# Document Ingestion Pipeline
A modular, type-safe Python application using Prefect for scheduling ingestion jobs from web/documentation sites (via Firecrawl) and Git repositories (via Repomix) into Weaviate or Open WebUI knowledge endpoints.
## Features
- **Multiple Data Sources:**
  - Web/documentation sites via Firecrawl
  - Git repositories via Repomix
- **Multiple Storage Backends:**
  - Weaviate vector database (self-hosted at http://weaviate.yo)
  - Open WebUI knowledge endpoints (http://chat.lab)
- **Scheduling & Orchestration:**
  - Prefect-based workflow orchestration
  - Cron and interval-based scheduling
  - Concurrent task execution
- **Type Safety:**
  - Strict Python typing with no `Any` types
  - Modern typing syntax (`|` instead of `Union`)
  - Pydantic models for validation
- **Code Quality:**
  - Modular architecture
  - Cyclomatic complexity < 15
  - Clean separation of concerns
## Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Install repomix globally (required for repository ingestion)
npm install -g repomix

# Copy and configure environment
cp .env.example .env
# Edit .env with your settings
```
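A starting `.env` might look like the following. The values shown are the documented defaults from the Environment Variables section, so in practice you only need to set what differs in your environment:

```bash
# Optional; leave unset if you do not use the Firecrawl API
FIRECRAWL_API_KEY=

LLM_ENDPOINT=http://llm.lab
WEAVIATE_ENDPOINT=http://weaviate.yo
OPENWEBUI_ENDPOINT=http://chat.lab
EMBEDDING_MODEL=ollama/bge-m3:latest
```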
## Usage

### One-time Ingestion

```bash
# Ingest a documentation site into Weaviate
python -m ingest_pipeline ingest https://docs.example.com --type web --storage weaviate

# Ingest a repository into Open WebUI
python -m ingest_pipeline ingest https://github.com/user/repo --type repository --storage open_webui
```
### Scheduled Ingestion

```bash
# Create a daily documentation crawl
python -m ingest_pipeline schedule daily-docs https://docs.example.com \
    --type documentation \
    --storage weaviate \
    --cron "0 2 * * *"

# Create an hourly repository sync (interval in minutes)
python -m ingest_pipeline schedule repo-sync https://github.com/user/repo \
    --type repository \
    --storage open_webui \
    --interval 60
```
### Serve Deployments

```bash
# Start serving scheduled deployments
python -m ingest_pipeline serve
```
### Configuration

```bash
# View current configuration
python -m ingest_pipeline config
```
## Architecture

```
ingest_pipeline/
├── core/                # Core models and exceptions
│   ├── models.py        # Pydantic models with strict typing
│   └── exceptions.py    # Custom exceptions
├── ingestors/           # Data source ingestors
│   ├── base.py          # Abstract base ingestor
│   ├── firecrawl.py     # Web/docs ingestion via Firecrawl
│   └── repomix.py       # Repository ingestion via Repomix
├── storage/             # Storage adapters
│   ├── base.py          # Abstract base storage
│   ├── weaviate.py      # Weaviate adapter
│   └── openwebui.py     # Open WebUI adapter
├── flows/               # Prefect flows
│   ├── ingestion.py     # Main ingestion flow
│   └── scheduler.py     # Deployment scheduling
├── config/              # Configuration management
│   └── settings.py      # Settings with Pydantic
├── utils/               # Utilities
│   └── vectorizer.py    # Text vectorization
└── cli/                 # CLI interface
    └── main.py          # Typer-based CLI
```
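The `base.py` modules give ingestors and storage adapters a common contract, so sources and backends can be mixed freely. The actual interfaces aren't shown here; a plausible stdlib sketch of the pattern (all names and signatures are assumptions for illustration):

```python
from abc import ABC, abstractmethod


class BaseIngestor(ABC):
    """Hypothetical contract: fetch a source and return document chunks."""

    @abstractmethod
    def ingest(self, url: str) -> list[str]:
        """Return extracted document chunks for the given URL."""


class BaseStorage(ABC):
    """Hypothetical contract: persist document chunks into a backend."""

    @abstractmethod
    def store(self, chunks: list[str]) -> int:
        """Store chunks and return the number written."""


class InMemoryStorage(BaseStorage):
    """Toy adapter used here only to demonstrate the plug-in pattern."""

    def __init__(self) -> None:
        self.chunks: list[str] = []

    def store(self, chunks: list[str]) -> int:
        self.chunks.extend(chunks)
        return len(chunks)


storage = InMemoryStorage()
print(storage.store(["doc one", "doc two"]))  # 2
```

Because flows depend only on the abstract base, adding a new backend means implementing one class rather than touching the ingestion logic.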
## Environment Variables

- `FIRECRAWL_API_KEY`: API key for Firecrawl (optional)
- `LLM_ENDPOINT`: LLM proxy endpoint (default: `http://llm.lab`)
- `WEAVIATE_ENDPOINT`: Weaviate endpoint (default: `http://weaviate.yo`)
- `OPENWEBUI_ENDPOINT`: Open WebUI endpoint (default: `http://chat.lab`)
- `EMBEDDING_MODEL`: Model for embeddings (default: `ollama/bge-m3:latest`)
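The real `config/settings.py` uses Pydantic; as a dependency-free sketch of the same defaulting behaviour, a stdlib stand-in might look like:

```python
import os


class Settings:
    """Stdlib stand-in for config/settings.py (the real one uses Pydantic)."""

    def __init__(self) -> None:
        env = os.environ.get
        # Each value falls back to the documented default when unset
        self.firecrawl_api_key: str | None = env("FIRECRAWL_API_KEY")
        self.llm_endpoint: str = env("LLM_ENDPOINT", "http://llm.lab")
        self.weaviate_endpoint: str = env("WEAVIATE_ENDPOINT", "http://weaviate.yo")
        self.openwebui_endpoint: str = env("OPENWEBUI_ENDPOINT", "http://chat.lab")
        self.embedding_model: str = env("EMBEDDING_MODEL", "ollama/bge-m3:latest")


settings = Settings()
print(settings.weaviate_endpoint)
```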
## Vectorization

The pipeline uses your LLM proxy at http://llm.lab with:

- Model: `ollama/gpt-oss:20b` for processing
- Embeddings: `ollama/bge-m3:latest` for vectorization
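A proxy that serves models named like `ollama/bge-m3:latest` is commonly OpenAI-compatible; assuming that (and an assumed `/v1/embeddings` route — adjust if your proxy differs), an embedding request could be built like this:

```python
import json
from urllib import request


def build_embedding_request(
    texts: list[str],
    model: str = "ollama/bge-m3:latest",
    endpoint: str = "http://llm.lab",
) -> request.Request:
    # Assumes an OpenAI-compatible embeddings route on the proxy;
    # the path is an assumption, not confirmed by this project.
    body = json.dumps({"model": model, "input": texts}).encode()
    return request.Request(
        f"{endpoint}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embedding_request(["hello world"])
print(req.full_url)  # http://llm.lab/v1/embeddings
```

Sending the request (e.g. with `urllib.request.urlopen`) would return one embedding vector per input text on an OpenAI-compatible server.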
## Storage Backends
### Weaviate
- Endpoint: http://weaviate.yo
- Automatic collection creation
- Vector similarity search
- Batch ingestion support
### Open WebUI
- Endpoint: http://chat.lab/docs
- Knowledge base integration
- Direct API access
- Document management
## Development

The codebase follows strict typing and quality standards:

- No use of the `Any` type
- Modern Python typing syntax
- Cyclomatic complexity < 15
- Modular, testable architecture
## License
MIT