# Document Ingestion Pipeline
A modular, type-safe Python application using Prefect for scheduling ingestion jobs from web/documentation sites (via Firecrawl) and Git repositories (via Repomix) into Weaviate or Open WebUI knowledge endpoints.
## Features
- **Multiple Data Sources:**
  - Web/documentation sites via Firecrawl
  - Git repositories via Repomix
- **Multiple Storage Backends:**
  - Weaviate vector database (self-hosted at http://weaviate.yo)
  - Open WebUI knowledge endpoints (http://chat.lab)
- **Scheduling & Orchestration:**
  - Prefect-based workflow orchestration
  - Cron and interval-based scheduling
  - Concurrent task execution
- **Type Safety:**
  - Strict Python typing with no `Any` types
  - Modern typing syntax (`|` instead of `Union`)
  - Pydantic models for validation
- **Code Quality:**
  - Modular architecture
  - Cyclomatic complexity < 15
  - Clean separation of concerns
## Installation

```bash
# Install dependencies
pip install -r requirements.txt

# Install repomix globally (required for repository ingestion)
npm install -g repomix

# Copy and configure environment
cp .env.example .env
# Edit .env with your settings
```
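A starting `.env` might look like the following. The values shown are the documented defaults from the Environment Variables section, so in practice you only need to set what differs in your environment:

```bash
# Optional; leave unset if you do not use the Firecrawl API
FIRECRAWL_API_KEY=

LLM_ENDPOINT=http://llm.lab
WEAVIATE_ENDPOINT=http://weaviate.yo
OPENWEBUI_ENDPOINT=http://chat.lab
EMBEDDING_MODEL=ollama/bge-m3:latest
```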
## Usage

### One-time Ingestion

```bash
# Ingest a documentation site into Weaviate
python -m ingest_pipeline ingest https://docs.example.com --type web --storage weaviate

# Ingest a repository into Open WebUI
python -m ingest_pipeline ingest https://github.com/user/repo --type repository --storage open_webui
```
### Scheduled Ingestion

```bash
# Create a daily documentation crawl
python -m ingest_pipeline schedule daily-docs https://docs.example.com \
    --type documentation \
    --storage weaviate \
    --cron "0 2 * * *"

# Create an hourly repository sync (interval in minutes)
python -m ingest_pipeline schedule repo-sync https://github.com/user/repo \
    --type repository \
    --storage open_webui \
    --interval 60
```
### Serve Deployments

```bash
# Start serving scheduled deployments
python -m ingest_pipeline serve
```
### Configuration

```bash
# View current configuration
python -m ingest_pipeline config
```
## Architecture

```
ingest_pipeline/
├── core/                # Core models and exceptions
│   ├── models.py        # Pydantic models with strict typing
│   └── exceptions.py    # Custom exceptions
├── ingestors/           # Data source ingestors
│   ├── base.py          # Abstract base ingestor
│   ├── firecrawl.py     # Web/docs ingestion via Firecrawl
│   └── repomix.py       # Repository ingestion via Repomix
├── storage/             # Storage adapters
│   ├── base.py          # Abstract base storage
│   ├── weaviate.py      # Weaviate adapter
│   └── openwebui.py     # Open WebUI adapter
├── flows/               # Prefect flows
│   ├── ingestion.py     # Main ingestion flow
│   └── scheduler.py     # Deployment scheduling
├── config/              # Configuration management
│   └── settings.py      # Settings with Pydantic
├── utils/               # Utilities
│   └── vectorizer.py    # Text vectorization
└── cli/                 # CLI interface
    └── main.py          # Typer-based CLI
```
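The `base.py` modules give ingestors and storage adapters a common contract, so sources and backends can be mixed freely. The actual interfaces aren't shown here; a plausible stdlib sketch of the pattern (all names and signatures are assumptions for illustration):

```python
from abc import ABC, abstractmethod


class BaseIngestor(ABC):
    """Hypothetical contract: fetch a source and return document chunks."""

    @abstractmethod
    def ingest(self, url: str) -> list[str]:
        """Return extracted document chunks for the given URL."""


class BaseStorage(ABC):
    """Hypothetical contract: persist document chunks into a backend."""

    @abstractmethod
    def store(self, chunks: list[str]) -> int:
        """Store chunks and return the number written."""


class InMemoryStorage(BaseStorage):
    """Toy adapter used here only to demonstrate the plug-in pattern."""

    def __init__(self) -> None:
        self.chunks: list[str] = []

    def store(self, chunks: list[str]) -> int:
        self.chunks.extend(chunks)
        return len(chunks)


storage = InMemoryStorage()
print(storage.store(["doc one", "doc two"]))  # 2
```

Because flows depend only on the abstract base, adding a new backend means implementing one class rather than touching the ingestion logic.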
## Environment Variables

- `FIRECRAWL_API_KEY`: API key for Firecrawl (optional)
- `LLM_ENDPOINT`: LLM proxy endpoint (default: `http://llm.lab`)
- `WEAVIATE_ENDPOINT`: Weaviate endpoint (default: `http://weaviate.yo`)
- `OPENWEBUI_ENDPOINT`: Open WebUI endpoint (default: `http://chat.lab`)
- `EMBEDDING_MODEL`: Model for embeddings (default: `ollama/bge-m3:latest`)
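The real `config/settings.py` uses Pydantic; as a dependency-free sketch of the same defaulting behaviour, a stdlib stand-in might look like:

```python
import os


class Settings:
    """Stdlib stand-in for config/settings.py (the real one uses Pydantic)."""

    def __init__(self) -> None:
        env = os.environ.get
        # Each value falls back to the documented default when unset
        self.firecrawl_api_key: str | None = env("FIRECRAWL_API_KEY")
        self.llm_endpoint: str = env("LLM_ENDPOINT", "http://llm.lab")
        self.weaviate_endpoint: str = env("WEAVIATE_ENDPOINT", "http://weaviate.yo")
        self.openwebui_endpoint: str = env("OPENWEBUI_ENDPOINT", "http://chat.lab")
        self.embedding_model: str = env("EMBEDDING_MODEL", "ollama/bge-m3:latest")


settings = Settings()
print(settings.weaviate_endpoint)
```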
## Vectorization

The pipeline uses your LLM proxy at http://llm.lab with:

- Model: `ollama/gpt-oss:20b` for processing
- Embeddings: `ollama/bge-m3:latest` for vectorization
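A proxy that serves models named like `ollama/bge-m3:latest` is commonly OpenAI-compatible; assuming that (and an assumed `/v1/embeddings` route — adjust if your proxy differs), an embedding request could be built like this:

```python
import json
from urllib import request


def build_embedding_request(
    texts: list[str],
    model: str = "ollama/bge-m3:latest",
    endpoint: str = "http://llm.lab",
) -> request.Request:
    # Assumes an OpenAI-compatible embeddings route on the proxy;
    # the path is an assumption, not confirmed by this project.
    body = json.dumps({"model": model, "input": texts}).encode()
    return request.Request(
        f"{endpoint}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embedding_request(["hello world"])
print(req.full_url)  # http://llm.lab/v1/embeddings
```

Sending the request (e.g. with `urllib.request.urlopen`) would return one embedding vector per input text on an OpenAI-compatible server.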
## Storage Backends
### Weaviate
- Endpoint: http://weaviate.yo
- Automatic collection creation
- Vector similarity search
- Batch ingestion support
### Open WebUI
- Endpoint: http://chat.lab/docs
- Knowledge base integration
- Direct API access
- Document management
## Development

The codebase follows strict typing and quality standards:

- No use of the `Any` type
- Modern Python typing syntax
- Cyclomatic complexity < 15
- Modular, testable architecture
## License
MIT