Document Ingestion Pipeline

A modular, type-safe Python application that uses Prefect to schedule ingestion jobs, pulling content from web/documentation sites (via Firecrawl) and Git repositories (via Repomix) into Weaviate or Open WebUI knowledge endpoints.

Features

  • Multiple Data Sources:

    • Web/documentation sites via Firecrawl
    • Git repositories via Repomix
  • Multiple Storage Backends:

    • Weaviate vector database
    • Open WebUI knowledge endpoints
  • Scheduling & Orchestration:

    • Prefect-based workflow orchestration
    • Cron and interval-based scheduling
    • Concurrent task execution
  • Type Safety:

    • Strict Python typing with no Any types
    • Modern typing syntax (using | instead of Union)
    • Pydantic models for validation
  • Code Quality:

    • Modular architecture
    • Cyclomatic complexity < 15
    • Clean separation of concerns

Installation

# Install dependencies
pip install -r requirements.txt

# Install repomix globally (required for repository ingestion)
npm install -g repomix

# Copy and configure environment
cp .env.example .env
# Edit .env with your settings

Usage

One-time Ingestion

# Ingest a documentation site into Weaviate
python -m ingest_pipeline ingest https://docs.example.com --type web --storage weaviate

# Ingest a repository into Open WebUI
python -m ingest_pipeline ingest https://github.com/user/repo --type repository --storage open_webui

Scheduled Ingestion

# Create a daily documentation crawl
python -m ingest_pipeline schedule daily-docs https://docs.example.com \
  --type documentation \
  --storage weaviate \
  --cron "0 2 * * *"

# Create an hourly repository sync
python -m ingest_pipeline schedule repo-sync https://github.com/user/repo \
  --type repository \
  --storage open_webui \
  --interval 60

Serve Deployments

# Start serving scheduled deployments
python -m ingest_pipeline serve

Configuration

# View current configuration
python -m ingest_pipeline config

Architecture

ingest_pipeline/
├── core/               # Core models and exceptions
│   ├── models.py      # Pydantic models with strict typing
│   └── exceptions.py  # Custom exceptions
├── ingestors/         # Data source ingestors
│   ├── base.py       # Abstract base ingestor
│   ├── firecrawl.py  # Web/docs ingestion via Firecrawl
│   └── repomix.py    # Repository ingestion via Repomix
├── storage/           # Storage adapters
│   ├── base.py       # Abstract base storage
│   ├── weaviate.py   # Weaviate adapter
│   └── openwebui.py  # Open WebUI adapter
├── flows/             # Prefect flows
│   ├── ingestion.py  # Main ingestion flow
│   └── scheduler.py  # Deployment scheduling
├── config/            # Configuration management
│   └── settings.py   # Settings with Pydantic
├── utils/             # Utilities
│   └── vectorizer.py # Text vectorization
└── cli/              # CLI interface
    └── main.py       # Typer-based CLI
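
The layout above separates ingestors from storage adapters behind abstract base classes. The sketch below illustrates how that split might look; every class, method, and field name is hypothetical and chosen for illustration, since the actual interfaces in base.py are not shown here. The toy StaticIngestor and InMemoryStorage stand in for the Firecrawl/Repomix ingestors and the Weaviate/Open WebUI adapters.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical minimal document model (the project uses Pydantic models)."""

    source_url: str
    content: str


class BaseIngestor(ABC):
    """Hypothetical ingestor interface: turn a URL into documents."""

    @abstractmethod
    def fetch(self, url: str) -> list[Document]: ...


class BaseStorage(ABC):
    """Hypothetical storage interface: persist documents, return the count stored."""

    @abstractmethod
    def store(self, documents: list[Document]) -> int: ...


class StaticIngestor(BaseIngestor):
    """Toy ingestor that returns one fixed page instead of crawling."""

    def fetch(self, url: str) -> list[Document]:
        return [Document(source_url=url, content="example page body")]


class InMemoryStorage(BaseStorage):
    """Toy storage adapter that keeps documents in a list."""

    def __init__(self) -> None:
        self.documents: list[Document] = []

    def store(self, documents: list[Document]) -> int:
        self.documents.extend(documents)
        return len(documents)


def run_ingestion(url: str, ingestor: BaseIngestor, storage: BaseStorage) -> int:
    """Fetch from the source and hand the result to the storage backend."""
    return storage.store(ingestor.fetch(url))
```

Because run_ingestion depends only on the two base classes, any source can be paired with any backend, which is the combination the --type and --storage options expose on the command line.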

Environment Variables

  • FIRECRAWL_API_KEY: API key for Firecrawl (optional)
  • LLM_ENDPOINT: LLM proxy endpoint (default: http://llm.lab)
  • WEAVIATE_ENDPOINT: Weaviate endpoint (default: http://weaviate.yo)
  • OPENWEBUI_ENDPOINT: Open WebUI endpoint (default: http://chat.lab)
  • EMBEDDING_MODEL: Model for embeddings (default: ollama/bge-m3:latest)

Vectorization

The pipeline uses your LLM proxy at http://llm.lab with:

  • Model: ollama/gpt-oss:20b for processing
  • Embeddings: ollama/bge-m3:latest for vectorization
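
To make the embedding step concrete, the sketch below builds (but does not send) an embeddings request for the proxy. It assumes the proxy exposes an OpenAI-compatible /v1/embeddings route that accepts "model" and "input" fields; that route is an assumption, not something this README confirms, and build_embedding_request is a hypothetical helper rather than the code in utils/vectorizer.py.

```python
import json
from urllib.request import Request


def build_embedding_request(
    texts: list[str],
    endpoint: str = "http://llm.lab",
    model: str = "ollama/bge-m3:latest",
) -> Request:
    """Build a POST request asking the proxy to embed the given texts.

    Assumption: the proxy serves an OpenAI-compatible /v1/embeddings route.
    """
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return Request(
        url=f"{endpoint.rstrip('/')}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to urllib.request.urlopen when the proxy is reachable; keeping construction separate from sending makes the payload easy to inspect or test offline.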

Storage Backends

Weaviate

  • Endpoint: http://weaviate.yo
  • Automatic collection creation
  • Vector similarity search
  • Batch ingestion support
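
Two of the ideas above, batch ingestion and vector similarity search, can be illustrated without a running Weaviate instance. The sketch below is not the Weaviate adapter: batched and cosine_similarity are hypothetical helpers, and cosine is used only as one common similarity measure, not as a statement about how this project configures its collections.

```python
import math
from collections.abc import Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` items, in order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start : start + size])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / norm
```

Sending documents in fixed-size chunks keeps each request bounded, and a similarity search ranks stored vectors by a score such as this one against the query vector.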

Open WebUI

  • Endpoint: http://chat.lab

Development

The codebase follows strict typing and quality standards:

  • No use of Any type
  • Modern Python typing syntax
  • Cyclomatic complexity < 15
  • Modular, testable architecture

License

MIT
