# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a modular document ingestion pipeline that uses Prefect to orchestrate ingestion from web/documentation sites (via Firecrawl) and Git repositories (via Repomix) into a Weaviate vector database or Open WebUI knowledge endpoints.
## Development Commands

### Environment Setup
```bash
# Install dependencies using uv (required)
uv sync

# Activate virtual environment
source .venv/bin/activate

# Install repomix globally (required for repository ingestion)
npm install -g repomix

# Configure environment
cp .env.example .env
# Edit .env with your settings
```
### Running the Application
```bash
# One-time ingestion
python -m ingest_pipeline ingest <url> --type web --storage weaviate

# Schedule recurring ingestion
python -m ingest_pipeline schedule <name> <url> --type web --storage weaviate --cron "0 2 * * *"

# Start deployment server
python -m ingest_pipeline serve

# View configuration
python -m ingest_pipeline config
```
### Code Quality
```bash
# Linting and formatting
uv run ruff check .
uv run ruff format .

# Type checking
uv run pyrefly check ingest_pipeline/ tests/
uv run basedpyright

# Code review suggestions
uv run sourcery review ingest_pipeline/ tests/ --fix

# Install dev dependencies
uv sync --dev
```
## Architecture
The pipeline follows a modular architecture with clear separation of concerns:
- Ingestors (`ingest_pipeline/ingestors/`): Abstract base class pattern for different data sources (Firecrawl for web, Repomix for repositories)
- Storage Adapters (`ingest_pipeline/storage/`): Abstract base class for storage backends (Weaviate, Open WebUI); a minimal sketch of this base-class pattern follows the list
- Prefect Flows (`ingest_pipeline/flows/`): Orchestration layer using Prefect for scheduling and task management
- CLI (`ingest_pipeline/cli/main.py`): Typer-based command interface with the commands `ingest`, `schedule`, `serve`, and `config`
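The ingestor and storage layers share the same contract style. Below is a minimal sketch of the abstract-base-class pattern on the storage side; `BaseStorage` is named in this repo, but the method signature and the `Document`/`WeaviateStorage` names are illustrative assumptions, so check `ingest_pipeline/storage/` for the real interface.

```python
from abc import ABC, abstractmethod

from pydantic import BaseModel


class Document(BaseModel):
    """Illustrative document shape; the real models live in core/models.py."""

    url: str
    content: str
    metadata: dict[str, str]


class BaseStorage(ABC):
    """Shared contract for storage backends (Weaviate, Open WebUI)."""

    @abstractmethod
    async def store(self, documents: list[Document]) -> int:
        """Persist a batch of documents and return the number stored."""


class WeaviateStorage(BaseStorage):
    """Hypothetical concrete adapter; batched Weaviate writes would go here."""

    async def store(self, documents: list[Document]) -> int:
        return len(documents)
```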
## Key Implementation Details
### Type Safety
- Strict typing enforced with no `Any` types allowed
- Modern typing syntax using `|` instead of `Union`
- Pydantic v2+ for all models and settings
- All models in `core/models.py` use TypedDict for metadata and strict Pydantic models (illustrated in the sketch after this list)
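A hedged illustration of that combination, with field names invented for the example (the actual definitions live in `core/models.py`):

```python
from typing import TypedDict

from pydantic import BaseModel, ConfigDict


class DocumentMetadata(TypedDict):
    """TypedDict keeps metadata keys statically checked without a full model."""

    source_url: str
    content_type: str


class IngestedDocument(BaseModel):
    """Strict Pydantic v2 model: extra fields rejected, no Any anywhere."""

    model_config = ConfigDict(strict=True, extra="forbid")

    title: str
    content: str
    metadata: DocumentMetadata
    word_count: int | None = None  # `|` union syntax instead of Union
```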
### Configuration Management
- Settings loaded from the `.env` file via Pydantic Settings
- Cached singleton pattern in `config/settings.py` using `@lru_cache` (sketched after this list)
- Environment-specific endpoints configured for local services (llm.lab, weaviate.yo, chat.lab)
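A minimal sketch of the cached-singleton settings pattern, assuming pydantic-settings; the field and function names here are chosen for illustration, and the real definitions are in `config/settings.py`:

```python
from functools import lru_cache

from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Values are read from .env; defaults mirror the documented endpoints."""

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    llm_proxy_url: str = "http://llm.lab"
    weaviate_url: str = "http://weaviate.yo"
    openwebui_url: str = "http://chat.lab"
    firecrawl_url: str = "http://crawl.lab:30002"
    max_concurrent_tasks: int = 5


@lru_cache
def get_settings() -> Settings:
    """Cached singleton: every caller shares one Settings instance."""
    return Settings()
```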
### Flow Orchestration
- Main ingestion flow in `flows/ingestion.py` with retry logic and task decorators
- Deployment scheduling in `flows/scheduler.py` supports both cron and interval schedules
- Tasks use Prefect's `@task` decorator with retries and tags for monitoring (see the sketch after this list)
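A sketch of how a task and flow might be declared with retries and tags; the function names are assumptions, not the actual contents of `flows/ingestion.py`:

```python
from prefect import flow, task


@task(retries=3, retry_delay_seconds=30, tags=["ingestion"])
async def fetch_source(url: str) -> str:
    """Prefect retries transient failures and tags the task for monitoring."""
    return f"<content of {url}>"  # real fetching is delegated to the ingestors


@task(retries=2, tags=["storage"])
async def store_content(content: str) -> None:
    """Hand the processed content to the configured storage adapter."""
    print(f"storing {len(content)} characters")


@flow(name="ingestion-flow")
async def ingestion_flow(url: str) -> None:
    content = await fetch_source(url)
    await store_content(content)
```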
### Storage Backends
- Weaviate: Uses batch ingestion with a configurable batch size and automatic collection creation (a batching sketch follows this list)
- Open WebUI: Direct API integration for knowledge base management
- Both inherit from the abstract `BaseStorage` class, ensuring a consistent interface
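A pure-Python sketch of how batched writes could be chunked to respect the 50-500 document limit; the real Weaviate adapter uses the client's batch API, and the helper name here is hypothetical:

```python
from collections.abc import Iterator, Sequence


def batched(documents: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield successive slices of at most batch_size documents."""
    if not 50 <= batch_size <= 500:
        raise ValueError("batch size must be between 50 and 500 documents")
    for start in range(0, len(documents), batch_size):
        yield documents[start : start + batch_size]
```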
## Service Endpoints
- LLM Proxy: http://llm.lab (for embeddings and processing)
- Weaviate: http://weaviate.yo (vector database)
- Open WebUI: http://chat.lab (knowledge interface)
- Firecrawl: http://crawl.lab:30002 (web crawling service)
## Important Constraints
- Cyclomatic complexity must remain < 15 for all functions
- Maximum file size for ingestion: 1MB
- Batch size limits: 50-500 documents
- Concurrent task limit: 5 (configurable via MAX_CONCURRENT_TASKS)
- All async operations use proper async/await patterns; a minimal concurrency sketch follows this list
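A sketch of how the concurrency cap could be enforced with asyncio; the function name and structure are assumptions, not the pipeline's actual implementation:

```python
import asyncio


async def ingest_all(urls: list[str], max_concurrent_tasks: int = 5) -> None:
    """Bound concurrent ingestion to MAX_CONCURRENT_TASKS with a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def ingest_one(url: str) -> None:
        async with semaphore:
            await asyncio.sleep(0)  # placeholder for real async ingestion work
            print(f"ingested {url}")

    await asyncio.gather(*(ingest_one(url) for url in urls))
```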