# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a modular document ingestion pipeline that uses Prefect to orchestrate ingestion from web/documentation sites (via Firecrawl) and Git repositories (via Repomix) into a Weaviate vector database or Open WebUI knowledge endpoints.
## Development Commands

### Environment Setup
```bash
# Install dependencies using uv (required)
uv sync

# Activate virtual environment
source .venv/bin/activate

# Install repomix globally (required for repository ingestion)
npm install -g repomix

# Configure environment
cp .env.example .env
# Edit .env with your settings
```
### Running the Application
```bash
# One-time ingestion
python -m ingest_pipeline ingest <url> --type web --storage weaviate

# Schedule recurring ingestion
python -m ingest_pipeline schedule <name> <url> --type web --storage weaviate --cron "0 2 * * *"

# Start deployment server
python -m ingest_pipeline serve

# View configuration
python -m ingest_pipeline config
```
### Code Quality
```bash
# Linting and formatting
uv run ruff check .
uv run ruff format .

# Type checking
uv run pyrefly check ingest_pipeline/ tests/
uv run basedpyright

# Code review suggestions
uv run sourcery review ingest_pipeline/ tests/ --fix

# Install dev dependencies
uv sync --dev
```
## Architecture
The pipeline follows a modular architecture with clear separation of concerns:
- Ingestors (`ingest_pipeline/ingestors/`): Abstract base class pattern for different data sources (Firecrawl for web, Repomix for repositories)
- Storage Adapters (`ingest_pipeline/storage/`): Abstract base class for storage backends (Weaviate, Open WebUI); a minimal sketch of this base-class pattern follows the list
- Prefect Flows (`ingest_pipeline/flows/`): Orchestration layer using Prefect for scheduling and task management
- CLI (`ingest_pipeline/cli/main.py`): Typer-based command interface with the commands `ingest`, `schedule`, `serve`, and `config`
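The ingestor and storage layers share the same contract style. Below is a minimal sketch of the abstract-base-class pattern on the storage side; `BaseStorage` is named in this repo, but the method signature and the `Document`/`WeaviateStorage` names are illustrative assumptions, so check `ingest_pipeline/storage/` for the real interface.

```python
from abc import ABC, abstractmethod

from pydantic import BaseModel


class Document(BaseModel):
    """Illustrative document shape; the real models live in core/models.py."""

    url: str
    content: str
    metadata: dict[str, str]


class BaseStorage(ABC):
    """Shared contract for storage backends (Weaviate, Open WebUI)."""

    @abstractmethod
    async def store(self, documents: list[Document]) -> int:
        """Persist a batch of documents and return the number stored."""


class WeaviateStorage(BaseStorage):
    """Hypothetical concrete adapter; batched Weaviate writes would go here."""

    async def store(self, documents: list[Document]) -> int:
        return len(documents)
```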
## Key Implementation Details
### Type Safety
- Strict typing enforced with no `Any` types allowed
- Modern typing syntax using `|` instead of `Union`
- Pydantic v2+ for all models and settings
- All models in `core/models.py` use TypedDict for metadata and strict Pydantic models (illustrated in the sketch after this list)
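A hedged illustration of that combination, with field names invented for the example (the actual definitions live in `core/models.py`):

```python
from typing import TypedDict

from pydantic import BaseModel, ConfigDict


class DocumentMetadata(TypedDict):
    """TypedDict keeps metadata keys statically checked without a full model."""

    source_url: str
    content_type: str


class IngestedDocument(BaseModel):
    """Strict Pydantic v2 model: extra fields rejected, no Any anywhere."""

    model_config = ConfigDict(strict=True, extra="forbid")

    title: str
    content: str
    metadata: DocumentMetadata
    word_count: int | None = None  # `|` union syntax instead of Union
```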
### Configuration Management
- Settings loaded from the `.env` file via Pydantic Settings
- Cached singleton pattern in `config/settings.py` using `@lru_cache` (sketched after this list)
- Environment-specific endpoints configured for local services (llm.lab, weaviate.yo, chat.lab)
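A minimal sketch of the cached-singleton settings pattern, assuming pydantic-settings; the field and function names here are chosen for illustration, and the real definitions are in `config/settings.py`:

```python
from functools import lru_cache

from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Values are read from .env; defaults mirror the documented endpoints."""

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    llm_proxy_url: str = "http://llm.lab"
    weaviate_url: str = "http://weaviate.yo"
    openwebui_url: str = "http://chat.lab"
    firecrawl_url: str = "http://crawl.lab:30002"
    max_concurrent_tasks: int = 5


@lru_cache
def get_settings() -> Settings:
    """Cached singleton: every caller shares one Settings instance."""
    return Settings()
```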
### Flow Orchestration
- Main ingestion flow in `flows/ingestion.py` with retry logic and task decorators
- Deployment scheduling in `flows/scheduler.py` supports both cron and interval schedules
- Tasks use Prefect's `@task` decorator with retries and tags for monitoring (see the sketch after this list)
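A sketch of how a task and flow might be declared with retries and tags; the function names are assumptions, not the actual contents of `flows/ingestion.py`:

```python
from prefect import flow, task


@task(retries=3, retry_delay_seconds=30, tags=["ingestion"])
async def fetch_source(url: str) -> str:
    """Prefect retries transient failures and tags the task for monitoring."""
    return f"<content of {url}>"  # real fetching is delegated to the ingestors


@task(retries=2, tags=["storage"])
async def store_content(content: str) -> None:
    """Hand the processed content to the configured storage adapter."""
    print(f"storing {len(content)} characters")


@flow(name="ingestion-flow")
async def ingestion_flow(url: str) -> None:
    content = await fetch_source(url)
    await store_content(content)
```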
### Storage Backends
- Weaviate: Uses batch ingestion with a configurable batch size and automatic collection creation (a batching sketch follows this list)
- Open WebUI: Direct API integration for knowledge base management
- Both inherit from the abstract `BaseStorage` class, ensuring a consistent interface
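A pure-Python sketch of how batched writes could be chunked to respect the 50-500 document limit; the real Weaviate adapter uses the client's batch API, and the helper name here is hypothetical:

```python
from collections.abc import Iterator, Sequence


def batched(documents: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield successive slices of at most batch_size documents."""
    if not 50 <= batch_size <= 500:
        raise ValueError("batch size must be between 50 and 500 documents")
    for start in range(0, len(documents), batch_size):
        yield documents[start : start + batch_size]
```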
## Service Endpoints
- LLM Proxy: http://llm.lab (for embeddings and processing)
- Weaviate: http://weaviate.yo (vector database)
- Open WebUI: http://chat.lab (knowledge interface)
- Firecrawl: http://crawl.lab:30002 (web crawling service)
## Important Constraints
- Cyclomatic complexity must remain < 15 for all functions
- Maximum file size for ingestion: 1MB
- Batch size limits: 50-500 documents
- Concurrent task limit: 5 (configurable via MAX_CONCURRENT_TASKS)
- All async operations use proper async/await patterns; a minimal concurrency sketch follows this list
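A sketch of how the concurrency cap could be enforced with asyncio; the function name and structure are assumptions, not the pipeline's actual implementation:

```python
import asyncio


async def ingest_all(urls: list[str], max_concurrent_tasks: int = 5) -> None:
    """Bound concurrent ingestion to MAX_CONCURRENT_TASKS with a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent_tasks)

    async def ingest_one(url: str) -> None:
        async with semaphore:
            await asyncio.sleep(0)  # placeholder for real async ingestion work
            print(f"ingested {url}")

    await asyncio.gather(*(ingest_one(url) for url in urls))
```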