rag-manager/CLAUDE.md
2025-09-19 08:31:36 +00:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a modular document ingestion pipeline that uses Prefect to orchestrate ingestion from web and documentation sites (via Firecrawl) and Git repositories (via Repomix) into a Weaviate vector database or Open WebUI knowledge endpoints.

Development Commands

Environment Setup

# Install dependencies using uv (required)
uv sync

# Activate virtual environment
source .venv/bin/activate

# Install repomix globally (required for repository ingestion)
npm install -g repomix

# Configure environment
cp .env.example .env
# Edit .env with your settings

Running the Application

# One-time ingestion
python -m ingest_pipeline ingest <url> --type web --storage weaviate

# Schedule recurring ingestion
python -m ingest_pipeline schedule <name> <url> --type web --storage weaviate --cron "0 2 * * *"

# Start deployment server
python -m ingest_pipeline serve

# View configuration
python -m ingest_pipeline config

Code Quality

# Install dev dependencies (needed before running the tools below)
uv sync --dev

# Lint and format
uv run ruff check .
uv run ruff format .

# Type checking
uv run pyrefly check ingest_pipeline/ tests/
uv run basedpyright

# Automated review with fixes applied
uv run sourcery review ingest_pipeline/ tests/ --fix

Architecture

The pipeline follows a modular architecture with clear separation of concerns:

  • Ingestors (ingest_pipeline/ingestors/): Abstract base class pattern for different data sources (Firecrawl for web, Repomix for repositories)
  • Storage Adapters (ingest_pipeline/storage/): Abstract base class for storage backends (Weaviate, Open WebUI)
  • Prefect Flows (ingest_pipeline/flows/): Orchestration layer using Prefect for scheduling and task management
  • CLI (ingest_pipeline/cli/main.py): Typer-based command interface with commands: ingest, schedule, serve, config

Key Implementation Details

Type Safety

  • Strict typing enforced with no Any types allowed
  • Modern typing syntax using | instead of Union
  • Pydantic v2+ for all models and settings
  • Models in core/models.py are strict Pydantic models, with TypedDict used for metadata

Configuration Management

  • Settings loaded from .env file via Pydantic Settings
  • Cached singleton pattern in config/settings.py using @lru_cache
  • Environment-specific endpoints configured for local services (llm.lab, weaviate.yo, chat.lab)
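
The cached singleton pattern can be sketched as below. To keep the example dependency-free, a frozen dataclass stands in for the Pydantic Settings class, and values are read from `os.environ` instead of a .env file. The field names and the `WEAVIATE_ENDPOINT` variable are assumptions. `MAX_CONCURRENT_TASKS` and its default of 5 come from the constraints listed in this file.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Stand-in for the Pydantic Settings class; field names are hypothetical."""

    weaviate_endpoint: str
    max_concurrent_tasks: int


@lru_cache
def get_settings() -> Settings:
    """Build the settings once; later calls return the cached instance."""
    return Settings(
        # WEAVIATE_ENDPOINT and its fallback URL are assumptions for illustration.
        weaviate_endpoint=os.environ.get("WEAVIATE_ENDPOINT", "http://localhost:8080"),
        max_concurrent_tasks=int(os.environ.get("MAX_CONCURRENT_TASKS", "5")),
    )


print(get_settings() is get_settings())  # True: the same object is reused
```

In tests, `get_settings.cache_clear()` forces the next call to rebuild the settings, which is useful after changing environment variables.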

Flow Orchestration

  • Main ingestion flow in flows/ingestion.py with retry logic and task decorators
  • Deployment scheduling in flows/scheduler.py supporting both cron and interval schedules
  • Tasks use Prefect's @task decorator with retries and tags for monitoring

Storage Backends

  • Weaviate: Uses batch ingestion with configurable batch size, automatic collection creation
  • Open WebUI: Direct API integration for knowledge base management
  • Both inherit from abstract BaseStorage class ensuring consistent interface
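
Batch ingestion with a configurable batch size might look roughly like the helper below. This is a plain-Python sketch and makes no Weaviate client calls. The function name `iter_batches` and the default of 100 are assumptions. The 50 to 500 range is the batch size limit stated under Important Constraints.

```python
from collections.abc import Iterator, Sequence

# Limits taken from the "Important Constraints" list in this file.
MIN_BATCH_SIZE = 50
MAX_BATCH_SIZE = 500


def iter_batches(documents: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` documents."""
    if not MIN_BATCH_SIZE <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between {MIN_BATCH_SIZE} and {MAX_BATCH_SIZE}")
    for start in range(0, len(documents), batch_size):
        yield documents[start : start + batch_size]


documents = [f"doc-{i}" for i in range(120)]
batch_sizes = [len(batch) for batch in iter_batches(documents, batch_size=50)]
print(batch_sizes)  # [50, 50, 20]
```

A storage adapter would send each yielded slice to the backend in one request, so a failure affects only that batch.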

Service Endpoints

Important Constraints

  • Cyclomatic complexity must stay below 15 for every function
  • Maximum file size for ingestion: 1 MB
  • Batch size limits: 50 to 500 documents
  • Concurrent task limit: 5 (configurable via MAX_CONCURRENT_TASKS)
  • All async operations use proper async/await patterns
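
One common way to enforce a concurrent task limit with async/await is an `asyncio.Semaphore`, sketched below. Whether the pipeline uses a semaphore or Prefect's own concurrency controls is not stated here, so treat this as an assumption. The function name `run_limited` and the placeholder sleep are hypothetical.

```python
import asyncio

MAX_CONCURRENT_TASKS = 5  # default from the constraint list above


async def run_limited(urls: list[str], limit: int = MAX_CONCURRENT_TASKS) -> int:
    """Process all URLs with at most `limit` running at once; return the peak concurrency."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def process(url: str) -> None:
        nonlocal active, peak
        async with semaphore:  # waits here once `limit` tasks are already running
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # placeholder for real ingestion work on `url`
            active -= 1

    await asyncio.gather(*(process(url) for url in urls))
    return peak


urls = [f"https://example.com/{i}" for i in range(12)]
print(asyncio.run(run_limited(urls)))  # 5
```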