* feat: enhance coverage reporting and improve tool configuration
  - Added support for JSON coverage reports in pyproject.toml.
  - Updated .gitignore to include coverage.json and task files for better management.
  - Introduced a new Type Safety Audit Report to document findings and recommendations for type safety improvements.
  - Created a comprehensive coverage configuration guide explaining the coverage reporting setup.
  - Refactored tools configuration to use environment variables for concurrent scraping settings.

  These changes improve the project's testing and reporting capabilities while enhancing overall code quality and maintainability.

* feat: enhance configuration handling and improve error logging
  - Introduced a new utility function `_get_env_int` for robust environment-variable integer retrieval with validation (a sketch of this pattern follows the commit log below).
  - Updated `WebToolsConfig` and `ToolsConfigModel` to use the new utility for environment variable defaults.
  - Enhanced logging in `CircuitBreaker` to provide detailed state transition information.
  - Improved URL handling in `url_analyzer.py` for better file extension extraction and normalization.
  - Added type validation and logging in `SecureInputMixin` to keep input sanitization and validation consistent.

  These changes improve the reliability and maintainability of configuration management and error handling across the codebase.

* refactor: update imports and enhance .gitignore for improved organization
  - Updated import paths in various example scripts to reflect the new structure under `biz_bud`.
  - Enhanced .gitignore with clearer formatting for task files.
  - Removed obsolete function calls and improved error handling in several scripts.
  - Added a public alias for backward compatibility in `upload_r2r.py`.

  These changes improve code organization, maintainability, and compatibility across the project.

* refactor: update graph paths in langgraph.json for improved organization
  - Changed paths for the research, catalog, paperless, and url_to_r2r graphs to reflect the new directory structure.
  - Added new entries for the analysis and scraping graphs.

  These changes improve the organization and maintainability of the graph configurations.

* fix: enhance validation and error handling in date range and scraping functions
  - Updated date validation in `UserFiltersModel` to ensure date values are strings.
  - Improved error messages in `create_scraped_content_dict` to clarify the conditions for success and failure.
  - Expanded test coverage for date validation and scraped-content creation.

  These changes improve input validation and error handling across the application, enhancing overall reliability.

* refactor: streamline graph creation and enhance type annotations in examples
  - Simplified graph creation in `catalog_ingredient_research_example.py` and `catalog_tech_components_example.py` by compiling the graph directly.
  - Updated type annotations in `catalog_intel_with_config.py` for improved clarity and consistency.
  - Enhanced error handling in catalog data processing to guard against unexpected data types.

  These changes improve code readability, maintainability, and error resilience across the example scripts.
* Update src/biz_bud/nodes/extraction/extractors.py

  Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

* Update src/biz_bud/core/validation/pydantic_models.py

  Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

* refactor: migrate Jina and Tavily clients to use ServiceFactory dependency injection
* refactor: migrate URL processing to provider-based architecture with improved error handling
* feat: add FirecrawlApp compatibility classes and mock implementations
* fix: add thread-safe locking to LazyLoader factory management (illustrated in the second sketch after this commit log)
* feat: implement service restart and refactor cache decorator helpers
* refactor: move r2r_direct_api_call to tools.clients.r2r_utils and improve HTTP service error handling
* chore: update Sonar task IDs in report configuration

---------

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
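As context for the configuration-handling commit, the following is a minimal sketch of the kind of `_get_env_int` helper it describes. The function name comes from the commit message; the argument names, logging, and fallback behavior shown here are assumptions, not the repository's actual implementation.

```python
import logging
import os

logger = logging.getLogger(__name__)


def _get_env_int(name: str, default: int, minimum: int = 1) -> int:
    """Read an integer environment variable, falling back to a validated default."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Non-numeric value: log and fall back rather than crash at import time
        logger.warning("Ignoring %s=%r: not an integer; using default %d", name, raw, default)
        return default
    if value < minimum:
        logger.warning("Ignoring %s=%d: below minimum %d; using default %d", name, value, minimum, default)
        return default
    return value
```

A config model default could then be written as, for example, `_get_env_int("MAX_CONCURRENT_SCRAPES", 5)`; the variable name here is hypothetical.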
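The thread-safe LazyLoader locking mentioned in the fix commit typically follows the pattern below. This is an illustrative sketch only; the real `LazyLoader` class, its method names, and its factory registry in the codebase may differ.

```python
import threading
from collections.abc import Callable
from typing import Any


class LazyLoader:
    """Caches factory results so each service is created at most once across threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._factories: dict[str, Callable[[], Any]] = {}
        self._instances: dict[str, Any] = {}

    def register(self, name: str, factory: Callable[[], Any]) -> None:
        # Registration and instance invalidation happen under the same lock
        with self._lock:
            self._factories[name] = factory
            self._instances.pop(name, None)

    def get(self, name: str) -> Any:
        # Holding the lock while constructing prevents two threads from
        # building the same service concurrently
        with self._lock:
            if name not in self._instances:
                self._instances[name] = self._factories[name]()
            return self._instances[name]
```

The key point of the pattern is that factory registration and instance creation share one lock, so concurrent callers cannot construct duplicate service instances.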
#!/usr/bin/env python3
"""Fixed script to properly crawl R2R documentation and upload to R2R instance."""

import asyncio
import os
import sys
from typing import Any

from biz_bud.core.config.loader import load_config_async
from biz_bud.graphs.rag.graph import process_url_to_r2r_with_streaming


async def crawl_r2r_docs_fixed(
    max_depth: int = 3, max_pages: int = 50, use_crawl: bool = False
) -> None:
    """Crawl the R2R documentation site and upload it to R2R.

    This fixed version:
    - Uses the iterative graph for better control
    - Defaults to the map+scrape approach for reliability
    - Provides real-time progress updates

    Args:
        max_depth: Maximum crawl depth (default: 3)
        max_pages: Maximum number of pages to crawl (default: 50)
        use_crawl: Use the Firecrawl crawl endpoint instead of map+scrape (default: False)

    """
    url = "https://r2r-docs.sciphi.ai"

    print(f"🚀 Starting crawl of {url}")
    print(f"📊 Settings: max_depth={max_depth}, max_pages={max_pages}")
    print("-" * 60)

    # Load configuration
    config = await load_config_async()
    config_dict = config.model_dump()

    # Configure for reliable crawling
    config_dict["scrape_params"] = {"max_depth": max_depth, "max_pages": max_pages}

    # Prefer map+scrape over the crawl endpoint unless explicitly requested
    config_dict["rag_config"] = {
        "crawl_depth": max_depth,
        "max_pages_to_crawl": max_pages,
        "use_crawl_endpoint": use_crawl,  # Crawl endpoint is less reliable
        "use_map_first": not use_crawl,  # Use map to discover URLs first
    }

    # Check for Firecrawl API key
    api_key = os.getenv("FIRECRAWL_API_KEY")
    if not api_key:
        api_config = config_dict.get("api", {})
        firecrawl_config = api_config.get("firecrawl", {})
        api_key = api_config.get("firecrawl_api_key") or firecrawl_config.get("api_key")

    if not api_key:
        print("❌ Error: FIRECRAWL_API_KEY not found in environment or config")
        print("Please set FIRECRAWL_API_KEY environment variable")
        sys.exit(1)

    # Check for R2R instance
    r2r_base_url = os.getenv("R2R_BASE_URL", "http://192.168.50.210:7272")
    if "api_config" not in config_dict:
        config_dict["api_config"] = {}
    config_dict["api_config"]["r2r_base_url"] = r2r_base_url

    mode = "crawl" if use_crawl else "map+scrape"
    print(f"✅ Using Firecrawl API ({mode} mode)")
    print(f"✅ Using R2R instance at: {r2r_base_url}")
    print()

    # Track progress
    pages_processed = 0

    def on_update(update: dict[str, Any]) -> None:
        """Handle streaming updates."""
        nonlocal pages_processed

        if update.get("type") == "status":
            print(f"📌 {update.get('message', '')}")
        elif update.get("type") == "progress":
            progress = update.get("progress", {})
            current = progress.get("current", 0)
            total = progress.get("total", 0)
            if current > pages_processed:
                pages_processed = current
                print(f"📊 Progress: {current}/{total} pages")
        elif update.get("type") == "error":
            print(f"❌ Error: {update.get('message', '')}")

    try:
        # Process URL and upload to R2R with streaming updates
        print("🕷️ Starting crawl and R2R upload process...")
        result = await process_url_to_r2r_with_streaming(url, config_dict, on_update=on_update)

        # Display results
        print("\n" + "=" * 60)
        print("📊 CRAWL RESULTS")
        print("=" * 60)

        if result.get("error"):
            print(f"❌ Error: {result['error']}")
            return

        # Show scraped content summary
        scraped_content = result.get("scraped_content", [])
        if scraped_content:
            print(f"\n✅ Successfully crawled {len(scraped_content)} pages:")

            # Group pages by top-level documentation section
            sections: dict[str, list[dict[str, Any]]] = {}
            for page in scraped_content:
                url_parts = page.get("url", "").split("/")
                section = (url_parts[3] or "root") if len(url_parts) > 3 else "root"

                if section not in sections:
                    sections[section] = []
                sections[section].append(page)

            # Show organized results
            for section, pages in sorted(sections.items()):
                print(f"\n  📁 /{section} ({len(pages)} pages)")
                for page in pages[:3]:  # Show first 3 per section
                    title = page.get("title", "Untitled")
                    if len(title) > 60:
                        title = f"{title[:57]}..."
                    print(f"    - {title}")
                if len(pages) > 3:
                    print(f"    ... and {len(pages) - 3} more")

        # Show R2R upload results
        r2r_info = result.get("r2r_info")
        if r2r_info:
            print("\n✅ R2R Upload Successful:")

            # Check if multiple documents were uploaded
            if r2r_info.get("uploaded_documents"):
                docs = r2r_info["uploaded_documents"]
                print(f"  - Total documents uploaded: {len(docs)}")
                print(f"  - Collection: {r2r_info.get('collection_name', 'default')}")

                # Show sample document IDs
                print("  - Sample document IDs:")
                for doc_id in list(docs.keys())[:3]:
                    print(f"    • {doc_id}")
            else:
                # Single document upload
                print(f"  - Document ID: {r2r_info.get('document_id')}")
                print(f"  - Collection: {r2r_info.get('collection_name')}")
                print(f"  - Title: {r2r_info.get('title')}")

        print("\n✅ Crawl and upload completed successfully!")
        print(f"📊 Total pages processed: {len(scraped_content)}")

    except Exception as e:
        print(f"\n❌ Error during crawl: {e}")
        import traceback

        traceback.print_exc()


def main() -> None:
    """Run the main entry point."""
    import argparse

    parser = argparse.ArgumentParser(
        description="Crawl R2R documentation and upload to R2R instance (fixed version)"
    )
    parser.add_argument("--max-depth", type=int, default=3, help="Maximum crawl depth (default: 3)")
    parser.add_argument(
        "--max-pages",
        type=int,
        default=50,
        help="Maximum number of pages to crawl (default: 50)",
    )
    parser.add_argument(
        "--use-crawl",
        action="store_true",
        help="Use crawl endpoint instead of map+scrape (not recommended)",
    )

    args = parser.parse_args()

    # Run the async crawl
    asyncio.run(
        crawl_r2r_docs_fixed(
            max_depth=args.max_depth, max_pages=args.max_pages, use_crawl=args.use_crawl
        )
    )


if __name__ == "__main__":
    main()