biz-bud/examples/firecrawl_monitoring_example.py
Travis Vasceannie e0bfb7a2f2 feat: enhance coverage reporting and improve tool configuration (#55)
* feat: enhance coverage reporting and improve tool configuration

- Added support for JSON coverage reports in pyproject.toml.
- Updated .gitignore to include coverage.json and task files for better management.
- Introduced a new Type Safety Audit Report to document findings and recommendations for type safety improvements.
- Created a comprehensive coverage configuration guide to assist in understanding coverage reporting setup.
- Refactored tools configuration to use environment variables for concurrent scraping settings (see the sketch after this list).

These changes improve the project's testing and reporting capabilities while enhancing overall code quality and maintainability.
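
For the concurrent-scraping bullet above, a minimal sketch of what wiring an environment variable into a config default might look like; the variable name BB_SCRAPE_MAX_CONCURRENT and the field name are assumptions, not the actual ones used in WebToolsConfig:

    import os

    from pydantic import BaseModel, Field

    def _concurrency_default() -> int:
        # Hypothetical variable name; fall back to 5 if unset or not an integer.
        raw = os.getenv("BB_SCRAPE_MAX_CONCURRENT", "5")
        try:
            return int(raw)
        except ValueError:
            return 5

    class WebToolsConfig(BaseModel):
        # Default is resolved from the environment whenever no explicit value is given.
        max_concurrent_scrapes: int = Field(default_factory=_concurrency_default)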

* feat: enhance configuration handling and improve error logging

- Introduced a new utility function `_get_env_int` for robust retrieval of integer environment variables, with validation (sketched after this list).
- Updated `WebToolsConfig` and `ToolsConfigModel` to utilize the new utility for environment variable defaults.
- Enhanced logging in `CircuitBreaker` to provide detailed state transition information.
- Improved URL handling in `url_analyzer.py` for better file extension extraction and normalization.
- Added type validation and logging in `SecureInputMixin` to ensure input sanitization and validation consistency.

These changes improve the reliability and maintainability of configuration management and error handling across the codebase.
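
A rough sketch of the kind of helper the `_get_env_int` bullet describes; the exact signature, bounds handling, and logging are assumptions, not the real implementation:

    import logging
    import os

    logger = logging.getLogger(__name__)

    def _get_env_int(name: str, default: int, *, minimum: int = 1) -> int:
        """Read an integer environment variable, falling back to a validated default."""
        raw = os.getenv(name)
        if raw is None:
            return default
        try:
            value = int(raw)
        except ValueError:
            logger.warning("Invalid integer for %s=%r; using default %d", name, raw, default)
            return default
        if value < minimum:
            logger.warning("%s=%d is below minimum %d; using default %d", name, value, minimum, default)
            return default
        return value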

* refactor: update imports and enhance .gitignore for improved organization

- Updated import paths in various example scripts to reflect the new structure under `biz_bud`.
- Enhanced .gitignore to include clearer formatting for task files.
- Removed obsolete function calls and improved error handling in several scripts.
- Added a public alias for backward compatibility in `upload_r2r.py` (see the short sketch after this list).

These changes improve code organization, maintainability, and compatibility across the project.
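
The backward-compatibility alias in `upload_r2r.py` presumably amounts to something like the following; the names here are illustrative only:

    def upload_document_to_r2r(path: str) -> None:
        """New, preferred entry point."""
        ...

    # Public alias so existing imports of the old name keep working.
    upload_to_r2r = upload_document_to_r2r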

* refactor: update graph paths in langgraph.json for improved organization

- Changed paths for research, catalog, paperless, and url_to_r2r graphs to reflect new directory structure.
- Added new entries for analysis and scraping graphs to enhance functionality.

These changes improve the organization and maintainability of the graph configurations.

* fix: enhance validation and error handling in date range and scraping functions

- Updated date validation in UserFiltersModel to ensure date values are strings (validator sketched after this list).
- Improved error messages in create_scraped_content_dict to clarify conditions for success and failure.
- Enhanced test coverage for date validation and scraping content creation to ensure robustness.

These changes improve input validation and error handling across the application, enhancing overall reliability.
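
A small sketch of the sort of validator the date-range fix describes, assuming a Pydantic v2 model; the field names and rejection policy are guesses:

    from pydantic import BaseModel, field_validator

    class UserFiltersModel(BaseModel):
        start_date: str | None = None
        end_date: str | None = None

        @field_validator("start_date", "end_date", mode="before")
        @classmethod
        def _ensure_date_is_string(cls, value):
            # Reject anything that is not already a string (or None).
            if value is None or isinstance(value, str):
                return value
            raise ValueError(f"date values must be strings, got {type(value).__name__}")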

* refactor: streamline graph creation and enhance type annotations in examples

- Simplified graph creation in `catalog_ingredient_research_example.py` and `catalog_tech_components_example.py` by directly compiling the graph (see the sketch after this list).
- Updated type annotations in `catalog_intel_with_config.py` for improved clarity and consistency.
- Enhanced error handling in catalog data processing to ensure robustness against unexpected data types.

These changes improve code readability, maintainability, and error resilience across example scripts.
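
"Directly compiling the graph" presumably means building the LangGraph StateGraph and calling .compile() in the example itself rather than going through a helper; a minimal sketch with an invented state and node:

    from typing import TypedDict

    from langgraph.graph import END, START, StateGraph

    class ResearchState(TypedDict):
        query: str
        findings: list[str]

    def research_node(state: ResearchState) -> dict:
        # Placeholder node; the real examples run the catalog research logic here.
        return {"findings": state["findings"] + [f"looked up {state['query']}"]}

    builder = StateGraph(ResearchState)
    builder.add_node("research", research_node)
    builder.add_edge(START, "research")
    builder.add_edge("research", END)
    graph = builder.compile()  # the examples then call graph.invoke(...) / graph.ainvoke(...)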

* Update src/biz_bud/nodes/extraction/extractors.py

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

* Update src/biz_bud/core/validation/pydantic_models.py

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

* refactor: migrate Jina and Tavily clients to use ServiceFactory dependency injection
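
ServiceFactory-based injection presumably looks something like the following; only the names ServiceFactory, Jina, and Tavily come from the commit, the rest is an illustrative guess:

    class JinaClient:
        def __init__(self, http_client, api_key: str) -> None:
            # Dependencies are injected instead of being constructed inside the client.
            self._http = http_client
            self._api_key = api_key

    class ServiceFactory:
        """Central place that owns shared resources (HTTP session, API keys)."""

        def __init__(self, http_client, api_keys: dict[str, str]) -> None:
            self._http = http_client
            self._keys = api_keys

        def jina_client(self) -> JinaClient:
            return JinaClient(self._http, self._keys["jina"])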

* refactor: migrate URL processing to provider-based architecture with improved error handling

* feat: add FirecrawlApp compatibility classes and mock implementations

* fix: add thread-safe locking to LazyLoader factory management
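
Thread-safe lazy creation of factory-built instances usually comes down to double-checked locking; a generic sketch of the pattern (LazyLoader's real interface may differ):

    import threading
    from typing import Any, Callable

    class LazyLoader:
        def __init__(self, factory: Callable[[], Any]) -> None:
            self._factory = factory
            self._instance: Any = None
            self._lock = threading.Lock()

        def get(self) -> Any:
            # Fast path: skip the lock once the instance exists.
            if self._instance is None:
                with self._lock:
                    # Re-check inside the lock so only one thread runs the factory.
                    if self._instance is None:
                        self._instance = self._factory()
            return self._instance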

* feat: implement service restart and refactor cache decorator helpers

* refactor: move r2r_direct_api_call to tools.clients.r2r_utils and improve HTTP service error handling

* chore: update Sonar task IDs in report configuration

---------

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>


#!/usr/bin/env python3
"""Example to monitor and understand Firecrawl crawl/scrape behavior.
This example helps debug the following issues:
1. Why the crawl endpoint calls scrape internally
2. How to handle crawl returning 0 pages
3. Dataset naming and reuse patterns
"""
import asyncio
import os
from datetime import datetime
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
async def analyze_crawl_vs_scrape():
"""Analyze the relationship between crawl and scrape endpoints."""
# Note: These imports are from the original firecrawl library
# They are not available in our current client implementation
# from biz_bud.tools.clients.firecrawl import (
# CrawlJob,
# CrawlOptions,
# FirecrawlApp,
# FirecrawlOptions,
# )
# This example needs the actual firecrawl library
print("This example requires the actual firecrawl-py library")
return
print("=== Understanding Crawl vs Scrape Behavior ===\n")
print("Key insights:")
print("1. The /crawl endpoint discovers URLs on a website")
print("2. For each discovered URL, it internally calls /scrape")
print("3. This is why you see both endpoints being called")
print("4. If crawl finds 0 pages, our code falls back to direct scrape\n")
async with FirecrawlApp() as app:
url = "https://docs.anthropic.com/en/docs/intro-to-claude"
# First, let's see what a direct scrape gives us
print(f"1. Direct scrape of {url}")
print("-" * 50)
scrape_result = await app.scrape_url(
url, FirecrawlOptions(formats=["markdown", "links"], only_main_content=True)
)
if scrape_result.success:
print("✅ Direct scrape successful")
if scrape_result.data:
print(f" - Content length: {len(scrape_result.data.markdown or '')} chars")
print(f" - Links found: {len(scrape_result.data.links or [])}")
if scrape_result.data.links:
print(" - Sample links:")
for link in scrape_result.data.links[:3]:
print(f"{link}")
else:
print(" - No data returned")
# Now let's see what crawl does
print(f"\n2. Crawl starting from {url}")
print("-" * 50)
crawl_job = await app.crawl_website(
url=url,
options=CrawlOptions(
limit=5,
max_depth=1,
scrape_options=FirecrawlOptions(formats=["markdown"], only_main_content=True),
),
wait_for_completion=True,
)
if isinstance(crawl_job, CrawlJob):
print("✅ Crawl completed")
print(f" - Status: {crawl_job.status}")
print(f" - Pages discovered: {crawl_job.total_count}")
print(f" - Pages scraped: {crawl_job.completed_count}")
if crawl_job.data:
print(" - Scraped URLs:")
for i, page in enumerate(crawl_job.data[:5], 1):
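# page.metadata may be a model exposing sourceURL or a plain dict, depending on the response shape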
page_url = "unknown"
if hasattr(page.metadata, "sourceURL"):
page_url = page.metadata.sourceURL
elif isinstance(page.metadata, dict):
page_url = page.metadata.get("url", "unknown")
print(f" {i}. {page_url}")
# This is the key insight
print("\n📊 ANALYSIS:")
print(f" The crawl endpoint discovered {crawl_job.total_count} URLs")
print(f" Then it called /scrape for each URL ({crawl_job.completed_count} times)")
print(" This is why you see both endpoints in your logs!")
if crawl_job.completed_count == 0:
print("\n⚠️ CRAWL RETURNED 0 PAGES!")
print(" In this case, our code falls back to direct scrape")
print(" This ensures we always get at least the main page content")
async def monitor_crawl_job():
"""Monitor a crawl job with real-time status updates."""
# Note: These imports are from the original firecrawl library
# They are not available in our current client implementation
# from biz_bud.tools.clients.firecrawl import (
# CrawlJob,
# CrawlOptions,
# FirecrawlApp,
# FirecrawlOptions,
# )
# This example needs the actual firecrawl library
print("This example requires the actual firecrawl-py library")
return
print("=== Firecrawl Crawl Job Monitoring Example ===\n")
# Track status updates
status_history = []
def status_callback(job: CrawlJob) -> None:
"""Handle callback function called on each status update."""
timestamp = datetime.now().strftime("%H:%M:%S.%f")[:-3]
status_history.append(
{
"time": timestamp,
"status": job.status,
"completed": job.completed_count,
"total": job.total_count,
}
)
# Progress bar visualization
if job.total_count > 0:
progress = job.completed_count / job.total_count
bar_length = 30
filled = int(bar_length * progress)
bar = "" * filled + "" * (bar_length - filled)
print(
f"\r[{timestamp}] {bar} {job.completed_count}/{job.total_count} - {job.status}",
end="",
flush=True,
)
async with FirecrawlApp() as app:
print(f"Connected to: {app.client.base_url}")
print(f"API Version: {app.api_version}\n")
# Configure crawl options
options = CrawlOptions(
limit=10, # Crawl up to 10 pages
max_depth=2, # Follow links 2 levels deep
scrape_options=FirecrawlOptions(
formats=["markdown"],
only_main_content=True,
timeout=30000, # 30 seconds per page
),
)
# Example URL - replace with your target
url = "https://example.com"
print(f"Starting crawl of {url}")
print(f"Max pages: {options.limit}, Max depth: {options.max_depth}\n")
start_time = datetime.now()
try:
# Method 1: Start crawl and get job ID immediately
initial_job = await app.crawl_website(
url=url,
options=options,
wait_for_completion=False, # Don't wait
)
if isinstance(initial_job, CrawlJob) and initial_job.job_id:
print(f"Job ID: {initial_job.job_id}")
print(f"Status URL: {app.client.base_url}/v1/crawl/{initial_job.job_id}")
print("\nMonitoring progress:\n")
# Poll for updates with callback
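# _poll_crawl_status is a private FirecrawlApp helper; 2s interval x 150 polls caps the wait at ~5 minutes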
final_job = await app._poll_crawl_status(
job_id=initial_job.job_id,
poll_interval=2, # Check every 2 seconds
max_polls=150, # Max 5 minutes
status_callback=status_callback,
)
print("\n") # New line after progress bar
# Calculate statistics
duration = (datetime.now() - start_time).total_seconds()
print("\n=== Crawl Complete ===")
print(f"Duration: {duration:.1f} seconds")
print(f"Status: {final_job.status}")
print(f"Pages crawled: {final_job.completed_count}")
print(f"Total pages found: {final_job.total_count}")
if final_job.data:
print("\nCrawled URLs:")
for i, page in enumerate(final_job.data[:5], 1):
page_url = "unknown"
if hasattr(page, "metadata") and isinstance(page.metadata, dict):
page_url = page.metadata.get("url", "unknown")
content_length = len(page.markdown or page.content or "")
print(f" {i}. {page_url} ({content_length} chars)")
if len(final_job.data) > 5:
print(f" ... and {len(final_job.data) - 5} more pages")
# Show status timeline
print(f"\nStatus Updates ({len(status_history)} total):")
# Show first 2 and last 2 updates
if len(status_history) > 4:
for update in status_history[:2]:
print(
f" [{update['time']}] {update['status']}: {update['completed']}/{update['total']}"
)
print(" ...")
for update in status_history[-2:]:
print(
f" [{update['time']}] {update['status']}: {update['completed']}/{update['total']}"
)
else:
for update in status_history:
print(
f" [{update['time']}] {update['status']}: {update['completed']}/{update['total']}"
)
# Performance metrics
if final_job.completed_count > 0:
avg_time_per_page = duration / final_job.completed_count
print("\nPerformance:")
print(f" Average time per page: {avg_time_per_page:.2f} seconds")
print(f" Pages per minute: {(60 / avg_time_per_page):.1f}")
else:
print(f"Unexpected result: {type(initial_job)}")
except Exception as e:
print(f"\nError during crawl: {e}")
import traceback
traceback.print_exc()
async def monitor_batch_scrape():
"""Monitor batch scrape operations."""
from biz_bud.tools.clients.firecrawl import FirecrawlApp, FirecrawlOptions
print("\n\n=== Firecrawl Batch Scrape Monitoring Example ===\n")
# URLs to scrape
urls = [
"https://example.com",
"https://example.org",
"https://example.net",
"https://example.edu",
"https://example.io",
]
async with FirecrawlApp() as app:
print(f"Scraping {len(urls)} URLs concurrently...")
print("Max concurrent: 3\n")
start_time = datetime.now()
# Progress tracking
print("Progress:")
results = await app.batch_scrape(
urls=urls,
options=FirecrawlOptions(
formats=["markdown"],
only_main_content=True,
timeout=30000,
),
max_concurrent=3,
)
duration = (datetime.now() - start_time).total_seconds()
# Analyze results - filter for valid result objects with proper attributes
valid_results = [
r for r in results if hasattr(r, "success") and not isinstance(r, (dict, list, str))
]
successful = sum(1 for r in valid_results if getattr(r, "success", False))
# failed = len(valid_results) - successful # Would be used for monitoring stats
total_content = sum(
len(getattr(r.data, "markdown", "") or "")
for r in valid_results
if getattr(r, "success", False)
and hasattr(r, "data")
and r.data
and hasattr(r.data, "markdown")
)
print("\n=== Batch Complete ===")
print(f"Duration: {duration:.1f} seconds")
print(f"Success rate: {successful}/{len(results)} ({successful / len(results) * 100:.0f}%)")
print(f"Total content: {total_content:,} characters")
print(f"Average time per URL: {duration / len(urls):.2f} seconds")
# Show individual results
print("\nIndividual Results:")
for i, url in enumerate(urls):
if i < len(results):
result = results[i]
if (
hasattr(result, "success")
and not isinstance(result, (dict, list, str))
and getattr(result, "success", False)
and hasattr(result, "data")
and result.data
):
content_len = len(getattr(result.data, "markdown", "") or "")
print(f"{url} - {content_len} chars")
else:
error = getattr(result, "error", "Unknown error") or "Unknown error"
print(f"{url} - {error}")
async def monitor_ragflow_dataset_creation():
"""Monitor how RAGFlow datasets are created and reused."""
print("\n\n=== RAGFlow Dataset Management ===\n")
try:
from urllib.parse import urlparse
from ragflow_sdk import RAGFlow
api_key = os.getenv("RAGFLOW_API_KEY")
base_url = os.getenv("RAGFLOW_BASE_URL", "http://rag.lab")
if not api_key:
print("❌ RAGFLOW_API_KEY not set, skipping RAGFlow tests")
return
client = RAGFlow(api_key=api_key, base_url=base_url)
# Test URL
test_url = "https://example.com"
parsed = urlparse(test_url)
site_name = parsed.netloc.replace("www.", "").replace(":", "_")
print(f"Testing with URL: {test_url}")
print(f"Site name for dataset: {site_name}\n")
# List existing datasets
print("1. Existing datasets:")
print("-" * 50)
datasets = client.list_datasets()
existing_dataset = None
for dataset in datasets:
if dataset.name == site_name:
existing_dataset = dataset
print(f"✅ Found existing dataset: {dataset.name} (ID: {dataset.id})")
else:
print(f" - {dataset.name}")
if existing_dataset:
print(f"\n📊 INSIGHT: Dataset '{site_name}' already exists!")
print(" Our code will reuse this instead of creating a new one")
print(" This prevents duplicate datasets with timestamps")
else:
print(f"\n📊 INSIGHT: No dataset named '{site_name}' found")
print(" A new dataset will be created with just the site name")
print(" NOT with a timestamp - this allows future reuse")
# Show what the old behavior would have done
from datetime import datetime
old_style_name = f"{site_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
print(f"\n⚠️ OLD BEHAVIOR would create: {old_style_name}")
print(f"✅ NEW BEHAVIOR creates/reuses: {site_name}")
except ImportError:
print("❌ RAGFlow SDK not installed, skipping RAGFlow tests")
except Exception as e:
print(f"❌ RAGFlow test error: {e}")
async def main():
"""Run monitoring examples."""
print(f"Firecrawl Base URL: {os.getenv('FIRECRAWL_BASE_URL', 'https://api.firecrawl.dev')}")
print(f"RAGFlow Base URL: {os.getenv('RAGFLOW_BASE_URL', 'http://rag.lab')}")
print(f"API Version: {os.getenv('FIRECRAWL_API_VERSION', 'auto-detect')}\n")
try:
# Analyze crawl vs scrape behavior
await analyze_crawl_vs_scrape()
# Run crawl monitoring
await monitor_crawl_job()
# Run batch scrape monitoring
await monitor_batch_scrape()
# Monitor RAGFlow dataset creation
await monitor_ragflow_dataset_creation()
print("\n✅ All examples completed!")
print("\n" + "=" * 60)
print("SUMMARY OF FIXES:")
print("1. Crawl calling scrape is EXPECTED behavior - not a bug")
print("2. Crawl returning 0 pages is handled with fallback scraping")
print("3. Datasets now use site names for reuse, not timestamps")
except KeyboardInterrupt:
print("\n\nMonitoring cancelled by user")
except Exception as e:
print(f"\n❌ Error: {e}")
import traceback
traceback.print_exc()
if __name__ == "__main__":
asyncio.run(main())