Files
biz-bud/docs/audit-core-deps.md
Travis Vasceannie e0bfb7a2f2 feat: enhance coverage reporting and improve tool configuration (#55)
* feat: enhance coverage reporting and improve tool configuration

- Added support for JSON coverage reports in pyproject.toml.
- Updated .gitignore to include coverage.json and task files for better management.
- Introduced a new Type Safety Audit Report to document findings and recommendations for type safety improvements.
- Created a comprehensive coverage configuration guide to assist in understanding coverage reporting setup.
- Refactored tools configuration to utilize environment variables for concurrent scraping settings.

These changes improve the project's testing and reporting capabilities while enhancing overall code quality and maintainability.

* feat: enhance configuration handling and improve error logging

- Introduced a new utility function `_get_env_int` for robust environment variable integer retrieval with validation.
- Updated `WebToolsConfig` and `ToolsConfigModel` to utilize the new utility for environment variable defaults.
- Enhanced logging in `CircuitBreaker` to provide detailed state transition information.
- Improved URL handling in `url_analyzer.py` for better file extension extraction and normalization.
- Added type validation and logging in `SecureInputMixin` to ensure input sanitization and validation consistency.

These changes improve the reliability and maintainability of configuration management and error handling across the codebase.

* refactor: update imports and enhance .gitignore for improved organization

- Updated import paths in various example scripts to reflect the new structure under `biz_bud`.
- Enhanced .gitignore to include clearer formatting for task files.
- Removed obsolete function calls and improved error handling in several scripts.
- Added public alias for backward compatibility in `upload_r2r.py`.

These changes improve code organization, maintainability, and compatibility across the project.

* refactor: update graph paths in langgraph.json for improved organization

- Changed paths for research, catalog, paperless, and url_to_r2r graphs to reflect new directory structure.
- Added new entries for analysis and scraping graphs to enhance functionality.

These changes improve the organization and maintainability of the graph configurations.

* fix: enhance validation and error handling in date range and scraping functions

- Updated date validation in UserFiltersModel to ensure date values are strings.
- Improved error messages in create_scraped_content_dict to clarify conditions for success and failure.
- Enhanced test coverage for date validation and scraping content creation to ensure robustness.

These changes improve input validation and error handling across the application, enhancing overall reliability.

* refactor: streamline graph creation and enhance type annotations in examples

- Simplified graph creation in `catalog_ingredient_research_example.py` and `catalog_tech_components_example.py` by directly compiling the graph.
- Updated type annotations in `catalog_intel_with_config.py` for improved clarity and consistency.
- Enhanced error handling in catalog data processing to ensure robustness against unexpected data types.

These changes improve code readability, maintainability, and error resilience across example scripts.

* Update src/biz_bud/nodes/extraction/extractors.py

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

* Update src/biz_bud/core/validation/pydantic_models.py

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

* refactor: migrate Jina and Tavily clients to use ServiceFactory dependency injection

* refactor: migrate URL processing to provider-based architecture with improved error handling

* feat: add FirecrawlApp compatibility classes and mock implementations

* fix: add thread-safe locking to LazyLoader factory management

* feat: implement service restart and refactor cache decorator helpers

* refactor: move r2r_direct_api_call to tools.clients.r2r_utils and improve HTTP service error handling

* chore: update Sonar task IDs in report configuration

---------

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
2025-08-04 00:54:52 -04:00

14 KiB

Of course. Writing a script to enforce architectural conventions is an excellent way to maintain a large codebase. Statically analyzing your code is far more reliable than manual reviews for catching these kinds of deviations.

This script will use Python's built-in ast (Abstract Syntax Tree) module. It's the most robust way to analyze Python code, as it understands the code's structure, unlike simple text-based searches which can be easily fooled.

The script will identify modules, functions, or packages that are NOT using your core dependency infrastructure by looking for "anti-patterns"—the use of standard libraries or direct instantiations where your custom framework should be used instead.

The Script: audit_core_dependencies.py

Save the following code as a Python file (e.g., audit_core_dependencies.py) in the root of your repository.

import ast
import os
import argparse
from typing import Any, Dict, List, Set, Tuple

# --- Configuration of Anti-Patterns ---

# Direct imports of libraries that should be replaced by your core infrastructure.
# Maps the disallowed module to the suggested core module/function.
DISALLOWED_IMPORTS: Dict[str, str] = {
    "logging": "biz_bud.logging.unified_logging.get_logger",
    "requests": "biz_bud.core.networking.http_client.HTTPClient",
    "httpx": "biz_bud.core.networking.http_client.HTTPClient",
    "aiohttp": "biz_bud.core.networking.http_client.HTTPClient",
    "asyncio.gather": "biz_bud.core.networking.async_utils.gather_with_concurrency",
}

# Direct instantiation of service clients or tools that should come from the factory.
DISALLOWED_INSTANTIATIONS: Dict[str, str] = {
    "TavilySearchProvider": "ServiceFactory.get_service() or create_tools_for_capabilities()",
    "JinaSearchProvider": "ServiceFactory.get_service() or create_tools_for_capabilities()",
    "ArxivProvider": "ServiceFactory.get_service() or create_tools_for_capabilities()",
    "FirecrawlClient": "ServiceFactory.get_service() or a dedicated provider from ScrapeService",
    "TavilyClient": "ServiceFactory.get_service()",
    "PostgresStore": "ServiceFactory.get_db_service()",
    "LangchainLLMClient": "ServiceFactory.get_llm_client()",
    "HTTPClient": "HTTPClient.get_or_create_client() instead of direct instantiation",
}

# Built-in exceptions that should ideally be wrapped in a custom BusinessBuddyError.
DISALLOWED_EXCEPTIONS: Set[str] = {
    "Exception",
    "ValueError",
    "KeyError",
    "TypeError",
    "AttributeError",
    "NotImplementedError",
}

class InfrastructureVisitor(ast.NodeVisitor):
    """
    AST visitor that walks the code tree and identifies violations
    of the core dependency infrastructure usage.
    """

    def __init__(self, filepath: str):
        self.filepath = filepath
        self.violations: List[Tuple[int, str]] = []
        self.imported_names: Dict[str, str] = {} # Maps alias to full import path

    def _add_violation(self, node: ast.AST, message: str):
        self.violations.append((node.lineno, message))

    def visit_Import(self, node: ast.Import) -> None:
        """Checks for `import logging`, `import requests`, etc."""
        for alias in node.names:
            if alias.name in DISALLOWED_IMPORTS:
                suggestion = DISALLOWED_IMPORTS[alias.name]
                self._add_violation(
                    node,
                    f"Disallowed import '{alias.name}'. Please use '{suggestion}'."
                )
        self.generic_visit(node)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        """Checks for direct service/client imports, e.g., `from biz_bud.tools.clients import TavilyClient`"""
        if node.module:
            for alias in node.names:
                full_import_path = f"{node.module}.{alias.name}"
                # Store the imported name (could be an alias)
                self.imported_names[alias.asname or alias.name] = full_import_path

                # Check for direct service/tool instantiation patterns
                if "biz_bud.tools.clients" in node.module or \
                   "biz_bud.services" in node.module and "factory" not in node.module:
                    if alias.name in DISALLOWED_INSTANTIATIONS:
                        suggestion = DISALLOWED_INSTANTIATIONS[alias.name]
                        self._add_violation(
                            node,
                            f"Disallowed direct import of '{alias.name}'. Use the ServiceFactory: '{suggestion}'."
                        )

        self.generic_visit(node)

    def visit_Raise(self, node: ast.Raise) -> None:
        """Checks for `raise ValueError` instead of a custom error."""
        if isinstance(node.exc, ast.Call) and isinstance(node.exc.func, ast.Name):
            exception_name = node.exc.func.id
        elif isinstance(node.exc, ast.Name):
            exception_name = node.exc.id
        else:
            exception_name = "unknown"

        if exception_name in DISALLOWED_EXCEPTIONS:
            self._add_violation(
                node,
                f"Raising generic exception '{exception_name}'. Please use a custom `BusinessBuddyError` from `core.errors.base`."
            )
        self.generic_visit(node)

    def visit_Assign(self, node: ast.Assign) -> None:
        """Checks for direct state mutation like `state['key'] = value`."""
        for target in node.targets:
            if isinstance(target, ast.Subscript) and isinstance(target.value, ast.Name):
                if target.value.id == 'state':
                    self._add_violation(
                        node,
                        "Direct state mutation `state[...] = ...` detected. Please use `StateUpdater` for immutable updates."
                    )
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call) -> None:
        """
        Checks for:
        1. Direct instantiation of disallowed classes (e.g., `TavilyClient()`).
        2. Direct use of `asyncio.gather`.
        3. Direct state mutation via `state.update(...)`.
        """
        # 1. Check for direct instantiations
        if isinstance(node.func, ast.Name):
            class_name = node.func.id
            if class_name in DISALLOWED_INSTANTIATIONS:
                 # Verify it's not a legitimate call, e.g. a function with the same name
                if self.imported_names.get(class_name, "").endswith(class_name):
                    suggestion = DISALLOWED_INSTANTIATIONS[class_name]
                    self._add_violation(
                        node,
                        f"Direct instantiation of '{class_name}'. Use the ServiceFactory: '{suggestion}'."
                    )
        
        # 2. Check for `asyncio.gather` and `state.update`
        if isinstance(node.func, ast.Attribute):
            attr = node.func
            if isinstance(attr.value, ast.Name):
                parent_name = attr.value.id
                attr_name = attr.attr
                
                # Check for asyncio.gather
                if parent_name == 'asyncio' and attr_name == 'gather':
                    suggestion = DISALLOWED_IMPORTS['asyncio.gather']
                    self._add_violation(
                        node,
                        f"Direct use of 'asyncio.gather'. Please use '{suggestion}' for controlled concurrency."
                    )

                # Check for state.update()
                if parent_name == 'state' and attr_name == 'update':
                    self._add_violation(
                        node,
                        "Direct state mutation with `state.update()` detected. Please use `StateUpdater`."
                    )

        self.generic_visit(node)


def audit_directory(directory: str) -> Dict[str, List[Tuple[int, str]]]:
    """Scans a directory for Python files and audits them."""
    all_violations: Dict[str, List[Tuple[int, str]]] = {}
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".py"):
                filepath = os.path.join(root, file)
                try:
                    with open(filepath, "r", encoding="utf-8") as f:
                        source_code = f.read()
                        tree = ast.parse(source_code, filename=filepath)
                    
                    visitor = InfrastructureVisitor(filepath)
                    visitor.visit(tree)
                    
                    if visitor.violations:
                        all_violations[filepath] = visitor.violations
                except (SyntaxError, ValueError) as e:
                    all_violations[filepath] = [(0, f"ERROR: Could not parse file: {e}")]
    return all_violations

def main():
    parser = argparse.ArgumentParser(description="Audit Python code for adherence to core infrastructure.")
    parser.add_argument(
        "directory",
        nargs="?",
        default="src/biz_bud",
        help="The directory to scan. Defaults to 'src/biz_bud'."
    )
    args = parser.parse_args()

    print(f"--- Auditing directory: {args.directory} ---\n")
    
    violations = audit_directory(args.directory)
    
    if not violations:
        print("\033[92m✅ All scanned files adhere to the core infrastructure rules.\033[0m")
        return

    print(f"\033[91m🔥 Found {len(violations)} file(s) with violations:\033[0m\n")
    
    total_violations = 0
    for filepath, file_violations in violations.items():
        print(f"\033[1m\033[93mFile: {filepath}\033[0m")
        for line, message in sorted(file_violations):
            print(f"  \033[96mL{line}:\033[0m {message}")
            total_violations += 1
        print("-" * 20)
        
    print(f"\n\033[1m\033[91mSummary: Found {total_violations} total violations in {len(violations)} files.\033[0m")

if __name__ == "__main__":
    main()

How to Run the Script

  1. Save the file as audit_core_dependencies.py in your project's root directory.
  2. Run from your terminal:
    # Scan the default 'src/biz_bud' directory
    python audit_core_dependencies.py
    
    # Scan a different directory
    python audit_core_dependencies.py path/to/your/code
    

How It Works and What It Detects

This script defines a series of "anti-patterns" and then checks your code for them.

  1. Logging (DISALLOWED_IMPORTS):

    • Anti-Pattern: import logging
    • Why: Your custom logging in biz_bud.logging.unified_logging and services.logger_factory is designed to provide structured, context-aware logs. Using the standard logging library directly bypasses this, leading to inconsistent log formats and loss of valuable context like trace IDs or node names.
    • Detection: The script flags any file that directly imports the logging module.
  2. Errors (DISALLOWED_EXCEPTIONS):

    • Anti-Pattern: raise ValueError("...") or except Exception:
    • Why: Your core.errors framework is built to create a predictable, structured error handling system. Raising generic exceptions bypasses your custom error types (BusinessBuddyError), telemetry, and routing logic. This leads to unhandled crashes and makes it difficult to implement targeted recovery strategies.
    • Detection: The visit_Raise method checks if the code is raising a standard, built-in exception instead of a custom one.
  3. HTTP & APIs (DISALLOWED_IMPORTS):

    • Anti-Pattern: import requests or import httpx
    • Why: Your core.networking.http_client.HTTPClient provides a centralized, singleton session manager with built-in retry logic, timeouts, and potentially unified headers or proxy configurations. Using external HTTP libraries directly fragments this logic, leading to inconsistent network behavior and making it harder to manage connections.
    • Detection: Flags any file importing requests, httpx, or aiohttp.
  4. Tools, Services, and Language Models (DISALLOWED_INSTANTIATIONS):

    • Anti-Pattern: from biz_bud.tools.clients import TavilyClient; client = TavilyClient()
    • Why: Your ServiceFactory is the single source of truth for creating and managing the lifecycle of services. It handles singleton behavior, dependency injection, and centralized configuration. Bypassing it means you might have multiple instances of a service (e.g., multiple database connection pools), services without proper configuration, or services that don't get cleaned up correctly.
    • Detection: The script first identifies direct imports of service or client classes and then uses visit_Call to check if they are being instantiated directly.
  5. State Reducers (visit_Assign, visit_Call):

    • Anti-Pattern: state['key'] = value or state.update({...})
    • Why: Your architecture appears to be moving towards immutable state updates (as hinted by core/langgraph/state_immutability.py and the concept of reducers). Direct mutation of the state dictionary is an anti-pattern because it can lead to unpredictable side effects, making the graph's flow difficult to trace and debug. Using a StateUpdater class or reducers ensures that state changes are explicit and traceable.
    • Detection: The script specifically looks for assignment to state[...] or calls to state.update().
  6. Concurrency (visit_Call):

    • Anti-Pattern: asyncio.gather(...)
    • Why: Your gather_with_concurrency wrapper in core.networking.async_utils likely adds a semaphore or other logic to limit the number of concurrent tasks. Calling asyncio.gather directly bypasses this control, which can lead to overwhelming external APIs with too many requests, hitting rate limits, or exhausting system resources.
    • Detection: The script looks for direct calls to asyncio.gather.

This script provides a powerful, automated first line of defense to enforce your architectural standards and significantly reduce the classes of bugs you asked about.