noteflow/docs/sprints/phase-ongoing/patterns/logging_enrichment_spec.md
2026-01-02 04:22:40 +00:00


Sprint: Logging Enrichment & Gap Remediation

Size: L | Owner: Backend | Prerequisites: Sprint Logging Centralization (Infrastructure) | Phase: Ongoing - Patterns & Observability


Open Issues & Prerequisites

Review Date: 2026-01-02 — Infrastructure migration complete, enrichment work outstanding

Blocking Issues

| ID | Issue | Status | Resolution |
|----|-------|--------|------------|
| B1 | structlog infrastructure must be in place | RESOLVED | 77 files migrated to `get_logger()` pattern |
| B2 | Dual output (Rich + JSON) must work | RESOLVED | Implemented in `config.py` |

Design Gaps to Address

| ID | Gap | Resolution |
|----|-----|------------|
| G1 | No timing decorator/context manager | Create `log_timing` context manager and `@timed` decorator |
| G2 | No structured state transition logging | Define `log_state_transition(entity, old, new)` helper |
| G3 | Missing documentation | Create `docs/guides/logging.md` |

Prerequisite Verification

| Prerequisite | Status | Notes |
|--------------|--------|-------|
| structlog infrastructure | Complete | `src/noteflow/infrastructure/logging/` |
| Context variable injection | Complete | `add_noteflow_context` processor |
| OTEL trace correlation | Complete | `add_otel_trace_context` processor |
| LogBuffer integration | Complete | `LogBufferHandler` attached |

Validation Status (2026-01-02)

IMPLEMENTED

| Component | Status | Notes |
|-----------|--------|-------|
| `LoggingConfig` dataclass | Complete | `config.py:31-53` |
| `configure_logging()` | Complete | `config.py:163-187` |
| `get_logger()` | Complete | `config.py:190-199` |
| `add_noteflow_context` processor | Complete | `processors.py:27-50` |
| `add_otel_trace_context` processor | Complete | `processors.py:53-90` |
| File migration (77 files) | Complete | All use centralized `get_logger()` |

NOT IMPLEMENTED

| Component | Status | Notes |
|-----------|--------|-------|
| Timing decorator/helper | Not started | Needed for 50+ operations |
| State transition logger | Not started | Needed for 7+ state machines |
| Silent failure remediation | Not started | 10+ locations return `None` silently |
| `docs/guides/logging.md` | Not started | Usage documentation |
| `CLAUDE.md` logging section | Not started | Project instructions |

Downstream impact: Debugging remains difficult for network calls, blocking operations, and state transitions.


Objective

Systematically enrich logging across the codebase to eliminate debugging blind spots. This sprint addresses 100+ logging gaps identified in docs/triage.md, adding timing information to network/blocking operations, converting silent failures to logged failures, and implementing structured state transition logging.


Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Timing approach | Context manager + decorator | Reusable, consistent pattern across sync/async |
| Log levels | INFO for operations, DEBUG for details | Balances verbosity with observability |
| State transitions | Structured fields (`old_state`, `new_state`) | Enables filtering/alerting in log aggregators |
| Silent failures | Log at WARNING, still return `None` | Non-breaking change, adds visibility |
| Batch strategy | By priority (P0 → P1 → P2) | Address critical gaps first |

What Already Exists

| Asset | Location | Implication |
|-------|----------|-------------|
| Centralized `get_logger()` | `infrastructure/logging/config.py:190` | All files already use this |
| Context injection | `infrastructure/logging/processors.py` | `request_id`, `user_id` auto-injected |
| OTEL correlation | `infrastructure/logging/processors.py:53` | `trace_id`, `span_id` auto-injected |
| Structured logging | structlog library | Use keyword args for all context |
| Test infrastructure | `tests/infrastructure/observability/test_logging_*.py` | Extend for new helpers |

Scope

| Task | Effort | Notes |
|------|--------|-------|
| **Infrastructure Helpers** | | |
| Create `@timed` decorator | S | Handles sync/async, logs duration |
| Create `log_timing` context manager | S | For inline timing blocks |
| Create `log_state_transition()` helper | S | Structured old→new logging |
| **P0 - Critical Gaps** | | |
| Ollama availability timeout logging | S | `ollama_provider.py:101-115` |
| Cloud summarization API timing | S | `cloud_provider.py:238-282` |
| Database engine creation logging | S | `database.py:85-116` |
| **P1 - High Priority Gaps** | | |
| Calendar API request logging | M | `google_adapter.py`, `outlook_adapter.py` |
| OAuth token refresh timing | S | `oauth_manager.py:211-222` |
| Webhook delivery start logging | S | `executor.py:107` (upgrade to INFO) |
| ASR transcription duration | S | `asr/engine.py:156-177` |
| NER model warmup timing | S | `ner_service.py:185-197` |
| Diarization operation timing | M | `diarization/engine.py:299-347` |
| Silent `ValueError` logging | M | 4 locations identified |
| Silent settings fallback logging | M | 4 locations identified |
| gRPC client stub unavailable logging | M | `_client_mixins/*.py` |
| **P2 - Medium Priority Gaps** | | |
| Meeting state transition logging | M | Add to `MeetingMixin` |
| Diarization job state logging | S | `_jobs.py:147-171` |
| Segmenter state machine logging | S | `segmenter.py:121-127` |
| Stream cleanup logging | S | `_cleanup.py:14-34` |
| Background task spawn logging | S | `_jobs.py:130-132` |
| Repository CRUD logging | L | 10+ repository files |
| Unit of Work transaction logging | M | `unit_of_work.py:220-296` |
| **P3 - Lower Priority** | | |
| File system operation logging | S | `audio/writer.py` |
| Export timing logging | S | `export/pdf.py`, `markdown.py`, `html.py` |
| Lazy model loading logging | M | NER, diarization, Ollama engines |
| Singleton creation logging | S | `metrics/collector.py` |
| **Documentation** | | |
| Create `docs/guides/logging.md` | M | Usage guide with examples |
| Update `CLAUDE.md` | S | Add logging configuration section |
| Update `triage.md` | S | Mark infrastructure items resolved |

Total Effort: L (2-3 days across P0-P2)


Domain Model

Timing Helper

```python
# src/noteflow/infrastructure/logging/timing.py

from __future__ import annotations

import asyncio
import functools
import time
from contextlib import contextmanager
from typing import TYPE_CHECKING, ParamSpec, TypeVar

from .config import get_logger

if TYPE_CHECKING:
    from collections.abc import Callable, Generator

P = ParamSpec("P")
T = TypeVar("T")

logger = get_logger()


@contextmanager
def log_timing(
    operation: str,
    **context: str | int | float,
) -> Generator[None, None, None]:
    """Context manager for timing operations with structured logging.

    Args:
        operation: Name of the operation being timed.
        **context: Additional context fields to include in logs.

    Yields:
        None

    Example:
        with log_timing("ollama_availability_check", host=self._host):
            client.list()
    """
    logger.info(f"{operation}_started", **context)
    start = time.perf_counter()
    try:
        yield
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(f"{operation}_completed", duration_ms=elapsed_ms, **context)
    except TimeoutError:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.error(f"{operation}_timeout", duration_ms=elapsed_ms, **context)
        raise
    except Exception as exc:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.error(f"{operation}_failed", duration_ms=elapsed_ms, error=str(exc), **context)
        raise


def timed(operation: str) -> Callable[[Callable[P, T]], Callable[P, T]]:
    """Decorator for timing function calls with structured logging.

    Both sync and async paths reuse log_timing, so TimeoutError is logged
    consistently as ``{operation}_timeout`` in either case.

    Args:
        operation: Name of the operation (used as log event prefix).

    Returns:
        Decorated function with timing logs.

    Example:
        @timed("transcribe_audio")
        async def transcribe_async(self, audio: bytes) -> list[AsrResult]:
            ...
    """
    def decorator(func: Callable[P, T]) -> Callable[P, T]:
        @functools.wraps(func)
        def sync_wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
            with log_timing(operation):
                return func(*args, **kwargs)

        @functools.wraps(func)
        async def async_wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
            # log_timing never awaits, so it is safe inside async code; the
            # awaited duration is still captured by the surrounding block.
            with log_timing(operation):
                return await func(*args, **kwargs)

        if asyncio.iscoroutinefunction(func):
            return async_wrapper  # type: ignore[return-value]
        return sync_wrapper

    return decorator
```

State Transition Helper

```python
# src/noteflow/infrastructure/logging/transitions.py

from __future__ import annotations

from enum import Enum

from .config import get_logger

logger = get_logger()


def log_state_transition(
    entity_type: str,
    entity_id: str,
    old_state: Enum | str | None,
    new_state: Enum | str,
    **context: str | int | float,
) -> None:
    """Log a state transition with structured fields.

    Args:
        entity_type: Type of entity (e.g., "meeting", "diarization_job").
        entity_id: Unique identifier for the entity.
        old_state: Previous state (None for creation; logged as null, not
            the string "None").
        new_state: New state being transitioned to.
        **context: Additional context fields.

    Example:
        log_state_transition(
            "meeting", meeting_id,
            old_state=MeetingState.RECORDING,
            new_state=MeetingState.STOPPED,
            workspace_id=workspace_id,
        )
    """
    def _value(state: Enum | str | None) -> str | None:
        if state is None:
            return None
        return state.value if isinstance(state, Enum) else str(state)

    logger.info(
        f"{entity_type}_state_transition",
        entity_id=entity_id,
        old_state=_value(old_state),
        new_state=_value(new_state),
        **context,
    )
```

Deliverables

Backend

Infrastructure Layer:

  • src/noteflow/infrastructure/logging/timing.py — Timing helpers
  • src/noteflow/infrastructure/logging/transitions.py — State transition helper
  • src/noteflow/infrastructure/logging/__init__.py — Export new helpers

P0 Fixes:

  • src/noteflow/infrastructure/summarization/ollama_provider.py — Add timeout logging
  • src/noteflow/infrastructure/summarization/cloud_provider.py — Add API timing
  • src/noteflow/infrastructure/persistence/database.py — Add engine creation logging

P1 Fixes:

  • src/noteflow/infrastructure/calendar/google_adapter.py — Add request logging
  • src/noteflow/infrastructure/calendar/outlook_adapter.py — Add request logging
  • src/noteflow/infrastructure/calendar/oauth_manager.py — Add refresh timing
  • src/noteflow/infrastructure/webhooks/executor.py — Upgrade to INFO level
  • src/noteflow/infrastructure/asr/engine.py — Add transcription timing
  • src/noteflow/application/services/ner_service.py — Add warmup timing
  • src/noteflow/infrastructure/diarization/engine.py — Add operation timing
  • Silent failure locations (4 files) — Add WARNING logs

P2 Fixes:

  • src/noteflow/grpc/_mixins/meeting.py — Add state transition logging
  • src/noteflow/grpc/_mixins/diarization/_jobs.py — Add job state logging
  • src/noteflow/infrastructure/asr/segmenter.py — Add VAD state logging
  • src/noteflow/grpc/_mixins/streaming/_cleanup.py — Add cleanup logging
  • Repository files (10+) — Add CRUD logging
  • src/noteflow/infrastructure/persistence/unit_of_work.py — Add transaction logging
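
For the meeting state-transition item, a hedged sketch of what a call site might look like. `MeetingState`, `stop_meeting`, and the recording logger are illustrative stand-ins so the sketch is self-contained, not the real `MeetingMixin` code; the helper mirrors the event shape from the Domain Model section:

```python
from enum import Enum


class MeetingState(Enum):
    RECORDING = "recording"
    STOPPED = "stopped"


class RecordingLogger:
    """Stand-in for the structlog logger, so the sketch runs standalone."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict]] = []

    def info(self, event: str, **fields) -> None:
        self.events.append((event, fields))


logger = RecordingLogger()


def log_state_transition(entity_type, entity_id, old_state, new_state, **context):
    # Same event shape as the Domain Model helper: Enum -> .value, str kept as-is.
    def _value(state):
        if state is None:
            return None
        return state.value if isinstance(state, Enum) else str(state)

    logger.info(
        f"{entity_type}_state_transition",
        entity_id=entity_id,
        old_state=_value(old_state),
        new_state=_value(new_state),
        **context,
    )


# Hypothetical call site when a recording stops:
def stop_meeting(meeting_id: str, workspace_id: str) -> None:
    log_state_transition(
        "meeting",
        meeting_id,
        old_state=MeetingState.RECORDING,
        new_state=MeetingState.STOPPED,
        workspace_id=workspace_id,
    )


stop_meeting("m-123", "ws-1")
```

Because the states are emitted as flat string fields, a log aggregator can filter on `new_state="stopped"` without parsing message text.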

Documentation:

  • docs/guides/logging.md — Usage guide
  • CLAUDE.md — Add logging configuration section
  • docs/triage.md — Mark resolved items

Test Strategy

Fixtures to extend or create

  • tests/infrastructure/observability/conftest.py: Add captured_logs fixture using structlog testing utilities
  • tests/infrastructure/observability/conftest.py: Add mock_time fixture for timing tests

Parameterized tests

  • Timing decorator: sync functions, async functions, timeout errors, general exceptions
  • State transitions: enum states, string states, None old_state, with/without context

Core test cases

  • Timing helpers: Verify correct event names, duration_ms field, error handling
  • State transitions: Verify old_state/new_state fields, entity_id included
  • Integration: Verify logs appear in LogBuffer with correct structure
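
A self-contained sketch of the core timing test shape, using a recording logger in place of structlog so the assertions need no capture fixture. The inline `log_timing` is a simplified copy of the Domain Model version (TimeoutError branch omitted), included only to make the sketch runnable:

```python
import time
from contextlib import contextmanager


class RecordingLogger:
    """Captures (level, event, fields) tuples for assertions."""

    def __init__(self) -> None:
        self.events: list[tuple[str, str, dict]] = []

    def info(self, event: str, **fields) -> None:
        self.events.append(("info", event, fields))

    def error(self, event: str, **fields) -> None:
        self.events.append(("error", event, fields))


logger = RecordingLogger()


@contextmanager
def log_timing(operation: str, **context):
    # Simplified inline copy of the helper under test, bound to the recorder.
    logger.info(f"{operation}_started", **context)
    start = time.perf_counter()
    try:
        yield
        logger.info(f"{operation}_completed",
                    duration_ms=(time.perf_counter() - start) * 1000, **context)
    except Exception as exc:
        logger.error(f"{operation}_failed",
                     duration_ms=(time.perf_counter() - start) * 1000,
                     error=str(exc), **context)
        raise


# Success path: started + completed, each carrying the context fields.
with log_timing("demo_op", host="localhost"):
    pass

# Failure path: started + failed, and the exception still propagates.
try:
    with log_timing("demo_op"):
        raise ValueError("boom")
except ValueError:
    pass

event_names = [event for _, event, _ in logger.events]
```

The real tests would import the production helpers and swap the logger via the planned `captured_logs` fixture, but the assertions (event names, `duration_ms` presence, error propagation) stay the same.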

Quality Gates

  • pytest tests/infrastructure/observability/ passes
  • make quality-py passes (ruff, basedpyright, test quality)
  • No # type: ignore without justification
  • All new public functions have docstrings
  • Documentation files created

Priority-Based Implementation Order

Batch 1: Infrastructure + P0 (Day 1 morning)

  1. Create timing.py with log_timing and @timed
  2. Create transitions.py with log_state_transition
  3. Update __init__.py exports
  4. Write tests for new helpers
  5. Apply to Ollama availability check
  6. Apply to cloud summarization API calls
  7. Apply to database engine creation

Batch 2: P1 Network/Blocking (Day 1 afternoon)

  1. Calendar API request logging (Google, Outlook)
  2. OAuth token refresh timing
  3. Webhook delivery start logging
  4. ASR transcription timing
  5. NER model warmup timing
  6. Diarization operation timing

Batch 3: P1 Error Handling (Day 2 morning)

  1. Silent ValueError logging (4 locations)
  2. Silent settings fallback logging (4 locations)
  3. gRPC client stub unavailable logging

Batch 4: P2 State Transitions (Day 2 afternoon)

  1. Meeting state transition logging
  2. Diarization job state logging
  3. Segmenter state machine logging
  4. Stream cleanup logging
  5. Background task spawn logging

Batch 5: P2 Database + Docs (Day 3)

  1. Repository CRUD logging (10+ files)
  2. Unit of Work transaction logging
  3. Create docs/guides/logging.md
  4. Update CLAUDE.md
  5. Update triage.md resolved section

Post-Sprint

  • Add log aggregation integration (e.g., Loki, Datadog)
  • Create alerting rules for error patterns
  • Add request tracing dashboard
  • Consider structured logging for client-side (Rust tracing)