15 KiB
15 KiB
Sprint: Logging Enrichment & Gap Remediation
Size: L | Owner: Backend | Prerequisites: Sprint Logging Centralization (Infrastructure) Phase: Ongoing - Patterns & Observability
Open Issues & Prerequisites
✅ Review Date: 2026-01-02 — Infrastructure migration complete, enrichment work outstanding
Blocking Issues
| ID | Issue | Status | Resolution |
|---|---|---|---|
| B1 | structlog infrastructure must be in place | ✅ RESOLVED | 77 files migrated to get_logger() pattern |
| B2 | Dual output (Rich + JSON) must work | ✅ RESOLVED | Implemented in config.py |
Design Gaps to Address
| ID | Gap | Resolution |
|---|---|---|
| G1 | No timing decorator/context manager | Create @log_timing decorator and LogTiming context manager |
| G2 | No structured state transition logging | Define log_state_transition(entity, old, new) helper |
| G3 | Missing documentation | Create docs/guides/logging.md |
Prerequisite Verification
| Prerequisite | Status | Notes |
|---|---|---|
| structlog infrastructure | ✅ Complete | src/noteflow/infrastructure/logging/ |
| Context variable injection | ✅ Complete | add_noteflow_context processor |
| OTEL trace correlation | ✅ Complete | add_otel_trace_context processor |
| LogBuffer integration | ✅ Complete | LogBufferHandler attached |
Validation Status (2026-01-02)
IMPLEMENTED
| Component | Status | Notes |
|---|---|---|
LoggingConfig dataclass |
✅ Complete | config.py:31-53 |
configure_logging() |
✅ Complete | config.py:163-187 |
get_logger() |
✅ Complete | config.py:190-199 |
add_noteflow_context processor |
✅ Complete | processors.py:27-50 |
add_otel_trace_context processor |
✅ Complete | processors.py:53-90 |
| File migration (77 files) | ✅ Complete | All use centralized get_logger() |
NOT IMPLEMENTED
| Component | Status | Notes |
|---|---|---|
| Timing decorator/helper | ❌ Not started | Needed for 50+ operations |
| State transition logger | ❌ Not started | Needed for 7+ state machines |
| Silent failure remediation | ❌ Not started | 10+ locations return None silently |
| docs/guides/logging.md | ❌ Not started | Usage documentation |
| CLAUDE.md logging section | ❌ Not started | Project instructions |
Downstream impact: Debugging remains difficult for network calls, blocking operations, and state transitions.
Objective
Systematically enrich logging across the codebase to eliminate debugging blind spots. This sprint addresses 100+ logging gaps identified in docs/triage.md, adding timing information to network/blocking operations, converting silent failures to logged failures, and implementing structured state transition logging.
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Timing approach | Context manager + decorator | Reusable, consistent pattern across sync/async |
| Log levels | INFO for operations, DEBUG for details | Balances verbosity with observability |
| State transitions | Structured fields (old_state, new_state) | Enables filtering/alerting in log aggregators |
| Silent failures | Log at WARNING, still return None | Non-breaking change, adds visibility |
| Batch strategy | By priority (P0 → P1 → P2) | Address critical gaps first |
What Already Exists
| Asset | Location | Implication |
|---|---|---|
Centralized get_logger() |
infrastructure/logging/config.py:190 |
All files already use this |
| Context injection | infrastructure/logging/processors.py |
request_id, user_id auto-injected |
| OTEL correlation | infrastructure/logging/processors.py:53 |
trace_id, span_id auto-injected |
| Structured logging | structlog library |
Use keyword args for all context |
| Test infrastructure | tests/infrastructure/observability/test_logging_*.py |
Extend for new helpers |
Scope
| Task | Effort | Notes |
|---|---|---|
| Infrastructure Helpers | ||
Create @log_timing decorator |
S | Handles sync/async, logs duration |
Create LogTiming context manager |
S | For inline timing blocks |
Create log_state_transition() helper |
S | Structured old→new logging |
| P0 - Critical Gaps | ||
| Ollama availability timeout logging | S | ollama_provider.py:101-115 |
| Cloud summarization API timing | S | cloud_provider.py:238-282 |
| Database engine creation logging | S | database.py:85-116 |
| P1 - High Priority Gaps | ||
| Calendar API request logging | M | google_adapter.py, outlook_adapter.py |
| OAuth token refresh timing | S | oauth_manager.py:211-222 |
| Webhook delivery start logging | S | executor.py:107 (upgrade to INFO) |
| ASR transcription duration | S | asr/engine.py:156-177 |
| NER model warmup timing | S | ner_service.py:185-197 |
| Diarization operation timing | M | diarization/engine.py:299-347 |
| Silent ValueError logging | M | 4 locations identified |
| Silent settings fallback logging | M | 4 locations identified |
| gRPC client stub unavailable logging | M | _client_mixins/*.py |
| P2 - Medium Priority Gaps | ||
| Meeting state transition logging | M | Add to MeetingMixin |
| Diarization job state logging | S | _jobs.py:147-171 |
| Segmenter state machine logging | S | segmenter.py:121-127 |
| Stream cleanup logging | S | _cleanup.py:14-34 |
| Background task spawn logging | S | _jobs.py:130-132 |
| Repository CRUD logging | L | 10+ repository files |
| Unit of Work transaction logging | M | unit_of_work.py:220-296 |
| P3 - Lower Priority | ||
| File system operation logging | S | audio/writer.py |
| Export timing logging | S | export/pdf.py, markdown.py, html.py |
| Lazy model loading logging | M | NER, diarization, Ollama engines |
| Singleton creation logging | S | metrics/collector.py |
| Documentation | ||
| Create docs/guides/logging.md | M | Usage guide with examples |
| Update CLAUDE.md | S | Add logging configuration section |
| Update triage.md | S | Mark infrastructure items resolved |
Total Effort: L (2-3 days across P0-P2)
Domain Model
Timing Helper
# src/noteflow/infrastructure/logging/timing.py
from __future__ import annotations
import functools
import time
from contextlib import contextmanager
from typing import TYPE_CHECKING, ParamSpec, TypeVar
from .config import get_logger
if TYPE_CHECKING:
from collections.abc import Callable, Generator
P = ParamSpec("P")
T = TypeVar("T")
logger = get_logger()
@contextmanager
def log_timing(
operation: str,
**context: str | int | float,
) -> Generator[None, None, None]:
"""Context manager for timing operations with structured logging.
Args:
operation: Name of the operation being timed.
**context: Additional context fields to include in logs.
Yields:
None
Example:
with log_timing("ollama_availability_check", host=self._host):
client.list()
"""
logger.info(f"{operation}_started", **context)
start = time.perf_counter()
try:
yield
elapsed_ms = (time.perf_counter() - start) * 1000
logger.info(f"{operation}_completed", duration_ms=elapsed_ms, **context)
except TimeoutError:
elapsed_ms = (time.perf_counter() - start) * 1000
logger.error(f"{operation}_timeout", duration_ms=elapsed_ms, **context)
raise
except Exception as exc:
elapsed_ms = (time.perf_counter() - start) * 1000
logger.error(f"{operation}_failed", duration_ms=elapsed_ms, error=str(exc), **context)
raise
def timed(operation: str) -> Callable[[Callable[P, T]], Callable[P, T]]:
"""Decorator for timing function calls with structured logging.
Args:
operation: Name of the operation (used as log event prefix).
Returns:
Decorated function with timing logs.
Example:
@timed("transcribe_audio")
async def transcribe_async(self, audio: bytes) -> list[AsrResult]:
...
"""
def decorator(func: Callable[P, T]) -> Callable[P, T]:
@functools.wraps(func)
def sync_wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
with log_timing(operation):
return func(*args, **kwargs)
@functools.wraps(func)
async def async_wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
logger.info(f"{operation}_started")
start = time.perf_counter()
try:
result = await func(*args, **kwargs)
elapsed_ms = (time.perf_counter() - start) * 1000
logger.info(f"{operation}_completed", duration_ms=elapsed_ms)
return result
except Exception as exc:
elapsed_ms = (time.perf_counter() - start) * 1000
logger.error(f"{operation}_failed", duration_ms=elapsed_ms, error=str(exc))
raise
import asyncio
if asyncio.iscoroutinefunction(func):
return async_wrapper # type: ignore[return-value]
return sync_wrapper
return decorator
State Transition Helper
# src/noteflow/infrastructure/logging/transitions.py
from __future__ import annotations
from typing import TYPE_CHECKING
from .config import get_logger
if TYPE_CHECKING:
from enum import Enum
logger = get_logger()
def log_state_transition(
entity_type: str,
entity_id: str,
old_state: Enum | str | None,
new_state: Enum | str,
**context: str | int | float,
) -> None:
"""Log a state transition with structured fields.
Args:
entity_type: Type of entity (e.g., "meeting", "diarization_job").
entity_id: Unique identifier for the entity.
old_state: Previous state (None for creation).
new_state: New state being transitioned to.
**context: Additional context fields.
Example:
log_state_transition(
"meeting", meeting_id,
old_state=MeetingState.RECORDING,
new_state=MeetingState.STOPPED,
workspace_id=workspace_id,
)
"""
old_value = old_state.value if hasattr(old_state, "value") else str(old_state)
new_value = new_state.value if hasattr(new_state, "value") else str(new_state)
logger.info(
f"{entity_type}_state_transition",
entity_id=entity_id,
old_state=old_value,
new_state=new_value,
**context,
)
Deliverables
Backend
Infrastructure Layer:
src/noteflow/infrastructure/logging/timing.py— Timing helperssrc/noteflow/infrastructure/logging/transitions.py— State transition helpersrc/noteflow/infrastructure/logging/__init__.py— Export new helpers
P0 Fixes:
src/noteflow/infrastructure/summarization/ollama_provider.py— Add timeout loggingsrc/noteflow/infrastructure/summarization/cloud_provider.py— Add API timingsrc/noteflow/infrastructure/persistence/database.py— Add engine creation logging
P1 Fixes:
src/noteflow/infrastructure/calendar/google_adapter.py— Add request loggingsrc/noteflow/infrastructure/calendar/outlook_adapter.py— Add request loggingsrc/noteflow/infrastructure/calendar/oauth_manager.py— Add refresh timingsrc/noteflow/infrastructure/webhooks/executor.py— Upgrade to INFO levelsrc/noteflow/infrastructure/asr/engine.py— Add transcription timingsrc/noteflow/application/services/ner_service.py— Add warmup timingsrc/noteflow/infrastructure/diarization/engine.py— Add operation timing- Silent failure locations (4 files) — Add WARNING logs
P2 Fixes:
src/noteflow/grpc/_mixins/meeting.py— Add state transition loggingsrc/noteflow/grpc/_mixins/diarization/_jobs.py— Add job state loggingsrc/noteflow/infrastructure/asr/segmenter.py— Add VAD state loggingsrc/noteflow/grpc/_mixins/streaming/_cleanup.py— Add cleanup logging- Repository files (10+) — Add CRUD logging
src/noteflow/infrastructure/persistence/unit_of_work.py— Add transaction logging
Documentation:
docs/guides/logging.md— Usage guideCLAUDE.md— Add logging configuration sectiondocs/triage.md— Mark resolved items
Test Strategy
Fixtures to extend or create
tests/infrastructure/observability/conftest.py: Addcaptured_logsfixture using structlog testing utilitiestests/infrastructure/observability/conftest.py: Addmock_timefixture for timing tests
Parameterized tests
- Timing decorator: sync functions, async functions, timeout errors, general exceptions
- State transitions: enum states, string states, None old_state, with/without context
Core test cases
- Timing helpers: Verify correct event names, duration_ms field, error handling
- State transitions: Verify old_state/new_state fields, entity_id included
- Integration: Verify logs appear in LogBuffer with correct structure
Quality Gates
pytest tests/infrastructure/observability/passesmake quality-pypasses (ruff, basedpyright, test quality)- No
# type: ignorewithout justification - All new public functions have docstrings
- Documentation files created
Priority-Based Implementation Order
Batch 1: Infrastructure + P0 (Day 1 morning)
- Create
timing.pywithlog_timingand@timed - Create
transitions.pywithlog_state_transition - Update
__init__.pyexports - Write tests for new helpers
- Apply to Ollama availability check
- Apply to cloud summarization API calls
- Apply to database engine creation
Batch 2: P1 Network/Blocking (Day 1 afternoon)
- Calendar API request logging (Google, Outlook)
- OAuth token refresh timing
- Webhook delivery start logging
- ASR transcription timing
- NER model warmup timing
- Diarization operation timing
Batch 3: P1 Error Handling (Day 2 morning)
- Silent ValueError logging (4 locations)
- Silent settings fallback logging (4 locations)
- gRPC client stub unavailable logging
Batch 4: P2 State Transitions (Day 2 afternoon)
- Meeting state transition logging
- Diarization job state logging
- Segmenter state machine logging
- Stream cleanup logging
- Background task spawn logging
Batch 5: P2 Database + Docs (Day 3)
- Repository CRUD logging (10+ files)
- Unit of Work transaction logging
- Create docs/guides/logging.md
- Update CLAUDE.md
- Update triage.md resolved section
Post-Sprint
- Add log aggregation integration (e.g., Loki, Datadog)
- Create alerting rules for error patterns
- Add request tracing dashboard
- Consider structured logging for client-side (Rust tracing)