noteflow/docs/observability.md
Travis Vasceannie 2d1a86937f Enhance observability and usage event infrastructure
- Introduced `UsageEventSink` interface for capturing application-layer usage events.
- Updated `OtelUsageEventSink` to attach span events and emit usage events from application services.
- Enhanced logging by injecting `trace_id` and `span_id` into LogBuffer entries.
- Implemented metadata persistence for `Summary.tokens_used` and `Summary.latency_ms` before saving.
- Updated documentation to reflect changes in observability components and planned features for Sprint 15.

All quality checks pass.
2025-12-29 21:07:47 +00:00


NoteFlow Observability (OpenTelemetry Plan)

Status: Planned (Sprint 15)
Owner: Backend
Goal: Standardize traces, metrics, and logs so usage events are derived consistently and correlate to user actions.


Why OpenTelemetry

  • Single, vendor-neutral instrumentation layer for traces/metrics/logs.
  • Correlation IDs become first-class (trace_id/span_id) across gRPC calls.
  • Usage events can be emitted as span events instead of bespoke emitters.

Existing Assets (Reuse)

  • src/noteflow/infrastructure/logging/log_buffer.py — structured log buffer (existing)
  • src/noteflow/infrastructure/metrics/collector.py — system metrics (existing)
  • src/noteflow/grpc/_mixins/observability.py — logs/metrics RPCs (existing)

These remain the retrieval surface; OpenTelemetry becomes the capture layer.


Planned Components (Sprint 15)

  • src/noteflow/application/observability/ports.py
    • UsageEventSink interface for application-layer usage events
  • src/noteflow/infrastructure/observability/otel.py
    • Initializes OTel providers (trace + metrics)
    • Instruments gRPC server
    • Adds resource attributes (service.name, version)
  • src/noteflow/infrastructure/observability/usage.py
    • OtelUsageEventSink attaches span events
  • src/noteflow/grpc/_interceptors/otel.py
    • Ensures correlation IDs propagate into async contexts
  • src/noteflow/infrastructure/logging/context.py
    • Logging filter/record factory to inject trace_id/span_id
  • src/noteflow/infrastructure/logging/log_buffer.py
    • Extend record schema to include trace_id/span_id
  • src/noteflow/application/services/summarization_service.py
    • Persist Summary.tokens_used / Summary.latency_ms and emit usage events

All new code must satisfy docs/sprints/QUALITY_STANDARDS.md.
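
As an illustration of the port described above, the following is a minimal sketch of what a UsageEventSink interface could look like. The method name `record`, its parameters, the `AttributeValue` alias, and the two helper sinks are assumptions made for this example; the actual signature in ports.py may differ.

```python
from __future__ import annotations

from typing import Mapping, Protocol, Union

# Assumption: attribute values are limited to simple scalar types,
# matching what span event attributes can carry.
AttributeValue = Union[str, bool, int, float]


class UsageEventSink(Protocol):
    """Application-layer port; infrastructure supplies the implementation."""

    def record(self, event_name: str, attributes: Mapping[str, AttributeValue]) -> None:
        ...


class NullUsageEventSink:
    """Hypothetical no-op sink for when observability is disabled."""

    def record(self, event_name: str, attributes: Mapping[str, AttributeValue]) -> None:
        return None


class RecordingUsageEventSink:
    """Hypothetical in-memory sink, useful in unit tests."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict[str, AttributeValue]]] = []

    def record(self, event_name: str, attributes: Mapping[str, AttributeValue]) -> None:
        # Copy the mapping so later mutation by the caller cannot change history.
        self.events.append((event_name, dict(attributes)))


sink: UsageEventSink = RecordingUsageEventSink()
sink.record("summarization.completed", {"meeting_id": "m1", "success": True})
```

Keeping the port free of OpenTelemetry imports lets application services depend only on the Protocol. An OTel-backed sink would presumably forward each call to the active span's `add_event(name, attributes=...)`.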


Correlation Model

  • Primary: trace_id, span_id (OpenTelemetry)
  • Secondary: request_id, workspace_id, user_id, meeting_id (domain context)
  • Rule: gRPC interceptor sets context once per request; all logs/spans read from context.
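
The rule above can be sketched with the standard library alone. In this example a context variable stands in for the state the interceptor would set, and a logging filter copies the IDs onto each record. `RequestContext`, `CorrelationFilter`, and `handle_request` are hypothetical names; in the planned design the trace_id/span_id values would come from the active OpenTelemetry span instead of being passed in by hand.

```python
from __future__ import annotations

import asyncio
import contextvars
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    trace_id: str
    span_id: str
    request_id: str


# Holds the context for the request currently being served (None outside a request).
_current: contextvars.ContextVar[RequestContext | None] = contextvars.ContextVar(
    "noteflow_request_context", default=None
)


class CorrelationFilter(logging.Filter):
    """Attach trace_id/span_id to every record, or None when no request is active."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = _current.get()
        record.trace_id = ctx.trace_id if ctx else None
        record.span_id = ctx.span_id if ctx else None
        return True  # never drop the record, only enrich it


async def _do_work(logger: logging.Logger, step: str) -> None:
    await asyncio.sleep(0)
    logger.info("step %s", step)


async def handle_request(logger: logging.Logger, ctx: RequestContext) -> None:
    # Stand-in for the interceptor: set the context once, then run the handler.
    token = _current.set(ctx)
    try:
        # Tasks created here copy the current context, so both see the same IDs.
        await asyncio.gather(_do_work(logger, "a"), _do_work(logger, "b"))
    finally:
        _current.reset(token)
```

Because asyncio tasks copy the context at creation time, code running inside the handler does not need the IDs passed explicitly; it reads them from the context variable.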

Usage Events (Span Events)

Usage is recorded by attaching events to the active span:

  • summarization.completed
  • embedding.completed
  • transcription.completed
  • diarization.completed
  • entity_extraction.completed

Each event should include:

  • meeting_id
  • workspace_id
  • user_id
  • tokens_input, tokens_output
  • latency_ms
  • provider, model
  • success, error_message (optional)

tokens_output may be None when a provider reports only a total token count. Persist usage metadata on Summary.tokens_used and Summary.latency_ms before the summary is saved.
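
A minimal sketch of the event payload and the persist-before-save step follows. `UsageEvent`, `to_attributes`, `finalize_summary`, and the `Summary` stand-in are hypothetical names and shapes chosen for illustration, and the field types are assumptions. Optional fields that are None are omitted from the attribute dict, since OpenTelemetry attribute values cannot be None.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class UsageEvent:
    meeting_id: str
    workspace_id: str
    user_id: str
    provider: str
    model: str
    tokens_input: int
    latency_ms: float
    success: bool
    tokens_output: Optional[int] = None  # None when only a total is reported
    error_message: Optional[str] = None

    def to_attributes(self) -> dict:
        # Drop unset optional fields rather than sending None values.
        return {k: v for k, v in asdict(self).items() if v is not None}


@dataclass
class Summary:
    """Hypothetical stand-in for the domain entity."""

    text: str
    tokens_used: Optional[int] = None
    latency_ms: Optional[float] = None


def finalize_summary(
    summary: Summary, tokens_used: Optional[int], started: float, finished: float
) -> Summary:
    """Copy usage metadata onto the summary; intended to run before saving."""
    summary.tokens_used = tokens_used
    summary.latency_ms = (finished - started) * 1000.0  # seconds to milliseconds
    return summary


event = UsageEvent(
    meeting_id="m1", workspace_id="w1", user_id="u1",
    provider="example-provider", model="example-model",
    tokens_input=120, latency_ms=250.0, success=True,
)
attributes = event.to_attributes()  # no tokens_output or error_message keys
```

Passing the start and finish timestamps in, rather than reading a clock inside the function, keeps the helper deterministic in tests.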


Wiring Plan (Minimal)

  1. Startup: call configure_observability() in src/noteflow/grpc/server.py.
  2. gRPC: enable OTel gRPC server instrumentation.
  3. Logging: inject trace_id/span_id into LogBuffer entries via logging filter.
  4. Usage: emit events via UsageEventSink from application services.
  5. Metadata: persist Summary.tokens_used / Summary.latency_ms before save.
  6. Metrics history (optional): start background collection if historical graphs are required.
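
The startup step could be sketched roughly as below. This is an assumption-laden illustration, not the planned implementation: the parameter names, the default values, and the choice to return the provider (or None on failure) are all hypothetical. The imports are done lazily inside a try block so that a missing or failing OpenTelemetry install leaves the service running with its existing logging behavior, and the exporter is only attached when an endpoint is supplied.

```python
from __future__ import annotations

import logging

logger = logging.getLogger("noteflow.observability")


def configure_observability(
    service_name: str = "noteflow",      # assumed default
    service_version: str = "0.0.0",      # assumed default
    otlp_endpoint: str | None = None,    # hypothetical setting; None means local-only
) -> object | None:
    """Install an OTel tracer provider; return it, or None if setup failed."""
    try:
        # Lazy imports: an ImportError here is treated like any other init failure.
        from opentelemetry import trace
        from opentelemetry.sdk.resources import Resource
        from opentelemetry.sdk.trace import TracerProvider

        resource = Resource.create(
            {"service.name": service_name, "service.version": service_version}
        )
        provider = TracerProvider(resource=resource)

        if otlp_endpoint is not None:
            # Exporters stay optional: only attached when explicitly configured.
            from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
                OTLPSpanExporter,
            )
            from opentelemetry.sdk.trace.export import BatchSpanProcessor

            provider.add_span_processor(
                BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))
            )

        trace.set_tracer_provider(provider)
        return provider
    except Exception:
        # Fall back to the existing logging path instead of failing startup.
        logger.warning("OTel init failed; continuing without it", exc_info=True)
        return None
```

Returning the provider gives the caller something to pass to the existing shutdown hooks, for example by calling its `shutdown()` method, so that any export worker is stopped with the rest of the server. The gRPC server instrumentation from step 2 would presumably be enabled inside the same try block.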

Tests (Planned)

Backend tests must validate that:

  • OTel spans are created per gRPC call.
  • LogBuffer entries include trace_id/span_id when available.
  • Usage events appear as span events with required attributes.

Planned test files:

  • tests/grpc/test_observability.py
  • tests/infrastructure/test_log_buffer.py
  • tests/application/test_summarization_usage.py

Quality Gates

  • pytest tests/grpc/test_observability.py
  • pytest tests/infrastructure/test_log_buffer.py
  • ruff check src/noteflow
  • basedpyright
  • Must comply with docs/sprints/QUALITY_STANDARDS.md

Notes

  • Keep exporters optional; default to local-only collection unless configured.
  • Do not introduce background threads that bypass existing async shutdown hooks.
  • If OTel initialization fails, fall back to current LogBuffer behavior.