- Introduced `UsageEventSink` interface for capturing application-layer usage events.
- Updated `OtelUsageEventSink` to attach span events and emit usage events from application services.
- Enhanced logging by injecting `trace_id` and `span_id` into LogBuffer entries.
- Implemented metadata persistence for `Summary.tokens_used` and `Summary.latency_ms` before saving.
- Updated documentation to reflect changes in observability components and planned features for Sprint 15.

All quality checks pass.
NoteFlow Observability (OpenTelemetry Plan)
Status: Planned (Sprint 15)
Owner: Backend
Goal: Standardize traces, metrics, and logs so usage events are derived consistently and correlate to user actions.
Why OpenTelemetry
- Single, vendor‑neutral instrumentation layer for traces/metrics/logs.
- Correlation IDs become first‑class (trace_id/span_id) across gRPC calls.
- Usage events can be emitted as span events instead of bespoke emitters.
Existing Assets (Reuse)
- src/noteflow/infrastructure/logging/log_buffer.py: structured log buffer (existing)
- src/noteflow/infrastructure/metrics/collector.py: system metrics (existing)
- src/noteflow/grpc/_mixins/observability.py: logs/metrics RPCs (existing)
These remain the retrieval surface; OpenTelemetry becomes the capture layer.
Planned Components (Sprint 15)
- src/noteflow/application/observability/ports.py
  - UsageEventSink interface for application-layer usage events
- src/noteflow/infrastructure/observability/otel.py
  - Initializes OTel providers (trace + metrics)
  - Instruments gRPC server
  - Adds resource attributes (service.name, version)
- src/noteflow/infrastructure/observability/usage.py
  - OtelUsageEventSink attaches span events
- src/noteflow/grpc/_interceptors/otel.py
  - Ensures correlation IDs propagate into async contexts
- src/noteflow/infrastructure/logging/context.py
  - Logging filter/record factory to inject trace_id/span_id
- src/noteflow/infrastructure/logging/log_buffer.py
  - Extend record schema to include trace_id/span_id
- src/noteflow/application/services/summarization_service.py
  - Persist Summary.tokens_used/Summary.latency_ms and emit usage events
All new code must satisfy docs/sprints/QUALITY_STANDARDS.md.
Correlation Model
- Primary: trace_id, span_id (OpenTelemetry)
- Secondary: request_id, workspace_id, user_id, meeting_id (domain context)
- Rule: gRPC interceptor sets context once per request; all logs/spans read from context.
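The "set once, read everywhere" rule can be sketched with the standard library's contextvars, which asyncio tasks inherit automatically. This is only an illustration: RequestContext, bind_request_context, and current_request_context are hypothetical names, not existing NoteFlow code, and the real interceptor would also carry the OpenTelemetry span context.

```python
import asyncio
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class RequestContext:
    """Secondary (domain) correlation fields for one request (hypothetical)."""

    request_id: str
    workspace_id: Optional[str] = None
    user_id: Optional[str] = None
    meeting_id: Optional[str] = None


# Hypothetical module-level holder; each asyncio task gets a copy of the
# context that was current when the task was created.
_request_context: ContextVar[Optional[RequestContext]] = ContextVar(
    "request_context", default=None
)


@contextmanager
def bind_request_context(ctx: RequestContext) -> Iterator[None]:
    """Set the context once (e.g. in the interceptor) and restore it afterwards."""
    token = _request_context.set(ctx)
    try:
        yield
    finally:
        _request_context.reset(token)


def current_request_context() -> Optional[RequestContext]:
    """Read side used by logging and span code."""
    return _request_context.get()


async def _handler() -> Optional[str]:
    # Runs in a child task and still sees the context bound by the caller.
    ctx = current_request_context()
    return ctx.meeting_id if ctx else None


async def handle_request() -> Optional[str]:
    with bind_request_context(RequestContext(request_id="req-1", meeting_id="m-42")):
        return await asyncio.create_task(_handler())
```

Because the value is reset when the block exits, one request's identifiers cannot leak into the next request handled on the same event loop.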
Usage Events (Span Events)
Usage is recorded by attaching events to the active span:
- summarization.completed
- embedding.completed
- transcription.completed
- diarization.completed
- entity_extraction.completed
Each event should include:
- meeting_id
- workspace_id
- user_id
- tokens_input, tokens_output
- latency_ms
- provider, model
- success, error_message (optional)
tokens_output can be None when providers only return total tokens. Usage metadata should be persisted on Summary.tokens_used and Summary.latency_ms before saving.
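One possible shape for the event payload and the sink port is sketched below. Only the UsageEventSink name and the field names come from this plan; the UsageEvent dataclass, the record() method, the attributes() helper, and the in-memory sink are assumptions made for illustration. Fields left as None are dropped before export, since OpenTelemetry attribute values must be primitives or sequences of primitives.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Protocol, Union

AttributeValue = Union[str, int, float, bool]


@dataclass(frozen=True)
class UsageEvent:
    """Hypothetical payload carrying the attributes listed above."""

    name: str  # e.g. "summarization.completed"
    meeting_id: str
    workspace_id: str
    user_id: str
    provider: str
    model: str
    latency_ms: float
    success: bool
    tokens_input: int
    tokens_output: Optional[int] = None  # None when only a total is reported
    error_message: Optional[str] = None

    def attributes(self) -> Dict[str, AttributeValue]:
        """Flatten to span-event attributes, dropping the name and None fields."""
        fields = asdict(self)
        fields.pop("name")
        return {key: value for key, value in fields.items() if value is not None}


class UsageEventSink(Protocol):
    """Application-layer port; the method name `record` is an assumption."""

    def record(self, event: UsageEvent) -> None: ...


class InMemoryUsageEventSink:
    """Test double that satisfies the port without any OTel dependency."""

    def __init__(self) -> None:
        self.events: List[UsageEvent] = []

    def record(self, event: UsageEvent) -> None:
        self.events.append(event)


sink = InMemoryUsageEventSink()
sink.record(
    UsageEvent(
        name="summarization.completed",
        meeting_id="m-42",
        workspace_id="w-1",
        user_id="u-7",
        provider="example-provider",
        model="example-model",
        latency_ms=812.5,
        success=True,
        tokens_input=1500,
    )
)
```

An OTel-backed sink could pass event.name and event.attributes() to the active span's add_event call, keeping application services unaware of the tracing library.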
Wiring Plan (Minimal)
- Startup: call configure_observability() in src/noteflow/grpc/server.py.
- gRPC: enable OTel gRPC server instrumentation.
- Logging: inject trace_id/span_id into LogBuffer entries via logging filter.
- Usage: emit events via UsageEventSink from application services.
- Metadata: persist Summary.tokens_used/Summary.latency_ms before save.
- Metrics history (optional): start background collection if historical graphs are required.
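The logging step could look roughly like the filter below. It is a sketch, not the planned context.py: TraceContextFilter and otel_trace_ids are hypothetical names, and the ID lookup is injectable so the filter can be exercised without OpenTelemetry installed. The default lookup uses the OpenTelemetry API's get_current_span() and renders IDs as 32 and 16 hex characters.

```python
import logging
from typing import Callable, Optional, Tuple

TraceIds = Tuple[Optional[str], Optional[str]]


def otel_trace_ids() -> TraceIds:
    """Read hex trace/span IDs from the active OpenTelemetry span, if any."""
    try:
        from opentelemetry import trace
    except ImportError:
        # OTel is not installed: behave as if there is no active span.
        return None, None
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return None, None
    return format(ctx.trace_id, "032x"), format(ctx.span_id, "016x")


class TraceContextFilter(logging.Filter):
    """Annotates every record with trace_id/span_id (None when unavailable)."""

    def __init__(self, get_ids: Callable[[], TraceIds] = otel_trace_ids) -> None:
        super().__init__()
        self._get_ids = get_ids

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id, record.span_id = self._get_ids()
        return True  # never drop records, only annotate them


# Usage with a stubbed lookup, as a test might do.
record = logging.makeLogRecord({"msg": "summary saved"})
TraceContextFilter(lambda: ("a" * 32, "b" * 16)).filter(record)
```

Attaching the filter to the handler that feeds LogBuffer would let the buffer copy record.trace_id and record.span_id into its entries without knowing where they came from.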
Tests (Planned)
Backend tests must validate that:
- OTel spans are created per gRPC call.
- LogBuffer entries include trace_id/span_id when available.
- Usage events appear as span events with required attributes.
Planned test files:
- tests/grpc/test_observability.py
- tests/infrastructure/test_log_buffer.py
- tests/application/test_summarization_usage.py
Quality Gates
- pytest tests/grpc/test_observability.py
- pytest tests/infrastructure/test_log_buffer.py
- ruff check src/noteflow
- basedpyright
- Must comply with docs/sprints/QUALITY_STANDARDS.md
Notes
- Keep exporters optional; default to local‑only collection unless configured.
- Do not introduce background threads that bypass existing async shutdown hooks.
- If OTel initialization fails, fall back to current LogBuffer behavior.
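A minimal sketch of the fallback rule, assuming configure_observability() returns a boolean and takes a service name and version (the plan fixes neither the signature nor the return type). It covers tracing only; metrics setup and gRPC instrumentation are omitted. No span processor or exporter is registered, which matches the local-only default.

```python
import logging

logger = logging.getLogger(__name__)


def configure_observability(service_name: str = "noteflow", version: str = "0.0.0") -> bool:
    """Try to set up OTel tracing; on any failure, leave logging as it is."""
    try:
        from opentelemetry import trace
        from opentelemetry.sdk.resources import Resource
        from opentelemetry.sdk.trace import TracerProvider

        resource = Resource.create(
            {"service.name": service_name, "service.version": version}
        )
        # No span processor or exporter is added, so nothing leaves the process.
        trace.set_tracer_provider(TracerProvider(resource=resource))
    except Exception:
        # Missing packages or a broken setup must not stop the server.
        logger.warning(
            "OpenTelemetry unavailable; continuing with LogBuffer only",
            exc_info=True,
        )
        return False
    return True


otel_enabled = configure_observability()
```

The caller can use the returned flag to decide whether to install OTel-specific pieces such as the gRPC instrumentation, while LogBuffer keeps working either way.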