
NoteFlow Observability (OpenTelemetry Plan)

Status: Planned (Sprint 15)
Owner: Backend
Goal: Standardize traces, metrics, and logs so usage events are derived consistently and correlate to user actions.


Why OpenTelemetry

  • Single, vendor-neutral instrumentation layer for traces/metrics/logs.
  • Correlation IDs become first-class (trace_id/span_id) across gRPC calls.
  • Usage events can be emitted as span events instead of bespoke emitters.

Existing Assets (Reuse)

  • src/noteflow/infrastructure/logging/log_buffer.py — structured log buffer (existing)
  • src/noteflow/infrastructure/metrics/collector.py — system metrics (existing)
  • src/noteflow/grpc/_mixins/observability.py — logs/metrics RPCs (existing)

These remain the retrieval surface; OpenTelemetry becomes the capture layer.


Planned Components (Sprint 15)

  • src/noteflow/application/observability/ports.py
    • UsageEventSink interface for application-layer usage events (sketched after this list)
  • src/noteflow/infrastructure/observability/otel.py
    • Initializes OTel providers (trace + metrics)
    • Instruments gRPC server
    • Adds resource attributes (service.name, version)
  • src/noteflow/infrastructure/observability/usage.py
    • OtelUsageEventSink attaches span events
  • src/noteflow/grpc/_interceptors/otel.py
    • Ensures correlation IDs propagate into async contexts
  • src/noteflow/infrastructure/logging/context.py
    • Logging filter/record factory to inject trace_id/span_id
  • src/noteflow/infrastructure/logging/log_buffer.py
    • Extend record schema to include trace_id/span_id
  • src/noteflow/application/services/summarization_service.py
    • Persist Summary.tokens_used / Summary.latency_ms and emit usage events

All new code must satisfy docs/sprints/QUALITY_STANDARDS.md.
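
A minimal sketch of the port/adapter pair above, assuming a single record() method and the attribute names listed under Usage Events below; the exact shape is open until the sprint:

# application/observability/ports.py (sketch)
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class UsageEvent:
    name: str                       # e.g. "summarization.completed"
    meeting_id: str
    workspace_id: str
    user_id: str
    tokens_input: int | None = None
    tokens_output: int | None = None
    latency_ms: int | None = None
    provider: str | None = None
    model: str | None = None
    success: bool = True
    error_message: str | None = None


class UsageEventSink(Protocol):
    def record(self, event: UsageEvent) -> None:
        """Record a usage event against the current request."""
        ...


# infrastructure/observability/usage.py (sketch)
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)


class OtelUsageEventSink:
    """Attaches usage events to the active span as span events."""

    def record(self, event: UsageEvent) -> None:
        span = trace.get_current_span()
        if not span.get_span_context().is_valid:
            # Logged per the early-return rule below; there is no span to attach to.
            logger.debug("No active span for usage event %s", event.name)
            return
        # OTel attributes must not be None, so unset fields are dropped.
        attributes = {
            key: value
            for key, value in vars(event).items()
            if key != "name" and value is not None
        }
        span.add_event(event.name, attributes=attributes)

Keeping the sink behind a Protocol lets application services emit usage without importing OpenTelemetry directly.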


Correlation Model

  • Primary: trace_id, span_id (OpenTelemetry)
  • Secondary: request_id, workspace_id, user_id, meeting_id (domain context)
  • Rule: gRPC interceptor sets context once per request; all logs/spans read from context.
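
A sketch of that rule using contextvars and a stdlib logging filter; RequestContext and CorrelationFilter are assumed names, while the field names come from the lists above:

import logging
from contextvars import ContextVar
from dataclasses import dataclass

from opentelemetry import trace


@dataclass(frozen=True)
class RequestContext:
    request_id: str
    workspace_id: str | None = None
    user_id: str | None = None
    meeting_id: str | None = None


# Set once per request by the gRPC interceptor; read by logs and spans afterwards.
current_request: ContextVar[RequestContext | None] = ContextVar("current_request", default=None)


class CorrelationFilter(logging.Filter):
    """Copies trace/span IDs and domain context onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = trace.format_trace_id(ctx.trace_id) if ctx.is_valid else None
        record.span_id = trace.format_span_id(ctx.span_id) if ctx.is_valid else None
        request = current_request.get()
        record.request_id = request.request_id if request else None
        return True  # never drop records, only enrich them

The planned gRPC interceptor would call current_request.set(...) at the start of each RPC, and the filter would be attached to the handlers that feed the LogBuffer.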

Usage Events (Span Events)

Usage is recorded by attaching events to the active span:

  • summarization.completed
  • embedding.completed
  • transcription.completed
  • diarization.completed
  • entity_extraction.completed

Each event should include:

  • meeting_id
  • workspace_id
  • user_id
  • tokens_input, tokens_output
  • latency_ms
  • provider, model
  • success, error_message (optional)

tokens_output can be None when providers only return total tokens. Usage metadata should be persisted on Summary.tokens_used and Summary.latency_ms before saving.
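
A sketch of the service-side flow, assuming the UsageEvent/UsageEventSink sketched under Planned Components and an illustrative provider result object; only Summary.tokens_used and Summary.latency_ms come from this document:

import time


async def run_summarization(service, meeting_id: str):
    """Hypothetical service flow: persist usage metadata, then emit the span event."""
    started = time.monotonic()
    result = await service.provider.summarize(meeting_id)    # provider call (assumed shape)
    latency_ms = int((time.monotonic() - started) * 1000)

    summary = result.summary
    summary.tokens_used = result.tokens_total                 # persisted before save
    summary.latency_ms = latency_ms
    await service.summaries.save(summary)

    service.usage_sink.record(UsageEvent(
        name="summarization.completed",
        meeting_id=meeting_id,
        workspace_id=summary.workspace_id,
        user_id=summary.created_by,
        tokens_input=result.tokens_input,
        tokens_output=result.tokens_output,                   # None when only totals are reported
        latency_ms=latency_ms,
        provider=result.provider,
        model=result.model,
    ))
    return summary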


Wiring Plan (Minimal)

  1. Startup: call configure_observability() in src/noteflow/grpc/server.py (sketched after this list).
  2. gRPC: enable OTel gRPC server instrumentation.
  3. Logging: inject trace_id/span_id into LogBuffer entries via logging filter.
  4. Usage: emit events via UsageEventSink from application services.
  5. Metadata: persist Summary.tokens_used / Summary.latency_ms before save.
  6. Metrics history (optional): start background collection if historical graphs are required.
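
A sketch of steps 1 and 2, assuming opentelemetry-sdk and opentelemetry-instrumentation-grpc are installed; the exporter choice and service name are placeholders and stay local-only per the notes below:

from opentelemetry import trace
from opentelemetry.instrumentation.grpc import GrpcAioInstrumentorServer
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


def configure_observability(service_version: str = "0.0.0") -> None:
    """Initialize tracing and instrument the async gRPC server (steps 1 and 2)."""
    resource = Resource.create({
        "service.name": "noteflow",
        "service.version": service_version,
    })
    provider = TracerProvider(resource=resource)
    # Local-only by default; swap in an OTLP exporter once a collector is configured.
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    # Instruments grpc.aio servers created after this call.
    GrpcAioInstrumentorServer().instrument()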

Tests (Planned)

Backend tests must validate that:

  • OTel spans are created per gRPC call.
  • LogBuffer entries include trace_id/span_id when available.
  • Usage events appear as span events with required attributes.

Planned test files:

  • tests/grpc/test_observability.py
  • tests/infrastructure/test_log_buffer.py
  • tests/application/test_summarization_usage.py
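
A possible starting shape for tests/application/test_summarization_usage.py, using the SDK's in-memory exporter; a real test would drive the planned OtelUsageEventSink, while here the event is attached directly to keep the sketch self-contained:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter


def test_usage_event_recorded_as_span_event() -> None:
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer(__name__)

    with tracer.start_as_current_span("Summarize"):
        trace.get_current_span().add_event(
            "summarization.completed",
            attributes={"meeting_id": "m-1", "tokens_input": 512, "latency_ms": 840},
        )

    (span,) = exporter.get_finished_spans()
    (event,) = span.events
    assert event.name == "summarization.completed"
    assert event.attributes["meeting_id"] == "m-1"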

Quality Gates

  • pytest tests/grpc/test_observability.py
  • pytest tests/infrastructure/test_log_buffer.py
  • ruff check src/noteflow
  • basedpyright
  • Must comply with docs/sprints/QUALITY_STANDARDS.md

Logging Standards

Principles

  1. Never suppress without logging: Replace contextlib.suppress() with explicit try/except
  2. Log early returns: Any function that returns early should log at DEBUG level (see the early-return example below)
  3. Structured logging: Always include context (IDs, operation names)
  4. Appropriate levels:
    • ERROR: Unexpected failures
    • WARNING: Expected failures (validation errors)
    • INFO: Significant state changes
    • DEBUG: Operational flow details

Examples

Before (Bad)

with contextlib.suppress(Exception):
    await risky_operation()

After (Good)

try:
    await risky_operation()
except SpecificError as exc:
    logger.warning("Expected error in operation", error=str(exc))
except Exception as exc:
    logger.error("Unexpected error in operation", error=str(exc), exc_info=True)
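
Early Return (Principle 2)

A hypothetical early-return path logged at DEBUG, in the structured style above; the function, repository, and field names are illustrative only:

async def load_summary(meeting_id: str) -> Summary | None:
    summary = await repository.get_summary(meeting_id)
    if summary is None:
        # Log the skip so the early return stays visible in the LogBuffer.
        logger.debug("No summary for meeting, skipping", meeting_id=meeting_id)
        return None
    return summary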

Notes

  • Keep exporters optional; default to local-only collection unless configured.
  • Do not introduce background threads that bypass existing async shutdown hooks.
  • If OTel initialization fails, fall back to current LogBuffer behavior.
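
For the last point, a minimal guard around the configure_observability() sketched under the wiring plan; a failure only disables OTel and leaves the existing LogBuffer path untouched:

import logging

logger = logging.getLogger(__name__)


def init_observability_or_fallback() -> bool:
    """Enable OTel if possible; telemetry setup must never break server startup."""
    try:
        configure_observability()
        return True
    except Exception:
        # Per the logging standards above: never suppress without logging.
        logger.exception("OTel initialization failed; continuing with LogBuffer only")
        return False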

Connection Troubleshooting

Use this checklist when diagnosing client-to-server connection issues.

Quick Checks

Check                  Command / Location                      Expected
Server running         ps aux | grep noteflow                  Process visible
Server port open       lsof -i :50051                          LISTEN state
Client effective URL   Settings → Server Connection tooltip    Shows URL and source

Environment Variables

Variable                  Default            Purpose
NOTEFLOW_BIND_ADDRESS     0.0.0.0            Server bind address (Python)
NOTEFLOW_SERVER_ADDRESS   127.0.0.1:50051    Client server URL override
NOTEFLOW_GRPC_PORT        50051              Server port
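
A sketch of how the server could combine these values; the variable names and defaults come from the table, the helper itself is hypothetical:

import os


def resolve_listen_address() -> str:
    """Build the gRPC listen address from the environment, with documented defaults."""
    bind = os.environ.get("NOTEFLOW_BIND_ADDRESS", "0.0.0.0")
    port = os.environ.get("NOTEFLOW_GRPC_PORT", "50051")
    return f"{bind}:{port}"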

Common Issues

1. "Connection refused" in Docker

Symptom: Desktop client cannot connect to server running in Docker.

Cause: Server binds to 127.0.0.1 (localhost only) instead of 0.0.0.0 (all interfaces).

Fix:

# Set bind address to all interfaces
export NOTEFLOW_BIND_ADDRESS=0.0.0.0

# Verify Docker port mapping
docker run -p 50051:50051 ...

2. IPv6 vs IPv4 mismatch

Symptom: Connection works on some systems but not others.

Cause: localhost resolves to ::1 (IPv6) on some systems, 127.0.0.1 (IPv4) on others.

Fix: Use explicit IPv4 address 127.0.0.1 instead of localhost.

3. Wrong server URL in client

Symptom: Client connects to wrong server or shows "not connected".

Diagnosis:

  1. Open Settings page
  2. Hover over the info icon next to "Server Connection"
  3. Check the tooltip shows correct URL and source

Source priority:

  1. Environment (NOTEFLOW_SERVER_ADDRESS)
  2. User preferences (Settings page)
  3. Default (127.0.0.1:50051)

4. Firewall blocking connection

Symptom: Server running, port open, but client cannot connect.

Fix (Linux):

sudo ufw allow 50051/tcp

Fix (macOS):

# Check if firewall is blocking
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --listapps

Diagnostic Commands

# Check server is listening
netstat -tlnp | grep 50051

# Test gRPC connectivity (requires grpcurl)
grpcurl -plaintext localhost:50051 list

# Check server logs
docker logs noteflow-server 2>&1 | grep -i "bind\|listen\|address"

# Verify environment
printenv | grep NOTEFLOW

Client-Side Logging

Enable verbose logging in the Tauri client:

RUST_LOG=noteflow_tauri=debug npm run tauri dev

Look for connection-related entries:

  • grpc_connect — connection attempts
  • server_address — resolved server URL
  • connection_error — failure details