- Updated client submodule to commit 3748f0b.
- Added logging standards to `docs/observability.md` to improve error handling and operational transparency.
- Marked implementation checklist items as complete in `README.md` and `IMPLEMENTATION_CHECKLIST.md` for sprint GAP-012.
NoteFlow Observability (OpenTelemetry Plan)
Status: Planned (Sprint 15)
Owner: Backend
Goal: Standardize traces, metrics, and logs so usage events are derived consistently and correlate to user actions.
Why OpenTelemetry
- Single, vendor‑neutral instrumentation layer for traces/metrics/logs.
- Correlation IDs become first‑class (trace_id/span_id) across gRPC calls.
- Usage events can be emitted as span events instead of bespoke emitters.
Existing Assets (Reuse)
- src/noteflow/infrastructure/logging/log_buffer.py: structured log buffer (existing)
- src/noteflow/infrastructure/metrics/collector.py: system metrics (existing)
- src/noteflow/grpc/_mixins/observability.py: logs/metrics RPCs (existing)
These remain the retrieval surface; OpenTelemetry becomes the capture layer.
Planned Components (Sprint 15)
- src/noteflow/application/observability/ports.py
  - UsageEventSink interface for application-layer usage events
- src/noteflow/infrastructure/observability/otel.py
  - Initializes OTel providers (trace + metrics)
  - Instruments gRPC server
  - Adds resource attributes (service.name, version)
- src/noteflow/infrastructure/observability/usage.py
  - OtelUsageEventSink attaches span events
- src/noteflow/grpc/_interceptors/otel.py
  - Ensures correlation IDs propagate into async contexts
- src/noteflow/infrastructure/logging/context.py
  - Logging filter/record factory to inject trace_id/span_id
- src/noteflow/infrastructure/logging/log_buffer.py
  - Extend record schema to include trace_id/span_id
- src/noteflow/application/services/summarization_service.py
  - Persist Summary.tokens_used/Summary.latency_ms and emit usage events
All new code must satisfy docs/sprints/QUALITY_STANDARDS.md.
Correlation Model
- Primary: trace_id, span_id (OpenTelemetry)
- Secondary: request_id, workspace_id, user_id, meeting_id (domain context)
- Rule: gRPC interceptor sets context once per request; all logs/spans read from context.
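The "set once, read everywhere" rule can be sketched with the standard library's contextvars module, which asyncio tasks copy at creation time. This is a minimal illustration, not the planned interceptor: RequestContext, set_request_context, and handle_request are hypothetical names, and only the field names come from the list above.

```python
import asyncio
import contextvars
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestContext:
    """Domain identifiers for one request (field names follow the Secondary list)."""

    request_id: str
    workspace_id: Optional[str] = None
    user_id: Optional[str] = None
    meeting_id: Optional[str] = None


# Hypothetical module-level variable; the real interceptor may store context differently.
_request_context: contextvars.ContextVar[Optional[RequestContext]] = contextvars.ContextVar(
    "request_context", default=None
)


def set_request_context(ctx: RequestContext) -> contextvars.Token:
    """Set the context once per request, e.g. from a gRPC interceptor."""
    return _request_context.set(ctx)


def get_request_context() -> Optional[RequestContext]:
    """Read the context from anywhere below the interceptor."""
    return _request_context.get()


async def _read_request_id() -> Optional[str]:
    ctx = get_request_context()
    return ctx.request_id if ctx else None


async def handle_request(request_id: str) -> Optional[str]:
    """Stand-in for a request handler that spawns a child task."""
    token = set_request_context(RequestContext(request_id=request_id))
    try:
        # Tasks copy the current context when created, so the child sees the same IDs.
        return await asyncio.create_task(_read_request_id())
    finally:
        # Restore the previous value so nothing leaks past the request.
        _request_context.reset(token)
```

Because each task works on its own copy of the context, concurrent requests do not overwrite each other's identifiers.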
Usage Events (Span Events)
Usage is recorded by attaching events to the active span:
- summarization.completed
- embedding.completed
- transcription.completed
- diarization.completed
- entity_extraction.completed
Each event should include:
- meeting_id
- workspace_id
- user_id
- tokens_input, tokens_output
- latency_ms
- provider, model
- success, error_message (optional)

tokens_output can be None when a provider returns only a total token count. Usage metadata should be set on
Summary.tokens_used and Summary.latency_ms before the summary is saved.
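One possible shape for a usage event and its sink is sketched below. Only the UsageEventSink name and the attribute names come from this document; the UsageEvent dataclass, the record method, and the in-memory sink are assumptions made for illustration. The sketch also assumes that None-valued fields (such as tokens_output) should be left out of the attribute set rather than sent as nulls.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Protocol, Tuple, Union

AttributeValue = Union[str, int, float, bool]


@dataclass(frozen=True)
class UsageEvent:
    """Hypothetical event payload; field names mirror the attribute list above."""

    event_name: str
    meeting_id: str
    workspace_id: str
    user_id: str
    provider: str
    model: str
    latency_ms: float
    success: bool
    tokens_input: int
    tokens_output: Optional[int] = None  # None when only a total is reported
    error_message: Optional[str] = None

    def to_attributes(self) -> Dict[str, AttributeValue]:
        """Flatten to span-event attributes, dropping unset (None) fields."""
        raw = asdict(self)
        raw.pop("event_name")
        return {key: value for key, value in raw.items() if value is not None}


class UsageEventSink(Protocol):
    """Assumed port signature; the real interface may differ."""

    def record(self, event: UsageEvent) -> None: ...


class InMemoryUsageEventSink:
    """Test double that stores (name, attributes) pairs instead of touching a span."""

    def __init__(self) -> None:
        self.events: List[Tuple[str, Dict[str, AttributeValue]]] = []

    def record(self, event: UsageEvent) -> None:
        self.events.append((event.event_name, event.to_attributes()))
```

An OpenTelemetry-backed sink could pass the same name and attributes to the active span's add_event method; the in-memory variant keeps application tests free of an SDK dependency.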
Wiring Plan (Minimal)
- Startup: call configure_observability() in src/noteflow/grpc/server.py.
- gRPC: enable OTel gRPC server instrumentation.
- Logging: inject trace_id/span_id into LogBuffer entries via logging filter.
- Usage: emit events via UsageEventSink from application services.
- Metadata: persist Summary.tokens_used/Summary.latency_ms before save.
- Metrics history (optional): start background collection if historical graphs are required.
Tests (Planned)
Backend tests must validate that:
- OTel spans are created per gRPC call.
- LogBuffer entries include trace_id/span_id when available.
- Usage events appear as span events with required attributes.
Planned test files:
- tests/grpc/test_observability.py
- tests/infrastructure/test_log_buffer.py
- tests/application/test_summarization_usage.py
Quality Gates
- pytest tests/grpc/test_observability.py
- pytest tests/infrastructure/test_log_buffer.py
- ruff check src/noteflow
- basedpyright
- Must comply with docs/sprints/QUALITY_STANDARDS.md
Logging Standards
Principles
- Never suppress without logging: Replace contextlib.suppress() with explicit try/except
- Log early returns: Any function that returns early should log at DEBUG level
- Structured logging: Always include context (IDs, operation names)
- Appropriate levels:
  - ERROR: Unexpected failures
  - WARNING: Expected failures (validation errors)
  - INFO: Significant state changes
  - DEBUG: Operational flow details
Examples
Before (Bad)
with contextlib.suppress(Exception):
    await risky_operation()
After (Good)
try:
    await risky_operation()
except SpecificError as exc:
    logger.warning("Expected error in operation", error=str(exc))
except Exception as exc:
    logger.error("Unexpected error in operation", error=str(exc), exc_info=True)
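The "log early returns" principle can be illustrated the same way. The sketch below assumes the standard library logger, where context travels through the extra argument; with a structured logger like the one the examples above appear to use, the same fields would be passed as keyword arguments. The summarize function is hypothetical.

```python
import logging
from typing import Optional

logger = logging.getLogger("noteflow.example.early_return")


def summarize(meeting_id: str, transcript: Optional[str]) -> Optional[str]:
    """Hypothetical function showing a logged early return."""
    if not transcript:
        # Say why nothing happened, and for which meeting, before returning.
        logger.debug(
            "summarize skipped: empty transcript",
            extra={"meeting_id": meeting_id, "operation": "summarize"},
        )
        return None
    return transcript[:40]
```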
Notes
- Keep exporters optional; default to local‑only collection unless configured.
- Do not introduce background threads that bypass existing async shutdown hooks.
- If OTel initialization fails, fall back to current LogBuffer behavior.
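The fallback note could be honored with a guarded initializer like the sketch below. configure_observability is the name used in the wiring plan, but its parameters, return value, and body here are assumptions. The OpenTelemetry imports are deferred so that a missing or broken SDK results in a logged warning and a False return value instead of a startup failure.

```python
import logging

logger = logging.getLogger("noteflow.observability")


def configure_observability(service_name: str = "noteflow", version: str = "0.0.0") -> bool:
    """Return True if OTel tracing was set up, False if the app should rely on LogBuffer only."""
    try:
        # Imported lazily so a missing SDK is handled like any other init failure.
        from opentelemetry import trace
        from opentelemetry.sdk.resources import Resource
        from opentelemetry.sdk.trace import TracerProvider

        resource = Resource.create(
            {"service.name": service_name, "service.version": version}
        )
        # No exporter is attached here, so spans stay local unless one is configured.
        trace.set_tracer_provider(TracerProvider(resource=resource))
        return True
    except Exception:
        # Log instead of suppressing, then continue with existing behavior.
        logger.warning(
            "OpenTelemetry initialization failed; falling back to LogBuffer only",
            exc_info=True,
        )
        return False
```

The caller can use the boolean to decide whether to install OTel-dependent pieces such as the gRPC instrumentation.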
Connection Troubleshooting
Use this checklist when diagnosing client-to-server connection issues.
Quick Checks
| Check | Command / Location | Expected |
|---|---|---|
| Server running | ps aux \| grep noteflow | Process visible |
| Server port open | lsof -i :50051 | LISTEN state |
| Client effective URL | Settings → Server Connection tooltip | Shows URL and source |
Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| NOTEFLOW_BIND_ADDRESS | 0.0.0.0 | Server bind address (Python) |
| NOTEFLOW_SERVER_ADDRESS | 127.0.0.1:50051 | Client server URL override |
| NOTEFLOW_GRPC_PORT | 50051 | Server port |
Common Issues
1. "Connection refused" in Docker
Symptom: Desktop client cannot connect to server running in Docker.
Cause: Server binds to 127.0.0.1 (localhost only) instead of 0.0.0.0 (all interfaces).
Fix:
# Set bind address to all interfaces
export NOTEFLOW_BIND_ADDRESS=0.0.0.0
# Verify Docker port mapping
docker run -p 50051:50051 ...
2. IPv6 vs IPv4 mismatch
Symptom: Connection works on some systems but not others.
Cause: localhost resolves to ::1 (IPv6) on some systems, 127.0.0.1 (IPv4) on others.
Fix: Use explicit IPv4 address 127.0.0.1 instead of localhost.
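To see which address families a host name resolves to on a given machine, a small standard-library check such as the following can help. The resolved_families helper is hypothetical, and the output for localhost depends on the system's resolver configuration.

```python
import socket
from typing import List


def resolved_families(host: str, port: int) -> List[str]:
    """Return address family names (e.g. 'AF_INET', 'AF_INET6') in resolver order."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    families: List[str] = []
    for family, _socktype, _proto, _canonname, _sockaddr in infos:
        name = socket.AddressFamily(family).name
        if name not in families:
            families.append(name)
    return families


if __name__ == "__main__":
    # An explicit IPv4 literal resolves to AF_INET only; localhost may differ per system.
    print("127.0.0.1 ->", resolved_families("127.0.0.1", 50051))
    print("localhost ->", resolved_families("localhost", 50051))
```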
3. Wrong server URL in client
Symptom: Client connects to wrong server or shows "not connected".
Diagnosis:
- Open Settings page
- Hover over the info icon next to "Server Connection"
- Check the tooltip shows correct URL and source
Source priority:
- Environment (NOTEFLOW_SERVER_ADDRESS)
- User preferences (Settings page)
- Default (127.0.0.1:50051)
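The priority order can be expressed as a small resolution function. This Python sketch only illustrates the ordering; it is not the client's actual code, and resolve_server_address along with its source labels are hypothetical. It also assumes that an empty or whitespace-only value counts as unset.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_SERVER_ADDRESS = "127.0.0.1:50051"


def resolve_server_address(
    preference: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, str]:
    """Return (address, source) using environment > preferences > default."""
    env_value = env.get("NOTEFLOW_SERVER_ADDRESS", "").strip()
    if env_value:
        return env_value, "environment"
    if preference and preference.strip():
        return preference.strip(), "preferences"
    return DEFAULT_SERVER_ADDRESS, "default"
```

Returning the source alongside the address mirrors what the Settings tooltip reports, which makes a mismatch between the expected and effective URL easier to trace.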
4. Firewall blocking connection
Symptom: Server running, port open, but client cannot connect.
Fix (Linux):
sudo ufw allow 50051/tcp
Fix (macOS):
# Check if firewall is blocking
sudo /usr/libexec/ApplicationFirewall/socketfilterfw --listapps
Diagnostic Commands
# Check server is listening
netstat -tlnp | grep 50051
# Test gRPC connectivity (requires grpcurl)
grpcurl -plaintext 127.0.0.1:50051 list
# Check server logs
docker logs noteflow-server 2>&1 | grep -i "bind\|listen\|address"
# Verify environment
printenv | grep NOTEFLOW
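Where netstat or grpcurl is not installed, a plain TCP reachability check can be done from Python. The helper below is a hypothetical sketch: it confirms only that something accepts connections on the port, not that the gRPC service behind it is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("50051 reachable:", is_port_open("127.0.0.1", 50051))
```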
Client-Side Logging
Enable verbose logging in the Tauri client:
RUST_LOG=noteflow_tauri=debug npm run tauri dev
Look for connection-related entries:
- grpc_connect: connection attempts
- server_address: resolved server URL
- connection_error: failure details