noteflow/docs/sprints/phase-gaps/sprint-gap-003-error-handling.md
2026-01-02 04:22:40 +00:00


# SPRINT-GAP-003: Error Handling Mismatches

| Attribute     | Value      |
| ------------- | ---------- |
| Sprint        | GAP-003    |
| Size          | M (Medium) |
| Owner         | TBD        |
| Phase         | Hardening  |
| Prerequisites | None       |

## Open Issues

- Define error severity levels for user-facing vs. silent errors
- Decide on an error aggregation strategy (toast vs. panel vs. inline)
- Determine retry eligibility by error type

## Validation Status

| Component             | Exists  | Needs Work   |
| --------------------- | ------- | ------------ |
| gRPC error helpers    | Yes     | Complete     |
| Domain error mapping  | Yes     | Complete     |
| Client error emission | Partial | Inconsistent |
| Error recovery UI     | No      | Required     |

## Objective

Establish consistent error handling across the backend-client boundary, ensuring all errors are appropriately surfaced, logged, and actionable while maintaining system stability.

## Key Decisions

| Decision             | Choice                       | Rationale                                      |
| -------------------- | ---------------------------- | ---------------------------------------------- |
| Silenced errors      | Eliminate `.catch(() => {})` | Hidden failures cause debugging nightmares     |
| Error classification | Retry/Fatal/Transient        | Different handling for each type               |
| User notification    | Severity-based               | Critical → toast, Warning → inline, Info → log |
| Webhook failures     | Log + metrics only           | Fire-and-forget by design, but with visibility |

## What Already Exists

### Backend Error Infrastructure (`src/noteflow/grpc/_mixins/errors.py`)

- `abort_not_found()`, `abort_invalid_argument()`, etc.
- `DomainError` → gRPC status mapping
- `domain_error_handler` decorator
- Feature requirement helpers (`require_feature_*`)

### Client Error Patterns

- `emit_error()` in Tauri commands
- `ErrorEvent` type for Tauri events
- Connection error state tracking

## Identified Issues

### 1. Silenced Errors Pattern (Critical)

Multiple locations use `.catch(() => {})` to ignore errors:

Location 1: `client/src/api/tauri-adapter.ts:107`

```typescript
this.invoke(TauriCommands.SEND_AUDIO_CHUNK, args).catch((_err) => {});
```

Location 2: `client/src/api/tauri-adapter.ts:124`

```typescript
this.invoke(TauriCommands.STOP_RECORDING, { meeting_id: this.meetingId }).catch((_err) => {});
```

Location 3: `client/src/api/index.ts:51`

```typescript
await startTauriEventBridge().catch((_error) => {});
```

Location 4: `client/src/lib/preferences.ts:234`

```typescript
validateCachedIntegrations().catch(() => {});
```

Location 5: `client/src/contexts/project-context.tsx:134`

```typescript
.catch(() => {});
```

Location 6: `client/src/api/cached-adapter.ts:134`

```typescript
await startTauriEventBridge().catch(() => {});
```

Problem: errors are silently discarded:

- No logging for debugging
- No user notification for actionable errors
- No metrics for monitoring
- Failures appear as successes to calling code
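The document's Phase 1 plan is to replace these silent catches with a logging handler. A minimal sketch of the shape such a helper could take (the `logError` name matches the migration plan below, but this body is illustrative, not existing code):

```typescript
// Hypothetical logError helper: returns a catch-handler that records the
// failure with context instead of discarding it. The promise chain still
// resolves, so callers keep the fire-and-forget behavior they rely on.
function logError(context: string) {
  return (err: unknown): void => {
    const message = err instanceof Error ? err.message : String(err);
    // A real implementation would route this to the app logger / metrics
    // pipeline rather than the console.
    console.error(`[${context}]`, message);
  };
}

// Usage: the silent catch becomes a logged, still non-throwing one.
Promise.reject(new Error("bridge failed")).catch(logError("startTauriEventBridge"));
```

Because the handler never rethrows, this change is behavior-preserving for callers while making every previously hidden failure visible in logs.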

### 2. Inconsistent Error Emission (Medium)

Location: `client/src-tauri/src/commands/diarization.rs:51-54`

```rust
Err(err) => {
    emit_error(&app, "diarization_error", &err);
    Err(err)
}
```

Problem: some commands emit errors while others don't:

- `refine_speaker_diarization` emits errors
- `rename_speaker` doesn't emit (lines 81-86)
- Inconsistent user experience

### 3. Webhook Failures Invisible (Low - By Design)

Location: `src/noteflow/grpc/_mixins/meeting.py:187-192`

```python
try:
    await self._webhook_service.trigger_recording_stopped(...)
except Exception:
    logger.exception("Failed to trigger recording.stopped webhooks")
```

Problem: fire-and-forget is the right design for webhooks, but:

- No metrics are emitted for the failure rate
- No visibility into webhook health
- Failures are only visible in logs

### 4. Error Type Information Lost (Medium)

Problem: gRPC errors are well-structured on the backend, but the client often reduces them to a plain string:

Backend:

```python
await context.abort(grpc.StatusCode.NOT_FOUND, f"Meeting {meeting_id} not found")
```

Client:

```rust
Err(err) => {
    emit_error(&app, "diarization_error", &err);  // Just the error string; no status code
    ...
}
```

Lost information:

- gRPC status code (`NOT_FOUND`, `INVALID_ARGUMENT`, etc.)
- Whether the error is retryable
- Error category for appropriate UI handling

## Scope

### Task Breakdown

| Task                                    | Effort | Description                                |
| --------------------------------------- | ------ | ------------------------------------------ |
| Replace `.catch(() => {})` with logging | S      | Log all caught errors with context         |
| Add error classification                | M      | Categorize errors as Retry/Fatal/Transient |
| Preserve gRPC status in client          | M      | Include status code in error events        |
| Add error metrics                       | S      | Emit metrics for error rates by type       |
| Consistent error emission               | S      | All Tauri commands emit errors             |
| Webhook failure metrics                 | S      | Track webhook delivery success/failure     |

## Files to Modify

Client:

- `client/src/api/tauri-adapter.ts` - Replace silent catches
- `client/src/api/index.ts` - Log event bridge errors
- `client/src/api/cached-adapter.ts` - Log event bridge errors
- `client/src/lib/preferences.ts` - Surface validation errors
- `client/src/contexts/project-context.tsx` - Handle errors
- `client/src-tauri/src/commands/*.rs` - Consistent error emission
- `client/src/types/errors.ts` - Add error classification

Backend:

- `src/noteflow/grpc/_mixins/meeting.py` - Add webhook metrics
- `src/noteflow/application/services/webhook_service.py` - Add metrics

## Error Classification Schema

```typescript
enum ErrorSeverity {
  Fatal = 'fatal',         // Unrecoverable; show dialog
  Retryable = 'retryable', // Can retry; show with retry button
  Transient = 'transient', // Temporary; auto-retry
  Warning = 'warning',     // Show inline; don't block
  Info = 'info',           // Log only; no UI
}

enum ErrorCategory {
  Network = 'network',
  Auth = 'auth',
  Validation = 'validation',
  NotFound = 'not_found',
  Server = 'server',
  Client = 'client',
}

interface ClassifiedError {
  message: string;
  code: string;
  severity: ErrorSeverity;
  category: ErrorCategory;
  retryable: boolean;
  grpcStatus?: number;
  context?: Record<string, unknown>;
}
```
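The schema implies a mapping from gRPC status codes to severity and category (the "gRPC status → severity mapping" deliverable). A hedged sketch of what that classifier could look like; the `classifyGrpcStatus` helper, its table, and the chosen severities are illustrative assumptions, though the status numbers follow the gRPC specification (`INVALID_ARGUMENT` = 3, `NOT_FOUND` = 5, `UNAVAILABLE` = 14, etc.):

```typescript
type Severity = 'fatal' | 'retryable' | 'transient' | 'warning' | 'info';
type Category = 'network' | 'auth' | 'validation' | 'not_found' | 'server' | 'client';

interface Classification {
  severity: Severity;
  category: Category;
  retryable: boolean;
}

// Hypothetical status → classification table, keyed by numeric gRPC code.
const GRPC_CLASSIFICATION: Record<number, Classification> = {
  3:  { severity: 'warning',   category: 'validation', retryable: false }, // INVALID_ARGUMENT
  4:  { severity: 'retryable', category: 'network',    retryable: true  }, // DEADLINE_EXCEEDED
  5:  { severity: 'warning',   category: 'not_found',  retryable: false }, // NOT_FOUND
  7:  { severity: 'fatal',     category: 'auth',       retryable: false }, // PERMISSION_DENIED
  13: { severity: 'fatal',     category: 'server',     retryable: false }, // INTERNAL
  14: { severity: 'retryable', category: 'network',    retryable: true  }, // UNAVAILABLE
  16: { severity: 'fatal',     category: 'auth',       retryable: false }, // UNAUTHENTICATED
};

function classifyGrpcStatus(status: number): Classification {
  // Unknown codes default to a fatal server error so nothing is silently dropped.
  return GRPC_CLASSIFICATION[status] ?? { severity: 'fatal', category: 'server', retryable: false };
}
```

Keeping the table in one place means the Tauri commands only need to forward the numeric status; severity and retryability are derived on the client side.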

## Migration Strategy

### Phase 1: Add Logging (Low Risk)

- Replace `.catch(() => {})` with `.catch(logError)`
- No user-facing changes
- Immediate debugging improvement

### Phase 2: Add Classification (Low Risk)

- Classify errors at the emission point
- Preserve the gRPC status code
- Backward compatible

### Phase 3: Consistent Emission (Medium Risk)

- All Tauri commands emit errors
- UI handles the new error events
- May surface previously hidden errors

### Phase 4: User-Facing Improvements (Medium Risk)

- Add retry buttons for retryable errors
- Add an error aggregation panel
- May change the user experience

## Deliverables

### Backend

- Webhook delivery metrics (success/failure/retry counts)
- Structured error events for observability

### Client

- Zero `.catch(() => {})` patterns
- Error classification utility
- Consistent error emission from all Tauri commands
- Error logging with context
- Metrics emission for error rates

### Shared

- Error classification schema documentation
- gRPC status → severity mapping

### Tests

- Unit test: error classification logic
- Integration test: end-to-end error propagation
- E2E test: error UI rendering

## Test Strategy

### Fixtures

- Mock server that returns various error codes
- Network failure simulation
- Invalid request payloads

### Test Cases

| Case             | Input                    | Expected                                         |
| ---------------- | ------------------------ | ------------------------------------------------ |
| gRPC NOT_FOUND   | Get non-existent meeting | Error with severity=Warning, category=NotFound   |
| gRPC UNAVAILABLE | Server down              | Error with severity=Retryable, category=Network  |
| gRPC INTERNAL    | Server bug               | Error with severity=Fatal, category=Server       |
| Network timeout  | Slow response            | Error with severity=Retryable, category=Network  |
| Validation error | Invalid UUID             | Error with severity=Warning, category=Validation |
| Webhook failure  | HTTP 500 from webhook    | Logged, metric emitted, no user error            |
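The gRPC rows of the table above translate directly into unit tests for the classification logic. A sketch of the assertion shape, with the classifier stubbed inline for illustration (a real test would import the actual utility):

```typescript
// Inline stub of the classifier under test, keyed by numeric gRPC status
// code (NOT_FOUND = 5, INTERNAL = 13, UNAVAILABLE = 14). Illustrative only.
const table: Record<number, { severity: string; category: string }> = {
  5:  { severity: 'warning',   category: 'not_found' },
  13: { severity: 'fatal',     category: 'server' },
  14: { severity: 'retryable', category: 'network' },
};
const classify = (status: number) => table[status];

// Assertions mirror the test-case table.
const notFound = classify(5);
if (notFound?.severity !== 'warning' || notFound.category !== 'not_found')
  throw new Error('NOT_FOUND case failed');

const unavailable = classify(14);
if (unavailable?.severity !== 'retryable' || unavailable.category !== 'network')
  throw new Error('UNAVAILABLE case failed');

const internal = classify(13);
if (internal?.severity !== 'fatal' || internal.category !== 'server')
  throw new Error('INTERNAL case failed');
```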

## Quality Gates

- Zero `.catch(() => {})` in production code
- All errors have a classification
- All Tauri commands emit errors consistently
- Error metrics visible in the observability dashboard
- gRPC status preserved in client errors
- Tests cover all error categories
- Documentation for error classification