noteflow/docs/sprints/phase-gaps/sprint-gap-003-error-handling.md
2026-01-02 04:22:40 +00:00


# SPRINT-GAP-003: Error Handling Mismatches
| Attribute | Value |
|-----------|-------|
| **Sprint** | GAP-003 |
| **Size** | M (Medium) |
| **Owner** | TBD |
| **Phase** | Hardening |
| **Prerequisites** | None |
## Open Issues
- [ ] Define error severity levels for user-facing vs silent errors
- [ ] Decide on error aggregation strategy (toast vs panel vs inline)
- [ ] Determine retry eligibility by error type
## Validation Status
| Component | Exists | Status |
|-----------|--------|--------|
| gRPC error helpers | Yes | Complete |
| Domain error mapping | Yes | Complete |
| Client error emission | Partial | Inconsistent |
| Error recovery UI | No | Required |
## Objective
Establish consistent error handling across the backend-client boundary, ensuring all errors are appropriately surfaced, logged, and actionable while maintaining system stability.
## Key Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Silenced errors | Eliminate `.catch(() => {})` | Hidden failures cause debugging nightmares |
| Error classification | Retry/Fatal/Transient | Different handling for each type |
| User notification | Severity-based | Critical → toast, Warning → inline, Info → log |
| Webhook failures | Log + metrics only | Fire-and-forget by design, but with visibility |
## What Already Exists
### Backend Error Infrastructure (`src/noteflow/grpc/_mixins/errors.py`)
- `abort_not_found()`, `abort_invalid_argument()`, etc.
- `DomainError` → gRPC status mapping
- `domain_error_handler` decorator
- Feature requirement helpers (`require_feature_*`)
### Client Error Patterns
- `emit_error()` in Tauri commands
- `ErrorEvent` type for Tauri events
- Connection error state tracking
## Identified Issues
### 1. Silenced Errors Pattern (Critical)
Six locations ignore errors with `.catch(() => {})` or an equivalent no-op handler:
**Location 1**: `client/src/api/tauri-adapter.ts:107`
```typescript
this.invoke(TauriCommands.SEND_AUDIO_CHUNK, args).catch((_err) => {});
```
**Location 2**: `client/src/api/tauri-adapter.ts:124`
```typescript
this.invoke(TauriCommands.STOP_RECORDING, { meeting_id: this.meetingId }).catch((_err) => {});
```
**Location 3**: `client/src/api/index.ts:51`
```typescript
await startTauriEventBridge().catch((_error) => {});
```
**Location 4**: `client/src/lib/preferences.ts:234`
```typescript
validateCachedIntegrations().catch(() => {});
```
**Location 5**: `client/src/contexts/project-context.tsx:134`
```typescript
.catch(() => {});
```
**Location 6**: `client/src/api/cached-adapter.ts:134`
```typescript
await startTauriEventBridge().catch(() => {});
```
**Problem**: Errors are silently discarded:
- No logging for debugging
- No user notification for actionable errors
- No metrics for monitoring
- Failures appear as success to calling code
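Phase 1 of the migration replaces these no-op handlers with `.catch(logError)`. A minimal sketch of what such a helper could look like (the name `logError`, the context label, and the console transport are illustrative; a real implementation would route through the app's logger and metrics pipeline):

```typescript
// Sketch of a logError helper to replace `.catch(() => {})`.
// Returns a rejection handler bound to a context label so the log
// line identifies which fire-and-forget call failed.
function logError(context: string): (err: unknown) => void {
  return (err: unknown) => {
    const message = err instanceof Error ? err.message : String(err);
    console.error(`[${context}]`, message);
    // Hook for metrics emission, e.g. incrementErrorCounter(context)
  };
}

// Usage: the call stays fire-and-forget, but failures become visible.
// this.invoke(TauriCommands.SEND_AUDIO_CHUNK, args)
//   .catch(logError('send_audio_chunk'));
```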
### 2. Inconsistent Error Emission (Medium)
**Location**: `client/src-tauri/src/commands/diarization.rs:51-54`
```rust
Err(err) => {
    emit_error(&app, "diarization_error", &err);
    Err(err)
}
```
**Problem**: Some commands emit errors, others don't:
- `refine_speaker_diarization` emits errors
- `rename_speaker` doesn't emit errors (lines 81-86)
- Inconsistent user experience
### 3. Webhook Failures Invisible (Low - By Design)
**Location**: `src/noteflow/grpc/_mixins/meeting.py:187-192`
```python
try:
    await self._webhook_service.trigger_recording_stopped(...)
except Exception:
    logger.exception("Failed to trigger recording.stopped webhooks")
```
**Problem**: Fire-and-forget is correct for webhooks, but:
- No metrics emitted for failure rate
- No visibility into webhook health
- Failures only visible in logs
### 4. Error Type Information Lost (Medium)
**Problem**: gRPC errors are well-structured on the backend, but the client often reduces them to a bare string:
Backend:
```python
await context.abort(grpc.StatusCode.NOT_FOUND, f"Meeting {meeting_id} not found")
```
Client:
```rust
Err(err) => {
    // The structured error is collapsed into a string; the status code is lost
    emit_error(&app, "diarization_error", &err);
    ...
}
```
Lost information:
- gRPC status code (NOT_FOUND, INVALID_ARGUMENT, etc.)
- Whether error is retryable
- Error category for appropriate UI handling
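One way to keep this information on the wire is a structured event payload instead of a bare string. The field names below are hypothetical, not the existing `ErrorEvent` shape:

```typescript
// Hypothetical structured payload for Tauri error events that preserves
// the gRPC status instead of collapsing the error to a string.
interface ErrorEventPayload {
  source: string;       // e.g. 'diarization'
  message: string;
  grpcStatus?: number;  // standard gRPC codes: NOT_FOUND = 5, UNAVAILABLE = 14, ...
  retryable: boolean;
}

// Example: what a backend NOT_FOUND might look like on the client.
const example: ErrorEventPayload = {
  source: 'diarization',
  message: 'Meeting 123 not found',
  grpcStatus: 5,
  retryable: false,
};
```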
## Scope
### Task Breakdown
| Task | Effort | Description |
|------|--------|-------------|
| Replace `.catch(() => {})` with logging | S | Log all caught errors with context |
| Add error classification | M | Categorize errors as Retry/Fatal/Transient |
| Preserve gRPC status in client | M | Include status code in error events |
| Add error metrics | S | Emit metrics for error rates by type |
| Consistent error emission | S | All Tauri commands emit errors |
| Webhook failure metrics | S | Track webhook delivery success/failure |
### Files to Modify
**Client:**
- `client/src/api/tauri-adapter.ts` - Replace silent catches
- `client/src/api/index.ts` - Log event bridge errors
- `client/src/api/cached-adapter.ts` - Log event bridge errors
- `client/src/lib/preferences.ts` - Surface validation errors
- `client/src/contexts/project-context.tsx` - Handle errors
- `client/src-tauri/src/commands/*.rs` - Consistent error emission
- `client/src/types/errors.ts` - Add error classification
**Backend:**
- `src/noteflow/grpc/_mixins/meeting.py` - Add webhook metrics
- `src/noteflow/application/services/webhook_service.py` - Add metrics
## Error Classification Schema
```typescript
enum ErrorSeverity {
  Fatal = 'fatal',         // Unrecoverable, show dialog
  Retryable = 'retryable', // Can retry, show with retry button
  Transient = 'transient', // Temporary, auto-retry
  Warning = 'warning',     // Show inline, don't block
  Info = 'info',           // Log only, no UI
}

enum ErrorCategory {
  Network = 'network',
  Auth = 'auth',
  Validation = 'validation',
  NotFound = 'not_found',
  Server = 'server',
  Client = 'client',
}

interface ClassifiedError {
  message: string;
  code: string;
  severity: ErrorSeverity;
  category: ErrorCategory;
  retryable: boolean;
  grpcStatus?: number;
  context?: Record<string, unknown>;
}
```
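A sketch of the gRPC status → classification mapping the schema implies. The status numbers follow the standard gRPC codes; the function name and mapping choices are illustrative (they mirror the test-case table in this document), and the schema types are restated as string unions so the sketch stands alone:

```typescript
// Minimal restatement of the schema above so this sketch is self-contained.
type ErrorSeverity = 'fatal' | 'retryable' | 'transient' | 'warning' | 'info';
type ErrorCategory = 'network' | 'auth' | 'validation' | 'not_found' | 'server' | 'client';
interface ClassifiedError {
  message: string;
  code: string;
  severity: ErrorSeverity;
  category: ErrorCategory;
  retryable: boolean;
  grpcStatus?: number;
}

// Standard gRPC status code numbers (per the gRPC specification).
const GRPC = {
  INVALID_ARGUMENT: 3,
  DEADLINE_EXCEEDED: 4,
  NOT_FOUND: 5,
  INTERNAL: 13,
  UNAVAILABLE: 14,
} as const;

// Hypothetical classifier: maps a gRPC status to severity/category.
function classifyGrpcError(status: number, message: string): ClassifiedError {
  switch (status) {
    case GRPC.NOT_FOUND:
      return { message, code: 'not_found', severity: 'warning', category: 'not_found', retryable: false, grpcStatus: status };
    case GRPC.INVALID_ARGUMENT:
      return { message, code: 'validation', severity: 'warning', category: 'validation', retryable: false, grpcStatus: status };
    case GRPC.UNAVAILABLE:
    case GRPC.DEADLINE_EXCEEDED:
      return { message, code: 'network', severity: 'retryable', category: 'network', retryable: true, grpcStatus: status };
    default:
      return { message, code: 'server', severity: 'fatal', category: 'server', retryable: false, grpcStatus: status };
  }
}
```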
## Migration Strategy
### Phase 1: Add Logging (Low Risk)
- Replace `.catch(() => {})` with `.catch(logError)`
- No user-facing changes
- Immediate debugging improvement
### Phase 2: Add Classification (Low Risk)
- Classify errors at emission point
- Add gRPC status preservation
- Backward compatible
### Phase 3: Consistent Emission (Medium Risk)
- All Tauri commands emit errors
- UI handles new error events
- May surface previously hidden errors
### Phase 4: User-Facing Improvements (Medium Risk)
- Add retry buttons for retryable errors
- Add error aggregation panel
- May change user experience
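The auto-retry behavior for transient errors could be wrapped as a small helper. This is a sketch only: the `isTransient` predicate, attempt count, and backoff delays are placeholders, not existing code:

```typescript
// Sketch of an auto-retry wrapper for transient errors with
// exponential backoff (100ms, 200ms, 400ms, ...).
async function withRetry<T>(
  op: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      // Non-transient errors and exhausted budgets propagate immediately.
      if (!isTransient(err) || attempt === maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** (attempt - 1)));
    }
  }
  throw lastErr;
}
```

Fatal and warning-level errors would bypass this wrapper entirely and go straight to the dialog or inline UI.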
## Deliverables
### Backend
- [ ] Add webhook delivery metrics (success/failure/retry counts)
- [ ] Emit structured error events for observability
### Client
- [ ] Zero `.catch(() => {})` patterns
- [ ] Error classification utility
- [ ] Consistent error emission from all Tauri commands
- [ ] Error logging with context
- [ ] Metrics emission for error rates
### Shared
- [ ] Error classification schema documentation
- [ ] gRPC status → severity mapping
### Tests
- [ ] Unit test: error classification logic
- [ ] Integration test: error propagation end-to-end
- [ ] E2E test: error UI rendering
## Test Strategy
### Fixtures
- Mock server that returns various error codes
- Network failure simulation
- Invalid request payloads
### Test Cases
| Case | Input | Expected |
|------|-------|----------|
| gRPC NOT_FOUND | Get non-existent meeting | Error with severity=Warning, category=NotFound |
| gRPC UNAVAILABLE | Server down | Error with severity=Retryable, category=Network |
| gRPC INTERNAL | Server bug | Error with severity=Fatal, category=Server |
| Network timeout | Slow response | Error with severity=Retryable, category=Network |
| Validation error | Invalid UUID | Error with severity=Warning, category=Validation |
| Webhook failure | HTTP 500 from webhook | Logged, metric emitted, no user error |
## Quality Gates
- [ ] Zero `.catch(() => {})` in production code
- [ ] All errors have classification
- [ ] All Tauri commands emit errors consistently
- [ ] Error metrics visible in observability dashboard
- [ ] gRPC status preserved in client errors
- [ ] Tests cover all error categories
- [ ] Documentation for error classification