noteflow/docs/sprints/phase-gaps/sprint-gap-003-error-handling.md
2026-01-02 04:22:40 +00:00


# SPRINT-GAP-003: Error Handling Mismatches
| Attribute | Value |
|-----------|-------|
| **Sprint** | GAP-003 |
| **Size** | M (Medium) |
| **Owner** | TBD |
| **Phase** | Hardening |
| **Prerequisites** | None |
## Open Issues
- [ ] Define error severity levels for user-facing vs silent errors
- [ ] Decide on error aggregation strategy (toast vs panel vs inline)
- [ ] Determine retry eligibility by error type
## Validation Status
| Component | Exists | Status |
|-----------|--------|--------|
| gRPC error helpers | Yes | Complete |
| Domain error mapping | Yes | Complete |
| Client error emission | Partial | Inconsistent |
| Error recovery UI | No | Required |
## Objective
Establish consistent error handling across the backend-client boundary, ensuring all errors are appropriately surfaced, logged, and actionable while maintaining system stability.
## Key Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Silenced errors | Eliminate `.catch(() => {})` | Hidden failures cause debugging nightmares |
| Error classification | Retry/Fatal/Transient | Different handling for each type |
| User notification | Severity-based | Critical → toast, Warning → inline, Info → log |
| Webhook failures | Log + metrics only | Fire-and-forget by design, but with visibility |
## What Already Exists
### Backend Error Infrastructure (`src/noteflow/grpc/_mixins/errors.py`)
- `abort_not_found()`, `abort_invalid_argument()`, etc.
- `DomainError` → gRPC status mapping
- `domain_error_handler` decorator
- Feature requirement helpers (`require_feature_*`)
### Client Error Patterns
- `emit_error()` in Tauri commands
- `ErrorEvent` type for Tauri events
- Connection error state tracking
## Identified Issues
### 1. Silenced Errors Pattern (Critical)
Six locations ignore errors with `.catch(() => {})` or an equivalent no-op handler:
**Location 1**: `client/src/api/tauri-adapter.ts:107`
```typescript
this.invoke(TauriCommands.SEND_AUDIO_CHUNK, args).catch((_err) => {});
```
**Location 2**: `client/src/api/tauri-adapter.ts:124`
```typescript
this.invoke(TauriCommands.STOP_RECORDING, { meeting_id: this.meetingId }).catch((_err) => {});
```
**Location 3**: `client/src/api/index.ts:51`
```typescript
await startTauriEventBridge().catch((_error) => {});
```
**Location 4**: `client/src/lib/preferences.ts:234`
```typescript
validateCachedIntegrations().catch(() => {});
```
**Location 5**: `client/src/contexts/project-context.tsx:134`
```typescript
.catch(() => {});
```
**Location 6**: `client/src/api/cached-adapter.ts:134`
```typescript
await startTauriEventBridge().catch(() => {});
```
**Problem**: Errors are silently discarded:
- No logging for debugging
- No user notification for actionable errors
- No metrics for monitoring
- Failures appear as success to calling code
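Phase 1 of the migration replaces these no-op handlers with `.catch(logError)`. A minimal sketch of what such a helper could look like (the name `logError`, the context label, and the console transport are illustrative; a real implementation would route through the app's logger and metrics pipeline):

```typescript
// Sketch of a logError helper to replace `.catch(() => {})`.
// Returns a rejection handler bound to a context label so the log
// line identifies which fire-and-forget call failed.
function logError(context: string): (err: unknown) => void {
  return (err: unknown) => {
    const message = err instanceof Error ? err.message : String(err);
    console.error(`[${context}]`, message);
    // Hook for metrics emission, e.g. incrementErrorCounter(context)
  };
}

// Usage: the call stays fire-and-forget, but failures become visible.
// this.invoke(TauriCommands.SEND_AUDIO_CHUNK, args)
//   .catch(logError('send_audio_chunk'));
```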
### 2. Inconsistent Error Emission (Medium)
**Location**: `client/src-tauri/src/commands/diarization.rs:51-54`
```rust
Err(err) => {
    emit_error(&app, "diarization_error", &err);
    Err(err)
}
```
**Problem**: Some commands emit errors, others don't:
- `refine_speaker_diarization` emits errors
- `rename_speaker` doesn't emit errors (lines 81-86)
- Inconsistent user experience
### 3. Webhook Failures Invisible (Low - By Design)
**Location**: `src/noteflow/grpc/_mixins/meeting.py:187-192`
```python
try:
    await self._webhook_service.trigger_recording_stopped(...)
except Exception:
    logger.exception("Failed to trigger recording.stopped webhooks")
```
**Problem**: Fire-and-forget is correct for webhooks, but:
- No metrics emitted for failure rate
- No visibility into webhook health
- Failures only visible in logs
### 4. Error Type Information Lost (Medium)
**Problem**: gRPC errors are well-structured on the backend, but the client often reduces them to a bare string:
Backend:
```python
await context.abort(grpc.StatusCode.NOT_FOUND, f"Meeting {meeting_id} not found")
```
Client:
```rust
Err(err) => {
    // The structured error is collapsed into a string; the status code is lost
    emit_error(&app, "diarization_error", &err);
    ...
}
```
Lost information:
- gRPC status code (NOT_FOUND, INVALID_ARGUMENT, etc.)
- Whether error is retryable
- Error category for appropriate UI handling
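One way to keep this information on the wire is a structured event payload instead of a bare string. The field names below are hypothetical, not the existing `ErrorEvent` shape:

```typescript
// Hypothetical structured payload for Tauri error events that preserves
// the gRPC status instead of collapsing the error to a string.
interface ErrorEventPayload {
  source: string;       // e.g. 'diarization'
  message: string;
  grpcStatus?: number;  // standard gRPC codes: NOT_FOUND = 5, UNAVAILABLE = 14, ...
  retryable: boolean;
}

// Example: what a backend NOT_FOUND might look like on the client.
const example: ErrorEventPayload = {
  source: 'diarization',
  message: 'Meeting 123 not found',
  grpcStatus: 5,
  retryable: false,
};
```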
## Scope
### Task Breakdown
| Task | Effort | Description |
|------|--------|-------------|
| Replace `.catch(() => {})` with logging | S | Log all caught errors with context |
| Add error classification | M | Categorize errors as Retry/Fatal/Transient |
| Preserve gRPC status in client | M | Include status code in error events |
| Add error metrics | S | Emit metrics for error rates by type |
| Consistent error emission | S | All Tauri commands emit errors |
| Webhook failure metrics | S | Track webhook delivery success/failure |
### Files to Modify
**Client:**
- `client/src/api/tauri-adapter.ts` - Replace silent catches
- `client/src/api/index.ts` - Log event bridge errors
- `client/src/api/cached-adapter.ts` - Log event bridge errors
- `client/src/lib/preferences.ts` - Surface validation errors
- `client/src/contexts/project-context.tsx` - Handle errors
- `client/src-tauri/src/commands/*.rs` - Consistent error emission
- `client/src/types/errors.ts` - Add error classification
**Backend:**
- `src/noteflow/grpc/_mixins/meeting.py` - Add webhook metrics
- `src/noteflow/application/services/webhook_service.py` - Add metrics
## Error Classification Schema
```typescript
enum ErrorSeverity {
  Fatal = 'fatal',         // Unrecoverable, show dialog
  Retryable = 'retryable', // Can retry, show with retry button
  Transient = 'transient', // Temporary, auto-retry
  Warning = 'warning',     // Show inline, don't block
  Info = 'info',           // Log only, no UI
}

enum ErrorCategory {
  Network = 'network',
  Auth = 'auth',
  Validation = 'validation',
  NotFound = 'not_found',
  Server = 'server',
  Client = 'client',
}

interface ClassifiedError {
  message: string;
  code: string;
  severity: ErrorSeverity;
  category: ErrorCategory;
  retryable: boolean;
  grpcStatus?: number;
  context?: Record<string, unknown>;
}
```
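A sketch of the gRPC status → classification mapping the schema implies. The status numbers follow the standard gRPC codes; the function name and mapping choices are illustrative (they mirror the test-case table in this document), and the schema types are restated as string unions so the sketch stands alone:

```typescript
// Minimal restatement of the schema above so this sketch is self-contained.
type ErrorSeverity = 'fatal' | 'retryable' | 'transient' | 'warning' | 'info';
type ErrorCategory = 'network' | 'auth' | 'validation' | 'not_found' | 'server' | 'client';
interface ClassifiedError {
  message: string;
  code: string;
  severity: ErrorSeverity;
  category: ErrorCategory;
  retryable: boolean;
  grpcStatus?: number;
}

// Standard gRPC status code numbers (per the gRPC specification).
const GRPC = {
  INVALID_ARGUMENT: 3,
  DEADLINE_EXCEEDED: 4,
  NOT_FOUND: 5,
  INTERNAL: 13,
  UNAVAILABLE: 14,
} as const;

// Hypothetical classifier: maps a gRPC status to severity/category.
function classifyGrpcError(status: number, message: string): ClassifiedError {
  switch (status) {
    case GRPC.NOT_FOUND:
      return { message, code: 'not_found', severity: 'warning', category: 'not_found', retryable: false, grpcStatus: status };
    case GRPC.INVALID_ARGUMENT:
      return { message, code: 'validation', severity: 'warning', category: 'validation', retryable: false, grpcStatus: status };
    case GRPC.UNAVAILABLE:
    case GRPC.DEADLINE_EXCEEDED:
      return { message, code: 'network', severity: 'retryable', category: 'network', retryable: true, grpcStatus: status };
    default:
      return { message, code: 'server', severity: 'fatal', category: 'server', retryable: false, grpcStatus: status };
  }
}
```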
## Migration Strategy
### Phase 1: Add Logging (Low Risk)
- Replace `.catch(() => {})` with `.catch(logError)`
- No user-facing changes
- Immediate debugging improvement
### Phase 2: Add Classification (Low Risk)
- Classify errors at emission point
- Add gRPC status preservation
- Backward compatible
### Phase 3: Consistent Emission (Medium Risk)
- All Tauri commands emit errors
- UI handles new error events
- May surface previously hidden errors
### Phase 4: User-Facing Improvements (Medium Risk)
- Add retry buttons for retryable errors
- Add error aggregation panel
- May change user experience
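The auto-retry behavior for transient errors could be wrapped as a small helper. This is a sketch only: the `isTransient` predicate, attempt count, and backoff delays are placeholders, not existing code:

```typescript
// Sketch of an auto-retry wrapper for transient errors with
// exponential backoff (100ms, 200ms, 400ms, ...).
async function withRetry<T>(
  op: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      // Non-transient errors and exhausted budgets propagate immediately.
      if (!isTransient(err) || attempt === maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** (attempt - 1)));
    }
  }
  throw lastErr;
}
```

Fatal and warning-level errors would bypass this wrapper entirely and go straight to the dialog or inline UI.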
## Deliverables
### Backend
- [ ] Add webhook delivery metrics (success/failure/retry counts)
- [ ] Emit structured error events for observability
### Client
- [ ] Zero `.catch(() => {})` patterns
- [ ] Error classification utility
- [ ] Consistent error emission from all Tauri commands
- [ ] Error logging with context
- [ ] Metrics emission for error rates
### Shared
- [ ] Error classification schema documentation
- [ ] gRPC status → severity mapping
### Tests
- [ ] Unit test: error classification logic
- [ ] Integration test: error propagation end-to-end
- [ ] E2E test: error UI rendering
## Test Strategy
### Fixtures
- Mock server that returns various error codes
- Network failure simulation
- Invalid request payloads
### Test Cases
| Case | Input | Expected |
|------|-------|----------|
| gRPC NOT_FOUND | Get non-existent meeting | Error with severity=Warning, category=NotFound |
| gRPC UNAVAILABLE | Server down | Error with severity=Retryable, category=Network |
| gRPC INTERNAL | Server bug | Error with severity=Fatal, category=Server |
| Network timeout | Slow response | Error with severity=Retryable, category=Network |
| Validation error | Invalid UUID | Error with severity=Warning, category=Validation |
| Webhook failure | HTTP 500 from webhook | Logged, metric emitted, no user error |
## Quality Gates
- [ ] Zero `.catch(() => {})` in production code
- [ ] All errors have classification
- [ ] All Tauri commands emit errors consistently
- [ ] Error metrics visible in observability dashboard
- [ ] gRPC status preserved in client errors
- [ ] Tests cover all error categories
- [ ] Documentation for error classification