# SPRINT-GAP-003: Error Handling Mismatches

| Attribute | Value |
|-----------|-------|
| **Sprint** | GAP-003 |
| **Size** | M (Medium) |
| **Owner** | TBD |
| **Phase** | Hardening |
| **Prerequisites** | None |

|
## Open Issues

- [ ] Define error severity levels for user-facing vs silent errors
- [ ] Decide on error aggregation strategy (toast vs panel vs inline)
- [ ] Determine retry eligibility by error type

## Validation Status

| Component | Exists | Needs Work |
|-----------|--------|------------|
| gRPC error helpers | Yes | Complete |
| Domain error mapping | Yes | Complete |
| Client error emission | Partial | Inconsistent |
| Error recovery UI | No | Required |

## Objective

Establish consistent error handling across the backend-client boundary, ensuring all errors are appropriately surfaced, logged, and actionable while maintaining system stability.

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Silenced errors | Eliminate `.catch(() => {})` | Hidden failures cause debugging nightmares |
| Error classification | Retry/Fatal/Transient | Different handling for each type |
| User notification | Severity-based | Critical → toast, Warning → inline, Info → log |
| Webhook failures | Log + metrics only | Fire-and-forget by design, but with visibility |

## What Already Exists
### Backend Error Infrastructure (`src/noteflow/grpc/_mixins/errors.py`)

- `abort_not_found()`, `abort_invalid_argument()`, etc.
- `DomainError` → gRPC status mapping
- `domain_error_handler` decorator
- Feature requirement helpers (`require_feature_*`)

### Client Error Patterns

- `emit_error()` in Tauri commands
- `ErrorEvent` type for Tauri events
- Connection error state tracking

## Identified Issues
### 1. Silenced Errors Pattern (Critical)

Multiple locations use `.catch(() => {})` to ignore errors:

**Location 1**: `client/src/api/tauri-adapter.ts:107`

```typescript
this.invoke(TauriCommands.SEND_AUDIO_CHUNK, args).catch((_err) => {});
```

**Location 2**: `client/src/api/tauri-adapter.ts:124`

```typescript
this.invoke(TauriCommands.STOP_RECORDING, { meeting_id: this.meetingId }).catch((_err) => {});
```

**Location 3**: `client/src/api/index.ts:51`

```typescript
await startTauriEventBridge().catch((_error) => {});
```

**Location 4**: `client/src/lib/preferences.ts:234`

```typescript
validateCachedIntegrations().catch(() => {});
```

**Location 5**: `client/src/contexts/project-context.tsx:134`

```typescript
.catch(() => {});
```

**Location 6**: `client/src/api/cached-adapter.ts:134`

```typescript
await startTauriEventBridge().catch(() => {});
```

**Problem**: Errors are silently discarded:

- No logging for debugging
- No user notification for actionable errors
- No metrics for monitoring
- Failures appear as success to calling code

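
One low-risk fix is a small catch handler that records context instead of discarding the error. The sketch below assumes a hypothetical `logError` helper (the name follows Phase 1 of the migration strategy; it is not existing code):

```typescript
// Hypothetical logError helper: returns a catch handler that logs the
// error with call-site context instead of silently swallowing it.
function logError(context: string): (err: unknown) => void {
  return (err: unknown) => {
    const message = err instanceof Error ? err.message : String(err);
    console.error(`[${context}]`, message);
  };
}

// Before: this.invoke(TauriCommands.SEND_AUDIO_CHUNK, args).catch((_err) => {});
// After — fire-and-forget semantics are preserved, but the failure is visible:
// this.invoke(TauriCommands.SEND_AUDIO_CHUNK, args).catch(logError('send_audio_chunk'));
```

This keeps call sites non-blocking while making every swallowed failure show up in logs.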
### 2. Inconsistent Error Emission (Medium)
**Location**: `client/src-tauri/src/commands/diarization.rs:51-54`

```rust
Err(err) => {
    emit_error(&app, "diarization_error", &err);
    Err(err)
}
```

**Problem**: Some commands emit errors, others don't:

- `refine_speaker_diarization` emits errors
- `rename_speaker` doesn't emit (lines 81-86)
- Inconsistent user experience

### 3. Webhook Failures Invisible (Low - By Design)
**Location**: `src/noteflow/grpc/_mixins/meeting.py:187-192`

```python
try:
    await self._webhook_service.trigger_recording_stopped(...)
except Exception:
    logger.exception("Failed to trigger recording.stopped webhooks")
```

**Problem**: Fire-and-forget is correct for webhooks, but:

- No metrics emitted for failure rate
- No visibility into webhook health
- Failures only visible in logs

### 4. Error Type Information Lost (Medium)
**Problem**: gRPC errors are well-structured on the backend, but the client often reduces them to a plain string:

Backend:

```python
await context.abort(grpc.StatusCode.NOT_FOUND, f"Meeting {meeting_id} not found")
```

Client:

```rust
Err(err) => {
    emit_error(&app, "diarization_error", &err); // Just error code, no status
    ...
}
```

Lost information:

- gRPC status code (NOT_FOUND, INVALID_ARGUMENT, etc.)
- Whether the error is retryable
- Error category for appropriate UI handling

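
One way to keep this information across the Tauri boundary is to emit a structured payload rather than a bare string. The sketch below assumes a hypothetical payload shape (field names are illustrative, not the existing `ErrorEvent` type) and parses it defensively so older string-only emissions still work:

```typescript
// Hypothetical enriched error payload: the Tauri side would serialize
// the gRPC status and retryability alongside the message.
interface ErrorEventPayload {
  message: string;
  grpcStatus?: number;  // e.g. 5 = NOT_FOUND, 14 = UNAVAILABLE
  retryable?: boolean;
}

// Defensive parsing on the client: fall back gracefully when a command
// still emits a plain string instead of the structured payload.
function parseErrorEvent(raw: unknown): ErrorEventPayload {
  if (typeof raw === 'string') {
    return { message: raw };
  }
  const obj = raw as Partial<ErrorEventPayload>;
  return {
    message: obj.message ?? 'Unknown error',
    grpcStatus: obj.grpcStatus,
    retryable: obj.retryable,
  };
}
```

The fallback branch makes the migration incremental: commands can adopt the structured payload one at a time without breaking existing listeners.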
## Scope
### Task Breakdown

| Task | Effort | Description |
|------|--------|-------------|
| Replace `.catch(() => {})` with logging | S | Log all caught errors with context |
| Add error classification | M | Categorize errors as Retry/Fatal/Transient |
| Preserve gRPC status in client | M | Include status code in error events |
| Add error metrics | S | Emit metrics for error rates by type |
| Consistent error emission | S | All Tauri commands emit errors |
| Webhook failure metrics | S | Track webhook delivery success/failure |

### Files to Modify
**Client:**

- `client/src/api/tauri-adapter.ts` - Replace silent catches
- `client/src/api/index.ts` - Log event bridge errors
- `client/src/api/cached-adapter.ts` - Log event bridge errors
- `client/src/lib/preferences.ts` - Surface validation errors
- `client/src/contexts/project-context.tsx` - Handle errors
- `client/src-tauri/src/commands/*.rs` - Consistent error emission
- `client/src/types/errors.ts` - Add error classification

**Backend:**

- `src/noteflow/grpc/_mixins/meeting.py` - Add webhook metrics
- `src/noteflow/application/services/webhook_service.py` - Add metrics

## Error Classification Schema

```typescript
enum ErrorSeverity {
  Fatal = 'fatal',         // Unrecoverable, show dialog
  Retryable = 'retryable', // Can retry, show with retry button
  Transient = 'transient', // Temporary, auto-retry
  Warning = 'warning',     // Show inline, don't block
  Info = 'info',           // Log only, no UI
}

enum ErrorCategory {
  Network = 'network',
  Auth = 'auth',
  Validation = 'validation',
  NotFound = 'not_found',
  Server = 'server',
  Client = 'client',
}

interface ClassifiedError {
  message: string;
  code: string;
  severity: ErrorSeverity;
  category: ErrorCategory;
  retryable: boolean;
  grpcStatus?: number;
  context?: Record<string, unknown>;
}
```
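
A classification helper built on this schema might look like the following sketch. The helper name and the exact status-code mapping are assumptions (the mapping mirrors the test cases later in this document); a trimmed version of the schema is re-declared locally so the sketch is self-contained:

```typescript
// Trimmed local copies of the schema types (string unions instead of enums).
type Severity = 'fatal' | 'retryable' | 'transient' | 'warning' | 'info';
type Category = 'network' | 'auth' | 'validation' | 'not_found' | 'server' | 'client';

interface ClassifiedError {
  message: string;
  code: string;
  severity: Severity;
  category: Category;
  retryable: boolean;
  grpcStatus?: number;
}

// Standard gRPC status codes → (severity, category). Unknown codes fall
// back to fatal/server so nothing is silently downgraded.
const GRPC_MAP: Record<number, [Severity, Category]> = {
  3:  ['warning',   'validation'], // INVALID_ARGUMENT
  5:  ['warning',   'not_found'],  // NOT_FOUND
  7:  ['fatal',     'auth'],       // PERMISSION_DENIED
  13: ['fatal',     'server'],     // INTERNAL
  14: ['retryable', 'network'],    // UNAVAILABLE
  16: ['fatal',     'auth'],       // UNAUTHENTICATED
};

function classifyGrpcError(status: number, message: string): ClassifiedError {
  const [severity, category] = GRPC_MAP[status] ?? ['fatal', 'server'];
  return {
    message,
    code: `grpc_${status}`,
    severity,
    category,
    retryable: severity === 'retryable' || severity === 'transient',
    grpcStatus: status,
  };
}
```

For example, `classifyGrpcError(5, 'Meeting not found')` yields a warning-level `not_found` error that is not retryable, while status 14 (UNAVAILABLE) classifies as a retryable network error.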
## Migration Strategy
### Phase 1: Add Logging (Low Risk)

- Replace `.catch(() => {})` with `.catch(logError)`
- No user-facing changes
- Immediate debugging improvement

### Phase 2: Add Classification (Low Risk)

- Classify errors at emission point
- Add gRPC status preservation
- Backward compatible

### Phase 3: Consistent Emission (Medium Risk)

- All Tauri commands emit errors
- UI handles new error events
- May surface previously hidden errors

### Phase 4: User-Facing Improvements (Medium Risk)

- Add retry buttons for retryable errors
- Add error aggregation panel
- May change user experience

## Deliverables
### Backend

- [ ] Add webhook delivery metrics (success/failure/retry counts)
- [ ] Emit structured error events for observability

### Client

- [ ] Zero `.catch(() => {})` patterns
- [ ] Error classification utility
- [ ] Consistent error emission from all Tauri commands
- [ ] Error logging with context
- [ ] Metrics emission for error rates

### Shared

- [ ] Error classification schema documentation
- [ ] gRPC status → severity mapping

### Tests

- [ ] Unit test: error classification logic
- [ ] Integration test: error propagation end-to-end
- [ ] E2E test: error UI rendering

## Test Strategy
### Fixtures

- Mock server that returns various error codes
- Network failure simulation
- Invalid request payloads

### Test Cases

| Case | Input | Expected |
|------|-------|----------|
| gRPC NOT_FOUND | Get non-existent meeting | Error with severity=Warning, category=NotFound |
| gRPC UNAVAILABLE | Server down | Error with severity=Retryable, category=Network |
| gRPC INTERNAL | Server bug | Error with severity=Fatal, category=Server |
| Network timeout | Slow response | Error with severity=Retryable, category=Network |
| Validation error | Invalid UUID | Error with severity=Warning, category=Validation |
| Webhook failure | HTTP 500 from webhook | Logged, metric emitted, no user error |

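
The gRPC rows of the table above lend themselves to a data-driven unit test. The sketch below assumes a hypothetical `classifyGrpcStatus` helper; a minimal inline mapping is included so the example runs on its own, but in practice the assertions would target the real classification utility:

```typescript
// Minimal illustrative classifier covering only the table's gRPC cases.
type Expected = { severity: string; category: string };

function classifyGrpcStatus(status: number): Expected {
  switch (status) {
    case 5:  return { severity: 'warning',   category: 'not_found' };  // NOT_FOUND
    case 14: return { severity: 'retryable', category: 'network' };    // UNAVAILABLE
    case 13: return { severity: 'fatal',     category: 'server' };     // INTERNAL
    default: return { severity: 'fatal',     category: 'server' };
  }
}

// Data-driven cases mirroring the test table's gRPC rows.
const cases: Array<[number, Expected]> = [
  [5,  { severity: 'warning',   category: 'not_found' }],
  [14, { severity: 'retryable', category: 'network' }],
  [13, { severity: 'fatal',     category: 'server' }],
];

for (const [status, expected] of cases) {
  const actual = classifyGrpcStatus(status);
  if (actual.severity !== expected.severity || actual.category !== expected.category) {
    throw new Error(`status ${status}: got ${JSON.stringify(actual)}`);
  }
}
```

Keeping the expectations in a table-shaped array makes it cheap to extend the test whenever a new row is added to the classification table.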
## Quality Gates

- [ ] Zero `.catch(() => {})` in production code
- [ ] All errors have classification
- [ ] All Tauri commands emit errors consistently
- [ ] Error metrics visible in observability dashboard
- [ ] gRPC status preserved in client errors
- [ ] Tests cover all error categories
- [ ] Documentation for error classification