SPRINT-GAP-004: Diarization Job Lifecycle Issues
| Attribute | Value |
|---|---|
| Sprint | GAP-004 |
| Size | S (Small) |
| Owner | TBD |
| Phase | Hardening |
| Prerequisites | None |
Open Issues
- Define maximum poll attempts before giving up
- Determine transient error detection heuristics
- Decide on job recovery strategy after app restart
Validation Status
| Component | Exists | Needs Work |
|---|---|---|
| Job status polling | Yes | Needs resilience |
| Job cancellation | Yes | Needs confirmation |
| DB persistence | Yes | Works correctly |
| Memory fallback | Yes | Doesn't survive restart |
Objective
Improve diarization job lifecycle management to handle transient failures gracefully, provide cancellation confirmation, and ensure job state survives application restarts.
Key Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Poll failure handling | Retry with backoff | Transient errors shouldn't kill poll loop |
| Cancellation | Return final state | Client needs confirmation of cancel |
| Memory fallback | Log warning | Users should know DB is recommended |
| Max poll attempts | 60 (2 minutes) | Prevent infinite polling |
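The decisions above map naturally onto a few client-side constants. A minimal sketch with hypothetical names (the existing code hard-codes the 2-second interval inline):

use std::time::Duration;

// Hypothetical constants reflecting the Key Decisions table.
const POLL_INTERVAL: Duration = Duration::from_secs(2); // existing 2 s cadence
const MAX_POLL_ATTEMPTS: u32 = 60;                      // 60 x 2 s = 2 minutes
const MAX_TRANSIENT_RETRIES: u32 = 3;                   // per-poll retry budget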
What Already Exists
Backend (src/noteflow/grpc/_mixins/diarization/_jobs.py)
- Job creation with DB or memory fallback
- Background task execution with timeout
- Status updates (QUEUED → RUNNING → COMPLETED/FAILED)
- Cancellation via task cancellation
Client (client/src-tauri/src/commands/diarization.rs)
- refine_speaker_diarization() starts job + polling
- start_diarization_poll() polls every 2 seconds
- cancel_diarization_job() sends cancel request
- Error emission on failures
Identified Issues
1. Polling Loop Breaks on Any Error (Medium)
Location: client/src-tauri/src/commands/diarization.rs:128-134
let status = match state.grpc_client.get_diarization_job_status(&job_id).await {
    Ok(status) => status,
    Err(err) => {
        emit_error(&app, "diarization_error", &err);
        break; // Stops polling on ANY error
    }
};
Problem: Any error terminates the poll loop:
- Transient network error → polling stops
- User sees "error" but job may still be running
- No distinction between fatal and recoverable errors
Impact: Users must manually refresh to see job completion.
2. Cancellation Lacks Confirmation (Low)
Location: client/src-tauri/src/commands/diarization.rs:90-95
pub async fn cancel_diarization_job(
    state: State<'_, Arc<AppState>>,
    job_id: String,
) -> Result<CancelDiarizationResult> {
    state.grpc_client.cancel_diarization_job(&job_id).await
}
Problem: The cancel request returns success, but:
- Doesn't poll for final CANCELLED status
- UI may show inconsistent state
- Race with job completion
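A minimal sketch of a confirmed cancel, assuming a final_status field on CancelDiarizationResult and an is_terminal() helper on the status type (neither exists today); the remaining types and gRPC calls follow the current command module:

pub async fn cancel_diarization_job(
    state: State<'_, Arc<AppState>>,
    job_id: String,
) -> Result<CancelDiarizationResult> {
    // Ask the server to cancel first.
    let mut result = state.grpc_client.cancel_diarization_job(&job_id).await?;

    // Confirm with a short, bounded poll: CANCELLED, or COMPLETED/FAILED if
    // the job raced the cancellation to the finish line.
    for _ in 0..5 {
        tokio::time::sleep(Duration::from_millis(500)).await;
        let status = state.grpc_client.get_diarization_job_status(&job_id).await?;
        if status.is_terminal() {
            result.final_status = Some(status);
            break;
        }
    }
    Ok(result)
}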
3. In-Memory Fallback Doesn't Persist (Low)
Location: src/noteflow/grpc/_mixins/diarization/_jobs.py:124-128
if repo.supports_diarization_jobs:
    await repo.diarization_jobs.create(job)
    await repo.commit()
else:
    self._diarization_jobs[job_id] = job  # Lost on restart
Problem: When the DB is not available:
- Jobs stored in memory only
- Server restart loses all job state
- No warning to user
4. No Maximum Poll Attempts (Low)
Location: client/src-tauri/src/commands/diarization.rs:124-144
tauri::async_runtime::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(2));
    loop {
        interval.tick().await;
        // ... no attempt counter
        if matches!(status.status, ...) {
            break;
        }
    }
});
Problem: Polling continues indefinitely:
- Zombie poll loops if job disappears
- Resource waste
- Potential memory leak
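A minimal sketch of the bounded loop, reusing the hypothetical POLL_INTERVAL and MAX_POLL_ATTEMPTS constants from Key Decisions:

tauri::async_runtime::spawn(async move {
    let mut interval = tokio::time::interval(POLL_INTERVAL);
    let mut attempts: u32 = 0;
    loop {
        interval.tick().await;
        attempts += 1;
        if attempts > MAX_POLL_ATTEMPTS {
            // Give up after ~2 minutes and surface a timeout through the
            // existing error-emission path so the UI is not left spinning.
            break;
        }
        // ... fetch status and break on terminal states, as today
    }
});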
Scope
Task Breakdown
| Task | Effort | Description |
|---|---|---|
| Add retry logic to polling | S | Retry transient errors with backoff |
| Add max poll attempts | S | Stop after 60 attempts (2 min) |
| Confirm cancellation | S | Poll for CANCELLED status after cancel |
| Warn on memory fallback | S | Emit warning when using in-memory storage |
| Distinguish error types | S | Identify transient vs fatal errors |
Files to Modify
Client:
client/src-tauri/src/commands/diarization.rs
Backend:
src/noteflow/grpc/_mixins/diarization/_jobs.py
Error Classification for Polling
fn is_transient_error(err: &Error) -> bool {
    // Network errors, timeouts, temporary unavailable
    matches!(err.code(),
        ErrorCode::Unavailable |
        ErrorCode::DeadlineExceeded |
        ErrorCode::ResourceExhausted
    )
}

fn is_fatal_error(err: &Error) -> bool {
    // Job gone, invalid state
    matches!(err.code(),
        ErrorCode::NotFound |
        ErrorCode::InvalidArgument |
        ErrorCode::PermissionDenied
    )
}
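A sketch of how these classifiers could drive the retry decision inside the poll loop; the retry budget (3) and backoff schedule are the values proposed above, and emit_error is the existing helper:

let mut transient_failures: u32 = 0;
let status = loop {
    match state.grpc_client.get_diarization_job_status(&job_id).await {
        Ok(status) => break Some(status),
        Err(err) if is_transient_error(&err) && transient_failures < 3 => {
            // Exponential backoff: 1 s, 2 s, 4 s before the next attempt.
            let delay = Duration::from_secs(1u64 << transient_failures);
            transient_failures += 1;
            tokio::time::sleep(delay).await;
        }
        Err(err) => {
            // Fatal error, or transient retries exhausted: stop polling.
            emit_error(&app, "diarization_error", &err);
            break None;
        }
    }
};
// None means the outer poll loop should exit; Some(status) is handled as today.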
Migration Strategy
Phase 1: Add Resilience (Low Risk)
- Retry on transient errors (up to 3 times)
- Add max poll attempts counter
- No behavior change for success path
Phase 2: Add Confirmation (Low Risk)
- Cancel returns with final status poll
- UI updates to show confirmed cancellation
- Minor API change
Phase 3: Add Warnings (Low Risk)
- Emit warning event for memory fallback
- UI can show notification
- No breaking changes
Deliverables
Backend
- Log warning when using in-memory job storage
- Emit metric for storage mode (db vs memory)
Client
- Retry transient polling errors (3 attempts with backoff)
- Maximum 60 poll attempts
- Confirm cancellation by polling for CANCELLED
- Handle memory fallback warning
Tests
- Unit test: transient error retry
- Unit test: max attempts reached
- Integration test: cancel + confirm
- Integration test: server restart during poll
Test Strategy
Fixtures
- Mock server that fails intermittently
- Mock server that cancels slowly
- In-memory mode server
Test Cases
| Case | Input | Expected |
|---|---|---|
| Transient error during poll | Network blip | Retry up to 3 times, continue polling |
| Fatal error during poll | Job not found | Stop polling, emit error |
| Max attempts reached | Zombie job | Stop polling, emit timeout error |
| Cancel job | Active job | Returns CANCELLED status |
| Memory fallback | No DB | Warning emitted, job proceeds |
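As a sketch of the first case, a flaky stub can stand in for the "mock server that fails intermittently" fixture. The retry_transient helper below is hypothetical (and kept synchronous for brevity); it only illustrates the shape of the test:

fn retry_transient<T, E>(
    max_retries: u32,
    mut call: impl FnMut() -> Result<T, E>,
    is_transient: impl Fn(&E) -> bool,
) -> Result<T, E> {
    let mut failures = 0;
    loop {
        match call() {
            Ok(value) => return Ok(value),
            Err(err) if is_transient(&err) && failures < max_retries => failures += 1,
            Err(err) => return Err(err),
        }
    }
}

#[test]
fn transient_errors_are_retried_then_succeed() {
    let mut calls = 0;
    let result = retry_transient(
        3,
        || {
            // Fail twice with a "network blip", then report completion.
            calls += 1;
            if calls <= 2 { Err("unavailable") } else { Ok("COMPLETED") }
        },
        |err| *err == "unavailable",
    );
    assert_eq!(result, Ok("COMPLETED"));
    assert_eq!(calls, 3);
}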
Quality Gates
- Polling survives transient errors
- Polling stops after max attempts
- Cancel returns confirmed state
- Memory fallback warning visible
- Tests cover error scenarios
- No zombie poll loops in production