SPRINT-GAP-004: Diarization Job Lifecycle Issues

| Attribute | Value |
| --- | --- |
| Sprint | GAP-004 |
| Size | S (Small) |
| Owner | TBD |
| Phase | Hardening |
| Prerequisites | None |

Open Issues

  • Define maximum poll attempts before giving up
  • Determine transient error detection heuristics
  • Decide on job recovery strategy after app restart

Validation Status

| Component | Exists | Notes |
| --- | --- | --- |
| Job status polling | Yes | Needs resilience |
| Job cancellation | Yes | Needs confirmation |
| DB persistence | Yes | Works correctly |
| Memory fallback | Yes | Doesn't survive restart |

Objective

Improve diarization job lifecycle management to handle transient failures gracefully, provide cancellation confirmation, and ensure job state survives application restarts.

Key Decisions

| Decision | Choice | Rationale |
| --- | --- | --- |
| Poll failure handling | Retry with backoff | Transient errors shouldn't kill poll loop |
| Cancellation | Return final state | Client needs confirmation of cancel |
| Memory fallback | Log warning | Users should know DB is recommended |
| Max poll attempts | 60 (2 minutes) | Prevent infinite polling |

What Already Exists

Backend (src/noteflow/grpc/_mixins/diarization/_jobs.py)

  • Job creation with DB or memory fallback
  • Background task execution with timeout
  • Status updates (QUEUED → RUNNING → COMPLETED/FAILED)
  • Cancellation via task cancellation

Client (client/src-tauri/src/commands/diarization.rs)

  • refine_speaker_diarization() starts job + polling
  • start_diarization_poll() polls every 2 seconds
  • cancel_diarization_job() sends cancel request
  • Error emission on failures

Identified Issues

1. Polling Loop Breaks on Any Error (Medium)

Location: client/src-tauri/src/commands/diarization.rs:128-134

```rust
let status = match state.grpc_client.get_diarization_job_status(&job_id).await {
    Ok(status) => status,
    Err(err) => {
        emit_error(&app, "diarization_error", &err);
        break;  // Stops polling on ANY error
    }
};
```

Problem: Any error terminates the poll loop:

  • Transient network error → polling stops
  • User sees "error" but job may still be running
  • No distinction between fatal and recoverable errors

Impact: Users must manually refresh to see job completion.
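
A minimal sketch of the intended fix, reusing the is_transient_error helper defined under Error Classification below; the failure counter and the extra backoff delay are illustrative names, not existing code:

```rust
let mut consecutive_failures: u32 = 0;
loop {
    interval.tick().await;
    match state.grpc_client.get_diarization_job_status(&job_id).await {
        Ok(status) => {
            consecutive_failures = 0; // any successful poll resets the retry budget
            // ... existing terminal-state handling and event emission
        }
        Err(err) if is_transient_error(&err) && consecutive_failures < 3 => {
            // Transient: back off a little extra, then let the loop retry.
            consecutive_failures += 1;
            tokio::time::sleep(Duration::from_secs(u64::from(consecutive_failures))).await;
        }
        Err(err) => {
            // Fatal error, or transient retries exhausted: stop for real.
            emit_error(&app, "diarization_error", &err);
            break;
        }
    }
}
```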

2. Cancellation Has No Confirmation (Low)

Location: client/src-tauri/src/commands/diarization.rs:90-95

```rust
pub async fn cancel_diarization_job(
    state: State<'_, Arc<AppState>>,
    job_id: String,
) -> Result<CancelDiarizationResult> {
    state.grpc_client.cancel_diarization_job(&job_id).await
}
```

Problem: The cancel request returns success, but (see the sketch below):

  • The client never polls for the final CANCELLED status
  • The UI may show an inconsistent state
  • The cancel can race with job completion
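
A sketch of cancel-with-confirmation; is_terminal, CancelDiarizationResult::from, and the CancelUnconfirmed error variant are hypothetical names used for illustration:

```rust
pub async fn cancel_diarization_job(
    state: State<'_, Arc<AppState>>,
    job_id: String,
) -> Result<CancelDiarizationResult> {
    state.grpc_client.cancel_diarization_job(&job_id).await?;

    // Poll briefly so the caller gets a confirmed terminal state:
    // CANCELLED, or COMPLETED if the job won the race with the cancel.
    for _ in 0..5 {
        let status = state.grpc_client.get_diarization_job_status(&job_id).await?;
        if status.is_terminal() {
            return Ok(CancelDiarizationResult::from(status));
        }
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
    Err(Error::CancelUnconfirmed { job_id })
}
```

If no terminal state arrives within the short window, returning an explicit error lets the UI fall back to the regular poll loop rather than assuming the cancel landed.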

3. In-Memory Fallback Doesn't Persist (Low)

Location: src/noteflow/grpc/_mixins/diarization/_jobs.py:124-128

```python
if repo.supports_diarization_jobs:
    await repo.diarization_jobs.create(job)
    await repo.commit()
else:
    self._diarization_jobs[job_id] = job  # Lost on restart
```

Problem: When the DB is not available (a sketch of the planned warning follows this list):

  • Jobs are stored in memory only
  • A server restart loses all job state
  • No warning to the user
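
A minimal sketch of the planned warning, assuming a module-level logger in the mixin; the logger setup and message text are illustrative:

```python
import logging

logger = logging.getLogger(__name__)

if repo.supports_diarization_jobs:
    await repo.diarization_jobs.create(job)
    await repo.commit()
else:
    # Surface the durability gap instead of silently degrading.
    logger.warning(
        "Diarization job %s is stored in memory only and will be lost "
        "on restart; configure a database for durable job state.",
        job_id,
    )
    self._diarization_jobs[job_id] = job
```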

4. No Maximum Poll Attempts (Low)

Location: client/src-tauri/src/commands/diarization.rs:124-144

```rust
tauri::async_runtime::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(2));
    loop {
        interval.tick().await;
        // ... no attempt counter
        if matches!(status.status, ...) {
            break;
        }
    }
});
```

Problem: Polling continues indefinitely (a sketch of the fix follows this list):

  • Zombie poll loops if job disappears
  • Resource waste
  • Potential memory leak
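
A sketch of the attempt counter, assuming the existing 2-second interval; the timeout event payload is elided:

```rust
const MAX_POLL_ATTEMPTS: u32 = 60; // 60 attempts x 2s interval = the 2-minute budget

tauri::async_runtime::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(2));
    let mut attempts: u32 = 0;
    loop {
        interval.tick().await;
        attempts += 1;
        if attempts > MAX_POLL_ATTEMPTS {
            // No terminal state within the budget: emit a timeout error
            // (payload shape elided) and stop instead of spinning forever.
            break;
        }
        // ... existing status fetch and terminal-state checks
    }
});
```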

Scope

Task Breakdown

| Task | Effort | Description |
| --- | --- | --- |
| Add retry logic to polling | S | Retry transient errors with backoff |
| Add max poll attempts | S | Stop after 60 attempts (2 min) |
| Confirm cancellation | S | Poll for CANCELLED status after cancel |
| Warn on memory fallback | S | Emit warning when using in-memory storage |
| Distinguish error types | S | Identify transient vs fatal errors |

Files to Modify

Client:

  • client/src-tauri/src/commands/diarization.rs

Backend:

  • src/noteflow/grpc/_mixins/diarization/_jobs.py

Error Classification for Polling

```rust
fn is_transient_error(err: &Error) -> bool {
    // Network errors, timeouts, temporarily unavailable
    matches!(err.code(),
        ErrorCode::Unavailable |
        ErrorCode::DeadlineExceeded |
        ErrorCode::ResourceExhausted
    )
}

fn is_fatal_error(err: &Error) -> bool {
    // Job gone, invalid state
    matches!(err.code(),
        ErrorCode::NotFound |
        ErrorCode::InvalidArgument |
        ErrorCode::PermissionDenied
    )
}
```
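
If the client surfaces gRPC failures as tonic statuses, ErrorCode here would map onto tonic::Code (an assumption about the client's error type, not confirmed by the code above). Treating any code matched by neither helper as fatal keeps the loop conservative: an unknown error stops polling rather than retrying indefinitely.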

Migration Strategy

Phase 1: Add Resilience (Low Risk)

  • Retry on transient errors (up to 3 times)
  • Add max poll attempts counter
  • No behavior change for success path

Phase 2: Add Confirmation (Low Risk)

  • Cancel polls for and returns the final job status
  • UI updates to show confirmed cancellation
  • Minor API change

Phase 3: Add Warnings (Low Risk)

  • Emit warning event for memory fallback
  • UI can show notification
  • No breaking changes

Deliverables

Backend

  • Log warning when using in-memory job storage
  • Emit metric for storage mode (db vs memory)

Client

  • Retry transient polling errors (3 attempts with backoff)
  • Maximum 60 poll attempts
  • Confirm cancellation by polling for CANCELLED
  • Handle memory fallback warning

Tests

  • Unit test: transient error retry
  • Unit test: max attempts reached
  • Integration test: cancel + confirm
  • Integration test: server restart during poll

Test Strategy

Fixtures

  • Mock server that fails intermittently
  • Mock server that cancels slowly
  • In-memory mode server

Test Cases

| Case | Input | Expected |
| --- | --- | --- |
| Transient error during poll | Network blip | Retry up to 3 times, continue polling |
| Fatal error during poll | Job not found | Stop polling, emit error |
| Max attempts reached | Zombie job | Stop polling, emit timeout error |
| Cancel job | Active job | Returns CANCELLED status |
| Memory fallback | No DB | Warning emitted, job proceeds |
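
For the "max attempts reached" case, one workable approach is to extract the poll loop into a function driven by a status-fetching closure and run it under tokio's paused test clock, so the 60 simulated ticks complete instantly. All names below are hypothetical:

```rust
use std::time::Duration;

#[derive(Clone, Copy, PartialEq)]
enum JobStatus {
    Running,
    Completed,
}

enum PollOutcome {
    Finished(JobStatus),
    TimedOut,
}

// Poll loop extracted so tests can inject a fake status source.
async fn poll_until_terminal<F, Fut>(mut fetch: F, max_attempts: u32) -> PollOutcome
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = JobStatus>,
{
    let mut interval = tokio::time::interval(Duration::from_secs(2));
    for _ in 0..max_attempts {
        interval.tick().await;
        let status = fetch().await;
        if status != JobStatus::Running {
            return PollOutcome::Finished(status);
        }
    }
    PollOutcome::TimedOut
}

#[tokio::test(start_paused = true)]
async fn poll_stops_after_max_attempts() {
    // A zombie job that never reaches a terminal state.
    let outcome = poll_until_terminal(|| async { JobStatus::Running }, 60).await;
    assert!(matches!(outcome, PollOutcome::TimedOut));
}
```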

Quality Gates

  • Polling survives transient errors
  • Polling stops after max attempts
  • Cancel returns confirmed state
  • Memory fallback warning visible
  • Tests cover error scenarios
  • No zombie poll loops in production