ROCm Support Implementation Checklist

This checklist tracks the implementation progress for Sprint 18.5.


Phase 1: Device Abstraction Layer

1.1 GPU Detection Module

  • Create src/noteflow/infrastructure/gpu/__init__.py
  • Create src/noteflow/infrastructure/gpu/detection.py
    • Implement GpuBackend enum (NONE, CUDA, ROCM, MPS)
    • Implement GpuInfo dataclass
    • Implement detect_gpu_backend() function
    • Implement get_gpu_info() function
    • Add ROCm version detection via torch.version.hip
  • Create tests/infrastructure/gpu/test_detection.py
    • Test no-torch case
    • Test CUDA detection
    • Test ROCm detection (HIP check)
    • Test MPS detection
    • Test CPU fallback
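
The detection items above can be sketched as follows. This is a minimal illustration, not the project's implementation: the enum string values, the `GpuInfo` field names, and the `backend_from_torch()` helper are assumptions introduced here. It relies on ROCm builds of PyTorch reporting through the `torch.cuda` API and setting `torch.version.hip`.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Any


class GpuBackend(Enum):
    # Member names come from the checklist; the string values are assumed.
    NONE = "none"
    CUDA = "cuda"
    ROCM = "rocm"
    MPS = "mps"


@dataclass(frozen=True)
class GpuInfo:
    # Field names are illustrative assumptions.
    backend: GpuBackend
    device_name: str | None = None
    hip_version: str | None = None


def backend_from_torch(torch: Any) -> GpuBackend:
    """Classify the backend exposed by an already-imported torch module."""
    if torch.cuda.is_available():
        # ROCm builds reuse torch.cuda; torch.version.hip is None on CUDA builds.
        if getattr(torch.version, "hip", None):
            return GpuBackend.ROCM
        return GpuBackend.CUDA
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return GpuBackend.MPS
    return GpuBackend.NONE


def detect_gpu_backend() -> GpuBackend:
    """Return NONE when torch is not installed (the no-torch case)."""
    try:
        import torch
    except ImportError:
        return GpuBackend.NONE
    return backend_from_torch(torch)


def get_gpu_info() -> GpuInfo:
    try:
        import torch
    except ImportError:
        return GpuInfo(backend=GpuBackend.NONE)
    backend = backend_from_torch(torch)
    name = None
    if backend in (GpuBackend.CUDA, GpuBackend.ROCM):
        name = torch.cuda.get_device_name(0)
    return GpuInfo(backend, name, getattr(torch.version, "hip", None))
```

Taking the torch module as a parameter in `backend_from_torch()` lets the CUDA, ROCm, MPS, and CPU cases be tested with a stub object instead of real hardware.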

1.2 Domain Types

  • Create src/noteflow/domain/ports/gpu.py
    • Export GpuBackend enum
    • Export GpuInfo type
    • Define GpuDetectionProtocol
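
One possible shape for the port, sketched with `typing.Protocol`. The method names mirror the functions listed in 1.1, but the exact signatures, the trimmed-down `GpuInfo`, and the `CpuOnlyDetector` stand-in are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Protocol, runtime_checkable


class GpuBackend(Enum):
    NONE = "none"
    CUDA = "cuda"
    ROCM = "rocm"
    MPS = "mps"


@dataclass(frozen=True)
class GpuInfo:
    backend: GpuBackend
    device_name: str | None = None


@runtime_checkable
class GpuDetectionProtocol(Protocol):
    """Port that lets other layers ask about GPUs without importing torch."""

    def detect_gpu_backend(self) -> GpuBackend: ...

    def get_gpu_info(self) -> GpuInfo: ...


class CpuOnlyDetector:
    """Hypothetical stand-in that satisfies the port structurally."""

    def detect_gpu_backend(self) -> GpuBackend:
        return GpuBackend.NONE

    def get_gpu_info(self) -> GpuInfo:
        return GpuInfo(backend=GpuBackend.NONE)
```

Because the protocol is structural, `CpuOnlyDetector` needs no inheritance; `isinstance(CpuOnlyDetector(), GpuDetectionProtocol)` passes as long as both methods exist.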

1.3 ASR Device Types

  • Update src/noteflow/application/services/asr_config/types.py
    • Add ROCM = "rocm" to AsrDevice enum
    • Add ROCm entry to DEVICE_COMPUTE_TYPES mapping
    • Update AsrCapabilities dataclass with rocm_available and gpu_backend fields
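
A hedged sketch of the type changes. Only `ROCM = "rocm"`, `rocm_available`, and `gpu_backend` come from the checklist; the other enum members, the compute types listed per device, and the `cuda_available` field are assumptions standing in for whatever the real module defines.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class AsrDevice(str, Enum):
    CPU = "cpu"    # assumed existing member
    CUDA = "cuda"  # assumed existing member
    ROCM = "rocm"  # new member from the checklist


# Compute types per device are placeholders, not the project's real mapping.
DEVICE_COMPUTE_TYPES: dict[AsrDevice, tuple[str, ...]] = {
    AsrDevice.CPU: ("int8", "float32"),
    AsrDevice.CUDA: ("int8", "float16", "float32"),
    AsrDevice.ROCM: ("float16", "float32"),
}


@dataclass(frozen=True)
class AsrCapabilities:
    cuda_available: bool            # assumed existing field
    rocm_available: bool = False    # new field, defaulted for compatibility
    gpu_backend: str = "none"       # new field, defaulted for compatibility
```

Giving the new fields defaults means existing call sites that construct `AsrCapabilities` without them keep working.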

1.4 Diarization Device Mixin

  • Update src/noteflow/infrastructure/diarization/engine/_device_mixin.py
    • Add ROCm detection in _detect_available_device()
    • Maintain backward compatibility with "cuda" device string
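
The backward-compatibility point can be illustrated like this: ROCm builds of PyTorch address AMD GPUs through the `"cuda"` device string, so the device string stays unchanged while a separate label records the real backend. The function name and the two-value return are assumptions, and MPS handling is left out to keep the sketch short.

```python
from __future__ import annotations

from typing import Any


def resolve_torch_device(torch: Any) -> tuple[str, str]:
    """Return (device string for torch, logical backend label)."""
    if torch.cuda.is_available():
        # On ROCm builds torch.version.hip is a version string, else None.
        label = "rocm" if getattr(torch.version, "hip", None) else "cuda"
        # The device string stays "cuda" on both backends.
        return "cuda", label
    return "cpu", "cpu"
```

Callers that only pass the first value to `torch.device(...)` behave as before; only code that reports or logs the backend needs the second value.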

1.5 System Metrics

  • Update src/noteflow/infrastructure/metrics/system_resources.py
    • Handle ROCm VRAM queries (same API as CUDA via HIP)
    • Add gpu_backend field to metrics
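
A minimal sketch of a VRAM query that works on both backends, assuming the metrics module goes through PyTorch. `torch.cuda.mem_get_info()` returns free and total bytes and is available on ROCm builds as well. The function name and the dictionary keys are hypothetical.

```python
from __future__ import annotations

from typing import Any


def gpu_memory_metrics(torch: Any) -> dict[str, object]:
    """Collect VRAM figures plus a gpu_backend label from a torch module."""
    if not torch.cuda.is_available():
        return {
            "gpu_backend": "none",
            "vram_free_bytes": None,
            "vram_total_bytes": None,
        }
    backend = "rocm" if getattr(torch.version, "hip", None) else "cuda"
    # Same call on CUDA and ROCm builds; returns (free, total) in bytes.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return {
        "gpu_backend": backend,
        "vram_free_bytes": free_bytes,
        "vram_total_bytes": total_bytes,
    }
```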

1.6 gRPC Proto

  • Update src/noteflow/grpc/proto/noteflow.proto
    • Add ASR_DEVICE_ROCM = 3 to AsrDevice enum
    • Add rocm_available field to AsrConfiguration
    • Add gpu_backend field to AsrConfiguration
  • Regenerate Python stubs
  • Run scripts/patch_grpc_stubs.py
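
On the Python side, the new enum value needs a translation to and from the domain device string. The sketch below mirrors the proto enum with an `IntEnum` rather than importing generated stubs. Only `ASR_DEVICE_ROCM = 3` is taken from the checklist; the other member names and numbers, and both helper functions, are assumptions.

```python
from __future__ import annotations

from enum import IntEnum


class ProtoAsrDevice(IntEnum):
    ASR_DEVICE_UNSPECIFIED = 0  # assumed
    ASR_DEVICE_CPU = 1          # assumed
    ASR_DEVICE_CUDA = 2         # assumed
    ASR_DEVICE_ROCM = 3         # from the checklist


_TO_PROTO = {
    "cpu": ProtoAsrDevice.ASR_DEVICE_CPU,
    "cuda": ProtoAsrDevice.ASR_DEVICE_CUDA,
    "rocm": ProtoAsrDevice.ASR_DEVICE_ROCM,
}
_FROM_PROTO = {value: key for key, value in _TO_PROTO.items()}


def device_to_proto(device: str) -> ProtoAsrDevice:
    """Map a domain device string to the wire enum; unknown maps to 0."""
    return _TO_PROTO.get(device.lower(), ProtoAsrDevice.ASR_DEVICE_UNSPECIFIED)


def device_from_proto(value: int) -> str | None:
    """Map a wire enum value back to a device string, or None if unknown."""
    try:
        return _FROM_PROTO.get(ProtoAsrDevice(value))
    except ValueError:
        return None
```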

1.7 Phase 1 Tests

  • Run pytest tests/infrastructure/gpu/
  • Run make quality-py
  • Verify no regressions in CUDA detection

Phase 2: ASR Engine Protocol

2.1 Engine Protocol Definition

  • Extend src/noteflow/infrastructure/asr/protocols.py (or relocate to domain/ports)
    • Reuse AsrResult / WordTiming from infrastructure/asr/dto.py
    • Add device property (logical device: cpu/cuda/rocm)
    • Add compute_type property
    • Confirm model_size + is_loaded already covered
    • Add optional transcribe_file() helper (if needed)
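
The property additions might look like the sketch below. It covers only the four properties named above; the real protocol presumably also declares loading and transcription methods, which are omitted because their signatures are not given here. `FakeEngine` is a hypothetical test double.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class AsrEngine(Protocol):
    @property
    def device(self) -> str:
        """Logical device: "cpu", "cuda", or "rocm"."""
        ...

    @property
    def compute_type(self) -> str: ...

    @property
    def model_size(self) -> str | None: ...

    @property
    def is_loaded(self) -> bool: ...


class FakeEngine:
    """Hypothetical engine used to show a structural compliance check."""

    def __init__(self, device: str, compute_type: str) -> None:
        self._device = device
        self._compute_type = compute_type

    @property
    def device(self) -> str:
        return self._device

    @property
    def compute_type(self) -> str:
        return self._compute_type

    @property
    def model_size(self) -> str | None:
        return None

    @property
    def is_loaded(self) -> bool:
        return False
```

With `@runtime_checkable`, `isinstance(engine, AsrEngine)` checks that the members exist, which is one way to write a protocol compliance test. It does not verify return types; a type checker such as basedpyright covers that part.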

2.2 Refactor FasterWhisperEngine

  • Update src/noteflow/infrastructure/asr/engine.py
    • Ensure compliance with AsrEngine
    • Add explicit type annotations
    • Document as CUDA/CPU backend
  • Create tests/infrastructure/asr/test_protocol_compliance.py
    • Verify FasterWhisperEngine implements protocol

2.3 PyTorch Whisper Engine (Fallback)

  • Create src/noteflow/infrastructure/asr/pytorch_engine.py
    • Implement WhisperPyTorchEngine class
    • Implement all protocol methods
    • Handle device placement (cuda/rocm/cpu)
    • Support all compute types
  • Create tests/infrastructure/asr/test_pytorch_engine.py
    • Test model loading
    • Test transcription
    • Test device handling
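
A rough sketch of the fallback engine, assuming it wraps the `openai-whisper` package (`whisper.load_model()` and `model.transcribe()`). The constructor arguments, the `torch_device` property, and returning plain text instead of the project's `AsrResult` are simplifications made for illustration.

```python
from __future__ import annotations

from typing import Any


class WhisperPyTorchEngine:
    def __init__(self, device: str = "cpu", compute_type: str = "float32") -> None:
        self._device = device
        self._compute_type = compute_type
        self._model: Any = None
        self._model_size: str | None = None

    @property
    def device(self) -> str:
        return self._device

    @property
    def torch_device(self) -> str:
        # ROCm builds of PyTorch use the "cuda" device string for AMD GPUs.
        return "cuda" if self._device in ("cuda", "rocm") else "cpu"

    @property
    def compute_type(self) -> str:
        return self._compute_type

    @property
    def model_size(self) -> str | None:
        return self._model_size

    @property
    def is_loaded(self) -> bool:
        return self._model is not None

    def load_model(self, model_size: str = "base") -> None:
        # Imported lazily so the module loads without the optional dependency.
        import whisper

        self._model = whisper.load_model(model_size, device=self.torch_device)
        self._model_size = model_size

    def transcribe(self, audio: Any) -> str:
        if self._model is None:
            raise RuntimeError("Model not loaded; call load_model() first")
        use_fp16 = self._compute_type == "float16"
        result = self._model.transcribe(audio, fp16=use_fp16)
        return result["text"]
```

Keeping the logical device (`"rocm"`) separate from the torch device string (`"cuda"`) lets the engine report the real backend while still placing tensors correctly.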

2.4 Engine Factory

  • Create src/noteflow/infrastructure/asr/factory.py
    • Implement create_asr_engine() function
    • Implement _resolve_device() helper
    • Implement _create_cpu_engine() helper
    • Implement _create_cuda_engine() helper
    • Implement _create_rocm_engine() helper
    • Define EngineCreationError exception
  • Create tests/infrastructure/asr/test_factory.py
    • Test auto device resolution
    • Test explicit device selection
    • Test fallback behavior
    • Test error cases
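
The factory behavior listed above (auto resolution, explicit selection, fallback, errors) can be sketched without any real engines. In this illustration the available devices and the builder callables are passed in explicitly; that injection style, the `"auto"` keyword, and the preference order are assumptions rather than the project's actual signature.

```python
from __future__ import annotations

import logging
from typing import Callable

logger = logging.getLogger(__name__)

Builder = Callable[[], object]


class EngineCreationError(RuntimeError):
    """Raised when no engine can be built for the requested device."""


def _resolve_device(requested: str, available: set[str]) -> str:
    if requested == "auto":
        # Assumed preference order: GPU backends first, then CPU.
        for candidate in ("cuda", "rocm", "cpu"):
            if candidate in available:
                return candidate
        raise EngineCreationError("No device available")
    if requested not in available:
        raise EngineCreationError(f"Device {requested!r} is not available")
    return requested


def create_asr_engine(
    device: str,
    available: set[str],
    builders: dict[str, list[Builder]],
) -> object:
    """Try each builder for the resolved device until one succeeds."""
    resolved = _resolve_device(device, available)
    for build in builders.get(resolved, []):
        try:
            return build()
        except ImportError as exc:
            # A missing optional backend triggers the next builder in line.
            logger.warning("ASR backend unavailable on %s: %s", resolved, exc)
    raise EngineCreationError(f"No usable ASR engine for device {resolved!r}")
```

For example, registering two builders under `"rocm"`, a CTranslate2-based one followed by a PyTorch one, gives the fallback path: if the first raises `ImportError`, a warning is logged and the second is tried.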

2.5 Update Engine Manager

  • Update src/noteflow/application/services/asr_config/_engine_manager.py
    • Add detect_rocm_available() method
    • Update build_capabilities() for ROCm
    • Update check_configuration() for ROCm validation
    • Use factory for engine creation in build_engine_for_job()
  • Update tests/application/test_asr_config_service.py
    • Add ROCm detection tests
    • Add ROCm validation tests

2.6 Phase 2 Tests

  • Run full ASR test suite
  • Run make quality-py
  • Verify CUDA path unchanged

Phase 3: ROCm-Specific Engine

3.1 ROCm Engine Implementation

  • Create src/noteflow/infrastructure/asr/rocm_engine.py
    • Implement FasterWhisperRocmEngine class
    • Handle CTranslate2-ROCm import with fallback
    • Implement all protocol methods
    • Add ROCm-specific optimizations
  • Create tests/infrastructure/asr/test_rocm_engine.py
    • Test import fallback behavior
    • Test engine creation (mock)
    • Test protocol compliance

3.2 Update Factory for ROCm

  • Update src/noteflow/infrastructure/asr/factory.py
    • Add ROCm engine import with graceful fallback
    • Log warning when falling back to PyTorch
  • Update factory tests for ROCm path

3.3 ROCm Installation Detection

  • Update src/noteflow/infrastructure/gpu/detection.py
    • Add is_ctranslate2_rocm_available() function
    • Add get_rocm_version() function
  • Add corresponding tests
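
A hedged sketch of the two helpers. `get_rocm_version()` reads `torch.version.hip`, as in 1.1. The CTranslate2 check is only a heuristic assumed for illustration: it confirms a ROCm build of PyTorch and an importable `ctranslate2` package, which does not prove that the installed CTranslate2 was built against ROCm.

```python
from __future__ import annotations

import importlib.util


def get_rocm_version() -> str | None:
    """Return the HIP version string from PyTorch, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return getattr(torch.version, "hip", None)


def is_ctranslate2_rocm_available() -> bool:
    """Heuristic: ROCm PyTorch build present and ctranslate2 importable."""
    if get_rocm_version() is None:
        return False
    # find_spec locates the package without importing it.
    return importlib.util.find_spec("ctranslate2") is not None
```

Both functions return a negative result instead of raising when torch is missing, so they are safe to call on machines without any GPU stack.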

3.4 Phase 3 Tests

  • Run ROCm-specific tests (skip if no ROCm)
  • Run make quality-py
  • Test on AMD hardware (if available)

Phase 4: Configuration & Distribution

4.1 Feature Flag

  • Update src/noteflow/config/settings/_features.py
    • Add NOTEFLOW_FEATURE_ROCM_ENABLED flag
    • Document in settings
  • Update any feature flag guards
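
A minimal sketch of reading the flag and guarding a code path with it. The real settings module may use a settings library rather than raw environment access; the helper names, the accepted truthy strings, and the default of disabled are all assumptions.

```python
from __future__ import annotations

import os
from typing import Mapping

FLAG_NAME = "NOTEFLOW_FEATURE_ROCM_ENABLED"
_TRUTHY = {"1", "true", "yes", "on"}


def rocm_feature_enabled(environ: Mapping[str, str] | None = None) -> bool:
    """Return True when the flag is set to a truthy value (default: off)."""
    env = os.environ if environ is None else environ
    return env.get(FLAG_NAME, "").strip().lower() in _TRUTHY


def require_rocm_feature(environ: Mapping[str, str] | None = None) -> None:
    """Guard for ROCm-only code paths; raises when the flag is off."""
    if not rocm_feature_enabled(environ):
        raise RuntimeError(f"ROCm support is disabled; set {FLAG_NAME}=true")
```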

4.2 gRPC Config Handlers

  • Update src/noteflow/grpc/mixins/asr_config.py
    • Handle ROCm device in GetAsrConfiguration()
    • Handle ROCm device in UpdateAsrConfiguration()
    • Add ROCm to capabilities response
  • Update tests in tests/grpc/test_asr_config.py

4.3 Dependencies

  • Update pyproject.toml
    • Add rocm extras group
    • Add openai-whisper as optional dependency
    • Document ROCm installation in comments
  • Create requirements-rocm.txt (optional)

4.4 Docker ROCm Image

  • Create docker/Dockerfile.rocm
    • Base on rocm/pytorch image
    • Install NoteFlow with ROCm extras
    • Configure for GPU access
  • Update compose.yaml (and/or add compose.rocm.yaml) with ROCm profile
  • Test Docker image build

4.5 Documentation

  • Create docs/installation/rocm.md
    • System requirements
    • PyTorch ROCm installation
    • CTranslate2-ROCm installation (optional)
    • Docker usage
    • Troubleshooting
  • Update main README with ROCm section
  • Update CLAUDE.md with ROCm notes

4.6 Phase 4 Tests

  • Run full test suite
  • Run make quality
  • Build ROCm Docker image
  • Test on AMD hardware

Final Validation

Quality Gates

  • pytest tests/quality/ passes
  • make quality-py passes
  • make quality passes (full stack)
  • Proto regenerated correctly
  • No type errors (basedpyright)
  • No lint errors (ruff)

Functional Validation

  • CUDA path works (no regression)
  • CPU path works (no regression)
  • ROCm detection works
  • PyTorch fallback works
  • gRPC configuration works
  • Device switching works

Documentation

  • Sprint README complete
  • Implementation checklist complete
  • Installation guide complete
  • API documentation updated

Notes

Files Created

| File | Status |
|------|--------|
| src/noteflow/domain/ports/gpu.py | |
| src/noteflow/domain/ports/asr.py | optional (only if relocating protocol) |
| src/noteflow/infrastructure/gpu/__init__.py | |
| src/noteflow/infrastructure/gpu/detection.py | |
| src/noteflow/infrastructure/asr/pytorch_engine.py | |
| src/noteflow/infrastructure/asr/rocm_engine.py | |
| src/noteflow/infrastructure/asr/factory.py | |
| docker/Dockerfile.rocm | |
| docs/installation/rocm.md | |

Files Modified

| File | Status |
|------|--------|
| application/services/asr_config/types.py | |
| application/services/asr_config/_engine_manager.py | |
| infrastructure/diarization/engine/_device_mixin.py | |
| infrastructure/metrics/system_resources.py | |
| infrastructure/asr/engine.py | |
| infrastructure/asr/protocols.py | |
| grpc/proto/noteflow.proto | |
| grpc/mixins/asr_config.py | |
| config/settings/_features.py | |
| pyproject.toml | |