noteflow/docs/spec.md
Travis Vasceannie af1285b181 Add initial project structure and files
- Introduced .python-version for Python version management.
- Added AGENTS.md for documentation on agent usage and best practices.
- Created alembic.ini for database migration configurations.
- Implemented main.py as the entry point for the application.
- Established pyproject.toml for project dependencies and configurations.
- Initialized README.md for project overview.
- Generated uv.lock for dependency locking.
- Documented milestones and specifications in docs/milestones.md and docs/spec.md.
- Created logs/status_line.json for logging status information.
- Added initial spike implementations for UI tray hotkeys, audio capture, ASR latency, and encryption validation.
- Set up NoteFlow core structure in src/noteflow with necessary modules and services.
- Developed test suite in tests directory for application, domain, infrastructure, and integration testing.
- Included initial migration scripts in infrastructure/persistence/migrations for database setup.
- Established security protocols in infrastructure/security for key management and encryption.
- Implemented audio infrastructure for capturing and processing audio data.
- Created converters for ASR and ORM in infrastructure/converters.
- Added export functionality for different formats in infrastructure/export.
- Ensured all new files are included in the repository for future development.
2025-12-17 18:28:59 +00:00


Below is a rewritten, end-to-end Product Specification + Engineering Design Document for NoteFlow V1 (Minimum Lovable Product) that merges:

  • your revised V1 draft (confidence-model triggers, single-process, partial/final UX, extract-then-synthesize citations, pragmatic typing, packaging constraints, risks table), and
  • the de-risking feedback I gave earlier (audio capture reality, diarization scope, citation enforcement, OS permissions, shipping concerns, storage/retention, update strategy, and “don’t promise what you can’t reliably ship”).

I’ve kept it “shipping-ready” by being explicit about decisions, failure modes, acceptance criteria, and what is deferred.


NoteFlow V1 — Minimum Lovable Product

Intelligent Meeting Notetaker (Local-first capture + navigable recall + evidence-linked summaries)

Document Version: 1.0 (Engineering Draft)
Status: Engineering Review
Target Platforms: macOS 12+ (Monterey), Windows 10/11 (64-bit)
Primary Use Case: Zoom/Teams-style meetings and ad-hoc conversations
Core Value Proposition: “I can reliably record a meeting, read/search a transcript, and get a summary where every point links back to evidence.”


0. Glossary

  • Segment: A finalized chunk of transcript with start/end offsets and stable text.
  • Partial transcript: Unstable text shown in the live view; may be replaced. Not persisted.
  • Evidence link: A reference from a summary bullet to one or more Segment IDs (and timestamps).
  • Trigger score: Weighted confidence score (0.0-1.0) used to prompt recording.
  • Local-first: All recordings/transcripts stored on device by default; cloud is optional and explicit.

1. Product Strategy

1.1 Goals (V1 Must Deliver)

  1. Reliable capture of meeting audio (with explicit scope + honest constraints).

  2. Near real-time transcription with a stable partial/final UX.

  3. Post-meeting review with:

    • transcript navigation,
    • audio playback synced to timestamps,
    • annotations (action items/decisions/notes),
    • an evidence-linked summary (no uncited claims).
  4. Local-first storage with retention controls and deletion that is actually meaningful.

  5. A foundation for V2 (speaker identity, live RAG callbacks, advanced exports) without building them now.

1.2 Non-Goals (V1 Will Not Promise)

  • Fully autonomous “always start recording” behavior by default.
  • Biometric speaker identification (“this is Alice”) or cross-meeting voice profiles.
  • Live “RAG callback cards” injected during meetings.
  • Team workspaces / cloud sync / org deployment.
  • PDF/DOCX export bundled in-app (V1 exports Markdown/HTML; PDF is via OS print).
  • Perfect diarization accuracy; diarization is best-effort and post-meeting only.

2. Scope: V1 vs V2+

| Feature Area | V1 Scope (Must Ship) | Deferred (V2+) |
|---|---|---|
| Audio Capture | Mic capture (default). Windows-only optional system loopback (no drivers) if feasible. macOS loopback requires user-installed device; V1 supports selecting it but does not ship drivers. | First-class macOS system audio capture without user setup; multi-source mixing; per-app capture. |
| Transcription | Near real-time via partial/final segments; timestamps; searchable transcript. | Multi-language translation, custom vocab, advanced diarization alignment. |
| Speakers | Anonymous speaker separation (post-meeting best-effort): “Speaker A/B/C”. Rename per meeting (non-biometric). | Voice profiles, biometric identification, continuous learning loop. |
| Triggers | Weighted confidence model; user confirmation by default; snooze and per-app suppression. | Fully autonomous auto-start as default; “call state” deep integrations. |
| Intelligence | Evidence-based summary (citations enforced). | Live RAG callbacks; cross-meeting memory assistant. |
| Storage | Local per-user database + encrypted assets; retention + deletion. | Cloud sync; team search; shared templates. |
| Export | Markdown/HTML + clipboard; “Print to PDF” via OS. | Bundled PDF/DOCX, templating marketplace. |

3. Success Metrics & Acceptance Criteria

3.1 Product Metrics (V1)

  • Core loop latency (P95): word spoken → visible partial text < 3.0s
  • Session reliability: crash rate < 0.5% for sessions > 60 minutes
  • False trigger prompts: < 1 prompt/day/user median; < 3 P95
  • Citation correctness: ≥ 90% of summary bullets link to supporting transcript segments (human audit)

3.2 “Must Work” Acceptance Criteria (Release Blockers)

  • User can start/stop recording manually from tray/menubar or hotkey.
  • Transcript segments are persisted and viewable after app restart.
  • Clicking a summary bullet jumps to the cited transcript segment (and audio if stored).
  • Deleting a meeting removes transcript + audio in a way that prevents casual recovery.
  • App never records without a visible, persistent indicator.

4. User Experience

4.1 Primary Screens

  1. Tray/Menubar Control

    • Start / Stop recording
    • Open NoteFlow
    • Snooze triggers (15m / 1h / 2h / today)
    • Settings
  2. Active Meeting View

    • Recording indicator + timer

    • VU meter (trust signal)

    • Rolling transcript:

      • Partial text in grey (unstable)
      • Final text in normal text (committed)
    • Annotation hotkeys (Action / Decision / Note)

    • “Mark moment” button (adds timestamped note instantly)

  3. Post-Meeting Review

    • Transcript with search (in-meeting search is required; global search is “basic” in V1)
    • Speaker labels (if diarization completed)
    • Audio playback controls (if audio stored)
    • Summary panel with evidence links
    • Export buttons: Copy Markdown / Save HTML
  4. Meeting Library

    • List of meetings (title, date, duration, source)
    • Keyword search (V1: scan-based acceptable up to defined limits)
    • Filters: date range, source app, “has action items”
  5. Settings

    • Trigger sensitivity & sources
    • Audio device selection + test
    • “Store audio” toggle + retention days
    • Summarization provider (local/cloud) + privacy consent
    • Telemetry opt-in

5. Core Workflows

5.1 Workflow A — Smart Prompt to Record (Weighted Confidence Model)

Inputs (each produces a score contribution):

  • Calendar proximity (optional connector): meeting starts within 5 minutes → +0.30
  • Foreground app: Zoom/Teams/etc is frontmost → +0.40
  • Audio activity: mic level above threshold for 5s → +0.30

Threshold behavior

  • Score < 0.40: ignore
  • 0.40-0.79: show notification: “Meeting detected. Start NoteFlow?”
  • ≥ 0.80: auto-start only if user explicitly enabled

Controls

  • Snooze button included on prompt
  • “Don’t prompt for this app” option
  • If already recording, ignore all new triggers

Engineering note (explicit constraint): V1 does not claim true “call state” detection. Foreground app + audio activity + calendar is the reliable baseline.
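
The scoring and threshold rules above can be sketched as a pure function, which also makes them easy to unit test. This is a minimal sketch, not the shipped implementation: `TriggerSignals`, `trigger_score`, and `decide` are hypothetical names, and two behaviors are assumptions rather than spec: a score of 0.80 or more falls back to a prompt when auto-start is disabled, and an active snooze suppresses every outcome.

```python
from dataclasses import dataclass

# Weights and thresholds copied from Workflow A.
WEIGHT_CALENDAR = 0.30
WEIGHT_FOREGROUND_APP = 0.40
WEIGHT_AUDIO_ACTIVITY = 0.30
PROMPT_THRESHOLD = 0.40
AUTO_START_THRESHOLD = 0.80


@dataclass(frozen=True)
class TriggerSignals:
    """Hypothetical container for the three V1 trigger inputs."""
    calendar_within_5_min: bool = False
    meeting_app_frontmost: bool = False
    mic_active_for_5s: bool = False


def trigger_score(signals: TriggerSignals) -> float:
    """Sum the weights of the signals that are currently active."""
    score = 0.0
    if signals.calendar_within_5_min:
        score += WEIGHT_CALENDAR
    if signals.meeting_app_frontmost:
        score += WEIGHT_FOREGROUND_APP
    if signals.mic_active_for_5s:
        score += WEIGHT_AUDIO_ACTIVITY
    # Round so float noise never moves a score across a threshold.
    return round(score, 2)


def decide(
    signals: TriggerSignals,
    *,
    auto_start_enabled: bool = False,
    already_recording: bool = False,
    snoozed: bool = False,
) -> str:
    """Return "ignore", "prompt", or "auto_start"."""
    if already_recording or snoozed:  # snooze handling is an assumption
        return "ignore"
    score = trigger_score(signals)
    if score < PROMPT_THRESHOLD:
        return "ignore"
    if score >= AUTO_START_THRESHOLD and auto_start_enabled:
        return "auto_start"
    # Assumption: a high score without the opt-in still only prompts.
    return "prompt"
```

With these weights, mic activity alone (0.30) is ignored, a frontmost meeting app alone (0.40) prompts, and only all three signals together (1.00) can cross the auto-start threshold.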


5.2 Workflow B — Live Transcription (Partial → Final)

  1. User starts recording (manual or triggered).

  2. Audio pipeline streams frames into ring buffer.

  3. VAD segments speech regions.

  4. Transcriber produces partial hypothesis every ~2 seconds.

  5. When VAD detects silence > 500ms (or max segment duration reached), commit final segment:

    • assign stable Segment ID
    • store text + timestamps
    • update UI (partial becomes final)

UI invariant: final segments never change text; corrections happen by creating a new segment (V2) or via explicit “edit transcript” (deferred).


5.3 Workflow C — Post-Meeting Summary with Enforced Citations (“Extract → Synthesize → Verify”)

Goal: no summary bullet can exist without a citation.

  1. Chunking: transcript segments grouped into blocks ~500 tokens (segment-aware).

  2. Extraction prompt: model must return a list of:

    • quote (verbatim excerpt)
    • segment_ids (one or more)
    • category (decision/action/key_point)
  3. Synthesis prompt: rewrite extracted quotes into a professional bullet list; each bullet ends with [...] containing Segment IDs.

  4. Verification:

    • parse bullets; if any bullet lacks [...], mark it uncited and do not show it by default (user can reveal “uncited drafts” panel)
  5. Display: clicking a bullet scrolls transcript to cited segment(s) and sets playback time.
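
The verification step (step 4) is the part that enforces the "no uncited claims" goal, so a sketch may help. The version below assumes each synthesized bullet is one string ending in a bracketed, comma-separated list of Segment IDs; `VerifiedBullet` and `verify_bullets` are hypothetical names. It also drops IDs that do not exist in the meeting, which goes one step beyond the stated rule and is an assumption.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Matches a trailing "[id, id, ...]" group at the end of a bullet.
_TRAILING_CITATION = re.compile(r"\[([^\[\]]*)\]\s*$")


@dataclass
class VerifiedBullet:
    content: str
    citation_ids: List[str] = field(default_factory=list)

    @property
    def is_cited(self) -> bool:
        return bool(self.citation_ids)


def verify_bullets(
    bullets: List[str], known_segment_ids: Set[str]
) -> Tuple[List[VerifiedBullet], List[VerifiedBullet]]:
    """Split bullets into (cited, uncited) lists."""
    cited: List[VerifiedBullet] = []
    uncited: List[VerifiedBullet] = []
    for raw in bullets:
        text = raw.strip()
        ids: List[str] = []
        match = _TRAILING_CITATION.search(text)
        if match:
            parts = [p.strip() for p in match.group(1).split(",")]
            # Assumption: IDs the meeting does not contain do not count.
            ids = [p for p in parts if p and p in known_segment_ids]
            text = text[: match.start()].rstrip()
        bullet = VerifiedBullet(content=text, citation_ids=ids)
        (cited if bullet.is_cited else uncited).append(bullet)
    return cited, uncited
```

Only the `cited` list would be shown by default; the `uncited` list would back the “uncited drafts” panel.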


5.4 Workflow D — Best-Effort Anonymous Diarization (Post-Meeting)

V1 approach: diarization is a background job after recording stops (not real-time).

  1. If diarization enabled, run pipeline on recorded audio.
  2. Obtain speaker turns and cluster labels.
  3. Align speaker turns to transcript segments by time overlap.
  4. Assign “Speaker A/B/C” per meeting.
  5. User can rename speakers per meeting (non-biometric).

Failure handling: if diarization model unavailable or too slow, transcript remains “Unknown speaker.”
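
Step 3 (alignment by time overlap) can be illustrated with a small sketch. It assumes the diarization pipeline yields speaker turns as `(start, end, cluster index)` and assigns each transcript segment the cluster it overlaps most; `SpeakerTurn` and `label_segment` are hypothetical names, and "largest total overlap wins" is one reasonable alignment rule, not one the spec mandates.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class SpeakerTurn:
    start: float  # seconds from meeting start
    end: float
    cluster: int  # anonymous cluster index, 0-based


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def cluster_name(cluster: int) -> str:
    """0 -> "Speaker A", 1 -> "Speaker B"; numbers past the alphabet."""
    if 0 <= cluster < 26:
        return f"Speaker {chr(ord('A') + cluster)}"
    return f"Speaker {cluster + 1}"


def label_segment(seg_start: float, seg_end: float, turns: Iterable[SpeakerTurn]) -> str:
    """Pick the speaker whose turns overlap the segment the most."""
    totals: Dict[int, float] = {}
    for turn in turns:
        shared = overlap(seg_start, seg_end, turn.start, turn.end)
        if shared > 0:
            totals[turn.cluster] = totals.get(turn.cluster, 0.0) + shared
    if not totals:
        return "Unknown"  # same fallback as when diarization is unavailable
    best = max(totals, key=lambda c: totals[c])
    return cluster_name(best)
```

Because labels are derived per meeting from cluster indices, nothing here carries identity across meetings.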


6. Functional Requirements (FR)

6.1 Recording & Audio

  • FR-01 Manual start/stop recording from tray/menubar.

  • FR-02 Global hotkey start/stop (configurable; can be disabled).

  • FR-03 Visible recording indicator whenever audio capture is active.

  • FR-04 Audio device selection + test page (VU meter).

  • FR-05 Audio dropouts handled gracefully:

    • attempt reconnect
    • if reconnection fails, prompt user and stop recording safely (flush files)

6.2 Transcription

  • FR-10 Near real-time transcript view with partial/final states.
  • FR-11 Persist finalized transcript segments with timestamps.
  • FR-12 Transcript is searchable within a meeting.

6.3 Annotations

  • FR-20 Add annotations during recording and review:

    • types: action_item, decision, note, risk (risk is allowed but not required in summary)
  • FR-21 An annotation always includes:

    • timestamp range
    • text
    • origin: user/system (V1: system used only for “uncited draft” metadata; no RAG callbacks)

6.4 Summaries

  • FR-30 Generate summary on demand (and optionally auto after stop).
  • FR-31 Enforce citations; uncited bullets are suppressed by default.
  • FR-32 Summary bullets clickable → jump to transcript + playback time.

6.5 Meeting Library

  • FR-40 Meeting library list with sorting and basic search.
  • FR-41 Delete meeting removes transcript + audio + summary.

6.6 Settings & Privacy

  • FR-50 Retention policy (default 30 days, configurable).
  • FR-51 Cloud summarization requires explicit opt-in and provider selection.
  • FR-52 Telemetry is opt-in and content-free.

7. Non-Functional Requirements (NFR)

7.1 Performance

  • NFR-01 P95 partial transcript latency < 3s on baseline hardware (defined in release checklist).
  • NFR-02 Background jobs (diarization, embeddings) must not freeze UI; they run in worker threads and report progress.

7.2 Reliability

  • NFR-10 Crash-safe persistence:

    • audio file is written incrementally
    • transcript segments flushed within 2s of finalization
  • NFR-11 On restart after crash, last session is recoverable (meeting marked “incomplete”).

7.3 Security & Privacy

  • NFR-20 Local data encrypted at rest (see Section 10).
  • NFR-21 No recording without indicator.
  • NFR-22 No content in telemetry logs.

8. Technical Architecture

8.1 Process Model

Decision: Client-Server architecture with gRPC.

The system is split into two components that can run on the same machine or separately:

Server (Headless Backend)

  • ASR Engine: faster-whisper for transcription
  • Meeting Store: in-memory meeting management
  • Storage: LanceDB for persistence + encrypted audio assets
  • gRPC Service: bidirectional streaming for real-time transcription

Client (GUI Application)

  • UI: Flet (Python) for main window
  • Tray/Menubar: native integration layer (pystray)
  • Audio Capture: sounddevice for local mic capture
  • gRPC Client: streams audio to server, receives transcripts

Rationale:

  • Enables headless server deployment (e.g., home server, NAS)
  • Client can run on any machine with audio hardware
  • Separates compute-heavy ASR from UI responsiveness
  • Maintains local-first operation when both run on same machine

Deployment modes:

  1. Local: Server + Client on same machine (default)
  2. Split: Server on headless machine, Client on workstation with audio

8.2 gRPC Service Contract

Service: NoteFlowService

| RPC | Type | Purpose |
|---|---|---|
| StreamTranscription | Bidirectional stream | Audio chunks → transcript updates |
| CreateMeeting | Unary | Start a new meeting |
| StopMeeting | Unary | Stop recording |
| ListMeetings | Unary | Query meetings |
| GetMeeting | Unary | Get meeting details |
| GenerateSummary | Unary | Generate evidence-linked summary |
| GetServerInfo | Unary | Health check + capabilities |

Audio streaming contract:

  • Client sends AudioChunk messages (float32, 16kHz mono)
  • Server responds with TranscriptUpdate messages (partial or final)
  • Final segments include word-level timestamps

8.3 Concurrency & Threading

Server:

  • gRPC thread pool: handles incoming requests
  • ASR worker: processes audio buffers through faster-whisper
  • IO worker: persists segments + meeting metadata

Client:

  • Main/UI thread: rendering + user actions
  • Audio thread (high priority): capture callback → gRPC stream
  • gRPC stream thread: sends audio, receives transcripts
  • Event dispatch: updates UI from transcript callbacks

Hard rule: Server's IO worker is the only component that writes to the database (prevents corruption/races).
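
One common way to enforce a single-writer rule is to route every write through a queue drained by one thread. The sketch below shows that pattern with the standard library only; `IOWorker` and its method names are hypothetical, and the `write` callable stands in for whatever actually persists a record.

```python
import queue
import threading
from typing import Any, Callable

_STOP = object()  # sentinel that tells the worker to exit


class IOWorker:
    """Serializes all writes onto one thread so no two writers race."""

    def __init__(self, write: Callable[[Any], None]) -> None:
        self._write = write
        self._queue: "queue.Queue[Any]" = queue.Queue()
        self._thread = threading.Thread(target=self._run, name="io-worker", daemon=True)

    def start(self) -> None:
        self._thread.start()

    def submit(self, record: Any) -> None:
        """Called from any thread; never touches the database directly."""
        self._queue.put(record)

    def depth(self) -> int:
        """Approximate backlog, usable for a write-queue-depth metric."""
        return self._queue.qsize()

    def stop(self) -> None:
        """Drain everything submitted so far, then stop the thread."""
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            self._write(item)
```

The ASR worker and gRPC handlers would call `submit()` only, which keeps the FIFO order of finalized segments and leaves exactly one code path that writes.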


8.4 Audio Pipeline (Client-Side)

V1 capture modes

  1. Microphone input (default, cross-platform)
  2. Windows-only optional loopback (if implemented without extra drivers)
  3. macOS loopback via user-installed virtual device (supported if user configures; not bundled)

Client Pipeline

  1. Capture: PortAudio via sounddevice
    • internal capture format: float32 frames
    • resample to 16kHz mono for streaming
  2. Stream: gRPC StreamTranscription to server
    • chunks sent every ~100ms
    • includes timestamp for sync
  3. Display: receive TranscriptUpdate from server
    • partial updates shown in grey
    • final segments committed to UI
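
To make the chunk arithmetic in step 2 concrete: at 16 kHz mono, a 100 ms chunk is 1,600 float32 samples, or 6,400 bytes. The sketch below slices an already-resampled buffer into such chunks. `AudioChunk` here is a plain-Python stand-in, not the generated protobuf class, and its field names are assumptions.

```python
from array import array
from dataclasses import dataclass
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 100
SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 1600 samples


@dataclass(frozen=True)
class AudioChunk:
    """Hypothetical stand-in for the gRPC message; field names are assumed."""
    meeting_id: str
    audio_data: bytes      # float32 PCM, mono, 16 kHz, native byte order
    offset_seconds: float  # position of the first sample, for sync


def iter_chunks(meeting_id: str, samples: Sequence[float]) -> Iterator[AudioChunk]:
    """Yield ~100 ms chunks; the last chunk may be shorter."""
    for first in range(0, len(samples), SAMPLES_PER_CHUNK):
        window = samples[first:first + SAMPLES_PER_CHUNK]
        yield AudioChunk(
            meeting_id=meeting_id,
            audio_data=array("f", window).tobytes(),  # "f" = 4-byte float
            offset_seconds=first / SAMPLE_RATE_HZ,
        )
```

In the real client the generator would be fed from the capture callback and passed to the streaming RPC as its request iterator.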

Server Pipeline

  1. Receive: audio chunks from gRPC stream
  2. Buffer: accumulate until processable duration (~1s)
  3. VAD: silero-vad filters non-speech
  4. ASR: faster-whisper inference with word timestamps
  5. Finalize: silence boundary or max segment length
  6. Persist: segments written to DB
  7. Stream: send TranscriptUpdate back to client

Explicit failure modes

  • device unplugged → reconnect to default device; show toast
  • permission denied → block recording and show system instructions
  • sustained dropouts → stop recording safely, mark session incomplete

8.5 Transcription Engine (Partial/Final Contract)

Partial inference cadence: every ~2 seconds

Finalization rules:

  • VAD silence > 500ms finalizes current segment
  • max segment length (e.g., 20s) forces finalization to control latency/UX

Text stability rule: partial may be replaced; final never mutates.
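
The two finalization rules can be expressed as a small state machine driven by per-frame VAD decisions. This is a sketch under assumptions: `SegmentTracker` is a hypothetical name, timestamps are seconds on the audio clock, and the 20 s cap uses the example value from the rule above rather than a confirmed setting.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SILENCE_FINALIZE_S = 0.5  # VAD silence > 500 ms finalizes
MAX_SEGMENT_S = 20.0      # example cap from the spec; tune as needed


@dataclass
class SegmentTracker:
    segment_start: Optional[float] = None
    last_speech_at: Optional[float] = None

    def observe(self, now: float, is_speech: bool) -> Optional[Tuple[float, float]]:
        """Feed one VAD decision; return (start, end) when a segment finalizes."""
        if is_speech:
            if self.segment_start is None:
                self.segment_start = now
            self.last_speech_at = now
            if now - self.segment_start >= MAX_SEGMENT_S:
                return self._finalize(now)  # forced cut to bound latency
            return None
        if self.segment_start is not None and self.last_speech_at is not None:
            if now - self.last_speech_at > SILENCE_FINALIZE_S:
                return self._finalize(self.last_speech_at)
        return None

    def _finalize(self, end: float) -> Tuple[float, float]:
        assert self.segment_start is not None
        start = self.segment_start
        self.segment_start = None
        self.last_speech_at = None
        return (start, end)
```

Once `observe` returns a range, the caller would assign the Segment ID and persist the text; nothing emitted before that point is treated as stable.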


8.6 Diarization (V1 Post-Meeting Only)

  • Runs after meeting stop or on-demand
  • Produces anonymous labels
  • Time-align with transcript segments
  • Stored per meeting; no cross-meeting identity

Important: diarization is optional; must never block transcript availability.


8.7 Summarization Providers

Provider interface: Summarizer.generate(transcript: MeetingTranscript) -> MeetingSummary

Supported provider modes:

  • Cloud provider (user-supplied API key; explicit opt-in)
  • Local provider (optional; user-installed runtime; best-effort)

Privacy contract: if cloud is enabled, UI must clearly display “Transcript will be sent to provider X” at first use and in settings.
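
The provider interface and the opt-in rule can be sketched together with a structural `Protocol`. Everything beyond `Summarizer.generate` is an assumption for illustration: the `name` and `is_cloud` attributes, the `summarize` gate, the `CloudConsentRequired` error, and the simplified `MeetingTranscript` and `MeetingSummary` stand-ins (the fuller summary model is in Section 9.3).

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class MeetingTranscript:
    """Hypothetical stand-in; the spec does not define this type."""
    meeting_id: str
    segments: List[str] = field(default_factory=list)


@dataclass
class MeetingSummary:
    """Simplified stand-in for the Section 9.3 model."""
    meeting_id: str
    provider: str
    overview: str


class Summarizer(Protocol):
    name: str
    is_cloud: bool

    def generate(self, transcript: MeetingTranscript) -> MeetingSummary: ...


class CloudConsentRequired(RuntimeError):
    """Raised when a cloud provider is used without explicit opt-in."""


class EchoSummarizer:
    """Toy local provider used only to exercise the interface."""
    name = "echo-local"
    is_cloud = False

    def generate(self, transcript: MeetingTranscript) -> MeetingSummary:
        overview = " ".join(transcript.segments)[:200]
        return MeetingSummary(transcript.meeting_id, self.name, overview)


def summarize(
    provider: Summarizer, transcript: MeetingTranscript, *, cloud_consent_given: bool
) -> MeetingSummary:
    """Refuse to send a transcript off-device without consent."""
    if provider.is_cloud and not cloud_consent_given:
        raise CloudConsentRequired(f"Transcript will be sent to provider {provider.name}")
    return provider.generate(transcript)
```

Putting the consent check in one gate function, rather than inside each provider, means a new cloud provider cannot forget it.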


9. Storage & Data Model

9.1 On-Disk Layout (Per User)

  • App data directory (OS standard)

    • db/ (LanceDB)

    • meetings/<meeting_id>/

      • audio.<ext> (encrypted container)
      • manifest.json (non-sensitive)
    • logs/ (rotating; content-free)

    • settings.json

9.2 Database Schema (LanceDB)

Core tables:

  • meetings

    • id (UUID)
    • title
    • started_at, ended_at
    • source_app
    • flags: has_audio, has_summary, diarization_status
  • segments

    • id (UUID)
    • meeting_id
    • start_offset, end_offset
    • text
    • speaker_label (“Unknown”, “Speaker A”…)
    • confidence (optional)
    • embedding_vector (optional, computed post-meeting)
  • annotations

    • id
    • meeting_id
    • start_offset, end_offset
    • type
    • text
    • created_at
  • summaries

    • meeting_id
    • generated_at
    • provider
    • overview
    • points (serialized)
    • verification_report (uncited_count, etc.)

9.3 Domain Models (Pydantic v2)

Key correctness requirements:

  • enforce end >= start
  • avoid mutable defaults
  • keep “escape hatches” constrained and documented

Example models (illustrative; not exhaustive):

from __future__ import annotations

from datetime import datetime
from typing import Literal
from pydantic import BaseModel, Field, model_validator

MeetingID = str
SegmentID = str
AnnotationID = str

class MeetingMetadata(BaseModel):
    id: MeetingID
    title: str = "Untitled Meeting"
    started_at: datetime = Field(default_factory=datetime.now)
    ended_at: datetime | None = None
    trigger_source: Literal["manual", "calendar", "app", "mixed"] = "manual"
    source_app: str | None = None
    participants: list[str] = Field(default_factory=list)

class TranscriptSegment(BaseModel):
    id: SegmentID
    meeting_id: MeetingID
    start: float = Field(..., ge=0.0)
    end: float = Field(..., ge=0.0)
    text: str
    speaker_label: str = "Unknown"
    is_final: bool = True

    @model_validator(mode="after")
    def validate_times(self) -> "TranscriptSegment":
        if self.end < self.start:
            raise ValueError("segment end < start")
        return self

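class Annotation(BaseModel):
    id: AnnotationID
    meeting_id: MeetingID
    type: Literal["action_item", "decision", "note", "risk"]
    start: float = Field(..., ge=0.0)
    end: float = Field(..., ge=0.0)
    text: str
    created_at: datetime = Field(default_factory=datetime.now)

    # Same invariant as TranscriptSegment: the range must not run backwards.
    @model_validator(mode="after")
    def validate_times(self) -> "Annotation":
        if self.end < self.start:
            raise ValueError("annotation end < start")
        return self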

class SummaryPoint(BaseModel):
    category: Literal["decision", "action_item", "key_point"]
    content: str
    citation_ids: list[SegmentID] = Field(default_factory=list)
    is_cited: bool = True

class MeetingSummary(BaseModel):
    meeting_id: MeetingID
    generated_at: datetime
    provider: str
    overview: str
    points: list[SummaryPoint]
    uncited_points: list[SummaryPoint] = Field(default_factory=list)

10. Privacy, Security & Compliance

  • Persistent recording indicator (tray/menubar icon + in-app)

  • First-run permission guide:

    • microphone access
    • hotkeys/accessibility permissions if required by OS
  • One-time legal reminder: user responsibility to comply with local consent laws

10.2 Encryption at Rest (Pragmatic + Real)

Goal: protect recordings and derived data on disk.

Design: envelope encryption

  • Master key stored in OS credential store (Keychain/Credential Manager) via a cross-platform keyring abstraction.
  • Per-meeting data key (DEK) generated randomly.
  • Meeting assets (audio, sensitive metadata) encrypted with DEK.
  • DEK encrypted with master key and stored in DB.

Deletion (“cryptographic shred”)

  • Delete encrypted DEK record + delete encrypted file(s).
  • Without DEK, leftover bytes are unusable.

10.3 Retention

  • Default retention: 30 days
  • Retention job runs at app startup and once daily
  • “Delete now” always available per meeting
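
The selection half of the retention job is simple date arithmetic, sketched below. It assumes retention is measured from a meeting's end time and that meetings without an end time (still recording or incomplete) are never auto-deleted; both are assumptions, and `expired_meeting_ids` is a hypothetical name. Actual deletion would follow the cryptographic-shred steps in 10.2.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple

DEFAULT_RETENTION_DAYS = 30


def expired_meeting_ids(
    meetings: Iterable[Tuple[str, Optional[datetime]]],
    now: datetime,
    retention_days: int = DEFAULT_RETENTION_DAYS,
) -> List[str]:
    """Return IDs of meetings whose end time is older than the cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [
        meeting_id
        for meeting_id, ended_at in meetings
        # Assumption: unfinished meetings (ended_at is None) are kept.
        if ended_at is not None and ended_at < cutoff
    ]


if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    sample = [("old", current - timedelta(days=45)), ("live", None)]
    print(expired_meeting_ids(sample, current))
```

Passing `now` in explicitly keeps the function deterministic, so the startup run and the daily run can share it and tests do not depend on the wall clock.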

10.4 Telemetry (Opt-in, Content-Free)

Allowed fields only:

  • crash stacktrace (redacted paths if needed)
  • performance counters (latency, dropouts, model runtime)
  • feature toggles (summarization enabled yes/no)

Explicitly forbidden:

  • transcript text
  • audio
  • meeting titles/participants (unless user explicitly opts-in to “diagnostic mode,” which is V2+)

11. Packaging, Distribution, Updates

11.1 Packaging

  • Primary: PyInstaller-based app bundle (one-click install experience)
  • No bundled PDF engine in V1 (avoid complex native deps)
  • Exports: HTML/Markdown + OS “Print to PDF”

11.2 Code Signing & OS Requirements

  • macOS: signed + notarized app bundle
  • Windows: signed installer recommended to reduce SmartScreen friction

11.3 Updates (V1 Reality)

  • V1 includes: “Check for updates” → opens release page + shows current version
  • V1.1+ can add auto-update once packaging is stable across OS targets

12. Observability

12.1 Logging

  • Structured logging (JSON) to rotating files
  • Log levels configurable
  • Must never log transcript content or raw audio

12.2 Metrics (Local + Optional Telemetry)

Track locally:

  • audio_dropout_count
  • vad_speech_ratio
  • asr_partial_latency_ms (P50/P95)
  • asr_final_latency_ms
  • summarization_duration_ms
  • db_write_queue_depth
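
For the P50/P95 latency figures, a nearest-rank percentile over locally collected samples is enough and needs no dependencies. The sketch below is one way to compute it; `percentile` is a hypothetical helper, and nearest-rank is a choice made here, not a method the spec prescribes.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_partial_latency_target(latencies_ms: Sequence[float]) -> bool:
    """Check the NFR-01 target: P95 partial latency under 3 seconds."""
    return percentile(latencies_ms, 95) < 3000
```

Only the numeric samples are stored, which keeps the metric compatible with the content-free logging rule.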

13. Development Standards (Pragmatic)

13.1 Typing Policy

  • mypy --strict required in CI

  • Any avoided in core domain; allowed only at explicit boundaries (OS bindings, C libs)

  • type: ignore[code] allowed only with:

    1. narrow scope
    2. comment explaining why
    3. tracked follow-up task if it’s not permanent

13.2 Architecture Conventions

  • Dependency Injection for services (no heavy constructors)

  • Facade exports (__init__.py) for clean APIs

  • Module size guideline:

    • soft limit 500 LoC
    • hard limit 750 LoC → refactor into package

13.3 Testing Strategy

  • Unit tests: trigger scoring, summarization verifier, model validators
  • Integration tests: DB schema, retention deletion, encrypted asset lifecycle
  • E2E tests (required): inject prerecorded audio into pipeline; assert transcript contains expected phrases + stable segment timing behavior
  • CI must not depend on live microphone input

14. Known Risks & Mitigations (V1)

| Risk | Impact | Mitigation |
|---|---|---|
| Mic-only capture misses remote speakers (headphones) | Product feels “broken” | Provide Windows loopback option if feasible; on macOS provide “Audio Setup Wizard” supporting user-installed loopback devices; clearly label limitations in UI. |
| Whisper hallucinations on silence | Bad transcript | VAD gate; discard non-speech frames; conservative finalization. |
| Model performance on low-end CPU | Laggy UI | “Low Power Mode” (slower partial cadence), async background jobs, allow cloud ASR (optional later). |
| Diarization dependency/model availability | Feature instability | Make diarization optional + post-meeting; graceful fallback to “Unknown speaker.” |
| False trigger prompts | Annoyance | Weighted scoring + snooze + per-app suppression + “only prompt when foreground.” |
| Packaging/permissions friction | Drop-off | First-run wizard; clear permission UX; signed builds. |

15. Roadmap (V2+)

High-confidence next steps after V1 ships:

  1. Live RAG callbacks (throttled, high-signal only)
  2. Speaker identity profiles with safeguards (quarantine samples, versioning, revert)
  3. Advanced exports (PDF/DOCX via a packaging-friendly approach)
  4. Search upgrades (FTS/semantic global search performance)
  5. Cloud sync (optional) and team workspaces (separate product decision)

16. Open Questions (Engineering Spikes Required)

These must be resolved with short spikes before implementation finalization:

  1. Tray + global hotkeys compatibility with chosen UI stack on macOS/Windows
  2. Windows loopback feasibility with the selected audio library and packaging approach
  3. Diarization model choice that does not require gated downloads or accounts (or else diarization becomes V2)
  4. Local LLM summarization feasibility (quality + packaging); if not feasible, cloud-only summarization requires an explicit product decision

If you want, I can also produce a companion “Implementation Plan” (milestones + tasks + module breakdown + API skeletons) that matches this spec exactly—so engineering can start building without re-interpreting decisions.