Travis Vasceannie af1285b181 Add initial project structure and files

- Introduced .python-version for Python version management.
- Added AGENTS.md for documentation on agent usage and best practices.
- Created alembic.ini for database migration configurations.
- Implemented main.py as the entry point for the application.
- Established pyproject.toml for project dependencies and configurations.
- Initialized README.md for project overview.
- Generated uv.lock for dependency locking.
- Documented milestones and specifications in docs/milestones.md and docs/spec.md.
- Created logs/status_line.json for logging status information.
- Added initial spike implementations for UI tray hotkeys, audio capture, ASR latency, and encryption validation.
- Set up NoteFlow core structure in src/noteflow with necessary modules and services.
- Developed test suite in tests directory for application, domain, infrastructure, and integration testing.
- Included initial migration scripts in infrastructure/persistence/migrations for database setup.
- Established security protocols in infrastructure/security for key management and encryption.
- Implemented audio infrastructure for capturing and processing audio data.
- Created converters for ASR and ORM in infrastructure/converters.
- Added export functionality for different formats in infrastructure/export.
- Ensured all new files are included in the repository for future development.

2025-12-17 18:28:59 +00:00

Below is a rewritten, end‑to‑end Product Specification + Engineering Design Document for NoteFlow V1 (Minimum Lovable Product) that merges:

your revised V1 draft (confidence-model triggers, single-process, partial/final UX, extract‑then‑synthesize citations, pragmatic typing, packaging constraints, risks table), and
the de-risking feedback I gave earlier (audio capture reality, diarization scope, citation enforcement, OS permissions, shipping concerns, storage/retention, update strategy, and “don’t promise what you can’t reliably ship”).

I’ve kept it “shipping-ready” by being explicit about decisions, failure modes, acceptance criteria, and what is deferred.

NoteFlow V1 — Minimum Lovable Product

Intelligent Meeting Notetaker (Local‑first capture + navigable recall + evidence‑linked summaries)

Document Version: 1.0 (Engineering Draft)
Status: Engineering Review
Target Platforms: macOS 12+ (Monterey), Windows 10/11 (64-bit)
Primary Use Case: Zoom/Teams-style meetings and ad-hoc conversations
Core Value Proposition: “I can reliably record a meeting, read/search a transcript, and get a summary where every point links back to evidence.”

0. Glossary

Segment: A finalized chunk of transcript with start/end offsets and stable text.
Partial transcript: Unstable text shown in the live view; may be replaced. Not persisted.
Evidence link: A reference from a summary bullet to one or more Segment IDs (and timestamps).
Trigger score: Weighted confidence score (0.0–1.0) used to prompt recording.
Local-first: All recordings/transcripts stored on device by default; cloud is optional and explicit.

1. Product Strategy

1.1 Goals (V1 Must Deliver)

Reliable capture of meeting audio (with explicit scope + honest constraints).
Near real-time transcription with a stable partial/final UX.
Post‑meeting review with:
- transcript navigation,
- audio playback synced to timestamps,
- annotations (action items/decisions/notes),
- an evidence‑linked summary (no uncited claims).
Local-first storage with retention controls and deletion that is actually meaningful.
A foundation for V2 (speaker identity, live RAG callbacks, advanced exports) without building them now.

1.2 Non‑Goals (V1 Will Not Promise)

Fully autonomous “always start recording” behavior by default.
Biometric speaker identification (“this is Alice”) or cross‑meeting voice profiles.
Live “RAG callback cards” injected during meetings.
Team workspaces / cloud sync / org deployment.
PDF/DOCX export bundled in-app (V1 exports Markdown/HTML; PDF is via OS print).
Perfect diarization accuracy; diarization is best-effort and post‑meeting only.

2. Scope: V1 vs V2+

Feature Area	V1 Scope (Must Ship)	Deferred (V2+)
Audio Capture	Mic capture (default). Windows-only optional system loopback (no drivers) if feasible. macOS loopback requires user-installed device; V1 supports selecting it but does not ship drivers.	First-class macOS system audio capture without user setup; multi-source mixing; per-app capture.
Transcription	Near real-time via partial/final segments; timestamps; searchable transcript.	Multi-language translation, custom vocab, advanced diarization alignment.
Speakers	Anonymous speaker separation (post‑meeting best-effort): “Speaker A/B/C”. Rename per meeting (non-biometric).	Voice profiles, biometric identification, continuous learning loop.
Triggers	Weighted confidence model; user confirmation by default; snooze and per-app suppression.	Fully autonomous auto-start as default; “call state” deep integrations.
Intelligence	Evidence-based summary (citations enforced).	Live RAG callbacks; cross-meeting memory assistant.
Storage	Local per-user database + encrypted assets; retention + deletion.	Cloud sync; team search; shared templates.
Export	Markdown/HTML + clipboard; “Print to PDF” via OS.	Bundled PDF/DOCX, templating marketplace.

3. Success Metrics & Acceptance Criteria

3.1 Product Metrics (V1)

Core loop latency (P95): word spoken → visible partial text < 3.0s
Session reliability: crash rate < 0.5% for sessions > 60 minutes
False trigger prompts: < 1 prompt/day/user median; < 3 P95
Citation correctness: ≥ 90% of summary bullets link to supporting transcript segments (human audit)

3.2 “Must Work” Acceptance Criteria (Release Blockers)

User can start/stop recording manually from tray/menubar or hotkey.
Transcript segments are persisted and viewable after app restart.
Clicking a summary bullet jumps to the cited transcript segment (and audio if stored).
Deleting a meeting removes transcript + audio in a way that prevents casual recovery.
App never records without a visible, persistent indicator.

4. User Experience

4.1 Primary Screens

Tray/Menubar Control
- Start / Stop recording
- Open NoteFlow
- Snooze triggers (15m / 1h / 2h / today)
- Settings
Active Meeting View
- Recording indicator + timer
- VU meter (trust signal)
- Rolling transcript:
  - Partial text in grey (unstable)
  - Final text in normal text (committed)
- Annotation hotkeys (Action / Decision / Note)
- “Mark moment” button (adds timestamped note instantly)
Post‑Meeting Review
- Transcript with search (in-meeting search is required; global search is “basic” in V1)
- Speaker labels (if diarization completed)
- Audio playback controls (if audio stored)
- Summary panel with evidence links
- Export buttons: Copy Markdown / Save HTML
Meeting Library
- List of meetings (title, date, duration, source)
- Keyword search (V1: scan-based acceptable up to defined limits)
- Filters: date range, source app, “has action items”
Settings
- Trigger sensitivity & sources
- Audio device selection + test
- “Store audio” toggle + retention days
- Summarization provider (local/cloud) + privacy consent
- Telemetry opt-in

5. Core Workflows

5.1 Workflow A — Smart Prompt to Record (Weighted Confidence Model)

Inputs (each produces a score contribution):

Calendar proximity (optional connector): meeting starts within 5 minutes → +0.30
Foreground app: Zoom/Teams/etc is frontmost → +0.40
Audio activity: mic level above threshold for 5s → +0.30

Threshold behavior

Score < 0.40: ignore
0.40–0.79: show notification: “Meeting detected. Start NoteFlow?”
≥ 0.80: auto-start only if user explicitly enabled

Controls

Snooze button included on prompt
“Don’t prompt for this app” option
If already recording, ignore all new triggers

Engineering note (explicit constraint): V1 does not claim true “call state” detection. Foreground app + audio activity + calendar is the reliable baseline.
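
The scoring and threshold rules above can be sketched as a small pure function. This is a minimal illustration, not the shipped implementation: the names `TriggerSignals`, `trigger_score`, and `decide` are hypothetical, and falling back to a prompt when the score is ≥ 0.80 but auto-start is disabled is an assumption the spec does not state explicitly.

```python
from dataclasses import dataclass

# Weights and thresholds copied from Workflow A.
WEIGHTS = {"calendar": 0.30, "foreground_app": 0.40, "audio_activity": 0.30}
PROMPT_THRESHOLD = 0.40
AUTO_START_THRESHOLD = 0.80


@dataclass(frozen=True)
class TriggerSignals:
    calendar: bool = False        # meeting starts within 5 minutes
    foreground_app: bool = False  # Zoom/Teams/etc. is frontmost
    audio_activity: bool = False  # mic level above threshold for 5 s


def trigger_score(signals: TriggerSignals) -> float:
    """Sum the weights of all active signals (0.0 to 1.0)."""
    score = sum(w for name, w in WEIGHTS.items() if getattr(signals, name))
    return round(score, 2)  # rounding avoids float noise at the thresholds


def decide(
    signals: TriggerSignals,
    *,
    recording: bool = False,
    snoozed: bool = False,
    auto_start_enabled: bool = False,
) -> str:
    """Return "ignore", "prompt", or "auto_start"."""
    if recording or snoozed:
        return "ignore"  # already recording, or the user snoozed triggers
    score = trigger_score(signals)
    if score < PROMPT_THRESHOLD:
        return "ignore"
    if score >= AUTO_START_THRESHOLD and auto_start_enabled:
        return "auto_start"
    return "prompt"  # assumption: high scores still prompt unless auto-start is on
```

Keeping the decision free of OS calls makes it straightforward to cover in the unit tests listed in Section 13.3.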

5.2 Workflow B — Live Transcription (Partial → Final)

User starts recording (manual or triggered).
Audio pipeline streams frames into ring buffer.
VAD segments speech regions.
Transcriber produces partial hypothesis every ~2 seconds.
When VAD detects silence > 500ms (or max segment duration reached), commit final segment:
- assign stable Segment ID
- store text + timestamps
- update UI (partial becomes final)

UI invariant: final segments never change text; corrections happen by creating a new segment (V2) or via explicit “edit transcript” (deferred).
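
One way to hold the UI invariant is to keep finals in an append-only list and the partial in a single replaceable slot. The sketch below is illustrative only; `LiveTranscript` and `FinalSegment` are hypothetical names, not part of the NoteFlow codebase.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen: a final segment cannot be mutated after creation
class FinalSegment:
    segment_id: str
    text: str
    start: float
    end: float


@dataclass
class LiveTranscript:
    finals: list[FinalSegment] = field(default_factory=list)
    partial: str = ""  # unstable text; never persisted

    def on_partial(self, text: str) -> None:
        """Replace the current partial hypothesis wholesale."""
        self.partial = text

    def on_final(self, segment: FinalSegment) -> None:
        """Commit a final segment and clear the partial it supersedes."""
        if any(s.segment_id == segment.segment_id for s in self.finals):
            raise ValueError(f"segment {segment.segment_id} is already final")
        self.finals.append(segment)
        self.partial = ""

    def render(self) -> list[tuple[str, str]]:
        """Return (state, text) rows: finals first, then the grey partial."""
        rows = [("final", s.text) for s in self.finals]
        if self.partial:
            rows.append(("partial", self.partial))
        return rows
```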

5.3 Workflow C — Post‑Meeting Summary with Enforced Citations (“Extract → Synthesize → Verify”)

Goal: no summary bullet can exist without a citation.

Chunking: transcript segments grouped into blocks ~500 tokens (segment-aware).
Extraction prompt: model must return a list of:
- quote (verbatim excerpt)
- segment_ids (one or more)
- category (decision/action/key_point)
Synthesis prompt: rewrite extracted quotes into a professional bullet list; each bullet ends with [...] containing Segment IDs.
Verification:
- parse bullets; if any bullet lacks [...], mark it uncited and do not show it by default (user can reveal “uncited drafts” panel)
Display: clicking a bullet scrolls transcript to cited segment(s) and sets playback time.
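
The verification step can be sketched as a parser that splits synthesized bullets into cited and uncited groups. This is a minimal sketch under stated assumptions: the function name `verify_bullets` is hypothetical, IDs inside the brackets are assumed to be comma-separated, and treating a bullet whose IDs are all unknown as uncited goes slightly beyond what the workflow text requires.

```python
import re

# Matches a trailing "[id1, id2]" block at the end of a bullet.
CITATION_RE = re.compile(r"\[([^\[\]]*)\]\s*$")


def verify_bullets(
    bullets: list[str], known_segment_ids: set[str]
) -> tuple[list[tuple[str, list[str]]], list[str]]:
    """Split bullets into (cited, uncited).

    cited:   (bullet text without the citation block, valid segment IDs)
    uncited: raw bullets with no citation block or no known segment ID
    """
    cited: list[tuple[str, list[str]]] = []
    uncited: list[str] = []
    for bullet in bullets:
        match = CITATION_RE.search(bullet)
        ids: list[str] = []
        if match:
            candidates = (part.strip() for part in match.group(1).split(","))
            ids = [sid for sid in candidates if sid in known_segment_ids]
        if match and ids:
            cited.append((bullet[: match.start()].rstrip(), ids))
        else:
            uncited.append(bullet)  # hidden by default; shown in "uncited drafts"
    return cited, uncited
```

The length of the `uncited` list could feed the `verification_report` stored with the summary (Section 9.2).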

5.4 Workflow D — Best‑Effort Anonymous Diarization (Post‑Meeting)

V1 approach: diarization is a background job after recording stops (not real-time).

If diarization enabled, run pipeline on recorded audio.
Obtain speaker turns and cluster labels.
Align speaker turns to transcript segments by time overlap.
Assign “Speaker A/B/C” per meeting.
User can rename speakers per meeting (non-biometric).

Failure handling: if diarization model unavailable or too slow, transcript remains “Unknown speaker.”
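
The alignment step (speaker turns to transcript segments by time overlap) might look like the sketch below. It assumes the diarization pipeline yields `(start, end, cluster)` tuples and that each segment takes the cluster with the largest total overlap; the function names are hypothetical.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: list[tuple[float, float]],
    turns: list[tuple[float, float, int]],
) -> list[str]:
    """Return one label per segment: "Speaker A/B/C..." or "Unknown"."""
    names: dict[int, str] = {}  # cluster id -> per-meeting anonymous label
    labels: list[str] = []
    for seg_start, seg_end in segments:
        totals: dict[int, float] = {}
        for turn_start, turn_end, cluster in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > 0:
                totals[cluster] = totals.get(cluster, 0.0) + shared
        if not totals:
            labels.append("Unknown")  # no speaker turn covers this segment
            continue
        best = max(totals, key=lambda c: totals[c])
        if best not in names:
            # Letters are assigned in order of first appearance (A, B, C, ...).
            names[best] = f"Speaker {chr(ord('A') + len(names))}"
        labels.append(names[best])
    return labels
```

Labels are scoped to one call, which matches the "per meeting, no cross-meeting identity" rule.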

6. Functional Requirements (FR)

6.1 Recording & Audio

FR-01 Manual start/stop recording from tray/menubar.
FR-02 Global hotkey start/stop (configurable; can be disabled).
FR-03 Visible recording indicator whenever audio capture is active.
FR-04 Audio device selection + test page (VU meter).
FR-05 Audio dropouts handled gracefully:
- attempt reconnect
- if reconnection fails, prompt user and stop recording safely (flush files)

6.2 Transcription

FR-10 Near real-time transcript view with partial/final states.
FR-11 Persist finalized transcript segments with timestamps.
FR-12 Transcript is searchable within a meeting.

6.3 Annotations

FR-20 Add annotations during recording and review:
- types: action_item, decision, note, risk (risk is allowed but not required in summary)
FR-21 An annotation always includes:
- timestamp range
- text
- origin: user/system (V1: system used only for “uncited draft” metadata; no RAG callbacks)

6.4 Summaries

FR-30 Generate summary on demand (and optionally auto after stop).
FR-31 Enforce citations; uncited bullets are suppressed by default.
FR-32 Summary bullets clickable → jump to transcript + playback time.

6.5 Library & Search

FR-40 Meeting library list with sorting and basic search.
FR-41 Delete meeting removes transcript + audio + summary.

6.6 Settings & Privacy

FR-50 Retention policy (default 30 days, configurable).
FR-51 Cloud summarization requires explicit opt-in and provider selection.
FR-52 Telemetry is opt-in and content-free.

7. Non‑Functional Requirements (NFR)

7.1 Performance

NFR-01 P95 partial transcript latency < 3s on baseline hardware (defined in release checklist).
NFR-02 Background jobs (diarization, embeddings) must not freeze UI; they run in worker threads and report progress.

7.2 Reliability

NFR-10 Crash-safe persistence:
- audio file is written incrementally
- transcript segments flushed within 2s of finalization
NFR-11 On restart after crash, last session is recoverable (meeting marked “incomplete”).

7.3 Security & Privacy

NFR-20 Local data encrypted at rest (see Section 10).
NFR-21 No recording without indicator.
NFR-22 No content in telemetry logs.

8. Technical Architecture

8.1 Process Model

Decision: Client-Server architecture with gRPC.

The system is split into two components that can run on the same machine or separately:

Server (Headless Backend)

ASR Engine: faster-whisper for transcription
Meeting Store: in-memory meeting management
Storage: LanceDB for persistence + encrypted audio assets
gRPC Service: bidirectional streaming for real-time transcription

Client (GUI Application)

UI: Flet (Python) for main window
Tray/Menubar: native integration layer (pystray)
Audio Capture: sounddevice for local mic capture
gRPC Client: streams audio to server, receives transcripts

Rationale:

Enables headless server deployment (e.g., home server, NAS)
Client can run on any machine with audio hardware
Separates compute-heavy ASR from UI responsiveness
Maintains local-first operation when both run on same machine

Deployment modes:

Local: Server + Client on same machine (default)
Split: Server on headless machine, Client on workstation with audio

8.2 gRPC Service Contract

Service: NoteFlowService

RPC	Type	Purpose
`StreamTranscription`	Bidirectional stream	Audio chunks → transcript updates
`CreateMeeting`	Unary	Start a new meeting
`StopMeeting`	Unary	Stop recording
`ListMeetings`	Unary	Query meetings
`GetMeeting`	Unary	Get meeting details
`GenerateSummary`	Unary	Generate evidence-linked summary
`GetServerInfo`	Unary	Health check + capabilities

Audio streaming contract:

Client sends AudioChunk messages (float32, 16kHz mono)
Server responds with TranscriptUpdate messages (partial or final)
Final segments include word-level timestamps
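
To make the streaming contract concrete, here is a hedged sketch of the two message shapes and a chunking helper. The field names are assumptions for illustration (the actual protobuf definitions are not part of this document), and the ~100 ms chunk size is taken from the client pipeline in Section 8.4.

```python
from array import array
from dataclasses import dataclass
from typing import Iterator

SAMPLE_RATE = 16_000                                  # 16 kHz mono, per the contract
CHUNK_MS = 100                                        # chunk cadence from Section 8.4
SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000    # 1600 samples


@dataclass(frozen=True)
class AudioChunk:                 # hypothetical stand-in for the protobuf message
    meeting_id: str
    timestamp: float              # seconds since recording start, used for sync
    pcm: bytes                    # float32 mono samples


@dataclass(frozen=True)
class TranscriptUpdate:           # hypothetical stand-in for the protobuf message
    meeting_id: str
    text: str
    is_final: bool                # False: partial (replaceable); True: committed
    start: float
    end: float


def chunk_audio(meeting_id: str, samples: array) -> Iterator[AudioChunk]:
    """Slice a float32 sample buffer into ~100 ms chunks with timestamps."""
    for offset in range(0, len(samples), SAMPLES_PER_CHUNK):
        window = samples[offset : offset + SAMPLES_PER_CHUNK]
        yield AudioChunk(meeting_id, offset / SAMPLE_RATE, window.tobytes())
```

For example, 4000 samples (250 ms) yield two full chunks and one 50 ms remainder.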

8.3 Concurrency & Threading

Server:

gRPC thread pool: handles incoming requests
ASR worker: processes audio buffers through faster-whisper
IO worker: persists segments + meeting metadata

Client:

Main/UI thread: rendering + user actions
Audio thread (high priority): capture callback → gRPC stream
gRPC stream thread: sends audio, receives transcripts
Event dispatch: updates UI from transcript callbacks

Hard rule: Server's IO worker is the only component that writes to the database (prevents corruption/races).
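
A common way to enforce a single-writer rule is a queue drained by one dedicated thread. The sketch below assumes a generic `write_fn` callable standing in for the real database write; `SingleWriter` is a hypothetical name.

```python
import queue
import threading
from typing import Any, Callable

_STOP = object()  # sentinel that tells the worker to exit


class SingleWriter:
    """Funnels all writes through one thread so no two writers race."""

    def __init__(self, write_fn: Callable[[Any], None]) -> None:
        self._write_fn = write_fn
        self._queue: queue.Queue[Any] = queue.Queue()
        self._thread = threading.Thread(target=self._run, name="io-worker", daemon=True)
        self._thread.start()

    def submit(self, record: Any) -> None:
        """Called from any thread (gRPC handlers, ASR worker); never writes directly."""
        self._queue.put(record)

    def depth(self) -> int:
        """Approximate backlog; could back the db_write_queue_depth metric."""
        return self._queue.qsize()

    def close(self) -> None:
        """Drain everything queued so far, then stop the worker."""
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            record = self._queue.get()
            if record is _STOP:
                break
            self._write_fn(record)  # the only place a write happens
```

Because the queue is FIFO, segments are persisted in the order they were finalized.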

8.4 Audio Pipeline (Client-Side)

V1 capture modes

Microphone input (default, cross-platform)
Windows-only optional loopback (if implemented without extra drivers)
macOS loopback via user-installed virtual device (supported if user configures; not bundled)

Client Pipeline

Capture: PortAudio via sounddevice
- internal capture format: float32 frames
- resample to 16kHz mono for streaming
Stream: gRPC StreamTranscription to server
- chunks sent every ~100ms
- includes timestamp for sync
Display: receive TranscriptUpdate from server
- partial updates shown in grey
- final segments committed to UI

Server Pipeline

Receive: audio chunks from gRPC stream
Buffer: accumulate until processable duration (~1s)
VAD: silero-vad filters non-speech
ASR: faster-whisper inference with word timestamps
Finalize: silence boundary or max segment length
Persist: segments written to DB
Stream: send TranscriptUpdate back to client

Explicit failure modes

device unplugged → reconnect to default device; show toast
permission denied → block recording and show system instructions
sustained dropouts → stop recording safely, mark session incomplete

8.5 Transcription Engine (Partial/Final Contract)

Partial inference cadence: every ~2 seconds

Finalization rules:

VAD silence > 500ms finalizes current segment
max segment length (e.g., 20s) forces finalization to control latency/UX

Text stability rule: partial may be replaced; final never mutates.
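
The finalization rules can be expressed as a small state machine fed by VAD frames. This is a sketch, assuming fixed-duration frames with a speech/non-speech flag and integer millisecond timestamps (to avoid float drift); `SegmentFinalizer` is a hypothetical name and the 20 s cap is the example value from the rules above.

```python
from typing import Optional

SILENCE_FINALIZE_MS = 500    # VAD silence > 500 ms finalizes the segment
MAX_SEGMENT_MS = 20_000      # example cap from the spec (20 s)


class SegmentFinalizer:
    def __init__(self) -> None:
        self._start: Optional[int] = None   # start of the open segment, if any
        self._last_speech_end = 0
        self._silence = 0

    def push(self, t_start: int, duration: int, is_speech: bool) -> Optional[tuple[int, int]]:
        """Feed one VAD frame; return (start_ms, end_ms) when a segment finalizes."""
        t_end = t_start + duration
        if is_speech:
            if self._start is None:
                self._start = t_start
            self._last_speech_end = t_end
            self._silence = 0
            if t_end - self._start >= MAX_SEGMENT_MS:
                return self._finish()       # forced finalization at max length
        elif self._start is not None:
            self._silence += duration
            if self._silence > SILENCE_FINALIZE_MS:
                return self._finish()       # silence boundary reached
        return None

    def _finish(self) -> tuple[int, int]:
        assert self._start is not None
        segment = (self._start, self._last_speech_end)
        self._start = None
        self._silence = 0
        return segment
```

The returned end offset is the end of the last speech frame, so trailing silence is not counted as part of the segment.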

8.6 Diarization (V1 Post‑Meeting Only)

Runs after meeting stop or on-demand
Produces anonymous labels
Time-align with transcript segments
Stored per meeting; no cross-meeting identity

Important: diarization is optional; must never block transcript availability.

8.7 Summarization Providers

Provider interface: Summarizer.generate(transcript: MeetingTranscript) -> MeetingSummary

Supported provider modes:

Cloud provider (user-supplied API key; explicit opt-in)
Local provider (optional; user-installed runtime; best-effort)

Privacy contract: if cloud is enabled, UI must clearly display “Transcript will be sent to provider X” at first use and in settings.
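
The provider interface and the consent gate could be sketched with a `typing.Protocol`. Everything beyond the `generate` signature is an assumption for illustration: the `name` and `requires_cloud_consent` attributes, the `ConsentRequired` error, and the simplified dataclasses that stand in for the Pydantic models of Section 9.3.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class MeetingTranscript:          # simplified stand-in for the real model
    meeting_id: str
    segments: list[tuple[str, str]] = field(default_factory=list)  # (segment_id, text)


@dataclass
class MeetingSummary:             # simplified stand-in for the real model
    meeting_id: str
    provider: str
    overview: str


class Summarizer(Protocol):
    name: str
    requires_cloud_consent: bool

    def generate(self, transcript: MeetingTranscript) -> MeetingSummary: ...


class ConsentRequired(RuntimeError):
    """Raised when a cloud provider is used before the user has opted in."""


def summarize(
    provider: Summarizer, transcript: MeetingTranscript, *, cloud_consent_given: bool
) -> MeetingSummary:
    if provider.requires_cloud_consent and not cloud_consent_given:
        raise ConsentRequired(f"Transcript will be sent to provider {provider.name}")
    return provider.generate(transcript)


class LocalStub:
    """Trivial local provider used only to exercise the interface."""

    name = "local-stub"
    requires_cloud_consent = False

    def generate(self, transcript: MeetingTranscript) -> MeetingSummary:
        overview = f"{len(transcript.segments)} segment(s) transcribed"
        return MeetingSummary(transcript.meeting_id, self.name, overview)
```

Putting the consent check in one function, rather than inside each provider, keeps the privacy contract enforceable in a single place.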

9. Storage & Data Model

9.1 On-Disk Layout (Per User)

App data directory (OS standard)
- db/ (LanceDB)
- meetings/<meeting_id>/
  - audio.<ext> (encrypted container)
  - manifest.json (non-sensitive)
- logs/ (rotating; content-free)
- settings.json

9.2 Database Schema (LanceDB)

Core tables:

meetings
- id (UUID)
- title
- started_at, ended_at
- source_app
- flags: has_audio, has_summary, diarization_status
segments
- id (UUID)
- meeting_id
- start_offset, end_offset
- text
- speaker_label (“Unknown”, “Speaker A”…)
- confidence (optional)
- embedding_vector (optional, computed post‑meeting)
annotations
- id
- meeting_id
- start_offset, end_offset
- type
- text
- created_at
summaries
- meeting_id
- generated_at
- provider
- overview
- points (serialized)
- verification_report (uncited_count, etc.)

9.3 Domain Models (Pydantic v2)

Key correctness requirements:

enforce end >= start
avoid mutable defaults
keep “escape hatches” constrained and documented

Example models (illustrative; not exhaustive):

from __future__ import annotations

from datetime import datetime
from typing import Literal
from pydantic import BaseModel, Field, model_validator

MeetingID = str
SegmentID = str
AnnotationID = str

class MeetingMetadata(BaseModel):
    id: MeetingID
    title: str = "Untitled Meeting"
    started_at: datetime = Field(default_factory=datetime.now)
    ended_at: datetime | None = None
    trigger_source: Literal["manual", "calendar", "app", "mixed"] = "manual"
    source_app: str | None = None
    participants: list[str] = Field(default_factory=list)

class TranscriptSegment(BaseModel):
    id: SegmentID
    meeting_id: MeetingID
    start: float = Field(..., ge=0.0)
    end: float = Field(..., ge=0.0)
    text: str
    speaker_label: str = "Unknown"
    is_final: bool = True

    @model_validator(mode="after")
    def validate_times(self) -> "TranscriptSegment":
        if self.end < self.start:
            raise ValueError("segment end < start")
        return self

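class Annotation(BaseModel):
    id: AnnotationID
    meeting_id: MeetingID
    type: Literal["action_item", "decision", "note", "risk"]
    start: float = Field(..., ge=0.0)
    end: float = Field(..., ge=0.0)
    text: str
    origin: Literal["user", "system"] = "user"  # required by FR-21
    created_at: datetime = Field(default_factory=datetime.now)

    @model_validator(mode="after")
    def validate_times(self) -> "Annotation":
        # Same end >= start rule as TranscriptSegment.
        if self.end < self.start:
            raise ValueError("annotation end < start")
        return self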

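class SummaryPoint(BaseModel):
    category: Literal["decision", "action_item", "key_point"]
    content: str
    citation_ids: list[SegmentID] = Field(default_factory=list)
    is_cited: bool = True

    @model_validator(mode="after")
    def sync_is_cited(self) -> "SummaryPoint":
        # A point counts as cited only if it references at least one segment,
        # so the flag can never disagree with citation_ids.
        self.is_cited = bool(self.citation_ids)
        return self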

class MeetingSummary(BaseModel):
    meeting_id: MeetingID
    generated_at: datetime
    provider: str
    overview: str
    points: list[SummaryPoint]
    uncited_points: list[SummaryPoint] = Field(default_factory=list)

10. Privacy, Security & Compliance

10.1 Recording Indicator, Permissions & Consent

Persistent recording indicator (tray/menubar icon + in-app)
First-run permission guide:
- microphone access
- hotkeys/accessibility permissions if required by OS
One-time legal reminder: user responsibility to comply with local consent laws

10.2 Encryption at Rest (Pragmatic + Real)

Goal: protect recordings and derived data on disk.

Design: envelope encryption

Master key stored in OS credential store (Keychain/Credential Manager) via a cross-platform keyring abstraction.
Per-meeting data key (DEK) generated randomly.
Meeting assets (audio, sensitive metadata) encrypted with DEK.
DEK encrypted with master key and stored in DB.

Deletion (“cryptographic shred”)

Delete encrypted DEK record + delete encrypted file(s).
Without DEK, leftover bytes are unusable.
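
The envelope scheme and the shred step can be illustrated with the third-party `cryptography` package. The spec does not name a crypto library, so the use of Fernet here is an assumption, as are the function names; in the real design the master key would come from the OS credential store rather than being passed around.

```python
from cryptography.fernet import Fernet


def encrypt_asset(master_key: bytes, plaintext: bytes) -> tuple[bytes, bytes]:
    """Return (wrapped_dek, ciphertext) for one meeting asset."""
    dek = Fernet.generate_key()                     # random per-meeting data key
    ciphertext = Fernet(dek).encrypt(plaintext)     # asset encrypted with the DEK
    wrapped_dek = Fernet(master_key).encrypt(dek)   # DEK encrypted with the master key
    return wrapped_dek, ciphertext


def decrypt_asset(master_key: bytes, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    """Unwrap the DEK with the master key, then decrypt the asset."""
    dek = Fernet(master_key).decrypt(wrapped_dek)
    return Fernet(dek).decrypt(ciphertext)


def shred(dek_table: dict[str, bytes], meeting_id: str) -> None:
    """Cryptographic shred: drop the wrapped DEK so leftover bytes are unusable."""
    dek_table.pop(meeting_id, None)
```

Fernet encrypts a whole payload in memory, so an incrementally written audio file (NFR-10) would need a chunked or streaming container on top of the same key hierarchy.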

10.3 Retention

Default retention: 30 days
Retention job runs at app startup and once daily
“Delete now” always available per meeting
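
The selection step of the retention job might be as simple as the sketch below. It assumes meetings are available as `(id, ended_at)` pairs and that meetings still in progress (`ended_at` is `None`) are never expired; the function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def expired_meeting_ids(
    meetings: list[tuple[str, Optional[datetime]]],
    now: datetime,
    retention_days: int = 30,   # default retention from the spec
) -> list[str]:
    """Return IDs of meetings that ended before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return [
        meeting_id
        for meeting_id, ended_at in meetings
        if ended_at is not None and ended_at < cutoff
    ]
```

Passing `now` in explicitly keeps the job deterministic under test; each returned ID would then go through the same delete path as “Delete now”.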

10.4 Telemetry (Opt-in, Content-Free)

Allowed fields only:

crash stacktrace (redacted paths if needed)
performance counters (latency, dropouts, model runtime)
feature toggles (summarization enabled yes/no)

Explicitly forbidden:

transcript text
audio
meeting titles/participants (unless user explicitly opts-in to “diagnostic mode,” which is V2+)
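
An allow-list is easier to audit than a deny-list: anything not named is dropped. The sketch below is illustrative; the field names in `ALLOWED_FIELDS` are hypothetical examples of the allowed categories, not a defined schema.

```python
# Hypothetical field names covering the allowed categories above.
ALLOWED_FIELDS = frozenset({
    "crash_stacktrace",
    "asr_partial_latency_ms",
    "audio_dropout_count",
    "model_runtime_ms",
    "summarization_enabled",
})


def scrub_event(event: dict[str, object]) -> dict[str, object]:
    """Keep only allow-listed keys; everything else is dropped before sending."""
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}
```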

11. Packaging, Distribution, Updates

11.1 Packaging

Primary: PyInstaller-based app bundle (one-click install experience)
No bundled PDF engine in V1 (avoid complex native deps)
Exports: HTML/Markdown + OS “Print to PDF”

11.2 Code Signing & OS Requirements

macOS: signed + notarized app bundle
Windows: signed installer recommended to reduce SmartScreen friction

11.3 Updates (V1 Reality)

V1 includes: “Check for updates” → opens release page + shows current version
V1.1+ can add auto-update once packaging is stable across OS targets

12. Observability

12.1 Logging

Structured logging (JSON) to rotating files
Log levels configurable
Must never log transcript content or raw audio
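
These logging requirements map onto the standard library fairly directly. A minimal sketch, assuming a logger named `noteflow`, a 1 MB rotation size, and a convention where structured values are passed via `extra={"fields": {...}}` (all three are assumptions):

```python
import json
import logging
from logging.handlers import RotatingFileHandler


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),            # event name, never transcript text
            "fields": getattr(record, "fields", {}),  # numeric/flag values only
        }
        return json.dumps(payload)


def build_logger(path: str, level: int = logging.INFO) -> logging.Logger:
    handler = RotatingFileHandler(path, maxBytes=1_000_000, backupCount=3, encoding="utf-8")
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("noteflow")
    logger.setLevel(level)          # log level is configurable
    logger.addHandler(handler)
    logger.propagate = False        # keep records out of the root logger
    return logger
```

A call such as `logger.info("segment_finalized", extra={"fields": {"duration_ms": 1200}})` records what happened and how long it took without touching content.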

12.2 Metrics (Local + Optional Telemetry)

Track locally:

audio_dropout_count
vad_speech_ratio
asr_partial_latency_ms (P50/P95)
asr_final_latency_ms
summarization_duration_ms
db_write_queue_depth
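
For the P50/P95 latency figures, a nearest-rank percentile over locally recorded samples is usually enough. The sketch below is one possible definition (the spec does not say which percentile method to use), with hypothetical function names.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples recorded")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # ceil(p / 100 * n) computed with integers to avoid float rounding
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_report(samples_ms: list[float]) -> dict[str, float]:
    """Summarize asr_partial_latency_ms samples as P50 and P95."""
    return {"p50": percentile(samples_ms, 50), "p95": percentile(samples_ms, 95)}
```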

13. Development Standards (Pragmatic)

13.1 Typing Policy

mypy --strict required in CI
Any avoided in core domain; allowed only at explicit boundaries (OS bindings, C libs)
type: ignore[code] allowed only with:
1. narrow scope
2. comment explaining why
3. tracked follow-up task if it’s not permanent

13.2 Architecture Conventions

Dependency Injection for services (no heavy constructors)
Facade exports (__init__.py) for clean APIs
Module size guideline:
- soft limit 500 LoC
- hard limit 750 LoC → refactor into package

13.3 Testing Strategy

Unit tests: trigger scoring, summarization verifier, model validators
Integration tests: DB schema, retention deletion, encrypted asset lifecycle
E2E tests (required): inject prerecorded audio into pipeline; assert transcript contains expected phrases + stable segment timing behavior
CI must not depend on live microphone input

14. Known Risks & Mitigations (V1)

Risk	Impact	Mitigation
Mic-only capture misses remote speakers (headphones)	Product feels “broken”	Provide Windows loopback option if feasible; on macOS provide “Audio Setup Wizard” supporting user-installed loopback devices; clearly label limitations in UI.
Whisper hallucinations on silence	Bad transcript	VAD gate; discard non-speech frames; conservative finalization.
Model performance on low-end CPU	Laggy UI	“Low Power Mode” (slower partial cadence), async background jobs, allow cloud ASR (optional later).
Diarization dependency/model availability	Feature instability	Make diarization optional + post-meeting; graceful fallback to “Unknown speaker.”
False trigger prompts	Annoyance	Weighted scoring + snooze + per-app suppression + “only prompt when foreground.”
Packaging/permissions friction	Drop-off	First-run wizard; clear permission UX; signed builds.

15. Roadmap (V2+)

High-confidence next steps after V1 ships:

Live RAG callbacks (throttled, high-signal only)
Speaker identity profiles with safeguards (quarantine samples, versioning, revert)
Advanced exports (PDF/DOCX via a packaging-friendly approach)
Search upgrades (FTS/semantic global search performance)
Cloud sync (optional) and team workspaces (separate product decision)

16. Open Questions (Engineering Spikes Required)

These must be resolved with short spikes before implementation finalization:

Tray + global hotkeys compatibility with chosen UI stack on macOS/Windows
Windows loopback feasibility with the selected audio library and packaging approach
Diarization model choice that does not require gated downloads or accounts (or else diarization becomes V2)
Local LLM summarization feasibility (quality + packaging); if not feasible, cloud-only summarization requires an explicit product decision

