Below is a rewritten, end‑to‑end Product Specification + Engineering Design Document for NoteFlow V1 (Minimum Lovable Product) that merges:
- your revised V1 draft (confidence-model triggers, single-process, partial/final UX, extract‑then‑synthesize citations, pragmatic typing, packaging constraints, risks table), and
- the de-risking feedback I gave earlier (audio capture reality, diarization scope, citation enforcement, OS permissions, shipping concerns, storage/retention, update strategy, and “don’t promise what you can’t reliably ship”).
I’ve kept it “shipping-ready” by being explicit about decisions, failure modes, acceptance criteria, and what is deferred.
NoteFlow V1 — Minimum Lovable Product
Intelligent Meeting Notetaker (Local‑first capture + navigable recall + evidence‑linked summaries)
Document Version: 1.0 (Engineering Draft)
Status: Engineering Review
Target Platforms: macOS 12+ (Monterey), Windows 10/11 (64-bit)
Primary Use Case: Zoom/Teams-style meetings and ad-hoc conversations
Core Value Proposition: “I can reliably record a meeting, read/search a transcript, and get a summary where every point links back to evidence.”
0. Glossary
- Segment: A finalized chunk of transcript with start/end offsets and stable text.
- Partial transcript: Unstable text shown in the live view; may be replaced. Not persisted.
- Evidence link: A reference from a summary bullet to one or more Segment IDs (and timestamps).
- Trigger score: Weighted confidence score (0.0–1.0) used to prompt recording.
- Local-first: All recordings/transcripts stored on device by default; cloud is optional and explicit.
1. Product Strategy
1.1 Goals (V1 Must Deliver)
- Reliable capture of meeting audio (with explicit scope + honest constraints).
- Near real-time transcription with a stable partial/final UX.
- Post‑meeting review with:
  - transcript navigation,
  - audio playback synced to timestamps,
  - annotations (action items/decisions/notes),
  - an evidence‑linked summary (no uncited claims).
- Local-first storage with retention controls and deletion that is actually meaningful.
- A foundation for V2 (speaker identity, live RAG callbacks, advanced exports) without building them now.
1.2 Non‑Goals (V1 Will Not Promise)
- Fully autonomous “always start recording” behavior by default.
- Biometric speaker identification (“this is Alice”) or cross‑meeting voice profiles.
- Live “RAG callback cards” injected during meetings.
- Team workspaces / cloud sync / org deployment.
- PDF/DOCX export bundled in-app (V1 exports Markdown/HTML; PDF is via OS print).
- Perfect diarization accuracy; diarization is best-effort and post‑meeting only.
2. Scope: V1 vs V2+
| Feature Area | V1 Scope (Must Ship) | Deferred (V2+) |
|---|---|---|
| Audio Capture | Mic capture (default). Windows-only optional system loopback (no drivers) if feasible. macOS loopback requires user-installed device; V1 supports selecting it but does not ship drivers. | First-class macOS system audio capture without user setup; multi-source mixing; per-app capture. |
| Transcription | Near real-time via partial/final segments; timestamps; searchable transcript. | Multi-language translation, custom vocab, advanced diarization alignment. |
| Speakers | Anonymous speaker separation (post‑meeting best-effort): “Speaker A/B/C”. Rename per meeting (non-biometric). | Voice profiles, biometric identification, continuous learning loop. |
| Triggers | Weighted confidence model; user confirmation by default; snooze and per-app suppression. | Fully autonomous auto-start as default; “call state” deep integrations. |
| Intelligence | Evidence-based summary (citations enforced). | Live RAG callbacks; cross-meeting memory assistant. |
| Storage | Local per-user database + encrypted assets; retention + deletion. | Cloud sync; team search; shared templates. |
| Export | Markdown/HTML + clipboard; “Print to PDF” via OS. | Bundled PDF/DOCX, templating marketplace. |
3. Success Metrics & Acceptance Criteria
3.1 Product Metrics (V1)
- Core loop latency (P95): word spoken → visible partial text < 3.0s
- Session reliability: crash rate < 0.5% for sessions > 60 minutes
- False trigger prompts: < 1 prompt/day/user median; < 3 P95
- Citation correctness: ≥ 90% of summary bullets link to supporting transcript segments (human audit)
3.2 “Must Work” Acceptance Criteria (Release Blockers)
- User can start/stop recording manually from tray/menubar or hotkey.
- Transcript segments are persisted and viewable after app restart.
- Clicking a summary bullet jumps to the cited transcript segment (and audio if stored).
- Deleting a meeting removes transcript + audio in a way that prevents casual recovery.
- App never records without a visible, persistent indicator.
4. User Experience
4.1 Primary Screens
- Tray/Menubar Control
  - Start / Stop recording
  - Open NoteFlow
  - Snooze triggers (15m / 1h / 2h / today)
  - Settings
- Active Meeting View
  - Recording indicator + timer
  - VU meter (trust signal)
  - Rolling transcript:
    - Partial text in grey (unstable)
    - Final text in normal text (committed)
  - Annotation hotkeys (Action / Decision / Note)
  - “Mark moment” button (adds timestamped note instantly)
- Post‑Meeting Review
  - Transcript with search (in-meeting search is required; global search is “basic” in V1)
  - Speaker labels (if diarization completed)
  - Audio playback controls (if audio stored)
  - Summary panel with evidence links
  - Export buttons: Copy Markdown / Save HTML
- Meeting Library
  - List of meetings (title, date, duration, source)
  - Keyword search (V1: scan-based acceptable up to defined limits)
  - Filters: date range, source app, “has action items”
- Settings
  - Trigger sensitivity & sources
  - Audio device selection + test
  - “Store audio” toggle + retention days
  - Summarization provider (local/cloud) + privacy consent
  - Telemetry opt-in
5. Core Workflows
5.1 Workflow A — Smart Prompt to Record (Weighted Confidence Model)
Inputs (each produces a score contribution):
- Calendar proximity (optional connector): meeting starts within 5 minutes → `+0.30`
- Foreground app: Zoom/Teams/etc. is frontmost → `+0.40`
- Audio activity: mic level above threshold for 5s → `+0.30`
Threshold behavior
- Score `< 0.40`: ignore
- Score `0.40–0.79`: show notification: “Meeting detected. Start NoteFlow?”
- Score `≥ 0.80`: auto-start only if user explicitly enabled
Controls
- Snooze button included on prompt
- “Don’t prompt for this app” option
- If already recording, ignore all new triggers
Engineering note (explicit constraint): V1 does not claim true “call state” detection. Foreground app + audio activity + calendar is the reliable baseline.
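The scoring and threshold rules above can be sketched as a small pure function. This is a minimal illustration, not the shipped design: `TriggerSignals`, `trigger_score`, and `decide` are hypothetical names, and falling back to a prompt when the score is ≥ 0.80 but auto-start is disabled is an assumption the spec does not state explicitly.

```python
from dataclasses import dataclass

# Weights and thresholds copied from Workflow A.
WEIGHT_CALENDAR = 0.30
WEIGHT_FOREGROUND_APP = 0.40
WEIGHT_AUDIO_ACTIVITY = 0.30
PROMPT_THRESHOLD = 0.40
AUTO_START_THRESHOLD = 0.80


@dataclass(frozen=True)
class TriggerSignals:
    """Hypothetical snapshot of the three trigger inputs."""

    calendar_within_5_min: bool = False
    meeting_app_frontmost: bool = False
    mic_active_5s: bool = False


def trigger_score(signals: TriggerSignals) -> float:
    """Sum the contributions of the active signals (0.0 to 1.0)."""
    score = 0.0
    if signals.calendar_within_5_min:
        score += WEIGHT_CALENDAR
    if signals.meeting_app_frontmost:
        score += WEIGHT_FOREGROUND_APP
    if signals.mic_active_5s:
        score += WEIGHT_AUDIO_ACTIVITY
    # Round to avoid float noise such as 0.7 + 0.3 != 1.0.
    return round(score, 2)


def decide(
    signals: TriggerSignals,
    *,
    is_recording: bool,
    auto_start_enabled: bool,
    snoozed: bool = False,
) -> str:
    """Return "ignore", "prompt", or "auto_start"."""
    if is_recording or snoozed:
        return "ignore"  # already recording, or the user snoozed triggers
    score = trigger_score(signals)
    if score < PROMPT_THRESHOLD:
        return "ignore"
    if score >= AUTO_START_THRESHOLD and auto_start_enabled:
        return "auto_start"
    # Assumption: high scores without the auto-start opt-in still prompt.
    return "prompt"
```

Keeping the decision free of OS calls makes it straightforward to cover in the unit tests listed under the testing strategy.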
5.2 Workflow B — Live Transcription (Partial → Final)
- User starts recording (manual or triggered).
- Audio pipeline streams frames into ring buffer.
- VAD segments speech regions.
- Transcriber produces partial hypothesis every ~2 seconds.
- When VAD detects silence > 500ms (or max segment duration reached), commit final segment:
  - assign stable Segment ID
  - store text + timestamps
  - update UI (partial becomes final)
UI invariant: final segments never change text; corrections happen by creating a new segment (V2) or via explicit “edit transcript” (deferred).
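A rough sketch of the partial-to-final commit rule follows. It assumes the transcriber reports each partial hypothesis with the speech start/end offsets it covers; `SegmentBuffer`, `FinalSegment`, and the method names are hypothetical, and the 20s cap is the example value from Section 8.5 rather than a fixed decision.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from uuid import uuid4

SILENCE_FINALIZE_S = 0.5  # VAD silence > 500 ms commits the segment
MAX_SEGMENT_S = 20.0      # example cap; forces a commit on long speech


@dataclass(frozen=True)
class FinalSegment:
    """Committed segment; frozen so its text can never change."""

    id: str
    start: float
    end: float
    text: str


@dataclass
class SegmentBuffer:
    """Hypothetical tracker for the single in-progress (partial) segment."""

    start: float | None = None
    last_speech_end: float = 0.0
    partial_text: str = ""
    finals: list[FinalSegment] = field(default_factory=list)

    def on_partial(self, text: str, speech_start: float, speech_end: float) -> None:
        """Replace the unstable partial text with the newest hypothesis."""
        if self.start is None:
            self.start = speech_start
        self.partial_text = text
        self.last_speech_end = speech_end

    def maybe_finalize(self, now: float) -> FinalSegment | None:
        """Commit the partial if silence or the max duration was exceeded."""
        if self.start is None:
            return None
        silence = now - self.last_speech_end
        too_long = (now - self.start) >= MAX_SEGMENT_S
        if silence <= SILENCE_FINALIZE_S and not too_long:
            return None
        segment = FinalSegment(
            id=str(uuid4()),
            start=self.start,
            end=self.last_speech_end,
            text=self.partial_text,
        )
        self.finals.append(segment)
        # Reset so the next partial starts a fresh segment.
        self.start, self.partial_text = None, ""
        return segment
```

Making `FinalSegment` immutable is one way to encode the "final never mutates" invariant at the type level.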
5.3 Workflow C — Post‑Meeting Summary with Enforced Citations (“Extract → Synthesize → Verify”)
Goal: no summary bullet can exist without a citation.
- Chunking: transcript segments grouped into blocks ~500 tokens (segment-aware).
- Extraction prompt: model must return a list of:
  - `quote` (verbatim excerpt)
  - `segment_ids` (one or more)
  - `category` (decision/action/key_point)
- Synthesis prompt: rewrite extracted quotes into a professional bullet list; each bullet ends with `[...]` containing Segment IDs.
- Verification:
  - parse bullets; if any bullet lacks `[...]`, mark it `uncited` and do not show it by default (user can reveal “uncited drafts” panel)
- Display: clicking a bullet scrolls transcript to cited segment(s) and sets playback time.
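The verification step could look roughly like the sketch below. It assumes citations are comma-separated Segment IDs inside a trailing `[...]`; `verify_bullets` is a hypothetical helper, and treating a bullet that cites an unknown Segment ID as uncited is an extra assumption beyond what the workflow states.

```python
from __future__ import annotations

import re

# Matches a trailing "[id1, id2]" group at the end of a bullet.
_CITATION_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def verify_bullets(
    bullets: list[str], known_segment_ids: set[str]
) -> tuple[list[tuple[str, list[str]]], list[str]]:
    """Split bullets into (cited, uncited).

    cited   -> list of (bullet text without the citation, segment ids)
    uncited -> raw bullets with a missing or unresolvable citation
    """
    cited: list[tuple[str, list[str]]] = []
    uncited: list[str] = []
    for bullet in bullets:
        match = _CITATION_RE.search(bullet)
        ids: list[str] = []
        if match:
            ids = [part.strip() for part in match.group(1).split(",") if part.strip()]
        # Assumption: every cited id must resolve to a real segment.
        if match and ids and all(seg_id in known_segment_ids for seg_id in ids):
            cited.append((bullet[: match.start()].rstrip(), ids))
        else:
            uncited.append(bullet)
    return cited, uncited
```

The `uncited` list maps naturally onto the “uncited drafts” panel, and its length onto `uncited_count` in the verification report.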
5.4 Workflow D — Best‑Effort Anonymous Diarization (Post‑Meeting)
V1 approach: diarization is a background job after recording stops (not real-time).
- If diarization enabled, run pipeline on recorded audio.
- Obtain speaker turns and cluster labels.
- Align speaker turns to transcript segments by time overlap.
- Assign “Speaker A/B/C” per meeting.
- User can rename speakers per meeting (non-biometric).
Failure handling: if diarization model unavailable or too slow, transcript remains “Unknown speaker.”
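Aligning speaker turns to segments by time overlap can be sketched as below. The tuple shapes and the `assign_speakers` name are illustrative assumptions; the sketch also assumes a meeting has at most 26 clusters, so labels fit “Speaker A” through “Speaker Z”.

```python
from __future__ import annotations


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time ranges."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: list[tuple[float, float]],
    turns: list[tuple[float, float, int]],
) -> list[str]:
    """Label each (start, end) segment with the best-overlapping speaker.

    turns are (start, end, cluster_id) from the diarization pipeline.
    Segments with no overlapping turn stay "Unknown".
    """
    labels: dict[int, str] = {}  # cluster id -> "Speaker A/B/C", per meeting
    result: list[str] = []
    for seg_start, seg_end in segments:
        totals: dict[int, float] = {}
        for turn_start, turn_end, cluster in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > 0:
                totals[cluster] = totals.get(cluster, 0.0) + shared
        if not totals:
            result.append("Unknown")
            continue
        best = max(totals, key=lambda cluster: totals[cluster])
        if best not in labels:
            # Letters are assigned in order of first appearance.
            labels[best] = f"Speaker {chr(ord('A') + len(labels))}"
        result.append(labels[best])
    return result
```

Because labels are assigned in order of first appearance within one call, they carry no identity across meetings, which matches the non-biometric constraint.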
6. Functional Requirements (FR)
6.1 Recording & Audio
- FR-01 Manual start/stop recording from tray/menubar.
- FR-02 Global hotkey start/stop (configurable; can be disabled).
- FR-03 Visible recording indicator whenever audio capture is active.
- FR-04 Audio device selection + test page (VU meter).
- FR-05 Audio dropouts handled gracefully:
  - attempt reconnect
  - if reconnection fails, prompt user and stop recording safely (flush files)
6.2 Transcription
- FR-10 Near real-time transcript view with partial/final states.
- FR-11 Persist finalized transcript segments with timestamps.
- FR-12 Transcript is searchable within a meeting.
6.3 Annotations
- FR-20 Add annotations during recording and review:
  - types: `action_item`, `decision`, `note`, `risk` (risk is allowed but not required in summary)
- FR-21 An annotation always includes:
  - timestamp range
  - text
  - origin: user/system (V1: system used only for “uncited draft” metadata; no RAG callbacks)
6.4 Summaries
- FR-30 Generate summary on demand (and optionally auto after stop).
- FR-31 Enforce citations; uncited bullets are suppressed by default.
- FR-32 Summary bullets clickable → jump to transcript + playback time.
6.5 Library & Search
- FR-40 Meeting library list with sorting and basic search.
- FR-41 Delete meeting removes transcript + audio + summary.
6.6 Settings & Privacy
- FR-50 Retention policy (default 30 days, configurable).
- FR-51 Cloud summarization requires explicit opt-in and provider selection.
- FR-52 Telemetry is opt-in and content-free.
7. Non‑Functional Requirements (NFR)
7.1 Performance
- NFR-01 P95 partial transcript latency < 3s on baseline hardware (defined in release checklist).
- NFR-02 Background jobs (diarization, embeddings) must not freeze UI; they run in worker threads and report progress.
7.2 Reliability
- NFR-10 Crash-safe persistence:
  - audio file is written incrementally
  - transcript segments flushed within 2s of finalization
- NFR-11 On restart after crash, last session is recoverable (meeting marked “incomplete”).
7.3 Security & Privacy
- NFR-20 Local data encrypted at rest (see Section 10).
- NFR-21 No recording without indicator.
- NFR-22 No content in telemetry logs.
8. Technical Architecture
8.1 Process Model
Decision: Client-Server architecture with gRPC.
The system is split into two components that can run on the same machine or separately:
Server (Headless Backend)
- ASR Engine: faster-whisper for transcription
- Meeting Store: in-memory meeting management
- Storage: LanceDB for persistence + encrypted audio assets
- gRPC Service: bidirectional streaming for real-time transcription
Client (GUI Application)
- UI: Flet (Python) for main window
- Tray/Menubar: native integration layer (pystray)
- Audio Capture: sounddevice for local mic capture
- gRPC Client: streams audio to server, receives transcripts
Rationale:
- Enables headless server deployment (e.g., home server, NAS)
- Client can run on any machine with audio hardware
- Separates compute-heavy ASR from UI responsiveness
- Maintains local-first operation when both run on same machine
Deployment modes:
- Local: Server + Client on same machine (default)
- Split: Server on headless machine, Client on workstation with audio
8.2 gRPC Service Contract
Service: NoteFlowService
| RPC | Type | Purpose |
|---|---|---|
| StreamTranscription | Bidirectional stream | Audio chunks → transcript updates |
| CreateMeeting | Unary | Start a new meeting |
| StopMeeting | Unary | Stop recording |
| ListMeetings | Unary | Query meetings |
| GetMeeting | Unary | Get meeting details |
| GenerateSummary | Unary | Generate evidence-linked summary |
| GetServerInfo | Unary | Health check + capabilities |
Audio streaming contract:
- Client sends `AudioChunk` messages (float32, 16kHz mono)
- Server responds with `TranscriptUpdate` messages (partial or final)
- Final segments include word-level timestamps
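To make the streaming contract concrete, the two message shapes might be modeled as below before the actual protobuf definitions exist. All field names (`pcm_f32`, `timestamp`, `segment_id`) and the helper functions are hypothetical placeholders, not the real schema.

```python
from __future__ import annotations

from array import array
from dataclasses import dataclass

SAMPLE_RATE_HZ = 16_000  # mono, per the streaming contract
BYTES_PER_SAMPLE = 4     # float32


@dataclass(frozen=True)
class AudioChunk:
    """Client -> server. Field names are illustrative only."""

    meeting_id: str
    pcm_f32: bytes    # raw float32 samples, 16 kHz mono
    timestamp: float  # capture time offset in seconds, for sync


@dataclass(frozen=True)
class TranscriptUpdate:
    """Server -> client. Partial updates carry no segment id."""

    meeting_id: str
    text: str
    is_final: bool
    segment_id: str | None = None


def make_chunk(meeting_id: str, samples: list[float], timestamp: float) -> AudioChunk:
    """Pack Python floats into a float32 byte payload."""
    return AudioChunk(meeting_id, array("f", samples).tobytes(), timestamp)


def chunk_duration_s(chunk: AudioChunk) -> float:
    """Duration of the audio carried by one chunk."""
    return (len(chunk.pcm_f32) // BYTES_PER_SAMPLE) / SAMPLE_RATE_HZ
```

At 16 kHz, a ~100ms chunk is 1,600 samples, or 6,400 bytes of float32 payload.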
8.3 Concurrency & Threading
Server:
- gRPC thread pool: handles incoming requests
- ASR worker: processes audio buffers through faster-whisper
- IO worker: persists segments + meeting metadata
Client:
- Main/UI thread: rendering + user actions
- Audio thread (high priority): capture callback → gRPC stream
- gRPC stream thread: sends audio, receives transcripts
- Event dispatch: updates UI from transcript callbacks
Hard rule: Server's IO worker is the only component that writes to the database (prevents corruption/races).
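One common way to enforce a single-writer rule is a queue drained by one dedicated thread. The sketch below assumes that approach; `IOWorker` and `write_fn` are hypothetical names, and the real persistence call would replace the callback.

```python
from __future__ import annotations

import queue
import threading
from typing import Any, Callable

_STOP = object()  # sentinel that tells the worker to exit


class IOWorker:
    """Serializes all database writes through one thread."""

    def __init__(self, write_fn: Callable[[Any], None]) -> None:
        self._queue: queue.Queue[Any] = queue.Queue()
        self._write_fn = write_fn
        self._thread = threading.Thread(target=self._run, name="io-worker", daemon=True)
        self._thread.start()

    def submit(self, record: Any) -> None:
        """Safe to call from any thread (gRPC handlers, ASR worker)."""
        self._queue.put(record)

    def queue_depth(self) -> int:
        """Approximate backlog size."""
        return self._queue.qsize()

    def close(self) -> None:
        """Flush everything already queued, then stop the thread."""
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            self._write_fn(item)  # the only place writes happen
```

Other threads only ever call `submit`, so write ordering follows queue order, and `queue_depth` could feed the `db_write_queue_depth` metric.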
8.4 Audio Pipeline (Client-Side)
V1 capture modes
- Microphone input (default, cross-platform)
- Windows-only optional loopback (if implemented without extra drivers)
- macOS loopback via user-installed virtual device (supported if user configures; not bundled)
Client Pipeline
- Capture: PortAudio via `sounddevice`
  - internal capture format: float32 frames
  - resample to 16kHz mono for streaming
- Stream: gRPC `StreamTranscription` to server
  - chunks sent every ~100ms
  - includes timestamp for sync
- Display: receive `TranscriptUpdate` from server
  - partial updates shown in grey
  - final segments committed to UI
Server Pipeline
- Receive: audio chunks from gRPC stream
- Buffer: accumulate until processable duration (~1s)
- VAD: silero-vad filters non-speech
- ASR: faster-whisper inference with word timestamps
- Finalize: silence boundary or max segment length
- Persist: segments written to DB
- Stream: send `TranscriptUpdate` back to client
Explicit failure modes
- device unplugged → reconnect to default device; show toast
- permission denied → block recording and show system instructions
- sustained dropouts → stop recording safely, mark session incomplete
8.5 Transcription Engine (Partial/Final Contract)
Partial inference cadence: every ~2 seconds
Finalization rules:
- VAD silence > 500ms finalizes current segment
- max segment length (e.g., 20s) forces finalization to control latency/UX
Text stability rule: partial may be replaced; final never mutates.
8.6 Diarization (V1 Post‑Meeting Only)
- Runs after meeting stop or on-demand
- Produces anonymous labels
- Time-align with transcript segments
- Stored per meeting; no cross-meeting identity
Important: diarization is optional; must never block transcript availability.
8.7 Summarization Providers
Provider interface: Summarizer.generate(transcript: MeetingTranscript) -> MeetingSummary
Supported provider modes:
- Cloud provider (user-supplied API key; explicit opt-in)
- Local provider (optional; user-installed runtime; best-effort)
Privacy contract: if cloud is enabled, UI must clearly display “Transcript will be sent to provider X” at first use and in settings.
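The provider interface could be expressed as a structural `Protocol`, so cloud and local providers need no shared base class. In this sketch the transcript and summary types are simplified stand-ins for the domain models, and `FirstLinesSummarizer` and `run_summary` are hypothetical; the opt-in check mirrors FR-51.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class MeetingTranscript:
    meeting_id: str
    segments: list[tuple[str, str]]  # (segment_id, text); simplified stand-in


@dataclass
class MeetingSummary:
    meeting_id: str
    provider: str
    overview: str
    points: list[str] = field(default_factory=list)


class Summarizer(Protocol):
    """Any object with a matching generate() satisfies the interface."""

    def generate(self, transcript: MeetingTranscript) -> MeetingSummary: ...


class FirstLinesSummarizer:
    """Hypothetical offline stand-in: cites the first few segments verbatim."""

    def generate(self, transcript: MeetingTranscript) -> MeetingSummary:
        points = [f"{text} [{seg_id}]" for seg_id, text in transcript.segments[:3]]
        overview = f"{len(transcript.segments)} segments"
        return MeetingSummary(transcript.meeting_id, "first-lines", overview, points)


def run_summary(
    provider: Summarizer,
    transcript: MeetingTranscript,
    *,
    is_cloud: bool,
    cloud_opt_in: bool,
) -> MeetingSummary:
    """Refuse to send a transcript to a cloud provider without opt-in."""
    if is_cloud and not cloud_opt_in:
        raise PermissionError("cloud summarization requires explicit opt-in")
    return provider.generate(transcript)
```

A stand-in provider like this also gives tests a deterministic summarizer that needs no network or model download.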
9. Storage & Data Model
9.1 On-Disk Layout (Per User)
- App data directory (OS standard)
  - `db/` (LanceDB)
  - `meetings/<meeting_id>/`
    - `audio.<ext>` (encrypted container)
    - `manifest.json` (non-sensitive)
  - `logs/` (rotating; content-free)
  - `settings.json`
9.2 Database Schema (LanceDB)
Core tables:
- `meetings`
  - id (UUID)
  - title
  - started_at, ended_at
  - source_app
  - flags: has_audio, has_summary, diarization_status
- `segments`
  - id (UUID)
  - meeting_id
  - start_offset, end_offset
  - text
  - speaker_label (“Unknown”, “Speaker A”…)
  - confidence (optional)
  - embedding_vector (optional, computed post‑meeting)
- `annotations`
  - id
  - meeting_id
  - start_offset, end_offset
  - type
  - text
  - created_at
- `summaries`
  - meeting_id
  - generated_at
  - provider
  - overview
  - points (serialized)
  - verification_report (uncited_count, etc.)
9.3 Domain Models (Pydantic v2)
Key correctness requirements:
- enforce `end >= start`
- avoid mutable defaults
- keep “escape hatches” constrained and documented
Example models (illustrative; not exhaustive):
```python
from __future__ import annotations

from datetime import datetime
from typing import Literal

from pydantic import BaseModel, Field, model_validator

# Plain string aliases keep IDs readable without wrapper types.
MeetingID = str
SegmentID = str
AnnotationID = str


class MeetingMetadata(BaseModel):
    id: MeetingID
    title: str = "Untitled Meeting"
    started_at: datetime = Field(default_factory=datetime.now)
    ended_at: datetime | None = None
    trigger_source: Literal["manual", "calendar", "app", "mixed"] = "manual"
    source_app: str | None = None
    # default_factory avoids sharing one mutable list across instances.
    participants: list[str] = Field(default_factory=list)


class TranscriptSegment(BaseModel):
    id: SegmentID
    meeting_id: MeetingID
    start: float = Field(..., ge=0.0)
    end: float = Field(..., ge=0.0)
    text: str
    speaker_label: str = "Unknown"
    is_final: bool = True

    @model_validator(mode="after")
    def validate_times(self) -> TranscriptSegment:
        # Enforce end >= start.
        if self.end < self.start:
            raise ValueError("segment end < start")
        return self


class Annotation(BaseModel):
    id: AnnotationID
    meeting_id: MeetingID
    type: Literal["action_item", "decision", "note", "risk"]
    start: float = Field(..., ge=0.0)
    end: float = Field(..., ge=0.0)
    text: str
    created_at: datetime = Field(default_factory=datetime.now)

    @model_validator(mode="after")
    def validate_times(self) -> Annotation:
        # Same end >= start rule as TranscriptSegment.
        if self.end < self.start:
            raise ValueError("annotation end < start")
        return self


class SummaryPoint(BaseModel):
    category: Literal["decision", "action_item", "key_point"]
    content: str
    citation_ids: list[SegmentID] = Field(default_factory=list)
    is_cited: bool = True


class MeetingSummary(BaseModel):
    meeting_id: MeetingID
    generated_at: datetime
    provider: str
    overview: str
    points: list[SummaryPoint]
    uncited_points: list[SummaryPoint] = Field(default_factory=list)
```
10. Privacy, Security & Compliance
10.1 Consent & Transparency
- Persistent recording indicator (tray/menubar icon + in-app)
- First-run permission guide:
  - microphone access
  - hotkeys/accessibility permissions if required by OS
- One-time legal reminder: user responsibility to comply with local consent laws
10.2 Encryption at Rest (Pragmatic + Real)
Goal: protect recordings and derived data on disk.
Design: envelope encryption
- Master key stored in OS credential store (Keychain/Credential Manager) via a cross-platform keyring abstraction.
- Per-meeting data key (DEK) generated randomly.
- Meeting assets (audio, sensitive metadata) encrypted with DEK.
- DEK encrypted with master key and stored in DB.
Deletion (“cryptographic shred”)
- Delete encrypted DEK record + delete encrypted file(s).
- Without DEK, leftover bytes are unusable.
10.3 Retention
- Default retention: 30 days
- Retention job runs at app startup and once daily
- “Delete now” always available per meeting
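The selection step of the retention job can be sketched as a pure function. It assumes age is measured from `ended_at` and that meetings still in progress (no `ended_at`) are never expired; neither detail is fixed by the spec, and `expired_meeting_ids` is a hypothetical helper.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

DEFAULT_RETENTION_DAYS = 30


def expired_meeting_ids(
    meetings: list[tuple[str, datetime | None]],
    now: datetime,
    retention_days: int = DEFAULT_RETENTION_DAYS,
) -> list[str]:
    """Return ids of meetings that ended before the retention cutoff.

    meetings is a list of (meeting_id, ended_at) pairs.
    """
    cutoff = now - timedelta(days=retention_days)
    return [
        meeting_id
        for meeting_id, ended_at in meetings
        # Assumption: unfinished meetings are skipped, not deleted.
        if ended_at is not None and ended_at < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2025, 3, 1, tzinfo=timezone.utc)
    sample = [
        ("old", datetime(2025, 1, 15, tzinfo=timezone.utc)),
        ("recent", datetime(2025, 2, 20, tzinfo=timezone.utc)),
        ("live", None),
    ]
    print(expired_meeting_ids(sample, now))
```

Each returned id would then go through the same deletion path as “Delete now”, so retention and manual deletion share one code path.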
10.4 Telemetry (Opt-in, Content-Free)
Allowed fields only:
- crash stacktrace (redacted paths if needed)
- performance counters (latency, dropouts, model runtime)
- feature toggles (summarization enabled yes/no)
Explicitly forbidden:
- transcript text
- audio
- meeting titles/participants (unless user explicitly opts in to “diagnostic mode,” which is V2+)
11. Packaging, Distribution, Updates
11.1 Packaging
- Primary: PyInstaller-based app bundle (one-click install experience)
- No bundled PDF engine in V1 (avoid complex native deps)
- Exports: HTML/Markdown + OS “Print to PDF”
11.2 Code Signing & OS Requirements
- macOS: signed + notarized app bundle
- Windows: signed installer recommended to reduce SmartScreen friction
11.3 Updates (V1 Reality)
- V1 includes: “Check for updates” → opens release page + shows current version
- V1.1+ can add auto-update once packaging is stable across OS targets
12. Observability
12.1 Logging
- Structured logging (JSON) to rotating files
- Log levels configurable
- Must never log transcript content or raw audio
12.2 Metrics (Local + Optional Telemetry)
Track locally:
- `audio_dropout_count`
- `vad_speech_ratio`
- `asr_partial_latency_ms` (P50/P95)
- `asr_final_latency_ms`
- `summarization_duration_ms`
- `db_write_queue_depth`
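For the P50/P95 latency figures, a nearest-rank percentile over locally collected samples is one simple option. The sketch below assumes that method (the spec does not name one), and `percentile` is a hypothetical helper.

```python
from __future__ import annotations

import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with >= pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples recorded")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Twenty made-up partial-latency samples: 100 ms, 200 ms, ..., 2000 ms.
    samples = [float(ms) for ms in range(100, 2100, 100)]
    print(percentile(samples, 50), percentile(samples, 95))
```

Comparing `percentile(samples, 95)` against the 3,000 ms budget gives a direct local check for NFR-01.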
13. Development Standards (Pragmatic)
13.1 Typing Policy
- `mypy --strict` required in CI
- `Any` avoided in core domain; allowed only at explicit boundaries (OS bindings, C libs)
- `type: ignore[code]` allowed only with:
  - narrow scope
  - comment explaining why
  - tracked follow-up task if it’s not permanent
13.2 Architecture Conventions
- Dependency Injection for services (no heavy constructors)
- Facade exports (`__init__.py`) for clean APIs
- Module size guideline:
  - soft limit 500 LoC
  - hard limit 750 LoC → refactor into package
13.3 Testing Strategy
- Unit tests: trigger scoring, summarization verifier, model validators
- Integration tests: DB schema, retention deletion, encrypted asset lifecycle
- E2E tests (required): inject prerecorded audio into pipeline; assert transcript contains expected phrases + stable segment timing behavior
- CI must not depend on live microphone input
14. Known Risks & Mitigations (V1)
| Risk | Impact | Mitigation |
|---|---|---|
| Mic-only capture misses remote speakers (headphones) | Product feels “broken” | Provide Windows loopback option if feasible; on macOS provide “Audio Setup Wizard” supporting user-installed loopback devices; clearly label limitations in UI. |
| Whisper hallucinations on silence | Bad transcript | VAD gate; discard non-speech frames; conservative finalization. |
| Model performance on low-end CPU | Laggy UI | “Low Power Mode” (slower partial cadence), async background jobs, allow cloud ASR (optional later). |
| Diarization dependency/model availability | Feature instability | Make diarization optional + post-meeting; graceful fallback to “Unknown speaker.” |
| False trigger prompts | Annoyance | Weighted scoring + snooze + per-app suppression + “only prompt when foreground.” |
| Packaging/permissions friction | Drop-off | First-run wizard; clear permission UX; signed builds. |
15. Roadmap (V2+)
High-confidence next steps after V1 ships:
- Live RAG callbacks (throttled, high-signal only)
- Speaker identity profiles with safeguards (quarantine samples, versioning, revert)
- Advanced exports (PDF/DOCX via a packaging-friendly approach)
- Search upgrades (FTS/semantic global search performance)
- Cloud sync (optional) and team workspaces (separate product decision)
16. Open Questions (Engineering Spikes Required)
These must be resolved with short spikes before implementation finalization:
- Tray + global hotkeys compatibility with chosen UI stack on macOS/Windows
- Windows loopback feasibility with the selected audio library and packaging approach
- Diarization model choice that does not require gated downloads or accounts (or else diarization becomes V2)
- Local LLM summarization feasibility (quality + packaging); if not feasible, cloud-only summarization requires an explicit product decision