Sprint 4: Named Entity Extraction (NER)
Priority: 4 | Owner: Backend | Complexity: Medium
Prerequisite: Sprint 0 (Proto & Schema Foundation)
Validation Status (2025-12-27)
✅ 100% COMPLETE
| Component | Location | Status |
|---|---|---|
| Feature Flag | config/settings.py | ✅ ner_enabled in FeatureFlags |
| Domain Entity | domain/entities/named_entity.py | ✅ NamedEntity, EntityCategory |
| NER Engine | infrastructure/ner/engine.py | ✅ NerEngine with spaCy integration |
| NER Service | application/services/ner_service.py | ✅ NerService with caching |
| gRPC Mixin | grpc/_mixins/entities.py | ✅ EntitiesMixin with extract/get/pin |
| ORM Model | persistence/models/entities/named_entity.py | ✅ NamedEntityModel |
| Repository | persistence/repositories/entity_repo.py | ✅ SqlAlchemyEntityRepository |
| Proto RPC | noteflow.proto | ✅ ExtractEntities, GetEntities, PinEntity |
| Dependency | pyproject.toml | ✅ spacy>=3.7 |
| Frontend Hook | hooks/use-entity-extraction.ts | ✅ Reusable extraction hook with state |
| Frontend Panel | pages/MeetingDetail.tsx | ✅ Entities tab with extract/refresh UI |
| Entity Store | lib/entity-store.ts | ✅ Client-side entity state management |
| Rust Command | commands/entities.rs | ✅ extract_entities command |
| Tauri Adapter | api/tauri-adapter.ts:179 | ✅ extractEntities() wrapper |
Objective
Automatically extract named entities (people, companies, products, locations) from meeting transcripts, reducing manual annotation effort and ensuring consistency.
Library Selection
NER Engine Options
| Library | Best For | Pros | Cons |
|---|---|---|---|
| spaCy (recommended) | Production pipelines | Fast, mature, transformer support | Fixed entity types without training |
| GLiNER | Zero-shot custom entities | Extract any entity type without training | Newer, less battle-tested |
| Hugging Face Transformers | Maximum flexibility | State-of-the-art models | More setup, heavier |
Decision: Use spaCy with spacy-transformers for:
- Production-ready performance (0.05x real-time)
- Transformer integration via the `en_core_web_trf` model
- Well-documented, widely adopted
- Future option: Add GLiNER adapter for custom entity types
References:
- spaCy NER Documentation
- GLiNER GitHub - Zero-shot NER alternative
- spaCy vs GLiNER Comparison
Current State Analysis
What Already Exists
Frontend Components (Complete)
| Component | Location | Status |
|---|---|---|
| Entity Highlight | client/src/components/entity-highlight.tsx | Renders inline entity highlights with tooltips (190 lines) |
| Entity Panel | client/src/components/entity-management-panel.tsx | Manual CRUD with Sheet UI (388 lines) |
| Entity Store | client/src/lib/entity-store.ts | Client-side state with observer pattern (169 lines) |
| Entity Types | client/src/types/entity.ts | TypeScript types + color mappings (49 lines) |
Backend Infrastructure (Complete)
| Component | Location | Status |
|---|---|---|
| ORM Model | infrastructure/persistence/models/entities/named_entity.py | Complete - NamedEntityModel with all fields |
| Migration | migrations/versions/h2c3d4e5f6g7_add_named_entities_table.py | Complete - table created with indices |
| Meeting Relationship | models/core/meeting.py:35-37 | Configured - named_entities with cascade delete |
| Proto RPC | noteflow.proto:46 | Defined - rpc ExtractEntities(...) |
| Proto Messages | noteflow.proto:593-630 | Defined - Request, Response, ExtractedEntity |
| Feature Flag | config/settings.py:223-226 | Defined - NOTEFLOW_FEATURE_NER_ENABLED |
| Dependency | pyproject.toml:63 | Declared - spacy>=3.7 |
Note: The `named_entities` table and proto definitions are already implemented. Sprint 0 completed this foundation work. No schema changes needed.
Gap: What Needs Implementation
Backend NER processing pipeline:
- No spaCy engine wrapper (`infrastructure/ner/`)
- No `NamedEntity` domain entity class
- No `NerPort` domain protocol
- No `NerService` application layer
- No `SqlAlchemyEntityRepository`
- No `EntitiesMixin` gRPC handler
- No Rust/Tauri command for extraction
- Frontend not wired to backend extraction
Field Naming Alignment Warning
⚠️ CRITICAL: Frontend and backend use different field names for entity text:

| Layer | Field | Location |
|---|---|---|
| Frontend TS | term | client/src/types/entity.ts:Entity.term |
| Backend Proto | text | noteflow.proto:ExtractedEntity.text |
| ORM Model | text | models/entities/named_entity.py:NamedEntityModel.text |

Resolution required: Either align the frontend to use `text` or add a mapping in the Tauri commands. The existing `entity-store.ts` uses `term` throughout, so a rename may be needed for consistency.
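If the mapping option is chosen, the translation itself is a single key rename. The sketch below shows that rename in Python purely for illustration; in practice it would live in the Tauri command or the TypeScript adapter. The function name `to_frontend_entity` and the dict-shaped payload are assumptions, not part of the existing codebase.

```python
def to_frontend_entity(proto_entity: dict) -> dict:
    """Rename the backend `text` key to the frontend `term` key.

    Hypothetical helper: all other keys are passed through untouched.
    """
    mapped = {key: value for key, value in proto_entity.items() if key != "text"}
    mapped["term"] = proto_entity["text"]
    return mapped


backend_payload = {"id": "1", "text": "Acme", "category": "company"}
frontend_payload = to_frontend_entity(backend_payload)
```

Keeping the rename in one place at the boundary means neither `entity-store.ts` nor the proto has to change, at the cost of one more translation step to maintain.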
Architecture: Protocol-Based Dependency Injection
All components use Protocol-based dependency injection for testability:
┌─────────────────────────────────────────────────────────────────┐
│ gRPC Layer │
│ EntitiesMixin(ner_service: NerServiceProtocol) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Application Layer │
│ NerService(ner_engine: NerPort, uow_factory: UoWFactory) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Infrastructure Layer │
│ SpacyNerEngine(model_name: str) implements NerPort │
│ GlinerNerEngine(model_name: str) implements NerPort │
└─────────────────────────────────────────────────────────────────┘
Key Principles (per ArjanCodes DI Best Practices):
- Constructor injection for all dependencies
- Protocol-based abstractions (structural typing)
- Factory functions for object creation
- No global state or singletons
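As a rough illustration of these principles, the sketch below wires a toy engine into a service through a `Protocol`. The `Entity`, `KeywordNerEngine`, and `create_ner_service` names are hypothetical stand-ins, and the service is reduced to a single method; the real classes are defined in the tasks that follow.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Entity:
    """Hypothetical stand-in for the NamedEntity domain class."""

    text: str
    category: str


class NerPort(Protocol):
    """Structural interface: any object with a matching extract() qualifies."""

    def extract(self, text: str) -> list[Entity]: ...


class KeywordNerEngine:
    """Toy adapter that satisfies NerPort without inheriting from it."""

    def __init__(self, known: dict[str, str]) -> None:
        self._known = known

    def extract(self, text: str) -> list[Entity]:
        return [Entity(word, cat) for word, cat in self._known.items() if word in text]


class NerService:
    """Receives its engine through the constructor (no globals, no singletons)."""

    def __init__(self, ner_engine: NerPort) -> None:
        self._ner_engine = ner_engine

    def entity_texts(self, transcript: str) -> list[str]:
        return [entity.text for entity in self._ner_engine.extract(transcript)]


def create_ner_service(known: dict[str, str]) -> NerService:
    """Factory function: the only place that knows which engine is used."""
    return NerService(KeywordNerEngine(known))


service = create_ner_service({"Acme": "company", "Berlin": "location"})
print(service.entity_texts("Acme opened an office in Berlin"))
```

Because `NerService` depends only on the protocol, a test can pass in any object with an `extract` method, which is what makes the spaCy and GLiNER engines interchangeable in the diagram above.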
Target/Affected Code
Files to Create
| File | Purpose | Lines Est. |
|---|---|---|
| src/noteflow/domain/entities/named_entity.py | Domain entity | ~50 |
| src/noteflow/domain/ports/ner.py | NER port interface | ~40 |
| src/noteflow/application/services/ner_service.py | Application service layer | ~150 |
| src/noteflow/infrastructure/ner/__init__.py | Module init | ~10 |
| src/noteflow/infrastructure/ner/engine.py | spaCy NER engine | ~120 |
| src/noteflow/infrastructure/persistence/repositories/entity_repo.py | Entity repository | ~100 |
| src/noteflow/infrastructure/converters/ner_converters.py | ORM ↔ domain converters | ~50 |
| src/noteflow/grpc/_mixins/entities.py | gRPC mixin (calls NerService) | ~100 |
| tests/application/test_ner_service.py | Application layer tests | ~200 |
| tests/infrastructure/ner/test_engine.py | Engine unit tests | ~150 |
| client/src/hooks/use-entity-extraction.ts | React hook with polling | ~80 |
| client/src-tauri/src/commands/entities.rs | Rust command | ~60 |
Note: The ORM model (`NamedEntityModel`) already exists at `models/entities/named_entity.py`. The frontend extraction panel can extend the existing `entity-management-panel.tsx`.
Files to Modify
| File | Change Type | Lines Est. |
|---|---|---|
| src/noteflow/grpc/service.py | Add mixin + NerService initialization | +15 |
| src/noteflow/grpc/server.py | Initialize NerService | +10 |
| src/noteflow/grpc/_mixins/protocols.py | Add _ner_service to ServicerHost | +5 |
| src/noteflow/infrastructure/persistence/repositories/__init__.py | Export entity repo | +2 |
| src/noteflow/infrastructure/persistence/unit_of_work.py | Add entity repository property | +15 |
| client/src/lib/tauri.ts | Add extractEntities wrapper | +15 |
| client/src/components/entity-management-panel.tsx | Add extraction trigger button | +30 |
| client/src-tauri/src/commands/mod.rs | Export entities module | +2 |
| client/src-tauri/src/lib.rs | Register entity commands | +3 |
| client/src-tauri/src/grpc/client.rs | Add extract_entities method | +25 |
Note: `pyproject.toml` already has `spacy>=3.7`. Proto regeneration not needed.
Implementation Tasks
Task 0: Verify Persistence Infrastructure
Prerequisite: Sprint 0 must be complete with:
- Proto definitions verified in noteflow.proto:
  ```protobuf
  rpc ExtractEntities(ExtractEntitiesRequest) returns (ExtractEntitiesResponse);
  ```
- Database schema verified via Alembic migration:
  ```sql
  SELECT column_name FROM information_schema.columns
  WHERE table_name = 'named_entities';
  -- Expected: id, meeting_id, text, category, segment_ids, confidence, is_pinned, created_at
  ```
- Feature flag verified:
  ```python
  from noteflow.config.settings import get_settings

  assert get_settings().feature_flags.ner_enabled
  ```

If any verification fails: Complete Sprint 0 first.
Task 1: Add Dependencies
File: pyproject.toml
```toml
[project.optional-dependencies]
ner = [
    "spacy>=3.7",
]
```
Post-install (handled by Docker in Sprint 0):
```bash
python -m spacy download en_core_web_sm
```
Task 2: Create Domain Entity
File: src/noteflow/domain/entities/named_entity.py
```python
"""Named entity domain entity."""

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import TYPE_CHECKING
from uuid import UUID, uuid4

if TYPE_CHECKING:
    from noteflow.domain.entities.meeting import MeetingId


class EntityCategory(Enum):
    """Categories for named entities."""

    PERSON = "person"
    COMPANY = "company"
    PRODUCT = "product"
    TECHNICAL = "technical"
    ACRONYM = "acronym"
    LOCATION = "location"
    DATE = "date"
    OTHER = "other"


@dataclass
class NamedEntity:
    """A named entity extracted from transcript.

    Represents a person, company, product, or other notable term
    identified in the meeting transcript.
    """

    id: UUID = field(default_factory=uuid4)
    meeting_id: MeetingId | None = None
    text: str = ""
    category: EntityCategory = EntityCategory.OTHER
    segment_ids: list[int] = field(default_factory=list)
    confidence: float = 0.0
    is_pinned: bool = False  # User-confirmed entity

    @classmethod
    def create(
        cls,
        text: str,
        category: EntityCategory,
        segment_ids: list[int],
        confidence: float,
        meeting_id: MeetingId | None = None,
    ) -> NamedEntity:
        """Create a new named entity.

        Args:
            text: The entity text as it appears in transcript.
            category: Classification category.
            segment_ids: Segments where entity appears.
            confidence: Extraction confidence (0.0-1.0).
            meeting_id: Optional meeting association.

        Returns:
            New NamedEntity instance.
        """
        return cls(
            text=text,
            category=category,
            # Deduplicate and sort so segment order is deterministic.
            segment_ids=sorted(set(segment_ids)),
            confidence=confidence,
            meeting_id=meeting_id,
        )
```
Task 3: Create NER Port Interface
File: src/noteflow/domain/ports/ner.py
```python
"""NER port interface (hexagonal architecture)."""

from __future__ import annotations

from typing import TYPE_CHECKING, Protocol

if TYPE_CHECKING:
    from noteflow.domain.entities.named_entity import NamedEntity


class NerPort(Protocol):
    """Port for named entity recognition.

    This is the domain port that the application layer uses.
    Infrastructure adapters (like NerEngine) implement this protocol.
    """

    def extract(self, text: str) -> list[NamedEntity]:
        """Extract named entities from text.

        Args:
            text: Input text to analyze.

        Returns:
            List of extracted entities.
        """
        ...

    def extract_from_segments(
        self,
        segments: list[tuple[int, str]],
    ) -> list[NamedEntity]:
        """Extract entities from multiple segments with tracking.

        Args:
            segments: List of (segment_id, text) tuples.

        Returns:
            Entities with segment_ids populated.
        """
        ...

    def is_ready(self) -> bool:
        """Check if the NER engine is loaded and ready.

        Returns:
            True if model is loaded.
        """
        ...
```
Task 4: Create NER Infrastructure
File: src/noteflow/infrastructure/ner/__init__.py
```python
"""Named Entity Recognition infrastructure."""

from noteflow.infrastructure.ner.engine import NerEngine

__all__ = ["NerEngine"]
```
File: src/noteflow/infrastructure/ner/engine.py
```python
"""NER engine using spaCy."""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING

from noteflow.domain.entities.named_entity import EntityCategory, NamedEntity

if TYPE_CHECKING:
    from spacy.language import Language

logger = logging.getLogger(__name__)

# Map spaCy entity types to our categories
_SPACY_CATEGORY_MAP: dict[str, EntityCategory] = {
    "PERSON": EntityCategory.PERSON,
    "ORG": EntityCategory.COMPANY,
    "PRODUCT": EntityCategory.PRODUCT,
    "GPE": EntityCategory.LOCATION,  # Geo-political entity
    "LOC": EntityCategory.LOCATION,
    "FAC": EntityCategory.LOCATION,  # Facility
    "DATE": EntityCategory.DATE,
    "TIME": EntityCategory.DATE,
    "MONEY": EntityCategory.OTHER,
    "PERCENT": EntityCategory.OTHER,
    "CARDINAL": EntityCategory.OTHER,
    "ORDINAL": EntityCategory.OTHER,
    "QUANTITY": EntityCategory.OTHER,
    "NORP": EntityCategory.OTHER,  # Nationalities, religions, etc.
    "EVENT": EntityCategory.OTHER,
    "WORK_OF_ART": EntityCategory.PRODUCT,
    "LAW": EntityCategory.OTHER,
    "LANGUAGE": EntityCategory.OTHER,
}


class NerEngine:
    """Named entity recognition engine using spaCy.

    Lazy-loads the spaCy model on first use to avoid startup delay.
    Implements the NerPort protocol for hexagonal architecture.
    """

    def __init__(self, model_name: str = "en_core_web_sm") -> None:
        """Initialize NER engine.

        Args:
            model_name: spaCy model to use.
        """
        self._model_name = model_name
        self._nlp: Language | None = None

    def _ensure_loaded(self) -> Language:
        """Lazy-load the spaCy model."""
        if self._nlp is None:
            import spacy

            logger.info("Loading spaCy model: %s", self._model_name)
            self._nlp = spacy.load(self._model_name)
            logger.info("spaCy model loaded")
        return self._nlp

    def is_ready(self) -> bool:
        """Check if model is loaded."""
        return self._nlp is not None

    def extract(self, text: str) -> list[NamedEntity]:
        """Extract named entities from text.

        Args:
            text: Input text to analyze.

        Returns:
            List of extracted entities (deduplicated).
        """
        if not text.strip():
            return []

        nlp = self._ensure_loaded()
        doc = nlp(text)

        entities: list[NamedEntity] = []
        seen: set[str] = set()

        for ent in doc.ents:
            # Normalize and deduplicate
            key = ent.text.lower().strip()
            if not key or key in seen:
                continue
            seen.add(key)

            category = _SPACY_CATEGORY_MAP.get(ent.label_, EntityCategory.OTHER)

            # Skip low-value entities
            if category == EntityCategory.OTHER and ent.label_ in {
                "CARDINAL", "ORDINAL", "QUANTITY", "PERCENT"
            }:
                continue

            entities.append(
                NamedEntity.create(
                    text=ent.text,
                    category=category,
                    segment_ids=[],  # Filled by caller
                    confidence=0.8,  # spaCy doesn't provide per-entity confidence
                )
            )

        return entities

    def extract_from_segments(
        self,
        segments: list[tuple[int, str]],
    ) -> list[NamedEntity]:
        """Extract entities from multiple segments with segment tracking.

        Args:
            segments: List of (segment_id, text) tuples.

        Returns:
            Entities with segment_ids populated (deduplicated across segments).
        """
        # Track entities and their segment occurrences
        entity_segments: dict[str, list[int]] = {}
        all_entities: dict[str, NamedEntity] = {}

        for segment_id, text in segments:
            entities = self.extract(text)
            for entity in entities:
                key = entity.text.lower()
                if key not in all_entities:
                    all_entities[key] = entity
                    entity_segments[key] = []
                entity_segments[key].append(segment_id)

        # Update segment IDs
        for key, entity in all_entities.items():
            entity.segment_ids = sorted(set(entity_segments[key]))

        return list(all_entities.values())
```
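The cross-segment deduplication in `extract_from_segments` can be exercised without loading spaCy. The sketch below isolates that aggregation step and feeds it a toy extractor; `aggregate_by_segment` and `fake_extract` are hypothetical names used only for illustration, and the capitalised-word heuristic is not how spaCy finds entities.

```python
from collections.abc import Callable


def aggregate_by_segment(
    segments: list[tuple[int, str]],
    extract: Callable[[str], list[str]],
) -> dict[str, list[int]]:
    """Map each lower-cased entity text to the sorted segment IDs it appears in."""
    occurrences: dict[str, set[int]] = {}
    for segment_id, text in segments:
        for entity_text in extract(text):
            occurrences.setdefault(entity_text.lower(), set()).add(segment_id)
    return {key: sorted(ids) for key, ids in occurrences.items()}


def fake_extract(text: str) -> list[str]:
    """Hypothetical extractor: treats words starting with a capital as entities."""
    return [word.strip(".,") for word in text.split() if word[:1].isupper()]


segments = [(0, "Alice met Bob."), (1, "bob called ALICE"), (2, "Alice left.")]
result = aggregate_by_segment(segments, fake_extract)
print(result)
```

Here "Alice" and "ALICE" collapse into one key spanning three segments, which is the same case-insensitive merging the engine applies to spaCy results.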
Task 5: Create Entity Persistence
✅ ORM Model Already Exists: The `NamedEntityModel` is already implemented at `infrastructure/persistence/models/entities/named_entity.py`. Do not recreate it. Only the repository and converters need to be implemented.
Existing ORM (reference only; do not modify):
- Location: `src/noteflow/infrastructure/persistence/models/entities/named_entity.py`
- Fields: `id`, `meeting_id`, `text`, `category`, `segment_ids`, `confidence`, `is_pinned`, `created_at`
- Relationship: `meeting` → `MeetingModel.named_entities` (cascade delete configured)
File to Create: src/noteflow/infrastructure/persistence/repositories/entity_repo.py
```python
"""Named entity repository."""

from __future__ import annotations

from typing import TYPE_CHECKING
from uuid import UUID

from sqlalchemy import delete, select

from noteflow.infrastructure.persistence.models import NamedEntityModel
from noteflow.infrastructure.persistence.repositories._base import BaseRepository

if TYPE_CHECKING:
    from noteflow.domain.entities.meeting import MeetingId
    from noteflow.domain.entities.named_entity import NamedEntity


class SqlAlchemyEntityRepository(BaseRepository):
    """Repository for named entity persistence."""

    async def save(self, entity: NamedEntity) -> None:
        """Save or update a named entity.

        Args:
            entity: The entity to save.
        """
        from noteflow.infrastructure.converters.ner_converters import NerConverter

        model = NerConverter.to_orm(entity)
        await self._session.merge(model)
        await self._session.flush()

    async def save_batch(self, entities: list[NamedEntity]) -> None:
        """Save multiple entities efficiently.

        Args:
            entities: List of entities to save.
        """
        from noteflow.infrastructure.converters.ner_converters import NerConverter

        for entity in entities:
            model = NerConverter.to_orm(entity)
            await self._session.merge(model)
        # Flush once after all merges rather than once per entity.
        await self._session.flush()

    async def get(self, entity_id: UUID) -> NamedEntity | None:
        """Get entity by ID.

        Args:
            entity_id: The entity UUID.

        Returns:
            Entity if found, None otherwise.
        """
        from noteflow.infrastructure.converters.ner_converters import NerConverter

        stmt = select(NamedEntityModel).where(NamedEntityModel.id == entity_id)
        result = await self._session.execute(stmt)
        model = result.scalar_one_or_none()
        return NerConverter.to_domain(model) if model else None

    async def get_by_meeting(self, meeting_id: MeetingId) -> list[NamedEntity]:
        """Get all entities for a meeting.

        Args:
            meeting_id: The meeting UUID.

        Returns:
            List of entities.
        """
        from noteflow.infrastructure.converters.ner_converters import NerConverter

        stmt = (
            select(NamedEntityModel)
            .where(NamedEntityModel.meeting_id == meeting_id)
            .order_by(NamedEntityModel.category, NamedEntityModel.text)
        )
        result = await self._session.execute(stmt)
        models = result.scalars().all()
        return [NerConverter.to_domain(m) for m in models]

    async def delete_by_meeting(self, meeting_id: MeetingId) -> int:
        """Delete all entities for a meeting.

        Args:
            meeting_id: The meeting UUID.

        Returns:
            Number of deleted entities.
        """
        stmt = delete(NamedEntityModel).where(
            NamedEntityModel.meeting_id == meeting_id
        )
        result = await self._session.execute(stmt)
        await self._session.flush()
        return result.rowcount

    async def update_pinned(self, entity_id: UUID, is_pinned: bool) -> bool:
        """Update the pinned status of an entity.

        Args:
            entity_id: The entity UUID.
            is_pinned: New pinned status.

        Returns:
            True if entity was found and updated.
        """
        stmt = select(NamedEntityModel).where(NamedEntityModel.id == entity_id)
        result = await self._session.execute(stmt)
        model = result.scalar_one_or_none()
        if model:
            model.is_pinned = is_pinned
            await self._session.flush()
            return True
        return False
```
Task 6: Create NER Converters
File: src/noteflow/infrastructure/converters/ner_converters.py
```python
"""NER domain ↔ ORM converters."""

from __future__ import annotations

from typing import TYPE_CHECKING

from noteflow.domain.entities.named_entity import EntityCategory, NamedEntity

if TYPE_CHECKING:
    from noteflow.infrastructure.persistence.models import NamedEntityModel


class NerConverter:
    """Convert between NamedEntity domain objects and ORM models."""

    @staticmethod
    def to_domain(model: NamedEntityModel) -> NamedEntity:
        """Convert ORM model to domain entity.

        Args:
            model: SQLAlchemy model.

        Returns:
            Domain entity.
        """
        return NamedEntity(
            id=model.id,
            meeting_id=model.meeting_id,
            text=model.text,
            category=EntityCategory(model.category),
            segment_ids=list(model.segment_ids) if model.segment_ids else [],
            confidence=model.confidence,
            is_pinned=model.is_pinned,
        )

    @staticmethod
    def to_orm(entity: NamedEntity) -> NamedEntityModel:
        """Convert domain entity to ORM model.

        Args:
            entity: Domain entity.

        Returns:
            SQLAlchemy model.
        """
        from noteflow.infrastructure.persistence.models import NamedEntityModel

        return NamedEntityModel(
            id=entity.id,
            meeting_id=entity.meeting_id,
            text=entity.text,
            category=entity.category.value,
            segment_ids=entity.segment_ids,
            confidence=entity.confidence,
            is_pinned=entity.is_pinned,
        )
```
Task 7: Create NerService Application Layer
File: src/noteflow/application/services/ner_service.py
This is the critical architectural component that was missing. The NerService sits between gRPC and infrastructure, following hexagonal architecture.
```python
"""Named Entity Recognition application service.

This service orchestrates NER operations, following hexagonal architecture:
- gRPC mixin → NerService (application) → NerEngine (infrastructure) ✓

The service handles:
- Extraction orchestration
- Caching/persistence of results
- Feature flag checking
- Concurrency control for model loading
"""

from __future__ import annotations

import asyncio
import logging
from dataclasses import dataclass
from typing import TYPE_CHECKING

from noteflow.config.settings import get_settings
from noteflow.domain.entities.named_entity import NamedEntity

if TYPE_CHECKING:
    from uuid import UUID

    from noteflow.domain.ports.ner import NerPort
    from noteflow.infrastructure.persistence.unit_of_work import UnitOfWork

logger = logging.getLogger(__name__)


@dataclass
class ExtractionResult:
    """Result of entity extraction."""

    entities: list[NamedEntity]
    cached: bool
    total_count: int


class NerService:
    """Application service for Named Entity Recognition.

    Provides a clean interface for NER operations, abstracting away
    the infrastructure details (spaCy engine, database persistence).
    """

    def __init__(
        self,
        ner_engine: NerPort,
        uow_factory: type[UnitOfWork],
    ) -> None:
        """Initialize NER service.

        Args:
            ner_engine: NER engine implementation (infrastructure adapter).
            uow_factory: Factory for creating Unit of Work instances.
        """
        self._ner_engine = ner_engine
        self._uow_factory = uow_factory
        self._extraction_lock = asyncio.Lock()
        self._model_load_lock = asyncio.Lock()

    async def extract_entities(
        self,
        meeting_id: UUID,
        force_refresh: bool = False,
    ) -> ExtractionResult:
        """Extract named entities from a meeting's transcript.

        Checks for cached results first, unless force_refresh is True.
        Persists new extractions to the database. A meeting without
        segments yields an empty result rather than an error.

        Args:
            meeting_id: Meeting to extract entities from.
            force_refresh: If True, re-extract even if cached results exist.

        Returns:
            ExtractionResult with entities and metadata.

        Raises:
            ValueError: If meeting not found.
            RuntimeError: If NER feature is disabled.
        """
        settings = get_settings()
        if not settings.feature_flags.ner_enabled:
            raise RuntimeError("NER extraction is disabled by feature flag")

        async with self._uow_factory() as uow:
            # Check for cached results
            if not force_refresh:
                cached = await uow.entities.get_by_meeting(meeting_id)
                if cached:
                    logger.debug(
                        "Returning %d cached entities for meeting %s",
                        len(cached),
                        meeting_id,
                    )
                    return ExtractionResult(
                        entities=cached,
                        cached=True,
                        total_count=len(cached),
                    )

            # Fetch meeting and segments
            meeting = await uow.meetings.get(meeting_id)
            if not meeting:
                raise ValueError(f"Meeting {meeting_id} not found")
            if not meeting.segments:
                logger.debug("Meeting %s has no segments", meeting_id)
                return ExtractionResult(entities=[], cached=False, total_count=0)

            # Build segment data
            segments = [(s.segment_id, s.text) for s in meeting.segments]

        # Extract entities (outside UoW to avoid long transactions)
        async with self._extraction_lock:
            loop = asyncio.get_running_loop()

            # Ensure model is loaded. asyncio.Lock serializes coroutines on
            # this event loop; it does not protect against other threads.
            if not self._ner_engine.is_ready():
                async with self._model_load_lock:
                    if not self._ner_engine.is_ready():
                        # Run sync model load in executor
                        await loop.run_in_executor(
                            None,
                            lambda: self._ner_engine.extract("warm up"),
                        )

            # Extract entities
            entities = await loop.run_in_executor(
                None,
                self._ner_engine.extract_from_segments,
                segments,
            )

        # Assign meeting ID to entities
        for entity in entities:
            entity.meeting_id = meeting_id

        # Persist results
        async with self._uow_factory() as uow:
            if force_refresh:
                await uow.entities.delete_by_meeting(meeting_id)
            await uow.entities.save_batch(entities)
            await uow.commit()

        logger.info(
            "Extracted %d entities from meeting %s (%d segments)",
            len(entities),
            meeting_id,
            len(segments),
        )
        return ExtractionResult(
            entities=entities,
            cached=False,
            total_count=len(entities),
        )

    async def get_entities(self, meeting_id: UUID) -> list[NamedEntity]:
        """Get cached entities for a meeting.

        Args:
            meeting_id: Meeting UUID.

        Returns:
            List of entities (empty if not extracted yet).
        """
        async with self._uow_factory() as uow:
            return await uow.entities.get_by_meeting(meeting_id)

    async def pin_entity(self, entity_id: UUID, is_pinned: bool = True) -> bool:
        """Mark an entity as user-verified (pinned).

        Args:
            entity_id: Entity UUID.
            is_pinned: New pinned status.

        Returns:
            True if entity was found and updated.
        """
        async with self._uow_factory() as uow:
            result = await uow.entities.update_pinned(entity_id, is_pinned)
            if result:
                await uow.commit()
            return result

    async def clear_entities(self, meeting_id: UUID) -> int:
        """Delete all entities for a meeting.

        Args:
            meeting_id: Meeting UUID.

        Returns:
            Number of deleted entities.
        """
        async with self._uow_factory() as uow:
            count = await uow.entities.delete_by_meeting(meeting_id)
            await uow.commit()
            logger.info("Cleared %d entities for meeting %s", count, meeting_id)
            return count

    def is_ready(self) -> bool:
        """Check if NER engine is ready.

        Returns:
            True if model is loaded.
        """
        return self._ner_engine.is_ready()
```
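The cache-then-extract flow of `extract_entities` can be checked in isolation with in-memory fakes. The sketch below is a simplified illustration, not the service itself: `InMemoryEntityStore`, `CountingEngine`, and `extract_cached` are hypothetical names, entities are plain strings, and `asyncio.to_thread` stands in for the executor call.

```python
import asyncio


class InMemoryEntityStore:
    """Hypothetical stand-in for the entity repository."""

    def __init__(self) -> None:
        self._rows: dict[str, list[str]] = {}

    async def get_by_meeting(self, meeting_id: str) -> list[str]:
        return list(self._rows.get(meeting_id, []))

    async def replace(self, meeting_id: str, entities: list[str]) -> None:
        self._rows[meeting_id] = list(entities)


class CountingEngine:
    """Toy engine that records how often extraction actually runs."""

    def __init__(self) -> None:
        self.calls = 0

    def extract(self, text: str) -> list[str]:
        self.calls += 1
        return sorted({word for word in text.split() if word.istitle()})


async def extract_cached(store, engine, meeting_id, transcript, force_refresh=False):
    """Return (entities, cached); skip the engine when stored results exist."""
    if not force_refresh:
        cached = await store.get_by_meeting(meeting_id)
        if cached:
            return cached, True
    # Run the synchronous engine in a worker thread so the event loop stays free.
    entities = await asyncio.to_thread(engine.extract, transcript)
    await store.replace(meeting_id, entities)
    return entities, False


async def demo():
    store, engine = InMemoryEntityStore(), CountingEngine()
    first = await extract_cached(store, engine, "m1", "Alice met Bob")
    second = await extract_cached(store, engine, "m1", "Alice met Bob")
    third = await extract_cached(store, engine, "m1", "Alice met Bob", force_refresh=True)
    return first, second, third, engine.calls


first, second, third, calls = asyncio.run(demo())
```

The second call is served from the store without touching the engine, and only `force_refresh=True` triggers a second extraction. A test in `tests/application/test_ner_service.py` could assert the same behavior against the real `NerService` with fakes injected through its constructor.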
Task 8: Update gRPC Mixin (Calls NerService)
File: src/noteflow/grpc/_mixins/entities.py
The mixin now calls NerService (application layer), not NerEngine (infrastructure).
```python
"""Entity extraction gRPC mixin."""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING

import grpc

from noteflow.grpc.proto import noteflow_pb2

if TYPE_CHECKING:
    from noteflow.grpc._mixins.protocols import ServicerHost

logger = logging.getLogger(__name__)


class EntitiesMixin:
    """Mixin for entity extraction RPC methods.

    Architecture: gRPC → NerService (application) → NerEngine (infrastructure)
    """

    async def ExtractEntities(
        self: ServicerHost,
        request: noteflow_pb2.ExtractEntitiesRequest,
        context: grpc.aio.ServicerContext,
    ) -> noteflow_pb2.ExtractEntitiesResponse:
        """Extract named entities from meeting transcript.

        Delegates to NerService for extraction, caching, and persistence.
        """
        meeting_id = self._parse_meeting_id(request.meeting_id)

        try:
            result = await self._ner_service.extract_entities(
                meeting_id=meeting_id,
                force_refresh=request.force_refresh,
            )
        except ValueError as e:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details(str(e))
            return noteflow_pb2.ExtractEntitiesResponse()
        except RuntimeError as e:
            # Feature disabled
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details(str(e))
            return noteflow_pb2.ExtractEntitiesResponse()

        # Convert to proto
        proto_entities = [
            noteflow_pb2.ExtractedEntity(
                id=str(e.id),
                text=e.text,
                category=e.category.value,
                segment_ids=e.segment_ids,
                confidence=e.confidence,
                is_pinned=e.is_pinned,
            )
            for e in result.entities
        ]

        return noteflow_pb2.ExtractEntitiesResponse(
            entities=proto_entities,
            total_count=result.total_count,
            cached=result.cached,
        )

    async def GetEntities(
        self: ServicerHost,
        request: noteflow_pb2.GetEntitiesRequest,
        context: grpc.aio.ServicerContext,
    ) -> noteflow_pb2.GetEntitiesResponse:
        """Get cached entities for a meeting (no extraction)."""
        meeting_id = self._parse_meeting_id(request.meeting_id)
        entities = await self._ner_service.get_entities(meeting_id)

        proto_entities = [
            noteflow_pb2.ExtractedEntity(
                id=str(e.id),
                text=e.text,
                category=e.category.value,
                segment_ids=e.segment_ids,
                confidence=e.confidence,
                is_pinned=e.is_pinned,
            )
            for e in entities
        ]

        return noteflow_pb2.GetEntitiesResponse(entities=proto_entities)

    async def PinEntity(
        self: ServicerHost,
        request: noteflow_pb2.PinEntityRequest,
        context: grpc.aio.ServicerContext,
    ) -> noteflow_pb2.PinEntityResponse:
        """Mark an entity as user-verified."""
        from uuid import UUID

        entity_id = UUID(request.entity_id)
        success = await self._ner_service.pin_entity(entity_id, request.is_pinned)

        if not success:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details(f"Entity {entity_id} not found")

        return noteflow_pb2.PinEntityResponse(success=success)
```
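One edge case worth noting in `PinEntity`: `UUID(request.entity_id)` raises `ValueError` when the client sends a malformed ID, which would surface as an unhandled error rather than a clean status code. A minimal sketch of a guard is shown below; `parse_uuid` is a hypothetical helper, and mapping the `None` case to `grpc.StatusCode.INVALID_ARGUMENT` is a suggestion rather than something the plan specifies.

```python
from __future__ import annotations

from uuid import UUID


def parse_uuid(raw: str) -> UUID | None:
    """Return the parsed UUID, or None when the string is not a valid UUID."""
    try:
        return UUID(raw)
    except ValueError:
        return None


valid = parse_uuid("12345678-1234-5678-1234-567812345678")
invalid = parse_uuid("not-a-uuid")
```

In the mixin, a `None` result could set the status code and return an empty `PinEntityResponse` before `pin_entity` is ever called.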
Task 9: Frontend Integration
File: client/src/hooks/useEntityExtraction.ts
```typescript
import { useCallback, useEffect, useState } from 'react';
import { invoke } from '@tauri-apps/api/core';

export interface ExtractedEntity {
  id: string;
  text: string;
  category: 'person' | 'company' | 'product' | 'technical' | 'acronym' | 'location' | 'date' | 'other';
  segmentIds: number[];
  confidence: number;
  isPinned: boolean;
}

interface ExtractionResult {
  entities: ExtractedEntity[];
  totalCount: number;
  cached: boolean;
}

type ExtractionStatus = 'idle' | 'loading' | 'success' | 'error';

interface UseEntityExtractionOptions {
  autoLoad?: boolean;
  pollInterval?: number; // For long extractions
}

interface UseEntityExtractionReturn {
  entities: ExtractedEntity[];
  status: ExtractionStatus;
  error: string | null;
  cached: boolean;
  extract: (forceRefresh?: boolean) => Promise<void>;
  pinEntity: (entityId: string, isPinned: boolean) => Promise<void>;
  clearEntities: () => Promise<void>;
}

export function useEntityExtraction(
  meetingId: string | null,
  options: UseEntityExtractionOptions = {},
): UseEntityExtractionReturn {
  const { autoLoad = false } = options;

  const [entities, setEntities] = useState<ExtractedEntity[]>([]);
  const [status, setStatus] = useState<ExtractionStatus>('idle');
  const [error, setError] = useState<string | null>(null);
  const [cached, setCached] = useState(false);

  const extract = useCallback(async (forceRefresh = false) => {
    if (!meetingId) return;

    setStatus('loading');
    setError(null);

    try {
      const result = await invoke<ExtractionResult>('extract_entities', {
        meetingId,
        forceRefresh,
      });
      setEntities(result.entities);
      setCached(result.cached);
      setStatus('success');
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      setError(message);
      setStatus('error');
    }
  }, [meetingId]);

  const pinEntity = useCallback(async (entityId: string, isPinned: boolean) => {
    try {
      await invoke('pin_entity', { entityId, isPinned });
      // Update local state once the backend call has succeeded
      setEntities((prev) =>
        prev.map((e) => (e.id === entityId ? { ...e, isPinned } : e))
      );
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      setError(message);
    }
  }, []);

  const clearEntities = useCallback(async () => {
    if (!meetingId) return;

    try {
      await invoke('clear_entities', { meetingId });
      setEntities([]);
      setCached(false);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      setError(message);
    }
  }, [meetingId]);

  // Auto-load on mount if requested
  useEffect(() => {
    if (autoLoad && meetingId) {
      extract(false);
    }
  }, [autoLoad, meetingId, extract]);

  return {
    entities,
    status,
    error,
    cached,
    extract,
    pinEntity,
    clearEntities,
  };
}
```
File: client/src/components/EntityExtractionPanel.tsx
import { Loader2, Pin, RefreshCw, Sparkles, Trash2, X } from 'lucide-react';
import { useMemo } from 'react';

import { Button } from '@/components/ui/Button';
import { Card, CardContent, CardHeader, CardTitle } from '@/components/ui/Card';
import { Badge } from '@/components/ui/Badge';
import { Tooltip, TooltipContent, TooltipTrigger } from '@/components/ui/Tooltip';
import { ExtractedEntity, useEntityExtraction } from '@/hooks/useEntityExtraction';

interface EntityExtractionPanelProps {
  meetingId: string;
  className?: string;
}

const CATEGORY_COLORS: Record<string, string> = {
  person: 'bg-blue-100 text-blue-800',
  company: 'bg-purple-100 text-purple-800',
  product: 'bg-green-100 text-green-800',
  location: 'bg-amber-100 text-amber-800',
  date: 'bg-rose-100 text-rose-800',
  technical: 'bg-cyan-100 text-cyan-800',
  acronym: 'bg-indigo-100 text-indigo-800',
  other: 'bg-gray-100 text-gray-800',
};

function EntityBadge({
  entity,
  onPin,
}: {
  entity: ExtractedEntity;
  onPin: (id: string, pinned: boolean) => void;
}) {
  const colorClass = CATEGORY_COLORS[entity.category] || CATEGORY_COLORS.other;
  return (
    <Badge
      variant="secondary"
      className={`${colorClass} flex items-center gap-1 ${entity.isPinned ? 'ring-2 ring-yellow-400' : ''}`}
    >
      <span>{entity.text}</span>
      <Tooltip>
        <TooltipTrigger asChild>
          <button
            type="button"
            onClick={() => onPin(entity.id, !entity.isPinned)}
            className="ml-1 opacity-60 hover:opacity-100"
          >
            <Pin className={`w-3 h-3 ${entity.isPinned ? 'fill-current' : ''}`} />
          </button>
        </TooltipTrigger>
        <TooltipContent>
          {entity.isPinned ? 'Unpin entity' : 'Pin as verified'}
        </TooltipContent>
      </Tooltip>
    </Badge>
  );
}

export function EntityExtractionPanel({ meetingId, className }: EntityExtractionPanelProps) {
  const {
    entities,
    status,
    error,
    cached,
    extract,
    pinEntity,
    clearEntities,
  } = useEntityExtraction(meetingId, { autoLoad: true });

  // Group entities by category
  const groupedEntities = useMemo(() => {
    const groups: Record<string, ExtractedEntity[]> = {};
    for (const entity of entities) {
      const category = entity.category;
      if (!groups[category]) {
        groups[category] = [];
      }
      groups[category].push(entity);
    }
    return groups;
  }, [entities]);

  const categoryOrder = ['person', 'company', 'product', 'location', 'technical', 'acronym', 'date', 'other'];

  return (
    <Card className={className}>
      <CardHeader className="pb-3">
        <div className="flex items-center justify-between">
          <CardTitle className="text-sm font-medium">
            Named Entities
            {cached && (
              <span className="ml-2 text-xs text-muted-foreground">(cached)</span>
            )}
          </CardTitle>
          <div className="flex gap-1">
            <Tooltip>
              <TooltipTrigger asChild>
                <Button
                  variant="ghost"
                  size="sm"
                  onClick={() => extract(false)}
                  disabled={status === 'loading'}
                >
                  {status === 'loading' ? (
                    <Loader2 className="w-4 h-4 animate-spin" />
                  ) : (
                    <Sparkles className="w-4 h-4" />
                  )}
                </Button>
              </TooltipTrigger>
              <TooltipContent>Extract entities</TooltipContent>
            </Tooltip>
            <Tooltip>
              <TooltipTrigger asChild>
                <Button
                  variant="ghost"
                  size="sm"
                  onClick={() => extract(true)}
                  disabled={status === 'loading'}
                >
                  <RefreshCw className="w-4 h-4" />
                </Button>
              </TooltipTrigger>
              <TooltipContent>Re-extract (refresh)</TooltipContent>
            </Tooltip>
            {entities.length > 0 && (
              <Tooltip>
                <TooltipTrigger asChild>
                  <Button
                    variant="ghost"
                    size="sm"
                    onClick={clearEntities}
                  >
                    <Trash2 className="w-4 h-4" />
                  </Button>
                </TooltipTrigger>
                <TooltipContent>Clear all entities</TooltipContent>
              </Tooltip>
            )}
          </div>
        </div>
      </CardHeader>
      <CardContent>
        {/* Loading state */}
        {status === 'loading' && entities.length === 0 && (
          <div className="flex flex-col items-center justify-center py-8 text-muted-foreground">
            <Loader2 className="w-8 h-8 animate-spin mb-2" />
            <p className="text-sm">Extracting entities...</p>
            <p className="text-xs mt-1">This may take a moment for long transcripts</p>
          </div>
        )}
        {/* Error state */}
        {status === 'error' && (
          <div className="flex items-center gap-2 p-3 bg-destructive/10 text-destructive rounded-md">
            <X className="w-4 h-4 flex-shrink-0" />
            <p className="text-sm">{error}</p>
          </div>
        )}
        {/* Empty state */}
        {status === 'success' && entities.length === 0 && (
          <p className="text-sm text-muted-foreground text-center py-4">
            No entities found. Click the sparkle icon to extract.
          </p>
        )}
        {/* Entity list grouped by category */}
        {entities.length > 0 && (
          <div className="space-y-4">
            {categoryOrder
              // Treat a missing group as empty so the comparison is type-safe
              .filter((cat) => (groupedEntities[cat]?.length ?? 0) > 0)
              .map((category) => (
                <div key={category}>
                  <h4 className="text-xs font-medium text-muted-foreground uppercase tracking-wide mb-2">
                    {category} ({groupedEntities[category].length})
                  </h4>
                  <div className="flex flex-wrap gap-2">
                    {groupedEntities[category].map((entity) => (
                      <EntityBadge
                        key={entity.id}
                        entity={entity}
                        onPin={pinEntity}
                      />
                    ))}
                  </div>
                </div>
              ))}
          </div>
        )}
      </CardContent>
    </Card>
  );
}
File: client/src-tauri/src/commands/entities.rs
use serde::{Deserialize, Serialize};
use tauri::State;
use uuid::Uuid;

use crate::state::AppState;

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct ExtractedEntity {
    pub id: String,
    pub text: String,
    pub category: String,
    pub segment_ids: Vec<i32>,
    pub confidence: f32,
    pub is_pinned: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct ExtractEntitiesResult {
    pub entities: Vec<ExtractedEntity>,
    pub total_count: i32,
    pub cached: bool,
}

#[tauri::command]
pub async fn extract_entities(
    meeting_id: String,
    force_refresh: Option<bool>,
    state: State<'_, AppState>,
) -> Result<ExtractEntitiesResult, String> {
    let client = state.grpc_client.lock().await;
    let client = client.as_ref().ok_or("gRPC client not initialized")?;
    let request = proto::ExtractEntitiesRequest {
        meeting_id,
        force_refresh: force_refresh.unwrap_or(false),
    };
    let response = client
        .extract_entities(request)
        .await
        .map_err(|e| format!("Failed to extract entities: {e}"))?;
    Ok(ExtractEntitiesResult {
        entities: response
            .entities
            .into_iter()
            .map(|e| ExtractedEntity {
                id: e.id,
                text: e.text,
                category: e.category,
                segment_ids: e.segment_ids,
                confidence: e.confidence,
                is_pinned: e.is_pinned,
            })
            .collect(),
        total_count: response.total_count,
        cached: response.cached,
    })
}

#[tauri::command]
pub async fn pin_entity(
    entity_id: String,
    is_pinned: bool,
    state: State<'_, AppState>,
) -> Result<bool, String> {
    let client = state.grpc_client.lock().await;
    let client = client.as_ref().ok_or("gRPC client not initialized")?;
    let request = proto::PinEntityRequest {
        entity_id,
        is_pinned,
    };
    let response = client
        .pin_entity(request)
        .await
        .map_err(|e| format!("Failed to pin entity: {e}"))?;
    Ok(response.success)
}

#[tauri::command]
pub async fn clear_entities(
    meeting_id: String,
    state: State<'_, AppState>,
) -> Result<i32, String> {
    let client = state.grpc_client.lock().await;
    let client = client.as_ref().ok_or("gRPC client not initialized")?;
    let request = proto::ClearEntitiesRequest { meeting_id };
    let response = client
        .clear_entities(request)
        .await
        .map_err(|e| format!("Failed to clear entities: {e}"))?;
    Ok(response.deleted_count)
}
Code Segments to Reuse
Persistence Layer Patterns (CRITICAL)
Location: src/noteflow/infrastructure/persistence/
| Pattern | Reference Location | Usage |
|---|---|---|
| ORM Model | models/entities/named_entity.py | Already exists - no creation needed |
| Repository | repositories/annotation_repo.py | Template for CRUD operations |
| Base Repository | repositories/_base.py | Extend for helper methods |
| Unit of Work | unit_of_work.py | Add entity repository property |
| Converters | converters/orm_converters.py | Add entity_to_domain() method |
Existing Entity UI Components
Location: client/src/components/entity-highlight.tsx
Already renders entity highlights with tooltips - connect to extracted entities.
Location: client/src/components/entity-management-panel.tsx
CRUD panel with Sheet UI - extend to display auto-extracted entities.
Location: client/src/lib/entity-store.ts
Client-side observer pattern store - wire to backend extraction results.
Warning: Color definitions are duplicated between entity-highlight.tsx (inline categoryColors) and types/entity.ts (ENTITY_CATEGORY_COLORS). Use the shared constant from types/entity.ts.
Application Service Pattern
Location: src/noteflow/application/services/summarization_service.py
Pattern for application service with:
- Dataclass-based service with settings
- Provider registration pattern (multi-backend support)
- Lazy model loading via property getters
- Callback-based persistence (not UoW-injected)
Location: src/noteflow/infrastructure/summarization/ollama_provider.py
Pattern for lazy model loading:
- self._client: T | None = None initial state
- _get_client() method for lazy initialization
- is_available property for runtime checks
- asyncio.to_thread() for CPU-bound inference
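The lazy-loading pattern listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in ollama_provider.py: LazyEngine, fake_loader, and the Model alias are hypothetical names, and the "model" is a plain callable standing in for a loaded NLP pipeline.

```python
from __future__ import annotations

import asyncio
from typing import Callable

# Hypothetical stand-in for a loaded pipeline: text in, entity strings out.
Model = Callable[[str], list[str]]


class LazyEngine:
    """Defers the expensive model load until the first extraction."""

    def __init__(self, loader: Callable[[], Model]) -> None:
        self._loader = loader
        self._client: Model | None = None  # initial state: nothing loaded

    @property
    def is_available(self) -> bool:
        """Runtime check: True once the model has been loaded."""
        return self._client is not None

    def _get_client(self) -> Model:
        """Load on first call, then reuse the cached instance."""
        if self._client is None:
            self._client = self._loader()
        return self._client

    async def extract(self, text: str) -> list[str]:
        """Run load and inference in a worker thread, off the event loop."""
        return await asyncio.to_thread(lambda: self._get_client()(text))


def fake_loader() -> Model:
    """Toy loader: treats title-cased words as entities."""
    return lambda text: [word for word in text.split() if word.istitle()]


if __name__ == "__main__":
    engine = LazyEngine(fake_loader)
    print(engine.is_available)
    print(asyncio.run(engine.extract("Alice met Bob")))
    print(engine.is_available)
```

Note that _get_client() in this sketch is not guarded against two threads loading at once; a real engine would likely need a lock around the first load.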
Location: src/noteflow/grpc/_mixins/diarization.py
Pattern for CPU-bound gRPC handlers:
- asyncio.Lock for concurrency control
- loop.run_in_executor() for blocking operations
- Structured logging with meeting context
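A rough sketch of that handler pattern is shown below. It is not the diarization mixin itself: ExtractionHandler and blocking_extract are hypothetical names, and the toy extractor only stands in for CPU-bound inference.

```python
import asyncio
import logging

logger = logging.getLogger("ner.sketch")


def blocking_extract(text: str) -> list[str]:
    """Stand-in for CPU-bound inference: capitalized words become entities."""
    return [word.strip(".,") for word in text.split() if word[:1].isupper()]


class ExtractionHandler:
    """Hypothetical handler that serializes extraction requests."""

    def __init__(self) -> None:
        # One extraction at a time, so concurrent requests queue up
        # instead of competing for CPU and memory.
        self._lock = asyncio.Lock()

    async def handle(self, meeting_id: str, text: str) -> list[str]:
        async with self._lock:
            loop = asyncio.get_running_loop()
            logger.info("extraction started", extra={"meeting_id": meeting_id})
            # Run the blocking call in the default thread pool executor
            # so the event loop stays responsive.
            result = await loop.run_in_executor(None, blocking_extract, text)
            logger.info(
                "extraction finished",
                extra={"meeting_id": meeting_id, "entity_count": len(result)},
            )
            return result


async def main() -> list[list[str]]:
    handler = ExtractionHandler()
    return await asyncio.gather(
        handler.handle("m1", "Alice spoke."),
        handler.handle("m2", "Bob listened."),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The lock trades throughput for predictable resource use; whether a single global lock or a per-meeting lock fits better depends on how the service is deployed.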
Performance Targets
| Metric | Target | Measurement |
|---|---|---|
| Model load time | < 3s | First extraction latency |
| Extraction throughput | > 500 segments/sec | pytest --benchmark |
| API response (cached) | < 50ms | p95 latency |
| API response (extraction) | < 2s for 100 segments | p95 latency |
| Memory overhead | < 200MB | Model + cache |
Acceptance Criteria
Functional
- "Auto-Extract Entities" button appears on meeting detail page
- Clicking button extracts entities from all segments
- Extracted entities appear in EntityExtractionPanel
- Entity categories are mapped from spaCy labels to the domain categories (person, company, location, etc.)
- Entities link to source segments via segment_ids
- Entities persist to database and survive server restart
- Cached results returned on subsequent requests
- Force refresh re-extracts and replaces cached results
- Pinned entities preserve user verification
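The caching criteria above (cached results on repeat requests, re-extraction on force refresh) can be sketched as follows. This is an illustration only, not the NerService implementation: the dict-backed store, ExtractionOutcome, and the extractor callable are hypothetical stand-ins for the repository, result type, and NER engine.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class ExtractionOutcome:
    """Hypothetical result type: entity texts plus a cache-hit flag."""

    entities: list[str]
    cached: bool


def extract_entities(
    meeting_id: str,
    transcript: str,
    store: dict[str, list[str]],
    extractor: Callable[[str], list[str]],
    force_refresh: bool = False,
) -> ExtractionOutcome:
    """Return stored entities unless none exist or a refresh is forced."""
    existing = store.get(meeting_id)
    if existing and not force_refresh:
        return ExtractionOutcome(entities=existing, cached=True)
    # Cache miss or forced refresh: extract and replace the stored result.
    entities = extractor(transcript)
    store[meeting_id] = entities
    return ExtractionOutcome(entities=entities, cached=False)


if __name__ == "__main__":
    store: dict[str, list[str]] = {}
    toy = lambda text: [w for w in text.split() if w.istitle() and w != "Hello"]
    print(extract_entities("m1", "Hello John", store, toy).cached)
    print(extract_entities("m1", "Hello John", store, toy).cached)
    print(extract_entities("m1", "Hello John", store, toy, force_refresh=True).cached)
```

An empty stored list is treated as "no cache" here, which mirrors the extracts-when-no-cache test case. The sketch does not model pinned entities.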
Technical
- spaCy model lazy-loaded (no startup delay)
- Entities deduplicated across segments
- Segment tracking works correctly
- Error handling for missing spaCy model
- NerService mediates between gRPC and NerEngine (hexagonal architecture)
- Feature flag controls extraction availability
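Deduplication and segment tracking interact: when the same entity appears in several segments, one record should carry all of its segment IDs. A minimal sketch, assuming raw mentions arrive as (segment_id, text, category) tuples; Mention and merge_mentions are hypothetical names, not the engine's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Mention:
    """Hypothetical merged entity record."""

    text: str
    category: str
    segment_ids: list[int] = field(default_factory=list)


def merge_mentions(raw: list[tuple[int, str, str]]) -> list[Mention]:
    """Merge mentions that share a normalized text, collecting segment IDs."""
    merged: dict[str, Mention] = {}
    for segment_id, text, category in raw:
        key = text.lower().strip()  # same normalization the document describes
        # The first occurrence decides the display text and category.
        mention = merged.setdefault(key, Mention(text=text, category=category))
        if segment_id not in mention.segment_ids:
            mention.segment_ids.append(segment_id)
    return list(merged.values())


if __name__ == "__main__":
    raw = [
        (1, "Alice", "person"),
        (2, "Bob", "person"),
        (3, "alice ", "person"),
    ]
    for mention in merge_mentions(raw):
        print(mention.text, mention.segment_ids)
```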
Quality Gates
- pytest tests/quality/ passes
- Module size < 150 lines (engine.py, ner_service.py)
- All functions documented
- Unit tests cover edge cases
- Application layer tests mock infrastructure
Test Plan
Test Philosophy: Use pytest.mark.parametrize for test variations. No conditionals in tests - each parameter set is a separate test case. See pytest parametrize docs.
Unit Tests (Infrastructure)
File: tests/infrastructure/ner/test_engine.py
"""NER engine unit tests using pytest.mark.parametrize."""
import pytest

from noteflow.domain.entities.named_entity import EntityCategory
from noteflow.infrastructure.ner.engine import SpacyNerEngine


# Use module-scoped fixture to load model once (expensive)
@pytest.fixture(scope="module")
def engine() -> SpacyNerEngine:
    """Create NER engine with model pre-loaded."""
    eng = SpacyNerEngine(model_name="en_core_web_sm")
    eng._ensure_loaded()  # Pre-load for all tests
    return eng


class TestEntityExtraction:
    """Test entity extraction with parametrized inputs."""

    @pytest.mark.parametrize(
        ("text", "expected_category", "expected_text_fragment"),
        [
            pytest.param(
                "John Smith discussed the project.",
                EntityCategory.PERSON,
                "john",
                id="person-single",
            ),
            pytest.param(
                "Alice Johnson and Bob Williams met today.",
                EntityCategory.PERSON,
                "alice",
                id="person-multiple",
            ),
            pytest.param(
                "We use Google Cloud for hosting.",
                EntityCategory.COMPANY,
                "google",
                id="company-tech",
            ),
            pytest.param(
                "The meeting is in New York.",
                EntityCategory.LOCATION,
                "new york",
                id="location-city",
            ),
        ],
    )
    def test_extracts_expected_entity_type(
        self,
        engine: SpacyNerEngine,
        text: str,
        expected_category: EntityCategory,
        expected_text_fragment: str,
    ) -> None:
        """Extract entities of expected category from text."""
        entities = engine.extract(text)
        matching = [e for e in entities if e.category == expected_category]
        assert matching, f"Expected {expected_category.value} entity in: {text}"
        texts_lower = [e.text.lower() for e in matching]
        assert any(
            expected_text_fragment in t for t in texts_lower
        ), f"Expected '{expected_text_fragment}' in {texts_lower}"

    @pytest.mark.parametrize(
        ("text", "expected_count"),
        [
            pytest.param("", 0, id="empty-string"),
            pytest.param(" ", 0, id="whitespace-only"),
            pytest.param("Hello world.", 0, id="no-entities"),
        ],
    )
    def test_handles_edge_cases(
        self,
        engine: SpacyNerEngine,
        text: str,
        expected_count: int,
    ) -> None:
        """Handle edge cases correctly."""
        entities = engine.extract(text)
        assert len(entities) == expected_count


class TestDeduplication:
    """Test entity deduplication."""

    @pytest.mark.parametrize(
        ("text", "entity_fragment", "expected_count"),
        [
            pytest.param(
                "John met with John about John's project.",
                "john",
                1,
                id="repeated-name",
            ),
            pytest.param(
                "Google uses Google Cloud. Google is great.",
                "google",
                1,
                id="repeated-company",
            ),
        ],
    )
    def test_deduplicates_repeated_entities(
        self,
        engine: SpacyNerEngine,
        text: str,
        entity_fragment: str,
        expected_count: int,
    ) -> None:
        """Deduplicate repeated entities in text."""
        entities = engine.extract(text)
        matching = [e for e in entities if entity_fragment in e.text.lower()]
        assert len(matching) == expected_count, f"Expected {expected_count} '{entity_fragment}'"


class TestSegmentTracking:
    """Test segment ID tracking across multiple segments."""

    @pytest.mark.parametrize(
        ("segments", "entity_fragment", "expected_segment_ids"),
        [
            pytest.param(
                [(1, "John presented."), (2, "John reviewed."), (3, "John approved.")],
                "john",
                {1, 2, 3},
                id="entity-in-all-segments",
            ),
            pytest.param(
                [(1, "Alice spoke."), (2, "Bob listened."), (3, "Alice concluded.")],
                "alice",
                {1, 3},
                id="entity-in-some-segments",
            ),
        ],
    )
    def test_tracks_segment_ids(
        self,
        engine: SpacyNerEngine,
        segments: list[tuple[int, str]],
        entity_fragment: str,
        expected_segment_ids: set[int],
    ) -> None:
        """Track segment IDs for entities across segments."""
        entities = engine.extract_from_segments(segments)
        matching = [e for e in entities if entity_fragment in e.text.lower()]
        assert matching, f"Expected entity with '{entity_fragment}'"
        assert set(matching[0].segment_ids) == expected_segment_ids


class TestLazyLoading:
    """Test model lazy loading."""

    def test_model_not_loaded_on_init(self) -> None:
        """Model is not loaded until first use."""
        engine = SpacyNerEngine()
        assert not engine.is_ready()

    def test_model_loads_on_first_extract(self) -> None:
        """Model loads on first extraction."""
        engine = SpacyNerEngine()
        engine.extract("Test text")
        assert engine.is_ready()
Application Layer Tests
File: tests/application/test_ner_service.py
"""NerService application layer tests using parametrize."""
from typing import Any
from unittest.mock import AsyncMock, MagicMock
from uuid import uuid4

import pytest

# Module path follows application/services/ner_service.py
from noteflow.application.services.ner_service import ExtractionResult, NerService
from noteflow.domain.entities.meeting import Meeting
from noteflow.domain.entities.named_entity import EntityCategory, NamedEntity
from noteflow.domain.entities.segment import Segment

# Fixtures defined in tests/conftest.py - DO NOT REDEFINE
# mock_uow, mock_ner_engine, mock_uow_factory


@pytest.fixture
def ner_service(mock_ner_engine: MagicMock, mock_uow_factory: type) -> NerService:
    """Create NerService with injected mocks."""
    return NerService(ner_engine=mock_ner_engine, uow_factory=mock_uow_factory)


class TestExtractEntities:
    """Test entity extraction scenarios."""

    @pytest.mark.asyncio
    @pytest.mark.parametrize(
        ("cached_entities", "force_refresh", "expected_cached"),
        [
            pytest.param(
                [NamedEntity.create("Cached", EntityCategory.PERSON, [1], 0.9)],
                False,
                True,
                id="returns-cached-when-available",
            ),
            pytest.param(
                [],
                False,
                False,
                id="extracts-when-no-cache",
            ),
            pytest.param(
                [NamedEntity.create("Old", EntityCategory.PERSON, [1], 0.9)],
                True,
                False,
                id="re-extracts-on-force-refresh",
            ),
        ],
    )
    async def test_caching_behavior(
        self,
        ner_service: NerService,
        mock_uow: AsyncMock,
        mock_ner_engine: MagicMock,
        cached_entities: list[NamedEntity],
        force_refresh: bool,
        expected_cached: bool,
    ) -> None:
        """Test caching behavior with different scenarios."""
        meeting_id = uuid4()
        mock_uow.entities.get_by_meeting.return_value = cached_entities
        # Setup meeting for extraction cases
        mock_uow.meetings.get.return_value = Meeting(
            id=meeting_id,
            title="Test",
            segments=[Segment(segment_id=1, text="Hello John")],
        )
        result = await ner_service.extract_entities(meeting_id, force_refresh=force_refresh)
        assert result.cached == expected_cached


class TestErrorHandling:
    """Test error handling scenarios."""

    @pytest.mark.asyncio
    @pytest.mark.parametrize(
        ("setup_mock", "expected_error", "expected_message"),
        [
            pytest.param(
                {"meetings.get": None, "entities.get_by_meeting": []},
                ValueError,
                "not found",
                id="meeting-not-found",
            ),
        ],
    )
    async def test_raises_expected_errors(
        self,
        ner_service: NerService,
        mock_uow: AsyncMock,
        setup_mock: dict[str, Any],
        expected_error: type[Exception],
        expected_message: str,
    ) -> None:
        """Test that expected errors are raised."""
        meeting_id = uuid4()
        # Apply mock setup: walk each dotted path and set its return value
        for attr_path, return_value in setup_mock.items():
            attrs = attr_path.split(".")
            obj = mock_uow
            for attr in attrs[:-1]:
                obj = getattr(obj, attr)
            getattr(obj, attrs[-1]).return_value = return_value
        with pytest.raises(expected_error, match=expected_message):
            await ner_service.extract_entities(meeting_id)


class TestPinEntity:
    """Test entity pinning."""

    @pytest.mark.asyncio
    @pytest.mark.parametrize(
        ("update_result", "expected_return", "expected_commit_calls"),
        [
            pytest.param(True, True, 1, id="successful-pin"),
            pytest.param(False, False, 0, id="entity-not-found"),
        ],
    )
    async def test_pin_entity_behavior(
        self,
        ner_service: NerService,
        mock_uow: AsyncMock,
        update_result: bool,
        expected_return: bool,
        expected_commit_calls: int,
    ) -> None:
        """Test pin entity with different outcomes."""
        entity_id = uuid4()
        mock_uow.entities.update_pinned.return_value = update_result
        result = await ner_service.pin_entity(entity_id, is_pinned=True)
        assert result == expected_return
        assert mock_uow.commit.call_count == expected_commit_calls
Dependencies
- spaCy: NER library
- en_core_web_sm: English language model (roughly 12 MB as a download for the 3.7 release; downloaded in Sprint 0 Docker setup)
Blocks
- Sprint 0 (Proto & Schema Foundation) must be complete
Failure Modes
| Failure | Detection | Recovery |
|---|---|---|
| spaCy model not installed | OSError on spacy.load() | Return FAILED_PRECONDITION with helpful message to run python -m spacy download en_core_web_sm |
| Model OOM on long transcript | MemoryError or process crash | Chunk transcript into batches of 100 segments, process incrementally |
| Empty transcript | len(segments) == 0 | Return empty entities list (not an error) |
| Feature flag disabled | settings.feature_flags.ner_enabled == False | Return FAILED_PRECONDITION with feature disabled message |
| DB constraint violation | UniqueViolation on uq_named_entities_meeting_text | Use ON CONFLICT DO UPDATE for upsert behavior |
| Extraction timeout | Processing > 30s for large meetings | Add configurable timeout, chunk processing |
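The chunking recovery for long transcripts can be sketched with a small batching helper. This is an assumption about how batching might look, not code from the engine; iter_batches is a hypothetical name and the default of 100 comes from the table above.

```python
from __future__ import annotations

from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int = 100) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


if __name__ == "__main__":
    # Segments as (segment_id, text) pairs, as in the engine tests.
    segments = [(i, f"Segment {i}") for i in range(250)]
    for batch in iter_batches(segments):
        # Each batch would be passed to the engine separately, so memory
        # use is bounded by the batch size rather than the transcript length.
        print(len(batch))
```

Processing in batches also gives a natural point to check a deadline between batches, which is one way the configurable timeout could be enforced.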
Deduplication Limitation
Current: Uses normalized_text = text.lower().strip() for deduplication.
Known issue: "IBM" and "I.B.M." would be treated as different entities.
Future work: Add fuzzy matching using rapidfuzz library:
# Future enhancement (not in this sprint)
from rapidfuzz import fuzz

from noteflow.domain.entities.named_entity import NamedEntity


def find_similar_entity(new_text: str, existing: list[NamedEntity]) -> NamedEntity | None:
    """Find existing entity with >90% similarity."""
    for entity in existing:
        # fuzz.ratio returns a score from 0 to 100
        if fuzz.ratio(new_text.lower(), entity.text.lower()) > 90:
            return entity
    return None
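The known issue itself is easy to reproduce with the standard library. In the sketch below, normalize mirrors the current rule described above, while normalize_loose is a hypothetical stricter key (not part of this sprint) that also drops punctuation and whitespace.

```python
import re


def normalize(text: str) -> str:
    """Current deduplication key: lowercase and strip outer whitespace."""
    return text.lower().strip()


def normalize_loose(text: str) -> str:
    """Hypothetical key: lowercase, then drop anything outside a-z and 0-9."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


if __name__ == "__main__":
    # The current key keeps the dots, so the two spellings stay distinct.
    print(normalize("IBM") == normalize("I.B.M."))
    # The looser key collapses both spellings to the same string.
    print(normalize_loose("IBM") == normalize_loose("I.B.M."))
```

The looser key has costs of its own: it merges "New York" with "NewYork" and discards non-ASCII letters entirely, so it is a trade-off rather than a drop-in fix.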
Definition of Done
- All acceptance criteria met
- pytest tests/quality/ passes
- pytest tests/application/test_ner_service.py passes
- pytest tests/infrastructure/ner/ passes
- ruff check . passes
- basedpyright passes
- Performance targets verified
- Frontend components render correctly with loading/error states
- Entities persist across server restarts
- Feature flag controls availability