Sprint 4: Named Entity Extraction (NER)

Priority: 4 | Owner: Backend | Complexity: Medium | Prerequisite: Sprint 0 (Proto & Schema Foundation)


Validation Status (2025-12-27)

100% COMPLETE

| Component | Location | Status |
|-----------|----------|--------|
| Feature Flag | config/settings.py | ner_enabled in FeatureFlags |
| Domain Entity | domain/entities/named_entity.py | NamedEntity, EntityCategory |
| NER Engine | infrastructure/ner/engine.py | NerEngine with spaCy integration |
| NER Service | application/services/ner_service.py | NerService with caching |
| gRPC Mixin | grpc/_mixins/entities.py | EntitiesMixin with extract/get/pin |
| ORM Model | persistence/models/entities/named_entity.py | NamedEntityModel |
| Repository | persistence/repositories/entity_repo.py | SqlAlchemyEntityRepository |
| Proto RPC | noteflow.proto | ExtractEntities, GetEntities, PinEntity |
| Dependency | pyproject.toml | spacy>=3.7 |
| Frontend Hook | hooks/use-entity-extraction.ts | Reusable extraction hook with state |
| Frontend Panel | pages/MeetingDetail.tsx | Entities tab with extract/refresh UI |
| Entity Store | lib/entity-store.ts | Client-side entity state management |
| Rust Command | commands/entities.rs | extract_entities command |
| Tauri Adapter | api/tauri-adapter.ts:179 | extractEntities() wrapper |

Objective

Automatically extract named entities (people, companies, products, locations) from meeting transcripts, reducing manual annotation effort and ensuring consistency.


Library Selection

NER Engine Options

| Library | Best For | Pros | Cons |
|---------|----------|------|------|
| spaCy (recommended) | Production pipelines | Fast, mature, transformer support | Fixed entity types without training |
| GLiNER | Zero-shot custom entities | Extract any entity type without training | Newer, less battle-tested |
| Hugging Face Transformers | Maximum flexibility | State-of-the-art models | More setup, heavier |

Decision: Use spaCy with spacy-transformers for:

  • Production-ready performance (0.05x real-time)
  • Transformer integration via en_core_web_trf model
  • Well-documented, widely adopted
  • Future option: Add GLiNER adapter for custom entity types
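
A minimal sketch of the model selection this decision implies: prefer en_core_web_trf when it is installed, otherwise fall back to en_core_web_sm. The helper name load_ner_pipeline is illustrative and not part of the codebase.

# Sketch only: prefer the transformer pipeline, fall back to the small model.
# load_ner_pipeline is a hypothetical helper, not existing project code.
import spacy
from spacy.language import Language


def load_ner_pipeline(prefer_transformer: bool = True) -> Language:
    """Load en_core_web_trf if available, otherwise en_core_web_sm."""
    candidates = ["en_core_web_trf", "en_core_web_sm"] if prefer_transformer else ["en_core_web_sm"]
    for name in candidates:
        if spacy.util.is_package(name):
            return spacy.load(name)
    raise OSError(
        "No spaCy English model installed; run `python -m spacy download en_core_web_sm`"
    )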



Current State Analysis

What Already Exists

Frontend Components (Complete)

| Component | Location | Status |
|-----------|----------|--------|
| Entity Highlight | client/src/components/entity-highlight.tsx | Renders inline entity highlights with tooltips (190 lines) |
| Entity Panel | client/src/components/entity-management-panel.tsx | Manual CRUD with Sheet UI (388 lines) |
| Entity Store | client/src/lib/entity-store.ts | Client-side state with observer pattern (169 lines) |
| Entity Types | client/src/types/entity.ts | TypeScript types + color mappings (49 lines) |

Backend Infrastructure (Complete)

| Component | Location | Status |
|-----------|----------|--------|
| ORM Model | infrastructure/persistence/models/entities/named_entity.py | Complete - NamedEntityModel with all fields |
| Migration | migrations/versions/h2c3d4e5f6g7_add_named_entities_table.py | Complete - table created with indices |
| Meeting Relationship | models/core/meeting.py:35-37 | Configured - named_entities with cascade delete |
| Proto RPC | noteflow.proto:46 | Defined - rpc ExtractEntities(...) |
| Proto Messages | noteflow.proto:593-630 | Defined - Request, Response, ExtractedEntity |
| Feature Flag | config/settings.py:223-226 | Defined - NOTEFLOW_FEATURE_NER_ENABLED |
| Dependency | pyproject.toml:63 | Declared - spacy>=3.7 |

Note: The named_entities table and proto definitions are already implemented. Sprint 0 completed this foundation work. No schema changes are needed.

Gap: What Needs Implementation

Backend NER processing pipeline:

  • No spaCy engine wrapper (infrastructure/ner/)
  • No NamedEntity domain entity class
  • No NerPort domain protocol
  • No NerService application layer
  • No SqlAlchemyEntityRepository
  • No EntitiesMixin gRPC handler
  • No Rust/Tauri command for extraction
  • Frontend not wired to backend extraction

Field Naming Alignment Warning

⚠️ CRITICAL: Frontend and backend use different field names for entity text:

| Layer | Field | Location |
|-------|-------|----------|
| Frontend TS | term | client/src/types/entity.ts:Entity.term |
| Backend Proto | text | noteflow.proto:ExtractedEntity.text |
| ORM Model | text | models/entities/named_entity.py:NamedEntityModel.text |

Resolution required: Either align frontend to use text or add mapping in Tauri commands. The existing entity-store.ts uses term throughout—a rename may be needed for consistency.


Architecture: Protocol-Based Dependency Injection

All components use Protocol-based dependency injection for testability:

┌─────────────────────────────────────────────────────────────────┐
│                        gRPC Layer                                │
│  EntitiesMixin(ner_service: NerServiceProtocol)                 │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Application Layer                            │
│  NerService(ner_engine: NerPort, uow_factory: UoWFactory)       │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Infrastructure Layer                           │
│  SpacyNerEngine(model_name: str) implements NerPort             │
│  GlinerNerEngine(model_name: str) implements NerPort            │
└─────────────────────────────────────────────────────────────────┘

Key Principles (per ArjanCodes DI Best Practices):

  • Constructor injection for all dependencies
  • Protocol-based abstractions (structural typing)
  • Factory functions for object creation
  • No global state or singletons
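
A rough sketch of the composition root this wiring implies, assuming the constructor signatures shown later in this sprint; create_ner_service is an illustrative name, and the real UnitOfWork may require a bound session factory rather than being passed as a bare class.

# Illustrative composition root (factory name is an assumption, not existing code).
from noteflow.application.services.ner_service import NerService
from noteflow.infrastructure.ner.engine import NerEngine
from noteflow.infrastructure.persistence.unit_of_work import UnitOfWork


def create_ner_service(model_name: str = "en_core_web_sm") -> NerService:
    """Wire the infrastructure adapter into the application service."""
    engine = NerEngine(model_name=model_name)  # satisfies NerPort structurally
    return NerService(ner_engine=engine, uow_factory=UnitOfWork)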

Target/Affected Code

Files to Create

| File | Purpose | Lines Est. |
|------|---------|------------|
| src/noteflow/domain/entities/named_entity.py | Domain entity | ~50 |
| src/noteflow/domain/ports/ner.py | NER port interface | ~40 |
| src/noteflow/application/services/ner_service.py | Application service layer | ~150 |
| src/noteflow/infrastructure/ner/__init__.py | Module init | ~10 |
| src/noteflow/infrastructure/ner/engine.py | spaCy NER engine | ~120 |
| src/noteflow/infrastructure/persistence/repositories/entity_repo.py | Entity repository | ~100 |
| src/noteflow/infrastructure/converters/ner_converters.py | ORM ↔ domain converters | ~50 |
| src/noteflow/grpc/_mixins/entities.py | gRPC mixin (calls NerService) | ~100 |
| tests/application/test_ner_service.py | Application layer tests | ~200 |
| tests/infrastructure/ner/test_engine.py | Engine unit tests | ~150 |
| client/src/hooks/use-entity-extraction.ts | React hook with polling | ~80 |
| client/src-tauri/src/commands/entities.rs | Rust command | ~60 |

Note: The ORM model (NamedEntityModel) already exists at models/entities/named_entity.py. The frontend extraction panel can extend the existing entity-management-panel.tsx.

Files to Modify

| File | Change Type | Lines Est. |
|------|-------------|------------|
| src/noteflow/grpc/service.py | Add mixin + NerService initialization | +15 |
| src/noteflow/grpc/server.py | Initialize NerService | +10 |
| src/noteflow/grpc/_mixins/protocols.py | Add _ner_service to ServicerHost | +5 |
| src/noteflow/infrastructure/persistence/repositories/__init__.py | Export entity repo | +2 |
| src/noteflow/infrastructure/persistence/unit_of_work.py | Add entity repository property | +15 |
| client/src/lib/tauri.ts | Add extractEntities wrapper | +15 |
| client/src/components/entity-management-panel.tsx | Add extraction trigger button | +30 |
| client/src-tauri/src/commands/mod.rs | Export entities module | +2 |
| client/src-tauri/src/lib.rs | Register entity commands | +3 |
| client/src-tauri/src/grpc/client.rs | Add extract_entities method | +25 |

Note: pyproject.toml already has spacy>=3.7. Proto regeneration is not needed.


Implementation Tasks

Task 0: Verify Persistence Infrastructure

Prerequisite: Sprint 0 must be complete with:

  1. Proto definitions verified in noteflow.proto:

    rpc ExtractEntities(ExtractEntitiesRequest) returns (ExtractEntitiesResponse);
    
  2. Database schema verified via Alembic migration:

    SELECT column_name FROM information_schema.columns
    WHERE table_name = 'named_entities';
    -- Expected: id, meeting_id, text, category, segment_ids, confidence, is_pinned, created_at
    
  3. Feature flag verified:

    from noteflow.config.settings import get_settings
    assert get_settings().feature_flags.ner_extraction_enabled
    

If any verification fails: Complete Sprint 0 first.


Task 1: Add Dependencies

File: pyproject.toml

[project.optional-dependencies]
ner = [
    "spacy>=3.7",
]

Post-install (handled by Docker in Sprint 0):

python -m spacy download en_core_web_sm

Task 2: Create Domain Entity

File: src/noteflow/domain/entities/named_entity.py

"""Named entity domain entity."""

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import TYPE_CHECKING
from uuid import UUID, uuid4

if TYPE_CHECKING:
    from noteflow.domain.entities.meeting import MeetingId


class EntityCategory(Enum):
    """Categories for named entities."""

    PERSON = "person"
    COMPANY = "company"
    PRODUCT = "product"
    TECHNICAL = "technical"
    ACRONYM = "acronym"
    LOCATION = "location"
    DATE = "date"
    OTHER = "other"


@dataclass
class NamedEntity:
    """A named entity extracted from transcript.

    Represents a person, company, product, or other notable term
    identified in the meeting transcript.
    """

    id: UUID = field(default_factory=uuid4)
    meeting_id: MeetingId | None = None
    text: str = ""
    category: EntityCategory = EntityCategory.OTHER
    segment_ids: list[int] = field(default_factory=list)
    confidence: float = 0.0
    is_pinned: bool = False  # User-confirmed entity

    @classmethod
    def create(
        cls,
        text: str,
        category: EntityCategory,
        segment_ids: list[int],
        confidence: float,
        meeting_id: MeetingId | None = None,
    ) -> NamedEntity:
        """Create a new named entity.

        Args:
            text: The entity text as it appears in transcript.
            category: Classification category.
            segment_ids: Segments where entity appears.
            confidence: Extraction confidence (0.0-1.0).
            meeting_id: Optional meeting association.

        Returns:
            New NamedEntity instance.
        """
        return cls(
            text=text,
            category=category,
            segment_ids=sorted(set(segment_ids)),
            confidence=confidence,
            meeting_id=meeting_id,
        )
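
A quick usage check of the factory above (illustrative only), showing that segment IDs come back deduplicated and sorted, and that the meeting association is filled in later by the service layer:

from noteflow.domain.entities.named_entity import EntityCategory, NamedEntity

entity = NamedEntity.create(
    text="Acme Corp",
    category=EntityCategory.COMPANY,
    segment_ids=[3, 1, 3],
    confidence=0.8,
)
assert entity.segment_ids == [1, 3]  # duplicates removed, order normalized
assert entity.meeting_id is None     # assigned later by NerService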

Task 3: Create NER Port Interface

File: src/noteflow/domain/ports/ner.py

"""NER port interface (hexagonal architecture)."""

from __future__ import annotations

from typing import TYPE_CHECKING, Protocol

if TYPE_CHECKING:
    from noteflow.domain.entities.named_entity import NamedEntity


class NerPort(Protocol):
    """Port for named entity recognition.

    This is the domain port that the application layer uses.
    Infrastructure adapters (like NerEngine) implement this protocol.
    """

    def extract(self, text: str) -> list[NamedEntity]:
        """Extract named entities from text.

        Args:
            text: Input text to analyze.

        Returns:
            List of extracted entities.
        """
        ...

    def extract_from_segments(
        self,
        segments: list[tuple[int, str]],
    ) -> list[NamedEntity]:
        """Extract entities from multiple segments with tracking.

        Args:
            segments: List of (segment_id, text) tuples.

        Returns:
            Entities with segment_ids populated.
        """
        ...

    def is_ready(self) -> bool:
        """Check if the NER engine is loaded and ready.

        Returns:
            True if model is loaded.
        """
        ...

Task 4: Create NER Infrastructure

File: src/noteflow/infrastructure/ner/__init__.py

"""Named Entity Recognition infrastructure."""

from noteflow.infrastructure.ner.engine import NerEngine

__all__ = ["NerEngine"]

File: src/noteflow/infrastructure/ner/engine.py

"""NER engine using spaCy."""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING

from noteflow.domain.entities.named_entity import EntityCategory, NamedEntity

if TYPE_CHECKING:
    from spacy.language import Language

logger = logging.getLogger(__name__)

# Map spaCy entity types to our categories
_SPACY_CATEGORY_MAP: dict[str, EntityCategory] = {
    "PERSON": EntityCategory.PERSON,
    "ORG": EntityCategory.COMPANY,
    "PRODUCT": EntityCategory.PRODUCT,
    "GPE": EntityCategory.LOCATION,  # Geo-political entity
    "LOC": EntityCategory.LOCATION,
    "FAC": EntityCategory.LOCATION,  # Facility
    "DATE": EntityCategory.DATE,
    "TIME": EntityCategory.DATE,
    "MONEY": EntityCategory.OTHER,
    "PERCENT": EntityCategory.OTHER,
    "CARDINAL": EntityCategory.OTHER,
    "ORDINAL": EntityCategory.OTHER,
    "QUANTITY": EntityCategory.OTHER,
    "NORP": EntityCategory.OTHER,  # Nationalities, religions, etc.
    "EVENT": EntityCategory.OTHER,
    "WORK_OF_ART": EntityCategory.PRODUCT,
    "LAW": EntityCategory.OTHER,
    "LANGUAGE": EntityCategory.OTHER,
}


class NerEngine:
    """Named entity recognition engine using spaCy.

    Lazy-loads the spaCy model on first use to avoid startup delay.
    Implements the NerPort protocol for hexagonal architecture.
    """

    def __init__(self, model_name: str = "en_core_web_sm") -> None:
        """Initialize NER engine.

        Args:
            model_name: spaCy model to use.
        """
        self._model_name = model_name
        self._nlp: Language | None = None

    def _ensure_loaded(self) -> Language:
        """Lazy-load the spaCy model."""
        if self._nlp is None:
            import spacy
            logger.info("Loading spaCy model: %s", self._model_name)
            self._nlp = spacy.load(self._model_name)
            logger.info("spaCy model loaded")
        return self._nlp

    def is_ready(self) -> bool:
        """Check if model is loaded."""
        return self._nlp is not None

    def extract(self, text: str) -> list[NamedEntity]:
        """Extract named entities from text.

        Args:
            text: Input text to analyze.

        Returns:
            List of extracted entities (deduplicated).
        """
        if not text.strip():
            return []

        nlp = self._ensure_loaded()
        doc = nlp(text)

        entities: list[NamedEntity] = []
        seen: set[str] = set()

        for ent in doc.ents:
            # Normalize and deduplicate
            key = ent.text.lower().strip()
            if not key or key in seen:
                continue
            seen.add(key)

            category = _SPACY_CATEGORY_MAP.get(ent.label_, EntityCategory.OTHER)

            # Skip low-value entities
            if category == EntityCategory.OTHER and ent.label_ in {
                "CARDINAL", "ORDINAL", "QUANTITY", "PERCENT"
            }:
                continue

            entities.append(
                NamedEntity.create(
                    text=ent.text,
                    category=category,
                    segment_ids=[],  # Filled by caller
                    confidence=0.8,  # spaCy doesn't provide per-entity confidence
                )
            )

        return entities

    def extract_from_segments(
        self,
        segments: list[tuple[int, str]],
    ) -> list[NamedEntity]:
        """Extract entities from multiple segments with segment tracking.

        Args:
            segments: List of (segment_id, text) tuples.

        Returns:
            Entities with segment_ids populated (deduplicated across segments).
        """
        # Track entities and their segment occurrences
        entity_segments: dict[str, list[int]] = {}
        all_entities: dict[str, NamedEntity] = {}

        for segment_id, text in segments:
            entities = self.extract(text)

            for entity in entities:
                key = entity.text.lower()

                if key not in all_entities:
                    all_entities[key] = entity
                    entity_segments[key] = []

                entity_segments[key].append(segment_id)

        # Update segment IDs
        for key, entity in all_entities.items():
            entity.segment_ids = sorted(set(entity_segments[key]))

        return list(all_entities.values())

Task 5: Create Entity Persistence

ORM Model Already Exists: The NamedEntityModel is already implemented at infrastructure/persistence/models/entities/named_entity.py. Do not recreate it. Only the repository and converters need to be implemented.

Existing ORM (reference only—do not modify):

  • Location: src/noteflow/infrastructure/persistence/models/entities/named_entity.py
  • Fields: id, meeting_id, text, category, segment_ids, confidence, is_pinned, created_at
  • Relationship: meeting ↔ MeetingModel.named_entities (cascade delete configured)

File to Create: src/noteflow/infrastructure/persistence/repositories/entity_repo.py

"""Named entity repository."""

from __future__ import annotations

from typing import TYPE_CHECKING
from uuid import UUID

from sqlalchemy import delete, select

from noteflow.infrastructure.persistence.models import NamedEntityModel
from noteflow.infrastructure.persistence.repositories._base import BaseRepository

if TYPE_CHECKING:
    from noteflow.domain.entities.meeting import MeetingId
    from noteflow.domain.entities.named_entity import NamedEntity


class SqlAlchemyEntityRepository(BaseRepository):
    """Repository for named entity persistence."""

    async def save(self, entity: NamedEntity) -> None:
        """Save or update a named entity.

        Args:
            entity: The entity to save.
        """
        from noteflow.infrastructure.converters.ner_converters import NerConverter

        model = NerConverter.to_orm(entity)
        await self._session.merge(model)
        await self._session.flush()

    async def save_batch(self, entities: list[NamedEntity]) -> None:
        """Save multiple entities efficiently.

        Args:
            entities: List of entities to save.
        """
        from noteflow.infrastructure.converters.ner_converters import NerConverter

        for entity in entities:
            model = NerConverter.to_orm(entity)
            await self._session.merge(model)
        await self._session.flush()

    async def get(self, entity_id: UUID) -> NamedEntity | None:
        """Get entity by ID.

        Args:
            entity_id: The entity UUID.

        Returns:
            Entity if found, None otherwise.
        """
        from noteflow.infrastructure.converters.ner_converters import NerConverter

        stmt = select(NamedEntityModel).where(NamedEntityModel.id == entity_id)
        result = await self._session.execute(stmt)
        model = result.scalar_one_or_none()
        return NerConverter.to_domain(model) if model else None

    async def get_by_meeting(self, meeting_id: MeetingId) -> list[NamedEntity]:
        """Get all entities for a meeting.

        Args:
            meeting_id: The meeting UUID.

        Returns:
            List of entities.
        """
        from noteflow.infrastructure.converters.ner_converters import NerConverter

        stmt = (
            select(NamedEntityModel)
            .where(NamedEntityModel.meeting_id == meeting_id)
            .order_by(NamedEntityModel.category, NamedEntityModel.text)
        )
        result = await self._session.execute(stmt)
        models = result.scalars().all()
        return [NerConverter.to_domain(m) for m in models]

    async def delete_by_meeting(self, meeting_id: MeetingId) -> int:
        """Delete all entities for a meeting.

        Args:
            meeting_id: The meeting UUID.

        Returns:
            Number of deleted entities.
        """
        stmt = delete(NamedEntityModel).where(
            NamedEntityModel.meeting_id == meeting_id
        )
        result = await self._session.execute(stmt)
        await self._session.flush()
        return result.rowcount

    async def update_pinned(self, entity_id: UUID, is_pinned: bool) -> bool:
        """Update the pinned status of an entity.

        Args:
            entity_id: The entity UUID.
            is_pinned: New pinned status.

        Returns:
            True if entity was found and updated.
        """
        stmt = select(NamedEntityModel).where(NamedEntityModel.id == entity_id)
        result = await self._session.execute(stmt)
        model = result.scalar_one_or_none()
        if model:
            model.is_pinned = is_pinned
            await self._session.flush()
            return True
        return False

Task 6: Create NER Converters

File: src/noteflow/infrastructure/converters/ner_converters.py

"""NER domain ↔ ORM converters."""

from __future__ import annotations

from typing import TYPE_CHECKING

from noteflow.domain.entities.named_entity import EntityCategory, NamedEntity

if TYPE_CHECKING:
    from noteflow.infrastructure.persistence.models import NamedEntityModel


class NerConverter:
    """Convert between NamedEntity domain objects and ORM models."""

    @staticmethod
    def to_domain(model: NamedEntityModel) -> NamedEntity:
        """Convert ORM model to domain entity.

        Args:
            model: SQLAlchemy model.

        Returns:
            Domain entity.
        """
        return NamedEntity(
            id=model.id,
            meeting_id=model.meeting_id,
            text=model.text,
            category=EntityCategory(model.category),
            segment_ids=list(model.segment_ids) if model.segment_ids else [],
            confidence=model.confidence,
            is_pinned=model.is_pinned,
        )

    @staticmethod
    def to_orm(entity: NamedEntity) -> NamedEntityModel:
        """Convert domain entity to ORM model.

        Args:
            entity: Domain entity.

        Returns:
            SQLAlchemy model.
        """
        from noteflow.infrastructure.persistence.models import NamedEntityModel

        return NamedEntityModel(
            id=entity.id,
            meeting_id=entity.meeting_id,
            text=entity.text,
            category=entity.category.value,
            segment_ids=entity.segment_ids,
            confidence=entity.confidence,
            is_pinned=entity.is_pinned,
        )

Task 7: Create NerService Application Layer

File: src/noteflow/application/services/ner_service.py

This is the critical architectural component that was missing. The NerService sits between gRPC and infrastructure, following hexagonal architecture.

"""Named Entity Recognition application service.

This service orchestrates NER operations, following hexagonal architecture:
- gRPC mixin → NerService (application) → NerEngine (infrastructure) ✓

The service handles:
- Extraction orchestration
- Caching/persistence of results
- Feature flag checking
- Concurrency control for model loading
"""

from __future__ import annotations

import asyncio
import logging
from dataclasses import dataclass
from typing import TYPE_CHECKING

from noteflow.config.settings import get_settings
from noteflow.domain.entities.named_entity import NamedEntity

if TYPE_CHECKING:
    from uuid import UUID

    from noteflow.domain.ports.ner import NerPort
    from noteflow.infrastructure.persistence.unit_of_work import UnitOfWork

logger = logging.getLogger(__name__)


@dataclass
class ExtractionResult:
    """Result of entity extraction."""

    entities: list[NamedEntity]
    cached: bool
    total_count: int


class NerService:
    """Application service for Named Entity Recognition.

    Provides a clean interface for NER operations, abstracting away
    the infrastructure details (spaCy engine, database persistence).
    """

    def __init__(
        self,
        ner_engine: NerPort,
        uow_factory: type[UnitOfWork],
    ) -> None:
        """Initialize NER service.

        Args:
            ner_engine: NER engine implementation (infrastructure adapter).
            uow_factory: Factory for creating Unit of Work instances.
        """
        self._ner_engine = ner_engine
        self._uow_factory = uow_factory
        self._extraction_lock = asyncio.Lock()
        self._model_load_lock = asyncio.Lock()

    async def extract_entities(
        self,
        meeting_id: UUID,
        force_refresh: bool = False,
    ) -> ExtractionResult:
        """Extract named entities from a meeting's transcript.

        Checks for cached results first, unless force_refresh is True.
        Persists new extractions to the database.

        Args:
            meeting_id: Meeting to extract entities from.
            force_refresh: If True, re-extract even if cached results exist.

        Returns:
            ExtractionResult with entities and metadata.

        Raises:
            ValueError: If meeting not found or has no segments.
            RuntimeError: If NER feature is disabled.
        """
        settings = get_settings()
        if not settings.feature_flags.ner_extraction_enabled:
            raise RuntimeError("NER extraction is disabled by feature flag")

        async with self._uow_factory() as uow:
            # Check for cached results
            if not force_refresh:
                cached = await uow.entities.get_by_meeting(meeting_id)
                if cached:
                    logger.debug(
                        "Returning %d cached entities for meeting %s",
                        len(cached),
                        meeting_id,
                    )
                    return ExtractionResult(
                        entities=cached,
                        cached=True,
                        total_count=len(cached),
                    )

            # Fetch meeting and segments
            meeting = await uow.meetings.get(meeting_id)
            if not meeting:
                raise ValueError(f"Meeting {meeting_id} not found")

            if not meeting.segments:
                logger.debug("Meeting %s has no segments", meeting_id)
                return ExtractionResult(entities=[], cached=False, total_count=0)

            # Build segment data
            segments = [(s.segment_id, s.text) for s in meeting.segments]

        # Extract entities (outside UoW to avoid long transactions)
        async with self._extraction_lock:
            # Ensure model is loaded (thread-safe)
            if not self._ner_engine.is_ready():
                async with self._model_load_lock:
                    if not self._ner_engine.is_ready():
                        # Run sync model load in executor
                        loop = asyncio.get_running_loop()
                        await loop.run_in_executor(
                            None,
                            lambda: self._ner_engine.extract("warm up"),
                        )

            # Extract entities
            loop = asyncio.get_running_loop()
            entities = await loop.run_in_executor(
                None,
                self._ner_engine.extract_from_segments,
                segments,
            )

        # Assign meeting ID to entities
        for entity in entities:
            entity.meeting_id = meeting_id

        # Persist results
        async with self._uow_factory() as uow:
            if force_refresh:
                await uow.entities.delete_by_meeting(meeting_id)
            await uow.entities.save_batch(entities)
            await uow.commit()

        logger.info(
            "Extracted %d entities from meeting %s (%d segments)",
            len(entities),
            meeting_id,
            len(segments),
        )

        return ExtractionResult(
            entities=entities,
            cached=False,
            total_count=len(entities),
        )

    async def get_entities(self, meeting_id: UUID) -> list[NamedEntity]:
        """Get cached entities for a meeting.

        Args:
            meeting_id: Meeting UUID.

        Returns:
            List of entities (empty if not extracted yet).
        """
        async with self._uow_factory() as uow:
            return await uow.entities.get_by_meeting(meeting_id)

    async def pin_entity(self, entity_id: UUID, is_pinned: bool = True) -> bool:
        """Mark an entity as user-verified (pinned).

        Args:
            entity_id: Entity UUID.
            is_pinned: New pinned status.

        Returns:
            True if entity was found and updated.
        """
        async with self._uow_factory() as uow:
            result = await uow.entities.update_pinned(entity_id, is_pinned)
            if result:
                await uow.commit()
            return result

    async def clear_entities(self, meeting_id: UUID) -> int:
        """Delete all entities for a meeting.

        Args:
            meeting_id: Meeting UUID.

        Returns:
            Number of deleted entities.
        """
        async with self._uow_factory() as uow:
            count = await uow.entities.delete_by_meeting(meeting_id)
            await uow.commit()
            logger.info("Cleared %d entities for meeting %s", count, meeting_id)
            return count

    def is_ready(self) -> bool:
        """Check if NER engine is ready.

        Returns:
            True if model is loaded.
        """
        return self._ner_engine.is_ready()

Task 8: Update gRPC Mixin (Calls NerService)

File: src/noteflow/grpc/_mixins/entities.py

The mixin now calls NerService (application layer), not NerEngine (infrastructure).

"""Entity extraction gRPC mixin."""

from __future__ import annotations

import logging
from typing import TYPE_CHECKING

import grpc

from noteflow.grpc.proto import noteflow_pb2

if TYPE_CHECKING:
    from noteflow.grpc._mixins.protocols import ServicerHost

logger = logging.getLogger(__name__)


class EntitiesMixin:
    """Mixin for entity extraction RPC methods.

    Architecture: gRPC → NerService (application) → NerEngine (infrastructure)
    """

    async def ExtractEntities(
        self: ServicerHost,
        request: noteflow_pb2.ExtractEntitiesRequest,
        context: grpc.aio.ServicerContext,
    ) -> noteflow_pb2.ExtractEntitiesResponse:
        """Extract named entities from meeting transcript.

        Delegates to NerService for extraction, caching, and persistence.
        """
        meeting_id = self._parse_meeting_id(request.meeting_id)

        try:
            result = await self._ner_service.extract_entities(
                meeting_id=meeting_id,
                force_refresh=request.force_refresh,
            )
        except ValueError as e:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details(str(e))
            return noteflow_pb2.ExtractEntitiesResponse()
        except RuntimeError as e:
            # Feature disabled
            context.set_code(grpc.StatusCode.FAILED_PRECONDITION)
            context.set_details(str(e))
            return noteflow_pb2.ExtractEntitiesResponse()

        # Convert to proto
        proto_entities = [
            noteflow_pb2.ExtractedEntity(
                id=str(e.id),
                text=e.text,
                category=e.category.value,
                segment_ids=e.segment_ids,
                confidence=e.confidence,
                is_pinned=e.is_pinned,
            )
            for e in result.entities
        ]

        return noteflow_pb2.ExtractEntitiesResponse(
            entities=proto_entities,
            total_count=result.total_count,
            cached=result.cached,
        )

    async def GetEntities(
        self: ServicerHost,
        request: noteflow_pb2.GetEntitiesRequest,
        context: grpc.aio.ServicerContext,
    ) -> noteflow_pb2.GetEntitiesResponse:
        """Get cached entities for a meeting (no extraction)."""
        meeting_id = self._parse_meeting_id(request.meeting_id)

        entities = await self._ner_service.get_entities(meeting_id)

        proto_entities = [
            noteflow_pb2.ExtractedEntity(
                id=str(e.id),
                text=e.text,
                category=e.category.value,
                segment_ids=e.segment_ids,
                confidence=e.confidence,
                is_pinned=e.is_pinned,
            )
            for e in entities
        ]

        return noteflow_pb2.GetEntitiesResponse(entities=proto_entities)

    async def PinEntity(
        self: ServicerHost,
        request: noteflow_pb2.PinEntityRequest,
        context: grpc.aio.ServicerContext,
    ) -> noteflow_pb2.PinEntityResponse:
        """Mark an entity as user-verified."""
        from uuid import UUID

        entity_id = UUID(request.entity_id)
        success = await self._ner_service.pin_entity(entity_id, request.is_pinned)

        if not success:
            context.set_code(grpc.StatusCode.NOT_FOUND)
            context.set_details(f"Entity {entity_id} not found")

        return noteflow_pb2.PinEntityResponse(success=success)

Task 9: Frontend Integration

File: client/src/hooks/use-entity-extraction.ts

import { useCallback, useEffect, useState } from 'react';
import { invoke } from '@tauri-apps/api/core';

export interface ExtractedEntity {
  id: string;
  text: string;
  category: 'person' | 'company' | 'product' | 'technical' | 'acronym' | 'location' | 'date' | 'other';
  segmentIds: number[];
  confidence: number;
  isPinned: boolean;
}

interface ExtractionResult {
  entities: ExtractedEntity[];
  totalCount: number;
  cached: boolean;
}

type ExtractionStatus = 'idle' | 'loading' | 'success' | 'error';

interface UseEntityExtractionOptions {
  autoLoad?: boolean;
  pollInterval?: number;  // For long extractions
}

interface UseEntityExtractionReturn {
  entities: ExtractedEntity[];
  status: ExtractionStatus;
  error: string | null;
  cached: boolean;
  extract: (forceRefresh?: boolean) => Promise<void>;
  pinEntity: (entityId: string, isPinned: boolean) => Promise<void>;
  clearEntities: () => Promise<void>;
}

export function useEntityExtraction(
  meetingId: string | null,
  options: UseEntityExtractionOptions = {},
): UseEntityExtractionReturn {
  const { autoLoad = false } = options;

  const [entities, setEntities] = useState<ExtractedEntity[]>([]);
  const [status, setStatus] = useState<ExtractionStatus>('idle');
  const [error, setError] = useState<string | null>(null);
  const [cached, setCached] = useState(false);

  const extract = useCallback(async (forceRefresh = false) => {
    if (!meetingId) return;

    setStatus('loading');
    setError(null);

    try {
      const result = await invoke<ExtractionResult>('extract_entities', {
        meetingId,
        forceRefresh,
      });

      setEntities(result.entities);
      setCached(result.cached);
      setStatus('success');
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      setError(message);
      setStatus('error');
    }
  }, [meetingId]);

  const pinEntity = useCallback(async (entityId: string, isPinned: boolean) => {
    try {
      await invoke('pin_entity', { entityId, isPinned });

      // Optimistic update
      setEntities((prev) =>
        prev.map((e) => (e.id === entityId ? { ...e, isPinned } : e))
      );
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      setError(message);
    }
  }, []);

  const clearEntities = useCallback(async () => {
    if (!meetingId) return;

    try {
      await invoke('clear_entities', { meetingId });
      setEntities([]);
      setCached(false);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      setError(message);
    }
  }, [meetingId]);

  // Auto-load on mount if requested
  useEffect(() => {
    if (autoLoad && meetingId) {
      extract(false);
    }
  }, [autoLoad, meetingId, extract]);

  return {
    entities,
    status,
    error,
    cached,
    extract,
    pinEntity,
    clearEntities,
  };
}

File: client/src/components/EntityExtractionPanel.tsx

import { Loader2, Pin, RefreshCw, Sparkles, Trash2, X } from 'lucide-react';
import { useMemo } from 'react';

import { Button } from '@/components/ui/Button';
import { Card, CardContent, CardHeader, CardTitle } from '@/components/ui/Card';
import { Badge } from '@/components/ui/Badge';
import { Tooltip, TooltipContent, TooltipTrigger } from '@/components/ui/Tooltip';
import { ExtractedEntity, useEntityExtraction } from '@/hooks/use-entity-extraction';

interface EntityExtractionPanelProps {
  meetingId: string;
  className?: string;
}

const CATEGORY_COLORS: Record<string, string> = {
  person: 'bg-blue-100 text-blue-800',
  company: 'bg-purple-100 text-purple-800',
  product: 'bg-green-100 text-green-800',
  location: 'bg-amber-100 text-amber-800',
  date: 'bg-rose-100 text-rose-800',
  technical: 'bg-cyan-100 text-cyan-800',
  acronym: 'bg-indigo-100 text-indigo-800',
  other: 'bg-gray-100 text-gray-800',
};

function EntityBadge({
  entity,
  onPin,
}: {
  entity: ExtractedEntity;
  onPin: (id: string, pinned: boolean) => void;
}) {
  const colorClass = CATEGORY_COLORS[entity.category] || CATEGORY_COLORS.other;

  return (
    <Badge
      variant="secondary"
      className={`${colorClass} flex items-center gap-1 ${entity.isPinned ? 'ring-2 ring-yellow-400' : ''}`}
    >
      <span>{entity.text}</span>
      <Tooltip>
        <TooltipTrigger asChild>
          <button
            type="button"
            onClick={() => onPin(entity.id, !entity.isPinned)}
            className="ml-1 opacity-60 hover:opacity-100"
          >
            <Pin className={`w-3 h-3 ${entity.isPinned ? 'fill-current' : ''}`} />
          </button>
        </TooltipTrigger>
        <TooltipContent>
          {entity.isPinned ? 'Unpin entity' : 'Pin as verified'}
        </TooltipContent>
      </Tooltip>
    </Badge>
  );
}

export function EntityExtractionPanel({ meetingId, className }: EntityExtractionPanelProps) {
  const {
    entities,
    status,
    error,
    cached,
    extract,
    pinEntity,
    clearEntities,
  } = useEntityExtraction(meetingId, { autoLoad: true });

  // Group entities by category
  const groupedEntities = useMemo(() => {
    const groups: Record<string, ExtractedEntity[]> = {};
    for (const entity of entities) {
      const category = entity.category;
      if (!groups[category]) {
        groups[category] = [];
      }
      groups[category].push(entity);
    }
    return groups;
  }, [entities]);

  const categoryOrder = ['person', 'company', 'product', 'location', 'technical', 'acronym', 'date', 'other'];

  return (
    <Card className={className}>
      <CardHeader className="pb-3">
        <div className="flex items-center justify-between">
          <CardTitle className="text-sm font-medium">
            Named Entities
            {cached && (
              <span className="ml-2 text-xs text-muted-foreground">(cached)</span>
            )}
          </CardTitle>
          <div className="flex gap-1">
            <Tooltip>
              <TooltipTrigger asChild>
                <Button
                  variant="ghost"
                  size="sm"
                  onClick={() => extract(false)}
                  disabled={status === 'loading'}
                >
                  {status === 'loading' ? (
                    <Loader2 className="w-4 h-4 animate-spin" />
                  ) : (
                    <Sparkles className="w-4 h-4" />
                  )}
                </Button>
              </TooltipTrigger>
              <TooltipContent>Extract entities</TooltipContent>
            </Tooltip>

            <Tooltip>
              <TooltipTrigger asChild>
                <Button
                  variant="ghost"
                  size="sm"
                  onClick={() => extract(true)}
                  disabled={status === 'loading'}
                >
                  <RefreshCw className="w-4 h-4" />
                </Button>
              </TooltipTrigger>
              <TooltipContent>Re-extract (refresh)</TooltipContent>
            </Tooltip>

            {entities.length > 0 && (
              <Tooltip>
                <TooltipTrigger asChild>
                  <Button
                    variant="ghost"
                    size="sm"
                    onClick={clearEntities}
                  >
                    <Trash2 className="w-4 h-4" />
                  </Button>
                </TooltipTrigger>
                <TooltipContent>Clear all entities</TooltipContent>
              </Tooltip>
            )}
          </div>
        </div>
      </CardHeader>

      <CardContent>
        {/* Loading state */}
        {status === 'loading' && entities.length === 0 && (
          <div className="flex flex-col items-center justify-center py-8 text-muted-foreground">
            <Loader2 className="w-8 h-8 animate-spin mb-2" />
            <p className="text-sm">Extracting entities...</p>
            <p className="text-xs mt-1">This may take a moment for long transcripts</p>
          </div>
        )}

        {/* Error state */}
        {status === 'error' && (
          <div className="flex items-center gap-2 p-3 bg-destructive/10 text-destructive rounded-md">
            <X className="w-4 h-4 flex-shrink-0" />
            <p className="text-sm">{error}</p>
          </div>
        )}

        {/* Empty state */}
        {status === 'success' && entities.length === 0 && (
          <p className="text-sm text-muted-foreground text-center py-4">
            No entities found. Click the sparkle icon to extract.
          </p>
        )}

        {/* Entity list grouped by category */}
        {entities.length > 0 && (
          <div className="space-y-4">
            {categoryOrder
              .filter((cat) => groupedEntities[cat]?.length > 0)
              .map((category) => (
                <div key={category}>
                  <h4 className="text-xs font-medium text-muted-foreground uppercase tracking-wide mb-2">
                    {category} ({groupedEntities[category].length})
                  </h4>
                  <div className="flex flex-wrap gap-2">
                    {groupedEntities[category].map((entity) => (
                      <EntityBadge
                        key={entity.id}
                        entity={entity}
                        onPin={pinEntity}
                      />
                    ))}
                  </div>
                </div>
              ))}
          </div>
        )}
      </CardContent>
    </Card>
  );
}

File: client/src-tauri/src/commands/entities.rs

use serde::{Deserialize, Serialize};
use tauri::State;
use uuid::Uuid;

use crate::state::AppState;

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct ExtractedEntity {
    pub id: String,
    pub text: String,
    pub category: String,
    pub segment_ids: Vec<i32>,
    pub confidence: f32,
    pub is_pinned: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct ExtractEntitiesResult {
    pub entities: Vec<ExtractedEntity>,
    pub total_count: i32,
    pub cached: bool,
}

#[tauri::command]
pub async fn extract_entities(
    meeting_id: String,
    force_refresh: Option<bool>,
    state: State<'_, AppState>,
) -> Result<ExtractEntitiesResult, String> {
    let client = state.grpc_client.lock().await;
    let client = client.as_ref().ok_or("gRPC client not initialized")?;

    let request = proto::ExtractEntitiesRequest {
        meeting_id,
        force_refresh: force_refresh.unwrap_or(false),
    };

    let response = client
        .extract_entities(request)
        .await
        .map_err(|e| format!("Failed to extract entities: {e}"))?;

    Ok(ExtractEntitiesResult {
        entities: response
            .entities
            .into_iter()
            .map(|e| ExtractedEntity {
                id: e.id,
                text: e.text,
                category: e.category,
                segment_ids: e.segment_ids,
                confidence: e.confidence,
                is_pinned: e.is_pinned,
            })
            .collect(),
        total_count: response.total_count,
        cached: response.cached,
    })
}

#[tauri::command]
pub async fn pin_entity(
    entity_id: String,
    is_pinned: bool,
    state: State<'_, AppState>,
) -> Result<bool, String> {
    let client = state.grpc_client.lock().await;
    let client = client.as_ref().ok_or("gRPC client not initialized")?;

    let request = proto::PinEntityRequest {
        entity_id,
        is_pinned,
    };

    let response = client
        .pin_entity(request)
        .await
        .map_err(|e| format!("Failed to pin entity: {e}"))?;

    Ok(response.success)
}

#[tauri::command]
pub async fn clear_entities(
    meeting_id: String,
    state: State<'_, AppState>,
) -> Result<i32, String> {
    let client = state.grpc_client.lock().await;
    let client = client.as_ref().ok_or("gRPC client not initialized")?;

    let request = proto::ClearEntitiesRequest { meeting_id };

    let response = client
        .clear_entities(request)
        .await
        .map_err(|e| format!("Failed to clear entities: {e}"))?;

    Ok(response.deleted_count)
}

Code Segments to Reuse

Persistence Layer Patterns (CRITICAL)

Location: src/noteflow/infrastructure/persistence/

| Pattern | Reference Location | Usage |
|---------|--------------------|-------|
| ORM Model | models/entities/named_entity.py | Already exists - no creation needed |
| Repository | repositories/annotation_repo.py | Template for CRUD operations |
| Base Repository | repositories/_base.py | Extend for helper methods |
| Unit of Work | unit_of_work.py | Add entity repository property |
| Converters | converters/orm_converters.py | Add entity_to_domain() method |
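
For the Unit of Work row above, a rough sketch of the entities property that the NerService code assumes (the attribute and session field names are assumptions; the class body here is a simplified stand-in for the real unit_of_work.py):

# Illustrative fragment only; attribute names are assumptions, not existing code.
from sqlalchemy.ext.asyncio import AsyncSession

from noteflow.infrastructure.persistence.repositories.entity_repo import (
    SqlAlchemyEntityRepository,
)


class UnitOfWork:  # simplified stand-in for the existing class
    def __init__(self, session: AsyncSession) -> None:
        self._session = session
        self._entities = None

    @property
    def entities(self) -> SqlAlchemyEntityRepository:
        """Lazily build the named-entity repository from the active session."""
        if self._entities is None:
            self._entities = SqlAlchemyEntityRepository(self._session)
        return self._entities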

Existing Entity UI Components

Location: client/src/components/entity-highlight.tsx

Already renders entity highlights with tooltips - connect to extracted entities.

Location: client/src/components/entity-management-panel.tsx

CRUD panel with Sheet UI - extend to display auto-extracted entities.

Location: client/src/lib/entity-store.ts

Client-side observer pattern store - wire to backend extraction results.

Warning: Color definitions are duplicated between entity-highlight.tsx (inline categoryColors) and types/entity.ts (ENTITY_CATEGORY_COLORS). Use the shared constant from types/entity.ts.

Application Service Pattern

Location: src/noteflow/application/services/summarization_service.py

Pattern for application service with:

  • Dataclass-based service with settings
  • Provider registration pattern (multi-backend support)
  • Lazy model loading via property getters
  • Callback-based persistence (not UoW-injected)

Location: src/noteflow/infrastructure/summarization/ollama_provider.py

Pattern for lazy model loading:

  • self._client: T | None = None initial state
  • _get_client() method for lazy initialization
  • is_available property for runtime checks
  • asyncio.to_thread() for CPU-bound inference

Location: src/noteflow/grpc/_mixins/diarization.py

Pattern for CPU-bound gRPC handlers:

  • asyncio.Lock for concurrency control
  • loop.run_in_executor() for blocking operations
  • Structured logging with meeting context

Performance Targets

| Metric | Target | Measurement |
|--------|--------|-------------|
| Model load time | < 3s | First extraction latency |
| Extraction throughput | > 500 segments/sec | pytest --benchmark |
| API response (cached) | < 50ms | p95 latency |
| API response (extraction) | < 2s for 100 segments | p95 latency |
| Memory overhead | < 200MB | Model + cache |
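
A hedged sketch of how the throughput row could be checked with the pytest-benchmark plugin (the plugin is assumed to be installed; the test name and synthetic segments are illustrative, not existing tests):

# Illustrative throughput benchmark; requires the pytest-benchmark plugin.
import pytest

from noteflow.infrastructure.ner.engine import NerEngine


@pytest.fixture(scope="module")
def loaded_engine() -> NerEngine:
    engine = NerEngine(model_name="en_core_web_sm")
    engine.extract("warm up")  # force the lazy model load outside the measured region
    return engine


def test_extraction_throughput(benchmark, loaded_engine: NerEngine) -> None:
    segments = [(i, f"Segment {i}: John Smith met Acme Corp in Berlin.") for i in range(500)]
    entities = benchmark(loaded_engine.extract_from_segments, segments)
    assert entities  # at least one entity extracted across the batch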

Acceptance Criteria

Functional

  • "Auto-Extract Entities" button appears on meeting detail page
  • Clicking button extracts entities from all segments
  • Extracted entities appear in EntityExtractionPanel
  • Entity categories match spaCy labels (person, company, location, etc.)
  • Entities link to source segments via segment_ids
  • Entities persist to database and survive server restart
  • Cached results returned on subsequent requests
  • Force refresh re-extracts and replaces cached results
  • Pinned entities preserve user verification

Technical

  • spaCy model lazy-loaded (no startup delay)
  • Entities deduplicated across segments
  • Segment tracking works correctly
  • Error handling for missing spaCy model
  • NerService mediates between gRPC and NerEngine (hexagonal architecture)
  • Feature flag controls extraction availability

Quality Gates

  • pytest tests/quality/ passes
  • Module size < 150 lines (engine.py, ner_service.py)
  • All functions documented
  • Unit tests cover edge cases
  • Application layer tests mock infrastructure

Test Plan

Test Philosophy: Use pytest.mark.parametrize for test variations. No conditionals in tests: each parameter set is a separate test case. See the pytest parametrize docs.

Unit Tests (Infrastructure)

File: tests/infrastructure/ner/test_engine.py

"""NER engine unit tests using pytest.mark.parametrize."""

import pytest

from noteflow.domain.entities.named_entity import EntityCategory
from noteflow.infrastructure.ner.engine import NerEngine


# Use module-scoped fixture to load model once (expensive)
@pytest.fixture(scope="module")
def engine() -> NerEngine:
    """Create NER engine with model pre-loaded."""
    eng = NerEngine(model_name="en_core_web_sm")
    eng._ensure_loaded()  # Pre-load for all tests
    return eng


class TestEntityExtraction:
    """Test entity extraction with parametrized inputs."""

    @pytest.mark.parametrize(
        ("text", "expected_category", "expected_text_fragment"),
        [
            pytest.param(
                "John Smith discussed the project.",
                EntityCategory.PERSON,
                "john",
                id="person-single",
            ),
            pytest.param(
                "Alice Johnson and Bob Williams met today.",
                EntityCategory.PERSON,
                "alice",
                id="person-multiple",
            ),
            pytest.param(
                "We use Google Cloud for hosting.",
                EntityCategory.COMPANY,
                "google",
                id="company-tech",
            ),
            pytest.param(
                "The meeting is in New York.",
                EntityCategory.LOCATION,
                "new york",
                id="location-city",
            ),
        ],
    )
    def test_extracts_expected_entity_type(
        self,
        engine: NerEngine,
        text: str,
        expected_category: EntityCategory,
        expected_text_fragment: str,
    ) -> None:
        """Extract entities of expected category from text."""
        entities = engine.extract(text)

        matching = [e for e in entities if e.category == expected_category]
        assert matching, f"Expected {expected_category.value} entity in: {text}"

        texts_lower = [e.text.lower() for e in matching]
        assert any(
            expected_text_fragment in t for t in texts_lower
        ), f"Expected '{expected_text_fragment}' in {texts_lower}"

    @pytest.mark.parametrize(
        ("text", "expected_count"),
        [
            pytest.param("", 0, id="empty-string"),
            pytest.param("   ", 0, id="whitespace-only"),
            pytest.param("Hello world.", 0, id="no-entities"),
        ],
    )
    def test_handles_edge_cases(
        self,
        engine: NerEngine,
        text: str,
        expected_count: int,
    ) -> None:
        """Handle edge cases correctly."""
        entities = engine.extract(text)
        assert len(entities) == expected_count


class TestDeduplication:
    """Test entity deduplication."""

    @pytest.mark.parametrize(
        ("text", "entity_fragment", "expected_count"),
        [
            pytest.param(
                "John met with John about John's project.",
                "john",
                1,
                id="repeated-name",
            ),
            pytest.param(
                "Google uses Google Cloud. Google is great.",
                "google",
                1,
                id="repeated-company",
            ),
        ],
    )
    def test_deduplicates_repeated_entities(
        self,
        engine: NerEngine,
        text: str,
        entity_fragment: str,
        expected_count: int,
    ) -> None:
        """Deduplicate repeated entities in text."""
        entities = engine.extract(text)

        matching = [e for e in entities if entity_fragment in e.text.lower()]
        assert len(matching) == expected_count, f"Expected {expected_count} '{entity_fragment}'"


class TestSegmentTracking:
    """Test segment ID tracking across multiple segments."""

    @pytest.mark.parametrize(
        ("segments", "entity_fragment", "expected_segment_ids"),
        [
            pytest.param(
                [(1, "John presented."), (2, "John reviewed."), (3, "John approved.")],
                "john",
                {1, 2, 3},
                id="entity-in-all-segments",
            ),
            pytest.param(
                [(1, "Alice spoke."), (2, "Bob listened."), (3, "Alice concluded.")],
                "alice",
                {1, 3},
                id="entity-in-some-segments",
            ),
        ],
    )
    def test_tracks_segment_ids(
        self,
        engine: NerEngine,
        segments: list[tuple[int, str]],
        entity_fragment: str,
        expected_segment_ids: set[int],
    ) -> None:
        """Track segment IDs for entities across segments."""
        entities = engine.extract_from_segments(segments)

        matching = [e for e in entities if entity_fragment in e.text.lower()]
        assert matching, f"Expected entity with '{entity_fragment}'"
        assert set(matching[0].segment_ids) == expected_segment_ids


class TestLazyLoading:
    """Test model lazy loading."""

    def test_model_not_loaded_on_init(self) -> None:
        """Model is not loaded until first use."""
        engine = NerEngine()
        assert not engine.is_ready()

    def test_model_loads_on_first_extract(self) -> None:
        """Model loads on first extraction."""
        engine = NerEngine()
        engine.extract("Test text")
        assert engine.is_ready()

Application Layer Tests

File: tests/application/test_ner_service.py

"""NerService application layer tests using parametrize."""

from typing import Any
from unittest.mock import AsyncMock, MagicMock
from uuid import uuid4

import pytest

from noteflow.application.services.ner_service import ExtractionResult, NerService
from noteflow.domain.entities.meeting import Meeting
from noteflow.domain.entities.named_entity import EntityCategory, NamedEntity
from noteflow.domain.entities.segment import Segment


# Fixtures defined in tests/conftest.py - DO NOT REDEFINE
# mock_uow, mock_ner_engine, mock_uow_factory


@pytest.fixture
def ner_service(mock_ner_engine: MagicMock, mock_uow_factory: type) -> NerService:
    """Create NerService with injected mocks."""
    return NerService(ner_engine=mock_ner_engine, uow_factory=mock_uow_factory)


class TestExtractEntities:
    """Test entity extraction scenarios."""

    @pytest.mark.asyncio
    @pytest.mark.parametrize(
        ("cached_entities", "force_refresh", "expected_cached"),
        [
            pytest.param(
                [NamedEntity.create("Cached", EntityCategory.PERSON, [1], 0.9)],
                False,
                True,
                id="returns-cached-when-available",
            ),
            pytest.param(
                [],
                False,
                False,
                id="extracts-when-no-cache",
            ),
            pytest.param(
                [NamedEntity.create("Old", EntityCategory.PERSON, [1], 0.9)],
                True,
                False,
                id="re-extracts-on-force-refresh",
            ),
        ],
    )
    async def test_caching_behavior(
        self,
        ner_service: NerService,
        mock_uow: AsyncMock,
        mock_ner_engine: MagicMock,
        cached_entities: list[NamedEntity],
        force_refresh: bool,
        expected_cached: bool,
    ) -> None:
        """Test caching behavior with different scenarios."""
        meeting_id = uuid4()
        mock_uow.entities.get_by_meeting.return_value = cached_entities

        # Setup meeting for extraction cases
        mock_uow.meetings.get.return_value = Meeting(
            id=meeting_id,
            title="Test",
            segments=[Segment(segment_id=1, text="Hello John")],
        )

        result = await ner_service.extract_entities(meeting_id, force_refresh=force_refresh)

        assert result.cached == expected_cached


class TestErrorHandling:
    """Test error handling scenarios."""

    @pytest.mark.asyncio
    @pytest.mark.parametrize(
        ("setup_mock", "expected_error", "expected_message"),
        [
            pytest.param(
                {"meetings.get": None, "entities.get_by_meeting": []},
                ValueError,
                "not found",
                id="meeting-not-found",
            ),
        ],
    )
    async def test_raises_expected_errors(
        self,
        ner_service: NerService,
        mock_uow: AsyncMock,
        setup_mock: dict[str, Any],
        expected_error: type[Exception],
        expected_message: str,
    ) -> None:
        """Test that expected errors are raised."""
        meeting_id = uuid4()

        # Apply mock setup
        for attr_path, return_value in setup_mock.items():
            attrs = attr_path.split(".")
            obj = mock_uow
            for attr in attrs[:-1]:
                obj = getattr(obj, attr)
            getattr(obj, attrs[-1]).return_value = return_value

        with pytest.raises(expected_error, match=expected_message):
            await ner_service.extract_entities(meeting_id)


class TestPinEntity:
    """Test entity pinning."""

    @pytest.mark.asyncio
    @pytest.mark.parametrize(
        ("update_result", "expected_return", "expected_commit_calls"),
        [
            pytest.param(True, True, 1, id="successful-pin"),
            pytest.param(False, False, 0, id="entity-not-found"),
        ],
    )
    async def test_pin_entity_behavior(
        self,
        ner_service: NerService,
        mock_uow: AsyncMock,
        update_result: bool,
        expected_return: bool,
        expected_commit_calls: int,
    ) -> None:
        """Test pin entity with different outcomes."""
        entity_id = uuid4()
        mock_uow.entities.update_pinned.return_value = update_result

        result = await ner_service.pin_entity(entity_id, is_pinned=True)

        assert result == expected_return
        assert mock_uow.commit.call_count == expected_commit_calls

Dependencies

  • spaCy: NER library
  • en_core_web_sm: English language model (~50MB, downloaded in Sprint 0 Docker setup)

Blocks

  • Sprint 0 (Proto & Schema Foundation) must be complete

Failure Modes

| Failure | Detection | Recovery |
|---------|-----------|----------|
| spaCy model not installed | OSError on spacy.load() | Return FAILED_PRECONDITION with helpful message to run python -m spacy download en_core_web_sm |
| Model OOM on long transcript | MemoryError or process crash | Chunk transcript into batches of 100 segments, process incrementally |
| Empty transcript | len(segments) == 0 | Return empty entities list (not an error) |
| Feature flag disabled | settings.feature_flags.ner_enabled == False | Return FAILED_PRECONDITION with feature disabled message |
| DB constraint violation | UniqueViolation on uq_named_entities_meeting_text | Use ON CONFLICT DO UPDATE for upsert behavior |
| Extraction timeout | Processing > 30s for large meetings | Add configurable timeout, chunk processing |
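
For the OOM and timeout rows, a minimal sketch of batched extraction built on the NerEngine.extract_from_segments signature defined earlier; the helper name and the batch size of 100 are assumptions taken from the table, not existing code.

from noteflow.domain.entities.named_entity import NamedEntity
from noteflow.infrastructure.ner.engine import NerEngine


def extract_in_batches(
    engine: NerEngine,
    segments: list[tuple[int, str]],
    batch_size: int = 100,
) -> list[NamedEntity]:
    """Process segments in fixed-size batches and merge duplicates across batches."""
    merged: dict[str, NamedEntity] = {}
    for start in range(0, len(segments), batch_size):
        for entity in engine.extract_from_segments(segments[start : start + batch_size]):
            key = entity.text.lower()
            if key in merged:
                merged[key].segment_ids = sorted(
                    set(merged[key].segment_ids) | set(entity.segment_ids)
                )
            else:
                merged[key] = entity
    return list(merged.values())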

Deduplication Limitation

Current: Uses normalized_text = text.lower().strip() for deduplication.

Known issue: "IBM" and "I.B.M." would be treated as different entities.

Future work: Add fuzzy matching using rapidfuzz library:

# Future enhancement (not in this sprint)
from rapidfuzz import fuzz

def find_similar_entity(new_text: str, existing: list[NamedEntity]) -> NamedEntity | None:
    """Find existing entity with >90% similarity."""
    for entity in existing:
        if fuzz.ratio(new_text.lower(), entity.text.lower()) > 90:
            return entity
    return None

Definition of Done

  • All acceptance criteria met
  • pytest tests/quality/ passes
  • pytest tests/application/test_ner_service.py passes
  • pytest tests/infrastructure/ner/ passes
  • ruff check . passes
  • basedpyright passes
  • Performance targets verified
  • Frontend components render correctly with loading/error states
  • Entities persist across server restarts
  • Feature flag controls availability