TL;DR / Highest‑impact fixes (do these first)
Blocking sleep in the TUI (user-visible delay): time.sleep(2) is called at the end of an ingestion run on the IngestionScreen. Because this runs in a thread worker it does not block the event loop itself, but it holds that worker for two seconds and delays the screen transition; prefer a scheduled UI callback instead.
Blocking Weaviate client calls inside async methods: Several async methods in WeaviateStorage call the synchronous Weaviate client directly (connect(), collections.create, queries, inserts, deletes). Wrap those in asyncio.to_thread(...) (or equivalent) to avoid freezing the loop when these calls take time.
Embedding at scale: use batch vectorization: store_batch vectorizes each document one by one; you already have Vectorizer.vectorize_batch. Switching to batch reduces HTTP round trips and improves throughput under backpressure.
Connection lifecycle: close the vectorizer client: WeaviateStorage.close() closes the Weaviate client but not the httpx.AsyncClient inside the Vectorizer. Await that client's aclose() as well to prevent leaked connections/sockets under heavy usage.
Overly broad exception handling in UI utilities: Multiple places catch Exception, which makes failures ambiguous and harder to surface to users (e.g., storage manager and list builders). Narrow these to the expected exceptions and fail fast with user-friendly messages where appropriate.
Correctness / Bugs
Blocking sleep in TUI: time.sleep(2) after posting notifications and before app.pop_screen() blocks the worker thread; use Textual timers instead.
Synchronous SDK in async contexts (Weaviate):
initialize() calls self.client.connect() (sync). Wrap with asyncio.to_thread(self.client.connect).
Object operations such as collection.data.insert(...), collection.query.fetch_objects(...), collection.data.delete_many(...), and client.collections.delete(...) are sync calls invoked from async methods. These can stall the event loop under latency; use asyncio.to_thread(...) (see snippet below).
HTTP client lifecycle (Vectorizer): Vectorizer owns an httpx.AsyncClient but WeaviateStorage.close() doesn’t close it—add a close call to avoid resource leaks.
Heuristic “word_count” for OpenWebUI file listing: word_count is estimated via size/6. That’s fine as a placeholder but can mislead paging and UI logic downstream—consider a sentinel or a clearer label (e.g., estimated_word_count).
Wide except Exception: Broad catches appear in places like the storage manager and TUI screen update closures; they hide actionable errors. Prefer catching StorageError, IngestionError, or specific SDK exceptions—and surface toasts with actionable details.
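As an illustration of the narrower catch suggested above, here is a minimal sketch. StorageError is defined locally as a stand-in for the project's exception, and load_collections, fetch, and notify are hypothetical names chosen for the example, not functions from the codebase.

```python
from typing import Callable


class StorageError(Exception):
    """Stand-in for the project's storage exception (assumed shape)."""


def load_collections(
    fetch: Callable[[], list],
    notify: Callable[[str], None],
) -> list:
    """Return collection names, handling only the failure we expect."""
    try:
        return fetch()
    except StorageError as exc:
        # Expected failure: tell the user what went wrong and degrade gracefully.
        notify(f"Could not list collections: {exc}")
        return []
    # Anything else (KeyError, TypeError, ...) propagates so real bugs stay visible.
```

The design choice is that only the anticipated failure is converted into a user-facing message; programming errors still raise and show up in logs and tests.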
Performance & Scalability
Batch embeddings: In WeaviateStorage.store_batch you conditionally vectorize each doc inline. Use your existing vectorize_batch and map the results back to documents. This cuts request overhead and enables controlled concurrency (you already have AsyncTaskManager to help).
Async-friendly Weaviate calls: Offload sync SDK operations to a thread so your Textual UI remains responsive while bulk inserting/querying/deleting. (See patch below.)
Retry & backoff are solid in your HTTP base client: Your TypedHttpClient.request adds exponential backoff + jitter; this is great and worth keeping consistent across adapters.
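For adapters that do not go through TypedHttpClient, the same retry behaviour could be approximated as follows. This is a sketch of one common variant ("full jitter"), not a reproduction of TypedHttpClient.request; the function name backoff_delays and its default values are assumptions made for the example.

```python
import random
from typing import Optional


def backoff_delays(
    attempts: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> list:
    """Return one randomized delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly between 0 and the ceiling.
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A caller would sleep for each delay in turn (for example with asyncio.sleep) between failed attempts.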
UX notes (Textual TUI)
Notifications are good—make them consistent: app.safe_notify(...) exists—use that everywhere instead of self.notify(...) to normalize markup handling & safety.
Avoid visible “freeze” scenarios: Replace the final time.sleep(2) with a timer and transition immediately; users see their actions complete without lag. (Patch below.)
Search table quick find: You have EnhancedDataTable with quick-search messages; wire a shortcut (e.g., /) to focus a search input and filter rows live.
Theme file size & maintainability: The theming system is thorough; consider splitting styles.py into smaller modules or generating CSS at build time to keep the Python file lean. (The responsive CSS generators are consolidated here.)
Modularity / Redundancy
Converge repeated property mapping: WeaviateStorage.store and store_batch build the same properties dict; factor this into a small helper to keep schemas in one place (less drift, easier to extend).
Common “describe/list/count” patterns across storages: R2RStorage, OpenWebUIStorage, and WeaviateStorage present similar collection/document listing and counting methods. Consider a small “collection view” mixin with shared helpers; each backend implements only the API-specific steps.
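One way the shared "collection view" idea could look is sketched below. The class name CollectionViewMixin, the method names, and the in-memory backend are all hypothetical; the real adapters' signatures may differ.

```python
import asyncio
from abc import ABC, abstractmethod


class CollectionViewMixin(ABC):
    """Hypothetical mixin: a backend implements list_documents and inherits the rest."""

    @abstractmethod
    async def list_documents(self, collection: str) -> list:
        """Backend-specific listing (the only required method)."""

    async def count(self, collection: str) -> int:
        # Default fallback: count by listing. Backends with a native
        # count endpoint would override this with a cheaper call.
        return len(await self.list_documents(collection))

    async def describe(self, collection: str) -> dict:
        return {"name": collection, "count": await self.count(collection)}


class InMemoryStorage(CollectionViewMixin):
    """Toy backend used only to exercise the mixin."""

    def __init__(self, data: dict) -> None:
        self._data = data

    async def list_documents(self, collection: str) -> list:
        return list(self._data.get(collection, []))


summary = asyncio.run(InMemoryStorage({"docs": [{"id": 1}, {"id": 2}]}).describe("docs"))
```

Each real adapter would keep only its API-specific listing logic and override count where the backend offers something faster.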
Security & Reliability
Input sanitization for LLM metadata: Your MetadataTagger sanitizes and bounds fields (e.g., max lengths, language whitelist). This is a strong pattern—keep it.
Timeouts and typed HTTP clients: You standardize HTTP clients, headers, and timeouts. Good foundation for consistent behavior & observability.
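To make the sanitization pattern concrete, the sketch below bounds field lengths and whitelists the language code. It is not MetadataTagger's actual code: the field names, the limits, and the whitelist contents are assumptions chosen for the example.

```python
ALLOWED_LANGUAGES = {"en", "de", "fr", "es"}  # assumed whitelist
MAX_TITLE_LEN = 200   # assumed bound
MAX_TAG_LEN = 50      # assumed bound
MAX_TAGS = 10         # assumed bound


def sanitize_metadata(raw: dict) -> dict:
    """Coerce untrusted LLM output into bounded, predictable fields."""
    title = str(raw.get("title", "")).strip()[:MAX_TITLE_LEN]

    language = str(raw.get("language", "")).strip().lower()
    if language not in ALLOWED_LANGUAGES:
        language = "unknown"

    raw_tags = raw.get("tags")
    if not isinstance(raw_tags, list):
        raw_tags = []
    # Drop empty tags, truncate long ones, and cap the total count.
    tags = [str(t).strip()[:MAX_TAG_LEN] for t in raw_tags if str(t).strip()]

    return {"title": title, "language": language, "tags": tags[:MAX_TAGS]}
```

A function of this shape is also easy to unit test, since it takes and returns plain dictionaries.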
Suggested patches (drop‑in)
- Don’t block the UI when closing the ingestion screen
Current (blocking):
    import time

    time.sleep(2)
    cast("CollectionManagementApp", self.app).pop_screen()
Safer (schedule via the app’s timer):
    def _pop() -> None:
        try:
            self.app.pop_screen()
        except Exception:
            pass

    # Schedule from the worker thread
    cast("CollectionManagementApp", self.app).call_from_thread(
        lambda: self.app.set_timer(2.0, _pop)
    )
- Offload Weaviate sync calls from async methods
Example – insert (from WeaviateStorage.store)
    # Before (sync call inside an async method)
    collection.data.insert(properties=properties, vector=vector)

    # After
    await asyncio.to_thread(collection.data.insert, properties=properties, vector=vector)
Example – fetch/delete by filter (from delete_by_filter)
    response = await asyncio.to_thread(
        collection.query.fetch_objects, filters=where_filter, limit=1000
    )
    for obj in response.objects:
        await asyncio.to_thread(collection.data.delete_by_id, obj.uuid)
Example – connect during initialize
await asyncio.to_thread(self.client.connect)
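The effect of these changes can be checked in isolation. The following self-contained sketch uses a stand-in blocking function (slow_sync_insert, hypothetical) instead of the real Weaviate client, and a heartbeat task to show that the event loop keeps running while the blocking call executes in a thread.

```python
import asyncio
import time


def slow_sync_insert(properties: dict) -> str:
    """Stand-in for a blocking SDK call such as collection.data.insert."""
    time.sleep(0.2)
    return f"stored:{properties['title']}"


async def heartbeat(ticks: list, stop: asyncio.Event) -> None:
    """Record a timestamp every 20 ms whenever the loop is free to run."""
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main() -> tuple:
    ticks: list = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    # The blocking call runs in a worker thread; the heartbeat keeps ticking.
    result = await asyncio.to_thread(slow_sync_insert, {"title": "doc-1"})
    stop.set()
    await beat
    return result, len(ticks)


result, tick_count = asyncio.run(main())
```

If slow_sync_insert were called directly inside main instead of through asyncio.to_thread, the heartbeat would record no ticks for the duration of the call.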
- Batch embeddings in store_batch
Current (per‑doc):
    for doc in documents:
        if doc.vector is None:
            doc.vector = await self.vectorizer.vectorize(doc.content)
Proposed (batch):
    # Collect contents needing vectors
    to_embed_idxs = [i for i, d in enumerate(documents) if d.vector is None]
    if to_embed_idxs:
        contents = [documents[i].content for i in to_embed_idxs]
        vectors = await self.vectorizer.vectorize_batch(contents)
        for j, idx in enumerate(to_embed_idxs):
            documents[idx].vector = vectors[j]
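A runnable variant of the batch approach, with chunking added so that very large batches are split into bounded requests, might look like this. Doc and FakeVectorizer are stand-ins, chunk_size is a hypothetical parameter, and the sketch assumes vectorize_batch returns one vector per input in the same order.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    content: str
    vector: Optional[list] = None


class FakeVectorizer:
    """Stand-in for the real Vectorizer; counts batch requests."""

    def __init__(self) -> None:
        self.calls = 0

    async def vectorize_batch(self, texts: list) -> list:
        self.calls += 1
        return [[float(len(t))] for t in texts]


async def fill_missing_vectors(docs: list, vectorizer, chunk_size: int = 64) -> None:
    """Embed only documents without a vector, one request per chunk."""
    pending = [d for d in docs if d.vector is None]
    for start in range(0, len(pending), chunk_size):
        chunk = pending[start:start + chunk_size]
        vectors = await vectorizer.vectorize_batch([d.content for d in chunk])
        if len(vectors) != len(chunk):
            raise ValueError("vectorizer returned a different number of vectors")
        for doc, vec in zip(chunk, vectors):
            doc.vector = vec


docs = [Doc("a"), Doc("bb", [9.0]), Doc("ccc")]
vectorizer = FakeVectorizer()
asyncio.run(fill_missing_vectors(docs, vectorizer, chunk_size=1))
```

The length check guards against a provider silently dropping an input, which would otherwise misalign vectors and documents.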
- Close the Vectorizer HTTP client when storage closes
Current: WeaviateStorage.close() only closes the Weaviate client.
Add:
    async def close(self) -> None:
        if self.client:
            try:
                cast(weaviate.WeaviateClient, self.client).close()
            except Exception as e:
                import logging
                logging.warning("Error closing Weaviate client: %s", e)
        # NEW: close vectorizer HTTP client too
        try:
            await self.vectorizer.client.aclose()
        except Exception:
            pass
(Your Vectorizer owns an httpx.AsyncClient with headers/timeouts set.)
- Prefer safe_notify consistently
Replace direct self.notify(...) calls inside TUI screens with cast(AppType, self.app).safe_notify(...) (same severity, markup=False by default), centralizing styling/sanitization.
Smaller improvements (quality of life)
Quick search focus: Wire the EnhancedDataTable quick-search to a keyboard binding (e.g., /) so users can filter rows without reaching for the mouse.
Refactor repeated properties dict: Extract the property construction in WeaviateStorage.store/store_batch into a helper to avoid drift and reduce line count.
Styles organization: styles.py is hefty by design. Consider splitting “theme palette”, “components”, and “responsive” generators into separate modules to keep diffs small and reviews easier.
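For the properties-dict refactor mentioned in this list, a helper could be as small as the sketch below. The Document fields and property names here are hypothetical placeholders; the real schema lives in WeaviateStorage and should be copied from there.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Simplified stand-in for the project's document model."""
    content: str
    source_url: str
    title: str = ""


def build_properties(doc: Document) -> dict:
    """Single place that maps a document to the stored property dict."""
    return {
        "content": doc.content,
        "source_url": doc.source_url,
        "title": doc.title or "Untitled",
        "word_count": len(doc.content.split()),
    }


def build_properties_batch(docs: list) -> list:
    # store_batch can reuse the same mapping, so the two paths cannot drift.
    return [build_properties(d) for d in docs]
```

With this in place, adding a field to the schema becomes a one-line change that both store and store_batch pick up.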
Architecture & Modularity
Storage adapters: The three adapters share “describe/list/count” concepts. Introduce a tiny shared “CollectionIntrospection” mixin (interfaces + default fallbacks), and keep only API specifics in each adapter. This will simplify the TUI’s StorageManager as well.
Ingestion flows: Good use of Prefect tasks/flows with retries, variables, and tagging. The Firecrawl→R2R specialization cleanly reuses common steps. The batch boundaries are clear and progress updates are threaded back to the UI.
DX / Testing
Unit tests: In this export I don’t see tests. Add lightweight tests for:
Vector extraction/format parsing in Vectorizer (covers multiple providers).
Weaviate adapters: property building, name normalization, vector extraction.
MetadataTagger sanitization rules.
TUI: use Textual’s pilot to test that notifications and timers trigger and that screens transition without blocking (verifies the sleep fix).
Static analysis: You already have excellent typing. Add ruff + mypy --strict and a pre-commit config to keep it consistent across contributors.