6166 Commits

Daniel.y
b670544958 Merge pull request #2433 from danielaskdd/fix-jina-embedding
Fix: Add configurable model support for Jina embedding
2025-11-28 19:36:18 +08:00
yangdx
ea8d55ab42 Add documentation for embedding provider configuration rules 2025-11-28 17:49:30 +08:00
Christian Clauss
90f341d614 Fix typos discovered by codespell 2025-11-28 10:31:52 +01:00
yangdx
4ab4a7ac94 Allow embedding models to use provider defaults when unspecified
- Set EMBEDDING_MODEL default to None
- Pass model param only when provided
- Let providers use their own defaults
- Fix lollms embed function params
- Add ollama embed_model default param
2025-11-28 16:57:33 +08:00
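
A minimal sketch of the pattern this commit describes: forward the model parameter to the provider's embedding call only when the user has configured one, so the provider's own default is used otherwise. The helper and parameter names below are illustrative, not lightrag's actual API.

```python
import os

# Illustrative only: pass `model` through to the provider's embed function
# only when EMBEDDING_MODEL is actually set; otherwise let the provider
# fall back to its own default model.
async def embed_with_optional_model(texts, provider_embed):
    model = os.getenv("EMBEDDING_MODEL") or None
    kwargs = {"model": model} if model else {}
    return await provider_embed(texts, **kwargs)
```
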
yangdx
881b8d3a50 Bump API version to 0257 2025-11-28 15:39:55 +08:00
yangdx
56e0365cf0 Add configurable model parameter to jina_embed function
- Add model parameter to jina_embed
- Pass model from API server
- Default to jina-embeddings-v4
- Update function documentation
- Make model selection flexible
2025-11-28 15:38:29 +08:00
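
A hedged sketch of a configurable Jina embedding call with the default named in the commit ("jina-embeddings-v4"); the real jina_embed in lightrag may differ in signature and transport, and the endpoint and response handling below assume Jina's OpenAI-style embeddings API.

```python
import os
import httpx

# Sketch only: configurable model with the default the commit describes.
async def jina_embed(texts: list[str], model: str = "jina-embeddings-v4") -> list[list[float]]:
    headers = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}
    payload = {"model": model, "input": texts}
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            "https://api.jina.ai/v1/embeddings", json=payload, headers=headers
        )
        resp.raise_for_status()
        # Assumes an OpenAI-compatible response shape: {"data": [{"embedding": [...]}, ...]}
        return [item["embedding"] for item in resp.json()["data"]]
```
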
Daniel.y
1b02684e2f Merge pull request #2432 from danielaskdd/embedding-example
Doc: Update README examples to prevent double-wrapping of embedding functions
2025-11-28 15:24:52 +08:00
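
The PR above is about wrapping an embedding function exactly once. A hedged illustration using lightrag's EmbeddingFunc wrapper; the backend call is a placeholder, and whether a given helper is already wrapped should be checked against the README note this PR adds.

```python
from lightrag.utils import EmbeddingFunc

async def my_embed(texts: list[str]) -> list[list[float]]:
    # Placeholder backend call; return one vector per input text.
    return [[0.0] * 1024 for _ in texts]

# Wrap the raw async function exactly once. Wrapping an already-wrapped
# embedding function in another EmbeddingFunc is the "double-wrapping"
# the updated README examples avoid.
embedding_func = EmbeddingFunc(embedding_dim=1024, max_token_size=8192, func=my_embed)
```
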
yangdx
97a9dfcac0 Add important note about embedding function wrapping restrictions 2025-11-28 14:55:15 +08:00
yangdx
1d07ff7f60 Update OpenAI and Ollama embedding func examples in README 2025-11-28 14:41:29 +08:00
yangdx
6e2946e78a Add max_token_size parameter to azure_openai_embed wrapper 2025-11-28 13:41:01 +08:00
yangdx
4f12fe121d Change entity extraction logging from warning to info level
• Reduce log noise for empty entities
2025-11-27 11:00:34 +08:00
Ghazi-raad
4e8e08cf4d Update lightrag/operate.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-26 23:18:20 +00:00
Ghazi-raad
56677ae466 Update lightrag/prompt.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-26 23:18:12 +00:00
Ghazi-raad
207af40f54 Optimize for OpenAI Prompt Caching: Restructure entity extraction prompts
- Remove input_text from entity_extraction_system_prompt to enable caching
- Move input_text to entity_extraction_user_prompt for per-chunk variability
- Update operate.py to format system prompt once without input_text
- Format user prompts with input_text for each chunk

This enables OpenAI's automatic prompt caching (50% discount on cached tokens):
- ~1300 token system message cached and reused for ALL chunks
- Only ~150 token user message varies per chunk
- Expected 45% cost reduction on prompt tokens during indexing
- 2-3x faster response times from cached prompts

Fixes #2355
2025-11-26 21:56:25 +00:00
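
A hedged sketch of the message structure this commit moves to: the long, chunk-independent system prompt is formatted once and kept byte-identical so OpenAI's automatic prompt caching can reuse it, while only the short user message carries the per-chunk input_text. The prompt text below is a placeholder, not lightrag's actual template.

```python
# Placeholder templates illustrating the caching-friendly split.
ENTITY_EXTRACTION_SYSTEM_PROMPT = (
    "You are an entity extraction assistant. "
    "Output entities and relations as {tuple_delimiter}-separated records."
)
ENTITY_EXTRACTION_USER_PROMPT = "Extract entities from the following text:\n{input_text}"

# Format the long system prompt once (identical for every chunk -> cacheable)...
system_msg = ENTITY_EXTRACTION_SYSTEM_PROMPT.format(tuple_delimiter="<|>")

def messages_for_chunk(chunk_text: str) -> list[dict]:
    # ...and only the short user message varies per chunk.
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": ENTITY_EXTRACTION_USER_PROMPT.format(input_text=chunk_text)},
    ]
```
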
palanisd
a898f0548d Merge branch 'HKUDS:main' into cohere-rerank 2025-11-25 14:21:43 -05:00
BukeLy
cf68cdfe3a refactor: improve PostgreSQL migration code quality
Why this change is needed:
1. Added clarifying comments to _pg_migrate_workspace_data() parameter handling
2. Removed dead code from PGDocStatusStorage.initialize() that was never executed

Changes:

1. PostgreSQL Migration Parameter Documentation (lightrag/kg/postgres_impl.py:2240-2241):
   - Added comments explaining dict rebuild for correct value ordering
   - Clarifies that Python 3.7+ dict insertion order is relied upon
   - Documents that execute() converts dict to tuple via .values()

2. Dead Code Removal (lightrag/kg/postgres_impl.py:3061-3062):
   - Removed unreachable table creation code from PGDocStatusStorage.initialize()
   - Table is already created by PostgreSQLDB.initdb() during initialization
   - This code path was never executed because the table always exists before initialize() is called
   - Added NOTE comment explaining where table creation actually happens

Impact:
- No functional changes - only code clarification and cleanup
- Reduces maintenance burden by removing unreachable code
- Improves code readability with better documentation

Testing:
- All 14 PostgreSQL migration tests pass
- All 5 UnifiedLock safety tests pass
- Pre-commit checks pass (ruff-format, ruff)
2025-11-26 02:24:51 +08:00
BukeLy
0fb7c5bc3b test: add unit test for Case 1 sequential workspace migration bug
Add test_case1_sequential_workspace_migration to verify the fix for
the multi-tenant data loss bug in PostgreSQL Case 1 migration.

Problem:
- When workspace_a migrates first (Case 4: only legacy table exists)
- Then workspace_b initializes later (Case 1: both tables exist)
- Bug: Case 1 only checked if legacy table was globally empty
- Result: workspace_b's data was not migrated, causing data loss

Test Scenario:
1. Legacy table contains data from both workspace_a (3 records) and
   workspace_b (3 records)
2. workspace_a initializes first → triggers Case 4 migration
3. workspace_b initializes second → triggers Case 1 migration
4. Verify workspace_b's data is correctly migrated to new table
5. Verify workspace_b's data is deleted from legacy table
6. Verify legacy table is dropped when empty

This test uses mock tracking of inserted records to verify migration
behavior without requiring a real PostgreSQL database.

Related: GitHub PR #2391 comment #2553973066
2025-11-26 01:32:25 +08:00
BukeLy
a8f5c9bd33 fix: migrate workspace data in PostgreSQL Case 1 to prevent data loss
Why this change is needed:
In multi-tenant deployments, when workspace A migrates first (creating
the new model-suffixed table), subsequent workspace B initialization
enters Case 1 (both tables exist). The original Case 1 logic only
checked if the legacy table was empty globally, without checking if
the current workspace had unmigrated data. This caused workspace B's
data to remain in the legacy table while the application queried the
new table, resulting in data loss for workspace B.

How it solves the problem:
1. Extracted migration logic into _pg_migrate_workspace_data() helper
   function to avoid code duplication
2. Modified Case 1 to check if current workspace has data in legacy
   table and migrate it if found
3. Both Case 1 and Case 4 now use the same migration helper, ensuring
   consistent behavior
4. After migration, only delete the current workspace's data from
   legacy table, preserving other workspaces' data

Impact:
- Prevents data loss in multi-tenant PostgreSQL deployments
- Maintains backward compatibility with single-tenant setups
- Reduces code duplication between Case 1 and Case 4

Testing:
All PostgreSQL migration tests pass (8/8)
2025-11-26 01:16:57 +08:00
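
A simplified, self-contained sketch of the workspace-scoped migration step now shared by Case 1 and Case 4, written directly against asyncpg; table and column names are placeholders, and only the "migrate, then delete, only the current workspace's rows" behaviour comes from the commit message.

```python
import asyncpg

async def migrate_workspace_rows(conn: asyncpg.Connection, legacy_table: str,
                                 new_table: str, workspace: str) -> int:
    # Copy only this workspace's rows into the new model-suffixed table...
    rows = await conn.fetch(
        f"SELECT id, workspace, content, content_vector FROM {legacy_table} "
        f"WHERE workspace = $1",
        workspace,
    )
    for r in rows:
        await conn.execute(
            f"INSERT INTO {new_table} (id, workspace, content, content_vector) "
            f"VALUES ($1, $2, $3, $4)",
            r["id"], r["workspace"], r["content"], r["content_vector"],
        )
    # ...then remove only this workspace's legacy rows; other tenants' data stays put.
    await conn.execute(f"DELETE FROM {legacy_table} WHERE workspace = $1", workspace)
    return len(rows)
```
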
yangdx
93d445dfdd Add pipeline status lock function for legacy compatibility
- Add get_pipeline_status_lock function
- Return NamespaceLock for consistency
- Support workspace parameter
- Enable logging option
- Legacy code compatibility
2025-11-25 18:24:39 +08:00
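
A hedged, single-process sketch of the compatibility helper described above: one shared lock per workspace-scoped pipeline_status namespace. lightrag's real NamespaceLock is shared-storage/multiprocess aware, which this illustration omits.

```python
import asyncio
from collections import defaultdict

_pipeline_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

def get_pipeline_status_lock(workspace: str = "", enable_logging: bool = False) -> asyncio.Lock:
    # One shared lock per workspace-scoped pipeline_status namespace.
    namespace = f"pipeline_status:{workspace}" if workspace else "pipeline_status"
    if enable_logging:
        print(f"get_pipeline_status_lock: returning lock for '{namespace}'")
    return _pipeline_locks[namespace]
```
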
Daniel.y
d2cd1c0722 Merge pull request #2421 from EightyOliveira/fix_catch_order
fix: exception handling order error
2025-11-25 17:52:56 +08:00
yangdx
777c91794b Add Langfuse observability configuration to env.example
- Add Langfuse environment variables
- Include setup instructions
- Support OpenAI compatible APIs
- Enable tracing configuration
- Add cloud/self-host options
2025-11-25 17:16:55 +08:00
EightyOliveira
8994c70f2f fix: exception handling order error 2025-11-25 16:36:41 +08:00
Daniel.y
2539b4e2c8 Merge pull request #2418 from danielaskdd/start-without-webui
Refactor: Allow API Server to Start Without Built WebUI Assets
2025-11-25 03:02:15 +08:00
yangdx
48b67d3077 Handle missing WebUI assets gracefully without blocking server startup
- Change build check from error to warning
- Redirect to /docs when WebUI unavailable
- Add webui_available to health endpoint
- Only mount /webui if assets exist
- Return status tuple from build check
2025-11-25 02:51:55 +08:00
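
A hedged FastAPI sketch of the startup behaviour this commit describes; paths, route names, and the health payload are illustrative rather than lightrag's actual server code.

```python
from pathlib import Path
from fastapi import FastAPI
from fastapi.responses import RedirectResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()
webui_dir = Path(__file__).parent / "webui"
webui_available = (webui_dir / "index.html").is_file()

# Only mount /webui when the built assets actually exist.
if webui_available:
    app.mount("/webui", StaticFiles(directory=str(webui_dir), html=True), name="webui")

@app.get("/")
async def root():
    # Fall back to the API docs when the WebUI assets were never built.
    return RedirectResponse("/webui" if webui_available else "/docs")

@app.get("/health")
async def health():
    return {"status": "healthy", "webui_available": webui_available}
```
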
Daniel.y
2832a2ca7e Merge pull request #2417 from danielaskdd/neo4j-retry
Fix: Add Comprehensive Retry Mechanism for Neo4j Storage Operations
2025-11-25 02:03:48 +08:00
yangdx
5f91063c7a Add ruff as dependency to pytest and evaluation extras 2025-11-25 02:03:28 +08:00
yangdx
8c4d7a00ad Refactor: Extract retry decorator to reduce code duplication in Neo4J storage
• Define READ_RETRY_EXCEPTIONS constant
• Create reusable READ_RETRY decorator
• Replace 11 duplicate retry decorators
• Improve code maintainability
• Add missing retry to edge_degrees_batch
2025-11-25 01:35:21 +08:00
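
A hedged sketch of the extracted decorator pattern using tenacity; the exception tuple, attempt count, and backoff below are assumptions, and the query is only an example of a read operation the decorator would wrap.

```python
from neo4j.exceptions import ServiceUnavailable, SessionExpired, TransientError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

READ_RETRY_EXCEPTIONS = (ServiceUnavailable, SessionExpired, TransientError)

# Defined once and applied to every read method, instead of repeating the
# same retry decorator eleven times.
READ_RETRY = retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=8),
    retry=retry_if_exception_type(READ_RETRY_EXCEPTIONS),
    reraise=True,
)

@READ_RETRY
async def get_node(session, entity_id: str):
    result = await session.run("MATCH (n {entity_id: $id}) RETURN n", id=entity_id)
    return await result.single()
```
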
Daniel.y
5b81ef000e Merge pull request #2410 from netbrah/create-copilot-setup-steps
feat: create copilot-setup-steps.yml
2025-11-24 22:36:33 +08:00
yangdx
7aaa51cda9 Add retry decorators to Neo4j read operations for resilience 2025-11-24 22:28:15 +08:00
palanisd
293ddbc326 Update test_neo4j_fulltext_index.py 2025-11-24 09:21:37 -05:00
palanisd
dd18eb5b9c Merge pull request #3 from netbrah/copilot/fix-overlap-tokens-validation
Fix infinite loop in chunk_documents_for_rerank when overlap_tokens >= max_tokens
2025-11-24 09:11:24 -05:00
copilot-swe-agent[bot]
8835fc244a Improve edge case handling for max_tokens=1
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-24 03:43:05 +00:00
copilot-swe-agent[bot]
1d6ea0c5f7 Fix chunking infinite loop when overlap_tokens >= max_tokens
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-24 03:40:58 +00:00
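
A minimal sketch of the guard this fix implies: clamp the overlap so each window advances by at least one token, which removes the infinite loop when overlap_tokens >= max_tokens (and also covers the max_tokens=1 edge case addressed in the adjacent commit). chunk_documents_for_rerank itself may be structured differently.

```python
def chunk_tokens(tokens: list[int], max_tokens: int, overlap_tokens: int) -> list[list[int]]:
    # Clamp so the stride is always >= 1 and the loop always terminates.
    max_tokens = max(1, max_tokens)
    overlap_tokens = min(overlap_tokens, max_tokens - 1)
    step = max_tokens - overlap_tokens
    return [tokens[i : i + max_tokens] for i in range(0, len(tokens), step)]
```
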
copilot-swe-agent[bot]
e136da968b Initial plan 2025-11-24 03:33:26 +00:00
palanisd
c233da6318 Update copilot-setup-steps.yml 2025-11-23 17:42:04 -05:00
BukeLy
3b8a1e64b7 style: apply ruff formatting fixes to test files
Apply ruff-format fixes to 6 test files to pass pre-commit checks:
- test_dimension_mismatch.py
- test_e2e_multi_instance.py
- test_no_model_suffix_safety.py
- test_postgres_migration.py
- test_unified_lock_safety.py
- test_workspace_migration_isolation.py

Changes are primarily assert statement reformatting to match ruff style guide.
2025-11-23 16:59:02 +08:00
BukeLy
510baebf62 fix: correct PostgreSQL execute() parameter format in workspace cleanup
Critical Bug Fix:
PostgreSQLDB.execute() expects data as dict, but workspace cleanup
was passing a list [workspace], causing cleanup to fail with
"PostgreSQLDB.execute() expects data as dict, got list" error.

Changes:
1. Fixed postgres_impl.py:2522
   - Changed: await db.execute(delete_query, [workspace])
   - To: await db.execute(delete_query, {"workspace": workspace})

2. Improved test_postgres_migration.py mock
   - Enhanced COUNT(*) mock to properly distinguish between:
     * Legacy table with workspace filter (returns 50)
     * Legacy table without filter after deletion (returns 0)
     * New table verification (returns 50)
   - Uses storage.legacy_table_name dynamically instead of hardcoded strings
   - Detects table type by checking for model suffix patterns

3. Fixed test_unified_lock_safety.py formatting
   - Applied ruff formatting to assert statement

Impact:
- Workspace-aware legacy cleanup now works correctly
- Legacy tables properly deleted when all workspace data migrated
- Legacy tables preserved when other workspace data remains

Tests: All 25 unit tests pass
2025-11-23 16:55:48 +08:00
BukeLy
e2d68adff9 style: apply ruff formatting to test files 2025-11-23 16:45:50 +08:00
BukeLy
16fff353d9 fix: prevent data loss in PostgreSQL migration and add doc_status table creation
This commit fixes two critical issues in PostgreSQL storage:

BUG 1: Legacy table cleanup causing data loss across workspaces
---------------------------------------------------------------
PROBLEM:
- After migrating workspace_a data from legacy table, the ENTIRE legacy
  table was deleted
- This caused workspace_b's data (still in legacy table) to be lost
- Multi-tenant data isolation was violated

FIX:
- Implement workspace-aware cleanup: only delete migrated workspace's data
- Check if other workspaces still have data before dropping table
- Only drop legacy table when it becomes completely empty
- If other workspace data exists, preserve legacy table with remaining records

Location: postgres_impl.py PGVectorStorage.setup_table() lines 2510-2567

Test verification:
- test_workspace_migration_isolation_e2e_postgres validates this fix

BUG 2: PGDocStatusStorage missing table initialization
-------------------------------------------------------
PROBLEM:
- PGDocStatusStorage.initialize() only set workspace, never created table
- Caused "relation 'lightrag_doc_status' does not exist" errors
- document insertion (ainsert) failed immediately

FIX:
- Add table creation to initialize() method using _pg_create_table()
- Consistent with other storage implementations:
  * MongoDocStatusStorage creates collections
  * JsonDocStatusStorage creates directories
  * PGDocStatusStorage now creates tables ✓

Location: postgres_impl.py PGDocStatusStorage.initialize() lines 2965-2971

Test Results:
- Unit tests: 13/13 passed (test_unified_lock_safety,
  test_workspace_migration_isolation, test_dimension_mismatch)
- E2E tests require PostgreSQL server

Related: PR #2391 (Vector Storage Model Isolation)
2025-11-23 16:43:49 +08:00
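
A hedged asyncpg sketch of the workspace-aware cleanup in BUG 1's fix: delete only the migrated workspace's rows and drop the legacy table only once nothing is left in it. Table names are placeholders.

```python
import asyncpg

async def cleanup_legacy_after_migration(conn: asyncpg.Connection,
                                         legacy_table: str, workspace: str) -> None:
    # Remove only the rows that were just migrated for this workspace.
    await conn.execute(f"DELETE FROM {legacy_table} WHERE workspace = $1", workspace)
    remaining = await conn.fetchval(f"SELECT COUNT(*) FROM {legacy_table}")
    if remaining == 0:
        # No other workspace still depends on the legacy table.
        await conn.execute(f"DROP TABLE {legacy_table}")
```
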
BukeLy
204a2535c8 fix: prevent double-release in UnifiedLock.__aexit__ error recovery
Problem:
When UnifiedLock.__aexit__ encountered an exception during async_lock.release(),
the error recovery logic would incorrectly attempt to release async_lock again
because it only checked main_lock_released flag. This could cause:
- Double-release attempts on already-failed locks
- Masking of original exceptions
- Undefined behavior in lock state

Root Cause:
The recovery logic used only main_lock_released to determine whether to attempt
async_lock release, without tracking whether async_lock.release() had already
been attempted and failed.

Fix:
- Added async_lock_released flag to track async_lock release attempts
- Updated recovery logic condition to check both main_lock_released AND
  async_lock_released before attempting async_lock release
- This ensures async_lock.release() is only called once, even if it fails

Testing:
- Added test_aexit_no_double_release_on_async_lock_failure:
  Verifies async_lock.release() is called only once when it fails
- Added test_aexit_recovery_on_main_lock_failure:
  Verifies recovery logic still works when main lock fails
- All 5 UnifiedLock safety tests pass

Impact:
- Eliminates double-release bugs in multiprocess lock scenarios
- Preserves correct error propagation
- Maintains recovery logic for legitimate failure cases

Files Modified:
- lightrag/kg/shared_storage.py: Added async_lock_released tracking
- tests/test_unified_lock_safety.py: Added 2 new tests (5 total now pass)
2025-11-23 16:34:08 +08:00
BukeLy
49bbb3a4d7 test: add E2E test for workspace migration isolation
Why this change is needed:
Add end-to-end test to verify the P0 bug fix for cross-workspace data
leakage during PostgreSQL migration. Unit tests use mocks and cannot verify
that real SQL queries correctly filter by workspace in actual database.

What this test does:
- Creates legacy table with MIXED data (workspace_a + workspace_b)
- Initializes LightRAG for workspace_a only
- Verifies ONLY workspace_a data migrated to new table
- Verifies workspace_b data NOT leaked to new table (0 records)
- Verifies workspace_b data preserved in legacy table (3 records)
- Verifies workspace_a data cleaned from legacy after migration (0 records)

Impact:
- tests/test_e2e_multi_instance.py: Add test_workspace_migration_isolation_e2e_postgres
- Validates multi-tenant isolation in real PostgreSQL environment
- Prevents regression of critical security fix

Testing:
E2E test passes with real PostgreSQL container, confirming workspace
filtering works correctly with actual SQL execution.
2025-11-23 16:27:05 +08:00
BukeLy
cfc6587e04 fix: prevent race conditions and cross-workspace data leakage in migration
Why this change is needed:
Two critical P0 security vulnerabilities were identified in CursorReview:
1. UnifiedLock silently allows unprotected execution when lock is None, creating
   false security and potential race conditions in multi-process scenarios
2. PostgreSQL migration copies ALL workspace data during legacy table migration,
   violating multi-tenant isolation and causing data leakage

How it solves the problem:
- UnifiedLock now raises RuntimeError when lock is None instead of WARNING
- Added workspace parameter to setup_table() for proper data isolation
- Migration queries now filter by workspace in both COUNT and SELECT operations
- Added clear error messages to help developers diagnose initialization issues

Impact:
- lightrag/kg/shared_storage.py: UnifiedLock raises exception on None lock
- lightrag/kg/postgres_impl.py: Added workspace filtering to migration logic
- tests/test_unified_lock_safety.py: 3 tests for lock safety
- tests/test_workspace_migration_isolation.py: 3 tests for workspace isolation
- tests/test_dimension_mismatch.py: Updated table names and mocks
- tests/test_postgres_migration.py: Updated mocks for workspace filtering

Testing:
- All 31 tests pass (16 migration + 4 safety + 3 lock + 3 workspace + 5 dimension)
- Backward compatible: existing code continues working unchanged
- Code style verified with ruff and pre-commit hooks
2025-11-23 16:09:59 +08:00
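
A hedged sketch of the stricter UnifiedLock behaviour described above: acquiring a lock that was never initialized now fails loudly instead of silently running unprotected. The class body and error text are illustrative, not lightrag's actual implementation.

```python
class UnifiedLockSketch:
    def __init__(self, lock, name: str = "storage"):
        self._lock = lock
        self._name = name

    async def __aenter__(self):
        if self._lock is None:
            # Fail loudly instead of proceeding without any protection.
            raise RuntimeError(
                f"Lock '{self._name}' is not initialized; "
                "initialize shared storage before acquiring it."
            )
        await self._lock.acquire()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self._lock.release()
```
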
BukeLy
f69cf9bcd6 fix: prevent vector dimension mismatch crashes and data loss on no-suffix restarts
Why this change is needed:
Two critical issues were identified in Codex review of PR #2391:
1. Migration fails when legacy collections/tables use different embedding dimensions
   (e.g., upgrading from 1536d to 3072d models causes initialization failures)
2. When model_suffix is empty (no model_name provided), table_name equals legacy_table_name,
   causing Case 1 logic to delete the only table/collection on second startup

How it solves the problem:
- Added dimension compatibility checks before migration in both Qdrant and PostgreSQL
- PostgreSQL uses two-method detection: pg_attribute metadata query + vector sampling fallback
- When dimensions mismatch, skip migration and create new empty table/collection, preserving legacy data
- Added safety check to detect when new and legacy names are identical, preventing deletion
- Both backends log clear warnings about dimension mismatches and skipped migrations

Impact:
- lightrag/kg/qdrant_impl.py: Added dimension check (lines 254-297) and no-suffix safety (lines 163-169)
- lightrag/kg/postgres_impl.py: Added dimension check with fallback (lines 2347-2410) and no-suffix safety (lines 2281-2287)
- tests/test_no_model_suffix_safety.py: New test file with 4 test cases covering edge scenarios
- Backward compatible: All existing scenarios continue working unchanged

Testing:
- All 20 tests pass (16 existing migration tests + 4 new safety tests)
- E2E tests enhanced with explicit verification points for dimension mismatch scenarios
- Verified graceful degradation when dimension detection fails
- Code style verified with ruff and pre-commit hooks
2025-11-23 15:44:07 +08:00
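
A hedged sketch of the sampling fallback for the dimension check described above, using pgvector's vector_dims(); the real code first inspects pg_attribute metadata and also covers Qdrant, and the table/column names here are placeholders.

```python
import asyncpg

async def legacy_dimension_matches(conn: asyncpg.Connection, legacy_table: str,
                                   expected_dim: int) -> bool:
    sampled = await conn.fetchval(
        f"SELECT vector_dims(content_vector) FROM {legacy_table} "
        f"WHERE content_vector IS NOT NULL LIMIT 1"
    )
    if sampled is None:
        # Empty legacy table: nothing to migrate, treat as compatible.
        return True
    # On mismatch the caller skips migration and creates a new empty table,
    # preserving the legacy data, as the commit describes.
    return sampled == expected_dim
```
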
netbrah
a05bbf105e Add Cohere reranker config, chunking, and tests 2025-11-22 16:43:13 -05:00
palanisd
1b0413ee74 Create copilot-setup-steps.yml 2025-11-22 15:29:05 -05:00
palanisd
e9f2b13b26 Fix Neo4j fulltext index name mismatch between creation and query 2025-11-22 15:23:32 -05:00
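
The fix above is about using one index name consistently. A hedged illustration with the Neo4j Python driver: the name passed to CREATE FULLTEXT INDEX must be the exact name later given to db.index.fulltext.queryNodes(). Labels, properties, credentials, and the index name below are placeholders.

```python
from neo4j import GraphDatabase

INDEX_NAME = "entity_fulltext_index"

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Create the index under one well-known name...
    session.run(
        f"CREATE FULLTEXT INDEX {INDEX_NAME} IF NOT EXISTS "
        "FOR (n:Entity) ON EACH [n.entity_id, n.description]"
    )
    # ...and query it with exactly the same name, avoiding the mismatch.
    records = session.run(
        "CALL db.index.fulltext.queryNodes($index_name, $q) YIELD node, score "
        "RETURN node.entity_id AS id, score",
        index_name=INDEX_NAME,
        q="alice",
    )
```
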
copilot-swe-agent[bot]
8cb04cea5c Optimize index search loop with early break
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-22 20:20:59 +00:00
copilot-swe-agent[bot]
4a75c60cf4 Fix Neo4j fulltext index name mismatch and add tests
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-22 20:16:47 +00:00
palanisd
10780f4a69 Merge branch 'HKUDS:main' into update-full-text-index-for-workspace 2025-11-22 15:16:30 -05:00
copilot-swe-agent[bot]
eb1f5aeea2 Initial plan 2025-11-22 20:11:38 +00:00