6166 Commits

Daniel.y
b670544958 Merge pull request #2433 from danielaskdd/fix-jina-embedding
Fix: Add configurable model support for Jina embedding
2025-11-28 19:36:18 +08:00
yangdx
ea8d55ab42 Add documentation for embedding provider configuration rules 2025-11-28 17:49:30 +08:00
Christian Clauss
90f341d614 Fix typos discovered by codespell 2025-11-28 10:31:52 +01:00
yangdx
4ab4a7ac94 Allow embedding models to use provider defaults when unspecified
- Set EMBEDDING_MODEL default to None
- Pass model param only when provided
- Let providers use their own defaults
- Fix lollms embed function params
- Add ollama embed_model default param
2025-11-28 16:57:33 +08:00
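
A minimal sketch of the pattern this commit describes: forward the model parameter to the provider's embedding call only when the user has configured one, so the provider's own default is used otherwise. The helper and parameter names below are illustrative, not lightrag's actual API.

```python
import os

# Illustrative only: pass `model` through to the provider's embed function
# only when EMBEDDING_MODEL is actually set; otherwise let the provider
# fall back to its own default model.
async def embed_with_optional_model(texts, provider_embed):
    model = os.getenv("EMBEDDING_MODEL") or None
    kwargs = {"model": model} if model else {}
    return await provider_embed(texts, **kwargs)
```
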
yangdx
881b8d3a50 Bump API version to 0257 2025-11-28 15:39:55 +08:00
yangdx
56e0365cf0 Add configurable model parameter to jina_embed function
- Add model parameter to jina_embed
- Pass model from API server
- Default to jina-embeddings-v4
- Update function documentation
- Make model selection flexible
2025-11-28 15:38:29 +08:00
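
A hedged sketch of a configurable Jina embedding call with the default named in the commit ("jina-embeddings-v4"); the real jina_embed in lightrag may differ in signature and transport, and the endpoint and response handling below assume Jina's OpenAI-style embeddings API.

```python
import os
import httpx

# Sketch only: configurable model with the default the commit describes.
async def jina_embed(texts: list[str], model: str = "jina-embeddings-v4") -> list[list[float]]:
    headers = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}
    payload = {"model": model, "input": texts}
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            "https://api.jina.ai/v1/embeddings", json=payload, headers=headers
        )
        resp.raise_for_status()
        # Assumes an OpenAI-compatible response shape: {"data": [{"embedding": [...]}, ...]}
        return [item["embedding"] for item in resp.json()["data"]]
```
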
Daniel.y
1b02684e2f Merge pull request #2432 from danielaskdd/embedding-example
Doc: Update README examples to prevent double-wrapping of embedding functions
2025-11-28 15:24:52 +08:00
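
The PR above is about wrapping an embedding function exactly once. A hedged illustration using lightrag's EmbeddingFunc wrapper; the backend call is a placeholder, and whether a given helper is already wrapped should be checked against the README note this PR adds.

```python
from lightrag.utils import EmbeddingFunc

async def my_embed(texts: list[str]) -> list[list[float]]:
    # Placeholder backend call; return one vector per input text.
    return [[0.0] * 1024 for _ in texts]

# Wrap the raw async function exactly once. Wrapping an already-wrapped
# embedding function in another EmbeddingFunc is the "double-wrapping"
# the updated README examples avoid.
embedding_func = EmbeddingFunc(embedding_dim=1024, max_token_size=8192, func=my_embed)
```
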
yangdx
97a9dfcac0 Add important note about embedding function wrapping restrictions 2025-11-28 14:55:15 +08:00
yangdx
1d07ff7f60 Update OpenAI and Ollama embedding func examples in README 2025-11-28 14:41:29 +08:00
yangdx
6e2946e78a Add max_token_size parameter to azure_openai_embed wrapper 2025-11-28 13:41:01 +08:00
yangdx
4f12fe121d Change entity extraction logging from warning to info level
• Reduce log noise for empty entities
2025-11-27 11:00:34 +08:00
Ghazi-raad
4e8e08cf4d Update lightrag/operate.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-26 23:18:20 +00:00
Ghazi-raad
56677ae466 Update lightrag/prompt.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-26 23:18:12 +00:00
Ghazi-raad
207af40f54 Optimize for OpenAI Prompt Caching: Restructure entity extraction prompts
- Remove input_text from entity_extraction_system_prompt to enable caching
- Move input_text to entity_extraction_user_prompt for per-chunk variability
- Update operate.py to format system prompt once without input_text
- Format user prompts with input_text for each chunk

This enables OpenAI's automatic prompt caching (50% discount on cached tokens):
- ~1300 token system message cached and reused for ALL chunks
- Only ~150 token user message varies per chunk
- Expected 45% cost reduction on prompt tokens during indexing
- 2-3x faster response times from cached prompts

Fixes #2355
2025-11-26 21:56:25 +00:00
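
A hedged sketch of the message structure this commit moves to: the long, chunk-independent system prompt is formatted once and kept byte-identical so OpenAI's automatic prompt caching can reuse it, while only the short user message carries the per-chunk input_text. The prompt text below is a placeholder, not lightrag's actual template.

```python
# Placeholder templates illustrating the caching-friendly split.
ENTITY_EXTRACTION_SYSTEM_PROMPT = (
    "You are an entity extraction assistant. "
    "Output entities and relations as {tuple_delimiter}-separated records."
)
ENTITY_EXTRACTION_USER_PROMPT = "Extract entities from the following text:\n{input_text}"

# Format the long system prompt once (identical for every chunk -> cacheable)...
system_msg = ENTITY_EXTRACTION_SYSTEM_PROMPT.format(tuple_delimiter="<|>")

def messages_for_chunk(chunk_text: str) -> list[dict]:
    # ...and only the short user message varies per chunk.
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": ENTITY_EXTRACTION_USER_PROMPT.format(input_text=chunk_text)},
    ]
```
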
palanisd
a898f0548d Merge branch 'HKUDS:main' into cohere-rerank 2025-11-25 14:21:43 -05:00
BukeLy
cf68cdfe3a refactor: improve PostgreSQL migration code quality
Why this change is needed:
1. Added clarifying comments to _pg_migrate_workspace_data() parameter handling
2. Removed dead code from PGDocStatusStorage.initialize() that was never executed

Changes:

1. PostgreSQL Migration Parameter Documentation (lightrag/kg/postgres_impl.py:2240-2241):
   - Added comments explaining dict rebuild for correct value ordering
   - Clarifies that Python 3.7+ dict insertion order is relied upon
   - Documents that execute() converts dict to tuple via .values()

2. Dead Code Removal (lightrag/kg/postgres_impl.py:3061-3062):
   - Removed unreachable table creation code from PGDocStatusStorage.initialize()
   - Table is already created by PostgreSQLDB.initdb() during initialization
   - This code path was never executed because the table always exists before initialize() is called
   - Added NOTE comment explaining where table creation actually happens

Impact:
- No functional changes - only code clarification and cleanup
- Reduces maintenance burden by removing unreachable code
- Improves code readability with better documentation

Testing:
- All 14 PostgreSQL migration tests pass
- All 5 UnifiedLock safety tests pass
- Pre-commit checks pass (ruff-format, ruff)
2025-11-26 02:24:51 +08:00
BukeLy
0fb7c5bc3b test: add unit test for Case 1 sequential workspace migration bug
Add test_case1_sequential_workspace_migration to verify the fix for
the multi-tenant data loss bug in PostgreSQL Case 1 migration.

Problem:
- When workspace_a migrates first (Case 4: only legacy table exists)
- Then workspace_b initializes later (Case 1: both tables exist)
- Bug: Case 1 only checked if legacy table was globally empty
- Result: workspace_b's data was not migrated, causing data loss

Test Scenario:
1. Legacy table contains data from both workspace_a (3 records) and
   workspace_b (3 records)
2. workspace_a initializes first → triggers Case 4 migration
3. workspace_b initializes second → triggers Case 1 migration
4. Verify workspace_b's data is correctly migrated to new table
5. Verify workspace_b's data is deleted from legacy table
6. Verify legacy table is dropped when empty

This test uses mock tracking of inserted records to verify migration
behavior without requiring a real PostgreSQL database.

Related: GitHub PR #2391 comment #2553973066
2025-11-26 01:32:25 +08:00
BukeLy
a8f5c9bd33 fix: migrate workspace data in PostgreSQL Case 1 to prevent data loss
Why this change is needed:
In multi-tenant deployments, when workspace A migrates first (creating
the new model-suffixed table), subsequent workspace B initialization
enters Case 1 (both tables exist). The original Case 1 logic only
checked if the legacy table was empty globally, without checking if
the current workspace had unmigrated data. This caused workspace B's
data to remain in the legacy table while the application queried the
new table, resulting in data loss for workspace B.

How it solves the problem:
1. Extracted migration logic into _pg_migrate_workspace_data() helper
   function to avoid code duplication
2. Modified Case 1 to check if current workspace has data in legacy
   table and migrate it if found
3. Both Case 1 and Case 4 now use the same migration helper, ensuring
   consistent behavior
4. After migration, only delete the current workspace's data from
   legacy table, preserving other workspaces' data

Impact:
- Prevents data loss in multi-tenant PostgreSQL deployments
- Maintains backward compatibility with single-tenant setups
- Reduces code duplication between Case 1 and Case 4

Testing:
All PostgreSQL migration tests pass (8/8)
2025-11-26 01:16:57 +08:00
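
A simplified, self-contained sketch of the workspace-scoped migration step now shared by Case 1 and Case 4, written directly against asyncpg; table and column names are placeholders, and only the "migrate, then delete, only the current workspace's rows" behaviour comes from the commit message.

```python
import asyncpg

async def migrate_workspace_rows(conn: asyncpg.Connection, legacy_table: str,
                                 new_table: str, workspace: str) -> int:
    # Copy only this workspace's rows into the new model-suffixed table...
    rows = await conn.fetch(
        f"SELECT id, workspace, content, content_vector FROM {legacy_table} "
        f"WHERE workspace = $1",
        workspace,
    )
    for r in rows:
        await conn.execute(
            f"INSERT INTO {new_table} (id, workspace, content, content_vector) "
            f"VALUES ($1, $2, $3, $4)",
            r["id"], r["workspace"], r["content"], r["content_vector"],
        )
    # ...then remove only this workspace's legacy rows; other tenants' data stays put.
    await conn.execute(f"DELETE FROM {legacy_table} WHERE workspace = $1", workspace)
    return len(rows)
```
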
yangdx
93d445dfdd Add pipeline status lock function for legacy compatibility
- Add get_pipeline_status_lock function
- Return NamespaceLock for consistency
- Support workspace parameter
- Enable logging option
- Legacy code compatibility
2025-11-25 18:24:39 +08:00
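
A hedged, single-process sketch of the compatibility helper described above: one shared lock per workspace-scoped pipeline_status namespace. lightrag's real NamespaceLock is shared-storage/multiprocess aware, which this illustration omits.

```python
import asyncio
from collections import defaultdict

_pipeline_locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

def get_pipeline_status_lock(workspace: str = "", enable_logging: bool = False) -> asyncio.Lock:
    # One shared lock per workspace-scoped pipeline_status namespace.
    namespace = f"pipeline_status:{workspace}" if workspace else "pipeline_status"
    if enable_logging:
        print(f"get_pipeline_status_lock: returning lock for '{namespace}'")
    return _pipeline_locks[namespace]
```
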
Daniel.y
d2cd1c0722 Merge pull request #2421 from EightyOliveira/fix_catch_order
fix: exception handling order error
2025-11-25 17:52:56 +08:00
yangdx
777c91794b Add Langfuse observability configuration to env.example
- Add Langfuse environment variables
- Include setup instructions
- Support OpenAI compatible APIs
- Enable tracing configuration
- Add cloud/self-host options
2025-11-25 17:16:55 +08:00
EightyOliveira
8994c70f2f fix: exception handling order error 2025-11-25 16:36:41 +08:00
Daniel.y
2539b4e2c8 Merge pull request #2418 from danielaskdd/start-without-webui
Refactor: Allow API Server to Start Without Built WebUI Assets
2025-11-25 03:02:15 +08:00
yangdx
48b67d3077 Handle missing WebUI assets gracefully without blocking server startup
- Change build check from error to warning
- Redirect to /docs when WebUI unavailable
- Add webui_available to health endpoint
- Only mount /webui if assets exist
- Return status tuple from build check
2025-11-25 02:51:55 +08:00
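
A hedged FastAPI sketch of the startup behaviour this commit describes; paths, route names, and the health payload are illustrative rather than lightrag's actual server code.

```python
from pathlib import Path
from fastapi import FastAPI
from fastapi.responses import RedirectResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI()
webui_dir = Path(__file__).parent / "webui"
webui_available = (webui_dir / "index.html").is_file()

# Only mount /webui when the built assets actually exist.
if webui_available:
    app.mount("/webui", StaticFiles(directory=str(webui_dir), html=True), name="webui")

@app.get("/")
async def root():
    # Fall back to the API docs when the WebUI assets were never built.
    return RedirectResponse("/webui" if webui_available else "/docs")

@app.get("/health")
async def health():
    return {"status": "healthy", "webui_available": webui_available}
```
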
Daniel.y
2832a2ca7e Merge pull request #2417 from danielaskdd/neo4j-retry
Fix: Add Comprehensive Retry Mechanism for Neo4j Storage Operations
2025-11-25 02:03:48 +08:00
yangdx
5f91063c7a Add ruff as dependency to pytest and evaluation extras 2025-11-25 02:03:28 +08:00
yangdx
8c4d7a00ad Refactor: Extract retry decorator to reduce code duplication in Neo4J storage
• Define READ_RETRY_EXCEPTIONS constant
• Create reusable READ_RETRY decorator
• Replace 11 duplicate retry decorators
• Improve code maintainability
• Add missing retry to edge_degrees_batch
2025-11-25 01:35:21 +08:00
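
A hedged sketch of the extracted decorator pattern using tenacity; the exception tuple, attempt count, and backoff below are assumptions, and the query is only an example of a read operation the decorator would wrap.

```python
from neo4j.exceptions import ServiceUnavailable, SessionExpired, TransientError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

READ_RETRY_EXCEPTIONS = (ServiceUnavailable, SessionExpired, TransientError)

# Defined once and applied to every read method, instead of repeating the
# same retry decorator eleven times.
READ_RETRY = retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=8),
    retry=retry_if_exception_type(READ_RETRY_EXCEPTIONS),
    reraise=True,
)

@READ_RETRY
async def get_node(session, entity_id: str):
    result = await session.run("MATCH (n {entity_id: $id}) RETURN n", id=entity_id)
    return await result.single()
```
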
Daniel.y
5b81ef000e Merge pull request #2410 from netbrah/create-copilot-setup-steps
feat: create copilot-setup-steps.yml
2025-11-24 22:36:33 +08:00
yangdx
7aaa51cda9 Add retry decorators to Neo4j read operations for resilience 2025-11-24 22:28:15 +08:00
palanisd
293ddbc326 Update test_neo4j_fulltext_index.py 2025-11-24 09:21:37 -05:00
palanisd
dd18eb5b9c Merge pull request #3 from netbrah/copilot/fix-overlap-tokens-validation
Fix infinite loop in chunk_documents_for_rerank when overlap_tokens >= max_tokens
2025-11-24 09:11:24 -05:00
copilot-swe-agent[bot]
8835fc244a Improve edge case handling for max_tokens=1
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-24 03:43:05 +00:00
copilot-swe-agent[bot]
1d6ea0c5f7 Fix chunking infinite loop when overlap_tokens >= max_tokens
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-24 03:40:58 +00:00
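
A minimal sketch of the guard this fix implies: clamp the overlap so each window advances by at least one token, which removes the infinite loop when overlap_tokens >= max_tokens (and also covers the max_tokens=1 edge case addressed in the adjacent commit). chunk_documents_for_rerank itself may be structured differently.

```python
def chunk_tokens(tokens: list[int], max_tokens: int, overlap_tokens: int) -> list[list[int]]:
    # Clamp so the stride is always >= 1 and the loop always terminates.
    max_tokens = max(1, max_tokens)
    overlap_tokens = min(overlap_tokens, max_tokens - 1)
    step = max_tokens - overlap_tokens
    return [tokens[i : i + max_tokens] for i in range(0, len(tokens), step)]
```
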
copilot-swe-agent[bot]
e136da968b Initial plan 2025-11-24 03:33:26 +00:00
palanisd
c233da6318 Update copilot-setup-steps.yml 2025-11-23 17:42:04 -05:00
BukeLy
3b8a1e64b7 style: apply ruff formatting fixes to test files
Apply ruff-format fixes to 6 test files to pass pre-commit checks:
- test_dimension_mismatch.py
- test_e2e_multi_instance.py
- test_no_model_suffix_safety.py
- test_postgres_migration.py
- test_unified_lock_safety.py
- test_workspace_migration_isolation.py

Changes are primarily assert statement reformatting to match ruff style guide.
2025-11-23 16:59:02 +08:00
BukeLy
510baebf62 fix: correct PostgreSQL execute() parameter format in workspace cleanup
Critical Bug Fix:
PostgreSQLDB.execute() expects data as dict, but workspace cleanup
was passing a list [workspace], causing cleanup to fail with
"PostgreSQLDB.execute() expects data as dict, got list" error.

Changes:
1. Fixed postgres_impl.py:2522
   - Changed: await db.execute(delete_query, [workspace])
   - To: await db.execute(delete_query, {"workspace": workspace})

2. Improved test_postgres_migration.py mock
   - Enhanced COUNT(*) mock to properly distinguish between:
     * Legacy table with workspace filter (returns 50)
     * Legacy table without filter after deletion (returns 0)
     * New table verification (returns 50)
   - Uses storage.legacy_table_name dynamically instead of hardcoded strings
   - Detects table type by checking for model suffix patterns

3. Fixed test_unified_lock_safety.py formatting
   - Applied ruff formatting to assert statement

Impact:
- Workspace-aware legacy cleanup now works correctly
- Legacy tables properly deleted when all workspace data migrated
- Legacy tables preserved when other workspace data remains

Tests: All 25 unit tests pass
2025-11-23 16:55:48 +08:00
BukeLy
e2d68adff9 style: apply ruff formatting to test files 2025-11-23 16:45:50 +08:00
BukeLy
16fff353d9 fix: prevent data loss in PostgreSQL migration and add doc_status table creation
This commit fixes two critical issues in PostgreSQL storage:

BUG 1: Legacy table cleanup causing data loss across workspaces
---------------------------------------------------------------
PROBLEM:
- After migrating workspace_a data from legacy table, the ENTIRE legacy
  table was deleted
- This caused workspace_b's data (still in legacy table) to be lost
- Multi-tenant data isolation was violated

FIX:
- Implement workspace-aware cleanup: only delete migrated workspace's data
- Check if other workspaces still have data before dropping table
- Only drop legacy table when it becomes completely empty
- If other workspace data exists, preserve legacy table with remaining records

Location: postgres_impl.py PGVectorStorage.setup_table() lines 2510-2567

Test verification:
- test_workspace_migration_isolation_e2e_postgres validates this fix

BUG 2: PGDocStatusStorage missing table initialization
-------------------------------------------------------
PROBLEM:
- PGDocStatusStorage.initialize() only set workspace, never created table
- Caused "relation 'lightrag_doc_status' does not exist" errors
- document insertion (ainsert) failed immediately

FIX:
- Add table creation to initialize() method using _pg_create_table()
- Consistent with other storage implementations:
  * MongoDocStatusStorage creates collections
  * JsonDocStatusStorage creates directories
  * PGDocStatusStorage now creates tables ✓

Location: postgres_impl.py PGDocStatusStorage.initialize() lines 2965-2971

Test Results:
- Unit tests: 13/13 passed (test_unified_lock_safety,
  test_workspace_migration_isolation, test_dimension_mismatch)
- E2E tests require PostgreSQL server

Related: PR #2391 (Vector Storage Model Isolation)
2025-11-23 16:43:49 +08:00
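
A hedged asyncpg sketch of the workspace-aware cleanup in BUG 1's fix: delete only the migrated workspace's rows and drop the legacy table only once nothing is left in it. Table names are placeholders.

```python
import asyncpg

async def cleanup_legacy_after_migration(conn: asyncpg.Connection,
                                         legacy_table: str, workspace: str) -> None:
    # Remove only the rows that were just migrated for this workspace.
    await conn.execute(f"DELETE FROM {legacy_table} WHERE workspace = $1", workspace)
    remaining = await conn.fetchval(f"SELECT COUNT(*) FROM {legacy_table}")
    if remaining == 0:
        # No other workspace still depends on the legacy table.
        await conn.execute(f"DROP TABLE {legacy_table}")
```
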
BukeLy
204a2535c8 fix: prevent double-release in UnifiedLock.__aexit__ error recovery
Problem:
When UnifiedLock.__aexit__ encountered an exception during async_lock.release(),
the error recovery logic would incorrectly attempt to release async_lock again
because it only checked main_lock_released flag. This could cause:
- Double-release attempts on already-failed locks
- Masking of original exceptions
- Undefined behavior in lock state

Root Cause:
The recovery logic used only main_lock_released to determine whether to attempt
async_lock release, without tracking whether async_lock.release() had already
been attempted and failed.

Fix:
- Added async_lock_released flag to track async_lock release attempts
- Updated recovery logic condition to check both main_lock_released AND
  async_lock_released before attempting async_lock release
- This ensures async_lock.release() is only called once, even if it fails

Testing:
- Added test_aexit_no_double_release_on_async_lock_failure:
  Verifies async_lock.release() is called only once when it fails
- Added test_aexit_recovery_on_main_lock_failure:
  Verifies recovery logic still works when main lock fails
- All 5 UnifiedLock safety tests pass

Impact:
- Eliminates double-release bugs in multiprocess lock scenarios
- Preserves correct error propagation
- Maintains recovery logic for legitimate failure cases

Files Modified:
- lightrag/kg/shared_storage.py: Added async_lock_released tracking
- tests/test_unified_lock_safety.py: Added 2 new tests (5 total now pass)
2025-11-23 16:34:08 +08:00
BukeLy
49bbb3a4d7 test: add E2E test for workspace migration isolation
Why this change is needed:
Add end-to-end test to verify the P0 bug fix for cross-workspace data
leakage during PostgreSQL migration. Unit tests use mocks and cannot verify
that real SQL queries correctly filter by workspace in actual database.

What this test does:
- Creates legacy table with MIXED data (workspace_a + workspace_b)
- Initializes LightRAG for workspace_a only
- Verifies ONLY workspace_a data migrated to new table
- Verifies workspace_b data NOT leaked to new table (0 records)
- Verifies workspace_b data preserved in legacy table (3 records)
- Verifies workspace_a data cleaned from legacy after migration (0 records)

Impact:
- tests/test_e2e_multi_instance.py: Add test_workspace_migration_isolation_e2e_postgres
- Validates multi-tenant isolation in real PostgreSQL environment
- Prevents regression of critical security fix

Testing:
E2E test passes with real PostgreSQL container, confirming workspace
filtering works correctly with actual SQL execution.
2025-11-23 16:27:05 +08:00
BukeLy
cfc6587e04 fix: prevent race conditions and cross-workspace data leakage in migration
Why this change is needed:
Two critical P0 security vulnerabilities were identified in CursorReview:
1. UnifiedLock silently allows unprotected execution when lock is None, creating
   false security and potential race conditions in multi-process scenarios
2. PostgreSQL migration copies ALL workspace data during legacy table migration,
   violating multi-tenant isolation and causing data leakage

How it solves the problem:
- UnifiedLock now raises RuntimeError when lock is None instead of WARNING
- Added workspace parameter to setup_table() for proper data isolation
- Migration queries now filter by workspace in both COUNT and SELECT operations
- Added clear error messages to help developers diagnose initialization issues

Impact:
- lightrag/kg/shared_storage.py: UnifiedLock raises exception on None lock
- lightrag/kg/postgres_impl.py: Added workspace filtering to migration logic
- tests/test_unified_lock_safety.py: 3 tests for lock safety
- tests/test_workspace_migration_isolation.py: 3 tests for workspace isolation
- tests/test_dimension_mismatch.py: Updated table names and mocks
- tests/test_postgres_migration.py: Updated mocks for workspace filtering

Testing:
- All 31 tests pass (16 migration + 4 safety + 3 lock + 3 workspace + 5 dimension)
- Backward compatible: existing code continues working unchanged
- Code style verified with ruff and pre-commit hooks
2025-11-23 16:09:59 +08:00
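
A hedged sketch of the stricter UnifiedLock behaviour described above: acquiring a lock that was never initialized now fails loudly instead of silently running unprotected. The class body and error text are illustrative, not lightrag's actual implementation.

```python
class UnifiedLockSketch:
    def __init__(self, lock, name: str = "storage"):
        self._lock = lock
        self._name = name

    async def __aenter__(self):
        if self._lock is None:
            # Fail loudly instead of proceeding without any protection.
            raise RuntimeError(
                f"Lock '{self._name}' is not initialized; "
                "initialize shared storage before acquiring it."
            )
        await self._lock.acquire()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self._lock.release()
```
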
BukeLy
f69cf9bcd6 fix: prevent vector dimension mismatch crashes and data loss on no-suffix restarts
Why this change is needed:
Two critical issues were identified in Codex review of PR #2391:
1. Migration fails when legacy collections/tables use different embedding dimensions
   (e.g., upgrading from 1536d to 3072d models causes initialization failures)
2. When model_suffix is empty (no model_name provided), table_name equals legacy_table_name,
   causing Case 1 logic to delete the only table/collection on second startup

How it solves the problem:
- Added dimension compatibility checks before migration in both Qdrant and PostgreSQL
- PostgreSQL uses two-method detection: pg_attribute metadata query + vector sampling fallback
- When dimensions mismatch, skip migration and create new empty table/collection, preserving legacy data
- Added safety check to detect when new and legacy names are identical, preventing deletion
- Both backends log clear warnings about dimension mismatches and skipped migrations

Impact:
- lightrag/kg/qdrant_impl.py: Added dimension check (lines 254-297) and no-suffix safety (lines 163-169)
- lightrag/kg/postgres_impl.py: Added dimension check with fallback (lines 2347-2410) and no-suffix safety (lines 2281-2287)
- tests/test_no_model_suffix_safety.py: New test file with 4 test cases covering edge scenarios
- Backward compatible: All existing scenarios continue working unchanged

Testing:
- All 20 tests pass (16 existing migration tests + 4 new safety tests)
- E2E tests enhanced with explicit verification points for dimension mismatch scenarios
- Verified graceful degradation when dimension detection fails
- Code style verified with ruff and pre-commit hooks
2025-11-23 15:44:07 +08:00
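
A hedged sketch of the sampling fallback for the dimension check described above, using pgvector's vector_dims(); the real code first inspects pg_attribute metadata and also covers Qdrant, and the table/column names here are placeholders.

```python
import asyncpg

async def legacy_dimension_matches(conn: asyncpg.Connection, legacy_table: str,
                                   expected_dim: int) -> bool:
    sampled = await conn.fetchval(
        f"SELECT vector_dims(content_vector) FROM {legacy_table} "
        f"WHERE content_vector IS NOT NULL LIMIT 1"
    )
    if sampled is None:
        # Empty legacy table: nothing to migrate, treat as compatible.
        return True
    # On mismatch the caller skips migration and creates a new empty table,
    # preserving the legacy data, as the commit describes.
    return sampled == expected_dim
```
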
netbrah
a05bbf105e Add Cohere reranker config, chunking, and tests 2025-11-22 16:43:13 -05:00
palanisd
1b0413ee74 Create copilot-setup-steps.yml 2025-11-22 15:29:05 -05:00
palanisd
e9f2b13b26 Fix Neo4j fulltext index name mismatch between creation and query 2025-11-22 15:23:32 -05:00
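
The fix above is about using one index name consistently. A hedged illustration with the Neo4j Python driver: the name passed to CREATE FULLTEXT INDEX must be the exact name later given to db.index.fulltext.queryNodes(). Labels, properties, credentials, and the index name below are placeholders.

```python
from neo4j import GraphDatabase

INDEX_NAME = "entity_fulltext_index"

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Create the index under one well-known name...
    session.run(
        f"CREATE FULLTEXT INDEX {INDEX_NAME} IF NOT EXISTS "
        "FOR (n:Entity) ON EACH [n.entity_id, n.description]"
    )
    # ...and query it with exactly the same name, avoiding the mismatch.
    records = session.run(
        "CALL db.index.fulltext.queryNodes($index_name, $q) YIELD node, score "
        "RETURN node.entity_id AS id, score",
        index_name=INDEX_NAME,
        q="alice",
    )
```
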
copilot-swe-agent[bot]
8cb04cea5c Optimize index search loop with early break
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-22 20:20:59 +00:00
copilot-swe-agent[bot]
4a75c60cf4 Fix Neo4j fulltext index name mismatch and add tests
Co-authored-by: netbrah <162479981+netbrah@users.noreply.github.com>
2025-11-22 20:16:47 +00:00
palanisd
10780f4a69 Merge branch 'HKUDS:main' into update-full-text-index-for-workspace 2025-11-22 15:16:30 -05:00
copilot-swe-agent[bot]
eb1f5aeea2 Initial plan 2025-11-22 20:11:38 +00:00