Documentation Embeddings Generation System
Overview
The documentation embeddings generation system processes various documentation sources and uploads their metadata to a database for semantic search functionality. The system is located in apps/docs/scripts/search/ and works by:
- Discovering content sources from multiple types of documentation
- Processing content into structured sections with checksums
- Generating embeddings using OpenAI's text-embedding-ada-002 model
- Storing the sections and their vector embeddings in a database for semantic search
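As a rough illustration of the embedding step, the sketch below builds a request for OpenAI's embeddings endpoint with the text-embedding-ada-002 model and reads the vector out of the response. The helper names (buildEmbeddingRequest, parseEmbeddingResponse) and the newline-flattening step are assumptions made for illustration; the actual script may well use the OpenAI SDK instead.

```typescript
// Hypothetical helpers, not taken from generate-embeddings.ts.
const EMBEDDING_MODEL = "text-embedding-ada-002";
const EMBEDDING_DIMENSIONS = 1536; // vector length returned by text-embedding-ada-002

interface EmbeddingRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build the HTTP request for one piece of text.
function buildEmbeddingRequest(text: string, apiKey: string): EmbeddingRequest {
  return {
    url: "https://api.openai.com/v1/embeddings",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    // Assumption: newlines are flattened to spaces before embedding.
    body: JSON.stringify({
      model: EMBEDDING_MODEL,
      input: text.replace(/\n/g, " "),
    }),
  };
}

interface EmbeddingResponse {
  data: { embedding: number[] }[];
}

// Pull the vector out of the JSON response and sanity-check its length.
function parseEmbeddingResponse(json: EmbeddingResponse): number[] {
  const embedding = json.data[0]?.embedding;
  if (!embedding || embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error("Unexpected embedding response shape");
  }
  return embedding;
}
```

Keeping request construction and response parsing separate from the network call makes both halves easy to test without an API key.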
Architecture
Main Entry Point
- generate-embeddings.ts - Main script that orchestrates the entire process
  - Supports --refresh flag to force regeneration of all content
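A minimal sketch of how a --refresh flag could interact with checksum-based change detection. Only the flag name comes from this document; parseCliOptions and shouldRegenerate are hypothetical names, and the real script may parse its arguments differently.

```typescript
// Hypothetical argument handling; only the --refresh flag is documented.
interface CliOptions {
  refresh: boolean;
}

function parseCliOptions(argv: string[]): CliOptions {
  return { refresh: argv.includes("--refresh") };
}

// With --refresh every page is regenerated; otherwise only pages whose
// checksum differs from the stored one (or that have no stored checksum).
function shouldRegenerate(
  options: CliOptions,
  storedChecksum: string | undefined,
  currentChecksum: string
): boolean {
  return options.refresh || storedChecksum !== currentChecksum;
}
```

In a Node script the options would typically come from parseCliOptions(process.argv.slice(2)).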
Content Sources (sources/ directory)
Base Classes
- BaseLoader - Abstract class for loading content from different sources
- BaseSource - Abstract class for processing and formatting content
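The split between loading and processing might look something like the sketch below. The class names BaseLoader and BaseSource come from this document; every method, property, and the toy InlineLoader/InlineSource pair are assumptions for illustration and may not match the real classes.

```typescript
// Hypothetical shapes; member names are illustrative only.
interface Section {
  heading?: string;
  content: string;
}

abstract class BaseSource {
  checksum?: string;
  sections?: Section[];

  constructor(public source: string, public path: string) {}

  // Turn raw content into a checksum and a list of sections.
  abstract process(): { checksum: string; sections: Section[] };
}

abstract class BaseLoader {
  constructor(public source: string, public path: string) {}

  // Read or fetch raw content and wrap it in one or more sources.
  abstract load(): Promise<BaseSource[]>;
}

// Toy in-memory implementation showing how the two fit together.
class InlineSource extends BaseSource {
  constructor(path: string, private raw: string) {
    super("inline", path);
  }

  process() {
    const checksum = String(this.raw.length); // placeholder, not a real hash
    const sections: Section[] = [{ content: this.raw }];
    this.checksum = checksum;
    this.sections = sections;
    return { checksum, sections };
  }
}

class InlineLoader extends BaseLoader {
  constructor(path: string, private raw: string) {
    super("inline", path);
  }

  async load(): Promise<BaseSource[]> {
    return [new InlineSource(this.path, this.raw)];
  }
}
```

With this shape, the orchestrating script only needs to know about load() and process(), so a new source type can be added without touching the main loop.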
Source Types
- Markdown Sources (markdown.ts)
  - Processes .mdx files from guides and documentation
  - Extracts frontmatter metadata and content sections
- Reference Documentation (reference-doc.ts)
  - OpenAPI References - Management API documentation from OpenAPI specs
  - Client Library References - JavaScript, Dart, Python, C#, Swift, Kotlin SDKs
  - CLI References - Command-line interface documentation
  - Processes YAML/JSON specs and matches with common sections
- GitHub Discussions (github-discussion.ts)
  - Fetches troubleshooting discussions from GitHub using GraphQL API
  - Uses GitHub App authentication for access
- Partner Integrations (partner-integrations.ts)
  - Fetches approved partner integration documentation from Supabase database
  - Technology integrations only (excludes agencies)
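To make the Markdown Sources step more concrete, here is a deliberately simplified sketch of extracting frontmatter and splitting a document into sections by heading. It is a stand-in, not the logic in markdown.ts: processMarkdown is a hypothetical name, it only handles flat key: value frontmatter, and it does not understand MDX syntax or headings inside code fences.

```typescript
interface MarkdownSection {
  heading?: string;
  content: string;
}

interface ProcessedMarkdown {
  meta: Record<string, string>;
  sections: MarkdownSection[];
}

// Simplified stand-in: flat `key: value` frontmatter and `#` headings only.
function processMarkdown(raw: string): ProcessedMarkdown {
  const meta: Record<string, string> = {};
  let body = raw;

  // Frontmatter is the block between the leading pair of `---` lines.
  const match = raw.match(/^---\n([\s\S]*?)\n---\n?/);
  if (match) {
    for (const line of match[1].split("\n")) {
      const idx = line.indexOf(":");
      if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
    body = raw.slice(match[0].length);
  }

  // Start a new section at every heading; text before the first heading
  // becomes a section without a heading.
  const sections: MarkdownSection[] = [];
  let current: MarkdownSection = { content: "" };
  const flush = () => {
    if (current.heading !== undefined || current.content.trim()) sections.push(current);
  };
  for (const line of body.split("\n")) {
    const heading = line.match(/^#{1,6}\s+(.*)$/);
    if (heading) {
      flush();
      current = { heading: heading[1].trim(), content: "" };
    } else {
      current.content += line + "\n";
    }
  }
  flush();

  return {
    meta,
    sections: sections.map((s) => ({ ...s, content: s.content.trim() })),
  };
}
```

A real implementation would more likely parse the file into a syntax tree, which avoids the edge cases this line-based version ignores.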
Processing Flow
- Content Discovery: Each source loader discovers and loads content files/data
- Content Processing: Each source processes content into:
  - Checksum for change detection
  - Metadata (title, subtitle, etc.)
  - Sections with headings and content
- Change Detection: Compares checksums against existing database records
- Embedding Generation: Uses OpenAI to generate embeddings for new/changed content
- Database Storage: Stores in the page and page_section tables with embeddings
- Cleanup: Removes outdated pages using version tracking
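The flow above can be sketched end to end with an in-memory Map standing in for the database. Everything here is an assumption made for illustration: the SHA-256 checksum, the syncPages name, the injected embed function, and the idea that each run stamps pages with a version string so unstamped pages can be removed afterwards.

```typescript
import { createHash } from "node:crypto";

interface SourceDoc {
  path: string;
  content: string;
}

interface StoredPage {
  checksum: string;
  version: string;
  embedding: number[];
}

// Injected so the sketch does not depend on a real embeddings API.
type Embed = (text: string) => Promise<number[]>;

function checksumOf(content: string): string {
  return createHash("sha256").update(content).digest("base64");
}

// One sync pass; returns the paths that were (re)embedded.
async function syncPages(
  docs: SourceDoc[],
  store: Map<string, StoredPage>,
  embed: Embed,
  version: string,
  refresh = false
): Promise<string[]> {
  const regenerated: string[] = [];

  for (const doc of docs) {
    const checksum = checksumOf(doc.content);
    const existing = store.get(doc.path);

    if (!refresh && existing && existing.checksum === checksum) {
      // Unchanged: keep the embedding, just mark the page as seen in this run.
      existing.version = version;
      continue;
    }

    const embedding = await embed(doc.content);
    store.set(doc.path, { checksum, version, embedding });
    regenerated.push(doc.path);
  }

  // Cleanup: pages not stamped with this run's version no longer exist upstream.
  store.forEach((page, path) => {
    if (page.version !== version) store.delete(path);
  });

  return regenerated;
}
```

Stamping unchanged pages with the current version is what lets the cleanup step tell "still present but unchanged" apart from "deleted at the source" without re-embedding anything.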
Database Schema
- page table: Stores page metadata, content, checksum, version
- page_section table: Stores individual sections with embeddings, token counts
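For orientation, the two tables could be modelled along these lines, together with a small cosine-similarity ranking to show what the stored embeddings are for. The field names are guesses based on the description above, not the actual column names, and in practice the ranking would normally run inside the database rather than in application code.

```typescript
// Hypothetical row shapes; field names are assumptions, not the real columns.
interface PageRow {
  id: number;
  path: string;
  meta: Record<string, unknown>;
  content: string;
  checksum: string;
  version: string;
}

interface PageSectionRow {
  id: number;
  pageId: number; // references PageRow.id
  heading?: string;
  content: string;
  tokenCount: number;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Rank sections by similarity to a query embedding, best match first.
function rankSections(
  query: number[],
  sections: PageSectionRow[],
  limit = 5
): PageSectionRow[] {
  return [...sections]
    .sort(
      (x, y) =>
        cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding)
    )
    .slice(0, limit);
}
```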