Unstract
The Data Layer for your Agentic Workflows—Automate Document-based workflows with close to 100% accuracy!
🤖 Prompt Studio
Prompt Studio is a purpose-built environment that supercharges your schema definition efforts. Compare outputs from different LLMs side by side and keep tabs on costs while you develop generic prompts that work across wide-ranging document variations. When you're ready, launch extraction APIs with a single click.
🔌 Integrations that suit your environment
Once you've used Prompt Studio to define your schema, Unstract makes it easy to integrate into your existing workflows. Simply choose the integration type that best fits your environment:
| Integration Type | Description | Best For | Documentation |
|---|---|---|---|
| 🖥️ MCP Servers | Run Unstract as an MCP Server to provide structured data extraction to Agents or LLMs in your ecosystem. | Developers building Agentic/LLM apps/tools that speak MCP. | Unstract MCP Server Docs |
| 🌐 API Deployments | Turn any document into JSON with an API call. Deploy any Prompt Studio project as a REST API endpoint with a single click. | Teams needing programmatic access in apps, services, or custom tooling. | API Deployment Docs |
| ⚙️ ETL Pipelines | Embed Unstract directly into your ETL jobs to transform unstructured data before loading it into your warehouse / database. | Engineering and Data engineering teams that need to batch process documents into clean JSON. | ETL Pipelines Docs |
| 🧩 n8n Nodes | Use Unstract as ready-made nodes in n8n workflows for drag-and-drop automation. | Low-code users and ops teams automating workflows. | Unstract n8n Nodes Docs |
☁️ Getting Started (Cloud / Enterprise)
The easy-peasy way to try Unstract is to sign up for a 14-day free trial. Give Unstract a spin now!
Unstract Cloud also comes with some really awesome features that give serious accuracy boosts to agentic/LLM-powered document-centric workflows in the enterprise.
| Feature | Description | Documentation |
|---|---|---|
| 🧪 LLMChallenge | Uses two Large Language Models to ensure trustworthy output. You either get the right response or no response at all. | Docs |
| ⚡ SinglePass Extraction | Reduces LLM token usage by up to 8x, dramatically cutting costs. | Docs |
| 📉 SummarizedExtraction | Reduces LLM token usage by up to 6x, saving costs while keeping accuracy. | Docs |
| 👀 Human-In-The-Loop | Side-by-side comparison of extracted value and source document, with highlighting for human review and tweaking. | Docs |
| 🔐 SSO Support | Enterprise-ready authentication options for seamless onboarding and off-boarding. | Docs |
⏩ Quick Start Guide
Unstract comes well documented. You can get introduced to the basics of Unstract and learn how to connect systems such as LLMs, vector databases, embedding models and text extractors to it. The easiest way to get your feet wet is to go through our Quick Start Guide, where you do some prompt engineering in Prompt Studio and launch an API to structure varied credit card statements!
🚀 Getting started (self-hosted)
System Requirements
- 8GB RAM (minimum)
Prerequisites
- Linux or macOS (Intel or M-series)
- Docker
- Docker Compose (if you need to install it separately)
- Git
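Before running the platform, it can help to confirm that the command-line prerequisites above are installed. The following is a minimal sketch, not part of Unstract: the helper name `missing_tools` is hypothetical, and it only checks that the `docker` and `git` executables are on `PATH`. It does not verify Docker Compose (which may be installed as a Docker plugin rather than a separate executable) or the amount of RAM.

```python
import shutil
from typing import Callable, Iterable, Optional

# Tool names taken from the prerequisites list; the check itself is illustrative.
REQUIRED_TOOLS = ("docker", "git")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("Docker and Git were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.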
Next, either download a release or clone this repo and do the following:
✅ ./run-platform.sh
✅ Now visit http://frontend.unstract.localhost in your browser
✅ Use unstract as both the username and the password to log in
That's all there is to it!
Follow these steps to change the default username and password. See user guide for more details on managing the platform.
Another really quick way to experience Unstract is by signing up for our hosted version. It comes with a 14-day free trial!
📄 Supported File Types
Unstract supports a wide range of file formats for document processing:
| Category | Format | Description |
|---|---|---|
| Word Processing | DOCX | Microsoft Word Open XML |
| Word Processing | DOC | Microsoft Word |
| Word Processing | ODT | OpenDocument Text |
| Presentation | PPTX | Microsoft PowerPoint Open XML |
| Presentation | PPT | Microsoft PowerPoint |
| Presentation | ODP | OpenDocument Presentation |
| Spreadsheet | XLSX | Microsoft Excel Open XML |
| Spreadsheet | XLS | Microsoft Excel |
| Spreadsheet | ODS | OpenDocument Spreadsheet |
| Document & Text | PDF | Portable Document Format |
| Document & Text | TXT | Plain Text |
| Document & Text | CSV | Comma-Separated Values |
| Document & Text | JSON | JavaScript Object Notation |
| Image | BMP | Bitmap Image |
| Image | GIF | Graphics Interchange Format |
| Image | JPEG | Joint Photographic Experts Group |
| Image | JPG | Joint Photographic Experts Group |
| Image | PNG | Portable Network Graphics |
| Image | TIF | Tagged Image File Format |
| Image | TIFF | Tagged Image File Format |
| Image | WEBP | Web Picture Format |
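If you want to filter files before sending them to Unstract, the formats in the table can be captured in a small lookup. This is a sketch under stated assumptions: the helper name `is_supported` is hypothetical, it looks only at the file name's final extension and not at the file's contents, and Unstract's own validation may behave differently.

```python
from pathlib import Path

# Extensions copied from the table above; the helper is illustrative.
SUPPORTED_EXTENSIONS = frozenset({
    "docx", "doc", "odt",
    "pptx", "ppt", "odp",
    "xlsx", "xls", "ods",
    "pdf", "txt", "csv", "json",
    "bmp", "gif", "jpeg", "jpg", "png", "tif", "tiff", "webp",
})


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension appears in the table above."""
    extension = Path(filename).suffix.lstrip(".").lower()
    return extension in SUPPORTED_EXTENSIONS
```

For example, `is_supported("statement.PDF")` is true because the comparison is case-insensitive, while `is_supported("archive.tar.gz")` is false because only the last extension is considered.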
🤝 Ecosystem support
LLM Providers
Vector Databases
| Provider | Status |
|---|---|
| Qdrant | ✅ Working |
| Weaviate | ✅ Working |
| Pinecone | ✅ Working |
| PostgreSQL | ✅ Working |
| Milvus | ✅ Working |
Embeddings
| Provider | Status |
|---|---|
| OpenAI | ✅ Working |
| Azure OpenAI | ✅ Working |
| Google PaLM | ✅ Working |
| Ollama | ✅ Working |
| VertexAI | ✅ Working |
| Bedrock | ✅ Working |
Text Extractors
| Provider | Status |
|---|---|
| Unstract LLMWhisperer V2 | ✅ Working |
| Unstructured.io Community | ✅ Working |
| Unstructured.io Enterprise | ✅ Working |
| LlamaIndex Parse | ✅ Working |
ETL Sources
| Provider | Status |
|---|---|
| AWS S3 | ✅ Working |
| MinIO | ✅ Working |
| Google Cloud Storage | ✅ Working |
| Azure Cloud Storage | ✅ Working |
| Google Drive | ✅ Working |
| Dropbox | ✅ Working |
| SFTP | ✅ Working |
ETL Destinations
| Provider | Status |
|---|---|
| Snowflake | ✅ Working |
| Amazon Redshift | ✅ Working |
| Google BigQuery | ✅ Working |
| PostgreSQL | ✅ Working |
| MySQL | ✅ Working |
| MariaDB | ✅ Working |
| Microsoft SQL Server | ✅ Working |
| Oracle | ✅ Working |
🙌 Contributing
Contributions are welcome! Please see CONTRIBUTING.md for further details to get started easily.
👋 Join the LLM-powered automation community
- On Slack, join great conversations around LLMs, their ecosystem and leveraging them to automate the previously unautomatable!
- Follow us on X/Twitter
- Follow us on LinkedIn
🚨 Backup encryption key
Copy the value of the ENCRYPTION_KEY setting from either the backend/.env or the platform-service/.env file to a secure location.
Adapter credentials are encrypted by the platform using this key. Losing or changing it will make all existing adapters inaccessible!
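One way to take this backup is to read the key out of the .env file and write it to a separate file that only your user can read. The sketch below is illustrative and not part of Unstract: the function names and the backup location are hypothetical, and it assumes the file uses plain `KEY=VALUE` lines. Move the resulting file to a secure location such as a password manager or secrets vault.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: Path, key: str = "ENCRYPTION_KEY") -> Optional[str]:
    """Return the value assigned to `key` in a KEY=VALUE style .env file."""
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so "=" characters inside the value survive.
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes.
            return value.strip().strip("'\"")
    return None


def backup_encryption_key(env_path: Path, backup_path: Path) -> bool:
    """Write the key to `backup_path` (owner read/write only); False if absent."""
    value = read_env_value(env_path)
    if value is None:
        return False
    backup_path.write_text(f"ENCRYPTION_KEY={value}\n", encoding="utf-8")
    backup_path.chmod(0o600)
    return True
```

A call such as `backup_encryption_key(Path("backend/.env"), Path("unstract-key.bak"))` would be typical usage; the backup file name here is only an example.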
📊 A note on analytics
In full disclosure, Unstract integrates PostHog to track usage analytics. As you can see by inspecting the relevant code here, we collect the minimum possible metrics. PostHog can be disabled by setting REACT_APP_ENABLE_POSTHOG to false in the frontend's .env file.
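Editing the file by hand is the simplest route. If you script your setup, the change can be sketched as follows. This is a hedged example, not project tooling: `set_env_value` is a hypothetical helper, it assumes plain `KEY=VALUE` lines, and the path `frontend/.env` in the usage note is an assumption about where the frontend's .env file lives.

```python
from pathlib import Path


def set_env_value(env_path: Path, key: str, value: str) -> None:
    """Set `key` to `value` in a KEY=VALUE .env file, appending it if absent."""
    lines = (
        env_path.read_text(encoding="utf-8").splitlines()
        if env_path.exists()
        else []
    )
    updated = False
    for index, line in enumerate(lines):
        # Compare the text before the first "=" with the requested key.
        if line.partition("=")[0].strip() == key:
            lines[index] = f"{key}={value}"
            updated = True
    if not updated:
        lines.append(f"{key}={value}")
    env_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, `set_env_value(Path("frontend/.env"), "REACT_APP_ENABLE_POSTHOG", "false")` rewrites an existing entry or appends a new one. The frontend will probably need to be restarted or rebuilt before the change takes effect.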
