Files
unstract/backend/sample.env
ali 0c5997f9a9 UN-2470 [FEAT] Remove Django dependency from Celery workers with internal APIs (#1494)
* UN-2470 [MISC] Remove Django dependency from Celery workers

This commit introduces a new worker architecture that decouples
Celery workers from Django where possible, enabling support for
gevent/eventlet pool types and reducing worker startup overhead.

Key changes:
- Created separate worker modules (api-deployment, callback, file_processing, general)
- Added internal API endpoints for worker communication
- Implemented Django-free task execution where appropriate
- Added shared utilities and client facades
- Updated container configurations for new worker architecture
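
The sketch below illustrates the shape of such a Django-free worker: a Celery app whose tasks read state over an authenticated internal HTTP API instead of importing Django models. Module, task, and endpoint names are illustrative assumptions, not the PR's actual layout; only `CELERY_BROKER_BASE_URL` and `INTERNAL_SERVICE_API_KEY` appear in the sample.env below.

```python
# Hedged sketch of a Django-free Celery worker (names are illustrative).
import os

import requests
from celery import Celery

app = Celery("file_processing", broker=os.environ["CELERY_BROKER_BASE_URL"])

INTERNAL_API_BASE = os.environ.get(
    "INTERNAL_API_BASE_URL", "http://unstract-backend:8000/internal"
)  # hypothetical env var and URL
INTERNAL_API_KEY = os.environ["INTERNAL_SERVICE_API_KEY"]


@app.task(name="process_file")
def process_file(file_execution_id: str) -> dict:
    # No Django ORM here: worker state comes from an internal HTTP API,
    # which is what enables gevent/eventlet pools and faster startup.
    resp = requests.get(
        f"{INTERNAL_API_BASE}/file-executions/{file_execution_id}/",
        headers={"Authorization": f"Bearer {INTERNAL_API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```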

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix pre-commit issues: file permissions and ruff errors

Set up Docker for the new workers

- Add executable permissions to worker entrypoint files
- Fix import order in namespace package __init__.py
- Remove unused variable api_status in general worker
- Address ruff E402 and F841 errors

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* refactored Dockerfiles, fixes

* flexibility in Celery run commands

* added debug logs

* handled file history for API

* cleanup

* cleanup

* cloud plugin structure

* minor changes in import plugin

* added notification and logger workers under new worker module

* add docker compatibility for new workers

* handled docker issues

* log consumer worker fixes

* added scheduler worker

* minor env changes

* cleanup the logs

* minor changes in logs

* resolved scheduler worker issues

* cleanup and refactor

* ensuring backward compatibility with existing workers

* added configuration internal APIs and cache utils

* optimization

* Fix API client singleton pattern to share HTTP sessions

- Fix flawed singleton implementation that was trying to share BaseAPIClient instances
- Now properly shares HTTP sessions between specialized clients
- Eliminates 6x BaseAPIClient initialization by reusing the same underlying session
- Should reduce API deployment orchestration time by ~135ms (from 6 clients to 1 session)
- Added debug logging to verify singleton pattern activation
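
A minimal sketch of this session-sharing pattern, assuming the specialized clients all derive from `BaseAPIClient` (class and method names beyond that are illustrative):

```python
# Share one requests.Session across all specialized API clients.
import threading

import requests


class BaseAPIClient:
    _shared_session = None
    _lock = threading.Lock()

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url
        # Reuse a single underlying session so each specialized client
        # avoids re-initializing its own connection pool.
        with BaseAPIClient._lock:
            if BaseAPIClient._shared_session is None:
                BaseAPIClient._shared_session = requests.Session()
        self.session = BaseAPIClient._shared_session


class WorkflowAPIClient(BaseAPIClient):  # hypothetical specialized client
    def get_workflow(self, workflow_id: str) -> dict:
        resp = self.session.get(
            f"{self.base_url}/workflows/{workflow_id}/", timeout=30
        )
        resp.raise_for_status()
        return resp.json()
```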

* cleanup and structuring

* cleanup in callback

* file system connectors issue

* celery env values changes

* optional gossip

* variables for sync, mingle and gossip
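
For instance, those toggles might be wired from env vars to Celery worker flags roughly as below; the `CELERY_ENABLE_*` variable names are illustrative assumptions:

```python
# Hedged sketch: opt out of gossip/mingle/heartbeat via env vars.
import os

from celery import Celery

app = Celery("worker", broker=os.environ["CELERY_BROKER_BASE_URL"])


def build_worker_argv() -> list[str]:
    argv = ["worker", "--loglevel=INFO"]
    # Each flag is opt-out; unset or "true" keeps Celery's default behaviour.
    if os.environ.get("CELERY_ENABLE_GOSSIP", "true").lower() == "false":
        argv.append("--without-gossip")
    if os.environ.get("CELERY_ENABLE_MINGLE", "true").lower() == "false":
        argv.append("--without-mingle")
    if os.environ.get("CELERY_ENABLE_HEARTBEAT", "true").lower() == "false":
        argv.append("--without-heartbeat")
    return argv


if __name__ == "__main__":
    app.worker_main(argv=build_worker_argv())
```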

* Fix for file type check

* resolving task pipeline issue

* handled API deployment failure response

* Task pipeline fixes

* updated file history cleanup with active file execution

* pipeline status update and workflow UI page execution

* cleanup and resolving conflicts

* remove unstract-core from connectors

* Commit uv.lock changes

* uv lock updates

* resolve migration issues

* defer connector-metadata

* Fix connector migration for production scale

- Add encryption key handling with defer() to prevent decryption failures
- Add final cleanup step to fix duplicate connector names
- Optimize for large datasets with batch processing and bulk operations
- Ensure unique constraint in migration 0004 can be created successfully
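
A hedged sketch of that migration strategy (the app label, model, and field names are placeholders, not the real migration):

```python
# defer() the encrypted column and de-duplicate names in batches.
from django.db import migrations

BATCH_SIZE = 1000


def dedupe_connector_names(apps, schema_editor):
    ConnectorInstance = apps.get_model("connector_v2", "ConnectorInstance")
    # defer() keeps the encrypted metadata out of the SELECT, so rows
    # whose payloads can't be decrypted don't abort the migration.
    qs = ConnectorInstance.objects.defer("connector_metadata").order_by("pk")
    seen = set()
    to_update = []
    for row in qs.iterator(chunk_size=BATCH_SIZE):
        if row.connector_name in seen:
            row.connector_name = f"{row.connector_name} ({row.pk})"
            to_update.append(row)
        else:
            seen.add(row.connector_name)
        # Flush in batches so large datasets don't blow up memory.
        if len(to_update) >= BATCH_SIZE:
            ConnectorInstance.objects.bulk_update(to_update, ["connector_name"])
            to_update.clear()
    if to_update:
        ConnectorInstance.objects.bulk_update(to_update, ["connector_name"])


class Migration(migrations.Migration):
    dependencies = [("connector_v2", "0003_previous_migration")]  # placeholder
    operations = [
        migrations.RunPython(dedupe_connector_names, migrations.RunPython.noop)
    ]
```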

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* hitl fixes

* minor fixes on hitl

* api_hub related changes

* dockerfile fixes

* api client cache fixes with actual response class

* fix: tags and llm_profile_id

* optimized clear cache

* cleanup

* enhanced logs

* added more handling around file/directory checks and added loggers

* cleaned up the run-platform script

* internal APIs are exempted from CSRF

* SonarCloud issues

* SonarCloud issues

* resolving SonarCloud issues

* resolving SonarCloud issues

* Delta: added batch size fix in workers

* comments addressed

* Celery configuration changes for new workers

* fixes in callback regarding the pipeline type check

* change internal url registry logic

* gitignore changes

* gitignore changes

* addressing PR comments and cleaning up the code

* adding missed profiles for v2

* SonarCloud blocker issues resolved

* implement OTel

* Commit uv.lock changes

* handle execution time and some cleanup

* adding user_data in metadata. PR: https://github.com/Zipstack/unstract/pull/1544

* scheduler backward compatibility

* replace user_data with custom_data

* Commit uv.lock changes

* celery worker command issue resolved

* enhance package imports in connectors by changing to lazy imports
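
The lazy-import change likely amounts to a PEP 562 module-level `__getattr__` in the connectors package; the sketch below shows the general pattern with illustrative connector names:

```python
# connectors/__init__.py -- import heavy connector deps only on first use.
import importlib

_LAZY_EXPORTS = {
    # public name -> (module path, attribute); entries are illustrative
    "GCSFS": ("unstract.connectors.filesystems.gcs", "GCSFS"),
    "S3FS": ("unstract.connectors.filesystems.s3", "S3FS"),
}


def __getattr__(name: str):
    try:
        module_path, attr = _LAZY_EXPORTS[name]
    except KeyError:
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
    # Importing here (not at package import time) keeps worker startup fast.
    return getattr(importlib.import_module(module_path), attr)
```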

* Update runner.py, removing OTel from it

Signed-off-by: ali <117142933+muhammad-ali-e@users.noreply.github.com>

* added delta changes

* handle errors to destination DB

* resolve tool instance ID validation and HITL queue name in API

* handled direct execution from workflow page to worker and logs

* handle cost logs

* Update health.py

Signed-off-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor log changes

* introducing log consumer scheduler for bulk creates, and socket.emit from worker for WebSocket updates
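
A sketch of the bulk-create side, assuming logs are buffered in a Redis list and flushed by a beat-scheduled task; the buffer key, task name, and insert helper are illustrative, while `LOGS_BATCH_LIMIT` and `LOG_HISTORY_CONSUMER_INTERVAL` come from the sample.env below:

```python
# Periodically drain buffered logs and insert them in one batch.
import json
import os

import redis
from celery import Celery

app = Celery("log_consumer", broker=os.environ["CELERY_BROKER_BASE_URL"])

BATCH_LIMIT = int(os.environ.get("LOGS_BATCH_LIMIT", "30"))
LOG_BUFFER_KEY = "unified_logs:buffer"  # hypothetical Redis key

app.conf.beat_schedule = {
    "flush-execution-logs": {
        "task": "consume_log_history",
        "schedule": float(os.environ.get("LOG_HISTORY_CONSUMER_INTERVAL", "30")),
    }
}


@app.task(name="consume_log_history")
def consume_log_history() -> int:
    r = redis.Redis(host=os.environ.get("REDIS_HOST", "unstract-redis"))
    pipe = r.pipeline()
    pipe.lrange(LOG_BUFFER_KEY, 0, BATCH_LIMIT - 1)
    pipe.ltrim(LOG_BUFFER_KEY, BATCH_LIMIT, -1)
    raw_logs, _ = pipe.execute()
    logs = [json.loads(item) for item in raw_logs]
    if logs:
        # One bulk insert (e.g. Django bulk_create behind an internal
        # API) instead of one INSERT per log line.
        post_logs_via_internal_api(logs)  # hypothetical helper
    return len(logs)
```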

* Commit uv.lock changes

* Celery time limit/timeout config cleanup

* implemented Redis client class in worker

* pipeline status enum mismatch

* notification worker fixes

* resolve uv lock conflicts

* workflow log fixes

* WS channel name issue resolved; handling Redis downtime in status tracker and removing Redis keys
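
The Redis-down handling presumably makes status tracking best-effort; a minimal sketch of that idea (the wrapper class and method names are illustrative, the env vars come from the sample.env below):

```python
# Swallow Redis outages in the status tracker instead of failing the run.
import logging
import os

import redis

logger = logging.getLogger(__name__)


class WorkerRedisClient:  # hypothetical wrapper class
    def __init__(self) -> None:
        self._client = redis.Redis(
            host=os.environ.get("REDIS_HOST", "unstract-redis"),
            port=int(os.environ.get("REDIS_PORT", "6379")),
            username=os.environ.get("REDIS_USER", "default"),
            password=os.environ.get("REDIS_PASSWORD") or None,
            socket_connect_timeout=2,
        )

    def set_status(self, key: str, value: str, ttl: int) -> bool:
        try:
            self._client.set(key, value, ex=ttl)
            return True
        except redis.RedisError:
            # Status tracking is best-effort: log and continue so the
            # file execution itself is not failed by a Redis outage.
            logger.warning("Redis unavailable; skipped status update for %s", key)
            return False
```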

* default TTL changed for unified logs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: ali <117142933+muhammad-ali-e@users.noreply.github.com>
Signed-off-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-10-03 11:24:07 +05:30


DJANGO_SETTINGS_MODULE='backend.settings.dev'
# NOTE: Change below to True if you are running in HTTPS mode.
SESSION_COOKIE_SECURE=False
CSRF_COOKIE_SECURE=False
# Default log level
DEFAULT_LOG_LEVEL="INFO"
# Common
PATH_PREFIX="api/v1"
# Django settings
DJANGO_APP_BACKEND_URL=http://frontend.unstract.localhost
DJANGO_SECRET_KEY="1(xf&nc6!y7!l&!5xe&i_rx7e^m@fcut9fduv86ft=-b@2g6"
# Postgres DB envs
DB_HOST='unstract-db'
DB_USER='unstract_dev'
DB_PASSWORD='unstract_pass'
DB_NAME='unstract_db'
DB_PORT=5432
DB_SCHEMA="unstract"
# Celery Backend Database (optional - defaults to DB_NAME if unset)
# Example:
# CELERY_BACKEND_DB_NAME=unstract_celery_db
# Redis
REDIS_HOST="unstract-redis"
REDIS_PORT=6379
REDIS_PASSWORD=""
REDIS_USER=default
# Connector OAuth
SOCIAL_AUTH_EXTRA_DATA_EXPIRATION_TIME_IN_SECOND=3600
GOOGLE_OAUTH2_KEY=
GOOGLE_OAUTH2_SECRET=
# User session
SESSION_EXPIRATION_TIME_IN_SECOND=7200
# FE Web Application Dependencies
WEB_APP_ORIGIN_URL="http://frontend.unstract.localhost"
# API keys for trusted services
INTERNAL_SERVICE_API_KEY=
# Unstract Core envs
BUILTIN_FUNCTIONS_API_KEY=
FREE_STORAGE_AWS_ACCESS_KEY_ID=
FREE_STORAGE_AWS_SECRET_ACCESS_KEY=
UNSTRACT_FREE_STORAGE_BUCKET_NAME=
GDRIVE_GOOGLE_SERVICE_ACCOUNT=
GDRIVE_GOOGLE_PROJECT_ID=
GOOGLE_STORAGE_ACCESS_KEY_ID=
GOOGLE_STORAGE_SECRET_ACCESS_KEY=
GOOGLE_STORAGE_BASE_URL=https://storage.googleapis.com
# Platform Service
PLATFORM_SERVICE_HOST=http://unstract-platform-service
PLATFORM_SERVICE_PORT=3001
# Tool Runner
UNSTRACT_RUNNER_HOST=http://unstract-runner
UNSTRACT_RUNNER_PORT=5002
UNSTRACT_RUNNER_API_TIMEOUT=240 # (in seconds) 4 mins
UNSTRACT_RUNNER_API_RETRY_COUNT=5 # Number of retries for failed requests
UNSTRACT_RUNNER_API_BACKOFF_FACTOR=3 # Exponential backoff factor for retries
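# NOTE (assumption): with urllib3-style exponential backoff, sleeps grow as
# factor * 2^(n-1), i.e. roughly 3s, 6s, 12s, 24s, 48s across the 5 retries.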
# Prompt Service
PROMPT_HOST=http://unstract-prompt-service
PROMPT_PORT=3003
# Prompt Studio
PROMPT_STUDIO_FILE_PATH=/app/prompt-studio-data
# Structure Tool Image (Runs prompt studio exported tools)
# https://hub.docker.com/r/unstract/tool-structure
STRUCTURE_TOOL_IMAGE_URL="docker:unstract/tool-structure:0.0.88"
STRUCTURE_TOOL_IMAGE_NAME="unstract/tool-structure"
STRUCTURE_TOOL_IMAGE_TAG="0.0.88"
# Feature Flags
EVALUATION_SERVER_IP=unstract-flipt
EVALUATION_SERVER_PORT=9000
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
# X2Text Service
X2TEXT_HOST=http://unstract-x2text-service
X2TEXT_PORT=3004
# Encryption Key
# Key must be 32 url-safe base64-encoded bytes. Check the README.md for details
ENCRYPTION_KEY="Sample-Key"
# Cache TTL
CACHE_TTL_SEC=10800
# Default user auth credentials
DEFAULT_AUTH_USERNAME=
DEFAULT_AUTH_PASSWORD=
# System admin credentials
SYSTEM_ADMIN_USERNAME="admin"
SYSTEM_ADMIN_PASSWORD="admin"
SYSTEM_ADMIN_EMAIL="admin@abc.com"
# Set Django Session Expiry Time (in seconds)
SESSION_COOKIE_AGE=86400
# Control async extraction of LLMWhisperer
# Time in seconds to wait before polling LLMWhisperer's status API
ADAPTER_LLMW_POLL_INTERVAL=30
# Total number of times to poll the status API.
# 500 mins to allow 1500 (max pages limit) * 20 (approx time in sec to process a page)
ADAPTER_LLMW_MAX_POLLS=1000
# Number of times to retry the /whisper-status API before failing the extraction
ADAPTER_LLMW_STATUS_RETRIES=5
# Enable logging of workflow history.
ENABLE_LOG_HISTORY=True
# Interval in seconds for periodic consumer operations.
LOG_HISTORY_CONSUMER_INTERVAL=30
# Maximum number of logs to insert in a single batch.
LOGS_BATCH_LIMIT=30
# Logs Expiry of 24 hours
LOGS_EXPIRATION_TIME_IN_SECOND=86400
# Celery Configuration
# Used by celery and to connect to queue to push logs
CELERY_BROKER_BASE_URL="amqp://unstract-rabbitmq:5672//"
CELERY_BROKER_USER=admin
CELERY_BROKER_PASS=password
# Indexing flag to prevent re-index
INDEXING_FLAG_TTL=1800
# Notification Timeout in Seconds
NOTIFICATION_TIMEOUT=5
# Path where public and private tools are registered
# with a YAML and JSONs
TOOL_REGISTRY_CONFIG_PATH="/data/tool_registry_config"
# Flipt Service
FLIPT_SERVICE_AVAILABLE=False
# File System Configuration for Workflow and API Execution
# Directory Prefixes for storing execution files
WORKFLOW_EXECUTION_DIR_PREFIX="unstract/execution"
API_EXECUTION_DIR_PREFIX="unstract/api"
# Storage Provider for Workflow Execution
# Valid options: MINIO, S3, etc.
WORKFLOW_EXECUTION_FILE_STORAGE_CREDENTIALS='{"provider": "minio", "credentials": {"endpoint_url": "http://unstract-minio:9000", "key": "minio", "secret": "minio123"}}'
# Storage Provider for API Execution
API_FILE_STORAGE_CREDENTIALS='{"provider": "minio", "credentials": {"endpoint_url": "http://unstract-minio:9000", "key": "minio", "secret": "minio123"}}'
# Remote storage related envs
PERMANENT_REMOTE_STORAGE='{"provider": "minio", "credentials": {"endpoint_url": "http://unstract-minio:9000", "key": "minio", "secret": "minio123"}}'
REMOTE_PROMPT_STUDIO_FILE_PATH="unstract/prompt-studio-data"
# Storage Provider for Tool registry
TOOL_REGISTRY_STORAGE_CREDENTIALS='{"provider":"local"}'
# Highlight data to be available in api deployment
ENABLE_HIGHLIGHT_API_DEPLOYMENT=False
# Execution result and cache expire time
# For API results cached per workflow execution (24 hours)
EXECUTION_RESULT_TTL_SECONDS=86400
# For execution metadata cached per workflow execution (24 hours)
EXECUTION_CACHE_TTL_SECONDS=86400
# Instant workflow polling timeout in seconds (5 minutes)
INSTANT_WF_POLLING_TIMEOUT=300
# Maximum number of batches (i.e., parallel tasks) created for a single workflow execution (1 file at a time)
MAX_PARALLEL_FILE_BATCHES=1
# Maximum allowed value for MAX_PARALLEL_FILE_BATCHES (upper limit for validation)
MAX_PARALLEL_FILE_BATCHES_MAX_VALUE=100
# Maximum number of files allowed per workflow page execution
WORKFLOW_PAGE_MAX_FILES=2
# File execution tracker TTL in seconds (5 hours)
FILE_EXECUTION_TRACKER_TTL_IN_SECOND=18000
# File execution tracker completed TTL in seconds (10 minutes)
FILE_EXECUTION_TRACKER_COMPLETED_TTL_IN_SECOND=600
# Runner polling timeout (3 hours)
MAX_RUNNER_POLLING_WAIT_SECONDS=10800
# Runner polling interval (2 seconds)
RUNNER_POLLING_INTERVAL_SECONDS=2
# ETL Pipeline minimum schedule interval (in seconds)
# Default: 1800 seconds (30 minutes)
# Examples: 900 (15 min), 1800 (30 min), 3600 (60 min)
MIN_SCHEDULE_INTERVAL_SECONDS=1800