unstract/backend/workflow_manager/file_execution/models.py
ali 0c5997f9a9 UN-2470 [FEAT] Remove Django dependency from Celery workers with internal APIs (#1494)
* UN-2470 [MISC] Remove Django dependency from Celery workers

This commit introduces a new worker architecture that decouples
Celery workers from Django where possible, enabling support for
gevent/eventlet pool types and reducing worker startup overhead.

Key changes:
- Created separate worker modules (api-deployment, callback, file_processing, general)
- Added internal API endpoints for worker communication
- Implemented Django-free task execution where appropriate
- Added shared utilities and client facades
- Updated container configurations for new worker architecture

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix pre-commit issues: file permissions and ruff errors

Set up Docker for the new workers

- Add executable permissions to worker entrypoint files
- Fix import order in namespace package __init__.py
- Remove unused variable api_status in general worker
- Address ruff E402 and F841 errors

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* refactored, Dockerfiles, fixes

* flexibility on celery run commands

* added debug logs

* handled file history for API

* cleanup

* cleanup

* cloud plugin structure

* minor changes in import plugin

* added notification and logger workers under new worker module

* add docker compatibility for new workers

* handled docker issues

* log consumer worker fixes

* added scheduler worker

* minor env changes

* cleanup the logs

* minor changes in logs

* resolved scheduler worker issues

* cleanup and refactor

* ensuring backward compatibility with existing workers

* added configuration internal apis and cache utils

* optimization

* Fix API client singleton pattern to share HTTP sessions

- Fix flawed singleton implementation that was trying to share BaseAPIClient instances
- Now properly shares HTTP sessions between specialized clients
- Eliminates 6x BaseAPIClient initialization by reusing the same underlying session
- Should reduce API deployment orchestration time by ~135ms (from 6 clients to 1 session)
- Added debug logging to verify singleton pattern activation

* cleanup and structuring

* cleanup in callback

* file system connectors issue

* celery env values changes

* optional gossip

* variables for sync, mingle and gossip

* Fix for file type check

* Task pipeline issue resolving

* api deployment failed response handled

* Task pipeline fixes

* updated file history cleanup with active file execution

* pipeline status update and workflow UI page execution

* cleanup and resolving conflicts

* remove unstract-core from connectors

* Commit uv.lock changes

* uv locks updates

* resolve migration issues

* defer connector metadata

* Fix connector migration for production scale

- Add encryption key handling with defer() to prevent decryption failures
- Add final cleanup step to fix duplicate connector names
- Optimize for large datasets with batch processing and bulk operations
- Ensure unique constraint in migration 0004 can be created successfully

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* hitl fixes

* minor fixes on hitl

* api_hub related changes

* dockerfile fixes

* api client cache fixes with actual response class

* fix: tags and llm_profile_id

* optimized clear cache

* cleanup

* enhanced logs

* added more handling for is-file-dir checks and added loggers

* clean up the runplatform script

* internal APIs are exempted from CSRF

* SonarCloud issues

* SonarCloud issues

* resolving SonarCloud issues

* resolving SonarCloud issues

* Delta: added batch size fix in workers

* comments addressed

* celery configuration changes for new workers

* fixes in callback regarding the pipeline type check

* change internal url registry logic

* gitignore changes

* gitignore changes

* addressing PR comments and cleaning up the code

* adding missed profiles for v2

* SonarCloud blocker issues resolved

* implement OTel

* Commit uv.lock changes

* handle execution time and some cleanup

* adding user_data in metadata. PR: https://github.com/Zipstack/unstract/pull/1544

* scheduler backward compatibility

* replace user_data with custom_data

* Commit uv.lock changes

* celery worker command issue resolved

* enhance package imports in connectors by changing to lazy imports

* Update runner.py by removing the otel from it

Update runner.py by removing the otel from it

Signed-off-by: ali <117142933+muhammad-ali-e@users.noreply.github.com>

* added delta changes

* handle errors to destination DB

* resolve tool instance ID validation and HITL queue name in API

* handled direct execution from workflow page to worker and logs

* handle cost logs

* Update health.py

Signed-off-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor log changes

* introducing log consumer scheduler for bulk creates, and socket.emit from worker for WS

* Commit uv.lock changes

* time limit or timeout celery config cleanup

* implemented redis client class in worker

* pipeline status enum mismatch

* notification worker fixes

* resolve uv lock conflicts

* workflow log fixes

* WS channel name issue resolved; handle Redis being down in the status tracker, and remove Redis keys

* default TTL changed for unified logs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: ali <117142933+muhammad-ali-e@users.noreply.github.com>
Signed-off-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2025-10-03 11:24:07 +05:30

258 lines
8.8 KiB
Python

import uuid
from datetime import timedelta
from typing import Any

from django.db import models

from utils.common_utils import CommonUtils
from utils.models.base_model import BaseModel
from workflow_manager.endpoint_v2.dto import FileHash
from workflow_manager.workflow_v2.enums import ExecutionStatus

FILE_NAME_LENGTH = 255
FILE_PATH_LENGTH = 255
HASH_LENGTH = 64
MIME_TYPE_LENGTH = 128


class WorkflowFileExecutionManager(models.Manager):
    def get_or_create_file_execution(
        self,
        workflow_execution: Any,
        file_hash: FileHash,
        is_api: bool = False,
    ) -> "WorkflowFileExecution":
        """Retrieves or creates a new input file record for a workflow execution.

        Args:
            workflow_execution: The `WorkflowExecution` object
                associated with this file.
            file_hash: The `FileHash` object containing file metadata.
            is_api: (Optional) Whether the file comes from an API deployment;
                API files are stored without a file path.

        Returns:
            The `WorkflowFileExecution` object.
        """
        # Determine file path based on connection type
        execution_file_path = file_hash.file_path if not is_api else None
        lookup_fields = {
            "workflow_execution": workflow_execution,
            "file_path": execution_file_path,
        }
        if file_hash.file_hash:
            lookup_fields["file_hash"] = file_hash.file_hash
        elif file_hash.provider_file_uuid:
            lookup_fields["provider_file_uuid"] = file_hash.provider_file_uuid

        execution_file, is_created = self.get_or_create(**lookup_fields)

        if is_created:
            self._update_execution_file(execution_file, file_hash)
        return execution_file

    def _update_execution_file(
        self, execution_file: "WorkflowFileExecution", file_hash: FileHash
    ) -> None:
        """Updates the attributes of a newly created WorkflowFileExecution object."""
        execution_file.file_name = file_hash.file_name
        execution_file.file_size = file_hash.file_size
        execution_file.mime_type = file_hash.mime_type
        execution_file.provider_file_uuid = file_hash.provider_file_uuid
        execution_file.fs_metadata = file_hash.fs_metadata
        execution_file.save()


class WorkflowFileExecution(BaseModel):
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    workflow_execution = models.ForeignKey(
        "workflow_v2.WorkflowExecution",
        on_delete=models.CASCADE,
        db_index=True,
        editable=False,
        db_comment="Foreign key from WorkflowExecution model",
        related_name="file_executions",
    )
    file_name = models.CharField(
        max_length=FILE_NAME_LENGTH, db_comment="Name of the file"
    )
    file_path = models.CharField(
        max_length=FILE_PATH_LENGTH, null=True, db_comment="Full Path of the file"
    )
    file_size = models.BigIntegerField(null=True, db_comment="Size of the file in bytes")
    file_hash = models.CharField(
        max_length=HASH_LENGTH, null=True, db_comment="Hash of the file content"
    )
    provider_file_uuid = models.CharField(
        max_length=HASH_LENGTH,
        null=True,
        db_comment="Unique identifier assigned by the file storage provider",
    )
    fs_metadata = models.JSONField(
        null=True,
        db_comment="Complete metadata of the file retrieved from the file system.",
    )
    mime_type = models.CharField(
        max_length=MIME_TYPE_LENGTH,
        blank=True,
        null=True,
        db_comment="MIME type of the file",
    )
    status = models.TextField(
        choices=ExecutionStatus.choices,
        db_comment="Current status of the execution",
    )
    execution_time = models.FloatField(null=True, db_comment="Execution time in seconds")
    execution_error = models.TextField(
        blank=True, null=True, db_comment="Error message if execution failed"
    )

    # Custom manager
    objects = WorkflowFileExecutionManager()

    def __str__(self):
        return (
            f"WorkflowFileExecution: {self.file_name} "
            f"(WorkflowExecution: {self.workflow_execution})"
        )

    def update_status(
        self,
        status: ExecutionStatus | str,
        execution_error: str | None = None,
        execution_time: float | None = None,
    ) -> None:
        """Updates the status and execution details of an input file.

        Args:
            status: The new status of the file (ExecutionStatus enum or string)
            execution_error: (Optional) Error message if processing failed
            execution_time: (Optional) Execution time for processing the file;
                computed from `created_at` for final states when not provided
        """
        # Set execution_time if provided, otherwise calculate it for final states
        status = ExecutionStatus(status)
        self.status = status.value
        if status in [
            ExecutionStatus.COMPLETED,
            ExecutionStatus.ERROR,
            ExecutionStatus.STOPPED,
        ]:
            self.execution_time = (
                execution_time
                if execution_time is not None
                else CommonUtils.time_since(self.created_at, 3)
            )
        self.execution_error = execution_error
        self.save()

    @property
    def pretty_file_size(self) -> str:
        """Convert file_size from bytes to human-readable format.

        Returns:
            str: File size with a precision of 2 decimals
        """
        return CommonUtils.pretty_file_size(self.file_size)

    @property
    def pretty_execution_time(self) -> str:
        """Convert execution_time from seconds to HH:MM:SS format.

        Returns:
            str: Time in HH:MM:SS format
        """
        # Compute execution time for a run that's in progress
        time_in_secs = (
            self.execution_time
            if self.execution_time
            else CommonUtils.time_since(self.created_at)
        )
        return str(timedelta(seconds=time_in_secs)).split(".")[0]

    class Meta:
        verbose_name = "Workflow File Execution"
        verbose_name_plural = "Workflow File Executions"
        db_table = "workflow_file_execution"
        indexes = [
            models.Index(
                fields=["workflow_execution", "file_hash"],
                name="workflow_file_hash_idx",
            ),
            models.Index(
                fields=["workflow_execution", "provider_file_uuid"],
                name="workflow_exec_p_uuid_idx",
            ),
            models.Index(
                fields=["workflow_execution", "file_hash", "file_path", "status"],
                name="wf_file_hash_path_status_idx",
            ),
            models.Index(
                fields=[
                    "workflow_execution",
                    "provider_file_uuid",
                    "file_path",
                    "status",
                ],
                name="wf_provider_uuid_path_stat_idx",
            ),
        ]
        constraints = [
            models.UniqueConstraint(
                fields=["workflow_execution", "file_hash", "file_path"],
                name="unique_workflow_file_hash_path",
            ),
            models.UniqueConstraint(
                fields=["workflow_execution", "provider_file_uuid", "file_path"],
                name="unique_workflow_provider_uuid_path",
            ),
            # CRITICAL FIX: Add constraint for API files where file_path is None
            # This prevents duplicate entries for same file_hash
            models.UniqueConstraint(
                fields=["workflow_execution", "file_hash"],
                condition=models.Q(file_path__isnull=True),
                name="unique_workflow_api_file_hash",
            ),
        ]

    @property
    def is_completed(self) -> bool:
        """Check if the execution status is completed.

        Returns:
            bool: True if the execution status is completed, False otherwise.
        """
        return self.status is not None and self.status == ExecutionStatus.COMPLETED.value

    def update(
        self,
        file_hash: str | None = None,
        fs_metadata: dict[str, Any] | None = None,
        mime_type: str | None = None,
    ) -> None:
        """Updates the file execution details.

        Args:
            file_hash: (Optional) Hash of the file content
            fs_metadata: (Optional) File system metadata
            mime_type: (Optional) MIME type of the file

        Returns:
            None
        """
        update_fields = []
        if file_hash is not None:
            self.file_hash = file_hash
            update_fields.append("file_hash")
        if fs_metadata is not None:
            self.fs_metadata = fs_metadata
            update_fields.append("fs_metadata")
        if mime_type is not None:
            self.mime_type = mime_type
            update_fields.append("mime_type")
        if update_fields:  # Save only if there's an actual update
            self.save(update_fields=update_fields)
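
For context, a minimal usage sketch of the manager and status helper defined in this model. It is not taken from the repository: the `workflow_execution` and `file_hash` names below are assumed to already exist (an existing `WorkflowExecution` row and a populated `FileHash` DTO produced by the file-listing step), and the literal values are placeholders.

# Hypothetical usage sketch only (not part of models.py); assumes
# `workflow_execution` and `file_hash` are already available.
from workflow_manager.file_execution.models import WorkflowFileExecution
from workflow_manager.workflow_v2.enums import ExecutionStatus

# Looks up an existing record keyed on (workflow_execution, file_path,
# file_hash / provider_file_uuid), or creates one and copies the FileHash
# metadata onto it via _update_execution_file().
file_execution = WorkflowFileExecution.objects.get_or_create_file_execution(
    workflow_execution=workflow_execution,
    file_hash=file_hash,
    is_api=False,  # True for API deployments: file_path is stored as None
)

# ... process the file ...

# For terminal states, execution_time is derived from created_at when it
# is not passed explicitly.
file_execution.update_status(
    status=ExecutionStatus.COMPLETED,
    execution_time=12.345,
)
print(file_execution.pretty_execution_time)  # e.g. "0:00:12"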