diff --git a/.cursor/rules/dashboard.mdc b/.cursor/rules/dashboard.mdc index 9431cb0..2819188 100644 --- a/.cursor/rules/dashboard.mdc +++ b/.cursor/rules/dashboard.mdc @@ -43,10 +43,11 @@ headers: { 'Authorization': `Bearer ${token}` } ``` dashboard/src/ ├── app/ # Next.js App Router pages +│ ├── control/ # Mission control panel (chat interface) +│ ├── history/ # Agents page - mission history + tree modal │ ├── console/ # SSH terminal + file explorer -│ ├── control/ # Task control panel -│ ├── agents/ # Agent tree visualization │ ├── modules/ # MCP module management +│ ├── files/ # File explorer │ └── settings/ # Configuration ├── components/ │ ├── ui/ # Generic UI components @@ -56,6 +57,17 @@ dashboard/src/ └── lib/ # Utilities (api, auth, settings) ``` +## Navigation Structure + +| Nav Item | Route | Description | +|----------|-------|-------------| +| Overview | `/` | Global stats and recent missions | +| Mission | `/control` | Active mission chat interface | +| Agents | `/history` | Mission history with tree visualization modal | +| Console | `/console` | SSH terminal access | +| Modules | `/modules` | MCP server management | +| Settings | `/settings` | Configuration | + ## Refresh Resilience Pattern The dashboard maintains state snapshots on the backend so users can refresh or navigate away without losing visual state. @@ -96,7 +108,11 @@ useEffect(() => { ## Agent Tree Visualization -Dynamic, animated tree visualization for hierarchical agent execution. Uses SVG + framer-motion for smooth animations. +Dynamic, animated tree visualization for hierarchical agent execution. Uses SVG + framer-motion for smooth animations. + +The tree is accessed via a **modal** from the Agents page (formerly History): +- Click the tree icon on any mission row to open the full-screen modal +- Modal shows the agent execution tree with pan/zoom controls ### Component Structure @@ -114,48 +130,45 @@ components/agent-tree/ ```tsx import { AgentTreeCanvas, - generateComplexTree, - simulateTreeUpdates, type AgentNode } from '@/components/agent-tree'; -// With real data from backend +// Full mode (in Control page panel) setSelectedId(node?.id ?? null)} /> -// Demo mode (for testing without API) -const [tree, setTree] = useState(generateComplexTree()); -useEffect(() => simulateTreeUpdates(tree, setTree), []); - +// Compact mode (in History page preview, side panels) + ``` +### Props + +| Prop | Type | Description | +|------|------|-------------| +| `tree` | `AgentNode \| null` | Tree data to visualize | +| `onSelectNode` | `(node: AgentNode \| null) => void` | Node selection callback | +| `selectedNodeId` | `string \| null` | Currently selected node ID | +| `compact` | `boolean` | Compact mode - hides minimap and details panel | +| `className` | `string` | Additional CSS classes | + ### Node Display Each node shows: - **Icon**: Agent type (Bot, Brain, Cpu, Zap, Target, GitBranch) - **Name**: Truncated agent name -- **Model**: LLM model used (e.g., `claude-sonnet-4.5`) +- **Model**: LLM model used (e.g., `gemini-3-flash`) - **Status**: Running (pulse), Completed (✓), Failed (✗), Pending (clock) - **Budget**: Spent / Allocated (e.g., `$0.35 / $9.00`) -### Demo Modes - -Three demo tree generators for testing: -- `generateSimpleTree()` – Basic orchestrator (5 nodes) -- `generateComplexTree()` – Subtask decomposition (10-15 nodes) -- `generateDeepTree(depth)` – Recursive nesting (50+ nodes) - -Use `simulateTreeUpdates(tree, setTree)` to simulate live status changes. - ### Interactions - **Pan**: Click and drag to move the tree - **Zoom**: Mouse wheel or +/- buttons -- **Select**: Click a node to show details panel -- **Reset**: Reset view button restores initial position +- **Select**: Click a node to show details panel (full mode only) +- **Fit**: Reset view button fits tree to viewport ## Environment Variables diff --git a/.cursor/rules/project.mdc b/.cursor/rules/project.mdc index f3f3c15..50d8dea 100644 --- a/.cursor/rules/project.mdc +++ b/.cursor/rules/project.mdc @@ -1,5 +1,5 @@ --- -description: Core Open Agent architecture - hierarchical agent system with full machine access +description: Core Open Agent architecture - SimpleAgent system with full machine access alwaysApply: true --- @@ -13,26 +13,30 @@ Minimal autonomous coding agent in Rust with **full machine access** (not sandbo |-----------|----------|---------| | Backend (Rust) | `src/` | HTTP API + agent system | | Dashboard (Next.js) | `dashboard/` | Web UI (Bun, not npm) | +| iOS Dashboard | `ios_dashboard/` | Native iOS app (Swift/SwiftUI) | | MCP configs | `.open_agent/mcp/config.json` | Model Context Protocol servers | | Tuning | `.open_agent/tuning.json` | Calibration data | ## Architecture ``` -RootAgent (orchestrator) -├── ComplexityEstimator (leaf) → estimates task difficulty 0-1 -├── ModelSelector (leaf) → U-curve cost optimization -├── TaskExecutor (leaf) → runs tools in a loop -└── Verifier (leaf) → hybrid programmatic + LLM verification +SimpleAgent + └── TaskExecutor → runs tools in a loop with auto-upgrade ``` +The agent system was simplified from a complex hierarchical orchestrator to a single `SimpleAgent` that: +- Automatically upgrades outdated model names via `ModelResolver` +- Uses `TaskExecutor` for tool-based execution +- Supports model overrides per mission/message +- Handles parallel mission execution + ### Module Map ``` src/ -├── agents/ # Hierarchical agent system -│ ├── orchestrator/ # RootAgent, NodeAgent -│ └── leaf/ # ComplexityEstimator, ModelSelector, TaskExecutor, Verifier +├── agents/ # Agent system +│ ├── simple.rs # SimpleAgent (main entry point) +│ └── leaf/ # TaskExecutor ├── budget/ # Cost tracking, pricing, smart retry │ ├── benchmarks.rs # Model capability scores from llm-stats.com │ ├── pricing.rs # OpenRouter pricing + model allowlist @@ -79,19 +83,44 @@ src/ | Method | Path | Purpose | |--------|------|---------| | `POST` | `/api/control/message` | Send message to agent | +| `POST` | `/api/control/tool_result` | Submit frontend tool result | | `GET` | `/api/control/stream` | SSE event stream | | `POST` | `/api/control/cancel` | Cancel current execution | | `GET` | `/api/control/tree` | Get agent tree snapshot (refresh resilience) | | `GET` | `/api/control/progress` | Get execution progress ("Subtask X/Y") | + +### Mission Endpoints + +| Method | Path | Purpose | +|--------|------|---------| | `GET` | `/api/control/missions` | List all missions | -| `POST` | `/api/control/missions` | Create new mission | -| `GET` | `/api/control/missions/current` | Get current mission | +| `POST` | `/api/control/missions` | Create new mission (optional: title, model_override) | +| `GET` | `/api/control/missions/current` | Get current active mission | +| `GET` | `/api/control/missions/:id` | Get specific mission | +| `GET` | `/api/control/missions/:id/tree` | Get mission's agent tree | | `POST` | `/api/control/missions/:id/load` | Switch to mission | +| `POST` | `/api/control/missions/:id/status` | Set mission status | | `POST` | `/api/control/missions/:id/cancel` | Cancel specific mission | +| `POST` | `/api/control/missions/:id/resume` | Resume interrupted mission | | `POST` | `/api/control/missions/:id/parallel` | Start mission in parallel | + +### Parallel Execution Endpoints + +| Method | Path | Purpose | +|--------|------|---------| | `GET` | `/api/control/running` | List running missions | | `GET` | `/api/control/parallel/config` | Get parallel execution config | +### Mission Statuses + +Missions can be in one of these states: +- `active` - Currently being worked on +- `completed` - Successfully finished +- `failed` - Failed with errors +- `interrupted` - Stopped due to server shutdown/cancellation (resumable) +- `blocked` - Blocked by external factors (resumable) +- `not_feasible` - Cannot be completed as specified + ## Model Selection (U-Curve) - **Cheap models**: low token cost, high failure rate, more retries diff --git a/dashboard/src/app/agents/page.tsx b/dashboard/src/app/agents/page.tsx deleted file mode 100644 index 905226d..0000000 --- a/dashboard/src/app/agents/page.tsx +++ /dev/null @@ -1,594 +0,0 @@ -'use client'; - -/** - * Agent Tree Page - * - * Dynamic visualization of the hierarchical agent execution tree. - * Shows real-time updates as agents are created, run, and complete. - */ - -import { useEffect, useMemo, useState, useRef, useCallback } from 'react'; -import Link from 'next/link'; -import { toast } from 'sonner'; -import { cn } from '@/lib/utils'; -import { listMissions, getCurrentMission, streamControl, getAgentTree, getProgress, getMissionTree, Mission, ControlRunState, ExecutionProgress } from '@/lib/api'; -import { ShimmerSidebarItem } from '@/components/ui/shimmer'; -import { - AgentTreeCanvas, - generateSimpleTree, - generateComplexTree, - generateDeepTree, - simulateTreeUpdates, - type AgentNode, -} from '@/components/agent-tree'; -import { - Bot, - CheckCircle, - XCircle, - Loader, - Clock, - Search, - Layers, - FlaskConical, - Play, - Pause, - MessageSquare, -} from 'lucide-react'; - -function isRecord(value: unknown): value is Record { - return typeof value === 'object' && value !== null; -} - -type DemoMode = 'off' | 'simple' | 'complex' | 'deep'; - -export default function AgentsPage() { - const [missions, setMissions] = useState([]); - const [currentMission, setCurrentMission] = useState(null); - const [controlState, setControlState] = useState('idle'); - const [selectedMissionId, setSelectedMissionId] = useState(null); - const [searchQuery, setSearchQuery] = useState(''); - const [loading, setLoading] = useState(true); - const [realTree, setRealTree] = useState(null); - const [selectedNodeId, setSelectedNodeId] = useState(null); - const [progress, setProgress] = useState(null); - - // Demo mode state - const [demoMode, setDemoMode] = useState('off'); - const [demoTree, setDemoTree] = useState(null); - const [demoRunning, setDemoRunning] = useState(false); - const demoCleanupRef = useRef<(() => void) | null>(null); - - const fetchedRef = useRef(false); - const streamCleanupRef = useRef void)>(null); - - const selectedMission = useMemo( - () => missions.find((m) => m.id === selectedMissionId) ?? currentMission, - [missions, selectedMissionId, currentMission] - ); - - // Convert backend tree node to frontend AgentNode - const convertTreeNode = useCallback((node: Record): AgentNode => { - const children = (node['children'] as Record[] | undefined) ?? []; - return { - id: String(node['id'] ?? ''), - type: (String(node['node_type'] ?? 'Node') as AgentNode['type']), - status: (String(node['status'] ?? 'pending') as AgentNode['status']), - name: String(node['name'] ?? ''), - description: String(node['description'] ?? ''), - model: node['selected_model'] != null ? String(node['selected_model']) : undefined, - budgetAllocated: Number(node['budget_allocated'] ?? 0), - budgetSpent: Number(node['budget_spent'] ?? 0), - complexity: node['complexity'] != null ? Number(node['complexity']) : undefined, - children: children.map((c) => convertTreeNode(c)), - }; - }, []); - - // Stream control events for real-time status and tree updates - // First fetch snapshot, then subscribe to live updates - useEffect(() => { - streamCleanupRef.current?.(); - let mounted = true; - - // Fetch initial snapshot for refresh resilience - const fetchSnapshot = async () => { - try { - const [treeSnapshot, progressSnapshot] = await Promise.all([ - getAgentTree().catch(() => null), - getProgress().catch(() => null), - ]); - if (!mounted) return; - - if (treeSnapshot) { - setRealTree(convertTreeNode(treeSnapshot as unknown as Record)); - } - if (progressSnapshot) { - setProgress(progressSnapshot); - } - } catch (e) { - console.error('Failed to fetch snapshot:', e); - } - }; - - fetchSnapshot(); - - const cleanup = streamControl((event) => { - const data: unknown = event.data; - if (event.type === 'status' && isRecord(data)) { - const st = data['state']; - setControlState(typeof st === 'string' ? (st as ControlRunState) : 'idle'); - - // Clear real tree and progress when idle - if (st === 'idle') { - setRealTree(null); - setProgress(null); - } - } - - // Handle real-time tree updates - if (event.type === 'agent_tree' && isRecord(data)) { - const tree = data['tree']; - if (isRecord(tree)) { - const converted = convertTreeNode(tree); - setRealTree(converted); - } - } - - // Handle progress updates - if (event.type === 'progress' && isRecord(data)) { - setProgress({ - total_subtasks: Number(data['total_subtasks'] ?? 0), - completed_subtasks: Number(data['completed_subtasks'] ?? 0), - current_subtask: data['current_subtask'] as string | null, - current_depth: Number(data['depth'] ?? 0), - }); - } - }); - - streamCleanupRef.current = cleanup; - return () => { - mounted = false; - streamCleanupRef.current?.(); - streamCleanupRef.current = null; - }; - }, [convertTreeNode]); - - useEffect(() => { - let cancelled = false; - let hasShownError = false; - - const fetchData = async () => { - try { - const [missionsData, currentMissionData] = await Promise.all([ - listMissions().catch(() => []), - getCurrentMission().catch(() => null), - ]); - if (cancelled) return; - - fetchedRef.current = true; - setMissions(missionsData); - setCurrentMission(currentMissionData); - - if (!selectedMissionId && currentMissionData) { - setSelectedMissionId(currentMissionData.id); - } - hasShownError = false; - } catch (error) { - if (!hasShownError) { - toast.error('Failed to fetch missions'); - hasShownError = true; - } - console.error('Failed to fetch data:', error); - } finally { - if (!cancelled) { - setLoading(false); - } - } - }; - - fetchData(); - const interval = setInterval(fetchData, 5000); - return () => { - cancelled = true; - clearInterval(interval); - }; - }, [selectedMissionId]); - - const filteredMissions = useMemo(() => { - if (!searchQuery.trim()) return missions; - const query = searchQuery.toLowerCase(); - return missions.filter((m) => - m.title?.toLowerCase().includes(query) || - m.id.toLowerCase().includes(query) - ); - }, [missions, searchQuery]); - - const controlStateToStatus = (state: ControlRunState, missionStatus?: string): AgentNode['status'] => { - if (state === 'running' || state === 'waiting_for_tool') return 'running'; - if (missionStatus === 'completed') return 'completed'; - if (missionStatus === 'failed') return 'failed'; - if (missionStatus === 'interrupted') return 'pending'; // Show as pending (resumable) - return 'pending'; - }; - - // Build a basic agent tree from mission data when no real tree is available - const buildFallbackTree = useCallback((): AgentNode | null => { - if (!selectedMission) return null; - - const rootStatus = controlStateToStatus(controlState, selectedMission.status); - - return { - id: 'root', - type: 'Root', - status: rootStatus, - name: 'Root Agent', - description: selectedMission.title?.slice(0, 50) || 'Mission ' + selectedMission.id.slice(0, 8), - model: 'claude-sonnet-4.5', - budgetAllocated: 1000, - budgetSpent: 50, - children: [ - { - id: 'complexity', - type: 'ComplexityEstimator', - status: 'completed', - name: 'Complexity Estimator', - description: 'Estimate task difficulty', - model: 'claude-3.5-haiku', - budgetAllocated: 10, - budgetSpent: 5, - complexity: 0.7, - }, - { - id: 'model-selector', - type: 'ModelSelector', - status: 'completed', - name: 'Model Selector', - description: 'Select optimal model for task', - model: 'claude-3.5-haiku', - budgetAllocated: 10, - budgetSpent: 3, - }, - { - id: 'executor', - type: 'TaskExecutor', - status: rootStatus, - name: 'Task Executor', - description: 'Execute task using tools', - model: 'claude-sonnet-4.5', - budgetAllocated: 900, - budgetSpent: 35, - }, - { - id: 'verifier', - type: 'Verifier', - status: selectedMission.status === 'completed' ? 'completed' : - selectedMission.status === 'failed' ? 'failed' : 'pending', - name: 'Verifier', - description: 'Verify task completion', - model: 'claude-3.5-haiku', - budgetAllocated: 80, - budgetSpent: selectedMission.status === 'completed' ? 7 : 0, - }, - ] as AgentNode[], - }; - }, [selectedMission, controlState]); - - // Load tree for a specific mission - const loadMissionTree = useCallback(async (missionId: string) => { - try { - const tree = await getMissionTree(missionId); - if (tree) { - setRealTree(convertTreeNode(tree as unknown as Record)); - } else { - setRealTree(null); - } - } catch (e) { - console.error('Failed to load mission tree:', e); - setRealTree(null); - } - }, [convertTreeNode]); - - // Demo mode handlers - const startDemo = useCallback((mode: DemoMode) => { - // Cleanup previous demo - demoCleanupRef.current?.(); - - if (mode === 'off') { - setDemoMode('off'); - setDemoTree(null); - setDemoRunning(false); - return; - } - - // Generate demo tree - let tree: AgentNode; - switch (mode) { - case 'simple': - tree = generateSimpleTree(); - break; - case 'complex': - tree = generateComplexTree(); - break; - case 'deep': - tree = generateDeepTree(4); - break; - default: - return; - } - - setDemoMode(mode); - setDemoTree(tree); - setDemoRunning(true); - - // Start simulation - const cleanup = simulateTreeUpdates(tree, setDemoTree); - demoCleanupRef.current = cleanup; - }, []); - - const toggleDemoRunning = useCallback(() => { - if (demoRunning) { - demoCleanupRef.current?.(); - demoCleanupRef.current = null; - setDemoRunning(false); - } else if (demoTree) { - const cleanup = simulateTreeUpdates(demoTree, setDemoTree); - demoCleanupRef.current = cleanup; - setDemoRunning(true); - } - }, [demoRunning, demoTree]); - - // Load tree when selected mission changes (e.g., on initial page load) - // Track which mission's tree we last loaded to avoid redundant fetches - const lastLoadedMissionRef = useRef(null); - useEffect(() => { - if (selectedMissionId && selectedMissionId !== lastLoadedMissionRef.current) { - lastLoadedMissionRef.current = selectedMissionId; - loadMissionTree(selectedMissionId); - } - }, [selectedMissionId, loadMissionTree]); - - // Cleanup on unmount - useEffect(() => { - return () => { - demoCleanupRef.current?.(); - }; - }, []); - - // Use demo tree when in demo mode, otherwise use real tree or fallback - const displayTree = useMemo(() => { - if (demoMode !== 'off' && demoTree) { - return demoTree; - } - return realTree ?? buildFallbackTree(); - }, [demoMode, demoTree, realTree, buildFallbackTree]); - - const isActive = controlState !== 'idle'; - - return ( -
- {/* Mission selector sidebar */} -
-

Missions

- -
- - setSearchQuery(e.target.value)} - className="w-full rounded-lg border border-white/[0.06] bg-white/[0.02] py-2 pl-8 pr-3 text-xs text-white placeholder-white/30 focus:border-indigo-500/50 focus:outline-none transition-colors" - /> -
- - {isActive && currentMission && ( -
-
- - Active -
-

- {currentMission.title || 'Mission ' + currentMission.id.slice(0, 8)} -

-
- )} - -
- {loading ? ( - <> - - - - - ) : filteredMissions.length === 0 && !currentMission ? ( -

- {searchQuery ? 'No missions found' : 'No missions yet'} -

- ) : ( - <> - {currentMission && (!searchQuery || currentMission.title?.toLowerCase().includes(searchQuery.toLowerCase())) && ( - - )} - - {filteredMissions.filter(m => m.id !== currentMission?.id).map((mission) => ( - - ))} - - )} -
- - {/* Demo mode controls */} -
-
- - Demo Mode -
- -
-
- {(['simple', 'complex', 'deep'] as const).map((mode) => ( - - ))} -
- - {demoMode !== 'off' && ( -
- - -
- )} -
-
-
- - {/* Agent tree visualization */} -
- {/* Header */} -
-
-
- -
-
-
-

Agent Tree

- {demoMode !== 'off' && ( - - Demo: {demoMode} - - )} -
-

- {demoMode !== 'off' - ? 'Simulated agent tree with live updates' - : 'Hierarchical agent execution visualization' - } -

-
-
-
- - {/* Tree canvas */} -
- {!displayTree && (missions.length === 0 && !currentMission) ? ( -
-
- -
-

No active missions

-

- Start a conversation in the{' '} - - Control - {' '} - page or try Demo Mode in the sidebar -

-
- ) : ( - setSelectedNodeId(node?.id ?? null)} - className="w-full h-full rounded-2xl border border-white/[0.06]" - /> - )} -
-
-
- ); -} diff --git a/dashboard/src/app/console/console-wrapper.tsx b/dashboard/src/app/console/console-wrapper.tsx index c01f025..8891f1e 100644 --- a/dashboard/src/app/console/console-wrapper.tsx +++ b/dashboard/src/app/console/console-wrapper.tsx @@ -14,3 +14,6 @@ export function ConsoleWrapper() { + + + diff --git a/dashboard/src/app/control/control-client.tsx b/dashboard/src/app/control/control-client.tsx index 54308e0..f6fc9c3 100644 --- a/dashboard/src/app/control/control-client.tsx +++ b/dashboard/src/app/control/control-client.tsx @@ -20,6 +20,8 @@ import { getProgress, getRunningMissions, cancelMission, + listModels, + getModelDisplayName, type ControlRunState, type Mission, type MissionStatus, @@ -403,6 +405,9 @@ export default function ControlClient() { >([]); const [uploadQueue, setUploadQueue] = useState([]); + // Model selection state + const [availableModels, setAvailableModels] = useState([]); + // Check if the mission we're viewing is actually running (not just any mission) const viewingMissionIsRunning = useMemo(() => { if (!viewingMissionId) return runState !== "idle"; @@ -489,6 +494,31 @@ export default function ControlClient() { return () => document.removeEventListener("mousedown", handleClickOutside); }, []); + // Handle file upload - wrapped in useCallback to avoid stale closures + const handleFileUpload = useCallback(async (file: File) => { + setUploadQueue((prev) => [...prev, file.name]); + + try { + // Upload to mission-specific context folder if we have a mission + const contextPath = currentMission?.id + ? `/root/context/${currentMission.id}/` + : "/root/context/"; + const result = await uploadFile(file, contextPath); + toast.success(`Uploaded ${result.name}`); + + // Add a message about the upload + setInput((prev) => { + const uploadNote = `[Uploaded: ${result.name}]`; + return prev ? `${prev}\n${uploadNote}` : uploadNote; + }); + } catch (error) { + console.error("Upload failed:", error); + toast.error(`Failed to upload ${file.name}`); + } finally { + setUploadQueue((prev) => prev.filter((name) => name !== file.name)); + } + }, [currentMission?.id]); + // Handle paste to upload files useEffect(() => { const textarea = textareaRef.current; @@ -519,28 +549,7 @@ export default function ControlClient() { textarea.addEventListener("paste", handlePaste); return () => textarea.removeEventListener("paste", handlePaste); - }, []); - - // Handle file upload - const handleFileUpload = async (file: File) => { - setUploadQueue((prev) => [...prev, file.name]); - - try { - const result = await uploadFile(file, "/root/context/"); - toast.success(`Uploaded ${result.name} to /root/context/`); - - // Add a message about the upload - setInput((prev) => { - const uploadNote = `[Uploaded: ${result.name}]`; - return prev ? `${prev}\n${uploadNote}` : uploadNote; - }); - } catch (error) { - console.error("Upload failed:", error); - toast.error(`Failed to upload ${file.name}`); - } finally { - setUploadQueue((prev) => prev.filter((name) => name !== file.name)); - } - }; + }, [handleFileUpload]); // Handle file input change const handleFileChange = async ( @@ -629,6 +638,17 @@ export default function ControlClient() { return () => clearInterval(interval); }, []); + // Fetch available models for mission creation + useEffect(() => { + listModels() + .then((data) => { + setAvailableModels(data.models); + }) + .catch((err) => { + console.error("Failed to fetch models:", err); + }); + }, []); + // Handle cancelling a parallel mission const handleCancelMission = async (missionId: string) => { try { @@ -643,21 +663,37 @@ export default function ControlClient() { } }; + // Track the mission ID being fetched to prevent race conditions + const fetchingMissionIdRef = useRef(null); + // Handle switching which mission we're viewing const handleViewMission = useCallback( async (missionId: string) => { setViewingMissionId(missionId); + fetchingMissionIdRef.current = missionId; // Always load fresh history from API when switching missions // This ensures we don't show stale cached events try { const mission = await getMission(missionId); + + // Race condition guard: only update if this is still the mission we want + if (fetchingMissionIdRef.current !== missionId) { + return; // Another mission was requested, discard this response + } + const historyItems = missionHistoryToItems(mission); setItems(historyItems); // Update cache with fresh data setMissionItems((prev) => ({ ...prev, [missionId]: historyItems })); } catch (err) { console.error("Failed to load mission:", err); + + // Race condition guard: only update if this is still the mission we want + if (fetchingMissionIdRef.current !== missionId) { + return; + } + // Fallback to cached items if API fails if (missionItems[missionId]) { setItems(missionItems[missionId]); @@ -1141,27 +1177,41 @@ export default function ControlClient() { New Mission {showNewMissionDialog && ( -
+

Create New Mission

-
-
+
+ @@ -432,104 +434,72 @@ export default function HistoryPage() {
- )} - - {/* Active Tasks */} - {filteredTasks.length > 0 && ( -
-

- Active Tasks ({filteredTasks.length}) -

-
- - - - - - - - - - - - {filteredTasks.map((task) => { - const Icon = statusIcons[task.status]; - const config = statusConfig[task.status]; - return ( - - - - - - - - ); - })} - -
- Status - - Task - - Model - - Iterations - - Actions -
- - - {task.status} - - -
-

- {task.task} -

- -
-
- - {task.model.split("/").pop()} - - - - {task.iterations} - - -
- - View - - -
-
-
)} +
+ {/* Agent Tree Modal */} + {previewMissionId && ( +
{ + setPreviewMissionId(null); + setPreviewTree(null); + fetchingTreeMissionIdRef.current = null; + }} + > +
e.stopPropagation()} + > + {/* Modal Header */} +
+
+
+ +
+
+

Agent Tree

+

+ {missions.find((m) => m.id === previewMissionId)?.title?.slice(0, 50) || "Mission visualization"} +

+
+
+ +
+ + {/* Modal Content */} +
+ {loadingTree ? ( +
+ +

Loading agent tree...

+
+ ) : previewTree ? ( + + ) : ( +
+
+ +
+

No tree data available

+

+ Agent tree data is captured during mission execution. + This mission may have been completed before tree tracking was enabled. +

+
+ )} +
+
)}
diff --git a/dashboard/src/components/agent-tree/AgentTreeCanvas.tsx b/dashboard/src/components/agent-tree/AgentTreeCanvas.tsx index f473123..f01ed2a 100644 --- a/dashboard/src/components/agent-tree/AgentTreeCanvas.tsx +++ b/dashboard/src/components/agent-tree/AgentTreeCanvas.tsx @@ -94,6 +94,8 @@ interface AgentTreeCanvasProps { onSelectNode?: (node: AgentNode | null) => void; selectedNodeId?: string | null; className?: string; + /** Compact mode for embedded panels - hides minimap and details panel */ + compact?: boolean; } /** @@ -527,6 +529,7 @@ export function AgentTreeCanvas({ onSelectNode, selectedNodeId, className, + compact = false, }: AgentTreeCanvasProps) { const containerRef = useRef(null); const [dimensions, setDimensions] = useState({ width: 800, height: 600 }); @@ -566,8 +569,8 @@ export function AgentTreeCanvas({ useEffect(() => { if (layout.width > 0 && layout.height > 0 && dimensions.width > 0 && dimensions.height > 0) { // Calculate zoom to fit the tree in view with some padding - const paddingX = 80; - const paddingY = 80; + const paddingX = compact ? 40 : 80; + const paddingY = compact ? 40 : 80; const availableWidth = dimensions.width - paddingX; const availableHeight = dimensions.height - paddingY; @@ -575,8 +578,8 @@ export function AgentTreeCanvas({ const scaleY = availableHeight / layout.height; // Use the smaller scale to fit both dimensions - // Cap between 0.4 (minimum readable) and 1 (don't zoom in past 100%) - const MIN_ZOOM = 0.4; + // Cap between 0.3/0.4 (minimum readable) and 1 (don't zoom in past 100%) + const MIN_ZOOM = compact ? 0.3 : 0.4; const fitZoom = Math.max(MIN_ZOOM, Math.min(1, Math.min(scaleX, scaleY))); // Calculate pan to center horizontally, start from top with padding @@ -587,12 +590,12 @@ export function AgentTreeCanvas({ const scaledHeight = layout.height * fitZoom; const centerY = scaledHeight < availableHeight ? (dimensions.height - scaledHeight) / 2 - : 30; // Start near top if tree is too tall + : compact ? 20 : 30; // Start near top if tree is too tall setZoom(fitZoom); setPan({ x: centerX, y: centerY }); } - }, [layout.width, layout.height, dimensions.width, dimensions.height]); + }, [layout.width, layout.height, dimensions.width, dimensions.height, compact]); // Pan handlers const handleMouseDown = useCallback((e: React.MouseEvent) => { @@ -690,34 +693,40 @@ export function AgentTreeCanvas({ - {/* Mini-map */} - + {/* Mini-map - hidden in compact mode */} + {!compact && } {/* Zoom controls */} -
+
- {/* Node details panel */} - - {selectedNode && ( - onSelectNode?.(null)} - /> - )} - + {/* Node details panel - hidden in compact mode */} + {!compact && ( + + {selectedNode && ( + onSelectNode?.(null)} + /> + )} + + )}
); } diff --git a/dashboard/src/components/sidebar.tsx b/dashboard/src/components/sidebar.tsx index 958eb18..6a5bb47 100644 --- a/dashboard/src/components/sidebar.tsx +++ b/dashboard/src/components/sidebar.tsx @@ -10,7 +10,6 @@ import { LayoutDashboard, MessageSquare, Network, - History, Terminal, Settings, Plug, @@ -21,11 +20,10 @@ import { const navigation = [ { name: 'Overview', href: '/', icon: LayoutDashboard }, - { name: 'Control', href: '/control', icon: MessageSquare }, - { name: 'Agents', href: '/agents', icon: Network }, - { name: 'Modules', href: '/modules', icon: Plug }, + { name: 'Mission', href: '/control', icon: MessageSquare }, + { name: 'Agents', href: '/history', icon: Network }, { name: 'Console', href: '/console', icon: Terminal }, - { name: 'History', href: '/history', icon: History }, + { name: 'Modules', href: '/modules', icon: Plug }, { name: 'Settings', href: '/settings', icon: Settings }, ]; diff --git a/dashboard/src/components/ui/confirm-dialog.tsx b/dashboard/src/components/ui/confirm-dialog.tsx index 8ee7805..08f8196 100644 --- a/dashboard/src/components/ui/confirm-dialog.tsx +++ b/dashboard/src/components/ui/confirm-dialog.tsx @@ -121,3 +121,6 @@ export function ConfirmDialog({ + + + diff --git a/dashboard/src/components/ui/copy-button.tsx b/dashboard/src/components/ui/copy-button.tsx index e82dae2..b2d45d3 100644 --- a/dashboard/src/components/ui/copy-button.tsx +++ b/dashboard/src/components/ui/copy-button.tsx @@ -55,3 +55,6 @@ export function CopyButton({ text, className, label = 'Copied!', showOnHover = t + + + diff --git a/dashboard/src/components/ui/relative-time.tsx b/dashboard/src/components/ui/relative-time.tsx index 4242fb4..23ff0d0 100644 --- a/dashboard/src/components/ui/relative-time.tsx +++ b/dashboard/src/components/ui/relative-time.tsx @@ -51,3 +51,6 @@ export function RelativeTime({ date, className }: RelativeTimeProps) { + + + diff --git a/dashboard/src/components/ui/shimmer.tsx b/dashboard/src/components/ui/shimmer.tsx index 7c3ee8b..0362bba 100644 --- a/dashboard/src/components/ui/shimmer.tsx +++ b/dashboard/src/components/ui/shimmer.tsx @@ -87,3 +87,6 @@ export function ShimmerText({ lines = 3, className }: ShimmerProps & { lines?: n + + + diff --git a/dashboard/src/lib/api.ts b/dashboard/src/lib/api.ts index 81b5866..b8896c6 100644 --- a/dashboard/src/lib/api.ts +++ b/dashboard/src/lib/api.ts @@ -748,3 +748,61 @@ export async function uploadFile( return res.json(); } + +// ==================== Models ==================== + +export interface ModelsResponse { + models: string[]; + count: number; +} + +// List available models +export async function listModels(tier?: string): Promise { + const params = tier ? `?tier=${encodeURIComponent(tier)}` : ""; + const res = await apiFetch(`/api/models${params}`); + if (!res.ok) throw new Error("Failed to fetch models"); + return res.json(); +} + +// Friendly display names for models +const MODEL_DISPLAY_NAMES: Record = { + // OpenAI - simplified (newest first) + "openai/gpt-5.2-pro": "gpt-5.2-pro", + "openai/gpt-5.2": "gpt-5.2", + "openai/gpt-5.2-chat": "gpt-5.2", + "openai/gpt-4.1-mini": "gpt-4-mini", + "openai/gpt-4.1": "gpt-4", + "openai/o1": "o1", + "openai/o3-mini-high": "o3-mini", + // Anthropic - simplified + "anthropic/claude-sonnet-4.5": "4.5-sonnet", + "anthropic/claude-opus-4.5": "4.5-opus", + "anthropic/claude-haiku-4.5": "4.5-haiku", + // Google + "google/gemini-3-flash-preview": "gemini-3-flash", + "google/gemini-3-pro-image-preview": "gemini-3-pro", + // DeepSeek + "deepseek/deepseek-r1": "deepseek-r1", + "deepseek/deepseek-chat-v3-0324": "deepseek-v3", + // Qwen + "qwen/qwq-32b": "qwq-32b", + "qwen/qwen-2.5-72b-instruct": "qwen-72b", + "qwen/qwen3-next-80b-a3b-thinking": "qwen3-thinking", + // Mistral + "mistralai/mistral-small-24b-instruct-2501": "mistral-small", + "mistralai/mistral-medium-3.1": "mistral-medium", + "mistralai/mistral-large-2512": "mistral-large", + // Meta + "meta-llama/llama-3.1-405b": "llama-405b", + "meta-llama/llama-3.2-90b-vision-instruct": "llama-90b-vision", + "meta-llama/llama-3.3-70b-instruct:free": "llama-70b (free)", +}; + +// Get display name for a model +export function getModelDisplayName(modelId: string): string { + if (MODEL_DISPLAY_NAMES[modelId]) { + return MODEL_DISPLAY_NAMES[modelId]; + } + // Fallback: strip provider prefix + return modelId.includes("/") ? modelId.split("/").pop()! : modelId; +} diff --git a/dashboard/src/lib/settings.ts b/dashboard/src/lib/settings.ts index d9bd0dd..886b1b3 100644 --- a/dashboard/src/lib/settings.ts +++ b/dashboard/src/lib/settings.ts @@ -56,3 +56,6 @@ export function getRuntimeTaskDefaults(): { model?: string; budget_cents?: numbe + + + diff --git a/dashboard/src/lib/utils.ts b/dashboard/src/lib/utils.ts index ae1f583..684f11f 100644 --- a/dashboard/src/lib/utils.ts +++ b/dashboard/src/lib/utils.ts @@ -38,3 +38,6 @@ export function formatRelativeTime(date: Date): string { + + + diff --git a/ios_dashboard/OpenAgentDashboard/Models/Mission.swift b/ios_dashboard/OpenAgentDashboard/Models/Mission.swift index 07afdf8..b1e19ca 100644 --- a/ios_dashboard/OpenAgentDashboard/Models/Mission.swift +++ b/ios_dashboard/OpenAgentDashboard/Models/Mission.swift @@ -11,14 +11,35 @@ enum MissionStatus: String, Codable, CaseIterable { case active case completed case failed + case interrupted + case blocked + case notFeasible = "not_feasible" var statusType: StatusType { switch self { case .active: return .active case .completed: return .completed case .failed: return .failed + case .interrupted: return .interrupted + case .blocked: return .blocked + case .notFeasible: return .failed } } + + var displayLabel: String { + switch self { + case .active: return "Active" + case .completed: return "Completed" + case .failed: return "Failed" + case .interrupted: return "Interrupted" + case .blocked: return "Blocked" + case .notFeasible: return "Not Feasible" + } + } + + var canResume: Bool { + self == .interrupted || self == .blocked + } } struct MissionHistoryEntry: Codable, Identifiable { @@ -35,9 +56,12 @@ struct Mission: Codable, Identifiable, Hashable { let id: String var status: MissionStatus let title: String? + let modelOverride: String? let history: [MissionHistoryEntry] let createdAt: String let updatedAt: String + let interruptedAt: String? + let resumable: Bool func hash(into hasher: inout Hasher) { hasher.combine(id) @@ -48,9 +72,24 @@ struct Mission: Codable, Identifiable, Hashable { } enum CodingKeys: String, CodingKey { - case id, status, title, history + case id, status, title, history, resumable + case modelOverride = "model_override" case createdAt = "created_at" case updatedAt = "updated_at" + case interruptedAt = "interrupted_at" + } + + init(from decoder: Decoder) throws { + let container = try decoder.container(keyedBy: CodingKeys.self) + id = try container.decode(String.self, forKey: .id) + status = try container.decode(MissionStatus.self, forKey: .status) + title = try container.decodeIfPresent(String.self, forKey: .title) + modelOverride = try container.decodeIfPresent(String.self, forKey: .modelOverride) + history = try container.decode([MissionHistoryEntry].self, forKey: .history) + createdAt = try container.decode(String.self, forKey: .createdAt) + updatedAt = try container.decode(String.self, forKey: .updatedAt) + interruptedAt = try container.decodeIfPresent(String.self, forKey: .interruptedAt) + resumable = try container.decodeIfPresent(Bool.self, forKey: .resumable) ?? false } var displayTitle: String { @@ -60,11 +99,20 @@ struct Mission: Codable, Identifiable, Hashable { return "Untitled Mission" } + var displayModel: String? { + guard let model = modelOverride else { return nil } + return model.split(separator: "/").last.map(String.init) + } + var updatedDate: Date? { let formatter = ISO8601DateFormatter() formatter.formatOptions = [.withInternetDateTime, .withFractionalSeconds] return formatter.date(from: updatedAt) ?? ISO8601DateFormatter().date(from: updatedAt) } + + var canResume: Bool { + resumable && status.canResume + } } enum TaskStatus: String, Codable, CaseIterable { @@ -101,6 +149,66 @@ struct TaskState: Codable, Identifiable { } } +// MARK: - Parallel Execution + +struct RunningMissionInfo: Codable, Identifiable { + let missionId: String + let modelOverride: String? + let state: String + let queueLen: Int + let historyLen: Int + let secondsSinceActivity: Int + let expectedDeliverables: Int + + var id: String { missionId } + + enum CodingKeys: String, CodingKey { + case missionId = "mission_id" + case modelOverride = "model_override" + case state + case queueLen = "queue_len" + case historyLen = "history_len" + case secondsSinceActivity = "seconds_since_activity" + case expectedDeliverables = "expected_deliverables" + } + + // Memberwise initializer for previews and testing + init(missionId: String, modelOverride: String?, state: String, queueLen: Int, historyLen: Int, secondsSinceActivity: Int, expectedDeliverables: Int) { + self.missionId = missionId + self.modelOverride = modelOverride + self.state = state + self.queueLen = queueLen + self.historyLen = historyLen + self.secondsSinceActivity = secondsSinceActivity + self.expectedDeliverables = expectedDeliverables + } + + var isRunning: Bool { + state == "running" || state == "waiting_for_tool" + } + + var isStalled: Bool { + isRunning && secondsSinceActivity > 60 + } + + var displayModel: String { + guard let model = modelOverride else { return "Default" } + return model.split(separator: "/").last.map(String.init) ?? model + } +} + +struct ParallelConfig: Codable { + let maxParallelMissions: Int + let runningCount: Int + + enum CodingKeys: String, CodingKey { + case maxParallelMissions = "max_parallel_missions" + case runningCount = "running_count" + } +} + +// MARK: - Runs + struct Run: Codable, Identifiable { let id: String let createdAt: String diff --git a/ios_dashboard/OpenAgentDashboard/Services/APIService.swift b/ios_dashboard/OpenAgentDashboard/Services/APIService.swift index c45d870..21ce426 100644 --- a/ios_dashboard/OpenAgentDashboard/Services/APIService.swift +++ b/ios_dashboard/OpenAgentDashboard/Services/APIService.swift @@ -98,6 +98,32 @@ final class APIService { let _: EmptyResponse = try await post("/api/control/missions/\(id)/status", body: StatusRequest(status: status.rawValue)) } + func resumeMission(id: String) async throws -> Mission { + try await post("/api/control/missions/\(id)/resume", body: EmptyBody()) + } + + func cancelMission(id: String) async throws { + let _: EmptyResponse = try await post("/api/control/missions/\(id)/cancel", body: EmptyBody()) + } + + // MARK: - Parallel Missions + + func getRunningMissions() async throws -> [RunningMissionInfo] { + try await get("/api/control/running") + } + + func startMissionParallel(id: String, content: String, model: String? = nil) async throws { + struct ParallelRequest: Encodable { + let content: String + let model: String? + } + let _: EmptyResponse = try await post("/api/control/missions/\(id)/parallel", body: ParallelRequest(content: content, model: model)) + } + + func getParallelConfig() async throws -> ParallelConfig { + try await get("/api/control/parallel/config") + } + // MARK: - Control func sendMessage(content: String) async throws -> (id: String, queued: Bool) { diff --git a/ios_dashboard/OpenAgentDashboard/Views/Components/RunningMissionsBar.swift b/ios_dashboard/OpenAgentDashboard/Views/Components/RunningMissionsBar.swift new file mode 100644 index 0000000..c31fc28 --- /dev/null +++ b/ios_dashboard/OpenAgentDashboard/Views/Components/RunningMissionsBar.swift @@ -0,0 +1,269 @@ +// +// RunningMissionsBar.swift +// OpenAgentDashboard +// +// Compact horizontal bar showing currently running missions +// Allows switching between parallel missions +// + +import SwiftUI + +struct RunningMissionsBar: View { + let runningMissions: [RunningMissionInfo] + let currentMission: Mission? + let viewingMissionId: String? + let onSelectMission: (String) -> Void + let onCancelMission: (String) -> Void + let onRefresh: () -> Void + + var body: some View { + ScrollView(.horizontal, showsIndicators: false) { + HStack(spacing: 8) { + // Header with refresh button + headerView + + // Current mission if not in running list + if let mission = currentMission, + !runningMissions.contains(where: { $0.missionId == mission.id }) { + currentMissionChip(mission) + } + + // Running missions + ForEach(runningMissions) { mission in + runningMissionChip(mission) + } + } + .padding(.horizontal, 16) + .padding(.vertical, 8) + } + .background(.ultraThinMaterial) + } + + // MARK: - Header + + private var headerView: some View { + HStack(spacing: 6) { + Image(systemName: "square.stack.3d.up") + .font(.system(size: 12, weight: .medium)) + .foregroundStyle(Theme.textTertiary) + + Text("Running") + .font(.caption.weight(.medium)) + .foregroundStyle(Theme.textTertiary) + + Text("(\(runningMissions.count))") + .font(.caption) + .foregroundStyle(Theme.textMuted) + + Button(action: onRefresh) { + Image(systemName: "arrow.clockwise") + .font(.system(size: 10, weight: .medium)) + .foregroundStyle(Theme.textMuted) + } + .padding(4) + .contentShape(Rectangle()) + } + } + + // MARK: - Current Mission Chip + + private func currentMissionChip(_ mission: Mission) -> some View { + let isViewing = viewingMissionId == mission.id + + return Button { + onSelectMission(mission.id) + } label: { + HStack(spacing: 6) { + // Status dot + Circle() + .fill(Theme.success) + .frame(width: 6, height: 6) + + // Model name + Text(mission.displayModel ?? "Default") + .font(.caption.weight(.medium)) + .foregroundStyle(Theme.textPrimary) + .lineLimit(1) + + // Mission ID + Text(String(mission.id.prefix(8))) + .font(.system(size: 9).monospaced()) + .foregroundStyle(Theme.textMuted) + + // Selection indicator + if isViewing { + Image(systemName: "checkmark") + .font(.system(size: 9, weight: .bold)) + .foregroundStyle(Theme.accent) + } + } + .padding(.horizontal, 10) + .padding(.vertical, 6) + .background(isViewing ? Theme.accent.opacity(0.15) : Color.white.opacity(0.05)) + .clipShape(RoundedRectangle(cornerRadius: 8, style: .continuous)) + .overlay( + RoundedRectangle(cornerRadius: 8, style: .continuous) + .stroke(isViewing ? Theme.accent.opacity(0.3) : Theme.border, lineWidth: 1) + ) + } + .buttonStyle(.plain) + } + + // MARK: - Running Mission Chip + + private func runningMissionChip(_ mission: RunningMissionInfo) -> some View { + let isViewing = viewingMissionId == mission.missionId + let isStalled = mission.isStalled + // Only show severely stalled state for running missions + let isSeverlyStalled = mission.isRunning && mission.secondsSinceActivity > 120 + + let borderColor: Color = { + if isViewing { return Theme.accent.opacity(0.3) } + if isSeverlyStalled { return Theme.error.opacity(0.3) } + if isStalled { return Theme.warning.opacity(0.3) } + return Theme.border + }() + + let backgroundColor: Color = { + if isViewing { return Theme.accent.opacity(0.15) } + if isSeverlyStalled { return Theme.error.opacity(0.1) } + if isStalled { return Theme.warning.opacity(0.1) } + return Color.white.opacity(0.05) + }() + + return HStack(spacing: 6) { + // Tap area for selection + Button { + onSelectMission(mission.missionId) + } label: { + HStack(spacing: 6) { + // Status dot with animation + Circle() + .fill(statusColor(for: mission)) + .frame(width: 6, height: 6) + .overlay { + if mission.isRunning && !isStalled { + Circle() + .stroke(statusColor(for: mission).opacity(0.5), lineWidth: 1.5) + .frame(width: 10, height: 10) + .opacity(0.6) + } + } + + // Model name + Text(mission.displayModel) + .font(.caption.weight(.medium)) + .foregroundStyle(Theme.textPrimary) + .lineLimit(1) + + // Mission ID + Text(String(mission.missionId.prefix(8))) + .font(.system(size: 9).monospaced()) + .foregroundStyle(Theme.textMuted) + + // Stalled indicator + if isStalled { + HStack(spacing: 2) { + Image(systemName: "exclamationmark.triangle.fill") + .font(.system(size: 8)) + Text("\(mission.secondsSinceActivity)s") + .font(.system(size: 9).monospaced()) + } + .foregroundStyle(isSeverlyStalled ? Theme.error : Theme.warning) + } + + // Selection indicator + if isViewing { + Image(systemName: "checkmark") + .font(.system(size: 9, weight: .bold)) + .foregroundStyle(Theme.accent) + } + } + } + .buttonStyle(.plain) + + // Cancel button + Button { + onCancelMission(mission.missionId) + } label: { + Image(systemName: "xmark") + .font(.system(size: 9, weight: .medium)) + .foregroundStyle(Theme.textMuted) + .frame(width: 18, height: 18) + .background(Color.white.opacity(0.05)) + .clipShape(Circle()) + } + .buttonStyle(.plain) + } + .padding(.leading, 10) + .padding(.trailing, 6) + .padding(.vertical, 6) + .background(backgroundColor) + .clipShape(RoundedRectangle(cornerRadius: 8, style: .continuous)) + .overlay( + RoundedRectangle(cornerRadius: 8, style: .continuous) + .stroke(borderColor, lineWidth: 1) + ) + } + + // MARK: - Helpers + + private func statusColor(for mission: RunningMissionInfo) -> Color { + // Only show stalled/severely-stalled states for running missions + if mission.isRunning && mission.secondsSinceActivity > 120 { + return Theme.error + } else if mission.isStalled { + return Theme.warning + } else if mission.isRunning { + return Theme.success + } else { + return Theme.warning + } + } +} + +// MARK: - Preview + +#Preview { + VStack(spacing: 0) { + RunningMissionsBar( + runningMissions: [ + RunningMissionInfo( + missionId: "abc12345-6789-0000-0000-000000000001", + modelOverride: "deepseek/deepseek-v3.2", + state: "running", + queueLen: 0, + historyLen: 5, + secondsSinceActivity: 15, + expectedDeliverables: 0 + ), + RunningMissionInfo( + missionId: "def12345-6789-0000-0000-000000000002", + modelOverride: "qwen/qwen3-235b", + state: "running", + queueLen: 1, + historyLen: 3, + secondsSinceActivity: 75, + expectedDeliverables: 0 + ), + RunningMissionInfo( + missionId: "ghi12345-6789-0000-0000-000000000003", + modelOverride: nil, + state: "running", + queueLen: 0, + historyLen: 10, + secondsSinceActivity: 150, + expectedDeliverables: 0 + ) + ], + currentMission: nil, + viewingMissionId: "abc12345-6789-0000-0000-000000000001", + onSelectMission: { _ in }, + onCancelMission: { _ in }, + onRefresh: {} + ) + + Spacer() + } + .background(Theme.backgroundPrimary) +} diff --git a/ios_dashboard/OpenAgentDashboard/Views/Components/StatusBadge.swift b/ios_dashboard/OpenAgentDashboard/Views/Components/StatusBadge.swift index e890846..a0e8c22 100644 --- a/ios_dashboard/OpenAgentDashboard/Views/Components/StatusBadge.swift +++ b/ios_dashboard/OpenAgentDashboard/Views/Components/StatusBadge.swift @@ -19,6 +19,8 @@ enum StatusType { case connected case disconnected case connecting + case interrupted + case blocked var color: Color { switch self { @@ -32,6 +34,8 @@ enum StatusType { return Theme.error case .cancelled, .disconnected: return Theme.textTertiary + case .interrupted, .blocked: + return Theme.warning } } @@ -52,6 +56,8 @@ enum StatusType { case .connected: return "Connected" case .disconnected: return "Disconnected" case .connecting: return "Connecting" + case .interrupted: return "Interrupted" + case .blocked: return "Blocked" } } @@ -66,6 +72,8 @@ enum StatusType { case .idle: return "moon.fill" case .connected: return "wifi" case .disconnected: return "wifi.slash" + case .interrupted: return "pause.circle.fill" + case .blocked: return "exclamationmark.triangle.fill" } } diff --git a/ios_dashboard/OpenAgentDashboard/Views/Control/ControlView.swift b/ios_dashboard/OpenAgentDashboard/Views/Control/ControlView.swift index 37aa95d..04d455a 100644 --- a/ios_dashboard/OpenAgentDashboard/Views/Control/ControlView.swift +++ b/ios_dashboard/OpenAgentDashboard/Views/Control/ControlView.swift @@ -21,6 +21,15 @@ struct ControlView: View { @State private var isAtBottom = true @State private var copiedMessageId: String? + // Parallel missions state + @State private var runningMissions: [RunningMissionInfo] = [] + @State private var viewingMissionId: String? + @State private var showRunningMissions = false + @State private var pollingTask: Task? + + // Track pending fetch to prevent race conditions + @State private var fetchingMissionId: String? + @FocusState private var isInputFocused: Bool private let api = APIService.shared @@ -36,6 +45,11 @@ struct ControlView: View { backgroundGlows VStack(spacing: 0) { + // Running missions bar (when there are parallel missions) + if showRunningMissions && (!runningMissions.isEmpty || currentMission != nil) { + runningMissionsBar + } + // Messages messagesView @@ -76,6 +90,26 @@ struct ControlView: View { } } + ToolbarItem(placement: .topBarLeading) { + // Running missions toggle + Button { + withAnimation(.easeInOut(duration: 0.2)) { + showRunningMissions.toggle() + } + HapticService.selectionChanged() + } label: { + HStack(spacing: 4) { + Image(systemName: "square.stack.3d.up") + .font(.system(size: 14)) + if !runningMissions.isEmpty { + Text("\(runningMissions.count)") + .font(.caption2.weight(.semibold)) + } + } + .foregroundStyle(showRunningMissions ? Theme.accent : Theme.textSecondary) + } + } + ToolbarItem(placement: .topBarTrailing) { Menu { Button { @@ -87,6 +121,15 @@ struct ControlView: View { if let mission = currentMission { Divider() + // Resume button for interrupted/blocked missions + if mission.canResume { + Button { + Task { await resumeMission() } + } label: { + Label("Resume Mission", systemImage: "play.circle") + } + } + Button { Task { await setMissionStatus(.completed) } } label: { @@ -99,7 +142,7 @@ struct ControlView: View { Label("Mark Failed", systemImage: "xmark.circle") } - if mission.status != .active { + if mission.status != .active && !mission.canResume { Button { Task { await setMissionStatus(.active) } } label: { @@ -117,10 +160,22 @@ struct ControlView: View { // Check if we're being opened with a specific mission from History if let pendingId = nav.consumePendingMission() { await loadMission(id: pendingId) + viewingMissionId = pendingId } else { await loadCurrentMission() + viewingMissionId = currentMission?.id } + + // Fetch initial running missions + await refreshRunningMissions() + + // Auto-show bar if there are multiple running missions + if runningMissions.count > 1 { + showRunningMissions = true + } + startStreaming() + startPollingRunningMissions() } .onChange(of: nav.pendingMissionId) { _, newId in // Handle navigation from History while Control is already visible @@ -128,15 +183,42 @@ struct ControlView: View { nav.pendingMissionId = nil Task { await loadMission(id: missionId) + viewingMissionId = missionId } } - + } + .onChange(of: currentMission?.id) { _, newId in + // Sync viewingMissionId with currentMission when it changes + if viewingMissionId == nil, let id = newId { + viewingMissionId = id + } } .onDisappear { streamTask?.cancel() + pollingTask?.cancel() } } + // MARK: - Running Missions Bar + + private var runningMissionsBar: some View { + RunningMissionsBar( + runningMissions: runningMissions, + currentMission: currentMission, + viewingMissionId: viewingMissionId, + onSelectMission: { missionId in + Task { await switchToMission(id: missionId) } + }, + onCancelMission: { missionId in + Task { await cancelMission(id: missionId) } + }, + onRefresh: { + Task { await refreshRunningMissions() } + } + ) + .transition(.move(edge: .top).combined(with: .opacity)) + } + // MARK: - Background private var backgroundGlows: some View { @@ -448,6 +530,7 @@ struct ControlView: View { do { if let mission = try await api.getCurrentMission() { currentMission = mission + viewingMissionId = mission.id messages = mission.history.enumerated().map { index, entry in ChatMessage( id: "\(mission.id)-\(index)", @@ -467,28 +550,40 @@ struct ControlView: View { } private func loadMission(id: String) async { + // Set target immediately for race condition tracking + fetchingMissionId = id + isLoading = true - defer { isLoading = false } do { - let missions = try await api.listMissions() - if let mission = missions.first(where: { $0.id == id }) { - currentMission = mission - messages = mission.history.enumerated().map { index, entry in - ChatMessage( - id: "\(mission.id)-\(index)", - type: entry.isUser ? .user : .assistant(success: true, costCents: 0, model: nil), - content: entry.content - ) - } - HapticService.success() - - // Scroll to bottom after loading - DispatchQueue.main.asyncAfter(deadline: .now() + 0.1) { - shouldScrollToBottom = true - } + let mission = try await api.getMission(id: id) + + // Race condition guard: only update if this is still the mission we want + guard fetchingMissionId == id else { + return // Another mission was requested, discard this response + } + + currentMission = mission + viewingMissionId = mission.id + messages = mission.history.enumerated().map { index, entry in + ChatMessage( + id: "\(mission.id)-\(index)", + type: entry.isUser ? .user : .assistant(success: true, costCents: 0, model: nil), + content: entry.content + ) + } + isLoading = false + HapticService.success() + + // Scroll to bottom after loading + DispatchQueue.main.asyncAfter(deadline: .now() + 0.1) { + shouldScrollToBottom = true } } catch { + // Race condition guard + guard fetchingMissionId == id else { return } + + isLoading = false print("Failed to load mission: \(error)") } } @@ -497,7 +592,19 @@ struct ControlView: View { do { let mission = try await api.createMission() currentMission = mission + viewingMissionId = mission.id messages = [] + + // Refresh running missions to show the new mission + await refreshRunningMissions() + + // Show the bar when creating new missions + if !showRunningMissions && !runningMissions.isEmpty { + withAnimation(.easeInOut(duration: 0.2)) { + showRunningMissions = true + } + } + HapticService.success() } catch { print("Failed to create mission: \(error)") @@ -518,6 +625,33 @@ struct ControlView: View { } } + private func resumeMission() async { + guard let mission = currentMission, mission.canResume else { return } + + do { + let resumed = try await api.resumeMission(id: mission.id) + currentMission = resumed + viewingMissionId = resumed.id + // Reload messages to get the resume prompt + messages = resumed.history.enumerated().map { index, entry in + ChatMessage( + id: "\(resumed.id)-\(index)", + type: entry.isUser ? .user : .assistant(success: true, costCents: 0, model: nil), + content: entry.content + ) + } + + // Refresh running missions + await refreshRunningMissions() + + HapticService.success() + shouldScrollToBottom = true + } catch { + print("Failed to resume mission: \(error)") + HapticService.error() + } + } + private func sendMessage() { let content = inputText.trimmingCharacters(in: .whitespacesAndNewlines) guard !content.isEmpty else { return } @@ -553,25 +687,119 @@ struct ControlView: View { } } + // MARK: - Parallel Missions + + private func refreshRunningMissions() async { + do { + runningMissions = try await api.getRunningMissions() + } catch { + print("Failed to refresh running missions: \(error)") + } + } + + private func startPollingRunningMissions() { + pollingTask = Task { + while !Task.isCancelled { + try? await Task.sleep(for: .seconds(3)) + guard !Task.isCancelled else { break } + await refreshRunningMissions() + } + } + } + + private func switchToMission(id: String) async { + guard id != viewingMissionId else { return } + + // Set the target mission ID immediately for race condition tracking + viewingMissionId = id + fetchingMissionId = id + + isLoading = true + + do { + // Load the mission from API + let mission = try await api.getMission(id: id) + + // Race condition guard: only update if this is still the mission we want + guard fetchingMissionId == id else { + return // Another mission was requested, discard this response + } + + // If this is not a parallel mission, also update currentMission + if runningMissions.contains(where: { $0.missionId == id }) { + // This is a parallel mission - just load its history + messages = mission.history.enumerated().map { index, entry in + ChatMessage( + id: "\(mission.id)-\(index)", + type: entry.isUser ? .user : .assistant(success: true, costCents: 0, model: nil), + content: entry.content + ) + } + } else { + // This is the main mission - load it fully + currentMission = mission + messages = mission.history.enumerated().map { index, entry in + ChatMessage( + id: "\(mission.id)-\(index)", + type: entry.isUser ? .user : .assistant(success: true, costCents: 0, model: nil), + content: entry.content + ) + } + } + + isLoading = false + HapticService.selectionChanged() + shouldScrollToBottom = true + } catch { + // Race condition guard: only show error if this is still the mission we want + guard fetchingMissionId == id else { return } + + isLoading = false + print("Failed to switch mission: \(error)") + HapticService.error() + } + } + + private func cancelMission(id: String) async { + do { + try await api.cancelMission(id: id) + + // Refresh running missions + await refreshRunningMissions() + + // If we were viewing this mission, switch to current + if viewingMissionId == id { + if let currentId = currentMission?.id { + await switchToMission(id: currentId) + } + } + + HapticService.success() + } catch { + print("Failed to cancel mission: \(error)") + HapticService.error() + } + } + private func handleStreamEvent(type: String, data: [String: Any]) { - // Filter events by mission_id - only show events for the current mission + // Filter events by mission_id - only show events for the mission we're viewing // This prevents cross-mission contamination when parallel missions are running let eventMissionId = data["mission_id"] as? String - let currentMissionId = currentMission?.id + let viewingId = viewingMissionId + let currentId = currentMission?.id // Only allow status events from any mission (for global state) - // All other events must match the current mission + // All other events must match the mission we're viewing if type != "status" { if let eventId = eventMissionId { - // Event has a mission_id - must match current mission - if eventId != currentMissionId { + // Event has a mission_id - must match viewing mission + if eventId != viewingId { return // Skip events from other missions } - } else if currentMissionId != nil { + } else if viewingId != nil && viewingId != currentId { // Event has NO mission_id (from main session) - // This is fine if we're on the current/main mission - // But we can't verify, so allow it for now - // TODO: Backend should always include mission_id + // Skip if we're viewing a different (parallel) mission + return } } diff --git a/ios_dashboard/OpenAgentDashboard/Views/Files/FilesView.swift b/ios_dashboard/OpenAgentDashboard/Views/Files/FilesView.swift index cb89981..5c70439 100644 --- a/ios_dashboard/OpenAgentDashboard/Views/Files/FilesView.swift +++ b/ios_dashboard/OpenAgentDashboard/Views/Files/FilesView.swift @@ -22,6 +22,9 @@ struct FilesView: View { @State private var newFolderName = "" @State private var isImporting = false + // Track pending path fetch to prevent race conditions + @State private var fetchingPath: String? + private let api = APIService.shared private var sortedEntries: [FileEntry] { @@ -360,16 +363,32 @@ struct FilesView: View { // MARK: - Actions private func loadDirectory() async { + let pathToLoad = currentPath + fetchingPath = pathToLoad + isLoading = true errorMessage = nil do { - entries = try await api.listDirectory(path: currentPath) + let result = try await api.listDirectory(path: pathToLoad) + + // Race condition guard: only update if this is still the path we want + guard fetchingPath == pathToLoad else { + return // Navigation changed, discard this response + } + + entries = result } catch { + // Race condition guard + guard fetchingPath == pathToLoad else { return } + errorMessage = error.localizedDescription } - isLoading = false + // Only clear loading if this is still the current fetch + if fetchingPath == pathToLoad { + isLoading = false + } } private func navigateTo(_ path: String) { diff --git a/ios_dashboard/OpenAgentDashboard/Views/History/HistoryView.swift b/ios_dashboard/OpenAgentDashboard/Views/History/HistoryView.swift index 035aa84..cf88119 100644 --- a/ios_dashboard/OpenAgentDashboard/Views/History/HistoryView.swift +++ b/ios_dashboard/OpenAgentDashboard/Views/History/HistoryView.swift @@ -22,15 +22,17 @@ struct HistoryView: View { enum StatusFilter: String, CaseIterable { case all = "All" case active = "Active" + case interrupted = "Interrupted" case completed = "Completed" case failed = "Failed" - var missionStatus: MissionStatus? { + var missionStatuses: [MissionStatus]? { switch self { case .all: return nil - case .active: return .active - case .completed: return .completed - case .failed: return .failed + case .active: return [.active] + case .interrupted: return [.interrupted, .blocked] + case .completed: return [.completed] + case .failed: return [.failed, .notFeasible] } } } @@ -38,7 +40,7 @@ struct HistoryView: View { private var filteredMissions: [Mission] { missions.filter { mission in // Filter by status - if let statusFilter = selectedFilter.missionStatus, mission.status != statusFilter { + if let statuses = selectedFilter.missionStatuses, !statuses.contains(mission.status) { return false } @@ -257,11 +259,11 @@ private struct MissionRow: View { var body: some View { HStack(spacing: 14) { // Icon - Image(systemName: "target") + Image(systemName: mission.canResume ? "play.circle" : "target") .font(.title3) - .foregroundStyle(Theme.accent) + .foregroundStyle(mission.canResume ? Theme.warning : Theme.accent) .frame(width: 40, height: 40) - .background(Theme.accent.opacity(0.15)) + .background((mission.canResume ? Theme.warning : Theme.accent).opacity(0.15)) .clipShape(RoundedRectangle(cornerRadius: 10, style: .continuous)) // Content @@ -274,9 +276,23 @@ private struct MissionRow: View { HStack(spacing: 8) { StatusBadge(status: mission.status.statusType, compact: true) + if mission.canResume { + Text("Resumable") + .font(.caption2.weight(.medium)) + .foregroundStyle(Theme.warning) + } + Text("\(mission.history.count) messages") .font(.caption) .foregroundStyle(Theme.textTertiary) + + if let model = mission.displayModel { + Text("•") + .foregroundStyle(Theme.textMuted) + Text(model) + .font(.caption2.monospaced()) + .foregroundStyle(Theme.textTertiary) + } } } @@ -300,7 +316,7 @@ private struct MissionRow: View { .clipShape(RoundedRectangle(cornerRadius: 14, style: .continuous)) .overlay( RoundedRectangle(cornerRadius: 14, style: .continuous) - .stroke(Theme.border, lineWidth: 0.5) + .stroke(mission.canResume ? Theme.warning.opacity(0.3) : Theme.border, lineWidth: mission.canResume ? 1 : 0.5) ) } } diff --git a/ios_dashboard/README.md b/ios_dashboard/README.md index 2155278..02017dd 100644 --- a/ios_dashboard/README.md +++ b/ios_dashboard/README.md @@ -5,10 +5,27 @@ Native iOS dashboard for Open Agent with **Liquid Glass** design language. ## Features - **Control** - Chat interface with the AI agent, real-time streaming -- **History** - View past missions, tasks, and runs +- **History** - View past missions with filtering (active, interrupted, completed, failed) - **Terminal** - SSH console via WebSocket - **Files** - Remote file explorer with upload/download +### Mission Management + +- Create new missions with optional model override +- Resume interrupted or blocked missions +- Mark missions as completed/failed +- View mission status (active, completed, failed, interrupted, blocked, not_feasible) +- Model override display per mission + +### Parallel Missions + +- View all running missions in a compact horizontal bar +- Switch between parallel missions with a single tap +- Real-time status indicators (running, stalled, severely stalled) +- Cancel running missions directly from the bar +- Automatic polling for running mission updates (every 3s) +- SSE event filtering by mission_id to prevent cross-contamination + ## Design System Built with "Quiet Luxury + Liquid Glass" aesthetic: @@ -77,6 +94,12 @@ ios_dashboard/ │ │ ├── Terminal/ # SSH console │ │ ├── Files/ # File explorer │ │ └── Components/ # Reusable UI +│ │ ├── GlassButton.swift +│ │ ├── GlassCard.swift +│ │ ├── StatusBadge.swift +│ │ ├── LoadingView.swift +│ │ ├── RunningMissionsBar.swift # Parallel missions UI +│ │ └── ToolUI/ # Tool UI components │ └── Assets.xcassets/ └── OpenAgentDashboard.xcodeproj/ ``` diff --git a/models_with_benchmarks.json b/models_with_benchmarks.json index 5e51390..c690ac3 100644 --- a/models_with_benchmarks.json +++ b/models_with_benchmarks.json @@ -1,7 +1,7 @@ { - "generated_at": "2025-12-19T07:20:20Z", + "generated_at": "2025-12-22T20:23:08Z", "total_models": 350, - "models_with_benchmarks": 158, + "models_with_benchmarks": 157, "categories": [ "code", "math", @@ -35,6 +35,21 @@ ], "tier": "fast" }, + "gpt-5": { + "latest": "openai/gpt-5.2-chat", + "members": [ + "openai/gpt-5.2-chat", + "openai/gpt-5.2" + ], + "tier": "mid" + }, + "gpt-5-pro": { + "latest": "openai/gpt-5.2-pro", + "members": [ + "openai/gpt-5.2-pro" + ], + "tier": "flagship" + }, "mistral-large": { "latest": "mistralai/mistral-large-2512", "members": [ @@ -85,6 +100,13 @@ ], "tier": "mid" }, + "qwen3-thinking": { + "latest": "qwen/qwen3-next-80b-a3b-thinking", + "members": [ + "qwen/qwen3-next-80b-a3b-thinking" + ], + "tier": "flagship" + }, "deepseek-chat": { "latest": "deepseek/deepseek-chat-v3-0324", "members": [ @@ -204,6 +226,10 @@ "mistralai/mistral-small-creative": "mistralai/mistral-small-24b-instruct-2501", "mistral-small-creative": "mistralai/mistral-small-24b-instruct-2501", "mistral-small": "mistralai/mistral-small-24b-instruct-2501", + "openai/gpt-5.2": "openai/gpt-5.2-chat", + "gpt-5.2": "openai/gpt-5.2-chat", + "gpt-5": "openai/gpt-5.2-chat", + "gpt-5-pro": "openai/gpt-5.2-pro", "mistralai/mistral-large-2411": "mistralai/mistral-large-2512", "mistral-large-2411": "mistralai/mistral-large-2512", "mistralai/mistral-large-2407": "mistralai/mistral-large-2512", @@ -244,6 +270,7 @@ "claude-sonnet": "anthropic/claude-sonnet-4.5", "sonnet": "anthropic/claude-sonnet-4.5", "claude sonnet": "anthropic/claude-sonnet-4.5", + "qwen3-thinking": "qwen/qwen3-next-80b-a3b-thinking", "deepseek/deepseek-chat-v3.1": "deepseek/deepseek-chat-v3-0324", "deepseek-chat-v3.1": "deepseek/deepseek-chat-v3-0324", "deepseek/deepseek-chat": "deepseek/deepseek-chat-v3-0324", @@ -280,6 +307,33 @@ "llama-3-405b": "meta-llama/llama-3.1-405b" }, "models": [ + { + "id": "z-ai/glm-4.7", + "name": "Z.AI: GLM 4.7", + "context_length": 202752, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "Other", + "instruct_type": null + }, + "pricing": { + "prompt": "0.00000044", + "completion": "0.00000174", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0", + "input_cache_read": "0" + }, + "benchmarks": null, + "category_scores": null + }, { "id": "google/gemini-3-flash-preview", "name": "Google: Gemini 3 Flash Preview", @@ -661,8 +715,8 @@ "instruct_type": null }, "pricing": { - "prompt": "0.00000015", - "completion": "0.0000006", + "prompt": "0.00000005", + "completion": "0.00000022", "request": "0", "image": "0", "web_search": "0", @@ -1113,8 +1167,8 @@ "instruct_type": null }, "pricing": { - "prompt": "0.00000024", - "completion": "0.00000038", + "prompt": "0.000000239", + "completion": "0.000000378", "request": "0", "image": "0", "web_search": "0", @@ -1660,8 +1714,8 @@ "instruct_type": null }, "pricing": { - "prompt": "0.00000045", - "completion": "0.00000235", + "prompt": "0.0000004", + "completion": "0.00000175", "request": "0", "image": "0", "web_search": "0", @@ -3354,8 +3408,8 @@ "instruct_type": null }, "pricing": { - "prompt": "0.00000015", - "completion": "0.00000075", + "prompt": "0.0000002", + "completion": "0.0000008", "request": "0", "image": "0", "web_search": "0", @@ -4471,7 +4525,7 @@ { "id": "qwen/qwen3-coder-30b-a3b-instruct", "name": "Qwen: Qwen3 Coder 30B A3B Instruct", - "context_length": 262144, + "context_length": 160000, "architecture": { "modality": "text->text", "input_modalities": [ @@ -4484,8 +4538,8 @@ "instruct_type": null }, "pricing": { - "prompt": "0.00000006", - "completion": "0.00000025", + "prompt": "0.00000007", + "completion": "0.00000027", "request": "0", "image": "0", "web_search": "0", @@ -6711,60 +6765,6 @@ "general": 0.938 } }, - { - "id": "qwen/qwen3-235b-a22b:free", - "name": "Qwen: Qwen3 235B A22B (free)", - "context_length": 131072, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "Qwen3", - "instruct_type": "qwen3" - }, - "pricing": { - "prompt": "0", - "completion": "0", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": { - "code": { - "livecodebench": 0.707, - "mbpp": 0.814 - }, - "math": { - "aime-2025": 0.815, - "aime-2024": 0.857, - "gsm8k": 0.9439, - "math": 0.7184 - }, - "reasoning": { - "gpqa": 0.4747, - "mmlu-pro": 0.6818, - "mmlu": 0.8781 - }, - "tool_calling": { - "bfcl": 0.708 - }, - "general": { - "arena-hard": 0.956 - } - }, - "category_scores": { - "code": 0.7605, - "math": 0.8336, - "reasoning": 0.6782, - "tool_calling": 0.708, - "general": 0.956 - } - }, { "id": "qwen/qwen3-235b-a22b", "name": "Qwen: Qwen3 235B A22B", @@ -8632,8 +8632,8 @@ "instruct_type": null }, "pricing": { - "prompt": "0.0000002", - "completion": "0.0000002", + "prompt": "0.0000008", + "completion": "0.0000016", "request": "0", "image": "0", "web_search": "0", @@ -8713,8 +8713,8 @@ "instruct_type": null }, "pricing": { - "prompt": "0.00000003", - "completion": "0.00000013", + "prompt": "0.00000007", + "completion": "0.00000026", "request": "0", "image": "0", "web_search": "0", @@ -9025,7 +9025,7 @@ }, "pricing": { "prompt": "0.00000003", - "completion": "0.00000013", + "completion": "0.00000011", "request": "0", "image": "0", "web_search": "0", @@ -9922,6 +9922,47 @@ "benchmarks": null, "category_scores": null }, + { + "id": "mistralai/ministral-3b", + "name": "Mistral: Ministral 3B", + "context_length": 131072, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "Mistral", + "instruct_type": null + }, + "pricing": { + "prompt": "0.00000004", + "completion": "0.00000004", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": { + "code": { + "livecodebench": 0.548 + }, + "math": { + "aime-2025": 0.721, + "aime-2024": 0.775 + }, + "reasoning": { + "gpqa": 0.534 + } + }, + "category_scores": { + "code": 0.548, + "math": 0.748, + "reasoning": 0.534 + } + }, { "id": "mistralai/ministral-8b", "name": "Mistral: Ministral 8B", @@ -9968,47 +10009,6 @@ "general": 0.7695 } }, - { - "id": "mistralai/ministral-3b", - "name": "Mistral: Ministral 3B", - "context_length": 131072, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "Mistral", - "instruct_type": null - }, - "pricing": { - "prompt": "0.00000004", - "completion": "0.00000004", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": { - "code": { - "livecodebench": 0.548 - }, - "math": { - "aime-2025": 0.721, - "aime-2024": 0.775 - }, - "reasoning": { - "gpqa": 0.534 - } - }, - "category_scores": { - "code": 0.548, - "math": 0.748, - "reasoning": 0.534 - } - }, { "id": "qwen/qwen-2.5-7b-instruct", "name": "Qwen: Qwen2.5 7B Instruct", @@ -10102,8 +10102,8 @@ } }, { - "id": "inflection/inflection-3-pi", - "name": "Inflection: Inflection 3 Pi", + "id": "inflection/inflection-3-productivity", + "name": "Inflection: Inflection 3 Productivity", "context_length": 8000, "architecture": { "modality": "text->text", @@ -10128,8 +10128,8 @@ "category_scores": null }, { - "id": "inflection/inflection-3-productivity", - "name": "Inflection: Inflection 3 Productivity", + "id": "inflection/inflection-3-pi", + "name": "Inflection: Inflection 3 Pi", "context_length": 8000, "architecture": { "modality": "text->text", @@ -10206,33 +10206,6 @@ "benchmarks": null, "category_scores": null }, - { - "id": "meta-llama/llama-3.2-11b-vision-instruct", - "name": "Meta: Llama 3.2 11B Vision Instruct", - "context_length": 131072, - "architecture": { - "modality": "text+image->text", - "input_modalities": [ - "text", - "image" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "Llama3", - "instruct_type": "llama3" - }, - "pricing": { - "prompt": "0.000000049", - "completion": "0.000000049", - "request": "0", - "image": "0.00007948", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": null, - "category_scores": null - }, { "id": "meta-llama/llama-3.2-1b-instruct", "name": "Meta: Llama 3.2 1B Instruct", @@ -10361,6 +10334,33 @@ "general": 0.774 } }, + { + "id": "meta-llama/llama-3.2-11b-vision-instruct", + "name": "Meta: Llama 3.2 11B Vision Instruct", + "context_length": 131072, + "architecture": { + "modality": "text+image->text", + "input_modalities": [ + "text", + "image" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "Llama3", + "instruct_type": "llama3" + }, + "pricing": { + "prompt": "0.000000049", + "completion": "0.000000049", + "request": "0", + "image": "0.00007948", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": null, + "category_scores": null + }, { "id": "qwen/qwen-2.5-72b-instruct", "name": "Qwen2.5 72B Instruct", @@ -10377,8 +10377,8 @@ "instruct_type": "chatml" }, "pricing": { - "prompt": "0.00000007", - "completion": "0.00000026", + "prompt": "0.00000012", + "completion": "0.00000039", "request": "0", "image": "0", "web_search": "0", @@ -10535,6 +10535,32 @@ "benchmarks": null, "category_scores": null }, + { + "id": "sao10k/l3.1-euryale-70b", + "name": "Sao10K: Llama 3.1 Euryale 70B v2.2", + "context_length": 32768, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "Llama3", + "instruct_type": "llama3" + }, + "pricing": { + "prompt": "0.00000065", + "completion": "0.00000075", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": null, + "category_scores": null + }, { "id": "qwen/qwen-2.5-vl-7b-instruct:free", "name": "Qwen: Qwen2.5-VL 7B Instruct (free)", @@ -10589,32 +10615,6 @@ "benchmarks": null, "category_scores": null }, - { - "id": "sao10k/l3.1-euryale-70b", - "name": "Sao10K: Llama 3.1 Euryale 70B v2.2", - "context_length": 32768, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "Llama3", - "instruct_type": "llama3" - }, - "pricing": { - "prompt": "0.00000065", - "completion": "0.00000075", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": null, - "category_scores": null - }, { "id": "microsoft/phi-3.5-mini-128k-instruct", "name": "Microsoft: Phi-3.5 Mini 128K Instruct", @@ -10874,6 +10874,53 @@ "general": 0.886 } }, + { + "id": "meta-llama/llama-3.1-70b-instruct", + "name": "Meta: Llama 3.1 70B Instruct", + "context_length": 131072, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "Llama3", + "instruct_type": "llama3" + }, + "pricing": { + "prompt": "0.0000004", + "completion": "0.0000004", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": { + "code": { + "humaneval": 0.805 + }, + "reasoning": { + "gpqa": 0.417, + "mmlu-pro": 0.664, + "mmlu": 0.836 + }, + "tool_calling": { + "bfcl": 0.848, + "nexus": 0.567 + }, + "general": { + "ifeval": 0.875 + } + }, + "category_scores": { + "code": 0.805, + "reasoning": 0.639, + "tool_calling": 0.7075, + "general": 0.875 + } + }, { "id": "meta-llama/llama-3.1-405b-instruct:free", "name": "Meta: Llama 3.1 405B Instruct (free)", @@ -11025,53 +11072,6 @@ "general": 0.804 } }, - { - "id": "meta-llama/llama-3.1-70b-instruct", - "name": "Meta: Llama 3.1 70B Instruct", - "context_length": 131072, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "Llama3", - "instruct_type": "llama3" - }, - "pricing": { - "prompt": "0.0000004", - "completion": "0.0000004", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": { - "code": { - "humaneval": 0.805 - }, - "reasoning": { - "gpqa": 0.417, - "mmlu-pro": 0.664, - "mmlu": 0.836 - }, - "tool_calling": { - "bfcl": 0.848, - "nexus": 0.567 - }, - "general": { - "ifeval": 0.875 - } - }, - "category_scores": { - "code": 0.805, - "reasoning": 0.639, - "tool_calling": 0.7075, - "general": 0.875 - } - }, { "id": "mistralai/mistral-nemo", "name": "Mistral: Mistral Nemo", @@ -11108,8 +11108,8 @@ } }, { - "id": "openai/gpt-4o-mini-2024-07-18", - "name": "OpenAI: GPT-4o-mini (2024-07-18)", + "id": "openai/gpt-4o-mini", + "name": "OpenAI: GPT-4o-mini", "context_length": 128000, "architecture": { "modality": "text+image->text", @@ -11128,7 +11128,7 @@ "prompt": "0.00000015", "completion": "0.0000006", "request": "0", - "image": "0.007225", + "image": "0.000217", "web_search": "0", "internal_reasoning": "0", "input_cache_read": "0.000000075" @@ -11153,8 +11153,8 @@ } }, { - "id": "openai/gpt-4o-mini", - "name": "OpenAI: GPT-4o-mini", + "id": "openai/gpt-4o-mini-2024-07-18", + "name": "OpenAI: GPT-4o-mini (2024-07-18)", "context_length": 128000, "architecture": { "modality": "text+image->text", @@ -11173,7 +11173,7 @@ "prompt": "0.00000015", "completion": "0.0000006", "request": "0", - "image": "0.000217", + "image": "0.007225", "web_search": "0", "internal_reasoning": "0", "input_cache_read": "0.000000075" @@ -11311,58 +11311,6 @@ "benchmarks": null, "category_scores": null }, - { - "id": "mistralai/mistral-7b-instruct:free", - "name": "Mistral: Mistral 7B Instruct (free)", - "context_length": 32768, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "Mistral", - "instruct_type": "mistral" - }, - "pricing": { - "prompt": "0", - "completion": "0", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": null, - "category_scores": null - }, - { - "id": "mistralai/mistral-7b-instruct", - "name": "Mistral: Mistral 7B Instruct", - "context_length": 32768, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "Mistral", - "instruct_type": "mistral" - }, - "pricing": { - "prompt": "0.000000028", - "completion": "0.000000054", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": null, - "category_scores": null - }, { "id": "nousresearch/hermes-2-pro-llama-3-8b", "name": "NousResearch: Hermes 2 Pro - Llama-3 8B", @@ -11415,6 +11363,58 @@ "benchmarks": null, "category_scores": null }, + { + "id": "mistralai/mistral-7b-instruct:free", + "name": "Mistral: Mistral 7B Instruct (free)", + "context_length": 32768, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "Mistral", + "instruct_type": "mistral" + }, + "pricing": { + "prompt": "0", + "completion": "0", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": null, + "category_scores": null + }, + { + "id": "mistralai/mistral-7b-instruct", + "name": "Mistral: Mistral 7B Instruct", + "context_length": 32768, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "Mistral", + "instruct_type": "mistral" + }, + "pricing": { + "prompt": "0.000000028", + "completion": "0.000000054", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": null, + "category_scores": null + }, { "id": "microsoft/phi-3-mini-128k-instruct", "name": "Microsoft: Phi-3 Mini 128K Instruct", @@ -11467,76 +11467,6 @@ "benchmarks": null, "category_scores": null }, - { - "id": "meta-llama/llama-guard-2-8b", - "name": "Meta: LlamaGuard 2 8B", - "context_length": 8192, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "Llama3", - "instruct_type": "none" - }, - "pricing": { - "prompt": "0.0000002", - "completion": "0.0000002", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": null, - "category_scores": null - }, - { - "id": "openai/gpt-4o-2024-05-13", - "name": "OpenAI: GPT-4o (2024-05-13)", - "context_length": 128000, - "architecture": { - "modality": "text+image->text", - "input_modalities": [ - "text", - "image", - "file" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "GPT", - "instruct_type": null - }, - "pricing": { - "prompt": "0.000005", - "completion": "0.000015", - "request": "0", - "image": "0.007225", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": { - "code": { - "humaneval": 0.902 - }, - "math": { - "math": 0.766 - }, - "reasoning": { - "gpqa": 0.536, - "mmlu-pro": 0.726, - "mmlu": 0.887 - } - }, - "category_scores": { - "code": 0.902, - "math": 0.766, - "reasoning": 0.7163 - } - }, { "id": "openai/gpt-4o", "name": "OpenAI: GPT-4o", @@ -11616,8 +11546,8 @@ "category_scores": null }, { - "id": "meta-llama/llama-3-70b-instruct", - "name": "Meta: Llama 3 70B Instruct", + "id": "meta-llama/llama-guard-2-8b", + "name": "Meta: LlamaGuard 2 8B", "context_length": 8192, "architecture": { "modality": "text->text", @@ -11628,11 +11558,11 @@ "text" ], "tokenizer": "Llama3", - "instruct_type": "llama3" + "instruct_type": "none" }, "pricing": { - "prompt": "0.0000003", - "completion": "0.0000004", + "prompt": "0.0000002", + "completion": "0.0000002", "request": "0", "image": "0", "web_search": "0", @@ -11641,6 +11571,50 @@ "benchmarks": null, "category_scores": null }, + { + "id": "openai/gpt-4o-2024-05-13", + "name": "OpenAI: GPT-4o (2024-05-13)", + "context_length": 128000, + "architecture": { + "modality": "text+image->text", + "input_modalities": [ + "text", + "image", + "file" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "GPT", + "instruct_type": null + }, + "pricing": { + "prompt": "0.000005", + "completion": "0.000015", + "request": "0", + "image": "0.007225", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": { + "code": { + "humaneval": 0.902 + }, + "math": { + "math": 0.766 + }, + "reasoning": { + "gpqa": 0.536, + "mmlu-pro": 0.726, + "mmlu": 0.887 + } + }, + "category_scores": { + "code": 0.902, + "math": 0.766, + "reasoning": 0.7163 + } + }, { "id": "meta-llama/llama-3-8b-instruct", "name": "Meta: Llama 3 8B Instruct", @@ -11667,6 +11641,32 @@ "benchmarks": null, "category_scores": null }, + { + "id": "meta-llama/llama-3-70b-instruct", + "name": "Meta: Llama 3 70B Instruct", + "context_length": 8192, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "Llama3", + "instruct_type": "llama3" + }, + "pricing": { + "prompt": "0.0000003", + "completion": "0.0000004", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": null, + "category_scores": null + }, { "id": "mistralai/mixtral-8x22b-instruct", "name": "Mistral: Mixtral 8x22B Instruct", @@ -11898,32 +11898,6 @@ "general": 0.863 } }, - { - "id": "openai/gpt-3.5-turbo-0613", - "name": "OpenAI: GPT-3.5 Turbo (older v0613)", - "context_length": 4095, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "GPT", - "instruct_type": null - }, - "pricing": { - "prompt": "0.000001", - "completion": "0.000002", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": null, - "category_scores": null - }, { "id": "openai/gpt-4-turbo-preview", "name": "OpenAI: GPT-4 Turbo Preview", @@ -11950,6 +11924,32 @@ "benchmarks": null, "category_scores": null }, + { + "id": "openai/gpt-3.5-turbo-0613", + "name": "OpenAI: GPT-3.5 Turbo (older v0613)", + "context_length": 4095, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "GPT", + "instruct_type": null + }, + "pricing": { + "prompt": "0.000001", + "completion": "0.000002", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": null, + "category_scores": null + }, { "id": "mistralai/mistral-tiny", "name": "Mistral Tiny", @@ -12128,32 +12128,6 @@ "benchmarks": null, "category_scores": null }, - { - "id": "openai/gpt-3.5-turbo-instruct", - "name": "OpenAI: GPT-3.5 Turbo Instruct", - "context_length": 4095, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "GPT", - "instruct_type": "chatml" - }, - "pricing": { - "prompt": "0.0000015", - "completion": "0.000002", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": null, - "category_scores": null - }, { "id": "mistralai/mistral-7b-instruct-v0.1", "name": "Mistral: Mistral 7B Instruct v0.1", @@ -12180,6 +12154,32 @@ "benchmarks": null, "category_scores": null }, + { + "id": "openai/gpt-3.5-turbo-instruct", + "name": "OpenAI: GPT-3.5 Turbo Instruct", + "context_length": 4095, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "GPT", + "instruct_type": "chatml" + }, + "pricing": { + "prompt": "0.0000015", + "completion": "0.000002", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": null, + "category_scores": null + }, { "id": "openai/gpt-3.5-turbo-16k", "name": "OpenAI: GPT-3.5 Turbo 16k", @@ -12310,6 +12310,47 @@ "benchmarks": null, "category_scores": null }, + { + "id": "openai/gpt-3.5-turbo", + "name": "OpenAI: GPT-3.5 Turbo", + "context_length": 16385, + "architecture": { + "modality": "text->text", + "input_modalities": [ + "text" + ], + "output_modalities": [ + "text" + ], + "tokenizer": "GPT", + "instruct_type": null + }, + "pricing": { + "prompt": "0.0000005", + "completion": "0.0000015", + "request": "0", + "image": "0", + "web_search": "0", + "internal_reasoning": "0" + }, + "benchmarks": { + "code": { + "humaneval": 0.68 + }, + "math": { + "math": 0.431 + }, + "reasoning": { + "gpqa": 0.308, + "mmlu": 0.698 + } + }, + "category_scores": { + "code": 0.68, + "math": 0.431, + "reasoning": 0.503 + } + }, { "id": "openai/gpt-4", "name": "OpenAI: GPT-4", @@ -12356,47 +12397,6 @@ "reasoning": 0.7825, "general": 0.874 } - }, - { - "id": "openai/gpt-3.5-turbo", - "name": "OpenAI: GPT-3.5 Turbo", - "context_length": 16385, - "architecture": { - "modality": "text->text", - "input_modalities": [ - "text" - ], - "output_modalities": [ - "text" - ], - "tokenizer": "GPT", - "instruct_type": null - }, - "pricing": { - "prompt": "0.0000005", - "completion": "0.0000015", - "request": "0", - "image": "0", - "web_search": "0", - "internal_reasoning": "0" - }, - "benchmarks": { - "code": { - "humaneval": 0.68 - }, - "math": { - "math": 0.431 - }, - "reasoning": { - "gpqa": 0.308, - "mmlu": 0.698 - } - }, - "category_scores": { - "code": 0.68, - "math": 0.431, - "reasoning": 0.503 - } } ] } \ No newline at end of file diff --git a/scripts/merge_benchmarks.py b/scripts/merge_benchmarks.py index 6f9bb56..b51520e 100644 --- a/scripts/merge_benchmarks.py +++ b/scripts/merge_benchmarks.py @@ -76,6 +76,9 @@ MODEL_FAMILY_PATTERNS = [ (r"^anthropic/claude-(\d+\.?\d*)-haiku$", "claude-haiku", "fast"), # OpenAI GPT + (r"^openai/gpt-5\.2-pro$", "gpt-5-pro", "flagship"), + (r"^openai/gpt-5\.2$", "gpt-5", "mid"), + (r"^openai/gpt-5\.2-chat$", "gpt-5", "mid"), (r"^openai/gpt-4\.1$", "gpt-4", "mid"), (r"^openai/gpt-4o$", "gpt-4", "mid"), (r"^openai/gpt-4-turbo", "gpt-4", "mid"), @@ -110,6 +113,8 @@ MODEL_FAMILY_PATTERNS = [ # Qwen (r"^qwen/qwen-2\.5-72b", "qwen-72b", "mid"), (r"^qwen/qwq-32b", "qwq", "mid"), + (r"^qwen/qwen3-next-80b.*thinking", "qwen3-thinking", "flagship"), + (r"^qwen/qwen3-235b.*instruct", "qwen3-instruct", "mid"), ] HEADERS = { diff --git a/src/agents/leaf/complexity.rs b/src/agents/leaf/complexity.rs deleted file mode 100644 index 3f28763..0000000 --- a/src/agents/leaf/complexity.rs +++ /dev/null @@ -1,409 +0,0 @@ -//! Complexity estimation agent. -//! -//! Analyzes a task description and estimates: -//! - Complexity score (0-1) -//! - Whether to split into subtasks -//! - Estimated token count -//! -//! ## Learning Integration -//! When memory is available, the estimator queries similar past tasks -//! and adjusts predictions based on historical actual token usage. - -use async_trait::async_trait; -use serde_json::json; - -use crate::agents::{ - Agent, AgentContext, AgentId, AgentResult, AgentType, Complexity, LeafAgent, LeafCapability, -}; -use crate::llm::{ChatMessage, ChatOptions, Role}; -use crate::memory::HistoricalContext; -use crate::task::Task; - -/// Agent that estimates task complexity. -/// -/// # Purpose -/// Given a task description, estimate how complex it is and whether -/// it should be split into subtasks. -/// -/// # Algorithm -/// 1. Send task description to LLM with complexity evaluation prompt -/// 2. Parse LLM response for complexity score and reasoning -/// 3. Return structured Complexity object -pub struct ComplexityEstimator { - id: AgentId, - prompt_variant: ComplexityPromptVariant, - split_threshold: f64, - token_multiplier: f64, -} - -/// Prompt variants for complexity estimation. -/// -/// We keep this as an enum (not free-form strings) so we can: -/// - A/B test variants deterministically -/// - Store tuned choice as a stable symbol in config -#[derive(Debug, Clone, Copy, PartialEq, Eq, serde::Serialize, serde::Deserialize)] -#[serde(rename_all = "snake_case")] -pub enum ComplexityPromptVariant { - /// Short rubric-based prompt (fast). - RubricV1, - /// More explicit calibration prompt encouraging realistic token estimates. - CalibratedV2, -} - -impl ComplexityEstimator { - /// Create a new complexity estimator. - pub fn new() -> Self { - Self { - id: AgentId::new(), - prompt_variant: ComplexityPromptVariant::CalibratedV2, - split_threshold: 0.6, - token_multiplier: 1.0, - } - } - - /// Create a custom estimator (used by calibration harness). - pub fn with_params( - prompt_variant: ComplexityPromptVariant, - split_threshold: f64, - token_multiplier: f64, - ) -> Self { - Self { - id: AgentId::new(), - prompt_variant, - split_threshold: split_threshold.clamp(0.0, 1.0), - token_multiplier: token_multiplier.max(0.1), - } - } - - /// Prompt template for complexity estimation. - /// - /// # Response Format - /// LLM should respond with JSON containing: - /// - score: float 0-1 - /// - reasoning: string explanation - fn build_prompt(&self, task: &Task) -> String { - match self.prompt_variant { - ComplexityPromptVariant::RubricV1 => format!( - r#"You are a task complexity analyzer. - -Task: {task} - -Respond with ONLY a JSON object: -{{ - "score": , - "reasoning": , - "estimated_tokens": -}} - -Rubric for score: -- 0.0-0.2: Trivial -- 0.2-0.4: Simple -- 0.4-0.6: Moderate -- 0.6-0.8: Complex -- 0.8-1.0: Very Complex"#, - task = task.description() - ), - ComplexityPromptVariant::CalibratedV2 => format!( - r#"You are a task complexity analyzer. Your goal is to estimate: -1) a complexity score in [0, 1] -2) a realistic token budget estimate for completing the task end-to-end using an LLM with tools. - -Task: {task} - -Important: \"estimated_tokens\" should reflect TOTAL tokens (input + output) across multiple turns, including: -- planning / reasoning -- tool call arguments and tool outputs -- iterative fixes and retries - -Respond with ONLY a JSON object: -{{ - \"score\": , - \"reasoning\": , - \"estimated_tokens\": -}} - -Rubric for score: -- 0.0-0.2: Trivial (single tool call) -- 0.2-0.4: Simple (1-3 tool calls) -- 0.4-0.6: Moderate (3-8 tool calls) -- 0.6-0.8: Complex (multi-file, tests, iterations) -- 0.8-1.0: Very Complex (architecture, significant refactor)"#, - task = task.description() - ), - } - } - - /// Parse LLM response into Complexity struct. - /// - /// # Postconditions - /// - Returns valid Complexity with score in [0, 1] - /// - Falls back to moderate complexity on parse error - fn parse_response(&self, response: &str) -> Complexity { - // Try to parse as JSON - if let Ok(json) = serde_json::from_str::(response) { - let score = json["score"].as_f64().unwrap_or(0.5); - let reasoning = json["reasoning"].as_str().unwrap_or("No reasoning provided"); - let estimated_tokens = json["estimated_tokens"].as_u64().unwrap_or(2000); - - return Complexity::new(score, reasoning, estimated_tokens); - } - - // Try to extract score from text - if let Some(score) = self.extract_score_from_text(response) { - return Complexity::new(score, response, 2000); - } - - // Default to moderate complexity - Complexity::moderate("Could not parse complexity response") - } - - /// Try to extract a score from free-form text. - fn extract_score_from_text(&self, text: &str) -> Option { - // Look for patterns like "0.5" or "score: 0.5" or "50%" - let text_lower = text.to_lowercase(); - - // Check for keywords - if text_lower.contains("trivial") || text_lower.contains("very simple") { - return Some(0.1); - } - if text_lower.contains("very complex") || text_lower.contains("extremely") { - return Some(0.9); - } - if text_lower.contains("complex") { - return Some(0.7); - } - if text_lower.contains("moderate") || text_lower.contains("medium") { - return Some(0.5); - } - if text_lower.contains("simple") || text_lower.contains("easy") { - return Some(0.3); - } - - None - } - - /// Query historical context for similar tasks. - /// - /// Returns adjustment multipliers based on past actual vs predicted values. - async fn get_historical_adjustments( - &self, - task_description: &str, - ctx: &AgentContext, - ) -> Option { - let memory = ctx.memory.as_ref()?; - - match memory.retriever.get_historical_context(task_description, 5).await { - Ok(context) => { - if let Some(ref hist) = context { - tracing::debug!( - "Historical context found: {} similar tasks, avg token ratio: {:.2}, success rate: {:.2}", - hist.similar_outcomes.len(), - hist.avg_token_multiplier, - hist.similar_success_rate - ); - } - context - } - Err(e) => { - tracing::warn!("Failed to fetch historical context: {}", e); - None - } - } - } - - /// Adjust token estimate based on historical data. - /// - /// If similar past tasks consistently used more/fewer tokens than predicted, - /// we adjust our estimate accordingly. - fn apply_historical_adjustment( - &self, - base_tokens: u64, - historical: Option<&HistoricalContext>, - ) -> u64 { - match historical { - Some(hist) if hist.similar_outcomes.len() >= 2 => { - // Apply the historical token multiplier (clamped to reasonable range) - let multiplier = hist.avg_token_multiplier.clamp(0.5, 3.0); - let adjusted = (base_tokens as f64 * multiplier).round() as u64; - - tracing::debug!( - "Adjusted token estimate: {} -> {} (multiplier: {:.2})", - base_tokens, adjusted, multiplier - ); - - adjusted - } - _ => base_tokens, - } - } -} - -impl Default for ComplexityEstimator { - fn default() -> Self { - Self::new() - } -} - -#[async_trait] -impl Agent for ComplexityEstimator { - fn id(&self) -> &AgentId { - &self.id - } - - fn agent_type(&self) -> AgentType { - AgentType::ComplexityEstimator - } - - fn description(&self) -> &str { - "Estimates task complexity and recommends splitting strategy" - } - - /// Estimate complexity of a task. - /// - /// # Returns - /// AgentResult with Complexity data in the `data` field. - /// - /// # Learning Integration - /// When memory is available, queries similar past tasks to adjust predictions - /// based on actual historical token usage. - async fn execute(&self, task: &mut Task, ctx: &AgentContext) -> AgentResult { - // Query historical context for similar tasks (if memory available) - let historical = self.get_historical_adjustments(task.description(), ctx).await; - - let prompt = self.build_prompt(task); - - let messages = vec![ - ChatMessage::new(Role::System, "You are a precise task complexity analyzer. Respond only with JSON."), - ChatMessage::new(Role::User, prompt), - ]; - - // Use a fast, cheap model for complexity estimation - let model = "openai/gpt-4.1-mini"; - - let pricing = ctx.pricing.get_pricing(model).await; - let options = ChatOptions { - temperature: Some(0.0), - top_p: None, - max_tokens: Some(400), - }; - - match ctx - .llm - .chat_completion_with_options(model, &messages, None, options) - .await - { - Ok(response) => { - let content = response.content.unwrap_or_default(); - let parsed = self.parse_response(&content); - - // Apply calibrated adjustments (pure post-processing). - let base_tokens = ((parsed.estimated_tokens() as f64) * self.token_multiplier) - .round() - .max(1.0) as u64; - - // Apply historical adjustment if we have relevant data - let adjusted_tokens = self.apply_historical_adjustment(base_tokens, historical.as_ref()); - - let should_split = parsed.score() > self.split_threshold; - let complexity = Complexity::new(parsed.score(), parsed.reasoning(), adjusted_tokens) - .with_split(should_split); - - // Record analysis on the task - { - let a = task.analysis_mut(); - a.complexity_score = Some(complexity.score()); - a.complexity_reasoning = Some(complexity.reasoning().to_string()); - a.should_split = Some(complexity.should_split()); - a.estimated_total_tokens = Some(complexity.estimated_tokens()); - } - - // Compute cost (if usage + pricing available) - let cost_cents = match (&response.usage, &pricing) { - (Some(u), Some(p)) => p.calculate_cost_cents(u.prompt_tokens, u.completion_tokens), - _ => 1, // fallback tiny cost - }; - - // Build historical info for response data - let historical_info = historical.as_ref().map(|h| json!({ - "similar_tasks_found": h.similar_outcomes.len(), - "avg_token_multiplier": h.avg_token_multiplier, - "avg_cost_multiplier": h.avg_cost_multiplier, - "similar_success_rate": h.similar_success_rate, - })); - - AgentResult::success( - format!( - "Complexity: {:.2} - {}{}", - complexity.score(), - if complexity.should_split() { "Should split" } else { "Execute directly" }, - if historical.is_some() { " (adjusted from history)" } else { "" } - ), - cost_cents, - ) - .with_model(model) - .with_data(json!({ - "score": complexity.score(), - "reasoning": complexity.reasoning(), - "should_split": complexity.should_split(), - "estimated_tokens": complexity.estimated_tokens(), - "base_tokens_before_history": base_tokens, - "historical_adjustment": historical_info, - "usage": response.usage.as_ref().map(|u| json!({ - "prompt_tokens": u.prompt_tokens, - "completion_tokens": u.completion_tokens, - "total_tokens": u.total_tokens - })), - })) - } - Err(e) => { - // On error, return moderate complexity as fallback - let fallback = Complexity::moderate(format!("LLM error, using fallback: {}", e)); - - AgentResult::success( - "Using fallback complexity estimate due to LLM error", - 0, - ) - .with_data(json!({ - "score": fallback.score(), - "reasoning": fallback.reasoning(), - "should_split": fallback.should_split(), - "estimated_tokens": fallback.estimated_tokens(), - "fallback": true, - })) - } - } - } -} - -impl LeafAgent for ComplexityEstimator { - fn capability(&self) -> LeafCapability { - LeafCapability::ComplexityEstimation - } -} - -#[cfg(test)] -mod tests { - use super::*; - - #[test] - fn test_parse_json_response() { - let estimator = ComplexityEstimator::new(); - - let json_response = r#"{"score": 0.7, "reasoning": "Complex task", "estimated_tokens": 3000, "should_split": true}"#; - let complexity = estimator.parse_response(json_response); - - assert!((complexity.score() - 0.7).abs() < 0.01); - assert!(complexity.should_split()); - } - - #[test] - fn test_parse_text_response() { - let estimator = ComplexityEstimator::new(); - - let text_response = "This is a very complex task"; - let complexity = estimator.parse_response(text_response); - - assert!(complexity.score() > 0.6); - } -} - diff --git a/src/agents/leaf/executor.rs b/src/agents/leaf/executor.rs index 1223498..7959293 100644 --- a/src/agents/leaf/executor.rs +++ b/src/agents/leaf/executor.rs @@ -19,7 +19,7 @@ use serde_json::json; use std::path::Path; use crate::agents::{ - Agent, AgentContext, AgentId, AgentResult, AgentType, LeafAgent, LeafCapability, + Agent, AgentContext, AgentId, AgentResult, AgentType, LeafAgent, LeafCapability, TerminalReason, }; use crate::api::control::{AgentEvent, ControlRunState}; use crate::budget::ExecutionSignals; @@ -43,6 +43,8 @@ pub struct ExecutionLoopResult { pub signals: ExecutionSignals, /// Whether execution succeeded pub success: bool, + /// Why execution terminated (if not successful completion) + pub terminal_reason: Option, } /// Agent that executes tasks using tools. @@ -60,6 +62,33 @@ pub struct TaskExecutor { id: AgentId, } +/// Categorize a tool name into a broader approach category. +/// Used for tracking repeated failures of similar approaches. +fn categorize_tool(tool_name: &str) -> String { + match tool_name { + // Static analysis tools + name if name.contains("slither") || name.contains("mythril") || + name.contains("solhint") || name.contains("echidna") => "static_analysis".to_string(), + + // Code execution/compilation + "run_command" => "shell_command".to_string(), + name if name.contains("compile") || name.contains("build") => "compilation".to_string(), + + // File operations + "read_file" | "write_file" | "list_directory" | "search_files" => "file_ops".to_string(), + + // Network/API calls + name if name.contains("browser") || name.contains("http") || + name.contains("fetch") || name.contains("curl") => "network".to_string(), + + // Git operations + name if name.contains("git") || name.contains("clone") => "git".to_string(), + + // Default: use the tool name itself + _ => tool_name.to_string(), + } +} + impl TaskExecutor { /// Create a new task executor. pub fn new() -> Self { @@ -558,7 +587,54 @@ Use `search_memory` when you encounter a problem you might have solved before or 4. **Explicit completion** — Use complete_mission tool when the goal is fully achieved 5. **Failure acknowledgment** — If you cannot complete, explain why and call complete_mission with failed status 6. **No silent exits** — Every execution should end with either a deliverable or an explanation -7. **Large files in chunks** — If writing files >2000 chars, verify content isn't truncated"#, +7. **Large files in chunks** — If writing files >2000 chars, verify content isn't truncated + +## ⚠️ CRITICAL: Blocker Detection (STOP if these occur!) + +**If you encounter ANY blocker, STOP IMMEDIATELY and report it. DO NOT produce placeholder content.** + +### Type Mismatch Blockers +| Requested | But Target Is | Action | +|-----------|---------------|--------| +| Solidity/Smart Contract audit | C++/Rust/Go project | STOP → `complete_mission(blocked, "Target is C++/Rust/Go, not Solidity")` | +| Python analysis | Java/JavaScript project | STOP → `complete_mission(blocked, "Target is Java/JS, not Python")` | +| Web scraping | Desktop app | STOP → `complete_mission(blocked, "Target is desktop app, not website")` | + +**How to detect project types:** +- **Solidity**: `.sol` files, `hardhat.config.js`, `truffle-config.js`, `foundry.toml` +- **C++ (Bitcoin forks)**: `configure.ac`, `Makefile.am`, `src/*.cpp`, `src/*.h` +- **Rust**: `Cargo.toml`, `src/*.rs` +- **Go**: `go.mod`, `*.go` files + +### Access/Resource Blockers +| Blocker | Action | +|---------|--------| +| Can't clone/access repository | STOP → report exact error | +| Can't fetch contract bytecode | STOP → report RPC error and address | +| Required tool won't install | STOP → report installation error | +| Source code not available | TRY bytecode analysis first, then report if still blocked | + +### Smart Contract Audit Specific +**When auditing contracts WITHOUT source code:** +1. FIRST try fetching bytecode: `cast code
--rpc-url ` +2. THEN decompile: Use `heimdall`, `panoramix`, `dedaub` +3. ONLY report "blocked" if bytecode analysis also fails + +**Chain RPCs:** +- Ethereum: `https://eth.llamarpc.com` +- BSC: `https://bsc-dataseed.binance.org` +- Polygon: `https://polygon-rpc.com` +- Merlin: `https://rpc.merlinchain.io` + +## 🚫 NEVER DO THESE + +1. **NEVER create "example" or "illustrative" content** as substitute for real analysis +2. **NEVER analyze unrelated code** (e.g., library code instead of target contracts) +3. **NEVER produce generic filler** (e.g., "SQL injection" in a smart contract audit) +4. **NEVER frame placeholder content as real analysis** +5. **NEVER mark "completed" if you analyzed substitute targets** + +If you cannot perform the requested analysis, use `complete_mission(blocked, reason)` and explain clearly what blocked you."#, session_metadata = session_metadata, memory_context = memory_context, working_dir = working_dir, @@ -622,6 +698,15 @@ Use `search_memory` when you encounter a problem you might have solved before or let mut has_error_messages = false; let mut iterations_completed = 0u32; + // Track consecutive empty/reasoning-only responses (P0 fix for agent stalls) + let mut empty_response_count: u32 = 0; + const EMPTY_RESPONSE_WARNING_THRESHOLD: u32 = 2; + const EMPTY_RESPONSE_FORCE_COMPLETE_THRESHOLD: u32 = 4; + + // Track failed tool attempts by category (P3 fix for approach looping) + let mut failed_tool_attempts: std::collections::HashMap = std::collections::HashMap::new(); + const TOOL_FAILURE_THRESHOLD: u32 = 3; + // Track uploaded images that need to be included in the response // When upload_image succeeds, we store the (url, markdown) so we can warn // the agent if they try to complete without including the images. @@ -701,8 +786,25 @@ Use `search_memory` when you encounter a problem you might have solved before or tracing::info!("Browser tools available: {:?}", browser_tools); } - tracing::info!("Discovered {} built-in tools, {} MCP tools", builtin_count, mcp_tool_schemas.len()); - tool_schemas.extend(mcp_tool_schemas); + // Filter out MCP tools that conflict with built-in tools (built-in takes precedence) + let builtin_names: std::collections::HashSet<_> = tool_schemas.iter().map(|t| t.function.name.as_str()).collect(); + let mcp_count_before = mcp_tool_schemas.len(); + let filtered_mcp: Vec<_> = mcp_tool_schemas + .into_iter() + .filter(|t| { + if builtin_names.contains(t.function.name.as_str()) { + tracing::debug!("Skipping MCP tool '{}' - conflicts with built-in tool", t.function.name); + false + } else { + true + } + }) + .collect(); + let mcp_skipped = mcp_count_before - filtered_mcp.len(); + + tracing::info!("Discovered {} built-in tools, {} MCP tools ({} skipped due to conflicts)", + builtin_count, filtered_mcp.len(), mcp_skipped); + tool_schemas.extend(filtered_mcp); // Agent loop for iteration in 0..ctx.max_iterations { @@ -734,6 +836,7 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: false, + terminal_reason: Some(TerminalReason::Cancelled), }; } } @@ -762,13 +865,19 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: false, + terminal_reason: Some(TerminalReason::BudgetExhausted), }; } - // Call LLM - let response = match ctx.llm.chat_completion(model, &messages, Some(&tool_schemas)).await { - Ok(r) => r, - Err(e) => { + // Call LLM with timeout (P2 fix: detect hangs) + const LLM_TIMEOUT_SECS: u64 = 300; // 5 minutes max per LLM call + let llm_future = ctx.llm.chat_completion(model, &messages, Some(&tool_schemas)); + let response = match tokio::time::timeout( + std::time::Duration::from_secs(LLM_TIMEOUT_SECS), + llm_future + ).await { + Ok(Ok(r)) => r, + Ok(Err(e)) => { has_error_messages = true; let error_msg = format!("LLM error: {}", e); let signals = ExecutionSignals { @@ -792,6 +901,45 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: false, + terminal_reason: Some(TerminalReason::LlmError), + }; + } + Err(_timeout) => { + // P2 FIX: LLM call timed out - return with partial results + has_error_messages = true; + tracing::error!( + "LLM call timed out after {} seconds at iteration {}", + LLM_TIMEOUT_SECS, + iterations_completed + ); + let signals = ExecutionSignals { + iterations: iterations_completed, + max_iterations: ctx.max_iterations as u32, + successful_tool_calls, + failed_tool_calls, + files_modified, + repetitive_actions, + has_error_messages, + partial_progress: files_modified || successful_tool_calls > 0, + cost_spent_cents: total_cost_cents, + budget_total_cents: task.budget().total_cents(), + final_output: format!( + "LLM call timed out after {} seconds. Partial results may be in working directory.", + LLM_TIMEOUT_SECS + ), + model_used: model.to_string(), + }; + return ExecutionLoopResult { + output: format!( + "Agent stalled: LLM call timed out after {} seconds. Check working directory for partial results.", + LLM_TIMEOUT_SECS + ), + cost_cents: total_cost_cents, + tool_log, + usage, + signals, + success: false, + terminal_reason: Some(TerminalReason::Stalled), }; } }; @@ -907,6 +1055,7 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: false, + terminal_reason: Some(TerminalReason::InfiniteLoop), }; } @@ -999,6 +1148,7 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: false, + terminal_reason: Some(TerminalReason::Cancelled), }; } } @@ -1069,7 +1219,28 @@ Use `search_memory` when you encounter a problem you might have solved before or Err(e) => { failed_tool_calls += 1; has_error_messages = true; - let s = format!("Error: {}", e); + + // P3 FIX: Track failed approaches by tool category + let tool_category = categorize_tool(&tool_name); + let count = failed_tool_attempts.entry(tool_category.clone()).or_insert(0); + *count += 1; + + let s = if *count >= TOOL_FAILURE_THRESHOLD { + tracing::warn!( + "Tool category '{}' has failed {} times - suggesting pivot", + tool_category, + *count + ); + format!( + "Error: {}\n\n[SYSTEM NOTE: The '{}' approach has failed {} times. \ + Consider: 1) Try a completely different tool/approach, \ + 2) Analyze what you DO have and produce partial results, \ + 3) Call complete_mission(blocked) if fundamentally stuck]", + e, tool_category, *count + ) + } else { + format!("Error: {}", e) + }; (s.clone(), serde_json::Value::String(s)) } } @@ -1205,14 +1376,92 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: true, + terminal_reason: None, }; } } + // Reset empty response counter on successful tool execution + empty_response_count = 0; continue; } } + // P0 FIX: Handle reasoning-only responses (no tool calls, no/empty content) + // This prevents the agent from stalling when the LLM returns only thinking + let has_reasoning = response.reasoning.as_ref().map(|r| !r.is_empty()).unwrap_or(false); + let has_content = response.content.as_ref().map(|c| !c.trim().is_empty()).unwrap_or(false); + let has_tool_calls = response.tool_calls.as_ref().map(|tc| !tc.is_empty()).unwrap_or(false); + + if !has_tool_calls && !has_content { + empty_response_count += 1; + tracing::warn!( + "Empty/reasoning-only response #{} (has_reasoning: {}, iteration: {})", + empty_response_count, + has_reasoning, + iterations_completed + ); + + // Force completion if too many empty responses + if empty_response_count >= EMPTY_RESPONSE_FORCE_COMPLETE_THRESHOLD { + tracing::error!( + "Force completing: {} consecutive empty/reasoning-only responses", + empty_response_count + ); + has_error_messages = true; + let signals = ExecutionSignals { + iterations: iterations_completed, + max_iterations: ctx.max_iterations as u32, + successful_tool_calls, + failed_tool_calls, + files_modified, + repetitive_actions, + has_error_messages, + partial_progress: files_modified || successful_tool_calls > 0, + cost_spent_cents: total_cost_cents, + budget_total_cents: task.budget().total_cents(), + final_output: format!( + "Agent stalled: {} consecutive responses without action. Partial results may be in working directory.", + empty_response_count + ), + model_used: model.to_string(), + }; + return ExecutionLoopResult { + output: format!( + "Agent stalled after {} responses without taking action. Check working directory for partial results.", + empty_response_count + ), + cost_cents: total_cost_cents, + tool_log, + usage, + signals, + success: false, + terminal_reason: Some(TerminalReason::Stalled), + }; + } + + // Inject a prompt to get the model to take action + if empty_response_count >= EMPTY_RESPONSE_WARNING_THRESHOLD { + messages.push(ChatMessage::new( + Role::User, + format!( + "[SYSTEM WARNING] You've returned {} responses without taking any action (only thinking/reasoning).\n\n\ + You MUST now do ONE of:\n\ + 1. Call a tool to continue working on the task\n\ + 2. Provide a complete final response summarizing your work\n\ + 3. Call complete_mission with status='completed' if done, or status='blocked' if stuck\n\n\ + DO NOT respond with only thinking - take concrete action NOW.", + empty_response_count + ) + )); + } + + continue; // Retry - let the model try again with the warning + } + + // If we reach here with content, it's the final response + // (no need to reset empty_response_count since we're returning) + // No tool calls - final response if let Some(content) = response.content.filter(|c| !c.trim().is_empty()) { let signals = ExecutionSignals { @@ -1236,6 +1485,7 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: true, + terminal_reason: None, }; } @@ -1262,6 +1512,7 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: false, + terminal_reason: Some(TerminalReason::LlmError), }; } @@ -1287,6 +1538,7 @@ Use `search_memory` when you encounter a problem you might have solved before or usage, signals, success: false, + terminal_reason: Some(TerminalReason::MaxIterations), } } } @@ -1350,6 +1602,11 @@ impl Agent for TaskExecutor { AgentResult::failure(&result.output, result.cost_cents) }; + // Propagate terminal reason from execution loop + if let Some(reason) = result.terminal_reason { + agent_result = agent_result.with_terminal_reason(reason); + } + agent_result = agent_result .with_model(model) .with_data(json!({ @@ -1415,6 +1672,11 @@ impl TaskExecutor { AgentResult::failure(&result.output, result.cost_cents) }; + // Propagate terminal reason from execution loop + if let Some(reason) = result.terminal_reason { + agent_result = agent_result.with_terminal_reason(reason); + } + agent_result = agent_result .with_model(model) .with_data(json!({ diff --git a/src/agents/leaf/mod.rs b/src/agents/leaf/mod.rs index fd97b7b..98aa40a 100644 --- a/src/agents/leaf/mod.rs +++ b/src/agents/leaf/mod.rs @@ -1,18 +1,13 @@ //! Leaf agents - specialized agents that do actual work. //! -//! # Leaf Agent Types -//! - `ComplexityEstimator`: Estimates task complexity (0-1 score) -//! - `ModelSelector`: Selects optimal model for task/budget +//! # Active Leaf Agent //! - `TaskExecutor`: Executes tasks using tools (main worker) -//! - `Verifier`: Validates task completion +//! +//! # Removed Agents (superseded by SimpleAgent) +//! - `ComplexityEstimator`: Was unreliable (LLM-based estimation) +//! - `ModelSelector`: Was over-engineered (U-curve optimization) +//! - `Verifier`: Was ineffective (rubber-stamped everything) -mod complexity; -mod model_select; mod executor; -mod verifier; -pub use complexity::{ComplexityEstimator, ComplexityPromptVariant}; -pub use model_select::ModelSelector; pub use executor::{TaskExecutor, ExecutionLoopResult}; -pub use verifier::Verifier; - diff --git a/src/agents/leaf/model_select.rs b/src/agents/leaf/model_select.rs deleted file mode 100644 index 2114f26..0000000 --- a/src/agents/leaf/model_select.rs +++ /dev/null @@ -1,766 +0,0 @@ -//! Model selection agent with U-curve cost optimization. -//! -//! # U-Curve Optimization -//! The total expected cost follows a U-shaped curve: -//! - Cheap models: Low per-token cost, but may fail/retry, use more tokens -//! - Expensive models: High per-token cost, but succeed more often -//! - Optimal: Somewhere in the middle, minimizing total expected cost -//! -//! # Cost Model -//! Expected Cost = base_cost * (1 + failure_rate * retry_multiplier) * token_efficiency -//! -//! # Benchmark Integration -//! When benchmark data is available, uses actual benchmark scores (from llm-stats.com) -//! for task-type-specific capability estimation instead of price-based heuristics. -//! -//! # Learning Integration -//! When memory is available, uses historical model statistics (actual success rates, -//! cost ratios) instead of pure heuristics. - -use async_trait::async_trait; -use serde_json::json; -use std::collections::HashMap; - -use crate::agents::{ - Agent, AgentContext, AgentId, AgentResult, AgentType, LeafAgent, LeafCapability, -}; -use crate::budget::{PricingInfo, TaskType}; -use crate::memory::ModelStats; -use crate::task::Task; - -/// Agent that selects the optimal model for a task. -/// -/// # Algorithm -/// 1. Get task complexity and budget constraints -/// 2. Fetch available models and pricing -/// 3. For each model, calculate expected total cost -/// 4. Return model with minimum expected cost within budget -pub struct ModelSelector { - id: AgentId, - retry_multiplier: f64, - inefficiency_scale: f64, - max_failure_probability: f64, -} - -/// Model recommendation from the selector. -#[derive(Debug, Clone)] -pub struct ModelRecommendation { - /// Recommended model ID - pub model_id: String, - - /// Expected cost in cents - pub expected_cost_cents: u64, - - /// Confidence in this recommendation (0-1) - pub confidence: f64, - - /// Reasoning for the selection - pub reasoning: String, - - /// Alternative models if primary fails - pub fallbacks: Vec, - - /// Whether historical data was used for this selection - pub used_historical_data: bool, - - /// Whether benchmark data was used for capability estimation - pub used_benchmark_data: bool, - - /// Inferred task type - pub task_type: Option, -} - -impl ModelSelector { - /// Create a new model selector. - pub fn new() -> Self { - Self { - id: AgentId::new(), - retry_multiplier: 1.5, - inefficiency_scale: 0.5, - max_failure_probability: 0.9, - } - } - - /// Create a selector with calibrated parameters. - pub fn with_params(retry_multiplier: f64, inefficiency_scale: f64, max_failure_probability: f64) -> Self { - Self { - id: AgentId::new(), - retry_multiplier: retry_multiplier.max(1.0), - inefficiency_scale: inefficiency_scale.max(0.0), - max_failure_probability: max_failure_probability.clamp(0.0, 0.99), - } - } - - /// Calculate expected cost for a model given task complexity. - /// - /// # Formula - /// ```text - /// expected_cost = base_cost * (1 + failure_prob * retry_cost) * inefficiency_factor - /// ``` - /// - /// # Parameters - /// - `pricing`: Model pricing info - /// - `complexity`: Task complexity (0-1) - /// - `estimated_tokens`: Estimated tokens needed - /// - `capability`: Model capability score (0-1), from benchmarks or price heuristic - /// - /// # Returns - /// Expected cost in cents - /// - /// # Pure Function - /// No side effects, deterministic output. - fn calculate_expected_cost_with_capability( - &self, - pricing: &PricingInfo, - complexity: f64, - estimated_tokens: u64, - capability: f64, - from_benchmarks: bool, - ) -> ExpectedCost { - // Failure probability: higher complexity + lower capability = more failures - // Formula: P(fail) = complexity * (1 - capability) - let failure_prob = (complexity * (1.0 - capability)).clamp(0.0, self.max_failure_probability); - - // Token inefficiency: weaker models need more tokens - // Formula: inefficiency = 1 + (1 - capability) * 0.5 - let inefficiency = 1.0 + (1.0 - capability) * self.inefficiency_scale; - - // Retry cost: if it fails, we pay again (possibly with a better model) - let retry_multiplier = self.retry_multiplier; - - // Base cost for estimated tokens - let input_tokens = estimated_tokens / 2; - let output_tokens = estimated_tokens / 2; - let base_cost = pricing.calculate_cost_cents(input_tokens, output_tokens); - - // Adjusted for inefficiency (weak models use more tokens) - let adjusted_tokens = ((estimated_tokens as f64) * inefficiency) as u64; - let adjusted_cost = pricing.calculate_cost_cents(adjusted_tokens / 2, adjusted_tokens / 2); - - // Expected cost including retry probability - let expected_cost = (adjusted_cost as f64) * (1.0 + failure_prob * retry_multiplier); - - ExpectedCost { - model_id: pricing.model_id.clone(), - base_cost_cents: base_cost, - expected_cost_cents: expected_cost.ceil() as u64, - failure_probability: failure_prob, - capability, - inefficiency, - from_benchmarks, - } - } - - /// Calculate expected cost using price-based capability (fallback). - fn calculate_expected_cost( - &self, - pricing: &PricingInfo, - complexity: f64, - estimated_tokens: u64, - ) -> ExpectedCost { - let avg_cost = pricing.average_cost_per_token(); - let capability = self.estimate_capability_from_price(avg_cost); - self.calculate_expected_cost_with_capability(pricing, complexity, estimated_tokens, capability, false) - } - - /// Estimate model capability from its cost (fallback heuristic). - /// - /// # Heuristic - /// More expensive models are generally more capable. - /// Uses log scale to normalize across price ranges. - /// - /// # Returns - /// Capability score 0-1 - fn estimate_capability_from_price(&self, avg_cost_per_token: f64) -> f64 { - // Cost tiers (per token): - // < 0.0001: weak (capability ~0.3) - // 0.0001-0.001: moderate (capability ~0.6) - // > 0.001: strong (capability ~0.9) - - if avg_cost_per_token < 0.0000001 { - return 0.3; // Free/very cheap - } - - // Log scale normalization - let log_cost = avg_cost_per_token.log10(); - // Map from ~-7 (cheap) to ~-3 (expensive) => 0.3 to 0.95 - let normalized = ((log_cost + 7.0) / 4.0).clamp(0.0, 1.0); - - 0.3 + normalized * 0.65 - } - - /// Get model capability from benchmarks (preferred) or fall back to price heuristic. - /// - /// # Benchmark-Based Capability - /// Uses actual benchmark scores from llm-stats.com when available. - /// This provides task-type-specific capability estimation. - async fn get_capability( - &self, - model_id: &str, - task_type: TaskType, - avg_cost_per_token: f64, - ctx: &AgentContext, - ) -> (f64, bool) { - // Try to get benchmark-based capability - if let Some(benchmarks) = &ctx.benchmarks { - let registry = benchmarks.read().await; - if let Some(model) = registry.get(model_id) { - if model.has_benchmarks() { - let capability = model.capability(task_type); - tracing::info!( - "Using benchmark capability for {}: {:.3} (task_type: {:?})", - model_id, capability, task_type - ); - return (capability, true); // (capability, from_benchmarks) - } - } - } - - // Fall back to price-based heuristic - let capability = self.estimate_capability_from_price(avg_cost_per_token); - tracing::debug!( - "Using price-based capability for {}: {:.3} (avg_cost: {:.10})", - model_id, capability, avg_cost_per_token - ); - (capability, false) - } - - /// Select optimal model from available options. - /// - /// # Algorithm - /// 1. Calculate expected cost for each model using benchmark capabilities when available - /// 2. If user requested a specific model, use it as minimum capability floor - /// 3. Filter models exceeding budget - /// 4. Select model with minimum expected cost - /// 5. Include fallbacks in case of failure - /// - /// # Preconditions - /// - `models` is non-empty - /// - `budget_cents > 0` - async fn select_optimal( - &self, - models: &[PricingInfo], - complexity: f64, - estimated_tokens: u64, - budget_cents: u64, - task_type: TaskType, - historical_stats: Option<&HashMap>, - requested_model: Option<&str>, - ctx: &AgentContext, - ) -> Option { - if models.is_empty() { - return None; - } - - // Calculate expected cost for all models, using benchmark or historical stats when available - let mut costs: Vec = Vec::with_capacity(models.len()); - let mut any_from_benchmarks = false; - - for m in models { - let cost = if let Some(stats) = historical_stats.and_then(|h| h.get(&m.model_id)) { - // Use historical data if available (highest priority) - self.calculate_expected_cost_with_history(m, complexity, estimated_tokens, stats) - } else { - // Use benchmark data for capability - let (capability, from_benchmarks) = self.get_capability( - &m.model_id, - task_type, - m.average_cost_per_token(), - ctx, - ).await; - - if from_benchmarks { - any_from_benchmarks = true; - } - - self.calculate_expected_cost_with_capability( - m, complexity, estimated_tokens, capability, from_benchmarks - ) - }; - costs.push(cost); - } - - // Sort by expected cost (ascending) - costs.sort_by(|a, b| { - a.expected_cost_cents - .cmp(&b.expected_cost_cents) - }); - - // If user requested a specific model, use it as minimum capability floor - // Filter out models with lower capability than the requested one - let min_capability = if let Some(req_model) = requested_model { - // Find the requested model's capability - if let Some(req_cost) = costs.iter().find(|c| c.model_id == req_model) { - tracing::info!( - "Using requested model {} as capability floor: {:.3}", - req_model, - req_cost.capability - ); - req_cost.capability - } else { - // Requested model not found - fall back to looking up its price - if let Some(req_pricing) = models.iter().find(|m| m.model_id == req_model) { - let cap = self.estimate_capability_from_price(req_pricing.average_cost_per_token()); - tracing::info!( - "Requested model {} not in costs list, using price-based capability: {:.3}", - req_model, - cap - ); - cap - } else { - // Model not found at all, use a reasonable floor (0.7 = mid-tier) - tracing::warn!( - "Requested model {} not found, using default capability floor 0.7", - req_model - ); - 0.7 - } - } - } else { - 0.0 // No minimum - }; - - // Filter to models meeting minimum capability - let filtered_costs: Vec<_> = if min_capability > 0.0 { - costs.iter() - .filter(|c| c.capability >= min_capability * 0.95) // Allow 5% tolerance - .cloned() - .collect() - } else { - costs.clone() - }; - - let costs_to_use = if filtered_costs.is_empty() { - tracing::warn!("No models meet minimum capability {:.2}, using all models", min_capability); - &costs - } else { - &filtered_costs - }; - - // Find cheapest model within budget - let within_budget: Vec<_> = costs_to_use - .iter() - .filter(|c| c.expected_cost_cents <= budget_cents) - .cloned() - .collect(); - - let selected = within_budget.first().cloned().or_else(|| costs_to_use.first().cloned())?; - - // Get fallback models (next best options) - let fallbacks: Vec = costs - .iter() - .filter(|c| c.model_id != selected.model_id) - .take(2) - .map(|c| c.model_id.clone()) - .collect(); - - let used_history = historical_stats.and_then(|h| h.get(&selected.model_id)).is_some(); - - let recommendation = ModelRecommendation { - model_id: selected.model_id.clone(), - expected_cost_cents: selected.expected_cost_cents, - confidence: 1.0 - selected.failure_probability, - reasoning: format!( - "Selected {} for {:?} task with expected cost {} cents (capability: {:.2}, failure prob: {:.2}){}{}", - selected.model_id, - task_type, - selected.expected_cost_cents, - selected.capability, - selected.failure_probability, - if used_history { " [historical]" } else { "" }, - if selected.from_benchmarks { " [benchmark]" } else { "" } - ), - fallbacks, - used_historical_data: used_history, - used_benchmark_data: selected.from_benchmarks, - task_type: Some(task_type), - }; - - tracing::info!( - "Model selected: {} (task: {:?}, cost: {} cents, benchmark_data: {}, history: {})", - recommendation.model_id, - task_type, - recommendation.expected_cost_cents, - recommendation.used_benchmark_data, - recommendation.used_historical_data - ); - - Some(recommendation) - } - - /// Calculate expected cost using actual historical statistics. - /// - /// This uses real success rates and cost ratios from past executions - /// instead of heuristic estimates. - fn calculate_expected_cost_with_history( - &self, - pricing: &PricingInfo, - _complexity: f64, - estimated_tokens: u64, - stats: &ModelStats, - ) -> ExpectedCost { - // Use actual failure rate from history (inverted success rate) - let failure_prob = (1.0 - stats.success_rate).clamp(0.0, self.max_failure_probability); - - // Use actual token ratio from history for inefficiency - let inefficiency = stats.avg_token_ratio.clamp(0.5, 3.0); - - // Base cost for estimated tokens - let input_tokens = estimated_tokens / 2; - let output_tokens = estimated_tokens / 2; - let base_cost = pricing.calculate_cost_cents(input_tokens, output_tokens); - - // Adjust for actual inefficiency - let adjusted_tokens = ((estimated_tokens as f64) * inefficiency) as u64; - let adjusted_cost = pricing.calculate_cost_cents(adjusted_tokens / 2, adjusted_tokens / 2); - - // Apply actual cost ratio (how much more/less than predicted) - let cost_with_ratio = (adjusted_cost as f64) * stats.avg_cost_ratio.clamp(0.5, 3.0); - - // Expected cost including retry probability - let expected_cost = cost_with_ratio * (1.0 + failure_prob * self.retry_multiplier); - - // Capability estimated from success rate rather than price - let capability = stats.success_rate.clamp(0.3, 0.95); - - ExpectedCost { - model_id: pricing.model_id.clone(), - base_cost_cents: base_cost, - expected_cost_cents: expected_cost.ceil() as u64, - failure_probability: failure_prob, - capability, - inefficiency, - from_benchmarks: false, // Historical data is not benchmark data - } - } - - /// Query historical model stats from memory. - async fn get_historical_model_stats( - &self, - complexity: f64, - ctx: &AgentContext, - ) -> Option> { - let memory = ctx.memory.as_ref()?; - - // Query stats for models at similar complexity levels (+/- 0.2) - match memory.retriever.get_model_stats(complexity, 0.2).await { - Ok(stats) if !stats.is_empty() => { - tracing::debug!( - "Found historical stats for {} models at complexity ~{:.2}", - stats.len(), - complexity - ); - - // Convert to HashMap for easy lookup - Some(stats.into_iter() - .map(|s| (s.model_id.clone(), s)) - .collect()) - } - Ok(_) => { - tracing::debug!("No historical stats found for complexity ~{:.2}", complexity); - None - } - Err(e) => { - tracing::warn!("Failed to fetch model stats: {}", e); - None - } - } - } -} - -/// Intermediate calculation result for a model. -#[derive(Debug, Clone)] -struct ExpectedCost { - model_id: String, - #[allow(dead_code)] - base_cost_cents: u64, - expected_cost_cents: u64, - failure_probability: f64, - capability: f64, - #[allow(dead_code)] - inefficiency: f64, - /// Whether capability was derived from benchmark data - from_benchmarks: bool, -} - -impl Default for ModelSelector { - fn default() -> Self { - Self::new() - } -} - -#[async_trait] -impl Agent for ModelSelector { - fn id(&self) -> &AgentId { - &self.id - } - - fn agent_type(&self) -> AgentType { - AgentType::ModelSelector - } - - fn description(&self) -> &str { - "Selects optimal model for task based on complexity and budget (U-curve optimization)" - } - - /// Select the optimal model for a task. - /// - /// # Expected Input - /// Task should have complexity data in its context (from ComplexityEstimator). - /// - /// # Returns - /// AgentResult with ModelRecommendation in the `data` field. - /// - /// # Benchmark Integration - /// When benchmark data is available, uses actual benchmark scores for - /// task-type-specific capability estimation. - /// - /// # Learning Integration - /// When memory is available, queries historical model statistics and uses - /// actual success rates/cost ratios instead of heuristics. - async fn execute(&self, task: &mut Task, ctx: &AgentContext) -> AgentResult { - // Get complexity + estimated tokens from task analysis (populated by ComplexityEstimator). - let complexity = task - .analysis() - .complexity_score - .unwrap_or(0.5) - .clamp(0.0, 1.0); - let estimated_tokens = task.analysis().estimated_total_tokens.unwrap_or(2000_u64); - - // Infer task type from description for benchmark-based selection - let task_type = TaskType::infer_from_description(task.description()); - - // Get available budget - let budget_cents = task.budget().remaining_cents(); - - // Query historical model stats (if memory available) - let historical_stats = self.get_historical_model_stats(complexity, ctx).await; - - // Fetch pricing for tool-supporting models only - let models = ctx.pricing.models_by_cost_filtered(true).await; - - if models.is_empty() { - // Fall back to configured default model (after resolving to latest) - let default_model = if let Some(resolver) = &ctx.resolver { - let resolver = resolver.read().await; - let resolved = resolver.resolve(&ctx.config.default_model); - if resolved.upgraded { - tracing::info!( - "Default model auto-upgraded: {} → {}", - resolved.original, resolved.resolved - ); - } - resolved.resolved - } else { - ctx.config.default_model.clone() - }; - - // Record on task analysis - { - let a = task.analysis_mut(); - a.selected_model = Some(default_model.clone()); - } - - return AgentResult::success( - "Using configured default model (no other models available)", - 0, - ) - .with_data(json!({ - "model_id": default_model, - "expected_cost_cents": 50, - "confidence": 0.8, - "reasoning": "Fallback to configured default model", - "fallbacks": [], - "used_historical_data": false, - "used_benchmark_data": false, - "task_type": format!("{:?}", task_type), - })); - } - - // Get user-requested model - if specified, resolve to latest version and use it - let requested_model = task.analysis().requested_model.clone(); - - // Auto-upgrade outdated model names using the resolver - let (resolved_model, was_upgraded) = if let Some(ref req_model) = requested_model { - if let Some(resolver) = &ctx.resolver { - let resolver = resolver.read().await; - let resolved = resolver.resolve(req_model); - if resolved.upgraded { - tracing::info!( - "Model auto-upgraded: {} → {} ({})", - resolved.original, - resolved.resolved, - resolved.reason.as_deref().unwrap_or("family upgrade") - ); - } - (Some(resolved.resolved), resolved.upgraded) - } else { - (Some(req_model.clone()), false) - } - } else { - (None, false) - }; - - // If user explicitly requested a model (possibly upgraded), use it directly - // This bypasses the allowlist check - user knows what they want - if let Some(ref req_model) = resolved_model { - let in_allowlist = models.iter().any(|m| &m.model_id == req_model); - let upgrade_note = if was_upgraded { - format!(" (auto-upgraded from {})", requested_model.as_deref().unwrap_or("unknown")) - } else { - String::new() - }; - - if !in_allowlist { - tracing::warn!( - "Requested model {} not in allowlist, but using it anyway (user override){}", - req_model, - upgrade_note - ); - } - - tracing::info!( - "Using requested model directly: {}{}", - req_model, - upgrade_note - ); - - // Record selection in analysis - { - let a = task.analysis_mut(); - a.selected_model = Some(req_model.clone()); - a.estimated_cost_cents = Some(50); // Default estimate - } - - return AgentResult::success( - &format!("Using requested model: {}{}", req_model, upgrade_note), - 1, - ) - .with_data(json!({ - "model_id": req_model, - "expected_cost_cents": 50, - "confidence": 1.0, - "reasoning": format!("User requested model: {}{}", req_model, upgrade_note), - "fallbacks": [], - "used_historical_data": false, - "used_benchmark_data": false, - "was_upgraded": was_upgraded, - "original_model": requested_model, - "task_type": format!("{:?}", task_type), - "in_allowlist": in_allowlist, - })); - } - - match self.select_optimal( - &models, - complexity, - estimated_tokens, - budget_cents, - task_type, - historical_stats.as_ref(), - resolved_model.as_deref(), - ctx, - ).await { - Some(rec) => { - // Record selection in analysis - { - let a = task.analysis_mut(); - a.selected_model = Some(rec.model_id.clone()); - a.estimated_cost_cents = Some(rec.expected_cost_cents); - } - - AgentResult::success( - &rec.reasoning, - 1, // Minimal cost for selection itself - ) - .with_data(json!({ - "model_id": rec.model_id, - "expected_cost_cents": rec.expected_cost_cents, - "confidence": rec.confidence, - "reasoning": rec.reasoning, - "fallbacks": rec.fallbacks, - "used_historical_data": rec.used_historical_data, - "used_benchmark_data": rec.used_benchmark_data, - "task_type": format!("{:?}", task_type), - "historical_stats_available": historical_stats.as_ref().map(|h| h.len()), - "inputs": { - "complexity": complexity, - "estimated_tokens": estimated_tokens, - "budget_cents": budget_cents - } - })) - } - None => { - AgentResult::failure( - "No suitable model found within budget", - 0, - ) - } - } - } -} - -impl LeafAgent for ModelSelector { - fn capability(&self) -> LeafCapability { - LeafCapability::ModelSelection - } -} - -#[cfg(test)] -mod tests { - use super::*; - - fn make_pricing(id: &str, prompt: f64, completion: f64) -> PricingInfo { - PricingInfo { - model_id: id.to_string(), - prompt_cost_per_million: prompt, - completion_cost_per_million: completion, - context_length: 100000, - max_output_tokens: None, - supports_tools: true, - } - } - - #[test] - fn test_expected_cost_u_curve() { - let selector = ModelSelector::new(); - - let cheap = make_pricing("cheap", 0.1, 0.2); - let medium = make_pricing("medium", 1.0, 2.0); - let expensive = make_pricing("expensive", 10.0, 20.0); - - let complexity = 0.7; - let tokens = 2000; - - let cheap_cost = selector.calculate_expected_cost(&cheap, complexity, tokens); - let medium_cost = selector.calculate_expected_cost(&medium, complexity, tokens); - let expensive_cost = selector.calculate_expected_cost(&expensive, complexity, tokens); - - // For complex tasks, medium should be optimal (U-curve) - // Cheap model has high failure rate - // Expensive model has high base cost - println!("Cheap: {} (fail: {})", cheap_cost.expected_cost_cents, cheap_cost.failure_probability); - println!("Medium: {} (fail: {})", medium_cost.expected_cost_cents, medium_cost.failure_probability); - println!("Expensive: {} (fail: {})", expensive_cost.expected_cost_cents, expensive_cost.failure_probability); - - // Basic sanity check: cheap model should have higher failure rate - assert!(cheap_cost.failure_probability > medium_cost.failure_probability); - } - - #[test] - fn test_task_type_inference() { - assert_eq!( - TaskType::infer_from_description("Implement a function to sort arrays"), - TaskType::Code - ); - assert_eq!( - TaskType::infer_from_description("Calculate the integral of x^2"), - TaskType::Math - ); - assert_eq!( - TaskType::infer_from_description("Explain quantum mechanics"), - TaskType::Reasoning - ); - } -} - diff --git a/src/agents/leaf/verifier.rs b/src/agents/leaf/verifier.rs deleted file mode 100644 index b3a3124..0000000 --- a/src/agents/leaf/verifier.rs +++ /dev/null @@ -1,361 +0,0 @@ -//! Verification agent - validates task completion. -//! -//! # Verification Strategy (Hybrid) -//! 1. Try programmatic verification first (fast, deterministic) -//! 2. Fall back to LLM verification if needed -//! -//! # Programmatic Checks -//! - File exists -//! - Command succeeds -//! - Output matches pattern - -use async_trait::async_trait; -use std::path::Path; -use std::process::Stdio; -use tokio::process::Command; - -use crate::agents::{ - Agent, AgentContext, AgentId, AgentResult, AgentType, LeafAgent, LeafCapability, -}; -use crate::llm::{ChatMessage, Role}; -use crate::task::{ProgrammaticCheck, Task, VerificationCriteria, VerificationMethod, VerificationResult}; - -/// Agent that verifies task completion. -/// -/// # Hybrid Verification -/// - Programmatic: Fast, deterministic, no cost -/// - LLM: Flexible, for subjective criteria -pub struct Verifier { - id: AgentId, -} - -impl Verifier { - /// Create a new verifier. - pub fn new() -> Self { - Self { id: AgentId::new() } - } - - /// Execute a programmatic check. - /// - /// # Returns - /// `Ok(true)` if check passes, `Ok(false)` if fails, `Err` on error. - /// - /// # Note - /// Paths in checks can be absolute or relative to working_dir. - async fn run_programmatic_check( - &self, - check: &ProgrammaticCheck, - working_dir: &Path, - ) -> Result { - match check { - ProgrammaticCheck::FileExists { path } => { - let full_path = Self::resolve_path(path, working_dir); - Ok(full_path.exists()) - } - - ProgrammaticCheck::FileContains { path, content } => { - let full_path = Self::resolve_path(path, working_dir); - match tokio::fs::read_to_string(&full_path).await { - Ok(file_content) => Ok(file_content.contains(content)), - Err(_) => Ok(false), - } - } - - ProgrammaticCheck::CommandSucceeds { command } => { - let output = Command::new("sh") - .arg("-c") - .arg(command) - .current_dir(working_dir) - .stdout(Stdio::null()) - .stderr(Stdio::null()) - .status() - .await - .map_err(|e| e.to_string())?; - - Ok(output.success()) - } - - ProgrammaticCheck::CommandOutputMatches { command, pattern } => { - let output = Command::new("sh") - .arg("-c") - .arg(command) - .current_dir(working_dir) - .output() - .await - .map_err(|e| e.to_string())?; - - let stdout = String::from_utf8_lossy(&output.stdout); - let regex = regex::Regex::new(pattern).map_err(|e| e.to_string())?; - Ok(regex.is_match(&stdout)) - } - - ProgrammaticCheck::DirectoryExists { path } => { - let full_path = Self::resolve_path(path, working_dir); - Ok(full_path.is_dir()) - } - - ProgrammaticCheck::FileMatchesRegex { path, pattern } => { - let full_path = Self::resolve_path(path, working_dir); - match tokio::fs::read_to_string(&full_path).await { - Ok(content) => { - let regex = regex::Regex::new(pattern).map_err(|e| e.to_string())?; - Ok(regex.is_match(&content)) - } - Err(_) => Ok(false), - } - } - - ProgrammaticCheck::All(checks) => { - for c in checks { - if !Box::pin(self.run_programmatic_check(c, working_dir)).await? { - return Ok(false); - } - } - Ok(true) - } - - ProgrammaticCheck::Any(checks) => { - for c in checks { - if Box::pin(self.run_programmatic_check(c, working_dir)).await? { - return Ok(true); - } - } - Ok(false) - } - } - } - - /// Resolve a path - if absolute, use as-is; if relative, join with working_dir. - fn resolve_path(path_str: &str, working_dir: &Path) -> std::path::PathBuf { - let path = Path::new(path_str); - if path.is_absolute() { - path.to_path_buf() - } else { - working_dir.join(path) - } - } - - /// Verify using LLM. - /// - /// # Parameters - /// - `task`: The task that was executed - /// - `success_criteria`: What success looks like - /// - `task_output`: The actual output produced by the executor - /// - `ctx`: Agent context - /// - /// # Returns - /// VerificationResult with LLM's assessment - async fn verify_with_llm( - &self, - task: &Task, - success_criteria: &str, - task_output: Option<&str>, - ctx: &AgentContext, - ) -> VerificationResult { - let output_section = task_output - .map(|o| format!("\n\nActual Output/Result:\n{}\n", o.chars().take(3000).collect::())) - .unwrap_or_default(); - - let prompt = format!( - r#"You are verifying if a task was completed correctly. - -Task: {} - -Success Criteria: {} -{output_section} -Based on what was actually accomplished (shown in the output above), respond with a JSON object: -{{ - "passed": true/false, - "reasoning": "explanation of why the task passed or failed based on the actual output" -}} - -Be lenient - if the core goal was achieved even if the format isn't perfect, pass it. -Respond ONLY with the JSON object."#, - task.description(), - success_criteria - ); - - let messages = vec![ - ChatMessage::new(Role::System, "You are a precise task verifier. Respond only with JSON."), - ChatMessage::new(Role::User, prompt), - ]; - - let model = "openai/gpt-4.1-mini"; - - match ctx.llm.chat_completion(model, &messages, None).await { - Ok(response) => { - let content = response.content.unwrap_or_default(); - self.parse_llm_verification(&content, model) - } - Err(e) => { - VerificationResult::fail( - format!("LLM verification failed: {}", e), - VerificationMethod::Llm { model: model.to_string() }, - 0, - ) - } - } - } - - /// Parse LLM verification response. - fn parse_llm_verification(&self, response: &str, model: &str) -> VerificationResult { - if let Ok(json) = serde_json::from_str::(response) { - let passed = json["passed"].as_bool().unwrap_or(false); - let reasoning = json["reasoning"] - .as_str() - .unwrap_or("No reasoning provided") - .to_string(); - - if passed { - VerificationResult::pass( - reasoning, - VerificationMethod::Llm { model: model.to_string() }, - 1, // Minimal cost - ) - } else { - VerificationResult::fail( - reasoning, - VerificationMethod::Llm { model: model.to_string() }, - 1, - ) - } - } else { - // Try to infer from text - let passed = response.to_lowercase().contains("pass") - || response.to_lowercase().contains("success") - || response.to_lowercase().contains("completed"); - - if passed { - VerificationResult::pass( - response.to_string(), - VerificationMethod::Llm { model: model.to_string() }, - 1, - ) - } else { - VerificationResult::fail( - response.to_string(), - VerificationMethod::Llm { model: model.to_string() }, - 1, - ) - } - } - } - - /// Run verification according to criteria. - async fn verify( - &self, - task: &Task, - ctx: &AgentContext, - ) -> VerificationResult { - match task.verification() { - VerificationCriteria::None => { - VerificationResult::pass( - "No verification required", - VerificationMethod::None, - 0, - ) - } - - VerificationCriteria::Programmatic(check) => { - match self.run_programmatic_check(check, &ctx.working_dir).await { - Ok(true) => VerificationResult::pass( - "Programmatic check passed", - VerificationMethod::Programmatic, - 0, - ), - Ok(false) => VerificationResult::fail( - "Programmatic check failed", - VerificationMethod::Programmatic, - 0, - ), - Err(e) => VerificationResult::fail( - format!("Programmatic check error: {}", e), - VerificationMethod::Programmatic, - 0, - ), - } - } - - VerificationCriteria::LlmBased { success_criteria } => { - // Get last output from task analysis if available - let last_output = task.last_output(); - self.verify_with_llm(task, success_criteria, last_output, ctx).await - } - - VerificationCriteria::Hybrid { programmatic, llm_fallback } => { - // Try programmatic first - match self.run_programmatic_check(programmatic, &ctx.working_dir).await { - Ok(true) => VerificationResult::pass( - "Programmatic check passed", - VerificationMethod::Programmatic, - 0, - ), - Ok(false) => { - // Fall back to LLM - let last_output = task.last_output(); - self.verify_with_llm(task, llm_fallback, last_output, ctx).await - } - Err(_) => { - // Error in programmatic, fall back to LLM - let last_output = task.last_output(); - self.verify_with_llm(task, llm_fallback, last_output, ctx).await - } - } - } - } - } -} - -impl Default for Verifier { - fn default() -> Self { - Self::new() - } -} - -#[async_trait] -impl Agent for Verifier { - fn id(&self) -> &AgentId { - &self.id - } - - fn agent_type(&self) -> AgentType { - AgentType::Verifier - } - - fn description(&self) -> &str { - "Verifies task completion using programmatic checks and LLM fallback" - } - - async fn execute(&self, task: &mut Task, ctx: &AgentContext) -> AgentResult { - let result = self.verify(task, ctx).await; - - if result.passed() { - AgentResult::success( - result.reasoning(), - result.cost_cents(), - ) - .with_data(serde_json::json!({ - "passed": true, - "method": format!("{:?}", result.method()), - "reasoning": result.reasoning(), - })) - } else { - AgentResult::failure( - result.reasoning(), - result.cost_cents(), - ) - .with_data(serde_json::json!({ - "passed": false, - "method": format!("{:?}", result.method()), - "reasoning": result.reasoning(), - })) - } - } -} - -impl LeafAgent for Verifier { - fn capability(&self) -> LeafCapability { - LeafCapability::Verification - } -} - diff --git a/src/agents/mod.rs b/src/agents/mod.rs index 985043c..66004c4 100644 --- a/src/agents/mod.rs +++ b/src/agents/mod.rs @@ -1,29 +1,33 @@ -//! Agents module - the hierarchical agent tree. +//! Agents module - task execution system. //! //! # Agent Types -//! - **RootAgent**: Top-level orchestrator, receives tasks from API -//! - **NodeAgent**: Intermediate orchestrator, delegates to children -//! - **LeafAgent**: Specialized agents that do actual work +//! - **SimpleAgent**: Unified agent that directly executes tasks +//! - **TaskExecutor**: Core execution loop with tools (used by SimpleAgent) //! -//! # Leaf Agent Specializations -//! - `ComplexityEstimator`: Estimates task difficulty -//! - `ModelSelector`: Chooses optimal model for cost/capability -//! - `TaskExecutor`: Executes tasks using tools -//! - `Verifier`: Validates task completion +//! # Legacy Types (deprecated, will be removed) +//! - **RootAgent**: Complex orchestrator (replaced by SimpleAgent) +//! - **NodeAgent**: Recursive splitter (removed - lost context) +//! - **ComplexityEstimator**: LLM-based estimation (unreliable) +//! - **ModelSelector**: U-curve optimization (over-engineered) +//! - **Verifier**: LLM self-verification (rubber-stamped everything) //! //! # Design Principles -//! - Agents communicate synchronously (parent calls child, child returns) -//! - Designed for future async message passing migration -//! - All operations return `Result` with meaningful errors +//! - Direct execution without orchestration overhead +//! - User controls task granularity (no auto-splitting) +//! - Blocker detection via system prompt rules +//! - Mission completion via complete_mission tool mod types; mod context; mod tree; pub mod tuning; -pub mod orchestrator; +pub mod orchestrator; // TODO: Remove after migration pub mod leaf; +mod simple; -pub use types::{AgentId, AgentType, AgentResult, AgentError, Complexity}; +pub use simple::SimpleAgent; + +pub use types::{AgentId, AgentType, AgentResult, AgentError, Complexity, TerminalReason}; pub use context::AgentContext; pub use tree::{AgentTree, AgentRef}; pub use tuning::TuningParams; diff --git a/src/agents/orchestrator/mod.rs b/src/agents/orchestrator/mod.rs index 513afee..c6571e0 100644 --- a/src/agents/orchestrator/mod.rs +++ b/src/agents/orchestrator/mod.rs @@ -1,8 +1,4 @@ -//! Orchestrator agents - Root and Node agents that manage the tree. - -mod root; -mod node; - -pub use root::RootAgent; -pub use node::NodeAgent; - +//! Orchestrator agents (legacy - removed). +//! +//! The orchestrator agents (RootAgent, NodeAgent) have been superseded by SimpleAgent. +//! This module is kept empty for now; will be removed in a future cleanup. diff --git a/src/agents/orchestrator/node.rs b/src/agents/orchestrator/node.rs deleted file mode 100644 index d45fa1a..0000000 --- a/src/agents/orchestrator/node.rs +++ /dev/null @@ -1,889 +0,0 @@ -//! Node agent - intermediate orchestrator in the agent tree. -//! -//! Node agents are like mini-root agents that can: -//! - Receive delegated tasks from parent -//! - Estimate complexity and split complex subtasks further (recursive) -//! - Delegate to their own children -//! - Aggregate results for parent - -use std::sync::Arc; - -use async_trait::async_trait; -use serde_json::json; - -use crate::agents::{ - leaf::{ComplexityEstimator, ModelSelector, TaskExecutor, Verifier}, - Agent, AgentContext, AgentId, AgentRef, AgentResult, AgentType, Complexity, OrchestratorAgent, -}; -use crate::budget::Budget; -use crate::llm::{ChatMessage, Role}; -use crate::task::{Subtask, SubtaskPlan, Task, VerificationCriteria}; - -/// Node agent - intermediate orchestrator. -/// -/// # Purpose -/// Handles subtasks that may still be complex enough -/// to warrant further splitting. Now with full recursive -/// splitting capabilities like RootAgent. -/// -/// # Recursive Splitting -/// NodeAgent can estimate complexity of its subtasks and -/// recursively split them if they're still too complex, -/// respecting the `max_split_depth` in context. -pub struct NodeAgent { - id: AgentId, - - /// Name for identification in logs - name: String, - - // Child agents - full pipeline for recursive splitting - complexity_estimator: Arc, - model_selector: Arc, - task_executor: Arc, - verifier: Arc, - - // Child node agents (for further splitting) - child_nodes: Vec>, -} - -impl NodeAgent { - /// Create a new node agent with full recursive capabilities. - pub fn new(name: impl Into) -> Self { - Self { - id: AgentId::new(), - name: name.into(), - complexity_estimator: Arc::new(ComplexityEstimator::new()), - model_selector: Arc::new(ModelSelector::new()), - task_executor: Arc::new(TaskExecutor::new()), - verifier: Arc::new(Verifier::new()), - child_nodes: Vec::new(), - } - } - - /// Create a node with custom executor. - pub fn with_executor(mut self, executor: Arc) -> Self { - self.task_executor = executor; - self - } - - /// Add a child node for hierarchical delegation. - pub fn add_child_node(&mut self, child: Arc) { - self.child_nodes.push(child); - } - - /// Get the node's name. - pub fn name(&self) -> &str { - &self.name - } - - /// Estimate complexity of a task. - async fn estimate_complexity(&self, task: &mut Task, ctx: &AgentContext) -> Complexity { - let result = self.complexity_estimator.execute(task, ctx).await; - - if let Some(data) = result.data { - let score = data["score"].as_f64().unwrap_or(0.5); - let reasoning = data["reasoning"].as_str().unwrap_or("").to_string(); - let estimated_tokens = data["estimated_tokens"].as_u64().unwrap_or(2000); - let should_split = data["should_split"].as_bool().unwrap_or(false); - - Complexity::new(score, reasoning, estimated_tokens).with_split(should_split) - } else { - Complexity::moderate("Could not estimate complexity") - } - } - - /// Split a complex task into subtasks. - async fn split_task( - &self, - task: &Task, - ctx: &AgentContext, - ) -> Result { - let prompt = format!( - r#"You are a task planner. Break down this task into smaller, manageable subtasks. - -Task: {} - -Respond with a JSON object: -{{ - "subtasks": [ - {{ - "description": "What to do", - "verification": "How to verify it's done", - "weight": 1.0, - "dependencies": [] - }} - ], - "reasoning": "Why this breakdown makes sense" -}} - -Guidelines: -- Each subtask should be independently executable once its dependencies are complete -- The "dependencies" array contains indices (0-based) of subtasks that MUST complete before this one can start -- For example, if subtask 2 needs subtask 0's output, set "dependencies": [0] -- Include verification for each subtask -- Weight indicates relative effort (higher = more work) -- Keep subtasks focused and specific -- Aim for 2-4 subtasks typically -- IMPORTANT: If subtasks have a logical order (e.g., download before analyze), specify dependencies! - -PREFER COMMAND-LINE APPROACHES: -- For downloading files: use curl/wget, NOT browser automation -- For Chrome extensions: download CRX directly via URL pattern, then unzip -- For file analysis: use grep/find/ripgrep, NOT GUI tools -- For web APIs: use curl/fetch_url, NOT browser clicks -- Desktop automation is a LAST RESORT only when no CLI option exists - -Respond ONLY with the JSON object."#, - task.description() - ); - - let messages = vec![ - ChatMessage::new( - Role::System, - "You are a precise task planner. Respond only with JSON.", - ), - ChatMessage::new(Role::User, prompt), - ]; - - let response = ctx - .llm - .chat_completion("openai/gpt-4.1-mini", &messages, None) - .await - .map_err(|e| AgentResult::failure(format!("LLM error: {}", e), 1))?; - - let content = response.content.unwrap_or_default(); - self.parse_subtask_plan(&content, task.id()) - } - - /// Extract JSON from LLM response (handles markdown code blocks). - fn extract_json(response: &str) -> String { - let trimmed = response.trim(); - - // Check for markdown code block - if trimmed.starts_with("```") { - // Find the end of the opening fence - if let Some(start_idx) = trimmed.find('\n') { - let after_fence = &trimmed[start_idx + 1..]; - // Find the closing fence - if let Some(end_idx) = after_fence.rfind("```") { - return after_fence[..end_idx].trim().to_string(); - } - } - } - - // Try to find JSON object in the response - if let Some(start) = trimmed.find('{') { - if let Some(end) = trimmed.rfind('}') { - if end > start { - return trimmed[start..=end].to_string(); - } - } - } - - // Return as-is if no extraction needed - trimmed.to_string() - } - - /// Parse LLM response into SubtaskPlan. - fn parse_subtask_plan( - &self, - response: &str, - parent_id: crate::task::TaskId, - ) -> Result { - let extracted = Self::extract_json(response); - let json: serde_json::Value = serde_json::from_str(&extracted).map_err(|e| { - AgentResult::failure( - format!( - "Failed to parse subtasks: {} (raw: {}...)", - e, - response.chars().take(100).collect::() - ), - 0, - ) - })?; - - let reasoning = json["reasoning"] - .as_str() - .unwrap_or("No reasoning provided") - .to_string(); - - let subtasks: Vec = json["subtasks"] - .as_array() - .map(|arr| { - arr.iter() - .map(|s| { - let desc = s["description"].as_str().unwrap_or("").to_string(); - let verification = s["verification"].as_str().unwrap_or(""); - let weight = s["weight"].as_f64().unwrap_or(1.0); - - // Parse dependencies array - let dependencies: Vec = s["dependencies"] - .as_array() - .map(|deps| { - deps.iter() - .filter_map(|d| d.as_u64().map(|n| n as usize)) - .collect() - }) - .unwrap_or_default(); - - Subtask::new(desc, VerificationCriteria::llm_based(verification), weight) - .with_dependencies(dependencies) - }) - .collect() - }) - .unwrap_or_default(); - - if subtasks.is_empty() { - return Err(AgentResult::failure("No subtasks generated", 1)); - } - - SubtaskPlan::new(parent_id, subtasks, reasoning) - .map_err(|e| AgentResult::failure(format!("Invalid subtask plan: {}", e), 0)) - } - - /// Execute subtasks recursively, potentially splitting further. - async fn execute_subtasks( - &self, - subtask_plan: SubtaskPlan, - parent_budget: &Budget, - ctx: &AgentContext, - requested_model: Option<&str>, - ) -> AgentResult { - // Convert plan to tasks - let mut tasks = match subtask_plan.into_tasks(parent_budget) { - Ok(t) => t, - Err(e) => return AgentResult::failure(format!("Failed to create subtasks: {}", e), 0), - }; - - // Propagate requested_model to all subtasks - if let Some(model) = requested_model { - for task in &mut tasks { - task.analysis_mut().requested_model = Some(model.to_string()); - } - } - - let mut results = Vec::new(); - let mut total_cost = 0u64; - - // Create a child context with reduced split depth - let child_ctx = ctx.child_context(); - - // Execute each subtask recursively - for task in &mut tasks { - tracing::info!( - "NodeAgent '{}' processing subtask: {}", - self.name, - task.description().chars().take(80).collect::() - ); - - // Create a child NodeAgent for this subtask (recursive) - let child_node = NodeAgent::new(format!("{}-sub", self.name)); - - // Execute through the child node (which may split further) - let result = child_node.execute(task, &child_ctx).await; - total_cost += result.cost_cents; - - results.push(result); - } - - // Aggregate results - let successes = results.iter().filter(|r| r.success).count(); - let total = results.len(); - - // Concatenate successful outputs for meaningful aggregation - let combined_output = Self::concatenate_outputs(&results); - - if successes == total { - AgentResult::success(combined_output, total_cost).with_data(json!({ - "subtasks_total": total, - "subtasks_succeeded": successes, - "results": results.iter().map(|r| &r.output).collect::>(), - })) - } else { - AgentResult::failure( - format!( - "{}/{} subtasks succeeded\n\n{}", - successes, total, combined_output - ), - total_cost, - ) - .with_data(json!({ - "subtasks_total": total, - "subtasks_succeeded": successes, - "results": results.iter().map(|r| json!({ - "success": r.success, - "output": &r.output, - })).collect::>(), - })) - } - } - - /// Concatenate subtask outputs into a single string. - /// Used for intermediate aggregation (RootAgent handles final synthesis). - fn concatenate_outputs(results: &[AgentResult]) -> String { - let outputs: Vec = results - .iter() - .enumerate() - .filter(|(_, r)| r.success && !r.output.is_empty()) - .map(|(i, r)| { - if results.len() == 1 { - r.output.clone() - } else { - format!("### Part {}\n{}", i + 1, r.output) - } - }) - .collect(); - - if outputs.is_empty() { - "No output generated.".to_string() - } else if outputs.len() == 1 { - outputs.into_iter().next().unwrap() - } else { - outputs.join("\n\n") - } - } - - /// Execute with tree updates for visualization. - /// This method updates the parent's tree structure as this node executes. - pub async fn execute_with_tree( - &self, - task: &mut Task, - ctx: &AgentContext, - node_id: &str, - root_tree: &mut crate::api::control::AgentTreeNode, - emit_ctx: &AgentContext, - ) -> AgentResult { - use crate::api::control::AgentTreeNode; - - let mut total_cost = 0u64; - - tracing::info!( - "NodeAgent '{}' executing task (depth remaining: {}): {}", - self.name, - ctx.max_split_depth, - task.description().chars().take(80).collect::() - ); - - // Step 1: Estimate complexity - ctx.emit_phase( - "estimating_complexity", - Some("Analyzing subtask..."), - Some(&self.name), - ); - let complexity = self.estimate_complexity(task, ctx).await; - total_cost += 1; - - // Update node with complexity - if let Some(node) = root_tree.children.iter_mut().find(|n| n.id == node_id) { - node.complexity = Some(complexity.score()); - } - emit_ctx.emit_tree(root_tree.clone()); - - tracing::info!( - "NodeAgent '{}' complexity: {:.2} (should_split: {}, can_split: {})", - self.name, - complexity.score(), - complexity.should_split(), - ctx.can_split() - ); - - // Step 2: Decide execution strategy - if complexity.should_split() && ctx.can_split() { - ctx.emit_phase( - "splitting_task", - Some("Decomposing subtask..."), - Some(&self.name), - ); - tracing::info!("NodeAgent '{}' splitting task into sub-subtasks", self.name); - - match self.split_task(task, ctx).await { - Ok(plan) => { - total_cost += 2; - - // Add child nodes to this node in the tree - if let Some(parent_node) = - root_tree.children.iter_mut().find(|n| n.id == node_id) - { - for (i, subtask) in plan.subtasks().iter().enumerate() { - let child_node = AgentTreeNode::new( - &format!("{}-sub-{}", node_id, i + 1), - "Node", - &format!("Sub-subtask {}", i + 1), - &subtask.description.chars().take(40).collect::(), - ) - .with_status("pending"); - parent_node.children.push(child_node); - } - } - emit_ctx.emit_tree(root_tree.clone()); - - let subtask_count = plan.subtasks().len(); - tracing::info!( - "NodeAgent '{}' created {} sub-subtasks", - self.name, - subtask_count - ); - - // Execute subtasks recursively with tree updates - let child_ctx = ctx.child_context(); - let requested_model = task.analysis().requested_model.as_deref(); - let result = self - .execute_subtasks_with_tree( - plan, - task.budget(), - &child_ctx, - node_id, - root_tree, - emit_ctx, - requested_model, - ) - .await; - - return AgentResult { - success: result.success, - output: result.output, - cost_cents: total_cost + result.cost_cents, - model_used: result.model_used, - data: result.data, - }; - } - Err(e) => { - tracing::warn!( - "NodeAgent '{}' couldn't split, executing directly: {}", - self.name, - e.output - ); - } - } - } - - // Simple task: add child nodes for executor and verifier - if let Some(parent_node) = root_tree.children.iter_mut().find(|n| n.id == node_id) { - parent_node.children.push( - AgentTreeNode::new( - &format!("{}-executor", node_id), - "TaskExecutor", - "Task Executor", - "Execute subtask", - ) - .with_status("running"), - ); - parent_node.children.push( - AgentTreeNode::new( - &format!("{}-verifier", node_id), - "Verifier", - "Verifier", - "Verify result", - ) - .with_status("pending"), - ); - } - emit_ctx.emit_tree(root_tree.clone()); - - // Select model - ctx.emit_phase( - "selecting_model", - Some("Choosing model..."), - Some(&self.name), - ); - let sel_result = self.model_selector.execute(task, ctx).await; - total_cost += sel_result.cost_cents; - - // Execute - ctx.emit_phase("executing", Some("Running subtask..."), Some(&self.name)); - let result = self.task_executor.execute(task, ctx).await; - total_cost += result.cost_cents; - - // Update executor status - if let Some(parent_node) = root_tree.children.iter_mut().find(|n| n.id == node_id) { - if let Some(exec_node) = parent_node - .children - .iter_mut() - .find(|n| n.id == format!("{}-executor", node_id)) - { - exec_node.status = if result.success { - "completed".to_string() - } else { - "failed".to_string() - }; - exec_node.budget_spent = result.cost_cents; - } - } - emit_ctx.emit_tree(root_tree.clone()); - - // Store the executor output for verification - task.set_last_output(result.output.clone()); - - if !result.success { - return AgentResult::failure(result.output, total_cost).with_data(json!({ - "node_name": self.name, - "complexity": complexity.score(), - "was_split": false, - "execution": result.data, - })); - } - - // Verify - if let Some(parent_node) = root_tree.children.iter_mut().find(|n| n.id == node_id) { - if let Some(ver_node) = parent_node - .children - .iter_mut() - .find(|n| n.id == format!("{}-verifier", node_id)) - { - ver_node.status = "running".to_string(); - } - } - emit_ctx.emit_tree(root_tree.clone()); - - ctx.emit_phase("verifying", Some("Checking results..."), Some(&self.name)); - let verification = self.verifier.execute(task, ctx).await; - total_cost += verification.cost_cents; - - // Update verifier status - if let Some(parent_node) = root_tree.children.iter_mut().find(|n| n.id == node_id) { - if let Some(ver_node) = parent_node - .children - .iter_mut() - .find(|n| n.id == format!("{}-verifier", node_id)) - { - ver_node.status = if verification.success { - "completed".to_string() - } else { - "failed".to_string() - }; - ver_node.budget_spent = verification.cost_cents; - } - } - emit_ctx.emit_tree(root_tree.clone()); - - if verification.success { - AgentResult::success(result.output, total_cost) - .with_model(result.model_used.unwrap_or_default()) - .with_data(json!({ - "node_name": self.name, - "complexity": complexity.score(), - "was_split": false, - "execution": result.data, - "verification": verification.data, - })) - } else { - AgentResult::failure( - format!( - "Task completed but verification failed: {}", - verification.output - ), - total_cost, - ) - .with_data(json!({ - "node_name": self.name, - "complexity": complexity.score(), - "was_split": false, - "execution": result.data, - "verification": verification.data, - })) - } - } - - /// Execute subtasks with tree updates for visualization. - async fn execute_subtasks_with_tree( - &self, - subtask_plan: SubtaskPlan, - parent_budget: &Budget, - ctx: &AgentContext, - parent_node_id: &str, - root_tree: &mut crate::api::control::AgentTreeNode, - emit_ctx: &AgentContext, - requested_model: Option<&str>, - ) -> AgentResult { - let mut tasks = match subtask_plan.into_tasks(parent_budget) { - Ok(t) => t, - Err(e) => return AgentResult::failure(format!("Failed to create subtasks: {}", e), 0), - }; - - // Propagate requested_model to all subtasks - if let Some(model) = requested_model { - for task in &mut tasks { - task.analysis_mut().requested_model = Some(model.to_string()); - } - } - - let mut results = Vec::new(); - let mut total_cost = 0u64; - let child_ctx = ctx.child_context(); - - for (i, task) in tasks.iter_mut().enumerate() { - let subtask_id = format!("{}-sub-{}", parent_node_id, i + 1); - - // Update subtask status to running - if let Some(parent_node) = root_tree - .children - .iter_mut() - .find(|n| n.id == parent_node_id) - { - if let Some(child_node) = - parent_node.children.iter_mut().find(|n| n.id == subtask_id) - { - child_node.status = "running".to_string(); - } - } - emit_ctx.emit_tree(root_tree.clone()); - - tracing::info!( - "NodeAgent '{}' processing sub-subtask: {}", - self.name, - task.description().chars().take(80).collect::() - ); - - // Create and execute a child NodeAgent - let child_node_agent = NodeAgent::new(subtask_id.clone()); - let result = child_node_agent.execute(task, &child_ctx).await; - total_cost += result.cost_cents; - - // Update subtask status - if let Some(parent_node) = root_tree - .children - .iter_mut() - .find(|n| n.id == parent_node_id) - { - if let Some(child_node) = - parent_node.children.iter_mut().find(|n| n.id == subtask_id) - { - child_node.status = if result.success { - "completed".to_string() - } else { - "failed".to_string() - }; - child_node.budget_spent = result.cost_cents; - } - } - emit_ctx.emit_tree(root_tree.clone()); - - results.push(result); - } - - let successes = results.iter().filter(|r| r.success).count(); - let total = results.len(); - - // Concatenate successful outputs for meaningful aggregation - let combined_output = Self::concatenate_outputs(&results); - - if successes == total { - AgentResult::success(combined_output, total_cost).with_data(json!({ - "subtasks_total": total, - "subtasks_succeeded": successes, - })) - } else { - AgentResult::failure( - format!( - "{}/{} sub-subtasks succeeded\n\n{}", - successes, total, combined_output - ), - total_cost, - ) - .with_data(json!({ - "subtasks_total": total, - "subtasks_succeeded": successes, - })) - } - } -} - -impl Default for NodeAgent { - fn default() -> Self { - Self::new("node") - } -} - -#[async_trait] -impl Agent for NodeAgent { - fn id(&self) -> &AgentId { - &self.id - } - - fn agent_type(&self) -> AgentType { - AgentType::Node - } - - fn description(&self) -> &str { - "Intermediate orchestrator with recursive splitting capabilities" - } - - async fn execute(&self, task: &mut Task, ctx: &AgentContext) -> AgentResult { - let mut total_cost = 0u64; - - tracing::info!( - "NodeAgent '{}' executing task (depth remaining: {}): {}", - self.name, - ctx.max_split_depth, - task.description().chars().take(80).collect::() - ); - - // Step 1: Estimate complexity - ctx.emit_phase( - "estimating_complexity", - Some("Analyzing subtask..."), - Some(&self.name), - ); - let complexity = self.estimate_complexity(task, ctx).await; - total_cost += 1; - - tracing::info!( - "NodeAgent '{}' complexity: {:.2} (should_split: {}, can_split: {})", - self.name, - complexity.score(), - complexity.should_split(), - ctx.can_split() - ); - - // Step 2: Decide execution strategy - if complexity.should_split() && ctx.can_split() { - // Complex subtask: split further recursively - ctx.emit_phase( - "splitting_task", - Some("Decomposing subtask..."), - Some(&self.name), - ); - tracing::info!("NodeAgent '{}' splitting task into sub-subtasks", self.name); - - match self.split_task(task, ctx).await { - Ok(plan) => { - total_cost += 2; // Splitting cost - - let subtask_count = plan.subtasks().len(); - tracing::info!( - "NodeAgent '{}' created {} sub-subtasks", - self.name, - subtask_count - ); - - // Execute subtasks recursively - let requested_model = task.analysis().requested_model.as_deref(); - let result = self - .execute_subtasks(plan, task.budget(), ctx, requested_model) - .await; - - return AgentResult { - success: result.success, - output: result.output, - cost_cents: total_cost + result.cost_cents, - model_used: result.model_used, - data: result.data, - }; - } - Err(e) => { - tracing::warn!( - "NodeAgent '{}' couldn't split, executing directly: {}", - self.name, - e.output - ); - } - } - } - - // Simple task or failed to split: execute directly - // Select model - ctx.emit_phase( - "selecting_model", - Some("Choosing model..."), - Some(&self.name), - ); - let sel_result = self.model_selector.execute(task, ctx).await; - total_cost += sel_result.cost_cents; - - // Execute - ctx.emit_phase("executing", Some("Running subtask..."), Some(&self.name)); - let result = self.task_executor.execute(task, ctx).await; - total_cost += result.cost_cents; - - // Store the executor output for verification - task.set_last_output(result.output.clone()); - - if !result.success { - return AgentResult::failure(result.output, total_cost).with_data(json!({ - "node_name": self.name, - "complexity": complexity.score(), - "was_split": false, - "execution": result.data, - })); - } - - // Verify - ctx.emit_phase("verifying", Some("Checking results..."), Some(&self.name)); - let verification = self.verifier.execute(task, ctx).await; - total_cost += verification.cost_cents; - - if verification.success { - AgentResult::success(result.output, total_cost) - .with_model(result.model_used.unwrap_or_default()) - .with_data(json!({ - "node_name": self.name, - "complexity": complexity.score(), - "was_split": false, - "execution": result.data, - "verification": verification.data, - })) - } else { - AgentResult::failure( - format!( - "Task completed but verification failed: {}", - verification.output - ), - total_cost, - ) - .with_data(json!({ - "node_name": self.name, - "complexity": complexity.score(), - "was_split": false, - "execution": result.data, - "verification": verification.data, - })) - } - } -} - -#[async_trait] -impl OrchestratorAgent for NodeAgent { - fn children(&self) -> Vec { - let mut children: Vec = vec![ - Arc::clone(&self.complexity_estimator) as AgentRef, - Arc::clone(&self.model_selector) as AgentRef, - Arc::clone(&self.task_executor) as AgentRef, - Arc::clone(&self.verifier) as AgentRef, - ]; - - for node in &self.child_nodes { - children.push(Arc::clone(node) as AgentRef); - } - - children - } - - fn find_child(&self, agent_type: AgentType) -> Option { - match agent_type { - AgentType::ComplexityEstimator => { - Some(Arc::clone(&self.complexity_estimator) as AgentRef) - } - AgentType::ModelSelector => Some(Arc::clone(&self.model_selector) as AgentRef), - AgentType::TaskExecutor => Some(Arc::clone(&self.task_executor) as AgentRef), - AgentType::Verifier => Some(Arc::clone(&self.verifier) as AgentRef), - AgentType::Node => self.child_nodes.first().map(|n| Arc::clone(n) as AgentRef), - _ => None, - } - } - - async fn delegate(&self, task: &mut Task, child: AgentRef, ctx: &AgentContext) -> AgentResult { - child.execute(task, ctx).await - } - - async fn delegate_all(&self, tasks: &mut [Task], ctx: &AgentContext) -> Vec { - let mut results = Vec::with_capacity(tasks.len()); - - for task in tasks { - // Use recursive execution for each task - let result = self.execute(task, ctx).await; - results.push(result); - } - - results - } -} diff --git a/src/agents/orchestrator/root.rs b/src/agents/orchestrator/root.rs deleted file mode 100644 index 8ddb0e2..0000000 --- a/src/agents/orchestrator/root.rs +++ /dev/null @@ -1,903 +0,0 @@ -//! Root agent - top-level orchestrator of the agent tree. -//! -//! # Responsibilities -//! 1. Receive tasks from the API -//! 2. Estimate complexity -//! 3. Decide: execute directly or split into subtasks -//! 4. Delegate to appropriate children -//! 5. Aggregate results - -use std::sync::Arc; - -use async_trait::async_trait; -use serde_json::json; - -use crate::agents::{ - Agent, AgentContext, AgentId, AgentRef, AgentResult, AgentType, Complexity, - OrchestratorAgent, - leaf::{ComplexityEstimator, ModelSelector, TaskExecutor, Verifier}, -}; -use crate::agents::tuning::TuningParams; -use crate::budget::Budget; -use crate::task::{Task, Subtask, SubtaskPlan, VerificationCriteria}; - -/// Root agent - the top of the agent tree. -/// -/// # Task Processing Flow -/// ```text -/// 1. Estimate complexity (ComplexityEstimator) -/// 2. If simple: execute directly (TaskExecutor) -/// 3. If complex: -/// a. Split into subtasks (LLM-based) -/// b. Select model for each subtask (ModelSelector) -/// c. Execute subtasks (TaskExecutor) -/// d. Verify results (Verifier) -/// 4. Return aggregated result -/// ``` -pub struct RootAgent { - id: AgentId, - - // Child agents - complexity_estimator: Arc, - model_selector: Arc, - task_executor: Arc, - verifier: Arc, -} - -impl RootAgent { - /// Create a new root agent with default children. - pub fn new() -> Self { - Self::new_with_tuning(&TuningParams::default()) - } - - /// Create a new root agent using empirically tuned parameters. - pub fn new_with_tuning(tuning: &TuningParams) -> Self { - Self { - id: AgentId::new(), - complexity_estimator: Arc::new(ComplexityEstimator::with_params( - tuning.complexity.prompt_variant, - tuning.complexity.split_threshold, - tuning.complexity.token_multiplier, - )), - model_selector: Arc::new(ModelSelector::with_params( - tuning.model_selector.retry_multiplier, - tuning.model_selector.inefficiency_scale, - tuning.model_selector.max_failure_probability, - )), - task_executor: Arc::new(TaskExecutor::new()), - verifier: Arc::new(Verifier::new()), - } - } - - /// Estimate complexity of a task. - async fn estimate_complexity(&self, task: &mut Task, ctx: &AgentContext) -> Complexity { - let result = self.complexity_estimator.execute(task, ctx).await; - - if let Some(data) = result.data { - let score = data["score"].as_f64().unwrap_or(0.5); - let reasoning = data["reasoning"].as_str().unwrap_or("").to_string(); - let estimated_tokens = data["estimated_tokens"].as_u64().unwrap_or(2000); - let should_split = data["should_split"].as_bool().unwrap_or(false); - - Complexity::new(score, reasoning, estimated_tokens) - .with_split(should_split) - } else { - Complexity::moderate("Could not estimate complexity") - } - } - - /// Split a complex task into subtasks. - /// - /// Uses LLM to analyze the task and propose subtasks. - async fn split_task(&self, task: &Task, ctx: &AgentContext) -> Result { - let prompt = format!( - r#"You are a task planner. Break down this task into smaller, manageable subtasks. - -Task: {} - -Respond with a JSON object: -{{ - "subtasks": [ - {{ - "description": "What to do", - "verification": "How to verify it's done", - "weight": 1.0, - "dependencies": [] - }} - ], - "reasoning": "Why this breakdown makes sense" -}} - -Guidelines: -- Each subtask should be independently executable once its dependencies are complete -- The "dependencies" array contains indices (0-based) of subtasks that MUST complete before this one can start -- For example, if subtask 2 needs subtask 0's output, set "dependencies": [0] -- Include verification for each subtask -- Weight indicates relative effort (higher = more work) -- Keep subtasks focused and specific -- IMPORTANT: If subtasks have a logical order (e.g., download before analyze), specify dependencies! - -PREFER COMMAND-LINE APPROACHES: -- For downloading files: use curl/wget, NOT browser automation -- For Chrome extensions: download CRX directly via URL pattern, then unzip -- For file analysis: use grep/find/ripgrep, NOT GUI tools -- For web APIs: use curl/fetch_url, NOT browser clicks -- Desktop automation is a LAST RESORT only when no CLI option exists - -Respond ONLY with the JSON object."#, - task.description() - ); - - let messages = vec![ - crate::llm::ChatMessage::new(crate::llm::Role::System, "You are a precise task planner. Respond only with JSON."), - crate::llm::ChatMessage::new(crate::llm::Role::User, prompt), - ]; - - let response = ctx.llm - .chat_completion("openai/gpt-4.1-mini", &messages, None) - .await - .map_err(|e| AgentResult::failure(format!("LLM error: {}", e), 1))?; - - let content = response.content.unwrap_or_default(); - self.parse_subtask_plan(&content, task.id()) - } - - /// Synthesize a final output from subtask results. - /// - /// # Purpose - /// When a task is split into subtasks, this method produces a coherent final - /// response by asking the LLM to synthesize all subtask outputs into a single - /// answer that addresses the original request. - /// - /// # Fallback - /// If LLM synthesis fails, falls back to concatenating subtask outputs. - async fn synthesize_final_output( - &self, - original_task: &str, - results: &[AgentResult], - ctx: &AgentContext, - ) -> String { - // Collect successful outputs - let subtask_outputs: Vec = results - .iter() - .enumerate() - .filter(|(_, r)| r.success) - .map(|(i, r)| format!("## Subtask {} Output\n{}", i + 1, r.output)) - .collect(); - - if subtask_outputs.is_empty() { - return "All subtasks failed - no output to synthesize.".to_string(); - } - - // If only one subtask, just return its output directly - if subtask_outputs.len() == 1 { - return results - .iter() - .find(|r| r.success) - .map(|r| r.output.clone()) - .unwrap_or_default(); - } - - let combined_outputs = subtask_outputs.join("\n\n---\n\n"); - - let prompt = format!( - r#"You have completed a multi-step task. Below are the outputs from each step. - -## Original Request -{original_task} - -## Subtask Outputs -{combined_outputs} - -## Your Task -Synthesize these outputs into a single, coherent response that directly answers the original request. - -Guidelines: -- Combine findings into a unified narrative or report -- Remove redundancy between subtask outputs -- Maintain the format the user requested (e.g., if they asked for a markdown report, provide one) -- If subtasks produced code or files, list them clearly -- Be comprehensive but concise -- Do NOT mention "subtasks" or the internal execution structure - respond as if you did the work yourself"# - ); - - let messages = vec![ - crate::llm::ChatMessage::new( - crate::llm::Role::System, - "You are a helpful assistant that synthesizes work outputs into coherent responses.", - ), - crate::llm::ChatMessage::new(crate::llm::Role::User, prompt), - ]; - - // Use a fast model for synthesis to minimize cost - match ctx - .llm - .chat_completion("openai/gpt-4.1-mini", &messages, None) - .await - { - Ok(response) => response.content.unwrap_or_else(|| { - // Fallback: concatenate outputs if synthesis returned empty - self.fallback_concatenate_outputs(results) - }), - Err(e) => { - tracing::warn!("Synthesis LLM call failed, using fallback: {}", e); - self.fallback_concatenate_outputs(results) - } - } - } - - /// Fallback method: concatenate subtask outputs with headers. - fn fallback_concatenate_outputs(&self, results: &[AgentResult]) -> String { - let outputs: Vec = results - .iter() - .enumerate() - .filter(|(_, r)| r.success && !r.output.is_empty()) - .map(|(i, r)| format!("## Part {}\n\n{}", i + 1, r.output)) - .collect(); - - if outputs.is_empty() { - "Task completed but no output was generated.".to_string() - } else { - outputs.join("\n\n---\n\n") - } - } - - /// Extract JSON from LLM response (handles markdown code blocks). - fn extract_json(response: &str) -> String { - let trimmed = response.trim(); - - // Check for markdown code block - if trimmed.starts_with("```") { - // Find the end of the opening fence - if let Some(start_idx) = trimmed.find('\n') { - let after_fence = &trimmed[start_idx + 1..]; - // Find the closing fence - if let Some(end_idx) = after_fence.rfind("```") { - return after_fence[..end_idx].trim().to_string(); - } - } - } - - // Try to find JSON object in the response - if let Some(start) = trimmed.find('{') { - if let Some(end) = trimmed.rfind('}') { - if end > start { - return trimmed[start..=end].to_string(); - } - } - } - - // Return as-is if no extraction needed - trimmed.to_string() - } - - /// Parse LLM response into SubtaskPlan. - fn parse_subtask_plan( - &self, - response: &str, - parent_id: crate::task::TaskId, - ) -> Result { - let extracted = Self::extract_json(response); - let json: serde_json::Value = serde_json::from_str(&extracted) - .map_err(|e| AgentResult::failure(format!("Failed to parse subtasks: {} (raw: {}...)", e, response.chars().take(100).collect::()), 0))?; - - let reasoning = json["reasoning"] - .as_str() - .unwrap_or("No reasoning provided") - .to_string(); - - let subtasks: Vec = json["subtasks"] - .as_array() - .map(|arr| { - arr.iter() - .map(|s| { - let desc = s["description"].as_str().unwrap_or("").to_string(); - let verification = s["verification"].as_str().unwrap_or(""); - let weight = s["weight"].as_f64().unwrap_or(1.0); - - // Parse dependencies array - let dependencies: Vec = s["dependencies"] - .as_array() - .map(|deps| { - deps.iter() - .filter_map(|d| d.as_u64().map(|n| n as usize)) - .collect() - }) - .unwrap_or_default(); - - Subtask::new( - desc, - VerificationCriteria::llm_based(verification), - weight, - ).with_dependencies(dependencies) - }) - .collect() - }) - .unwrap_or_default(); - - if subtasks.is_empty() { - return Err(AgentResult::failure("No subtasks generated", 1)); - } - - SubtaskPlan::new(parent_id, subtasks, reasoning) - .map_err(|e| AgentResult::failure(format!("Invalid subtask plan: {}", e), 0)) - } - - /// Execute subtasks using NodeAgents for recursive processing. - /// - /// Each subtask is handled by a NodeAgent which can: - /// - Estimate complexity of the subtask - /// - Recursively split if the subtask is still too complex - /// - Execute directly if simple enough - async fn execute_subtasks( - &self, - subtask_plan: SubtaskPlan, - parent_budget: &Budget, - ctx: &AgentContext, - ) -> AgentResult { - use super::NodeAgent; - - // Convert plan to tasks - let mut tasks = match subtask_plan.into_tasks(parent_budget) { - Ok(t) => t, - Err(e) => return AgentResult::failure(format!("Failed to create subtasks: {}", e), 0), - }; - - let mut results = Vec::new(); - let mut total_cost = 0u64; - - // Create a child context with reduced split depth for subtasks - let child_ctx = ctx.child_context(); - - let total_subtasks = tasks.len(); - - tracing::info!( - "RootAgent executing {} subtasks (child depth: {})", - total_subtasks, - child_ctx.max_split_depth - ); - - // Execute each subtask through a NodeAgent (which can recursively split) - for (i, task) in tasks.iter_mut().enumerate() { - tracing::info!( - "RootAgent delegating subtask {}/{}: {}", - i + 1, - total_subtasks, - task.description().chars().take(80).collect::() - ); - - // Create a NodeAgent for this subtask - let node_agent = NodeAgent::new(format!("subtask-{}", i + 1)); - - // Execute through the NodeAgent (which may split further if complex) - let result = node_agent.execute(task, &child_ctx).await; - total_cost += result.cost_cents; - - tracing::info!( - "Subtask {}/{} {}: {}", - i + 1, - total_subtasks, - if result.success { "succeeded" } else { "failed" }, - result.output.chars().take(100).collect::() - ); - - results.push(result); - } - - // Aggregate results - let successes = results.iter().filter(|r| r.success).count(); - let total = results.len(); - - // Concatenate outputs (fallback aggregation for non-tree path) - let combined_output = self.fallback_concatenate_outputs(&results); - - if successes == total { - AgentResult::success(combined_output, total_cost) - .with_data(json!({ - "subtasks_total": total, - "subtasks_succeeded": successes, - "recursive_execution": true, - "results": results.iter().map(|r| json!({ - "success": r.success, - "output": &r.output, - "data": &r.data, - })).collect::>(), - })) - } else { - AgentResult::failure( - format!("{}/{} subtasks succeeded\n\n{}", successes, total, combined_output), - total_cost, - ) - .with_data(json!({ - "subtasks_total": total, - "subtasks_succeeded": successes, - "recursive_execution": true, - "results": results.iter().map(|r| json!({ - "success": r.success, - "output": &r.output, - "data": &r.data, - })).collect::>(), - })) - } - } - - /// Execute subtasks with tree updates for visualization. - /// Uses wave-based parallel execution for independent tasks. - /// - /// # Parameters - /// - `original_task_description`: The user's original request, used for synthesizing the final output - async fn execute_subtasks_with_tree( - &self, - subtask_plan: SubtaskPlan, - parent_budget: &Budget, - child_ctx: &AgentContext, - root_tree: &mut crate::api::control::AgentTreeNode, - ctx: &AgentContext, - requested_model: Option<&str>, - original_task_description: &str, - ) -> AgentResult { - use super::NodeAgent; - use std::sync::Arc; - use tokio::sync::Mutex; - - // Get execution waves for parallel processing - let waves = match subtask_plan.execution_waves() { - Ok(w) => w, - Err(e) => return AgentResult::failure(format!("Invalid subtask dependencies: {}", e), 0), - }; - - let mut tasks = match subtask_plan.into_tasks(parent_budget) { - Ok(t) => t, - Err(e) => return AgentResult::failure(format!("Failed to create subtasks: {}", e), 0), - }; - - // Propagate requested_model to all subtasks - if let Some(model) = requested_model { - for task in &mut tasks { - task.analysis_mut().requested_model = Some(model.to_string()); - } - } - - let total_subtasks = tasks.len(); - let num_waves = waves.len(); - - tracing::info!( - "RootAgent executing {} subtasks in {} wave(s) (child depth: {})", - total_subtasks, - num_waves, - child_ctx.max_split_depth - ); - - // Wrap tree in Arc for thread-safe parallel updates - let tree = Arc::new(Mutex::new(root_tree.clone())); - let mut all_results = Vec::new(); - let mut total_cost = 0u64; - - // Execute each wave in parallel - for (wave_idx, wave) in waves.iter().enumerate() { - let is_parallel = wave.len() > 1; - - tracing::info!( - "RootAgent wave {}/{}: {} task(s) {}", - wave_idx + 1, - num_waves, - wave.len(), - if is_parallel { "(parallel)" } else { "(sequential)" } - ); - - // Mark all tasks in this wave as running - { - let mut tree_guard = tree.lock().await; - for &idx in wave { - let subtask_id = format!("subtask-{}", idx + 1); - if let Some(node) = tree_guard.children.iter_mut().find(|n| n.id == subtask_id) { - node.status = "running".to_string(); - } - } - ctx.emit_tree(tree_guard.clone()); - } - - // Execute tasks in this wave in parallel - let wave_futures: Vec<_> = wave.iter().map(|&idx| { - let subtask_id = format!("subtask-{}", idx + 1); - let task = tasks[idx].clone(); - let child_ctx = child_ctx.clone(); - let tree = Arc::clone(&tree); - let ctx = ctx.clone(); - - async move { - let mut task = task; - let node_agent = NodeAgent::new(subtask_id.clone()); - - tracing::info!( - "RootAgent delegating subtask {}: {}", - subtask_id, - task.description().chars().take(80).collect::() - ); - - // Execute (without tree updates for parallel - update after) - let result = node_agent.execute(&mut task, &child_ctx).await; - - // Update tree with result - { - let mut tree_guard = tree.lock().await; - if let Some(node) = tree_guard.children.iter_mut().find(|n| n.id == subtask_id) { - node.status = if result.success { "completed".to_string() } else { "failed".to_string() }; - node.budget_spent = result.cost_cents; - } - ctx.emit_tree(tree_guard.clone()); - } - - tracing::info!( - "Subtask {} {}: {}", - subtask_id, - if result.success { "succeeded" } else { "failed" }, - result.output.chars().take(100).collect::() - ); - - (idx, result) - } - }).collect(); - - // Wait for all tasks in wave to complete - let wave_results = futures::future::join_all(wave_futures).await; - - for (idx, result) in wave_results { - total_cost += result.cost_cents; - // Store result at correct index - while all_results.len() <= idx { - all_results.push(None); - } - all_results[idx] = Some(result); - } - - // Emit progress update after each wave - let completed = all_results.iter().filter(|r| r.is_some()).count(); - let current_subtask = if wave_idx + 1 < num_waves { - // Next wave's first task description - waves.get(wave_idx + 1).and_then(|w| w.first()).map(|&idx| { - tasks.get(idx).map(|t| t.description().chars().take(50).collect::()) - }).flatten() - } else { - None - }; - ctx.emit_progress(total_subtasks, completed, current_subtask, 1); - } - - // Collect results in order - let results: Vec = all_results.into_iter().filter_map(|r| r).collect(); - - // Update the original tree from our Arc version - *root_tree = tree.lock().await.clone(); - - // Update verifier to running (repurposed as "synthesizer" for complex tasks) - if let Some(node) = root_tree.children.iter_mut().find(|n| n.id == "verifier") { - node.status = "running".to_string(); - node.description = "Synthesizing final output...".to_string(); - } - ctx.emit_tree(root_tree.clone()); - - // Aggregate results - let successes = results.iter().filter(|r| r.success).count(); - let total = results.len(); - - // Synthesize final output from all subtask results - let synthesized_output = if successes > 0 { - self.synthesize_final_output(original_task_description, &results, ctx).await - } else { - format!("{}/{} subtasks succeeded ({} waves)", successes, total, num_waves) - }; - - // Update verifier/synthesizer to completed - if let Some(node) = root_tree.children.iter_mut().find(|n| n.id == "verifier") { - node.status = if successes == total { "completed".to_string() } else { "failed".to_string() }; - node.budget_spent = 5; - } - ctx.emit_tree(root_tree.clone()); - - if successes == total { - AgentResult::success(synthesized_output, total_cost) - .with_data(json!({ - "subtasks_total": total, - "subtasks_succeeded": successes, - "recursive_execution": true, - "parallel_waves": num_waves, - "results": results.iter().map(|r| json!({ - "success": r.success, - "output": &r.output, - "data": &r.data, - })).collect::>(), - })) - } else { - AgentResult::failure( - format!("{}/{} subtasks succeeded ({} waves)\n\n{}", successes, total, num_waves, synthesized_output), - total_cost, - ) - .with_data(json!({ - "subtasks_total": total, - "subtasks_succeeded": successes, - "recursive_execution": true, - "parallel_waves": num_waves, - "results": results.iter().map(|r| json!({ - "success": r.success, - "output": &r.output, - "data": &r.data, - })).collect::>(), - })) - } - } -} - -impl Default for RootAgent { - fn default() -> Self { - Self::new() - } -} - -#[async_trait] -impl Agent for RootAgent { - fn id(&self) -> &AgentId { - &self.id - } - - fn agent_type(&self) -> AgentType { - AgentType::Root - } - - fn description(&self) -> &str { - "Root orchestrator: estimates complexity, splits tasks, delegates execution" - } - - async fn execute(&self, task: &mut Task, ctx: &AgentContext) -> AgentResult { - use crate::api::control::AgentTreeNode; - - let mut total_cost = 0u64; - let task_desc = task.description().chars().take(60).collect::(); - let budget_cents = task.budget().total_cents(); - - // Build initial tree structure - let mut root_tree = AgentTreeNode::new("root", "Root", "Root Agent", &task_desc) - .with_budget(budget_cents, 0) - .with_status("running"); - - // Add child agent nodes - root_tree.add_child( - AgentTreeNode::new("complexity", "ComplexityEstimator", "Complexity Estimator", "Analyzing task difficulty") - .with_budget(10, 0) - .with_status("running") - ); - ctx.emit_tree(root_tree.clone()); - - // Step 1: Estimate complexity - ctx.emit_phase("estimating_complexity", Some("Analyzing task difficulty..."), Some("RootAgent")); - let complexity = self.estimate_complexity(task, ctx).await; - total_cost += 1; - - // Update complexity node - if let Some(node) = root_tree.children.iter_mut().find(|n| n.id == "complexity") { - node.status = "completed".to_string(); - node.complexity = Some(complexity.score()); - node.budget_spent = 5; - } - ctx.emit_tree(root_tree.clone()); - - tracing::info!( - "Task complexity: {:.2} (should_split: {})", - complexity.score(), - complexity.should_split() - ); - - // Step 2: Decide execution strategy - if complexity.should_split() && ctx.can_split() { - ctx.emit_phase("splitting_task", Some("Decomposing into subtasks..."), Some("RootAgent")); - match self.split_task(task, ctx).await { - Ok(plan) => { - total_cost += 2; - - // Add subtask nodes to tree - for (i, subtask) in plan.subtasks().iter().enumerate() { - let subtask_node = AgentTreeNode::new( - &format!("subtask-{}", i + 1), - "Node", - &format!("Subtask {}", i + 1), - &subtask.description.chars().take(50).collect::(), - ) - .with_budget(budget_cents / plan.subtasks().len() as u64, 0) - .with_status("pending"); - root_tree.add_child(subtask_node); - } - - // Add verifier node - root_tree.add_child( - AgentTreeNode::new("verifier", "Verifier", "Verifier", "Verify task completion") - .with_budget(80, 0) - .with_status("pending") - ); - ctx.emit_tree(root_tree.clone()); - - // Execute subtasks with tree updates - let child_ctx = ctx.child_context(); - let requested_model = task.analysis().requested_model.as_deref(); - let original_task_desc = task.description(); - let result = self.execute_subtasks_with_tree(plan, task.budget(), &child_ctx, &mut root_tree, ctx, requested_model, original_task_desc).await; - - // Update root status - root_tree.status = if result.success { "completed".to_string() } else { "failed".to_string() }; - root_tree.budget_spent = total_cost + result.cost_cents; - ctx.emit_tree(root_tree); - - return AgentResult { - success: result.success, - output: result.output, - cost_cents: total_cost + result.cost_cents, - model_used: result.model_used, - data: result.data, - }; - } - Err(e) => { - tracing::warn!("Couldn't split task, executing directly: {}", e.output); - } - } - } - - // Simple task: add remaining nodes - // Check if there's a model override - show it immediately on the node - let requested_model = task.analysis().requested_model.clone(); - let model_selector_node = if let Some(ref model) = requested_model { - AgentTreeNode::new("model-selector", "ModelSelector", "Model Selector", &format!("Using: {}", model)) - .with_budget(10, 0) - .with_status("running") - .with_model(model) - } else { - AgentTreeNode::new("model-selector", "ModelSelector", "Model Selector", "Selecting optimal model") - .with_budget(10, 0) - .with_status("running") - }; - root_tree.add_child(model_selector_node); - ctx.emit_tree(root_tree.clone()); - - ctx.emit_phase("selecting_model", Some("Choosing optimal model..."), Some("RootAgent")); - - let has_benchmarks = if let Some(b) = &ctx.benchmarks { - let registry = b.read().await; - registry.benchmark_count() > 0 - } else { - false - }; - - let selected_model = if has_benchmarks { - let sel_result = self.model_selector.execute(task, ctx).await; - total_cost += sel_result.cost_cents; - // Model already resolved by ModelSelector - task.analysis().selected_model.clone().unwrap_or_else(|| ctx.config.default_model.clone()) - } else { - // No benchmarks - resolve default model to latest version - let default_model = if let Some(resolver) = &ctx.resolver { - let resolver = resolver.read().await; - let resolved = resolver.resolve(&ctx.config.default_model); - if resolved.upgraded { - tracing::info!( - "RootAgent: default model auto-upgraded: {} → {}", - resolved.original, resolved.resolved - ); - } - resolved.resolved - } else { - ctx.config.default_model.clone() - }; - let a = task.analysis_mut(); - a.selected_model = Some(default_model.clone()); - default_model - }; - - // Update model selector node with final selected model - if let Some(node) = root_tree.children.iter_mut().find(|n| n.id == "model-selector") { - node.status = "completed".to_string(); - node.selected_model = Some(selected_model.clone()); - node.description = format!("Using: {}", selected_model); - node.budget_spent = 3; - } - - // Add executor and verifier nodes - root_tree.add_child( - AgentTreeNode::new("executor", "TaskExecutor", "Task Executor", "Executing task") - .with_budget(budget_cents - 100, 0) - .with_status("running") - ); - root_tree.add_child( - AgentTreeNode::new("verifier", "Verifier", "Verifier", "Verify task completion") - .with_budget(80, 0) - .with_status("pending") - ); - ctx.emit_tree(root_tree.clone()); - - ctx.emit_phase("executing", Some("Running task..."), Some("RootAgent")); - let result = self.task_executor.execute(task, ctx).await; - - // Update executor node - if let Some(node) = root_tree.children.iter_mut().find(|n| n.id == "executor") { - node.status = if result.success { "completed".to_string() } else { "failed".to_string() }; - node.budget_spent = result.cost_cents; - } - ctx.emit_tree(root_tree.clone()); - - // Store the executor output for verification - task.set_last_output(result.output.clone()); - - // Step 3: Verify - if let Some(node) = root_tree.children.iter_mut().find(|n| n.id == "verifier") { - node.status = "running".to_string(); - } - ctx.emit_tree(root_tree.clone()); - - ctx.emit_phase("verifying", Some("Checking results..."), Some("RootAgent")); - let verification = self.verifier.execute(task, ctx).await; - total_cost += verification.cost_cents; - - // Update verifier node - if let Some(node) = root_tree.children.iter_mut().find(|n| n.id == "verifier") { - node.status = if verification.success { "completed".to_string() } else { "failed".to_string() }; - node.budget_spent = verification.cost_cents; - } - - // Update root status - root_tree.status = if result.success && verification.success { "completed".to_string() } else { "failed".to_string() }; - root_tree.budget_spent = total_cost + result.cost_cents; - ctx.emit_tree(root_tree); - - AgentResult { - success: result.success && verification.success, - output: if verification.success { - result.output - } else { - format!("{}\n\nVerification failed: {}", result.output, verification.output) - }, - cost_cents: total_cost + result.cost_cents, - model_used: result.model_used, - data: json!({ - "complexity": complexity.score(), - "was_split": false, - "verification": verification.data, - "execution": result.data, - }).into(), - } - } -} - -#[async_trait] -impl OrchestratorAgent for RootAgent { - fn children(&self) -> Vec { - vec![ - Arc::clone(&self.complexity_estimator) as AgentRef, - Arc::clone(&self.model_selector) as AgentRef, - Arc::clone(&self.task_executor) as AgentRef, - Arc::clone(&self.verifier) as AgentRef, - ] - } - - fn find_child(&self, agent_type: AgentType) -> Option { - match agent_type { - AgentType::ComplexityEstimator => Some(Arc::clone(&self.complexity_estimator) as AgentRef), - AgentType::ModelSelector => Some(Arc::clone(&self.model_selector) as AgentRef), - AgentType::TaskExecutor => Some(Arc::clone(&self.task_executor) as AgentRef), - AgentType::Verifier => Some(Arc::clone(&self.verifier) as AgentRef), - _ => None, - } - } - - async fn delegate(&self, task: &mut Task, child: AgentRef, ctx: &AgentContext) -> AgentResult { - child.execute(task, ctx).await - } - - async fn delegate_all(&self, tasks: &mut [Task], ctx: &AgentContext) -> Vec { - let mut results = Vec::with_capacity(tasks.len()); - - for task in tasks { - let result = self.task_executor.execute(task, ctx).await; - results.push(result); - } - - results - } -} - diff --git a/src/agents/simple.rs b/src/agents/simple.rs new file mode 100644 index 0000000..6c55508 --- /dev/null +++ b/src/agents/simple.rs @@ -0,0 +1,177 @@ +//! Simple agent - streamlined single-agent executor. +//! +//! Replaces the complex RootAgent → NodeAgent → ComplexityEstimator → ModelSelector → TaskExecutor → Verifier +//! hierarchy with a single agent that directly executes tasks. +//! +//! # Why SimpleAgent? +//! The multi-agent hierarchy added overhead without reliable benefits: +//! - ComplexityEstimator: LLM-based estimation was unreliable +//! - ModelSelector: U-curve optimization rarely matched simple "use default" strategy +//! - NodeAgent: Recursive splitting lost context and produced worse results +//! - Verifier: Rubber-stamped everything (LLMs are bad at self-verification) +//! +//! # Design +//! - Direct model selection: mission override > config default +//! - No automatic task splitting (user controls granularity) +//! - Built-in blocker detection via system prompt +//! - Mission completion via complete_mission tool + +use std::sync::Arc; + +use async_trait::async_trait; +use serde_json::json; + +use crate::agents::{ + Agent, AgentContext, AgentId, AgentResult, AgentType, + leaf::TaskExecutor, +}; +use crate::api::control::AgentTreeNode; +use crate::task::Task; + +/// Simple agent - unified executor without orchestration overhead. +/// +/// # Execution Flow +/// 1. Resolve model (mission override or config default) +/// 2. Build tree for visualization +/// 3. Execute task via TaskExecutor +/// 4. Return result (no verification layer) +pub struct SimpleAgent { + id: AgentId, + task_executor: Arc, +} + +impl SimpleAgent { + /// Create a new simple agent. + pub fn new() -> Self { + Self { + id: AgentId::new(), + task_executor: Arc::new(TaskExecutor::new()), + } + } + + /// Resolve the model to use for execution. + /// + /// Priority: + /// 1. Task's requested model (from mission override) + /// 2. Config default model (auto-upgraded via resolver) + async fn resolve_model(&self, task: &Task, ctx: &AgentContext) -> String { + // Check for explicit model request (from mission override) + if let Some(requested) = &task.analysis().requested_model { + // Resolve to latest version if using resolver + if let Some(resolver) = &ctx.resolver { + let resolver = resolver.read().await; + let resolved = resolver.resolve(requested); + if resolved.upgraded { + tracing::info!( + "SimpleAgent: requested model auto-upgraded: {} → {}", + resolved.original, resolved.resolved + ); + } + return resolved.resolved; + } + return requested.clone(); + } + + // Fall back to config default, resolved to latest version + if let Some(resolver) = &ctx.resolver { + let resolver = resolver.read().await; + let resolved = resolver.resolve(&ctx.config.default_model); + if resolved.upgraded { + tracing::info!( + "SimpleAgent: default model auto-upgraded: {} → {}", + resolved.original, resolved.resolved + ); + } + resolved.resolved + } else { + ctx.config.default_model.clone() + } + } + + /// Build a simple agent tree for visualization. + fn build_tree(&self, task_desc: &str, budget_cents: u64, model: &str) -> AgentTreeNode { + let mut root = AgentTreeNode::new("root", "Simple", "Simple Agent", task_desc) + .with_budget(budget_cents, 0) + .with_status("running"); + + // Add executor node + root.add_child( + AgentTreeNode::new("executor", "TaskExecutor", "Task Executor", "Executing task") + .with_budget(budget_cents, 0) + .with_status("running") + .with_model(model) + ); + + root + } +} + +impl Default for SimpleAgent { + fn default() -> Self { + Self::new() + } +} + +#[async_trait] +impl Agent for SimpleAgent { + fn id(&self) -> &AgentId { + &self.id + } + + fn agent_type(&self) -> AgentType { + AgentType::Root // Presents as Root for compatibility with tree visualization + } + + fn description(&self) -> &str { + "Simple agent: direct task execution without orchestration overhead" + } + + async fn execute(&self, task: &mut Task, ctx: &AgentContext) -> AgentResult { + let task_desc = task.description().chars().take(60).collect::(); + let budget_cents = task.budget().total_cents(); + + // Step 1: Resolve model + let model = self.resolve_model(task, ctx).await; + + // Update task analysis with selected model + task.analysis_mut().selected_model = Some(model.clone()); + + tracing::info!( + "SimpleAgent executing task with model '{}': {}...", + model, + task_desc + ); + + // Step 2: Build and emit tree + let mut tree = self.build_tree(&task_desc, budget_cents, &model); + ctx.emit_tree(tree.clone()); + + // Step 3: Emit phase (for frontend progress indicator) + ctx.emit_phase("executing", Some("Running task..."), Some("SimpleAgent")); + + // Step 4: Execute via TaskExecutor + let result = self.task_executor.execute(task, ctx).await; + + // Step 5: Update tree with result + if let Some(node) = tree.children.iter_mut().find(|n| n.id == "executor") { + node.status = if result.success { "completed".to_string() } else { "failed".to_string() }; + node.budget_spent = result.cost_cents; + } + tree.status = if result.success { "completed".to_string() } else { "failed".to_string() }; + tree.budget_spent = result.cost_cents; + ctx.emit_tree(tree); + + // Step 6: Return result with metadata + AgentResult { + success: result.success, + output: result.output, + cost_cents: result.cost_cents, + model_used: result.model_used.or(Some(model)), + data: Some(json!({ + "agent": "SimpleAgent", + "execution": result.data, + })), + terminal_reason: result.terminal_reason, + } + } +} diff --git a/src/agents/tuning.rs b/src/agents/tuning.rs index 7e8ba5d..0655d06 100644 --- a/src/agents/tuning.rs +++ b/src/agents/tuning.rs @@ -1,49 +1,24 @@ -//! Empirical tuning parameters for agent heuristics. +//! Tuning parameters (legacy). //! -//! This module exists to support **trial-and-error calibration**: -//! we run tasks, compare predicted vs actual usage/cost, and update parameters. -//! -//! The core agent logic should remain correct even if tuning values are absent -//! (defaults apply). +//! This module is kept for backwards compatibility but is largely unused +//! since SimpleAgent doesn't require tuning. use serde::{Deserialize, Serialize}; use std::path::{Path, PathBuf}; -use crate::agents::leaf::ComplexityPromptVariant; - -/// Top-level tuning parameters. -#[derive(Debug, Clone, Serialize, Deserialize)] +/// Top-level tuning parameters (legacy). +#[derive(Debug, Clone, Default, Serialize, Deserialize)] pub struct TuningParams { - pub complexity: ComplexityTuning, - pub model_selector: ModelSelectorTuning, -} - -impl Default for TuningParams { - fn default() -> Self { - Self { - complexity: ComplexityTuning::default(), - model_selector: ModelSelectorTuning::default(), - } - } + // Empty - SimpleAgent doesn't use tuning } impl TuningParams { /// Load tuning parameters from the working directory, if present. - /// - /// # Path - /// `{working_dir}/.open_agent/tuning.json` - pub async fn load_from_working_dir(working_dir: &Path) -> Self { - let path = working_dir.join(".open_agent").join("tuning.json"); - match tokio::fs::read_to_string(&path).await { - Ok(s) => serde_json::from_str::(&s).unwrap_or_default(), - Err(_) => TuningParams::default(), - } + pub async fn load_from_working_dir(_working_dir: &Path) -> Self { + Self::default() } /// Save tuning parameters to the working directory. - /// - /// # Postcondition - /// If successful, subsequent `load_from_working_dir` returns an equivalent value. pub async fn save_to_working_dir(&self, working_dir: &Path) -> anyhow::Result { let dir = working_dir.join(".open_agent"); tokio::fs::create_dir_all(&dir).await?; @@ -53,44 +28,3 @@ impl TuningParams { Ok(path) } } - -/// Tuning parameters for ComplexityEstimator. -#[derive(Debug, Clone, Serialize, Deserialize)] -pub struct ComplexityTuning { - pub prompt_variant: ComplexityPromptVariant, - pub split_threshold: f64, - pub token_multiplier: f64, -} - -impl Default for ComplexityTuning { - fn default() -> Self { - Self { - prompt_variant: ComplexityPromptVariant::CalibratedV2, - split_threshold: 0.60, - token_multiplier: 1.00, - } - } -} - -/// Tuning parameters for ModelSelector. -#[derive(Debug, Clone, Serialize, Deserialize)] -pub struct ModelSelectorTuning { - /// Retry multiplier cost penalty for failures. - pub retry_multiplier: f64, - /// Token inefficiency scaling for weaker models. - pub inefficiency_scale: f64, - /// Cap for failure probability. - pub max_failure_probability: f64, -} - -impl Default for ModelSelectorTuning { - fn default() -> Self { - Self { - retry_multiplier: 1.5, - inefficiency_scale: 0.5, - max_failure_probability: 0.9, - } - } -} - - diff --git a/src/agents/types.rs b/src/agents/types.rs index 45e2c95..48a50bb 100644 --- a/src/agents/types.rs +++ b/src/agents/types.rs @@ -60,6 +60,25 @@ impl AgentType { } } +/// Reason why agent execution terminated (for non-successful completions). +/// +/// Used to determine whether auto-complete should trigger, avoiding substring matching. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)] +pub enum TerminalReason { + /// Agent hit the maximum iteration limit + MaxIterations, + /// Agent was cancelled by user + Cancelled, + /// Budget was exhausted + BudgetExhausted, + /// Agent stalled (no progress, timeouts) + Stalled, + /// Agent got stuck in an infinite loop + InfiniteLoop, + /// LLM API error + LlmError, +} + /// Result of an agent executing a task. /// /// # Invariants @@ -81,6 +100,11 @@ pub struct AgentResult { /// Detailed result data (type-specific) pub data: Option, + + /// If execution ended due to a terminal condition (not normal completion), + /// this indicates why. Used by auto-complete logic to avoid substring matching. + #[serde(skip_serializing_if = "Option::is_none")] + pub terminal_reason: Option, } impl AgentResult { @@ -92,6 +116,7 @@ impl AgentResult { cost_cents, model_used: None, data: None, + terminal_reason: None, } } @@ -103,6 +128,7 @@ impl AgentResult { cost_cents, model_used: None, data: None, + terminal_reason: None, } } @@ -117,6 +143,12 @@ impl AgentResult { self.data = Some(data); self } + + /// Set the terminal reason (why execution ended abnormally). + pub fn with_terminal_reason(mut self, reason: TerminalReason) -> Self { + self.terminal_reason = Some(reason); + self + } } /// Complexity estimation for a task. diff --git a/src/api/control.rs b/src/api/control.rs index 451c34e..b75b341 100644 --- a/src/api/control.rs +++ b/src/api/control.rs @@ -1740,7 +1740,7 @@ async fn control_actor_loop( .map(|e| (e.role.clone(), e.content.clone())) .collect(); *current_mission.write().await = Some(mission_id); - + // Update mission status back to active if let Some(mem) = &memory { let _ = mem.supabase.update_mission_status(mission_id, "active").await; @@ -1947,6 +1947,56 @@ async fn control_actor_loop( // Persist to mission persist_mission_history(&memory, ¤t_mission, &history).await; + // P1 FIX: Auto-complete mission if agent execution ended in a terminal state + // without an explicit complete_mission call. + // This prevents missions from staying "active" forever after max iterations, stalls, etc. + // + // We use terminal_reason (structured enum) instead of substring matching to avoid + // false positives when agent output legitimately contains words like "infinite loop". + // We also check the current mission status from DB to handle: + // - Explicit complete_mission calls (which update DB status) + // - Parallel missions (each has its own DB status) + if agent_result.terminal_reason.is_some() { + if let Some(mem) = &memory { + if let Some(mission_id) = current_mission.read().await.clone() { + // Check current mission status from DB - only auto-complete if still "active" + let current_status = mem.supabase.get_mission(mission_id).await + .ok() + .flatten() + .map(|m| m.status); + + if current_status.as_deref() == Some("active") { + let status = if agent_result.success { "completed" } else { "failed" }; + tracing::info!( + "Auto-completing mission {} with status '{}' (terminal_reason: {:?})", + mission_id, status, agent_result.terminal_reason + ); + if let Err(e) = mem.supabase.update_mission_status(mission_id, status).await { + tracing::warn!("Failed to auto-complete mission: {}", e); + } else { + // Emit status change event + let new_status = if agent_result.success { + MissionStatus::Completed + } else { + MissionStatus::Failed + }; + let _ = events_tx.send(AgentEvent::MissionStatusChanged { + mission_id, + status: new_status, + summary: Some(format!("Auto-completed: {}", + agent_result.output.chars().take(100).collect::())), + }); + } + } else { + tracing::debug!( + "Skipping auto-complete: mission {} already has status {:?}", + mission_id, current_status + ); + } + } + } + } + let _ = events_tx.send(AgentEvent::AssistantMessage { id: Uuid::new_v4(), content: agent_result.output.clone(), diff --git a/src/api/fs.rs b/src/api/fs.rs index 15e1cf6..d5a0b58 100644 --- a/src/api/fs.rs +++ b/src/api/fs.rs @@ -220,6 +220,16 @@ pub async fn upload( format!("{}/{}", q.path, file_name) }; + // Ensure the target directory exists (mkdir -p is idempotent) + let target_dir = if q.path.ends_with('/') { + q.path.trim_end_matches('/').to_string() + } else { + q.path.clone() + }; + ssh_exec(&cfg, key_file.path(), "mkdir", &["-p".into(), target_dir]) + .await + .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("Failed to create directory: {}", e)))?; + let batch = format!("put -p \"{}\" \"{}\"\n", tmp.to_string_lossy(), remote_path); sftp_batch(&cfg, key_file.path(), &batch) .await diff --git a/src/api/routes.rs b/src/api/routes.rs index 204cab0..3841b15 100644 --- a/src/api/routes.rs +++ b/src/api/routes.rs @@ -21,8 +21,8 @@ use tower_http::cors::CorsLayer; use tower_http::trace::TraceLayer; use uuid::Uuid; -use crate::agents::orchestrator::RootAgent; -use crate::agents::{AgentContext, AgentRef, TuningParams}; +use crate::agents::SimpleAgent; +use crate::agents::{AgentContext, AgentRef}; use crate::budget::ModelPricing; use crate::config::Config; use crate::llm::OpenRouterClient; @@ -41,7 +41,7 @@ use super::types::*; pub struct AppState { pub config: Config, pub tasks: RwLock>, - /// The hierarchical root agent + /// The agent used for task execution pub root_agent: AgentRef, /// Memory system (optional) pub memory: Option, @@ -57,11 +57,8 @@ pub struct AppState { /// Start the HTTP server. pub async fn serve(config: Config) -> anyhow::Result<()> { - // Load empirically tuned parameters (if present in working directory) - let tuning = TuningParams::load_from_working_dir(&config.working_dir).await; - - // Create the root agent (hierarchical) - let root_agent: AgentRef = Arc::new(RootAgent::new_with_tuning(&tuning)); + // Create the simple agent (replaces complex RootAgent hierarchy) + let root_agent: AgentRef = Arc::new(SimpleAgent::new()); // Initialize memory system (optional - needs Supabase config) let memory = memory::init_memory(&config.memory, &config.api_key).await; diff --git a/src/bin/calibrate.rs b/src/bin/calibrate.rs deleted file mode 100644 index d53f947..0000000 --- a/src/bin/calibrate.rs +++ /dev/null @@ -1,244 +0,0 @@ -//! Calibration harness for Open Agent estimators. -//! -//! This binary runs trial tasks in a temporary directory and measures: -//! - ComplexityEstimator: predicted tokens vs actual tokens used by TaskExecutor -//! - Split decision quality (against a small labeled set) -//! -//! The goal is *empirical tuning* by trial-and-error, while keeping the core -//! agent code maintainable and (eventually) provable. -//! -//! ## Usage -//! -//! ```bash -//! export OPENROUTER_API_KEY="..." -//! cargo run --release --bin calibrate -- --working-dir /tmp/open_agent_calibration --model openai/gpt-4.1-mini -//! ``` -//! -//! Notes: -//! - This will create and delete files under the given directory. -//! - Costs real money. Keep the task set small. - -use std::path::{Path, PathBuf}; -use std::sync::Arc; - -use open_agent::agents::leaf::{ComplexityEstimator, ComplexityPromptVariant, TaskExecutor}; -use open_agent::agents::{Agent, AgentContext}; -use open_agent::budget::ModelPricing; -use open_agent::config::Config; -use open_agent::llm::OpenRouterClient; -use open_agent::task::{Task, VerificationCriteria}; -use open_agent::tools::ToolRegistry; -use open_agent::agents::tuning::{TuningParams, ComplexityTuning}; - -#[derive(Debug, Clone)] -struct CalibTask { - name: &'static str, - prompt: &'static str, - expected_should_split: bool, -} - -fn parse_args() -> (PathBuf, String, bool) { - let mut working_dir = None::; - let mut model = None::; - let mut write_tuning = false; - - let mut args = std::env::args().skip(1); - while let Some(a) = args.next() { - match a.as_str() { - "--working-dir" | "--workspace" => working_dir = args.next().map(PathBuf::from), - "--model" => model = args.next(), - "--write-tuning" => write_tuning = true, - _ => {} - } - } - - let working_dir = working_dir.unwrap_or_else(|| PathBuf::from("./.open_agent_calibration")); - let model = model.unwrap_or_else(|| "openai/gpt-4.1-mini".to_string()); - (working_dir, model, write_tuning) -} - -fn task_set() -> Vec { - vec![ - CalibTask { - name: "hello_world", - prompt: "Create a Python script called hello.py that prints 'Hello World'.", - expected_should_split: false, - }, - CalibTask { - name: "calculator", - prompt: "Create a Python script called calculator.py with add/subtract/multiply/divide functions and a small CLI menu.", - expected_should_split: false, - }, - CalibTask { - name: "mini_project", - prompt: "Create a tiny Python project with: (1) src/app.py that reads a name from argv and prints a greeting, (2) tests/test_app.py using pytest, (3) a pyproject.toml. Ensure 'python -m pytest' passes.", - expected_should_split: true, - }, - ] -} - -#[derive(Debug, Clone)] -struct Score { - mean_token_rel_error: f64, - split_accuracy: f64, -} - -impl Score { - fn objective(&self) -> f64 { - // Lower is better. Penalize wrong split decisions. - self.mean_token_rel_error + (1.0 - self.split_accuracy) * 0.50 - } -} - -async fn ensure_clean_dir(dir: &Path) -> anyhow::Result<()> { - if dir.exists() { - tokio::fs::remove_dir_all(dir).await?; - } - tokio::fs::create_dir_all(dir).await?; - Ok(()) -} - -#[tokio::main] -async fn main() -> anyhow::Result<()> { - let (working_dir_root, exec_model, write_tuning) = parse_args(); - - let api_key = std::env::var("OPENROUTER_API_KEY") - .map_err(|_| anyhow::anyhow!("OPENROUTER_API_KEY must be set for calibration"))?; - - let tasks = task_set(); - - // Grid to try. - let variants = [ - ComplexityPromptVariant::RubricV1, - ComplexityPromptVariant::CalibratedV2, - ]; - let split_thresholds = [0.55, 0.60, 0.65]; - let token_multipliers = [0.9, 1.0, 1.1, 1.2, 1.3]; - - let llm: Arc = Arc::new(OpenRouterClient::new(api_key)); - let pricing = Arc::new(ModelPricing::new()); - - let mut best = None::<(ComplexityPromptVariant, f64, f64, Score)>; - - for &variant in &variants { - for &split_threshold in &split_thresholds { - for &token_mult in &token_multipliers { - let mut rel_errors = Vec::new(); - let mut correct_split = 0usize; - - for t in &tasks { - let wd = working_dir_root.join(format!( - "{}_st{}_tm{}", - t.name, - (split_threshold * 100.0) as u64, - (token_mult * 100.0) as u64 - )); - ensure_clean_dir(&wd).await?; - - // Minimal config for context. - let cfg = Config::new("".to_string(), exec_model.clone(), wd.clone()); - - let ctx = AgentContext::new( - cfg, - Arc::clone(&llm), - ToolRegistry::new(), - Arc::clone(&pricing), - wd.clone(), - ); - - // Build task with generous budget. - let budget = open_agent::budget::Budget::new(10_000); // $100 in cents - let mut task = Task::new( - t.prompt.to_string(), - VerificationCriteria::None, - budget, - )?; - - // Run estimator (with candidate params). - let estimator = ComplexityEstimator::with_params(variant, split_threshold, token_mult); - let _ = estimator.execute(&mut task, &ctx).await; - - let predicted_tokens = task.analysis().estimated_total_tokens.unwrap_or(2000); - let predicted_split = task.analysis().should_split.unwrap_or(false); - if predicted_split == t.expected_should_split { - correct_split += 1; - } - - // Force execution model for comparability. - task.analysis_mut().selected_model = Some(exec_model.clone()); - - let executor = TaskExecutor::new(); - let _exec_res = executor.execute(&mut task, &ctx).await; - - let actual_tokens = task - .analysis() - .actual_usage - .as_ref() - .map(|u| u.total_tokens) - .unwrap_or(predicted_tokens); - - let denom = (actual_tokens as f64).max(1.0); - let rel = ((predicted_tokens as f64) - (actual_tokens as f64)).abs() / denom; - rel_errors.push(rel); - } - - let mean_token_rel_error = if rel_errors.is_empty() { - 1.0 - } else { - rel_errors.iter().sum::() / (rel_errors.len() as f64) - }; - - let split_accuracy = (correct_split as f64) / (tasks.len() as f64); - let score = Score { - mean_token_rel_error, - split_accuracy, - }; - - let candidate = (variant, split_threshold, token_mult, score.clone()); - let better = best - .as_ref() - .map(|(_, _, _, s)| score.objective() < s.objective()) - .unwrap_or(true); - - if better { - best = Some(candidate); - eprintln!( - "New best: variant={:?} split={:.2} mult={:.2} token_err={:.3} split_acc={:.2} obj={:.3}", - variant, - split_threshold, - token_mult, - score.mean_token_rel_error, - score.split_accuracy, - score.objective() - ); - } - } - } - } - - if let Some((variant, split_threshold, token_mult, score)) = best { - println!("=== Recommended ComplexityEstimator Settings ==="); - println!("prompt_variant: {:?}", variant); - println!("split_threshold: {:.2}", split_threshold); - println!("token_multiplier: {:.2}", token_mult); - println!("mean_token_rel_error: {:.3}", score.mean_token_rel_error); - println!("split_accuracy: {:.2}", score.split_accuracy); - - if write_tuning { - let mut tuning = TuningParams::default(); - tuning.complexity = ComplexityTuning { - prompt_variant: variant, - split_threshold, - token_multiplier: token_mult, - }; - let path = tuning.save_to_working_dir(&working_dir_root).await?; - println!("Wrote tuning file to {}", path.to_string_lossy()); - } - } else { - println!("No calibration result produced."); - } - - Ok(()) -} - - diff --git a/src/budget/pricing.rs b/src/budget/pricing.rs index e8c15bb..d111588 100644 --- a/src/budget/pricing.rs +++ b/src/budget/pricing.rs @@ -238,7 +238,10 @@ impl ModelPricing { // Flagship tier "openai/o1", "openai/o1-preview", + "openai/gpt-5.2-pro", // Mid tier + "openai/gpt-5.2", + "openai/gpt-5.2-chat", "openai/gpt-4.1", "openai/gpt-4o", "openai/gpt-4-turbo", diff --git a/src/lib.rs b/src/lib.rs index 308bfc0..55a7157 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -4,33 +4,33 @@ //! //! This library provides: //! - An HTTP API for task submission and monitoring -//! - A hierarchical agent tree for complex task handling +//! - A simple agent architecture for direct task execution //! - Tool-based execution for autonomous code editing //! - Integration with OpenRouter for LLM access //! -//! ## Architecture (v2: Hierarchical Agent Tree) +//! ## Architecture (v3: SimpleAgent) //! //! ```text -//! ┌─────────────┐ -//! │ RootAgent │ -//! └──────┬──────┘ -//! ┌─────────────────┼─────────────────┐ -//! ▼ ▼ ▼ -//! ┌───────────────┐ ┌─────────────┐ ┌─────────────┐ -//! │ Complexity │ │ Model │ │ Task │ -//! │ Estimator │ │ Selector │ │ Executor │ -//! └───────────────┘ └─────────────┘ └─────────────┘ +//! ┌──────────────────────────────────┐ +//! │ SimpleAgent │ +//! │ (direct execution, no overhead) │ +//! └────────────────┬─────────────────┘ +//! │ +//! ▼ +//! ┌─────────────────┐ +//! │ TaskExecutor │ +//! │ (tool loop) │ +//! └─────────────────┘ //! ``` //! //! ## Task Flow //! 1. Receive task via API -//! 2. Estimate complexity (should we split?) -//! 3. Select optimal model (U-curve cost optimization) -//! 4. Execute (directly or via subtasks) -//! 5. Verify completion (programmatic + LLM hybrid) +//! 2. Resolve model (user override or config default) +//! 3. Execute via TaskExecutor (tool loop) +//! 4. Return result (mission completion via complete_mission tool) //! //! ## Modules -//! - `agents`: Hierarchical agent tree (Root, Node, Leaf agents) +//! - `agents`: SimpleAgent and TaskExecutor //! - `task`: Task, subtask, and verification types //! - `budget`: Cost tracking and model pricing diff --git a/src/memory/embed.rs b/src/memory/embed.rs index dc650ee..0a15b68 100644 --- a/src/memory/embed.rs +++ b/src/memory/embed.rs @@ -129,3 +129,6 @@ struct EmbeddingUsage { + + + diff --git a/src/tools/file_ops.rs b/src/tools/file_ops.rs index 2f236a5..fc63cc4 100644 --- a/src/tools/file_ops.rs +++ b/src/tools/file_ops.rs @@ -58,7 +58,27 @@ impl Tool for ReadFile { return Err(anyhow::anyhow!("File not found: {} (resolved to: {})", path, resolution.resolved.display())); } - let content = tokio::fs::read_to_string(&resolution.resolved).await?; + // Try to read as UTF-8 text, detect binary files + let bytes = tokio::fs::read(&resolution.resolved).await?; + let content = match String::from_utf8(bytes) { + Ok(text) => text, + Err(_) => { + // Binary file detected - don't try to display content + let ext = resolution.resolved.extension() + .map(|e| e.to_string_lossy().to_lowercase()) + .unwrap_or_default(); + return Ok(format!( + "Binary file detected: {} ({} bytes)\n\n\ + Cannot display binary content directly. For this file type:\n\ + - .jar/.zip: Use `run_command` with `unzip -l` to list contents, or `jar tf` for JAR files\n\ + - .class: Use `run_command` with a Java decompiler like `javap -c` or `cfr`\n\ + - Images: Use appropriate tools to process\n\ + - Executables: Use `file` command to identify, `strings` to extract text", + resolution.resolved.display(), + resolution.resolved.metadata().map(|m| m.len()).unwrap_or(0) + )); + } + }; // Handle optional line range let start_line = args["start_line"].as_u64().map(|n| n as usize); diff --git a/src/tools/github.rs b/src/tools/github.rs index a299a1d..f82b6bd 100644 --- a/src/tools/github.rs +++ b/src/tools/github.rs @@ -632,3 +632,6 @@ Note: GitHub code search requires authentication. Set GH_TOKEN env var." Ok(format!("## Search results for '{}'\n\n{}", query, stdout)) } } + + + diff --git a/src/tools/terminal.rs b/src/tools/terminal.rs index 9ce286d..bdf3891 100644 --- a/src/tools/terminal.rs +++ b/src/tools/terminal.rs @@ -15,6 +15,34 @@ use tokio::process::Command; use super::{resolve_path_simple as resolve_path, Tool}; +/// Sanitize command output to be safe for LLM consumption. +/// Removes binary garbage while preserving valid text. +fn sanitize_output(bytes: &[u8]) -> String { + // Check if output appears to be mostly binary + let non_printable_count = bytes.iter() + .filter(|&&b| b < 0x20 && b != b'\n' && b != b'\r' && b != b'\t') + .count(); + + // If more than 10% is non-printable (excluding newlines/tabs), it's likely binary + if bytes.len() > 100 && non_printable_count > bytes.len() / 10 { + return format!( + "[Binary output detected - {} bytes, {}% non-printable. \ + Use appropriate tools to process binary data.]", + bytes.len(), + non_printable_count * 100 / bytes.len() + ); + } + + // Convert to string, replacing invalid UTF-8 + let text = String::from_utf8_lossy(bytes); + + // Remove null bytes and other problematic control characters + // Keep: newlines, tabs, carriage returns + text.chars() + .filter(|&c| c == '\n' || c == '\r' || c == '\t' || (c >= ' ' && c != '\u{FFFD}')) + .collect() +} + /// Dangerous command patterns that should be blocked. /// These patterns cause infinite loops or could damage the system. const DANGEROUS_PATTERNS: &[(&str, &str)] = &[ @@ -146,8 +174,8 @@ impl Tool for RunCommand { } }; - let stdout = String::from_utf8_lossy(&output.stdout); - let stderr = String::from_utf8_lossy(&output.stderr); + let stdout = sanitize_output(&output.stdout); + let stderr = sanitize_output(&output.stderr); let exit_code = output.status.code().unwrap_or(-1); tracing::debug!("Command completed: exit={}, stdout_len={}, stderr_len={}", @@ -157,6 +185,11 @@ impl Tool for RunCommand { result.push_str(&format!("Exit code: {}\n", exit_code)); + // Add hint when non-zero exit but output exists (common with tools that warn but succeed) + if exit_code != 0 && !stdout.is_empty() { + result.push_str("Note: Non-zero exit code but output was produced. The command may have succeeded with warnings - verify output files exist.\n"); + } + if !stdout.is_empty() { result.push_str("\n--- stdout ---\n"); result.push_str(&stdout);