Add stall detection warning for stuck agent operations (#28)

* Add multi-user auth and per-user control sessions

* Add mission store abstraction and auth UX polish

* Fix unused warnings in tooling

* Fix Bugbot review issues

  - Prevent username enumeration by using generic error message
  - Add pagination support to InMemoryMissionStore::list_missions
  - Improve config error when JWT_SECRET missing but DASHBOARD_PASSWORD set

* Trim stored username in comparison for consistency

* Fix mission cleanup to also remove orphaned tree data

* Refactor Open Agent as OpenCode workspace host

* Remove chromiumoxide and pin @types/react

* Pin idna_adapter for MSRV compatibility

* Add host-mcp bin target

* Use isolated Playwright MCP sessions

* Allow Playwright MCP as root

* Fix iOS dashboard warnings

* Add autoFocus to username field in multi-user login mode

  Mirrors the iOS implementation behavior where username field is focused when multi-user auth mode is active.

* Fix Bugbot review issues

  - Add conditional ellipsis for tool descriptions (only when > 32 chars)
  - Add serde(default) to JWT usr field for backward compatibility

* Fix empty user ID fallback in multi-user auth

  Add effective_user_id helper that falls back to username when id is empty, preventing session sharing and token verification issues.

* Fix parallel mission history preservation

  Load existing mission history into runner before starting parallel execution to prevent losing conversation context.

* Fix desktop stream controls layout overflow on iPad

  - Add frame(maxWidth: .infinity) constraints to ensure controls stay within bounds on wide displays
  - Add alignment: .leading to VStacks for consistent layout
  - Add Spacer() to buttons row to prevent spreading
  - Increase label width to 55 for consistent FPS/Quality alignment
  - Add alignment: .trailing to value text frames

* Fix queued user messages not persisted to mission history

  When a user message was queued (sent while another task was running), it was not being added to the history or persisted to the database. This caused queued messages to be lost from mission history. Added the same persistence logic used for initial messages to the queued message handling code path.

* Add stall detection warning for stuck agent operations

  When an agent hasn't reported activity for 60+ seconds, show a warning banner in the chat UI with a Stop button. After 120+ seconds, the warning becomes more urgent with a Force Stop button.

  Changes:
  - Dashboard: Add viewingMissionStallSeconds tracking and stall warning banner
  - Backend: Update parallel runner last_activity when receiving events

  This helps users identify and cancel stuck missions (e.g., when OpenCode tool execution hangs indefinitely).

* Fix main mission stall detection always reporting zero

  Track main_runner_last_activity separately from parallel runners. Update activity timestamp when events match the running main mission. Resolves Bugbot review finding.

* Reset stall timer when new task starts

  Reset main_runner_last_activity when spawning a new task to prevent false stall warnings from idle time between tasks. Resolves Bugbot review finding.

* Update CLAUDE.md to prefer debug builds by default

  - Debug builds compile 5-10x faster than release builds
  - Only use --release for production deployment or when explicitly requested
  - Added Build Mode Policy section documenting this preference
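
The two thresholds in the stall detection entry above (warning at 60 seconds, urgent at 120) amount to a small classification rule. The dashboard implements it in TypeScript; what follows is only a minimal Rust sketch of the same rule. `StallLevel`, `classify_stall`, `stop_button_label`, and the two constants are hypothetical names chosen for illustration, not identifiers from this commit.

```rust
/// Severity of a stall, mirroring the two thresholds in the commit message.
/// (Hypothetical type; the commit itself uses two booleans in the dashboard.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StallLevel {
    None,
    Warning,
    Severe,
}

/// Seconds without activity before the warning banner appears.
pub const STALL_WARNING_SECS: u64 = 60;
/// Seconds without activity before the banner turns urgent.
pub const STALL_SEVERE_SECS: u64 = 120;

/// Classify a mission from its state string and idle time.
/// Only missions in the "running" state are considered, matching the
/// dashboard check, which reports zero stall seconds for any other state.
pub fn classify_stall(state: &str, seconds_since_activity: u64) -> StallLevel {
    if state != "running" {
        return StallLevel::None;
    }
    if seconds_since_activity >= STALL_SEVERE_SECS {
        StallLevel::Severe
    } else if seconds_since_activity >= STALL_WARNING_SECS {
        StallLevel::Warning
    } else {
        StallLevel::None
    }
}

/// Button label for a given level; `None` means no banner is shown.
pub fn stop_button_label(level: StallLevel) -> Option<&'static str> {
    match level {
        StallLevel::None => None,
        StallLevel::Warning => Some("Stop"),
        StallLevel::Severe => Some("Force Stop"),
    }
}
```

Both thresholds are inclusive, so a mission idle for exactly 60 seconds already counts as stalled.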
2026-01-04 23:55:27 -08:00
parent a3d3437b1d
commit aa65c4a1ef
3 changed files with 113 additions and 8 deletions
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -15,14 +15,18 @@ Minimal autonomous coding agent in Rust with **full machine access** (not sandbo
 ## Commands

 ```bash
-# Backend
-cargo build --release           # Build
-cargo run --release             # Run server (port 3000)
-RUST_LOG=debug cargo run        # Debug mode
+# Backend - ALWAYS use debug builds by default (faster compilation)
+cargo build                     # Build (debug mode - use this by default)
+cargo run                       # Run server (port 3000)
+RUST_LOG=debug cargo run        # Run with debug logging
 cargo test                      # Run tests
 cargo fmt                       # Format code
 cargo clippy                    # Lint

+# Release builds - ONLY use when explicitly requested or for production deployment
+cargo build --release           # Release build (slower compilation, faster binary)
+cargo run --release             # Run in release mode
+
 # Dashboard (uses Bun, NOT npm/yarn/pnpm)
 cd dashboard
 bun install                     # Install deps (NEVER use npm install)
@@ -34,10 +38,18 @@ bun run build                   # Production build
 # - bun add <pkg> (not npm install <pkg>)
 # - bun run <script> (not npm run <script>)

-# Deployment
+# Deployment (release build required for production)
 ssh root@95.216.112.253 'cd /root/open_agent && git pull && cargo build --release && cp target/release/open_agent /usr/local/bin/ && cp target/release/desktop-mcp /usr/local/bin/ && cp target/release/host-mcp /usr/local/bin/ && systemctl restart open_agent'
 ```

+## Build Mode Policy
+
+**Always prefer debug builds** unless explicitly requested otherwise:
+- Debug builds compile much faster (~5-10x)
+- Use `cargo build` and `cargo run` (no `--release` flag)
+- Only use `--release` for production deployment or when user explicitly requests it
+- Performance difference is negligible for development/testing
+
 ## Architecture

 Open Agent uses OpenCode as its execution backend, enabling Claude Max subscription usage.
@@ -144,7 +156,7 @@ OPENCODE_PERMISSIVE=true
 **Desktop Tools with OpenCode:**
 To enable desktop tools (i3, Xvfb, screenshots):

-1. Build the MCP servers: `cargo build --release --bin desktop-mcp --bin host-mcp`
+1. Build the MCP servers: `cargo build --bin desktop-mcp --bin host-mcp` (use `--release` only for production)
 2. Workspace `opencode.json` files are generated automatically under `workspaces/`
   from `.openagent/mcp/config.json` (override by editing MCP configs via the UI).
 3. OpenCode will automatically load the tools from the MCP server
--- a/dashboard/src/app/control/control-client.tsx
+++ b/dashboard/src/app/control/control-client.tsx
@@ -70,6 +70,7 @@ import {
  PanelRight,
  Wifi,
  WifiOff,
+  AlertTriangle,
 } from "lucide-react";
 import {
  OptionList,
@@ -664,6 +665,18 @@ export default function ControlClient() {
    return mission.state === "running" || mission.state === "waiting_for_tool";
  }, [viewingMissionId, runningMissions, runState]);

+  // Check if the mission we're viewing appears stalled (no activity for 60+ seconds)
+  const viewingMissionStallSeconds = useMemo(() => {
+    if (!viewingMissionId) return 0;
+    const mission = runningMissions.find((m) => m.mission_id === viewingMissionId);
+    if (!mission) return 0;
+    if (mission.state !== "running") return 0;
+    return mission.seconds_since_activity;
+  }, [viewingMissionId, runningMissions]);
+
+  const isViewingMissionStalled = viewingMissionStallSeconds >= 60;
+  const isViewingMissionSeverelyStalled = viewingMissionStallSeconds >= 120;
+
  const isBusy = viewingMissionIsRunning;

  const streamCleanupRef = useRef<null | (() => void)>(null);
@@ -2535,6 +2548,53 @@ export default function ControlClient() {
                  </div>
                )}

+              {/* Stall warning banner when agent hasn't reported activity for 60+ seconds */}
+              {isViewingMissionStalled && viewingMissionId && (
+                <div className="flex justify-center py-4 animate-fade-in">
+                  <div className={cn(
+                    "flex flex-col sm:flex-row items-start sm:items-center gap-3 rounded-xl px-5 py-4",
+                    isViewingMissionSeverelyStalled
+                      ? "bg-red-500/10 border border-red-500/20"
+                      : "bg-amber-500/10 border border-amber-500/20"
+                  )}>
+                    <div className="flex items-center gap-3">
+                      <AlertTriangle className={cn(
+                        "h-5 w-5 shrink-0",
+                        isViewingMissionSeverelyStalled ? "text-red-400" : "text-amber-400"
+                      )} />
+                      <div className="text-sm">
+                        <span className={cn(
+                          "font-medium",
+                          isViewingMissionSeverelyStalled ? "text-red-400" : "text-amber-400"
+                        )}>
+                          Agent may be stuck
+                        </span>
+                        <span className="text-white/50 ml-1">
+                          — No activity for {Math.floor(viewingMissionStallSeconds)}s
+                        </span>
+                        <p className="text-white/40 text-xs mt-1">
+                          {isViewingMissionSeverelyStalled
+                            ? "The agent appears to be stuck on a long-running operation. Consider stopping it."
+                            : "A tool or external operation may be taking longer than expected."}
+                        </p>
+                      </div>
+                    </div>
+                    <button
+                      onClick={() => handleCancelMission(viewingMissionId)}
+                      className={cn(
+                        "shrink-0 inline-flex items-center gap-1.5 rounded-lg px-3 py-1.5 text-sm font-medium transition-colors",
+                        isViewingMissionSeverelyStalled
+                          ? "bg-red-500 text-white hover:bg-red-400"
+                          : "bg-amber-500/20 text-amber-400 hover:bg-amber-500/30 border border-amber-500/30"
+                      )}
+                    >
+                      <Square className="h-3.5 w-3.5" />
+                      {isViewingMissionSeverelyStalled ? "Force Stop" : "Stop"}
+                    </button>
+                  </div>
+                </div>
+              )}
+
              {/* Continue banner for blocked missions */}
              {activeMission?.status === "blocked" && items.length > 0 && (
                <div className="flex justify-center py-4">
--- a/src/api/control.rs
+++ b/src/api/control.rs
@@ -1624,7 +1624,7 @@ fn spawn_control_session(
    mission_store: Arc<dyn MissionStore>,
 ) -> ControlState {
    let (cmd_tx, cmd_rx) = mpsc::channel::<ControlCommand>(256);
-    let (events_tx, _events_rx) = broadcast::channel::<AgentEvent>(1024);
+    let (events_tx, events_rx) = broadcast::channel::<AgentEvent>(1024);
    let tool_hub = Arc::new(FrontendToolHub::new());
    let status = Arc::new(RwLock::new(ControlStatus {
        state: ControlRunState::Idle,
@@ -1666,6 +1666,7 @@ fn spawn_control_session(
        mission_cmd_rx,
        mission_cmd_tx,
        events_tx.clone(),
+        events_rx,
        tool_hub,
        status,
        current_mission,
@@ -1749,6 +1750,7 @@ async fn control_actor_loop(
    mut mission_cmd_rx: mpsc::Receiver<crate::tools::mission::MissionControlCommand>,
    mission_cmd_tx: mpsc::Sender<crate::tools::mission::MissionControlCommand>,
    events_tx: broadcast::Sender<AgentEvent>,
+    mut events_rx: broadcast::Receiver<AgentEvent>,
    tool_hub: Arc<FrontendToolHub>,
    status: Arc<RwLock<ControlStatus>>,
    current_mission: Arc<RwLock<Option<Uuid>>>,
@@ -1767,6 +1769,8 @@ async fn control_actor_loop(
    // Track which mission the main `running` task is actually working on.
    // This is different from `current_mission` which can change when user creates a new mission.
    let mut running_mission_id: Option<Uuid> = None;
+    // Track last activity for the main runner (for stall detection)
+    let mut main_runner_last_activity: std::time::Instant = std::time::Instant::now();

    // Parallel mission runners - each runs independently
    let mut parallel_runners: std::collections::HashMap<
@@ -2101,6 +2105,8 @@ async fn control_actor_loop(
                                let mission_id = current_mission.read().await.clone();
                                running_cancel = Some(cancel.clone());
                                running_mission_id = mission_id;
+                                // Reset activity timer when new task starts to avoid false stall warnings
+                                main_runner_last_activity = std::time::Instant::now();
                                running = Some(tokio::spawn(async move {
                                    let result = run_single_control_turn(
                                        cfg,
@@ -2341,7 +2347,7 @@ async fn control_actor_loop(
                                    state: "running".to_string(),
                                    queue_len: queue.len(),
                                    history_len: history.len(),
-                                    seconds_since_activity: 0, // Main runner doesn't track this yet
+                                    seconds_since_activity: main_runner_last_activity.elapsed().as_secs(),
                                    expected_deliverables: 0,
                                });
                            }
@@ -2788,6 +2794,8 @@ async fn control_actor_loop(
                    // Capture which mission this task is working on
                    let mission_id = current_mission.read().await.clone();
                    running_mission_id = mission_id;
+                    // Reset activity timer when new task starts to avoid false stall warnings
+                    main_runner_last_activity = std::time::Instant::now();
                    running = Some(tokio::spawn(async move {
                        let result = run_single_control_turn(
                            cfg,
@@ -2871,6 +2879,31 @@ async fn control_actor_loop(
                    tracing::info!("Parallel mission {} removed from runners", mid);
                }
            }
+            // Update last_activity for runners when we receive events for them
+            event = events_rx.recv() => {
+                if let Ok(event) = event {
+                    // Extract mission_id from event if present
+                    let mission_id = match &event {
+                        AgentEvent::ToolCall { mission_id, .. } => *mission_id,
+                        AgentEvent::ToolResult { mission_id, .. } => *mission_id,
+                        AgentEvent::Thinking { mission_id, .. } => *mission_id,
+                        AgentEvent::AgentPhase { mission_id, .. } => *mission_id,
+                        AgentEvent::AgentTree { mission_id, .. } => *mission_id,
+                        AgentEvent::Progress { mission_id, .. } => *mission_id,
+                        _ => None,
+                    };
+                    // Update last_activity for matching runner (main or parallel)
+                    if let Some(mid) = mission_id {
+                        if running_mission_id == Some(mid) {
+                            // Update main runner activity
+                            main_runner_last_activity = std::time::Instant::now();
+                        } else if let Some(runner) = parallel_runners.get_mut(&mid) {
+                            // Update parallel runner activity
+                            runner.touch();
+                        }
+                    }
+                }
+            }
        }
    }
 }
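
The event-handling arm above refreshes either `main_runner_last_activity` or a parallel runner's timestamp through `runner.touch()`, whose definition is not part of this diff. A minimal sketch of how such tracking might look, under stated assumptions: `ActivityTracker` and `record_activity` are hypothetical names, mission ids are plain `u64` values instead of `Uuid` to keep the block dependency-free, and the current time is passed in explicitly so the behavior can be checked without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the per-runner activity timestamp.
#[derive(Debug, Clone, Copy)]
pub struct ActivityTracker {
    last_activity: Instant,
}

impl ActivityTracker {
    /// Start tracking from `now` (also used when a new task is spawned).
    pub fn new_at(now: Instant) -> Self {
        Self { last_activity: now }
    }

    /// Record activity at `now`, analogous to `runner.touch()` in the diff.
    pub fn touch_at(&mut self, now: Instant) {
        self.last_activity = now;
    }

    /// Time elapsed since the last recorded activity (zero if `now` is earlier).
    pub fn idle_for(&self, now: Instant) -> Duration {
        now.saturating_duration_since(self.last_activity)
    }

    /// Whole seconds since the last activity, as reported to the dashboard.
    pub fn seconds_since(&self, now: Instant) -> u64 {
        self.idle_for(now).as_secs()
    }
}

/// Route an event's mission id to the matching tracker.
/// The main runner is checked first, then the parallel runners.
/// Returns true if some tracker was updated.
pub fn record_activity(
    running_mission_id: Option<u64>,
    main: &mut ActivityTracker,
    parallel: &mut HashMap<u64, ActivityTracker>,
    event_mission_id: Option<u64>,
    now: Instant,
) -> bool {
    // Events without a mission id cannot be attributed to any runner.
    let mid = match event_mission_id {
        Some(mid) => mid,
        None => return false,
    };
    if running_mission_id == Some(mid) {
        main.touch_at(now);
        true
    } else if let Some(runner) = parallel.get_mut(&mid) {
        runner.touch_at(now);
        true
    } else {
        false
    }
}
```

Passing `now` as a parameter is a choice made for the sketch only; the code in the diff calls `std::time::Instant::now()` directly.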