Dashboard:
- Redesigned Agent Tree page with hierarchical tree visualization
- Added Tree Overview mini-map showing agent counts by status
- Added collapsible nodes with smooth animations
- Added color-coded status indicators (running/completed/failed/pending)
- Added details panel showing agent info, budget, complexity, and logs
- Added Expand All / Collapse All controls
Backend:
- Enhanced NodeAgent with recursive complexity estimation and splitting
- Added extract_json helper for robust LLM response parsing
- RootAgent now delegates to NodeAgent for recursive subtask execution
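The extract_json helper mentioned above is not shown in this document. A minimal sketch of what such a helper might look like, assuming the LLM response wraps its JSON in a markdown fence or surrounding prose (only the name extract_json comes from the changelog; everything else is illustrative):

import json
import re

def extract_json(text):
    # Best-effort extraction of a JSON object from an LLM response (illustrative sketch).
    # 1. The response may already be plain JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 2. Look for a fenced ```json ... ``` block.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Fall back to the outermost brace-delimited span.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            pass
    return None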
Open Agent Scripts
Reusable Python scripts for data-processing tasks whose inputs are too large to fit in the LLM context.
Available Scripts
merge_benchmarks.py
Merges OpenRouter models with ZeroEval benchmark scores.
Usage:
python3 scripts/merge_benchmarks.py
What it does:
- Fetches all models from OpenRouter API (~350 models)
- Fetches benchmark metadata from ZeroEval API (~383 benchmarks)
- Fetches scores for key benchmarks in each category:
- code: SWE-bench, HumanEval, LiveCodeBench, Aider-Polyglot, etc.
- math: AIME 2025/2024, MATH-500, GSM8K, etc.
- reasoning: GPQA, MMLU-Pro, MMLU, ARC, HellaSwag, etc.
- tool_calling: BFCL, Tau-Bench, ACEBench, etc.
- long_context: RULER, LongBench, InfiniteBench, etc.
- general: IFEval, Arena-Hard, MT-Bench, etc.
- Merges models with benchmark data
- Outputs models_with_benchmarks.json (a sketch of this fetch-and-merge flow follows below)
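A rough sketch of the fetch-and-merge flow described above. The OpenRouter models endpoint is the public one; the ZeroEval URL, its response shape, and the merge keys are placeholders and may not match the real script:

import json
import urllib.request

OPENROUTER_MODELS_URL = "https://openrouter.ai/api/v1/models"      # public endpoint
ZEROEVAL_SCORES_URL = "https://example.invalid/zeroeval/scores"    # placeholder URL

def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def merge(models, scores_by_model):
    # Attach benchmark scores to each model, keyed by its OpenRouter id.
    merged = []
    for m in models:
        merged.append({
            "id": m["id"],
            "name": m.get("name"),
            "context_length": m.get("context_length"),
            "pricing": m.get("pricing"),
            "benchmarks": scores_by_model.get(m["id"], {}),
        })
    return merged

models = fetch_json(OPENROUTER_MODELS_URL)["data"]
scores = fetch_json(ZEROEVAL_SCORES_URL)   # assumed shape: {model_id: {category: {benchmark: score}}}
output = {"total_models": len(models), "models": merge(models, scores)}

with open("models_with_benchmarks.json", "w") as f:
    json.dump(output, f, indent=2)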
Output files:
- models_with_benchmarks.json - Main output with merged data
- openrouter_models_raw.json - Raw OpenRouter API response
- llm_stats_benchmarks.json - Benchmark metadata from ZeroEval
Output format:
{
  "generated_at": "2025-12-17T03:37:04Z",
  "total_models": 349,
  "models_with_benchmarks": 156,
  "categories": ["code", "math", "reasoning", "tool_calling", "long_context", "general"],
  "models": [
    {
      "id": "openai/gpt-5.2",
      "name": "GPT-5.2",
      "context_length": 400000,
      "pricing": {...},
      "benchmarks": {
        "code": {"swe-bench-verified": 0.731},
        "math": {"aime-2025": 0.96},
        "reasoning": {"gpqa": 0.924}
      },
      "category_scores": {
        "code": 0.731,
        "math": 0.96,
        "reasoning": 0.924
      }
    }
  ]
}
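The document does not state how category_scores are derived from the per-benchmark numbers. One plausible reading, assuming a simple mean over the benchmarks in each category (with a single benchmark per category, as in the example above, the aggregate equals that score):

def category_scores(benchmarks):
    # Collapse per-benchmark scores into one number per category.
    # Here: the mean of each category's scores; the real script may aggregate differently.
    return {
        cat: round(sum(scores.values()) / len(scores), 3)
        for cat, scores in benchmarks.items()
        if scores
    }

print(category_scores({"code": {"swe-bench-verified": 0.731},
                       "math": {"aime-2025": 0.96},
                       "reasoning": {"gpqa": 0.924}}))
# -> {'code': 0.731, 'math': 0.96, 'reasoning': 0.924}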
Best Practices for Large Data Tasks
When dealing with data too large for the LLM context (>10KB):
- Use scripts: Run Python/bash scripts with run_command
- Write to files: Save intermediate results to files
- Read summaries: Read only summaries or specific sections
- Process in chunks: Break large tasks into smaller pieces
Example:
# Run the merge script
python3 scripts/merge_benchmarks.py
# Check summary
python3 -c "import json; d=json.load(open('models_with_benchmarks.json')); print(f'Models: {d[\"total_models\"]}, With benchmarks: {d[\"models_with_benchmarks\"]}')"
# Look up specific model
python3 -c "import json; d=json.load(open('models_with_benchmarks.json')); m=[x for x in d['models'] if 'gpt-5' in x['id'].lower()]; print(json.dumps(m[:3], indent=2))"
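To illustrate the "write to files" and "process in chunks" practices together, a sketch that scans the merged output in small batches and writes a compact summary for later steps to read. The input file and field names follow the output format above; benchmark_summary.json and the chunk size are arbitrary choices for this example:

import json

with open("models_with_benchmarks.json") as f:
    data = json.load(f)

CHUNK = 50                      # process models in slices of 50
best_per_category = {}

for i in range(0, len(data["models"]), CHUNK):
    for model in data["models"][i:i + CHUNK]:
        for cat, score in model.get("category_scores", {}).items():
            if cat not in best_per_category or score > best_per_category[cat][1]:
                best_per_category[cat] = (model["id"], score)

# Save a small summary file; downstream steps read this instead of the full dataset.
with open("benchmark_summary.json", "w") as f:
    json.dump({cat: {"model": mid, "score": s}
               for cat, (mid, s) in best_per_category.items()}, f, indent=2)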