The /api/control/message endpoint now accepts an optional `model` field to specify which model to use for a particular message. This enables:
- Model comparison tests from the dashboard
- Per-message model selection in the control session

The model override is passed through to the task's requested_model field, which the ModelSelector respects when choosing the execution model.
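For illustration, a minimal sketch of sending a message with a per-message model override. Only the endpoint path and the `model` field come from the description above; the `message` field name, host, and port are assumptions.

```python
# Hedged sketch: POST a control message with a per-message model override.
# Field names other than `model`, and the base URL, are assumed for illustration.
import json
import urllib.request

payload = {
    "message": "Compare outputs for this prompt",  # assumed field name
    "model": "openai/gpt-5.2",                     # optional per-message override
}

req = urllib.request.Request(
    "http://localhost:8000/api/control/message",   # assumed host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))
```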
Open Agent Scripts
Reusable Python scripts for data-processing tasks where the data is too large to fit in the LLM context.
Available Scripts
merge_benchmarks.py
Merges OpenRouter models with ZeroEval benchmark scores.
Usage:
python3 scripts/merge_benchmarks.py
What it does:
- Fetches all models from OpenRouter API (~350 models)
- Fetches benchmark metadata from ZeroEval API (~383 benchmarks)
- Fetches scores for key benchmarks in each category:
- code: SWE-bench, HumanEval, LiveCodeBench, Aider-Polyglot, etc.
- math: AIME 2025/2024, MATH-500, GSM8K, etc.
- reasoning: GPQA, MMLU-Pro, MMLU, ARC, HellaSwag, etc.
- tool_calling: BFCL, Tau-Bench, ACEBench, etc.
- long_context: RULER, LongBench, InfiniteBench, etc.
- general: IFEval, Arena-Hard, MT-Bench, etc.
- Merges models with benchmark data
- Outputs models_with_benchmarks.json (this fetch-and-merge flow is sketched below)
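The flow above is roughly fetch, index, and join. Below is a minimal sketch of that flow, not the script itself: the OpenRouter models endpoint (https://openrouter.ai/api/v1/models) is real, but the ZeroEval URL, its response shape, and the benchmark-to-category mapping are placeholders assumed for illustration.

```python
# Hedged sketch of the fetch-and-merge flow. The ZeroEval endpoint and its
# response shape below are assumptions, not the real API.
import json
import urllib.request

def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# OpenRouter's public model list.
models = fetch_json("https://openrouter.ai/api/v1/models")["data"]

# Placeholder: assume ZeroEval scores look like {model_id: {benchmark: score}}.
scores_by_model = fetch_json("https://example.com/zeroeval/scores")

# Map each benchmark of interest to its category (subset shown).
CATEGORIES = {
    "swe-bench-verified": "code",
    "aime-2025": "math",
    "gpqa": "reasoning",
}

merged = []
for m in models:
    raw = scores_by_model.get(m["id"], {})
    benchmarks = {}
    for bench, score in raw.items():
        cat = CATEGORIES.get(bench)
        if cat:
            benchmarks.setdefault(cat, {})[bench] = score
    merged.append({
        "id": m["id"],
        "name": m.get("name"),
        "context_length": m.get("context_length"),
        "pricing": m.get("pricing"),
        "benchmarks": benchmarks,
    })

with open("models_with_benchmarks.json", "w") as f:
    json.dump({"total_models": len(models), "models": merged}, f, indent=2)
```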
Output files:
- models_with_benchmarks.json - Main output with merged data
- openrouter_models_raw.json - Raw OpenRouter API response
- llm_stats_benchmarks.json - Benchmark metadata from ZeroEval
Output format:
{
"generated_at": "2025-12-17T03:37:04Z",
"total_models": 349,
"models_with_benchmarks": 156,
"categories": ["code", "math", "reasoning", "tool_calling", "long_context", "general"],
"models": [
{
"id": "openai/gpt-5.2",
"name": "GPT-5.2",
"context_length": 400000,
"pricing": {...},
"benchmarks": {
"code": {"swe-bench-verified": 0.731},
"math": {"aime-2025": 0.96},
"reasoning": {"gpqa": 0.924}
},
"category_scores": {
"code": 0.731,
"math": 0.96,
"reasoning": 0.924
}
}
]
}
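Given that format, a consumer can rank models by a category aggregate without pulling the full file into context. A small sketch, assuming the file has been generated as above; the "code" category and the top-10 cutoff are arbitrary choices for illustration.

```python
# Rank models by their "code" category score from the merged output file.
import json

with open("models_with_benchmarks.json") as f:
    data = json.load(f)

ranked = sorted(
    (m for m in data["models"] if "code" in m.get("category_scores", {})),
    key=lambda m: m["category_scores"]["code"],
    reverse=True,
)

for m in ranked[:10]:
    print(f'{m["category_scores"]["code"]:.3f}  {m["id"]}')
```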
Best Practices for Large Data Tasks
When dealing with data too large for the LLM context (>10KB):
- Use scripts: Run Python/bash scripts with run_command
- Write to files: Save intermediate results to files
- Read summaries: Read only summaries or specific sections
- Process in chunks: Break large tasks into smaller pieces (see the sketch after the example below)
Example:
# Run the merge script
python3 scripts/merge_benchmarks.py
# Check summary
python3 -c "import json; d=json.load(open('models_with_benchmarks.json')); print(f'Models: {d[\"total_models\"]}, With benchmarks: {d[\"models_with_benchmarks\"]}')"
# Look up specific model
python3 -c "import json; d=json.load(open('models_with_benchmarks.json')); m=[x for x in d['models'] if 'gpt-5' in x['id'].lower()]; print(json.dumps(m[:3], indent=2))"