The /api/control/message endpoint now accepts an optional `model` field to specify which model to use for a particular message. This enables:
- Model comparison tests from the dashboard
- Per-message model selection in the control session

The model override is passed through to the task's requested_model field, which the ModelSelector respects when choosing the execution model.
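For illustration, a minimal sketch of sending a message with a per-message model override. Only the endpoint path and the `model` field come from the description above; the `message` field name, host, and port are assumptions.

```python
# Hedged sketch: POST a control message with a per-message model override.
# Field names other than `model`, and the base URL, are assumed for illustration.
import json
import urllib.request

payload = {
    "message": "Compare outputs for this prompt",  # assumed field name
    "model": "openai/gpt-5.2",                     # optional per-message override
}

req = urllib.request.Request(
    "http://localhost:8000/api/control/message",   # assumed host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode("utf-8"))
```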
Open Agent Scripts
Reusable Python scripts for data-processing tasks where the data is too large to fit in the LLM context.
Available Scripts
merge_benchmarks.py
Merges OpenRouter models with ZeroEval benchmark scores.
Usage:
python3 scripts/merge_benchmarks.py
What it does:
- Fetches all models from OpenRouter API (~350 models)
- Fetches benchmark metadata from ZeroEval API (~383 benchmarks)
- Fetches scores for key benchmarks in each category:
- code: SWE-bench, HumanEval, LiveCodeBench, Aider-Polyglot, etc.
- math: AIME 2025/2024, MATH-500, GSM8K, etc.
- reasoning: GPQA, MMLU-Pro, MMLU, ARC, HellaSwag, etc.
- tool_calling: BFCL, Tau-Bench, ACEBench, etc.
- long_context: RULER, LongBench, InfiniteBench, etc.
- general: IFEval, Arena-Hard, MT-Bench, etc.
- Merges models with benchmark data
- Outputs models_with_benchmarks.json (this fetch-and-merge flow is sketched below)
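The flow above is roughly fetch, index, and join. Below is a minimal sketch of that flow, not the script itself: the OpenRouter models endpoint (https://openrouter.ai/api/v1/models) is real, but the ZeroEval URL, its response shape, and the benchmark-to-category mapping are placeholders assumed for illustration.

```python
# Hedged sketch of the fetch-and-merge flow. The ZeroEval endpoint and its
# response shape below are assumptions, not the real API.
import json
import urllib.request

def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# OpenRouter's public model list.
models = fetch_json("https://openrouter.ai/api/v1/models")["data"]

# Placeholder: assume ZeroEval scores look like {model_id: {benchmark: score}}.
scores_by_model = fetch_json("https://example.com/zeroeval/scores")

# Map each benchmark of interest to its category (subset shown).
CATEGORIES = {
    "swe-bench-verified": "code",
    "aime-2025": "math",
    "gpqa": "reasoning",
}

merged = []
for m in models:
    raw = scores_by_model.get(m["id"], {})
    benchmarks = {}
    for bench, score in raw.items():
        cat = CATEGORIES.get(bench)
        if cat:
            benchmarks.setdefault(cat, {})[bench] = score
    merged.append({
        "id": m["id"],
        "name": m.get("name"),
        "context_length": m.get("context_length"),
        "pricing": m.get("pricing"),
        "benchmarks": benchmarks,
    })

with open("models_with_benchmarks.json", "w") as f:
    json.dump({"total_models": len(models), "models": merged}, f, indent=2)
```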
Output files:
- models_with_benchmarks.json - Main output with merged data
- openrouter_models_raw.json - Raw OpenRouter API response
- llm_stats_benchmarks.json - Benchmark metadata from ZeroEval
Output format:
{
"generated_at": "2025-12-17T03:37:04Z",
"total_models": 349,
"models_with_benchmarks": 156,
"categories": ["code", "math", "reasoning", "tool_calling", "long_context", "general"],
"models": [
{
"id": "openai/gpt-5.2",
"name": "GPT-5.2",
"context_length": 400000,
"pricing": {...},
"benchmarks": {
"code": {"swe-bench-verified": 0.731},
"math": {"aime-2025": 0.96},
"reasoning": {"gpqa": 0.924}
},
"category_scores": {
"code": 0.731,
"math": 0.96,
"reasoning": 0.924
}
}
]
}
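Given that format, a consumer can rank models by a category aggregate without pulling the full file into context. A small sketch, assuming the file has been generated as above; the "code" category and the top-10 cutoff are arbitrary choices for illustration.

```python
# Rank models by their "code" category score from the merged output file.
import json

with open("models_with_benchmarks.json") as f:
    data = json.load(f)

ranked = sorted(
    (m for m in data["models"] if "code" in m.get("category_scores", {})),
    key=lambda m: m["category_scores"]["code"],
    reverse=True,
)

for m in ranked[:10]:
    print(f'{m["category_scores"]["code"]:.3f}  {m["id"]}')
```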
Best Practices for Large Data Tasks
When dealing with data too large for the LLM context (>10KB):
- Use scripts: Run Python/bash scripts with run_command
- Write to files: Save intermediate results to files
- Read summaries: Read only summaries or specific sections
- Process in chunks: Break large tasks into smaller pieces (see the sketch after the example below)
Example:
# Run the merge script
python3 scripts/merge_benchmarks.py
# Check summary
python3 -c "import json; d=json.load(open('models_with_benchmarks.json')); print(f'Models: {d[\"total_models\"]}, With benchmarks: {d[\"models_with_benchmarks\"]}')"
# Look up specific model
python3 -c "import json; d=json.load(open('models_with_benchmarks.json')); m=[x for x in d['models'] if 'gpt-5' in x['id'].lower()]; print(json.dumps(m[:3], indent=2))"