The independent standard for AI evaluation
123 real-world tasks. 14 categories. Every score with error bars. ModelBench evaluates AI models on tool use, safety, and reasoning — not on sanitized academic benchmarks they have already memorized.
AI evaluation is broken. Models score 90% on MMLU in part because the questions have leaked into their training data. ModelBench tests what matters in production: Can they call the right tools? Do they refuse harmful requests? Are they honest when uncertain? We publish every score with 95% confidence intervals, every judge's reasoning, and every methodology detail. No black boxes.
Why ModelBench
- Real tasks, not academic trivia — 123 production agent tasks: tool calling, UI generation, multi-step workflows, safety boundaries.
- Bias-aware judging — a Gemma 4 31B judge with mitigations for 12 known judge biases (CALM framework). Open-weight, Apache 2.0.
- Statistical rigor — standard error of the mean (SEM) and 95% confidence intervals on every score, following Anthropic's evaluation methodology (a worked sketch appears after this list).
- Anti-gaming by design — Contamination detection, temporal drift monitoring, and 3 prompt format variants.
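To make the statistical rigor concrete, here is a minimal sketch of how per-task pass/fail outcomes become a mean score with SEM and a 95% confidence interval. The function name and the simple unweighted aggregation are illustrative assumptions, not ModelBench's actual pipeline.

```python
import math

def score_with_ci(outcomes: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean score, SEM, 95% CI half-width) for per-task pass/fail outcomes.

    Hypothetical helper: the real pipeline may weight categories or use judge
    scores rather than raw pass/fail, but the statistics are the standard ones.
    """
    n = len(outcomes)
    mean = sum(outcomes) / n
    # Bessel-corrected sample variance of the 0/1 outcomes.
    variance = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    sem = math.sqrt(variance / n)   # standard error of the mean
    return mean, sem, z * sem       # report as mean ± 1.96 * SEM

# Example: a model passes 97 of the 123 tasks.
mean, sem, half_width = score_with_ci([True] * 97 + [False] * 26)
print(f"score = {mean:.3f} ± {half_width:.3f} (SEM = {sem:.3f})")
```

With 123 binary outcomes the 95% interval in this example spans roughly ±0.07, which is why a bare point estimate can be misleading when comparing closely ranked models.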
How It Works
- Models face real tasks — Each model receives the same 123 tasks spanning 14 categories.
- Automated checks and an AI judge evaluate every response — Deterministic assertions verify tool calls; a sketch of one such check appears after this list. Gemma 4 31B evaluates response quality with mitigations for 12 judge biases.
- Rankings with confidence — Results include error bars, score distributions, and power analysis.
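Below is a sketch of what a deterministic tool-call assertion might look like. The task schema, field names, and helper are illustrative assumptions, not ModelBench's actual format; the point is that these checks are plain code with unambiguous outcomes, kept separate from the AI judge's qualitative scoring.

```python
def check_tool_call(model_call: dict, expected: dict) -> list[str]:
    """Return failure messages; an empty list means the assertion passed.

    Illustrative schema: model_call = {"name": ..., "arguments": {...}},
    expected = {"name": ..., "required_args": {...}}.
    """
    failures = []
    if model_call.get("name") != expected["name"]:
        failures.append(
            f"expected tool {expected['name']!r}, got {model_call.get('name')!r}"
        )
    args = model_call.get("arguments", {})
    for key, value in expected.get("required_args", {}).items():
        if args.get(key) != value:
            failures.append(f"argument {key!r}: expected {value!r}, got {args.get(key)!r}")
    return failures

# Example: the model picks the right tool but passes the wrong unit.
expected = {"name": "get_weather", "required_args": {"city": "Paris", "units": "celsius"}}
model_call = {"name": "get_weather", "arguments": {"city": "Paris", "units": "fahrenheit"}}
print(check_tool_call(model_call, expected))
# -> ["argument 'units': expected 'celsius', got 'fahrenheit'"]
```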
Methodology grounded in 50+ peer-reviewed sources, including Anthropic's evaluation research, Stanford HELM, the CALM framework for judge bias, and the International AI Safety Report led by Yoshua Bengio.