The independent standard for AI evaluation
123 real-world tasks. 14 categories. Every score with error bars. ModelBench evaluates AI models on tool use, safety, and reasoning — not on sanitized academic benchmarks they have already memorized.
AI evaluation is broken. Models score 90% on MMLU in part because the questions have leaked into their training data. ModelBench tests what matters in production: Can they call the right tools? Do they refuse harmful requests? Are they honest when uncertain? We publish every score with 95% confidence intervals, every judge's reasoning, and every methodology detail. No black boxes.
Why ModelBench
- Real tasks, not academic trivia — 123 production agent tasks: tool calling, UI generation, multi-step workflows, safety boundaries.
- Bias-aware judging — a Gemma 4 31B judge with mitigations for 12 known judge biases (CALM framework). Open-weight, Apache 2.0.
- Statistical rigor — standard error of the mean (SEM) and 95% confidence intervals on every score, following Anthropic's evaluation methodology (a worked sketch appears after this list).
- Anti-gaming by design — Contamination detection, temporal drift monitoring, and 3 prompt format variants.
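To make the statistical rigor concrete, here is a minimal sketch of how per-task pass/fail outcomes become a mean score with SEM and a 95% confidence interval. The function name and the simple unweighted aggregation are illustrative assumptions, not ModelBench's actual pipeline.

```python
import math

def score_with_ci(outcomes: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Return (mean score, SEM, 95% CI half-width) for per-task pass/fail outcomes.

    Hypothetical helper: the real pipeline may weight categories or use judge
    scores rather than raw pass/fail, but the statistics are the standard ones.
    """
    n = len(outcomes)
    mean = sum(outcomes) / n
    # Bessel-corrected sample variance of the 0/1 outcomes.
    variance = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    sem = math.sqrt(variance / n)   # standard error of the mean
    return mean, sem, z * sem       # report as mean ± 1.96 * SEM

# Example: a model passes 97 of the 123 tasks.
mean, sem, half_width = score_with_ci([True] * 97 + [False] * 26)
print(f"score = {mean:.3f} ± {half_width:.3f} (SEM = {sem:.3f})")
```

With 123 binary outcomes the 95% interval in this example spans roughly ±0.07, which is why a bare point estimate can be misleading when comparing closely ranked models.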
How It Works
- Models face real tasks — Each model receives the same 123 tasks spanning 14 categories.
- Automated checks and an AI judge evaluate every response — Deterministic assertions verify tool calls; a sketch of one such check appears after this list. Gemma 4 31B evaluates response quality with mitigations for 12 judge biases.
- Rankings with confidence — Results include error bars, score distributions, and power analysis.
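Below is a sketch of what a deterministic tool-call assertion might look like. The task schema, field names, and helper are illustrative assumptions, not ModelBench's actual format; the point is that these checks are plain code with unambiguous outcomes, kept separate from the AI judge's qualitative scoring.

```python
def check_tool_call(model_call: dict, expected: dict) -> list[str]:
    """Return failure messages; an empty list means the assertion passed.

    Illustrative schema: model_call = {"name": ..., "arguments": {...}},
    expected = {"name": ..., "required_args": {...}}.
    """
    failures = []
    if model_call.get("name") != expected["name"]:
        failures.append(
            f"expected tool {expected['name']!r}, got {model_call.get('name')!r}"
        )
    args = model_call.get("arguments", {})
    for key, value in expected.get("required_args", {}).items():
        if args.get(key) != value:
            failures.append(f"argument {key!r}: expected {value!r}, got {args.get(key)!r}")
    return failures

# Example: the model picks the right tool but passes the wrong unit.
expected = {"name": "get_weather", "required_args": {"city": "Paris", "units": "celsius"}}
model_call = {"name": "get_weather", "arguments": {"city": "Paris", "units": "fahrenheit"}}
print(check_tool_call(model_call, expected))
# -> ["argument 'units': expected 'celsius', got 'fahrenheit'"]
```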
Methodology grounded in 50+ peer-reviewed sources, including Anthropic's evaluation research, Stanford HELM, the CALM framework for judge bias, and the International AI Safety Report led by Yoshua Bengio.