The independent standard for AI evaluation

123 real-world tasks. 14 categories. Every score with error bars. ModelBench evaluates AI models on tool use, safety, and reasoning — not sanitized academic benchmarks that models have memorized.

AI evaluation is broken. Models score 90% on MMLU because they have seen the questions. ModelBench tests what matters in production: Can they call the right tools? Do they refuse harmful requests? Are they honest when uncertain? We publish every score with 95% confidence intervals, every judge's reasoning, and every methodology detail. No black boxes.
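To make "every score with 95% confidence intervals" concrete, here is one standard way to compute such an interval over per-task pass/fail results: a Wilson score interval. This is an illustrative sketch, not ModelBench's actual code; the function name and the example pass counts are hypothetical.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate over n binary task results.

    Unlike the naive normal approximation, the Wilson interval stays inside
    [0, 1] and behaves sensibly for small n or extreme pass rates.
    """
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

# Hypothetical example: a model passes 100 of the 123 tasks.
lo, hi = wilson_interval(100, 123)
print(f"pass rate {100/123:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

With 123 tasks the interval spans roughly 14 percentage points, which is why two models a few points apart may not be meaningfully distinguishable without this context.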

Why ModelBench

How It Works

  1. Models face real tasks — Each model receives the same 123 tasks spanning 14 categories.
  2. Automated checks + AI judge evaluate — Deterministic assertions verify tool calls, and Gemma 4 31B scores response quality with mitigations for 12 known judge biases.
  3. Rankings with confidence — Results include error bars, score distributions, and power analysis.
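The "deterministic assertions" in step 2 can be sketched as a plain predicate over the model's recorded tool calls. This is a minimal illustration of the idea, assuming tool calls are parsed into name/arguments records; the function, field names, and example transcript are hypothetical, not ModelBench's actual schema.

```python
def check_tool_call(calls: list[dict], expected_name: str, required_args: dict) -> bool:
    """Pass if any recorded call uses the expected tool with the required arguments.

    Deterministic: same transcript in, same verdict out — no LLM judgment involved.
    """
    for call in calls:
        if call.get("name") != expected_name:
            continue
        args = call.get("arguments", {})
        if all(args.get(k) == v for k, v in required_args.items()):
            return True
    return False

# Hypothetical transcript: the model made one tool call during the task.
transcript = [
    {"name": "search_flights", "arguments": {"origin": "SFO", "dest": "JFK"}},
]
ok = check_tool_call(transcript, "search_flights", {"origin": "SFO"})
print(ok)  # True
```

Checks like this settle the objective part of a task (did the right tool get called, with the right arguments?) before any AI judge weighs in on subjective quality.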

The methodology is grounded in 50+ peer-reviewed sources, including Anthropic's evaluation research, Stanford HELM, the CALM framework for judge-bias analysis, and the International AI Safety Report led by Yoshua Bengio.