AutomationBench AI benchmark leaderboard

Can AI models do real work? Zapier's AI benchmark assessment measures execution in realistic systems.

AutomationBench tests AI agents on end-to-end workflow execution using 47 real tools across six business functions: Sales, Marketing, Operations, Support, Finance, and HR. It's built on real patterns from 2B+ monthly tasks across 3.7M companies.

Realistic environments

Each task drops an agent into an isolated environment (CRM records, inbox threads, calendars, and more) with the kind of ambiguity that makes real work hard.

Deterministic scoring

We score the final environment state against fixed success criteria. No LLM-as-judge. No vibes.
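To make state-based scoring concrete, here is a minimal sketch in Python. It is not Zapier's implementation: the state layout, the criterion names, and the all-criteria-must-pass rule are assumptions chosen for illustration. It shows the general idea of checking a final environment state against fixed predicates, with no model in the grading loop.

```python
from typing import Any, Callable

# Hypothetical shapes (assumptions, not the benchmark's real schema):
# an environment state is a nested dict, and each success criterion
# is a named predicate over that state.
State = dict[str, Any]
Criterion = Callable[[State], bool]


def score_task(final_state: State, criteria: dict[str, Criterion]) -> bool:
    """Pass only if every fixed criterion holds on the final state.

    All-or-nothing grading is an assumption made for this sketch.
    """
    return all(check(final_state) for check in criteria.values())


def benchmark_score(results: list[bool]) -> float:
    """Return the percentage of tasks passed (0.0 for an empty run)."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Illustrative task: close a deal in the CRM and send a follow-up email.
criteria: dict[str, Criterion] = {
    "deal_closed": lambda s: s["crm"]["deal_42"]["stage"] == "closed_won",
    "followup_sent": lambda s: any(
        m["to"] == "ana@example.com" for m in s["inbox"]["sent"]
    ),
}

good_state: State = {
    "crm": {"deal_42": {"stage": "closed_won"}},
    "inbox": {"sent": [{"to": "ana@example.com", "subject": "Next steps"}]},
}
bad_state: State = {
    "crm": {"deal_42": {"stage": "closed_won"}},
    "inbox": {"sent": []},  # the agent never sent the follow-up
}

print(score_task(good_state, criteria))
print(score_task(bad_state, criteria))
```

Because the checks are plain predicates over state, the same final state always yields the same verdict, which is the property that deterministic scoring relies on.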

Leaderboard

| Rank | Model | Score | Cost / task |
|------|-------|-------|-------------|
| 1 | Claude Opus 4.8 (XHigh) | 15.5% | $2.36 |
| 2 | Claude Opus 4.8 (Max) | 15.4% | $3.14 |
| 3 | Gemini 3.5 Flash (Medium) | 14.5% | $0.87 |
| 4 | GPT-5.5 (XHigh) | 12.9% | $6.31 |
| 5 | Gemini 3.5 Flash (High) | 12.6% | $1.30 |
| 6 | Gemini 3.5 Flash (Low) | 12.2% | $0.65 |
| 7 | Claude Opus 4.8 (High) | 11.6% | $1.87 |
| 8 | GPT-5.5 (High) | 11.3% | $4.06 |
| 9 | Claude Opus 4.8 (Medium) | 11.1% | $1.69 |
| 10 | Claude Opus 4.8 (Low) | 9.9% | $1.13 |
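The leaderboard pairs each score with a cost per task, so one way to read it is by cost efficiency. The sketch below is an illustration only: "score points per dollar" is a metric assumed here, not one the benchmark reports, and the figures are copied from the leaderboard above.

```python
# (model, score in %, cost per task in USD), copied from the leaderboard.
LEADERBOARD = [
    ("Claude Opus 4.8 (XHigh)", 15.5, 2.36),
    ("Claude Opus 4.8 (Max)", 15.4, 3.14),
    ("Gemini 3.5 Flash (Medium)", 14.5, 0.87),
    ("GPT-5.5 (XHigh)", 12.9, 6.31),
    ("Gemini 3.5 Flash (High)", 12.6, 1.30),
    ("Gemini 3.5 Flash (Low)", 12.2, 0.65),
    ("Claude Opus 4.8 (High)", 11.6, 1.87),
    ("GPT-5.5 (High)", 11.3, 4.06),
    ("Claude Opus 4.8 (Medium)", 11.1, 1.69),
    ("Claude Opus 4.8 (Low)", 9.9, 1.13),
]


def points_per_dollar(score_pct: float, cost_usd: float) -> float:
    """Score percentage points per dollar of per-task cost (assumed metric)."""
    return score_pct / cost_usd


# Sort from most to least cost-efficient under the assumed metric.
ranked = sorted(
    LEADERBOARD,
    key=lambda row: points_per_dollar(row[1], row[2]),
    reverse=True,
)

for model, score, cost in ranked:
    print(f"{model:28s} {points_per_dollar(score, cost):5.1f} points per dollar")
```

Under this assumed metric the ordering differs from the score ranking: the cheaper Gemini 3.5 Flash configurations move to the top, while GPT-5.5 (XHigh) falls to the bottom despite ranking fourth on score.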

By domain

| Domain | Top model | Score | 2nd-place model |
|--------|-----------|-------|-----------------|
| Sales | GPT-5.5 (XHigh), OpenAI | 17.9% | Claude Opus 4.8 (Max), Anthropic (17.1%) |
| Marketing | GPT-5.5 (XHigh), OpenAI | 20.0% | Claude Opus 4.7 (Max), Anthropic (18.0%) |
| Operations | Gemini 3.5 Flash (Medium), Google | 20.0% | Claude Opus 4.8 (Max), Anthropic (19.0%) |
| Support | Claude Opus 4.8 (XHigh), Anthropic | 15.0% | Claude Opus 4.8 (Max/High), Anthropic (tied at 13.0%) |
| Finance | Claude Opus 4.8 (Max), Anthropic | 12.5% | Gemini 3.5 Flash (High), Google; Claude Opus 4.8 (XHigh), Anthropic (tied at 11.7%) |
| HR | Claude Opus 4.8 (XHigh), Anthropic | 20.0% | Gemini 3.5 Flash (Medium), Google (19.2%) |

Try the latest models in Zapier

Want to test frontier models on real workflows—today? Use Zapier to run agents across the tools you already use. Zapier Benchmarks results will be published as evaluations complete.
