AutomationBench, our lead eval, tests AI agents on end-to-end workflow execution across six domains (Sales, Marketing, Operations, Support, Finance, and HR). It's built on real patterns from 2B+ monthly tasks across 3.7M Zapier customers.
Realistic environments
Each task drops an agent into an isolated environment (CRM records, inbox threads, calendars, and more) with the kind of ambiguity that makes real work hard.
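As a rough sketch (not AutomationBench's actual harness; every name and field below is hypothetical), an isolated task environment can be modeled as a self-contained snapshot of structured business state that the agent mutates only through tool calls:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an isolated task environment: a self-contained
# snapshot of business state (CRM, inbox, calendar) that the agent mutates
# only through tool calls. Field names are illustrative, not AutomationBench's
# actual schema.
@dataclass
class TaskEnvironment:
    crm_records: dict = field(default_factory=dict)       # record id -> fields
    inbox_threads: list = field(default_factory=list)     # email threads
    calendar_events: list = field(default_factory=list)   # scheduled events

    def snapshot(self) -> dict:
        """Serialize the final state so it can be scored after the run."""
        return {
            "crm_records": self.crm_records,
            "inbox_threads": self.inbox_threads,
            "calendar_events": self.calendar_events,
        }
```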
Deterministic scoring
We score the final environment state against fixed success criteria. No LLM-as-judge. No vibes.
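Here is a minimal sketch of what deterministic, state-based grading can look like, assuming success criteria are fixed expected values checked against the final environment snapshot; the field paths and helper below are illustrative, not the benchmark's real checker:

```python
# Hypothetical deterministic scorer: compare the agent's final environment
# snapshot against fixed expected values. No model judges the transcript; the
# task passes only if every criterion matches the final state exactly.
def score_task(final_state: dict, success_criteria: dict) -> bool:
    for path, expected in success_criteria.items():
        node = final_state
        for key in path.split("."):            # e.g. "crm_records.acct_42.stage"
            if not isinstance(node, dict) or key not in node:
                return False
            node = node[key]
        if node != expected:
            return False
    return True


# Example criteria for a hypothetical sales task: the deal stage must be
# updated and the follow-up must be logged on the right record.
criteria = {
    "crm_records.acct_42.stage": "Closed Won",
    "crm_records.acct_42.followup_logged": True,
}
```

Because the pass condition is exact state matching, repeated grading of the same final snapshot always yields the same result, which is what keeps the scores reproducible.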
| Rank | Model | Score | Cost / task |
| --- | --- | --- | --- |
| 1 | GPT-5.5 (XHigh) — OpenAI | 12.9% | $6.31 |
| 2 | GPT-5.5 (High) — OpenAI | 11.3% | $4.06 |
| 3 | Claude Opus 4.7 (Max) — Anthropic | 9.9% | $1.80 |
| 4 | Gemini 3.1 Pro (High) — Google | 9.6% | $0.54 |
| 5 | GPT-5.5 (Medium) — OpenAI | 8.5% | $2.42 |
| 6 | Claude Opus 4.7 (High) — Anthropic | 8.4% | $1.21 |
| 7 | Claude Opus 4.7 (XHigh) — Anthropic | 8.2% | $1.44 |
| 8 | GPT-5.4 (High) — OpenAI | 7.6% | $1.93 |
| 9 | GPT-5.4 (XHigh) — OpenAI | 7.3% | $3.92 |
| 10 | Gemini 3.1 Pro (Low) — Google | 7.2% | $0.30 |
| Domain | Top model | Score | 2nd place model |
| --- | --- | --- | --- |
| Sales | GPT-5.5 (XHigh) — OpenAI | 17.9% | GPT-5.5 (High) — OpenAI (13.7%) |
| Marketing | GPT-5.5 (XHigh) — OpenAI | 20.0% | Claude Opus 4.7 (Max) — Anthropic (18.0%) |
| Operations | GPT-5.5 (High) — OpenAI | 17.0% | Gemini 3.1 Pro (High) — Google (tie at 14.0% with GPT-5.5 XHigh and Opus 4.7 Max) |
| Support | GPT-5.4 (High) — OpenAI | 10.0% | GPT-5.4 (XHigh) — OpenAI (tie at 10.0%) |
| Finance | Claude Opus 4.7 (Max) — Anthropic | 8.3% | Claude Opus 4.7 (High) — Anthropic (tie at 6.7% with GPT-5.5 Medium / High / XHigh) |
| HR | GPT-5.5 (XHigh) — OpenAI | 10.8% | GPT-5.5 (High) — OpenAI (10.0%) |