Can AI models do real work?

Zapier Benchmarks measure execution: did the work get done correctly in realistic systems?

AutomationBench, our lead eval, tests AI agents on end-to-end workflow execution across six domains (Sales, Marketing, Operations, Support, Finance, and HR). It's built on real patterns from 2B+ monthly tasks across 3.7M Zapier customers.

Realistic environments

Each task drops an agent into an isolated environment (CRM records, inbox threads, calendars, and more) with the kind of ambiguity that makes real work hard.
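To make that concrete, here is a minimal sketch (not the actual benchmark harness; the class and field names below are illustrative assumptions) of how one of these isolated environments could be represented so its state can be inspected after the agent runs:

```python
# Illustrative only: hypothetical classes showing how an isolated task
# environment (CRM records, inbox threads, calendar events) might be modeled.
from dataclasses import dataclass, field


@dataclass
class CRMRecord:
    contact: str
    stage: str                  # e.g. "Qualified", "Closed Won"
    owner: str | None = None


@dataclass
class TaskEnvironment:
    crm: dict[str, CRMRecord] = field(default_factory=dict)
    inbox: list[dict] = field(default_factory=list)       # email threads
    calendar: list[dict] = field(default_factory=list)    # events

    def snapshot(self) -> dict:
        """Serialize current state so it can be scored after the agent finishes."""
        return {
            "crm": {name: vars(rec) for name, rec in self.crm.items()},
            "inbox": list(self.inbox),
            "calendar": list(self.calendar),
        }


# A seeded environment the agent would act on through its tools.
env = TaskEnvironment(crm={"acme-corp": CRMRecord(contact="Dana Lee", stage="Qualified")})
```

The point is that the environment is just structured state the agent mutates through its tools, which is what makes checking the final state possible.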

Deterministic scoring

We score the final environment state against fixed success criteria. No LLM-as-judge. No vibes.
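As a sketch of what that kind of check can look like (the function name and criteria format here are assumptions for illustration, not the actual grader):

```python
# Illustrative sketch of deterministic scoring: compare the final environment
# state against fixed, machine-checkable criteria. No model grades anything.
def score_task(final_state: dict, success_criteria: dict) -> bool:
    """Pass only if every expected value is present in the final state."""
    for path, expected in success_criteria.items():
        node = final_state
        for key in path.split("."):        # e.g. "crm.acme-corp.stage"
            if not isinstance(node, dict) or key not in node:
                return False
            node = node[key]
        if node != expected:
            return False
    return True


# Hypothetical example: the task passes only if the CRM record was actually updated.
final_state = {"crm": {"acme-corp": {"stage": "Closed Won", "owner": "jordan"}}}
criteria = {"crm.acme-corp.stage": "Closed Won", "crm.acme-corp.owner": "jordan"}
print(score_task(final_state, criteria))   # True
```

Because the criteria are fixed paths and expected values, the same final state always produces the same score.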

Leaderboard

| Rank | Model | Score | Cost / task |
| --- | --- | --- | --- |
| 1 | GPT-5.5 (XHigh) — OpenAI | 12.9% | $6.31 |
| 2 | GPT-5.5 (High) — OpenAI | 11.3% | $4.06 |
| 3 | Claude Opus 4.7 (Max) — Anthropic | 9.9% | $1.80 |
| 4 | Gemini 3.1 Pro (High) — Google | 9.6% | $0.54 |
| 5 | GPT-5.5 (Medium) — OpenAI | 8.5% | $2.42 |
| 6 | Claude Opus 4.7 (High) — Anthropic | 8.4% | $1.21 |
| 7 | Claude Opus 4.7 (XHigh) — Anthropic | 8.2% | $1.44 |
| 8 | GPT-5.4 (High) — OpenAI | 7.6% | $1.93 |
| 9 | GPT-5.4 (XHigh) — OpenAI | 7.3% | $3.92 |
| 10 | Gemini 3.1 Pro (Low) — Google | 7.2% | $0.30 |

By domain

| Domain | Top model | Score | 2nd place model |
| --- | --- | --- | --- |
| Sales | GPT-5.5 (XHigh) — OpenAI | 17.9% | GPT-5.5 (High) — OpenAI (13.7%) |
| Marketing | GPT-5.5 (XHigh) — OpenAI | 20.0% | Claude Opus 4.7 (Max) — Anthropic (18.0%) |
| Operations | GPT-5.5 (High) — OpenAI | 17.0% | Gemini 3.1 Pro (High) — Google (tie at 14.0% with GPT-5.5 XHigh and Opus 4.7 Max) |
| Support | GPT-5.4 (High) — OpenAI | 10.0% | GPT-5.4 (XHigh) — OpenAI (tie at 10.0%) |
| Finance | Claude Opus 4.7 (Max) — Anthropic | 8.3% | Claude Opus 4.7 (High) — Anthropic (tie at 6.7% with GPT-5.5 Medium / High / XHigh) |
| HR | GPT-5.5 (XHigh) — OpenAI | 10.8% | GPT-5.5 (High) — OpenAI (10.0%) |

Try the latest models in Zapier

Want to test frontier models on real workflows—today? Use Zapier to run agents across the tools you already use. Zapier Benchmarks results will be published as evaluations complete.