AutomationBench AI benchmark leaderboard


Can AI models do real work? Zapier's AI benchmark assessment measures execution in realistic systems.

AutomationBench tests AI agents on end-to-end workflow execution using 47 real tools across six business functions: Sales, Marketing, Operations, Support, Finance, and HR. It's built on real patterns from 2B+ monthly tasks across 3.7M companies.

Realistic environments

Each task drops an agent into an isolated environment (CRM records, inbox threads, calendars, and more) with the kind of ambiguity that makes real work hard.

Deterministic scoring

We score the final environment state against fixed success criteria. No LLM-as-judge. No vibes.

Leaderboard

Rank  Model                      Score   Cost / task
1     Fable 5.0 (Max)            17.4%   $3.67
2     Fable 5.0 (XHigh)          16.0%   $3.03
3     Claude Opus 4.8 (XHigh)    15.5%   $2.36
4     Claude Opus 4.8 (Max)      15.4%   $3.14
5     Gemini 3.5 Flash (Medium)  14.5%   $0.87
6     Fable 5.0 (Medium)         12.9%   $2.59
7     GPT-5.5 (XHigh)            12.9%   $6.31
8     Gemini 3.5 Flash (High)    12.6%   $1.30
9     Fable 5.0 (High)           12.6%   $2.70
10    Gemini 3.5 Flash (Low)     12.2%   $0.65

By domain

Domain      Top model                           Score   2nd place model
Sales       GPT-5.5 (XHigh), OpenAI             17.9%   Claude Opus 4.8 (Max), Anthropic (17.1%)
Marketing   GPT-5.5 (XHigh), OpenAI             20.0%   Fable 5.0 (Max), Anthropic / Claude Opus 4.7 (Max), Anthropic (tie at 18.0%)
Operations  Fable 5.0 (Max), Anthropic          27.0%   Fable 5.0 (XHigh), Anthropic (23.0%)
Support     Claude Opus 4.8 (XHigh), Anthropic  15.0%   Fable 5.0 (XHigh), Anthropic (14.0%)
Finance     Fable 5.0 (Max), Anthropic          12.5%   Claude Opus 4.8 (Max), Anthropic (tie at 12.5%)
HR          Fable 5.0 (Max), Anthropic          20.0%   Claude Opus 4.8 (XHigh), Anthropic (tie at 20.0%)

How AutomationBench works

Each task boots a tiny simulated company, hands the agent a request a real ops person would get, and grades the world it leaves behind. The agent's reply isn't scored — the data is.

600+   Held-out evaluation tasks
6      Business domains
47     Simulated apps
~500   API endpoints

01 · Boot a tiny company

The task seeds a fresh simulated business: CRM records, inbox threads, spreadsheets, support cases. Every model starts from this exact same world — including its traps: stale rows, near-duplicate names, policies buried in an inbox.

02 · The agent works alone

One trigger message kicks things off. The agent gets tools to search, read, update, and send — no clarifying questions, no human in the loop. It has to find what it needs the way a new hire would: by looking.

03 · Grade the end state

When the agent stops, deterministic assertions inspect the world it left behind. Was the record updated? Did the right team get the email — and the wrong teams not? The outcome happened or it didn't.

One task, end to end · sales.multi_hop_lookup

"We just closed the Meridian Corp Platform Deal! Mark it as won and route the win notice to the right team per our routing policy. Confirm the account tier from the Account Hierarchy spreadsheet, convert currencies if needed, and check for any open support escalations."
CRM · three Meridians
Meridian Corp — Platform Deal          €120,000
Meridian Solutions — Platform Deal     $150,000
Meridian Corporation — Platform Deal   $95,000

Near-duplicate names. Update the wrong record and the task fails.

Sheets · stale rows
EUR → USD   1.10         Jan 10
EUR → USD   1.30         Jan 25
Tier        Mid-Market   Dec 15
Tier        Enterprise   Jan 12

Both sheets contain outdated entries. The newest row wins: €120,000 × 1.30 = $156,000, tier = Enterprise.
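The "newest row wins" rule and the conversion can be sketched in a few lines. This is only an illustration: the row dictionaries, the `newest` helper, and the calendar years are assumptions (the sheets show month and day only), and `Decimal` is used to keep the currency math exact.

```python
from datetime import date
from decimal import Decimal

# Hypothetical in-memory copies of the two sheets. Years are assumed;
# the December row is placed in the year before the January rows.
fx_rows = [
    {"pair": "EUR/USD", "rate": Decimal("1.10"), "as_of": date(2025, 1, 10)},
    {"pair": "EUR/USD", "rate": Decimal("1.30"), "as_of": date(2025, 1, 25)},
]
tier_rows = [
    {"tier": "Mid-Market", "as_of": date(2024, 12, 15)},
    {"tier": "Enterprise", "as_of": date(2025, 1, 12)},
]

def newest(rows):
    """Return the row with the latest as_of date."""
    return max(rows, key=lambda row: row["as_of"])

rate = newest(fx_rows)["rate"]            # picks the Jan 25 rate
tier = newest(tier_rows)["tier"]          # picks the Jan 12 tier
amount_usd = Decimal("120000") * rate     # EUR 120,000 converted to USD

print(tier, f"${amount_usd:,.0f}")
```

An agent that reads the first matching row instead of the newest one would get $132,000 and Mid-Market, which routes the notice to the wrong team.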

Inbox · the routing policy

"Enterprise → executive-team@ · Mid-Market → vp-sales@ · SMB → smb-team@ … if the account has any open Critical/High escalations, also notify support-escalation@."

The rules aren't in the prompt — the agent has to find this email. And the open Critical case sits on the parent account, one hop up the hierarchy.
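As a rough sketch, the quoted part of the policy plus the one-hop hierarchy check might look like the following. The function name, the account identifiers, and the `parents` / `open_cases` structures are hypothetical, and the elided portion of the policy email ("…") is deliberately not modeled. Checking every ancestor rather than only the direct parent is also an assumption.

```python
# Routes taken from the quoted policy; addresses are kept as shown in the source.
ROUTES = {
    "Enterprise": "executive-team@",
    "Mid-Market": "vp-sales@",
    "SMB": "smb-team@",
}
ESCALATION = "support-escalation@"

def win_notice_recipients(tier, account_id, parents, open_cases):
    """Return recipients for a win notice.

    parents maps an account id to its parent id (or None at the root).
    open_cases maps an account id to a list of open case severities.
    """
    recipients = [ROUTES[tier]]
    node, seen = account_id, set()
    # Walk up the hierarchy so a case on a parent account is not missed.
    while node is not None and node not in seen:
        seen.add(node)
        if any(sev in {"Critical", "High"} for sev in open_cases.get(node, [])):
            recipients.append(ESCALATION)
            break
        node = parents.get(node)
    return recipients

# Hypothetical data: the Critical case sits on the parent, one hop up.
parents = {"meridian-corp": "meridian-group", "meridian-group": None}
open_cases = {"meridian-group": ["Critical"]}
print(win_notice_recipients("Enterprise", "meridian-corp", parents, open_cases))
```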

The grade · six assertions on the final state

  • Opportunity Meridian Corp — Platform Deal stage is Closed Won (the right record, not a lookalike)
  • Email sent to executive-team@ naming the deal, $156,000, and Enterprise (correct tier from the newest row, correct FX math)
  • Email sent to support-escalation@ with the same details (found the parent account's open Critical case)
  • No win notice sent to vp-sales@ (the stale tier would have routed here)
  • No win notice sent to smb-team@ (a lookalike account's tier)
  • No win notice sent to sales-team@ (the lazy default for "everything else")
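The six assertions above could be expressed as plain boolean checks over the final state. The sketch below assumes a hypothetical state layout (an `opportunities` dict keyed by a made-up record id and a `sent_emails` list) and uses simple substring matching; it is not the benchmark's actual grader.

```python
def sent_to(state, address):
    """All emails in the final state addressed to the given recipient."""
    return [m for m in state["sent_emails"] if m["to"] == address]

def mentions_all(message, needles):
    """True if the email body contains every required detail."""
    return all(n in message["body"] for n in needles)

def grade(state):
    """Return one boolean per assertion, in the order listed above."""
    details = ("Meridian Corp", "$156,000", "Enterprise")
    stage = state["opportunities"].get("meridian_corp", {}).get("stage")
    return [
        stage == "Closed Won",
        any(mentions_all(m, details) for m in sent_to(state, "executive-team@")),
        any(mentions_all(m, details) for m in sent_to(state, "support-escalation@")),
        not sent_to(state, "vp-sales@"),    # negative assertion
        not sent_to(state, "smb-team@"),    # negative assertion
        not sent_to(state, "sales-team@"),  # negative assertion
    ]

# Hypothetical final state left behind by a correct run.
body = "Closed Won: Meridian Corp Platform Deal, $156,000, Enterprise tier"
final_state = {
    "opportunities": {"meridian_corp": {"stage": "Closed Won"}},
    "sent_emails": [
        {"to": "executive-team@", "body": body},
        {"to": "support-escalation@", "body": body},
    ],
}
print(grade(final_state))
```

Because three of the checks are negative, an agent that emails every team satisfies the first three assertions and fails the last three.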

Official metric

task_completed_correctly

Strict pass/fail. Every scored assertion must pass. Leaderboard scores run against a held-out private evaluation set; the public 600-task set is for research and experimentation. Run-to-run variance is typically within 1%.

Diagnostic only

partial_credit

Fraction of assertions passed. Useful for debugging and training signal — and as a dense reward when the benchmark is used as an RL environment. Not part of the headline score.
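The relationship between the two metrics can be shown with a short sketch. The function names mirror the metric names, but the implementations and the handling of an empty assertion list are assumptions.

```python
def task_completed_correctly(results):
    """Strict pass/fail: every scored assertion must pass."""
    return len(results) > 0 and all(results)

def partial_credit(results):
    """Diagnostic only: fraction of assertions that passed."""
    return sum(results) / len(results) if results else 0.0

# Hypothetical run: the agent updated the record and emailed the right
# teams, but also sent the notice to one team that should not get it.
results = [True, True, True, False, True, True]
print(task_completed_correctly(results), round(partial_credit(results), 3))
```

A run like this earns most of the partial credit yet contributes nothing to the headline score.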

Frequently asked

Why are the scores so low?

The benchmark measures full workflow completion across multiple apps, not isolated reasoning. The most striking failure mode: models declare success while actually failing. In one analysis, 72% of Opus's failures, 91% of Gemini's, and 84% of GPT-5.4's involved this false confidence — the agent reports the task done, but the world state is wrong.

Other common failure modes: giving up when the first searches don't find the data, assuming data lives where it intuitively "should" (CRM instead of Sheets), processing some items from a list and then summarizing as if all were done, and paraphrasing instructions instead of following them exactly.

A task is only complete when business state is correct end-to-end. Getting most of the way there still fails strict scoring.

How does the agent actually interact with the apps?

Two tools: search runs BM25 keyword search over API schemas and returns the top 5 candidates; execute mimics a curl/fetch with method, URL, and body. Discovering the right endpoints is part of the challenge.
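To give a feel for what keyword search over schemas involves, here is a minimal Okapi BM25 ranker in plain Python. It is a sketch under assumptions: the endpoint names and descriptions are invented, the tokenizer is naive, and `k1=1.5`, `b=0.75` are conventional defaults rather than values taken from the benchmark.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_search(query, docs, top_k=5, k1=1.5, b=0.75):
    """Rank docs (id -> text) against query with Okapi BM25; return up to top_k ids."""
    if not docs:
        return []
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n or 1.0
    # Document frequency: number of docs containing each term.
    df = Counter(term for toks in tokens.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
    return [d for d in ranked if scores[d] > 0][:top_k]

# Hypothetical schema descriptions, keyed by endpoint.
schemas = {
    "PATCH /crm/opportunities/{id}": "update opportunity stage amount close date",
    "GET /crm/opportunities": "list search opportunities by account name",
    "POST /mail/messages": "send email message to recipients",
    "GET /sheets/{id}/rows": "read spreadsheet rows",
    "GET /support/cases": "list support cases by account severity status",
}
print(bm25_search("send email", schemas))
```

Keyword matching is literal: in this toy index a query for "opportunity" does not match the endpoint described with "opportunities", which hints at why endpoint discovery is part of the challenge.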

Leaderboard scores run in API mode. Behind the simulated apps, Pydantic models are the source of truth — schemas, pagination, required fields, and 4xx error cases behave like the real APIs, but state lives locally so runs are reproducible. Agents get up to 50 steps per task (rarely hit). AutomationBench also supports Zapier-tool mode and a Limited Zapier toolset for experiments on how tool surface shape affects performance.
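The idea of an API that behaves realistically while keeping state local can be sketched as below. This uses a standard-library dataclass as a stand-in for the Pydantic models mentioned above; the endpoint path, the allowed stage names, and the specific status codes are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumed set of valid stages, for illustration only.
VALID_STAGES = {"Prospecting", "Negotiation", "Closed Won", "Closed Lost"}

@dataclass
class SimulatedCRM:
    """In-memory stand-in for a simulated app; state never leaves the process."""
    opportunities: dict = field(default_factory=dict)

    def execute(self, method, url, body=None):
        """Mimic an HTTP call against local state; return (status, payload)."""
        parts = url.strip("/").split("/")
        if len(parts) != 2 or parts[0] != "opportunities":
            return 404, {"error": "unknown endpoint"}
        record = self.opportunities.get(parts[1])
        if record is None:
            return 404, {"error": "opportunity not found"}
        if method == "GET":
            return 200, dict(record)
        if method == "PATCH":
            stage = (body or {}).get("stage")
            if stage not in VALID_STAGES:
                # Validation failure surfaces as a 4xx, as a real API would.
                return 422, {"error": "invalid or missing 'stage'"}
            record["stage"] = stage
            return 200, dict(record)
        return 405, {"error": "method not allowed"}

crm = SimulatedCRM({"opp_1": {"stage": "Negotiation"}})
print(crm.execute("PATCH", "/opportunities/opp_1", {"stage": "Won"}))
print(crm.execute("PATCH", "/opportunities/opp_1", {"stage": "Closed Won"}))
```

Since every run starts from the same seeded dictionary, two agents issuing the same calls see the same responses, which is what makes the final-state grading reproducible.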

How is this different from other agent benchmarks?

Many AI benchmarks test coding, browsing, QA, or general reasoning. AutomationBench focuses on whether agents can complete cross-app business work reliably across sales, marketing, ops, support, finance, and HR.

Key design choices: deterministic final-state assertions instead of LLM-as-judge, multi-app workflows instead of single-turn Q&A, simulated tools and APIs instead of screenshot-based browsing, and both positive and negative assertions — the negatives exist to prevent shotgun reward hacking, like emailing everyone in the company instead of the specified recipients.

Try the latest models in Zapier

Want to test frontier models on real workflows—today? Use Zapier to run agents across the tools you already use. Zapier Benchmarks results will be published as evaluations complete.
