New benchmark - Coming soon

Can AI models do real work?

Zapier Benchmarks measure whether an AI agent can execute an end-to-end business workflow across real tools—correctly.

PREVIEW

We’re finalizing the public release. The methodology summary below is accurate. The leaderboard is coming soon.

Download Whitepaper (coming soon)

View on GitHub

Outcomes, not outputs

Zapier Benchmarks measure execution: did the work get done correctly in real systems?

Zapier Benchmarks evaluate AI agents on end-to-end workflow execution across six domains (Sales, Marketing, Operations, Support, Finance, and HR) and score whether the result matches fixed success criteria—so you can compare models on real work.

Real environments

Each task drops an agent into an isolated environment (CRM records, inbox threads, calendars, and more) with the kind of ambiguity that makes real work hard.

Deterministic scoring

We score the final environment state against fixed success criteria. No LLM-as-judge. No vibes.
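
To make that concrete, here is a minimal sketch of deterministic state checking under assumed conventions; the task format and field names are illustrative, not Zapier's actual harness.

```python
# Illustrative sketch only: the task format and field names below are assumptions,
# not Zapier's actual harness.
from typing import Any


def score_task(final_state: dict[str, Any], success_criteria: dict[str, Any]) -> bool:
    """Deterministic check: every field named in the success criteria must match
    the final environment state exactly. No LLM-as-judge anywhere in the loop."""
    return all(final_state.get(key) == expected for key, expected in success_criteria.items())


# Example: did the agent actually move the deal in the CRM?
final_state = {"crm.deal.stage": "Closed Won", "crm.deal.amount": 12000}
criteria = {"crm.deal.stage": "Closed Won", "crm.deal.amount": 12000}
print(score_task(final_state, criteria))  # True only if the end state matches
```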

Coming soon

Leaderboard

We’ll publish results as models complete evaluation on the public task set.

What we'll report

The headline score is the percentage of workflow tasks the model completes fully and correctly; a short sketch of how the reported numbers roll up follows the list below.

Pass rate • strict success

By domain • Sales → HR

Cost • $/task (where available)
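
Here is a rough sketch of how those three numbers roll up from per-task results; the records, domains, and costs below are invented for illustration, not real evaluation data.

```python
# Illustrative sketch only: the task records, domains, and costs are made up.
from collections import defaultdict

results = [
    {"domain": "Sales",   "passed": True,  "cost_usd": 0.42},
    {"domain": "Sales",   "passed": False, "cost_usd": 0.57},
    {"domain": "Support", "passed": True,  "cost_usd": 0.31},
]

# Strict pass rate: a task counts only if it was completed fully and correctly.
pass_rate = sum(r["passed"] for r in results) / len(results)

# Per-domain breakdown (Sales through HR) and average cost per task.
by_domain = defaultdict(list)
for r in results:
    by_domain[r["domain"]].append(r["passed"])
domain_rates = {d: sum(v) / len(v) for d, v in by_domain.items()}
cost_per_task = sum(r["cost_usd"] for r in results) / len(results)

print(f"pass rate: {pass_rate:.0%} | by domain: {domain_rates} | $/task: {cost_per_task:.2f}")
```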

Try the latest models in Zapier

Want to test frontier models on real workflows—today? Use Zapier to run agents across the tools you already use. Zapier Benchmarks results will be published as evaluations complete.
