AutomationBench, our lead eval, tests AI agents on end-to-end workflow execution across six domains (Sales, Marketing, Operations, Support, Finance, and HR). It's built on real patterns from 2B+ monthly tasks across 3.7M Zapier customers.
Realistic environments
Each task drops an agent into an isolated environment (CRM records, inbox threads, calendars, and more) with the kind of ambiguity that makes real work hard.
Deterministic scoring
We score the final environment state against fixed success criteria. No LLM-as-judge. No vibes.
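To make the idea concrete, here is a minimal sketch of state-based scoring. It is not AutomationBench's actual harness: the `Criterion` type, the `score` and `passed` helpers, the state layout, and the sample CRM and outbox data are all hypothetical, and the all-criteria-must-hold pass rule is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical representation of an environment's final state.
State = dict[str, Any]


@dataclass(frozen=True)
class Criterion:
    """One fixed success check: a name plus a pure predicate over the state."""
    name: str
    check: Callable[[State], bool]


def score(final_state: State, criteria: list[Criterion]) -> dict[str, bool]:
    """Evaluate every criterion against the final state.

    The predicates are plain functions of the state, so the same state
    always produces the same result. No model is consulted.
    """
    return {c.name: c.check(final_state) for c in criteria}


def passed(final_state: State, criteria: list[Criterion]) -> bool:
    """Assumed pass rule for this sketch: every criterion must hold."""
    return all(score(final_state, criteria).values())


# Illustrative criteria for a made-up Sales task.
criteria = [
    Criterion(
        "deal_marked_won",
        lambda s: s["crm"]["deal_42"]["stage"] == "closed_won",
    ),
    Criterion(
        "followup_sent",
        lambda s: any(m["to"] == "buyer@example.com" for m in s["outbox"]),
    ),
]

# Made-up final state left behind by an agent run.
final_state = {
    "crm": {"deal_42": {"stage": "closed_won"}},
    "outbox": [{"to": "buyer@example.com", "subject": "Next steps"}],
}

if __name__ == "__main__":
    print(score(final_state, criteria))
    print(passed(final_state, criteria))
```

Because each check reads only the final state, the agent can take any route to the outcome, and rerunning the scorer on the same state gives the same answer.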