The gold standard for AI agent evaluation

Reduce your eval cycle time from days to hours. Iterate on your agent faster with HUD.

"Leading AI labs use HUD to boost agent success rates by over 5x on complex tasks, in weeks."

import asyncio

from hud import load_taskset, run_job, ClaudeAgent

async def main():
    # load the evalset
    taskset = load_taskset("GAIA")

    # evaluate the agent against every task in the set
    job = await run_job(ClaudeAgent, taskset, "test-gaia-job")

    # get results OR view them in app.hud.so
    print(await job.get_analytics())

asyncio.run(main())

We 💛 Researchers.

Browser Use
UC Berkeley
MIT
Columbia University
Yale University

Evaluate anything.

OSWorld · Academic · 369 tasks
  1. OpenAI CUA: 38.1%
  2. Claude 3.7 Sonnet: 28%
  3. UI-TARS-72B: 24.6%

Financial Analysis 1 · Professional · 15 tasks · financial-analyst

WebArena · Academic · 25 tasks · webarena

Pokemon 1 · Gaming · 10 tasks · game-agent

WebVoyager · Academic · 643 tasks · webvoyager

Autonomy-10 · Private · 30 tasks · autonomy

GeoGuessr 1 · Gaming · 50 tasks · geoguessr

HR 1 · Professional · 15 tasks · hr-analytics

Legal Research 1 · Professional · 15 tasks · legal-researcher

Evaluate instantly, anytime.

Stop waiting hours for results. Our platform orchestrates hundreds of concurrent machines, spinning up full OS environments in seconds for rapid evaluation cycles. Iterate faster, identify regressions sooner, and push better agents to production.
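For instance, several evalsets can be launched side by side from one script. This is a minimal sketch reusing the run_job call from the snippet above; the taskset identifiers follow the cards listed earlier and may differ from the exact strings the SDK expects.

import asyncio

from hud import load_taskset, run_job, ClaudeAgent

async def main():
    # Load two evalsets (identifiers are illustrative).
    osworld = load_taskset("OSWorld")
    webvoyager = load_taskset("WebVoyager")

    # Launch both jobs concurrently; each job fans its tasks out to separate
    # cloud environments, so the runs overlap instead of queuing.
    jobs = await asyncio.gather(
        run_job(ClaudeAgent, osworld, "osworld-nightly"),
        run_job(ClaudeAgent, webvoyager, "webvoyager-nightly"),
    )

    for job in jobs:
        print(await job.get_analytics())

asyncio.run(main())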

OSWorld Benchmark Runtime Comparison


Integrate Your Agent's Stack

Evaluate agentic abilities while leveraging existing tools & models.

OpenAI Operator
Claude Computer Use
MCP
RAG

Agents, your way.

Don't force your agent into a specific mold. The HUD evaluation schema adapts to any architecture: bring your own tools, models (like VLMs or RAG systems), or APIs. Evaluate core agentic abilities across environments while keeping the unique components of your agent stack in place.
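As a rough illustration of the idea, an agent class that wraps your own model client and retrieval step drops into the same run_job flow as ClaudeAgent above. The predict hook and the observation/action shapes below are placeholder assumptions, not the SDK's documented interface; check the hud docs for the actual agent contract.

import asyncio

from hud import load_taskset, run_job

class MyAgent:
    # Bring-your-own-stack agent (illustrative sketch only).
    # The `predict` method and the observation/action formats are assumptions,
    # not the real SDK interface.

    def __init__(self):
        # Plug in components you already have: a VLM client, a RAG index, tool APIs.
        self.model = None
        self.retriever = None

    async def predict(self, observation):
        # Optionally enrich the observation with your own retrieval step,
        # ask your model for the next action, and return it in the format
        # the environment expects.
        raise NotImplementedError("wire your model and tools in here")

async def main():
    # The custom class slots in exactly where ClaudeAgent does.
    taskset = load_taskset("OSWorld")
    job = await run_job(MyAgent, taskset, "my-agent-osworld")
    print(await job.get_analytics())

asyncio.run(main())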


Any evaluation, any environment.

Go beyond standard benchmark sets. Create tasks tailored to your specific agent, product, or workflow across diverse environments. Evaluate performance on desktop software, web browsers, text-based interfaces, or proprietary Dockerfile environments.

Test on any environment

Desktop / OS
Browser / Web
Response / API
Dockerfile / Custom

Build evaluations for niche workflows, proprietary tools, and unique agent loops.
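As a sketch of what such a custom evaluation can look like (the keys below are illustrative assumptions about the task schema, not the SDK's exact API), a browser task might pair a prompt with a setup step and an automated check:

# Illustrative only: these keys sketch the shape of a custom task,
# not the SDK's exact schema.
custom_task = {
    "prompt": "Download last quarter's invoices from the billing portal",
    "environment": "browser",   # could equally be an OS image, an API, or a custom Dockerfile
    "setup": {"goto": "https://example.com/billing"},
    "evaluate": {"file_exists": "invoices_q3.csv"},
}

# A collection of such tasks forms a private evalset that runs through the
# same run_job flow as the stock benchmarks above.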

Pricing


Basic

$1/evaluation*

Most informative evalsets cost ~$10-15 per run (avg. 10 min).

*Plus $0.15/hr per active environment.

  • ✓ Access to all stock evaluations
  • ✓ Full control, telemetry, and evaluation
  • ✓ Access to public leaderboards (Coming soon)

Start with $10 in free credits!

Get started

Enterprise

Custom

Significant discounts available for labs running evals at scale.

  • ✓ Benchmark agents on proprietary datasets & workflows
  • ✓ Stress-test new models before production deployment
  • ✓ Dedicated support for complex evaluation needs

Are you a researcher?

Get $100 in free credits when you sign up with a .edu email address.

Or maybe you have specific needs? Tell us what you're building.

Any questions?

Book a 15-min Call

Talk to a product specialist, scope your eval goals, see how others test agents.

Or email us a quick question at founders@hud.so.
