The evaluation platform for computer use agents

Evaluate and iterate on your AI agent in hundreds of environments and thousands of tasks.

import asyncio
from hud import load_taskset, run_job, ClaudeAgent

async def main():
    # load a taskset
    taskset = load_taskset("GAIA")

    # evaluate
    job = await run_job(ClaudeAgent, taskset, "test-gaia-job")

    # get results OR view them in app.hud.so
    print(await job.get_analytics())

asyncio.run(main())

We 💛 Researchers.

Browser Use
UC Berkeley
MIT
Columbia University
Yale University

Evaluate anything.

OSWorld (Academic, 369 tasks)
  Leaderboard: 1. OpenAI CUA 38.1%, 2. Claude 3.7 Sonnet 28%, 3. UI-TARS-72B 24.6%

WebVoyager (Academic, 643 tasks)

SheetBench (Professional, 50 tasks)

GAIA (Academic, 165 tasks)

Mind2Web (Academic, 7775 tasks)

Financial Analysis 1 (Professional, 15 tasks)

WebArena (Academic, 200 tasks)

Pokemon 1 (Gaming, 10 tasks)

Autonomy-10 (Private, 30 tasks)

GeoGuessr 1 (Gaming, 50 tasks)

HR 1 (Professional, 15 tasks)

Legal Research 1 (Professional, 15 tasks)

Improve your iteration loop.

Don't wait hours for results. Our platform orchestrates hundreds of concurrent machines, spinning up full OS environments in seconds for rapid evaluation cycles. Iterate faster, identify regressions sooner, and push better agents to production.

Chart: OSWorld benchmark runtime comparison
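Because job runs are asynchronous, several benchmarks can be launched side by side from one script. Below is a minimal sketch built on the same load_taskset / run_job calls shown above; the benchmark names and job labels are placeholders, and exact loading semantics may differ in your SDK version.

import asyncio
from hud import load_taskset, run_job, ClaudeAgent

async def main():
    # Placeholder benchmark names; any tasksets you have access to work here.
    names = ["OSWorld", "WebVoyager", "GAIA"]
    tasksets = [load_taskset(name) for name in names]

    # Launch the runs concurrently; the platform provisions the underlying
    # OS environments in parallel, so wall-clock time stays close to a single run.
    jobs = await asyncio.gather(
        *(run_job(ClaudeAgent, taskset, f"nightly-{name.lower()}")
          for name, taskset in zip(names, tasksets))
    )

    for name, job in zip(names, jobs):
        print(name, await job.get_analytics())

asyncio.run(main())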


Integrate your agent's stack.

Evaluate agentic abilities while leveraging existing tools & models.

OpenAI Operator
Claude Computer Use
MCP
RAG

Your agent, your way.

Don't force your agent into a specific mold. The HUD evaluation schema adapts to any architecture: bring your own tools, models (VLMs, RAG systems), or APIs. Focus on evaluating core agentic abilities across environments while integrating the unique components of your agent stack.
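As a concrete illustration, the adapter below wraps an existing model stack behind a single prediction method. The class, method name, and attributes are hypothetical stand-ins for however you wire your components into the evaluation loop, not the documented HUD agent interface.

from dataclasses import dataclass

# Hypothetical adapter: wraps whatever already powers your agent (a VLM,
# a RAG pipeline, tool-calling APIs) behind one method an evaluation loop
# can call. Names and signatures are illustrative, not the HUD API.
@dataclass
class MyAgent:
    vlm_client: object        # your own vision-language model client
    retriever: object = None  # optional RAG component

    async def predict(self, observation):
        # Optionally ground the screenshot/DOM observation with retrieved context.
        context = self.retriever.search(observation) if self.retriever else None
        # Ask your own model for the next action (click, type, scroll, ...).
        return await self.vlm_client.next_action(observation, context=context)

The idea is that an adapter like this slots in wherever ClaudeAgent appears in the snippet above, leaving the evaluation loop itself unchanged.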


Any evaluation, any environment.

Go beyond standard benchmark sets. Create tasks tailored to your specific agent, product, or workflow across diverse environments. Evaluate performance on desktop software, web browsers, text-based interfaces, or proprietary Dockerfile environments.

Test on any environment

Desktop / OS
Browser / Web
Response / API
Dockerfile / Custom

Build evaluations for niche workflows, proprietary tools, and unique agent loops.
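For a sense of what a tailored task can look like, here is a hypothetical custom task expressed as plain data. The field names are illustrative rather than a fixed schema, but the shape is the same: an environment spec (browser, desktop, API, or a custom Dockerfile image) paired with a prompt and a programmatic check.

# Hypothetical custom evaluation task, expressed as plain data.
custom_task = {
    "id": "crm-export-report",
    "environment": {
        "type": "dockerfile",                    # or "browser", "desktop", "api"
        "image": "internal/crm-sandbox:latest",  # placeholder image name
    },
    "prompt": "Export last quarter's sales report as a CSV named q3.csv.",
    "evaluate": {
        # Passes if the expected artifact exists when the agent finishes.
        "check": "file_exists",
        "path": "/workspace/q3.csv",
    },
}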

Pricing


Basic

$1/evaluation*

Most informative evalsets cost ~$10-15 per run (avg. 10 min).

*Plus $0.15/hr per active environment.

  • ✓ Access to all stock evaluations
  • ✓ Full control, telemetry, and evaluation
  • ✓ Access to public leaderboards (coming soon)

Start with $10 in free credits!

Get started

Enterprise

Custom

Significant discounts available for labs running evals at scale.

  • ✓ Benchmark agents on proprietary datasets & workflows
  • ✓ Stress-test new models before production deployment
  • ✓ Dedicated support for complex evaluation needs

Are you a researcher?

Get $100 in free credits when you sign up with a .edu email address.

Need custom environments and evalsets? Tell us what you're building.

Any questions?

Or email us a quick question at founders@hud.so.

HUD