The evaluation platform for computer use agents

Evaluate and iterate on your AI agent across hundreds of environments and thousands of tasks.

from hud import load_taskset, run_job, ClaudeAgent

# load an evalset
taskset = load_taskset("GAIA")

# evaluate your agent across every task
job = await run_job(ClaudeAgent, taskset, "test-gaia-job")

# get results OR view them in app.hud.so
print(await job.get_analytics())

We 💛 Researchers.

Partners using our evaluation and RL environment platform: Browser Use, UC Berkeley, MIT, Columbia University, Yale University.

Evaluate anything.

OSWorld: Academic, 369 tasks. Leaderboard: 1. OpenAI CUA 38.1%, 2. Claude 3.7 Sonnet 28%, 3. UI-TARS-72B 24.6%
WebVoyager: Academic, 643 tasks
SheetBench: Professional, 50 tasks
GAIA: Academic, 165 tasks
Mind2Web: Academic, 7,775 tasks
Financial Analysis 1: Professional, 15 tasks
WebArena: Academic, 200 tasks
Pokemon 1: Gaming, 10 tasks
Autonomy-10: Private, 30 tasks
GeoGuessr 1: Gaming, 50 tasks
HR 1: Professional, 15 tasks
Legal Research 1: Professional, 15 tasks

Improve your iteration loop.

Don't wait hours for results. Our platform orchestrates hundreds of concurrent machines, spinning up full OS environments in seconds for rapid evaluation cycles. Iterate faster, identify regressions sooner, and push better agents to production.
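As a rough illustration of a concurrent run with the SDK, here is a minimal sketch that fans several evalsets out in parallel with asyncio. It assumes the same load_taskset / run_job interface as the snippet above; the exact evalset names accepted by load_taskset, and whether the platform prefers its own concurrency controls over plain asyncio.gather, are assumptions.

import asyncio
from hud import load_taskset, run_job, ClaudeAgent

async def main():
    # Run several evalsets side by side instead of back to back.
    names = ["OSWorld", "WebVoyager", "GAIA"]
    tasksets = [load_taskset(name) for name in names]
    jobs = await asyncio.gather(
        *(run_job(ClaudeAgent, ts, f"{name}-nightly") for name, ts in zip(names, tasksets))
    )
    for name, job in zip(names, jobs):
        print(name, await job.get_analytics())

asyncio.run(main())

In this sketch the three benchmarks run concurrently, so the wall-clock time is roughly that of the slowest evalset rather than the sum of all three.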

OSWorld Benchmark Runtime Comparison


Integrate Your Agent's Stack

Evaluate agentic abilities while leveraging existing tools & models.

OpenAI Operator
Claude Computer Use
MCP
RAG

Your agent, your way.

Don't force your agent into a specific mold. The HUD evaluation schema adapts to any architecture: bring your own tools, models (like VLMs or RAG systems), or APIs. Evaluate core agentic abilities across environments while keeping the unique components of your agent stack in the loop.
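To make the "bring your own stack" idea concrete, here is a minimal sketch of a custom agent plugged into the same run_job call used above. The agent contract assumed here (a class with an async predict method that maps an observation to the next action) and the stubbed action format are illustrative assumptions, not the SDK's documented interface; swap in whatever adapter hook your SDK version provides, along with your own VLM, RAG, or API calls.

from hud import load_taskset, run_job

class MyAgent:
    """Hypothetical custom agent. The contract assumed here (an async
    predict method returning the next action) is an illustration, not
    the SDK's documented interface."""

    async def predict(self, observation):
        # Plug in your own VLM call, RAG lookup, or internal API here.
        # This stub simply requests a fresh screenshot so the loop advances.
        return {"type": "screenshot"}

taskset = load_taskset("SheetBench")
job = await run_job(MyAgent, taskset, "my-agent-sheetbench")
print(await job.get_analytics())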


Any evaluation, any environment.

Go beyond standard benchmark sets. Create tasks tailored to your specific agent, product, or workflow across diverse environments. Evaluate performance on desktop software, web browsers, text-based interfaces, or proprietary Dockerfile environments.

Test on any environment

Desktop / OS
Browser / Web
Response / API
Dockerfile / Custom

Build evaluations for niche workflows, proprietary tools, and unique agent loops.
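As a concrete example of the kind of information such a custom evaluation captures, here is a minimal sketch. The CustomTask dataclass and its fields are purely illustrative, not the HUD task schema: the point is that a task pairs an instruction with an environment (here a hypothetical private Docker image) and a check for success.

from dataclasses import dataclass

@dataclass
class CustomTask:
    # Illustrative fields only; not the HUD task schema.
    prompt: str       # the instruction shown to the agent
    environment: str  # where it runs, e.g. a private Docker image
    evaluator: str    # how success is checked once the agent finishes

invoice_export = CustomTask(
    prompt="Open the invoicing app and export last month's invoices to CSV.",
    environment="registry.example.internal/invoicing-app:latest",
    evaluator="csv_export_matches_reference",
)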

Pricing


SDK

Free

Build and run evaluations locally with our open-source SDK.

  • ✓Full SDK for building MCP environments
  • ✓Run evaluations locally with Docker
  • ✓Create custom benchmarks & evaluators
  • ✓Hot-reload environment development
  • ✓Open source community & examples

Cloud

$0.50/environment hour*

Most full evalsets cost ~$1-10 per run. No infrastructure to manage.

*Complex environments like OSWorld may cost more.

  • ✓Run evaluations at scale by deploying environments
  • ✓Access to production benchmarks
  • ✓Live telemetry & debugging dashboard
  • ✓Parallel evaluation runs (100+ concurrent)
  • ✓Submit to public leaderboards & scorecards

Start with $10 in free credits!

Start evaluating

Enterprise

Custom

Volume discounts & tailored solutions for research and enterprise teams.

  • ✓Private benchmarks with custom environments
  • ✓Stress-test new models before production deployment
  • ✓On-premise deployment options available
  • ✓Dedicated engineering support & training

Are you a researcher?

Get $100 in free credits when you sign up with a .edu email address.

Need custom environments and evalsets? Tell us what you're building.

Any questions?

Or email us a quick question at founders@hud.so.
