The evaluation platform for computer use agents

Evaluate and iterate on your AI agent across hundreds of environments and thousands of tasks.

from hud import load_taskset, run_job, ClaudeAgent

# load an evalset
taskset = load_taskset("GAIA")

# evaluate your agent across every task
job = await run_job(ClaudeAgent, taskset, "test-gaia-job")

# get results OR view them in app.hud.so
print(await job.get_analytics())

We 💛 Researchers.

Partners using our evaluation and RL environment platform: Browser Use, UC Berkeley, MIT, Columbia University, Yale University.

Evaluate anything.

OSWorld: Academic, 369 tasks. Leaderboard: 1. OpenAI CUA 38.1%, 2. Claude 3.7 Sonnet 28%, 3. UI-TARS-72B 24.6%
WebVoyager: Academic, 643 tasks
SheetBench: Professional, 50 tasks
GAIA: Academic, 165 tasks
Mind2Web: Academic, 7,775 tasks
Financial Analysis 1: Professional, 15 tasks
WebArena: Academic, 200 tasks
Pokemon 1: Gaming, 10 tasks
Autonomy-10: Private, 30 tasks
GeoGuessr 1: Gaming, 50 tasks
HR 1: Professional, 15 tasks
Legal Research 1: Professional, 15 tasks

Improve your iteration loop.

Don't wait hours for results. Our platform orchestrates hundreds of concurrent machines, spinning up full OS environments in seconds for rapid evaluation cycles. Iterate faster, identify regressions sooner, and push better agents to production.
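As a rough illustration of a concurrent run with the SDK, here is a minimal sketch that fans several evalsets out in parallel with asyncio. It assumes the same load_taskset / run_job interface as the snippet above; the exact evalset names accepted by load_taskset, and whether the platform prefers its own concurrency controls over plain asyncio.gather, are assumptions.

import asyncio
from hud import load_taskset, run_job, ClaudeAgent

async def main():
    # Run several evalsets side by side instead of back to back.
    names = ["OSWorld", "WebVoyager", "GAIA"]
    tasksets = [load_taskset(name) for name in names]
    jobs = await asyncio.gather(
        *(run_job(ClaudeAgent, ts, f"{name}-nightly") for name, ts in zip(names, tasksets))
    )
    for name, job in zip(names, jobs):
        print(name, await job.get_analytics())

asyncio.run(main())

In this sketch the three benchmarks run concurrently, so the wall-clock time is roughly that of the slowest evalset rather than the sum of all three.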

OSWorld Benchmark Runtime Comparison


Integrate Your Agent's Stack

Evaluate agentic abilities while leveraging existing tools & models.

OpenAI Operator
Claude Computer Use
MCP
RAG

Your agent, your way.

Don't force your agent into a specific mold. The HUD evaluation schema adapts to any architecture: bring your own tools, models (like VLMs or RAG systems), or APIs. Evaluate core agentic abilities across environments while keeping the unique components of your agent stack in the loop.
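To make the "bring your own stack" idea concrete, here is a minimal sketch of a custom agent plugged into the same run_job call used above. The agent contract assumed here (a class with an async predict method that maps an observation to the next action) and the stubbed action format are illustrative assumptions, not the SDK's documented interface; swap in whatever adapter hook your SDK version provides, along with your own VLM, RAG, or API calls.

from hud import load_taskset, run_job

class MyAgent:
    """Hypothetical custom agent. The contract assumed here (an async
    predict method returning the next action) is an illustration, not
    the SDK's documented interface."""

    async def predict(self, observation):
        # Plug in your own VLM call, RAG lookup, or internal API here.
        # This stub simply requests a fresh screenshot so the loop advances.
        return {"type": "screenshot"}

taskset = load_taskset("SheetBench")
job = await run_job(MyAgent, taskset, "my-agent-sheetbench")
print(await job.get_analytics())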


Any evaluation, any environment.

Go beyond standard benchmark sets. Create tasks tailored to your specific agent, product, or workflow across diverse environments. Evaluate performance on desktop software, web browsers, text-based interfaces, or proprietary Dockerfile environments.

Test on any environment

Desktop / OS
Browser / Web
Response / API
Dockerfile / Custom

Build evaluations for niche workflows, proprietary tools, and unique agent loops.
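As a concrete example of the kind of information such a custom evaluation captures, here is a minimal sketch. The CustomTask dataclass and its fields are purely illustrative, not the HUD task schema: the point is that a task pairs an instruction with an environment (here a hypothetical private Docker image) and a check for success.

from dataclasses import dataclass

@dataclass
class CustomTask:
    # Illustrative fields only; not the HUD task schema.
    prompt: str       # the instruction shown to the agent
    environment: str  # where it runs, e.g. a private Docker image
    evaluator: str    # how success is checked once the agent finishes

invoice_export = CustomTask(
    prompt="Open the invoicing app and export last month's invoices to CSV.",
    environment="registry.example.internal/invoicing-app:latest",
    evaluator="csv_export_matches_reference",
)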

Pricing


SDK

Free

Build and run evaluations locally with our open-source SDK.

  • ✓Full SDK for building MCP environments
  • ✓Run evaluations locally with Docker
  • ✓Create custom benchmarks & evaluators
  • ✓Hot-reload environment development
  • ✓Open source community & examples

Cloud

$0.50/environment hour*

Most full evalsets cost ~$1-10 per run. No infrastructure to manage.

*Complex environments like OSWorld may cost more.

  • ✓Run evaluations at scale by deploying environments
  • ✓Access to production benchmarks
  • ✓Live telemetry & debugging dashboard
  • ✓Parallel evaluation runs (100+ concurrent)
  • ✓Submit to public leaderboards & scorecards

Start with $10 in free credits!

Start evaluating

Enterprise

Custom

Volume discounts & tailored solutions for research and enterprise teams.

  • ✓Private benchmarks with custom environments
  • ✓Stress-test new models before production deployment
  • ✓On-premise deployment options available
  • ✓Dedicated engineering support & training

Are you a researcher?

Get $100 in free credits when you sign up with a .edu email address.

Need custom environments and evalsets? Tell us what you're building.

Any questions?

Or email us a quick question at founders@hud.so.
