The evaluation and RL platform for AI agents

Evaluate and iterate on your AI agent across hundreds of environments and thousands of tasks.

uv tool install hud-python
# Evaluate any dataset
hud eval hud-evals/OSWorld-Verified
# Train your own model
hud rl hud-evals/basic-2048

We 💛 Researchers.

Partners using our evaluation and RL environment platform: Browser Use, TryCua, UC Berkeley, MIT, Columbia University, Yale University.

Evaluate and train agents.

OSWorld (Academic, 369 tasks)
  1. OpenAI CUA: 38.1%
  2. Claude 3.7 Sonnet: 28%
  3. UI-TARS-72B: 24.6%

SheetBench-50 (Professional, 50 tasks)
WebVoyager (Academic, 643 tasks)
GAIA (Academic, 165 tasks)
Mind2Web (Academic, 7775 tasks)
Financial Analysis 1 (Professional, 15 tasks)
WebArena (Academic, 200 tasks)
Pokemon 1 (Gaming, 10 tasks)
Autonomy-10 (Private, 30 tasks)
GeoGuessr 1 (Gaming, 50 tasks)
HR 1 (Professional, 15 tasks)
Legal Research 1 (Professional, 15 tasks)

Don't wait hours for results.

Our infrastructure handles thousands of concurrent environments. Watch your agents think and train in real time on app.hud.so.

hud eval hud-evals/SheetBench-50 claude --max-concurrent 100

OSWorld-Verified benchmark runtime
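
The same fan-out is easy to picture in the SDK itself. Below is a minimal sketch, assuming the ClaudeAgent and BrowserTask classes from the Python example further down this page; how tasks are loaded and what run() returns are assumptions, and the real hud eval CLI additionally handles dataset loading, retries, and telemetry for you.

import asyncio

from hud import ClaudeAgent
from hud.samples import BrowserTask

MAX_CONCURRENT = 100  # mirrors --max-concurrent 100

async def run_all(tasks: list[BrowserTask]) -> list:
    # Cap the number of in-flight environments with a semaphore.
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def run_one(task: BrowserTask):
        async with semaphore:
            # run() is assumed to return a score, as in the SDK example below.
            return await ClaudeAgent().run(task)

    return await asyncio.gather(*(run_one(t) for t in tasks))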


Turn your software into an environment

Desktop / OS
Browser / Web
Response / API
Dockerfile / Custom

Build your own environments.

Evaluate and train agents in your own software, web apps or chat interfaces in less than 30 minutes. Deploy the models to your own infrastructure.

hud init my-environment
cd my-environment && hud dev --interactive
# When you're done, RL a custom agent
hud rl
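
Under the hood every environment speaks MCP (see the next section), so a custom environment is ultimately an MCP server that exposes the tools your agent can call, plus an evaluator for scoring. Here is a minimal sketch using the reference mcp Python SDK rather than the exact scaffolding hud init generates; the tool names and the in-memory state are illustrative assumptions.

# Sketch of an MCP environment server; not the layout `hud init` produces.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-environment")
state = {"page": ""}

@mcp.tool()
def navigate(url: str) -> str:
    """Pretend to load a page and record it in the environment state."""
    state["page"] = f"Loaded {url}"
    return state["page"]

@mcp.tool()
def evaluate(search_terms: str) -> float:
    """Return 1.0 if the current page contains the search terms, else 0.0."""
    return 1.0 if search_terms in state["page"] else 0.0

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio for any MCP client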

Your agent, any environment.

Every environment runs on MCP, so you can use any client that can call tools. Build your own agents and benchmark them against Claude and Operator.

from hud import ClaudeAgent, OperatorAgent, OpenAIChatGenericAgent
from hud.samples import BrowserTask

task = BrowserTask(
    prompt="Navigate to the hud.so homepage",
    evaluate_tool={
        "evaluate": {
            "page_contains": {"search_terms": "Navigate to the hud.so homepage"}
        }
    },
)

score = await ClaudeAgent().run(task)
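
Because the task object is client-agnostic, you can hand the same task to a different agent for a head-to-head comparison, for example with the OperatorAgent import shown above:

claude_score = await ClaudeAgent().run(task)
operator_score = await OperatorAgent().run(task)
print(f"Claude: {claude_score}  Operator: {operator_score}")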

Build and test your agent

OpenAI Operator
Claude Computer Use
MCP
RAG

Pricing


SDK

Free

Build and run evaluations locally with our open-source SDK.

  • ✓Full SDK for building MCP environments
  • ✓Run evaluations locally with Docker
  • ✓Create custom benchmarks & evaluators
  • ✓Hot-reload environment development
  • ✓Open source community & examples

Cloud

$0.50/environment hour

Full benchmarks cost ~$1-10. No infrastructure to manage.


  • ✓Run evaluations at scale by deploying environments
  • ✓Access to production benchmarks
  • ✓Live telemetry & debugging dashboard
  • ✓Parallel evaluation runs (100+ concurrent)
  • ✓Submit to public leaderboards & scorecards

Start with $10 in free credits!

Enterprise

Custom

Solutions for research and enterprise teams.

  • ✓Environments to train & test your agent
  • ✓Private benchmarks and RL workflows
  • ✓Test models before production
  • ✓On-premise deployment options available
  • ✓Dedicated engineering support & training

Are you a researcher?

Get $100 in free credits when you sign up with a .edu email address.

Need custom environments and evalsets? Tell us what you're building.

Any questions?

Or email us a quick question at founders@hud.so.
