The evaluation and RL platform for AI agents

Evaluate and improve your AI agent across hundreds of environments and thousands of tasks.

uv tool install hud-python
# Set your API keys
hud set HUD_API_KEY=... ANTHROPIC_API_KEY=...
# Evaluate any dataset
hud eval hud-evals/OSWorld-Verified claude
# Train your own model
hud rl hud-evals/basic-2048

We 💛 Researchers.

HUD partners using our evaluation and RL environment platform: Browser Use, TryCua, UC Berkeley, MIT, Columbia University, and Yale University.

Evaluate and train agents.


Don't wait hours for results.

Our infrastructure handles thousands of concurrent environments. Watch your agents think and train in real time on hud.so/home.

hud eval hud-evals/SheetBench-50 claude --max-concurrent 100

OSWorld-Verified benchmark runtime


Turn your software into an environment

Desktop / OS
Browser / Web
Response / API
Dockerfile / Custom

Build your own environments.

Evaluate and train agents in your own software, web apps, or chat interfaces in less than 30 minutes. Deploy the resulting models to your own infrastructure.

hud init my-environment
cd my-environment && hud dev --interactive
# When you're done, RL a custom agent
hud rl
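
At its core, an environment is just an MCP server that exposes interaction and evaluation tools (see the next section). Below is a minimal sketch using the open-source mcp Python SDK rather than the scaffold that hud init generates; the tool names and toy counter state are illustrative assumptions.

# Minimal MCP tool server sketch (generic mcp SDK, not the hud init scaffold).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-environment")

state = {"clicks": 0}  # toy in-memory state standing in for your software


@mcp.tool()
def click_button() -> str:
    """Interaction tool the agent can call to act on the environment."""
    state["clicks"] += 1
    return f"Button clicked {state['clicks']} time(s)."


@mcp.tool()
def evaluate(target_clicks: int) -> float:
    """Scoring tool: returns 1.0 once the agent has reached the target."""
    return 1.0 if state["clicks"] >= target_clicks else 0.0


if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio for any MCP client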

Your agent, any environment.

Every environment runs on MCP, so you can use any client that can call tools. Build your own agents and benchmark them against Claude and Operator.

from hud import ClaudeAgent, OperatorAgent, OpenAIChatGenericAgent
from hud.samples import BrowserTask

task = BrowserTask(
    prompt="Navigate to the hud.so homepage",
    evaluate_tool={
        "evaluate": {"page_contains": {"search_terms": "Navigate to the hud.so homepage"}}
    },
)
score = await ClaudeAgent().run(task)
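
Because run(task) is awaited, the snippet above needs an async context. A small comparison harness is sketched below, reusing only the names from the example; the loop over agents and the printout are illustrative assumptions.

import asyncio

from hud import ClaudeAgent, OperatorAgent
from hud.samples import BrowserTask


async def main() -> None:
    # Same task as above; the evaluate_tool payload is copied from the example.
    task = BrowserTask(
        prompt="Navigate to the hud.so homepage",
        evaluate_tool={
            "evaluate": {"page_contains": {"search_terms": "Navigate to the hud.so homepage"}}
        },
    )
    # Run the same task with two agents and compare scores
    # (the comparison loop itself is an assumption, not HUD-prescribed usage).
    for agent_cls in (ClaudeAgent, OperatorAgent):
        score = await agent_cls().run(task)
        print(f"{agent_cls.__name__}: {score}")


if __name__ == "__main__":
    asyncio.run(main())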

Build and test your agent

OpenAI Operator
Claude Computer Use
MCP
RAG

Pricing


SDK

Free

Build and run evaluations locally with our open-source SDK.

  • ✓ Full SDK for building MCP environments
  • ✓ Run evaluations locally with Docker
  • ✓ Create custom benchmarks & evaluators
  • ✓ Hot-reload environment development
  • ✓ Open source community & examples

Cloud

$0.50/environment hour

Full benchmarks cost ~$1-10. No infrastructure to manage.


  • ✓ Run evaluations at scale by deploying environments
  • ✓ Access to production benchmarks
  • ✓ Live telemetry & debugging dashboard
  • ✓ Parallel evaluation runs (100+ concurrent)
  • ✓ Submit to public leaderboards & scorecards

Start with $10 in free credits!

Enterprise

Custom

Solutions for research and enterprise teams.

  • ✓ Environments to train & test your agent
  • ✓ Private benchmarks and RL workflows
  • ✓ Test models before production
  • ✓ On-premise deployment options available
  • ✓ Dedicated engineering support & training

Are you a researcher?

Get $100 in free credits when you sign up with a .edu email address.

Need custom environments and evalsets? Tell us what you're building.

Any questions?

Or email us a quick question at founders@hud.so.
