The evaluation and RL platform for AI agents
Evaluate and improve your AI agent in hundreds of environments and thousands of tasks designed by |
Evaluate and train agents.
Don't wait hours for results.
Our infrastructure handles thousands of concurrent environments. Watch your agents think and train in real time on hud.so/home.
hud eval hud-evals/SheetBench-50 claude --max-concurrent 100
OSWorld-Verified benchmark runtime
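To give a feel for what a concurrency cap such as `--max-concurrent 100` means in practice, here is a minimal sketch using only Python's standard library. It is not how hud schedules environments internally; `run_all`, `fake_episode`, and `make_tasks` are hypothetical names invented for this illustration, and the sleep stands in for a real agent episode.

```python
import asyncio


async def run_all(tasks, max_concurrent):
    """Run every coroutine factory in `tasks`, at most `max_concurrent` at once."""
    gate = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def run_one(task):
        nonlocal running, peak
        async with gate:  # blocks while `max_concurrent` episodes are in flight
            running += 1
            peak = max(peak, running)
            try:
                return await task()
            finally:
                running -= 1

    # gather preserves input order, so results line up with tasks
    results = await asyncio.gather(*(run_one(t) for t in tasks))
    return results, peak


async def fake_episode(i):
    """Hypothetical stand-in for one agent episode; returns a dummy score."""
    await asyncio.sleep(0.01)
    return i * i


def make_tasks(n):
    # Bind i as a default argument so each factory keeps its own index.
    return [lambda i=i: fake_episode(i) for i in range(n)]


results, peak = asyncio.run(run_all(make_tasks(20), max_concurrent=5))
```

Raising the cap shortens wall-clock time until something else (API rate limits, available environments) becomes the bottleneck, which is why the flag is worth tuning per benchmark.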
Turn your software into an environment
Build your own environments.
Evaluate and train agents in your own software, web apps or chat interfaces in less than 30 minutes. Deploy the models to your own infrastructure.
hud init my-environment
cd my-environment && hud dev --interactive
# When you're done, RL a custom agent
hud rl
Your agent, any environment.
Every environment runs on MCP, so you can use any client that can call tools. Build your own agents and benchmark them against Claude and Operator.
from hud import ClaudeAgent, OperatorAgent, OpenAIChatGenericAgent
from hud.samples import BrowserTask
task = BrowserTask(
  prompt = "Navigate to the hud.so homepage", 
  evaluate_tool = {"evaluate": {
    "page_contains": {
      "search_terms": "Navigate to the hud.so homepage"
    }
  }
})
score = await ClaudeAgent().run(task)
Build and test your agent
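Because every environment speaks MCP, a client ultimately talks to it with JSON-RPC 2.0 messages, and a tool invocation uses the `tools/call` method with a tool `name` and an `arguments` object. The sketch below only builds such a message; it does not connect to anything. The `tool_call_request` helper is hypothetical, and the assumption that a hud environment exposes a tool named `evaluate` with these arguments is taken from the sample task above rather than from hud's documentation.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests need a unique id per call


def tool_call_request(name, arguments):
    """Build an MCP `tools/call` request as a plain dict (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Assumed tool name and argument shape, mirroring the sample task above.
request = tool_call_request(
    "evaluate",
    {"page_contains": {"search_terms": "Navigate to the hud.so homepage"}},
)
wire = json.dumps(request)  # what a client would send over its transport
```

Any agent framework that can emit messages of this shape, directly or through an MCP client library, can drive the same environment.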
Pricing
SDK
Build and run evaluations locally with our open-source SDK.
- ✓ Full SDK for building MCP environments
- ✓ Run evaluations locally with Docker
- ✓ Create custom benchmarks & evaluators
- ✓ Hot-reload environment development
- ✓ Open source community & examples
Cloud
Full benchmarks cost ~$1-10. No infrastructure to manage.
- ✓ Run evaluations at scale by deploying environments
- ✓ Access to production benchmarks
- ✓ Live telemetry & debugging dashboard
- ✓ Parallel evaluation runs (100+ concurrent)
- ✓ Submit to public leaderboards & scorecards
Are you a researcher?
Get $100 in free credits when you sign up with a .edu email address.
Making an academic eval? Apply for a grant and we'll cover your costs!
Need custom environments and evalsets? Tell us what you're building.





