Evaluate and train agents.
Don't wait hours for results.
Our infrastructure handles thousands of concurrent environments. Watch your agents think and train in real time on app.hud.so.
hud eval hud-evals/SheetBench-50 claude --max-concurrent 100
OSWorld-Verified benchmark runtime
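The same one-line pattern runs any hosted benchmark; the dataset slug and concurrency level below are illustrative assumptions.
# Illustrative: dataset slug and concurrency are assumptions
hud eval hud-evals/OSWorld-Verified claude --max-concurrent 25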
Turn your software into an environment
Desktop / OS
Browser / Web
Response / API
Dockerfile / Custom
Build your own environments.
Evaluate and train agents in your own software, web apps, or chat interfaces in less than 30 minutes. Deploy the trained models to your own infrastructure.
hud init my-environment
cd my-environment && hud dev --interactive
# When you're done, RL a custom agent
hud rl
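Under the hood, every environment is an MCP server that exposes tools for the agent to act with and for the evaluator to score against. Here is a minimal sketch using the open MCP Python SDK; the tool names and reward convention are illustrative assumptions, not hud's required schema.

# Minimal sketch of a custom MCP environment (illustrative, not hud's exact schema)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-environment")
state = {"counter": 0}

@mcp.tool()
def increment(amount: int) -> str:
    """Example action tool: the agent calls this to change environment state."""
    state["counter"] += amount
    return f"counter is now {state['counter']}"

@mcp.tool()
def evaluate(target: int) -> float:
    """Example evaluator tool: returns 1.0 if the goal state was reached."""
    return 1.0 if state["counter"] == target else 0.0

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default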
Your agent, any environment.
Every environment runs on MCP, so you can use any client that can call tools. Build your own agents and benchmark them against Claude and Operator.
from hud import ClaudeAgent, OperatorAgent, OpenAIChatGenericAgent
from hud.samples import BrowserTask

task = BrowserTask(
    prompt="Navigate to the hud.so homepage",
    evaluate_tool={
        "evaluate": {
            "page_contains": {
                "search_terms": "Navigate to the hud.so homepage"
            }
        }
    },
)

score = await ClaudeAgent().run(task)
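To benchmark several agents on the same task, a sketch like the one below works, assuming each agent class shares the .run(task) interface shown above (the exact SDK surface here is an assumption).

# Sketch: run one task against multiple agents (assumes a shared .run interface)
async def compare(task):
    results = {}
    for name, agent in [("claude", ClaudeAgent()), ("operator", OperatorAgent())]:
        results[name] = await agent.run(task)
    return results

scores = await compare(task)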
Build and test your agent
OpenAI Operator
Claude Computer Use
MCP
RAG
Pricing
SDK
Free
Build and run evaluations locally with our open-source SDK.
- ✓ Full SDK for building MCP environments
- ✓ Run evaluations locally with Docker
- ✓ Create custom benchmarks & evaluators
- ✓ Hot-reload environment development
- ✓ Open-source community & examples
Cloud
$0.50/environment-hour
A full benchmark run typically costs $1–$10. No infrastructure to manage.
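For example, a run that uses 10 environment-hours (say, 100 tasks averaging six minutes each, a purely illustrative workload) comes to about $5 at this rate.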
- ✓ Run evaluations at scale by deploying environments
- ✓ Access to production benchmarks
- ✓ Live telemetry & debugging dashboard
- ✓ Parallel evaluation runs (100+ concurrent)
- ✓ Submit to public leaderboards & scorecards
Are you a researcher?
Get $100 in free credits when you sign up with a .edu email address.
Making an academic eval? Apply for a grant and we'll cover your costs.
Need custom environments and evalsets? Tell us what you're building.