The evaluation platform for computer use agents
Evaluate and iterate on your AI agent across hundreds of environments and thousands of tasks.
We 💛 Researchers.

Evaluate anything.
Improve your iteration loop.
Don't wait hours for results. Our platform orchestrates hundreds of concurrent machines, spinning up full OS environments in seconds for rapid evaluation cycles. Iterate faster, identify regressions sooner, and push better agents to production.
OSWorld Benchmark Runtime Comparison
Integrate Your Agent's Stack
Evaluate agentic abilities while leveraging existing tools & models.
Your agent, your way.
Don't force your agent into a specific mold. The HUD evaluation schema adapts to any architecture: bring your own tools, models (VLMs, RAG systems), or APIs. Focus on evaluating core agentic abilities across diverse environments while keeping the unique components of your agent stack intact.
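As a minimal sketch of what "bring your own stack" can mean in practice: the harness only needs a callable that maps observations to actions, so any model or tool pipeline can sit behind it. All names below (`Task`, `run_task`, `my_agent`) are illustrative assumptions, not the actual HUD SDK interface.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: plugging a custom agent into an evaluation loop.
# Task/run_task are stand-ins, NOT the real HUD SDK API.

@dataclass
class Task:
    prompt: str
    evaluate: dict = field(default_factory=dict)

def my_agent(observation: str) -> str:
    # Your own stack goes here: call a VLM, query a RAG index, hit any API.
    return f"click_search:{observation}"

def run_task(task: Task, agent, max_steps: int = 3) -> bool:
    observation = task.prompt
    for _ in range(max_steps):
        action = agent(observation)  # harness only needs obs -> action
        if action.startswith("click_search"):
            return True  # stand-in for the task's evaluate() check
        observation = action
    return False

task = Task(prompt="Open the settings page",
            evaluate={"check": "page == settings"})
print(run_task(task, my_agent))  # → True
```

Because the agent is just a function, swapping in a different model or toolchain means changing `my_agent`, not the evaluation harness.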
Any evaluation, any environment.
Go beyond standard benchmark sets. Create tasks tailored to your specific agent, product, or workflow across diverse environments. Evaluate performance on desktop software, web browsers, text-based interfaces, or proprietary Dockerfile-based environments.
Test on any environment
Build evaluations for niche workflows, proprietary tools, and unique agent loops.
Pricing
SDK
Build and run evaluations locally with our open-source SDK.
- ✓Full SDK for building MCP environments
- ✓Run evaluations locally with Docker
- ✓Create custom benchmarks & evaluators
- ✓Hot-reload environment development
- ✓Open source community & examples
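A custom benchmark plus evaluator can be as small as a scoring function applied to the final environment state. The sketch below assumes hypothetical state dictionaries as if returned by locally run Docker environments; the real SDK's evaluator interface may differ.

```python
# Hypothetical custom-evaluator sketch; the real SDK interface may differ.
def evaluate_url(final_state: dict, expected_url: str) -> float:
    """Return 1.0 if the agent ended on the expected page, else 0.0."""
    return 1.0 if final_state.get("url") == expected_url else 0.0

# A tiny custom benchmark: two tasks with expected outcomes.
benchmark = [
    {"prompt": "Go to the pricing page",
     "expected_url": "https://example.com/pricing"},
    {"prompt": "Open the docs",
     "expected_url": "https://example.com/docs"},
]

# Simulated final states, standing in for locally run environment results.
final_states = [
    {"url": "https://example.com/pricing"},
    {"url": "https://example.com/blog"},
]

scores = [evaluate_url(s, t["expected_url"])
          for s, t in zip(final_states, benchmark)]
print(sum(scores) / len(scores))  # → 0.5
```

Binary pass/fail scores like this aggregate cleanly into a benchmark-level mean, which is what a leaderboard ultimately reports.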
Cloud
Most full evalsets cost ~$1-10 per run. No infrastructure to manage.
*Complex environments like OSWorld may cost more.
- ✓Run evaluations at scale by deploying environments
- ✓Access to production benchmarks
- ✓Live telemetry & debugging dashboard
- ✓Parallel evaluation runs (100+ concurrent)
- ✓Submit to public leaderboards & scorecards
Start with $10 in free credits!
Start evaluating
Enterprise
Volume discounts & tailored solutions for research and enterprise teams.
- ✓Private benchmarks with custom environments
- ✓Stress-test new models before production deployment
- ✓On-premise deployment options available
- ✓Dedicated engineering support & training
Are you a researcher?
Get $100 in free credits when you sign up with a .edu email address.
Making an academic eval? Apply for a grant and we'll cover your costs!
Need custom environments and evalsets? Tell us what you're building.