Tool Usage Leaderboard

Tool call failure rates and user tool rejection rates. Lower is better.

These benchmarks are based on real-world usage by engineers working with Claude Code as the coding agent. Model names are hidden from users during evaluation.
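
The page does not state how the two rates are computed. As a rough sketch, assuming each rate is simply the number of failed (or user-rejected) tool calls divided by the total tool calls recorded for a model, the ranking could be derived as follows. The names `ToolCall`, `rates`, and `leaderboard` are hypothetical and used only for illustration; the actual pipeline may define or weight these metrics differently.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One recorded tool call (hypothetical record shape, assumed for this sketch)."""
    model: str
    failed: bool    # the tool call itself errored
    rejected: bool  # the user declined the proposed tool call


def rates(calls):
    """Return per-model failure and rejection rates as fractions of total calls."""
    totals = {}  # model -> [total calls, failed calls, rejected calls]
    for call in calls:
        counts = totals.setdefault(call.model, [0, 0, 0])
        counts[0] += 1
        counts[1] += int(call.failed)
        counts[2] += int(call.rejected)
    return {
        model: {"failure_rate": failed / total, "rejection_rate": rejected / total}
        for model, (total, failed, rejected) in totals.items()
    }


def leaderboard(calls):
    """Sort models by failure rate, lowest first, since lower is better."""
    return sorted(rates(calls).items(), key=lambda item: item[1]["failure_rate"])
```

Sorting on failure rate alone is one arbitrary choice; a ranking could equally sort on rejection rate or on some combination of the two.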