The study had mixed results, and none of the tools achieved even a 50% success rate, even with the help of Debug Gym. Anthropic’s Claude 3.7 Sonnet was the best performer, managing to successfully ...