GPT-5.5 scored 88.7% on SWE-bench. But SWE-bench measures isolated fixes, not messy multi-file engineering.