AI coding glossary
SWE-bench
Also known as: swe bench, software engineering benchmark
In one sentence
The standard benchmark suite for measuring AI agents on real-world software engineering tasks: given a repository and a GitHub issue, produce a patch that passes held-out tests. Main variants: SWE-bench Verified (instances that humans confirmed to be solvable), SWE-bench Lite, and SWE-bench Multimodal.
Full definition
SWE-bench is the dominant benchmark for measuring AI coding agents on real-world tasks. Each instance is a GitHub issue from a popular open-source Python repo; the agent gets the repo and the issue text and must produce a patch that passes a test suite it never sees. Variants: SWE-bench Verified (about 500 instances that humans confirmed to be solvable, the leaderboard standard); SWE-bench Lite (a smaller subset of 300 instances); SWE-bench Multimodal (instances that require image understanding). 2026 frontier scores: Claude Opus 4.7 and GPT-5.5 trade places monthly in the 70-80% range, with Gemini 3.1 Pro close behind. The benchmark is criticized because models and agent setups can be over-tuned to it, but it remains the most-cited proxy for "how good is this agent at real work."
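As an illustration of how a leaderboard percentage is derived from the pass/fail rule described above, here is a minimal sketch. It assumes the convention that an instance counts as resolved only when every test the fix is supposed to repair now passes (FAIL_TO_PASS in the dataset) and every test that passed before still passes (PASS_TO_PASS). The `InstanceResult` class and both function names are hypothetical, made up for this example; this is not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcomes for one benchmark instance after applying the agent's patch.

    Hypothetical container for illustration; each dict maps test name -> passed?
    """
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests the fix is expected to turn green
    pass_to_pass: dict[str, bool]  # tests that must not regress


def is_resolved(result: InstanceResult) -> bool:
    """An instance is resolved only if the target tests pass and nothing regresses."""
    # Assumption: an instance with no target tests is treated as unresolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[InstanceResult]) -> float:
    """Percentage of instances resolved, i.e. the headline score."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if is_resolved(r))
    return 100.0 * resolved / len(results)


# Made-up outcomes for four instances (not real benchmark data).
example_results = [
    InstanceResult("demo-1", {"test_fix": True}, {"test_old": True}),
    InstanceResult("demo-2", {"test_fix": False}, {"test_old": True}),  # bug not fixed
    InstanceResult("demo-3", {"test_fix": True}, {"test_old": False}),  # regression
    InstanceResult("demo-4", {"test_a": True, "test_b": True}, {}),
]

if __name__ == "__main__":
    print(f"resolve rate: {resolve_rate(example_results):.1f}%")
```

Note that under this rule a patch that fixes the reported bug but breaks an existing test counts as a failure, which is why a partial fix earns no credit.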