Question 1

What is SWE-bench?

Accepted Answer

A benchmark suite for measuring AI agents on real-world software engineering tasks: given a repo and its tests, fix this GitHub issue. Main variants: SWE-bench Verified (instances humans confirmed to be solvable), SWE-bench Lite, and SWE-bench Multimodal.

Question 2

What does SWE-bench mean in AI coding?

Accepted Answer

SWE-bench is the dominant benchmark for measuring AI coding agents on real-world tasks. Each instance is a GitHub issue from a popular open-source Python repo; the agent gets the repo and the issue text and must produce a patch that passes the hidden test suite. Variants: SWE-bench Verified (~500 instances that humans confirmed to be solvable, the leaderboard standard); SWE-bench Lite (a smaller subset); SWE-bench Multimodal (instances requiring image understanding). 2026 frontier scores: Claude Opus 4.7 and GPT-5.5 trade places monthly in the 70-80% range; Gemini 3.1 Pro is close. The benchmark is criticized because agents can be over-tuned to it, but it remains the most-cited proxy for 'how good is this agent at real work.'
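To make the pass/fail criterion concrete, here is a minimal Python sketch of how a "resolved" percentage could be computed from test outcomes. It is an illustration under stated assumptions, not the official harness: the `Instance` class, the `is_resolved` and `resolved_rate` functions, and the sample IDs are all hypothetical, and the split into "tests the fix must make pass" and "tests that must keep passing" is a simplified model of the hidden test suite. A real evaluation also has to set up the environment, apply the patch, and run the tests, none of which is modeled here.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical, simplified stand-in for one benchmark task."""
    instance_id: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]  # tests that already pass and must not regress


def is_resolved(instance: Instance, passed_tests: set[str]) -> bool:
    """An instance counts as resolved only if every required test passed."""
    required = instance.fail_to_pass + instance.pass_to_pass
    return all(test in passed_tests for test in required)


def resolved_rate(instances: list[Instance], results: dict[str, set[str]]) -> float:
    """Percentage of instances resolved; a missing result counts as unresolved."""
    if not instances:
        return 0.0
    resolved = sum(
        is_resolved(inst, results.get(inst.instance_id, set()))
        for inst in instances
    )
    return 100.0 * resolved / len(instances)


# Illustrative data only: two made-up instances and the tests that
# passed after applying an agent's patch to each.
instances = [
    Instance("demo-1", fail_to_pass=["test_bug"], pass_to_pass=["test_old"]),
    Instance("demo-2", fail_to_pass=["test_crash"], pass_to_pass=["test_api"]),
]
results = {
    "demo-1": {"test_bug", "test_old"},  # fix works, nothing regressed
    "demo-2": {"test_crash"},            # fix works, but test_api regressed
}
score = resolved_rate(instances, results)
```

In this sketch the second patch fixes its target test but breaks an existing one, so it does not count as resolved. That all-or-nothing rule is why a headline percentage reflects fully fixed issues, with no partial credit.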
