GPT-5.5 for Agentic Coding: What the Terminal-Bench 2.0 Leader Means for Your Workflow
GPT-5.5 leads Terminal-Bench 2.0 at 82.7% and cuts hallucinations by 52.5% versus GPT-4.5. Here's what changes for developers running agentic coding pipelines — API migration, SKILL.md compatibility, MCP tool-call format, and when to switch.
OpenAI shipped GPT-5.5 on April 23, 2026, billing it as the first fully retrained base model since GPT-4.5. For its first several weeks it was mostly a benchmark curiosity. Then Terminal-Bench 2.0 published its June leaderboard and developers noticed the number: 82.7%, a 13-point gap over Claude Opus 4.7, the previous field leader. If you run agentic pipelines that issue tool calls, write code, or coordinate multi-step terminal workflows, that gap has practical consequences.
This post is not about whether GPT-5.5 is the "best" model. It's about the practical delta: what changed in the API, what breaks in existing SKILL.md and MCP configurations, and where the gains actually show up in production pipelines — as opposed to synthetic benchmark conditions.
What changed in GPT-5.5
OpenAI describes GPT-5.5 as its "strongest agentic coding model to date." Three changes matter for developers using it in pipelines:
52.5% fewer hallucinations than GPT-4.5. This is the headline for anyone running long-horizon agents. Hallucinations in agentic pipelines are different from hallucinations in chat — when an agent fabricates a file path or invents an API parameter and the next tool call depends on it, the error compounds. Lower hallucination rates translate directly to fewer pipeline aborts.
Retrained tool-call format. GPT-5.5 uses a revised internal representation for tool use. The public API surface is backward compatible — your existing function-calling JSON still works — but the model handles parallel tool calls differently. It now schedules concurrent calls aggressively when it can infer independence. If your agent wrapper assumes sequential tool execution and shares mutable state between calls, you'll see race conditions that weren't present on GPT-5.
128K context, native code execution. Context window is identical to GPT-5, but the model was trained with heavier emphasis on in-context code execution feedback loops. In practice, agentic tasks that require reading long files, planning changes, and then verifying them are where the performance uplift is most visible.
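The race-condition risk from concurrent tool calls can be made concrete with a small sketch. It assumes a wrapper that runs tool handlers on a thread pool; `Workspace`, `record_edit`, and `run_concurrently` are hypothetical names for illustration, not part of any OpenAI SDK. The point is only that a read-modify-write on shared state needs a lock once calls stop being sequential.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Workspace:
    """Hypothetical shared state that several tool handlers mutate."""

    def __init__(self):
        self._lock = threading.Lock()
        self.edit_counts = {}

    def record_edit(self, path, lines_changed):
        # The read-modify-write below must be atomic once calls run concurrently.
        with self._lock:
            self.edit_counts[path] = self.edit_counts.get(path, 0) + lines_changed


def run_concurrently(workspace, calls, max_workers=8):
    """Run (path, lines_changed) tool calls on a thread pool and wait for all."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(workspace.record_edit, p, n) for p, n in calls]
        for future in futures:
            future.result()  # re-raise any handler exception
    return workspace.edit_counts
```

If guarding every handler is impractical, the simpler option is to keep tool execution sequential in the wrapper.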
82.7% on Terminal-Bench 2.0: #1 on the June 2026 leaderboard
52.5% fewer hallucinations vs GPT-4.5 on agentic tasks
13pt lead over Opus 4.7: 69.4% vs 82.7% on Terminal-Bench 2.0
Terminal-Bench 2.0: reading the leaderboard
Terminal-Bench 2.0 tests AI agents on real command-line workflows inside a sandboxed terminal: compiling code, training models, setting up servers, system administration, data science pipelines, and security tasks. It scores on a 0–1 accuracy scale across four categories: reasoning, tool calling, agents, and code.
The June 2026 leaderboard has 44 models evaluated. The top five:
Rank  Model                              Score
──────────────────────────────────────────────
1     GPT-5.5 (OpenAI)                   0.827
2     Claude Mythos Preview (Anthropic)  0.820
3     GPT-5.3 Codex (OpenAI)             0.773
4     Gemini 3.5 Flash (Google)          0.762
5     GPT-5.4 (OpenAI)                   0.751
Average across all 44 models: 0.573

Two things stand out. First, the gap between rank 1 and rank 2 is just 0.7 points (0.827 vs 0.820): Claude Mythos Preview is real competition, and its score comes from a preview model. Second, the average score across all 44 models is 0.573, meaning the average model still fails more than 40% of real terminal tasks. "Good at coding" and "good at agentic terminal workflows" remain meaningfully different things in 2026.
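Scores on a 0-1 scale are easy to misread as percentage points, so it can help to compute the gaps explicitly. The sketch below uses only scores quoted in this post; the helper name `gap_in_points` is made up for illustration.

```python
# Scores quoted in this post (0-1 scale).
scores = {
    "GPT-5.5": 0.827,
    "Claude Mythos Preview": 0.820,
    "Claude Opus 4.7": 0.694,
}
average_score = 0.573


def gap_in_points(a, b):
    """Difference between two 0-1 scores, in percentage points."""
    return round((a - b) * 100, 1)


lead_over_mythos = gap_in_points(scores["GPT-5.5"], scores["Claude Mythos Preview"])
lead_over_opus = gap_in_points(scores["GPT-5.5"], scores["Claude Opus 4.7"])
average_failure_rate = round((1 - average_score) * 100, 1)

print(lead_over_mythos)      # 0.7
print(lead_over_opus)        # 13.3
print(average_failure_rate)  # 42.7
```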
API migration in 15 minutes
The model ID change is the smallest part. Switch gpt-5 to gpt-5.5 and most pipelines work on first run. The non-obvious part is parallel tool-call handling.
# Before (GPT-5)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5",  # ← change this
    messages=[...],
    tools=[...],
)

# After (GPT-5.5)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[...],
    tools=[...],
    # GPT-5.5 issues parallel tool calls by default.
    # Set parallel_tool_calls=False if your tools share mutable state.
    parallel_tool_calls=True,  # explicit; default True on gpt-5.5
)

Tool-call format and MCP changes
For MCP servers configured with the 2026 MCP stateless spec, GPT-5.5 is fully compatible — tool definitions, result schemas, and streaming behavior all work as expected. The one catch is the tool_choice parameter semantics.
On GPT-5, "tool_choice": "auto" meant the model might skip tool calls and respond in plain text. On GPT-5.5, "auto" is more aggressive: the model will invoke tools when it sees any plausible reason to, even if a plain text response would satisfy the request. For chat interfaces, this shows up as unexpected tool invocations mid-conversation. For agentic pipelines where you want the model to always use tools, it's the right default.
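One way to contain the more eager "auto" behavior on chat surfaces is to set tool_choice per request rather than globally. The sketch below is an assumption about how an application might do that: `surface` and `turn_needs_tools` are hypothetical inputs your own code would supply, while "none" and "auto" are standard tool_choice values in the Chat Completions API.

```python
def tool_choice_for(surface, turn_needs_tools):
    """Pick a tool_choice value for one request.

    surface: "chat" or "agent" (hypothetical labels for your own app).
    turn_needs_tools: your own heuristic for whether this turn should act.
    """
    if surface == "chat" and not turn_needs_tools:
        # Plain conversation: suppress tools so the model answers in text.
        return "none"
    # Agent pipelines, and chat turns that plausibly need a tool.
    return "auto"


def build_request(messages, tools, surface, turn_needs_tools):
    """Assemble keyword arguments for client.chat.completions.create(...)."""
    return {
        "model": "gpt-5.5",
        "messages": messages,
        "tools": tools,
        "tool_choice": tool_choice_for(surface, turn_needs_tools),
    }
```

The heuristic behind `turn_needs_tools` is the hard part and is deliberately left out; even a coarse rule keeps tool calls out of small talk.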
{
  "mcpServers": {
    "skills-hub": {
      "command": "npx",
      "args": ["@skills-hub-ai/mcp"],
      "model": "gpt-5.5"
    }
  }
}

The "model" entry is a hint for clients that support it.

If you're still on Gemini CLI's MCP config format (the old url field instead of serverUrl), clean that up before switching models. The two migrations compound if you do them together and something breaks.
SKILL.md patterns that land best
GPT-5.5's hallucination reduction is most pronounced in tasks that require accurately reading and reproducing values from a provided context: exact file paths, exact function signatures, exact error messages. Skills that give the model concrete anchors — file paths, line numbers, specific output to match — get the biggest lift.
Three SKILL.md design patterns perform well:
1. Acceptance-criteria skills
Declare explicit pass/fail criteria the model checks at the end of its work. GPT-5.5's lower hallucination rate means it's less likely to fabricate "tests pass" when they don't.
---
name: fix-flaky-test
description: Diagnoses and fixes a flaky test. Verifies the fix runs green 3× in a row.
version: 1.0.0
category: test
platforms:
- CLAUDE_CODE
- CODEX_CLI
---
TARGET: $ARGUMENTS
PHASE 1: Run the target test 5 times. Count fails. If < 2 fails, report "not
reliably flaky" and stop.
PHASE 2: Diagnose the root cause. Check: shared test state, timing dependencies,
network calls that should be mocked, random seeds not fixed.
PHASE 3: Apply the fix. Run the test 3 consecutive times. All three must pass.
DONE WHEN: 3/3 consecutive green runs. Report the root cause category and the
specific change made.

2. Read-verify-write skills
GPT-5.5 handles long-context verification loops better than its predecessors. Skills that read a file, make a plan, write a patch, then re-read to verify the patch landed correctly — without hallucinating the verification — get much more reliable results on 5.5 than on GPT-5.
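The read-verify-write loop has a direct analogue in tool code, which is useful when the skill delegates the patch step to a helper. This is a minimal sketch, assuming a single exact-match text replacement; `patch_and_verify` is a hypothetical helper, not part of any SKILL.md tooling.

```python
from pathlib import Path


def patch_and_verify(path, old, new):
    """Replace exactly one occurrence of `old`, then re-read the file to verify."""
    file = Path(path)
    original = file.read_text(encoding="utf-8")  # read
    matches = original.count(old)
    if matches != 1:
        # Refuse ambiguous or missing anchors instead of guessing.
        raise ValueError(f"expected exactly one match for {old!r}, found {matches}")
    expected = original.replace(old, new, 1)     # plan
    file.write_text(expected, encoding="utf-8")  # write
    written = file.read_text(encoding="utf-8")   # verify from disk, not from memory
    return written == expected
```

Failing loudly on zero or multiple matches mirrors the "concrete anchors" advice: the patch only proceeds when the target is unambiguous.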
3. Parallel-safe subagent skills
If you're using OpenAI's Responses API or an MCP client that spawns parallel completions, GPT-5.5 handles concurrent tool use without the ghost-call problem that appeared in some GPT-5 pipelines (where the model would echo a previous tool result instead of making a fresh call).
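Parallel tool use also puts a requirement on the wrapper: one assistant message can carry several tool calls, and each needs its own result keyed by its call id. The sketch below assumes the Chat Completions shape, where each entry in `message.tool_calls` has `.id`, `.function.name`, and `.function.arguments` (a JSON string). `run_tool_calls` and the `handlers` dict are hypothetical names, and the handlers are assumed to be safe to run concurrently.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def run_tool_calls(tool_calls, handlers):
    """Execute every tool call from one assistant message and build the replies.

    handlers: dict mapping tool names to Python callables (your own code).
    Returns a list of {"role": "tool", ...} messages in the original call order.
    """
    def run_one(call):
        args = json.loads(call.function.arguments)
        result = handlers[call.function.name](**args)
        # Each result is tied to the id of the call that produced it.
        return {
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        }

    with ThreadPoolExecutor() as pool:
        replies = list(pool.map(run_one, tool_calls))  # map preserves input order

    ids = [reply["tool_call_id"] for reply in replies]
    if len(set(ids)) != len(tool_calls):
        raise RuntimeError("duplicate tool_call_id in replies")
    return replies
```

The replies are appended to the message list after the assistant message that requested them, before the next completion call.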
# Install the GPT-5.5 agentic setup skill
npx @skills-hub-ai/cli install gpt-5-5-agentic-setup
# Run it against your project
# In Claude Code or Codex CLI:
# /gpt-5-5-agentic-setup

When not to switch
GPT-5.5 is priced higher than GPT-5. The benchmark lead is real but it's measured on agentic terminal tasks. Three scenarios where staying on GPT-5 (or switching to a different model) is the better call:
High-frequency, short-context completions. If your pipeline makes thousands of small completions per day — autocomplete, short classification, docstring generation — GPT-5.5's price premium doesn't pay back. GPT-5.4 or a smaller model is the right tool.
Heavy Anthropic toolchain investment. Claude Mythos Preview is only 0.7 points behind on Terminal-Bench 2.0 (0.820 vs 0.827). If you're running Claude Code with subagents, SKILL.md compositions that use Claude-specific syntax, or Anthropic's extended thinking for planning, the integration advantages likely outweigh the benchmark gap. Mythos GA is expected to close it anyway.
Budget-constrained pipelines. gpt-5.3-codex scores 77.3% on Terminal-Bench 2.0 at a significantly lower token cost than GPT-5.5. For pipelines running multiple agents in parallel, the economics of 5.3-Codex are often better even though the per-task accuracy is lower.
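The budget trade-off can be framed as cost per solved task rather than cost per attempt. The sketch below assumes failed attempts are simply retried, attempts are independent, and benchmark scores carry over to your own tasks, none of which is guaranteed. The success rates are the Terminal-Bench 2.0 scores quoted above; the per-attempt costs are placeholders in arbitrary units, not real prices.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected spend per solved task when failures are retried until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Terminal-Bench 2.0 scores quoted in this post.
GPT_5_5_RATE = 0.827
CODEX_5_3_RATE = 0.773

# GPT-5.5 is cheaper per solved task only while its per-attempt cost stays
# below this multiple of the 5.3-Codex per-attempt cost.
break_even_multiple = GPT_5_5_RATE / CODEX_5_3_RATE

# PLACEHOLDER costs in arbitrary units; substitute your own measured spend.
codex_cost = 1.00
gpt_5_5_cost = 1.50

gpt_5_5_is_pricier_per_success = (
    cost_per_success(gpt_5_5_cost, GPT_5_5_RATE)
    > cost_per_success(codex_cost, CODEX_5_3_RATE)
)

print(round(break_even_multiple, 2))   # 1.07
print(gpt_5_5_is_pricier_per_success)  # True
```

Under these assumptions the accuracy lead alone justifies only about a 7% higher per-attempt cost; anything beyond that has to come from benefits the benchmark score does not capture, such as fewer pipeline aborts.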
0.573: average Terminal-Bench 2.0 score across all 44 evaluated models. On average, models still fail more than 40% of real agentic terminal tasks, so benchmark context matters.
The safe path for most teams: run your existing agentic pipeline with GPT-5.5 in shadow mode for a week, compare outputs and error rates against your current model, and make the switch only if the delta justifies the cost. The gains are real. They're also not magic — your pipelines still need well-designed skills, clear acceptance criteria, and proper concurrency handling. GPT-5.5 won't save a badly designed agent, but it will make a well-designed one measurably more reliable.
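A shadow-mode comparison can be as simple as running both models on the same tasks and tallying outcomes. The sketch below is one possible shape, not a prescribed harness: `current_model` and `candidate_model` are hypothetical callables that take a task and return an output, and `is_success` is your own judge. Only the current model's output would be served to users.

```python
def shadow_compare(tasks, current_model, candidate_model, is_success):
    """Run both models on the same tasks and return per-task rates."""
    stats = {"current_ok": 0, "candidate_ok": 0, "disagreements": 0}
    for task in tasks:
        current_out = current_model(task)
        try:
            # Shadow call: a failure here must never affect the served result.
            candidate_out = candidate_model(task)
        except Exception:
            candidate_out = None
        if is_success(task, current_out):
            stats["current_ok"] += 1
        if candidate_out is not None and is_success(task, candidate_out):
            stats["candidate_ok"] += 1
        if current_out != candidate_out:
            stats["disagreements"] += 1
    total = len(tasks) or 1  # avoid division by zero on an empty task list
    return {name: count / total for name, count in stats.items()}
```

Logging the disagreeing tasks themselves, not just the rate, tends to be more useful when deciding whether the delta justifies the cost.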
Written by Skills-Hub Team (AI model ecosystem coverage)