OpenAI · Model deep dive

GPT-5.5 for Agentic Coding: What the Terminal-Bench 2.0 Leader Means for Your Workflow

GPT-5.5 leads Terminal-Bench 2.0 at 82.7% and cuts hallucinations by 52.5% versus GPT-4.5. Here's what changes for developers running agentic coding pipelines — API migration, SKILL.md compatibility, MCP tool-call format, and when to switch.

82.7% Terminal-Bench 2.0 accuracy, #1 on the agentic coding benchmark

By Skills-Hub Team · AI model ecosystem coverage · June 10, 2026 · 8 min read

GPT-5.5 · Agentic Coding · Terminal-Bench

OpenAI shipped GPT-5.5 on April 23, 2026, billing it as the first fully retrained base model since GPT-4.5. For the first several weeks it was mostly a benchmark curiosity. Then Terminal-Bench 2.0 published its June leaderboard and developers noticed the number: 82.7%, a 13-point gap over Claude Opus 4.7, the previous field leader. If you run agentic pipelines that issue tool calls, write code, or coordinate multi-step terminal workflows, that number means something real.

This post is not about whether GPT-5.5 is the "best" model. It's about the practical delta: what changed in the API, what breaks in existing SKILL.md and MCP configurations, and where the gains actually show up in production pipelines — as opposed to synthetic benchmark conditions.

What changed in GPT-5.5

GPT-5.5 is described by OpenAI as their "strongest agentic coding model to date." Three changes matter for developers using it in pipelines:

52.5% fewer hallucinations than GPT-4.5. This is the headline for anyone running long-horizon agents. Hallucinations in agentic pipelines are different from hallucinations in chat — when an agent fabricates a file path or invents an API parameter and the next tool call depends on it, the error compounds. Lower hallucination rates translate directly to fewer pipeline aborts.

Retrained tool-call format. GPT-5.5 uses a revised internal representation for tool use. The public API surface is backward compatible — your existing function-calling JSON still works — but the model handles parallel tool calls differently. It now schedules concurrent calls aggressively when it can infer independence. If your agent wrapper assumes sequential tool execution and shares mutable state between calls, you'll see race conditions that weren't present on GPT-5.

128K context, native code execution. Context window is identical to GPT-5, but the model was trained with heavier emphasis on in-context code execution feedback loops. In practice, agentic tasks that require reading long files, planning changes, and then verifying them are where the performance uplift is most visible.
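To see why a lower per-step hallucination rate matters more as runs get longer, here is a rough back-of-the-envelope sketch. It assumes each step fails independently with a fixed probability, which real agents do not strictly obey, and the 2% baseline rate is a made-up illustrative number rather than a measured one. Only the 52.5% reduction comes from the figures above.

```python
# Illustrative arithmetic only: the baseline rate below is hypothetical.
REDUCTION = 0.525  # "52.5% fewer hallucinations", as quoted in the article


def run_success_probability(per_step_error: float, steps: int) -> float:
    """Chance that a run of `steps` dependent steps contains no fabricated
    value, assuming each step fails independently with `per_step_error`."""
    return (1.0 - per_step_error) ** steps


baseline_error = 0.02                               # hypothetical per-step rate
reduced_error = baseline_error * (1.0 - REDUCTION)  # same rate after the cut

for steps in (10, 50, 100):
    before = run_success_probability(baseline_error, steps)
    after = run_success_probability(reduced_error, steps)
    print(f"{steps:>3} steps: {before:.1%} -> {after:.1%} clean runs")
```

Under these assumptions the difference between the two columns widens as the step count grows, which is the compounding effect described above.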

82.7% on Terminal-Bench 2.0: #1 on the June 2026 leaderboard

52.5% fewer hallucinations vs GPT-4.5 on agentic tasks

13pt lead over Opus 4.7: 69.4% vs 82.7% on Terminal-Bench 2.0

Terminal-Bench 2.0: reading the leaderboard

Terminal-Bench 2.0 tests AI agents on real command-line workflows inside a sandboxed terminal: compiling code, training models, setting up servers, system administration, data science pipelines, and security tasks. It scores on a 0–1 accuracy scale across four categories: reasoning, tool calling, agents, and code.

The June 2026 leaderboard has 44 models evaluated. The top five:

Terminal-Bench 2.0 — top 5 (June 10, 2026)

Rank  Model                               Score
───────────────────────────────────────────────
 1    GPT-5.5 (OpenAI)                    0.827
 2    Claude Mythos Preview (Anthropic)   0.820
 3    GPT-5.3 Codex (OpenAI)              0.773
 4    Gemini 3.5 Flash (Google)           0.762
 5    GPT-5.4 (OpenAI)                    0.751

Average across all 44 models:             0.573

Two things stand out. First, the gap between rank 1 and rank 2 is just 0.7 points (0.827 vs 0.820). Claude Mythos Preview is real competition, and its score comes from a preview model. Second, the average score across all 44 models is 0.573, meaning the average model still fails roughly 43% of real terminal tasks. "Good at coding" and "good at agentic terminal workflows" remain meaningfully different things in 2026.
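The gaps are easy to misread when scores are quoted on a 0–1 scale, so here is a small sketch that converts the table's numbers into percentage points. The scores are copied from the table above; the helper name is ours.

```python
# Scores copied from the top-5 table above (0-1 accuracy scale).
scores = {
    "GPT-5.5": 0.827,
    "Claude Mythos Preview": 0.820,
    "GPT-5.3 Codex": 0.773,
    "Gemini 3.5 Flash": 0.762,
    "GPT-5.4": 0.751,
}
FIELD_AVERAGE = 0.573  # average across all 44 evaluated models


def gap_in_points(a: float, b: float) -> float:
    """Difference between two 0-1 scores, expressed in percentage points."""
    return round((a - b) * 100, 1)


ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
leader_name, leader_score = ranked[0]

for name, score in ranked[1:]:
    print(f"{leader_name} leads {name} by {gap_in_points(leader_score, score)} points")

print(f"Leader vs field average: {gap_in_points(leader_score, FIELD_AVERAGE)} points")
print(f"Share of tasks the average model fails: {1 - FIELD_AVERAGE:.1%}")
```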

API migration in 15 minutes

The model ID change is the smallest part. Switch gpt-5 to gpt-5.5 and most pipelines work on first run. The non-obvious part is parallel tool-call handling.

Before

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5",           # ← change this
    messages=[...],
    tools=[...],
)

After — with parallel-call guard

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[...],
    tools=[...],
    # GPT-5.5 issues parallel tool calls by default.
    # Set parallel_tool_calls=False if your tools share mutable state.
    parallel_tool_calls=True,   # explicit; default True on gpt-5.5
)

Tool-call format and MCP changes

For MCP servers configured with the 2026 MCP stateless spec, GPT-5.5 is fully compatible — tool definitions, result schemas, and streaming behavior all work as expected. The one catch is the tool_choice parameter semantics.

On GPT-5, "tool_choice": "auto" meant the model might skip tool calls and respond in plain text. On GPT-5.5, "auto" is more aggressive: the model will invoke tools when it sees any plausible reason to, even if a plain text response would satisfy the request. For chat interfaces, this shows up as unexpected tool invocations mid-conversation. For agentic pipelines where you want the model to always use tools, it's the right default.
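One way to keep chat surfaces from triggering tools unexpectedly is to set tool_choice per request instead of relying on "auto" everywhere. The sketch below shows a simple policy function. The surface labels, the tools_relevant flag, and the function name are assumptions for illustration; the returned values ("none", "auto", and the named-function object) are standard Chat Completions tool_choice values.

```python
from typing import Optional, Union


def pick_tool_choice(
    surface: str,
    tools_relevant: bool = False,
    force_tool: Optional[str] = None,
) -> Union[str, dict]:
    """Choose a tool_choice value for a single request.

    surface:        "agent" for pipeline steps, "chat" for conversational turns.
    tools_relevant: set by your own routing logic when a chat turn needs tools.
    force_tool:     name of the one function the model must call, if any.
    """
    if force_tool is not None:
        # Pin the request to one specific function.
        return {"type": "function", "function": {"name": force_tool}}
    if surface == "agent":
        # The aggressive "auto" behaviour is what an agent loop wants.
        return "auto"
    # Chat turns: only expose tool use when the turn has been flagged for it.
    return "auto" if tools_relevant else "none"


print(pick_tool_choice("chat"))
print(pick_tool_choice("agent"))
print(pick_tool_choice("agent", force_tool="run_tests"))
```

The result would be passed as the tool_choice argument alongside tools. "required" is also available if a step must produce at least one tool call, though it prevents the model from ending the step with a plain-text answer.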

MCP client config: no changes needed for stateless MCP. The "model" entry is an optional hint for clients that support it.

{
  "mcpServers": {
    "skills-hub": {
      "command": "npx",
      "args": ["@skills-hub-ai/mcp"],
      "model": "gpt-5.5"
    }
  }
}

If you're still on Gemini CLI's MCP config format (the old url field instead of serverUrl), clean that up before switching models. Doing both migrations at once makes it much harder to tell which change caused a failure.

SKILL.md patterns that land best

GPT-5.5's hallucination reduction is most pronounced in tasks that require accurately reading and reproducing values from a provided context: exact file paths, exact function signatures, exact error messages. Skills that give the model concrete anchors — file paths, line numbers, specific output to match — get the biggest lift.

Three SKILL.md design patterns perform well:

1. Acceptance-criteria skills

Declare explicit pass/fail criteria the model checks at the end of its work. GPT-5.5's lower hallucination rate means it's less likely to fabricate "tests pass" when they don't.

example-skill.md (acceptance-criteria pattern)

---
name: fix-flaky-test
description: Diagnoses and fixes a flaky test. Verifies the fix runs green 3× in a row.
version: 1.0.0
category: test
platforms:
  - CLAUDE_CODE
  - CODEX_CLI
---

TARGET: $ARGUMENTS

PHASE 1: Run the target test 5 times. Count fails. If < 2 fails, report "not
reliably flaky" and stop.

PHASE 2: Diagnose the root cause. Check: shared test state, timing dependencies,
network calls that should be mocked, random seeds not fixed.

PHASE 3: Apply the fix. Run the test 3 consecutive times. All three must pass.

DONE WHEN: 3/3 consecutive green runs. Report the root cause category and the
specific change made.
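The pass/fail logic in this skill is simple enough to check outside the model as well. Here is a minimal sketch of the two gates, assuming the test can be run as a command whose exit code is 0 on success. The function names are ours, and the demo command is a placeholder that always passes.

```python
import subprocess
import sys


def count_failures(cmd: list, runs: int) -> int:
    """Run `cmd` `runs` times and count the non-zero exit codes."""
    return sum(
        subprocess.run(cmd, capture_output=True).returncode != 0
        for _ in range(runs)
    )


def is_reliably_flaky(cmd: list, runs: int = 5, min_failures: int = 2) -> bool:
    """PHASE 1 gate: fewer than `min_failures` failures means stop."""
    return count_failures(cmd, runs) >= min_failures


def consecutive_green(cmd: list, required: int = 3) -> bool:
    """DONE WHEN gate: every one of `required` back-to-back runs must pass."""
    for _ in range(required):
        if subprocess.run(cmd, capture_output=True).returncode != 0:
            return False  # stop at the first red run
    return True


# Placeholder commands for the demo; swap in your real test invocation.
always_green = [sys.executable, "-c", "pass"]
always_red = [sys.executable, "-c", "import sys; sys.exit(1)"]

print("green x3:", consecutive_green(always_green))
print("flaky?  :", is_reliably_flaky(always_green, runs=2))
```

Running the gate in your harness, rather than trusting the model's report, gives you an independent check on the "3/3 green" claim.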

2. Read-verify-write skills

GPT-5.5 handles long-context verification loops better than its predecessors. Skills that read a file, make a plan, write a patch, then re-read to verify the patch landed correctly — without hallucinating the verification — get much more reliable results on 5.5 than on GPT-5.
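As a rough illustration of the loop, the sketch below reads a file, refuses to patch unless the anchor text is unambiguous, writes the change, and then re-reads from disk to confirm it landed. The function name and the demo file are hypothetical; a real skill would have the agent perform these steps through its own file tools.

```python
import tempfile
from pathlib import Path


def patch_and_verify(path: Path, old: str, new: str) -> bool:
    """Replace exactly one occurrence of `old` with `new`, then re-read the
    file to confirm the patch is what ended up on disk."""
    original = path.read_text(encoding="utf-8")

    # Plan: only proceed when the anchor matches exactly once.
    matches = original.count(old)
    if matches != 1:
        raise ValueError(f"expected exactly one match for the anchor, found {matches}")

    # Write the patch.
    expected = original.replace(old, new, 1)
    path.write_text(expected, encoding="utf-8")

    # Verify from disk instead of trusting the in-memory value.
    return path.read_text(encoding="utf-8") == expected


# Demo against a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "settings.py"
    target.write_text("TIMEOUT = 5\n", encoding="utf-8")
    ok = patch_and_verify(target, "TIMEOUT = 5", "TIMEOUT = 30")
    final_text = target.read_text(encoding="utf-8")

print(ok, final_text.strip())
```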

3. Parallel-safe subagent skills

If you're using OpenAI's Responses API or an MCP client that spawns parallel completions, GPT-5.5 handles concurrent tool use without the ghost-call problem that appeared in some GPT-5 pipelines (where the model would echo a previous tool result instead of making a fresh call).

Terminal

# Install the GPT-5.5 agentic setup skill
npx @skills-hub-ai/cli install gpt-5-5-agentic-setup

# Run it against your project
# In Claude Code or Codex CLI:
# /gpt-5-5-agentic-setup

When not to switch

GPT-5.5 is priced higher than GPT-5. The benchmark lead is real but it's measured on agentic terminal tasks. Three scenarios where staying on GPT-5 (or switching to a different model) is the better call:

High-frequency, short-context completions. If your pipeline makes thousands of small completions per day — autocomplete, short classification, docstring generation — GPT-5.5's price premium doesn't pay back. GPT-5.4 or a smaller model is the right tool.

Heavy Anthropic toolchain investment. Claude Mythos Preview is only 0.7 points behind on Terminal-Bench 2.0, so if you're running Claude Code with subagents, SKILL.md compositions that use Claude-specific syntax, or Anthropic's extended thinking for planning, the integration advantages likely outweigh the benchmark gap. Mythos GA is expected to close it anyway.

Budget-constrained pipelines. gpt-5.3-codex scores 77.3% on Terminal-Bench 2.0 at a significantly lower token cost than GPT-5.5. For pipelines running multiple agents in parallel, the economics of 5.3-Codex are often better even though the per-task accuracy is lower.
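The budget trade-off can be framed as cost per successful task instead of cost per attempt. The sketch below does that arithmetic under two loud assumptions: benchmark accuracy stands in for your real success rate, and failed attempts are simply retried. The per-attempt costs are hypothetical placeholders, not published prices.

```python
def cost_per_success(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend per completed task when failed attempts are retried and
    each attempt succeeds independently with probability `accuracy`."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_attempt / accuracy


# Accuracies are the Terminal-Bench 2.0 scores quoted above.
# Costs are hypothetical, in arbitrary units per attempt.
candidates = {
    "gpt-5.5": {"accuracy": 0.827, "cost_per_attempt": 1.00},
    "gpt-5.3-codex": {"accuracy": 0.773, "cost_per_attempt": 0.60},
}

for name, c in candidates.items():
    per_success = cost_per_success(c["cost_per_attempt"], c["accuracy"])
    print(f"{name}: {per_success:.3f} per successful task")

# Price ratio (5.3-Codex / 5.5) at which both cost the same per success.
break_even_ratio = 0.773 / 0.827
print(f"break-even price ratio: {break_even_ratio:.3f}")
```

Under these assumptions, 5.3-Codex comes out cheaper per successful task whenever its per-attempt cost is below roughly 93% of GPT-5.5's. Plug in your own measured success rates and actual prices before drawing conclusions.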

0.573: average Terminal-Bench 2.0 score across all 44 evaluated models. Most models still fail nearly half of real agentic terminal tasks, so benchmark context matters.

The safe path for most teams: run your existing agentic pipeline with GPT-5.5 in shadow mode for a week, compare outputs and error rates against your current model, and make the switch only if the delta justifies the cost. The gains are real. They're also not magic — your pipelines still need well-designed skills, clear acceptance criteria, and proper concurrency handling. GPT-5.5 won't save a badly designed agent, but it will make a well-designed one measurably more reliable.
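A shadow-mode harness can be quite small. The following is a sketch under the assumption that each pipeline can be wrapped as a callable taking a task and returning an output. The names run_shadow and ShadowReport are ours, and the two runners in the demo are stubs standing in for real model-backed pipelines.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class ShadowReport:
    total: int = 0
    primary_errors: int = 0
    shadow_errors: int = 0
    mismatches: List[Any] = field(default_factory=list)

    def error_rates(self) -> Tuple[float, float]:
        """(primary, shadow) error rates over all tasks seen so far."""
        if self.total == 0:
            return (0.0, 0.0)
        return (self.primary_errors / self.total, self.shadow_errors / self.total)


def _attempt(runner: Callable[[Any], Any], task: Any) -> Tuple[Any, bool]:
    """Run one pipeline on one task and return (output, failed)."""
    try:
        return runner(task), False
    except Exception:
        # Count the failure instead of raising so every task gets compared.
        return None, True


def run_shadow(
    tasks: Iterable[Any],
    primary: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
) -> Tuple[List[Any], ShadowReport]:
    """Serve results from `primary`; run `shadow` on the same input and only
    record how it compares."""
    report = ShadowReport()
    served = []
    for task in tasks:
        primary_out, primary_failed = _attempt(primary, task)
        shadow_out, shadow_failed = _attempt(shadow, task)
        report.total += 1
        report.primary_errors += primary_failed
        report.shadow_errors += shadow_failed
        if not primary_failed and not shadow_failed and primary_out != shadow_out:
            report.mismatches.append(task)
        served.append(primary_out)  # the shadow output is never returned
    return served, report


# Stub pipelines for the demo only.
def current_pipeline(task: int) -> int:
    return task * 2


def candidate_pipeline(task: int) -> int:
    if task == 3:
        raise RuntimeError("simulated pipeline abort")
    return -1 if task == 2 else task * 2


served, report = run_shadow([1, 2, 3], current_pipeline, candidate_pipeline)
print(report.error_rates(), report.mismatches)
```

Exact output equality is a crude comparison for model-generated text. In practice you would likely replace the `!=` check with a task-specific scorer such as a test-suite result.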

