Skip to main content

Open Source · Model release

Cohere North Mini Code: The Open-Source 30B Agent That Fits on One H100

Cohere's North Mini Code landed June 9 as the first Apache-licensed agentic coding model that genuinely outperforms much larger proprietary alternatives — 33.4 on the AI Coding Index, 256K context, single-H100 FP8 deployment. Here's the full developer playbook.

33.4Artificial Analysis Coding Index — beats 120B+ models
By Skills-Hub Team · Open-source AI coverage8 min read
CohereNorth Mini CodeOpen Source

Every month in 2026 brings a new proprietary coding model with a bigger benchmark number. What Cohere shipped on June 9 is different: North Mini Code 1.0 is Apache-licensed, fits on a single H100, and scores 33.4 on the Artificial Analysis Coding Index — enough to beat Nemotron 3 Super at 120B and Devstral 2 at 123B. For teams who want agentic coding infrastructure they can run on their own hardware without vendor lock-in, it is the most serious open-weight option available today.

30B / 3B

total / active parameters

MoE — full model, fraction of compute

256K

context window

64K max generation

2.8×

throughput vs Devstral Small 2

tokens/sec at the same hardware tier

Why open weights change the calculus

Proprietary APIs are excellent until they are not. The June 18 Gemini CLI enterprise pivot — which closed the free tier overnight — reminded every team depending on a managed inference endpoint that the economics can change without warning. Open weights don't solve every problem, but they solve the "rug-pull" problem: the model you run today on your own H100 is the same model you run in six months.

North Mini Code specifically targets what Cohere calls sovereign AI: agentic coding infrastructure that a team can deploy on-premise, in a private VPC, or air-gapped, without sending code to an external API. For healthcare companies, financial institutions, and any team with strict data-residency requirements, this unlocks workloads that were previously off-limits for AI assistance.

Architecture: 30B total, 3B active

North Mini Code uses a mixture-of-experts (MoE) architecture with 30 billion total parameters but only 3 billion active per token. The distinction matters for hardware planning: a dense 30B model needs roughly 60–70 GB of GPU memory at FP16; a 30B MoE with 3B active can run in less than 24 GB at FP8 quantization, putting it squarely on a single H100 80GB or even an H100 SXM at FP4.

The practical consequence is throughput. Because each forward pass activates a fraction of the weights, the model processes tokens faster than a comparably-scoring dense model would. Cohere reports 30% faster inter-token latency versus Devstral Small 2 at matched hardware, which matters a great deal for interactive coding agents where users feel every additional second of generation delay.

Hardware requirements at a glance
Quantization  | VRAM      | Hardware tier       | Use case
---------------------------------------------------------------------------
BF16 (full)   | ~60 GB    | 2× A100 40GB or     | Max quality, batch jobs
              |           | 1× H100 SXM 80GB    |
FP8           | ~30 GB    | 1× H100 80GB        | Recommended default
W4A16         | ~18 GB    | 1× A100 40GB or     | Cost-optimized / edge
              |           | 1× L40S             |

Hugging Face hosts all three quantization variants under CohereLabs/north-mini-code-1.0. You pick the right one for your GPU tier; the model card has benchmarks for each quantization to help you judge the quality-cost tradeoff.

Benchmarks in context

The headline number is 33.4 on the Artificial Analysis Coding Index, a composite benchmark that aggregates SWE-Bench Verified, SWE-Bench Pro, and Terminal-Bench 2.0 scores. In the open-weight tier, that clears Nemotron 3 Super (120B) and Devstral 2 (123B) — both four times larger. It is the highest score posted by any open-weight model smaller than 70B parameters.

33.4

Artificial Analysis Coding Index — highest score for any open-weight model under 70B

One important caveat Cohere is transparent about: during Artificial Analysis benchmarking, North Mini Code generated 75 million output tokens — three times the class median of 25 million. The model achieves its score partly by generating more tokens per problem, which has cost implications in practice. At $0.50 per million output tokens on the Cohere API, a three-hour agentic session that produces the expected token volume would cost roughly $3–8, competitive with but not dramatically cheaper than mid-tier proprietary models. The win comes from the self-hosted path, where compute is the only variable cost.

Running it: local, managed, and hybrid

North Mini Code is available on four inference surfaces today: HuggingFace (self-hosted), Cohere API, OpenRouter, and OpenCode. Each has a different operational profile.

Self-hosted via HuggingFace

The most flexible path. Pull the FP8 variant onto an H100 and serve it with vLLM or TGI for an OpenAI-compatible inference endpoint your existing tooling can hit without modification.

Terminal — serve with vLLM
# Install vLLM
pip install vllm

# Serve North Mini Code FP8 on port 8000
python -m vllm.entrypoints.openai.api_server \
  --model CohereLabs/north-mini-code-1.0-fp8 \
  --dtype float8 \
  --max-model-len 65536 \
  --port 8000

# Health check
curl http://localhost:8000/health

The --max-model-len 65536 cap is intentional: North Mini Code supports 256K input context but max generation is 64K. Setting this on the server prevents clients from accidentally requesting longer generation windows than the model supports.

Managed via Cohere API or OpenRouter

If you don't have an H100 available, the Cohere API and OpenRouter both offer North Mini Code as a drop-in endpoint. On OpenRouter the model ID is cohere/north-mini-code-1.0. Both endpoints are OpenAI-compatible, so any tool that accepts a baseURL override will work without further modification.

OpenRouter config for OpenCode
{
  "model": "openrouter/cohere/north-mini-code-1.0",
  "providers": {
    "openrouter": {
      "apiKey": "$OPENROUTER_API_KEY",
      "baseUrl": "https://openrouter.ai/api/v1"
    }
  }
}

Via OpenCode

OpenCode natively supports North Mini Code through its multi-provider routing layer. This is the fastest path to using it as a coding agent without additional infrastructure setup — you get LSP integration, terminal access, and multi-file editing out of the box.

North Mini Code in agentic workflows

The model was built explicitly for agentic use: sub-agent orchestration, architecture mapping, and code review are cited by Cohere as primary design targets. In practice, this shows up in two specific ways compared to general-purpose models fine-tuned on code.

First, the model handles tool-call sequences reliably — the multi-turn patterns of read → analyze → edit → run → verify that agentic coding sessions require. Second, the 256K context makes it viable for whole-repository context injection, which is especially useful for the architecture-mapping use case where you want the model to reason about dependency graphs and cross-file relationships.

SKILL.md snippet — routing to North Mini Code
---
name: north-mini-code-review
description: Routes code review tasks to a self-hosted North Mini Code instance.
version: 1.0.0
category: review
platforms:
  - CLAUDE_CODE
  - CODEX_CLI
env:
  NORTH_MINI_CODE_URL: "http://localhost:8000"
---
North Mini Code is built for sovereign AI deployment — agentic coding infrastructure that teams can run on their own terms, on-prem or in a private VPC, without sending source code to an external endpoint.
, Cohere engineering blog

One workflow that pairs particularly well is using North Mini Code for the compute-heavy phases of a pipeline (architecture analysis, large context review) and a faster proprietary model for interactive completions. The 256K input context is genuinely useful here: you can load an entire monorepo's dependency graph and ask the model to produce a migration plan without chunking.

256K

input context

whole-repo architecture mapping

64K

max generation

long diffs and migration plans

H100 at FP8

single-GPU deployment

When to choose it (and when not to)

North Mini Code is the right choice in four scenarios.

Data sovereignty requirements. If your code cannot leave your infrastructure — HIPAA, SOC 2 Type II in a private environment, government contracts — self-hosted North Mini Code is one of the only paths to a frontier-quality coding agent. Proprietary APIs are simply off the table for these workloads.

High-volume agentic batch jobs. Running hundreds of code review passes per day against a managed API at $0.50/M output tokens adds up. On your own H100, the marginal cost per token is electricity and amortized hardware. At 10M+ output tokens per month, self-hosted pays back in under a year for most teams.

Low-latency interactive coding. The MoE architecture's 30% inter-token latency advantage over comparable dense models is perceptible in interactive use. If you're running an agentic pipeline where developers are watching generation happen in real time, this matters.

Experimentation and fine-tuning. Apache 2.0 means you can fine-tune North Mini Code on your internal codebase and ship the resulting model without licensing concerns. This is not possible with any proprietary coding model today.

Where North Mini Code is not the right choice: if your team needs the absolute highest SWE-Bench scores and has no infrastructure constraints, Claude Fable 5 and GPT-5.5 still lead the field by a significant margin. North Mini Code is best-in-class among open-weight models under 70B; it is not best-in-class overall.

The practical install path: use the skills-hub north-mini-code integration skill to wire it into your Claude Code or OpenCode setup in one pass, including the vLLM server configuration and token limit guardrails.

Terminal
# Install the North Mini Code integration skill
npx @skills-hub-ai/cli install north-mini-code

# Skill configures your local vLLM endpoint or OpenRouter fallback
# and wires the model into your .claude/settings.json

The open-source coding model tier is maturing faster than anyone expected. North Mini Code landing at 33.4 on the AI Coding Index in June 2026, outperforming models four times its size, sets a benchmark that will force the entire field — open and closed — to keep moving. If you run infrastructure you control, it earns a serious look.

Written by

Skills-Hub Team

Open-source AI coverage

Skills-Hub is the open registry for AI coding skills, with SKILL.md files synced daily from Anthropic, Google, Microsoft, and 90+ official sources. Free + MIT.

Continue reading