
Closing the Gap: How Open-Weight Models Caught the Frontier

Three years ago the best downloadable model trailed the frontier by more than a year. In 2026 that lag is measured in months, and a wave of open-weight labs, most of them Chinese, is compressing the distance faster than almost anyone predicted.

Olympia Tech · 10 min read

Ask which large language model is best and, until recently, the answer was simple: whichever one OpenAI, Anthropic or Google was charging for that quarter. Open-weight models, the ones you can download and run on your own hardware, were a respectable second tier, always a year or so behind.

That assumption is now stale. Through 2026 the lag between the best open-weight model and the closed frontier collapsed to a handful of months, and on the benchmarks that drive most enterprise value it has effectively vanished. The surprise is who did it: five open-weight families, four of them Chinese, reached frontier quality at almost the same moment.

The honest version of the story is more interesting than the hype. Here is where the gap actually stands, measured by groups that do not sell models, and what it means for anyone choosing where to spend.

[Figure: Capability over time, the closing distance. Closed frontier vs. best open weight, Dec 2023 to May 2026. The gap narrows from 16 months to about 4 months.]
~4 mo: open-weight lag behind closed SOTA (Epoch AI, May 2026)
3.3%: top closed vs. top open capability gap (Stanford AI Index 2026)
60/100: highest intelligence index, the closed frontier ceiling
51/100: best open weight, GLM-5.2, at roughly one-sixth the cost
01

The gap, measured

The “open source is catching up” story is easy to overstate. The honest version is more interesting, and the numbers come from groups that do not sell models.

Epoch AI, which tracks capability with a composite index, found that through early 2026 the best open-weight models trailed the closed frontier by an average of just four months. That is up slightly from three months in late 2025, but a world away from the roughly 16 months it took Llama 3.1 405B to match the original GPT-4.

Stanford's 2026 AI Index puts the raw capability gap between the single best closed model and the single best open model at just 3.3%. On knowledge benchmarks like MMLU, a 17.5-point chasm from late 2023 has effectively closed to zero. The lead now lives in the hardest places: frontier reasoning, sustained multi-step agentic work, and safety alignment.

A crucial caveat runs the other way. Epoch notes the true gap may be larger than it looks: closed labs keep their most capable models private, and open models tend to hill-climb public benchmarks harder, so they can look better on the tests everyone can see than on the ones they cannot.

16 mo → 4 mo: how far the open-weight lag has collapsed since the GPT-4 era, per Epoch AI's capability index.
3.3%: top closed vs. top open capability gap in Stanford's 2026 AI Index, down from double digits.
2.7%: how far the leading U.S. model led the leading Chinese model in March 2026, a lead that has changed hands more than once.
02

Where everyone actually stands

Artificial Analysis runs its own independent evaluations rather than trusting lab-reported numbers. Its Intelligence Index v4.1 aggregates nine benchmarks across agents, coding, reasoning and knowledge.

Claude Fable 5 · closed · 60
Claude Opus 4.8 · closed · 56
GPT-5.5 (xhigh) · closed · 55
Claude Sonnet 5 · closed · 53
GLM-5.2 (max) · open · 51
MiniMax-M3 · open · 44
DeepSeek V4 Pro · open · 44

Artificial Analysis Intelligence Index v4.1, mid-2026, scale 0 to 100. Kimi K2.6 and Alibaba's Qwen 3.7 Max sit just behind the plotted open models; Gemini 3.1 Pro competes at the very top of the closed tier. Index versions and harnesses shift the exact ordering week to week.

A single composite hides the more striking story: on the specific tasks that generate most enterprise value, the best open model already trades blows with the frontier. Z.ai's GLM-5.2 is the clearest example.

Benchmark | GLM-5.2 (open) | GPT-5.5 | Claude Opus 4.8
Terminal-Bench 2.1: shell agents | 81.0 | not listed | 85.0
Humanity's Last Exam (with tools) | 54.7 | 52.2 | 57.9
SWE-bench Pro: real software fixes | 62.1 | 58.6 | 59.2
Cost per task vs. frontier | ~1⁄6× | not listed | not listed

GLM-5.2 lands within a few points of Anthropic's flagship on shell-agent reliability, tops the field on SWE-bench Pro real-world software fixes, and beats GPT-5.5 on Humanity's Last Exam, all at roughly one-sixth the price. Sources: Hugging Face (Z.ai), VentureBeat, Semgrep.

03

Who's on the field

Five independent open families, DeepSeek, Qwen, Kimi, GLM and Mistral, reached frontier quality at roughly the same time. That simultaneity is what makes the shift structural rather than a one-off. Chinese labs now hold four of the top five open-weight slots.

The open insurgents
CN
GLM-5.2
Z.ai · Zhipu AI

The current open-weight champion. A 744B-parameter Mixture-of-Experts model with a 1M-token context, purpose-built for long-horizon coding. Trained entirely on Huawei Ascend silicon, with no Nvidia dependency.

MIT license · 744B / 40B active · 1M context
CN
DeepSeek V4
DeepSeek

Back near the top with a two-tier lineup: V4 Pro (1.6T / 49B active) for maximum capability, V4 Flash for cheap, fast inference. Leads open models on real-work agentic benchmarks, but hallucinates when unsure.

MIT license · Pro + Flash · 1M context
CN
Qwen 3.x
Alibaba

The broadest family in open weights, from 0.8B pocket models to the 3.7 Max flagship. Unmatched multilingual strength (Chinese, Japanese, Korean) and the deepest deployment footprint of any open lineup.

Apache 2.0 · Full size range · Multimodal
CN
Kimi K2.6
Moonshot AI

A trillion-parameter vision-language model tuned for agent swarms: a plan-write-test-debug loop that can run for days and spin up hundreds of collaborating agents. Strongest open model on graduate-level science reasoning.

Modified MIT · 1T / 32B active · Multimodal
CN
MiniMax M3
MiniMax

Pitched as the first open model to fuse frontier coding, a 1M-token context and native multimodality. Its MSA sparse-attention design cuts long-context compute to roughly one-twentieth of the previous generation.

Open weights · 1M context · Native multimodal
US
Llama 5
Meta

Once the model that defined open source, now the Western anchor playing catch-up. Llama 5 packs 600B+ params, but on pure benchmarks it trails the leading Chinese open models, and its community license is not OSI-approved.

Community license · 600B+ · Largest ecosystem
EU
Mistral Large 3
Mistral · France

Europe's flagship and the standard-bearer for sovereign AI that stays on your hardware and under your laws. A 675B sparse MoE with 200+ language support: behind on raw scores, but the trusted default inside the EU.

Apache 2.0 · 675B / 41B active · 256K context
US
Nemotron 3 · gpt-oss
NVIDIA · OpenAI

The rest of the long tail: NVIDIA's Nemotron line and OpenAI's gpt-oss weights keep a Western open presence alive, alongside Google's Gemma 4, the sole Western entry in the open top tier.

Open weights · Efficiency-tuned · Western tail
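The "total / active" pairs on these cards are easier to compare as fractions. A minimal sketch, using only the figures quoted above and assuming "1.6T" and "1T" mean 1,600B and 1,000B; the dictionary and function names are illustrative and not taken from any vendor:

```python
# Total and active parameter counts in billions, as quoted in the cards above.
# Assumption: "1.6T" and "1T" are read as 1,600B and 1,000B.
PARAMS_B = {
    "GLM-5.2": (744, 40),
    "DeepSeek V4 Pro": (1600, 49),
    "Kimi K2.6": (1000, 32),
    "Mistral Large 3": (675, 41),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a model's parameters that are used for each token."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected total_b > 0 and 0 <= active_b <= total_b")
    return active_b / total_b


if __name__ == "__main__":
    for name, (total_b, active_b) in PARAMS_B.items():
        share = active_fraction(total_b, active_b)
        print(f"{name}: {share:.1%} of parameters active per token")
```

On these figures each model activates roughly 3 to 6 percent of its weights per token. In a sparse design, compute per token tracks the active count, while memory still has to hold the full set of weights.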
The closed frontier
US
Claude Opus 4.8 · Fable 5
Anthropic

Holds the top of nearly every serious reasoning and agentic-coding leaderboard. The Mythos-tier Fable 5 sits above Opus as the current intelligence ceiling.

Closed · Index 56–60 · Frontier reasoning
US
GPT-5.5
OpenAI

The strongest all-around reasoning model for many workloads: complex multi-step problems, long-horizon agents and architectural code. Priced at a steep premium over open weights.

Closed · Index 55 · All-round
US
Gemini 3.1 Pro
Google DeepMind

Google's frontier line trades the top spot with Anthropic and OpenAI depending on the benchmark, and is often cited as the strongest in head-to-head coding-arena play.

Closed · Top tier · Coding arena
04

Spotlight: the model that made everyone look

Z.ai · released June 16, 2026 · MIT license

GLM-5.2

When security firm Semgrep dropped GLM-5.2 into its cyber benchmark on a whim, the open-weight model, running with none of Semgrep's custom scaffolding, beat a frontier coding agent at a genuinely hard security-research task. Semgrep's own framing is measured: the harness still matters more than the model. But that an open model at one-sixth the cost could win at all is the story.

Architecturally it leans on IndexShare, reusing one lightweight indexer across every four sparse-attention layers to cut per-token compute by 2.9× at full 1M context, plus an upgraded multi-token-prediction layer for faster decoding. And because it was trained on Huawei Ascend chips, it sidesteps U.S. export controls entirely.
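The sharing idea can be pictured with a toy loop. What follows is a sketch under stated assumptions, not Z.ai's implementation: it assumes the indexer's job is to pick the top-k cached tokens each sparse-attention layer attends to, and that layers are grouped in consecutive runs of four. The function names, the eight-layer depth and the scores are all hypothetical.

```python
from typing import Callable, List, Tuple


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_tokens_per_layer(
    num_layers: int,
    share: int,
    score_fn: Callable[[int], List[float]],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Pick attended tokens for each layer, running the indexer once per group.

    score_fn stands in for the lightweight indexer (hypothetical): given a
    layer number it returns one relevance score per cached token.
    """
    selections: List[List[int]] = []
    indexer_calls = 0
    current: List[int] = []
    for layer in range(num_layers):
        if layer % share == 0:  # first layer of a group: run the indexer
            current = top_k_indices(score_fn(layer), k)
            indexer_calls += 1
        selections.append(current)  # later layers in the group reuse the result
    return selections, indexer_calls


def toy_scores(layer: int) -> List[float]:
    """Made-up relevance scores over four cached tokens."""
    return [0.1, 0.9, 0.3, 0.7] if layer < 4 else [0.8, 0.2, 0.6, 0.4]


if __name__ == "__main__":
    picks, calls = select_tokens_per_layer(num_layers=8, share=4, score_fn=toy_scores, k=2)
    print(f"indexer ran {calls} times for 8 layers")
    print(picks)
```

The sketch only counts indexer runs (two instead of eight here). It makes no attempt to reproduce the 2.9× figure, which depends on details not given above.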

Parameters: 744B / 40B active
Context: 1M tokens
Terminal-Bench 2.1: 81.0 vs. 85.0 for Claude Opus 4.8
Cost vs. frontier: ~1⁄6×
“An open model no longer just gets close to the frontier. On the benchmarks that matter for real engineering, it matches or exceeds it.”
The emerging consensus across independent 2026 evaluations
05

The dimension where open already won

Capability is a close race. Price is not. This is why “open vs. closed” has quietly become “how do we orchestrate both.”

Monthly cost for a typical RAG workload (100k requests, 5k tokens in / 1k tokens out):
GPT-5.5 (closed frontier): $2,275
DeepSeek V3.2 (open, hosted): $168
Illustrative hosted-inference rates, April 2026. Roughly a 13× cost difference for comparable output on the broad middle of production tasks. Self-hosting at high volume drops marginal cost toward zero.
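The arithmetic behind a comparison like this is simple enough to rerun with your own numbers. A minimal sketch: the workload shape and the two monthly totals come from the figures above, while the per-million-token rates in the example call are placeholders chosen for illustration, not any provider's actual prices.

```python
def monthly_cost(
    requests: int,
    in_tokens: int,
    out_tokens: int,
    in_rate_per_m: float,
    out_rate_per_m: float,
) -> float:
    """Monthly spend in dollars for a metered, per-token API.

    Rates are dollars per million tokens; token counts are per request.
    """
    input_cost = requests * in_tokens / 1_000_000 * in_rate_per_m
    output_cost = requests * out_tokens / 1_000_000 * out_rate_per_m
    return input_cost + output_cost


if __name__ == "__main__":
    # Workload from the comparison above: 100k requests, 5k in / 1k out.
    requests, in_tokens, out_tokens = 100_000, 5_000, 1_000
    print(f"input tokens per month:  {requests * in_tokens:,}")
    print(f"output tokens per month: {requests * out_tokens:,}")

    # Placeholder rates (hypothetical): $2 per million in, $8 per million out.
    print(f"example bill: ${monthly_cost(requests, in_tokens, out_tokens, 2.0, 8.0):,.0f}")

    # Ratio of the two monthly totals quoted above.
    print(f"cost ratio: {2275 / 168:.1f}x")
```

The workload comes to 500 million input tokens and 100 million output tokens a month, and the two quoted totals differ by a factor of about 13.5.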

Independent benchmarking consistently shows well-optimized open models delivering 85 to 90% of closed-model performance on enterprise tasks while cutting cost 60 to 84% for high-volume work. For summarization, extraction, classification, code assistance and content generation, the quality difference has shrunk to near-irrelevance.

There is a second-order effect: open weights set a price ceiling. When a model like GLM-5.2 delivers comparable results for pennies, closed providers cannot hold premium per-token pricing indefinitely. Open competition is the reason your closed-model bill keeps shrinking, even if you never run an open model.

06

Why teams actually own the stack

Capability convergence is why open is now viable. What actually makes teams switch is the set of things you can only do when you hold the weights instead of renting them.

Weights & fine-tuning
Open weights

Full weights in hand: LoRA, QLoRA or full-parameter SFT on your own data.

Closed frontier

Wrapper APIs only. No architecture access, and tuning is limited to what the vendor exposes.

Inference control
Open weights

Speculative decoding, prefix caching and custom quantization are all yours to tune.

Closed frontier

A black box: a silent vendor update can shift behaviour or latency under a live workload.

Data & sovereignty
Open weights

Runs air-gapped or on-prem; nothing leaves your network, which suits HIPAA and the EU AI Act.

Closed frontier

Prompts and data leave for third-party servers, a standing compliance and residency risk.

Cost model
Open weights

Fixed infrastructure, so marginal cost falls toward zero at volume, and you meter every token.

Closed frontier

Metered per token, and reasoning tokens bill at the full output rate, often several times the visible length.

Portability & lock-in
Open weights

Runs on any hardware or cloud, and you freeze the exact version you shipped.

Closed frontier

Tied to one vendor's roadmap, price changes and model deprecations.
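The cost-model row implies a break-even point: below some monthly volume the metered API is cheaper, and above it fixed infrastructure wins. A rough sketch of that calculation, assuming a flat monthly infrastructure bill; the $10,000 figure is purely hypothetical, and the per-request API cost is derived from the $2,275 per 100k requests quoted earlier.

```python
import math


def break_even_requests(
    fixed_monthly_cost: float,
    api_cost_per_request: float,
    self_host_marginal_cost: float = 0.0,
) -> int:
    """Smallest monthly request count at which self-hosting costs no more than the API."""
    saving_per_request = api_cost_per_request - self_host_marginal_cost
    if saving_per_request <= 0:
        raise ValueError("self-hosting never breaks even at these rates")
    return math.ceil(fixed_monthly_cost / saving_per_request)


if __name__ == "__main__":
    api_cost = 2275 / 100_000  # dollars per request, from the closed-model bill above
    fixed = 10_000.0           # hypothetical monthly GPU and ops cost
    volume = break_even_requests(fixed, api_cost)
    print(f"break-even at about {volume:,} requests per month")
```

With these placeholder inputs the crossover sits a little under 440,000 requests a month. Staffing, utilization and hardware depreciation would all move it in practice.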

07

Where the gap holds, and the fine print

“Catching up” is not “caught up.” Four honest asterisks on the whole story.

The harness beats the model

The biggest performance swings come from tooling and scaffolding, not model choice. A frontier model in a weak agent loop loses to an open model in a strong one, and vice versa. Raw scores travel poorly to production.

Public-benchmark hill-climbing

Open models tend to optimize harder for the tests everyone can see. Epoch warns they often score worse on private benchmarks, so the visible gap may understate the real one. Closed labs also keep their best models unreleased.

Confident hallucination

DeepSeek's V4 line posts high hallucination rates: when it does not know an answer, it almost always answers anyway. Frontier alignment and calibration remain a real, if narrowing, closed-model advantage.

The frontier keeps moving

The lag ticked from three months to four between late 2025 and mid-2026. Epoch argues the compute-investment gulf is widening, not narrowing, so parity on any given week is not the same as parity for good.

08

A three-way world

The open surge is also a geographic story. The U.S. still ships the most notable models; China ships the most impactful open ones; Europe writes the rules.

US · Breakthrough performance

50 notable models released in 2025, the most of any nation. Home to every closed frontier lab, but its open champions (Llama, gpt-oss, Gemma) now trail China's on raw benchmarks.

CN · Cost-efficient scaling · open lead

4 of the top 5 open-weight models are Chinese. DeepSeek's R1 sparked the reset in early 2025; Z.ai, Alibaba, Moonshot and MiniMax carried it, increasingly on domestic Huawei silicon, routing around export controls.

EU · Rules & sovereignty

3 notable models in 2025, led by Mistral. Behind on scores, but the EU AI Act and data-sovereignty rules have made European open weights the trusted default for regulated, on-premises deployment.

The bottom line

Not “open or closed.” Both, on purpose.

The 2026 consensus among practitioners is not a winner. It is a routing decision. Frontier closed models still own the hardest reasoning, the most reliable long-horizon agents, and the best-calibrated safety, worth paying for on high-stakes work. Open weights own cost, privacy, control and customization, and are now good enough that for most production tasks the capability difference is a rounding error against an order-of-magnitude price cut.
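A routing decision of this kind can start as a few explicit rules. The sketch below is one hypothetical policy, not a recommendation: the task categories, the field names and the choice to keep sensitive data on open weights even for hard tasks are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Task kinds where, per the argument above, the closed frontier still leads.
FRONTIER_KINDS = {"hard_reasoning", "long_horizon_agent"}


@dataclass(frozen=True)
class Task:
    kind: str                     # e.g. "summarization", "extraction", "hard_reasoning"
    high_stakes: bool = False     # errors are costly enough to justify a premium
    sensitive_data: bool = False  # data must not leave your own network


def route(task: Task) -> str:
    """Return "open" or "closed" for a task under this illustrative policy."""
    if task.sensitive_data:
        return "open"    # privacy wins: run on weights you host yourself
    if task.high_stakes or task.kind in FRONTIER_KINDS:
        return "closed"  # pay for the frontier where it still leads
    return "open"        # default: the broad middle of production work


if __name__ == "__main__":
    for task in (
        Task("summarization"),
        Task("hard_reasoning"),
        Task("extraction", high_stakes=True),
        Task("long_horizon_agent", sensitive_data=True),
    ):
        print(task.kind, "->", route(task))
```

Real routers usually add a fallback, for example retrying on the closed model when an open model's answer fails a validation check, but the ordering of the rules is where the policy lives.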

What changed is that the frontier is no longer a closed club. Five open families reached it at once, a new state-of-the-art open model lands every few weeks, and the lag is counted in months. Whether that lag keeps shrinking or the compute gulf reopens it is the single most important number to watch.

Sources: Epoch AI (open vs. closed capability gap) · Stanford HAI 2026 AI Index · Artificial Analysis (Intelligence Index v4.1) · Hugging Face (GLM-5.2) · VentureBeat · Semgrep (cyber benchmarks) · DeepInfra. Figures reflect public benchmarks as of early July 2026 and shift week to week; this briefing synthesizes third-party reporting and is not investment or purchasing advice.

One thread runs underneath this whole story: silicon. GLM-5.2 trained on Huawei Ascend chips precisely because the physical supply chain, not the software, is where export controls bite. For that side of the race, from lithography to power, see our core sample of the AI stack's full value chain.
