
Three years ago the best downloadable model trailed the frontier by more than a year. In 2026 that lag is measured in months, and a wave of open-weight labs, most of them Chinese, is compressing the distance faster than almost anyone predicted.
Ask which large language model is best and, until recently, the answer was simple: whichever one OpenAI, Anthropic or Google was charging for that quarter. Open-weight models, the ones you can download and run on your own hardware, were a respectable second tier, always a year or so behind.
That assumption is now stale. Through 2026 the lag between the best open-weight model and the closed frontier collapsed to a handful of months, and on the benchmarks that drive most enterprise value it has effectively vanished. The surprise is who did it: five open-weight families, four of them Chinese, reached frontier quality at almost the same moment.
The “open source is catching up” story is easy to overstate, and the honest version is more interesting than the hype. Here is where the gap actually stands, measured by groups that do not sell models, and what it means for anyone choosing where to spend.
As recently as 2024, the best model was one you paid OpenAI, Anthropic or Google for, and open weights were a respectable second tier. Epoch AI, tracking capability with its composite index, found that through early 2026 the best open-weight models trailed the closed frontier by an average of just four months. That is up slightly from three months in late 2025, but a world away from the roughly 16 months Llama 3.1 405B once took to match the original GPT-4.
Stanford's 2026 AI Index puts the raw capability gap between the single best closed model and the single best open model at just 3.3%. On knowledge benchmarks like MMLU, a 17.5-point chasm from late 2023 has effectively closed to zero. The lead now lives in the hardest places: frontier reasoning, sustained multi-step agentic work, and safety alignment.
A crucial caveat runs the other way. Epoch notes the true gap may be larger than it looks: closed labs keep their most capable models private, and open models tend to hill-climb public benchmarks harder, so they can look better on the tests everyone can see than on the ones they cannot.
Artificial Analysis runs its own independent evaluations rather than trusting lab-reported numbers. Its Intelligence Index v4.1 aggregates nine benchmarks across agents, coding, reasoning and knowledge.
Artificial Analysis Intelligence Index v4.1, mid-2026, scale 0 to 100. Kimi K2.6 and Alibaba's Qwen 3.7 Max sit just behind the plotted open models; Gemini 3.1 Pro competes at the very top of the closed tier. Index versions and harnesses shift the exact ordering week to week.
A single composite hides the more striking story: on the specific tasks that generate most enterprise value, the best open model already trades blows with the frontier. Z.ai's GLM-5.2 is the clearest example.
GLM-5.2 lands within a few points of Anthropic's flagship on shell-agent reliability, tops the field on SWE-bench Pro real-world software fixes, and beats GPT-5.5 on Humanity's Last Exam, all at roughly one-sixth the price. Sources: Hugging Face (Z.ai), VentureBeat, Semgrep.
Five independent open families, DeepSeek, Qwen, Kimi, GLM and Mistral, reached frontier quality at roughly the same time. That simultaneity is what makes the shift structural rather than a one-off. Chinese labs now hold four of the top five open-weight slots.

The current open-weight champion. A 744B-parameter Mixture-of-Experts model with a 1M-token context, purpose-built for long-horizon coding. Trained entirely on Huawei Ascend silicon, with no Nvidia dependency.

Back near the top with a two-tier lineup: V4 Pro (1.6T / 49B active) for maximum capability, V4 Flash for cheap, fast inference. Leads open models on real-work agentic benchmarks, but hallucinates when unsure.

The broadest family in open weights, from 0.8B pocket models to the 3.7 Max flagship. Unmatched multilingual strength (Chinese, Japanese, Korean) and the deepest deployment footprint of any open lineup.

A trillion-parameter vision-language model tuned for agent swarms: a plan-write-test-debug loop that can run for days and spin up hundreds of collaborating agents. Strongest open model on graduate-level science reasoning.

Pitched as the first open model to fuse frontier coding, a 1M-token context and native multimodality. Its MSA sparse-attention design cuts long-context compute to roughly one-twentieth of the previous generation.

Once the model that defined open source, now the Western anchor playing catch-up. Llama 5 packs 600B+ params, but on pure benchmarks it trails the leading Chinese open models, and its community license is not OSI-approved.

Europe's flagship and the standard-bearer for sovereign AI that stays on your hardware and under your laws. A 675B sparse MoE with 200+ language support: behind on raw scores, but the trusted default inside the EU.

The rest of the long tail: NVIDIA's Nemotron line and OpenAI's gpt-oss weights keep a Western open presence alive, alongside Google's Gemma 4, the sole Western entry in the open top tier.

Holds the top of nearly every serious reasoning and agentic-coding leaderboard. The Mythos-tier Fable 5 sits above Opus as the current intelligence ceiling.

The strongest all-around reasoning model for many workloads: complex multi-step problems, long-horizon agents and architectural code. Priced at a steep premium over open weights.

Google's frontier line trades the top spot with Anthropic and OpenAI depending on the benchmark, and is often cited as the strongest in head-to-head coding-arena play.
When security firm Semgrep dropped GLM-5.2 into its cyber benchmark on a whim, the open-weight model, running with none of Semgrep's custom scaffolding, beat a frontier coding agent at a genuinely hard security-research task. Semgrep's own framing is measured: the harness still matters more than the model. The story is that an open model at one-sixth the cost could win at all.
Architecturally, GLM-5.2 leans on IndexShare, which reuses one lightweight indexer across every four sparse-attention layers to cut per-token compute by 2.9× at the full 1M-token context, plus an upgraded multi-token-prediction layer for faster decoding. And because it was trained on Huawei Ascend chips, it sidesteps U.S. export controls entirely.
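The layer-sharing idea can be illustrated with a toy sketch. It assumes, purely for illustration, that an indexer scores every cached token and keeps the top-k positions, and that the selection is recomputed only at the first layer of each group of four. The names `top_k_indices`, `run_layers` and `toy_indexer`, the scoring rule and the sizes are all hypothetical stand-ins, not Z.ai's implementation.

```python
from typing import Callable, List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(
    num_layers: int,
    share_every: int,
    indexer: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Pick the attended token positions for each layer.

    The indexer runs only at the first layer of every `share_every`-layer
    group; the other layers in the group reuse that selection.
    """
    selections: List[List[int]] = []
    indexer_calls = 0
    current: List[int] = []
    for layer in range(num_layers):
        if layer % share_every == 0:
            current = top_k_indices(indexer(layer), k)
            indexer_calls += 1
        selections.append(current)
    return selections, indexer_calls


def toy_indexer(layer: int) -> List[float]:
    # Hypothetical relevance scores for 8 cached tokens (not a real model).
    return [((pos * 7 + layer) % 8) / 8 for pos in range(8)]


selections, calls = run_layers(num_layers=8, share_every=4, indexer=toy_indexer, k=3)
print("indexer calls with sharing:", calls)
print("indexer calls without sharing:", run_layers(8, 1, toy_indexer, 3)[1])
```

With eight layers and a group size of four, the toy indexer runs twice instead of eight times. The sketch counts indexer calls only; the 2.9× figure quoted above is a whole-model number and cannot be derived from it.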
An open model no longer just gets close to the frontier. On the benchmarks that matter for real engineering, it matches or exceeds it.
Capability is a close race. Price is not. This is why “open vs. closed” has quietly become “how do we orchestrate both.”
Independent benchmarking consistently shows well-optimized open models delivering 85 to 90% of closed-model performance on enterprise tasks while cutting cost 60 to 84% for high-volume work. For summarization, extraction, classification, code assistance and content generation, the quality difference has shrunk to near-irrelevance.
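A rough way to read those two ranges together is quality delivered per dollar, relative to a closed model scored at 1.0. The sketch below is back-of-envelope arithmetic, not a benchmark result. It assumes the low ends of both ranges go together and the high ends go together, a pairing the source figures do not state, and `quality_per_dollar` is a hypothetical helper.

```python
def quality_per_dollar(relative_quality: float, cost_reduction: float) -> float:
    """Quality per unit of spend, relative to a closed model at 1.0.

    relative_quality: open-model performance as a fraction of closed (e.g. 0.85).
    cost_reduction:   fraction of spend saved by switching (e.g. 0.60).
    """
    if not 0 <= cost_reduction < 1:
        raise ValueError("cost_reduction must be in [0, 1)")
    relative_cost = 1.0 - cost_reduction
    return relative_quality / relative_cost


# Assumed pairing: low end with low end, high end with high end.
conservative = quality_per_dollar(0.85, 0.60)
optimistic = quality_per_dollar(0.90, 0.84)
print(f"conservative: {conservative:.1f}x, optimistic: {optimistic:.1f}x")
```

Under that pairing the open model delivers roughly 2.1 to 5.6 times the quality per dollar, which is why a 10 to 15 point performance deficit can still be the cheaper choice for high-volume work.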
There is a second-order effect: open weights set a price ceiling. When a model like GLM-5.2 delivers comparable results for pennies, closed providers cannot hold premium per-token pricing indefinitely. Open competition is the reason your closed-model bill keeps shrinking, even if you never run an open model.
Capability convergence is why open is now viable. What makes teams switch is everything you can only do when you hold the weights instead of renting them.
Open: full weights in hand, so you can run LoRA, QLoRA or full-parameter SFT on your own data.
Closed: wrapper APIs only. No architecture access, and tuning is limited to what the vendor exposes.

Open: speculative decoding, prefix caching and custom quantization are all yours to tune.
Closed: a black box, where a silent vendor update can shift behaviour or latency under a live workload.

Open: runs air-gapped or on-prem; nothing leaves your network, which suits HIPAA and the EU AI Act.
Closed: prompts and data leave for third-party servers, a standing compliance and residency risk.

Open: fixed infrastructure, so marginal cost falls toward zero at volume, and you meter every token.
Closed: metered per token, and reasoning tokens bill at the full output rate, often several times the visible length.

Open: runs on any hardware or cloud, and you freeze the exact version you shipped.
Closed: tied to one vendor's roadmap, price changes and model deprecations.
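The cost contrast above, fixed infrastructure against per-token metering with billed reasoning tokens, can be turned into a break-even estimate. The sketch below is a simplification: it ignores input-token charges, staffing and utilisation. Every number in it is a made-up placeholder rather than a real price, and `reasoning_multiplier` is a hypothetical parameter meaning hidden reasoning tokens per visible output token.

```python
def api_monthly_cost(
    visible_output_tokens: float,
    price_per_million_output: float,
    reasoning_multiplier: float,
) -> float:
    """Metered bill when hidden reasoning tokens are charged at the output rate."""
    billed_tokens = visible_output_tokens * (1 + reasoning_multiplier)
    return billed_tokens / 1_000_000 * price_per_million_output


def break_even_tokens(
    fixed_monthly_cost: float,
    price_per_million_output: float,
    reasoning_multiplier: float,
) -> float:
    """Visible output tokens per month at which self-hosting matches the API bill."""
    cost_per_visible_token = (
        price_per_million_output * (1 + reasoning_multiplier) / 1_000_000
    )
    return fixed_monthly_cost / cost_per_visible_token


# Placeholder inputs: $20,000/month of hardware, $10 per million output
# tokens, and three hidden reasoning tokens for every visible one.
threshold = break_even_tokens(20_000, 10.0, 3.0)
print(f"break-even at {threshold:,.0f} visible output tokens per month")
print(f"API bill at that volume: ${api_monthly_cost(threshold, 10.0, 3.0):,.0f}")
```

Above the break-even volume the fixed-cost deployment is cheaper on these assumptions, and below it the metered API is. A higher reasoning multiplier pulls the threshold down, because each visible token costs more to rent.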
“Catching up” is not “caught up.” Four honest asterisks on the whole story.
The biggest performance swings come from tooling and scaffolding, not model choice. A frontier model in a weak agent loop loses to an open model in a strong one, and vice versa. Raw scores travel poorly to production.
Open models tend to optimize harder for the tests everyone can see. Epoch warns they often score worse on private benchmarks, so the visible gap may understate the real one. Closed labs also keep their best models unreleased.
DeepSeek's V4 line posts high hallucination rates: when it does not know an answer, it almost always answers anyway. Frontier alignment and calibration remain a real, if narrowing, closed-model advantage.
The lag ticked from three months to four between late 2025 and mid-2026. Epoch argues the compute-investment gulf is widening, not narrowing, so parity on any given week is not the same as parity for good.
The open surge is also a geographic story. The U.S. still ships the largest number of notable models; China ships the most impactful open ones; Europe writes the rules.
Notable models released in 2025, the most of any nation. Home to every closed frontier lab, but its open champions (Llama, gpt-oss, Gemma) now trail China's on raw benchmarks.
Four of the top five open-weight models are Chinese. DeepSeek's R1 sparked the reset in early 2025; Z.ai, Alibaba, Moonshot and MiniMax carried it, increasingly on domestic Huawei silicon, routing around export controls.
Notable models in 2025, led by Mistral. Behind on scores, but the EU AI Act and data-sovereignty rules have made European open weights the trusted default for regulated, on-premises deployment.
The 2026 consensus among practitioners is not a winner. It is a routing decision. Frontier closed models still own the hardest reasoning, the most reliable long-horizon agents, and the best-calibrated safety, worth paying for on high-stakes work. Open weights own cost, privacy, control and customization, and are now good enough that for most production tasks the capability difference is a rounding error against an order-of-magnitude price cut.
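A routing decision of this kind can be sketched as a small rule table. The task categories are taken from the routine workloads named earlier in this piece; the `Task` fields, the tier labels and the order of the rules are hypothetical choices for illustration, not a practitioner standard.

```python
from dataclasses import dataclass

# Routine workloads where the quality difference is described as near-irrelevant.
ROUTINE_KINDS = {
    "summarization",
    "extraction",
    "classification",
    "code_assistance",
    "content_generation",
}


@dataclass(frozen=True)
class Task:
    kind: str
    high_stakes: bool = False     # hardest reasoning, long-horizon agents
    sensitive_data: bool = False  # data that must not leave the network


def route(task: Task) -> str:
    """Pick a model tier for a task (assumed rule order: privacy, stakes, cost)."""
    if task.sensitive_data:
        return "open-self-hosted"   # privacy and control win outright
    if task.high_stakes or task.kind not in ROUTINE_KINDS:
        return "closed-frontier"    # pay for reliability where it matters
    return "open-self-hosted"       # routine work goes to the cheap tier


print(route(Task("summarization")))
print(route(Task("architecture_review", high_stakes=True)))
print(route(Task("extraction", sensitive_data=True)))
```

Putting the privacy rule first reflects one possible priority: a team with no residency constraints might instead check stakes first, or add a fallback that escalates to the closed tier when the open model's answer fails a check.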
What changed is that the frontier is no longer a closed club. Five open families reached it at once, a new state-of-the-art open model lands every few weeks, and the lag is counted in months. That lag is the single most important number to watch: it will show whether the gap keeps shrinking or the compute gulf reopens it.
Sources: Epoch AI (open vs. closed capability gap) · Stanford HAI 2026 AI Index · Artificial Analysis (Intelligence Index v4.1) · Hugging Face (GLM-5.2) · VentureBeat · Semgrep (cyber benchmarks) · DeepInfra. Figures reflect public benchmarks as of early July 2026 and shift week to week; this briefing synthesizes third-party reporting and is not investment or purchasing advice.
One thread runs underneath this whole story: silicon. GLM-5.2 trained on Huawei Ascend chips precisely because the physical supply chain, not the software, is where export controls bite. For that side of the race, from lithography to power, see our core sample of the AI stack's full value chain.