GLM-5.1 Is Here—Is China’s New Model Actually Beating GPT on Long-Horizon Tasks?

A Chinese AI lab has just released a model that can optimize code over 600 iterations without human intervention. In one demonstration, GLM-5.1 pushed a vector database to 21,500 queries per second, against a previous best of roughly 3,500. In another, it built a Linux-style desktop in a browser in just 8 hours, adding a file browser, terminal, text editor, and games, all integrated into a unified user interface. So, can it beat GPT?

On April 7, 2026, Chinese AI lab Z.ai (a subsidiary of Zhipu AI) released GLM-5.1, a 754-billion-parameter Mixture-of-Experts model that represents a fundamental shift in how we think about AI capabilities.

The headline numbers are impressive. But the real story isn’t about benchmark scores. It’s about something far more interesting: long-horizon task execution.

For years, AI models have been evaluated on single-turn performance. You ask a question, it gives an answer. You give a coding prompt, it generates code. But real-world software engineering doesn’t work that way. Real problems require iteration, debugging, testing, and refinement, sometimes over hundreds of cycles.

GLM-5.1 is the first open-weights model specifically optimized for this kind of sustained, multi-step work. And the results are turning heads in the AI community.

The Core Innovation: Staying Productive When the Runway Gets Long

The fundamental limitation of previous models, including GLM-5 itself, is that they tend to exhaust their repertoire early. They apply familiar techniques for quick initial gains, then plateau. Giving them more time doesn’t help.

GLM-5.1 is built differently. According to Z.ai’s technical documentation, the model is designed to stay effective on agentic tasks over much longer horizons. It handles ambiguous problems with better judgment and stays productive over longer sessions. It breaks complex problems down, runs experiments, reads results, and identifies blockers with real precision.

The key claim: GLM-5.1 sustains optimization over hundreds of rounds and thousands of tool calls. The longer it runs, the better the result.

This isn’t marketing hype. Z.ai demonstrated this capability across three distinct scenarios with progressively less structured feedback.

Scenario 1: Vector Database Optimization (600+ Iterations)

VectorDBBench is an open-source coding challenge that evaluates a model’s ability to build a high-performance approximate nearest neighbor search database. The model is given a Rust skeleton with API endpoints and empty implementation stubs. Acting as a tool-calling agent, it must read files, write code, compile, test, and profile.

The previous state-of-the-art result under this setting was 3,547 queries per second (QPS), achieved by Claude Opus 4.6.

Z.ai restructured the evaluation into an outer optimization loop: in each iteration, the model could use as many tool calls as needed to edit code, compile, test, and profile, then submit a new version to be benchmarked. The model decided autonomously when to submit and what to try next.
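The shape of such an outer loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Z.ai's harness: `outer_loop`, `RunLog`, `toy_propose`, and `toy_benchmark` are hypothetical names, and the "agent" and "benchmark" here are toy stand-ins for a model making tool calls and for a real QPS measurement.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunLog:
    """Everything the agent can look back on between submissions."""
    best_qps: float = 0.0
    history: List[float] = field(default_factory=list)


def outer_loop(propose: Callable[[RunLog], int],
               benchmark: Callable[[int], float],
               iterations: int) -> RunLog:
    """Run `iterations` submissions; each one is scored exactly once."""
    log = RunLog()
    for _ in range(iterations):
        # Inner phase: the agent may do any amount of work (edit, compile,
        # test, profile) before it decides to hand back a candidate.
        candidate = propose(log)
        # Outer phase: one benchmarked submission per iteration.
        qps = benchmark(candidate)
        log.history.append(qps)
        log.best_qps = max(log.best_qps, qps)
    return log


# Hypothetical stand-ins so the sketch runs end to end.
def toy_propose(log: RunLog) -> int:
    # Pretend the agent turns one tuning knob a step further each round.
    return len(log.history)


def toy_benchmark(knob: int) -> float:
    # Pretend throughput peaks when the knob is set to 7.
    return 1000.0 - (knob - 7) ** 2


log = outer_loop(toy_propose, toy_benchmark, iterations=10)
print(f"best of {len(log.history)} submissions: {log.best_qps:.0f} QPS")
```

The point of the structure is that the log, not a human, carries state from one submission to the next, so the agent can keep deciding what to try after earlier ideas stop paying off.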

The result: GLM-5.1 did not plateau after 50 or 100 submissions. It continued to find meaningful improvements over 600+ iterations with 6,000+ tool calls, ultimately reaching 21.5k QPS, roughly 6× the best result achieved in a single 50-turn session.

The optimization trajectory revealed a characteristic staircase pattern: periods of incremental tuning within a fixed strategy, punctuated by structural changes that shifted the performance frontier. Around iteration 90, the model shifted from full-corpus scanning to IVF cluster probing with f16 vector compression, jumping to 6.4k QPS. Around iteration 240, it introduced a two-stage pipeline (u8 prescoring followed by f16 reranking), reaching 13.4k QPS.

Six such structural transitions occurred over the full run, each initiated by the model after analyzing its own benchmark logs and identifying the current bottleneck.
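For readers unfamiliar with the two techniques named above, here is a toy, pure-Python sketch of IVF cluster probing combined with a u8 prescoring pass and an f16 rerank. It is not the Rust code GLM-5.1 wrote. `TinyIVF` and its parameters (`nprobe`, `shortlist`) are hypothetical, centroids are simply sampled from the data (a real IVF index would typically train them with k-means), and the vectors are assumed to lie in [-1, 1].

```python
import random
import struct


def to_f16(x):
    """Round a float to IEEE half precision and back (struct format 'e')."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_u8(x, lo=-1.0, hi=1.0):
    """Quantize a float in [lo, hi] to an integer code in 0..255."""
    x = min(max(x, lo), hi)
    return round((x - lo) / (hi - lo) * 255)


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinyIVF:
    def __init__(self, vectors, centroids):
        self.centroids = centroids
        # Two compressed copies: coarse u8 codes and finer f16 values.
        self.u8 = [[to_u8(x) for x in v] for v in vectors]
        self.f16 = [[to_f16(x) for x in v] for v in vectors]
        # Inverted lists: each vector id is filed under its nearest centroid.
        self.lists = [[] for _ in centroids]
        for i, v in enumerate(vectors):
            c = min(range(len(centroids)), key=lambda j: dist2(v, centroids[j]))
            self.lists[c].append(i)

    def search(self, q, k=1, nprobe=2, shortlist=8):
        # Probe only the nprobe nearest clusters instead of the full corpus.
        order = sorted(range(len(self.centroids)),
                       key=lambda j: dist2(q, self.centroids[j]))
        candidates = [i for c in order[:nprobe] for i in self.lists[c]]
        # Stage 1: cheap prescoring on the u8 codes.
        q8 = [to_u8(x) for x in q]
        candidates.sort(key=lambda i: dist2(q8, self.u8[i]))
        # Stage 2: rerank only the shortlist with the f16 vectors.
        reranked = sorted(candidates[:shortlist],
                          key=lambda i: dist2(q, self.f16[i]))
        return reranked[:k]


random.seed(0)
DIM = 8
vectors = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(200)]
centroids = random.sample(vectors, 8)  # assumption: no k-means training here
index = TinyIVF(vectors, centroids)
hit = index.search(vectors[5], k=1)
print(hit)
```

Both ideas trade a little accuracy for much less work per query: probing skips most of the corpus, and the coarse u8 pass means the more expensive f16 comparison only runs on a handful of candidates.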

Scenario 2: GPU Kernel Optimization (1,000+ Turns)

KernelBench evaluates whether a model can take a reference PyTorch implementation and produce a faster GPU kernel with identical outputs. The benchmark is organized into three levels of increasing complexity, covering 50 problems total.

For reference, torch.compile with default settings achieves 1.15× speedup on these problems; with max-autotune, 1.49×.
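A speedup figure of this kind is a ratio of run times that only counts when the faster version produces the same outputs. The sketch below illustrates that definition with plain Python functions standing in for GPU kernels. It is not KernelBench's actual harness: `timed`, `speedup`, and the two prefix-sum functions are hypothetical names chosen for the example.

```python
import math
import time
from itertools import accumulate
from statistics import median


def timed(fn, arg, repeats=5):
    """Median wall-clock seconds for fn(arg) over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(reference, candidate, arg, tol=1e-9):
    """Return reference_time / candidate_time, but only if outputs agree."""
    expected, actual = reference(arg), candidate(arg)
    same = len(expected) == len(actual) and all(
        math.isclose(a, b, rel_tol=tol, abs_tol=tol)
        for a, b in zip(expected, actual)
    )
    if not same:
        raise ValueError("candidate output differs from reference")
    return timed(reference, arg) / timed(candidate, arg)


# Toy stand-ins (not GPU kernels): quadratic vs. linear prefix sums.
def slow_prefix_sums(xs):
    return [sum(xs[: i + 1]) for i in range(len(xs))]


def fast_prefix_sums(xs):
    return list(accumulate(xs))


ratio = speedup(slow_prefix_sums, fast_prefix_sums, list(range(500)))
print(f"speedup: {ratio:.1f}x")
```

The correctness gate matters as much as the timer: a kernel that is fast but wrong would otherwise score well.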

When Z.ai ran four models on Level 3 (full-model, end-to-end optimization of complete architectures like MobileNet, VGG, MiniGPT, and Mamba), the trajectories highlighted stark differences in long-horizon optimization behavior:

GLM-5 improved quickly at first but leveled off relatively early.
Claude Opus 4.5 continued a bit longer, but its gains also tapered off.
GLM-5.1 pushed the frontier further, delivering 3.6× speedup and continuing to make progress well into the run.

While GLM-5.1’s rate of improvement also slowed over time, it sustained useful optimization for substantially longer than GLM-5. Claude Opus 4.6, the fourth model, remained the strongest in this setting, finishing at 4.2× and still showing headroom at the end.

Scenario 3: Building a Linux Desktop Over 8 Hours

The previous two scenarios have explicit numeric objectives (QPS, speedup) that the model can benchmark against. Website generation is inherently more subjective: given a natural-language prompt, produce a working web application. There is no single metric to optimize; what counts as “good” depends on completeness, visual polish, and interaction quality.

Z.ai tested this with a deliberately ambitious prompt: build a Linux-style desktop environment as a web application. No starter code, no design mockups, no intermediate guidance.

The difference was stark:

Most models, including earlier versions of GLM, give up quickly. They produce a basic skeleton with a static taskbar and one or two placeholder windows, then declare the task complete. The model has no mechanism to step back and ask what’s missing.

Z.ai wrapped GLM-5.1 in a simple harness that changed this: after each round of execution, the model reviewed its own output, identified what could be improved (missing features, rough styling, broken interactions), and continued. This loop ran for 8 hours.
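Z.ai has not published the harness in this article, but the review-and-continue pattern it describes might look roughly like the sketch below. The names `refine`, `toy_review`, and `toy_revise` are hypothetical. In the real setup both the review and the revision would be calls to the same model; here they are trivial functions over a list of feature names so the loop can run.

```python
import time
from typing import Callable, List


def refine(artifact: List[str],
           review: Callable[[List[str]], List[str]],
           revise: Callable[[List[str], List[str]], List[str]],
           budget_seconds: float,
           max_rounds: int = 100) -> List[str]:
    """Alternate self-review and revision until time runs out or nothing is left to fix."""
    deadline = time.monotonic() + budget_seconds
    for _ in range(max_rounds):
        if time.monotonic() >= deadline:
            break
        issues = review(artifact)      # what is missing, rough, or broken?
        if not issues:
            break                      # nothing left to improve
        artifact = revise(artifact, issues)
    return artifact


# Hypothetical stand-ins: the "desktop" is just a list of finished features.
WISHLIST = ["taskbar", "file browser", "terminal", "text editor"]


def toy_review(features: List[str]) -> List[str]:
    return [f for f in WISHLIST if f not in features]


def toy_revise(features: List[str], issues: List[str]) -> List[str]:
    return features + [issues[0]]      # address one issue per round


result = refine(["taskbar"], toy_review, toy_revise, budget_seconds=5.0)
print(result)
```

Without the review step the first artifact would simply be returned as finished, which is the failure mode described for earlier models.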

Early on, GLM-5.1 delivered a basic layout with a taskbar and a simple window, similar to what a short session would produce. But it didn’t stop there.

As it continued, the system steadily filled out: file browser, terminal, text editor, system monitor, calculator, games. Each new addition was integrated into a coherent UI rather than bolted on as an afterthought. Styling became more polished, interactions smoother, edge cases handled. By the end, the result was a complete, visually consistent desktop environment running in the browser.

This is what becomes possible when the model is given both the time and the capability to keep refining.

Benchmark Performance: Where GLM-5.1 Stands

Beyond long-horizon tasks, GLM-5.1 shows competitive performance on standard benchmarks. According to Z.ai’s published results:

Benchmark	GLM-5.1	GLM-5	GPT-5.4	Claude Opus 4.6	Gemini 3.1 Pro
SWE-Bench Pro	58.4	55.1	57.7	57.3	54.2
NL2Repo	42.7	35.9	41.3	49.8	33.4
Terminal-Bench 2.0	63.5	56.2	—	65.4	68.5
AIME 2026	95.3	95.4	98.7	95.6	98.2
GPQA-Diamond	86.2	86.0	92.0	91.3	94.3

On SWE-Bench Pro, the gold standard for real-world software engineering, GLM-5.1 achieves 58.4%, narrowly outperforming both GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%). This is a significant milestone for an open-weights model.

An analysis from WaveSpeedAI puts GLM-5.1’s coding score at 94.6% of Claude Opus 4.6’s performance, though, as noted under Limitations, the underlying scores are self-reported and still await independent verification.

The “Pelican Test”: A Real-World Demonstration

Simon Willison, a prominent British developer and AI researcher, independently tested GLM-5.1 through OpenRouter. He asked the model to “Generate an SVG of a pelican on a bicycle”.

What happened next was unexpected. Without prompting, the model decided to return an HTML page that included both the SVG and a separate set of CSS animations. The SVG itself was excellent; Willison called it “my new favorite from an open weights model”.

However, the animation was broken: the pelican floated off the screen. Willison followed up: “the animation is a bit broken, the pelican ends up positioned off the screen at the top right”.

GLM-5.1’s response was remarkable. It diagnosed the issue:

“The issue is that CSS transform animations on SVG elements override the SVG transform attribute used for positioning, causing the pelican to lose its placement and fly off to the top-right. The fix is to separate positioning (SVG attribute) from animation (inner group) and use animateTransform for SVG rotations since it handles coordinate systems correctly.”

It then spat out fresh HTML that fixed the problem, complete with a wobbling beak animation described in SVG comments.
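The fix the model describes, keeping positioning and animation on separate groups, can be illustrated with a small SVG built and checked from Python. This is a hand-written sketch, not the model's actual output: the `pelican_svg` helper, the element ids, and the ellipse standing in for the pelican are all made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def pelican_svg(x, y):
    """Return an SVG where placement and animation live on separate groups."""
    return f'''<svg xmlns="{SVG_NS}" viewBox="0 0 400 300">
  <!-- Outer group: static placement via the SVG transform attribute -->
  <g id="pelican-position" transform="translate({x} {y})">
    <!-- Inner group: animation only, so it never overwrites the placement -->
    <g id="pelican-wobble">
      <animateTransform attributeName="transform" type="rotate"
                        values="-3 0 0; 3 0 0; -3 0 0" dur="2s"
                        repeatCount="indefinite"/>
      <ellipse cx="0" cy="0" rx="40" ry="25" fill="white" stroke="black"/>
    </g>
  </g>
</svg>'''


# Parse the markup to confirm it is well-formed and structured as intended.
root = ET.fromstring(pelican_svg(200, 150))
outer = root.find(f"{{{SVG_NS}}}g")
inner = outer.find(f"{{{SVG_NS}}}g")
print(outer.get("transform"), inner.get("id"))
```

Because the rotation is applied to the inner group, the outer group's `translate` is never touched by the animation, so the shape stays where it was placed.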

This exchange demonstrates GLM-5.1’s ability to not just generate code, but to debug its own output, identify root causes, and produce corrected solutions. That is exactly the kind of iterative capability that defines long-horizon task performance.

The Bigger Picture: What GLM-5.1 Represents

GLM-5.1 is more than just another model release. It represents several significant shifts in the AI landscape:

1. The First Open-Weights Model at Frontier Scale

GLM-5 is the first open-weights model to score 50 on the Artificial Analysis Intelligence Index. The weights are available on Hugging Face under the MIT license (zai-org/GLM-5). GLM-5.1 weights have been promised but not yet released.

2. Trained Without Nvidia Hardware

The model was trained entirely on 100,000 Huawei Ascend 910B chips, with no Nvidia GPUs. Given U.S. export controls on AI chips to China, this is a milestone for China’s AI self-sufficiency.

3. Aggressive Pricing

GLM-5.1 is priced at $1.00 per million input tokens and $3.20 per million output tokens, a fraction of Claude Opus 4.6’s $15/$75 pricing. Only DeepSeek undercuts it on pure price.

4. MIT License Commitment

Despite concerns that Zhipu might follow OpenAI’s closed-source path, the company confirmed that GLM-5.1 will be open source. “Don’t panic. GLM-5.1 will be open source,” Zhipu’s global head posted on social media.
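To make the pricing gap in point 3 concrete, the short calculation below uses only the per-million-token prices quoted there. The `job_cost` and `price_ratio` helpers and the example workload (2 million input tokens, 500,000 output tokens) are hypothetical and chosen purely for illustration.

```python
# Prices in USD per million tokens, as quoted in this article.
PRICES = {
    "GLM-5.1": {"input": 1.00, "output": 3.20},
    "Claude Opus 4.6": {"input": 15.00, "output": 75.00},
}


def price_ratio(kind):
    """How many times more Claude Opus 4.6 costs per token of this kind."""
    return PRICES["Claude Opus 4.6"][kind] / PRICES["GLM-5.1"][kind]


def job_cost(model, input_tokens, output_tokens):
    """Total USD cost of one workload at the quoted prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


print(f"input ratio:  {price_ratio('input'):.1f}x")
print(f"output ratio: {price_ratio('output'):.1f}x")

# Hypothetical workload: 2M input tokens, 0.5M output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 2_000_000, 500_000):.2f}")
```

At these prices the gap is 15× on input tokens and about 23× on output tokens, so the effective saving on a real job depends on its mix of input and output.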

Limitations

GLM-5.1 is not without constraints:

Text-only. No image, audio, or video input. For multimodal tasks, Claude, GPT, or Gemini remain necessary.
Self-reported coding scores. The 94.6%-of-Opus claim uses Claude Code as the evaluation framework. Independent verification is pending.
Storage requirements. The full BF16 model requires ~1.49TB, so self-hosting is non-trivial.
Not yet fully open. Only GLM-5 weights are currently available; GLM-5.1 weights are promised but not yet released.
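The storage figure above can be sanity-checked with a back-of-the-envelope calculation: BF16 stores each parameter in 2 bytes, so 754 billion parameters come to about 1.51 TB of weights, close to (though not identical with) the ~1.49TB quoted. The sketch below shows the arithmetic. The `fp8` and `int4` rows are hypothetical lower-precision formats included only for comparison, not formats the article says are available, and the estimate covers weights alone.

```python
PARAMS = 754e9  # parameter count quoted in the article

# Bytes per parameter, weights only. fp8 and int4 are illustrative
# assumptions, not formats confirmed for GLM-5.1.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_bytes(params, fmt):
    """Approximate size of the weights in bytes for a given format."""
    return params * BYTES_PER_PARAM[fmt]


for fmt in BYTES_PER_PARAM:
    size = weight_bytes(PARAMS, fmt)
    # Decimal terabytes (10**12 bytes) and binary tebibytes (2**40 bytes).
    print(f"{fmt}: {size / 1e12:.2f} TB ({size / 2**40:.2f} TiB)")
```

Whichever unit is used, the full-precision weights will not fit on a single accelerator, which is why self-hosting is non-trivial.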

The Bottom Line: Is It Beating GPT?

The answer depends on what you’re measuring.

On standard benchmarks like GPQA-Diamond and AIME, GLM-5.1 trails GPT-5.4 and Claude Opus 4.6. The best closed-source models still lead on pure reasoning and mathematical accuracy.

On long-horizon software engineering tasks like SWE-Bench Pro and VectorDBBench optimization, GLM-5.1 is competitive with—and in some cases exceeds—the best closed-source models. Its ability to sustain productive iteration over hundreds or thousands of cycles is genuinely new.

On price, GLM-5.1 is dramatically cheaper than its U.S. counterparts: at the prices quoted above, it is 15× cheaper than Claude Opus 4.6 for input tokens and roughly 23× cheaper for output tokens.

On openness, GLM-5.1 is positioned to become the most capable open-weights model released so far once its weights ship. The promised MIT license would let developers self-host, fine-tune, and deploy without API dependency.

So, is China beating the U.S. on long-horizon tasks? Not yet across the board, but GLM-5.1 suggests the gap is closing fast. And for developers who prioritize iterative capability, open weights, and cost efficiency, GLM-5.1 isn’t just competitive. It might be the better choice.

As Simon Willison put it after his pelican test: “Something new happened.” That “something new” might just change how we think about what AI can do when given the time to do it right.