
The Stack Eats the Model

Almost none of the most-starred AI projects this week were new models. Three layers of the AI stack are being rebuilt at the same time, and the model itself is becoming a commodity.

AI Infrastructure · AI Agents · LLM Inference · KV Cache · Generative AI · Newsletter

A GenAI Newsletter by Raj


If you've been following AI Twitter this past week, you might have noticed something unusual: almost none of the most-starred projects were new models. They were tools, harnesses, compressors, and inference engines. All the stuff that wraps around models. Builders have known this for months, but this week the broader AI community seems to have caught on: the model is the easy part. The stack around it is what actually matters.

Three layers of the AI stack are being rebuilt at the same time, and each one tells a different story about where we're headed.


Layer 1: Above the Model

GitHub's biggest trend this week was agent infrastructure. A research paper from Tsinghua and Shenzhen called Natural-Language Agent Harnesses is at the center of it.

The paper's core idea: what if the code that controls an AI agent (the loops, routing, error handling, tool selection) was written in natural language? The paper argues that agent performance increasingly depends on harness engineering, but harness design is buried in controller code and runtime-specific conventions, making it impossible to transfer, compare, or study scientifically.

Their solution is to express the high-level control logic as a natural-language SOP that the LLM itself interprets and executes. The harness becomes a portable document. Early results show it works well, and it opens the door to agents that can modify their own control flow by editing their own instructions.
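To make the idea concrete, here is a minimal sketch of what a natural-language harness could look like. This is hypothetical code, not the paper's implementation: the `SOP` string, the `llm` callable, and the `tools` dict are all illustrative stand-ins. The point is that the branching logic lives in the SOP text, which the model interprets, while the Python controller is a dumb dispatch loop.

```python
# Hypothetical sketch of a natural-language agent harness.
# The control flow (when to stop, retry, switch tools) lives in
# the plain-text SOP, not in Python branching logic.

SOP = """
You are the controller. On each turn:
1. If the task is done, reply DONE: <answer>.
2. Otherwise pick one tool and reply TOOL: <name> <args>.
3. On a tool error, retry once, then switch tools.
"""

def run_agent(task, llm, tools, max_steps=10):
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        # The model reads the SOP plus the transcript and decides what to do.
        reply = llm(SOP, "\n".join(transcript))
        if reply.startswith("DONE:"):
            return reply[5:].strip()
        if reply.startswith("TOOL:"):
            name, _, args = reply[5:].strip().partition(" ")
            try:
                result = tools[name](args)
            except Exception as e:
                result = f"ERROR: {e}"  # errors become context, per the SOP
            transcript.append(f"TOOL {name} -> {result}")
    return None
```

Notice that "editing the harness" here means editing a string, which is exactly what makes the harness portable between runtimes, and what would let an agent rewrite its own control flow.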

The practical ecosystem moved in the same direction:

  • OpenClaw (the open-source Claude Code alternative) spawned an entire ecosystem in a single week: memory consolidation ("sleep for your AI"), curated resource lists, SAST security scanners, legal assistants, advertising skills, and full engineering workflow stacks. All as modular skills that snap into any compatible agent.

  • Boris Cherny (Claude Code team lead) dropped a thread revealing power features most users don't know about: /loop and /schedule for automated recurring agents, /batch for fanning out massive changesets to dozens of parallel worktree agents, /branch for forking sessions, and custom agents via --agent. These are production orchestration primitives.

  • Phantom, built on the Claude Agent SDK, gives an AI agent its own computer, persistent memory, email identity, and secure credential collection. A full digital co-worker.

  • Anvil creates an IDE for parallel agent work with one-click worktrees, shared plans between agents, and isolation between them.

None of these projects are improving the model itself. They're all improving the harness, the memory, the orchestration, the identity. The model is treated as a commodity, a reasoning engine you plug into a larger system.


Layer 2: Inside the Model

Something interesting happened inside the model too: three competing approaches to KV cache compression dropped in the same week.

For context: KV cache is the memory that grows linearly as your context window expands. It's the reason a 128K context model needs so much VRAM. Compress the KV cache, and you can serve longer contexts on cheaper hardware or serve more users on the same GPU.
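The arithmetic makes the VRAM pressure obvious. Using assumed, Llama-7B-class numbers (32 layers, 32 attention heads, head dimension 128, fp16), the cache alone for a 128K context is around 64 GiB:

```python
# Back-of-the-envelope KV cache size. Model config is an assumption
# (Llama-7B-class: 32 layers, 32 heads, head_dim 128, fp16 = 2 bytes).

def kv_cache_bytes(seq_len, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    # 2x for keys AND values; one entry per token, per layer, per head
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

full = kv_cache_bytes(128 * 1024)
print(full / 2**30)      # 64.0 GiB for the cache alone at 128K tokens
print(full / 5 / 2**30)  # 12.8 GiB with 5x compression
```

That 64 GiB is on top of the model weights, which is why 5x compression changes what hardware can serve long contexts at all.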

TurboQuant (Google, ICLR 2026) achieves 5x compression using 3-bit quantization while maintaining 99.5% attention fidelity. Two independent implementations appeared on GitHub within days. The key insight is that you can quantize keys to 3 bits and values to 2 bits without meaningfully degrading output quality, because attention patterns are more robust to precision loss than people assumed.
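For intuition, here is a toy absmax quantizer for cache tensors. This is not TurboQuant's codebook-based method, just the simplest possible version of the same idea: 3-bit symmetric integers plus a per-vector scale, which already gets you roughly fp16/3-bit ≈ 5x compression (ignoring the scale overhead).

```python
import numpy as np

# Toy per-vector absmax quantization of a KV tensor to n bits.
# Illustrative only; TurboQuant's learned-codebook scheme is more involved.

def quantize(x, bits=3):
    levels = 2 ** (bits - 1) - 1                 # symmetric range, ±3 for 3 bits
    scale = np.abs(x).max(axis=-1, keepdims=True) / levels
    q = np.round(x / scale).astype(np.int8)      # packed to 3 bits in a real kernel
    return q, scale

def dequantize(q, scale):
    return q * scale

keys = np.random.randn(8, 128).astype(np.float32)
q, scale = quantize(keys, bits=3)
err = np.abs(dequantize(q, scale) - keys).mean()  # small mean reconstruction error
```

The claim in the paper, as summarized above, is that attention scores tolerate exactly this kind of precision loss far better than intuition suggests.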

Then RotorQuant showed up, claiming to be 10-19x faster than TurboQuant with 44x fewer parameters. Its approach uses Clifford algebra vector quantization, a mathematical framework from geometric algebra that represents rotations more efficiently than standard linear algebra. TurboQuant learns quantization codebooks, which requires a forward pass through a small network for each cache entry; RotorQuant represents cache entries as geometric rotors, so quantization collapses to a single matrix operation. Whether the quality matches remains to be seen (RotorQuant is days old), but the architectural difference suggests this compression arms race is just getting started.

Why does this matter beyond benchmarks? KV cache compression is what makes long-context actually affordable. A 128K context window with TurboQuant's 5x compression costs the same as a 25K window without it. RotorQuant's potential 10-19x speedup on top of that could make million-token contexts viable on consumer hardware. And the agent explosion above needs long context to work: agents that loop, remember, and self-modify accumulate enormous context windows.


Layer 3: Below the Model

The third layer being rebuilt is the one closest to the metal: the inference engine.

Three projects this week signal that the Python-dominated LLM serving stack is being rewritten from scratch.

rvLLM is an LLM inference engine written entirely in Rust, positioning itself as a "drop-in vLLM replacement." vLLM (the current standard) is Python with C++/CUDA kernels. rvLLM bets that Rust's memory safety, zero-cost abstractions, and concurrency model can deliver better performance without the operational footprint of Python. At 216 stars in its first week, the community is paying attention.

Zinc (Zig Inference Engine) is focused on AMD RDNA3/RDNA4 GPUs via Vulkan. The entire LLM serving ecosystem is NVIDIA-first today. Zinc is the first serious attempt to make AMD GPUs first-class citizens for LLM inference, using Zig's explicit memory control and Vulkan's cross-platform compute shaders. If it works, it opens up a whole second hardware ecosystem.

liter-llm is a universal LLM API client with a Rust core and 11 native language bindings, supporting 142+ providers. It standardizes the interface layer. Think of it as the database driver of the LLM world.

Meanwhile, MemBoost (from the arXiv papers this week) tackles inference cost by detecting repeated or near-duplicate queries across users and sessions, caching intermediate computation. Under workloads with semantic repetition (which describes most production deployments), this saves a lot of redundant work.
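A sketch of the idea, in the spirit of MemBoost but with an entirely hypothetical interface: embed each query, and on a new request return the cached response of any previously seen query whose embedding is close enough. The `embed` callable and the `threshold` are stand-ins for whatever a real system would use.

```python
import numpy as np

# Hypothetical semantic response cache (not MemBoost's actual API).
# `embed` is a stand-in for any sentence-embedding function.

class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []                        # (unit embedding, response) pairs

    def get(self, query):
        v = self.embed(query)
        v = v / np.linalg.norm(v)
        for e, response in self.entries:
            if float(v @ e) >= self.threshold:   # cosine similarity hit
                return response
        return None                              # miss: caller runs the model

    def put(self, query, response):
        v = self.embed(query)
        self.entries.append((v / np.linalg.norm(v), response))
```

A production version would use an approximate-nearest-neighbor index instead of a linear scan, and would cache intermediate computation (as MemBoost reportedly does) rather than only final responses, but the hit/miss logic is the same.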

A Python-first, CUDA-only serving layer is no longer good enough. As LLMs move into production infrastructure, the stack needs the same engineering rigor we applied to databases, web servers, and operating systems.


The Connecting Thread

All three layers together tell the same story: the model itself has become a commodity. The frontier of AI engineering has moved to what surrounds it: how you orchestrate it (agent harnesses), how you make it efficient (KV cache compression), and how you serve it (inference engines).

This has happened before in technology. The CPU became a commodity, and the value moved to operating systems and applications. The database engine became a commodity, and the value moved to ORMs, query optimizers, and cloud services. The language model is following the same path.

The research papers from ICML and arXiv this week support this. The Muon optimizer paper (Sharp Capacity Scaling of Spectral Optimizers) shows that even training is becoming more about infrastructure than architecture. Spectral optimizers like Muon work because they solve the associative memory problem more efficiently. And the Weight Tying paper reveals that a standard practice in model design (sharing input and output embeddings) has been subtly biasing models toward output space alignment all along, a structural artifact nobody designed on purpose.

The companies and individuals who will thrive in the next phase of AI are the ones building the best stacks around the models, not the ones training the biggest models.


This is the third edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. Previous editions covered the three races in AI and the week AI learned to do its own research.


This Week's Radar:

About the author
Raj is an AI engineer and researcher. He spent 14 years at Microsoft working on Microsoft 365 Copilot Chat, Bing search ranking, Cortana, and Azure ML. Today he is building a new AI startup in stealth, advises other AI startups, conducts research on mechanistic interpretability, and writes weekly about Generative AI.