
Why Looping Is the New Scaling

Three papers converge on an idea that could reshape how we think about model intelligence: you do not need more layers. You need to run the right layers again.

Transformers · Inference Scaling · Looping · Mechanistic Interpretability · Generative AI · Newsletter

A GenAI Newsletter by Raj


For two years, the AI industry has been chasing scale. Bigger models, more parameters, longer training runs. The implicit bet: if we make the network deeper and wider, it will get smarter.

Three papers appeared this week that suggest a different path. None of them made the front page of Hacker News. None came from OpenAI or Anthropic. But they converge on an idea that could reshape how we think about model intelligence: you do not need more layers. You need to run the right layers again.


The Idea

A transformer model processes your input by pushing it through a stack of layers, one after another. Layer 1 does some work, passes the result to layer 2, and so on until the final layer produces an output. Every layer has its own weights: a 40-layer model stores 40 distinct sets of parameters.

Looping changes this. Instead of 40 unique layers, you take a block of, say, 8 layers and run the input through that same block 5 times. The compute matches a 40-layer model, but you only store 8 layers' worth of parameters. The model gets depth without getting bigger.
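The arithmetic is easy to see in code. Here is a minimal numpy sketch of the idea, with toy residual maps standing in for real transformer layers (the dimensions, weight scales, and layer function are all illustrative, not from any of the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden dimension (toy scale)

# One shared block of 8 "layers" -- here just residual tanh maps
# standing in for real transformer layers.
block = [rng.standard_normal((D, D)) * 0.1 for _ in range(8)]

def run_block(x):
    for W in block:
        x = x + np.tanh(x @ W)  # residual update through one layer
    return x

x = rng.standard_normal(D)
for _ in range(5):  # 8 layers x 5 loops = 40 layers of compute,
    x = run_block(x)  # but only 8 layers' worth of parameters stored
```

The loop count is a pure inference-time knob here: nothing about the stored weights changes when you run the block 4 times instead of 5.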

This idea has been around for years. Universal Transformers proposed it in 2018. But it stayed a curiosity because vanilla looping had a fatal flaw: it only worked if you ran exactly the number of loops the model was trained for. Run it for 4 loops instead of 5, and the output collapsed into garbage. Run it for 6, same thing. The model was fragile to its own depth.

This week, three independent research groups published results that fix this problem and explain why looping works at all.


Paper 1: Elastic Looped Transformers

Sahil Goyal, Swayam Agrawal, and collaborators introduced ELT, a visual generation model that loops transformer blocks with a training trick called Intra-Loop Self Distillation. The idea is simple: during training, randomly pick an intermediate loop count and force the model to produce decent output at that point too, not just at the final loop.

The result is a model family that works at any compute budget from a single training run. Want faster inference? Exit after 3 loops. Want higher quality? Run all 8. The model degrades gracefully instead of collapsing.
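The shape of that training objective can be sketched in a few lines. This is a hedged toy version (random residual maps instead of a real model, mean-squared error instead of the paper's actual distillation loss; the function names are mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
W = rng.standard_normal((D, D)) * 0.1  # one toy shared "block"

def run_block(x):
    return x + np.tanh(x @ W)  # residual pass through the shared block

def loop_outputs(x, n_loops):
    """Hidden state after each of n_loops passes through the block."""
    outs = []
    for _ in range(n_loops):
        x = run_block(x)
        outs.append(x)
    return outs

def isd_loss(outs):
    """Intra-Loop Self Distillation, sketched: pick a random
    intermediate loop and penalize its distance from the final-loop
    output, so early exits stay usable. (The paper applies this as a
    training loss on model outputs; this only shows the objective's
    shape.)"""
    k = rng.integers(0, len(outs) - 1)  # random intermediate loop
    return float(np.mean((outs[k] - outs[-1]) ** 2))

loss = isd_loss(loop_outputs(rng.standard_normal(D), 8))
```

Minimizing a loss like this at random intermediate depths is what turns "collapses at the wrong loop count" into "degrades gracefully."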

With 4x fewer parameters than standard models, ELT matches the image quality of DiT-XL on ImageNet. The same weights, used multiple times, do the work that used to require a model four times larger.

This is a visual generation paper, not a language model paper. But the architecture is general. The principle transfers.

https://arxiv.org/abs/2604.09168


Paper 2: Why Looping Works

Hugh Blayney, Alvaro Arroyo, Johan Obando-Ceron and collaborators published the first mechanistic study of looped reasoning in language models. They wanted to understand what actually happens inside the model when you run the same layers twice.

Their answer: the hidden state converges to a fixed point.

When you loop a block of layers, the model's internal representation traces a trajectory through a high-dimensional space. On the first pass, it moves a long distance. On the second pass, it moves less. By the third or fourth pass, it barely moves at all. It has settled into an orbit. The attention patterns stabilize. The model has "finished thinking."
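That convergence is easy to observe directly: track how far the hidden state moves on each pass. A toy sketch, using a contractive map so a fixed point is guaranteed to exist (real looped blocks are not guaranteed to be contractions; that is the assumption being made here):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
W = rng.standard_normal((D, D))
W *= 0.9 / np.linalg.norm(W, 2)  # spectral norm 0.9 => contraction

def run_block(x):
    return np.tanh(x @ W)  # toy stand-in for one pass through the block

x = rng.standard_normal(D)
moves = []
for _ in range(6):
    x_next = run_block(x)
    moves.append(np.linalg.norm(x_next - x))  # distance moved this loop
    x = x_next
# moves[0] is large, moves[-1] is small: the state settles into orbit
```

Plotting `moves` against loop index gives exactly the trajectory the paper describes: a long first step, then geometrically shrinking ones.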

The deeper finding is that looped blocks learn the same inference stages as deeper feedforward models. A 40-layer feedforward model develops specialized computations at different depths. A looped model with 8 layers run 5 times develops the same computations, in the same order, within each loop iteration. The loop is not just a parameter-saving trick. The model is actually learning to iterate on its own reasoning.

They also found that block size matters. A single looped layer does not converge well. Blocks of 3 to 5 layers form stable fixed points. Larger blocks converge faster but have diminishing returns.

https://arxiv.org/abs/2604.11791


Paper 3: Entropy Tells You Where the Model Disagrees With Itself

Songlin Yang, Xianghao Kong, and Anyi Rao proposed an information-theoretic framework for probing what happens inside transformer layers. They tracked entropy trajectories across layers in multimodal models and found something relevant to the looping story: shared parameters do not guarantee unified processing. What matters is whether the information flow is consistent across layers.

In models where different modalities follow different entropy trajectories through the same layers, the output is incoherent. In models where the trajectories align, the output is good. The weights are the same in both cases. The difference is in how the model routes information through those weights.

This matters for looping because it explains a failure mode. If you loop a block of layers where the information flow is inconsistent, the fixed point the model converges to may be the wrong one. The block needs to have coherent internal dynamics for looping to help. Not every set of layers is worth repeating.
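A crude version of this probe is simple to write down: compute an entropy statistic on the hidden state after each layer and compare trajectories across inputs. This sketch is my own simplification (softmax entropy of a toy hidden vector; the paper's actual estimator is more involved):

```python
import numpy as np

def layer_entropy(h):
    """Shannon entropy of a softmax over the hidden vector: a crude
    proxy for how spread out the representation is at this layer."""
    p = np.exp(h - h.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
D = 16
layers = [rng.standard_normal((D, D)) * 0.2 for _ in range(6)]

def trajectory(x):
    """Entropy of the hidden state after each of the shared layers."""
    traj = []
    for W in layers:
        x = np.tanh(x @ W)
        traj.append(layer_entropy(x))
    return traj

# Two inputs standing in for two modalities: if their entropy
# trajectories diverge through the same weights, the shared parameters
# are not processing them in a unified way.
traj_a = trajectory(rng.standard_normal(D))
traj_b = trajectory(rng.standard_normal(D))
gap = max(abs(a - b) for a, b in zip(traj_a, traj_b))
```

The point of the probe is that `gap` depends on the inputs, not the weights: the same parameters can carry aligned or misaligned information flow.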

https://arxiv.org/abs/2604.10949


What This Means

These three papers, read together, form a theory of inference-time scaling that is fundamentally different from the "make it bigger" approach.

The old theory: intelligence scales with parameter count. If the model is not smart enough, train a bigger one.

The new theory: intelligence scales with inference-time compute, applied strategically. The right block of layers, run multiple times, can match a model with 4x more parameters. The mechanism is fixed-point convergence. The practical requirement is that the looped block must have coherent information flow and span a complete inference stage (roughly 3-5 layers).

This connects to something I have been working on. My research on reasoning circuits in transformers found that language models contain specific blocks of 3-5 layers that, when duplicated at inference time, improve reasoning by 5-16% without any retraining. The key challenge was identifying which layers to duplicate. The papers this week now explain why those specific layers work: they are the layers where the model's representation is approaching but has not quite reached a fixed point. One more pass through those layers lets it finish the thought.
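Mechanically, inference-time duplication is a one-line transformation of the layer stack. A sketch (the indices here are illustrative; as the text says, identifying which block to repeat is the hard part):

```python
def duplicate_block(layers, start, end):
    """Repeat layers[start:end] once at inference time: the model gets
    an extra pass through that block with no retraining and no new
    parameters. start/end are hypothetical indices, not values from
    the research described above."""
    return layers[:end] + layers[start:end] + layers[end:]

# 12 "layers" labeled by index; repeat the block spanning layers 4-7
stack = duplicate_block(list(range(12)), 4, 8)
# stack runs layers 4-7 twice in sequence
```

In a real model, `layers` would be the transformer's module list and the duplicated entries would share weights by reference, so memory cost is unchanged.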

The practical implications are immediate. If you are deploying a language model behind an API, you can serve a smaller model that loops specific layers and match the quality of a model several times larger. The memory footprint stays small. The latency is tunable: more loops for hard questions, fewer for easy ones. The same weights serve every difficulty level.

If you are training models, the implication is that you should think about which layers are worth making unique and which are better shared. The current default of giving every layer its own parameters may be wasteful. A hybrid architecture with some unique layers and some looped blocks could be both smaller and smarter.

And if you are building products, the most interesting possibility is adaptive inference. The model tries a question, checks whether its hidden state has converged, and either returns the answer or loops again. Easy questions get fast answers. Hard questions get more compute. The user does not choose. The model decides based on its own internal dynamics.
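The control loop for that is short. A sketch under the same toy assumptions as before (a contractive stand-in block, a hypothetical movement threshold `tol`; a production system would need a calibrated convergence test, not a fixed norm cutoff):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16
W = rng.standard_normal((D, D))
W *= 0.6 / np.linalg.norm(W, 2)  # contraction so a fixed point exists

def run_block(x):
    return np.tanh(x @ W)  # toy stand-in for the looped block

def adaptive_infer(x, tol=1e-3, max_loops=32):
    """Loop until the hidden state stops moving (converged to its
    fixed point) or the loop budget runs out. Easy inputs exit early;
    hard inputs get more compute."""
    for n in range(1, max_loops + 1):
        x_next = run_block(x)
        if np.linalg.norm(x_next - x) < tol:
            return x_next, n  # converged: the answer is ready
        x = x_next
    return x, max_loops  # budget exhausted: return best effort

out, n_loops = adaptive_infer(rng.standard_normal(D))
```

`n_loops` is the compute the model chose to spend, decided by its own dynamics rather than by the user.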


Also This Week

Policy circuits in alignment. Gregory Frank published a paper localizing the exact circuit that makes aligned models refuse harmful requests. An intermediate-layer attention gate (contributing less than 1% of the output signal) detects harmful content and triggers deeper amplifier heads that generate the refusal. The gate is causally necessary but nearly invisible by activation magnitude. This is important for interpretability research: the most critical components in a transformer may be the quietest ones.

https://arxiv.org/abs/2604.04385

SCOPE: Better on-policy distillation. Signal-Calibrated On-Policy Distillation Enhancement improves how student models learn from teacher models by using dual-path adaptive weighting. On-policy distillation (where the student generates its own training data rather than copying the teacher's) is becoming the default approach for post-training. SCOPE makes the token-level credit assignment less noisy.

https://arxiv.org/abs/2604.10688

Hiro acquired by OpenAI. Hiro, the AI personal CFO startup, is joining OpenAI. They stopped accepting new signups immediately. The financial AI space continues to consolidate around the largest labs, which makes independent benchmarks like FABRIC more important, not less.


The Takeaway

The race to build bigger models is not over, but a parallel race has started. The question is no longer just "how many parameters can we train?" It is also "how intelligently can we use the parameters we already have?"

Looping is the simplest version of this idea. Run the same layers twice. But the principle extends to any form of adaptive inference-time compute: chain-of-thought, tree search, self-verification, retrieval-augmented generation. All of these are ways of spending more compute at inference time to get better answers from the same model.

The papers this week give us the first mechanistic understanding of why this works. Hidden states converge to fixed points. Looped blocks learn inference stages. Coherent information flow determines which layers benefit from repetition.

The models are not getting smarter by getting bigger. They are getting smarter by thinking longer.


This is the seventh edition of my weekly deep dive into what is actually happening at the frontier of Generative AI. Previous editions: AI Gets Personal / The Quiet Skill Revolution / The Stack Got Leaked / The Stack Eats the Model / The Three Races in AI / The Week AI Learned to Do Its Own Research



About the author
Raj is an AI engineer and researcher. He spent 14 years at Microsoft working on Microsoft 365 Copilot Chat, Bing search ranking, Cortana, and Azure ML. Today he is building a new AI startup in stealth, advises other AI startups, conducts research on mechanistic interpretability, and writes weekly about Generative AI.