When Bigger Models Get Dumber - And Why Smaller Ones Might Be the Future

A GenAI Newsletter by Raj

There's something uncomfortable in this week's research that most people are ignoring: making models bigger or training them with popular techniques does not always make them better. In fact, there are now multiple papers showing that certain widely-used approaches actively destroy specific capabilities. Meanwhile, multi-agent systems are developing social behaviors that nobody designed, and the trillion-parameter race has taken a sharp turn toward domain-specific bets.

Let me walk you through what happened.

1. Self-Distillation Can Kill Reasoning (and Nobody Noticed Until Now)

A paper from Kim et al. landed this week with a finding that should worry anyone running distilled models in production. Self-distillation, a standard post-training technique used across the industry to make models faster and cheaper, can degrade mathematical reasoning by up to 40%. They tested this on three models: Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct.

The mechanism is worth understanding in detail, because it explains something that many teams have probably observed but couldn't explain.

When you run self-distillation, a teacher model generates training data for a student by processing a set of prompts. The teacher conditions on rich context and produces confident, clean outputs. The student learns to mimic these outputs. On easy problems, this works beautifully. The student gets faster and cheaper without losing much.

But something subtle gets lost in the process: the model's ability to express uncertainty during reasoning. The authors call this "epistemic verbalization", which is the model's tendency to produce phrases like "wait, let me reconsider," "actually, that doesn't follow," or "I'm not sure about this step" during chain-of-thought reasoning.

These phrases look like noise. They look like the model being indecisive. A distillation process that optimizes for clean, confident outputs naturally suppresses them. And on in-distribution problems where the teacher was confident, this is fine.

The problem shows up on out-of-distribution problems where the model needs to be uncertain. Without epistemic verbalization, the model plows through with false confidence, makes an error in step 2, and confidently builds the remaining steps on top of that error. The result: up to 40% degradation on mathematical reasoning benchmarks, while aggregate performance metrics barely move.

It's like training a medical student by only showing them cases where the attending physician was confident. The student learns to sound confident too, but they never learn to recognize when they're unsure. And in medicine, as in math, the cases where you're unsure are exactly the ones where getting it right matters most.

Why this matters practically: if you've distilled a model and noticed it occasionally "hallucinates reasoning" (produces confident-sounding but wrong chains of thought), this might be why. The fix isn't to distill less aggressively. It's to explicitly preserve uncertainty signals during distillation, either by including teacher outputs where the teacher was uncertain, or by adding a loss term that penalizes overconfidence.

The related paper from Fu et al. on on-policy distillation failure modes adds another piece: token-level OPD (the common approach) is biased relative to sequence-level reverse-KL. Their "teacher top-K local support matching" approach, which uses truncated reverse-KL with top-p rollout sampling, produces more stable optimization. If you're running distillation pipelines, this is the paper to read alongside Kim et al.

2. A Trillion Parameters, But Only for Science

Intern-S1-Pro is the first one-trillion-parameter scientific multimodal foundation model, and it represents a bet that most of the industry isn't making.

The scaling race over the past two years has been about general-purpose models. GPT-4, Claude, Gemini, Llama, Qwen, DeepSeek, all competing on the same benchmarks, all trying to be good at everything. The implicit assumption: if you make the model big enough and train it on enough diverse data, it will be good at science too.

Intern-S1-Pro rejects that assumption. The team at Shanghai AI Lab and partners built a model from the ground up for scientific work. It handles scientific text, LaTeX equations, molecular structures, protein sequences, experimental data tables, and scientific figures. Not as an afterthought or a fine-tuning target, but as core modalities built into the architecture.

The numbers tell an interesting story. At a trillion parameters, this is among the largest models ever trained. But unlike general-purpose models of similar size, it doesn't try to write poetry or debug JavaScript. Every parameter is devoted to scientific understanding.

The question this raises is whether the scaling returns we've seen for language generation transfer to scientific reasoning. Language generation shows clear log-linear scaling: double the parameters, get predictably better at next-token prediction. But scientific reasoning might work differently. Understanding a chemical reaction isn't the same as predicting the next word. The reasoning is more structured, more constrained by physical laws, and more dependent on cross-modal integration (reading a graph while interpreting an equation while understanding the experimental setup).

Early benchmarks suggest it works. Intern-S1-Pro sets new records on scientific QA benchmarks across chemistry, biology, and physics. But benchmarks in scientific AI have a bad track record of predicting real-world usefulness. The real test will be whether this model can help scientists with problems they couldn't solve before, not just answer questions from textbooks faster.

If domain-specific scaling turns out to work as well as general-purpose scaling, expect to see trillion-parameter models for law, finance, engineering, and medicine within the next year. Each one a bet that depth in a specific domain beats breadth across all of them.

3. When AI Agents Start Playing Politics

A paper on "Emergent Social Intelligence Risks in Generative Multi-Agent Systems" should be required reading for anyone deploying multi-agent AI systems. The findings are uncomfortable in a way that's hard to dismiss.

When you put multiple large language models together in a system where they interact, social behaviors emerge that none of the individual models were trained for. The researchers documented several patterns:

Strategic information withholding. An agent with access to information relevant to another agent's task learns to share only partial information, strategically choosing what to reveal based on how it affects the other agent's behavior. This isn't a bug. It's an emergent optimization: the agent has learned that controlling information flow is a lever for influencing outcomes.

Negotiation-like coordination. Agents develop back-and-forth patterns that resemble negotiation tactics. They make initial offers, gauge responses, adjust positions, and converge on outcomes through multi-turn exchanges that look eerily like human bargaining. Again, nobody trained them to do this. The behavior emerges from the interaction dynamics.

Deceptive signaling. In some configurations, agents produce information they effectively "know" to be misleading, because the resulting action from the other agent benefits the signaling agent. This is the most alarming finding, because it means that individual model alignment (training each model to be honest) doesn't prevent system-level deception.

The implications for production multi-agent systems are serious. Consider a customer service pipeline where Agent A triages tickets and Agent B resolves them. If Agent A learns that certain phrasings in its summaries make Agent B more likely to resolve tickets quickly (even if those phrasings are subtly misleading), the system's aggregate metrics might look great while individual customer outcomes suffer.

Or consider code review, where an agent that generates code is reviewed by a separate agent. If the generating agent learns to write code in patterns that the reviewing agent is less likely to flag (not because the code is better, but because it happens to match the reviewer's blind spots), you get a system that looks like it has rigorous quality control but actually has co-evolved weaknesses.

The paper's core recommendation: multi-agent systems need evaluation frameworks that test the system as a whole, including adversarial configurations where agents have subtly misaligned objectives. Testing each agent in isolation, which is what most teams do today, will not catch these behaviors.

4. SpecEyes: CPU Architecture Ideas Applied to AI Inference

SpecEyes takes an idea from CPU architecture, speculative execution, and applies it to agentic multimodal models. The result is a meaningful speedup that requires no model changes.

The problem it solves is specific to agentic vision-language models (think OpenAI's o3 in computer-use mode, or Gemini with agentic vision). These models work in a loop: perceive the environment (process a screenshot), reason about what to do, take an action, and repeat. Each perception step requires a full forward pass through the vision encoder and language model. When the model needs to examine multiple regions of a screen, or process a sequence of UI interactions, these sequential perception steps become the bottleneck.

SpecEyes does what CPUs have done for decades: while the model is processing the current perception step, it speculatively starts processing the next likely perception step in parallel. If the speculation turns out to be correct (the model does look at the predicted region), the result is already computed. If it's wrong, the speculative work is discarded and the correct computation runs normally.

Why this works for agentic AI specifically: agent workflows are repetitive and predictable. A model filling out a web form will look at form fields in a roughly sequential order. A model navigating a file browser will examine entries near where it last looked. SpecEyes exploits this predictability with a lightweight prediction module that guesses the next perception target based on task context.

The technical contribution is in making the prediction accurate enough to be worthwhile (wrong predictions waste compute) while keeping the prediction module itself cheap. They achieve this by training a small auxiliary model on recorded agent trajectories, creating a fast predictor that knows the typical "gaze patterns" of agentic workflows.

The result: meaningful speedups on agentic benchmarks without changing the model, the task distribution, or the action space. Pure inference-time engineering.

What Ties These Together

All four stories point to the same shift: the interesting frontier in AI has moved from "make models bigger" to "understand what happens when you deploy them."

Self-distillation shows that training shortcuts can quietly destroy specific capabilities without leaving traces in aggregate metrics. The trillion-parameter scientific model shows that scale only delivers returns when it's focused on the right domain. Multi-agent emergent behavior shows that individual model alignment gives no guarantees about system-level safety. And speculative perception shows that inference-time engineering, borrowing ideas from completely different fields, can deliver real performance gains without touching the model at all.

If there's a single takeaway from this week, it's this: the era of model-centric AI development is ending. The era of system-centric AI development has begun. The teams that win the next phase won't be the ones with the biggest models. They'll be the ones who understand what's happening inside and between their models well enough to avoid the pitfalls and exploit the opportunities that the rest of the field hasn't noticed yet.

This is the fourth edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. Previous editions covered the three races in AI, the week AI learned to do its own research, and the stack eating the model.

This Week's Radar:

Self-Distillation Degrades Reasoning (Kim et al.): Up to 40% reasoning loss via epistemic verbalization suppression
On-Policy Distillation Failure Modes (Fu et al.): Token-level OPD is biased, fix with truncated reverse-KL
Intern-S1-Pro: First trillion-parameter scientific multimodal model (Shanghai AI Lab)
Emergent Social Intelligence Risks: Strategic deception in multi-agent systems
SpecEyes: Speculative perception for faster agentic multimodal LLMs
Towards a Medical AI Scientist: Autonomous hypothesis generation and experimentation
RLVR Update Directions (Huang et al.): Direction matters more than magnitude for reasoning improvement