7 min read

The Three Races Happening in AI Right Now

If you only followed model releases, you'd think AI progress is linear. But there are three separate races happening at the same time, each with different winners and different stakes.

AI Research · Efficiency · Multimodal · Alignment · Generative AI · Newsletter

A GenAI Newsletter by Raj


If you only followed model releases, you'd think AI progress is linear: bigger models, better benchmarks, repeat. But if you look at what's actually being built this week, there are three separate races happening at the same time, each with different winners, different stakes, and different implications for what AI looks like a year from now.


Race 1: The Efficiency Race

The model that caught my attention this week is Nemotron-Cascade 2 from NVIDIA. It's a 30B parameter Mixture-of-Experts model where only 3 billion parameters are active at any given time. Despite this, its mathematical and coding reasoning performance approaches that of frontier open models.

This is part of a pattern. The efficiency race centers on one question: how small can you make the model and still get frontier-quality output? The answer keeps shrinking. A year ago, you needed 70B+ parameters for competitive reasoning. Six months ago, 32B was enough. Now NVIDIA is showing that 3B active parameters can get close.

Nemotron-Cascade 2 uses two techniques worth understanding:

Cascade RL: Instead of training one large model with reinforcement learning, they train a cascade where a small model handles easy queries and a larger model only activates for hard ones. Think of it as an automatic router that saves compute most of the time.

Multi-Domain On-Policy Distillation: The model learns from its own outputs across math, code, and language simultaneously, instead of from a teacher model's outputs. This avoids the distribution mismatch that makes traditional distillation fragile.
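The cascade-routing idea can be sketched in a few lines. Everything below is a hypothetical stand-in: the actual routing criterion, thresholds, and models in Nemotron-Cascade 2 are not public, so this only illustrates the confidence-gated escalation pattern.

```python
EASY_THRESHOLD = 0.9  # assumed cutoff; the real criterion is unknown

def small_model(query):
    """Stand-in for the cheap model: returns (answer, confidence).
    Here, short queries are treated as 'easy' for illustration."""
    confidence = 0.95 if len(query.split()) < 8 else 0.4
    return f"small-answer({query})", confidence

def large_model(query):
    """Stand-in for the expensive model, invoked only on hard queries."""
    return f"large-answer({query})"

def cascade(query):
    """Accept the small model's answer when it is confident,
    escalate to the large model otherwise. Most traffic never
    touches the large model, which is where the compute savings
    come from."""
    answer, confidence = small_model(query)
    if confidence >= EASY_THRESHOLD:
        return answer, "small"
    return large_model(query), "large"
```

The design choice worth noting is that the router is free: the small model's own confidence doubles as the routing signal, so no separate classifier has to run in front of every query.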

At ICML this year, a separate paper on FP4 quantization showed that you can train LLMs in 4-bit floating point, half the bit width of the already aggressive FP8. FP4 means roughly 2x the throughput on the same hardware, which is a big deal for training costs. A year ago, researchers said FP8 was the floor for training precision. That floor just dropped again.
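To make FP4 concrete: the E2M1 format can represent only sixteen values, so practical schemes pair it with a per-block scale factor. The sketch below is a toy block-scaled quantizer, not the paper's method; real FP4 training also keeps higher-precision accumulators and master weights.

```python
import numpy as np

# The magnitudes representable in FP4 (E2M1); each also has a sign bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x, block=16):
    """Toy block-scaled FP4 quantizer: a per-block scale maps the
    largest magnitude onto 6.0, then every value snaps to the nearest
    grid point. Illustrative only."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0  # avoid dividing by zero on all-zero blocks
    scaled = x / scale
    # Snap each magnitude to the nearest representable value, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(scaled) * FP4_GRID[idx] * scale
```

With only eight magnitudes per block, the per-block scale is doing most of the work, which is why block size and scale format matter as much as the 4-bit payload itself.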

Why does this matter? Every halving of compute requirements doubles the number of people and companies who can run these models. The efficiency race isn't about saving money for large labs. It's about making frontier AI accessible to anyone with a laptop.


Race 2: The Multimodal Race

One paper worth paying attention to from the conference circuit is Magma, a foundation model for multimodal AI agents that can operate in both digital and physical worlds. It came out of Microsoft Research and was presented at CVPR.

Most vision-language models can describe what they see. Magma can act on what it sees by clicking buttons in GUIs, manipulating objects in 3D environments, and navigating physical spaces. It combines what the authors call "verbal intelligence" with "spatial intelligence," so the model keeps its language understanding while also being able to plan and carry out actions in visual environments.

There's a growing gap between AI that can talk about the world and AI that can do things in the world, and several projects this week are working to close it.

NVIDIA released Kimodo, a kinematic motion diffusion model that generates physically realistic human and robot motion from text descriptions. You can say "walk to the table and pick up the cup" and Kimodo generates a 3D motion sequence that a humanoid robot can execute, complete with proper foot contacts, joint constraints, and smooth transitions.

Kimodo's design splits the problem into two stages: one model predicts the global trajectory (where the body goes), and a second model predicts the local motion (what the limbs do). This separation lets you constrain the path independently from the gesture, which is exactly what robotics applications need.
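The two-stage split can be sketched as an interface. Both stages below are hypothetical stand-ins (linear interpolation and zero offsets instead of diffusion models); the point is only how the composition lets you constrain the path without touching the gesture.

```python
import numpy as np

def plan_trajectory(start, goal, steps=50):
    """Stage 1 stand-in: the global root trajectory, i.e. where the
    body goes. Here plain linear interpolation; Kimodo would generate
    this with a diffusion model conditioned on the text prompt."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1 - t) * start + t * goal  # (steps, 3) root positions

def local_motion(trajectory, n_joints=24):
    """Stage 2 stand-in: per-frame joint offsets, i.e. what the limbs
    do, conditioned on the trajectory. Here zeros; the real model
    would produce gait, arm swing, and foot contacts."""
    return np.zeros((trajectory.shape[0], n_joints, 3))

def generate_motion(start, goal):
    """Compose the stages: full pose = root path + joint offsets.
    Because the path is a separate input, you can reroute it around
    an obstacle and re-run only stage 2."""
    root = plan_trajectory(start, goal)
    joints = local_motion(root)
    return root[:, None, :] + joints  # (steps, n_joints, 3)
```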

On the research side, a paper called Generation Models Know Space showed that multimodal LLMs suffer from "spatial blindness," meaning they can describe scenes semantically but fail at fine-grained geometric reasoning. The proposed fix uses the 3D understanding that generative models already have baked in to give language models better spatial awareness. It works, but it also highlights that language and space seem to be processed by very different parts of these models, and nobody has a clean way to bridge them yet.

The multimodal race determines whether AI stays in the chatbox or enters the physical world. Magma, Kimodo, and spatial reasoning research are three pieces of the same puzzle, and when they converge, we'll have AI agents that can see a room, plan a route, and execute it.


Race 3: The Alignment Race

This one gets less attention but probably matters the most in the long run.

At ICML, a paper called The Geometry of Refusal in Large Language Models found something worth knowing about. Earlier research suggested that a single "refusal direction" in the model's activation space controls whether it refuses harmful queries, and that removing this direction could jailbreak the model completely. The new paper shows it's more complicated than that: refusal behavior is mediated by multi-dimensional concept cones rather than a single direction.
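The earlier single-direction jailbreak idea is simple enough to write down: project the refusal direction out of every activation vector, h' = h - (h . r_hat) r_hat. The sketch below shows that ablation; the paper's point is that refusal actually lives in a cone of directions, so removing one ray like this is insufficient.

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation along one direction.
    activations: (n, d) array of residual-stream vectors.
    direction:   (d,) candidate refusal direction."""
    r = direction / np.linalg.norm(direction)
    # Subtract the projection onto r from every row.
    return activations - np.outer(activations @ r, r)
```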

What this means in practice is that the safety mechanisms in LLMs are harder to bypass than we thought, but they're also tangled up with how the model reasons. Removing safety tends to break capability too.

This ties into something else I noticed. From ACL 2025, a survey on Personalized Alignment argues that the biggest gap in real-world LLM deployment is that alignment is treated as one-size-fits-all. What counts as "helpful" for a doctor is different from what counts as "helpful" for a student. The paper goes through different approaches for making alignment work per-user without needing to fine-tune a separate model for each person, including things like contextual steering, preference profiles, and adaptive guardrails.
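One of those approaches, contextual steering, can be illustrated with a deliberately simple sketch: the same base model, but the preamble and guardrail settings come from a per-user preference profile instead of one global policy. The profile fields and role names here are my own illustration, not taken from the survey.

```python
# Hypothetical per-user preference profiles. In a real system these
# would be learned or user-configured, not hard-coded.
PROFILES = {
    "doctor": {
        "preamble": "Answer with clinical precision; include dosage ranges.",
        "refuse_medical_advice": False,
    },
    "student": {
        "preamble": "Explain simply and avoid prescribing treatment.",
        "refuse_medical_advice": True,
    },
}

def build_prompt(user_role, query):
    """Assemble a role-conditioned prompt from the user's profile,
    falling back to the most conservative profile for unknown roles."""
    profile = PROFILES.get(user_role, PROFILES["student"])
    return f"{profile['preamble']}\n\nUser: {query}"
```

Even this toy version shows the core trade-off: personalization moves the definition of "helpful" into a profile, which means someone has to decide what the default profile is.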

From the practical side, a paper on Energy Considerations of LLM Inference found that existing benchmarks for efficiency optimization miss how real-world workloads actually behave. The energy cost of running LLMs in production is far more variable than lab benchmarks suggest, because query distributions in the wild look nothing like evaluation suites. This matters because energy cost is starting to function as its own kind of alignment constraint. Regulators and investors are asking whether specific AI applications justify the electricity they consume.

The alignment race goes beyond preventing harm. It's really about who gets to define what "aligned" means, and whether that definition ends up being universal or personalized, technical or political, measured in safety scores or in electricity bills. Based on this week's papers, the answer seems to be all of the above.


The Intersection

These three races don't exist in isolation. They overlap in ways that shape where AI actually goes from here.

Efficiency combined with multimodal capabilities gives us embodied AI that runs on edge devices (Kimodo on a Jetson, not a data center). Efficiency combined with alignment gives us personalized models small enough to run locally, with per-user safety profiles. Multimodal combined with alignment creates agents that can act in the physical world and actually need robust safety, because you can't undo a robot's action the way you can discard a chatbot's response.

The companies and research groups that will shape the next phase of AI are the ones working at these intersections. Winning one race alone won't be enough.


This is the second edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. If you missed the first one, on Karpathy's AutoResearch, transformer circuit surgery, and self-evolving agents, read it here.



About the author
Raj is an AI engineer and researcher. He spent 14 years at Microsoft working on Microsoft 365 Copilot Chat, Bing search ranking, Cortana, and Azure ML. Today he is building a new AI startup in stealth, advises other AI startups, conducts research on mechanistic interpretability, and writes weekly about Generative AI.