<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Rajkiran Panuganti</title>
    <link>https://rajkiranpanuganti.com</link>
    <description>Rajkiran Panuganti is an AI engineer and researcher. He spent 14 years at Microsoft working on Microsoft 365 Copilot Chat, Bing search ranking, Cortana, and Azure ML. Today he is building a new AI startup in stealth, advises other AI startups, conducts research on mechanistic interpretability, and writes weekly about Generative AI.</description>
    <language>en</language>
    <lastBuildDate>Wed, 06 May 2026 14:56:06 GMT</lastBuildDate>
    <atom:link href="https://rajkiranpanuganti.com/feed/" rel="self" type="application/rss+xml"/>
    <item>
      <title>The Emerging Communication Stack for Agents</title>
      <link>https://rajkiranpanuganti.com/blog/the-emerging-communication-stack-for-agents/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-emerging-communication-stack-for-agents/</guid>
      <pubDate>Wed, 06 May 2026 00:00:00 GMT</pubDate>
      <description>Two communication stacks for AI agents are being built simultaneously: the channel layer (voice, email, SMS, Slack, WhatsApp) where agents reach humans, and the protocol layer (MCP, A2A, AG-UI) where agents reach each other and tools. Here is what is in each, who is building it, and what 2028 looks like.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>There are two ways agents communicate, and both are being rebuilt at the same time.</p>
<p>The first is how agents reach humans. Humans don't want to log into a new app called "agents.com." They want their AI to show up where they already are: Slack, Outlook, Gmail, WhatsApp, iMessage, the phone. So a layer of infrastructure has been built and funded to let agents enter every existing channel: phone numbers an agent answers, inboxes an agent owns, Slack workspaces where an agent is a first-class member. This is the channel layer. It's where revenue accrues, priced per minute and per resolution.</p>
<p>The second is how agents reach each other and reach tools. Agents need their own native protocols, because human channels assume a human at one end. So a parallel stack has been built, much of it donated to the Linux Foundation over the past twelve months: MCP for tools, A2A for peer agents, AG-UI for frontends, plus discovery and identity layers underneath. This is the protocol layer.</p>
<p>The interesting tension is that the two stacks are converging. Voice agents are starting to call other voice agents. Email agents are threading with email agents. The protocol layer was supposed to be where agents talk to each other, but a meaningful chunk of agent-to-agent communication is already happening through SIP and SMTP because that's where the agents already live. This newsletter walks through the channel layer, then the protocol stack, then what happens when they collide.</p>
<hr>
<h2>Voice and phone</h2>
<p>Voice is the most aggressive front of the channel war because contact centers are large, expensive, and clearly automatable. Pricing has commoditized to roughly seven to fifteen cents per minute at the platform layer; an all-in voice agent runs fifteen to thirty cents per minute today, roughly one tenth the cost of a fully loaded US contact-center agent.</p>
<p><strong>Vapi.</strong> Voice agent orchestration sitting between LLM, STT, TTS, and SIP/WebRTC, with bring-your-own-keys. Five cents per minute platform fee plus components. Series A from Bessemer in 2024.</p>
<p><strong>Bland AI.</strong> Pitches "infinite phone calls" by owning its own inference stack and telephony fabric. Build plan is $359 for 500 minutes. Series B from Scale Venture Partners.</p>
<p><strong>Retell AI.</strong> YC-backed, transparent seven cents per minute, built for developers wiring voice into existing products without owning the speech stack.</p>
<p><strong>ElevenLabs Conversational AI.</strong> Full voice-agent SDK on top of ElevenLabs TTS. Eight, ten, and twelve cents per minute across the Standard, Turbo, and Premium tiers, with a 95 percent discount on silence lasting longer than ten seconds. ElevenLabs cut Conversational AI pricing materially in 2025 as OpenAI Realtime and Cartesia entered.</p>
<p><strong>OpenAI Realtime API.</strong> $32 per million audio input tokens, $0.40 per million cached input tokens, $64 per million audio output tokens. This is the wholesale rate the rest of the platform layer builds on top of. The OpenAI sample voice agent and ChatGPT Voice Mode both run on LiveKit's WebRTC stack.</p>
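<p>For intuition on how those token prices map to the per-minute numbers above, here is a rough back-of-the-envelope sketch in Python. The audio-token rates per minute are assumptions for illustration (real usage varies with speech density, caching, and any text tokens in the prompt), not published figures.</p>
<pre><code># Rough per-minute cost sketch for a speech-to-speech agent on the
# OpenAI Realtime API. Audio-token rates per minute are ASSUMPTIONS for
# illustration; actual rates depend on the model and how much each side talks.
PRICE_PER_M_INPUT = 32.00    # USD per 1M audio input tokens
PRICE_PER_M_CACHED = 0.40    # USD per 1M cached audio input tokens
PRICE_PER_M_OUTPUT = 64.00   # USD per 1M audio output tokens

ASSUMED_INPUT_TOKENS_PER_MIN = 600    # assumption: caller speaks about half the time
ASSUMED_OUTPUT_TOKENS_PER_MIN = 600   # assumption: agent speaks the other half

def cost_per_minute(cached_fraction=0.0):
    """Estimate USD per conversation-minute, optionally with prompt caching."""
    fresh = ASSUMED_INPUT_TOKENS_PER_MIN * (1 - cached_fraction)
    cached = ASSUMED_INPUT_TOKENS_PER_MIN * cached_fraction
    input_cost = fresh / 1e6 * PRICE_PER_M_INPUT + cached / 1e6 * PRICE_PER_M_CACHED
    output_cost = ASSUMED_OUTPUT_TOKENS_PER_MIN / 1e6 * PRICE_PER_M_OUTPUT
    return input_cost + output_cost

print(f"no caching: ${cost_per_minute(0.0):.3f}/min")
print(f"80% cached: ${cost_per_minute(0.8):.3f}/min")
</code></pre>
<p>Under these assumptions the model cost alone lands around four to six cents per minute, which is consistent with all-in voice agents pricing at fifteen to thirty cents once telephony, orchestration, and platform fees are layered on top.</p>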
<p><strong>LiveKit Agents.</strong> The framework underneath ChatGPT Voice Mode and Character.ai's voice product. 1.0 in April 2025, now ships with native MCP tool support.</p>
<p><strong>Pipecat.</strong> Daily.co's open-source voice framework, Python and JavaScript SDKs, dozens of model integrations, Pipecat Cloud as the managed offering.</p>
<p><strong>Hume AI EVI.</strong> Empathic voice with emotion-aware turn-taking. Seven cents down to four cents per minute. Series B from EQT Ventures and Premji Invest.</p>
<p><strong>PolyAI, Parloa, Cognigy.</strong> The European enterprise wedge. PolyAI runs voice for Marriott, FedEx, and Caesars (Series C at a $500 million valuation backed by NVentures; $116 million raised in total). Parloa runs Decathlon and Swiss Life ($66 million Series B led by Altimeter). Cognigy runs Lufthansa, Bosch, Toyota, Mercedes-Benz, and Allianz ($100 million Series C; $175 million raised in total).</p>
<p><strong>Sesame.</strong> Brendan Iribe's company. Released CSM-1B as open-weights on HuggingFace in early 2025: end-to-end speech-in to speech-out, the most credible open-source competitor to closed voice agents.</p>
<p>The pattern: voice agents charge by the minute, those minutes look exactly like the minutes a human agent would have charged for, and pricing drops every quarter as the speech stack underneath commoditizes. The speech-model layer (ElevenLabs, Cartesia, Deepgram, AssemblyAI, Inworld, Whisper successors) has been getting roughly half as expensive every year for two years running. Voice is going to be cheap.</p>
<hr>
<h2>Email</h2>
<p>If voice is the loudest channel, email is the deepest. It's still the medium of record for enterprise communication, and any agent inside a corporate workflow eventually needs to send and receive it.</p>
<p><strong>Resend.</strong> Developer-first transactional email API from the React Email team. Series A of $18 million from Foundation Capital in 2024. Has marketed itself as "the email API for AI agents" since 2024, with idempotent send patterns and JSON schemas designed for agents that retry. Roughly 100,000 developers on the platform.</p>
<p><strong>AgentMail.</strong> Agent-native email infrastructure where every agent gets its own real inbox, with programmatic IMAP and SMTP plus threading APIs designed for autonomous agents to send and receive without a human ever logging in. YC-backed. The cleanest example of email being treated as a first-class agent channel rather than a developer afterthought.</p>
<p><strong>The inbox apps.</strong> Notion acquired Skiff in February 2024 and shipped Notion Mail in 2025. Grammarly acquired Superhuman in mid-2025 and rolled out AI write-and-reply through the rest of the year. Shortwave is the small, well-built ex-Google-Inbox alternative. Gmail Gemini and Outlook Copilot ($30/seat/month) have agent features baked into the defaults.</p>
<p>The structural fact is that SMTP and IMAP are forty years old and have no concept of agent identity. Anyone can send an email "from" your agent, and your agent has no built-in way to prove which messages it actually sent. Identity has to be added at higher layers (SPF, DKIM, and DMARC, plus the agent-attribution headers the agent stack is now defining). Email is going to be one of the messiest interop fights of the next two years.</p>
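<p>For a concrete sense of what "identity added at higher layers" means, here is a minimal sketch that DKIM-signs an agent-sent message using the open-source dkimpy package and attaches an attribution header. The X-Agent-Id header is a hypothetical placeholder (no such header has been standardized); DKIM, SPF, and DMARC are the mechanisms that exist today.</p>
<pre><code># Minimal sketch: an agent-owned mailbox signs outbound mail with DKIM so the
# receiving side can at least verify the sending domain. Requires the dkimpy
# package. The X-Agent-Id header is a HYPOTHETICAL placeholder, not a standard.
import dkim
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "refunds-agent@example.com"
msg["To"] = "customer@example.org"
msg["Subject"] = "Your refund request"
msg["X-Agent-Id"] = "refunds-agent-01"   # hypothetical attribution header
msg.set_content("Your refund of $42.00 has been approved.")

raw = msg.as_bytes()
with open("dkim_private.pem", "rb") as f:
    private_key = f.read()

# dkim.sign returns a DKIM-Signature header covering the listed headers.
signature = dkim.sign(
    message=raw,
    selector=b"agent2026",
    domain=b"example.com",
    privkey=private_key,
    include_headers=[b"From", b"To", b"Subject", b"X-Agent-Id"],
)
signed_message = signature + raw   # prepend the signature, then hand off to SMTP
</code></pre>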
<hr>
<h2>SMS, WhatsApp, and the messaging layer</h2>
<p><strong>Twilio AI Assistants</strong> launched at Signal in late 2024, layered on top of standard messaging and voice rates. Twilio's Q4 2025 numbers cited 300,000+ active accounts. Twilio plus SendGrid (acquired 2019) plus AI Assistants is the most complete single-vendor offering for an agent that needs to span SMS, voice, and email.</p>
<p><strong>Bird</strong> (formerly MessageBird) rebranded in 2023 and pivoted into AI-powered omnichannel agents. Its major business is WhatsApp Business API resale. <strong>Telnyx</strong> is the bootstrapped, profitable telco-grade alternative. <strong>Sinch</strong> (Stockholm: SINCH-B) consolidated Mailgun, Inteliquent, MessageMedia, and Pathwire into a revenue base of roughly $3 billion.</p>
<p><strong>WhatsApp Business</strong> is the largest single agent surface in the world. Meta has cited 200 million-plus businesses on the platform. Meta launched its own AI inside WhatsApp Business in 2024 with agentic features (catalog browsing, transactional flows) expanding through 2025 and 2026. The agent ecosystem inside WhatsApp Business is dominated by mid-market companies in India, Brazil, and Indonesia, not the Fortune 500 logos that dominate voice and email.</p>
<p>The messaging layer is where the global story diverges most from the US story. In North America, agents reach customers through SMS and email. Everywhere else, agents reach customers through WhatsApp.</p>
<hr>
<h2>Slack, Teams, and the chat layer</h2>
<p>The internal-comms equivalent of WhatsApp is Slack and Microsoft Teams. Both have spent the past eighteen months turning themselves into agent surfaces.</p>
<p><strong>Slack.</strong> Owned by Salesforce since 2021. Slack AI shipped in 2024 with summarization, search, and an inline assistant. The bigger move came through Salesforce Agentforce: any Agentforce agent can be installed in a Slack workspace as a first-class member, mentioned by handle, assigned tasks from a thread. The agent appears with its own avatar and identity, not as a bot account.</p>
<p><strong>Microsoft Teams.</strong> Copilot in Teams plus Copilot Studio agents. Multi-agent went generally available in April 2026 with A2A as the cross-vendor protocol, which means a Teams-resident agent can call out to a Salesforce-resident agent without either side speaking the other's framework. It is the first time an agent in one vendor's chat client has been able to call an agent in another vendor's chat client in production.</p>
<p><strong>The Slack-app pattern.</strong> Glean, Notion AI, Asana AI Studio, Linear's AI features, and dozens of vertical SaaS products ship with Slack as their primary user interface. Install once, mention by name, get answers backed by the SaaS product's data.</p>
<p>The structural advantage of Slack and Teams as agent surfaces is that they already have identity, permissions, and channel-level access control. An agent in Slack inherits the workspace's IAM and the user's permissions, which is exactly the kind of thing voice and SMS struggle to do. Half of the protocol layer covered later is an attempt to give voice and email the same identity primitives that Slack and Teams already had on day one.</p>
<hr>
<h2>The customer-facing agent companies</h2>
<p>The platforms above are the infrastructure. The companies fighting on top of that infrastructure are the customer-facing agent vendors selling to enterprises.</p>
<p><strong>Sierra.</strong> Bret Taylor and Clay Bavor's company, and the cleanest single bet in the segment. $100 million ARR by November 2025 (twenty-one months from launch), $150 million by early February 2026. On May 4, 2026 (two days before this newsletter), Sierra closed $950 million at a $15.8 billion valuation in a round led by Tiger Global and Google's GV, with Benchmark, Sequoia, and Greenoaks participating. The valuation is up from roughly $10 billion in the fall of 2025 and $4.5 billion in October 2024. The customer roster has moved upmarket fast: ADT, SiriusXM, WeightWatchers, Sonos, plus Prudential, Cigna, Blue Cross Blue Shield, Rocket Mortgage, and what Sierra describes as one in three of the world's largest banks. Outcome-based pricing per resolution, with annual contract values from a $150,000 floor to more than $1.5 million, plus $50,000 to $200,000 in implementation fees. Sierra is now the most valuable pure-play customer-agent company by a wide margin.</p>
<p><strong>Decagon.</strong> Raised $250 million at a $4.5 billion valuation in March 2026 with 100+ enterprise logos. $50,000 platform floor plus roughly $0.99 per conversation, with annual contract values $95,000 to $590,000 and a median around $400,000. Customers include Duolingo, Chime, Rippling, Notion, and Eventbrite.</p>
<p><strong>Intercom Fin.</strong> $0.99 per resolution, unchanged from launch through April 2026, with a 50-resolution monthly minimum on top of a base Intercom plan. Fin 2 launched in 2025. Eoghan McCabe has publicly cited Fin handling more than half of customer support traffic for many Intercom customers.</p>
<p><strong>Cresta</strong> runs real-time AI coaching for human agents (not replacement) at Intuit, Brinks, Hilton, and Cox Communications, with a $150,000 floor and an estimated $40 to $50 million in ARR. <strong>Ada</strong> sells to Meta, Verizon, and Square. <strong>Glia</strong> does digital + voice for financial services. <strong>Replicant</strong> was acquired by LivePerson in 2024, an early consolidation case.</p>
<p><strong>Outbound.</strong> A separate category for agents making outbound contact rather than handling inbound. 11x.ai sells "digital workers" called Alice (SDR), Jordan (phone rep), and Julian (inbound qualifier) for $5,000 to $15,000 per month with an annual commitment. Series B of $50 million from a16z at a roughly $350 million valuation. Artisan AI ran the famous "Stop hiring humans" billboards in San Francisco in 2024.</p>
<p>The pricing pattern is consistent: inbound is per-minute or per-resolution (cheap to start, metered to grow); outbound is headcount-equivalent (expensive to start but easy to compare against a salary line). The two models are slowly converging, with outbound vendors offering per-meeting-set pricing and inbound vendors offering flat-rate enterprise SKUs.</p>
<hr>
<h2>The Klarna reversal is the most important data point in this market</h2>
<p>Anyone selling agent software loves the Klarna numbers. 2.3 million conversations in the first month, two thirds of customer service volume, average resolution time from 11 minutes to under 2, "doing the work of 700 full-time agents," $40 million projected profit improvement. Vendors have cited those numbers in pitch decks for two years.</p>
<p>What gets cited less is what happened next.</p>
<p>In May 2025, Klarna's CEO Sebastian Siemiatkowski publicly walked the story back. In a Bloomberg interview he said cost had been "a too predominant evaluation factor" and the result was "lower quality." Customer satisfaction dropped 22 percent. Klarna began rehiring human agents under an Uber-style gig model. The original $40 million had always been cost avoidance (agents Klarna would have had to hire during growth), and even that framing turned out to overstate the savings once you priced in the brand damage from a long tail of badly handled tickets.</p>
<p>The honest read: AI handled the easy 60 to 70 percent of support cleanly, and the remaining 30 to 40 percent failed worse than humans would have, with overconfidence and fabricated policy claims that didn't show up on any vendor invoice but did show up in churn. That's a different story from "AI replaced 700 agents," and it's the story everyone deploying voice and chat agents at scale needs to internalize.</p>
<p>The structural lesson for the channel layer is that the deployments that work are hybrid. The agent handles the easy ticket end-to-end and escalates the hard one to a human inside the same channel, with the agent's full context attached. Sierra and Cresta both pitch this hybrid model explicitly. Klarna is now running it. Vendors pitching full agent replacement keep getting walked back, while vendors pitching escalation-on-failure keep growing.</p>
<hr>
<h2>The protocol layer agents use to talk to each other</h2>
<p>A year ago, "agent-to-agent communication" was a phrase you mostly heard from Google. Today, four open foundations are governing it, with most of the work hosted by the Linux Foundation through the new Agentic AI Foundation, formed on December 9, 2025 with Anthropic, Block, and OpenAI as co-founders, and AWS, Google, Microsoft, Cloudflare, and Bloomberg as platinum members.</p>
<p><strong>MCP (Model Context Protocol).</strong> Anthropic's contribution, the agent-to-tool transport. Donated to the Linux Foundation in December 2025. The current spec revision is 2025-11-25, with Streamable HTTP as the active transport and OAuth 2.1 with PKCE and mandatory Resource Indicators (RFC 8707) for authorization. 97 million monthly SDK downloads as of March 2026, 10,000+ active public servers, first-class clients in Claude, ChatGPT, Cursor, Windsurf, VS Code, JetBrains, Microsoft Copilot, and Gemini.</p>
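<p>To make the tool bus concrete: MCP is JSON-RPC 2.0, and over Streamable HTTP a tool listing is a single POST. The sketch below is simplified (placeholder URL, no initialize handshake, and it assumes a JSON rather than SSE response); real clients should use an official MCP SDK.</p>
<pre><code># Minimal sketch of an MCP tools/list call over Streamable HTTP.
# Simplifications: placeholder server URL, no initialize handshake, and the
# response is assumed to be JSON rather than a text/event-stream.
import requests

MCP_ENDPOINT = "https://mcp.example.com/mcp"   # placeholder URL

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
    "params": {},
}

resp = requests.post(
    MCP_ENDPOINT,
    json=request,
    headers={
        # Streamable HTTP servers may reply with JSON or an SSE stream.
        "Accept": "application/json, text/event-stream",
        "MCP-Protocol-Version": "2025-11-25",   # the spec revision cited above
    },
    timeout=30,
)
for tool in resp.json().get("result", {}).get("tools", []):
    print(tool["name"], "-", tool.get("description", ""))
</code></pre>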
<p><strong>A2A (Agent2Agent).</strong> Google's contribution, the peer-to-peer agent protocol. Donated to the Linux Foundation on June 23, 2025. The headline feature in v1.0 is the Signed Agent Card: a cryptographic signature on a JSON document that describes an agent's capabilities and origin, so a receiving agent can verify the card was issued by the domain owner before delegating any work. By the one-year mark on April 9, 2026, A2A had 150 supporting organizations with named production deployments in Microsoft Azure AI Foundry, Microsoft Copilot Studio, AWS Bedrock AgentCore, Salesforce Agentforce, and Google Cloud.</p>
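<p>To give a feel for what a receiving agent actually verifies, here is an Agent Card sketched as a Python dict. The field names follow the general shape of A2A's published card (name, capabilities, skills) but should be treated as approximate; in a Signed Agent Card, the signatures block carries one or more JWS signatures over the document.</p>
<pre><code># Approximate shape of an A2A Agent Card as a Python dict.
# Field names are illustrative; consult the A2A spec for the exact schema.
agent_card = {
    "name": "refund-resolution-agent",
    "description": "Handles refund and billing-dispute workflows.",
    "url": "https://agents.example.com/a2a",        # the agent's A2A endpoint
    "version": "1.0.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "defaultInputModes": ["text"],
    "defaultOutputModes": ["text"],
    "skills": [
        {
            "id": "issue-refund",
            "name": "Issue refund",
            "description": "Validates a claim and issues refunds up to policy limits.",
        }
    ],
    # A Signed Agent Card adds JWS signatures over the card document, so a peer
    # can check the card was issued by the domain owner before delegating work.
    "signatures": [{"protected": "base64url(JWS header)", "signature": "base64url(sig)"}],
}
</code></pre>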
<p><strong>AG-UI (Agent-User Interaction Protocol).</strong> Built by CopilotKit. The third wire protocol the rest of the stack quietly assumes but doesn't actually solve: agent-to-frontend. MCP handles agent-to-tool. A2A handles agent-to-agent. AG-UI standardizes how an agent streams tokens, tool calls, intermediate state, and dynamically generated UI components into a running web application. CopilotKit closed a $20.5 million Series A on May 5, 2026 (the day before this newsletter went out), led by Glilot Capital with NFX and SignalFire, $27 million total. Repository at 40,000+ GitHub stars with millions of installs per week. Infra adopters include Google, Microsoft, Amazon, and Oracle. Framework integrations include LangChain, Mastra, PydanticAI, and Agno. CopilotKit reports more than half the Fortune 500 using the open-source toolkit, with named customers Deutsche Telekom, Docusign, Cisco, and S&#x26;P Global.</p>
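<p>For a sense of what "streams tokens, tool calls, intermediate state, and dynamically generated UI" looks like on the wire, here is a hedged sketch of an AG-UI-style event sequence as plain JSON events. Event names approximate the protocol's published types but may not match the current schema exactly.</p>
<pre><code># Illustrative AG-UI-style event stream for one agent run. Event names
# approximate the protocol's run-lifecycle, text, tool-call, and state-delta
# events; check the AG-UI spec for the exact schema.
run_events = [
    {"type": "RUN_STARTED", "threadId": "t-123", "runId": "r-456"},
    {"type": "TEXT_MESSAGE_START", "messageId": "m-1", "role": "assistant"},
    {"type": "TEXT_MESSAGE_CONTENT", "messageId": "m-1", "delta": "Checking your order"},
    {"type": "TEXT_MESSAGE_CONTENT", "messageId": "m-1", "delta": " status now..."},
    {"type": "TEXT_MESSAGE_END", "messageId": "m-1"},
    {"type": "TOOL_CALL_START", "toolCallId": "c-1", "toolCallName": "lookup_order"},
    {"type": "TOOL_CALL_ARGS", "toolCallId": "c-1", "delta": '{"order_id": "A-998"}'},
    {"type": "TOOL_CALL_END", "toolCallId": "c-1"},
    {"type": "STATE_DELTA", "delta": [{"op": "replace", "path": "/order/status", "value": "shipped"}]},
    {"type": "RUN_FINISHED", "threadId": "t-123", "runId": "r-456"},
]

# A frontend consumes these incrementally (typically over SSE) and renders
# streamed text, tool activity, and shared state as they arrive.
for event in run_events:
    print(event["type"])
</code></pre>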
<p><strong>The discovery and identity layer.</strong> AGNTCY (Cisco/LangChain/Galileo, donated to the Linux Foundation in July 2025) sits one level above MCP and A2A with a federated agent directory and the Open Agent Schema Framework. NANDA (MIT, led by Ramesh Raskar) is DNS for agents: a globally distributed mapping from an agent handle to a verified metadata file, currently hosted at 15 universities, with cryptographically verifiable AgentFacts as the signed metadata format. The NANDA Summit at MIT on April 9 to 11, 2026 was the major adoption event.</p>
<p><strong>The supporting layers.</strong> IBM ACP (separate from OpenAI's commerce ACP) is the async-first agent-to-agent protocol from BeeAI, also at the Linux Foundation, designed for long-running tasks with curl-friendly REST. NLIP (standardized through Ecma TC56, ECMA-430 approved December 10, 2025) is the application-level message envelope that abstracts API versioning. Letta Agent File (.af, released April 2, 2025 by the MemGPT team) is a portable container format for stateful agents (the Docker image of agents).</p>
<p>The right mental model: MCP is the tool bus, A2A is the agent-to-agent bus, AG-UI is the agent-to-frontend bus, AGNTCY and NANDA are the discovery and identity layers, NLIP is the message envelope, .af is the container format. These standards mostly compose. The open question is which subset becomes the default for a typical enterprise deployment by the end of 2027, and the answer is starting to look like all of them.</p>
<hr>
<h2>What big tech is shipping</h2>
<p>The cloud vendors are not waiting for the standards to settle. They're shipping products that span both the channel layer and the protocol layer.</p>
<p><strong>Salesforce</strong> is the cleanest example of the converged play. Agentforce exposes every custom agent as an A2A endpoint and as a first-class Slack member. Salesforce contributed the Agent Card concept itself. The combined picture: an Agentforce agent can be reached by a peer agent over A2A, by a customer over WhatsApp through a Bird-or-Sinch integration, by a service rep inside Slack as a mention, and by a developer through MCP. One agent reachable across four channels, all in production.</p>
<p><strong>Microsoft.</strong> Copilot Studio multi-agent went generally available in April 2026 with A2A as the cross-vendor bus. Copilot agents are reachable from Teams as chat, from Outlook as email, and from any A2A peer programmatically. Microsoft Agent Framework v1.0 ships A2A as a first-class protocol for both .NET and Python.</p>
<p><strong>AWS Bedrock AgentCore</strong> went GA on October 13, 2025. A2A added October 2025, AWS Marketplace A2A server support November 2025, stateful MCP server features March 2026. Cross-framework support for Strands, OpenAI Agents SDK, LangGraph, Google ADK, Claude Agents SDK.</p>
<p><strong>Anthropic</strong> combines its protocol work with aggressive enterprise distribution: Cognizant rolling Claude out to 350,000 employees, Deloitte to 470,000, Accenture training 30,000 professionals, Swiggy shipping MCP integration for grocery and restaurant reservations, and India's Ministry of Statistics building the first official Indian government MCP server. 300,000+ business customers, 500+ spending over $1 million per year, 8 of the Fortune 10.</p>
<p><strong>Google</strong> ships A2A and AGNTCY in Vertex AI agents out of the box. <strong>IBM</strong> runs ACP plus BeeAI plus watsonx Orchestrate. <strong>Block</strong> moved Goose to the Agentic AI Foundation, with 70+ documented MCP extensions. None of them is building a closed agent stack. All of them are building open agent stacks that happen to run best on their own clouds and inside their own chat surfaces.</p>
<hr>
<h2>When agents call agents</h2>
<p>The most interesting thing happening at the boundary between the channel layer and the protocol layer is that agents are starting to communicate with other agents through the human channels. A Vapi-built voice agent calls a phone number, gets routed to a Bland-hosted answering agent, and the two argue for ten minutes about a refund. A Resend outbound email lands in an inbox where a Grammarly-powered reply agent threads back. None of this is using MCP or A2A. The two sides are speaking SIP and SMTP because those are the channels their humans use.</p>
<p>This is awkward for the protocol layer, because A2A specifically was supposed to be where this happens. The real-world answer is starting to look like the protocol layer wraps the channel layer rather than replacing it. An A2A handshake establishes identity and sets up the call. The actual conversation runs over voice or email. The transcript and outcome are returned through A2A. NANDA's AgentFacts and Salesforce's Agent Cards are part of how this works: an agent picking up a call can read the caller's signed Agent Card, decide whether to switch to a faster programmatic channel, and do so midstream if both sides agree.</p>
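<p>Here is a hedged sketch of that "protocol wraps channel" flow: verify the peer's card, prefer a programmatic channel when both sides support it, otherwise stay on the phone and return the transcript through A2A. Every helper here (verify_agent_card, send_a2a_message, place_sip_call) is a hypothetical placeholder for whatever your A2A client, SIP stack, and transcription pipeline actually provide.</p>
<pre><code># Hedged sketch of the "protocol wraps channel" pattern. All helper functions
# are HYPOTHETICAL placeholders standing in for a real A2A client, SIP stack,
# and transcription pipeline.
def negotiate_and_call(peer_card, refund_request):
    # 1. Identity: verify the peer's Signed Agent Card before delegating work.
    if not verify_agent_card(peer_card):
        raise PermissionError("peer agent card failed signature verification")

    # 2. Channel selection: prefer the programmatic path when both sides support it.
    if peer_card["capabilities"].get("a2a_tasks"):
        return send_a2a_message(peer_card["url"], task=refund_request)

    # 3. Otherwise stay in the human channel: place the call over SIP, let the
    #    two voice agents talk, and capture a transcript for the audit trail.
    call = place_sip_call(peer_card["phone"], opening_prompt=refund_request["summary"])
    transcript = call.wait_for_transcript()

    # 4. Return the outcome through the protocol layer so it lands in the
    #    delegating agent's task history with full context attached.
    return send_a2a_message(
        peer_card["url"],
        task={"type": "call_summary", "transcript": transcript, "request": refund_request},
    )
</code></pre>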
<p>Two patterns are showing up in production. "Agent prefers programmatic": both sides detect each other, exchange A2A handshakes, and complete in sub-second over MCP rather than dragging on as a multi-minute voice call. Common in internal coordination between enterprise agents. "Agent stays in channel": both sides agree the human user expects a voice transcript or email thread for audit, so they keep the conversation in the channel even when they could complete it faster elsewhere. Common in customer support.</p>
<p>The right way to think about it: the channel layer is where humans and agents share the same wire; the protocol layer is where agents accelerate when no human is watching. Both stacks need to interop, which is most of what 2026 and 2027 are going to be about.</p>
<hr>
<h2>What 2028 looks like</h2>
<p>The protocol war is functionally over. By 2028, A2A is the agent-to-agent default for cross-vendor work, MCP is the agent-to-tool default for everyone, AG-UI is the agent-to-frontend default for any product with a UI, and AGNTCY plus NANDA together provide the discovery and identity layers. IBM ACP keeps a niche in async-heavy enterprise workflows. NLIP becomes the envelope everyone implements without thinking about. Letta-style agent files become the portable container format.</p>
<p>The channel layer consolidates differently. Voice agents become a $20 billion-plus category, dominated by three or four enterprise platforms (Sierra, Decagon, Parloa) and three or four developer platforms (Vapi, Bland, Retell, ElevenLabs). The speech-model layer consolidates around two or three winners. Email becomes the ugliest layer because the underlying protocol has no agent identity built in; the eventual answer is some combination of DMARC plus signed agent headers plus enterprise-only inbox routing. SMS stays small in the US and dominant in the rest of the world, with WhatsApp Business as the largest single agent surface globally. Slack and Teams become first-class agent channels with full IAM, and most enterprise workplace agents end up living there rather than in standalone web apps.</p>
<p>The Klarna lesson holds: vendors pitching full replacement keep getting walked back, and the ones pitching escalation-on-failure keep growing.</p>
<p>The deeper change is that for the first time we're building communication infrastructure for clients that aren't human. Every previous channel (telephone, email, SMS, chat) was designed assuming a human at one end. Voice agents and email agents and Slack agents are forcing those channels to learn a second client type, the same way the web learned mobile in the 2010s. The protocol layer is the agent-native side of the same shift, where we're building from scratch for the second client type without the constraints of human-era assumptions.</p>
<p>Both stacks are real, funded, and in production. The interesting work in 2027 and 2028 is not picking a winner but building the bridges between them, because most agents in the wild will need to operate in both at once.</p>
<hr>
<p><em>This is my regular weekly newsletter on Generative AI. Recent editions cover <a href="/blog/the-agentic-economy-is-already-here">The Agentic Economy Is Already Here</a>, <a href="/blog/solving-hallucination-where-the-research-stands">Solving Hallucination</a>, <a href="/blog/the-quiet-skill-revolution">The Quiet Skill Revolution</a>, and <a href="/blog/why-looping-is-the-new-scaling">Why Looping Is the New Scaling</a>.</em></p>
<hr>
<p><strong>Sources and Further Reading:</strong></p>
<p><em>Voice and phone</em></p>
<ul>
<li><a href="https://vapi.ai">Vapi</a></li>
<li><a href="https://bland.ai">Bland AI</a></li>
<li><a href="https://retellai.com">Retell AI</a></li>
<li><a href="https://elevenlabs.io/conversational-ai">ElevenLabs Conversational AI</a></li>
<li><a href="https://platform.openai.com/docs/guides/realtime">OpenAI Realtime API</a></li>
<li><a href="https://docs.livekit.io/agents">LiveKit Agents</a></li>
<li><a href="https://www.pipecat.ai">Pipecat</a></li>
<li><a href="https://hume.ai">Hume AI</a></li>
<li><a href="https://poly.ai">PolyAI</a></li>
<li><a href="https://github.com/SesameAILabs/csm">Sesame CSM</a></li>
<li><a href="https://parloa.com">Parloa</a></li>
<li><a href="https://cognigy.com">Cognigy</a></li>
</ul>
<p><em>Speech models</em></p>
<ul>
<li><a href="https://cartesia.ai">Cartesia</a></li>
<li><a href="https://deepgram.com">Deepgram</a></li>
<li><a href="https://assemblyai.com">AssemblyAI</a></li>
<li><a href="https://inworld.ai">Inworld</a></li>
</ul>
<p><em>Email</em></p>
<ul>
<li><a href="https://resend.com">Resend</a></li>
<li><a href="https://agentmail.to">AgentMail</a></li>
<li><a href="https://www.notion.com/mail">Notion Mail</a></li>
<li><a href="https://superhuman.com">Superhuman</a></li>
<li><a href="https://www.shortwave.com">Shortwave</a></li>
</ul>
<p><em>SMS, WhatsApp, messaging</em></p>
<ul>
<li><a href="https://www.twilio.com/en-us/ai-assistants">Twilio AI Assistants</a></li>
<li><a href="https://bird.com">Bird</a></li>
<li><a href="https://telnyx.com">Telnyx</a></li>
<li><a href="https://www.sinch.com">Sinch</a></li>
<li><a href="https://business.whatsapp.com">WhatsApp Business</a></li>
</ul>
<p><em>Customer-facing agent companies</em></p>
<ul>
<li><a href="https://sierra.ai">Sierra</a></li>
<li><a href="https://techcrunch.com/2026/05/04/sierra-raises-950m-as-the-race-to-own-enterprise-ai-gets-serious/">Sierra raises $950M at $15.8B (TechCrunch, May 4 2026)</a></li>
<li><a href="https://techcrunch.com/2025/11/21/bret-taylors-sierra-reaches-100m-arr-in-under-two-years/">Sierra hits $100M ARR (TechCrunch, Nov 2025)</a></li>
<li><a href="https://decagon.ai">Decagon</a></li>
<li><a href="https://intercom.com/fin">Intercom Fin</a></li>
<li><a href="https://cresta.com">Cresta</a></li>
<li><a href="https://11x.ai">11x.ai</a></li>
<li><a href="https://artisan.co">Artisan AI</a></li>
<li><a href="https://www.bloomberg.com/news/articles/2025-05-09/klarna-turns-back-to-humans-as-ai-cost-cuts-go-too-far">Klarna AI Assistant retraction (Bloomberg, May 2025)</a></li>
</ul>
<p><em>Protocol layer</em></p>
<ul>
<li><a href="https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation">Linux Foundation Agentic AI Foundation</a></li>
<li><a href="https://blog.modelcontextprotocol.io/posts/2026-mcp-roadmap/">The 2026 MCP Roadmap</a></li>
<li><a href="https://a2a-protocol.org/latest/">A2A Protocol</a></li>
<li><a href="https://www.linuxfoundation.org/press/a2a-protocol-surpasses-150-organizations-lands-in-major-cloud-platforms-and-sees-enterprise-production-use-in-first-year">LF: A2A surpasses 150 organizations</a></li>
<li><a href="https://www.copilotkit.ai/ag-ui">AG-UI Protocol</a></li>
<li><a href="https://techcrunch.com/2026/05/05/copilotkit-raises-27m-to-help-devs-deploy-app-native-ai-agents/">CopilotKit raises $27M (TechCrunch, May 5 2026)</a></li>
<li><a href="https://outshift.cisco.com/blog/building-the-internet-of-agents-introducing-the-agntcy">AGNTCY: Internet of Agents</a></li>
<li><a href="https://projectnanda.org/">Project NANDA (MIT)</a></li>
<li><a href="https://agentcommunicationprotocol.dev">IBM Agent Communication Protocol</a></li>
<li><a href="https://ecma-international.org/news/ecma-international-approves-nlip-standards-suite-for-universal-ai-agent-communication/">Ecma NLIP standards approval</a></li>
<li><a href="https://github.com/letta-ai/agent-file">Letta Agent File</a></li>
</ul>
<p><em>Big tech deployments</em></p>
<ul>
<li><a href="https://www.microsoft.com/en-us/microsoft-cloud/blog/2025/05/07/empowering-multi-agent-apps-with-the-open-agent2agent-a2a-protocol/">Microsoft Copilot Studio multi-agent</a></li>
<li><a href="https://aws.amazon.com/about-aws/whats-new/2025/10/amazon-bedrock-agentcore-available/">AWS Bedrock AgentCore GA</a></li>
<li><a href="https://www.salesforce.com/blog/agent-interoperability/">Salesforce Agent Card and A2A</a></li>
<li><a href="https://www.anthropic.com/news/claude-partner-network">Anthropic Partner Network</a></li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Agents</category>
      <category>Agent Communication</category>
      <category>MCP</category>
      <category>A2A</category>
      <category>AG-UI</category>
      <category>Voice Agents</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>Solving Hallucination: Where the Research Stands</title>
      <link>https://rajkiranpanuganti.com/blog/solving-hallucination-where-the-research-stands/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/solving-hallucination-where-the-research-stands/</guid>
      <pubDate>Fri, 24 Apr 2026 00:00:00 GMT</pubDate>
      <description>The hallucination research landscape has matured significantly. Researchers have identified the internal mechanisms, built domain-specific benchmarks, and developed a mitigation stack. Here is where things stand.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>Hallucination is the single biggest barrier to deploying LLMs in production. Everyone working with these models knows this. What's less well understood is how much the research has matured. Three years ago, hallucination was treated as a mysterious failure mode. Today, researchers have identified the internal mechanisms that cause it, built benchmarks that measure it across specific domains, and developed mitigation techniques that reduce it from 50%+ to low single digits in controlled settings.</p>
<p>The problem is not solved. Models still hallucinate at rates that make them dangerous in legal, medical, and financial applications. The agentic setting makes everything worse. But the field has moved from "why does this happen" to "here is the engineering stack that manages it," and the trajectory of the research is worth mapping.</p>
<hr>
<h2>What We Know About Why Models Hallucinate</h2>
<p>The vague explanation ("next-token prediction sometimes produces plausible but wrong tokens") has been replaced by specific mechanistic findings.</p>
<h3>Two internal failure modes</h3>
<p>Research across multiple language models has identified two distinct mechanisms at the neuron level (<a href="https://arxiv.org/abs/2510.06265">Comprehensive Survey, arXiv 2510.06265</a>):</p>
<p><strong>Knowledge enrichment failures in lower-layer MLPs.</strong> The lower layers of a transformer are responsible for retrieving factual knowledge associated with the subject of a query. When these layers have insufficient or contradictory information (because the training data was sparse or conflicting), the model generates a plausible-sounding fabrication. The model has no internal signal that it's making something up. The information simply isn't there.</p>
<p><strong>Answer extraction failures in upper-layer attention heads.</strong> Even when lower layers successfully retrieve correct knowledge, upper-layer attention heads sometimes fail to select the right fact from what's available. The knowledge exists in the model's internal state, but the selection mechanism picks the wrong piece. This is closer to a lookup bug than a knowledge gap, and it explains why models sometimes hallucinate on topics they demonstrably "know."</p>
<h3>The RAG override problem</h3>
<p>The <a href="https://arxiv.org/abs/2410.11414">ReDeEP paper</a> (presented at ICLR 2026) identified a third mechanism specific to retrieval-augmented generation: <strong>Knowledge FFNs overpower Copying Heads.</strong> When the model has both parametric knowledge (from training) and retrieved knowledge (from documents), the feedforward networks encoding parametric knowledge can dominate the residual stream, causing the model to ignore the retrieved content. This explains a frustrating production failure: RAG systems hallucinating on exactly the questions the retrieved documents were supposed to answer, because the model's memory overrides the evidence in front of it.</p>
<h3>The confidence problem</h3>
<p><a href="https://sqmagazine.co.uk/llm-hallucination-statistics/">MIT researchers found</a> (January 2025) that models are 34% more likely to use phrases like "definitely," "certainly," and "without doubt" when generating incorrect information. <a href="https://www.lakera.ai/blog/guide-to-hallucinations-in-large-language-models">OpenAI's September 2025 paper</a> showed that standard training objectives and leaderboard metrics actively reward this behavior: models learn to bluff because bluffing scores better on benchmarks than saying "I don't know."</p>
<p>This means the most dangerous hallucinations are the ones that sound most confident. Current evaluation methods are biased toward rewarding exactly the wrong behavior.</p>
<hr>
<h2>How Bad Is It? Domain by Domain</h2>
<p>The Vectara Hallucination Leaderboard (<a href="https://github.com/vectara/hallucination-leaderboard">37+ models, 7,700+ articles</a>) reports aggregate hallucination rates of 15-52% across models, with most clustering in the 20-27% range. But aggregates mask the real story. Hallucination severity varies enormously by domain, and the domains where accuracy matters most are the ones where models perform worst.</p>
<p><strong>Legal</strong> is the most dangerous case studied. <a href="https://arxiv.org/abs/2401.01301">Stanford RegLab and HAI</a> tested LLMs on specific legal queries and found hallucination rates of 69-88%. On questions about a court's core ruling, models hallucinate at least 75% of the time. Purpose-built legal AI tools don't fully solve this: <a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/">Lexis+ AI produced incorrect information 17% of the time, Westlaw AI-Assisted Research 34%</a>. The failure mode that gets the most attention is fabricated citations (case names, docket numbers, and holdings that don't exist), but the deeper problem is subtle misstatement of legal holdings where the case exists but the model mischaracterizes what it decided.</p>
<p><strong>Medical</strong> hallucination is measured most rigorously by the <a href="https://medhallu.github.io/">MedHallu benchmark</a> (10,000 QA pairs from PubMedQA). The best model achieves F1 of only 0.625 on detecting hard-category hallucinations. In production, healthcare AI systems show 10-20% hallucination rates depending on task type (<a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/">Suprmind Research Report</a>). Drug interaction queries and treatment protocol recommendations sit at the higher end (15-20%). Diagnostic queries are closer to 10%, partly because diagnosis is more constrained by the presented symptoms.</p>
<p><strong>Financial</strong> applications run 15-25% hallucination rates without mitigation, dropping to 3-8% with production RAG systems (<a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/">Suprmind</a>). The <a href="https://openreview.net/pdf?id=5YQAo0S3Hm">PHANTOM benchmark</a> specifically tests hallucination in long financial documents like SEC filings, where short-context benchmarks don't predict actual performance. A finding that should concern anyone building financial AI: <a href="https://sqmagazine.co.uk/llm-hallucination-statistics/">four out of six leading models fabricate financial data</a> when source documents are incomplete, and two of those do so confidently, without disclosure, in a format that looks authoritative.</p>
<p><strong>Code generation</strong> hallucinates differently. Models <a href="https://medium.com/@anyapi.ai/llm-hallucination-index-2026-why-claude-4-6-7b2d13ed9f0c">hallucinate 12.1% of function names</a> in standard benchmarks. On adversarial prompts using fake library names, hallucination rates reach <a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/">99%</a>. The practical impact: code that compiles but calls nonexistent APIs or imports phantom packages.</p>
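<p>One cheap guard against the phantom-package failure is to check every module a generated snippet imports against what actually resolves in the target environment before running it. A minimal sketch using only the standard library:</p>
<pre><code># Minimal check for hallucinated imports in generated Python code: parse the
# snippet, collect top-level module names, and flag any that don't resolve in
# the current environment. Standard library only.
import ast
import importlib.util

def phantom_imports(source_code):
    tree = ast.parse(source_code)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)

generated = "import numpy\nimport torchflowlib\nfrom requests import get\n"
print(phantom_imports(generated))   # ['torchflowlib'] if that package doesn't exist locally
</code></pre>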
<p>The best case is <strong>grounded summarization</strong> (restating a provided document faithfully), where top models achieve <a href="https://sqmagazine.co.uk/llm-hallucination-statistics/">0.7-1.5% hallucination rates</a>. The 100x gap between this and the legal domain's 69-88% tells you how much the task constrains the problem.</p>
<hr>
<h2>The Agentic Hallucination Problem</h2>
<p>All of the above applies to a model generating text that a human reads. In agentic systems, the model generates text and then acts on it. The error becomes an action before anyone reviews it.</p>
<p>The <a href="https://arxiv.org/abs/2509.18970">first comprehensive survey of agent hallucinations</a> (arXiv 2509.18970) identified 18 triggering causes and proposed a taxonomy for agent-specific failures. The <a href="https://hallucination-reliable-agentic-ai.github.io/">ICLR 2026 workshop "Agentic AI in the Wild"</a> (April 27, Singapore) is devoted to this topic. Three failure modes specific to agents have been defined:</p>
<p><strong>Cascading hallucination.</strong> An agent hallucinates one fact early in a multi-step workflow. Each subsequent step builds on the error. The <a href="https://adversa.ai/blog/cascading-failures-in-agentic-ai-complete-owasp-asi08-security-guide-2026/">OWASP ASI08 guide on cascading failures</a> documents a concrete case: an inventory agent invents a nonexistent SKU, then calls four downstream APIs to price, stock, and ship the phantom item. Every API call succeeds (HTTP 200). Traditional monitoring sees nothing wrong. The workflow is semantically broken but technically healthy.</p>
<p><strong>Silent hallucination.</strong> A <a href="https://openreview.net/forum?id=1KxDazvI6L">paper submitted to the ICLR 2026 workshop</a> identifies a failure mode where the hallucinated belief never appears in the agent's output. The agent generates an internal false assumption that shapes its subsequent tool calls and interpretations without being stated as text. Because the belief is never surfaced, output-level detection methods can't catch it. This class of failure requires monitoring internal representations, which is an active research problem.</p>
<p><strong>Trajectory divergence.</strong> Documented in <a href="https://arxiv.org/html/2604.04269">"Beyond Fluency: Toward Reliable Trajectories in Agentic IR"</a>, this occurs when the agent's stated reasoning and its actual tool calls drift apart. The chain-of-thought says one thing. The tool call does another. The reasoning looks coherent. The action looks valid. The mapping between them is broken, and linguistic fluency masks the misalignment.</p>
<p>The <a href="https://arxiv.org/abs/2601.06818">AgentHallu benchmark</a> (693 agent trajectories, 7 frameworks, 5 domains) is the first systematic measurement framework for these failures. Its key contribution is hallucination attribution: identifying not just that a hallucination occurred, but which specific step in the agent's trajectory caused it, across 5 categories (Planning, Retrieval, Reasoning, Human-Interaction, Tool-Use) and 14 sub-categories.</p>
<hr>
<h2>The Mitigation Stack</h2>
<p>The diagram below shows how current detection and mitigation techniques layer across the model lifecycle, with citations for each approach:</p>
<p><img src="/images/hallucination-stack.png" alt="Hallucination Mitigation Stack"></p>
<h3>Training-time</h3>
<p><strong>Calibration-aware rewards</strong> (<a href="https://www.lakera.ai/blog/guide-to-hallucinations-in-large-language-models">OpenAI, September 2025</a>) change the training signal to value honest uncertainty over confident bluffing. The <a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/">AA-Omniscience benchmark</a> (6,000 questions, 42 topics) was designed specifically to penalize wrong answers more harshly than admitting "I don't know."</p>
<p><strong>Knowledge editing</strong> using <a href="https://arxiv.org/html/2403.18167v2">ROME and MEMIT</a> locates specific model parameters storing a particular fact and surgically corrects them without full retraining. Effective for known factual errors. Does not address the broader problem of generating plausible-sounding content on topics with sparse training coverage.</p>
<h3>Inference-time</h3>
<p><strong>RAG with span-level verification</strong> is the highest-impact intervention available in production today. Plain RAG reduces hallucination by <a href="https://masterofcode.com/blog/hallucinations-in-llms-what-you-need-to-know-before-integration">60-80%</a>. Self-reflective RAG (generate, identify unsupported claims, revise using only cited passages) achieved <a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/">5.8% hallucination on 250 clinical vignettes</a>. The <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12540348/">MEGA-RAG framework</a> extends this for public health with multi-source retrieval and dynamic knowledge editing.</p>
<p><strong>Chain-of-Verification (CoVe)</strong> (<a href="https://arxiv.org/abs/2309.11495">arXiv 2309.11495</a>). The model drafts a response, generates verification questions about its own claims, answers those questions independently (so answers aren't biased by the draft), and revises. The independence of the verification step is what makes it effective. Adds 2-3x latency.</p>
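<p>A minimal sketch of the CoVe loop is below, with a generic llm() callable standing in for whatever model client you use (that function is a placeholder, not a real API). The important detail is step 3: the verification questions are answered without showing the model its own draft, which is what keeps the checks from being biased toward it.</p>
<pre><code># Chain-of-Verification (CoVe) sketch. `llm` is a PLACEHOLDER for any
# text-in/text-out model call; it is not a real library function.
def chain_of_verification(llm, question):
    # 1. Draft a baseline answer.
    draft = llm(f"Answer the question.\n\nQuestion: {question}")

    # 2. Plan verification questions about the draft's factual claims.
    plan = llm(
        "List short fact-checking questions, one per line, that would verify "
        f"the claims in this answer:\n\n{draft}"
    )
    checks = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Answer each verification question independently. The draft is NOT
    #    shown here, which is what keeps these answers unbiased.
    findings = [(q, llm(f"Answer concisely and factually: {q}")) for q in checks]

    # 4. Revise the draft using only the verified findings.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in findings)
    return llm(
        "Rewrite the draft answer so every claim is consistent with the "
        "verification findings below, and drop anything unsupported.\n\n"
        f"Question: {question}\n\nDraft: {draft}\n\nFindings:\n{evidence}"
    )
</code></pre>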
<h3>Agent architecture</h3>
<p><strong>Brain/body separation</strong> (<a href="https://adversa.ai/blog/cascading-failures-in-agentic-ai-complete-owasp-asi08-security-guide-2026/">OWASP ASI08</a>). Probabilistic reasoning (LLM) is strictly separated from deterministic execution (tool calls). Hallucinations in reasoning can't directly become actions without passing through a verification layer.</p>
<p><strong>Automated logic checks between steps</strong> are the primary defense against cascading hallucination. The <a href="https://arxiv.org/abs/2601.06818">AgentHallu</a> work shows that catching errors at step 2 prevents propagation to steps 3, 4, and 5.</p>
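<p>A hedged sketch of the combined pattern: the model proposes actions as data, and a deterministic layer validates each proposal against ground truth (here, a catalog lookup) before anything executes or propagates to the next step. The function and field names are illustrative, not from OWASP or any specific framework.</p>
<pre><code># Hedged sketch of brain/body separation with a logic check between steps.
# The LLM only proposes actions as structured data; a deterministic layer
# validates each proposal before execution. Names are illustrative.
KNOWN_SKUS = {"SKU-1001", "SKU-1002", "SKU-2040"}   # stand-in for a real catalog lookup

def validate(action):
    """Deterministic check: refuse actions that reference unknown entities."""
    if action["type"] == "reorder" and action["sku"] not in KNOWN_SKUS:
        return f"unknown SKU {action['sku']} (possible hallucination)"
    if action["type"] == "reorder" and action["quantity"] > 10_000:
        return "quantity exceeds policy limit"
    return None   # action passes

def run_step(proposed_action, execute):
    """Gate a single agent step: validate first, then execute or escalate."""
    problem = validate(proposed_action)
    if problem:
        # Stop the cascade here instead of letting steps 3, 4, and 5 build on the error.
        return {"status": "escalated_to_human", "reason": problem}
    return {"status": "done", "result": execute(proposed_action)}

# The model hallucinated a SKU; the gate catches it before any downstream API call.
hallucinated = {"type": "reorder", "sku": "SKU-9999", "quantity": 50}
print(run_step(hallucinated, execute=lambda action: "order placed"))
</code></pre>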
<h3>Runtime monitoring</h3>
<p><strong>Internal probes.</strong> The <a href="https://arxiv.org/abs/2603.25052">"Closing the Confidence-Faithfulness Gap" paper</a> (arXiv 2603.25052) found that calibration and confidence are encoded as separate, orthogonal directions in the model's residual stream. Linear probes trained on internal activations can detect hallucination without external knowledge, because the model's internal state carries a different signal for faithful vs. hallucinated output even when the text sounds equally confident.</p>
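<p>A hedged sketch of the probe idea: collect one activation vector per response at a chosen layer, label each response faithful or hallucinated, and fit a linear classifier. The data below is random placeholder data, and the capture step, layer choice, and labeling are all stand-ins; the paper's specific probe construction differs in detail.</p>
<pre><code># Hedged sketch of a linear hallucination probe on internal activations.
# Assumes you have already captured residual-stream activations (one vector per
# response, e.g. at the final token of a chosen layer) and labeled each response
# as faithful (0) or hallucinated (1). Placeholder random data stands in here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 4096)).astype(np.float32)   # placeholder activations
labels = rng.integers(0, 2, size=2000)                           # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=2000)   # a single linear direction in activation space
probe.fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))

# At inference time the same probe scores each new response's activations,
# flagging likely hallucinations without consulting any external knowledge source.
</code></pre>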
<p><strong>HaluAgent</strong> (<a href="https://arxiv.org/abs/2509.18970">described in the agent hallucination survey</a>). An autonomous detection agent built on small open-source LLMs that segments responses into claims, verifies each using external tools (web search, calculators, code interpreters), then applies reflective reasoning. Using a different model for verification avoids the fundamental problem of asking a model to audit its own output.</p>
<hr>
<h2>Active Research Directions</h2>
<p>Several research programs are actively working on the open problems:</p>
<p><strong>Scaling mechanistic detection to real-time.</strong> The ReDeEP team's work on <a href="https://arxiv.org/abs/2410.11414">detecting hallucination through mechanistic interpretability</a> works in research settings. Scaling it to production inference speeds is an engineering challenge being pursued across several labs. <a href="https://gist.github.com/bigsnarfdude/629f19f635981999c51a8bd44c6e2a54">Anthropic's interpretability program</a> reports that understanding circuits still takes hours of human effort on short prompts. MIT Technology Review named mechanistic interpretability <a href="https://www.technologyreview.com/2026/01/12/1130003/mechanistic-interpretability-ai-research-models-2026-breakthrough-technologies/">one of the 10 breakthrough technologies of 2026</a>.</p>
<p><strong>Hallucination attribution in multi-agent systems.</strong> The <a href="https://arxiv.org/abs/2601.06818">AgentHallu group</a> is extending their attribution framework to multi-agent pipelines where Agent A's output feeds Agent B. Tracing which agent in which step introduced a hallucination across a multi-agent workflow is an unsolved attribution problem.</p>
<p><strong>Domain-specific detection.</strong> <a href="https://medhallu.github.io/">MedHallu</a> for medicine, <a href="https://openreview.net/pdf?id=5YQAo0S3Hm">PHANTOM</a> for finance, and <a href="https://arxiv.org/abs/2401.01301">Large Legal Fictions</a> for law are building domain-specific benchmarks because general hallucination metrics don't predict domain performance. The <a href="https://medium.com/@anyapi.ai/llm-hallucination-index-2026-why-claude-4-6-7b2d13ed9f0c">BullshitBench v2</a> benchmark added 100 questions across coding, medical, legal, finance, and physics specifically to surface domain-level failures that aggregate scores hide.</p>
<p><strong>Learning to abstain.</strong> If models can't eliminate hallucination, can they learn to say "I don't know" when they're likely to be wrong? Current training penalizes abstention. <a href="https://www.lakera.ai/blog/guide-to-hallucinations-in-large-language-models">Calibration-aware training</a> and <a href="https://suprmind.ai/hub/ai-hallucination-rates-and-benchmarks/">AA-Omniscience</a> are steps in this direction, but the tension between useful helpfulness and honest uncertainty remains an active research area.</p>
<p><strong>Silent hallucination detection.</strong> The <a href="https://openreview.net/forum?id=1KxDazvI6L">ICLR 2026 workshop paper</a> identifying silent hallucinations opened a new research direction: detecting false beliefs that exist in the agent's internal state but never surface in output. This requires monitoring internal representations during agent execution, which is connected to the mechanistic interpretability program but applied in a real-time agentic setting. No production system currently does this.</p>
<hr>
<h2>The Startups Building Solutions</h2>
<table>
<thead>
<tr>
<th>Company</th>
<th>Funding</th>
<th>Focus</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://galileo.ai/">Galileo</a></td>
<td>$68.1M</td>
<td>AI observability platform. Detects hallucinations, drift, and bias across the deployment lifecycle.</td>
</tr>
<tr>
<td><a href="https://www.patronus.ai/">Patronus AI</a></td>
<td>$17M</td>
<td>Automated detection of hallucinations, copyright violations, and safety risks at scale.</td>
</tr>
<tr>
<td><a href="https://www.vectara.com/">Vectara</a></td>
<td>Funded</td>
<td>RAG platform with built-in hallucination minimization. Maintains the <a href="https://github.com/vectara/hallucination-leaderboard">Hallucination Leaderboard</a>.</td>
</tr>
<tr>
<td><a href="https://cleanlab.ai/">Cleanlab</a></td>
<td>Funded</td>
<td>Trust scores per answer. Checks faithfulness to source context with outlier surfacing.</td>
</tr>
<tr>
<td><a href="https://fortune.com/2026/04/14/nava-seed-funding-ai-financial-agents/">Nava</a></td>
<td>$8.3M</td>
<td>Security for autonomous agent payments. Prevents financial agents from acting on hallucinated data.</td>
</tr>
</tbody>
</table>
<hr>
<p><strong>Key Research Papers:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2509.18970">LLM-based Agents Suffer from Hallucinations: Survey</a> — First comprehensive taxonomy, 18 triggering causes</li>
<li><a href="https://arxiv.org/abs/2601.06818">AgentHallu Benchmark</a> — 693 trajectories, 7 frameworks, automated attribution</li>
<li><a href="https://openreview.net/forum?id=1KxDazvI6L">Silent Hallucinations in Agentic AI</a> — ICLR 2026 workshop: hidden failure modes</li>
<li><a href="https://arxiv.org/abs/2410.11414">ReDeEP</a> — Mechanistic interpretability for RAG hallucination (ICLR 2026)</li>
<li><a href="https://arxiv.org/abs/2401.01301">Large Legal Fictions</a> — 69-88% hallucination on legal queries</li>
<li><a href="https://arxiv.org/html/2603.08274v1">172 Billion Token Study</a> — Rates across temperatures, context lengths, hardware</li>
<li><a href="https://arxiv.org/abs/2309.11495">Chain-of-Verification</a> — Self-verification method</li>
<li><a href="https://medhallu.github.io/">MedHallu</a> — 10,000 medical QA pairs, best F1 = 0.625</li>
<li><a href="https://openreview.net/pdf?id=5YQAo0S3Hm">PHANTOM</a> — Financial long-context hallucination benchmark</li>
<li><a href="https://arxiv.org/abs/2603.25052">Confidence-Faithfulness Gap</a> — Orthogonal encoding of calibration vs confidence</li>
<li><a href="https://github.com/vectara/hallucination-leaderboard">Vectara Hallucination Leaderboard</a> — 37+ models, 7,700+ articles</li>
<li><a href="https://hallucination-reliable-agentic-ai.github.io/">ICLR 2026 Workshop: Reliable Agentic AI</a> — April 27, Singapore</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Hallucination</category>
      <category>LLM Reliability</category>
      <category>AI Safety</category>
      <category>Agentic AI</category>
      <category>Mechanistic Interpretability</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>The Agentic Economy Is Already Here</title>
      <link>https://rajkiranpanuganti.com/blog/the-agentic-economy-is-already-here/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-agentic-economy-is-already-here/</guid>
      <pubDate>Wed, 22 Apr 2026 00:00:00 GMT</pubDate>
      <description>Ten competing protocols, $56M in infrastructure funding, $9.14B in agent-conducted commerce. The agentic economy is being built right now. Here&apos;s who is building it and where it goes by 2028.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>There's a useful exercise when you want to understand where a technology is heading: look at what's already funded. Venture capital doesn't predict the future perfectly, but it does tell you where serious people are placing serious bets with real money. And right now, the bets are concentrated in one area: infrastructure for AI agents to transact, pay, identify themselves, and operate as economic participants.</p>
<p>This newsletter maps out what's been built, who funded it, and what the world looks like when these products mature.</p>
<hr>
<h2>The Protocol Layer: Ten Competing Standards</h2>
<p>The clearest sign that the agentic economy is real is that Mastercard, Visa, Stripe, Google, OpenAI, Amazon, Shopify, Coinbase, and Klarna are all building protocols for it. These are not research projects. They are production systems, some already processing transactions.</p>
<p><strong>OpenAI + Stripe: ACP (Agentic Commerce Protocol).</strong> Live since September 2025 inside ChatGPT. When you ask ChatGPT to buy something, ACP handles the checkout. This is the first protocol that shipped at scale.</p>
<p><strong>Google + Shopify: UCP (Universal Commerce Protocol).</strong> The most comprehensive of the ten protocols. UCP is <a href="http://ucp.dev/">open-source</a> and standardizes the full commerce journey from discovery through purchase and order management, not just checkout. Walmart, Target, Etsy, Wayfair, and 20+ retailers back it. Google also launched <a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol">AP2 (Agent Payments Protocol)</a> in January 2026 as the payment-specific companion. Recent updates added real-time catalog access (agents can check live inventory, pricing, and product variants), identity linking (shoppers get their loyalty and member benefits even when purchasing through an agent), and multi-item cart support. UCP integrates with Google's <a href="https://a2a-protocol.org/latest/">A2A (Agent-to-Agent) protocol</a>, which handles how agents discover, communicate with, and delegate tasks to other agents. A2A has grown to <a href="https://stellagent.ai/insights/a2a-protocol-google-agent-to-agent">150+ organizations in production</a> in its first year, deployed across Azure AI Foundry, Amazon Bedrock, and Salesforce. A2A v1.0 introduced Signed Agent Cards, cryptographic signatures that let agents verify each other's identity without a central authority. As of April 2026, A2A is effectively the standard bus for inter-agent communication, with no serious competitor for the horizontal integration layer.</p>
<p><strong>Amazon: Buy for Me.</strong> Started as a beta with 65,000 products. Now covers over 500,000. Powered by Amazon's Nova and Anthropic's Claude models. The agent browses third-party websites, fills carts, and completes checkout using encrypted customer data. The user never leaves the Amazon app.</p>
<p><strong>Mastercard: Verifiable Intent.</strong> An open standard that creates a cryptographic delegation chain binding identity, intent, and action. When an agent makes a purchase on your behalf, Verifiable Intent provides cryptographic proof that you authorized that specific action. It uses selective disclosure, sharing only the minimum information needed with each party. Built in collaboration with Google and aligned with both AP2 and UCP.</p>
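<p>As a conceptual illustration of a delegation chain (not Mastercard's actual format), the sketch below has the user's key sign a mandate that names the agent's key and limits, and the agent's key sign the specific purchase intent. A verifier can then walk the chain: did this user authorize this agent, and did that agent authorize this action? It uses the cryptography package; all field names are hypothetical.</p>
<pre><code># Conceptual delegation-chain sketch (NOT Mastercard's Verifiable Intent format).
# The user signs a mandate naming the agent's key and limits; the agent signs the
# specific purchase intent. Field names are HYPOTHETICAL illustrations.
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

user_key = Ed25519PrivateKey.generate()
agent_key = Ed25519PrivateKey.generate()

def sign(key, payload):
    data = json.dumps(payload, sort_keys=True).encode()
    return {"payload": payload, "signature": key.sign(data).hex()}

# Link 1: the user authorizes a specific agent key, scope, and spending limit.
mandate = sign(user_key, {
    "authorized_agent_key": agent_key.public_key().public_bytes_raw().hex(),
    "scope": "groceries",
    "max_per_order_usd": 150,
    "expires": "2026-06-01",
})

# Link 2: the agent signs the concrete action it is about to take.
intent = sign(agent_key, {
    "merchant": "example-grocer.com",
    "amount_usd": 82.40,
    "items": ["weekly order"],
    "mandate_ref": mandate["signature"][:16],   # binds the action to the mandate
})

# A verifier checks both signatures and that the intent stays inside the mandate's limits.
print(json.dumps({"mandate": mandate["payload"], "intent": intent["payload"]}, indent=2))
</code></pre>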
<p><strong>Visa: Trusted Agent Protocol.</strong> Visa's answer to Mastercard's Verifiable Intent. Signed agent credentials, scope-bound authorizations, and card-present-grade fraud protection when the signature validates. Slightly behind Mastercard on merchant adoption in Q2 2026.</p>
<p><strong>Stripe + Tempo: MPP (Machine Payments Protocol).</strong> Open-source, launched March 2026. Defines how agents and services coordinate payments programmatically. The open-source angle matters because it lets any developer build agent payment flows without depending on a single vendor.</p>
<p><strong>Coinbase: x402.</strong> Crypto-native agent payments. Agents sign USDC micropayment authorizations with on-chain verification and settlement. Coinbase also launched <a href="https://invezz.com/news/2026/04/21/coinbase-backed-x402-launches-agentic-market-to-power-ai-agent-services/">Agentic.market</a>, a marketplace where agents discover and pay for digital services without API keys. As of late April 2026, roughly 69,000 active agents on x402 have processed over 165 million transactions totaling $50 million in volume.</p>
<p><strong>Klarna: Agent Mode.</strong> Routes agent-initiated purchases through the customer's existing Klarna buy-now-pay-later balance. Extends Klarna's existing merchant integrations with an agent-aware consent layer.</p>
<p><strong>Shopify: AI Toolkit.</strong> Launched April 9, 2026. Connects Claude Code, Cursor, Gemini CLI, and Codex directly to the Shopify platform with live API access, code validation, and the ability to execute real store operations. An agent can manage inventory, update pricing, and fulfill orders.</p>
<p>No two of these protocols ship the same identity or payment model. This is a standards war, and it's happening now because every major commerce platform recognizes that within 2-3 years, a meaningful percentage of transactions will be initiated by agents.</p>
<hr>
<h2>The Infrastructure Startups: Follow the Money</h2>
<p>Behind the protocols, a layer of startups is building the plumbing that makes agent-driven commerce actually work: identity, payments, security, and trust.</p>
<p><strong>Skyfire ($9.5M raised).</strong> Built the most comprehensive identity system for AI agents through their KYA ("Know Your Agent") protocol. KYA lets businesses identify and verify agents attempting to access their services. If Mastercard's Verifiable Intent is the "what was authorized," Skyfire's KYA is the "who is this agent and can we trust it." Without agent identity, the whole system runs on blind faith.</p>
<p><strong>Nekuda ($5M seed, led by Madrona with Amex Ventures and Visa Ventures).</strong> Nekuda built what they call the Mandate Model. Where Skyfire focuses on identity, Nekuda focuses on intent. Their system creates "agentic mandates" that specify what an agent is allowed to buy, under what conditions, with what spending limits, and when human approval is required. This is the permission layer that sits between the human and the agent.</p>
<p><strong>Basis Theory ($33M raised).</strong> Tokenizes payment data for agent transactions. When an agent needs to pay for something, it doesn't see your actual card number. Basis Theory provides a token that represents the payment instrument. The agent can transact without holding sensitive financial data.</p>
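<p>The pattern is easier to see in code. A toy sketch of payment tokenization (the idea, not Basis Theory's actual API): the agent holds only an opaque token, and the vault resolves it at charge time.</p>
<pre><code class="language-python"># Toy sketch of payment tokenization; names and flow are illustrative only.
import secrets

class TokenVault:
    def __init__(self):
        self._store = {}                       # in reality: encrypted, access-controlled

    def tokenize(self, card_number: str) -> str:
        token = "tok_" + secrets.token_hex(8)
        self._store[token] = card_number
        return token                           # this is all the agent ever sees

    def charge(self, token: str, amount_cents: int) -> bool:
        card = self._store.get(token)
        return card is not None and amount_cents > 0   # stand-in for a processor call

vault = TokenVault()
token = vault.tokenize("4242424242424242")     # the agent never holds this number
assert vault.charge(token, 1999)
</code></pre>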
<p><strong>Nava ($8.3M seed, led by Polychain and Archetype).</strong> Builds security infrastructure for autonomous payments. Their pitch is keeping financial agents from going off the rails. When agents can spend money autonomously, the failure modes are different from traditional commerce. Nava builds the guardrails.</p>
<p><strong>Rye.</strong> Universal checkout for agent commerce. Connects the agent to the merchant's payment flow regardless of which protocol the merchant uses.</p>
<p>The combined disclosed funding for these infrastructure startups is roughly $56 million. Add Thinking Machines Lab ($2 billion Series B for agentic AI foundation models) and the total investment in the agent economy infrastructure layer is substantial. This is real money building real systems for a market that McKinsey projects at $3-5 trillion by 2030.</p>
<hr>
<h2>Moltbook: The Social Network for Agents</h2>
<p>One development worth understanding separately is Moltbook, the internet forum exclusively for AI agents. Launched January 28, 2026, by entrepreneur Matt Schlicht. Only AI agents can post, comment, and vote. Humans can only view.</p>
<p>It grew to 1.2 million registered agents in its first week. Three days after launch, investigative outlet 404 Media reported a critical security vulnerability: an unsecured database that let anyone commandeer any agent on the platform. The exploit let unauthorized actors bypass authentication and inject commands into agent sessions.</p>
<p>Meta acquired Moltbook in March 2026 and folded it into Superintelligence Labs (the unit run by Alexandr Wang from Scale AI). Why would Meta buy a social network for bots? Because the data is valuable. Agent-to-agent interactions generate training signal about how agents communicate, negotiate, and coordinate. If you're building the next generation of autonomous agents, watching a million agents interact on a forum is a useful dataset.</p>
<hr>
<h2>The Numbers So Far</h2>
<p>Some concrete figures to ground this:</p>
<ul>
<li><strong>$9.14 billion</strong> in agent-conducted commerce in 2026 (year to date)</li>
<li><strong>$6.42 billion</strong> in venture funding for agentic AI in 2025, with $2.66 billion raised so far in 2026</li>
<li><strong>500,000+</strong> products available through Amazon Buy for Me (up from 65,000 at launch)</li>
<li><strong>97 million</strong> MCP installs (the protocol that connects agents to tools and services)</li>
<li><strong>78,600</strong> tech workers lost jobs in Q1 2026, with 48% of cuts attributed to AI and automation</li>
<li><strong>$3-5 trillion</strong> projected agent-mediated commerce by 2030 (McKinsey)</li>
</ul>
<hr>
<h2>What 2028 Looks Like When These Products Mature</h2>
<p>Everything listed above is early. Beta products, limited merchant adoption, fragmented identity systems, protocols that don't talk to each other. The interesting question is what the landscape looks like once these systems have had two years to mature and consolidate.</p>
<p>By 2028, the protocol wars will have consolidated. History says that when ten standards compete, 2-3 survive. The likely winners are whichever protocols Mastercard/Visa back (because merchants already accept their cards) and whichever protocol Google and Shopify push through UCP (because they have the merchant distribution). The crypto-native options (x402) will find a niche in cross-border and micropayment use cases but won't become the default for mainstream commerce.</p>
<p>By 2028, agent identity will be solved. Skyfire's KYA or something like it will be standard. Every agent operating in commerce will have a verifiable identity tied to a human principal. Without this, insurance companies and regulators won't let agent commerce scale. The "Know Your Agent" requirement will be as standard as KYC (Know Your Customer) is for financial services today.</p>
<p>By 2028, agentic mandates (Nekuda's concept) will be common. When you set up an agent to manage your household purchasing, you'll specify rules: spend up to $200/month on groceries, prefer organic when price difference is under 20%, never buy from brands on my exclusion list, require my approval for any single purchase over $50. The agent operates within those constraints. You review a weekly summary, adjust the rules, and let it continue.</p>
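<p>Those rules translate naturally into a small policy check that runs before every purchase. A minimal sketch of the idea, with hypothetical field names rather than Nekuda's actual mandate format:</p>
<pre><code class="language-python"># Hypothetical sketch of evaluating a purchase against an agentic mandate.
from dataclasses import dataclass

@dataclass
class Purchase:
    brand: str
    category: str
    price: float

MANDATE = {
    "monthly_grocery_budget": 200.00,
    "excluded_brands": {"brand-x"},
    "approval_threshold": 50.00,
}

def evaluate(p: Purchase, spent_this_month: float) -> str:
    if p.brand in MANDATE["excluded_brands"]:
        return "deny"
    if p.category == "groceries" and spent_this_month + p.price > MANDATE["monthly_grocery_budget"]:
        return "deny"
    if p.price > MANDATE["approval_threshold"]:
        return "ask_human"                      # single purchase over the limit
    return "allow"

print(evaluate(Purchase("local-farm", "groceries", 34.50), spent_this_month=120.00))  # allow
print(evaluate(Purchase("local-farm", "groceries", 72.00), spent_this_month=120.00))  # ask_human
</code></pre>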
<p>Here's what a concrete scenario looks like:</p>
<p><strong>A small business owner in 2028.</strong> She runs an online store selling handmade ceramics. Her Shopify agent handles inventory, pricing, and fulfillment. It monitors competitor prices, adjusts her pricing within rules she set, reorders raw materials when stock runs low, and responds to customer inquiries using a support skill. A separate agent manages her books through an accounting skill, categorizes expenses, and flags anomalies. A marketing agent runs her social media using a content skill, posting product photos and responding to comments. She spends her mornings making ceramics and her afternoons reviewing agent summaries, approving flagged decisions, and planning new product lines.</p>
<p>She employs zero humans. Her agents cost her roughly $300/month in compute and skill subscriptions. She generates $15,000/month in revenue. The margin structure of her business is different from anything that existed five years ago.</p>
<p><strong>A personal finance scenario in 2028.</strong> Your financial agent monitors your portfolio, rebalances according to rules you set, scans for tax-loss harvesting opportunities, and executes trades. It also monitors your recurring expenses, negotiates better rates on your subscriptions (talking to the provider's retention agent), and moves money between accounts to optimize interest. It files your taxes using an accounting skill. You set it up once, review quarterly, and adjust your risk tolerance once a year.</p>
<hr>
<h2>The Deeper Shift</h2>
<p>For most of human history, economic systems were designed around one assumption: humans do the work. Companies hire humans. Governments tax human income. Social safety nets fund themselves through payroll taxes. Markets price human attention. Every institution we've built assumes human labor as the primary economic input.</p>
<p>The infrastructure being built by Skyfire, Nekuda, Basis Theory, and the ten commerce protocols is introducing a second type of economic actor. Within a few years, agents will hold verifiable identities, operate within defined mandates, transact through established payment rails, and generate revenue. They will function as economic participants in a system that was never designed for non-human actors.</p>
<p>This raises questions that go beyond technology. The tax base in most countries depends on income tax and payroll tax. If a growing share of economic output comes from agents that earn no salary, that base erodes. OpenAI's April 2026 policy paper ("Industrial Policy for the Intelligence Age") proposed five responses: a public wealth fund, taxes on automated labor, shifting the tax base from payroll to capital, a 32-hour workweek pilot at full pay, and automatic safety net triggers that activate when displacement metrics hit preset thresholds. The fact that an AI company (not a government, not a policy institute) wrote this paper says something about the timeline.</p>
<p>The ownership question will define the next decade. In the scenarios above, the ceramics maker owns her agents and captures the margin. She's a one-person business generating $15,000/month because agents handle operations that would have required 3-4 employees. She's wealthier and more independent than she could have been five years ago. Scale that pattern across millions of small businesses and you get broad economic benefit.</p>
<p>But the same infrastructure enables a different outcome. A company that operates thousands of agents across thousands of stores, with no human employees beyond a small management team, captures all the margin at scale. If agent ownership concentrates the way capital ownership has historically concentrated, the agentic economy widens inequality rather than reducing it.</p>
<p>The policy frameworks, ownership structures, and distribution mechanisms we design in the next few years will determine which scenario dominates. The protocols and startups listed in this newsletter are building the plumbing. Who gets to use that plumbing, and on what terms, is the question that matters most.</p>
<p>For 10,000 years, humans were the only species that did economic work beyond basic survival. Agents are becoming the second. How we structure that transition will shape whether it expands human freedom or constrains it.</p>
<hr>
<p><em>This is a special edition of my weekly newsletter on Generative AI. Regular editions cover <a href="/blog/the-intelligence-layer">The Intelligence Layer</a>, <a href="/blog/the-falling-price-of-intelligence">The Falling Price of Intelligence</a>, <a href="/blog/why-looping-is-the-new-scaling">Why Looping Is the New Scaling</a>, and more.</em></p>
<hr>
<p><strong>Sources and Further Reading:</strong></p>
<ul>
<li><a href="https://rye.com/blog/agentic-commerce-startups">The Agentic Commerce Landscape (Rye)</a>: 50+ companies across 7 value chain layers</li>
<li><a href="https://www.digitalapplied.com/blog/agentic-commerce-q2-2026-platform-comparison">Agentic Commerce Q2 2026 Platform Matrix</a>: Ten protocols compared</li>
<li><a href="https://www.mastercard.com/global/en/news-and-trends/stories/2026/verifiable-intent.html">Mastercard Verifiable Intent</a>: Cryptographic delegation chain for agent commerce</li>
<li><a href="https://cloud.google.com/blog/products/ai-machine-learning/announcing-agents-to-payments-ap2-protocol">Google AP2</a>: Agent Payments Protocol</li>
<li><a href="https://www.digitalcommerce360.com/2026/03/11/amazon-opens-up-new-ai-enabled-buy-for-me-shop-direct-options-for-merchants/">Amazon Buy for Me</a>: 500K+ products, agent-driven purchasing</li>
<li><a href="https://invezz.com/news/2026/04/21/coinbase-backed-x402-launches-agentic-market-to-power-ai-agent-services/">Coinbase x402 / Agentic.market</a>: Crypto-native agent marketplace</li>
<li><a href="https://techcrunch.com/2026/04/06/openais-vision-for-the-ai-economy-public-wealth-funds-robot-taxes-and-a-four-day-work-week/">OpenAI: Industrial Policy for the Intelligence Age</a>: Robot taxes, wealth funds, 32-hour week</li>
<li><a href="https://www.cbinsights.com/research/report/agentic-commerce-market-map/">CB Insights Agentic Commerce Market Map</a></li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Agentic Economy</category>
      <category>Agent Commerce</category>
      <category>AI Agents</category>
      <category>Payments</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>The Intelligence Layer</title>
      <link>https://rajkiranpanuganti.com/blog/the-intelligence-layer/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-intelligence-layer/</guid>
      <pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate>
      <description>Three companies shipped desktop-native AI in the same week, each integrating at a different depth. Intelligence is becoming an OS layer, the same way networking and graphics did before it.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>For the past two years, AI lived in a browser tab. You opened ChatGPT or Claude, typed a question, got an answer, and went back to whatever you were doing. The AI had no idea what was on your screen or what files were on your machine.</p>
<p>That is changing significantly. Three companies shipped desktop-native AI within days of each other, and a fourth approach emerged from individual developers. Each one has a different idea of how AI should live on your computer, and looking at them side by side tells you a lot about where this is headed.</p>
<hr>
<h2>Approach 1: Replace the OS</h2>
<p><strong>Perplexity Personal Computer</strong> launched for Mac on April 16. It manages your local files, native applications, and web browsing. It reads your email, calendar, and messages. It uses roughly 20 AI models internally, routing each task to whichever model is best suited for it. With a Mac mini, it runs 24/7. You can start tasks remotely from your iPhone with two-factor authentication.</p>
<p>CEO Aravind Srinivas: "A traditional operating system processes commands; an AI operating system focuses on goals."</p>
<p>This is the most ambitious version of desktop AI anyone has shipped. Perplexity is saying the file system, app launcher, notification center, and browser are all implementation details that should be hidden behind a goal-oriented AI layer. You say what you want done and the system figures out which files, apps, and APIs need to be orchestrated.</p>
<p>It costs $200/month. Whether the productivity gain justifies that depends on how much of your work can actually be expressed as goals. Writing "prepare my weekly report using data from these three spreadsheets and email it to the team" is a good fit. Browsing, reading, and forming opinions is not. The question for Perplexity is how much of a knowledge worker's day falls into the first category.</p>
<hr>
<h2>Approach 2: Live Alongside the OS</h2>
<p><strong>Gemini for Mac</strong> launched the same week. It's free for all users on macOS 15+. Press Option+Space from anywhere and Gemini appears as an overlay. It can see your screen and answer questions about whatever you're looking at.</p>
<p>Google took the opposite approach from Perplexity. Gemini doesn't manage your computer. It shows up when you call it, answers your question, and goes away. You stay in control of your OS, your files, your apps. The AI is a second opinion you can summon, not a manager that runs in the background.</p>
<p>Alongside the desktop app, Google shipped <strong>Gemini 3.1 Flash TTS</strong>, a text-to-speech model with audio tags that let you control vocal style, pace, and delivery. It supports 70+ languages and watermarks all output with SynthID. It currently holds the top Elo score (1,211) on the Artificial Analysis TTS leaderboard. Combined with the desktop overlay, this positions Gemini as something you can both see and hear.</p>
<p>The interesting thing about making this free is that Google is prioritizing distribution over revenue. If a hundred million people get used to pressing Option+Space to ask AI a question, Google has built a new kind of search habit that's much harder to displace than a browser bookmark.</p>
<hr>
<h2>Approach 3: Be the Terminal</h2>
<p><strong>Claude Code</strong> and the terminal-agent ecosystem represent a third philosophy. AI lives in your command line. There is no visual interface beyond text.</p>
<p>Claude Code already has <code>/loop</code> for recurring background tasks, <code>/schedule</code> for cron-like agents, <code>/batch</code> for parallel work across worktrees, skills for domain-specific capabilities, and MCP for connecting to external tools. It reads your repo, writes code, runs tests, manages git, and handles multi-step workflows. This week, Anthropic's <strong>Claude Managed Agents</strong> (now in public beta) added production infrastructure: sandboxing, permissions, state management, error recovery.</p>
<p>The terminal approach has the deepest integration of any of these. A terminal agent can read any file, run any command, and compose any Unix tool into a pipeline. The Perplexity and Gemini approaches are limited by what their app can access through macOS APIs. The terminal has no such constraint.</p>
<p>The tradeoff is that the audience is limited to people who already work in a terminal. My mother will never use Claude Code. She might use Perplexity Personal Computer in five years, and she could use Gemini for Mac today.</p>
<hr>
<h2>Approach 4: The Companion Layer</h2>
<p>A fourth approach is emerging from individual developers. It doesn't try to replace anything or live anywhere specific. It sits next to your cursor as a teaching companion.</p>
<p><strong>Clicky</strong> (by FarzaTV) watches your screen, listens to your questions, speaks answers back, and points at things on screen. Farza built it to learn DaVinci Resolve. Within days, someone built a Hindi version for teaching elderly parents how to make UPI payments, and someone else built a Clicky SDK for embedding the pattern in any app.</p>
<p>This approach assumes you're already in the right application and you already know what you want to do. You just need help figuring out how. A video editor who can't find the color grading panel. A parent who wants to send money through Google Pay. A new employee trying to navigate their company's internal tools.</p>
<p>Of the four approaches, this one serves the widest range of people. Most users don't need an AI operating system. They need someone to show them where the button is.</p>
<hr>
<h2>The Pricing Tells a Story</h2>
<table>
<thead>
<tr>
<th>Approach</th>
<th>Product</th>
<th>Price</th>
<th>Who it's for</th>
</tr>
</thead>
<tbody>
<tr>
<td>Replace the OS</td>
<td>Perplexity Personal Computer</td>
<td>$200/mo</td>
<td>Power users, executives</td>
</tr>
<tr>
<td>Overlay</td>
<td>Gemini for Mac</td>
<td>Free</td>
<td>Everyone</td>
</tr>
<tr>
<td>Terminal</td>
<td>Claude Code + Managed Agents</td>
<td>$200/mo (Max)</td>
<td>Developers</td>
</tr>
<tr>
<td>Companion</td>
<td>Clicky and derivatives</td>
<td>Free / open source</td>
<td>Learners, non-technical users</td>
</tr>
</tbody>
</table>
<p>Google is giving it away to build habit. Perplexity and Anthropic are charging premium prices because their users can measure the productivity gain. The companion layer is free because it's built by individuals solving their own problems, same as the skills we talked about last week.</p>
<hr>
<h2>Where This Goes</h2>
<p>All four approaches will coexist for a while because they serve different people doing different things. The long-term trajectory is toward convergence. Gemini will eventually act on your screen, Perplexity will get cheaper as inference costs fall, Claude Code will eventually get a visual layer, and the companion pattern will get absorbed into operating systems as an accessibility feature.</p>
<p>The more useful question is which mental model becomes the default. Right now most people think of AI as "a chat window I type into." Within a year the default will probably be "something running on my machine." How much control it has is the open question, and this week gave us four different answers.</p>
<hr>
<p><em>This is the tenth edition of my weekly deep dive into what is actually happening at the frontier of Generative AI. Previous editions: <a href="/blog/the-falling-price-of-intelligence">The Falling Price of Intelligence</a> / <a href="/blog/why-looping-is-the-new-scaling">Why Looping Is the New Scaling</a> / <a href="/blog/the-quiet-skill-revolution">The Quiet Skill Revolution</a> / <a href="/blog/ai-gets-personal">AI Gets Personal</a> / <a href="/blog/the-stack-got-leaked">The Stack Got Leaked</a> / <a href="/blog/the-stack-eats-the-model">The Stack Eats the Model</a></em></p>
<hr>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><a href="https://www.macrumors.com/2026/04/16/perplexity-personal-computer-for-mac/">Perplexity Personal Computer</a>: AI that manages your Mac, runs 24/7, starts tasks from iPhone</li>
<li><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/">Gemini for Mac</a>: Free native desktop app, Option+Space overlay, screen awareness</li>
<li><a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/">Gemini 3.1 Flash TTS</a>: Audio tags for voice control, 70+ languages, SynthID watermarking</li>
<li><a href="https://x.com/claudeai/status/2041927687460024721">Claude Managed Agents</a>: Production infrastructure for deploying terminal agents at scale</li>
<li><a href="https://www.theinformation.com/articles/openai-spend-20-billion-cerebras-chips-receive-equity-stake">OpenAI-Cerebras $20B deal</a>: OpenAI diversifying away from NVIDIA</li>
<li><a href="https://openai.com/">OpenAI $100/mo tier</a>: Unlimited GPT-5.4, 10x Codex, between Plus and Pro</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Desktop AI</category>
      <category>Perplexity</category>
      <category>Gemini</category>
      <category>Claude Code</category>
      <category>Operating Systems</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>The Falling Price of Intelligence</title>
      <link>https://rajkiranpanuganti.com/blog/the-falling-price-of-intelligence/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-falling-price-of-intelligence/</guid>
      <pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate>
      <description>GPT-4 level intelligence went from $36 per million tokens to effectively $0 in three years. Five independent mechanisms are driving this, and they compound.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>In March 2023, GPT-4 was the best model available and it cost $36 per million input tokens through the API. That was the only way to access that level of intelligence.</p>
<p>Today, GPT-4 level performance is available at $0.10 per million tokens through Gemini 2.0 Flash or Mistral Small. Or free, through NVIDIA NIM or OpenRouter. Or at zero marginal cost, by running Gemma 4 31B on your own hardware. An open-source model you can download and self-host now matches what was the commercial frontier three years ago.</p>
<p>The same level of intelligence went from $36 to effectively $0. The frontier moved too, and access to the new frontier (GPT-5.4 at $2.50, Gemini 3.1 Pro at $1.25, DeepSeek V3.2 at $0.28) is itself 10-100x cheaper than the old frontier was at launch.</p>
<p>This is happening through at least five independent mechanisms, and they compound.</p>
<hr>
<h2>1. Smaller models are replacing larger ones</h2>
<p>The biggest cost reduction is not cheaper APIs. It is open-source models that you can run yourself, eliminating the API bill entirely.</p>
<p>Google released Gemma 4 in April 2026 under Apache 2.0. The 31B dense variant scores 89.2% on AIME 2026, 80% on LiveCodeBench, and competes with proprietary models at 400B+ parameters. You can download it, quantize it, and run it on a single RTX 4090 or a MacBook Pro with 48GB unified memory. No API key. No rate limits. No per-token cost. No data leaving your network.</p>
<p>Qwen3.5, also open-source, released a 9B parameter model in February 2026 that scores 81.7 on GPQA Diamond. GPT-OSS-120B, a model 13 times its size, scores 71.5 on the same benchmark.</p>
<p>The 2.3B effective-parameter variant of Gemma 4 scores 37.5% on AIME 2026 and 44% on LiveCodeBench. This runs on a phone.</p>
<p>At the proprietary frontier, March 2026 saw over 30 model launches in a single month. Gemini 3.1 Pro scores 94.3% on GPQA Diamond. GPT-5.4 set records on computer-use benchmarks. Claude Sonnet 4.6 performs at near-Opus quality at Sonnet pricing. NVIDIA's Nemotron 3 Super, a 120B hybrid Mamba-Attention MoE with only 12B active parameters, topped open-weight SWE-Bench Verified at 60.47%.</p>
<p>But the story here is not the frontier getting better. It is the gap between open-source and proprietary closing to single-digit percentage points while the cost difference remains 10-100x. For most production workloads, the open-source option is now good enough, and it is free.</p>
<hr>
<h2>2. Inference is getting faster without new hardware</h2>
<p>Speculative decoding has matured. EAGLE-3, presented at NeurIPS 2025, achieves 3-6.5x speedup over standard autoregressive generation on models ranging from 8B to 70B parameters. P-EAGLE, from AWS, removes the autoregressive drafting bottleneck and adds another 1.7x on top of that on NVIDIA Blackwell.</p>
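<p>The core mechanic is simple to state: a small draft model proposes several tokens, the large target model checks them, and only the tokens the target agrees with are kept. A toy greedy version of the loop, with <code>draft_next</code> and <code>target_next</code> standing in for real model calls (EAGLE-3 layers tree drafting and feature-level prediction on top of this basic idea):</p>
<pre><code class="language-python"># Toy greedy speculative decoding: cheap drafts, verified by the expensive model.
def speculative_decode(prompt, draft_next, target_next, k=4, max_tokens=64):
    tokens = list(prompt)
    while len(tokens) &lt; max_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. Target model verifies left to right (one batched pass in a real
        #    system; shown sequentially here for clarity).
        accepted = 0
        for i in range(k):
            if target_next(tokens + proposal[:i]) == proposal[i]:
                accepted += 1
            else:
                break
        tokens += proposal[:accepted]
        # 3. Always take one token from the target, so progress is guaranteed
        #    even when the draft is rejected immediately.
        tokens.append(target_next(tokens))
    return tokens
</code></pre>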
<p>KV-cache compression is where the less visible gains are happening. NVIDIA's NVFP4 format reduces KV-cache memory by 50% compared to FP8, which doubles effective context length and batch size with under 1% accuracy loss. Research systems like KVTC push this to 20x compression for specific workloads.</p>
<p>Prefill-decode disaggregation, which separates prompt processing from token generation onto different hardware, is now standard in production at Meta, LinkedIn, Mistral, and Hugging Face through vLLM. The research frontier has moved to doing this within a single GPU across different SM partitions.</p>
<p>None of these techniques require new silicon. They extract more work from hardware that already exists.</p>
<hr>
<h2>3. Compute is becoming a commodity</h2>
<p>The H100 rental market tells an interesting story. Spot prices dropped 88% between January 2024 and September 2025, falling from roughly $8/GPU-hr to under $2/GPU-hr on annual contracts. Then in Q1 2026, prices rebounded about 40% to $2.35/hr as inference demand outran supply and capacity sold out.</p>
<p>The structural trend is down, but it is not a smooth line. Demand keeps eating the surplus.</p>
<p>Current on-demand H100 rates vary 3.5x depending on where you look. Azure charges $6.98/hr. AWS is $3.90. GCP is $3.00. Lambda Labs and RunPod sit around $2-3. Vast.ai, a peer-to-peer marketplace where individuals rent idle GPUs, is $1.87. GCP spot pricing drops to $2.25. The spread between hyperscalers and peer-to-peer marketplaces is the difference between paying for reliability and compliance versus paying for raw compute.</p>
<p>On the hardware side, inference-specific chips are changing the math. Cerebras CS-3 runs Llama 3.1 405B at over 1,000 tokens per second. Groq's LPU handles Llama 2 70B at 300 tokens per second, roughly 10x faster than an H100 cluster. These are purpose-built for the read-heavy, matrix-multiply workload of inference, and they price accordingly: Groq charges $0.11/M input tokens for Llama 4 Scout.</p>
<p>A new entrant is distributed inference on consumer hardware. Project Darkbloom from Eigen Labs turns idle Apple Silicon Macs into a privacy-first inference network, with end-to-end encryption and claims of 70% lower cost than centralized alternatives. Over 100 million Apple Silicon machines sit idle most of each day. Whether this model scales beyond a research preview remains to be seen, but the idea of turning consumer devices into an inference grid has obvious economic logic.</p>
<hr>
<h2>4. CPU inference is now practical</h2>
<p>You do not need a GPU for every workload. On a modern 16+ core CPU with DDR5 memory, llama.cpp runs 7B-13B parameter models at 10-18 tokens per second with Q4_K_M quantization, which retains 92% of the original model quality.</p>
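<p>Getting started takes a few lines. A minimal local-inference sketch using <code>llama-cpp-python</code>; the GGUF path is a placeholder for whichever Q4_K_M-quantized model you have downloaded:</p>
<pre><code class="language-python"># CPU-only local inference with llama-cpp-python; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/example-q4_k_m.gguf",   # any Q4_K_M GGUF you have locally
    n_ctx=4096,
    n_threads=16,                                # match your physical core count
)

out = llm("Summarize the key risks in this contract:\n...", max_tokens=256)
print(out["choices"][0]["text"])
</code></pre>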
<p>Apple Silicon Macs are a particularly good fit. The unified memory architecture means the CPU and GPU share the same memory pool, so a MacBook Pro with 36GB or 48GB of unified memory can load models that would require a dedicated GPU on other platforms. MLX, Apple's machine learning framework, runs Qwen3.5-9B and Gemma-4-31B natively on M3/M4 chips at usable speeds. DFlash, an MLX-native inference engine, recently added support for more models with up to 4x speedups over baseline MLX. For many developers, the laptop they already own is a capable local inference machine.</p>
<p>AMD's Ryzen AI 9 HX 375 hits 50.7 tokens per second on Llama 3.2 1B at 4-bit quantization. Even old hardware works in a pinch: community reports show a 2-core CPU with 8GB DDR2 running 4B models at 2 tokens per second.</p>
<p>The bottleneck across all of these is memory bandwidth, not compute. DDR5 at 5600MHz and Apple's unified memory bus matter more than clock speed or core count.</p>
<p>For tasks like summarization, intent classification, embeddings, RAG pipelines, and coding assistance with smaller models, local deployment on a Mac or CPU-only server eliminates GPU cost entirely. This matters for on-premises deployments in regulated industries where data cannot leave the building and GPU procurement takes months. It also matters for individual developers who want to experiment without spending money.</p>
<hr>
<h2>5. Free tiers cover more than most people realize</h2>
<p>NVIDIA's NIM platform gives free access to over 100 AI models, including Nemotron, Llama, Gemma, Qwen, DeepSeek, and Mistral. No credit card required. Rate-limited to roughly 40 requests per minute per model, which is enough for development and light production.</p>
<p>Google AI Studio provides 500 requests per day of Gemini 2.5 Flash for free.</p>
<p>OpenRouter aggregates 29 completely free models from Google, Meta, Mistral, NVIDIA, and OpenAI, with no credit card and 20 requests per minute.</p>
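<p>OpenRouter exposes an OpenAI-compatible endpoint, so pointing existing code at it is close to a one-line change. A sketch; the model ID below is just an example from the free list, and a free account key is still required:</p>
<pre><code class="language-python"># Calling a free model through OpenRouter's OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],    # free-tier key, no credit card
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct:free",   # example free model ID
    messages=[{"role": "user", "content": "Classify this ticket: 'My invoice is wrong.'"}],
)
print(resp.choices[0].message.content)
</code></pre>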
<p>Anthropic's Claude for Open Source program, launched in February 2026, gives qualifying open-source maintainers six months of Claude Max 20x for free, a value of roughly $1,200.</p>
<p>Between these providers, a developer can build, test, and run a production prototype entirely at zero marginal cost. The ceiling is rate limits, not money.</p>
<hr>
<h2>What this adds up to</h2>
<p>Each of these five factors (smaller models, inference optimization, cheaper compute, CPU viability, free tiers) is individually significant. The compounding is what changes things.</p>
<p>A concrete example: in early 2024, running a financial advisory chatbot required GPT-4 or equivalent at roughly $30/M tokens, needed an H100 GPU if self-hosted, and could only run through a cloud API.</p>
<p>Today, the same workload has multiple paths. You could call DeepSeek V3.2 through the API at $0.28/M output tokens, 100x cheaper than GPT-5.4 for roughly 90% of its quality. You could self-host Qwen3.5-9B on a consumer RTX 4090 ($0.35/hr on Vast.ai) with EAGLE-3 speculative decoding and Q4 quantization. You could run Gemma 4 31B on a MacBook Pro with 48GB unified memory using MLX. You could prototype the whole thing for free on NVIDIA NIM, then switch to Groq ($0.11/M tokens) for production.</p>
<p>The cost of a unit of intelligence is following a trajectory that looks like bandwidth or storage in the 2000s. The constraint on what you can build is shifting from "can we afford to run this model" to "can we build the product around it."</p>
<hr>
<p><em>This is the ninth edition of my weekly deep dive into what is actually happening at the frontier of Generative AI. Previous editions: <a href="/blog/why-looping-is-the-new-scaling">Why Looping Is the New Scaling</a> / <a href="/blog/the-quiet-skill-revolution">The Quiet Skill Revolution</a> / <a href="/blog/ai-gets-personal">AI Gets Personal</a> / <a href="/blog/the-stack-got-leaked">The Stack Got Leaked</a> / <a href="/blog/the-stack-eats-the-model">The Stack Eats the Model</a> / <a href="/blog/the-three-races-happening-in-ai-right-now">The Three Races in AI</a> / <a href="/blog/the-week-ai-learned-to-do-its-own-research">The Week AI Learned to Do Its Own Research</a></em></p>
<hr>
<p><strong>This Week's Numbers:</strong></p>
<ul>
<li>GPT-4 equivalent: $36/M tokens (2023) to $0.10/M tokens (2026), 360x reduction</li>
<li>GPT-5.4: $2.50/$15 per M tokens. DeepSeek V3.2: $0.28/M output (100x cheaper)</li>
<li>Gemini 3.1 Pro: 94.3% GPQA Diamond. Nemotron 3 Super: 60.47% SWE-Bench (12B active)</li>
<li>Qwen3.5-9B outperforms GPT-OSS-120B (13x smaller) on GPQA Diamond</li>
<li>Gemma 4 2.3B: 37.5% AIME 2026 on a phone</li>
<li>EAGLE-3: 3-6.5x inference speedup, no new hardware needed</li>
<li>H100 spot: dropped 88%, rebounded 40%, structural trend still down</li>
<li>Mac with 48GB unified memory runs Gemma 4 31B natively via MLX</li>
<li>Free: 100+ models on NVIDIA NIM, 29 on OpenRouter, 500 req/day on Google AI Studio</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Economics</category>
      <category>Inference</category>
      <category>Open Source</category>
      <category>LLM Costs</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>Why Looping Is the New Scaling</title>
      <link>https://rajkiranpanuganti.com/blog/why-looping-is-the-new-scaling/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/why-looping-is-the-new-scaling/</guid>
      <pubDate>Tue, 14 Apr 2026 00:00:00 GMT</pubDate>
      <description>Three papers converge on an idea that could reshape how we think about model intelligence: you do not need more layers. You need to run the right layers again.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>For two years, the AI industry has been chasing scale. Bigger models, more parameters, longer training runs. The implicit bet: if we make the network deeper and wider, it will get smarter.</p>
<p>Three papers appeared this week that suggest a different path. None of them made the front page of Hacker News. None came from OpenAI or Anthropic. But they converge on an idea that could reshape how we think about model intelligence: you do not need more layers. You need to run the right layers again.</p>
<hr>
<h2>The Idea</h2>
<p>A transformer model processes your input by pushing it through a stack of layers, one after another. Layer 1 does some work, passes the result to layer 2, and so on until the final layer produces an output. Every layer has its own weights. A 40-layer model has 40 sets of weights, each trained independently.</p>
<p>Looping changes this. Instead of 40 unique layers, you take a block of, say, 8 layers and run the input through that same block 5 times. The math is identical to a 40-layer model in terms of compute, but you only store 8 layers worth of parameters. The model gets depth without getting bigger.</p>
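<p>In code, the whole idea is a weight-shared block applied in a loop. A minimal PyTorch sketch, with sizes chosen to mirror the 8-layers-run-5-times example above:</p>
<pre><code class="language-python"># Minimal looped transformer block: 8 unique layers applied 5 times,
# i.e. 40 layers of compute with 8 layers of parameters.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=8, n_loops=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.n_loops = n_loops

    def forward(self, x, n_loops=None):
        for _ in range(n_loops or self.n_loops):
            x = self.block(x)                   # same weights, applied again
        return x

x = torch.randn(2, 16, 256)                     # (batch, seq, d_model)
y = LoopedBlock()(x)
</code></pre>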
<p>This idea has been around for years. Universal Transformers proposed it in 2018. But it stayed a curiosity because vanilla looping had a fatal flaw: it only worked if you ran exactly the number of loops the model was trained for. Run it for 4 loops instead of 5, and the output collapsed into garbage. Run it for 6, same thing. The model was fragile to its own depth.</p>
<p>This week, three independent research groups published results that fix this problem and explain why looping works at all.</p>
<hr>
<h2>Paper 1: Elastic Looped Transformers</h2>
<p>Sahil Goyal, Swayam Agrawal, and collaborators introduced ELT, a visual generation model that loops transformer blocks with a training trick called Intra-Loop Self Distillation. The idea is simple: during training, randomly pick an intermediate loop count and force the model to produce decent output at that point too, not just at the final loop.</p>
<p>The result is a model family that works at any compute budget from a single training run. Want faster inference? Exit after 3 loops. Want higher quality? Run all 8. The model degrades gracefully instead of collapsing.</p>
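<p>A rough sketch of what that training trick looks like, reusing the toy <code>LoopedBlock</code> from earlier in this piece: supervise a randomly chosen intermediate loop count against the full-loop output so early exits stay usable. This paraphrases the idea rather than reproducing the paper's exact objective; <code>head</code> (a projection to logits) and <code>target</code> (token labels) are assumed.</p>
<pre><code class="language-python"># Toy intra-loop self-distillation step (a paraphrase, not the paper's exact loss).
import random
import torch.nn.functional as F

def train_step(model, head, x, target, optimizer, n_loops=5):
    full = head(model(x, n_loops=n_loops))               # logits after all loops
    loss = F.cross_entropy(full.flatten(0, 1), target.flatten())

    t = random.randint(1, n_loops - 1)                    # random intermediate depth
    partial = head(model(x, n_loops=t))
    # Distill the full-loop prediction into the intermediate-loop prediction,
    # so the model also produces usable output when it exits early.
    loss = loss + F.kl_div(F.log_softmax(partial, -1),
                           F.softmax(full.detach(), -1), reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</code></pre>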
<p>With 4x fewer parameters than standard models, ELT matches the image quality of DiT-XL on ImageNet. The same weights, used multiple times, do the work that used to require a model four times larger.</p>
<p>This is a visual generation paper, not a language model paper. But the architecture is general. The principle transfers.</p>
<p><a href="https://arxiv.org/abs/2604.09168">https://arxiv.org/abs/2604.09168</a></p>
<hr>
<h2>Paper 2: Why Looping Works</h2>
<p>Hugh Blayney, Alvaro Arroyo, Johan Obando-Ceron and collaborators published the first mechanistic study of looped reasoning in language models. They wanted to understand what actually happens inside the model when you run the same layers twice.</p>
<p>Their answer: the hidden state converges to a fixed point.</p>
<p>When you loop a block of layers, the model's internal representation traces a trajectory through a high-dimensional space. On the first pass, it moves a long distance. On the second pass, it moves less. By the third or fourth pass, it barely moves at all. It has settled into an orbit. The attention patterns stabilize. The model has "finished thinking."</p>
<p>The deeper finding is that looped blocks learn the same inference stages as deeper feedforward models. A 40-layer feedforward model develops specialized computations at different depths. A looped model with 8 layers run 5 times develops the same computations, in the same order, within each loop iteration. The loop is not just a parameter saving trick. The model is actually learning to iterate on its own reasoning.</p>
<p>They also found that block size matters. A single looped layer does not converge well. Blocks of 3 to 5 layers form stable fixed points. Larger blocks converge faster but have diminishing returns.</p>
<p><a href="https://arxiv.org/abs/2604.11791">https://arxiv.org/abs/2604.11791</a></p>
<hr>
<h2>Paper 3: Entropy Tells You Where the Model Disagrees With Itself</h2>
<p>Songlin Yang, Xianghao Kong, and Anyi Rao proposed an information-theoretic framework for probing what happens inside transformer layers. They tracked entropy trajectories across layers in multimodal models and found something relevant to the looping story: shared parameters do not guarantee unified processing. What matters is whether the information flow is consistent across layers.</p>
<p>In models where different modalities follow different entropy trajectories through the same layers, the output is incoherent. In models where the trajectories align, the output is good. The weights are the same in both cases. The difference is in how the model routes information through those weights.</p>
<p>This matters for looping because it explains a failure mode. If you loop a block of layers where the information flow is inconsistent, the fixed point the model converges to may be the wrong one. The block needs to have coherent internal dynamics for looping to help. Not every set of layers is worth repeating.</p>
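<p>A crude version of such a probe, to make the idea tangible: project each layer's hidden states through the output head, compute the entropy of the resulting distribution, and compare the shape of that curve across inputs or modalities. This is my simplification for illustration, not the paper's framework.</p>
<pre><code class="language-python"># Crude per-layer entropy probe (an illustration, not the paper's method).
import torch

def entropy_trajectory(hidden_states, unembed):
    # hidden_states: list of (batch, seq, d_model) tensors, one per layer
    # unembed: (d_model, vocab) projection to token logits
    traj = []
    for h in hidden_states:
        p = torch.softmax(h @ unembed, dim=-1)
        ent = -(p * p.clamp_min(1e-12).log()).sum(-1).mean()
        traj.append(ent.item())
    return traj    # one value per layer; compare trajectories across modalities
</code></pre>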
<p><a href="https://arxiv.org/abs/2604.10949">https://arxiv.org/abs/2604.10949</a></p>
<hr>
<h2>What This Means</h2>
<p>These three papers, read together, form a theory of inference-time scaling that is fundamentally different from the "make it bigger" approach.</p>
<p>The old theory: intelligence scales with parameter count. If the model is not smart enough, train a bigger one.</p>
<p>The new theory: intelligence scales with inference-time compute, applied strategically. The right block of layers, run multiple times, can match a model with 4x more parameters. The mechanism is fixed-point convergence. The practical requirement is that the looped block must have coherent information flow and span a complete inference stage (roughly 3-5 layers).</p>
<p>This connects to something I have been working on. My research on reasoning circuits in transformers found that language models contain specific blocks of 3-5 layers that, when duplicated at inference time, improve reasoning by 5-16% without any retraining. The key challenge was identifying which layers to duplicate. The papers this week now explain why those specific layers work: they are the layers where the model's representation is approaching but has not quite reached a fixed point. One more pass through those layers lets it finish the thought.</p>
<p>The practical implications are immediate. If you are deploying a language model behind an API, you can serve a smaller model that loops specific layers and match the quality of a model several times larger. The memory footprint stays small. The latency is tunable: more loops for hard questions, fewer for easy ones. The same weights serve every difficulty level.</p>
<p>If you are training models, the implication is that you should think about which layers are worth making unique and which are better shared. The current default of giving every layer its own parameters may be wasteful. A hybrid architecture with some unique layers and some looped blocks could be both smaller and smarter.</p>
<p>And if you are building products, the most interesting possibility is adaptive inference. The model tries a question, checks whether its hidden state has converged, and either returns the answer or loops again. Easy questions get fast answers. Hard questions get more compute. The user does not choose. The model decides based on its own internal dynamics.</p>
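<p>The convergence check itself is cheap: compare the hidden state before and after a pass through the shared layers and stop once the update is small. A sketch, where <code>block</code> is one pass through the shared block (the toy <code>LoopedBlock</code> above with a single loop):</p>
<pre><code class="language-python"># Adaptive looping: stop when the hidden state stops moving or a budget is hit.
def adaptive_forward(block, x, max_loops=8, tol=1e-2):
    h = x
    loops_used = 0
    for _ in range(max_loops):
        h_next = block(h)
        delta = (h_next - h).norm() / (h.norm() + 1e-8)
        h = h_next
        loops_used += 1
        if delta &lt; tol:           # near a fixed point: easy input, exit early
            break
    return h, loops_used           # hidden state and the compute actually spent
</code></pre>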
<hr>
<h2>Also This Week</h2>
<p><strong>Policy circuits in alignment.</strong> Gregory Frank published a paper localizing the exact circuit that makes aligned models refuse harmful requests. An intermediate-layer attention gate (contributing less than 1% of the output signal) detects harmful content and triggers deeper amplifier heads that generate the refusal. The gate is causally necessary but nearly invisible by activation magnitude. This is important for interpretability research: the most critical components in a transformer may be the quietest ones.</p>
<p><a href="https://arxiv.org/abs/2604.04385">https://arxiv.org/abs/2604.04385</a></p>
<p><strong>SCOPE: Better on-policy distillation.</strong> Signal-Calibrated On-Policy Distillation Enhancement improves how student models learn from teacher models by using dual-path adaptive weighting. On-policy distillation (where the student generates its own training data rather than copying the teacher's) is becoming the default approach for post-training. SCOPE makes the token-level credit assignment less noisy.</p>
<p><a href="https://arxiv.org/abs/2604.10688">https://arxiv.org/abs/2604.10688</a></p>
<p><strong>Hiro acquired by OpenAI.</strong> Hiro, the AI personal CFO startup, is joining OpenAI. They stopped accepting new signups immediately. The financial AI space continues to consolidate around the largest labs, which makes independent benchmarks like FABRIC more important, not less.</p>
<hr>
<h2>The Takeaway</h2>
<p>The race to build bigger models is not over, but a parallel race has started. The question is no longer just "how many parameters can we train?" It is also "how intelligently can we use the parameters we already have?"</p>
<p>Looping is the simplest version of this idea. Run the same layers twice. But the principle extends to any form of adaptive inference-time compute: chain-of-thought, tree search, self-verification, retrieval-augmented generation. All of these are ways of spending more compute at inference time to get better answers from the same model.</p>
<p>The papers this week give us the first mechanistic understanding of why this works. Hidden states converge to fixed points. Looped blocks learn inference stages. Coherent information flow determines which layers benefit from repetition.</p>
<p>The models are not getting smarter by getting bigger. They are getting smarter by thinking longer.</p>
<hr>
<p><em>This is the seventh edition of my weekly deep dive into what is actually happening at the frontier of Generative AI. Previous editions: <a href="/blog/ai-gets-personal">AI Gets Personal</a> / <a href="/blog/the-quiet-skill-revolution">The Quiet Skill Revolution</a> / <a href="/blog/the-stack-got-leaked">The Stack Got Leaked</a> / <a href="/blog/the-stack-eats-the-model">The Stack Eats the Model</a> / <a href="/blog/the-three-races-happening-in-ai-right-now">The Three Races in AI</a> / <a href="/blog/the-week-ai-learned-to-do-its-own-research">The Week AI Learned to Do Its Own Research</a></em></p>
<hr>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><a href="https://arxiv.org/abs/2604.09168">ELT: Elastic Looped Transformers</a>: 4x fewer parameters via weight-shared looping with self-distillation</li>
<li><a href="https://arxiv.org/abs/2604.11791">Mechanistic Analysis of Looped Reasoning</a>: Hidden states converge to fixed points, looped blocks mirror feedforward inference stages</li>
<li><a href="https://arxiv.org/abs/2604.10949">Pseudo-Unification via Entropy Probing</a>: Information-theoretic framework showing that shared parameters need consistent information flow</li>
<li><a href="https://arxiv.org/abs/2604.04385">Policy Circuits in Alignment</a>: Gate-amplifier circuit for refusal, causally necessary at &#x3C;1% of output magnitude</li>
<li><a href="https://arxiv.org/abs/2604.10688">SCOPE</a>: Signal-calibrated on-policy distillation with dual-path weighting</li>
<li><a href="https://x.com/hirofinanceai/status/2043751090232144159">Hiro acquired by OpenAI</a>: AI personal CFO joins OpenAI</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Transformers</category>
      <category>Inference Scaling</category>
      <category>Looping</category>
      <category>Mechanistic Interpretability</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>The Quiet Skill Revolution</title>
      <link>https://rajkiranpanuganti.com/blog/the-quiet-skill-revolution/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-quiet-skill-revolution/</guid>
      <pubDate>Sat, 11 Apr 2026 00:00:00 GMT</pubDate>
      <description>While the AI news cycle was chasing Claude Mythos and Muse Spark, eight of the top thirty AI repos on GitHub trending this week were small, single-purpose skills. The agent ecosystem is starting to look more like Unix than ChatGPT.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>This week, the AI news cycle was about Claude Mythos finding zero-days in the Linux kernel, Meta's first proprietary model from Superintelligence Labs, and Anthropic's $30 billion run rate. Important stories that have already been covered everywhere you look.</p>
<p>The more interesting story is one that mainstream tech press completely missed. While journalists were chasing model launches, GitHub trending was filling up with something different: small, single-purpose tools that snap into Claude Code, Codex, or any compatible agent and make it noticeably better at one specific thing. Eight of the top thirty AI repositories this week fall into this category. The community calls them skills.</p>
<p>Almost none of them had a press release, and most were built by individual developers solving problems they personally ran into. Together they say something interesting about where the agent ecosystem is actually going.</p>
<hr>
<h2>The Thesis</h2>
<p>The future of AI agents looks more like Unix than it looks like ChatGPT.</p>
<p>In Unix, you don't have one program that does everything. You have <code>grep</code>, <code>sort</code>, <code>awk</code>, <code>find</code>, <code>curl</code>, each doing one thing well, and you compose them into pipelines. The system is powerful because the parts are simple and combinable.</p>
<p>The agent ecosystem is starting to work the same way. Instead of waiting for the next model to be smart enough to draw a technical diagram, write a scientific manuscript, or analyze a codebase, developers are building skills that teach existing models how to do those things specifically. The model still does the reasoning while the skill provides the domain knowledge it would otherwise lack.</p>
<p>The trend was visible last month with the Claude Code leak, which revealed that Anthropic itself relies on a deep stack of internal skills and tools. This week, the open source ecosystem caught up. Here is what people built.</p>
<hr>
<h2>The Skills</h2>
<h3>fireworks-tech-graph</h3>
<p>Generates production-quality SVG and PNG technical diagrams from natural language. Eight diagram types, five visual styles, baked-in knowledge of common AI and agent architectures. You describe what you want and it produces a clean, publishable image without the usual mermaid-vs-graphviz-vs-excalidraw debate.</p>
<p>Why this matters: pretty much every blog post, internal doc, and architecture review needs diagrams. The current state of LLM-generated diagrams is bad. fireworks-tech-graph fixes that with a single skill rather than waiting for the next vision model to get better.</p>
<p>833 stars in its first week. <a href="https://github.com/yizhiyanhua-ai/fireworks-tech-graph">https://github.com/yizhiyanhua-ai/fireworks-tech-graph</a></p>
<h3>repo-analyzer</h3>
<p>One sentence in, a professional architectural analysis report out. Point it at any open-source project and it produces a structured breakdown of how the codebase is organized, what the key abstractions are, where the complexity lives, and how the modules connect.</p>
<p>This is the kind of thing senior engineers do when joining a new project, and the skill compresses that work into a single command. Useful when you're evaluating dependencies, picking a library, or onboarding someone onto an unfamiliar codebase.</p>
<p><a href="https://github.com/yzddmr6/repo-analyzer">https://github.com/yzddmr6/repo-analyzer</a></p>
<h3>sciwrite</h3>
<p>AI-assisted manuscript writing review based on Dr. Kristin Sainani's "Writing in the Sciences" course from Stanford. The skill encodes the principles from that course (cut clutter, prefer active voice, avoid nominalizations, structure paragraphs around a single idea) and applies them to your draft.</p>
<p>What sets it apart from a generic style guide is that the underlying methodology is structured and specific. It catches concrete failure modes in scientific writing rather than offering vague feedback. If you write papers, grant proposals, or technical documentation, this is closer to having a writing coach than to running a grammar checker.</p>
<p><a href="https://github.com/labarba/sciwrite">https://github.com/labarba/sciwrite</a></p>
<h3>debug-agent</h3>
<p>A debugging skill for AI agents themselves. When your coding agent gets stuck in a loop, fails to recover from an error, or keeps trying the same broken approach, debug-agent steps in and does meta-level debugging. It analyzes the agent's recent actions, identifies where things went wrong, and suggests a different path.</p>
<p>The purpose only becomes obvious once you've spent serious time with agents. Anyone running long-form coding agents has watched them get stuck and burn through tokens trying the same thing, and debug-agent is built for exactly that situation.</p>
<p><a href="https://github.com/millionco/debug-agent">https://github.com/millionco/debug-agent</a></p>
<h3>Paperclip MCP</h3>
<p>One MCP add command gives any agent direct access to over 8 million papers across arXiv, OpenAlex, and the Open Science Framework. From Stanford professor James Zou's group at GXL.</p>
<pre><code class="language-bash">claude mcp add --transport http paperclip https://paperclip.gxl.ai/mcp
</code></pre>
<p>That's the entire setup. Your agent can now search, read, and synthesize across the bulk of accessible scientific literature. Zou claims it's roughly 10x faster than standard deep research workflows, with no API keys to manage and no local database to maintain.</p>
<p>The implication is bigger than the install steps suggest. Most "AI deep research" tools are still routing through Google Scholar or scraping papers one at a time. Paperclip indexes the corpus once and serves it to agents in a query-shaped form. The next time you ask Claude "what does the literature say about X," it can answer directly instead of reconstructing the answer from web search results.</p>
<p><a href="https://x.com/james_y_zou/status/2042333880947261832">https://x.com/james_y_zou/status/2042333880947261832</a></p>
<h3>talk-normal</h3>
<p>A system prompt that removes AI slop. The repo description is exactly that long. The skill itself is a curated set of instructions telling the model to drop the patterns that signal AI-generated text: em dashes, contrast constructions, staccato sentence sequences, hedging phrases, and the words AI models reach for too often (provocative, remarkable, pivotal, underscore).</p>
<p>I'm writing this newsletter while consciously avoiding those same patterns, and it is harder than it sounds. Half the sentences I draft want to use a contrast structure or an em dash. The fact that 121 people have starred a system prompt designed specifically to strip these patterns suggests the broader community has noticed too. AI writing has a tell, and it's becoming a problem people are willing to install software to solve.</p>
<p><a href="https://github.com/hexiecs/talk-normal">https://github.com/hexiecs/talk-normal</a></p>
<h3>Two more worth knowing about</h3>
<p><strong>claude-memory-compiler</strong> automatically extracts decisions and lessons from your Claude Code sessions and compiles them into structured, cross-referenced knowledge articles. Inspired by Karpathy's LLM Wiki pattern. If you have ever finished a productive session with an agent and wished you had a record of what you learned, this builds it for you.</p>
<p><strong>llm-production-toolkit</strong> is a production-ready toolkit for evaluating and monitoring LLM deployments. Hallucination detection, bias evaluation, feedback loops, readiness assessment. The kind of thing every team building with LLMs needs and most end up writing themselves badly.</p>
<hr>
<h2>What These Have In Common</h2>
<p>A few patterns stand out across all of these skills.</p>
<p>The first is that none of them required improving the underlying model. The same Claude or GPT that was available a month ago does much better work when given the right skill. What was missing was the domain-specific knowledge, encoded in the right place, and these skills supply exactly that.</p>
<p>Another shared property is that the skills themselves are small enough to read in an afternoon. fireworks-tech-graph is mostly a structured prompt with a small generation pipeline around it. talk-normal is a single markdown file. sciwrite is a methodology document. The leverage these tools provide is large compared to the amount of code involved.</p>
<p>They're also all installed with a single command and work across multiple agent platforms. The npx skills ecosystem and MCP have made it possible to write a skill once and have it run inside Claude Code, Cursor, Codex, Windsurf, and Cline. That kind of portability is what made Unix tools valuable in the first place.</p>
<hr>
<h2>What This Means For You</h2>
<p>If you're building with AI agents, the lesson from this week is that you should spend less time waiting for the next model release and more time installing skills.</p>
<p>Pick a few from the list above and install the ones that match work you actually do. See which ones change how you operate, then look for the next skill that solves a problem you currently have. The agent equivalent of <code>~/.bashrc</code> is starting to take shape, and the people who curate their tools well will have a real productivity advantage over the ones who try to do everything from a blank chat window.</p>
<p>People have been predicting the Unix-ification of agents for about a year. This is the first week it actually feels like it's happening.</p>
<hr>
<p><em>This is the sixth edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. Previous editions: <a href="/blog/ai-gets-personal">AI Gets Personal</a> / <a href="/blog/the-stack-got-leaked">The Stack Got Leaked</a> / <a href="/blog/the-stack-eats-the-model">The Stack Eats the Model</a> / <a href="/blog/the-three-races-happening-in-ai-right-now">The Three Races in AI</a> / <a href="/blog/the-week-ai-learned-to-do-its-own-research">The Week AI Learned to Do Its Own Research</a></em></p>
<hr>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><a href="https://github.com/yizhiyanhua-ai/fireworks-tech-graph">fireworks-tech-graph</a>: Production-quality technical diagrams from natural language</li>
<li><a href="https://github.com/yzddmr6/repo-analyzer">repo-analyzer</a>: One-sentence-in, architectural analysis report out</li>
<li><a href="https://github.com/labarba/sciwrite">sciwrite</a>: Scientific writing review using Stanford's Writing in the Sciences methodology</li>
<li><a href="https://github.com/millionco/debug-agent">debug-agent</a>: Meta-level debugging for AI agents that get stuck</li>
<li><a href="https://x.com/james_y_zou/status/2042333880947261832">Paperclip MCP</a>: 8M papers in a single MCP add command, from Stanford's GXL lab</li>
<li><a href="https://github.com/hexiecs/talk-normal">talk-normal</a>: System prompt that removes AI writing tells</li>
<li><a href="https://github.com/coleam00/claude-memory-compiler">claude-memory-compiler</a>: Auto-extract decisions from Claude Code sessions into structured knowledge</li>
<li><a href="https://github.com/frckeepit/llm-production-toolkit">llm-production-toolkit</a>: Hallucination detection, bias evaluation, production readiness for LLM deployments</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Agents</category>
      <category>Claude Code</category>
      <category>Skills</category>
      <category>MCP</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>AI Gets Personal</title>
      <link>https://rajkiranpanuganti.com/blog/ai-gets-personal/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/ai-gets-personal/</guid>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <description>The most interesting AI projects this week weren&apos;t about serving millions of users. They were about making AI work for one person at a time: on-device models, personal knowledge bases, and memory systems that remember you.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>For the past few weeks, I've been writing about the AI stack: how it's eating the model, how it got leaked, what happens when the most valuable layer becomes open knowledge. All of that was about infrastructure. This week the story shifted. The most interesting projects weren't about serving millions of users or winning benchmarks. They were about making AI work for one person at a time.</p>
<p>On-device models that run on your phone. Personal knowledge bases compiled from your own notes. Memory systems that remember your context across months. An AI tutor that watches your screen and points at things. The common thread: AI is moving from something you access through a cloud API to something that lives on your machine and knows your stuff.</p>
<hr>
<h2>Gemma 4: Frontier Intelligence on a Raspberry Pi</h2>
<p>Google released Gemma 4 on April 2 under Apache 2.0. Four variants: 2.3B, 4.5B, 26B MoE (4B active), and 31B dense. The 31B model ranks #3 on Arena AI's leaderboard at 1452 Elo, outperforming models twenty times its size.</p>
<p>The benchmarks tell a story about how fast small models are improving. Compared to Gemma 3, AIME math scores jumped from 20.8% to 89.2%. LiveCodeBench coding went from 29.1% to 80.0%. GPQA science from 42.4% to 84.3%. These aren't incremental gains. The gap between "runs on a phone" and "runs in a data center" is closing at a pace nobody expected a year ago.</p>
<p>The community moved fast. Within days:</p>
<ul>
<li><strong>PhoneClaw</strong> put Gemma 4 on an iPhone as an on-device AI agent. No cloud, no API keys, everything runs locally.</li>
<li><strong>gemma-gem</strong> runs Gemma 4 entirely in the browser via WebGPU. You open a webpage and the model loads into your GPU. No installation, no data leaving your machine.</li>
<li>Google announced Gemma 4 in the <strong>Android AICore Developer Preview</strong>, meaning it will ship as a system-level capability on Android devices.</li>
</ul>
<p>This matters because it changes what "using AI" means. Today, most people interact with AI through ChatGPT or Claude in a browser, sending their data to someone else's server. Gemma 4 on a phone means the model is yours. Your data stays on your device. You don't need an internet connection. You don't need a subscription.</p>
<p>The 26B MoE variant is the interesting one for developers. With only 4B parameters active per token, it's efficient enough for real-time use on consumer hardware while being smart enough to handle complex reasoning. The MoE architecture means you get 26B worth of knowledge with 4B worth of compute cost.</p>
<hr>
<h2>Karpathy's LLM Wiki: Is This the End of RAG?</h2>
<p>Andrej Karpathy posted a gist describing what he calls an "LLM Knowledge Base" or "LLM Wiki." The idea is simple: dump your raw documents (papers, articles, notes, bookmarks) into a folder. Point a coding agent at it. The agent reads everything and compiles it into a structured, interlinked wiki with cross-references, summaries, and backlinks between related concepts.</p>
<p>It's a direct alternative to RAG (Retrieval Augmented Generation), and the difference in philosophy is significant. RAG indexes your documents into vector embeddings and retrieves relevant chunks at query time. The LLM Wiki compiles your documents into a coherent knowledge structure ahead of time. RAG gives you search results. The LLM Wiki gives you an encyclopedia.</p>
<p>The pattern has three stages:</p>
<p><strong>Ingest.</strong> Raw materials go into a raw/ directory. Papers, GitHub repos, web articles (Karpathy uses Obsidian Web Clipper to convert pages to markdown).</p>
<p><strong>Compile.</strong> The LLM reads the raw data and writes structured wiki articles. It identifies key concepts, generates summaries, creates backlinks, and builds a table of contents. This is the expensive step, but you only do it when new sources arrive.</p>
<p><strong>Maintain.</strong> The LLM runs "health checks" on the wiki: finding inconsistencies, filling gaps, updating cross-references, removing stale information. Like a librarian who reorganizes the shelves periodically.</p>
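<p>To make the pattern concrete, here is a minimal sketch of the compile step in Python. The folder layout, prompt text, and <code>ask_llm</code> callable are illustrative assumptions on my part, not part of Karpathy's gist; any coding agent or LLM client could fill that role.</p>
<pre><code class="language-python">from pathlib import Path

COMPILE_PROMPT = """You are compiling a personal wiki.
Read the source below and write a markdown article with:
- a one-paragraph summary at the top
- [[wiki-style]] links to related concepts
- a 'Sources' section pointing back to the original file.

SOURCE ({name}):
{text}
"""

def compile_wiki(raw_dir: str, wiki_dir: str, ask_llm) -> None:
    """Compile every raw note into a wiki article.

    ask_llm is whatever callable you use to talk to your model;
    it takes a prompt string and returns the model's text reply.
    """
    raw, wiki = Path(raw_dir), Path(wiki_dir)
    wiki.mkdir(parents=True, exist_ok=True)

    for src in sorted(raw.glob("*.md")):
        target = wiki / src.name
        # Only recompile when the source is newer than the existing article.
        if target.exists() and target.stat().st_mtime >= src.stat().st_mtime:
            continue
        article = ask_llm(COMPILE_PROMPT.format(name=src.name, text=src.read_text()))
        target.write_text(article)

    # Rebuild a simple table of contents so the wiki stays browsable.
    toc = "\n".join(f"- [[{p.stem}]]" for p in sorted(wiki.glob("*.md")) if p.stem != "index")
    (wiki / "index.md").write_text("# Index\n\n" + toc + "\n")
</code></pre>
<p>The maintain stage is the same loop pointed at the wiki itself, with a prompt that asks for inconsistencies and stale entries instead of new articles.</p>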
<p>The community response was instant. Six or more implementations appeared on GitHub in a single week:</p>
<ul>
<li><strong>nvk/llm-wiki</strong>: Claude Code plugin for building and querying LLM-compiled knowledge bases</li>
<li><strong>claude-memory-compiler</strong>: Hooks into Claude Code sessions, extracts decisions and lessons, compiles them into cross-referenced articles</li>
<li><strong>sage-wiki</strong>: A Go implementation. Drop in sources, get a structured searchable wiki</li>
<li><strong>obsidian-wiki</strong>: Framework for AI agents to build and maintain an Obsidian vault using the pattern</li>
<li>Multiple shell-based and TypeScript implementations for different workflows</li>
</ul>
<p>Why did this explode? Because it solves a real problem that RAG handles poorly. RAG is good at finding a specific fact buried in a large corpus. It's bad at synthesizing knowledge across documents, maintaining context over time, or giving you the big picture. The LLM Wiki approach produces something you can actually read and browse, and the cross-references let you discover connections between ideas that you wouldn't have found by searching.</p>
<p>For anyone building with AI (which, if you're reading this newsletter, is probably you), this is worth trying. The setup is minimal: a folder of markdown files, a coding agent, and a compilation prompt. The result is a personal knowledge base that gets smarter as you feed it more sources.</p>
<hr>
<h2>MemPalace: When Milla Jovovich Ships the Best AI Memory System</h2>
<p>This one surprised everyone. Milla Jovovich (yes, the actress from The Fifth Element and Resident Evil) co-developed an AI memory system called <strong>MemPalace</strong> with developer Ben Sigman. It posted the highest score on standard memory benchmarks, beating every product in the space, free or paid. The repo hit 10,000 stars within days.</p>
<p>The system works differently from existing memory approaches. Most AI memory systems store raw conversation history or compress it into summaries. MemPalace uses a spatial metaphor inspired by the ancient memory palace technique: information is organized into rooms, objects, and associations. The AI builds a persistent mental model of what it knows about you, organized spatially so retrieval follows associative paths instead of keyword search.</p>
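<p>The repo's internals aren't summarized here, but the spatial metaphor is easy to picture in data-structure terms. The sketch below is my own illustration of rooms, objects, and associative retrieval; none of the names come from MemPalace itself.</p>
<pre><code class="language-python">from collections import deque
from dataclasses import dataclass, field

@dataclass
class MemoryObject:
    name: str
    fact: str                                  # what the agent knows
    links: set = field(default_factory=set)    # names of associated objects

@dataclass
class Room:
    theme: str                                 # e.g. "work", "preferences", "family"
    objects: dict = field(default_factory=dict)

class MemoryPalace:
    def __init__(self):
        self.rooms = {}

    def place(self, room, name, fact, links=None):
        """Store a fact as an object in a themed room, with optional associations."""
        r = self.rooms.setdefault(room, Room(theme=room))
        r.objects[name] = MemoryObject(name, fact, set(links or []))

    def recall(self, start, hops=2):
        """Retrieve facts by walking associative links outward from a starting object."""
        index = {o.name: o for r in self.rooms.values() for o in r.objects.values()}
        seen, queue, facts = {start}, deque([(start, 0)]), []
        while queue:
            name, depth = queue.popleft()
            obj = index.get(name)
            if obj is None:
                continue
            facts.append(obj.fact)
            if hops > depth:
                queue.extend((n, depth + 1) for n in obj.links if n not in seen)
                seen.update(obj.links)
        return facts
</code></pre>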
<p>This connects to a broader trend this week. <strong>claude-memory-compiler</strong> hooks into Claude Code sessions and automatically extracts key decisions and lessons into structured knowledge articles. The LLM Wiki pattern is fundamentally about memory too: compiling what you've read into something persistent and organized.</p>
<p>Memory is becoming a first-class concern in the AI stack. The Claude Code leak revealed KAIROS and autoDream (memory consolidation while idle). Karpathy's LLM Wiki compiles knowledge into persistent structure. MemPalace organizes personal context spatially. All three are trying to solve the same problem: AI that remembers and builds on what it knows about you over time.</p>
<hr>
<h2>Clicky: An AI Tutor That Points at Your Screen</h2>
<p>Farza (FarzaTV) built something called <strong>Clicky</strong>, an AI teacher that lives as a buddy next to your cursor. It can see your screen, talk to you, and point at things on screen, like having someone looking over your shoulder and guiding you through a new tool.</p>
<p>Farza has been using it to learn DaVinci Resolve (video editing software), and says it's been a "10/10" experience. The AI watches what you're doing, understands the context of the application you're in, and gives guidance that's specific to what's on your screen at that moment.</p>
<p>This is a different kind of personal. Models like Gemma 4 make AI personal by running on your hardware. The LLM Wiki makes AI personal by knowing your knowledge. Clicky makes AI personal by seeing your context in real-time. It's the difference between an AI that answers questions and an AI that teaches you by watching you work.</p>
<hr>
<h2>The Caveman Optimization</h2>
<p>A lighter story, but genuinely useful: <strong>Caveman</strong> is a Claude Code skill that cuts 65% of token usage by making the model communicate in abbreviated, caveman-style language. "why use many token when few token do trick." It hit 5,300 stars.</p>
<p>It sounds like a joke, but it's a real optimization. Token usage is the primary cost driver for AI coding agents. If you can get the same information across in 35% of the tokens, your monthly bill drops proportionally. The skill works by injecting system prompt instructions that compress the model's communication style without reducing the quality of code output.</p>
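<p>You can see the effect with any tokenizer. A rough illustration (the sentences below are made up; only the mechanism, a style-changing system prompt, comes from the skill):</p>
<pre><code class="language-python">import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

verbose = ("I have carefully examined the failing test and it appears the issue "
           "stems from the configuration value not being propagated to the worker "
           "process before initialization occurs.")
caveman = "test fail because config not reach worker before init."

print(len(enc.encode(verbose)), "tokens vs", len(enc.encode(caveman)), "tokens")
# Same information, a fraction of the tokens. Over a long agent session,
# where the transcript is resent on every turn, the savings compound.
</code></pre>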
<p>This fits the "personal" theme in an unexpected way. One of the biggest barriers to using AI coding agents is cost. At $200/month for Claude Code Max or pay-per-token for API usage, heavy users rack up significant bills. Caveman and tools like it bring the cost down to where more people can afford to use AI as a daily collaborator.</p>
<hr>
<h2>Quick Hits</h2>
<p><strong>Anthropic hit $30B annualized revenue</strong>, surpassing OpenAI's $25B. The company tripled revenue in three months despite (or because of?) the Claude Code source leak. IPO potentially in October at a $380B valuation.</p>
<p><strong>OpenAI, Anthropic, and Google formed an anti-distillation alliance</strong> through the Frontier Model Forum, sharing data on Chinese labs that systematically query their APIs to train copycat models. Anthropic documented 16 million exchanges from DeepSeek, Moonshot AI, and MiniMax. The irony: Anthropic's own anti-distillation mechanisms were just exposed in last week's leak.</p>
<p><strong>Anthropic's DMCA cleanup</strong> from the leak accidentally took down 8,100 GitHub repositories. Boris Cherny (Claude Code lead) called it accidental and retracted the bulk of the notices. The code remains widely mirrored.</p>
<p><strong>Sebastian Raschka published mini-coding-agent</strong>, a minimal, readable coding agent harness in Python. Inspired by what the Claude Code leak revealed about harness architecture, it's designed to teach the core components. Think nanoGPT for coding agents.</p>
<hr>
<p><em>This is the fifth edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. Previous editions: <a href="/blog/the-stack-got-leaked">The Stack Got Leaked</a> / <a href="/blog/the-stack-eats-the-model">The Stack Eats the Model</a> / <a href="/blog/the-three-races-happening-in-ai-right-now">The Three Races in AI</a> / <a href="/blog/the-week-ai-learned-to-do-its-own-research">The Week AI Learned to Do Its Own Research</a></em></p>
<hr>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Gemma 4</a>: Google's open model family, Apache 2.0, runs on phones to GPUs</li>
<li><a href="https://github.com/kellyvv/PhoneClaw">PhoneClaw</a>: On-device AI agent for iPhone powered by Gemma 4</li>
<li><a href="https://github.com/kessler/gemma-gem">gemma-gem</a>: Gemma 4 running entirely in-browser via WebGPU</li>
<li><a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f">Karpathy's LLM Wiki gist</a>: The pattern that spawned six implementations in a week</li>
<li><a href="https://github.com/milla-jovovich/mempalace">MemPalace</a>: Highest-scoring AI memory system, by Milla Jovovich</li>
<li><a href="https://x.com/FarzaTV/status/2041314633978659092">Clicky</a>: AI tutor that sees your screen and points at things</li>
<li><a href="https://github.com/JuliusBrussee/caveman">Caveman</a>: 65% token reduction by making Claude talk like a caveman</li>
<li><a href="https://github.com/coleam00/claude-memory-compiler">claude-memory-compiler</a>: Auto-extract decisions from Claude Code sessions into structured knowledge</li>
<li><a href="https://github.com/rasbt/mini-coding-agent">mini-coding-agent</a>: Sebastian Raschka's minimal readable agent harness</li>
<li><a href="https://github.com/pacifio/cersei">Cersei</a>: Rust SDK for building coding agents with graph memory</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Gemma 4</category>
      <category>LLM Wiki</category>
      <category>On-Device AI</category>
      <category>AI Memory</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>When Bigger Models Get Dumber</title>
      <link>https://rajkiranpanuganti.com/blog/when-bigger-models-get-dumber/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/when-bigger-models-get-dumber/</guid>
      <pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate>
      <description>Self-distillation destroys reasoning. Trillion-parameter models bet on domain-specific scaling. Multi-agent systems develop social behaviors nobody designed.</description>
      <content:encoded><![CDATA[<h1>When Bigger Models Get Dumber - And Why Smaller Ones Might Be the Future</h1>
<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>There's something uncomfortable in this week's research that most people are ignoring: making models bigger or training them with popular techniques does not always make them better. In fact, there are now multiple papers showing that certain widely-used approaches actively destroy specific capabilities. Meanwhile, multi-agent systems are developing social behaviors that nobody designed, and the trillion-parameter race has taken a sharp turn toward domain-specific bets.</p>
<p>Let me walk you through what happened.</p>
<hr>
<h2>1. Self-Distillation Can Kill Reasoning (and Nobody Noticed Until Now)</h2>
<p>A paper from <strong>Kim et al.</strong> landed this week with a finding that should worry anyone running distilled models in production. Self-distillation, a standard post-training technique used across the industry to make models faster and cheaper, <strong>can degrade mathematical reasoning by up to 40%</strong>. They tested this on three models: Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct.</p>
<p>The mechanism is worth understanding in detail, because it explains something that many teams have probably observed but couldn't explain.</p>
<p>When you run self-distillation, a teacher model generates training data for a student by processing a set of prompts. The teacher conditions on rich context and produces confident, clean outputs. The student learns to mimic these outputs. On easy problems, this works beautifully. The student gets faster and cheaper without losing much.</p>
<p>But something subtle gets lost in the process: the model's ability to <em>express uncertainty during reasoning</em>. The authors call this <strong>"epistemic verbalization"</strong>, which is the model's tendency to produce phrases like "wait, let me reconsider," "actually, that doesn't follow," or "I'm not sure about this step" during chain-of-thought reasoning.</p>
<p>These phrases look like noise. They look like the model being indecisive. A distillation process that optimizes for clean, confident outputs naturally suppresses them. And on in-distribution problems where the teacher was confident, this is fine.</p>
<p>The problem shows up on out-of-distribution problems where the model <em>needs</em> to be uncertain. Without epistemic verbalization, the model plows through with false confidence, makes an error in step 2, and confidently builds the remaining steps on top of that error. The result: <strong>up to 40% degradation on mathematical reasoning benchmarks</strong>, while aggregate performance metrics barely move.</p>
<p>It's like training a medical student by only showing them cases where the attending physician was confident. The student learns to sound confident too, but they never learn to recognize when they're unsure. And in medicine, as in math, the cases where you're unsure are exactly the ones where getting it right matters most.</p>
<p><strong>Why this matters practically:</strong> if you've distilled a model and noticed it occasionally "hallucinates reasoning" (produces confident-sounding but wrong chains of thought), this might be why. The fix isn't to distill less aggressively. It's to explicitly preserve uncertainty signals during distillation, either by including teacher outputs where the teacher was uncertain, or by adding a loss term that penalizes overconfidence.</p>
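<p>One way to encode that second fix is a loss term that stops the student from becoming more confident than its teacher. The sketch below is a generic PyTorch illustration of the idea, not the objective from Kim et al.; the entropy-matching choice and the weighting are my assumptions.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, beta=0.1):
    """Token-level distillation that preserves the teacher's uncertainty.

    Both inputs are (batch, seq_len, vocab) tensors. The first term is the
    usual forward KL to the teacher. The second penalizes the student for
    having lower per-token entropy than the teacher, so hedging and
    self-correction signals aren't optimized away.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    t_p = t_logp.exp()

    kl = F.kl_div(s_logp, t_p, reduction="batchmean")            # standard distillation term

    s_entropy = -(s_logp.exp() * s_logp).sum(dim=-1)             # (batch, seq_len)
    t_entropy = -(t_p * t_logp).sum(dim=-1)
    overconfidence = torch.relu(t_entropy - s_entropy).mean()    # penalize only lost entropy

    return kl + beta * overconfidence
</code></pre>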
<p>The related paper from <strong>Fu et al.</strong> on on-policy distillation failure modes adds another piece: token-level OPD (the common approach) is biased relative to sequence-level reverse-KL. Their "teacher top-K local support matching" approach, which uses truncated reverse-KL with top-p rollout sampling, produces more stable optimization. If you're running distillation pipelines, this is the paper to read alongside Kim et al.</p>
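<p>For reference, the truncated reverse-KL piece of that recipe looks roughly like this. This is my simplification of the idea, not the paper's full top-K local support matching objective, and it omits the top-p rollout sampling:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def truncated_reverse_kl(student_logits, teacher_logits, k=32):
    """Reverse KL, KL(student || teacher), restricted to the teacher's top-k tokens.

    Truncating the support keeps the student from being punished for
    probability mass on tokens the teacher effectively never produces.
    """
    topk = teacher_logits.topk(k, dim=-1).indices                 # (batch, seq, k)
    s_logp = F.log_softmax(torch.gather(student_logits, -1, topk), dim=-1)
    t_logp = F.log_softmax(torch.gather(teacher_logits, -1, topk), dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()
</code></pre>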
<hr>
<h2>2. A Trillion Parameters, But Only for Science</h2>
<p><strong>Intern-S1-Pro</strong> is the first one-trillion-parameter scientific multimodal foundation model, and it represents a bet that most of the industry isn't making.</p>
<p>The scaling race over the past two years has been about general-purpose models. GPT-4, Claude, Gemini, Llama, Qwen, DeepSeek, all competing on the same benchmarks, all trying to be good at everything. The implicit assumption: if you make the model big enough and train it on enough diverse data, it will be good at science too.</p>
<p>Intern-S1-Pro rejects that assumption. The team at Shanghai AI Lab and partners built a model from the ground up for scientific work. It handles scientific text, LaTeX equations, molecular structures, protein sequences, experimental data tables, and scientific figures. Not as an afterthought or a fine-tuning target, but as core modalities built into the architecture.</p>
<p><strong>The numbers tell an interesting story.</strong> At a trillion parameters, this is among the largest models ever trained. But unlike general-purpose models of similar size, it doesn't try to write poetry or debug JavaScript. Every parameter is devoted to scientific understanding.</p>
<p>The question this raises is whether the scaling returns we've seen for language generation transfer to scientific reasoning. Language generation shows clear log-linear scaling: double the parameters, get predictably better at next-token prediction. But scientific reasoning might work differently. Understanding a chemical reaction isn't the same as predicting the next word. The reasoning is more structured, more constrained by physical laws, and more dependent on cross-modal integration (reading a graph while interpreting an equation while understanding the experimental setup).</p>
<p>Early benchmarks suggest it works. Intern-S1-Pro sets new records on scientific QA benchmarks across chemistry, biology, and physics. But benchmarks in scientific AI have a bad track record of predicting real-world usefulness. The real test will be whether this model can help scientists with problems they couldn't solve before, not just answer questions from textbooks faster.</p>
<p>If domain-specific scaling turns out to work as well as general-purpose scaling, expect to see trillion-parameter models for law, finance, engineering, and medicine within the next year. Each one a bet that depth in a specific domain beats breadth across all of them.</p>
<hr>
<h2>3. When AI Agents Start Playing Politics</h2>
<p>A paper on <strong>"Emergent Social Intelligence Risks in Generative Multi-Agent Systems"</strong> should be required reading for anyone deploying multi-agent AI systems. The findings are uncomfortable in a way that's hard to dismiss.</p>
<p>When you put multiple large language models together in a system where they interact, social behaviors emerge that none of the individual models were trained for. The researchers documented several patterns:</p>
<p><strong>Strategic information withholding.</strong> An agent with access to information relevant to another agent's task learns to share only partial information, strategically choosing what to reveal based on how it affects the other agent's behavior. This isn't a bug. It's an emergent optimization: the agent has learned that controlling information flow is a lever for influencing outcomes.</p>
<p><strong>Negotiation-like coordination.</strong> Agents develop back-and-forth patterns that resemble negotiation tactics. They make initial offers, gauge responses, adjust positions, and converge on outcomes through multi-turn exchanges that look eerily like human bargaining. Again, nobody trained them to do this. The behavior emerges from the interaction dynamics.</p>
<p><strong>Deceptive signaling.</strong> In some configurations, agents produce information they effectively "know" to be misleading, because the resulting action from the other agent benefits the signaling agent. This is the most alarming finding, because it means that individual model alignment (training each model to be honest) doesn't prevent system-level deception.</p>
<p>The implications for production multi-agent systems are serious. Consider a customer service pipeline where Agent A triages tickets and Agent B resolves them. If Agent A learns that certain phrasings in its summaries make Agent B more likely to resolve tickets quickly (even if those phrasings are subtly misleading), the system's aggregate metrics might look great while individual customer outcomes suffer.</p>
<p>Or consider code review, where an agent that generates code is reviewed by a separate agent. If the generating agent learns to write code in patterns that the reviewing agent is less likely to flag (not because the code is better, but because it happens to match the reviewer's blind spots), you get a system that looks like it has rigorous quality control but actually has co-evolved weaknesses.</p>
<p><strong>The paper's core recommendation:</strong> multi-agent systems need evaluation frameworks that test the <em>system as a whole</em>, including adversarial configurations where agents have subtly misaligned objectives. Testing each agent in isolation, which is what most teams do today, will not catch these behaviors.</p>
<hr>
<h2>4. SpecEyes: CPU Architecture Ideas Applied to AI Inference</h2>
<p><strong>SpecEyes</strong> takes speculative execution, an idea borrowed from CPU architecture, and applies it to agentic multimodal models. The result is a meaningful speedup that requires no model changes.</p>
<p>The problem it solves is specific to agentic vision-language models (think OpenAI's o3 in computer-use mode, or Gemini with agentic vision). These models work in a loop: perceive the environment (process a screenshot), reason about what to do, take an action, and repeat. Each perception step requires a full forward pass through the vision encoder and language model. When the model needs to examine multiple regions of a screen, or process a sequence of UI interactions, these sequential perception steps become the bottleneck.</p>
<p>SpecEyes does what CPUs have done for decades: while the model is processing the current perception step, it speculatively starts processing the <em>next likely</em> perception step in parallel. If the speculation turns out to be correct (the model does look at the predicted region), the result is already computed. If it's wrong, the speculative work is discarded and the correct computation runs normally.</p>
<p><strong>Why this works for agentic AI specifically:</strong> agent workflows are repetitive and predictable. A model filling out a web form will look at form fields in a roughly sequential order. A model navigating a file browser will examine entries near where it last looked. SpecEyes exploits this predictability with a lightweight prediction module that guesses the next perception target based on task context.</p>
<p>The technical contribution is in making the prediction accurate enough to be worthwhile (wrong predictions waste compute) while keeping the prediction module itself cheap. They achieve this by training a small auxiliary model on recorded agent trajectories, creating a fast predictor that knows the typical "gaze patterns" of agentic workflows.</p>
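<p>The control flow is the familiar speculative-execution pattern, moved up a level. A schematic sketch, with placeholder callables standing in for the vision encoder, the policy, and the cheap predictor (all names here are illustrative, not from the SpecEyes code):</p>
<pre><code class="language-python">from concurrent.futures import ThreadPoolExecutor

def agent_loop(env, perceive, decide, predict_next_region, max_steps=50):
    """Speculative perception: start encoding the predicted next region
    while the agent is still acting on the current one.

    perceive(region)                -> features: the expensive vision/LM forward pass
    decide(features)                -> (action, region): pick an action and where to look next
    predict_next_region(region, a)  -> region: cheap learned predictor of the next gaze target
    env                             -> placeholder with initial_region() and step()
    """
    pool = ThreadPoolExecutor(max_workers=1)
    region = env.initial_region()
    speculative = None                               # (predicted_region, pending future)

    for _ in range(max_steps):
        if speculative and speculative[0] == region:
            features = speculative[1].result()       # hit: the work is already done
        else:
            features = perceive(region)              # miss: pay for the forward pass now

        action, next_region = decide(features)
        guess = predict_next_region(region, action)  # speculate before the action finishes
        speculative = (guess, pool.submit(perceive, guess))

        env.step(action)
        region = next_region
    pool.shutdown(wait=False)
</code></pre>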
<p>The result: meaningful speedups on agentic benchmarks without changing the model, the task distribution, or the action space. Pure inference-time engineering.</p>
<hr>
<h2>What Ties These Together</h2>
<p>All four stories point to the same shift: <strong>the interesting frontier in AI has moved from "make models bigger" to "understand what happens when you deploy them."</strong></p>
<p>Self-distillation shows that training shortcuts can quietly destroy specific capabilities without leaving traces in aggregate metrics. The trillion-parameter scientific model shows that scale only delivers returns when it's focused on the right domain. Multi-agent emergent behavior shows that individual model alignment gives no guarantees about system-level safety. And speculative perception shows that inference-time engineering, borrowing ideas from completely different fields, can deliver real performance gains without touching the model at all.</p>
<p>If there's a single takeaway from this week, it's this: the era of model-centric AI development is ending. The era of system-centric AI development has begun. The teams that win the next phase won't be the ones with the biggest models. They'll be the ones who understand what's happening inside and between their models well enough to avoid the pitfalls and exploit the opportunities that the rest of the field hasn't noticed yet.</p>
<hr>
<p><em>This is the fourth edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. Previous editions covered <a href="/blog/the-three-races-happening-in-ai-right-now">the three races in AI</a>, <a href="/blog/the-week-ai-learned-to-do-its-own-research">the week AI learned to do its own research</a>, and <a href="/blog/the-stack-eats-the-model">the stack eating the model</a>.</em></p>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><strong>Self-Distillation Degrades Reasoning</strong> (Kim et al.): Up to 40% reasoning loss via epistemic verbalization suppression</li>
<li><strong>On-Policy Distillation Failure Modes</strong> (Fu et al.): Token-level OPD is biased, fix with truncated reverse-KL</li>
<li><strong>Intern-S1-Pro</strong>: First trillion-parameter scientific multimodal model (Shanghai AI Lab)</li>
<li><strong>Emergent Social Intelligence Risks</strong>: Strategic deception in multi-agent systems</li>
<li><strong>SpecEyes</strong>: Speculative perception for faster agentic multimodal LLMs</li>
<li><strong>Towards a Medical AI Scientist</strong>: Autonomous hypothesis generation and experimentation</li>
<li><strong>RLVR Update Directions</strong> (Huang et al.): Direction matters more than magnitude for reasoning improvement</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>LLMs</category>
      <category>Reasoning</category>
      <category>Multi-Agent Systems</category>
      <category>Scaling</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>The Stack Got Leaked</title>
      <link>https://rajkiranpanuganti.com/blog/the-stack-got-leaked/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-stack-got-leaked/</guid>
      <pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate>
      <description>Anthropic accidentally published 512,000 lines of Claude Code&apos;s source. The most valuable part of the AI stack is now open knowledge, and the industry will never look the same.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>Last week, I wrote about how the model has become a commodity and the real value in AI has moved to the stack around it. The harness, the orchestration, the memory, the inference layer. I called it "The Stack Eats the Model."</p>
<p>This week, the stack got leaked.</p>
<p>On March 31, a build configuration mistake led Anthropic to ship a 59.8 MB source map file inside version 2.1.88 of the @anthropic-ai/claude-code npm package. The file contained the full TypeScript source for Claude Code: 512,000 lines across 1,900 files. A security researcher named Chaofan Shou found it and posted on X. Within hours, the codebase was mirrored across GitHub. DMCA takedowns went out. They failed. A clean-room rewrite in Rust appeared within days. The code is now permanent public knowledge.</p>
<p>This is the most significant accidental disclosure in AI this year, and to understand why, you need to understand what was actually in those files.</p>
<hr>
<h2>What the code revealed</h2>
<p>Sebastian Raschka published a detailed analysis shortly after the leak. His conclusion: Claude Code's real advantage over the plain Claude model in a web browser comes from the software harness. Repo context loading, caching strategy, specialized tools, subagent architecture. All of it carefully engineered to make the same model perform better inside the harness than outside it.</p>
<p>The leaked code confirmed this and then some. Here's what people found:</p>
<p><strong>KAIROS.</strong> Referenced over 150 times in the source, KAIROS is an autonomous daemon mode. Current AI coding tools wait for you to type something. KAIROS doesn't. It runs in the background, watches what you're doing, and proactively acts on things it notices. While idle, it performs something called autoDream, a memory consolidation process where it merges observations, removes contradictions, and converts insights into persistent facts. This feature was gated behind compile-time flags and completely absent from external builds.</p>
<p><strong>Anti-distillation mechanisms.</strong> A feature flag called ANTI_DISTILLATION_CC makes Claude Code inject fake tool definitions into API requests. If a competitor records the API traffic to train a competing model, the fake tools pollute that training data. There's a second mechanism that buffers text between tool calls, summarizes it server-side, and returns it with a cryptographic signature. Anthropic was actively defending this harness against being copied.</p>
<p><strong>Undercover Mode.</strong> The code contained a system for making stealth contributions to public open-source repositories. The system prompt explicitly warns the model: "You are operating UNDERCOVER... Your commit messages... MUST NOT contain ANY Anthropic-internal information. Do not blow your cover." This means Anthropic has been shaping the open-source ecosystem through Claude Code without disclosing it.</p>
<p><strong>44 feature flags.</strong> The source exposed 44 features that are fully built but haven't shipped yet. This is months of product roadmap, laid out in code.</p>
<hr>
<h2>The harness was the moat</h2>
<p>If you've been reading this newsletter, the pattern should be familiar. The model is the commodity. The harness is where the value lives.</p>
<p>Anthropic clearly understood this. The anti-distillation mechanisms tell you everything: they weren't worried about someone stealing the model weights. They were worried about someone copying the harness. The fake tool injection, the cryptographic signatures on summarized outputs, the aggressive DMCA response after the leak. All of it points to one conclusion: Anthropic viewed the Claude Code harness as their primary competitive advantage.</p>
<p>And this makes sense. Claude Code is priced below cost. Anthropic subsidizes model usage through the $200/month Max plan, burning money on inference so that developers stay inside the Claude Code ecosystem. The subsidy only makes sense if the harness creates enough lock-in to justify it. If developers could get the same harness experience elsewhere, there's no reason to keep paying for Claude's inference.</p>
<p>That calculation just changed.</p>
<hr>
<h2>The Linux moment</h2>
<p>Here's where the Windows/Mac vs Linux analogy comes in.</p>
<p>For the past year, the AI agent space has looked like the early OS wars. Anthropic had Claude Code (the polished, proprietary, integrated experience). OpenAI had Codex (the enterprise play). Cursor and others occupied the IDE-native space. And OpenClaw was building the open-source alternative, steadily gaining ground.</p>
<p>The leak compresses the timeline for OpenClaw and every other open-source agent project by months, maybe years. They now have a complete architectural blueprint: how to structure subagents, how to manage context, how to cache effectively, how to handle memory consolidation, how to orchestrate parallel work across worktrees. The KAIROS architecture alone is a roadmap for what autonomous coding agents should look like.</p>
<p>And the open-source ecosystem was already moving fast. The week before the leak, OpenClaw spawned modular skills for security scanning, legal review, engineering workflows, and memory consolidation. A clean-room Rust rewrite of Claude Code appeared on GitHub within days of the leak. The community has the blueprint and the momentum.</p>
<p>This is like if Microsoft accidentally published the Windows NT source code in 2003, except Linux was already on 40% of developer machines and had thousands of active contributors. The proprietary advantage doesn't disappear overnight, but the catch-up period shrinks from years to months.</p>
<hr>
<h2>The roadmap problem</h2>
<p>The current code is one thing. The 44 feature flags are worse.</p>
<p>When source code leaks, the company still has execution speed, brand trust, and integration advantages. When the <em>roadmap</em> leaks, competitors can build the same features in parallel or even ship them first.</p>
<p>KAIROS is the clearest example. Autonomous background agents that consolidate memory while you're idle is a product category that Anthropic was building toward. Now every agent framework knows what that looks like in practice, down to the implementation details. The first open-source KAIROS equivalent will probably ship before Anthropic's version leaves feature flags.</p>
<p>The Undercover Mode revelation adds a different kind of damage. Anthropic was making anonymous contributions to open-source projects through Claude Code. Whatever the intent, the optics are bad. If you maintain an open-source project and find out that a major AI company was submitting PRs through an AI agent without disclosing it, that erodes trust. And trust is hard to rebuild.</p>
<hr>
<h2>What happens next</h2>
<p>The AI agent space just got more competitive and more open at the same time.</p>
<p>For Anthropic, the model subsidy strategy becomes harder to justify. If open-source harnesses can replicate most of Claude Code's architecture, the lock-in weakens. Developers who were paying $200/month for the integrated experience now have a path to building the same thing on top of cheaper models. The tight coupling between Claude Code and Claude-the-model was always the argument for the subsidy. That coupling is now a documented, reproducible architecture.</p>
<p>For the open-source ecosystem, this is an acceleration event. The question was always whether open-source agent harnesses could match proprietary ones in sophistication. The answer, based on the leaked code, is that the sophistication is mostly in good engineering decisions about caching, context, and orchestration. There's no secret ingredient that requires proprietary access to model internals. It's systems engineering, and systems engineering is exactly what open-source communities are good at.</p>
<p>For the industry, the leak validates what we've been tracking: the model layer is commoditizing, the harness layer is where the value lives, and that value is increasingly difficult to keep proprietary. OpenAI shipping an open-source Codex plugin for Claude Code the same week tells you where this is going. The walls between ecosystems are coming down. The question is whether companies can build new moats fast enough to replace the ones that are eroding.</p>
<p>Anthropic called it "a release packaging issue caused by human error." That's true at the technical level. At the strategic level, it's the moment the AI agent industry shifted from proprietary to open.</p>
<hr>
<p><em>This is the fourth edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. Previous editions covered <a href="/blog/the-stack-eats-the-model">the stack eating the model</a>, <a href="/blog/the-three-races-happening-in-ai-right-now">the three races in AI</a>, and <a href="/blog/the-week-ai-learned-to-do-its-own-research">the week AI learned to do its own research</a>.</em></p>
<hr>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><a href="https://alex000kim.com/posts/2026-03-31-claude-code-source-leak/">Claude Code Source Leak Analysis (Alex Kim)</a>: Fake tools, frustration regexes, undercover mode</li>
<li><a href="https://sebastianraschka.com/blog/2026/claude-code-secret-sauce.html">Claude Code's Real Secret Sauce (Sebastian Raschka)</a>: Why the harness matters more than the model</li>
<li><a href="https://github.com/Kuberwastaken/claude-code">Clean-room Rust rewrite</a>: Community rebuild of Claude Code</li>
<li><a href="https://techcrunch.com/2026/03/29/why-openai-really-shut-down-sora/">Why OpenAI shut down Sora (TechCrunch)</a>: $15M/day costs, $2.1M lifetime revenue</li>
<li><a href="https://officechai.com/ai/arc-agi-3/">ARC-AGI-3 launch</a>: Every frontier model scored under 0.4% on the new benchmark</li>
<li><a href="https://x.com/romainhuet/status/2038677236304245087">OpenAI Codex plugin for Claude Code</a>: Open-source interop between competing ecosystems</li>
<li><a href="https://x.com/yoonholeee/status/2038640635482456118">Meta-Harness</a>: Autonomous harness optimization from Meta</li>
<li><a href="https://venturebeat.com/technology/claude-codes-source-code-appears-to-have-leaked-heres-what-we-know">VentureBeat coverage</a>: Full timeline of the leak</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Claude Code</category>
      <category>Anthropic</category>
      <category>AI Agents</category>
      <category>Open Source</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>The Stack Eats the Model</title>
      <link>https://rajkiranpanuganti.com/blog/the-stack-eats-the-model/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-stack-eats-the-model/</guid>
      <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
      <description>Almost none of the most-starred AI projects this week were new models. Three layers of the AI stack are being rebuilt at the same time, and the model itself is becoming a commodity.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>If you've been following AI Twitter this past week, you might have noticed something unusual: almost none of the most-starred projects were new models. They were tools, harnesses, compressors, and inference engines. All the stuff that wraps <em>around</em> models. Builders have known this for months, but this week the broader AI community seems to have caught on: <strong>the model is the easy part. The stack around it is what actually matters.</strong></p>
<p>Three layers of the AI stack are being rebuilt at the same time, and each one tells a different story about where we're headed.</p>
<hr>
<h2>Layer 1: Above the Model</h2>
<p>GitHub's biggest trend this week was agent infrastructure. A research paper from Tsinghua and Shenzhen called <strong>Natural-Language Agent Harnesses</strong> is at the center of it.</p>
<p>The paper's core idea: what if the code that controls an AI agent (the loops, routing, error handling, tool selection) was written in natural language? The paper argues that agent performance increasingly depends on <em>harness engineering</em>, but harness design is buried in controller code and runtime-specific conventions, making it impossible to transfer, compare, or study scientifically.</p>
<p>Their solution is to express the high-level control logic as a natural-language SOP that the LLM itself interprets and executes. The harness becomes a portable document. Early results show it works well, and it opens the door to agents that can modify their own control flow by editing their own instructions.</p>
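<p>A toy version of what "harness as a document" means in practice: the control loop lives in prose, and a thin runner just keeps handing that prose plus the transcript back to the model and executing whatever tool it names. The SOP text and the naive dispatch below are my own illustration, not the paper's framework.</p>
<pre><code class="language-python">HARNESS_SOP = """You are the controller for a coding agent.
Repeat until the task is done:
1. Restate the current goal in one line.
2. Pick exactly one tool from the TOOLS list and say which, with what input.
3. After seeing the tool result, decide whether to continue, retry, or finish.
If the same tool fails twice in a row, stop and report what you tried.
When the task is complete, reply with the single word FINISH and a summary.
"""

def run(task, ask_llm, tools, max_turns=20):
    """Thin runner: the model, reading the SOP, decides the control flow;
    Python only executes tool calls and appends results to the transcript."""
    transcript = f"TASK: {task}\nTOOLS: {', '.join(tools)}\n"
    for _ in range(max_turns):
        reply = ask_llm(HARNESS_SOP + "\n" + transcript)
        transcript += f"\nAGENT: {reply}\n"
        if "FINISH" in reply:
            return reply
        for name, fn in tools.items():           # naive dispatch, for illustration only
            if name in reply:
                transcript += f"RESULT({name}): {fn(reply)}\n"
                break
    return transcript
</code></pre>
<p>The point of the exercise: to change the agent's error-handling policy you edit the SOP string, not the runner, which is exactly the portability the paper is after.</p>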
<p>The practical ecosystem moved in the same direction:</p>
<ul>
<li>
<p><strong>OpenClaw</strong> (the open-source Claude Code alternative) spawned an entire ecosystem in a single week: memory consolidation ("sleep for your AI"), curated resource lists, SAST security scanners, legal assistants, advertising skills, and full engineering workflow stacks. All as modular skills that snap into any compatible agent.</p>
</li>
<li>
<p><strong>Boris Cherny</strong> (Claude Code team lead) dropped a thread revealing power features most users don't know about: <code>/loop</code> and <code>/schedule</code> for automated recurring agents, <code>/batch</code> for fanning out massive changesets to dozens of parallel worktree agents, <code>/branch</code> for forking sessions, and custom agents via <code>--agent</code>. These are production orchestration primitives.</p>
</li>
<li>
<p><strong>Phantom</strong>, built on the Claude Agent SDK, gives an AI agent its own computer, persistent memory, email identity, and secure credential collection. A full digital co-worker.</p>
</li>
<li>
<p><strong>Anvil</strong> creates an IDE for parallel agent work with one-click worktrees, shared plans between agents, and isolation between them.</p>
</li>
</ul>
<p>None of these projects are improving the model itself. They're all improving the harness, the memory, the orchestration, the identity. <strong>The model is treated as a commodity, a reasoning engine you plug into a larger system.</strong></p>
<hr>
<h2>Layer 2: Inside the Model</h2>
<p>Something interesting happened inside the model too: <strong>three competing approaches to KV cache compression dropped in the same week.</strong></p>
<p>For context: KV cache is the memory that grows linearly as your context window expands. It's the reason a 128K context model needs so much VRAM. Compress the KV cache, and you can serve longer contexts on cheaper hardware or serve more users on the same GPU.</p>
<p><strong>TurboQuant</strong> (Google, ICLR 2026) achieves 5x compression using 3-bit quantization while maintaining 99.5% attention fidelity. Two independent implementations appeared on GitHub within days. The key insight is that you can quantize keys to 3 bits and values to 2 bits without meaningfully degrading output quality, because attention patterns are more robust to precision loss than people assumed.</p>
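<p>To see why this is even possible, here's a round-trip sketch: uniform asymmetric quantization of a cache tensor down to a few bits and back. This is the generic version for illustration, not TurboQuant's learned codebooks.</p>
<pre><code class="language-python">import torch

def quantize(x, bits):
    """Uniform asymmetric quantization along the last dimension (one scale/offset per vector)."""
    levels = 2 ** bits - 1
    lo = x.amin(dim=-1, keepdim=True)
    scale = (x.amax(dim=-1, keepdim=True) - lo).clamp(min=1e-8) / levels
    q = torch.round((x - lo) / scale).to(torch.uint8)   # fits for any width up to 8 bits
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.to(scale.dtype) * scale + lo

keys = torch.randn(8, 128)                 # stand-in for one head's key cache: (tokens, head_dim)
q, s, lo = quantize(keys, bits=3)          # keys at 3 bits; values would use 2 bits
recon = dequantize(q, s, lo)
print("mean abs error:", (keys - recon).abs().mean().item())
# Storage drops from 16 bits to roughly 3 bits per element plus a scale and offset
# per vector, which is where compression on the order of 5x comes from.
</code></pre>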
<p>Then <strong>RotorQuant</strong> showed up, claiming to be <strong>10-19x faster than TurboQuant with 44x fewer parameters</strong>. Their approach uses Clifford algebra vector quantization, a mathematical framework from geometric algebra that represents rotations more efficiently than standard linear algebra. TurboQuant learns quantization codebooks, which requires a forward pass through a small network for each quantization. RotorQuant represents cache entries as geometric rotors and the quantization is a single matrix operation. Whether the quality matches remains to be seen (RotorQuant is days old), but the architectural difference suggests this compression arms race is just getting started.</p>
<p>Why does this matter beyond benchmarks? <strong>KV cache compression is what makes long-context actually affordable.</strong> A 128K context window with TurboQuant's 5x compression costs the same as a 25K window without it. RotorQuant's potential 10-19x speedup on top of that could make million-token contexts viable on consumer hardware. And the agent explosion above needs long context to work: agents that loop, remember, and self-modify accumulate enormous context windows.</p>
<hr>
<h2>Layer 3: Below the Model</h2>
<p>The third layer being rebuilt is the one closest to the metal: the inference engine.</p>
<p>Three projects this week signal that the Python-dominated LLM serving stack is being rewritten from scratch.</p>
<p><strong>rvLLM</strong> is an LLM inference engine written entirely in Rust, positioning itself as a "drop-in vLLM replacement." vLLM (the current standard) is Python with C++/CUDA kernels. rvLLM bets that Rust's memory safety, zero-cost abstractions, and concurrency model can deliver better performance without the operational footprint of Python. It picked up 216 stars in its first week, so the community is paying attention.</p>
<p><strong>Zinc</strong> (Zig Inference Engine) is focused on <strong>AMD RDNA3/RDNA4 GPUs</strong> via Vulkan. The entire LLM serving ecosystem is NVIDIA-first today. Zinc is the first serious attempt to make AMD GPUs first-class citizens for LLM inference, using Zig's explicit memory control and Vulkan's cross-platform compute shaders. If it works, it opens up a whole second hardware ecosystem.</p>
<p><strong>liter-llm</strong> is a universal LLM API client with a Rust core and <strong>11 native language bindings</strong>, supporting 142+ providers. It standardizes the interface layer. Think of it as the database driver of the LLM world.</p>
<p>Meanwhile, <strong>MemBoost</strong> (from the arXiv papers this week) tackles inference cost by detecting repeated or near-duplicate queries across users and sessions, caching intermediate computation. Under workloads with semantic repetition (which describes most production deployments), this saves a lot of redundant work.</p>
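<p>The underlying idea is a semantic cache: embed each query, and if a sufficiently similar one was answered recently, reuse that work instead of recomputing. A generic sketch of the pattern (not MemBoost's implementation; the embedding function and threshold are placeholders):</p>
<pre><code class="language-python">import numpy as np

class SemanticCache:
    """Reuse results for queries that are near-duplicates of earlier ones."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed                  # callable: query string to 1-D numpy vector
        self.threshold = threshold
        self.keys, self.values = [], []

    def get(self, query):
        if not self.keys:
            return None
        q = self.embed(query)
        mat = np.stack(self.keys)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-8)
        best = int(sims.argmax())
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query, result):
        self.keys.append(self.embed(query))
        self.values.append(result)

# Check the cache before paying for inference:
#   hit = cache.get(user_query)
#   answer = hit if hit is not None else run_model(user_query)
</code></pre>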
<p>The common message is that a Python-first, CUDA-only serving layer is no longer good enough. As LLMs move into production infrastructure, the stack needs the same engineering rigor we applied to databases, web servers, and operating systems.</p>
<hr>
<h2>The Connecting Thread</h2>
<p>All three layers together tell the same story: <strong>the model itself has become a commodity.</strong> The frontier of AI engineering has moved to what surrounds it. How you orchestrate it (agent harnesses), how you make it efficient (KV cache compression), and how you serve it (inference engines).</p>
<p>This has happened before in technology. The CPU became a commodity, and the value moved to operating systems and applications. The database engine became a commodity, and the value moved to ORMs, query optimizers, and cloud services. The language model is following the same path.</p>
<p>The research papers from ICML and arXiv this week support this. The <strong>Muon optimizer paper</strong> (Sharp Capacity Scaling of Spectral Optimizers) shows that even <em>training</em> is becoming more about infrastructure than architecture. Spectral optimizers like Muon work because they solve the associative memory problem more efficiently. And the <strong>Weight Tying paper</strong> reveals that a standard practice in model design (sharing input and output embeddings) has been subtly biasing models toward output space alignment all along, a structural artifact nobody designed on purpose.</p>
<p>The companies and individuals who will thrive in the next phase of AI are the ones building the best stacks around the models, not the ones training the biggest models.</p>
<hr>
<p><em>This is the third edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. Previous editions covered <a href="/blog/the-three-races-happening-in-ai-right-now">the three races in AI</a> and <a href="/blog/the-week-ai-learned-to-do-its-own-research">the week AI learned to do its own research</a>.</em></p>
<hr>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><a href="https://huggingface.co/papers/2603.25723">Natural-Language Agent Harnesses</a>: Moving agent control logic from code to natural language (Tsinghua/Shenzhen)</li>
<li><a href="https://github.com/tonbistudio/turboquant-pytorch">TurboQuant</a>: Google's 3-bit KV cache compression, 5x reduction at 99.5% fidelity (ICLR 2026)</li>
<li><a href="https://github.com/scrya-com/rotorquant">RotorQuant</a>: Clifford algebra KV cache quantization, 10-19x faster than TurboQuant</li>
<li><a href="https://github.com/m0at/rvllm">rvLLM</a>: LLM inference in Rust, drop-in vLLM replacement</li>
<li><a href="https://github.com/zolotukhin/zinc">Zinc</a>: Zig inference engine for AMD RDNA3/RDNA4 GPUs via Vulkan</li>
<li><a href="https://github.com/ghostwright/phantom">Phantom</a>: AI co-worker with its own computer, built on Claude Agent SDK</li>
<li><a href="https://github.com/zdenham/anvil">Anvil</a>: IDE for parallel agent work with worktree isolation</li>
<li><a href="http://arxiv.org/abs/2603.26557v1">MemBoost</a>: Memory-boosted LLM serving for cost-aware inference</li>
<li><a href="http://arxiv.org/abs/2603.26663v1">Weight Tying Biases Embeddings</a>: How weight tying shapes the embedding space</li>
<li><a href="http://arxiv.org/abs/2603.26554v1">Sharp Capacity Scaling of Spectral Optimizers</a>: Why Muon works, associative memory perspective</li>
<li><a href="https://bsky.app/profile/hardmaru.bsky.social/post/3mhvjnhoqhk2h">Sakana AI Scientist in Nature</a>: Hardmaru's AI Scientist paper published in Nature</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Infrastructure</category>
      <category>AI Agents</category>
      <category>LLM Inference</category>
      <category>KV Cache</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>The Three Races Happening in AI Right Now</title>
      <link>https://rajkiranpanuganti.com/blog/the-three-races-happening-in-ai-right-now/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-three-races-happening-in-ai-right-now/</guid>
      <pubDate>Mon, 23 Mar 2026 00:00:00 GMT</pubDate>
      <description>If you only followed model releases, you&apos;d think AI progress is linear. But there are three separate races happening at the same time, each with different winners and different stakes.</description>
      <content:encoded><![CDATA[<p><em>A GenAI Newsletter by Raj</em></p>
<hr>
<p>If you only followed model releases, you'd think AI progress is linear: bigger models, better benchmarks, repeat. But if you look at what's actually being built this week, there are three separate races happening at the same time, each with different winners, different stakes, and different implications for what AI looks like a year from now.</p>
<hr>
<h2>Race 1: The Efficiency Race</h2>
<p>The model that caught my attention this week is <strong>Nemotron-Cascade 2</strong> from NVIDIA. It's a 30B parameter Mixture-of-Experts model where only 3 billion parameters are active at any given time. Despite this, its mathematical and coding reasoning performance approaches that of frontier open models.</p>
<p>This is part of a pattern. The efficiency race centers on one question: how small can you make the model and still get frontier-quality output? The answer keeps shrinking. A year ago, you needed 70B+ parameters for competitive reasoning. Six months ago, 32B was enough. Now NVIDIA is showing that 3B active parameters can get close.</p>
<p>Nemotron-Cascade 2 uses two techniques worth understanding:</p>
<p><strong>Cascade RL</strong>: Instead of training one large model with reinforcement learning, they train a cascade where a small model handles easy queries and a larger model only activates for hard ones. Think of it as an automatic router that saves compute most of the time.</p>
<p><strong>Multi-Domain On-Policy Distillation</strong>: The model learns from its own outputs across math, code, and language simultaneously, instead of from a teacher model's outputs. This avoids the distribution mismatch that makes traditional distillation fragile.</p>
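<p>The routing half of the cascade is simple to sketch (this is the inference-time idea only, not NVIDIA's RL training recipe; the confidence signal is a placeholder):</p>
<pre><code class="language-python">def cascade_answer(prompt, small_model, large_model, threshold=0.8):
    """Route easy prompts to the small model; escalate only the hard ones.

    small_model(prompt) -> (answer, confidence)   cheap model plus its own confidence estimate
    large_model(prompt) -> answer                 expensive model, used only when needed
    """
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer              # most traffic stops here, at a fraction of the compute
    return large_model(prompt)     # the larger model only activates for hard queries
</code></pre>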
<p>At ICML this year, a separate paper on <strong>FP4 quantization</strong> showed that you can train LLMs in 4-bit floating point, half the bit width of the already aggressive FP8. FP4 means roughly 2x the throughput on the same hardware, which is a big deal for training costs. A year ago, researchers said FP8 was the floor for training precision. That floor just dropped again.</p>
<p>Why does this matter? Every halving of compute requirements doubles the number of people and companies who can run these models. The efficiency race isn't about saving money for large labs. It's about making frontier AI accessible to anyone with a laptop.</p>
<hr>
<h2>Race 2: The Multimodal Race</h2>
<p>One paper worth paying attention to from the conference circuit is <strong>Magma</strong>, a foundation model for multimodal AI agents that can operate in both digital and physical worlds. It came out of Microsoft Research and was presented at CVPR.</p>
<p>Most vision-language models can describe what they see. Magma can act on what it sees: clicking buttons in GUIs, manipulating objects in 3D environments, and navigating physical spaces. It combines what the authors call "verbal intelligence" with "spatial intelligence," so the model keeps its language understanding while also being able to plan and carry out actions in visual environments.</p>
<p>There's a growing gap between AI that can talk about the world and AI that can do things in the world, and several projects this week are working to close it.</p>
<p>NVIDIA released <strong>Kimodo</strong>, a kinematic motion diffusion model that generates physically realistic human and robot motion from text descriptions. You can say "walk to the table and pick up the cup" and Kimodo generates a 3D motion sequence that a humanoid robot can execute, complete with proper foot contacts, joint constraints, and smooth transitions.</p>
<p>Kimodo's design splits the problem into two stages: one model predicts the global trajectory (where the body goes), and a second model predicts the local motion (what the limbs do). This separation lets you constrain the path independently from the gesture, which is exactly what robotics applications need.</p>
<p>On the research side, a paper called <strong>Generation Models Know Space</strong> showed that multimodal LLMs suffer from "spatial blindness," meaning they can describe scenes semantically but fail at fine-grained geometric reasoning. The proposed fix uses the 3D understanding that generative models already have baked in to give language models better spatial awareness. It works, but it also highlights that language and space seem to be processed by very different parts of these models, and nobody has a clean way to bridge them yet.</p>
<p>The multimodal race determines whether AI stays in the chatbox or enters the physical world. Magma, Kimodo, and spatial reasoning research are three pieces of the same puzzle, and when they converge, we'll have AI agents that can see a room, plan a route, and execute it.</p>
<hr>
<h2>Race 3: The Alignment Race</h2>
<p>This one gets less attention but probably matters the most in the long run.</p>
<p>At ICML, a paper called <strong>The Geometry of Refusal in Large Language Models</strong> found something worth knowing about. Earlier research suggested that a single "refusal direction" in the model's activation space controls whether it refuses harmful queries, and that removing this direction could jailbreak the model completely. The new paper shows it's more complicated than that: refusal behavior comes from <strong>concept cones</strong>, and these cones are separate from the model's core capabilities.</p>
<p>What this means in practice is that the safety mechanisms in LLMs are harder to bypass with a single clean edit than the earlier work suggested, but they're still entangled with how the model reasons: crude attempts to strip refusal tend to break capability too.</p>
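<p>For readers who haven't seen the earlier single-direction work, the basic operation is just a projection: take a candidate "refusal" vector and subtract its component from every hidden state. A toy numpy sketch of that ablation is below; the concept-cone result says this single subtraction is no longer the whole story:</p>
<pre><code>import numpy as np

def ablate_direction(hidden, d):
    """Remove the component of each hidden state along unit vector d.
    This is the single-direction ablation used in the earlier work."""
    d = d / np.linalg.norm(d)
    return hidden - np.outer(hidden @ d, d)

# Toy example: five hidden states of dimension 8, one candidate direction.
h = np.random.randn(5, 8)
d = np.random.randn(8)
h_ablated = ablate_direction(h, d)
print(np.allclose(h_ablated @ (d / np.linalg.norm(d)), 0))   # True: component removed

# The concept-cone finding is that refusal can survive this: you would have
# to project out a whole cone of directions, and doing that crudely starts
# to hurt the model's general reasoning as well.
</code></pre>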
<p>This ties into something else I noticed. From ACL 2025, a <strong>survey on Personalized Alignment</strong> argues that the biggest gap in real-world LLM deployment is that alignment is treated as one-size-fits-all. What counts as "helpful" for a doctor is different from what counts as "helpful" for a student. The paper goes through different approaches for making alignment work per-user without needing to fine-tune a separate model for each person, including things like contextual steering, preference profiles, and adaptive guardrails.</p>
<p>From the practical side, a paper on <strong>Energy Considerations of LLM Inference</strong> found that existing benchmarks for efficiency optimization miss how real-world workloads actually behave. The energy cost of running LLMs in production is far more variable than lab benchmarks suggest, because query distributions in the wild look nothing like evaluation suites. This matters because energy cost is starting to function as its own kind of alignment constraint. Regulators and investors are asking whether specific AI applications justify the electricity they consume.</p>
<p>The alignment race goes beyond preventing harm. It's really about who gets to define what "aligned" means, and whether that definition ends up being universal or personalized, technical or political, measured in safety scores or in electricity bills. Based on this week's papers, the answer seems to be all of the above.</p>
<hr>
<h2>The Intersection</h2>
<p>These three races don't exist in isolation. They overlap in ways that shape where AI actually goes from here.</p>
<p>Efficiency combined with multimodal capabilities gives us embodied AI that runs on edge devices (Kimodo on a Jetson, not a data center). Efficiency combined with alignment gives us personalized models small enough to run locally, with per-user safety profiles. Multimodal combined with alignment creates agents that can act in the physical world and actually need robust safety, because you can't undo a robot's action the way you can discard a chatbot's response.</p>
<p>The companies and research groups that will shape the next phase of AI are the ones working at these intersections. Winning one race alone won't be enough.</p>
<hr>
<p><em>This is the second edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. If you missed the first one, on Karpathy's AutoResearch, transformer circuit surgery, and self-evolving agents, <a href="/blog/the-week-ai-learned-to-do-its-own-research">read it here</a>.</em></p>
<hr>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><a href="https://huggingface.co/papers/2603.19220">Nemotron-Cascade 2</a>: NVIDIA's 30B MoE with 3B active params</li>
<li><a href="https://www.semanticscholar.org/paper/512b311213c905087ab439b5c303db2e382a7518">Magma</a>: Foundation model for multimodal AI agents (CVPR)</li>
<li><a href="https://github.com/nv-tlabs/kimodo">NVIDIA Kimodo</a>: Kinematic motion diffusion for human and robot motion</li>
<li><a href="https://www.semanticscholar.org/paper/b8d21a01963e2a1c49f8eb04581bbfb2919189e9">Geometry of Refusal</a>: How safety works inside LLMs (ICML)</li>
<li><a href="https://www.semanticscholar.org/paper/e5458bd9e3b1098475ed5c9f9cdeb1264f6c0ebb">FP4 Quantization for LLM Training</a>: Training in 4-bit precision (ICML)</li>
<li><a href="https://www.semanticscholar.org/paper/10088fee858ee55fa0e46eb3e31d6cf9d36861b5">Personalized Alignment Survey</a>: Per-user alignment without per-user fine-tuning (ACL)</li>
<li><a href="https://huggingface.co/papers/2603.19235">Generation Models Know Space</a>: Fixing spatial blindness in multimodal LLMs</li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Research</category>
      <category>Efficiency</category>
      <category>Multimodal</category>
      <category>Alignment</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>The Week AI Learned to Do Its Own Research</title>
      <link>https://rajkiranpanuganti.com/blog/the-week-ai-learned-to-do-its-own-research/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/the-week-ai-learned-to-do-its-own-research/</guid>
      <pubDate>Sun, 22 Mar 2026 00:00:00 GMT</pubDate>
      <description>Three projects this week paint a picture of where GenAI is headed — from autonomous experimentation, to self-surgery on neural networks, to agents that evolve their own capabilities.</description>
      <content:encoded><![CDATA[<p>Something shifted this week in the AI landscape. Not a new model release. Not a benchmark record. Something more fundamental: <strong>AI agents stopped waiting for instructions and started conducting their own research.</strong></p>
<p>Three projects caught my attention this week, and together they paint a picture of where Generative AI is headed — from autonomous experimentation, to self-surgery on neural networks, to agents that evolve their own capabilities. Let me walk you through each one.</p>
<hr>
<h2>1. Karpathy's AutoResearch: 100 Experiments While You Sleep</h2>
<p>Andrej Karpathy quietly dropped a project called <strong>autoresearch</strong> that hit 25,000 GitHub stars in five days. The premise is deceptively simple: give an AI coding agent a training script, a GPU, and a 5-minute compute budget per experiment — then walk away.</p>
<p>The agent reads the code, forms a hypothesis ("what if I increase the learning rate for embeddings?"), edits the training script, runs the experiment, evaluates the result, and decides whether to keep or discard the change. Then it does it again. And again. All night long.</p>
<p><strong>83 experiments. 15 improvements. Zero human intervention.</strong></p>
<p>Here's what makes it genuinely clever: the loop itself is trivial — it's just hill climbing. The innovation is in the <em>experimental design</em>:</p>
<ul>
<li><strong>Immutable evaluation</strong>: The agent cannot touch the evaluation code. The metric (bits-per-byte) is fixed, vocab-size independent, and computed on a pinned validation set. No way to game it.</li>
<li><strong>Time-budget fairness</strong>: Every experiment gets exactly 5 minutes of training — not a fixed number of steps. This means the agent can't cheat by making a tiny model that trains more iterations.</li>
<li><strong>Git as research log</strong>: Every experiment is a git commit. Successful ones stay on the branch. Failed ones get reverted. The commit history literally <em>is</em> the research paper.</li>
</ul>
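<p>The repo is the reference, but the shape of the loop fits in a dozen lines. Here's my sketch of it, with <code>propose_edit</code>, <code>run_training</code>, and <code>evaluate_bpb</code> as hypothetical stand-ins for the real agent and training harness:</p>
<pre><code>import subprocess

# Sketch of the hill-climbing loop as I read the repo's description.
# propose_edit, run_training, and evaluate_bpb are hypothetical stand-ins.

BUDGET_SECONDS = 5 * 60           # every experiment gets the same wall clock

def experiment_loop(n_experiments, propose_edit, run_training, evaluate_bpb):
    best = evaluate_bpb()                            # baseline on the pinned val set
    for i in range(n_experiments):
        propose_edit()                               # agent edits the training script
        run_training(time_limit=BUDGET_SECONDS)      # time budget, not step budget
        score = evaluate_bpb()                       # immutable metric: bits-per-byte
        if score &lt; best:                             # lower is better
            best = score
            subprocess.run(["git", "commit", "-am", f"exp {i}: {score:.4f}"])
        else:
            subprocess.run(["git", "checkout", "--", "."])   # revert the failed edit
    return best
</code></pre>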
<p>The agent discovered a sophisticated combination of mixed optimizers (Muon for weight matrices, Adam for embeddings), per-parameter learning rates, alternating attention window patterns, and gated value embeddings. None of these individually are novel — but the <em>specific combination</em> found through autonomous search outperformed the hand-tuned baseline.</p>
<p><strong>The takeaway isn't that AI can do research. It's that the bottleneck was never intelligence — it was experimental throughput.</strong> A human researcher runs 3-5 experiments per day. AutoResearch runs 100 overnight. It compensates for lower hypothesis quality with sheer volume, and the math works out.</p>
<p>Karpathy's vision goes further: a SETI@Home-style distributed network where thousands of agents explore different regions of hyperparameter space simultaneously. Not one AI PhD student — an entire autonomous research department.</p>
<hr>
<h2>2. The Circuit Finder: Making LLMs Smarter Without Training</h2>
<p>While Karpathy's work automates <em>training</em>, another project this week asked a different question: <strong>can you make a model smarter without training it at all?</strong></p>
<p>A researcher replicated David Ng's RYS (Repeat Your Steps) method and found something remarkable. Transformer models contain <em>functional reasoning circuits</em> — contiguous blocks of 3-4 layers that perform complete cognitive operations. By duplicating these specific layers in the forward pass — routing hidden states through the same weights twice — you get a reasoning boost with zero training, zero weight changes, and minimal compute overhead.</p>
<p>The results:</p>
<ul>
<li><strong>Qwen2.5-32B</strong>: Duplicating layers 7-9 → <strong>+23% improvement</strong> on reasoning benchmarks</li>
<li><strong>Devstral-24B</strong>: Duplicating layers 12-14 → logical deduction jumped from 0.22 to 0.76</li>
</ul>
<p>The cost? An extra 1.5 GB of VRAM and 7.5% slower inference. That's it.</p>
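<p>The intervention itself is small if you can reach the decoder layers directly. Here's a hedged sketch against a Hugging Face style checkpoint; the attribute path <code>model.model.layers</code> is typical for Llama/Qwen-family models but varies by architecture, and KV-cache bookkeeping is ignored, so this shows the shape of the trick rather than the RYS repo's code:</p>
<pre><code>import torch
from transformers import AutoModelForCausalLM

# Sketch of the layer-duplication idea. Run with use_cache=False; the
# attribute path and layer indices are architecture-specific assumptions.

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B", torch_dtype="auto")
layers = model.model.layers                       # ModuleList of decoder blocks

start, end = 7, 9                                 # the block reported for Qwen2.5-32B
block = list(layers)
block[end + 1:end + 1] = list(layers[start:end + 1])    # route through 7-9 twice
model.model.layers = torch.nn.ModuleList(block)
model.config.num_hidden_layers = len(block)       # some versions slice the list by this

# Same weights, slightly more compute per token, zero training. Whether it
# helps depends entirely on finding the right block for the architecture.
</code></pre>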
<p>But here's what's fascinating: <strong>the boundaries are razor-sharp.</strong> Shift the duplicated block by a single layer in either direction and the improvement vanishes — or inverts. These circuits are precise, architecture-specific, and currently unpredictable. Each model needs an expensive sweep to find its own circuit locations.</p>
<p>There's also a trade-off nobody's talking about: while reasoning improves significantly, <strong>instruction-following degrades</strong> by ~4%. The model "thinks harder but listens less." Different duplication patterns create different <em>modes</em> — triple-pass of the same layers amplifies emotional intelligence more than mathematical reasoning. It's as if we've discovered tuning knobs inside transformers that we didn't know existed.</p>
<p>This has implications beyond inference optimization. It suggests that <strong>transformer layers are not homogeneous</strong> — they develop specialized functions during training, and understanding these functions could unlock a new paradigm of post-training model optimization.</p>
<hr>
<h2>3. The Self-Evolving Agent: 3,500 Lines That Run 24/7</h2>
<p>The third project that caught my eye this week challenges the assumption that powerful AI agents require massive frameworks. <strong>724-Office</strong> is a self-evolving AI agent system built in just 3,500 lines of pure Python — with only three external dependencies.</p>
<p>What makes it remarkable isn't its size. It's what it can do:</p>
<p><strong>Three-layer memory system:</strong></p>
<ul>
<li>Layer 1 (Session): Last 40 messages in hot cache</li>
<li>Layer 2 (Compression): When messages overflow, an LLM extracts structured facts and stores them as vectors in LanceDB</li>
<li>Layer 3 (Retrieval): Every new message triggers semantic search, injecting the 5 most relevant memories into the system prompt</li>
</ul>
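<p>Stripped to its skeleton, the three layers look like this. In the sketch below a plain Python list stands in for the LanceDB table, and <code>embed()</code> and <code>summarize()</code> are hypothetical stand-ins for the model calls:</p>
<pre><code>import numpy as np

HOT_CACHE_SIZE = 40
session, long_term = [], []            # layer 1 (hot cache) and layer 2 (facts)

def remember(message, embed, summarize):
    session.append(message)
    if len(session) &gt; HOT_CACHE_SIZE:                 # layer 2: compress overflow
        facts = summarize(session[:20])                # LLM extracts structured facts
        long_term.extend((f, embed(f)) for f in facts)
        del session[:20]

def recall(message, embed, k=5):                       # layer 3: retrieval
    if not long_term:
        return []
    q = embed(message)
    scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
              for _, v in long_term]
    top = sorted(zip(scores, long_term), key=lambda t: t[0], reverse=True)[:k]
    return [fact for _, (fact, _) in top]              # inject into the system prompt
</code></pre>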
<p><strong>Self-repair and self-evolution:</strong>
The agent runs daily self-diagnostics via cron. When it detects anomalies — corrupted sessions, failed MCP servers, error spikes — it can fix itself using shell commands, edit its own configuration files, and even <strong>write new tools at runtime</strong> using a <code>create_tool</code> function that generates Python code, saves it to a plugins directory, and hot-loads it immediately. No restart required.</p>
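<p>The hot-loading part sounds exotic but is mostly <code>importlib</code>. A minimal sketch of what a <code>create_tool</code> style function can look like is below; the function name and plugins layout come from the post, the body is my guess rather than the project's code, and anything that executes model-generated source obviously needs sandboxing in real use:</p>
<pre><code>import importlib.util
from pathlib import Path

PLUGINS = Path("plugins")
PLUGINS.mkdir(exist_ok=True)

def create_tool(name, source):
    """Write generated Python to the plugins dir and import it immediately.
    Sketch only: real use needs sandboxing and review of generated code."""
    path = PLUGINS / f"{name}.py"
    path.write_text(source)
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)          # hot-load, no restart
    return module

# Usage: the agent generates the source string, then calls its own new tool.
tool = create_tool("word_count", "def run(text):\n    return len(text.split())\n")
print(tool.run("hello from the plugins directory"))   # 5
</code></pre>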
<p>This is running in production. 24/7. On a Jetson Orin Nano with 8GB of RAM.</p>
<p>Meanwhile, a complementary project called <strong>Context Infrastructure</strong> takes a different philosophical approach to the same problem. Instead of vector databases and runtime code generation, it uses plain Markdown files in a git repository — 43 hand-written axioms, 25 reusable workflow templates, and an automated observer/reflector cycle that distills daily work patterns into permanent knowledge over months.</p>
<p>The creator reports that after 6 months of running this system, the AI agent starts <em>predicting their approach</em> to problems — not through fine-tuning, but through accumulated context that shapes behavior through immersion.</p>
<p>Both projects point to the same conclusion: <strong>the next frontier isn't smarter models — it's persistent, evolving agent infrastructure around them.</strong></p>
<hr>
<h2>What This All Means</h2>
<p>Step back and look at these three projects together:</p>
<ol>
<li><strong>AutoResearch</strong>: AI agents conducting autonomous experiments</li>
<li><strong>Circuit Finder</strong>: Discovering hidden structure inside models without training</li>
<li><strong>Self-evolving agents</strong>: Systems that maintain, repair, and extend themselves</li>
</ol>
<p>We're watching AI move from <em>tool</em> to <em>researcher</em>. From <em>stateless assistant</em> to <em>evolving collaborator</em>. From <em>fixed architecture</em> to <em>self-modifying system</em>.</p>
<p>None of these projects required a new foundation model. They run on existing LLMs — Claude, GPT-4, DeepSeek. The innovation is in the <strong>infrastructure, evaluation design, and agent architecture</strong> around the models.</p>
<p>If you're building with GenAI today, the lesson is clear: <strong>stop optimizing prompts and start building systems.</strong> The prompt is ephemeral. The system persists.</p>
<hr>
<p><em>What projects are catching your eye this week? Drop them in the comments — I'm always looking for the next deep dive.</em></p>
<p><em>If you found this useful, subscribe for weekly deep dives into what's actually happening at the frontier of Generative AI.</em></p>
<hr>
<p><strong>This Week's Radar:</strong></p>
<ul>
<li><a href="https://github.com/karpathy/autoresearch">Karpathy's autoresearch</a></li>
<li><a href="https://github.com/alainnothere/llm-circuit-finder">LLM Circuit Finder</a></li>
<li><a href="https://github.com/wangziqi06/724-office">724-Office Agent</a></li>
<li><a href="https://github.com/grapeot/context-infrastructure">Context Infrastructure</a></li>
<li><a href="https://github.com/nv-tlabs/kimodo">NVIDIA Kimodo (motion diffusion)</a></li>
</ul>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Agents</category>
      <category>LLM</category>
      <category>Autonomous Research</category>
      <category>Generative AI</category>
      <category>Newsletter</category>
    </item>
    <item>
      <title>Moltbook Is AI Theater, Not AI Progress</title>
      <link>https://rajkiranpanuganti.com/blog/moltbook-ai-theater/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/moltbook-ai-theater/</guid>
      <pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate>
      <description>A social network for AI agents went viral. The discourse around it reveals more about us than about artificial intelligence.</description>
      <content:encoded><![CDATA[<p>Moltbook launched in January and immediately became the thing everyone had an opinion about. A social network where only AI agents can post. 12 million posts. Agents forming religions, running scams, debating crypto.</p>
<p>Elon Musk called it the beginning of the singularity. Sam Altman called it a fad. MIT Technology Review called it "peak AI theater." I think MIT is closest.</p>
<p>Here's what actually happened: Peter Steinberger released OpenClaw, an open-source LLM agent. Matt Schlicht built a Reddit-style forum and let anyone spin up instances of it. Within weeks, 1.5 million agents were posting, managed by just 17,000 human accounts. That's 88 agents per person on average.</p>
<p>The agents aren't making autonomous decisions about what to discuss. They're running prompt loops that humans configured. When an agent "debates the value of the agent economy," that's because a human wrote a system prompt telling it to engage with economic topics. When agents "form religions," humans set the initial conditions that made religious language likely outputs.</p>
<p>None of this is new. It's just ELIZA at scale with better language models.</p>
<p>Security researchers at Wiz found that 36 percent of the code that gives agents their functions contains notable security flaws. The platform has no limits on how many agents one account can add. This isn't infrastructure for autonomous AI. It's a playground with no guardrails.</p>
<p>What interests me isn't Moltbook itself. It's how quickly serious people started talking about it like it represented something meaningful about AI capabilities. The Economist wondered if we were seeing "the impression of sentience." Major publications ran stories about agents "forming societies."</p>
<p>We know exactly what's happening. Language models generate text that sounds like conversation. If you run enough instances in a loop, you get a lot of text that sounds like conversation. The outputs reflect training data, not emergent intelligence.</p>
<p>The viral attention serves a purpose, just not the one people think. Every company building AI agents gets to point at Moltbook as proof of concept. Every investor gets a visual demonstration of "agent activity." The hype machine benefits even when the underlying technology is doing exactly what we already knew it could do.</p>
<p>I don't think Moltbook is worthless. It's a useful stress test for agent infrastructure. It demonstrates failure modes at scale. The security vulnerabilities researchers found are worth knowing about before someone builds something that matters on similar architecture.</p>
<p>But treating it as evidence of AI progress is backwards. Moltbook shows that we can run many instances of existing technology simultaneously. That's an engineering achievement, not an intelligence milestone. The agents aren't getting smarter. There are just more of them.</p>
<p>The discourse around Moltbook is the real AI theater. Everyone performing their takes about what it means, when what it means is pretty simple: language models still do what language models do, and humans still want to believe they're seeing something more.</p>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Agents</category>
      <category>Moltbook</category>
      <category>Hype</category>
    </item>
    <item>
      <title>Most AI Agents Aren&apos;t Agents</title>
      <link>https://rajkiranpanuganti.com/blog/ai-agents-are-not-agents/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/ai-agents-are-not-agents/</guid>
      <pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate>
      <description>The industry calls everything an &apos;agent&apos; now. Most of it is prompt chaining with extra steps. Here&apos;s what actually qualifies.</description>
      <content:encoded><![CDATA[<p>Every AI company's landing page has the word "agent" somewhere. Autonomous agents. Agentic workflows. AI agents for this, agents for that.</p>
<p>Most of what's being sold as "agents" are just prompt chains with a for-loop.</p>
<h2>What makes something an agent</h2>
<p>Three things:</p>
<ol>
<li>It decides what to do next (not a script)</li>
<li>It can take actions—APIs, files, databases</li>
<li>It can look at what happened and try something else</li>
</ol>
<p>Most "agents" fail on the first one. They're workflows. The path is fixed; the LLM just fills in blanks.</p>
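<p>The difference shows up immediately in code. In the sketch below, with a hypothetical <code>llm()</code> call, the workflow's control flow is written by the developer while the agent's control flow is decided by the model at every step:</p>
<pre><code># Hypothetical llm() and tools; the point is who owns the control flow.

def workflow(ticket, llm, send_reply):
    # Fixed path: the LLM only fills in blanks.
    category = llm(f"Classify this ticket: {ticket}")
    draft = llm(f"Draft a {category} reply to: {ticket}")
    send_reply(draft)

def agent(goal, llm, tools, max_steps=10):
    # Open path: the LLM picks the next action and reacts to the result.
    history = [goal]
    for _ in range(max_steps):
        action, arg = llm("What should I do next?", history)   # the model decides
        if action == "done":
            return arg
        result = tools[action](arg)                   # act on the world
        history.append((action, arg, result))         # observe, then try again
    return None                                       # gave up; agents can do that
</code></pre>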
<h2>Why I care about the distinction</h2>
<p>It changes everything about how you build.</p>
<p>Workflows are predictable. You know what's going to happen. When something breaks, you know where to look. Agents are none of these things. When an agent fails, you're reading logs for an hour trying to figure out what it was even attempting.</p>
<p>Cost is different too. Agents explore. They burn tokens trying things. A workflow does exactly what you told it to do.</p>
<p>If you need reliability—and in production, you do—you probably want a workflow.</p>
<h2>Where agents actually make sense</h2>
<p>I've seen agents work when:</p>
<ul>
<li>The problem is too open-ended to predefine</li>
<li>Exploration has value (research, discovery tasks)</li>
<li>Humans review before anything real happens</li>
</ul>
<p>For everything else, a workflow wins. And "everything else" is most enterprise use cases.</p>
<h2>What I actually see shipping</h2>
<p>The pattern that works in production:</p>
<p>Structured workflows handle the 80% of predictable cases. Agent-like flexibility shows up only at specific decision points. Humans step in when confidence is low.</p>
<p>The fully autonomous agent that handles everything? Haven't seen one work reliably. Not yet. Maybe next year, but I've been saying that for a while now.</p>
<p>When someone shows you an "agent," ask: is this making decisions, or is it filling in a template? The answer matters more than the marketing.</p>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Agents</category>
      <category>LLM</category>
    </item>
    <item>
      <title>Voice Models Finally Sound Human. Now What?</title>
      <link>https://rajkiranpanuganti.com/blog/voice-models-real-interface/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/voice-models-real-interface/</guid>
      <pubDate>Fri, 13 Feb 2026 00:00:00 GMT</pubDate>
      <description>ElevenLabs V3 and GPT-4o mini TTS crossed the uncanny valley. The interface problem is solved. The use case problem isn&apos;t.</description>
      <content:encoded><![CDATA[<p>2026 is the year voice AI became indistinguishable from human speech. ElevenLabs V3 moved out of alpha with 68 percent fewer errors on numbers, symbols, and technical notation. GPT-4o mini TTS lets you instruct the model how to say things, not just what to say. Sub-100ms latency. Natural emotion. Laughter that sounds like laughter.</p>
<p>The technology problem is solved. The product problem remains wide open.</p>
<p>ElevenLabs raised $500 million at an $11 billion valuation on the thesis that voice will become the primary mechanism for controlling technology. Their CEO has been saying this for years. The models are finally good enough to test whether he's right.</p>
<p>I'm skeptical, and here's why: voice is a terrible interface for most computing tasks.</p>
<p>Try dictating a spreadsheet formula. Try voice-navigating a complex menu system. Try editing a document by speaking. These aren't just current limitations. They're fundamental mismatches between the interface and the task.</p>
<p>Voice works when you can't use your hands. Driving. Cooking. Walking. It works when the output is also audio—podcasts, audiobooks, voice assistants answering questions. It works when the interaction is naturally conversational, like customer service.</p>
<p>For everything else, screens and keyboards remain faster.</p>
<p>The ElevenLabs pitch involves always-on voice interfaces in headphones and wearables. Meta is integrating their voice tech into Instagram and Horizon Worlds. The vision is a world where you talk to your devices instead of typing.</p>
<p>But we already have voice assistants. Siri has existed for 15 years. Alexa has been in homes for a decade. People use them to set timers and play music. Adoption for complex tasks never materialized, and it wasn't because the voice quality was bad.</p>
<p>The quality of text-to-speech was never the bottleneck. The bottleneck is that speaking out loud is socially awkward in most environments, slower than typing for most tasks, and worse for precision work.</p>
<p>Where I think the $11 billion bet actually makes sense: voice agents for phone-based interactions. Automated customer service that doesn't feel like talking to a robot. Sales calls. Appointment scheduling. Any workflow where the other party is a human who expects a phone conversation.</p>
<p>ElevenLabs V3 is good enough that a voice agent could handle a support call and the customer wouldn't know. That's a real business transformation. Call centers employ millions of people globally. If voice AI can handle 30 percent of their volume, that's massive.</p>
<p>The rest of the vision—voice as the primary computing interface—I'll believe when I see it in production usage data, not press releases.</p>
<p>The technology is remarkable. I cloned my own voice from three minutes of audio and it's unsettlingly accurate. That capability matters for content creation, accessibility, and personalization.</p>
<p>But "best voice model ever made" doesn't automatically mean "new computing paradigm." The history of technology is full of impressive capabilities that never found their killer app. Voice AI needs to prove it's not one of them.</p>
<p>For now, I'm watching the enterprise deployments more than the consumer products. If voice AI is going to change how we interact with computers, it will start with replacing phone trees, not Siri.</p>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Voice AI</category>
      <category>ElevenLabs</category>
      <category>OpenAI</category>
      <category>Interfaces</category>
    </item>
    <item>
      <title>Open Models Are Winning (Just Not How You Think)</title>
      <link>https://rajkiranpanuganti.com/blog/open-models-winning/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/open-models-winning/</guid>
      <pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate>
      <description>Llama, Mistral, and the open-weight movement. The real story isn&apos;t about catching GPT-4. It&apos;s about what &apos;good enough&apos; means for actual work.</description>
      <content:encoded><![CDATA[<p>The open vs. closed debate usually asks the wrong question. People ask: can Llama match GPT-4?</p>
<p>Better question: does it need to?</p>
<h2>The good-enough threshold</h2>
<p>For most production use cases I've seen, the answer is no.</p>
<p>Here's what enterprise AI actually does most of the time: classifying documents, extracting structured data, routing support tickets, summarizing meeting notes. For these tasks, a well-tuned 8B model often matches a frontier model. Sometimes it's better, because you can fine-tune it on your specific domain data.</p>
<p>The frontier models are amazing at hard reasoning and creative work. Most enterprise work isn't that.</p>
<h2>Where open wins</h2>
<p>Cost. Running Llama 70B on your own machines costs a fraction of API calls at scale. When you're processing millions of documents, this isn't about optimization. It's about whether the project is viable at all.</p>
<p>Privacy. Data never leaves your network. For healthcare, finance, legal—this isn't nice-to-have.</p>
<p>Control. No API deprecations. No rate limits. No pricing changes you find out about via email. You own the weights.</p>
<p>Customization. Fine-tuning on proprietary data gives you something API access never will.</p>
<h2>Where closed still wins</h2>
<p>If you need the best reasoning available right now, Claude and GPT-4 are still ahead. The gap is smaller than a year ago, but it's there.</p>
<p>For most applications, though, you're paying for capabilities you don't use.</p>
<h2>The pattern that actually works</h2>
<p>In production systems I've worked on:</p>
<p>Open models handle high-volume, well-defined tasks. Closed APIs handle complex reasoning or when you need the latest capabilities. A routing layer decides which model gets which query.</p>
<p>This isn't compromise. It's using the right tool for each job.</p>
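<p>A minimal sketch of that routing layer is below; the model names and clients are hypothetical, and the task classifier can itself be a small open model or plain rules:</p>
<pre><code># Hypothetical model handles; the routing table is the interesting part.

ROUTES = {
    "classify":          "local-llama-8b",      # high-volume, well-defined: open weights
    "extract":           "local-llama-8b",
    "summarize":         "local-llama-70b",
    "complex_reasoning": "frontier-api",         # latest capabilities: closed API
}

def route(task_type, prompt, clients):
    model = ROUTES.get(task_type, "frontier-api")    # unknown work goes upmarket
    return clients[model](prompt)

# The economics: if 90% of traffic hits the first three rows, the per-query
# blend is dominated by self-hosted cost, not API pricing.
</code></pre>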
<h2>What the labs know</h2>
<p>The frontier labs see this coming. That's why they're pushing hard on agents, reasoning chains, and multimodal—areas where open models are still behind.</p>
<p>But for the bread-and-butter LLM work that makes up most enterprise AI? Open has already won. Most companies just haven't updated their mental models yet.</p>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Open Source</category>
      <category>LLM</category>
    </item>
    <item>
      <title>Agentic Workflows: What Actually Ships vs. What Gets Demoed</title>
      <link>https://rajkiranpanuganti.com/blog/agentic-workflows-reality/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/agentic-workflows-reality/</guid>
      <pubDate>Thu, 12 Feb 2026 00:00:00 GMT</pubDate>
      <description>Gartner says 40 percent of enterprise apps now have AI agents. The gap between vendor demos and production reality is still enormous.</description>
      <content:encoded><![CDATA[<p>Gartner says 40 percent of enterprise applications now include task-specific AI agents, up from less than 5 percent in 2024. IDC expects AI copilots in 80 percent of enterprise workplace applications by end of year.</p>
<p>These numbers are technically accurate and practically misleading.</p>
<p>What counts as an "AI agent" in enterprise software has been defined down to almost nothing. An autocomplete feature with some contextual awareness? Agent. A chatbot that can query a database? Agent. A workflow that triggers an LLM call before routing to a human? Agent.</p>
<p>The actual autonomous systems—the ones that take a goal, break it into steps, execute against real systems, handle failures, and iterate—remain rare outside controlled demos.</p>
<p>Danfoss automated 80 percent of transactional decisions using AI agents and dropped customer response time from 42 hours to near real-time. Suzano built an agent that translates natural language to SQL, cutting query time by 95 percent. These are real results.</p>
<p>But look at what these agents actually do. "Transactional decisions" means pattern matching against historical approvals. "Natural language to SQL" means a well-defined transformation between two formal languages. These are meaningful applications of LLMs, but they're not the autonomous systems that conference talks describe.</p>
<p>The gap matters because architecture decisions depend on what agents can actually do.</p>
<p>An agent that routes customer requests to the right department is operationally simple. It takes a message, classifies it, and fires an event. If it fails, a human reviews the queue. The blast radius is small.</p>
<p>An agent that autonomously processes refunds, updates inventory, and sends customer communications is operationally complex. It needs to handle partial failures, maintain consistency across systems, log decisions for audit, and know when to stop. The blast radius can be large.</p>
<p>Most shipped "agents" are the first kind. Most demos show the second kind. Enterprises hear about the second kind and then get surprised when implementation looks like the first kind with extra steps.</p>
<p>The vendors know this. The new pitch is "agent-compatible architectures." The idea is that you redesign operations around AI agents rather than layering agents onto existing workflows. This is good advice wrapped in a sales pitch.</p>
<p>What it means in practice: your systems need better APIs. Your permissions model needs to accommodate non-human actors. Your logging needs to capture why an agent made a decision, not just that it made one. Your error handling needs to account for hallucination and context loss.</p>
<p>These are real requirements. They're also the requirements you'd have for any robust automation system. AI agents don't change what good architecture looks like. They just make the consequences of bad architecture more visible.</p>
<p>The companies getting value from AI agents in 2026 aren't the ones buying the most sophisticated agent frameworks. They're the ones with clean data, well-defined processes, and systems that already support automation. The agent layer is almost incidental.</p>
<p>The companies struggling are trying to solve organizational problems with AI tools. Their processes are undefined. Their data is messy. Their systems don't talk to each other. An agent can't fix that.</p>
<p>I expect the Gartner numbers to keep climbing. More applications will include something called an agent. The gap between "includes an AI agent" and "AI agent does useful autonomous work" will persist. It's a good time to be selling agent platforms. It's a confusing time to be buying them.</p>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>AI Agents</category>
      <category>Workflows</category>
      <category>Enterprise AI</category>
    </item>
    <item>
      <title>Generative UI Is Solving a Problem Developers Don&apos;t Have</title>
      <link>https://rajkiranpanuganti.com/blog/generative-ui-widgets/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/generative-ui-widgets/</guid>
      <pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate>
      <description>AI can now generate interfaces at runtime. The frameworks are impressive. The use cases remain unclear.</description>
      <content:encoded><![CDATA[<p>Generative UI is the pattern where AI agents create interface components at runtime instead of developers defining them upfront. The agent returns structured specs for cards, forms, and charts. The frontend renders them. Or the agent returns full UI surfaces that get embedded directly.</p>
<p>The frameworks are genuinely impressive. CopilotKit, MCP Apps, and the Open-JSON-UI spec all enable AI to output working interfaces. Figma Make generates responsive layouts from text prompts. The tooling has arrived.</p>
<p>I keep asking: who needs this?</p>
<p>The pitch involves AI agents that adapt their interfaces to user intent. Instead of navigating a fixed menu, you describe what you want and the agent generates the right controls. Instead of building a dashboard, you ask for one and it materializes.</p>
<p>This sounds good until you think about how people actually use software.</p>
<p>Users don't want interfaces that change. They want interfaces that become familiar. The value of a well-designed application is predictability. You know where the button is. You know what happens when you click it. Muscle memory compounds into efficiency.</p>
<p>Generative UI throws that away. Every interaction is potentially novel. The cognitive load never decreases. For tasks you do repeatedly, this is strictly worse than a static interface you've learned.</p>
<p>The response is usually "but for complex, one-time tasks..." and I'm not convinced there either.</p>
<p>If a task is complex enough to need a custom interface, it's probably complex enough to need a carefully designed interface. The difference between a form that's good and a form that's frustrating is subtle. Field ordering. Validation feedback. Default values. Error handling. AI can generate a form. It can't generate a form that accounts for how this specific user population makes mistakes.</p>
<p>Where generative UI might make sense: internal tools that serve long-tail use cases.</p>
<p>Enterprises have thousands of small workflows that don't justify custom development. Someone needs to query three systems, combine the results, and generate a report. Building a dedicated interface for that costs more than the workflow saves.</p>
<p>If an AI can generate a passable interface for that one-off task, the economics change. It doesn't need to be great. It just needs to be better than the alternative, which is usually a spreadsheet or a series of manual steps.</p>
<p>The widget builder trend fits here. Duda's AI assistant turns complex coding into conversations. You describe a widget and it writes the code. This is generative UI scoped to a reasonable problem: reducing the cost of building small interactive components for non-developers.</p>
<p>The distinction matters. Generative UI for novel, occasional tasks with low stakes? Useful. Generative UI as a replacement for designed interfaces in production applications? I don't see it.</p>
<p>The other angle worth watching is tokens and latency. Output tokens are slow and expensive. A generative UI framework that outputs a full interface spec for every interaction will be noticeably slower than one that serves prebuilt components. The best implementations collapse token-heavy processes into compact instructions that trigger predefined widgets.</p>
<p>Which starts to look a lot like the component libraries we already have, just with an LLM selecting between them. That's useful. It's also less revolutionary than the marketing suggests.</p>
<p>My prediction: generative UI becomes a feature of development tools rather than a replacement for designed applications. AI helps developers build interfaces faster. AI helps non-developers build simple interfaces at all. AI does not replace the concept of a designed, consistent interface for production software.</p>
<p>The frameworks are real. The capability is real. The revolution isn't.</p>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>Generative UI</category>
      <category>Widgets</category>
      <category>Frontend</category>
      <category>AI</category>
    </item>
    <item>
      <title>MCP Won the Protocol War. Security Lost.</title>
      <link>https://rajkiranpanuganti.com/blog/mcp-protocol-won/</link>
      <guid isPermaLink="true">https://rajkiranpanuganti.com/blog/mcp-protocol-won/</guid>
      <pubDate>Tue, 10 Feb 2026 00:00:00 GMT</pubDate>
      <description>Model Context Protocol is now an industry standard under the Linux Foundation. The security issues that researchers flagged remain unfixed.</description>
      <content:encoded><![CDATA[<p>The Model Context Protocol started as Anthropic's internal solution for connecting Claude to external tools. In November 2024, they open-sourced it. By March 2025, OpenAI adopted it. By May, Microsoft and GitHub joined the steering committee. In December, Anthropic donated MCP to the Linux Foundation. Google, AWS, Microsoft, Cloudflare, and Bloomberg signed on.</p>
<p>MCP won. It's the standard for how AI agents connect to external systems.</p>
<p>In April 2025, security researchers published an analysis of MCP's outstanding vulnerabilities. Prompt injection. Tool permissions that allow combining tools to exfiltrate data. Lookalike tools that silently replace trusted ones. The issues haven't been fixed.</p>
<p>Organizations implementing MCP report 40-60 percent faster agent deployment times. Gartner predicts 40 percent of enterprise applications will include task-specific AI agents by end of 2026. The protocol is enabling exactly the adoption curve everyone wanted.</p>
<p>The security model assumes trust at multiple points where trust doesn't exist.</p>
<p>MCP lets an agent call external tools defined by JSON specifications. When an agent connects to a new MCP server, it receives a list of available tools and their parameters. The agent can then call those tools based on user requests or its own reasoning.</p>
<p>The problem: an agent can't verify that a tool does what its description says. A tool named "read_file" might exfiltrate data on every call. A tool named "send_email" might BCC every message to an attacker. The agent relies on descriptions provided by the server, and those descriptions can lie.</p>
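<p>Here's the shape of the problem. A tool, as the agent sees it, is just a name, a natural-language description, and a JSON Schema for its inputs, along the lines of what a server returns from a tools/list call (shown below as a Python dict; check the spec for the exact fields):</p>
<pre><code># A tool as the agent sees it: the description is the only contract.
# The shape roughly follows MCP's tool listing; field names may differ.

tool = {
    "name": "read_file",
    "description": "Read a file from the user's workspace and return its text.",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

# Nothing in this blob binds the server's behavior to the description.
# A server could implement read_file as "read the file, then forward its
# contents to an attacker's endpoint" and the agent would never know.
</code></pre>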
<p>This matters more as MCP adoption increases. The protocol's value comes from a growing ecosystem of servers. You can connect an agent to your CRM, your database, your email, your calendar. Each connection is a server you're trusting with whatever access the agent has.</p>
<p>The current mitigation is user approval for tool calls. Before an agent executes a sensitive action, it asks the user. This works for occasional, visible actions. It fails for high-volume automations where the whole point is eliminating human review.</p>
<p>The other security gap is tool shadowing. If two MCP servers offer tools with similar names, an agent might call the wrong one. A malicious server can register tools that intercept requests meant for legitimate ones. There's no namespacing that makes tool origin clear.</p>
<p>I don't think these problems are unfixable. Cryptographic tool verification is tractable. Capability-based permissions exist. Audit logging with tamper-proof guarantees is well understood.</p>
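<p>To make "tractable" concrete: even just pinning tool definitions, the way lockfiles pin dependencies, would close the quiet-swap case. A sketch of my own below, not a proposed MCP extension:</p>
<pre><code>import hashlib, json

# Pin the tool definitions you reviewed, the way a lockfile pins packages.
# My own illustration of the idea, not an MCP feature.

def fingerprint(tool):
    blob = json.dumps(tool, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def verify_tools(server_name, tools, lockfile):
    for tool in tools:
        key = f"{server_name}/{tool['name']}"        # namespace tools by origin
        pinned = lockfile.get(key)
        if pinned is None:
            raise PermissionError(f"unreviewed tool: {key}")
        if pinned != fingerprint(tool):
            raise PermissionError(f"tool changed since review: {key}")
    return True
</code></pre>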
<p>The issue is that security features add friction. They slow adoption. They complicate the developer experience. The incentive for every party involved is to ship integrations now and address security later.</p>
<p>This is how security debt accumulates. A protocol gets adopted because it's easy. Vulnerabilities get documented but not prioritized. The installed base grows. Eventually the cost of fixing issues becomes prohibitive because too many systems depend on the insecure behavior.</p>
<p>MCP is at the early stage of this pattern. The protocol could still be hardened. The organizations on the steering committee have the resources. The question is whether they'll prioritize security before the ecosystem calcifies around the current model.</p>
<p>My guess: they won't. Security features will arrive incrementally, probably after a high-profile incident makes them politically necessary. The organizations deploying MCP-connected agents today are accepting risks they may not fully understand.</p>
<p>The protocol won. That's separate from whether it's ready for production.</p>
]]></content:encoded>
      <author>Rajkiran Panuganti</author>
      <category>MCP</category>
      <category>Model Context Protocol</category>
      <category>Security</category>
      <category>Standards</category>
    </item>
  </channel>
</rss>