
The Falling Price of Intelligence

GPT-4 level intelligence went from $36 per million tokens to effectively $0 in three years. Five independent mechanisms are driving this, and they compound.

AI Economics · Inference · Open Source · LLM Costs · Generative AI · Newsletter

A GenAI Newsletter by Raj


In March 2023, GPT-4 was the best model available and it cost $36 per million input tokens through the API. That was the only way to access that level of intelligence.

Today, GPT-4 level performance is available at $0.10 per million tokens through Gemini 2.0 Flash or Mistral Small. Or free, through NVIDIA NIM or OpenRouter. Or at zero marginal cost, by running Gemma 4 31B on your own hardware. An open-source model you can download and self-host now matches what was the commercial frontier three years ago.

The same level of intelligence went from $36 to effectively $0. The frontier moved too, and access to the new frontier (GPT-5.4 at $2.50, Gemini 3.1 Pro at $1.25, DeepSeek V3.2 at $0.28) is itself 10-100x cheaper than the old frontier was at launch.

This is happening through five independent mechanisms, and they compound.


1. Smaller models are replacing larger ones

The biggest cost reduction is not cheaper APIs. It is open-source models that you can run yourself, eliminating the API bill entirely.

Google released Gemma 4 in April 2026 under Apache 2.0. The 31B dense variant scores 89.2% on AIME 2026, 80% on LiveCodeBench, and competes with proprietary models at 400B+ parameters. You can download it, quantize it, and run it on a single RTX 4090 or a MacBook Pro with 48GB unified memory. No API key. No rate limits. No per-token cost. No data leaving your network.
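If you want to try this, the whole loop fits in a few lines with llama-cpp-python. A minimal sketch, assuming you have already downloaded a Q4_K_M GGUF of the model; the filename below is a placeholder, not an official artifact:

```python
# Fully local inference with llama-cpp-python: no API key, no per-token cost.
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-4-31b-q4_k_m.gguf",  # hypothetical local file
    n_ctx=8192,       # context window to allocate
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize KV-cache compression in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```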

Qwen3.5, also open source, shipped a 9B-parameter model in February 2026 that scores 81.7 on GPQA Diamond. GPT-OSS-120B, a model 13 times its size, scores 71.5 on the same benchmark.

The 2.3B effective-parameter variant of Gemma 4 scores 37.5% on AIME 2026 and 44% on LiveCodeBench. This runs on a phone.

At the proprietary frontier, March 2026 saw over 30 model launches in a single month. Gemini 3.1 Pro scores 94.3% on GPQA Diamond. GPT-5.4 set records on computer-use benchmarks. Claude Sonnet 4.6 performs at near-Opus quality at Sonnet pricing. NVIDIA's Nemotron 3 Super, a 120B hybrid Mamba-Attention MoE with only 12B active parameters, topped open-weight SWE-Bench Verified at 60.47%.

But the story here is not the frontier getting better. It is the gap between open-source and proprietary closing to single-digit percentage points while the cost difference remains 10-100x. For most production workloads, the open-source option is now good enough, and it is free.


2. Inference is getting faster without new hardware

Speculative decoding has matured. EAGLE-3, presented at NeurIPS 2025, achieves 3-6.5x speedup over standard autoregressive generation on models ranging from 8B to 70B parameters. P-EAGLE, from AWS, removes the autoregressive drafting bottleneck and adds another 1.7x on top of that on NVIDIA Blackwell.
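The core trick is easy to state in code. Below is a toy, framework-free sketch of the greedy variant: a cheap draft model proposes k tokens, the target model verifies them, and every accepted draft token is a token the big model never had to generate serially. The `draft_next` and `target_next` callables are stand-ins, and real systems like EAGLE-3 verify sampled distributions rather than argmaxes, but the accept-the-longest-matching-prefix structure is the same.

```python
from typing import Callable, List

def speculative_step(
    seq: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model (stub)
    target_next: Callable[[List[int]], int],  # expensive target model (stub)
    k: int = 4,
) -> List[int]:
    # 1. Draft model cheaply proposes k tokens, one at a time.
    draft = []
    for _ in range(k):
        draft.append(draft_next(seq + draft))

    # 2. Target verifies each position. In practice all k target calls
    #    happen in ONE batched forward pass, which is where the speedup lives.
    accepted = []
    for i in range(k):
        t = target_next(seq + accepted)
        if t == draft[i]:
            accepted.append(t)   # draft guessed right: a free token
        else:
            accepted.append(t)   # first mismatch: keep the target's token, stop
            break
    else:
        accepted.append(target_next(seq + accepted))  # all k accepted: bonus token

    return seq + accepted  # 1 to k+1 tokens per target pass instead of 1
```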

KV-cache compression is where the less visible gains are happening. NVIDIA's NVFP4 format reduces KV-cache memory by 50% compared to FP8, which doubles effective context length and batch size with under 1% accuracy loss. Research systems like KVTC push this to 20x compression for specific workloads.
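The arithmetic behind that 50% claim is straightforward. A back-of-envelope sizing, with illustrative layer and head counts (roughly 70B-class with grouped-query attention, not any specific model's spec), ignoring NVFP4's small per-block scale-factor overhead:

```python
# KV-cache size scales linearly with bytes per element, so halving the
# format halves the memory, which doubles effective batch or context.
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elt):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elt / 1e9
    # 2x for keys and values

args = dict(layers=80, kv_heads=8, head_dim=128, seq_len=32_768, batch=16)
print(f"FP8  : {kv_cache_gb(**args, bytes_per_elt=1.0):.1f} GB")   # ~85.9 GB
print(f"NVFP4: {kv_cache_gb(**args, bytes_per_elt=0.5):.1f} GB")   # ~42.9 GB
```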

Prefill-decode disaggregation, which separates prompt processing from token generation onto different hardware, is now standard in production at Meta, LinkedIn, Mistral, and Hugging Face through vLLM. The research frontier has moved to doing this within a single GPU across different SM partitions.
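A minimal sketch of the disaggregation pattern, with stubs standing in for the models and a queue standing in for the network transfer of the KV cache between fleets; this is the shape of the idea, not vLLM's actual API:

```python
import queue, threading

handoff = queue.Queue()  # stands in for the KV-cache transfer fabric

def prefill_worker(prompts):
    # Prefill: one big, compute-bound parallel pass over the whole prompt.
    for prompt in prompts:
        kv_cache = f"<kv for {prompt!r}>"  # stub for the real KV tensors
        handoff.put((prompt, kv_cache))    # ship the cache to the decode fleet
    handoff.put(None)                      # signal end of stream

def decode_worker():
    # Decode: memory-bound, token-by-token generation reusing shipped caches.
    while (item := handoff.get()) is not None:
        prompt, kv_cache = item
        print(f"decoding {prompt!r} using {kv_cache}")

t = threading.Thread(target=prefill_worker, args=(["hi", "summarize X"],))
t.start()
decode_worker()
t.join()
```

Separating the two phases lets each run on hardware sized for its bottleneck: compute-dense GPUs for prefill, memory-bandwidth-rich ones for decode.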

None of these techniques require new silicon. They extract more work from hardware that already exists.


3. Compute is becoming a commodity

The H100 rental market tells an interesting story. Spot prices dropped 88% between January 2024 and September 2025, and annual-contract rates fell from roughly $8/GPU-hr to under $2/GPU-hr. Then in Q1 2026, prices rebounded about 40% to $2.35/hr as inference demand outran supply and capacity sold out.

The structural trend is down, but it is not a smooth line. Demand keeps eating the surplus.

Current on-demand H100 rates vary by nearly 4x depending on where you look. Azure charges $6.98/hr. AWS is $3.90. GCP is $3.00. Lambda Labs and RunPod sit around $2-3. Vast.ai, a peer-to-peer marketplace where individuals rent idle GPUs, is $1.87. GCP spot pricing drops to $2.25. The spread between hyperscalers and peer-to-peer marketplaces is the difference between paying for reliability and compliance versus paying for raw compute.

On the hardware side, inference-specific chips are changing the math. Cerebras CS-3 runs Llama 3.1 405B at over 1,000 tokens per second. Groq's LPU handles Llama 2 70B at 300 tokens per second, roughly 10x faster than an H100 cluster. These are purpose-built for the read-heavy, matrix-multiply workload of inference, and they price accordingly: Groq charges $0.11/M input tokens for Llama 4 Scout.
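A useful way to compare rentals against per-token APIs is to convert $/hr and tokens/sec into $/M tokens. A quick sketch using the figures above; the utilization parameter is an assumption, and real servers batch many concurrent streams, so aggregate throughput per dollar runs far higher than single-stream numbers:

```python
# Convert a rental price and a throughput figure into $ per million tokens.
def usd_per_million_tokens(price_per_hr, tokens_per_sec, utilization=1.0):
    tokens_per_hr = tokens_per_sec * 3600 * utilization
    return price_per_hr / tokens_per_hr * 1e6

# 300 tok/s single-stream at a Vast.ai-style $1.87/hr:
print(usd_per_million_tokens(price_per_hr=1.87, tokens_per_sec=300))  # ~$1.73/M
```

Batching tens of concurrent requests per GPU is how serving providers push that figure down toward Groq's $0.11/M.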

A new entrant is distributed inference on consumer hardware. Project Darkbloom from Eigen Labs turns idle Apple Silicon Macs into a privacy-first inference network, with end-to-end encryption and claims of 70% lower cost than centralized alternatives. Over 100 million Apple Silicon machines sit idle most of each day. Whether this model scales beyond a research preview remains to be seen, but the idea of turning consumer devices into an inference grid has obvious economic logic.


4. CPU inference is now practical

You do not need a GPU for every workload. On a modern 16+ core CPU with DDR5 memory, llama.cpp runs 7B-13B parameter models at 10-18 tokens per second with Q4_K_M quantization, which retains roughly 92% of the original model quality.

Apple Silicon Macs are a particularly good fit. The unified memory architecture means the CPU and GPU share the same memory pool, so a MacBook Pro with 36GB or 48GB of unified memory can load models that would require a dedicated GPU on other platforms. MLX, Apple's machine learning framework, runs Qwen3.5-9B and Gemma-4-31B natively on M3/M4 chips at usable speeds. DFlash, an MLX-native inference engine, recently added support for more models with up to 4x speedups over baseline MLX. For many developers, the laptop they already own is a capable local inference machine.
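Getting started on a Mac is a two-liner with mlx-lm. A sketch, with a hypothetical community repo id; check the mlx-community hub for whichever quantized conversion actually exists:

```python
# Local inference on Apple Silicon via Apple's MLX framework.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-31b-4bit")  # hypothetical repo id
text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in one paragraph.",
    max_tokens=200,
)
print(text)
```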

AMD's Ryzen AI 9 HX 375 hits 50.7 tokens per second on Llama 3.2 1B at 4-bit quantization. Even old hardware works in a pinch: community reports show a 2-core CPU with 8GB DDR2 running 4B models at 2 tokens per second.

The bottleneck across all of these is memory bandwidth, not compute. DDR5 at 5600 MT/s and Apple's unified memory bus matter more than clock speed or core count.
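You can sanity-check those throughput numbers with a roofline estimate: every generated token streams all the weights through memory once, so tokens/sec is capped at bandwidth divided by model size. Using nominal peak bandwidths (sustained bandwidth is lower, which is why measured speeds sit below these ceilings):

```python
# Upper bound on decode speed: memory bandwidth / bytes of weights per token.
def max_tokens_per_sec(bandwidth_gbps, model_gb):
    return bandwidth_gbps / model_gb

ddr5_dual_channel = 5600 * 8 * 2 / 1000  # 5600 MT/s x 8 bytes x 2 channels = 89.6 GB/s
m4_max = 546                             # Apple M4 Max unified memory, GB/s

model_q4 = 4.2  # ~7B params at Q4_K_M, roughly 4.8 bits per parameter
print(f"DDR5 ceiling  : {max_tokens_per_sec(ddr5_dual_channel, model_q4):.0f} tok/s")  # ~21
print(f"M4 Max ceiling: {max_tokens_per_sec(m4_max, model_q4):.0f} tok/s")             # ~130
```

The ~21 tok/s ceiling for dual-channel DDR5 brackets the 10-18 tok/s real-world figure above.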

For tasks like summarization, intent classification, embeddings, RAG pipelines, and coding assistance with smaller models, local deployment on a Mac or CPU-only server eliminates GPU cost entirely. This matters for on-premises deployments in regulated industries where data cannot leave the building and GPU procurement takes months. It also matters for individual developers who want to experiment without spending money.


5. Free tiers cover more than most people realize

NVIDIA's NIM platform gives free access to over 100 AI models, including Nemotron, Llama, Gemma, Qwen, DeepSeek, and Mistral. No credit card required. Rate-limited to roughly 40 requests per minute per model, which is enough for development and light production.

Google AI Studio provides 500 requests per day of Gemini 2.5 Flash for free.

OpenRouter aggregates 29 completely free models from Google, Meta, Mistral, NVIDIA, and OpenAI, with no credit card and 20 requests per minute.

Anthropic's Claude for Open Source program, launched in February 2026, gives qualifying open-source maintainers six months of Claude Max 20x for free, a value of roughly $1,200.

Between these providers, a developer can build, test, and run a production prototype entirely at zero marginal cost. The ceiling is rate limits, not money.
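Mechanically, this is easy because most of these free tiers speak the OpenAI-compatible chat API, so switching providers is a base_url and model-id change. A sketch; the endpoint URLs are the providers' documented ones, but treat the model ids as examples and check each catalog for what is currently free:

```python
# Swapping free providers behind the same OpenAI-compatible client.
import os
from openai import OpenAI

providers = {
    "nvidia_nim": ("https://integrate.api.nvidia.com/v1", "NVIDIA_API_KEY"),
    "openrouter": ("https://openrouter.ai/api/v1", "OPENROUTER_API_KEY"),
}

base_url, key_env = providers["openrouter"]
client = OpenAI(base_url=base_url, api_key=os.environ[key_env])

resp = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct:free",  # example free model id
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```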


What this adds up to

Each of these five factors (smaller models, inference optimization, cheaper compute, CPU viability, and free tiers) is individually significant. The compounding is what changes things.

A concrete example: in early 2024, running a financial advisory chatbot required GPT-4 or equivalent at roughly $30/M tokens, needed an H100 GPU if self-hosted, and could only run through a cloud API.

Today, the same workload has multiple paths. You could call DeepSeek V3.2 through the API at $0.28/M output tokens, more than 50x cheaper than GPT-5.4's $15/M output for roughly 90% of its quality. You could self-host Qwen3.5-9B on a consumer RTX 4090 ($0.35/hr on Vast.ai) with EAGLE-3 speculative decoding and Q4 quantization. You could run Gemma 4 31B on a MacBook Pro with 48GB unified memory using MLX. You could prototype the whole thing for free on NVIDIA NIM, then switch to Groq ($0.11/M tokens) for production.
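To put rough numbers on those paths, here is a back-of-envelope monthly cost at an assumed 50M output tokens per month; the per-unit prices are the ones quoted above, while the volume and the 60 tok/s sustained figure for the rented 4090 are assumptions:

```python
# Monthly cost of each deployment path at an assumed token volume.
M = 1e6
monthly_tokens = 50 * M  # assumed workload: 50M output tokens/month

paths = {
    "DeepSeek V3.2 API":        0.28 / M * monthly_tokens,
    # RTX 4090 rented at $0.35/hr; ~60 tok/s sustained is an assumption:
    "Rented 4090 (Qwen3.5-9B)": 0.35 * monthly_tokens / (60 * 3600),
    "MacBook via MLX":          0.0,  # hardware you already own
    "Groq (Llama 4 Scout)":     0.11 / M * monthly_tokens,
}
for name, usd in paths.items():
    print(f"{name:>26}: ${usd:,.2f}/month")
```

At this volume the per-token APIs can undercut even a cheap rented GPU; self-hosting wins on privacy and on hardware you already own.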

The cost of a unit of intelligence is following a trajectory that looks like bandwidth or storage in the 2000s. The constraint on what you can build is shifting from "can we afford to run this model" to "can we build the product around it."


This is the ninth edition of my weekly deep dive into what is actually happening at the frontier of Generative AI. Previous editions: Why Looping Is the New Scaling / The Quiet Skill Revolution / AI Gets Personal / The Stack Got Leaked / The Stack Eats the Model / The Three Races in AI / The Week AI Learned to Do Its Own Research


This Week's Numbers:

  • GPT-4 equivalent: $36/M tokens (2023) to $0.10/M tokens (2026), 360x reduction
  • GPT-5.4: $2.50/$15 per M tokens. DeepSeek V3.2: $0.28/M output (50x+ cheaper)
  • Gemini 3.1 Pro: 94.3% GPQA Diamond. Nemotron 3 Super: 60.47% SWE-Bench (12B active)
  • Qwen3.5-9B, 13x smaller, outperforms GPT-OSS-120B on GPQA Diamond
  • Gemma 4 2.3B: 37.5% AIME 2026 on a phone
  • EAGLE-3: 3-6.5x inference speedup, no new hardware needed
  • H100 spot: dropped 88%, rebounded 40%, structural trend still down
  • Mac with 48GB unified memory runs Gemma 4 31B natively via MLX
  • Free: 100+ models on NVIDIA NIM, 29 on OpenRouter, 500 req/day on Google AI Studio

About the author

Raj is an AI engineer and researcher. He spent 14 years at Microsoft working on Microsoft 365 Copilot Chat, Bing search ranking, Cortana, and Azure ML. Today he is building a new AI startup in stealth, advises other AI startups, conducts research on mechanistic interpretability, and writes weekly about Generative AI.