
AI Gets Personal

The most interesting AI projects this week weren't about serving millions of users. They were about making AI work for one person at a time: on-device models, personal knowledge bases, and memory systems that remember you.

Tags: Gemma 4, LLM Wiki, On-Device AI, AI Memory, Generative AI, Newsletter

A GenAI Newsletter by Raj


For the past few weeks, I've been writing about the AI stack: how it's eating the model, how it got leaked, what happens when the most valuable layer becomes open knowledge. All of that was about infrastructure. This week the story shifted. The most interesting projects weren't about serving millions of users or winning benchmarks. They were about making AI work for one person at a time.

On-device models that run on your phone. Personal knowledge bases compiled from your own notes. Memory systems that remember your context across months. An AI tutor that watches your screen and points at things. The common thread: AI is moving from something you access through a cloud API to something that lives on your machine and knows your stuff.


Gemma 4: Frontier Intelligence on a Raspberry Pi

Google released Gemma 4 on April 2 under Apache 2.0. Four variants: 2.3B, 4.5B, 26B MoE (4B active), and 31B dense. The 31B model ranks #3 on Arena AI's leaderboard at 1452 Elo, outperforming models twenty times its size.

The benchmarks tell a story about how fast small models are improving. Compared to Gemma 3, AIME math scores jumped from 20.8% to 89.2%. LiveCodeBench coding went from 29.1% to 80.0%. GPQA science from 42.4% to 84.3%. These aren't incremental gains. The gap between "runs on a phone" and "runs in a data center" is closing at a pace nobody expected a year ago.

The community moved fast. Within days:

  • PhoneClaw put Gemma 4 on an iPhone as an on-device AI agent. No cloud, no API keys, everything runs locally.
  • gemma-gem runs Gemma 4 entirely in the browser via WebGPU. You open a webpage and the model loads into your GPU. No installation, no data leaving your machine.
  • Google announced Gemma 4 in the Android AICore Developer Preview, meaning it will ship as a system-level capability on Android devices.

This matters because it changes what "using AI" means. Today, most people interact with AI through ChatGPT or Claude in a browser, sending their data to someone else's server. Gemma 4 on a phone means the model is yours. Your data stays on your device. You don't need an internet connection. You don't need a subscription.

The 26B MoE variant is the interesting one for developers. With only 4B parameters active per token, it's efficient enough for real-time use on consumer hardware while being smart enough to handle complex reasoning. The MoE architecture means you get 26B worth of knowledge with 4B worth of compute cost.
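To make that trade-off concrete, here's a toy sketch of top-k expert routing in plain Python. This is illustrative only, not Gemma 4's actual architecture: the dimensions, gating scheme, and expert count are made up. The point it demonstrates is that total parameters scale with the number of experts while per-token compute scales only with k.

```python
import math

def moe_layer(x, experts, gate_w, k=2):
    """Route a token through its top-k experts only (toy MoE sketch)."""
    # One gating logit per expert: dot(x, gate column)
    scores = [sum(xi * wi for xi, wi in zip(x, col)) for col in gate_w]
    # Pick the k highest-scoring experts; only these run this token
    top_k = sorted(range(len(scores)), key=lambda i: scores[i])[-k:]
    exp_s = [math.exp(scores[i]) for i in top_k]
    total = sum(exp_s)
    weights = [e / total for e in exp_s]   # softmax over selected experts
    out = [0.0] * len(x)
    for w, i in zip(weights, top_k):
        # Matrix-vector product for expert i only; the remaining
        # len(experts) - k experts cost nothing for this token.
        y = [sum(xi * mij for xi, mij in zip(x, row)) for row in experts[i]]
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# 16 experts' worth of parameters, 2 experts' worth of compute per token
d = 4
experts = [[[0.1 * (i + r + c) for c in range(d)] for r in range(d)]
           for i in range(16)]
gate_w = [[0.05 * (e - j) for j in range(d)] for e in range(16)]
x = [1.0, -0.5, 0.25, 0.0]
y = moe_layer(x, experts, gate_w, k=2)
```

Scale the same idea up and you get the 26B/4B split: the gate decides which slice of the model's knowledge each token actually pays for.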


Karpathy's LLM Wiki: Is This the End of RAG?

Andrej Karpathy posted a gist describing what he calls an "LLM Knowledge Base" or "LLM Wiki." The idea is simple: dump your raw documents (papers, articles, notes, bookmarks) into a folder. Point a coding agent at it. The agent reads everything and compiles it into a structured, interlinked wiki with cross-references, summaries, and backlinks between related concepts.

It's a direct alternative to RAG (Retrieval Augmented Generation), and the difference in philosophy is significant. RAG indexes your documents into vector embeddings and retrieves relevant chunks at query time. The LLM Wiki compiles your documents into a coherent knowledge structure ahead of time. RAG gives you search results. The LLM Wiki gives you an encyclopedia.

The pattern has three stages:

Ingest. Raw materials go into a raw/ directory. Papers, GitHub repos, web articles (Karpathy uses Obsidian Web Clipper to convert pages to markdown).

Compile. The LLM reads the raw data and writes structured wiki articles. It identifies key concepts, generates summaries, creates backlinks, and builds a table of contents. This is the expensive step, but you only do it when new sources arrive.

Maintain. The LLM runs "health checks" on the wiki: finding inconsistencies, filling gaps, updating cross-references, removing stale information. Like a librarian who reorganizes the shelves periodically.
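The compile stage, the heart of the pattern, can be sketched in a few lines. Everything here is an assumption for illustration: `llm` is a placeholder for whatever completion function you use (a coding agent, a chat API), and the file layout just mirrors Karpathy's `raw/` convention. It is not his actual implementation.

```python
from pathlib import Path

def compile_wiki(raw_dir="raw", wiki_dir="wiki", llm=None):
    """Sketch of the compile stage: read each raw markdown source,
    ask an LLM for a structured wiki article, write it out, and
    build a table of contents that links everything together."""
    Path(wiki_dir).mkdir(parents=True, exist_ok=True)
    toc = []
    for src in sorted(Path(raw_dir).glob("*.md")):
        prompt = (
            "Compile this source into a wiki article with a summary, "
            "key concepts, and [[backlinks]] to related topics:\n\n"
            + src.read_text()
        )
        # `llm` is hypothetical; fall back to copying through untouched
        article = llm(prompt) if llm else src.read_text()
        (Path(wiki_dir) / src.name).write_text(article)
        toc.append(f"- [[{src.stem}]]")
    (Path(wiki_dir) / "index.md").write_text("\n".join(toc) + "\n")
    return toc
```

The maintain stage would be the same loop run over `wiki/` instead of `raw/`, with a prompt asking for inconsistencies and gaps rather than fresh articles.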

The community response was instant. Six or more implementations appeared on GitHub in a single week:

  • nvk/llm-wiki: Claude Code plugin for building and querying LLM-compiled knowledge bases
  • claude-memory-compiler: Hooks into Claude Code sessions, extracts decisions and lessons, compiles them into cross-referenced articles
  • sage-wiki: A Go implementation. Drop in sources, get a structured searchable wiki
  • obsidian-wiki: Framework for AI agents to build and maintain an Obsidian vault using the pattern
  • Multiple shell-based and TypeScript implementations for different workflows

Why did this explode? Because it solves a real problem that RAG handles poorly. RAG is good at finding a specific fact buried in a large corpus. It's bad at synthesizing knowledge across documents, maintaining context over time, or giving you the big picture. The LLM Wiki approach produces something you can actually read and browse, and the cross-references let you discover connections between ideas that you wouldn't have found by searching.

For anyone building with AI (which, if you're reading this newsletter, is probably you), this is worth trying. The setup is minimal: a folder of markdown files, a coding agent, and a compilation prompt. The result is a personal knowledge base that gets smarter as you feed it more sources.


MemPalace: When Milla Jovovich Ships the Best AI Memory System

This one surprised everyone. Milla Jovovich (yes, the actress from The Fifth Element and Resident Evil) co-developed an AI memory system called MemPalace with developer Ben Sigman. It posted the highest score on standard memory benchmarks, beating every product in the space, free or paid. The repo hit 10,000 stars within days.

The system works differently from existing memory approaches. Most AI memory systems store raw conversation history or compress it into summaries. MemPalace uses a spatial metaphor inspired by the ancient memory palace technique: information is organized into rooms, objects, and associations. The AI builds a persistent mental model of what it knows about you, organized spatially so retrieval follows associative paths instead of keyword search.

This connects to a broader trend this week. claude-memory-compiler hooks into Claude Code sessions and automatically extracts key decisions and lessons into structured knowledge articles. The LLM Wiki pattern is fundamentally about memory too: compiling what you've read into something persistent and organized.

Memory is becoming a first-class concern in the AI stack. The Claude Code leak revealed KAIROS and autoDream (memory consolidation while idle). Karpathy's LLM Wiki compiles knowledge into persistent structure. MemPalace organizes personal context spatially. All three are trying to solve the same problem: AI that remembers and builds on what it knows about you over time.


Clicky: An AI Tutor That Points at Your Screen

Farza (FarzaTV) built something called Clicky, an AI teacher that lives next to your cursor as a little buddy. It can see your screen, talk to you, and point at things directly, like having someone look over your shoulder and guide you through a new tool.

Farza has been using it to learn DaVinci Resolve (video editing software), and says it's been a "10/10" experience. The AI watches what you're doing, understands the context of the application you're in, and gives guidance that's specific to what's on your screen at that moment.

This is a different kind of personal. Models like Gemma 4 make AI personal by running on your hardware. The LLM Wiki makes AI personal by knowing your knowledge. Clicky makes AI personal by seeing your context in real-time. It's the difference between an AI that answers questions and an AI that teaches you by watching you work.


The Caveman Optimization

A lighter story, but genuinely useful: Caveman is a Claude Code skill that cuts 65% of token usage by making the model communicate in abbreviated, caveman-style language. "why use many token when few token do trick." It hit 5,300 stars.

It sounds like a joke, but it's a real optimization. Token usage is the primary cost driver for AI coding agents. If you can get the same information across in 35% of the tokens, your monthly bill drops proportionally. The skill works by injecting system prompt instructions that compress the model's communication style without reducing the quality of code output.
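The mechanism is easy to picture. Here's a hypothetical sketch of how such a style-compressing skill might prepend its instruction, plus a crude back-of-envelope for the savings; the actual Caveman prompt and its measurement method are not quoted in this piece, so everything below is illustrative.

```python
# Hypothetical compression instruction (not the real Caveman prompt)
CAVEMAN_STYLE = (
    "Respond in terse caveman speech. Drop articles, filler, and "
    "politeness. Never shorten code itself, only the prose around it."
)

def with_caveman(system_prompt: str) -> str:
    """Prepend the style instruction to an agent's system prompt."""
    return CAVEMAN_STYLE + "\n\n" + system_prompt

def rough_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English."""
    return max(1, len(text) // 4)

verbose = ("I have carefully reviewed the code and I believe the issue "
           "is that the loop variable shadows the outer variable.")
terse = "loop var shadow outer var. fix name."
savings = 1 - rough_tokens(terse) / rough_tokens(verbose)
```

Same diagnosis, a fraction of the tokens. The key constraint is the last clause of the instruction: the compression applies to the model's chatter, not to the code it emits.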

This fits the "personal" theme in an unexpected way. One of the biggest barriers to using AI coding agents is cost. At $200/month for Claude Code Max or pay-per-token for API usage, heavy users rack up significant bills. Caveman and tools like it bring the cost down to where more people can afford to use AI as a daily collaborator.


Quick Hits

Anthropic hit $30B annualized revenue, surpassing OpenAI's $25B. The company tripled revenue in three months despite (or because of?) the Claude Code source leak. IPO potentially in October at a $380B valuation.

OpenAI, Anthropic, and Google formed an anti-distillation alliance through the Frontier Model Forum, sharing data on Chinese labs that systematically query their APIs to train copycat models. Anthropic documented 16 million exchanges from DeepSeek, Moonshot AI, and MiniMax. The irony: Anthropic's own anti-distillation mechanisms were just exposed in last week's leak.

Anthropic's DMCA cleanup from the leak accidentally took down 8,100 GitHub repositories. Boris Cherny (Claude Code lead) acknowledged the mistake and retracted the bulk of the notices. The code remains widely mirrored.

Sebastian Raschka published mini-coding-agent, a minimal, readable coding agent harness in Python. Inspired by what the Claude Code leak revealed about harness architecture, it's designed to teach the core components. Think NanoGPT for coding agents.


This is the fifth edition of my weekly deep dive into what's actually happening at the frontier of Generative AI. Previous editions: The Stack Got Leaked / The Stack Eats the Model / The Three Races in AI / The Week AI Learned to Do Its Own Research


This Week's Radar:

  • Gemma 4: Google's open model family, Apache 2.0, runs on phones to GPUs
  • PhoneClaw: On-device AI agent for iPhone powered by Gemma 4
  • gemma-gem: Gemma 4 running entirely in-browser via WebGPU
  • Karpathy's LLM Wiki gist: The pattern that spawned six implementations in a week
  • MemPalace: Highest-scoring AI memory system, by Milla Jovovich
  • Clicky: AI tutor that sees your screen and points at things
  • Caveman: 65% token reduction by making Claude talk like a caveman
  • claude-memory-compiler: Auto-extract decisions from Claude Code sessions into structured knowledge
  • mini-coding-agent: Sebastian Raschka's minimal readable agent harness
  • Cersei: Rust SDK for building coding agents with graph memory

About the author

Raj is an AI engineer and researcher. He spent 14 years at Microsoft working on Microsoft 365 Copilot Chat, Bing search ranking, Cortana, and Azure ML. Today he is building a new AI startup in stealth, advises other AI startups, conducts research on mechanistic interpretability, and writes weekly about Generative AI.