The Week AI Learned to Do Its Own Research

Three projects this week paint a picture of where GenAI is headed — from autonomous experimentation, to self-surgery on neural networks, to agents that evolve their own capabilities.

Something shifted this week in the AI landscape. Not a new model release. Not a benchmark record. Something more fundamental: AI agents stopped waiting for instructions and started conducting their own research.

Three projects caught my attention, and together they trace where Generative AI is headed: from autonomous experimentation, to self-surgery on neural networks, to agents that evolve their own capabilities. Let me walk you through each one.


1. Karpathy's AutoResearch: 100 Experiments While You Sleep

Andrej Karpathy quietly dropped a project called autoresearch that hit 25,000 GitHub stars in five days. The premise is deceptively simple: give an AI coding agent a training script, a GPU, and a 5-minute compute budget per experiment — then walk away.

The agent reads the code, forms a hypothesis ("what if I increase the learning rate for embeddings?"), edits the training script, runs the experiment, evaluates the result, and decides whether to keep or discard the change. Then it does it again. And again. All night long.

83 experiments. 15 improvements. Zero human intervention.

Here's what makes it genuinely clever: the loop itself is trivial — it's just hill climbing. The innovation is in the experimental design:

  • Immutable evaluation: The agent cannot touch the evaluation code. The metric (bits-per-byte) is fixed, vocab-size independent, and computed on a pinned validation set. No way to game it.
  • Time-budget fairness: Every experiment gets exactly 5 minutes of wall-clock training, not a fixed number of steps. Changes are judged on what they deliver per unit of compute: anything that slows each step buys fewer steps, and anything that speeds training earns more.
  • Git as research log: Every experiment is a git commit. Successful ones stay on the branch. Failed ones get reverted. The commit history literally is the research paper.
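
The outer loop itself is small enough to sketch. Below is a minimal simulation of the keep-or-revert cycle, with a toy objective standing in for the real training run; the `evaluate` and `propose` callables, and all names here, are my assumptions, not the repo's API:

```python
import random

def autoresearch_loop(evaluate, propose, baseline_cfg, n_experiments=100):
    """Plain hill climbing over configs.

    `evaluate` is the immutable metric (lower is better, like bits-per-byte);
    `propose` mutates a config (the agent's hypothesis step). The real system
    edits a training script and commits to git; here we just mutate a dict.
    """
    best_cfg, best_score = baseline_cfg, evaluate(baseline_cfg)
    kept = 0
    for _ in range(n_experiments):
        candidate = propose(best_cfg)   # form a hypothesis
        score = evaluate(candidate)     # run it under the fixed budget
        if score < best_score:          # keep = commit stays on the branch
            best_cfg, best_score, kept = candidate, score, kept + 1
        # else: discard = git revert
    return best_cfg, best_score, kept

# Toy stand-in for "training": minimize (lr - 0.3)^2 by random perturbation.
rng = random.Random(0)
evaluate = lambda cfg: (cfg["lr"] - 0.3) ** 2
propose = lambda cfg: {"lr": cfg["lr"] + rng.uniform(-0.05, 0.05)}

cfg, score, kept = autoresearch_loop(evaluate, propose, {"lr": 1.0})
```

Most proposals get discarded, a minority stick, and the config drifts steadily toward the optimum; that is the whole trick, scaled up to 100 real training runs a night.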

The agent discovered a sophisticated combination of mixed optimizers (Muon for weight matrices, Adam for embeddings), per-parameter learning rates, alternating attention window patterns, and gated value embeddings. None of these individually are novel — but the specific combination found through autonomous search outperformed the hand-tuned baseline.
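
The mixed-optimizer part of that recipe is easy to reproduce in PyTorch by splitting parameters into groups. Muon is not in `torch.optim`, so SGD-with-momentum stands in for it below; the partitioning rule (2-D weight matrices vs. embeddings and 1-D params) is the part that mirrors the discovered setup:

```python
import torch
import torch.nn as nn

# Toy model with the two parameter roles the agent treated differently.
embed = nn.Embedding(100, 32)  # embedding table
proj = nn.Linear(32, 32)       # 2-D weight matrix, plus a 1-D bias

# Partition: weight matrices get one optimizer (Muon in AutoResearch;
# SGD+momentum here as a stand-in), embeddings and biases get Adam
# with its own learning rate.
matrix_params = [proj.weight]
embed_params = [embed.weight, proj.bias]

opt_matrix = torch.optim.SGD(matrix_params, lr=0.02, momentum=0.95)
opt_embed = torch.optim.Adam(embed_params, lr=3e-3)

# One training step drives both optimizers off the same backward pass.
tokens = torch.randint(0, 100, (8,))
loss = proj(embed(tokens)).pow(2).mean()
loss.backward()
opt_matrix.step(); opt_embed.step()
opt_matrix.zero_grad(); opt_embed.zero_grad()
```

The learning rates above are placeholders; the point is that two optimizers can coexist on disjoint parameter sets with no framework support beyond `torch.optim` itself.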

The takeaway isn't that AI can do research. It's that the bottleneck was never intelligence; it was experimental throughput. A human researcher runs 3-5 experiments per day. AutoResearch runs 100 overnight, compensating for lower hypothesis quality with sheer volume: even a modest hit rate (15 keeps out of 83 runs above) adds up to real gains by morning.

Karpathy's vision goes further: a SETI@Home-style distributed network where thousands of agents explore different regions of hyperparameter space simultaneously. Not one AI PhD student — an entire autonomous research department.


2. The Circuit Finder: Making LLMs Smarter Without Training

While Karpathy's work automates training, another project this week asked a different question: can you make a model smarter without training it at all?

A researcher replicated David Ng's RYS (Repeat Your Steps) method and found something remarkable. Transformer models contain functional reasoning circuits — contiguous blocks of 3-4 layers that perform complete cognitive operations. By duplicating these specific layers in the forward pass — routing hidden states through the same weights twice — you get a reasoning boost with zero training, zero weight changes, and minimal compute overhead.

The results:

  • Qwen2.5-32B: Duplicating layers 7-9 → +23% improvement on reasoning benchmarks
  • Devstral-24B: Duplicating layers 12-14 → logical deduction jumped from 0.22 to 0.76

The cost? An extra 1.5 GB of VRAM and 7.5% slower inference. That's it.

But here's what's fascinating: the boundaries are razor-sharp. Shift the duplicated block by a single layer in either direction and the improvement vanishes — or inverts. These circuits are precise, architecture-specific, and currently unpredictable. Each model needs an expensive sweep to find its own circuit locations.
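
Mechanically, the trick is just routing activations through a slice of the layer stack twice during the forward pass. Here is a minimal sketch with stand-in layers; the function and argument names are mine, and a real implementation would hook a Hugging Face-style `model.layers` list:

```python
def run_with_duplication(layers, x, start, end):
    """Run x through `layers`, repeating the block [start, end] a second
    time with the exact same weights: no training, no weight changes."""
    order = list(range(len(layers)))
    order = order[:end + 1] + list(range(start, end + 1)) + order[end + 1:]
    for i in order:
        x = layers[i](x)
    return x

# Toy stand-in layers: each just records its index so we can see the path.
layers = [lambda xs, i=i: xs + [i] for i in range(6)]
path = run_with_duplication(layers, [], start=2, end=3)
# path == [0, 1, 2, 3, 2, 3, 4, 5] -- layers 2-3 executed twice
```

The extra VRAM cost in the real setting comes from the duplicated block's KV cache, not from new weights, since the same tensors are reused.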

There's also a trade-off nobody's talking about: while reasoning improves significantly, instruction-following degrades by ~4%. The model "thinks harder but listens less." Different duplication patterns create different modes — triple-pass of the same layers amplifies emotional intelligence more than mathematical reasoning. It's as if we've discovered tuning knobs inside transformers that we didn't know existed.

This has implications beyond inference optimization. It suggests that transformer layers are not homogeneous — they develop specialized functions during training, and understanding these functions could unlock a new paradigm of post-training model optimization.


3. The Self-Evolving Agent: 3,500 Lines That Run 24/7

The third project that caught my eye this week challenges the assumption that powerful AI agents require massive frameworks. 724-Office is a self-evolving AI agent system built in just 3,500 lines of pure Python — with only three external dependencies.

What makes it remarkable isn't its size. It's what it can do:

Three-layer memory system:

  • Layer 1 (Session): Last 40 messages in hot cache
  • Layer 2 (Compression): When messages overflow, an LLM extracts structured facts and stores them as vectors in LanceDB
  • Layer 3 (Retrieval): Every new message triggers semantic search, injecting the 5 most relevant memories into the system prompt
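
That three-layer shape fits in a few dozen lines. In this sketch, keyword overlap stands in for LanceDB's vector search and a stub stands in for the LLM fact extractor; the class and method names are my assumptions, not the project's API:

```python
from collections import deque

class ThreeLayerMemory:
    """Layer 1: hot cache of the last N messages.
    Layer 2: on overflow, compress the evicted message into a stored fact
             (the real system has an LLM extract structured facts).
    Layer 3: retrieval returns the k facts most relevant to a query
             (the real system does vector search in LanceDB)."""

    def __init__(self, hot_size=40, summarize=None):
        self.hot = deque(maxlen=hot_size)
        self.facts = []
        self.summarize = summarize or (lambda msg: msg)

    def add(self, message):
        if len(self.hot) == self.hot.maxlen:
            # Oldest message is about to fall out of the hot cache:
            # compress it into long-term storage first.
            self.facts.append(self.summarize(self.hot[0]))
        self.hot.append(message)

    def retrieve(self, query, k=5):
        # Keyword-overlap ranking as a stand-in for semantic search.
        q = set(query.lower().split())
        ranked = sorted(self.facts,
                        key=lambda f: -len(q & set(f.lower().split())))
        return ranked[:k]

mem = ThreeLayerMemory(hot_size=2)
for msg in ["deploy runs on jetson", "memory uses lancedb",
            "agent repairs itself"]:
    mem.add(msg)
```

Nothing overflows until the hot cache fills; after that, every eviction feeds Layer 2, and every new query can pull from Layer 3.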

Self-repair and self-evolution: The agent runs daily self-diagnostics via cron. When it detects anomalies — corrupted sessions, failed MCP servers, error spikes — it can fix itself using shell commands, edit its own configuration files, and even write new tools at runtime using a create_tool function that generates Python code, saves it to a plugins directory, and hot-loads it immediately. No restart required.
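
The hot-loading step is the interesting part, and the standard library covers it. A hedged sketch follows: the project names a create_tool function, but this signature and the plugins path are my assumptions:

```python
import importlib.util
import pathlib

PLUGINS = pathlib.Path("plugins")

def create_tool(name: str, source: str):
    """Save generated Python source into the plugins directory and
    hot-load it as a module immediately; no process restart needed."""
    PLUGINS.mkdir(exist_ok=True)
    path = PLUGINS / f"{name}.py"
    path.write_text(source)
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# An "agent-written" tool: generated source goes straight to disk and RAM.
adder = create_tool("adder", "def run(a, b):\n    return a + b\n")
# adder.run(2, 3) returns 5
```

The same `importlib` machinery is what makes "edit its own configuration, then reload" safe to do from inside a running process.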

This is running in production. 24/7. On a Jetson Orin Nano with 8GB of RAM.

Meanwhile, a complementary project called Context Infrastructure takes a different philosophical approach to the same problem. Instead of vector databases and runtime code generation, it uses plain Markdown files in a git repository — 43 hand-written axioms, 25 reusable workflow templates, and an automated observer/reflector cycle that distills daily work patterns into permanent knowledge over months.
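
The observer/reflector cycle is simple to sketch: once a day, distill the raw work log into bullet points and append them to a permanent Markdown file that future prompts include. The file names and the `distill` stub below are my assumptions; the real system would have an LLM do the distilling:

```python
import datetime
import pathlib

def reflect(daily_log: pathlib.Path, knowledge: pathlib.Path, distill) -> int:
    """One reflector pass: read today's log, extract recurring patterns,
    and append them as dated bullets to the permanent knowledge file."""
    patterns = distill(daily_log.read_text())
    stamp = datetime.date.today().isoformat()
    with knowledge.open("a", encoding="utf-8") as f:
        for p in patterns:
            f.write(f"- ({stamp}) {p}\n")
    return len(patterns)

# Stub distiller: keep only lines the author explicitly marked as lessons.
distill = lambda text: [ln[8:] for ln in text.splitlines()
                        if ln.startswith("LESSON: ")]

# Demo on throwaway files in the working directory.
log = pathlib.Path("daily.md")
log.write_text("shipped feature\nLESSON: prefer small PRs\n")
count = reflect(log, pathlib.Path("knowledge.md"), distill)
```

Because the knowledge file is plain Markdown in git, the "memory" is diffable, reviewable, and portable across models, which is the whole philosophical bet of this approach.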

The creator reports that after 6 months of running this system, the AI agent starts predicting their approach to problems — not through fine-tuning, but through accumulated context that shapes behavior through immersion.

Both projects point to the same conclusion: the next frontier isn't smarter models — it's persistent, evolving agent infrastructure around them.


What This All Means

Step back and look at these three projects together:

  1. AutoResearch: AI agents conducting autonomous experiments
  2. Circuit Finder: Discovering hidden structure inside models without training
  3. Self-evolving agents: Systems that maintain, repair, and extend themselves

We're watching AI move from tool to researcher. From stateless assistant to evolving collaborator. From fixed architecture to self-modifying system.

None of these projects required a new foundation model. They run on existing LLMs — Claude, GPT-4, DeepSeek. The innovation is in the infrastructure, evaluation design, and agent architecture around the models.

If you're building with GenAI today, the lesson is clear: stop optimizing prompts and start building systems. The prompt is ephemeral. The system persists.


What projects are catching your eye this week? Drop them in the comments — I'm always looking for the next deep dive.

If you found this useful, subscribe for weekly deep dives into what's actually happening at the frontier of Generative AI.



Written by Rajkiran Panuganti