Towards Continuous LLM Learning (1): Sleepy Coder - When Fine-Tuning Fails

What if your AI coding assistant could learn from its mistakes? Not just for one session, but across training cycles. We built exactly that—and fifty-one adapters later, learned the mistake was trying to teach it at all.
| Resource | Link |
|---|---|
| Video | Sleepy Coder |
| Code | sleepy-coder |
| Share Paper | arXiv:2602.06043 |
| UWSH Paper | arXiv:2512.05117 |
| Part 2 | Routing Prevents Forgetting |
The Dream: Day/Night Learning
AI coding agents have a memory problem. They fix a bug today, then make the same mistake next week. Every session starts from the same frozen model. Nothing carries forward.
The idea was elegant: build an agent that improves overnight.
DAY CYCLE (Inference)
Agent attempts to fix Rust compiler errors
Successes and failures are logged
↓
NIGHT CYCLE (Training)
Fine-tune on failure patterns using LoRA
Create specialized adapters
↓
EVAL
Test against benchmark
Measure improvement
↓
(repeat)
During the day, the agent works and we log its failures—the error messages, the broken code, and the fixes that worked. Overnight, we fine-tune the model on those failures. Each morning, a new checkpoint should wake up a little better than before.
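As a rough illustration of the logging step, the sketch below turns one logged failure into a prompt/completion pair for the night cycle. The `FailureRecord` type, its field names, and the prompt layout are assumptions made for this example, not the sleepy-coder schema.

```python
import json
from dataclasses import dataclass


@dataclass
class FailureRecord:
    """One logged day-cycle failure (hypothetical schema)."""
    koan_id: str        # e.g. "bc_005"
    error_message: str  # compiler output seen by the agent
    broken_code: str    # code that failed to compile
    fixed_code: str     # fix that eventually worked


def to_training_example(rec: FailureRecord) -> dict:
    """Build a prompt/completion pair for night-cycle fine-tuning."""
    prompt = (
        "Fix this Rust compiler error.\n"
        f"Error:\n{rec.error_message}\n"
        f"Code:\n{rec.broken_code}\n"
    )
    return {"koan_id": rec.koan_id, "prompt": prompt, "completion": rec.fixed_code}


record = FailureRecord(
    koan_id="bc_005",
    error_message="cannot borrow `v` as mutable more than once at a time",
    broken_code="let a = &mut v; let b = &mut v; a.push(1); b.push(2);",
    fixed_code="v.push(1); v.push(2);",
)
example = to_training_example(record)
line = json.dumps(example)  # one JSONL line per logged failure
```

Writing one JSON object per line keeps the day-cycle log append-only, so the night cycle can read whatever accumulated without coordinating with the agent.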
We based this on two papers from the Johns Hopkins team (Kaushik, Vaidya, Chaudhari, Chellappa, Yuille):
- Share LoRA Subspaces (arXiv:2602.06043): learn a shared low-rank basis across tasks, then train only coefficients (76x fewer parameters per task)
- UWSH (arXiv:2512.05117): the Universal Weight Subspace Hypothesis suggests neural networks converge to shared spectral subspaces
The theory was sound. The implementation worked. The results were devastating.
The System
The Sleepy Coder agent runs in a Rust runtime, fixing compiler errors on 30 “koans” (small coding exercises) across 5 error families:
- Borrow Checker: Ownership and lifetime errors
- Type Bounds: Missing trait implementations
- Result Handling: Option/Result conversions
- Type Mismatches: Incompatible types
- Missing Items: Undefined functions or modules
The base model: Qwen2.5-Coder-1.5B-Instruct — small enough to train on a single GPU, capable enough to pass most koans without any fine-tuning.
The Journey: From Hope to Reality
Chapter 1: Naive LoRA
First attempt: standard fine-tuning on failure patterns.
| Metric | Before | After |
|---|---|---|
| Pass Rate | 73.3% | 60.0% |
| Change | n/a | -13.3 pts |
Catastrophic forgetting. The model learned the new patterns but lost ground on koans it had previously passed.
Chapter 2: The Paper Chase
We found the Share paper promising “continual learning without forgetting.” The UWSH paper provided theoretical backing: neural networks naturally converge to shared low-rank subspaces.
Key insight from Share:
Train ONLY the coefficients. Keep the basis FROZEN.
This meant ~21,000 trainable parameters instead of ~1.6 million. A 76x reduction.
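The reduction factor follows directly from the two approximate parameter counts quoted above:

```python
full_lora_params = 1_600_000  # approximate, per adapter (from the text)
coef_only_params = 21_000     # approximate, coefficients only (from the text)

reduction = full_lora_params / coef_only_params
print(f"{reduction:.1f}x fewer trainable parameters")  # prints "76.2x fewer trainable parameters"
```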
Chapter 3: The Proper Implementation
SVD: Singular Value Decomposition breaks a matrix into components that reveal its underlying structure. In Share, SVD finds the common “directions” that multiple LoRA adapters share—a compressed basis that captures what they have in common.
We rebuilt everything:
- Phase 1: Extract shared basis from 51 adapters via SVD
- Phase 2: Train only coefficient vectors (frozen basis)
- Phase 3: Merge and update basis periodically
We trained 51 pattern-specific adapters. We followed the algorithm precisely.
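A toy version of Phases 1 and 2 can be sketched in plain Python. Each adapter is flattened to a short vector, and power iteration stands in for a full SVD, recovering only the single strongest shared direction. Each adapter is then summarized by one coefficient along that frozen direction, with a simple projection standing in for training. The vectors, the rank-1 basis, and the function names are illustrative assumptions; a real setup would work on full LoRA matrices and keep more than one basis vector.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_shared_direction(adapters, iters=200):
    """Power iteration on A^T A: approximates the top right singular vector."""
    dim = len(adapters[0])
    v = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit-length start
    for _ in range(iters):
        proj = [dot(row, v) for row in adapters]  # A v
        w = [sum(p * row[j] for p, row in zip(proj, adapters))  # A^T (A v)
             for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return v


# Three toy "adapters" that mostly vary along the first axis.
adapters = [
    [2.0, 0.0, 0.0],
    [4.0, 0.0, 0.1],
    [-3.0, 0.0, 0.0],
]

basis = top_shared_direction(adapters)          # Phase 1: extract, then freeze
coeffs = [dot(row, basis) for row in adapters]  # Phase 2 stand-in: one scalar per adapter
rebuilt = [[c * b for b in basis] for c in coeffs]
```

Here each adapter is described by a single number instead of three, which is the same trade the Share approach makes at a much larger scale: the basis is shared and frozen, and only the small coefficient set is task-specific.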
Chapter 4: The Stubborn Seven
No matter what we tried, 7 tasks kept failing:
| Task | The Problem |
|---|---|
| bc_003 | Mutable borrow while immutable exists |
| bc_005 | Double mutable borrow |
| bc_010 | Returning reference to local data |
| tb_002 | Missing Clone trait |
| tb_007 | Missing Hash trait |
| tb_008 | Missing Ord trait |
| rh_004 | Option to Result conversion |
These require a deeper understanding of Rust’s ownership and trait systems, which the 1.5B model did not reliably learn in any of our runs.
Chapter 5: The Final Score
| Approach | Pass Rate | vs Baseline | Regressions |
|---|---|---|---|
| Baseline (no training) | 73.3% | n/a | 0 |
| Naive LoRA | 60.0% | -13.3 pts | Many |
| Targeted LoRA (7 patterns) | 63.3% | -10.0 pts | 4+ |
| Replay buffer | 70.0% | -3.3 pts | 2 |
| Phase 2 coef-only (10K params) | 66.7% | -6.6 pts | 2 |
| Share Full (Ph2+Ph3) | 73.3% | 0.0 pts | 0 |
The Share algorithm did exactly what it claimed: it prevented forgetting. But it couldn’t improve beyond baseline because there was nothing to improve.
What Went Wrong
1. The Model Already Knows
The base model already passes 73% of patterns. Training on these patterns doesn’t add knowledge—it dilutes what’s there.
2. Training Causes Forgetting
Even training only on the 7 failure patterns (44 examples) caused 4 new regressions. The model’s knowledge is interconnected.
3. Averaging Destroys Specialization
The Share paper assumes task routing at inference—selecting the right coefficients for each task. We averaged coefficients, which negated any specialization.
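The difference between averaging and routing is easy to see with made-up numbers. In the sketch below the two coefficient vectors are hypothetical and deliberately opposite, an extreme case chosen so the cancellation is obvious; real coefficients would cancel only partially.

```python
# Hypothetical per-task coefficients along a two-vector frozen basis.
task_coeffs = {
    "borrow_checker": [0.8, -0.2],
    "type_bounds": [-0.8, 0.2],
}


def averaged(coeffs_by_task):
    """Blend all tasks into one coefficient vector (what we did)."""
    vectors = list(coeffs_by_task.values())
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def routed(coeffs_by_task, task):
    """Select the coefficients for the task at hand (what Share assumes)."""
    return coeffs_by_task[task]


blend = averaged(task_coeffs)                    # [0.0, 0.0]: both specializations cancel
picked = routed(task_coeffs, "borrow_checker")   # [0.8, -0.2]: specialization preserved
```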
4. More Adapters Made It Worse
| Adapter Count | Pass Rate |
|---|---|
| 6 adapters | 73.3% |
| 51 adapters | 70.0% |
More adapters meant more subspace dilution when averaging. The signal got lost in the noise.
The Critical Insight
LoRA fine-tuning cannot improve a capable base model for tasks it already handles reasonably well.
The model’s knowledge is interconnected. Even 10,000 trainable parameters (0.0007% of the model) can break things. The baseline represents the ceiling, not the floor.
What We Learned
- Read the room. If your base model passes 73%, maybe it doesn’t need fine-tuning. Maybe it needs better prompts.
- Negative results are results. 51 failed experiments taught us more than a successful one would have.
- Catastrophic forgetting is real. Small models especially can’t absorb new knowledge without losing old.
- Share prevents forgetting, not ignorance. The algorithm does what it claims; it just can’t create knowledge from nothing.
- Sometimes the answer is “don’t.” The best LoRA adapter for this task is no adapter.
- Task routing vs averaging matters. The Share paper assumes you select coefficients based on task type, not blend them together.
- AI coding agents cut corners. When implementing research papers, AI agents repeatedly stopped before completing all phases of the algorithm. I had to direct the agent to re-read the papers many times before it implemented them correctly.
Paths Forward
Since fine-tuning doesn’t work here, these are the alternatives:
| Approach | Tradeoff |
|---|---|
| Prompt engineering | No weight changes, limited by context |
| Multi-turn repair | Uses base model reasoning, slower |
| Larger model (7B+) | More capacity to absorb knowledge |
| Task routing with Share | Select coefficients, don’t average |
| Model ensemble | Multiple models, pick best output |
| Accept baseline | 73% may be good enough |
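For the task-routing row, one possible router keys on the error code that rustc prints in its `error[E....]` header. The mapping below from codes to this post's error families is an assumption for illustration, and it is incomplete: Result Handling has no single dedicated code, so such errors would land in another family or fall through to the default.

```python
import re
from typing import Optional

# Assumed mapping from rustc error codes to this post's error families.
FAMILY_BY_CODE = {
    "E0499": "borrow_checker",  # second mutable borrow
    "E0502": "borrow_checker",  # mutable borrow while immutably borrowed
    "E0515": "borrow_checker",  # returning a reference to local data
    "E0277": "type_bounds",     # trait bound not satisfied
    "E0308": "type_mismatch",   # mismatched types
    "E0425": "missing_items",   # unresolved name
}

ERROR_CODE = re.compile(r"error\[(E\d{4})\]")


def route(compiler_output: str) -> Optional[str]:
    """Return the family for the first recognized error code, else None."""
    for code in ERROR_CODE.findall(compiler_output):
        if code in FAMILY_BY_CODE:
            return FAMILY_BY_CODE[code]
    return None  # caller falls back to the unmodified base model


family = route(
    "error[E0502]: cannot borrow `v` as mutable because it is also borrowed as immutable"
)
```

Returning `None` for unrecognized output keeps the base model as the default path, so a routing miss costs nothing relative to baseline.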
The Numbers
Experiments run: 51 adapters, multiple algorithms
Parameters trained: From 10K to 1.6M per adapter
Best achieved: 73.3% (matches baseline)
Target: ≥76.7%
Conclusion: Target not achievable with LoRA
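Assuming every percentage in this post is a fraction of the 30 koans, the pass rates convert to whole-koan counts, which shows how small the gap to the target is:

```python
TOTAL_KOANS = 30  # from the benchmark description


def koans_passed(pass_rate_percent: float) -> int:
    """Convert a rounded pass rate back to a whole number of koans."""
    return round(pass_rate_percent / 100 * TOTAL_KOANS)


baseline = koans_passed(73.3)  # 22
target = koans_passed(76.7)    # 23
gap = target - baseline        # 1
```

Under that assumption, the target amounts to a single additional koan over baseline.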
Sometimes the most valuable research shows what doesn’t work. Fifty-one adapters later, we know: let sleeping models lie.
Part 1 of the Towards Continuous LLM Learning series.
