# How AI Learns Part 4: Memory-Based Learning

Modern AI systems increasingly rely on external memory.
This shifts much of “learning” out of the model’s parameters and into storage the model can read at inference time.
| Resource | Link |
|---|---|
| Related | Engram |
| Related | Engram Revisited |
| Related | Multi-hop RAG |
## The Memory Paradigm
### Why External Memory?
Most of what looks like “learning new facts” should not modify weights.
Weights are for generalization: they encode reasoning patterns, language structure, and broad capability.
Memory is for storage: it holds specific facts, documents, and experiences.
If you store everything in weights:
- You create interference
- You risk forgetting
- You must retrain
If you store facts in memory:
- No forgetting
- Fast updates
- Survives model upgrades
## Retrieval-Augmented Generation (RAG)
Documents are embedded into vectors. At query time:
- Embed the query
- Search the vector database
- Retrieve relevant documents
- Inject into prompt
- Generate grounded response
The model does not need to remember facts internally. It retrieves them on demand.
### RAG Benefits
| Benefit | Description |
|---|---|
| No forgetting | External storage, not weights |
| Persistent | Survives restarts and model changes |
| Scalable | Add documents without retraining |
| Verifiable | Can cite sources |
### RAG Challenges
- Retrieval precision (wrong docs = bad answers)
- Latency (search takes time)
- Index maintenance
- Chunk boundaries
## Cache-Augmented Generation (CAG)
Instead of retrieving from a vector database at query time, CAG caches previously computed context or key-value (KV) attention states and reuses them.
Use cases:
- Repeated knowledge tasks
- Multi-turn conversations
- Pre-computed context windows
Benefits over RAG:
- Often faster (no embedding + search)
- More deterministic
- Good for structured repeated workflows
Trade-offs:
- Less flexible
- Cache management complexity
## Engram-Style Memory
Recent proposals (e.g., DeepSeek research) introduce conditional memory modules with direct indexing.
Instead of scanning long context or searching vectors:
- Memory slots indexed directly
- O(1) lookup instead of O(n) attention
- Separates static knowledge from dynamic reasoning
The goal: Constant-time memory access that doesn’t scale with context length.
This changes the compute story:
- Don’t waste attention on “known facts”
- Reserve compute for reasoning
- Avoid context rot
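The contrast between direct indexing and attention over long context can be illustrated with a plain hash map: a slot lookup touches one entry, while the `scan_context` alternative walks every token. This `MemorySlots` class is an illustrative sketch of the idea, not DeepSeek's actual architecture:

```python
class MemorySlots:
    """Directly indexed memory: O(1) average-case lookup per key."""
    def __init__(self):
        self._slots = {}

    def write(self, key, fact):
        self._slots[key] = fact

    def read(self, key):
        return self._slots.get(key)

def scan_context(context_tokens, key):
    # The O(n) alternative: scan the whole context for the fact.
    for token in context_tokens:
        if token.startswith(key + "="):
            return token.split("=", 1)[1]
    return None

mem = MemorySlots()
mem.write("capital:Australia", "Canberra")
assert mem.read("capital:Australia") == "Canberra"  # one hash lookup, no scan
assert mem.read("capital:Mars") is None             # unknown key, still O(1)
```

The point of the sketch: lookup cost is independent of how much is stored, so “known facts” stop competing with reasoning for attention budget.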
## Model Editing
A related technique is model editing: surgically patching specific factual associations without full fine-tuning.
Example: The model says “The capital of Australia is Sydney.” You edit the specific association to “Canberra” without retraining.
Pros:
- Targeted fixes
- Fast
Cons:
- Side effects possible
- Consistency not guaranteed
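The Sydney→Canberra example can be sketched as a rank-one update to a toy linear associative memory, loosely in the spirit of rank-one editing methods such as ROME. The vectors and helper names here are invented for illustration; real editors operate on transformer MLP weights:

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rank_one_edit(W, key, v_new):
    """Return W patched so that W @ key == v_new, with a minimal rank-one change."""
    v_old = matvec(W, key)
    kk = sum(k * k for k in key)
    return [
        [W[i][j] + (v_new[i] - v_old[i]) * key[j] / kk for j in range(len(key))]
        for i in range(len(W))
    ]

# Keys: one-hot codes for subjects; values: one-hot codes for cities.
k_australia, k_france = [1.0, 0.0], [0.0, 1.0]
v_sydney, v_canberra, v_paris = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

# A "model" that wrongly maps Australia -> Sydney and correctly France -> Paris.
W = [[v[i] for v in (v_sydney, v_paris)] for i in range(3)]

W = rank_one_edit(W, k_australia, v_canberra)
assert matvec(W, k_australia) == v_canberra   # the fact is patched
assert matvec(W, k_france) == v_paris         # the unrelated fact is untouched
```

The side-effect risk in the cons list shows up here too: with non-orthogonal keys (unlike the one-hot keys above), the same update would bleed into other associations.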
## The Key Distinction
| Aspect | Weight Learning | Memory Learning |
|---|---|---|
| Location | Parameters | External storage |
| Persistence | Model lifetime | Storage lifetime |
| Forgetting risk | High | Minimal |
| Update speed | Slow (training) | Fast (database) |
| Survives model change? | No | Yes |
## When to Use What
| Situation | Approach |
|---|---|
| Need new reasoning capability | Weight-based (fine-tune) |
| Need to know new facts | Memory-based (RAG) |
| Need domain expertise | Weight-based (LoRA) |
| Need to cite sources | Memory-based (RAG) |
| Frequently changing data | Memory-based (RAG/CAG) |
## References
| Concept | Paper |
|---|---|
| RAG | Retrieval-Augmented Generation (Lewis et al. 2020) |
| Engram | Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025) |
| REALM | REALM: Retrieval-Augmented Pre-Training (Guu et al. 2020) |
| Model Editing | Editing Factual Knowledge (De Cao et al. 2021) |
## Coming Next
In Part 5, we’ll examine context engineering and recursive reasoning: ICL, RLM, and techniques that prevent context rot during inference.
The brain stays stable. The notebook grows.
Part 4 of the How AI Learns series. View all parts | Next: Part 5 →