Deepseek Papers (3/3): Engram Revisited - From Emulation to Implementation
Abstract

We started by training models to act like they had memory. Then we found an open source implementation that does it for real. This is what we learned.
| Resource | Link |
|---|---|
| Paper | arXiv:2601.07372 |
| Our Code | engram-poc |
| Reference | weagan/Engram |
| Video | Engram Revisited |
| Playlist | All Engram Videos |
The Journey
Phase 1: Behavioral Emulation
Part 2 described our first approach: LoRA fine-tuning to make a model behave like it has memory. Train on patterns, and the model learns to respond consistently.
| Metric | Baseline | LoRA-tuned |
|---|---|---|
| Accuracy | 8.6% | 14.1% |
| Improvement | - | +63% relative |
It worked, but the architecture was unchanged. We were approximating Engram benefits, not implementing them.
Phase 2: The Discovery
Then we found weagan/Engram on GitHub—real hash-based memory in ~300 lines of Python:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedEngramModule(nn.Module):
    def __init__(self, table_size=50000, d_model=512):
        super().__init__()
        # Large learnable memory table
        self.memory_table = nn.Parameter(torch.zeros(table_size, d_model))
        # Gate decides when to trust memory
        self.gate = nn.Sequential(
            nn.Linear(d_model * 2, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid()
        )

    def forward(self, hidden_states, input_ids):
        # O(1) hash lookup
        indices = self.multi_head_hash(input_ids)
        retrieved = F.embedding(indices, self.memory_table)
        # Gated injection
        gate_score = self.gate(torch.cat([hidden_states, retrieved], dim=-1))
        return hidden_states + gate_score * retrieved
```
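The excerpt calls `self.multi_head_hash` but doesn't show it. One plausible reading, consistent with the forward pass above, is to fold several cheap hashes of the local token context into a single slot index per position; the constants and the bigram context below are our illustration, not the reference code:
```python
def multi_head_hash(self, input_ids):
    # Illustrative sketch only (would live inside EnhancedEngramModule):
    # two hash "heads" over the local context, folded into one index per
    # position so the forward pass above works unchanged.
    table_size = self.memory_table.shape[0]
    prev_ids = torch.roll(input_ids, shifts=1, dims=-1)          # left neighbor (wraps at position 0)
    h1 = (input_ids * 10007) % table_size                        # unigram head
    h2 = ((input_ids * 31 + prev_ids) * 10039) % table_size      # bigram head
    return (h1 ^ h2) % table_size                                # (batch, seq) slot indices
```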
The key insight: the gate decides when to trust the lookup. Not every token needs memory.
Phase 3: Integration with HuggingFace
We ported the module to work with HuggingFace models:
```
SmolLM-135M (frozen)
        ↓
EnhancedEngramModule (per layer)
  - 50K slot memory table
  - O(1) hash-based lookup
  - Learned gating
        ↓
Output
```
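In code, one way to do this (a sketch, not our exact training harness) is to freeze the base model and wrap its decoder layers so their outputs pass through the Engram module. The `current_batch` plumbing for getting token ids into the wrapper is our own assumption:
```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class EngramWrappedLayer(nn.Module):
    """Wraps one frozen decoder layer and adds gated Engram memory to its output."""
    def __init__(self, layer, engram, get_input_ids):
        super().__init__()
        self.layer = layer
        self.engram = engram
        self.get_input_ids = get_input_ids   # callable returning the current batch's token ids

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.layer(hidden_states, *args, **kwargs)
        hidden = outputs[0] if isinstance(outputs, tuple) else outputs
        hidden = self.engram(hidden, self.get_input_ids())
        if isinstance(outputs, tuple):
            return (hidden,) + outputs[1:]
        return hidden

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
for p in model.parameters():
    p.requires_grad = False                  # base model stays frozen

current_batch = {}                           # set current_batch["input_ids"] before each forward
for i, layer in enumerate(model.model.layers):
    model.model.layers[i] = EngramWrappedLayer(
        layer,
        EnhancedEngramModule(table_size=50_000, d_model=model.config.hidden_size),
        lambda: current_batch["input_ids"],
    )
```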
The proof it works—O(1) lookup regardless of sequence length:
| Sequence Length | Lookup Time | Expected if O(n) |
|---|---|---|
| 64 tokens | 0.15 ms | - |
| 2048 tokens | 2.77 ms | 4.8 ms |
Lookup time grows far more slowly than sequence length: the hash lookup itself is constant-time per position, so the overhead doesn't balloon on long contexts.
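A minimal timing sketch of the kind behind this table (illustrative only; batch size, iteration count, and vocab size are assumptions):
```python
import time
import torch

engram = EnhancedEngramModule(table_size=50_000, d_model=512)
engram.eval()

for seq_len in (64, 2048):
    input_ids = torch.randint(0, 49_152, (1, seq_len))   # vocab size assumed
    hidden = torch.randn(1, seq_len, 512)
    with torch.no_grad():
        engram(hidden, input_ids)                          # warm-up
        start = time.perf_counter()
        for _ in range(100):
            engram(hidden, input_ids)
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000
    print(f"{seq_len:5d} tokens: {elapsed_ms:.2f} ms per lookup")
```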
The Reality Check
Here’s where it gets interesting. Real Engram memory excels at some tasks and hurts others.
Where Engram Helps
| Task Type | Baseline | Engram | Change |
|---|---|---|---|
| Acronym expansion | 25% | 75% | +200% |
| Element symbols | 33% | 67% | +103% |
| Long-term fact recall | 90% | 100% | +11% |
For exact-match lookups with structured keys, Engram dominates.
Where Engram Hurts
| Task Type | Baseline | Engram | Change |
|---|---|---|---|
| World capitals | 83% | 67% | -19% |
| Pattern completion | 14% | 11% | -21% |
For tasks where the base model already knows the answer, Engram’s hash collisions add noise.
The Key Insight
Engram is a specialized tool, not a general enhancement.
| Use Engram For | Don’t Use Engram For |
|---|---|
| FAQ responses | Creative generation |
| Terminology lookup | Novel combinations |
| Entity facts | Context-dependent answers |
| Code boilerplate | Reasoning tasks |
The gating mechanism is critical: it must learn to suppress memory when it doesn’t help. Without proper gating, hash collisions inject noise into every token.
Obstacles Encountered
1. Hash Collisions
Different inputs can map to the same memory slot. The gate must learn to ignore irrelevant retrievals.
2. Parameter Explosion
50K slots × 768 dimensions × 30 layers = 1.2B additional parameters. We had to inject selectively (every 4th layer) to stay practical.
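The arithmetic, back-of-the-envelope (the every-4th-layer figure below is our own calculation):
```python
table_size, d_model, n_layers = 50_000, 768, 30

every_layer  = table_size * d_model * n_layers                     # ~1.15B extra parameters
every_fourth = table_size * d_model * len(range(0, n_layers, 4))   # 8 layers -> ~307M

print(f"every layer: {every_layer / 1e9:.2f}B, every 4th: {every_fourth / 1e6:.0f}M")
```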
3. Training Dynamics
Memory tables start at zero. They need higher learning rates (10x) to develop meaningful representations before the model learns to use them.
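In practice that means a separate optimizer parameter group for the tables (the base learning rate here is a placeholder, not our actual setting):
```python
import torch

base_lr = 1e-4                                                   # placeholder value
memory_params = [p for n, p in model.named_parameters()
                 if "memory_table" in n and p.requires_grad]
other_params  = [p for n, p in model.named_parameters()
                 if "memory_table" not in n and p.requires_grad]

optimizer = torch.optim.AdamW([
    {"params": other_params,  "lr": base_lr},
    {"params": memory_params, "lr": base_lr * 10},               # zero-initialized tables learn faster
])
```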
4. Evaluation Mismatch
Our pattern completion task wasn’t ideal for hash-based memory. Engram shines on exact-match retrieval, not generalization.
Combined Approach
The best results came from combining both methods:
```
Base Model (SmolLM-135M)
        ↓
EnhancedEngramModule
  - Long-term fact storage
  - O(1) lookup for known patterns
        ↓
LoRA Adapters
  - Pattern completion
  - Domain-specific behaviors
        ↓
Output
```
This gives you:
- Long-term memory from hash tables
- Pattern consistency from behavioral training
- Flexibility to disable either component
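A sketch of how the two can be combined with peft on top of the Engram-wrapped model from the integration step (rank and target modules are illustrative, not the settings from Part 2):
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],     # attention projections in Llama-style layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Keep the memory tables and gates trainable alongside the LoRA adapters
for name, param in model.named_parameters():
    if "memory_table" in name or "gate" in name:
        param.requires_grad = True

model.print_trainable_parameters()
```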
What We Learned
- Emulation vs Implementation: LoRA fine-tuning approximates memory behavior; hash tables implement it. Both have their place.
- Gating is Essential: The learned gate prevents hash collisions from degrading performance. Never use Engram without gating.
- Match Task to Tool: Hash-based memory excels at exact lookups, not pattern generalization. Use it where applicable.
- Selective Application: Don’t inject Engram everywhere. Target layers and use cases where it helps.
- The Gate as a Safety Valve: When the gate learns to output near-zero for a task, that’s the model telling you Engram doesn’t help there. Listen to it.
Resources
- Engram Paper (arXiv:2601.07372)
- engram-poc Repository - Our implementation
- weagan/Engram - Reference implementation
- Engram Revisited Video
- Engram Video Playlist
- Part 1: mHC
- Part 2: Engram Introduction
Series Recap
| Part | Topic | Key Insight |
|---|---|---|
| 1 | mHC | Doubly-stochastic constraints bound signal amplification |
| 2 | Engram Intro | O(1) lookup beats recomputing through attention |
| 3 | Engram Revisited | Use Engram where applicable; gate to avoid worse results |
Hash-based memory is powerful but specialized. The gate decides when to use it—and when not to.
Part 3 of the Deepseek Papers series. View all parts
