Deepseek Papers (3/3): Engram Revisited - From Emulation to Implementation
Abstract

We started by training models to act like they had memory. Then we found an open source implementation that does it for real. This is what we learned.
| Resource | Link |
|---|---|
| Paper | arXiv:2601.07372 |
| Our Code | engram-poc |
| Reference | weagan/Engram |
| Video | Engram Revisited |
| Playlist | All Engram Videos |
The Journey
Phase 1: Behavioral Emulation
Part 2 described our first approach: LoRA fine-tuning to make a model behave like it has memory. Train on patterns, and the model learns to respond consistently.
| Metric | Baseline | LoRA-tuned |
|---|---|---|
| Accuracy | 8.6% | 14.1% |
| Improvement | - | +63% relative |
It worked, but the architecture was unchanged. We were approximating Engram benefits, not implementing them.
Phase 2: The Discovery
Then we found weagan/Engram on GitHub—real hash-based memory in ~300 lines of Python:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedEngramModule(nn.Module):
    def __init__(self, table_size=50000, d_model=512):
        super().__init__()
        # Large learnable memory table
        self.memory_table = nn.Parameter(torch.zeros(table_size, d_model))
        # Gate decides when to trust memory
        self.gate = nn.Sequential(
            nn.Linear(d_model * 2, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid()
        )

    def forward(self, hidden_states, input_ids):
        # O(1) hash lookup
        indices = self.multi_head_hash(input_ids)
        retrieved = F.embedding(indices, self.memory_table)
        # Gated injection
        gate_score = self.gate(torch.cat([hidden_states, retrieved], dim=-1))
        return hidden_states + gate_score * retrieved
```
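The excerpt calls `self.multi_head_hash` but doesn't show it. One plausible reading, consistent with the forward pass above, is to fold several cheap hashes of the local token context into a single slot index per position; the constants and the bigram context below are our illustration, not the reference code:
```python
def multi_head_hash(self, input_ids):
    # Illustrative sketch only (would live inside EnhancedEngramModule):
    # two hash "heads" over the local context, folded into one index per
    # position so the forward pass above works unchanged.
    table_size = self.memory_table.shape[0]
    prev_ids = torch.roll(input_ids, shifts=1, dims=-1)          # left neighbor (wraps at position 0)
    h1 = (input_ids * 10007) % table_size                        # unigram head
    h2 = ((input_ids * 31 + prev_ids) * 10039) % table_size      # bigram head
    return (h1 ^ h2) % table_size                                # (batch, seq) slot indices
```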
The key insight: the gate decides when to trust the lookup. Not every token needs memory.
Phase 3: Integration with HuggingFace
We ported the module to work with HuggingFace models:
```
SmolLM-135M (frozen)
        ↓
EnhancedEngramModule (per layer)
  - 50K slot memory table
  - O(1) hash-based lookup
  - Learned gating
        ↓
Output
```
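In code, one way to do this (a sketch, not our exact training harness) is to freeze the base model and wrap its decoder layers so their outputs pass through the Engram module. The `current_batch` plumbing for getting token ids into the wrapper is our own assumption:
```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

class EngramWrappedLayer(nn.Module):
    """Wraps one frozen decoder layer and adds gated Engram memory to its output."""
    def __init__(self, layer, engram, get_input_ids):
        super().__init__()
        self.layer = layer
        self.engram = engram
        self.get_input_ids = get_input_ids   # callable returning the current batch's token ids

    def forward(self, hidden_states, *args, **kwargs):
        outputs = self.layer(hidden_states, *args, **kwargs)
        hidden = outputs[0] if isinstance(outputs, tuple) else outputs
        hidden = self.engram(hidden, self.get_input_ids())
        if isinstance(outputs, tuple):
            return (hidden,) + outputs[1:]
        return hidden

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
for p in model.parameters():
    p.requires_grad = False                  # base model stays frozen

current_batch = {}                           # set current_batch["input_ids"] before each forward
for i, layer in enumerate(model.model.layers):
    model.model.layers[i] = EngramWrappedLayer(
        layer,
        EnhancedEngramModule(table_size=50_000, d_model=model.config.hidden_size),
        lambda: current_batch["input_ids"],
    )
```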
The proof it works—O(1) lookup regardless of sequence length:
| Sequence Length | Lookup Time | Expected if O(n) |
|---|---|---|
| 64 tokens | 0.15 ms | - |
| 2048 tokens | 2.77 ms | 4.8 ms |
Lookup time grows far more slowly than sequence length: the hash lookup itself is constant-time per position, so the overhead doesn't balloon on long contexts.
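A minimal timing sketch of the kind behind this table (illustrative only; batch size, iteration count, and vocab size are assumptions):
```python
import time
import torch

engram = EnhancedEngramModule(table_size=50_000, d_model=512)
engram.eval()

for seq_len in (64, 2048):
    input_ids = torch.randint(0, 49_152, (1, seq_len))   # vocab size assumed
    hidden = torch.randn(1, seq_len, 512)
    with torch.no_grad():
        engram(hidden, input_ids)                          # warm-up
        start = time.perf_counter()
        for _ in range(100):
            engram(hidden, input_ids)
    elapsed_ms = (time.perf_counter() - start) / 100 * 1000
    print(f"{seq_len:5d} tokens: {elapsed_ms:.2f} ms per lookup")
```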
The Reality Check
Here’s where it gets interesting. Real Engram memory excels at some tasks and hurts others.
Where Engram Helps
| Task Type | Baseline | Engram | Change |
|---|---|---|---|
| Acronym expansion | 25% | 75% | +200% |
| Element symbols | 33% | 67% | +103% |
| Long-term fact recall | 90% | 100% | +11% |
For exact-match lookups with structured keys, Engram dominates.
Where Engram Hurts
| Task Type | Baseline | Engram | Change |
|---|---|---|---|
| World capitals | 83% | 67% | -19% |
| Pattern completion | 14% | 11% | -21% |
For tasks where the base model already knows the answer, Engram’s hash collisions add noise.
The Key Insight
Engram is a specialized tool, not a general enhancement.
| Use Engram For | Don’t Use Engram For |
|---|---|
| FAQ responses | Creative generation |
| Terminology lookup | Novel combinations |
| Entity facts | Context-dependent answers |
| Code boilerplate | Reasoning tasks |
The gating mechanism is critical: it must learn to suppress memory when it doesn’t help. Without proper gating, hash collisions inject noise into every token.
Obstacles Encountered
1. Hash Collisions
Different inputs can map to the same memory slot. The gate must learn to ignore irrelevant retrievals.
2. Parameter Explosion
50K slots × 768 dimensions × 30 layers = 1.2B additional parameters. We had to inject selectively (every 4th layer) to stay practical.
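The arithmetic, back-of-the-envelope (the every-4th-layer figure below is our own calculation):
```python
table_size, d_model, n_layers = 50_000, 768, 30

every_layer  = table_size * d_model * n_layers                     # ~1.15B extra parameters
every_fourth = table_size * d_model * len(range(0, n_layers, 4))   # 8 layers -> ~307M

print(f"every layer: {every_layer / 1e9:.2f}B, every 4th: {every_fourth / 1e6:.0f}M")
```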
3. Training Dynamics
Memory tables start at zero. They need higher learning rates (10x) to develop meaningful representations before the model learns to use them.
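In practice that means a separate optimizer parameter group for the tables (the base learning rate here is a placeholder, not our actual setting):
```python
import torch

base_lr = 1e-4                                                   # placeholder value
memory_params = [p for n, p in model.named_parameters()
                 if "memory_table" in n and p.requires_grad]
other_params  = [p for n, p in model.named_parameters()
                 if "memory_table" not in n and p.requires_grad]

optimizer = torch.optim.AdamW([
    {"params": other_params,  "lr": base_lr},
    {"params": memory_params, "lr": base_lr * 10},               # zero-initialized tables learn faster
])
```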
4. Evaluation Mismatch
Our pattern completion task wasn’t ideal for hash-based memory. Engram shines on exact-match retrieval, not generalization.
Combined Approach
The best results came from combining both methods:
```
Base Model (SmolLM-135M)
        ↓
EnhancedEngramModule
  - Long-term fact storage
  - O(1) lookup for known patterns
        ↓
LoRA Adapters
  - Pattern completion
  - Domain-specific behaviors
        ↓
Output
```
This gives you:
- Long-term memory from hash tables
- Pattern consistency from behavioral training
- Flexibility to disable either component
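A sketch of how the two can be combined with peft on top of the Engram-wrapped model from the integration step (rank and target modules are illustrative, not the settings from Part 2):
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],     # attention projections in Llama-style layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Keep the memory tables and gates trainable alongside the LoRA adapters
for name, param in model.named_parameters():
    if "memory_table" in name or "gate" in name:
        param.requires_grad = True

model.print_trainable_parameters()
```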
What We Learned
- Emulation vs Implementation: LoRA fine-tuning approximates memory behavior; hash tables implement it. Both have their place.
- Gating is Essential: The learned gate prevents hash collisions from degrading performance. Never use Engram without gating.
- Match Task to Tool: Hash-based memory excels at exact lookups, not pattern generalization. Use it where applicable.
- Selective Application: Don’t inject Engram everywhere. Target layers and use cases where it helps.
- The Gate as a Safety Valve: When the gate learns to output near-zero for a task, that’s the model telling you Engram doesn’t help there. Listen to it.
Resources
- Engram Paper (arXiv:2601.07372)
- engram-poc Repository - Our implementation
- weagan/Engram - Reference implementation
- Engram Revisited Video
- Engram Video Playlist
- Part 1: mHC
- Part 2: Engram Introduction
Series Recap
| Part | Topic | Key Insight |
|---|---|---|
| 1 | mHC | Doubly-stochastic constraints bound signal amplification |
| 2 | Engram Intro | O(1) lookup beats recomputing through attention |
| 3 | Engram Revisited | Use Engram where applicable; gate to avoid worse results |
Hash-based memory is powerful but specialized. The gate decides when to use it—and when not to.
Part 3 of the Deepseek Papers series. View all parts
