When people say, “AI learned something,” they usually mean one of at least four very different things.

Large Language Models (LLMs) do not learn in one single way. They learn at different time scales, in different locations, and with very different consequences. To understand modern AI systems—especially agents—we need to separate these layers.


Four Time Scales of Learning

[Figure: concentric rings showing the four time scales of learning: core weights, adapters, external memory, and prompt/context.]
Learning happens at different layers with different persistence and speed.

1. Pretraining (Years)

This is the foundation.

The model trains on massive datasets using gradient descent. The result is a set of weights—billions of parameters—encoding statistical structure of language and knowledge.

This learning:

  • Is slow and expensive
  • Persists across restarts
  • Cannot easily be reversed
  • Is vulnerable to interference if modified later

Think of this as long-term biological memory.
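
The mechanism at the heart of this layer can be sketched as a single gradient-descent step on a toy loss. This is a deliberately minimal illustration, not a real training loop; the function names and the one-parameter "model" are invented for the example.

```python
# A minimal sketch of weight-based learning: one gradient-descent step
# on the toy loss (w*x - y)^2. In pretraining, this update is repeated
# billions of times, and the resulting w persists across restarts.

def grad_step(w, x, y, lr=0.1):
    """One gradient-descent step for the loss (w*x - y)^2."""
    pred = w * x
    grad = 2 * (pred - y) * x   # dL/dw
    return w - lr * grad        # the weight itself changes

w = 0.0
for _ in range(100):            # the "slow" part: many passes over data
    w = grad_step(w, x=1.0, y=3.0)

print(round(w, 3))  # converges toward 3.0
```

The key property to notice: the learned value lives *in the weight*, so it survives any restart, but changing it later risks disturbing everything else the weight encodes.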

2. Fine-Tuning (Days to Weeks)

Fine-tuning modifies the weights further, but with narrower data.

This includes:

  • Instruction tuning (following directions)
  • Alignment methods (Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO))
  • Domain adaptation
  • Parameter-efficient methods like Low-Rank Adaptation (LoRA)

This is still weight-based learning.

It persists across restarts. It risks catastrophic forgetting. It modifies the brain itself.
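
The adapter idea behind LoRA can be sketched in a few lines: freeze the pretrained matrix W and learn only a low-rank update B @ A that is added at inference time. Shapes and values below are toy examples, not the paper's implementation.

```python
# A minimal sketch of the LoRA idea: W stays frozen; only the low-rank
# factors B (d x r) and A (r x d) are trained. Here r = 1.

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 0.0], [0.0, 1.0]]    # frozen pretrained weights (2x2)
B = [[1.0], [0.0]]              # trainable low-rank factor (2x1)
A = [[0.0, 0.5]]                # trainable low-rank factor (1x2)

W_eff = add(W, matmul(B, A))    # effective weights: W + B @ A
print(W_eff)  # [[1.0, 0.5], [0.0, 1.0]]
```

Deleting B and A restores W exactly, which is why adapter-based learning can be modular and reversible in a way that full fine-tuning is not.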

3. Memory-Based Learning (Seconds to Minutes)

This is where many modern systems shift away from weight updates entirely.

Instead of changing weights, they store information externally:

  • RAG (Retrieval-Augmented Generation)
  • CAG (Cache-Augmented Generation)
  • Vector databases
  • Engram-style memory modules

The model retrieves relevant memory per query.

The brain stays stable. The notebook grows.

This learning:

  • Persists across restarts
  • Survives model upgrades
  • Does not cause forgetting
  • Is fast
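
A minimal sketch of this pattern, using hand-made toy vectors in place of a real embedding model (`remember` and `retrieve` are illustrative names, not any library's API):

```python
# A minimal sketch of memory-based learning: the model's weights never
# change; new facts are written to an external store instantly and
# retrieved per query by vector similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

memory = []                        # the "notebook": grows, never forgets

def remember(vec, text):
    memory.append((vec, text))     # no gradient step, instant write

def retrieve(query_vec):
    return max(memory, key=lambda m: cosine(m[0], query_vec))[1]

remember([1.0, 0.0], "The deploy key rotated on Tuesday.")
remember([0.0, 1.0], "The staging cluster runs Kubernetes 1.29.")

print(retrieve([0.9, 0.1]))  # → "The deploy key rotated on Tuesday."
```

Because the store lives outside the model, swapping in a new model (with a compatible embedding step) leaves every memory intact.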

4. In-Context Learning (Milliseconds)

This is temporary reasoning scaffolding.

Information exists only in the prompt window.

It:

  • Does not update weights
  • Does not persist across sessions
  • Is powerful but fragile
  • Suffers from context rot

This is working memory.
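
In code, the entire "learning" at this layer is just string construction (`build_prompt` is a hypothetical helper, not a real API):

```python
# A minimal sketch of in-context learning: the task is demonstrated by
# examples placed in the prompt; nothing outlives the prompt string.

def build_prompt(examples, query):
    shots = "\n".join(f"Input: {x} -> Output: {y}" for x, y in examples)
    return f"{shots}\nInput: {query} -> Output:"

examples = [("cat", "CAT"), ("dog", "DOG")]   # shown, never trained on
prompt = build_prompt(examples, "fox")
print(prompt)
# When the session ends, `prompt` is gone: no weights touched, no store
# written. That is both its safety and its fragility.
```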

Why This Matters

Most discussions collapse all of this into “the model learned.”

But:

  • Updating weights risks forgetting
  • Updating memory does not
  • Updating prompts does not persist
  • Updating adapters can be modular and reversible

Continuous learning systems must coordinate all four.
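
One hypothetical way such coordination might look is a router that sends each kind of update to the layer whose trade-offs fit it. The categories and rules here are illustrative, not a standard design.

```python
# A hypothetical sketch of routing updates across the four layers,
# following the trade-offs listed above.

def route_update(kind):
    """Pick a learning layer for a given kind of update."""
    if kind == "core_capability":
        return "pretraining"    # slow, expensive, risks interference
    if kind == "domain_behavior":
        return "adapter"        # modular and reversible (e.g. LoRA)
    if kind == "new_fact":
        return "memory"         # fast, no forgetting, survives upgrades
    return "prompt"             # temporary working memory only

print(route_update("new_fact"))  # → memory
```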

Persistence Comparison

Mechanism      Persists Across Chat?   Persists Across Restart?   Persists Across Model Change?
Pretraining    Yes                     Yes                        No
Fine-tune      Yes                     Yes                        No
LoRA           Yes                     Yes                        Usually
Distillation   Yes                     Yes                        No
ICL            No                      No                         No
RAG            Yes                     Yes                        Yes
Engram         Yes                     Yes                        Yes
CAG            Yes                     Yes                        Yes

That last column is subtle but powerful for agents.

References

  • LoRA: "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
  • RAG: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)
  • ICL: "What Can Transformers Learn In-Context?" (Garg et al., 2022)
  • Engram: "Engram: Conditional Memory via Scalable Lookup" (DeepSeek, 2025)
  • DPO: "Direct Preference Optimization" (Rafailov et al., 2023)

Coming Next

In Part 2, we’ll examine the two fundamental failure modes that arise from confusing these layers: catastrophic forgetting and context rot.


Learning happens in layers of permanence.