ML Frontier #04: Is Chain of Thought Real?
1201 words • 7 min read • Abstract

Fourth ML Frontier episode. In 2022, Chain of Thought changed how we think about AI reasoning. By 2026, the question has shifted from “how to make CoT better” to “is it real reasoning at all?”
| Resource | Link |
|---|---|
| Papers | 10 papers (2024–2026) |
| Video | ML Frontier 4: Is CoT Real?![]() |
| Comments | Discord |
What Chain of Thought Promised
Wei et al. (2022) showed that prompting language models to “think step by step” dramatically improved performance on math, logic, and multi-step reasoning tasks. The idea was simple: instead of jumping straight to an answer, generate intermediate reasoning steps. The model explains its work, and the answer improves.
This became the foundation for everything from coding assistants to scientific reasoning pipelines. But a deeper question was always lurking: are those reasoning steps real?
The Faithfulness Problem
Recent research shows models can produce plausible-looking reasoning steps that don’t reflect their actual internal computation. The chain of thought looks like reasoning—it has logical connectives, intermediate conclusions, references to the problem—but the model may have arrived at the answer through entirely different internal pathways.
This is the faithfulness gap. A model’s visible reasoning trace can be:
- Post-hoc rationalization — constructing a justification after already deciding the answer
- Biased by surface features — following patterns in the prompt rather than the problem structure
- Unfaithful to internal state — the actual computation happening in the model’s hidden layers doesn’t match the text it generates
“Reasoning Models Don’t Always Say What They Think” (arXiv 2505.05410) shows that even models specifically trained to reason via CoT produce traces that are often unfaithful to their actual decision process. A March 2026 follow-up, “Reasoning Models Struggle to Control their Chains of Thought” (arXiv 2603.05706), goes further: models can’t reliably steer or suppress their reasoning traces even when instructed to. And “Diagnosing Pathological Chain-of-Thought in Reasoning Models” (arXiv 2602.13904) catalogs specific failure modes where CoT reasoning becomes actively pathological.
The implication is uncomfortable: when a model shows you its “thinking,” you can’t assume it’s showing you how it actually thinks.
CoT Is Task-Dependent
The research also reveals that Chain of Thought isn’t universally beneficial. It helps with:
- Math and arithmetic — multi-step calculations benefit from intermediate results
- Multi-hop logic — problems requiring sequential deductions
- Complex planning — tasks with dependencies between steps
But CoT can actually hurt performance on:
- Pattern recognition — tasks where the answer is immediate/intuitive
- Simple classification — forcing step-by-step reasoning adds noise
- Tasks with misleading structure — when the “obvious” reasoning path leads away from the correct answer
A 2024 meta-analysis, “To CoT or not to CoT?” (arXiv 2409.12183), confirms this systematically: CoT provides negligible or negative benefit on many task types including commonsense reasoning and factual retrieval. The blanket advice of “always use Chain of Thought” is wrong. The right approach depends on the task.
Adaptive Reasoning: Knowing When to Think
The field is moving toward conditional reasoning—models that decide when to think step by step and when to skip it. Wang and Zhou (2024) showed that chain-of-thought reasoning can emerge from models without explicit prompting, suggesting the capability is latent rather than purely prompt-dependent.
This points toward a future where models dynamically allocate reasoning effort:
- Simple questions get immediate answers
- Complex questions trigger step-by-step decomposition
- Ambiguous questions get clarifying sub-questions
“s1: Simple Test-Time Scaling” (arXiv 2501.19393) demonstrates this with budget-forcing—a simple mechanism to control how much reasoning a model performs at test time, truncating or extending thinking adaptively. “Outcome-Based RL Provably Leads Transformers to Reason” (arXiv 2601.15158) shows that RL training can teach transformers when reasoning is needed, not just how to reason. The model itself becomes the judge of how much thinking a problem requires.
Latent Reasoning: Thinking Without Showing
Some of the most interesting current work explores latent reasoning—models that reason internally without generating visible steps. Instead of producing a text trace, the model uses its internal representations to perform multi-step computation within the forward pass.
This connects to research on:
- Implicit chain of thought — reasoning encoded in hidden states rather than output tokens
- Pause tokens — giving models extra computation steps without requiring text output
- Internal scratchpads — dedicated hidden-state computation that doesn’t appear in the response
COCONUT (arXiv 2412.06769) demonstrates this concretely: LLMs reason using continuous hidden states as “thoughts” instead of generating discrete tokens. Two 2026 papers push further: “Latent Chain-of-Thought as Planning” (arXiv 2601.21358) decouples reasoning from verbalization entirely, and “Latent Reasoning with Supervised Thinking States” (arXiv 2602.08332) trains models to reason through supervised internal states.
If latent reasoning works at scale, it could offer the accuracy benefits of CoT without the token cost or the faithfulness problem—because there’s no visible trace to be unfaithful.
CoT as Part of an Ecosystem
Chain of Thought is no longer a standalone technique. It’s one component in a broader ecosystem:
| Component | Role |
|---|---|
| CoT | Step-by-step decomposition |
| Tool use | Offload computation to external systems |
| Reflection | Self-critique and error correction |
| Planning loops | Multi-turn strategy with backtracking |
| Reinforcement learning | Reward signals for reasoning quality |
This makes CoT the bridge concept between language models as predictors (next token), as reasoners (multi-step logic), and as agents (goal-directed behavior). Understanding where CoT works and where it breaks is essential for building systems that combine all three.
The Open Questions
| Question | Status |
|---|---|
| Is CoT faithful to internal computation? | Evidence says often not |
| When should models use CoT? | Task-dependent; adaptive approaches emerging |
| Can models reason without visible steps? | Latent reasoning research is promising |
| Does CoT scale with model size? | Yes, but with diminishing returns on simple tasks |
| Will CoT survive as a technique? | Likely evolves into adaptive/latent forms |
Papers
| Date | Paper | Link |
|---|---|---|
| Jan 2022 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | arXiv 2201.11903 |
| Sep 2024 | To CoT or not to CoT? CoT Helps Mainly on Math and Symbolic Reasoning | arXiv 2409.12183 |
| Dec 2024 | Training LLMs to Reason in a Continuous Latent Space (COCONUT) | arXiv 2412.06769 |
| Jan 2025 | s1: Simple Test-Time Scaling | arXiv 2501.19393 |
| May 2025 | Reasoning Models Don’t Always Say What They Think | arXiv 2505.05410 |
| Jan 2026 | Outcome-Based RL Provably Leads Transformers to Reason | arXiv 2601.15158 |
| Jan 2026 | Latent Chain-of-Thought as Planning | arXiv 2601.21358 |
| Feb 2026 | Latent Reasoning with Supervised Thinking States | arXiv 2602.08332 |
| Feb 2026 | Diagnosing Pathological Chain-of-Thought in Reasoning Models | arXiv 2602.13904 |
| Mar 2026 | Reasoning Models Struggle to Control their Chains of Thought | arXiv 2603.05706 |
Is the reasoning real, or just a good story? Follow for more ML Frontier episodes exploring research at the edge.
Part 4 of the Machine Learning Frontier series. View all parts
Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.
