# Multi-Hop Reasoning (1/2): Training Wheels for Small LLMs
692 words • 4 min read

A tiny 135M parameter model goes from 0% to 75% accuracy in 5 minutes of training. The secret? Knowledge graph-guided training with rejection sampling.
| Resource | Link |
|---|---|
| Paper | KG-Guided RAG (arXiv) |
| Code | multi-hop-reasoning |
| ELI5 | eli5.md |
| Demo | Live Demo |
| Video | LLM with Training Wheels |
| Part 2 | The Distribution Trap |
## The Problem: Multi-Hop Reasoning
LLMs struggle with questions requiring multiple reasoning steps. “What’s the fix for a crash caused by a corrupted config file on a system running outdated firmware?” requires connecting several facts:
- Corrupted config → need config reset
- Outdated firmware → need firmware update
- Crash context → check dependencies between these fixes
Standard fine-tuning teaches pattern matching. Multi-hop reasoning requires following logical chains.
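Following a logical chain is essentially a walk over a graph of facts. A minimal sketch (the entities and edge names below are hypothetical, not taken from the paper's graph) makes the "hops" concrete:

```python
# Hypothetical troubleshooting facts as an adjacency list.
# Each hop follows one labeled edge from entity to entity.
KG = {
    "crash": [("caused_by", "corrupted_config"), ("caused_by", "outdated_firmware")],
    "corrupted_config": [("fixed_by", "config_reset")],
    "outdated_firmware": [("fixed_by", "firmware_update")],
}

def hops_from(entity, depth=0, max_depth=3):
    """Enumerate reasoning chains reachable from `entity`."""
    if depth >= max_depth:
        return [[entity]]
    chains = []
    for rel, target in KG.get(entity, []):
        for tail in hops_from(target, depth + 1, max_depth):
            chains.append([entity, f"--{rel}-->"] + tail)
    return chains or [[entity]]

for chain in hops_from("crash"):
    print(" ".join(chain))
```

Answering the question above means completing two such chains and then reconciling them, which is exactly what pattern matching alone fails to do.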
## The Paper’s Approach
Learn with training wheels, remove them after learning completes.
Knowledge Graph-Guided RAG from Princeton proposes using knowledge graphs during training to score reasoning quality—then removing the graph at inference.
The key insight: train with scaffolding, test without it.
## My Implementation
The repo implements this for a software troubleshooting domain:
| Component | Details |
|---|---|
| Knowledge Graph | ~200 entities, ~600 edges (symptoms, causes, fixes) |
| Training Data | MCQs with 1-3 hop paths |
| Eval Data | MCQs with 4-5 hop paths (harder) |
| Model | SmolLM-135M-Instruct |
| Framework | MLX (Apple Silicon native) |
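One way to picture the training data: each MCQ carries a gold reasoning path through the graph, whose length is the hop count. This is a hypothetical shape for one item, not the repo's exact schema:

```python
# Hypothetical training example; field names are assumptions,
# not the repo's actual data format.
example = {
    "question": "A crash occurs after a config edit on old firmware. First step?",
    "choices": {"A": "Reinstall OS", "B": "Reset config", "C": "Replace disk", "D": "Ignore"},
    "answer": "B",
    "gold_path": ["crash", "corrupted_config", "config_reset"],  # a 2-hop path
    "hops": 2,
}
```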
## The Training Pipeline

```
┌─────────────────────────────────────────┐
│ 1. SFT: Learn output format             │
│    TRACE: <reasoning>                   │
│    ANSWER: A|B|C|D                      │
├─────────────────────────────────────────┤
│ 2. RSFT: Rejection Sampling FT          │
│    - Generate multiple answers          │
│    - Score with knowledge graph         │
│    - Keep only correct traces           │
│    - Train on winners                   │
└─────────────────────────────────────────┘
```
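The RSFT step boils down to a simple loop. A sketch under stated assumptions (the function names `generate` and `kg_score` are stand-ins, not the repo's API):

```python
import random

def rsft_round(generate, kg_score, prompts, k=8, threshold=0.0):
    """One rejection-sampling round: sample k completions per prompt,
    score each with the knowledge graph, and keep only completions
    above threshold as fine-tuning targets."""
    kept = []
    for prompt in prompts:
        for completion in (generate(prompt) for _ in range(k)):
            if kg_score(prompt, completion) > threshold:
                kept.append((prompt, completion))
    return kept

# Toy stand-ins for the model and the graph scorer:
generate = lambda p: random.choice(["TRACE: config reset\nANSWER: B", "ANSWER: Z"])
kg_score = lambda p, c: 1.0 if "ANSWER: B" in c else -2.0
winners = rsft_round(generate, kg_score, ["q1", "q2"], k=4)
```

The filtered `winners` become the next fine-tuning set, so the model only ever imitates traces the graph endorsed.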
## The Reward Function
The knowledge graph scores outputs during training:
- R_corr: +1.0 correct answer, -2.0 incorrect
- R_path: Entity coverage (did the trace mention relevant nodes?)
- P_spam: -0.5 penalty for repeating entities (prevents gaming)
At inference, the graph is removed. The model must reason from learned patterns.
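Putting the three terms together, a minimal sketch of the composite reward (the weights follow the numbers above; the entity-coverage formula and the function signature are my assumptions):

```python
def reward(answer, gold_answer, trace_entities, path_entities):
    """Composite reward sketch: correctness + path coverage - spam penalty."""
    # R_corr: +1.0 for the right answer, -2.0 otherwise
    r_corr = 1.0 if answer == gold_answer else -2.0
    # R_path: fraction of gold-path entities the trace mentions (assumed formula)
    covered = set(trace_entities) & set(path_entities)
    r_path = len(covered) / max(len(path_entities), 1)
    # P_spam: -0.5 if any entity is repeated, to prevent reward gaming
    repeats = len(trace_entities) - len(set(trace_entities))
    p_spam = 0.5 if repeats > 0 else 0.0
    return r_corr + r_path - p_spam
```

Note the asymmetry: a wrong answer costs twice what a right one earns, so guessing is discouraged even when the trace covers the path.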
## Results
| Phase | Accuracy | Training Time |
|---|---|---|
| Base model | 0% | - |
| After SFT | 30% | ~2 min |
| After RSFT | 75% | ~3 min |
The critical finding: distribution matching matters.
Training on easy examples (1-2 hops) hurt performance on hard eval (4-5 hops). Training on examples matching the eval distribution achieved 75%.
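In code, distribution matching is just a filter on hop count before training. A sketch, assuming each training item carries a `hops` field (the field name and data shape are hypothetical):

```python
def match_distribution(train_pool, eval_hops=(4, 5)):
    """Keep only training items whose hop count falls in the eval range."""
    return [ex for ex in train_pool if ex["hops"] in eval_hops]

# Toy pool: the 2-hop item is dropped, the 4- and 5-hop items survive.
pool = [{"id": 1, "hops": 2}, {"id": 2, "hops": 4}, {"id": 3, "hops": 5}]
matched = match_distribution(pool)
```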
## Running It

```bash
git clone https://github.com/softwarewrighter/multi-hop-reasoning
cd multi-hop-reasoning

# Setup (Apple Silicon)
make setup-mlx

# Full pipeline
make train
```
Results appear in ~5 minutes on an M-series Mac.
## Implementation Details
| Metric | Value |
|---|---|
| Primary Language | Python |
| Source Files | 12 .py files |
| Estimated Size | ~1.5 KLOC |
| Framework | MLX, Transformers |
| Platform | Apple Silicon (MLX native) |
**Good for you if:** You want to understand knowledge graph-guided training, experiment with rejection sampling fine-tuning, or see how small models can learn reasoning patterns.

**Complexity:** Moderate. Clean codebase with Make targets for each step. Requires understanding of fine-tuning concepts.
## Key Takeaways

- **Scaffolded training works.** Use structured feedback during training, remove it at inference.
- **Distribution matching matters.** Train on examples that match your eval distribution.
- **Small models can reason.** 135M parameters is enough for 75% accuracy on 4-5 hop questions.
- **MLX makes iteration fast.** The full pipeline runs in 5 minutes on a MacBook.
## Resources
- Paper: Knowledge Graph-Guided RAG
- Repository: multi-hop-reasoning
- Live Demo
- Video: LLM with Training Wheels
Knowledge graphs as training wheels—helping small models learn to reason, then letting go.
Part 1 of the Multi-Hop Reasoning series. View all parts | Next: Part 2 →
