# Multi-Hop Reasoning (1/2): Training Wheels for Small LLMs
692 words • 4 min read

A tiny 135M parameter model goes from 0% to 75% accuracy in 5 minutes of training. The secret? Knowledge graph-guided training with rejection sampling.
| Resource | Link |
|---|---|
| Paper | KG-Guided RAG (arXiv) |
| Code | multi-hop-reasoning |
| ELI5 | eli5.md |
| Demo | Live Demo |
| Video | LLM with Training Wheels |
| Part 2 | The Distribution Trap |
## The Problem: Multi-Hop Reasoning
LLMs struggle with questions requiring multiple reasoning steps. “What’s the fix for a crash caused by a corrupted config file on a system running outdated firmware?” requires connecting several facts:
- Corrupted config → need config reset
- Outdated firmware → need firmware update
- Crash context → check dependencies between these fixes
Standard fine-tuning teaches pattern matching. Multi-hop reasoning requires following logical chains.
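Following a logical chain is essentially a walk over a graph of facts. A minimal sketch (the entities and edge names below are hypothetical, not taken from the paper's graph) makes the "hops" concrete:

```python
# Hypothetical troubleshooting facts as an adjacency list.
# Each hop follows one labeled edge from entity to entity.
KG = {
    "crash": [("caused_by", "corrupted_config"), ("caused_by", "outdated_firmware")],
    "corrupted_config": [("fixed_by", "config_reset")],
    "outdated_firmware": [("fixed_by", "firmware_update")],
}

def hops_from(entity, depth=0, max_depth=3):
    """Enumerate reasoning chains reachable from `entity`."""
    if depth >= max_depth:
        return [[entity]]
    chains = []
    for rel, target in KG.get(entity, []):
        for tail in hops_from(target, depth + 1, max_depth):
            chains.append([entity, f"--{rel}-->"] + tail)
    return chains or [[entity]]

for chain in hops_from("crash"):
    print(" ".join(chain))
```

Answering the question above means completing two such chains and then reconciling them, which is exactly what pattern matching alone fails to do.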
## The Paper’s Approach
Learn with training wheels, remove them after learning completes.
Knowledge Graph-Guided RAG from Princeton proposes using knowledge graphs during training to score reasoning quality—then removing the graph at inference.
The key insight: train with scaffolding, test without it.
## My Implementation
The repo implements this for a software troubleshooting domain:
| Component | Details |
|---|---|
| Knowledge Graph | ~200 entities, ~600 edges (symptoms, causes, fixes) |
| Training Data | MCQs with 1-3 hop paths |
| Eval Data | MCQs with 4-5 hop paths (harder) |
| Model | SmolLM-135M-Instruct |
| Framework | MLX (Apple Silicon native) |
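One way to picture the training data: each MCQ carries a gold reasoning path through the graph, whose length is the hop count. This is a hypothetical shape for one item, not the repo's exact schema:

```python
# Hypothetical training example; field names are assumptions,
# not the repo's actual data format.
example = {
    "question": "A crash occurs after a config edit on old firmware. First step?",
    "choices": {"A": "Reinstall OS", "B": "Reset config", "C": "Replace disk", "D": "Ignore"},
    "answer": "B",
    "gold_path": ["crash", "corrupted_config", "config_reset"],  # a 2-hop path
    "hops": 2,
}
```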
## The Training Pipeline

```
┌─────────────────────────────────────────┐
│ 1. SFT: Learn output format             │
│    TRACE: <reasoning>                   │
│    ANSWER: A|B|C|D                      │
├─────────────────────────────────────────┤
│ 2. RSFT: Rejection Sampling FT          │
│    - Generate multiple answers          │
│    - Score with knowledge graph         │
│    - Keep only correct traces           │
│    - Train on winners                   │
└─────────────────────────────────────────┘
```
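The RSFT step boils down to a simple loop. A sketch under stated assumptions (the function names `generate` and `kg_score` are stand-ins, not the repo's API):

```python
import random

def rsft_round(generate, kg_score, prompts, k=8, threshold=0.0):
    """One rejection-sampling round: sample k completions per prompt,
    score each with the knowledge graph, and keep only completions
    above threshold as fine-tuning targets."""
    kept = []
    for prompt in prompts:
        for completion in (generate(prompt) for _ in range(k)):
            if kg_score(prompt, completion) > threshold:
                kept.append((prompt, completion))
    return kept

# Toy stand-ins for the model and the graph scorer:
generate = lambda p: random.choice(["TRACE: config reset\nANSWER: B", "ANSWER: Z"])
kg_score = lambda p, c: 1.0 if "ANSWER: B" in c else -2.0
winners = rsft_round(generate, kg_score, ["q1", "q2"], k=4)
```

The filtered `winners` become the next fine-tuning set, so the model only ever imitates traces the graph endorsed.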
## The Reward Function
The knowledge graph scores outputs during training:
- R_corr: +1.0 correct answer, -2.0 incorrect
- R_path: Entity coverage (did the trace mention relevant nodes?)
- P_spam: -0.5 penalty for repeating entities (prevents gaming)
At inference, the graph is removed. The model must reason from learned patterns.
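Putting the three terms together, a minimal sketch of the composite reward (the weights follow the numbers above; the entity-coverage formula and the function signature are my assumptions):

```python
def reward(answer, gold_answer, trace_entities, path_entities):
    """Composite reward sketch: correctness + path coverage - spam penalty."""
    # R_corr: +1.0 for the right answer, -2.0 otherwise
    r_corr = 1.0 if answer == gold_answer else -2.0
    # R_path: fraction of gold-path entities the trace mentions (assumed formula)
    covered = set(trace_entities) & set(path_entities)
    r_path = len(covered) / max(len(path_entities), 1)
    # P_spam: -0.5 if any entity is repeated, to prevent reward gaming
    repeats = len(trace_entities) - len(set(trace_entities))
    p_spam = 0.5 if repeats > 0 else 0.0
    return r_corr + r_path - p_spam
```

Note the asymmetry: a wrong answer costs twice what a right one earns, so guessing is discouraged even when the trace covers the path.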
## Results
| Phase | Accuracy | Training Time |
|---|---|---|
| Base model | 0% | - |
| After SFT | 30% | ~2 min |
| After RSFT | 75% | ~3 min |
The critical finding: distribution matching matters.
Training on easy examples (1-2 hops) hurt performance on hard eval (4-5 hops). Training on examples matching the eval distribution achieved 75%.
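In code, distribution matching is just a filter on hop count before training. A sketch, assuming each training item carries a `hops` field (the field name and data shape are hypothetical):

```python
def match_distribution(train_pool, eval_hops=(4, 5)):
    """Keep only training items whose hop count falls in the eval range."""
    return [ex for ex in train_pool if ex["hops"] in eval_hops]

# Toy pool: the 2-hop item is dropped, the 4- and 5-hop items survive.
pool = [{"id": 1, "hops": 2}, {"id": 2, "hops": 4}, {"id": 3, "hops": 5}]
matched = match_distribution(pool)
```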
## Running It

```bash
git clone https://github.com/softwarewrighter/multi-hop-reasoning
cd multi-hop-reasoning

# Setup (Apple Silicon)
make setup-mlx

# Full pipeline
make train
```
Results appear in ~5 minutes on an M-series Mac.
## Implementation Details
| Metric | Value |
|---|---|
| Primary Language | Python |
| Source Files | 12 .py files |
| Estimated Size | ~1.5 KLOC |
| Framework | MLX, Transformers |
| Platform | Apple Silicon (MLX native) |
**Good for you if:** You want to understand knowledge graph-guided training, experiment with rejection sampling fine-tuning, or see how small models can learn reasoning patterns.

**Complexity:** Moderate. Clean codebase with Make targets for each step. Requires understanding of fine-tuning concepts.
## Key Takeaways

- **Scaffolded training works.** Use structured feedback during training, remove it at inference.
- **Distribution matching matters.** Train on examples that match your eval distribution.
- **Small models can reason.** 135M parameters is enough for 75% accuracy on 4-5 hop questions.
- **MLX makes iteration fast.** The full pipeline runs in 5 minutes on a MacBook.
## Resources
- Paper: Knowledge Graph-Guided RAG
- Repository: multi-hop-reasoning
- Live Demo
- Video: LLM with Training Wheels
Knowledge graphs as training wheels—helping small models learn to reason, then letting go.
Part 1 of the Multi-Hop Reasoning series. View all parts | Next: Part 2 →
