Small Models (6/6): Which Small AI Fits YOUR Laptop?
985 words • 5 min read

Maximum AI capability on minimum hardware. The 2-3B efficient frontier.
This is Part 6 (the finale) of the Small Models, Big Brains series. We’re benchmarking the best small models to help you choose the right one for your laptop.
| Resource | Link |
|---|---|
| Code | efficient-llm |
| Phi-2 | microsoft/phi-2 |
| Gemma | ai.google.dev/gemma |
| SmolLM | HuggingFace Blog |
| Video | Which Small AI Fits YOUR Laptop? |
The Efficient Frontier
In finance, the “efficient frontier” is the set of optimal portfolios offering the highest expected return for a given level of risk.
In AI, it’s the models offering the best capability for a given size.
The Contenders
| Model | Params | Source | Key Strength |
|---|---|---|---|
| Phi-2 | 2.7B | Microsoft | Reasoning, synthetic data |
| Gemma-2B | 2B | Google | Distillation, multilingual |
| SmolLM2-1.7B | 1.7B | HuggingFace | 11T tokens, fast inference |
| SmolLM3-3B | 3B | HuggingFace | Dual reasoning, 6 languages |
Benchmark Results
Actual measurements on Apple Silicon (M-series) from efficient-llm:
| Model | MMLU | GSM8K | HumanEval | Speed (CPU) | Memory |
|---|---|---|---|---|---|
| Phi-2 | 61.7% | 57.0% | 50.0% | 7.1 tok/s | 5.2GB |
| Gemma-2B | 38.9% | 18.0% | 90.0% | 8.5 tok/s | 4.7GB |
| SmolLM2 | 55.6% | * | * | 3.7 tok/s | 3.2GB |
*SmolLM2 GSM8K/HumanEval scores reflect prompt format incompatibility, not capability.
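To make the frontier concrete, here's a quick sketch in plain Python (numbers copied straight from the table above) that ranks the contenders by one crude efficiency metric, MMLU points per GB of memory:

```python
# Rank the contenders by MMLU score per GB of resident memory,
# using the measurements from the benchmark table above.
models = [
    # (name, mmlu_pct, speed_tok_s, memory_gb)
    ("Phi-2",    61.7, 7.1, 5.2),
    ("Gemma-2B", 38.9, 8.5, 4.7),
    ("SmolLM2",  55.6, 3.7, 3.2),
]

for name, mmlu, speed, mem in sorted(models, key=lambda m: m[1] / m[3], reverse=True):
    print(f"{name:10s} {mmlu / mem:5.1f} MMLU pts/GB at {speed:.1f} tok/s")
```

By this measure SmolLM2 actually leads, which is the whole point of the frontier framing: "best" depends on which axis you're optimizing.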
The Key Insight: Data Quality Beats Parameters
Phi-2 achieves 61.7% MMLU with only 2.7B parameters.
For comparison:
- Llama-2-7B: ~46% MMLU
- Llama-2-13B: ~55% MMLU
Phi-2 beats models 5x its size. The secret? Synthetic textbook training.
Microsoft generated high-quality educational content specifically designed to teach reasoning. Quality data > quantity data > model size.
Model Profiles
Phi-2: The Reasoning Champion
Strengths: Math, logic, code understanding
Weakness: Less conversational
Best for: Technical tasks, chain-of-thought
Phi-2 was trained on “textbook quality” synthetic data. It thinks like a textbook explains.
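If you want to poke at Phi-2's reasoning yourself, a minimal sketch with Hugging Face Transformers looks like this (FP16, letting `device_map="auto"` pick the device; prompt wording is just an illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Phi-2 in half precision; device_map="auto" picks GPU/MPS/CPU.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

# Phi-2 responds well to explicit step-by-step prompts.
prompt = (
    "Question: A train travels 60 miles in 1.5 hours. "
    "What is its average speed?\nAnswer: Let's think step by step."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```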
Gemma-2B: The Distillation Expert
Strengths: Multilingual, edge deployment
Weakness: Lower benchmark scores
Best for: Production apps, Google ecosystem
Google distilled knowledge from larger models into this compact package. Great tooling and documentation.
SmolLM2-1.7B: The Speed Demon
Strengths: Fastest inference, smallest footprint
Weakness: Prompt format sensitivity
Best for: Memory-constrained environments
HuggingFace trained it on 11T tokens, the same massive-overtraining recipe as TinyLlama at a slightly larger scale.
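The prompt-format sensitivity noted above is mostly a matter of using the model's chat template instead of raw strings. A sketch, assuming the instruct variant (`HuggingFaceTB/SmolLM2-1.7B-Instruct`; check the hub for the exact id):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the request in the model's own chat template rather than a raw prompt;
# small instruct models are notoriously picky about this.
messages = [{"role": "user", "content": "Write a haiku about small language models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```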
SmolLM3-3B: The Balanced Choice
Strengths: Dual reasoning modes, 6 languages
Weakness: Newest, less battle-tested
Best for: General-purpose small model needs
The latest from HuggingFace, designed to be the go-to small model.
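SmolLM3's dual reasoning modes are toggled through the chat template. The exact switch here is an assumption (an `enable_thinking` flag, which some recent chat templates accept as a forwarded kwarg); check the model card for the actual mechanism:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")  # assumed hub id

messages = [{"role": "user", "content": "Is 9.11 greater than 9.9?"}]

# Extra kwargs are forwarded into the Jinja chat template, so toggling a
# template-defined thinking flag (assumed name) looks like this:
with_thinking = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False, enable_thinking=True
)
without_thinking = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False, enable_thinking=False
)
print(with_thinking[:200])
```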
Decision Framework
```
├── Need best reasoning? → Phi-2
├── Need instruction following? → SmolLM2 or SmolLM3
├── Need multilingual? → Gemma-2B or SmolLM3
├── Memory constrained (<4GB)? → SmolLM2 + INT4
├── Need Google ecosystem? → Gemma-2B
├── General purpose? → SmolLM3
└── Maximum quality per byte? → Phi-2
```
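The same framework as a tiny lookup, useful if you're scripting model selection (a sketch; the requirement keys are made up for illustration, and the SmolLM hub ids are assumptions):

```python
# Map a single dominant requirement to a model pick, mirroring the tree above.
PICKS = {
    "reasoning":        "microsoft/phi-2",
    "instructions":     "HuggingFaceTB/SmolLM2-1.7B-Instruct",  # or SmolLM3
    "multilingual":     "google/gemma-2b",                      # or SmolLM3
    "low_memory":       "HuggingFaceTB/SmolLM2-1.7B-Instruct",  # quantize to INT4
    "google_ecosystem": "google/gemma-2b",
    "general":          "HuggingFaceTB/SmolLM3-3B",
}

def pick_model(requirement: str) -> str:
    """Return a hub id for the dominant requirement; default to the generalist."""
    return PICKS.get(requirement, PICKS["general"])

print(pick_model("reasoning"))  # microsoft/phi-2
```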
Running the Benchmarks
```bash
git clone https://github.com/softwarewrighter/efficient-llm
cd efficient-llm

# Setup
uv venv && source .venv/bin/activate
uv pip install torch transformers accelerate bitsandbytes datasets tqdm

# HuggingFace login (required for Gemma)
huggingface-cli login

# Download and benchmark
python download_models.py
python benchmark_quality.py
python benchmark_speed.py
python benchmark_memory.py

# Interactive demos
python demo_reasoning.py
python demo_code.py
python demo_chat.py
```
Hardware Requirements
| Setup | Models You Can Run |
|---|---|
| 4GB RAM | SmolLM2 (INT4) |
| 8GB RAM | All models (INT4) |
| 16GB RAM | All models (FP16) |
| Apple Silicon | All models (MPS) |
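For the 4GB row, INT4 quantization is what makes it work. A minimal sketch using bitsandbytes through Transformers (note: bitsandbytes 4-bit typically requires a CUDA GPU; on Apple Silicon the benchmarks above run via MPS instead, and the SmolLM2 hub id is an assumption):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization: a ~1.7B model fits comfortably under 4GB.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```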
Implementation Details
| Metric | Value |
|---|---|
| Primary Language | Python |
| Source Files | 7 .py files |
| Estimated Size | ~1.4 KLOC |
| Framework | Transformers, PyTorch |
| Build System | uv / pip |
| Key Features | MMLU/GSM8K/HumanEval benchmarks, demos |
Good for you if: You want to benchmark 2-3B models, compare quality vs speed tradeoffs, or run interactive comparisons between Phi-2, Gemma, and SmolLM.
Complexity: Low. Similar structure to billion-llm. Standalone Python scripts for each benchmark and demo. Requires HuggingFace authentication for Gemma access.
Series Recap
Over six parts, we’ve explored the cutting edge of small model research:
| Part | Model/Topic | Key Insight |
|---|---|---|
| 1 | TRM (<1K params) | Iteration beats scale |
| 2 | MobileLLM (350M) | Offline AI is practical |
| 3 | HRM (27M) | Hierarchy enables reasoning |
| 4 | BDH | Sparsity enables interpretability |
| 5 | 1B models | The efficiency sweet spot |
| 6 | 2-3B models | Data quality beats parameters |
Key Takeaways
- Data quality beats parameter count. Phi-2 proves careful curation outperforms brute scaling.
- The 2-3B range is remarkably capable. These models handle real tasks, not just demos.
- Each model has its niche. Match the model to your use case.
- Quantization makes everything accessible. INT4 lets you run 3B models on 4GB of RAM.
- The frontier keeps moving. SmolLM3 is weeks old. Better models are coming.
What We’ve Learned
Small models aren’t a compromise—they’re a different optimization target. When you can’t throw compute at a problem, you’re forced to be clever:
- Recursive reasoning (TRM)
- Mobile-optimized architectures (MobileLLM)
- Hierarchical decomposition (HRM)
- Sparse interpretable activations (BDH)
- Overtraining on quality data (TinyLlama, Phi-2)
These techniques will eventually feed back into large models too. Small model research isn’t a dead end—it’s the frontier.
Part 6 of 6 in the Small Models, Big Brains series. Thanks for following along!
Have questions? Find me on YouTube @SoftwareWrighter or Discord.
