Small Models (2/6): AI in Your Pocket

AI in your pocket, no internet required. Pocket Eliza++ runs MobileLLM-350M on Android via llama.cpp and JNI, creating a privacy-first therapist chatbot. The 260MB quantized model achieves ~10 tokens/second on mid-range phones.

AI on your phone. All day. No internet required.

This is Part 2 of the Small Models, Big Brains series. Today we’re putting a language model in your pocket with Pocket Eliza++—a modern AI therapist that runs completely offline on Android.

Resource	Link
Paper	MobileLLM (ICML 2024)
Code	pocket-llm
Runtime	llama.cpp
Video	AI in Your Pocket

Why Offline Matters

Benefit	Description
Privacy	Data never leaves your device
Speed	No network latency
Cost	No API fees
Offline	Works without internet
Battery	Efficient on-device inference

Cloud AI is convenient, but sometimes you want a conversation that stays on your device.

MobileLLM: Meta’s Edge Champion

MobileLLM is Meta’s sub-500M parameter model optimized specifically for on-device inference.

Architecture Optimizations

Technique	Benefit
Deep-thin design	More layers, fewer parameters per layer
SwiGLU activation	Better performance than ReLU
Embedding sharing	Saves 30% of parameters
Grouped-query attention	Faster inference

The result: a 260MB quantized model (Q4_K_M) that runs smoothly on phones.

Pocket Eliza++

Eliza taking notes

The original ELIZA (1966) used pattern matching to simulate a Rogerian therapist. Pocket Eliza++ uses the same therapeutic approach but with actual language understanding.

Therapeutic Design

The system prompt instructs the model to:

Ask one short question at a time
Never repeat questions
Vary question types (feelings, motivations, specifics)
Never give advice or explanations

It’s a reflective listener, not a problem solver.

Technical Stack

┌─────────────────────────────────┐
│     Kotlin + Jetpack Compose    │  UI Layer
├─────────────────────────────────┤
│            JNI Bridge           │
├─────────────────────────────────┤
│           llama.cpp             │  Inference Engine
├─────────────────────────────────┤
│    MobileLLM-350M (Q4_K_M)      │  Model (260MB)
└─────────────────────────────────┘

Model: MobileLLM-350M quantized to Q4_K_M (260MB GGUF)
Runtime: llama.cpp compiled for Android via NDK
Interface: Kotlin + Jetpack Compose
Bridge: JNI bindings connect Kotlin to native llama.cpp

Building the App

# Clone the repository
git clone https://github.com/softwarewrighter/pocket-llm
cd pocket-llm/android-demo

# Clone llama.cpp into native source
git clone https://github.com/ggerganov/llama.cpp.git \
    app/src/main/cpp/llama.cpp

# Download the model (260MB)
mkdir -p app/src/main/assets
curl -L -o app/src/main/assets/MobileLLM-376M-Q4_K_M.gguf \
    "https://huggingface.co/pjh64/MobileLLM-350M-GGUF/resolve/main/MobileLLM-376M-Q4_K_M.gguf"

# Build and install
./gradlew assembleDebug
adb install -r app/build/outputs/apk/debug/app-debug.apk

Build Requirements

Requirement	Value
Target SDK	35 (Android 15)
Min SDK	28 (Android 9.0)
ABI	arm64-v8a
NDK	CMake for native build
Kotlin	2.0.0

Quick CLI Demo

Don’t want to build the Android app? Test with Ollama:

pip install -r requirements.txt
ollama pull smollm:360m
python3 eliza.py

Performance

On a mid-range Android phone (Snapdragon 7 series):

First token: ~500ms
Generation: ~10 tokens/second
Memory: ~400MB RAM
Battery: Minimal impact for short sessions

Implementation Details

Metric	Value
Languages	Kotlin (UI), Python (CLI), C++ (JNI)
Source Files	6 `.kt`, 4 `.py`, 2 `.cpp`
Estimated Size	~1.3 KLOC
Android Target	SDK 28+ (Android 9.0)
Build System	Gradle + CMake (NDK)
Key Dependency	llama.cpp (vendored)

Good for you if: You want to deploy LLMs on Android, learn JNI/NDK integration, or build privacy-focused mobile AI apps.

Complexity: Moderate-High. Requires Android Studio, NDK setup, and understanding of JNI bridges. The llama.cpp integration is the tricky part; the Kotlin UI is straightforward Jetpack Compose.

Key Takeaways

Sub-500M models are phone-ready. MobileLLM proves useful AI fits in your pocket.
llama.cpp is the universal runtime. Same engine runs on Mac, Linux, Windows, and Android.
Privacy doesn’t require sacrifice. Offline AI can still be conversational and helpful.
Quantization is essential. Q4_K_M brings 350M parameters down to 260MB with minimal quality loss.

What’s Next

Part 3 explores the Hierarchical Reasoning Model (HRM)—a 27M parameter model that beats o3-mini on abstract reasoning.