Blog Series

Series

Deepseek Papers (3)
Five ML Concepts (29)
General Technology (2)
How AI Learns (7)
Machine Learning (6)
Multi-Hop Reasoning (2)
Personal Software (5)
Small Models, Big Brains (6)
Throwback Thursday (5)
Towards Continuous LLM Learning (2)

Multi-part blog post series, organized by topic.

Deepseek Papers

Part 1: Deepseek Papers (1/3): mHC - Training Stability at Any Depth
February 1, 2026

760 words • 4 min read • Abstract
Implementing Deepseek's mHC (Manifold-Constrained Hyper-Connections) paper. Using Sinkhorn-Knopp iteration to create doubly-stochastic matrices, mHC maintains training stability at 48 layers where standard hyper-connections explode. Cross-platform validation on Apple Silicon and NVIDIA.
Part 2: Deepseek Papers (2/3): Engram - Conditional Memory for Transformers
February 2, 2026

705 words • 4 min read • Abstract
Implementing Deepseek's Engram paper on conditional memory. Instead of recomputing common patterns through O(n^2) attention, Engram provides O(1) lookup for cached results. Our LoRA-based behavioral approximation achieves 58% loss reduction in 10 seconds.
Part 3: Deepseek Papers (3/3): Engram Revisited - From Emulation to Implementation
February 11, 2026

1033 words • 6 min read • Abstract
From behavioral emulation to real implementation: integrating hash-based Engram memory with HuggingFace models. The gating mechanism is critical: it learns when to trust memory lookup and when hash collisions would add noise. Engram excels at exact-match retrieval, not generalization.
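
For readers who have not met the Sinkhorn-Knopp iteration named in Part 1, here is a minimal sketch of the core idea: alternately normalize the rows and columns of a positive square matrix until both sum to 1. This is not the code from the post or the paper. The function name `sinkhorn_knopp`, the iteration count, and the sample matrix are all assumptions chosen for illustration.

```python
def sinkhorn_knopp(matrix, iterations=50, eps=1e-12):
    """Return an approximately doubly-stochastic copy of a positive square matrix.

    Hypothetical helper for illustration only; not taken from the mHC post.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row step: scale each row so it sums to 1.
        for i in range(n):
            row_sum = sum(m[i]) + eps
            m[i] = [v / row_sum for v in m[i]]
        # Column step: scale each column so it sums to 1.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n)) + eps
            for i in range(n):
                m[i][j] /= col_sum
    return m


# Arbitrary positive weights, made up for this example.
weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]]
ds = sinkhorn_knopp(weights)
row_sums = [sum(row) for row in ds]
col_sums = [sum(ds[i][j] for i in range(3)) for j in range(3)]
```

After enough iterations both `row_sums` and `col_sums` sit very close to 1, which is the doubly-stochastic property the abstract refers to.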

Five ML Concepts

Part 1: Five ML Concepts - #1
February 4, 2026

411 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Backpropagation (learning by flowing error backward), Transformers (attention over all tokens), Mamba (linear-time sequence modeling), Hallucination (confident nonsense), and Embeddings (meaning as coordinates).
Part 2: Five ML Concepts - #2
February 5, 2026

446 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Gradient Descent (walk downhill to minimize error), Attention (focus on what matters), DPO (align from preference pairs), Learning Rate (step size tradeoff), Temperature (dial between predictable and creative).
Part 3: Five ML Concepts - #3
February 6, 2026

524 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Loss Function (how far off predictions are), Overfitting (memorizing vs learning), Fine-tuning (specializing pre-trained models), LoRA (efficient adaptation with small matrices), Tokenization (breaking text into digestible pieces).
Part 4: Five ML Concepts - #4
February 7, 2026

453 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Activation Functions (introduce nonlinearity), Transfer Learning (reuse knowledge across tasks), VLM (joint image-text understanding), Adam (adaptive learning rates), Superposition (many concepts in overlapping representations).
Part 5: Five ML Concepts - #5
February 8, 2026

493 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Perceptron (single linear unit ancestor), Pre-training (learn general patterns first), Speculative Decoding (draft fast, verify in parallel), In-Context Learning (adapt from prompt examples), Latent Space (internal representations where similar things cluster).
Part 6: Five ML Concepts - #6
February 9, 2026

491 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Regularization (constraints to prevent overfitting), BERT (bidirectional masked language modeling), RoPE (position via rotation in attention), Prompting (craft inputs to steer outputs), Positional Encoding (tell model where tokens are).
Part 7: Five ML Concepts - #7
February 10, 2026

469 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Cross-Validation (rotate held-out data), GPT (predict next token at scale), GQA (shared keys/values for efficiency), Context Window (how much the model sees), Self-Attention (each token attends to all others).
Part 8: Five ML Concepts - #8
February 11, 2026

477 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Bias-Variance Tradeoff (balance under/overfitting), Diffusion (generate by learning to denoise), KV Cache (store past keys/values), Mixed Precision (lower precision for speed), MLA (compress attention into latent space).
Part 9: Five ML Concepts - #9
February 12, 2026

470 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Dropout (random disabling prevents overfitting), RLHF (learn from human preferences), Inference (using trained models), Quantization (lower precision for efficiency), Flash Attention (block-wise for memory savings).
Part 10: Five ML Concepts - #10
February 13, 2026

499 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: CNN (sliding filters for image features), Encoder-Decoder (compress then generate), RAG (retrieve context before generating), Few-shot Learning (learn from prompt examples), Distillation (small student mimics large teacher).
Part 11: Five ML Concepts - #11
February 14, 2026

503 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: RNN (sequential processing with memory), Chain of Thought (step-by-step reasoning), Softmax (scores to probabilities), MoE (route inputs to specialists), Distribution Shift (training vs deployment mismatch).
Part 12: Five ML Concepts - #12
February 15, 2026

488 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Precision vs Recall (correct positives vs finding all), OOD Inputs (data unlike training), Batch Size (examples per update), Inductive Bias (built-in assumptions), Latency vs Throughput (speed vs capacity).
Part 13: Five ML Concepts - #13
February 16, 2026

448 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Calibration (predicted probabilities match outcomes), Shortcut Learning (exploiting spurious patterns), Early Stopping (halt when validation plateaus), Universal Approximation (NNs can approximate any continuous function), Checkpointing (save model state).
Part 14: Five ML Concepts - #14
February 17, 2026

448 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: ROC/AUC (performance across thresholds), Spurious Correlations (coincidental patterns), Gradient Clipping (limit gradients for stability), Loss Landscapes (error surface over parameters), Cold Start (no history for new users).
Part 15: Five ML Concepts - #15
February 18, 2026

470 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Perplexity (how surprised by data), Catastrophic Forgetting (new learning erases old), Weight Initialization (starting values matter), Curse of Dimensionality (high-D makes data sparse), Monitoring (track performance and drift).
Part 16: Five ML Concepts - #16
February 19, 2026

468 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Train/Val/Test Split (separate data roles), Overconfidence (high-probability wrong predictions), Batch Normalization (stable training), Optimization vs Generalization (low train loss doesn't mean good test), A/B Testing (compare with experiments).
Part 17: Five ML Concepts - #17
February 20, 2026

472 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Benchmark Leakage (test data contamination), Concept vs Data Drift (changed relationships vs inputs), Weight Decay (L2 penalty for simplicity), Scaling Laws (predictable performance growth), Shadow Deployment (test alongside production).
Part 18: Five ML Concepts - #18
February 21, 2026

444 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Preference Learning (train from comparisons), Ensembling (combine models for robustness), ML Fragility (breaks on distribution shift), Epoch (one pass through data), Cost vs Quality (bigger isn't always better).
Part 19: Five ML Concepts - #19
February 22, 2026

451 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Autoencoders (compress and reconstruct), Correlation vs Causation (co-occurrence isn't cause), Curriculum Learning (easy to hard), Failure Analysis (categorize errors), Covariate Shift (new inputs, same task).
Part 20: Five ML Concepts - #20
February 23, 2026

456 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: VAEs (generative with structured latents), Uncertainty Estimation (know when you don't know), Interpretability (distributed representations resist explanation), Gradient Noise (mini-batch variation), Human-in-the-Loop (human oversight for critical decisions).
Part 21: Five ML Concepts - #21
February 24, 2026

447 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Prompt Injection (malicious instructions overriding AI behavior), Jailbreaks (bypassing safety constraints), GRU (gated recurrent units for sequences), Planning vs Prediction (action evaluation vs forecasting), Production Rollbacks (reverting to stable model versions).
Part 22: Five ML Concepts - #22
February 25, 2026

472 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: RSFT (rejection sampling fine-tuning with filtered outputs), Model Steerability (adjusting behavior at inference time), LSTM (long short-term memory for sequences), Why More Data Beats Better Models (data scale trumps architecture tweaks), System Reliability vs Model Quality (balancing accuracy with uptime).
Part 23: Five ML Concepts - #23
February 26, 2026

440 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Emergent Behavior (capabilities appearing at scale), Tool Use (AI calling external tools), Loss Surface Sharpness (flatter minima generalize better), Learning Rate Schedules (adjusting learning rate during training), Canary Deployment (gradually rolling out new models safely).
Part 24: Five ML Concepts - #24
February 27, 2026

426 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Warmup (gradually increasing learning rate at start), Data Leakage (training on information unavailable at deployment), Mode Collapse (limited generative output variety), Blue/Green Deployment (switching between parallel production environments), Reward Hacking (exploiting reward function flaws).
Part 25: Five ML Concepts - #25
February 28, 2026

406 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Label Smoothing (softening targets to reduce overconfidence), Miscalibration (confidence not matching accuracy), Representation Learning (automatically learning useful features), Adversarial Examples (inputs crafted to cause errors), Double Descent (test error decreasing twice with model size).
Part 26: Five ML Concepts - #26
March 1, 2026

424 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Data Augmentation (expanding training data with transformations), Caching Strategies (reducing latency by reusing computation), Constitutional AI (training models to follow explicit principles), Goodhart's Law (optimizing metrics distorts objectives), Manifold Hypothesis (data lies on lower-dimensional structures).
Part 27: Five ML Concepts - #27
March 2, 2026

419 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Elastic Weight Consolidation (protecting important parameters during new task learning), Replay Buffers (mixing past examples to prevent forgetting), Parameter Routing (activating task-specific parameter subsets), Memory-Augmented Networks (external memory modules for neural networks), Model Editing (targeted weight updates without full retraining).
Part 28: Five ML Concepts - #28
March 3, 2026

443 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Lottery Ticket Hypothesis (small winning subnetworks within large models), Sparse Activation (using only part of a model per input), Conditional Computation (dynamically routing inputs for efficiency), Inference Parallelism (distributing inference across devices), Compute Optimality (balancing model size, data, and compute).
Part 29: Five ML Concepts - #29
March 4, 2026

457 words • 3 min read • Abstract
Five ML concepts in under 30 seconds each: Neural Collapse (late-stage geometric convergence of class representations), Grokking (sudden generalization after prolonged memorization), SAM (optimizing for flat loss regions under perturbations), Mechanistic Interpretability (analyzing internal circuits of neural networks), Self-Training Instability (feedback loops that amplify errors in self-generated data).

General Technology

Part 1: MCP: Teaching Claude to Play (and Trash Talk)
February 2, 2026

661 words • 4 min read • Abstract
Teaching Claude to play tic-tac-toe and trash talk using Model Context Protocol (MCP). A Rust server exposes 6 tools via JSON-RPC over stdio, proving MCP standardizes AI tool integration across any compatible language model.
Part 2: JSON et al: A Deep Dive into Data Serialization Formats
February 21, 2026

2241 words • 12 min read • Abstract
JSON is everywhere, but it's not the only option. This post explores data formats beyond basic JSON—JSONL for streaming, JSONB for fast queries, Protocol Buffers for compact wire formats, YAML/TOML for human editing, and TOON for LLM efficiency. Each has trade-offs: pick two of readability, compactness, or speed.

How AI Learns

Part 1: How AI Learns Part 1: The Many Meanings of Learning
February 24, 2026

592 words • 3 min read • Abstract
When people say 'AI learned something,' they usually mean one of four very different things. Understanding these time scales, from milliseconds to years, is essential for building AI systems that improve safely over time.
Part 2: How AI Learns Part 2: Catastrophic Forgetting vs Context Rot
February 25, 2026

641 words • 4 min read • Abstract
Two fundamentally different failure modes plague AI systems. Catastrophic forgetting destroys old knowledge when learning new skills. Context rot loses early instructions in long conversations. Different problems, different solutions.
Part 3: How AI Learns Part 3: Weight-Based Learning
February 26, 2026

649 words • 4 min read • Abstract
Weight-based learning modifies the neural network itself. Pretraining, fine-tuning, LoRA, alignment methods, distillation: each changes the brain permanently. Slow to change, but forms the stable core.
Part 4: How AI Learns Part 4: Memory-Based Learning
February 27, 2026

627 words • 4 min read • Abstract
Modern AI systems increasingly rely on external memory. RAG, CAG, and Engram-style modules shift 'learning' away from weights. The brain stays stable. The notebook grows.
Part 5: How AI Learns Part 5: Context Engineering & Recursive Reasoning
February 28, 2026

631 words • 4 min read • Abstract
Large context windows are not a complete solution. As context grows, attention dilutes and instructions drift. Recursive Language Models treat context as a dynamic environment, rebuilding focus each step instead of dragging everything forward.
Part 6: How AI Learns Part 6: Toward Continuous Learning
March 1, 2026

691 words • 4 min read • Abstract
Continuous learning aims to absorb new information and skills over time without losing old capabilities. The key: learn often in memory, consolidate carefully in weights. Periodic consolidation, not constant updates.
Part 7: How AI Learns Part 7: Designing a Continuous Learning Agent
March 2, 2026

894 words • 5 min read • Abstract
A robust architecture: core model (rarely updated) + adapters (modular skills) + external memory (facts) + context manager (RLM-style) + logging and evaluation loop. Errors feed into memory first. Only recurring, validated improvements reach adapters.

Machine Learning

Part 1: Solving Sparse Rewards with Many Eyes
February 3, 2026

1473 words • 8 min read • Abstract
Single explorer: 0% success. Five explorers: 60% success. Sparse rewards are an information problem, not a compute problem. Using multiple scouts with different exploration strategies, we gather diverse discoveries that benefit a shared learner.
Part 2: DyTopo: Dynamic Topology for Multi-Agent AI
February 12, 2026

781 words • 4 min read • Abstract
When multiple AI agents work together, fixed communication patterns fail at scale. DyTopo rebuilds the graph each round based on semantic similarity between what agents need and what they can offer, preventing context explosion while enabling adaptive collaboration.
Part 3: RLM: Recursive Language Models for Massive Context
February 13, 2026

995 words • 5 min read • Abstract
When data won't fit in a context window, RLM expands the workspace instead. The MIT paper achieves 87-91% accuracy where standard prompting scores 0%. My Rust implementation provides four capability levels from DSL commands to WASM sandboxing to LLM delegation.
Part 4: Neural-Net-RS: An Educational Neural Network Platform
February 15, 2026

1048 words • 6 min read • Abstract
Personal Software for education: a neural network platform where every step is visible, with no framework magic. CLI with progress bars, web UI with real-time loss charts, WASM for browser execution. Built via Vibe Coding to watch XOR training reveal why hidden layers matter.
Part 5: In-Context Learning Revisited: From Mystery to Engineering
February 22, 2026

643 words • 4 min read • Abstract
ICL evolved from emergent surprise (2020) to mechanistic understanding (2022) to engineered capability (2026). Transformers implement implicit gradient descent during inference: they learn without weight updates. The frontier: models learning from their own feedback. Not magic. Meta-learning in plain sight.
Part 6: Many-Eyes Learning: Intrinsic Rewards and Diversity
February 25, 2026

1393 words • 7 min read • Abstract
Expanding many-eyes learning with intrinsic rewards and a new web visualization. CuriousScout uses count-based novelty, OptimisticScout uses optimistic initialization. The key trade-off: diversity helps during exploration, but once Q-values converge, all scouts should follow the same optimal policy. Strategy quality matters more than diversity in simple environments.
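
As a rough illustration of the count-based novelty idea mentioned in Part 6, the sketch below pays an exploration bonus that shrinks as a state is revisited. The 1/sqrt(count) form is a common textbook choice and an assumption here, not necessarily what CuriousScout uses. The class name `CountBonus` and its `scale` parameter are hypothetical.

```python
import math
from collections import Counter


class CountBonus:
    """Hypothetical count-based intrinsic reward: scale / sqrt(visits)."""

    def __init__(self, scale=1.0):
        self.counts = Counter()  # visit count per state
        self.scale = scale

    def reward(self, state):
        # Record the visit first, so the first visit earns the full bonus.
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])


bonus = CountBonus()
first = bonus.reward((0, 0))   # first visit to this grid cell
second = bonus.reward((0, 0))  # revisit earns a smaller bonus
fresh = bonus.reward((0, 1))   # a new cell earns the full bonus again
```

A scout would add this bonus to the environment reward, so rarely visited states look more attractive until their counts grow.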

Multi-Hop Reasoning

Part 1: Multi-Hop Reasoning (1/2): Training Wheels for Small LLMs
February 1, 2026

692 words • 4 min read • Abstract
A 135M parameter model goes from 0% to 75% accuracy in 5 minutes. Using knowledge graph-guided training with rejection sampling, we teach multi-hop reasoning with scaffolding during training, then remove it at inference.
Part 2: Multi-Hop Reasoning (2/2): The Distribution Trap
February 18, 2026

796 words • 4 min read • Abstract
RSFT on easy examples made performance worse: 27% vs the 37% SFT baseline. Training distribution must match evaluation distribution. Easy examples teach shortcuts that fail on hard problems. The fix is one flag change.

Personal Software

Part 1: Cat Finder: Personal Software via Vibe Coding
February 14, 2026

914 words • 5 min read • Abstract
Personal Software via Vibe Coding: I needed to find cat photos scattered across my system. Instead of cloud services or app stores, I described what I wanted to Claude Code and got a working Rust CLI tool using YOLOv8 and ONNX Runtime. Privacy-first, locally-run, and mine to modify.
Part 2: midi-cli-rs: Music Generation for AI Coding Agents
February 20, 2026

1063 words • 6 min read • Abstract
Personal Software via Vibe Coding: a music tool for AI agents. midi-cli-rs provides mood presets (suspense, upbeat, calm, jazz) so agents can generate complete audio compositions from simple commands. No music theory required.
Part 3: midi-cli-rs: Extending with Custom Mood Packs
February 23, 2026

1300 words • 7 min read • Abstract
Personal Software grows. midi-cli-rs now supports custom mood packs: TOML files that extend built-in moods with your own musical variations. No Rust required. Define tempo, key, intensity, and let the generators handle the rest.
Part 4: music-pipe-rs: Unix Pipelines for MIDI Composition
February 24, 2026

1173 words • 6 min read • Abstract
Personal Software continues. music-pipe-rs takes the Unix philosophy to MIDI composition: small tools connected by pipes. Start with a seed, generate motifs, transform, visualize, convert to MIDI. Deterministic output from a single seed at the pipeline head.
Part 5: music-pipe-rs: Web Demo and Multi-Instrument Arrangements
February 28, 2026

697 words • 4 min read • Abstract
Continuing the music-pipe-rs story: a web demo with Bach and Baroque arrangements, the seq command for explicit note sequences, and GarageBand integration. Plus the generative music resources that inspired this project.

Small Models, Big Brains

Part 1: Small Models (1/6): 976 Parameters Beat Billions
January 31, 2026

703 words • 4 min read • Abstract
The best LLMs score zero on hard mazes. A model with 976 parameters scores 85%. The Tiny Recursive Model uses think-act cycles with deep supervision, proving iteration beats scale for tasks requiring backtracking and spatial reasoning.
Part 2: Small Models (2/6): AI in Your Pocket
February 1, 2026

765 words • 4 min read • Abstract
AI in your pocket, no internet required. Pocket Eliza++ runs MobileLLM-350M on Android via llama.cpp and JNI, creating a privacy-first therapist chatbot. The 260MB quantized model achieves ~10 tokens/second on mid-range phones.
Part 3: Small Models (3/6): Planner + Doer = Genius
February 2, 2026

789 words • 4 min read • Abstract
A 27-million-parameter model beats o3-mini on ARC. The Hierarchical Reasoning Model separates planning from execution, mimicking the brain's dual-process theory. It achieves 40% on the hardest reasoning benchmark where most LLMs score under 5%.
Part 4: Small Models (4/6): This AI Has a Visible Brain
February 3, 2026

842 words • 5 min read • Abstract
LLMs are black boxes. Baby Dragon Hatchling uses brain-inspired sparse coding with 80% sparsity, making only 20% of neurons active per token. When fewer neurons fire, each one carries interpretable meaning. Train it on Shakespeare and actually see what's happening inside.
Part 5: Small Models (5/6): Max AI Per Watt
February 4, 2026

839 words • 5 min read • Abstract
One billion parameters: the sweet spot for AI. Big enough to reason, small enough to run anywhere. Comparing TinyLlama, Llama-3.2-1B, StableLM, and Pythia with LoRA fine-tuning in minutes and speculative decoding for 2-3x speedups.
Part 6: Small Models (6/6): Which Small AI Fits YOUR Laptop?
February 5, 2026

985 words • 5 min read • Abstract
Which small AI fits your laptop? Benchmarking Phi-2, Gemma-2B, and SmolLM on the 2-3B efficient frontier. Phi-2 achieves 61.7% MMLU with only 2.7B parameters, beating models 5x larger through synthetic textbook training. Data quality beats parameters.

Throwback Thursday

Part 1: TBT (1/?): My First Program Was a Horse Race
January 29, 2026

1138 words • 6 min read • Abstract
My first program was a horse race game in APL on an IBM mainframe in 1972. This Throwback Thursday post recreates it using GNU APL, exploring array-oriented programming and the ideas that shaped languages from J to NumPy.
Part 2: TBT (2/?): Pipelines on OS/390
February 5, 2026

1779 words • 9 min read • Abstract
Unix invented pipes. Mainframes reinvented them for records, not bytes. This Throwback Thursday recreates CMS/TSO Pipelines in Rust with a visual debugger, demonstrating record-oriented dataflow from the 1996 Olympics web server era.
Part 3: TBT (3/?): Vector Graphics Games
February 13, 2026

1633 words • 9 min read • Abstract
Before pixels, there were vectors. Vibe Coding classic arcade games (Asteroids, BattleZone, Tempest) in Rust/WebAssembly with wgpu rendering, from my first encounter with an IBM 2250 to playable browser demos, all built in one day with Claude Code.
Part 4: TBT (4/?): ToonTalk - Teaching Robots to Program
February 19, 2026

1069 words • 6 min read • Abstract
ToonTalk is a 1995 visual programming environment where you train robots by showing them what to do. I vibe coded tt-rs, a Rust/WebAssembly reimplementation with boxes, scales, birds, nests, and robots: programming by demonstration for the browser.
Part 5: TBT (5/?): IBM 1130 System Emulator - Experience 1960s Computing
February 26, 2026

1231 words • 7 min read • Abstract
A browser-based IBM 1130 system emulator with authentic console panel indicator lights, keypunch, printer, and assembly game. Experience the full 1965 minicomputer ecosystem through interactive simulations. Work in progress.

Towards Continuous LLM Learning

Part 1: Towards Continuous LLM Learning (1): Sleepy Coder - When Fine-Tuning Fails
February 12, 2026

1211 words • 7 min read • Abstract
What happens when you fine-tune a model on new tasks? It forgets old ones. This post documents our implementation of the Share algorithm in Rust—using SVD-based subspace extraction to enable continual learning without catastrophic forgetting. Part 1 covers the problem and initial negative results.
Part 2: Towards Continuous LLM Learning (2): Routing Prevents Forgetting
February 18, 2026

775 words • 4 min read • Abstract
Part 2 of implementing the Share algorithm: after fixing critical bugs (zero-gradient saddle point, half-parameter training), routing-based coefficient selection achieves zero regressions. Result handling improved from 40% to 50%. We're 60% through verifying the paper's claims.