Many-Eyes Learning: Intrinsic Rewards and Diversity

In Part 1, we demonstrated that multiple scouts dramatically improve learning in sparse-reward environments. Five scouts achieved 60% success where a single scout achieved 0%.
This post explores how scouts explore: intrinsic rewards that drive novelty-seeking behavior, and what happens when you mix different exploration strategies.
| Resource | Link |
|---|---|
| Code | many-eyes-learning |
| Part 1 | Solving Sparse Rewards with Many Eyes |
| Video | Many-Eyes Learning: Watch AI Scouts Explore |
Recap: The Many-Eyes Architecture
```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Scout 1   │   │   Scout 2   │   │   Scout N   │
│ (strategy A)│   │ (strategy B)│   │ (strategy N)│
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       v                 v                 v
┌─────────────────────────────────────────────────┐
│                Experience Buffer                │
└─────────────────────────────────────────────────┘
                         │
                         v
┌─────────────────────────────────────────────────┐
│                  Shared Learner                 │
└─────────────────────────────────────────────────┘
```
Scouts are information gatherers, not independent learners. They explore with different strategies, pool their discoveries, and a shared learner benefits from the combined experience.
New Scout Strategies
CuriousScout: Count-Based Novelty
IRPO formalizes intrinsic rewards as the mechanism that drives scout exploration. CuriousScout implements count-based curiosity:
```python
from collections import defaultdict
from math import sqrt

class CuriousScout(Scout):
    def __init__(self, bonus_scale: float = 1.0):
        super().__init__()
        self.state_counts = defaultdict(int)   # visits per state
        self.bonus_scale = bonus_scale

    def intrinsic_reward(self, state):
        count = self.state_counts[state]
        self.state_counts[state] = count + 1   # record this visit
        return self.bonus_scale / sqrt(count + 1)
```
How it works:
- Track how many times each state has been visited
- Reward = bonus_scale / √(count + 1)
- Novel states get high rewards; familiar states get diminishing returns
The intuition: A curious scout is drawn to unexplored territory. The first visit to a state is exciting (reward = 1.0). The fourth visit is mundane (reward = 0.5). This creates natural pressure to explore widely.
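As a quick check, here is a minimal usage sketch (assuming the Scout base class needs no constructor arguments and that intrinsic_reward records each visit, as in the snippet above):

```python
scout = CuriousScout(bonus_scale=1.0)
state = (3, 4)  # any hashable state works, e.g. a grid cell

for visit in range(1, 5):
    print(f"visit {visit}: bonus = {scout.intrinsic_reward(state):.2f}")
# visit 1: bonus = 1.00
# visit 2: bonus = 0.71
# visit 3: bonus = 0.58
# visit 4: bonus = 0.50
```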
OptimisticScout: Optimism Under Uncertainty
A different philosophy: assume unknown states are valuable until proven otherwise.
```python
class OptimisticScout(Scout):
    def __init__(self, optimism: float = 10.0):
        super().__init__()
        self.optimism = optimism

    def initial_q_value(self):
        # Unvisited state-action pairs start out looking valuable
        return self.optimism  # instead of 0
```
How it works:
- Initialize all Q-values to a high value (e.g., 10.0)
- The agent is “optimistic” about unvisited state-action pairs
- As it explores and receives actual rewards, Q-values decay toward reality
The intuition: If you’ve never tried something, assume it might be great. This drives exploration without explicit novelty bonuses.
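A minimal sketch of how optimistic initialization interacts with a standard tabular Q-learning update (this learner class is illustrative, not the repo's implementation):

```python
from collections import defaultdict

class OptimisticQLearner:
    def __init__(self, actions, optimism=10.0, alpha=0.1, gamma=0.99):
        self.actions = actions
        # Every unseen (state, action) pair starts at the optimistic value
        self.q = defaultdict(lambda: optimism)
        self.alpha, self.gamma = alpha, gamma

    def best_action(self, state):
        # Untried actions look as good as (or better than) tried ones,
        # so the greedy choice itself drives exploration
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        # Real returns pull the optimistic estimate back toward reality
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```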
Strategy Comparison
| Strategy | Mechanism | Best For |
|---|---|---|
| Random | Uniform random actions | Baseline, maximum coverage |
| Epsilon-Greedy | Random with probability ε, greedy otherwise | Balancing exploit/explore |
| CuriousScout | Novelty bonus for unvisited states | Systematic coverage |
| OptimisticScout | High initial Q-values | Early exploration pressure |
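For comparison, epsilon-greedy action selection is the simplest of these mechanisms; a generic sketch, assuming q_values is a dict keyed by (state, action):

```python
import random

def epsilon_greedy_action(q_values, actions, state, epsilon=0.2):
    # Explore with probability epsilon, otherwise act greedily
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[(state, a)])
```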
The Diversity Experiment
Does mixing strategies help, or is it enough to have multiple scouts with the same good strategy?
Setup
- 7x7 sparse grid, 100 training episodes
- All configurations use exactly 5 scouts (fair comparison)
- 5 random seeds for statistical significance
Configurations
- Homogeneous Random: 5 identical random scouts
- Homogeneous Epsilon: 5 identical epsilon-greedy scouts (ε=0.2)
- Diverse Mix: Random + 2 epsilon-greedy (ε=0.1, 0.3) + CuriousScout + OptimisticScout
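A hedged sketch of how these configurations might be assembled (RandomScout and EpsilonGreedyScout stand in for the Part 1 classes; the repo's constructors may differ):

```python
configs = {
    "homogeneous_random": [RandomScout() for _ in range(5)],
    "homogeneous_epsilon": [EpsilonGreedyScout(epsilon=0.2) for _ in range(5)],
    "diverse_mix": [
        RandomScout(),
        EpsilonGreedyScout(epsilon=0.1),
        EpsilonGreedyScout(epsilon=0.3),
        CuriousScout(bonus_scale=1.0),
        OptimisticScout(optimism=10.0),
    ],
}
```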
Results
| Configuration | Success Rate |
|---|---|
| Random baseline | 7% |
| Homogeneous random | 20% |
| Homogeneous epsilon | 40% |
| Diverse mix | 40% |
Analysis
Finding: Strategy quality matters more than diversity in simple environments.
- Epsilon-greedy (homogeneous or mixed) outperforms pure random
- Diverse mix performs the same as homogeneous epsilon-greedy
- Having 5 good scouts beats having 5 diverse but weaker scouts
Why doesn’t diversity help here?
In a simple 7x7 grid, the exploration problem is primarily about coverage, not strategy complementarity. Five epsilon-greedy scouts with different random seeds already explore different regions due to stochastic action selection.
Diversity likely provides more benefit in:
- Complex environments with multiple local optima
- Tasks requiring different behavioral modes
- Environments with deceptive reward structures
Web Visualization
The web visualization demonstrates Many-Eyes Learning with real-time parallel scout movement. (The upcoming video walks through this demo—the post focuses on the underlying mechanism.)

How It Works
The web version uses Q-learning with a shared Q-table (simpler than DQN for clarity). All scouts contribute to the same Q-table—the core “many eyes” concept: more explorers = faster Q-value convergence.
| Scout | Role | Epsilon | Behavior |
|---|---|---|---|
| Random | Baseline | 1.0 (constant) | Always random, never follows policy |
| Scouts 1-N | Learning Agents | 0.5-0.8 → 0.01 | Epsilon-greedy with decay |
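Below is a minimal sketch of the shared-table mechanism described above: every scout, whatever its epsilon, writes into one Q-table through the same update (standard tabular Q-learning; the helper and environment names are illustrative):

```python
from collections import defaultdict

shared_q = defaultdict(float)   # one table for all scouts
ALPHA, GAMMA = 0.1, 0.99

def q_update(state, action, reward, next_state, actions):
    # Any scout's transition improves the estimate every scout acts on
    best_next = max(shared_q[(next_state, a)] for a in actions)
    td_target = reward + GAMMA * best_next
    shared_q[(state, action)] += ALPHA * (td_target - shared_q[(state, action)])

# Each scout steps with its own epsilon, but all call the same update:
# for scout in scouts:
#     s, a, r, s_next = scout.step(env)
#     q_update(s, a, r, s_next, env.actions)
```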
Exploration Modes
The UI provides a dropdown to select different exploration strategies:
| Mode | Heatmap Diversity | Learning Performance |
|---|---|---|
| Shared Policy | Low (identical paths) | Best (lowest avg steps) |
| Diverse Paths | High (distinct paths) | Worse (biases override optimal) |
| High Exploration | High | Worst (never fully exploits) |
| Boltzmann | Moderate | Moderate |
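The Boltzmann mode in the table uses softmax action selection; a minimal sketch (the temperature value is illustrative):

```python
import math
import random

def boltzmann_action(q_values, actions, state, temperature=1.0):
    # Higher temperature -> nearly uniform; lower -> nearly greedy
    prefs = [math.exp(q_values[(state, a)] / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]
```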
The Diversity vs Performance Trade-off
There’s a fundamental trade-off between visual diversity and learning performance:
- Shared Policy wins on performance: The “many eyes” benefit comes from diverse exploration during learning (finding the goal faster). But once Q-values converge, all scouts should follow the same optimal policy.
- Diverse Paths sacrifices performance for visuals: Scout-specific directional biases (Scout 1 prefers right, Scout 2 prefers down) create visually interesting heatmaps but suboptimal behavior.
- High Exploration never converges: Fixed 50% random actions means scouts never fully exploit the learned policy.
Key insight: For best learning, use Shared Policy. Use other modes to visualize how different exploration strategies affect the learning process, but expect higher average steps.
Learning Phases
| Phase | Episodes | Avg Steps | Behavior |
|---|---|---|---|
| Random | 1-5 | ~70 | All scouts exploring randomly |
| Early Learning | 5-15 | 40-60 | Policy starts forming |
| Convergence | 15-30 | 15-25 | Clear optimal path emerges |
| Stable | 30+ | 12-18 | Near-optimal with random scout noise |
Why “Average Steps to Goal”?
Success rate is coarse-grained—with 5 scouts, only 6 values are possible (0%, 20%, 40%, 60%, 80%, 100%). After ~10 episodes, all scouts typically reach the goal. Average steps shows continued policy refinement, dropping from ~70 (random) to ~8 (optimal).
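A small hypothetical helper makes the difference concrete (steps_to_goal holds one entry per scout, None if that scout never reached the goal):

```python
def episode_metrics(steps_to_goal, max_steps=100):
    # Success rate with 5 scouts can only take 6 values: 0.0, 0.2, ..., 1.0
    success_rate = sum(s is not None for s in steps_to_goal) / len(steps_to_goal)
    reached = [s for s in steps_to_goal if s is not None]
    # Average steps keeps shrinking as the policy sharpens, long after
    # success rate saturates at 100%
    avg_steps = sum(reached) / len(reached) if reached else max_steps
    return success_rate, avg_steps
```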
Running the Visualization
```bash
./scripts/serve.sh   # Open http://localhost:3200
```
- Yew/WASM frontend with FastAPI backend
- Speed control from 1x to 100x
- Replay mode to step through recorded training
What’s Next
Potential future directions:
| Direction | Why It Matters |
|---|---|
| Larger environments | Test scaling to 15x15, 25x25 grids |
| Scout communication | Real-time sharing vs passive pooling |
| Adaptive intrinsic rewards | Learn the reward function (closer to full IRPO) |
| Multi-goal environments | Multiple sparse rewards to discover |
Key Takeaways
- Intrinsic rewards drive exploration. CuriousScout and OptimisticScout implement different philosophies: novelty bonuses vs optimistic initialization.
- Strategy quality > diversity in simple environments. Five good scouts beat five diverse but weaker scouts.
- Diversity during learning, convergence after. The “many eyes” benefit comes from diverse exploration during learning. Once Q-values converge, all scouts should follow the same optimal policy.
- Shared Q-table enables collective learning. All scouts contribute to one Q-table—more explorers means faster convergence.
- Visual diversity costs performance. Modes like “Diverse Paths” create interesting heatmaps but suboptimal behavior. Use “Shared Policy” for best learning results.
References
| Concept | Paper |
|---|---|
| IRPO | Intrinsic Reward Policy Optimization (Cho & Tran 2026) |
| Reagent | Reasoning Reward Models for Agents (Fan et al. 2026) |
| ICM | Curiosity-driven Exploration (Pathak et al. 2017) |
Diverse exploration, convergent execution. Many eyes see more, but the best path is the one they all agree on.
Part 6 of the Machine Learning series. View all parts
