
Adaptive difficulty selection via GP-IRT with BALD acquisition #23

@jeremymanning

Description


Summary

The current knowledge mapping system estimates P(correct | location) using a Gaussian Process on a 50×50 grid, but has no awareness of question difficulty levels. It treats all questions equally regardless of whether they're L1 (easy) or L4 (expert). This means:

  1. No difficulty-level map: We can't tell users "you're at L3 in this region but L1 over there"
  2. Suboptimal question selection: The sampler picks questions purely by spatial uncertainty, ignoring that asking a question at the wrong difficulty is wasteful (too easy = no information; too hard = just guessing)
  3. No adaptive difficulty: There's no mechanism to start easy and ramp up as the learner demonstrates mastery

Proposal: GP-IRT with BALD Acquisition

Based on deep research across 5 parallel investigations (adaptive staircase methods, Bayesian Knowledge Tracing, Item Response Theory, RBF ordinal interpolation, and Computerized Adaptive Testing in embedding spaces), all 5 converge on the same architecture:

The Key Insight

The existing GP posterior value (0–1) can be reinterpreted as a transformed IRT ability parameter. By applying fixed difficulty thresholds, every cell immediately gets a difficulty level estimate with zero additional state. BALD (Bayesian Active Learning by Disagreement) acquisition then selects questions that are simultaneously in uncertain spatial regions AND at the learner's difficulty boundary.

Mathematical Model

Layer 1: Ability mapping (reinterpret existing GP output)

θ_irt(x,y) = 4 × GP_mean(x,y) - 2      // rescale [0,1] → [-2,2]
σ_irt(x,y) = 4 × GP_uncertainty(x,y)    // scale uncertainty accordingly
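
A minimal sketch of this rescaling in JavaScript (the helper names are illustrative; the real GP mean and uncertainty come from `estimator.js`):

```javascript
// Rescale the GP posterior mean from [0, 1] onto the IRT ability scale [-2, 2].
function thetaIrt(gpMean) {
  return 4 * gpMean - 2;
}

// Scale the GP uncertainty by the same factor so it lives on the IRT scale too.
function sigmaIrt(gpUncertainty) {
  return 4 * gpUncertainty;
}
```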

Layer 2: IRT difficulty model (new, ~10 lines)

P(correct | θ, difficulty_d) = sigmoid(a × (θ - b[d]))

where:
  a = 1.5 (discrimination, shared across levels)
  b = [-1.5, -0.5, 0.5, 1.5] for L1–L4
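
A sketch of this 2PL response model with the theoretical defaults above (constant and function names are illustrative, not existing code):

```javascript
const A = 1.5;                    // discrimination, shared across levels
const B = [-1.5, -0.5, 0.5, 1.5]; // difficulty b[d] for L1..L4

function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Probability that a learner of ability theta answers a level-d question
// correctly (d is 1-based: 1..4).
function pCorrect(theta, d) {
  return sigmoid(A * (theta - B[d - 1]));
}
```

Note that P = 0.5 exactly when θ equals the level's difficulty b[d], which is what makes "questions at the learner's boundary" well defined.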

Layer 3: Difficulty level estimate (derived, zero cost)

Mastery level L*(x,y) = max{ d ∈ {1,2,3,4} : GP_mean(x,y) ≥ threshold[d] }, or 0 (below L1) if no threshold is met

Thresholds in [0,1] space: [0.125, 0.375, 0.625, 0.875]
  - GP < 0.125  → below L1
  - [0.125, 0.375) → L1
  - [0.375, 0.625) → L2
  - [0.625, 0.875) → L3
  - ≥ 0.875 → L4
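
A sketch of the threshold lookup (assuming GP values in [0, 1]; returning 0 for "below L1"):

```javascript
const LEVEL_THRESHOLDS = [0.125, 0.375, 0.625, 0.875]; // lower bounds of L1..L4

// Highest level d whose threshold the GP mean meets; 0 means "below L1".
function difficultyLevel(gpMean) {
  let level = 0;
  for (let d = 0; d < LEVEL_THRESHOLDS.length; d++) {
    if (gpMean >= LEVEL_THRESHOLDS[d]) level = d + 1;
  }
  return level;
}
```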

Layer 4: BALD question selection (replaces pure uncertainty scoring)

EIG(question_q) = a² × P_q × (1 - P_q) × σ²_irt(x_q, y_q)

This naturally balances:
  - Spatial uncertainty (high σ → explore unknown regions)
  - Difficulty informativeness (P near 0.5 → question is at learner's boundary)
  - Discrimination power (a² weights more discriminating items)
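
The scoring function itself is essentially one line; a sketch using the shared discrimination a = 1.5 from above (names illustrative):

```javascript
const A = 1.5; // shared discrimination

// Expected information gain for a candidate question: peaks when the
// predicted P(correct) is near 0.5 AND the spatial uncertainty is high.
function eig(pQ, sigmaIrtSq) {
  return A * A * pQ * (1 - pQ) * sigmaIrtSq;
}
```

Because the P(1 − P) factor is constant across candidates when every question has the same predicted success probability, the score then reduces to ranking by σ², i.e. plain uncertainty sampling.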

Phase-Based Strategy

| Phase | Trigger | Strategy | Goal |
| --- | --- | --- | --- |
| Calibrate | N < 10 | Prefer L2–L3 in high-uncertainty regions | Establish rough ability baseline |
| Map | 10 ≤ N < 30 or coverage < 15% | Full BALD acquisition | Refine ability map + find difficulty boundaries |
| Learn | N ≥ 30 | ZPD targeting (P ≈ 0.55–0.70) | Optimal learning in zone of proximal development |

Phase transitions are soft — if coverage drops (e.g., user explores a new region), the system drops back to mapping mode.
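
One way to sketch this schedule, assuming `n` is the response count and `coverage` a fraction in [0, 1] (both names hypothetical):

```javascript
// Map session state to an acquisition phase; a drop in coverage pushes the
// system back into mapping mode even after 30+ responses.
function currentPhase(n, coverage) {
  if (n < 10) return 'calibrate';
  if (n < 30 || coverage < 0.15) return 'map';
  return 'learn';
}
```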

Empirical Constraints Discovered

| Metric | Value | Implication |
| --- | --- | --- |
| Cell occupancy | 6.0% (149/2500) | Per-cell staircases are impossible; must interpolate |
| Multi-difficulty cells | 46.3% (69/149) | Only ~half of occupied cells have multiple difficulty levels |
| Questions per domain | 50 | Budget limits resolution; adaptive selection is critical |
| Difficulty balance | ~13/13/12/12 per domain | IRT thresholds can be evenly spaced |
| IRT+BALD overhead | <1 ms in JS | Negligible on top of existing ~10 ms GP update |
| BALD vs. uncertainty | 10–12% RMSE improvement | Meaningful selection improvement |

Why Not Other Approaches?

| Approach | Verdict | Reason |
| --- | --- | --- |
| Classical staircase | Rejected | 0.5-level quantization error with only 4 levels; needs 40+ trials per location |
| QUEST+ per cell | Rejected | Only 6% cell occupancy; not enough questions per cell |
| Per-cell BKT | Rejected | Same sparsity problem; 54% of cells have only 1 difficulty level |
| Deep Knowledge Tracing | Rejected | Requires pre-training data from thousands of students; no cold start |
| Nadaraya–Watson | Rejected | Accuracy plateaus at ~42–47% for ordinal estimation |
| Full GP ordinal regression | Overkill | The IRT threshold approach achieves similar results with ~1% of the complexity |

Minimal Viable Version (80% of value, ~25 lines)

The MVP requires changes to only 3 files:

  1. src/utils/math.js — add normalCDF() (6 lines, Abramowitz & Stegun approximation)
  2. src/learning/estimator.js — add difficultyLevel to predict() output (3 lines using threshold lookup)
  3. src/learning/sampler.js — replace score = uncertainty with BALD EIG formula (5 lines)
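
For reference, one possible `normalCDF` using the Abramowitz & Stegun polynomial approximation mentioned in step 1 — this particular variant is 26.2.17, with absolute error below about 7.5×10⁻⁸; the body here is an illustration, not the committed code:

```javascript
// Standard normal CDF via the Abramowitz & Stegun 26.2.17 polynomial approximation.
function normalCDF(x) {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 +
    t * (-1.821255978 + t * 1.330274429))));
  const tail = Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI) * poly;
  return x >= 0 ? 1 - tail : tail; // symmetry handles the negative half
}
```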

This gives:

  • Every cell gets a difficulty level estimate (L1–L4) for free
  • Question selection becomes 10–12% more efficient
  • No new data structures, no new state, no new UI required
  • Backward-compatible: BALD reduces to uncertainty-based when all questions have the same difficulty

Full Implementation Roadmap

Phase A: MVP (~2–4 hours)

  • Add normalCDF to math.js
  • Add difficultyLevel to estimator.predict()
  • Replace uncertainty scoring with BALD EIG in sampler.selectNext()
  • Tests: verify BALD scores > uncertainty scores for boundary questions

Phase B: Full IRT Layer (~4–8 hours)

  • Level posterior computation (4-element probability vector per cell)
  • $phase computed atom in store.js
  • Phase-based selection in sampler.js
  • difficultyConfidence field
  • Update selectByMode() strategies

Phase C: Visualization (~4–8 hours)

  • Difficulty-level color mode for heatmap
  • Show difficulty level in quiz panel
  • Per-cell level posterior in tooltip
  • Difficulty breakdown in insights

Phase D: Calibration (ongoing)

  • Collect anonymized response data
  • Fit IRT parameters (a, b[1..4]) via maximum likelihood
  • A/B test BALD vs. uncertainty selection
  • Tune phase transition thresholds

Key Limitations & Assumptions

  1. IRT parameters are theoretical defaults — b = [-1.5, -0.5, 0.5, 1.5] and a = 1.5 are not empirically calibrated. Should be fit from real response data.
  2. GP value ≠ true ability — GP_mean is P(correct) averaged across observed difficulties. The IRT rescaling is an approximation that works when difficulty distribution is roughly uniform (which holds for current data: ~13/13/12/12).
  3. Stationarity assumption — All methods assume stable ability during assessment. Active learning causes within-session ability drift (the learner is learning!). The ZPD phase partially addresses this.
  4. UMAP distortion — RBF/Matérn kernels assume Euclidean distance, but UMAP non-linearly distorts distances. Empirical validation needed.
  5. 50-question budget — True difficulty boundary mapping benefits from 100+ questions per domain with deliberate spatial and difficulty coverage.

Research Sources

This proposal synthesizes findings from 50+ papers including:

  • IRT: Duck-Mayr et al. (GPIRT, UAI 2020), Chu & Ghahramani (GP Ordinal, JMLR 2005)
  • Staircase: Watson (QUEST+, 2017), Lesmes et al. (qCSF, 2010), Kontsevich & Tyler (Psi, 1999)
  • BKT: Corbett & Anderson (1995), Vie & Kashima (KTM, AAAI 2019), Yeung (Deep-IRT, 2019), Schmucker et al. (Supercharging BKT, JEDM 2024)
  • Active Learning: Houlsby et al. (BALD, 2011), Bayesian Optimal Experimental Design literature
  • CAT: EduCAT (USTC), ALEKS Knowledge Space Theory, Survey of CAT (arXiv:2404.00712)

Full research reports and figures available in .omc/scientist/reports/ and .omc/scientist/figures/.
