
Adaptive difficulty selection via GP-IRT with BALD acquisition #23

@jeremymanning

Description


Summary

The current knowledge mapping system estimates P(correct | location) using a Gaussian Process on a 50×50 grid, but has no awareness of question difficulty levels. It treats all questions equally regardless of whether they're L1 (easy) or L4 (expert). This means:

  1. No difficulty-level map: We can't tell users "you're at L3 in this region but L1 over there"
  2. Suboptimal question selection: The sampler picks questions purely by spatial uncertainty, ignoring that asking a question at the wrong difficulty is wasteful (too easy = no information; too hard = just guessing)
  3. No adaptive difficulty: There's no mechanism to start easy and ramp up as the learner demonstrates mastery

Proposal: GP-IRT with BALD Acquisition

Based on deep research across 5 parallel investigations (adaptive staircase methods, Bayesian Knowledge Tracing, Item Response Theory, RBF ordinal interpolation, and Computerized Adaptive Testing in embedding spaces), all 5 converge on the same architecture:

The Key Insight

The existing GP posterior value (0–1) can be reinterpreted as a transformed IRT ability parameter. By applying fixed difficulty thresholds, every cell immediately gets a difficulty level estimate with zero additional state. BALD (Bayesian Active Learning by Disagreement) acquisition then selects questions that are simultaneously in uncertain spatial regions AND at the learner's difficulty boundary.

Mathematical Model

Layer 1: Ability mapping (reinterpret existing GP output)

θ_irt(x,y) = 4 × GP_mean(x,y) - 2      // rescale [0,1] → [-2,2]
σ_irt(x,y) = 4 × GP_uncertainty(x,y)    // scale uncertainty accordingly
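
A minimal sketch of this rescaling in JavaScript (the helper names are illustrative; the real GP mean and uncertainty come from `estimator.js`):

```javascript
// Rescale the GP posterior mean from [0, 1] onto the IRT ability scale [-2, 2].
function thetaIrt(gpMean) {
  return 4 * gpMean - 2;
}

// Scale the GP uncertainty by the same factor so it lives on the IRT scale too.
function sigmaIrt(gpUncertainty) {
  return 4 * gpUncertainty;
}
```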

Layer 2: IRT difficulty model (new, ~10 lines)

P(correct | θ, difficulty_d) = sigmoid(a × (θ - b[d]))

where:
  a = 1.5 (discrimination, shared across levels)
  b = [-1.5, -0.5, 0.5, 1.5] for L1–L4
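
A sketch of this 2PL response model with the theoretical defaults above (constant and function names are illustrative, not existing code):

```javascript
const A = 1.5;                    // discrimination, shared across levels
const B = [-1.5, -0.5, 0.5, 1.5]; // difficulty b[d] for L1..L4

function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Probability that a learner of ability theta answers a level-d question
// correctly (d is 1-based: 1..4).
function pCorrect(theta, d) {
  return sigmoid(A * (theta - B[d - 1]));
}
```

Note that P = 0.5 exactly when θ equals the level's difficulty b[d], which is what makes "questions at the learner's boundary" well defined.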

Layer 3: Difficulty level estimate (derived, zero cost)

Mastery level L*(x,y) = max{ d ∈ {1,2,3,4} : GP_mean(x,y) ≥ threshold[d] }, or 0 (below L1) if no threshold is met

Thresholds in [0,1] space: [0.125, 0.375, 0.625, 0.875]
  - GP < 0.125  → below L1
  - [0.125, 0.375) → L1
  - [0.375, 0.625) → L2
  - [0.625, 0.875) → L3
  - ≥ 0.875 → L4
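
A sketch of the threshold lookup (assuming GP values in [0, 1]; returning 0 for "below L1"):

```javascript
const LEVEL_THRESHOLDS = [0.125, 0.375, 0.625, 0.875]; // lower bounds of L1..L4

// Highest level d whose threshold the GP mean meets; 0 means "below L1".
function difficultyLevel(gpMean) {
  let level = 0;
  for (let d = 0; d < LEVEL_THRESHOLDS.length; d++) {
    if (gpMean >= LEVEL_THRESHOLDS[d]) level = d + 1;
  }
  return level;
}
```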

Layer 4: BALD question selection (replaces pure uncertainty scoring)

EIG(question_q) = a² × P_q × (1 - P_q) × σ²_irt(x_q, y_q)

This naturally balances:
  - Spatial uncertainty (high σ → explore unknown regions)
  - Difficulty informativeness (P near 0.5 → question is at learner's boundary)
  - Discrimination power (a² weights more discriminating items)
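
The scoring function itself is essentially one line; a sketch using the shared discrimination a = 1.5 from above (names illustrative):

```javascript
const A = 1.5; // shared discrimination

// Expected information gain for a candidate question: peaks when the
// predicted P(correct) is near 0.5 AND the spatial uncertainty is high.
function eig(pQ, sigmaIrtSq) {
  return A * A * pQ * (1 - pQ) * sigmaIrtSq;
}
```

Because the P(1 − P) factor is constant across candidates when every question has the same predicted success probability, the score then reduces to ranking by σ², i.e. plain uncertainty sampling.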

Phase-Based Strategy

| Phase | Trigger | Strategy | Goal |
| --- | --- | --- | --- |
| Calibrate | N < 10 | Prefer L2–L3 in high-uncertainty regions | Establish rough ability baseline |
| Map | 10 ≤ N < 30 or coverage < 15% | Full BALD acquisition | Refine ability map + find difficulty boundaries |
| Learn | N ≥ 30 | ZPD targeting (P ≈ 0.55–0.70) | Optimal learning in zone of proximal development |

Phase transitions are soft — if coverage drops (e.g., user explores a new region), the system drops back to mapping mode.
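
One way to sketch this schedule, assuming `n` is the response count and `coverage` a fraction in [0, 1] (both names hypothetical):

```javascript
// Map session state to an acquisition phase; a drop in coverage pushes the
// system back into mapping mode even after 30+ responses.
function currentPhase(n, coverage) {
  if (n < 10) return 'calibrate';
  if (n < 30 || coverage < 0.15) return 'map';
  return 'learn';
}
```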

Empirical Constraints Discovered

| Metric | Value | Implication |
| --- | --- | --- |
| Cell occupancy | 6.0% (149/2500) | Per-cell staircases are impossible; must interpolate |
| Multi-difficulty cells | 46.3% (69/149) | Only ~half of occupied cells have multiple difficulty levels |
| Questions per domain | 50 | Budget limits resolution; adaptive selection is critical |
| Difficulty balance | ~13/13/12/12 per domain | IRT thresholds can be evenly spaced |
| IRT+BALD overhead | <1 ms in JS | Negligible on top of existing ~10 ms GP update |
| BALD vs. uncertainty | 10–12% RMSE improvement | Meaningful selection improvement |

Why Not Other Approaches?

| Approach | Verdict | Reason |
| --- | --- | --- |
| Classical staircase | Rejected | 0.5-level quantization error with only 4 levels; needs 40+ trials per location |
| QUEST+ per cell | Rejected | Only 6% cell occupancy; not enough questions per cell |
| Per-cell BKT | Rejected | Same sparsity problem; 54% of cells have only 1 difficulty level |
| Deep Knowledge Tracing | Rejected | Requires pre-training data from thousands of students; no cold start |
| Nadaraya–Watson | Rejected | Accuracy plateaus at ~42–47% for ordinal estimation |
| Full GP ordinal regression | Overkill | The IRT threshold approach achieves similar results with ~1% of the complexity |

Minimal Viable Version (80% of value, ~25 lines)

The MVP requires changes to only 3 files:

  1. src/utils/math.js — add normalCDF() (6 lines, Abramowitz & Stegun approximation)
  2. src/learning/estimator.js — add difficultyLevel to predict() output (3 lines using threshold lookup)
  3. src/learning/sampler.js — replace score = uncertainty with BALD EIG formula (5 lines)
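
For reference, one possible `normalCDF` using the Abramowitz & Stegun polynomial approximation mentioned in step 1 — this particular variant is 26.2.17, with absolute error below about 7.5×10⁻⁸; the body here is an illustration, not the committed code:

```javascript
// Standard normal CDF via the Abramowitz & Stegun 26.2.17 polynomial approximation.
function normalCDF(x) {
  const t = 1 / (1 + 0.2316419 * Math.abs(x));
  const poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 +
    t * (-1.821255978 + t * 1.330274429))));
  const tail = Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI) * poly;
  return x >= 0 ? 1 - tail : tail; // symmetry handles the negative half
}
```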

This gives:

  • Every cell gets a difficulty level estimate (L1–L4) for free
  • Question selection becomes 10–12% more efficient
  • No new data structures, no new state, no new UI required
  • Backward-compatible: BALD reduces to uncertainty-based when all questions have the same difficulty

Full Implementation Roadmap

Phase A: MVP (~2–4 hours)

  • Add normalCDF to math.js
  • Add difficultyLevel to estimator.predict()
  • Replace uncertainty scoring with BALD EIG in sampler.selectNext()
  • Tests: verify BALD scores > uncertainty scores for boundary questions

Phase B: Full IRT Layer (~4–8 hours)

  • Level posterior computation (4-element probability vector per cell)
  • $phase computed atom in store.js
  • Phase-based selection in sampler.js
  • difficultyConfidence field
  • Update selectByMode() strategies

Phase C: Visualization (~4–8 hours)

  • Difficulty-level color mode for heatmap
  • Show difficulty level in quiz panel
  • Per-cell level posterior in tooltip
  • Difficulty breakdown in insights

Phase D: Calibration (ongoing)

  • Collect anonymized response data
  • Fit IRT parameters (a, b[1..4]) via maximum likelihood
  • A/B test BALD vs. uncertainty selection
  • Tune phase transition thresholds

Key Limitations & Assumptions

  1. IRT parameters are theoretical defaults — b = [-1.5, -0.5, 0.5, 1.5] and a = 1.5 are not empirically calibrated. Should be fit from real response data.
  2. GP value ≠ true ability — GP_mean is P(correct) averaged across observed difficulties. The IRT rescaling is an approximation that works when difficulty distribution is roughly uniform (which holds for current data: ~13/13/12/12).
  3. Stationarity assumption — All methods assume stable ability during assessment. Active learning causes within-session ability drift (the learner is learning!). The ZPD phase partially addresses this.
  4. UMAP distortion — RBF/Matérn kernels assume Euclidean distance, but UMAP non-linearly distorts distances. Empirical validation needed.
  5. 50-question budget — True difficulty boundary mapping benefits from 100+ questions per domain with deliberate spatial and difficulty coverage.

Research Sources

This proposal synthesizes findings from 50+ papers including:

  • IRT: Duck-Mayr et al. (GPIRT, UAI 2020), Chu & Ghahramani (GP Ordinal, JMLR 2005)
  • Staircase: Watson (QUEST+, 2017), Lesmes et al. (qCSF, 2010), Kontsevich & Tyler (Psi, 1999)
  • BKT: Corbett & Anderson (1995), Vie & Kashima (KTM, AAAI 2019), Yeung (Deep-IRT, 2019), Schmucker et al. (Supercharging BKT, JEDM 2024)
  • Active Learning: Houlsby et al. (BALD, 2011), Bayesian Optimal Experimental Design literature
  • CAT: EduCAT (USTC), ALEKS Knowledge Space Theory, Survey of CAT (arXiv:2404.00712)

Full research reports and figures available in .omc/scientist/reports/ and .omc/scientist/figures/.
