The pipeline behind every item parameter (Brightroom)

The March 2024 engine note ended with an admission: about a quarter of the item pool was under-calibrated, and the selection step deprioritized those items because their parameter estimates were not yet trustworthy. The pool behaved like a smaller pool than its raw size suggested.

Last week the calibration pipeline that closes that gap moved from manual to nightly. Every item parameter in the pool now gets re-estimated every twenty-four hours, against a rolling thirty-day window of responses, on a cluster job that finishes around four in the morning St. Gallen time. This is a note on what that job actually does, why it took five months to write properly, and what is still wrong with it.

What "calibration" means in this loop

Every item in the v1 pool carries three numbers: discrimination (a), difficulty (b), and a guessing floor (c). The engine's selection step scores candidate items by Fisher information at the candidate's current ability, and that score depends entirely on those three numbers being right. A misfit difficulty moves an item to the wrong place on the information curve; a misfit discrimination overstates or understates how much the item teaches the engine.
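The dependence on (a, b, c) can be made concrete. What follows is a minimal sketch, assuming the pool uses the standard three-parameter logistic (3PL) model, which this note does not state explicitly. The function names are illustrative, not the engine's.

```python
import math

def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the (assumed) 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta: float, a: float, b: float, c: float) -> float:
    """Item information at ability theta under the 3PL model."""
    p = p_correct(theta, a, b, c)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

# The same item scored at theta = 0.0, once with difficulty b = 0.0 and once
# with a misfit difficulty estimate of b = 1.5 (illustrative values only).
well_placed = fisher_information(0.0, a=1.2, b=0.0, c=0.2)
misplaced = fisher_information(0.0, a=1.2, b=1.5, c=0.2)
```

With these made-up numbers the misfit difficulty cuts the item's information score to roughly a quarter of its well-placed value, so the selection step would rank the same item very differently depending on which b it believes.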

Calibration is the work of estimating (a, b, c) from real response data, and re-estimating them as more data arrives. The estimates start as priors (an author's target difficulty, a content-family heuristic) and become posteriors only after enough candidates of varied ability have responded to the item. Until they do, the item is sitting in a probabilistic fog.

[Figure: difficulty estimate for one item. Day 0: prior, CI ±0.42. Day 30: posterior, CI ±0.11.]

The same item, before and after thirty days of responses. The point estimate moves slightly; the interval around it tightens by roughly four-to-one. The engine is allowed to trust the second one.

What the figure shows is the typical trajectory for an item that has crossed the calibration threshold. The point estimate of difficulty barely moves, because the prior was close, but the credible interval narrows enough that the Fisher-information score against the item is no longer dominated by uncertainty. Multiply this across roughly 2,400 items and the pool behaves measurably differently.

The pipeline, end to end

Three stages, every night.

Stage 1: pull. The response logger has been recording every alpha and Pro session since December with full context: the candidate's θ at time of presentation, time-on-task, the option selected, the prior items in the session, the session number, the engine version. The nightly job pulls the last thirty days of those records, partitions by item, and drops items with fewer than twenty-five responses in the window. (Below twenty-five, the marginal maximum-likelihood fit is too wide to be useful.)
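The partition-and-threshold step of stage 1 can be sketched as below. The record layout (a dict with an `item_id` key) and the function name are assumptions for illustration; only the twenty-five-response minimum comes from the text.

```python
from collections import defaultdict

MIN_RESPONSES = 25  # items with fewer responses in the window are dropped

def partition_by_item(records, min_responses=MIN_RESPONSES):
    """Group response records by item_id and drop sparsely answered items."""
    by_item = defaultdict(list)
    for rec in records:
        by_item[rec["item_id"]].append(rec)
    return {
        item_id: recs
        for item_id, recs in by_item.items()
        if len(recs) >= min_responses
    }

# Hypothetical window: item "A" has 30 responses, item "B" only 24.
records = (
    [{"item_id": "A", "theta": 0.1 * i, "correct": i % 2 == 0} for i in range(30)]
    + [{"item_id": "B", "theta": 0.0, "correct": True} for _ in range(24)]
)
eligible = partition_by_item(records)
```

Item "B" falls one response short of the minimum, so only "A" goes forward to the fit.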

Stage 2: fit. For each remaining item, the job runs an EM estimator against the response data, with the previous night's posterior as the prior. Items whose difficulty estimate moves by more than 0.15 get flagged for review. Items whose discrimination drops below 0.4 get flagged for retirement. Items whose guessing-floor estimate rises by more than 0.05 get reviewed for ambiguity in the answer-choice text.

Stage 3: publish. The updated parameter table is written to a versioned snapshot. The engine reads the latest snapshot at session start, not on a per-request basis, so a candidate's session uses a single coherent set of parameters from start to finish. The snapshot rotation is the reason the cron job lands at four in the morning: that is the quietest window, and only a small number of sessions span midnight.
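The read-at-session-start rule can be sketched as follows. The class names and the in-memory store are hypothetical stand-ins, not the engine's actual snapshot mechanism. The point is only that a session captures one snapshot and never re-reads it.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    version: int
    params: dict  # item_id -> (a, b, c)

@dataclass
class SnapshotStore:
    snapshots: list = field(default_factory=list)

    def publish(self, params: dict) -> Snapshot:
        # Each publish appends a new immutable version; old ones stay readable.
        snap = Snapshot(version=len(self.snapshots) + 1, params=dict(params))
        self.snapshots.append(snap)
        return snap

    def latest(self) -> Snapshot:
        return self.snapshots[-1]

class Session:
    def __init__(self, store: SnapshotStore):
        # Read once at session start; never re-read mid-session.
        self.snapshot = store.latest()

    def params_for(self, item_id: str):
        return self.snapshot.params[item_id]

store = SnapshotStore()
store.publish({"item-1": (1.1, 0.30, 0.20)})
session = Session(store)                      # starts before the nightly publish
store.publish({"item-1": (1.1, 0.36, 0.20)})  # nightly job lands mid-session
late_session = Session(store)                 # starts after it
```

The first session keeps scoring against version 1 even after version 2 is published, while a session started afterwards sees the new parameters.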

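# nightly calibration job (sketch: last_n_days, pool, em_fit, drift, flag,
# snapshot and now are the job's own helpers and are not shown here)
MIN_RESPONSES = 25       # below this, the fit is too wide to be useful
DRIFT_THRESHOLD = 0.15   # difficulty movement that triggers review
A_FLOOR = 0.4            # discrimination below this triggers retirement review
C_RISE_THRESHOLD = 0.05  # guessing-floor rise that triggers ambiguity review

window = last_n_days(30)
items = pool.all()

for item in items:
    responses = window.filter(item_id=item.id)
    if len(responses) < MIN_RESPONSES:
        continue  # too few responses in the window; keep last night's values
    prior = item.posterior  # last night's posterior is tonight's prior
    a, b, c, se = em_fit(responses, prior=prior)
    if drift(item.b, b) > DRIFT_THRESHOLD:
        flag(item, reason="difficulty drift")
    if a < A_FLOOR:
        flag(item, reason="low discrimination")
    if c - item.c > C_RISE_THRESHOLD:  # assumes item.c holds last night's floor
        flag(item, reason="guessing floor rise")
    item.update(a=a, b=b, c=c, se=se)

snapshot.publish(items, ts=now())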

Three hundred and forty lines of Python, plus a thin Rust EM kernel for the per-item fit. The orchestration is the longer half; the fitting is the smaller one. Everything that's hard about this job is in stages 1 and 3, not stage 2.

The hardest call: anchors

A naive nightly re-fit moves every item every night. That breaks cross-cohort ability comparison: a candidate sitting today and a candidate sitting six weeks ago are no longer scored against the same yardstick. Two candidates with identical response patterns would get different θ readings purely because the intervening forty-two re-fits moved the item parameters underneath them.
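A toy illustration of the yardstick problem, assuming a 3PL response model and a grid-search maximum-likelihood estimate of θ. Both are assumptions, since this note does not say how the engine scores a session. The item parameters and the uniform +0.2 drift are invented for the example.

```python
import math

def p_correct(theta, a, b, c):
    """Probability of a correct response under the (assumed) 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(responses, items):
    """Grid-search maximum-likelihood theta for a fixed response pattern."""
    grid = [i / 100 for i in range(-400, 401)]

    def log_lik(theta):
        total = 0.0
        for correct, (a, b, c) in zip(responses, items):
            p = p_correct(theta, a, b, c)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_lik)

responses = [True, True, False]  # identical pattern for both candidates
six_weeks_ago = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
today = [(a, b + 0.2, c) for a, b, c in six_weeks_ago]  # every b drifted by +0.2

theta_then = estimate_theta(responses, six_weeks_ago)
theta_now = estimate_theta(responses, today)
```

The response pattern is the same in both calls, yet the θ estimate moves by about the same 0.2 that the difficulties drifted. Nothing about the candidate changed, only the scale underneath them.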

The fix is anchor items. About 12% of the pool is held out from nightly re-fit: items whose parameters have been stable for ninety days and whose response counts exceed five hundred. Anchors fix the scale. The rest of the pool floats relative to them. A candidate's scaled-score readout, and the guarantee bands, are ultimately defined against the anchors, not against the floating pool.
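The eligibility rule for anchors can be sketched as a simple predicate. The `ItemStats` fields are hypothetical. Only the ninety-day and five-hundred-response thresholds come from the text, and "stable for ninety days" is read here as ninety days or more.

```python
from dataclasses import dataclass

STABLE_DAYS = 90            # parameters unchanged for at least this long
MIN_ANCHOR_RESPONSES = 500  # response count must exceed this

@dataclass
class ItemStats:
    item_id: str
    days_stable: int     # days since the parameters last moved (hypothetical field)
    response_count: int

def is_anchor_candidate(item: ItemStats) -> bool:
    return (item.days_stable >= STABLE_DAYS
            and item.response_count > MIN_ANCHOR_RESPONSES)

def split_pool(items):
    """Return (anchor candidates, floating items); only the latter are re-fit nightly."""
    anchors = [i for i in items if is_anchor_candidate(i)]
    floating = [i for i in items if not is_anchor_candidate(i)]
    return anchors, floating

pool = [
    ItemStats("stable-and-busy", days_stable=120, response_count=900),
    ItemStats("stable-but-sparse", days_stable=120, response_count=500),
    ItemStats("busy-but-recent", days_stable=30, response_count=2000),
]
anchors, floating = split_pool(pool)
```

Only the first item qualifies: the second has exactly five hundred responses, which does not exceed the threshold, and the third has not been stable long enough.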

Choosing the anchor set was the work of the last six weeks. We held out a candidate set of 380 items, watched the per-anchor drift across forty-two nightly fits run in shadow against the manual-calibration baseline, and kept the 290 anchors whose parameters did not move more than 0.05 on difficulty across that window. The other ninety dropped back into the float.
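The shadow-run filter can be sketched like this, assuming "did not move more than 0.05" means the largest absolute deviation from the manual-calibration baseline across all shadow fits. The note does not define the measure precisely, and the data layout below is invented for illustration.

```python
MAX_ANCHOR_DRIFT = 0.05  # difficulty tolerance stated above

def stable_anchors(baseline_b, shadow_fits, max_drift=MAX_ANCHOR_DRIFT):
    """Keep candidates whose difficulty stayed within max_drift of the
    baseline in every shadow fit."""
    kept = []
    for item_id, b0 in baseline_b.items():
        worst = max(abs(fit[item_id] - b0) for fit in shadow_fits)
        if worst <= max_drift:
            kept.append(item_id)
    return kept

# Two hypothetical candidates and two shadow fits (the real run used 42).
baseline_b = {"anchor-1": 0.40, "anchor-2": -0.75}
shadow_fits = [
    {"anchor-1": 0.41, "anchor-2": -0.74},
    {"anchor-1": 0.43, "anchor-2": -0.66},
]
kept = stable_anchors(baseline_b, shadow_fits)
```

Here "anchor-1" never strays more than 0.03 from baseline and is kept, while "anchor-2" drifts by 0.09 in the second fit and drops back into the float.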

What this is built to fix

Three things the nightly pipeline is designed to improve. These are design goals and the mechanism behind them, not a published before/after study with reported effect sizes.

Usable pool size. Items sitting at low response counts move through the calibration threshold within days of crossing the minimum, rather than waiting on a weekly manual re-fit. The engine sees a wider eligible set at every selection step, which is the direct path to better information-per-question.

Measurement precision. A wider, better calibrated pool gives the selection step more options to maximize Fisher information over, so the items chosen extract a little more signal per response. The aim is sharper measurement at session end for the same number of items.

Item-drift detection is no longer manual. When an item leaks (it appears in an external forum thread, its exposure runs past budget, its response pattern shifts), the manual weekly cadence would catch it only on the next Friday review. The nightly job is built to flag drift far faster, so a compromised item can be retired before it skews many sessions.

What it doesn't fix

Cold-start items. A brand-new item with zero responses still carries author-set priors, no posteriors, and a wide credible interval. The selection step deprioritizes them. They have to be served at low exposure to enough candidates of varied ability before they clear the threshold and start scoring competitively in the selection step. There is no algorithmic shortcut to that. The shortcut is writing better priors at authoring time, which is the curriculum team's work, not the engine team's.

We will write a longer note on cold-start when the work being done at authoring time (the difficulty-prediction model the curriculum team is fitting against the accumulated alpha data) is ready to ship. That is the next pipeline note worth writing.

Brightroom Engineering
