The March 2024 engine note ended with an admission: about a quarter of the item pool was under-calibrated, and the selection step deprioritized those items because their parameter estimates were not yet trustworthy. The pool behaved like a smaller pool than its raw size suggested.
Last week the calibration pipeline that closes that gap moved from manual to nightly. Every item parameter in the pool now gets re-estimated every twenty-four hours, against a rolling thirty-day window of responses, on a cluster job that finishes around four in the morning St. Gallen time. This is a note on what that job actually does, why it took five months to write properly, and what is still wrong with it.
What "calibration" means in this loop
Every item in the v1 pool carries three numbers — discrimination (a), difficulty (b), and a guessing floor (c). The engine's selection step scores candidate items by Fisher information at the candidate's current ability, and that score depends entirely on those three numbers being right. A misfit difficulty moves an item to the wrong place on the information curve; a misfit discrimination overstates or understates how much the item teaches the engine.
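To make the dependence on (a, b, c) concrete, here is a minimal sketch of the information score. It assumes the standard three-parameter logistic (3PL) model without the 1.7 scaling constant; the note does not say which exact parameterization the engine uses, and the function names are illustrative.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability of a correct response at ability theta (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta: float, a: float, b: float, c: float) -> float:
    """Item information under the 3PL model."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return a * a * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Same item, same candidate (theta = 0). If the difficulty estimate is off
# by one unit, the item is scored at the wrong place on its information curve.
well_placed = fisher_information(theta=0.0, a=1.2, b=0.0, c=0.2)
misplaced = fisher_information(theta=0.0, a=1.2, b=1.0, c=0.2)
```

In this example `misplaced` comes out lower than `well_placed`, so a candidate at θ = 0 would see the item ranked lower than it deserves if the true difficulty were 0.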
Calibration is the work of estimating (a, b, c) from real response data — and re-estimating them as more data arrives. The estimates start as priors (an author's target difficulty, a content-family heuristic) and become posteriors only after enough candidates of varied ability have responded to the item. Until they do, the item is sitting in a probabilistic fog.
What the figure shows is the typical trajectory for an item that has crossed the calibration threshold. The point estimate of difficulty barely moves — the prior was close — but the credible interval narrows enough that the Fisher-information score against the item is no longer dominated by uncertainty. Multiply this across roughly 2,400 items and the pool behaves measurably differently.
The pipeline, end to end
Three stages, every night.
Stage 1: pull. The response logger has been recording every alpha and Pro session since December with full context: the candidate's θ at time of presentation, time-on-task, the option selected, the prior items in the session, the session number, the engine version. The nightly job pulls the last thirty days of those records, partitions by item, and drops items with fewer than twenty-five responses in the window. (Below twenty-five, the uncertainty around the marginal maximum-likelihood fit is too wide for the estimate to be useful.)
Stage 2: fit. For each remaining item, the job runs an EM estimator against the response data, with the previous night's posterior as the prior. Items whose difficulty estimate moves by more than 0.15 get flagged for review. Items whose discrimination drops below 0.4 get flagged for retirement. Items whose guessing-floor estimate rises by more than 0.05 get reviewed for ambiguity in the answer-choice text.
Stage 3: publish. The updated parameter table is written to a versioned snapshot. The engine reads the latest snapshot at session start, not on a per-request basis, so a candidate's session uses a single coherent set of parameters from start to finish. The snapshot rotation is the reason the cron lands at four in the morning: it is the quietest window for the small number of sessions that span midnight.
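The session-start read can be sketched as follows. This is a hypothetical in-memory model, not the engine's storage layer: `Snapshot`, `SnapshotStore`, and `Session` are invented names, and the point is only that a session pins one snapshot and keeps it even if a newer one is published mid-session.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping, Tuple

Params = Tuple[float, float, float]  # (a, b, c)


@dataclass(frozen=True)
class Snapshot:
    version: int
    params: Mapping[str, Params]  # read-only view of the parameter table


@dataclass
class SnapshotStore:
    _snapshots: list = field(default_factory=list)

    def publish(self, params: dict) -> Snapshot:
        # Copy the table so later edits by the caller cannot leak in.
        snap = Snapshot(len(self._snapshots) + 1, MappingProxyType(dict(params)))
        self._snapshots.append(snap)
        return snap

    def latest(self) -> Snapshot:
        return self._snapshots[-1]


class Session:
    def __init__(self, store: SnapshotStore) -> None:
        # Read once at session start; never re-read per request.
        self.snapshot = store.latest()

    def params_for(self, item_id: str) -> Params:
        return self.snapshot.params[item_id]


store = SnapshotStore()
store.publish({"item-1": (1.1, 0.30, 0.20)})
session = Session(store)                       # starts before the rotation
store.publish({"item-1": (1.1, 0.34, 0.20)})   # nightly job publishes
late_session = Session(store)                  # starts after the rotation
```

Here `session` keeps reading version 1 for its whole lifetime, while `late_session` sees version 2.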
    # nightly calibration job
    window = last_n_days(30)
    items = pool.all()

    for item in items:
        responses = window.filter(item_id=item.id)
        if len(responses) < MIN_RESPONSES:
            continue

        prior = item.posterior  # last night's
        a, b, c, se = em_fit(responses, prior=prior)

        if drift(item.b, b) > DRIFT_THRESHOLD:
            flag(item, reason="difficulty drift")
        if a < A_FLOOR:
            flag(item, reason="low discrimination")

        item.update(a=a, b=b, c=c, se=se)

    snapshot.publish(items, ts=now())

Three hundred and forty lines of Python, plus a thin Rust EM kernel for the per-item fit. The orchestration is the longer half; the fitting is the smaller one. Everything that's hard about this job is in stages 1 and 3, not stage 2.
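The three stage 2 flag rules can also be written as a small pure function, which makes them easy to test in isolation. This is a sketch under assumptions: the thresholds are the ones quoted in this note, but `ItemParams`, `review_flags`, and the flag strings are illustrative, and drift is taken as the absolute change from the previous night's estimate.

```python
from typing import List, NamedTuple


class ItemParams(NamedTuple):
    a: float  # discrimination
    b: float  # difficulty
    c: float  # guessing floor


DRIFT_THRESHOLD = 0.15    # difficulty movement that triggers review
A_FLOOR = 0.4             # discrimination below this triggers retirement review
C_RISE_THRESHOLD = 0.05   # upward guessing-floor movement that triggers review


def review_flags(previous: ItemParams, fitted: ItemParams) -> List[str]:
    """Return the flags raised by one night's re-fit of a single item."""
    flags = []
    if abs(fitted.b - previous.b) > DRIFT_THRESHOLD:
        flags.append("difficulty drift")
    if fitted.a < A_FLOOR:
        flags.append("low discrimination")
    if fitted.c - previous.c > C_RISE_THRESHOLD:  # upward movement only
        flags.append("guessing floor rise")
    return flags


last_night = ItemParams(a=1.0, b=0.0, c=0.20)
tonight = ItemParams(a=0.35, b=0.2, c=0.27)
flags = review_flags(last_night, tonight)
```

An item can raise more than one flag in a night; in the example above all three rules fire.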
The hardest call: anchors
A naive nightly re-fit moves every item every night. That breaks cross-cohort ability comparison: a candidate sitting today and a candidate sitting six weeks ago are no longer scored against the same yardstick. Two candidates with identical response patterns would get different θ readings purely because the intervening forty-two re-fits moved the item parameters underneath them.
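A toy illustration of the yardstick problem, assuming a 3PL model and a grid-search maximum-likelihood estimate of θ. Neither is confirmed as the engine's scoring method, the item parameters are made up, and the uniform 0.2 shift in difficulty is deliberately exaggerated to make the effect visible.

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[float, float, float]  # (a, b, c)


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta: float, items: Sequence[Params],
                   responses: Sequence[bool]) -> float:
    total = 0.0
    for (a, b, c), correct in zip(items, responses):
        p = p_correct(theta, a, b, c)
        total += math.log(p if correct else 1.0 - p)
    return total


def estimate_theta(items: Sequence[Params], responses: Sequence[bool]) -> float:
    """Grid-search maximum-likelihood estimate of theta on [-4, 4]."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# One response pattern, scored against two versions of the same five items.
responses = [True, True, False, True, False]
six_weeks_ago: List[Params] = [
    (1.2, -1.0, 0.2), (1.0, -0.5, 0.2), (1.4, 0.0, 0.2),
    (0.9, 0.5, 0.2), (1.1, 1.0, 0.2),
]
today = [(a, b + 0.2, c) for a, b, c in six_weeks_ago]  # re-fits moved b

theta_then = estimate_theta(six_weeks_ago, responses)
theta_now = estimate_theta(today, responses)
```

The response pattern is identical in both calls, yet `theta_now` lands above `theta_then` by roughly the size of the shift, purely because the parameters moved.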
The fix is anchor items. About 12% of the pool is held out from nightly re-fit — items whose parameters have been stable for ninety days and whose response counts exceed five hundred. Anchors fix the scale. The rest of the pool floats relative to them. A candidate's scaled-score readout, and the guarantee bands, are ultimately defined against the anchors, not against the floating pool.
Choosing the anchor set was the work of the last six weeks. We held out a candidate set of 380 items, watched the per-anchor drift across forty-two nightly fits run in shadow against the manual-calibration baseline, and kept the 290 anchors whose parameters did not move more than 0.05 on difficulty across that window. The other ninety dropped back into the float.
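The filtering step of that selection can be sketched as below. The 0.05 threshold comes from this note; everything else is an assumption for illustration, including the function names, the shape of the shadow-fit history, and the choice to measure drift against the first fit in the window (a max-minus-min range would be an equally plausible reading).

```python
from typing import Dict, List

MAX_ANCHOR_DRIFT = 0.05  # difficulty units, as quoted in the note


def difficulty_drift(b_history: List[float]) -> float:
    """Largest movement of b away from its value at the start of the window."""
    baseline = b_history[0]
    return max(abs(b - baseline) for b in b_history)


def select_anchors(shadow_fits: Dict[str, List[float]]) -> List[str]:
    """Keep candidate anchors whose difficulty stayed within the drift limit."""
    return sorted(
        item_id
        for item_id, history in shadow_fits.items()
        if difficulty_drift(history) <= MAX_ANCHOR_DRIFT
    )


# Hypothetical nightly difficulty estimates for two candidate anchors.
shadow_fits = {
    "stable-item": [0.50, 0.52, 0.49],
    "wandering-item": [0.50, 0.58, 0.61],
}
anchors = select_anchors(shadow_fits)
```

Items that fail the check are not discarded; as described above, they drop back into the floating pool and keep being re-fit nightly.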
What this fixes
Three measurable improvements, comparing the two weeks before and two weeks after the cron went live on May 5.
Usable pool size grew by 19%. Items that had been sitting at twenty-to-fifty responses are now moving through the calibration threshold within days of crossing twenty-five. The engine sees a wider eligible set at every selection step, which is the direct path to better information-per-question.
Standard error on θ tightened by 0.04. Same number of items per session, sharper measurement at the end. The mechanism is the wider pool: the selection step has more options to maximize Fisher information over, so the items chosen extract a little more signal per response.
Item-drift detection is no longer manual. Two items leaked in late April: they appeared in an external forum thread, their exposure exceeded the internal budget, and their response patterns shifted. Under the manual weekly cadence we would have caught them on the next Friday review. The nightly job flagged both within forty-eight hours of the leak. They were retired before the second weekend.
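The link between a wider pool and a tighter standard error on θ follows from the textbook asymptotic result for maximum-likelihood ability estimates. The note does not state which estimator the engine uses, so read this as the general relationship, not the engine's exact computation. With n items administered and I_i the Fisher information of item i:

```latex
\mathrm{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{\sum_{i=1}^{n} I_i(\hat{\theta})}}
```

With n fixed, the only way to shrink the standard error is to raise the information of the items chosen, which is what a larger eligible set makes possible.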
What it doesn't fix
Cold-start items. A brand-new item with zero responses still carries author-set priors, no posteriors, and a wide credible interval. The selection step deprioritizes such items. They have to be served at low exposure to a thousand-ish candidates of varied ability before they clear the threshold and start scoring competitively in the selection step. There is no algorithmic shortcut to that. The shortcut is writing better priors at authoring time, which is the curriculum team's work, not the engine team's.
We will write a longer note on cold-start when the work being done at authoring time — the difficulty-prediction model the curriculum team is fitting against the accumulated alpha data — is ready to ship. That is the next pipeline note worth writing.
— Brightroom Engineering