Three months ago, in a founding letter, we promised to build the measurement layer first. This is a short note on what that looks like with three months on it.

There is, as of three weeks ago, a Brightroom adaptive engine. It runs in a private alpha for roughly thirty testers, all candidates sitting the new GMAT Focus Edition inside the next twelve months. The loop works. It is not yet good. We want to put on the record what it does, what it does not, and what we have learned from three weeks of watching it run.

What v0 actually does

Unglamorous, on purpose. After each response, the engine updates a single scalar — an ability estimate, in the rolling-window sense — and selects the next item from a small pool based on a target difficulty near that estimate. There is no Item Response Theory yet. There are no item parameters fit from data yet. There is no mastery vector yet. There is a number that goes up when the candidate gets things right and down when they don't, and a selection rule that says: pick the next item near that number.

# v0 loop, per response
recent  = history[-WINDOW:]
ability = sum(recent.correct) / len(recent)   # bounded [0, 1]

target  = lerp(EASY, HARD, ability)
pool    = items.filter(unseen).near(target)
next_q  = random.choice(pool)

That is the entire selection logic. Seven lines of Python sitting behind a small service. The rest of the alpha is plumbing — auth, session state, the response logger that captures every answer alongside the context the engine had when it served the item.
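For readers who want something they can run, here is a minimal sketch of the loop above. It is an illustration, not the alpha's code: the window length, the difficulty scale, the tolerance that defines "near", the starting estimate for an empty history, and the `Item` type are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

WINDOW = 10             # assumed window length; the note does not state it
EASY, HARD = 0.0, 1.0   # assumed endpoints of the difficulty scale
TOLERANCE = 0.15        # assumed meaning of "near" the target difficulty


@dataclass(frozen=True)
class Item:
    """Hypothetical item record with a hand-assigned difficulty."""
    item_id: str
    difficulty: float


def lerp(lo: float, hi: float, t: float) -> float:
    """Linear interpolation between lo and hi for t in [0, 1]."""
    return lo + (hi - lo) * t


def estimate_ability(history: list[bool], window: int = WINDOW) -> float:
    """Fraction correct over the last `window` responses, bounded [0, 1]."""
    recent = history[-window:]
    if not recent:
        return 0.5  # assumed starting point before any response exists
    return sum(recent) / len(recent)


def select_next(items, seen, history, rng=random):
    """Pick a random unseen item whose difficulty is near the target."""
    target = lerp(EASY, HARD, estimate_ability(history))
    unseen = [it for it in items if it.item_id not in seen]
    if not unseen:
        return None  # pool exhausted for this candidate
    pool = [it for it in unseen if abs(it.difficulty - target) <= TOLERANCE]
    if not pool:
        # Nothing within tolerance: fall back to the closest unseen item.
        pool = [min(unseen, key=lambda it: abs(it.difficulty - target))]
    return rng.choice(pool)
```

The fallback to the closest unseen item is a design choice of this sketch; with a small pool, a strict "near" filter would otherwise come back empty.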

Why it is not 3PL yet

We are deliberately not running 3PL Item Response Theory yet, even though that is the obvious next step and the thing this engine will become. The reason is calibration data.

A 3PL model with hand-set priors and no posteriors does worse than a simpler model with no priors at all: it routes confidently in the wrong direction. With three weeks of alpha responses we do not yet have enough density on enough items to estimate the item parameters that the 3PL loop we have in mind needs. Running 3PL without that calibration data was the most common failure mode among the test-prep adaptive systems we audited before starting.
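For reference, the standard 3PL response function has three parameters per item: discrimination `a`, difficulty `b`, and a guessing floor `c`. The sketch below uses the conventional names and purely illustrative values; none of them are Brightroom parameters. Note that `theta` here is on the usual unbounded IRT scale, not the [0, 1] scale of the v0 estimate.

```python
import math


def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Standard 3PL: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative only: the same candidate and item, with the difficulty
# hand-set on the wrong side of the candidate's ability.
theta = 0.0
p_if_b_is_easy = p_correct_3pl(theta, a=1.0, b=-1.0, c=0.2)
p_if_b_is_hard = p_correct_3pl(theta, a=1.0, b=1.0, c=0.2)
gap = p_if_b_is_easy - p_if_b_is_hard
```

A guessed `b` that lands on the wrong side of the candidate changes the predicted probability substantially, and the model has no way to notice until parameters are fit from responses.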

Three weeks of alpha

Three things we have learned, in three weeks.

The cohort sticks. Bug reports run high; session abandons are low. We expected to lose more of the thirty than we have. The product is not yet good enough to be the reason they stay; we read this as proof they want the category to be better, not proof we are delivering anything yet.

The item pool is too small. Thirty testers running ninety-minute sessions burn through low-discrimination items quickly. We are seeing item re-exposure inside the same week, which compromises the signal we need on each item's behaviour. The pool has to widen before the cohort does. That is the bottleneck on the Q1 2024 roadmap.

The estimator is noisy. A single bad-day-at-work session can throw the ability estimate by 0.15 on the [0, 1] scale — close to a whole band of the score range. We can see this in the per-session trajectories: a candidate sitting steady at 0.62 has one rough Tuesday and the engine pulls them back to 0.47 for the next session, then over-corrects upward. v1 needs a smoother estimator, and we know roughly what shape it takes.
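The size of that swing follows from the arithmetic of a rolling window: each response in a window of N carries a weight of 1/N, so a handful of extra misses moves the estimate a long way. A small sketch, assuming a hypothetical window of 20 responses (the actual window length is not stated here):

```python
def rolling_ability(responses, window):
    """Fraction correct over the last `window` responses."""
    recent = responses[-window:]
    return sum(recent) / len(recent)


WINDOW = 20  # hypothetical window length, chosen for round numbers

steady = [True] * 12 + [False] * 8    # 12 of 20 correct
rough = [True] * 9 + [False] * 11     # 9 of 20 correct: three more misses

per_response = 1 / WINDOW             # weight of a single response
shift = rolling_ability(steady, WINDOW) - rolling_ability(rough, WINDOW)
```

Under that assumption, three extra misses are enough to produce a shift of about 0.15, the same size as the rough-Tuesday drop described above.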

What's next

In Q1 2024 the loop becomes proper 3PL IRT. Item parameters get fit from the alpha data we are spending these months collecting. The mastery vector — the structure that will hold per-sub-skill estimates separately from the global ability — gets stubbed in. The rolling-window ability gets replaced by something closer to a maximum-likelihood estimate against the candidate's full response history, with a smoothing prior that the noisy-Tuesday problem cannot easily move.
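One possible shape for such an estimator is a maximum a posteriori (MAP) estimate: the full-history 3PL log-likelihood plus a Gaussian log-prior on ability. The sketch below is an assumption about the general form, not the planned v1 design. The grid range, the prior mean and spread, and the tuple layout of `responses` are all hypothetical.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """Standard 3PL response probability."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def map_ability(responses, prior_mean=0.0, prior_sd=1.0, grid_steps=801):
    """Grid-search MAP estimate of ability on [-4, 4].

    `responses` is the candidate's full history as (a, b, c, correct)
    tuples. The score is log-likelihood plus a Gaussian log-prior, so a
    short run of misses is weighed against everything that came before.
    """
    best_theta, best_score = prior_mean, -math.inf
    for i in range(grid_steps):
        theta = -4.0 + 8.0 * i / (grid_steps - 1)
        # Gaussian log-prior, up to an additive constant.
        score = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
        for a, b, c, correct in responses:
            p = p_correct_3pl(theta, a, b, c)
            score += math.log(p if correct else 1.0 - p)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta
```

A grid search is used here only because it is easy to read and has no dependencies; a production estimator would more likely use Newton-Raphson or a library optimiser.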

We are not going to call any of this "launched." We will keep calling it alpha. The right name is decided by the candidates and the data, not by us. The bar to call it live is a measurable lift on a held-out cohort, not a feature checkmark.

The harder problem behind the loop

We are spending more time on the calibration pipeline than on the loop itself. The loop is seven lines. The pipeline that feeds it is the actual product.

Every response that comes back from the alpha is logged with full context — the ability estimate at time of presentation, the time-on-task, the option selected, the prior items in the session, the candidate's session number. We need that context, because in three months when we fit item parameters from the accumulated data we have to be able to ask: was this item a discriminator at high ability, or was it just a low-information item that happened to get many responses by accident? Logging the response without the context is the version of this work that produces a bad engine that looks fine.
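As an illustration of what one logged response might carry, here is a minimal record sketch. The context fields mirror the list above; `candidate_id`, `item_id`, `correct`, the field names themselves, and the JSON-lines serialisation are assumptions made for the example, not the alpha's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResponseLog:
    """One served item plus the context the engine had when it served it."""
    candidate_id: str                # assumed identifier
    session_number: int
    item_id: str                     # assumed identifier
    ability_at_presentation: float   # engine's estimate when the item was served
    time_on_task_s: float
    option_selected: str
    correct: bool                    # assumed to be stored alongside the option
    prior_item_ids: tuple[str, ...]  # items already seen in this session


def to_json_line(record: ResponseLog) -> str:
    """Serialise a record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


example = ResponseLog(
    candidate_id="cand-001",
    session_number=4,
    item_id="q-117",
    ability_at_presentation=0.62,
    time_on_task_s=94.5,
    option_selected="C",
    correct=True,
    prior_item_ids=("q-031", "q-088"),
)
line = to_json_line(example)
```

Storing the ability estimate at presentation on every row is what later lets a fit ask how an item behaved at high versus low ability, rather than only how often it was answered correctly.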

We will write the pipeline up properly when it is doing more than logging. For now this is what the engine looks like. Not yet good. Honest about it.

— Brightroom Engineering