An engine that picks the next question, properly.

Three months ago we wrote that the engine was seven lines of Python and a number that went up and down. The note ended with a forecast: in Q1 2024 the loop becomes proper 3PL Item Response Theory. That happened last Tuesday. This is what shipped, what it does, and what is still wrong with it.

What changed

The v0 engine ran on a rolling-window scalar: recent correct-rate, bounded to [0, 1], used as both a score and a targeting parameter. It worked well enough to demonstrate the loop existed. It was not a measurement model.
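For readers who did not see the December note, a scalar of this kind can be sketched as below. This is an illustration and not the original seven lines: the class name `RollingAbility` and the window size of 20 are assumptions.

```python
from collections import deque


class RollingAbility:
    """Rolling-window correct-rate, used as both score and target (v0 style)."""

    def __init__(self, window: int = 20):  # window size is an assumed value
        # deque(maxlen=...) silently drops the oldest response when full
        self.recent = deque(maxlen=window)

    def update(self, correct: bool) -> float:
        """Record one response and return the new ability scalar in [0, 1]."""
        self.recent.append(1 if correct else 0)
        return sum(self.recent) / len(self.recent)
```

Every response inside the window carries the same weight and there is no uncertainty attached to the result, which is why a scalar like this swings on a short run of wrong answers.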

v1 replaces the scalar with a three-parameter logistic Item Response Theory estimator. Every item in the pool now carries three numbers: discrimination (a), difficulty (b), and a guessing floor (c). Every candidate carries an ability estimate (θ) and a standard error. The selection step is no longer "pick the item nearest the current ability." It is "pick the item that maximizes Fisher information at the current ability, subject to topic-balance and exposure constraints."
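The three item parameters combine into a single response curve. The sketch below uses the standard 3PL form without the optional 1.7 scaling constant; whether the production engine includes that constant is not stated here, and the function name `p_correct` is an assumption.

```python
import math


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    """3PL probability that a candidate at ability theta answers correctly.

    a: discrimination (slope), b: difficulty (location), c: guessing floor.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

When θ equals the item's difficulty b, the probability sits halfway between the guessing floor and 1, so an item with c = 0.2 is answered correctly 60% of the time by a candidate exactly at its difficulty. Far below b, the curve flattens out at c rather than at zero.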

v0 select:  pool.filter(unseen).near(target)
v0 update:  ability = sum(recent.correct) / len(recent)
v1 select:  argmax(I(item, theta)) over eligible pool
v1 update:  theta, se = irt_mle(history, item_params)

The v0 loop and the v1 loop, side by side. The v1 update carries a standard error; the v0 update does not. That single field is the reason the rest of the loop can be honest.

The selection logic, in pseudocode, is what most adaptive textbooks describe and most adaptive products do not run:

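# v1 loop, per response: re-estimate ability, then pick the next item
theta, se = irt_mle(history, item_params)

# candidates that are unseen and satisfy the balance and exposure constraints
eligible = pool.filter(unseen, topic_balance, exposure_cap)
scored   = [(item, fisher_information(item, theta))
            for item in eligible]
next_q, _ = max(scored, key=lambda x: x[1])  # keep the item, drop its score

if se < TARGET_SE and coverage_met(mastery):
    end_session()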

Fisher information, at a given θ, measures how much the next response is expected to reduce uncertainty about the candidate's ability. Picking the item that maximizes it is the same as picking the item that, in expectation, teaches the engine the most per minute of candidate time. That is the entire point of an adaptive loop.
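For the 3PL model the item information has a closed form: I(θ) = a² · (Q/P) · ((P − c)/(1 − c))², where P is the probability of a correct response and Q = 1 − P. A minimal sketch of that formula and the argmax step follows. The `Item` type, the function names, and the three example items are assumptions for illustration, and the topic-balance and exposure constraints are left out.

```python
import math
from typing import NamedTuple, Sequence


class Item(NamedTuple):
    item_id: str
    a: float  # discrimination
    b: float  # difficulty
    c: float  # guessing floor


def p_correct(theta: float, item: Item) -> float:
    """3PL probability of a correct response (no 1.7 scaling constant)."""
    return item.c + (1.0 - item.c) / (1.0 + math.exp(-item.a * (theta - item.b)))


def fisher_information(item: Item, theta: float) -> float:
    """Closed-form 3PL item information at ability theta."""
    p = p_correct(theta, item)
    return item.a ** 2 * ((1.0 - p) / p) * ((p - item.c) / (1.0 - item.c)) ** 2


def select_next(eligible: Sequence[Item], theta: float) -> Item:
    """Return the eligible item with the highest information at theta."""
    return max(eligible, key=lambda item: fisher_information(item, theta))


# Hypothetical pool: identical slope and floor, different difficulties.
pool = [
    Item("easy", 1.0, -2.0, 0.2),
    Item("matched", 1.0, 0.0, 0.2),
    Item("hard", 1.0, 2.0, 0.2),
]
next_q = select_next(pool, theta=0.0)
```

With equal discrimination across the pool, the item whose difficulty is closest to the candidate's ability wins, which is the v0 behavior. The two rules part ways once discrimination varies: a sharply discriminating item slightly off target can carry more information than a flat item exactly on it.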

The mastery vector, stubbed

v1 also stubs in the structure that v2 will fill out: alongside the scalar θ, the engine now carries a thirty-element vector of sub-skill estimates: quant sub-strands, verbal sub-strands, data insights sub-strands. Each response updates a small set of those entries, weighted by the sub-skill loadings attached to the item.
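The update rule itself is not described in this note. One plausible shape, shown purely as an assumption, is a loading-weighted nudge toward the response outcome; the learning rate, the 0-to-1 scale of the entries, and the dictionary of loadings are all hypothetical.

```python
from typing import Dict, List

N_SUBSKILLS = 30  # the "thirty-element vector" from the text


def update_mastery(
    mastery: List[float],
    loadings: Dict[int, float],
    correct: bool,
    rate: float = 0.1,  # assumed step size
) -> List[float]:
    """Return a new mastery vector after one response.

    loadings maps sub-skill index -> weight in [0, 1]. Only the sub-skills
    the item loads on move; every other entry is copied unchanged.
    """
    target = 1.0 if correct else 0.0
    updated = list(mastery)
    for idx, weight in loadings.items():
        updated[idx] += rate * weight * (target - updated[idx])
    return updated


# A hypothetical item that loads fully on sub-skill 3 and half on sub-skill 7.
start = [0.5] * N_SUBSKILLS
after = update_mastery(start, {3: 1.0, 7: 0.5}, correct=True)
```

A rule of this kind keeps each entry inside [0, 1] as long as the rate and the loadings do, and touches only the handful of sub-skills attached to the item.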

The vector is not yet a selection input. Selection still runs against the scalar. The mastery estimates are logged, surfaced in the analytics panel for testers to react to, and used to bias the topic-balance constraint when two candidate items score equally on Fisher information. We will write a longer note when the vector becomes a first-class part of the selection step. That is v2.

What this fixes from v0

Three things we wrote about in the December note are materially better.

The estimator is no longer noisy. The bad-Tuesday problem from v0, where one rough session dragged the ability estimate down half a band, is gone. A 3PL MLE against the candidate's full history, with a smoothing prior, simply does not move that fast in response to a single session. The trade-off is that early sessions converge more slowly; the standard error stays wide for the first eight to ten items, then tightens. That is the correct shape.
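To make the shape of that estimator concrete, here is a small sketch that treats "MLE with a smoothing prior" as a posterior-mode estimate under a normal prior on θ. That reading, the prior width, the grid search (used instead of Newton steps for simplicity), and the name `estimate_theta` are assumptions, not the production `irt_mle`.

```python
import math
from typing import List, Tuple

Params = Tuple[float, float, float]  # (a, b, c)


def p_correct(theta: float, a: float, b: float, c: float) -> float:
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float, c: float) -> float:
    p = p_correct(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def estimate_theta(
    history: List[Tuple[Params, bool]], prior_sd: float = 1.0
) -> Tuple[float, float]:
    """Return (theta, se) from the full response history.

    theta is the mode of the log posterior on a grid over [-4, 4];
    se comes from the test information at theta plus the prior precision.
    """

    def log_posterior(theta: float) -> float:
        total = -0.5 * (theta / prior_sd) ** 2  # normal prior, up to a constant
        for (a, b, c), correct in history:
            p = p_correct(theta, a, b, c)
            total += math.log(p if correct else 1.0 - p)
        return total

    grid = [i / 100.0 for i in range(-400, 401)]
    theta = max(grid, key=log_posterior)
    info = 1.0 / prior_sd ** 2 + sum(
        item_information(theta, *params) for params, _ in history
    )
    return theta, 1.0 / math.sqrt(info)
```

With an empty history the estimate sits at the prior mean with a standard error equal to the prior width. Each response adds a bounded amount of information, so the standard error narrows gradually and no single answer can move θ very far once the history is long.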

The item pool can be larger without selection getting worse. v0 broke when we widened the pool: with no per-item parameters, "near the target" had no good way to choose between two items at similar difficulty. v1 scores every item by information at the candidate's current ability, so a wider pool produces a sharper selection, not a noisier one. The pool is now at roughly 2,400 items. v0 was running on 380.

Stop conditions are honest. v0 ran for a fixed item count. v1 ends a session when the standard error on θ falls below a target, or when pacing data suggests the next item would harm the candidate more than help them. A short, clean session is now a possible outcome. A long, ragged one is no longer the default.

What is still wrong with it

Two things. Both are calibration problems, which is the category of problem we said in December was the actual bottleneck.

About a quarter of the pool is under-calibrated. Items that have been seen by fewer than roughly 120 candidates of varied ability carry parameter estimates whose confidence intervals are wide enough that the Fisher-information ranking against them is not yet trustworthy. The engine knows this. Those items are seeded at low exposure, and the selection step deprioritizes them. But the consequence is that the v1 pool behaves like a smaller pool than its raw size suggests. Closing this gap is the work of the next ninety days.
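How the deprioritization is implemented is not spelled out here. One simple way to express the idea, offered as an assumption rather than the shipped rule, is to discount an item's information by how far its response count falls short of the calibration threshold. The linear discount and the function names are hypothetical; the threshold of 120 is the rough figure from the paragraph above.

```python
from typing import List, Tuple

MIN_CALIBRATION_SAMPLE = 120  # "roughly 120 candidates" from the text

# (item_id, fisher_information_at_theta, n_responses_seen)
ScoredItem = Tuple[str, float, int]


def calibration_weight(n_responses: int, threshold: int = MIN_CALIBRATION_SAMPLE) -> float:
    """Linear discount in [0, 1]; 1.0 once an item has enough responses."""
    return min(1.0, n_responses / threshold)


def rank_items(scored: List[ScoredItem]) -> List[ScoredItem]:
    """Order items by information discounted for thin calibration data."""
    return sorted(
        scored,
        key=lambda row: row[1] * calibration_weight(row[2]),
        reverse=True,
    )


# Hypothetical candidates: the new item looks better on paper, but its
# parameters rest on only 30 responses.
ranked = rank_items([("new", 0.30, 30), ("seasoned", 0.20, 500)])
```

Under a discount like this, a thinly calibrated item has to look much more informative than a well-calibrated one before it is chosen, which is one way a pool ends up behaving smaller than its raw size.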

Pacing is not yet a first-class signal. The loop selects on ability and topic balance. It does not select on the candidate's pacing pattern within the current session: time-on-task, the shape of the response latencies, the gap between fast-confident answers and slow-uncertain ones. Time-on-task is logged. It is not consumed. We will write a note when it is, because that is the next loop change worth describing.

Who this is in front of

v1 went live last Tuesday for the private-alpha cohort, which has grown since the small group we wrote about in December. From this week, v1 is also the engine behind Pro 1-month, the first paid Brightroom tier. Pro 1-month is $199 a month, gives a candidate the same engine the alpha cohort has been using, and is being kept deliberately narrow on features until we are confident the engine is worth paying for. The diagnostic stays free.

We are calling it v1. We are not calling it finished. A v1 engine with a quarter of its pool under-calibrated and no pacing signal is a useful product, but it is not the engine we want to be running this time next year. v2 is the mastery vector becoming a selection input. v3 is pacing as a constraint inside the loop. Both are on the 2024 roadmap. Neither is launched until the data say so.

Brightroom Engineering
