Data Insights needed three dimensions, not one.

The mastery vector is a selection input as of last Thursday. The 30-element sub-skill structure we stubbed in March 2024 (logged, surfaced in analytics, used only to bias topic-balance ties) now sits inside the loop next to Fisher information, with a learned cross-loading matrix and a per-candidate routing rule that respects the vector's shape. Two candidates with the same θ in Data Insights now get routed to different items.

The reason the change took ten months is structural. The v1 architecture treated each section as a scalar (one θ for Quant, one for Verbal, one for Data Insights), which works cleanly for Quant and Verbal and quietly fails for Data Insights. This post covers why DI needed three dimensions, what the model now looks like, and what is still hard.

The DI problem in one paragraph

Data Insights is, structurally, three sub-sections wearing one section name. The five item families (data sufficiency, multi-source reasoning, table analysis, graphics interpretation, two-part analysis) load on three latent abilities that GMAC does not separate in its public scoring: interpretation (reading a chart or table accurately), inference (deriving an unstated conclusion from given information), and integration (combining quantitative and verbal evidence across sources). A candidate strong on interpretation and weak on inference, and a candidate weak on interpretation and strong on inference, can sit the same section and produce the same scaled score. They do not have the same gap.

[Figure: sub-skill profiles for Candidate A and Candidate B on three axes (Interpretation, Inference, Integration); both candidates at θ = 0.64.]

Two alpha-cohort candidates with the same DI ability estimate and inverted sub-skill profiles. Under v1, the engine routed both to the same next-item difficulty. Under v2, it routes them to different items.

What the vector now contains

Thirty sub-skills, partitioned across the three sections as the GMAC content outline partitions them: twelve for Quant, ten for Verbal, eight for Data Insights. The DI eight collapse onto the three latent dimensions described above; the Quant twelve and Verbal ten collapse onto smaller latent structures that v2 also fits but does not yet route on, because the same-θ-different-profile problem is empirically rare in Quant and Verbal at the cohort sizes we have.

Each item in the pool now carries a loading vector: a sparse row in a 30-column matrix describing which sub-skills the item probes and with what weight. After every response, the candidate's mastery vector updates:

# v2 mastery update, per response
loading = item.loadings           # sparse 30-vector
predicted = sigmoid(loading @ m - item.b)
residual  = response.correct - predicted
m         = m + LR * loading * residual
m         = clip(m, -3.0, 3.0)

The update is small, sparse, and proportional to the residual: the difference between what the engine expected and what the candidate did. An item the candidate was expected to get right and got wrong moves the relevant sub-skill components downward; an item they were expected to miss and got right moves them upward. Items the engine predicted correctly barely move the vector at all, which is the right behavior: the engine has nothing to learn from confirmation.
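A minimal runnable sketch of the update above, in plain Python. The learning rate value, the dictionary representation of a sparse loading row, and the example item are assumptions for illustration; the post gives only the update rule and the clip bounds.

```python
import math

N_SUBSKILLS = 30   # from the post: 12 Quant + 10 Verbal + 8 Data Insights
LR = 0.1           # assumed learning rate; the post does not state a value
CLIP = 3.0         # clip bound from the snippet above


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def update_mastery(m, loadings, b, correct):
    """Return a new mastery vector after one response.

    m        -- list of N_SUBSKILLS floats, each already within [-CLIP, CLIP]
    loadings -- sparse loading row as {sub_skill_index: weight} (assumed layout)
    b        -- item difficulty
    correct  -- True if the candidate answered correctly
    """
    logit = sum(w * m[i] for i, w in loadings.items()) - b
    predicted = sigmoid(logit)
    residual = (1.0 if correct else 0.0) - predicted
    updated = list(m)
    # Only the loaded components move, so clipping them is equivalent to
    # clipping the whole vector when m starts inside the bounds.
    for i, w in loadings.items():
        updated[i] = max(-CLIP, min(CLIP, m[i] + LR * w * residual))
    return updated


m0 = [0.0] * N_SUBSKILLS
easy_item = {22: 0.7, 25: 0.3}   # hypothetical item loading on two sub-skills

# An easy item (b = -2.0) the engine expects this candidate to get right.
after_miss = update_mastery(m0, easy_item, b=-2.0, correct=False)
after_hit = update_mastery(m0, easy_item, b=-2.0, correct=True)
```

In this toy case the surprising miss moves component 22 down by several times as much as the expected hit moves it up, and the 28 components the item does not load on stay where they were.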

The selection step, updated

Selection no longer maximizes Fisher information alone. The new objective is a weighted sum:

# v2 selection, per response
theta, se = irt_mle(history, item_params)
m         = mastery_vector(history)

eligible  = pool.filter(unseen, topic_balance, exposure_cap)
target    = difficulty_target(theta, se, pace)

scored = [
    (item,
     ALPHA * fisher_information(item, theta)
   + BETA  * mastery_lift(item.loadings, m))
    for item in eligible
    if abs(item.b - target) <= window(pace)
]
next_q, _ = max(scored, key=lambda x: x[1])  # max() returns the (item, score) pair; keep the item

Fisher information rewards items that sharpen the scalar θ estimate. The new term, mastery_lift, rewards eligible items that probe the candidate's weakest sub-skill. The two terms compete: the most informative item for θ is not always the most informative item for the candidate's mastery shape. ALPHA and BETA are tuned per section. In Data Insights they sit at roughly 0.6 and 0.4. In Quant and Verbal, BETA is still set to a small value pending the data that says it should move.
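The sketch below makes the two-term objective concrete on a toy three-axis vector (interpretation, inference, integration). fisher_information uses the standard 3PL item information formula. The body of mastery_lift is a hypothetical stand-in, since the post does not define it: here it is the loading-weighted distance of the probed sub-skills from the +3.0 ceiling, scaled to [0, 1]. The Item fields, the item parameters, and both candidate profiles are illustrative assumptions.

```python
import math
from dataclasses import dataclass

ALPHA, BETA = 0.6, 0.4   # approximate Data Insights weights quoted in the post
CLIP = 3.0               # mastery components live in [-3.0, 3.0]


@dataclass
class Item:
    name: str
    a: float        # discrimination
    b: float        # difficulty
    c: float        # guessing floor
    loadings: dict  # {sub_skill_index: weight}, assumed layout


def fisher_information(item, theta):
    """Standard 3PL item information at ability theta."""
    logistic = 1.0 / (1.0 + math.exp(-item.a * (theta - item.b)))
    p = item.c + (1.0 - item.c) * logistic
    return item.a ** 2 * ((1.0 - p) / p) * ((p - item.c) / (1.0 - item.c)) ** 2


def mastery_lift(loadings, m):
    """Hypothetical definition: higher when the item probes weak sub-skills."""
    total = sum(loadings.values())
    gap = sum(w * (CLIP - m[i]) for i, w in loadings.items())
    return gap / (2.0 * CLIP * total)


def select_next(items, theta, m, target, window):
    scored = [
        (item, ALPHA * fisher_information(item, theta)
               + BETA * mastery_lift(item.loadings, m))
        for item in items
        if abs(item.b - target) <= window
    ]
    if not scored:
        return None   # nothing inside the difficulty window
    best, _ = max(scored, key=lambda pair: pair[1])
    return best


pool = [
    Item("interp_item", a=1.0, b=0.6, c=0.2, loadings={0: 1.0}),
    Item("infer_item", a=1.0, b=0.6, c=0.2, loadings={1: 1.0}),
    Item("too_hard", a=1.0, b=2.5, c=0.2, loadings={2: 1.0}),
]

# Same theta, inverted profiles (axes: interpretation, inference, integration).
candidate_a = [1.5, -1.0, 0.5]
candidate_b = [-1.0, 1.5, 0.5]

pick_a = select_next(pool, theta=0.64, m=candidate_a, target=0.6, window=0.5)
pick_b = select_next(pool, theta=0.64, m=candidate_b, target=0.6, window=0.5)
```

The two in-window items have identical item parameters, so the Fisher term ties and the lift term alone decides: Candidate A is routed to the inference item and Candidate B to the interpretation item. In a real pool the two terms sit on different scales, which is part of what tuning ALPHA and BETA has to absorb.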

The hard part: the loadings matrix

The loadings matrix is the part that took most of the ten months. v1 had a hand-authored matrix written by the curriculum team: every item tagged with which sub-skills it probed, manually. Hand-authored matrices are fine for small pools and unreliable at scale; by the time we had two thousand items in the pool, the authoring drift between team members was producing meaningful inconsistency in the routing.

v2 fits the matrix from response data. The process is a constrained matrix factorization run nightly inside the calibration pipeline: given the response history (candidates × items × correct/incorrect) and the v1 hand-authored loadings as the prior, fit a sparse loadings matrix that maximizes the joint likelihood of the responses under the 3PL model with the mastery-adjusted ability. The constraint is the sparsity pattern: an item is allowed to load on at most three sub-skills, and item authors are still the source of truth for which sub-skills are eligible for a given item family. The data adjusts the weights.
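A much-reduced sketch of the idea for a single item, not the nightly pipeline itself. It holds candidate mastery vectors fixed, swaps the 3PL likelihood for the plain logistic used in the update snippet earlier, and treats the hand-authored weights as the mean of a Gaussian prior. The prior strength, step size, non-negativity constraint, and synthetic data are all assumptions. The sparsity pattern is enforced by only ever touching the sub-skills the author tagged.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_item_loadings(authored, b, responses,
                      prior_strength=1.0, lr=0.5, steps=200):
    """Gradient ascent on the log-posterior of one item's loading weights.

    authored  -- {sub_skill_index: weight}; its keys are the eligible set
    b         -- item difficulty, held fixed here
    responses -- list of (m, correct); m maps sub-skill index to mastery
    """
    if len(authored) > 3:
        raise ValueError("an item may load on at most three sub-skills")
    w = dict(authored)
    n = max(1, len(responses))
    for _ in range(steps):
        # Prior term pulls each weight back toward the authored value.
        grad = {i: -prior_strength * (w[i] - authored[i]) for i in w}
        for m, correct in responses:
            p = sigmoid(sum(w[i] * m[i] for i in w) - b)
            residual = (1.0 if correct else 0.0) - p
            for i in w:
                grad[i] += residual * m[i]
        for i in w:
            # Weights stay non-negative (an assumption of this sketch).
            w[i] = max(0.0, w[i] + lr * grad[i] / n)
    return w


# Synthetic check: the author overweighted sub-skill 22 and underweighted 25.
rng = random.Random(0)
true_w = {22: 0.3, 25: 0.9}
authored = {22: 0.9, 25: 0.3}
responses = []
for _ in range(600):
    m = {22: rng.uniform(-2, 2), 25: rng.uniform(-2, 2)}
    p_correct = sigmoid(sum(true_w[i] * m[i] for i in true_w))
    responses.append((m, rng.random() < p_correct))

fitted = fit_item_loadings(authored, b=0.0, responses=responses)
```

With a few hundred simulated responses, the fitted weight on sub-skill 22 falls below its authored value and the weight on 25 rises above it, while no weight appears on a sub-skill the author did not mark eligible.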

Two kinds of item reliably drift in this fit. The first is an item whose hand-authored loadings overweight the "dominant" sub-skill (the family the author belongs to) and underweight the secondary skill the item also probes. The data flags this within a week of the item accumulating sixty responses. The second is an item whose response pattern fits a sub-skill the author did not tag. Those get reviewed and either re-tagged or retired.
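The first kind of drift can be sketched as a comparison between authored and fitted weights. This is a hypothetical check, not the pipeline's actual rule: reading sixty responses as a minimum sample, the 0.25 tolerance, and the function name are all assumptions.

```python
MIN_RESPONSES = 60   # sixty responses, read here as a minimum sample size
TOLERANCE = 0.25     # assumed threshold for a weight shift worth reviewing


def drifted_weights(authored, fitted, n_responses, tolerance=TOLERANCE):
    """Return {sub_skill_index: fitted - authored} for weights that moved
    by more than `tolerance`. Empty if the item has too few responses."""
    if n_responses < MIN_RESPONSES:
        return {}
    shifts = {}
    for i, authored_weight in authored.items():
        delta = fitted.get(i, 0.0) - authored_weight
        if abs(delta) > tolerance:
            shifts[i] = delta
    return shifts


# Hypothetical item: author leaned on sub-skill 22, data says 25 matters as much.
authored = {22: 0.9, 25: 0.1}
fitted = {22: 0.5, 25: 0.5}

flags = drifted_weights(authored, fitted, n_responses=75)
too_early = drifted_weights(authored, fitted, n_responses=40)
```

A negative shift on the dominant sub-skill paired with a positive shift on the secondary one matches the overweight pattern described above. The second kind of drift needs a fit that is allowed to look outside the authored eligibility set, which this sketch does not attempt.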

Why this matters to a candidate

Two candidates sitting Data Insights with the same scaled score get different next-item sequences. The candidate whose alpha-cohort data identifies them as weak-interpretation gets routed to interpretation items first; the candidate whose data identifies them as weak-inference gets the inverse. Neither candidate is told, in product, "we think you are weak at X." The engine routes against the model, the candidate encounters the items, and the gap closes by the route rather than by the label.

The candidate-visible change is a sharper diagnostic report. The next time a candidate runs the diagnostic re-take, the DI sub-skill breakdown is the new three-axis structure, replacing the single per-section bar. The Section Analytics surface, which has been in alpha through the autumn, ships its DI panel against the same three axes when it goes to beta in the spring.

Brightroom Engineering
