About — Board Control

Who Built This

[Photo: Colin Davy on the Jeopardy! set, on the Alex Trebek Stage.]

Board Control is built by Colin Davy, a data scientist in Chicago and Jeopardy champion. You can read the full story of how he used data science to prepare for and win on Jeopardy in this article. Find him on Bluesky at @adjbaseline.

What This Site Does

Every day, this site automatically analyzes the latest Jeopardy game and generates win probability charts showing each player's chances of winning at every moment, like the win probability graphs ESPN shows during football games, but for Jeopardy. It covers every regular-season and tournament game: over 5,000 games and counting.

Beyond win probability, the site includes:

BUTTREY Ratings — Who's actually the best? A data-driven ranking of 600+ players using a Bradley-Terry model based on head-to-head Coryat score performance (which strips out wagering strategy).
BRING IT Forecaster — The dream-matchup tool. Pick any three rated players and BRING IT uses their BUTTREY ratings to predict who wins the head-to-head. Ken vs. James vs. Matt? Finally, an answer.
Daily Double Optimizer — Did they bet the right amount? See whether a contestant's wager was mathematically correct given their position and estimated knowledge.
Excitement Index — A composite score ranking every game by how dramatic it was. Lead changes, wild swings, upset potential, and clutch wagering all factor in.
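For readers curious how ratings turn into win chances, here is a minimal sketch. The pairwise formula is the standard Bradley-Terry form. The three-player version is an assumption (a Plackett-Luce-style share of total strength), not necessarily what BRING IT does, and the player names and rating values are hypothetical.

```python
import math
from typing import Dict


def pairwise_win_prob(rating_a: float, rating_b: float) -> float:
    """Bradley-Terry: probability that A beats B, given log-strength ratings."""
    return 1.0 / (1.0 + math.exp(rating_b - rating_a))


def three_way_win_probs(ratings: Dict[str, float]) -> Dict[str, float]:
    """Assumed three-player extension: each player's win chance is their
    share of the total exponentiated rating (Plackett-Luce style)."""
    top = max(ratings.values())  # subtract the max for numerical stability
    strengths = {name: math.exp(r - top) for name, r in ratings.items()}
    total = sum(strengths.values())
    return {name: s / total for name, s in strengths.items()}


# Hypothetical ratings, for illustration only.
hypothetical_ratings = {"Player A": 1.2, "Player B": 0.8, "Player C": 0.5}
probs = three_way_win_probs(hypothetical_ratings)
for name, p in probs.items():
    print(f"{name}: {p:.1%}")
```

Only rating differences matter in this form, so adding the same constant to every rating leaves the probabilities unchanged.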

Why It Exists

Board Control started with a simple question: "Was my Daily Double wager correct?" Answering that required a win probability model. Building the model opened up bigger questions — who's really the best player of all time, which games were the most dramatic, which wagers were smart and which were memorable blow-ups. One tool turned into the site you see now: a place for anyone who wants to understand the show the way serious fans understand any competitive format.

For short-form answers to common questions, see the FAQ. For questions or feedback, get in touch.

How the Excitement Index Is Calibrated

The Excitement Index is a 0–10 score built from ten measurable game-content signals — Round Tempo, Final Stakes, DD Wagering, FJ Cover Tightness, Hot Start, Buzzer Dominance, Stakes Context, Comeback Depth, FJ Swing, and Run-of-Correct. The FAQ has the short version. This section is for readers who want the underlying methodology.

The calibration target

The default slider weights aren't picked by feel. They're fit against actual r/Jeopardy community reaction. We pulled every available Reddit thread for each modern-era episode (~2,400 games with substantive discussion) and asked Claude Sonnet 4.6 to score each thread and its top 50 comments on a strict 1–10 rubric:

10 — multiple GOTY mentions, immediate consensus this was historic
7–8 — strong positive sentiment, real engagement
5–6 — moderate engagement, mixed reception
3–4 — subdued thread, mostly procedural
1–2 — dominated by "boring" / "snoozer" / "blowout" complaints

Crucially, the rubric explicitly debiases against star power. A close-fought game between unknowns can score 9 or 10; a blowout featuring a famous player can score 3 or 4. The signal we're after is sentiment density and content, not comment volume or name recognition.

The autoresearch loop

The 10-component formula wasn't designed up front. It was produced by twelve iterations of a structured loop:

Residual analysis — find the games where the current formula disagrees most with the calibration target.
Hypothesis — propose a new primitive (or a transform tweak, or a dropped feature) that could plausibly close the gap.
Re-fit — re-optimize all slider weights against the calibration target with the candidate change in place.
Ship or discard — keep the change only if it clears both gates: cross-validated Spearman ρ improves by at least +0.005, and the held-out (untrained) games show no regression.
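The ship-or-discard gate in the last step can be sketched as follows. This is an illustration under assumptions, not the site's actual code: the function names are hypothetical, Spearman ρ is computed by hand as the Pearson correlation of average ranks, and "no regression" is read as the held-out ρ not decreasing.

```python
from statistics import mean

MIN_CV_GAIN = 0.005  # threshold quoted in the text


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rho: Pearson correlation of the two rank vectors.
    Undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def should_ship(cv_before, cv_after, holdout_before, holdout_after):
    """Keep a candidate only if it clears both gates."""
    improves = (cv_after - cv_before) >= MIN_CV_GAIN
    no_regression = holdout_after >= holdout_before  # assumed reading
    return improves and no_regression
```

For example, `should_ship(0.60, 0.62, 0.59, 0.58)` rejects a candidate that improves the cross-validated score but regresses on the held-out games.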

Most candidates lost. The ones that survived twelve rounds of this gate are what's in v14. Architecturally the formula stayed simple throughout: a weighted mean of normalized primitives, one monotone transform per primitive, no multipliers, no conditional logic.
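A minimal sketch of that architecture, under stated assumptions: each primitive is already normalized to the range 0 to 1, each transform is a monotone function on that range, and the weighted mean is scaled by 10 to land on the 0–10 scale. The primitive names, weights, and transforms below are hypothetical placeholders, not the v14 values.

```python
import math
from typing import Callable, Dict


def excitement_score(
    primitives: Dict[str, float],
    weights: Dict[str, float],
    transforms: Dict[str, Callable[[float], float]],
) -> float:
    """Weighted mean of transformed primitives, scaled to 0-10.
    No multipliers and no conditional logic, as described in the text."""
    total_weight = sum(weights[name] for name in primitives)
    weighted_sum = sum(
        weights[name] * transforms[name](value)
        for name, value in primitives.items()
    )
    return 10.0 * weighted_sum / total_weight


# Hypothetical inputs, for illustration only.
example_primitives = {"comeback_depth": 0.64, "fj_swing": 0.25, "round_tempo": 0.5}
example_weights = {"comeback_depth": 2.0, "fj_swing": 1.0, "round_tempo": 1.0}
example_transforms = {
    "comeback_depth": math.sqrt,  # concave: rewards small comebacks early
    "fj_swing": lambda x: x,      # identity
    "round_tempo": lambda x: x,   # identity
}

score = excitement_score(example_primitives, example_weights, example_transforms)
print(round(score, 3))
```

Because every term is a bounded, monotone function of one primitive, raising any single signal can only raise the score, which keeps the slider weights easy to interpret.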

How well does it work?

On held-out games (episodes the optimizer never saw during fitting), Spearman ρ between the formula's score and the LLM-graded community sentiment is 0.61. Squaring that gives about 0.37, so roughly 37% of the variance in how Reddit ranks Jeopardy games (Spearman ρ works on rank order, not raw scores) is captured by these ten game-content primitives alone, with no information about who the players are or how famous they became.
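The arithmetic behind the 37% figure, written out. It assumes the squared coefficient is read as a share of variance, which for a rank correlation like Spearman ρ describes variance in rank order rather than in raw scores:

```latex
\rho = 0.61 \quad\Longrightarrow\quad \rho^{2} = 0.61^{2} = 0.3721 \approx 37\%, \qquad 1 - \rho^{2} = 0.6279 \approx 63\%
```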

Footnote: primitives that came back

A useful sanity check on whether this is data-driven or vibes-driven: several of the primitives in v14 had been rejected in earlier iterations under noisier calibration targets (an older version used a keyword-counting heuristic on Reddit comments instead of the LLM rubric). Once the target got cleaner, the same primitives passed the gate cleanly. Comeback Depth, Hot Start, Buzzer Dominance, FJ Cover Tightness, and Run-of-Correct all fall into this category — features whose signal was real but had been buried by noise in the older target.

The reverse also happened: two primitives the previous version (v9) leaned on, FJ Suspense and DD Correct Aggression, dropped out under the cleaner target. They were redundant with other features — FJ Cover Tightness measures the same thing as FJ Suspense more directly, and DD Correct Aggression correlates +0.50 with Final Stakes on raw values. Both were removed, weights re-fit without them, and the holdout score improved.

What this doesn't capture

The honest limitations. The formula sees gameplay numbers: scores, wagers, buzzer wins, leads, deficits. It does not see question quality, contestant chemistry, banter, host moments, or anything else that makes an episode feel alive on the broadcast. A game can be objectively close and tactically interesting and still feel flat on TV; another can have a "vibe" the numbers will never explain. The 0.61 holdout correlation is a real number (0.61 squared is about 0.37, or roughly 37% of the variance in community-sentiment rankings), but the remaining 63% is outside what these ten gameplay primitives capture.