When the judge shares the blind spot

LLM-as-judge blind spots are the systematic reasoning failures an automated evaluator inherits from the model it is built on. They matter because a benchmark scored by a language model and a reward model trained on that model’s preferences can agree completely and both be wrong in the same direction. New work on probabilistic reasoning has made the failure concrete, and it points at a fix that does not come from a better prompt. It comes from human ground truth whose consistency is tracked over time, so a team can tell genuine signal from shared error.


Ontology Roundup · Issue 04

The dice trap

A 20-second test of the same shortcut that trips eight state-of-the-art models. Answer first, then look.

You roll two fair dice. You are told at least one of them is a six.

What is the probability that both are sixes?

[Figure: the 36 outcomes of two dice. Still possible, with at least one six: 11 outcomes, of which both sixes: 1 outcome. Ruled out by the clue: 25 outcomes.]

Telling you “at least one is a six” does not isolate one die and leave the other free. It removes every outcome with no six at all, leaving 11 equally likely outcomes. Exactly one of them is the double six. So the answer is 1 in 11, not 1 in 6. The intuitive move quietly assumes the two dice stay independent after the clue, and the clue is precisely what links them.

A problem built to trip a shortcut

Avena et al., in “How reliable are LLMs when it comes to playing dice?”, tested eight state-of-the-art models on discrete-probability problems. On standard questions the models did reasonably well. On counterintuitive questions, the kind designed to trigger a heuristic shortcut, they failed in a consistent and predictable way, and chain-of-thought prompting did not reliably rescue them. The failures were not random noise that more sampling would average out. They clustered, because the shortcut that produces them is the same shortcut a person takes when they answer fast.

Take a problem in the paper’s spirit. You roll two fair dice. Someone you trust tells you at least one of them came up a six. What is the probability that both are sixes? The fast answer is one in six: one die is already a six, so the other just needs to match. The fast answer is wrong. Once you condition on at least one six, the outcomes still on the table number eleven, not thirty-six, and exactly one of those eleven is the double six. The probability is one in eleven, not one in six. The intuitive move treats the two dice as independent after the conditioning, and the conditioning is exactly what breaks the independence. A model that has absorbed the same shortcut from its training data reaches for the same one in six, and explains its way there fluently.
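The count of eleven can be checked by brute force. The following is a minimal sketch in Python that enumerates all 36 outcomes and applies the clue as a filter; the variable names are illustrative and not taken from the paper. It also computes what the shortcut actually answers, which is the different question "given that the first die is a six".

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Apply the clue: keep only outcomes that contain at least one six.
at_least_one_six = [o for o in outcomes if 6 in o]

# Within the conditioned set, count the double six.
both_sixes = [o for o in at_least_one_six if o == (6, 6)]
p_conditional = Fraction(len(both_sixes), len(at_least_one_six))

# The shortcut silently answers a different question:
# "given that the FIRST die is a six, is the second one a six too?"
first_is_six = [o for o in outcomes if o[0] == 6]
p_shortcut = Fraction(
    sum(1 for o in first_is_six if o == (6, 6)), len(first_is_six)
)

print(len(at_least_one_six), p_conditional, p_shortcut)  # 11 1/11 1/6
```

The two probabilities differ only because the conditioning sets differ: six outcomes when a specific die is named, eleven when the clue is "at least one".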

Why LLM-as-judge blind spots travel downstream

This would be a curiosity if probabilistic reasoning lived in a corner of the benchmark suite. It does not. The same models that miss the dice problem are now doing the grading. LLM-as-judge pipelines score open-ended outputs, rank candidate responses, and stand in for human raters at scale. Reward models are trained on preference labels, and a growing share of those labels are generated or filtered by other models. When the evaluator carries a reasoning blind spot, that blind spot does not stay in the evaluator. It propagates into every score it assigns and every preference it expresses, and from there into the reward model that learns to imitate it.

The implication is uncomfortable and specific: a reward model trained on model-generated preferences has likely inherited the exact failure modes of the model that produced them. That is what LLM-as-judge blind spots are at the system level: not a single wrong answer, but a bias baked into the training signal, invisible precisely because the thing that would catch it shares it.

Agreement is not evidence when the prior is shared

The usual defence is agreement. If the judge model and the policy model agree, or if several model graders converge, the result is treated as reliable. Convergence feels like triangulation. It is not, when the instruments share a calibration error. Three thermometers built with the same two-degree bias will agree with each other all day and all be two degrees wrong. The dice paper is the evidence that frontier models do share calibration errors on a describable class of problems, so model-on-model agreement on those problems certifies consensus, not correctness.
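The thermometer analogy can be put in numbers. This is a toy sketch, not a model of any real grader: the true temperature, the shared bias, and the small per-instrument offsets are all hypothetical values chosen for illustration. It separates two quantities that are easy to conflate, namely how much the instruments disagree with each other and how far their consensus sits from the truth.

```python
from statistics import mean, pstdev

TRUE_TEMP = 20.0    # hypothetical ground truth, in degrees
SHARED_BIAS = 2.0   # the same calibration error built into every instrument

# Small instrument-specific offsets (hypothetical values).
individual_offsets = [0.05, -0.10, 0.05]

# Each reading is the truth plus the shared bias plus its own small offset.
readings = [TRUE_TEMP + SHARED_BIAS + d for d in individual_offsets]

spread = pstdev(readings)             # how much the instruments disagree
consensus = mean(readings)            # what "agreement" reports
consensus_error = consensus - TRUE_TEMP  # how wrong that agreement is

print(f"spread={spread:.2f} error={consensus_error:.2f}")
```

The spread stays small while the error stays near two degrees. Adding more instruments with the same bias shrinks the first number and leaves the second untouched, which is why only a reference from outside the shared calibration can expose it.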

This is the structural point that connects to the rest of the Roundup. Issue 03 argued that reward-model QA is the missing layer between collecting step-level preference data and trusting the reward model trained on it. The dice result is why that layer cannot be staffed by more of the same models. The quality assurance has to come from outside the shared prior.

The check that does not share the failure mode

Human ground truth is the obvious candidate, and it is also the one most quietly eroded over the last two years, as model-generated labels got cheaper and human review got thinner. But not all human judgement is equal, and one-shot human labels have their own failure modes: fatigue, inconsistency, and the same fast-thinking shortcuts the dice problem exploits. A single anonymous rater answering one in six is no better than the model.

What breaks the shared-prior loop is human judgement you can actually characterise: evaluators whose reasoning consistency is measured across many tasks and tracked over time, so you can see who stays steady on counterintuitive problems and who drifts. Consistency that is observed and recorded is a different asset from consistency you assume. It lets a team weight, audit, and defend its ground truth, rather than hoping the crowd averaged out. Tracked human consistency is the property that lets you separate signal from shared error, because it is measured against outcomes the model prior did not get to define.
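One way to make "observed and recorded" concrete is a per-evaluator accuracy history on items whose answers are independently verified, for example by enumeration as in the dice problem. The sketch below is an assumption about how such a record might be kept, not a description of any existing platform: the `Judgement` record, the `period` field, and the `drift` measure are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    evaluator_id: str  # hypothetical stable identifier for the evaluator
    period: int        # e.g. a week number
    correct: bool      # matched the verified answer on a counterintuitive item


def accuracy_by_period(judgements):
    """Return {evaluator_id: {period: accuracy}} from a flat list of judgements."""
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for j in judgements:
        cell = tallies[j.evaluator_id][j.period]
        cell[0] += int(j.correct)
        cell[1] += 1
    return {
        ev: {p: c / n for p, (c, n) in sorted(periods.items())}
        for ev, periods in tallies.items()
    }


def drift(per_period):
    """Accuracy in the latest period minus accuracy in the earliest one."""
    periods = sorted(per_period)
    return per_period[periods[-1]] - per_period[periods[0]]


# Hypothetical log: eval-a stays steady, eval-b slips in period 2.
log = [
    Judgement("eval-a", 1, True), Judgement("eval-a", 1, True),
    Judgement("eval-a", 2, True), Judgement("eval-a", 2, True),
    Judgement("eval-b", 1, True), Judgement("eval-b", 1, True),
    Judgement("eval-b", 2, True), Judgement("eval-b", 2, False),
]

history = accuracy_by_period(log)
for evaluator, per_period in history.items():
    print(evaluator, per_period, drift(per_period))
```

A production record would need more than first-versus-last drift, such as confidence intervals and per-category breakdowns, but even this minimal history distinguishes a steady evaluator from one who is slipping, which a single batch of labels cannot do.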

What auditable human ground truth requires

For tracked human judgement to be trustworthy, it has to be verifiable by a third party, not just asserted by the platform that collected it. That needs three things, and all three exist as mature open standards. The first is stable evaluator identity that persists across batches and projects, so a consistency record attaches to a person rather than a disposable account; that is the work of the W3C Decentralized Identifiers specification. The second is signed, timestamped contributions that prove who judged what and when; these are carried as Verifiable Credentials. The third is a way to revoke or update a standing without re-issuing everything, which the W3C Bitstring Status List provides. On top of these, selective disclosure lets an evaluator prove domain expertise without exposing their identity or full history, so the audit does not become surveillance, and the Decentralized Identity Foundation maintains the interoperability work that keeps these pieces talking to each other.
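As a rough illustration of how the three pieces fit together, the sketch below shows the shape of a credential and a status lookup. The top-level field names follow the W3C Verifiable Credentials Data Model 2.0 and the Bitstring Status List entry format, but everything else is an assumption made for illustration: the `EvaluatorConsistencyCredential` type, the subject properties, the DIDs, the URLs, and all values are hypothetical. The cryptographic proof is omitted, and the status list is shown as already-decoded bytes, skipping the compression and encoding a real status list credential applies.

```python
# Hypothetical credential: field names follow the VC Data Model 2.0,
# but the custom type, subject properties, and all values are made up.
credential = {
    "@context": ["https://www.w3.org/ns/credentials/v2"],
    "type": ["VerifiableCredential", "EvaluatorConsistencyCredential"],
    "issuer": "did:example:evaluation-platform",
    "validFrom": "2025-01-01T00:00:00Z",
    "credentialSubject": {
        "id": "did:example:evaluator-123",       # persistent evaluator DID
        "tasksJudged": 400,                      # hypothetical property
        "accuracyOnCounterintuitiveItems": 0.93,  # hypothetical property
    },
    "credentialStatus": {
        "id": "https://example.com/status/1#5",
        "type": "BitstringStatusListEntry",
        "statusPurpose": "revocation",
        "statusListIndex": "5",
        "statusListCredential": "https://example.com/status/1",
    },
}


def is_revoked(bitstring: bytes, index: int) -> bool:
    """Read one status bit, treating index 0 as the left-most bit of byte 0."""
    byte = bitstring[index // 8]
    return bool((byte >> (7 - index % 8)) & 1)


# Toy, already-decoded status list in which only bit 5 is set.
status_list = bytes([0b00000100, 0b00000000])
index = int(credential["credentialStatus"]["statusListIndex"])

print(is_revoked(status_list, index), is_revoked(status_list, 4))  # True False
```

The design point is that revocation lives in a separate, shared list: flipping one bit changes an evaluator's standing without touching or re-issuing the credential itself.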

None of this is speculative cryptography. It is the same identity stack that already underwrites credential verification in other regulated settings, pointed at a new surface: the provenance of the humans who produce evaluation ground truth.

Where Ontology fits

Ontology’s contribution here is substrate, not a turnkey evaluator product. ONT ID and ONTO Wallet implement exactly these primitives: persistent decentralised identity, verifiable credentials held by the person rather than the platform, and selective disclosure as a default rather than a bolt-on. A team building evaluation infrastructure can use that substrate to give every evaluator a portable identity, attach a longitudinal consistency record to it as signed credentials, and let any downstream consumer of the data check that record without trusting the collector’s word for it. The dice paper is a useful jolt because it makes the abstract concrete: the judge can be confidently, fluently wrong, and the only check that helps is human ground truth you can verify. The week ahead takes that into consistency as a safety property, the question of who watches the watchers as models approach self-improvement, and what a shared standard for evaluator quality would actually contain.