{"id":922,"date":"2026-06-15T17:21:29","date_gmt":"2026-06-15T17:21:29","guid":{"rendered":"https:\/\/ont.io\/news\/?p=922"},"modified":"2026-06-15T17:21:35","modified_gmt":"2026-06-15T17:21:35","slug":"llm-as-judge-blind-spots","status":"publish","type":"post","link":"https:\/\/ont.io\/news\/llm-as-judge-blind-spots\/","title":{"rendered":"When the judge shares the blind spot"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">LLM-as-judge blind spots are the systematic reasoning failures an automated evaluator inherits from the model it is built on. They matter because a benchmark scored by a language model and a reward model trained on that model&#8217;s preferences can agree completely and both be wrong in the same direction. New work on probabilistic reasoning has made the failure concrete, and it points at a fix that does not come from a better prompt. It comes from human ground truth whose consistency is tracked over time, so a team can tell genuine signal from shared error.<\/p>\n\n\n\n<style>\n  :root{\n    --navy:#02101C;\n    --navy-2:#0A1B2A;\n    --navy-3:#10263A;\n    --blue:#48A3FF;\n    --blue-dim:#2C5A82;\n    --ink:#E8F1FB;\n    --muted:#9DB4C9;\n    --line:#1E3550;\n  }\n  *{box-sizing:border-box}\n  .dt-wrap{\n    font-family:-apple-system,BlinkMacSystemFont,\"Segoe UI\",Roboto,Helvetica,Arial,sans-serif;\n    background:var(--navy);\n    color:var(--ink);\n    max-width:680px;\n    margin:0 auto;\n    padding:28px 26px 30px;\n    border-radius:16px;\n    line-height:1.6;\n  }\n  .dt-kicker{font-size:12px;letter-spacing:.14em;text-transform:uppercase;color:var(--blue);font-weight:600;margin:0 0 10px}\n  .dt-wrap h1{font-size:25px;font-weight:600;margin:0 0 6px;color:#fff;line-height:1.25}\n  .dt-sub{color:var(--muted);font-size:15px;margin:0 0 22px}\n  
.dt-card{background:var(--navy-2);border:1px solid var(--navy-3);border-radius:12px;padding:18px 18px 20px;margin:0 0 18px}\n  .dt-q{font-size:17px;font-weight:500;margin:0 0 4px;color:#fff}\n  .dt-qn{color:var(--muted);font-size:14px;margin:0 0 16px}\n  .dt-opts{display:flex;flex-wrap:wrap;gap:10px}\n  .dt-opt{\n    flex:1 1 150px;background:var(--navy-3);color:var(--ink);border:1px solid var(--blue-dim);\n    border-radius:10px;padding:14px 12px;font-size:16px;font-weight:500;cursor:pointer;text-align:center;\n    transition:border-color .15s, background .15s;\n  }\n  .dt-opt:hover{border-color:var(--blue);background:#143049}\n  .dt-opt .dt-frac{font-size:20px;font-weight:700;color:#fff;display:block}\n  .dt-opt[disabled]{cursor:default;opacity:.9}\n  .dt-opt.correct{border-color:var(--blue);background:#10324f}\n  .dt-opt.wrong{border-color:#5A2D33;background:#2A1418;opacity:.7}\n  .dt-verdict{font-size:15px;margin:16px 0 0;padding:12px 14px;border-radius:10px;background:var(--navy-3);border-left:3px solid var(--blue)}\n  .dt-verdict b{color:#fff}\n  .dt-reveal{display:none;margin-top:18px}\n  .dt-reveal.show{display:block}\n  .dt-gridwrap{display:flex;gap:20px;flex-wrap:wrap;align-items:flex-start;margin:6px 0 4px}\n  .dt-grid{display:grid;grid-template-columns:repeat(6,30px);grid-auto-rows:30px;gap:4px}\n  .dt-cell{border-radius:5px;background:#0E2236;border:1px solid #163551;font-size:11px;color:#42607B;display:flex;align-items:center;justify-content:center}\n  .dt-cell.cond{background:#123A57;border-color:var(--blue-dim);color:#bcd6ef}\n  .dt-cell.win{background:var(--blue);border-color:#bfe0ff;color:#04263f;font-weight:700}\n  .dt-legend{font-size:13px;color:var(--muted);max-width:230px}\n  .dt-legend .sw{display:inline-block;width:12px;height:12px;border-radius:3px;vertical-align:-1px;margin-right:7px}\n  .dt-legend .row{margin:0 0 8px}\n  .dt-note{font-size:14px;color:var(--muted);margin:14px 0 0}\n  .dt-why{background:var(--navy-2);border:1px solid 
var(--navy-3);border-radius:12px;padding:18px;margin-top:18px}\n  .dt-why h2{font-size:16px;font-weight:600;color:var(--blue);margin:0 0 8px}\n  .dt-why p{font-size:14.5px;color:var(--ink);margin:0 0 10px}\n  .dt-why p:last-child{margin-bottom:0}\n  .dt-src{font-size:12px;color:#6F89A1;margin-top:16px}\n  .dt-src a{color:var(--blue);text-decoration:none}\n  @media (max-width:520px){.dt-grid{grid-template-columns:repeat(6,26px);grid-auto-rows:26px}}\n<\/style>\n\n\n<div class=\"dt-wrap\">\n  <p class=\"dt-kicker\">Ontology Roundup \u00b7 Issue 04<\/p>\n  <h1>The dice trap<\/h1>\n  <p class=\"dt-sub\">A 20-second test of the same shortcut that trips eight state-of-the-art models. Answer first, then look.<\/p>\n\n  <div class=\"dt-card\">\n    <p class=\"dt-q\">You roll two fair dice. You are told at least one of them is a six.<\/p>\n    <p class=\"dt-qn\">What is the probability that both are sixes?<\/p>\n    <div class=\"dt-opts\" id=\"dtOpts\">\n      <button class=\"dt-opt\" data-key=\"six\"><span class=\"dt-frac\">1 in 6<\/span>the other just needs to match<\/button>\n      <button class=\"dt-opt\" data-key=\"eleven\"><span class=\"dt-frac\">1 in 11<\/span>count what is still possible<\/button>\n      <button class=\"dt-opt\" data-key=\"thirtysix\"><span class=\"dt-frac\">1 in 36<\/span>both sixes from scratch<\/button>\n    <\/div>\n    <div class=\"dt-verdict\" id=\"dtVerdict\" style=\"display:none\"><\/div>\n\n    <div class=\"dt-reveal\" id=\"dtReveal\">\n      <div class=\"dt-gridwrap\">\n        <div class=\"dt-grid\" id=\"dtGrid\" aria-hidden=\"true\"><\/div>\n        <div class=\"dt-legend\">\n          <p class=\"row\"><span class=\"sw\" style=\"background:#123A57;border:1px solid #2C5A82\"><\/span>Still possible: at least one six (11 outcomes)<\/p>\n          <p class=\"row\"><span class=\"sw\" style=\"background:#48A3FF\"><\/span>Both sixes (1 outcome)<\/p>\n          <p class=\"row\"><span class=\"sw\" style=\"background:#0E2236;border:1px solid 
#163551\"><\/span>Ruled out by the clue (25 outcomes)<\/p>\n        <\/div>\n      <\/div>\n      <p class=\"dt-note\">Telling you \u201cat least one is a six\u201d does not isolate one die and leave the other free. It removes every outcome with no six at all, leaving 11 equally likely outcomes. Exactly one of them is the double six. So the answer is <b style=\"color:#fff\">1 in 11<\/b>, not 1 in 6. The intuitive move quietly assumes the two dice stay independent after the clue, and the clue is precisely what links them.<\/p>\n    <\/div>\n  <\/div>\n\n  <div class=\"dt-why\" id=\"dtWhy\" style=\"display:none\">\n    <h2>Why a dice problem is an evaluation problem<\/h2>\n    <p>Avena et al. tested eight state-of-the-art models on problems like this one. On counterintuitive items the models failed in a consistent, predictable way, and chain-of-thought prompting did not reliably rescue them. The shortcut that produced your first instinct is the same shortcut baked into the training data.<\/p>\n    <p>Those same models now do the grading. When an LLM-as-judge carries a reasoning blind spot, a reward model trained on its preferences inherits the blind spot too. The judge and the judged share a prior, so their agreement certifies consensus, not correctness.<\/p>\n    <p>The only check that does not share the failure mode is human ground truth you can verify: evaluators whose reasoning consistency is tracked over time, with a portable identity and signed contributions, so a team can tell signal from shared error. That verifiable layer is what the Ontology identity stack provides.<\/p>\n    <p class=\"dt-src\">Source: Avena et al., \u201cHow reliable are LLMs when it comes to playing dice?\u201d (arXiv, 5 June 2026). Read the full piece: <a href=\"#\" data-internal=\"article-url\">When the judge shares the blind spot<\/a>.<\/p>\n  <\/div>\n<\/div>\n\n<script>\n(function(){\n  var ANSWERS={\n    six:{ok:false,msg:\"That is the fast answer, and it is the trap. 
One die is already a six, so it feels like the other just needs to match. Look at what the clue actually rules out.\"},\n    eleven:{ok:true,msg:\"Correct, and most people (and most models) do not get here first. Here is why 11 is the number that matters.\"},\n    thirtysix:{ok:false,msg:\"That is the chance of two sixes with no clue at all. The clue changes the question: you already know at least one six landed. Look at what stays possible.\"}\n  };\n  var opts=document.getElementById('dtOpts');\n  var verdict=document.getElementById('dtVerdict');\n  var reveal=document.getElementById('dtReveal');\n  var why=document.getElementById('dtWhy');\n  var grid=document.getElementById('dtGrid');\n  var answered=false;\n\n  for(var d1=1;d1<=6;d1++){\n    for(var d2=1;d2<=6;d2++){\n      var c=document.createElement('div');\n      c.className='dt-cell';\n      var cond=(d1===6||d2===6);\n      var win=(d1===6&&d2===6);\n      if(win){c.className+=' win';c.textContent='6,6';}\n      else if(cond){c.className+=' cond';c.textContent=d1+','+d2;}\n      else{c.textContent=d1+','+d2;}\n      grid.appendChild(c);\n    }\n  }\n\n  opts.addEventListener('click',function(e){\n    var btn=e.target.closest('.dt-opt');\n    if(!btn||answered)return;\n    answered=true;\n    var key=btn.getAttribute('data-key');\n    var a=ANSWERS[key];\n    var all=opts.querySelectorAll('.dt-opt');\n    for(var i=0;i<all.length;i++){\n      all[i].setAttribute('disabled','');\n      var k=all[i].getAttribute('data-key');\n      if(k==='eleven')all[i].classList.add('correct');\n      else if(k===key)all[i].classList.add('wrong');\n    }\n    verdict.style.display='block';\n    verdict.innerHTML=(a.ok?'<b>Right.<\/b> ':'<b>The intuitive answer.<\/b> ')+a.msg;\n    reveal.classList.add('show');\n    why.style.display='block';\n  });\n})();\n<\/script>\n\n\n\n<h2 class=\"wp-block-heading\">A problem built to trip a shortcut<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Avena et al., in \u201cHow 
reliable are LLMs when it comes to playing dice?\u201d, tested eight state-of-the-art models on discrete-probability problems. On standard questions the models did reasonably well. On counterintuitive questions, the kind designed to trigger a heuristic shortcut, they failed in a consistent and predictable way, and chain-of-thought prompting did not reliably rescue them. The failures were not random noise that more sampling would average out. They clustered, because the shortcut that produces them is the same shortcut a person takes when they answer fast.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Take a problem in the paper&#8217;s spirit. You roll two fair dice. Someone you trust tells you at least one of them came up a six. What is the probability that both are sixes? The fast answer is one in six: one die is already a six, so the other just needs to match. The fast answer is wrong. Once you condition on at least one six, the outcomes still on the table are not thirty-six but eleven: twenty-five of the thirty-six contain no six at all and are ruled out. Exactly one of the remaining eleven is the double six. The probability is one in eleven, not one in six. The intuitive move treats the two dice as independent after the conditioning, and the conditioning is exactly what breaks the independence. A model that has absorbed the same shortcut from its training data reaches for the same one in six, and explains its way there fluently.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why LLM-as-judge blind spots travel downstream<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This would be a curiosity if probabilistic reasoning lived in a corner of the benchmark suite. It does not. The same models that miss the dice problem are now doing the grading. LLM-as-judge pipelines score open-ended outputs, rank candidate responses, and stand in for human raters at scale. Reward models are trained on preference labels, and a growing share of those labels are generated or filtered by other models. 
When the evaluator carries a reasoning blind spot, that blind spot does not stay in the evaluator. It propagates into every score it assigns and every preference it expresses, and from there into the reward model that learns to imitate it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The implication is uncomfortable and specific: a reward model trained on model-generated preferences has likely inherited the exact failure modes of the model that produced them. That is what LLM-as-judge blind spots are at the system level. Not a single wrong answer, but a bias baked into the training signal, invisible precisely because the thing that would catch it shares it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Agreement is not evidence when the prior is shared<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The usual defence is agreement. If the judge model and the policy model agree, or if several model graders converge, the result is treated as reliable. Convergence feels like triangulation. It is not, when the instruments share a calibration error. Three thermometers built with the same two-degree bias will agree with each other all day and all be two degrees wrong. The dice paper is the evidence that frontier models do share calibration errors on a describable class of problems, so model-on-model agreement on those problems certifies consensus, not correctness.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is the structural point that connects to the rest of the Roundup. Issue 03 argued that reward-model QA is the missing layer between collecting step-level preference data and trusting the reward model trained on it. The dice result is why that layer cannot be staffed by more of the same models. 
The quality assurance has to come from outside the shared prior.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The check that does not share the failure mode<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Human ground truth is the obvious candidate, and it is also the one most quietly eroded over the last two years, as model-generated labels got cheaper and human review got thinner. But not all human judgement is equal, and one-shot human labels have their own failure modes: fatigue, inconsistency, and the same fast-thinking shortcuts the dice problem exploits. A single anonymous rater answering one in six is no better than the model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What breaks the shared-prior loop is human judgement you can actually characterise: evaluators whose reasoning consistency is measured across many tasks and tracked over time, so you can see who stays steady on counterintuitive problems and who drifts. Consistency that is observed and recorded is a different asset from consistency you assume. It lets a team weight, audit, and defend its ground truth, rather than hoping the crowd averaged out. Tracked human consistency is the property that lets you separate signal from shared error, because it is measured against outcomes the model prior did not get to define.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What auditable human ground truth requires<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For tracked human judgement to be trustworthy it has to be verifiable by a third party, not just asserted by the platform that collected it. That needs three things, and all three exist as mature open standards. Stable evaluator identity that persists across batches and projects, so a consistency record attaches to a person rather than a disposable account, is the work of the\u00a0<a href=\"https:\/\/www.w3.org\/TR\/did-1.1\/\" target=\"_blank\" rel=\"noopener\">W3C Decentralized Identifiers specification<\/a>. 
Signed, timestamped contributions that prove who judged what and when are carried as <a href=\"https:\/\/www.w3.org\/TR\/vc-data-model-2.0\/\" target=\"_blank\" rel=\"noopener\">Verifiable Credentials<\/a>. And a way to revoke or update a standing without re-issuing everything is provided by the\u00a0<a href=\"https:\/\/www.w3.org\/TR\/vc-bitstring-status-list\/\" target=\"_blank\" rel=\"noopener\">W3C Bitstring Status List<\/a>. Selective disclosure lets an evaluator prove domain expertise without exposing their identity or full history, so the audit does not become surveillance, and the\u00a0<a href=\"https:\/\/identity.foundation\/\" target=\"_blank\" rel=\"noopener\">Decentralized Identity Foundation<\/a>\u00a0maintains the interoperability work that keeps these pieces talking to each other.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">None of this is speculative cryptography. It is the same identity stack that already underwrites credential verification in other regulated settings, pointed at a new surface: the provenance of the humans who produce evaluation ground truth.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where Ontology fits<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Ontology&#8217;s contribution here is substrate, not a turnkey evaluator product. <a href=\"https:\/\/ont.id\" target=\"_blank\" rel=\"noopener\">ONT ID<\/a> and <a href=\"https:\/\/onto.app\" target=\"_blank\" rel=\"noopener\">ONTO Wallet<\/a> implement exactly these primitives: persistent decentralised identity, verifiable credentials held by the person rather than the platform, and selective disclosure as a default rather than a bolt-on. A team building evaluation infrastructure can use that substrate to give every evaluator a portable identity, attach a longitudinal consistency record to it as signed credentials, and let any downstream consumer of the data check that record without trusting the collector&#8217;s word for it. 
The dice paper is a useful jolt because it makes the abstract concrete: the judge can be confidently, fluently wrong, and the only check that helps is human ground truth you can verify. The week ahead takes that into consistency as a safety property, the question of who watches the watchers as models approach self-improvement, and what a shared standard for evaluator quality would actually contain.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>LLM-as-judge blind spots are the systematic reasoning failures an automated evaluator inherits from the model it is built on. They matter because a benchmark scored by a language model and a reward model trained on that model&#8217;s preferences can agree completely and both be wrong in the same direction. New work on probabilistic reasoning has<\/p>\n","protected":false},"author":5,"featured_media":924,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[170,113,13],"tags":[117,172,177,196,197],"class_list":["post-922","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-data","category-did-and-privacy","tag-decentralised-identity","tag-ai-evaluation","tag-rlhf","tag-llm-as-judge-blind-spots","tag-llm-as-judge"],"_links":{"self":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts\/922","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/comments?post=922"}],"version-history":[{"count":2,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts\/922\/revisions"}],"predecessor-version":[{"id":925,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts\/922\/revisions\/925"}],"wp:featuredmedia
":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/media\/924"}],"wp:attachment":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/media?parent=922"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/categories?post=922"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/tags?post=922"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}