{"id":902,"date":"2026-06-01T11:12:40","date_gmt":"2026-06-01T11:12:40","guid":{"rendered":"https:\/\/ont.io\/news\/?p=902"},"modified":"2026-06-01T11:12:44","modified_gmt":"2026-06-01T11:12:44","slug":"evaluator-provenance-metr","status":"publish","type":"post","link":"https:\/\/ont.io\/news\/evaluator-provenance-metr\/","title":{"rendered":"When benchmarks break: the case for traceable evaluator provenance"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Evaluator provenance is the layer that turns benchmark results from &#8220;trust the publisher&#8221; claims into independently verifiable artefacts. When that layer is missing, a single methodology dispute can collapse confidence in a benchmark used by entire policy and research ecosystems. <a href=\"https:\/\/aiweekly.co\/alerts\/researcher-finds-fatal-flaws-in-metr-ai-progress-graph\" target=\"_blank\" rel=\"noopener\">The METR time-horizons graph is the latest example: cited everywhere, audited rarely, and now publicly contested.<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The METR situation arrived the way these things always do. A benchmark that had been quoted in policy briefings, lab announcements, and capability roundups for months turned out, on close inspection, to contain what one teardown described as numerous severe errors. The errors were not minor formatting issues. They were structural problems with how the underlying judgements had been compiled and how the resulting graph had been read. The benchmark had become a load-bearing reference for an industry whose habit is to cite it and move on.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is not an indictment of METR specifically. It is the latest, loudest instance of a category that has been building for a year. The category is benchmarks that have no traceable evaluator provenance behind them. When the underlying judgement chain is opaque, any methodology question becomes an unfalsifiable argument. 
The benchmark publisher and the benchmark critic each get to assert their reading. There is no shared artefact either side can verify independently. Trust collapses to authority, and authority is exactly what every external observer was already sceptical of.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What &#8220;evaluator provenance&#8221; actually means<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Evaluator provenance is the verifiable chain from &#8220;this person made this judgement at this time, using this rubric, while holding these credentials&#8221; through to &#8220;this judgement contributed to this aggregate statistic, alongside these other judgements, in this proportion.&#8221; It is not a single document. It is a stack of signed claims that any third party can audit without taking the publisher&#8217;s word for any individual step. The standards that make it possible have been mature for years:&nbsp;<a href=\"https:\/\/www.w3.org\/TR\/did-1.1\/\" target=\"_blank\" rel=\"noopener\">W3C Decentralized Identifiers<\/a>&nbsp;anchor each evaluator&#8217;s portable, holder-controlled identifier;&nbsp;<a href=\"https:\/\/www.w3.org\/TR\/vc-data-model-2.0\/\" target=\"_blank\" rel=\"noopener\">W3C Verifiable Credentials<\/a>&nbsp;carry the signed judgements and the issuer&#8217;s attestation of methodology;&nbsp;<a href=\"https:\/\/www.w3.org\/TR\/vc-bitstring-status-list\/\" target=\"_blank\" rel=\"noopener\">W3C Bitstring Status Lists<\/a>&nbsp;let issuers revoke credentials cleanly when a methodology issue is found upstream.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With this stack in place, three things become observable to anyone outside the publishing organisation. First, who actually made each judgement, identified by a stable DID, not by a platform-internal anonymous ID that does not survive a vendor switch. 
Second, what methodology the issuer attested to at the time the judgement was made, with the version of the rubric and the date both signed into the credential. Third, what has happened to the credential since: revoked, superseded, re-attested, or unchanged.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">None of this requires exposing individual evaluator identities. Selective disclosure, formalised in the W3C VC 2.0 family, lets the issuer prove &#8220;this judgement came from a credentialed evaluator who met the rubric criteria&#8221; without revealing the evaluator&#8217;s name, demographic envelope, or anything else the methodology does not require.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why the METR situation is the warning shot, not the outlier<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Every benchmark currently in wide circulation has at most a partial answer to the provenance question. The most rigorous publishers describe their methodology in detail. Almost none of them ship a signed evaluator chain that a downstream auditor can independently verify. This was tolerable while benchmarks were narrow technical artefacts cited mostly between researchers. It has become structurally untenable now that the same numbers are quoted in policy hearings, procurement decisions, and frontier-capability roadmaps.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The pattern the METR situation foreshadows is the standard one for credibility events. A single high-profile critique shifts the burden of proof. The next benchmark publication has to answer a question that did not exist last quarter: how can a third party verify what your evaluators actually did? Publishers who can answer it become the trusted ones. 
Publishers who cannot will end up doing what METR is doing now: defending their reading of a methodology the rest of the field is no longer willing to take on faith.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The teams that ship evaluator-provenance-ready benchmarks first will be the ones whose results survive scrutiny when the next dispute lands. The teams that do not will inherit the credibility risk that METR is currently absorbing on behalf of the entire field.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What the stack looks like in practice<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A traceable evaluator chain has four moving parts, and the standards work has already been done for each.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, each evaluator holds a DID. The DID is anchored on a substrate the evaluator controls, not on a platform-internal database. When the evaluator works across multiple platforms or studies, the DID is the durable handle that lets a third party assemble a complete picture of what that evaluator has been involved in.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Second, each judgement is wrapped in a verifiable credential. The credential names the rubric, the issuer (the eval platform or research group), the date, and any attestations the issuer wants to bind in (specialist certifications, calibration scores, inter-rater agreement history). The issuer signs the credential, so any later tampering causes signature verification to fail.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Third, methodology changes are tracked as versioned issuer attestations, not as buried footnotes in a methods appendix. If a rubric is amended, the new rubric version is what subsequent credentials cite. The published benchmark statistic can be reconstructed against either rubric version by any auditor who has the underlying signed credentials.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Fourth, revocation is first-class. 
When a methodology defect is found, the issuer publishes a status update on a&nbsp;<a href=\"https:\/\/www.w3.org\/TR\/vc-bitstring-status-list\/\" target=\"_blank\" rel=\"noopener\">W3C Bitstring Status List<\/a>, and every downstream verifier that checks credential status can see that the affected credentials no longer attest to what they originally claimed. The benchmark publisher does not have to issue a press release and hope the field reads it. The verification layer just stops returning the same answer.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where Ontology fits<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Ontology has been deploying decentralised identity standards for years.\u00a0<a href=\"https:\/\/ont.id\" target=\"_blank\" rel=\"noopener\">ONT ID<\/a>\u00a0issues credentials that hold across platforms. <a href=\"https:\/\/onto.app\" target=\"_blank\" rel=\"noopener\">ONTO Wallet<\/a>\u00a0gives the evaluator direct custody of those credentials. The pattern that benchmark provenance now needs (many issuers, many verifiers, durable holders) is the topology the identity stack was built for. Ontology is not positioning this as a turnkey evaluator product. The identity stack is the substrate that any evaluator platform, eval research group, or benchmark publisher can build on if they want their results to survive the next credibility event.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The METR situation is uncomfortable for everyone who cited the graph. It is also the moment the field discovers that opaque evaluator chains are no longer worth what they used to be. 
The teams that build the provenance layer now will be the ones whose work still holds up when the next benchmark dispute lands.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Continue reading this week<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Tomorrow:\u00a0Your reward model is only as good as your preference data, on why distillation ROI inherits whatever quality problems live in the upstream reward-model training data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Evaluator provenance is the layer that turns benchmark results from &#8220;trust the publisher&#8221; claims into independently verifiable artefacts. When that layer is missing, a single methodology dispute can collapse confidence in a benchmark used by entire policy and research ecosystems. The METR time-horizons graph is the latest example: cited everywhere, audited rarely, and now publicly<\/p>\n","protected":false},"author":5,"featured_media":905,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[170,113,13],"tags":[117,172,179,180],"class_list":["post-902","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-data","category-did-and-privacy","tag-decentralised-identity","tag-ai-evaluation","tag-evaluator-provenance","tag-metr"],"_links":{"self":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts\/902","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/comments?post=902"}],"version-history":[{"count":1,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts\/902\/revisions"}],"predecessor-version":[{"id":904,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/pos
ts\/902\/revisions\/904"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/media\/905"}],"wp:attachment":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/media?parent=902"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/categories?post=902"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/tags?post=902"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}