Evaluator Drift: AI Benchmarks' Missing Identity Layer

Apparent drift in flagship AI systems is often attributed to changes in the model when the real change is in the evaluator population. Persistent, portable identity for evaluators, built on W3C standards, is the missing primitive that makes evaluator drift measurable and benchmark scores comparable over time.

A flagship model lands on Tuesday. The benchmark scores are state of the art. By Friday, the team that ships it is already getting reports that the model feels different. By the following Tuesday, the scores have not moved, but the qualitative experience has shifted noticeably. Not for everyone. Not in the same direction. Just enough that nobody can pin it down.

This pattern is now familiar enough that it has a name in industry conversations: model drift. The instinct is to blame the model. Quantisation. Some silent update. A different inference path. Sometimes that is the cause. Often it is not.

The harder, less comfortable possibility is that the model has not changed at all. The evaluators have.

Static benchmarks measure model output against a fixed test set, judged by a population of humans whose composition is constantly shifting. New evaluators arrive. Existing evaluators drift in their expertise, their context, their tolerance for edge cases. Crowd-platform onboarding cohorts move through the system. The population that gave the model 92% on Tuesday is not the population that gave it 89% on Friday, even if the model is byte-for-byte identical.
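To make the arithmetic concrete, here is a minimal sketch in Python. The two subgroups and their approval rates are invented for illustration, chosen only so the totals match the 92% and 89% figures above. The model's behaviour is held fixed and only the cohort mix changes.

```python
# Hypothetical numbers: two evaluator subgroups rating an unchanged model.
# Each subgroup approves the same outputs at a fixed rate on both days.
APPROVAL_RATE = {"lenient": 0.95, "strict": 0.85}


def cohort_score(mix: dict[str, float]) -> float:
    """Expected approval rate for a cohort, given each subgroup's share."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("cohort shares must sum to 1")
    return sum(share * APPROVAL_RATE[group] for group, share in mix.items())


# Same model, same subgroup behaviour, different cohort composition.
tuesday = cohort_score({"lenient": 0.7, "strict": 0.3})
friday = cohort_score({"lenient": 0.4, "strict": 0.6})

print(f"Tuesday: {tuesday:.2f}, Friday: {friday:.2f}")
# prints "Tuesday: 0.92, Friday: 0.89"
```

Nothing about the model appears in this calculation. The three-point drop comes entirely from who showed up to rate.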

This is not an indictment of any specific eval platform. It is a structural fact about how the entire AI evaluation stack is constructed: at the bottom of the pipeline, indistinguishable from each other by design, sit anonymous humans whose continuity over time is nobody’s responsibility.

That is the missing layer. Not better benchmarks. Not larger datasets. Persistent identity for the people whose judgements those benchmarks ultimately depend on.

What “persistent” actually means

Most platforms have some notion of evaluator identity. The evaluator has a user ID. They have a history on the platform. They have a quality score, sometimes a reputation rank, computed from how often their judgements agree with golden examples or with the consensus of other evaluators.

What they do not have is identity that travels. If the team migrates from one eval vendor to another, the evaluator is a fresh sign-up. Their reputation does not move. Their history does not move. The team is forced to cold-start every quality measurement on every platform switch, and the evaluator is forced to rebuild trust from zero every time they show up somewhere new.

That is not an evaluator problem. It is a primitive problem.

The primitive that fixes it has been hiding in plain sight for nearly a decade: decentralised identity. The W3C Decentralized Identifiers specification became a Recommendation in 2022. The companion Verifiable Credentials Data Model 2.0 became a Recommendation in May 2025. These standards describe an architecture where identity, and the credentials attached to it, are anchored to a unique, persistent, portable identifier that the holder controls. The eval platform does not own the identity. The evaluator does. The credentials are presented to the platform when needed and stay with the evaluator when they leave.
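As a rough illustration of what such a credential could look like, the Python sketch below builds an unsigned document in the general shape of the Verifiable Credentials Data Model 2.0. The credential type `EvaluatorTrackRecordCredential`, the subject fields `goldAgreementRate` and `judgementCount`, and the `did:example:` identifiers are all hypothetical. A real credential would also carry a cryptographic proof and an additional JSON-LD context defining the custom terms, both omitted here.

```python
from datetime import datetime, timezone

# Base context URL defined by the Verifiable Credentials Data Model 2.0.
VC_V2_CONTEXT = "https://www.w3.org/ns/credentials/v2"


def build_evaluator_credential(
    issuer_did: str,
    evaluator_did: str,
    agreement_rate: float,
    judgement_count: int,
) -> dict:
    """Assemble an unsigned, illustrative track-record credential."""
    for did in (issuer_did, evaluator_did):
        if not did.startswith("did:"):
            raise ValueError(f"not a DID: {did!r}")
    return {
        "@context": [VC_V2_CONTEXT],
        # The second type is a made-up name for this sketch.
        "type": ["VerifiableCredential", "EvaluatorTrackRecordCredential"],
        "issuer": issuer_did,  # the platform attesting to the record
        "validFrom": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "credentialSubject": {
            "id": evaluator_did,  # identifier the evaluator controls
            "goldAgreementRate": agreement_rate,  # hypothetical field
            "judgementCount": judgement_count,  # hypothetical field
        },
    }


credential = build_evaluator_credential(
    issuer_did="did:example:platform-a",
    evaluator_did="did:example:evaluator-123",
    agreement_rate=0.91,
    judgement_count=4200,
)
print(credential["credentialSubject"]["id"])
```

The point of the shape is that the subject identifier belongs to the evaluator, not to the issuing platform, so the same record can be presented to a second platform without a fresh sign-up.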

That changes the maths under benchmarks.

Drift you can actually measure

When evaluators are anonymous and platform-bound, evaluator drift is mostly invisible. You can compute inter-rater agreement within a single eval batch. You cannot meaningfully compare the population of evaluators on Tuesday with the population on Friday, because they are constructed as distinct, untraceable cohorts every time.

When evaluators carry portable identity and a portable record of their past judgements, three new things become possible at the population level.

First, the consistency of an individual evaluator over time becomes a measurable, persistent attribute, not a black-box reputation score that resets on every platform.

Second, the composition of an eval cohort becomes auditable. You can ask the simple question that today is almost impossible to answer rigorously: was the population that rated the model on Friday meaningfully different from the population that rated it on Tuesday? If it was, the model did not necessarily drift. Your sample drifted.

Third, the unit of accountability changes. A model’s benchmark score becomes a tuple: the model, the cohort, and the cohort’s measurable consistency over time. Anything else is a snapshot, and snapshots cannot detect drift in either direction.
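The three capabilities above can be sketched in a few lines of Python, assuming every judgement is tagged with a persistent evaluator identifier. The data, the `did:example:` identifiers, the function names, and the choice of Jaccard overlap and mean gold agreement as metrics are all assumptions made for illustration, not a prescribed method.

```python
from dataclasses import dataclass

# Each judgement: (evaluator DID, day label, agreed with the gold answer?)
judgements = [
    ("did:example:alice", "tue", True),
    ("did:example:alice", "tue", True),
    ("did:example:alice", "fri", True),
    ("did:example:alice", "fri", False),
    ("did:example:bob", "tue", True),
    ("did:example:carol", "fri", False),
]


def gold_agreement(records, evaluator, day):
    """1. One evaluator's agreement with gold answers on a given day."""
    hits = [ok for who, when, ok in records if who == evaluator and when == day]
    return sum(hits) / len(hits) if hits else None


def cohort_overlap(a, b):
    """2. Jaccard overlap: 1.0 means identical cohorts, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


@dataclass(frozen=True)
class BenchmarkResult:
    """3. A score reported together with the cohort that produced it."""
    model_id: str
    score: float
    cohort: frozenset
    cohort_consistency: float


tuesday = {who for who, when, _ in judgements if when == "tue"}
friday = {who for who, when, _ in judgements if when == "fri"}

# Mean per-evaluator gold agreement, used here as a stand-in for consistency.
friday_rates = [gold_agreement(judgements, who, "fri") for who in friday]
result = BenchmarkResult(
    model_id="model-x",  # hypothetical model name
    score=0.89,  # hypothetical benchmark score
    cohort=frozenset(friday),
    cohort_consistency=sum(friday_rates) / len(friday_rates),
)

print(f"cohort overlap Tue vs Fri: {cohort_overlap(tuesday, friday):.2f}")
print(f"Friday cohort consistency: {result.cohort_consistency:.2f}")
```

With anonymous, platform-bound evaluators, neither the overlap nor the per-evaluator history can be computed at all, because there is no stable key to join Tuesday's records to Friday's.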

This is infrastructure, not a product pitch

Decentralised identity does not solve evaluation. It does not magic away the difficulty of measuring whether a model is good. It does not replace the dozens of methodological choices a serious AI team has to make about how to construct, weight, and interpret a benchmark.

What it does is install a primitive at the bottom of the stack so that the rest of the stack can do honest work. Without it, the people whose judgements every benchmark depends on are anonymised by default, scattered by platform churn, and impossible to compare across time. With it, they are durable, portable, and observable as a population.

The conversation about benchmark reliability is converging on this point from several directions at once. Arena leaderboard watchers track Elo histories because static rankings hide population shifts. Safety researchers note that evaluator demographic composition materially changes refusal behaviour. RLHF teams complain that the cold-start problem on new platforms wastes weeks of calibration time on every vendor switch. Every one of those complaints is a complaint about the missing identity layer.

Ontology has been building toward this layer since long before AI evaluation became the loudest reason to care about it. ONT ID, our decentralised identity infrastructure, implements the W3C standards that the AI evaluation conversation now urgently needs. The broader ecosystem, including the Decentralized Identity Foundation, has been maturing the primitives in public for years. The point of this piece is not to sell ONT ID. It is to draw the connection between a problem the AI community is openly admitting it has and a category of infrastructure that already exists.

The next time a flagship model feels off by Friday, it is worth asking whether the model drifted or the people did. The honest answer requires identity that persists. The architecture for that already exists. The question is when AI evaluation infrastructure starts using it.

Continue reading this week

Tomorrow: Verifying humans without watching them, on why proof-of-personhood without surveillance is the architecture the AI training-data problem actually needs.

Ontology News

Geoff R
