Preference Data Integrity: The Variable Distillation Hides

Preference data integrity is the upstream gate that determines what every distilled, fine-tuned, or RLHF-aligned model is actually optimising for. A reward model trained on inconsistent, sybil-contaminated, or methodologically opaque human preferences encodes those defects, and distillation then carries them into a model that is cheaper and faster to serve. Efficiency at the model layer does not fix a quality problem at the judgement layer; it amplifies it.

Last week’s release of RTDMD (Huang et al., 2026), which proposes reward-guided RL for few-step diffusion alignment, is the latest entry in a category that has become crowded very quickly. The paper makes a useful technical contribution at the alignment step. It also explicitly acknowledges, in its own framing, that aligning distilled models with human preferences remains challenging. The framework solves a downstream optimisation problem. The upstream supply of preference signal still does what it has always done, which is determine the ceiling on everything built on top of it.

This pattern repeats across the distillation literature this year. Each paper makes the inference side cheaper, faster, or more controllable. Each paper acknowledges that preference data quality remains the limiting factor. The collective effect is an industry that is getting very good at deploying models efficiently while quietly inheriting whatever quality problems exist in the judgement data those models were aligned against.

Monday’s piece on the METR teardown argued that benchmark publishers face a credibility event when their evaluator chain is opaque. The same argument applies at the reward-model layer: a reward model trained on a preference dataset whose integrity cannot be audited is a reward model whose outputs cannot be defended when the next dispute arrives. The audience is different. The structural problem is identical.

What “preference data integrity” actually has to mean

Preference data integrity is the property that every preference judgement in a training set can be traced back to: (a) a stable evaluator identity, (b) a documented and signed rubric at the version that applied when the judgement was made, (c) a verifiable record of the evaluator’s relevant credentials at that time, and (d) a status trail indicating whether the judgement, the rubric, or the evaluator credential has since been revoked or superseded. The standards stack is the same one the benchmark-provenance argument used: W3C Decentralized Identifiers for the evaluator anchor, W3C Verifiable Credentials for the signed judgements and methodology attestations, W3C Bitstring Status Lists for revocation. Same primitives, different consumer.
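As a rough sketch of what those four traceability properties could look like at the record level, the Python below models a single judgement and reports which provenance fields are missing. Every field name here is hypothetical, chosen for illustration; nothing in it reflects an existing schema or a W3C-defined data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PreferenceJudgement:
    # All field names are hypothetical, for illustration only.
    prompt_id: str
    chosen: str
    rejected: str
    evaluator_did: Optional[str]      # (a) stable evaluator identity
    rubric_version: Optional[str]     # (b) signed rubric version in force
    credential_id: Optional[str]      # (c) evaluator credential at that time
    status_list_index: Optional[int]  # (d) position in a revocation status list


def missing_provenance(j: PreferenceJudgement) -> List[str]:
    """Return the names of the traceability fields that are absent."""
    checks = {
        "evaluator_did": j.evaluator_did,
        "rubric_version": j.rubric_version,
        "credential_id": j.credential_id,
        "status_list_index": j.status_list_index,
    }
    # Index 0 is a valid status position, so only None or "" counts as missing.
    return [name for name, value in checks.items() if value is None or value == ""]


traceable = PreferenceJudgement(
    "p1", "response-a", "response-b", "did:example:alice", "rubric-v2", "cred-1", 0
)
flat = PreferenceJudgement("p1", "response-a", "response-b", None, None, None, None)

print(missing_provenance(traceable))  # no fields missing
print(missing_provenance(flat))       # all four provenance fields missing
```

The second record is the "flat list of pairwise comparisons" case: the comparison itself is intact, but nothing about who made it, under which rubric, or whether it still stands can be recovered.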

Without those four properties, a preference dataset is a flat list of pairwise comparisons whose generative process is opaque. The reward model trained on it is also opaque. The distilled model is a fast, cheap, deployable artefact whose alignment behaviour is whatever its training data implied, and there is no audit path back through any of it. When something downstream misbehaves, the team has nowhere to look except the model weights and the loss curves. The judgement layer is the part of the pipeline that produced the behaviour, and it is the part the team has the least visibility into.

Why distillation makes the upstream problem more expensive, not less

Distillation has its own economic logic. The larger model is too expensive to serve at scale. A smaller model trained against the larger model’s preferences captures most of the alignment behaviour at a fraction of the inference cost. The teams running this pipeline are not making a quality trade-off in their own mental model; they are paying for inference savings with engineering effort and accepting some quality delta that they intend to minimise.

The quality delta calculation assumes the reward model is fixed. It is not. The reward model is itself a derivative artefact whose quality is bounded by the preference data that trained it. Any inconsistency, sybil contamination, or methodological drift in the upstream preference data flows into the reward model, then into the distilled model, then into every downstream deployment. The distilled model is faster at producing whatever the reward model encoded. Faster is not the same as better, and when the upstream is contaminated, faster is actively worse, because a single bad behaviour is now repeated across a much larger inference volume and its downstream cost scales with it.

The teams that will actually realise distillation ROI are the ones whose upstream preference data has integrity guarantees the team can defend. Everyone else is buying speed against a quality ceiling they did not measure.

Where sybil contamination quietly sits

Sybil contamination is what happens when one person controls multiple evaluator accounts, or a coordinated group games a labelling pipeline. It is a chronic problem in any preference-data marketplace that pays per judgement and verifies evaluator identity with nothing stronger than a session cookie or a payment-processor account. Most teams know this exists. Most teams quietly absorb it as a cost of doing business at scale.

At the reward-model training layer the cost is structural. A sybil cluster systematically biases preference data toward whatever the contaminating actor wanted, which is sometimes payment optimisation and sometimes ideology. The reward model treats the sybil-weighted preferences as signal. Distillation propagates them. The downstream model behaves the way the contaminated upstream said it should. The fix is not better fraud detection at the application layer. It is evaluator uniqueness as a primitive, which is solvable with the same W3C Verifiable Credentials and selective disclosure stack the broader identity layer rests on. Thursday’s piece in this issue takes up the sybil-contamination argument in detail.
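A toy calculation makes the bias concrete. The sketch below uses invented votes (not real data) and assumes a `person` identifier is available, which in practice is exactly what a uniqueness credential would have to supply. It compares the preference rate when every account counts equally with the rate when every unique person counts equally.

```python
from collections import defaultdict

# Each vote is (person, account, pick). All values are invented for illustration.
# Six honest evaluators with one account each: four prefer B, two prefer A.
votes = [
    ("p1", "acct-1", "B"), ("p2", "acct-2", "B"),
    ("p3", "acct-3", "B"), ("p4", "acct-4", "B"),
    ("p5", "acct-5", "A"), ("p6", "acct-6", "A"),
]
# One sybil operator running six accounts, all voting A.
votes += [("sybil", f"acct-s{i}", "A") for i in range(6)]


def win_rate_per_account(votes):
    """Share of judgements preferring A when every account counts equally."""
    return sum(1 for _person, _account, pick in votes if pick == "A") / len(votes)


def win_rate_per_person(votes):
    """Share preferring A when each unique person carries one unit of weight."""
    picks_by_person = defaultdict(list)
    for person, _account, pick in votes:
        picks_by_person[person].append(pick)
    shares = [picks.count("A") / len(picks) for picks in picks_by_person.values()]
    return sum(shares) / len(shares)


print(round(win_rate_per_account(votes), 3))  # 8 of 12 judgements favour A
print(round(win_rate_per_person(votes), 3))   # 3 of 7 people favour A
```

Counted per account, A wins a two-thirds majority; counted per person, B is preferred. A reward model sees only the first number unless evaluator uniqueness is part of the data.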

What “audit-ready preference data” looks like in practice

An audit-ready preference dataset has four properties any third party can verify without access to the underlying evaluator pool.

First, every preference judgement carries the signed credential of the evaluator who made it. The credential resolves to a DID the evaluator controls, not to a platform-internal account ID that vanishes when the evaluator stops working with that platform. The judgement is independently attributable, even years later.

Second, every judgement is bound to the rubric version that applied at the time. Rubric changes are tracked as versioned attestations, not as silent updates to a methodology document. A downstream auditor can reconstruct the dataset against either the original or the updated rubric and compare the resulting reward signals.

Third, evaluator uniqueness is verifiable. The credential attests that the evaluator is a unique person, certified by a trust framework whose criteria are publicly documented. The credential does not have to reveal name, demographic envelope, or anything else; selective disclosure (W3C VC 2.0) lets the evaluator, as holder of the credential, prove uniqueness without disclosing identity.

Fourth, revocation is first-class. When an evaluator credential is rescinded (for cause, for sybil activity, for methodology drift, or because the evaluator chooses to retire it), the W3C Bitstring Status List surfaces the change to any downstream verifier the next time it checks the list. The team training the reward model can then decide whether to retrain, reweight, or accept the affected judgements as legacy. That decision is informed. Today, without a revocation signal, the decision cannot even be posed.
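The revocation check in the fourth property can be sketched in a few lines. The code below assumes the layout as I read it in the Bitstring Status List specification: a GZIP-compressed bitstring, base64url-encoded without padding behind a multibase `u` prefix, with index 0 at the left-most bit. Treat those details as assumptions to verify against the spec. The helper names and the 128-entry list are hypothetical (a real list is far larger), and fetching the list and verifying its signature are omitted entirely.

```python
import base64
import gzip


def encode_status_list(bits: bytes) -> str:
    """Compress and encode a raw bitstring (assumed multibase base64url, no padding)."""
    compressed = gzip.compress(bits)
    return "u" + base64.urlsafe_b64encode(compressed).decode("ascii").rstrip("=")


def decode_status_list(encoded: str) -> bytes:
    """Reverse encode_status_list and return the raw bitstring."""
    if not encoded.startswith("u"):
        raise ValueError("expected multibase base64url prefix 'u'")
    payload = encoded[1:]
    payload += "=" * (-len(payload) % 4)  # restore the stripped padding
    return gzip.decompress(base64.urlsafe_b64decode(payload))


def set_revoked(bits: bytearray, index: int) -> None:
    """Flip the bit at `index` to 1, counting from the left-most bit."""
    bits[index // 8] |= 1 << (7 - index % 8)


def is_revoked(bits: bytes, index: int) -> bool:
    """Read the bit at `index`, counting from the left-most bit."""
    return bool((bits[index // 8] >> (7 - index % 8)) & 1)


def partition_by_status(judgements, bits):
    """Split (judgement_id, status_index) pairs into active and revoked ids."""
    active, revoked = [], []
    for judgement_id, status_index in judgements:
        (revoked if is_revoked(bits, status_index) else active).append(judgement_id)
    return active, revoked


# Toy list with 128 entries; the credential at index 9 has been revoked.
raw = bytearray(16)
set_revoked(raw, 9)
bitstring = decode_status_list(encode_status_list(bytes(raw)))

judgements = [("j1", 3), ("j2", 9), ("j3", 10)]  # invented ids and indices
active, revoked = partition_by_status(judgements, bitstring)
print(active, revoked)
```

The partition is the input to the retrain, reweight, or accept-as-legacy decision: the affected judgements are identified by lookup rather than by forensic guesswork on the model.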

Where Ontology fits

Ontology has been deploying decentralised identity standards for years. ONT ID issues credentials that hold across platforms. ONTO Wallet gives the evaluator direct custody. The same identity substrate that makes benchmark publishers auditable makes preference-data publishers auditable. The standards work has been done. The only question is whether the teams currently buying inference savings on the back of opaque preference data want to know what they are buying.

The next round of model-distillation papers will continue to make inference cheaper. The teams that build on top of preference data with verifiable integrity will be the ones whose distilled models actually do what their alignment claims say they do. Everyone else will be defending behaviour they cannot audit, on the back of data they cannot trace, against critics who increasingly will not take any of it on faith.

Continue reading this week

Tomorrow: Continuous training needs continuous evaluators, on why snapshot evaluator pools cannot measure drift in a continually retrained model. Background reading: Issue 01’s piece on reputation as public infrastructure frames the evaluator-record argument the preference-data integrity question depends on.

Ontology News

Your reward model is only as good as your preference data

Geoff R
