Verifying humans without watching them

Proof of personhood for AI training data does not have to depend on biometric surveillance. On-device verifiable credentials with selective disclosure, built on the W3C Verifiable Credentials Data Model 2.0, let contributors prove they are human without surrendering biographical data. They replace the surveillance-default model that most platforms inherit by accident.

The training data conversation has changed in the last twelve months. It used to be about quantity. Now it is about provenance. The worry that training corpora are contaminated by AI-generated content has gone from a niche concern to a first-order problem, and the proposed fixes range from “build a giant dataset of known-bad AI slop to train against” to “only accept data from verified human contributors” to “scrap public web scraping entirely and run controlled human studies.”

All three are honest responses to the same underlying realisation: proof of personhood and verifiably human-origin data are becoming paid attributes in the AI supply chain. The premium is real, the demand is growing, and the supply is fragile.

What is rarely said out loud is that the way most platforms intend to deliver that premium is, on its face, terrible.

This is the second piece in this week’s Ontology Roundup. It builds on Tuesday’s argument on persistent identity for AI evaluators, and moves the same primitive upstream into the training-data layer.

The default solution is surveillance

When a team needs to be sure that a contributor is human, the default instinct is to collect more information. Take a selfie. Hold up an ID. Submit to a liveness check. Run behavioural biometrics in the background. Cross-reference against a database of known accounts. Layer in device fingerprinting. The longer the list of signals, the higher the confidence that a real human is on the other end.

Each of those signals also takes the human apart, piece by piece, and stores those pieces in databases the contributor does not control and cannot revoke. The human becomes the price of being trusted as a human.

This is the wrong architecture for the problem. It is also, almost certainly, what most large platforms will deploy by default, because the data they collect is valuable to them for unrelated reasons. Combining surveillance-based human verification with data-hungry platform incentives produces an outcome where the cost of being trusted to contribute training data is permanent biographical exposure.

That outcome is not a regulatory abstraction. It is the predictable consequence of solving the proof-of-human problem with the centralised tools the industry already has.

What the right architecture looks like

The right architecture is older than the problem. Verifiable credentials, as formalised in the W3C Verifiable Credentials Data Model 2.0 (which W3C announced as a Recommendation in May 2025), separate three roles: an issuer who attests a claim about a subject, a holder who controls the credential, and a verifier who needs to confirm the claim. Cryptographic signatures over the issuer’s attestation let the verifier confirm a claim without contacting the issuer at the moment of verification, and without seeing more of the holder’s data than they need to.
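The role split can be sketched in a few lines of Python. This is a minimal illustration, not how any W3C cryptosuite works: to stay within the standard library it uses a Lamport one-time signature as a stand-in for the EdDSA or ECDSA suites a real deployment would use, and the `did:example:` identifiers and the `verifiedHuman` claim name are hypothetical. It shows only the three roles and the offline check. The whole claim is presented here, with no selective disclosure yet.

```python
import hashlib
import json
import secrets


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def keygen():
    """Issuer key pair: 256 pairs of random secrets, and their hashes as the public key."""
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(sha256(a), sha256(b)) for a, b in private]
    return private, public


def message_bits(message: bytes):
    """The 256 bits of the message digest, most significant bit first."""
    digest = sha256(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def sign(private, message: bytes):
    """Reveal one secret per digest bit. A Lamport key must sign only one message."""
    return [private[i][bit] for i, bit in enumerate(message_bits(message))]


def verify(public, message: bytes, signature) -> bool:
    """Check each revealed secret against the public key. No issuer contact needed."""
    if len(signature) != 256:
        return False
    return all(
        sha256(signature[i]) == public[i][bit]
        for i, bit in enumerate(message_bits(message))
    )


def canonical(claim: dict) -> bytes:
    """Stable byte encoding so issuer and verifier hash the same thing."""
    return json.dumps(claim, sort_keys=True, separators=(",", ":")).encode()


# Issuer: attests the claim once and signs it.
issuer_private, issuer_public = keygen()
claim = {
    "issuer": "did:example:issuer",      # hypothetical identifier
    "subject": "did:example:holder",     # hypothetical identifier
    "verifiedHuman": True,               # hypothetical claim name
}
credential = {"claim": claim, "proof": sign(issuer_private, canonical(claim))}

# Holder: stores `credential` on their own device and presents it later.

# Verifier: needs only the issuer's public key, obtained ahead of time.
accepted = verify(issuer_public, canonical(credential["claim"]), credential["proof"])

# A claim altered after issuance no longer matches the signature.
tampered = dict(credential["claim"], subject="did:example:someone-else")
rejected = verify(issuer_public, canonical(tampered), credential["proof"])

print(accepted, rejected)
```

The point of the sketch is the last few lines: the verifier never calls the issuer, it only checks the presented proof against a public key it already trusts.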

In the proof-of-human case, this means an issuer (a government, a bank, a biometric service, a trust framework) can attest “this person is a verified human” once. The credential lives on the contributor’s device. When a platform needs to know that a contributor is human, the contributor presents a cryptographic proof derived from the credential. The platform learns one bit of information: human, yes. It does not learn the issuer’s underlying data, does not learn the contributor’s biometric features, does not get a copy of the credential, and gains no identifier it could use to link the contributor to their activity on other platforms.

The contributor stays a human. The platform gets the assurance it needs. The issuer is not contacted, so the issuer does not learn where or when the credential was used. None of these properties are theoretical. They are the explicit design goals of the W3C VC family, and they map directly onto the proof-of-human problem the AI industry is now scrambling to solve.

Selective disclosure is the part that matters

The architectural feature that turns this from “a different form of surveillance” into “actually private” is selective disclosure. The credential may contain many attributes about the contributor: their legal name, their date of birth, their country of residence, their biometric template, whatever the issuer chose to attest. The contributor decides which attributes the verifier learns, and the cryptographic proof confirms only those.
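One way to see how a holder can reveal a single attribute from a signed credential is the salted-hash approach. The sketch below is loosely modelled on that idea as used in SD-JWT, which is an assumption on our part and not a mechanism this piece prescribes. The function names and attribute names are hypothetical, and the issuer's signature over the digest list is omitted and assumed to have been checked already.

```python
import hashlib
import json
import secrets


def digest(salt: str, name: str, value) -> str:
    """Commitment to one attribute. The random salt stops the verifier guessing values."""
    encoded = json.dumps([salt, name, value], separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def issue(attributes: dict):
    """Issuer: salt every attribute. Only the sorted digests go into the signed payload."""
    disclosures = {
        name: (secrets.token_hex(16), value) for name, value in attributes.items()
    }
    signed_digests = sorted(
        digest(salt, name, value) for name, (salt, value) in disclosures.items()
    )
    return signed_digests, disclosures


def present(disclosures: dict, reveal: list):
    """Holder: hand over salt and value only for the attributes they choose."""
    return [(disclosures[name][0], name, disclosures[name][1]) for name in reveal]


def verify(signed_digests: list, presented: list) -> dict:
    """Verifier: accept a revealed attribute only if its digest is in the signed list."""
    accepted = {}
    for salt, name, value in presented:
        if digest(salt, name, value) not in signed_digests:
            raise ValueError(f"attribute {name!r} is not covered by the credential")
        accepted[name] = value
    return accepted


# Issuer attests several attributes (all values here are made up).
signed_digests, disclosures = issue({
    "legalName": "Alex Example",
    "dateOfBirth": "1990-01-01",
    "countryOfResidence": "NZ",
    "verifiedHuman": True,
})

# Holder reveals exactly one of them.
presented = present(disclosures, ["verifiedHuman"])

# Verifier sees four opaque digests and one opened attribute.
learned = verify(signed_digests, presented)
print(learned)
```

The verifier ends up with `verifiedHuman` and nothing else: the name, date of birth and country remain salted digests it cannot open. One limitation of this particular sketch is that the digest list is identical in every presentation, so it could itself be used to correlate presentations. Avoiding that requires an unlinkable proof scheme rather than plain salted hashes.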

In a proof-of-human flow, the contributor’s wallet computes a proof that the credential exists, is valid, and was issued by a trusted issuer, without revealing the credential itself. The verifier learns that a trusted issuer has confirmed humanness, and nothing else. There is no PII exchange, no biometric upload, and no shared identifier the platform can later use to fingerprint the contributor across the rest of their digital life.

This is not a hypothetical primitive. The W3C Verifiable Credentials Working Group published seven Recommendations in May 2025 that specify how this works in practice, including the data model, cryptographic suites, status lists, and securing mechanisms. The standards are stable. The implementations are deployable. The question is whether AI infrastructure teams design their proof-of-human flows around them, or default to surveillance because surveillance is what the existing identity vendors sell.

Where Ontology fits

Ontology has been deploying decentralised identity infrastructure on these standards for years. ONT ID is built to issue, hold, and verify W3C credentials with selective disclosure as a first-class capability, not an afterthought. The ONTO Wallet stores credentials on-device, computes presentation proofs locally, and never leaks the underlying data to the verifier or the issuer.

We did not build this for the AI training-data problem. We built it because data ownership and minimal disclosure have been the right answers to credential verification for as long as the standards have existed. The fact that AI is now the loudest reason to care does not change the architecture. It just changes how many teams need to find it.

The teams that will buy the verified-human premium are starting to ask the right questions. The platforms that answer “give us a selfie and an ID” are answering the wrong question. The platforms that answer “present a verifiable credential with selective disclosure” are answering the right one, and using infrastructure that respects the contributor whose humanity is being verified.


Continue reading this week

Tomorrow: Selective disclosure is the privacy primitive AI did not know it needed, going deep on the credential mechanism that makes this architecture work.