{"id":909,"date":"2026-06-04T09:44:37","date_gmt":"2026-06-04T09:44:37","guid":{"rendered":"https:\/\/ont.io\/news\/?p=909"},"modified":"2026-06-04T09:44:42","modified_gmt":"2026-06-04T09:44:42","slug":"longitudinal-evaluation","status":"publish","type":"post","link":"https:\/\/ont.io\/news\/longitudinal-evaluation\/","title":{"rendered":"Continuous training needs continuous evaluators"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Longitudinal evaluation is the human-judgement layer that scales alongside continual model adaptation. A continually retrained model paired with a snapshot evaluator pool produces measurement drift faster than the model itself drifts, because the eval cohort changes underneath the result while the publisher assumes it is fixed. The standards stack that makes evaluator pools longitudinal is already mature; the question is whether teams shipping continual-tuning pipelines build the human-side infrastructure to match.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Last week&#8217;s release of&nbsp;<a href=\"https:\/\/arxiv.org\/abs\/2605.26110\" target=\"_blank\" rel=\"noopener\">Prism (Tang et al., 2026)<\/a>, a framework for multimodal continual instruction tuning, is the latest entry in a category of work that takes one premise seriously: deployed models do not sit still. They are retrained, fine-tuned, instruction-extended, and behaviourally patched on a cadence measured in weeks, sometimes in days. The Prism paper itself flags that current research in this area is hindered by severe engineering bottlenecks. The bottlenecks the authors describe are on the model side. The bottlenecks on the evaluation side are larger and less discussed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When a model is updated continually, evaluating it against a snapshot benchmark answers only part of the question. The benchmark scores tell the team how the new model behaves on the old test set, judged by the old evaluator pool. 
They do not tell the team what has changed in the cohort doing the judging. The published delta between version N and version N+1 is the sum of two things: actual model behaviour change and cohort composition change. Without a way to separate those two, the team is measuring something less specific than it thinks it is.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/ont.io\/news\/preference-data-integrity-the-variable-distillation-hides\/\">Yesterday&#8217;s piece on preference data integrity<\/a> made the point at the training layer: reward models inherit whatever quality problems exist in the upstream judgements. The same problem reappears at the post-deployment evaluation layer, with a longer time horizon and quieter symptoms. A snapshot evaluator cohort against a continually retrained model is the slow version of a contaminated reward dataset. The cost is paid in misattributed regressions, methodology disputes that cannot be settled, and a slow erosion of confidence in whatever evaluation pipeline the team is running.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What &#8220;longitudinal&#8221; actually has to mean for an evaluator pool<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Longitudinal evaluation is the property that the evaluator cohort is observable over time, the same way the model is. Three things have to be true. First, evaluator identity is stable across batches. The same person doing batch 12 in March and batch 47 in November is identifiable as the same person, not as two separate anonymous accounts. This needs a portable, holder-controlled identifier;&nbsp;<a href=\"https:\/\/www.w3.org\/TR\/did-1.1\/\" target=\"_blank\" rel=\"noopener\">W3C Decentralized Identifiers<\/a>&nbsp;are the substrate. 
Second, each evaluator&#8217;s contributions are signed and timestamped, so that drift in any individual evaluator&#8217;s judgement pattern is observable, not assumed away.&nbsp;<a href=\"https:\/\/www.w3.org\/TR\/vc-data-model-2.0\/\" target=\"_blank\" rel=\"noopener\">W3C Verifiable Credentials<\/a>&nbsp;carry the signed claims. Third, cohort composition is auditable: the team can answer, at any point in the model&#8217;s training history, what fraction of the judging was done by evaluators present in the prior batch, what fraction by new entrants, and what fraction by previously-active evaluators returning. None of those questions are answerable today in the median preference-data or eval pipeline.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">With longitudinal evaluation in place, three new measurements become available to the team. Per-evaluator calibration drift becomes visible, so the team can decide whether a behavioural shift in batch N comes from the evaluator or the model. Cohort composition change becomes a tracked variable rather than a noise floor. Inter-batch consistency on hold-out items becomes a primary statistic, useful for both the publisher and any downstream auditor wondering whether the methodology held over time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Why snapshot eval pools break continual training<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The default assumption in most evaluation pipelines is that the cohort is stable enough to ignore over the time horizons that matter. This was a defensible assumption when models were released once per year and the evaluation was done in a tight window around release. It stops being defensible when the model is being retrained every two weeks and the cohort is paid per judgement on a platform with normal turnover. Inside a calendar quarter, the practical evaluator pool can rotate substantially. 
The published metrics make the model look like it is improving or regressing along the dimensions the team cares about, but some unknown fraction of the apparent change is cohort drift.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The Prism authors note that engineering bottlenecks are holding back continual multimodal tuning. The model-side bottlenecks they call out (catastrophic forgetting, knowledge interference, parameter-efficient adaptation) are well-defined enough that the field can attack them with code. The evaluation-side bottleneck is structural in a different way. It does not get solved by writing better evaluation code against the same cohort. It gets solved by making the cohort itself a tracked variable, with the same temporal resolution as the model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Teams that ship continual training without longitudinal evaluation are running a calibration experiment they cannot read. The model is changing. The evaluator pool is changing. The published metric is a function of both. The cost surfaces months later as benchmarks that no longer agree with one another, methodology questions that cannot be resolved without going back to data the pipeline did not keep, and downstream users of the metric losing confidence faster than the technical team realises.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">What the human-side infrastructure looks like in practice<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Longitudinal evaluation has four moving parts, each of which maps to a standards artefact that has been mature for years.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, each evaluator holds a DID that persists across batches, platforms, and labelling vendors. The DID is the durable anchor. When the team&#8217;s labelling pipeline switches vendors mid-quarter, the evaluator&#8217;s record stays attached to the same identifier on both sides. 
The cohort composition statistic stays meaningful through the switch.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Second, every contribution by every evaluator is wrapped in a signed verifiable credential carrying the rubric version, the timestamp, the issuer attestation, and any specialist credentials the evaluator was bringing to that batch. The credential is portable and the signature is verifiable. Time-series analysis on per-evaluator behaviour is straightforward because each contribution carries enough metadata to reconstruct exactly what the evaluator was asked to do and under what version of the methodology.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Third, evaluator-level methodology versioning is first-class. When the rubric updates, the new version is what subsequent credentials cite. Existing credentials remain pinned to the version that applied when they were issued, and a&nbsp;<a href=\"https:\/\/www.w3.org\/TR\/vc-bitstring-status-list\/\" target=\"_blank\" rel=\"noopener\">W3C Bitstring Status List<\/a>&nbsp;tracks revocations, supersessions, and any retroactive re-attestations. Any downstream auditor can reconstruct the evaluator pool against any historical rubric version and recompute the published metric.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Fourth, evaluator privacy is preserved through selective disclosure (W3C VC 2.0 family). The cohort composition statistic does not require disclosing individual evaluator identity; it requires being able to count unique evaluators, track returning ones, and verify rubric eligibility, none of which need name, location, or demographic detail. The&nbsp;<a href=\"https:\/\/identity.foundation\/\" target=\"_blank\" rel=\"noopener\">Decentralized Identity Foundation<\/a>&nbsp;has been stewarding the standards for almost a decade. The deployment ergonomics are not the blocker. 
The willingness of teams running continual training to build the human-side infrastructure with the same rigour as the model side is.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Where Ontology fits<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Ontology has been deploying decentralised identity standards for years.\u00a0<a href=\"https:\/\/ont.id\" target=\"_blank\" rel=\"noopener\">ONT ID<\/a>\u00a0issues credentials that hold across systems and across time.\u00a0<a href=\"https:\/\/onto.app\" target=\"_blank\" rel=\"noopener\">ONTO Wallet<\/a>\u00a0gives the evaluator direct custody of their record, so that an evaluator&#8217;s contributions in batch 12 and batch 47 are anchored to the same holder regardless of which labelling vendor or eval platform mediated each batch. The pattern continual training now needs (many issuers, many verifiers, durable holders, time-series queryable contributions) is the topology the identity stack was built for. The standards work has been done. The cost of adoption falls almost entirely on plumbing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The teams that build longitudinal evaluator infrastructure now will be the ones whose continual-training pipelines produce metrics anyone outside the team trusts six months from now. The teams that ship continual training without that human-side layer will spend the next year discovering that their published numbers do not survive a careful look.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Continue reading this week<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Tomorrow:\u00a0Sybil contamination is a preference-data problem, on the privacy-preserving primitive that prevents one person controlling multiple evaluator accounts from quietly biasing the entire upstream. 
Yesterday:\u00a0<a href=\"https:\/\/ont.io\/news\/preference-data-integrity-the-variable-distillation-hides\/\">Your reward model is only as good as your preference data<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Longitudinal evaluation is the human-judgement layer that scales alongside continual model adaptation. A continually retrained model paired with a snapshot evaluator pool produces measurement drift faster than the model itself drifts, because the eval cohort changes underneath the result while the publisher assumes it is fixed. The standards stack that makes evaluator pools longitudinal is<\/p>\n","protected":false},"author":5,"featured_media":910,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[170,113],"tags":[177,185,186,117,172],"class_list":["post-909","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-data","tag-rlhf","tag-continual-learning","tag-longitudinal-evaluation","tag-decentralised-identity","tag-ai-evaluation"],"_links":{"self":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts\/909","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/comments?post=909"}],"version-history":[{"count":1,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts\/909\/revisions"}],"predecessor-version":[{"id":911,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/posts\/909\/revisions\/911"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/media\/910"}],"wp:attachment":[{"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/media?parent=909"}],"wp:term":[{"taxonomy":"category",
"embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/categories?post=909"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ont.io\/news\/wp-json\/wp\/v2\/tags?post=909"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}