Pre-registration · substrate model

How Whel will validate its tiers.

This page records the validation benchmark for Whel's confidence tiers before it is run. The sample, the external sources of truth, the adjudication rules, and the falsifying outcome are fixed here so the result can be read as a calibration check rather than a retrospective rationalization. The benchmark itself will be executed after publication of this page and the result reported against the criteria below, whatever it shows.

Active pairs (corpus)

183

Strong-tier pairs

Conditions represented

Distinct compounds

171

Live from the substrate · 433 verbatim claims across 186 source documents back the active signals.

01 · Purpose

Why pre-register

Whel reads each drug–condition pair through three evidence arms and scores each arm on five dimensions, then discounts the arm score by a female-applicability multiplier and sorts it into a tier. The full model is documented on the signal types & scoring page and the technical architecture page. The question this page addresses is one step downstream of that model: when Whel reports a pair as Strong, how often does the independent external clinical record line up with it?

The credibility of any answer to that question depends on fixing the test before running it. The sample, the external comparators, the adjudication rules, and the reporting format are all locked here. The result will be reported against this page, with the version tag at the top, even if it undercuts Whel's own framing.

02 · Sample

What gets evaluated

The primary sample is every active Strong-tier drug–condition pair in the substrate at the snapshot date. As of the live corpus above, that is 11 Strong-tier pairs spanning 6 conditions and 11 distinct compounds. The full list is read directly from the substrate at request time, so the count on this page can never silently drift from what the engine actually holds.

A matched comparator sample is drawn from the Emerging tier (68 pairs) and the Exploratory tier (44 pairs), using stratified random sampling on the anchoring arm so the comparator mix mirrors the Strong sample. The comparators are included specifically so the result can be read as calibration: a Strong tier that corroborates externally at a much higher rate than the lower tiers is the outcome the model is designed to produce.
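The stratified draw described above can be sketched as follows. This is a minimal illustration, not the engine's sampler: the dictionary keys `id` and `arm`, the arm labels, the seed, and the choice of one comparator per Strong pair within each arm are all assumptions, since the page fixes the stratification variable but not the comparator sample size.

```python
import random
from collections import Counter, defaultdict


def draw_matched_comparators(strong_pairs, candidate_pairs, seed=0):
    """Draw comparators whose anchoring-arm mix mirrors the Strong sample.

    Pairs are dicts with hypothetical keys "id" and "arm".
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    needed = Counter(pair["arm"] for pair in strong_pairs)

    # Group the candidate pool by anchoring arm (the stratum).
    pool_by_arm = defaultdict(list)
    for pair in candidate_pairs:
        pool_by_arm[pair["arm"]].append(pair)

    sample = []
    for arm in sorted(needed):
        pool = pool_by_arm[arm]
        # Take as many as the Strong sample has in this arm, or the whole
        # stratum if it is smaller.
        sample.extend(rng.sample(pool, min(needed[arm], len(pool))))
    return sample
```

Under these assumptions the function would be called once on the Emerging pool and once on the Exploratory pool, with the seed recorded alongside the frozen snapshot so the draw can be repeated.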

Structural pre-checks

Of the active corpus, 108 pairs carry a clinical validation status, meaning a Direct evidence arm is strong enough to anchor the pair. Every Strong-tier pair passes the engine's structural audit: no missing dimension scores, no tier–score-band mismatches, no signal whose anchoring arm lacks a verbatim claim, and no duplicate source URLs within a pair. The frozen list of the 11 Strong-tier pairs is archived at the time of execution so the benchmark is reproducible against a fixed snapshot.
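The four audit conditions can be expressed as a single check per pair. The sketch below is illustrative only: the field names (`dimension_scores`, `score`, `tier`, `anchoring_claims`, `source_urls`) and the `tier_bands` mapping are hypothetical, and the actual score bands are not stated on this page.

```python
def audit_pair(pair, tier_bands):
    """Return a list of structural problems for one pair (empty means pass).

    `pair` and `tier_bands` use hypothetical field names; the real schema
    and the real score bands may differ.
    """
    problems = []

    # Five dimension scores are expected, none of them missing.
    scores = pair["dimension_scores"]
    if len(scores) != 5 or any(value is None for value in scores.values()):
        problems.append("missing dimension score")

    # The pair's score must fall inside the band for its assigned tier.
    low, high = tier_bands[pair["tier"]]
    if not (low <= pair["score"] <= high):
        problems.append("tier-score-band mismatch")

    # The anchoring arm must be backed by at least one verbatim claim.
    if not pair["anchoring_claims"]:
        problems.append("anchoring arm lacks a verbatim claim")

    # Source URLs must be unique within the pair.
    urls = pair["source_urls"]
    if len(urls) != len(set(urls)):
        problems.append("duplicate source URL")

    return problems
```

A pair passes when the returned list is empty; returning every problem rather than stopping at the first makes the audit log easier to review.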

03 · External comparators

Where the external evidence comes from

Each pair in the sample is checked against the following external bodies of evidence. Searches use the compound name (generic and brand where relevant) together with the condition name and standard synonyms, with no date restriction and no language restriction.

Sources already indexed in the substrate for that pair are excluded from the external search so the comparison remains genuinely external. If a PubMed PMID is already attached to the signal as evidence, that PMID does not count toward external corroboration, but other PubMed records on the same pair do.
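The exclusion rule amounts to a set difference on identifiers. A minimal sketch, assuming search results and indexed sources are both available as lists of PMID strings (the function name and the inputs are hypothetical):

```python
def external_only(search_pmids, indexed_pmids):
    """Keep search hits that are not already attached to the signal.

    Order is preserved and repeated hits are collapsed, so each external
    record is counted once.
    """
    indexed = set(indexed_pmids)
    seen = set()
    kept = []
    for pmid in search_pmids:
        if pmid in indexed or pmid in seen:
            continue
        seen.add(pmid)
        kept.append(pmid)
    return kept
```

Only the records returned here would be eligible to count toward external corroboration for that pair.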

04 · External corroboration

What counts as a hit

This benchmark is deliberately external. The five-dimension rubric that produces a pair's tier, including its internal corroboration dimension, is scored only against the evidence Whel has actually ingested. The ladder below measures something separate: whether the independent external record, setting aside what Whel already holds, takes the same pair seriously. Because the two are kept apart, any agreement between them actually carries weight.

Each pair is assigned a single level on the scale below, taking the highest applicable level. The level reflects only what exists in the external record. Whether that evidence is positive or negative does not change the level: a pair that surfaces in a guideline as a discouraged option still scores E3, and the direction is recorded separately.

E0

No independent external support

Nothing in the external record beyond the sources Whel already indexed for the pair.

E1

Independent primary evidence

At least one published trial or study not already attached to the signal reports the drug–condition pair.

E2

Replicated or synthesized

A systematic review or meta-analysis, or two or more independent trials, cover the pair.

E3

Guideline-recognized

A named clinical-society guideline names the pair in either direction. A discouraged option still counts, and the direction is recorded separately.

E3 (guideline recognition) is established by a human curation pass that sits outside the scoring pipeline. A named society body such as ESHRE, ISSWSH, NAMS, NICE, ACOG, or ASRM names the pair, and the endorsement strength and evidence certainty are recorded using that body's own framework (GRADE for ESHRE, NAMS Levels I/II/III, the ISSWSH modified Delphi), then normalized so grades from different bodies can be compared. Coverage is intentionally narrow at this stage and expands following the same workflow.

Separately, and never blended into either the tier or the external level, Whel surfaces the Every Cure MATRIX cross-reference. MATRIX is an independent treatment-probability prediction from a graph-ML model trained on an open biomedical knowledge graph, built on the KGML-xDTD framework (Ma, Zhou, Liu & Koslicki, GigaScience 2023). Per-pair MATRIX scores appear beside a pair's arm scores as a disclosure layer. Full audit numbers, per-condition coverage, and the score distribution are published at external references → 05 · Coverage disclosure.

Effect direction in the external record is captured as a secondary field: supports the indexed direction, contradicts it, mixed, or unclear. This is reported alongside the level but is not used to determine the level itself.
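The "highest applicable level" rule, with direction held as a separate field, can be sketched as below. The `ExternalRecord` structure and its field names are hypothetical. It is assumed to describe only what remains after already-indexed sources are excluded, and the human adjudicator, not code, fills it in.

```python
from dataclasses import dataclass


@dataclass
class ExternalRecord:
    """What a reviewer found externally for one pair (hypothetical shape)."""
    independent_trials: int    # trials or studies not attached to the signal
    has_review_or_meta: bool   # systematic review or meta-analysis covers the pair
    guideline_named: bool      # a named society guideline names the pair
    direction: str = "unclear"  # supports / contradicts / mixed / unclear


def external_level(record):
    """Return the highest applicable level; direction never changes it."""
    if record.guideline_named:
        return "E3"
    if record.has_review_or_meta or record.independent_trials >= 2:
        return "E2"
    if record.independent_trials >= 1:
        return "E1"
    return "E0"
```

Checking from E3 downward guarantees a single level per pair, and a guideline that discourages the option still returns E3 with `direction` carrying the disagreement.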

05 · Adjudication

How pairs are scored

Each pair is reviewed in randomized order so that no rater ever scores a contiguous tier block. Tier assignment is masked: the rater sees the compound, the condition, and the anchoring arm, but not Whel's scoring or tier. The external level (E0–E3) and direction are recorded before the masked fields are revealed.
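A rating queue with these properties could be built roughly as follows. The masked field names (`tier`, `score`) and the visible ones are assumptions about the record layout, and the seed is illustrative.

```python
import random

# Hypothetical names for the fields hidden from the rater.
MASKED_FIELDS = ("tier", "score")


def build_rating_queue(pairs, seed):
    """Strip masked fields from each pair and shuffle the review order."""
    rng = random.Random(seed)  # recorded seed makes the order reproducible
    queue = [
        {key: value for key, value in pair.items() if key not in MASKED_FIELDS}
        for pair in pairs
    ]
    rng.shuffle(queue)
    return queue
```

Shuffling the pooled Strong, Emerging, and Exploratory pairs together, rather than within each tier, is what keeps tier membership from being inferred from position in the queue.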

The primary adjudicator is an external clinician-researcher with a decade of NIMH- and PCORI-funded research experience in women's health, drawn from outside the project team. The project lead is not the primary rater. A subset of at least 20 percent of the sample will be independently scored by a second reviewer, with disagreements resolved by discussion and inter-rater agreement reported as Cohen's kappa. Both reviewers must be blind to the tier assignment at the time of scoring.
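For reference, Cohen's kappa on the double-scored subset compares observed agreement with the agreement expected by chance from each rater's level frequencies. The sketch below computes the unweighted statistic; whether a weighted variant will be used for the ordinal E0–E3 scale is not specified on this page, so the unweighted form is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set")
    n = len(rater_a)

    # Share of pairs on which the two raters gave the same level.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Kappa would be computed on the levels as first recorded, before disagreements are resolved by discussion.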

06 · Analysis plan

What gets reported

For each tier in the sample (Strong, Emerging, Exploratory), the report will give the distribution across E0–E3, the share that reaches at least E1 and at least E2, and the share with direction consistent with the indexed effect. Each share is reported with an exact binomial confidence interval.
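The page does not name the interval method. One common reading of "exact" for a proportion is the Clopper–Pearson interval, and the sketch below computes it with the standard library only. Both the method and the 95 percent level are assumptions made for illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iterations=60):
    """Bisection on [0, 1] for an increasing function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided interval for k successes out of n."""
    if not 0 <= k <= n or n == 0:
        raise ValueError("need 0 <= k <= n and n > 0")
    # Lower bound: P(X >= k | p) = alpha / 2, which is increasing in p.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper
```

With only 11 Strong-tier pairs these intervals are wide. Even if all 11 reach E1, the lower bound is 0.025^(1/11), roughly 0.72, which is why the headline result is stated against fixed thresholds rather than interval overlap.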

The primary calibration question is whether the Strong tier reaches at least E1 in a clearly higher share of cases than the Emerging tier, and at least E2 in a clearly higher share than the Exploratory tier. The result will be reported as supported, partially supported, or not supported against the pre-specified thresholds below.

Pre-specified thresholds

Supported: at least 85 percent of Strong pairs reach E1 or higher, and at least 50 percent reach E2 or higher.
Partially supported: the Supported thresholds are not met, but at least 70 percent of Strong pairs reach E1 or higher and at least 30 percent reach E2 or higher.
Not supported: below the partially-supported thresholds, or the Strong tier does not exceed the Emerging tier on either metric.
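The decision rule can be written out mechanically. In the sketch below, levels are encoded as integers 0 to 3 for E0 to E3, and "does not exceed the Emerging tier on either metric" is read as the Strong share being no higher than the Emerging share on both metrics. That reading, the encoding, and the function names are assumptions.

```python
def share_at_least(levels, minimum):
    """Fraction of pairs whose external level is at or above `minimum`."""
    return sum(level >= minimum for level in levels) / len(levels)


def classify_result(strong_levels, emerging_levels):
    """Apply the pre-specified thresholds to integer levels (0 = E0 ... 3 = E3)."""
    strong_e1 = share_at_least(strong_levels, 1)
    strong_e2 = share_at_least(strong_levels, 2)
    emerging_e1 = share_at_least(emerging_levels, 1)
    emerging_e2 = share_at_least(emerging_levels, 2)

    # Assumed reading: Strong must beat Emerging on at least one metric.
    if not (strong_e1 > emerging_e1 or strong_e2 > emerging_e2):
        return "not supported"
    if strong_e1 >= 0.85 and strong_e2 >= 0.50:
        return "supported"
    if strong_e1 >= 0.70 and strong_e2 >= 0.30:
        return "partially supported"
    return "not supported"
```

With 11 Strong pairs, these thresholds translate to at least 10 pairs at E1 or higher and at least 6 at E2 or higher for a Supported result, and at least 8 and 4 respectively for Partially supported.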

Directional consistency is reported separately and is not used to adjudicate the headline result. A pair where the external record is clearly opposite to the indexed direction is flagged for review.

07 · Limitations

What this benchmark does not show

A successful result here would show that Whel's Strong tier concentrates on drug–condition pairs that the external clinical record also takes seriously. It would not show that those pairs work in patients. It would show only that the indexed evidence and the external evidence agree on which pairs are worth studying. Clinical efficacy is a separate question, and answering it requires trials.

The sample is small (11 Strong, with matched comparators). The external sources of truth are themselves imperfect: guidelines lag the literature, the literature lags the biology, and some conditions in scope (notably PMDD and vulvodynia) have thinner guideline coverage than others. These conditions will tend toward E1 or E2 even where the indexed signal is well supported by trials, and that asymmetry is expected.

The classifier scoring the indexed signals is the same language-model family that may assist external search. Where external search is used to surface candidate papers, a human reviewer makes the final level assignment by reading those papers; the model does not.

08 · Reporting

Where the result will land

Results will be posted as a new section on this page and linked from the home page and the technical architecture page. The version tag at the top is incremented only if the methodology itself changes; the underlying data snapshot is recorded with the result. Raw scoring sheets, including disagreements, will be made available on request and archived alongside the writeup.

Negative or partial-support results will be reported with the same prominence as positive ones. If the Strong tier does not materially separate from Emerging on the pre-specified thresholds, the model is revised and re-tested before the benchmark is run again.

Related

The three evidence arms, the five-dimension rubric, and the indexing pipeline that produces these tiers are documented separately.

How signals are scored →

Latest revision · Substrate model · June 2026

The pre-registered benchmark was re-grounded on the three-arm substrate model. The earlier external-validation ladder, which doubled as an in-pipeline literature grade, was retired; the in-pipeline corroboration signal now lives inside the five-dimension rubric, and this page keeps a clean external ladder (E0–E3) that is computed only after the fact. The sample numbers above are read live from the substrate so they cannot drift from the engine.

Full revision history →