How Whel will validate its tiers.
This page records the validation benchmark for Whel's confidence tiers before it is run. The sample, the external sources of truth, the adjudication rules, and the falsifying outcome are fixed here so the result can be read as a calibration check rather than a retrospective rationalization. The benchmark itself will be executed after publication of this page and the result reported against the criteria below, whatever it shows.
Live from the substrate · 433 verbatim claims across 186 source documents back the active signals.
Why pre-register
Whel reads each drug–condition pair through three evidence arms and scores each arm on five dimensions, then discounts the arm score by a female-applicability multiplier and sorts it into a tier. The full model is documented on the signal types & scoring page and the technical architecture page. The question this page addresses is one step downstream of that model: when Whel reports a pair as Strong, how often does the independent external clinical record line up with it?
The credibility of any answer to that question depends on fixing the test before running it. The sample, the external comparators, the adjudication rules, and the reporting format are all locked here. The result will be reported against this page, with the version tag at the top, even if it undercuts Whel's own framing.
What gets evaluated
The primary sample is every active Strong-tier drug–condition pair in the substrate at the snapshot date. As of the live corpus above, that is 11 Strong-tier pairs spanning 6 conditions and 11 distinct compounds. The full list is read directly from the substrate at request time, so the count on this page can never silently drift from what the engine actually holds.
A matched comparator sample is drawn from the Emerging tier (68 pairs) and the Exploratory tier (44 pairs), using stratified random sampling on the anchoring arm so the comparator mix mirrors the Strong sample. The comparators are included specifically so the result can be read as calibration: a Strong tier that corroborates externally at a much higher rate than the lower tiers is the outcome the model is designed to produce.
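One way the stratified draw could work is sketched below. This is a minimal illustration, not Whel's sampling code: the arm labels, the target sample size `n`, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(strong_arms, pool, n, seed=0):
    """Draw about n pairs from `pool` so the anchoring-arm mix mirrors `strong_arms`.

    strong_arms: list of anchoring-arm labels, one per Strong pair.
    pool: list of (pair_id, anchoring_arm) tuples from a lower tier.
    n and seed are hypothetical parameters; the page does not fix them.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced

    # Group the comparator pool by anchoring arm.
    by_arm = defaultdict(list)
    for pair_id, arm in pool:
        by_arm[arm].append(pair_id)

    counts = Counter(strong_arms)
    total = sum(counts.values())

    sample = []
    for arm in sorted(counts):
        # Quota proportional to the arm's share of the Strong sample.
        quota = round(n * counts[arm] / total)
        candidates = sorted(by_arm.get(arm, []))
        # Never ask for more pairs than the stratum holds.
        sample.extend(rng.sample(candidates, min(quota, len(candidates))))
    return sample
```

Because quotas are rounded per arm, the total can differ from `n` by a pair or two, and a stratum with too few candidates is simply exhausted. Either case would need to be recorded with the frozen sample.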
Of the active corpus, 108 pairs carry a clinical validation status, meaning a Direct evidence arm is strong enough to anchor the pair. Every Strong-tier pair passes the engine's structural audit: no missing dimension scores, no tier–score-band mismatches, no signal whose anchoring arm lacks a verbatim claim, and no duplicate source URLs within a pair. The frozen list of the 11 Strong-tier pairs is archived at the time of execution so the benchmark is reproducible against a fixed snapshot.
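The four audit checks can be pictured as a single function that returns a list of problems. The sketch below is illustrative only: the record layout (`arm_scores`, `claims`, `source_urls`, and so on) and the score bands passed in are hypothetical, since this page does not publish the engine's schema or band boundaries.

```python
def audit_pair(pair, bands):
    """Return a list of structural problems for one pair (empty list = pass).

    `pair` is a hypothetical dict layout; `bands` maps a tier name to an
    assumed (low, high) score band, low inclusive and high exclusive.
    """
    problems = []

    # 1. Every arm must carry all five dimension scores.
    for arm, scores in pair["arm_scores"].items():
        if len(scores) != 5 or any(s is None for s in scores):
            problems.append(f"missing dimension score in arm {arm}")

    # 2. The pair's score must fall inside the band for its tier.
    low, high = bands[pair["tier"]]
    if not (low <= pair["score"] < high):
        problems.append("tier-score-band mismatch")

    # 3. The anchoring arm must have at least one verbatim claim.
    if not pair["claims"].get(pair["anchoring_arm"]):
        problems.append("anchoring arm has no verbatim claim")

    # 4. Source URLs must be unique within the pair.
    urls = pair["source_urls"]
    if len(urls) != len(set(urls)):
        problems.append("duplicate source URL")

    return problems
```

Returning every problem rather than stopping at the first makes it easier to archive a complete audit record alongside the frozen snapshot.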
Where the external evidence comes from
Each pair in the sample is checked against the following external bodies of evidence. Searches use the compound name (generic and brand where relevant) together with the condition name and standard synonyms, with no date restriction and no language restriction.
- PubMed (NCBI)
- ClinicalTrials.gov
- Cochrane Library
- ESHRE guidelines
- ASRM practice committee documents
- NICE guidance
- ACOG practice bulletins
Sources already indexed in the substrate for that pair are excluded from the external search so the comparison remains genuinely external. If a PubMed PMID is already attached to the signal as evidence, that PMID does not count toward external corroboration, but other PubMed records on the same pair do.
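The exclusion rule amounts to a set difference between what the external search returns and what the signal already cites. A minimal sketch, assuming records are compared by a normalized identifier string (the identifiers used when calling it would be placeholders, not real PMIDs):

```python
def external_only(search_results, indexed_ids):
    """Drop any search result whose identifier is already attached to the signal.

    Identifiers are compared case-insensitively after trimming whitespace;
    that normalization is an assumption made for this sketch.
    """
    indexed = {i.strip().lower() for i in indexed_ids}
    return [r for r in search_results if r.strip().lower() not in indexed]
```

For example, if a search returns `["pmid:111", "pmid:222", "pmid:333"]` and the signal already cites `"PMID:222"`, only the first and third records remain eligible to count toward external corroboration.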
What counts as a hit
This benchmark is deliberately external. The five-dimension rubric that produces a pair's tier, including its internal corroboration dimension, is scored only against the evidence Whel has actually ingested. The ladder below measures something separate: whether the independent external record, setting aside what Whel already holds, takes the same pair seriously. Because the two are kept apart, any agreement between them actually carries weight.
Each pair is assigned a single level on the scale below, taking the highest applicable level. The level reflects only what exists in the external record. Whether that evidence is positive or negative does not change the level: a pair that surfaces in a guideline as a discouraged option still scores E3, and the direction is recorded separately.
E3 (guideline recognition) is established by a human curation pass that sits outside the scoring pipeline. A named society body such as ESHRE, ISSWSH, NAMS, NICE, ACOG, or ASRM names the pair, and the endorsement strength and evidence certainty are recorded using that body's own framework (GRADE for ESHRE, NAMS Levels I/II/III, the ISSWSH modified Delphi), then normalized so grades from different bodies can be compared. Coverage is intentionally narrow at this stage and expands following the same workflow.
Separately, and never blended into either the tier or the external level, Whel surfaces the Every Cure MATRIX cross-reference. MATRIX is an independent treatment-probability prediction from a graph-ML model trained on an open biomedical knowledge graph, built on the KGML-xDTD framework (Ma, Zhou, Liu & Koslicki, GigaScience 2023). Per-pair MATRIX scores appear beside a pair's arm scores as a disclosure layer. Full audit numbers, per-condition coverage, and the score distribution are published at external references → 05 · Coverage disclosure.
Effect direction in the external record is captured as a secondary field: supports the indexed direction, contradicts it, mixed, or unclear. This is reported alongside the level but is not used to determine the level itself.
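The separation between level and direction can be made explicit in the record a rater fills in. The sketch below is one possible shape, not the project's scoring sheet: the class and function names are hypothetical, and treating E0 as the floor when no higher level applies is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    E0 = 0
    E1 = 1
    E2 = 2
    E3 = 3


DIRECTIONS = {"supports", "contradicts", "mixed", "unclear"}


@dataclass(frozen=True)
class ExternalRating:
    level: Level
    direction: str  # secondary field, never used to compute the level


def rate(applicable_levels, direction):
    """Take the highest applicable level; record direction alongside it."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    # Assumption: E0 is the floor when no higher level applies.
    level = max(applicable_levels, default=Level.E0)
    return ExternalRating(level=level, direction=direction)
```

Under this shape, a pair named in a guideline as a discouraged option would be recorded as `rate([Level.E1, Level.E3], "contradicts")`, which yields level E3 with the direction stored separately.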
How pairs are scored
Each pair is reviewed in randomized order so that no rater ever scores a contiguous tier block. Tier assignment is masked: the rater sees the compound, the condition, and the anchoring arm, but not Whel's scoring or tier. The external level (E0–E3) and direction are recorded before the masked fields are revealed.
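A rating queue with these properties could be built roughly as follows. This is a sketch under assumptions: the field names are hypothetical, and `pair_id` is an invented key kept only so ratings can be re-joined to tiers after unmasking.

```python
import random

# Fields the rater is allowed to see; everything else is masked.
VISIBLE_FIELDS = ("pair_id", "compound", "condition", "anchoring_arm")


def build_rating_queue(pairs, seed):
    """Return masked copies of `pairs` in a reproducible random order."""
    rng = random.Random(seed)
    # Whitelist visible fields so new scoring fields are masked by default.
    queue = [{k: p[k] for k in VISIBLE_FIELDS} for p in pairs]
    rng.shuffle(queue)
    return queue
```

A whitelist is used rather than a blacklist so that any field added to the substrate later stays hidden unless it is deliberately exposed. A plain shuffle makes long same-tier runs unlikely but does not strictly rule them out, so the realized order would still need a check before rating begins.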
The primary adjudicator is an external clinician-researcher with a decade of NIMH- and PCORI-funded research experience in women's health, drawn from outside the project team. The project lead is not the primary rater. A subset of at least 20 percent of the sample will be independently scored by a second reviewer, with disagreements resolved by discussion and inter-rater agreement reported as Cohen's kappa. Both reviewers must be blind to the tier assignment at the time of scoring.
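Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance from each rater's label frequencies: kappa = (p_o - p_e) / (1 - p_e). A minimal unweighted implementation is sketched below. Whether the final report uses unweighted or weighted kappa for the ordered E0–E3 scale is not stated here, so the unweighted form is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)
```

With a double-scored subset of only a few dozen pairs, the kappa estimate will itself be imprecise, which is worth keeping in mind when reading it.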
What gets reported
For each tier in the sample (Strong, Emerging, Exploratory), the report will give the distribution across E0–E3, the share that reaches at least E1 and at least E2, and the share with direction consistent with the indexed effect. Results are reported with exact confidence intervals.
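Assuming "exact" refers to the Clopper-Pearson binomial interval, the bounds can be computed with the standard library alone by bisecting on the binomial CDF. The function names below are illustrative, and the 95 percent default is an assumption; the page does not state the confidence level.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes out of n."""

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper
```

With a sample of 11, these intervals are wide: even if all 11 Strong pairs reached a given level, the 95 percent lower bound would sit at about 0.72.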
The primary calibration question is whether the Strong tier reaches at least E1 in a clearly higher share of cases than the Emerging tier, and at least E2 in a clearly higher share than the Exploratory tier. The result will be reported as supported, partially supported, or not supported against the pre-specified thresholds below.
- Supported: at least 85 percent of Strong pairs reach E1 or higher, and at least 50 percent reach E2 or higher.
- Partially supported: at least 70 percent reach E1 and at least 30 percent reach E2.
- Not supported: below the partially-supported thresholds, or the Strong tier does not exceed the Emerging tier on either metric.
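The three outcomes above can be written as a small decision function. This is a sketch of one reading of the criteria: shares are passed as fractions between 0 and 1, and "does not exceed the Emerging tier on either metric" is interpreted as Strong failing to exceed Emerging on both the E1 and the E2 share. The function and parameter names are hypothetical.

```python
def classify_result(strong_e1, strong_e2, emerging_e1, emerging_e2):
    """Map tier shares (fractions in [0, 1]) to the pre-specified verdict.

    strong_e1 / strong_e2: share of Strong pairs reaching at least E1 / E2.
    emerging_e1 / emerging_e2: the same shares for the Emerging tier.
    """
    # Assumed reading: Strong exceeds Emerging on neither metric.
    if strong_e1 <= emerging_e1 and strong_e2 <= emerging_e2:
        return "not supported"
    if strong_e1 >= 0.85 and strong_e2 >= 0.50:
        return "supported"
    if strong_e1 >= 0.70 and strong_e2 >= 0.30:
        return "partially supported"
    return "not supported"
```

With 11 Strong pairs, the thresholds translate to whole-pair counts: 10 of 11 at E1 or higher clears 85 percent while 9 of 11 does not, and 6 of 11 at E2 or higher clears 50 percent while 5 of 11 does not.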
Directional consistency is reported separately and is not used to adjudicate the headline result. A pair where the external record is clearly opposite to the indexed direction is flagged for review.
What this benchmark does not show
A successful result here would show that Whel's Strong tier concentrates on drug–condition pairs that the external clinical record also takes seriously. It would not show that those pairs work in patients. It would show only that the indexed evidence and the external evidence agree on which pairs are worth studying. Clinical efficacy is a separate question, and answering it requires trials.
The sample is small (11 Strong, with matched comparators). The external sources of truth are themselves imperfect: guidelines lag the literature, the literature lags the biology, and some conditions in scope (notably PMDD and vulvodynia) have thinner guideline coverage than others. These conditions will tend toward E1 or E2 even where the indexed signal is well supported by trials, and that asymmetry is expected.
The classifier scoring the indexed signals is the same language-model family that may assist external search. Where external search is used to surface candidate papers, a human reviewer makes the final level assignment by reading those papers; the model does not.
Where the result will land
Results will be posted as a new section on this page and linked from the home page and the technical architecture page. The version tag at the top is incremented only if the methodology itself changes; the underlying data snapshot is recorded with the result. Raw scoring sheets, including disagreements, will be made available on request and archived alongside the writeup.
Negative or partial-support results will be reported with the same prominence as positive ones. If the Strong tier does not materially separate from Emerging on the pre-specified thresholds, the model is revised and re-tested before the benchmark is run again.
The three evidence arms, the five-dimension rubric, and the indexing pipeline that produces these tiers are documented separately.
The pre-registered benchmark was re-grounded on the three-arm substrate model. The earlier external-validation ladder, which doubled as an in-pipeline literature grade, was retired; the in-pipeline corroboration signal now lives inside the five-dimension rubric, and this page keeps a clean external ladder (E0–E3) that is computed only after the fact. The sample numbers above are read live from the substrate so they cannot drift from the engine.