Methodology changelog.
This page records every dated revision to Whel's validation methodology. Entries that change the rubric, the sample, the external sources of truth, the adjudication rules, or the pre-specified thresholds are recorded here so the methodology page stays readable and so smaller refinements that would not warrant a roadmap entry remain visible. The newest revision is on top.
Major version. The LLM-synthesized repurposing_signals model was retired in favor of the arm-aware substrate engine. Whel now reads each drug-condition pair through three evidence arms (Direct, Pathway, Community), scores each arm on five tuned dimensions (corroboration, rigor, specificity, plausibility, consistency), discounts by a female-applicability multiplier, and sorts into a tier. Cross-condition was demoted from a scored fourth arm to a derived-hypotheses lens. The full model is documented on the signal types & scoring page. Snapshot at cutover: 183 active pairs across 6 conditions, tiered 11 Strong / 66 Moderate / 65 Emerging / 41 Exploratory, of which 108 carry a clinical validation status.
The L0–L3 literature grade was retired with this change. It had served two roles at once, an in-pipeline corroboration input and an external-validation benchmark, and the external reviewer flagged that doubling as circular. The in-pipeline signal is now the corroboration dimension inside the five-dimension rubric, scored only against evidence Whel has actually ingested. The validation methodology page keeps a clean external-only ladder (E0–E3) computed after the fact, so any agreement between the tier and the external record now carries real information.
The flagship editorial pages (the two featured signals), the homepage, the candidates index, the technical-architecture page, and the external-references page were all rebound to read tiers, scores, validation status, dimensions, and claims directly from the live substrate at request time. Hard-coded snapshots of the old model, which had drifted from the engine's stricter, more honest corroboration scoring, were removed, so the public numbers can no longer diverge from what the engine holds.
Source-level LLM extraction pipeline built and shipped. The Phase 2a smoke test on the same day surfaced a real architectural finding: the sources.key_finding_excerpt column that Phase 2a was designed to ground had existed in the schema since migration 041 but was 0 percent populated across all 2,166 active-signal source rows. No script had ever written to it. Phase 2a as originally designed would have skipped every row. The smoke test reported 30 of 30 source rows skipped with the same error message: “source row has empty key_finding_excerpt; nothing to ground.” The smoke test itself was the audit working as intended; the verifier surfaced the missing data layer before any misleading public numbers shipped.
v3.13 ships the data-layer fix: scripts/extract-key-findings.py. The script iterates every free-text source row, fetches the canonical source text (NCBI E-utilities efetch for PubMed abstracts, ClinicalTrials.gov API v2 briefSummary plus detailedDescription for trial records, Reddit's public JSON endpoint for post body plus top 5 comments for forum signals), and calls Claude Opus 4.8 to extract a 2 to 4 sentence key finding focused on the drug-condition pair for that specific signal. The extraction prompt is tight on purpose: outputs must use only claims directly supported by the canonical source text, must be specific (effect sizes, sample sizes, direction of effect when present), and must return the exact literal NO_RELEVANT_FINDING if the canonical source does not actually discuss the drug-condition pair. The refusal path is itself a useful audit signal: any row that returns NO_RELEVANT_FINDING is a source that was attached to a signal but does not actually evidence the claim, which Phase 2a then surfaces as a flag-worthy citation.
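A minimal sketch of how the refusal path could be separated from a normal extraction when each per-source result is recorded. The function name and the treatment of an empty model reply are assumptions for illustration, not the logic of scripts/extract-key-findings.py; the sentinel and the status strings are the ones named in this changelog, and the skipped status is left out because its trigger is not described here.

```python
from typing import Optional

# Exact literal the extraction prompt is required to return on refusal.
NO_RELEVANT_FINDING = "NO_RELEVANT_FINDING"


def classify_extraction(source_text: Optional[str], model_output: Optional[str]) -> str:
    """Map one per-source result onto a run-log status string (hypothetical helper)."""
    if not source_text or not source_text.strip():
        return "fetch_failed"          # canonical text could not be retrieved
    if model_output is None or not model_output.strip():
        return "api_failed"            # assumption: no usable reply counts as an API failure
    if model_output.strip() == NO_RELEVANT_FINDING:
        return "no_relevant_finding"   # source attached to the signal but silent on the pair
    return "extracted"
```

Comparing against the whole stripped reply, rather than searching for the sentinel as a substring, keeps a finding that merely mentions the literal from being misread as a refusal.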
The script emits two artifacts. A JSON run log at scripts/audit-output/key-finding-extractions.json records every per-source result with the prompt input, the raw model output, the latency, and the status (extracted, no_relevant_finding, fetch_failed, api_failed, skipped). A Supabase migration at supabase/migrations/045_backfill_key_finding_excerpts.sql carries the UPDATE statements that write each successful extraction back to the sources table. Each UPDATE is guarded by key_finding_excerpt IS NULL so the migration is idempotent and safe to re-run.
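The guarded UPDATE can be sketched as a small generator. This is an illustration under stated assumptions: the helper names are hypothetical, rows are assumed to be addressed by the id column, and quoting assumes PostgreSQL's default standard-conforming strings, where doubling a single quote is sufficient.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def guarded_update(source_id: str, excerpt: str) -> str:
    """Build one UPDATE that only fills a key_finding_excerpt that is still NULL."""
    return (
        "UPDATE sources\n"
        f"SET key_finding_excerpt = {sql_literal(excerpt)}\n"
        f"WHERE id = {sql_literal(source_id)}\n"
        "  AND key_finding_excerpt IS NULL;"
    )
```

Because the IS NULL guard stops matching once a row has been written, running the same file a second time changes nothing, which is the idempotency property described above.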
Order of operations to land Phase 2a's first real numbers: (1) run the extraction script locally (requires ANTHROPIC_API_KEY and the Supabase env vars; roughly 30 to 40 minutes dominated by Reddit's rate limit on 190 posts); (2) review the generated migration in Supabase Studio, spot-check a few UPDATE statements, run the migration; (3) re-run the export to refresh the audit snapshot with the now-populated key_finding_excerpt field; (4) run scripts/verify-summary-grounding.py for the first real Phase 2a numbers. The Phase 2a disclosure block on /about/external-references → 01d switches from “tooling shipped, awaiting first run” to live numbers at step (4).
A separate Roadmap row records signal-level summary grounding as a Planned follow-on. The current repurposing_signals.summary and repurposing_signals.mechanism_hypothesis fields are hand-written prose in seed migrations 002 through 007, not LLM-generated; they reference real PMIDs and clinical findings in narrative form and would also benefit from grounding against the cited source corpus. That is a separate audit mechanism (per-signal union of cited sources rather than per-source 1:1) and not in scope for v3.13.
Path C Phase 2 tooling shipped. Phase 1 (citation validation against canonical publishers) is now joined by Phase 2 (grounding the LLM-generated finding text on each source row against the canonical content at the publisher). Phase 2 is split into two mechanisms because Whel's five source types divide cleanly into free-text and structured-data categories, and the right verification mechanism differs by category. Sentence-BERT cosine-similarity grounding is appropriate when the canonical source has a free-text abstract (PubMed, ClinicalTrials.gov, Reddit); field-by-field structured verification is appropriate when the canonical source is a structured record (AEMS reaction counts; Open Targets drug-target-disease attributions).
Phase 2a (free-text grounding) ships as scripts/verify-summary-grounding.py. For every source row in {pubmed, clinical_trial, reddit}, the script splits the LLM-generated finding text stored in sources.key_finding_excerpt into sentences, fetches the canonical source text (NCBI E-utilities efetch for PubMed abstracts, ClinicalTrials.gov API v2 briefSummary plus detailedDescription for trial records, Reddit's public JSON endpoint for post body plus the top five comments), embeds both sides with all-MiniLM-L6-v2 from sentence-transformers, and computes max cosine similarity per LLM-summary sentence against the canonical sentences. Sentences scoring below 0.40 are flagged as “not directly supported by the source.” Reddit posts are refetched on every audit run rather than cached, so deleted or removed posts surface as “source no longer available,” itself a useful finding. Threshold tuning against a human-labeled validation set is recorded on the Roadmap as a Planned follow-on; the 0.40 default is documented and defensible as a v0.1 baseline but is expected to move with calibration.
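The flagging rule itself is simple enough to sketch without the embedding model. The example below assumes sentence embeddings have already been computed (in the real pipeline by all-MiniLM-L6-v2) and uses plain lists of floats as stand-ins; the function names are hypothetical.

```python
import math
from typing import List, Sequence

THRESHOLD = 0.40  # v0.1 default from this entry; expected to move with calibration


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_unsupported(
    summary_vecs: Sequence[Sequence[float]],
    source_vecs: Sequence[Sequence[float]],
    threshold: float = THRESHOLD,
) -> List[int]:
    """Return indices of summary sentences whose best source match is below threshold."""
    flagged = []
    for i, sentence in enumerate(summary_vecs):
        best = max((cosine(sentence, c) for c in source_vecs), default=0.0)
        if best < threshold:
            flagged.append(i)
    return flagged
```

Taking the maximum over source sentences means a summary sentence needs only one close counterpart in the canonical text to pass; a source with no sentences at all flags every summary sentence.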
Phase 2b (structured-source verification) ships as scripts/verify-structured-sources.py. AEMS rows (source_type='faers' in the schema; referred to as AEMS in user-facing prose to reflect the FDA's 2025 rename of the system formerly known as FAERS) carry the canonical openFDA query in the url column. The verifier re-runs each url, reads meta.results.total from the openFDA response, and compares it to the count parsed out of the LLM-extracted title within a tolerance of max(5, 10 percent of claimed count). The tolerance accommodates the fact that AEMS is a continuously-updating dataset; exact matches across two timepoints are not expected and would themselves be evidence of caching. Open Targets rows are verified by re-fetching the drug record through the OT GraphQL drug(chemblId: $id) query and confirming that the target symbol claimed in the LLM-extracted title appears in the canonical linkedTargets list. Numerical score comparison is deferred to a follow-on once the OT GraphQL exposes per-drug per-disease per-target scores in a single hop.
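Both structured checks reduce to short predicates. The sketch below is illustrative only: the function names are hypothetical, the tolerance boundary is assumed to be inclusive, and the response shape used for the Open Targets check (linkedTargets holding rows with an approvedSymbol field) is an assumption about the GraphQL payload rather than something this changelog specifies.

```python
def count_within_tolerance(claimed: int, observed: int) -> bool:
    """True when the live openFDA total is within max(5, 10 percent of the claimed count)."""
    tolerance = max(5, claimed / 10)
    return abs(observed - claimed) <= tolerance  # assumption: the boundary is inclusive


def target_in_linked_targets(symbol: str, drug_record: dict) -> bool:
    """True when the claimed target symbol appears among the drug's linked targets."""
    rows = (drug_record.get("linkedTargets") or {}).get("rows") or []
    wanted = symbol.strip().upper()
    return any((row.get("approvedSymbol") or "").upper() == wanted for row in rows)
```

The floor of 5 matters for small counts: a claimed count of 20 tolerates a drift of 5 reports, where a pure 10 percent rule would allow only 2.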
The scripts/export-sources-for-audit.py export was extended in the same commit to include key_finding_excerpt and primary_endpoint_text; both columns were already present in the live database per migration 041 but had not been pulled into the audit snapshot. After the re-export and the two verifier runs, the two new disclosure blocks on /about/external-references → 01d switch from “tooling shipped, awaiting first run” to live numbers (audited row count, sentences flagged, flag rate for Phase 2a; status breakdown by source_type for Phase 2b). Phase 3 (prompt hardening that forbids citation generation outside the Phase 1 manifest) remains Planned.
Naming convention going forward: the source_type field value in the database stays as the literal string faers (the schema enum value, unchanged), but all user-facing prose on the site uses AEMS to reflect the FDA's rename of the public dashboard from FAERS (FDA Adverse Event Reporting System) to AEMS (FDA Adverse Event Monitoring System). FAERS appears in the methodology only when describing the literal database field value, in code formatting. A small sweep of recent prose on the v3.9 and v3.10 changelog entries and the external-references disclosure block landed in the same commit.
OT-DRUGNAME backfill: the 10 active-signal Open Targets source rows flagged in v3.10 as storing a synthetic OT-{DRUGNAME} shorthand in sources.external_id have been backfilled to the canonical CHEMBL identifier for each drug. CHEMBL IDs were resolved via the Open Targets GraphQL search(entityNames: ["drug"]) endpoint and independently verified through the same drug(chemblId: $id) query that the audit verifier calls. The 10 mappings: APREPITANT → CHEMBL1471, DESVENLAFAXINE → CHEMBL1118, ENZALUTAMIDE → CHEMBL1082407, TRIMEBUTINE → CHEMBL190044, TASIMELTEON → CHEMBL2103822, TRADIPITANT → CHEMBL3544984, OLAPARIB → CHEMBL521686, FOSNETUPITANT → CHEMBL3989917, MILNACIPRAN → CHEMBL259209, and TRIIODOTHYRONINE (liothyronine) → CHEMBL1544. For each drug, the top search hit is the base compound; salt and prodrug forms (e.g. DESVENLAFAXINE SUCCINATE, LIOTHYRONINE SODIUM) appeared as alternates and were rejected so the backfilled IDs match the convention of the 38 existing canonical Open Targets rows in the sources table, which store base CHEMBL IDs rather than salt-form variants.
The backfill ships as Supabase migration supabase/migrations/044_backfill_ot_drugname_to_chembl.sql. Each of the ten UPDATE statements targets one specific source row by its UUID id AND its expected old external_id, so the migration is a no-op on any row that has already been touched and safe to re-run. The migration updates two columns on each row: external_id from OT-{DRUGNAME} to the canonical CHEMBL ID, and url from a platform.opentargets.org/disease/{GO_or_EFO} link to the matching platform.opentargets.org/drug/CHEMBL{id} link, matching the URL shape of the 38 existing canonical rows. Title and key-finding columns are unchanged.
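The re-run safety of that double guard can be demonstrated in miniature. The sketch uses an in-memory SQLite table as a stand-in for the Supabase (PostgreSQL) sources table; the row id and the disease-URL placeholder are hypothetical, and only the aprepitant mapping is taken from the entry above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sources (id TEXT PRIMARY KEY, external_id TEXT, url TEXT)")
conn.execute(
    "INSERT INTO sources VALUES (?, ?, ?)",
    # "row-1" stands in for the real UUID; the disease ID is left as a placeholder.
    ("row-1", "OT-APREPITANT", "https://platform.opentargets.org/disease/{GO_or_EFO}"),
)

# Guard on both the row id and the expected old value of external_id.
BACKFILL = """
UPDATE sources
SET external_id = :new_id, url = :new_url
WHERE id = :row_id AND external_id = :old_id
"""
params = {
    "row_id": "row-1",
    "old_id": "OT-APREPITANT",
    "new_id": "CHEMBL1471",
    "new_url": "https://platform.opentargets.org/drug/CHEMBL1471",
}

first_run = conn.execute(BACKFILL, params).rowcount   # one row rewritten
second_run = conn.execute(BACKFILL, params).rowcount  # guard no longer matches
print(first_run, second_run)
```

On the second execution the external_id no longer equals the old shorthand, so the WHERE clause matches nothing and the statement leaves the row untouched.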
User-visible effect on /conditions/[slug] drug cards: the source-attribution chip rendered alongside each affected Open Targets signal changes from a synthetic label (e.g. “OT-APREPITANT”) to the canonical CHEMBL identifier (“CHEMBL1471”), and the outbound link goes to the drug page rather than the disease page. The disease page showed every drug for the condition; the drug page shows every disease for the drug. Both are valid Open Targets surfaces; the drug page is the more useful destination for a citation that is evidencing a specific drug-condition pair.
Expected audit shift after the migration runs: in the next scripts/verify-database-sources.py report, the opentargets resolved_match count rises from 38 to 48, the unresolved count for opentargets drops from 10 to 0, and the headline summary on /about/external-references → 01d shifts from “170 resolved_match / 10 unresolved” to “180 resolved_match / 0 unresolved.” The Roadmap row Backfill canonical Open Targets identifiers on signals using OT-DRUGNAME shorthand flips from Planned to Live; the v3.10 architectural-debt finding is closed.
First database-sources audit run. The Path C Phase 1 tooling that shipped in v3.9 was executed on the live Whel sources table: 2,166 source rows across all active signals, audited row by row against the canonical external source for each identifier type. Result: 170 fully resolved with matching metadata, 1,986 format-only passes (AEMS dashboard URLs and Reddit permalinks that pass the well-formed-URL pattern but cannot be resolved further because neither publisher exposes a record-lookup API), and 10 unresolved. There were zero resolved_mismatch entries: every PubMed PMID, every ClinicalTrials.gov NCT ID, and every canonical Open Targets identifier that resolved did so with a stored title matching the canonical title within the 0.80 fuzzy threshold. That is a strong positive signal that the LLM extraction pipeline is producing accurate metadata for the identifier types where canonical metadata is checkable: 113 of 113 PubMed rows clean, 19 of 19 ClinicalTrials.gov rows clean, 38 of 38 canonical Open Targets rows clean.
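A hedged sketch of what a title comparison at the 0.80 threshold might look like. The changelog does not name the fuzzy-matching algorithm, so the use of difflib's SequenceMatcher and the normalization step are assumptions made for illustration; only the threshold value comes from the entry above.

```python
from difflib import SequenceMatcher

TITLE_THRESHOLD = 0.80  # fuzzy threshold named in this entry


def normalize(title: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing period."""
    return " ".join(title.lower().split()).rstrip(".")


def titles_match(stored: str, canonical: str, threshold: float = TITLE_THRESHOLD) -> bool:
    """True when the stored title is close enough to the canonical title."""
    ratio = SequenceMatcher(None, normalize(stored), normalize(canonical)).ratio()
    return ratio >= threshold
```

Normalizing first keeps trivial differences (case, a publisher's trailing period, doubled spaces) from consuming the tolerance that the threshold is meant to reserve for real wording drift.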
The 10 unresolved entries are all Open Targets rows where the external_id column stores a synthetic shorthand of the form OT-{DRUGNAME} rather than a canonical Open Targets identifier (CHEMBL ID, ENSEMBL gene ID, or EFO / MONDO disease ID). The ten drug names are: APREPITANT, DESVENLAFAXINE, ENZALUTAMIDE, TRIMEBUTINE, TASIMELTEON, TRADIPITANT, OLAPARIB, FOSNETUPITANT, MILNACIPRAN, and TRIIODOTHYRONINE (liothyronine). The Open Targets GraphQL search correctly does not resolve these because they are not Open Targets identifiers. However, the url column on these rows points at a real platform.opentargets.org page (with a real GO / EFO disease ontology ID), and the descriptive title in the title column carries the actual drug-target-disease finding (e.g. “Aprepitant, genetic_target_overlap for menopause (target: TACR1, OT score: 0.482)”). So users see a valid citation that links to a real Open Targets page; the failure is at the identifier-storage layer, not the user-visible content layer. Same shape as the Bate finding in v3.8 (real underlying citation, mangled metadata).
The 10 entries are kept in the audit as unresolved rather than patched into format_only_pass so the architectural debt stays visible. A new Roadmap row records the fix: backfill the canonical Open Targets identifier (CHEMBL for the drug, plus the specific target ENSEMBL ID and the disease MONDO/EFO ID that the URL already points at) on each of the 10 signals. The fix would reduce the unresolved count to zero on the next audit run without changing what is rendered to users. Recorded on the Roadmap under Backfill canonical Open Targets identifiers on signals using OT-DRUGNAME shorthand.
Where Path C Phase 1 now stands: the manifest audit covers 22 hand-written prose references and the database-sources audit covers 2,166 live database rows. Both run on demand from scripts/verify-citations.py and scripts/verify-database-sources.py, both gated by --strict, and both surface their results on /about/external-references → 01d. Phase 1 is now complete for the existing citation surface. Phase 2 (sentence-level summary grounding via Sentence-BERT) and Phase 3 (prompt hardening that forbids LLM citation generation outside the Phase 1 manifest) remain Planned.
Phase 1 audit scope expanded in two directions: the pre-verified citation manifest now covers the featured-page references, and the tooling for auditing the live database sources table shipped. Together these close the gap between “hand-written prose citations are audited” (v3.8) and “every citation rendered to a user can be audited.” The featured-page expansion added eight references to the manifest: six PMIDs from /featured (Jung & Brubaker 2019, Lethaby et al. 2016, NAMS 2020, Raz & Stamm 1993, Perrotta et al. 2008, Anger et al. 2022) and two PMIDs from /featured/anastrozole-endometriosis (the two PMC-linked systematic reviews). After the expansion, the verifier ran on 22 entries.
First run flagged a real author misattribution on the live anastrozole-endometriosis featured page. The page cited “Nawathe et al., 2011” pointing at PMC3141646, which converts to PMID 21693038. PubMed esummary returns that PMID as Ferrero S, Gillott DJ, Venturini PL & Remorgida V 2011, “Use of aromatase inhibitors to treat endometriosis-related pain symptoms: a systematic review” (Reproductive Biology and Endocrinology). The journal and year were correct, the description on the featured page matched the Ferrero paper exactly, but the author attribution was simply wrong. The featured page and the manifest were both corrected to attribute Ferrero rather than Nawathe. The second featured-page reference (the 2023 systematic review of systematic reviews in Drug Design, Development and Therapy) was unattributed in the original copy; PubMed esummary returned Peitsidis P as first author, and the featured page was updated to read “Peitsidis P et al. 2023” with the canonical title. A small verifier patch landed alongside: corporate-authored guideline papers like NAMS 2020 GSM return an empty first-author surname from PubMed (NAMS Editorial Panel is a corporate author, not a personal one), and the surname-equality check now treats two explicitly-empty surnames as a valid match for that case.
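The corporate-author patch can be sketched as follows. This is one reading of "explicitly-empty", under the assumption that a missing value (None) is distinct from an empty string; the function name is hypothetical and the real check in scripts/verify-citations.py may differ.

```python
from typing import Optional


def surnames_match(claimed: Optional[str], canonical: Optional[str]) -> bool:
    """Compare first-author surnames, allowing the corporate-author case."""
    if claimed is None or canonical is None:
        return False  # assumption: a missing value is not an explicitly empty one
    a = claimed.strip().casefold()
    b = canonical.strip().casefold()
    if a == "" and b == "":
        return True   # both sides explicitly empty, e.g. a guideline with a corporate author
    return a == b
```

Keeping None separate from the empty string means the relaxation applies only when the manifest deliberately records that there is no personal first author, not when the field was simply never filled in.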
Tooling for the live database-sources audit shipped at the same time, as two new artifacts. The export script at scripts/export-sources-for-audit.py queries the Whel database's repurposing_signals and sources tables and dumps every active-signal source row to lib/sources-audit-snapshot.json. The verifier at scripts/verify-database-sources.py audits the snapshot row by row against the canonical source for each identifier type: PMIDs against NCBI E-utilities, NCT IDs against ClinicalTrials.gov API v2, Open Targets identifiers against the Open Targets GraphQL API, and AEMS and Reddit URLs against format checks (URLs are well-formed, point at the correct host, include the expected path segments). Output lands in scripts/audit-output/database-sources-audit-report.json and lib/database-sources-audit-snapshot.json, which the disclosure on /about/external-references → 01d reads from. Until the export runs and the snapshot is committed, the disclosure shows the “tooling shipped, awaiting first run” block instead of live numbers, which is the honest state.
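A format check of the kind described for AEMS and Reddit URLs might look like the sketch below. The host names and path segments in the rule table are assumptions chosen for illustration (the verifier's actual rules are not listed in this changelog), and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical expectations per source_type; the real rules live in the verifier.
EXPECTED = {
    "reddit": {"hosts": {"www.reddit.com", "reddit.com"}, "segments": ("r", "comments")},
    "faers": {"hosts": {"api.fda.gov"}, "segments": ("drug", "event.json")},
}


def format_check(source_type: str, url: str) -> bool:
    """True when the URL is https, on an expected host, with the expected path segments."""
    rule = EXPECTED.get(source_type)
    if rule is None:
        return False
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname not in rule["hosts"]:
        return False
    parts = [p for p in parsed.path.split("/") if p]
    return all(segment in parts for segment in rule["segments"])
```

A pass here says only that the URL has the right shape; it does not confirm that the post or query still resolves, which is why these rows are reported as format-only rather than resolved.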
What is not yet audited: the export step requires Supabase credentials, which are available only locally, so the database-sources audit cannot complete in this transparency cycle until the export runs. The next methodology entry will record the first run's numbers and any findings. After that, both the manifest audit and the database-sources audit run on every push; both are gated by the same --strict convention.
Path C Phase 1 (citation validation) goes live as code. The manual audit that produced the v3.7 entry was the prototype; v3.8 ships the engineered version. The implementation has three artifacts. A structured pre-verified reference list at lib/whel-citations.json records every external citation that appears on a public surface, with its identifiers and the metadata claimed by the citing surface. A verifier script at scripts/verify-citations.py resolves every PMID against the NCBI E-utilities esummary endpoint, every DOI against the Crossref REST API works/{doi} endpoint, and every arXiv ID against the arXiv API, then compares returned canonical metadata (title, first-author surname, container title, year) against the claims in the manifest using fuzzy match with calibrated thresholds. Output is written to two sinks: the human-readable run log at scripts/audit-output/citation-audit-report.json and a site-imported sidecar at lib/citation-audit-snapshot.json that the external-references disclosure reads from. A --strict flag exits non-zero on any unresolved or mismatched entry and is wired for pre-publish use in CI.
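The --strict convention can be sketched with argparse. The status strings and the way results are passed in are assumptions for illustration; the real verifier loads the manifest and resolves each entry before reaching this step.

```python
import argparse
from typing import Dict, List, Optional

# Assumption: status names that should fail a strict run.
FAILING = {"unresolved", "resolved_mismatch"}


def exit_code(results: List[Dict[str, str]], strict: bool) -> int:
    """Return 1 only when strict mode is on and at least one entry failed."""
    failed = [r for r in results if r.get("status") in FAILING]
    return 1 if strict and failed else 0


def main(argv: Optional[List[str]] = None,
         results: Optional[List[Dict[str, str]]] = None) -> int:
    parser = argparse.ArgumentParser(description="Citation audit (illustrative sketch)")
    parser.add_argument("--strict", action="store_true",
                        help="exit non-zero on any unresolved or mismatched entry")
    args = parser.parse_args(argv)
    return exit_code(results or [], args.strict)
```

A script built this way would end with sys.exit(main()), so a CI step fails on a non-zero code while a local run without --strict still produces the report.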
The first official run on June 7, 2026 found that the audit script catches exactly the failure modes it was built to catch. Five real issues surfaced in the initial manifest, all of which had previously slipped past the manual review process that produced the v3.7 cleanup. One was a wrong title attached to a real DOI: the Bate & Evans 2009 reference in the methods PDF cited “Quantitative methods for pharmacovigilance signal detection” but the canonical title for DOI 10.1002/pds.1742 is “Quantitative signal detection using spontaneous ADR reporting.” The DOI resolved, the authors matched, the year matched, but the title was wrong. This is the exact failure mode (real identifier, mangled metadata) that Phase 1 was built to catch and that pure identifier resolution misses. Three further issues were epub-versus-journal-issue year mismatches on Ma 2023 KGML-xDTD, Pushpakom 2019 Drug repurposing, and Ochoa 2023 Open Targets, all of which are real and widely accepted citations but whose Crossref records show the epub year while site copy and the methods PDF use the journal-issue year. The verifier was relaxed to accept a 1-year tolerance for this real-world citation noise, with a comment in the script explaining why. The fifth issue was a Crossref-side metadata gap on the Zunzunegui Sanz bioRxiv DOI; the verifier was patched to fall back to the canonical container name when the DOI prefix identifies the work as a bioRxiv preprint.
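The two verifier relaxations can be sketched as small helpers. The function names are hypothetical, and treating the 10.1101 DOI prefix as bioRxiv is a simplification: that prefix belongs to Cold Spring Harbor Laboratory and is shared with other content, so a real check would need more than the prefix.

```python
from typing import Optional

YEAR_TOLERANCE = 1  # absorbs epub-year versus journal-issue-year differences


def years_match(claimed: int, canonical: int, tolerance: int = YEAR_TOLERANCE) -> bool:
    """True when the claimed year is within the tolerance of the canonical year."""
    return abs(claimed - canonical) <= tolerance


def container_title(doi: str, crossref_container: Optional[str]) -> Optional[str]:
    """Prefer Crossref's container title; fall back to a name implied by the DOI prefix."""
    if crossref_container:
        return crossref_container
    if doi.lower().startswith("10.1101/"):
        return "bioRxiv"  # simplification: this prefix is not exclusive to bioRxiv
    return None
```

Applying the fallback only when Crossref returns no container keeps the canonical record authoritative whenever it exists.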
After the manifest and verifier patches landed, the re-run cleared all 14 entries (resolved + match for every citation). The live audit numbers are now surfaced on /about/external-references → 01d replacing the “what the disclosure will display when shipped” placeholder, alongside the failure modes the first run caught. Phase 2 (sentence-level summary grounding via Sentence-BERT) and Phase 3 (prompt hardening so the LLM can only cite from the pre-verified manifest) remain Planned; the disclosure surface now explicitly distinguishes Phase 1 results from what Phases 2 and 3 will add. The strict-mode CI gate is in place but not yet wired to the deploy pipeline; that wiring is a separate small step recorded on the Roadmap.
Site-wide citation audit. Five citations recorded in earlier methodology entries and on the public site were flagged in external review as either misattributed or post-knowledge-cutoff fabrications generated by the LLM scoring layer. Three did not correspond to real papers as written (Gong et al. 2026 “Reference fabrication in biomedical large language models” in Bioengineering; Li et al. 2025 on knowledge-guided prompting in IEEE J. Biomed. Health Inform.; Zong et al. 2026 “EvidenceNet” as a paper title rather than a dataset name). Two were partially fabricated with real underlying identifiers (KGML-xDTD was attributed to Fajgenbaum et al. 2024 Lancet Haematology rather than the actual Ma, Zhou, Liu & Koslicki 2023 GigaScience paper; Zunzunegui Sanz et al. 2025 had a real bioRxiv DOI but the wrong title and author list).
Each flagged citation was verified against canonical external sources (NCBI E-utilities for PMIDs, Crossref REST API for DOIs, arXiv for arXiv identifiers) and replaced with one of three outcomes per the audit policy. Where a real underlying paper exists with mangled bibliographic details, the citation was updated to the canonical record: KGML-xDTD now cites Ma, Zhou, Liu & Koslicki (GigaScience 2023, doi:10.1093/gigascience/giad057); Zong et al. 2026 now cites the real arXiv paper at arXiv:2603.28325 with its correct title and author list (EvidenceNet is the dataset name introduced in the paper, not the paper title); Zunzunegui Sanz et al. 2025 now cites the correct title from the real bioRxiv record at doi:10.1101/2025.06.13.659527. Where the citation did not correspond to a real paper, it was either dropped and the surrounding claim rewritten, or replaced with a real substitute paper that supports the same claim. The fabricated “47 to 55 percent biomedical LLM reference fabrication rate” citation (Gong 2026) was replaced with the real Bhattacharyya et al. 2023 study in Cureus (PMID 37337480, doi:10.7759/cureus.39238), which examined 115 references across 30 ChatGPT-generated medical papers and reported 47 percent fully fabricated, 46 percent authentic but with bibliographic errors, and 7 percent fully accurate; the supporting Gravel, D'Amours-Gravel & Osmanlliu 2023 study (Mayo Clin Proc Digit Health, doi:10.1016/j.mcpdig.2023.05.004) was added alongside it. The Li et al. 2025 knowledge-guided prompting citation was dropped without substitution; the underlying concept is well-supported in the broader biomedical NLP literature without requiring a single named source.
Fixes were applied to every public surface where the flagged citations appeared: this changelog page, the methodology page section 04, the external-references page sections 01c and 01d, the roadmap REGISTER rows for the MATRIX cross-reference and Path A, B, and C, and both sources for the methods PDF (docs/methods-draft.md and docs/methods-print.html). The methods PDF binary at public/whel-methods-v0.1.pdf was version-bumped in the HTML source to v0.2 (June 2026, citation audit); the regenerated PDF binary will land in public/ in a follow-up step. The audit was conducted by manual lookup, which is the same procedure that Path C Phase 1 will run as code once the citation-validation pipeline ships; Path C Phase 3 (prompt hardening that forbids citation generation outside a pre-verified reference list) is what prevents this failure mode going forward.
LLM output validation strategy made explicit. The structured grounding layers in v3.4 (Path A and Path B, recorded in section 01c on /about/external-references) constrain what data the LLM works with. A separate failure surface applies to what the LLM produces as output. Three failure modes are documented in the literature and apply to Whel's specific pipeline: per-source extraction misclassification (an LLM that reads a PubMed abstract and assigns the wrong study type, wrong direction of effect, or hallucinates mechanism details not present in the source); summary drift (an LLM-written summary that extends beyond what the source actually says, the risk pattern documented by Bhattacharyya et al. 2023 (Cureus, doi:10.7759/cureus.39238) applied to Whel's task); and citation fabrication or misattribution in long-form prose Whel publishes (featured signal walkthroughs, the methods PDF, written drafts), where the LLM is asked to generate references rather than classify ones it was given.
Whel's response is a three-part output validation pipeline, recorded as Path C on the Roadmap. Phase 1 is a citation validation step that resolves every PMID against NCBI E-utilities and every DOI against the Crossref REST API, returning the canonical title, authors, journal, and year, and comparing those against the LLM-claimed metadata. References that fail to resolve or whose returned metadata does not match the LLM's claims are blocked from publication. Phase 2 is sentence-level summary grounding using a sentence-transformer model (Sentence-BERT or equivalent) to compute the cosine similarity between each sentence in an LLM-generated summary and the source abstract. Sentences that fall below a calibrated similarity threshold are flagged as “not directly supported by the source” and either suppressed or surfaced with that marker on the signal card. Phase 3 is prompt hardening for any LLM-generated long-form prose that ships to users. The hardened prompt forbids citation generation (the LLM may only cite from a pre-verified reference list provided to it), forbids numerical claims unless they appear verbatim in the input context, and requires the LLM to produce, alongside the text, a sentence-by-sentence list of supporting input sources that Phase 1 then checks.
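The Phase 3 rule on numerical claims lends itself to a mechanical post-check. The sketch below is one possible form of that check, not part of the planned design: the regular expression and the function name are assumptions, and "verbatim" is read strictly, so 47 and 47% count as different tokens.

```python
import re
from typing import List

# Matches integers, decimals, and comma-grouped numbers, with an optional percent sign.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def unsupported_numbers(generated: str, context: str) -> List[str]:
    """Return numeric tokens in the generated text that never appear in the input context."""
    allowed = set(NUMBER.findall(context))
    return [token for token in NUMBER.findall(generated) if token not in allowed]
```

An empty result means every number in the prose can be traced to the supplied context; any returned token is a candidate for suppression or human review.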
A fourth strategy in the broader literature, multi-sample consistency checking through re-querying the model, was considered and deferred. The cost (three to five times the Claude API spend) is not justified by the marginal gain on Whel's constrained extraction task, and Phase 2 grounding addresses the same failure modes more cheaply. The deferred entry is recorded here so a future decision to revisit it has the design history available.
Path C is distinct from Path A and Path B. A and B ground the LLM's inputs (canonical ontologies for entity resolution; a domain-restricted knowledge graph for scoring-time context). C validates the LLM's outputs (citations, summary statements, published prose). They are complementary layers in the same overall pipeline architecture and are designed to ship in parallel rather than sequentially. The Path C disclosure surface lives in section 01d on /about/external-references.
MATRIX cross-reference reaches per-signal display. The Every Cure MATRIX coverage layer (live since v3.1) was previously surfaced only as an aggregate audit on the external-references page (compound match rate, per-condition counts, score distribution). Per-pair MATRIX scores are now surfaced on each signal card on /conditions/[slug] pages as a “MATRIX · Top N%” chip alongside the L-grade chip and the tier chip, where the percentile is MATRIX's own quantile rank across its roughly 39.5 million drug–disease predictions. Per-pair scores are sourced from a new public snapshot at lib/matrix-pair-scores-snapshot.json, extracted from the same audit report that produces the aggregate snapshot. 176 of 271 active compound–condition pairs in the current audit run have a MATRIX score and now show the chip; 95 pairs are “matrix silent” (compound not in MATRIX's drug list, or score below MATRIX's publication threshold) and correctly show no chip.
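How a chip label could be derived from a rank is sketched below. Everything about the input is an assumption: the snapshot may store a quantile rather than a rank position, and the rounding rule (round up to a whole percent) is chosen for illustration. Only the label format and the no-chip behavior for silent pairs come from the entry above.

```python
from typing import Optional


def matrix_chip(rank: Optional[int], total: int) -> Optional[str]:
    """Build the chip label from a rank position (1 = highest score), or None if silent."""
    if rank is None:
        return None  # "matrix silent": no score, so no chip is rendered
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    top_percent = (100 * rank + total - 1) // total  # integer ceiling of 100 * rank / total
    return f"MATRIX · Top {top_percent}%"
```

Integer arithmetic is used for the ceiling so that a pair sitting exactly on a percentile boundary is not pushed into the next bucket by floating-point error.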
The external-references coverage disclosure at /about/external-references → 01b was extended with a “How to read these numbers” explainer card that defines both MATRIX values in Every Cure's own framing (treatment-probability prediction from a model trained on a biomedical knowledge graph), explains what “Top N%” does and does not say, quotes Every Cure's “research use only” disclaimer verbatim, and explains why Whel surfaces an independent ML layer beside its own literature-driven grades. The chip tooltip uses the same treatment-probability framing for hover-state consistency. No change to Whel's rubric, sample, or tier definitions; MATRIX remains separated from Whel's grades rather than blended into them.
Structured grounding strategy made explicit. Whel's evidence extraction and scoring layer is built on Claude Opus 4.6, a large language model. WHBench, an independent 2026 benchmark of frontier LLMs on women's health clinical questions (Maurya, Saboo & Kumar, 2026, arXiv:2604.00024), found that no model in its 22-model lineup exceeded 75% on the 23-criterion rubric, with the top model fully correct in only 35.5% of scenarios. The failure pattern is systematic rather than random: universal blind spots in social determinants of health (0.7%–19.1% across all 22 models), wide variation in safety harm rates within the top tier, and persistent gaps in completeness on follow-up timelines and monitoring plans. Empirical work on medical LLM reference fabrication documents high error rates: Bhattacharyya, Miller, Bhattacharyya & Miller 2023 (Cureus, doi:10.7759/cureus.39238, PMID 37337480) examined 115 references across 30 ChatGPT-generated medical papers and found 47% fully fabricated, 46% authentic but with bibliographic errors, and only 7% authentic and accurate; Gravel, D'Amours-Gravel & Osmanlliu 2023 (Mayo Clin Proc Digit Health, doi:10.1016/j.mcpdig.2023.05.004) reported a similar pattern of fabricated and inaccurate citations in ChatGPT-generated medical content. The hybrid-architecture literature on combining structured biomedical knowledge with LLM extraction (Zong, Lv, Xue, Zheng, Wan & Zhang 2026, arXiv:2603.28325, which introduces the EvidenceNet dataset; Zunzunegui Sanz, Otero-Carrasco & Rodríguez-González 2025 on LLM-assisted drug-repurposing hypothesis validation, bioRxiv, doi:10.1101/2025.06.13.659527) shows that adding structured external knowledge on top of LLM extraction improves accuracy and interpretability.
Whel's response, recorded as two roadmap items, is to add two structured grounding layers on top of the existing LLM pipeline without replacing it. These are architectural additions to the pipeline rather than post-hoc validation: they change what data lands in the database and how the LLM arrives at its scoring. Path A is ontology-grounded entity resolution: every compound and condition the LLM extracts is resolved against canonical biomedical registries (ChEMBL or DrugBank for compounds, MONDO for conditions), rewritten with the registry's standard identifier, and enriched with the structured metadata that resolution returns (drug class, ATC code, known targets; ontology lineage) before being written to the database. Entities that fail to resolve are flagged for human review rather than silently stored. This addresses the structured-output hallucination class of error directly and also moves the data Whel stores from free-text strings to canonical identifiers with structured metadata. Path B is knowledge-graph grounding, built using the BioCypher framework (Lobentanzer et al., Nature Biotechnology 2023), restricted to Whel's six conditions and active-signal compounds. The knowledge graph informs the LLM at prompt time, following the knowledge-augmented prompting pattern documented in the biomedical NLP literature: mechanistic paths drawn from the subgraph relevant to a given signal are included as structured context during scoring, reducing the model's reliance on parametric memory alone. The graph also surfaces beside each signal as a disclosure layer (“graph supports” or “graph silent”) in the same shape as the existing MATRIX cross-reference at /about/external-references.
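The resolve-or-flag behaviour described for Path A can be sketched as follows. This is a minimal illustration under stated assumptions, not Whel's implementation: the in-memory `TOY_REGISTRY`, the `ResolvedEntity` record, the function names, and the `EXAMPLE:` identifiers are all hypothetical stand-ins for a real ChEMBL, DrugBank, or MONDO lookup.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-in for a registry lookup (ChEMBL/DrugBank for compounds,
# MONDO for conditions). Identifiers are placeholders, not real registry IDs.
TOY_REGISTRY = {
    ("compound", "example compound"): {
        "id": "EXAMPLE:C0001",
        "metadata": {"drug_class": "example class"},
    },
    ("condition", "endometriosis"): {
        "id": "EXAMPLE:D0001",
        "metadata": {"lineage": ["example parent term"]},
    },
}


@dataclass
class ResolvedEntity:
    kind: str                          # "compound" or "condition"
    raw_text: str                      # string exactly as the LLM extracted it
    canonical_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)
    needs_review: bool = False         # True when no registry match was found


def resolve_entity(kind: str, raw_text: str, registry: dict = TOY_REGISTRY) -> ResolvedEntity:
    """Resolve one extracted entity, or flag it for human review."""
    hit = registry.get((kind, raw_text.strip().lower()))
    if hit is None:
        # Unresolved entities are flagged rather than silently stored.
        return ResolvedEntity(kind, raw_text, needs_review=True)
    return ResolvedEntity(kind, raw_text, hit["id"], dict(hit["metadata"]))


def partition_for_storage(entities):
    """Split entities into those safe to write and those needing review."""
    to_store = [e for e in entities if not e.needs_review]
    to_review = [e for e in entities if e.needs_review]
    return to_store, to_review


resolved = resolve_entity("condition", " Endometriosis ")
unresolved = resolve_entity("compound", "unrecognised string")
to_store, to_review = partition_for_storage([resolved, unresolved])
```

In this sketch only `to_store` would reach the database, carrying the canonical identifier and metadata, while `to_review` would feed a human worklist. A production resolver would also need synonym handling and fuzzy matching, which this example omits.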
Whel will not train a custom graph neural network for drug-condition link prediction. The platform consumes machine learning (Claude Opus 4.6 for extraction and scoring, MATRIX scores from Every Cure as an external disclosure layer, where MATRIX builds on the KGML-xDTD graph-ML framework of Ma, Zhou, Liu & Koslicki, GigaScience 2023) but does not develop its own ML models. The knowledge-graph plus graph-neural-network prediction direction (TxGNN; Huang et al. 2024, Nature Medicine) is acknowledged as state-of-the-art for global drug repurposing prediction but is out of scope for an evidence index focused on women's hormonal health, where the value proposition is provenance and interpretability rather than throughput.
Source-coverage philosophy made explicit. The five automated pipelines (PubMed, ClinicalTrials.gov, FDA AEMS, Open Targets, Reddit) ingest representative sources per compound–condition pair through condition-keyed Boolean searches with publication-date and article-type filters, rather than exhaustive enumeration of every paper in the literature. For under-researched conditions this is a reasonable approximation of the available evidence base. For well-studied compound–condition pairs it surfaces synthesis papers (reviews, position statements, society guidelines) and may leave the original RCTs cited inside them outside the indexed sources. The L0–L3 grade carries the independent-corroboration question as a separate layer. A planned manual-curation extension, documented in the Roadmap register as “Manual primary-source curation pass,” will close the gap on high-evidence signals through the same human-in-the-loop worklist pattern that produced the L3 grades. The featured-signal walkthrough on /featured already documents this gap in prose for the one signal it covers, in Section 04 “Literature Whel did not ingest.”
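To make the idea of a condition-keyed Boolean search with publication-date and article-type filters concrete, here is a minimal sketch of how such a query string might be assembled for PubMed. The function name, its parameters, and the example values are assumptions for illustration; Whel's actual search strings are not given on this page. The `[Title/Abstract]`, `[dp]`, and `[pt]` suffixes are standard PubMed search field tags.

```python
def build_pubmed_query(compound, condition, year_from, year_to, article_types):
    """Assemble a Boolean PubMed query for one compound-condition pair.

    Hypothetical helper: the filter choices here are illustrative,
    not Whel's published search procedure.
    """
    if year_from > year_to:
        raise ValueError("year_from must not be later than year_to")

    clauses = [
        f'"{compound}"[Title/Abstract]',
        f'"{condition}"[Title/Abstract]',
        f"{year_from}:{year_to}[dp]",      # publication-date range filter
    ]
    if article_types:
        # Article-type filter: any one of the listed publication types.
        types = " OR ".join(f'"{t}"[pt]' for t in article_types)
        clauses.append(f"({types})")
    return " AND ".join(clauses)


query = build_pubmed_query(
    "example compound",
    "endometriosis",
    2015,
    2025,
    ["Review", "Randomized Controlled Trial"],
)
```

A search built this way returns a representative slice of the literature rather than every paper, which is why a review matched by the article-type filter can be indexed while the RCTs it cites are not.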
The external-evidence rubric (L0 / L1 / L2 / L3) is now codified in a schema-versioned sidecar at lib/literature-grade-rubric.json and surfaced on this page as a collapsible block in Section 04. v3.2 records the search procedure per source (PubMed, ClinicalTrials.gov, Cochrane, named guideline bodies), inclusion criteria and boundary rules at every level transition, source-attribution requirements per L assignment, and the conflict-resolution rule used when two reviewers disagree. No change to the sample, the comparators, or the pre-specified thresholds; the tightening makes the L assignment behind any signal reproducible against the printed rules, which the v3.1 page implied but did not pin down.
Every Cure's MATRIX dataset is now surfaced as an independent biological-plausibility layer beside Whel's grades wherever MATRIX has coverage; it is not blended into the grades. A reproducible audit of MATRIX coverage over Whel's active compound–condition universe was run and published on this site (85.7% adjusted compound match rate, six of six conditions confirmed, full per-condition breakdown and dataset SHAs at /about/external-references → 01b · Coverage disclosure). No change to Whel's rubric, sample, or tier definitions.
v3 records the close of an independent external review covering two findings. C1 (replication-score drift in the LLM rater): the rater prompts in all four pipelines were tightened to enforce literal source counting per the published rubric; 14 signals were downgraded to the tier the literature actually supports; 19 manually-verified PubMed citations were added so each remaining Moderate-tier signal carries the source count the strict rubric requires. S3 (ClinicalTrials.gov citation/condition mismatches across 21 audit rows): 10 signals were deactivated, 5 were reassigned from clinical-trial-finding to cross-condition framing, 1 source was dropped where the signal retained independent support, 2 sources were replaced with proper condition-specific citations (ESHRE 2022 endometriosis guideline; 2025 network meta-analysis of hormone therapies for adenomyosis pain), and 1 row was documented as a ClinicalTrials.gov API field limitation. Recorded in database migrations 036 through 040. Planned extensions, including external cross-reference to Every Cure MATRIX scores and a cross-arm concordance flag, are documented on the Roadmap page.
Named an external clinician-researcher as the primary rater in place of the project lead. The sample, the rubric, the external comparators, and the pre-specified thresholds are unchanged from v1. Sample numbers reflect the Whel database snapshot at time of publication. Updates to this page will be versioned and dated.
The methodology page itself records the live pre-registration; the roadmap records planned changes that have not yet shipped.