When does clinical AI actually work?
A clinical AI reports 90% accuracy in trials. Drop it into your hospital and that number shifts. I study when the evidence holds and when it breaks — across operators, settings, and benchmarks.
Pipeline 2026
Publicly released
Same evidence. Two very different numbers.
An AI diagnostic tool reports 90% accuracy across 4,000 cases. That's the average, pooled across every operator. Condition on who is operating the tool and the accuracy shifts.
Operator: Junior attending (skill index = 72)
Gap: 2 percentage points lower than the reported figure.
Numbers in this demo are illustrative — not actual study results.
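A minimal sketch of the arithmetic behind the demo, assuming three hypothetical operator groups whose case counts and accuracies are invented so that they pool to the 90% headline over 4,000 cases. The group names (other than the junior attending), the case split, and the helper names `Stratum`, `pooled_accuracy`, and `gap_in_points` are illustrative assumptions, not study data or an actual implementation of the slider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stratum:
    """One operator group: how many cases it handled and how many were correct."""
    operator: str
    cases: int
    correct: int

    @property
    def accuracy(self) -> float:
        return self.correct / self.cases


# Hypothetical, illustrative strata (not actual study results).
# Chosen so the pooled figure is 3,600 / 4,000 = 90%.
STRATA = [
    Stratum("Senior attending", cases=2000, correct=1880),
    Stratum("Junior attending", cases=1000, correct=880),
    Stratum("Resident", cases=1000, correct=840),
]


def pooled_accuracy(strata: list[Stratum]) -> float:
    """Accuracy over all cases, ignoring who operated the tool."""
    total_cases = sum(s.cases for s in strata)
    total_correct = sum(s.correct for s in strata)
    return total_correct / total_cases


def gap_in_points(stratum: Stratum, strata: list[Stratum]) -> float:
    """Conditional accuracy minus pooled accuracy, in percentage points."""
    return round((stratum.accuracy - pooled_accuracy(strata)) * 100, 1)


if __name__ == "__main__":
    print(f"Pooled accuracy: {pooled_accuracy(STRATA):.1%}")
    for s in STRATA:
        gap = gap_in_points(s, STRATA)
        print(f"{s.operator}: {s.accuracy:.1%} ({gap:+.1f} points vs pooled)")
```

With these made-up counts the junior attending group sits 2 points below the pooled figure, matching the gap shown in the demo, while the other groups sit above and below it. The single headline number hides all three.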
"Most validation studies report one number. The truth has dozens."
Three directions
Direction A
Epistemic drift
AI increasingly produces the evidence we rely on. Measuring and naming what silently drifts.
Direction B
Operator-dependence + LMIC
Surgical and diagnostic trials don't transfer cleanly between settings. Measuring what actually transports.
Direction C
Deployment-validity index
A structured index for whether a piece of clinical AI evidence should be trusted in a given setting.
Work
16 items · 4 groups
Clinical AI & epistemology
When models help produce evidence — measurement and evaluation frameworks.
Structural Validity Without Direction Fidelity: Evaluating LLM-Generated Medical Abstracts Across Model Scales
Target: NEJM AI
Two-paper series, data collection complete (n = 1080 + 720 + 120). OSF prereg X4RP5.
In progress
When Coherence Is No Longer Evidence: Epistemic Drift in AI-Augmented Science
Accountability in Research · Q1
Under review
Epistemic Immunodepression in Clinical Evidence
MetaArXiv · CC-BY 4.0 · 10.31222/osf.io/gqunf_v1
Cross-posted to LessWrong.
Preprint · 2026
Method frameworks & patents
OPERA · CIVER · BRIM — original rubrics in progress.
OPERA — Evidence Framework for Conditional Validity in Operator-Dependent Medicine
Target: Journal of Clinical Epidemiology · Q1
Companion framework to the M2 pilot. v6 draft.
In progress
A pilot of a new clinical evidence framework: conditional validity in operator-dependent research
MetaArXiv · pending DOI
OPERA empirical pilot. κ = 0.965 across 4 paired surgical case studies.
Preprint · 2026
CIVER — Clinical Information Veracity Evaluation Rubric
Vietnam patent · filed 2026-03-27
Patent
BRIM — sister rubric framework
Vietnam patent · filed
Patent
Pediatric & plastic surgery
Anorectal and urogenital malformations, clinical case work.
Diagnostic Imaging of Neonatal Anorectal Malformations: Emphasis on Discordant Cases and Implications for Surgical Planning
Journal of Pediatric Surgery · Q1 · 10.1016/j.jpedsurg.2026.163184
Published · 2026
Staged vs Single-stage Anorectal Malformation Repair — a Meta-analysis
Pediatric Surgery International · Q2
Under review
Transperineal Ultrasound Transferability Between HIC and LMIC ARM Cohorts
Target: Pediatric Radiology
ARM cohort transferability seed paper.
In progress
Frozen Section Biopsy in Hirschsprung Disease
Journal of Indian Association of Pediatric Surgeons · Q3
Under review
Gastric Volvulus in a Pediatric Patient — Case Report
Journal of Surgical Case Reports
Under review
Single-stage Repair of Proximal Hypospadias — Case Series
Urology Case Reports · Q3
Under review
Tiered Surgical Strategy for Posterior Urethral Valves in an LMIC Setting
Target: Journal of Pediatric Urology · Q1
Three-arm cohort (n = 27), 2015–2023.
In progress
Open-source software
CiteCheck · AI for Academic — tools released free to the community.
CiteCheck
Python · PyPI · MIT licence
Reference verification engine. Powers the AI for Academic Paper Checker.
Open source
Want to collaborate or ask a research question?
Book an advisory session →