Diagnostic performance of machine learning models versus established risk stratification for intracranial aneurysm rupture: a systematic review and bivariate meta-analysis.
Patel S., Nischal SA., Chai Y-H., Mendonca A., Kale KM., Castiglione J., Patel P., Gooch R., Tjoumakaris SI., Jabbour P.
BACKGROUND: Machine learning (ML) models have been proposed to improve the discrimination of intracranial aneurysm rupture status beyond established clinical risk stratification tools. However, reported performance is heterogeneous and the relative contribution of model architecture and feature dominance remains unclear.
METHODS: We performed a systematic review and diagnostic meta-analysis, reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy (PRISMA-DTA), of studies evaluating ML models for intracranial aneurysm rupture discrimination. PubMed, Embase and CENTRAL were searched to February 2026. Sensitivity and specificity were pooled using a bivariate random-effects model, with summary receiver operating characteristic curves generated across training, internal testing and external validation datasets. Models were compared with regression-based approaches and Population, Hypertension, Age, Size of aneurysm, Earlier subarachnoid haemorrhage, Site of aneurysm (PHASES) scores. Subgroup and meta-regression analyses explored the associations of algorithm family and feature domain with diagnostic performance.
RESULTS: Sixty-two retrospective cohorts (29 709 patients, 209 models) met the inclusion criteria. In training datasets, pooled sensitivity and specificity for ML were 0.81 (95% CI 0.75 to 0.85) and 0.83 (95% CI 0.80 to 0.86), with an area under the curve (AUC) of 0.878, exceeding PHASES (AUC 0.667). In testing datasets, ML retained higher discrimination (AUC 0.837) than regression models (AUC 0.806) and PHASES (AUC 0.646). In external validation, sensitivity was preserved (0.82), but specificity declined (0.66). Deep learning demonstrated the highest AUCs in both training and testing datasets. Incorporation of haemodynamic or radiomic features improved pooled discrimination relative to morphology alone. There was evidence of small-study effects, and Prediction Model Risk Of Bias Assessment Tool (PROBAST) ratings were mostly unclear.
CONCLUSIONS: ML approaches demonstrate higher pooled discrimination for aneurysm rupture status than conventional risk scores in retrospective datasets, but reduced external validation specificity and heterogeneity limit confidence for clinical translation. Prospective, externally validated, calibrated models are required before integration into routine cerebrovascular risk stratification.

