WHAT’S NEW?
Atrial fibrillation (AF) represents a significant public health issue due to its considerable impact on morbidity and mortality as well as its economic strain on healthcare systems. Nevertheless, tools specifically designed to assess mortality risk in patients with AF are lacking. This study aimed to utilize machine learning methods for identifying pertinent variables and developing an easily applicable prognostic score to predict 1-year mortality in AF patients. By leveraging a large population dataset and employing XGBoost models for predictor screening, the CRAMB (Charlson comorbidity index, readmission, age, metastatic solid tumor, and blood urea nitrogen maximum) score was developed. The simplicity of the CRAMB score makes it user-friendly, allowing for coverage of a large and heterogeneous AF population. Moreover, the proposed model has better predictive performance than that of the clinically used CHA2DS2-VASc risk score for 1-year mortality among AF patients.
INTRODUCTION
Atrial fibrillation (AF) is a prevalent cardiac arrhythmia linked to considerable morbidity and mortality. It is characterized by an irregular and often rapid heart rate, resulting in compromised blood flow and potential complications such as stroke, heart failure, and other cardiovascular events [1]. AF has a broad impact on cardiac function, functional status, and quality of life and is also a risk factor for stroke [2]. AF becomes more prevalent with age, affecting more than 2 million individuals in the United States, 14% to 17% of whom are aged 65 years and older [3]. The prevalence of AF in the Polish population ≥65 years was estimated as 19.2% [4]. AF represents a significant public health issue due to its considerable impact on morbidity and mortality as well as its economic strain on healthcare systems.
The CHA2DS2-VASc score (congestive heart failure, hypertension, age ≥75 years, diabetes mellitus, prior stroke or transient ischemic attack or thromboembolism, vascular disease, age 65–74 years, sex category) [5], a tool for assessing the risk of stroke in AF patients, has been associated with cardiovascular events and mortality in diverse patient groups, including those without AF [6]. Nevertheless, tools specifically designed to assess mortality risk in AF patients are lacking. Although recent studies have introduced new AF risk scores [7, 8], these scores were developed based on data from clinical trials, limiting their applicability to the broader AF population.
Consequently, further research is necessary to identify potential models for scoring AF risk. The objective of this study was to employ machine learning methods to identify relevant variables and create an easily applicable prognostic score for predicting 1-year mortality in AF patients.
METHODS
Study design and setting
The data used in this research originated from the Medical Information Mart for Intensive Care-IV (MIMIC-IV version 2.2) database [9, 10]. Over the period from 2008 to 2019, the intensive care unit (ICU) at Beth Israel Deaconess Medical Center admitted more than 50 000 critically ill patients, as documented in MIMIC-IV. Approval for the MIMIC-IV database was granted by the Massachusetts Institute of Technology (Cambridge, MA, US) and Beth Israel Deaconess Medical Center (Boston, MA, US), with consent obtained for the initial data collection. The critical care database from China comprises comprehensive information on 2790 ICU patients, predominantly with pneumonia, admitted from January 2019 to December 2020 [11]. The database was approved by the Ethics Committee of Zigong Fourth People’s Hospital (Approval Number: 2021-014) and can be accessed through the online repository “PhysioNet” with the requisite credentials [12]. The MIMIC-IV database was used for model development and testing, while the Chinese hospital ICU database was used for external validation of the model.
Study population
The study population included patients aged 18 years and older with a discharge diagnosis of AF. AF patients were identified in the MIMIC-IV database and the external validation database by searching the International Classification of Diseases diagnostic terms for the keyword “atrial fibrillation”. The matched diagnostic terms were manually reviewed to confirm that they met the inclusion criterion. The exclusion criterion was lack of patient data on survival outcomes. When several MIMIC-IV records shared the same patient ID, only the record with the smallest hospitalization sequence number was retained.
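The selection rules above can be illustrated with a minimal sketch. The record layout and field names (`patient_id`, `hosp_seq`, `diagnoses`, `died_within_1y`) are hypothetical and are not MIMIC-IV column names; the order in which the filters are applied is also an assumption, as it is not stated in the text.

```python
def select_af_cohort(records):
    """Keep adult AF admissions with a known outcome, one record per patient."""
    first_stay = {}
    for rec in records:
        # Keyword match on the diagnostic terms, case-insensitive.
        has_af = any("atrial fibrillation" in d.lower() for d in rec["diagnoses"])
        # Exclude non-AF records, minors, and records without survival data.
        if not has_af or rec["age"] < 18 or rec["died_within_1y"] is None:
            continue
        kept = first_stay.get(rec["patient_id"])
        # Retain the record with the smallest hospitalization sequence.
        if kept is None or rec["hosp_seq"] < kept["hosp_seq"]:
            first_stay[rec["patient_id"]] = rec
    return list(first_stay.values())


# Illustrative records only; these are not study data.
records = [
    {"patient_id": 1, "hosp_seq": 1, "age": 69, "diagnoses": ["Pneumonia"], "died_within_1y": False},
    {"patient_id": 1, "hosp_seq": 3, "age": 71, "diagnoses": ["Unspecified atrial fibrillation"], "died_within_1y": True},
    {"patient_id": 1, "hosp_seq": 2, "age": 70, "diagnoses": ["Atrial fibrillation"], "died_within_1y": False},
    {"patient_id": 2, "hosp_seq": 1, "age": 17, "diagnoses": ["Atrial fibrillation"], "died_within_1y": False},
    {"patient_id": 3, "hosp_seq": 1, "age": 80, "diagnoses": ["Atrial fibrillation"], "died_within_1y": None},
]
cohort = select_af_cohort(records)
```

In this toy input, only patient 1 remains, represented by the AF admission with the lowest sequence number.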
Study variables
The variables examined in the research included the characteristics of the study population, complications, various scores (such as the Charlson Comorbidity Index and the CHA2DS2-VASc score), vital signs, and an array of laboratory tests (including routine blood tests, blood biochemistry, coagulation, blood lipids, cardiac markers, etc.). Additionally, the investigation considered the use of vasopressors (norepinephrine, epinephrine, phenylephrine, dopamine, dobutamine, vasopressin, and milrinone), antithrombotic agents (heparin, enoxaparin, warfarin, aspirin, clopidogrel, ticagrelor, rivaroxaban, edoxaban, dabigatran etexilate, fondaparinux sodium, prasugrel, and apixaban), beta-blockers (propranolol, metoprolol, bisoprolol, carvedilol, labetalol, atenolol, and nebivolol) and various other data points. For laboratory tests, summary statistics, including minimum and maximum values during hospitalization, were used to derive variables. An indicator column for the respective drug was generated based on whether the drug was used during hospitalization. The variable “readmission” was derived from the variable “hospital stay sequence” for convenient clinical application. If the number of hospital admissions was greater than 1, “readmission” was assigned the value “Yes”; otherwise, it was assigned the value “No”.
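As a hedged illustration of the variable derivation described above, the following sketch builds minimum and maximum laboratory summaries, per-drug indicator columns, and the “readmission” variable for one hospitalization. The function name, argument names, and output keys are hypothetical; leaving absent laboratory values as `None` is an assumption made here for simplicity.

```python
def derive_features(lab_values, drugs_given, hosp_seq, drug_list):
    """Derive summary variables for a single hospitalization."""
    features = {}
    # Minimum and maximum of each laboratory test during the stay.
    for lab, values in lab_values.items():
        observed = [v for v in values if v is not None]
        features[f"{lab}_min"] = min(observed) if observed else None
        features[f"{lab}_max"] = max(observed) if observed else None
    # One indicator per drug: 1 if used during hospitalization, else 0.
    given = {d.lower() for d in drugs_given}
    for drug in drug_list:
        features[f"{drug}_used"] = int(drug in given)
    # "Yes" when the hospital stay sequence is greater than 1.
    features["readmission"] = "Yes" if hosp_seq > 1 else "No"
    return features


# Illustrative values only; these are not study data.
example = derive_features(
    lab_values={"bun": [18.0, None, 42.0, 30.0]},
    drugs_given=["Warfarin"],
    hosp_seq=2,
    drug_list=["warfarin", "apixaban"],
)
```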
Outcome variable
The primary outcome measured was 1-year mortality. Survival time was calculated by using the date of death available in the MIMIC-IV database and external validation database restricted to a 1-year timeframe.
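One way to express this outcome definition is sketched below. The reference date from which survival is counted is not specified in the text, so the `start` argument is an assumption, as are the 365-day window and the function name.

```python
from datetime import date

FOLLOW_UP_DAYS = 365  # assumed length of the 1-year window


def one_year_outcome(start, death_date):
    """Return (died_within_1y, survival_days), with follow-up capped at one year."""
    if death_date is None:
        # No recorded death: treated as alive at the end of follow-up.
        return False, FOLLOW_UP_DAYS
    days = (death_date - start).days
    if days <= FOLLOW_UP_DAYS:
        return True, days
    # Death after the window does not count as an event.
    return False, FOLLOW_UP_DAYS


# Illustrative dates only; these are not study data.
early_death = one_year_outcome(date(2015, 1, 1), date(2015, 3, 2))
late_death = one_year_outcome(date(2015, 1, 1), date(2016, 6, 1))
no_death = one_year_outcome(date(2015, 1, 1), None)
```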
Machine learning model development and validation
The derivation dataset was randomly partitioned into training and test samples at a 3:1 ratio. To prevent model overfitting, tenfold cross-validation and model calibration techniques were applied. To accommodate varying degrees of missing values in dataset variables, the mainstream machine learning model XGBoost was employed due to its ability to handle missing data. The discriminative performance of the models was assessed using the area under the receiver operating characteristic curve. Feature scaling was deemed unnecessary before inputting the data into the model. A total of 174 candidate variables were incorporated into the model training process. Furthermore, a calibration curve was used as a graphical representation to evaluate the concordance between the predicted probabilities and observed outcomes in binary classification models. On the calibration curve, the x-axis denotes the mean predicted probability assigned by the model to a specific class, and the y-axis signifies the observed frequency of positive instances. Ideally, a well-calibrated model produces a calibration curve that closely aligns with the diagonal line (y = x), signifying a perfect correspondence between the predicted probabilities and actual outcomes.
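The two evaluation tools described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the AUC is computed through its rank interpretation, and the calibration curve uses equal-width probability bins.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed event rate) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls in the last bin
        bins[idx].append((y, p))
    points = []
    for members in bins:
        if members:
            mean_pred = sum(p for _, p in members) / len(members)  # x-axis
            obs_rate = sum(y for y, _ in members) / len(members)   # y-axis
            points.append((mean_pred, obs_rate))
    return points


# Illustrative labels and probabilities only; these are not study data.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
auc = roc_auc(y_true, y_prob)
curve = calibration_curve(y_true, y_prob, n_bins=2)
```

In this toy example, both calibration points lie on the diagonal (y = x), which is the pattern expected of a well-calibrated model.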
Machine learning model interpretability
SHapley Additive exPlanations (SHAP) is a model-agnostic explainability technique that assigns importance values to features based on their contribution to a model’s prediction [13]. SHAP values are grounded in Shapley values from game theory, which fairly distribute payouts based on each player’s contribution to the total gain. This method satisfies the properties of local accuracy, missingness, and consistency, making it versatile and reliable across different model types.
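The game-theoretic idea can be made concrete with an exact Shapley value calculation for a toy three-feature function. This is a simplified sketch and not the SHAP library or its tree-specific algorithm: “absent” features are replaced by a fixed baseline value, which is an assumption made here for illustration, and the function and model are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that feature i joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical model: one additive term and one interaction term.
    return 2 * z[0] + z[1] * z[2]


x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the local accuracy property; the interaction term is split equally between the two features that share it.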
Development of the scoring scale
The XGBoost model assigned importance to predictor variables, and variables with higher importance were selected based on this ranking. These selected variables were subsequently integrated into a logistic model to construct the scoring model. Manual testing was employed to evaluate the impact of introducing or removing variables on the area under the curve (AUC) of the logistic regression model in the test set. After striking a balance between AUC performance and the increase in model complexity associated with the number of variables included, the chosen variables for the AF scoring model were ultimately determined. A nomogram was used to construct the finalized AF scores. Decision curve analysis (DCA) was employed to assess the clinical utility and net benefit of the AF scoring model, CCI, and CHA2DS2-VASc scores within the test set [14]. DCA quantified the net benefit of a clinical prediction model at different risk thresholds, avoiding the simplistic assumptions that all patients were at low or high risk. The superior model was identified by the highest net benefit at the chosen threshold. The flowchart of the study is shown in Supplementary material, Figure S1.
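The net benefit used in DCA can be written as the true-positive rate per patient minus the false-positive rate per patient, weighted by the odds of the threshold probability. The sketch below illustrates this standard formula; the function names and toy data are hypothetical and are not taken from the study.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of classifying as high risk when y_prob >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the default strategy that treats everyone as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Illustrative labels and probabilities only; these are not study data.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
nb_none = 0.0  # treating no one as high risk has zero net benefit by definition
```

Repeating the calculation over a range of thresholds and plotting the results yields the decision curve for each score and for the two default strategies.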
Data analysis
Python software (version 3.11.5) was used to construct the machine learning models, evaluate the performance, and generate the AUC and calibration curves. R software (version 4.3.2) was used for logistic and Cox regression analyses, forest plot creation, DCA, and nomogram generation. Baseline characteristics are presented as means (standard deviations), medians (IQR), or percentages (%), as determined by the distribution characteristics of the data. The DeLong test was applied to determine whether the AUC of a given prediction significantly differed from that of another prediction [15]. Python was used to make descriptive tables [16] and run the DeLong test. When constructing the original machine learning model, no handling of missing values was conducted. However, during the development of the logistic model for the AF score, missing values were removed from the dataset based on the variables included in the AF score, as logistic models are unable to manage missing values. In all analyses, statistical significance was defined as a two-sided P-value <0.05.
During sensitivity analysis, missing values in the original dataset were imputed. The Python library “MIDASpy” was used for data filling [17]. Additionally, hyperparameter tuning was performed on the XGBoost model to evaluate the impact of imputation and model parameter adjustments on performance. A grid search was used for hyperparameter tuning, with values for ‘n_estimators’ of 50, 100, 150, and 200 and values for ‘max_depth’ ranging from 3 to 10.
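The search space described above can be enumerated as follows. This sketch only lists the candidate settings and does not fit any model; it assumes that “3 to 10” denotes the integers 3 through 10 inclusive, which is not stated explicitly in the text.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": list(range(3, 11)),  # assumption: integers 3..10 inclusive
}

# Every combination of the two hyperparameters, as a list of dictionaries.
candidates = [
    dict(zip(param_grid, combo)) for combo in product(*param_grid.values())
]
```

Under this assumption, the grid contains 4 × 8 = 32 candidate settings, each of which would be evaluated during tuning.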
RESULTS
Baseline characteristics
This study enrolled 26 365 individuals diagnosed with AF from the MIMIC-IV database. Among the patients, 56.3% were male. The cohort had a median age of 77.0 years (interquartile range [IQR], 68.0–85.0), a median CHA2DS2-VASc score of 4 (IQR, 2–5), a median CCI of 5 (IQR, 4–7), and a median hospitalization duration of 4 days (IQR, 1–7). Additional results are presented in Supplementary material, Tables S1 and S2. The external validation dataset included 231 patients with AF, of whom 152 (65.8%) died. Additional findings are detailed in Supplementary material, Table S3.
Screening variables using the XGBoost model
The XGBoost model achieved an AUC of 0.825 (95% confidence interval [CI], 0.816–0.835) for the prediction of 1-year mortality in the test set (Figure 1). Figure 2 illustrates the importance of the predictor variables as determined by the XGBoost model. Notably, the CCI and the presence of metastatic solid tumors were identified as the top two variables, with considerably greater importance than other variables. Supplementary material, Figure S2 shows the predictor importance interpretation based on the SHAP values for the XGBoost model.
Derivation and evaluation of the AF score
The 1-year mortality risk score for AF was named the CRAMB score, an acronym for the CCI, readmission, age, metastatic solid tumor, and maximum blood urea nitrogen (BUN) (Figure 2). Logistic and Cox regression analyses were employed to assess the predictive value of these five variables for the outcome of death, and the results were expressed as odds ratios (ORs) and hazard ratios (HRs), respectively. Both the forest plot of ORs (Figure 3) and the forest plot of HRs (Figure 4) demonstrated that all five variables were significantly associated with mortality.
A nomogram was used to calculate the CRAMB score (Supplementary material, Figure S3). In the test set, the AUC for the CRAMB score was 0.765 (95% CI, 0.753–0.776), surpassing the CCI at 0.733 (95% CI, 0.720–0.746) and the CHA2DS2-VASc score at 0.617 (95% CI, 0.603–0.631) (Figure 1). The sensitivity analysis showed that hyperparameter adjustment and missing value filling had very little impact on the AUC of XGBoost and the different scoring models (Supplementary material, Figure S4). Table 1 displays supplementary performance metrics corresponding to these scores.
Item | Accuracy | Sensitivity | Specificity | ROC AUC | DeLong test P-value
Charlson Comorbidity Index | 0.692 | 0.375 | 0.876 | 0.733 | <0.001
CHA2DS2-VASc | 0.635 | 0.068 | 0.963 | 0.617 | <0.001
CRAMB | 0.715 | 0.438 | 0.876 | 0.765 | –
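For reference, accuracy, sensitivity, and specificity such as those in Table 1 are computed from the confusion matrix of dichotomized predictions. The sketch below shows the definitions on toy data; the function name is hypothetical, and the cut-off used to dichotomize each score is not reported here, so the predicted labels are taken as given.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of deaths correctly flagged
        "specificity": tn / (tn + fp),  # share of survivors correctly cleared
    }


# Illustrative labels only; these are not study data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```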
The DeLong test showed that the AUC of the CRAMB score differed significantly from those of the existing scores (CCI and CHA2DS2-VASc; P <0.001 for both), as indicated in Table 1. The DCA results in Figure 5 show that the CRAMB score yielded a positive net benefit across the entire threshold range, and that this net benefit exceeded those of the CCI and CHA2DS2-VASc scores as well as those of the default strategies that use no scoring system and assume that all patients are at either high or low risk. The calibration plot (Supplementary material, Figure S5) for the test set indicated that the CRAMB score was well calibrated.
Model evaluation on the external validation set
In the external validation set, the AUC for the CRAMB score was 0.582 (95% CI, 0.502–0.657), which was numerically higher than that of the CCI (0.542 [95% CI, 0.469–0.618]) and that of the CHA2DS2-VASc score (0.511 [95% CI, 0.438–0.586]) (Supplementary material, Figure S6). Additional findings are detailed in Supplementary material, Table S4. DCA showed that the net benefit of the CRAMB score exceeded that of the other two scores at threshold probabilities between 60% and 80% (Supplementary material, Figure S7).
DISCUSSION
Main findings
This study’s primary contribution is establishing a benchmark for using machine learning models in the construction of AF scores for mortality prediction. This study introduces and validates a novel risk score for assessing the 1-year mortality risk in AF patients. By leveraging a large-sample population dataset and employing XGBoost models for predictor screening, we developed the CRAMB score (Charlson comorbidity index, readmission, age, metastatic solid tumor, and blood urea nitrogen maximum). XGBoost excels at variable selection by effectively capturing nonlinear relationships and handling missing data [18]. Its built-in feature importance mechanism automatically identifies key variables, a capability lacking in logistic regression. Furthermore, compared with logistic regression, XGBoost’s ensemble learning often results in superior predictive performance, and its regularization techniques boost resilience against overfitting, making it a robust choice for predictive modeling and variable selection. The variables incorporated in the CRAMB score were validated through logistic and Cox regression analyses, demonstrating their predictive significance for mortality. The CRAMB score exhibited excellent calibration, and DCA illustrated its clinical utility. Importantly, the findings of this study demonstrated that the CRAMB score outperformed the widely used CHA2DS2-VASc risk score in predicting mortality despite the latter’s original focus on predicting ischemic stroke.
Predictors of death in AF patients
Predictors and risk factors for death in AF patients span a broad spectrum of clinical and demographic variables. Hypertension has been identified as a significant risk factor for incident heart failure and all-cause mortality in AF patients [19]. Moreover, patients with chronic kidney disease who develop AF face an increased risk of stroke and death [20], and renal function has been associated with the risk of stroke and bleeding in AF patients [21]. Additionally, age is correlated with elevated risks of stroke and mortality in patients with either AF or sinus rhythm [22]. Proposed factors such as cancer-related inflammation, anticancer treatments, and other comorbidities associated with cancer are believed to influence atrial remodeling, potentially increasing the susceptibility of cancer patients to AF [23]. Therefore, AF screening is important to reduce the burden of AF-associated stroke [24].
Comparison with similar studies
Unlike the ABC-death (age, biomarkers [N-terminal pro B-type natriuretic peptide, troponin T, growth differentiation factor-15]) risk score [7] and the BASIC-AF risk score (biomarkers, age, ultrasound, ventricular conduction delay, and clinical history) [8], which were developed from clinical trial data, the CRAMB score was constructed from the MIMIC-IV database, so its source population differs markedly from clinical trial populations. Therefore, this study addresses a gap in the development of scoring methods and the screening of predictor variables within a broader population than that of previous studies of this nature. Future research on AF scores should focus on the characteristics of the population used for score development, comprehensively considering the importance and applicability of the variables included.
Expanding the clinical application potential of the CRAMB score
For effective integration into clinical workflows, the CRAMB score should be incorporated into electronic health records for automated calculation and routine assessments during admissions and outpatient visits. In addition, the nomogram can also be turned into an online tool for automatic calculation. Training clinical staff on its use, interpretation, and communication with patients is essential. Integrating the score into clinical decision support systems and multidisciplinary team meetings will enhance patient management. Pilot programs and continuous outcome monitoring will refine its application, ensuring robust and effective use, ultimately improving patient care and optimizing use of resources.
Limitations
The main limitation of this study is the limited representativeness of the external validation set. Future studies should validate the model using datasets from multiple medical centers or international sources to enhance generalizability. Conducting a prospective cohort study to assess the predictive power of the CRAMB score would provide stronger evidence of its efficacy, as prospective data collection allows for better control of variables and reduces retrospective biases. Additionally, incorporating a broader set of variables, such as lifestyle factors and detailed medication history, could improve the model’s accuracy and relevance. Finally, the CRAMB score was developed using data from a specific period, which may not reflect current medical practices. As healthcare evolves, this could affect the score’s relevance. To address this, we should regularly update the model with recent data to maintain its accuracy and relevance. Continuous recalibration and validation will ensure that the CRAMB score reflects current practices and improves patient care and outcomes in a dynamic healthcare environment.
CONCLUSIONS
This study’s primary contribution is establishing a benchmark for using machine learning models in the construction of a score for mortality prediction in AF patients. By leveraging a large-sample population dataset and employing XGBoost models for predictor screening, we developed the CRAMB score (CCI, readmission, age, metastatic solid tumor, and blood urea nitrogen maximum). The simplicity of the CRAMB score makes it user-friendly, allowing for the coverage of a broad and heterogeneous AF population. Moreover, the proposed model has superior predictive performance compared to that of the currently used CHA2DS2-VASc risk score for 1-year mortality in AF patients. External validation of the CRAMB score in new datasets has potential value for enhancing clinical practice.
Supplementary material
Supplementary material is available at https://journals.viamedica.pl/polish_heart_journal.
Article information
Conflict of interest: None declared.
Funding: This work was supported by the Real World Study Project of Hainan Boao Lecheng Pilot Zone (Real World Study Base of NMPA) (No. HNLC2022RWS017) to HC.
Open access: This article is available in open access under the Creative Commons Attribution-Non-Commercial-No Derivatives 4.0 International (CC BY-NC-ND 4.0) license, which allows downloading and sharing articles with others as long as they credit the authors and the publisher, but without permission to change them in any way or use them commercially. For commercial use, please contact the journal office at polishheartjournal@ptkardio.pl.