Research on the Influencing Factors of Heart Disease based on Logistic Regression and XGBoost

Authors

  • Zitong Zhou

DOI:

https://doi.org/10.61173/qeache10

Keywords:

Heart disease, influencing factors, logistic regression, XGBoost

Abstract

This study uses the UCI Cleveland Heart Disease dataset (297 complete records, 14 features) to build a two-stage pipeline for cardiac risk prediction: feature selection first, then a comparison of Logistic Regression (LR) against Gradient Boosting Decision Trees (GBDT/XGBoost). In the first stage, mutual information and recursive feature elimination isolate the five most informative variables (chest-pain type, maximum heart rate, exercise-induced ST depression, ST slope, and number of major vessels), all of which have well-established clinical relevance. In the second stage, GBDT outperforms LR on every reported metric: accuracy 89.2% vs. 83.7%, recall 86.7% vs. 80.0%, and AUC 0.925 vs. 0.874, indicating the value of capturing non-linear interactions among risk factors. Feature-importance analysis confirms that these predictors are medically interpretable. The authors acknowledge two limitations, the modest sample size and the continuing need for transparent models, and recommend expanding the data, integrating SHAP explanations, and incorporating real-time monitoring. Overall, explainable GBDT-based tools can help clinicians identify high-risk individuals early, enabling personalized interventions and improved patient outcomes.
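The two-stage pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses a synthetic stand-in from scikit-learn's `make_classification` rather than the Cleveland data (297 rows and 13 predictors, on the assumption that the 14 features counted above include the outcome), it substitutes scikit-learn's `GradientBoostingClassifier` for XGBoost, and the split ratio, random seeds, and the way the two selection methods are combined are all choices made for this example. The numbers it produces are therefore unrelated to those reported in the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the Cleveland data (297 rows, 13 predictors).
X, y = make_classification(
    n_samples=297, n_features=13, n_informative=5, n_redundant=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stage 1a: mutual information between each feature and the label.
mi_scores = mutual_info_classif(X_train, y_train, random_state=0)
mi_ranking = np.argsort(mi_scores)[::-1]  # feature indices, most informative first

# Stage 1b: recursive feature elimination down to five features,
# matching the number of predictors named in the abstract.
# ASSUMPTION: RFE is wrapped around logistic regression on standardized inputs.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(StandardScaler().fit_transform(X_train), y_train)
selected = np.flatnonzero(rfe.support_)  # column indices kept by RFE

# Stage 2: fit both models on the selected features and compare them.
# ASSUMPTION: GradientBoostingClassifier stands in for XGBoost.
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "GBDT": GradientBoostingClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_train[:, selected], y_train)
    pred = model.predict(X_test[:, selected])
    proba = model.predict_proba(X_test[:, selected])[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "auc": roc_auc_score(y_test, proba),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Here the mutual-information ranking is computed only for inspection, and the models are trained on the RFE selection; the abstract does not say how the two criteria were combined, so a different rule (for example, keeping features that both methods favor) may be closer to the original work.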

Published

2025-10-23
