Logistic Regression for Customer Churn Analysis: The Role of Data Proportion and Class Balancing in Model Performance

Authors

  • Huimin Pan

DOI:

https://doi.org/10.61173/x28wbd81

Keywords:

Customer churn prediction, logistic regression, machine learning

Abstract

Customer churn prediction is a critical challenge for the banking industry because retaining existing clients is often more cost-effective than acquiring new ones. This study applies logistic regression to the Bank Customer Churn dataset (Kaggle, 2017), which contains 10,000 records and 13 features. Preprocessing comprised removal of irrelevant identifiers, categorical encoding, and feature selection, after which seven key variables were retained. The study examines how the training set proportion (50%, 60%, 70%, 80%) and class balancing (oversampling vs. no balancing) affect model performance. Accuracy remained stable at approximately 0.81 across all four training ratios, suggesting that enlarging the training set within this range yields no significant gain. In contrast, balancing the dataset reduced overall accuracy (0.72 vs. 0.81), reflecting the trade-off between overall accuracy and minority-class recall in imbalanced classification. The logistic regression coefficients revealed interpretable patterns: customers in Germany had higher churn odds, while active membership, longer tenure, and ownership of multiple products reduced churn likelihood. These findings clarify how data preprocessing choices affect churn modeling outcomes and offer actionable insights for banking institutions seeking to strengthen retention strategies.
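
The experimental design described in the abstract (varying the training proportion, then comparing balanced and unbalanced training data) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the author's implementation: the abstract does not specify the oversampling method or where in the pipeline it is applied, so random duplication of minority rows applied to the training partition only is an assumption here. The toy data (1,000 rows with a 20% positive rate) and the helper names `split_train_test`, `random_oversample`, `accuracy`, and `recall` are hypothetical. The always-predict-"no churn" baseline is included only to show why accuracy alone can look good on imbalanced data while minority-class recall is zero.

```python
import random
from collections import Counter


def split_train_test(rows, labels, train_frac, seed=0):
    """Shuffle row indices and split them into train/test partitions."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_frac)
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical toy data: 1,000 rows, every fifth row labelled as churn (1).
rows = [[i] for i in range(1000)]
labels = [1 if i % 5 == 0 else 0 for i in range(1000)]

for frac in (0.5, 0.6, 0.7, 0.8):
    X_tr, y_tr, X_te, y_te = split_train_test(rows, labels, frac, seed=42)
    # Assumption: balance the training partition only, so the test set
    # keeps the original class distribution.
    X_bal, y_bal = random_oversample(X_tr, y_tr, seed=42)
    # Baseline that always predicts "no churn" (0).
    baseline = [0] * len(y_te)
    print(frac, len(y_tr), dict(Counter(y_bal)),
          round(accuracy(y_te, baseline), 2), recall(y_te, baseline))
```

In a fuller experiment, a logistic regression model would be fitted once on `(X_tr, y_tr)` and once on `(X_bal, y_bal)` for each training proportion, and both accuracy and minority-class recall would be reported on the untouched test partition.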

Published

2025-10-23

Section

Articles