An Analysis of Factors Influencing the Box Office of Movies and Optimization of the XGBoost Model for Prediction
DOI:
https://doi.org/10.61173/q9ac4q60
Keywords:
Box office prediction, Machine learning, XGBoost, Film industry analytics
Abstract
Box office prediction is crucial for the film industry, but traditional models struggle to capture the non-linear relationships and interactions among factors such as budget and release timing. Recent work has introduced machine learning approaches, including Random Forest (RF) and Extreme Gradient Boosting (XGBoost). XGBoost performs well owing to its strong feature selection capability and nonlinear modeling. Using data on 1,313 films released between 1995 and 2016, this study systematically compares three models: linear regression, RF, and XGBoost. After adding new features such as the director's film count and optimizing hyperparameters, XGBoost achieves the best performance (Coefficient of Determination (R²) = 0.69, Root Mean Squared Error (RMSE) = 0.78, Mean Absolute Percentage Error (MAPE) = 3.12%, Mean Absolute Error (MAE) = 0.55), significantly outperforming RF and linear regression. Feature importance analysis identifies budget and the total number of audience ratings (NAR) as the most important variables, while a summer release also has a significant effect on revenue. These findings suggest that high-budget productions, enhanced audience engagement, and strategic release timing can help maximize box office revenue. The study shows that XGBoost works well for predicting box office success and provides useful evidence to help the film industry make better decisions.
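For readers unfamiliar with the four evaluation metrics reported above, the following is a minimal sketch of how they are conventionally computed. It is not the study's code: the function name `regression_metrics` and the sample values are hypothetical, and the sketch assumes the true values are non-constant and contain no zeros (otherwise R² and MAPE are undefined).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and MAPE (in percent) for paired sequences.

    Hypothetical helper for illustration only. Assumes y_true is not
    constant (R^2 needs non-zero variance) and has no zeros (MAPE
    divides by each true value).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Made-up values, not data from the study.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]
metrics = regression_metrics(y_true, y_pred)
```

A higher R² and lower RMSE, MAE, and MAPE indicate a better fit, which is the sense in which the three models are ranked.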