Collection of Scientific Papers

DOI: https://doi.org/10.32515/2664-262X.2025.11(42).1.14-26

A Data-Driven Approach for Balancing Overfitting and Underfitting in Decision Tree Models

Mykola Zlobin, Volodymyr Bazylevych

About the Authors

Mykola Zlobin, postgraduate student, Chernihiv Polytechnic National University, Chernihiv, Ukraine, e-mail: mykolay.zlobin@gmail.com, ORCID ID: 0009-0000-7653-6109

Volodymyr Bazylevych, Associate Professor, PhD in Economics (Candidate of Economic Sciences), Chernihiv Polytechnic National University, Chernihiv, Ukraine, e-mail: bazvlamar@stu.cn.ua, ORCID ID: 0000-0001-8935-446X

Abstract

This article develops a data-driven framework for balancing overfitting and underfitting in decision tree models. Overfitting occurs when a model captures noise in the training data, which reduces generalization, while underfitting occurs when a model is too simple to capture the underlying structure, which leads to poor predictive accuracy. The study systematically tunes the max_leaf_nodes parameter and evaluates model performance using Mean Absolute Error (MAE). The objective is to find the balance that preserves accuracy while preventing excessive complexity. A Decision Tree Regressor was trained on the Ames Housing dataset, which includes 79 explanatory variables related to home prices. The dataset was split into training and validation sets. The model was evaluated by iterating over max_leaf_nodes values ranging from 2 to 5000 and computing the MAE for each configuration. The results show that increasing max_leaf_nodes initially improves accuracy, but beyond 400 nodes the MAE stabilizes around 242,906, indicating that further complexity does not improve performance. Models with too few leaf nodes underfit the data, while models with too many leaf nodes overfit and capture spurious patterns. To mitigate this, systematic hyperparameter tuning is used to find the optimal configuration. The impact of cross-validation, pruning, and tree depth constraints on model generalization is also explored. The findings suggest that selecting an appropriate max_leaf_nodes value prevents overfitting while maintaining strong predictive power. Further statistical analysis confirmed that models with excessive complexity tend to have higher error fluctuations, which reduces their reliability. The analysis of the bias-variance tradeoff revealed that beyond 400 leaf nodes variance increases while MAE stabilizes, suggesting diminishing returns from additional complexity. The paper demonstrates the importance of structured hyperparameter tuning in decision tree models. The optimal max_leaf_nodes value was found to be 400.
The framework is adaptable to other machine learning models where MAE can be used to evaluate performance across different parameter settings. For instance, in Random Forest models, the number of trees can be optimized in the same way. The results emphasize that tuning model complexity is essential for achieving accurate predictions while avoiding overfitting. Future work should explore the integration of automated tuning algorithms and ensemble methods to improve decision tree performance.
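
The tuning loop described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and uses synthetic data as a stand-in for the Ames Housing dataset. The helper name `mae_for_leaf_nodes`, the candidate grid, and the random seeds are illustrative choices, not the authors' code, and the numbers it prints are not comparable to the MAE values reported in the study.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def mae_for_leaf_nodes(max_leaf_nodes, X_train, X_val, y_train, y_val):
    """Fit a tree capped at max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))


# Synthetic stand-in data (an assumption; the study uses the Ames Housing dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(1000, 5))
y = 50.0 * np.sin(X[:, 0]) + 10.0 * X[:, 1] + rng.normal(0.0, 5.0, size=1000)

# Hold out part of the data for validation, as the abstract describes.
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Illustrative grid inside the 2 to 5000 range mentioned in the abstract.
candidates = [2, 5, 25, 50, 100, 250, 400, 500, 5000]
scores = {
    n: mae_for_leaf_nodes(n, X_train, X_val, y_train, y_val) for n in candidates
}

# The candidate with the lowest validation MAE is the selected setting.
best = min(scores, key=scores.get)
for n, mae in scores.items():
    print(f"max_leaf_nodes={n:5d}  validation MAE={mae:8.3f}")
print("lowest validation MAE at max_leaf_nodes =", best)
```

On real data, the same loop could be wrapped in cross-validation instead of a single split, which would reduce the dependence of the selected value on one particular partition.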

Keywords

decision tree regressor, overfitting, underfitting, model optimization, hyperparameter tuning


References

1. Gu, Y., Wylie, B. K., Boyte, S. P., Picotte, J., Howard, D. M., Smith, K., Nelson, K. J. (2016). An optimal sample data usage strategy to minimize overfitting and underfitting effects in regression tree models based on remotely-sensed data. Remote Sensing, 8(11), 943. https://doi.org/10.3390/rs8110943.

2. Aliferis, C., Simon, G. (2024). Overfitting, underfitting and general model overconfidence and under-performance pitfalls and best practices in machine learning and AI. Artificial intelligence and machine learning in health care and medical sciences: Best practices and pitfalls, 477-524. https://doi.org/10.1007/978-3-031-39355-6_10

3. Li, Y., Linero, A. R., Murray, J. (2023). Adaptive conditional distribution estimation with Bayesian decision tree ensembles. Journal of the American Statistical Association, 118(543), 2129-2142. https://doi.org/10.1080/01621459.2022.2037431.

4. Zhang, J., Wang, Y., Santolucito, M., Piskac, R. (2020). Succinct Explanations With Cascading Decision Trees. arXiv preprint arXiv:2010.06631. https://doi.org/10.48550/arXiv.2010.06631.

5. Song, Y. Y., Lu, Y. (2015). Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry, 27(2), 130. https://doi.org/10.11919/j.issn.1002-0829.215044

6. Adler, A. I., Painsky, A. (2022). Feature importance in gradient boosting trees with cross-validation feature selection. Entropy, 24(5), 687. https://doi.org/10.3390/e24050687.

7. Lee, D., Tellez, F. P., Jaiswal, R. (2024). Predicting Fire Incidents with ML: an XAI approach. https://doi.org/10.21203/rs.3.rs-5356484/v1

8. Amro, A., Al-Akhras, M., Hindi, K. E., Habib, M., Shawar, B. A. (2021). Instance reduction for avoiding overfitting in decision trees. Journal of Intelligent Systems, 30(1), 438-459. https://doi.org/10.1515/jisys-2020-0061

9. Mienye, I. D., Jere, N. (2024). A Survey of Decision Trees: Concepts, Algorithms, and Applications. IEEE Access, 12, 86716-86727. https://doi.org/10.1109/ACCESS.2024.3416838

10. Liu, B., Mazumder, R. (2023). ForestPrune: compact depth-pruned tree ensembles. In International Conference on Artificial Intelligence and Statistics. 9417-9428. PMLR.

11. Zhao, L., Alipour-Fanid, A., Slawski, M., Zeng, K. (2018). Prediction-time efficient classification using feature computational dependencies. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2787-2796. https://doi.org/10.1145/3219819.3220117

12. Park, Y., Ho, J. C. (2019). Tackling overfitting in boosting for noisy healthcare data. IEEE Transactions on Knowledge and Data Engineering, 33(7), 2995-3006. https://doi.org/10.1109/TKDE.2019.2959988

13. Leiva, R. G., Anta, A. F., Mancuso, V., Casari, P. (2019). A novel hyperparameter-free approach to decision tree construction that avoids overfitting by design. IEEE Access, 7, 99978-99987. https://doi.org/10.1109/ACCESS.2019.2930235

14. James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An introduction to statistical learning. Springer. https://doi.org/10.1007/978-3-031-38747-0

15. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.

16. Wan, X., Wang, W., Liu, J., Tong, T. (2014). Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Medical Research Methodology, 14, 1-13. https://doi.org/10.1186/1471-2288-14-135.

17. Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.

18. De Cock, D. (2011). Ames, Iowa: Alternative to the Boston housing data as an end-of-semester regression project. Journal of Statistics Education, 19(3), 1-15. https://jse.amstat.org/v19n3/decock.pdf.
