Collection of Scientific Papers

DOI: https://doi.org/10.32515/2664-262X.2025.11(42).1.14-26

A Data-Driven Approach to Balancing Overfitting and Underfitting in Decision Tree Models

M. M. Zlobin, V. M. Bazylevych

About the Authors

M. M. Zlobin, PhD student, Chernihiv Polytechnic National University, Chernihiv, Ukraine, e-mail: mykolay.zlobin@gmail.com, ORCID ID: 0009-0000-7653-6109

V. M. Bazylevych, Associate Professor, Candidate of Economic Sciences, Chernihiv Polytechnic National University, Chernihiv, Ukraine, e-mail: bazvlamar@stu.cn.ua, ORCID ID: 0000-0001-8935-446X

Abstract

This article develops a data-driven approach to balancing overfitting and underfitting in decision tree models. Overfitting typically occurs when a model captures noise, which reduces generalization, whereas underfitting leads to low predictive accuracy. The study systematically tuned the max_leaf_nodes parameter and evaluated model performance using mean absolute error (MAE). The goal was to find the balance that keeps the model accurate while preventing excessive complexity. A decision tree regressor was trained on the Ames Housing dataset, which includes 79 explanatory variables related to house prices. The dataset was split into training and validation sets. The model was evaluated by iterating over max_leaf_nodes values from 2 to 5000 and computing the MAE for each configuration. The results showed that increasing max_leaf_nodes initially improved accuracy, but beyond 400 nodes the MAE stabilized at 242,906, indicating that further complexity did not improve performance. The article emphasizes that models with too few leaf nodes underfit the data, whereas models with too many leaf nodes overfit by capturing spurious patterns. To mitigate this, systematic hyperparameter tuning was used to find the optimal configuration. The effects of cross-validation, pruning, and tree-depth constraints on model generalization were also examined. The findings indicate that choosing an appropriate max_leaf_nodes value prevents overfitting while retaining strong predictive power. The article demonstrates the importance of structured hyperparameter tuning in decision tree models. The optimal value of max_leaf_nodes was found to be 400.
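The tuning procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses a synthetic regression dataset as a stand-in for Ames Housing, and the candidate grid and the helper name `validation_mae` are hypothetical choices made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def validation_mae(max_leaf_nodes, X_train, X_val, y_train, y_val):
    """Fit a tree capped at max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_val, model.predict(X_val))


# Assumption: synthetic data stands in for the Ames Housing dataset.
X, y = make_regression(
    n_samples=1000, n_features=20, n_informative=8, noise=25.0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Hypothetical grid covering the 2..5000 range mentioned in the abstract.
candidates = [2, 5, 25, 50, 100, 400, 1000, 5000]
scores = {n: validation_mae(n, X_train, X_val, y_train, y_val) for n in candidates}

# Pick the leaf-node cap with the lowest validation MAE.
best_leaf_nodes = min(scores, key=scores.get)
for n in candidates:
    print(f"max_leaf_nodes={n:>5}  validation MAE={scores[n]:.2f}")
print("lowest validation MAE at max_leaf_nodes =", best_leaf_nodes)
```

On real data, the values printed for very small caps would be expected to reflect underfitting, while a flat or rising tail at large caps would suggest that extra leaves no longer help generalization.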
The framework can be adapted to other machine learning models in which MAE can be used to evaluate performance under different parameter settings. For example, in Random Forest models the number of trees can be optimized in the same way. The results underline that tuning model complexity is essential for achieving accurate predictions while avoiding overfitting. Future work should investigate integrating automated tuning algorithms and ensemble methods to improve decision tree performance.
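As an illustration of the Random Forest adaptation, the same validation-MAE loop can be applied to the number of trees. The sketch below assumes scikit-learn, where that setting is the `n_estimators` parameter of `RandomForestRegressor`. The synthetic data and the candidate tree counts are hypothetical and are not taken from the article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Assumption: synthetic data, used only to keep the sketch self-contained.
X, y = make_regression(n_samples=400, n_features=10, noise=20.0, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

# Hypothetical candidate values for the number of trees.
tree_counts = [1, 5, 20, 50]
forest_mae = {}
for n_trees in tree_counts:
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=1)
    forest.fit(X_train, y_train)
    forest_mae[n_trees] = mean_absolute_error(y_val, forest.predict(X_val))

# Select the tree count with the lowest validation MAE.
best_tree_count = min(forest_mae, key=forest_mae.get)
print(forest_mae)
print("lowest validation MAE at n_estimators =", best_tree_count)
```

In practice the validation MAE of a forest tends to level off as trees are added, so the smallest count on the plateau is often a reasonable choice when training cost matters.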

Keywords

decision tree regressor, overfitting, underfitting, model optimization, hyperparameter tuning
