DOI: https://doi.org/10.32515/2664-262X.2025.12(43).2.27-35
Experimental Study of the Effectiveness of the GPT-4o Model for Evaluating the Quality of User Interfaces with Consideration of Security Risks
About the Authors
Olena Prysiazhniuk, Associate Professor, PhD in Information Technology (Candidate of Technical Sciences), Associate Professor of Information and Digital Technologies Department, Volodymyr Vynnychenko Central Ukrainian State University, Kropyvnytskyi, Ukraine, ORCID: https://orcid.org/0000-0002-7135-3124, e-mail: elena_drobot@ukr.net
Anna Puzikova, Associate Professor, PhD in Information Technology (Candidate of Physical and Mathematical Sciences), Associate Professor of Information and Digital Technologies Department, Volodymyr Vynnychenko Central Ukrainian State University, Kropyvnytskyi, Ukraine, ORCID: https://orcid.org/0000-0002-6843-5583, e-mail: a.v.puzikova@cuspu.edu.ua
Dmytro Oryshechko, Master’s Student in Computer Science, Central Ukrainian State University, Kropyvnytskyi, Ukraine, ORCID: https://orcid.org/0009-0004-5371-8697, e-mail: 12026083@cuspu.edu.ua
Abstract
The emergence of multimodal large language models (LLMs), including GPT-4o, creates new opportunities for automating the evaluation of user interface quality in terms of usability, accessibility, and security. A review of publicly available sources demonstrates that the integration of LLM-based tools for user interface evaluation, particularly in the context of security risk assessment, remains insufficiently explored. The purpose of this article is to examine the effectiveness of employing the multimodal GPT-4o model for automating the assessment of web interface quality from a user experience perspective, with consideration of security risks.
The research involved the development and implementation of a set of methodological procedures, which included: 1) defining a system of evaluation criteria (usability, accessibility, visual design, and information architecture) and analyzing the associated security risks; 2) conducting a series of expert assessments of the interfaces of 20 university websites, performed by human experts and by GPT-4o using a unified set of criteria and scoring scales. In addition to numerical assessments, the experts prepared analytical reports documenting security risks identified in cases where an interface received a low score for at least one criterion, together with recommendations for improvement; 3) performing a comparative analysis of the obtained results using agreement coefficients to evaluate the consistency between GPT-4o and human evaluators.
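For illustration, a minimal Python sketch of how such a GPT-4o assessment could be automated is given below. The exact prompt, rubric wording, scoring scale, and screenshot handling used in the study are not reproduced here; the criteria list, the 1-5 scale, the JSON output format, and the file name are assumptions, and the call relies on the publicly documented OpenAI Chat Completions API for image input.

import base64
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed criteria; the study's exact rubric wording and prompt are not public.
CRITERIA = ["usability", "accessibility", "visual design", "information architecture"]

def encode_image(path):
    """Return a screenshot file as a base64 string for the image_url payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def score_interface(screenshot_path):
    """Ask GPT-4o to score one website screenshot on each criterion (1-5 scale)."""
    prompt = (
        "You are a UX auditor. Rate the university website in the screenshot "
        "from 1 to 5 on each criterion: " + ", ".join(CRITERIA) + ". "
        "For any criterion rated 2 or lower, list the related security risks "
        "and give recommendations for improvement. Answer as JSON."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduce run-to-run variation in the scores
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + encode_image(screenshot_path)}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical file name; in the study, one assessment would be run per each of the 20 websites.
print(score_interface("university_site_01.png"))

Setting the temperature to zero is a common way to make repeated scoring runs more reproducible; the generation settings actually used in the study are not stated in the abstract.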
The effectiveness of GPT-4o integration was assessed based on the degree of alignment between model-generated and human-generated scores, as well as the structure and quality of the analytical reports produced by GPT-4o. The main results of the study are as follows: 1) a set of critical security risks inherent to the selected system of user interface evaluation criteria has been identified, and their impact on the quality and objectivity of UX analysis has been determined; 2) a two-stage procedure for conducting the experimental study has been developed; 3) statistically significant agreement between the evaluations provided by GPT-4o and human experts has been demonstrated, indicating the feasibility of employing the model as a reliable assessment agent within UX audit processes; 4) it has been established that GPT-4o is capable of detecting a broader spectrum of UX vulnerabilities – particularly those related to accessibility and inclusivity – than human experts, which is essential for evaluating web resources intended for users with special needs.
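The abstract does not specify which agreement coefficients were applied, so the sketch below is only a hypothetical illustration of how consistency between GPT-4o and expert scores on an ordinal 1-5 scale could be quantified, using quadratically weighted Cohen's kappa and Spearman rank correlation as stand-in measures; the score vectors are invented for demonstration.

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Invented 1-5 scores for 20 websites on a single criterion (e.g., accessibility):
# one vector for the human expert panel (rounded consensus), one for GPT-4o.
human = np.array([4, 5, 3, 2, 4, 4, 3, 5, 2, 3, 4, 5, 3, 4, 2, 3, 4, 5, 3, 4])
gpt4o = np.array([4, 4, 3, 2, 5, 4, 3, 5, 2, 2, 4, 5, 3, 4, 3, 3, 4, 4, 3, 4])

# Quadratically weighted kappa treats the 1-5 scale as ordinal,
# penalising large disagreements more heavily than near-misses.
kappa = cohen_kappa_score(human, gpt4o, weights="quadratic")

# Rank correlation as a complementary measure of agreement.
rho, p_value = spearmanr(human, gpt4o)

print(f"weighted kappa = {kappa:.2f}, Spearman rho = {rho:.2f} (p = {p_value:.3f})")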
Keywords
interface requirements analysis, user interface quality, artificial intelligence, GPT-4o, expert evaluations, security risks
References
1. Takaffoli, M., Li, S., & Mäkelä, V. (2024). Generative AI in User Experience Design and Research: How Do UX Practitioners, Teams, and Companies Use GenAI in Industry? Designing Interactive Systems Conference: Proceedings of the ACM International Conference (pp. 1579–1593), July 1–5, 2024, Copenhagen, Denmark. https://doi.org/10.1145/3643834.3660720.
2. Muratovic, F., Kearns-Manolatos, D., & Alibage, A. (2025). Generative AI in Software Development: Challenges, Opportunities, and New Paradigms for Quality Assurance. Computer, 58(7), 31–39. https://doi.org/10.1109/MC.2025.3556330.
3. Wang, J., Huang, Y., Chen, C., Liu, Z., Wang, S., & Wang, Q. (2023). Software Testing With Large Language Models: Survey, Landscape, and Vision. IEEE Transactions on Software Engineering, 50, 911–936. https://doi.org/10.1109/TSE.2024.3368208.
4. Vorochek, O. H., & Solovei, I. V. (2024). Doslidzhennia zasobiv shtuchnoho intelektu dlia avtomatyzatsii protsesu testuvannia prohramnoho zabezpechennia [Research of artificial intelligence tools for automating software testing processes]. Visnyk Natsionalnoho tekhnichnoho universytetu «KhPI». Seriia: Systemnyi analiz, upravlinnia ta informatsiini tekhnolohii, 1(11)’2024, 58–64. https://doi.org/10.20998/2079-0023.2024.01.09 [in Ukrainian].
5. Hsueh, N.-L., Lin, H.-J., & Lai, L.-C. (2024). Applying Large Language Model to User Experience Testing. Electronics, 13(23), 4633. https://doi.org/10.3390/electronics13234633.
6. Freeman, L., Robert, J., & Wojton, H. (2025). The Impact of Generative AI on Test & Evaluation: Challenges and Opportunities. Foundations of Software Engineering: Proceedings of the 33rd ACM International Conference (pp. 1376–1380), June 23–28, 2025, Norway. https://doi.org/10.1145/3696630.3728723.
7. Quinlan, M., Ceross, A., & Simpson, A. (2023). The aesthetics of cyber security: How do users perceive them? arXiv preprint, arXiv:2306.08171v1. https://doi.org/10.48550/arXiv.2306.08171.
8. Petelka, J., Zou, Y., & Schaub, F. (2019). Put Your Warning Where Your Link Is: Improving and Evaluating Email Phishing Warnings. Human Factors in Computing Systems (CHI): Proceedings of the Conference, May 4–9, 2019, Glasgow, Scotland, UK (No. 518, pp. 1–15). https://doi.org/10.1145/3290605.3300748.
9. Moran, K. (2025, October 26). The Aesthetic-Usability Effect. Nielsen Norman Group: UX-Training, Consulting, & Research. https://www.nngroup.com/articles/aesthetic-usability-effect/.
10. Duan, P., Chen, C.-Y., Li, G., Hartmann, B., & Li, Y. (2024). Generating Automatic Feedback on UI Mockups with Large Language Models. Human Factors in Computing Systems: Proceedings of the CHI International Conference, May 11–16, 2024, Honolulu, HI, United States (pp. 1–20). https://doi.org/10.1145/3613904.3642782.
11. Guerino, G., Rodrigues, L., Capeleti, B., Mello, R. F., Freire, A., & Zaina, L. (2025). Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation. arXiv preprint, arXiv:2506.16345v1.
12. Zhong, R., Hsieh, G., & McDonald, D. (2024). How can LLMs support UX Practitioners with image-related tasks? GenAICHI: Workshop on Generative AI and HCI (pp. 1–6). https://generativeaiandhci.github.io/papers/2024/genaichi2024_29.pdf.
13. Renaud, K., & Coles-Kemp, L. (2022). Accessible and Inclusive Cyber Security: A Nuanced and Complex Challenge. SN Computer Science, 3. https://doi.org/10.1007/s42979-022-01239-1.
14. Renaud, K. (2021). Accessible cyber security: the next frontier? Information Systems Security and Privacy: Proceedings of the 7th International Conference, February 11–13, 2021 (pp. 9–18). https://doi.org/10.5220/0010419500090018.
15. Web Accessibility in Mind. (2025, October 26). WebAIM: Web Accessibility Evaluation Report. https://webaim.org/projects/million/.
16. World Wide Web Consortium. (2025, May 6). Web Content Accessibility Guidelines (WCAG) 2.1. https://www.w3.org/TR/WCAG21/.
17. Flutter. (2025, October 20). Flutter Documentation. https://flutter.dev/docs.
18. Hnatiienko, H. M., & Snytiuk, V. Ye. (2008). Ekspertni tekhnolohii pryiniattia rishen: Monohrafiia [Expert decision-making technologies: Monograph]. Kyiv: TOV «Maklaut», 444 p. http://dspace.nbuv.gov.ua/handle/123456789/56847 [in Ukrainian].
19. European Telecommunications Standards Institute. (n.d.). Accessibility requirements for ICT products and services. https://www.etsi.org/deliver/etsi_en/301500_301599/301549/03.02.01_60/en_301549v030201p.pdf.
Copyright © 2025, Olena Prysiazhniuk, Anna Puzikova, Dmytro Oryshechko