
Siberian Journal of Clinical and Experimental Medicine

Sample size for assessing a diagnostic accuracy of AI-based software in radiology

https://doi.org/10.29001/2073-8552-2024-39-3-188-198

Abstract

Introduction. Determining the minimum sample size for a given task is an important yet insufficiently studied problem. Many methods exist, but most of them are not applicable to the validation of AI-based software.

Aim: To consider a methodology for determining the balance of the “norm”/“abnormality” classes and to propose a statistical approach for determining the amount of data needed to test (validate) AI-based software.

Material and Methods. The outputs of AI-based software were analyzed on a dataset of mammograms. Mammograms were classified as showing breast cancer (“abnormality”) or no breast cancer (“norm”). The general population contained 123,301 unique studies, with an original class balance of 89.3% “norm” to 10.7% “abnormality”. The output of the AI-based software (an ML algorithm) was taken to be the probability that pathology is present in the study as a whole. The ground truth (GT) was coded as 0 when a doctor assigned BI-RADS category 1 or 2 and as 1 for BI-RADS categories 3, 4, or 5. Each data sample was passed to the AI-based software for processing, and a quality metric, AUC ROC, was calculated from its output. These steps were repeated 10,000 times for every studied “norm”/“abnormality” balance. Mean AUC ROC values were then calculated across the different random data series with the same balance, and these means were analyzed.
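The resampling procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`auc_roc`, `resample_auc`), the synthetic beta-distributed scores, sampling without replacement, and the use of the sample standard deviation for the coefficient of variation are all assumptions made for illustration. AUC ROC is computed here through the rank-based Mann-Whitney formulation.

```python
import random
import statistics


def auc_roc(labels, scores):
    """AUC ROC via the Mann-Whitney U statistic; tied scores get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def resample_auc(norm_scores, abnormal_scores, n_studies, abnormal_share,
                 repeats=10_000, seed=0):
    """Return (mean, coefficient of variation) of AUC ROC over repeated draws."""
    rng = random.Random(seed)
    n_abn = round(n_studies * abnormal_share)
    n_norm = n_studies - n_abn
    labels = [1] * n_abn + [0] * n_norm
    aucs = []
    for _ in range(repeats):
        # Assumption: each sample is drawn without replacement from each class.
        pos = rng.sample(abnormal_scores, n_abn)
        neg = rng.sample(norm_scores, n_norm)
        aucs.append(auc_roc(labels, pos + neg))
    mean = statistics.fmean(aucs)
    # Assumption: CV is the sample standard deviation divided by the mean.
    return mean, statistics.stdev(aucs) / mean


# Synthetic stand-in scores (hypothetical, NOT the study's data).
demo_rng = random.Random(42)
norm_scores = [demo_rng.betavariate(2, 5) for _ in range(5000)]
abnormal_scores = [demo_rng.betavariate(5, 2) for _ in range(1000)]

for n in (30, 70, 190):
    mean, cv = resample_auc(norm_scores, abnormal_scores, n, 0.5, repeats=200)
    print(n, round(mean, 3), round(cv, 4))
```

The demonstration uses 200 repeats to keep the run short; the study reports 10,000 repeats per class balance. Scanning `n_studies` and `abnormal_share` over a grid and recording the coefficient of variation at each point would give a curve of the kind the Results summarize.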

Results. The coefficient of variation of the AUC ROC values reaches its maximum at 190 studies for a 10% “abnormality” share, at 80 studies for a 20% share, at 120 studies for a 30% share, at 110 studies for a 40% share, and at 70 studies for a 50% share.

Conclusion. When testing AI-based software, it should be taken into account that the number of studies at which AUC ROC values are most heterogeneous (deviate most from the mean) differs between class balances. If the purpose of validation is to establish the worst-case behavior of AUC ROC, then for the software studied the “abnormality” share should be 10% and the number of studies 190. If validation is carried out with a limited amount of data, the “abnormality” share should be 50% and the number of studies 70.

About the Authors

T. M. Bobrovskaya
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department (Moscow Center for Diagnostics and Telemedicine)
Russian Federation

Tatiana M. Bobrovskaya, Junior Research Scientist, Department of Innovative Technologies

24, Petrovka str., bld. 1, Moscow, 127051



Yu. A. Vasilev
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department (Moscow Center for Diagnostics and Telemedicine); Pirogov National Medical and Surgical Center
Russian Federation

Yuriy A. Vasilev, Cand. Sci. (Med.), Director of Moscow Center for Diagnostics and Telemedicine; Head of the Department of Radiation Diagnostics with a course of Clinical Radiology; Associate Professor of the Department, Pirogov National Medical and Surgical Center

24, Petrovka str., bld. 1, Moscow, 127051,

70, Nizhnyaya Pervomajskaya str., Moscow, 105203



N. Yu. Nikitin
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department (Moscow Center for Diagnostics and Telemedicine)
Russian Federation

Nikita Yu. Nikitin, Cand. Sci. (Phys.-Math.), Senior Research Scientist, Department of Medical Informatics, Radiomics and Radiogenomics

24, Petrovka str., bld. 1, Moscow, 127051



A. V. Vladzimirskyy
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department (Moscow Center for Diagnostics and Telemedicine); I.M. Sechenov First Moscow State Medical University of the Ministry of Health of the Russian Federation (Sechenov University)
Russian Federation

Anton V. Vladzimirskyy, Dr. Sci. (Med.), Deputy Director for Research, Moscow Center for Diagnostics and Telemedicine; Professor, Information and Internet Technology Department, I.M. Sechenov First Moscow State Medical University (Sechenov University)

24, Petrovka str., bld. 1, Moscow, 127051,

8, Trubeckaya str., bld. 2, Moscow, 119991



O. V. Omelyanskaya
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department (Moscow Center for Diagnostics and Telemedicine)
Russian Federation

Olga V. Omelyanskaya, Head of Division Management of the Directorate of Science

24, Petrovka str., bld. 1, Moscow, 127051



S. F. Chetverikov
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department (Moscow Center for Diagnostics and Telemedicine)
Russian Federation

Sergey F. Chetverikov, Cand. Sci. (Tech.), Head of the Sector of System Development for the Introduction of Medical Intelligent Technologies, Department of Medical Informatics, Radiomics and Radiogenomics

24, Petrovka str., bld. 1, Moscow, 127051



K. M. Arzamasov
Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department (Moscow Center for Diagnostics and Telemedicine); MIREA – Russian Technological University
Russian Federation

Kirill M. Arzamasov, Cand. Sci. (Med.), Head of the Department of Medical Informatics, Radiomics and Radiogenomics, Moscow Center for Diagnostics and Telemedicine; Associate Professor, Department of Artificial Technology, MIREA – Russian Technological University

24, Petrovka str., bld. 1, Moscow, 127051,

78, Vernadskogo prospekt, Moscow, 119454




For citations:


Bobrovskaya T.M., Vasilev Yu.A., Nikitin N.Yu., Vladzimirskyy A.V., Omelyanskaya O.V., Chetverikov S.F., Arzamasov K.M. Sample size for assessing a diagnostic accuracy of AI-based software in radiology. Siberian Journal of Clinical and Experimental Medicine. 2024;39(3):188-198. (In Russ.) https://doi.org/10.29001/2073-8552-2024-39-3-188-198


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2713-2927 (Print)
ISSN 2713-265X (Online)