Year: 2022 | Volume: 36 | Issue: 3 | Page: 296-307
Performance of deep-learning artificial intelligence algorithms in detecting retinopathy of prematurity: A systematic review
Amelia Bai1, Christopher Carty2, Shuan Dai3
1 Department of Ophthalmology, Queensland Children's Hospital, Brisbane; Centre for Children's Health Research, Brisbane; School of Medical Science, Griffith University, Gold Coast, Australia
2 Griffith Centre of Biomedical and Rehabilitation Engineering (GCORE), Menzies Health Institute Queensland, Griffith University Gold Coast; Department of Orthopaedics, Children's Health Queensland Hospital and Health Service, Queensland Children's Hospital, Brisbane, Australia
3 Department of Ophthalmology Queensland Children's Hospital, Brisbane; School of Medical Science, Griffith University, Gold Coast; University of Queensland, Australia
Date of Submission: 29-Sep-2021
Date of Decision: 09-Nov-2021
Date of Acceptance: 12-Nov-2021
Date of Web Publication: 14-Oct-2022
Correspondence Address:
Dr. Shuan Dai
Level 7d Surgical Directorate, 501 Stanley Street, South Brisbane QLD 4104
Source of Support: None, Conflict of Interest: None
PURPOSE: Artificial intelligence (AI) offers considerable promise for retinopathy of prematurity (ROP) screening and diagnosis. The development of deep-learning algorithms to detect the presence of disease may contribute to sufficient screening, early detection, and timely treatment for this preventable blinding disease. This review aimed to systematically examine the literature on AI algorithms for detecting ROP. Specifically, we focused on the performance of deep-learning algorithms as measured by sensitivity, specificity, and area under the receiver operating curve (AUROC) for both the detection and grading of ROP.
METHODS: We searched Medline OVID, PubMed, Web of Science, and Embase for studies published from January 1, 2012, to September 20, 2021. Studies evaluating the diagnostic performance of deep-learning models based on retinal fundus images with expert ophthalmologists' judgment as reference standard were included. Studies which did not investigate the presence or absence of disease were excluded. Risk of bias was assessed using the QUADAS-2 tool.
RESULTS: Twelve of the 175 studies identified were included. Five studies measured performance in detecting the presence of ROP and seven studies determined the presence of plus disease. The average AUROC across the 11 studies reporting it was 0.98. The average sensitivity and specificity for detecting ROP were 95.72% and 98.15%, respectively, and for detecting plus disease were 91.13% and 95.92%, respectively.
CONCLUSION: The diagnostic performance of deep-learning algorithms in published studies was high. Few studies presented externally validated results or compared performance to expert human graders. Large-scale prospective validation alongside robust study design could improve future studies.
Keywords: Artificial intelligence, deep learning, diagnosis, retinopathy of prematurity, screening
How to cite this article:
Bai A, Carty C, Dai S. Performance of deep-learning artificial intelligence algorithms in detecting retinopathy of prematurity: A systematic review. Saudi J Ophthalmol 2022;36:296-307
How to cite this URL:
Bai A, Carty C, Dai S. Performance of deep-learning artificial intelligence algorithms in detecting retinopathy of prematurity: A systematic review. Saudi J Ophthalmol [serial online] 2022 [cited 2023 Mar 29];36:296-307. Available from: https://www.saudijophthalmol.org/text.asp?2022/36/3/296/358593
Introduction
The concept of artificial intelligence (AI) dates back to the 1950s, when Alan Turing first discussed how to build and test intelligent machines in the paper “Computing Machinery and Intelligence.” It was not until 1956, however, at the seminal Dartmouth Summer Research Project on Artificial Intelligence, that John McCarthy officially coined the term AI. This conference introduced a computer program designed to mimic the problem-solving skills of a human, catalyzing the next 20 years of AI research. Today, AI is incorporated into many applications of day-to-day life, including speech recognition, photo captioning, language translation, robotics, and even self-driving cars. These applications are made possible through deep learning, an advanced form of AI that self-learns from large training sets to program itself to perform certain tasks. The application of AI has gained popularity in the medical diagnostic field, and promising outcomes have resulted from deep-learning screening algorithms in ophthalmology.
There has been particular success in AI screening for diabetic retinopathy, with several groups reporting deep-learning algorithms detecting diabetic retinopathy at sensitivities of 83%–90% and specificities of 92%–98%. Moreover, the successful validation of these algorithms has seen progression to “real-world” implementation of screening programs through prospective evaluation. One such study produced a sensitivity of 83.3% and specificity of 92.5% in detecting referable diabetic retinopathy in a prospective evaluation. Similarly promising results are being reported by many other groups utilizing deep learning for the diagnosis of other ophthalmic conditions, including diabetic macular edema, age-related macular degeneration, glaucoma, and retinopathy of prematurity (ROP).
ROP is a retinal vascular proliferative disease affecting premature infants whose diagnosis depends on timely screening. Globally, an estimated 50,000 children or more are blind from ROP, and it remains the leading cause of preventable childhood blindness. Advances in retinal imaging mean the disease is now readily identifiable on retinal photographs, making it a strong candidate for deep learning. As survival rates of premature infants continue to increase with medical advances, the demand for ROP screening is rapidly exceeding the capacity of available specialist ophthalmologists. For this reason, reports of deep-learning models matching or exceeding human experts in ROP diagnostic performance have generated considerable interest. It remains fundamental, however, that this enthusiasm does not override the need for critical appraisal, as a missed diagnosis of ROP can result in significant sequelae such as blindness. Therefore, any deep-learning screening algorithm will need to show high diagnostic performance, high sensitivity, generalizability, and applicability to the real-world setting. In anticipation of deep-learning diagnostic tools being implemented into clinical practice, it is judicious to systematically review the body of evidence supporting AI screening for ROP. This systematic review aims to critically appraise the current state of diagnostic performance of deep-learning algorithms for ROP screening, with particular consideration of study design, algorithm development, type of validation, performance compared to clinicians, and diagnostic accuracy.
Methods
Search strategy and selection criteria
Studies that developed or validated a deep-learning model for the diagnosis of ROP and compared the accuracy of algorithm diagnoses to ROP experts were included in this systematic review. We searched MEDLINE-Ovid, PubMed, Web of Science, and Embase for studies published from January 1, 2012, to September 20, 2021. The full search strategy for each database is available in [Appendix 1]. The cutoff of January 1, 2012 was prespecified based on the breakthrough in deep-learning approaches marked by the AlexNet model. The search was first performed on July 10, 2020, revised on May 23, 2021, and updated on September 20, 2021. Manual searches of bibliographies and citations from included studies were also completed to identify additional articles potentially missed by the searches.
Eligibility assessment was conducted by two reviewers who independently screened the titles and abstracts of search results. Only studies aiming to identify the presence of the disease of interest, ROP, through AI algorithms were included. We accepted standard-of-care diagnosis, expert opinion, or consensus as adequate reference standards to classify the absence or presence of disease. We excluded studies that did not test diagnostic performance or that investigated accuracy of image segmentation rather than disease classification. Studies assessing the ability to classify disease severity were accepted if they incorporated primary results of disease detection. Review articles, conference abstracts, and studies that presented duplicate data were excluded. We assessed the risk of bias in patient selection, index test, reference standard, and flow and timing of each study using QUADAS-2. The full assessment of bias can be found in [Appendix 2].
This systematic review was completed following the recommendations of the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement, and the research question was formulated according to the CHARMS checklist for systematic reviews of prediction models. Methods of analysis and inclusion criteria were specified in advance.
Data were extracted independently by two reviewers (AB and SD) using a predefined data extraction sheet, followed by cross-checking. Any discrepancies were discussed with a third reviewer (CC). Demographics and sample size (gestational age [GA], birth weight, number of participants, and number of images), data characteristics (data source, inclusion and exclusion criteria, and image augmentation), algorithm development (architecture, transfer learning, and number of images for training and tuning), algorithm validation (reference standard, number of experts, same method for assessing reference standard, and internal and external validation), and results (sensitivity, specificity, area under the receiver operating characteristic curve (AUROC), human graders, and external validation if applicable) were sought. Two papers produced different algorithms from different data sets or with different identification tasks and were therefore recorded as separate algorithms in the Results section. Data from all 12 papers were included, and any missing information was recorded. Where sensitivity and specificity were not explicitly reported but could be calculated from a confusion matrix, the calculated results were included.
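Where a study reported only a confusion matrix, sensitivity and specificity follow directly from the four cell counts. A minimal sketch in Python, using hypothetical counts not drawn from any included study:

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Derive sensitivity and specificity from 2x2 confusion-matrix counts.

    tp/fn: diseased images correctly/incorrectly classified
    fp/tn: healthy images incorrectly/correctly classified
    """
    sensitivity = tp / (tp + fn)  # true-positive rate among diseased images
    specificity = tn / (tn + fp)  # true-negative rate among healthy images
    return sensitivity, specificity

# Hypothetical counts, for illustration only
se, sp = sensitivity_specificity(tp=94, fn=6, fp=2, tn=98)
print(f"sensitivity={se:.1%}, specificity={sp:.1%}")  # sensitivity=94.0%, specificity=98.0%
```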
Results
Our search identified 175 records, of which 99 were screened [Figure 1]. Thirty full-text articles were assessed for eligibility, and 12 studies were included in the systematic review. Eighteen studies were excluded due to no test of diagnostic performance, no classification task, no internal validation, no AI algorithm, or not being based on standard clinical care.
Data characteristics and demographics
All twelve studies obtained retrospective images as part of routine clinical care or from local screening programs. Seven of these studies collected images from China, one from India, one from North America, one from sites in America and Mexico, one from America and Nepal, and one from New Zealand. Image collection dates across studies ranged from July 2011 to June 2020. Three studies specified their inclusion criteria, and five other studies specified their exclusion criteria. Poor-quality images were excluded in five studies, and image augmentation was applied in seven studies. These characteristics are summarized in [Table 1]. Seven studies recorded demographic information, among which the mean GA was 30.9 weeks and the mean birth weight was 1501.25 g. A total of 178,459 images were used across all 12 studies, ranging from 2668 to 52,249 images per study. Five studies formulated an algorithm to detect ROP, and seven studies created an algorithm to detect the presence of plus disease from a total of 5358 plus disease images. Full details of demographics and sample size are found in [Table 2].
Table 2: Patient demographics and sample size for the 12 included studies
Algorithm development and validation
Convolutional neural networks formed the basis of the algorithms developed in all twelve studies. A variety of architectures were used for transfer learning, including ResNet, ImageNet, U-Net, and VGG-16 [Table 3], whereas one study did not use a transfer-learning approach. The majority of studies used <6000 images to train their algorithm; however, five studies utilized >10,000 images for algorithm development. The reference standard across all twelve studies was based on disease diagnosis by 1–5 expert graders, with an average of 2.6 human graders agreeing on each image per study. A variety of internal validation methods were recorded, including random split-sample validation and cross-validation [Table 4]. Five studies obtained external validation of their AI algorithms, of which one completed a prospective evaluation of algorithm performance.
The performance of each algorithm is listed in [Table 5]. Five studies recorded the ability of their algorithm to detect the presence of ROP, with an average area under the receiver operating curve (AUROC) of 0.984. Sensitivity and specificity were recorded in four of those studies and averaged 95.72% and 98.15%, respectively. One study compared human grader performance to the AI algorithm, revealing similar sensitivities (94.1% AI, 93.5% human) and specificities (99.3% AI, 99.5% human) for ROP diagnosis. Two of the five studies underwent external validation, revealing an average sensitivity and specificity of 60% and 88.3%, respectively, for detecting the presence of disease. The seven other studies determined the ability of their algorithm to detect the presence of plus disease. Among these, six studies measured AUROC, with an average of 0.98. The average sensitivity and specificity for detecting plus disease, recorded in six studies, were 91.13% and 95.92%, respectively. External validation occurred in two of these studies and produced an average sensitivity of 93.45% and specificity of 87.35%. Performance at detecting pre-plus disease was measured in two articles, producing an average sensitivity of 96.2% and specificity of 95.7%. This compares with four studies which measured performance in determining the stage of ROP, showing an average sensitivity and specificity of 89.07% and 94.63%, respectively.
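The AUROC values summarized above have a useful probabilistic reading: AUROC equals the probability that a randomly chosen diseased image receives a higher model score than a randomly chosen healthy one (the Mann-Whitney statistic). A small illustrative sketch, with hypothetical model scores rather than data from any included study:

```python
def auroc(scores_pos, scores_neg):
    """AUROC via the Mann-Whitney statistic: the probability that a
    randomly chosen positive (diseased) image is scored above a
    randomly chosen negative (healthy) image, with ties counting half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores, for illustration only
pos = [0.9, 0.8, 0.75, 0.6]   # scores on diseased images
neg = [0.7, 0.4, 0.3, 0.2]    # scores on healthy images
print(auroc(pos, neg))  # 0.9375
```

A perfect classifier scores every diseased image above every healthy image and yields an AUROC of 1.0, which is why the pooled averages near 0.98 reported here indicate near-perfect ranking on the internal test sets.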
Discussion
We found that deep-learning algorithms for ROP screening demonstrated sensitivity and specificity comparable to neural network algorithms for diabetic retinopathy. Although this supports the potential for deep-learning algorithms to be implemented as real-world diagnostic tools, several methodological deficiencies common across the included studies need to be considered. These include the quality of the reference standard, use of sample size calculations, external validation, definition of the presence or absence of disease, and the need for prospective evaluation.
First, we found variability in algorithm diagnostic targets, with the 12 papers split between diagnosing the presence of ROP as a whole and the presence of plus disease. It is important to differentiate these diagnostic targets, as the clinical implications of the findings differ. In addition, most studies utilized a reference standard graded by, on average, 2–3 experts, with only one study producing a reference standard diagnosed by five clinicians per image. It is well reported that there is significant intergrader variability in ROP diagnosis due to its subjective nature; therefore, caution is needed in recognizing the potential for grader bias in studies utilizing only a few expert graders.
Second, there was a large variety in the number of images used to train each algorithm, ranging from 289 to 39,029 images. Convolutional neural networks learn by computing the error between the machine's output and the image diagnosis; hence, the more images used to train a machine, the smaller the error of its diagnostic output. For this reason, the studies with sample sizes in the tens of thousands were likely to have more reliable results than those trained on hundreds or thousands of images. Nonetheless, no studies reported formal sample size calculations to ensure sufficient sizing. Despite the challenge of sample size calculations in the context of AI algorithms, they remain a principal component of any study design, yet only one paper reported sample size as a limitation. Future studies should consider formulating sample size calculations to justify the number of images required for algorithm design.
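One established approach that could be adapted here (not used by any included study) is Buderer's formula for diagnostic accuracy studies, which estimates the number of images needed so that the confidence interval around an expected sensitivity has a chosen half-width, given the disease prevalence in the image set. The planning values below are hypothetical:

```python
import math

def n_for_sensitivity(expected_se, precision, prevalence, z=1.96):
    """Buderer's formula: total images needed so the 95% CI around an
    expected sensitivity has half-width `precision`, given the prevalence
    of disease among the screened images."""
    # Required number of true positives + false negatives (diseased images)
    tp_fn = (z ** 2) * expected_se * (1 - expected_se) / precision ** 2
    # Scale up by prevalence to get the total image count
    return math.ceil(tp_fn / prevalence)

# Hypothetical planning values: 95% expected sensitivity, +/-3% precision,
# 10% prevalence of ROP among screened images
print(n_for_sensitivity(0.95, 0.03, 0.10))  # 2028
```

An analogous calculation with specificity and (1 - prevalence) sizes the healthy arm; the larger of the two drives the test-set size.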
Third, exclusion of poor-quality images or image augmentation may affect the performance of these deep-learning algorithms in the real-world clinical setting. This factor may inflate the apparent diagnostic performance of an algorithm, as high-quality images correlate with high-quality diagnoses and smaller algorithm errors. It is understandable that most papers exclude poor-quality images; however, it is important to keep this within reason. The quality of images used to train an algorithm should correspond to the quality of images taken in the clinical setting so that reported performance reflects real-life performance. It is also for this reason that external validation using an image set outside the training set is crucial to determine the generalizability of a study. Only five of the twelve studies completed external validation; all but one, which showed equivalent performance, revealed inferior algorithm performance compared with their test set. This finding highlights the need for out-of-sample external validation of these screening algorithms to better understand how an algorithm will perform in the clinical setting.
Fourth, the ground truth or reference standard labels were mostly derived from data collected for other purposes, such as a database of ROP images or retrospective routine clinical care notes. Although there exists an internationally accepted guideline for defining the presence and stage of ROP, the International Classification of Retinopathy of Prematurity revisited (ICROP), more recently updated in a 2021 third edition, only five studies specifically mentioned ICROP in their methods for defining the reference standard. As ICROP is the universally adopted diagnostic criterion for grading ROP, it is reasonable to assume that the other seven studies also used these guidelines; however, the criteria for the presence or absence of disease should always be clearly defined in AI studies.
Finally, only one study completed a prospective evaluation of their algorithm, a process that is vital for assessing real-world performance. The majority of studies assessed deep-learning diagnostic accuracy in isolation, without external validation as mentioned earlier or comparison to experts. Only three studies provided a comparison of AI performance with human performance, allowing for evaluation of real-world application. Without comparison of AI to human performance, the results from the other studies are limited in their ability to be extrapolated into health-care delivery. For a deep-learning diagnostic tool to be applicable in clinical bedside screening, it must perform comparably to or better than the gold standard, in this case expert diagnosis. More work is required to validate the performance of AI algorithms in comparison to human graders, ideally using the same external test dataset.
It is clear from this systematic review that there is still no well-designed randomized head-to-head comparison of an effective, externally validated AI algorithm against human performance in real time. A study of this magnitude could reveal the clinical implications of an algorithm implemented in the clinical setting. For this reason, prospective evaluations of these deep-learning diagnostic tests are crucial to realizing the potential of AI in both diagnostic and therapeutic medicine. We recognize that there is a significant “black box” issue in deep learning, where the image features learned by an algorithm are unknown to the user. It is for this reason that many clinicians are sceptical about entrusting clinical care to AI, especially when the clinical features clinicians are familiar with may not be the same features used by an algorithm. This further emphasizes the need for well-executed studies that minimize bias and are thoroughly and transparently reported. Most of the concerns we have highlighted in this review are avoidable with robust design, and it remains critical that these AI diagnostic tests are evaluated in the context of their intended clinical pathway.
Conclusion
AI has been heralded as a revolutionary technology for many industries, and deep-learning algorithms for the diagnosis of ROP are no exception. Despite the issues we have highlighted in this systematic review, the performance of the twelve deep-learning algorithms evaluated was high, with all studies reporting an AUROC at or above 0.94. These results outline the ability of AI algorithms to perform comparably to or exceed human experts and provide the groundwork for future large-scale prospective studies. Although there are clear screening and treatment guidelines, ROP disease burden continues to rise as increased survival of preterm infants coincides with advancements in medical care. The inadequate accessibility and number of experienced ophthalmologists continue to limit ROP screening and diagnosis. Consequently, the burden of ROP visual impairment is expected to increase unless a novel strategy such as deep-learning diagnostic algorithms becomes available. The successful application of AI in ROP could revolutionize disease diagnosis through high predictive performance and streamlined efficiency. The clinical implications of implementation into real-world clinical practice are substantial, with translation into highly accessible, high-quality, timely screening and a significant reduction in screening costs. AI may therefore become ubiquitous and indispensable for ROP screening, and it is important that high-quality research continues to aid the translation of this transformative technology in order to reduce the incidence of visual loss and blindness from this preventable disease.
Financial support and sponsorship
Conflicts of interest
There are no conflicts of interest.
References
Turing A. Computing machinery and intelligence. Mind 1950;59:433-60.
McCarthy J, Minsky M, Rochester N, Shannon C. A proposal for the Dartmouth Summer Research Project on Artificial Intelligence, 1955. AI Magazine 2006;27:12.
Wu J, Yılmaz E, Zhang M, Li H, Tan KC. Deep spiking neural networks for large vocabulary automatic speech recognition. Front Neurosci 2020;14:199.
Popel M, Tomkova M, Tomek J, Kaiser Ł, Uszkoreit J, Bojar O, et al. Transforming machine translation: A deep learning system reaches news translation quality comparable to human professionals. Nat Commun 2020;11:4381.
Fayyad J, Jaradat MA, Gruyer D, Najjaran H. Deep learning sensor fusion for autonomous vehicle perception and localization: A review. Sensors (Basel) 2020;20:E4220.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521:436-44.
Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316:2402-10.
Zhang Y, Shi J, Peng Y, Zhao Z, Zheng Q, Wang Z, et al. Artificial intelligence-enabled screening for diabetic retinopathy: A real-world, multicenter and prospective study. BMJ Open Diabetes Res Care 2020;8:e001596.
De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med 2018;24:1342-50.
Burlina PM, Joshi N, Pekala M, Pacheco KD, Freund DE, Bressler NM. Automated grading of age-related macular degeneration from color fundus images using deep convolutional neural networks. JAMA Ophthalmol 2017;135:1170-6.
Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology 2018;125:1199-206.
Brown JM, Campbell JP, Beers A, Chang K, Ostmo S, Chan RVP, et al. Automated diagnosis of plus disease in retinopathy of prematurity using deep convolutional neural networks. JAMA Ophthalmol 2018;136:803-10.
Tan Z, Simkin S, Lai C, Dai S. Deep learning algorithm for automated diagnosis of retinopathy of prematurity plus disease. Transl Vis Sci Technol 2019;8:23.
Blencowe H, Lawn JE, Vazquez T, Fielder A, Gilbert C. Preterm-associated visual impairment and estimates of retinopathy of prematurity at regional and global levels for 2010. Pediatr Res 2013;74 Suppl 1:35-49.
Gilbert C. Retinopathy of prematurity: A global perspective of the epidemics, population of babies at risk and implications for control. Early Hum Dev 2008;84:77-82.
Valentine PH, Jackson JC, Kalina RE, Woodrum DE. Increased survival of low birth weight infants: Impact on the incidence of retinopathy of prematurity. Pediatrics 1989;84:442-5.
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115:211-52.
Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529-36.
Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. Int J Surg 2021;88:105906.
Moons KG, de Groot JA, Bouwmeester W, Vergouwe Y, Mallett S, Altman DG, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: The CHARMS checklist. PLoS Med 2014;11:e1001744.
Chen JS, Coyner AS, Ostmo S, Sonmez K, Bajimaya S, Pradhan E, et al. Deep learning for the diagnosis of stage in retinopathy of prematurity: Accuracy and generalizability across populations and cameras. Ophthalmol Retina 2021;5:1027-35.
Wang J, Ju R, Chen Y, Zhang L, Hu J, Wu Y, et al. Automated retinopathy of prematurity screening using deep neural networks. EBioMedicine 2018;35:361-8.
Campbell JP, Singh P, Redd TK, Brown JM, Shah PK, Subramanian P, et al. Applications of artificial intelligence for retinopathy of prematurity screening. Pediatrics 2021;147:e2020016618.
Hu J, Chen Y, Zhong J, Ju R, Yi Z. Automated analysis for retinopathy of prematurity by deep neural networks. IEEE Trans Med Imaging 2019;38:269-79.
Huang YP, Basanta H, Kang EY, Chen KJ, Hwang YS, Lai CC, et al. Automated detection of early-stage ROP using a deep convolutional neural network. Br J Ophthalmol 2021;105:1099-103.
Mao J, Luo Y, Liu L, Lao J, Shao Y, Zhang M, et al. Automated diagnosis and quantitative analysis of plus disease in retinopathy of prematurity based on deep convolutional neural networks. Acta Ophthalmol 2020;98:e339-45.
Ramachandran S, Niyas P, Vinekar A, John R. A deep learning framework for the detection of plus disease in retinal fundus images of preterm infants. Biocybern Biomed Eng 2021;41:362-75.
Tong Y, Lu W, Deng QQ, Chen C, Shen Y. Automated identification of retinopathy of prematurity by image-based deep learning. Eye Vis (Lond) 2020;7:40.
Wang J, Ji J, Zhang M, Lin JW, Zhang G, Gong W, et al. Automated explainable multidimensional deep learning platform of retinal images for retinopathy of prematurity screening. JAMA Netw Open 2021;4:e218758.
Yildiz VM, Tian P, Yildiz I, Brown JM, Kalpathy-Cramer J, Dy J, et al. Plus disease in retinopathy of prematurity: Convolutional neural network performance using a combined neural network and feature extraction approach. Transl Vis Sci Technol 2020;9:10.
Zhang Y, Wang L, Wu Z, Zeng J, Chen Y, Tian R, et al. Development of an automated screening system for retinopathy of prematurity using a deep neural network for wide-angle retinal images. IEEE Access 2019;7:10232-41.
Coyner AS, Swan R, Campbell JP, Ostmo S, Brown JM, Kalpathy-Cramer J, et al. Automated fundus image quality assessment in retinopathy of prematurity using deep convolutional neural networks. Ophthalmol Retina 2019;3:444-50.
Greenwald MF, Danford ID, Shahrawat M, Ostmo S, Brown J, Kalpathy-Cramer J, et al. Evaluation of artificial intelligence-based telemedicine screening for retinopathy of prematurity. J AAPOS 2020;24:160-2.
Gupta K, Campbell JP, Taylor S, Brown JM, Ostmo S, Chan RVP, et al. A quantitative severity scale for retinopathy of prematurity using deep learning to monitor disease regression after treatment. JAMA Ophthalmol 2019;137:1029-36.
Redd T, Campbell J, Brown J, Kim S, Ostmo S, Chan R, et al. Utilization of a deep learning image assessment tool for epidemiologic surveillance of retinopathy of prematurity. Invest Ophthalmol Vis Sci 2019;60:580-4.
Smith K, Kim S, Goldstein I, Ostmo S, Chan R, Brown J, et al. Quantitative analysis of aggressive posterior retinopathy of prematurity using deep learning. Invest Ophthalmol Vis Sci 2019;60:4759.
Taylor S, Kishan G, Campbell P, Brown J, Ostmo S, Chan R, et al. Invest Ophthalmol Vis Sci 2018;59:3937.
Wallace DK, Zhao Z, Freedman SF. A pilot study using “ROPtool” to quantify plus disease in retinopathy of prematurity. J AAPOS 2007;11:381-7.
Wang J, Zhang G, Lin J, Ji J, Qiu K, Zhang M. Application of standardized manual labeling on identification of retinopathy of prematurity images in deep learning. Zhonghua Shiyan Yanke Zazhi 2019;37:653-7.
Campbell J, Chan R, Ostmo S, Anderson J, Singh P, Kalpathy-Cramer J, Chiang M. Analysis of the relationship between retinopathy of prematurity zone, stage, extent and a deep learning-based vascular severity scale. Invest Ophthalmol Vis Sci 2020;61:2193.
Choi RY, Brown JM, Kalpathy-Cramer J, Chan RV, Ostmo S, Chiang MF, et al. Variability in plus disease identified using a deep learning-based retinopathy of prematurity severity scale. Ophthalmol Retina 2020;4:1016-21.
Ramachandran S, Kochitty S, Vinekar V, John R. A fully convolutional neural network approach for the localization of optic disc in retinopathy of prematurity diagnosis. J Intell Fuzzy Syst 2020;38:6269-78.
Worrall DE, Wilson C, Brostow GJ. Automated retinopathy of prematurity case detection with convolutional neural networks. In: Deep Learning and Data Labeling for Medical Applications. DLMIA, LABELS; 2016. p. 68-76.
Ataer-Cansizoglu E, Bolon-Canedo V, Campbell JP, Bozkurt A, Erdogmus D, Kalpathy-Cramer J, et al. Computer-based image analysis for plus disease diagnosis in retinopathy of prematurity: Performance of the “i-ROP” system and image features associated with expert diagnosis. Transl Vis Sci Technol 2015;4:5.
Touch P, Wu Y, Kihara Y, Zepeda E, Gillette T, Cabrera M, et al. Development of AI deep learning algorithms for the quantification of retinopathy of prematurity. J Invest Med 2019;67:209.
Wang S, Zhang Y, Lei S, Zhu H, Li J, Wang Q, et al. Performance of deep neural network-based artificial intelligence method in diabetic retinopathy screening: A systematic review and meta-analysis of diagnostic test accuracy. Eur J Endocrinol 2020;183:41-9.
Gschließer A, Stifter E, Neumayer T, Moser E, Papp A, Pircher N, et al. Inter-expert and intra-expert agreement on the diagnosis and treatment of retinopathy of prematurity. Am J Ophthalmol 2015;160:553-60.e3.
Campbell JP, Ataer-Cansizoglu E, Bolon-Canedo V, Bozkurt A, Erdogmus D, Kalpathy-Cramer J, et al. Expert diagnosis of plus disease in retinopathy of prematurity from computer-based image analysis. JAMA Ophthalmol 2016;134:651-7.
International Committee for the Classification of Retinopathy of Prematurity. The international classification of retinopathy of prematurity revisited. Arch Ophthalmol 2005;123:991-9.
Chiang MF, Quinn GE, Fielder AR, Ostmo SR, Paul Chan RV, Berrocal A, et al. International classification of retinopathy of prematurity, third edition. Ophthalmology 2021;128:e51-68.