Introduction
Diabetic retinopathy (DR) is a microvascular disorder caused by the long-term effects of diabetes that can lead to vision-threatening damage of the retina [1]. According to the Global Burden of Disease study, DR is the leading cause of blindness worldwide among people of working age [2]. Screening coupled with timely referral and treatment is a universally accepted strategy for blindness prevention [3]. Although there are several methods to diagnose DR, the most widely used is slit lamp biomicroscopy (fundoscopy).
The global prevalence of diabetes has nearly doubled in the last three decades, and 35% of patients suffer from some form of DR, making it the most common complication of the disease [4, 5]. Epidemiologic studies show that 11% of the world's population is currently over 60 years of age, a share projected to double by 2050 [6]. Because of this rapidly aging population, healthcare systems are expected to struggle with timely diagnosis and treatment. In fact, DR is the only cause of blindness whose prevalence has increased over the past three decades, suggesting that ophthalmology services are already failing to meet the growing need [2]. Therefore, new methods to streamline the process are being explored. One approach that is beginning to be established and integrated into healthcare systems worldwide is artificial intelligence (AI).
AI is often used as an umbrella term for the branch of computer science capable of analyzing complex data [7]. A convolutional neural network (CNN) is an AI-based system most commonly used to analyze visual images. It applies a set of algorithms that take an input image, evaluate the importance of the various objects present, and classify the image. The field of ophthalmology is well suited for AI studies because of the numerous digital techniques widely used in clinical practice, such as color fundus photography, optical coherence tomography (OCT), and computerized visual field testing. The collected data are stored in publicly available databases and used to develop AI algorithms capable of performing image-recognition diagnostic tasks [8]. In recent years, many studies on the detection and screening of DR using different algorithms have been published.
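To make this concrete, the following is a minimal illustrative sketch in Python (TensorFlow/Keras) of a small CNN for binary DR classification. It is not the architecture of any system reviewed here; the input size, layer widths, and training configuration are arbitrary assumptions.

```python
# Minimal sketch of a CNN for binary DR classification of fundus images.
# Hypothetical architecture for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dr_classifier(input_shape=(224, 224, 3)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Convolutional blocks extract local retinal features
        # (e.g., microaneurysms, haemorrhages) at increasing abstraction.
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.GlobalAveragePooling2D(),
        # Dense head maps the pooled features to a single DR probability.
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # P(DR present)
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auroc")])
    return model

model = build_dr_classifier()
model.summary()
```

In practice, the reviewed systems are far deeper and are trained on tens of thousands of labeled fundus images, but the input-convolve-pool-classify pipeline is the same.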
This review aims to evaluate the efficacy of different AI algorithms used by researchers in detecting DR from retinal fundus images.
Material and methods
This systematic literature review was carried out in accordance with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [9].
Information sources and search strategy
The literature search was performed using the PubMed and ScienceDirect databases. A manual review of references within the included articles was carried out to ensure all relevant studies were assessed. Keywords and combinations of possible synonyms used for the search were selected using the Medical Subject Headings (MeSH) dictionary. The terms used were "diabetic retinopathy" combined with "artificial intelligence", "deep neural network", "deep learning", "machine learning", and "neural networks". The databases were last searched in February 2022.
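The exact query strings are not reported verbatim; purely as an illustration of how the listed terms combine, a boolean query of the following shape can be assembled (a generic reconstruction, not the string actually submitted):

```python
# Hypothetical reconstruction of the boolean search string from the
# terms listed above; for illustration only.
ai_terms = ["artificial intelligence", "deep neural network", "deep learning",
            "machine learning", "neural networks"]
query = '"diabetic retinopathy" AND (' + " OR ".join(f'"{t}"' for t in ai_terms) + ")"
print(query)
# "diabetic retinopathy" AND ("artificial intelligence" OR "deep neural network"
# OR "deep learning" OR "machine learning" OR "neural networks")
```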
Study selection and eligibility criteria
Study selection criteria were developed using the PICO tool. PICO is an acronym for Patient, Intervention, Comparison, and Outcome. It is used to conceptualize the main research question and selection criteria by identifying the essential issue, intervention, and outcome for a diagnostic method. By clearly defining these terms, researchers can systematically examine potentially relevant studies against their stated selection criteria [10]. This review aimed to analyze articles describing the efficacy of AI models in detecting DR from retinal fundus images. Studies had to be written in English, published in the last 10 years, and have open access to the full text. Research papers using either publicly available datasets or their own were included. The performance of the AI-based algorithm had to be compared with that of an ophthalmologist, and the outcome measures had to include sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC). Inclusion and exclusion criteria are described in Table 1.
Table 1. Inclusion and exclusion criteria

| Inclusion criteria | Exclusion criteria |
| --- | --- |
| Publications written in English with open access to full text | Literature reviews and case reports |
| Published in the last 10 years | |
| Used AI to detect DR | |
| Diagnosis was made solely based on retinal images captured by fundus photography | |
| Index test was compared with ophthalmologist diagnosis | |
| Efficacy of method evaluated by sensitivity, specificity, and AUROC | |
Selection process and data collection process
After removing duplicates, two reviewers independently screened the titles and abstracts of the remaining studies. Studies that failed to meet the inclusion criteria were excluded, and the full texts of the relevant studies were assessed against the eligibility criteria. Disagreements were resolved by discussion or by consultation with a third reviewer. The PRISMA flow diagram displays specific information about the selection process.
Data extraction and outcome measures
Two reviewers extracted data into an electronic database using a standardized form. The collected data included the authors and year of publication, the name of the AI system used, the study sample size, the source of images, and the outcome measures: sensitivity, specificity, and AUROC.
Sensitivity and specificity are the two fundamental performance measures of a diagnostic test. Sensitivity describes the proportion of true positive cases correctly identified in a dataset, while specificity describes the proportion of true negatives [11]. However, there is significant variation in diagnostic confidence when making decisions based on image findings. One of the most popular measures that more accurately conveys the overall performance of a diagnostic test is the area under the receiver operating characteristic curve (AUROC) [12]. The AUROC represents the selected method's ability to correctly distinguish whether a condition is present or not. It is a combined measure of sensitivity and specificity across all decision thresholds and ranges between 0 and 1. The better an AI-based algorithm identifies true positives and true negatives, the higher the AUROC. An AUROC of 0.5 implies the prediction is no better than chance, while an AUROC of 1 implies perfect prediction [13]. Evaluating these three outcome measures allows an objective comparison of the efficacy of different diagnostic tests.
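As a minimal sketch of how these three measures are computed for a binary DR classifier (using scikit-learn and toy labels and scores, not data from any reviewed study):

```python
# Toy example: computing sensitivity, specificity, and AUROC for a
# binary DR classifier. Labels and scores are illustrative values.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # 1 = DR present (ground truth)
y_score = np.array([0.9, 0.8, 0.4, 0.2, 0.3, 0.6, 0.7, 0.1])  # model outputs
y_pred = (y_score >= 0.5).astype(int)          # binarize at a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)            # true-positive rate
specificity = tn / (tn + fp)            # true-negative rate
auroc = roc_auc_score(y_true, y_score)  # threshold-independent summary

print(f"Se={sensitivity:.2f}, Sp={specificity:.2f}, AUROC={auroc:.2f}")
# Se=0.75, Sp=0.75, AUROC=0.94
```

Note that sensitivity and specificity depend on the chosen decision threshold (0.5 here), while the AUROC summarizes performance across all thresholds.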
Quality assessment
Two independent reviewers used a modified seven-item checklist based on the methodological index for nonrandomized studies (MINORS) criteria to assess the quality of selected studies. The criteria included disclosure, study aim, input features, determination of ground truth labels, dataset distribution, performance metric, and explanation of the used AI model. Modified MINORS criteria have already been used in previously conducted systematic reviews for diagnostic studies [14].
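As an illustration of how such a checklist can be applied, the sketch below represents the seven items as a simple scoring structure. The 0-2 rating scale is the standard MINORS convention and is an assumption here, as is the example rating; the text does not state the exact scale used.

```python
# Illustrative scoring sketch for the modified seven-item MINORS checklist.
# Items follow the text; the 0/1/2 scale is the usual MINORS convention.
CHECKLIST = ["disclosure", "study aim", "input features",
             "determination of ground truth labels", "dataset distribution",
             "performance metric", "explanation of the AI model"]

def score_study(ratings: dict) -> int:
    """Sum per-item ratings: 0 = not reported, 1 = inadequate, 2 = adequate."""
    return sum(ratings[item] for item in CHECKLIST)

# Hypothetical ratings for one study:
example = {item: 2 for item in CHECKLIST}
example["disclosure"] = 1
print(score_study(example), "/", 2 * len(CHECKLIST))  # 13 / 14
```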
Results
Review of literature
The systematic search yielded a total of 2322 scientific publications. After the elimination of duplicates, 1363 potentially relevant studies remained. Titles and abstracts were then screened, after which 168 studies remained. The full texts of the remaining articles were examined for final inclusion, and 15 studies from 14 scientific publications were included in this systematic review. Reasons for exclusion are described in the PRISMA flow chart (Fig. 1).
Study characteristics
Pei et al. and Ming et al. used the EyeWisdom system [15, 16], Zhang et al. and Li et al. used the Inception V3 architecture [17, 18], Arenas-Cavalli et al. used the DART system [19], Roychowdhury et al. used the DREAM system [20], and Wang et al. used DeepDR and Lesion-Net in separate reports [21, 22]; the remaining studies did not specify a market name for their systems. Eight of the included studies used a CNN to detect DR [19, 21–27], while the others did not specify the architecture or used another kind of AI system. Some studies used publicly available datasets, while others collected their own data: eight studies used images from public datasets (Kaggle, MESSIDOR, MESSIDOR-2, E-Ophtha, DIARETDB1, EyePACS) [17, 20, 24–28], five studies used images from local hospitals [15, 19, 21, 23], and three studies used a combination of locally gathered images and public datasets [16, 18, 22]. Study sample sizes varied from 321 to 75137 images, with a total of 148431 images and an average of 9895 images per study. The full characteristics of the included studies are presented in Table 2.
Table 2. Characteristics and results of included studies

| Authors | Name of the system used | Study sample | Source of images | Sensitivity, % | Specificity, % | AUROC |
| --- | --- | --- | --- | --- | --- | --- |
| Arenas-Cavalli et al. 2022 [19] | DART | 1123 | Locally gathered | 94.6 | 74.3 | 0.915 |
| He et al. 2020 [23] | Not specified | 3556 | Locally gathered | 90.8 | 98.5 | 0.946 |
| Zhang et al. 2021 [17] | Inception V3 | 7025 | Kaggle dataset | 92.5 | 90.7 | 0.968 |
| Gargeya et al. 2017 [24] | Not specified | 75137 | MESSIDOR-2, E-Ophtha datasets | 94.0 | 98.0 | 0.97 |
| Li et al. 2019 [18] | Inception-v3 (deep transfer learning) | 19233 | Locally gathered | 96.9 | 93.5 | 0.991 |
| Wang et al. 2020 [21] | DeepDR, CNN | 6788 | Locally gathered | 93.5 | 77.1 | 0.93 |
| Wang et al. 2021 [22] | Lesion-Net, CNN | 12252 | Kaggle dataset and locally gathered | 90.5 | 78.5 | 0.938 |
| Roychowdhury et al. 2014 [20] | DREAM | 1200 | DIARETDB1, MESSIDOR datasets | 100.0 | 53.2 | 0.904 |
| Cao et al. 2018 [28] | Not specified | 1200 | MESSIDOR dataset | 92.4 | 91.5 | 0.939 |
| Ming et al. 2021 [16] | EyeWisdom | 321 | Locally gathered | 90.0 | 96.6 | 0.933 |
| Pei et al. 2022 [15] | EyeWisdomDSS | 1768 | Locally gathered | 91.0 | 81.3 | 0.862 |
| Pei et al. 2022 [15] | EyeWisdomMCS | 1768 | Locally gathered | 76.2 | 92.4 | 0.843 |
| Saxena et al. 2020 [25] | Not specified | 1200 | EyePACS, MESSIDOR, MESSIDOR-2 datasets | 88.8 | 89.9 | 0.958 |
| Baget-Bernaldiz et al. 2021 [26] | Not specified | 14327 | MESSIDOR dataset | 97.3 | 94.6 | 0.968 |
| Shah et al. 2020 [27] | Not specified | 1533 | MESSIDOR dataset | 99.7 | 98.5 | 0.991 |
Results of individual studies
The average sensitivity (Se) of the AI-based algorithms was 92.58% (95% CI: 89.40% to 95.70%). The highest Se of 100% was achieved by Roychowdhury et al. [20] using the DREAM system, with an AUROC of 0.904 and the single lowest specificity (Sp) of all the included studies, 53.2%. The system with the lowest Se of 76.2% was EyeWisdomMCS used by Pei et al. [15]; it achieved an Sp of 92.4% and an AUROC of 0.843, the lowest among all studies. The mean Sp was 87.22% (95% CI: 80.36% to 94.10%); the highest Sp of 98.5% was achieved by He et al. [23] using a CNN, with an Se of 90.79% and an AUROC of 0.946. The average AUROC was 0.937 (95% CI: 0.913–0.960); the highest AUROC of 0.991 was achieved by Li et al. [18], with an Se of 96.93% and an Sp of 93.45%. These results suggest that it was harder for researchers to achieve high specificity than high sensitivity. The full results of the included studies are presented in Table 2.
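The pooled figures above can be reproduced from the Table 2 values. The sketch below computes the mean and a t-distribution 95% CI for the sensitivity column; because it uses the rounded table values, its mean (92.55%) differs marginally from the 92.58% quoted, which appears to be based on unrounded data, while the CI matches.

```python
# Reproducing the pooled sensitivity statistics from the Table 2 values.
import numpy as np
from scipy import stats

se = np.array([94.6, 90.8, 92.5, 94.0, 96.9, 93.5, 90.5, 100.0,
               92.4, 90.0, 91.0, 76.2, 88.8, 97.3, 99.7])

mean = se.mean()
ci_low, ci_high = stats.t.interval(0.95, df=len(se) - 1,
                                   loc=mean, scale=stats.sem(se))
print(f"mean Se = {mean:.2f}% (95% CI {ci_low:.2f}% to {ci_high:.2f}%)")
# mean Se = 92.55% (95% CI 89.40% to 95.70%)
```

The same code applied to the specificity and AUROC columns yields the other two pooled estimates.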
Quality assessment
An overview of the quality assessment is provided in Figure 2. All the evaluated studies had a clearly defined study aim. Articles using locally gathered datasets described patient inclusion and exclusion criteria in detail, while studies using publicly available datasets comprehensively described the image selection procedure. Each study explicitly stated that the index test was compared with an ophthalmologist's diagnosis as ground truth. All the studies described how their AI-based algorithm was trained, validated, and tested using specific datasets. All studies explained how the diagnostic performance of their AI model was evaluated and provided Se, Sp, and AUROC as outcome measures. Six articles did not specify the market name of the AI system they used; however, every study explained in depth how its AI model works.
Discussion
Screening for and diagnosing DR is one of the most widely explored fields of AI research in ophthalmology, with more studies reporting new techniques published every year [29]. It is important to systematically review and evaluate the emerging studies to identify the direction of work needed for full integration into our healthcare systems. The UK National Institute for Clinical Excellence (NICE) guidelines state that a DR screening test should have a sensitivity of at least 80% and a specificity of at least 95%, while a lower specificity threshold of 50% was defined by another commission of specialists and has been shown to be cost-effective [30, 31]. Of all the studies included in this review, He et al., Gargeya et al., Ming et al., and Shah et al. exceeded these requirements, while Baget-Bernaldiz et al. and Li et al. came close, reaching a sensitivity/specificity of 97.3%/94.6% and 96.9%/93.5%, respectively. Furthermore, every study included in this review exceeded the 50% specificity threshold associated with cost-effectiveness.
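As a quick cross-check of this claim, the sketch below applies the quoted thresholds (sensitivity >= 80%, specificity >= 95%) to the Table 2 results:

```python
# Checking each reviewed system against the NICE screening thresholds
# quoted above (Se >= 80%, Sp >= 95%); values are taken from Table 2.
studies = {
    "Arenas-Cavalli 2022": (94.6, 74.3), "He 2020": (90.8, 98.5),
    "Zhang 2021": (92.5, 90.7), "Gargeya 2017": (94.0, 98.0),
    "Li 2019": (96.9, 93.5), "Wang 2020": (93.5, 77.1),
    "Wang 2021": (90.5, 78.5), "Roychowdhury 2014": (100.0, 53.2),
    "Cao 2018": (92.4, 91.5), "Ming 2021": (90.0, 96.6),
    "Pei 2022 (DSS)": (91.0, 81.3), "Pei 2022 (MCS)": (76.2, 92.4),
    "Saxena 2020": (88.8, 89.9), "Baget-Bernaldiz 2021": (97.3, 94.6),
    "Shah 2020": (99.7, 98.5),
}
meets = [name for name, (se, sp) in studies.items() if se >= 80 and sp >= 95]
print(meets)  # ['He 2020', 'Gargeya 2017', 'Ming 2021', 'Shah 2020']
```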
To reduce preventable vision loss globally, early detection and timely treatment of DR are needed, and an efficient DR screening system must be established [32]. The reviewed studies show that AI-based, fully automated systems can assess retinal fundus images with high accuracy and potentially reduce the manual workload required. Moreover, the collected outcome measures show that these systems have advanced significantly since their early days and perform on par with trained professionals. It is expected that, as the technology advances, AI will be able to autonomously identify patients with preventable vision loss and refer them to clinicians. This is particularly important in the developing world, where healthcare systems are chronically underfunded and trained professionals are difficult to reach [33]. Some places have already begun to integrate AI into their healthcare systems: in 2018, the US Food and Drug Administration (FDA) approved the AI-based algorithm IDx-DR for DR detection, making it the first AI-based medical device approved for autonomous disease diagnosis [32].
Due to the lack of standardized testing procedures, comparing the results of different AI-based algorithms poses various challenges. Most of the studies included in this review used different algorithms on different datasets, which makes direct comparison less accurate. Moreover, the sample sizes varied drastically: Ming et al. used retinal fundus images collected from local hospitals with a total sample of 321 images, while Gargeya et al. used multiple publicly available datasets with a total of 75137 images. Such large differences in sample size may distort comparisons and reduce the reliability of the results. Additionally, none of the publications disclosed whether their algorithms were trained on DR images only; AI systems benefit from exposure to a variety of images, while algorithms exposed only to DR samples tend to have lower sensitivity and specificity over time [34].
Even though AI has already achieved significant results in DR detection, there is still room for improvement. AI-based systems still struggle to correctly identify some diseases, such as macular edema, from fundus photographs alone [35]. It is believed that the ability to evaluate multiple types of images could boost the diagnostic performance of AI [36]. Some AI-based algorithms currently in development use additional data besides images to make a diagnosis: the EyeWisdom system used by Pei et al. and Ming et al. can screen for nearly 20 different eye diseases, such as DR, glaucoma, and age-related macular degeneration, based on fundus photographs and disease history [15, 16]. These examples show that additional data inputs could increase the scope and efficacy of AI technology.
A further topic of discussion is the potential collaboration between AI and human professionals. There are growing concerns that advancing AI could make physicians redundant. However, studies show that a combination of human specialists and AI technology provides better quality of care than either acting alone. AI-based algorithms can meticulously analyze the provided data and make purely objective decisions, while human specialists see the bigger picture, allowing them to make nuanced decisions. This combination of complementary strengths ensures the best possible care [37].
Conclusion
Our results show that AI-based algorithms can accurately detect DR in retinal fundus images. AI technology can speed up the screening process and reduce the cost of care for a constantly growing population of patients. Some studies included in this review already meet the criteria required for successful use in clinical practice. However, given the variety of AI algorithms and datasets, further research and a more standardized approach are needed to accurately compare the efficacy of AI technology.
Conflict of interests
The authors report no competing interests.