
Adherence of randomised controlled trials using artificial intelligence in ophthalmology to CONSORT-AI guidelines: a systematic review and critical appraisal


Discussion

Here, we aimed to evaluate the adherence of RCTs investigating the use of AI within ophthalmology to the guidelines set by the CONSORT-AI checklist for RCT reporting standards. Our study identified a total of five RCTs that evaluated AI applications in ophthalmology. These articles examined the utility of AI in diabetic retinopathy screening,19 20 ophthalmology education,21 detection of fungal keratitis22 and diagnosis of childhood cataracts.23 The mean CONSORT-AI score of the articles was 53% (range 37%–78%). None of the articles reported all items in the CONSORT-AI checklist, and all articles were rated as moderate risk, or ‘some concerns present’, by the RoB-2 tool assessment. All articles had a moderate risk of bias for the ‘selection of the reported result’ and ‘deviations from intended interventions’ domains, and a low risk of bias for the ‘measurement of the outcome’ and ‘missing outcome data’ domains. Only one article had a low risk of bias for its ‘randomisation process’, with the remainder having a moderate risk in this domain.

The mean CONSORT score for our included studies (53%) is higher than the mean score of 39% reported in previous work by Yao et al in 2014, which reviewed the quality of reporting in 64 RCTs focused on ophthalmic surgery.12 Aside from the difference in the number of reviewed articles, a potential reason for this difference in reported CONSORT-AI scores is that the articles found in our study are relatively new. The CONSORT-AI guidelines were published in 2020, and 3/5 of our articles were published in 2021 or later,19 20 22 which suggests that awareness of and adherence to reporting guidelines may have increased over time. Many of the items that the identified articles in our review failed to report were also missed in the studies identified by Yao et al.12 These include determining an adequate sample size (item 7), random allocation sequence generation (item 8) and its implementation (item 10).13 The low reporting rate of sample size calculation is a critical concern, as this information is essential for protocol development in all RCTs. Some items that were commonly missed in Yao et al were not missed in our reviewed articles, such as mentioning the term RCT in the title or abstract (item 1),12 which demonstrates the value of journals and publishing editors establishing expected reporting standards.
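To illustrate what the commonly omitted sample size calculation (item 7) typically involves, the following is a minimal Python sketch of a standard two-proportion calculation; the 90% versus 80% detection rates, significance level and power are hypothetical illustrative assumptions, not values taken from any of the reviewed trials.

import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Participants needed per arm to detect p1 vs p2 with a two-sided z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

# Hypothetical example: AI screening detects referable disease in 90% of cases
# versus 80% for human graders -> roughly 197 participants per arm.
print(sample_size_two_proportions(0.90, 0.80))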

We observed some common trends in the CONSORT-AI and RoB-2 assessments in our study. For AI-based RCTs, it is difficult to blind both the physicians and the participants to the intervention received if the participants are humans and not images. For instance, if an RCT is comparing AI-based screening versus human-based screening, the participant may know whether they have been assigned to the AI or to a human at the time the intervention is given. One strategy to blind the participants, as seen in Noriega et al19 and Xu et al,22 is to replace human participants with human-derived data. Additionally, blinding the outcome assessors to the prescribed intervention is an important feature of study design in RCTs, but three of the included studies in this review (Noriega et al,19 Xu et al22 and Wu et al21) did not outline these steps in their methods.

None of our included articles described where to find their initial trial protocol. Only one of the articles, by Lin et al, was registered on ClinicalTrials.gov.23 This is a critical limitation, as it could indicate a potential source of bias if analysis decisions were made after outcomes were measured, which undermines the credibility of the RCT findings. Although the outcome measurements were standard choices (eg, sensitivity and specificity for binary classification model performance), the role of an initial trial protocol cannot be overlooked, as it is a key component of pretrial planning and study integrity. Furthermore, no article other than Mathenge et al reported where the AI algorithm code could be found.20 This reduces transparency and may impede the reproducibility of the results as well as the progress of applying AI technologies. Siontis et al found that AI RCTs across all healthcare applications, not just ophthalmology, fail to provide the algorithm code for their AI tools.24

Criteria 4b (settings and locations where data were collected), 15 (baseline demographics) and 21 (generalisability of trial findings) of the CONSORT-AI checklist were not perfectly adhered to in our five articles. Only three articles reported item 4b,20 22 23 three articles reported item 15,20 21 23 and two articles reported item 21.20 22 Although these criteria were not the most frequently missed items, they are of utmost importance clinically, as they concern whether the results of the trial can reasonably be applied to a clinician’s patient population. In a 2021 review of the development and validation pathways of AI RCTs, Siontis et al found that most AIs are not tested on datasets collected from patient populations outside of where the AI was developed, and thus it may be unsafe to apply these AIs to such populations.24 In fact, using limited or imbalanced datasets in both the development and validation stages may lead to discriminatory AI.25 Therefore, special attention should be paid to these criteria.

In our review, we also found that the criterion for providing an explanation of any interim analyses and/or stopping guidelines if applicable (item 7b) was not reported in any of the articles. It could be argued that all RCTs should at least state that an interim analysis was not planned, even if it was not applicable to the specific study design. Shahzad et al conducted a systematic review that also used CONSORT-AI to assess the reporting quality of AI RCTs across all healthcare applications published between January 2015 and December 2021. They likewise found that item 7b was not reported in more than 85% of the included studies, and scored this item as non-applicable in their grading using CONSORT-AI.16

When analysing the appropriateness of analyses and the clarity of the performance assessments for each article, we found that each article chose suitable methods for its individual trial. Noriega et al, Xu et al and Lin et al evaluated the performance of their different comparators by calculating sensitivity and specificity, among other metrics.19 22 23 Xu et al and Lin et al presented this information in the form of a table.22 23 Noriega et al and Xu et al also presented these results visually by plotting the sensitivity and specificity of the different comparators on a receiver operating characteristic curve that represented the performance of the AI alone.19 22 In Wu et al’s investigation of the effectiveness of AI-assisted problem-based learning, ophthalmology clerks completed a test before and after either a traditional lecture or an AI-assisted lecture.21 Improvement in test performance was assessed and compared between the two groups by analysing differences in the pre-lecture and post-lecture test scores using paired t-tests. A main source of bias in their study, not captured in the risk of bias assessment, is the quality of the test questions, which were not made available to readers. It is important to note that all AI-based RCTs identified in this study had no drop-outs, as all participants who enrolled in the RCT yielded valid data for analysis. This is largely because, in some cases, the subjects were images drawn from pre-collected databases and registries.
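For readers less familiar with these analyses, the following minimal Python sketch reproduces the two patterns described above: computing sensitivity and specificity for a binary classifier, and running a paired t-test on pre- versus post-lecture scores within one group. All numbers are made up for illustration and do not come from the reviewed trials.

import numpy as np
from scipy import stats

# (1) Binary classification performance (1 = disease present), hypothetical data
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])   # assumed ground-truth labels
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0])   # assumed AI predictions
tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")

# (2) Pre- vs post-lecture improvement within one group, hypothetical scores
pre = np.array([62, 70, 55, 68, 74, 60])
post = np.array([71, 78, 60, 75, 80, 66])
t_stat, p_value = stats.ttest_rel(post, pre)
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")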

Despite a comprehensive search of the literature, only a limited number of RCTs on AI were retrieved in the current study. The small number of RCTs identified prevented our study from conducting any temporal analyses or stratifying our analyses. In comparison, a literature review on the reporting guidelines of RCTs in ophthalmic surgery overall yielded 65 RCTs.12 There are a couple of reasons that may explain the small number of RCTs investigating the efficacy of AI for ophthalmological applications. First, this small number may be an indication of the novelty of AI within the field of ophthalmology. Another reason may be the high costs and resources associated with RCTs. It is not feasible to conduct an RCT for all of the various AI tools developed for ophthalmology. Siontis et al found that the development and validation stages that different AI models go through before being evaluated in RCTs vary widely between papers.24 The increasing number of standard guidelines for the reporting and quality assessment of AI, including DECIDE-AI,26 PROBAST-AI,27 QUADAS-AI,28 STARD-AI29 and TRIPOD-AI,27 is suggestive of a shift towards standardised assessment of AI tools. Another step that may aid in better assessment of AI tools in RCTs is determining performance metric thresholds that must be met at each stage of development and validation, although justifying these cutoffs may be difficult and subjective, and meeting them does not automatically imply high reliability of the RCT results.
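As a purely illustrative sketch of what such a prespecified threshold might look like in practice, the short Python example below checks whether the lower bound of a 95% Wilson confidence interval for sensitivity clears an assumed minimum before a tool would advance to the next validation stage; the threshold and counts are hypothetical assumptions, not values from any of the reviewed trials or guidelines.

from statsmodels.stats.proportion import proportion_confint

THRESHOLD = 0.85        # assumed, prespecified minimum sensitivity
tp, fn = 185, 15        # hypothetical validation-set counts
ci_low, ci_high = proportion_confint(tp, tp + fn, alpha=0.05, method="wilson")
print(f"sensitivity 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print("gate passed:", ci_low >= THRESHOLD)

Requiring the confidence interval bound, rather than the point estimate, to clear the threshold is one way to make such a gate less sensitive to small validation samples, though the choice of bound and threshold remains a subjective design decision.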
