Discussion
The ARIES study included more than 306 000 cases from different UK regions and diverse breast screening populations, enabling subgroup analyses across participant age, breast density and ethnicity. Non-inferiority tests passed for all outcome metrics on all assessed subgroups, and superiority tests passed for the majority of them. Differences within participant subgroups in the expected impact (ie, point estimates of the absolute differences between workflows with and without AI) were small, providing important evidence of the AI system’s safety and effectiveness. In addition, substantial potential workload savings were modelled across all centres.
While most point estimates showed an improvement on every metric across all subgroups, the CDR estimates showed a change of between −0.24 and −0.08 per 1000 across participant subgroups, compared with standard DR. This is in line with expectations, as the specific workflows are designed for operational improvements rather than improved cancer detection, and the retrospective study design prevents the AI from detecting cancers in patients who were not originally recalled for further assessment by human readers.
In a prospective setting, additional cancer detection can be achieved by implementing the AI system as an XR, as published by Ng et al, where the AI flags cases for additional human review that have not been recalled in the DR process,4 making this approach ideal for complementing the sIR and DRT workflows.
The IC flag rate, used as a proxy measure for additional cancer detection, was 41.2%, indicating the opportunity to increase CDR. Although flagging a case as suspicious does not guarantee that a cancer will indeed be picked up by the human readers, increasing CDR is a very realistic and likely scenario, as has been shown in recent publications.3 4 Moreover, in a previous retrospective study of the AI system evaluated in ARIES, the AI’s IC flag rate was 24.7% for 2-year ICs,2 and this translated into a CDR increase in live use ranging from 0.7 to 1.6 per 1000 cases after implementation at the same sites as an additional reader.4 In another retrospective study, the IC flag rate of 34.1% for 3-year ICs19 converted into a CDR increase of 1.0 per 1000 cases in a subsequent prospective evaluation.20 The method used to estimate the translation of IC flag rate in a retrospective study to CDR increase in a prospective setting for AI as an additional reader (online supplemental methods section) yields an expected CDR increase of 1.2 per 1000 cases.
This is the first study in breast screening to present a granular assessment of an AI system’s clinical impact, in terms of multiple subgroups and the use of various relevant clinical and operational outcome metrics. Several large-scale studies have been published, but without performing subgroup analyses.2 21 22 A large-scale study evaluating the same AI system assessed generalisability across different mammography equipment vendors (Hologic, GE, Siemens, IMS), but did not include the further subgroup analyses presented here.2 Others limited their report to one patient characteristic, such as age23 or breast density24 25 and SEN as a single outcome metric.23–25
The main limitations of the ARIES study are inherently linked with its retrospective nature, which is unavoidable given the large and heterogeneous population required to perform such extensive subgroup analyses. First, human reader breast density assessments at the individual patient level were not available for screening cases at the study centres, as this is not part of standard screening practice in the UK. Using an automated breast density assessment tool, however, made it feasible to obtain a BIRADS density score for this large cohort of more than 306 000 screens. The breast density tool used was developed and validated on a large dataset from the USA,16 and the breast density distribution of the study screening population obtained from this tool was in line with distributions found elsewhere in Europe.25–28
Second, as follow-up duration was limited, the collection of ICs is expected to be incomplete for the later years of the study dataset. The follow-up duration used is nonetheless relevant for comparing subgroups and ensuring that, within every subgroup, relevant cancers have been detected. Third, ethnicity information was only available for analysis at one of the centres. This resulted in wider confidence intervals for the IC flag rate for the ethnicity subgroups, although the point estimates were in line with the other subgroups. Finally, use of AI as a second reader is expected to affect human reader behaviour only minimally compared with concurrent use of AI for decision support, although this would require prospective confirmation.
The use of relevant metrics and subgroup analyses is key for meaningful evaluations that can inform the safety and effectiveness of AI and address an important area of AI bias.5 29 30 The non-inferiority and superiority results, together with the expected impact on clinical and operational metrics (ie, point estimate differences between workflows with and without AI) across every participant subgroup in a large-scale and heterogeneous population, provide confidence in the safe deployment and prospective use of the AI system. The study has shown how substantial workload savings with AI, supporting the sustainability of breast screening, can go hand-in-hand with safe performance across patient subgroups, ensuring screening quality for participants of all ages, breast densities and ethnicities.