Mitigated deployment strategy for ethical AI in clinical settings


Abstract

Clinical diagnostic tools can disadvantage subgroups due to poor model generalisability, which can be caused by unrepresentative training data. Practical deployment solutions to mitigate harm to subgroups from models with differential performance have yet to be established. This paper builds on existing work that considers a selective deployment approach, in which poorly performing subgroups are excluded from deployment. As an alternative, the proposed ‘mitigated deployment’ strategy requires safety nets to be built into clinical workflows to safeguard under-represented groups within a universal deployment. This approach relies on human–artificial intelligence collaboration and postmarket evaluation to continually improve model performance across subgroups with real-world data. Using a real-world case study, the benefits and limitations of mitigated deployment are explored. This adds to the tools available to healthcare organisations when considering how to safely deploy models with differential performance across subgroups.

Introduction

Artificial intelligence (AI) has been touted as the catalyst for the next healthcare revolution; however, the healthcare community remains wary about the risks associated with such technology. Notably, the potential for bias and the negative impact this may have on under-represented groups is of particular concern.1–3 In their paper titled ‘A selective deployment’, Vandersluis and Savulescu4 highlight a dilemma: deploying clinical AI tools that do not generalise adequately across demographic groups can harm under-represented groups through poor treatment, whereas withholding such tools could cost well-represented groups significant utility. They propose excluding poorly performing subgroups from deployment to prevent harm (selective deployment) in preference to not deploying at all (delayed deployment) or deploying universally without safety nets (expedited deployment); see figure 1. Our paper approaches the dilemma from a clinical perspective and, with a focus on diagnostic models, argues first that seeking to make all models generalisable across subgroups can, in fact, perpetuate inequality. Second, where a universal deployment is deemed clinically appropriate, we propose ‘mitigated deployment’ as an alternative strategy. Mitigated deployment involves a universal deployment with additional safety measures, such as a human-in-the-loop, for under-represented subgroups where models may perform less well. In addition, postmarket monitoring is mandated to assess the real-world performance of models on subgroups, for the purposes of harm mitigation and future model improvement. This will be illustrated using a case study of AI deployment within National Health Service (NHS) dermatology, where an ethical approach will be evaluated using the four pillars of medical ethics: autonomy, beneficence, non-maleficence and justice.

Box of definitions. AI, artificial intelligence.

Should all clinical models be generalisable?

A widely used breast cancer prognostication tool in the NHS uses menopausal status as a factor to predict prognosis.5 In this instance, sex-specific information is necessary for the output, and therefore, separate models for each sex may be advantageous due to the differing pathophysiology of disease manifestation.6 The example of breast cancer illustrates the problems with assuming generalisability is always in the best interest of subgroups, but it does not represent the wider research landscape, where males are generally over-represented in clinical trial data and females under-represented.7 8 Although medicine has moved on from Aristotle’s understanding of the female body as a mutilated version of man, much of the current evidence base in clinical medicine stems from data based on white male normativity.9 10 As AI models are data driven, this bias within clinical research can be perpetuated by universal deployment. Gap-App is a prognostication model for pancreatic cancer that uses distinct gene expression profiles between sexes to predict 3-year survival.11 Gap-App as a sex-distinct model outperformed a universal version by the same developers, highlighting the problem with Vandersluis and Savulescu’s4 assumption that models should be generalisable to the whole population (see figure 2). It also shows that differential performance will not always be solved by collecting more data; instead, the type of data collected can matter most, as distinct models may rely on differing inputs for each sex where appropriate.
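To make the architectural point concrete, the sketch below is a purely hypothetical illustration (not Gap-App’s published implementation): it shows how a deployment might dispatch patients to sex-specific models that rely on different input features, rather than forcing a single pooled model. All function, feature and variable names are assumptions.

```python
# Hypothetical sketch: routing patients to sex-specific prognostic models that
# use different inputs, instead of one pooled model. Not Gap-App's real code.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Patient:
    sex: str                      # "female" or "male"
    features: Dict[str, float]    # hypothetical model inputs


def female_model(features: Dict[str, float]) -> float:
    # Placeholder: a female-specific model might weight inputs such as
    # menopausal status or female-enriched gene expression profiles.
    return 0.7 * features.get("gene_profile_f", 0.0) + 0.3 * features.get("menopausal_status", 0.0)


def male_model(features: Dict[str, float]) -> float:
    # Placeholder: a male-specific model may rely on a different input set.
    return 0.8 * features.get("gene_profile_m", 0.0)


MODELS: Dict[str, Callable[[Dict[str, float]], float]] = {
    "female": female_model,
    "male": male_model,
}


def predict_survival(patient: Patient) -> float:
    """Route each patient to the model trained for their subgroup."""
    return MODELS[patient.sex](patient.features)
```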

Deployment strategies for models that perform differentially across subgroups.

Our paper holds the view that separate models based on biological differences in disease pathophysiology can be clinically appropriate; however, when applied to non-biological attributes, they can be highly problematic. As sex differences are biological, they can be more easily identified and used as inputs to distinct models, such as menopausal status or gene profiles.12 This contrasts with the factors that determine gender, race or ethnicity, which are social constructs that are societally specific and not easily defined.13 Furthermore, differentiation based on social constructs can perpetuate harmful stereotypes that reflect power dynamics in society, such as historical characterisations of women as ‘hysterical’, which became the basis of ‘hysteria’ as a medical diagnosis, and beliefs stemming from slavery in innate biological differences between races, which resulted in harmful medical practices (eg, estimated glomerular filtration rate (eGFR) race adjustment).12 14

Current regulatory landscape

AI used for diagnostic purposes in the health service is typically regulated as a medical device. In the United Kingdom (UK), the Medicines and Healthcare products Regulatory Agency (MHRA) is the regulatory body responsible for ensuring medical devices meet necessary safety, quality and efficacy standards.15 Medical devices used within the health system must also meet clinical safety standards (DCB 0129/0160), which mandate risk assessments from both supplier and deploying organisations with oversight from a clinical safety officer.16 The 2023 equity in medical devices review, prompted by concerns of bias, highlighted insufficiencies in the current process surrounding AI-enabled medical device regulation, with recommendations for reform.17 The MHRA is currently undertaking a Medical Device Change Programme to update its guidance on AI deployment in the health service.15 In addition, a review of the clinical safety standards (DCB 0129/0160) is underway to ensure advancements in AI are accounted for in risk assessment.18 Given that this is a developing field in which regulation is still catching up, we propose that mitigated deployment can address key regulatory shortcomings for models that perform differentially across subgroups by mandating harm mitigation, such as a human-in-the-loop, specifically for impacted groups. This deployment approach also requires clearer postmarket evaluation requirements that mandate subgroup analysis in collaboration with deploying organisations, which is currently missing. This paper is aimed partly at contributing to the considerations for regulators in addressing the AI fairness question in clinical settings while these reviews are underway.

Mitigated deployment is apt to fill this lacuna because it is well aligned with existing standards. At the core of the mitigated deployment strategy are the principles of human–AI collaboration and postmarket monitoring, which are emphasised in the guiding principles for good machine learning practice developed by the MHRA, Food and Drug Administration (FDA) and Health Canada.15 Additionally, the 2023 equity in medical devices review recommends that equity should be a criterion for prepurchase validation checks, reflecting the responsible-by-design approach encouraged in mitigated deployment, which requires harm mitigation as a prerequisite for regulatory approval where appropriate.17 Outside of the UK, the European Union (EU) AI Act also mandates that high-risk AI systems be developed with human oversight built into pathways before the model is placed on the market.19 Furthermore, the EU AI Act stipulates that postmarket monitoring is essential for high-risk applications. The mitigated deployment approach is therefore aligned with international consensus and can provide strategic guidance, currently missing, on how to ethically deploy models that perform differentially. See figure 3 for a visual representation of the risk–benefit ratio of mitigated deployment compared with other possible deployment types.

Visual representation of the risk–benefit ratio for deployment strategies for models that perform differentially across subgroups.

Case study: DERM

Within dermatology, AI is increasingly used in NHS skin cancer pathways, which are experiencing overwhelming demand with limited dermatologist provision.20 Bias within diagnostic models for skin conditions is a well-recognised problem, largely due to a lack of diverse training data with limited images of Fitzpatrick V/VI skin types.21 In the NHS, AI deployment in dermatology has used elements of a mitigated deployment approach to safeguard patients from diagnostic errors. To illustrate this, the case study of Deep Ensemble for the Recognition of Malignancy (DERM) will be explored. As the only AI as a medical device to hold Class IIa UK Conformity Assessed (UKCA) marking and, recently, a Class III Conformité Européenne (CE) mark, DERM is currently the world’s first autonomous AI for detecting skin cancer.22 However, to achieve this, DERM first had to collate real-world evidence of its efficacy and safety, which was done through live deployments with human second reads as a harm mitigation strategy and postdeployment evaluation including subgroup analysis.

Since 2020, numerous NHS trusts have integrated DERM into their urgent skin cancer pathways where, following a General Practitioner (GP) referral for a suspicious skin lesion, patients undergo dermoscopic photography at specialised AI teledermatology hubs and DERM outputs a referral recommendation.23 NHS deployment adopted a ‘human-in-the-loop’ approach in which skin lesions that would otherwise be discharged from the skin cancer pathway receive a remote ‘second read’ by a dermatologist, who can overturn DERM’s recommendation.23 As part of clinical validation, a second read was conducted on all lesions discharged by DERM, regardless of Fitzpatrick type. Of the 754 cases (36%) flagged by the second read between February 2022 and April 2023, 7 were cancers that would otherwise have been missed.24 None of the false negatives recorded were from Fitzpatrick types V/VI.24 DERM’s postdeployment monitoring strategy requires a root cause analysis for false negatives, involving a case review by a panel that includes dermatologists, who may identify patterns in model errors that serve as a point of learning and model improvement.23 Updated iterations of DERM have been less prone to error.23 The strategy of human reads verifying AI outputs to reduce model error has also been used in radiology and pathology.25 26
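A minimal sketch of the kind of subgroup-level audit this postdeployment monitoring implies is given below. The record fields, group labels and function names are assumptions for illustration, not DERM’s actual evaluation pipeline.

```python
# Hypothetical sketch of postdeployment subgroup analysis: summarising
# dermatologist second-read outcomes by Fitzpatrick skin type so that false
# negatives can be identified and fed into root cause analysis.
from collections import defaultdict
from typing import Dict, List


def audit_second_reads(records: List[Dict]) -> Dict[str, Dict[str, int]]:
    """Count AI discharges, second-read overturns and missed cancers per subgroup."""
    summary: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"discharged": 0, "overturned": 0, "missed_cancer": 0}
    )
    for record in records:
        group = record["fitzpatrick_type"]            # e.g. "I-II", "III-IV", "V-VI"
        summary[group]["discharged"] += 1
        if record["second_read_overturned"]:
            summary[group]["overturned"] += 1
        if record["histology_confirmed_cancer"]:
            summary[group]["missed_cancer"] += 1      # candidate for root cause analysis
    return dict(summary)


# Example usage with synthetic records:
synthetic_records = [
    {"fitzpatrick_type": "I-II", "second_read_overturned": True, "histology_confirmed_cancer": False},
    {"fitzpatrick_type": "V-VI", "second_read_overturned": False, "histology_confirmed_cancer": False},
]
print(audit_second_reads(synthetic_records))
```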

Mitigated deployment would see harm mitigation measures such as a human-in-the-loop continue specifically for under-represented groups, where model performance may remain variable. These measures would persist in the clinical workflow until model performance is deemed safe, as demonstrated by analysis of sufficient volumes of real-world data. Following successful clinical validation, DERM is now deployed autonomously at Chelsea and Westminster Hospital NHS Foundation Trust.27 However, a second read persists only for Fitzpatrick types V/VI.27 Based on postdeployment data from three NHS trusts between February 2022 and April 2023, only 4% of cases were from Fitzpatrick skin types V and VI.28 Although no cancers were missed in this subgroup, continuing a second read for darker-skinned patients helps mitigate harm where there is class imbalance, until enough evidence of DERM’s performance on this subgroup is gathered through live deployment. The proposed mitigated deployment approach would also use images captured during live deployments, with histological labels and where patients consent, as training data to improve future iterations of the model. This method allows deployment and is more ethical than a selective deployment, which would have excluded darker-skinned groups until enough data were collected; a delayed deployment, which would withhold the model from lighter-skinned individuals as well; or an expedited deployment, which would deploy to all without safety nets.
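The conditional safety net described above could, in principle, be expressed as a simple routing rule in the triage workflow. The sketch below is illustrative only: the group list, evidence threshold and function names are assumptions, not the actual configuration of any live deployment.

```python
# Illustrative sketch of a mitigated deployment routing rule: AI discharge
# recommendations for under-represented skin types keep a dermatologist second
# read until sufficient real-world evidence has accrued. Values are hypothetical.
from typing import Optional

SECOND_READ_GROUPS = {"V", "VI"}     # Fitzpatrick types still under mitigation
EVIDENCE_THRESHOLD = 1000            # hypothetical minimum of audited cases per group


def route_case(fitzpatrick_type: str, model_recommends_discharge: bool,
               audited_cases_for_group: int) -> str:
    """Decide whether a case follows the autonomous pathway or needs a second read."""
    if not model_recommends_discharge:
        return "refer_to_dermatology"          # referrals proceed as usual
    if fitzpatrick_type in SECOND_READ_GROUPS and audited_cases_for_group < EVIDENCE_THRESHOLD:
        return "dermatologist_second_read"     # safety net for under-represented groups
    return "autonomous_discharge"


def retain_for_retraining(consented: bool, histology_label: Optional[str]) -> bool:
    """Only consented images with histological labels are kept as future training data."""
    return consented and histology_label is not None
```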

Limitations of mitigated deployment

A human-in-the-loop has largely proven beneficial in clinical AI; however, the risk of automation bias should be considered. Automation bias refers to the propensity of humans to over-rely on suggestions made by automated systems.36 Abdelwanis et al36 use a bowtie analysis to identify causal factors for automation bias in healthcare and propose solutions to this challenge. First, the selection of decision makers is critical, as less experienced personnel can be more likely to accept incorrect algorithmic advice due to a lack of clinical confidence in the diagnostic task. In a dermatology context, the human-in-the-loop should be a consultant dermatologist rather than a trainee or generalist. Second, if a human-in-the-loop is to mitigate model error, clinicians must understand AI models and their limitations. In a survey of clinicians, nearly 60% of participants considered themselves slightly or not at all knowledgeable about AI.37 Training for clinicians is therefore essential to ensure an understanding of the limitations of AI and the risks it may introduce, and to avoid complacency about model outputs. Additionally, research within the field of human–computer interaction has highlighted that the way in which AI outputs are presented to clinicians can affect how heavily the information is weighted, and product designers can help design AI applications in ways that discourage automation bias.38 This also ties into how design can aid interpretability and therefore improve the quality of a human-in-the-loop. Prior to deployment, clinical validation studies are needed to provide evidence supporting the safety of both the AI model and the harm mitigation.

A challenge that can arise from the universal deployment of AI models is hidden exclusion. Hidden exclusions make model performance appear ‘better’ while seeming inclusionary, avoiding the backlash associated with selective deployment and making a deployment more palatable to the NHS. For example, deployments of AI for skin cancer detection in the NHS currently exclude acral lesions from algorithmic assessment, despite acral subtypes of melanoma being the most common type of melanoma in darker skin.34 Exclusions like acral lesions can be clinically appropriate, but regulatory bodies need robust mechanisms to differentiate between clinically appropriate exclusions and those that act as a proxy for under-represented groups in order to present ‘better’ performing models. Therefore, the rationale for exclusion criteria should be provided in the process for regulatory approval, and where there is uncertainty, the expert advisory committee that advises the MHRA should be consulted for clinical input to ensure clinical appropriateness.15

Economically, the cost of mitigated deployment can be seen as a barrier to adoption. Our paper holds the view that regulators mandating the design of mitigation strategies at the preapproval stage is preferable because it encourages adherence, while acknowledging that this can be costly. Initial financial costs of mitigated deployment will include system designs that integrate a human-in-the-loop into clinical workflows and a platform to store data that meets data protection standards. Finances available to each NHS trust differ, which can lead to more financially well-off trusts having the means to deploy AI-enabled healthcare when others cannot. To avoid geographical inequalities in AI-enabled healthcare, a proportion of the £21 million NHS AI Diagnostic Fund should be dedicated to incentivising NHS deployment in areas that would otherwise be neglected.39 There is precedent for ‘means-based’ allocation of funding in healthcare, as demonstrated by the Carr-Hill formula used to calculate individual GP practice funding, which is determined by factors such as patient need.40 Furthermore, although the initial costs of mitigated deployment may be high, the long-term savings to organisations will offset this; for example, DERM’s postreferral model provided cost savings even with a second reader, achieving a benefit–cost ratio (BCR) of 1.7 (ie, £1.70 returned for every £1 invested), projected to increase to a BCR of 2 without a second reader.24 As well as financial savings, non-financial gains (eg, reduced waiting times) are also important considerations.
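To make the quoted ratios concrete, the short calculation below uses a purely hypothetical programme cost; only the BCR figures of 1.7 and 2 come from the text above.

```python
# Hypothetical worked example of the reported benefit-cost ratios. The
# £500,000 annual programme cost is invented for illustration only.
programme_cost = 500_000                       # hypothetical annual deployment cost (£)
bcr_with_second_reader = 1.7                   # £1.70 returned per £1 invested
bcr_without_second_reader = 2.0                # projected once the second read is retired

benefit_with = bcr_with_second_reader * programme_cost         # £850,000 returned
benefit_without = bcr_without_second_reader * programme_cost   # £1,000,000 returned
print(f"Net gain with second reader: £{benefit_with - programme_cost:,.0f}")
print(f"Net gain without second reader: £{benefit_without - programme_cost:,.0f}")
```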

Conclusion

In summary, mitigated deployment is an ethical strategy for responsible AI deployment that can safeguard under-represented groups from model harm through human–AI collaboration, and it is a useful tool for policymakers in the current context of proposed regulatory reform. Mandating mitigations prior to deployment can ensure a responsible-by-design approach, while postdeployment evaluations including subgroup analysis are necessary to ensure mitigations are effective.


