
A proposal for developing a platform that evaluates algorithmic equity and accuracy

Solutions to improve algorithm transparency and performance and promote health equity

Starting from the premise that any complex societal problem must first be measured before it can be solved, Mayo Clinic and Duke School of Medicine entered a collaboration with Optum/Change Healthcare focused on analysing their data, which comprise >35.1 billion healthcare events and over 15.7 billion insurance claims, to look for patterns of care and any possible inequities in that care. Change Healthcare provides social determinants of health, including economic vulnerability, education levels/gaps, race/ethnicity and household characteristics, on about 125 million unique de-identified individuals. This provides a unique combined clinical and non-clinical view of healthcare journeys in the USA. A better understanding of this dataset will enable Mayo and Duke to design initiatives to help eradicate racism and offer services to underserved communities. One component of the project reviews the billing data, including ICD codes and CPT codes. It analyses diabetes care, as reflected by haemoglobin A1c testing and the use of telemedicine services, and includes a planned study of the utilisation of colorectal cancer screening services, as reflected in the use of Cologuard, an at-home stool-DNA screening test (Mayo Clinic has a financial interest in Cologuard), colonoscopy and other screening methods. Utilisation of these services is being mapped, when available, against numerous social determinants of health, including a patient's education level, country of origin, economic stability indicator (financial), how likely the patient was to search for medical information on the internet, requests to their physician for information about medications, the presence of a senior adult in the household, number of children, and home and car ownership.

The results of such analyses will help clinicians and healthcare executives develop more equitable digital tools, but they do not obviate the need to formally evaluate AI-enhanced algorithms and digital services to ensure that they achieve their stated purpose and help improve health equity. Unfortunately, the current digital solutions marketplace remains a 'Wild West' that is acutely in need of certifying protocols to address the aforementioned shortcomings. There are three possible pathways to follow in creating these evaluation services. One approach is to develop a system similar to the nutrition or drug label currently in place for most US foods, beverages and medications.15 It would list many of the 'ingredients' that have been used to generate each algorithm or digital service, including how the dataset was derived and tested and what kind of clinical studies were conducted to demonstrate that it has value in routine patient care. It would also list the type of methodology used to develop the model (for example, convolutional neural network, random forest analysis or gradient boosting), the types of statistical tests and performance metrics that were used on the training and test sets, and the bias assessment tools employed. A second approach would be a Consumer Reports-like system. It would take a closer look at commercially available AI-enhanced services, outlining and comparing them much the way Consumer Reports compares appliances, automobiles and the like. This second approach would be facilitated by a cross-health-system data and algorithm platform or federation where internal and external models can be tested, improved and selected. That would allow potential users to separate the wheat from the chaff, providing them with a reliable resource as they decide how to make investments. A third approach would be a hybrid evaluation system that combines elements of the first two.
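
As a rough illustration of the first approach, the algorithm 'ingredients' could also be captured in a machine-readable label so that competing products can be compared programmatically. The following is a minimal sketch in Python; the schema, field names and example values are assumptions for illustration, not a proposed standard (a filled-in, human-readable version appears in box 1).

```python
from dataclasses import dataclass, field


@dataclass
class AlgorithmLabel:
    """Hypothetical 'nutrition label' for an AI-enhanced clinical algorithm."""
    name: str
    intended_use: str
    model_type: str                 # e.g. convolutional neural network, random forest
    training_data_source: str       # where and when the training data were collected
    population_composition: dict    # demographic breakdown of the training dataset
    performance_metrics: dict       # metrics reported on the held-out test set
    bias_assessment_tools: list = field(default_factory=list)
    clinical_evidence: list = field(default_factory=list)  # published studies


# Illustrative example, loosely mirroring the fictitious product card in box 1
label = AlgorithmLabel(
    name="RadiologyIntel",
    intended_use="Decision support for abdominal CT diagnosis",
    model_type="Convolutional neural network",
    training_data_source="Acme Medical Center, September 2014 to December 2016",
    population_composition={"Non-Hispanic white": 0.60, "Hispanic and Latino": 0.18,
                            "Black or African-American": 0.13, "Asian": 0.06, "Other": 0.03},
    performance_metrics={"AUC": 0.85, "Classification accuracy": 0.75},
    bias_assessment_tools=["Google TCAV", "Audit-AI"],
    clinical_evidence=["Retrospective analysis", "Prospective clinical trial"],
)
```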

Applying these types of evaluation tools to existing diagnostic and screening algorithms might avert the poor model performances that have been reported in the medical literature. For example, the Epic Deterioration Index, which was designed to identify subgroups of hospitalised patients with COVID-19 at risk for complications and to alert clinicians to the onset of sepsis, fell short of expectations in one analysis.16 The system had to be deactivated 'because of spurious alerting owing to changes in patients' demographic characteristics associated with the COVID-19 pandemic'.17

For any of these approaches to be successful, it is necessary to develop an AI evaluation system with specific evaluation criteria and testing environments to judge model performance and impact on health equity. The best place to start is by taking a critical look at the input data being collected for each dataset. Any algorithm developer interested in demonstrating that they have a representative service will want to present statistics on the percentages of white, black, Asian, Hispanic and other groups in their dataset, as illustrated in box 1 and table 1. Similarly, they will attest to its male/female balance, as well as its socioeconomic and geographic breakdowns. It is also important to keep in mind that an equitable algorithm must be derived from a dataset that is representative of the entire population to be served. The AI evaluation system described here would create standards by which a product can be evaluated. There would then be multiple testing labs available, as well as several certification entities that use the results of these labs.
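
As a minimal sketch of what the demographic portion of such an evaluation might look like, the snippet below compares a dataset's group composition with a reference population and flags under-representation; the counts, reference shares and the 0.8 threshold are all hypothetical.

```python
# Flag demographic groups that are under-represented in a training dataset
# relative to a reference population (e.g. census figures for the service area).
# All numbers below, and the 0.8 threshold, are illustrative assumptions.

dataset_counts = {"White": 6000, "Black": 700, "Asian": 400, "Hispanic": 800, "Other": 100}
reference_shares = {"White": 0.60, "Black": 0.13, "Asian": 0.06, "Hispanic": 0.18, "Other": 0.03}

total = sum(dataset_counts.values())
for group, count in dataset_counts.items():
    share = count / total
    ratio = share / reference_shares[group]   # 1.0 means perfectly proportional
    flag = "UNDER-REPRESENTED" if ratio < 0.8 else "ok"
    print(f"{group:9s} dataset {share:6.1%} vs reference {reference_shares[group]:6.1%}  ratio {ratio:4.2f}  {flag}")
```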

Box 1

The fictitious product description could serve as a template for an artificial intelligence (AI) evaluation service that helps clinicians and healthcare executives make a more informed decision about how to invest in digital services that are equitable and accurate. The sample only includes a few of the most important algorithm features that can be documented in a 'nutrition label' style format. For clinicians with no background in information technology, an educational training session may be required to enable them to make useful comparisons among competing products. The graphic is a simplified version of what a product card might look like. It is intended to serve as the starting point for an iterative design process.

RadiologyIntel

Summary: machine learning-based decision support software to augment imaging-based diagnosis from abdominal CT scans.

Data:

Input data sources: radiology information system/picture archiving and communication system, and Epic electronic health record (EHR) system.

Input data type: digital abdominal images, text reports from radiologists, EHR narrative data on signs and symptoms, laboratory test results.

Training data location and time period: Acme Medical Center, Jamestown, Virginia, September 2014 to December 2016.

Statistical tests and metrics employed during training and validation testing

High-level Python-based neural network libraries (Keras, TensorFlow).

Conducted on NVIDIA GeForce graphics processing units (GPUs).

Population composition

Ethnic composition

Non-Hispanic white 60%

Hispanic and Latino 18%

Black or African-American 13%

Asian 6%

Other 3%

Gender balance 55/45%, male/female

Primary outcome(s) XXX

Time horizon XXX

Algorithm and performance:

Type of algorithm employed

Convolutional neural network

Algorithm validation

Retrospective analysis*

Prospective clinical trial†

Size/Composition of training dataset:

55 000 inpatients at academic medical centre

Size/Composition of cross-validation dataset:

35 000 inpatients at community hospital

Performance metrics

Area under the curve 0.85

Sensitivity

Specificity

Classification accuracy 75%

Summary receiver operating curve 0.75

Bias assessment evaluation

Google TCAV

Audit-AI

Food and Drug Administration approval status

510(k) premarket notification: cleared December 2020

Warnings

This model is not intended to generate independent diagnostic decisions but is to be used as an adjunct to the radiologist's and attending physician's clinical expertise. Use of the algorithm should be discontinued if there are significant shifts in performance statistics or changes in the patient population.

Published evidential support (fictitious references to illustrate the nutrition label model)

*Loretz A et al. Evaluation of an AI-based detection software in abdominal computed tomography scans. JAMA 2017;450:345–357.

†Mendez J et al. Randomised clinical trial to compare radiological imaging algorithm to radiologists’ diagnostic skills. Lancet 2019;333:450–460.

This form of algorithmic hygiene is a bare minimum standard, however. There are numerous types of bias that require attention, including statistical overestimation and underestimation, confirmation bias and anchoring bias. In addition, developers need to be realistic about how data are entered into their training set. Electronic and human data entry can inadvertently insert biased information into a dataset's raw data. Many types of healthcare data require humans to enter descriptors and tags that may be influenced by their own prejudices and stereotypes. Even devices like rulers, cameras and voice recognition software used to generate data can introduce bias. Alegion, a company that provides ground truth training for machine learning initiatives, points out: 'For example, a camera with a chromatic filter will generate images with a consistent colour bias. An 11-7/8 inch long "foot ruler" will always over-represent lengths'.18
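
To make the ruler example concrete, a systematic error in a measurement device shifts every record in the same direction, so it cannot be averaged away and simply becomes part of the 'ground truth'. A tiny illustrative simulation (all figures assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
true_lengths_cm = rng.uniform(20, 40, size=1000)   # hypothetical true values

# A "foot ruler" that is physically 11-7/8 inches but is read as 12 inches
# over-reports every measurement by the same constant factor.
bias_factor = 12 / 11.875
measured_lengths_cm = true_lengths_cm * bias_factor

print(f"Mean true length:     {true_lengths_cm.mean():.2f} cm")
print(f"Mean measured length: {measured_lengths_cm.mean():.2f} cm")
print(f"Systematic over-estimate on every record: {bias_factor - 1:.2%}")
```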

Vendors will also want to take the next step and demonstrate that the composition of their data science team is diverse and represents all the segments of society that have often been under-represented in healthcare. Without such a diverse team, subtle choices made during the data collection process can produce unbalanced datasets. Additional credentialling documents that would allow the best solution providers to stand out include bias impact statements, inclusive design principles, algorithm auditing processes and cross-functional work teams. Algorithm developers can also use several analytical tools designed to detect such problems, including Google's TCAV, Audit-AI and IBM's AI Fairness 360, discussed in box 2.

Box 2

Bias detection analytics tools

Although it is virtually impossible to eliminate all bias from artificial intelligence (AI)-based datasets and algorithms, there are several tools that can help mitigate the problem. These tools are essentially algorithmic solutions to correct algorithmic inequities. Here are a few examples of these detection tools.

Testing with concept activation vectors

Testing with concept activation vectors (TCAV) is one of Google's tools to address algorithmic bias, including bias by race, gender and location. For example, in a neural network-based system designed to classify images and identify a zebra, TCAV can determine how sensitive the prediction is to the presence of stripes.21 The tool uses directional derivatives to estimate the degree to which a user-defined concept is important to the results of the classification task at hand. Using concept activation vectors can help detect biases by unearthing unexpected word, class or concept associations that suggest an inequity. In one analysis, for instance, the 'female' concept was linked to the 'apron' class.22
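
The core idea can be sketched without Google's TCAV library: collect hidden-layer activations for 'concept' examples and for random examples, fit a linear classifier between them, and treat the normal to its decision boundary as the concept activation vector; the model's sensitivity to the concept is then the directional derivative of its class score along that vector. The following is a self-contained toy sketch with synthetic activations (everything here is assumed for illustration and is not the TCAV library's API).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n, d = 200, 64                      # examples per set, width of a hidden layer

# Synthetic activations: 'concept' examples (e.g. striped images) vs random ones.
concept_acts = rng.normal(0.5, 1.0, size=(n, d))
random_acts = rng.normal(0.0, 1.0, size=(n, d))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * n + [0] * n)

# The CAV is the normal to the hyperplane separating concept from random activations.
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# Conceptual sensitivity: directional derivative of the class score along the CAV.
# The 'gradient' of the class logit with respect to the activations is made up
# here; in practice it would come from backpropagation through the real model.
fake_gradient = rng.normal(size=d)
sensitivity = float(fake_gradient @ cav)
print(f"Directional derivative along the concept direction: {sensitivity:+.3f}")
```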

Audit-AI

Audit-AI makes use of a Python library from pymetrics that can detect discrimination by locating specific patterns in the training data. For example, it can feed mammography access data for various ethnic groups into the algorithm in question to generate proportional pass rates for each group, comparing white with black patients. The resulting bias ratio can then be analysed statistically, looking for both significant and clinically meaningful differences in healthcare access.23
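
A library-free sketch of the kind of proportional pass-rate comparison described above is shown below; the counts and the 0.8 'four-fifths' threshold are illustrative assumptions, not the Audit-AI API.

```python
# Hypothetical screening-access counts: a 'pass' means the patient received
# the mammogram. All numbers are invented for illustration.
groups = {
    "white": {"pass": 820, "total": 1000},
    "black": {"pass": 640, "total": 1000},
}

pass_rates = {g: v["pass"] / v["total"] for g, v in groups.items()}
bias_ratio = pass_rates["black"] / pass_rates["white"]

print(f"Pass rates: {pass_rates}")
print(f"Bias ratio (black/white): {bias_ratio:.2f}")

# A common rule of thumb (the 'four-fifths' rule) flags ratios below 0.8; in
# practice a statistical test (e.g. a two-proportion z-test) and a judgement
# about clinical meaningfulness would follow.
if bias_ratio < 0.8:
    print("Potential disparity: ratio falls below the 0.8 threshold")
```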

AI Fairness 360

A Python-based bias detection toolkit from IBM, AI Fairness 360 (AIF360) starts with the assumption that many datasets do not contain enough diverse data points. The IBM team explains: 'Bias detection is demonstrated using several metrics, including disparate impact, average odds difference, statistical parity difference, equal opportunity difference and Theil index. Bias alleviation is explored via a variety of methods, including reweighing (preprocessing algorithm), prejudice remover (in-processing algorithm) and disparate impact remover (preprocessing technique)'. One use case in which AIF360 reveals discrimination involves a scoring model based on healthcare utilisation.24
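
The sketch below shows how a disparate-impact check and reweighing might look with AIF360, based on its publicly documented API; the toy utilisation data, the 0/1 coding of the protected attribute and the column names are assumptions for illustration.

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Toy data: label = 1 means the person was referred to a care-management
# programme; 'race' is coded 1 = privileged group, 0 = unprivileged group.
df = pd.DataFrame({
    "race":        [1, 1, 1, 1, 0, 0, 0, 0],
    "utilisation": [5, 7, 6, 8, 5, 7, 6, 8],   # prior healthcare utilisation
    "label":       [1, 1, 1, 0, 1, 0, 0, 0],
})

dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["race"])

privileged = [{"race": 1}]
unprivileged = [{"race": 0}]

metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups=unprivileged,
                                  privileged_groups=privileged)
print("Disparate impact:", metric.disparate_impact())                    # ~1.0 is ideal
print("Statistical parity difference:", metric.statistical_parity_difference())

# Reweighing is one of the preprocessing mitigations mentioned above: it adjusts
# instance weights so favourable-outcome rates are balanced across groups.
reweighed = Reweighing(unprivileged_groups=unprivileged,
                       privileged_groups=privileged).fit_transform(dataset)
```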

Tariq et al have also reviewed numerous AI evaluation tools that are worth considering.25 They have developed a 10-question tool to evaluate AI products, covering 'model type, dataset size and distribution, dataset demographics/subgroups, standalone model performance, comparative performance against a gold standard, failure analysis, publications, participation in public challenges, dataset release and scale of implementation'.

The history of medicine is filled with 'near misses': technologies that had the potential to improve patient care but failed to hit their intended target and did not live up to that potential once rigorously tested. The evidence suggests that machine learning-enhanced algorithms as a group do not fall into that category; instead, they are poised to profoundly transform the diagnosis, treatment and prognosis of disease. As we have documented in earlier publications,10 there are a small number of randomised controlled trials (RCTs) and non-RCT prospective studies to support the use of these digital tools in several medical specialties, including oncology, radiology, ophthalmology and dermatology. But for clinicians and healthcare executives to make informed decisions regarding commercially available algorithmic services, we propose an evaluation platform that dispassionately reports on the basic features of each product. Such a platform would allow providers to compare competing products and choose those that are equitable and accurate.
