Finding a constrained number of predictor phenotypes for multiple outcome prediction

Methods

Definitions

An outcome is a health condition or state, such as new onset of diabetes, hypertension or cancer.

A time at risk is a period of time relative to an index date. For example, if the index date is a fixed calendar date such as 1 January 2019, then a time at risk could be between 1 January 2019 and 31 December 2019. The calendar date at which the index occurs can also vary per patient, such as the first diagnosis of hypertension, so a time at risk could run from the day of diagnosis of hypertension until 1 year after the day of diagnosis.

A target population is a cohort of patients (and a corresponding index date) for whom you wish to predict the personalised risk of some incident outcome occurring during a time at risk. For example, a target population could consist of all patients newly diagnosed with atrial fibrillation, with the index per patient being the initial diagnosis of atrial fibrillation.

A prediction task can then be generically defined as predicting the risk of an outcome occurring during a time at risk for patients within a target population.

The term phenotype, in the context of observational healthcare research, is often used to describe a set of coding instructions that can be applied to identify a cohort of patients with some health status of interest within a database. For example, a phenotype for atrial fibrillation is the coding instructions that identify which patients in the database have atrial fibrillation and when they have it. The coding instructions generally consist of a set of medical codes used to record the health status [eg, the International Classification of Diseases, ninth revision (ICD-9) code for atrial fibrillation, 427.31] and logic such as ‘the code must occur during an inpatient visit’. There is also logic to state the start and end of the phenotype: for chronic illnesses, the start date is often the date of the earliest record of any code representing the illness and the end date is the end of the patient’s observation (ie, they have it forever). For acute illnesses, the end date is often some fixed time period after the start date, such as 90 days after the start of the illness. A patient is said to have phenotype X at time ti if the patient is in the cohort identified by the phenotype and ti is between the patient’s phenotype start and end dates.

Databases

In this study, we used eight anonymised and deidentified observational healthcare databases (eg, health insurance claims data and electronic healthcare records) mapped to the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) (see online supplemental appendix A table 1). The OMOP CDM is a standardised data structure to which any observational healthcare data can be mapped.8 The structure consists of tables such as person (containing deidentified details about the patient, such as year of birth and sex), condition occurrence (containing time-stamped medical condition records per patient) and drug exposure (containing time-stamped details about drug(s) dispensed or prescribed). In addition, a standardised vocabulary is used: SNOMED9 for medical conditions and RxNorm for drugs.10 Using databases mapped to the OMOP CDM makes external validation of prediction models more efficient.11 Six databases (MDCR, MDCD, CCAE, JMDC, Germany and Australia) were used to learn the constrained set of predictors and to develop models using these predictors. Two held-out databases (Optum de-identified Electronic Health Record data set (Optum EHR) and Optum’s de-identified Clinformatics Data Mart Database (Optum CDM)) were used only to develop models using the constrained predictors. This provided a fair test of whether the constrained set of predictors, learnt on different data, performed well in new data.

Finding a constrained set of candidate predictor phenotypes

A large-scale descriptive analysis was performed to identify SNOMED (medical condition) and RxNorm (drug) codes that are associated with many outcomes across many target populations. For each target population and outcome pair, we created a labelled dataset consisting of patient ‘code features’ based on SNOMED/RxNorm codes recorded in the year prior to index and patient labels specifying whether the patient experienced the outcome within 1 year after index. This was done using a retrospective cohort design: we identified a target population in a database (eg, new users of lisinopril in database A) and a given index date (eg, the date a patient first has a record corresponding to lisinopril in the database), then assigned each patient the label ‘has outcome’ if the patient had the outcome within 1 year after index, or the label ‘does not have outcome’ otherwise. A patient’s code features were created using one-hot encoding for any SNOMED/RxNorm code recorded in the year prior to index (eg, ‘code 27550009 corresponding to vascular disorder was recorded within 1 year prior to index’ had a value of 1 for a patient if the patient had code 27550009 recorded in the year before starting the drug and 0 otherwise).
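As a minimal sketch, the labelling and one-hot encoding step described above might look like the following (the record layout, function name and toy codes are hypothetical; the actual extraction works against the OMOP CDM tables):

```python
def build_labelled_dataset(records, outcome_days, codes):
    """Hypothetical sketch: one-hot encode any code recorded in the 365 days
    before index and label patients who had the outcome within 365 days after.

    records: {patient_id: [(code, day_relative_to_index), ...]}
    outcome_days: {patient_id: [day_relative_to_index, ...]} outcome events
    codes: ordered list of candidate SNOMED/RxNorm codes
    """
    features, labels = {}, {}
    for pid, recs in records.items():
        # codes recorded in the year prior to index (day 0)
        prior = {code for code, day in recs if -365 <= day < 0}
        features[pid] = [1 if code in prior else 0 for code in codes]
        # 'has outcome' if any outcome event occurs within 1 year after index
        labels[pid] = int(any(0 < day <= 365
                              for day in outcome_days.get(pid, [])))
    return features, labels
```

Records on or after the index day, or more than 365 days before it, contribute nothing to the feature vector, mirroring the 1-year lookback window described above.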

For each labelled dataset, we calculated each code feature’s standardised mean difference (SMD)12 by calculating the difference between the mean value of the feature in the ‘has outcome’ group and the mean value in the ‘does not have outcome’ group, divided by the feature’s pooled SD. This was done across 65 664 combinations of 64 target populations for different new drug users with ≥365 days observation in the database prior to index (index was the date of first drug recording), 171 outcomes and 6 databases (MDCR, MDCD, CCAE, JMDC, Germany and Australia). The 171 outcomes are listed in online supplemental appendix B. These outcomes were chosen because the Observational Health Data Science and Informatics (OHDSI) collaboration13 has previously created phenotypes for them. We then aggregated the SMD values per code feature by counting how often the SMD was greater than 0.1 (‘#-SMD-significant’) and ordered the code features by decreasing value of #-SMD-significant. A clinician then reviewed the list of SNOMED/RxNorm code features starting at the top, to identify specific medical conditions that the codes corresponded to. For example, the feature representing having SNOMED code 27550009 in the 1 year prior to index, corresponding to ‘vascular disorder’, was ranked 7, and the clinician annotated this as ‘peripheral vascular disease’. The six code features before this were not annotated since they were non-disease specific and the corresponding medical condition was unclear (eg, ‘pain’ is a broad symptom and does not correspond to a specific medical condition). The top 1500 code features were reviewed and annotated to specify the medical condition when possible. Many of the code features corresponded to non-specific conditions: of the top 100 code features, 51 were considered too broad or unspecific.
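For a binary code feature, the SMD reduces to the difference in prevalence between the two groups divided by the pooled SD of the feature. A minimal sketch of this calculation and the ‘#-SMD-significant’ count (assuming the 0.1 threshold is applied to the absolute SMD, and pooling the two group variances with equal weight):

```python
import math

def smd_binary(p_outcome, p_no_outcome):
    """Standardised mean difference for a binary feature: difference in
    prevalence divided by the pooled SD (the variance of a Bernoulli
    feature with prevalence p is p * (1 - p))."""
    pooled_sd = math.sqrt((p_outcome * (1 - p_outcome)
                           + p_no_outcome * (1 - p_no_outcome)) / 2)
    return (p_outcome - p_no_outcome) / pooled_sd if pooled_sd > 0 else 0.0

def count_smd_significant(smd_values, threshold=0.1):
    """'#-SMD-significant': how often a feature's SMD exceeds the 0.1
    threshold across target-population/outcome/database combinations."""
    return sum(1 for s in smd_values if abs(s) > threshold)
```

For example, a feature recorded in 30% of the ‘has outcome’ group but only 10% of the ‘does not have outcome’ group has an SMD of about 0.52, well above the 0.1 threshold.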

We then applied the standard phenotyping process for each medical condition identified by the clinician, resulting in our constrained set of predictor phenotypes. These phenotypes were then used as the constrained set of predictors for a patient. For example, to extract the constrained set of predictors for patient A on 1 January 2016, for each phenotype predictor we would check whether patient A belonged to the phenotype cohort on or prior to 1 January 2016. If patient A was deemed to belong to phenotype 1 on or prior to 1 January 2016, then the predictor for phenotype 1 would have a value of 1; otherwise, it would have a value of 0. This was repeated for each phenotype in the constrained set of predictors (see figure 1). The predictor phenotypes aim to give more meaning to a predictor (aka feature), as they represent whether a patient has a medical condition/drug history (eg, do you have a history of vascular disorder? True/False) rather than a specific code feature (eg, did you have SNOMED 27550009 recorded previously?).
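The phenotype-based predictor extraction described above can be sketched as follows (the dictionary layout for the phenotype cohorts is a hypothetical simplification of the cohort tables actually used):

```python
from datetime import date

def phenotype_predictors(phenotype_cohorts, patient_id, index_date):
    """For each predictor phenotype, return 1 if the patient entered the
    phenotype cohort on or before the index date, else 0.

    phenotype_cohorts: {phenotype_name: {patient_id: cohort_start_date}}
    (hypothetical layout for illustration)
    """
    return {name: int(patient_id in cohort
                      and cohort[patient_id] <= index_date)
            for name, cohort in phenotype_cohorts.items()}
```

A phenotype whose cohort start date falls after the index date contributes a 0, since the condition had not yet been observed at prediction time.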

Figure 1 Illustration of the prediction task and how the constrained predictor phenotypes are used to create predictors (aka features) for a patient.

Validating the constrained set of candidate predictors

Four prediction tasks were used to investigate the predictive ability of the constrained set of predictor phenotypes. For each task, we compared how well a model performed when using the constrained phenotypes as predictors versus how well a model performed when there was no constraint on the number of predictors included. In addition, for tasks 1–3, we implemented any previously published prediction models (existing models) to compare their performance against the new models developed using the constrained set of predictors.

Task 1: We developed models to predict 30-day risk of major bleeding in patients newly treated with warfarin who had a history of atrial fibrillation (index is first record of warfarin) to compare against the benchmark existing model HAS-BLED.

Task 2: In a random sample of 2 million patients with an outpatient visit in 2018 (index is first outpatient visit in 2018), we developed models to predict death within 1 year, to compare against the Charlson comorbidity index. Death is not completely captured across the databases we have, so models were developed and evaluated using the recorded deaths only. The Charlson comorbidity index is an existing model that is frequently used in epidemiology studies as a measure of mortality risk.

Task 3: We developed models to predict 1-year risk of ischaemic stroke in patients newly diagnosed with atrial fibrillation (index is first atrial fibrillation date) to compare against the benchmark existing model CHA2DS2-VASc.

Task set 4: We developed a set of models to predict the 1-year risk of each of five outcomes—seizure, fracture, gastrointestinal (GI) bleed, diarrhoea and insomnia—following the initial major depressive disorder (MDD) diagnosis for patients given antidepressant treatment within 30 days before or after index (first record of MDD) and observed in the data for at least 1 year prior to index. This was a random sample of previous prediction tasks used for methods research by the OHDSI community.14 There is no existing simple model for these tasks, but we include them to compare the age/sex and unconstrained models.

We developed models for all prediction tasks where there were enough patients in the target population and at least 20 patients had the outcome, across 8 databases (JMDC, Germany, Australia, MDCR, MDCD, CCAE, Optum EHR and Optum CDM), using the PatientLevelPrediction framework14 with the following model designs:

  1. Constrained logistic regression (LR): logistic regression with least absolute shrinkage and selection operator (LASSO) regularisation15 16 using the constrained set of predictor phenotypes plus age groups (in 5-year bins)/sex.

  2. Constrained gradient boosting machines (GBM): eXtreme Gradient Boosting machine17 using the constrained set of predictor phenotypes plus age/sex.

  3. Simplest-case LR: logistic regression with LASSO regularisation using age groups (in 5-year bins)/sex-only predictors.

  4. Most-comprehensive-case LR: logistic regression with LASSO regularisation using one-hot-encoded predictors corresponding to whether a code was recorded (value of 1 if the code is recorded prior to index and 0 otherwise for a patient) for all SNOMED/RxNorm codes recorded for at least one patient in the target population prior to index plus age groups (in 5-year bins)/sex. There are often >10 000 candidate predictors for this model to select from.

  5. Existing models: We externally validated existing models for each prediction task where possible (ie, where the predictors could be measured in the database).

The simplest-case LR and most-comprehensive-case LR designs have been used previously to provide relative bounds to put external validation performance into context.18 We chose the same designs to enable us to put the performance of the constrained models into context.

Models were developed using the standard PatientLevelPrediction process.14 Of the labelled data, 75% were used to learn the model, with threefold cross-validation to pick the optimal hyper-parameters, and the remaining 25% were used to internally validate the models.
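A minimal sketch of the 75/25 split with threefold cross-validation fold assignment (the seed and round-robin fold scheme are illustrative only, not the PatientLevelPrediction implementation):

```python
import random

def split_labelled_data(patient_ids, seed=42, train_fraction=0.75):
    """Shuffle patients reproducibly, hold out 25% for internal validation,
    and assign the 75% training patients to three cross-validation folds
    used for hyper-parameter selection (illustrative scheme)."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    train, test = ids[:cut], ids[cut:]
    # round-robin assignment of training patients to folds 0, 1, 2
    folds = {pid: i % 3 for i, pid in enumerate(train)}
    return train, test, folds
```

Fixing the seed keeps the split reproducible, so the internal validation set is the same every time the model is refit.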

Performance was estimated using the commonly used discrimination metric, the area under the receiver operating characteristic curve (AUROC). AUROC is a measure that determines how well the models rank patients by order of risk. A value of 1 means perfect ranking and a value of 0.5 means random ranking. The AUROC can be thought of as the probability that a randomly selected patient who develops an outcome during the time at risk will be assigned a higher risk by the model compared with a randomly selected patient who does not develop the outcome during the time at risk.
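This probabilistic interpretation gives a direct, if inefficient, way to compute the AUROC by comparing every outcome/non-outcome pair of patients (a sketch, with tied scores counted as half a win):

```python
def auroc(scores, labels):
    """AUROC from its probabilistic interpretation: the chance that a
    randomly selected outcome patient is assigned a higher risk score
    than a randomly selected non-outcome patient (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        return float("nan")  # undefined when one group is empty
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that ranks every outcome patient above every non-outcome patient scores 1; a model that ranks at random scores about 0.5.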

Feature importance for the constrained set of candidate predictors

The coefficient for a specific predictor in a logistic regression model corresponds to how much the predictor impacts the prediction (when the predictors are normalised). To estimate the importance of each constrained predictor, we calculated the median and mean absolute value of a predictor’s logistic regression coefficients across the models developed using different databases and for different prediction tasks. In addition, we calculated how often the predictor had a non-zero coefficient in the logistic regression models.
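A sketch of this aggregation for a single predictor, given its fitted coefficients from the models developed across databases and tasks (input layout is illustrative):

```python
import statistics

def coefficient_importance(coefficients):
    """Summarise one predictor's LASSO logistic regression coefficients
    across models: median and mean absolute value, plus how often the
    coefficient is non-zero (ie, the predictor was selected by LASSO)."""
    abs_coefs = [abs(c) for c in coefficients]
    return {"median_abs": statistics.median(abs_coefs),
            "mean_abs": statistics.fmean(abs_coefs),
            "n_nonzero": sum(1 for c in coefficients if c != 0)}
```

A predictor that LASSO zeroes out in most models has a low non-zero count and low median absolute coefficient, marking it as unimportant.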
