## Is medicine mesmerized by machine learning statistical thinking

BD Horne et al wrote an important paper Exceptional mortality prediction by risk scores from common laboratory tests that apparently garnered little attention, perhaps because it used older technology: standard clinical lab tests and logistic regression. Yet even putting themselves at a significant predictive disadvantage by binning all the continuous lab values into fifths, the authors were able to achieve a validated c-index (AUROC) of 0.87 in predicting death within 30d in a mixed inpatient, outpatient, and emergency department patient population. Their model also predicted 1y and 5y mortality very well, and performed well in a completely independent NHANES cohort 1.

The above model, called by the authors the Intermountain Risk Score, used the following predictors: age, sex, hematocrit, hemoglobin, red cell distribution width, mean corpuscular volume, red blood cell count, platelet count, mean platelet volume, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, total white blood count, sodium, potassium, chloride, bicarbonate, calcium, glucose, creatinine, and BUN 2. The model is objective, transparent, and needs only one-time and not historical information. It did not need the EHR (other than to get age and sex) but rather used the clinical lab data system. How predicted risks are arrived at is obvious, i.e., a physician can easily see which patient factors were contributing to overall risk of mortality. The predictive factors are measured at obvious times. One can be certain that the model did not use information it shouldn’t such as the use of certain treatments and procedures that may create a kind of circularity with death. It is important to note however that inter-lab variation has created challenges in analyzing lab data from multiple health systems.

Contrast the above under-hyped approach with machine learning (ML). Consider the Avati et al’s paper Improving palliative care with deep learning which was publicized here. The Avati paper addresses an important area and is well motivated. Palliative care (e.g., hospice) is often sought at the wrong time and relies on individual physician referrals. An automatic screening method may yield a list of candidate patients near end of life who should be evaluated by a physician for the possibility of recommending palliative rather than curative care. A method designed to screen for such patients needs to be able to estimate either mortality risk or life expectancy accurately.

Avati et al’s analysis used a year’s worth of prior data on each patient and was based on 13,654 candidate features from the EHR. As with any retrospective study not based on an inception cohort with a well-defined “time zero”, it is tricky to define a time zero and somewhat easy to have survival bias and other sampling biases sneak into the analysis. The ML algorithm, in order to use a binary outcome, required division of patients into “positive” and “negative” cases, something not required by regression models for time until an event 3. “Positive” cases must have at least 12 months of previous data in the health system, weeding out patients who died quickly. “Negative” cases must have been alive for at least 12 months from the prediction date. It is also not clear how variable censoring times were handled. In standard statistical model, patients entering the system just before the data analysis have short follow-up and are right-censored early, but still contribute some information.

Avati et al used deep learning on the 13,654 features to achieve a validated c-index of 0.93. To the authors’ credit, they constructed an unbiased calibration curve, although it used binning and is very low resolution. Like many applications of ML where few statistical principles are incorporated into the algorithm, the result is a failure to make accurate predictions on the absolute risk scale. The calibration curve is far from the line of identity as shown below.

Importantly, some of the hype over ML comes from journals and professional societies and not so much from the researchers themselves. That is the case for the Avati et al deep learning algorithm, which is not actually being used in production mode at Stanford. A much better calibrated and somewhat more statistically-based algorithm is currently being used.

Like many ML algorithms, the focus is on development of “classifiers”. As detailed here, classifiers are far from optimal in medical decision support where decisions are not to be made in a paper but only once utilities/costs are known. Utilities and costs only become known during the physician/patient interaction. Unlike statistical models which directly estimate risk or life expectancy, the majority of ML algorithms start by using classification, then if a probability is needed they try to convert the patterns into a probability (this is sometimes called a “probability machine”). As judged by Avati et al’s calibration plot, this conversion may not be reliable.

Avati et al, besides showing us what is needed, and consistent with forward prediction (the calibration plot) also reported a number of problematic measures. As detailed here, the use of improper probability accuracy scoring rules is very common in the ML world, because of the hope that one can actually make a decision (classification) using the data without needing to incorporate costs of incorrect decisions (utilities). Improper accuracy scores have a number of problems, such as

Proportion classified correctly, sensitivity, specificity, precision, and recall are all improper accuracy scoring rules and should not play a role in a forward prediction mode when risk or life expectancy estimation are the real goals. A poker player wins consistently because she is able to estimate the probability she will ultimately win with her current hand, not because she recalls how often she’s had such a hand when she won.

One additional point: the ML deep learning algorithm is a black box, not provided by Avati et al, and apparently not usable by others. And the algorithm is so complex (especially with its extreme usage of procedure codes) that one can’t be certain that it didn’t use proxies for private insurance coverage, raising a possible ethics flag. In general, any bias that exists in the health system may be represented in the EHR, and an EHR-wide ML algorithm has a chance of perpetuating that bias in future medical decisions. On a separate note, I would favor using comprehensive comorbidity indexes and severity of disease measures over doing a free-range exploration of ICD-9 codes.

Nomogram for obtaining predicted 1- and 2-year survival probabilities and the 10th, 25th, 50th, 75th, and 90th percentiles of survival time (in months) for individual patients in HELP. Disease class abbreviations: a=ARF/MOSF/Coma, b=all others, c=CHF, d=Cancer, e=Orthopedic. To use the nomogram, place a ruler vertically such that it touches the appropriate value on the axis for each predictor. Read off where the ruler intersects the ‘Points’ axis at the top of the diagram. Do this for each predictor, making a listing of the points. Add up all these points and locate this value on the ‘Total Points’ axis with a vertical ruler. Follow the ruler down and read off any of the predicted values of interest. APS is the APACHE III Acute Physiology Score.

In the rush to use ML and large EHR databases to accelerate learning from data, researchers often forget about the advantages of statistical models and of using more compact, cleaner, and better defined data. They also sometimes forget how to measure absolute predictive accuracy, or that utilities must be incorporated to make optimum decisions. Utilities are applied to predicted risks; classifiers are at odds with optimum decision making and with incorporating utilities at the appropriate time, which is usually at the last minute just before the medical decision is made and not when a classifier is being built.