Seeing the Signal through the Noise in Health Data Analytics

As we venture into experiments in Precision Medicine, Accountable Care, Population Health, and Outcome-Based Medicine, we remain challenged in our analysis of patient care data. I work in a cool niche in the data analytics space where we better link records for comprehensive longitudinal histories (patients, customers, citizens, providers…you name it). This experience has shown me analytical anomalies that are indicators of a bigger problem. The data that practitioners enter in the system is wrong! I know, I know…this is unbelievable, right? But, as I peeled back the onion even more, I figured out more of the dirty little secrets of medicine that I didn’t know…

It turns out that practitioners have been forced to “learn” how to record patients as having a certain condition in order to ensure coverage for a service (typically a diagnostic service or medication). I am not talking about situations where a provider fraudulently diagnoses a patient with back pain so they can get opioids. Nor am I referring to clear-cut fraud cases where practitioners diagnose a patient (even a fictional patient) with a condition and file for reimbursement when no procedure was actually done. Instead, this discussion is about providers who enter a diagnosis to justify a diagnostic procedure or medication even when they are not certain that the patient has the condition. The diagnostic test is needed and is completely reasonable given the symptoms presented; recording the diagnosis makes the test eligible for reimbursement, but the patient may not actually have the condition. The same is true for a medication: a provider knows the patient will benefit from off-label use, but must diagnose the patient with the on-label condition in order for the medication to be reimbursable.

Whether the insurance companies understand and intend for these situations to happen or not is a separate issue. However, when geeks like me look at the longitudinal data for regression analysis, coincidence detection, or decision flow, the diagnosis is just another “fact” in the data, but it is a wrong fact that skews the analysis. Two personal cases exemplify the issue I describe in this article:

  • Early in the development of biologic injectables (Enbrel, Humira, etc.) for treatment of autoimmune arthritis and similar diseases, patients with “off-label” diagnoses were often prescribed the medication under an on-label diagnosis. Before mass “electronification” of medical records, this was a simple lie that a physician justified because they knew better. For example, I am HLA-B27 negative, but I took an injectable that was prescribed only for rheumatoid arthritis (I technically have seronegative polyarthropathy). Both are autoimmune diseases, and there is plenty of information that would tell a physician that it should work, but the incorrect diagnosis remains in the data.
  • When a patient presents with a persistent deep cough, wheezing, and a fever, a practitioner would want to rule out pneumonia before moving on to alternative care. The insurance claim form needs a diagnosis to justify the chest X-ray. Originally, doctors would provide a diagnosis of “Rule out Pneumonia.” But insurance carriers deny the chest X-ray unless the diagnosis is Pneumonia…not “Rule out Pneumonia.” So, again, we have to create artificial data to justify care. Doing so meets the payers’ demands for their business process, but it causes inaccuracies in the data used elsewhere in the healthcare ecosystem.

Big data dorks like me really want to see the analysis succeed: to actually reduce the cost of care while expanding accessibility, improving quality, and protecting patient privacy. But here is the law of unintended consequences in action yet again. To achieve efficiency, we implemented an administrative procedure that actually thwarts the broader objective. The incorrect diagnosis data is “noise” (in a signal-to-noise ratio kind of way) that prevents accurate predictive analysis. In non-“data dork” terms: when we do the math on the effectiveness of a treatment for a disease, we call something a success if it has only a 20% uplift in positive response over a placebo, and we can’t even see all of the underlying noise in the data that is preventing a more conclusive outcome.
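
To put rough numbers on that dilution, here is a minimal sketch in Python. It assumes a deliberately simplified model that does not come from any real study: some hypothetical fraction of the “diagnosed” cohort was coded for administrative reasons and does not have the disease, and those patients respond at the same rate whether they get the treatment or the placebo. Every name and rate below is illustrative.

```python
def observed_uplift(true_uplift, miscoded_fraction, base_rate=0.30, miscoded_rate=0.30):
    """Uplift an analyst would measure when part of the cohort is miscoded.

    true_uplift       -- assumed gain in response rate among real patients
    miscoded_fraction -- assumed share of the cohort that lacks the disease
    base_rate         -- assumed placebo response rate among real patients
    miscoded_rate     -- assumed response rate of miscoded patients (same in both arms)
    """
    real = 1.0 - miscoded_fraction
    # Only real patients benefit from the treatment; miscoded patients add
    # the same response rate to both arms and so contribute no signal.
    treated = real * (base_rate + true_uplift) + miscoded_fraction * miscoded_rate
    placebo = real * base_rate + miscoded_fraction * miscoded_rate
    return treated - placebo


for fraction in (0.0, 0.1, 0.25, 0.5):
    measured = observed_uplift(0.20, fraction)
    print(f"{fraction:.0%} miscoded -> observed uplift {measured:.3f}")
```

Under these assumptions the measured uplift shrinks in proportion to the miscoded share: a true uplift of 0.20 measured in a cohort that is one quarter miscoded shows up as 0.15. Real data is messier than this, but the direction of the distortion is the point.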

All is not lost…

The above is just one example of a lot of “noise” in the data. As the Accountable Care Organization (ACO) model develops (it is truly not new; HMOs already get it), the focus on measurement is increasing. We have moved to ICD-10 to capture additional detail and to FHIR to better share and link health data, and we are even implementing strong new identity management practices. With better access to medical histories and dramatic improvements in the assemblage of longitudinal records to make the analysis more comprehensive, good data dorks will be able to detect such a practice as an anomaly and remove the error from the analysis. For example, the current crush of healthcare analytics firms is using data from a gigantic anonymized data set from the government (Medicare) or from a single large payer or provider. As we get better at linking these records with records from other sources, we will be able to see indicators of the presence of the error and therefore dampen the “noise” in the analysis.

If I am diagnosed with pneumonia (as described in the case above) but never receive medication for pneumonia, then it was probably a misdiagnosis, in this case for administrative reasons. This “noise dampening” assessment is not implemented today because large anonymous data sets and data streams from a single payer (or provider) don’t provide a comprehensive enough source to say with confidence that the patient was not prescribed the medication. You can’t prove a negative: without confidence that you have all of the patient’s data, you can’t say the medication was never prescribed.
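
The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production method: the record types, the `EXPECTED_TREATMENTS` table, the 14-day window, and the `coverage_confidence` score (how sure we are that we hold all of the patient’s records) are all hypothetical, and the medication list is an example rather than clinical guidance.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical lookup: medications we would expect to follow a diagnosis.
EXPECTED_TREATMENTS = {"pneumonia": {"amoxicillin", "azithromycin", "doxycycline"}}


@dataclass
class Diagnosis:
    patient_id: str
    condition: str
    recorded_on: date


@dataclass
class Prescription:
    patient_id: str
    medication: str
    filled_on: date


def classify_diagnosis(dx, prescriptions, coverage_confidence,
                       window_days=14, min_confidence=0.9):
    """Label a diagnosis as supported, likely administrative, or unknown."""
    expected = EXPECTED_TREATMENTS.get(dx.condition)
    if expected is None:
        return "no_rule"  # no expectation defined for this condition
    window_end = dx.recorded_on + timedelta(days=window_days)
    treated = any(
        rx.patient_id == dx.patient_id
        and rx.medication in expected
        and dx.recorded_on <= rx.filled_on <= window_end
        for rx in prescriptions
    )
    if treated:
        return "supported"
    # We can't prove a negative: without enough confidence that the record
    # is complete, a missing prescription tells us nothing.
    if coverage_confidence < min_confidence:
        return "unknown"
    return "likely_administrative"


dx = Diagnosis("p1", "pneumonia", date(2024, 1, 10))
print(classify_diagnosis(dx, [], coverage_confidence=0.95))
print(classify_diagnosis(dx, [], coverage_confidence=0.50))
```

The two calls differ only in the completeness score, which is exactly the gap the paragraph above describes: the same missing prescription is evidence in a well-linked record and mere silence in a fragmentary one.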

As we grow more confident that we have assembled all of the right data about the patient of interest, we will be able to take our data analytics to the next level. To improve that confidence, we need to expand access to data for authorized individuals, which is basically an effort to improve identity management and permissions (consent) practices. We also need better assurance of record linkage than the traditional probabilistic matching used in master patient index (MPI) and master data management (MDM) tools. Linkage algorithms need to see through natural and cultural variations in names over time, see through errors in data and differences in data governance, and resolve the complex edge cases that make us lose confidence in the current probabilistic approaches (twins, generational errors, familial data overlap, etc.). These changes are massive, but the payoff will be even more substantial.
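
The twin problem is easy to reproduce with a Fellegi-Sunter-style score, one common formulation of probabilistic matching (which formulation any particular MPI or MDM tool uses is not something I am asserting here). The sketch below is in Python; the field list, the m and u probabilities, and the sample records are all made up for illustration.

```python
import math

# Hypothetical per-field probabilities, chosen only for illustration:
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "birth_date": (0.97, 0.001),
    "address": (0.80, 0.005),
}


def match_weight(record_a, record_b):
    """Sum of log2 likelihood-ratio weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if record_a[field] == record_b[field]:
            total += math.log2(m / u)              # agreement adds evidence
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return total


# Fictional records: two twins living together, plus an unrelated person.
twin_a = {"last_name": "Rivera", "first_name": "Anna",
          "birth_date": "1990-04-02", "address": "12 Elm St"}
twin_b = {"last_name": "Rivera", "first_name": "Maria",
          "birth_date": "1990-04-02", "address": "12 Elm St"}
stranger = {"last_name": "Okafor", "first_name": "James",
            "birth_date": "1975-11-30", "address": "88 Lake Rd"}

print("twins:   ", round(match_weight(twin_a, twin_b), 2))
print("stranger:", round(match_weight(twin_a, stranger), 2))
```

With these made-up parameters the twins score strongly positive even though they are different people, because three of the four fields agree and a differing first name looks just like a nickname or a typo. That is the kind of edge case that erodes confidence in a purely probabilistic link.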