Beyond Big Data for Health Care with Data Science Pipelines


February 16, 2019

By now, the consensus view of Big Data is that bigger doesn't translate into better. It's a more subtle irony that in the face of bigger and faster data, data science is still vexingly difficult. The difficulty lies not in the data or in the science; it lies in formulating an approach to data science pipelines.

One of the more challenging applications is using healthcare data to address chronic disease. At CloudGeometry, we worked with Gali Health to create a data science pipeline architecture for the unique challenges of healthcare data. It required integration across different kinds of healthcare business processes and different points of data collection. The most significant challenge is that the data itself is subject to constant change. Let's take a closer look at where this problem comes from, and how a data science pipeline approach helps.

Beyond BI basics in classic healthcare

Let's look at a highly simplified health care use case:

  1. A patient checks in to visit a doctor at a hospital
  2. The doctor examines the patient and puts information into an Electronic Medical Record
  3. Data capture in the record kicks off interactions with third-party systems like billing, insurance, physician payment, etc.
  4. The doctor's prescription gets sent to the pharmacy, where the patient picks it up
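The fan-out in steps 3 and 4 can be sketched as a minimal event flow. The record and event names below are illustrative assumptions, not any real EMR schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical event types for the four steps above.
@dataclass
class Visit:
    patient_id: str
    doctor_id: str
    checked_in: datetime          # step 1: patient checks in

@dataclass
class MedicalRecordEntry:
    visit: Visit                  # step 2: doctor records the exam
    diagnosis: str
    prescription: Optional[str] = None

def downstream_events(entry: MedicalRecordEntry) -> list:
    """Fan one EMR entry out to the third-party systems it must reach."""
    targets = ["billing", "insurance", "physician_payment"]   # step 3
    if entry.prescription:
        targets.append("pharmacy")                            # step 4
    return targets
```

Even in this toy version, one clinical event already touches four separate systems, which is exactly where cross-system analytics gets hard.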

At the highest level, this use case spans a well-scoped set of systems. Each solves for a separate set of business problems, and they're generally related. Drawing on that commonality of data makes it possible to do some analysis across those systems. Data is aggregated in a data warehouse or data lake. Making sense of what happens in a particular business process (such as how many patients a doctor sees) is a classic problem for business intelligence (BI).


However, this approach quickly falls apart in the real world. It's not just because the systems are inherently complex or costly (and they are). The problem is that neither healthcare nor the patients fit this model. Healthcare does not begin and end with one or more visits to the doctor. This is even more challenging with chronic diseases, where care must be managed continuously because the disease can only be controlled, not cured.

The modern health care data landscape

The technology marketplace has addressed this unmet need with a huge array of data-driven technologies. These applications target both health care problems that end in recovery and chronic diseases that do not. As a result, the analytics landscape becomes far more daunting.

The first-order problem is that business processes that were once amenable to classic business intelligence have become much more complex, with many specialized sources of instrumentation and interaction. The more challenging element is the connections between all these systems, which now form a many-to-many pattern: the same systems must communicate data to each other.


Within this new ecosystem of bigger, faster healthcare data, the constituent systems must deal with continuous change on three major levels.

  1. Infrastructure changes. The configuration of compute, storage, and networking on which applications run is constantly changing. Various services solve for various problems, and they don't all change at the same time. Side effects of changes to one piece of infrastructure (e.g., the transition from IPv4 to IPv6) can propagate in unforeseen ways.
  2. Application changes. Improvements to the logic with which each application solves its isolated problems (as simple as automating a process like checking in to the doctor's office with a smartphone app) can also propagate in unforeseen ways.
  3. Data changes. Very few, if any, health care systems are self-contained. There is always going to be a set of third-party data flows where changes are invisible until the systems that rely on them are blindsided. Cloud applications have accelerated this phenomenon. Changes to data structure and semantics materialize in unexpected ways, often uncovered only when data quality goes wrong.

The infrastructure management world thinks of #1 and #2 as DevOps: application code written by developers is tested in concert with the compute infrastructure managed by operators prior to release, then validated continuously.

When one does not control the infrastructure, as with ingest from third-party applications, the data itself adds a third dimension that has come to be called DataOps. Data flow management tools like StreamSets and Talend provide an environment in which these three layers of accelerating complexity can be managed as they affect the flow of analytic data.
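The "invisible third-party change" problem is concrete enough to sketch. Below is a minimal, hand-rolled structural check on an incoming feed (real tools like StreamSets do this far more thoroughly); the expected field names are invented for illustration:

```python
import json

# Fields we expect from a hypothetical third-party vitals feed.
EXPECTED_FIELDS = {"patient_id", "timestamp", "measurement", "unit"}

def schema_drift(record: dict) -> dict:
    """Compare an incoming record's fields against the expected schema
    and report drift, instead of letting it fail silently downstream."""
    fields = set(record)
    return {
        "missing": sorted(EXPECTED_FIELDS - fields),
        "unexpected": sorted(fields - EXPECTED_FIELDS),
    }

# The upstream vendor silently renamed 'measurement'/'unit' to 'value':
record = json.loads('{"patient_id": "p1", "timestamp": "2019-02-16T10:00:00", "value": 7.2}')
drift = schema_drift(record)
# drift flags 'measurement' and 'unit' as missing, 'value' as unexpected
```

Flagging the drift at ingest is what turns a blindsiding data change into an ordinary operational event.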

Advanced analytics & healthcare data for chronic disease management

And then there are chronic diseases, which don't fit neatly into the healthcare delivery paradigm. The line between illness and wellness doesn't divide neatly between visits to the doctor. This is the thesis behind Gali Health, an innovative startup targeting patient-centered care with a data-driven model.

Gali Health aspires to organize data from a wide range of sources it does not own into what amounts to data pipelines focused on the information patients, physicians, and researchers need to consistently improve outcomes.

  • Process data aggregation. The FHIR standard describes data formats and elements for exchanging Electronic Medical Records, using a modern web-based suite of API technology to facilitate interoperation between legacy health care systems. FHIR provides an alternative to document-centric approaches by directly exposing discrete data elements as services. At Gali Health, we used the FHIR protocols to drive dataflows that integrate basic elements of healthcare like patients, admissions, diagnostic reports, and medications. These can be retrieved and manipulated via their own resource URLs originating on clinical systems, and are ingested through dataflows into depersonalized databases to support research.
  • Machine learning. Health information traditionally has been based on published literature, with observations documented by clinicians. Gali Health turns the process on its head, using updates by each patient, starting from basic health and lifestyle information and moving to more specific clinical information (treatments, diagnosed conditions, recent tests, etc.). Machine learning creates models from the data, rather than pre-programmed business processes suited to a clinical-only setting. Those user-originated dataflows, even though they are depersonalized, can create models that describe key parameters associated with a specific health context. For example, for individuals affected by Crohn's disease, it is important to understand the stages of the disease and its subtypes, responses to known treatments, procedures, and tests, and the types of symptoms. This data gives researchers an ever more granular approach to making sense of disease dynamics outside of the doctor's office.
  • Artificial intelligence. Gali Health's greatest ambition is to move beyond depersonalized modeling to provide personalized support for each person on their health journey. The Gali AI is based on an in-depth analysis of the experiences of individuals who are dealing with prevalent chronic conditions. It renders a behavioral model including clinical, emotional, psychological, and educational components, designed to provide users with comprehensive and personalized support. All of this information will enable Gali to intelligently support the patient in the context of their day-to-day experience.
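The depersonalization step in the first bullet can be sketched in miniature. The resource shape follows the FHIR Patient resource (resourceType, id, name, birthDate); which fields to strip, and the pseudonymization scheme, are illustrative assumptions, not Gali Health's actual policy:

```python
import hashlib

# Hypothetical list of directly identifying FHIR Patient fields to drop.
IDENTIFYING_FIELDS = {"name", "telecom", "address", "identifier"}

def depersonalize(resource: dict) -> dict:
    """Strip identifying fields from a FHIR resource before it enters
    a research dataflow, keeping clinically useful fields."""
    out = {k: v for k, v in resource.items() if k not in IDENTIFYING_FIELDS}
    # Replace the FHIR id with a stable pseudonym so a patient's
    # records still link up across dataflows without exposing the id.
    out["id"] = hashlib.sha256(resource["id"].encode()).hexdigest()[:16]
    return out

patient = {
    "resourceType": "Patient",
    "id": "pat-123",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "birthDate": "1985-07-01",
}
safe = depersonalize(patient)
# 'name' is gone; 'birthDate' survives for cohort analysis
```

Real de-identification (e.g., of dates and rare attributes) is much harder than field dropping; this only illustrates where the step sits in the dataflow.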

In each of these scenarios, a key goal is to facilitate the acquisition of data for exploratory analysis, in effect doubling down on data science for scientific goals.

Data pipeline strategy for bigger/faster medical data

The rich many-to-many relationships of the modern medical data ecosystem require a twofold approach to data flow management, centered on automated workflows. It's a virtuous cycle of pipeline development and dataflow monitoring. Pipeline automation can be applied to data integration in the following ways:

  • Pipeline development. Configurable connectors and data transformations facilitate reuse of common or standardized pipeline logic. They also speed dataflow design and minimize the need for specialized coding skills.
  • Pipeline deployment. New pipeline templates and logic are created and deployed into production together with application and configuration changes, subject to the same build and validation constraints.
  • Pipeline scaling. Variations in volume, periodic or one-time, call for a data pipeline approach that goes hand-in-hand with cloud infrastructure, whose elastic compute, networking, and storage can respond to temporary ebbs and flows without introducing unnecessary costs.
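The reuse idea in the first bullet is worth making concrete. Here is a minimal sketch of configurable transforms composed into a pipeline from a declarative spec (the stage names and configs are invented; tools like StreamSets provide this as a visual catalog rather than code):

```python
from functools import reduce

# A tiny catalog of reusable, configurable record transforms.
TRANSFORMS = {
    # Rename fields according to a {old_name: new_name} mapping.
    "rename": lambda cfg: (
        lambda rec: {cfg.get(k, k): v for k, v in rec.items()}
    ),
    # Convert one Celsius field to Fahrenheit.
    "c_to_f": lambda cfg: (
        lambda rec: {**rec, cfg["field"]: rec[cfg["field"]] * 9 / 5 + 32}
    ),
}

def build_pipeline(spec):
    """Turn a list of (stage_name, config) pairs into one function."""
    stages = [TRANSFORMS[name](cfg) for name, cfg in spec]
    return lambda rec: reduce(lambda r, stage: stage(r), stages, rec)

# Declaring a dataflow is now configuration, not specialized coding:
pipeline = build_pipeline([
    ("rename", {"temp": "temperature"}),
    ("c_to_f", {"field": "temperature"}),
])
result = pipeline({"patient_id": "p1", "temp": 37.0})
```

Because each stage is a named, configured unit, the same logic can be redeployed and validated alongside application changes, per the second bullet.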

The criticality of healthcare data requires a higher level of reliability in the dataflows. At Gali Health, we built this out by continually monitoring the performance of data movement, creating the feedback loop required for both agility and high reliability.

  • Data movement performance. Instrument and monitor the entire data movement architecture for throughput, latency, and error rates, end-to-end or point-to-point.
  • Data characteristics. Logic that inspects data in-flow can flag changes in the structure of the data (added/deleted fields), the semantics of the data (changed type or meaning), and data values (skew and noise), which can serve as event triggers. Remediation can range from retries to taking a dataflow offline for a development fix.
  • Dataflow utilization. Making sense of performance and problems, whether in data characteristics or resources, can provide valuable feedback for logic in downstream analytic systems. For example, pre-processing in lower-cost data storage can save runtime compute cycles.
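The "data values (skew and noise)" trigger in the second bullet can be sketched as a rolling check of a batch statistic against a baseline. The tolerance and the mean-based statistic are illustrative assumptions; production monitoring would use richer distribution tests:

```python
from statistics import mean

def check_value_drift(batch, baseline, tolerance=0.2):
    """Return a drift event if the batch mean moves more than
    `tolerance` (relative) away from the baseline, else None."""
    m = mean(batch)
    drift = abs(m - baseline) / baseline
    if drift > tolerance:
        return {"event": "value_drift", "batch_mean": m, "drift": round(drift, 3)}
    return None

ok = check_value_drift([5.1, 5.0, 4.9], baseline=5.0)       # within tolerance
alert = check_value_drift([9.8, 10.2, 10.0], baseline=5.0)  # drift event fires
```

An emitted event like this is what the bullet calls an event trigger: it can drive a retry, a quarantine of the batch, or offlining the dataflow for a fix.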

In the context of data science, data pipelines need an even greater degree of agility. In more advanced cases, this entails pointing the pipeline efforts at more sophisticated analytic approaches, i.e., ML and AI. But even with more straightforward modeling and transformation, data science as practiced across a broad spectrum of sources must be able to adjust to iterative data manipulations and exploration. The science is as much in the connections across bigger and faster data as it is in the data itself.

Michael Philip

Michael is founder and CEO of CloudGeometry. For the last decade, he has helped a growing series of venture-backed technology startups inside and outside Silicon Valley. CloudGeometry was inspired by his experience in a succession of product and engineering positions at leading technology companies, learning what does (and does not) work in executing product development with a mix of local and remote teams. Ten years in, the team has honed those lessons into a successful product development services portfolio for high-velocity startups. CloudGeometry has designed and launched successful products proven across a range of markets and sectors worldwide.
