The Data Engineering Checklist for Data Science

June 2, 2018

What comes after Big Data? Bigger data. Demands on data scientists are growing even faster, and they need the resources and tools to make ever more effective use of their time. Expanding Data Engineering holds the key to breakthroughs in data science.

The two specialties overlap to some degree. The success of both relies on an effective division of labor, and collaboration between them makes for a more robust backbone for a data-driven business.

Data Engineer Roles and Responsibilities

Ensuring the reliable delivery of data is the overarching responsibility of the data engineer. It’s no longer enough to think of it as “data prep”. Sophisticated data science relies on the creation and operation of data pipelines. Here’s a checklist for what needs to get done to make that work:

  • Data platform system architecture. Engineering a data pipeline requires assembling multiple technology components. They aren’t necessarily built to snap together like Legos; architectural trade-offs must be made to achieve operational coherence, reliability, and growth.
  • Business architectural alignment. The problems to be solved are not strictly technical. Data pumped through the data pipeline serves customers and solves their problems profitably. Architectural decisions must align with profit goals, and contend with changes driven by market forces.
  • Opportunities for data acquisition. Some questions need answers from data outside of the pipeline. Data engineering manages delivery for data it does not completely control. This may be third-party data; it may also require lobbying other organizations to make their data more usable downstream.
  • Orderly supply of multiple integrated data sets. Just putting it all in one place — call it data lake or data warehouse — doesn’t make it all work. Data engineering ensures data across sources can be reliably and securely accessed across use cases.
  • Software-driven infrastructure. The fluid nature of both data and infrastructure means it must be addressable by automated processes. Configuration artifacts (and the processes that modify them) need to live in a code repository.
  • Operational data quality and integrity. Noisy and broken data is a given, all the more so as source systems have broadened the scope of data inputs. Data engineering owns the creation and maintenance of automated data validation rules and alerts.
  • Data consumption support. In a data-driven business, everybody wants some. Direct visibility into the changes consumers need helps to improve infrastructure and business alignment, and makes the data useful to a broader audience faster.
  • Science experiment support. The cycle of experimentation and discovery is unpredictable. Work often never makes it past the “science project” phase; that’s fine. However, data engineering needs to build this in, with spare capacity and flexibility. Advanced analytics, such as automated data modeling with machine learning and AI, needs this even more.
  • Production-ready data science. Success can’t be a surprise. The data engineering team needs to operationalize insights. Repeatably locking in successes relies on reproducible engineering processes, which frees up the data scientist to pursue new insights.
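The data-quality item above can be made concrete with a minimal sketch of automated validation rules for one pipeline stage. The record fields (`user_id`, `amount`) and rule names here are hypothetical illustrations, not part of any particular tool:

```python
# Minimal sketch of automated data-validation rules for a pipeline stage.
# Field names and thresholds are hypothetical examples.

def not_null(field):
    """Rule: the field must be present and non-null."""
    return lambda rec: rec.get(field) is not None

def in_range(field, lo, hi):
    """Rule: the field must be a value between lo and hi inclusive."""
    return lambda rec: rec.get(field) is not None and lo <= rec[field] <= hi

RULES = {
    "user_id is present": not_null("user_id"),
    "amount within limits": in_range("amount", 0, 10_000),
}

def validate(records):
    """Split a batch into clean rows and (row, broken-rules) pairs for alerting."""
    clean, failures = [], []
    for rec in records:
        broken = [name for name, rule in RULES.items() if not rule(rec)]
        if broken:
            failures.append((rec, broken))
        else:
            clean.append(rec)
    return clean, failures

batch = [
    {"user_id": 1, "amount": 250},
    {"user_id": None, "amount": 99},   # fails "user_id is present"
    {"user_id": 2, "amount": -5},      # fails "amount within limits"
]
clean, failures = validate(batch)
```

In a real pipeline the failure list would feed an alerting channel rather than a local variable, and the rule set would live in the same code repository as the rest of the infrastructure configuration, per the software-driven infrastructure item above.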

Data scientist skills

With so much bigger data, it’s easy to imagine all the things a data scientist could do. Both the difficulty of problems and the scarcity of skills demand more focus. The prime directive is applying advanced mathematical and statistical problem-solving skills to tackle difficult business problems using data. Let’s break down the work that matters most.

  • Establish business context. Framing the problems to solve is critical to any research and analysis. Business problems are often imprecise. It’s up to data science to point efforts in the right direction.
  • Identifying relevant datasets. This is the critical step in moving from conversational speculation in the right direction. Self-evident conclusions are rare. Testing how robust insights are requires many iterations across different data inputs and possibly relevant outcomes.
  • Identifying hidden relationships and patterns. Source data almost always comes from unrelated business processes. Data science finds connections across the boundaries between those sources. Patterns that emerge across data sources are generally not all aligned on a single problem.
  • Reflect problem statements to key stakeholders. The first questions any scientist needs to ask are directed at the person asking for the science — problem-solving starts with unpacking ambiguous needs before any heavy analytic lifting.
  • Use visual analytics to tell stories. The language of data and modeling is often opaque to research sponsors. This starts with giving stakeholders tools to improve understanding. Remember, they need to use those tools to explain the findings to other non-scientists. That’s what ensures the conclusions can have real business impact.
  • Leverage automation, machine learning, and AI. For AI and ML to be useful, they need to have solid foundations in modeling and data discipline. Without those foundations, all they can do is make the wrong conclusions easier to draw.
  • Continuously reinforced data quality feedback loop. No data is reliable until you rely on it. A key collaboration point with data engineering, as the source of the data, is to ensure data pipelines have robust enough validations to prevent garbage in the first place.
  • Influence multiple data owners. Data scientists are cutting-edge synthesizers of business problem-solving. That makes them uniquely qualified to see the value of data quality. Outreach to the different business functions that supply the data can show the benefits of investing in more analytics-friendly data.
  • Focus coding on creative problem solving over optimization. Data scientists should not be thought of as coders with superpowers. Software helps data science because coding has great descriptive power. It’s best to assume their work does not prioritize elegant compute and storage optimization, let alone owning all elements of a well-architected cloud.

The age of bigger data ensures big opportunities for data science. Combining it with Data Engineering ensures those opportunities can have an even greater impact.

Alex Ulyanov

Alex is an AWS Certified Professional Solution Architect and a seasoned cloud infrastructure leader. Working with the company’s top clients, including GE Digital, Zypmedia, Origami Logic, and ThinFilm, he has driven architectural innovation to unlock the performance and reliability of cloud environments for a wide range of applications. His extended team of practitioners help large and small companies alike design, build, manage, and grow successful cloud implementations.
