Modern software development runs at a faster pace than ever before. As cloud computing divides winners and losers in the data-driven marketplace, converging applications and their data with the infrastructure they run on makes more and more sense.

Converging software development with infrastructure operations is not news. Everyone is familiar with DevOps. DevOps originated over a decade and a half ago in the first generation of web and mobile, and it was a game changer. Now that “Big Data” has given way to bigger data, the discipline of DevOps needs to encompass what has come to be called “DataOps.”

The ideal goal is to apply the discipline of well-behaved release and validation processes to make data as predictable, reliable, and fresh as updates to mobile apps and websites. The challenge is that data infrastructure is much more complicated.

Data Drift: When Data Is a Moving Target

The era of bigger, faster data goes beyond cloud economics and cheaper storage. With so many sources of data running independently, there has been an explosion of complexity in the data supply chain. Gone are the days when one database managed all the transactions of one business process.

Market risks and opportunities drive changes in business requirements; changes in business requirements drive changes in apps; and changes in apps drive changes in their data. Part of the problem is the speed of that dynamic. What complicates it further is that the multiple systems producing the data undergo their own changes independently, without accounting for the impact on downstream consumers of data and analytics.

Contrast this with 20th-century practices centered on the relational database and the data warehouse. All data and applications were tightly controlled, and their output flowed to a “single source of truth.” The same people who ran the data warehouse ran the business infrastructure. All the answers came from the same place; reports and analytics were always predictable, just like the business models.

That’s now changed irreversibly. More apps produce more data, broadening the range of stakeholders who consume data from those apps. Data infrastructure needs a stable way to metabolize those changes without interruption. Data quality used to mean “never change the data.” Now it means “the data is always changing; what are you going to do about it?”

Agile Development in the Age of Bigger, Faster Data

20th-century software release cycles were measured in months, and sometimes years. Development requirements were stable and transparent. Technologists built systems knowing how they would work and who would use them. They were tested to perfection before release into the wild.

Those days are gone. The change is felt most acutely by businesses that have to compete with the likes of Netflix, Facebook, Amazon, Google, LinkedIn, and their mass-market business models. When those titans emerged, no one really knew what software would best suit consumers. When mobile apps became first-class citizens, mass-market feedback accelerated even further, driving an even more rapid cycle of discovery, experimentation, and improvement.

It was at about this time that the Agile Manifesto set forth a new view of software engineering that has come to dominate modern software development. Where once there was a deep, well-divided set of specialties spread over the long life cycle of building a complex system, agile development values speed of iteration over prescriptive assumptions about the best way to do each thing. In the words of the Agile Manifesto, it prioritizes:

  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan

Modern compute economics and the cloud in some ways forced the hand of software development teams. There was just too much complexity to divide and conquer with a single, long-term work assignment across a single team. The agile approach systematically favors the adaptive strategies on the left side “over” the classic practices on the right.

A good idea that works is better than an idea you haven’t tried yet, or falling back on “we’ve always done it that way”. Bigger and faster data has made the need even more acute.

Pipeline Development and Management

Relaxing the constraints of 20th-century software engineering doesn’t mean that iterative development is free of structure. Upleveling from DevOps to DataOps introduces new infrastructure complexities. There are four stepping stones around which the multiple specialties of data operations must converge:

  • Build—design topologies of flexible, repeatable dataflow pipelines using configurable tools rather than brittle scripting and one-off data movement and imports
  • Execute—run pipelines on edge systems and in auto-scaling cloud environments
  • Operate—manage dataflow performance through continuous monitoring and enforcement of Data SLAs to tie development goals to operational reality (a rough sketch follows this list)
  • Protect—secure data end to end, in and around the pipeline, at every point of its journey, both against bad actors and to satisfy compliance, governance, and similar requirements
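To make the “Operate” step concrete, here is a minimal sketch of what checking a Data SLA might look like in code. The SLA fields, metric names, and thresholds are hypothetical, not the API of any particular monitoring tool.

```python
from dataclasses import dataclass

@dataclass
class DataSLA:
    """A hypothetical Data SLA: what 'healthy' means for one pipeline."""
    max_latency_seconds: float    # freshness: newest data must arrive within this window
    min_records_per_minute: int   # throughput floor expected from upstream sources
    max_error_rate: float         # fraction of records allowed to fail validation

def check_sla(sla, latency_seconds, records_per_minute, error_rate):
    """Compare observed pipeline metrics against the SLA and return any violations."""
    violations = []
    if latency_seconds > sla.max_latency_seconds:
        violations.append(f"latency {latency_seconds:.0f}s exceeds {sla.max_latency_seconds:.0f}s")
    if records_per_minute < sla.min_records_per_minute:
        violations.append(f"throughput {records_per_minute}/min below {sla.min_records_per_minute}/min")
    if error_rate > sla.max_error_rate:
        violations.append(f"error rate {error_rate:.1%} above limit {sla.max_error_rate:.1%}")
    return violations

# Example: an orders pipeline must stay under five minutes of lag with under 1% bad records.
orders_sla = DataSLA(max_latency_seconds=300, min_records_per_minute=1000, max_error_rate=0.01)
for problem in check_sla(orders_sla, latency_seconds=420, records_per_minute=950, error_rate=0.004):
    print("SLA violation:", problem)  # in practice this would page the on-call team or open a ticket
```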

The job of DataOps really revolves around pipelines: the movement of data from origin to consumption in a fabric of many-to-many relationships. What makes this work is that data pipelines are not monolithic; they are continually evolving as sources and endpoints change. Data pipelines include four types of logic for managing data in these relationships:

  • Data Origins: the one or more source systems from which data enters the pipeline
  • Transformations: the data processing you want to perform, which introduces changes into the data as it passes through the pipeline
  • Executors: stages that trigger a task when they receive an event; executors do not write or store events
  • Destinations: the target(s) for a pipeline; a pipeline can write to one or more destinations
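To show how these four types of logic fit together, here is a minimal, tool-agnostic sketch in Python. The record format, file names, and function names are invented for illustration; they are not the API of any specific pipeline product.

```python
import json

# Origin: the source system from which data enters the pipeline
# (here, a hypothetical newline-delimited JSON file of orders).
def origin(path):
    with open(path) as source:
        for line in source:
            yield json.loads(line)

# Transformation: logic that changes records as they pass through the pipeline.
def transform(records):
    for record in records:
        record["amount_usd"] = round(record["amount_cents"] / 100, 2)
        yield record

# Executor: triggers a task when it receives an event; it does not write or store data.
def executor(event):
    print("event received:", event)  # e.g., kick off a downstream job or a notification

# Destination: the target the pipeline writes to (here, another file), which
# emits an event to the executor once the batch completes.
def destination(records, path):
    written = 0
    with open(path, "w") as target:
        for record in records:
            target.write(json.dumps(record) + "\n")
            written += 1
    executor({"type": "batch-complete", "records_written": written})

# Wire the stages together: origin -> transformation -> destination (+ executor on events).
destination(transform(origin("orders.jsonl")), "orders_clean.jsonl")
```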

By applying the DevOps idea of “infrastructure as code”, data pipelines can be managed as a resilient set of resources. Versioning of configurations ensures that pipelines can continue to deliver valuable data at high quality even as the sources and destinations of data change.
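As a rough sketch of what “infrastructure as code” can mean for a pipeline, the topology can live in a declarative definition that is committed to version control rather than buried in scripts. The schema and field names below are hypothetical.

```python
import json

# A hypothetical declarative pipeline definition. Keeping it in the repository gives
# pipeline changes the same history, diffs, reviews, and rollbacks as application code.
pipeline_definition = {
    "name": "orders-to-warehouse",
    "version": "1.4.0",
    "origin": {"type": "kafka", "topic": "orders", "brokers": ["broker-1:9092"]},
    "transformations": [
        {"type": "rename_field", "from": "amt", "to": "amount_cents"},
        {"type": "convert_type", "field": "amount_cents", "to": "integer"},
    ],
    "destinations": [
        {"type": "warehouse_table", "table": "analytics.orders"},
    ],
}

# Serialize deterministically so that version-control diffs show exactly what changed
# when a source, transformation, or destination is added or modified.
with open("orders-to-warehouse.pipeline.json", "w") as definition_file:
    json.dump(pipeline_definition, definition_file, indent=2, sort_keys=True)
```

From there, a CI job can validate the definition and promote the same artifact from development to production, just as it would an application build.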

Along with maintaining well-configured pipeline instances in the repository, DataOps needs to account for alerting and response to changes. The “executors” function of a pipeline can do more than move data from one stage to the next. It can also provide automated data validation and feedback, so that if there are problems at any step of the pipeline, even at initial ingest, they can be addressed. It’s analogous to both test automation on the development side and systems monitoring on the operations side.
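A minimal sketch of that validation-and-feedback loop, assuming a hypothetical rule set and a stand-in alert hook, might look like this:

```python
# Hypothetical per-stage validation: records are checked against simple rules as they
# flow through; bad records are counted, and an alert fires if too many fail.
REQUIRED_FIELDS = ("order_id", "amount_cents", "currency")

def validate(record):
    """Return a list of problems with this record; an empty list means it passes."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if field not in record]
    if record.get("amount_cents", 0) < 0:
        problems.append("negative amount")
    return problems

def alert(message):
    print("ALERT:", message)  # stand-in for paging, chat notifications, or an incident ticket

def checked_stage(records, max_error_rate=0.05):
    """Pass valid records downstream, divert invalid ones, and report the failure rate."""
    passed, failed = 0, 0
    for record in records:
        if validate(record):
            failed += 1  # in a real pipeline the record would be routed to an error sink
        else:
            passed += 1
            yield record
    total = passed + failed
    if total and failed / total > max_error_rate:
        alert(f"{failed} of {total} records failed validation at this stage")

# Usage: wrap any stage's output; here one of two records is missing required fields.
clean = list(checked_stage([
    {"order_id": 1, "amount_cents": 500, "currency": "USD"},
    {"order_id": 2},
]))
```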

With thresholds and validations built into the pipeline, the ops team can take prompt action to iterate and make changes, because the best way to prevent garbage in, garbage out is to ensure no garbage enters the pipeline to begin with.