
Google’s Call to Stop Grooming Data and Accept Its Infinite Arrival

Aug 12th, 2015 9:23pm

After an unexplained dearth of Google news that seemed, at one point, to have lasted several minutes on Wednesday, the company moved the majority of its Google Cloud-based data warehouse portfolio into general release. The release includes Cloud Dataflow, a unique data workflow management system that gives database admins more direct control over their cloud-based environments, closer to the level of control they would expect from on-premises data warehouses.

Unlike most cloud-based database systems, Cloud Dataflow is designed explicitly for integration with third parties, including the commercial Hadoop provider Cloudera and Salesforce. As Google engineers have explained, this is because large-scale, big data deployments tend to differ from one another, sometimes radically. A single cloud-based tool set may not adapt adequately to every conceivable workflow. So Google’s approach is to provide a platform where multiple components can coexist, and customers can determine the workflows that make those components work together.


Consider the ability to create a data warehouse by diagramming it, and you get the idea.

Eric Schmidt, Database Engineer

“Depending on where you’re running these jobs, typically, elasticity is not your friend today, at a macro level,” said Google cloud engineer Eric Schmidt during the recent Hadoop Summit. (Yes, Eric Schmidt is his name; no, he’s not that Eric Schmidt. But with a company that employs this many people, redundancy was bound to happen.)

“Time and life never stop, especially in streaming mode,” Schmidt said. “Data just keeps on coming, keeps on coming. Even though we’re all great developers, and we think we can create rock-solid serialization/deserialization mechanisms that handle variances in schema, the reality is that something is going to change in your system.”

When a typical data warehouse is designed for what vendors call “elasticity,” Schmidt explained, it usually means the vendor has provisioned enough capacity to handle peak workloads. That’s not so much an elastic pipe as a fat one, and over-provisioning is too costly an option for most businesses.

What’s more, as Schmidt and his colleagues have proven, over-provisioning can be inefficient, even at peak capacity.

In a just-released white paper [PDF] co-authored by Schmidt and ten other Google colleagues, for submission to a large-scale database conference in Hawaii later this month, Google makes its formal case for the Dataflow Model. It’s a case you might not expect: When an organization publishes a database management system to the cloud, it’s often on behalf of a customer. The example the researchers give concerns a video services provider that wants simple analytics about its daily usage and some insight into its customers, and doesn’t expect the quality of those analytics to vary as the database scales up or down. Such a customer expects a certain stable level of programmability when its developers build apps to access that data.

Schmidt writes:

We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost.

The database may scale up and, at times, its accessibility may scale down. But from a developer’s perspective, who cares? The programmability of the database stays the same.
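
To make those knobs concrete, here is a minimal sketch of how a developer might express them in pipeline code. It is not taken from the paper; the package and class names follow Apache Beam, the later open-source home of this SDK, and the one-minute windows, late-firing trigger and one-hour allowed lateness are assumed example choices. Each choice maps to one of the paper’s axes: the trigger governs latency, the allowed lateness governs correctness, and the state kept around to honor it governs cost.

// A sketch only: package names follow Apache Beam, the open-source
// descendant of this SDK; the windowing policy below is an assumed example.
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowedCounts {
  // 'events' stands in for an unbounded stream of, say, video-session IDs.
  static PCollection<KV<String, Long>> countPerSession(PCollection<String> events) {
    return events
        // Slice the never-ending stream into one-minute event-time windows.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            // Latency knob: emit a first result once the watermark suggests
            // the window is complete...
            .triggering(AfterWatermark.pastEndOfWindow()
                // ...then refine it each time a late element arrives.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // Correctness-versus-cost knob: keep window state for an hour
            // so that stragglers can still be folded in.
            .withAllowedLateness(Duration.standardHours(1))
            // Late firings add to earlier results rather than replacing them.
            .accumulatingFiredPanes())
        .apply(Count.<String>perElement());
  }
}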

Databases Will Never Be ‘Complete’

Existing database analytics systems, argue Schmidt and his colleagues, expect the input data that constitutes the final analytical model to at some point be “complete.” In the case of statistics surrounding a running system, “completeness” will never happen. The team pointed to this omission in database design as a major shortcoming.

“We believe this approach is fundamentally flawed when the realities of today’s enormous, highly disordered datasets clash with the semantics and timeliness demanded by consumers,” the team wrote. “We also believe that any approach that is to have broad practical value across such a diverse and varied set of use cases as those that exist today (not to mention those lingering on the horizon) must provide simple, but powerful, tools for balancing the amount of correctness, latency, and cost appropriate for the specific use case at hand. Lastly, we believe it is time to move beyond the prevailing mindset of an execution engine dictating system semantics; properly designed and built batch, micro-batch, and streaming systems can all provide equal levels of correctness, and all three see widespread use in unbounded data processing today.”

They go on to suggest that if a workflow system can sufficiently create automated flexibility — adapting workflows to different schemes as usage scales up or down — then from the end customers’ perspectives, the stability of the system will be presumed to be constant. From that point on, customers can judge service providers on just a few remaining variables: latency and resource cost.

The remainder of the white paper outlines a rather complex system (which we would hope the actual Google service would simplify significantly) for defining the conditions under which different workflows would operate, as well as those that would trigger changes in those workflows. It’s a mathematical way to express the ideal that a database system facing workloads which are massive only at certain times should not behave identically under every conceivable condition.

During Hadoop Summit, Schmidt described this simplification as “a simple knob for speed.” Cloud Dataflow now provides a set of SDKs for building parallelized data processing pipelines. It also adds a managed service for optimizing workflows over those pipelines.
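
As a rough illustration of what such a pipeline looks like to the developer, here is a hypothetical, self-contained example, again written against the Apache Beam packages this SDK later became; the MinimalPipeline class, the gs:// paths and the transform labels are placeholders. The developer only describes the pipeline; how it is parallelized and how many workers run it is left to the managed service.

// Illustrative only: package names follow Apache Beam, the open-source
// descendant of the Dataflow SDK, and the gs:// paths are placeholders.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    // Runner, project and worker settings come in on the command line; how
    // the work is scheduled and scaled is decided by the managed service.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLogs", TextIO.read().from("gs://example-bucket/logs/*"))
     .apply("CountLines", Count.<String>globally())
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((Long total) -> "total lines: " + total))
     .apply("WriteResult", TextIO.write().to("gs://example-bucket/output/counts"));

    p.run();
  }
}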

In other words, all the complexity the research team demonstrates in its white paper is based on variables that are ascertained through customers’ use of the Dataflow SDK to create pipelines.

“This is probably now year 14 of innovation around big data,” said Schmidt at one point. “We’re also investing heavily in the cloud. At the same time, we are continuing to embrace all types of open source workflows on top of [Google Cloud Platform]. You want to bring your Hadoop workloads to GCP? Great. We’re continuing to optimize our infrastructure … to run Hadoop-based workloads. At the same time, we’re also spending a lot of innovation time to build fully-managed services. And this is where the optimization comes in.”

Feature image: “Flying Fátima” by José Eduardo Deboni is licensed under CC BY 2.0.
