Battle of Open Source Analytics: Spark vs Drill vs Quasar

Three rising open source projects are on a collision course, and the results will be epic.

There is a pervasive but little-recognized need in modern Enterprise analytics, one that grows more urgent and more obvious with each passing day.

Modern companies want the ability to ask questions and derive insights from all their data, no matter its structure, and no matter its location. They want this power today, right now, without having to go through year-long data warehousing projects—projects that, due to the rapidly changing nature of modern data silos, are often obsolete before they are even finished.

This requirement is best summarized in the tagline: any data, any place, any time. Companies are fed up with being treated like golden geese by purveyors of legacy analytics solutions that drive billions in services revenue. Instead, companies want to explore and understand any kind of data (even if it’s not relational!), no matter where it’s located, and any time they want, without mandatory ETL, data modeling, or data warehousing.

The theoretical solution to this problem is a unified analytics layer that can handle relational and post-relational data, push computation to any data source (web services, databases, data warehouses, and data lakes), and efficiently join together diverse data sets in ad hoc ways.

That’s not an easy piece of technology to build, which explains why no one has built a complete solution to the problem. In fact, the problem is just too big to solve in one fell swoop — attempting to build a feature-complete solution from day one would necessarily end in failure!

To attack a problem of this magnitude, you need an incremental strategy. One that delivers some value today, even while it’s on the road to becoming something much bigger.

Interestingly, there are three different incremental strategies for solving this problem. And, coincidentally (or not?), there are three excellent open source projects, each pursuing one of the three strategies.

Let’s take a look at each strategy in turn.

Spark: Computation

The Spark strategy is to build a monster analytics engine, one that’s fast and adept at performing analytics on so-called semi-structured data (ranging from relational data to JSON- and XML-like data).

When it comes to raw computation, Spark has few equals. Spark was built for in-memory execution and continues to increase in performance. In the latest version, it can compile analytical workflows directly to JVM bytecode, which can be run in-memory and co-located with data across a cluster.
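
As a rough illustration, here is a minimal Spark sketch of the kind of semi-structured workload Spark excels at. The file name and fields are hypothetical; the logical plan is optimized by Catalyst and, where supported, compiled to JVM bytecode through whole-stage code generation.

```scala
// A minimal sketch, assuming Spark 2.x on the classpath. The file
// events.json and its fields are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SemiStructuredSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("semi-structured-analytics")
      .master("local[*]")
      .getOrCreate()

    // Schema is inferred from the JSON; nested fields become columns.
    val events = spark.read.json("events.json")

    // Catalyst optimizes this plan and, where supported, whole-stage
    // code generation compiles it to JVM bytecode before execution.
    events
      .filter(col("status") === "complete")
      .groupBy(col("user.country"))
      .agg(count("*").as("events"), avg("latencyMs").as("avgLatency"))
      .show()

    spark.stop()
  }
}
```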

Still, Spark has not been content to remain an engine of computation. Spark connectors for NoSQL data sources have begun proliferating, and a couple of NoSQL database vendors are promoting Spark as a leading solution for analytics on their databases.

Unfortunately, because Spark was designed primarily for number crunching, its ability to federate queries is poor, so you can’t join data silos together with maximum efficiency. In addition, while Spark’s developer APIs support semi-structured data, Spark SQL only pays lip service to non-relational data. Finally, Spark’s ability to push computation down into a data source (such as PostgreSQL, Elasticsearch, or a complex cloud API) is abysmal.
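
For concreteness, here is a hedged sketch of that pushdown boundary using Spark’s JDBC source, with hypothetical connection details and table names. At the time of writing, the JDBC source pushes down column pruning and filter predicates, but not aggregations or joins.

```scala
// A hedged sketch of the JDBC pushdown boundary. Connection details,
// credentials, and the orders table are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host/sales")
      .option("dbtable", "orders")
      .option("user", "analyst")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // The filter and the column pruning are pushed into PostgreSQL as a
    // WHERE clause and a narrowed SELECT list...
    val filtered = orders
      .filter(col("amount") > 100)
      .select("region", "amount")

    // ...but the aggregation is not pushed down: Spark pulls the filtered
    // rows across the network and runs the GROUP BY itself, even though
    // PostgreSQL could execute it in place. explain() shows the boundary.
    filtered.groupBy("region").agg(sum("amount").as("revenue")).explain()
  }
}
```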

To realistically become more than a distributed analytics engine, Spark needs the ability to push arbitrary computation into a data source, and to fully support the complex, heterogeneous data models and use cases that are the hallmark of NoSQL analytics.

Yet, these ambitious goals are precisely what Databricks and other contributors to Spark are focusing on, at least in part. The next major version of Spark will have greatly improved pushdown, and better SQL support for analytics over semi-structured data (though still poor in comparison to the developer APIs or non-Spark solutions to the problem).

Spark wants to be more than just a distributed analytics engine.

Drill: Federation

The Drill strategy is to build a sophisticated query federation system, which can distribute computation to diverse data sources, including those that support semi-structured data.

The project has existed since 2012. While originally targeted primarily at distributed, interactive queries, the project now focuses on semi-structured data and query federation (such as distributed joins) across different data sources.
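
To make the federation strategy concrete, here is a sketch of a single Drill query issued over JDBC from Scala, joining a MongoDB collection against raw JSON files. The plugin names, database, and file path are hypothetical examples of Drill’s storage-plugin syntax, and a local drillbit is assumed.

```scala
// A sketch, assuming a local drillbit and the Drill JDBC driver on the
// classpath. The mongo and dfs plugin names, database, and path are
// hypothetical.
import java.sql.DriverManager

object DrillFederationSketch {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost")

    // One statement spans two silos: a MongoDB collection and raw JSON
    // files on disk. Drill plans the join and distributes the execution.
    val sql =
      """SELECT p.name, SUM(o.qty) AS total
        |FROM mongo.inventory.products p
        |JOIN dfs.`/data/orders.json` o ON p.sku = o.sku
        |GROUP BY p.name""".stripMargin

    val rs = conn.createStatement().executeQuery(sql)
    while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getLong(2)}")
    conn.close()
  }
}
```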

Drill, of course, has its own analytics engine, but unlike Spark’s, it is not capable of general-purpose computation. Rather, it handles the same class of analytical problems that can be solved by SQL queries, which is where the mainstream market is.

Drill’s implementation of SQL (powered by Calcite, among other projects) has better support for semi-structured data than Spark, but abysmal support for heterogeneity. The schema-on-read feature is less about dynamic schema, and more about deferred schema, which is not surprising considering the strongly relational basis of the project.
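
A small, hypothetical example makes the distinction concrete. A deferred-schema engine infers a schema from the first records it reads and tends to fail when the data later disagrees; a truly dynamic-schema engine does not.

```scala
// Hypothetical records illustrating schema heterogeneity: the contact
// field is a string, then a document, then absent entirely.
object HeterogeneitySketch {
  val records: String =
    """{"id": 1, "contact": "alice@example.com"}
      |{"id": 2, "contact": {"email": "bob@example.com", "phone": "555-0100"}}
      |{"id": 3}""".stripMargin
  // A deferred-schema reader locks in the type it sees first and will
  // typically reject record 2 with a schema-change error; a dynamic-schema
  // engine can answer SELECT contact FROM ... across all three records.
}
```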

Beyond these limitations, which make Drill post-relational technology (but just barely), many of Drill’s connectors perform extremely minimal pushdown, even when the underlying data source (such as MongoDB) is capable of efficiently executing large chunks of an analytics workflow.

Drill positions itself as not being competitive with Spark. In a sense, that’s right: the two are not directly competitive. But Drill does go head-to-head with Spark SQL. And even while Spark gains more connectors, better pushdown, better federation, and improved support for semi-structured data, Drill is getting faster at crunching numbers (with Apache Arrow), and in time, with sufficient resources, Drill’s connectors will improve.

Quasar: Compilation

The Quasar strategy is to build advanced compiler technology that’s capable of pushing arbitrarily complex semi-structured analytics workflows all the way down to the irregular surface areas of modern data silos (no matter how fragmented they might be).

The newest of the three projects, Quasar started life with no ability to perform any computation at all. Given an analytics workflow expressed in SQL2 (a dialect of SQL that supports unlimited heterogeneity and nesting), Quasar compiles it to low-level operations that run entirely inside the target data source.
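
To illustrate (with a hypothetical mount path and field names, and without claiming this is the exact surface syntax), a SQL2 query over a MongoDB-backed mount might look like the following. Quasar compiles such a query entirely to the source’s native operations, in MongoDB’s case an aggregation pipeline.

```scala
// An illustrative SQL2 query held as a Scala string; the mount path
// /prod/reviews and the field names are hypothetical.
object QuasarSketch {
  val sql2: String =
    """SELECT city, AVG(score) AS avgScore
      |FROM `/prod/reviews`
      |WHERE score >= 4
      |GROUP BY city""".stripMargin
  // Rather than executing this itself, Quasar compiles the whole query to
  // the target's native operations (for MongoDB, an aggregation pipeline),
  // so the computation runs entirely inside the data source.
}
```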

Quasar’s support for pushdown is second-to-none, and the project employs the most sophisticated compiler technology of any open source project. The compiler is based on research into generalized recursion schemes and fixed-point data encoding, and incorporates bidirectional type inference, a pattern-based structural type system (with direct support for schema heterogeneity and data-in-schema), and a next-generation architecture straight from academia.

Unlike the other solutions, which offer ad hoc extensions to support post-relational data models, Quasar formally extends relational algebra to handle heterogeneity and multi-dimensionality. The extension, dubbed MRA (Multi-dimensional Relational Algebra), has a specification and reference implementation, and there is a formal grammar for SQL2.

Quasar’s current weaknesses are precisely the respective strengths of Spark and Drill: Quasar doesn’t have the ability to execute its own analytical computations (it must push them into the data source), and Quasar can’t yet federate queries across multiple data sources.

However, Quasar development is accelerating, and the project should gain four more connectors, a high-speed, columnar analytics engine for semi-structured data, and federated queries—all before the end of the year.

Destined to Collide

As should be incredibly obvious by now, these three projects — Spark, Drill, and Quasar — are on a collision course.

While all three projects took completely different approaches, they are all converging on the same problem: modern companies want to be able to explore and understand all their data, no matter its structure, and no matter where it’s located, without the fuss of ETL, data modeling, and data warehousing.

To solve this problem, a solution must have the following characteristics:

  • General-purpose and highly efficient analytical computation, with robust support for semi-structured data.
  • Powerful query federation, including distributed joins, across both relational and post-relational data sources.
  • Compilation of advanced analytical workflows on semi-structured data to arbitrary targets (cloud APIs, databases, data lakes, data warehouses).
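
Put together, these characteristics imply a single ad hoc statement spanning wildly different silos. The sketch below is purely illustrative, with every source, table, and column name hypothetical: a cloud API, a document database, and a data warehouse joined in one query, with as much work as possible pushed into each source.

```scala
// Purely illustrative; every source, table, and column name here is
// hypothetical. One ad hoc statement joins a cloud API, a document
// database, and a data warehouse, pushing work into each source.
object HybridSketch {
  val unifiedQuery: String =
    """SELECT c.segment, r.name AS region, SUM(t.amount) AS revenue
      |FROM salesforce.accounts c
      |JOIN mongo.app.transactions t ON t.accountId = c.id
      |JOIN warehouse.public.regions r ON r.code = t.region
      |GROUP BY c.segment, r.name""".stripMargin
}
```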

The solution to modern Enterprise analytics doesn’t look exactly like Spark, Drill, or Quasar. Rather, what companies want is a hybrid combining aspects of all three projects. Which is exactly why all three projects are trying to shore up their respective weaknesses and go head-to-head in the same market.

From this battle of the three giants, only one will emerge as the market leader. The others will be forced to specialize or settle for runner-up positions.

The technologies themselves can give us clues about what’s going to happen.

Surviving the Apocalypse

Spark’s greatest strength, the fact that it’s a workhorse of general-purpose computation, is also a great weakness, because it means that Spark wants to do all the computation itself rather than share it with other technologies. In addition, Spark is a general-purpose analytics platform for developers only. Spark SQL is considerably less powerful, unable to satisfy all eight characteristics of NoSQL analytics systems.

Ultimately, as competing or overlapping technologies such as Flink, Beam, Concord, Pachyderm, and Kudu increase in popularity, Spark is going to have to decide whether to stay greedy or whether to greatly improve its core compilation and federation technology, and increase the expressive power of Spark SQL for modern data models. It must decide whether to stay a developer-focused analytics engine, or become a high-level uniform interface to all sources of data, even those that have built-in computational engines of their own.

Drill’s breadth of connectors is impressive, though their depth suffers due to both limitations in the relational way its core thinks about data (which doesn’t give semi-structured data a first-class treatment), and the complexity of doing more advanced types of pushdown given the way the technology is architected.

Quasar’s compiler technology and mathematical extension to relational algebra are sophisticated, which makes the technology very un-opinionated about where the data is located and what structure it has (if any). Quasar can push even the most complex analytics into very irregular targets, such as APIs and NoSQL databases. But currently, Quasar must push analytics all the way down, so it only works with query-complete backends, and as a consequence, it has no support for federating queries across multiple sources of data.

While many might pick Spark as a winner because of its traction in the market, there are powerful historical reasons for doubting this verdict:

  1. Open source analytics engines come and go. MapReduce, which was once all the rage, is now obsolete—people barely remember what it was used for. A few years ago, no one had heard of Dataflow or Flink or Pachyderm, but now each is showing rising adoption. There is no doubt that Spark is the leading framework for distributed analytics, but history suggests that this does not result in long-term dominance.
  2. Developers aren’t the right market for analytics. Spark has enjoyed tremendous success, but only because developers have been needed to create data pipelines and analytics workflows. No business wants developers mucking around with analytics if it can avoid it. Writing code to solve analytics problems cannot last forever; as well-defined use cases emerge, high-level solutions follow.
  3. SQL will continue to solve the mainstream needs for general-purpose analytics. Many companies have tried killing SQL, and all of them have failed. Spark, Drill, and Quasar all offer a SQL interface to data, but companies don’t use Spark SQL because it’s the best interface across today’s data silos. They use Spark SQL because they’re already using other components of Spark (such as the DataFrame API).
  4. Companies don’t care what crunches the numbers. This suggests that solutions which focus on number crunching are vulnerable to churn, whereas solutions which focus on the business problem (providing a uniform means of constructing analytics workflows across all data silos) are likely to have far greater staying power, particularly if they can nimbly adapt to new computational frameworks, which biases toward powerful compiler technology like Quasar’s.

If Spark is destined to become the ultimately forgettable analytics engine that followed MapReduce (but was replaced by whatever comes next), this leaves Drill and Quasar. Both of these projects are less choosy about who does the number crunching, and both have embraced and extended SQL. Indeed, the projects have put a lot of work into ensuring their interfaces are powerful and uniform across today’s modern sources of data.

Drill’s compiler technology is good, but its current generation of connectors fails to take advantage of the technology. Drill’s more significant problem is that it doesn’t really think about the world in a truly semi-structured way, which means that historically, Drill has provided a second-class experience for many use cases in NoSQL analytics—including data-in-schema, heterogeneous schema, and XML-like data.

Some of these limitations have lessened in recent years, but there has been zero progress on others. In part, this is because Drill depends on external projects (such as Calcite), and because satisfying all eight characteristics of NoSQL analytics systems requires a re-formalization of relational analytics. As Presto amply demonstrates, adding a few JSON functions on top of a relational core doesn’t meaningfully address the analytics problem for semi-structured data.

Quasar’s powerful generalized data model and formal extension to relational algebra mean it can handle the full range of modern data, including the most advanced use cases in NoSQL analytics. In addition, the compiler technology can target the irregular surfaces of the world, such as cloud APIs, NoSQL databases, data lakes, and RDBMS systems, and can push maximum computation into the underlying source of data.

Between Drill and Quasar, Quasar has the edge. Quasar’s approach of rethinking the foundations of analytics for modern data, focusing less on where the data is and more on what you can do with it, holds tremendous promise for providing a uniform, general-purpose, code-free interface to all the world’s sources of data, even the sources that are too irregular or too complex for Drill and Spark SQL.

Quasar’s interface allows expressing arbitrary analytics on semi-structured data, with full support for complex data models, schema heterogeneity, data-in-schema, and other issues that relational technology chokes on. Drill can’t handle all of these cases, and neither can Spark SQL, which means businesses should, in theory, gravitate to the one interface that lets them access all their data and solve all their problems, rather than to partial solutions like Drill and Spark SQL.

What Quasar needs to go from up-and-comer to market leader is simply what the others have and Quasar currently lacks: more connectors, federated queries (including distributed joins), and fallback evaluation for data sources that aren’t query-complete.

If Quasar can address these limitations in a reasonable time frame, then Quasar will become the unified interface to modern data — the one that requires zero ETL, zero data modeling, and zero data mapping; the one that can handle arbitrary analytics over any kind of data, and the one that can push maximum computation to the data source, no matter how irregular it might be.

Watch what happens over the next 12 to 18 months. In my opinion, that period will decide the future of NoSQL analytics, including which technologies become market leaders and which are relegated to the dustbin of history.
