March 26, 2018

Making Hadoop Relatable Again

There has been much debate over the future of Hadoop in recent months. Should it work more like a cloud object store? Should it support GPUs and FPGAs, Docker or Kubernetes (or both)? Should compute and storage be separated in Hadoop? Is it even necessary anymore? The folks at Splice Machine have their own take: If you make Hadoop look more like a relational database, then people will do more with it.

“Everybody is struggling to figure out how to expose what’s been put on the data lake to the business,” Splice Machine founder and CEO Monte Zweben told Datanami at the recent Strata Data Conference. “Our opinion is that you can take infrastructure that people understand, like relational database management systems, and run them directly on the data lake.”

That’s essentially the message Splice has been pushing since the peak of the Hadoop frenzy in the 2013-2015 timeframe, and it’s the same message it’s pushing today. The big difference, according to Zweben, is the maturity level: Splice Machine’s open source technology, which turns Hadoop into a distributed, ACID-compliant relational database, is now ready for primetime. Wells Fargo is arguably its biggest paying customer and production use case, but it has dozens more across financial services, healthcare, and other industries.

“We’re at a point of inflection as a company,” Zweben said. “We spent the last four years making the transactional database really work at scale – being able to have petabyte-scale customers getting millisecond response times to queries for record lookups. That hasn’t really been done before at the SQL level with ACID compliance, and we finally proved it at that level for production data.”


Earlier this month the San Francisco company announced a new connector that extends its Hadoop-resident RDBMS further into the world of Apache Spark. While Splice already utilized Spark (along with HBase) as an execution engine, the new connector brings Spark DataFrames into the Splice fold.

According to Splice, the connector brings two main benefits. First, it extends all the CRUD-like capabilities of Splice’s database – including creating tables and inserting, updating, upserting, deleting, and querying data – to Spark DataFrames. Second, it makes data in Splice’s database available to Spark’s libraries and interfaces, such as MLlib, Spark Streaming, SparkR, and Spark SQL.
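To make that concrete, here is a minimal sketch of the pattern the connector enables, in PySpark. The PySpliceContext class, its import path, and the table names are assumptions for illustration, not confirmed Splice Machine API; consult the company’s documentation for the supported interface.

```python
# Hedged sketch of the connector pattern described above. PySpliceContext,
# its import path, and the table names are assumptions for illustration,
# not confirmed Splice Machine API.
from pyspark.sql import SparkSession
from splicemachine.spark.context import PySpliceContext  # assumed path

spark = SparkSession.builder.appName("splice-sketch").getOrCreate()
splice = PySpliceContext(spark)  # bridges Spark jobs to the Splice RDBMS

# Read: run SQL against the database, get a Spark DataFrame back.
accounts = splice.df("SELECT customer_id, balance FROM BANK.ACCOUNTS")

# Write: DataFrame-driven mutations, with ACID semantics handled by the
# database rather than by the Spark job.
adjusted = accounts.withColumn("balance", accounts.balance * 1.01)
splice.update(adjusted, "BANK.ACCOUNTS")   # update matching rows in place
splice.insert(adjusted, "BANK.AUDIT_LOG")  # append rows to a second table
```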

Having a full database backing Spark will simplify data movement for data scientists and engineers working in Spark, Zweben says. For starters, they no longer need to use JDBC or ODBC connections, which require data to be serialized and moved one record at a time. That will help ETL and streaming analytics use cases.
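For contrast, the conventional JDBC route in Spark looks like the snippet below; this is the record-at-a-time path the connector is meant to replace. It continues the sketch above, and the URL form and driver class name are placeholders, not confirmed values.

```python
# Continuing the sketch above: the conventional JDBC path, in which rows are
# serialized through the driver one record at a time. The URL, port, and
# driver class are illustrative placeholders, not confirmed values.
jdbc_accounts = (
    spark.read.format("jdbc")
    .option("url", "jdbc:splice://db-host:1527/splicedb")        # assumed URL form
    .option("dbtable", "BANK.ACCOUNTS")
    .option("driver", "com.splicemachine.db.jdbc.ClientDriver")  # assumed class
    .load()
)
```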

The DataFrames API will also help for machine learning use cases, he says. After data scientists working in Spark build a predictive model using Python or R, they can use Splice Machine to extend that model to the data a business application has stored in the relational database.
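A sketch of that hand-off might look like the following: a model is fitted in Spark, then applied to rows living in the relational database, with predictions written back. The `spark` and `splice` objects and the table names are assumptions carried over from the earlier sketch.

```python
# Sketch of the hand-off described above. The splice context and table
# names are assumptions carried over from the earlier example.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0)], ["f1", "f2", "label"]
)
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression().fit(assembler.transform(train))

# Score database-resident rows and persist predictions back to the database.
live = assembler.transform(splice.df("SELECT id, f1, f2 FROM APP.CUSTOMERS"))
splice.insert(model.transform(live).select("id", "prediction"), "APP.SCORES")
```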

Some people may wonder whether SQL has much to do with machine learning. Machine learning, after all, uses the power of statistics to automatically draw correlations between derived features hidden in data – huge amounts of data, ideally – while SQL is used to query and reshape the data stored in tables.

Zweben has been down that road before. “Machine learning needs SQL,” he says. “And the reason is the power of any machine learning analytic is not algorithmic. It comes from the signal in the data. Getting the data in the right feature vector for the analytic is the secret behind good data science.”
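In Spark terms, Zweben’s point looks something like the sketch below: SQL does the shaping, and the resulting columns become the feature vector an MLlib model consumes. The data and column names are invented for illustration.

```python
# Sketch of Zweben's point: SQL shapes raw rows into per-customer features,
# which are then packed into the vector an MLlib model expects. All data
# and names here are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("sql-features-sketch").getOrCreate()
spark.createDataFrame(
    [(1, 120.0), (1, 80.0), (2, 300.0)], ["customer_id", "amount"]
).createOrReplaceTempView("transactions")

# The feature engineering happens in SQL, not in the learning algorithm.
features = spark.sql("""
    SELECT customer_id,
           COUNT(*)    AS txn_count,
           AVG(amount) AS avg_amount,
           MAX(amount) AS max_amount
    FROM transactions
    GROUP BY customer_id
""")

vectors = VectorAssembler(
    inputCols=["txn_count", "avg_amount", "max_amount"], outputCol="features"
).transform(features)
vectors.select("customer_id", "features").show()
```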

When it was pointed out that Spark already has a SQL implementation, Zweben agreed that Spark SQL is useful for some things, but argued that it’s not strong enough. “It doesn’t have enough SQL in it,” he said. “It does a lot, but it’s not mutable. There’s no updatable capability.”
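The distinction is easy to demonstrate. Spark SQL of that era rejects a mutation outright, while the same statement is ordinary DML to a database; the `splice.execute` handle below is an assumption consistent with the earlier sketches.

```python
# Spark SQL (as of 2018) has no UPDATE for plain tables; the parser rejects
# the statement. The same DML is routine for an updatable database.
try:
    spark.sql("UPDATE transactions SET amount = 0 WHERE customer_id = 42")
except Exception as err:
    print(f"Spark SQL rejects the mutation: {err}")

# Against the Splice RDBMS, the equivalent statement would simply run
# (execute() is an assumed handle, consistent with the earlier sketches):
# splice.execute("UPDATE BANK.ACCOUNTS SET balance = 0 WHERE customer_id = 42")
```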

There’s a certain power and elegance that comes from having a database at your command, as opposed to just a file system that accepts new files but doesn’t let you update existing ones. As some of the luster has come off Hadoop, Splice is well-positioned to find out how much demand there is for a Hadoop-resident relational database.

“Somebody said the other day, ‘Why don’t you just describe yourselves as making Hadoop updateable?’ It was an interesting statement,” Zweben said. “That’s what we are. Just like a database makes large-scale database tables updateable, delete-able, and query-able, that’s what we do to big data. And we do it in the same way as you did it on a relational database management system.”

Related Items:

How Erasure Coding Changes Hadoop Storage Economics

Hadoop-based RDBMS Now Available from Splice

Picking the Right SQL-on-Hadoop Tool for the Job
