#246 — March 22, 2019


Psst... we have a special feature further down in this issue focusing on the increasing role of GPUs in the database world, so don't miss it 😄

Database Weekly

Nvidia Sees Green in Data Science Workloads — A significant trend we’re seeing in the database and data science spaces right now is the use of GPUs for large-scale data processing, so it should be no surprise that Nvidia are now doubling down on this sector. Nvidia’s CEO, Jensen Huang, devoted a third of his recent keynote solely to data science and to where he sees Nvidia leading the way.


A Decade Later, Apache Spark Still Going Strong — While Spark’s rapid growth seems to be more of a recent phenomenon, the project, which began quietly at UC Berkeley and “kicked MapReduce out of the Hadoop nest”, is actually 10 years old.


Tired of Manually Revoking SSH Keys & DB Creds — "Securing access to databases was unmanageable. Now with strongDM it's very simple." - VP Engineering, Hearst MediaOS | Manage access to every DB and server with your existing SSO. Click here to watch a demo.

strongDM sponsor

Scaling Relational Databases with Apache Spark SQL and DataFrames — Spark is now well established as a high-performance cluster-computing framework based around the MapReduce approach of processing data. But it’s also possible both to use SQL with it and to represent data stored within Spark in a relational style using DataFrames. This article explains how, then follows up with a hands-on tutorial.

Dipanjan (DJ) Sarkar

Neo4j Announces Neo4j Labs for Building Next-Gen Graph Database Tooling — The company behind the popular Neo4j graph database are formalizing the idea of having a variety of experimental projects that at least one Neo4j employee will be working on. GRANDstack, a toolkit for building full stack GraphQL databases on top of Neo4j, is one such example.

Michael Hunger

🔎  A quick look at GPUs in the database world...

With Nvidia's increased interest in GPU-based data science and databases, what else is going on in the space?

Early on, GPUs were used to run SQL queries faster by executing them heavily in parallel. Kinetica (originally GPUdb) made a splash back in 2016 with its work on commercializing such GPU-backed databases, which until then had been almost entirely limited to research projects.

MapD ('MAssively Parallel Database'), another pioneer in this market and now known as OmniSci, open sourced their database system in 2017 (this article is a great way to understand their technology).

Now, the market is exploding in numerous directions. One popular approach is to bring GPU capabilities to existing database systems: IBM is adding GPU capabilities to Spark, while Brytlyt (more on their tech here) and PG-Strom are doing the same for PostgreSQL.

Alternatively, there are entirely new approaches, such as Sqream's GPU-based data warehouse, or Uber's AresDB, an open source real-time analytics engine that Uber uses to deal with their huge firehose of data.

📖 Tutorials

Oracle vs. SQL Server Architecture — This is a pretty neat high-level comparison of the different approaches taken by Oracle Database and SQL Server, aimed at either DBAs who need to work with both or industry observers like me :-)

Kellyn Pot’Vin-Gorman

Stitching Sheets: Using MongoDB Stitch To Create An API For Data In Google Sheets

MongoDB sponsor

Data Masking in the World of GDPR

Lockwood Lyon

MySQL Connection Handling and Scaling — A modern description of how MySQL works with connections, threads, and scaling generally.

Geir Høydalsvik

4 Apache Cassandra Pitfalls You Must Avoid — Keep it secure, keep it monitored, and don’t treat it like a standard relational database.

Justin Cameron

Performing a PostgreSQL Upgrade using pg_dumpall — A look at one of several approaches to upgrading Postgres. pg_dumpall is well suited for clusters and is relatively straightforward, although it’s not ideal for particularly large databases.

Vallarapu, Camargos, Augustine and Ihalainen

💬 Stories and Opinions

The Creator of Elasticsearch on 'Open' Distros, Open Source, and Building a Company — Elasticsearch’s creator has responded to Amazon’s move to create an ‘Open Distro’ for Elasticsearch, a story we featured last week. Others, however, say that Amazon are doing their bit.

Elastic Blog

MongoDB Named a Leader in Forrester's Latest NoSQL Report — MongoDB has been recognized as a NoSQL market leader by research firm Forrester in its Forrester Wave™: Big Data NoSQL, Q1 2019 report, with the strongest “current offering”, though Microsoft has the edge on “strategy”.