Getting Answers Faster: NVIDIA and Open-Source Ecosystem Come Together to Accelerate Data Science

by Clement Farabet

No matter the industry, data science has become a universal toolkit for businesses. Data analytics and machine learning give organizations insights and answers that shape their day-to-day actions and future plans. Being data-driven has become essential to leading any industry.

While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For this reason, scientific computing and deep learning have turned to NVIDIA GPU acceleration. Data analytics and machine learning haven’t yet tapped into the GPU as systematically. That’s changing.

RAPIDS, launched today at GTC Europe, gives data scientists, for the first time, a robust platform for GPU-accelerated data science: analytics, machine learning and, soon, data visualization. What’s more, the libraries are open source, built with the support of open-source contributors and available immediately at www.RAPIDS.ai.

Initial benchmarks show game-changing 50x speedups with RAPIDS running on the NVIDIA DGX-2 AI supercomputer, compared with CPU-only systems, reducing experiment iteration from hours to minutes.
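To make the "hours to minutes" claim concrete, the arithmetic behind a speedup factor can be sketched as below. This is only an illustration: the function name and the 5-hour baseline are hypothetical, not figures from the benchmarks, and the sketch assumes the speedup applies uniformly to the whole experiment.

```python
def accelerated_minutes(baseline_hours: float, speedup: float) -> float:
    """Return the runtime in minutes after applying a uniform speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours * 60.0 / speedup


# Hypothetical example: a 5-hour CPU-only experiment at the quoted 50x speedup.
print(accelerated_minutes(5.0, 50.0))  # 6.0
```

In practice, stages that are not accelerated (data loading, for instance) would reduce the end-to-end gain, so the result is best read as an upper bound under these assumptions.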

By the Community, for the Community

With a suite of CUDA-integrated software tools, RAPIDS gives developers new plumbing under the foundation of their data science workflows.

To make this happen, NVIDIA engineers and open-source Python contributors collaborated for two years. Building on key open-source projects including Apache Arrow, Pandas and scikit-learn, RAPIDS connects the data science ecosystem by bringing together popular capabilities from multiple libraries and adding the power of GPU acceleration.

RAPIDS will also integrate with Apache Spark, the leading open-source data science framework for data centers, used by more than 1,000 organizations.

A data science workshop following the GTC Europe keynote will feature a panel of open-source luminaries: Travis Oliphant and Peter Wang, co-founders of Anaconda, and Wes McKinney, creator of Apache Arrow and the Pandas software library and a contributor to RAPIDS.

These pioneers will discuss the potential of RAPIDS for GPU-accelerated data science before an audience of developers, researchers and business leaders. At the workshop, Databricks, a company founded by the creators of Spark, will present on unifying data management and machine learning tools using GPUs.

It was a natural step for NVIDIA, as the creator of CUDA, to develop the first complete solution that integrates Python data science libraries with CUDA at the kernel level. By keeping it open source, we welcome further growth and contributions from other developers in the ecosystem.

This community is vast: the core data science libraries are downloaded tens of millions of times a year through the Conda package manager. Open-source development makes it easier for data scientists to adopt RAPIDS rapidly while keeping the flexibility to modify and customize tools for their applications.

In recent years, NVIDIA has made diverse contributions to the AI open-source community, including the Material Definition Language SDK, the NCCL software module for communication between GPUs and the NVIDIA DIGITS deep learning application.

There are 120 repositories on our GitHub page, including research algorithms, the CUTLASS library for matrix multiplication in CUDA and NVcaffe, our fork of the Caffe deep learning framework. And we’ll continue to contribute to RAPIDS alongside the open-source community, supporting data scientists as they conduct efficient, granular analysis.

Delivering Rapid Answers to Data Science Questions

Data scientists, and the insights they extract, are in high demand. But when relying on CPU systems, there’s always been a limit on how fast they can crunch data.

Depending on the size of their datasets, scientists may have a long wait for results from their machine learning models. And some may aggregate or simplify their data, sacrificing granularity for faster results.
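The cost of aggregating can be seen in a toy example. The sketch below uses invented daily sales figures for three hypothetical stores (none of this data comes from the article) and compares a single chain-wide average with a per-store view.

```python
from statistics import mean

# Hypothetical daily unit sales for three stores; invented for illustration.
sales = {
    "store_a": [10, 12, 11],
    "store_b": [40, 42, 41],
    "store_c": [5, 90, 7],
}

# Aggregated view: one average for the whole chain.
chain_mean = mean(v for days in sales.values() for v in days)

# Granular view: average and peak day for each store.
per_store = {store: (mean(days), max(days)) for store, days in sales.items()}

print(round(chain_mean, 2))
for store, (avg, peak) in per_store.items():
    print(store, avg, peak)
```

The chain-wide average hides the one-day spike at the third store, which only shows up when the data is kept at store level. Keeping that detail means processing more rows, which is where faster computation helps.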

With the adoption of RAPIDS and GPUs, data scientists can ramp up iteration and testing, providing more accurate predictions that improve business outcomes. Typical training times can shrink from days to hours, or from hours to minutes.

In the retail industry, faster iteration could allow grocery chains to estimate the optimal amount of fresh fruit to stock in each store location. For banks, GPU-accelerated insights could alert lenders about which homeowners are at risk of defaulting on a mortgage.

Access to the RAPIDS open-source suite of libraries is immediately available at www.RAPIDS.ai, where the code is being released under the Apache license. Containerized versions of RAPIDS will be available this week on the NVIDIA GPU Cloud container registry.

For more updates about RAPIDS, follow @rapidsai.