20170602_OSSummit_an_intelligent_storage
1. An Intelligent Storage
~SQL Execution on GPU Closely Connected with SSD~
PG-Strom Development Team
KaiGai Kohei <kaigai@kaigai.gr.jp>
2. The PG-Strom Project
Who am I?
▌KaiGai Kohei
PG-Strom Development Team
▌Background
Linux kernel (SELinux, …)
PostgreSQL database
Enhancement of security features
Writable FDW, Remote Join
Custom Scan Interface
GPU and CUDA
▌PG-Strom
SQL Acceleration Technology using GPU
Solution for data analytics in-database
Open Source Summit Japan 2017 - An Intelligent Storage
3. The PG-Strom Project
PG-Strom (1/3) – GPU as accelerator of SQL
                      | GPU                  | CPU
Model                 | NVIDIA Tesla P40     | Intel Xeon E5-2699v4
Code name             | Pascal               | Broadwell
Release               | Q3-2016              | Q1-2016
# of transistors      | 12 billion           | 7.2 billion
# of cores            | 3840 (simple)        | 22 (functional)
Core clocks           | 1.306GHz~1.53GHz     | 2.20GHz~3.60GHz
Performance (FP32)    | 12 TFLOPS            | 1.2 TFLOPS (with AVX2)
RAM size              | 24GB (GDDR5)         | max 1.5TB (DDR4)
Memory bandwidth      | 347GB/s              | 76.8GB/s
Power consumption     | 250W                 | 145W
Manufacturing process | 16nm                 | 14nm
Application: GPU off-load of WHERE, JOIN, GROUP BY; automatic generation of GPU binary from SQL.
PG-Strom: extension module of PostgreSQL, to accelerate SQL workloads
using multi-thousand cores and high-bandwidth memory.
Same SQL statement, same data structure.
4. The PG-Strom Project
PG-Strom (2/3) – Transparent GPU code generation on demand
QUERY: SELECT cat, count(*), avg(x) FROM t0
WHERE x between y and y + 20.0 GROUP BY cat;
:
STATIC_FUNCTION(bool)
gpupreagg_qual_eval(kern_context *kcxt,
kern_data_store *kds,
size_t kds_index)
{
pg_float8_t KPARAM_1 = pg_float8_param(kcxt,1);
pg_float8_t KVAR_3 = pg_float8_vref(kds,kcxt,2,kds_index);
pg_float8_t KVAR_4 = pg_float8_vref(kds,kcxt,3,kds_index);
return EVAL((pgfn_float8ge(kcxt, KVAR_3, KVAR_4) &&
pgfn_float8le(kcxt, KVAR_3,
pgfn_float8pl(kcxt, KVAR_4, KPARAM_1))));
} :
e.g.) Generates CUDA C code from the calculation formula in the WHERE clause,
with references to the input data: the SQL expression becomes CUDA source code,
the run-time compiler (nvrtc) performs just-in-time compilation, then the
binary is executed in parallel on the GPU.
5. The PG-Strom Project
PG-Strom (3/3) – An example of SQL acceleration
▌Test Query:
SELECT cat, count(*), avg(x)
FROM t0 NATURAL JOIN t1 [NATURAL JOIN t2 ...]
GROUP BY cat;
t0 contains 100M rows, t1...t8 each contain 100K rows (like a star schema)
[Chart] PG-Strom microbenchmark with JOIN/GROUP BY: query response time [sec]
vs. number of tables joined (2 to 9).
PostgreSQL v9.6: 8.48, 13.23, 18.28, 23.42, 28.88, 34.50, 40.77, 47.16
PG-Strom 2.0devel: 5.00, 5.46, 5.91, 6.45, 7.17, 8.07, 9.22, 10.21
CPU: Xeon E5-2640v4 / GPU: Tesla P40 / RAM: 128GB / OS: CentOS 7.3 /
DB: PostgreSQL 9.6.2 + PG-Strom 2.0devel
6. The PG-Strom Project
In-database Analytics
▌Computing close to the data avoids:
Cost of moving data
Cost of transforming / validating data
Wasted time for DBAs
▌Utilization of SQL’s flexibility:
Pre-/post-processing of analytic algorithms
Combination with other data (JOIN)
Summarization of results (GROUP BY / window functions)
7. The PG-Strom Project
What we want to develop (1/2)
Data lifecycle: Data Generation → Data Collection → Summarize / Pre-process →
Analytics & ML → Visualization (BI tools).
Data size: large → small; amount of computing: small → large.
Elemental technologies mapped onto the lifecycle:
SSD-to-GPU Direct SQL Execution (NVMe-SSD + GPU)
Transparent GPU code generation & asynchronous execution
Row⇔Column transform in hardware
Statistical analytics & machine learning libraries (DNN, k-means, SVM,
Random Forest) via PL/CUDA; advanced algorithms as a part of SQL statements
Goal: execution of the entire lifecycle for data analytics in-database.
8. The PG-Strom Project
What we want to develop (2/2) – elemental technology
(Same lifecycle diagram as the previous slide, with the elemental
technologies grouped by role:)
SSD-to-GPU Direct SQL Execution (for I/O)
PG-Strom (for SQL)
PL/CUDA and library (for computing)
9. The PG-Strom Project
I/O Acceleration with GPU?
10. The PG-Strom Project
Re-definition of GPU’s role
I/O workloads Computing workloads
11. The PG-Strom Project
Brief Architecture of GPU
GPU Device
Memory
(GDDR5 or HBM2)
Host RAM
Data transfer with DMA over the PCIe bus
CPU
PCIe Bus
DMA can also transfer data from particular storage blocks directly to a
destination physical address.
Storage
GPU Device
12. The PG-Strom Project
Technology Basis (1/2) – GPUDirect RDMA
▌A feature for peer-to-peer DMA between the GPU and other PCIe devices
▌Originally designed for MPI over InfiniBand
▌But available for any kind of PCIe device, if a custom kernel module supports it
Copyright (c) NVIDIA corporation, 2015
13. The PG-Strom Project
Technology Basis (2/2) – GPUDirect RDMA
Physical Address Space
PCIe BAR1 Area
GPU RAM
Host RAM
NVMe-SSD / Infiniband HBA (PCIe devices)
GPUDirect RDMA
A feature to map the GPU’s device memory into the physical address space of
the host system. Once GPU device memory gets a physical address on the host
system, drivers can use that address as the source or destination of DMA
operations.
14. The PG-Strom Project
Idea of SSD-to-GPU Direct SQL Execution
Large PostgreSQL Tables
PCIe Bus
NVMe SSD ⇒ GPU: SSD-to-GPU P2P DMA (NVMe-Strom driver)
WHERE / JOIN / GROUP BY
PostgreSQL Data Block
Reduction of unnecessary rows according to the SQL ⇒ smaller amount of I/O
(Existing data flow: large I/O size)
SSD-to-GPU Direct SQL Execution – load PostgreSQL’s data blocks to the GPU
directly, then reduce unnecessary rows with the multi-thousand cores of the GPU.
15. The PG-Strom Project
Linux kernel APIs of GPUDirect RDMA
(used as the destination of SSD-to-GPU DMA commands!)
int nvidia_p2p_get_pages(uint64_t p2p_token, /* deprecated */
uint32_t va_space, /* deprecated */
uint64_t virtual_address,
uint64_t length,
struct nvidia_p2p_page_table **page_table,
void (*free_callback)(void *private_data),
void *private_data);
struct nvidia_p2p_page_table {
uint32_t version;
uint32_t page_size;
struct nvidia_p2p_page **pages;
uint32_t entries;
uint8_t *gpu_uuid;
} nvidia_p2p_page_table_t;
Array of I/O-mapped physical memory addresses of the GPU device memory
CUresult cuMemAlloc(CUdeviceptr *daddr,
size_t bytesize);
16. The PG-Strom Project
NVMe-Strom Software Stack
User space: PostgreSQL + pg-strom allocates GPU device memory with
cuMemAlloc(), then issues requests via read(2) and ioctl(2) on
/proc/nvme-strom, passing the file offset.
Kernel space: NVMe-Strom sits beside the VFS / page cache, translates the
file offset into block numbers, and hands the DMA request to the NVMe SSD
driver and the nvidia driver.
Result: SSD-to-GPU peer-to-peer DMA writes the blocks straight into GPU
device memory.
17. The PG-Strom Project
Benchmark (1/2) – Star Schema Benchmark
[Chart] Query execution throughput [MB/s] (0 to 4500) for Star Schema
Benchmark queries Q1-1 ... Q4-3, comparing: PostgreSQL SSDx1, PostgreSQL
SSDx2, PG-Strom SSDx1, PG-Strom SSDx2.
SQL with WHERE clause, JOIN and aggregation/GROUP BY against a 353GB database.
Usually, the entire workload is dominated by I/O.
PostgreSQL achieves up to 1.6GB/s with regular filesystem-based access.
PG-Strom with SSD-to-GPU Direct SQL Execution achieved up to 3.8GB/s at peak performance.
※ (Query Execution Throughput) = (Database Size; 353GB) / (Entire Query Response Time [sec])
Chart annotations: theoretical limitation by 2x SSD [4.4GB/s]; theoretical
limitation by 1x SSD [2.2GB/s]; one configuration under investigation.
18. The PG-Strom Project
Benchmark (2/2) – Star Schema Benchmark
Example of Query
SELECT sum(lo_revenue), d_year, p_brand1
FROM lineorder, date1, part, supplier
WHERE lo_orderdate = d_datekey
AND lo_partkey = p_partkey
AND lo_suppkey = s_suppkey
AND p_category = 'MFGR#12'
AND s_region = 'AMERICA'
GROUP BY d_year, p_brand1
ORDER BY d_year, p_brand1;
customer: 12M rows (1.6GB)
date1: 2,500 rows (400KB)
part: 1.8M rows (206MB)
supplier: 4M rows (528MB)
lineorder: 2.4B rows (351GB)
Typical summary queries, consisting of a WHERE clause, JOIN and GROUP BY.
Since the lineorder table is larger than the physical RAM size, the majority
of the workload is dominated by I/O.
19. The PG-Strom Project
Hardware configuration on the benchmark
▌Hardware/Software configuration
Model: DELL PowerEdge R730
CPU: Xeon E5-2670v3 (12C, 2.3GHz) x2
RAM: 128GB
HDD: SAS 300GB x8 (RAID5)
OS: CentOS7.2 (3.10.0-327.el7)
SW: CUDA 7.5.17
NVIDIA driver 352.99
PostgreSQL 9.6.1
PG-Strom v2.0devel
▌GPU Specifications
NVIDIA Tesla K80
2x 2496 CUDA cores (560MHz)
2x 12GB GDDR5 RAM (240GB/s)
▌SSD Specifications
Intel SSD 750 (400GB) x2
Interface: PCIe 3.0 x4 (NVMe 1.2)
Max SeqRead: 2.2GB/s
Block diagram: CPU1 ⇔ CPU2
CPU1: SSD x2, Tesla K80 GPU
CPU2: Tesla K20 GPU (unused)
20. The PG-Strom Project
(OT) Consistency with disk cache
Problem: ① Storage blocks may not be up to date if the target blocks are
kept in the page cache. ② Short DMA transfers are not efficient.
Solution – write back cached PostgreSQL blocks (= 8KB/block) to a userspace
buffer once, then kick a bulk transfer from RAM to GPU via CUDA APIs.
Almost equivalent cost to read(2) + cuMemcpyHtoDAsync().
Although PostgreSQL blocks are reordered in GPU memory, this generates
correct results.
BLK-100: uncached
BLK-101: cached
BLK-102: uncached
BLK-103: uncached
BLK-104: cached
BLK-105: cached
BLK-106: uncached
BLK-107: uncached
BLK-108: cached
BLK-109: uncached
BLCKSZ
(=8KB)
Transfer Size
Per Request
BLCKSZ *
NChunks
BLK-108: cached
BLK-105: cached
BLK-104: cached
BLK-101: cached
BLK-100: uncached
BLK-102: uncached
BLK-103: uncached
BLK-106: uncached
BLK-107: uncached
BLK-109: uncached
BLK-108: cached
BLK-105: cached
BLK-104: cached
BLK-101: cached
unused SSD-to-GPU
P2P DMA
File Userspace DMA
Buffer (RAM)
Device Memory
(GPU)
CUDA API
(userspace)
cuMemcpyHtoDAsync
21. The PG-Strom Project
Not only Scan, but GpuJoin also...
Inner
relation
Outer
relation
Inner
relation
Outer
relation
Hash table Hash table
Next Step Next Step
Just reference
to the results
on CPU side
Hash Table
Search by CPU
Generation of
results by CPU
Generation of
results by GPU
Parallel
Hash Table
Search by GPU
HashJoin by CPU GpuHashJoin
Here we have
SSD-to-GPU
P2P DMA
22. The PG-Strom Project
GpuPreAgg
Not only Scan, but GpuPreAgg also...
Aggregation by CPU Aggregation with GPU
Relation Relation
Aggregation
Aggregation
1st Step:
Parallel Hash
Reduction
2nd Step:
Atomic Merge
Read of
1M rows
Process
1M rows
by CPU
Smaller
# of rows
GpuPreAgg:
Reduction of rows
to be processed by CPU
Interim
results
Here we have SSD-to-GPU P2P DMA.
24. The PG-Strom Project
Single Node Solution
2x Enterprise Grade SSD
Based on PCIe x8
NVIDIA
Tesla P40
CPU
RAM
DMI
SSD-1
SSD-2
GPU
Max 10GB/s query processing capability per node
PCI-E
Hardware Block Diagram
25. The PG-Strom Project
Multi Node Solution (?)
NVIDIA
Tesla P40
Joint solution with PostgreSQL’s scale-out technologies
2x Enterprise Grade SSD
Based on PCIe x8
26. The PG-Strom Project
Scale Up Solution (?)
5x NVIDIA
Tesla P40
DMI
GPU-1
PCI-SW
PCI-SW
QPIx2
GPU-2
GPU-4
GPU-5
SSD-1
SSD-2
SSD-3
SSD-4
SSD-5
GPU-3
RAM
RAM
CPU 1
CPU 2
PCIe x16
x2 slot
PCIe
96lane
PCIe
96lane
Deployment on GPU monster
5x Enterprise Grade SSD
Based on PCIe x16
27. The PG-Strom Project
Towards columnar structure (1/3) – On-memory Columnar Cache
▌On-memory Columnar Cache
Caches PostgreSQL’s data blocks (row data) in memory as column data
Reduces data transfer between RAM and GPU
Optimizes usage of the GPU’s memory bandwidth
Switches which GPU kernel is invoked according to existence of the columnar cache
BlockNum = 101
all_visible=false
BlockNum = 102
all_visible=true
BlockNum = 103
all_visible=true
BlockNum = 104
all_visible=true
Columnar Cache
(BlockNum=102)
Columnar Cache
(BlockNum=102)
background
worker
Construction
UPDATE
invalidation
PG-Strom
Row-data
GPU Kernel
for Row
Column-data
GPU Kernel
for Column
28. The PG-Strom Project
Towards columnar structure (2/3) – Background
▌Row data format – Random memory access
Less usage ratio of memory bus, larger number of memory transactions
▌Column data format – Coalesced memory access
Optimal memory bus usage, least number of memory transactions
Memory transaction width: 256bit.
Row data: only 32bit x1 = 32bit is valid within a 256bit memory transaction
(usage ratio: 12.5%), because GPU cores access scattered addresses.
Column data: 32bit x8 = 256bit is entirely valid within a 256bit memory
transaction (usage ratio: 100.0%), because GPU cores access adjacent addresses.
Why columnar structure is preferable for GPU architecture
29. The PG-Strom Project
World of OLTP World of OLAP
Towards columnar structure (3/3) – FPGA engine on SSD
FPGA logic on the SSD enables writing by row, then reading as column.
Row⇒Column Transformer on the SSD:
Write of transactional data (row format)
Read of analytic data (column format)
Data format transformation by the FPGA on the SSD; only required columns are
fetched as pre-processed data.
SQL execution based on columnar format: best for the GPU.
30. The PG-Strom Project
Towards the “true” intelligent storage system
Large PostgreSQL Tables
PCIe Bus
NVMe SSD ⇒ GPU: SSD-to-GPU P2P DMA (NVMe-Strom driver)
WHERE / JOIN / GROUP BY
PostgreSQL Data Block
Reduction of unnecessary rows according to the SQL ⇒ smaller amount of I/O
Step-1: A pair of NVMe-SSD and GPU looks like an intelligent storage device
that understands SQL workloads: it filters out unnecessary rows and
pre-processes JOIN/GROUP BY prior to loading the data into main memory.
Step-2: A special NVMe-SSD transforms the data into a format more suitable
for processing, based on knowledge of the requester (PostgreSQL’s block
format in this case).