
Overview

Finding true time series data at billion-row scale is hard. It’s usually either generated from standard RNGs, which leave out the all-important temporal dependencies through their use of Gaussian or uniform distributions, or it’s too large to easily move around for testing purposes. In this post I show you how to synthesize billions of rows of true time series data with an autoregressive component, and then explore it with ClickHouse, a big-data-scale OLAP RDBMS, all on AWS. If you’re on a deadline, feel free to skip my thoughts and get the code here.

Background Discussion

One of the areas that I continue to be impressed with is how quickly “standard SQL” solutions rebounded to handle very large datasets. These multi-billion-row datasets were the domain of a few limited “Big Data” solutions only a few years ago. If you were working in this field in the early 2010s, you’ll remember some crazy things when it came to manipulating very large data sets (raise your hand if you spent time googling how to create a bag in Pig…).

Of course there have always been SQL or SQL-like offerings for big data work (and I won’t even talk about some of the crazy big data OLAP cube solutions), but in those early days, if you wanted serious performance, you were probably going to be doing a lot more than just writing SQL. Even Spark SQL (which was a great option at the time, and still is for some use cases) required knowing a bit about RDDs and lazy evaluation, and even with serious SQL chops, unless you knew Python or a bit of Scala, you were in for a tough ride.

Presto went pretty far to solve this pain point, and in fact it was one of my early favorites in this space, as you can see from a post of mine way back in 2016. Presto is still doing well (Amazon Athena, anyone?) and has great support (check out the cool things Starburst is doing with enterprise Presto). Spark SQL definitely has its place, and of course proprietary solutions like Vertica, Aster and Greenplum are in use at a lot of enterprises, but there are also some great offerings out there that are still not mainstream for some reason.

Back when I wrote that old Presto post, Mark had jumped in and left some comments and I’d been following his Taxi Ride Adventures ever since. I started off this research really wanting to test out a GPU database and settled on brytlytDB, but something is up with them and they’ve pulled their AMI from AWS. I went looking elsewhere for something interesting to evaluate and I decided to give ClickHouse a shot. Released in 2016 from the minds at Yandex (you should be familiar with them, but if not, think Google in Russia), ClickHouse is a columnar RDBMS implemented in C++ that offers distributed capabilities and has a syntax very close to ANSI SQL.

About ClickHouse

One of the areas that impressed me the most about ClickHouse was the numerous analytics-related functions that are available right out of the box. These are just a few that I’m looking forward to playing with.

  • windowFunnel() – Searches for event chains in a sliding time window and calculates the maximum number of events that occurred from the chain.

  • retention() - Takes as arguments a set of 1 to 32 conditions of type UInt8 that indicate whether a certain condition was met for the event. Any condition can be specified as an argument (as in WHERE).

  • sequenceCount() - Counts the number of event chains that matched the pattern. The function searches event chains that don’t overlap. It starts to search for the next chain after the current chain is matched.

  • Aggregate function combinators like -Resample - Lets you divide data into groups and then aggregate the data in those groups separately. Groups are created by splitting the values from one column into intervals.

  • groupArrayMovingAvg - Calculates the moving average of input values.

There is even an implementation for stochastic Linear and Logistic regression with an option to use a variety of algorithms for updating weights (SGD of course, but also Adam, Nesterov and Momentum)! Not only is that just plain cool, but think of the overhead you can save not moving your data out of the database!

The analytics functions lead me to the next item: I had been impressed with the work Mark and others had done to benchmark ClickHouse, but I wanted to focus a bit more on time series data rather than the traditional taxi-ride set many have used. I started looking for some large datasets and didn’t find what I wanted, so in a similar approach to the Sparkov project, I decided to generate my own.

I got a big boost from the team behind the Python TimeSynth package, as with their help I was able to address the largest fault I’ve found with most synthetic “time series” data. Most synthetic data sets that purport to be time series are usually just generated with a random number generator based on a uniform or Gaussian distribution, which means the data points are independent. This is the case even with the wonderful Time Series Benchmark Suite (TSBS). Part of the magic of time series data is the temporal dependency inherent in the values. With the TimeSynth package it became trivial to implement a solution that could generate billions of records with an autoregressive component, plus noise drawn from a Gaussian process (meaning the noise itself is i.i.d.).
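To make that concrete, here’s a rough sketch of the kind of TimeSynth usage gen.py builds on: an autoregressive signal with i.i.d. Gaussian noise layered on top. The class names follow the TimeSynth README; the specific parameters here are illustrative, not the ones the repo actually uses.

import timesynth as ts

# Regularly spaced "time" points (gen.py works at a vastly larger scale).
time_sampler = ts.TimeSampler(stop_time=500)
regular_samples = time_sampler.sample_regular_time(num_points=500)

# An autoregressive signal: each value depends on the previous ones, which is
# exactly the temporal dependency that plain RNG-based generators lack.
ar_signal = ts.signals.AutoRegressive(ar_param=[1.5, -0.75])

# Independent, identically distributed Gaussian noise on top of the AR process.
white_noise = ts.noise.GaussianNoise(std=0.3)

series = ts.TimeSeries(ar_signal, noise_generator=white_noise)
samples, signals, errors = series.sample(regular_samples)
print(samples[:10])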

OK, enough exposition. Let’s walk through creating an environment to generate 2.5 billion rows of true time series data, and then use ClickHouse to demonstrate some very impressive performance. While ClickHouse has a distributed capability, I was impressed enough with the SMP-style performance to run this on a single instance for now. Even though I used a hefty instance (r5ad.24xlarge, 96 cores, 768GB memory), this is still some fantastic performance. You’re not going to see these numbers on MySQL, even with equivalent hardware, I promise. In fact, I’ve included some Redshift comparisons at the end if you’re interested in the measurements.

Hands-On

Here’s the high-level approach.

• Using a Python 3 environment with boto3, we’ll create an EC2 instance.
• Configure two NVMe disks in RAID 0 to use for data generation as well as our ClickHouse storage.
• Clone my repo and use the parallelized Python code there to generate our data.
• Create the destination table in ClickHouse that’s well suited to our use case of time series data (column-oriented and using the MergeTree engine).
• Load the data into ClickHouse.
• Run some queries that demonstrate how we can perform aggregations and windowing functions across billions of rows very quickly.

At this point I will assume you have an environment running Python 3 with the boto3 library installed. There is some more detail on this in my GitHub README, but Google will be your friend here to get to this state if you’re not already there.

Data Generation

The first thing we’ll need to do is clone the repo from GitHub.

git clone https://github.com/namebrandon/time-series-gen.git

You’ll end up more or less with a structure like this.

time-series-gen/
├── create-db.sql    - SQL DDL to create ClickHouse database and table
├── gen.py           - Python 3 script that is configurable. Data size and details are set here.
├── launch-ec2.py    - Python 3 script that uses the AWS SDK to instantiate the data generation and query environment
├── load-data.sh     - bash script to load data to ClickHouse
├── requirements.txt - pip requirements
├── secrets.txt      - you create this. Two lines, account key on first, secret on second line.
└── time-series.conf - ClickHouse conf.d override for storage - points to the NVMe RAID 0 array

launch-ec2.py is the Python script we’ll be using to create the infrastructure and, as mentioned, it requires boto3. There is a requirements.txt for pip usage, but that’s for the data generation component. Using it here will take care of the boto3 requirement, but it’s also overkill, and you may want to avoid having those packages installed in your current environment.

Assuming there are no issues with boto3, you need to make sure you create a file called secrets.txt in the root of the directory where you cloned the repo. This should contain your AWS IAM key and secret, and it’s how launch-ec2.py obtains credentials for creating AWS resources. If you’ve got credentials in environment variables you can rewrite the code to use those instead; I believe os.getenv should help (there’s a sketch of that right after the commands below), but if you’re using virtualenv that may make things a bit more complicated.

Please note that you should never store secrets or credentials in version control. My .gitignore will prevent the secrets.txt file from being shared but you should be cautious here and double-check before committing.

touch secrets.txt
echo AADK1134123213 >> secrets.txt
echo asd090jkj12l3kj2l109a-0s9klk3aqa14 >> secrets.txt
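For anyone who prefers environment variables over a secrets.txt file, here’s a minimal sketch of what that swap might look like inside launch-ec2.py. This is an illustration, not the code as it ships; it assumes you’ve exported AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and that the region matches what you configure in the script.

import os
import boto3

# Pull credentials from the environment instead of reading secrets.txt.
access_key = os.getenv("AWS_ACCESS_KEY_ID")
secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
if not access_key or not secret_key:
    raise SystemExit("AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are not set")

# The rest of the script would use this client exactly as it uses the one
# built from secrets.txt today.
ec2 = boto3.client(
    "ec2",
    region_name="us-west-2",  # match whatever region you set in launch-ec2.py
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
)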

The final step before we execute the script is to modify any account or region specific items in the code so that they work for you. There is a section in the launch-ec2.py file labeled “Configure your relevant information here”. I’ll let you guess what that section might be for!

One important item: part of generating this data, as well as optimizing most any database’s performance, is ensuring our I/O subsystem performs well. As a result, I’ve chosen either the r5ad or c5ad EC2 instance family. These come with a pair of NVMe disks that we can put into RAID 0 for very fast I/O throughput. The userdata script that is part of launch-ec2.py creates this array and expects certain underlying storage devices. Unless you are comfortable picking up where this code will fail, creating your own array, and continuing manually from that point, please stick with an r5ad or c5ad instance type (I suggest at minimum a 4xlarge).
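To give you a feel for what that userdata does (and therefore where to look if it breaks), the RAID 0 piece boils down to a handful of commands passed to EC2 as a string. This is a simplified sketch rather than the exact script in the repo; the nvme1n1/nvme2n1 device names and the XFS filesystem are assumptions here.

# Simplified illustration only -- see the userdata block in launch-ec2.py for
# the real thing. Device names and filesystem are assumptions for this sketch.
USER_DATA = """#!/bin/bash
# Stripe the two local NVMe disks into a single RAID 0 array and mount it.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
mkfs.xfs /dev/md0
mkdir -p /mnt/md0
mount /dev/md0 /mnt/md0
"""

# boto3 takes this as a plain string at launch time, e.g.:
#   ec2.run_instances(..., UserData=USER_DATA, ...)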

Once you’re comfortable with your changes to the launch-ec2.py file, let’s kick it off.

python3 launch-ec2.py

When successful you should get output similar to this, with your public IP and your instance identifier.

44.234.19.212.
i-07077941efee714ca.

Next step is to connect to your instance and clone the repo once more. Please be patient here, the userdata script can take up to 10 minutes to complete. You’ll know you’re ready to move forward when the /mnt/md0 directory is available. If it’s not available the script is still running, and you need to wait a bit more. If you’re waiting more than 10 minutes, start debugging as something has broken. If you do end up needing to debug, I’d suggest re-running the python script to generate a new instance but with the user data variable emptied out, then connecting to the instance and stepping through the userdata lines one by one and seeing what might be breaking.

Assuming all is well, we proceed to connecting to the instance.

ssh -i path-to-pem-file/your.pem centos@your-instance-ip	
cd /mnt/md0
git clone https://github.com/namebrandon/time-series-gen.git

Let’s then jump into the root directory where we cloned the code, and install our Python requirements.

cd time-series-gen/ && pip3 install -r requirements.txt

Next up is to edit gen.py to suit your needs. I’ve installed nano already in the build, but feel free to use vi or emacs if that’s more your speed. By default it is set to create 2.5 billion rows of data across 96 files (one per core of an r5ad.24xlarge). This will take about 12 hours to complete (Danger, Will Robinson! - this can get expensive depending on your instance type; you may want to start small the first time you walk through this process). Now is a good time to share some insight from developing this codebase…

In general, I’ve found this code generates anywhere from 600 to 1,100 rows of data per second, per core. There is a caveat: you can hit these numbers as long as you don’t run out of RAM and start spilling to disk. For reference, 1MM rows is about 150MB, so 2.5 billion rows is roughly 375GB uncompressed (thus the r5ad.24xlarge). The Python code uses the multiprocessing library and assumes the optimal one file/chunk per core ratio (around line 71 in the code). Core speed and utilization will affect this, but I’ve found 1k rows/core/second to be a decent rule of thumb, and this does seem to scale linearly, at least up to 96 cores. I have yet to hit a bottleneck on the I/O subsystem with the RAID 0 NVMe setup, and I would guess this will continue to scale pretty well, even beyond 128 cores. If you’ve got some time on your hands and want to test this, I would love to see your results!
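For context, the parallelization pattern itself is straightforward: one worker per core, each writing its own chunk file. This isn’t the exact gen.py code, just a sketch of its shape; the output path and helper name are made up for the illustration.

import multiprocessing as mp

NUM_WORKERS = 96                               # one file/chunk per core
TOTAL_ROWS = 2_500_000_000
ROWS_PER_WORKER = TOTAL_ROWS // NUM_WORKERS

def write_chunk(chunk_id: int) -> str:
    """Generate ROWS_PER_WORKER rows and write them to this worker's file."""
    out_path = f"/mnt/md0/time-series-gen/data/chunk_{chunk_id:03d}.csv"
    with open(out_path, "w") as out:
        for _ in range(ROWS_PER_WORKER):
            # ...generate one AR + Gaussian-noise row here and write it out...
            pass
    return out_path

if __name__ == "__main__":
    with mp.Pool(processes=NUM_WORKERS) as pool:
        for finished in pool.imap_unordered(write_chunk, range(NUM_WORKERS)):
            print(f"finished {finished}")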

Now we kick off the code and generate our data.. Since this could potentially take a while (and be expensive depending on your instance choice), I suggest using screen (again, already installed for you) to create a recoverable session. Your call!

screen -S datagen
python3 gen.py

If you’re new to screen, hitting CTRL-A then just “D” will detach you. You can get screen session details with screen -list and reattach with screen -r [session-name], in this case the session name is datagen.

Data Import

Assuming your data generation went fine and it’s minutes to several hours later, let’s get the data loaded into ClickHouse to experiment with.

From the repo root directory we’re going to chmod the shell scripts as executable as well as provide some config overrides so that ClickHouse will use our RAID 0 array for data and temp storage as opposed to the EBS volume it uses for the default mount point.

chmod +x *.sh
sudo cp time-series.conf /etc/clickhouse-server/conf.d/
sudo service clickhouse-server restart

After the clickhouse-server restart you should see some output showing that it initialized the new directory we provided in the config override.

Launch the ClickHouse client to test connectivity (clickhouse-client). Assuming all was successful, you can use exit to leave the SQL client.

ClickHouse client version 20.3.19.4.
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.3.19 revision 54433.

ip-102-31-32-266.us-west-2.compute.internal :) exit
Bye.

Finally, we’ll create a database and then start loading the data. Since this is a session we wouldn’t want interrupted, I’ll again suggest the use of screen here.

./create-db.sh
screen -S dataload
./load-data.sh

You will not see any output at this point from the load-data script, so again, CTRL-A and then hitting “D” will detach you from the screen session. Loading is impressively quick and should take less than 35 minutes for all 2.5 billion rows. You can reattach to your screen session (screen -r) to see the output. When you see a time() output with details on user/run/system execution time, you’re done.

Querying Our Data

Let’s load the ClickHouse client again and run a count to make sure we’re all loaded and ready to go. Now is a good time to mention that you can run queries from the shell by passing in a string or file, or you can use the interactive client. See the create-db.sh shell script for an example, but for now we’ll use the interactive client.

Now is also a good time to talk about our data. I had intended to focus on exchange / ticker type data, but I also wanted to make the data more applicable to others who might want to use it to simulate sensor readings and other non-financial systems. As a result, there are some decisions I’ve made:

  • The data resolves to the second and is captured regularly (every second) with no omissions. Every second from May 6, 1949 is included, and the end date depends on how the generation script is configured (by default, 2.5B records puts the end date at July 25, 2028; there’s a quick sanity check of that date just after this list).

  • There are no missing values or extreme outliers. Gaussian noise is added to the process (in some cases to a heavy degree), but the underlying parameters for each time series / column can be viewed in the code.
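If you’re wondering where that end date comes from, it’s just the row count read as seconds from the start date. A quick sanity check (assuming the series starts at midnight):

from datetime import datetime, timedelta

start = datetime(1949, 5, 6)
rows = 2_499_999_936  # the COUNT(*) we'll see from ClickHouse later on
print(start + timedelta(seconds=rows))  # lands on 2028-07-25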

The data is 17 columns wide (the first of which is your timestamp index), and the values themselves are Decimal64(2). Almost all should be positive numbers, but it is very likely a small subset of the data went negative during generation. Since the generation of the data is non-deterministic your mileage will vary, but if you absolutely need positive numbers, columns “QYUY” and “ELHW” are probably your best bets. Otherwise, you really should modify the underlying parameters in the ts_data_gen() function calls in gen.py to suit your needs.

Anyway, here’s a subset of data (with only 6 columns shown) to give you an idea.

exchg_time_stamp     LMWI   CJHA    BCTD    DNUL     HEQB    AOGT
1992-06-23 05:23:08  51.13  162.93  312.09  1244.2   606.55  418.83
1992-06-23 05:23:09  53.11  155.72  253.96  1100.92  337.18  396.77
1992-06-23 05:23:10  48.66  177.42  180.01  1417.26  775.8   249.29
1992-06-23 05:23:11  47.23  158.38  360.52  1050.34  377.56  764.01
1992-06-23 05:23:12  38.8   172.64  113.41  1095.14  474.14  237.05
1992-06-23 05:23:13  45.9   127.07  125.8   1061.73  850.18  590.35
1992-06-23 05:23:14  43.4   154.48  208.7   631.3    477.62  657.48
1992-06-23 05:23:15  51.34  143.62  331.39  953.99   395.37  413.54
1992-06-23 05:23:16  47.72  153.32  231.18  1583.28  879.46  86.39
1992-06-23 05:23:17  59.42  127.77  69.73   1505.2   673.77  508.43

Note that I included a header.txt file in the data/ folder, but the labels are arbitrary and fully customizable. I don’t actually use the header.txt file anywhere; it’s just a reference, since the column names are specified in the database creation script.

I’m hoping this choice gives everyone flexibility: those looking to limit the data to certain hours, or who need outliers or missing / irregular data, can certainly use this as a starting point to modify, and those who need consistent, sensor-like data should be covered.

For these tests I’m still using the r5ad.24xlarge instance we used to create the data, but you could always generate the data on one instance, save it off to S3 or EFS/EBS, and then spin up a dedicated ClickHouse instance of your choosing. You can even use my same code; just skip the data generation and copy your own data back down to <repo-dir>/data.
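If you go that detached route, pushing the generated files up to S3 from the generation box is a few lines of boto3. Here’s a sketch; the bucket name, key prefix, data path, and .csv extension are all assumptions for the illustration.

import glob
import boto3

s3 = boto3.client("s3")
bucket = "your-timeseries-bucket"  # placeholder -- use your own bucket

# Upload each generated chunk so a separate ClickHouse box can pull it down later.
for path in glob.glob("/mnt/md0/time-series-gen/data/*.csv"):
    key = "exchange-data/" + path.rsplit("/", 1)[-1]
    s3.upload_file(path, bucket, key)
    print(f"uploaded {path} -> s3://{bucket}/{key}")

With that said, let’s jump into the client.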

clickhouse-client

First up, our generic count.

SELECT COUNT(*) FROM perftest.exchange_data;

Pretty standard, not useful for performance but we know our data is there.

SELECT COUNT(*)
FROM perftest.exchange_data

┌────COUNT()─┐
│ 2499999936 │
└────────────┘

1 rows in set. Elapsed: 0.003 sec.

Now let’s take a look at the monthly min, max and average across a given year for a given column/symbol, in this case “DNUL”.

//QUERY A
SELECT
    toYear(exchg_time_stamp) AS YEAR,
    toMonth(exchg_time_stamp) AS MONTH,
    MAX(DNUL),
    MIN(DNUL),
    AVG(DNUL)
FROM perftest.exchange_data
WHERE toYear(exchg_time_stamp) = 1992
GROUP BY toYear(exchg_time_stamp), toMonth(exchg_time_stamp)
ORDER BY MONTH ASC;
┌─YEAR─┬─MONTH─┬─MAX(DNUL)─┬─MIN(DNUL)─┬─AVG(DNUL)─┐
│ 1992 │     1 │   2299.45 │    -82.89 │   1150.04 │
│ 1992 │     2 │   2294.96 │    -72.94 │   1149.86 │
│ 1992 │     3 │   2281.89 │     36.32 │   1150.06 │
│ 1992 │     4 │   2324.44 │    -15.05 │   1150.00 │
│ 1992 │     5 │   2390.85 │   -100.95 │   1150.07 │
│ 1992 │     6 │   2241.36 │     10.40 │   1149.92 │
│ 1992 │     7 │   2275.68 │      0.40 │   1150.09 │
│ 1992 │     8 │   2298.19 │    -59.29 │   1150.15 │
│ 1992 │     9 │   2289.68 │     40.18 │   1149.96 │
│ 1992 │    10 │   2299.45 │     15.09 │   1150.13 │
│ 1992 │    11 │   2299.45 │    -82.89 │   1149.97 │
│ 1992 │    12 │   2294.96 │    -72.94 │   1149.83 │
└──────┴───────┴───────────┴───────────┴───────────┘

12 rows in set. Elapsed: 0.168 sec. Processed 31.62 million rows, 379.47 MB (188.03 million rows/s., 2.26 GB/s.)

0.168 seconds..? Ok..

How about a daily average for UOFC in July of 1993..

//QUERY B
SELECT
    toDayOfMonth(exchg_time_stamp) AS DAY,
    AVG(UOFC) AS UOFC_AVG
FROM perftest.exchange_data
WHERE toYear(exchg_time_stamp) = 1993
  AND toMonth(exchg_time_stamp) = 07
GROUP BY toYear(exchg_time_stamp), toMonth(exchg_time_stamp), toDayOfMonth(exchg_time_stamp)
ORDER BY DAY ASC;
┌─DAY─┬─UOFC_AVG─┐
│   1 │  1510.99 │
│   2 │  1510.30 │
│   3 │  1509.03 │
│   4 │  1509.67 │
│   5 │  1508.84 │
│   6 │  1509.56 │
│   7 │  1509.92 │
│   8 │  1509.87 │
│   9 │  1510.12 │
│  10 │  1509.06 │
│  11 │  1510.00 │
│  12 │  1509.44 │
│  13 │  1510.43 │
│  14 │  1509.28 │
│  15 │  1509.47 │
│  16 │  1509.92 │
│  17 │  1509.51 │
│  18 │  1510.69 │
│  19 │  1509.61 │
│  20 │  1510.94 │
│  21 │  1509.61 │
│  22 │  1510.75 │
│  23 │  1510.11 │
│  24 │  1510.34 │
│  25 │  1510.28 │
│  26 │  1510.24 │
│  27 │  1509.57 │
│  28 │  1509.53 │
│  29 │  1509.86 │
│  30 │  1510.02 │
│  31 │  1510.83 │
└─────┴──────────┘

31 rows in set. Elapsed: 0.044 sec. Processed 2.68 million rows, 32.14 MB (61.31 million rows/s., 735.69 MB/s.)

Well.. 0.044 seconds. So we’re seeing what I feel are pretty good results, but keep in mind we did partition our data on import by the date field, and being a columnar database, these statistics are often compiled on import and when we optimize the database.

How about some more interesting work..

Let’s try a 7-day moving average for all of Q4 in 1993 (note that I ignore the existence of Q3 1993, so the first 6 days of Q4 until the window fills up are below the expected average).

//QUERY C
SELECT groupArrayMovingAvg(7)(UOFC_MAX) AS UOFC_MOV
FROM
(
    SELECT
        toDayOfMonth(exchg_time_stamp) AS DAY,
        MAX(UOFC) AS UOFC_MAX
    FROM perftest.exchange_data
    WHERE toYear(exchg_time_stamp) = 1993
      AND toMonth(exchg_time_stamp) IN (09, 10, 11, 12)
    GROUP BY toYear(exchg_time_stamp), toMonth(exchg_time_stamp), toDayOfMonth(exchg_time_stamp)
    ORDER BY DAY ASC
)
┌─UOFC_MOV───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ [358.33,700.31,1041.12,1414.60,1775.00,2120.75,2482.90,2486.39,2494.20,2512.15,2531.28,2552.40,2563.50,2544.22,2548.93,2560.92,2556.40,2516.79,2479.02,2473.89,2485.03,2462.45,2478.15,2471.50,2463.39,2476.13,2489.45,2484.08,2497.88,2475.13,2497.24,2506.20,2499.19,2478.81,2489.15,2487.01,2495.47,2485.14,2477.33,2481.65,2476.22,2497.23,2506.23,2493.58,2499.88,2514.91,2527.65,2536.61,2525.83,2527.55,2529.04,2519.18,2504.43,2488.56,2498.74,2475.84,2471.32,2471.38,2466.93,2481.05,2484.84,2481.69,2489.73,2469.96,2478.61,2489.53,2487.97,2486.49,2501.79,2503.89,2507.04,2497.79,2488.85,2479.57,2492.77,2463.91,2448.55,2457.14,2455.22,2460.11,2456.49,2445.62,2468.47,2497.08,2485.89,2504.13,2493.31,2505.47,2504.00,2494.33,2479.73,2489.18,2486.79,2497.58,2481.60,2479.65,2473.56,2474.36,2469.79,2444.72,2435.85,2439.86,2433.99,2435.05,2426.12,2430.92,2450.32,2458.09,2466.82,2475.17,2482.99,2494.31,2488.91,2475.86,2472.68,2468.92,2464.65,2472.17,2454.16,2472.24,2474.47,2478.68] │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.102 sec. Processed 10.54 million rows, 126.49 MB (103.48 million rows/s., 1.24 GB/s.)

Again sub-second.. 0.102 seconds.

What about the standard deviation for all values recorded on each Sunday over the past 22 years?

//QUERY D
SELECT
    toYear(exchg_time_stamp) AS YEAR,
    toMonth(exchg_time_stamp) AS MONTH,
    toDayOfMonth(exchg_time_stamp) AS DATE,
    stddevPop(HEQB) AS STDDEV
FROM perftest.exchange_data
WHERE toDayOfWeek(exchg_time_stamp) = 07
  AND YEAR > (SELECT MAX(toYear(exchg_time_stamp)) - 22 FROM perftest.exchange_data)
GROUP BY toYear(exchg_time_stamp), toMonth(exchg_time_stamp), toDayOfMonth(exchg_time_stamp)
ORDER BY YEAR, MONTH, DATE ASC;
....

│ 2028 │     2 │   20 │ 130.25183991022928 │
│ 2028 │     2 │   27 │ 129.94991843013983 │
│ 2028 │     3 │    5 │ 129.91073473735725 │
│ 2028 │     3 │   12 │   130.384333031235 │
│ 2028 │     3 │   19 │ 129.99975192284023 │
│ 2028 │     3 │   26 │ 130.55983915431267 │
│ 2028 │     4 │    2 │ 130.57678890216286 │
│ 2028 │     4 │    9 │ 130.69185131445647 │
│ 2028 │     4 │   16 │ 129.90156965949257 │
│ 2028 │     4 │   23 │ 130.32405073508113 │
│ 2028 │     4 │   30 │ 130.43221994584007 │
│ 2028 │     5 │    7 │  130.2685510781478 │
│ 2028 │     5 │   14 │ 129.96306860027582 │
│ 2028 │     5 │   21 │ 130.00647214658198 │
│ 2028 │     5 │   28 │   130.046485150503 │
│ 2028 │     6 │    4 │ 130.35841207992678 │
│ 2028 │     6 │   11 │  130.3562438090328 │
│ 2028 │     6 │   18 │  130.5529111126979 │
│ 2028 │     6 │   25 │ 130.48533864001732 │
│ 2028 │     7 │    2 │  130.5544487943632 │
│ 2028 │     7 │    9 │ 130.39781171476767 │
│ 2028 │     7 │   16 │  130.7816592645926 │
│ 2028 │     7 │   23 │ 130.06199790868968 │
└──────┴───────┴──────┴────────────────────┘

1125 rows in set. Elapsed: 1.523 sec. Processed 106.08 million rows, 1.27 GB (69.65 million rows/s., 835.74 MB/s.)

1.523 seconds.. There’s some magic happening here, and I like it.

This is not extensive by any means, but hopefully enough to pique some interest for those who haven’t used the platform before. Suffice it to say, I am a ClickHouse fan. If you come up with other queries to test with the time series data I’d love to hear about it!

A Quick Comparison

Now I know you might be thinking “Sure Brandon, ClickHouse looks cool but if you’re going to throw a 96 core, 768GB RAM machine at this for $6 an hour, it better be fast! How about something a bit more reasonable?”

I have you covered.

Here is a comparison of query times between the r5ad.24xlarge I used and a c5ad.4xlarge, which has only 16 cores and 32 GB of RAM and runs around $0.68/hour at the time of writing.

                          QUERY A  QUERY B  QUERY C  QUERY D
ClickHouse r5ad.24xlarge  0.168s   0.044s   0.102s   1.523s
ClickHouse c5ad.4xlarge   0.178s   0.046s   0.117s   1.947s

2.5 billion rows.. $0.68/hr. Did I mention this thing also has a distributed mode I haven’t even tried yet?

And I know having a reference benchmark is helpful here, so here is the query execution on an AWS Redshift cluster (2 nodes of dc2.8xlarge, which is 32 CPU, 244GB of RAM and 2.5TB of SSD storage per node). Data was loaded from CSV via an s3 copy and only took a hair over 7 minutes thanks to the parallel s3 operations. Not bad for 2.5B rows. The table was created with a sort key on the timestamp column, and it’s Redshift’s default behavior to perform an ANALYZE operation to optimize statistics after a copy operation. Redshift queries were executed via Navicat for PostgreSQL as well as through the AWS web-based query editor; multiple runs across three different clusters were evaluated, and the fastest run for each query was chosen. At ~$10.00 per hour, the dc2.8xlarge cluster is more comparable to the r5ad.24xlarge in both price and performance (64 CPU, 488GB RAM in total).

                          QUERY A  QUERY B  QUERY C  QUERY D
ClickHouse r5ad.24xlarge  0.168s   0.044s   0.102s   1.523s
ClickHouse c5ad.4xlarge   0.178s   0.046s   0.117s   1.947s
Redshift dc2.8xlarge      5.641s   5.663s   5.052s   6.703s

Now Redshift is a great product; don’t go thinking ClickHouse is the way to go for every possible use case. I guarantee you we can find scenarios where out-of-the-box Redshift will beat ClickHouse, and unlike ClickHouse, Redshift ties into the existing AWS solutions very easily. There are pros and cons to both, and one should always start by framing the problem first and then working backward to the appropriate technology solution.

That’s it for now…

I’m just scratching the surface here, and it will take some time to get into the full capabilities of ClickHouse, but I have to say that I’ve been very impressed thus far. The fact that all of this has been running on a single box, without my needing to jump into clustering, is really great, and it means that those who need to squeeze even more performance out of the platform have plenty of room to work with.

If you’re looking for a lightweight, open source OLAP database that can handle some serious time series data, you’ve got a great option here with ClickHouse.