Optimizing MySQL for high concurrency on Amazon RDS

Transaction response time chart: NoRedInk’s degraded performance before fixing the MySQL bottlenecks.

TL;DR

MySQL couldn’t handle our load regardless of our server capacity. We fixed it by disabling the query cache, removing the thread concurrency limit, and raising the limit on concurrently open tables.

Given that we’re in the education space, NoRedInk has an interesting traffic pattern unlike that of most other web applications:

  • High traffic on workdays, starting around 8:00am ET and dying down around 5:00pm PT.
  • Low traffic during nights and weekends.
  • Low traffic during summer break.

This means that, although we have daily “traffic spikes”, our infrastructure is set up to handle them automatically. Given how different our high-traffic demands are from our low-traffic ones, and that we have an opportunity to tune scaling daily, we feel like we have the tuning aspect of scaling mostly nailed by now.

The fact that our traffic dies down during summer break, though, means that we have about a month of silence during which our actual traffic demands increase but are invisible to us. When summer break ends, we’re suddenly faced with a bottleneck that we would otherwise have found and fixed gradually.

That’s what happened this summer break to our MySQL database: traffic increased more than we anticipated, MySQL started failing in ways that we didn’t understand, and the metrics pointed us nowhere.

Our memory usage? Under control: no swapping, and memory to spare. Disk IO? Way under capacity. CPU usage? Aside from a few peaks, under half capacity. Network IO? No visible bottleneck.

MySQL comes with some really poor default configs, but Amazon RDS updates most of those to good defaults. That meant that once we started running into trouble, we initially ruled out those parameters as suspects. It turns out, however, that the configuration we had was the culprit, and that updating some of those values allowed us to unleash the full potential of the hardware backing our database.

I’m sure you’re curious to learn more! Let’s get busy.

Looking for concurrency bottlenecks

When you run into problems with your database performance, the first step is usually to check whether your machine has any bottlenecks in CPU, memory, or IO from either disk or network. Generally, one of those is the culprit, and increasing its capacity solves the problem. But what do you do when that’s not the case?

For high-concurrency scenarios with MySQL, a bottleneck with no clear hardware capacity culprit will usually occur when there’s a mutex causing contention somewhere, or when a particular parameter is limiting the number of concurrent connections that can access a particular part of the internal storage engine. Unfortunately, in cases like this, diagnosing the actual problem is difficult, and when there are multiple bottlenecks they’re hard to debug in isolation, which makes identifying them even harder.

Your best bet for a first step when trying to understand where your database is spending its time is calling SHOW PROCESSLIST. This shows what each of your threads is doing at the moment. If a significant portion of them are in the same state, for instance “Waiting for query cache lock” or “Opening tables”, you’ll know where to dig deeper.
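For example, grouping the process list by state gives a quick picture of where threads are piling up. Here’s a minimal sketch using INFORMATION_SCHEMA.PROCESSLIST, which exposes the same data as SHOW PROCESSLIST in queryable form:

-- count active sessions by state to spot pile-ups at a glance
SELECT STATE, COUNT(*) AS SESSIONS
  FROM INFORMATION_SCHEMA.PROCESSLIST
 WHERE COMMAND != 'Sleep'
 GROUP BY STATE
 ORDER BY SESSIONS DESC;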

The best tool you have at your disposal for finding more detailed information about those bottlenecks is the performance schema. Once it’s enabled, you get access to a number of metrics that would otherwise be unavailable.
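Note that performance_schema can only be turned on at server startup, which on RDS means enabling it in your DB parameter group and rebooting the instance. You can check whether it’s already on:

-- returns ON when the performance schema is collecting data
SHOW VARIABLES LIKE 'performance_schema';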

For our debugging, in most cases what we did was one of two things:

1) List the events with the highest total wait times.

-- "example of tracking down exactly what the contention points are"
SELECT EVENT_NAME,
       SUM_TIMER_WAIT/1000000000 WAIT_MS,
       COUNT_STAR
  FROM performance_schema.events_waits_summary_global_by_event_name
  ORDER BY SUM_TIMER_WAIT DESC, COUNT_STAR DESC LIMIT 30;


-- table handler waits on top, they also had the biggest increase through time
+------------------------------------------------------+-----------------+--------------+
| EVENT_NAME                                           | WAIT_MS         | COUNT_STAR   |
+------------------------------------------------------+-----------------+--------------+
| idle                                                 | 4661443248.1257 |   6165953390 |
| wait/io/table/sql/handler                            | 1755211603.3699 | 381701231771 |
| wait/io/file/innodb/innodb_log_file                  |  459199280.6118 |    139800252 |
| wait/io/file/innodb/innodb_data_file                 |   83050382.6978 |    197884296 |
| wait/io/file/myisam/kfile                            |   56080735.9075 |   5274545307 |
| wait/io/file/myisam/dfile                            |   13172549.9320 |    725383142 |
| wait/lock/table/sql/handler                          |    5669784.9629 |  15013313221 |
| wait/io/file/sql/binlog                              |    2407201.1162 |    389713292 |
...

2) List the current or last completed wait event for each session.

-- "what the current, or last completed, wait for each session was, and for exactly how long they waited"
SELECT NAME,
       IF(PPS.THREAD_ID = 0, 'Internal Thread', CONCAT(IPS.USER, '@', IPS.HOST)) USER,
       DB,
       COMMAND,
       STATE,
       TIME,
       EVENT_NAME LAST_WAIT,
       IF(TIMER_WAIT IS NULL , 'Still Waiting', TIMER_WAIT/1000000000) LAST_WAIT_MS
  FROM performance_schema.events_waits_current
  JOIN performance_schema.threads PPS USING (THREAD_ID)
LEFT JOIN INFORMATION_SCHEMA.PROCESSLIST IPS ON IPS.ID = PPS.THREAD_ID;

-- table handler waits, discrete records:
+-----------------------------------+----------------------------------+----------+---------+--------------+------+-----------------------------------------+---------------+
| NAME                              | USER                             | DB       | COMMAND | STATE        | TIME | LAST_WAIT                               | LAST_WAIT_MS  |
+-----------------------------------+----------------------------------+----------+---------+--------------+------+-----------------------------------------+---------------+
| thread/sql/one_connection         | noredink_app@172.31.35.138:36177 | noredink | Sleep   |              |    1 | idle                                    | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.17.171:49280 | noredink | Sleep   |              |  133 | idle                                    | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.31.142:54139 | noredink | Sleep   |              |   42 | idle                                    | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.4.22:43788   | noredink | Query   | update       |    0 | idle                                    | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.33.13:35375  | noredink | Query   | updating     |    0 | wait/io/table/sql/handler               | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.42.196:55806 | noredink | Sleep   |              |    4 | idle                                    | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.4.22:38855   | noredink | Sleep   |              |   14 | idle                                    | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.17.171:55028 | noredink | Sleep   |              |    3 | idle                                    | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.19.73:55344  | noredink | Sleep   |              |  229 | idle                                    | Still Waiting |
| thread/sql/one_connection         | noredink_app@172.31.42.196:59682 | noredink | Sleep   |              |   15 | wait/io/table/sql/handler               | 0.1417        |
| thread/sql/one_connection         | noredink_app@172.31.15.149:36555 | noredink | Sleep   |              |   66 | wait/io/table/sql/handler               | Still Waiting |

Credit for the queries goes to Mark Leith.

For the first one, we’d measure each event’s waits over time and look for those whose increases correlated with the periods when our database ran into performance problems.

That’s the backstory. Now here’s what we actually changed to make our MySQL perform again.

Disabling the query cache

The query cache is a feature of MySQL that lets it return the same results when an identical query is executed more than once, without having to fetch the data and redo its calculations. MySQL writes the results of the first execution to the cache; when an identical query arrives later, it just fetches the results from the cache and returns them directly. This can speed up query execution significantly, especially for queries whose results are expensive to fetch and compute.
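If you’re curious how well the cache is working for your workload, its status counters give a rough picture (what counts as “good” numbers depends entirely on your query mix):

-- Qcache_hits vs. Qcache_inserts shows how often cached results get reused;
-- a high Qcache_lowmem_prunes means entries are constantly being evicted
SHOW GLOBAL STATUS LIKE 'Qcache%';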

Disabling the query cache is scary. NoRedInk’s usage scenario can mostly be broken down into two parts: 1) students answer questions, and 2) teachers see data about those answers. The reports that pull data for teachers involve some huge queries. Huge and slow. We were afraid that disabling the query cache could cause more harm than good by removing one bottleneck just to add another.

So you can imagine our surprise when we found out that the query cache was in fact a huge hindrance to our database performance. There were two reasons:

1) If you have a workload that’s write-heavy for a particular table, there will be constant invalidation of your cache. Since cache invalidation happens at the table level, any insert on a table will invalidate the entire cache for that table. As quoted from the MySQL query cache documentation:

Server workload has a significant effect on query cache efficiency. A query mix consisting almost entirely of a fixed set of SELECT statements is much more likely to benefit from enabling the cache than a mix in which frequent INSERT statements cause continual invalidation of results in the cache.

2) There’s a mutex around the code that handles query cache writes, which means only one connection can write to the cache at a time. Additionally, when the cache is locked for writing, a read lock prevents other connections from reading from the cache. To quote from the MySQL query cache configuration documentation:

Be careful not to set the size of the cache too large. Due to the need for threads to lock the cache during updates, you may see lock contention issues with a very large cache.

Given those two limitations of the MySQL query cache, it’s likely that, in a high-concurrency scenario, the query cache is doing more harm than good. That’s definitely what we found in our case.

If you want to disable the query cache, set the query_cache_size property to 0.
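That property is dynamic, so on a self-managed server the change can look like the sketch below; on RDS, make the change in your DB parameter group so it persists across restarts. Setting query_cache_type to 0 as well stops MySQL from checking the cache on reads:

-- free the cache memory and stop storing new results
SET GLOBAL query_cache_size = 0;
-- stop checking the cache on reads; note that the query cache mutex is
-- skipped entirely only when the server starts with query_cache_type = 0
SET GLOBAL query_cache_type = 0;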

Increasing InnoDB thread concurrency

InnoDB has an internal control for how many threads are allowed to enter it at the same time. It keeps track of how many threads are inside it by assigning each one a “ticket” and counting the number of tickets that have been handed out. If that count exceeds the configured limit, a new thread is added to a queue, where it waits for the next available ticket.

There are two primary reasons why you would want to set a limit for thread concurrency:

  • If you have a small installation and are afraid that, without a concurrency limit, the database will thrash other machine resources.
  • If you don’t want your database to peg resources allocated to other services running on the same machine.

Given the above, this setting should always be either disabled or set to the maximum value on a high-concurrency installation. Surprisingly, ours was set to 8 by default, likely because of the number of vCPUs the database originally had.

You might want to set the value to the maximum instead of disabling thread concurrency altogether: if thread concurrency is disabled, InnoDB will no longer collect thread concurrency metrics. To quote from the documentation:

Disabling thread concurrency checking enables InnoDB to create as many threads as it needs. A value of 0 also disables the queries inside InnoDB and queries in queue counters in the ROW OPERATIONS section of SHOW ENGINE INNODB STATUS output.

Setting it to the maximum should cause no visible performance overhead, while still allowing you to access metrics.

Here are our charts from before, with concurrency pegged at 8 concurrent threads (on 16 CPUs), and after, with the configuration allowing the maximum number of threads. Before, we had 250 threads waiting to execute while the site was under stress, with at most 8 threads executing simultaneously. With the limit raised to the maximum, you can see peaks of more than 50 threads executing, meaning that, before, we had hundreds of threads blocked from executing.

Our InnoDB profile before raising the limit on concurrent threads.

Our InnoDB profile after raising the limit.

To disable InnoDB thread concurrency, set the property innodb_thread_concurrency to 0. To keep the metrics for current threads, set it to the maximum, 1000, instead.
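innodb_thread_concurrency is dynamic, so the change looks like the sketch below (again, persist it in your RDS parameter group):

-- option 1: remove the limit entirely, at the cost of the queue counters
SET GLOBAL innodb_thread_concurrency = 0;

-- option 2, which keeps the ROW OPERATIONS counters in
-- SHOW ENGINE INNODB STATUS alive: the maximum of 1000
SET GLOBAL innodb_thread_concurrency = 1000;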

Increasing InnoDB open tables

MySQL has an LRU cache for open tables, a cache that stores table metadata for use by each database connection.

When we found out the table cache was a possible bottleneck, we collected some metrics about the number of open tables to check if they were correlated to our performance issues.

Spreadsheet snippet: the open-table metrics we collected.

One way to see whether performance could benefit from a larger open table cache is to issue the SHOW GLOBAL STATUS command and check the Open_tables property. If the limit on concurrently open tables is a bottleneck, that value will sit at the maximum set by table_open_cache, meaning some connections are waiting before they can get the information about a table that they need to continue executing.
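A minimal check along those lines:

-- if Open_tables sits at the table_open_cache value, the cache is saturated;
-- a steadily climbing Opened_tables also points at table cache churn
SHOW GLOBAL STATUS LIKE 'Open%tables';
SHOW GLOBAL VARIABLES LIKE 'table_open_cache';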

Finding the ideal maximum was more art than science. We slowly increased the limit in increments of one to two thousand. Once the number of currently open tables no longer reached the maximum we’d set after about an hour, we figured we had found a good value for our new maximum. Since there are reports online that setting it too high is detrimental to performance, we wanted to make sure it was not much higher than the peak number of open tables our database load requires.

In order to set the number of open tables, update the property table_open_cache. We updated ours from 2000 to 8000.
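table_open_cache is dynamic, so the change itself is a one-liner; as with the other settings, put the value in your RDS parameter group so it survives restarts:

-- raise the open table cache limit; 8000 is what worked for our load
SET GLOBAL table_open_cache = 8000;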

Conclusion

MySQL configuration parameters have a big impact on the overall performance of your application, and tuning them to your workload is incredibly important, especially as you scale. If you ignore them, you may end up in a position similar to ours, where throwing more hardware at the problem had no effect.

Keep in mind this is what worked for us in this particular scenario. Although some parameters might seem like no-brainers, database load and usage profiles vary drastically across applications, and what helps one profile may hurt another. So be careful!

If you’re excited about infrastructure automation and performance and love debugging interesting issues, we’re hiring operations / infrastructure engineers. You can learn more about our positions on our jobs page.

That’s it! Hope you had a fun read and have a great day!

Discuss this post on Hacker News


Marcos Toledo
@mtoledo
Director of Engineering at NoRedInk
