The choice of good InnoDB primary keys is a critical performance tuning decision. This post will guide you through the steps of choosing the best primary key depending on your workload.

As a principal architect at Percona, one of my main duties is to tune customer databases. There are many aspects of performance tuning that make the job complex and very interesting. In this post, I want to discuss one of the most important ones: the choice of good InnoDB primary keys. You would be surprised how many times I have had to explain the importance of primary keys, and how many debates I have had around the topic, as people often have preconceived ideas that translate into doing things a certain way without further thought.

The choice of a good primary key for an InnoDB table is extremely important and can have a huge performance impact. I once worked with a customer whose overloaded x1.16xlarge RDS instance, with close to 1TB of RAM, ended up doing very well on an r4.4xlarge instance after we put a new primary key in place. That's a huge impact. Of course, it is not a silver bullet: you need to have a workload like the ones I'll highlight in the following sections. Keep in mind that tuning comes with trade-offs, especially with the primary key. What you gain somewhere, you have to pay for, performance-wise, elsewhere. You need to determine what is best for your workload.

What is special about InnoDB primary keys?

InnoDB is called an index-organized storage engine: it uses the B-Tree of the primary key to store the data, the table rows. That means a primary key is mandatory with InnoDB. If there is no primary key for a table, InnoDB adds a hidden auto-incremented 6-byte counter to the table and uses that hidden counter as the primary key. There are some issues with the InnoDB hidden primary key, so you should always define explicit primary keys on your tables. In summary, you access all InnoDB rows by their primary key values.

An InnoDB secondary index is also a B-Tree. The search key is made of the index columns and the values stored are the primary keys of matching rows. A search by a secondary index very often results in an implicit search by primary key. You can find more information about InnoDB file format in the documentation. Jeremy Cole’s InnoDB Ruby tools are also a great way to learn about InnoDB internals.

What is a B-Tree?

A B-Tree is a data structure optimized for operations on block devices. Block devices, or disks, have rather high data access latency, especially spinning disks. Retrieving a single byte at a random position doesn't take much less time than retrieving a bigger piece of data like an 8KB or 16KB object. That's the fundamental argument for B-Trees. InnoDB uses pieces of data, called pages, of 16KB.

A simple three level B-Tree

Let's attempt a simplified description of a B-Tree. A B-Tree is a data structure organized around a key. The key is used to search the data inside the B-Tree. A B-Tree normally has multiple levels. The data is stored only in the bottom-most level, the leaves. The pages of the other levels, the nodes, only contain keys and pointers to pages in the next lower level.

When you want to access a piece of data for a given value of the key, you start from the top node, the root node, compare the keys it contains with the search value, and find the page to access at the next level. The process is repeated until you reach the last level, the leaves. In theory, you need one disk read operation per level of the B-Tree. In practice there is always a memory cache, and the nodes, since they are less numerous and accessed often, are easy to cache.

An ordered insert example

Let’s consider the following sysbench table:
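For reference, a standard sysbench OLTP table has roughly this schema (the exact column definitions vary slightly between sysbench versions, so treat this as a sketch):

```sql
-- Typical sysbench OLTP table: id is an auto-increment primary key,
-- so rows are inserted in primary key order
CREATE TABLE sbtest1 (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  k INT UNSIGNED NOT NULL DEFAULT '0',
  c CHAR(120) NOT NULL DEFAULT '',
  pad CHAR(60) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k_1 (k)
) ENGINE=InnoDB;
```

The Data_length and Index_length values discussed below come from `SHOW TABLE STATUS LIKE 'sbtest1';`.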

The primary key B-Tree size is Data_length. There is one secondary key B-Tree, the k_1 index, and its size is given by Index_length. The sysbench table was inserted in order of the primary key, since the id column is auto-incremented. When you insert in order of the primary key, InnoDB fills its pages with up to 15KB of data (out of 16KB), even when innodb_fill_factor is set to 100. That allows for some row expansion by updates after the initial insert, before a page needs to be split. There are also some headers and footers in the pages. If a page is too full and cannot accommodate an update adding more data, the page is split into two. Similarly, if two neighboring pages are less than 50% full, InnoDB will merge them. Here is, for example, a sysbench table inserted in id order:

The table doesn’t fit in the buffer pool, but the queries give us good insights. The pages of the primary key B-Tree have on average 75 records and store a bit less than 15KB of data. The index k_1 is inserted in random order by sysbench. Why is the filling factor so good? It’s simply because sysbench creates the index after the rows have been inserted and InnoDB uses a sort file to create it.

You can easily estimate the number of levels in an InnoDB B-Tree. The above table needs about 40k leaf pages (3M/75). Each node page holds about 1200 pointers when the primary key is a four-byte integer. The level above the leaves thus has approximately 35 pages and, on top of the B-Tree, there is the root node (PAGE_NUMBER = 3). We have a total of three levels.

A randomly inserted example

If you are a keen observer, you will have realized a direct consequence of inserting in random order of the primary key: the pages are often split, and on average the filling factor is only around 65-75%. You can easily see the filling factor from the information schema. I modified sysbench to insert in random order of id and created a table, also with 3M rows. The resulting table is much larger:

While the size of the primary key B-Tree inserted in order of id is 644MB, the size when inserted in random order is about 1GB, 60% larger. Obviously, we have a lower page filling factor:
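One way to inspect the filling factor, assuming the table's pages are loaded in the buffer pool, is a query along these lines against the INNODB_BUFFER_PAGE table (a sketch; adjust the table name pattern to your schema):

```sql
-- Average records and payload bytes per page, per index,
-- for the pages currently cached in the buffer pool
SELECT index_name,
       COUNT(*) AS pages,
       ROUND(AVG(number_records)) AS avg_rows_per_page,
       ROUND(AVG(data_size)) AS avg_data_bytes_per_page
FROM information_schema.innodb_buffer_page
WHERE table_name LIKE '%sbtest%'
  AND page_type = 'INDEX'
GROUP BY index_name;
```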

The primary key pages are now filled with only about 10KB of data (~66%). It is a normal and expected consequence of inserting rows in random order. We’ll see that for some workloads, it is bad. For some others, it is a small price to pay.

A practical analogy

It is always good to have a concrete model or analogy in your mind to better understand what is going on. Let's assume you have been tasked to write the names and arrival times, on paper, of all the attendees arriving at a large event like Percona Live. So, you sit at a table close to the entry with a good pen and a pile of sheets of paper. As people arrive, you write their names and arrival times, one after the other. When a sheet is full, after about 40 names, you move it aside and start writing on a new one. That's fast and effective. You handle a sheet only once, and when it is full, you don't touch it anymore. The analogy is easy: a sheet of paper represents an InnoDB page.

The above use case represents an ordered insert. It is very efficient for the writes. Your only issue is with the organizer of the event: she keeps coming to you asking if “Mr. X” or “Mrs. Y” has arrived. You have to scan through your sheets to find the name. That’s the drawback of ordered inserts, reads can be more expensive. Not all reads are expensive, some can be very cheap. For example: “Who were the first ten people to get in?” is super easy. You’ll want an ordered insert strategy when the critical aspects of the application are the rate and the latency of the inserts. That usually means the reads are not user-facing. They are coming from report batch jobs, and as long as these jobs complete in a reasonable time, you don’t really care.

Now, let’s consider a random insertion analogy. For the next day of the event, tired of the organizer questions, you decide on a new strategy: you’ll write the names grouped by the first letter of the last name. Your goal is to ease the searches by name. So you take 26 sheets, and on top of each one, you write a different letter. As the first visitors arrive, you quickly realize you are now spending a lot more time looking for the right sheet in the stack and putting it back at the right place once you added a name to it.

At the end of the morning, you have done much more work. You also have more sheets than the previous day, since for some letters there are few names while for others you needed more than one sheet. Finding names is much easier though. The main drawback of random insertion order is the overhead of managing the database pages when adding entries. The database reads and writes from/to disk much more, and the dataset size is larger.

Determine your workload type

The first step is to determine what kind of workload you have. When you have an insert-intensive workload, very likely the top queries are inserts on some large tables and the database heavily writes to disk. If you repeatedly execute "show processlist;" in the MySQL client, you see these inserts very often. That's typical of applications logging a lot of data: there are many data collectors and they all wait to insert data. If they wait for too long, some data may be lost. If you have strict SLAs on the insert time and relaxed ones on the read time, you clearly have an insert-oriented workload and you should insert rows in order of the primary key.

You may also have a decent insert rate on large tables but these inserts are queued and executed by batch processes. Nobody is really waiting for these inserts to complete and the server can easily keep up with the number of inserts. What matters for your application is the large number of read queries going to the large tables, not the inserts. You already went through query tuning and even though you have good indexes, the database is reading from disk at a very high rate.

When you look at the MySQL processlist, you often see the same select query forms on the large tables. The only option seems to be adding more memory to lower the disk reads, but the tables are growing fast and you can't add memory forever. We'll discuss the read-intensive workload in detail in the next section.

If you can't figure out whether you have an insert-heavy or read-heavy workload, maybe you just don't have a big workload. In such a case, the default would be to use ordered inserts, and the best way to achieve this with MySQL is through an auto-increment integer primary key. That's the default behavior of many ORMs.

A read-intensive workload

I have seen quite a few read-intensive workloads over my consulting years, mostly with online games and social networking applications. On top of that, some games have social networking features, like watching the scores of your friends as they progress through the game. Before we go further, we first need to confirm the reads are inefficient. When reads are inefficient, the top select query forms will be accessing a number of distinct InnoDB pages close to the number of rows examined. The Percona Server for MySQL slow log, when the verbosity level includes "InnoDB", exposes both quantities, and the pt-query-digest tool includes stats on them. Here's an example output (I've removed some lines):
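To capture these quantities in the first place, the extended slow log can be enabled roughly like this (a sketch using Percona Server for MySQL variable names; adjust long_query_time to your needs):

```sql
-- Percona Server extended slow log with per-query InnoDB statistics
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 0;       -- log everything; use with care in production
SET GLOBAL log_slow_verbosity = 'innodb';
```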

The friends table definition is:
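A friends table of this kind, with an auto-increment primary key, looks roughly like this (a sketch based on the columns discussed in this post; index names are illustrative):

```sql
CREATE TABLE friends (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id INT UNSIGNED NOT NULL,
  friend_user_id INT UNSIGNED NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  active TINYINT NOT NULL DEFAULT '1',
  PRIMARY KEY (id),
  KEY idx_user (user_id),
  KEY idx_friend (friend_user_id)
) ENGINE=InnoDB;
```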

I built this simple example on my test server. The table easily fits in memory, so there are no disk reads. What matters here is the relation between "page distin" and "Rows examine". As you can see, the ratio is close to 1. It means that InnoDB rarely gets more than one row per page it accesses. For a given user_id value, the matching rows are scattered all over the primary key B-Tree. We can confirm this by looking at the output of the sample query:

The rows are often thousands of id values apart. Although the rows are small, about 30 bytes, an InnoDB page doesn't contain more than 500 rows. As the application becomes popular, there are more and more users, and the table size grows like the square of the number of users. As soon as the table outgrows the InnoDB buffer pool, MySQL starts to read from disk. Worst case, with nothing cached, we need one read IOP per friend. If the rate of these selects is 300/s and, on average, every user has 100 friends, MySQL needs to access up to 30000 pages per second. Clearly, this doesn't scale for long.

We need to determine all the ways the table is accessed. For that, I use pt-query-digest and I raise the limit on the number of query forms returned. Let’s assume I found:

  • 93% of the times by user_id
  • 5% of the times by friend_id
  • 2% of the times by id

The above proportions are quite common. When there is a dominant access pattern, we can do something. The friends table is a typical example of a many-to-many table. With InnoDB, we should define such tables as:
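A sketch of such a definition, keeping the id column with a plain secondary index for the small fraction of accesses by id (index names are illustrative):

```sql
CREATE TABLE friends (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id INT UNSIGNED NOT NULL,
  friend_user_id INT UNSIGNED NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  active TINYINT NOT NULL DEFAULT '1',
  PRIMARY KEY (user_id, friend_user_id),
  KEY idx_friend (friend_user_id),
  KEY idx_id (id)
) ENGINE=InnoDB;
```

Note that MySQL requires an auto-increment column to be part of an index, which is why idx_id is kept.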

Now the rows are ordered, grouped, by user_id inside the primary key B-Tree, but the inserts are in random order. In other words, we slowed down the inserts to the benefit of the select statements on the table. To insert a row, InnoDB potentially needs one disk read to get the page where the new row is going and one disk write to save it back to disk. Remember, in the previous analogy, we needed to take one sheet from the stack, add a name, and put it back in place. We also made the table bigger: the InnoDB pages are not as full, and the secondary indexes are bigger since the primary key is larger. We also added a secondary index. Now we have less data in the InnoDB buffer pool.

Shall we panic because there is less data in the buffer pool? No, because now when InnoDB reads a page from disk, instead of getting only a single matching row, it gets up to hundreds of matching rows. The number of read IOPS is no longer correlated to the number of friends times the rate of select statements; it is now only a factor of the incoming rate of select statements. The impact of not having enough memory to cache the whole table is much reduced. As long as the storage can perform more read IOPS than the rate of select statements, all is fine. With the modified table, the relevant lines of the pt-query-digest output are now:

With the new primary key, instead of 30k read IOPS, MySQL needs to perform only about 588 read IOPS (~300*1.96). That's a much easier workload to handle. The inserts are more expensive, but if their rate is 100/s, it just means 100 read IOPS and 100 write IOPS in the worst case.

The above strategy works well when there is a clear access pattern. Off the top of my head, here are a few other examples where there is usually a dominant access pattern:

  • Game leaderboards (by user)
  • User preferences (by user)
  • Messaging application (by from or to)
  • User object store (by user)
  • Likes on items (by item)
  • Comments on items (by item)

What can you do when you don't have a dominant access pattern? One option is the use of a covering index. The covering index needs to cover all the required columns. The order of the columns is also important, as the first must be the grouping value. Another option is to use partitions to create an easy-to-cache hot spot in the dataset. I'll discuss these strategies in future posts, as this one is long enough!
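For example, if many queries fetched the created and active columns by friend_user_id, a covering index might look like this (a hypothetical column list; adapt it to the actual select statement):

```sql
-- Hypothetical covering index: the grouping value (friend_user_id) first,
-- followed by every column the query reads, so InnoDB can answer from
-- the index alone without visiting the primary key B-Tree
ALTER TABLE friends
  ADD INDEX idx_friend_covering (friend_user_id, user_id, created, active);
```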

In this post we have seen a common strategy used to solve read-intensive workloads. This strategy doesn't work all the time: you must access the data through a common pattern. But when it works, and you choose good InnoDB primary keys, you are the hero of the day!


Scott Klasing

Yves, great article on what so many consider after they have placed a table in production. My only complaint is you have not written anything on how to choose the best primary key and it involves/requires knowledge of the data.

If more MySQL or database technicians were to focus on this during the data modeling phase, they would have far fewer problems. There are huge gains for those who best understand database modeling, normalization of data, and the relational model itself as invented by Ted Codd back in the 60's era. The model is straight set theory and is easily, mathematically, proven to work well when adhered to.

I have to admit I am very surprised the database industry does not require you to specify a PK rather than defaulting to an internal, hidden, useless-to-users approach. Force the dba/programmer/user to think about what is the best access path to the data. Ted Codd (inventor of the relational model) used to say that if you have not identified the table's PK then you have not identified the table's purpose, with which I fully agree. The proof that PKs should be mandated: if you have a table without a PK, then how do your batch programs identify which row should be updated? You might be able to view the data and make that determination, but a program cannot.

As most already know, a Primary Key must be unique and cannot contain nulls, otherwise you have not identified the row. So to start, you first identify all possible candidate keys, i.e., those that are unique and don't allow nulls. Second, choose the better of the candidate keys.

An example best explains. Usually in most MySQL shops you see lazy choices of auto-incremented PKs proliferating the design, and at the same time you see the most common secondary indexes that should have/could have been the primary key. The point is that improper choice of the PK always results in additional unnecessary secondary indexes, which further slows inserts/updates/deletes. The pattern shows up in the WHERE clauses: since the application did not know the value of the PK but it did know a name or a description, it searched by the secondary key to then identify the PK. Note the similar pattern where MySQL includes the PK in all secondary indexes. Quite often the main secondary index, assuming it is unique, outweighs the choice of the not-thought-through auto-increment key.

Then question the impact of single-attribute PKs like auto-increment versus a key that better represents the business but is perhaps a two-attribute PK. If your model's table hierarchy is only 2 to 3 deep, then a multi-attribute key is manageable. If much deeper, then it becomes problematic writing code, since the lower levels of the hierarchy have to add an additional attribute that represents the next level. Also question the order of the attributes in the PK, since your WHERE clauses must reference the first-order attributes to even use the index. Also question the distinct cardinality of an attribute, because the higher the distinct cardinality of an attribute, the better it is in an index. For example, if an index's first attribute has low cardinality (number of distinct values), then the index has to read/lock more similar rows to answer the question.

I believe the above process is why DBAs, developers, and implementors get lazy: they don't understand their data as well as they should. If they did, the above becomes a very easy process to identify the best PK.

In summary, I always inform our teams: choose the candidate key that best represents the business, the one you will search by, and most likely order by and group by, since the PK is typically the clustered, physically sorted index. Finally, for shops with huge write volumes, as in 100's of thousands of queries per second and more: always, always, always UPDATE via the PK; never allow updates or deletes that are not PK based. Even if your volumes are lower, "design for millions of transactions per second" by only taking one-row update/delete locks as opposed to locking many rows. If you don't know the PK to update or delete by, then read in non-locking mode to obtain the list of PKs and then update or delete by the list.

I have used the above techniques for the last 37-plus years and they have never failed to produce significantly improved performance results. The above logic applies to every database I have ever worked on (IDMS, IMS, DB2, Oracle, MySQL, ObjectStar, CouchDB, MongoDB, other NoSQL DBs, and several cloud DBs), not just InnoDB engine tables.

Loved the article, keep up the good work.

xiaojianping

If there is no primary key for a table, InnoDB adds a hidden auto-incremented 6 bytes counter to the table and use that hidden counter as the primary key.
I remember it is 8 bytes.


Stofa.

Hi,

The question may seem stupid.
What about the choice of using a varchar column as a PRIMARY KEY? Is there any limit concerning the maximum length of a varchar column used as a PRIMARY KEY for an InnoDB table?
Someone has suggested, as an alternative to int or bigint columns, the use of an MD5 or a UUID as a PRIMARY KEY for InnoDB.

Stofa.

Thanks.

Jose María Zaragoza

Thanks for you post

Why not to create ‘friends’ table like this:

CREATE TABLE friends (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  user_id int(10) unsigned NOT NULL,
  friend_user_id int(10) unsigned NOT NULL,
  created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  active tinyint(4) NOT NULL DEFAULT '1',
  PRIMARY KEY (id),
  KEY idx_friend (friend_user_id, user_id)
) ENGINE=InnoDB AUTO_INCREMENT=144002 DEFAULT CHARSET=latin1

– PK is autoincrement ( so, inserts are ordered , better for selecting write page )
– only one secondary index. It supports searches by (friend_user_id) and (friend_user_id,user_id)

Regards

SCG82

Because 93% of queries are by user_id, your schema would perform very poorly for these queries.


Sids

Great article on tuning ideology.

Just a random question: is there any specific need for the ID column "id int(10) unsigned NOT NULL AUTO_INCREMENT" after you've shifted the primary key to "PRIMARY KEY (user_id,friend_user_id)"?

Is there any use-case to keep
Column : id int(10) unsigned NOT NULL AUTO_INCREMENT and
Key : idx_id (id) ?


John Lee

Very informative. What is the relationship between primary key vs. partition key and their impact on performance? For example, I’d like to use ‘id’ as primary key because they are unique. id is also incrementally generated, so storing them together generally provide grouping of items that are relatively close in incoming timestamp. However, search is usually done with timestamp since I do not know id in advance. I am thinking a primary key index as such (id, timestamp). Since the tables are usually well over 100G, I would like to partition as well but on timestamp.

H

Yves, thanks for this very helpful article! I have a question regarding the concepts you explained: when having e.g. document data in an InnoDB table in a multi-tenant database used for multiple customer accounts, would it make sense / be wise to "preface" the primary key with the account ID? (The access pattern would be to retrieve e.g. the newest documents for a specific account.)

So instead of

CREATE TABLE documents (
doc_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
doc_account MEDIUMINT UNSIGNED NOT NULL,
doc_title VARCHAR(128) NOT NULL,
doc_date DATE NOT NULL,

PRIMARY KEY (doc_id),
INDEX (doc_account, doc_date)
)

it would be better to do this

CREATE TABLE documents (
doc_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
doc_account MEDIUMINT UNSIGNED NOT NULL,
doc_title VARCHAR(128) NOT NULL,
doc_date DATE NOT NULL,

PRIMARY KEY (doc_account, doc_id),
INDEX (doc_account, doc_date)
)

? Would I leave the secondary index just as it is? Any help is highly appreciated!