Saturday, 2 February 2013

Performance series: GUIDs vs. IDENTITY INT in RDBMS

[Level C2] A few colleagues and I have been debating whether we should adopt GUIDs as our primary keys. Around the same time, the topic came up in a meeting with our DBAs about other scenarios.

As systems grow more complex and more layers sit between your user interface and your database, not knowing the ID of the aggregate root before storing it can pose unnecessary challenges. On the other hand, the cost of choosing a GUID as the primary key is usually unknown. That is why I set out to find the exact cost of choosing a GUID instead of an IDENTITY INT as the primary key of a table.

Pros and cons

This is not a new topic. Quite a few resources discuss it and list the pros and cons. Jeff Atwood's blog post from six years ago is a short and sweet one explaining why Stack Overflow decided to move (at least some of) its IDs to GUIDs. It also contains many references to other related posts - all of which are good reads, especially this one.

Why GUID?

For me, the main reason to choose a GUID is knowing the ID of the entity even before persisting it. If you do not have this requirement, you might still want to use a GUID and avoid IDENTITY INT, especially if you are using sharding or master-master replication. But for me, the first reason is vital. Why?

If you use CQRS and commands to store your entities, you would naturally implement commands as asynchronous operations. Your UI then has to find out about the result of an operation by polling, or by subscribing to the event published after the command is processed. If you do not know the ID of your entity before persistence, you end up having to correlate using some other reference.

Problem with GUIDs

Apart from GUIDs being unreadable, there are performance implications for using GUIDs. The main ones are:

  1. A GUID is a 16-byte type while an INT is a 4-byte type. With storage so cheap nowadays, this is normally not a problem from a storage point of view. The problem is that every read and write of the key touches four times as many bytes.
  2. The most serious problem arises when you use the GUID as the clustered index of the table. Because new GUIDs are effectively random, every INSERT lands at an arbitrary point in the index, causing page splits and potentially moving many data pages. This is unlike an IDENTITY INT clustered index, where rows are stored sequentially, giving the best insert performance.
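To make the second point concrete, this is a minimal sketch of the problematic design (table and column names are illustrative, not from any benchmark in this post). In SQL Server a PRIMARY KEY is clustered by default, so a table like this fragments under random NEWID() values:

```sql
-- Illustrative anti-pattern: the GUID primary key defaults to being
-- the clustered index, so random client-generated GUIDs cause page
-- splits on virtually every INSERT.
CREATE TABLE dbo.CustomerOrder
(
    Id UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED,  -- random key order
    PlacedOn DATETIME NOT NULL
);
```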

Benchmarks

This old article provides a benchmark of the cost of the writes. Considering very little has changed in the basic storage engine of SQL Server, this benchmark is still relevant - if not precise. According to the results, the cost of writing GUID primary keys into a table containing 1,000,000 records is 10x the cost of writing INT primary keys, and the gap widens sharply as the table grows.

One of the solutions presented in the article is to have a semi-sequential GUID generated in the database. In fact, since SQL Server 2005 there has been an option to do exactly that: NEWSEQUENTIALID(). This brings the overhead down to roughly a linear 10% on both reads and writes.
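A sketch of that approach, using the same illustrative table names as above. Note that NEWSEQUENTIALID() can only be used in a column DEFAULT, which is precisely why the ID cannot be known by the application before the row is inserted:

```sql
-- Database-generated sequential GUIDs (SQL Server 2005+).
-- Inserts stay sequential, but the application only learns the ID
-- after the INSERT (e.g. via an OUTPUT clause).
CREATE TABLE dbo.CustomerOrder
(
    Id UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_CustomerOrder_Id DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_CustomerOrder PRIMARY KEY CLUSTERED,
    PlacedOn DATETIME NOT NULL
);
```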

The problem with these solutions is that the GUID is still generated in the database, so they do not solve the problem of knowing the ID before storing the entity.

My solution and benchmark

As discussed, the problem is generally to do with the non-sequential nature of GUIDs. But who said the primary key has to be the clustered index?!

Basically my solution is to:

  1. Add an IDENTITY INT column (I called it OrderId - nothing to do with customer orders) and set it as the clustered index. If you do not want an additional column, you could instead cluster on a DATETIME column storing a timestamp, which is usually useful anyway.
  2. Add a UNIQUEIDENTIFIER column (GUID data type in SQL Server) as the main ID. Set that to be the primary key.
  3. Add a non-clustered index for the GUID ID
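The three steps above can be sketched in T-SQL as follows (table and column names are illustrative). Declaring the primary key as NONCLUSTERED covers steps 2 and 3 in one go, since SQL Server backs the key with a non-clustered index automatically:

```sql
-- Sequential storage via an IDENTITY clustering key, with a
-- client-generated GUID as the (non-clustered) primary key.
CREATE TABLE dbo.CustomerOrder
(
    OrderId  INT IDENTITY(1,1) NOT NULL,   -- step 1: clustering key only
    Id       UNIQUEIDENTIFIER NOT NULL,    -- step 2: app-generated GUID
    PlacedOn DATETIME NOT NULL,
    CONSTRAINT PK_CustomerOrder
        PRIMARY KEY NONCLUSTERED (Id)      -- steps 2 & 3: PK backed by a
);                                         -- non-clustered index

CREATE UNIQUE CLUSTERED INDEX IX_CustomerOrder_OrderId
    ON dbo.CustomerOrder (OrderId);        -- step 1: sequential page layout
```

With this layout, inserts append to the end of the clustered index regardless of the GUID values the application generates.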

This keeps the data storage sequential while benefiting from system-generated rather than database-generated IDs. There are two drawbacks: 1) storing an additional column; 2) an additional non-clustered index, which takes some space and makes reads and writes slightly slower. Bear in mind, this is recommended only if you use the table transactionally - storing and retrieving by key - and not for reporting. In most cases this is what you would do anyway, especially if you are implementing CQS or CQRS.

* Using an IDENTITY INT as the clustered index

As can be seen, using this technique, performance of the GUID primary key is close to the performance of the INT primary key (roughly 10% overhead).

Conclusion

The performance of a GUID primary key is acceptable, adding a mere 10% overhead, if we use another INT or DATETIME timestamp column as the clustered index.


2 comments:

  1. Great article. I hear this argument a lot from developers and you've done a great job of highlighting the issues. For the devs considering generating sequential GUIDs in your application layer (Hibernate etc.), please ensure the sequence is compatible with how your RDBMS stores them internally. Remember that SQL Server 2012 now supports sequences, which might help.

    1. Agreed - developers need to be aware that byte order for GUIDs matters: make sure the app layer generates GUIDs that sort in the same order as the data layer, or else you'll need a translation to ensure sequential writes.

      Be careful about advising DATETIME as a clustered index. Ideally a clustered index should be unique. Since we can't guarantee the uniqueness of temporal columns in OLTP applications, we'd have to create a non-unique clustered index. When you have a non-unique clustered index, SQL Server adds a uniqueifier value (an INT) as a hidden column to make each row unique. You've now got a 2-column clustering key (DATETIME + INT), and SQL Server stores a copy of that clustering key with every non-clustered index row. In short - you've added 12 bytes of overhead where a 4-byte INT would have sufficed.

      Ultimately - when working with anything that puts data on disk, you need to be aware of how you're storing data on disk. That wasted space adds up long term and can be the difference between adequate performance and amazing performance.
