{"id":944,"date":"2021-01-29T05:36:19","date_gmt":"2021-01-29T05:36:19","guid":{"rendered":"https:\/\/showmethedata.blog\/?p=944"},"modified":"2021-05-12T12:01:40","modified_gmt":"2021-05-12T12:01:40","slug":"how-do-column-types-affect-join-speeds-in-data-warehouses","status":"publish","type":"post","link":"https:\/\/showmethedata.blog\/how-do-column-types-affect-join-speeds-in-data-warehouses","title":{"rendered":"How Do Column Types Affect Join Speeds In Data Warehouses?"},"content":{"rendered":"\n

When I first started building the Data Vault at Georgian<\/a>, I couldn’t decide what column data type to use as my tables’ primary key. <\/p>\n\n\n\n

I had heard that integer joins vastly outperform string joins, and was worried about degrading join performance as our data grew.<\/p>\n\n\n\n

In the SQL databases of the operational world, this decision is pretty much made for you by giving you auto-incrementing int<\/strong> primary keys out of the box.<\/p>\n\n\n\n

In the Data Warehousing world, however, whether you’re building a Kimball or Data Vault or something else, you need to make this choice explicitly.<\/p>\n\n\n\n

You can generate an integer<\/strong>, a UUID string<\/strong>, or hash<\/strong> your columns into a single column, which comes with many benefits<\/a>. As if that wasn’t complex enough, your hashed keys can be stored as strings<\/strong> or bytes<\/strong>, and each algorithm’s output may vary in length.<\/p>\n\n\n\n

This brings up a question:<\/p>\n\n\n\n

How does the column type of keys affect the speed of joins in Data Warehouses?<\/p><\/blockquote><\/figure>\n\n\n\n

After some digging, I found some benchmarks for transactional databases, but that’s not what I was looking for. Logically speaking, integers<\/strong> must<\/em> be faster than strings and byte-strings because there are generally fewer bytes to scan. But… by how much!? <\/p>\n\n\n\n

Knowing the answer seemed very important because, in a data warehouse, a bad choice can get multiplied a billion-fold.<\/p>\n\n\n\n

I finally buckled under the pressure of curiosity and decided to run a benchmark on BigQuery to answer this question for myself.<\/p>\n\n\n\n

Experiment Design<\/h2>\n\n\n\n

I decided to generate 5 million rows of random numbers and test joining them (without cache) on the following types of keys:<\/p>\n\n\n\n