{"id":944,"date":"2021-01-29T05:36:19","date_gmt":"2021-01-29T05:36:19","guid":{"rendered":"https:\/\/showmethedata.blog\/?p=944"},"modified":"2021-05-12T12:01:40","modified_gmt":"2021-05-12T12:01:40","slug":"how-do-column-types-affect-join-speeds-in-data-warehouses","status":"publish","type":"post","link":"https:\/\/showmethedata.blog\/how-do-column-types-affect-join-speeds-in-data-warehouses","title":{"rendered":"How Do Column Types Affect Join Speeds In Data Warehouses?"},"content":{"rendered":"\n
When I first started building the Data Vault at Georgian<\/a>, I couldn’t decide what column data type to use as my tables’ primary key. <\/p>\n\n\n\n I had heard that integer joins vastly outperform string joins, and I was worried about degrading join performance as our data grew.<\/p>\n\n\n\n In the SQL databases of the operational world, this decision is pretty much made for you: you get auto-incrementing int<\/strong> primary keys out of the box.<\/p>\n\n\n\n In the Data Warehousing world, however, whether you’re building a Kimball warehouse, a Data Vault, or something else, you need to make this choice explicitly.<\/p>\n\n\n\n You can generate an integer<\/strong>, a UUID string<\/strong>, or hash<\/strong> your columns into a single column, which comes with many benefits<\/a>. As if that weren’t complex enough, your hashed keys can be stored as strings<\/strong> or bytes<\/strong>, and each algorithm’s output may vary in length.<\/p>\n\n\n\n This brings up a question:<\/p>\n\n\n\n How does the column type of keys affect the speed of joins in Data Warehouses?<\/p><\/blockquote><\/figure>\n\n\n\n After some digging, I found some benchmarks for transactional databases, but that’s not what I was looking for. Logically speaking, integers<\/strong> must<\/em> be faster than strings and byte-strings because there are generally fewer bytes to scan. But… by how much!? 
<\/p>\n\n\n\n Knowing the answer seemed very important because, in a data warehouse, a bad choice can get multiplied a billion-fold.<\/p>\n\n\n\n I finally buckled under the pressure of curiosity and decided to run a benchmark on BigQuery to answer this question for myself.<\/p>\n\n\n\n I decided to generate 5 million rows of random numbers and test joining them (without cache) on the following types of keys: a plain integer<\/strong>, a FARM_FINGERPRINT integer<\/strong>, MD5 bytes<\/strong>, an MD5 hex string<\/strong>, SHA1 bytes<\/strong>, and a SHA1 hex string<\/strong>.<\/p>\n\n\n\n Here is the code I used to generate the tables I wanted to join:<\/p>\n\n\n\n And here is the code I used to run the join:<\/p>\n\n\n\n I ran the join query 30 times for each key type<\/em> so I could use a Z-test to compare the mean query times and get reliable confidence intervals.<\/p>\n\n\n\n Some definitions you may find helpful when interpreting the results:<\/p>\n\n\n\n Lower Bound 90% Confidence Interval: <\/strong>There’s a 5% probability that the true mean query time is below<\/strong> this number.<\/p>\n\n\n\n Upper Bound 90% Confidence Interval: <\/strong>There’s a 5% probability that the true mean query time is above<\/strong> this number.<\/p>\n\n\n\n Standard Deviation: <\/strong>A measure of how much the individual query times in our sample deviated (on either side) from the mean query time, on average.<\/p>\n\n\n\n Standard Error of the Estimate of the Mean: <\/strong>An estimate of how much our sample’s mean query time deviates from the true mean query time.<\/p>\n\n\n\n You may also care to have a comparative view of the above data. To keep things simple, I will only compare the difference in mean query times and ignore the confidence intervals of the differences (which I’ve made available in the Excel download below).<\/p>\n\n\n\n Here is everything I took away from this experiment.<\/p>\n\n\n\n I’ve made my entire experiment available for you to download in an Excel sheet. I’ve made it dead simple to use. 
Feel free to add your own data to it and experiment on the data warehouse of your choice!<\/p>\n\n\n\nExperiment Design<\/h2>\n\n\n\n
\/* GENERATE_ARRAY has a limit of 1M rows\n so I had to union a bunch of them together *\/\n\nWITH\n keys_1 AS (SELECT * FROM UNNEST(GENERATE_ARRAY(1,1000000)) AS key),\n keys_2 AS (SELECT * FROM UNNEST(GENERATE_ARRAY(1000001,2000000)) AS key),\n keys_3 AS (SELECT * FROM UNNEST(GENERATE_ARRAY(2000001,3000000)) AS key),\n keys_4 AS (SELECT * FROM UNNEST(GENERATE_ARRAY(3000001,4000000)) AS key),\n keys_5 AS (SELECT * FROM UNNEST(GENERATE_ARRAY(4000001,5000000)) AS key),\n keys_union AS (\n SELECT key FROM keys_1 UNION ALL\n SELECT key FROM keys_2 UNION ALL\n SELECT key FROM keys_3 UNION ALL\n SELECT key FROM keys_4 UNION ALL\n SELECT key FROM keys_5\n ),\n keys_hashed AS (\n SELECT\n key,\n MD5(CAST(key AS STRING)) AS key_md5_bytes,\n TO_HEX(MD5(CAST(key AS STRING))) AS key_md5_str,\n FARM_FINGERPRINT(CAST(key AS STRING)) AS key_farm,\n SHA1(CAST(key AS STRING)) AS key_sha_bytes,\n TO_HEX(SHA1(CAST(key AS STRING))) AS key_sha_str\n FROM keys_union\n )\n SELECT *, RAND() AS val FROM keys_hashed<\/code><\/pre>\n\n\n\n
SELECT\n t1.val, t2.val\nFROM bq_benchmark.t1\nJOIN bq_benchmark.t2\nUSING(<key column here>); <\/code><\/pre>\n\n\n\n
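Before looking at the results, it may help to see how the summary statistics below are computed. Here is a minimal Python sketch of that calculation; the 30 timings are simulated stand-ins (not the original measurements), and it assumes a 90% confidence interval built from the normal z-value of 1.645:

```python
import math
import random

# Hypothetical sample: 30 query wall-clock times (seconds) for one key type.
# In the real experiment these came from timed BigQuery runs with caching off.
random.seed(7)
times = [4.05 + random.gauss(0, 0.42) for _ in range(30)]

n = len(times)
mean = sum(times) / n
# Sample standard deviation (with Bessel's correction).
std_dev = math.sqrt(sum((t - mean) ** 2 for t in times) / (n - 1))
# Standard error of the estimate of the mean.
std_err = std_dev / math.sqrt(n)
# 90% CI: mean +/- z * SE, where z = 1.645 leaves 5% in each tail.
z = 1.645
lower, upper = mean - z * std_err, mean + z * std_err

print(f"mean={mean:.2f}s  std_dev={std_dev:.2f}  std_err={std_err:.3f}")
print(f"90% CI: ({lower:.2f}s, {upper:.2f}s)")
```

Each row of the results table reports exactly these five quantities for one key type.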
Experiment Results<\/h2>\n\n\n\n
<\/th> MB
Processed<\/th>Lower Bound
90% CI (s)<\/th>Mean Query
Time (s)<\/th>Upper Bound
90% CI (s)<\/th>Std. Dev<\/th> Std. Error
of Estimate
of Mean<\/th><\/tr><\/thead>Int<\/strong><\/td> 153<\/td> 3.92<\/td> 4.05<\/td> 4.18<\/td> 0.42<\/td> 0.078<\/td><\/tr> Farm Int<\/strong><\/td> 153<\/td> 4.19<\/td> 4.30<\/td> 4.40<\/td> 0.34<\/td> 0.06<\/td><\/tr> MD5 Bytes<\/strong><\/td> 248<\/td> 4.40<\/td> 4.57<\/td> 4.74<\/td> 0.56<\/td> 0.10<\/td><\/tr> MD5 String<\/strong><\/td> 400<\/td> 5.90<\/td> 6.09<\/td> 6.28<\/td> 0.63<\/td> 0.12<\/td><\/tr> SHA1 Bytes<\/strong><\/td> 286<\/td> 4.61<\/td> 4.77<\/td> 4.94<\/td> 0.55<\/td> 0.10<\/td><\/tr> SHA1 String<\/strong><\/td> 477<\/td> 5.50<\/td> 5.65<\/td> 5.80<\/td> 0.50<\/td> 0.09<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n Base Type<\/th> Compared Type<\/th> Absolute Diff<\/th> Relative
(1+ is Slower<\/em>)<\/th>Confidence
in Difference<\/th><\/tr><\/thead>Int<\/strong><\/td> Farm<\/strong><\/td> 0.24 s<\/td> 1.06\u00d7<\/td> 99.27%<\/td><\/tr> <\/td> MD5 Bytes<\/strong><\/td> 0.52 s<\/td> 1.13\u00d7<\/td> 100%<\/td><\/tr> <\/td> MD5 String<\/strong><\/td> 2.03 s<\/td> 1.50\u00d7<\/td> 100%<\/td><\/tr> <\/td> SHA1 Bytes<\/strong><\/td> 0.72 s<\/td> 1.18\u00d7<\/td> 100%<\/td><\/tr> <\/td> SHA1 String<\/strong><\/td> 1.59 s<\/td> 1.39\u00d7<\/td> 100%<\/td><\/tr> Farm<\/strong><\/td> MD5 Bytes<\/strong><\/td> 0.28 s<\/td> 1.06\u00d7<\/td> 98.96%<\/td><\/tr> <\/td> MD5 String<\/strong><\/td> 1.79 s<\/td> 1.42\u00d7<\/td> 100%<\/td><\/tr> <\/td> SHA1 Bytes<\/strong><\/td> 0.48 s<\/td> 1.11\u00d7<\/td> 100%<\/td><\/tr> <\/td> SHA1 String<\/strong><\/td> 1.35 s<\/td> 1.31\u00d7<\/td> 100%<\/td><\/tr> MD5 Bytes<\/strong><\/td> MD5 String<\/strong><\/td> 1.51 s<\/td> 1.33\u00d7<\/td> 100%<\/td><\/tr> <\/td> SHA1 Bytes<\/strong><\/td> 0.20 s<\/td> 1.03\u00d7<\/td> 91.93%<\/td><\/tr> <\/td> SHA1 String<\/strong><\/td> 1.07 s<\/td> 1.23\u00d7<\/td> 100%<\/td><\/tr> SHA1 Bytes<\/strong><\/td> MD5 String<\/strong><\/td> 1.31 s<\/td> 1.27\u00d7<\/td> 100%<\/td><\/tr> <\/td> SHA1 String<\/strong><\/td> 0.87 s<\/td> 1.18\u00d7<\/td> 100%<\/td><\/tr> SHA1 String<\/strong><\/td> MD5 String<\/strong><\/td> 0.44 s<\/td> 1.08\u00d7<\/td> 99.83%<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n Conclusion<\/h2>\n\n\n\n
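The Confidence in Difference column in the comparison table above can be reproduced, at least approximately, with a two-sample Z-test on the means and standard errors from the results table. The sketch below is my own reconstruction (the one-sided formulation and the `z_test_confidence` helper are assumptions, not the post's exact method), and small discrepancies from the published percentages come from rounding of the reported means:

```python
import math

def z_test_confidence(mean1, se1, mean2, se2):
    """Hypothetical helper: one-sided confidence that mean2 exceeds mean1.

    Combines the two standard errors, computes a z-score, and maps it
    through the standard normal CDF (via the error function).
    """
    z = (mean2 - mean1) / math.sqrt(se1 ** 2 + se2 ** 2)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Int vs Farm, using the means and standard errors from the results table.
conf = z_test_confidence(4.05, 0.078, 4.30, 0.06)
print(f"{conf:.2%}")  # lands near the ~99% confidence reported for Int vs Farm
```

With identical means the function returns 50%, i.e. no evidence of a difference, which matches the intuition behind the table's near-100% entries for the large gaps.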
Click To Download The Experiment Sheet<\/a><\/h3>\n\n\n\n