{"id":926,"date":"2021-01-26T05:28:21","date_gmt":"2021-01-26T05:28:21","guid":{"rendered":"https:\/\/showmethedata.blog\/?p=926"},"modified":"2021-05-12T12:01:47","modified_gmt":"2021-05-12T12:01:47","slug":"generating-unique-keys-in-bigquery","status":"publish","type":"post","link":"https:\/\/showmethedata.blog\/generating-unique-keys-in-bigquery","title":{"rendered":"Generating Unique Keys In BigQuery"},"content":{"rendered":"\n<p>If you\u2019ve ever tried to build an enterprise data warehouse using BigQuery, you\u2019ll quickly realize that the auto-incrementing primary keys you were so fond of in operational databases are not a thing in BigQuery.&nbsp;<\/p>\n\n\n\n<p>This is by design. It&#8217;s not the job of your data warehouse to ensure referential integrity as your operational databases do, but rather to store and process tons of historical data.<\/p>\n\n\n\n<p>However, this shift in perspective does not make the problem of finding a way to represent unique entities go away.&nbsp;We need to generate unique keys in the data warehouse to identify whether some entity already exists in the warehouse.<\/p>\n\n\n\n<p>What are your options?<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Generating Surrogate Keys<\/h3>\n\n\n\n<p>Keys <em>generated<\/em> to uniquely identify an entity or row in a database are called <strong>surrogate <\/strong>or <strong>technical<\/strong> keys, and there is more than one way to make them.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Option 1\u200a\u2014\u200aReplicate Auto-Incrementing Integers<\/h4>\n\n\n\n<p>Your first instinct might be to replicate the incrementing keys with the <code>ROW_NUMBER()<\/code> window function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">SELECT \n  ROW_NUMBER() OVER() AS SurrogateKey,\n  *\nFROM `mytable`<\/code><\/pre>\n\n\n\n<p>While it\u2019s nice to have an ever-increasing numeric identifier that encodes information about the size and distribution of your records, it comes at a 
cost.<\/p>\n\n\n\n<p>To implement <code>ROW_NUMBER()<\/code>, <a href=\"https:\/\/cloud.google.com\/blog\/products\/data-analytics\/bigquery-and-surrogate-keys-practical-approach\" rel=\"noreferrer noopener\" target=\"_blank\">BigQuery needs to sort values at the root node of the execution tree<\/a>, so the sort is limited by the amount of memory in a single execution node and may fail on large tables.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Option 2\u200a\u2014\u200aGenerate a&nbsp;UUID<\/h4>\n\n\n\n<p>A better alternative might be a Universally Unique Identifier (UUID), generated with the <code>GENERATE_UUID()<\/code> function.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">SELECT \n  GENERATE_UUID() AS SurrogateKey,\n  *\nFROM `mytable`<\/code><\/pre>\n\n\n\n<p>This returns a string of 32 hexadecimal digits in 5 hyphen-separated groups, e.g.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">2e8815a9-46fc-48fe-a7a8-cc531da385b6<\/pre>\n\n\n\n<p>Here, you\u2019re basically guaranteed to get a unique ID, but the long string means slower joins across big datasets compared to the smaller integer keys.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Option 3\u200a\u2014\u200aGenerate a Hash Key (Recommended)<\/h4>\n\n\n\n<p>My favourite option is to use a hash key to represent unique entities. 
The idea is to concatenate multiple columns that may uniquely identify a record (you can even choose all columns if you wish) into a string and run a cryptographic hash function on it.<\/p>\n\n\n\n<p>Cryptographic hash functions have 2 extremely useful properties which, in the context of surrogate key generation, have little to do with security:<\/p>\n\n\n\n<ol><li>They map input of any length into a fixed-length output<\/li><li>Similar inputs produce vastly different outputs with a low chance of collision<\/li><\/ol>\n\n\n\n<p>In BigQuery, you have a few functions to choose from:<\/p>\n\n\n\n<div style=\"height:57px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">MD5<\/h4>\n\n\n\n<p>Returns 16 bytes using the <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/MD5\" target=\"_blank\">MD5 algorithm<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\"><strong>EXPRESSION:<\/strong>\nMD5('CAT')\n\n<strong>OUTPUT (BASE64-ENCODED BYTES AS A STRING):<\/strong>\n'wBrhpfEi8lzlZ1+GAotTag=='<\/code><\/pre>\n\n\n\n<div style=\"height:42px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>SHA1<\/strong><\/h4>\n\n\n\n<p>Returns 20 bytes using the <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/SHA-1\" target=\"_blank\">SHA1 algorithm<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\"><strong>EXPRESSION:<\/strong>\nSHA1('CAT')\n\n<strong>OUTPUT (BASE64-ENCODED BYTES AS A STRING):<\/strong>\n'z5t3XCxERSAXjTDCZ0QAZsbv9ug='<\/code><\/pre>\n\n\n\n<div style=\"height:42px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>SHA256<\/strong><\/h4>\n\n\n\n<p>Returns 32 bytes using the <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/SHA-2\" target=\"_blank\">SHA256 
algorithm<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\"><strong>EXPRESSION:<\/strong>\nSHA256('CAT')\n\n<strong>OUTPUT (BASE64-ENCODED BYTES AS A STRING):<\/strong>\n'FbiaVpR0JAphb5qU3QRbJxHURd3pVbYr9Ljyoq+vD2s='<\/code><\/pre>\n\n\n\n<div style=\"height:42px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>SHA512<\/strong><\/h4>\n\n\n\n<p>Returns 64 bytes using the <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/SHA-2\" target=\"_blank\">SHA512 algorithm<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\"><strong>EXPRESSION:<\/strong>\nSHA512('CAT')\n\n<strong>OUTPUT (BASE64-ENCODED BYTES AS A STRING):<\/strong>\n'AEMYdVzb+Pm7YdWZ0mgWotIaKcdPY82gaiNw0ZD+GivBjBWzSH5rRfbLH2ynG+kH1vzjjoZAkZ09qk6CYLQujQ=='<\/code><\/pre>\n\n\n\n<div style=\"height:42px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>FARM_FINGERPRINT<\/strong><\/h4>\n\n\n\n<p>Returns a 64-bit integer (<strong>INT64<\/strong>) fingerprint using the <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/google\/farmhash\" target=\"_blank\">open-source farmhash library<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\"><strong>EXPRESSION:<\/strong> FARM_FINGERPRINT('CAT')\n<strong>OUTPUT:<\/strong> -1069775538560612551<\/code><\/pre>\n\n\n\n<div style=\"height:42px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">What Key Should I&nbsp;Use?<\/h3>\n\n\n\n<p>Although UUIDs and incremental keys give you unique keys, hash keys have some properties that make them shine in a data warehouse context.<\/p>\n\n\n\n<ul><li>You can use the same key across multiple systems and technologies without doing lookups. 
All systems just use the same columns and hash them to get the key they need.<\/li><li>You can load multiple tables in parallel (no problem in BigQuery, but you must turn off referential integrity in other warehouses).<\/li><li>They are deterministic\u200a\u2014\u200aYou can re-load parts of the warehouse after wiping them out, and you\u2019d keep the same key. In IoT scenarios, you can even generate them at the source devices.<\/li><li>Flexible\u200a\u2014\u200aHash keys support multi-structured data such as JSON, images, files, etc.<\/li><\/ul>\n\n\n\n<p>So, please\u2026 Do yourself a favour and use hash keys.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Choosing a Hash Algorithm<\/h4>\n\n\n\n<p>When choosing a hash algorithm, there are some important considerations to keep in mind:<\/p>\n\n\n\n<ol><li><code>MD5<\/code> is faster to generate than <code>SHA1<\/code>, followed by <code>SHA256<\/code> and <code>SHA512<\/code>.<\/li><li>Shorter keys (like <code>MD5<\/code>) mean fewer bytes to process, which leads to faster join performance.<\/li><li><code>FARM_FINGERPRINT<\/code> returns an <code>INT64<\/code>, which is faster to join on than <code>BYTES<\/code> or <code>STRING<\/code> types.<\/li><\/ol>\n\n\n\n<p>With that in mind, if you really care about join performance and don\u2019t mind staying in the BigQuery world, use the <code>FARM_FINGERPRINT<\/code> function. 
However, if you build hybrid systems and want to generate keys across boundaries, you\u2019ll have a tough time generating farm fingerprints on every system.<\/p>\n\n\n\n<p><code>MD5<\/code> is a nice balance between speed and compatibility with pretty much any technology, and is my personal preference.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-style-default\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/showmethedata.blog\/wp-content\/uploads\/2021\/01\/feature-major-key-1024x576.jpg\" alt=\"\" class=\"wp-image-935\" srcset=\"https:\/\/showmethedata.blog\/wp-content\/uploads\/2021\/01\/feature-major-key-1024x576.jpg 1024w, https:\/\/showmethedata.blog\/wp-content\/uploads\/2021\/01\/feature-major-key-300x169.jpg 300w, https:\/\/showmethedata.blog\/wp-content\/uploads\/2021\/01\/feature-major-key-768x432.jpg 768w, https:\/\/showmethedata.blog\/wp-content\/uploads\/2021\/01\/feature-major-key.jpg 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>DJ Khaled vouching for <strong>MD5<\/strong> as the key to success<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">What About Hash Collisions?<\/h4>\n\n\n\n<p>By the <a rel=\"noreferrer noopener\" href=\"https:\/\/en.wikipedia.org\/wiki\/Pigeonhole_principle\" target=\"_blank\">pigeonhole principle<\/a>, because we\u2019re squeezing a huge range of inputs into a small range of outputs (16 bytes, in the case of MD5), there is a small chance of 2 different inputs resolving to the same hash key.<\/p>\n\n\n\n<p>What are the chances of that happening?<\/p>\n\n\n\n<p>MD5 returns a 128-bit output, which means the probability of any 2 given hashes colliding is <code><strong>1\/2\u00b9\u00b2\u2078<\/strong><\/code>. But since we\u2019re storing all of the hashes, the birthday paradox takes effect: a collision somewhere in the table only becomes likely once you have stored on the order of <code><strong>2\u2076\u2074<\/strong><\/code> hashes.<\/p>\n\n\n\n<p>In other words, on average, you\u2019d have to 
generate roughly <strong><a rel=\"noreferrer noopener\" class=\"rank-math-link\" href=\"https:\/\/www.google.com\/search?q=2%5E64%2F1000%2F%28seconds+per+year%29\" target=\"_blank\">585 million records a second for the next 1,000 years<\/a><\/strong> before getting a collision.<\/p>\n\n\n\n<p>Should you still test for collisions and have a fallback strategy? Yes.<\/p>\n\n\n\n<p>Should it keep you up at night? No.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>This post looked at the conceptual angle behind the choice of key types in data systems.<\/p>\n\n\n\n<p>We explored the pros and cons of each alternative and showed how a simple MD5 hash can generate surrogate keys and meet many additional desirable requirements of modern data systems.<\/p>\n\n\n\n<p>If you&#8217;re interested in the impact of the key types on join performance in data warehouses, <a href=\"https:\/\/showmethedata.blog\/how-do-column-types-affect-join-speeds-in-data-warehouses\" class=\"rank-math-link\">I&#8217;ve written about that here<\/a>.<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you\u2019ve ever tried to build an enterprise data warehouse using BigQuery, you\u2019ll quickly realize that the auto-incrementing primary keys you were so fond of in operational databases are not a thing in BigQuery.&nbsp; This is by design. 
It&#8217;s not the job of your data warehouse to ensure referential integrity as your operational databases do, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":935,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_eb_attr":"","footnotes":""},"categories":[14],"tags":[20,30],"_links":{"self":[{"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/posts\/926"}],"collection":[{"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/comments?post=926"}],"version-history":[{"count":0,"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/posts\/926\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/media\/935"}],"wp:attachment":[{"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/media?parent=926"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/categories?post=926"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/showmethedata.blog\/wp-json\/wp\/v2\/tags?post=926"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}