{"id":926,"date":"2021-01-26T05:28:21","date_gmt":"2021-01-26T05:28:21","guid":{"rendered":"https:\/\/showmethedata.blog\/?p=926"},"modified":"2021-05-12T12:01:47","modified_gmt":"2021-05-12T12:01:47","slug":"generating-unique-keys-in-bigquery","status":"publish","type":"post","link":"https:\/\/showmethedata.blog\/generating-unique-keys-in-bigquery","title":{"rendered":"Generating Unique Keys In BigQuery"},"content":{"rendered":"\n

If you\u2019ve ever tried to build an enterprise data warehouse using BigQuery, you\u2019ll quickly realize that the auto-incrementing primary keys you were so fond of in operational databases are not a thing in BigQuery. <\/p>\n\n\n\n

This is by design. It’s not the job of your data warehouse to ensure referential integrity as your operational databases do, but rather to store and process tons of historical data.<\/p>\n\n\n\n

However, this shift in perspective does not make the problem of finding a way to represent unique entities go away. We need to generate unique keys in the data warehouse to identify whether some entity already exists in the warehouse.<\/p>\n\n\n\n

What are your options?<\/p>\n\n\n\n

Generating Surrogate Keys<\/h3>\n\n\n\n

Keys generated<\/em> to uniquely identify an entity or row in a database are called surrogate <\/strong>or technical<\/strong> keys, and there is more than one way to make them.<\/p>\n\n\n\n

Option 1\u200a\u2014\u200aReplicate Auto-Incrementing Integers<\/h4>\n\n\n\n

Your first instinct is to try to replicate the incrementing keys by using the ROW_NUMBER()<\/code> aggregation function<\/p>\n\n\n\n

SELECT \n  ROW_NUMBER() OVER() AS SurrogateKey,\n  *\nFROM `mytable`<\/code><\/pre>\n\n\n\n

While it\u2019s nice to have an ever-increasing numeric identifier that encodes information about the size and distribution of your records, it comes at a cost.<\/p>\n\n\n\n

To implement ROW_NUMBER()<\/code>, BigQuery needs to sort values at the root node of the execution tree<\/a>, which is limited by the amount of memory in one execution node.<\/p>\n\n\n\n

Option 2\u2014 Generate a UUID<\/h4>\n\n\n\n

A better alternative might be to use a Universally Unique Identifier (UUID) by using the GENERATE_UUID()<\/code> function.<\/p>\n\n\n\n

SELECT 
GENERATE_UUID() AS SurrogateKey,
*
FROM `mytable`<\/code><\/pre>\n\n\n\n

This option will return 32 hexadecimal digits in 5 groups e.g. <\/p>\n\n\n\n

2e8815a9\u201346fc-48fe-a7a8-cc531da385b6<\/pre>\n\n\n\n

Here, you\u2019re basically guaranteed to get a unique ID, but the long string means slower joins across big datasets compared to the smaller integer keys.<\/p>\n\n\n\n

Option 3\u2014 Generate a Hash Key (Recommended)<\/h4>\n\n\n\n

My favourite option is to use a hash key to represent unique entities. The idea is to concatenate multiple columns that may uniquely identify a record (you can even choose all columns if you wish) into a string and running a cryptographic hash function on it. <\/p>\n\n\n\n

Cryptographic hash functions have 2 extremely useful properties, which in the context of surrogate key generation, have little to do with security:<\/p>\n\n\n\n

  1. They map input of any length into a fixed-length output<\/li>
  2. Similar inputs produce vastly different outputs with a low chance of collision<\/li><\/ol>\n\n\n\n

    In Bigquery, you have a few functions to choose from: <\/p>\n\n\n\n

    <\/div>\n\n\n\n

    MD5<\/h4>\n\n\n\n

    Returns 16 bytes using the MD5 algorithm<\/a><\/p>\n\n\n\n

    EXPRESSION:<\/strong>\nMD5('CAT')\n\nOUTPUT (BASE64-ENCODED BYTES AS A STRING):<\/strong>\n'wBrhpfEi8lzlZ1+GAotTag=='<\/code><\/pre>\n\n\n\n
    <\/div>\n\n\n\n

    SHA1\u200a<\/strong><\/h4>\n\n\n\n

    \u200a<\/strong>Returns 20 bytes using the SHA1 algorithm<\/a>.<\/p>\n\n\n\n

    EXPRESSION:<\/strong>\nSHA1('CAT')\n\nOUTPUT (BASE64-ENCODED BYTES AS A STRING):<\/strong> 'z5t3XCxERSAXjTDCZ0QAZsbv9ug='<\/code><\/pre>\n\n\n\n
    <\/div>\n\n\n\n

    SHA256\u200a<\/strong><\/h4>\n\n\n\n

    \u200a<\/strong>Returns 32 bytes using the SHA256 algorithm<\/a>.<\/p>\n\n\n\n

    EXPRESSION:<\/strong>\nSHA_256('CAT')\n\nOUTPUT (BASE64-ENCODED BYTES AS A STRING):<\/strong>\n'FbiaVpR0JAphb5qU3QRbJxHURd3pVbYr9Ljyoq+vD2s='<\/code><\/pre>\n\n\n\n
    <\/div>\n\n\n\n

    SHA512\u200a<\/strong><\/h4>\n\n\n\n

    Returns 64 bytes using the SHA512 algorithm<\/a>.<\/p>\n\n\n\n

    EXPRESSION:<\/strong>\nSHA_512('CAT')\n\nOUTPUT (BASE64-ENCODED BYTES AS A STRING):<\/strong>\n'AEMYdVzb+Pm7YdWZ0mgWotIaKcdPY82gaiNw0ZD+GivBjBWzSH5rRfbLH2ynG+kH1vzjjoZAkZ09qk6CYLQujQ=='<\/code><\/pre>\n\n\n\n
    <\/div>\n\n\n\n

    FARM_FINGERPRINT\u200a<\/strong><\/h4>\n\n\n\n

    Returns a unique number <\/strong>using the open-source farmhash library<\/a>. <\/p>\n\n\n\n

    EXPRESSION:<\/strong> FARM_FINGERPRINT('CAT')\nOUTPUT:<\/strong> -1069775538560612551<\/code><\/pre>\n\n\n\n
    <\/div>\n\n\n\n

    What Key Should I Use?<\/h3>\n\n\n\n

    Although UUIDs and incremental keys give you unique keys, hash keys have some properties that make them shine in a data warehouse context.<\/p>\n\n\n\n