If you’ve ever tried to build an enterprise data warehouse using BigQuery, you’ll quickly realize that the auto-incrementing primary keys you were so fond of in operational databases are not a thing in BigQuery.
This is by design. It’s not the job of your data warehouse to ensure referential integrity as your operational databases do, but rather to store and process tons of historical data.
However, this shift in perspective does not make the problem of finding a way to represent unique entities go away. We need to generate unique keys in the data warehouse to identify whether some entity already exists in the warehouse.
What are your options?
Generating Surrogate Keys
Keys generated to uniquely identify an entity or row in a database are called surrogate or technical keys, and there is more than one way to make them.
Option 1 — Replicate Auto-Incrementing Integers
Your first instinct might be to replicate the incrementing keys by using the `ROW_NUMBER()` window function:

```sql
SELECT ROW_NUMBER() OVER() AS SurrogateKey, * FROM `mytable`
```
While it’s nice to have an ever-increasing numeric identifier that encodes information about the size and distribution of your records, it comes at a cost: to compute `ROW_NUMBER()` over an entire table, BigQuery needs to sort all the rows at the root node of the execution tree, which is limited by the amount of memory in one execution node. On large tables, this query can fail with a "Resources exceeded" error.
Option 2 — Generate a UUID
A better alternative might be to use a Universally Unique Identifier (UUID) by using the `GENERATE_UUID()` function:

```sql
SELECT GENERATE_UUID() AS SurrogateKey, * FROM `mytable`
```

This option returns a STRING of 32 hexadecimal digits in 5 hyphen-separated groups (in the 8-4-4-4-12 format).
Here, you’re basically guaranteed to get a unique ID, but the long string means slower joins across big datasets compared to the smaller integer keys.
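For a feel of what these identifiers look like, here is a small Python sketch using the standard library's `uuid` module, which generates the same kind of random (version 4) UUIDs that `GENERATE_UUID()` produces:

```python
import uuid

# A random (version 4) UUID, the same kind GENERATE_UUID() returns.
key = str(uuid.uuid4())
print(key)  # different on every call

# 32 hex digits in 5 hyphen-separated groups: 8-4-4-4-12 characters.
groups = [len(g) for g in key.split("-")]
print(groups)  # [8, 4, 4, 4, 12]
```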
Option 3 — Generate a Hash Key (Recommended)
My favourite option is to use a hash key to represent unique entities. The idea is to concatenate multiple columns that together uniquely identify a record (you can even choose all columns if you wish) into a single string and run a cryptographic hash function on it.
Cryptographic hash functions have 2 extremely useful properties, which in the context of surrogate key generation, have little to do with security:
- They map input of any length into a fixed-length output
- Similar inputs produce vastly different outputs with a low chance of collision
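The idea can be sketched in a few lines of Python; the column values and the `|` delimiter here are illustrative choices, not part of any standard (in BigQuery itself you would concatenate the columns and apply one of the hash functions below):

```python
import base64
import hashlib

def surrogate_key(*columns):
    """Concatenate the identifying columns and MD5-hash the result.

    The '|' delimiter is an arbitrary choice that keeps ('ab', 'c')
    and ('a', 'bc') from producing the same input; pick one delimiter
    and use it everywhere.
    """
    raw = "|".join(str(c) for c in columns)
    digest = hashlib.md5(raw.encode("utf-8")).digest()  # always 16 bytes
    return base64.b64encode(digest).decode("ascii")

# Deterministic: the same inputs always yield the same key.
key = surrogate_key("ACME Corp", "2023-01-15", "EUR")
assert key == surrogate_key("ACME Corp", "2023-01-15", "EUR")
```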
In BigQuery, you have a few functions to choose from (the byte outputs below are shown base64-encoded as strings):

- `MD5` returns 16 bytes using the MD5 algorithm.
  `MD5('CAT')` → `'wBrhpfEi8lzlZ1+GAotTag=='`
- `SHA1` returns 20 bytes using the SHA-1 algorithm.
  `SHA1('CAT')` → `'z5t3XCxERSAXjTDCZ0QAZsbv9ug='`
- `SHA256` returns 32 bytes using the SHA-256 algorithm.
  `SHA256('CAT')` → `'FbiaVpR0JAphb5qU3QRbJxHURd3pVbYr9Ljyoq+vD2s='`
- `SHA512` returns 64 bytes using the SHA-512 algorithm.
  `SHA512('CAT')` → `'AEMYdVzb+Pm7YdWZ0mgWotIaKcdPY82gaiNw0ZD+GivBjBWzSH5rRfbLH2ynG+kH1vzjjoZAkZ09qk6CYLQujQ=='`
- `FARM_FINGERPRINT` returns an INT64 using the open-source FarmHash library.
  `FARM_FINGERPRINT('CAT')` → `-1069775538560612551`
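The digest sizes above are properties of the algorithms themselves, so they hold in any language, which is part of the cross-system appeal. A quick check with Python's standard `hashlib` (note that `FARM_FINGERPRINT` has no standard-library equivalent, so it is omitted here):

```python
import hashlib

# Each algorithm produces a fixed-length digest, regardless of input size.
for name, expected_bytes in [("md5", 16), ("sha1", 20),
                             ("sha256", 32), ("sha512", 64)]:
    digest = hashlib.new(name, b"CAT").digest()
    print(f"{name}: {len(digest)} bytes")
    assert len(digest) == expected_bytes
```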
What Key Should I Use?
Although UUIDs and incremental keys give you unique keys, hash keys have some properties that make them shine in a data warehouse context.
- You can use the same key across multiple systems and technologies without doing lookups. All systems just use the same columns and hash them to get the key they need.
- You can load multiple tables in parallel (no problem in BigQuery, but you must turn off referential integrity in other warehouses)
- They are deterministic — You can re-load parts of the warehouse after wiping them out, and you’d keep the same key. In IoT scenarios, you can even generate them at the source devices.
- Flexible — Hash keys support multi-structured data such as JSON, Images, Files, etc.
So, please… Do yourself a favour and use hash keys.
Choosing a Hash Algorithm
When choosing a hash algorithm, there are some important considerations to keep in mind:
- `MD5` is faster to generate than `SHA1`, followed by `SHA256` and `SHA512`.
- Shorter keys (like `MD5`'s 16 bytes) mean fewer bytes to process, which leads to faster join performance.
- `FARM_FINGERPRINT` returns an `INT64`, which is faster to join on than the `BYTES` or `STRING` keys the other functions produce.
With that in mind, if you really care about join performance and don't mind staying in the BigQuery world, use the `FARM_FINGERPRINT` function. However, if you build hybrid systems and want to generate keys across system boundaries, you'll have a tough time generating farm fingerprints on every platform. `MD5` strikes a nice balance between speed and compatibility with pretty much any technology, and it is my personal preference.
What About Hash Collisions?
Thanks to the pigeonhole principle, since we’re squeezing a huge range of inputs into a small range of outputs (like 16 bytes, in the case of MD5), there is a small chance of 2 different inputs resolving to the same hash key.
What are the chances of that happening?
MD5 returns a 128-bit output, which means the probability of any two specific hashes colliding is 1/2¹²⁸. But since we're storing all of the hashes, the birthday paradox takes effect, and the probability of some collision among n keys grows to roughly n²/2¹²⁹.
In other words, you'd have to generate roughly 700 million records a second for the next 1,000 years before reaching even a 50% chance of a single collision.
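You can sanity-check these figures yourself with the standard birthday-bound approximations (p ≈ n²/2¹²⁹ for small p, and n ≈ 1.1774·2⁶⁴ for a 50% chance of a collision in a 128-bit space):

```python
BITS = 128  # MD5 output size in bits

def collision_probability(n):
    """Birthday-bound approximation: p ≈ n^2 / 2^(BITS + 1)."""
    return n * n / 2 ** (BITS + 1)

# Keys needed for a ~50% chance of at least one collision:
# n ≈ 1.1774 * sqrt(2^128) = 1.1774 * 2^64
n_half = 1.1774 * 2 ** 64
print(f"{n_half:.3g} keys for a 50% collision chance")

# Spread over 1,000 years, that rate is in the hundreds of millions per second.
seconds = 1000 * 365.25 * 24 * 3600
print(f"{n_half / seconds:.3g} keys per second for 1,000 years")

# Even a billion stored keys carries a vanishingly small risk.
print(collision_probability(1e9))  # on the order of 1e-21
```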
Should you still test for collisions and have a fallback strategy? Yes.
Should it keep you up at night? No.
This post looked at the conceptual angle behind the choice of key types in data systems.
We explored the pros and cons of each alternative and showed how a simple MD5 hash can generate surrogate keys and meet many additional desirable requirements of modern data systems.
If you’re interested in the impact of the key types on join performance in data warehouses, I’ve written about that here.