If you’ve ever tried to build an enterprise data warehouse using BigQuery, you’ll quickly realize that the auto-incrementing primary keys you were so fond of in operational databases are not a thing in BigQuery.
This is by design. It’s not the job of your data warehouse to ensure referential integrity as your operational databases do, but rather to store and process tons of historical data.
However, this shift in perspective does not make the problem of finding a way to represent unique entities go away. We need to generate unique keys in the data warehouse to identify whether some entity already exists in the warehouse.
What are your options?
Keys generated to uniquely identify an entity or row in a database are called surrogate or technical keys, and there is more than one way to make them.
Your first instinct might be to replicate the incrementing keys by using the ROW_NUMBER() window function:
SELECT
ROW_NUMBER() OVER() AS SurrogateKey,
*
FROM `mytable`
While it’s nice to have an ever-increasing numeric identifier that encodes information about the size and distribution of your records, it comes at a cost.
To implement ROW_NUMBER(), BigQuery needs to sort values at the root node of the execution tree, which is limited by the amount of memory in a single execution node.
A better alternative might be to use a Universally Unique Identifier (UUID) via the GENERATE_UUID() function.
SELECT
GENERATE_UUID() AS SurrogateKey,
*
FROM `mytable`
This option returns 32 hexadecimal digits in 5 groups, e.g. 2e8815a9-46fc-48fe-a7a8-cc531da385b6
Here, you’re basically guaranteed to get a unique ID, but the long string means slower joins across big datasets compared to the smaller integer keys.
My favourite option is to use a hash key to represent unique entities. The idea is to concatenate multiple columns that may uniquely identify a record (you can even choose all columns if you wish) into a string and then run a cryptographic hash function on it.
Cryptographic hash functions have two extremely useful properties which, in the context of surrogate key generation, have little to do with security: they are deterministic (the same input always produces the same output), and their outputs are spread so uniformly across the output space that two different inputs virtually never produce the same hash.
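As a minimal illustration outside BigQuery (the column values below are hypothetical), a few lines of Python show both properties in action:

```python
import hashlib

def surrogate_key(*cols, sep="|"):
    # Join the identifying columns with a delimiter so that
    # ("ab", "c") and ("a", "bc") don't concatenate to the same string.
    raw = sep.join(str(c) for c in cols)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# Deterministic: the same business columns always produce the same key,
# so re-loading the same record can never mint a second identity for it.
key1 = surrogate_key("alice@example.com", "2021-01-01")
key2 = surrogate_key("alice@example.com", "2021-01-01")
assert key1 == key2

# Uniform: even a one-character change yields a completely different key.
key3 = surrogate_key("alice@example.org", "2021-01-01")
assert key1 != key3
```

The delimiter matters: without it, distinct column combinations can collapse into the same concatenated string and therefore the same key.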
In BigQuery, you have a few functions to choose from:
Returns 16 bytes using the MD5 algorithm
EXPRESSION: MD5('CAT')
OUTPUT (BASE64-ENCODED BYTES AS A STRING): 'wBrhpfEi8lzlZ1+GAotTag=='
Returns 20 bytes using the SHA1 algorithm.
EXPRESSION: SHA1('CAT')
OUTPUT (BASE64-ENCODED BYTES AS A STRING): 'z5t3XCxERSAXjTDCZ0QAZsbv9ug='
Returns 32 bytes using the SHA256 algorithm.
EXPRESSION: SHA256('CAT')
OUTPUT (BASE64-ENCODED BYTES AS A STRING): 'FbiaVpR0JAphb5qU3QRbJxHURd3pVbYr9Ljyoq+vD2s='
Returns 64 bytes using the SHA512 algorithm.
EXPRESSION: SHA512('CAT')
OUTPUT (BASE64-ENCODED BYTES AS A STRING): 'AEMYdVzb+Pm7YdWZ0mgWotIaKcdPY82gaiNw0ZD+GivBjBWzSH5rRfbLH2ynG+kH1vzjjoZAkZ09qk6CYLQujQ=='
Returns a signed 64-bit integer (INT64) using the open-source FarmHash library.
EXPRESSION: FARM_FINGERPRINT('CAT')
OUTPUT: -1069775538560612551
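Because these are standard algorithms, the byte lengths above can be sanity-checked outside BigQuery; for example, Python's hashlib computes the same digests, here base64-encoded the way the outputs above are:

```python
import base64
import hashlib

for name, algo in [("MD5", hashlib.md5), ("SHA1", hashlib.sha1),
                   ("SHA256", hashlib.sha256), ("SHA512", hashlib.sha512)]:
    digest = algo(b"CAT").digest()                       # raw digest bytes
    encoded = base64.b64encode(digest).decode("ascii")   # base64, as shown above
    print(f"{name}: {len(digest)} bytes -> {encoded}")
```

This cross-language reproducibility is exactly what makes hash keys attractive for hybrid systems: any stack with an MD5 implementation can generate the same key from the same input.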
Although UUIDs and incremental keys give you unique keys, hash keys have some properties that make them shine in a data warehouse context.
So, please… Do yourself a favour and use hash keys.
When choosing a hash algorithm, there are some important considerations to keep in mind:
1. MD5 is faster to generate than SHA1, followed by SHA256 and SHA512.
2. Shorter outputs (like MD5's 16 bytes) mean fewer bytes to process, which leads to faster join performance.
3. FARM_FINGERPRINT returns an INT64, which is faster to join on than BYTES or STRING types.
With that in mind, if you really care about join performance and don't mind staying in the BigQuery world, use the FARM_FINGERPRINT function. However, if you build hybrid systems and want to generate keys across system boundaries, you'll have a tough time generating farm fingerprints on every system.
MD5 is a nice balance between speed and compatibility with pretty much any technology, and it is my personal preference.
Thanks to the pigeonhole principle, since we’re squeezing a huge range of inputs into a small range of outputs (like 16 bytes, in the case of MD5), there is a small chance of 2 different inputs resolving to the same hash key.
What are the chances of that happening?
MD5 returns a 128-bit output, which means the probability of any two specific hashes colliding is 1/2¹²⁸. But since we're storing all of the hashes, the birthday paradox takes effect, and a collision becomes likely after roughly 2⁶⁴ hashes.
In other words, on average, you'd have to generate roughly 580 million records a second for the next 1,000 years before getting a collision.
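The birthday-bound arithmetic is easy to verify: divide the roughly 2⁶⁴ hashes needed before a collision becomes likely by the number of seconds in 1,000 years:

```python
# A collision becomes likely after roughly sqrt(2**128) = 2**64 MD5 hashes.
hashes_for_likely_collision = 2 ** 64

# Seconds in 1,000 years (using 365.25 days per year).
seconds_in_1000_years = 1000 * 365.25 * 24 * 60 * 60

rate = hashes_for_likely_collision / seconds_in_1000_years
print(f"{rate:,.0f} hashes per second")  # on the order of hundreds of millions
```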
Should you still test for collisions and have a fallback strategy? Yes.
Should it keep you up at night? No.
This post looked at the conceptual angle behind the choice of key types in data systems.
We explored the pros and cons of each alternative and showed how a simple MD5 hash can generate surrogate keys and meet many additional desirable requirements of modern data systems.
If you’re interested in the impact of the key types on join performance in data warehouses, I’ve written about that here.