If you’ve ever tried to build an enterprise data warehouse using BigQuery, you’ll quickly realize that the auto-incrementing primary keys you were so fond of in operational databases are not a thing in BigQuery.
This is by design. It’s not the job of your data warehouse to ensure referential integrity as your operational databases do, but rather to store and process tons of historical data.
However, this shift in perspective does not make the problem of finding a way to represent unique entities go away. We need to generate unique keys in the data warehouse to identify whether some entity already exists in the warehouse.
What are your options?
Keys generated to uniquely identify an entity or row in a database are called surrogate or technical keys, and there is more than one way to make them.
Your first instinct might be to replicate the incrementing keys by using the ROW_NUMBER() window function:
SELECT
ROW_NUMBER() OVER() AS SurrogateKey,
*
FROM `mytable`
While it’s nice to have an ever-increasing numeric identifier that encodes information about the size and distribution of your records, it comes at a cost.
To implement ROW_NUMBER(), BigQuery needs to sort values at the root node of the execution tree, which is limited by the amount of memory in a single execution node.
A better alternative might be to use a Universally Unique Identifier (UUID) via the GENERATE_UUID() function.
SELECT
GENERATE_UUID() AS SurrogateKey,
*
FROM `mytable`
This option returns 32 hexadecimal digits in 5 groups, e.g. 2e8815a9-46fc-48fe-a7a8-cc531da385b6
Here, you’re basically guaranteed to get a unique ID, but the long string means slower joins across big datasets compared to the smaller integer keys.
My favourite option is to use a hash key to represent unique entities. The idea is to concatenate multiple columns that may uniquely identify a record (you can even choose all columns if you wish) into a string and then run a cryptographic hash function on it.
Cryptographic hash functions have two extremely useful properties which, in the context of surrogate key generation, have little to do with security: they are deterministic (the same input always produces the same output), and their outputs are spread so uniformly across the output space that two different inputs virtually never produce the same hash.
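As a minimal illustration outside BigQuery (the column values below are hypothetical), a few lines of Python show both properties in action:

```python
import hashlib

def surrogate_key(*cols, sep="|"):
    # Join the identifying columns with a delimiter so that
    # ("ab", "c") and ("a", "bc") don't concatenate to the same string.
    raw = sep.join(str(c) for c in cols)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# Deterministic: the same business columns always produce the same key,
# so re-loading the same record can never mint a second identity for it.
key1 = surrogate_key("alice@example.com", "2021-01-01")
key2 = surrogate_key("alice@example.com", "2021-01-01")
assert key1 == key2

# Uniform: even a one-character change yields a completely different key.
key3 = surrogate_key("alice@example.org", "2021-01-01")
assert key1 != key3
```

The delimiter matters: without it, distinct column combinations can collapse into the same concatenated string and therefore the same key.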
In BigQuery, you have a few functions to choose from:
Returns 16 bytes using the MD5 algorithm
EXPRESSION: MD5('CAT')
OUTPUT (BASE64-ENCODED BYTES AS A STRING): 'wBrhpfEi8lzlZ1+GAotTag=='
Returns 20 bytes using the SHA1 algorithm.
EXPRESSION: SHA1('CAT')
OUTPUT (BASE64-ENCODED BYTES AS A STRING): 'z5t3XCxERSAXjTDCZ0QAZsbv9ug='
Returns 32 bytes using the SHA256 algorithm.
EXPRESSION: SHA256('CAT')
OUTPUT (BASE64-ENCODED BYTES AS A STRING): 'FbiaVpR0JAphb5qU3QRbJxHURd3pVbYr9Ljyoq+vD2s='
Returns 64 bytes using the SHA512 algorithm.
EXPRESSION: SHA512('CAT')
OUTPUT (BASE64-ENCODED BYTES AS A STRING): 'AEMYdVzb+Pm7YdWZ0mgWotIaKcdPY82gaiNw0ZD+GivBjBWzSH5rRfbLH2ynG+kH1vzjjoZAkZ09qk6CYLQujQ=='
Returns a signed 64-bit integer (INT64) using the open-source FarmHash library.
EXPRESSION: FARM_FINGERPRINT('CAT')
OUTPUT: -1069775538560612551
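Because these are standard algorithms, the byte lengths above can be sanity-checked outside BigQuery; for example, Python's hashlib computes the same digests, here base64-encoded the way the outputs above are:

```python
import base64
import hashlib

for name, algo in [("MD5", hashlib.md5), ("SHA1", hashlib.sha1),
                   ("SHA256", hashlib.sha256), ("SHA512", hashlib.sha512)]:
    digest = algo(b"CAT").digest()                       # raw digest bytes
    encoded = base64.b64encode(digest).decode("ascii")   # base64, as shown above
    print(f"{name}: {len(digest)} bytes -> {encoded}")
```

This cross-language reproducibility is exactly what makes hash keys attractive for hybrid systems: any stack with an MD5 implementation can generate the same key from the same input.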
Although UUIDs and incremental keys give you unique keys, hash keys have some properties that make them shine in a data warehouse context.
So, please… Do yourself a favour and use hash keys.
When choosing a hash algorithm, there are some important considerations to keep in mind:
1. MD5 is faster to generate than SHA1, followed by SHA256 and SHA512.
2. Shorter outputs (like MD5's 16 bytes) mean fewer bytes to process, which leads to faster join performance.
3. FARM_FINGERPRINT returns an INT64, which is faster to join on than BYTES or STRING types.
With that in mind, if you really care about join performance and don't mind staying in the BigQuery world, use the FARM_FINGERPRINT function. However, if you build hybrid systems and want to generate keys across system boundaries, you'll have a tough time generating farm fingerprints on every system.
MD5 is a nice balance between speed and compatibility with pretty much any technology, and it is my personal preference.
Thanks to the pigeonhole principle, since we’re squeezing a huge range of inputs into a small range of outputs (like 16 bytes, in the case of MD5), there is a small chance of 2 different inputs resolving to the same hash key.
What are the chances of that happening?
MD5 returns a 128-bit output, which means the probability of any two specific hashes colliding is 1/2¹²⁸. But since we're storing all of the hashes, the birthday paradox takes effect, and a collision becomes likely after roughly 2⁶⁴ hashes.
In other words, on average, you'd have to generate roughly 580 million records a second for the next 1,000 years before getting a collision.
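The birthday-bound arithmetic is easy to verify: divide the roughly 2⁶⁴ hashes needed before a collision becomes likely by the number of seconds in 1,000 years:

```python
# A collision becomes likely after roughly sqrt(2**128) = 2**64 MD5 hashes.
hashes_for_likely_collision = 2 ** 64

# Seconds in 1,000 years (using 365.25 days per year).
seconds_in_1000_years = 1000 * 365.25 * 24 * 60 * 60

rate = hashes_for_likely_collision / seconds_in_1000_years
print(f"{rate:,.0f} hashes per second")  # on the order of hundreds of millions
```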
Should you still test for collisions and have a fallback strategy? Yes.
Should it keep you up at night? No.
This post looked at the conceptual angle behind the choice of key types in data systems.
We explored the pros and cons of each alternative and showed how a simple MD5 hash can generate surrogate keys and meet many additional desirable requirements of modern data systems.
If you’re interested in the impact of the key types on join performance in data warehouses, I’ve written about that here.