One ID to rule them all
At Global we deal with a lot of different content types across our various products, including News Articles, Playlists, Podcasts and Videos. These content types are split across many source systems, including third parties. Because of this split, we increasingly found it tricky to reliably identify the content type when items were fed into a single system such as a recommendation engine or analytics. It was impossible to determine what content we were dealing with from a single piece of information, namely the ID, and sometimes the accompanying type information wasn’t present. To solve this problem we developed a Universal ID system.
This post details how we went about creating this Universal ID, and things I might personally change if I were doing it again.
Requirements
We started with a set of simple requirements for the universal ID:
1) Unique Forever
The ID should point to one and only one piece of content, now and forever more. If the content is removed, the ID should never be reused, and it should be impossible for two different content sources to form the same ID.
2) URL Safe
Because our platforms exist on the web and IDs may end up in URLs for things like sharing, the ID format should be entirely URL safe, i.e. free of the troublesome characters used in some encoding formats, such as equals signs.
3) Short
Short IDs would mean we could pack more of them into less data when passing long lists around. We also wanted to keep IDs short so they could in theory be read out when sharing over a voice call.
Initial Plan
Given the requirements above we came up with an initial plan:
- Create a very short unique alphanumeric prefix for each content type we had
- Combine this with the current primary ID of the content in the source system
- Encode this optimally for a short length and URL safeness.
To avoid the combination of prefix and ID from one system clashing with the prefix and ID from another, we’d need to encode them in a way that removed the clash risk. To give an example of how a clash might happen, imagine the following two prefix and ID combinations:
Prefix: p1
ID: 920304
and
Prefix: p
ID: 1920304
If you were to combine those by simply appending the ID to the prefix, the resulting Universal IDs would be identical (p1920304 in both cases), breaking the first requirement. We didn’t want people to have to worry about avoiding certain prefixes, so some additional consideration would be needed when combining and encoding.
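To make the clash concrete, here is a small Python sketch of the naive approach. The function name naive_universal_id is made up purely for illustration.

```python
# Naive approach: glue the prefix and the source ID together as plain text.
# naive_universal_id is a hypothetical helper, not part of any real system.
def naive_universal_id(prefix: str, source_id: int) -> str:
    return f"{prefix}{source_id}"


first = naive_universal_id("p1", 920304)
second = naive_universal_id("p", 1920304)

# Two different (prefix, ID) pairs collapse into the same string.
print(first, second, first == second)  # p1920304 p1920304 True
```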
We also decided not to go with the fairly common Base64 encoding (or the URL-safe Base64URL version) because it generated fairly long encodings for short values, and contained symbols and confusable characters which might make for harder reading.
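As a rough illustration of the length problem, the snippet below Base64URL-encodes the plain-text combination from the example above using Python's standard library. The choice of input (the prefix and ID joined as text) is an assumption made for the example, not a description of what we would actually have encoded.

```python
import base64

# Hypothetical input: the prefix and ID from the earlier example, as plain text.
raw = b"p1920304"

# urlsafe_b64encode swaps "+" and "/" for "-" and "_" but keeps "=" padding.
encoded = base64.urlsafe_b64encode(raw).decode("ascii")

print(encoded, len(encoded))
```

Eight bytes of input become twelve characters of output, and in this case the result ends with an equals sign, which is exactly the kind of character the URL safe requirement was trying to avoid.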
The Solution
The solution ended up being more complex than you might think for a number of reasons.
Firstly, when looking through all the different sources of content we wanted to cover we found we had 3 different primary ID formats in use:
- Integer
- UUID
- An odd composite ID format made up of 2 integers
We’d need to create a method that could encode and decode all 3 ID formats, as we wanted to be able to reverse the encoding.
Secondly, UUIDv4 itself is quite a long ID format by default as it contains 128 bits of information (32 hexadecimal digits), so doing our best to meet the requirement of short IDs would need some clever encoding.
Thus the method we came up with is:
- Convert the primary ID to base 10
  - For UUIDs we stripped out the hyphens and treated the result as base 16
  - For the composite ID format we concatenated the 2 integers with the letter a in the middle, essentially making it a base 11 primary ID
- Convert the alphanumeric prefix to base 10 as well, treating its starting base as 36
- Concatenate the converted prefix, the letter a again, and the converted primary ID to create a long base 11 encoding of all the information
- Convert the encoded base 11 string to base 58, where base 58 is all lower and upper case letters of the English alphabet plus the digits, excluding 4 easily confused characters (0, I, O, l).
This results in Universal IDs that never conflict. They range from 3-4 characters for low integer primary IDs, average around 6-7 characters, and max out at a somewhat unwieldy 26 characters or so for UUID primary IDs.
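The method described above could be sketched in Python roughly as follows. This is an illustration rather than the production implementation: the ordering of the base 58 alphabet, the function names (encode, decode, to_base, from_base) and the use of a tuple for the composite ID are all assumptions, so the output won't necessarily match real Universal IDs.

```python
import uuid

# Assumed ordering (digits, upper, lower, minus 0, I, O, l). The post names
# the excluded characters but not the order of the remaining ones.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"
BASE11 = BASE36[:11]  # digits 0-9 plus the letter a


def to_base(n: int, digits: str) -> str:
    """Render a non-negative integer using the given digit alphabet."""
    if n == 0:
        return digits[0]
    out = []
    while n:
        n, remainder = divmod(n, len(digits))
        out.append(digits[remainder])
    return "".join(reversed(out))


def from_base(text: str, digits: str) -> int:
    """Parse a string written in the given digit alphabet back to an integer."""
    n = 0
    for ch in text:
        n = n * len(digits) + digits.index(ch)
    return n


def encode(prefix: str, primary_id) -> str:
    if isinstance(primary_id, uuid.UUID):
        id_part = str(primary_id.int)  # hyphen-free hex read as base 16
    elif isinstance(primary_id, tuple):
        id_part = f"{primary_id[0]}a{primary_id[1]}"  # composite of 2 integers
    else:
        id_part = str(primary_id)  # plain integer
    prefix_part = str(int(prefix, 36))  # prefix read as base 36
    combined = f"{prefix_part}a{id_part}"  # one long base 11 string
    return to_base(int(combined, 11), ALPHABET)


def decode(universal_id: str) -> tuple[str, str]:
    """Return (prefix, ID part); the ID part is still in its base 10/11 form."""
    combined = to_base(from_base(universal_id, ALPHABET), BASE11)
    prefix_part, id_part = combined.split("a", 1)
    return to_base(int(prefix_part), BASE36), id_part


print(encode("p1", 920304), encode("p", 1920304))
print(decode(encode("p1", 920304)))
```

Because the converted prefix only ever contains the digits 0-9, the first a in the base 11 string always marks the boundary between prefix and ID, which is what keeps p1 + 920304 and p + 1920304 apart. One limitation of this sketch is that a prefix starting with 0 would not survive the round trip, since the leading zero is lost when the prefix is converted to a number.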
Here are a few examples:
42KuVj - The podcast "The News Agents"
7DrhGRN - A specific episode of "The News Agents"
2JsSa4Z6PBv - An LBC video clip
7giJZVDDBDi4Me7Lp5xEmpCVwp - An article from Gold about Donny Osmond
What I might change
The Universal ID system continues to work well, but there are a couple of things I might change if I were to create one again:
Remove risky characters
So far we’ve been fairly lucky in that we’ve not knowingly surfaced any rude or offensive IDs, but it’s definitely possible, and with incrementing IDs it’s an inevitability at some point. We at least know our prefixes don’t start off dangerous, so any rude or offensive parts wouldn’t be at the start of the ID.
The easiest solution would probably be to remove the vowels from the list of possible characters. That would reduce the final encoding base to 50, at the cost of slightly longer strings.
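The base 50 figure can be checked with a few lines of Python. The variable names here are illustrative only, and the ordering of the characters is an assumption.

```python
import string

CONFUSABLE = set("0IOl")   # the 4 characters already excluded
VOWELS = set("aeiouAEIOU")

# All digits and letters, minus the confusable characters: the base 58 set.
base58 = [
    c
    for c in string.digits + string.ascii_uppercase + string.ascii_lowercase
    if c not in CONFUSABLE
]

# Dropping vowels removes only 8 more characters, because I and O are
# already gone from the upper case letters.
no_vowels = [c for c in base58 if c not in VOWELS]

print(len(base58), len(no_vowels))  # 58 50
```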
Stick to only lowercase letters
We don’t often read out URLs, or expect our users to either, but on the rare occasions we do, it’s more annoying and takes longer to say “Capital w, r, one, Capital j, …” than to read a longer string with no need to differentiate case.
Removing all upper case letters would drop the encoding base to 34 (29 if also removing vowels) and thus significantly increase the length of the final encoding. It’d definitely be a trickier switch, and encoded UUIDs could get unreasonably long.
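To get a feel for the trade-off, the sketch below estimates the worst-case length of an encoded UUID for each candidate base. It assumes a two-character prefix (at most 4 decimal digits), one separator, and a 39-digit UUID (2^128 has 39 decimal digits), giving 44 base 11 digits in total. The helper name max_length is made up for the example.

```python
import math


def max_length(base11_digits: int, target_base: int) -> int:
    """Upper bound on the characters needed to hold a base 11 string
    of the given length once converted to target_base."""
    return math.ceil(base11_digits * math.log(11) / math.log(target_base))


# Assumed worst case: 4-digit prefix + separator + 39-digit UUID.
WORST_CASE_DIGITS = 4 + 1 + 39

for base in (58, 50, 34, 29):
    print(base, max_length(WORST_CASE_DIGITS, base))
```

Under these assumptions the bound works out at 26 characters for base 58, 27 for base 50, 30 for base 34 and 32 for base 29.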