IDing

…

SIQ

The main point of the SUOU library is to provide an implementation for the methods of SIS, a protocol for information exchange in phase of definition, and of which SUOU is the reference implementation.

The key element is the ID format called SIQ, a 112-bit identifier format.

Here follow an extract from the specification:

Why SIQ?

I needed unique, compact, decentralized, reproducible and sortable identifiers for my applications.

Something I could reliably use as database key, as long as being fit for my purposes, in the context of a larger project, a federated protocol.

Why not …

Serial numbers? They are relative. If they needed to be absolute, they would have to be issued by a single central authority for everyone else. Unacceptable for a decentralized protocol.

Username-domain identifiers? Despite them being in use in other decentralized protocols (such as ActivityPub and Matrix), they are immutable and bound to a single domain. It means, the system sees different domains or usernames as different users. Users can’t change their username after registration, therefore forcing them to carry an unpleasant or cringe handle for the rest of their life.

UUID’s? UUIDs are unreliable. Most services use UUIDv4’s, which are just opaque sequences of random bytes, and definitely not optimal as database keys. Other versions exist (such as the timestamp-based [UUIDv7](https://uuidv7.org)), however they still miss something needed for cross-domain uniqueness. In any case, UUIDs need to waste some bits to specify their “protocol”.

**Snowflake**s? Snowflakes would be a good choice, and are the inspiration for SIQ themselves. However, 64 bits are not enough for our use case, and Snowflake is already making the necessary sacrifices to ensure everything fits into 64 bits (i.e. the epoch got significantly moved forward).

Content hashes? They are based on content, therefore they require content to be immutable and undeletable. Also: collisions.

PLC’s (i.e. the ones in use at BlueSky)? [The implementation is cryptic](https://github.com/did-method-plc/did-method-plc). Moreover, it requires a central authority, and BlueSky is, as of now, holding the role of the sole authority. The resulting identifier as well is apparently random, therefore unorderable.

ULID’s? They are just UUIDv4’s with a timestamp. Sortable? Yes. Predictable? No, random bits rely on the assumption of being generated on a single host — i.e. centralization. Think of them as yet another attempt to UUIDv7’s.

Anatomy of a SIQ

SIQ’s are 112 bit binary strings. Why 112? Why not 128? Idk, felt like it. Maybe to save space. Maybe because I could fit it into UUID some day — UUID already reserves some bits for the protocol.

Those 112 bits split up into:

56 bits of timestamp;
8 bits of process (”shard”) information;
32 bits of domain hash;
16 bits of serial and qualifier.

Here is a graph of a typical SIQ layout:

``` 0: tttttttt tttttttt tttttttt tttttttt tttttttt 40: uuuuuuuu uuuuuuuu ssssssss dddddddd dddddddd 80: dddddddd dddddddd nnnnnnnn nnqqqqqq

where: t : timestamp – seconds u : timestamp – fraction seconds s : shard d : domain hash n : progressive q : qualifier (variable width, in fact) ```

Timestamp

SIQ uses 56 bits for storing timestamp:

40 bits for seconds;
16 bits for fraction seconds.

There is no need to explain [why I need no less than 40 bits for seconds](https://en.wikipedia.org/wiki/Year_2038_problem).

Most standards — including Snowflake and ULID — store timestamp in milliseconds. It means the system needs to make a division by 1000 to retrieve second value.

But 1000 is almost 1024, right? So the last ten bits can safely be ignored and we easily obtain a UNIX timestamp by doing a right shi-  wait.

It’s more comfortable to assume that 1024 is nearly 1000. Melius abundare quam deficere. And injective mapping is there.

But rounding? Truncation? Here comes the purpose of the 6 additional trailing bits: precision control. Bits from dividing milliseconds o’clock are different from those from rounding microseconds.

Yes, most systems can’t go beyond milliseconds for accuracy — standard Java is like that. But detecting platform accuracy is beyond my scope.

There are other factors to ensure uniqueness: domain and shard bits.

Domain, shard

The temporal uniqueness is ensured by timestamp. However, in a distributed, federated system there is the chance for the same ID to get generated twice by two different subjects.

Therefore, spacial uniqueness must be enforced in some way.

Since SIQ’s are going to be used the most in web applications, a way to differentiate spacially different applications is via the domain name.

I decided to reserve 32 bits for the domain hash.

The algorithm of choice is SHA-256 for its well-known diffusion and collision resistance. However, 256 bits are too much to fit into a SIQ! So, the last 4 bytes are taken.

…

Development and testing environments may safely set all the domain bits to 0.

Qualifiers

The last 16 bits are special, in a way that makes those identifiers unique, and you can tell what is what just by looking at them.

Inspired by programming language implementations, such as OCaml and early JavaScript, a distinguishing bit affix differentiates among types of heterogeneous entities:

terminal entities (leaves) end in 1. This includes content blobs, array elements, and relationships;
non-leaves end in 0.

The full assigment scheme (managed by me) looks like this:

x00000 user account x10000 application (e.g. API, client, bot, form) x01000 event, task x11000 product, subscription x00100 user group, membership, role x10100 collection, feed x01100 invite x11100 unassigned x00010 tag, category x10010 unassigned x01010 channel (guild, live chat, forum, wiki~) x11010 unassigned xx0110 thread, page xx1110 message, post, revision xxx001 3+ fk relationship xxx101 many-to-many, hash array element xxx011 array element (one to many) xxx111 content ——————————————————–

The leftover bits are used as progressive serials, incremented as generation continues, and usually reset when timestamp is incremented.

Like with snowflakes and ULID’s, if you happen to run out with serials, you need to wait till timestamp changes. Usually around 15 microseconds.

Storage

It is advised to store in databases as 16 byte binary strings.

In MySQL/MariaDB, it’s VARBINARY(16).

The two extra bytes are to ease alignment, and possible expansion of timestamp range — even though it would not be an issue until some years after 10,000 CE.

It is possible to fit them into UUID’s (specifically, UUIDv8’s — custom ones), taking advantage from databases and libraries implementing a UUID type — e.g. PostgreSQL.

Unfortunately, nobody wants to deal with storing arbitrarily long integers — lots of issues pop up by going beyond 64. 128 bit integers are not natively supported in most places. Let alone 112 bit ones.

(end of extract)

Implementation

class suou.iding.Siq: Representation of a SIQ as an integer.

class suou.iding.SiqGen(domain: str, last_siq: int = 0, local_id: int | None = None, shard_id: int | None = None): Implement a SIS-compliant SIQ generator.

SiqGen.__init__(domain: str, last_siq: int = 0, local_id: int | None = None, shard_id: int | None = None)

SiqGen.generate(typ: SiqType, n: int = 1) → Iterable[int]

Generate one or more SIQ’s. The generated ids are returned as integers. Bulk generation is supported.

Returns as an iterator, to allow generation “on the fly”. To get a scalar or a list, use .generate_one() or next(), or .generate_list() or list(.generate()), respectively.

Warning: the function may block.

Snowflake

SUOU also implements [the Discord flavor of] Snowflake ID’s.

This flavor of Snowflake requires an epoch date, and consists of: * 42 bits of timestamp, with millisecond precision; * 10 bits for, respectively, worker ID (5 bits) and shard ID (5 bits); * 12 bits incremented progressively.

class suou.snowflake.Snowflake: Representation of a Snowflake as an integer.

class suou.snowflake.SnowflakeGen(epoch: int, local_id: int = 0, shard_id: int | None = None, last_id: int = 0)

Implements a generator Snowflake ID’s (i.e. the ones in use at Twitter / Discord).

Discord snowflakes are in this format: tttttttt tttttttt tttttttt tttttttt tttttttt ttddddds sssspppp pppppppp

where: t: timestamp (in milliseconds) — 42 bits d: local ID — 5 bits s: shard ID — 5 bits p: progressive counter — 10 bits

Converter takes local ID and shard ID as one; latter 8 bits are taken for the shard ID, while the former 2 are added to timestamp, taking advantage of more precision — along with up to 2 most significant bits of progressive co

The constructor takes an epoch argument, since snowflakes, due to optimization requirements, are based on a different epoch (e.g. Jan 1, 2015 for Discord); epoch is wanted as seconds since Unix epoch (i.e. midnight of Jan 1, 1970).

Other ID formats

Other ID formats (such as UUID’s, ULID’s) are implemented by other libraries.

In particular, Python itself has support for UUID in the Standard Library.