Whether you are designing a standalone application or a microservice, you will inevitably encounter the topic of sharing identifiers. Whether it’s URLs of web pages, RESTful API resources, JSON documents, CSV exports, or something else, the identifier of specific resources will be exposed.
/orders/123
/products/345/variants/1
While an identifier is just a number and does not carry any negative connotations, there are valid reasons why you might want to avoid exposing them. These reasons include:
- Security and Data Exposure: Numerical identifiers are sequential and predictable, which can expose information about the underlying data source (e.g., the volume of the data) and provide a basis for ID enumeration.
- Privacy and Confidentiality: Concerns may arise about concealing the volume of referenced data. For example, the number of customers, clients, or orders might be information a business prefers to keep private.
- Non-descriptive Nature: Integers as identifiers can lead to confusion. An ID like 123 does not convey any additional information, making debugging edge cases more challenging.
These and other reasons (like SEO optimisation) have led to the increased use of text-based identifiers. Their readability and versatility make them ideal for external data sharing.
However, in database (or data model) design, the advantages of text identifiers are often overshadowed by the problems they introduce. While text identifiers improve interoperability, they frequently come with performance and storage trade-offs. In contrast, integers are naturally faster and more efficient to process, resulting in lower storage requirements and faster indexing, sorting, and searching-tasks that computers are optimised for.
In this article, we will explore scenarios where using text identifiers directly in the database design might seem natural and discuss strategies for using them effectively.
What Makes Text Identifiers Appealing?
Let’s be honest-text identifiers are popular for a reason. For humans, they are much more readable and, in some cases, add extra context. (Raise your hand if you don’t appreciate Heroku’s quirky and memorable names!) If chosen carefully, they can also be easier to recall.
Text identifiers can embed additional context. Consider the order number APAC-20241103-8237, which encodes both the region and the order date. Another popular reason for using text identifiers is their uniqueness in distributed environments.
They’re especially handy when people need to interact with them directly. For example, customers copying an order number from an email or support teams discussing an issue benefit from a readable, meaningful identifier. It’s simpler, more intuitive, and less likely to cause headaches when someone tries to recall or share it.
When an Identifier Isn’t Just an Identifier
Problems with text identifiers arise when they are used as natural keys in your data or database model. Despite their benefits, text identifiers often make poor primary keys for several reasons:
- Context Changes: The additional context provided by text identifiers will likely change, necessitating updates. Despite assurances to the contrary, changes are inevitable.
- Sorting Issues: Sorting text identifiers can be tricky, particularly with locale-based sorting or when numbers are embedded within the identifier (e.g., order1434 vs. order349).
The ultimate issue is efficiency:
- Text identifiers typically require more storage space than numeric ones. Each character in a text field occupies more bytes than a simple integer, leading to larger database sizes and slower performance, especially during indexing or handling large datasets.
- Databases are optimised for numeric operations, making searches, joins, and indexing on text fields inherently slower. This performance gap can significantly affect large datasets, impacting application efficiency.
- Text identifiers complicate managing relationships between tables. The additional storage requirements affect not only the source table but also all referencing entities. Multiply this increased storage by the number of referencing tables, and you’ll get a sense of the overall impact.
As databases grow, these issues become more pronounced. The combination of increased storage needs and slower operations contributes to database bloat, which can degrade system performance. Additionally, bloated indexes can mislead the query planner into making suboptimal choices, such as favouring sequential scans over index scans, complicating regular operations further.
Aren't UUIDs the Solution to All This?
It depends. While UUIDs offer numerical representation and are excellent for unique key generation across distributed systems, they are not always the best choice:
- In comparison to BIGINT, there are few real-world scenarios that necessitate the full range of UUIDs. Premature optimisation most often isn’t worthwhile for most solutions.
- UUIDs’ 16-byte (128-bit) storage size can be less efficient than many text identifiers. Even BIGINT, at 8 bytes (64 bits), is more storage-efficient. This inefficiency extends to indexes, joins, and other operations.
- A matter of opinion: UUIDs are plain ugly.
Moreover, different versions of UUIDs offer varying benefits. For example, UUIDv1 includes a timestamp component, making it somewhat sortable, while UUIDv4 is entirely random. Even sortable UUIDs may not offer significant advantages over more streamlined options like BIGINT or carefully structured text identifiers.
In most cases, the trade-offs in performance and readability make UUIDs less appealing unless their globally unique nature is essential across distributed systems.
Real-Life Examples
Let’s move beyond theory and explore practical examples:
CREATE TABLE sessions (
token TEXT PRIMARY KEY,
user_id INT NOT NULL REFERENCES users(user_id),
...
);
CREATE TABLE products (
sku TEXT PRIMARY KEY,
label TEXT NOT NULL,
...
);
CREATE TABLE documents (
document_id TEXT PRIMARY KEY,
...
);
In these cases, using text identifiers as primary keys might seem logical because:
- They serve as natural keys for the entity.
- They are likely propagated outside the service scope, hence needing storage anyway.
- They are not expected to change.
- The storage impact for a single table seems negligible.
However, this logic falters under scrutiny. Text identifiers often propagate beyond the service scope, leading to pressure to update them. Updating primary keys is no trivial matter. Consider:
- While SKUs ideally never change, real-world scenarios like rebranding, product consolidation, or supplier changes can necessitate updates. Despite their appeal, SKUs are poor primary key candidates.
- Randomly generated text identifiers (like session tokens) will… TBD.
The real issue arises when referencing these identifiers across tables:
CREATE TABLE session_logs (
log_id BIGINT GENERATED ALWAYS AS IDENTITY,
token TEXT NOT NULL REFERENCES sessions(token),
...
);
CREATE TABLE product_reviews (
review_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
product_sku TEXT NOT NULL REFERENCES products(sku),
...
);
CREATE TABLE customer_orders (
order_id TEXT PRIMARY KEY,
customer_id TEXT NOT NULL REFERENCES customers(customer_id),
...
);
CREATE TABLE document_revisions (
revision_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
document_id TEXT NOT NULL REFERENCES documents(document_id),
...
);
In such scenarios, text identifiers can become problematic. Firstly, you lose the ability to modify them. Secondly, the increased storage requirements become evident. For instance, an additional 100 bytes per reference across a million records results in an extra 100 MB of storage for just one reference.
However, storage isn’t the biggest concern; indexing is. In PostgreSQL, indexing text fields - whether as primary or foreign keys - requires significantly more space than indexing numeric fields. This leads to bloated indexes, slower lookups, and more fragmented operations, particularly in queries relying on foreign key relationships. As a result, the query planner might resort to full table scans instead of index scans, further degrading performance.
These performance issues often remain undetected in development or testing environments but can cause significant disruptions in production.
Reintroducing Text Identifiers Effectively
The goal of this post is not to discourage the use of text identifiers entirely but to highlight why they are unsuitable as primary keys. Here are some strategies to handle them effectively:
1. Introduce a Surrogate Key
One straightforward solution is to introduce a surrogate primary key, replacing references to text identifiers:
CREATE TABLE products (
product_id INT GENERATED BY DEFAULT AS IDENTITY,
sku TEXT NOT NULL,
label TEXT NOT NULL,
...
);
CREATE INDEX products_by_sku ON products(sku);
CREATE TABLE product_reviews (
review_id SERIAL PRIMARY KEY,
product_id INT REFERENCES products(product_id),
...
);
This approach maintains efficient retrieval by SKU via an index.
2. Use Mapping Tables for Greater Flexibility
Mapping tables allow you to:
* Update identifiers without affecting the parent entity.
* Maintain a history of text identifiers linked to a specific entity.
CREATE TABLE products (
product_id INT GENERATED BY DEFAULT AS IDENTITY,
label TEXT NOT NULL,
...
);
CREATE TABLE product_skus (
product_sku_id INT GENERATED BY DEFAULT AS IDENTITY,
product_id INT REFERENCES products(product_id),
sku TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
deleted_at TIMESTAMPTZ
);
CREATE UNIQUE INDEX unique_product_skus ON product_skus (product_id, sku) WHERE deleted_at IS NULL;
This approach accommodates changing SKUs without compromising data integrity. A related use case could be linking a mapping table to a product variant:
CREATE TABLE product_variants (
variant_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
product_id INT NOT NULL REFERENCES products(product_id),
sku TEXT NOT NULL,
name TEXT NOT NULL,
...
);
3. Decompose the Text Identifier’s Meaning
Text identifiers often embed meaningful information, such as regions, dates, or categories. By decomposing the identifier, you can store this contextual data separately, improving both flexibility and performance.
For example, instead of storing an order identifier like ORD-EMEA-00789 directly, you can design a more robust schema:
CREATE TABLE orders (
order_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
region_id INT NOT NULL REFERENCES regions(region_id),
order_date DATE NOT NULL,
...
);
CREATE TABLE regions (
region_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name TEXT NOT NULL,
code TEXT NOT NULL,
...
);
INSERT INTO regions (name, code) VALUES
('Europe, the Middle East, and Africa', 'EMEA'),
('Asia Pacific', 'APAC');
To generate a user-friendly order number, you can create a function:
CREATE OR REPLACE FUNCTION get_order_number(p_order_id INT)
RETURNS TEXT AS $$
DECLARE
v_region_code TEXT;
v_formatted_order_id TEXT;
v_order_number TEXT;
BEGIN
SELECT r.code INTO v_region_code
FROM orders o
JOIN regions r ON o.region_id = r.region_id
WHERE o.order_id = p_order_id;
IF NOT FOUND THEN
RETURN NULL;
END IF;
v_formatted_order_id := TO_CHAR(p_order_id, 'FM00000');
v_order_number := 'ORD-' || v_region_code || '-' || v_formatted_order_id;
RETURN v_order_number;
END;
$$ LANGUAGE plpgsql;
This method enables you to store and retrieve structured data efficiently while still providing a readable and meaningful identifier for external use.
4. Reversible Text IDs
For scenarios requiring enhanced security or privacy, where identifiers should not be easily enumerable or predictable, you can use reversible text-based IDs. These allow you to present users with seemingly random text representations while maintaining efficient numerical storage internally.
Sqids is a project that can help achieve this. It generates URL-friendly unique identifiers from numbers and can encode multiple numerical identifiers into a single string. Here’s an example using the Sqids project:
[42] -> JgaEBgznCpUZo3Kk
[42, 430004] -> lTiYlvsGkh59m1PQ
The generated identifiers are reversible with knowledge of the shared alphabet, allowing you to decode requests without hitting the database, which can be beneficial in high-throughput environments. This technique is particularly useful for user, account, or session identifiers, balancing the need for security with operational efficiency.
However, it’s crucial to remember that this is not a substitute for robust security practices. Proper authentication and authorisation mechanisms are still necessary to secure your application.
By thoughtfully integrating these strategies, you can leverage the benefits of text identifiers where appropriate while avoiding common pitfalls in database design. This balance ensures efficient, maintainable systems that meet both technical and business needs.
5. Leverage Generate columns
In edge case scenarios, where keeping text identifiers with embedded context is necessary, generated columns in PostgreSQL can be a valuable feature. Since version 12, PostgreSQL allows defining columns whose values are automatically computed from other columns in the table. This ensures consistency without manual intervention.
For example, you can define a function to handle the formatting logic:
CREATE OR REPLACE FUNCTION get_formatted_order_id(p_region_id INT, p_order_id INT)
RETURNS TEXT AS $$
DECLARE
v_region_code TEXT;
BEGIN
v_region_code := CASE p_region_id
WHEN 1 THEN 'EMEA'
WHEN 2 THEN 'APAC'
ELSE 'OTHER'
END;
RETURN 'ORD-' || v_region_code || '-' || LPAD(p_order_id::TEXT, 5, '0');
END;
$$ LANGUAGE plpgsql IMMUTABLE;
The function is marked IMMUTABLE because PostgreSQL requires generated columns to use only immutable functions-those guaranteed to return the same result for the same input every time.
CREATE TABLE orders (
order_id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
region_id INT NOT NULL,
order_date DATE NOT NULL,
formatted_order_id TEXT GENERATED ALWAYS AS (
get_formatted_order_id(region_id, order_id)
) STORED
);
This setup ensures that formatted_order_id
is automatically updated whenever region_id
or order_id
changes. The column is physically stored, enhancing read performance for frequently queried data.
Using generated columns can simplify maintaining consistency in derived values like formatted text identifiers. However, be mindful of their characteristics, such as:
- Automated Updates: The system automatically recomputes the column’s value when the referenced columns change.
- Immutability Requirements: Only immutable functions can be used, ensuring reliable and consistent computation.
Wrapping up
Text identifiers will stay with us, and that's great. They’re human-readable, memorable, and can pack a lot of meaningful context into a simple string. They make external interactions smoother, whether it’s a customer referencing an order number or a support team tracing an issue. They even add a bit of charm and personality to otherwise sterile identifiers.
However it's important to keep their usage in control. The nice rule I've heard is
- Use text identifiers when communicating externally - for example, in URLs or API responses
- Internally, always rely on numerical IDs - surrogate keys like INT or BIGINT (even UUID if you go for planet dominanance) where necessary, to maintain database efficiency and integrity.
This approach allows you to leverage the strengths of text identifiers for external communication while keeping your database optimised for performance and scalability. By storing text identifiers in dedicated fields and using numeric primary keys for internal operations, you achieve a the right balance and do best to maintain the system performance.