Billion User Trap - The Database Design That Looked Perfect on Day One

Jun 16, 2026 architecture distributed-systems database-optimization

Share on:

Background

In the first part we established the foundational principle that separates systems which survive scale from systems that collapse under it - design for access patterns, not for entities.

In this second part we double click on the database design of the User Profile system and try to understand its impact from performance engineering and scalability standpoint. Specifically - why the schema that adheres to proper normalization, proper foreign keys, and proper indexes, becomes a latency problem.

Every Architecture Diagram Is Pristine Before the First 100 Million Rows

Show me an ER diagram from a planning session and I'll show you something that may cause a production incident when your platform scales. Not because engineers are bad at design. Because the diagram was designed for the data they had or had to persist. Not the data they'd get and would be getting accessed.

The KYC User Profile system had a natural, normalized design. Clean and elegant!

 1-- Users: core identity
 2CREATE TABLE users (
 3  user_id       UUID PRIMARY KEY,
 4  first_name     VARCHAR(255),
 5  last_name     VARCHAR(255),
 6  email         VARCHAR(255) UNIQUE,
 7  phone         VARCHAR(20),
 8  date_of_birth DATE,
 9  region        VARCHAR(10),
10  created_at    TIMESTAMP,
11  updated_at    TIMESTAMP
12);
13
14-- KYC: verification state and documents
15CREATE TABLE kyc_records (
16  kyc_id        UUID PRIMARY KEY,
17  user_id       UUID REFERENCES users(user_id),
18  status        ENUM('PENDING','VERIFIED','REJECTED','EXPIRED'),
19  verified_at   TIMESTAMP,
20  doc_type      VARCHAR(50),
21  doc_reference VARCHAR(255),
22  updated_at    TIMESTAMP
23);
24
25-- Accounts: financial accounts per user
26CREATE TABLE accounts (
27  account_id    UUID PRIMARY KEY,
28  user_id       UUID REFERENCES users(user_id),
29  account_type  ENUM('SAVINGS','CURRENT','WALLET','INVESTMENT'),
30  account_no    VARCHAR(30),
31  balance       DECIMAL(18,4),
32  status        VARCHAR(20),
33  opened_at     TIMESTAMP
34);

And here is the query for User Detail API which is equally elegant:

1SELECT u.*, k.status, k.verified_at, k.doc_type,
2       a.account_id, a.account_type, a.account_no, a.balance
3FROM   users u
4JOIN   kyc_records k ON k.user_id = u.user_id
5JOIN   accounts a    ON a.user_id = u.user_id
6WHERE  u.user_id = :userId;

This works for 100 users, 100,000 users. But beyond certain million no. of users, API starts showing signs of latency degradation - and that's because database joins are worst enemies of scale

Why Joins Become Your Worst Enemy at Scale

A database JOIN is not a free operation. For a point lookup on a single user_id across three tables, with proper indexes, you're looking at 3 B-tree traversals plus result merging.

At low concurrency this seems fine.

But at 10 million requests per day (roughly 115 RPS average, with peaks easily hitting 500-1000 RPS), every millisecond of database CPU matters.

The hidden cost of JOINs at scale

Lock contention across tables : When a KYC update happens, the kyc_records table takes a row-level write lock. Concurrent reads doing a JOIN that touches that row wait
Buffer contention : When the working set of data, your queries need is larger than the DB buffer size - pages are being loaded, evicted, and reloaded in a continuous cycle
Query plan instability : This may be due to frequent changes in your DB object statistics
The N+1 ORM problem : The "elegant" JOIN becomes N+1 fetches when the ORM decides to lazily load accounts

The Right Mental Model: Design for Access Patterns, Not just for Normalization

The normalized schema is correct for writes. Normalization avoids update anomalies.

It is wrong as the primary read model at scale!

Hence the proposal to shift to CQRS — Command Query Responsibility Segregation.

Writes still go to the normalized source-of-truth schema. But reads get their own optimized data stores.

For this system, two read models are required:

Read Model 1: User Summary Store Optimized for the listing API. Contains exactly: userId, firstName, lastName, kycStatus, region. Pre-joined. No relationships to traverse.
Read Model 2: User Detail Store Optimized for the detail API. Contains the full denormalized user record: KYC fields + all accounts. Point lookup by userId. One read, complete response.

With CQRS, the operational complexity here is real: you now have two representations of the same data. They must stay consistent. That consistency mechanism is part of this series.

The Indexes That Actually Matter

Indexes are one of the most common hotspots as far as DB performance is concerned.

The indexes most teams create

1CREATE INDEX idx_users_email ON users(email);
2CREATE INDEX idx_kyc_user_id ON kyc_records(user_id);
3CREATE INDEX idx_accounts_user_id ON accounts(user_id);

Standard Indexes are sub optimal for listing API query at scale because query using standard index requires 2 steps -

Index Scan - B-tree traversal to find the row identifier that matches WHERE clause
Heap Fetch - For each row identifier found, DB goes to the heap and fetches the full row

What you actually need

1-- Covering index for the summary read (index-only scan, no heap fetch)
2CREATE INDEX idx_user_summary ON users(region, created_at DESC, user_id)
3  INCLUDE (full_name, kyc_status);

A covering index stores the additional columns you need directly inside the index structure. When the query planner uses this index, Step 2 i.e. Heap Fetch is not required

What Comes Next

The data model is sorted. But the listing API still has a major performance / scalability issue that most teams won't find until they're in production. Stay tuned for Part - 3

Architect's Note

Three-table JOIN, proper indexes, clean normalization - Correct on day one. But by year two we start seeing latency degradations. Fundamentally speaking - the gap between a schema that works and a schema that scales is always access pattern analysis. That's not taught in SQL courses. It's learnt by making hands dirty while triaging and fixing production incidents.