The Billion-User Trap
Background
This multi part series walks through the real architectural decisions and real mistakes behind designing a globally distributed KYC User Profile system - a system that should be capable of serving billions of users at 10M+ RPD with sub-500 ms P99 latency. It's mainly about weighing trade-offs and making architectural decisions.
Not theory. Not a tutorial. No "here's how Redis works."
If you already know what a cache stampede is, what CDC does, and why OFFSET pagination is dangerous β this series is written for you. If you're a CTO or engineering leader who has watched a "simple" looking service quietly become your biggest operational liability - this series is especially for you.
Every part covers one assumption that looked reasonable on day one and became a production problem at scale. The database design that passed every review. The pagination that worked fine until it didn't. The cache layer that was added to fix latency and made almost no difference.
That's what makes them worth writing about.
It Started With Two APIs
The Lie We Tell Ourselves at the Beginning
The request sounded almost trivial.
A product manager walked into a planning meeting and wrote two bullet points on the whiteboard:
- API 1 : List users with KYC status
- API 2 : Fetch full user details including all accounts
Clean. Simple. Two endpoints.
A senior developer in the room said: "I can have a prototype running by end of week."
Nobody pushed back.
That's the first mistake.
The Requirements That Seemed Harmless
The system was a KYC (Know Your Customer) profile service for a FinTech platform. Every user had:
- A core identity record (First Name, Last Name, DOB, address)
- KYC verification status and documents
- Multiple financial accounts (savings, current, wallet β up to 4 per user)
The UI needed two things:
- API 1 : User Listing Return a paginated list of users. Each record: User ID, First Name, Last Name, KYC status. No account details.
- API 2 : User Detail Return the full profile for a selected user: complete KYC data, all linked accounts, everything.
A straightforward CRUD problem. Until the next sentence in the requirements doc.
Then the Projected Numbers Arrived
Billions of users.
10 million requests per day.
Multi-region rollout β North America first, then APAC, EMEA.
P99 latency: β€ 500ms.
Everything changed with these projected numbers. The features were still 2 simple GET APIs. But the entire conversation shifted:
- "What tables do we need?" became "What are the dominant read patterns?"
- "Which DB framework we use?" became "How do we avoid hot partitions?"
- "Should we use Postgres or MySQL?" became "How do we handle cross-region consistency?"
The First Architectural Principle: Access Patterns Over Entities
Most engineers design storage around what data exists or is going to get persisted.
Experienced architects design storage around how data is accessed.
The difference sounds subtle - but it is not.
"List users" and "Fetch user details" look similar on a whiteboard. In production they are fundamentally different problems
| Dimension | List Users | Fetch User Details |
|---|---|---|
| Access type | Range scan / full table | Point lookup by ID |
| Data volume per request | Partial (KYC status only) | Full (KYC + all accounts) |
| Cache strategy | Shared, TTL-based | Per-user, invalidation-driven |
| DB pressure at scale | Index scans, pagination cost | Key-value lookup, predictable |
| Failure mode | Slow query + cascade | Cache miss + single DB hit |
Treating these two access patterns the same is the first architectural mistake.
What Breaks the System
Scale doesn't kill systems by itself.
Hidden assumptions kill systems. Scale just surfaces them faster.
The most dangerous assumptions w.r.t given use case:
- "Our users are evenly distributed." They're not. 20% of users drive 80% of API calls.
- "Joins are fine - the database is fast." Joins are fine until they run across 500 million rows.
- "Pagination is a UI concern." Pagination is a database concern as well.
The Architectural Decision Log Starts Here
Before a single line of code, before schema design, before choosing a cache: document your access patterns.
For this system, the two dominant reads are:
1Read Pattern 1: GET /users?page=N&size=50
2 - Frequency: Very high (every UI load, every export job)
3 - Data: userId, name, kycStatus
4 - Latency target: < 200ms P99
5 - Consistency: Eventual OK (few seconds)
6
7Read Pattern 2: GET /users/{userId}
8 - Frequency: High but more targeted
9 - Data: Full KYC + all accounts
10 - Latency target: < 300ms P99
11 - Consistency: Strong preferred (compliance context)
These two patterns drive every decision in this series. Database design, Cache topology, Pagination strategy, Consistency model. All of it flows from here.
What Comes Next
In Part 2, we look at the database design - the normalized schema with elegant joins - and exactly where it breaks down before you hit 100 million users.
comments powered by Disqus