Migrating 50TB at 100k RPS Without Downtime — A Practical System Design Breakdown#
Note: This blog is a summary of the YouTube video "System Design: Migrating 50TB at 100k RPS (The Hard Way)" by FAANG Senior Engineer.
The Problem#
Imagine this:
- A 50TB Oracle database
- A monolithic application
- Handling 100,000 writes per second (100k RPS)
Requirement:
- ✅ Zero downtime
- ✅ Zero data loss
- ✅ Migration to microservices
At this scale, the physics of distributed systems start to matter. Small mistakes become catastrophic.
Why Naive Approaches Fail#
1️⃣ "Just Take a Database Dump"#
Let's do the math.
Even with a dedicated, fully saturated 10 Gbps link, transferring 50TB takes roughly 11–12 hours. At 100k writes per second:
100,000 × 60 × 60 × 12 ≈ 4.3 billion writes
That's over four billion lost writes during the downtime window.
Offline migration is simply not an option.
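The back-of-the-envelope numbers above can be checked in a few lines (assuming 1 TB = 10¹² bytes and an optimistically saturated link):

```python
# Back-of-the-envelope check for the "database dump" approach.
TB = 10**12                      # bytes
link_bps = 10 * 10**9            # 10 Gbps, fully saturated (optimistic)
db_bytes = 50 * TB
write_rps = 100_000

transfer_seconds = db_bytes * 8 / link_bps
transfer_hours = transfer_seconds / 3600
lost_writes = write_rps * transfer_seconds

print(f"transfer time: {transfer_hours:.1f} h")    # ~11.1 h
print(f"writes lost:   {lost_writes / 1e9:.1f}B")  # ~4.0 billion over the raw transfer window
```

Real-world overhead (compression, checksums, restore time, index rebuilds) only makes these numbers worse.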
2️⃣ "Just Do Dual Writes"#
Another common idea: Modify the app to write to both old and new databases.
This is a trap.
Problems:
- Distributed transaction complexity (Two Generals Problem)
- Network blips create inconsistencies
- 100ms outage at 100k RPS → 10,000 failed writes
- Latency increases
- Monolith becomes even more brittle
At this scale, application-level dual writes are dangerous.
First Principles: A Database Is a Log#
The key insight comes from Jay Kreps' essay "The Log: What every software engineer should know about real-time data's unifying abstraction".
A database table is not the source of truth. The transaction log is.
- Oracle → Redo Log
- Postgres → WAL (Write-Ahead Log)
Every write is appended to a log:
- Insert X
- Update Y
- Delete Z
It's a stream. And that stream becomes the foundation of the migration strategy.
The Real Strategy#
The solution combines three major ideas:
- Change Data Capture (CDC)
- Kafka as a buffer
- Strangler Fig Pattern
Step 1: Build a Change Data Capture Pipeline#
Instead of writing to two databases at the app layer:
- Use a CDC tool like Debezium
- It tails the Oracle redo log (like a replica)
- Converts changes into events (JSON/Avro)
- Publishes them to Apache Kafka
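As a sketch, a Debezium Oracle connector registered with Kafka Connect looks roughly like this (hostnames, credentials, table names, and topic prefix are illustrative placeholders, not from the video):

```json
{
  "name": "oracle-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.oracle.OracleConnector",
    "database.hostname": "oracle.internal",
    "database.port": "1521",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "ORCLCDB",
    "topic.prefix": "legacy",
    "table.include.list": "APP.USERS,APP.ORDERS",
    "snapshot.mode": "initial",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.legacy"
  }
}
```

The important knob for the next step is "snapshot.mode": "initial", which tells Debezium to bulk-read existing data before switching to log tailing.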
Why Kafka?#
Because at 100k RPS:
- You need a durable buffer
- You need backpressure handling
- You need fault tolerance
If the new microservice crashes, Kafka absorbs the load. No data loss. No pressure on Oracle.
Step 2: Handle the Existing 50TB (Snapshot + Replay)#
CDC only captures new writes. But what about the existing 50TB?
Debezium can:
- Take an initial snapshot
- Read tables chunk-by-chunk
- While simultaneously tracking log position
Flow:
- Snapshot starts at T0
- Snapshot may take days
- Meanwhile, all new writes go to Kafka
- When snapshot completes, the new DB replays Kafka events starting from T0
Result: a near real-time replica, possibly only milliseconds behind.
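The snapshot-then-replay ordering can be sketched in a few lines (names and structures are illustrative, not the video's code): bulk-load the snapshot, then apply only events whose log position is newer than the snapshot's starting point T0.

```python
# Minimal sketch of snapshot + replay ordering (illustrative names).
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    log_position: int   # position in the source's transaction log
    key: str
    value: dict

def migrate(snapshot_rows, buffered_events, snapshot_start_position):
    """Bulk-load the snapshot, then replay only events newer than T0."""
    new_db = {}
    # 1. Bulk load: the snapshot reflects state as of snapshot_start_position (T0).
    for key, value in snapshot_rows.items():
        new_db[key] = value
    # 2. Replay: events at or before T0 are already in the snapshot; skip them.
    for ev in sorted(buffered_events, key=lambda e: e.log_position):
        if ev.log_position > snapshot_start_position:
            new_db[ev.key] = ev.value
    return new_db
```

In practice Debezium interleaves snapshot chunks with log reading using watermarks, but the net effect is the same: no event is lost and no pre-snapshot event is double-applied.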
Step 3: Scaling Concurrency Correctly#
At 100k RPS, one consumer won't work. You must:
- Partition Kafka topics by user ID (or logical key)
This ensures parallel processing with order preserved per user.
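A sketch of key-based partitioning (Kafka's default partitioner does this with murmur2 when a message key is set; a stable MD5-based hash stands in for it here):

```python
import hashlib

NUM_PARTITIONS = 32

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user ID to a partition deterministically, so every event for
    the same user lands on the same partition and stays ordered."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same user → same partition → per-user ordering preserved.
# Different users spread across partitions → 32-way parallel consumers.
```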
Critical Requirement: Idempotency#
Kafka may deliver messages twice. Your new service must handle that safely.
Safe (replayable): Set balance = 50
Dangerous (not replayable): Subtract 10 from balance
Prefer full state over deltas whenever possible.
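The difference is easy to demonstrate with a redelivered message (a minimal sketch, not the video's code):

```python
def apply_full_state(db, event):
    # Idempotent: applying the same event N times == applying it once.
    db[event["user"]] = event["balance"]

def apply_delta(db, event):
    # NOT idempotent: a redelivered message changes the balance again.
    db[event["user"]] = db.get(event["user"], 0) + event["delta"]

db_a, db_b = {}, {"alice": 60}
full = {"user": "alice", "balance": 50}
delta = {"user": "alice", "delta": -10}

for _ in range(2):          # Kafka redelivers the same message
    apply_full_state(db_a, full)
    apply_delta(db_b, delta)

print(db_a["alice"])        # 50 — correct despite the duplicate
print(db_b["alice"])        # 40 — should be 50; the duplicate subtracted twice
```

Where deltas are unavoidable, deduplicating on a unique event ID before applying achieves the same safety.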
Step 4: Move Traffic — The Strangler Fig Pattern#
Use a proxy (Envoy/Nginx):
- Initially → All traffic to monolith
- Introduce new microservice
- Enable shadow traffic (real request → monolith, copy → microservice async)
This allows performance testing at real scale with zero user impact.
Step 5: Incremental Cutover#
Route traffic gradually:
- 1% → 10% → 50% → 100%
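Percentage rollout is typically implemented as deterministic bucketing on a stable key, so a given user consistently hits the same backend at each stage (a sketch of the idea; in production Envoy's weighted routing does this at the proxy layer):

```python
import hashlib

def route(user_id: str, rollout_percent: int) -> str:
    """Deterministically bucket users into 0-99; users below the rollout
    percentage go to the new microservice, the rest stay on the monolith."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "microservice" if bucket < rollout_percent else "monolith"

# Ramping 1% → 10% → 50% → 100% is just raising the number;
# rolling back is lowering it — no user flip-flops mid-stage.
assert route("user-7", 0) == "monolith"
assert route("user-7", 100) == "microservice"
```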
The Ugly Part: Reverse CDC#
Once a user's writes are served by the new microservice, Oracle's copy of that user goes stale, and any legacy monolith job that still reads it sees stale data. Solution: temporarily stream changes back from the microservice to Oracle, then remove the reverse pipeline once the cutover is complete.
Final Takeaway#
Migrating 50TB at 100k RPS is like performing a brain transplant while the patient is running.
The strategy:
- CDC from transaction logs
- Kafka as a durable buffer
- Snapshot + replay
- Idempotent consumers
- Strangler Fig routing
- Incremental cutover
- Temporary reverse CDC
Source#
"System Design: Migrating 50TB at 100k RPS (The Hard Way)" — By FAANG Senior Engineer