
Database Internals: Storage, Query Patterns & Selection#

The core insight: database selection is choosing which access patterns get to be fast. Storage layout makes that choice permanent. Indexes extend performance within that choice — they cannot change it.


Table of Contents#

  1. The Three-Layer Constraint Model
  2. PostgreSQL
  3. MongoDB
  4. Cassandra
  5. Time-Series: TimescaleDB & InfluxDB
  6. Side-by-Side Comparison
  7. Decision Tree
  8. The Ceiling Principle

1. The Three-Layer Constraint Model#

Every database performance problem lives in one of three layers. Each layer sets a ceiling on everything below it.

┌─────────────────────────────────────────────────────────────┐
│  Layer 1: Storage Layout  — the physics                     │
│  How bytes sit on disk. Cannot be escaped by software.      │
│  Row store / Document / Column-family / Columnar            │
└───────────────────────┬─────────────────────────────────────┘
                        │ constrains
┌───────────────────────▼─────────────────────────────────────┐
│  Layer 2: Data Model  — the contract                        │
│  How you express data. Embed vs reference.                  │
│  Schema vs schema-less. The irreversible decision.          │
└───────────────────────┬─────────────────────────────────────┘
                        │ constrains
┌───────────────────────▼─────────────────────────────────────┐
│  Layer 3: Query Pattern  — what you ask                     │
│  Point lookup vs scan vs aggregation.                       │
│  Indexes help here — but cannot exceed Layer 1's ceiling.   │
└─────────────────────────────────────────────────────────────┘

Misalignment at any layer = a ceiling you cannot engineer around.

Indexes, caching, and query tuning all operate at Layer 3. They cannot fix a wrong storage layout (Layer 1) or a wrong data model (Layer 2). Most teams debug at Layer 3 and wonder why they keep hitting walls.


2. PostgreSQL#

The Problem It Was Built to Solve#

Relational data — entities with relationships between them, where you need to query across those relationships flexibly, and where correctness guarantees (ACID) matter more than raw throughput.


Storage Layout on Disk#

Storage engine: PostgreSQL's own heap-based storage engine.

Fundamental unit: The page (8KB by default). Everything Postgres reads or writes is in pages.

Layout: Row-oriented (heap)

All columns for one row are stored together, and multiple rows are packed into the same page.

Page 1 (8KB):
┌─────────────────────────────────────────────────────────┐
│ Page Header (24 bytes)                                  │
│ Item pointers → [ptr1][ptr2][ptr3]...                   │
│                                                         │
│ Row (tuple): id=1 | name="Shreya" | age=25 | city="BLR"│
│ Row (tuple): id=2 | name="Rahul"  | age=28 | city="MUM"│
│ Row (tuple): id=3 | name="Priya"  | age=22 | city="DEL"│
└─────────────────────────────────────────────────────────┘

Page 2 (8KB):
┌─────────────────────────────────────────────────────────┐
│ Row (tuple): id=4 | name="Arjun" | age=31 | city="BLR" │
│ ...                                                     │
└─────────────────────────────────────────────────────────┘

Key implication: Reading one row reads the whole page into memory. If you only need age from 1 million rows, Postgres still reads all other columns alongside it — they're physically inseparable on disk.
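
You can see this layout from SQL itself: every row exposes a system column ctid, whose first component is the page (block) number the row lives in. A quick check — illustrative, assuming the users table above:

-- ctid = (page_number, tuple_index): where each tuple physically sits
SELECT ctid, id, name, age, city
FROM users
ORDER BY ctid
LIMIT 10;
-- Typical output: (0,1), (0,2), (0,3)... — consecutive rows sharing page 0,
-- each stored with all of its columns together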


How Indexes Work#

Primary data structure: B-tree (balanced tree). Default for all indexes.

B-tree on age:
         [25]
        /    \
    [21,24]  [28,31]
       |         |
  page ptrs   page ptrs
  • Lookup cost: O(log n) — walk the tree to find the page pointer, then fetch the page
  • Range scan cost: O(log n + k) — find the start, then scan leaf nodes sequentially
  • Write cost: Every insert/update must update the B-tree — O(log n)
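
To confirm which path the planner actually takes, EXPLAIN shows whether a query walks the B-tree or falls back to a sequential scan. A minimal sketch — assuming the users table is large enough that the planner prefers the index:

CREATE INDEX idx_users_age ON users (age);

EXPLAIN SELECT * FROM users WHERE age = 25;
-- Plan typically shows: Index Scan using idx_users_age on users

EXPLAIN SELECT * FROM users WHERE age BETWEEN 22 AND 28;
-- Same index, range condition: find the starting leaf, then scan leaves sequentially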

JSONB indexes:

-- Functional index: index a specific path inside JSON
CREATE INDEX ON users ((profile -> 'address' ->> 'city'));

-- GIN index: index all keys inside the JSON blob
CREATE INDEX ON users USING GIN (profile jsonb_path_ops);

JSONB with a GIN index lets you search inside JSON efficiently — but the underlying storage is still row-oriented. The whole row is fetched even if you only need one JSON field.
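
For reference, these are the query shapes each index serves — the functional index answers equality on that one extracted path, while the GIN index answers containment (@>) queries anywhere in the blob. A sketch against the same profile column:

-- Served by the functional index on (profile -> 'address' ->> 'city')
SELECT * FROM users
WHERE profile -> 'address' ->> 'city' = 'Bangalore';

-- Served by the GIN (jsonb_path_ops) index: containment query
SELECT * FROM users
WHERE profile @> '{"address": {"city": "Bangalore"}}';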

MVCC (Multi-Version Concurrency Control):

Postgres never overwrites a row in place. Every UPDATE writes a new version of the row (a new tuple) and marks the old one as dead. This enables transactions without locking readers.

Dead tuples accumulate and must be cleaned up by VACUUM — a maintenance cost that comes directly from Postgres keeping old row versions in the heap itself (other engines typically use undo logs instead).

Before UPDATE:
  [Row v1: id=1, age=25]  ← live

After UPDATE (age=26):
  [Row v1: id=1, age=25]  ← dead (marked, not deleted yet)
  [Row v2: id=1, age=26]  ← live

After VACUUM:
  [Row v1: id=1, age=25]  ← reclaimed
  [Row v2: id=1, age=26]  ← live
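
Postgres exposes the resulting bloat through its statistics views. A quick, read-only way to see dead tuples piling up and when autovacuum last ran:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'users';
-- A large n_dead_tup relative to n_live_tup means VACUUM is falling behind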

Query Patterns Where Postgres Excels#

1. Fetch one record by key

SELECT * FROM users WHERE id = 'u123';
-- B-tree index → O(log n) → fetch one page → done
-- Excellent. This is the sweet spot.

2. Flexible multi-column queries

SELECT * FROM orders
WHERE user_id = 'u123'
AND status = 'pending'
AND created_at > '2024-01-01';
-- A compound index on (user_id, status, created_at) can serve all three predicates
-- Postgres query planner picks the optimal strategy

3. Joins across tables

SELECT u.name, o.amount, p.name AS product
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id
WHERE u.city = 'Bangalore';
-- Postgres executes nested loop, hash join, or merge join
-- depending on table sizes and available indexes

4. Aggregations on moderate data

SELECT city, COUNT(*), AVG(age)
FROM users
GROUP BY city;
-- Fine up to tens of millions of rows
-- Hits ceiling at hundreds of millions without partitioning

5. ACID transactions across multiple tables

BEGIN;
  UPDATE accounts SET balance = balance - 500 WHERE id = 'a1';
  UPDATE accounts SET balance = balance + 500 WHERE id = 'a2';
  INSERT INTO audit_log (action, amount) VALUES ('transfer', 500);
COMMIT;
-- Either all three happen or none do
-- Postgres's strongest unique capability

Query Patterns Where Postgres Hits the Ceiling#

Full column scan across billions of rows

-- Slow: reading all rows to extract one column
SELECT AVG(revenue) FROM orders;  -- 2 billion rows
-- Must read every page on disk
-- Even with index, a full scan touches everything

Unbounded time-series appends

-- Millions of inserts per second into one table
-- MVCC dead tuple accumulation, VACUUM can't keep up
-- WAL (write-ahead log) becomes a bottleneck

Ad-hoc queries on massive denormalized tables

-- 500 million rows, no partition key in query
-- Full table scan → minutes, not milliseconds

Mitigation: Table partitioning extends the ceiling significantly:

CREATE TABLE events (id BIGINT, created_at TIMESTAMPTZ NOT NULL, payload JSONB)  -- columns illustrative
  PARTITION BY RANGE (created_at);
CREATE TABLE events_2024_01 PARTITION OF events
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- Query with a created_at filter → touches only the relevant partition
-- Still row-oriented, but much smaller scan surface

The Storage Layout Ceiling for Postgres#

| Access pattern | Performance | Reason |
|---|---|---|
| Row by primary key | Excellent | B-tree → one page read |
| Row by indexed column | Excellent | B-tree → one page read |
| Range scan, small result | Good | Sequential page reads |
| Full table aggregation | Degrades at scale | All pages must be read |
| Single column across all rows | Poor at scale | Row layout forces reading all columns |
| High write throughput (millions/sec) | Hits ceiling | Single-node write path, MVCC overhead |

3. MongoDB#

The Problem It Was Built to Solve#

Variable-structure data — objects that differ in shape per record — where you almost always need the full object, and where you want flexibility to evolve your schema without migrations.


Storage Layout on Disk#

Storage engine: WiredTiger (default since MongoDB 3.2).

Format: BSON (Binary JSON) — JSON parsed once at write time, stored as binary.

JSON (text):   { "name": "Shreya", "age": 25 }  ← text, re-parsed on every read
BSON (binary): \x1a\x00\x00\x00\x02name...      ← length-prefixed, typed, parsed once at write time

Layout: Document-oriented

Each document is stored as a complete BSON blob. The entire object lives together on disk.

WiredTiger Page (~16KB):
┌──────────────────────────────────────────────────────────────────┐
│ Doc u123: { name:"Shreya", age:25, address:{city:"BLR"},        │
│             languages:["Python","Go"], open_to_work:true }       │
│                                                                  │
│ Doc u124: { name:"Rahul", specialty:"neurology",                │
│             license:"MH-2891" }  ← different shape, same page   │
│                                                                  │
│ Doc u125: { name:"Priya", age:22, tags:["ml","python"] }        │
└──────────────────────────────────────────────────────────────────┘

Key implication: Fetching one document fetches the entire object — name, address, nested fields, arrays, everything. This is fast when you need the whole object. It's wasteful when you only need one field from a large document.


Two B-trees, Not One#

This is the detail most engineers miss:

B-tree 1: The collection itself — keyed by _id. Documents are physically sorted by _id on disk.

Collection B-tree (keyed by _id):
  _id: "u101" → BSON blob of doc u101
  _id: "u123" → BSON blob of doc u123
  _id: "u200" → BSON blob of doc u200

B-tree 2: Each index — keyed by the indexed field, pointing back to _id.

Index B-tree (keyed by age):
  22 → _id: "u125"
  25 → _id: "u123"   ← Step 1: find _id
  28 → _id: "u089"
            ↓
         Step 2: look up _id in collection B-tree
            ↓
         Fetch BSON blob → full document

An indexed query always does two B-tree lookups. The index finds the _id; the collection B-tree fetches the document.
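
You can watch the two-step lookup in the planner output: explain() shows an IXSCAN stage (the index B-tree) feeding a FETCH stage (the collection B-tree). A sketch, assuming an index on age as in the diagram:

db.users.createIndex({ age: 1 })

db.users.find({ age: 25 }).explain("executionStats")
// winningPlan: FETCH (fetch full document) ← IXSCAN (walk the age index)
// executionStats: totalKeysExamined = index entries touched,
//                 totalDocsExamined = documents fetched from the collection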


Multikey Index: The Array Cost#

When you index an array field, MongoDB explodes it — one index entry per array element:

Document: { _id: "u123", languages: ["Python", "Go", "Rust"] }

Index entries created:
  "Go"     → _id: "u123"
  "Python" → _id: "u123"
  "Rust"   → _id: "u123"

The cost:

10,000 documents × 1,000 array elements = 10,000,000 index entries

Impact:
  - Index may not fit in RAM → spills to disk → slow queries
  - Every array modification = multiple index updates
  - Write throughput degrades with array size

Rule: Never create a multikey index on an unbounded array.
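
For completeness: nothing in the syntax marks an index as multikey — MongoDB flips it automatically the moment an indexed field contains an array. A minimal sketch:

// Same syntax as any single-field index; it becomes multikey
// because languages holds arrays in existing documents
db.users.createIndex({ languages: 1 })
// explain() on a query using it reports isMultiKey: true in the IXSCAN stage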


Embed vs Reference: The Core Design Decision#

Embed when:                          Reference when:
─────────────────────────────────    ────────────────────────────────
Always accessed together             Accessed independently
Bounded size (won't grow forever)    Unbounded (can grow forever)
No independent existence             Has its own lifecycle
"Address" inside "User"              "Orders" for a "User"

// Good embed:                       // Bad embed (use reference):
{                                    { _id: "u123",
  _id: "u123",                         orders: [...1000 orders...] }
  address: {                         
    city: "Bangalore"                // Good reference:
  }                                  // orders collection:
}                                    { _id: "o1", user_id: "u123" }

Query Patterns Where MongoDB Excels#

1. Fetch full object by key

db.users.findOne({ _id: "u123" })
// One B-tree lookup → one page read → complete document
// Fastest possible access pattern for document stores

2. Query on nested fields natively

db.users.find({ "address.city": "Bangalore" })
// Dot notation is native — no operator tricks
// Works with index: db.users.createIndex({ "address.city": 1 })

3. Array membership

db.users.find({ languages: "Python" })
// Multikey index makes this O(log n)
// No JOIN to a separate table needed

4. Variable-shape documents in one collection

// Doctors, musicians, developers — all in users collection
// Each has different fields — no NULLs, no separate tables
db.users.find({ profile_type: "doctor", specialty: "neurology" })

5. Aggregation pipeline

db.orders.aggregate([
  { $match: { status: "completed" } },
  { $group: { _id: "$user_id", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } },
  { $limit: 10 }
])
// Flexible, but slower than SQL for complex multi-stage aggregations

Query Patterns Where MongoDB Hits the Ceiling#

Cross-document joins

// No native join — must use $lookup (slow) or application-side join
// If you find yourself doing this often, wrong database
db.orders.aggregate([
  { $lookup: { from: "users", localField: "user_id", foreignField: "_id", as: "user" } }
])

Partial field queries at scale

// Only need name and email from 10 million users
db.users.find({}, { name: 1, email: 1 })
// Still reads entire document BSON blob per match
// Projection is applied after the fetch — unless the query is covered by an index, storage is read in full

High write throughput sustained

// MongoDB shards horizontally — but sharding adds complexity
// Single shard writes limited by one WiredTiger instance
// At Cassandra-level write volumes, MongoDB sharding is harder to operate

The Storage Layout Ceiling for MongoDB#

| Access pattern | Performance | Reason |
|---|---|---|
| Full document by _id | Excellent | One B-tree lookup, one page |
| Full document by indexed field | Very good | Two B-tree lookups |
| Nested field query | Very good | Native dot notation + index |
| Array membership query | Good (bounded arrays) | Multikey index |
| Partial field scan at scale | Degrades | Whole doc read for each match |
| Cross-collection joins | Poor | No native join; $lookup is expensive |
| Unbounded array indexing | Dangerous | Index explodes in size |
| Massive write throughput | Hits ceiling | Single-node WiredTiger limit |

4. Cassandra#

The Problem It Was Built to Solve#

Facebook's inbox (2008): hundreds of millions of users, millions of messages per minute, always the same query shape — "give me user X's messages." No single machine could handle the write volume. Manual Postgres sharding was fragile and painful.

Cassandra was designed with one goal: make write throughput and horizontal scale a first-class property, not an afterthought.


Storage Layout on Disk#

Storage engine: LSM-tree (Log-Structured Merge-tree) — fundamentally different from B-trees.

Layout: Column-family oriented

Data is organized by partition key first, then by column families within each partition.

Partition: user_id = "u123"   (lives on Node A)
┌─────────────────────────────────────────────────────┐
│  sent_at: 2024-01-15  │ message_id │ content        │
│  sent_at: 2024-01-14  │ message_id │ content        │
│  sent_at: 2024-01-13  │ message_id │ content        │
│  ... sorted by clustering key (sent_at) ...         │
└─────────────────────────────────────────────────────┘

Partition: user_id = "u124"   (lives on Node B)
┌─────────────────────────────────────────────────────┐
│  sent_at: 2024-01-15  │ message_id │ content        │
│  ...                                                │
└─────────────────────────────────────────────────────┘

All of one user's data lives on one node (+ replicas). Fetching a user's messages is one network hop to one node, then a sequential disk read within the partition.


The LSM-tree Write Path — Why Writes Are So Fast#

This is Cassandra's core innovation over B-tree databases:

Step 1: Write to memory (MemTable)

New message arrives → written to in-memory MemTable instantly
Write is confirmed to client immediately
No disk I/O on the write path

Step 2: Sequential flush to disk (SSTable)

When MemTable fills up → flushed to disk as an SSTable file
SSTable = Sorted String Table = immutable, sorted, sequential file
Sequential writes are 10-100x faster than random writes

Step 3: Compaction (background)

Multiple SSTables accumulate → merged and compacted in background
Compaction removes deleted data and merges updates
Read performance maintained without blocking writes
Write path:
  Client write → MemTable (RAM) → confirmed ✓
                      ↓ (async, when full)
                  SSTable on disk (sequential)
                      ↓ (background)
                  Compaction → merged SSTable

Compare to B-tree (Postgres):
  Client write → find correct B-tree position → random disk write → confirmed ✓
  (random writes are slow at scale)

Key implication: Cassandra trades read complexity for write speed. Reads may need to check multiple SSTables and the MemTable to assemble the current value. Compaction is the background process that keeps reads fast.
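
If you want to watch these stages on a live cluster, nodetool exposes them. The commands below are standard Cassandra tooling; the keyspace and table names are illustrative:

# Force the MemTable for one table to flush to an SSTable on disk
nodetool flush messaging inbox

# How many SSTables currently back the table (reads may need to check several)
nodetool tablestats messaging.inbox

# Background compaction activity merging SSTables
nodetool compactionstats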


The Ring Architecture#

         Node A (tokens 0–25%)
        /                     \
Node D (75–100%)          Node B (25–50%)
        \                     /
         Node C (50–75%)
  • No master/leader. Every node is equal.
  • Partition key is hashed to determine which node owns that data.
  • Replication factor (typically 3) means data lives on 3 nodes.
  • Adding a node automatically rebalances — no manual resharding.
Write flow:
  1. Hash(user_id) → determines coordinator node
  2. Coordinator writes to owner node + 2 replica nodes
  3. Once quorum (2 of 3) confirms → acknowledged to client
  
If one node is down:
  Write still succeeds (2 remaining nodes confirm quorum)
  Downed node catches up when it rejoins (hinted handoff)
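
Replication factor and quorum are both things you declare explicitly. A minimal sketch in CQL/cqlsh — keyspace and datacenter names are illustrative:

-- Keyspace with 3 replicas per datacenter
CREATE KEYSPACE messaging
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- cqlsh setting: require 2 of 3 replicas to acknowledge each read/write
CONSISTENCY QUORUM;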

Query Patterns: The Schema = The Query#

In Cassandra, your table schema is your query. You design tables to answer specific questions, not to model your domain.

-- Design for the query: "give me user X's messages, newest first"
CREATE TABLE inbox (
  user_id    UUID,
  sent_at    TIMESTAMP,
  message_id UUID,
  content    TEXT,
  sender_id  UUID,
  PRIMARY KEY (user_id, sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);

user_id = partition key → which node holds this data
sent_at = clustering key → sort order within the partition

-- This query is instant — hits one partition, sequential read
SELECT * FROM inbox
WHERE user_id = 'u123'
AND sent_at > '2024-01-14'
LIMIT 50;

-- This query is catastrophic — no partition key
SELECT * FROM inbox WHERE content LIKE '%urgent%';
-- Cassandra rejects it outright unless forced with ALLOW FILTERING or a custom index —
-- and then EVERY node scans EVERY partition. Do not do this in production.

Query Patterns Where Cassandra Excels#

1. Fetch one entity's time-ordered data

SELECT * FROM inbox
WHERE user_id = 'u123'
AND sent_at > toTimestamp(now()) - 86400s
LIMIT 100;
-- One partition → one node → sequential disk read → milliseconds

2. High-throughput append-only writes

5 million writes/second:
  - Each write goes to the node that owns that partition
  - No single write bottleneck
  - MemTable absorbs writes in RAM
  - Sequential flush to SSTables on disk
  - Adding nodes linearly increases write capacity

3. Always-on availability

Node A goes down:
  Reads and writes route to replica nodes
  No downtime, no manual failover
  System degrades gracefully

4. Time-series per entity

-- Netflix: user's viewing history
CREATE TABLE viewing_history (
  user_id   UUID,
  watched_at TIMESTAMP,
  show_id   UUID,
  progress  INT,
  PRIMARY KEY (user_id, watched_at)
) WITH CLUSTERING ORDER BY (watched_at DESC);

-- Uber: driver location pings
CREATE TABLE driver_locations (
  driver_id UUID,
  recorded_at TIMESTAMP,
  lat FLOAT,
  lng FLOAT,
  PRIMARY KEY (driver_id, recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);

Query Patterns Where Cassandra Hits the Ceiling#

Any query without the partition key

-- These require ALLOW FILTERING to even run — full cluster scans, catastrophically slow
SELECT * FROM inbox WHERE sender_id = 'u456';
SELECT * FROM inbox WHERE content = 'hello';
SELECT COUNT(*) FROM inbox;  -- runs without ALLOW FILTERING, but still touches every node — never do this

Joins across tables

Cassandra has no JOIN.
Cross-entity queries must be done in application code.
Design your data model so you never need to join.

Ad-hoc analytics

-- "What % of users sent messages in the last 24 hours?"
-- Requires scanning all partitions across all nodes
-- Cassandra is the wrong tool for this query

Transactions across multiple partitions

Cassandra supports lightweight transactions (LWT) within one partition.
Cross-partition transactions do not exist.
If your use case needs them: wrong database.

The Storage Layout Ceiling for Cassandra#

| Access pattern | Performance | Reason |
|---|---|---|
| Partition key lookup | Excellent | One node, sequential read |
| Partition key + time range | Excellent | Sequential scan within partition |
| High-throughput writes | Excellent | LSM-tree, no write bottleneck |
| Query without partition key | Catastrophic | Full cluster scan |
| Cross-partition aggregation | Catastrophic | Touches every node |
| Joins | Not possible | No native join |
| Flexible ad-hoc queries | Not possible | Schema encodes query |
| ACID transactions | Not possible | Eventually consistent by design |

5. Time-Series: TimescaleDB & InfluxDB#

The Problem They Were Built to Solve#

Measurements of phenomena over time — server CPU, IoT sensor readings, stock prices, application metrics — where you need time-aware queries (rolling averages, downsampling, rate of change) that general-purpose databases handle awkwardly.


TimescaleDB#

What it is: PostgreSQL with a time-series extension. Runs as Postgres. You write SQL. You get time-series superpowers.

Storage layout: Postgres's row-oriented heap, but with automatic hypertable partitioning by time:

hypertable: sensor_readings
  ├── chunk: 2024-01-01 to 2024-01-07  (one physical table)
  ├── chunk: 2024-01-08 to 2024-01-14  (one physical table)
  ├── chunk: 2024-01-15 to 2024-01-21  (one physical table, "hot")
  └── chunk: 2024-01-22 to now         (current write target)

Each chunk is an independent Postgres table. New writes normally land only in the most recent (hot) chunk. Old chunks can be compressed, moved to cheaper storage, or converted to columnar format.
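
For reference, a hypertable starts life as an ordinary Postgres table and is converted with one call. A sketch — table, columns, and the one-week chunk interval are illustrative, chosen to match the diagram above:

CREATE TABLE sensor_readings (
  time      TIMESTAMPTZ NOT NULL,
  sensor_id TEXT        NOT NULL,
  value     DOUBLE PRECISION
);

-- Convert to a hypertable, chunked by time into 1-week partitions
SELECT create_hypertable('sensor_readings', 'time',
                         chunk_time_interval => INTERVAL '7 days');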

Why this helps:

  • Write to one hot chunk → avoids index fragmentation on the full table
  • Range queries hit only relevant chunks → smaller scan surface
  • Old chunks compress up to 90–95% (time-series data compresses extremely well)

Columnar compression on old chunks:

ALTER TABLE sensor_readings SET (
  timescaledb.compress,
  timescaledb.compress_orderby = 'time DESC',
  timescaledb.compress_segmentby = 'sensor_id'
);
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');
-- Chunks older than 7 days → columnar compressed automatically

InfluxDB#

What it is: Purpose-built time-series database. Not SQL. Custom query language (Flux).

Storage layout: TSM (Time-Structured Merge Tree) — a variant of LSM-tree optimized for time-series:

In-memory WAL (Write-Ahead Log):
  sensor=temp01, time=1705312200, value=72.4
  sensor=temp01, time=1705312205, value=72.6
  sensor=cpu01,  time=1705312205, value=0.84
  ↓ (flushed periodically)

TSM File (on disk, sorted by series key + time):
  series: sensor=cpu01
    time=1705312200 → 0.82
    time=1705312205 → 0.84
    time=1705312210 → 0.81
  series: sensor=temp01
    time=1705312200 → 72.4
    time=1705312205 → 72.6

Each measurement series (a specific metric on a specific entity) is stored contiguously. Reading a series over a time range is a sequential read — very fast.

Key concepts:

  • Measurement = table name equivalent (e.g. cpu_usage)
  • Tag = indexed metadata (e.g. host=server01) — used for filtering
  • Field = the actual measurement value (e.g. value=0.84) — not indexed
  • Timestamp = always the primary axis
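
These four concepts map directly onto InfluxDB's line protocol for writes and Flux for reads. A small sketch — the bucket name and tag values are illustrative:

# Line protocol (writes): measurement,tags fields timestamp
cpu_usage,host=server01 value=0.84 1705312205000000000

// Flux (reads): average CPU for one host, last hour, 1-minute windows
from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu_usage" and r.host == "server01")
  |> aggregateWindow(every: 1m, fn: mean)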

Query Patterns Where Time-Series DBs Excel#

1. Time-range queries with aggregation

-- TimescaleDB
SELECT
  time_bucket('5 minutes', time) AS bucket,
  sensor_id,
  AVG(value) AS avg_temp,
  MAX(value) AS max_temp
FROM sensor_readings
WHERE sensor_id = 'temp01'
  AND time > NOW() - INTERVAL '24 hours'
GROUP BY bucket, sensor_id
ORDER BY bucket DESC;

-- This is the native query shape. Extremely fast.

2. Cross-entity aggregation over time

-- "Average CPU across all servers, last hour, per 1-minute window"
SELECT
  time_bucket('1 minute', time) AS bucket,
  AVG(cpu_usage) AS avg_cpu
FROM metrics
WHERE time > NOW() - INTERVAL '1 hour'
GROUP BY bucket
ORDER BY bucket;

-- Time-series DB scans the relevant time range only
-- Cassandra would require scanning all partitions — catastrophic

3. Downsampling (continuous aggregates)

-- TimescaleDB: pre-aggregate automatically
CREATE MATERIALIZED VIEW hourly_avg
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', time) AS bucket,
  sensor_id,
  AVG(value) AS avg_val
FROM sensor_readings
GROUP BY bucket, sensor_id;

-- Dashboard queries hit the view, not raw data
-- Near-instant response for charts

4. Data retention policies

-- Auto-delete data older than 90 days
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');
-- Drops entire old chunks — O(1) operation, no row-by-row delete

Query Patterns Where Time-Series DBs Hit the Ceiling#

Point lookups on non-time dimensions

-- "Get all readings for sensor temp01 on Jan 15th" — fine
-- "Get all sensors whose readings exceeded 100 at any point" — slow
-- Time is the primary access key; non-time filters are secondary

Relationships between entities

Time-series DBs don't model relationships.
"Which user triggered this metric?" → wrong question for this tool.
Use your OLTP database for entity relationships.

Write patterns without time as primary key

Every record must have a timestamp as the organizing axis.
If your data isn't measurements over time, wrong tool.

The Storage Layout Ceiling for Time-Series DBs#

| Access pattern | Performance | Reason |
|---|---|---|
| Range query on one series + time | Excellent | Contiguous on disk |
| Cross-series aggregation + time | Very good | Chunk pruning + columnar compression |
| Continuous aggregates | Excellent | Pre-computed, materialized |
| Point lookup by entity ID | Good | Partition pruning by series key |
| Non-time-range queries | Degrades | Time is the primary organizing axis |
| Joins across entities | Not native | No relational model |
| ACID transactions | Limited | Not the design goal |

6. Side-by-Side Comparison#

Storage Layout#

| Database | Layout | Physical unit | Write style |
|---|---|---|---|
| PostgreSQL | Row-oriented heap | 8KB page | Random write to B-tree |
| MongoDB | Document-oriented (BSON) | Variable BSON blob in WiredTiger page | Random write to B-tree |
| Cassandra | Column-family + LSM-tree | Partition | Sequential append (MemTable → SSTable) |
| TimescaleDB | Row-oriented heap + time chunks | 8KB page in chunk | Sequential (hot chunk) |
| InfluxDB | TSM (series + time) | TSM file per series | Sequential WAL → TSM |

Performance Profile#

| Database | Best read | Best write | Worst query |
|---|---|---|---|
| PostgreSQL | Row by index | Transactional, moderate | Full column scan at scale |
| MongoDB | Full document by key | Document insert | Cross-collection join |
| Cassandra | Partition key lookup | Massive parallel append | No partition key |
| TimescaleDB | Time range + aggregation | Sequential to hot chunk | Non-time-axis queries |
| InfluxDB | Series + time range | High-throughput metrics | Non-time-axis queries |

Consistency Model#

| Database | Consistency | Transactions | Tradeoff |
|---|---|---|---|
| PostgreSQL | Strong ACID | Full multi-table | Throughput limited by single leader |
| MongoDB | Strong (per-doc), tunable | Multi-doc since v4.0 (costly) | Flexibility vs consistency |
| Cassandra | Eventual (tunable to strong) | Within one partition only | Availability over consistency |
| TimescaleDB | Strong ACID (inherits Postgres) | Full Postgres transactions | Same limits as Postgres |
| InfluxDB | Eventual | None | Write throughput over consistency |

7. Decision Tree#

START: What does your data actually look like?
│
├── Uniform rows, same shape per record, relationships between entities
│   └── Do you need complex joins and ACID transactions?
│       YES → PostgreSQL
│       NO  → continue
│
├── Objects with different shapes per record
│   └── Do you almost always need the full object?
│       YES → MongoDB
│       NO (need partial fields + SQL) → PostgreSQL + JSONB column
│
├── High-volume writes (millions/min), same query shape always
│   └── Is your query always "give me entity X's data for time range Y"?
│       YES → Cassandra
│       NO (need flexible queries too) → wrong tool; consider Postgres first
│
├── Measurements over time (metrics, sensors, prices)
│   └── Do you need time-aware aggregations (avg, rate, downsample)?
│       YES → TimescaleDB (if you want SQL) or InfluxDB (purpose-built)
│       └── Do you also need joins and transactions alongside this?
│           YES → TimescaleDB (it's still Postgres underneath)
│           NO  → InfluxDB (higher write throughput, simpler)
│
└── Historical analysis, BI reports, ad-hoc questions across all data
    └── OLAP (BigQuery, Redshift, ClickHouse)
        Does data freshness need to be real-time (seconds)?
        YES → ClickHouse (handles higher-frequency ingestion)
        NO  → BigQuery / Redshift (batch ETL, minutes to hours lag)

The Refined Decision: Postgres First#

Before reaching for a specialized database, ask:

1. Can Postgres handle my write volume?
   Postgres handles ~10,000 writes/second on modest hardware.
   Most systems never exceed this. If yours doesn't → use Postgres.

2. Can Postgres handle my data with JSONB?
   Variable structure? Use a JSONB column for the flexible part.
   Structured envelope + flexible payload → Postgres + JSONB.

3. Can table partitioning solve my scale problem?
   Partition by time or by a high-cardinality field.
   Dramatically reduces scan surface. Extends Postgres life.

Only after exhausting these options → reach for a specialized database.

Use Case → Database Mapping#

| System | Database | Core reason |
|---|---|---|
| E-commerce backend | PostgreSQL | Transactions, orders, inventory |
| Audit log | PostgreSQL + JSONB | Structured envelope + variable payload + immutability via triggers |
| Product catalog | MongoDB | Each product type has different fields |
| User messaging (small-medium) | PostgreSQL | Transactions, moderate volume |
| User messaging (Facebook/Discord scale) | Cassandra | Millions of writes/min, fixed query |
| Netflix viewing history | Cassandra | Per-user partition, time-ordered, massive writes |
| Server/application metrics | TimescaleDB or InfluxDB | Time-aware aggregations, retention policies |
| IoT sensor data (moderate) | TimescaleDB | SQL familiar, chunked storage, compression |
| IoT sensor data (extreme volume) | InfluxDB or Cassandra | Purpose-built throughput |
| Business intelligence / reports | BigQuery / Redshift | Aggregations across all historical data |
| Real-time analytics | ClickHouse | Columnar, high ingestion, near-real-time |

8. The Ceiling Principle#

The deepest insight across all of this:

Every database is a set of bets. When engineers designed it, they chose which access patterns would be fast — and that choice is baked into the storage layout. Everything else is engineering around that choice.

PostgreSQL bet:   "Rows with relationships are the universal data shape.
                   Flexibility and correctness matter most."

MongoDB bet:      "Objects are the universal data shape.
                   Flexibility and locality matter most."

Cassandra bet:    "Write volume and availability are the hard constraints.
                   We will sacrifice query flexibility to solve them."

Time-series bet:  "Time is the universal access axis for measurements.
                   Time-aware queries should be trivially simple."

OLAP bet:         "Analytics scan everything.
                   Column access is the only thing that matters."

Indexes, caching, and query tuning operate at the surface. They extend how far you can push a database within its bets. But they cannot change what the bets were.

The ceiling is set at database selection time — before a single line of application code is written.

When you choose a database, you are choosing which queries will always be fast, which will be manageable with tuning, and which will always be slow no matter what you do.

Choose based on your access patterns. Not based on what's popular. Not based on what your team already knows (though that's a valid pragmatic factor). Not based on "NoSQL is modern."

The right database is the one whose bets align with your access patterns.


Built from first principles — not a list of facts, but a framework for reasoning about any database you encounter.
