A practical reference for engineers designing distributed systems — from quick prototypes to large-scale production architectures.
1. Why Queues Exist — The Mental Model#
Before picking a tool, understand the three fundamental problems queues solve:
Problem 1: Rate Mismatch — A producer generates events faster than a consumer can process them. A queue acts as a buffer, absorbing spikes and letting consumers work at their own pace.
Example: Your API receives 10,000 order events per second during a flash sale, but your email service can only send 500/second. A queue holds the backlog and drains it over time.
Problem 2: Decoupling — Service A shouldn't need to know Service B exists. With a queue, producers publish events into a channel — any number of consumers can subscribe without the producer caring.
Problem 3: Reliability & Fault Tolerance — If Service B goes down, a queue persists messages and re-delivers them when the consumer comes back.
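The rate-mismatch problem above can be seen in a toy simulation — this is not any real broker, just a `deque` standing in for the queue, with a producer bursting at 10 events/tick while the consumer drains 5/tick:

```python
from collections import deque

queue = deque()
backlog_history = []

# Producer bursts at 10 msgs/tick for 3 ticks, then stops;
# the consumer steadily drains up to 5 msgs/tick.
for tick in range(8):
    produced = 10 if tick < 3 else 0
    for i in range(produced):
        queue.append(f"event-{tick}-{i}")
    for _ in range(min(5, len(queue))):
        queue.popleft()  # consumer works at its own pace
    backlog_history.append(len(queue))

print(backlog_history)  # [5, 10, 15, 10, 5, 0, 0, 0] — grows, then drains
```

The backlog peaks during the spike and drains to zero afterward — exactly the buffering behavior the flash-sale example describes.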
The Two Fundamental Delivery Models#
Message Queue (Point-to-Point): One message is consumed by exactly one consumer. Once processed, it is deleted.
Producer → [Queue] → Consumer A (message gone after processing)
Use when distributing work — sending an email, processing a payment, resizing an image.
Pub/Sub (Publish-Subscribe): One message is delivered to ALL subscribers. Each subscriber gets its own copy.
Producer → [Topic] → Consumer A (gets a copy)
                   → Consumer B (gets a copy)
                   → Consumer C (gets a copy)
Use when broadcasting events — an "order placed" event that triggers billing, inventory, and notifications simultaneously.
Critical Insight: Most queues support both models, but are optimized for one. Kafka is optimized for Pub/Sub at scale. RabbitMQ is optimized for flexible routing.
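The two delivery models can be contrasted in a few lines of plain Python — in-memory lists stand in for queues and topics, and round-robin stands in for the broker's dispatch policy:

```python
import itertools

# Point-to-point: competing consumers; each message goes to exactly one.
messages = ["m1", "m2", "m3", "m4"]
workers = {"A": [], "B": []}
for msg, name in zip(messages, itertools.cycle(workers)):  # round-robin dispatch
    workers[name].append(msg)

# Pub/Sub: every subscriber receives its own copy of every message.
subscribers = {"billing": [], "inventory": [], "notify": []}
for msg in messages:
    for inbox in subscribers.values():
        inbox.append(msg)

print(workers)      # {'A': ['m1', 'm3'], 'B': ['m2', 'm4']}
print(subscribers)  # every subscriber holds all four messages
```

Work is split in the first model; it is duplicated to everyone in the second.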
2. The Four Players: Quick Summary#
| Queue | Best One-Line Description | Primary Model |
|---|---|---|
| Kafka | A distributed, persistent, ordered event log | Pub/Sub (with consumer groups) |
| RabbitMQ | A flexible, feature-rich message broker | Point-to-Point (with Pub/Sub) |
| Amazon SQS | A simple, fully managed cloud task queue | Point-to-Point |
| Redis Pub/Sub | An in-memory, fire-and-forget broadcast channel | Pub/Sub (ephemeral) |
3. Apache Kafka#
Kafka is not a traditional message queue — it's a distributed commit log. Consumers don't delete messages; they track their own offset in the log.
[Partition 0]: event1 | event2 | event3 | event4 | event5 ...
                                        ↑                   ↑
                        Consumer A is at offset 3    Consumer B is at offset 5
Because nothing is deleted, multiple consumer groups can independently read the same data at different positions.
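The offset mechanics can be sketched with a plain Python list as the log — a toy model, not the Kafka client API, but the read/commit/rewind behavior is the same idea:

```python
# The log is append-only; each consumer group just remembers its own offset.
log = ["event1", "event2", "event3", "event4", "event5"]
offsets = {"consumer_a": 3, "consumer_b": 5}

def poll(group, max_records=2):
    """Read from the group's current offset; nothing is ever deleted."""
    start = offsets[group]
    records = log[start:start + max_records]
    offsets[group] += len(records)  # commit the new offset
    return records

print(poll("consumer_a"))  # ['event4', 'event5'] — picks up where A left off
print(poll("consumer_b"))  # [] — B is already caught up
offsets["consumer_a"] = 0  # replay: just rewind the offset
print(poll("consumer_a"))  # ['event1', 'event2']
```

Replay is nothing more than moving the cursor — the data was never gone.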
Core Concepts:
- Topic — A named stream of events
- Partition — A topic is split into N partitions, each an independent ordered log. More partitions = more parallelism
- Consumer Group — Consumers that collectively read a topic; each partition goes to exactly one consumer
- Offset — A cursor marking where a consumer is in a partition
- Retention — Kafka retains messages for a configured period (e.g., 7 days) regardless of consumption
Key Features:
- Ordering — Guaranteed within a partition. Use a partition key (e.g., user ID) to ensure all events for one user go to the same partition
- Replay / Rewind — Reset a consumer's offset to any point in time and replay events
- Exactly-Once Semantics — Supported but complex; at-least-once + idempotent consumers is simpler for most cases
- Log Compaction — Keeps only the latest message per key, useful for maintaining state snapshots
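Two of the features above — partition keys and log compaction — reduce to small ideas. Kafka's default partitioner uses a murmur2 hash of the key; the deterministic toy hash below just stands in for it, and the three-partition count is an assumption:

```python
NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    # Stand-in for Kafka's murmur2-based partitioner: any stable hash works
    # to show that one key always maps to one partition.
    return sum(key.encode()) % NUM_PARTITIONS

assert partition_for("user-42") == partition_for("user-42")  # deterministic

# Log compaction: keep only the latest record per key.
log = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2"), ("user-1", "v3")]
compacted = dict(log)  # later entries overwrite earlier ones per key
print(compacted)  # {'user-1': 'v3', 'user-2': 'v1'}
```

Because every `user-42` event hashes to the same partition, per-user ordering holds; compaction turns the log into a latest-value snapshot per key.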
Use Kafka when: High throughput (millions of events/sec), multiple teams consuming the same stream, audit logs, event sourcing, stream processing, data pipelines.
Don't use Kafka when: Complex routing, small teams avoiding ops overhead, per-message TTL, simple task queues.
Gotchas: Partition count can be increased but never decreased, and increasing it changes key-to-partition mapping (breaking per-key ordering for new messages). Consumer lag monitoring is critical. Ordering is per-partition only.
4. RabbitMQ#
RabbitMQ is a traditional message broker implementing AMQP. Producers send to an exchange, which routes to queues via bindings. Messages are deleted after acknowledgment.
Exchange Types:
Direct Exchange — Routes to a queue whose binding key exactly matches the routing key.
Fanout Exchange — Broadcasts every message to ALL bound queues. Routing keys ignored.
Topic Exchange — Pattern matching with wildcards (* matches one word, # matches zero or more).
Headers Exchange — Routes based on message header attributes.
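Topic-exchange matching is easy to get wrong, so here is the `*`/`#` semantics in a small standalone matcher — a sketch of the AMQP rules, not RabbitMQ's actual implementation:

```python
def topic_matches(pattern: str, key: str) -> bool:
    """AMQP topic semantics: '*' matches exactly one word, '#' zero or more."""
    return _match(pattern.split("."), key.split("."))

def _match(pat, words):
    if not pat:
        return not words          # pattern exhausted: match iff key is too
    head, rest = pat[0], pat[1:]
    if head == "#":
        # '#' may swallow zero or more whole words
        return any(_match(rest, words[i:]) for i in range(len(words) + 1))
    if not words:
        return False
    if head == "*" or head == words[0]:
        return _match(rest, words[1:])
    return False

print(topic_matches("notification.email.*", "notification.email.welcome"))  # True
print(topic_matches("notification.#", "notification.sms.us.now"))           # True
print(topic_matches("notification.*", "notification.sms.us"))               # False
```

Note the last case: `*` refuses a two-word tail, which is exactly why `#` exists.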
Key Features:
- Dead Letter Queues — Failed messages route to a DLQ for inspection and retry
- Message TTL — Set expiration on messages or entire queues
- Message Priority — Higher-priority messages delivered first (1–255 supported; values ≤ 10 recommended in practice)
- Delayed Messages — Via plugin, schedule messages for future delivery
- Quorum Queues — Raft-based replicated queues for production durability
Use RabbitMQ when: Complex routing, per-message TTL/priority, delayed delivery, task queues with DLQ, request/reply patterns.
Don't use RabbitMQ when: Millions of messages/second, message replay, multiple independent consumer groups on the same stream.
5. Amazon SQS#
SQS is Amazon's fully managed queue — nothing to install or scale.
- Standard Queue — At-least-once delivery, best-effort ordering, unlimited throughput
- FIFO Queue — Exactly-once processing, strict ordering within a message group, 300 TPS per API action (3,000 with batching)
- Visibility Timeout — Message becomes invisible after pickup; redelivered if consumer crashes before deleting it
- Long Polling — Keep the connection open for up to 20s waiting for a message (always use in production)
- SQS + SNS Fan-Out — SNS publishes to multiple SQS queues simultaneously (one per service)
- Lambda Integration — Native trigger: Lambda auto-scales with queue depth
Use SQS when: Fully on AWS, simple task distribution, zero infrastructure management, native AWS integration.
Don't use SQS when: Message replay, message priority, streaming, complex routing, not on AWS.
Gotchas: Standard queue can deliver duplicates — consumers must be idempotent. Visibility timeout must be longer than processing time. Messages up to 256KB only (use S3 + claim-check pattern for larger payloads).
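The visibility-timeout and idempotency gotchas interact, and a toy in-memory queue makes the interaction concrete — this mimics SQS semantics but is not the `boto3` API:

```python
import time

class ToyQueue:
    """At-least-once semantics: a message reappears if not deleted in time."""
    def __init__(self, visibility_timeout=1.0):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # msg_id -> (body, invisible_until)

    def send(self, msg_id, body):
        self.messages[msg_id] = (body, 0.0)

    def receive(self):
        now = time.monotonic()
        for msg_id, (body, invisible_until) in self.messages.items():
            if now >= invisible_until:
                # Hide it for the visibility window instead of deleting it.
                self.messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

q = ToyQueue(visibility_timeout=0.01)
q.send("m1", "charge order 42")

processed = set()            # idempotency guard keyed on message id
first = q.receive()          # worker picks the message up...
time.sleep(0.02)             # ...but "crashes" before deleting it
second = q.receive()         # same message is redelivered
print(first == second)       # True — hence consumers must be idempotent
if second[0] not in processed:
    processed.add(second[0])
    q.delete(second[0])      # only now is the message really gone
```

If processing had side effects (a payment charge), the `processed` check is what prevents the redelivery from charging twice.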
6. Redis Pub/Sub#
Redis Pub/Sub is a fire-and-forget broadcast channel. If a consumer is offline when a message is published, it misses that message forever — no persistence, no ACK, no retry.
Key Features:
- Sub-millisecond delivery latency (in-memory, no disk I/O)
- Pattern subscribe (PSUBSCRIBE order.*) for wildcard channel subscriptions
- Extremely fast fan-out to thousands of subscribers
Redis Streams (better alternative): Adds persistence, consumer groups, ACK, replay, and message history — essentially Kafka Lite for moderate scale.
Use Redis Pub/Sub when: Real-time presence updates, live dashboard updates, cache invalidation signals, ephemeral chat, you already have Redis.
Don't use when: Losing a message is unacceptable, you need history or replay, processing workloads.
Gotchas: No durability — Redis restart loses all in-flight messages. Slow consumers block publishing. Memory pressure disconnects slow subscribers.
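Fire-and-forget delivery and glob-style pattern subscription can both be shown in a toy bus — `fnmatchcase` approximates Redis's glob pattern matching here; this is a model of the semantics, not the `redis-py` client:

```python
from fnmatch import fnmatchcase

class ToyPubSub:
    """Fire-and-forget: delivery only to subscribers present at publish time."""
    def __init__(self):
        self.subscribers = {}  # name -> (pattern, inbox)

    def psubscribe(self, name, pattern):
        self.subscribers[name] = (pattern, [])

    def unsubscribe(self, name):
        self.subscribers.pop(name, None)

    def publish(self, channel, message):
        for pattern, inbox in self.subscribers.values():
            if fnmatchcase(channel, pattern):  # glob-style, like PSUBSCRIBE
                inbox.append((channel, message))

bus = ToyPubSub()
bus.psubscribe("dashboard", "order.*")
bus.publish("order.created", "#1001")     # dashboard is listening → delivered
inbox = bus.subscribers["dashboard"][1]
bus.unsubscribe("dashboard")
bus.publish("order.created", "#1002")     # nobody listening → dropped forever
print(inbox)  # [('order.created', '#1001')] — #1002 was never delivered
```

Note there is no retry path anywhere in `publish` — that absence is the whole point of the model.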
7. Feature Comparison Matrix#
Delivery Semantics#
| Feature | Kafka | RabbitMQ | SQS Standard | SQS FIFO | Redis Pub/Sub |
|---|---|---|---|---|---|
| At-least-once | ✅ | ✅ | ✅ | ✅ | ❌ (at-most-once) |
| Exactly-once | ✅ (transactions, complex) | ❌ | ❌ | ✅ | ❌ |
| Message replay | ✅ | ❌ | ❌ | ❌ | ❌ |
| Ordering | Per partition | Per queue | Best-effort | Strict (per group) | Publish order |
8. The Decision Framework#
Run through these questions in order:
Step 1: Distributing work or broadcasting events?
- Work (one message, one consumer) → SQS or RabbitMQ
- Events (one message, many consumers) → Kafka or SNS+SQS or Redis Pub/Sub
Step 2: Need message replay/rewind?
- Yes → Kafka (or Redis Streams for moderate scale)
- No → continue
Step 3: Throughput requirements?
- < 10K/sec → any queue, choose by features
- 500K+/sec → Kafka or SQS Standard
- Sub-millisecond to many subscribers → Redis Pub/Sub
Step 4: Complex routing (patterns, headers, content)?
- Yes → RabbitMQ
- No → continue
Step 5: Priority, TTL, or scheduled delivery?
- Priority or arbitrary delay → RabbitMQ
- Short delay (< 15 min) → SQS Delay Queue
Step 6: Infrastructure philosophy?
- Fully on AWS → SQS + SNS
- Open-source, control → Kafka or RabbitMQ
- Already have Redis, moderate scale → Redis Streams
Step 7: Durability requirements?
- Messages cannot be lost → Kafka, SQS, or RabbitMQ Quorum Queues
- Some loss acceptable → Redis Pub/Sub is fine
9. Common System Design Patterns#
Order Processing#
User Places Order → [Kafka/SNS] → Billing, Inventory, Notification, Fraud
Why Kafka: One event, multiple independent services. Each is a separate consumer group. Use SNS+SQS on AWS if no replay needed.
Background Job Queue#
API Server → [SQS/RabbitMQ] → Worker Pool
Why SQS/RabbitMQ: Classic task queue — one job, one worker. RabbitMQ if you need priority queues (premium users first).
Real-Time Analytics Pipeline#
App Events → [Kafka] → Stream Processor → Data Warehouse + Dashboard + Anomaly Detection
Why Kafka: High throughput, multiple consumers, durable log for reprocessing.
Notification Routing#
Events → [RabbitMQ Topic Exchange] → Email Queue, SMS Queue, Push Queue
Why RabbitMQ: Topic exchange routes notification.email.* to email queue, notification.sms.* to SMS queue.
Chat / Presence#
User message → [Redis Pub/Sub] → All users in room receive it
Why Redis: Messages are ephemeral; offline users catch up from a database, not the queue.
Delayed Jobs (> 15 minutes)#
Store in database with deliver_at timestamp. Scheduler service pushes to SQS/RabbitMQ at delivery time.
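The scheduler half of this pattern is a small loop over jobs ordered by `deliver_at`. A heap stands in for the database table in this sketch; in production the push would go to SQS/RabbitMQ rather than returning a list:

```python
import heapq

# Durable delayed jobs: persist (deliver_at, job) pairs, then on each
# scheduler tick move every due job onto the real task queue.
pending = []  # stand-in for the database table, ordered by deliver_at
heapq.heappush(pending, (100, "send-renewal-reminder"))
heapq.heappush(pending, (50, "expire-cart"))
heapq.heappush(pending, (200, "send-survey"))

def drain_due(now):
    """One scheduler tick: pop and return every job whose time has come."""
    due = []
    while pending and pending[0][0] <= now:
        due.append(heapq.heappop(pending)[1])
    return due

print(drain_due(now=120))  # ['expire-cart', 'send-renewal-reminder']
print(drain_due(now=120))  # [] — nothing new is due
```

Because the jobs live in durable storage until their deadline, arbitrary delays (days, months) cost nothing, which is what SQS's 15-minute cap cannot give you.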
10. Hybrid Architectures#
Kafka + SQS#
Kafka as durable event log/source of truth. SQS for operational task queues. A Kafka consumer bridges them — reads events from Kafka, pushes actionable tasks to SQS.
SNS + SQS (AWS Fan-Out)#
SNS publishes to multiple SQS queues simultaneously. Each service has its own queue — independent scaling and isolated failure.
Kafka + Redis Pub/Sub#
Kafka ensures nothing is lost. A consumer processes events and publishes results to Redis for sub-millisecond delivery to live browser clients. If client misses a Redis message, it fetches from DB.
11. Quick-Reference Decision Tree#
START
│
├─ Need replay/rewind of historical events?
│ YES → Kafka
│ NO ↓
│
├─ Need to broadcast to multiple independent consumers?
│ YES ─┬─ High throughput or stream processing?
│ │ YES → Kafka
│ │ NO → SNS+SQS | RabbitMQ Fanout | Redis Pub/Sub (ephemeral)
│ NO ↓
│
├─ Need complex routing (patterns, headers)?
│ YES → RabbitMQ
│ NO ↓
│
├─ Need priority, TTL, or arbitrary delay?
│ Priority/arbitrary delay → RabbitMQ
│ Short delay only (< 15 min) → SQS Delay Queue
│ NO ↓
│
├─ Fully on AWS? Want managed service?
│ YES → SQS
│ NO ↓
│
├─ Sub-millisecond broadcast, can tolerate loss?
│ YES → Redis Pub/Sub
│ NO ↓
│
├─ Already using Redis, moderate scale?
│ YES → Redis Streams
│ NO ↓
│
└─ Default → RabbitMQ (versatile, well-understood)
12. Summary Table#
| Scenario | Best Fit |
|---|---|
| High-throughput event streaming | Kafka |
| Event sourcing / audit log | Kafka |
| Multiple teams consuming same events | Kafka |
| Replay historical data | Kafka |
| Stream processing (windowed aggregations) | Kafka + Kafka Streams |
| Complex routing by content or pattern | RabbitMQ |
| Message priority queues | RabbitMQ |
| Per-message TTL | RabbitMQ |
| Delayed/scheduled delivery (arbitrary) | RabbitMQ (plugin) |
| Task queues with dead-letter handling | RabbitMQ or SQS |
| Simple AWS task queue, no ops | SQS |
| Fan-out on AWS | SNS + SQS |
| Strict ordering + exactly-once (AWS) | SQS FIFO |
| Real-time ephemeral broadcast | Redis Pub/Sub |
| Live presence / typing indicators | Redis Pub/Sub |
| Cache invalidation signals | Redis Pub/Sub |
| Persistent consumer-group streams (Redis) | Redis Streams |
13. Queue vs Stream — The Most Confused Distinction#
Queue = A task handed off. Done. Forget it. Stream = A permanent record of what happened. Process it whenever you want.
A Queue is like a Post Office: You write a letter and drop it in the mailbox. Once delivered, it's gone. Its only job was to move the message from A to B.
A Stream is like a Bank Ledger: Every transaction is recorded permanently, in order. Multiple people — auditors, accountants, analysts — can independently read the same ledger.
The Five Fundamental Differences#
1. What happens after a message is consumed?
- Queue → deleted
- Stream → stays; multiple consumers read it at their own position
2. How many consumers can independently read?
- Queue → multiple workers share the queue; each message goes to exactly one worker
- Stream → unlimited independent consumer groups, each reads the full stream
3. Is ordering a first-class concern?
- Queue → best-effort in most implementations
- Stream → ordering is the entire point
4. Command vs Fact?
- Queue message → a command: { "action": "send_email", "to": "..." } (imperative)
- Stream event → a fact: { "event": "user.signed_up", "at": "..." } (past tense, immutable)
5. Replay?
- Queue → cannot replay; once processed, messages are gone forever
- Stream → reset offset to any point and re-read; stream is the source of truth
Does Persistence Make Something a Stream?#
No. SQS and RabbitMQ both persist to disk — neither is a stream. Persistence in a queue means "survive until one consumer gets it." Persistence in a stream means "the log is the system — read it any time from any position."
Is Kafka a Stream by Default?#
Yes. But your consumption pattern decides whether you're using it as a stream or a queue:
- Stream mode → multiple independent consumer groups, each reads every event
- Queue mode → competing consumers in one group, each event processed by exactly one worker
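The two consumption modes can be contrasted over one shared log — a toy model with an assumed two-partition layout, not the Kafka consumer-group protocol itself:

```python
log = ["e1", "e2", "e3", "e4"]

# Stream mode: each consumer GROUP keeps its own offset and sees every event.
group_offsets = {"analytics": 0, "billing": 0}
analytics = log[group_offsets["analytics"]:]  # full stream
billing = log[group_offsets["billing"]:]      # full stream, independently

# Queue mode: competing consumers WITHIN one group split the partitions,
# so each event is processed by exactly one worker (2 partitions assumed).
partitions = {0: ["e1", "e3"], 1: ["e2", "e4"]}
assignment = {"worker-1": partitions[0], "worker-2": partitions[1]}

print(analytics == billing == log)  # True — both groups read everything
print(sorted(assignment["worker-1"] + assignment["worker-2"]))  # each event once
```

Same log, two behaviors: groups duplicate the data across teams, while workers inside a group divide it.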
When to Use Each#
Use a Queue when: Message is a command that happens once, work distribution is the goal, no history needed.
Use a Stream when: Event is a fact multiple systems need, multiple teams consume same data, you need history/replay, real-time data pipelines, order of events carries meaning.
Queue: "Here's something to do — someone please handle it." Stream: "Here's something that happened — anyone who cares can read it, now or later."
Last updated: 2026. Kafka 3.x. RabbitMQ 3.12+. SQS: current AWS. Redis 7.x.