Early Stage Startup — Reliability & Availability Roadmap#
The Core Principle Before Anything Else#
Over-engineering early is as dangerous as under-engineering.
Every hour you spend building redundant infrastructure in week 1 is an hour not spent validating your product. A perfectly available system serving zero users solves nothing.
The goal at each stage is: just enough reliability for the current risk level.
The Reliability Spectrum#
STAGE 1 STAGE 2 STAGE 3 STAGE 4
Idea/MVP → Early Traction → Growth → Scale
0-100 users 100-10K users 10K-500K users 500K+ users
"Does it work?" "Is it stable?" "Can it grow?" "Can it survive?"
Stage 1: MVP / Pre-Launch (0–100 users)#
Your Situation#
- Building to validate an idea
- Users are friends, beta testers, early adopters
- A few hours of downtime → nobody notices or cares yet
- You probably have little to no funding
What You Need#
Nothing fancy. Literally the simplest thing that works.
Recommended Stack#
Single VPS (DigitalOcean / Hetzner / Render)
↓
Your Application (single process)
↓
Managed Database (Supabase / PlanetScale / Railway)
Specific choices:
- Hetzner CX21 (€4/month) — 2 vCPU, 4GB RAM. Runs most MVPs easily.
- DigitalOcean Droplet ($6-12/month) — slightly more expensive but excellent documentation
- Render / Railway — even simpler, deploy from git, no server management
- Supabase (free tier) — managed Postgres + auth + storage
What NOT to do yet#
- ❌ Don't set up multi-region deployments
- ❌ Don't containerize with Kubernetes
- ❌ Don't build microservices
- ❌ Don't configure auto-scaling
What you should do#
1. Managed Database from Day 1. Never run your own database server at this stage. Use a managed service.
- Your data is infinitely more valuable than your code
- Managed DBs have automated backups, failover, point-in-time recovery
- You cannot replicate this reliability cheaply yourself
2. Automated Backups. Configure daily backups. Know how to restore. Test restoring once. The worst thing that can happen at this stage is losing user data, not downtime.
3. Basic Monitoring
- UptimeRobot (free) — pings your URL every 5 minutes, emails you if it's down
- Sentry (free tier) — catches and reports application errors
These two tools take 30 minutes to set up and save enormous debugging time.
4. Separate your database from your application. Even with a single server, don't run your database on the same machine as your app; use the managed DB service. This decoupling matters when you scale.
Acceptable Downtime at This Stage#
Hours. If your server goes down and you restart it in 2 hours, your 50 users will probably not even notice. Don't lose sleep over this.
Cost#
₹500–2,000/month total.
Stage 2: Early Traction (100–10,000 users)#
Your Situation#
- Product-market fit is emerging
- Real users with real expectations
- Maybe some paying customers
- Downtime starts to hurt — you get complaints
- You have some budget now
What Changes#
You need basic redundancy and the ability to deploy without downtime. The single-server model starts showing cracks.
Problems You'll Actually Face at This Stage#
Problem 1: Deploying causes downtime. When you restart your app to deploy, it's unavailable for 10-30 seconds. At 100 users this is annoying. At 1,000 it's unacceptable.
Solution: Zero-downtime deployment. Use a platform that handles this for you:
- Railway / Render — blue-green deploys built in
- AWS Elastic Beanstalk — rolling deploys
- Or run 2 app instances behind a simple load balancer: deploy one, then the other
Problem 2: A single server going down. If your one server crashes, everything is down until you notice and fix it.
Solution: Managed platform or simple redundancy
Problem 3: The database is now the critical single point of failure. Your app can be restarted in seconds; your database losing data is catastrophic.
Solution: Upgrade to proper managed DB with automated failover
Recommended Stack#
Cloudflare (free tier) ← add this first, immediate benefits
↓
Single Cloud Instance (AWS EC2 t3.small or DigitalOcean Droplet)
+ Systemd / PM2 to auto-restart app on crash
↓
Managed Database with automated backups
(AWS RDS, Supabase Pro, PlanetScale)
↓
Redis (for sessions/caching) — ElastiCache or Upstash
Or move to a managed platform entirely:
Render / Railway / Fly.io
→ handles deploys, restarts, basic redundancy
→ you focus on code, not infrastructure
→ costs slightly more but saves enormous time
Key actions at this stage#
1. Add Cloudflare (Free — do this immediately)
- DDoS protection
- CDN for static assets
- Hides your origin IP
- Free SSL
- Takes 30 minutes to set up
- No reason not to do this at any stage
2. Set up proper application monitoring
- Datadog / New Relic / Grafana Cloud — track response times, error rates, resource usage
- Set alerts: "if error rate > 1% for 5 minutes, wake me up"
- Know about problems before your users report them
3. Structured logging
- Use a log management service (Papertrail, Logtail, CloudWatch)
- You will need to debug production issues. Logs are your eyes.
4. Database read replica (when DB becomes bottleneck)
- Most managed DBs offer a read replica at the click of a button
- Read-heavy traffic goes to replica, writes go to primary
- Immediate 2x database capacity
5. CI/CD pipeline
- GitHub Actions is free and sufficient
- Every push to main → automated tests → deploy
- Removes human error from deployments
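The read-replica split from action 4 above can be sketched in a few lines. This is a toy router, not any particular driver's API; the node labels and verb list are illustrative:

```python
import random

class RoutedDB:
    """Toy read/write splitter. Writes go to the primary, reads are spread
    across read replicas. Nodes are plain labels here; a real version would
    hold database connections (all names are illustrative)."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "ALTER", "BEGIN"}

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas) or [primary]

    def route(self, sql):
        verb = sql.lstrip().split()[0].upper()
        if verb in self.WRITE_VERBS:
            return self.primary              # writes must hit the primary
        return random.choice(self.replicas)  # reads can use any replica

db = RoutedDB("primary", ["replica-1", "replica-2"])
print(db.route("INSERT INTO users VALUES (1)"))  # primary
print(db.route("SELECT * FROM users"))           # one of the replicas
```

Many ORMs and proxies (e.g. connection-pooler features) do this split for you; the point is that reads and writes take different paths.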
When to add a Load Balancer at this stage#
When you need zero-downtime deploys with multiple instances. Not before.
Cheapest path: AWS ALB ($16/month) + 2 EC2 t3.small instances
LB sends traffic to both, you deploy one at a time
Acceptable Downtime#
Minutes to tens of minutes. Users expect occasional issues from a startup. What they don't forgive is losing their data.
Cost#
₹5,000–15,000/month depending on choices.
Stage 3: Growth (10,000–500,000 users)#
Your Situation#
- Real revenue at stake
- SLA expectations from customers (especially B2B)
- You probably have a small engineering team
- Traffic has predictable peaks (morning, evening) and occasional spikes
- Some features are more critical than others
What Changes#
You need proper redundancy, auto-scaling, and the ability to survive infrastructure failures without manual intervention.
Recommended Stack#
Cloudflare (Pro or Business tier for WAF)
↓
AWS ALB (multi-AZ, managed, zero operational overhead)
↓
Auto Scaling Group (EC2 instances that scale 2-10 based on CPU/traffic)
↓
AWS RDS (Multi-AZ) — primary + standby replica, automatic failover
↓
ElastiCache (Redis) — for sessions, caching, rate limiting
↓
S3 + CloudFront — for all static assets
Key Concepts to Implement#
1. Multi-AZ Deployment
Deploy across at least 2 Availability Zones (physically separate data centers in same region).
AZ-a: ALB node + 2 EC2 instances + RDS primary
AZ-b: ALB node + 2 EC2 instances + RDS standby
If AZ-a loses power, AZ-b continues serving traffic and RDS automatically promotes standby to primary. This happens automatically with managed AWS services.
2. Auto Scaling
Define scaling rules based on real metrics:
CPU > 70% for 2 minutes → add 2 EC2 instances
CPU < 30% for 10 minutes → remove 1 EC2 instance
Minimum: 2 instances (always)
Maximum: 10 instances
This handles traffic spikes without manual intervention and saves money during quiet periods.
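As a sketch, the policy above reduces to a small decision function. Thresholds are taken from the rules as written; the function itself is illustrative, not the AWS Auto Scaling API:

```python
def scaling_decision(cpu_history, current, minimum=2, maximum=10):
    """Evaluate the scaling rules above.
    cpu_history: per-minute CPU utilisation samples, newest last."""
    # Scale out: CPU above 70% for the last 2 minutes -> add 2 instances.
    if len(cpu_history) >= 2 and all(c > 70 for c in cpu_history[-2:]):
        return min(current + 2, maximum)
    # Scale in: CPU below 30% for the last 10 minutes -> remove 1 instance.
    if len(cpu_history) >= 10 and all(c < 30 for c in cpu_history[-10:]):
        return max(current - 1, minimum)
    return current

print(scaling_decision([65, 80, 85], current=4))  # 6
print(scaling_decision([20] * 10, current=4))     # 3
print(scaling_decision([20] * 10, current=2))     # 2 (never below minimum)
```

Note the asymmetry: scale out fast (add 2 after 2 minutes), scale in slowly (remove 1 after 10 minutes). This avoids flapping when traffic oscillates.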
3. Health Checks Everywhere
Every service must expose a /health endpoint that checks all its dependencies.
GET /health
→ Check DB connection
→ Check Redis connection
→ Check any critical dependency
→ Return 200 if all good, 503 if not
ALB uses this to route around unhealthy instances. Auto Scaling uses this to terminate and replace broken instances.
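A health endpoint boils down to running each dependency check and mapping the result to 200 or 503. A minimal sketch, with stand-in checks instead of real DB/Redis pings:

```python
def health(checks):
    """Run each dependency check and summarise the result.
    `checks` maps a name to a zero-argument callable that raises on failure."""
    details, healthy = {}, True
    for name, check in checks.items():
        try:
            check()
            details[name] = "ok"
        except Exception as exc:
            details[name] = f"fail: {exc}"
            healthy = False
    return (200 if healthy else 503), details

def db_ok():       # stand-in for a real "SELECT 1" against the DB
    pass

def redis_down():  # stand-in for a failing Redis PING
    raise IOError("connection refused")

print(health({"db": db_ok, "redis": db_ok}))       # (200, {'db': 'ok', ...})
print(health({"db": db_ok, "redis": redis_down})[0])  # 503
```

In a real service you would expose this via your web framework's routing; keep the checks cheap, since the ALB calls the endpoint every few seconds.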
4. Circuit Breakers
When a downstream service is failing, stop hammering it. Use a circuit breaker:
If Payment Service fails 5 times in 10 seconds:
→ Open circuit — stop calling Payment Service for 30 seconds
→ Show user a friendly error
→ After 30 seconds, try again (half-open state)
This prevents cascading failures where one broken service takes down everything.
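A minimal circuit breaker implementing exactly these rules might look like the sketch below. It is not a production library (use one where available); the fake-clock parameter just makes the behaviour deterministic:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker matching the rules above: open after
    `threshold` failures within `window` seconds, fail fast for `cooldown`
    seconds, then allow a single trial call (half-open)."""

    def __init__(self, threshold=5, window=10.0, cooldown=30.0,
                 clock=time.monotonic):
        self.threshold, self.window, self.cooldown = threshold, window, cooldown
        self.clock = clock
        self.failures = []     # timestamps of recent failures
        self.opened_at = None  # when the circuit tripped, or None

    def call(self, fn, fallback):
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown:
                return fallback()      # open: don't hammer the service
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures = [t for t in self.failures if now - t < self.window]
            self.failures.append(now)
            if len(self.failures) >= self.threshold:
                self.opened_at = now   # trip the circuit
            return fallback()
        self.failures.clear()
        return result

t = [0.0]  # fake clock so the example is deterministic
cb = CircuitBreaker(clock=lambda: t[0])
def payment(): raise RuntimeError("payment service down")
for _ in range(5):
    cb.call(payment, lambda: "friendly error")
print(cb.opened_at is not None)  # True: further calls fail fast for 30s
```

The fallback is where "show user a friendly error" lives; it runs instantly instead of waiting on a timeout against a dead service.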
5. Separate Your Stateless and Stateful Components
Stateless (can have many, can die freely):
Application servers — run as many as you want
Stateful (must be treated carefully):
Database — primary + replica, automated backups, Multi-AZ
Cache — Redis with persistence if needed
File storage — S3 (not local disk)
Never store anything on EC2 local disk that you'd be upset to lose.
6. Content Delivery
All static assets (JS, CSS, images) should come from CloudFront or Cloudflare, not your application servers. This reduces load on your servers dramatically and improves global response times.
7. Observability Stack
You need three things:
Metrics → Datadog / CloudWatch → "What is my system doing right now?"
Logs → CloudWatch Logs / Datadog → "What happened and when?"
Traces → AWS X-Ray / Datadog APM → "Why is this request slow?"
Without these three, debugging production incidents is guesswork.
8. Runbooks
Document procedures for common incidents:
- "Database failover in progress — what do we do?"
- "Traffic spike — how do we scale manually if auto-scaling isn't enough?"
- "Memory leak detected — how do we restart without downtime?"
These feel unnecessary until 2am when something is burning.
Target: 99.9% Uptime#
99.9% = ~8.7 hours downtime/year
Achievable with Multi-AZ + Auto Scaling + managed services.
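The downtime budget behind any availability target is simple arithmetic, worth keeping handy when negotiating SLAs:

```python
def downtime_budget_minutes(availability_pct):
    """Yearly downtime allowed by an availability target (365-day year)."""
    return 365 * 24 * 60 * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.95, 99.99):
    m = downtime_budget_minutes(sla)
    print(f"{sla}%: {m / 60:.1f} h/year ({m:.0f} min)")
```

For 99.9% this works out to about 8.8 hours a year; each extra nine cuts the budget by 10x.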
Cost#
₹30,000–1,00,000/month depending on traffic and instance sizes.
Stage 4: Scale (500,000+ users)#
Your Situation#
- Downtime has direct revenue impact
- SLAs with enterprise customers (99.9%, 99.95%)
- Traffic is geographically distributed globally
- You have a dedicated infrastructure or platform engineering team
- Partial failures are acceptable — complete outage is not
What Changes#
You need multi-region deployment, database sharding or read replicas globally, and sophisticated traffic management.
Recommended Stack#
AWS Route 53 (latency-based routing across regions)
↓
Cloudflare Enterprise (global edge, advanced WAF)
↓
Per Region:
ALB → Auto Scaling Group (application tier)
RDS Aurora Global Database (replicates across regions, <1s lag)
ElastiCache with replication
DynamoDB (for truly global, low-latency data)
Key Additions#
1. Multi-Region Active-Active. Serve traffic from multiple regions simultaneously:
Mumbai region → users in India, South Asia
Singapore region → users in SE Asia
US-East region → users in Americas
EU-West region → users in Europe
Route 53 routes each user to nearest healthy region.
2. Aurora Global Database. Amazon Aurora Global Database replicates from the primary region to up to five secondary regions with under 1 second of typical lag. If a region fails, a secondary can be promoted to primary in under a minute.
3. Chaos Engineering. Deliberately break things in production (on a schedule) to verify your redundancy actually works. Netflix's Chaos Monkey randomly terminates EC2 instances to ensure their system survives. You don't need Chaos Monkey — but you should periodically run drills: "let's fail over our DB and see what happens."
4. Rate Limiting and Quotas. Protect your services from runaway clients, scraping bots, or abuse. Implement at the Cloudflare edge (fastest) or at the ALB level.
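A common way to implement this is a per-client token bucket. A minimal sketch (real deployments enforce this at the edge or via Redis, since app-local state doesn't survive scaling):

```python
import time

class TokenBucket:
    """Per-client token bucket: `rate` tokens/second refill, up to
    `capacity`. The clock parameter makes the example deterministic."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

t = [0.0]
bucket = TokenBucket(rate=1, capacity=3, clock=lambda: t[0])
print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
t[0] = 2.0
print(bucket.allow())  # True (tokens refilled)
```

The bucket allows short bursts up to `capacity` while enforcing the average `rate`, which is usually the behaviour you want for well-behaved clients.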
5. Event-Driven Architecture. Decouple services using message queues (AWS SQS, EventBridge):
User signs up → Event published →
Email service consumes event → sends welcome email
Analytics service consumes event → records new user
Recommendation service consumes event → initializes profile
If the email service is down, signup still works; the email is just delayed. The critical path is decoupled from non-critical operations.
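The decoupling can be sketched with an in-memory queue standing in for SQS or EventBridge (topic names and handlers are illustrative):

```python
import queue

events = queue.Queue()  # stand-in for SQS / EventBridge

def sign_up(email):
    user = {"email": email}               # critical path: create the user
    events.put(("user.signed_up", user))  # publish; don't wait on consumers
    return user

def drain(handlers):
    """Consumers run independently; a dead consumer only delays its own work."""
    results = []
    while not events.empty():
        topic, payload = events.get()
        for handle in handlers.get(topic, []):
            results.append(handle(payload))
    return results

sign_up("a@example.com")
print(drain({"user.signed_up": [
    lambda u: f"welcome email to {u['email']}",
    lambda u: f"analytics row for {u['email']}",
]}))  # both consumers ran, after signup already succeeded
```

The key property: `sign_up` returns as soon as the event is published. Consumers that crash simply leave messages on the queue to be retried later.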
Target: 99.95%–99.99% Uptime#
99.99% = ~52 minutes downtime/year
Requires multi-region + significant operational investment.
Cost#
₹2,00,000–₹10,00,000+/month
Quick Decision Guide by Stage#
JUST STARTING?
→ Single managed instance (Render/Railway) + managed DB (Supabase)
→ Cloudflare free tier
→ Uptime monitoring (UptimeRobot free)
→ Sentry for errors (free tier)
Total: ₹1,000-3,000/month
GETTING TRACTION (first paying customers)?
→ Add 2nd instance + ALB for zero-downtime deploys
→ RDS or managed DB with automated backups
→ Basic monitoring (Datadog or CloudWatch)
→ CI/CD pipeline (GitHub Actions free)
Total: ₹8,000-20,000/month
GROWING (meaningful revenue at stake)?
→ Multi-AZ deployment
→ Auto Scaling Groups
→ RDS Multi-AZ
→ CloudFront for static assets
→ Full observability stack (metrics + logs + traces)
Total: ₹40,000-1,00,000/month
SCALING (enterprise customers, SLAs)?
→ Multi-region
→ Aurora Global Database
→ Cloudflare Enterprise or AWS Shield Advanced
→ Dedicated platform engineering
→ Chaos engineering practices
Total: ₹2,00,000+/month
The Universal Rules That Apply at Every Stage#
Rule 1: Your database is more important than your app. Your app can be restarted in seconds. Lost user data cannot be recovered. Use managed databases. Enable automated backups. Test restoring them.
Rule 2: Stateless applications scale; stateful ones don't. Store sessions in Redis, not in memory. Store files in S3, not on disk. If any instance can die at any time without losing data, you can run as many as you want.
Rule 3: Observability before redundancy. You cannot fix what you cannot see. Add monitoring and logging before you add more servers.
Rule 4: Managed services are almost always worth it. The cost of managed RDS is almost always less than the cost of a database going down and an engineer spending a weekend recovering it.
Rule 5: Complexity has a maintenance cost. Every Kubernetes cluster, every microservice, every config file is something that can break and that someone has to maintain. Add complexity only when the pain of not having it is clearly greater than the cost of maintaining it.
Rule 6: Test your recovery procedures. A backup that's never been restored is not a backup. A failover that's never been tested is not failover. Periodically drill your recovery procedures before you need them.
The Startup Reliability Maturity Model#
Level 1: We know when we're down (monitoring)
Level 2: We can deploy without going down (zero-downtime deploys)
Level 3: We survive a single server dying (redundancy)
Level 4: We survive an AZ failing (multi-AZ)
Level 5: We survive a region failing (multi-region)
Level 6: We survive partial failures gracefully (circuit breakers, graceful degradation)
Most startups should aim to reach Level 3-4 by the time they have meaningful revenue.
Level 5-6 is for companies where downtime has clear, direct, significant revenue impact.
Don't skip levels. Each level builds on the previous one.