DEEP DIVE SYSTEM DESIGN

How Meta Deploys Code to 4 Billion Users With Zero Downtime

By Akshay Ghalme·April 16, 2026·19 min read

Meta pushes code to production thousands of times per day. Facebook, Instagram, WhatsApp, Messenger — 4 billion users across the family of apps, zero scheduled maintenance windows, zero "we'll be right back" pages. When you open Instagram at 3 PM, the code serving your feed might be 2 hours old. By 5 PM, it has been replaced by a newer version — and you never noticed.

This is Part 2 of the Meta Infrastructure series. Part 1 covers WhatsApp's messaging architecture. Part 3 covers Meta's storage systems. This post covers the deployment pipeline — how code gets from an engineer's laptop to 4 billion users without anything breaking.

The Scale of Meta's Deployment Problem

Before diving into how it works, you need to understand the scale of the problem Meta solves every day:

Meta deployment stats (approximate):
  Engineers committing code daily:     ~1,000+
  Code commits per day:                ~1,000+
  Servers in production:               Millions
  Data centers globally:               20+
  Products deployed simultaneously:    Facebook, Instagram, WhatsApp, Messenger, Threads
  Users affected by any bad deploy:    Up to 4 billion
  Acceptable downtime per year:        Effectively zero

The challenge is not "how do you deploy code." Anyone can git push to a server. The challenge is: how do you deploy code thousands of times a day to millions of servers serving 4 billion users, and guarantee that none of those deploys causes a visible outage? That is the problem Meta's deployment infrastructure solves.

Tupperware — Meta's Container Orchestration Platform

Long before Kubernetes existed, Meta built Tupperware — their internal container orchestration system. Tupperware does for Meta what Kubernetes does for the rest of the industry: schedules containers onto physical machines, manages resource allocation, handles health checks, and automatically replaces failed containers.

The key differences from Kubernetes:

  • Global single control plane. Kubernetes runs per-cluster (one control plane per cluster, typically per region). Tupperware is a single global scheduler that can place containers across any data center in Meta's fleet. This enables cross-data-center migrations, global load balancing, and capacity reallocation in minutes rather than hours.
  • Custom scheduling for Meta's workload mix. Meta's fleet runs everything from stateless web servers to stateful databases to GPU-heavy ML training jobs. Tupperware's scheduler understands these workload types natively and co-locates complementary workloads (batch ML jobs fill idle capacity left by bursty web traffic).
  • Integrated with the entire Meta stack. Tupperware talks directly to Meta's load balancers, service discovery, logging, monitoring, and deployment pipeline. There is no glue layer — it is one integrated system. Kubernetes relies on an ecosystem of third-party tools (Istio, Prometheus, ArgoCD) that must be integrated separately.

For most companies, Kubernetes is the right choice. Tupperware exists because Meta needed it before Kubernetes was mature, and at Meta's scale, the integration depth pays for the cost of maintaining a custom system. Do not build your own Tupperware.
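To make the co-location idea concrete, here is a toy sketch in Python (names and numbers are my own, not Meta's implementation): batch jobs are greedily packed into whatever capacity the latency-sensitive web reservations leave idle on each machine.

```python
from dataclasses import dataclass, field

@dataclass
class Machine:
    name: str
    cores: int
    web_reserved: int                     # cores held back for bursty web traffic
    batch_assigned: list = field(default_factory=list)

    def idle_cores(self) -> int:
        used = sum(cores for _, cores in self.batch_assigned)
        return self.cores - self.web_reserved - used

def schedule_batch(machines, jobs):
    """Greedy best-fit: place each batch job on the machine whose idle
    capacity it fills most tightly, never touching web reservations."""
    placements = {}
    for name, cores in sorted(jobs, key=lambda j: -j[1]):   # biggest jobs first
        candidates = [m for m in machines if m.idle_cores() >= cores]
        if not candidates:
            placements[name] = None       # unschedulable until capacity frees up
            continue
        best = min(candidates, key=lambda m: m.idle_cores() - cores)
        best.batch_assigned.append((name, cores))
        placements[name] = best.name
    return placements

machines = [Machine("web-a", cores=32, web_reserved=24),
            Machine("web-b", cores=32, web_reserved=16)]
print(schedule_batch(machines, [("ml-train", 12), ("log-compact", 8), ("index", 6)]))
```

A real scheduler also handles preemption (evicting batch work when web traffic spikes), but the core invariant is the same: batch fills the gaps, web capacity is never borrowed against.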

The Deployment Pipeline — From Commit to Production

Here is the full lifecycle of a code change at Meta:

graph LR
  DEV["Engineer writes code"] --> REVIEW["Code Review (Phabricator)"]
  REVIEW --> LAND["Land to main branch"]
  LAND --> BUILD["Hermetic Build + automated tests"]
  BUILD --> CANARY["Canary Deploy: 1% traffic"]
  CANARY --> CHECK1{"Metrics OK?"}
  CHECK1 -->|Yes| ROLL10["Rollout: 10% traffic"]
  CHECK1 -->|No| ROLLBACK1["Auto Rollback"]
  ROLL10 --> CHECK2{"Metrics OK?"}
  CHECK2 -->|Yes| ROLL50["Rollout: 50% traffic"]
  CHECK2 -->|No| ROLLBACK2["Auto Rollback"]
  ROLL50 --> CHECK3{"Metrics OK?"}
  CHECK3 -->|Yes| FULL["Full Production"]
  CHECK3 -->|No| ROLLBACK3["Auto Rollback"]

Step 1: Code Review (Phabricator)

Every code change at Meta goes through code review on Phabricator (Meta's internal review tool, which they open-sourced). At least one other engineer must approve the change. For changes to critical systems (payments, privacy, security), a specialized reviewer from that domain must also sign off. The review is not just "does this code work" — it also checks "if this code fails, what is the blast radius?"

Step 2: Land and Build

After approval, the change is "landed" (merged) to the main branch. A hermetic build system compiles the change alongside all dependencies into a deployable artifact. Automated tests run — not just the changed code's tests, but any test that could be affected by the change (determined by dependency analysis). If tests fail, the commit is automatically reverted.
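The "any test that could be affected" selection can be sketched as a walk over a reverse dependency graph: start from the changed modules and collect every test target that transitively depends on them. The module and test names below are invented for illustration.

```python
from collections import deque

# module -> things that depend on it (reverse edges of the build graph)
REVERSE_DEPS = {
    "feed/ranker": ["feed/api", "tests/feed_ranker_test"],
    "feed/api": ["www/endpoint", "tests/feed_api_test"],
    "www/endpoint": ["tests/endpoint_test"],
}

def affected_tests(changed_modules):
    """BFS outward from the changed modules, collecting test targets."""
    seen, queue, tests = set(changed_modules), deque(changed_modules), set()
    while queue:
        mod = queue.popleft()
        if mod.startswith("tests/"):
            tests.add(mod)
            continue
        for dependent in REVERSE_DEPS.get(mod, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(tests)

# A change to the ranker pulls in its own tests plus every downstream suite.
print(affected_tests(["feed/ranker"]))
```

The payoff is that a one-line change to a leaf module runs a handful of tests, while a change to a widely-imported library runs thousands, automatically.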

Step 3: Canary Deployment (1% of Traffic)

The new build is deployed to a small subset of production servers — typically 1% of traffic. These canary servers run the new code while 99% of servers run the old code. Real users hit the canary servers; they are not simulated traffic. This is the most critical gate: the first contact between new code and real users.

Step 4: Automated Metric Comparison

An automated system continuously compares metrics between canary servers and the control group. The metrics include:

  • Error rates — HTTP 500s, crash rates, exception rates
  • Latency — p50, p95, p99 response times
  • Resource usage — CPU, memory, network per request
  • Business metrics — feed loads, story views, message sends (the metrics that prove the product still works)

If any metric degrades beyond a threshold, the canary is automatically rolled back without human intervention. No oncall engineer needed, no approval, no "let me check." The system decides, the system acts. At Meta's deployment velocity, human-in-the-loop rollback is too slow.
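A minimal sketch of that canary-versus-control comparison (metric names and thresholds are made up, not Meta's actual values): compute the relative regression of each metric against the control group and return a rollback verdict with no human in the loop.

```python
# max allowed relative degradation per metric (higher value = worse)
THRESHOLDS = {"error_rate": 0.10, "p99_latency_ms": 0.05, "cpu_per_request": 0.15}

def canary_verdict(control, canary):
    """Return ('rollback', reasons) if any metric regresses past its
    threshold, else ('proceed', [])."""
    reasons = []
    for metric, max_regression in THRESHOLDS.items():
        baseline = control[metric]
        regression = (canary[metric] - baseline) / baseline
        if regression > max_regression:
            reasons.append(f"{metric} regressed {regression:.1%}")
    return ("rollback" if reasons else "proceed", reasons)

control = {"error_rate": 0.0020, "p99_latency_ms": 180.0, "cpu_per_request": 3.1}
canary  = {"error_rate": 0.0031, "p99_latency_ms": 182.0, "cpu_per_request": 3.2}
verdict, why = canary_verdict(control, canary)
print(verdict, why)   # error rate jumped 55%, so this canary rolls back
```

Real systems add statistical tests so that noisy metrics on a 1% sample do not trigger false rollbacks, but the decision shape is the same: compare, threshold, act.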

Step 5: Gradual Rollout (10% → 50% → 100%)

If the canary passes, the rollout proceeds to 10% of traffic, then 50%, then 100%. At each stage, the same metric comparison runs. A problem that only manifests at 10% load (like a race condition under higher concurrency) gets caught at this stage. The total time from commit to full production is typically 6-12 hours for a routine change, with the canary stage taking the majority of that time.

Feature Flags and Gatekeeper — Dark Launches at Meta Scale

Not every change should be visible to users the moment it is deployed. Meta's Gatekeeper system separates deployment (code exists in production) from release (users see the feature). This is the single most important concept in modern deployment.

Every new feature at Meta ships behind a Gatekeeper flag. The flag controls who sees it:

// Pseudocode — Gatekeeper check in product code
if (gatekeeper.check("new_feed_ranking_v2", user)) {
  // Show new feed algorithm
} else {
  // Show old feed algorithm
}

// Gatekeeper targeting rules:
// - 0% of users initially (dark launch — code deployed, feature invisible)
// - Then: employees only (internal dogfooding)
// - Then: 1% of US users (external canary)
// - Then: 10% → 50% → 100% (gradual rollout)
// - Or: instantly set to 0% (kill switch if something breaks)

Gatekeeper flags can be targeted by user ID, percentage, country, device type, app version, employee status, or any combination. This enables:

  • Dark launches: Deploy code to production but make it invisible. Engineers can test the code path in production without users seeing it.
  • A/B testing: Show feature A to 50% and feature B to 50%, measure which performs better.
  • Instant kill switch: If a feature causes problems, set the flag to 0% and the feature disappears for all users within seconds — no code deploy needed.
  • Regional rollouts: Launch in India first, then US, then globally — each controlled by the flag.
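A common way to implement percentage targeting like this (a sketch of the general technique, not Meta's actual scheme) is to hash the flag name and user ID into a stable bucket from 0 to 99. Because the bucket is deterministic, a user who is inside the rollout at 10% stays inside at 50%, and setting the percentage to 0 is the kill switch.

```python
import hashlib

def bucket(flag: str, user_id: int) -> int:
    """Deterministic bucket 0-99 for this (flag, user) pair."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def check(flag: str, user_id: int, rollout_pct: int) -> bool:
    # rollout_pct = 0 is the kill switch: nobody passes
    return bucket(flag, user_id) < rollout_pct

uid = 42
assert check("new_feed_ranking_v2", uid, 100)      # full rollout
assert not check("new_feed_ranking_v2", uid, 0)    # kill switch
# monotonic: anyone inside at 10% is still inside at 50%
if check("new_feed_ranking_v2", uid, 10):
    assert check("new_feed_ranking_v2", uid, 50)
```

Hashing per flag (rather than per user alone) also decorrelates experiments: being in the 1% for one flag says nothing about your bucket for another.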

Load Balancing — Routing Traffic Across the Globe

When you open Facebook on your phone, your request does not go directly to a server in a data center. It passes through multiple layers of load balancing:

graph TD
  USER["User in Mumbai"] --> DNS["DNS Resolution: route to nearest PoP"]
  DNS --> EDGE["Edge PoP Mumbai"]
  EDGE --> L4["L4 Load Balancer: TCP termination"]
  L4 --> PROXYGEN["Proxygen L7 Load Balancer: HTTP routing"]
  PROXYGEN --> DC1["Data Center A service instances"]
  PROXYGEN --> DC2["Data Center B service instances"]
  DC1 --> SVC1["Feed Service"]
  DC1 --> SVC2["Story Service"]
  DC1 --> SVC3["Messaging Service"]

Edge Points of Presence (PoPs)

Meta operates dozens of Edge PoPs around the world. When you make a request, DNS routes you to the nearest PoP. The PoP terminates your TCP/TLS connection (eliminating the latency of connecting to a distant data center), caches static content, and forwards dynamic requests to the nearest data center. This is the same concept as CloudFront edge locations on AWS, but Meta runs its own.

Proxygen — Meta's Custom L7 Load Balancer

Proxygen is Meta's open-source HTTP framework that acts as the Layer 7 load balancer. It handles HTTP/2 and HTTP/3 (QUIC) termination, request routing to backend services, circuit breaking, and retry logic. Proxygen is where canary traffic splitting happens — the load balancer routes 1% of requests to canary servers and 99% to stable servers.
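The traffic split itself is conceptually simple. Here is a hedged sketch (my own function names, not Proxygen's API) of a weighted backend picker that sends roughly 1% of requests to the canary pool:

```python
import random

def pick_backend(canary_pct: float, stable_pool, canary_pool, rng=random):
    """Route roughly canary_pct of requests to the canary pool,
    the rest to the stable pool."""
    pool = canary_pool if rng.random() < canary_pct else stable_pool
    return rng.choice(pool)

rng = random.Random(7)          # seeded so the demo is repeatable
stable = ["web-001", "web-002", "web-003"]
canary = ["web-canary-1"]
hits = sum(pick_backend(0.01, stable, canary, rng) in canary for _ in range(10_000))
print(f"canary share: {hits / 10_000:.2%}")   # close to 1%
```

Production balancers layer on health checks, connection pooling, and sticky assignment so a given user consistently hits the same version, but the split decision is this one weighted coin flip per request.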

Graceful Degradation — What Gets Turned Off During Incidents

Not every feature on Facebook is equally important. When the system is under stress (high load, partial outage, cascade failure), Meta deliberately turns off less-important features to protect the most critical ones. This is graceful degradation, and it is pre-planned — not improvised during an incident.

Meta categorizes features into priority tiers:

  • Tier 0 (never degrade): Core feed loading, messaging delivery, login/authentication
  • Tier 1 (degrade under severe load): Comments, reactions, read receipts, typing indicators
  • Tier 2 (degrade under moderate load): Story recommendations, "people you may know," marketplace suggestions
  • Tier 3 (first to shed): Analytics logging, non-critical background jobs, A/B test data collection

Each tier has a Gatekeeper flag. During an incident, the incident commander can flip Tier 3 off globally in seconds, then Tier 2 if needed. This frees up compute, database connections, and network bandwidth for the features that matter most. Users experience a slightly degraded but functional product instead of a total outage.
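The tier mechanism above can be sketched as a single shed level that disables every feature at or below a given priority (tier assignments follow the list above; the class and its API are invented for illustration):

```python
# feature -> priority tier (0 = never degrade, 3 = first to shed)
TIERS = {
    "feed_load": 0, "message_delivery": 0, "login": 0,
    "comments": 1, "reactions": 1, "typing_indicators": 1,
    "story_recommendations": 2, "people_you_may_know": 2,
    "analytics_logging": 3, "ab_test_collection": 3,
}

class DegradationSwitch:
    def __init__(self):
        self.shed_at_or_above = 4        # 4 = nothing shed (tiers run 0-3)

    def shed(self, tier: int):
        """Disable every feature whose tier number is >= tier."""
        self.shed_at_or_above = tier

    def enabled(self, feature: str) -> bool:
        return TIERS[feature] < self.shed_at_or_above

switch = DegradationSwitch()
switch.shed(3)                           # incident begins: drop Tier 3 first
assert not switch.enabled("analytics_logging")
assert switch.enabled("comments")
switch.shed(2)                           # load still climbing: drop Tier 2 too
assert not switch.enabled("story_recommendations")
assert switch.enabled("feed_load")       # Tier 0 survives every shed level
```

The important property is that the decision of what to turn off is made calmly in advance; during the incident the only decision left is how far to turn the dial.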

The Cultural Evolution — "Move Fast and Break Things" Is Dead

Meta's original motto was "move fast and break things." This worked when Facebook had 100 million users and an outage was an inconvenience. It does not work when 4 billion people rely on your infrastructure for daily communication, business, and in some countries, emergency coordination.

The current engineering culture is better described as "move fast with stable infrastructure." The specific shifts:

  1. Automated guardrails replaced manual discipline. Instead of trusting engineers to test thoroughly, the canary system catches problems automatically. The system is designed so that a bad commit cannot reach all users.
  2. Feature flags replaced big-bang launches. Nothing goes from 0% to 100% in one step. Everything is gradual, measurable, and reversible.
  3. Blast radius thinking replaced "ship it and see." Every code review now includes "what happens if this fails?" The answer must be "the failure is contained to X% of users" — not "everyone is affected."
  4. Post-incident reviews replaced blame. After every SEV-1 or SEV-2 incident, Meta runs a blameless post-mortem. The output is always a system improvement (better canary metric, new circuit breaker, new graceful degradation tier) — never "engineer X messed up."

Incident Response — The SEV System

When something goes wrong at Meta scale, the response follows a structured severity system:

  • SEV-4: Minor degradation affecting a small number of users. Handled by the owning team during business hours.
  • SEV-3: Noticeable degradation for a significant number of users. On-call engineer investigates within 30 minutes.
  • SEV-2: Major feature broken for a large number of users. Incident commander assigned, war room opened, all non-critical deploys frozen.
  • SEV-1: Total or near-total outage. All hands on deck. External communication team activated. Example: the October 2021 global outage (BGP misconfiguration that made Facebook unreachable worldwide for ~6 hours).

The October 2021 outage was a watershed moment. A routine BGP configuration change accidentally withdrew the routes that told the internet where Facebook's servers were. The result: every Meta service (Facebook, Instagram, WhatsApp, Messenger, Oculus) became unreachable globally. Even Meta's internal tools went down, because they relied on the same infrastructure. Engineers could not access the data centers remotely and had to physically go on-site to fix the BGP configuration.

The aftermath led to massive investment in independent recovery paths — systems that can restore connectivity even when the primary infrastructure is completely down.

Comparison: Deployment Approaches at Scale

| Dimension | Meta | Google | Netflix | Amazon |
|---|---|---|---|---|
| Container platform | Tupperware (custom) | Borg → Kubernetes (custom → open-sourced) | Titus (custom, on AWS) | ECS (AWS-native) |
| Deploy frequency | Thousands/day (continuous) | Continuous (per-service) | ~100/day | ~150,000/day (across all teams) |
| Canary strategy | 1% → 10% → 50% → 100% | Per-service canary (Spanner, etc.) | Red/Black (new ASG, old ASG) | Per-team, varied strategies |
| Feature flags | Gatekeeper (built-in) | Experiments framework | Feature flags + A/B | Varies by team |
| Rollback | Automated on metric degradation | Automated | Manual (fast, ~minutes) | Automated per-service |
| Monorepo? | Yes (one of the largest) | Yes (the largest) | No (per-service repos) | No (per-team repos) |

What Smaller Companies Can Learn From Meta

You do not have 4 billion users or a custom container platform. But three of Meta's practices scale down to any team size:

1. Feature Flags From Day One

Every new feature behind a flag, even at a 10-person startup. It costs almost nothing to implement — LaunchDarkly or Unleash for managed, a simple database table for DIY. The payoff is enormous: instant rollback without deploys, gradual rollouts, A/B testing, and the ability to decouple deployment from release. See my CI/CD guide for the pipeline to pair with this.
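The "simple database table" flavor is genuinely this small. Here is a sketch using sqlite3 (table name, columns, and flag names are my own): one row per flag, a percentage column, deterministic hashing for bucketing, and a single UPDATE as the kill switch.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE flags (name TEXT PRIMARY KEY, rollout_pct INTEGER)")
db.execute("INSERT INTO flags VALUES ('new_checkout', 10)")   # 10% rollout

def flag_on(name: str, user_id: int) -> bool:
    row = db.execute(
        "SELECT rollout_pct FROM flags WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return False                     # unknown flag: fail closed
    b = int(hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100
    return b < row[0]

# Kill switch: no deploy, no restart, just one UPDATE.
db.execute("UPDATE flags SET rollout_pct = 0 WHERE name = 'new_checkout'")
assert not flag_on("new_checkout", 1234)
```

In a real service you would cache the flag rows with a short TTL so every request is not a database query, but even this naive version buys you decoupled deploy and release on day one.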

2. Canary Deployments

Route 1-5% of traffic to new code before full rollout. On AWS, you can do this with ECS weighted target groups, Kubernetes with Istio traffic splitting, or AWS CodeDeploy's canary strategy. The pattern is the same: deploy new code alongside old code, compare metrics, roll forward or roll back. See my scaling guide for when to introduce this.

3. Graceful Degradation

Identify which features are critical (login, payments, core product) and which can be turned off under load (recommendations, analytics, non-critical notifications). Build kill switches for non-critical features from day one. When your first traffic spike hits, you will be glad you can shed load deliberately instead of watching everything fall over together. Pair this with a monitoring stack so you know when to flip the switches.

Frequently Asked Questions

How does Meta deploy code without downtime?

Through a multi-stage pipeline: code review, automated testing, hermetic build, canary deployment to 1% of traffic, gradual rollout (1% → 10% → 50% → 100%) with automated metric comparison at each stage, and instant auto-rollback if error rates spike. No code reaches all users without passing through every stage. Tupperware orchestrates the container rollout, and Gatekeeper feature flags gate visibility.

What is Tupperware at Meta?

Meta's custom container orchestration platform, built before Kubernetes existed. It manages millions of containers across all of Meta's data centers through a single global control plane. Unlike Kubernetes (per-cluster), Tupperware schedules across data centers simultaneously. For most companies, Kubernetes is the right choice — Tupperware exists because of Meta's unique scale and timing.

What is Gatekeeper at Facebook?

Meta's feature flag system. Every new feature ships behind a Gatekeeper flag controlling which users see it. Flags can target by user ID, percentage, country, device, or employee status. Enables dark launches, A/B tests, gradual rollouts, and instant kill switches. A feature causing problems can be turned off globally in seconds without a deploy.

How often does Meta deploy code?

Thousands of times per day across the family of apps. There are no deployment windows, no release trains, no Friday freezes. The system is designed for continuous delivery — the canary pipeline and automated rollback make it safe to deploy constantly.

How does Meta handle canary deployments?

1% of real traffic is routed to servers running new code. Automated monitoring compares error rates, latency, CPU, and business metrics between canary and control. If any metric degrades beyond a threshold, auto-rollback fires with no human intervention. Only after passing does the rollout proceed to 10% → 50% → 100%.

What happens when Meta has an outage?

A SEV classification is assigned. SEV-1 (total outage) triggers an incident commander, war room, all-hands response, and deploy freeze. The response follows: identify blast radius → isolate cause → mitigate (usually Gatekeeper flag flip or rollback) → root-cause analysis. The 2021 global outage (BGP misconfiguration) led to major investments in independent recovery paths.

Does Meta use Kubernetes?

Meta primarily uses Tupperware. They have contributed to Kubernetes ecosystem tools but their core orchestration is custom. Tupperware differs in operating as a single global scheduler. For most companies, Kubernetes is the right choice — Tupperware exists because Meta needed it before K8s was mature.

What can smaller companies learn from Meta's deployment practices?

Three things: (1) Feature flags from day one — LaunchDarkly, Unleash, or a simple DB table. (2) Canary deployments — route 1-5% of traffic to new code before full rollout, using ECS weighted targets or Istio. (3) Graceful degradation — identify critical vs non-critical features and build kill switches from the start.


Akshay Ghalme

AWS DevOps Engineer with 3+ years building production cloud infrastructure. AWS Certified Solutions Architect. Currently managing a multi-tenant SaaS platform serving 1000+ customers.
