DEEP DIVE SYSTEM DESIGN

How Meta Stores Trillions of Photos, Messages, and Social Graph Edges

By Akshay Ghalme · April 16, 2026 · 21 min read

Every photo you have ever posted on Facebook still exists. Every Instagram story (even the expired ones — for moderation). Every WhatsApp backup. Every Messenger conversation. Meta stores trillions of objects across custom-built storage systems that most engineers have never heard of — because they were never open-sourced, never available on any cloud provider, and never needed by anyone operating at less than planetary scale.

This is Part 3 of the Meta Infrastructure series. Part 1 covers WhatsApp's messaging architecture. Part 2 covers Meta's zero-downtime deployment. This post covers the storage layer — the systems that hold the data behind all of Meta's products.

The scale is difficult to comprehend:

Meta storage (estimated, 2026):
  Total data stored:           Exabytes (1 exabyte = 1 million terabytes)
  New photos uploaded daily:   Hundreds of millions
  Social graph edges:          Trillions (friendships, likes, follows, comments)
  Messages processed daily:    100+ billion (WhatsApp alone)
  Data growth rate:            Petabytes per day
  Storage systems:             TAO, Haystack, f4, Cassandra, MySQL, custom blob stores

At this scale, you cannot use off-the-shelf databases. PostgreSQL cannot handle 2 billion reads per second. S3 was not designed for the access patterns of a social graph. Standard filesystems collapse under the metadata overhead of trillions of small files. Meta had to build everything custom — and the systems they built are among the most interesting pieces of infrastructure engineering in the industry.

TAO — The Social Graph Cache (2 Billion Reads Per Second)

When you open Facebook and see your friend's latest post, the query that fetches it does not hit a traditional database. It hits TAO — The Associations and Objects — Meta's distributed, read-optimized cache layer for the social graph.

TAO's data model is elegantly simple. Everything in Facebook's social graph is either an object or an association:

  • Objects: users, posts, photos, pages, groups, comments — each has an ID, a type, and key-value attributes.
  • Associations: friendships, likes, follows, comments-on-post — each is a directed edge connecting two objects, with a type and optional metadata.

"Show me Alice's friends" is an association query: get_associations(alice_id, FRIEND). "Show me Alice's latest post" is: get_associations(alice_id, AUTHORED, limit=1) followed by get_object(post_id). The entire Facebook news feed, with all its likes, comments, shares, and friend connections, is built on these two primitive operations.
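The two primitives can be sketched in a few lines. This is a hypothetical, in-memory model of TAO's data model only (the real system is a distributed cache); `TinyTao`, `TaoObject`, and the method names are illustrative, not Meta's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class TaoObject:
    id: int
    otype: str                      # e.g. "user", "post"
    data: dict = field(default_factory=dict)

class TinyTao:
    def __init__(self):
        self.objects = {}           # id -> TaoObject
        self.assocs = {}            # (src_id, assoc_type) -> dst ids, newest first

    def add_object(self, obj):
        self.objects[obj.id] = obj

    def add_assoc(self, src, atype, dst):
        self.assocs.setdefault((src, atype), []).insert(0, dst)

    def get_object(self, oid):
        return self.objects.get(oid)

    def get_associations(self, src, atype, limit=None):
        dsts = self.assocs.get((src, atype), [])
        return dsts if limit is None else dsts[:limit]

# "Show me Alice's latest post" as two primitive operations:
tao = TinyTao()
tao.add_object(TaoObject(1, "user", {"name": "Alice"}))
tao.add_object(TaoObject(100, "post", {"text": "hello"}))
tao.add_assoc(1, "AUTHORED", 100)
latest = tao.get_associations(1, "AUTHORED", limit=1)[0]
print(tao.get_object(latest).data["text"])   # hello
```

Everything richer — feeds, comment threads, like counts — composes out of these two lookups, which is exactly what makes the model cacheable at scale.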

  graph TD
    APP[Application Tier: Feed, Timeline, Search] --> TAO_L[TAO Leader Cache: read + write]
    APP --> TAO_F[TAO Follower Cache: read-only replicas]
    TAO_F --> TAO_L
    TAO_L -->|cache miss| MYSQL[(MySQL: persistent storage)]
    TAO_L -->|write| MYSQL
    TAO_L -->|invalidate| TAO_F

Why Not Just Use MySQL Directly?

MySQL is the persistent storage layer — TAO's truth lives in MySQL. But MySQL cannot serve 2 billion reads per second across a globally distributed fleet. TAO is a purpose-built cache layer that absorbs the vast majority of reads. A cache hit on TAO returns in microseconds. A cache miss falls through to MySQL and is then cached for future reads.
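The read path above is a classic read-through cache. A minimal sketch, assuming a dictionary stands in for both the cache tier and the MySQL query function (names are illustrative):

```python
class ReadThroughCache:
    def __init__(self, backing_read):
        self.cache = {}
        self.backing_read = backing_read   # stands in for a MySQL query
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.hits += 1                 # cache hit: microseconds
            return self.cache[key]
        self.misses += 1
        value = self.backing_read(key)     # cache miss: falls through to MySQL
        self.cache[key] = value            # cached for future reads
        return value

db = {"user:1": {"name": "Alice"}}
cache = ReadThroughCache(lambda k: db[k])
cache.get("user:1")                        # miss -> reads MySQL, populates cache
cache.get("user:1")                        # hit -> never touches MySQL
print(cache.hits, cache.misses)            # 1 1
```

At a 99.8% hit rate, the `backing_read` path runs for only 1 in 500 requests — which is the entire point of putting TAO in front of MySQL.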

Consistency Model

TAO provides read-after-write consistency within a data center (if you write an object, you can immediately read it back from TAO in the same DC) and eventual consistency across data centers (a write in the US data center takes a few seconds to propagate to the EU data center's TAO caches). This is the right trade-off for a social network — you see your own post immediately, and your friend in another country sees it a few seconds later.

The Scale Numbers

TAO handles over 2 billion read queries per second at peak. The cache hit rate exceeds 99.8% — only 0.2% of reads actually touch MySQL. This means MySQL handles roughly 4 million reads per second (which is still enormous), but without TAO it would need to handle 2 billion — which is impossible for any relational database.

Haystack — Warm Photo Storage

Photos are the largest category of data at Meta by volume. Hundreds of millions of new photos are uploaded every day across Facebook, Instagram, WhatsApp backups, and Messenger. The classic approach — store each photo as a file on a filesystem — fails catastrophically at this scale.

Why Traditional Filesystems Fail

A POSIX filesystem stores metadata (file name, size, permissions, timestamps) in an inode for each file. The inode must be read from disk before the actual file content can be read. On a system with billions of photos, the inode table is too large to fit in memory, which means every photo read requires at least two disk seeks — one for the inode, one for the data. At Facebook's volume, this doubles the I/O load for no good reason.

Haystack's Solution — Eliminate Metadata Overhead

Haystack packs thousands of photos into a single large file called a haystack volume. An in-memory index maps each photo's ID to its offset and size within the volume. A photo read becomes:

  1. Look up the photo ID in the in-memory index → get (volume_id, offset, size)
  2. Seek to offset in the volume file, read size bytes

One memory lookup, one disk seek, one sequential read. The metadata overhead that crippled POSIX filesystems is eliminated because the index fits entirely in RAM. A single Haystack machine can serve hundreds of thousands of photo reads per second.
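The read path can be sketched directly. This is a deliberately tiny model of the idea — a `BytesIO` stands in for the large on-disk volume file, and the index layout is a simplification of Haystack's actual needle format:

```python
import io

class MiniHaystack:
    def __init__(self):
        self.volume = io.BytesIO()   # stands in for one large volume file
        self.index = {}              # photo_id -> (offset, size); fits in RAM

    def write(self, photo_id, data):
        offset = self.volume.seek(0, io.SEEK_END)   # append-only volume
        self.volume.write(data)
        self.index[photo_id] = (offset, len(data))

    def read(self, photo_id):
        offset, size = self.index[photo_id]   # one memory lookup
        self.volume.seek(offset)              # one disk seek
        return self.volume.read(size)         # one sequential read

store = MiniHaystack()
store.write(42, b"\x89PNG...photo bytes...")
assert store.read(42).startswith(b"\x89PNG")
```

No inode is ever touched: the filesystem sees one giant file, and all per-photo metadata lives in the in-memory dictionary.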

f4 — Cold Photo Storage

Not all photos are accessed equally. A photo uploaded today gets viewed hundreds of times in the first 24 hours. A photo from 2015 might get viewed once a year. Keeping old photos in Haystack (which optimizes for fast reads) wastes expensive storage capacity.

f4 is Meta's cold storage system. When photos age past a threshold (typically weeks to months of low access frequency), they are migrated from Haystack to f4. The key difference: f4 uses erasure coding instead of full replication.

  • Haystack (warm): stores 3 full copies of each photo (3x storage overhead) for fast reads and high availability.
  • f4 (cold): uses Reed-Solomon erasure coding to achieve the same durability with roughly 1.4x storage overhead — less than half the cost per byte.

The trade-off: reads from f4 are slower (erasure-coded data requires reconstruction from multiple fragments) and more CPU-intensive. This is acceptable because cold photos are rarely read, and when they are, a few hundred extra milliseconds of latency is unnoticeable.
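The overhead numbers fall straight out of the coding parameters. A back-of-the-envelope sketch, assuming an RS(14, 10) layout purely for illustration (Meta's actual coding scheme differs in detail):

```python
def storage_overhead(data_blocks, parity_blocks):
    """Bytes stored per byte of user data under Reed-Solomon coding,
    plus how many lost blocks the stripe survives."""
    n = data_blocks + parity_blocks
    return n / data_blocks, parity_blocks

overhead, tolerated_losses = storage_overhead(data_blocks=10, parity_blocks=4)
print(overhead)            # 1.4 -> the "1.4x overhead", survives any 4 lost blocks
print(3.0 / overhead)      # ~2.14x cheaper per byte than 3x full replication
```

Replication at 3x also survives the loss of any two copies, but pays for it with two full extra copies; erasure coding buys comparable durability with 40% extra bytes instead of 200%.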

  graph LR
    UPLOAD[Photo Upload] --> PROCESS[Thumbnail Generation + Encoding]
    PROCESS --> HAYSTACK[(Haystack: Warm Storage, 3x replication)]
    HAYSTACK -->|aging policy: weeks to months| F4[(f4: Cold Storage, 1.4x erasure coding)]
    HAYSTACK --> CDN[CDN Edge Cache]
    F4 --> CDN
    CDN --> USER[User Views Photo]

This lifecycle mirrors AWS S3's storage classes — S3 Standard (warm, high-availability) → S3 Glacier (cold, low-cost) — but Meta's versions are custom-built for their specific access patterns and far more efficient at Meta's scale. For the equivalent on AWS, see my S3 vs EFS vs EBS comparison.

Cassandra for Messaging — Write-Heavy at Planetary Scale

WhatsApp processes 100+ billion messages per day. Messenger handles billions more. This write volume is orders of magnitude beyond what MySQL or PostgreSQL can handle, even with sharding. Meta uses Apache Cassandra for messaging storage — a distributed NoSQL database designed for exactly this workload.

Why Cassandra for Messages

  • Write-optimized: Cassandra's LSM-tree storage engine handles sequential writes faster than B-tree databases. A message write is appended to a commit log and memtable — no random disk seeks.
  • Tunable consistency: For messages, eventual consistency is acceptable (a few hundred milliseconds of delay between replicas is fine). This lets Cassandra distribute writes across multiple nodes without coordination locks.
  • Linear scalability: Adding more Cassandra nodes linearly increases write throughput. No single master bottleneck.
  • Geographic distribution: Cassandra's multi-datacenter replication is built in. A message written in Mumbai is replicated to the US data center asynchronously.
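The first bullet — why LSM writes avoid random seeks — is worth seeing concretely. A minimal sketch of an LSM-style write path (a hypothetical simplification of Cassandra's; real memtables are sorted structures and SSTables live on disk):

```python
class LsmWritePath:
    def __init__(self, memtable_limit=2):
        self.commit_log = []          # sequential, append-only (durability)
        self.memtable = {}            # in-memory write buffer
        self.sstables = []            # immutable flushed runs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))    # 1. sequential append, no seek
        self.memtable[key] = value              # 2. in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}                  # 3. flush as a sorted run

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest run wins
            if key in table:
                return table[key]
        return None

db = LsmWritePath()
db.write("msg:1", "hi")
db.write("msg:2", "hello")        # second write triggers a flush
print(db.read("msg:1"))           # hi
```

Every write touches only the end of the commit log and memory — the B-tree pattern of read-modify-write on random pages never happens, which is why the write path scales with message volume.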

Message Lifecycle

Messages on WhatsApp's server are transient — they are stored only until delivered, then deleted. The server is a store-and-forward relay, not a permanent archive. Messenger messages are persistent — Meta stores them indefinitely (until the user deletes them). The Cassandra clusters for WhatsApp vs Messenger have very different retention policies and data volumes because of this.

The CDN Layer — 50ms Photo Delivery

Most users never wait for a photo to be fetched from Haystack or f4. Meta operates a massive multi-tier CDN that caches content as close to the user as possible:

  1. Edge PoPs — points of presence in major cities around the world. The most popular content (trending photos, viral videos, profile pictures of popular accounts) is cached here. Most photo requests are served from the edge.
  2. Origin Caches — regional cache clusters that sit between the edge and the storage backend. Content that is not hot enough for the edge but too recent for cold storage is served from here.
  3. Storage Backend — Haystack (warm) or f4 (cold). Only cache misses from both the edge and origin cache layers reach here.

Because of the power-law distribution of social media content (a tiny percentage of photos get the vast majority of views), the CDN hit rate is extremely high — most photo requests never reach the storage backend at all. The CDN absorbs the read load, and Haystack/f4 only handle the long tail of rarely-viewed content.
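The three-tier lookup above reduces to a simple waterfall. A sketch, assuming dictionaries stand in for the edge PoP, origin cache, and Haystack/f4 backend (all names illustrative):

```python
def fetch(photo_id, edge, origin, backend):
    """Try edge, then origin, then storage; populate tiers on the way back."""
    if photo_id in edge:
        return edge[photo_id], "edge"
    if photo_id in origin:
        edge[photo_id] = origin[photo_id]       # promote hot content to edge
        return edge[photo_id], "origin"
    blob = backend[photo_id]                    # Haystack/f4 read (rare)
    origin[photo_id] = edge[photo_id] = blob    # cache at both tiers
    return blob, "backend"

edge, origin, backend = {}, {}, {101: b"jpeg bytes"}
print(fetch(101, edge, origin, backend)[1])    # backend (first request)
print(fetch(101, edge, origin, backend)[1])    # edge (every request after)
```

Under a power-law access pattern, almost every request for a popular photo after the first takes the `"edge"` branch — the backend sees the object once, not millions of times.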

Data Center Architecture — Built From Scratch

Meta does not rent space in other companies' data centers (with some exceptions for edge PoPs). They design and build their own data centers from the ground up. In 2011, Meta launched the Open Compute Project (OCP) — open-sourcing their data center designs so anyone can benefit.

Why Build Custom

At Meta's scale, standard enterprise hardware (Dell, HP, Cisco) is too expensive and too inflexible. A custom server designed for Meta's specific workload — lots of storage, moderate compute, specific network interface cards — costs 30-40% less than an equivalent off-the-shelf server. When you are buying millions of servers, that 30% savings is billions of dollars.

Open Compute Project Contributions

  • Custom servers — vanity-free (no bezels, no brand plates), modular, tool-less maintenance. Designed to be serviced by technicians who manage thousands of machines.
  • Custom racks — Open Rack standard with shared power shelves and a common power bus. Eliminates redundant power supplies per server.
  • Custom network switches — Wedge and 6-pack switches, running open-source networking OS (FBOSS). Replaced proprietary Cisco switches at a fraction of the cost.
  • Power and cooling — evaporative cooling in data centers located in cool climates (Luleå, Sweden; Prineville, Oregon), reducing energy cost by 40% compared to traditional HVAC.

Meta's Storage Stack vs AWS Equivalents

  • TAO: social graph cache (2B reads/sec). AWS equivalent: ElastiCache + DynamoDB. Key difference: purpose-built for graph queries, with a custom consistency model.
  • Haystack: warm photo/video storage. AWS equivalent: S3 Standard. Key difference: eliminates POSIX inode overhead with an in-memory index.
  • f4: cold archival storage. AWS equivalent: S3 Glacier / S3 Glacier Deep Archive. Key difference: erasure coding at 1.4x overhead instead of 3x full replication.
  • Cassandra: write-heavy message storage. AWS equivalent: DynamoDB / Amazon Keyspaces. Key difference: open source, multi-DC native, tunable consistency.
  • MySQL (with TAO): persistent graph storage. AWS equivalent: RDS MySQL / Aurora. Key difference: massively sharded; TAO absorbs 99.8% of reads.
  • Meta CDN: content delivery. AWS equivalent: CloudFront. Key difference: custom edge + origin caches, tightly integrated with storage.
  • OCP hardware: custom servers, racks, switches. AWS equivalent: none user-facing (AWS builds custom hardware too, but does not expose it). Key difference: open-sourced designs, 30-40% cheaper than enterprise hardware.

For most companies, the AWS column is the right answer. Meta builds custom because their scale justifies the engineering investment. A startup running the same workload patterns can achieve 90% of Meta's architecture using off-the-shelf AWS services at a fraction of the engineering cost. See my scaling guide for how to get there.

Warm vs Cold Storage — The Economics

The single most important insight from Meta's storage architecture is the storage lifecycle: data starts hot, cools down over time, and the optimal storage system changes as it cools. Here is the cost curve:

Storage tier:       Access frequency:     Cost per TB/month:    Retrieval latency:
Hot (in memory)     Millions/sec          $$$$$                 Microseconds
Warm (Haystack)     Thousands/day         $$$                   Low milliseconds
Cold (f4)           A few/month           $                     Hundreds of ms
Archive             Compliance only       ¢                     Minutes to hours

Most companies keep everything in "warm" storage because it is the default. Moving 80% of your data from warm to cold reduces storage costs by 60-70% with zero impact on user experience, because users almost never access old data. This is the exact same pattern behind AWS cost optimization — the hidden costs are in data you are paying premium prices to store but nobody accesses.
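The savings claim is easy to verify with back-of-the-envelope numbers. A worked example with illustrative prices (roughly S3 Standard vs a Glacier-class tier, not Meta's internal costs):

```python
# Move 80% of data from warm storage to cold storage that costs ~1/6 as much.
warm_price, cold_price = 0.023, 0.004   # $/GB-month, illustrative
data_gb = 100_000                       # 100 TB total

before = data_gb * warm_price
after = 0.2 * data_gb * warm_price + 0.8 * data_gb * cold_price
print(round(1 - after / before, 2))     # ~0.66 -> about 66% saved
```

That lands squarely in the 60-70% range above, and the saved dollars scale linearly with how much cold data you are currently paying warm prices for.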

What Smaller Companies Can Learn From Meta's Storage

  • Tiered storage is not optional at scale. If you have data older than 90 days that is rarely accessed, it should be on a cheaper storage tier. On AWS: S3 Lifecycle policies to move objects from Standard → Infrequent Access → Glacier automatically.
  • Cache everything that gets read more than once. TAO's 99.8% cache hit rate is extreme, but even a simple Redis layer with a 90% hit rate reduces database load by 10x. See the caching stage in my scaling guide.
  • Content-addressable storage for media. Store uploads by hash (content-addressed) rather than user-assigned filenames. This deduplicates identical uploads automatically — if 1,000 users share the same meme, you store it once.
  • Separate the CDN from the storage backend. Your storage backend handles writes and cold reads. Your CDN handles hot reads. S3 + CloudFront is the AWS version of Haystack + Meta CDN.
  • The database is not for blobs. Never store images, videos, or large files in your relational database. Use object storage (S3) for blobs and the database for references (URLs, metadata). This is what Haystack taught Meta — and what most small teams learn the hard way.
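The content-addressed storage point is a one-function idea. A minimal sketch, assuming a dictionary stands in for an object store keyed by S3-style paths (`BlobStore` is hypothetical):

```python
import hashlib

class BlobStore:
    def __init__(self):
        self.blobs = {}   # digest -> bytes (stands in for S3 object keys)

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)   # identical upload is a no-op
        return digest                         # store this key in your DB row

store = BlobStore()
meme = b"same image bytes"
key_a = store.put(meme)    # user 1 uploads
key_b = store.put(meme)    # user 1000 uploads the identical file
print(key_a == key_b, len(store.blobs))   # True 1
```

Because the key is derived from the content, deduplication requires no extra logic — the thousandth upload of the same meme resolves to the object that already exists.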

Frequently Asked Questions

How does Facebook store photos?

Using two custom storage systems. Haystack stores warm (recently uploaded, frequently accessed) photos in content-addressable volumes with an in-memory index, optimized for fast sequential reads. When photos age and are accessed less frequently, they migrate to f4, a cold storage system using erasure coding at 1.4x overhead instead of 3x replication — same durability at less than half the cost per byte.

What is TAO at Facebook?

TAO (The Associations and Objects) is Facebook's social graph cache. It handles over 2 billion reads per second, caching the social graph (friendships, likes, follows) on top of MySQL. TAO uses a simple objects + associations data model. Cache hit rate exceeds 99.8%, meaning MySQL only handles 0.2% of reads directly.

What database does WhatsApp use?

Mnesia (Erlang's built-in distributed database) for session and presence data. Custom-modified LMDB for temporary message storage on the server. Messages are stored only until delivered, then deleted. Messenger uses Cassandra for persistent message storage. Long-term chat backups live on Google Drive (Android) or iCloud (iOS).

How much data does Meta store?

Exabytes total, growing by petabytes per day. Hundreds of millions of new photos daily. Trillions of social graph edges. 100+ billion WhatsApp messages daily. Meta builds custom storage hardware through the Open Compute Project to manage this volume at reasonable cost.

What is Haystack at Facebook?

Meta's warm photo storage system. Traditional filesystems store one inode per file — at billions of photos, inode metadata exceeds memory, forcing two disk seeks per read. Haystack eliminates this by packing thousands of photos into single large files with an in-memory index. Result: one memory lookup + one disk seek per photo, handling hundreds of thousands of reads per second per machine.

How does Meta's CDN work?

Multi-tier: Edge PoPs in major cities cache the most popular content. Origin Caches between edge and storage handle the middle tier. Haystack/f4 storage backend handles cache misses. Power-law distribution means most requests never reach the storage backend — the CDN absorbs the vast majority of read load.

What is the Open Compute Project?

An initiative Facebook started in 2011 to open-source their data center designs — custom servers, racks, switches, and power/cooling systems. At Meta's scale, off-the-shelf enterprise hardware was too expensive. Custom designs are 30-40% cheaper and specifically optimized for their workloads.

How does Meta's storage compare to AWS?

TAO ≈ ElastiCache + DynamoDB. Haystack ≈ S3 Standard. f4 ≈ S3 Glacier. Cassandra ≈ DynamoDB/Keyspaces. Meta CDN ≈ CloudFront. The key difference: Meta builds custom for their specific patterns. For most companies, the AWS equivalents are the right choice — same architectural patterns without the engineering cost of building from scratch.


Akshay Ghalme

AWS DevOps Engineer with 3+ years building production cloud infrastructure. AWS Certified Solutions Architect. Currently managing a multi-tenant SaaS platform serving 1000+ customers.
