How Meta Stores Trillions of Photos, Messages, and Social Graph Edges
Every photo you have ever posted on Facebook still exists. Every Instagram story (even the expired ones — for moderation). Every WhatsApp backup. Every Messenger conversation. Meta stores trillions of objects across custom-built storage systems that most engineers have never heard of — because they were never open-sourced, never available on any cloud provider, and never needed by anyone operating at less than planetary scale.
This is Part 3 of the Meta Infrastructure series. Part 1 covers WhatsApp's messaging architecture. Part 2 covers Meta's zero-downtime deployment. This post covers the storage layer — the systems that hold the data behind all of Meta's products.
The scale is difficult to comprehend:
Meta storage (estimated, 2026):
- Total data stored: Exabytes (1 exabyte = 1 million terabytes)
- New photos uploaded daily: Hundreds of millions
- Social graph edges: Trillions (friendships, likes, follows, comments)
- Messages processed daily: 100+ billion (WhatsApp alone)
- Data growth rate: Petabytes per day
- Storage systems: TAO, Haystack, f4, Cassandra, MySQL, custom blob stores
At this scale, you cannot use off-the-shelf databases. PostgreSQL cannot handle 2 billion reads per second. S3 was not designed for the access patterns of a social graph. Standard filesystems collapse under the metadata overhead of trillions of small files. Meta had to build everything custom — and the systems they built are among the most interesting pieces of infrastructure engineering in the industry.
TAO — The Social Graph Cache (2 Billion Reads Per Second)
When you open Facebook and see your friend's latest post, the query that fetches it does not hit a traditional database. It hits TAO — The Associations and Objects — Meta's distributed, read-optimized cache layer for the social graph.
TAO's data model is elegantly simple. Everything in Facebook's social graph is either an object or an association:
- Objects: users, posts, photos, pages, groups, comments — each has an ID, a type, and key-value attributes.
- Associations: friendships, likes, follows, comments-on-post — each is a directed edge connecting two objects, with a type and optional metadata.
"Show me Alice's friends" is an association query: get_associations(alice_id, FRIEND). "Show me Alice's latest post" is: get_associations(alice_id, AUTHORED, limit=1) followed by get_object(post_id). The entire Facebook news feed, with all its likes, comments, shares, and friend connections, is built on these two primitive operations.
```mermaid
graph TD
    APP[App Servers<br/>Feed, Timeline, Search] --> TAO_L[TAO Leader Cache<br/>read + write]
    APP --> TAO_F[TAO Follower Cache<br/>read-only replicas]
    TAO_F --> TAO_L
    TAO_L -->|cache miss| MYSQL[(MySQL<br/>persistent storage)]
    TAO_L -->|write| MYSQL
    TAO_L -->|invalidate| TAO_F
    classDef app fill:#6C3CE1,stroke:#00D4AA,stroke-width:2px,color:#fff;
    classDef cache fill:#4A1DB5,stroke:#00D4AA,stroke-width:2px,color:#fff;
    classDef db fill:#047857,stroke:#00D4AA,stroke-width:3px,color:#fff;
    class APP app;
    class TAO_L,TAO_F cache;
    class MYSQL db;
```
Why Not Just Use MySQL Directly?
MySQL is the persistent storage layer — TAO's truth lives in MySQL. But MySQL cannot serve 2 billion reads per second across a globally distributed fleet. TAO is a purpose-built cache layer that absorbs the vast majority of reads. A cache hit on TAO returns in microseconds. A cache miss falls through to MySQL and is then cached for future reads.
Consistency Model
TAO provides read-after-write consistency within a data center (if you write an object, you can immediately read it back from TAO in the same DC) and eventual consistency across data centers (a write in the US data center takes a few seconds to propagate to the EU data center's TAO caches). This is the right trade-off for a social network — you see your own post immediately, and your friend in another country sees it a few seconds later.
The Scale Numbers
TAO handles over 2 billion read queries per second at peak. The cache hit rate exceeds 99.8% — only 0.2% of reads actually touch MySQL. This means MySQL handles roughly 4 million reads per second (which is still enormous), but without TAO it would need to handle 2 billion — which is impossible for any relational database.
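The arithmetic behind that claim is worth a quick sanity check, using only the two numbers stated above:

```python
# Back-of-envelope: how much read traffic actually reaches MySQL,
# given the peak read rate and cache hit rate quoted in the text.
reads_per_sec = 2_000_000_000   # TAO peak reads/sec
hit_rate = 0.998                # 99.8% served from cache

mysql_reads = reads_per_sec * (1 - hit_rate)
print(f"{mysql_reads:,.0f} reads/sec reach MySQL")  # 4,000,000 reads/sec
```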
Haystack — Warm Photo Storage
Photos are the largest category of data at Meta by volume. Hundreds of millions of new photos are uploaded every day across Facebook, Instagram, WhatsApp backups, and Messenger. The classic approach — store each photo as a file on a filesystem — fails catastrophically at this scale.
Why Traditional Filesystems Fail
A POSIX filesystem stores metadata (file name, size, permissions, timestamps) in an inode for each file. The inode must be read from disk before the actual file content can be read. On a system with billions of photos, the inode table is too large to fit in memory, which means every photo read requires at least two disk seeks — one for the inode, one for the data. At Facebook's volume, this doubles the I/O load for no good reason.
Haystack's Solution — Eliminate Metadata Overhead
Haystack packs thousands of photos into a single large file called a haystack volume. An in-memory index maps each photo's ID to its offset and size within the volume. A photo read becomes:
- Look up the photo ID in the in-memory index → get `(volume_id, offset, size)`
- Seek to `offset` in the volume file, read `size` bytes
One memory lookup, one disk seek, one sequential read. The metadata overhead that crippled POSIX filesystems is eliminated because the index fits entirely in RAM. A single Haystack machine can serve hundreds of thousands of photo reads per second.
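A minimal sketch of that read path, assuming a toy append-only volume (real Haystack adds per-photo "needle" headers, checksums, and deletion flags, none of which are modeled here):

```python
import os
import tempfile

# Haystack-style volume sketch: photos appended to one large file, with an
# in-memory dict mapping photo_id -> (offset, size). A toy, not Meta's code.
class Volume:
    def __init__(self, path):
        self.f = open(path, "w+b")
        self.index = {}                      # photo_id -> (offset, size)

    def put(self, photo_id, data):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(data)                   # sequential append
        self.f.flush()
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]  # one in-memory lookup
        self.f.seek(offset)                  # one disk seek
        return self.f.read(size)             # one sequential read

vol = Volume(os.path.join(tempfile.mkdtemp(), "volume_0.dat"))
vol.put(42, b"jpeg bytes of photo 42")
vol.put(43, b"jpeg bytes of photo 43")
print(vol.get(42))  # b'jpeg bytes of photo 42'
```

Contrast this with a POSIX filesystem: the `index` dict plays the role of the inode table, but because it holds only an ID and two integers per photo, it stays small enough to pin in RAM.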
f4 — Cold Photo Storage
Not all photos are accessed equally. A photo uploaded today gets viewed hundreds of times in the first 24 hours. A photo from 2015 might get viewed once a year. Keeping old photos in Haystack (which optimizes for fast reads) wastes expensive storage capacity.
f4 is Meta's cold storage system. When photos age past a threshold (typically weeks to months of low access frequency), they are migrated from Haystack to f4. The key difference: f4 uses erasure coding instead of full replication.
- Haystack (warm): stores 3 full copies of each photo (3x storage overhead) for fast reads and high availability.
- f4 (cold): uses Reed-Solomon erasure coding to achieve the same durability with roughly 1.4x storage overhead — less than half the cost per byte.
The trade-off: reads from f4 are slower (erasure-coded data requires reconstruction from multiple fragments) and more CPU-intensive. This is acceptable because cold photos are rarely read, and when they are, a few hundred extra milliseconds of latency is unnoticeable.
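Where does 1.4x come from? The published f4 design uses Reed-Solomon(10, 4) within a data center: 10 data blocks plus 4 parity blocks, any 10 of which can reconstruct the original data. The overhead arithmetic:

```python
# Storage overhead of Reed-Solomon(k data blocks, m parity blocks)
# versus full replication. f4's published scheme is RS(10, 4).
def rs_overhead(k, m):
    return (k + m) / k

replication = 3.0              # Haystack: 3 full copies
erasure = rs_overhead(10, 4)   # f4: (10 + 4) / 10

print(erasure)                 # 1.4
print(erasure / replication)   # ~0.47 -> less than half the cost per byte
```

RS(10, 4) also tolerates the loss of any 4 of the 14 blocks — comparable durability to 3x replication, at under half the footprint.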
```mermaid
graph TD
    UPLOAD[User Uploads Photo] --> PROCESS[Resize<br/>+ Encoding]
    PROCESS --> HAYSTACK[(Haystack<br/>Warm Storage<br/>3x replication)]
    HAYSTACK -->|Aging policy<br/>weeks/months| F4[(f4<br/>Cold Storage<br/>1.4x erasure coding)]
    HAYSTACK --> CDN[CDN Edge Cache]
    F4 --> CDN
    CDN --> USER[User Views Photo]
    classDef hot fill:#6C3CE1,stroke:#00D4AA,stroke-width:2px,color:#fff;
    classDef warm fill:#4A1DB5,stroke:#00D4AA,stroke-width:2px,color:#fff;
    classDef cold fill:#047857,stroke:#00D4AA,stroke-width:3px,color:#fff;
    class UPLOAD,PROCESS hot;
    class HAYSTACK,CDN warm;
    class F4 cold;
    class USER hot;
```
This lifecycle mirrors AWS S3's storage classes — S3 Standard (warm, high-availability) → S3 Glacier (cold, low-cost) — but Meta's versions are custom-built for their specific access patterns and far more efficient at Meta's scale. For the equivalent on AWS, see my S3 vs EFS vs EBS comparison.
Cassandra for Messaging — Write-Heavy at Planetary Scale
WhatsApp processes 100+ billion messages per day. Messenger handles billions more. This write volume is orders of magnitude beyond what MySQL or PostgreSQL can handle, even with sharding. Meta uses Apache Cassandra for messaging storage — a distributed NoSQL database designed for exactly this workload.
Why Cassandra for Messages
- Write-optimized: Cassandra's LSM-tree storage engine handles sequential writes faster than B-tree databases. A message write is appended to a commit log and memtable — no random disk seeks.
- Tunable consistency: For messages, eventual consistency is acceptable (a few hundred milliseconds of delay between replicas is fine). This lets Cassandra distribute writes across multiple nodes without coordination locks.
- Linear scalability: Adding more Cassandra nodes linearly increases write throughput. No single master bottleneck.
- Geographic distribution: Cassandra's multi-datacenter replication is built in. A message written in Mumbai is replicated to the US data center asynchronously.
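The first bullet — sequential writes with no random seeks — is the heart of the LSM design, and it can be sketched in a few lines. This is a toy of the write path only (real Cassandra adds memtable flushes to SSTables, compaction, and replication):

```python
import os
import tempfile

# Toy LSM-style write path: every write is (1) appended to a sequential
# commit log for durability and (2) inserted into an in-memory memtable.
# No random disk I/O happens on the write path.
class LSMStore:
    def __init__(self, log_path):
        self.log = open(log_path, "ab")  # append-only commit log
        self.memtable = {}               # in-memory, flushed to SSTables later

    def write(self, key, value):
        record = f"{key}\t{value}\n".encode()
        self.log.write(record)           # 1. sequential append to the log
        self.log.flush()
        self.memtable[key] = value       # 2. in-memory insert

    def read(self, key):
        # Real reads would also consult on-disk SSTables after a flush.
        return self.memtable.get(key)

store = LSMStore(os.path.join(tempfile.mkdtemp(), "commitlog"))
store.write("chat:42:0017", "hey, are you around?")
print(store.read("chat:42:0017"))
```

A B-tree database, by contrast, may need to read and rewrite interior pages at arbitrary disk locations on every insert — exactly the random I/O that caps write throughput.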
Message Lifecycle
Messages on WhatsApp's server are transient — they are stored only until delivered, then deleted. The server is a store-and-forward relay, not a permanent archive. Messenger messages are persistent — Meta stores them indefinitely (until the user deletes them). The Cassandra clusters for WhatsApp vs Messenger have very different retention policies and data volumes because of this.
The CDN Layer — 50ms Photo Delivery
Most users never wait for a photo to be fetched from Haystack or f4. Meta operates a massive multi-tier CDN that caches content as close to the user as possible:
- Edge PoPs — points of presence in major cities around the world. The most popular content (trending photos, viral videos, profile pictures of popular accounts) is cached here. Most photo requests are served from the edge.
- Origin Caches — regional cache clusters that sit between the edge and the storage backend. Content that is not hot enough for the edge but too recent for cold storage is served from here.
- Storage Backend — Haystack (warm) or f4 (cold). Only cache misses from both the edge and origin cache layers reach here.
Because of the power-law distribution of social media content (a tiny percentage of photos get the vast majority of views), the CDN hit rate is extremely high — most photo requests never reach the storage backend at all. The CDN absorbs the read load, and Haystack/f4 only handle the long tail of rarely-viewed content.
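The fall-through behavior of the three tiers can be sketched as a chain of caches, each populating itself on a miss. Tier names here are illustrative, not Meta's actual configuration:

```python
# Multi-tier read path sketch: edge PoP -> origin cache -> storage backend.
# Each tier serves hits locally and falls through to its backend on a miss,
# caching the result on the way back up.
class Tier:
    def __init__(self, name, backend):
        self.name, self.cache, self.backend = name, {}, backend

    def get(self, key):
        if key in self.cache:
            return self.cache[key], self.name      # hit: serve from this tier
        value, served_by = self.backend.get(key)   # miss: fall through
        self.cache[key] = value                    # populate for next time
        return value, served_by

class Storage:  # stand-in for Haystack / f4
    def __init__(self, data):
        self.data = data
    def get(self, key):
        return self.data[key], "storage"

storage = Storage({"photo:1": b"jpeg bytes"})
origin = Tier("origin", backend=storage)
edge = Tier("edge", backend=origin)

print(edge.get("photo:1")[1])  # "storage" -- first request falls all the way through
print(edge.get("photo:1")[1])  # "edge"    -- every later request stops at the edge
```

With power-law popularity, the hot head of the distribution pins itself at the edge after one miss each, which is why the backend sees only the long tail.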
Data Center Architecture — Built From Scratch
Meta does not rent space in other companies' data centers (with some exceptions for edge PoPs). They design and build their own data centers from the ground up. In 2011, Meta launched the Open Compute Project (OCP) — open-sourcing their data center designs so anyone can benefit.
Why Build Custom
At Meta's scale, standard enterprise hardware (Dell, HP, Cisco) is too expensive and too inflexible. A custom server designed for Meta's specific workload — lots of storage, moderate compute, specific network interface cards — costs 30-40% less than an equivalent off-the-shelf server. When you are buying millions of servers, that 30% savings is billions of dollars.
Open Compute Project Contributions
- Custom servers — vanity-free (no bezels, no brand plates), modular, tool-less maintenance. Designed to be serviced by technicians who manage thousands of machines.
- Custom racks — Open Rack standard with shared power shelves and a common power bus. Eliminates redundant power supplies per server.
- Custom network switches — Wedge and 6-pack switches, running open-source networking OS (FBOSS). Replaced proprietary Cisco switches at a fraction of the cost.
- Power and cooling — evaporative cooling in data centers located in cool climates (Luleå, Sweden; Prineville, Oregon), reducing energy cost by 40% compared to traditional HVAC.
Meta's Storage Stack vs AWS Equivalents
| Meta System | Purpose | AWS Equivalent | Key Difference |
|---|---|---|---|
| TAO | Social graph cache (2B reads/sec) | ElastiCache + DynamoDB | Purpose-built for graph queries; custom consistency model |
| Haystack | Warm photo/video storage | S3 Standard | Eliminates POSIX inode overhead; in-memory index |
| f4 | Cold archival storage | S3 Glacier / S3 Glacier Deep Archive | Erasure coding at 1.4x overhead vs Haystack's 3x replication |
| Cassandra | Write-heavy message storage | DynamoDB / Amazon Keyspaces | Open source, multi-DC native, tunable consistency |
| MySQL (with TAO) | Persistent graph storage | RDS MySQL / Aurora | Massively sharded; TAO absorbs 99.8% of reads |
| Meta CDN | Content delivery | CloudFront | Custom edge + origin cache; tight integration with storage |
| OCP Hardware | Custom servers, racks, switches | N/A (AWS uses custom hardware too, but it's not user-facing) | Open-sourced designs; 30-40% cheaper than enterprise hardware |
For most companies, the AWS column is the right answer. Meta builds custom because their scale justifies the engineering investment. A startup running the same workload patterns can achieve 90% of Meta's architecture using off-the-shelf AWS services at a fraction of the engineering cost. See my scaling guide for how to get there.
Warm vs Cold Storage — The Economics
The single most important insight from Meta's storage architecture is the storage lifecycle: data starts hot, cools down over time, and the optimal storage system changes as it cools. Here is the cost curve:
| Storage tier | Access frequency | Cost per TB/month | Retrieval latency |
|---|---|---|---|
| Hot (in memory) | Millions/sec | $$$$$ | Microseconds |
| Warm (Haystack) | Thousands/day | $$$ | Low milliseconds |
| Cold (f4) | A few/month | $ | Hundreds of ms |
| Archive | Compliance only | ¢ | Minutes to hours |
Most companies keep everything in "warm" storage because it is the default. Moving 80% of your data from warm to cold reduces storage costs by 60-70% with zero impact on user experience, because users almost never access old data. This is the exact same pattern behind AWS cost optimization — the hidden costs are in data you are paying premium prices to store but nobody accesses.
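A quick back-of-envelope shows where the 60-70% figure comes from. The per-TB prices below are assumptions for illustration (roughly warm-object-storage vs cold-archive pricing, not real quotes):

```python
# Savings from moving 80% of data to a cold tier. Prices are assumed,
# illustrative $/TB-month figures -- not actual cloud pricing.
total_tb = 1000
warm_price, cold_price = 23.0, 4.0  # assumed $/TB-month

all_warm = total_tb * warm_price
tiered = 0.2 * total_tb * warm_price + 0.8 * total_tb * cold_price

savings = 1 - tiered / all_warm
print(f"{savings:.0%} saved")  # ~66% with these assumed prices
```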
What Smaller Companies Can Learn From Meta's Storage
- Tiered storage is not optional at scale. If you have data older than 90 days that is rarely accessed, it should be on a cheaper storage tier. On AWS: S3 Lifecycle policies to move objects from Standard → Infrequent Access → Glacier automatically.
- Cache everything that gets read more than once. TAO's 99.8% cache hit rate is extreme, but even a simple Redis layer with a 90% hit rate reduces database load by 10x. See the caching stage in my scaling guide.
- Content-addressable storage for media. Store uploads by hash (content-addressed) rather than user-assigned filenames. This deduplicates identical uploads automatically — if 1,000 users share the same meme, you store it once.
- Separate the CDN from the storage backend. Your storage backend handles writes and cold reads. Your CDN handles hot reads. S3 + CloudFront is the AWS version of Haystack + Meta CDN.
- The database is not for blobs. Never store images, videos, or large files in your relational database. Use object storage (S3) for blobs and the database for references (URLs, metadata). This is what Haystack taught Meta — and what most small teams learn the hard way.
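The content-addressing point above is one of the cheapest wins to implement. A minimal sketch, assuming a dict as a stand-in for your object store:

```python
import hashlib

# Content-addressed blob storage sketch: the storage key is the SHA-256 of
# the bytes, so identical uploads collapse to one stored object.
blobs = {}  # hash -> bytes (stand-in for an object store like S3)

def put_blob(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    blobs.setdefault(key, data)  # no-op if already stored: free deduplication
    return key                   # persist this key (not the bytes) in your DB

meme = b"the same viral image, byte for byte"
keys = {put_blob(meme) for _ in range(1000)}  # 1,000 users upload the same file
print(len(keys), len(blobs))                  # 1 1 -- stored exactly once
```

The database row for each upload then holds only the hash and metadata, which also makes the blobs immutable and trivially cacheable by a CDN.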
Frequently Asked Questions
How does Facebook store photos?
Using two custom storage systems. Haystack stores warm (recently uploaded, frequently accessed) photos in large append-only volumes with an in-memory index, optimized for fast reads. When photos age and are accessed less frequently, they migrate to f4, a cold storage system using erasure coding at 1.4x overhead instead of 3x replication — the same durability at less than half the cost per byte.
What is TAO at Facebook?
TAO (The Associations and Objects) is Facebook's social graph cache. It handles over 2 billion reads per second, caching the social graph (friendships, likes, follows) on top of MySQL. TAO uses a simple objects + associations data model. Cache hit rate exceeds 99.8%, meaning MySQL only handles 0.2% of reads directly.
What database does WhatsApp use?
Mnesia (Erlang's built-in distributed database) for session and presence data. Custom-modified LMDB for temporary message storage on the server. Messages are stored only until delivered, then deleted. Messenger uses Cassandra for persistent message storage. Long-term chat backups live on Google Drive (Android) or iCloud (iOS).
How much data does Meta store?
Exabytes total, growing by petabytes per day. Hundreds of millions of new photos daily. Trillions of social graph edges. 100+ billion WhatsApp messages daily. Meta builds custom storage hardware through the Open Compute Project to manage this volume at reasonable cost.
What is Haystack at Facebook?
Meta's warm photo storage system. Traditional filesystems store one inode per file — at billions of photos, inode metadata exceeds memory, forcing two disk seeks per read. Haystack eliminates this by packing thousands of photos into single large files with an in-memory index. Result: one memory lookup + one disk seek per photo, handling hundreds of thousands of reads per second per machine.
How does Meta's CDN work?
Multi-tier: Edge PoPs in major cities cache the most popular content. Origin Caches between edge and storage handle the middle tier. Haystack/f4 storage backend handles cache misses. Power-law distribution means most requests never reach the storage backend — the CDN absorbs the vast majority of read load.
What is the Open Compute Project?
An initiative Facebook started in 2011 to open-source their data center designs — custom servers, racks, switches, and power/cooling systems. At Meta's scale, off-the-shelf enterprise hardware was too expensive. Custom designs are 30-40% cheaper and specifically optimized for their workloads.
How does Meta's storage compare to AWS?
TAO ≈ ElastiCache + DynamoDB. Haystack ≈ S3 Standard. f4 ≈ S3 Glacier. Cassandra ≈ DynamoDB/Keyspaces. Meta CDN ≈ CloudFront. The key difference: Meta builds custom for their specific patterns. For most companies, the AWS equivalents are the right choice — same architectural patterns without the engineering cost of building from scratch.
Meta Infrastructure Series
- Part 1: How WhatsApp Delivers Messages to 2 Billion Users — Erlang, Signal Protocol, messaging at planetary scale.
- Part 2: How Meta Deploys Code to 4 Billion Users With Zero Downtime — Tupperware, canary deployments, feature flags.
- Part 3: How Meta Stores Trillions of Photos, Messages, and Social Graph Edges — you are here.
Related Reading
- AWS S3 vs EFS vs EBS — the AWS storage comparison that maps to Meta's storage tiers.
- Scaling from 1K to 1M Users on AWS — the building blocks that lead to Meta-scale patterns.
- AWS NAT Gateway Cost Optimization — the hidden data transfer costs that compound at scale.
- How Netflix DRM Works at Scale — another infrastructure deep dive behind an everyday product.