How Dropbox Syncs Files Without Re-Uploading Them — Block-Level Deduplication and Delta Sync Explained
You edit a 4GB video file in Premiere, trim 10 seconds off the end, and save. Dropbox finishes syncing in 15 seconds. Over a home internet connection that could not possibly upload 4GB in 15 seconds. The reason is that Dropbox did not upload 4GB — it uploaded maybe 200KB. Every file in Dropbox is silently split into blocks, each block is identified by the hash of its contents, and the only blocks that get uploaded on every save are the ones whose hashes are new. This architecture is called content-addressable storage with block-level deduplication, and it is the reason cloud file sync works at all.
This post is a deep walk-through of how Dropbox actually handles the sync problem, grounded in Dropbox's own public engineering blog (they have been remarkably open about their architecture), their 2016 migration from AWS S3 to their custom Magic Pocket storage system, and their 2020 publication on the Nucleus sync engine rewrite in Rust. The core idea — content-addressable blocks — is not exclusive to Dropbox. Git uses it. Backup systems use it. Modern container registries use it. Once you see the pattern, you notice it everywhere.
The Problem Dropbox Had to Solve
Syncing a file across devices sounds easy: upload the whole file to a server, let other devices download it. That works for a 10KB document. It falls apart the moment you think about:
- A 4GB video file where you change a single frame
- 100 engineers who all have the same 500MB node_modules folder
- A 20-slide Keynote deck that you tweak the title of
- A laptop with flaky WiFi that needs to resume an interrupted upload
- A team with a shared 1TB photo library where photos are occasionally moved between folders
A whole-file upload model treats every one of those as "upload the whole file again," which is catastrophically slow and expensive. The fundamental insight Dropbox applied is that most file changes touch only a small fraction of the file's bytes. If you can figure out which bytes changed and only upload those, sync becomes almost free.
Step One — Chunking Files Into Blocks
When Dropbox sees a file, the first thing it does is split it into blocks: fixed-size or variable-size pieces of roughly 4 megabytes each. Dropbox has publicly described using 4MB as their block size, so the 4GB video file becomes roughly 1,000 blocks.
Each block is hashed with a cryptographic hash function (Dropbox uses SHA-256 for block identification). The hash is a 32-byte fingerprint that uniquely identifies the block's contents. Two blocks with identical bytes produce identical hashes. Two blocks that differ by even a single bit produce completely different hashes.
The file is now represented as an ordered list of block hashes plus metadata:
```
file:   my_video.mp4
size:   4,123,456,789 bytes
blocks:
  0:    sha256:3f4a8c91... (4 MB)
  1:    sha256:7d2e1b44... (4 MB)
  2:    sha256:a90c5f73... (4 MB)
  ...
  999:  sha256:1e8b4f22... (3.1 MB, last block)
```
This list of hashes — called a block list or manifest — is the true representation of the file in Dropbox's system. The actual blocks live separately in a giant content-addressable store, keyed by their hashes. Reconstructing the file means fetching the blocks in order and concatenating them.
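The chunk-and-hash step can be sketched in a few lines of Python. This is a simplified model, not Dropbox's client code: it uses naive fixed-size blocks (Dropbox's real chunking is content-defined, as discussed below), and `build_manifest` and its return shape are illustrative.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4MB, the block size Dropbox has described publicly

def build_manifest(path):
    """Split a file into fixed-size blocks and hash each one.

    Returns an ordered list of (sha256_hex, length) pairs; the order is
    what lets the file be reconstructed later by concatenating blocks.
    """
    manifest = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            manifest.append((hashlib.sha256(block).hexdigest(), len(block)))
    return manifest
```

The manifest is tiny compared to the file itself: a 4GB file compresses to about a thousand 32-byte hashes, which is why manifests are cheap to exchange during sync.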
Step Two — Content-Addressable Storage
Blocks are stored in a content-addressable store: a key-value system where the key is the hash of the value. When you write a block, the storage layer computes the hash of the block and stores it under that hash. When you read, you give the hash, and you get the bytes back.
Two magical things fall out of this design:
Automatic deduplication
If you write the same bytes twice, you get the same hash twice. The storage layer can detect this and skip the second write — the block is already there. At Dropbox's scale, this is an enormous savings. A popular stock photo, a common open-source library, a corporate logo — each of these exists exactly once in Dropbox's backend, no matter how many users have it in their accounts. Users never see this; it just means the storage system is a fraction of the size a naive design would require.
Integrity verification for free
Since the hash of a block is its address, any corruption of the block's contents would change its hash, and the system would fail to find it at the expected address. This is automatic tamper detection — you cannot silently corrupt a block in CAS without the system noticing. Backup systems like Borg and restic exploit this exact property to guarantee backup integrity.
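A toy in-memory version makes both properties concrete. `ContentAddressableStore` is a hypothetical name; a real backend persists blocks to disk or object storage rather than a dict, but the put/get contract is the same.

```python
import hashlib

class ContentAddressableStore:
    """A minimal in-memory content-addressable store: the key IS the hash."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Writing the same bytes twice is a no-op: automatic deduplication.
        self._blocks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Integrity check for free: re-hash on read, compare to the address.
        if hashlib.sha256(data).hexdigest() != key:
            raise IOError(f"block {key[:8]}... is corrupt")
        return data
```

Note that `put` never overwrites: since the key is derived from the contents, two writes with the same key are by definition the same bytes, so the second write can be skipped entirely.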
Step Three — Delta Sync on Change
Now the interesting part. You save the 4GB video file after trimming 10 seconds off the end. Here is what actually happens on the Dropbox client:
- Re-chunk the file. The client splits the new version into blocks just like it did before. Most blocks will be identical to the old version because most of the video's bytes did not change. Only the blocks near the edit point will have new content.
- Compute the new block list. The client hashes each block and builds a new block list. For a typical small edit, this new list will have maybe 3-5 blocks different from the old list.
- Compare against the server's known block list. The client tells the server "I have a new version of this file. Here is the new block list." The server compares against what it has and replies with "I already have blocks X, Y, Z. I am missing blocks A, B, C. Please upload those."
- Upload only the missing blocks. The client uploads A, B, C — maybe 12-20MB total. The other 4GB of the file stays put because those blocks already live in the backend from before.
- Commit the new manifest. The server atomically updates the file's current version to point to the new block list. The old blocks that are no longer referenced may be garbage-collected later (or retained for version history).
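The five steps above can be sketched end to end, with plain dicts standing in for the backend block store and the manifest database. `sync_new_version` and its parameters are illustrative only, not Dropbox's protocol.

```python
import hashlib

def sync_new_version(new_blocks, server_store, server_manifests, file_id):
    """One round of delta sync for a changed file.

    new_blocks: ordered list of raw block bytes for the new version.
    server_store: the backend CAS, a dict of sha256 hex -> bytes.
    server_manifests: dict of file_id -> committed block list.
    Returns how many blocks the client actually had to upload.
    """
    # Client: hash every block to build the new block list.
    new_manifest = [hashlib.sha256(b).hexdigest() for b in new_blocks]

    # Server: report which of those hashes it has never seen.
    missing = {h for h in new_manifest if h not in server_store}

    # Client: upload only the missing blocks.
    uploaded = 0
    for h, block in zip(new_manifest, new_blocks):
        if h in missing:
            server_store[h] = block
            missing.discard(h)
            uploaded += 1

    # Server: atomically commit the new block list as the current version.
    server_manifests[file_id] = new_manifest
    return uploaded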
This is delta sync: the only data that crosses the network is the data that actually changed. For most real-world file edits, this is a tiny fraction of the total file size. A slide deck with a changed title uploads maybe one modified block — a few hundred KB at most.
The Challenge — Where Do Block Boundaries Go?
There is a subtle problem I glossed over. Suppose you insert a single byte at the beginning of the file. With fixed-size blocks, every single block now has different contents — because each block is offset by one byte from where it used to be. Every block gets a new hash, and suddenly the "delta sync" becomes "re-upload everything." The optimization collapses.
The fix is content-defined chunking using a rolling hash. Instead of cutting the file at fixed-size boundaries (every 4MB), the client computes a rolling hash (like Rabin fingerprint or buzhash) over a sliding window and cuts the file at points where the hash has a particular property — for example, where the lowest 22 bits of the hash are all zero. This makes the average block size 4MB but the exact boundaries depend on the content.
The magic is that inserting a byte near the beginning shifts the byte positions but not the places where the rolling hash hits the cut condition. The blocks before and after the insertion point are mostly identical, just shifted — and the chunking algorithm finds the same boundaries in the shifted data. Only the block containing the inserted byte is actually different. The rest of the file re-uses the existing blocks.
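Here is a scaled-down sketch of content-defined chunking using a simple polynomial rolling hash. Production systems use Rabin fingerprints, buzhash, or FastCDC with 4MB-scale masks; the window and mask here are shrunk so the behavior is visible on small inputs, and the hash itself is a toy.

```python
WINDOW = 48            # bytes in the rolling-hash window
MASK = (1 << 13) - 1   # cut where the low 13 bits are zero: ~8KB average chunks
PRIME = 31
MOD = 1 << 32

def chunk(data: bytes):
    """Content-defined chunking: boundaries fall wherever the hash of the
    last WINDOW bytes has its low bits all zero, so they depend on content,
    not on byte offsets."""
    out_factor = pow(PRIME, WINDOW, MOD)  # cancels the outgoing byte
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * PRIME + byte) % MOD
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * out_factor) % MOD
        # Cut only once the window is full, so the hash is content-only.
        if (h & MASK) == 0 and i + 1 - start >= WINDOW:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Inserting a byte at the start of the input shifts every offset, but the cut condition still fires at the same content positions, so every chunk after the first cut point is byte-identical to before. That is the property that keeps delta sync cheap under inserts.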
This is the same trick used by rsync (the classic Unix file sync tool), restic, Borg, and most modern backup and sync tools. Dropbox's public writings have described using content-defined chunking for exactly this reason.
Magic Pocket — Dropbox's Custom Storage Backend
For its first several years, Dropbox stored blocks in AWS S3. The block-level deduplication architecture worked perfectly on top of S3 — the blocks are content-addressable keys, S3 is a key-value store, done. But at a certain scale, Dropbox's S3 bill was enormous, and the economics of running their own storage started to make sense.
In 2015-2016, Dropbox announced and then executed a migration from S3 to their own storage system, which they called Magic Pocket. They publicly documented the architecture in several engineering blog posts. Magic Pocket is not a reimplementation of S3 — it is a system purpose-built for Dropbox's workload, which has a few specific properties:
- Writes are dominated by content-addressed blocks that never change after write
- Reads are sequential for file reconstruction and random for sync
- Durability is critical (lost blocks mean lost user files)
- Cost per byte is the dominant operational concern at exabyte scale
Magic Pocket uses Reed-Solomon erasure coding for durability instead of full replication. Instead of storing each block 3 times (like traditional triple replication), Magic Pocket splits each block into N data fragments plus M parity fragments such that any N of the N+M fragments can reconstruct the block. Typical ratios give the same durability as triple replication with only 1.5x storage overhead instead of 3x. This is the same math that makes S3's "11 nines of durability" work, which we covered in the Netflix DRM post in a different context.
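Full Reed-Solomon needs finite-field arithmetic, but the M = 1 case is plain XOR and shows the core idea. This sketch and its fragment counts are illustrative, not Magic Pocket's actual parameters.

```python
def encode_with_parity(block: bytes, n_data: int = 4):
    """Toy erasure code: n_data data fragments plus one XOR parity fragment.

    Any single lost fragment can be rebuilt from the other n_data. Real
    systems use Reed-Solomon codes, which survive M losses with M parity
    fragments; XOR parity is the M = 1 special case.
    """
    frag_len = -(-len(block) // n_data)             # ceiling division
    padded = block.ljust(frag_len * n_data, b"\0")  # pad to align fragments
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(n_data)]
    parity = bytearray(frag_len)
    for frag in frags:
        for j, b in enumerate(frag):
            parity[j] ^= b
    return frags + [bytes(parity)]

def reconstruct(frags, lost_index):
    """XOR of all surviving fragments rebuilds the lost one, because the
    XOR of all n_data + 1 fragments is zero by construction."""
    frag_len = len(frags[0] if lost_index != 0 else frags[1])
    out = bytearray(frag_len)
    for i, frag in enumerate(frags):
        if i != lost_index:
            for j, b in enumerate(frag):
                out[j] ^= b
    return bytes(out)
```

Four data fragments plus one parity fragment is 1.25x storage overhead versus 3x for triple replication, though it survives only a single loss. Adding more parity fragments via Reed-Solomon is how systems reach triple-replication durability at roughly 1.5x.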
Critically, small teams should not build Magic Pocket. The break-even point where building your own storage is cheaper than S3 is enormous — Dropbox crossed it because they had hundreds of petabytes of steady-state storage. For almost any other company, S3 (or GCS, or Azure Blob) with deduplication done at the application layer is the correct choice. Magic Pocket is a cautionary tale about how much engineering it takes to beat a managed cloud service.
Nucleus — The Sync Engine Rewrite in Rust
The backend storing blocks is one part of the system. The other part is the sync engine — the code running on your laptop that watches the filesystem for changes, chunks files, talks to the backend, and reconciles local state with cloud state. This is a surprisingly hard piece of software.
In 2020, Dropbox publicly wrote about rewriting their sync engine, internally called Nucleus, in Rust. The blog post is one of the most detailed public descriptions of a production Rust migration. The motivations they gave:
- The original sync engine was written in Python and had accumulated years of technical debt. Fundamental state-machine bugs were hard to track down.
- Cross-platform file system behavior (Windows, macOS, Linux) requires precise control over file handles, memory, and threads. Python's GIL and garbage collector got in the way.
- Sync is effectively a distributed systems client, and strongly typed state machines (via Rust's enums and exhaustive pattern matching) eliminate entire categories of bugs at compile time.
- Correctness is a user-visible feature — silent data loss or file corruption in a sync product is catastrophic. Rust's borrow checker and lack of null pointers make certain classes of correctness bugs impossible.
The rewrite took years and involved shipping Nucleus alongside the old engine, shadow-syncing to compare behavior, and gradually migrating users. It is worth reading the public write-up if you are considering a major language migration in your own infrastructure — the engineering process described is a model for how to do it responsibly.
Conflict Handling — Last Write Wins With Conflicted Copies
What happens when two people edit the same file offline and then both come back online? In Google Docs, this is solved elegantly with Operational Transform — but arbitrary binary files cannot be merged the way text can.
Dropbox's answer is refreshingly simple: last write wins, but preserve the loser. The first client to upload wins. The second client's upload is still accepted, but the file is renamed with a "conflicted copy" suffix including the author's name and timestamp. You end up with two files in the folder — the winner and the conflicted copy. Neither is lost, but the user has to manually decide what to do.
This is a deliberate choice to avoid the complexity of automatic merging, which is not well-defined for most file types anyway. How do you "merge" two edits of a Photoshop file? You do not — you let the human do it. Dropbox gets out of the way and preserves both versions.
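The commit rule can be sketched as version-checked last-write-wins. The conflict-file naming below only approximates Dropbox's actual format, and `commit` and its dict-backed `server` are stand-ins.

```python
from datetime import date
from pathlib import PurePosixPath

def commit(server, path, new_content, base_version, author):
    """Last write wins, preserve the loser.

    server: dict mapping path -> (version, content).
    base_version: the file version this client last saw.
    If someone else committed first, this upload is kept under a
    conflicted-copy name instead of overwriting the winner.
    """
    current_version, _ = server.get(path, (0, b""))
    if current_version == base_version:
        server[path] = (current_version + 1, new_content)  # this client wins
        return path
    # Conflict: keep the winner, save this upload under a new name.
    p = PurePosixPath(path)
    conflict = str(p.with_name(
        f"{p.stem} ({author}'s conflicted copy {date.today()}){p.suffix}"))
    server[conflict] = (1, new_content)
    return conflict
```

The version check is what detects the conflict: a client that edited offline still holds the old version number, so its commit no longer matches the server's current version, and its upload is diverted rather than discarded.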
LAN Sync — The Trick Everyone Forgets About
Here is a feature of Dropbox that is easy to miss: if two computers on the same local network both have the same file, Dropbox will transfer new blocks between them directly, over the LAN, instead of going through the cloud. This is called LAN Sync, and Dropbox's public write-up describes it as UDP-broadcast peer discovery on the local network plus direct TCP block transfer.
The advantage is enormous for teams. A designer in the office pushes a new version of a 2GB Photoshop file. Without LAN Sync, every other designer's Dropbox client downloads the changed blocks from the cloud, over the office WiFi's upstream, which is usually slow. With LAN Sync, the changed blocks go directly from laptop to laptop at gigabit speeds. The cloud is still the source of truth, but the actual bytes never leave the local network.
This only works because the blocks are content-addressable — any peer can serve them because the identity of a block is its hash, not its storage location. The same block from any source is the same block. This is the same property that makes content-addressable systems (Git, IPFS, container registries) so compositional.
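Because a block's identity is its hash, a client can try any source and verify whatever it receives. Here is a sketch of that source-agnostic fetch, with dicts standing in for peer and cloud endpoints; `fetch_block` is illustrative, not Dropbox's client API.

```python
import hashlib

def fetch_block(block_hash, lan_peers, cloud):
    """Fetch a block, preferring LAN peers over the cloud.

    Any source is equally valid because the bytes are verified against
    the content hash regardless of where they came from. A peer serving
    bad data is simply skipped.
    """
    for peer in lan_peers:
        data = peer.get(block_hash)
        if data is not None and hashlib.sha256(data).hexdigest() == block_hash:
            return data  # served from the LAN, never touched the cloud
    # Fall back to the cloud, still verifying integrity.
    data = cloud[block_hash]
    if hashlib.sha256(data).hexdigest() != block_hash:
        raise IOError("corrupt block from cloud")
    return data
```

Note the contrast with location-addressed systems: if blocks were identified by URL, a peer's copy could not be trusted without a separate integrity channel. Content addressing makes trust a property of the data itself.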
The DevOps Patterns You Can Actually Reuse
Block-level deduplication is not a Dropbox-specific trick. The pattern shows up in many places in modern infrastructure, and once you see it, you can apply it yourself:
- Content-addressable storage for deduplication. Any system with repeated data (build artifacts, container images, backups, Docker image layers, Git objects) benefits from a CAS backend. Tools like rclone, restic, and borg use this pattern. You can use it in your own systems for caches, artifact stores, and backup tooling.
- Content-defined chunking for insert-resilient delta sync. If you are building anything that syncs files or large blobs, content-defined chunking with a rolling hash handles the "insert one byte" case elegantly. Off-the-shelf libraries like FastCDC make this straightforward.
- Erasure coding for storage cost. For durability, erasure coding is dramatically cheaper than full replication at high volumes. S3 uses it. Magic Pocket uses it. If you are running your own storage tier — even for internal use — Reed-Solomon erasure coding gives you the same durability as triple replication at roughly half the storage cost.
- Last-write-wins with preserved losers. For any system where automatic merging is not well-defined, preserving both versions with a clear naming convention is almost always better than losing data. Dropbox, Git (with merge conflicts), and many wiki systems use variants of this pattern.
- Peer-to-peer delivery for distributed state. If your network has clients close to each other, LAN-level direct transfer is an enormous bandwidth optimization that the cloud cannot match. BitTorrent used it first, Facebook's dcdr built on it, and Dropbox LAN Sync is a consumer-visible version. Worth considering any time you have to distribute the same large artifact to many nearby clients.
Frequently Asked Questions
How does Dropbox avoid re-uploading unchanged parts of a file?
Block-level deduplication. Files are split into ~4MB blocks, each identified by its SHA-256 hash. Only new or changed blocks are uploaded — blocks whose hashes already exist on the server are skipped because they are already stored.
What is content-addressable storage?
A storage system where the address of each piece of data is the hash of its contents. Write the same bytes twice and you get the same address, which means duplicates are detected and deduplicated automatically.
What is Magic Pocket?
Dropbox's custom storage system, which they migrated to from AWS S3 starting around 2015. It uses Reed-Solomon erasure coding on commodity hardware in Dropbox-owned data centers for exabyte-scale storage at a fraction of S3 pricing.
Should I build my own Magic Pocket?
Almost certainly not. The break-even point where building beats S3 is enormous. For the vast majority of companies, putting content-addressable blocks on S3 (or GCS / Azure Blob) is cheaper and simpler than running your own storage layer.
How does Dropbox handle conflicts?
Last write wins, with the loser preserved as a "conflicted copy" file. Both versions are kept — the user has to manually decide which to keep or merge.
Next Steps
- How Google Docs Real-Time Collaboration Works — the elegant solution to sync when files can be merged automatically
- How Netflix-Scale DRM Works — another system where S3-level storage economics drove architecture decisions
- How Stripe Detects Fraud in Real Time
- Free DevOps resources