How Google Docs Real-Time Collaboration Actually Works — Operational Transform at Scale
You and a coworker are editing the same Google Doc. You type "hello" at position 10. At the exact same moment, they delete the character at position 5. Neither of you knows about the other's change yet. Somehow, within a few hundred milliseconds, both of your screens converge on the same document with both changes applied correctly — not "both changes in some order," but "both changes in the only order that makes sense given what each of you intended." Now scale that to 100 people editing one doc, and 1 billion docs in existence. That is what Operational Transform solves, and the algorithm behind it is one of the most elegant pieces of distributed systems design ever deployed to regular users.
This post walks through how Google Docs actually makes collaborative editing work, grounded in the Jupiter paper from Xerox PARC (1995) that introduced the specific OT variant Google Docs uses, public talks from Google engineers, and the decades of academic literature on collaborative editing. The algorithm is not secret. What is rare is an explanation that treats it as a real systems problem rather than a theory exercise.
The Naive Approach, and Why It Does Not Work
The first thing anyone designing a collaborative editor thinks is: "Just send the whole document every time someone changes anything." This breaks the moment you think about it:
- A 100-page document being sent on every keystroke is absurd bandwidth
- Two users sending overlapping document states creates an obvious race — whose save wins?
- Latency means each user sees a stale version, and rebasing a whole document is a diff problem every round trip
The second idea is: "Send just the changes." Better — you now send a small patch like insert "hello" at position 10. But this has a subtle and devastating flaw.
Imagine the starting document is Hello, world!. Two users see it at the same time:
- Alice deletes character at position 0 ("H"). She expects the result ello, world!.
- Bob, simultaneously, inserts "!" at position 5 (between "o" and ","). He expects Hello!, world!.
Both users send their operations to the server. Alice's op arrives first. The server applies it: ello, world!. Now Bob's op arrives — insert "!" at position 5. But position 5 in the new document is a different character than position 5 in the document Bob was looking at when he made the change. If the server naively applies Bob's op, the "!" ends up in the wrong place. If it then broadcasts that result to Alice and Bob, Alice sees a garbled document. Bob sees something he did not type.
This is the core problem of concurrent editing: operations are defined relative to a document state, and that state drifts as other operations arrive. You cannot just send the ops and apply them in arrival order. Every operation needs to be adjusted — "transformed" — to account for the operations that happened concurrently with it.
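The divergence is easy to reproduce. Here is a minimal sketch in Python (apply_op is a hypothetical helper, handling single-character deletes only) that applies the two raw, untransformed operations in each arrival order:

```python
def apply_op(doc, op):
    """Apply one operation; op is ("insert", pos, text) or ("delete", pos)."""
    if op[0] == "insert":
        return doc[:op[1]] + op[2] + doc[op[1]:]
    return doc[:op[1]] + doc[op[1] + 1:]   # single-character delete

doc = "Hello, world!"
alice = ("delete", 0)        # Alice deletes "H"
bob = ("insert", 5, "!")     # Bob inserts "!" between "o" and ","

# Server order: Alice's op lands first, Bob's raw op is applied after it.
server = apply_op(apply_op(doc, alice), bob)
# Bob's replica: his own op first, then Alice's as it arrives.
bob_view = apply_op(apply_op(doc, bob), alice)

print(server)    # "ello,! world!"  <- the "!" landed one character too late
print(bob_view)  # "ello!, world!"  <- what Bob actually intended
```

Two replicas, two arrival orders, two different documents: without transformation, convergence is already lost with a single concurrent pair.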
Operational Transform — The Trick That Solves It
Operational Transform (OT) is a family of algorithms that defines, for every pair of operation types, a transformation function. The transformation function takes two concurrent operations and produces two new operations that can be applied in either order to produce the same final state. Formally, if op1 and op2 are concurrent, then:
apply(apply(state, op1), transform(op2, op1))
==
apply(apply(state, op2), transform(op1, op2))
Read that carefully. It says: if you apply op1 first and then a transformed version of op2, you get the same state as if you applied op2 first and then a transformed version of op1. This property is called TP1 (Transformation Property 1), and it is the mathematical foundation that makes OT work.
Back to our example. Alice's delete at position 0 and Bob's insert at position 5 are concurrent. When Bob's operation arrives at the server after Alice's has been applied, the server transforms it: since a character was deleted before position 5, the new position 5 in the updated document corresponds to the old position 4. Bob's op becomes insert "!" at position 4, and when applied to ello, world!, produces ello!, world! — exactly what Bob intended, at the correct character.
Notice what just happened. The server did not ask Bob to redo anything. It did not reject his operation. It transformed his operation to match the reality that arrived between when he typed it and when it reached the server. Alice sees the correctly-transformed result. So does Bob. Everyone converges. Nothing is lost.
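In code, the single transform this example needs is a position shift: an insert rebased over a concurrent delete moves left by one when the deletion happened before it. A minimal sketch with hypothetical helper names:

```python
def apply_op(doc, op):
    """Apply ("insert", pos, text) or ("delete", pos); single-char deletes."""
    if op[0] == "insert":
        return doc[:op[1]] + op[2] + doc[op[1]:]
    return doc[:op[1]] + doc[op[1] + 1:]

def transform_insert_over_delete(ins, dele):
    """Shift an insert left when a concurrent delete removed an earlier char."""
    pos = ins[1] - 1 if dele[1] < ins[1] else ins[1]
    return ("insert", pos, ins[2])

doc = "Hello, world!"
after_alice = apply_op(doc, ("delete", 0))                # "ello, world!"
bob_prime = transform_insert_over_delete(("insert", 5, "!"), ("delete", 0))
print(bob_prime)                              # ("insert", 4, "!")
print(apply_op(after_alice, bob_prime))       # "ello!, world!"
```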
The Jupiter Model — Google's Server-Centric Variant
There are many OT variants, differing in how they handle the hardest cases (nested concurrent operations, multi-user convergence, operations on rich text with formatting). Google Docs uses a variant called the Jupiter model, originally published by David Nichols, Pavel Curtis, Michael Dixon, and John Lamping at Xerox PARC in 1995.
The Jupiter model is intentionally simpler than fully-general OT because it makes one strong assumption: there is a central server that defines the canonical ordering of operations. Every client only has to reason about two operation streams: its own local operations and the stream of operations coming from the server. The server, in turn, only has to reason about each client's operations relative to the server's history. This drastically reduces the number of cases the transformation function has to handle, compared to a peer-to-peer OT system where every pair of clients must pairwise resolve conflicts.
This server-centric choice is the architectural decision that makes Google Docs tractable. A fully decentralized collaborative editor is theoretically possible, but the complexity balloons with the number of concurrent editors. Making one node authoritative — the server — linearizes the world into a single sequence that everyone else rebases onto.
The Full Client-Server Loop
Here is what actually happens when you type a character into Google Docs.
- Local application. The client immediately applies your keystroke to your local copy of the document. Your screen updates instantly. This is called optimistic application and it is what makes the editor feel responsive — there is no round trip before you see your own typing.
- Operation encoding. The client creates an operation — something like {type: "insert", position: 247, text: "a", author: "alice", base_revision: 1042}. The base_revision field is critical: it is the document revision number the client was looking at when it made the operation.
- Send to server. The operation is pushed to the server over the persistent connection (historically a long-poll, now typically WebSocket or similar) the client maintains for the doc.
- Server rebases. The server receives the operation with base_revision: 1042, but the server is now on revision 1048 — six operations from other users have arrived since the client's last sync. The server transforms the incoming operation against each of those six, one at a time, producing a version of the operation that makes sense against revision 1048.
- Server commits and assigns a new revision. The transformed operation is applied to the server's canonical document, producing revision 1049. The server's history now has one more operation in it.
- Server broadcasts. The transformed operation, along with its new revision number, is broadcast to every other client viewing the document.
- Clients rebase locally. Every receiving client checks what revision it thought it was on. If it matches the new operation's base, great — the operation applies cleanly. If not (because the client had its own pending ops in flight), the client transforms the incoming operation against its pending ops, applies the result to its displayed document, and updates its own pending ops so they match the new server state.
This loop runs on every keystroke, every cursor move, every bold-on-off toggle. At any given moment, every client has three versions of the document in memory: the last version it knows the server has confirmed, the version with all the operations it has sent but not yet heard back about, and the version the user is currently seeing. All three are continuously reconciled as the network delivers updates.
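The three-versions bookkeeping can be sketched as a small client model. This is illustrative, not Google's code: it restricts itself to insert operations to stay short, uses hypothetical names, and assumes the server wins position ties:

```python
def apply_op(doc, op):
    """op = ("insert", position, text); inserts only, to keep the sketch short."""
    _, pos, text = op
    return doc[:pos] + text + doc[pos:]

def transform(op, other, side):
    """Rebase insert `op` over concurrent insert `other`; `side` breaks ties."""
    if op[1] < other[1] or (op[1] == other[1] and side == "left"):
        return op
    return ("insert", op[1] + len(other[2]), op[2])

class Client:
    def __init__(self, doc, revision):
        self.confirmed = doc      # last state the server has acknowledged
        self.revision = revision  # server revision number of `confirmed`
        self.pending = []         # local ops sent (or queued) but not yet acked
        self.view = doc           # the document the user actually sees

    def local_edit(self, op):
        # Optimistic application: update the screen first, sync later.
        self.view = apply_op(self.view, op)
        self.pending.append(op)   # a real client would also send this upstream

    def remote_op(self, op, new_revision):
        # Server ops apply directly to the confirmed state...
        self.confirmed = apply_op(self.confirmed, op)
        self.revision = new_revision
        # ...but must be rebased over our pending ops before reaching the view,
        # while each pending op is rebased over the incoming op in turn.
        incoming, rebased = op, []
        for p in self.pending:
            rebased.append(transform(p, incoming, "right"))  # server wins ties
            incoming = transform(incoming, p, "left")
        self.pending = rebased
        self.view = apply_op(self.view, incoming)

c = Client("Hello, world!", revision=7)
c.local_edit(("insert", 5, "!"))       # view becomes "Hello!, world!"
c.remote_op(("insert", 0, ">> "), 8)   # a concurrent edit committed as rev 8
print(c.view)       # ">> Hello!, world!"
print(c.confirmed)  # ">> Hello, world!" (our "!" is not yet acknowledged)
print(c.pending)    # [("insert", 8, "!")], shifted right by the remote insert
```

When the server later acknowledges the client's own op, the client pops it off the pending list and folds it into the confirmed state; the view never changes on an ack, because it already includes the edit.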
Why This Is Not a Trivial Piece of Code
Describing OT in three paragraphs makes it sound easy. Writing an OT implementation that survives adversarial real-world inputs is famously hard — hard enough that there are academic papers specifically about bugs in published OT algorithms. Some of the things that make production OT difficult:
Every operation pair needs a correct transform function
Insert vs insert, insert vs delete, delete vs delete, format vs insert, format vs format, list-bullet vs paragraph break, image embed vs text delete — the number of cases scales roughly quadratically with the number of operation types. Each case needs a correct transform, and every case needs to preserve TP1 (and sometimes TP2, a stronger convergence property). Get any one case wrong and the document eventually diverges between clients.
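A sketch of what a correct pairwise transform looks like for plain single-character text ops (insert vs insert, insert vs delete, delete vs delete), with a randomized check that TP1 holds. The function names and the tie-breaking convention (one designated side wins equal-position inserts) are illustrative, not Google's implementation:

```python
import random

def apply_op(doc, op):
    if op[0] == "noop":
        return doc
    if op[0] == "insert":
        return doc[:op[1]] + op[2] + doc[op[1]:]
    return doc[:op[1]] + doc[op[1] + 1:]         # single-character delete

def transform(op, other, side):
    """Rebase `op` over concurrent `other`; `side` breaks insert/insert ties."""
    if op[0] == "noop" or other[0] == "noop":
        return op
    p, q = op[1], other[1]
    if op[0] == "insert" and other[0] == "insert":
        if p < q or (p == q and side == "left"):
            return op
        return ("insert", p + 1, op[2])          # single-char ops only
    if op[0] == "insert":                        # insert over delete
        return op if p <= q else ("insert", p - 1, op[2])
    if other[0] == "insert":                     # delete over insert
        return op if p < q else ("delete", p + 1)
    if p == q:                                   # both deleted the same char
        return ("noop",)
    return op if p < q else ("delete", p - 1)    # delete over delete

# Randomized TP1 check: both application orders must converge.
rng = random.Random(42)
doc = "abcdefgh"
def rand_op():
    if rng.random() < 0.5:
        return ("insert", rng.randrange(len(doc) + 1), "X")
    return ("delete", rng.randrange(len(doc)))
for _ in range(1000):
    a, b = rand_op(), rand_op()
    left = apply_op(apply_op(doc, a), transform(b, a, "right"))
    right = apply_op(apply_op(doc, b), transform(a, b, "left"))
    assert left == right, (a, b, left, right)
print("TP1 holds on 1000 random operation pairs")
```

Even at this toy scale, getting the three inequality directions and the tie cases right takes care; every new operation type multiplies the cases to check.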
Rich text operations are not atomic
When you paste a 500-word block of formatted text, that is not one operation — it is dozens. Inserts, format ranges, style applications, list conversions. Each has to be transformable against any concurrent operation on the same text region. This is why the vast majority of OT implementations fail on corner cases: the combinatorial space of "what if this weird paste happens at the same time as that weird format change" is enormous.
Ordering without global consensus
The server assigns sequence numbers, but the network delivers messages out of order. Clients have to be able to accept operations in any order, buffer the ones they are not yet ready for, and apply them as soon as their dependencies arrive. This is a distributed systems problem on top of the OT problem.
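The buffering side can be sketched independently of any OT logic: ops carry server-assigned revision numbers, and a client applies them strictly in order, holding early arrivals until the gap fills. All names here are hypothetical:

```python
class OrderedApplier:
    """Apply server ops strictly in revision order, buffering any gaps."""
    def __init__(self, next_revision):
        self.next_revision = next_revision
        self.buffer = {}            # revision -> operation

    def receive(self, revision, op, apply_fn):
        self.buffer[revision] = op
        applied = []
        # Drain every op whose predecessors have now all arrived.
        while self.next_revision in self.buffer:
            apply_fn(self.buffer.pop(self.next_revision))
            applied.append(self.next_revision)
            self.next_revision += 1
        return applied

log = []
a = OrderedApplier(next_revision=10)
print(a.receive(12, "op12", log.append))  # []  (10 and 11 still missing)
print(a.receive(10, "op10", log.append))  # [10]
print(a.receive(11, "op11", log.append))  # [11, 12]
print(log)                                # ['op10', 'op11', 'op12']
```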
Undo and redo across users
Your "undo" should undo your last action, not the action someone else made after you typed yours. Implementing a user-local undo stack in an OT system that has rebased your operations multiple times is surprisingly subtle — your "last op" may have been transformed several times and no longer means what you thought it meant.
Presence — Cursors, Selections, and the Easier Channel
Document operations are only one kind of real-time state in Google Docs. The other is presence: who else is in the doc, where their cursor is, what they have selected. Presence is much easier to implement because it does not need convergence guarantees — nobody cares if a cursor position is stale by 100ms.
Presence runs on the same persistent connection as operations but through a separate pub/sub channel. When you move your cursor, the client sends a lightweight update — "my cursor is now at position 423" — which the server fans out to everyone else in the doc. No OT, no transformation, no history. If two updates arrive out of order, the newest one wins because presence is ephemeral.
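Last-write-wins presence is only a few lines. A sketch with hypothetical names, using a per-update timestamp so that late-arriving stale updates are simply dropped:

```python
class PresenceState:
    """Last-write-wins presence: the newest update per user replaces older ones."""
    def __init__(self):
        self.cursors = {}           # user -> (timestamp, position)

    def update(self, user, timestamp, position):
        current = self.cursors.get(user)
        if current is None or timestamp > current[0]:
            self.cursors[user] = (timestamp, position)

p = PresenceState()
p.update("bob", 100, 423)
p.update("bob", 99, 7)      # a stale update delivered late: ignored
print(p.cursors["bob"])     # (100, 423)
```

No history, no transforms, no convergence proofs; the whole consistency model fits in one comparison.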
This separation of concerns is a general pattern worth knowing: not all real-time state needs the same consistency guarantees. Split your channels by the strictness of their convergence requirements. Document content needs OT. Cursors need pub/sub. Chat messages may need yet another model (append-only log). Treating them the same is how you end up with a complicated system that is over-engineered for the easy cases and under-engineered for the hard ones.
CRDTs — The Alternative Google Did Not Choose (and Why)
In the last decade, a different approach to collaborative editing has become popular: Conflict-free Replicated Data Types, or CRDTs. A CRDT is a data structure designed so that concurrent updates can be merged without needing a central server or a transformation function. Every update carries enough metadata (typically vector clocks or unique ID tags) that any two replicas can compute the same merged state independently.
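To make the contrast concrete, here is the smallest useful CRDT, a last-writer-wins register. It is a toy example, not any product's implementation: two replicas merge deterministically, in any order, with no server involved:

```python
class LWWRegister:
    """Last-writer-wins register: (clock, replica_id) totally orders writes."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.value = None
        self.stamp = (0, replica_id)   # (logical clock, tie-breaking id)

    def set(self, value, clock):
        self.stamp = (clock, self.replica_id)
        self.value = value

    def merge(self, other):
        # Commutative, associative, idempotent: merge order never matters.
        if other.stamp > self.stamp:
            self.stamp, self.value = other.stamp, other.value

a, b = LWWRegister("a"), LWWRegister("b")
a.set("draft-1", clock=1)
b.set("draft-2", clock=2)
a.merge(b); b.merge(a)
print(a.value, b.value)     # draft-2 draft-2 (both replicas converge)
```

The price of that server-free merge is the metadata: real text CRDTs tag every character with a unique ID, which is exactly the overhead OT's central server lets you avoid.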
CRDTs are elegant. They work peer-to-peer. They do not need a central authority to linearize operations. Automerge and Yjs are CRDT libraries; Figma's multiplayer system is CRDT-inspired, though it still routes through a central server. Why did Google not use them?
The honest answer is that Google Docs was built in 2006, before modern CRDT theory was mature. By the time CRDTs became production-ready, Google already had an enormous investment in Jupiter-model OT and a large body of working code. The cost of migrating to CRDTs — retraining the team, rewriting the algorithms, re-verifying correctness against real-world usage — was far higher than the marginal benefit. Operational Transform works, it has worked for over 15 years, and there is no user-visible problem that demands a switch.
For new systems today, CRDTs are often the right choice, especially if you want offline-first or peer-to-peer collaboration. But OT is not wrong — it is a different trade-off, and at Google's scale with a server-centric architecture, Jupiter-model OT is a legitimate production choice with decades of battle-testing.
The DevOps Patterns You Can Actually Reuse
Most engineers will never build a collaborative editor. But the patterns Google Docs relies on show up anywhere you have shared real-time state and multiple writers:
- Optimistic local updates, server reconciliation. Apply the user's action immediately on the client, send it to the server, and rebase against whatever the server did in the meantime. This is how every responsive real-time UI works — from Trello cards to Notion blocks to chat apps with typing indicators.
- Sequence numbers as convergence anchors. A monotonic server-assigned sequence number is the simplest primitive for "has everyone seen up to this point." Use it for client sync, for replication lag monitoring, and for conflict detection. ETags and version vectors in HTTP APIs use this same idea.
- Separate channels by consistency requirement. Real-time systems usually have multiple kinds of state. Put each kind on its own channel with its own consistency model. Operations (strong), presence (eventual), notifications (at-least-once). Mixing them makes everything worse.
- Linearize through a central authority when you can. Fully decentralized consensus is beautiful and extremely hard. If your problem can tolerate a server, having one linearize the order of events makes every downstream problem an order of magnitude easier. Jupiter's design choice to use a central server is what made Google Docs shippable.
- Rebase, do not reject. When a client's stale operation arrives, do not fail it — transform it to the current state. Git does this for commits, Google Docs does this for keystrokes. The pattern generalizes: any system where clients make changes based on potentially stale state can use rebase instead of reject for a much better user experience.
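The rebase-not-reject pattern in miniature, outside any editor: a server that replays a stale write on top of newer history instead of failing it with a version conflict. All names here are hypothetical:

```python
class RebaseServer:
    """Accept stale writes by replaying them over newer history,
    instead of rejecting them with a version conflict."""
    def __init__(self, state):
        self.state = state
        self.history = []                 # committed ops, in commit order

    def submit(self, op, base_version, transform_fn, apply_fn):
        # Rebase over everything committed since the client's snapshot.
        for committed in self.history[base_version:]:
            op = transform_fn(op, committed)
        self.state = apply_fn(self.state, op)
        self.history.append(op)
        return op, len(self.history)      # transformed op + new version

def apply_insert(doc, op):                # op = (position, text)
    return doc[:op[0]] + op[1] + doc[op[0]:]

def shift_insert(op, committed):
    # If an earlier-committed insert landed at or before us, shift right.
    return (op[0] + len(committed[1]), op[1]) if committed[0] <= op[0] else op

s = RebaseServer("Hello")
s.submit((5, "!"), 0, shift_insert, apply_insert)              # version 1
op, ver = s.submit((0, ">> "), 0, shift_insert, apply_insert)  # stale base
print(s.state, ver)   # ">> Hello!" and version 2: the stale write survived
```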
Frequently Asked Questions
How does Google Docs handle two people typing at once?
Operational Transform. Each keystroke is an operation sent to a central server. The server transforms concurrent operations against each other so everyone converges on the same final document, regardless of the order the operations arrive.
What is Operational Transform?
A family of algorithms for synchronizing changes to a shared data structure across multiple concurrent editors. The key property is that concurrent operations can be transformed so they can be applied in any order and produce the same final state.
Does Google Docs use CRDTs?
No. Google Docs uses Operational Transform, specifically the Jupiter model. CRDTs are a newer alternative used by Figma, Automerge, Yjs, and others. Both approaches work; they are different trade-offs.
How does Google Docs show cursors in real time?
Through a separate presence channel. Cursor and selection updates do not need OT's convergence guarantees — they are ephemeral pub/sub messages that the server forwards to everyone viewing the document.
What happens if I edit a Google Doc offline?
The client buffers operations locally and applies them optimistically. When you come back online, the client sends the buffered operations to the server, which rebases them against any concurrent changes. You may see text "jump" slightly as your local state is reconciled with the server's.
Next Steps
- How Uber's Surge Pricing Actually Works — real-time state synchronization, geospatial edition
- How Stripe Detects Fraud in Real Time — real-time ML inside a tight latency budget
- How Netflix-Scale DRM Works — more infrastructure hidden behind a simple UI
- Free DevOps resources