Reasoning About Serializability

Conflict Serializability
The Operation Swapping Game
Operation Scheduler
Multi-Version Databases
CockroachDB versus YugabyteDB
- CockroachDB
- YugabyteDB
Fin
Nuance

In Decomposing Transactional Systems, I hinted that comparing CockroachDB’s and YugabyteDB’s approach to transaction processing is an insightful adventure for understanding the tradeoffs in different approaches to serializability.

CockroachDB and YugabyteDB are remarkably architecturally similar databases. They both scale horizontally using partitions replicated with Raft. They implement a very similar three-phase protocol using write intents for atomic commitment. They both use loose clock synchronization and Hybrid Logical Clocks for ordering transactions across partitions. And yet, they still have significant differences in the behavior of their transactions. CockroachDB offers serializable by default, whereas YugabyteDB has a very expensive serializable implementation which requires persisting and replicating read requests to disk. Conversely, YugabyteDB offers abort-free read-only transactions, whereas read-only transactions can still interfere with read-write transactions in CockroachDB. How are these two systems so similar yet still so different?

There is one key difference: CockroachDB serializes transactions at their read version, whereas YugabyteDB serializes them at the commit version. If that statement did not explain the behavioral difference between the two systems to you, then I have great news! The rest of this post is for you! We’ll first need to take a sizable detour though a new mental model of serializability testing. This will better equip us to return to the topic of why choosing a different version to serialize transactions at has such an impact on the resulting behavior of a database.

Conflict Serializability

Standard database courses all cover the important topic of conflict serializability in nearly the same fashion. One is given a history of operations, and asked "is it serializable?" We’ll use the following example history:

T1.W(X) T1.W(Y) T2.R(Y) T3.W(Y) T2.W(Z) T3.W(X)

The instructed algorithm for answering this question is to form the precedence graph, a graph showing which transactions must commit after other transactions. Two transactions have a required ordering for committing when they have a conflict between them, of which there are three types:

Read-After-Write (RAW): If T2 reads a value written by T1, then T2 must commit after T1.
Write-After-Write (WAW): If T2 overwrites a value written by T1, then T2 must commit after T1.
Write-After-Read (WAR): If T2 overwrites a value read by T1, then T2 must commit after T1.

The precedence graph is formed by walking through the operations from left to right, creating a node for each transaction and an edge for each conflict. A cycle in the resulting graph tells us that there is no possible serial order of transactions, and thus our history is not serializable.

To mentally construct a precedence graph, the easiest flow is to look at the list of operations once for each item being operated upon, and then add the corresponding conflict between transactions which hae successive operations on the same item^[1].

X: T1.W                          T3.W
Y:       T1.W  T2.R  T3.W
Z:                         T2.W

X gives us a write-after-write conflict where T3 must commit after T1. Y gives us a read-after-write conflict from T2 to T1, and write-after-read conflict from T3 to T2. With our graph of commit-after relationships completed, we check for a cycle:

There is no cycle, and thus our history is serializable.

[1]: Note that if there’s a row like T1.W T2.R T3.R T4.W, then there must be read-after-write conflicts for T4->{T3,T2} and {T3,T2}->T1. Reads are collected together into a set which gains conflicts on the write immediately before and after. This lets us represent that reads can be freely reordered among each other, but they’re collectively bounded on either side by the preceding and following write.

The Operation Swapping Game

The precedence graph is a correct way to check for serializability, but there’s a second algorithm that’s sometimes more useful for mentally reasoning about serializability. One can view conflict serializability checking as a game of swapping the positions of adjacent operations. Our starting puzzle is a jumbled interleaving of transactions' operations. Our desired ending state is one in which each transaction’s operations are all adjacent. The ability to switch the positions of two operations are constrained by the same rules from above: we may only swap two adjacent operations if they do not conflict.

Using our example above, we can make one swap:

T1.W(X) T1.W(Y) T2.R(Y) T3.W(Y) T2.W(Z) T3.W(X)
                               ⤩
T1.W(X) T1.W(Y) T2.R(Y) T2.W(Z) T3.W(Y) T3.W(X)

And now all of T1’s, T2’s, and T3’s operations are grouped together, in that order. Serializability is defined as "being equivalent to a serial schedule of transactions". Our swap has transformed the history into exactly that: a serial execution of T1, T2, then T3. Thus, the history is serializable. No graph construction needed!

With a precedence graph, the cycle test tells us that there exists at least one serial schedule of transactions, and stops there. Generating all of the possible serial schedules can be done by enumerating all possible topological sorts of the graph. Looking at the precedence graph we constructed above, there is just one possible topological sort: T1, then T2, then T3. Our operation swapping game yielding the same is no mistake! The operation swapping game directly tries to find a serial schedule instead of taking the precedence graph approach of proving that at least one serial schedule must exist. Serializability only requires that there exists at least one serial schedule, and so both approaches are equally valid tests for serializability.

Being able to comfortably manipulate a history of operations is one of these things that just takes rote practice. It’s one of the few things that database intro class homework asking you to do a serializability test different histories over and over was actually somewhat useful. We also now have coding agents to be able to crank out arbitrary widgets, so just go wild until you get the idea of the operation swapping game:

If you play enough iterations such that the RNG gods bless you with a surprisingly tricky problem, you’ll find out that the swapping game is… not that great of a way to test for serializability when you have to untangle complicated interleaved operations. Precedence graphs are a very useful tool if someone gives you a large pile of transactions and asks "is it serializable?". Jepsen’s Elle ^[2] is based off of precedence graph reasoning for this exact reason. Even the interactive game itself constructs only solvable histories by ensuring there’s no cycle in a precedence graph. The operation swapping game is a terrible algorithm for testing serializability. [2]: Kyle Kingsbury and Peter Alvaro. 2020. Elle: inferring isolation anomalies from experimental observations. Proc. VLDB Endow. 14, 3 (November 2020), 268–280. [scholar]

Instead, the operation swapping game is here to teach the concept of getting to a serializable state by swapping operations leftward or rightward in a history, and when one is allowed to do so. This is best put to use in a slightly different context…

Operation Scheduler

What we’ve looked at so far is a history as a backwards-looking, fixed series of operations. An actively running database always has new work arriving, and its serializability checking is a slightly different job. A running database needs a scheduler, an algorithm that is provided an incoming stream of operations, and can decide which ones to admit into the schedule, and which ones to reject and abort the corresponding transaction (either immediately or at commit time). A live system starts with an empty history of operations, and then grows its history by always appending the new operation on the right, to reflect that each accepted operation is the most recent operation.

So far, it makes effectively no difference if we play the operation swapping game as an online scheduler algorithm or an offline serializable history testing algorithm, because there was no ability to abort a transaction and remove it from the history. For the sake of illustration, let’s run through our same example as an online game. We assume a transaction implicitly begins with its first operation, and implicitly commits after its last operation.

{
  "title": "Operation Scheduler",
  "steps": [
    {
      "title": "Start",
      "queue": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)", "T3.W(X)"],
      "history": [],
      "note": "Initial queue of operations."
    },
    {
      "title": "Process T1.W(x)",
      "queue": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)", "T3.W(X)"],
      "history": ["T1.W(X)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 0 }
      ]
    },
    {
      "title": "Process T1.W(Y)",
      "queue": ["T1.W(Y)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)", "T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 1 }
      ]
    },
    {
      "title": "Commit T1",
      "note": "T1's operations are already adjacent. Thus, it commits serializably.",
      "queue": ["T2.R(Y)", "T3.W(Y)", "T2.W(Z)", "T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)"],
      "focus": [ {"type": "commit", "history_indices": [0, 1]} ]
    },
    {
      "title": "Process T2.R(Y)",
      "queue": ["T2.R(Y)", "T3.W(Y)", "T2.W(Z)", "T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 2 }
      ]
    },
    {
      "title": "Process T3.W(Y)",
      "queue": ["T3.W(Y)", "T2.W(Z)", "T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T3.W(Y)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 3 }
      ]
    },
    {
      "title": "Process T2.W(Z)",
      "queue": ["T2.W(Z)", "T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 4 }
      ]
    },
    {
      "title": "Can T2 commit?",
      "note": "The current history is not serial, because T2's operations are not adjacent. Swap T3.W(Y) and T2.W(Z).",
      "queue": ["T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)"],
      "focus": [ { "type": "swap", "history_indices": [3, 4] } ]
    },
    {
      "title": "Can T2 commit?",
      "note": "The swap is allowed because T3.W(Y) and T2.W(Z) do not conflict.",
      "queue": ["T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)"],
      "focus": [ { "type": "swap", "history_indices": [3, 4] } ]
    },
    {
      "title": "Commit T2",
      "note": "T2's operations are now adjacent, so T2 can commit.",
      "queue": ["T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)"],
      "focus": [ {"type": "commit", "history_indices": [2, 3]} ]
    },
    {
      "title": "Process T3.W(X)",
      "queue": ["T3.W(X)"],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 5 }
      ]
    },
    {
      "title": "Commit T3",
      "note": "T3's operations are already adjacent.",
      "queue": [],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)"],
      "focus": [ {"type": "commit", "history_indices": [4, 5]} ]
    },
    {
      "title": "Complete",
      "note": "The final history is the serial schedule of T1, T2, T3.",
      "queue": [],
      "history": ["T1.W(X)", "T1.W(Y)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)"],
      "focus": [
        { "type": "commit", "history_indices": [0, 1] },
        { "type": "commit", "history_indices": [2, 3] },
        { "type": "commit", "history_indices": [4, 5] }
      ]
    }
  ]
}

And we end with the same result, but serializability is validated incrementally.

The reason that I like the operation swapping game as a mental model is that it’s much more useful in reasoning about this sort of online, incremental behavior of a database. However, if what you’re having to incrementally answer "can I serializably commit this transaction?", then that’s the job of a scheduler. You can use precedence graphs as the way to incrementally test if it’s okay to commit each transaction. Redshift ^[3] did this. However, it’s uncommon because it’s both computationally expensive and memory intensive. More commonly, databases either do pure two-phase locking, or they maintain small summarized metadata tracking the oldest or newest version of a read or write on each key to cheaply answer if it would be possible to swap operations sufficiently backwards or forwards. And if you read the Redshift paper, you’ll learn that this latter strategy is closer to what Redshift does today: they implemented Serial Safety Net ^[4]. Precedence graphs are a great tool for database theory, but they’re further from matching the actual design of what databases implement for enforcing serializable schedules in their scheduler. [3]: Nikos Armenatzoglou, Sanuj Basu, Naga Bhanoori, Mengchu Cai, Naresh Chainani, Kiran Chinta, Venkatraman Govindaraju, Todd J. Green, Monish Gupta, Sebastian Hillig, Eric Hotinger, Yan Leshinksy, Jintian Liang, Michael McCreedy, Fabian Nagel, Ippokratis Pandis, Panos Parchas, Rahul Pathak, Orestis Polychroniou, Foyzur Rahman, Gaurav Saxena, Gokul Soundararajan, Sriram Subramanian, and Doug Terry. 2022. Amazon Redshift Re-invented. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD/PODS '22), ACM, 2205–2217. [scholar] [4]: Tianzheng Wang, Ryan Johnson, Alan Fekete, and Ippokratis Pandis. 2015. The Serial Safety Net: Efficient Concurrency Control on Modern Hardware. In Proceedings of the 11th International Workshop on Data Management on New Hardware (DaMoN'15), Association for Computing Machinery, New York, NY, USA. [scholar]

Sometimes the answer to "can I serializably commit this transaction?" is "No", and as mentioned above, the operation swapping game can help with this too. If conflicts prevent swapping all of a transaction’s operations to be adjacent, then it must be aborted. A transaction’s operations will live in the history until it aborts, and then all its operations will be removed. We can run through a similar example, but move T1.W(Y) to the end so that there’s a cycle in the precedence graph:

{
  "title": "Operation Scheduler With Abort",
  "steps": [
    {
      "title": "Start",
      "note": "Initial queue of operations.",
      "queue": ["T1.W(X)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)", "T3.W(X)", "T1.W(Y)"],
      "history": []
    },
    {
      "title": "Process T1.W(X)",
      "queue": ["T1.W(X)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)", "T3.W(X)", "T1.W(Y)"],
      "history": ["T1.W(X)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 0 }
      ]
    },
    {
      "title": "Process T2.R(Y)",
      "queue": ["T2.R(Y)", "T3.W(Y)", "T2.W(Z)", "T3.W(X)", "T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 1 }
      ]
    },
    {
      "title": "Process T3.W(Y)",
      "queue": ["T3.W(Y)", "T2.W(Z)", "T3.W(X)", "T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)", "T3.W(Y)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 2 }
      ]
    },
    {
      "title": "Process T2.W(Z)",
      "queue": ["T2.W(Z)", "T3.W(X)", "T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 3 }
      ]
    },
    {
      "title": "Can T2 commit?",
      "note": "T2's operations are not adjacent yet. Swap T3.W(Y) and T2.W(Z).",
      "queue": ["T3.W(X)", "T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)", "T3.W(Y)", "T2.W(Z)"],
      "focus": [ { "type": "swap", "history_indices": [2, 3] } ]
    },
    {
      "title": "Can T2 commit?",
      "note": "The swap is allowed because T3.W(Y) and T2.W(Z) do not conflict.",
      "queue": ["T3.W(X)", "T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)"],
      "focus": [ { "type": "swap", "history_indices": [2, 3] } ]
    },
    {
      "title": "Commit T2",
      "note": "T2's operations are now adjacent, so T2 can commit.",
      "queue": ["T3.W(X)", "T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)"],
      "focus": [ { "type": "commit", "history_indices": [1, 2] } ]
    },
    {
      "title": "Process T3.W(X)",
      "queue": ["T3.W(X)", "T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 4 }
      ]
    },
    {
      "title": "Commit T3",
      "note": "T3's operations are already adjacent, so T3 can commit.",
      "queue": ["T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)"],
      "focus": [ { "type": "commit", "history_indices": [3, 4] } ]
    },
    {
      "title": "Process T1.W(Y)",
      "queue": ["T1.W(Y)"],
      "history": ["T1.W(X)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)", "T1.W(Y)"],
      "focus": [
        { "type": "remove", "queue_index": 0 },
        { "type": "add", "history_index": 5 }
      ]
    },
    {
      "title": "Can T1 commit?",
      "note": "T1's operations are split apart, so T1.W(X) must swap rightward until it becomes adjacent to T1.W(Y).",
      "queue": [],
      "history": ["T1.W(X)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)", "T1.W(Y)"]
    },
    {
      "title": "Can T1 commit?",
      "note": "Swap T1.W(X) and T2.R(Y).",
      "queue": [],
      "history": ["T1.W(X)", "T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)", "T1.W(Y)"],
      "focus": [ { "type": "swap", "history_indices": [0, 1] } ]
    },
    {
      "title": "Can T1 commit?",
      "note": "Swap T1.W(X) and T2.R(Y).",
      "queue": [],
      "history": ["T2.R(Y)", "T1.W(X)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)", "T1.W(Y)"],
      "focus": [ { "type": "swap", "history_indices": [0, 1] } ]
    },
    {
      "title": "Can T1 commit?",
      "note": "Swap T1.W(X) and T2.W(Z).",
      "queue": [],
      "history": ["T2.R(Y)", "T2.W(Z)", "T1.W(X)", "T3.W(Y)", "T3.W(X)", "T1.W(Y)"],
      "focus": [ { "type": "swap", "history_indices": [1, 2] } ]
    },
    {
      "title": "Can T1 commit?",
      "note": "Swap T1.W(X) and T3.W(Y).",
      "queue": [],
      "history": ["T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T1.W(X)", "T3.W(X)", "T1.W(Y)"],
      "focus": [ { "type": "swap", "history_indices": [2, 3] } ]
    },
    {
      "title": "Can T1 commit?",
      "note": "Cannot swap T1.W(X) and T3.W(X) because they have a write-after-write conflict on X.",
      "queue": [],
      "history": ["T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T1.W(X)", "T3.W(X)", "T1.W(Y)"],
      "focus": [ { "type": "swap", "history_indices": [3, 4] } ]
    },
    {
      "title": "Abort T1",
      "note": "Since conflicts prevent us from making T1's operations adjacent, we must abort T1.",
      "queue": [],
      "history": ["T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T1.W(X)", "T3.W(X)", "T1.W(Y)"],
      "focus": [ { "type": "remove", "history_indices": [3, 5] } ]
    },
    {
      "title": "Complete",
      "note": "The serializable schedule is T2, T3.",
      "queue": [],
      "history": ["T2.R(Y)", "T2.W(Z)", "T3.W(Y)", "T3.W(X)"],
      "focus": [
        { "type": "commit", "history_indices": [0, 1] },
        { "type": "commit", "history_indices": [2, 3] }
      ]
    }
  ]
}

If you’re squinting at this and wondering how T3.W(X) managed to execute and T3 commit if an earlier T1.W(X) would be holding an exclusive lock, that’s the point! The rules of the operation scheduler can be very flexible! Change the rules to match the system being examined. The behavior shown matches how a backwards-validating optimistic concurrency control database would behave. If one set of rules aborts more transactions than another set of rules, that’s fine! The only requirement is that the scheduler needs to be able to produce a serial order of transactions at the end, in order to satisfy serializability.

Multi-Version Databases

Our discussion so far has implicitly been that of a single-version database. Single-version databases are remarkably convenient, as it means operations arrive in time order, and time is the same order we wish to execute them in. Multi-version databases remove this convenience. Operations arrive in time order, but we wish to reflect that they need to execute in version order. This changes how our operation swapping game works in a number of ways.

We view the provided history as the time-ordered queue of operations we wish to execute. We pop the front of the queue to get the next operation, and

The history is time-ordered, and now due

Multi-version databases introduce two different axis: time and versions. Our ending goal is to end up with history of operations in version order.

Operations have always been presented in time order, and time goes older to newer, left to right.

We’ll adapt to this by writing the version as a suffix on each operation, and then the operation swapping game gains another layer of complexity.

The

One could equally view a single-version database as a multi-version

Reads can be inserted immediately at the correct location that puts it in version. MVCC means you can read always read backwards in time whenever you wish!
Version Order: The ending operation order must be in version order.
Read-Only Transactions: If a transaction is only reads, its read operations are allowed to ignore read-write conflicts.
Writes are inserted with an unknown version. When the version becomes known (the transaction tries to commit), we must attempt to swap all the writes into the correct location according to the new version.

To make this more concrete, let’s walk through a couple simple examples.

T1: R(x)         R(Y)
T2:       W(Y)         R(X)

T1.R(X) T2.W(Y) T1.R(Y) T2.R(X) T1.Commit() T1.Commit()

History: T1.R(X)
History: T1.R(X) T2.W(Y)
History: T1.R(X) T2.W(Y) T1.R(Y)
History: T1.R(X) T2.W(Y) T1.R(Y) T2.R(X)
                        ⤩
History: T1.R(X) T1.R(Y) T2.W(Y) T2.R(X)

T1.R(X)@v20 T2.W(Y)@v30 T1.R(Y)@v20 T2.R(X)@v30 T1.Commit() T1.Commit()

History: T1.R(X)@v20
History: T1.R(X)@v20 T2.W(Y)@v30
History: T1.R(X)@v20 T1.R(Y)@v20 T2.W(Y)@v30
History: T1.R(X) T2.W(Y) T1.R(Y) T2.R(X)
                        ⤩
History: T1.R(X) T1.R(Y) T2.W(Y) T2.R(X)

Histories are generally presented as a strictly a view of what previously happened in the database.

In an MVCC database, we have two dimensions: time and version. The history is presented in oldest to newest versions, left to right. Reads should be inserted into history according to the version at which the read snapshot was taken. Writes should be appended on the right, and then perform any swaps if needed to get the operation in the correct position according to its version. At commit time, operations are swapped to group all of a transaction’s operations. Following these rules makes it so that swapping reads forwards or writes backwards applies the same logic of "check for conflicts between the read and commit version".

CockroachDB versus YugabyteDB

We’re finally equipped to return back to the original topic.

CockroachDB

CockroachDB’s scheduler component maintains a cache that answers for each key, what is the maximum version at which the key has been read. When a transaction sends a read request, the read asks for a key as of a specified read version, and the server updates the cache accordingly as part of serving the read. To bound the size of the cache, key which haven’t been read for a while are just assigned one version which is less than or equal to the oldest version in the cache. When a write is performed, it also has the transaction’s read version attached, and the write wants to be inserted into the history at that version. The per-key cache is consulted to ask "if I accept this write, will it invalidate any previous read request I’ve served?" If yes, the write is rejected. If no, the write is accepted.

In terms of our operation swapping game, CockroachDB appends writes onto the end of the history, and then wants to swap them leftward to be grouped with the rest of the transaction’s reads (and writes) at the read version.

Operations to execute: T1.R(X)@v100 T2.W(X)@v90
Current history:

Operations to execute: T2.W(X)@v90
Current history: T1.R(X)@v100

YugabyteDB

Fin

CockroachDB serializes transactions as of their initial read version. YugabyteDB allocates a new commit version higher than any previously committed transaction, and serializes transactions as of the commit version. Phrasing this within the operation swapping game: CockroachDB always swaps all writes backwards to the read version, whereas Yugabyte swaps all reads forwards to the commit version. The major differences in their transactional capabilities above all fall out of this difference in which direction operations are moved in the history.

Precedence graphs are great for testing a history of operations for serializability. The operation swapping game is the mental model I use when thinking through how a database enforces serializability. Operation swapping for verifying serializability isn’t novel. I’ve poked fun at the CMU DB Intro class above, but Andy Pavlo’s course notes state "One can verify that a schedule is conflict serializable by swapping non-conflicting operations until a serial schedule is formed. For schedules with many transactions, this becomes too expensive. A better way to verify schedules is to use a dependency graph (precedence graph)." It’s just that that’s the end of the treatment of the topic, whereas I use it more frequently in my head than I do precedence graph building.

Nuance

Cockroach does have abort-free read-only transactions in the form of (slightly) stale reads.

Yugabyte can do serializability, but it requrires persisting reads as writes, so that the conflict information isn’t lost.

Spanner actually doesn’t persist reads, and thus loses opacity, but that’s a topic for another time.

↑ Blog ↑ NULL BITMAP on SIMD ↦

See discussion of this page on Reddit, Hacker News, and Lobsters.