Infrequently asked questions on deterministic distributed transaction management

Updated since my original post: added plug; explained why centralization is still hard; noted the issues unaddressed by this work; clarified that Daniel is focusing on ACID.

My homeboy Daniel Abadi baits flames in true Stonebraker style in his latest blog post, proclaiming: “the NoSQL decision to give up on ACID is the lazy solution to these scalability and replication issues.” Then he shows you how he promises to pimp your database by imposing a deterministic serial ordering to all transactions.

Without further ado, it’s time for some Infrequently Asked Questions On Deterministic Distributed Transaction Management!

I’m busy. Gimme the executive summary.

Daniel introduces his new VLDB paper, where he says he’s got the distributed DBMS your distributed DBMS could smell like, if only you used a centralized, deterministic transaction manager instead of lady-scented body wash. Specifically, he proposes avoiding the dreaded two-phase commit by centrally controlling concurrency, and minimizing lock time with optimistic operation.

I react.

What are your high-level thoughts?

My main critique: I’m not sure I agree that centralizing access qualifies as the (non-lazy) solution to scalable transaction processing. Centralized access of course makes life much easier and enables many possibilities, but—unlike centralized components in other distributed systems, e.g. metadata servers and lock servers—this thing is gonna get hit by every single transaction that comes in. Imagine every operation served by BigTable (one of those newfangled “NoSQL” systems) for Gmail, Google Apps, and a hojillion other Google external products and internal services, all running in data centers with full bisection bandwidth. Now imagine every one of those operations being funneled through a single box. Poof! Your ass is grass.

It’s nonetheless an interesting paper because of how it exploits this centralized concurrency control. And there might be some workloads that require partitioning for scalability, yet for which a single entrypoint can support the load. For the sake of discussing the rest of this paper, I’ll come back to centralization later.

Can you refresh my memory on distributed transaction management?

I’m assuming familiarity with this topic, but here’s some background for the sake of framing the discussion.

There are two big problems in distributed transactions:

  • distributed concurrency control
  • distributed commits

Concurrency control schemes can be classified roughly as being either pessimistic or optimistic. Traditional PCC schemes lock data before operating on it, but—as long as you can’t say up-front what exact data you’re going to lock (typical), in which case you can use lock ordering—locking is susceptible to deadlocks, which are hard to detect, requiring timeouts or centralized lock managers. Traditional OCC schemes involve an execution phase, followed by a validation barrier that checks for conflicts; if this passes, the transaction commits. In both cases, aborts may take place due to concurrency conflicts, due to interleaved transactions.

Distributed commits are needed to ensure the commit is effective on all partitions or none of them. Two-phase commit is the most popular distributed commit protocol, and both execution and concurrency control take place in the “prepare” phase of the 2PC, which is the only time aborts may be raised. Reasons a node aborts are if:

  1. it (or the network) fails,
  2. the transaction manually aborts (e.g. if a bank account balance is empty and an attempt was made to withdraw money), or
  3. there was a conflict with another transaction.

Aiite, so what’s the new hotness that Prof. Abadi’s proposing?

Here’s the scheme presented in the paper:

  1. When the central coordinator receives any transactions, it logs them to disk to ensure that they will eventually be executed.

  2. A pre-processing phase—not part of the actual transaction—performs reads to dereference all data locations (participant nodes). This is implemented a separate read-only transaction to optimistically determine the read/write sets.

  3. Once all data locations are resolved, the actual transaction can be collapsed down to a single PCC execution step in which each participant:

    1. locks its data and begins processing
    2. sends any data dependencies to the waiting cohorts
    3. waits to receive any data dependencies from other cohorts
    4. finish processing and unlock

What’s so special about this scheme?

Ideally just the one-way network latency dominates the locking contention latency. The system doesn’t use 2PC, since there are no non-deterministic aborts, relying on (1) global transaction ordering to avoid aborts needed due to conflicting concurrency, and (2) replication to ignore node failures. (Note that user aborts are allowed, as they are played across all replicas deterministically.)

To summarize, the main techniques at play here are:

  • minimize lock time with optimistic operation
  • use centralized global transaction ordering to avoid aborts due to conflicting concurrency
  • use replication to avoid aborts due to node/network failures
  • with both sources of non-deterministic aborts out of the way, toss out 2PC

Going back to centralization…didn’t Abadi address this in the comments?

When Ariel asked about scaling centralization in the comments, Dan replies:

All the preprocessor must do is agree on an arbitrary ordering. To avoid a single point of failure, it can be replicated across multiple independent nodes, and any of the well-known agreement protocols from the systems community (e.g. Paxos) can be used to agree on the transaction ordering.

This only addresses availability, not scalability.

Hm…isn’t that a pretty big problem?

I put on my robe and optimist hat.

One way forward is to try maintaining this architecture but focus on scaling the centralized access point. This is something I’ve thought about in the context of H-Store, which also advocates (among a bunch of other things) global transaction ordering via a centralized entry point, but only for multi-partition transactions. (The H-Store project also makes the very strong assumption that high-scaling applications won’t frequently use multi-partition transactions. That may be a true assumption, but it’s far from a satisfactory answer.)

Here’s a half-baked idea: partition the central coordinator, and establish an ordering among the partitions. Time is divided into epochs, and the transactions received in each epoch are ordered according to the partition ordering. To prevent clock skew from drifting apart each partition’s notion of the same epoch (causing transactions to block on other transactions in the same epoch), the partitions would need to synchronize themselves every so often. This approach introduces some transaction latency, but hopefully not substantially more than typical transaction batching.

Um, did you just solve the problem you’ve been whining about?

NO AND YOU DON’T END A SENTENCE WITH A PREPOSITION UP IN THIS BITCH. THIS IS MY HOUSE.

There are a whole slew of problems with this tightly-knit form of distributing the transaction manager, where all workers are dependent on and affected by all actors. For one thing, time is divided in a way such that when a worker hears from some manager m, it has to wait to hear from the managers that precede m to ensure that it executes all transactions in order (or that there were no preceding transactions). In other words: every manager messages every node every epoch.

There’s more we can do to clean this up, but a deeper problem is that the system has become sensitive to delays or failures in the “lock-step” operation of the transaction managers. This is a problem with any centralized scheme for totally ordered multicast, but with a greater number of such actors, the greater the probability of any one of them experiencing hiccups or faults that block the entire system.

E.g., in the Chubby paper, the authors’ account of how even KeepAlive RPCs dominated the system’s RPC traffic and had to move to UDP gives a taste for some of the problems lurking in the totally ordered reliable transport requirement back in our subject paper:

RPC use affects transport protocols KeepAlives are used both for refreshing the client’s session lease, and for passing events and cache invalidations from the master to the client. This design has the automatic and desirable consequence that a client cannot refresh its session lease without acknowledging cache invalidations.

This would seem ideal, except that it introduced a tension in our choice of protocol. TCP’s back off policies pay no attention to higher-level timeouts such as Chubby leases, so TCP-based KeepAlives led to many lost sessions at times of high network congestion. We were forced to send KeepAlive RPCs via UDP rather than TCP; UDP has no congestion avoidance mechanisms, so we would prefer to use UDP only when high-level timebounds must be met.

Anyway, I didn’t see any discussion or mention of this issue in Daniel’s paper or blog post, so one can only assume that this was not addressed.

OK, putting aside scaling transaction ordering, are there any other problems?

Sure, there are still plenty of big problems unaddressed here but “addressed” in NoSQL solutions, such as wide-area scalability and partition tolerance. (Granted, eventual consistency isn’t exactly the most appealing answer.) To be fair, Daniel’s not claiming to have solved all the world’s problems, and is instead focusing on scalable ACID.

However, I do think that when one presents an alternative to and makes strong statements against an existing approach, while one can choose to focus on any particular aspect, I humbly believe it’s only responsible to acknowledge the reasons for the existing approach that aren’t addressed by the new one, and to position things more clearly as the trade-offs they are—more so if one’s voice carries significant weight, as Daniel’s does.

Fine, what do you propose, punk?

If I (or anyone) knew the ultimate answer, this field of research wouldn’t exist, there wouldn’t be so much buzz about NoSQL, and I would’ve been famous as well as handsome. That said…once upon a time, I did write a thesis proposal on scalable ACID, and in particular techniques for abolishing distributed transactions altogether, among other things—an approach to the overarching problem of scalable ACID is something I believe comes from many places, not just the transaction manager. Anyway, since I’ve moved on and am no longer pursuing this, none of what I’m writing really counts as a shameless plug for anything. All stories for future blog posts!

But I will give a plug to some of the early fruits of that labor I enjoyed with my team Carlo Curino, Evan Jones, Sam Madden and Eugene Wu: namely, an automatic partitioning system that optimizes the colocation of coreferenced data. How sweet is that?

Pretty fscking sweet. Anything else you wanted to say?

Yeah, random nit: “Higher-order dependent transactions…appear much less frequently in real-world workloads.” Except that this happens, like, any time you want to traverse a many-to-many relationship in the relational data model. Which, despite TPC-C, is prevalent in many web applications.

All in all, a fun paper on one of my favorite topics, distributed transaction management for transactional workloads.

Follow me on Twitter for stuff far more interesting than what I blog.

  • http://community.voltdb.com Ariel Weisberg

    Can you provide a written example or diagram of a partitioned preprocessor that uses epochs? I am having trouble wrapping my head around what they would do.

  • Pingback: » links for 2010-09-03 (Dhananjay Nene)

  • http://www.saltwatercleanse.net Salt Water Cleanse

    This is the great blog, I’m reading them for a while, thanks for the new posts!

  • Anonymous

    Hi Ariel: Say you have 2 coordinators, A and B. Incoming transactions go to either one of them. Initially, any transactions A receives in the first 50ms are numbered 0-999, and B numbers its transactions 1000-1999. (The workers process transactions in the global order designated by this numbering.) After 50ms, A & B move on to the next epoch: A starts numbering its transactions 2000-2999, and B starts numbering its transactions 3000-3999. And so on, so that every 50ms, transactions from both coordinators can be interleaved.

  • http://pulse.yahoo.com/_JLPQ3ZQUMPKCOHSHELLSMDXYGM Jane Cole

    Very fascinating. Glad I stumbled across this info about deterministic distributed transaction management, definitely going to bookmark this to refer back again to it later this is an excellent collection of sites. management crm