TODO phi accrual failure detector TODO merge in notes from CS192, 6.829 TODO comparison of database replication techniques based on total order broadcast TODO http://bytepawn.com/readings-in-distributed-systems/ TODO http://www.cs.jhu.edu/~jak/docs/paxos_for_system_builders.pdf TODO Availability in Globally Distributed Storage Systems





  • distributed concurrency control: serializability
  • distributed commit: agree to do something or abort
  • consensus: agree on a value
  • replication: for availability, durability
    • paxos/viewstamped replication
    • virtual synchrony (spread)
  • CAP theorem’s semantics
    • assume partition of 5 nodes, with X,Y in non-quorum and Z in quorum; X holds lock on resource, and Y and Z get request for resource
    • if consistent & available, then Y can reach X to acquire
    • if consistent & partition-tolerant, then Z can acquire, breaking X’s lock
    • TODO what if lock were on quorum side? TODO tie in with “req” vs “node” failure
    • consistency:
    • one-copy serializability
    • linearizability
    • http://pl.atyp.us/wordpress/?p=2521
    • http://pl.atyp.us/wordpress/?p=2532
  • PACELC: if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)
  • consistency
    • eventual consistency: stale reads (OOO if read from diff nodes), divergent writes (clocks and conflicts)
    • strict cons
    • event cons: staleness
      • monotonic read cons aka timeline cons: can be stale but always in-order
      • RYOW aka RYW aka session cons: not nec implies (subset of) timeline cons
      • causal cons: use version numbers (lamport)
      • immediate cons
    • event cons: multi-writer
  • replication: quorums, primary-backup, master-master
    • R+W>N; no guarantees unless joins/leaves incur synchronization barriers
    • primary-backup: R=1,W=2,N=3; only need ack from majority (2), but write to all 3; majority prevents divergence in case of partition
    • quorums stronger guarantee: always read latest; PB read can be stale (if read from a replica that hasn’t applied update yet)
    • PB gives in-order, since all updates serialized through primary
    • quorums require app-specified ordering, eg cassandra’s app timestamps, since different clients write directly to their quorums (no global ordering)
    • with viewstamped replication, joins/leaves change views; this is when sync should happen (to bring new node up to speed)
      • but sync barriers are expensive, eg full merkle tree comparison or log shipping
    • dynamo uses sloppy quorums, bc trad’l strict quorums implies unavailability on failures/partitions
      • replicate to first N healthy nodes, not first N nodes; hence, the N may change or be entirely different
      • dynamo tries to spread quorum across datacenters
      • cassandra claims to use strict quorums
    • multi-master: requires dist commit and dist concurrency control or eventual cons
  • design patterns
    • tombstones: for deletes in eventually consistent systems
    • vector timestamps
  • logical clocks
    • happened-before relationship
      • a \rightarrow b if a happened before b in same process
      • a \rightarrow b if a,b are the send, recv of a msg
    • Lamport clocks/timestamps
      • timestamp = (counter, process ID); all msgs have unique timestamp; total order possible
      • on msg send, first incr then stamp
      • on msg recv, set counter = max(msg.counter, counter)
      • a \rightarrow b \implies L(a) < L(b) but not vice versa
      • however, key utility: L(a) \nless L(b) \implies a \nrightarrow b
    • vector clocks/timestamps
      • all msgs include vector timestamp (snapshot of global vector clock)
      • on update, increment own component
      • on msg send, first incr own component
      • on msg recv, incr own component and use component-wise max
      • time is more advanced on receiver than sender
      • a \rightarrow b \iff V(a) < V(b)
    • version vectors: for syncing replicas
      • after sync, both vectors are same: component-wise max
    • only for detecting conflicts; resolving is still separate, hard problem
      • eg commonly need to maintain complete history, e.g. with counters or version control (ancestor info needed)
      • easier cases include unioning sets


  • directed (one-way) synchronization
  • receiver sends rolling checksums of each non-overlapping block
  • sender compares these with its checksums of every (incl. overlapping) block
    • rolling checksums efficient to incrementally update
    • if match, calculate and compare MD5


  • SAN: block protocols, eg scsi, FC, iscsi, ataoe
  • NAS: remote FS, eg CIFS, NFS; NAS heads can be atop SANs


  • MPI
    • lib for apps
    • don’t mandate thread-safe impls, but allow them
    • high perf on high perf systems
    • modular: all refs relative to a module, not entire program
      • eg linear solver module must be restricted to a subset of processes
      • process src/dst must be spec’d by rank in a group not absolute ID
    • extensible
    • heterogeneous computing (impls working together)
    • well-defined behavior: no race conds/avoidable impl-specific behavior
  • PVM
    • also supporting heterogeneous computing and diff approach to extensibility
    • provide portable, heterogeneous env for using clusters over TCP


  • system objects: apps have handles to these
    • group: set of n processes; each process has a rank in [0,n); group routines mainly for specifying processes for communicator ctor
    • communicator: always associated 1:1 with a group; a “tag” passed in MPI calls; MPI_COMM_WORLD represents all tasks
  • organize tasks into task groups


  • cloudera
    • consulting/training services
    • products: hadoop distro, desktop gui, squool for rdbms import/export
    • people
      • doug cutting: hadoop/nutch/lucene founder; yahoo
      • mike cafarella
      • jeff hammerbacher: facebook data team; hive
  • consistency
    • strict: timestamp everything; impractical
    • sequential: some total ordering that’s consistent with the local orderings each computer sees
  • Merkle trees

  • consistent hashing: when changing hash table width, impose minimal disruption
    • eg, when adding, take a fair amount from each existing bucket
    • chord ring is a way to implement this
      • can replicate each node multiple times to make things smoother
    • http://citeseer.ist.psu.edu/karger97consistent.html (Karger)
  • gossip: decentralized but eventually consistent

general principles for reducing latency

  • distributed caches
  • CDN
  • caching proxy server
  • app-level (yslow, ajax)
  • edge DNS accelerator
  • zero copy
  • no serialization
  • load-balanced replicas
  • TCP offload engine


  • http://www.cs.umd.edu/~keleher/bib/dsmbiblio/dsmbiblio.html

TODO consistency

  • http://en.wikipedia.org/wiki/Consistency_model


  • lamport clocks: total ordering; single values
  • vector clocks: causal ordering; partial ordering; more information to eg resolve conflicts; easy to move to total order (eg by appending node ids)

data grids

  • data grids: general distributed utilities
    • caches, in-mem data structures, txns, pub-sub/msging, membership
  • oracle coherence, ibm websphere extreme scale/object grid, terracotta, gigaspaces, gemstone, jbosscache/jgroups/infinispan, hazelcast
    • hazelcast: undocumented consistency guarantees


  • normal operation: why simple approach similar to failstop replication won’t work

    invoke 1,1       X         -> B
    invoke 1,1       Y         -> C
      nobody's been fooled yet at this point since the you need 2f+1 agreements
    invoke 1,2       Z         -> B
    invoke 1,2       Z         -> C
      but at this point both will reply correctly even though they have diff
  • what about sending hashes of logs?
  • SUNDR: guarantees fork consistency property: if server lies to clients (drops, reorders, etc.) causing a fork in the clients’ histories

  • actually what happens: see journal
    • a prepared certificate is a set of 2f+1 prepares
    • commit point: when f+1 non-faulty replicas have a prepared certificate
  • read requests: send to all replicas, just wait for enough consistent replies

  • view changes: completely diff from paxos
    • go through primaries round-robin; no leader election
    • R_n to (new) P: R_n
    • P to all: P all the operations that were supposed to have been committed in the prev view
    • note: 2 replicas can execute the same operation in diff viewstamps; only thing that matters is that they all run in the same order
  • questions
    • diff btwn viewstamp replication and paxos?
    • what was the point about fork consistency in SUNDR and log hashes?
    • isn’t the all-to-all comm necessary since the primary could lie?

queueing systems

TODO dtxns

  • http://citeseer.ist.psu.edu/533331.html a fast commit protocol for distributed main memory database systems
  • http://citeseer.ist.psu.edu/589561.html
  • http://citeseer.ist.psu.edu/238153.html Transaction Routing for Distributed OLTP Systems
  • transaction management in the R* distributed database management system

TODO distributed query optimization

  • R*


amazon aws

  • ebs: volumes can only be mounted from 1 instance at a time, so no consistency issues
    • behaves like SAN
  • autoscaling: set conditions eg CPU, network, disk util

microsoft chicago

  • trailer containers
  • $.5B, 700 ft^2, 2 levels, 112 containers, 224K servers
  • each container: 2K servers, .5MW critical load, 50K lb, air skates
  • 11 diesel generators, currently 30MW total load, fully 56MW critical load

FAWN (dga, sosp09)

  • best paper; vijay presented
  • low-power embedded CPUs + small flash storage
    • increasing CPU/IO gap: reduce CPU
    • power grows super-linearly with speed
    • dynamic power scaling on trad’l systems is inefficient; near min volt at highest freqs
  • flash
    • fast random reads: << 1 ms (up to 175x faster than disk)
    • power-efficient IO: < 1 W at heavy load (disks at load use 10 W)
    • slow random writes
  • FAWN-DS: log-structured store
    • main-mem hash index: maps 160-bit keys to log pos
      • only stores 15-bit fragment of key, 1 valid bit, 4 B log pos = 16 B
      • index into hash table is lowest i bits (16 bits); tables have 2^i elts
      • key frag is next 15 bits; cmp this
      • pseudocode

        index = key & 0xffff
        keyfrag = (key >> 16) & 0x7fff
        for i = 0 to n:
          bucket = hash[i][index]
          if bucket.valid and bucket.keyfrag == keyfrag and
             readkey(bucket.offset) == key:
            return bucket
        return not_found
      • periodic snapshots
      • typical obj size: 256 B to 1 KB; for 256 MB RAM, yields approx 4 GB storage
    • each node has several virtual IDs in the key ring, so multiple files; these semi-random writes are still fast
    • maintenance: split, merge, compact are all scans that can run concurrently with ongoing operation (thanks to immutability/main-memory index)
  • TODO
  • FAWN-KV: consistent, replicated, HA KV store
    • chained replication
  • 350 KV queries/Joule: 2 OOM more than disk-based system


  • Michael J Freedman
  • Frans Kaashoek
  • Robert Morris
  • Ion Stoica
  • Scott Shenker

vuze aka azureus

  • custom msging protocol for extensions
    • peer exchange
  • superseeding
  • vivaldi coords
  • kademlia dht


Ivy: Memory Coherence in Shared Virtual Memory Systems (Yale)

  • migrate pages among writers; write invalidation
  • read-only copies can be replicated
  • implements sequential consistency
  • centralized manager algorithm
    • to get a read
      • C: RQ to M (read query)
      • M: RF to O (read forward)
      • O: RD to C (read data)
      • C: RC to M (read confirm)
    • to get a write
      • C2: WQ to M (write query)
      • M: IV to C1 (invalidate)
      • C1: IC to M (invalidate confirm)
      • M: WF to O (write forward)
      • O: WD to C2 (write data)
      • C2: WC to M (write confirm)
    • tricky: read/write confirms are done to order the read/write data requests
  • efficiency problems: comm intensive, page granularity, false sharing

TreadMarks: DSM on Standard Workstations and Operating Systems

  • “better” than Ivy
  • implements lazy release consistency with write notices/write diffs
  • compatibility with existing apps, since you can always re-write programs to respect sharing granularity (to avoid false sharing)
  • technqiues
    • LRC: lazy release consistency
    • write diffs: just send RLE of changes
    • write notices: don’t send write diffs immediately, but on demand (even lazier)
    • vector timestamps
  • details
    • release consistency: just publish changes following a lock release
    • eager RC: all coherence actions performed immediately on release operations
    • lazy RC: all coherence actions delayed until after a corresponding subsequent acquire
      • even lazier: use write notices
      • less work: use write diffs
    • coherence actions = publishing all changes
  • vector timestamps are necessary for this situation:

    CPU0: al1 x=1 rl1  al2 z=99 rl2
    CPU1:              al1 y=x rl1
    CPU2:                            al1 print x, y, z rl1
    • want CPU2 to see not just CPU1’s release but CPU0’s as well
    • CPU2 knows about CPU0’s change via CPU1’s vector timestamp, which includes CPU0’s entry



  • simplified overview: there are proposers and acceptors
    • propose
      • proposer sends proposal (with unique round number n) to all acceptors
      • acceptors promise to accept proposal if n is highest n seen (promise to ignore any lower n)
    • accept
      • if proposer hears back from quorum, send accept requests to quorum with the value it wants to set
      • if acceptor hasn’t promised to a higher n, tell proposer it accepted
    • if proposer gets quorum of accepts, then commit
  • normal for all nodes to be both proposer and acceptor
  • actual overview: acceptor promise identifies accepted n to prevent future proposers with higher n from changing the value

    prepare(n) ->
      "if you've already *accepted* a proposal what's its n and value?"
    <- prep_ok(n_a, v_a)
      the proposer will adopt this value, if it hears it from a majority
    accept(n,v) ->
    <- acc_ok(n)
    decide ->
  • proposal n is chosen when a majority of acceptors sent acc_ok(n)
    • interesting property of paxos: once proposal is chosen, system will never change its mind
  • problem: if proposers all get prep_ok responses that say nothing was agreed upon, then there may be conflicts in the accept stage
    • this is why we have acc_ok
  • pseudocode

      choose n, unique and higher than any n seen so far
      send prepare(n) to acceptors
      if prep_ok(n_a, v_a) from majority:
        # go with the *lower* proposal's value (but retain same n)
        if any v_a != nil: v' = v_a with max n_a 
        else: v' = v
        send accept(n, v') to all
        if acc_ok(n) from majority: send decided(v') to all
      n_p: highest prepare seen, initially 0
      v_a: highest accept seen, initially nil
      n_a: highest accept seen, initially 0
      prepare(n) handler:
        if n > n_p:
          n_p = n
          reply prep_ok(n_a, v_a)
      accept(n,v) handler:
        if n >= n_p: # reject anything lower than highest *prepare*
          # this ensures that we only accept the highest prepare we've seen
          n_a = n
          v_a = v
          reply acc_ok(n)
    commit point: majority of acceptors record a particular v_a
    if acceptor times out (doesn't receive decide):
      ask all servers for v_a, see if majority agrees
      otherwise become a proposer
  • scenario

    A1: p10              a10v10       p11     a11v10
    A2: p10              a10v10       p11     a11v10
    A3: p10     p11      a10v10 X             a11v10
  • say two proposers both get back a majority of prep_oks; n_p allows us to reject any accept request that isn’t the highest
  • vague argument for why other interleavings work out right:
  • important: the variable values must persist across crashes; must not be allowed to forget values
    • otherwise nodes that have already accepted a value will forget that they have done so, and may subsequently reply incorrectly to a different prepare/accept
    • what this effectively means is that nodes must record their variable updates to disk

life beyond distributed transactions: an apostate’s opinion (pat helland)

  • dtxns are fragile

Storage & Consistency

Errors in Database Systems, Eventual Consistency, and the CAP Theorem (Michael Stonebraker, cacm blog)

From http://www.allthingsdistributed.com/2008/12/eventually_consistent.html:

The first formal treatment of the consistency vs. latency (and if one wishes, availability and tolerance to network partitions) in fact appeared in the paper

Hagit Attiya, Jennifer L. Welch: Sequential Consistency versus Linearizability. ACM Trans. Comput. Syst. (TOCS) 12(2):91-122 (1994) - initially published in the SPAA 1991 conference.

It formally proves that for lineariability both read and write operations cannot be faster than the network latency. In sequential consistency, either reads or writes can be faster than the network latency, but at least one of them has to be slow.

In particular, this means that strong consistency conditions cannot be made highly available in partitionable environments, and cannot be implemented in a scalable manner.

Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems (read 2/25/08)

  • cluster storage systems where data is striped across multiple servers experience tcp throughput collapse
  • cause: exceeding output port buffer capacity within switches -> packets dropped -> tcp timeouts
  • solutions: nothing very satisfactory

Chain Replication for Supporting High Throughput and Availability

  • simple idea: have a chain of servers
  • all writes start at head and return from tail
  • all reads go to tail
  • “high availability”: you can stay up, but you’re only reading from tail….

The Chubby Lock Service for Loosely-Coupled Distributed Systems

Rethink the Sync

  • TODO
  • basically: bluefs allows for speculative IO assuming that cached data is still valid; allow computations to proceed while buffering linux IO so that speculative output is not released until data is validated

Tra (Russ Cox)

  • sync time >= mod time; time btwn is guaranteed to have no changes
  • only benefit of VT pair is bounding the times for dirs; same thing can be done with VTs, but more verbose repr
  • impl: traserv scans fs for changes
  • paper bugs
    • why is fig 8 known to have no mods at B3? it could’ve….
    • why doesn’t d have min of sync times in fig 12?



  • [work underappreciated for its answers to systems issues like stragglers and query fault-tolerance]
  • [issues]
    • pull-based reducers mean lots of disk seeks
      • O(mr), where m is mapper count, r is reducer count; on single disk, O(r)
      • bc after each mapper sorts locally, all reducers read a slice from each sorted map output


  • job: full map-reduce; task: an execution slice
  • jobtracker: master node that accepts jobs from clients, splits jobs into tasks, and assigns them to slaves (workers)
  • tasktracker: on each slave; manages task execution
  • map task execution
    • split: portion of input file for a task; defaults to 1 chunk (64MB)
    • mapper interface

      class mapper:
        map(key, val, OutputCollector)
        close(): called when map has been called on all records in split
    • OutputCollector: stores output in a format for reducer
      • apply partition function on key, buffer record/partition number in buffer
      • spill to disk occasionally; spill sorts by (partition, key)
    • commit phase: on task completion; merges spills; read when servicing reqs
    • combiner: optional map-side reducer
  • reduce task execution
    • each reduce task is assigned a partition of key range
    • shuffle phase: fetch task partition from each map task’s output (HTTP; default 5 at a time)
    • sort phase: group records with same key
    • reduce phase: apply reducer to each key and corresponding list of values
  • oozie workflow engine uses DAG of hadoop, pig, ssh, http, email jobs


  • TODO (below is just starter)
  • hiveql
    • from subqueries, inner/outer equijoins, multi-group-by, multi-table inserts, UDFs
    • sampling
    • nested collections: arrays, maps, UDTs
  • query optimizer
    • predicate pushdown
      • left outer join: left side preds pushed
      • right outer join: right side preds pushed
      • full outer join: none pushed
      • non-det functions not pushed (rand; specified with java annot)
    • map joins: user-specified small in-mem hash tables on mappers; no sort/reduce needed
    • group by optimizations: map-side partial aggs, load balancing for data skew
  • extensibility
    • disk data formats: text, sequence, …
    • in-mem (‘serde’) data formats: java, hadoop writable, thrift, …
    • user map-reduce scripts
    • UDFs: substr, trim, from_unixtime, …
    • user-defined agg functions: sum, average, …
    • user-defined table functions: explode, …

mapreduce online (tyson condie, nconway, …, jmh, cal TR 2009)

  • built hadoop online prototype (HOP); submitting to hadoop project
  • key idea: pipeline (no materialization)
  • features
    • online aggregation: see snapshots of results as they are generated (“early returns”)
    • continuous data stream processing: indefinite input
    • jobs finish faster
  • single job: start as many map/reduce tasks as you have slots for
    • don’t schedule all at once, since not enough slots and too many TCP conns
      • if no reducer task to take from a mapper, then just materialize
    • buffer records for opportunity to apply sort/combiner
      • resort/recombine periodically if reducer not keeping up
  • multi-job pipelining
  • fault-tolerance: reducer just avoids mixing tentative (“uncommitted”) map task output with “committed”
  • online agg: snapshot accuracy is hard; report simple progress metric along with snapshot results
  • continuous mr: continuous online agg
    • circular buffer of spillfiles
  • [db group meeting discussion]
    • they claim that they can’t overlap/pipeline mapreduce jobs, but unclear why
    • for regular non-online-agg MR jobs, unclear how much pipelining benefits perf
    • seems to be pushing the disk-writing to the reducer
      • mappers send smaller sorted units to reducers; reducers are reading from many input file descriptors
      • but reducers need to buffer these somehow; most likely need to spill to disk


  • TODO
  • DAGs, pipelining, explicit materialization
  • iterative algos incl. k-means, matrix power iteration, and graph traversal
  • cosmos distributed file system
  • ecosystem
    • scope: sql-like language; by bing team
    • dryadlinq: academic release
    • nebula: a distributed shell with file descriptors as connectors

manimal: relational optimization for data-intensive programs (mcafarella, webdb10, talk at cloudera)

  • dbms efficient, mr slow but flexible
  • given unmodified mr programs as bytecode
  • apply well understood opts like indexes, col stores
  • analyzer -> optimizer -> executor
  • special handling of regexes; eg constant regex must start w chars
  • eval 16 programs: 4 from pavlo, 12 from mahout
    • detected 3/4 sel opt oppty’s
      • couldn’t handle hashtable, but that’s popular enough for special handling
  • counters, log msgs, etc - we’re fine for abusing
  • opts: sel, proj, delta compression, dict compression w direct operation
    • dict compression can use the compressed value directly if you’re just doing e.g. equality comparisons; no need to decompress/look up in dictionary
    • sel: 1% selectivity -> 1% runtime
    • proj: huge data reductions
    • compression: delta compression not worth the trouble; too small gains
      • need semantic analysis of the code
  • outstanding problems
    • indexing time overhead is another mr job; only makes sense for repeat jobs
  • details
    • works w writables and text
    • overridden recordreader can use the index
  • mentioned haloop from uw on optimizing gradient ascent mr jobs
  • discussion
    • what % of mr programs are java or pig/hive?
      • karmasphere: most (90%) mr jobs are in java
      • amr thinks it’s the other way around; some even ban java
      • 99% are python streaming at fb, visa
    • flumejava/plume: closer to linq; java lib for mr jobs
      • craig chambers’ uw project -> google
    • hive has indexes
    • optimizer is rule-based; cost-based optimizers that take runtime feedback would be nice
  • extra cloudera work (not part of webdb talk)
    • learnavro: ‘zero to olap in 60s’; mcaf’s cloudera hacking
      • based on at&t learnpads paper; python done; java wip
      • go from raw data files to structured data
    • schemadictionary
      • inputs: anon structured data + previously seen datasets
      • find k-closest datasets to anon, + schema mappings
        • schema type info, data statistics
        • “how likely is it that this data was drawn from the same distrib?”
      • after user picks one of the 3, label anon data using mapping
      • dist computation function
    • tool to auto pick the correct visualization

Block stores


  • 2-node block-level replication, either sync or async
  • active/active only for FSs like OCFS2 or GFS2 (slower FSs), not ext3 etc.
  • supports online recovery
  • http://lwn.net/Articles/329543/


ceph (sage weil, osdi06) TODO incomplete

  • in linux 2.6.34; weil’s ucsd phd thesis, then founded new dream/dreamhost
  • near-POSIX semantics; has linux VFS client
    • extends interface & selectively relaxes consistency for app needs & for perf
    • O_LAZY: don’t need to disable read caching & write buffering when file opened by multiple clients
    • HPC community proposed extensions to posix
  • metadata server (MDS) cluster manages namespace and coordinates security, consistency, and coherence
    • dynamic subtree partitioning: can redundantly replicate portions for perf
    • measure metadata popularity using exponentially decaying counters; each op affects parent dir up to root
    • 2PC subtree authority transfers
  • CRUSH distribution function designed for heterogeneous, unreliable object storage devices (OSDs, i.e. backends)
    • file broken into objects mapped into placement groups (PGs) using simple hash fn
    • PGs assigned to OSDs using CRUSH
  • ceph monitor cluster manages MDS clusters
    • just has to handle MDS heartbeats
    • uses Paxos w leasing mechanism
  • largest scalabality test: 430 nodes, 128 MDS, per-MDS perf 50% of 1-MDS cluster, ~250K metadata ops/s (enough for 1000s of OSDs)
  • OSDs only comm w peers
  • actively migrates data to new nodes and re-replicates data on failed nodes
  • uses (and contributes to) btrfs; orig used own EBOFS (extent and b-tree object file system)
    • tuned for obj semantics & other features eg async notif of commits to disk
  • other features: snapshots; file/capacity accounting at dir level

pohmelfs TODO

farsite distributed file system (MSR, 2000)

  • replicates each file onto multiple desktops

Reclaiming Space from Duplicate Files in a Serverless Distributed File System (2002) [32 citations — 2 self]

  • over half of all consumed space is occupied by duplicate files
  • convergent encryption: coalesce duplicates to take single file’s space
  • database for aggregating file content and location information in decentralized, scalable, fault-tolerant manner

frangipani: scalable distributed file system (1997)

  • built on petal block storage


  • WAN operation
  • TODO

Replication in the Harp File System (Liskov, Ghemawat, et al)

  • primary copy replication: single primary server waits on multiple backup servers before returning
    • primary ensures enough backups to guarantee operation effects will survive all subsequent failovers
    • failover algorithm masks failures, removes failed node
  • operations are immediately recorded into volatile memory log
    • write-behind to disk
    • UPS for power failures
  • each file managed by a group (with one master)
    • modifications require 2PC
    • non-modifications don’t require 2PC, but instead TODO
  • replication method
    • overview
      • view change: each group configuration is a view
      • as with any replication scheme that tolerates network partitions, require 2n+1 for n failures
      • only actually need to store data on n+1 of the machines; can be propagated on failures (partition)
      • witnesses: replicas that don’t store data
      • 1 designated primary (will act as primary whenever it can), n designated backups, n designated witnesses
      • arrange groups so that each node is designated primary of one group, designated backup of another, designated witness of a third; this distributes workload
    • normal-case processing
      • log of event records with growing IDs of operations, both phase 1 and committed
      • commit point (CP): ID of latest committed op (separates phase 1 and committed)
      • 2PC
        • primary logs, sends logged info to backups
        • backups only accept in log order; ack for n means got up through n
        • primary commits by advancing CP and send CP to backups
      • applying changes
        • log is WAL no-steal: changes not applied on disk (FS) until committed
        • application point: ID of latest op that has started being applied to the local FS
        • lower bound (LB): ID of latest op that has finished being applied to the local FS
        • primary and backups exchange LBs
        • global LB: min(all LBs); may truncate things beyond this
      • recovery: log brings recovering node up to date
        • log is redo: only sufficient to redo op after failure
      • reads can be serviced immediately without 2PC
        • results reflect committed, no uncommitted (i.e. serialize these reads at the CP)
        • if primary is partitioned off, then the result of the read may not reflect a write that has been committed in the new view
        • compromises external consistency, which requires comm outside FS
        • make this unlikely by using loosely synchronized clocks
          • each message from backup to primary contains a time equal to backup’s clock + \delta (few hundred ms)
          • backup promises to not start new view until that time
          • expect that starting a new view will not be delayed, since \delta will have passed before new view is ready to run
          • primary needs to comm with the backup about a read op only if the time of its local clock is greater than the promised time
        • access time: can enable loose consistency on this
          • return immediately from primary, before committing
          • may be lost if there’s a failure/partition
    • view changes
      • views have view numbers
      • promoted witnesses may not have resources (eg FS) so must never trunc log
      • TODO left off here

The Google File System

  • 3x replication

GFS Evolution

  • main problem: master has memory index of all files’ metadata
  • orig master per cell per DC, but isolation was difficult and 1-master not scalable in # files; moved to multi-cell, multi-master
    • multi-cell: multiple masters (isolated FSs) over shared pool of chunkservers; app must manually partition over masters
    • namespaces: static partitioning of one FS namespace over multi-cell
  • started having more online apps like gmail
    • 64MB chunks too big but smaller chunks -> more files; 1MB compromise; see above discussion on master scalability
    • orig GFS was for throughput not latency
    • also needed more availability; 1-master is SPOF
      • orig master failover was manual: cell could be down for an hr
      • then auto, but slow (minutes); now down to 10s
    • generally, bigtable solves a bunch of problems
      • bigtable is very failure-aware & responds faster
  • multihoming
    • bigtable txn log: have 2 logs, write to one and failover to other, merging later
    • gmail uses multihoming, for availability and to hide gfs problems
  • many disks claimed to linux driver that they supported a range of ide protocol versions but actually responded reliably only to more recent ones, silently corrupting data
    • motivated GFS checksums, but these were E2E so covered everything (eg TCP corruption)
  • loose/eventual consistency
    • apps can read stale data
    • client failures are big issue
    • multi-writers are problematic; eg RecordAppend interface (dupes, ordering)
  • general google approaches
    • get things working reasonably well, then focus on scaling, not so much on efficiency; usually never focus on a binary, except for GFS master perf
    • had flexibility since apps & infrastructure all by google
  • http://queue.acm.org/detail.cfm?id=1594206
  • http://cacm.acm.org/magazines/2010/3/76283-gfs-evolution-on-fast-forward/fulltext

wheelfs: Don’t Give Up On Distributed File Systems (strib)

  • specify hints about consistency

CloudStore aka Kosmos File System (KFS)

  • from kosmix auto-portal site
  • hadoop/hypertable/hbase integration [compat with HDFS?]
  • for: mainly write-once/read-many loads; Ms of big files; mostly seq access
  • replication (configurable per-file) with async re-replication on failure
  • periodic chunk rebalancing; elasticity
  • for data integrity, verify checksum on read; re-replicate to recover
  • leases for client cache consistency
  • chunk versioning to detect stale chunks
  • C++, STL, boost, aio, log4cpp; solaris, linux; fuse


UsenetDHT (Emil Sit, NSDI08)

  • TODO: read paper
  • high bandwidth for peering and high storage/durability rquirements
  • related work
    • DHash, OpenDHT optimized for small objects (eg CFS)
    • Galcier: focus on catastrophic failure
    • Total Recall: …
    • Coral: perf, but a cache, so no durability
  • UsenetDHT: shared Usenet server
    • 80 TB/yr

    • Passing Tone: algo for data maintenance
      • re-replication on failures
      • durability for immut dat
      • using only pairwise sync
      • supports obj expiration
    • enhanced DHash for perf
      • bulk data
      • 6MB/s per server
  • goal: reduce distribution inefficiency and storage reqs
    • share articles among eg universities, ISPs
  • nodes do full exchange of headers only
  • challenges
    • in talk and paper
      • respond to eachfailure?
      • syncs expensive
      • TODO
    • in paper
      • limited disk cap: can spurious repairs be avoided
      • rand disk IO
  • Passing Tone
    • r_L reps for durab
    • extra replicas to mask transient failures
    • repair decisions need only pairwise syncing
    • two algos:
      • local: replication among current replica set
      • global: ensures correct placement
  • data placement
    • replicate to r_L successors
    • idea from Carbonite system: extra replicas to mask transient failures
    • make new copies when < r_L avail
  • predecessors enable local decisions
    • chord only know about immed pred
      • forces immed succ to handle maint
      • coord hard: constant writes/epirations
    • pred list allows nodes to handle own maint
      • given r_L and preds, know responsible range
      • extras reps can be reclaimed for space if nec
  • local maint
    • calc respons range from preds
    • sync with nbrs over range to
      • identify objs missing locally
      • make local reps of those objs

Kademlia (Petar Maymounkov, David Mazieres)

  • simplest


Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications


Friday: global debugging (paper notes)

  • intro
    • objective: track global state
    • contributions [basically new debugger]
      • primitives detect events with watch/breakpoints
      • attach commands to these
  • liblog
    • for libc/posix c/c++ apps
    • each thread logs side effects of non-det syscalls (recvfrom, select)
    • maintains causally consistent group replay by inserting lamport clocks in messages
    • incrementally deployable
    • can simulate a reasonably large # nodes on debugging machine
    • still hard to debug - find needle in haystack
  • design
    • watch/breakpoints
      • breakpoints are the same
      • implementation [watchpoints]
        • write-protected pages
        • alternatives
          • hardware - limited # watchpoint regs (usu. 8) not enough
          • single-stepping - too slow
          • local breakpoints - hard to analyze program to know where to insert them
          • periodic sampling - coarse-grained
      • implementation complexity
        • replicated gdb functionality
        • replaced gdb’s calling facilities due to conflicts with write-protection
    • commands
      • scripting language (running in own world; effectively ‘global’ state)
        • syntax to access app state, current node, lamport/real clocks, etc
      • language choice
        • python - easy to embed
      • syntax
        • translate to gdb’s ‘print’/‘set’
        • string interpolation
        • memory protection on app function invocs
        • general data structure marshalling -> raw bytes, enough for equivalence testing
    • limitations
      • false positives slow down replay
        • can recompile app to spread out stack
      • must manually build debugging state if starting from middle of replay
        • future: add debugging state to liblog checkpoints (at debug time)
      • friday predicates often require debugging themselves
  • case studies
    • routing consistency: i3/chord dht



  • overlog

BOOM: data-centric programming for the data center

  • implemented HDFS, including multi-paxos master node, in overlog
  • master/namespace is partitioned for scale-out; took 1 day
  • overlog timestep

The NIL distributed systems programming language: a status report

Mace: language support for building distributed systems



A System for Constructing Configurable High-Level Protocols

From asynchronous to synchronous specifications for distributed program synthesis


The Darknet and the Future of Content Distribution

A Survey of Anonymous Peer-to-Peer File-Sharing



  • elastic compute cloud
  • “compute units” = some amd opteron
  • vm’s (~2-3 min to start)
  • can’t run arbitrary os; must be linux 2.6-based with paravirt (for xen)


  • storage service

simpledb TODO

  • similar to bigtable
  • objects with loose set of multi-valued attrs
  • simple query lang: select ATTRS from DOMAIN where PREDS
  • eventual consistency
  • queries capped at 5 sec runtime
  • $1.50 / gb-month
  • $0.14 / cpu-hr consumed by queries


On Designing and Deploying Internet-Scale Services

  • lessons learned from Windows Live


PeerReview: Practical Accountability for Distributed Systems


Hundreds of Impossibility Results in Distributed Computing http://citeseer.ist.psu.edu/fich03hundreds.html

  • TODO




  • endpoints talk to a gateway (eg belonging to the company)
  • gateways talk to each other (over IP)
  • no P2P network as in Skype


r2: record and replay debugging tool

  • lib-based replay
    • binary rewriting (loader): liblog, jockey, recplay
      • intercept at syscall/libc
      • more lightweight than whole-sys replay
    • record run
      • call orig syscall/libc funcs
      • log input
      • log thread scheduling info
  • read()
    • record: log input
    • replay: feed buf
  • challenges
    • mem addrs must be identical
      • but fundamentally diff, eg shared libs
      • control mem consumption
    • nondet must be enclosed
      • shared mem, ioctl, benign race, …
      • fixing syscall/libc too restricted
    • fat interface
      • need to wrap too many api funcs
  • contribs
    • app-level isolation
      • kernel ideas: replay & system space
      • manage mem and exec
      • customize record & replay layer
    • annotation keywords
      • code template
    • experiments and experience
      • sqlite, bt, mpi
  • replay & system spaces
    • api as r2 syscalls: replay -> system
    • r2 upcall (from lib into app): system -> replay
  • space mode bit
    • 0 - system space; 1 - replay space
    • start in 0
    • when entering main, switch to 1
    • when invoking a syscall, switch to 0; e.g., read()
    • when issuing an upcall, switch to 1
      • wrap upcall (callback), e.g., createThread(…workerThread…)
  • mem mgmt
    • separate mem pools: all calls to malloc inside system space do not exist during replay run
    • explicit data xfers: read()
  • code templates
    • annotated prototype: int read([in]int fd, [out,bsize(return)]void *buf, [in]unsigned int nbytes)
      • language already in use (windows); see also SafeDrive(?)
    • php code template to generat c++ code snippet
  • mem mgmt (cont)
    • cross space pointers (xpointer)
      • make a copy in replay pool, e.g., getcwd(NULL,o)
    • encapsulated resources: as ints, e.g., files
  • capturing exec orders: lamport’s algo
    • maintain logical timestamp for each thread
    • syscall-upcall (callback): createthread < workerthread
    • syscall-syscall (signal & wait): releasemutex < waitforsingleobject
    • total order vs. causal order: race
  • related work
    • lib-based replay
    • whole system replay: hardware (strata), vm (revirt, idna)
    • annots: sal icse06, deputy osdi06
    • isolation: xfi osdi06, nooks sosp03
    • domain-specific replay: ml, mpi, java
    • replay apps: predicate checking, rate detection, intrusion detection


hashcache (nsdi09)

  • web proxy cache with pluggable indexing policy
    • goal: minimize in-mem index for storing largest amount of data
    • goal: maximize perf by minimize disk seeks
    • take advantage of the fact that there is a hash table on disk
    • index mainly to serve existence queries
    • achieved just 1 disk seek per req
  • 3 policies, from most compact to highest performant: Basic, SetMem, and Log
  • Basic: just store 8 bit low-false-positive hash
  • SetMem: store add’l 3 bits of LRU info
  • Log: store offset onto disk improves write performance
  • compared against open-source squid (least efficient) and commercial tiger (more efficient)

partial content replication (ucsd nsdi09)

  • eventual filter-based consistency

cimbiosys: content-based partial replication/partial content replication (doug terry, msrsv, ucsd, nsdi09)

  • motiv: photos
  • reqs: content-based selection, flexible (mesh) communication network
  • protocol: simple
    • target sends knowledge and filter to source
    • source sends unknown and filtered items to target
    • target adds items and updates knowledge
  • provides: eventual filter consistency (correctness), eventual knowledge singularity (perf)
  • move-out notifications for removing item when item no longer matches filter
    • “local” move-outs: must hold on to it so that it’s not lost permanently (in case someone else wanted to get it)
    • move-out notification chains: when a node that has removed an item syncs with another node that should remove the item as well; send move-out iff:
      • source’s filter is a superset of target’s filter,
      • source is as knowledgeable,
      • source doesn’t store item that target stores
  • filter changes: local operation
    • remove items that no longer match
    • retract knowledge
  • to ensure eventual singularity: tree embedded sync topology
  • misc
    • implemented in Mace on linux

rpc chains (nsdi09)

  • this primitive chains together multiple RPC invocations so that the computation can flow from one server to the next without involving the client every time
  • results are added to param list and fwded
  • evaluated with chained copy from one nfs server to another
  • nest/compose sub-chains by maintaining stack of chaining functions and state
    • also allow concurrent sub-chains using splits in the path
  • also ties in to efficient bandwidth usage a la GFS (chained replication)

google storage (google io 2010)

  • like S3
  • all data WAN-replicated, RYW consistency, any size objects

transactions across datacenters (google io 2009)

  • google dubs it “multihoming”; focus on app engine datastore
  • consistency (wrt durability)
    • weak: reads might not see prior write
      • gae memcache
    • eventual: reads eventually see prior write
      • dns, email, S3/simpledb (~500ms lag in S3)
    • strong: reads will see prior write
      • gae datastore, filesystems (sync), rdbms, azure tables
  • stats
    • within DC: 1-100Gbps, <1ms within rack, 1-5ms cross racks, no cost
    • across DC: 10Mbps-1Gbps, 50-150ms, cost for fiber
      • 30ms baseline speed-of-light across US
  • options
    • bunkerize single DC: azure, simpledb
      • redundant power sources, backbone connectivity, cooling
    • single-master
      • if async, then window of data loss vulnerability; sometimes even inconsistent replication
      • AWS
        • EC2, S3, SQS: choose US or EU
        • EC2: availability zones, elastic load balancing
      • financial institutions (banks, brokerages, etc.) use this frequently
      • can geolocate reads, depending on your consistency requirements
    • multihoming/multi-master: sync replication
  • techniques
    • backups: super-async, weak consistency; low latency, high throughput
    • master-slave: async, eventual consistency; low latency, high throughput
      • gae datastore uses this
    • multi-master: async, eventual consistency; need ordering; can keep updating on DC failures (since no single master)
    • 2PC: for cross-shard operations
      • semi-distributed consensus protocol (always need coordinator)
      • synchronous, high latency, low throughput
      • 3PC buys async with extra round trip
    • paxos
      • fully distributed; no single coordinator
      • synchronous, high latency (prohibitively high even in same DC), medium throughput
      • used in chubby; for moving state, memcache, offline processing
    • no silver bullet; embrace tradeoffs
  • entity groups in megastore are the unit of partitioning/consistency
  • criticism: if you watch this, don’t get too caught up in the taxonomies here; they aren’t very carefully designed. unfortunately the talk doesn’t bring to the forefront the fundamental dimensions, instead convolving them into a jumble of different “packages” without sufficiently teasing them apart. my fear is that this talk will lead to the proliferation of such falsehoods as “master-slave replication != paxos” (!) and “master-slave replication means == weak/eventual consistency.” that, and “multihoming.” (google managed to do it before with “sharding.”)
  • http://code.google.com/events/io/2009/sessions/TransactionsAcrossDatacenters.html

CAP theorem

  • bigtable, hbase opt for consistency and availability
  • dynamo opts for availability and partition-tolerance

revision control systems


  • cherry-picking: choosing individual changes from one branch and applying them to another branch
  • daggy fixes: fix the code as a tiny branch off of wherever the bug was introduced
    • nicer than cherry-picking because the problem is reduced to one of merge
    • reuse the same merge/conflict-resolution machinery


  • do work in client workspaces; submit changed files together in changelists
  • p4 client
  • used at google; g4 frontend

source depot

  • microsoft’s internal rcs; fast
  • shares the same syntax and model as perforce
  • there’s a vast hierarchy of branches
    • integration happens up the tree (reverse integration) and down the tree (forward integration)
    • integrations occur more frequently toward the leaves and less frequently toward the root
    • the leaves are what individual teams work on
    • they’re all mirroring the same file/dir namespace
    • there’s actually a set of these trees; the roots are called trunks

distributed oltp databases


  • multi-version, global, sync-repl, “externally consistent” txns
  • novel time API exposes clock uncertainty for external consistency
  • non-blocking reads in past, lock-free read-only txns, atomic schema changes
  • auto partition across paxos RSMs
  • trillions of rows
  • relational data model, SQL-like query lang
  • configurable repl
  • architectural hierarchy
    • universe: eg production, dev
      • universe master: display status of all zones
      • placement driver: manages data xfer across zones
      • zone: administrative unit; single location
        • zonemaster: assigns data to spanservers
        • location proxy: helps clients locate spanservers that house the data
        • spanserver: controls 100-1K tablets
          • tablet: bag of (key:string, timestamp -> string)
  • tablet data stored in ~B-trees & WALs on colossus (new GFS)
  • paxos replica group per tablet
    • leader also does CC w lock table
    • 2PC across leaders for dist txns
  • directory: keys w common prefix; unit of data placement; ~50MB
    • all data in dir have same (geo-)replication config
    • move data btwn paxos groups in dir-wise manner (TODO huh?); not txn’l (or blocks txns for few secs)
  • spanner layers a bucketing abstraction
  • TODO wound-wait truetime 2px+paxos 2pl


  • in-memory memtable LSM for holding pending writes
  • on write, flush WAL and update memtable
  • on reads, read memtable & multiple disktables, but first checking for non-existence in their respective bloom filters (which may say “maybe” or “no”)
  • when full, flush memtable to disk as a new disktable
  • periodically compact the disktables, merging them into a single disktable
  • not partition-tolerant
  • scale to PB range across thousands of machines; auto elasticity
  • coprocessors: run arbitrary code over rows/row ranges

voldemort (linkedin, 2009)

  • based on dynamo
  • linkedin: no joins, lot of denorm, no orm, no constraints/triggers, memcache, no-downtime releases, NOT horizontally partitioned (!), latency is key
  • large/persistent dataset cannot all be in mem
  • combines memory caching
  • design
    • layers top to bottom
      • client API, conflict resolution, serialization
      • routing and read repair, failover (hinted handoff)
      • storage engine (bdb/mysql/memory)
    • layers all have same interface: put, get, delete
  • physical arch configs (all top to bottom)
    • 3-tier, server-routed: frontend, load bal, backend service, load bal, voldemort cluster
    • 3-tier, client-routed: frontend, load bal, backend service, partition-aware routing, voldemort cluster
    • 2-tier, frontend-routed: frontend, best-effort partition-aware routing, backend service with voldemorts
  • client api
    • put, get, get_all, delete
    • store: a table; keys are unique to stores
    • put, delete: can specify expected version
    • OCC to support multi-row updates and consistent read-update-delete
  • versioning and conflict resolution and consistency
    • vector clocks provide partial order on values; like an improved OCC
    • conflicts resolved at read time and write time; user can supply handler
    • no locking/blocking necessary
    • also recently added paxos
  • routing
    • turns single get/put/delete into multiple parallel ops
    • consistent hashing variant that allows for weighing nodes by their capacity
    • repairs stale data at read time (??)
    • can add user strategies, eg only do synchronous ops on nodes in local DC (??)
  • storage: bdb (default), mysql, mmap, or memory for testing
  • limitations: no elasticity
  • offline processing
    • parallel extract from voldemort with throttling to respect live workload
    • use hadoop to produce new voldemort stores then atomic swap with live
  • perf
    • latency: median 0.1ms, 99.9% <3ms
    • 1-cli 1-svr throughput: 20K reads/s, 16.5K writes/s
  • TODO: any relation to memcached?
  • http://project-voldemort.com/

cassandra (facebook, 2008)

  • based on dynamo’s eventual consistency, P2P arch, DHT, gossip mgmt
  • based on bigtable’s column family data model, SSTables, hadoop integration
  • designed by one of the authors of dynamo, avinash lakshman
  • richer data model than voldemort
  • data model and physical storage layout
    • KeySpace > CF > Key > SuperColumn > SubColumn
    • with range partitioning, can range_slice to range-query on keys and columns; with random partitioning, can range-query on columns
    • cluster: the machines in a logical cassandra instance; set of keyspaces
    • keyspace: a namespace for column families; typically one per app
    • column families: ordered list of columns; analogous to relational table
      • rows have column name, value, timestamp; referenced by row keys
      • each stored in separate file, in row (i.e. key) major order
    • supercolumns: columns that have subcolumns
  • writes in LSMs
  • consistency
    • every kv pair has a client-managed timestamp
      • no version vectors or vector clocks or server-managed timestamps (vclocks may be added in 0.7)
      • clients must have synchronized clocks
      • latest timestamp wins
    • can specify replication degree per request
  • flexible replication: sync or async
    • RackUnaware: place replicas on n-1 subsequent nodes around ring
    • RackAware: place replica 2 in other DC, then rest on other racks in same DC
    • DatacenterShard: place m of n reps in other DC, rest on nodes in other racks in same DC
    • use Snitches, mappings from node (IP) to physical location (DC, rack)
  • usage
    • serialization
      • thrift, avro added in 2010
      • expects column names in network order (eg for ordering)
    • nodetool: check nodes, decommission, trigger compaction
    • working on compressed LSMs
    • need to regularly schedule major compactions to free up disk space
    • keyspaces, column families statically specified; changes require restarts
    • 0.6 allows direct access from hadoop
  • no mem caching
  • gossip-based distributed failure detection & membership protocol
    • sync membership & partitioning info w random node every sec
    • preserves symmetry among nodes
  • WIP
    • session level consistency
    • avro instead of thrift
    • backend for hive
  • adoption: fb, twitter, digg, rackspace
    • fb aborted adoption of cassandra, trying to move away but still stuck with it for inbox search; this helped the 3 devs leave (spullara)
    • twitter not planning to use cassandra for any critical data stores (7/7/10)
  • 14K writes/s, 7K reads/s; 15% write overhead from fsync (durable log)
  • refs

scalien keyspace TODO

  • KV + PaxosLease [seems interesting]
  • supports safe reads from master and dirty reads from any replica

riak (basho)

  • faithful impl of dynamo, incl all key components + extra features below
  • erlang; primary key lookup only; json fmt; consistent hashing; vnodes
  • eventual consistency w vector clocks & app conflict resolution
  • pluggable storage backend (in-mem, etc)
  • allows replicas to be temporarily inconsistent (tunable RW quora)
    • read-repair, anti-entropy
    • full single server durability
  • raw speed slower than cassandra
  • buckets are buckets are dynamically created
  • extra features
    • links: lightweight ptrs among data; can select along these paths
    • map-reduce over data in js
  • [basho founded by former akamai ppl; brewer joined BOD]
  • http://riak.basho.com/nyc-nosql/

perf comparisons with key-value usage model

  • chunkd: 100% tyrant in warm-read case, 83% in cold
  • cassandra: 30-85% chunkd
  • voldemort: 30-80% cassandra
  • tabled: 50-90% voldemort
  • riak: 15-45% tabled; 8-24x slower than tyrant
  • http://pl.atyp.us/wordpress/?p=2417


  • provides consistent hashing on top of tokyo tyrant
  • TODO

piccolo (jiyang, nyu, osdi10)

  • partitioned KV store; can eg force together partition i from multiple tables
  • commutative updates
  • load balancing
  • python/c++ impl
  • eg pagerank
  • other apps: crawler, numeric algos


  • works across DCs
  • eventual consistency
  • hinted handoff: if a node that should receive a write is down, Cassandra will send that write to another node with a “hint” saying that when the destination node becomes available again, the write should be forwarded there
  • vector clocks to detect conflicts
  • quorum consensus: if R+W>N then [usu.] consistent reads [but see critique]
  • sloppy quorums?
    • writes don’t need to happen on first N members of preference list
    • eg if first 3 are down then just take next 3 on the list; if first 3 came back, inconsistency
    • but then: “Using hinted handoff, Dynamo ensures that the read and write operations are not failed due to temporary node or network failures. Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key it to its local store.”
    • so, if W < N, then N - W writes can be written using hinted handoff to nodes other than the final destination, and forwarded when the natural nodes come back online. BUT the W writes specified by the requested consistency level do have to be to the right nodes or everything falls apart.
  • 3 levels of repair: read-repairs, hinted handoffs, replica synchronization/anti-entropy
    • read-repairs and hinted handoffs are near-realtime, minimizing drift
    • read repair: when reading from quorum, update any stale replicas
    • hinted handoffs: bulk op done on node rejoin
    • replica sync/anti-entropy: complete merge once a day using merkel trees
  • critique: http://jsensarma.com/blog/2009/11/dynamo-a-flawed-architecture-part-i/
    • inconsistency: actually asymmetric when claiming to be symmetric; symmetry is a flawed goal because nothing else in the datacenter architecture is symmetric
    • disaster recovery: lack of point-in-time consistency pushes complicated recovery onto app
      • eg can’t just replay logical operations from a point in time since the DB is not a consistent snapshot
      • standard log-shipping replication/recovery can at least track replication lag
    • quorum consensus: doesn’t actually guarantee consistency because nodes that (re-)join a quorum don’t have to go through a sync barrier (not brought up to date before being considered part of the quorum), so eg if R=1 then it may respond to a read request with stale data
    • decentralization !iff availability
      • centralized resources may be made highly available, eg dual-everything in SANs, eg highly available
      • centralization is a scalability bottleneck if anything
      • in fact dynamo cluster itself is likely centralized; eg no dual-routers means unavailability/partitions


  • bigtable clone for HDFS (or KFS); apache hadoop subproject
  • region: key range in consistent hashing (a shard/partition)
    • rows stored in byte-lexicographic order [first bytes must be key]
  • similar to HDFS: master:namenode :: regionserver:datanode (usu. deployed alongside each other on same hosts)
    • regionserver: assigned regions
    • master: manages assignments, notices failures
  • memstore: fast random writes
  • column family data model provides user-controlled locality and versioning and semi structure
    • think of sparse columns, so fields stored as key-values
    • separate column families stored in separate files
  • hbase 0.20: fast random reads
  • in-progress:
    • multi-master WAN replication
    • rewrite the master to leverage zookeeper more
    • need HDFS appends to avoid data loss
  • writes: WAL is HDFS file
    • commit after in mem on 3 replicas, but flushed to disk asynchronously
    • fast writes; CDH3 adds single-server durability
  • HDFS is rack-aware
  • http://hstack.org/why-were-using-hbase-part-2/
  • coprocessors: same as in bigtable

http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html http://stackoverflow.com/questions/1189911/non-relational-database-design http://www.google.com/search?hl=en&rlz=1C1GGLS_enUS337US337&q=voldemort+cassandra+hbase+hypertable+tyrant+dynomite&aq=f&oq=&aqi= http://research.yahoo.com/files/ycsb.pdf

google app engine (GAE) data store

  • entity groups are units of partitioning
  • supports optional stale reads
  • separate blobstore for storing large objects
  • supports secondary indexes
  • built on megastore


  • built on big-table
  • provides declarative schemas, multi-row transactions, secondary indices
  • used to do in-order & atomic wide-area replication with Paxos, but now realizes the folly of its synchronous ways
    • switch to asynchronous replication
    • along with fixing replication to maintain transactional consistency (before, replication would leave destination in potentially transactionally inconsistent state)
  • means that apps now get to deal with eventual consistency (the alternative being unavailability)
  • multi-row txns limited to rows sharing the same key prefix
    • optimistic concurrency control: a separate “row” contains the version number that clients check before and after that operation
  • asynchronous replication but durable commit log at master
    • if master fails before replication, then fresh read requests fail
    • later attempts to access fresh will replay the log
    • probably delay replay till subsequent access to give transient errors time to go away
  • supports declarative secondary indexes
    • like replication, these are updated asynchronously, hence may be stale
    • commit log also replays these
    • system understands protocol buffers; eg can index a PB’s repeated fields

infinispan (data grid)

  • FLOSS from jboss
  • formerly jboss cache
  • supports read committed and repeatable read
  • distributed transactions
  • supports eager distributed locking
  • mvcc

DRKP: working on “mul-tree-cast” membership service

  • multicast updates down multiple trees
  • each tree has the same root (leader)
  • leader changes by epoch
  • is BFT (without BFT protocol)
  • the idea is that if any leader is BF, then just wait for the next epoch
  • they use the currently constructed trees to disseminate the other trees being constructed
  • trees are constructed using a deterministic algorithm that each node can independently calculate
  • problem: what if you send different subsets of the membership to each tree?


  • cmds: set, add, replace, append, prepend, get, gets/cas, delete, incr/decr, stats*, flush_all, get_multi
  • all ops atomic; can set expiration time (impl: lazy on-access expiration); LRU
  • max key 250B; max value 1MB; slab mem allocator
  • TCP and UDP, ASCII and binary wire protocols
  • non-blocking; multi-threaded parsing, but not cache access; may be future work
  • interesting use cases
    • flood control: allow user attempt to post iff put with 60s expire succeeds
  • advantages vs db caching
    • size: no need for 1:1 balance btwn cache nodes and db nodes
    • efficiency: RDBMSs have a lot of other engineering (buffer manager, etc.); also, eg, mysql writes invalidate buffer
    • control: build very efficient caches, at the level application desires (cache not just DB accesses but the result of arbitrary computations/accesses)
  • pains, greatest to least
    • requires manual invalidation; del or put new value (usu preferred)
      • typically used as read-only cache, so only worry about staleness
      • though, causality can be easily guaranteed with version numbers
    • stampedes: overload backing DB on expiry/invalidation (if highly requested)
      • reduce window by:
        • using app-level expiry timestamp; on get, if expired, immediately put back with extended deadline to prevent others from stampeding; then hit DB and update cache
        • use probabilistic app-level expiry timestamps
        • somewhat helps: invalidate via cron instead of on-demand
        • somewhat helps: for invalidation, update eagerly instead of delete
      • avoid by: issue a unique work queue item to hit DB & update cache, eg using gearman
      • [ideally: allow the above “notify-one” expiry, informing later requests that a request is already working on it; then the others can wait a while for the value to appear in memcache, trying again if they still don’t see it after the delay, in case the first requestor failed to hit DB & update]
    • see note on facebook’s fancier invalidation experiences
    • no secondary indexing: can’t reference multiple keys to one value; must do this app-level
  • tips
    • scalable lists
      • maintain lists of keys instead of data; can spread load too
      • chunk up the lists



  • data model/organization
    • server can manage multiple isolated DBs
    • each DB has 1+ collections; collections have documents (schema-less objects with fields) + ACL
    • all objects have _id field; always indexed
  • BSON: like JSON data model; general binary serialization format; primitives, objects, arrays
  • dynamic queries
    • JSON-like notation; equality and gt/lt comparators
    • atomic update modifiers via set operators (constrained to same document)
    • generates plan, uses indexes; cursors
  • indexing on inner objects and embedded arrays; compound keys; specify sorting order; indexes support unique
  • query profiler
  • master-slave replication: async; both manual and auto
    • replica pairs (sets): node 1 is master 1 slave 2, node 2 is master 2 slave 1
    • auto failover
  • efficient storage of binary data incl. large objects (photos, videos)
  • auto-sharding for high scalability
    • each shard is replica pair and manages some chunks
    • chunk: continuous range of objects from particular collection, ordered by a key (specified similarly as index)
    • other components: frontend coordinators; config/metadata servers
  • aggregation (group by); support for large map-reduce operations
  • no durability: no sync disk/replica writes
  • more features: capped collections, gridfs for large blobs
  • focus on performance
    • C++; mmap’d files
    • no REST; binary TCP protocol
  • many supported langs, platforms
  • no txns

sinfonia (HP including mehul shah, sosp07)

  • application nodes & storage nodes are separate
  • data model: raw linear address space; accessed directly by client libs
  • minitransaction: a single-shot txn containing compare, read, and write phases
    • batchable; can be executed within phase 1 of 2PC; in par with replication

RAMcloud (ousterhout, HPTS09)

  • goal: low latency
  • rationale
    • 64GB RAM/server w 1M ops/s, 5-10us RPC; 1TB RAM in 5-10 yrs
    • $60/GB today; $4/GB in 5-10 yrs
    • lower latency reduces txn durations, conflicts

gizzard (twitter)

  • middleware sharding framework
  • for any storage system, not just sql
  • specify range routing table; more control than consistent hashing
  • each shard is a replication tree
    • each partition can have distinct replication tree
    • better fault tolerance
  • async replication only
  • writes are buffered till shard or replica available again
  • requires writes to be idempotent; retry-later strategy can apply ops OOO (eg newer ops before older failed ops)
  • handles migrations by applying WriteOnly shard in front of new node and replicating ops to both
  • http://engineering.twitter.com/2010/04/introducing-gizzard-framework-for.html