DB

TODO merge in notes from CS186

Background

resources

vocab

  • PDBMS = MPP
  • data cube: hypercube of multi-dim info; analytical aggregation along dims
  • a predicate is sargable if DBMS can use index to speed up execution
    • usu. requires invertible, order-preserving (monotonic) function on indexed attribs

fields

  • analytical processing, BI, data warehousing
  • transaction processing: worry about scaling, consistency
  • stream processing, sensor nets
  • data integration, deep web, federated db’s
    • template extraction, schema recognition
    • schema resolution
    • entity resolution
    • data extraction, wrapping (deep web)
    • query routing (federated db’s)
  • information extraction

companies

  • BI
    • vectorwise: orig monetdb; from CWI, holland; led by marten kersten
      • partnered with ingres
      • shared-everything (single machine) multicore
      • open-source column store (actually row/col hybrid)
      • query operators are run via query execution primitives written in low-level code that allow compilers to produce extremely efficient processing instrs
      • vectors of 100-1000 values get SIMD’d
      • founders from CWI
      • 3GB/s decompression, 4-5 cycles/tuple, 3-4X ratio
      • operate in decompressed domain
      • shared scans
    • kickfire/C2: entered with TPC-H win
    • sybase iq: data mining
      • scale-out; C++ UDFs; embedded data mining/predictive analytics lib (modeling & scoring)
    • sas: has own internal db
    • teradata
    • XSPRADA/Algebraix
    • paraccel
    • kognitio
    • enterprisedb: postgresql company
      • major contributor to postgresql
      • postgresql plus standard server: extensions, eg connectors, geospatial, replication, caching
      • postgresql plus advanced server: deeper oracle compat; infinite cache
    • greenplum: postgresql based; acquired by emc
    • aster data: customers: myspace; hadoop interop; COTS; investors: sequoia
      • ncluster: row-based MPP DBMS based on PG; supports map-reduce
    • calpont’s infinidb: mysql + columnar storage engine (MPP, dict/token compression, multicore queries, ACID, no indexes/mat views, “extends” like netezza’s zone maps)
      • tried this out since it seemed like most promising free c-store; suffers from mysql row size limit (no large tables, which is common in OLAP!)
      • varchar 8000 max
    • infobright: wants multicore queries, ACID
      • also column store, competes with infinidb
      • tried this out; not the fastest but pretty good sql support out of box
      • no CREATE TABLE AS
    • luciddb, monetdb: supposed to be slower/less mature than infinidb/infobright
    • netezza: row-based db hw with compression, clustering; acquired by ibm
      • compression: prefix compression, huffman coding
      • clustering: multi-dimensional simultaneous partitioning w space-filling hilbert curve
      • twinfin: well-designed proprietary hw appliance
      • twinfini: parallel
    • datallegro/ms
    • xtremedata: plugs into cpu socket for cpu bus bandwidth
    • groovy sql switch: mem oltp rdbms, appliance & sw
    • vertica: column store
      • stonebraker
      • h-store
  • main-memory
    • ibm soliddb: uses tries instead of btrees for indexes
    • oracle timesten
    • scaledb: mysql clustering; uses tries instead of btrees for indexes; better in-mem perf, more compact
  • distributed
    • continuent tungsten replicator: DB-neutral master/slave async replication
  • elephants
    • MS: sql server
      • shared scans
    • oracle/sun: mysql
      • oracle exadata: analytical
    • postgresql
    • hp
      • hp neoview: analytical
    • ibm
  • emc
  • marklogic: XML DB with XQuery as primary QL

people

  • dave dewitt: uwisc prof; started msr wisc

misc

  • XA transactions: open group specification for common dtxn standard

denormalization: simplex/complex reads vs writes

optimizations TODO

  • from ormperf: caching, outer-join fetching, transactional write-behind, etc.

applications

  • on-line transaction processing (OLTP): many small updates
  • on-line analytical processing (OLAP) aka data warehousing: ad-hoc, long-running queries
  • decision support systems
  • business intelligence (BI)
  • data mining

OLAP

  • multi dimensional expressions (MDX): lang for querying OLAP cubes
    • resembles SQL on surface; expresses things that may be clumsy in SQL
    • basically like pivot tables over RDBMSs
  • closely related: cubes, pivot tables, contingency tables

benchmarks

  • TPC-C: OLTP
  • TPC-E: new OLTP (TODO: does anyone use this? what’s different?)
  • TPC-H: OLAP

indexes

  • primary vs. secondary
    • primary: no duplicates, usu. primary key, may store rec directly in index (otherwise point to rec using rid)
    • secondary: support duplicates, points to rec using rid or primary key
  • clustered vs. unclustered
    • TODO

joins

  • bitmap join indices
    • build bitmap index over low cardinality outer table R (at least on join key)
    • map into this while scanning over S
  • simple hash join
    • build hash table of R in memory, and probe into it with S
    • on overflow, start spilling to file R’; S similarly spills to S’
      • a new hash function h’ is used to determine whether incoming R tuples should be sent to the hash table or to disk
      • puzzle: should you just stop reading and perform joins with what you have so far (i.e. potentially many keys, but not all tuples with that key are necessarily in the table)? or should you evict things to make space for all the tuples belonging to each key in the hash table (i.e. have fewer keys)?
        • the former degenerates into O(n^2) nested loop joins - worse because you can’t start to throw out tuples of S at any point
        • so you’ll need to start making room for the incoming tuples that h’ says should go in memory - but now you need to choose an eviction policy (probably unimportant since it’s never discussed - or I’m just misunderstanding something)
    • recursively (iteratively?) repeat on the overflow files R’ and S’
  • grace hash join (assume |R| < |S|)
    • phase 1: partition R and S into buckets R_i and S_i (all spilled onto disk)
      • hash function dictates partitioning; same for R, S so R_i need only be joined with S_i
      • if ~B buffer pages, can have at most ~B partitions (1 output buffer each)
    • phase 2: for each R_i, build hash table, then probe S_i against this
      • if R_i is too big and doesn’t fit in memory, overflow
        • can recursively partition R_i back onto disk into sub-partitions
  • hybrid hash join
    • uses same partitioning, but instead of using all of memory for output buffers, use part of it for the first bucket’s hashtable (i.e. build hash table for bucket 1 immediately)
    • usually you want to make sure you have enough buckets so that each bucket will subsequently be able to fit in memory; if this is the primary goal, then the immediate hash-tabling of bucket 1 is opportunistic (using whatever pages remain)
    • makes sense when you have larger memory (N buffer pages >> B buckets)

logging and recovery

  • buffer management policies
    • steal: steal frames with dirty pages
      • thus dirty pages are written (“swapped”) to disk
      • faster but complicates recovery
    • force: force all updates to disk before commit
      • thus committed pages might not be on disk
      • slower but helps recovery

        no ste al steal
        no force no undo, redo undo, redo
        force no undo, no redo undo, no redo
  • write-ahead logging (WAL)
    • undo info: force log record to disk for an (uncommitted) update before corresponding data page is written to disk (swapped/stolen)
    • redo info: force log record to disk for a xact before commit
  • log seq num (LSN): increasing ID for each log record
    • each data page contains a page LSN, the LSN of the last log record for an update to that page
    • system keeps track of flushed LSN, the max LSN flushed so far
  • technique to ensure coherent page written to disk: include in commit marker the CRC of the data
  • WAL: before page i written to disk, log must satisfy pageLSN \le flushedLSN
  • log records
    • fields: LSN, prevLSN, XID, type, [pageID, len, offset, beforeImg, afterImg]
    • prevLSN: LSN of last record in this xact
    • type: update, commit, abort, checkpoint (for log maintenance), compensation (for undo), end (end of commit/abort)
  • in-memory tables
    • xact table: XID, status (running/committing/aborting), lastLSN (last written by xact)
    • dirty page table: pageID (entry exists for each dirty page), recLSN (log record which first dirtied the page)
  • checkpoint periodically for faster recovery

B+ trees TODO

  • B-link-trees: sibling links are for concurrency, as a “temp repair bridge” while parent nodes are being rebalanced

LSM trees

  • uses log-structured merge (LSM) trees to create full DB replicas using purely sequential IO
  • LSM-trees don’t become fragmented; provide fast, predictable index scans
  • LSM-tree lookup perf comparable to B-tree lookups
  • good for many (amortized) random writes; bad for random reads (need to hit multiple components)

mysql

  • engines
    • myisam
    • innodb
    • percona’s xtradb: innodb replacement that scales better to mcores and better utilizes mem
    • mariadb’s maria: evolve myisam toward innodb; adds txns, crash safety, etc
  • forks
    • mariadb: founder’s community-oriented fork; dissatisfied with sun procedure
    • drizzle: complete refactoring/slimming by brian aker, long-time mysql dev

innodb

  • index-organized tables
    • full row data in PK index; saves space
    • 2ary indexes refer to PK value, not phys loc (can change due to splitting)
    • can’t add/drop/change PK def’n without full-table rewrite
    • can’t scan in physical page order; must scan in key order
  • supports index-only scans; covering indexes run faster
  • MVCC
    • updates copy old row to rollback segment first
    • deletes mark row for deletion, insert info into rollback segment too
    • “purge” GC can quickly determine what to discard from rollback segment
    • purge is single-threaded
    • not sure how deleted space is reclaimed/reused/kept around (doesn’t info about it get removed from the rollback segment by the purge process?)

sqlite

  • concurrency
    • one global lock
    • multi-process concurrency: only whole-database locks
      • unlocked, shared, reserved, pending, exclusive
        • reserved: intend to write (upgrade to exclusive)
        • pending: waiting to upgrade to exclusive; prevents new locks
      • isolation levels: eg BEGIN DEFERRED
        • deferred: acquire only on first query
        • immediate: acquire reserved lock right away; reads can still proceed
        • exclusive: acquire exclusive lock right away
    • shared-cache concurrency (disabled by default)
      • supports multiple cnxns to shared cache; 1 txn per cnxn at a time
      • 3 levels of locking
        • txn-level locking: at most 1 write txn at a time; txn implicitly becomes writer on first INSERT/UPDATE/DELETE; coexist with any readers
        • table-level locking: reader and writer locks
        • schema-level locking: on sqlite_master catalog table; some special rules
      • isolation levels
        • default isolation level is (actually) serializable, since <=1 writer per db at a time
        • read uncommitted: cnxn doesn’t take read locks; no blocking
    • features
      • heavy testing: 100% branch coverage; fault injection (malloc, IO); virtual file system snapshots to simulate crashes
      • virtual table: you can export a table interface and provide callback implementations for events on this table
  • storage
    • data stored directly in btree

sql server 2008

  • integration services: ETL tools
  • reporting services
  • analysis services: BI tools
  • powerpivot for excel/sharepoint makes report designing easier/interactive
  • full text search engine
  • r2
    • streaminsight: CEP
    • master data management: dealing with multiple DBs/sources
    • reporting services integration w sharepoint CMS: publish olap analytics

postgresql

  • table data and PK index are separate
    • can scan in physical page order; fast
    • supports changing PKs wo full-table rewrite; even allows concurrent writes
  • no index-only scans yet; covering indexes don’t help
    • adding more cols to indexes slows things down
  • MVCC: on update, old row left in place, new row quickly inserted
    • VACUUM must scan all data as a result
    • VACUUM is multi-threaded
  • users: disqus, skype, debian, omniti

postgresql vs mysql

oracle

  • table data and PK index can be same (mysql) or sep (pg)

distributed postgresql

  • postgres-r: async master-master replication
  • pgcluster: sync multi-master replication
  • pgpool-II: sync

distributed mysql

  • mysql cluster/NDB (network database)
    • node types
      • mgmt node: config server, start/stop nodes, run backup, etc
      • data node: number of data nodes = replicas * fragments
      • sql node or api node
    • database
      • partition: a fragment of the database; these are replicated
      • replica: a copy of a partition
      • hash-partitioned on table primary keys
      • stored in mem or on disk asynchronously
    • organization
      • data node: manages a replica
      • node group: manages a partition
    • checkpoints
      • local: save all data to disk; every few mins
      • global: txns for all nodes synced; redo log flushed; every few secs
    • system
      • replication may be sync or async
      • 2PC
      • no referential integrity
      • no txn isolation above read committed, so dtxns are naive
      • replication via mysql replication (replicate entire cluster)
  • mysql replication
    • system
      • async replication or semi-sync replication (wait for receipt; contributed by google in 2007)
      • supports multi-master
      • simple “greatest timestamp wins” conflict resolution
      • no auto failover
    • formats
      • statement-based replication: stream the queries
      • row-based replication: stream the updates
      • mixed format: default; usu statements, but rows for things like UUIDs
  • mysql multi-master replication manager (MMM)
    • async replication with auto failover
    • also: flipper, tungsten

clustering

blenddb

  • colocate related tuples on same pages in navigational queries
    • eg web page for a movie: actors >< movies >< directors; movies -- ratings
    • eg further navigate into
    • construct paths starting from prim key into one table and following foreign key joins to other tables; eg movie.id -> acted.movieid -> actor.id, movie.id -> directed.movieid -> director.id, movie.id -> ratings.movieid
    • pull in all tuples read by those paths
    • clustering is an offline process
      • optimization i didn’t try to fully understand: prioritize path heads by how frequently they spill over one page (don’t waste space on path sets that are too big anyway and require multiple seeks?)
    • faster than performing joins online and less space than materialized views
  • related work
    • (stochastic) clustering in OODBMSs: could use their algos here, but they’re about migration not replication
    • merged indices of multiple cols/tables: can cluster on this, but that only benefits joins on a single key, i.e. no navigation
    • memcached: blenddb is not a cache; prepped for any data
    • multi-table indices, eg oracle clusters: crappy implementation of merged indices that requires sep keys to be on sep pages
  • first work on clustering in RDBMS

RDBMSs

amazon relational database as a service (RDS)

  • mysql 5.1
  • synchronous wan repl (multi-az)
  • TODO how does it scale?

OLTP Through the Looking Glass, and What We Found There (stavros, dna, madden, stonenbraker)

  • lots of CPU overhead in conventional RDBMS
  • removed features from Shore to get to main memory DB
    • Shore uses user level non-preemptive threads with IO processes
    • [hence latching overheads not from multiprocessing; unclear why latching is so impactful or even prevalent]
  • went from 640 to 12700 TPS; down to 6.8% of orig work
  • removed in order:
    • 35% buffer mgr
    • 14% latching: many structs
    • 16% locking: 2PL lock mgr
    • 12% WAL: building records; pinning records; IO
    • 16% hand-coded optimizations: tuned B trees
  • “every reason to believe that many OLTP systems already fit or will soon fit into main memory”
    • [but no cost benefit analysis; james hamilton says even move to SSDs is not worth it, contrary to myspace’s claims]

scientific databases

general

  • projects
  • extremely large database (xldb): workshop started by jacek becla of slac
    • 55 PB raw images, 100 PB astronomical data for LSST
    • involvement
      • ppl: stonebraker, dewitt, kersten
      • vendors: teradata, greenplum, cloudera
      • companies: ebay, web companies
  • scidb: stonebraker, dewitt
    • A data model based on multidimensional arrays, not sets of tuples
    • A storage model based on versions and not update in place
    • Built-in support for provenance (lineage), workflows, and uncertainty
    • Scalability to 100s of petabytes and 1,000s of nodes with high degrees of tolerance to failures (including query fault tolerance)
    • Support for “external” objects so that data sets can be queried and manipulated without ever having to be loaded into the database
    • Open source in order to foster a community of contributors and to insure that data is never “locked up”—a critical requirement for scientists
  • refs

scientifica (DB group meeting talk, Mike Stonebraker, 4/24/08)

  • multi-dimensional array-based
    • interesting question is storage manager: how to partition the data (and then in each partition, how to chunk it)
  • uncertainty
    • minimal, since error analysis is different for each application
    • use error bars
  • lineage, provenance
    • need also to remember the derivation process, not just origin
  • interest from various scientific institutions but also amazon, google, etc.
  • CERN had planned to use Objectivity

Efficient Provenance Storage

  • main problem: compressing binary xml
  • intro
    • fields: science, law, etc.
    • dataset 270mb, provenance store 6gb

parallel/distributed olap

“query evaluation techniques for large databases” TODO

HARBOR (Lau, VLDB06)

  • intro
    • integrate mechanisms for recovery and high availability
    • take advantage of data redundancy in replicated dbms
    • features
      • no logging
      • recover without quiescing
  • traditional approaches
    • logging
  • approach overview
    • data warehousing techniques (assumptions)
      • data replication: logical (phys unnec); helps perf anyway
      • snapshot isolation
        • large ad-hoc query workloads over large read data sets intermingled with a smaller number of OLTP transactions
        • using snapshot isolation avoids lock contention (TODO is this a requirement?)
        • harbor uses time travel mechanism similar to snapshot isolation
        • historical queries of past can proceed without locks since past data never changes
    • fault tolerance model
      • k-safety: requires k+1, k may fail
      • no network partitions, no BFT
    • historical queries
      • assume T < now, so no locks needed (result set don’t change due to updates/deletes after T)
    • recovery approach
      • checkpoints: flush dirty pages and record current time T
      • after coming back up, execute historical query on other live sites that replicate your data; this catches you up to a point closer to present
      • execute standard non-hist query to catch up to present; this uses read lock, but is shorter than the prev query
  • query execution
    • recovery query is a range query on T, so break up data into time-partition segments; means normal queries will need to merge from all the segments
    • commit processing
      • 2pc with changes
        • commits include commit times (for modified tuples)
        • in-mem lists of modified tuple ids can be deleted on commit/abort
      • opt 2pc: no need for logging/disk writes, except coord
      • opt 3pc: no need for logging/disk writes, including coord (persist state onto workers, basically); non-blocking (can establish timeout)
  • recovery
    • query remote, online sites for missing updates
  • evaluation
    • compare against 2pc and aries

A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment (DeWitt, SIGMOD ’89)

  • tried out some hash joins in gamma
  • simple hash join
    • partition tuples among the join nodes (via a split table)
    • overflow possible at any of the join nodes; overflow handling is same as in centralized version, i.e., each node has its own overflow h’ and R’
    • the split table is augmented with h’, so S tuples are directed to either the correct join node, or to S’
  • grace hash join
    • each partition (bucket) is in turn partitioned across all partition nodes
    • then, for each bucket, its tuples are partitioned across the join nodes, where the hash table is constructed
    • hence, all partition nodes talk to all join nodes
  • hybrid hash join
    • the bucket-1 tuples go straight to some join nodes, and the rest are spilled to the partition nodes
    • it seems in their system that join nodes are diskless, so that’s all they can do - but if the set of join nodes overlaps with the set of partition nodes, then you can imagine immediately processing a number of buckets immediately (as many as there are nodes - the immediate hash table will just consume part of the memory of each of the join nodes)

chained declustering (dewitt 92)

  • problem space: distributed data placement (replication, clustering, etc.) strategies for high availability, reliability in shared nothing dbs
  • compared against:
    • tandem: mirrored disks
      • each partition mirrored on 2 disks that are each connected to same 2 controllers that are each connected to same 2 cpu’s
      • on cpu failure, cpu 2 has to support the load of parts 1 & 2
      • [silly: the problem with this comparison is that it’s comparing a shared-nothing system against a shared-something system]
        • equivalent shared nothing system is one where you treat each disk pair as the single partition they’re storing, and replicate each such partition onto 2 machines
        • and that would be obv. stupid
    • teradata
      • each partition has primary and backup copies
      • in n-node system, fragment the backup copy into n-1 pieces scattered over the other nodes
      • good innate load redistribution on failure
      • but expensive writes in normal operation
    • RAID5: maintain (parity) bytes
      • even a blind write of one sector of a block requires 4 accesses (1 extra “round”): read all 3 other disks then write parity
  • chained declustering
    • eg
      • node 1: part 1 primary + part n backup
      • node 2: part 2 primary + part 1 backup
      • node 3: part 3 primary + part 2 backup
      • etc
    • on failure, evenly redistribute the partition over all other hosts
    • in larger systems, don’t spread across all nodes; partition up system (called “relation clusters”)
  • [this was not designed for high scaling distributed systems]
    • assuming indep failures
    • back to inefficient system if nodes i and i+2 fail
    • compared against weird old architectures, not giant commodity pc clusters in datacenters

hybrid-range partitioning strategy: a new declustering strategy for multiprocessor database machines (dewitt, vldb90)

  • partitionings in gamma: round-robin, range, hash
    • for small range queries, range partitioning can localize execution to only relevant processors and is faster
    • for large range queries consuming significant CPU/IO resources, hash/RR can parallelize, lowering response time
  • hybrid-range partitioning strategy (HRPS): do query analysis to get optimal degree of intra-query parallelism, balancing range and hash/RR
    • fragments contain FC tuples and fragments contain unique range of values of partitioning attr
    • collect some stats on avg query characteristics, plugging into formulas for first derivatives to optimize (e.g.) # servers to distributed across
  • The hybrid-range partitioning strategy is an alternative to hash- and range-partitioning that uses query analysis to compute the typical resource consumption requirements of queries and from this determines the optimal partition size for range-partitioning. The scheme attempts to reduce resource consumption by localizing small range queries while declustering and parallelizing long-running range queries; however, HRPS is only applicable to range queries.

adaptively parallelizing distributed range queries (adam silberstein, brian cooper, vldb09)

  • problem space: parallelizing range queries
    • traditionally, lay out data to achieve highest throughput/max parallelism
    • in high-scaling systems like PNUTS/bigtable, maximizing server parallelism produces too much throughput for 1 client
  • assumption: tables are range-partitioned
  • proposal: adaptive approach
    • step 1: adaptive server allocation
      • find ideal parallelism for single query execution
      • depends on: query selectivity, client load, client-server BW, etc
      • min # servers K_q for query q is adjusted as scanning proceeds
    • step 2: multi-query scheduling
      • even after knowing K_q, must avoid contention (assigning too many queries to same server simultaneously)
      • num concurrent scans on a server to avoid random IO can result in faster execution for all queries
      • minimize disk contention and ensure all queries get good perf
      • max # queries L_s on server s
  • tunable: can favor short vs long queries or high vs low priority
  • implemented & evaluated in PNUTS

gamma (project dewitt led 8x-92)

  • shared-nothing dbms; horizontal partitioning
  • implemented and evaluated parallel joins: simple hash, grace hash, hybrid hash, sort-merge

sharing

  • shared memory
  • shared disk
  • shared nothing
  • the first two are relics anyway

why main memory is bad for olap (DB group meeting talk, Mike Stonebraker, 4/24/08)

  • compared: Vertica with disk, Vertica with ramdisk, and Vertica WOS only
    • seek times disappear; a few disks can satisfy bandwidth (even for several cores)
    • absence of indexes contributes 2x
    • compression contributes 7x
  • SAP will compare their TREX against Vertica

distributed dbms

building a database on s3

rose: compressed, log-structured replication (russell sears, VLDB08)

  • ROSE: replication oriented storage engine
    • storage engine for high-tput replication
    • targets seek-limited, write-intensive workloads that do near-real-time OLAP
    • target usage in replication: replication log (of actual data) comes streaming into replicas, which are fed as input to ROSE (which itself just uses STASIS for logging LSM tree ops)
    • although targeting replication, provides high-tput writes to any app that updates tuples without reading existing data, eg append-only, streaming, and versioning DBMSs
    • write perf depends on replicas’ ability to do writes without looking up old values; reads are expensive (esp. of deleted values)
  • maintains multiple versions/snapshots for each item, supporting eg OCC, MVCC, and more
    • deletion: tombstone insertion
  • introduce page compression format that takes advantage of LSM-tree’s sequential, sorted data layout
    • increases replication tput by reducing sequential IO
    • enables efficient tree lookups by supporting small page sizes & doubling as an index of the values it stores
    • any scheme that can compress data in a single pass and provide random access to compressed values could be used by rose
  • replication envs have multiple readers, 1 writer; thus rose gets atomicity, consistency, isolation to concurrent txns without resorting to rollbacks, blocking index reqs or interfering with maintenance tasks
  • rose avoids random IO during replication and scans
    • leaves more IO capacity for queries than existing systems
    • provides scalable, real-time replication of seek-bound workloads
  • analytical models and experiments: OOMs greater replication bandwidth
  • column-store within pages (a la PAX)
    • within each column, doesn’t sort by column values; just stores in tuple slot id order
    • adds overhead to lookups
    • appends: append to each col; buffered staging space

OLTP

the end of an architectural era (it’s time for a complete rewrite) (stonebraker, madden, dna, stavros, nabil, pat helland)

  • features
    • main memory
      • “vast majority of OLTPs <1 TB in size, and growing slowly”
      • [implies you’re throughput bound vs. capacity bound]
      • timesten, soliddb: inherit RDBMS baggage of System R, incl. disk-based recovery log, dynamic locking
    • single-threading vs. multi-threading and resource control (sync)
      • OLTP txns lightweight
      • current systems interleave CPU and IO via synchronization
    • elasticity vs. fork-lift upgrades
    • high avail through shared-nothing replication
    • no knobs
  • txn and schema characteristics
    • OLTP tend to have highly partitionable tree schemas [think entity groups]
      • [unsubstantiated; I do believe a more cyclic reasoning: high-scaling apps already are partitionable]
    • constrained-tree application: txns run within 1 partition
    • one-shot txn: can be executed in parallel without requiring intermed results to be serialized; still needs 2PC; can be done via vert part
    • also: two-phase, strongly two-phase, commutativity
  • system architecture
    • all txns are stored procedures; runtime in same thread as DB
    • distributed txn mgmt: single-sited, one-shot, general
    • replication and recovery
    • DB designer: make as many txn classes as possible single sited
      • [this is focused on static property of txn classes; I would like to focus on dynamic property of txn instances, i.e. individual objects]
  • [lot of faith in TPC-C]

h-store concurrency control (evan jones)

  • meat is speculative execution with multi-partition tweak
  • local speculation scenario
    • say you have the following txns, where x=2 is on partition p and y=10 is on q:
      • txn A: swap x, y
        • this is a multi-partition txn
        • this is multi-fragment on p
      • txn B1: x++
      • txn B2: x++
      • end result should be x=12, y=2 OR x=11, y=3 OR x=10, y=4
    • timeline: say we execute in A, B1, B2 order
      • read x and read y, sending to coord
      • cannot yet begin speculation of B1/B2, since A left inconsistent state
        • ow, second fragment of A will overwrite B1/B2’s effects
        • end result will be x=10, y=2; not serializable
      • locking or occ can “execute” B1, but they have overhead, and in this (conflicting) case, they will be blocked
      • each partition gets second fragment of A from coord
      • each partition sends ack for A, waits to commit (2PC)
      • while waiting, can start speculating B1/B2
      • must queue results; release only once no longer speculative
        • first fragment of a multi-part txn can always be speculated like this
        • can do better for speculating these first fragments of multi-part txns if using central coord and coord is aware of speculative txns; see next sec
      • must keep undo buffers; must undo/re-exec B1/B2 iff A aborts
  • if you have one coord for a batch of multi-partition txns…
    • A, B1, C, B2, where C increments both x,y; note C is multi-part single-frag
    • everything goes as before up till B1 speculation
    • because A and C have same coordinator, q can return the result to its portion of C, along with indication that result is speculating on A
      • this is the end of all of C; it’s a speculative ack
      • after committing A, coord can immediately commit C
    • p can speculate its fragment of C, but like B1, cannot send this result, because coord is unaware of prior B1 txn, since single-partition txns do’t go through the coord
    • where does this help?
      • allows a sequence of multi-part txns, each composed of a single frag at each part, to be executed without blocking; call these simple multi-part txns
      • these are common, e.g. updates to highly replicated read-mostly tables
      • e.g. updates based on a non-partition column need to search all partitions
      • all distributed txns in TPC-C are simple multi-part txns
    • limitations
      • only applicable after executing last fragment of a multi-part txns, so not useful for non-simple txns
      • require same coord; on abort, need to be able to cascade aborts
        • might be able to avert this by assigning txns a global order
  • speculation is on the success of a 2PC
    • unlike concurrency control, it assumes all txns conflict

voltdb

  • 1.0
    • server-compiled java stored procedures
    • specify partitioning columns for each table
    • k-safety; not viewstamped replication; static config
    • supports DB-wide replicated (read-mostly) tables
    • no concurrency control; strictly sequential operation
    • statements can’t do joins on 2 partitioned tables
    • supports aggs
    • periodic/continuous snapshots to disk
    • 2PC for dtxns?
    • early beta w 150 customers featured bunch of web gaming companies and bunch of use cases resembling stream processing

percolator (daniel peng, frank dabek, google, osdi10)

  • btwn DBMS & mapreduce: online processing that tolerates latency
  • latency comes from conflicts
  • built on bigtable, chubby
  • no blocking - on conflict, abort
  • locks taken at the sites
  • one of the locked sites designated as primary
  • ACID snapshot isolation (write-write conflicts only; reads are not monitored)
  • centralized timestamp oracle hands out timestamps; can reply to 2B req/s
    • allows others to run their own systems
  • solves blockage on 2PC coordinator failure: after some time, try to undo the primary lock
    • the primary lock site is “the” commit site
    • bigtable has row txns, which allows for atomic test-and-set to rollback or commit txns at the primary lock site
  • abort & backoff if see a lock w any timestamp or a write w more recent timestamp than current txn timestamp
  • observers run on storage servers; multiple changes may coalesce into one observer run
    • implemented via notifications in special bigtable column group/column
  • slower than MR due to excessive RPCs, but major engineering gains
    • lock operations made faster by adding special RPC to bigtable
  • app: incremental indexing

haystack (facebook, osdi10)

  • facebook photo store
  • replaces NFS NAS solution that choked on sheer # files and excessive disk seeks
  • photos stored in 100GB append-only physical volumes (files) that are replicas of logical volumes
  • lookup operation
    1. get metadata/location-specifying URL from the metadata server (haystack directory), which balances load across physical volumes
    2. browser issues img requests to the CDN, which fwds to haystack, or directly to haystack
    3. lookup in haystack cache (DHT, basically a CDN)
    4. haystack store looks up some position info in in-memory index and grabs the photo in 1 seek
    5. put into haystack cache iff requesting directly (CDN can cache) & requesting from a write-enabled store (recently-written photos are most-read)
  • index has filename, offset, and size for each photo ID
  • photo IDs have random number (cookie) to prevent guessing valid URLs
  • writes synchronously append to volume and asynchronously write index checkpoints for faster recovery (the volume is like the DB log)
    • batched uploads when possible; luckily many users upload whole albums
    • updates: latest ver in physical offset has highest offset; if diff physical vol, must update
    • deletes synchronously update physical volumes & index; compaction does GC
  • compactions: processes deletes & duplicate keys
  • occasionally, full reset needed of a node, which takes many hours
  • use XFS extent-based FS bc it has efficient prealloc & blockmaps that fit in mem

Data Integration

Indexing Dataspaces (Xin Dong, Alon Halevy, sigmod07)

  • INTRO
    • users explore and use predicate queries and neighborhood
    • keyword queries
    • traditionally: inverted list or index tree or XML
  • PROBLEM DEFINITION
    • INDEXING HETEROGENEOUS DATA
      • eg: bibtex, word, ppt, email, www, xls, xml, rdb
      • triples: (instance, attr, val) or (inst, assoc, inst) (directional)
      • how to extract from data sources to this model is a separate topic
      • synonyms among attrs & assocs
      • hierarchies: 2 types (don’t distinguish between them)
        • sub-property: eg father < parent, contactAuthor < author
        • sub-field: eg city < addr, firstName < name
          • [not about complex structures’ components, but see 4.4]
          • [what happens when name is a unique inst? question of extraction]
    • QUERYING HETEROGENEOUS DATA
      • predicate queries
        • attr pred: matches on direct attr/sub-attr
        • assoc preds: matches instances that have a (sub-)association with another instance that has any matching attr
        • [sub-attr/sub-assoc: either sub-property or sub-attr]
        • [(firstName ‘Yang’) won’t match an obj with name ‘Yang’]
      • neighborhood keyword queries
        • relevant inst: has direct attr
        • associated inst: assoc with a relevant inst
      • user need not have knowledge of exact schema
      • ranking is a separate issue
      • [both queries limited to 2 degrees]
    • INVERTED LISTS
      • store occurrence counts
  • INDEXING STRUCTURE
    • INDEXING ATTRS
      • attr inverted lists (ATIL): inverted list on (keyword//attr//)
    • INDEXING ASSOCS
      • attr-assoc inverted list (AAIL): ATIL + (keyword//assoc//) entries too
      • [doesn’t count occurrences within each assoc inst]
      • easy extension to support k-ary assocs
  • INDEXING HIERARCHIES
    • INDEX WITH DUPLICATION
      • dup-ATIL: duplicate (eg Yang//firstName//, Yang//name//)
      • increases index size
    • INDEX WITH HIERARCHY PATH
      • hier-ATIL: eg Yang//name//firstName//
      • use row-wise deltas to minimize space overhead
      • prefix serach more expensive
    • HYBRID INDEX
      • summary rows shadow other rows and end in additional //
      • [need to skip the shadowed rows somehow - easy problem]
      • start with hier-ATIL; add summary row for prefix p if count of other rows with prefix p exceeds threshold
      • [what if: a/a/{a,b,c/{a,b,c}}?]
    • SCHEMA-LEVEL SYNONYMS
      • assoc in one src = attr in another: naturally handled
      • term heterogeneity: have table mapping sets of synonyms to canonical names, and use this replacement in index and queries
    • NEIGHBORHOOD KEYWORD QUERIES
      • keyword inverted list (KIL): hybrid-AAIL, but also summarize prefixes that correspond directly to keywords, i.e. k//
  • EXPERIMENTAL EVALUATION

functiondb - arvindt

bootstrapping pay as you go - alon halevy

pay as you go - alon halevy

oltp ttlg

uncovering the relational web

column-stores vs. row-stores

column stores

general

  • column stores only recently gained traction because there’s enough memory for fetching large ranges of blocks

hybrid approaches

  • PAX (Weaving Relations for Cache Performance by Natassa Ailamaki, David DeWitt, Mark Hill, and Marios Skounakis, VLDB 2001)
    • store data by column within disk block
    • pros
      • CPU efficiency of C-stores
      • improved cache hit ratio & mem bandwidth
      • better compression
      • easy to implement in row-store
    • cons
      • IO properties of row-stores
      • cache prefetching on modern CPUs makes PAX cache efficiency obsolete
  • fractured mirrors (A Case for Fractured Mirrors by Ravi Ramamurthy, David DeWitt, and Qi Su, VLDB 2002)
    • replicate multiple ways; c-stores are OOM faster than row-stores and vice-versa
    • cons
      • may need higher degrees of replication
      • harder impl: need full c-store & row-store exec engines & query optimizers
      • writes can be a problem
  • fine-grained hybrids (Cudre-Mauroux, Wu, and Madden, CIDR 2009)
    • individual tables can be row- or col-oriented; declarative storage
    • pros: get perf advantages of fractured mirrors without additional replication
    • cons: complex to implement; may be easier to start with c-store since that already supports early materialization and so requires operators be able to handle input in both column and row-oriented format
    • vertica
  • http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html

distributed olap

hadoopdb

  • hadoop scalability while faster on structured data workloads
  • hadoop coordinates/exports from hdfs; dbms’s execute; allows fast sql
  • currently using pg; slower than db2; may use vectorwise
  • pg significantly outperforms hadoop on structured data
  • benefits vs PDBMSs
    • open/cheap/free
    • query fault-tolerance
    • tolerate node degradation (eg stragglers, not outright failures)

greenplum

  • postgresql based
  • table storage options
    • traditional pg
    • row-oriented append-only: compressible, scan-optimized
    • col-oriented append-only (storage-level-only solution)
    • treat external source as relational
  • compression: LZ, gzip
  • no need for joins to get multi columns
  • no: columnwise compression options, in-mem compression, late materialization
  • greenplum chorus: some new kind of ETL

teradata

  • major customers, 10/2008
    • ebay: 5 PB
    • walmart: 2.5 PB
    • bofa: 1.5 PB
    • anon financial services co: 1.4 PB
    • dell: 1 PB

alternative DBs

nimbusdb (jim starkey, ieee/acm talk at mit, 3/18/10)

  • consistency is overloaded
  • serializability: sufficient for consistency, but not necessary
    • expensive; “almost serializable” is useless (no guarantees and all cost)
    • serializable -> sequential txn order -> at every point, db has definitive state
  • keeps relating to relativity
    • time is seq of events, not just clock
    • comm requires latency
    • 2 nodes just can’t see same events in same order
    • not a bug, but a fact of life
  • mvcc/si: alternative to serializability
    • row updates create new versions pointing to old versions
    • each version tagged with creating txn
    • a txn sees a version consistent w when it started
    • a txn can’t update a version it can’t see
    • each txn sees stable, consistent state
  • nimbusdb: elastic, scalable, acid, sql rdbms
    • modest goals:
      • very high perf in data center
      • high perf geographically dispersed
      • fault tol: sw, hw, geological
    • less modest goals:
      • put DBAs outta work: 0 admin; dynamic, self-tuning
      • arbitrary redundancy; multi-tenant; of, for, in the cloud
  • chorus: set of nodes that instantiate a db
    • txn nodes: do sql
    • archive nodes: maintain a persistent disk archive; have copy of entire db
  • a nimbus db composed of distributed objs called atoms
    • can be serialized to net/disk
    • can be on any # of chorus nodes
    • all insts of atom know of each other
    • atoms replicate p2p
    • every atom has a chairman node: enforces consistency
      • deterministically chosen; if chairman disappears, everyone knows who next one is
    • [partition/replica group]
  • examples of types of nimbus atoms
    • txn mgr: start/end txns
    • table: metadata for a relational table
    • data: container for user data
    • catalog: tracks atom locations
  • comm is fully connected, async, ordered, batched
  • messaging
    • most data is archival and inactive
    • small fraction is active but stable (eg what was best seller 2 yrs ago)
    • smaller fraction is volatile but local (customer’s shopping cart or account in active use)
    • even less data is volatile and global
    • replicate only to those who care
  • nodes are autonomous: choose where to get an atom, which atoms to keep, which to drop
  • txn control
    • record version based
    • txn sees only the results of txns reported committed when & where it started
    • consistency maintained by atom chairman
      • eg can’t insert dupe into unique index
      • eg can’t update record 42 if another txn updating it
    • atom updates bcast repl msgs; repl msgs precede commit msg
  • nimbusdb is relativistic
    • db/consistency viewed only thru txns; no single definitive db state
    • nodes may differ due to msg skew
  • archive nodes provide durability
    • see all atom updates followed by a pre-commit msg
    • bcast the actual commit
    • txn nodes retain ‘dirty’ atoms till an archive node reports the atoms archived
    • multiple archive nodes provide redundancy
  • txn nodes provide scalability
    • any txn node can do anything
    • conn key is client session #
    • conn broker can give effect of sharding
    • txn nodes tend to request atoms from local nodes
    • data dynamically trends toward locality
  • misc features
    • semantic extensions eg structural inheritance
    • unbounded strs, unbounded nums
    • all metadata is dynamic
    • chorus members platform indep, rolling sw updates, 24/365
  • network partitions, CAP, nimbusdb
    • certain archive nodes are designated as commit agents (really, canaries)
    • subsets of commit agents form into coteries
      • subsets where no 2 are disjoint
    • a pre-commit must be received at least one commit agent in every coterie to commit
    • post partition, the partition that contains a coterie survives
  • [this talk did not convince me that this system would actually work; others felt similarly]

Alternative DBs

  • key‐value‐cache: memcached, repcached, coherence, infinispan, eXtreme scale, jboss cache, velocity, terracoqa
  • key‐value‐store: keyspace, flare, schema‐free, RAMCloud
  • eventually‐consistent key‐value‐store: dynamo, voldemort, Dynomite, SubRecord, Mo8onDb, Dovetaildb
  • ordered‐key‐value‐store: tokyo tyrant, lightcloud, NMDB, luxio, memcachedb, actord
  • data‐structures server: redis
  • tuple‐store: gigaspaces, coord, apache river
  • object database: ZopeDB, db4o, Shoal
  • document store: CouchDB, Mongo, Jackrabbit, XML Databases, ThruDB, CloudKit, Perservere, Riak Basho, Scalaris
  • wide columnar store: BigTable, Hbase, Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

hyder (phil bernstein, msr 2009)

  • modern scalable systems: abandon dtxns; eventual consistency; memcached
  • Hyder: oltp KV store on raw flash
    • single, reliable, distributed log accessed by all servers; “log is DB”
    • “no partitioning, all data avail to all”; servers can fetch from each others’ caches
    • DB is multi-versioned; server caches are trivially coherent
  • all homogeneous nodes (symmetric); app+db in-process
    • unified cache (no server/client/memcached)
    • no RPC
  • enabling hardware assumptions
    • Flash
      • 10^4 more IOPS/GB than HDD; higher perf but lower density
      • 4K pages: 100us to read, 200us to write, per chip
    • data center networks: RTTs <25us, 10GigE; so, many servers can share load
    • large mem: so, reduce rate that hyder needs to hit the log
    • manycore servers: so, maintain consistent views of DB
  • Flash
    • no mutates; must erase before writing
    • end up treating as sequential device
    • wear-leveling: multi-layer (MLC): ~10K erases; single-layer (SLC): ~100K
    • 12/09: 4GB MLC is $8.50; 1GB SLC is $6 (SLC ~= 3*MLC)
    • densities can double only 2-3 more times, but other non-vol memories (PCM, memristors)
  • Hyder stack: API (SQL/LINQ), OCC txn layer, indexed record layer, storage layer
  • storage layer
    • custom controller interface
    • log uses RAID erasure coding for reliability
  • txn
    • read a snapshot as cache of last committed db state
    • appends intention log record to log
    • record points to last committed record
    • lookups result in random access; unthinkable in disks but not in flash
    • OCC: detect conflicts (R/W sets) after appending, abort
    • broadcast intention at same time as appending intention
    • when appending, no idea where intention will land; append returns addr
    • all servers get a copy of this ack
  • indexed record layer: multi-versioned binary tree
    • think of whole DB as one big binary search tree
    • multiversioned binary search trees marshalled into log leaf-first (backptrs)
    • no update-in-place; must make new copy; shadow copies are wasteful
    • instead, just record the path, not actually the internal node contents
    • to avoid having every txn conflict with each other, distinguish between structural updates to internal nodes vs content updates
    • rationale: Flash space is costly
    • binary trees vs b-trees: b-tree nodes are large, so cheaper to update lots of small nodes
  • txn execution: get snapshot (ptr to root), generate updates locally, append intention log record
    • if there was an earlier conflicting update, cannot commit the change of this log entry
    • ergo entries that appear in log may not have been committed
    • key ingredient: log updates are bcast to all machines
    • but no global ordering
      • custom flash controller supports append-page() that returns addr where it stored the page
      • [could’ve alternatively had a txn manager sit in the controller that simply tells you whether to abort]
        • avoided this to minimize latency; longer-running txns increases conflicts in OCC
        • but now writing all this crap to flash (previously noted as “precious”)
    • a lot of effort spent on reliability of controller
  • every server maintains copy of log tail in memory
    • every server rolls fwd update records one by one
    • on each one, check if the txn actually committed
    • all servers will do the same things and make the same abort/commit decisions
  • bottlenecks
    • network: 15K update txns/s over 1 GigE; 150K update txns/s over 10 GigE
    • conflict detection & merge: every machine must roll fwd every op; up to 300K updates/s
    • root gets conflicting structural (physical) updates all the time, but db content actually doesn’t change conflictingly; they can be deterministically merged
    • aborts on write-hot data; currently takes 200us to write txn
  • major techs
    • flash is append-only: custom controller has mechanisms for sync & fault tol
    • storage is striped, with self-adaptive algo for storage alloc & load bal
    • fault-tol protocol for total ordering
    • fast algo for conflict detection & merging of intention records into last-committed state
  • [essentially building shared disk system]
    • one of criticisms is scalability: requires lots of cross-sectional bandwidth
    • answer: “predictable cross-sectional talk (bcasting as network was intended)”
    • answer?: cross-sectional bandwidth solved by fat trees

benchmarks

redis

  • non-volatile memcached that can also map to lists and sets with atomic push/pop ops
  • designed for in-memory workloads
  • can be persistent, either async or sync, but sync is currently slow
  • microbench: 110K sets/s, 81K gets/s on entry level linux box
  • impl in C; single-threaded event-driven
  • simple async master/slave replication out of the box

scalaris

  • in-mem erlang kv store w sync replication; multi-hop chord-style lookups; not much community/support

memcachedb

  • supports replication
  • thread pool

bitcask

  • simple backend introduced in 4/2010 by riak guys for riak
  • on-disk log files of CRC’d key-val pairs
  • in-mem keydir maps keys to log file/offset; is a complete mapping
  • reads req disk seek; rely on OS file cache
  • periodically merge/compact logs

tokyo cabinet

  • successor to GDBM, QDBM
  • features compactness, speed, parallelism, simple API, fault tol, 64-bit
  • C lib
  • much faster than BDB
    • 1M inserts/.4 s; 1M lookups/.33 s
  • record-level locking
  • txns: WAL and shadow paging [but apparently can disable this, since tyrant relies only on replication for durability]
  • backends
    • mem: string, array list dequeue, hash map, ordered tree
    • disk: hash, B+tree, fixed, column
      • hash: chaining with binary search
      • column: built on hash DB; col indexes built on B+ tree
  • lots of low-level opt/tuning, eg combining mmap and pwrite/pread syscalls
  • optional compression
  • tokyo dystopia: FTS engine; uses tokyo cabinet for inverted index

tokyo tyrant

  • server interface to tokyo cabinet
  • hot backup/update log
  • async replication; master-slaves or master-master (but no atomic conflict detection, only eventual)
  • speaks binary, memcached, HTTP
  • built-in ops: no-reply updates, multi-gets
  • lua extensibility for more ops (stored procs)
  • epoll/kqueue + thread pool scalability
  • 1M queries/17.2 s

chunkd

  • single-node storage daemon for project hail
  • similar to tokyo cabinet but with multi users and channel security
  • 2nd place to tokyo cabinet in perf

simple DBMs

  • DBM: simple database engine by ken thompson
  • CDB: constant database by dan j bernstein; super fast

berkeleydb

  • very slow for random writes into DBs of >100K rows
  • hard to configure

GT.M

  • old DB from 1986
    • OCC txns; non-distributed
    • users can also use locks
  • from wikipedia:

    GT.M is a high-performance schemaless database engine, optimized for transaction processing. GT.M is also an application development platform and a compiler for the ANSI/ISO standard M language, also known as MUMPS.

    GT.M compiles the M source code into the target machine language. These object files are dynamically linked into an image. GT.M databases are UNIX files identified by a file called Global directory. By changing the Global Directories, one can make the same program access different databases. Internal to these files, GT.M stores data in b-tree based multidimensional arrays (otherwise known as MUMPS globals), similarly to most modern M implementations.

  • MUMPS language predates C
  • M/DB is a SDB interface to GT.M

FluidDB

  • impl in Python
  • schemaless attrib-val pairs
  • attrib names are OWNER/NAME/NAME/…
  • each of read/update/create/see is open/closed has exceptions
  • basic data types
  • CQL-like query lang: has sara/rating and (tim/rating > 5 or mike/comment matches fantastic or mary/product-reviews/keywords contains "kids")

misc db

talk on mcore systems (nedbday08, anastasia)

  • total latency going up
  • larger aggregate l2 cache due to more l2 per core => higher-latency access times into l2
    • due to reaching across cores
  • only thing to do is to write sw that explicitly manages memory locality
  • dss is more cache-friendly than oltp
    • [what distinguishes these 2 workloads?]
  • contribution of
    • travel grows log-linearly
    • access drops log-linearly
  • move to shared-nothing arch
  • see also: gamma machine (uwisc, dewitt)

couchdb (apache, 2008)

site stats

google

  • 2006: 400M queries/day
  • 9/2009: 12.8B queries/mo, 427M queries/day, 5K queries/s
  • 6/8/2010: index ~100PB, growing 100-999TB/day
  • 11/2009 jeff dean LADIS talk
    • typical server: 16GB RAM, 2TB disk
      • easily hold full web index in mem with just 500K servers
    • failures/year: 1-5% disks, 2-4% servers; ea server expected to crash twice+
    • DC storage hierarchy
      • Server: DRAM: 16GB, 100ns, 20GB/s; Disk: 2TB, 10ms, 200MB/s
      • Rack: DRAM: 1TB, 300us, 100MB/s; Disk: 160TB, 11ms, 100MB/s
      • Aggregate Switch: DRAM: 30TB, 500us, 10MB/s; Disk: 4.8PB, 12ms, 10MB/s
    • times everyone should know (ns)
      • L1 cache reference: 0.5
      • Branch mispredict: 5
      • L2 cache reference: 7
      • Mutex lock/unlock: 25
      • Main memory reference: 100
      • Compress 1KB bytes with Zippy: 3,000
      • Send 2K bytes over 1 Gbps network: 20,000
      • Read 1MB sequentially from memory: 250,000
      • Roundtrip within same datacenter: 500,000
      • Disk seek: 10,000,000
      • Read 1MB sequentially from disk: 20,000,000
      • Send packet CA -> Netherlands -> CA: 150,000,000
    • GFS usage: 200+ 1000+-machine clusters, 4+ PB clients, 40 GB/s R/W load
    • mapreduce usage: 3.5M jobs/yr (488 machines/job, 8min/job)
    • bigtable usage: 500 clusters; largest 70PB, 30+GB/s IO
    • bigtable changes
      • better scaling, perf isolation, corruption protection
      • per-table replication; uses eventual consistency
        • all user-facing production uses of bigtable use replication
      • coprocessors: code that lives/moves/splits with data; auto scaling, load bal, req routing
        • eg used in managing metadata for colossus (new GFS)
        • eg lang model serving for machine xlation system
        • eg distributed query processing for full-text indexing support
        • eg regex code search
    • spanner: new bigtable; similar tabular data model, locality groups, coprocs
      • auto move/replicate data based on usage
      • optimizes resources of entire cluster
      • hierarchical dir structure
      • allows fine-grained access and replication control
      • supports distributed txns for apps that need it
      • plan to scale to 10M machines and 1 EB data
    • refs
  • 05/2010 non-volatile memory workshop at UCSD, talk by al borchers of platform group
    • apps: 99.9%ile latency, disk queueing, is ~100ms
    • experimented with flash as cache on GFS chunkservers
      • simulated traces of bigtable tablet servers for several apps
      • write-heavy
    • 16GB cache: 130 misses/s; 512GB: 450; logarithmic (exponential increase yields linear reduction)
    • write perf highly variable/chaotic
      • seq writes fine, rand writes kill perf & lifetime
      • lifetime: FIFO ~ 2x LRU (monitored GCs)
      • difference evens out as cache grows; at 512GB, FIFO ~ LRU
      • research opp: replacement policy w good hit rate & lifetime
    • shift from seek-bound to CPU-bound
      • PCIe less overhead than SAS/SATA; no need for blcok layer
      • NUMA (going through diff core) hurt just as much
      • sync multithreaded used 2-3x CPU of async
      • block layer: 2-3x CPU overhead of accessing RAM
      • IO scheduler: 30% overhead; unnecessary disk optimizations, lock contention
      • FS: 39% overhead; metadata writes, cold data hurt perf & lifetime
      • NUMA: 20-40% overhead
    • IO size constrained by page/block sizes; need new interface
    • read error rate higher than predicted by bit error rate
      • block, plane, die failures dominated
      • concentrated in a few bad chips
      • small impact to caching apps; treat as cache miss
      • large impact to DB apps: data loss
      • traditional RAID drivers yields terrible perf, more CPU overhead
      • research opp: fault tolerance in flash devices; long-term/large-scale failure rates and mechanisms
      • SLC/MLC both saw early-life failures
    • http://dga.livejournal.com/44132.html

internet archive

ebay

  • olap 4/2009
    • main teradata warehouse
      • 2 PB user data, 10,000s of users, Ms of queries/day

      • 72 nodes, >140 GB/s IO (2 GB/node/sec) (peak?)
      • 100s of production DBs being fed in
    • greenplum warehouse
      • 6.5 PB user data, 17e12 records, 150B new records/day (50 TB/day)
      • 96 nodes, 200 MB/node/sec IO, 4.5 PB storage, 70% compression
      • few concurrent users
      • only for web/network event logs
    • http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
  • oltp 11/2006
  • randy shoup, HPTS09
    • 89M active users, 190M items in 50K categories, 8B hits/day, 10% items start/end auction, 70B R/W ops/day
    • arch lessons: partition everything, use async, automate everything, tolerate failures, embrace inconsistency, expect evolution, understand dependencies, cache, save everything, invest in infrastructure
  • article 5/2008

amazon 9/2007

wikipedia 2007

  • 300 servers
    • 200 app servers
    • 70 squid servers
    • 30 memcached servers, 2 GB mem each
    • 20 mysql innodb servers, 16 GB mem each (200-300 GB each, total 5TB)
    • also: squid, nagios, dsh, nfs, ganglia, linux virtual service, lucene on mono, powerdns, lighttpd, apache, php, mediawiki
  • 50K http reqs/sec
  • 80K mysql reqs/sec
  • 7M registered users
  • 18M objects in english version
  • http://perspectives.mvdirona.com/2008/12/28/WikipediaArchitecture.aspx

youtube, 2007 seattle conference on scalability

  • apache -> lighttpd
  • mini-clusters (partitions/replica groups)
  • CDNs for most popular
  • squid: perf degraded w load
  • netscaler load balancer
  • most scaling issues in db
    • mysql w async master-slave replication
    • mysql slaves replay logs in serial order w 1 CPU, 1 disk; master can zoom ahead with multiprocessor and multidisk
      • warmed up buffer by reading in relevant pages first before the log could get replayed
    • split DB into 2 instances: video watch and general
      • video watch higher priority than general
    • initially served off RAID 10, but linux only sees 1 vol so doesn’t schedule too many parallel disk IOs
      • changed to software RAID 0 (striping) over 5 RAID 1 pairs (mirroring), ie manual RAID 10
    • ultimately partitioned by user

linkedin

hive at facebook, 11/2010

facebook, 2010-11-02 talk at facebook on “mysql at scale”

  • stats
    • Query response times: 4ms reads, 5ms writes.
    • Rows read per second: 450M peak
    • Network bytes per second: 38GB peak
    • Queries per second: 13M peak
    • Rows changed per second: 3.5M peak
    • InnoDB disk ops per second: 5.2M peak
  • operations
    • 99.9th percentile metrics
    • lots of query optimization
  • performance
    • extensive, redundant monitoring: error logs, frequent show variables sampling, slow query logs, client error logs, sampled query logs
    • look at macro trends: by day, by week
  • engineering: fb patch for mysql: lots of non-general changes
    • ppl: vamsi ponnekanti, mark callaghan, (someone else)
    • features:
      • online schema changes (OSC): eg add col, add/change index
        • setup triggers to track changes; briefly lock table
        • copy data to new table w desired schema
          • break up into small copies; see openarkkit
        • replay changes on new table
        • rename new table sa target table; again briefly lock table
        • updates can happen except during table locks
      • support major upgrades
      • fixed many stalls
      • fixed many bugs
      • perf improvements
    • current changes
      • incremental hot backup w xtrabackup
        • fast binary backup with innobackup + fast compression (qpress)
        • incremental w xtra to save space, working w percona
          • enhance w remote/streaming
        • fast selective table restore
          • sys tablespace ibdata1
          • per table tablespaces foo.ibd
      • single-threaded replication: want multi-threaded slaves
    • non-stored procedures
      • problem: want to reduce lock contention
      • trad’l approaches: SPs, triggers
        • thousands of DBs, 20 hards each, 50 edge tables each
        • millions of SPs to maintain
      • soln: pass CLIENT_MULTI_STATEMENTS to mysql_connect()
        • insert into counts on duplicate key
    • Insights
      • hadoop to mysql, then advertisers can see analytics
      • didn’t really follow this
  • http://www.livestream.com/facebookevents/video?clipId=flv_cc08bf93-7013-41e3-81c9-bfc906ef8442
  • http://highscalability.com/blog/2010/11/4/facebook-at-13-million-queries-per-second-recommends-minimiz.html

facebook, misc

facebook, 2009 talk at MSR

  • 300M active users
  • 80B photos; upload 1B+/mo
  • videos: 10M/mo
  • 1B pieces of content
  • 100,000s of apps
  • per day talk
    • 4 TB compressed new data per day
    • 135 TB compressed data scanned per day
    • 7500+ hive jobs on production cluster per day
    • 80K compute hours per day
  • accessibility
    • 200 ppl/mo run jobs on hadoop/hive
    • analysts (non-engs) use hive
  • warehouse
    • 4800 cores, 5.5 PB
    • 12 TB per node
    • 2-lev topology
      • 1 Gb/s node to rack switch
      • 4 Gb to top level rack switch
  • hadoop
    • 36 PB of uncompressed data, >2250 machines, 23,000 cores, 32 GB of RAM per machine, processing 80-90TB/day (hadoop summit 2010)
  • last year “20 percent more photo servers and 50 percent more upload servers” to process Halloween traffic, adding up to an additional 40 terabytes (that’s 40,000,000,000,000 bytes) of storage. That was when the site had 140 million users, less than half its current active user base of more than 300 million. http://www.datacenterknowledge.com/archives/2009/10/30/facebooks-spooktacular-scaling-for-halloween/

datacenter sizes, 2010-04-03

facebook, 2009 talk at UCSD

  • general
    • #2 property on internet, measuring time spent on site
    • 200B hits/mo, >3.9T feed actions/day, 15K sites using FB content
    • exponential user growth; 70% users outside US
  • 20B photos in 4 resolutions; 2-3B uploads/mo; serve 600K photos/s
    • orig served from NFS; can’t handle many files; added cache
    • optimized from 10 down to 2-4 IOs per serve
    • haystack: 1 IO per serve
  • arch
    • load balancers -> web servers -> memcached
    • php: easy but slow, sloppy, hard to interop w C++ backend; optimized php engine
    • syslog breaks at scale; built scribe; 25TB msgs/day
    • 30K (mostly web) servers
  • misc
    • looked at clustering algos; found little win for add’l complexity; use hash partitioning
    • requests get data from 100s of hosts; network perf critical
  • memcached: total 120M req/s
    • initially, 15-20K req/s/host; today, 250K
    • added: multithreading, polling device drivers, stats locking, batched packet handling
  • incast: get all responses in same 40us window -> overflow commodity switch buffers -> 200ms TCP timeout
    • custom congestion-aware UDP-based transport to manage congestion across multiple reqs instead of within single conn
  • authoritative storage in mysql
    • no joins; no dynamically changing data; no heavily-referenced static data
    • hard:
      • logical data migration
      • load balancing many logical db’s over varying number of hosts
      • scaling CPU
  • WAN replication in west/east (master/slave) US DCs
    • layer 7 load balancer routes to west (master) any requests that perform writes (based on a hard-coded list)
    • problem
      • async (cross-country) master-slave DB stmt replication
      • originally, invalidate both memcaches on update at master
      • but remote (slave) memcache could get repopulated with stale value from remote DB before new stmt arrives at remote DB (under load, up to 20s)
      • soln: don’t invalidate remote from master, invalidate on stmt receipt
        • replicated stmts must thus carry info on what memcached keys to invalidate
        • eg: REPLACE INTO profile (first_name) VALUES ('Monkey') WHEREuser_id='jsobel' MEMCACHE_DIRTY 'jsobel:first_name'
    • problem: read stale values after updating at master due to lag
      • soln: on write, set cookie w current time; app-aware load balancer routes all reqs to master for 20s
    • http://www.facebook.com/note.php?note_id=23844338919&id=9445547199&index=0
  • future work
    • load balancing
    • middle tier: balance btwn productivity/efficiency
    • graph-based caching/storage systems
    • search relevance via social graph
    • object discovery and ranking
    • storage systems
    • personalization
  • http://idleprocess.wordpress.com/2009/11/24/presentation-summary-high-performance-at-massive-scale-lessons-learned-at-facebook/

facebook & hive, 3/23/10 talk at qcon

  • 5800 cores, 8.7TB storage capacity (div by 3 for replication), 12TB/node
  • 2-level network topology: 1Gb/s node-to-rack switch, 4Gb/s top-level rack switch
  • stats
    • 12TB/day compressed new data, 135TB/day compressed scanned data
    • 7500+ hive jobs/day, 80K compute hrs/day, 95% jobs are hive jobs
    • used by analysts/non-engrs
  • apps
    • reporting
      • daily/weekly aggs of impression/click counts
      • measures of user engagement
      • microstrategy dashboards
    • ad-hoc analysis: # group admins broken down by state/country
    • machine learning (assembling training data)
      • ad optimization
      • user engagement as function of user attribs
    • many others
  • challenges: space constraints, resource scheduling, job isolation
    • RAID-like parity scheme for less replication space
    • implementing columnar compression
    • using fair share scheduler
  • http://www.infoq.com/presentations/Facebook-Hive-Hadoop

myspace

stackoverflow

twitter

  • hadoop summit 2010
  • stats 2010-09-19 http://www.lukew.com/ff/entry.asp
    • Twitter is seeing on average 90 million Tweets per day.
    • Twitter currently has more than 145 million registered users.
    • Twitter’s total mobile users jumped 62% since mid-April.
    • 16% of all new users to Twitter start on mobile now.
    • 46% of active Twitter users make mobile a regular part of their Twitter experience.
    • Traffic to Twitter.com has grown about 100% this year.
    • More than 60% of Twitter users aren’t in America.
    • 90% of Tweets are available to the public. And 25% of tweets contain links.
    • Twitter serves 80MB of updates per second.
    • Twitter delivers six billion API calls per day (70,000 per second.
    • The number of registered Twitter OAuth applications is now at almost 300,000
  • stats revealed at Chirp 2010
    • registered users: 105,779,710 (1,500% growth over the last three years.)
    • new sign-ups per day: ~ 300,000 (More recently, 60% of new accounts were from outside the U.S.)
    • new tweets per day: 55 million
    • unique daily visitors to the site twitter.com: ~ 180 million. (That’s actually dwarfed by the traffic that flows through twitter’s API – 75% of traffic is through the API.)
    • API requests per day: 3 billion
    • registered apps: 100,000 (from 50,000 in Dec/2009)
    • search queries per day: 600 milion
    • Twitter’s instance, of their recently open-sourced graph database (FlockDB), has 300 billion edges and handles 100,000 reads per second.
    • servers: “… in the hundreds”
    • http://radar.oreilly.com/2010/04/twitter-by-the-numbers.html
  • big data in real-time at twitter (qcon 2010-04-20)
    • tweets
      • partitioned by time
      • future: cassandra; back to partitioning by tweet id, but this time maintain index from user id to list of tweet ids
      • memcached for 90+% of reads
    • timeline: what user sees when they log in (relevant tweets from friends)
      • materialize messages in followers’ inbox caches
    • social graph: set operations (eg intersect) can’t be pre-computed
      • store fwd & bwd edges, partitioned by user
      • flockdb: real-time, distributed graph store
    • search: real-time indexing
      • currently partitioning by (term id, doc id) by time in mysql
      • poor write tput; rare terms hit many partitions; poor space efficiency
      • future: partition by document and time; may switch to lucene
    • stats
      • reads/s, writes/s, cardinality, bytes/item, durable?
      • tweets: 100K, 850, 12B, 300, yes
      • timelines: 80K, 1.2M, a lot, 3.2K, no
      • graphs: 100K, 20K, 13B, 110, yes
      • search: 13K, 21K, 315M, 1K, yes
    • principles
      • all reads must be from mem; disk for writes only
      • can precompute some problems but not all
    • http://www.slideshare.net/nkallen/q-con-3770885
  • use pig and hadoop http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
    • basic stats
      • reqs/day
      • req latency
      • distribution of response codes
      • searches/day
      • unique searches
      • unique users
      • geo dist
    • correlations
      • diffs for other clients (mobile, desktop, services, etc)
      • cohort analyses
      • what fails at the same time
      • what features get users hooked
      • what features do successful users use often
      • search corrections/suggestions
      • A/B testing
    • research
      • profiling users based on tweets (theirs, followers, followees)
      • what graph structures lead to successful networks
      • reputation
      • sentiments
      • retweets, retweet depths
      • duplicates
      • language detection
  • nosql eu 2010

yahoo

zynga

NYSE (brian clark’s talk at linux foundation end user summit 2009)

  • systems run under high pressure and tight constraints
  • process 3B txns/day (more than goog); each txn must be <1ms
  • “customers can switch to competing exchanges instantly and for almost no cost, so if NYSE’s systems are not performing, its customers will vanish”
  • typically 1.5TB/day; PBs of data kept online
  • “subject to thousands of attacks every day”
  • downtime must be limited to 90s/yr
  • wants to be able to move everything except specific app off a given core
    • also wants to lock a proc’s mem in place, but this has been avail via mlock
    • also wants non-disruptive way to measure latency, esp in net stack

stumbleupon

streamy

blizzard

  • AT&T has provided space, network monitoring/mgmt for 9 yrs
  • 10 DCs around world: WA, CA, TX, MA, fr, de, sw, kr, cn, tw
  • 20K systems, 1.3 PB storage, 13,250 blades, 75K cores, 112.5 TB RAM
  • 68 admins/ops; monitored from global NOC (GNOC), “which like many NOCs, features televisions tuned to the weather stations to track potential uptime threats across its data center footprint”

adobe

  • uses hbase
  • need sub-50ms accesses, no downtime, no data loss
  • mirrors git repository
    • faces transient bugs eg ZK byte-loss bug
  • preferred over cassandra despite its complexity; argues complexity is non-accidental
  • http://hstack.org/why-were-using-hbase-part-2/

microsoft

  • ieee computing article http://glinden.blogspot.com/2010/09/insights-into-performance-of-microsofts.html
    • hotmail
      • several PBs in 10,000s of servers
      • typical server: 2 CPU, 2 disks, add’l storage enclosure of up to 40 SATAs
    • typical cosmos server: 2 CPU, 16-24GB RAM, up to 4 SATAs
    • bing
      • 10,000s of servers
      • use RAM of 1,000s of servers
      • typical server: 2 CPU, 2-3GB RAM/core, 2-4 SATAs
    • odd: bing/cosmos servers CPU bound; bc of data compression?

github

  • http://github.com/blog/667-webpulp-tv-interview-with-tom-preston-werner
    • git repos on 6 file server pairs now
  • http://github.com/blog/530-how-we-made-github-fast
    • ldirectord on active/standby pair of xens
    • ships your request off to one of the four frontend machines
    • 4 frontend machines w nginx -> 16 unicorns w rails
    • mysql on 2 8-core 32GB RAM boxes w 15KRPM SAS HDDs, replicating over DRBD
    • git access
      • grit is ruby lib that speaks git
      • swapped grit’s FS backend to use the smoke RPC proxying service
      • proxymachine layer 7 TCP routing proxy with ruby routing logic
        • extracts username of specified repo
        • uses chimney to look up route for that user; look up in redis
        • each frontend runs 4 proxymachine instances behind haproxy
    • 4 active/standby pairs of (git) file servers: 8 core, 16GB RAM, 6 * 300GB 15K RPM SAS HDDs in RAID10
      • each has 2 ernie rpc servers spawning 16 ruby workers
    • ssh
      • sshd patched to look up PKs in mysql
      • gerve (git serve) is smarter git-shell (restricts exec)
        • grant access if you are owner or in ACL (in mysql)
        • chimney for route lookup
        • exec ssh git@ROUTE COMMAND ARGS
    • git: only does clones (a public operation); handled by proxymachine
    • sub-systems: job queue, archive downloads, billing, mirroring, svn importer
    • side-systems: github pages, gist, gem server, internal tools

disqus (djangocon 2010)

  • ~100 servers, 30% web, 10% db, 25% memcache, 20% HAProxy/heartbeat, 15% util
  • syslog-ng, pgbouncer, apache+mod_wsgi, ganglia, libmemcached
  • db: postgresql, slony-i (trigger-based) replication
    • pgfouine log analyzer, slow query logging
    • app-level partitioning (tables/vert, rows/horiz), replayed by slony
    • non-transactional; not sure how they deal with concurrency control, atomicity
    • single row atomic updates done on db, not in python (row.update(attr=val))
  • prevent thundering herd on memcached objects that have a predetermined lifespan with mintcache approach
    • have stale date as well as expire date
    • when stale, first requester is told to refresh the data
    • returning invalidate value instead of deleting then missing
  • cont deployment, fabric, hudson, selenium, redmine
    • can easily revert to earlier vers; can toggle features (for % users)
    • interesting idea: sentry aggregates their exceptions (instead of one per email)
  • http://www.slideshare.net/zeeg/djangocon-2010-scaling-disqus

flickr

  • @sampullara: replicates objs to shards where they’re needed

others