• semaphores: p decrements/acquires, v increments/releases

  • annoying to do performant multiprocessor programming
    • on tile64, singly cached shared memory means that all accesses to that memory need to go through the owning tile; can’t do this remotely
    • writes/msg-passing (non-blocking) may be faster than reads
    • cross-platform programming
    • on linux, can’t specify where to place things (stack, text, heap, etc.)
      • only from within a thread
      • same for tile64 memory:thread ownership


  • ILP, task par, data par
  • Amdahl’s law: sublinear speedup
    • though cache aggregation can actually lead to superlinear speedup
  • Flynn’s taxonomy
    • (S:single, M:multi, I:instr, D:data, P:program)
    • SISD (usual CPU), SIMD (GPUs), MIMD (you are here)
    • MIMD subtypes: SPMD (most common parallelism), MPMD


  • chips hit the power wall; higher clock frequency leads to hotter cores
    • Patterson: power wall, memory wall, ILP wall -> serial performance wall
  • fundamental differences from distributed systems
    • tightly coupled failures
    • communication timing over buses, interconnects is much more predictable, synchronous, etc.
  • symmetric multiprocessor (SMP)
    • multi-core processor aka chip-level multiprocessor (CMP): on a single integrated circuit (IC) aka die
  • network on chip (NOC) vs buses
  • memory hierarchy
    • non-uniform mem arch (NUMA)
      • separate mem per proc
      • early supercomputers traditionally focus on providing high-speed mem access
      • cache-coherent NUMA (ccNUMA): large inter-proc comm overhead
        • OS support: avoid scheduling/locking algos that make unfriendly accesses
      • for architects: investigate whether prediction has been used for cache-coherence
    • cache-only mem arch (COMA)
  • parallelism
    • coherence: wrt single datum
    • consistency: across separate things
    • cache coherency
      • false sharing: adjacent writes by different cores can steal cache lines between each other
      • CC mechanisms
        • central directory
          • scales better; ccNUMA tend to use this
        • bus snooping/bus sniffing: watch memory access addrs
          • invalidate own copy if others write
          • bus/full-mesh does not scale
        • snarfing: snoop on both addr and data
          • update own copy write away
      • coherence models: MSI, MESI, MOSI, MOESI, write-once, …
        • M modified E exclusive S shared I invalid O owned
        • eg MSI
    • related: DSM coherence protocols
  • misc
    • systolic array (Kung, CEL 78)
      • pipe network arrangement of data processing units (DPUs)
      • dual of von Neumann
  • IO channels/pins
  • chip multiprocessor watch from View
  • TODO faults

Contemporary Architecture

  • HyperTransport: standard bidirectional point-to-point DDR link
    • used in various electronics
    • dual date rate (DDR): sends data on both rising and falling edge of clock
    • abbrev often confused with HyperThreading
  • QuickPath: Intel’s point-to-point CPU interconnect; their answer to HT
  • Opteron
    • Barcelona: the current (11/17/2007) architecture
      • up to 4-way
      • per-core L1 (64K) L2 (512K), shared L3 (2M)
    • Direct Connect: the IO architecture
      • connection among CPUs: proprietary extension of 2 HT interfaces
      • integrated memory controller
      • HT connection to high-perf IO subsystem
      • crossbar switch between L3 and mem/IO
    • scales better than Xeon?
  • Core Duo (32-bit), Core 2 Duo (32-bit), Core 2 Quad
  • Xeon
    • two common buses: more vulnerable to congestion
    • separate memory controller
  • Cell
    • Power Processor Element (PPE): 2-way Power chip, the “controller”
    • multiple Synergistic Processing Elements (SPE), each with local storage
    • PS2: 8 SPEs with 256K local storage
    • difficult to program; must explicitly move memory
  • resources


  • TM: don’t worry about failures, persistence (vs. DB xacts - need recovery)

Speculation: two different meanings

  • optimistic concurrency control: speculate that we can safely parallelize
  • prediction, eg Speculator file system, branch prediction

Pessimistic concurrency control

  • mutex locks
  • most RDBMSs use 2PL: phase 1 acquire, phase 2 release
    • non-strict: phase 2 can start at any time
    • strict: release write locks after xact ends
    • strong
  • problems with locking
    • usability: not composable; lock ordering; reasoning difficulty
    • conservative: reduces possibilities for parallelism
    • overhead; slow
    • priority inversion
      • solutions: priority ceiling (?), priority inheritance
    • convoys: multiple threads of same priority repeatedly contend for lock
    • failure-intolerant (eg if node goes down, no waking)
    • preemption-intolerant (eg page faults)
    • deadlock, livelock
      • livelock still problem with other schemes (eg OCC)

Optimistic concurrency control

  • suited for low contention
  • challenges: performance (hardware), IO, lots more….

  • conflict detection
    • read-write, write-read, write-write
    • lazy: detect at end
    • eager: detect on conflict; less wasted work
  • version management
    • lazy, aka write buffers
      • old values in place, new values elsewhere
      • commits slower than aborts
      • most HTM use this; also DBs with OCC
      • two subclasses
        • eager commits: merge speculative state back into main memory before passing commit token to next head
        • lazy commits: pass token immediately; lazily merge state (eg on-demand); staleness is always clear
    • eager, aka undo logs
      • old values put in undo log, new values in place
      • common commits fast; aborts slower
      • most RDBMSs use this, but for recovery, not isolation
  • used in TM and SpMT

Transactional Memory

  • primitives: atomic, retry, abort

  • number of uncommitted transactions per processor
    • single-transaction (SingleT): can’t switch among xacts, so blocks when xact is stalled on another (potentially long-running) head xact
    • multi-transaction (MultiT)
      • single-version: when two xacts on same proc each need own version of a speculative state, the later one must stall
      • multi-version: no such restriction
  • contention management
    • various policies; can consider conflict type (ww/rw), time, contention, …
  • interaction with non-xact code
    • weak atomicity: xacts not atomic wrt non-xact code
    • strong atomicity: xacts are atomic wrt non-xact code
    • STM mostly weak, HTM mostly strong, due to varying costs
  • cannot always perform straightforward translation to transactions
    • eg: the following (legal) barrier code will deadlock when converted to xacts

      a=0; b=0; T1: lock(o1) { while (!a); b=1; } T2: lock(o2) { a=1; while (!b); }

    • determining when translation is safe is still open question
    • obviously, you would never write code from scratch that does this; the legacy code expects a,b to change, whereas when using xacts, you expect values to not change

  • open-nested transactions: screwed-up semantics
    • but: see Jim’s paper on ownership typing TODO
  • HTMs
    • limited amount of speculative state
    • virtualization: support overflows
    • still in exploration

Speculative Multi-Threading

  • speculative multithreading (SpMT), aka thread-level speculation (TLS)
  • spawn a task per call
  • lots of work: explored by SUIF, Sable, CMU DBs, …
  • not much work on prediction (understandably hard)
  • TODO crawl: explicit speculation, programming models with explicit prediction


  • data races: two accesses (at least one of which is a write) from diff threads to same mem location execute without proper synchronization
    • not always bugs
      • barrier synchronization: lock(y) { x += 1; }; ...; while (x != n);
      • flag synchronization: x = true; ...; while (!x); x = false;
  • atomicity violations: things that should be part of the same atomic block aren’t

Serializability and linearizability

  • serializability aka atomicity
    • history: ordering of operations (aka schedules, orderings)
    • history equivalence: final states are identical given any initial states
    • serializability: history has equivalent serial history
    • deciding history equivalence: tractable
    • deciding history serializability: NP complete (many possible interleavings)
  • linearizability (aka sequential consistency or atomic consistency)
    • history: ordering of operations on object
    • sequential history: one in which each invocation is followed by a response
    • linearizability: history has equivalent sequential history
    • two distinct properites: locality and non-blocking ???
  • serializability vs. linearizability
    • get at the same thing, but incomparable
    • linearizabilty does not address operations on more than one object, which serializability does
    • serializability does not require consistency of the partial and total orders, which linearizability does ???
  • great ref with examples

Consistency TODO

View and conflict serializability

  • all conflict ser scheds are view ser (conflict ser implies view ser)
  • determining view serializability is NP complete
  • so in practice, DBMSs restrict scheds to only conflict ser scheds

Multi-Granularity Locking

  • example granularity hierarchy: DB, table, page, tuple
  • idea of intention locking: must intention-lock all ancestors in granularity heirarchy
    • equivalent to just having locks at the leafs of the hierarchy
    • e.g.: you have an X on a tuple and want an S on the table (should conflict)
      • you don’t have to search all the tuple locks to see if there’s an X
      • instead just observe that there’s an IX on the table already
  • intention locks
    • IS: intend to get S on finer granularity
    • IX: intend to get X on finer granularity
    • SIX: S + IX, shared but intend to get X on fine granularity
  • protocol
    • to get IS/S, must have IS/IX on parents
    • to get X/IX/SIX, must have IX/SIX on parents
    • release in bottom-up order
  • lock escalation: dynamically ask for coarser-grained locks when too many low-level locks acquired
    • example: when reading tuples, may first try to use IS and get S on individual tuples; once we decide we’re reading a lot of tuples, then upgrade IS to S
compatible? IS IX SIX S X
IS y y y y
IX y y
S y y

Isolation levels

isolation level dirty read nonrep read phantom read
read uncommitted possible possible possible
read committed impossible possible possible
repeatable read impossible impossible possible
snapshot/ser impossible impossible impossible
  • undesirable phenomena
    • dirty read: read an uncommitted value (which may be rolled back)
    • nonrepeatable read: read committed values, but multiple reads may each return different values (if multiple other updates were committing)
    • phantom read: when reading sets of values twice, second set different from first (a more general nonrepeatable read)
  • more anomalies
    • lost update: in repeatable reads
      • A=0; T1: A++ & T2: A++ can end with A=1 instead of A=2
  • implementing isolation levels
    • read uncommitted: no snapshots; no intra-query consistency
    • read committed: per-query snapshots; no inter-query consistency; may see changes to previously read values
    • repeatable read: per-query snapshots that remain in effect just for the read values; may see newly inserted values (phantoms) and lose updates
    • snapshot: per-txn snapshots; won’t see things inserted later; almost always implemented using MVCC
    • serializable: read-write conflicts
  • snapshot isolation: weaker than serializable; only write-write conflicts
    • write skew: two transactions read a value each, then change the other’s value, but neither sees the result of the other’s update
      • because system doesn’t track reads; no read-write conflicts; eg MVCC
      • example: A=B=0; T1: A=B+1 & T2: B=A+1 can result in A=B=1 (no 2)
      • example: A=B=1; T1: if A=1 then B=0 & T2: if B=1 then A=0 can violate implicit invariant A+B>=1
      • easy to avoid: whenever reading A,B for writing B, also write A; this is what select for update does
      • example from PG docs: two classes that miss each other
        • insert into r (x,class) values ((select sum(x) from r where class = 1), 2)
        • insert into r (x,class) values ((select sum(x) from r where class = 2), 1)
        • harder to avoid; this also affects the next level of isolation i called “better than snapshot”
    • read-only anomaly example
      • X=Y=0; T1: Y+=1 & T2: X+=1 if X+Y=0; T3: print X,Y may print 0,1 when final values are 1,1
      • just T1,T2 is serializable; 1,1 and 0,1 are both valid final values
      • read-only T3 causes observable anomaly: should never get 1,1 after seeing 0,1
    • TODO: read “serializable isolation for snapshot databases”, SIGMOD08
  • “better than snapshot” (unnamed): track reads (eg read locks or read sets), but fail to track newly inserted values that should be tracked (eg we’re reading from a set restricted by a predicate, and a newly inserted value matches)
    • should really have read-locked the entire table, but that may be prohibitively expensive
    • alternatively, monitor new incoming values for matching values; i imagine this can be difficult for complex queries with joins/aggs/udf/etc, but how difficult? doesn’t seem uncomputable
    • this is sort of a read skew; the PG example from above falls into this category
  • serializable: need full read-write concurrency control; typically implemented using 2PL in RDBMSs
  • implementations
    • oracle, postgresql: “serializable” is a form of snapshot isolation
      • SI still get write conflicts to same data, but inserts of new tuples that would’ve matched an earlier read can slip by
      • eg race among sessions running select col from onerow; if no res, insert into onerow values (0);
      • constraints preserved, though (eg if onerow used col as PK, fail)
    • innodb “repeatable read” is actually “read committed”, “serializable” is serializable
    • innodb, sybase, mssql, db2 also support true serializability
      • bc they have few access paths, all closely tied to index structures
      • so they can implement shared locks on index, to prevent ghost inserts
      • in particular they implement “gap locks” to deal with phantoms
      • pg has many diff access paths, and indexes are plugin-replaceable w limited interface
      • predicate locking could be faster, since index locks = more locks
    • postgresql: “read uncommitted” is “read committed”, “repeatable read” is “serializable”
      • claims to need to implement predicate locking for true serializability [but other systems don’t]
  • ref:

OCC Techniques TODO

  • read-versioning

Lock-free and wait-free algorithms

  • difficult; focus on primitive data structures, eg non-blocking queues/stacks
  • memory management: eg lock-free mallocs, concurrent GCs
  • wait-free subset of lock-free subset of obstruction-free
  • lock-/wait-free: fault-tolerant, nobody can die with a lock
  • universal synchronization primitives: powerful enough to implement any lock-/wait-free data structure (Herlihy)

Read-Copy-Update (RCU) Locks

  • key idea: leave old data for readers, don’t update in-place
    • no need for reader locks if they don’t see changes in-place
    • fundamentally removes need for feedback from readers to updaters!
  • instead, update by copying data items, atomically change pointers
    • versions form a linked list; old readers free old version when done
  • benefits
    • reader code simpler, avoids locking issues and bus ops to notify updaters
    • writer code simpler than lock-free data structures
    • good performance for read-heavy workloads

Low-level hardware primitives

  • test and set: non-universal (proved cannot implement lock-/wait-free algos with only read+write+TAS), can only implement 2-consensus (?)
  • compare and swap: universal
  • load-linked + store-conditional (LL/SC): store only if no changes since load; also universal
  • multi-word compare and swap (MCAS)
    • good primitive, can be HTM-implemented

Memory model

  • example 1
    • initially, all 0; x=1; r0=x | y=1; r1=y; actually r0, r1 can end up 0
    • due to write buffer; stores may not have made it to visible memory or coherent cache
    • no sequential consistency
    • outcome determined by hardware, thread libs, compiler; but even if thread libs and compiler work as expected, most hardware still allows this to happen
  • example: independent reads, independent writes (IRIW)
    • x=1 | y=1 | r1=x (1); fence; r2=y (0); | r3=y; (1) fence; r4=x; (0)
    • thread 3 proves x was set first; thread 4 proves y was set first
    • this outcome is still possible because (e.g.) threads 1, 3 can be on same chip, and threads 2, 4 can be on same chip, so they see each other’s effects at different times
    • but what more can we do? already inserted fence in btwn each instr
    • TODO understand this better
  • Intel, AMD published better descriptions about sequential consistency only in late 2007
  • fence instructions give us sequential consistency when we need it
    • gcc: __sync_synchronize()
  • ref
  • From
    • x86, x64: strong memory model where memory access (loads/stores) are all effectively volatile
      • volatile only forces the compiler to avoid high level optimizations, hence 0 perf penalty for just marking sthg volatile
      • fence instrs: lfence, sfence, mfence
        • lfence, sfence apparently unneeded
        • mfence: if core writes then reads, the read may be served from the store buffer even though the write has not yet been written to memory; this forces a flush
    • itanium: weaker mem model; requires ld.acq/st.rel not ld/st


  • cache blocking

Programming models

Means of communication

  • Dataflow variables. The dataflow variable is a shared variable which only set once, but read from many times. Setting can be limited to a particular thread, or any thread might be able to set it–but only the first will succeed. (Subsequent threads that might try to set it to the same value might also “succeed”). Reads prior to the set may block or fail. See PVR’s excellent book for more on this style of concurrency.
  • Message queues. A message queue is a FIFO data structure that in the concurrent case, is generally owned by a particular process–other processes can post messages on it. Messages are generally copied from one process’s memory space to another, though on a shared-memory machine, shared memory might be used under the hood (if data is immutable, a reference is generally indistinguishable from a copy). Message queues can be implemented “on top of” dataflow variables, above.
  • Mailbox. Mailboxes, as in Erlang, are essentially message queues with extra processing on the read end which permit nifty things like out-of-order reads. All the fancy stuff occurs in the context of the reading process.
  • Object spaces, or tuple spaces. A form of shared memory which is “safe”–essentially a big blackbord of objects from which processes can concurrently read and write. Implementations exist for both shared-memory architectures (which is easy) and networks (which is harder).
  • From

Formal concurrency models

  • petri nets
    • models control flow, not data flow, thus not composable
    • another problem: simultaneous action
  • actor model
    • models dataflow
    • directly address processes (rather than channels)
    • asynchronous
  • process calculi
    • named (sync/async) channels, anon processes, par, seq, reductions
    • guarded choice
      • requires two-phase commit to choose
      • issues: starvation, livelock, efficiency
    • major types
      • calculus of communicating systems (CCS)
        • continues as pi calculus
      • concurrent sequential processes (CSP)
      • algebra of communicating process (ACP)
  • join calculus
    • basically, queues + pattern matching
  • comparisons
    • CSP v actors
      • CSP: sync 0-buffering rendezvous, anon procs, explicit channels
      • actors: async, named procs, per-actor mailboxes
  • TODO bisimulation

Misc hardware

  • Flash storage
    • inherently small; parallelize multiple storage units
    • parallelism exhibits non-sequential IO
    • downsides: cost, writes
  • (dynamically) reconfigurable computing
    • anti machine: data-stream-based anti machine

Nitty gritty programming

  • Most hardware platforms have 64-bit atomic operations; (ref: from sbw, from some amd manual)
  • Linux
    • sched_setaffinity: pin thread to a core
    • libnuma: pin memory

Acronyms TODO remove (incomplete) or automate

  • DSM: distributed shared memory
  • HTM: hardware TM
  • OCC: optimistic concurrency control
  • RTS: run-time system
  • STM: software TM
  • TM: transactional memory

still need to finish crawling:

MS technologies

  • Parallel Framework Extensions (PFX): TPL and PLINQ
  • Parallel LINQ (PLINQ): data parallel comprehensions
  • Task Parallel Library (TPL): fine-grained tasks are run in thread pools
  • Concurrency and Control Runtime (CCR): TODO
  • Axum: actors (“agents”) pass msgs asynchronously across ports on channels
  • F# asynchronous workflows: TODO
  • CHESS: Find and Reproduce Heisenbugs in Concurrent Programs TODO
    • repeatedly runs a concurrent test ensuring that every run takes a different interleaving



  • Scala DSL library of parallel operators, a la PLINQ
  • Long-term research angle includes an executor that schedules for locality


Transactional Memory Systems

McRT-STM: a high performance software transactional memory system for a multi-core runtime (Stanford)

  • evaluates STM design tradeoffs
  • reader-version 2-3 OOMs faster than reader-lock: to use locks, must perform writes, leads to cache effects
  • undo-logging < 1 OOM faster than writer-buffering: writer-buffering searches for last-written value
  • object-based and cache-line locking are comparable
  • object-based: compiler can gen code to acquire whole obj; harder in cache-line schemes
    • small objects in size-segregated heap, large objects in separate heap; tx for former only, too coarsed-grained with latter
  • keep as ref for performance numbers

LogTM (Wisc, HPCA 2006)

  • most HTMs leave old values, buffer new values; fast aborts, slow commits
  • LogTM: eager versioning (undo logging); also, eager conflict detection
  • transaction log
    • per-thread linear virtual address space
    • filled by hardware, read by software
  • eager conflict detection using ‘directory’
    • per-processor read/write bits
    • requesting processor issues coherence request to directory
    • directory forward to other processors
    • responding processors detect conflicts using local r/w bits, informs requesting processor of conflict

Subtleties of Transactional Memory Atomicity Semantics (UPenn)

  • basically, TM has some subtle tricky issues
  • interesting read; keep as ref for problems with TM

Tradeoffs in Buffering Memory State for Thread-Level Speculation in Multiprocessors (UIUC)

  • arch main memory (lazy, eager commits) vs future main memory
  • singleT vs multiT+singleVersion vs multiT+multiVersion
  • performance conclusion are murky, only that FMM (eager versioning) wins
  • performance: multi-task, multi-version > multi-task, single-version > single-task
  • lazy commits tends to be faster than eager commits
  • multiVersion tends to be faster than singleVersion – why?
  • keep as ref for taxonomy and for perf comparison

Memory-mapped transactions (Sukha 05)

  • perform xacts on mem-mapped file data

Other systems

  • Intel/Brown VTM
  • Herlihy/Moss TM
  • Stanford TCC

Distributed Speculation

Speculative Execution in a Distributed File System (UMich)

  • don’t just wait for the result of a remote request for data; simply predict the cached value, and run with it as far as possible without leaking the speculation (no IO “externalization”)
  • Speculator system adds Linux kernel support for speculation
    • TODO how is this done?
  • 2x performance SpecNFS over NFS improvement in LAN, OOM in WAN
  • Blue File System: full consistency semantics + synchronous IO safety
    • can outperform DFSs with weaker consistency/safety
  • awesome

to crawl:

  • Lightweight Distributed Selective Re-Execution and its Implications for Value Speculation
  • Hardware Support for Data Dependence Speculation in Distributed Shared-Memory Multiprocessors Via Cache-block Reconciliation
  • Distributed Models of Thread-Level Speculation
  • Speculations: Providing Fault-tolerance and Recoverability in Distributed Environments

Operating Systems & VMs

QoS Policy and Architecture for Cache/Memory in CMP Platforms (Intel, Stanford)

  • problem: cache thrashing, etc. can have serious impact
  • idea: use QOS for contention mgmt (inevitable as we move to NOC)
    • two priorities, high and low; assigned by users (or system)
  • modified Linux and Xen to support QOS
    • each core (“tile”) has control register containing current thread priority and some history about thread behavior
    • (OS) threads need to keep QOS bits and control register bit
  • Dynamic Quality Monitor (DQM): special hardware unit
    • monitors thread, records usage info in QOS Resource Table (QRT)
    • compares thread’s resource usage to priority; enforce QOS
    • QOS-aware mem unit: higher priority mem requests can cut in line
    • QOS-aware cache: mark cache lines with priority; maintain counters to cap low-priority entries; evict those first if too many
  • results
    • 20-30% fewer cache misses per instr
    • 10-20% perf improvement for high-priority threads (in the Linux system)
  • resources
  • misc commentary
    • this doesn’t seem necessarily specific to [SC]MPs, but simultaneous sharing exacerbates things; there is apparently a bunch of prior work on QOS in architecture, with a lot of attn on real-time systems
    • reminds me of the paper on SUIF+OS cooperation (hints for prefetches/releases), which isn’t about multiprocessors at all
    • POWER5 had hardware to enforce priority-based resource utilization

Disco VMM (Stanford)

  • changing OSes to use these is hard
  • what if app needs more resources than what commodity OSes can provide? two options:
    • make a small change to OS to be able to share mem across VMs
      • [would apps have to be designed with this peculiar model in mind for this to be effective? NFS seems to do fine, but need to better understand how the network sharing works to answer this]
    • apps that don’t need full OS can use a lightweight OS that does scale
      • [but then you don’t get a full OS]
  • challenges
    • execution overhead
    • memory is replicated
    • VMMs lack info to make good policy decisions on resource management
      • eg don’t know when an OS is no longer using a page
      • eg don’t know which instruction stream is important for scheduling
    • comm/sharing across machines is as cumbersome as on separate physical machines
  • interface
    • processor
      • emulates MIPS R10000 instrs, MMU, trap architecture
      • extends arch to support efficient access to some processor functions
        • eg toggling interrupts can be done using loads/stores to special addrs
        • enables OSes tuned for Disco to reduce trap emulation overheads
    • physical mem
      • dynamic page migration and replication to export nearly UMA
    • IO: virtualizes disks, network devices
      • virt disks configurable for different sharing and persistence models
      • provides special net interface to handle large transfer sizes without frag
  • implementation
    • 13K LOC; 72K footprint; small so replicate on all stores
    • attn to: NUMA mem placement, cache-aware data structures, interprocessor comm patterns
    • no data structures with poor cache behavior, eg linked lists
    • few locks; wait-free sync using LL/SC; interproc comm for eg TLB shootdowns, posting interrupts to given VCPU
    • virtual CPUs: real CPUs time-shared by OSes
      • each virtual CPU has process table entry (saved regs, etc)
      • physical CPU modes
        • kernel mode: Disco
        • supervisor mode: OS; can use protected supervisor segment of addr space but no privileged instrs or physical mem
        • user mode: otherwise
    • virtual physical memory
      • maintain physical-to-machine addr map (pmap)
      • intercept TLB operations; further mem references through this TLB mapping have no overhead (TLB: VM -> MM)
        • pmap: physical page -> pre-computed TLB entry + backmaps (for taking away pages from CPU TODO)
        • merge protection bits with orig entry before inserting into TLB
    • NUMA mem mgmt (replication, migration)
      • FLASH counts cache misses to each page; use these counters
      • to migrate/replicate, must touch TLBs (TLB “shootdowns”)
    • virtual IO
      • custom device drivers rather than emulate PIO
      • intercept DMA addrs, maintain DMA map
    • COW disks (section is very unclear)
      • start with same root disk containing kernel, apps, etc.
      • share machine mem means DMA can be a single map update
      • disk writes logged
      • [note: not actually for persistent disks]
    • virtual network interface doesn’t impose MTU and lets data be shared (via DMA map)
  • commodity OS: SGI IRIX
    • minor changes, mostly to HAL (privileged instrs changed to special mem ops)
    • problem: virtual CPUs (supervisor mode) cannot efficiently access KSEG0, an unmapped space which bypasses TLB (seems to be MIPS-specific)
      • kernel code/data resides here; had to relink to a mapped portion
    • custom device drivers: simple, since working with idealized device
    • tweaked network subsystem to behave well for remapping technique
    • page zeroing must be done anyway for security; hints to do it in Disco
  • implemented SPLASHOS, a specialized bare-metal OS
  • simulation on SimOS
    • used simplified model of FLASH (OOM faster to sim); short workloads
    • uniprocessor execution time overheads: 16% for IO intensive workloads, 3% for CPU intensive workloads (due to misses)
      • Disco has more misses due to mapped kernel code/data
      • kernel time shortened by Disco page initialization and second-level TLB
    • mem overhead (pmake workload, 8 CPUs)
      • lots can be shared, even over NFS
      • kernel private data not shareable, but is small
    • scalability (pmake, radix, 8 CPUs): pmake 40% faster on 8 VMs
      • IRIX is NUMA-unaware, suffers from high overheads in synchronization (semaphores) and memory system
      • VM has more sync overhead (slower, leads to more semaphore contention)
      • partitioning scales well, even with NFS, with single server serving 8 disks
    • dynamic page placement and migration/replication: 37% improvement
  • misc commentary
    • note that FLASH is an multichip not CMP, hence they introduce fault containment in Cellular Disco

Cellular Disco (Stanford)

  • what Disco lacked
    • (static) hardware partitioning has fault containment
    • OSes have fine-grained resource mgmt mechanisms/policies
  • fault containment: similar to Hive, but can assume Disco code correct
  • CPU load balancing
    • idle balancer (local view)
      • check neighboring run queues (intra-cell only)
      • VCPU migration cost: 37us to 1.5ms
        • cache and node mem affinity: >8ms
      • backoff
      • fast, local
    • periodic balancer (global view)
      • check for disparity in load tree
      • cost: affinity loss, fault dependencies (if moving across containment boundaries)
  • mem load balancing
    • each VM has allocation prefs
    • borrow mem based on:
      • combined alloc prefs of VMs
      • mem availability on other cells
      • mem usage
    • loan if enough mem available
  • performance results on prototype 32-CPU SGI Origin 2000 system
    • ~0% overhead, contained all faults
    • as well as single OS
    • much better than static hardware partitioning
  • resources

Hive (Stanford)

  • summary: fault containment, resource sharing, single system image
  • each kernel cell runs modified IRIX; hardware nodes partitioned into cells
  • fault containment vs. fault tolerance: in former, partial failures allowed
  • fault containment (confine to cell)
    • hardware: CPU halt, mem range failure [are these real problems?]
    • software: randomly corrupt OS structures [what is this supposed to model?]
  • fault containment arch
    • messages are RPCs; sanity check all RPCs
    • remote reads: reader protects self
    • remote writes: use FLASH hardware firewall to only allow remote writes to pages owned by processes that span cells
      • pessimistic: on detecting cell failure, discard all pages writeable by that cell
    • Byzantine failures: heuristics to detect; distributed consensus to reboot cell
  • resource sharing arch
    • mem sharing
      • logical-level: eg global file buffer cache, shared mem
      • physical-level: share free page frames
    • CPU sharing: can span and migrate tasks
    • Wax: centralized user-level resource allocation policy maker
      • migration, replication
  • SimOS simulating FLASH
    • 0-11% slowdown vs. 4-way IRIX
    • more scalable than NUMA-unaware SMP kernels
  • misc commentary
    • much more work than Disco, hence Disco
    • unclear whether this effort ever “finished”
    • led to Cellular IRIX?

TxLinux (UTexas)

  • HTM: making kernel concurrency easier
  • MetaTM primitives: xbegin, xend, xrestart, xgettxid
    • xpush, xpop: save/restore xact (for interrupt handlers)
    • informing xacts: xbegin returns status (first time? retry? why retry?)
  • critical sections: defy xact (?)
  • locks + xacts must cooperate
    • why
      • legacy code, IO, page faults (?)
      • flexibility: switch to locks for high-contention sections
    • how
      • cxspinlocks: acquisition via contention manager
      • cx_optimistic: attempt xact, roll back to lock on needed (IO); if always needed, default to lock
      • cx_exclusive: acquire lock via contention manager (use xcas)
      • cx_end: release
  • hardware simulation using Simics: 16,32 CPUs
    • usu. 1-2% speedups
  • passing thread priorities to contention manager removes all priority inversion for negligible (1%) perf cost


  • Novell’s API for managing NUMA policies
  • by regular kernel hacker Andi Kleen of SuSE Labs
  • policies
    • default (local), bind (fail if exhaust), interleave (bandwidth), preferred (fall back on others)
  • also numactl controls processes externally
  • ref:

NUMA Linux

  • doesn’t seem to be much active work on NUMA-aware scheduler; current attn is on CFS and rivals
  • NUMA Linux: heavily outdated
    • scheduler: locate processes as close to the memory they access as possible
      • 2.4: single runqueue design couldn’t scale (wrt not just number of CPUs but even number of tasks)
      • there was a bunch of work in 2002-2003 on various NUMA-aware multi-queue (MQ) schedulers
      • O(1) scheduler: runqueue per CPU; not NUMA aware
        • pretty much stopped work on NUMA-aware schedulers
      • O(1) load balancing extensions
        • keep process on same node (stationary)
        • spawn on least loaded nodes
    • discontigmem
    • locking primitives
    • userspace APIs
    • multipath IO

Linux Scalability for Large NUMA Systems


  • scalable, hierarchical clusters of resources
  • replicate system services and data boejcts to increase locality, increase concurrency, avoid centralization



  • trade off speed for precision
  • extensible through C API
  • machine model

SimOS (Stanford)

  • trade off speed for precision
  • used for FLASH, Hive, Disco, etc.


Azul Systems

  • Vega 2: 48-core custom CPU, 64-bit, 768-core-total multi-chip SMP, 768 GB mem
  • Java support: eg hardware-assisted pauseless GC, replacing locks with OCC

TRIPS (The Tera-op, Reliable, Intelligently adaptive Processing System)

RAW/Tilera CMP (Agarwal, MIT)

  • TILE64 product pitched at embedded market (network processing, HD media encoding)
  • RAW pioneers idea of tile + processor/cache + switched routing + mesh network
  • buses inefficient wrt perf and power (see this already in Opteron vs Xeon)
  • this is starting to bring us into the HPC domain
  • how big should a core be?
    • KILL: kill if less than linear
    • as soon as core performance starts dropping off from being a linear function of core size, stop growing
  • caches
    • each tile has L1 (16K), L2 (64K); L2s form a distributed L3 (4M)
    • caches are coherent; eg can run SMP Linux
  • “Hardwall” protection/virtualization technology
    • memory-based protection/virtualization no longer works when using direct IO interfaces/messaging
    • Hardwall partitions tiles up protected groups; disallow comm between groups
  • power: 30x perf per watt over 3GHz Xeon
  • misc details
    • MIPS-derived VLIW ISA
    • 90nm process, 600-900MHz, 32-bit, 2 ALUs, no FPU
    • 10,000 for $435
  • RAW: Raw Architecture Workstation; expose wire delay to programmer
    • akin to RISC philosophy: let compilers manage complexity
    • one cycle for data to move between tiles; compiler can statically schedule ops among ALUs
    • can write to special regs that directly send results of ALUs into another tile’s ALU
  • memory performance details, from discussions with Hank and Wentzlaff
    • architecture
      • tile64 has 4 memory controllers, each with 8 memory banks
      • 3 connections to each memory controller (tiles 1-3, 6-8, 57-59, 62-64)
      • high bandwidth memory access: memory controllers each support up to 64 outstanding memory requests
    • locality policies
      • tiles allocate on the memory controller that is the fewest hops away (and has enough space)
      • a running process that hasn’t spawned any threads (and only such a process) is eligible for automatic mirgation
      • libnuma, sched both work on tile-linux
    • programming for performance
      • mesh itself can become congested
      • can get pretty far using straightforward approaches (with either shared mem or ilib)
      • very complicated; lots of places things can go wrong
      • understanding performance is an ongoing open problem throughout tilera
      • slow software TLB miss handler, so for large data, use large pages (mmap)
    • memory sharing policies
      • uncached
      • singly cached, coherent
        • everything goes through an owning tile
        • writes are non-blocking, so communication by writing may be more performant than communication by reading
      • multiply cached, non-coherent
        • manage coherency yourself
        • basically, avoid overlapping cache lines
      • multiply cached, read-only
      • write from one tile and read from many tiles: supported only by hardware, not yet in tile-linux, still somewhat buggy
    • misc
      • optimization guide: not very useful
  • programming
    • tooling
      • Eclipse plugin for selecting “spatial views” (grid views)
      • modified gdb: aggregate control/state, breakpoint all related processes
      • profiler: collective/aggregated stats
    • model
      • incrementally exploit parallelism in C apps
      • iLib: lightweight sockets-like API for stream programming
  • resources

Software-Based Instruction Caching for the RAW (Miller)

  • use software to manage the i-cache
  • extended basic blocks: muliple exit points
  • TODO

Terascale (Intel)

  • 80-core, FP-centric
  • through silicon via (TSV) interconnect
  • large core counts

Merrimac stream processor (Dally, Stanford)

  • comm is expensive (power, latency, heat)
  • stream processor ideas
    • one control unit, 100s of FPUs
    • deep reg hierarchy for high local bandwidth
    • stream: seq of data objs
    • expose data parallelism to utilize FPUs and work while waiting for memory
    • expose multiple levels of locality
      • short term producer-consumer locality (LRF)
      • long term producer-consumer locality (SRF)
      • cannot be exploited by caches: no reuse, no spatial locality
  • Merrimac
    • 1 scalar core, 16 arith clusters ea. with 4 64-bit multiply-accumulate FPUs
    • execute on same VLIW instr
    • local register file (LRF) per FPU
    • stream register file (SRF) per cluster
  • programming model: series of computational kernels processing stream elements
  • memory: stream instructions for stride access, scatter/gather
  • interconnection network: “fat tree”

FLASH SMP (Stanford)

  • TODO
  • collection of nodes, each with CPU, mem, IO
  • high-perf scalable interconnect
  • directory for cache coherency
  • MIPS R10000 ISA

Speculative Lock Elision (SLE)

  • micro-architectural technique for dynamically removing unnecessary lock-induced serialization
  • misspeculation due to inter-thread data conflicts detected using existing cache mechanisms; rollback used for recovery



Concurrency Control and Recovery in Database Systems (Phil Bernstein, MSR)

  • multiversion concurrency control (MVCC)
    • each write produces new version of object
    • need never reject read; normally, value may be overwritten
      • eg read x; read x (self-join): x may have changed
      • eg read x; read y and write y; write x: y may have changed but not x
    • may be needed anyway for recovery (“before” images)
    • requires GC

Concurrency Control and Recovery (MJF 97)

  • book chapter; solid overview/ref
  • serializability
    • conflict serializability: most widely used, since it’s efficient to detect/enforce
    • view serializability: less restrictive (allows more schedules), but impractical to implement
  • buffer management policies (steal/force)
  • recovery logging (undo/redo), WAL, checkpoints
  • 2PL, S/X locks, hierarchical locking
  • isolation levels: read committed, read uncommitted, repeatable read, serializable
  • recovery difficult, ARIES
  • alternatives not in paper
    • OCC typically unused because it is only suitable when resources are abundant
    • MVCC, aka snapshot isolation, keeps previous versions so that read-only xacts can access older versions; commonly implemented

On Optimistic Methods for Concurrency Control (Kung, CMU)

  • seminal paper argues for OCC and against locks
  • two versions of validation: serial and parallel
    • presents application in B-tree insertion and performance analysis
  • keep as ref for details on validation
  • no story for recovery!
  • resources: 1 2

The serializability of concurrent database updates (Papadimitriou 86)

  • seminal paper on serializability

“Applying Thread-Level Speculation to Database Transactions.” Chris Colohan, CMU 06

  • goal: use parallelism to reduce latency; achieved in 4-way sim by 46-66% for 3/5 tpc-c xacts

  • user divides transaction up into ‘epochs’, which are run in separate threads (called epochs to differentiate from explicit threads in the transaction)

  • asks for changes to software, hardware; also proposes a new software engineering approach (in this case, iteratively optimize the dbms to remove dependencies for tls parallelism)

  • epochs restart on violation; hardware creates sub-epochs to tolerate violations

  • hardware detects RAW conflicts as they occur; large speculative changes are buffered; lot of discussion about hardware changes

  • modified bdb; ran tcpcc; simulated harware

  • some technique details
    • delay operations until non-spec (locks, resource release)
    • escape spec: resource allocation
    • traditional par: mem alloc, buf pool, err chks, false sharing
  • speculative xacts can interact with non-spec xacts TODO details

  • TODO crawl refs

Sagas (molina & salem 87): technique for intra-transaction parallelism that doesn’t require modifying DBMS

  • context: long-running xacts that can be broken up = sagas
  • useful for reducing contention, esp with more common shorter xacts
  • issues: how to implement, but also how to change dbms/xacts to make them more feasible to obtain in first place
  • composed of several DBMS-visible xacts
  • “compensating xacts” needed to undo individual xacts (not always possible to get back consistency)

TP monitors

  • evolved from work on sagas
  • important in 3-tier archs that use load-balancing
  • allow some xacts to execute in parallel

Transaction chopping (Shasha 95)

  • static analysis of data dependencies -> conservative ‘chopping’ into smaller xacts (more concurrency)

Intra-transaction parallelism in the mapping of an object model to a relational multi-processor system

  • ORM using replication and view materialization for fast reads
  • ITP by decomposition into shorter parallel relational ops

Parallel Database Systems (dewitt gray 92)

  • pipeline (task) parallelism/dataflow model, data parallelism/partitioning
  • perf: shared mem < shared disk < shared nothing
    • shared mem: “interference” (access contention) bottleneck
    • shared-nothing for linear speedup, linear scaleup
      • obviously would be nice, but no discussion on what conditions are suitable for this!
  • problems: startup, interference, skew
  • data partitioning: cluster related data together
    • range partitioning is simple but has data skew (if non-uniform distribution)
  • parallel operators
  • systems: Teradata, Tandem, Gamma, Super Database Computer, Bubba, …

StagedDB (Stavros, Anastassia, CMU)

  • STEPS: improve perf of cache-resident code for OLTP
    • share data among concurrent queries by breaking queries up into stages
    • added STEPS to Shore
  • QPipe query engine monitors queries for shareable results, then dispatches the results to the appropriate queries
    • eg different selections can share same scan
    • built on BDB

Encapsulation of Parallelism in the Volcano Query Processing System

(“case for shared nothing architecture” stonebraker (86))[]


  • Wisconsin Multiscalar & Trace Processor
  • CMU STAMPede


Cilk (CEL MIT 95)

  • C language extensions and runtime
  • primitives: spawn, sync
    • cactus stack (this sounds vacuous – doesn’t seem necessary to change anything about the stacks)
    • clearly does not support C! explicit sync is the first hint. zero data synchronization, zero speculation
    • ‘spawn’ = 2-6x proc call, so “scales down” to single-processor
    • supports malloc
    • vs. futures?
  • theory (actually pretty obvious!)
    • theorem from ’68: greedy scheduling achieves time to run on P processors is upper-bounded by (time to run on 1 processor)/P + (time length of critical path)
    • corollary: greedy scheduler achieves ≤ 2(optimal)
    • corollary: near-perfect speedup when P much smaller than (time to run on 1 processor) / (time length of critical path)
  • scheduling
    • each processor has work deque; spawns push onto this
    • once depleted, steal from other end of another deque
    • load balancing: always maintain busy leaves; protects against space exhaustion by letting a thread continually spawn work (a non-leaf thread)
  • misc
    • inlets: separate accumulator threads; seems completely uninteresting and unnecessary
    • aborts: to support “breaks”, use abort to terminate other children (use from within inlets, add other code)
    • speculation in alpha-beta pruning of minimax adversarial game search trees: “young brothers wait” – in an optimal (best-order first) game tree, either 1 child is searched, or all are searched
    • more recently: include page faults as measure of locality
    • nondeterminator tool: finds data races in program executions
    • JCilk: adds exception handling; compiled via jgo (gcj with goto)
  • implementation: cilk2c translator + cilk rts
    • fast (serial) and slow (parallel) version of each cilk procedure
    • default to fast, switch to slow if thread is stolen (restores stack vars)

Jade (Rinard)

  • did not process this in any detail
  • annotate with data usage information
  • tasks interact with shared objects
  • compiler uses this to relax program’s sequential order
  • cons
    • identify tasks to parallelize
    • cannot express algos requiring bi-directional comm
    • assumes single addr space

StreamIt (MIT)

  • filters, pipelines, splitjoin, feedbackloop
  • comm over infinite FIFO
  • static-rate (isochronous) streams
  • nicely statically parallelizable. 3 compilation steps:
    • partitioning: merge/split filters to get N load-balanced filters
      • aka fusion and fission
      • subtle issues; eg, a peeking filter can introduce state upstream (affecting data parallelism transparency)
    • layout: use simulated annealing algo and a communication cost function to arrange filters on processors
    • scheduling: map FIFO to hardware comm network
  • targets RAW arch
  • cons
    • restricted prog model: isochronous streams, no interleaving sends/receives
    • compiler: highly hardware specific
  • future directions
    • DMA achieves high utilization
      • experimented on Cell, which has things like 2D transformations; transformations too limited
      • interested in more complex transformations, eg varying strides
    • analytical model for understandign/predicting performance behavior (on various archs)

Deterministic Multithreading (Marek MIT)

  • apps
    • aside: CAD tools tend to yield non-deterministic results (but can be seeded)
  • deterministic locking framework
    • deterministic lock acquisition ordering
    • use pseudo-time rather than real-time; ensure that pseudo-time is not too far off from real-time
    • currently, pseudo-time is number of stores, using perf counters; actually deterministic
    • lots of comm overhead (70%); might be improved by new hardware features

Orca (Kaashoek)

  • data model: shared monitors with system-determined distribution/replication
  • parallelism: declare/fork processes (ie tasks)
    • pass objs by ref (shared) or by val (value)
    • no lightweight threads
  • condition synchronization: similar to guards in process/join calculi
    • operations have guards; block until any are true, then execute 1
  • RTS: read locally, write globally
  • high level of abstraction means hard to control perf

Lampson’s slides (06)

Sable (05)

  • sw-only approach: JVM with method-level TLS (and other features)
  • speculative versions of side-effectful bytecodes (25%); spec terminates on unsafe ops
  • threads fork at all callsites, joined on returns (forking heuristics implemented but unused)
  • design/techniques
    • return value prediction
    • heap buffering in open-addressing hashtables; on join, validate reads, commit writes
    • stack buffering separate (straightfwd)
    • alloc objs speculatively
  • eval
    • thread terminations mostly due to “signal from parent” (dependence violation??)
    • many correct specs, bunch of incorrect predictions, fewer dependence violations
    • 1-2x speedup

Evaluating MapReduce for Multi-core and Multiprocessor Systems (Stanford)

  • Phoenix: simplified MR built on C++ & pthreads for shared mem systems (both CMP & SMP)
  • features/characteristics
    • load balancing scheduler, pointer passing, buffer mgmt
    • still “materialize” between map & reduce
    • set input chunk size to cache size to balance between overhead and locality
    • fault tolerance/recovery: fail-stop; straightfwd
      • detection: worker time-out normalized by other times on similar tasks
      • recovery: restart/re-assign tasks; there are some details on buffer mgmt
      • future work: faults in scheduler, cross-node mem corruption (sandboxing)
  • experiments: 8 apps [including “enterprise computing”: word count], 8-core Sparc CMP, 24-chip Sparc SMP
    • [unmentioned: no disk; all datasets were small enough to fit in mem]
    • always compared speedup vs. sequential version
    • perf/scalability
      • 4 workers/core
      • frequently sublinear speedup; sometimes scaling on SMP -> slowdown
      • mat-mult superlinear due to cache agg
      • map dominates; short reduce/merge -> bandwidth saturation
    • pthreads vs Phoenix: the one that better fits the app performs better, due to issues unrelated to Phoenix per se, but more to restrictions of MR
  • related: MapReduce for the Cell B.E. Architecture (Wisc)
  • resources

The Problem With Threads (Lee, Cal 06)

  • threads are wildly nondeterministic; trying to prune nondeterminism
  • instead, start with deterministic primitives; explicit and judicious introduction of nondet
  • anecdote: his Ptolemy project had a major sleeper bug
  • possible models
    • Kahn process networks (PN)
    • synchronous/reactive (SR)
    • discrete events (DE)
  • seeing as how all attempts to remove threads remain esoteric, promotes coordination languages orthogonal to current languages
  • what about concern: interleaving (with IO) for efficiency?

Disciplined Message Passing (Lee, MSR 09)

  • [this talk turned out to be sorta uninteresting/aloof/masturbatory; not bothering to clean up these notes; did learn something about MPI though]
  • whipping boy: MPI
    • get your rank (ie process id) and figure out what to do based on that
    • ex: implementing a bool select: recv control bit on control source, then read from the data source
    • some modularity problems: specify direct source processes (no indir)
    • no type safety: must specify type of received object
    • send: standard allows *either( guaranteed delivery or buffering semantics!
  • problem about buffers: with 2 sinks on a connection, when should scheduler execute what?
    • if one proc consumes, other still needs, so can’t consume right away
    • can deadlock
  • naive schedulers all fail
    • is “fair” scheduling a good idea? no, …
    • is data-driven execution a good idea? no, eg bool select
    • is demand-driven execution a good idea? no, flip the prev example
    • most mixtures thereof
  • problems have been solved; ptolemy ii has director, which provides much more structure than msg-passing or thread lib; provides concurrent model of computation
    • correct execution
      • kahn least fixed point denotational semantics
      • kahn: functions that map streams to streams; these can be characterized using a nice topological framework as monotonic functions
        • eg if prefix of input, then prefix of output; provides ordering; functions are monotonic using usu. def’n
        • network of such processes can be ; has least solution; could be inf seqs
    • useful execution: extend at least one stream in finite add’l time
      • if correct execution satisfying criterion (1) exists that executes with bounded buffers, then a useful execution will execute with bounded buffers (prefer over buffer overrun)
        • hard to satisfy: trivial programs are undecidable: will it deadlock? can it be executed with bounded buffers?
  • synchronous data flow (SDF): lee’s phd thesis
    • boolean selects/switches no longer allowed (but muxes can work; muxes read from both at a time and discard one)
    • by itself too restrictive; other extension work, eg streamit
  • collective (higher order) operations

Unified Parallel C (UPC)

  • MYTHREAD, upc_forall, upc_barrier

Vivek Sarkar

  • projects
    • @stanford: SISAL (single assignment)
      • cons: haskell-like syntax
    • jikes rvm
    • @ibm: tools, X10
    • @rice: jabanero
  • collaborators
    • Kathy Yelick


  • TODO
  • JVM language; an extended subset of Java 1.4
    • removed/changed
      • no threads, synchronized, etc.
      • X10 arrays replace Java arrays
    • added TODO
      • async, finish, atomic, future, force, foreach, ateach, clocks
      • distribution: points, distributions
      • X10 arrays: multidim distributed arrays, array reductions, array initializers
      • serial constructs: nullable, const, extern, value types
  • partitioned global address space TODO
  • parent-child relationships for tasks (supposed to reduce cycles/deadlocks) TODO
  • talks
    • shared memory
    • messaging
    • threads
      • granularity (partitioning)
    • locality
      • data and computing mappings
      • colocation
      • affinity
  • places
    • flat set, disjoint
    • async(<place>) <stmt>; ...; finish { <stmt> }
    • or: hierarchical

Java semi-automatic parallelization (Danny Dig)

  • select code to parallelize and a transformation from the IDE
  • tool checks whether the xform is safe and performs the changes
  • eg: uphold class invariants and determine smallest critical sections

scalable reader writer locks (for .NET) (msr talk)

  • problem: acquiring read locks does a contended write
    • STM atomic perf was the best
    • slower than mutex lock on readonly workload; mutex can get away with hardware op on acquire (still write on release)
  • on current 8-way machine, must do 1000 empty calls within lock to avoid dramatic slowdown from contention
  • scalable, pure .NET lib, fairness, space efficient (difficult)
  • scalability: divide reader count into ‘subcount’ buckets (indexed with thread id hash)
    • acquiring write lock acquires all
    • each subcount has cache line sized padding
  • fairness: queue-based lock; nodes represent 1 writer or a bunch of readers (add to readers node if already at end)
    • can have multiple writers in a row but not readers
  • locking discipline
    • lock has tail and head ptrs
      • tail: lock to write/read consistently, but can sometimes dirty read
      • head: lock for null-non-null transitions
    • reader …
    • writer lock owning lock

      while true
        an = tail
        if an != null ad an is readers node
          increment lock count for thread for this thread
          if failed, continue loop
          wait until this node becomes the head
        else # null or writer
          increment lock count for thread for this thread
          lock this
            an = tail
            if an == null
              head = tail = rn; return
            elif a is readers node
    • IncrementLockCountForThread()

      id = thread.hash()
      index = id % nprocs
      lock arr[ind]
        if next != null return false
        return true
    • WaitUntilHead()

      for k in 0 to spincount
        if an == head, return
      lock an
        while an != head


  • Cache-Conscious Structure Layout
  • A Comparative Evaluation of Parallel Garbage Collector Implementations
  • []
  • parallel garbage-collection
  • Ptolemy project
  • Sequoia (Dally, Stanford)
  • BrookGPU (Stanford)
  • Chapel
  • Fortress
  • Titanium
  • Berkeley UPC
  • Borealis
  • Intel TBB
  • Ease, Carnap
  • In search of speculative thread-level parallelism
  • Dynamic Multithreading (DMT)
  • Stanford Hydra
  • Sun MAJC
  • goog: speculative multi-threading, thread-level speculation
  • Mojave, Caltech: Distributed Speculation
  • Data-Demultiplexing

Parallel Computing, Supercomputing, HPC

BFS for cell

  • I could only find this DDJ article, not the IEEE one


  • de facto std in computer clusters
  • comm protocol provides synchronization and comm primitives
    • eg point-to-point send/receive, gathering/reduction, barriers, meta-level stuff (process management), choosing topologies
  • does not address increasing internal/fine-grained concurrency (multi-core)
  • does not provide fault tolerance
  • low-level, difficult to use


  • more suited for a heterogeneous collection of computers than MPI
  • deals with failures with transparency (notifications)
  • VM concept provides set of dynamic resource manager/process control functions


  • for shared memory multiprocessing (fine-grained)
    • seen together with (complements) MPI
    • vs. using (eg) pthreads
  • simply sprinkle compiler directives (C PP pragmas)
    • spawn threads, synchronization primitives, copying, reduction
    • work-sharing (eg data-parallel for)
    • memory protection (private/shared regions & vars)
  • pros
    • simplicity: auto data layout, decomposition; typically no changes to serial code
    • incrementally add parallelism
  • cons
    • no numa awareness
    • no error handling


  • Futexes (what pthreads are implemented on)
  • PM2
  • Thinking Machines

Program Analysis


  • statically infer multi-variable access correlations using some heuristics with rel. simple data mining theory/techniques
  • detect inconsistent updates and multi-var concurrency bugs
  • OK false positive rate, found a number of real bugs

Massively Parallel/Distributed Programming Models

Space-time programming/computing (Sussman, Project MAC 05)

  • Amorphous Medium Language (AML)
  • “statistical computing”; natural fault tolerance
  • space (and time) are explicit: useful for exploiting locality in space and time
  • very pie-in-sky, but seems like a natural direction to head toward
  • see region streams/Regiment, GHT

Spatial logics


  • K42/Tornado
    • Gamsa-Tornado.pdf
    • Appavoo-K42.pdf
    • Krieger-HFS.pdf
  • Xen
    • Barham-Xen.pdf
  • McRT
    • Saha-CMPScalability.pdf
  • TxLinux/MetaTM
    • Rossbach-TxLinux.pdf
    • Ramadan-MetaTM_TxLinux.pdf
  • MapReduce
    • Ranger-MapReduce_SMP.pdf
    • He-MR_Mars.pdf
  • Scheduling
    • Tam-ThreadClustering.pdf
  • OLS large scale NUMAs
    • Gough-LinuxScalability.pdf
    • Sarma-SelectedLinuxScaling.pdf
    • Bryant-ExtremeLinuxScaling.pdf
  • Berkeley Parallel Computing Research
  • Memory Allocation
    • Mckenney-ParallelMemAlloc.pdf
  • MP Web Server
    • Veal-MP_WebServer.pdf


A View From Berkeley

  • lots of other material, but only covering systems perspectives here
  • rely less on compilers, more on autotuners
    • compilers hard to enhance
    • search-based autotuners gaining popularity for scientific code
      • optimize library kernels by generating many and testing perf on target platform
    • may apply to parallelism, but search space much larger
      • for given problem, many possible par algos, data layouts
      • perhaps decouple the problems, incorporate perf models
        • characterize system using perf tests, plug these into perf models
  • deconstructing OS support (somewhat disconnected)
    • increasing need of protection in embedded computing (eg cell phones)
    • traditional OSes: too large/brittle to support radical innovation and have a lot of particular support for legacy apps
    • VMs popular due to protection (attacks), isolation (software failures), migration (hardware failures)
    • Rosenblum: OSes should be libraries linked by app atop thin VMM
    • VMs enable end apps to select only portion of OS capabilities needed

Locating Cache Performance Bottlenecks Using Data Profiling (alex pesterev, pdos talk)

  • focus
    • apps bottlenecked by cache accesses
    • how to find these bottlenecks
    • how to find their causes
  • motivating example
    • producer to queue to consumer
    • if prod, cons on sep cores, and prod writes to queue (onto own cache), then cons has to pull data in from prod, resulting in misses
    • want prod, cons on same core; this is actually faster
  • memory levels and access times in cycles on amd opteron
    • L1 3, L2 14, L3 50, mem 250, L1-to-L1 120
  • detecting the problem
    • profiler can detect cache misses, but reason unknown: capacity? sharing?
    • need tool to help find sources of cache miss problems
    • hard bc requires finding 2 events: cache miss and cache invalidation/eviction
  • dprof profiles data accesses and presents to user:
    • which data types suffer cache misses
    • for each data type, how much space it takes up in cache
    • classifies the cache miss reason
    • which set of functions access a data type
  • design
    • data cache misses: core expects data to be in cache but it’s not
    • cache miss types/causes and solutions
      • false sharing: separate falsely shared cache line into 2+ cache lines
      • true sharing: factor data into multiple pieces or assign to single CPU
      • conflict (in n-way caches): allocate over a wider set of associativity sets
      • capacity: try to extract some locality or establish admission control (limit how much work CPU does at a time)
    • views
      • data profile view: data types ordered by # misses
      • classification view: which of the 4 kinds of misses (see above)
      • working set view: which data types most active, how many of each at any given time, and what cache assoc sets used by each
      • flow view: seq of functions that access data type
        • shows how a data type is processed
        • indicates pts at which data migrates btwn cores
  • profiling memcached
    • 16 memcached’s, ea pinned to 1 core, and 1 off-host client per core
    • single memcached perf: 1.1M req/s, 80 L2 misses/req, 1.4% L2 miss rate, 70 L3 misses/req, 1.2% L3 miss rate
    • oprofile report: not very helpful
    • dprof: reports high-miss data types as bouncing a lot; then track down to transmission control flow in flow view
  • dprof impl
    • 2 kernel modules
    • path traces built from access samples and obj access histories
    • access samples: amd instruction-based profiling
      • [this seems to be the key to everything]
    • for each obj access: offset into obj, L1/L2/L3 misses, avg miss latencies, timestamp, CPU change
  • overhead