Algorithms

TODO bertsekas’ auction algorithm (max weighting in bipartite graph) TODO merge in notes from CS170 TODO http://www.catonmat.net/blog/summary-of-mit-introduction-to-algorithms/ TODO http://cstheory.stackexchange.com/questions/189/algorithms-from-the-book

TODO

  • basics
    • the chinese remainder theorem (really an algo)
    • the euclidean algorithm
    • gaussian elimination
    • linear programming
    • primality testing
    • fast matrix product
    • tsp
    • integer factoring
  • combinatorial optimization
    • minimum spanning tree
    • shortest path in graphs
    • for-fulkerson max-flow algorithm
    • gale-shapley stable marriage
    • general matching in graphs
  • randomized algos
    • primality testing
    • volume of convex bodies
    • counting matchings
    • min-cut of graphs
    • string matching
    • hashing
    • gibbs sampling
    • markov chain monte carlo (MCMC)
  • heuristic algos
    • local search
    • shotgun sequencing algorithm
    • simulated annealing
  • real-world impact
    • fast fourier transform
    • rsa encryption
    • miller-rabin primality test
    • reed-solomon codes
    • lempel-ziv compression
    • page rank of google
    • consistent hashing of akamai
    • viterbi and hidden markov models
    • smith-waterman
    • spectral low rank approximation algorithms
    • binary decision diagrams
  • info theory
    • digital fountain codes
    • cdma
    • compressive sensing of images
  • http://rjlipton.wordpress.com/2009/10/27/highlights-of-focs-theory-day/

strategies

  • if you need to find say an O(n)-time algo, think about how to make use of O(n) space (dynamic programming); consider cases where the reuse is not exactly straightforward, e.g. palindrome
  • look for ways to simplify the problem by eliminating entire classes of stuff from the solution space, e.g. max j-i where A_j > A_i

randomized algorithms

  • las vegas: always gives correct results; runtime resources may vary
    • eg quicksort (rand pivots)
  • monte carlo: deterministic runtime; output may be incorrect w certain prob
    • eg min cut

misc terms

  • binary decision diagram (BDD): binary tree representation of truth table; each layer switches on one input

bit tricks

core concepts

hash tables

  • direct chaining: list per bucket; optional move-to-front heuristic
  • linear scanning aka open addressing: try other slots
    • try h, h+1, h+2, …
    • try h_1, h_2, h_3, …
  • cuckoo hashing: k (usu. 2) hash fn’s
    • insert: into one of the 2 locs, displacing any resident and repeat the insertion of the resident, possibly evicting multiple keys
    • lookup: inspect 2 locs
    • inserts succeed in expected const time, even amortizing table rebuilds (growths), as long as load factor <50%; with 3 fn’s, up to <91% load
    • also considering n-way tables; with 2 keys per bucket, up to <80% load
    • much faster than chained hashing for small, cache-resident hash tables on modern processors
    • n-way buckets faster than conventional methods for large tables

hash functions

  • rotating hashes: for open addressing, avoid computing a bunch of different hash functions; just compute one long hash value and apply sliding window on it to get a bunch of shorter hash values
  • implementations
    • second-generation: murmurhash
    • third-generation: cityhash, spookyhash; faster, greater ILP

bloom filter

  • TODO
  • TODO scalable bloom filters
  • m = -n*ln(p)/(ln(2)^2), where m is # bits given n elts, desired false prob p
  • k = 0.7*m/n, where k is # hash fns
  • sometimes bits are shared by all hash fns; sometimes partition up into disjoint ranges for each fn; same asymptotic behavior
  • Network Applications of Bloom Filters: A Survey

prime sieve

given n, return all primes up through n
isprime = [true] * n
for i in [2..n]
  for j in [2i, 3i, ..., n]
    isprime[j] = false
return [ i for i in [2..n] where isprime[i] ]

gcd

more concise (yay mod operator):
gcd(a,b) =
  if b = 0, a
  else,     gcd(a, b % a)

explicit:
gcd(a,b) =
  if b = 0, a
  if a < b, gcd(b, a)      // swap
  else,     gcd(a, b % a)

self-balancing binary search trees

  • red-black
    • leaf nodes don’t contain data; sometimes use single sentinal
    • root, leaf nodes are black
    • red nodes have black children
    • every simple path from any node to any descendant leaf contains same number of black nodes
  • AVL
  • splay: tree-rotate (“zig-zig/zig-zag”) recently accessed node to root

string search

  • knuth morris pratt
  • boyer-moore
  • boyer-moore-horspool

hashing

  • locality sensitive hashing
  • perfect hashing
  • universal hashing
  • linear probing, linear hashing

spatial indexing

  • quadtree: 4-ary tree (child per quadrant)
  • geohashing: treat each quadtree as trie, and address the quadrants as 00, 01, 10, and 11; then leaves are 00 11 01 10 00
    • to query region intersection, can just choose leaves for tightest bound, but that’s slow/numerous
    • instead, specify max number of regions, then recursively choose the quadrant that when subdivided will remove the most area while staying under max
    • visit selected areas in “Z” order; this limits the continuity
  • hilbert curves: “U” order; better locality behavior [for convex hulls?]
    • a type of 1D fractal known as a space-filling curve
    • can translate (x,y) to hilbert curve pos
  • http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves

bin packing

  • NP-hard
  • items of diff sizes must be packed into identical bins; minimize # bins
  • formally: given bin size V and item sizes a_1,\dots,a_n, find int B and B-partition S_1 \cup \dots \cups S_B of {1,\dots,n} such that \sum_{i \in S_k} a_i \le V for all k \in 1,\dots,B

integer linear programming aka integer programming

  • NP-hard
  • maximize \mathbf{c}^T \mathbf{x} subject to A \mathbf{x} \le \mathbf{b}

linear programming

  • simplex algorithm: simplest; usu efficient but poor worst-case behavior
  • both simplex-based methods and interior point methods have similar efficiency for routine linear programs
  • in P with ellipsoid algorithm or interior point methods
  • dual problem/duality: complement to the primal problem
    • linear case
      • primal objective function combines n vars w m upper-bound constraints
      • dual function combines m vars w n lower-bound constraints
    • all LP have strong duality: optimal values of primal & dual are equal
    • lagrange duality/lagrange multiplier: TODO

priority queues

  • binary heap: maintain invariant that each node is \ge children
    • add to bottom and take from top; bubble up/down as necessary
  • binomial heap: bin-heap that supports fast (log-time) merges, but log-time (non-const) min-peeks
    • special tree structure: a binomial tree is defined recursively
      • order 0 tree is single node
      • order n tree’s children: roots of trees of orders k-1,k-2,\dots,1,0
    • a heap is collection of O(\log n) trees
    • to find min, inspect all roots
    • merges merge trees of same order (each is O(1) op)
    • inserts are merges of order-0 trees, O(1) amortized
  • Fibonacci heap: O(1) insert/decrease/merge; log-time find/del-min
    • TODO
  • for heaps implemented as linked objects, keeping pointers to the nodes enables things like deleting specific nodes or relaxing their weights

randomized

sampling

  • reservoir sampling: sample n elts from stream of unknown len in 1 pass
    • keep buffer (reservoir) of n elts, where elts are x_i
    • \forall x_i, i>n, with prob n/i replace random buffered elt with x_i
    • prob n/i is prob of no combination of n elts containing x_i (or any particular elt in a set of i elts)
  • weighted reservoir sampling: each elt has weight
    • maintain total seen weight and total reservoir weight

random variate generation

sort

TODO heap, insertion, selection, bubble, radix, quick

in-place quicksort (c.a.r. hoare)

  • select a pivot, possibly doing so using sampling
  • rather than building lists less/equal/greater, walk the list from both ends and swap as necessary (to construct sublists in-place)

timsort

dynamic programming

dynamic time-warping

  • general term for matching up two sequences that are varying in time/speed
  • one simple formulation for sequences of discrete symbols is edit distance but only allowed ops are insertions and deletions

edit distance

  • levenshtein distance: the standard way of measuring distance; ins/del/upd
    • damerau-levenshtein: also transposition of two chars
  • hamming distance: considers only substitutions in same-length strings

subsequence sum

  • given x_i, A_i is max subsequence sum up to and including x_i

    A_i = \max{ A_{i-1}, 0 } + x_i

compression

  • huffman coding
    • variable length coding: more freq symbols given shorter codes
    • optimal (but not trivial to show)
    • pseudocode: build tree bottom-up
      • O(n) assuming you have symbols sorted by frequency
      • start w ea symbol having its own root node; node freq = freq of symbol
      • iterate until one root node
        • merge two nodes with min freqs; new parent freq is sum of child freqs
    • doesn’t yield ordered trees aka optimal BSTs aka alphabetic trees; need hu-tucker algo for that
  • LZ77
    • sliding window compression, replacing strings with backrefs ((length,offset) pairs)
    • strings must be within window size apart
  • LZ78: like LZ77 but looking forward as well; less popular
  • lempel-ziv-welch (LZW): patent-encumbered modification of LZ78

    dict = dict((chr(code), code) for code in xrange(256))
    loop
      find the next max-len substr W of input that exists in dict
      replace it with its dict code
      dict[W+next char] = next code
  • deflate
    • intended to be patent-unencumbered replacement for LZW
    • LZ77 then huffman coding, but uses one huffman tree for char literals and length codes, and another for backwards offsets
  • zip
    • permits multiple compression algos, but deflate dominant
    • can take collections of files, but compresses each individually, so can’t exploit redundancy across files
  • LZMA: lempel-ziv markov chain
    • used in 7z
  • burrows-wheeler transform (BWT)
    • sort all rotations of a string (padded with start and end) and take last col
    • reversible: last col gives all chars; sort to get first col; last and first give all pairs of successive chars
  • bzip2: BWT + MTF transform + Huffman coding; parallelizable; slower than gzip
    • move-to-front (MTF) transform: decrease entropy; usu useful only after BWT
  • prediction by partial matching (PPM): smaller/slower than bzip2
  • lempel-ziv-oberhumer (LZO): very fast decompression
    • block-based (parallelizable)
    • similar compression speed as deflate
    • added to hadoop by twitter
  • zippy: google’s fast compression algo
  • implementations:
    • gzip: same as zlib internally but diff headers/trailers
    • zlib: deflate; compresses single files (freq. tar); not parallelizable

On Reducing the Size of Compressed Javascript (by up to 20%)

  • browser only supports gzip; implementing lzma/ppm in js too slow
  • gzip = LZ77 + huffman; huffman uses separate huffman tree for backwards offset
    • arrange for the backwards offsets to be the same
  • sort functions to cluster by similarity, eg tf-idf, edit distance
    • impl: use edit distance, then recursively combine best-matching pairs
  • other tricks:
    • larger-base charset for var renames
    • rename from bottom-up scopes to reuse vars
    • stable ordering of renames (so always get f(a,b,c))

high-perf linear algebra

TODO singular value decomposition (SVD)

  • many apps in signal processing and statistics
  • important factorization of rectangular real or complex matrix

TODO fast-fourier transform (FFT)

  • compute (inverse) discrete FT
  • FFT on columns
  • multiply by twiddle factors
  • FFT on rows
  • transposes
  • butterfly diagram: used to describe FFT; portion of compute that combines smaller DFTs into larger DFT

TODO parallel HPC algorithms

coding/information theory

error detection coding

  • repetition codes: just repeat the original signal
  • checksum: modular sum of words/bytes in msg
  • parity bit: simplest 1-bit checksum
    • even: 1 iff # ones is even; ditto for odd
    • special case of CRC, where 1-bit CRC is generated by polynomial x+1
  • cyclic redundancy check: non-secure hash fn based on cyclic codes
    • characterized by generator polynomial
    • there’s a table of commonly used/standard CRCs
  • crypto hash function: infeasible to find input with same hash
  • eg: Ethernet has CRC-32, IPv4 has 2B checksum over header, IPv6 has none, UDP has checksum over UDP/IP headers (optional over IPv4), TCP has checksum over TCP/IP headers with ARQ via triple-ack or timeout

checksums

  • internet checksum: 2 bytes; used in IPv4, TCP, UDP headers; very weak (amazon outage)
  • adler-32: 4 bytes; stronger; still efficient to compute; weak for short msgs
  • fletcher’s checksum: slightly stronger than adler-32; slightly slower to compute
  • CRC: good middle ground; studied by mathematicians; easily implemented in HW
    • CRC32C: particularly strong for communicating apps; implemented in SSE4.2
      • slicing-by-8: best SW algo for CRC32C
  • http://evanjones.ca/crc32c.html

error correcting code (ECC): usu. mean FEC

  • automatic repeat request (ARQ) aka backward error correction: use error detection codes so receiver can request re-xmits; more of a protocol
  • hybrid ARQ (HARQ): combine FEC for minor errors & ARQ for major errors
  • forward error correction (FEC): always include ECC
    • convolutional codes: bit-by-bit; esp. suitable for HW
      • Viterbi decoder: optimal decoding
    • block codes: block-by-block
      • early, inefficient
        • repetition codes
        • Hamming codes: single bit errors
        • multi-dimensional parity-check (MDPC) codes: put msg in grid & calc parity for each row & col
      • efficient
        • Reed-Solomon codes: popular; corrects d errors with 2d check symbols
        • turbo codes: near-optimal
        • low-density parity-check (LDPC) codes: near-optimal
        • BCH codes: handles random and burst errors
      • fountain codes aka rateless erasure codes
        • “potentially limitless sequence of encoding symbols can be generated from a given set of source symbols such that the original source symbols can ideally be recovered from any subset of the encoding symbols of size equal to or only slightly larger than the number of source symbols”
        • LT codes: first practical realization; simplest
          • for each generated block, combine (xor) random sample of source blocks
          • sample size is sampled from a special distribution, the (robust) soliton dist

misc

feed-forward bloom filters (dga)

  • high-level idea: for 2-phase exact pattern matching on some target, first filter the target with bloom filter built from the patterns, then traditional-grep smaller datasets
  • assumes that the matches are all of the same len (else what are the target extracts that you should be hashing against the bloom filter?)
    • take the min len; if too small then handle those patterns separately
  • feed-forward bloom filters are regular bloom filters but where each positive test flips on bits in a second “shadow” array (initially zeroed), which can then be used to filter out irrelevant patterns
  • paper analysis shows high prob that there will be only few patterns in second array that don’t get any matches

cardinality estimation

cardinality estimation: k minimum values (KMV) sketch

  • simple & intuitive: remember just k minimum hashes and extrapolate cardinality from interval between hashes
  • can do set unions/intersects; eg hyper log log doesn’t support intersects
  • compression: can discard high order bits (state shrinks as set grows)
  • expansion to multisets (counts duplicates): AKMV
    • add counter to each hash; eg hyper log log doesn’t support these
  • http://blog.aggregateknowledge.com/2012/07/09/sketch-of-the-day-k-minimum-values/

count-min sketch

  • similar to bloom filter, but instead of 1 array of bits, have k arrays of counters (one for each of the k hash fn’s), and each element is a number not a bit (representing counts)
  • basically an approximate histogram / summarizes large vectors
  • operations
    • add: increment the elements that the hash functions point to
    • remove: decrement the elements that the hash functions point to
    • count: return the min of all the elements that the hash functions point to
  • can optionally implement conservative add: only increment the array with the min counter value
    • but decrement operation is no longer valid
  • simple descrip / used in statistical counting of passwords in MSR paper http://research.microsoft.com/pubs/132859/popularityISeverything.pdf

count-mean-min sketch

streaming quantiles TODO

stream-summary

universal hash families

  • if H is a family of functions h:M \rightarrow N where M={0,1,\dots,m-1},N={0,1,\dots,n-1},m \ge n:
    • weak universal hash family aka weak 2-universal hash family aka 2-universal hash family: \forall x,y \in M, x \ne y, P[h(x)=h(y)] \le 1/n
    • strongly 2-universal hash family aka (strongly) 2-independent hash family: \forall x,y \in M, a,b \in N, P[h(x)=a \wedge h(y)=b] \le 1/n^2
    • (strongly) k-independent universal hash family: \forall x_1, x_2, \dots, x_k \in M, a_1, a_2, \dots, a_k \in N, P[h(x_1) = a_1 \wedge h(x_2) = a_2 \wedge \dots] \le 1/n^k
    • “weak” commonly elided from “weak hash function” bc 2-universal hash families tend to also be 2-independent
  • carter and wegman’s hash family: h_{a,b}(x) = (ax+b \mod p) \mod n
  • multiply-carry: fastest 2-universal hash family for ints
    • is it strongly 2-universal?
  • http://blog.ezyang.com/2010/11/is-multiply-carry-strongly-universal/

XOR linked list

  • replace prev/next ptrs w a single field = prev xor next
  • starting traversal reqs knowning addrs of 2 nodes, not just 1

skip lists

  • for storing sorted list of items, like trees, but probabilistic
  • insertion, removal, contains: O(\log n); enumeration: O(n)
  • insert: decide # lists this node will be a part of
    • prob .5: make node part of lowest level only
    • prob .25: make node part of lowest 2 lists
  • delete: remove node from all lists it’s in
  • contains: traverse lists forward and down
  • http://igoro.com/archive/skip-lists-are-fascinating/

tries

  • trie: ordered tree where keys usu strings/radixes and all descendants of a node have a common prefix of the key w that node
  • patricia trie aka radix tree: space-optimized trie where each node w only one child is merged w its child
    • thus every internal node has at least 2 children
  • hash array mapped trie (HAMT): persistent version popularized in clojure
    • basically, high-fanout (very flat - 32x) radix tree keyed by hashing
    • use bitmap so nodes w <32 items can allocate only as many words as nec and bitmap identifies which items are present (population count aka hamming wt)

binary space partitioning (BSP) trees

  • TODO recursively partition space; apps in graphics
  • kd-tree aka k-dimensional tree: special case
    • organizes k-dim points (this is important)
      • binary tree, each internal node is a hyperplane splitting some dimension
      • remember though that each node also corresponds to a point!
      • think decision tree recursively splitting different continuous dimensions
    • useful for range searches, NN searches
      • NN search
        • go down tree as if inserting query point
        • once at leaf, save it as current best
        • walk back up tree until back at root
          • if current node closer, update current best
          • any points on other (sibling) subtree that could be closer? (does hypersphere with radius from query to current node intersect with other subtree?)
            • if so, recurse down to leaf of other subtree
    • various construction algos
      • canonical construction
        • as we move down the tree, we cycle through the dimensions; eg for 3D tree, level 1 splits x, 2 splits y, 3 splits z, 4 splits x, etc.
        • split on median point
    • not efficient for high dim; number of data points n should be much larger than 2^k

BK Tree

  • useful for spell correction/fuzzy searches
  • structure
    • one node per word
    • each word has edges such that edge i links to sub-tree of words all (Levenshtein) distance i from current word
  • querying: given term q and max dist m
    • start at root node and repeat the following at each traversed node
    • let d be distance btwn current node and q
    • use triangle inequality: follow only edges d-m \dots d+m
  • building: start w arbitrary root
    • recursively traverse tree using distance until hit leaf, then insert
  • properties
    • N-ary and irregular but generally well-balanced
    • searching w dist 1 traverses <5-8% of tree, 2 traverses <17-25% of tree
  • not a great explanation but not many out there http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees

shotgun genome assembly

  • de brujin graphs commonly used in genome sequencing
  • abstraction: reassemble full string from (2B) overlapping fragments
  • preparation: k-shingle fragments, store shingles into bloom filter
  • reassembly: start w some string, and recursively try appending every char to it and verifying it exists in the BF

    def extensions(prefix):
      """Yields extensions of prefix reachable via de Brujin graph (via
      shingling checks against bloom filter), or the prefix if no more
      extensions."""
      word = prefix[-k:]
      cs = [c for c in allchars if word+c in bf]
      for n,c in enumerate(cs):
        for ext in extensions(prefix+c):
          yield ext
      if cs==[]:
        yield prefix
    • exponential search tree but should cut off dead-ends quickly?
  • can’t directly assemble w this (false connections, also many convenient graph properties), but can use data structure to grok graph properties and eliminate/break up data before feeding into other programs
    • eliminate small graphs
    • find and shard out disconnected partitions
    • local graph complexity reduction & error/artifact trimming
  • http://www.slideshare.net/c.titus.brown/pycon-2011-talk-may-not-be-final-note

similarity with shingling

  • jaccard(set(shingles(stream1)), set(shingles(stream2)))

find majority item in stream

  • pseudocode from http://unmaintainable.wordpress.com/2010/02/21/finding-majority-in-stream/:

    maj, n = None, 0
    for x in xs:
      if x == maj: n += 1
      elif n == 0: maj, n = x, 1
      else: n -= 1
    return maj
  • assumes that there is a majority item; requires 2nd pass to verify this if unknown
  • can generalize to find all k items that exceed 1/k fraction of all items; useful for finding outliers in power law dist

automatic differentiation