Math

signals

  • fourier, z, laplace, wavelet, hilbert xforms
  • butterworth filter

statistics

  • frequentist statistics
    • Point estimates trade off bias and variance
    • Interval estimates have the goal of achieving nominal coverage and the goal of being informative
    • Tests have the goals of calibration and power
  • frequentists express prior info not in dist but in choice of methods—may still be too restrictive

  • prospective vs retrospective studies
    • prospective: cohort is random sample of population; watch them; may be very skewed (eg if studying rare disease)
    • retrospective: already know cases & controls; sample both evenly
  • ascertainment: just means sampling (ascertainment bias, ascertainment method)

  • TODO mixed effects

  • characteristic function = fourier transform

  • TODO http://itunes.apple.com/us/itunes-u/statistics-110-introduction/id495213607

  • bootstrap hypothesis testing: for 2 groups, merge them, generate a bunch of bootstrap samples, and see how frequently they generate means as different as what’s seen (the p-value)

  • areas: descriptive, inferential

  • moments
    • positive skew: lumped to the left (counter-intuitive)
    • positive kurtosis: pointier peak
  • sample variance: same as var but divide by n-1 instead of n

  • mean deviation: the mean absolute deviation from the mean
    • harder to eval in closed form
    • used by TPC To calculate large deviations from RTT by calcing EWM deviation
  • weak law of large numbers (prove using Chebyshev’s ineq): \lim_{n \rightarrow \infty} P(|\bar{X_n} - \mu| < \eps) = 1
  • strong law of large numbers (more complex proof): P( \lim_{n \rightarrow \infty} \bar{X_n} = \mu ) = 1
  • central limit theorem: mean/sum of sufficiently many independent random variables, each with finite mean & variance, is approximately normal
    • as sample size increases, sample mean distribution approaches normal
  • standard error of the mean: standard deviation of the sampling distribution of the sample mean (“sampling distribution of the sample mean” is the “distribution of the mean of samples”)
  • confidence interval: at a certain confidence level (eg 95%), the prob this experiment yielded a sample that contains the true pop param
    • “Were this procedure to be repeated on multiple samples, the calculated confidence interval (which would differ for each sample) would encompass the true population parameter 95% of the time.”
    • not for predicting prob of true pop param being in interval; pop param has actual value, not a prob dist
    • instead, indicates reliability of estimate
    • find lowest value of param (e.g. population mean or population binomial proportion) such that estimate (e.g. sample mean or sample binomial proportion) is still in innermost 95% of the sampling distribution of the estimate
      • innermost means excluding lowest/highest 2.5%
      • sampling distribution:
        • for normal data ~ N(\mu, \sigma), it’s ~ N(\mu, \sigma / \sqrt{n}), where \mu is unknown
        • for bernoulli data ~ Bern(p), it’s ~ Bin(n,p), where p is unknown
      • do the same for highest such value of param
    • closely related to significance testing: testing null hyp \theta=0 can be done by seeing if confidence interval for \theta includes 0 (w confidence level y=1-\alpha)
    • intervals calculated from (functions of) sample, so they’re statistics, functions of RVs, hence RVs themselves
    • margin of error: radius (half) of confidence interval
  • statistical error: deviation of a sample from unobservable true/theoretical value (eg person’s height vs population mean)
  • residual: deviation of a sample from observable estimated value (eg person’s height vs sample mean)

  • survival function: complement of CDF

  • 2 time series are cointegrated if linear combo is stationary
    • 2 time series are cointegrated if they share common stochastic drift
    • eg drunkard and his dog: both randomly constantly moving but periodically error-correcting; cointegrated of order zero
    • eg stock market index and its futures contract, pairs trading
    • eg, R-squared fails when dealing w non-stationary vars, leading to spurious regressions suggesting relationships when there’s none
    • Engle-Granger test: simplest detection approach TODO
  • order of integration I(p): min num diffs required to obtain a stationary series
    • eg 1,1,1: already stationary
    • eg 1,2,3: diff w lag-1 is stationary
    • eg 1,4,9,16,25: diff w lag-1 gives 3,5,7,9; second diff is stationary
    • similar to calculus integration/differentiation
  • simpson’s paradox: a correlation present in diff groups is reversed when groups are combined; often in social-/med-sciences when frequency data hastily given causal interps

  • TODO
  • (paired) student t-test
  • distributions
    • student t-distribution: something about normals for small populations where variance is unknown
    • chi-square distribution
    • Gaussian distribution, Cauchy, Exponential, Gamma, Generalized Extreme Value, Rician, Wrapped Cauchy, Von Mises, Binomial, and Beta
  • \mathrm{Correlation}(X, Y) = \frac{ \mathrm{Covariance}(X, Y) }{ \sigma_X \sigma_Y } where \sigma is the SD

  • seq of RVs has heteroscedasticity if they have diff variances
    • usu. error terms vary w each obs, often true in cross-sectional/time-series
  • copula, Gaussian copula: TODO

  • Kendall tau rank correlation coefficient \tau
    • similarity of orderings of data
    • TODO uses p-value somewhere

bayesians vs frequentists

  • views of probability
    • frequentist: P(H)=.5 turns up heads .5 of the time
    • bayesian: coin w P(H)=.5 deserves even-odds betting
  • views of stats
    • frequentist: ask what’s P(H=h) given that null hyp that true odds are .5
    • bayesian: estimate P(\theta=.5) vs P(\theta\ne.5) given #H & prior
  • frequentist
    • believes pop mean is real, but unknown and unknowable
  • frequentist estimator
    • find sample mean; construct confidence interval around it
    • can’t say “95% prob that true mean is in interval”; must say “95% prob over intervals that an interval contains true mean”, i.e. “if we were to repeatedly sample data, find mean, and construct confidence interval, 95% of intervals would contain the true mean”
  • frequentist tester
    • form hypothesis: “mean is blah”
    • accept or reject hypothesis with test to significance level
  • bayesian
    • only data is real; pop mean is an abstraction; construct a credible interval, centered near the sample mean and tempered by prior beliefs (sometimes very non-informative)
    • can say “95% prob that this interval contains the mean”
  • http://www.statisticalengineering.com/frequentists_and_bayesians.htm
  • http://www.rasmusen.org/x/2007/09/25/bayesian-vs-frequentist-statistical-theory/

conjugate priors

  • conjugate priors produce posteriors of the same algebraic form
  • an algebraic convenience that yields closed-form posteriors
  • beta dist is conjugate prior for binomial & bernoulli
  • dirichlet dist is generalization of beta and conjugate prior for multinomial & categorical
  • TODO

non-informative aka objective priors

probability vs statistics

test accuracy

  • precision = relevant & retrieved / retrieved
  • recall = relevant & retrieved / relevant
  • F1 score aka F-measure: F = 2 * (precision * recall) / (precision + recall)

statistical significance

  • standard score: number of SDs; aka z-value, z-score, normal score
  • this is frequentist hypothesis testing; Bayesian hypothesis testing is Bayesian inference
  • alt hyps
    • Fisherian testing: no alt hyp, only accept/reject null hyp
    • Neyman-Pearson: alt hyps
    • in practice: alt hyp is just negation of null hyp
  • p-value: significance level; P(observation or more extreme|null hypothesis)
    • assuming null hyp (no underlying relationship or diff btwn the variables), what’s prob we’ll see a diff at least as extreme as the one observed just by chance?
      • not same as saying, given that the alt hyp is true, we can replicate the results 5% of the time or 95% of the time; that’s called the statistical power of the design
    • the lower, the less likely the result if the null hyp is true, thus the more statistically significant the result is
    • if < .05 or .01, corresponding to a 5% or 1% chance of rejecting the null hyp when it’s actually true (type I error)
    • 5% is a typical value (border-line acceptability); in the end this is all arbitrary, no way to say something is/isn’t “really” significant
    • eg P(14+ heads of 20 flips|fair coin)
  • simple hypothesis: of the form \theta = \theta_0; specifies pop dist completely
  • composite hypothesis: of the form \theta > \theta_0; doesn’t specify population dist completely
  • two-tail: p-value is P(X>a \vee X
    • H_0 : \theta = \theta_0 vs H_1 : \theta \ne \theta_0
  • one-tail: p-value is P(X>x|\theta)
    • H_0 : \theta \le \theta_0 vs H_1 : \theta > \theta_0 (or vice versa)
    • problem is that the \theta is unknown (just something under \theta_0; TODO how do we justify the p-value, then, which must assume some value of \theta?
  • E-value: expected num tests until a certain observation, assuming null hyp
  • how to efficiently calc prob of at least h H in n flips?
  • type I error: false pos; rej null hyp when it’s actually true
  • type II error: false neg; acc null hyp when it’s actually false
  • mustn’t stop experiment as soon as significant results are detected
  • common “general format” of significance tests
    • they’re some ratio of (some measure of the differentiation common in the variables) / (overall differentiation of those vars)
    • eg: (part of overall differentiation of the WCC scores) / (overall differentiation of the WCC scores)
    • called ratio of “explained variation” to “total variation”
  • can only get to P(null hypothesis|observation) if we’re Bayesian and have a prior
    • Fisher violently opposes the concept of an alt hyp
  • many different and widespread interpretation pitfalls
  • quantile regression: estimate median or other quantile instead of mean
    • good for risk models
    • uses linear programming
  • robust/resistant statistics: around for 20+ yrs; not widely used; p-values reign for many non-ideal real-world reasons
  • z-test: appropriate only if you have a SD, else use t-test

statistical tests

  • many different statistical measures
  • power of a test is its probability of correctly rejecting null hyp (complement of false negative rate, \beta)
    • power function \beta(\theta) = P_\theta(X \in R), where R is rejection region (this is a diff \beta)
  • size/significance level of a test (\alpha): prob of incorrectly rejecting null hyp (false positive rate), or upper bound of prob of rejecting the null hyp over all cases covered by the null hyp
  • always want to choose the most powerful test (highest power under H_1) for a given size/significance level
    • finding most powerful tests is hard; most powerful tests don’t exist for many cases
    • 4 widely used tests: Wald, \chi^2, permutation, likelihood ratio
  • contingency tables: used by many statistical tests
  • test selection decision table: http://www.graphpad.com/www/book/choose.htm
    • the complexity here is a critique of the frequentist approach
  • chi-square: typically Pearson’s chi-square
  • Fisher’s exact test: for small categorical data; compares two groups
  • Pearson’s chi-square: for large categorical data
  • binomial test: for small categorical data
    • compare group against theoretical value (i.e. with-replacement; see binomial dist vs hypergeom dist)
  • non-parametric tests TODO: mann-whitney, wilcoxon
    • Kolmogorov-Smirnov (K-S) test: nonparam test for equality of continuous 1D prob dists; often used to test for GOF
      • “The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.”
  • sequential analysis: sample size isn’t fixed in advance; data is evaluated during collection; may conclude exp earlier
    • alt is bayesian exp design
  • Mann-Whitney-Wilcoxon RankSum test: useful to determine if 2 distributions are significantly different or not. Unlike the t-test, the RankSum test does not assume that the data are normally distributed, potentially providing a more accurate assessment of the data sets.
  • Chow test: whether 2 linear regressions on diff data sets are equal
    • also (in econometrics) for whether 2 time series over different times exhibit a structural break

data

  • censoring: limiting values to be within bounds
  • truncation: omitting values that are out of bounds
  • eg: if policyholders subject to policy limit u and deductible d, then losses are left-truncated (unreported if under d) and right-censored (limited to u-d)

time series (TODO also cover math)

  • stationary process: stochastic process whose joint prob dist doesn’t change when shifted in time/space, so mean/variance/autocorr don’t change
    • white noise is stationary; cymbal clash variance diminishes over time
    • i.e. “flat”, not trending over time
    • for a non-stationary signal, either fitting a curve and then taking residuals or differencing may give stationary data
  • autocorrelation: cross-correlation of signal w itself; a lagged self-correlation; to detect non-randomness; works if fixed intervals
    • IID iff 0 autocorrelation
    • Durbin-Watson test: conventional test to detect autocorrelation
      • 0 to 4; 2 means no autocorrelation
  • cross-correlation: sliding dot product of two signals; like convolution but no reversal
  • linear prediction: future values of discrete-time signal estimated as linear function of previous samples
  • autoregressive (AR) model: random process used to model natural phenomena; a type of linear prediction formula
  • vector autoregressive (VAR) model: popular in econometrics
  • Dickey Fuller test: whether unit root is present in autoregressive model
  • TODO: (partial) autocorrelation plots, ARMA, ARIMA, Ljung-Box for zero autocorr, unit root test for coint, Granger-causality, whiteness (iid-ness) & normality, VAR impulse response analysis and forecast error variance decomp,
  • TODO filtering: Hodrick-Prescott (HP), Baxter-King, Christiano-Fitzgerald (all popular in finance)
  • TODO Bayesian dynamic linear models (DLM)
  • repetition
    • seasonal aka periodic: fixed-length cycles
    • cyclic: variable-length cycles
  • cyclic/seasonal models

statistical inference

  • analysis of variance (ANOVA or AOV): TODO
  • canonical (variate) analysis: TODO
    • 1 predictor, 1 criterion: coef of corr r, coef of det r^2, coef \beta
    • n predictors, 1 criterion: multiple corr R, multiple coef of det R^2, weights \beta
    • n predictors, m criteria: canonical corrs \rho_1, \dots, canonical weights C,D
    • find linear combo of predictors and linear combo of criteria w max corr
  • kernel density estimator: basically place a small kernel everywhere you have a sample, resulting in somtehing much smoother than a histogram (where bins can make all the difference in interpretation)

  • mean absolute percentage error (MAPE): M = \frac{1}{n} \sum_{t=1}^n \left\vert \frac{A_t - F_t}{A_t} \right\vert

relationship btwn mean squared error and gaussian dist? (TODO cs229 notes)

outlier elimination

exponential smoothing for time series analysis

  • exponential moving average (EMA) aka exponentially weighted moving average (EWMA) aka single exponential smoothing (SES): does not spot trends
  • double exponential moving average (DEMA) aka linear exponential smoothing (LES): incorporates seasonality
  • triple exponential moving average (TEMA) aka triple exponential smoothing (TES): incorporates seasonality & trend
  • http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc43.htm

likelihood

misc

  • benford’s law: distribution of leading digits is non-uniform (exp decay)

pitfalls

  • unbiased estimators can be terrible
  • drug X results may have p=.04 and drug Y results may have p=.06 but difference may well NOT be significant (diff btwn sig & insig may well be insig)
  • p-value is not prob of hyp being right; to compute this (posterior), more info is needed (prior)
  • statistical significance != significance
  • data dredging: many sig tests will inevitably result in false pos results
  • example takedown: http://www.americanscientist.org/issues/id.14344,y.0,no.,content.true,page.1,css.print/issue.aspx

regression tips

  • from andrew gelman http://andrewgelman.com/2010/12/what_do_practit/
  • (wouldn’t take all thes as gospel)
  • pre-process your vars; discretize numeric, avg several related
  • usu fine to just treat discrete outcomes (eg 1-5 scale) as numeric; don’t worry about ordered logit/probit/etc., just run the reg already
  • take the 2 most important input vars in your reg & throw in their interaction
  • The key assumptions of a regression model are validity and additivity. Except when you’re focused on predictions, don’t spend one minute worrying about distributional issues such as normality or equal variance of the errors.

surveys

  • Likert scale aka rating scale: psychometric scale (5/7-point)
    • ordinal data, but not interval/linear data; requires nonparam tests

experiments

  • factorial design: 2+ factors, ea w discrete values/levels; experimental units take on all combos
    • eg compare 2 motors at 2 speeds (4 combos)

combinatorics

  • permutations without replacement: P(n,r) = n! / (n-r)!
  • permutations with replacement: n^r
  • combinations without replacement (partitioning sets): C(n,k) = n! / (k! (n-k)!)
    • this is the binomial coefficient
    • multinomial coefficient: partitioning into multiple sets
      • binomial is actually C(n; k, n-k)
      • C(n; k1, k2, …, kp) = n! / (k1! k2! … kp!)
  • combinations with replacement (multisets/bags): C(n-1 + k; n-1, k) = (n-1+k)! / (k! (n-1)!)
    • proof using “stars and bars”

mathematics

  • abstract algebra: study of algebraic structures such as groups, rings, fields, modules, vector spaces, and algebras
  • topology: extension of geometry that focuses on structure
  • algebraic topology: uses abstract algebra
  • topological graph theory: topology + graph theory; studies embeddings of graphs in surfaces
  • number theory
  • computational number theory, algorithmic number theory: applications in crypto

geometry

  • Pythagorean theorem
    • easy proof

Interesting questions that came up during ICFPC08

  • find the tangent lines (up to four) between a pair of circles
    • find the tangent lines between a circle and a point
  • if we roll a circle around the outside of an ellipse and trace the path of its center, is the resulting oval an ellipse?

  • circumcenter of a polygon = point equidistant to all vertices = center of circumscribed circle = intersection of all edge-bisecting lines

linear algebra

inverting matrices

  • Gaussian elim yields row echelon form (0s in lower-left)
  • Gaussian-Jordan elim yields reduced row echelon form (I)

From http://en.wikipedia.org/wiki/Sparse_matrix:

Conceptually, sparsity corresponds to systems which are loosely coupled. Consider a line of balls connected by springs from one to the next; this is a sparse system. By contrast, if the same line of balls had springs connecting every ball to every other ball, the system would be represented by a dense matrix. The concept of sparsity is useful in combinatorics and application areas such as network theory, of a low density of significant data or connections.

eigenvalues and eigenvectors

  • treating M as linear xform, which vectors would keep same direction (but possibly scaled)? (Mx=\lambda x)
    • M xforms unit sphere x^T x = 1 into ellipsoid where evecs are axes, evals are radii
    • if M has data pts as col vecs, covariance matrix M^T M has evecs as principal components, evals as their variances
    • shear: 1 evec; uniform scale: \infty evecs; non-uniform scale: 2 evecs; rotate: 0 evecs
  • power method or power iteration: eigenvalue algo
    • given mat A, find scalar \lambda (eigenvalue) and non-0 vector v (eigenvector) s.t. A v = \lambda v
    • doesn’t compute matrix decomposition so suitable for large sparse A
    • but will only find one eigenvalue (with max val, aka dominant eigenpair) and may converge slowly
    • start with random vector and iteratively multiply by A and normalize
    • used in pagerank, HITS
    • http://meyer.math.ncsu.edu/Meyer/PS_Files/IMAGE.pdf
  • for transition matrix of markov chain, evec is stationary state \pi
  • for ratings matrix (eg rows are users, cols are movies), evecs of M M^T are row features, of M^T M are col features

symmetric positive definite, positive semi-definite, negative definite, indefinite, etc.

  • M is positive definite iff any of:
    • z^T M z > 0 for all nonzero vector z
    • all evals of M are pos
    • all leading principle minors are pos
      • i.e. upper left k \times k submatrix of M has pos det for all k = 1,2,\dots,n
      • eg: if M is diagonal, then M is pos def iff all diag elts are pos (diag elts are evals)
  • M is neg def iff any of:
    • -M is pos def, i.e. x^T M x < 0 for all nonzero vector x
    • all evals of M are neg
    • upper left 1 \times 1 submatrix has pos det, upper left 2 \times 2 submatrix has neg det, upper left 3 \times 3 submatrix has pos det, etc.
  • M is pos semi-def iff any of:
    • z^T M z \ge 0 for all vector z (usual definition)
    • all evals of M are non-neg
    • note: if M is pos semi-def, then all upper left k \times k submatrices have non-neg dets, but this is not a sufficient condition any more; same for neg semi-def below
  • M is neg semi-def iff any of:
    • z^T M z \le 0 for all vector z (usual definition)
    • all evals of M are non-pos M is indef iff any of:
      • M is neither pos semi-def nor neg semi-def
      • M has pos and neg evals

definitions

  • vectors linearly dependent iff
    • any of the vectors can be linear combo of any others (redundancy)
    • only linear combo that equals 0 is all-0 coefs
    • vectors are n-dimensional and span \Re^n
    • determinant of vectors as cols is 0
  • basis of vector space: linearly indep spanning set of vectors
  • subspace of \Re^n: set of vectors that’s closed under scalar mult (hence must incl. origin)
  • nullspace of A is set of vectors x s.t. Ax=0; a valid subspace
  • only way to represent a line in >2D is parametrically (t)
  • A matrix A is diagonalizable if there exists a non-singular matrix B and a diagonal matrix D such that A = BDB^{-1}.
  • singular value decomposition (SVD): important factorization of a rectangular real or complex matrix
    • many applications in signal processing, statistics
    • eg computing the pseudoinverse, least squares fitting of data, matrix approximation, and determining the rank, range and null space of a matrix
  • singularity
    • matrix is singular iff has no inverse iff any eigenvalues are zero iff characteristic polynomial is 0
    • matrix is nonsingular iff invertible iff all eigenvalues are non-zero iff characteristic polynomial is non-0
  • characteristic polynomial: p(\lambda) = \det(A - \lambda I)
  • eigenvalues are roots of characteristic polynomial
  • eigenspace: for a given eigenvalue, linear subspace spanned by its eigenvectors
  • if M is pos def then z^* M z > 0 for any vector z
  • a pos def matrix has pos def inverse
  • all eigenvalues of pos def are positive
  • a real sym matrix has real sym inverse
  • kernel of a mapping: elts that map to the zero elt
  • kernel aka null space of a matrix: nullity of A

foundations

  • vector space: eg \mathbf{R}^2 (set of all 2-dim vecs)
    • subspace: subset of vector space closed under linear combination
      • subspaces of \mathbf{R}^2: \mathbf{R}^2, any line through origin, and \{(0,0)\}
      • subspaces of \mathbf{R}^3: \mathbf{R}^3, any plane through origin, any line through origin, and \{(0,0)\}
    • column space C(A) is span of A’s col vecs
      • if not fully linearly indep, then fewer dimensions spanned then #columns
      • Ax=b can be solved iff b \in C(A)
    • kernel aka null space N(A) is set of all solutions to Ax=0
      • solutions x always form a subspace, i.e. linear combos of xs are 0
      • N(A)=0 means col vecs of A lin indep (C(A) is a basis)
  • TODO
    • stochastic matrix
    • eigen*, singular, etc.
    • http://www.cut-the-knot.org/arithmetic/algebra/eigens.shtml
    • spectral decomposition aka eigendecomposition
    • spectral theorem

norms/distance functions

  • taxicab geometry aka rectilinear dist aka L1 distance aka \ell_1 norm aka city block dist aka manhattan dist aka manhattan length
    • sum of absolute diffs in coordinates
    • hamming dist is manhattan dist of unit hypercube (since hamming dst components are either 0 or 1; hamming dist is # positions that are diff)
  • euclidian dist aka L^2/\ell^2 dist/norm
  • p norm aka L^p norm
    • ( \sum_i |x_i|^p )^{1/p}
  • maximum norm aka infinity norm aka uniform norm aka supremum norm
    • \max( |x_1|, \dots, |x_2| )
  • triangle inequality holds for n-dim vectors

cryptography

RSA (TODO REVIEW!)

  • choose primes p, q
    • primality testing: wheel factorization is certain; there are also probabilistic tests
  • find e such that gcd(e, (p-1)(q-1)) = 1
    • i.e., find a such that ae + b(p-1)(q-1) = 1. This is a diophantine equation and can be solved using the Euclidian algorithm.
  • exponentiation: find M^e \mod n
    • M is data, e is pub key, n is priv key
    • square and multiply algorithm
    • Chinese Remainder Theorem (CRT) for more complex, faster algos

How To Share A Secret (Shamir 79)

signals, coding, information theory

Audio & Video Compression (Keith Winstein’s SIPB cluedump 2008/11/06)

  • huffman code: recurse: least freq are same len, differ only in last bit
  • keithw: slides
  • entropy coding
  • arithmetic coding can get us flatter?

general

  • Shannon’s theorem: describes channel capacity, the max information rate at which reliable comm is possible over a channel that has a certain error prob or SNR
    • only existential, not constructive, hence actual capacity depends on algo

maximum entropy

  • given constraints (“testable information”), the dist that best represents the current state of knowledge is the one w largest entropy
    • i.e., choose the least informative (and thus most generalizable) dist given what we know, since we don’t know anything else
    • also, many physical systems tend to move toward maxent configs over time
    • makes explicit our freedom in using different forms of prior information
  • applications
    • obtain priors for bayesian inference
    • maxent models: observed data itself is the testable information
  • examples
    • given just mean and sd, use normal
    • given just the support [a,b], use uniform over [a,b]
    • exponential dist w mean 1/\lambda is maxent dist over all dists in [0,\infty] w mean 1/\lambda

shannon entropy

  • H(X) = E(I(X)) = \sum_i p(x_i) I(x_i) = - \sum_i p(x_i) \log p(x_i), assuming X has possible values x_i
    • I(X) is the information content of X
    • if p_i = 0, the 0 \log 0 is taken to be 0
  • eg binary entropy: H(X) for varying P(X=1)
  • distinguish btwn entropy of a distribution of messages/symbols and the entropy rate of a stochastic process
    • eg need 7 bits to distinguish the first 128 symbols generated in the fibonacci sequence
    • the fib sequence formula itself has a much lower entropy and applies to any length fib seq

Fisher information

  • reflects amt of (pos) info of observer
  • depends on PDF & its first derivatives
  • FIM is the curvature of the KL divergence

minimum description length (MDL)

  • best hyp yields best compression; occam’s razor
  • minimum message length (MML) is very similar

probability

stochastic processes

  • bernoulli process: stream of discrete & independent events, usu binary; discrete-time, unlike poisson
  • poisson process: stream of continuous & independent events occurring at const avg rate; continuous-time (infinite-granularity) bernoulli process
    • characterized by rate param \lambda

distributions

  • bernoulli: heads or tails
  • binomial: number of successes in n draws with replacement (multiple bernoulli trials)
  • hypergeometric: number of successes in n draws without replacement (unlike binomial)
  • poisson: # events over some amount of time according to poisson process; infinite-granularity binomial
  • exponential: time btwn events in poisson process
  • gaussian/normal
  • log-normal distribution: dist of a RV whose log is normal
    • in Black-Scholes model, changes in the log of exchange rates, price indices, and stock market indices are assumed normal
  • Cauchy: occurs in spectroscopy
  • gamma: sum of fixed # exponentially dist vars
    • eg total time for some # ppl to arrive
  • great reference illustrating relationships among distributions: http://www.johndcook.com/distribution_chart.html

misc

  • law of total expectation: E(E(X|Y)) = E(X)
  • law of total variance: V(X) = V(E(X|Y)) + E(V(X|Y))
  • P(X < min(Y,Z)) = P(X > max(Y, Z)) + 1 – P(X > Y) – P(X > Z)
  • Chebyshev’s inequality: P(|X-\mu| \ge k \sigma) \le \frac{1}{k^2} for any real k>0
  • Jensen’s inequality: If f is convex and X is a random variable, f(E(X)) \le E(f(X)).

problems

  • expected # flips before HTH > that of HTT, because HTH overlaps with itself
    • sketch: with a million flips, same # HTH and HTT, but HTH are more clumped, so more space btwn them
  • st. petersburg’s paradox http://mindyourdecisions.com/blog/2010/06/07/the-st-petersburg-paradox-a-flimsy-critique-of-expectation-theory-by-people-who-dont-know-math-or-economics/
    • game: toss coin until H; payoff is $2 for 0T, $4 for 1T, $8 for 2T, …
    • expected payout is 1/2 * 2 + 1/4 * 4 + 1/8 * 8 + ... = \infty
    • but must consider realistic payouts (nobody has \infty), utility of payout (e.g. log scale), and weight-vs-risk (why you don’t play the lottery)
  • kelly criterion: optimal fraction to bet in a series of bets to maximize expected return/growth rate which is log(growth rate)

game theory

  • shapley shubik power index: measure power of players in weighted voting game
    • of all possible permutations, when votes in favor happen in order, which player tilts the vote? rank them by the frequency this applies
  • pareto efficiency aka pareto optimality: a possibly local optimum
    • can’t improve one person without worsening another
  • nash equilibrium: nobody can improve by changing only own strategy unilaterally
    • assumes each player knows equilibrium strategies of others
    • doesn’t mean there aren’t strategies that are better for all
      • eg competing businesses forming cartel to increase profits
    • may appear irrational to 3rd party, since not necessarily pareto optimal

puzzles

  • pirates
  • prisoner’s dilemma
    • if A testifies against B and B is silent, A goes free and B gets full 10yr
    • if both silent, both get 6mo
    • if both testify, both get 5yr
    • regardless of what opponent chooses, you always are better off betraying
    • nash equilibrium not nec pareto-optimal
    • non-0-sum game
  • hotelling’s game
    • 2 competing hot dogs (i,j) stand on beach that ranges over [-1, 1]
    • customers go to nearest stand
    • stands will naturally go to (0,0), even though social opt is (-.5,.5)

number theory

  • fermat’s last theorem: no 3 pos ints a,b,c can satisfy a^n+b^n=c^n for any int n>2
    • long-standing conjecture; proven in 1995

misc

queueing theory

  • little’s law: L=\lambda W where L is avg number of customers in system, \lambda is avg arrival rate, W is avg time each customer spends in system
  • Kendall’s notation: A/B/C where A is arrival process, B is service time dist, C is # servers

measure theory

  • studies ways of generalizing the notions of length/area/volume
  • measure theory asks questions like:
    • How do we define a measure on our space? (Jordan measure and Lebesgue measure are two different options in Euclidean space.)
    • What properties does our measure satisfy? (For example, does it satisfy translational invariance, rotational invariance, additivity?)
    • Which objects are measurable/which objects can we say it’s okay not to measure in order to preserve nice properties of our measure? (The Banach-Tarski ball can be rigidly reassembled into two copies of the same shape and size as the original, so we don’t want it to be measurable, since then we would lose additivity properties.)
  • TODO URL

sequences and series

graph theory

  • a branch of combinatorics
  • eulerian paths
    • eg bridges of Konigsberg
    • poly-time algos
    • euler trail aka eulerian path aka eulerian walk: visit each edge once; iff only 2 nodes have odd degrees
    • eulerian circuit aka eulerian cycle aka eulerian tour: also return to start; iff all nodes have even degrees
  • hamiltonian path: visit each vertex once; NP-complete

  • overview: http://arxiv.org/abs/condmat/0303516

  • spectral graph theory
    • study of graphs and the characteristic polynomials, evals, and evecs of A or L
  • digraph: directed graph

  • sparse graph: # edges close to min # edges
  • dense graph: # edges close to max # edges

  • finite graph: finite # edges and # nodes
  • locally finite graph: each node has finite degree

  • maximal clique: clique that’s not a subset of any other clique
  • maximum clique: biggest clique(s) in the graph

  • multigraph: a graph with multiple edges between same node pairs

  • spectrum of a graph: spectrum (eigenvalues) of graph’s adj mat A
    • isomorphic graphs have same spectrum
    • d-regular graph is connected iff \lambda_1 > \lambda-2
    • d-regular graph has largest eval d
  • laplacian matrix of a graph (of n nodes) L = D - A where
    • D = diag(d_1, ..., d_n), where d_i = degree of node i
    • A = adj mat
  • spectral embedding: plotting a graph using some of its spectral eigenvectors as coordinates

  • spectral gap: difference between largest evals, \lambda_1 - \lambda_2

  • expander graph: sparse graph with high connectivity
    • name: subsets “expand” quickly
    • standard constraint: d-regular
    • expansion parameters: edge expansion, vertex expansion, spectral expansion
      • hard to compute (exponentially many subsets)
      • lower- and upper-bounded by expressions on the gap of the graph
    • many apps involve random walks on expanders
      • markov transition matrix is just proportional to the adj mat
    • applications
      • fault tolerance: networks with guaranteed connectivity (removing random edges does not reduce property of an expander by much)
      • more surprisingly: ECCs, PRNGs, crypto hashes, PCP theorem
  • k-regular graph: each node has degree k
    • max degree is an eval
  • clustering coefficient of some node is (# links between neighbors / # possible); best to refer to wikipedia illustration
    • intuitively: how close a node + neighbors are to being a clique
  • centrality: importance of a node; various measures
    • degree centrality: just the node degree
    • nodes that occur on many shorest paths between pairs of nodes in the graph have a high betweenness centrality score
    • eigenvector centrality: proportional to sum of centrality scores of neighbors
      • calc eigendecomp of adj matrix, select evec w biggest eval
      • elt i in evec gives centrality of node i
  • complex networks: exhibit non-trivial topological features
    • e.g.:
      • heavy tail in degree distribution
      • high clustering coefficient TODO
      • assortativity: preference for a network’s nodes to attach to others that are similar or different in some way
      • clustering at many scales
      • hierarchical structure
  • scale-free network: describes many “real-world networks”
    • some nodes are highly connected hubs (high degree)
    • most nodes are of low degree
    • degree distribution follows power law
    • structure and dynamics are independent of # nodes
    • preferential attachment
    • fault-tolerance: assuming uniformly random node failures, unlikely that a hub fails
  • hypergraph: generalization of a graph, where edges can connect any number of vertices

  • de Bruijn graph: with n dimensions and m symbols, this is a directed graph of m^n vertices, one for each length-n sequence of symbols
    • rule for edges is more complicated; captures overlaps between sequences of symbols
  • butterfly graph: recursive criss-crosses
    • found in fft
  • butterfly network: a network that’s almost partitioned, connected only by a few edges
    • different from butterfly graph
  • network theory: application of graph theory to networks, which are graphs that represent relations between objects

  • network coding: “combining” values to maximize information flow

  • Euclidian graph: edge weights = Euclidian distance

  • random graph: graph generated by randomly adding edges to a node set
    • there are different random processes used
    • Erdos number: collaborative distance to (“degrees away from”) Paul Erdos
    • eg: start w all disconnected nodes, then create an edge btwn each pair w some prob
  • small world networks
    • randomly adding few edges to high-diameter graph drops diameter drastically
    • typical properties:
      • high clustering coefficient
      • short avg path length
      • over-abundance of hub nodes
      • many local links and few long range “shortcuts”
      • create ring of nodes, each connected to k nearest neighbors; then w some prob, rewire each edge to an existing dst node
  • planar graph
    • O(n) to determine planarity
    • sparse: O(v) edges, asymptotically smaller than max (O(v^2))
    • polyhedra can be “unrolled” into planar graphs
  • Euler characteristic: number that describes an aspect of a toplogical space’s shape or structure
  • Euler’s formula: for finite, connected, planar graphs, v - e + f = 2, where f is # faces
    • i.e., Euler characteristic is 2

category theory

  • a category consists of
    • a class of objects
    • a class of morphisms aka arrows
    • for each morphism f, a domain object and a codomain object
    • for each object A, an identity morphism 1_A: A \rightarrow A
    • for each pair of morphisms f: A \rightarrow B and g: B \rightarrow C, a composite morphism g \ocirc f: A \rightarrow C
  • with these rules:
    • identity composition: for each f: A \rightarrow B, f \ocirc 1_A = f and 1_B \ocirc f = f
    • associativity
  • functors: morphisms from category to category
    • define functor F: C1 \rightarrow C2
      • every object A \in C1 maps to an object F(A) \in C2
      • every map A \right B \in C1 maps to F(A) \rightarrow F(B) \in C2
    • must preserve rules:
      • F(1_A) = 1_{F(A)} \forall A \in C1
      • F(g \ocirc f) = F(g) \ocirc F(f) \forall f: A \rightarrow B, g: B \rightarrow C where A,B,C \in C1