## 1 Identities, approximations, limits

• Identity: ${lim}_{x\to \infty }{\left(1+\frac{a}{x}\right)}^{x}={e}^{a}$
• ${e}^{x}>1+x$ for $x>0$ and ${e}^{x}\approx 1+x$ for $-.1<x<.1$ (i.e., $x$ near 0)
• Euler's identity: ${e}^{i\pi }+1=0$ (from ${e}^{ix}=cosx+isinx$)
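A quick numerical sanity check of these identities (a sketch using Python's `math`/`cmath`; variable names are illustrative):

```python
import cmath
import math

# lim_{x->inf} (1 + a/x)^x = e^a: evaluate at a large x
a = 2.0
approx = (1 + a / 1_000_000) ** 1_000_000
exact = math.exp(a)

# e^x > 1 + x for x > 0; e^x ~ 1 + x near 0
assert math.exp(0.5) > 1.5
assert abs(math.exp(0.01) - 1.01) < 1e-4

# Euler's identity e^{i*pi} + 1 = 0, up to floating-point error
euler = cmath.exp(1j * math.pi) + 1
```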

## 2 General

• geometric mean ${\left({\prod }_{i}{x}_{i}\right)}^{\frac{1}{n}}$ is exp of arith mean of logs, $exp\left(\frac{1}{n}{\sum }_{i}log{x}_{i}\right)$
• eg annualizing compounding: given annual growths $a,b,c>1$ and initial price ${p}_{0}$, ${p}_{3}=abc{p}_{0}={\mu }^{3}{p}_{0}$ where geometric mean $\mu =\sqrt[3]{abc}$
• harmonic mean ${\left(\frac{1}{n}{\sum }_{i=1}^{n}{x}_{i}^{-1}\right)}^{-1}$
• if ${x}_{i}$ subject to (arithmetic-)mean-preserving spread, harmonic mean decreases
• preferable way to avg multiples, e.g. P/E ratio
• vs arith mean
• A travels 20mph for 1h then 30mph for 1h, avg speed is arith mean
• A travels 20mph for 1mi then 30mph for 1mi, avg speed is harmonic mean
• F-1 score is harmonic mean of precision & recall
• power mean ${M}^{r}\left(\left\{{x}_{i}\right\}\right)={\left(\frac{1}{n}{\sum }_{i}{x}_{i}^{r}\right)}^{\frac{1}{r}}$
• $r=-1$ harmonic, $r=0$ geom, $r=1$ arith, $r=2$ quadratic (root mean square), $r=-\infty$ min, $r=\infty$ max
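The means above can be sketched directly (helper names are mine; the power mean reduces to the others at the stated $r$):

```python
import math

def arith_mean(xs):
    return sum(xs) / len(xs)

def geom_mean(xs):
    # exp of the arithmetic mean of logs
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harm_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

def power_mean(xs, r):
    # r = -1 harmonic, r -> 0 geometric, r = 1 arithmetic, r = 2 RMS
    if r == 0:
        return geom_mean(xs)
    return (sum(x ** r for x in xs) / len(xs)) ** (1 / r)
```

E.g., 20 mph for 1 mi then 30 mph for 1 mi averages `harm_mean([20, 30]) == 24.0` mph, not 25.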
• Stirling's approx: $lnn!=nlnn-n+O\left(logn\right)$ where last term is $\frac{1}{2}ln\left(2\pi n\right)$
• Or, ${lim}_{n\to \infty }\frac{n!}{\sqrt{2\pi n}{\left(\frac{n}{e}\right)}^{n}}=1$ or $n!\sim \sqrt{2\pi n}{\left(\frac{n}{e}\right)}^{n}$
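Checking the $n!\sim \sqrt{2\pi n}{\left(\frac{n}{e}\right)}^{n}$ form numerically (sketch):

```python
import math

def stirling(n):
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

# ratio n! / stirling(n) approaches 1 from above as n grows
ratios = [math.factorial(n) / stirling(n) for n in (5, 20, 80)]
```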

## 3 Information Theory

• surprisal: $-logP\left(x\right)=log\frac{1}{P\left(x\right)}$; in bits; additive; used in entropy, KLIC, etc.
• $P\left(x\right)=\frac{1}{n}⟹-{log}_{2}P\left(x\right)={log}_{2}n$ bits
• entropy $H\left(X\right)=E\left[I\left(X\right)\right]=-{\sum }_{x}p\left(x\right)logp\left(x\right)\ge 0$ (expected information content)
• lower prob events have higher information content
• measured in bits
• mutual information $I\left(X;Y\right)={\sum }_{y}{\sum }_{x}{p}_{X,Y}\left(x,y\right)log\frac{{p}_{X,Y}\left(x,y\right)}{{p}_{X}\left(x\right){p}_{Y}\left(y\right)}\ge 0$
• self-information is entropy: $I\left(X;X\right)=H\left(X\right)$
• $I\left(X;Y\right)=H\left(X\right)-H\left(X\mid Y\right)=H\left(Y\right)-H\left(Y\mid X\right)=H\left(X\right)+H\left(Y\right)-H\left(X,Y\right)=H\left(X,Y\right)-H\left(X\mid Y\right)-H\left(Y\mid X\right)$
• symmetric uncertainty $U\left(X,Y\right)=2\frac{I\left(X;Y\right)}{H\left(X\right)+H\left(Y\right)}\in \left[0,1\right]$
• relationship to correlation
• MI measures general dependence, correlation measures linear dependence; MI is better for measuring dependence
• MI applicable to symbolic sequences; correlation applicable only to numerical sequences; but MI must estimate continuous distributions
• Kullback–Leibler divergence aka KLIC: non-symmetric measure of difference btwn dists $P,Q$
• expected # extra bits to code samples from $P$ when using code based on $Q$ rather than on $P$
• alt intuition: avg likelihood of data distributed as $P$ given $Q$ as model: ${D}_{KL}\left(P|Q\right)=-log\stackrel{‾}{L}$ where $L=Pr\left[X\sim P\mid Q\right]$
• ${D}_{KL}\left(P\parallel Q\right)={\sum }_{i}P\left(i\right)log\frac{P\left(i\right)}{Q\left(i\right)}={\sum }_{i}P\left(i\right)\left(logP\left(i\right)-logQ\left(i\right)\right)$; integral for continuous
• ${D}_{KL}\ge 0$; ${D}_{KL}=0$ iff $P=Q$; asymmetric
• mutual information $I\left(X;Y\right)={D}_{KL}\left(Pr\left[X,Y\right]|Pr\left[X\right]Pr\left[Y\right]\right)$
• http://www.snl.salk.edu/~shlens/kl.pdf
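Entropy, KL divergence, and MI per the definitions above, for discrete distributions (function names are mine):

```python
import math

def entropy(p):
    # H(X) = -sum p log2 p, in bits
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    # D_KL(P || Q) = sum p log2(p/q); >= 0, zero iff p == q
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_info(joint, px, py):
    # I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))
    return sum(
        joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
        for i in range(len(px))
        for j in range(len(py))
        if joint[i][j] > 0
    )
```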
• normalized compression distance (NCD): $NCD\left(x,y\right)=\frac{C\left(xy\right)-min\left\{C\left(x\right),C\left(y\right)\right\}}{max\left\{C\left(x\right),C\left(y\right)\right\}}$
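NCD can be approximated with any real compressor standing in for the complexity $C$; a sketch using zlib (similar inputs score near 0, unrelated inputs near 1):

```python
import zlib

def C(data: bytes) -> int:
    # compressed length as a crude complexity proxy
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```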

## 4 Finance

• rate of return (ROR) aka return on investment (ROI) aka return
• let ${V}_{f}$ be final value, ${V}_{i}$ be initial value
• ratio: $r=\frac{{V}_{f}}{{V}_{i}}$
• arithmetic return aka yield: ${r}_{arith}=\frac{{V}_{f}-{V}_{i}}{{V}_{i}}=r-1$
• logarithmic/continuous compound return: ${r}_{log}=ln\frac{{V}_{f}}{{V}_{i}}=ln\left(1+r\right)$
• compound annual growth rate (CAGR): ${\left(\frac{{V}_{f}}{{V}_{i}}\right)}^{\frac{1}{n}}-1$ where $n$ is # years
• annual percentage rate (APR)
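The return definitions as code (sketch; note log returns add across periods, and CAGR is the geometric-mean growth rate):

```python
import math

def arith_return(v_i, v_f):
    return (v_f - v_i) / v_i          # yield, r - 1

def log_return(v_i, v_f):
    return math.log(v_f / v_i)        # additive across periods

def cagr(v_i, v_f, years):
    return (v_f / v_i) ** (1 / years) - 1
```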

## 5 Signal Processing

• DFT: ${X}_{k}={\sum }_{n=0}^{N-1}{x}_{n}exp\left(-\frac{2\pi i}{N}kn\right)$
• IDFT: ${x}_{n}=\frac{1}{N}{\sum }_{k=0}^{N-1}{X}_{k}exp\left(i2\pi k\frac{n}{N}\right)$ (normalized, with opposite exp sign)
• interesting presentation: strength of freq $k$ is distance from origin of the midpoint of your signal's points as the signal is spun around a circle http://altdevblogaday.org/2011/05/17/understanding-the-fourier-transform/
• IIR, FIR: TODO
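The DFT/IDFT pair above, transcribed naively ($O({N}^{2})$; an FFT computes the same in $O(NlogN)$):

```python
import cmath

def dft(x):
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

def idft(X):
    N = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
        for n in range(N)
    ]
```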

## 6 Probability

### 6.1 Distributions

• Binomial: # successes in $n$ Bernoulli trials each with success prob $p$
• $Pr\left[X=k\right]=\left(\genfrac{}{}{0}{}{n}{k}\right){p}^{k}{\left(1-p\right)}^{n-k}$
• $E\left[X\right]=np$
• $Var\left[X\right]=np\left(1-p\right)$
• Geometric: # trials until Bernoulli success with prob $p$
• $Pr\left[X=k\right]={\left(1-p\right)}^{k-1}p$
• $E\left[X\right]=\frac{1}{p}$
• $Var\left[X\right]=\frac{1-p}{{p}^{2}}$
• Hypergeom: # successes in $n$ draws from population of $N$ containing $m$ successes
• $Pr\left[X=k\right]=\frac{\left(\genfrac{}{}{0}{}{m}{k}\right)\left(\genfrac{}{}{0}{}{N-m}{n-k}\right)}{\left(\genfrac{}{}{0}{}{N}{n}\right)}$
• $E\left[X\right]=n\frac{m}{N}$
• $Var\left[X\right]=n\frac{m}{N}\frac{\left(N-m\right)}{N}\frac{N-n}{N-1}$
• Negative binomial: # successes before the $r$-th failure in repeated Bernoulli trials (generalization of geom)
• $Pr\left[X=k\right]=\left(\genfrac{}{}{0}{}{k+r-1}{k}\right){\left(1-p\right)}^{r}{p}^{k}$
• $E\left[X\right]=\frac{pr}{1-p}$
• $Var\left[X\right]=\frac{pr}{{\left(1-p\right)}^{2}}$
• Poisson: # arrivals in sliver of time (infinite-granularity binomial) assuming mean $\lambda$ arrival rate
• $Pr\left[X=k\right]=\frac{{\lambda }^{k}}{k!}{e}^{-\lambda }$
• $E\left[X\right]=\lambda$
• $Var\left[X\right]=\lambda$
• Simple interesting proof from binomial
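The binomial→Poisson limit can be seen numerically: hold $\lambda =np$ fixed and let $n$ grow (sketch):

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam ** k / math.factorial(k) * math.exp(-lam)

# Binomial(n, lam/n) -> Poisson(lam) as n grows
lam, k = 3.0, 2
errors = [abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for n in (10, 100, 10_000)]
```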
• Normal: mean $\mu$, standard deviation $\sigma$
• $f\left(x\right)=\frac{1}{\sqrt{2\pi {\sigma }^{2}}}exp\left(-\frac{{\left(x-\mu \right)}^{2}}{2{\sigma }^{2}}\right)=\dots exp\left(-\frac{{Z}^{2}}{2}\right)$
• $E\left[X\right]=\mu$
• $Var\left[X\right]={\sigma }^{2}$
• Empirical rule: z-scores of 1/2/3 span 68%/95%/99.7%
• Is its own Fourier transform
• Beta: density shape over $\left(0,1\right)$
• uniform dist is a beta dist
• params $a,b$ s.t. $beta\left[a,b\right]\left(\theta \right)=\alpha {\theta }^{a-1}{\left(1-\theta \right)}^{b-1}$
• $E\left[X\right]=\frac{a}{a+b}$: higher $a$ suggests $\Theta$ closer to 1 than 0
• conjugate prior for Bernoulli/binomial dists
• Exponential: time btwn Poisson process events
• $f\left(x\right)=\left\{\begin{array}{cc}\lambda {e}^{-\lambda x},& x\ge 0\\ 0,& x<0\end{array}$; $Pr\left[X\le x\right]=1-{e}^{-\lambda x}$ for $x\ge 0$
• $E\left[X\right]=\frac{1}{\lambda }$
• $Var\left[X\right]=\frac{1}{{\lambda }^{2}}$
• memoryless: $Pr\left[X>s\mid X>t\right]=Pr\left[X>s-t\right]$ / constant event rate $\lambda$ / constant hazard $\lambda$
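Memorylessness checked by simulation (seeded sketch; $Pr\left[X>s\mid X>t\right]$ should match $Pr\left[X>s-t\right]={e}^{-\lambda \left(s-t\right)}$):

```python
import math
import random

random.seed(0)
lam, s, t = 2.0, 1.0, 0.5
xs = [random.expovariate(lam) for _ in range(200_000)]

# Pr[X > s | X > t] vs Pr[X > s - t]
survivors = [x for x in xs if x > t]
cond = sum(x > s for x in survivors) / len(survivors)
uncond = sum(x > s - t for x in xs) / len(xs)
exact = math.exp(-lam * (s - t))
```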
• Gamma: scale $\theta$ and shape $k$
• models waiting times: sum of $k$ indep exponentially distributed RVs, each with mean $\theta$
• also the sample variance of normal data
• conjugate prior for many dists TODO
• Student's t-distribution: a more spread-out normal distribution
• for when sample size is small, population SD unknown
• converges to normal as DF (corresponds to sample size) increases
• Chi-square, chi distributions
• chi-square: sum of squares of $k$ standardized normal RVs ${\sum }_{i}{\left(\frac{{X}_{i}-{\mu }_{i}}{{\sigma }_{i}}\right)}^{2}$
• chi: length of vector of $k$ normal components $\sqrt{{\sum }_{i}{\left(\frac{{X}_{i}-{\mu }_{i}}{{\sigma }_{i}}\right)}^{2}}$

### 6.2 Conjugate prior relationships

| Likelihood $Pr\left[x\mid \theta \right]$ | Conjugate prior $Pr\left[\theta \right]$ | Posterior $Pr\left[\theta \mid x\right]$ |
|---|---|---|
| Gaussian | Gaussian | Gaussian |
| Binomial$\left(N,\theta \right)$ | Beta$\left(r,s\right)$ | Beta$\left(r+n,s+N-n\right)$ |
| Poisson$\left(\theta \right)$ | Gamma$\left(r,s\right)$ | Gamma$\left(r+n,s+1\right)$ |
| Multinomial$\left({\theta }_{1},\dots ,{\theta }_{k}\right)$ | Dirichlet$\left({\alpha }_{1},\dots ,{\alpha }_{k}\right)$ | Dirichlet$\left({\alpha }_{1}+{n}_{1},\dots ,{\alpha }_{k}+{n}_{k}\right)$ |
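The Binomial/Beta row as code (sketch; with a uniform Beta(1,1) prior, $n$ successes in $N$ trials gives Beta$\left(1+n,1+N-n\right)$):

```python
from fractions import Fraction

def beta_binomial_update(r, s, n, N):
    # Beta(r, s) prior + n successes in N Binomial trials -> Beta(r+n, s+N-n)
    return r + n, s + N - n

r, s = beta_binomial_update(1, 1, 7, 10)   # uniform prior, 7 successes in 10 trials
posterior_mean = Fraction(r, r + s)        # E[theta] = a/(a+b)
```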

### 6.3 General definitions and properties

• Union bound aka Boole's inequality: $Pr\left[A\cup B\right]\le Pr\left[A\right]+Pr\left[B\right]$
• Bonferroni's inequality: $Pr\left[A\cap B\right]\ge Pr\left[A\right]+Pr\left[B\right]-1$
• $Pr\left[A\mid B\right]>Pr\left[A\right]⟺Pr\left[B\mid A\right]>Pr\left[B\right]$
• Linearity: $E\left[aX+bY\right]=aE\left[X\right]+bE\left[Y\right]$
• $E\left[X|Y=y\right]={\sum }_{x}x\cdot Pr\left[X=x|Y=y\right]$
• Iterated: $E\left[E\left[X|Y\right]\right]=E\left[X\right]$
• $Var\left[X\right]=E\left[{\left(X-\mu \right)}^{2}\right]=E\left[{X}^{2}\right]-{\left(E\left[X\right]\right)}^{2}$
• $Var\left[aX+b\right]=Var\left[aX\right]={a}^{2}Var\left[X\right]$
• $Var\left[aX+bY\right]={a}^{2}Var\left[X\right]+{b}^{2}Var\left[Y\right]+2abCov\left[X,Y\right]$
• $Var\left[X+Y\right]=Var\left[X\right]+Var\left[Y\right]$ if $X,Y$ indep/uncorrelated
• Pearson's product-moment coefficient is a “normalized” covariance: ${\rho }_{X,Y}=\frac{Cov\left[X,Y\right]}{{\sigma }_{X}{\sigma }_{Y}}\in \left[-1,1\right]$
• Law of total variance: $Var\left[X\right]=E\left[Var\left[X\mid Y\right]\right]+Var\left[E\left[X\mid Y\right]\right]$ (unexplained and explained components)
• coefficient of variation aka unitized risk aka variation coefficient: $c=\frac{\sigma }{|\mu |}$
• normalized measure of dispersion of a distribution
• signal to noise ratio (SNR): $\frac{\mu }{\sigma }$
• reciprocal of coefficient of variation; only sensical for positive variables
• Markov's inequality: $Pr\left[f\left(X\right)\ge t\right]\le E\left[f\left(X\right)\right]/t$, for any non-neg function $f$
• corollary: Chebyshev's inequality, $Pr\left[|X-E\left[X\right]|\ge a\right]\le \frac{Var\left[X\right]}{{a}^{2}}$
• Hoeffding's inequality: upper bound on prob that sum of RVs deviates from expected value
• for Bernoullis (important special case): $Pr\left[{\sum }_{i}{X}_{i}\le \left(p-\epsilon \right)n\right]\le exp\left(-2{\epsilon }^{2}n\right)$ where ${X}_{i}\sim Bernoulli\left(p\right)$
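Simulation against the Bernoulli bound (seeded sketch; the empirical tail frequency should sit below ${e}^{-2{\epsilon }^{2}n}$):

```python
import math
import random

random.seed(1)
n, p, eps = 500, 0.5, 0.15
bound = math.exp(-2 * eps ** 2 * n)        # e^{-22.5}, tiny

trials = 2_000
hits = sum(
    sum(random.random() < p for _ in range(n)) <= (p - eps) * n
    for _ in range(trials)
)
empirical = hits / trials
```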

## 7 Algorithms

### 7.1 LSH

• usually use shingling
• minhash: for clustering sets by similarity
• ${h}_{min}\left(S\right)={min}_{x\in S}h\left(x\right)$
• for two sets $A,B$, $Pr\left[{h}_{min}\left(A\right)={h}_{min}\left(B\right)\right]=J\left(A,B\right)=\frac{|A\cap B|}{|A\cup B|}$
• with $k$ hash fns:
• $Pr\left[{\bigwedge }_{i=1}^{k}{h}_{min}^{\left(i\right)}\left(A\right)={h}_{min}^{\left(i\right)}\left(B\right)\right]={\left(\frac{|A\cap B|}{|A\cup B|}\right)}^{k}$ (low false positives)
• $Pr\left[{\bigvee }_{i=1}^{k}{h}_{min}^{\left(i\right)}\left(A\right)={h}_{min}^{\left(i\right)}\left(B\right)\right]=1-{\left(1-\frac{|A\cap B|}{|A\cup B|}\right)}^{k}$ (low false negatives)
• estimate $J\left(A,B\right)$ as ratio of matching hash fns
• with $k$ smallest hashes:
• for $k\ll |A\cup B|$
• estimate $J\left(A,B\right)$ as ratio of matching hashes
• simhash: similar documents have low Hamming distance between their simhashes (Moses Charikar, STOC02)
• $V=\left[0\right]×64$ for 64-bit simhash
• for each item, if bit $i$ of $h\left(x\right)$ set, increment $V\left[i\right]$, else decrement $V\left[i\right]$
• bit $i$ of simhash is 1 if $V\left[i\right]>0$ else 0
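A minhash sketch along these lines (the salted-hash family here is an illustrative stand-in, not a vetted hash family):

```python
import random

random.seed(0)
K = 512
salts = [random.getrandbits(64) for _ in range(K)]

def minhash_sig(items):
    # one min-hash per salted hash fn
    return [min(hash((salt, x)) for x in items) for salt in salts]

def jaccard_estimate(sig_a, sig_b):
    # fraction of matching hash fns estimates J(A, B)
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(0, 100))
B = set(range(50, 150))                    # true Jaccard = 50/150 = 1/3
est = jaccard_estimate(minhash_sig(A), minhash_sig(B))
```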

## 8 Statistics

### 8.1 Basics

• coefficient of variation (CV) aka unitized risk: $\frac{\sigma }{\mu }$ (to scale/normalize SDs)
• (Pearson's) kurtosis: fourth standardized moment: $\frac{{\mu }_{4}}{{\sigma }^{4}}$
• measure of peakedness, but it's argued this really measures heavy tails

### 8.2 Fisherian tests TODO

• z-test/z-statistic: approximations are OK when $n>30$
• one-sample: $z=\frac{\stackrel{‾}{x}-{\mu }_{\stackrel{‾}{x}}}{{\sigma }_{\stackrel{‾}{x}}},{\sigma }_{\stackrel{‾}{x}}=\frac{\sigma }{\sqrt{n}}\approx \frac{S}{\sqrt{n}}$
• two-sample: $z=\frac{\left(\stackrel{‾}{X}-\stackrel{‾}{Y}\right)-0}{{\sigma }_{\stackrel{‾}{X}-\stackrel{‾}{Y}}}$ where ${\sigma }_{\stackrel{‾}{X}-\stackrel{‾}{Y}}=\sqrt{\frac{{\sigma }_{X}^{2}}{n}+\frac{{\sigma }_{Y}^{2}}{m}}\approx \sqrt{\frac{{S}_{X}^{2}}{n}+\frac{{S}_{Y}^{2}}{m}}$ (under ${H}_{0}:{\mu }_{\stackrel{‾}{X}-\stackrel{‾}{Y}}=0$)
• t-test/t-statistic: same as z-test but refer to student's t-distribution; use when $n\le 30$
• one-sample: $t=\frac{\stackrel{‾}{x}-{\mu }_{\stackrel{‾}{x}}}{S/\sqrt{n}}$ (same quantity as in z-test), DF is $n-1$
• two-sample: $t=\frac{\stackrel{‾}{X}-\stackrel{‾}{Y}-0}{{S}_{\stackrel{‾}{X}-\stackrel{‾}{Y}}}$ where ${S}_{\stackrel{‾}{X}-\stackrel{‾}{Y}}=\sqrt{\frac{{S}_{X}^{2}}{n}+\frac{{S}_{Y}^{2}}{m}}$
• chi-square test: ${X}^{2}={\sum }_{ij}\frac{{\left({O}_{ij}-{E}_{ij}\right)}^{2}}{{E}_{ij}}={\sum }_{i}{\left(\frac{{X}_{i}-{\mu }_{i}}{{\sigma }_{i}}\right)}^{2}$, where $ij$ enumerate cells in contingency table
• ${E}_{ij}$ are computed from marginal (global) frequencies of table
• dof is $\left(r-1\right)\left(c-1\right)$ for an $r×c$ contingency table; $k-1$ for goodness-of-fit over $k$ categories
• compare freqs of a sample against theoretical dist
• asymptotically approaches ${\chi }^{2}$ distribution
• uses normal approximation of multinomial distribution
• multinomial test
• Fisher's exact test
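The ${X}^{2}$ statistic for a contingency table, with expected counts from the marginals (sketch):

```python
def chi_square_stat(table):
    # X^2 = sum over cells of (O - E)^2 / E, E from marginal frequencies
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return sum(
        (table[i][j] - e) ** 2 / e
        for i in range(len(table))
        for j in range(len(table[0]))
        for e in [row_tot[i] * col_tot[j] / n]   # expected count E_ij
    )

# 2x2 table, dof = (2-1)(2-1) = 1; all E_ij = 25 here
x2 = chi_square_stat([[20, 30], [30, 20]])
```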

### 8.3 Neyman-Pearson tests

• Wald test: reject iff $\frac{|\stackrel{ˆ}{\theta }-{\theta }_{0}|}{\sigma }>{z}_{\alpha /2}$ where $\sigma$ is the standard error of $\stackrel{ˆ}{\theta }$
• depends on representation of $\theta$ (eg log-scaling); LR works with any monotonic transformation
• uses 2 approximations (know standard error, dist is ${\chi }^{2}$); LR only assumes dist is ${\chi }^{2}$ (if using that test)
• only deals with scalars; LR can deal with vector params
• likelihood ratio test: for comparing fit of 2 models/hyps, where null is special case of the alternative
• model with more params will be better fit, but is it significantly better?
• likelihood ratio test statistic for simple hyps ${\theta }_{0},{\theta }_{1}$: $\Lambda \left(x\right)=\frac{L\left({\theta }_{0}\mid x\right)}{L\left({\theta }_{1}\mid x\right)}$
• log-likelihood ratio test statistic: $D=-2ln\Lambda \left(x\right)=-2ln\left(\frac{L\left({\theta }_{0}\mid x\right)}{L\left({\theta }_{1}\mid x\right)}\right)=-2ln\left(L\left({\theta }_{0}\mid x\right)\right)+2ln\left(L\left({\theta }_{1}\mid x\right)\right)$
• approx. distributed as ${\chi }^{2}$ with dof $d{f}_{1}-d{f}_{0}$
• G-test is more accurate than ${\chi }^{2}$
• for composite hyps: $\Lambda \left(x\right)=\frac{sup\left\{L\left(\theta \mid x\right):\theta \in {\Theta }_{0}\right\}}{sup\left\{L\left(\theta \mid x\right):\theta \in \Theta \right\}}$ (compare MLEs)
• Z-test, F-test, chi-square, G-test are tests for nested models and can be phrased as (approximations of) log-likelihood ratios
• F-test: compare models; ANOVA; TODO
• G-test: for log-likelihood ratio tests; $G=2{\sum }_{ij}{O}_{ij}ln\left(\frac{{O}_{ij}}{{E}_{ij}}\right)$ TODO
• chi-square tests are approximations; useful before computers
• G-test much better where for any contingency cell $|{O}_{i}-{E}_{i}|>{E}_{i}$
• for small samples, use multinomial test, Fisher's exact test, or even Bayesian hyp selection
• Bayes factors: vs likelihood ratio tests

### 8.4 Hypothesis testing

• effect size: basically, how big a difference; fluffy notion/many formulations
• absolute: e.g., “difference between groups is 30lb”
• standardized difference of means: e.g., $\frac{\stackrel{‾}{X}-\stackrel{‾}{Y}}{S}$, where $S$ is SD of either or both groups
• problem: can get significance (low $p$-value) with big effect and small sample or big sample and small effect
• hence, report effect/sample size with $p$-value
• power analysis: prob of rejecting false ${H}_{0}$ (not making a type II error, false neg)
• power aka sensitivity is $1-\beta$ where $\beta$ is false negative rate
• can use to find min sample size to likely detect given effect size, or min effect size likely detected by given sample size
• e.g., more powerful experiments may have more subjects

### 8.5 Unsorted

• kalman filter: TODO understand
• untrusted predictions, untrusted measurements; combine them
• optimal estimate
• predict using previous data, measure, fuse/correct prediction and measurement
• usually no control signals ${\stackrel{\to }{u}}_{k}$
• http://www.swarthmore.edu/NatSci/echeeve1/Ref/Kalman/ScalarKalman.html
• Mahalanobis distance: similarity of a sample to a distribution
• ${D}_{M}=\sqrt{{\left(x-\mu \right)}^{T}{S}^{-1}\left(x-\mu \right)}$ where $x$ is new sample, $\mu$ is dist mean, $S$ is covar matrix
• same as normalized Euclidian dist if $S=I$: ${D}_{M}=\sqrt{{\sum }_{i}\frac{{\left({x}_{i}-{\mu }_{i}\right)}^{2}}{{s}_{i}^{2}}}$
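The diagonal-covariance special case noted above (sketch; $S=I$ recovers plain Euclidean distance):

```python
import math

def mahalanobis_diag(x, mu, variances):
    # D_M with S = diag(variances): the normalized Euclidean distance
    return math.sqrt(
        sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, variances))
    )
```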

### 8.6 Smoothing

• additive smoothing aka Laplace smoothing: $\stackrel{ˆ}{{\theta }_{i}}=\frac{{n}_{i}+k}{n+dk}$, $i=1,\dots ,d$
• called “rule of succession” for $k=1$
• Good-Turing estimation: complex
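Additive smoothing in one line (sketch; $k=1$ gives Laplace's rule of succession, so an unseen category still gets nonzero mass):

```python
def additive_smoothing(counts, k=1):
    # (n_i + k) / (n + d*k) over d categories
    n, d = sum(counts), len(counts)
    return [(c + k) / (n + d * k) for c in counts]

probs = additive_smoothing([0, 2, 3])   # unseen category gets 1/8 instead of 0
```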

### 8.7 Estimation

• 90% confidence interval: in repeated samplings, the computed intervals $I\left(X\right)$ would contain the true param 90% of the time (0.1 miss rate); $Pr\left[\theta \in I\left(X\right)\right]=.9$ ($\theta$ is const, $I\left(X\right)$ is RV)
• 90% credible interval $C\left(X\right)$: $Pr\left[\theta \in C\left(X\right)\right]=.9$ ($\theta ,C\left(X\right)$ are RV); “Bayesian confidence interval”
• statistic: any function of data $\delta \left(X\right)$
• estimator: any statistic used to estimate an (unknown) param $\theta$; usu denoted $\stackrel{ˆ}{\theta }=\delta \left(X\right)$
• bias: $E\left[\stackrel{ˆ}{\theta }-\theta \right]=E\left[\stackrel{ˆ}{\theta }\right]-\theta$ (unbiased if 0; asymptotically unbiased if $\to 0$ as $n\to \infty$)
• variance: $Var\left[\stackrel{ˆ}{\theta }-\theta \right]$
• unbiased estimator: converges to true param over repeatedly sampling
• e.g.: sample variance
• unbiased sample variance ($E\left[{s}^{2}\right]={\sigma }^{2}$) is ${s}^{2}=\frac{1}{n-1}{\sum }_{i}{\left({X}_{i}-\stackrel{‾}{X}\right)}^{2}$ (Bessel's correction)
• biased is ${s}_{n}^{2}=\frac{1}{n}{\sum }_{i}{\left({X}_{i}-\stackrel{‾}{X}\right)}^{2}$
• note: $s$ is not unbiased for SD
• can be terrible; they may average to true value but individual estimates may be ridiculous
• e.g. for Poisson $X$, estimator $\delta \left(X=x\right)$ of statistic $Pr{\left[X=0\right]}^{2}={e}^{-2\lambda }=E\left[\delta \left(X\right)\right]$ is ${\left(-1\right)}^{x}$, which is nonsense
• MLE is ${e}^{-2x}$, which is always positive and has smaller MSE
• besides bias, look also at efficiency, the MSE of individual estimates
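Bessel's correction by simulation (seeded sketch): dividing by $n$ underestimates ${\sigma }^{2}$ by a factor $\frac{n-1}{n}$, while dividing by $n-1$ centers on the truth:

```python
import random

random.seed(0)
n, reps, sigma2 = 5, 50_000, 4.0           # samples of 5 from N(0, 2^2)

biased_sum = unbiased_sum = 0.0
for _ in range(reps):
    xs = [random.gauss(0, 2) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    biased_sum += ss / n                   # s_n^2
    unbiased_sum += ss / (n - 1)           # s^2 with Bessel's correction

biased, unbiased = biased_sum / reps, unbiased_sum / reps
```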
• consistent: ${lim}_{n\to \infty }\delta \left({X}_{n}\right)=\theta$
• bias & variance must go to 0
• biased but consistent estimate of mean: $\frac{1}{n}{\sum }_{i}{x}_{i}+\frac{1}{n}$
• unbiased but inconsistent estimate of mean: ${x}_{1}$
• maximum likelihood estimation (MLE): $arg{max}_{\theta }Pr\left[X\mid \theta \right]$
• find the peak in the likelihood function
• Fisher popularized this by showing that typically MLE is unbiased, consistent, & asymptotically the lowest variance estimator
• least squares: in OLS, if errors are normal, then least squares estimate is MLE
• now clear that MLE sometimes bad in practice, and finite-sample behavior not same as asymptotic behavior, leading to Bayesian strategies
• disregards the uncertainty (“spread” in the likelihood function, i.e. Fisher information)
• property: $f\left(\stackrel{ˆ}{\theta }\right)$ is MLE of $f\left(\theta \right)$ for any $f$
• maximum a posteriori (MAP): $arg{max}_{\theta }Pr\left[\theta \mid X\right]=arg{max}_{\theta }Pr\left[X\mid \theta \right]Pr\left[\theta \right]$
• the Bayesian approach
• since we have a (posterior) prob dist over $\theta$, can extract whatever we want: mean, median, mode, intervals
• choose the single most probable $\theta$ given $X$ (mode of posterior)
• note: prediction using MAP is approx to Bayes prediction using all $\theta$ and their probabilities
• equiv: $arg{min}_{\theta }\left(-logPr\left[\theta \mid X\right]=-logPr\left[X\mid \theta \right]-logPr\left[\theta \right]\right)$
• $-logPr\left[\theta \right]$ is bits to describe $\theta$, $-logPr\left[X\mid \theta \right]$ is add'l bits to describe data
• hence, MAP chooses $\theta$ that provides max compression, or MDL
• Bayes estimation: given loss $L\left(\theta ,\delta \right)$ (eg sq err), minimize Bayes risk $E\left[L\left(\theta ,\delta \right)\right]$
• https://github.com/johnmyleswhite/JAGSExamples/blob/master/slides/

### 8.8 Exponential smoothing for time series analysis

• exponential moving average (EMA) aka exponentially weighted moving average (EWMA) aka single exponential smoothing (SES): ${S}_{t}=\alpha {y}_{t-1}+\left(1-\alpha \right){S}_{t-1}$ for $t\ge 3$, initialized with ${S}_{2}={y}_{1}$
• Does not spot trends
• double exponential moving average (DEMA) aka linear exponential smoothing (LES)$\begin{array}{cccc}{S}_{t}& =& \alpha {y}_{t}+\left(1-\alpha \right)\left({S}_{t-1}+{b}_{t-1}\right),& 0\le \alpha \le 1\\ {b}_{t}& =& \gamma \left({S}_{t}-{S}_{t-1}\right)+\left(1-\gamma \right){b}_{t-1},& 0\le \gamma \le 1\end{array}$
• triple exponential moving average (TEMA) aka triple exponential smoothing (Holt-Winters)
• http://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm
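SES and Holt's double smoothing, sketched with the common variant that smooths the current observation (slightly different indexing from the ${y}_{t-1}$ form above):

```python
def ses(ys, alpha):
    # single exponential smoothing; no trend handling
    s = ys[0]
    out = [s]
    for y in ys[1:]:
        s = alpha * y + (1 - alpha) * s
        out.append(s)
    return out

def holt(ys, alpha, gamma):
    # double/linear exponential smoothing: level s plus trend b
    s, b = ys[0], ys[1] - ys[0]
    out = [s]
    for y in ys[1:]:
        s_prev = s
        s = alpha * y + (1 - alpha) * (s + b)
        b = gamma * (s - s_prev) + (1 - gamma) * b
        out.append(s)
    return out
```

On a perfectly linear series Holt tracks exactly while SES lags behind, illustrating "does not spot trends."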

### 8.9 Applied

• recency, frequency, monetary value (RFM): simple customer behavior analysis
• segment customers along these 3 axes into discrete bins
• identify highest-value intersections of bins

### 8.10 Time series

• autoregressive model: ${X}_{t}=c+{\sum }_{i=1}^{p}{\varphi }_{i}{X}_{t-i}+{ϵ}_{t}$
• moving avg model: ${X}_{t}=\mu +{ϵ}_{t}+{\sum }_{i=1}^{q}{\theta }_{i}{ϵ}_{t-i}$
• autoregressive moving avg (ARMA) aka Box-Jenkins model: ${X}_{t}=\mu +{a}_{1}{X}_{t-1}+\dots +{a}_{p}{X}_{t-p}+{ϵ}_{t}+{b}_{1}{ϵ}_{t-1}+\dots +{b}_{q}{ϵ}_{t-q}$ where $E\left[{ϵ}_{t}{ϵ}_{s}\right]=0\ \forall t\ne s$ and ${ϵ}_{t}\sim N\left(0,{\sigma }^{2}\right)$
• one of most common univariate time series models
• Kalman filter can calculate exact log-likelihood, but “conditional” likelihood is easier/more commonly used
• vector autoregression (VAR) model: ${X}_{t}={A}_{1}{X}_{t-1}+\dots +{A}_{p}{X}_{t-p}+{ϵ}_{t}$ where each ${X}_{i}$ is a vector, ${A}_{i}$ is a matrix, and ${ϵ}_{t}\sim N\left(0,\Sigma \right)$
• multivariate time series generalization of univariate AR models
• widely used, esp. in macroecon
• http://www.slideshare.net/wesm/scipy-2011-time-series-analysis-in-python

## 9 Time Series

### 9.1 Processes

• second-order stationary process: $\mu ,{\sigma }^{2}$ (and autocovariance, as a function of lag only) are time-indep
• hazard rate at time $t$: event rate at $t$ conditional on survival until $t$; eg bathtub curve; eg constant (in exponential dists)

### 9.2 Autocorrelation

• autocorrelation function (ACF): for second-order stationary process, $R\left(\tau \right)=\frac{E\left[\left({X}_{t}-\mu \right)\left({X}_{t+\tau }-\mu \right)\right]}{{\sigma }^{2}}$ ($\tau$ is lag)
• when normalized by mean and variance, called autocorrelation coefficient
• partial ACF (PACF): TODO

### 9.3 Models

• Order-$q$ moving avg model MA($q$): ${X}_{t}=\mu +{ϵ}_{t}+{\sum }_{i=1}^{q}{\theta }_{i}{ϵ}_{t-i}$
• Autoregressive model: ${X}_{t}=c+{\sum }_{i=1}^{p}{\varphi }_{i}{X}_{t-i}+{ϵ}_{t}$
• Autoregressive moving avg (ARMA) aka Box-Jenkins model: ${X}_{t}=c+{ϵ}_{t}+{\sum }_{i=1}^{p}{\varphi }_{i}{X}_{t-i}+{\sum }_{i=1}^{q}{\theta }_{i}{ϵ}_{t-i}$
• Autoregressive integrated moving avg (ARIMA): TODO

## 10 Linear Algebra

• Power method or power iteration: an eigenvalue algo
• Given matrix $A$, produce scalar $\lambda$ and non-0 vector $v$ (eigenvector) s.t. $Av=\lambda v$
• Doesn't compute matrix decomposition so suitable for large sparse $A$
• But will only find one eigenpair (the eigenvalue of largest magnitude) and may converge slowly
• Start with random vector ${b}_{0}$ and iteratively multiply by $A$ and normalize: ${b}_{k+1}=\frac{A{b}_{k}}{\parallel A{b}_{k}\parallel }$
• TODO
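A bare power iteration (sketch; dominant-magnitude eigenpair only, eigenvalue recovered via the Rayleigh quotient):

```python
import math

def power_iteration(A, iters=200):
    # A is a square matrix as a list of row lists
    n = len(A)
    b = [1.0] * n
    for _ in range(iters):
        Ab = [sum(A[i][j] * b[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in Ab))
        b = [x / norm for x in Ab]
    # Rayleigh quotient b^T A b gives the eigenvalue
    Ab = [sum(A[i][j] * b[j] for j in range(n)) for i in range(n)]
    lam = sum(bi * abi for bi, abi in zip(b, Ab))
    return lam, b

lam, v = power_iteration([[2.0, 0.0], [0.0, 1.0]])   # dominant eigenvalue 2
```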
• $a\cdot b=\parallel a\parallel \parallel b\parallel cos\theta$
• cross product: $\parallel a×b\parallel =\parallel a\parallel \parallel b\parallel sin\theta$, direction given by right hand rule
• Cauchy-Schwarz inequality: $|a\cdot b|\le \parallel a\parallel \parallel b\parallel$
• triangle inequality: $\parallel a+b\parallel \le \parallel a\parallel +\parallel b\parallel$
• reduced row echelon form
• identity means unique solution
• $\left[\begin{array}{cccc}1& 2& 0& 3\\ 0& 0& 1& -2\\ 0& 0& 0& 0\end{array}\right]$ with pivot variables ${x}_{1},{x}_{3}$ and free vars ${x}_{2},{x}_{4}$ has inf solutions
• $\left[\begin{array}{cccc}1& 2& 0& 3\\ 0& 0& 1& -2\\ 0& 0& 0& -4\end{array}\right]$ has no solutions ($0=-4$)
• tensor product aka outer product $\stackrel{\to }{a}\otimes \stackrel{\to }{b}=\stackrel{\to }{a}{\stackrel{\to }{b}}^{T}$ is $|\stackrel{\to }{a}|×|\stackrel{\to }{b}|$ matrix
• gradient $\nabla f=\left[\begin{array}{c}\frac{\partial f}{\partial {x}_{1}}\\ \frac{\partial f}{\partial {x}_{2}}\\ ⋮\\ \frac{\partial f}{\partial {x}_{n}}\end{array}\right]$
• Hessian ${\nabla }^{2}f=\left[\begin{array}{cccc}\frac{{\partial }^{2}f}{\partial {x}_{1}^{2}}& \frac{{\partial }^{2}f}{\partial {x}_{1}\partial {x}_{2}}& \dots & \frac{{\partial }^{2}f}{\partial {x}_{1}\partial {x}_{n}}\\ \frac{{\partial }^{2}f}{\partial {x}_{2}\partial {x}_{1}}& \frac{{\partial }^{2}f}{\partial {x}_{2}^{2}}& & \frac{{\partial }^{2}f}{\partial {x}_{2}\partial {x}_{n}}\\ ⋮& & \ddots & ⋮\\ \frac{{\partial }^{2}f}{\partial {x}_{n}\partial {x}_{1}}& \frac{{\partial }^{2}f}{\partial {x}_{n}\partial {x}_{2}}& \dots & \frac{{\partial }^{2}f}{\partial {x}_{n}^{2}}\end{array}\right]$ is Jacobian of gradient
• Jacobian of function $f:{\Re }^{n}\to {\Re }^{m}$: $Jf=\left[\begin{array}{cccc}\frac{\partial {f}_{1}}{\partial {x}_{1}}& \frac{\partial {f}_{1}}{\partial {x}_{2}}& \dots & \frac{\partial {f}_{1}}{\partial {x}_{n}}\\ \frac{\partial {f}_{2}}{\partial {x}_{1}}& \frac{\partial {f}_{2}}{\partial {x}_{2}}& & \frac{\partial {f}_{2}}{\partial {x}_{n}}\\ ⋮& & \ddots & ⋮\\ \frac{\partial {f}_{m}}{\partial {x}_{1}}& \frac{\partial {f}_{m}}{\partial {x}_{2}}& \dots & \frac{\partial {f}_{m}}{\partial {x}_{n}}\end{array}\right]$
• graph/Markov chain is irreducible iff strongly connected; matrix is irreducible iff it's transition matrix of irreducible graph/Markov chain

### 10.1 Matrices

• Elementary matrix: matrix that differs from $I$ by one elementary row op
• $A$ is symmetric iff ${A}^{T}=A$
• Let $A$ be $n×n$ over field $K$ (e.g. $\Re$). Then these are equiv:
• $A$ is invertible
• $A$ is nonsingular
• $A$ is nondegenerate
• $A$ is row-equiv to $I$ (can be changed to $I$ using elementary row ops)
• $A$ is col-equiv to $I$
• $A$ has $n$ pivot positions
• $detA\ne 0$
• $rankA=n$
• $Ax=0$ has only the trivial sol $x=0$ (i.e. $NullA=\left\{0\right\}$)
• $Ax=b$ has exactly one sol for each $b\in {K}^{n}$
• cols of $A$ are lin indep
• cols of $A$ span ${K}^{n}$ (i.e. $ColA={K}^{n}$)
• cols of $A$ form a basis of ${K}^{n}$
• the linear transformation mapping $x$ to $Ax$ is a bijection from ${K}^{n}$ to ${K}^{n}$
• ${A}^{T}$ is invertible
• 0 is not an eval of $A$
• $A$ can be expressed as product of finitely many elementary matrices
• For any invertible $A$, these hold:
• ${\left(AB\right)}^{-1}={B}^{-1}{A}^{-1}$
• ${\left(kA\right)}^{-1}={k}^{-1}{A}^{-1}$ for $k\ne 0$
• ${\left({A}^{T}\right)}^{-1}={\left({A}^{-1}\right)}^{T}$
• $det\left({A}^{-1}\right)=det{\left(A\right)}^{-1}$

### 10.2 Decomposition

• Gaussian elim: for solving system of equations
• phase 1, fwd elim: use elementary row ops to get row echelon form (pivots move to right as you move down)
• phase 2, back-subst: solve for unknowns using simplified row echelon form
• Gauss-Jordan elim: extension of Gaussian elim to get reduced row echelon form ($I$ on left square)
• to invert $A$: $\left[A\mid I\right]⇒\left[I\mid {A}^{-1}\right]$
• LU: matrix form of Gaussian elim's fwd elim phase
• a square matrix $A=LU$ where $L,U$ are lower/upper triangular matrices
• efficient; good for “good” matrices
• when solving $Ax=b$, faster to reuse $LU$: if know $A=LU$, then can solve $LUx=b$ by solving $Ly=b$ for $y$ then $Ux=y$ for $x$, with each of these done efficiently using fwd-/back-subst
• QR: good balance btwn LU and SVD
• any real $m×n$ matrix $A=QR$ where $Q$ is orthog $m×m$, $R$ is upper triangular $m×n$
• SVD: very robust

### 10.3 HITS

• $\forall i,\forall k=1,2,3,\dots ,$$\begin{array}{ccc}{a}_{i}^{\left(k\right)}& =& \sum _{j:{e}_{ji}\in E}{h}_{j}^{\left(k-1\right)}\\ {h}_{i}^{\left(k\right)}& =& \sum _{j:{e}_{ij}\in E}{a}_{j}^{\left(k\right)}\end{array}$ and ${a}_{i}^{\left(1\right)}={h}_{i}^{\left(1\right)}=\frac{1}{n}$.
• if $\stackrel{\to }{h},\stackrel{\to }{a}$ are vectors of ${h}_{i},{a}_{i}$ and $\stackrel{\to }{L}$ is adj matrix () then $\begin{array}{ccc}{\stackrel{\to }{a}}^{\left(k\right)}& =& {\stackrel{\to }{L}}^{T}{\stackrel{\to }{h}}^{\left(k-1\right)}\\ {\stackrel{\to }{h}}^{\left(k\right)}& =& \stackrel{\to }{L}{\stackrel{\to }{a}}^{\left(k\right)}\end{array}$ or $\begin{array}{ccc}{\stackrel{\to }{a}}^{\left(k\right)}& =& {\stackrel{\to }{L}}^{T}\stackrel{\to }{L}{\stackrel{\to }{a}}^{\left(k-1\right)}\\ {\stackrel{\to }{h}}^{\left(k\right)}& =& \stackrel{\to }{L}{\stackrel{\to }{L}}^{T}{\stackrel{\to }{h}}^{\left(k-1\right)}\end{array}$ i.e., power method applied to positive semi-definite matrices $\stackrel{\to }{L}{\stackrel{\to }{L}}^{T}$ (auth matrix) and ${\stackrel{\to }{L}}^{T}\stackrel{\to }{L}$ (hub matrix).
• Thus, HITS amounts to solving for largest eigenvalue ${\lambda }_{1}$ in ${\stackrel{\to }{L}}^{T}\stackrel{\to }{L}\stackrel{\to }{a}={\lambda }_{1}\stackrel{\to }{a}$ and $\stackrel{\to }{L}{\stackrel{\to }{L}}^{T}\stackrel{\to }{h}={\lambda }_{1}\stackrel{\to }{h}$.
• http://meyer.math.ncsu.edu/Meyer/PS_Files/IMAGE.pdf

### 10.4 PageRank

• rank of page $i$ at iteration $k$: ${r}_{i}^{\left(k+1\right)}={\sum }_{j\in {I}_{i}}\frac{{r}_{j}^{\left(k\right)}}{|{O}_{j}|}$, where ${I}_{i}$ is pages linking to $i$ and ${O}_{j}$ is pages $j$ links to
• start with uniform ${r}_{i}^{\left(0\right)}=\frac{1}{n}\phantom{\rule{6px}{0ex}}\forall i$
• in matrix notation, ${{\pi }^{\left(k+1\right)}}^{T}={{\pi }^{\left(k\right)}}^{T}H$, where ${H}_{ij}=\frac{1}{|{O}_{i}|}$ if $i$ links to $j$, else 0
• problems
• dangling nodes i.e. pages w no outlinks (rows of all 0s)
• replace rows with $\stackrel{\to }{u}$ of all $\frac{1}{n}$; call resulting matrix $S$
• graph may not be strongly connected
• replace $S$ with $G=\alpha S+\left(1-\alpha \right)E$ where $E$ is teleportation matrix where all rows are $\stackrel{\to }{u}$
• $\pi =\pi G$, usu. with normalization condition ${\sum }_{i}{\pi }_{i}=1$
• properties
• $S,G$ are stochastic matrices (Markov chain transition matrix; rows sum to 1), hence there's always a solution
• actually a unique solution (stationary distribution), either by Markov theory or Perron-Frobenius theorem
• http://meyer.math.ncsu.edu/Meyer/PS_Files/IMAGE.pdf http://cacm.acm.org/magazines/2011/6/108660-pagerank-standing-on-the-shoulders-of-giants/fulltext
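The whole scheme in a few lines (sketch): dangling rows are spread uniformly (the $S$ fix), and teleportation mixes in $E$ with weight $1-\alpha$:

```python
def pagerank(outlinks, alpha=0.85, iters=100):
    # outlinks[i] = list of pages i links to
    n = len(outlinks)
    r = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - alpha) / n] * n        # teleportation term
        for i, outs in enumerate(outlinks):
            if outs:
                for j in outs:             # distribute rank over outlinks
                    new[j] += alpha * r[i] / len(outs)
            else:                          # dangling node: uniform row
                for j in range(n):
                    new[j] += alpha * r[i] / n
        r = new
    return r

ranks = pagerank([[1], [2], [0]])          # 3-cycle: uniform by symmetry
```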

## 11 Optimization

### 11.2 Linear programming

• $arg{max}_{\stackrel{\to }{x}}{\stackrel{\to }{c}}^{T}\stackrel{\to }{x}$ subject to $A\stackrel{\to }{x}\le \stackrel{\to }{b}$ and $\stackrel{\to }{x}\ge 0$
• algo families: basis exchange (eg simplex), interior point (eg ellipsoid)

### 11.3 Quadratic programming

• $arg{min}_{\stackrel{\to }{x}}f\left(\stackrel{\to }{x}\right)=\frac{1}{2}{\stackrel{\to }{x}}^{T}Q\stackrel{\to }{x}+{\stackrel{\to }{c}}^{T}\stackrel{\to }{x}$ subject to $A\stackrel{\to }{x}\le \stackrel{\to }{b}$ and $E\stackrel{\to }{x}=\stackrel{\to }{d}$
• algos: interior point, active set, augmented Lagrangian, conjugate gradient, simplex extensions
• for pos-def $Q$, ellipsoid method solves in poly time
• for indef $Q$ or if $Q$ has only 1 negative eigenvalue, NP-hard

## 12 Machine Learning

### 12.1 Information criteria

• Akaike's information criterion (AIC): $-2logL+2p$ (deviance plus twice the # params)
• $L$ is maximized likelihood using all avail data for estimation
• $p$ is # free params in model
• also seen (TODO resolve?)
• $nlog\frac{RSS}{n}+\frac{2nk}{n-k-1}$ where RSS is residual sum of squares and $k$ is # params (least-squares AICc form; equals AIC $+\frac{2k\left(k+1\right)}{n-k-1}$)
• $-2logL+2p+\frac{2p\left(p+1\right)}{n-p-1}$ http://scott.fortmann-roe.com/docs/MeasuringError.html
• asymptotically, minimizing AIC equiv to minimizing CV value
• Schwarz Bayesian information criterion (BIC): $-2logL+plogn$
• $n$ is # obs
• heavier penalty (once $logn>2$, i.e. $n\ge 8$) means model chosen by BIC is same or simpler (fewer params) than AIC
• asymptotically, minimizing BIC equiv to minimizing leave-$v$-out CV where $v=n\left(1-\frac{1}{logn-1}\right)$
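A numeric sketch of the two criteria (the log-likelihoods and param counts below are made up): the richer model fits slightly better but pays a larger penalty, and BIC penalizes it harder than AIC:

```python
import math

# AIC/BIC from a maximized log-likelihood L (toy, made-up numbers)
def aic(logL, p):            # p = number of free parameters
    return -2 * logL + 2 * p

def bic(logL, p, n):         # n = number of observations
    return -2 * logL + p * math.log(n)

n = 100
a_small, a_big = aic(-120.0, 3), aic(-118.5, 6)
b_small, b_big = bic(-120.0, 3, n), bic(-118.5, 6, n)
```

With these numbers both criteria pick the smaller model, and BIC's margin is wider, matching the "same or simpler" note above.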

### 12.2 Bayesian learning

• hypothesis prior $Pr\left[\Theta \right]$, where $\Theta$ is hyp RV
• likelihood $L\left(\theta \right)=Pr\left[\stackrel{\to }{x}\mid \theta \right]$; log-likelihood $\ell \left(\theta \right)=logPr\left[\stackrel{\to }{x}\mid \theta \right]$
• posterior $Pr\left[\theta \mid \stackrel{\to }{x}\right]=\alpha Pr\left[\stackrel{\to }{x}\mid \theta \right]Pr\left[\theta \right]$
• Bayesian learning: predict $Pr\left[X\text{'}\mid \stackrel{\to }{x}\right]={\sum }_{\theta }Pr\left[X\text{'}\mid \stackrel{\to }{x},\theta \right]Pr\left[\theta \mid \stackrel{\to }{x}\right]={\sum }_{\theta }Pr\left[X\text{'}\mid \theta \right]Pr\left[\theta \mid \stackrel{\to }{x}\right]$
• calculate prob of each hyp and predict over all hyps
• maximum a posteriori (MAP): predict $Pr\left[X\text{'}\mid {\theta }_{MAP}\right]$ where ${\theta }_{MAP}=arg{max}_{\theta }Pr\left[\theta \mid \stackrel{\to }{x}\right]$
• predict from just the single hyp with greatest posterior (easier)
• usually $Pr\left[X\text{'}\mid {\theta }_{MAP}\right]\to Pr\left[X\text{'}\mid \stackrel{\to }{x}\right]$ as more data arrives; otherwise, may snap to incorrect hypothesis
• equiv to MDL: ${\theta }_{MAP}=arg{min}_{\theta }-logPr\left[\stackrel{\to }{x}\mid \theta \right]-logPr\left[\theta \right]$
• maximum likelihood: assume uniform prior over hyps, then ${\theta }_{MAP}=arg{max}_{\theta }Pr\left[\stackrel{\to }{x}\mid \theta \right]$
• reasonable for more data, since data swamps priors

### 12.3 Complete data

• observations are commonly IID: $Pr\left[\stackrel{\to }{x}\mid \theta \right]={\prod }_{i=1}^{n}Pr\left[{\stackrel{\to }{x}}_{i}\mid \theta \right]$
• maximum likelihood
• log likelihood easier to maximize because products become sums: $logPr\left[\stackrel{\to }{x}\mid \theta \right]={\sum }_{i=1}^{n}logPr\left[{\stackrel{\to }{x}}_{i}\mid \theta \right]$
• Naive Bayes: MLE on Bayesian network where (discrete) class is root, attrs are leaves, and attrs are conditionally independent given class
• generative model: either class is a component of $\stackrel{\to }{x}$ (${\stackrel{\to }{x}}_{0}$) or say $Pr\left[\stackrel{\to }{x},C\mid \theta \right]$
• same for both discrete and continuous models
• for network $A\to B$ where $A,B$ are continuous, maximizing likelihood $Pr\left[b\mid a\right]=\frac{1}{\sqrt{2\pi }\sigma }exp\left(-\frac{{\left(b-\left({\theta }_{1}a+{\theta }_{2}\right)\right)}^{2}}{2{\sigma }^{2}}\right)$ is the same as minimizing the (negated) exponent, i.e. the sum of squared errors $E={\sum }_{i=1}^{n}{\left({b}_{i}-\left({\theta }_{1}{a}_{i}+{\theta }_{2}\right)\right)}^{2}$
• Bayesian parameter learning (incorporating hyp probs)
• commonly use conjugate prior for hyp prior $Pr\left[\Theta \right]$ to simplify math
• e.g. if $\Theta =Pr\left[X=head\right]$ and $Pr\left[\theta \right]=beta\left[a,b\right]\left(\theta \right)=\alpha {\theta }^{a-1}{\left(1-\theta \right)}^{b-1}$ and our data is $x=head$, then $Pr\left[\theta \mid x\right]=\alpha Pr\left[x\mid \theta \right]Pr\left[\theta \right]=\alpha \text{'}\theta {\theta }^{a-1}{\left(1-\theta \right)}^{b-1}=beta\left[a+1,b\right]\left(\theta \right)$
• usu. assume param independence so each param can have own beta dist
• incorporating param RVs into Bayesian network itself requires making copies of the variables describing each instance
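The coin example above in code: a conjugate beta update just shifts the exponents by the observed counts (the flip counts below are made up):

```python
# Conjugate beta prior update for a coin's head-probability theta:
# beta[a, b] prior + (h heads, t tails) observed  ->  beta[a + h, b + t] posterior.
def beta_update(a, b, heads, tails):
    return a + heads, b + tails

def beta_mean(a, b):
    # mean of beta[a, b]
    return a / (a + b)

a, b = beta_update(1, 1, heads=7, tails=3)   # uniform beta[1,1] prior + 10 flips
```

The posterior mean $(a+h)/(a+b+h+t)$ interpolates between the prior mean and the observed frequency.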

### 12.4 Incomplete data: EM algo

• hidden/latent variables: indirection that dramatically reduces number of parameters in Bayesian network
• EM algo
• E-step: calculate expected likelihood given current param estimates: $Q\left(\theta \mid {\theta }^{\left(t\right)}\right)={E}_{Z\mid X,{\Theta }^{\left(t\right)}}\left[logL\left(\theta ;X,Z\right)\right]$
• often, in reality, we don't actually have to calculate the expected likelihood, just as we don't have to compute the cost function in gradient descent
• this step usually just updates hidden value estimates, which will then be used in M-step
• M-step: find params that maximize expected likelihood: ${\theta }^{\left(t+1\right)}=arg{max}_{\theta }Q\left(\theta \mid {\theta }^{\left(t\right)}\right)$
• EM algo examples (all have $N$ data points)
• unsupervised clustering: Gaussian mixture model
• single Gaussian: $N\left(x\mid \mu ,\Sigma \right)=\frac{1}{{\left(2\pi \right)}^{d/2}\sqrt{|\Sigma |}}exp\left(-\frac{1}{2}{\left(x-\mu \right)}^{T}{\Sigma }^{-1}\left(x-\mu \right)\right)$
• GMM: $Pr\left[x\right]={\sum }_{i=1}^{k}Pr\left[C=i\right]\cdot N\left(x\mid {\mu }_{i},{\Sigma }_{i}\right)$ where $k$ is # Gaussians
• while ANNs are universal approximators of functions, GMMs are universal approximators of densities
• even diagonal GMMs are universal approximators; full-rank are unwieldy, since # params is square of # dims
• like K-means with probabilistic assignments and Gaussians instead of means (K-means is a hard EM algo)
• very sensitive to initialization; may initialize with K-means
• mixture model of $k$ components: $Pr\left[\stackrel{\to }{x}\right]={\sum }_{i=1}^{k}Pr\left[C=i\right]Pr\left[\stackrel{\to }{x}\mid C=i\right]$
• Bayes net: $C\to \stackrel{\to }{X}$, $C$ hidden, $C$ discrete, $\stackrel{\to }{X}$ continuous
• E-step: calculate some quantities useful later (assignment probabilities, which are the hidden variables):$\begin{array}{ccc}{p}_{ij}& ←& Pr\left[{C}_{j}=i\mid {\stackrel{\to }{x}}_{j}\right]=\alpha Pr\left[{\stackrel{\to }{x}}_{j}\mid {C}_{j}=i\right]Pr\left[{C}_{j}=i\right]\\ {p}_{i}& ←& \sum _{j}{p}_{ij}\end{array}$
• M-step: maximize expected likelihood of observed & hidden vars$\begin{array}{ccc}{\stackrel{\to }{\mu }}_{i}& ←& \sum _{j}{p}_{ij}{\stackrel{\to }{x}}_{j}/{p}_{i}\\ {\stackrel{\to }{\Sigma }}_{i}& ←& \sum _{j}{p}_{ij}\left({\stackrel{\to }{x}}_{j}-{\stackrel{\to }{\mu }}_{i}\right){\left({\stackrel{\to }{x}}_{j}-{\stackrel{\to }{\mu }}_{i}\right)}^{T}/{p}_{i}\\ Pr\left[C=i\right]={\theta }_{i}& ←& {p}_{i}/N\end{array}$
• to avoid local maxima (component shrinking to a single point, two components merging, etc.):
• use priors on params to apply MAP version of EM
• restart a component with new random params if it gets too small or too close to another component
• initialize params with reasonable values
• can also do MAP GMM
• Naive Bayes with hidden class
• Bayes net: $X\to \stackrel{\to }{Y}$, $X$ hidden, $X,Y$ discrete
• E-step: $\begin{array}{ccc}{p}_{ij}& ←& Pr\left[X=i\mid \stackrel{\to }{Y}={\stackrel{\to }{y}}_{j}\right]\\ {p}_{i}& ←& \sum _{j}{p}_{ij}\end{array}$
• M-step: $\stackrel{ˆ}{N}$ are expected counts: $\begin{array}{ccc}Pr\left[X=i\right]={\theta }_{i}& ←& \frac{\stackrel{ˆ}{N}\left(X=i\right)}{N}=\frac{1}{N}\sum _{j=1}^{N}Pr\left[X=i\mid {\stackrel{\to }{y}}_{j}\right]=\frac{{p}_{i}}{N}\\ Pr\left[\stackrel{\to }{Y}=\stackrel{\to }{y}\mid X=i\right]={\stackrel{\to }{\theta }}_{i}& ←& \frac{\stackrel{ˆ}{N}\left(\stackrel{\to }{Y}=\stackrel{\to }{y},X=i\right)}{\stackrel{ˆ}{N}\left(X=i\right)}=\frac{{\sum }_{j:{\stackrel{\to }{y}}_{j}=\stackrel{\to }{y}}{p}_{ij}}{{p}_{i}}\end{array}$
• HMMs: dynamic Bayes net with single discrete state var
• each data point is sequence of observations
• transition probs repeat across time: $\forall t,{\theta }_{ijt}={\theta }_{ij}$
• E-step: modify forward-backward algo to compute expected counts below
• obtained by smoothing rather than filtering: must pay attn to subsequent evidence in estimating prob of a particular transition (eg evidence is obtained after crime)
• M-step: ${\theta }_{ij}←\frac{{\sum }_{t}\stackrel{ˆ}{N}\left({X}_{t+1}=j,{X}_{t}=i\right)}{{\sum }_{t}\stackrel{ˆ}{N}\left({X}_{t}=i\right)}$
• EM algo
• pretend we know the params, then “complete” the data by inferring prob dists over the hidden vars, then find params that maximize likelihood of observed & hidden vars
• gist: ${\stackrel{\to }{\theta }}^{\left(t+1\right)}=arg{max}_{\stackrel{\to }{\theta }}{\sum }_{\stackrel{\to }{z}}Pr\left[\stackrel{\to }{Z}=\stackrel{\to }{z}\mid \stackrel{\to }{x},{\stackrel{\to }{\theta }}^{\left(t\right)}\right]\ell \left(\stackrel{\to }{x},\stackrel{\to }{Z}=\stackrel{\to }{z}\mid \theta \right)$
• E-step: compute $Q\left(\stackrel{\to }{\theta }\mid {\stackrel{\to }{\theta }}^{\left(t\right)}\right)={E}_{\stackrel{\to }{Z}\mid \stackrel{\to }{x},{\stackrel{\to }{\theta }}^{\left(t\right)}}\left[\ell \left(\stackrel{\to }{x},\stackrel{\to }{Z}\mid \stackrel{\to }{\theta }\right)\right]$
• expected likelihood over $\stackrel{\to }{Z}$ under current ${\stackrel{\to }{\theta }}^{\left(t\right)}$
• misnomer: what's calculated are fixed, data-dependent params of $Q$
• M-step: compute ${\stackrel{\to }{\theta }}^{\left(t+1\right)}=arg{max}_{\stackrel{\to }{\theta }}Q\left(\stackrel{\to }{\theta }\mid {\stackrel{\to }{\theta }}^{\left(t\right)}\right)$
• new $\stackrel{\to }{\theta }$ that maximizes the expected likelihood
• resembles gradient-based hill-climbing but no “step size” param
• monotonically increases likelihood
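The GMM E/M updates above, specialized to 1-D with scalar variances and run on synthetic two-cluster data (a minimal sketch; sensitive to initialization, as noted):

```python
import numpy as np

# EM for a 1-D, two-component Gaussian mixture (toy synthetic data).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu = np.array([-1.0, 1.0])            # deliberately poor init
var = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])              # mixing weights Pr[C=i]
for _ in range(200):
    # E-step: responsibilities p[i, j] = Pr[C_j = i | x_j]
    p = w[:, None] * normal_pdf(x[None, :], mu[:, None], var[:, None])
    p /= p.sum(axis=0, keepdims=True)
    p_i = p.sum(axis=1)
    # M-step: weighted MLE for means, variances, mixing weights
    mu = (p @ x) / p_i
    var = (p * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / p_i
    w = p_i / len(x)
```

Replacing the soft responsibilities with hard 0/1 assignments (and dropping the variances) recovers K-means, per the note above.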

### 12.5 Kernel models

• aka Parzen-Rosenblatt window
• each instance contributes small density function $K\left(\stackrel{\to }{x},{\stackrel{\to }{x}}_{i}\right)$
• density estimation: $p\left(\stackrel{\to }{x}\right)=\frac{1}{N}{\sum }_{i=1}^{N}K\left(\stackrel{\to }{x},{\stackrel{\to }{x}}_{i}\right)$
• kernel normally depends only on distance $D\left(\stackrel{\to }{x},{\stackrel{\to }{x}}_{i}\right)$
• eg $d$-dimensional Gaussian $K\left(\stackrel{\to }{x},{\stackrel{\to }{x}}_{i}\right)=\frac{1}{{\left(w\sqrt{2\pi }\right)}^{d}}exp\left(-\frac{D{\left(\stackrel{\to }{x},{\stackrel{\to }{x}}_{i}\right)}^{2}}{2{w}^{2}}\right)$
• supervised learning: take weighted combination of all predictions
• vs kNN's unweighted combination of $k$ instances
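Density estimation per the formula above, in 1-D with a Gaussian kernel of width $w$ (toy data points; a sketch):

```python
import numpy as np

# p(x) = (1/N) sum_i K(x, x_i), Gaussian kernel of bandwidth w (1-D)
def kde(x, data, w=0.5):
    bumps = np.exp(-(x - data) ** 2 / (2 * w ** 2)) / (w * np.sqrt(2 * np.pi))
    return bumps.mean()      # each instance contributes one small density bump

data = np.array([-1.0, 0.0, 0.0, 1.0])
```

Since each bump integrates to 1, so does their average, making `kde` a proper density.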

### 12.6 Classification

• linear classifiers: simplest type of feedforward neural network
• TODO: single- vs multi-layer perceptron; feedforward vs backpropagation
• perceptron learning algorithm: for each iteration, if ${y}_{t}\left(\stackrel{\to }{\theta }\cdot \stackrel{\to }{{x}_{t}}\right)\le 0$ (mistake), then $\stackrel{\to }{\theta }←\stackrel{\to }{\theta }+{y}_{t}\stackrel{\to }{{x}_{t}}$.
• makes at most $\frac{{R}^{2}}{{\gamma }_{g}^{2}}$ mistakes on training set, where $\parallel \stackrel{\to }{{x}_{i}}\parallel \le R$ and ${\gamma }_{g}\le \frac{{y}_{i}\left(\stackrel{\to }{{\theta }^{*}}\cdot \stackrel{\to }{{x}_{i}}\right)}{\parallel \stackrel{\to }{{\theta }^{*}}\parallel }$ is the margin
• support vector machine (SVM): maximum margin classifier with some slack
• equivalent formulation assuming ${y}_{t}$ are 1/0 instead of $±1$:$\begin{array}{ccc}minimize& & C\sum _{t=1}^{n}\left[{y}_{t}{cost}_{1}\left({\stackrel{\to }{\theta }}^{T}{\stackrel{\to }{x}}_{t}+{\theta }_{0}\right)+\left(1-{y}_{t}\right){cost}_{0}\left({\stackrel{\to }{\theta }}^{T}{\stackrel{\to }{x}}_{t}+{\theta }_{0}\right)\right]+\frac{1}{2}{\parallel \stackrel{\to }{\theta }\parallel }^{2}\\ where& & {cost}_{1}\left(z\right)=max\left(0,1-z\right)\\ & & {cost}_{0}\left(z\right)=max\left(0,1+z\right)\end{array}$since $\begin{array}{ccc}{y}_{t}\left({\stackrel{\to }{\theta }}^{T}{\stackrel{\to }{x}}_{t}+{\theta }_{0}\right)& \ge & 1-{\xi }_{t}\\ {\xi }_{t}& \ge & 0\end{array}$becomes (given that we're trying to minimize all the ${\xi }_{t}$)$\begin{array}{ccc}{\xi }_{t}& =& max\left(0,1-{y}_{t}\left({\stackrel{\to }{\theta }}^{T}{\stackrel{\to }{x}}_{t}+{\theta }_{0}\right)\right)\\ & =& {cost}_{1}\left({\stackrel{\to }{\theta }}^{T}{\stackrel{\to }{x}}_{t}+{\theta }_{0}\right)\end{array}$
• LOOCV error $\frac{1}{n}{\sum }_{i=1}^{n}Loss\left({y}_{i},f\left({x}_{i};{\stackrel{ˆ}{\stackrel{\to }{\theta }}}^{-i},{\stackrel{ˆ}{\stackrel{\to }{\theta }}}_{0}^{-i}\right)\right)$ where Loss is the 0-1 loss. Upper bound is (# support vectors / $n$).
• quadratic programming optimization problem: single global optimum that can be found efficiently
• dual:
• kernel trick: substitute kernel function $K\left({\stackrel{\to }{x}}_{i},{\stackrel{\to }{x}}_{j}\right)=F\left({\stackrel{\to }{x}}_{i}\right)\cdot F\left({\stackrel{\to }{x}}_{j}\right)$ where $F$ maps to high/infinite dimensions but $K$ can still be computed efficiently
• Mercer's theorem: any “reasonable” (positive definite) kernel function corresponds to some feature space
• kernels
• quadratic: ${\left({\stackrel{\to }{x}}_{i}\cdot {\stackrel{\to }{x}}_{j}\right)}^{2}$ (common illustration: slicing hyperparabola yields circular separator)
• polynomial: ${\left(1+{\stackrel{\to }{x}}_{i}\cdot {\stackrel{\to }{x}}_{j}\right)}^{d}$
• radial basis function (RBF): often the best
• logistic regression: model using the logistic/sigmoid function (inverse of the logit) $Pr\left[y=1\mid \stackrel{\to }{x};\stackrel{\to }{\theta }\right]={h}_{\stackrel{\to }{\theta }}\left(\stackrel{\to }{x}\right)=g\left(z\right)=\frac{{e}^{z}}{{e}^{z}+1}=\frac{1}{1+{e}^{-z}},\phantom{\rule{6px}{0ex}}z={\theta }_{0}+\stackrel{\to }{\theta }\cdot \stackrel{\to }{x}$
• loss function
• $J\left(\theta \right)=\frac{1}{m}{\sum }_{i=1}^{m}Cost\left({h}_{\stackrel{\to }{\theta }}\left({\stackrel{\to }{x}}^{\left(i\right)}\right),{y}^{\left(i\right)}\right)=-\frac{1}{m}{\sum }_{i=1}^{m}\left[{y}^{\left(i\right)}log{h}_{\stackrel{\to }{\theta }}\left({\stackrel{\to }{x}}^{\left(i\right)}\right)+\left(1-{y}^{\left(i\right)}\right)log\left(1-{h}_{\stackrel{\to }{\theta }}\left(\stackrel{\to }{{x}^{\left(i\right)}}\right)\right)\right]$ (a clever differentiable form)
• intuition: 0 when completely correct and $\infty$ when completely incorrect
• gradient descent: identical (LMS) update rule & derivation as in linear regression but $h$ is non-linear (logit)
• for perfectly separable data, $\theta \to \infty$; need regularization
• another algo: Newton's method to find zero of $\ell \text{'}\left(\theta \right)$
• coefficients & intercept have log-odds interpretation
• $z=log\frac{p}{1-p}$, where $p=Pr\left[y=1\mid \stackrel{\to }{x};\stackrel{\to }{\theta }\right]=g\left(z\right)$
• intercept: log-odds if $\stackrel{\to }{x}=\stackrel{\to }{0}$
• coefficient for indicator: log-odds between 1 and 0 groups
• coefficient for continuous: log-odds between unit deltas in value
• http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm
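Gradient ascent on the log-likelihood gives the LMS-shaped update noted above; a toy 1-D sketch (made-up, perfectly separable data, so $\theta$ would keep growing with more iterations, per the regularization note):

```python
import numpy as np

# Batch gradient ascent for logistic regression (toy 1-D data, col 0 = intercept)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = np.zeros(2)
alpha = 0.5
for _ in range(5000):
    h = sigmoid(X @ theta)
    theta += alpha * X.T @ (y - h)   # same LMS shape as linear regression
```

Predictions threshold $h$ at 0.5, i.e. $z$ at 0.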

### 12.7 Regression

• squared error $SE={\sum }_{i}{\left({y}_{i}-{\stackrel{ˆ}{y}}_{i}\right)}^{2}$, $MSE=\frac{1}{n}SE$, $RMSE=\sqrt{MSE}$
• bias-variance tradeoff: $MSE=Var\left[{\stackrel{ˆ}{y}}_{i}\right]+E{\left[{\stackrel{ˆ}{y}}_{i}-{y}_{i}\right]}^{2}=Var\left[{\stackrel{ˆ}{y}}_{i}\right]+{Bias}^{2}$
• relative sq err $RSE=\frac{{\sum }_{i}{\left({y}_{i}-{\stackrel{ˆ}{y}}_{i}\right)}^{2}}{{\sum }_{i}{\left({y}_{i}-\stackrel{‾}{y}\right)}^{2}}$, root rel sq err $RRSE=\sqrt{RSE}$
• $MAE=\frac{1}{n}{\sum }_{i}|{y}_{i}-{\stackrel{ˆ}{y}}_{i}|$, rel abs err $RAE=\frac{{\sum }_{i}|{y}_{i}-{\stackrel{ˆ}{y}}_{i}|}{{\sum }_{i}|{y}_{i}-\stackrel{‾}{y}|}$
• mean abs pct err $MAPE=\frac{1}{n}{\sum }_{i}|\frac{{y}_{i}-{\stackrel{ˆ}{y}}_{i}}{{y}_{i}}|$ (drawbacks: zeros, unbounded)
• relative errors can compare models of different units
• coefficient of determination ${R}^{2}=\frac{S{S}_{reg}}{S{S}_{tot}}=1-\frac{S{S}_{err}}{S{S}_{tot}}=1-RSE$ where
• $S{S}_{tot}={\sum }_{i}{\left({y}_{i}-\stackrel{‾}{y}\right)}^{2}$: total sum of squares (proportional to sample variance)
• $S{S}_{reg}={\sum }_{i}{\left({f}_{i}-\stackrel{‾}{y}\right)}^{2}$: regression/explained sum of squares
• $S{S}_{err}={\sum }_{i}{\left({y}_{i}-{f}_{i}\right)}^{2}$: residual sum of squares; sum of squared residuals
• ${R}^{2}=1$ is perfect; ${R}^{2}<0$ means $\stackrel{‾}{y}$ is better than the model
• adjusted ${R}^{2}$ is $1-\left(1-{R}^{2}\right)\frac{n-1}{n-p-1}=1-\frac{S{S}_{err}}{S{S}_{tot}}\cdot \frac{d{f}_{t}}{d{f}_{e}}=1-\frac{{Var}_{err}}{{Var}_{tot}}$, where
• $p$ is total number of regressors in linear model
• $n$ is sample size
• $d{f}_{t}$ is dof $n-1$ of estimate of population variance of $Y$
• $d{f}_{e}$ is dof $n-p-1$ of estimate of underlying population error variance
• ${Var}_{err}=\frac{S{S}_{err}}{n-p-1}$, ${Var}_{tot}=\frac{S{S}_{tot}}{n-1}$
• increases only if new regressor improves model more than would be expected by chance; penalizes too-many-regressors
• useful for feature selection, with small samples
• intervals
• confidence interval: tells us about population params (mean/variance, or model params)
• “at confidence level 95%, 95% of samples (repeating this experiment) would generate CIs containing true params”
• prediction interval: tells us about data values; always wider than CI
• “next value is in PI of 95% of samples (repeated experiments)”
• or: “95% of repeated experiments generate PI containing next random value”
• non-parametric assumes nothing about population; simply think of numbers of points/gaps
• tolerance interval: tells us about percentage of all future values; usu. wider than PI
• “95% of all future values are in TI in 95% of samples/repeated experiments”
• eg: in simple linear regression, $\stackrel{ˆ}{y}=\stackrel{ˆ}{\alpha }+\stackrel{ˆ}{\beta }x=E\left[y\mid x\right]$ is the mean response, while $y=\alpha +\beta x+\epsilon$ is actual response
• $\stackrel{ˆ}{y}$ use CI, since $\stackrel{ˆ}{\alpha },\stackrel{ˆ}{\beta }$ use CI; draw confidence bands in plot for what the possible regression line is
• $y$ use PI; draw prediction bands in plot for what the possible values are
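The error metrics at the top of this section on a toy prediction vector (numbers made up):

```python
import numpy as np

# Regression metrics from the definitions above
y = np.array([3.0, 5.0, 7.0, 9.0])
yhat = np.array([2.5, 5.0, 7.5, 9.0])

se = np.sum((y - yhat) ** 2)            # squared error
mse = se / len(y)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - yhat))
rse = se / np.sum((y - y.mean()) ** 2)  # relative squared error
r2 = 1 - rse                            # coefficient of determination
```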

### 12.8 Support vector regression (SVR)

• linear regression formulation with no slack: minimize $\frac{1}{2}{\parallel \stackrel{\to }{w}\parallel }^{2}$ subject to ${y}_{i}-⟨\stackrel{\to }{w},{\stackrel{\to }{x}}_{i}⟩-b\le ϵ$ and $⟨\stackrel{\to }{w},{\stackrel{\to }{x}}_{i}⟩+b-{y}_{i}\le ϵ$ (require prediction $⟨\stackrel{\to }{w},{\stackrel{\to }{x}}_{i}⟩+b$ to be within $±ϵ$ of ${y}_{i}$)

### 12.9 Linear regression

• ordinary least squares (OLS) regression: a linear regression
• minimize cost function $J\left(\theta \right)=\frac{1}{2}{\sum }_{i=1}^{m}{\left({y}^{\left(i\right)}-{h}_{\stackrel{\to }{\theta }}\left({\stackrel{\to }{x}}^{\left(i\right)}\right)\right)}^{2},{h}_{\stackrel{\to }{\theta }}\left(\stackrel{\to }{x}\right)=\stackrel{\to }{\theta }\cdot \stackrel{\to }{x}$ (sum of squared errors)
• gradient descent: repeatedly ${\theta }_{j}←{\theta }_{j}-\alpha \frac{\partial }{\partial {\theta }_{j}}J\left(\theta \right)$
• $\alpha$ is configurable learning rate
• batch: cost over all training instances
• ${\theta }_{j}←{\theta }_{j}+\alpha {\sum }_{i=1}^{m}\left({y}^{\left(i\right)}-\stackrel{\to }{\theta }\cdot {\stackrel{\to }{x}}^{\left(i\right)}\right){\stackrel{\to }{x}}_{j}^{\left(i\right)}$
• stochastic aka incremental: converges faster
• ${\theta }_{j}←{\theta }_{j}+\alpha \left({y}^{\left(i\right)}-\stackrel{\to }{\theta }\cdot {\stackrel{\to }{x}}^{\left(i\right)}\right){\stackrel{\to }{x}}_{j}^{\left(i\right)}$
• update rule is called least mean squares (LMS) rule aka Widrow-Hoff rule
• derivation for single training instance: $\frac{\partial }{\partial {\theta }_{j}}J\left(\stackrel{\to }{\theta }\right)=\frac{\partial }{\partial {\theta }_{j}}\frac{1}{2}{\left(y-\stackrel{\to }{\theta }\cdot \stackrel{\to }{x}\right)}^{2}=2\cdot \frac{1}{2}\left(y-\stackrel{\to }{\theta }\cdot \stackrel{\to }{x}\right)\frac{\partial }{\partial {\theta }_{j}}\left(y-\stackrel{\to }{\theta }\cdot \stackrel{\to }{x}\right)=-\left(y-\stackrel{\to }{\theta }\cdot \stackrel{\to }{x}\right){x}_{j}$
• magnitude of update proportional to error term
• can minimize in closed form without iterative algo (some matrix calculus): $\theta ={\left({X}^{T}X\right)}^{-1}{X}^{T}\stackrel{\to }{y}$
• LOOCV can also be computed without training $n$ models: $\frac{1}{n}{\sum }_{i=1}^{n}{\left(\frac{{e}_{i}}{1-{h}_{i}}\right)}^{2}$ where ${e}_{i}$ is residual
• probabilistic interp: why linear regression/why $J$?
• assume $y=\stackrel{\to }{\theta }\cdot \stackrel{\to }{x}+\epsilon$ where $\epsilon$ is normally distributed
• in the following, design matrix $X$ has training inputs as rows, and examples are indep
• to maximize likelihood $L\left(\stackrel{\to }{\theta }\right)=p\left(\stackrel{\to }{y}\mid X;\stackrel{\to }{\theta }\right)={\prod }_{i=1}^{m}\frac{1}{\sqrt{2\pi }\sigma }exp\left(-\frac{{\left({y}^{\left(i\right)}-\stackrel{\to }{\theta }\cdot {\stackrel{\to }{x}}^{\left(i\right)}\right)}^{2}}{2{\sigma }^{2}}\right)$, must minimize cost function in exponent
• can work it out by writing log likelihood
• always passes through mean of $x$s and $y$s
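Both routes above (normal equation and batch LMS gradient descent) on toy, exactly-linear data:

```python
import numpy as np

# Closed form theta = (X^T X)^{-1} X^T y vs batch gradient descent (toy data)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # col 0 = intercept
y = np.array([1.0, 3.0, 5.0, 7.0])       # exactly y = 1 + 2x

theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

theta = np.zeros(2)
alpha = 0.05
for _ in range(10000):
    theta += alpha * X.T @ (y - X @ theta)   # batch LMS update
```

Both should agree (and both lines pass through the mean point, per the note above).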

### 12.10 Multi-layer feed-forward neural networks

• Assume $m$ examples, $K$ outputs, $L$ layers, ${s}_{l}$ nodes in layer $l$
• Params ${\Theta }^{\left(l\right)}$ is ${s}_{l+1}×{s}_{l}$ such that ${\Theta }_{ij}^{\left(l\right)}$ is weight from node $j$ in layer $l$ to node $i$ in layer $l+1$ (subscripts are “backward”)
• Regularized cost $J\left(\Theta \right)=-\frac{1}{m}\left[{\sum }_{i=1}^{m}{\sum }_{k=1}^{K}{y}_{k}^{\left(i\right)}log{\left({h}_{\Theta }\left({x}^{\left(i\right)}\right)\right)}_{k}+\left(1-{y}_{k}^{\left(i\right)}\right)log\left(1-{\left({h}_{\Theta }\left({x}^{\left(i\right)}\right)\right)}_{k}\right)\right]+\frac{\lambda }{2m}{\sum }_{l=1}^{L-1}{\sum }_{i=1}^{{s}_{l+1}}{\sum }_{j=1}^{{s}_{l}}{\left({\Theta }_{ij}^{\left(l\right)}\right)}^{2}$
• Or squared error for regressions; these turn out to have the same derivations
• Backpropagation: algorithm used to compute gradients
• Need random initialization of $\Theta$ to break symmetry or all params will be equal
• ${\Delta }_{ij}^{\left(l\right)}←0\forall l,i,j$
• for $i=1$ to $m$:
• Forward-propagate activations ${\stackrel{\to }{a}}^{\left(l\right)}=g\left({\stackrel{\to }{z}}^{\left(l\right)}\right)$, ${\stackrel{\to }{z}}^{\left(l\right)}={\Theta }^{\left(l-1\right)}{\stackrel{\to }{a}}^{\left(l-1\right)}$ for $l=2,\dots ,L$ where ${\stackrel{\to }{a}}^{\left(1\right)}={\stackrel{\to }{x}}^{\left(i\right)}$
• Backpropagate errors ${\stackrel{\to }{\delta }}^{\left(l\right)}={\left({\Theta }^{\left(l\right)}\right)}^{T}{\stackrel{\to }{\delta }}^{\left(l+1\right)}\odot g\text{'}\left({\stackrel{\to }{z}}^{\left(l\right)}\right)$ for $l=L-1,\dots ,2$ where
• ${\stackrel{\to }{\delta }}^{\left(L\right)}={\stackrel{\to }{a}}^{\left(L\right)}-{y}^{\left(i\right)}$
• $g\text{'}\left({\stackrel{\to }{z}}^{\left(l\right)}\right)={\stackrel{\to }{a}}^{\left(l\right)}\odot \left(1-{\stackrel{\to }{a}}^{\left(l\right)}\right)$ since $g\text{'}\left(x\right)=g\left(x\right)\left(1-g\left(x\right)\right)$
• $a\odot b$ is element-wise multiplication (http://math.stackexchange.com/questions/52578/symbol-for-elementwise-multiplication-of-vectors)
• ${\Delta }_{ij}^{\left(l\right)}←{\Delta }_{ij}^{\left(l\right)}+{\delta }_{i}^{\left(l+1\right)}{a}_{j}^{\left(l\right)}$ for $l=L-1,\dots ,1$
• Excellent explanations of derivation at http://www.ml-class.org/course/qna/view?id=3740, http://www.scribd.com/doc/72228829/Back-Propagation
• Derivation of gradient for layer $L-1$ focusing on a single example ($m=1$)
• Define ${\delta }_{i}^{\left(l\right)}=\frac{\partial J}{\partial {z}_{i}^{\left(l\right)}}$: (how do we reduce the second factor?)$\begin{array}{ccc}\frac{\partial J}{\partial {\Theta }_{ij}^{\left(l-1\right)}}& =& \frac{\partial J}{\partial {z}_{i}^{\left(l\right)}}\frac{\partial {z}_{i}^{\left(l\right)}}{\partial {\Theta }_{ij}^{\left(l-1\right)}}={\delta }_{i}^{\left(l\right)}\frac{\partial }{\partial {\Theta }_{ij}^{\left(l-1\right)}}\left(\sum _{k}{\Theta }_{ik}^{\left(l-1\right)}{a}_{k}^{\left(l-1\right)}\right)={\delta }_{i}^{\left(l\right)}{a}_{j}^{\left(l-1\right)}\end{array}$
• For output layer:$\begin{array}{ccc}{\delta }_{i}^{\left(L\right)}& =& \frac{\partial J}{\partial {z}_{i}^{\left(L\right)}}\\ & =& \frac{\partial }{\partial {z}_{i}^{\left(L\right)}}\left(-\sum _{k=1}^{K}\left[{y}_{k}logg\left({z}_{k}^{\left(L\right)}\right)+\left(1-{y}_{k}\right)log\left(1-g\left({z}_{k}^{\left(L\right)}\right)\right)\right]\right)\\ & =& \frac{\partial }{\partial {z}_{i}^{\left(L\right)}}\left(-{y}_{i}logg\left({z}_{i}^{\left(L\right)}\right)-\left(1-{y}_{i}\right)log\left(1-g\left({z}_{i}^{\left(L\right)}\right)\right)\right)\\ & =& -\left({y}_{i}\frac{1}{g\left({z}_{i}^{\left(L\right)}\right)}-\left(1-{y}_{i}\right)\frac{1}{1-g\left({z}_{i}^{\left(L\right)}\right)}\right)g\left({z}_{i}^{\left(L\right)}\right)\left(1-g\left({z}_{i}^{\left(L\right)}\right)\right)\\ & =& -\left({y}_{i}\left(1-g\left({z}_{i}^{\left(L\right)}\right)\right)-\left(1-{y}_{i}\right)g\left({z}_{i}^{\left(L\right)}\right)\right)\\ & =& -\left({y}_{i}-g\left({z}_{i}^{\left(L\right)}\right)\right)\end{array}$
• Build on this to get previous layers
• For simpler squared error $J\left(\Theta \right)=\frac{1}{2}{\sum }_{j=1}^{K}{\left({y}_{j}-{a}_{j}\right)}^{2}$, slightly different:$\begin{array}{ccc}{\delta }_{i}^{\left(L\right)}& =& \frac{\partial }{\partial {z}_{i}^{\left(L\right)}}\left(\frac{1}{2}\sum _{k=1}^{K}{\left({y}_{k}-g\left({z}_{k}^{\left(L\right)}\right)\right)}^{2}\right)\\ & =& -\left({y}_{i}-g\left({z}_{i}^{\left(L\right)}\right)\right)g\text{'}\left({z}_{i}^{\left(L\right)}\right)\end{array}$
• Generally if single hidden layer then choose more hidden units than inputs/outputs, and choose same # hidden units in each layer if more than 1 hidden layer
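The backprop equations above as a numerical gradient check (sigmoid units, cross-entropy cost; bias units and the $1/m$ factor are omitted to keep the sketch short, and the shapes/inputs are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(thetas, x, y):
    # forward pass + cross-entropy cost, single example
    a = x
    for Th in thetas:
        a = sigmoid(Th @ a)
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

def gradients(thetas, x, y):
    # forward: keep all activations
    acts = [x]
    for Th in thetas:
        acts.append(sigmoid(Th @ acts[-1]))
    # backward: delta^(L) = a^(L) - y; delta^(l) = Theta^T delta .* a(1-a)
    delta = acts[-1] - y
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):
        grads[l] = np.outer(delta, acts[l])   # Delta_ij = delta_i * a_j
        if l > 0:
            delta = (thetas[l].T @ delta) * acts[l] * (1 - acts[l])
    return grads

thetas = [rng.normal(0, 0.5, (3, 2)), rng.normal(0, 0.5, (1, 3))]
x, y = np.array([0.3, -0.8]), np.array([1.0])
grads = gradients(thetas, x, y)
```

Comparing `grads` against central finite differences of `cost` is the standard way to validate a backprop implementation.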

### 12.11 Local regression

• locally weighted scatterplot smoothing (LOESS aka LOWESS) aka locally weighted regression (LWR)
• a type of smoother
• typically LOESS is variable-bandwidth (fixed-span) smoother (like nearest-neighbors)
• typically LOESS is locally quadratic or linear
• weight functions/kernels: tricubic (traditional), Gaussian, ...
• fixed-bandwidth example: to predict at $x$, fit $\theta$ to minimize ${\sum }_{i}{w}^{\left(i\right)}{\left({y}^{\left(i\right)}-\stackrel{\to }{\theta }\cdot {\stackrel{\to }{x}}^{\left(i\right)}\right)}^{2}$ where typically ${w}^{\left(i\right)}=exp\left(-\frac{{\left({x}^{\left(i\right)}-x\right)}^{2}}{2{\tau }^{2}}\right)$ (some kernel) and $\tau$ is bandwidth param
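The fixed-bandwidth example above reduces to a weighted least squares solve per query point; a sketch fitting $sin$ with local lines ($\tau$ and the grid are arbitrary choices):

```python
import numpy as np

# Locally weighted (linear) regression: refit at each query point x0
def lwr_predict(x0, X, y, tau=1.0):
    # Gaussian weights centered at the query point (col 0 of X is the intercept)
    w = np.exp(-(X[:, 1] - x0) ** 2 / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return theta @ np.array([1.0, x0])

xs = np.linspace(0, 2 * np.pi, 50)
X = np.column_stack([np.ones_like(xs), xs])
y = np.sin(xs)
```

Small $\tau$ tracks local curvature closely; large $\tau$ approaches a single global line.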

### 12.12 Regularization

• overfitting tends to occur when large weights found in $\stackrel{\to }{\beta }$
• regularization pressures $\stackrel{\to }{\beta }$ to be small
• LASSO/L1: minimize ${\sum }_{i}{\left(\stackrel{\to }{\beta }\cdot {\stackrel{\to }{x}}_{i}-{y}_{i}\right)}^{2}$ subject to ${\parallel \stackrel{\to }{\beta }\parallel }_{1}\le s$, where ${\parallel \stackrel{\to }{\beta }\parallel }_{1}={\sum }_{j}|{\stackrel{\to }{\beta }}_{j}|$ (${\ell }_{1}$ norm)
• equiv: minimize ${\sum }_{i}\frac{1}{2n}{\left(\stackrel{\to }{\beta }{\stackrel{\to }{x}}_{i}-{y}_{i}\right)}^{2}+\lambda {\parallel \stackrel{\to }{\beta }\parallel }_{1}$
• more notes in AI.page
• better than L2 when # features is large relative to # examples (drives many coefficients exactly to 0)
• ridge/Tikhonov: instead of ${\parallel X\stackrel{\to }{\beta }-\stackrel{\to }{y}\parallel }_{2}^{2}$, minimize ${\parallel X\stackrel{\to }{\beta }-\stackrel{\to }{y}\parallel }_{2}^{2}+{\parallel \Gamma \stackrel{\to }{\beta }\parallel }_{2}^{2}$ for some suitable Tikhonov matrix $\Gamma$
• L2 regularization: when $\Gamma =I$
• effect aka shrinkage
• $\Gamma$ can also be highpass to enforce smoothing
• explicit solution $\stackrel{ˆ}{\beta }={\left({X}^{T}X+{\Gamma }^{T}\Gamma \right)}^{-1}{X}^{T}\stackrel{\to }{y}$ ($O\left({n}^{3}\right)$ time)
• elastic net: handles highly correlated vars; balance L1 & L2
• minimize $\frac{1}{2n}{\parallel X\stackrel{\to }{\beta }-\stackrel{\to }{y}\parallel }_{2}^{2}+\lambda \alpha {\parallel \stackrel{\to }{\beta }\parallel }_{1}+\frac{\lambda \left(1-\alpha \right)}{2}{\parallel \stackrel{\to }{\beta }\parallel }_{2}^{2}$
• http://www-stat.stanford.edu/~owen/courses/305/Rudyregularization.pdf http://scikit-learn.org/stable/modules/linear_model.html
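Plain L2 ridge ($\Gamma =\sqrt{\lambda }I$) via the explicit solution above, on toy data; $\lambda \to 0$ recovers OLS and larger $\lambda$ shrinks the coefficients:

```python
import numpy as np

# Ridge closed form: beta_hat = (X^T X + lam I)^{-1} X^T y  (Gamma = sqrt(lam) I)
def ridge(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])
y = np.array([2.0, 3.0, 5.0, 8.0])
b_ols = ridge(X, y, 0.0)        # lam = 0 is just OLS
b_shrunk = ridge(X, y, 10.0)
```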

### 12.13 GLM

• exponential family: $p\left(y;\eta \right)=b\left(y\right)exp\left({\eta }^{T}T\left(y\right)-a\left(\eta \right)\right)$
• $\eta$ is natural param aka canonical param
• $T\left(y\right)$ is sufficient statistic; often $T\left(y\right)=y$
• $a\left(\eta \right)$ is log partition function
• ${e}^{-a\left(\eta \right)}$ is normalizing const; ensures $p$ sums/integrates to 1 over $y$
• fixed $T,a,b$ defines a family (set) of dists param'd by $\eta$ (vary $\eta$ for diff dists in fam)
• assumptions to derive GLM
• predicting $h\left(x\right)=E\left[T\left(y\right)\mid x\right]=E\left[y\mid x\right]$
• $\eta =\stackrel{\to }{\theta }\cdot \stackrel{\to }{x}$ (more of a “design choice” rather than an assumption)

### 12.14 PCA

• algorithm: normalize data, then find eigenvectors of covariance matrix
• can use SVD: $U,S,V=SVD\left(Cov\left(\frac{X-\mu }{\sigma }\right)\right)$, where $U$ cols are eigenvectors in decreasing importance, $S$ is diagonal matrix of singular values
• down-project: $\stackrel{\to }{z}={U}_{k}^{T}\stackrel{\to }{x}$, where ${U}_{k}$ is first $k$ cols of $U$
• recover: $\stackrel{\to }{\stackrel{ˆ}{x}}={U}_{k}\stackrel{\to }{z}$
• use: viz (project to 2D/3D), compression, speeding up learning
• don't use it just to prevent overfitting, since it ignores the labels; use regularization instead
• choose smallest $k$ where lost variance $\frac{\frac{1}{m}{\sum }_{i=1}^{m}{\parallel {x}^{\left(i\right)}-{\stackrel{ˆ}{x}}^{\left(i\right)}\parallel }^{2}}{\frac{1}{m}{\sum }_{i=1}^{m}{\parallel {x}^{\left(i\right)}\parallel }^{2}}\le .01$
• or (equivalently) retained variance $\frac{{\sum }_{i=1}^{k}{S}_{ii}}{{\sum }_{i=1}^{n}{S}_{ii}}\ge .99$ (“99% variance retained”)
• PCA transpose trick http://blog.echen.me/2011/03/14/pca-transpose-trick/
• given $m×n$ obs matrix $A$, often $n\gg m$ (dims $\gg$ obs)
• finding evecs of big $n×n$ matrix ${A}^{T}A$ is expensive
• insight: if $v$ is evec of $A{A}^{T}$, then ${A}^{T}v$ is evec of ${A}^{T}A$ w same eval (short proof)
• so, find evecs of $A{A}^{T}$, then multiply by ${A}^{T}$

### 12.15 SVD

• $M=U\Sigma {V}^{\ast }$, where
• $U$ is $m×m$ real/complex unitary matrix; cols (left singular vectors) are evecs of $M{M}^{\ast }$
• $\Sigma$ is $m×n$ non-neg real diag matrix
• entries are singular values
• non-0 singular values are sqrts of non-0 evals of $M{M}^{\ast }$ or ${M}^{\ast }M$
• ${V}^{\ast }$ is $n×n$ real/complex unitary matrix; cols (right singular vectors) are evecs of ${M}^{\ast }M$
• when all real, $U,V$ are rotations and $\Sigma$ is scaling
• eval decomp: similar concept but only for square matrices $M$
• ${M}^{\ast }M=V{\Sigma }^{\ast }{U}^{\ast }U\Sigma {V}^{\ast }=V\left({\Sigma }^{\ast }\Sigma \right){V}^{\ast }$
• $M{M}^{\ast }=U\Sigma {V}^{\ast }V{\Sigma }^{\ast }{U}^{\ast }=U\left(\Sigma {\Sigma }^{\ast }\right){U}^{\ast }$
• RHSs are eval decomps of LHSs
• use SVD to perform PCA
• let $M$ be deviations matrix with observations as columns; covar matrix is $\frac{1}{n}M{M}^{T}=\frac{1}{n}U{\Sigma }^{2}{U}^{T}$
• low-rank approx to $M$: $\stackrel{˜}{M}=U\stackrel{˜}{\Sigma }{V}^{\ast }$, where $\stackrel{˜}{\Sigma }$ is same as $\Sigma$ but with only $r$ largest singular values (rest replaced by 0)
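The truncation above in numpy, on a made-up matrix; by Eckart-Young the Frobenius error of the best rank-$r$ approximation is exactly the norm of the dropped singular values:

```python
import numpy as np

# Best rank-r approximation: keep the r largest singular values, zero the rest
rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # s is sorted descending
r = 2
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

err = np.linalg.norm(M - M_r)            # Frobenius norm
expected = np.sqrt(np.sum(s[r:] ** 2))   # dropped singular values
```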

### 12.16 Matrix factorization

• $R\approx MU$ where $M$ is ${n}_{M}×d$, $U$ is $d×{n}_{U}$
• $d$ latent features
• EM-like algo (6.867 project)
• randomly init $M$
• learn column $i$ of $U$ with OLS: $\stackrel{\to }{{u}_{i}}={\left({M}^{T}M\right)}^{-1}{M}^{T}\stackrel{\to }{{r}_{i}}$ where $\stackrel{\to }{{r}_{i}}$ has just known values from column $i$ of $R$
• learn $M$ from $U$; iterate
• strongly overfits; can try introducing prior to prevent overfitting and to coerce values to be within 1 to 5
• can't run if too sparse because ${\left({M}^{T}M\right)}^{-1}$ becomes singular
• ${e}_{ij}^{2}={\left({r}_{ij}-\stackrel{ˆ}{{r}_{ij}}\right)}^{2}={\left({r}_{ij}-{\sum }_{k=1}^{K}{p}_{ik}{q}_{kj}\right)}^{2}$
• gradient:$\begin{array}{ccc}\frac{\partial }{\partial {p}_{ik}}{e}_{ij}^{2}& =& -2\left({r}_{ij}-\stackrel{ˆ}{{r}_{ij}}\right){q}_{kj}=-2{e}_{ij}{q}_{kj}\\ \frac{\partial }{\partial {q}_{kj}}{e}_{ij}^{2}& =& -2\left({r}_{ij}-\stackrel{ˆ}{{r}_{ij}}\right){p}_{ik}=-2{e}_{ij}{p}_{ik}\end{array}$
• update rule (usu. $\alpha =.0002$), stepping against the gradient:$\begin{array}{ccc}{p}_{ik}\text{'}& =& {p}_{ik}-\alpha \frac{\partial }{\partial {p}_{ik}}{e}_{ij}^{2}={p}_{ik}+2\alpha {e}_{ij}{q}_{kj}\\ {q}_{kj}\text{'}& =& {q}_{kj}-\alpha \frac{\partial }{\partial {q}_{kj}}{e}_{ij}^{2}={q}_{kj}+2\alpha {e}_{ij}{p}_{ik}\end{array}$
• with regularization to avoid overfitting:$\begin{array}{ccc}{e}_{ij}^{2}& =& {\left({r}_{ij}-\sum _{k=1}^{K}{p}_{ik}{q}_{kj}\right)}^{2}+\frac{\beta }{2}\sum _{k=1}^{K}\left({p}_{ik}^{2}+{q}_{kj}^{2}\right)\\ {p}_{ik}\text{'}& =& {p}_{ik}-\alpha \frac{\partial }{\partial {p}_{ik}}{e}_{ij}^{2}={p}_{ik}+\alpha \left(2{e}_{ij}{q}_{kj}-\beta {p}_{ik}\right)\\ {q}_{kj}\text{'}& =& {q}_{kj}-\alpha \frac{\partial }{\partial {q}_{kj}}{e}_{ij}^{2}={q}_{kj}+\alpha \left(2{e}_{ij}{p}_{ik}-\beta {q}_{kj}\right)\end{array}$
• important extension: non-negative matrix factorization (NMF)
• require $P,Q$ to be non-negative
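A minimal sketch of the regularized SGD updates above (toy ratings matrix with 0 marking unobserved entries; data and constants are illustrative):

```python
import numpy as np

R = np.array([[5, 3, 0, 1],       # toy ratings; 0 = unobserved entry
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

n_rows, n_cols = R.shape
K = 2                             # latent features
alpha, beta = 0.002, 0.02         # step size, regularization strength
rng = np.random.default_rng(0)
P = rng.random((n_rows, K))
Q = rng.random((K, n_cols))

for _ in range(5000):
    for i in range(n_rows):
        for j in range(n_cols):
            if R[i, j] > 0:                          # only observed entries
                e_ij = R[i, j] - P[i, :] @ Q[:, j]
                p_old = P[i, :].copy()
                # regularized updates: p += alpha(2 e q - beta p), etc.
                P[i, :] += alpha * (2 * e_ij * Q[:, j] - beta * P[i, :])
                Q[:, j] += alpha * (2 * e_ij * p_old - beta * Q[:, j])

R_hat = P @ Q                     # predictions, incl. unobserved cells
```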

### 12.17 Decision trees

• information gain or entropy discretization
• entropy $H\left(.\right)$ is information uncertainty
• in total, need $H\left(Y\right)$ bits to classify (e.g. 1 bit for even 2-class distribution)
• each branch gives some information/removes uncertainty; want to move toward 0 uncertainty
• after branching on an attr with value dist $V$, avg remaining uncertainty is ${E}_{V}\left[H\left({Y}_{V}\right)\right]\le H\left(Y\right)$, where ${Y}_{v}$ is class dist down branch for attr value $v$
• choose attr w lowest remaining uncertainty, or greatest information gain $H\left(Y\right)-{E}_{V}\left[H\left({Y}_{V}\right)\right]$
• stopping criterion
• AIMA suggests building full tree then prune instead of early stop (eg consider learning xor), but this doesn't seem to be done elsewhere
• AIMA suggests chi-square test
• MDL reported to be best-performing: http://www.jair.org/media/279/live-279-1538-jair.pdf
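The attribute-selection rule above can be sketched as (function names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y) over the empirical class dist."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """H(Y) - E_V[H(Y_V)]: expected uncertainty removed by branching."""
    n = len(labels)
    remaining = 0.0
    for v in set(attr_values):
        branch = [y for y, a in zip(labels, attr_values) if a == v]
        remaining += (len(branch) / n) * entropy(branch)
    return entropy(labels) - remaining

y    = [0, 0, 1, 1]
attr = ['a', 'a', 'b', 'b']        # perfectly predictive attribute
print(information_gain(y, attr))   # → 1.0
```

A perfectly predictive attribute recovers the full 1 bit of a balanced binary label; an uninformative one gains 0.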

### 12.18 Markov Chain Monte Carlo (MCMC)

• given multivariate dist, simpler to sample from conditional dists than the joint (hard to marginalize by integrating over joint dist)
• Gibbs sampling: initialize vars ${X}_{i}^{\left(0\right)}\forall i$, then iteratively sample ${X}_{i}^{\left(t\right)}\sim P\left({X}_{i}\mid {X}_{1}^{\left(t\right)},\dots ,{X}_{i-1}^{\left(t\right)},{X}_{i+1}^{\left(t-1\right)},\dots ,{X}_{n}^{\left(t-1\right)}\right)\forall i$ (condition on the most recent value of each other var)
• typically discard samples from initial burn-in period
• typically keep only every $n$th sample (thinning), since successive samples are correlated
• useful when conditional distributions known, e.g. in Bayesian networks
• Gibbs is the M-H special case where the proposal dist is the full conditional dist; proposals are accepted 100% of the time
• Metropolis Hastings: (slower) generalization of Gibbs sampling for when conditional distributions unknown
• use proposal conditional dist $Q\left(X\text{'};X\right)$, eg $N\left(X,{\sigma }^{2}I\right)$; $Q$ needn't be symmetric in general (though this Gaussian is)
• $a={a}_{1}{a}_{2}=\frac{P\left(X\text{'}\right)Q\left({X}^{\left(t\right)};X\text{'}\right)}{P\left({X}^{\left(t\right)}\right)Q\left(X\text{'};{X}^{\left(t\right)}\right)}$ where:
• ${a}_{1}=\frac{P\left(X\text{'}\right)}{P\left({X}^{\left(t\right)}\right)}$ is likelihood ratio btwn proposed sample $X\text{'}$ and previous sample ${X}^{\left(t\right)}$
• ${a}_{2}=\frac{Q\left({X}^{\left(t\right)};X\text{'}\right)}{Q\left(X\text{'};{X}^{\left(t\right)}\right)}$ is ratio of proposal density in both directions
• Burn-in commentary: http://www.johndcook.com/blog/2011/08/10/markov-chains-dont-converge/
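A random-walk M-H sketch: with a symmetric Gaussian proposal, ${a}_{2}=1$ and only the likelihood ratio ${a}_{1}$ matters (target, constants, and names are illustrative):

```python
import math
import random

random.seed(0)

def p(x):
    """Unnormalized target density: standard normal (illustrative)."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, sigma=1.0, burn_in=1000, thin=5):
    """Random-walk M-H with symmetric proposal N(x, sigma^2), so the
    proposal-ratio term a2 = 1 and a = a1 = p(x') / p(x)."""
    x, samples = 0.0, []
    for t in range(burn_in + n_samples * thin):
        x_prop = random.gauss(x, sigma)
        if random.random() < p(x_prop) / p(x):   # accept w.p. min(1, a)
            x = x_prop
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append(x)                    # discard burn-in, thin
    return samples

samples = metropolis_hastings(5000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The retained samples should have mean ≈ 0 and variance ≈ 1, matching the target.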

### 12.19 Unsupervised learning

• K-means
• Algo: given inputs ${x}^{\left(1\right)},\dots ,{x}^{\left(m\right)}$, repeatedly:
• update cluster assignments ${c}^{\left(1\right)},\dots ,{c}^{\left(m\right)}$: assign each point to nearest cluster
• update cluster centroids ${\mu }_{1},\dots ,{\mu }_{K}$ to mean of assigned points
• Objective: minimize distortion function $J\left({c}^{\left(1\right)},\dots ,{c}^{\left(m\right)},{\mu }_{1},\dots ,{\mu }_{K}\right)=\frac{1}{m}{\sum }_{i=1}^{m}{\parallel {x}^{\left(i\right)}-{\mu }_{{c}^{\left(i\right)}}\parallel }^{2}$
• Use random initialization (good to initialize to $K$ random data points); run many times; choose clustering with min $J$
• Choose $K$ at knee/elbow in curve of $J$ over $K$ (or choose based on the problem you're solving)
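The K-means loop above, with random data-point initialization and multiple restarts keeping the min-$J$ run (a sketch on synthetic blobs):

```python
import numpy as np

def kmeans(X, K, n_init=10, n_iter=100, seed=0):
    """Run K-means n_init times from random data-point initializations;
    keep the run with the lowest distortion J."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        mu = X[rng.choice(len(X), K, replace=False)]   # init to K data points
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            c = d.argmin(axis=1)                       # assignment step
            new_mu = np.array([X[c == k].mean(axis=0) if (c == k).any()
                               else mu[k] for k in range(K)])
            if np.allclose(new_mu, mu):                # converged
                break
            mu = new_mu
        J = np.mean(np.sum((X - mu[c]) ** 2, axis=1))  # distortion
        if best is None or J < best[0]:
            best = (J, c, mu)
    return best

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),            # two synthetic blobs
               rng.normal(5, 0.5, (20, 2))])
J, c, mu = kmeans(X, K=2)
```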

### 12.20 Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process

• nonparametric: # params can grow w data
• in the following
• $n$ is # diners, $\alpha$ is dispersion param
• ${G}_{0}$ is base dist
• Chinese restaurant process
• each new diner sits at new table w prob $\frac{\alpha }{n+\alpha }$ or at table $k$ w prob $\frac{{n}_{k}}{n+\alpha }$; call table assignments ${g}_{i}$
• table $k$ serves a dish parameterized by ${\phi }_{k}\sim {G}_{0}$
• generate data points ${p}_{i}\sim F\left({\phi }_{{g}_{i}}\right)$
• Polya urn model
• start w urn containing $\alpha {G}_{0}\left({\phi }_{k}\right)$ balls of “color” ${\phi }_{k}$, for ea possible color ${\phi }_{k}$
• draw ball, note color ${\phi }_{i}$, put back orig ball & new ball of same color
• generate data points ${p}_{i}\sim F\left({\phi }_{i}\right)$
• stick-breaking process
• start w stick of length 1
• sample ${\beta }_{1}\sim Beta\left(1,\alpha \right)$; ${w}_{1}={\beta }_{1}$ is the eventual proportion of customers at table 1
• sample ${\beta }_{2}\sim Beta\left(1,\alpha \right)$; ${w}_{2}=\left(1-{\beta }_{1}\right){\beta }_{2}$ is eventual prop of customers at table 2; in general ${w}_{k}={\beta }_{k}\prod _{j<k}\left(1-{\beta }_{j}\right)$
• generate group assignments ${g}_{i}\sim Multinomial\left({w}_{1},\dots ,{w}_{\infty }\right)$
• generate data points ${p}_{i}\sim F\left({\phi }_{{g}_{i}}\right)$
• Dirichlet process
• generate a dist $G\sim DP\left({G}_{0},\alpha \right)$
• generate group-level params ${x}_{i}\sim G$, where ${x}_{i}$ is group param for $i$th data point
• generate each data point ${p}_{i}\sim F\left({x}_{i}\right)$
• inference: Gibbs sampling
• randomly init group assignments
• pick a point; fix assignments of other points; assign point to most likely group (existing or new)
• repeat till convergence
• Indian buffet process: customers belong to multiple tables instead of just 1
• CRP, Polya, sticks: sequential; DP: parallel
• http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/
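The CRP seating rule can be sketched as follows (ignoring the per-table dish draws ${\phi }_{k}$; function name is mine):

```python
import random

random.seed(0)

def chinese_restaurant_process(n, alpha):
    """Seat n diners: with i diners already seated, the next joins
    table k w.p. n_k/(i+alpha) or opens a new table w.p. alpha/(i+alpha)."""
    tables = []            # tables[k] = # diners at table k
    assignments = []
    for i in range(n):     # i = # diners already seated
        r = random.uniform(0, i + alpha)
        acc = 0.0
        for k, n_k in enumerate(tables):
            acc += n_k
            if r < acc:
                tables[k] += 1
                assignments.append(k)
                break
        else:
            tables.append(1)                   # open a new table
            assignments.append(len(tables) - 1)
    return assignments, tables

assignments, tables = chinese_restaurant_process(1000, alpha=2.0)
```

The expected number of occupied tables grows roughly like $\alpha logn$, and the rich-get-richer rule concentrates most diners at a few large tables.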

## 13 Natural language processing (NLP)

### 13.1 Naive Bayes

• NB usu uses multi-variate Bernoulli event model: ${X}_{i}\sim {Bernoulli}_{c,i}$ indicates whether dict word $i$ is present (for class $c$)
• multinomial event model: better for text classif; ${X}_{i}\sim {Multinomial}_{c}$ is $i$th word in doc, sampled from dict (for class $c$)
• model: $Pr\left[D\mid {\stackrel{\to }{\theta }}_{c}\right]=\frac{\left({\sum }_{i}{f}_{i}\right)!}{{\prod }_{i}{f}_{i}!}\prod {\left({\theta }_{c,i}\right)}^{{f}_{i}}$ where $D$ is document (this is just multinomial PMF)
• prediction: $arg{max}_{c}\left[logp\left({\stackrel{\to }{\theta }}_{c}\right)+{\sum }_{i}{f}_{i}log{\theta }_{c,i}\right]$
• estimation: ${\stackrel{ˆ}{\theta }}_{c,i}=\frac{{N}_{c,i}+{\alpha }_{i}}{{N}_{c}+\alpha }$ where $\alpha ={\sum }_{i}{\alpha }_{i}$
• data skew bias: sensitive to training class dist
• fix with complement NB: ${\stackrel{ˆ}{\theta }}_{\stackrel{˜}{c},i}=\frac{{N}_{\stackrel{˜}{c},i}+{\alpha }_{i}}{{N}_{\stackrel{˜}{c}}+\alpha }$ and $arg{max}_{c}\left[logp\left({\stackrel{\to }{\theta }}_{c}\right)-{\sum }_{i}{f}_{i}log{\theta }_{\stackrel{˜}{c},i}\right]$ (notice the $-$)
• note: doesn't do anything for binary; need 3+ classes
• weight magnitude errors: deal with deps like “San Francisco” vs. “Boston”
• use weight normalization $\frac{log{\stackrel{ˆ}{\theta }}_{c,i}}{{\sum }_{k}|log{\stackrel{ˆ}{\theta }}_{c,k}|}$ instead of $log{\stackrel{ˆ}{\theta }}_{c,i}$ (given without much explanation)
• transforms to model text more accurately based on empirical text dists
• term frequency: empirical dists were heavier-tailed than predicted by multinomial, more like power-law
• eg multinomial says $Pr\left[{f}_{i}=9\right]$ (see a word 9 times) is tiny, but in reality pretty high (“burstiness”)
• use ${f}_{i}\text{'}=log\left(d+{f}_{i}\right)$ where $d=1$ (or some optimized value)
• brings closer to the Bernoulli (0-1) model
• document frequency
• common words unlikely to be predictive, but random variations can create fictitious correlations
• use ${f}_{i}\text{'}={f}_{i}log\frac{{\sum }_{j}1}{{\sum }_{j}{\delta }_{i,j}}$ to downweight common words where ${\delta }_{i,j}=1$ iff word $i$ in doc $j$
• document length
• docs have strong inter-word deps; if word appears, likely to re-appear
• longer docs have disproportionately higher empirical term freqs/heavier tails; can have strong effect
• use ${f}_{i}\text{'}=\frac{{f}_{i}}{\sqrt{{\sum }_{k}{\left({f}_{k}\right)}^{2}}}$; denom is doc length; discounts long docs; common in IR
• http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
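A sketch of the multinomial model with the $log\left(1+{f}_{i}\right)$ TF transform and smoothed estimates ${\stackrel{ˆ}{\theta }}_{c,i}$ (toy counts; function names are mine):

```python
import math
import numpy as np

# Toy term-frequency matrix: rows = docs, cols = dict words (made-up data).
X_raw = np.array([[3, 0, 1],
                  [2, 1, 0],
                  [0, 4, 2],
                  [1, 3, 3]], dtype=float)
y = np.array([0, 0, 1, 1])

# TF transform from the notes: f' = log(1 + f), damping bursty counts.
X = np.log(1 + X_raw)

def fit_multinomial_nb(X, y, alpha_i=1.0):
    """theta_hat_{c,i} = (N_{c,i} + alpha_i) / (N_c + alpha)."""
    classes = np.unique(y)
    log_theta, log_prior = [], []
    for c in classes:
        N_ci = X[y == c].sum(axis=0) + alpha_i   # smoothed word totals
        log_theta.append(np.log(N_ci / N_ci.sum()))
        log_prior.append(math.log((y == c).mean()))
    return np.array(log_theta), np.array(log_prior)

def predict(x, log_theta, log_prior):
    """argmax_c [ log p(theta_c) + sum_i f_i log theta_{c,i} ]."""
    return int(np.argmax(log_prior + log_theta @ x))

log_theta, log_prior = fit_multinomial_nb(X, y)
print(predict(np.log(1 + np.array([2.0, 0, 1])), log_theta, log_prior))  # → 0
```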

## 14 Problems

• Solve for $x$ in ${x}^{{x}^{{x}^{\dots }}}=2$.
• Which is greater, ${e}^{\pi }$ or ${\pi }^{e}$? (Hint: ${e}^{x}>1+x$ for $x>0$)