Kullback–Leibler divergence aka KLIC: non-symmetric measure of difference between dists $P$ and $Q$
expected # extra bits to code samples from $P$ when using a code based on $Q$ rather than on $P$
alt intuition: expected log-likelihood ratio when data is distributed as $P$ and $Q$ is the model: $D_{KL}(P\|Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
integral for continuous: $\int p(x) \log \frac{p(x)}{q(x)}\,dx$
$D_{KL}(P\|Q) \ge 0$; $= 0$ for $P = Q$; asymmetric: $D_{KL}(P\|Q) \ne D_{KL}(Q\|P)$
mutual information: $I(X;Y) = D_{KL}(P(X,Y)\,\|\,P(X)P(Y))$
http://www.snl.salk.edu/~shlens/kl.pdf
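A minimal sketch of the discrete case (assuming numpy and log base 2 so the result reads as extra bits per sample; the distributions here are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # 0 * log(0/q) = 0 by convention
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))  # in bits

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # extra bits/sample coding P with a Q-based code
print(kl_divergence(q, p))   # differs: KL is asymmetric
```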
normalized compression distance (NCD): $NCD(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$ where $C(\cdot)$ is the compressed length
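A rough sketch using zlib as the compressor $C$ (any real compressor only approximates the ideal Kolmogorov-complexity version; the strings are made up):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance using zlib as the compressor C."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"abcabcabc", b"abcabcabd"))  # small: similar strings
print(ncd(b"abcabcabc", b"zzzyyyxxx"))  # larger: dissimilar strings
```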
4 Finance
rate of return (ROR) aka return on investment (ROI) aka return
let $V_f$ be final value, $V_i$ be initial value
ratio: $V_f / V_i$
arithmetic return aka yield: $\frac{V_f - V_i}{V_i}$
logarithmic/continuous compound return: $\ln \frac{V_f}{V_i}$
compound annual growth rate (CAGR): $\left(\frac{V_f}{V_i}\right)^{1/t} - 1$ where $t$ is # years
annual percentage rate (APR)
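A tiny worked example of these return formulas, with made-up values (1000 growing to 1500 over 3 years):

```python
import math

v_i, v_f, years = 1000.0, 1500.0, 3

ratio = v_f / v_i                          # 1.5
arithmetic_return = (v_f - v_i) / v_i      # 0.5, i.e. a 50% yield
log_return = math.log(v_f / v_i)           # ~0.405 continuous compound return
cagr = (v_f / v_i) ** (1 / years) - 1      # ~0.1447 per year
print(ratio, arithmetic_return, log_return, cagr)
```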
5 Signal Processing
DFT: $X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$
IDFT: $x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{2\pi i k n / N}$ (normalized, changed exp sign)
interesting presentation: the strength of a freq is the distance from the origin of the midpoint of your signal's points as the signal is spun around a circle at that freq http://altdevblogaday.org/2011/05/17/understanding-the-fourier-transform/
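A naive $O(N^2)$ sketch of the DFT/IDFT pair as written above (pure Python with cmath; a real implementation would use an FFT):

```python
import cmath

def dft(x):
    """Naive DFT: X_k = sum_n x_n * exp(-2*pi*i*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse: x_n = (1/N) * sum_k X_k * exp(+2*pi*i*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

signal = [0, 1, 0, -1]          # one cycle of a sine sampled 4 times
spectrum = dft(signal)
print([round(abs(c), 6) for c in spectrum])        # energy at k=1 and k=3 (mirror)
print([round(z.real, 6) for z in idft(spectrum)])  # recovers the signal
```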
IIR, FIR: TODO
6 Probability
6.1 Distributions
Binomial: # successes in $n$ Bernoulli trials each with success prob $p$
Geometric: # trials until Bernoulli success with prob $p$
Hypergeom: # successes in $n$ draws from a population of $N$ containing $K$ successes
Negative binomial: # successes in Bernoulli trials before $r$ failures (generalization of geom)
Poisson: # arrivals in a sliver of time (infinite-granularity binomial) assuming mean arrival rate $\lambda$
Simple interesting proof from binomial
Normal: mean $\mu$, standard deviation $\sigma$
Empirical rule: z-scores of 1/2/3 span 68%/95%/99.7%
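A quick check of the empirical rule, assuming scipy is available:

```python
from scipy import stats

# mass of a standard normal within z standard deviations of the mean
for z in (1, 2, 3):
    print(z, stats.norm.cdf(z) - stats.norm.cdf(-z))   # ~0.683, 0.954, 0.997
```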
Mahalanobis distance: similarity of a sample to a distribution
$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$ where $x$ is new sample, $\mu$ is dist mean, $\Sigma$ is covar matrix
same as normalized Euclidean dist if $\Sigma$ is diagonal: $D_M(x) = \sqrt{\sum_i (x_i - \mu_i)^2 / \sigma_i^2}$
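A minimal sketch, assuming numpy and a made-up correlated 2-D sample cloud:

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu))"""
    d = np.asarray(x) - np.asarray(mu)
    return float(np.sqrt(d @ np.linalg.solve(sigma, d)))

samples = np.random.default_rng(0).multivariate_normal([0, 0], [[2, 1], [1, 2]], 500)
mu, sigma = samples.mean(axis=0), np.cov(samples.T)
print(mahalanobis([1.0, 1.0], mu, sigma))   # along the correlation: small distance
print(mahalanobis([4.0, -4.0], mu, sigma))  # against the correlation: large distance
```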
8.6 Smoothing
additive smoothing aka Laplace smoothing: $\hat\theta_i = \frac{x_i + \alpha}{N + \alpha d}$, with observed counts $x_i$, total count $N$, $d$ categories, pseudocount $\alpha > 0$
called “rule of succession” for $\alpha = 1$
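A small sketch of additive smoothing over raw counts (function name and counts are made up):

```python
def laplace_smooth(counts, alpha=1.0):
    """theta_i = (x_i + alpha) / (N + alpha * d) for counts over d categories."""
    n, d = sum(counts), len(counts)
    return [(x + alpha) / (n + alpha * d) for x in counts]

print(laplace_smooth([3, 0, 1]))              # no category gets probability 0
print(laplace_smooth([3, 0, 1], alpha=0.01))  # smaller pseudocount, closer to the MLE
```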
Good-Turing estimation: complex
8.7 Estimation
90% confidence interval: in repeated samplings, the computed intervals would contain the true param 90% of the time (0.1 miss rate); $P(L \le \theta \le U) = 0.9$ ($\theta$ is const, the interval $[L, U]$ is the RV)
90% credible interval: $P(L \le \theta \le U \mid \text{data}) = 0.9$ ($\theta$ is the RV); “Bayesian confidence interval”
statistic: any function of data
estimator: any statistic used to estimate an (unknown) param $\theta$; usu denoted $\hat\theta$
bias: $E[\hat\theta] - \theta$ (unbiased if 0; asymptotically unbiased if $\to 0$ as $n \to \infty$)
variance: $E[(\hat\theta - E[\hat\theta])^2]$
unbiased estimator: averages to the true param over repeated sampling
e.g.: sample variance
unbiased sample variance ($s^2$) is $\frac{1}{n-1}\sum_i (x_i - \bar{x})^2$ (Bessel's correction; simulation below)
biased is $\frac{1}{n}\sum_i (x_i - \bar{x})^2$
note: $s = \sqrt{s^2}$ is not unbiased for SD
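A quick simulation (assuming numpy; true variance and sample size are made up) showing the $1/n$ estimator is biased low, Bessel's correction fixes it, and $\sqrt{s^2}$ is still biased for the SD:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n, trials = 5, 100_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), n)
    ss = np.sum((x - x.mean()) ** 2)
    biased.append(ss / n)          # divide by n
    unbiased.append(ss / (n - 1))  # Bessel's correction

print(np.mean(biased))             # ~3.2, systematically below 4
print(np.mean(unbiased))           # ~4.0, unbiased
print(np.mean(np.sqrt(unbiased)))  # still below 2.0: sqrt(s^2) is biased for SD
```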
unbiased estimators can be terrible; they may average to the true value but individual estimates may be ridiculous
e.g. for Poisson with rate $\lambda$, the only unbiased estimator of the statistic $e^{-2\lambda}$ from a single observation $x$ is $(-1)^x$, which is nonsense
MLE is $e^{-2x}$, which is always positive and has smaller MSE
besides bias, look also at efficiency, the MSE of individual estimates
consistent: $\hat\theta_n \to \theta$ in probability as $n \to \infty$
bias & variance must go to 0
biased but consistent estimate of mean: e.g. $\frac{1}{n+1}\sum_i x_i$ (bias $\to 0$ as $n \to \infty$)
unbiased but inconsistent estimate of mean: e.g. $\hat\mu = x_1$ (just the first observation; variance never shrinks)
maximum likelihood estimation (MLE): $\hat\theta_{MLE} = \arg\max_\theta P(\text{data} \mid \theta)$ (sketch at the end of this block)
find the peak in the likelihood function
Fisher popularized this by showing that typically MLE is unbiased, consistent, & asymptotically the lowest variance estimator
least squares: in OLS, if errors are normal, then least squares estimate is MLE
now clear that MLE is sometimes bad in practice, and finite-sample behavior is not the same as asymptotic behavior, leading to Bayesian strategies
disregards the uncertainty (“spread” in the likelihood function, i.e. Fisher information)
property (functional invariance): $g(\hat\theta)$ is the MLE of $g(\theta)$ for any function $g$
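A tiny illustration of “find the peak in the likelihood function”, assuming a Bernoulli model and a made-up data vector (the grid search is only to show the peak; the closed-form MLE is just the sample mean):

```python
import numpy as np

# MLE for a Bernoulli parameter: hat{p} = (# successes) / n maximizes the likelihood
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = data.sum() * np.log(p_grid) + (len(data) - data.sum()) * np.log(1 - p_grid)
print(p_grid[np.argmax(log_lik)])  # ~0.75, the peak of the likelihood function
print(data.mean())                 # closed-form MLE agrees
```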
maximum a posteriori (MAP): $\hat\theta_{MAP} = \arg\max_\theta P(\theta \mid \text{data}) = \arg\max_\theta P(\text{data} \mid \theta)\,P(\theta)$
the Bayesian approach
since we have a (posterior) prob dist over $\theta$, can extract whatever we want: mean, median, mode, intervals
commonly use conjugate prior for hyp prior to simplify math
e.g. if $\theta \sim \text{Beta}(a, b)$ and $X \sim \text{Bernoulli}(\theta)$ and our data is $s$ successes and $f$ failures, then $\theta \mid \text{data} \sim \text{Beta}(a + s, b + f)$ (sketch after this block)
usu. assume param independence so each param can have own beta dist
incorporating param RVs into Bayesian network itself requires making copies of the variables describing each instance
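A sketch of the Beta–Bernoulli conjugate update above, assuming scipy and made-up prior pseudo-counts and data:

```python
from scipy import stats

# Beta(a, b) prior on theta; observe s successes and f failures from Bernoulli(theta)
a, b = 2, 2          # prior pseudo-counts (assumed for illustration)
s, f = 7, 3          # observed data

posterior = stats.beta(a + s, b + f)      # conjugacy: posterior is Beta(a + s, b + f)
print(posterior.mean())                   # posterior mean
print((a + s - 1) / (a + s + b + f - 2))  # MAP estimate = posterior mode
print(posterior.interval(0.9))            # 90% credible interval
```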
12.4 Incomplete data: EM algo
hidden/latent variables: indirection that dramatically reduces number of parameters in Bayesian network
EM algo
E-step: calculate expected likelihood given current param estimates: $Q(\theta \mid \theta^{(t)}) = E_{Z \mid x, \theta^{(t)}}[\log P(x, Z \mid \theta)]$
often, in reality, we don't actually have to calculate the expected likelihood, just as we don't have to compute the cost function in gradient descent
this step usually just updates hidden value estimates, which will then be used in the M-step
M-step: find params that maximize expected likelihood: $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
EM algo examples (all have $N$ data points)
unsupervised clustering: Gaussian mixture model
single Gaussian: $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$
GMM: $p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$ where $K$ is # Gaussians (sketch at the end of this block)
while ANNs are universal approximators of functions, GMMs are universal approximators of densities
even diagonal GMMs are universal approximators; full-rank are unwieldy, since # params is square of # dims
like K-means with probabilistic assignments and Gaussians instead of means (K-means is a hard EM algo)
very sensitive to initialization; may initialize with K-means
mixture model of $K$ components: $p(x) = \sum_{k=1}^{K} P(C = k)\, p(x \mid C = k)$
Bayes net: $C \to X$; $C$ hidden, discrete; $X$ continuous
E-step: calculate some quantities useful later (assignment probabilities, which are the hidden variables): $p_{ik} = P(C = k \mid x_i) \propto P(C = k)\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
M-step: maximize expected likelihood of observed & hidden vars
to avoid local maxima (component shrinking to a single point, two components merging, etc.):
use priors on params to apply MAP version of EM
restart a component with new random params if it gets too small/too close to another component
initialize params with reasonable values
can also do MAP GMM
http://bengio.abracadoudou.com/lectures/gmm.pdf
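A minimal EM-for-GMM sketch following the E/M steps above (assuming numpy, full covariances, random-point initialization rather than K-means, and made-up blob data):

```python
import numpy as np

def em_gmm(X, K, iters=50, seed=0):
    """EM for a Gaussian mixture model with full covariances (minimal sketch)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.full(K, 1.0 / K)                     # mixing weights
    mu = X[rng.choice(N, K, replace=False)]     # init means from random data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(iters):
        # E-step: responsibilities p_ik = P(component k | x_i)
        resp = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(sigma[k])
            det = np.linalg.det(sigma[k])
            exponent = -0.5 * np.sum(diff @ inv * diff, axis=1)
            resp[:, k] = w[k] * np.exp(exponent) / np.sqrt((2 * np.pi) ** d * det)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, covariances from responsibilities
        nk = resp.sum(axis=0)
        w = nk / N
        mu = (resp.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return w, mu, sigma

# two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
w, mu, sigma = em_gmm(X, K=2)
print(w)   # ~[0.5, 0.5]
print(mu)  # means near (0,0) and (5,5), in either order
```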
Naive Bayes with hidden class
Bayes net: $C \to X_1, \dots, X_n$; $C$ hidden; all vars discrete
E-step: compute $P(C = c \mid x_i)$ for each example $x_i$ using the current params (inference in the net)
M-step: new params are normalized expected counts: $\theta_c = \hat N(C = c)/N$, $\theta_{x_j \mid c} = \hat N(X_j = x_j, C = c)/\hat N(C = c)$, where the $\hat N$ are expected counts from the E-step
HMMs: dynamic Bayes net with single discrete state var
each data point is sequence of observations
transition probs repeat across time: $P(X_{t+1} \mid X_t)$ is shared across all $t$
E-step: modify forward-backward algo to compute expected counts below
obtained by smoothing rather than filtering: must pay attn to subsequent evidence in estimating prob of a particular transition (eg evidence is obtained after crime)
M-step: re-estimate transition & sensor probs from the expected counts, e.g. $\hat\theta_{ij} = \frac{\text{expected \# of } i \to j \text{ transitions}}{\text{expected \# of transitions out of } i}$
EM algo
pretend we know the params, then “complete” the data by inferring prob dists over the hidden vars, then find params that maximize likelihood of observed & hidden vars
gist:
E-step: compute $Q(\theta \mid \theta^{(t)}) = E_{Z \mid x, \theta^{(t)}}[\log L(\theta; x, Z)]$
expected (log) likelihood over hidden vars $Z$ under current $\theta^{(t)}$
misnomer: what's calculated are the fixed, data-dependent params of the function $Q$
M-step: compute $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
new $\theta$ that maximizes the expected likelihood
resembles gradient-based hill-climbing but no “step size” param
monotonically increases likelihood
12.5 Kernel models
aka Parzen-Rosenblatt window
each instance contributes small density function
density estimation: $\hat p(x) = \frac{1}{N} \sum_{i=1}^{N} K(x, x_i)$
kernel normally depends only on the distance $\|x - x_i\|$
eg $d$-dimensional Gaussian with width $w$: $K(x, x_i) = \frac{1}{(2\pi w^2)^{d/2}} e^{-\|x - x_i\|^2 / (2 w^2)}$
supervised learning: take weighted combination of all predictions
vs kNN's unweighted combination of instances
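A sketch of Gaussian kernel density estimation with a fixed width $w$ (assuming numpy; the data and width are made up):

```python
import numpy as np

def kde(x, data, width=0.5):
    """Gaussian KDE at x: average of a small Gaussian bump centered on each instance."""
    z = (x - data) / width
    return np.mean(np.exp(-0.5 * z**2) / (width * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
for x in (-2.0, 0.0, 2.0):
    print(x, kde(x, data))   # high density near the two modes, low in between
```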
12.6 Classification
linear classifiers: simplest type of feedforward neural network
TODO: single- vs multi-layer perceptron; feedforward vs backpropagation
perceptron learning algorithm: for each iteration over examples $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, if $y_i\,(w \cdot x_i) \le 0$ (mistake), then $w \leftarrow w + y_i x_i$.
makes at most $(R/\gamma)^2$ mistakes on the training set, where $R = \max_i \|x_i\|$ and $\gamma$ is the margin
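A sketch of the perceptron update above, assuming numpy, $\pm 1$ labels, and a made-up separable dataset with the bias folded in as a constant feature:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Perceptron learning: on a mistake (y_i * w.x_i <= 0), set w <- w + y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:        # converged: data linearly separated
            break
    return w

# linearly separable toy data (last column is a constant 1 bias feature)
X = np.array([[1, 2, 1], [2, 3, 1], [-1, -2, 1], [-2, -1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # matches y
```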
support vector machine (SVM): maximum margin classifier with some slack
equivalent formulation assuming labels are 1/0 instead of $\pm 1$: since the margin constraint becomes a penalty we minimize (given that we're trying to minimize all the slack variables $\xi_i$)
LOOCV error $= \frac{1}{N}\sum_i \text{Loss}(y_i, \hat f_{-i}(x_i))$, where Loss is the 0-1 loss and $\hat f_{-i}$ is trained without example $i$. Upper bound is (# support vectors / $N$).
quadratic programming optimization problem: single global maximum that can be found efficiently
dual: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$
kernel trick: substitute kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ where $\phi$ maps to high/infinite dimensions but $K$ can still be computed efficiently
Mercer's theorem: any “reasonable” (positive definite) kernel function corresponds to some feature space
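A small illustration of the kernel trick: the quadratic kernel $K(x, z) = (x \cdot z)^2$ equals the dot product of explicit degree-2 feature maps, computed without ever building the $d^2$-dimensional features (the vectors are made up):

```python
import numpy as np
from itertools import product

def poly2_features(x):
    """Explicit feature map for the quadratic kernel K(x, z) = (x . z)^2."""
    return np.array([x[i] * x[j] for i, j in product(range(len(x)), repeat=2)])

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print((x @ z) ** 2)                           # kernel: O(d) work
print(poly2_features(x) @ poly2_features(z))  # same value via explicit d^2-dim features
```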
ordinary least squares (OLS) regression: a linear regression
minimize cost function $J(\theta) = \frac{1}{2}\sum_i (h_\theta(x_i) - y_i)^2$ (sum of squared errors)
gradient descent: repeatedly $\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$
$\alpha$ is configurable learning rate
batch: cost over all training instances
stochastic aka incremental: converges faster
update rule is called least mean squares (LMS) rule aka Widrow-Hoff rule
derivation for single training instance: $\frac{\partial J}{\partial \theta_j} = (h_\theta(x) - y)\,x_j$, so $\theta_j \leftarrow \theta_j + \alpha\,(y - h_\theta(x))\,x_j$
magnitude of update proportional to error term
can minimize in closed form without iterative algo (some matrix calculus): $\theta = (X^T X)^{-1} X^T y$ (normal equation; sketch at the end of this block)
LOOCV can also be computed without training $N$ models: $\text{LOOCV} = \frac{1}{N}\sum_i \left(\frac{e_i}{1 - h_{ii}}\right)^2$ where $e_i$ is the residual and $h_{ii}$ is the leverage (diagonal of the hat matrix $X(X^TX)^{-1}X^T$)
probabilistic interp: why linear regression / why least squares?
assume $y = \theta^T x + \epsilon$ where $\epsilon$ is normally distributed
in the following, design matrix $X$ has training inputs as rows, and examples are indep
to maximize likelihood $L(\theta) = \prod_i p(y_i \mid x_i; \theta)$, must minimize the cost function $J(\theta)$ in the exponent
can work it out by writing log likelihood
always passes through the mean of the $x$s and $y$s
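A sketch comparing the normal equation with the stochastic LMS update on made-up data (assuming numpy; the learning rate and epoch count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-3, 3, n)])   # design matrix with bias column
true_theta = np.array([1.0, 2.0])
y = X @ true_theta + rng.normal(0, 0.5, n)                  # y = theta^T x + gaussian noise

# closed form: theta = (X^T X)^{-1} X^T y  (normal equation)
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# stochastic gradient descent with the LMS / Widrow-Hoff update
theta, alpha = np.zeros(2), 0.01
for _ in range(50):
    for xi, yi in zip(X, y):
        theta += alpha * (yi - xi @ theta) * xi   # update proportional to the error term

print(theta_closed, theta)   # both close to [1, 2]
```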
12.10 Multi-layer feed-forward neural networks
Assume $m$ examples, $K$ outputs, $L$ layers, $s_l$ nodes in layer $l$
Params $\Theta$ are such that $\Theta^{(l)}_{ij}$ is the weight from node $j$ in layer $l$ to node $i$ in layer $l+1$ (subscripts are “backward”)
Regularized cost $J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y^{(i)}_k \log h_k(x^{(i)}) + (1 - y^{(i)}_k)\log(1 - h_k(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{l}\sum_{i}\sum_{j}\left(\Theta^{(l)}_{ij}\right)^2$
Or squared error for regressions; these turn out to have the same derivations
Backpropagation: algorithm used to compute gradients
Need random initialization of to break symmetry or all params will be equal
for $i = 1$ to $m$:
Forward-propagate activations $a^{(l)} = g(z^{(l)})$ for $l = 2, \dots, L$, where $z^{(l)} = \Theta^{(l-1)} a^{(l-1)}$ and $a^{(1)} = x^{(i)}$
Backpropagate errors $\delta^{(l)} = (\Theta^{(l)})^T \delta^{(l+1)} \circ g'(z^{(l)})$ for $l = L-1, \dots, 2$, where $\delta^{(L)} = a^{(L)} - y^{(i)}$
since $g'(z^{(l)}) = a^{(l)} \circ (1 - a^{(l)})$ for the sigmoid
$\circ$ is element-wise multiplication (http://math.stackexchange.com/questions/52578/symbol-for-elementwise-multiplication-of-vectors)
Accumulate $\Delta^{(l)} \leftarrow \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$ for $l = 1, \dots, L-1$
Gradient $\frac{\partial J}{\partial \Theta^{(l)}_{ij}} = D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} + \frac{\lambda}{m}\Theta^{(l)}_{ij}$ (no regularization term for the bias weights)
Excellent explanations of derivation at http://www.ml-class.org/course/qna/view?id=3740, http://www.scribd.com/doc/72228829/Back-Propagation
Derivation of the gradient for layer $l$, focusing on a single example $(x, y)$
Define $\delta^{(l)}_j = \frac{\partial J}{\partial z^{(l)}_j} = \sum_i \frac{\partial J}{\partial z^{(l+1)}_i} \frac{\partial z^{(l+1)}_i}{\partial z^{(l)}_j}$ (how do we reduce the second factor?)
For output layer: $\delta^{(L)} = a^{(L)} - y$
Build on this to get previous layers
For the simpler squared error $J = \frac{1}{2}\|a^{(L)} - y\|^2$, slightly different: $\delta^{(L)} = (a^{(L)} - y) \circ g'(z^{(L)})$
Useful to test with gradient checking (numerical gradient)
Generally if single hidden layer then choose more hidden units than inputs/outputs, and choose same # hidden units in each layer if more than 1 hidden layer
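A compact sketch of the cost/backprop equations above for one hidden layer with sigmoid activations, plus a numerical gradient check (assuming numpy; the architecture, XOR data, learning rate, and iteration count are all made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Theta1, Theta2):
    """Forward-propagate with bias units prepended at each layer."""
    a1 = np.column_stack([np.ones(len(X)), X])
    a2 = np.column_stack([np.ones(len(X)), sigmoid(a1 @ Theta1.T)])
    a3 = sigmoid(a2 @ Theta2.T)                     # output activations
    return a1, a2, a3

def cost_and_grads(X, Y, Theta1, Theta2, lam=0.0):
    m = len(X)
    a1, a2, a3 = forward(X, Theta1, Theta2)
    J = -np.sum(Y * np.log(a3) + (1 - Y) * np.log(1 - a3)) / m
    J += lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))

    delta3 = a3 - Y                                                  # output-layer error
    delta2 = (delta3 @ Theta2)[:, 1:] * a2[:, 1:] * (1 - a2[:, 1:])  # backpropagated error

    D2 = delta3.T @ a2 / m
    D1 = delta2.T @ a1 / m
    D2[:, 1:] += lam / m * Theta2[:, 1:]             # don't regularize bias weights
    D1[:, 1:] += lam / m * Theta1[:, 1:]
    return J, D1, D2

# XOR toy problem: 2 inputs -> 5 hidden -> 1 output
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
Theta1 = rng.uniform(-0.5, 0.5, (5, 3))   # random init breaks symmetry
Theta2 = rng.uniform(-0.5, 0.5, (1, 6))

for _ in range(10000):                     # plain batch gradient descent
    J, D1, D2 = cost_and_grads(X, Y, Theta1, Theta2)
    Theta1 -= 2.0 * D1
    Theta2 -= 2.0 * D2
print(J, forward(X, Theta1, Theta2)[2].round(2).ravel())  # should approach [0, 1, 1, 0]

# gradient checking on one weight: numerical vs analytic gradient
eps = 1e-4
T1p, T1m = Theta1.copy(), Theta1.copy()
T1p[0, 1] += eps
T1m[0, 1] -= eps
num = (cost_and_grads(X, Y, T1p, Theta2)[0] - cost_and_grads(X, Y, T1m, Theta2)[0]) / (2 * eps)
print(num, cost_and_grads(X, Y, Theta1, Theta2)[1][0, 1])  # should match closely
```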