Kullback–Leibler divergence aka KLIC: non-symmetric measure of difference between dists $P$ and $Q$
expected # extra bits to code samples from $P$ when using a code based on $Q$ rather than on $P$
alt intuition: expected log-likelihood ratio when data is distributed as $P$ and $Q$ is the model: $D_{KL}(P\|Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
integral for continuous: $\int p(x) \log \frac{p(x)}{q(x)}\,dx$
$D_{KL}(P\|Q) \ge 0$; $= 0$ for $P = Q$; asymmetric: $D_{KL}(P\|Q) \ne D_{KL}(Q\|P)$
mutual information: $I(X;Y) = D_{KL}(P(X,Y)\,\|\,P(X)P(Y))$
http://www.snl.salk.edu/~shlens/kl.pdf
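A minimal sketch of the discrete case (assuming numpy and log base 2 so the result reads as extra bits per sample; the distributions here are made up):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # 0 * log(0/q) = 0 by convention
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))  # in bits

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # extra bits/sample coding P with a Q-based code
print(kl_divergence(q, p))   # differs: KL is asymmetric
```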
normalized compression distance (NCD): $NCD(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$ where $C(\cdot)$ is the compressed length
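A rough sketch using zlib as the compressor $C$ (any real compressor only approximates the ideal Kolmogorov-complexity version; the strings are made up):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance using zlib as the compressor C."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"abcabcabc", b"abcabcabd"))  # small: similar strings
print(ncd(b"abcabcabc", b"zzzyyyxxx"))  # larger: dissimilar strings
```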
4 Finance
rate of return (ROR) aka return on investment (ROI) aka return
let $V_f$ be final value, $V_i$ be initial value
ratio: $V_f / V_i$
arithmetic return aka yield: $\frac{V_f - V_i}{V_i}$
logarithmic/continuous compound return: $\ln \frac{V_f}{V_i}$
compound annual growth rate (CAGR): $\left(\frac{V_f}{V_i}\right)^{1/t} - 1$ where $t$ is # years
annual percentage rate (APR)
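A tiny worked example of these return formulas, with made-up values (1000 growing to 1500 over 3 years):

```python
import math

v_i, v_f, years = 1000.0, 1500.0, 3

ratio = v_f / v_i                          # 1.5
arithmetic_return = (v_f - v_i) / v_i      # 0.5, i.e. a 50% yield
log_return = math.log(v_f / v_i)           # ~0.405 continuous compound return
cagr = (v_f / v_i) ** (1 / years) - 1      # ~0.1447 per year
print(ratio, arithmetic_return, log_return, cagr)
```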
5 Signal Processing
DFT: $X_k = \sum_{n=0}^{N-1} x_n\, e^{-2\pi i k n / N}$
IDFT: $x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k\, e^{2\pi i k n / N}$ (normalized, changed exp sign)
interesting presentation: the strength of a freq is the distance from the origin of the midpoint of your signal's points as the signal is spun around a circle at that freq http://altdevblogaday.org/2011/05/17/understanding-the-fourier-transform/
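A naive $O(N^2)$ sketch of the DFT/IDFT pair as written above (pure Python with cmath; a real implementation would use an FFT):

```python
import cmath

def dft(x):
    """Naive DFT: X_k = sum_n x_n * exp(-2*pi*i*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse: x_n = (1/N) * sum_k X_k * exp(+2*pi*i*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

signal = [0, 1, 0, -1]          # one cycle of a sine sampled 4 times
spectrum = dft(signal)
print([round(abs(c), 6) for c in spectrum])        # energy at k=1 and k=3 (mirror)
print([round(z.real, 6) for z in idft(spectrum)])  # recovers the signal
```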
IIR, FIR: TODO
6 Probability
6.1 Distributions
Binomial: # successes in $n$ Bernoulli trials each with success prob $p$
Geometric: # trials until Bernoulli success with prob $p$
Hypergeom: # successes in $n$ draws from a population of $N$ containing $K$ successes
Negative binomial: # successes in Bernoulli trials before $r$ failures (generalization of geom)
Poisson: # arrivals in a sliver of time (infinite-granularity binomial) assuming mean arrival rate $\lambda$
Simple interesting proof from binomial
Normal: mean $\mu$, standard deviation $\sigma$
Empirical rule: z-scores of 1/2/3 span 68%/95%/99.7%
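A quick check of the empirical rule, assuming scipy is available:

```python
from scipy import stats

# mass of a standard normal within z standard deviations of the mean
for z in (1, 2, 3):
    print(z, stats.norm.cdf(z) - stats.norm.cdf(-z))   # ~0.683, 0.954, 0.997
```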
Mahalanobis distance: similarity of a sample to a distribution
$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$ where $x$ is new sample, $\mu$ is dist mean, $\Sigma$ is covar matrix
same as normalized Euclidean dist if $\Sigma$ is diagonal: $D_M(x) = \sqrt{\sum_i (x_i - \mu_i)^2 / \sigma_i^2}$
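A minimal sketch, assuming numpy and a made-up correlated 2-D sample cloud:

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    """sqrt((x - mu)^T Sigma^{-1} (x - mu))"""
    d = np.asarray(x) - np.asarray(mu)
    return float(np.sqrt(d @ np.linalg.solve(sigma, d)))

samples = np.random.default_rng(0).multivariate_normal([0, 0], [[2, 1], [1, 2]], 500)
mu, sigma = samples.mean(axis=0), np.cov(samples.T)
print(mahalanobis([1.0, 1.0], mu, sigma))   # along the correlation: small distance
print(mahalanobis([4.0, -4.0], mu, sigma))  # against the correlation: large distance
```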
8.6 Smoothing
additive smoothing aka Laplace smoothing: $\hat\theta_i = \frac{x_i + \alpha}{N + \alpha d}$, with observed counts $x_i$, total count $N$, $d$ categories, pseudocount $\alpha > 0$
called “rule of succession” for $\alpha = 1$
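A small sketch of additive smoothing over raw counts (function name and counts are made up):

```python
def laplace_smooth(counts, alpha=1.0):
    """theta_i = (x_i + alpha) / (N + alpha * d) for counts over d categories."""
    n, d = sum(counts), len(counts)
    return [(x + alpha) / (n + alpha * d) for x in counts]

print(laplace_smooth([3, 0, 1]))              # no category gets probability 0
print(laplace_smooth([3, 0, 1], alpha=0.01))  # smaller pseudocount, closer to the MLE
```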
Good-Turing estimation: complex
8.7 Estimation
90% confidence interval: in repeated samplings, the computed intervals would contain the true param 90% of the time (0.1 miss rate); $P(L \le \theta \le U) = 0.9$ ($\theta$ is const, the interval $[L, U]$ is the RV)
90% credible interval: $P(L \le \theta \le U \mid \text{data}) = 0.9$ ($\theta$ is the RV); “Bayesian confidence interval”
statistic: any function of data
estimator: any statistic used to estimate an (unknown) param $\theta$; usu denoted $\hat\theta$
bias: $E[\hat\theta] - \theta$ (unbiased if 0; asymptotically unbiased if $\to 0$ as $n \to \infty$)
variance: $E[(\hat\theta - E[\hat\theta])^2]$
unbiased estimator: averages to the true param over repeated sampling
e.g.: sample variance
unbiased sample variance ($s^2$) is $\frac{1}{n-1}\sum_i (x_i - \bar{x})^2$ (Bessel's correction; simulation below)
biased is $\frac{1}{n}\sum_i (x_i - \bar{x})^2$
note: $s = \sqrt{s^2}$ is not unbiased for SD
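A quick simulation (assuming numpy; true variance and sample size are made up) showing the $1/n$ estimator is biased low, Bessel's correction fixes it, and $\sqrt{s^2}$ is still biased for the SD:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
n, trials = 5, 100_000

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), n)
    ss = np.sum((x - x.mean()) ** 2)
    biased.append(ss / n)          # divide by n
    unbiased.append(ss / (n - 1))  # Bessel's correction

print(np.mean(biased))             # ~3.2, systematically below 4
print(np.mean(unbiased))           # ~4.0, unbiased
print(np.mean(np.sqrt(unbiased)))  # still below 2.0: sqrt(s^2) is biased for SD
```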
unbiased estimators can be terrible; they may average to the true value but individual estimates may be ridiculous
e.g. for Poisson with rate $\lambda$, the only unbiased estimator of the statistic $e^{-2\lambda}$ from a single observation $x$ is $(-1)^x$, which is nonsense
MLE is $e^{-2x}$, which is always positive and has smaller MSE
besides bias, look also at efficiency, the MSE of individual estimates
consistent: $\hat\theta_n \to \theta$ in probability as $n \to \infty$
bias & variance must go to 0
biased but consistent estimate of mean: e.g. $\frac{1}{n+1}\sum_i x_i$ (bias $\to 0$ as $n \to \infty$)
unbiased but inconsistent estimate of mean: e.g. $\hat\mu = x_1$ (just the first observation; variance never shrinks)
maximum likelihood estimation (MLE): $\hat\theta_{MLE} = \arg\max_\theta P(\text{data} \mid \theta)$ (sketch at the end of this block)
find the peak in the likelihood function
Fisher popularized this by showing that typically MLE is unbiased, consistent, & asymptotically the lowest variance estimator
least squares: in OLS, if errors are normal, then least squares estimate is MLE
now clear that MLE is sometimes bad in practice, and finite-sample behavior is not the same as asymptotic behavior, leading to Bayesian strategies
disregards the uncertainty (“spread” in the likelihood function, i.e. Fisher information)
property (functional invariance): $g(\hat\theta)$ is the MLE of $g(\theta)$ for any function $g$
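A tiny illustration of “find the peak in the likelihood function”, assuming a Bernoulli model and a made-up data vector (the grid search is only to show the peak; the closed-form MLE is just the sample mean):

```python
import numpy as np

# MLE for a Bernoulli parameter: hat{p} = (# successes) / n maximizes the likelihood
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])
p_grid = np.linspace(0.01, 0.99, 99)
log_lik = data.sum() * np.log(p_grid) + (len(data) - data.sum()) * np.log(1 - p_grid)
print(p_grid[np.argmax(log_lik)])  # ~0.75, the peak of the likelihood function
print(data.mean())                 # closed-form MLE agrees
```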
maximum a posteriori (MAP): $\hat\theta_{MAP} = \arg\max_\theta P(\theta \mid \text{data}) = \arg\max_\theta P(\text{data} \mid \theta)\,P(\theta)$
the Bayesian approach
since we have a (posterior) prob dist over $\theta$, can extract whatever we want: mean, median, mode, intervals
commonly use conjugate prior for hyp prior to simplify math
e.g. if $\theta \sim \text{Beta}(a, b)$ and $X \sim \text{Bernoulli}(\theta)$ and our data is $s$ successes and $f$ failures, then $\theta \mid \text{data} \sim \text{Beta}(a + s, b + f)$ (sketch after this block)
usu. assume param independence so each param can have own beta dist
incorporating param RVs into Bayesian network itself requires making copies of the variables describing each instance
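A sketch of the Beta–Bernoulli conjugate update above, assuming scipy and made-up prior pseudo-counts and data:

```python
from scipy import stats

# Beta(a, b) prior on theta; observe s successes and f failures from Bernoulli(theta)
a, b = 2, 2          # prior pseudo-counts (assumed for illustration)
s, f = 7, 3          # observed data

posterior = stats.beta(a + s, b + f)      # conjugacy: posterior is Beta(a + s, b + f)
print(posterior.mean())                   # posterior mean
print((a + s - 1) / (a + s + b + f - 2))  # MAP estimate = posterior mode
print(posterior.interval(0.9))            # 90% credible interval
```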
12.4 Incomplete data: EM algo
hidden/latent variables: indirection that dramatically reduces number of parameters in Bayesian network
EM algo
E-step: calculate expected likelihood given current param estimates: $Q(\theta \mid \theta^{(t)}) = E_{Z \mid x, \theta^{(t)}}[\log P(x, Z \mid \theta)]$
often, in reality, we don't actually have to calculate the expected likelihood, just as we don't have to compute the cost function in gradient descent
this step usually just updates hidden value estimates, which will then be used in the M-step
M-step: find params that maximize expected likelihood: $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
EM algo examples (all have $N$ data points)
unsupervised clustering: Gaussian mixture model
single Gaussian: $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$
GMM: $p(x) = \sum_{k=1}^{K} w_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$ where $K$ is # Gaussians (sketch at the end of this block)
while ANNs are universal approximators of functions, GMMs are universal approximators of densities
even diagonal GMMs are universal approximators; full-rank are unwieldy, since # params is square of # dims
like K-means with probabilistic assignments and Gaussians instead of means (K-means is a hard EM algo)
very sensitive to initialization; may initialize with K-means
mixture model of $K$ components: $p(x) = \sum_{k=1}^{K} P(C = k)\, p(x \mid C = k)$
Bayes net: $C \to X$; $C$ hidden, discrete; $X$ continuous
E-step: calculate some quantities useful later (assignment probabilities, which are the hidden variables): $p_{ik} = P(C = k \mid x_i) \propto P(C = k)\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
M-step: maximize expected likelihood of observed & hidden vars
to avoid local maxima (component shrinking to a single point, two components merging, etc.):
use priors on params to apply MAP version of EM
restart a component with new random params if it gets too small/too close to another component
initialize params with reasonable values
can also do MAP GMM
http://bengio.abracadoudou.com/lectures/gmm.pdf
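A minimal EM-for-GMM sketch following the E/M steps above (assuming numpy, full covariances, random-point initialization rather than K-means, and made-up blob data):

```python
import numpy as np

def em_gmm(X, K, iters=50, seed=0):
    """EM for a Gaussian mixture model with full covariances (minimal sketch)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.full(K, 1.0 / K)                     # mixing weights
    mu = X[rng.choice(N, K, replace=False)]     # init means from random data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(iters):
        # E-step: responsibilities p_ik = P(component k | x_i)
        resp = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(sigma[k])
            det = np.linalg.det(sigma[k])
            exponent = -0.5 * np.sum(diff @ inv * diff, axis=1)
            resp[:, k] = w[k] * np.exp(exponent) / np.sqrt((2 * np.pi) ** d * det)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, covariances from responsibilities
        nk = resp.sum(axis=0)
        w = nk / N
        mu = (resp.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return w, mu, sigma

# two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
w, mu, sigma = em_gmm(X, K=2)
print(w)   # ~[0.5, 0.5]
print(mu)  # means near (0,0) and (5,5), in either order
```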
Naive Bayes with hidden class
Bayes net: $C \to X_1, \dots, X_n$; $C$ hidden; all vars discrete
E-step: compute $P(C = c \mid x_i)$ for each example $x_i$ using the current params (inference in the net)
M-step: new params are normalized expected counts: $\theta_c = \hat N(C = c)/N$, $\theta_{x_j \mid c} = \hat N(X_j = x_j, C = c)/\hat N(C = c)$, where the $\hat N$ are expected counts from the E-step
HMMs: dynamic Bayes net with single discrete state var
each data point is sequence of observations
transition probs repeat across time: $P(X_{t+1} \mid X_t)$ is shared across all $t$
E-step: modify forward-backward algo to compute expected counts below
obtained by smoothing rather than filtering: must pay attn to subsequent evidence in estimating prob of a particular transition (eg evidence is obtained after crime)
M-step: re-estimate transition & sensor probs from the expected counts, e.g. $\hat\theta_{ij} = \frac{\text{expected \# of } i \to j \text{ transitions}}{\text{expected \# of transitions out of } i}$
EM algo
pretend we know the params, then “complete” the data by inferring prob dists over the hidden vars, then find params that maximize likelihood of observed & hidden vars
gist:
E-step: compute $Q(\theta \mid \theta^{(t)}) = E_{Z \mid x, \theta^{(t)}}[\log L(\theta; x, Z)]$
expected (log) likelihood over hidden vars $Z$ under current $\theta^{(t)}$
misnomer: what's calculated are the fixed, data-dependent params of the function $Q$
M-step: compute $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
new $\theta$ that maximizes the expected likelihood
resembles gradient-based hill-climbing but no “step size” param
monotonically increases likelihood
12.5 Kernel models
aka Parzen-Rosenblatt window
each instance contributes small density function
density estimation: $\hat p(x) = \frac{1}{N} \sum_{i=1}^{N} K(x, x_i)$
kernel normally depends only on the distance $\|x - x_i\|$
eg $d$-dimensional Gaussian with width $w$: $K(x, x_i) = \frac{1}{(2\pi w^2)^{d/2}} e^{-\|x - x_i\|^2 / (2 w^2)}$
supervised learning: take weighted combination of all predictions
vs kNN's unweighted combination of instances
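A sketch of Gaussian kernel density estimation with a fixed width $w$ (assuming numpy; the data and width are made up):

```python
import numpy as np

def kde(x, data, width=0.5):
    """Gaussian KDE at x: average of a small Gaussian bump centered on each instance."""
    z = (x - data) / width
    return np.mean(np.exp(-0.5 * z**2) / (width * np.sqrt(2 * np.pi)))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
for x in (-2.0, 0.0, 2.0):
    print(x, kde(x, data))   # high density near the two modes, low in between
```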
12.6 Classification
linear classifiers: simplest type of feedforward neural network
TODO: single- vs multi-layer perceptron; feedforward vs backpropagation
perceptron learning algorithm: for each iteration over examples $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, if $y_i\,(w \cdot x_i) \le 0$ (mistake), then $w \leftarrow w + y_i x_i$.
makes at most $(R/\gamma)^2$ mistakes on the training set, where $R = \max_i \|x_i\|$ and $\gamma$ is the margin
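A sketch of the perceptron update above, assuming numpy, $\pm 1$ labels, and a made-up separable dataset with the bias folded in as a constant feature:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Perceptron learning: on a mistake (y_i * w.x_i <= 0), set w <- w + y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:        # converged: data linearly separated
            break
    return w

# linearly separable toy data (last column is a constant 1 bias feature)
X = np.array([[1, 2, 1], [2, 3, 1], [-1, -2, 1], [-2, -1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # matches y
```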
support vector machine (SVM): maximum margin classifier with some slack
equivalent formulation assuming labels are 1/0 instead of $\pm 1$: since the margin constraint becomes a penalty we minimize (given that we're trying to minimize all the slack variables $\xi_i$)
LOOCV error $= \frac{1}{N}\sum_i \text{Loss}(y_i, \hat f_{-i}(x_i))$, where Loss is the 0-1 loss and $\hat f_{-i}$ is trained without example $i$. Upper bound is (# support vectors / $N$).
quadratic programming optimization problem: single global maximum that can be found efficiently
dual: $\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$
kernel trick: substitute kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ where $\phi$ maps to high/infinite dimensions but $K$ can still be computed efficiently
Mercer's theorem: any “reasonable” (positive definite) kernel function corresponds to some feature space
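A small illustration of the kernel trick: the quadratic kernel $K(x, z) = (x \cdot z)^2$ equals the dot product of explicit degree-2 feature maps, computed without ever building the $d^2$-dimensional features (the vectors are made up):

```python
import numpy as np
from itertools import product

def poly2_features(x):
    """Explicit feature map for the quadratic kernel K(x, z) = (x . z)^2."""
    return np.array([x[i] * x[j] for i, j in product(range(len(x)), repeat=2)])

x, z = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print((x @ z) ** 2)                           # kernel: O(d) work
print(poly2_features(x) @ poly2_features(z))  # same value via explicit d^2-dim features
```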
ordinary least squares (OLS) regression: a linear regression
minimize cost function $J(\theta) = \frac{1}{2}\sum_i (h_\theta(x_i) - y_i)^2$ (sum of squared errors)
gradient descent: repeatedly $\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$
$\alpha$ is configurable learning rate
batch: cost over all training instances
stochastic aka incremental: converges faster
update rule is called least mean squares (LMS) rule aka Widrow-Hoff rule
derivation for single training instance: $\frac{\partial J}{\partial \theta_j} = (h_\theta(x) - y)\,x_j$, so $\theta_j \leftarrow \theta_j + \alpha\,(y - h_\theta(x))\,x_j$
magnitude of update proportional to error term
can minimize in closed form without iterative algo (some matrix calculus): $\theta = (X^T X)^{-1} X^T y$ (normal equation; sketch at the end of this block)
LOOCV can also be computed without training $N$ models: $\text{LOOCV} = \frac{1}{N}\sum_i \left(\frac{e_i}{1 - h_{ii}}\right)^2$ where $e_i$ is the residual and $h_{ii}$ is the leverage (diagonal of the hat matrix $X(X^TX)^{-1}X^T$)
probabilistic interp: why linear regression / why least squares?
assume $y = \theta^T x + \epsilon$ where $\epsilon$ is normally distributed
in the following, design matrix $X$ has training inputs as rows, and examples are indep
to maximize likelihood $L(\theta) = \prod_i p(y_i \mid x_i; \theta)$, must minimize the cost function $J(\theta)$ in the exponent
can work it out by writing log likelihood
always passes through the mean of the $x$s and $y$s
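A sketch comparing the normal equation with the stochastic LMS update on made-up data (assuming numpy; the learning rate and epoch count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(-3, 3, n)])   # design matrix with bias column
true_theta = np.array([1.0, 2.0])
y = X @ true_theta + rng.normal(0, 0.5, n)                  # y = theta^T x + gaussian noise

# closed form: theta = (X^T X)^{-1} X^T y  (normal equation)
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# stochastic gradient descent with the LMS / Widrow-Hoff update
theta, alpha = np.zeros(2), 0.01
for _ in range(50):
    for xi, yi in zip(X, y):
        theta += alpha * (yi - xi @ theta) * xi   # update proportional to the error term

print(theta_closed, theta)   # both close to [1, 2]
```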
12.10 Multi-layer feed-forward neural networks
Assume $m$ examples, $K$ outputs, $L$ layers, $s_l$ nodes in layer $l$
Params $\Theta$ are such that $\Theta^{(l)}_{ij}$ is the weight from node $j$ in layer $l$ to node $i$ in layer $l+1$ (subscripts are “backward”)
Regularized cost $J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y^{(i)}_k \log h_k(x^{(i)}) + (1 - y^{(i)}_k)\log(1 - h_k(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{l}\sum_{i}\sum_{j}\left(\Theta^{(l)}_{ij}\right)^2$
Or squared error for regressions; these turn out to have the same derivations
Backpropagation: algorithm used to compute gradients
Need random initialization of to break symmetry or all params will be equal
for $i = 1$ to $m$:
Forward-propagate activations $a^{(l)} = g(z^{(l)})$ for $l = 2, \dots, L$, where $z^{(l)} = \Theta^{(l-1)} a^{(l-1)}$ and $a^{(1)} = x^{(i)}$
Backpropagate errors $\delta^{(l)} = (\Theta^{(l)})^T \delta^{(l+1)} \circ g'(z^{(l)})$ for $l = L-1, \dots, 2$, where $\delta^{(L)} = a^{(L)} - y^{(i)}$
since $g'(z^{(l)}) = a^{(l)} \circ (1 - a^{(l)})$ for the sigmoid
$\circ$ is element-wise multiplication (http://math.stackexchange.com/questions/52578/symbol-for-elementwise-multiplication-of-vectors)
Accumulate $\Delta^{(l)} \leftarrow \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$ for $l = 1, \dots, L-1$
Gradient $\frac{\partial J}{\partial \Theta^{(l)}_{ij}} = D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij} + \frac{\lambda}{m}\Theta^{(l)}_{ij}$ (no regularization term for the bias weights)
Excellent explanations of derivation at http://www.ml-class.org/course/qna/view?id=3740, http://www.scribd.com/doc/72228829/Back-Propagation
Derivation of the gradient for layer $l$, focusing on a single example $(x, y)$
Define $\delta^{(l)}_j = \frac{\partial J}{\partial z^{(l)}_j} = \sum_i \frac{\partial J}{\partial z^{(l+1)}_i} \frac{\partial z^{(l+1)}_i}{\partial z^{(l)}_j}$ (how do we reduce the second factor?)
For output layer: $\delta^{(L)} = a^{(L)} - y$
Build on this to get previous layers
For the simpler squared error $J = \frac{1}{2}\|a^{(L)} - y\|^2$, slightly different: $\delta^{(L)} = (a^{(L)} - y) \circ g'(z^{(L)})$
Useful to test with gradient checking (numerical gradient)
Generally if single hidden layer then choose more hidden units than inputs/outputs, and choose same # hidden units in each layer if more than 1 hidden layer
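A compact sketch of the cost/backprop equations above for one hidden layer with sigmoid activations, plus a numerical gradient check (assuming numpy; the architecture, XOR data, learning rate, and iteration count are all made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Theta1, Theta2):
    """Forward-propagate with bias units prepended at each layer."""
    a1 = np.column_stack([np.ones(len(X)), X])
    a2 = np.column_stack([np.ones(len(X)), sigmoid(a1 @ Theta1.T)])
    a3 = sigmoid(a2 @ Theta2.T)                     # output activations
    return a1, a2, a3

def cost_and_grads(X, Y, Theta1, Theta2, lam=0.0):
    m = len(X)
    a1, a2, a3 = forward(X, Theta1, Theta2)
    J = -np.sum(Y * np.log(a3) + (1 - Y) * np.log(1 - a3)) / m
    J += lam / (2 * m) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))

    delta3 = a3 - Y                                                  # output-layer error
    delta2 = (delta3 @ Theta2)[:, 1:] * a2[:, 1:] * (1 - a2[:, 1:])  # backpropagated error

    D2 = delta3.T @ a2 / m
    D1 = delta2.T @ a1 / m
    D2[:, 1:] += lam / m * Theta2[:, 1:]             # don't regularize bias weights
    D1[:, 1:] += lam / m * Theta1[:, 1:]
    return J, D1, D2

# XOR toy problem: 2 inputs -> 5 hidden -> 1 output
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
Theta1 = rng.uniform(-0.5, 0.5, (5, 3))   # random init breaks symmetry
Theta2 = rng.uniform(-0.5, 0.5, (1, 6))

for _ in range(10000):                     # plain batch gradient descent
    J, D1, D2 = cost_and_grads(X, Y, Theta1, Theta2)
    Theta1 -= 2.0 * D1
    Theta2 -= 2.0 * D2
print(J, forward(X, Theta1, Theta2)[2].round(2).ravel())  # should approach [0, 1, 1, 0]

# gradient checking on one weight: numerical vs analytic gradient
eps = 1e-4
T1p, T1m = Theta1.copy(), Theta1.copy()
T1p[0, 1] += eps
T1m[0, 1] -= eps
num = (cost_and_grads(X, Y, T1p, Theta2)[0] - cost_and_grads(X, Y, T1m, Theta2)[0]) / (2 * eps)
print(num, cost_and_grads(X, Y, Theta1, Theta2)[1][0, 1])  # should match closely
```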