# NLP

TODO: merge in notes from 6.864 (shingles, markov chains, pos tagging, logistic regression, stemming, porter stemming: exclusion lists)

general NLP problems

- POS tagging
- sentence detection
- tokenization/segmentation: figuring out word/sentence boundaries
    - simplest approach: over all possible segmentations, return the one maximizing the product of P(word) over its words (see the sketch after this list)

- chunking
- parsing
- named entity recognition
- coreference
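
A minimal sketch of that max-product segmentation, assuming a toy unigram table `P` (hypothetical probabilities); dynamic programming over end positions avoids enumerating all segmentations, and log-probs avoid underflow:

```python
import math

# toy unigram probabilities (hypothetical values)
P = {"the": 0.05, "me": 0.02, "them": 0.01, "then": 0.01}

def segment(text, max_len=20):
    """Return the segmentation maximizing the sum of log P(word)."""
    n = len(text)
    best = [(-math.inf, [])] * (n + 1)  # best[i] = (score, words) for text[:i]
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in P and best[j][0] > -math.inf:
                score = best[j][0] + math.log(P[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1] if best[n][0] > -math.inf else None

print(segment("theme"))  # ['the', 'me']
```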

misc

- n-grams of chars: useful for language guessing
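
Character n-grams make a cheap language guesser: build an n-gram frequency profile per language and score new text against each. A minimal sketch with tiny hypothetical training strings (real profiles are built from large corpora):

```python
from collections import Counter

def trigrams(text):
    text = f" {text.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# toy per-language profiles (hypothetical training text)
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog"),
    "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def guess_language(text):
    grams = trigrams(text)
    # score = total count of trigrams shared with each profile
    return max(profiles, key=lambda lang: sum((grams & profiles[lang]).values()))

print(guess_language("the dog jumps"))     # en
print(guess_language("der hund springt"))  # de
```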

term frequency inverse document frequency (TFIDF)

term frequency is num occurrences of term t_i in doc d_j over total num words in d_j (how prominent t_i is in d_j):

tf_{i,j} = \frac{ n_{i,j} }{ \sum_k n_{k,j} }

inverse document frequency is num docs over num docs with t_i (more common words are less important):

idf_i = \log \frac{ |D| }{ | \{ d : t_i \in d \} | }

tf-idf_{i,j} = tf_{i,j} \cdot idf_i
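
A minimal sketch of these formulas over a toy corpus (hypothetical docs; assumes the queried term occurs somewhere, so the doc frequency is nonzero):

```python
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)       # n_{i,j} / sum_k n_{k,j}

def idf(term):
    df = sum(1 for d in docs if term in d)  # |{d : t_i in d}|
    return math.log(len(docs) / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("cat", docs[0]))  # ~0.18: rare across docs, so informative
print(tfidf("the", docs[0]))  # ~0.14: frequent here but common across docs
```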

soundex algorithm

- rudimentary “sounds-alike” phonetic algorithm: encodes similar-sounding names to the same representation, for fuzzy matching
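
A minimal sketch of classic American Soundex: keep the first letter, map the rest to digit classes, drop vowels and h/w/y, collapse runs of the same digit, pad/truncate to four characters:

```python
# consonant -> digit class; vowels and h/w/y have no code
CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(name):
    name = name.lower()
    digits = []
    prev = CODES.get(name[0])
    for c in name[1:]:
        code = CODES.get(c)
        if code and code != prev:
            digits.append(str(code))
        if c not in "hw":   # h/w are transparent: the same code on both
            prev = code     # sides of an h/w still collapses to one digit
    return (name[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))      # R163 R163
print(soundex("Ashcraft"), soundex("Ashcroft"))  # A261 A261
```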

vector space model

- represent text docs as vectors of identifiers
- eg index terms: each dim corresponds to a term; some possible values are:
    - bits indicating presence
    - counts
    - tf-idf

- relevancy rankings: treat the query as a doc, return the most cosine-similar docs (see the ranking sketch under cosine similarity below)
- limitations
    - long docs poorly represented, have poor similarity values (small scalar product/numerator and large dimensionality/denominator)
    - order information not preserved

cosine similarity

- \frac{A \cdot B}{\|A\| \|B\|}
- intuitively, find angle between vector representations of docs
- no need to recover the actual angle (arccos); cosine is monotone in the angle, so ranking by the cosine value directly preserves the ordering
- -1 means opposite, 0 means orthogonal (unrelated), 1 means same direction; in IR, components are always non-negative, so range is [0,1]
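
A minimal sketch of cosine similarity plus the vector-space ranking above, using toy docs and raw count vectors (a real system would use tf-idf weights):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and logs"]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)  # terms absent from b contribute 0
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def rank(query):
    q = Counter(query.split())
    vecs = [Counter(d.split()) for d in docs]
    return sorted(range(len(docs)), key=lambda i: cosine(q, vecs[i]), reverse=True)

print(rank("cat mat"))  # [0, 1, 2]: doc 0 shares both query terms
```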

jaccard similarity/coefficient/index

- for binary features
- simple: J(A,B) = \frac{|A \cap B|}{|A \cup B|}
- extended jaccard coefficient aka tanimoto coefficient: T(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}
    - a cosine-like similarity for real-valued vectors that reduces to jaccard on binary features
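
A minimal sketch of both, with toy inputs; on 0/1 vectors Tanimoto agrees with Jaccard on the corresponding sets:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def tanimoto(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

print(jaccard({1, 2, 3}, {2, 3, 4}))         # 0.5
print(tanimoto([1, 1, 1, 0], [0, 1, 1, 1]))  # 0.5: same sets as bit vectors
```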

NLP at Google (Katja Filippova)

- NLP problems at google: translation, speech recognition, language modeling, information extraction
- 7-gram language models on 2T+ tokens using simplified smoothing
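
The “simplified smoothing” at this scale is plausibly Stupid Backoff (Brants et al. 2007), which Google used for web-scale LMs: back off to shorter contexts with a fixed multiplier instead of a properly normalized discount, so scores are not true probabilities. A minimal sketch, assuming that scheme and toy counts:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
counts = Counter()
for n in (1, 2, 3):                      # collect 1- to 3-gram counts
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i:i + n])] += 1

def stupid_backoff(word, context, alpha=0.4):
    """Score (not a probability) of word given a context tuple."""
    if not context:
        return counts[(word,)] / len(corpus)       # unigram relative freq
    if counts[context + (word,)] > 0:
        return counts[context + (word,)] / counts[context]
    return alpha * stupid_backoff(word, context[1:], alpha)

print(stupid_backoff("sat", ("the", "cat")))  # 0.5: "the cat sat" 1 of 2
```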

- graph-based multi-sentence summarization (sentence fusion); see the sketch after this list
    - each unique (token, pos) pair is a vertex; plus dedicated start & end vertices
    - merge the graphs of multiple sentences (merging two vertices may require their neighbors to also match)
    - edge weights are inverse transition (bigram) frequencies, so frequent transitions are cheap
    - find the k shortest paths from start to end
    - constraint: a path must contain a verb
    - dataset: google news’ clusters
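
A heavily simplified sketch of the word-graph idea: merge sentences on shared tokens (standing in for (token, pos) vertices; assumes each word occurs at most once per sentence), weight edges by inverse bigram frequency, and take the single shortest start-to-end path. The real system adds POS tags, careful merging rules, k shortest paths, and the verb constraint:

```python
import heapq
from collections import Counter, defaultdict

sentences = [
    "the cat sat on a mat".split(),
    "the cat quickly sat on a mat today".split(),
]

bigrams = Counter()
edges = defaultdict(set)
for s in sentences:
    path = ["<s>"] + s + ["</s>"]
    for a, b in zip(path, path[1:]):
        bigrams[(a, b)] += 1   # transition frequency
        edges[a].add(b)

def shortest_fusion():
    """Dijkstra over the word graph; edge cost = 1 / bigram frequency."""
    heap = [(0.0, "<s>", ["<s>"])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == "</s>":
            return path[1:-1]
        if node in done:
            continue
        done.add(node)
        for nxt in edges[node]:
            heapq.heappush(heap, (cost + 1 / bigrams[(node, nxt)], nxt, path + [nxt]))

print(" ".join(shortest_fusion()))  # "the cat sat on a mat": the shared words
```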

- http://videolectures.net/russir2010_filippova_nlp/