TODO merge in notes from 6.864 shingles markov chain pos tagging logistic stemming porter stemming: exclusion lists

general NLP problems

  • POS tagging
  • sentence detection
  • tokenization/segmentation: figuring out word/sentence boundaries
    • simplest: over all possible segmentations, return seg w max prod over all words of P(word)
  • chunking
  • parsing
  • named entity recognition
  • coreference


  • n-grams of chars: useful for language guessing

term frequency inverse document frequency (TFIDF)

  • term frequency is num occurrences of term t_i in doc d_j over total num words in d_j (how prominent t_i is in d_j):

    tf_{i,j} = \frac{ n_{i,j} }{ \sum_k n_{k,j} }
  • inverse document frequency is num docs over num docs with t_i (more common words are less important):

    idf_i = \log \frac{ |D| }{ | \{ d : t_i \in d \} | }
  • tf-idf = tf * idf

soundex algorithm

  • rudemintary “sounds-alike” phonetic algorithm for encoding things that sound similar to the same representation, for matching

vector space model

  • represent text docs as vectors of identifiers
  • eg index terms: each dim corresponds to a term; some possible values are:
    • bits indicating presence
    • counts
    • tf-idf
  • relevancy rankings: treat query as doc, find most cosine-similar doc
  • limitations
    • long docs poorly represented, have poor similarity values (small scalar product/numerator and large dimensionality/denominator)
    • order information not preserved

cosine similarity

  • \frac{A \cdot B}{\|A\| \|B\|}
  • intuitively, find angle between vector representations of docs
  • don’t actually need to calc cosine; can just leave as is; ordering preserved
  • -1 means opposite, 0 means same, 1 means identical; in IR, components are always non-neg, so range is [0,1]

jaccard similarity/coefficient/index

  • for binary features
  • simple: J(A,B) = \frac{|A \cap B|}{|A \cup B|}
  • extended jaccard coefficient aka tanimoto coefficient: T(A,B) = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}
    • extended cosine similarity that yields jaccard for binary features

NLP at Google (Katja Filippova)

  • NLP problems at google: translation, speech recognition, language modeling, information extraction
    • 7-gram language models on 2T+ tokens using simplified smoothing
  • graph-based multi-sentence (sentence fusion) summarization
    • each unique (token,pos) is a vertex; also start & end vertices
    • merge graphs of multiple sentences together (merging vertices may require their neighbors to also be the same)
    • edges are inverse frequencies of transitions (bigram freqs)
    • find k shortest paths from start to end
    • constraint: path must contain a verb
    • use google news’ clusters as dataset