IR

TODO PPjoin, PPjoin+, MPjoin

general

  • worry about search index sizes; get very big very easily

IR models

  • vector space model (VSM): cosine similarity
    • documents are Euclidian-normalized vectors of TF(-IDF)
    • idf most commonly uses log(N / df_t)
    • variants
      • tf
        • sublinear (log) scaling
        • max tf normalization: normalize tf of all doc terms by doc’s max tf
          • ntf(t,d) = a + (1 - a) * tf(t,d) / tf_max(d) where a usu .4
          • avoids large swings in ntf from modest changes in tf(t,d) (eg 1 to 2)
          • unstable: change in stop word list dramatically alters weightings
          • sensitive to outlying # occurrences
          • doesn’t distinguish ntf=1 meaning high skew or ~same as other terms
      • SMART notation identifies doc/query weighting/normalization schemes
      • pivoted normalized doc len: replace denom |V(d)| with…
        • a * |V(d)| + (1 - a) * piv, or
        • a * u_d + (1 - a) * piv where u_d is # unique terms
  • probabilistic: order by prob of relevance
    • loss functions: 0/1 or with costs; rank by expected loss
    • binary independence model: assume boolean term vectors
      • many simplifying assumptions; same as multivariate naive bayes
      • query time boils down to computing retrieval status value (RSV)
      • simple MLE like NB; optional \alpha backoff (MAP uniform prior)
      • approximation yields the log IDF measure
    • okapi BM25
      • slightly more complex, but not that many params
      • incorporates tf
      • first prob model that performed very well
    • http://nlp.stanford.edu/IR-book/pdf/11prob.pdf
  • only impl diff among all these is scoring fn

google mustang

  • multi-stage ranking: first filter on each child node, send to root, rank
  • few hundred from each children

query understanding at bing

  • http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-query.html
  • two modes
    • Best effort retrieval: Find the most relevant results for the user query
      • Query segmentation
      • Stemming and synonym expansion
      • Term deletion
    • Automated Query Reformulation: Modify the user query to produce more relevant results for the inferred intent
      • spell correction
      • term deletion
      • This takes more liberty with the user’s intent
  • stemming: restaurant -> resturant, sf -> san francisco
    • abbrevs: un -> united nations
    • utilize co-click patterns to find un/united nations
  • term relaxation
    • [what is a diploid chromosome] -> “what is a” is not important from the SE matching, it introduces noise
    • [where can I get an iPhone 4] -> where is an important part of the query. Removing “where” misses the whole point of the query
    • [bowker’s test of change tutorial] -> test of symmetry is the correct terminology. How do you know that it is the incorrect terminology? If you relax it to “bowker’s test” you get better results
  • key concepts
    • win/loss ratio
      • wins are queries whose results improve
      • losses are queries whose results degrade
      • Related to precision, but not all valid reformulations change results
    • Pre vs. Post result analysis
      • Query alternatives generated pre-results
      • Blending decisions are post results
  • query evaluation
    • “Matching” level 0/l1/l2 -> inverted index, matching and ranking. Reduce billions to hundreds of thousands of pages. Much of the loss can occur here because it never made it into the candidate set. Assume that the other layers that use ML, etc… will bubble the correct results to the top.
    • “Merge” layer L3 -> the blending with multiple queries will be brought together
    • Federation layer L4 -> top layer coordinating activity
    • An important component is the query planner in L3 that performs annoation and rewriting.
  • matching & ranking
    • L0/l1 - 10^10 docs. l0 - boolean set operations, l1- ir score (a linear function over simple features like bm25, simple and fast, but not very accurate)
    • L2 reranking - 10^5 docs - ML heavy lifting: 1500 features, proximity
    • L3 reranking - 10^3 - federation and blending
    • L4 -> 10^1
    • Learning to rank Using Gradient Descent (for L2 layer)
  • query annotation
    • NLP Query annotation: offline analysis; similar to a ‘parse tree’
    • Ambiguity preserving: multiple interpretations
    • Backend independent: shared
    • Structure & Attributes: Syntax & semantics (how to handle tree leaf nodes)
  • query planning
    • [{un | “united nations”} jobs] -> l3-merge(l2-rank([un jobs]), l2-rank([“united nations” jobs])
    • OR
    • [{un| “united nations”} jobs] -> l3-cascade(threshold, l2-rank([un jobs]), l2-rank([“united nations” jobs])
    • the second is less certain and conditional
  • design considerations
    • Efficiency
      • one user query may generate multiple backend queries that are merged in L3
      • Some queries are cheaper than others
      • query reduction can improve performance
    • Relevance: L3 merging has maximal information, but is costly
    • Multiple query plan strategies: Depending on query analysis confidence
  • query analysis models
    • Noisy Channel Model: argmax{P(reqire | query) } = arg max{ P(rewrite)P(query| rewrite)}
    • bayes inversion
    • example: spelling
      • languagel model: likelihood of the correction
      • translation model: likelihood of the error occurring
  • language models
    • Based on Large-scale text mining
      • unigrams and N-grams (to favor common previously seen things, they make sense)
      • Probability of query term sequence
      • favor queries seen before
      • avoid nonsensical combinations
    • 1T n-gram resource: see MS ngram work here in SIGIR
    • Translation Models: training sets of aligned pairs (mispelling/correction; surface/stemmed)
    • Query log analysis
      • session reformulation
      • coclick-> associated queries
      • manual annotation
    • (missed the references, but see: Wei et al, Pang et al., Craswell)
  • summary
    • 60-70% of queries are reformulated
    • Can radically improve results
    • Trade-off between relevance and efficiency
      • rewrites can be costly
      • win/loss ratio is the key
    • Especially important for tail queries
      • no metadata to guide matching and ranking

implementations

  • sphinx: C++; mysql integration; BM25F; vik says it’s nice and “tight”
  • lucene: the most widely used index; java
  • solr: search service built atop lucene
  • solrcloud: distributed solr
  • elasticsearch: solrcloud alternative (dunno if it changes lucene)
  • indextank: reuses parts of lucene 3; customizable scoring; pretty simple; not sure what ranking it uses
  • vespa: yahoo’s kick-ass index
  • terrier: platform from U of glasgow; engineered to TREC
  • lemur: platform from CMU; well-regarded; also engineered to TREC but then made to do a bunch of other things

lucene

  • java library
  • solr: server around lucene
  • incremental updates only; removed batch updates early on
  • indexing speed (from homepage): 20MB/m; 1MB heap; incremental indexing same as batch; index ~20-30% of text
  • uses VSM model for scoring/ranking; not “academically approved”
  • filters w standard boolean model then ranks all matches
  • IndexReaders and IndexWriters
    • exposes delete(id, document) and index(id, document)
    • separate commit() flushes to disk
    • instantiating IndexReader gets a snapshot of what’s on disk
  • scoring TODO
  • GSOC11 changes lucene to accomodate modern ranking like BM25, DRF and separate matching/ranking

sphinx

  • c++ library and server
  • supports distributed index
  • uses BM25 (not BM25F?)
  • in-memory attribute map
    • supports live updates
  • on-disk index and document store
  • main selling point: mysql integration

vector space model

standard boolean model

  • simplest, exact-match

evaluation measures

  • set-based measures: don’t consider rankings
    • precision, recall, F-measure are single-value metrics of entire result set
    • precision: # relevant and retrieved / # retrieved
    • recall: # relevant and retrieved / # relevant
    • f-measure aka F1 score aka F-score: F = 2 \frac{PR}{P+R} where P is precision, R is recall
    • precision at some cutoff (precision@k aka PC): precision in the top k
    • R-precision: precision@R where R is # known relevant docs; addresses instability of precision@k
  • average precision (AP): considers ranking
    • mean precision over the precision-recall curve (recall is x var)
    • \frac{1}{|\text{relevant}|} \sum_{r=1}^k P(r) \text{rel}(r), where
    • may use interpolated precision value to smooth out curve
  • mean avg precision (MAP): mean AP over multiple queries
  • discounted cumulative gain (DCG): uses graded relevance scale instead of bool
    • insight: penalize highly relevant documents of low rank
    • similar spirit to precision@k and AP but for continuous relevance
    • DCG accumulated at rank p = rel_1 + \sum_{i=2}^p \frac{rel_i}{\log_2 i}
    • intuitively: relevance of 1 + other relevances, discounted logarithmically
  • normalized discounted cumulative gain (NDCG)
  • mean reciprocal rank (MRR)
  • these are all non-smooth cost functions
  • SIGIR10 paper from MSR comparing standard IR metrics (NDCG, MAP, PC) with clicks from A/B tests of rankings findings:

learning to rank (LETOR)

  • features
    • query-dependent: tf-idf, bm25, # matches in linking-anchors/title/body/url, # clicks on page for this query
    • query-independent: PageRank, # clicks/visits, spam score

latent semantic indexing (LSI)

  • uncover the underlying concepts/themes in a document by analyzing cohesively all docs in the collection, seeing what terms co-occur frequently
    • allows us to retrieve documents even if the exact terms don’t appear
    • construct term-document matrix (eg TF-IDFs)
    • then find low-rank approximation to the matrix (since orig too large); result is that some dimensions are combined and depend on more than one term, e.g. ${car),(truck),(flower)} -> {(1.3452*car + 0.2828*truck),(flower)}; mitigates synonymy problem
    • uses SVD
  • http://www.knowledgesearch.org/lsi/lsa_definition.htm

Structured Querying of Web Text (Mike Cafarella)

  • ExDB: probabilistic DB for SQL-like queries over web text
  • data model: unary/binary predicates over objects
  • information extraction
    • facts: uses an existing NLP system called TextRunner for extracting triples
    • types: uses existing KnowItAll system, which tags PSO and constructs predicates from phrases in well-formed sentences
    • synonyms
      • uses existing DIRT algo; compares degrees to which args of two preds coincide
      • eg: arg-pairs of preds invented and has-invented overlap strongly
    • inclusion dependencies or troponyms: invented(x,y) discovered(x,y)
      • no concrete idea, but an enhanced DIRT will probably work
    • functional dependencies: queries with negation
      • best left for data mining research
  • query processing
    • straightforward: series of joins, use top-k algos
    • projections: harder TODO
  • TODO
    • rest of paper
    • probabilistic query constraints

yahoo labs: contextual and display ads team

  • key challenges
    • scale
    • data sparseness
    • user privacy
    • imprecisely defined objectives/metrics
    • different train/test set distribuions
  • applied research areas in display ads
    • targeting: inferring user interest from historical behavior
    • contextual ads: serving contextually relevant (text) ads
    • GD stone: next gen display ad serving platform
    • categorization: ML approaches for categorizing queries, ads, pages
  • news direct display (NDD) optimization
  • techniques
    • linear: log reg, NB, SVM
    • non-linear: SVM (RBF), treenet, C5.0

google adwords bidding

  • as an advertiser: how to figure place CPC bids
    • conversion rate = # visitors who buy / # visitors
    • max profitable cost per action (CPA) = $ profit from buy
      • maybe your selling price - cost to you
    • value per click = value per action * conversion rate
      • this is what you should initially bid
    • cost per click (CPC): actual cost from an auction; always < your max bid
    • incremental cost per click (ICC) = cost of incremental clicks / number of incremental clicks
      • “incremental” means “additional,” over the next-lower bid
      • ICC = how much you’re paying for each additional click over the next-lower bid
      • if value per click < ICC, lower bid; else, raise bid
    • http://googleblog.blogspot.com/2009/09/new-adwords-bidding-tutorial.html

HITS (jon kleinberg, cornell)

  • some pages are hubs/portals with many outlinks
  • others are topic authorities with many inlinks
  • good hubs point to good auths
  • each page has a hub score and an auth score
  • hub score is sum of outbound auth scores
  • auth score is sum of inbound hub scores
  • equiv to solving max eigenval problem

SPEAR (michael noll)

  • find expert users and quality documents
  • based on HITS: mutual reinforcement of users and documents
  • time-based: discoverers vs followers
  • evaluated on delicious.com