IR Practice 2
When ranking performance improves, we should expect:
increased result clicks
What are the basic assumptions in a query generation model?
p(Q|D;R = 0) = p(Q|R = 0), Uniform document prior
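A hedged sketch of the standard derivation (not quoted from the card) may help show why these two assumptions let us rank by the query likelihood p(Q|D, R=1) alone:

```latex
% Requires amsmath. Rank documents by the odds of relevance given D and Q:
\begin{align*}
O(R=1 \mid D, Q)
  &\propto \frac{p(Q \mid D, R=1)\, p(R=1 \mid D)}{p(Q \mid D, R=0)\, p(R=0 \mid D)} \\
  &= \frac{p(Q \mid D, R=1)\, p(R=1 \mid D)}{p(Q \mid R=0)\, p(R=0 \mid D)}
     && \text{(assumption 1: } p(Q \mid D, R=0) = p(Q \mid R=0)\text{)} \\
  &\propto_{\text{rank}} p(Q \mid D, R=1)
     && \text{(assumption 2: uniform document prior)}
\end{align*}
```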
Zipf's law tells us:
In a given French corpus, if the most frequent word's relative frequency is 1, then the second most frequent word's relative frequency is around 0.5; smoothing is necessary
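A minimal Python sketch (toy numbers, assuming the idealized 1/rank form of Zipf's law) of the rank-frequency relationship behind this example:

```python
# Relative frequencies predicted by Zipf's law, f(rank) ~ 1 / rank,
# normalized so the most frequent word has relative frequency 1.
def zipf_relative_frequency(rank: int) -> float:
    return 1.0 / rank

for rank in range(1, 6):
    print(rank, round(zipf_relative_frequency(rank), 3))
# rank 2 -> 0.5: the second most frequent word occurs roughly half as often
# as the most frequent one, matching the card's example.
```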
Query Processing - Term-at-a-time
It accumulates scores for documents by processing term lists one at a time
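A minimal Python sketch of term-at-a-time scoring; `inverted_index` and its `(doc_id, weight)` posting format are illustrative assumptions, not part of the card:

```python
from collections import defaultdict

def term_at_a_time(query_terms, inverted_index, k=10):
    """Score documents by processing one term's posting list at a time."""
    accumulators = defaultdict(float)           # partial scores per document
    for term in query_terms:                    # one term list at a time
        for doc_id, weight in inverted_index.get(term, []):
            accumulators[doc_id] += weight      # accumulate this term's contribution
    # rank by accumulated score and return the top-k documents
    return sorted(accumulators.items(), key=lambda x: -x[1])[:k]
```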
Query Processing - Document-at-a-time
It calculates complete scores for documents by processing all term lists, one document at a time
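A minimal Python sketch of document-at-a-time scoring; representing each posting list as a `{doc_id: weight}` dict is a simplification (real systems merge doc-id-sorted lists):

```python
import heapq

def document_at_a_time(query_terms, inverted_index, k=10):
    """Compute each document's complete score before moving to the next one."""
    postings = [inverted_index.get(t, {}) for t in query_terms]
    all_doc_ids = sorted(set().union(*postings)) if postings else []
    top_k = []                                   # min-heap of (score, doc_id)
    for doc_id in all_doc_ids:                   # one document at a time
        score = sum(p.get(doc_id, 0.0) for p in postings)   # complete score
        heapq.heappush(top_k, (score, doc_id))
        if len(top_k) > k:
            heapq.heappop(top_k)                 # keep only the k best so far
    return sorted(top_k, reverse=True)
```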
Smoothing
It is a technique for estimating probabilities for missing (or unseen) words. - lower (or discount) the probability estimates for words that are seen in the document text. - assign that "left-over" probability to the estimates for the words that are not seen in the text
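One concrete way to do this is Jelinek-Mercer smoothing; this is just one common choice (the card does not prescribe a method), and the helper name `jm_smoothed_prob` is illustrative:

```python
def jm_smoothed_prob(word, doc_counts, doc_len, coll_counts, coll_len, lam=0.1):
    """Jelinek-Mercer: mix the document model with a collection (background) model."""
    p_doc = doc_counts.get(word, 0) / doc_len        # maximum-likelihood estimate
    p_coll = coll_counts.get(word, 0) / coll_len     # collection language model
    # seen words are discounted by (1 - lam); unseen words still get lam * p_coll
    return (1 - lam) * p_doc + lam * p_coll
```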
idf Using Combiners - Discussion - Intro Disadvantage:
It is inefficient: a large number of dummy counters, i.e. unnecessary disk reads during the reduce phase
Query Processing - What is the threshold?
It is the maximum possible score of unseen documents
Use VSM for ranking - The first step
It is to build query and document vectors
Stemming
It means to normalize lexical variations of words that have very similar or identical meanings. For example, computer, computing, and computation can all be normalized into "compute".
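A small illustration using NLTK's Porter stemmer (one possible stemmer; the card does not name a specific algorithm). Note that a real stemmer typically produces a shared stem such as "comput" rather than the dictionary form "compute":

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computing", "computation"]:
    print(word, "->", stemmer.stem(word))   # all three map to the same stem
```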
Zipf's law
It says that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions
Which of the following metric(s) emphasize(s) recall:
MAP (Mean Average Precision)
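A minimal sketch of (mean) average precision, which helps show why MAP is recall-oriented: every relevant document that is never retrieved drags the average down. The function names are illustrative:

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    hits, precisions = 0, []
    for i, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precisions.append(hits / i)          # precision at each relevant hit
    # dividing by the number of ALL relevant docs penalizes missed (unretrieved) ones
    return sum(precisions) / len(relevant_doc_ids) if relevant_doc_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```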
idf Using Combiners - Discussion - Mappers - Alternative?
Most efficient: mappers emit "1" for each term, and combiners then sum the dummy counters before the reduce phase. No dynamic memory allocation; reduces the amount of disk reads
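A plain-Python sketch of the idea (not a real Hadoop job); the function names `mapper`, `combiner`, and `reducer` are illustrative:

```python
def mapper(doc):
    for term in doc.split():
        yield term, 1                  # one dummy counter per occurrence

def combiner(term, counts):
    yield term, sum(counts)            # local, per-mapper aggregation

def reducer(term, partial_counts):
    yield term, sum(partial_counts)    # global term count across all mappers
```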
idf Using Combiners - Discussion - Intro
Most term count examples emit "1" for each term
Query Processing - Skipping
Search involves comparison of inverted lists of different lengths: - Can be very inefficient, - "Skipping" ahead to check document numbers is much better, - Compression makes this difficult
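A minimal sketch of skipping during the intersection of two doc-id-sorted posting lists; the fixed `skip` step stands in for real skip pointers, which would be stored in the index:

```python
def intersect_with_skips(short_list, long_list, skip=4):
    """Intersect two sorted lists of doc ids, jumping over runs in the longer list."""
    result, i, j = [], 0, 0
    while i < len(short_list) and j < len(long_list):
        if long_list[j] == short_list[i]:
            result.append(short_list[i]); i += 1; j += 1
        elif long_list[j] < short_list[i]:
            # skip ahead only if the skipped-to doc id does not overshoot the target
            if j + skip < len(long_list) and long_list[j + skip] <= short_list[i]:
                j += skip
            else:
                j += 1
        else:
            i += 1
    return result
```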
VSM - Advantages
Simple computational framework for ranking, - Any similarity measure or term weighting scheme could be used
Fagin's Algorithm - Pros and Cons
TA accesses fewer objects than FA and stops at least as early as FA, but TA performs more random accesses; FA uses unbounded buffer space and only performs random access at the end
Difference between Document-at-a-time vs Term-at-a-time
Term-at-a-time uses more memory for accumulators, but accesses disk more efficiently • Document-at-a-time needs little memory (priority queue of top documents) but reads much of the inverted index.
Vector Space (VS)
The basic idea of the _____ _____ _____ model is to represent both a document and a query as a vector in a high-dimensional space where each dimension corresponds to a term.
Query Processing - Optimizations
Two classes of optimization: - 1. Read less data from inverted lists • e.g., skip lists, • better for simple feature functions. 2. Calculate scores for fewer documents: • e.g., conjunctive processing, • better for complex feature functions
Stemming, vocabulary
______ reduces the ______ size.
Stopword removal
_______ ______ will decrease recall.
stop word
A _____ ______ is a word that usually doesn't reflect the content of a document where it occurs; for example, functional words of English are generally stop words.
What are three key assumptions we have made when using classical IR evaluation methods?
1. Search result relevance serves as a good proxy of user satisfaction. 2. Users will sequentially browse the search results. 3. The relevance quality of the search results will be independently evaluated.
What are the three key heuristics commonly shared by vector space models, BM25 and language models?
1. TF, 2. IDF, 3. Document length normalization
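For illustration, one common BM25 variant shows all three heuristics in a single formula (the exact form used in the course may differ):

```latex
% k_1 and b are tuning parameters; avgdl is the average document length.
\mathrm{score}(D, Q) \;=\; \sum_{t \in Q}
  \underbrace{\log\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}}_{\text{IDF}}
  \cdot
  \underbrace{\frac{f(t, D)\,(k_1 + 1)}
    {f(t, D) + k_1\!\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}}_{\text{TF with document length normalization}}
```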
What are the three basic assumptions in classical IR evaluation?
1. query is a proxy of users' information need, 2. the relevance of each document is independent of the others, 3. sequential browsing from top to bottom
VSM - Disadvantages
Assumption of term independence
Vector Space Model
Documents and query represented by a vector of term weights. Collection represented by a matrix of term weights. Query and document are represented as term vectors with tf-idf weighting.
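A minimal end-to-end sketch (toy corpus; raw tf with a smoothed idf, though weighting details vary by course) of building tf-idf vectors and ranking by cosine similarity:

```python
import math
from collections import Counter

def tfidf_vector(text, documents):
    tf = Counter(text.split())
    n_docs = len(documents)
    vec = {}
    for term, freq in tf.items():
        df = sum(1 for d in documents if term in d.split())
        idf = math.log((n_docs + 1) / (df + 1)) + 1        # smoothed idf
        vec[term] = freq * idf
    return vec

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = ["information retrieval ranks documents", "cats purr and sleep"]
doc_vecs = [tfidf_vector(d, docs) for d in docs]
query_vec = tfidf_vector("retrieval of documents", docs)
print([round(cosine(query_vec, dv), 3) for dv in doc_vecs])   # first doc scores higher
```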
Boolean Retrieval - disadvantages
Effectiveness depends entirely on the user - Simple queries usually don't work well - Complex queries are difficult - No ranking
Query Processing -Threshold - Fagin's Algorithm
Fagin's Algorithm is similar to TA with NRA, but: read lists until at least k objects have had a value read for every term, use random access to fetch the missing term scores, and return the top-k objects
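A minimal sketch of Fagin's Algorithm under simplifying assumptions: each term's posting list is pre-sorted by descending score, and `score_of(doc_id, term)` stands in for random access:

```python
def fagins_algorithm(sorted_lists, score_of, k):
    """sorted_lists: {term: [(doc_id, score), ...]} sorted by score descending."""
    seen_in = {term: set() for term in sorted_lists}
    depth = 0
    # Phase 1: sorted access in parallel until k objects have been seen in every list
    while True:
        for term, lst in sorted_lists.items():
            if depth < len(lst):
                seen_in[term].add(lst[depth][0])
        fully_seen = set.intersection(*seen_in.values())
        depth += 1
        if len(fully_seen) >= k or all(depth >= len(l) for l in sorted_lists.values()):
            break
    # Phase 2: random access to fill in the missing scores of every object seen
    candidates = set.union(*seen_in.values())
    totals = {d: sum(score_of(d, t) for t in sorted_lists) for d in candidates}
    return sorted(totals.items(), key=lambda x: -x[1])[:k]
```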
Vector space model is equivalent to the Bag-of-Words model
False because the Bag-of-Words model is just a special case of the vector space model, where individual words/N-grams are treated as the bases.
The assumption that words are "independent and identically distributed (i.i.d.)" in documents is the foundation of statistical language models.
False because language models do not require this assumption; it is made only to facilitate parameter estimation in language models.
Zipf's Law predicts that tail words take up a major portion of the vocabulary and are usually semantically meaningless.
False because tail words indeed take up a major portion of the vocabulary, but they are not necessarily semantically meaningless.
The goal of retrieval models we have learnt is to improve some specific IR evaluation metrics, such as NDCG and MAP
False because the goal of any retrieval model is to help users fulfill their information need.
We can directly get the number of unique terms in a particular document from an inverted index.
False because as an inverted index is a mapping from words to posting lists, it is non-trivial to get the number of unique words in a single document. We have to traverse the whole index for this purpose, and this is clearly very expensive.
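A toy illustration (index contents are made up) of why this count requires traversing the whole index:

```python
# Inverted index: term -> posting list of doc ids containing that term.
inverted_index = {
    "retrieval": [1, 2],
    "ranking":   [1],
    "cats":      [3],
}

def unique_terms_in_doc(index, doc_id):
    # Expensive: every posting list must be scanned just to answer for one document.
    return sum(1 for postings in index.values() if doc_id in postings)

print(unique_terms_in_doc(inverted_index, 1))   # 2
```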
We prefer cosine similarity over Euclidean distance in vector space models because the former is computationally more efficient.
False because cosine similarity normalizes for document length.
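A toy illustration of the length-normalization point: repeating a document changes its Euclidean distance to the query but leaves its cosine similarity unchanged:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query = [1.0, 1.0, 0.0]
doc = [2.0, 1.0, 0.0]
doc_doubled = [4.0, 2.0, 0.0]               # the same document concatenated with itself
print(cosine(query, doc), cosine(query, doc_doubled))         # equal
print(euclidean(query, doc), euclidean(query, doc_doubled))   # different
```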
Words with high DF are more discriminative than those with low DF
False because low DF words are usually more discriminative as they only appear in a handful of documents.
We do not use a database system to solve information retrieval problems mostly because of efficiency concern.
False because the major concern is that a database system cannot deal with unstructured text content.
Term independence is a basic assumption in all retrieval algorithms we have learned.
False because none of the retrieval algorithms we have learned requires term independence; the assumption is made only to reduce computational complexity.
idf Using Combiners - Discussion - Mappers
Most term count examples emit "1" for each term. This is inefficient: a large number of dummy counters, i.e. unnecessary disk reads during the reduce phase. Mappers could use an associative array to sum all terms in each document processed. Careful! Dynamic memory allocation may cause out-of-memory errors. Alternative?
Query Processing Skipping - Pros and Cons?
Must also consider random vs. sequential hard disk speed.
Query Processing -Threshold Methods (i.e. Fagin's Algorithm)
Only read subset of posting lists to compute top-k elements with highest score. Variants for Document-at-a-time query processing: - Posting lists sorted by doc id • WAND algorithm: in each list, skip ranges of doc ids with no chance of being in top-k. - Posting lists sorted by score • Threshold Algorithm (TA)
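A minimal sketch of the Threshold Algorithm (TA) under the same simplifying assumptions as the Fagin's Algorithm sketch above (score-sorted posting lists, `score_of(doc_id, term)` for random access):

```python
def threshold_algorithm(sorted_lists, score_of, k):
    """sorted_lists: {term: [(doc_id, score), ...]} sorted by score descending."""
    best = {}                                    # doc_id -> complete score
    depth, max_depth = 0, max(len(l) for l in sorted_lists.values())
    while depth < max_depth:
        threshold = 0.0
        for term, lst in sorted_lists.items():
            if depth >= len(lst):
                continue
            doc_id, score = lst[depth]
            threshold += score                   # sum of the last seen scores per list
            if doc_id not in best:               # random access for the remaining terms
                best[doc_id] = sum(score_of(doc_id, t) for t in sorted_lists)
        top_scores = sorted(best.values(), reverse=True)[:k]
        if len(top_scores) == k and top_scores[-1] >= threshold:
            break                                # k documents already beat the threshold
        depth += 1
    return sorted(best.items(), key=lambda x: -x[1])[:k]
```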
Discuss: Relationship between Precision, Recall and F-1 measure
Precision is used when the probability that a positive (retrieved) result is correct is important. The F measure is the harmonic mean of recall and precision. A false negative (Type II error) is a relevant document that is not retrieved; the false negative rate is 1 - Recall.
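A minimal sketch (toy relevance judgments) computing all three measures for a single ranked list:

```python
def precision_recall_f1(retrieved, relevant):
    tp = len(set(retrieved) & set(relevant))                   # relevant AND retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0           # 1 - recall = miss rate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"]))
# roughly (0.5, 0.67, 0.57): the harmonic mean pulls F1 toward the lower of the two
```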
Caching
Query distributions are similar to a Zipf distribution - About half of the queries each day are unique, but some are very popular
Unigram language model
A probability distribution over the words in a language. Generation of text consists of pulling words out of a "bucket" according to the probability distribution and replacing them
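A minimal sketch of the "bucket" view: generation samples words independently, with replacement, from the distribution; the toy probabilities are made up:

```python
import random

unigram = {"the": 0.4, "retrieval": 0.2, "of": 0.2, "documents": 0.2}   # toy model

def generate(model, length):
    words, probs = zip(*model.items())
    # drawing with replacement = putting each word back in the "bucket"
    return " ".join(random.choices(words, weights=probs, k=length))

print(generate(unigram, 6))
```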
N-gram language model
some applications use bigram and trigram language models where probabilities depend on previous words
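A minimal sketch of a maximum-likelihood bigram model on a toy corpus (no smoothing):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()
bigrams = Counter(zip(corpus, corpus[1:]))       # counts of (previous word, word)
unigrams = Counter(corpus[:-1])                  # counts of the previous word

def bigram_prob(prev, word):
    # p(word | prev): the probability now depends on the previous word
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("the", "cat"))                 # 2/3: "the" is followed by "cat" twice out of 3
```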