IR Practice 2

When ranking performance improves, we should expect:

increased result clicks

What are the basic assumptions in a query generation model?

p(Q|D, R = 0) = p(Q|R = 0) (the query is independent of the document given non-relevance), and a uniform document prior

Zipf's law tells us:

In a given French corpus, if the most frequent word's relative frequency is taken as 1, then the second most frequent word's frequency is around 0.5; smoothing is necessary
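
A minimal sketch of the rank-frequency relationship Zipf's law implies (the constant and exponent here are illustrative assumptions):

```python
# Under Zipf's law, frequency(rank) ≈ C / rank**s, so the word at rank 2
# has roughly half the frequency of the word at rank 1 (for C = 1, s = 1).
def zipf_frequency(rank, c=1.0, s=1.0):
    """Expected relative frequency of the word at the given rank."""
    return c / rank ** s

for rank in (1, 2, 3, 10):
    print(rank, zipf_frequency(rank))  # 1.0, 0.5, ~0.33, 0.1
```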

Query Processing - Term-at-a-time

It accumulates scores for documents by processing term lists one at a time
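
A minimal term-at-a-time sketch, assuming a toy index that maps each term to (doc_id, weight) postings:

```python
from collections import defaultdict

# Toy postings (illustrative): term -> list of (doc_id, term_weight) pairs.
index = {
    "cat": [(1, 2.0), (3, 1.0)],
    "hat": [(1, 1.0), (2, 3.0), (3, 1.0)],
}

def term_at_a_time(query_terms, index):
    """Accumulate partial document scores one term list at a time."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Rank documents by their accumulated scores.
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

print(term_at_a_time(["cat", "hat"], index))  # [(1, 3.0), (2, 3.0), (3, 2.0)]
```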

Query Processing - Document-at-a-time

It calculates complete scores for documents by processing all term lists, one document at a time
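
A minimal document-at-a-time sketch over the same kind of toy index (postings sorted by doc id), keeping only the top-k documents in a priority queue:

```python
import heapq

# Toy postings (illustrative), sorted by doc_id within each list.
index = {
    "cat": [(1, 2.0), (3, 1.0)],
    "hat": [(1, 1.0), (2, 3.0), (3, 1.0)],
}

def document_at_a_time(query_terms, index, k=10):
    """Compute each document's complete score before moving to the next one."""
    lists = [index.get(t, []) for t in query_terms]
    positions = [0] * len(lists)
    top_k = []  # min-heap of (score, doc_id); holds at most k entries
    while True:
        # Next unprocessed document = smallest doc_id across all list cursors.
        candidates = [lst[p][0] for lst, p in zip(lists, positions) if p < len(lst)]
        if not candidates:
            break
        doc_id = min(candidates)
        score = 0.0
        for i, lst in enumerate(lists):
            if positions[i] < len(lst) and lst[positions[i]][0] == doc_id:
                score += lst[positions[i]][1]
                positions[i] += 1
        heapq.heappush(top_k, (score, doc_id))
        if len(top_k) > k:
            heapq.heappop(top_k)  # drop the weakest candidate
    return sorted(top_k, reverse=True)

print(document_at_a_time(["cat", "hat"], index, k=2))  # [(3.0, 2), (3.0, 1)]
```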

Smoothing

It is a technique for estimating probabilities for missing (or unseen) words: lower (or discount) the probability estimates for words that are seen in the document text, and assign that "left-over" probability to the estimates for the words that are not seen in the text.
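
A minimal sketch of one common smoothing method, Jelinek-Mercer interpolation (the mixing weight and the toy counts are assumptions for illustration):

```python
# Mix the document model with the collection model so that words unseen in
# the document still receive a non-zero probability.
def smoothed_prob(word, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
    p_doc = doc_counts.get(word, 0) / doc_len      # maximum-likelihood estimate
    p_coll = coll_counts.get(word, 0) / coll_len   # background (collection) model
    return (1 - lam) * p_doc + lam * p_coll

doc = {"cat": 2, "hat": 1}
coll = {"cat": 10, "hat": 5, "mat": 5}
print(smoothed_prob("mat", doc, 3, coll, 20))      # 0.125: unseen in the doc, but > 0
```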

idf Using Combiners - Discussion - Intro Disadvantage:

It is inefficient: a high number of dummy counters, i.e., unnecessary disk reads during the reduce phase

Query Processing - What is the threshold?

It is the maximum possible score of unseen documents

Use VSM for ranking - The first step

It is to build query and document vectors

Stemming

It means normalizing lexical variations of words that have very similar or identical meanings. For example, computer, computing, and computation can all be normalized to "compute".
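
A short sketch using NLTK's Porter stemmer (NLTK is assumed to be installed; note that a real stemmer typically produces the truncated stem "comput" rather than the dictionary form "compute"):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ("computer", "computing", "computation"):
    # All three variants collapse to the same stem.
    print(word, "->", stemmer.stem(word))
```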

Zipf's law

It says that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions

Which of the following metric(s) emphasize(s) recall:

MAP (Mean Average Precision)

idf Using Combiners - Discussion - Mappers - Alternative?

Most efficient: mappers emit "1" for each term, and combiners then sum the dummy counters before the reduce phase. No dynamic memory allocation; reduces the amount of disk reads.
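
A minimal sketch of the mapper-plus-combiner idea outside any real MapReduce framework (the function names and toy document are assumptions):

```python
from collections import Counter

def mapper(doc):
    # Emit ("term", 1) for every token; no per-document dictionary is built,
    # so there is no dynamic memory allocation in the mapper itself.
    for term in doc.split():
        yield term, 1

def combiner(pairs):
    # Runs on the mapper's local output: sum the dummy "1" counters so fewer
    # intermediate records have to be shuffled and read by the reducers.
    combined = Counter()
    for term, count in pairs:
        combined[term] += count
    return list(combined.items())

print(combiner(mapper("to be or not to be")))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```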

idf Using Combiners - Discussion - Intro

Most term count examples emit "1" for each term

Query Processing - Skipping

Search involves comparing inverted lists of different lengths: - this can be very inefficient, - "skipping" ahead to check document numbers is much better, - compression makes this difficult
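
A minimal sketch of skipping while intersecting a short and a long posting list (both sorted by doc id; the fixed stride is an illustrative assumption, whereas real indexes store explicit skip pointers):

```python
def intersect_with_skips(short, long, skip=4):
    """Jump ahead in the long list in strides instead of one posting at a time."""
    result, j = [], 0
    for doc_id in short:
        # Skip ahead in strides while the skip target is still too small.
        while j + skip < len(long) and long[j + skip] < doc_id:
            j += skip
        # Finish with a linear scan inside the current stride.
        while j < len(long) and long[j] < doc_id:
            j += 1
        if j < len(long) and long[j] == doc_id:
            result.append(doc_id)
    return result

print(intersect_with_skips([3, 17], list(range(1, 31))))  # [3, 17]
```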

VSM - Advantages

Simple computational framework for ranking; any similarity measure or term-weighting scheme can be used

Fagin's Algorithm - Pros and Cons

TA accesses fewer objects than FA; TA stops at least as early as FA; TA performs more random accesses; FA uses unbounded buffer space; FA only performs random access at the end

Difference between Document-at-a-time vs Term-at-a-time

Term-at-a-time uses more memory for accumulators but accesses disk more efficiently. Document-at-a-time needs little memory (a priority queue of the top documents) but reads much of the inverted index.

Vector Space (VS)

The basic idea of the _____ _____ _____ model is to represent both a document and a query as a vector in a high-dimensional space where each dimension corresponds to a term.

Query Processing - Optimizations

Two classes of optimization: 1. Read less data from inverted lists (e.g., skip lists); better for simple feature functions. 2. Calculate scores for fewer documents (e.g., conjunctive processing); better for complex feature functions.

Stemming, vocabulary

______ reduces the ______ size.

Stopword removal

_______ ______ will decrease recall.

stop word

A _____ ______ is a word that usually doesn't reflect the content of a document where it occurs; for example, functional words of English are generally stop words.

What are three key assumptions we have made when using classical IR evaluation methods?

1. Search result relevance serves as a good proxy of user satisfaction. 2. Users will sequentially browse the search results. 3. The relevance quality of the search results will be independently evaluated.

What are the three key heuristics commonly shared by vector space models, BM25 and language models?

1. TF, 2. IDF, 3. Document length normalization

What are the three basic assumptions in classical IR evaluation?

1. A query is a proxy of the user's information need, 2. documents' relevance is independent of each other, 3. users browse sequentially from top to bottom

VSM - Disadvantages

Assumption of term independence

Vector Space Model

Documents and queries are represented by vectors of term weights, and the collection is represented by a matrix of term weights. Query and document are represented as term vectors with tf-idf weighting.
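
A minimal sketch of tf-idf weighting plus cosine ranking over a toy collection (the idf variant and the example documents are assumptions):

```python
import math
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

def tf_idf_vector(text, docs):
    """Build a sparse tf-idf vector (dict term -> weight) for one text."""
    tf = Counter(text.split())
    n = len(docs)
    vec = {}
    for term, freq in tf.items():
        df = sum(1 for d in docs if term in d.split())
        idf = math.log((n + 1) / (df + 1)) + 1  # smoothed idf variant (assumed)
        vec[term] = freq * idf
    return vec

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

query_vec = tf_idf_vector("cat on mat", docs)
for d in docs:
    print(round(cosine(query_vec, tf_idf_vector(d, docs)), 3), d)
```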

Boolean Retrieval - disadvantages

Effectiveness depends entirely on the user - Simple queries usually don't work well - Complex queries are difficult - No ranking

Query Processing -Threshold - Fagin's Algorithm

Fagin's Algorithm is similar to TA and NRA, but: read the lists until k objects have had a value read for all terms, use random access to look up the missing scores, and return the top-k objects
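
A minimal sketch of Fagin's Algorithm over score-sorted posting lists (the toy lists and helper names are assumptions):

```python
def fagins_algorithm(lists, k=2):
    """Sketch of FA: sorted access until k docs have been seen in every list,
    then random access to complete the scores of all seen documents."""
    lookup = [dict(lst) for lst in lists]   # per-list maps for random access
    seen_in = {}                            # doc_id -> number of lists it was seen in
    max_depth = max(len(lst) for lst in lists)
    depth = 0
    # Phase 1: sorted access, round-robin over the lists.
    while depth < max_depth and sum(c == len(lists) for c in seen_in.values()) < k:
        for lst in lists:
            if depth < len(lst):
                doc_id, _ = lst[depth]
                seen_in[doc_id] = seen_in.get(doc_id, 0) + 1
        depth += 1
    # Phase 2: random access to fill in every seen document's missing scores.
    scored = [(sum(m.get(d, 0.0) for m in lookup), d) for d in seen_in]
    return sorted(scored, reverse=True)[:k]

lists = [
    [(1, 0.9), (2, 0.8), (3, 0.1)],   # postings for term A, sorted by score
    [(2, 0.7), (1, 0.3), (3, 0.2)],   # postings for term B, sorted by score
]
print(fagins_algorithm(lists, k=1))   # [(1.5, 2)]: doc 2 has total score 1.5
```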

Vector space model is equivalent to the Bag-of-Word model

False because the bag-of-words model is just a special case of the vector space model, where we treat individual words/N-grams as the bases.

The assumption of words are "independent and identically distributed (i.i.d.)" in documents is the foundation of statistic language models.

False because language models do not make this assumption; it is only made to facilitate parameter estimation in language models.

Zipf's Law predicts that tail words take major portion of vocabulary and are usually semantically meaningless.

False because tail words indeed take up a major portion of the vocabulary, but they are not necessarily semantically meaningless.

The goal of retrieval models we have learnt is to improve some specific IR evaluation metrics, such as NDCG and MAP

False because the goal of any retrieval model is to help users fulfill their information need.

We can directly get the number of unique terms in a particular document from an inverted index.

False because an inverted index is a mapping from words to posting lists, so it is non-trivial to get the number of unique words in a single document. We would have to traverse the whole index for this purpose, which is clearly very expensive.

We prefer cosine similarity over Euclidean distance in vector space models because the former is computationally more efficient.

False because cosine similarity normalizes for document length.

Words with high DF are more discriminative than those with low DF

False because low DF words are usually more discriminative as they only appear in a handful of documents.

We do not use a database system to solve information retrieval problems mostly because of efficiency concern.

False because the major concern is that a database system cannot deal with unstructured text content.

Term independence is a basic assumption in all retrieval algorithms we have learned.

False because none of the retrieval algorithms we have learned assumes term independence; we make such an assumption only to reduce computational complexity.

idf Using Combiners - Discussion - Mappers

Most term count examples emit "1" for each term. Inefficient: a high number of dummy counters, i.e., unnecessary disk reads during the reduce phase. Mappers could use an associative array to sum all terms in each document processed. Careful: dynamic memory allocation may cause out-of-memory errors. Alternative?

Query Processing Skipping - Pros and Cons?

Must also consider random vs. sequential hard disk speed.

Query Processing -Threshold Methods (i.e. Fagin's Algorithm)

Only read a subset of the posting lists to compute the top-k elements with the highest scores. Variants for document-at-a-time query processing: - Posting lists sorted by doc id: the WAND algorithm skips, in each list, ranges of doc ids with no chance of being in the top-k. - Posting lists sorted by score: the Threshold Algorithm (TA).
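
A minimal sketch of the Threshold Algorithm (TA) over score-sorted posting lists (the toy data and helper names are assumptions):

```python
import heapq

def threshold_algorithm(lists, k=2):
    """Read the lists in parallel, complete each seen document's score by
    random access, and stop once the top-k beat the threshold (the sum of
    the scores at the current read positions)."""
    lookup = [dict(lst) for lst in lists]   # per-list maps for random access
    seen, top_k = set(), []                 # top_k is a min-heap of (score, doc_id)
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 0.0
        for lst, ra in zip(lists, lookup):
            if depth >= len(lst):
                continue
            doc_id, score = lst[depth]
            threshold += score
            if doc_id not in seen:
                seen.add(doc_id)
                total = sum(m.get(doc_id, 0.0) for m in lookup)  # random access
                heapq.heappush(top_k, (total, doc_id))
                if len(top_k) > k:
                    heapq.heappop(top_k)
        # Stop early: no unseen document can beat the current top-k.
        if len(top_k) == k and min(top_k)[0] >= threshold:
            break
    return sorted(top_k, reverse=True)

lists = [
    [(1, 0.9), (2, 0.8), (3, 0.1)],   # postings for term A, sorted by score
    [(2, 0.7), (1, 0.3), (3, 0.2)],   # postings for term B, sorted by score
]
print(threshold_algorithm(lists, k=1))   # [(1.5, 2)]: doc 2 has total score 1.5
```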

Discuss: Relationship between Precision, Recall and F-1 measure

Precision is used when the probability that a positive result is correct is important; recall is used when it is important not to miss relevant documents. The F measure is the harmonic mean of recall and precision. A false negative (Type II error) is a relevant document that is not retrieved; the false-negative rate is 1 - recall.
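
A minimal worked example of the three measures (the counts are assumptions):

```python
def precision_recall_f1(relevant_retrieved, retrieved, relevant):
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# 8 of the 10 retrieved documents are relevant; 20 relevant documents exist.
print(precision_recall_f1(8, 10, 20))  # (0.8, 0.4, ~0.533)
```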

Caching

Query distributions are similar to Zipf's - about half of each day's queries are unique, but some are very popular

Unigram language model

A probability distribution over the words in a language; generation of text consists of pulling words out of a "bucket" according to the probability distribution and replacing them
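
A minimal sketch of estimating a unigram model from a toy text and sampling from it (the text is an assumption):

```python
import random
from collections import Counter

# Estimate a word distribution from text, then "generate" by drawing words
# independently with replacement according to that distribution.
text = "the cat sat on the mat the cat ran".split()
counts = Counter(text)
words, weights = zip(*counts.items())
probs = [c / len(text) for c in weights]

print(dict(zip(words, probs)))            # the estimated unigram model
print(random.choices(words, probs, k=5))  # sample 5 words i.i.d.
```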

N-gram language model

Some applications use bigram and trigram language models, where probabilities depend on the previous words.

