INF 141 / CS 121 Information Retrieval Quiz 4 W18, CS 121 - Quiz 4, Inf 141: Quiz 4, Quiz 3 CS 121, Inf 141: Quiz 3, (CS 121) Quiz #2 Review, CS 121 Final
Vector Space Similarity Cosine Similarity
- cos(q, d) is the cosine of the angle between query vector q and document vector d - For length-normalized vectors, cosine similarity is simply the dot product (or scalar product) - q_i is the tf-idf weight of term i in the query - d_i is the tf-idf weight of term i in the document - All vectors are normalized => unit vectors - cos(theta) = q . d - Rank docs in decreasing order of cos(theta)
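A minimal sketch of this scoring (not from the lecture; the function name and the sparse dict representation are illustrative assumptions). For pre-normalized unit vectors the two norms are 1 and only the dot product remains:

```python
import math

def cosine(q, d):
    # q, d: dicts mapping term -> tf-idf weight (hypothetical representation)
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0
```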
Term Frequency tf calculation and limitations
- Definition: the term frequency tf(t,d) of term t in document d is the number of times that t occurs in d - The term frequency of a query (a set of terms t1..tn) is tf(t1+...+tn, d) = sum over i=1..n of tf(ti, d) - Raw term frequency is not what we want - a doc with 10 occurrences of a term is more relevant than a doc with 1 occurrence, but not 10 times more relevant - Relevance does not increase proportionally with term frequency
Euclidean distance
- Distance between two points, i.e., between the end points of the two vectors: D(a,b) = sqrt(sum over i=1..n of (b_i - a_i)^2)
What is a very standard weighting scheme for tf, df and normalization
- Documents: lnc - logarithmic tf ('l' as first character), no idf ('n' as second character) and cosine normalization ('c' as last character) - Query: ltc - logarithmic tf ('l' as first character), idf ('t' as second character) and cosine normalization ('c' as last character)
Term Frequency Term-document incidence matrix
- Each document is represented by a binary vector ∈ {0,1}^|V| - e.g., the column for Hamlet:
Term     Hamlet
Antony     0
Brutus     1
Caesar     1
Why Indexing? Term-document Incidence Matrix
- Each cell is 1 if the document contains the word, 0 otherwise - To answer a query, take the vector of each query term (complementing it for a NOT term) and AND them together, e.g., for BRUTUS AND CAESAR AND NOT CALPURNIA: - 1 1 0 1 0 0 AND - 1 1 0 1 1 1 AND - 1 0 1 1 1 1 = - 1 0 0 1 0 0
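A tiny sketch of this AND, treating each term's row of the incidence matrix as a bit vector (Python ints as 6-bit vectors; the bit patterns are the ones above):

```python
brutus    = 0b110100   # incidence-matrix row for BRUTUS
caesar    = 0b110111   # row for CAESAR
calpurnia = 0b010000   # row for CALPURNIA; its complement is 101111

mask = (1 << 6) - 1    # limit the complement to the 6 documents
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))   # -> 100100
```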
Introduction to efficient Scoring
- Efficient Ranking: find the K docs in the collection "nearest" to the query -> the K largest query-doc cosines - Two aspects to consider for efficient ranking: 1. Choosing the K largest cosine values efficiently - can we do this without sorting all N scores? 2. Computing cosine values efficiently - can we do this without computing all N cosines?
Ranked Retrieval Introduction Boolean Limitations
- Feast or Famine: Boolean queries often result in either too few (=0) or too many (1000s) results - For instance: - Query 1: "standard user dlink 650" -> 200,000 hits - It takes a lot of skill to come up with a query that produces a manageable number of hits - "AND" gives too few results - "OR" gives too many - Good for expert users with a precise understanding of their needs and of the collection (and for applications, which can easily analyze 1000s of results) - Not good for the majority of users - Users do not want to (or know how to) write Boolean queries - They don't want to wade through 1000s of results
Computing Cosines Efficiently Techniques: Generic Approach
- Find a set A of contenders, with K < |A| << N - A does not necessarily contain all of the top K, but should contain many docs from among the top K - Return the top K docs in A - Think of A as pruning non-contenders from the N docs
Vector Space Similarity Euclidean distance
- First cut: distance between two points, i.e., between the end points of the two vectors - Euclidean distance is a bad idea because it is large for vectors of different lengths
Collection vs Document frequency Example
word         collection freq   doc freq
insurance         10440          3997
try               10442          8760
Which word is a better search term (and should get a higher weight)?
- In general, we will consider the document frequency (df) a better measure than collection frequency for ranked retrieval
- Consider the following big collection - N = 1 million documents - each with about 1000 words - M = 500K terms (unique words) - Avg 8 bytes/word including spaces/punctuation
- Incidence Matrix would have half-a-trillion 0's and 1's - Most of the entries (99.8%) are 0 - Making the matrix very sparse
Scoring Documents Jaccard Coefficient: limitations
- It doesn't consider term frequency - Rare terms in a collection are more informative than frequent terms - Does not consider stop words - Need a more sophisticated way of normalizing for length - For length normalization, use a different scheme (|A ∩ B|)/(sqrt(|A ∪ B|))
Scoring Documents Jaccard Coefficient
- Jaccard (Jc) is a commonly used measure of the overlap of two sets A and B - Jc(A,B) = |A ∩ B| / |A ∪ B| - Jc(A,A) = 1 - Jc(A,B) = 0 if A ∩ B = ∅ - A and B don't have to be the same size - Always assigns a number between 0 and 1
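A one-function sketch of the coefficient as defined above (illustrative, not course-provided code); the printed value matches the ides-of-march example later in these notes:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    # Jc(A,A) = 1; two empty sets also give 1 by this convention
    return len(a & b) / len(union) if union else 1.0

print(jaccard("ides of march".split(), "caesar died in march".split()))  # 1/6
```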
Vector Space Model Queries as vectors
- Key idea 1: represent queries as vectors in the space, as if they were documents as well - Key idea 2: rank documents according to their proximity to the query in the space - Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model - Instead: rank more relevant documents higher than less relevant documents
Boolean Retrieval Model Query Processing: OR, NOT
- The merge can be adapted for: - BRUTUS OR CAESAR: all docIDs from both lists, without duplicates - NOT BRUTUS: the gaps in the BRUTUS list (all docIDs not in its postings)
Static Quality Scores: Modeling authority
- Modeling authority: - Assign to each document d a query-independent QUALITY SCORE in [0,1], denoted g(d) - Thus, a quantity like the number of citations is scaled into [0,1] - Net score: - Consider a simple total score combining cosine relevance and authority: net-score(q,d) = g(d) + cosine(q,d) - Now we seek the top K docs by net score
Boolean Retrieval Model Limitations
- No spell checking - No index to capture positional information in docs - Proximity: find GATES NEAR MICROSOFT - What about phrases? Stanford University - Zones in documents: find documents with (author = Ullman) AND (text contains automata) - Does not consider term frequency information in docs - We store 1 vs 0 occurrences of a search term, not 3 vs 2 occurrences, etc. - Usually more occurrences seem better (for relevance/ranking)
Why Indexing? Bigger collections: Inverted Index
- For each term t, record only the "1" positions, i.e., the docs containing t - Identify each document by a unique docID (document serial number) - The list of docIDs can be an array or a linked list
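A minimal sketch of building such an index from raw text; the helper name and whitespace tokenization are illustrative assumptions:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: dict docID -> document text
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # toy tokenizer
            index[term].add(doc_id)
    # store postings as docID-sorted lists, as the card describes
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({1: "Brutus killed Caesar", 2: "Caesar died"})
print(index["caesar"])   # [1, 2]
```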
Document Frequency
- Rare terms are more informative than frequent terms - consider a term in the query that is rare in the collection (high weight for rare terms) - Frequent terms are less informative than rare terms - consider a query term that is frequent in the collection (low weight for frequent terms) - We use document frequency (df) to capture this
Ranked Retrieval Introduction Advantages of Ranked Retrieval
- Rather than a set of documents satisfying a query expression, in RANKED RETRIEVAL the system returns an ordering over the (top) documents in the collection for a query - Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language - Ranked retrieval has normally been associated with free text queries and vice versa - "Feast or Famine" is not an issue: since the system produces a ranked result set, large result sets are not a problem - just show the top K (e.g., 5-10) results and do not overwhelm the user - Works better than Boolean models for most users
Semi-Structured Data facilitates queries such as
- Retrieve documents in which Title contains fish AND Bullets contain food
Choosing K largest Cosines Goal
- Selection vs. sorting - Typically we want to retrieve the top K docs in the cosine ranking for the query - not to totally sort all docs in the collection - Selecting the docs with the K highest cosines: - Let J be the number of docs with non-zero cosines - We seek the K best of these J
Why is doing a grep not a good solution to Unstructured Data?
- Slow (for a large corpus) - Other operations (e.g., find the word ROMANS near COUNTRYMEN) would require a new grep
Computing Cosines Efficiently Unweighted query terms
- Special-case simplification: no weighting on query terms - Assume each query term occurs only once - Then, for ranking, we do not need to consider term frequency in the query - Actually, we do not need to normalize the query vector either; then w(t,q) = 1
Cosine Similarity Example How similar are the following three novels? SaS: Sense and Sensibility PaP: Pride and Prejudice WH: Wuthering Heights
- Step 1: build the term frequency (counts) matrix - Step 2: apply log frequency weighting - Step 3: apply length normalization - Step 4: take dot products of the resulting unit vectors - these are the similarity scores
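The counts table from this card did not survive, so this sketch uses the term counts (affection, jealous, gossip, wuthering) from the standard textbook version of the example; treat those numbers as assumptions:

```python
import numpy as np

counts = {  # raw tf for (affection, jealous, gossip, wuthering)
    "SaS": np.array([115, 10, 2, 0]),
    "PaP": np.array([58, 7, 0, 0]),
    "WH":  np.array([20, 11, 6, 38]),
}

def log_weight(tf):
    w = np.zeros(len(tf))
    nz = tf > 0
    w[nz] = 1 + np.log10(tf[nz])        # Step 2: log frequency weighting
    return w

unit = {}
for name, tf in counts.items():
    w = log_weight(tf)
    unit[name] = w / np.linalg.norm(w)  # Step 3: length normalization

print(unit["SaS"] @ unit["PaP"])        # Step 4: dot product, ~0.94
print(unit["SaS"] @ unit["WH"])         # ~0.79
```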
Computing Cosines Efficiently: 1. Index Elimination
- Technique B: only consider docs containing many of the query terms - Any doc with at least one query term is a candidate; here we compute scores only for docs containing several of the query terms - say, at least 3 out of 4 - This imposes a "soft conjunction" on queries, as seen in web search engines (e.g., Google) - Easy to implement in postings traversal
Boolean Retrieval Model
- The Boolean Retrieval model answers queries based on Boolean expressions (AND, OR, and NOT) - Views each document as a set of terms - Precise! A document either matches the condition or it does not - Primary retrieval tool for 3 decades - Many professional searchers (e.g., lawyers) prefer the Boolean model
Document Frequency Collection vs. Document Frequency
- The collection frequency (cf) of t is the number of occurrences of t in the collection, counting multiple occurrences per document.
Why is Euclidean distance a bad idea?
- The distance between q and d2 is large even though the distributions of terms in the query q and in d2 are very similar
Vector Space Similarity From angles to cosines
- The following two notions are equivalent: - Rank documents in increasing order of the angle between query and document - Rank documents in decreasing order of cosine(query, document) - Cosine is a monotonically decreasing function on the interval [0°, 180°]
Term Frequency Log-Frequency Weighting
- The log frequency weight of term t in d is: w(t,d) = 1 + log10 tf(t,d) if tf(t,d) > 0, and 0 otherwise - Score for a document-query pair: Score(q,d) = sum over t in q∩d of (1 + log10 tf(t,d)) - The score is 0 if none of the query terms is present in the document
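A direct transcription of this scoring rule (a sketch; doc_tf mapping terms to raw counts is an assumed representation):

```python
import math

def score(query_terms, doc_tf):
    # Score(q,d) = sum over t in q ∩ d of (1 + log10 tf(t,d))
    return sum(1 + math.log10(doc_tf[t])
               for t in set(query_terms) if doc_tf.get(t, 0) > 0)

print(score(["best", "car", "insurance"], {"car": 2, "insurance": 1}))  # ~2.301
```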
Weighting Schemes tf-idf weighting
- The tf-idf weight of a term is the product of its tf weight and its idf weight: w(t,d) = [1 + log10 tf(t,d)] x log10(N/df(t)) - Properties: - Increases with the number of occurrences within a doc - Increases with the rarity of the term in the collection - Best known weighting scheme in information retrieval - Note: the "-" in tf-idf is a hyphen, not a minus sign! - Alternative names: tf.idf, tf x idf
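The weight as a function, mirroring the formula above (a sketch; the function name is illustrative):

```python
import math

def tf_idf(tf, df, N):
    # w(t,d) = (1 + log10 tf(t,d)) * log10(N / df(t)); 0 if the term is absent
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

print(tf_idf(tf=10, df=1000, N=1_000_000))   # 2 * 3 = 6.0
```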
Vector Space Model Documents as Vectors
- The weight matrix defines a |V|-dimensional vector space where: - Terms are the axes of the space - Documents are points in the space - Documents can be represented as vectors - Properties: - Very high-dimensional: tens of millions of dimensions when applied to a web search engine - Results in very sparse vectors - most entries are zero
Vector Space Similarity Angles for Similarity
- Thought experiment: take a document vector d and append the document to itself; call the resulting vector d' - "Semantically", d and d' have the same content - The Euclidean distance between the two docs can be quite large - But the angle between the two docs is 0, corresponding to maximal similarity - Key idea: rank documents according to their angle with the query
Boolean Retrieval Model Query Processing: AND
- To process BRUTUS AND CAESAR: - Locate BRUTUS in the dictionary (retrieve postings) - Locate CAESAR in the dictionary (retrieve postings) - Merge the postings - Walk through the two postings lists simultaneously - If the list lengths are x and y, the merge takes O(x+y) operations - CRUCIAL: postings must be sorted by docID
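The two-pointer merge in Python, a sketch of the O(x+y) walk described above; the sample postings are the BRUTUS and CAESAR lists used later in these notes:

```python
def intersect(p1, p2):
    # p1, p2: postings lists sorted by docID
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 16, 21, 34]))  # [2, 8, 16]
```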
Why do we index?
- Unstructured Data - Term-document incidence - Inverted Index
Choosing K largest Cosines Heap Data Structure
- Use a heap for selecting the top K - Heap = binary tree in which each parent node's value > the values of its children (max-heap) - A heap of J elements can be built in O(J) steps (linear in J) - Deleting K elements from the heap takes O(K log J) steps - So finding the K "winners" using a heap takes O(J + K log J) - E.g., for J = 1M and K = 100, this is about 10% of the cost of sorting J elements
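A sketch matching the card's bound: heapify all J scores in O(J), then pop K winners in O(K log J). Python's heapq is a min-heap, so scores are negated:

```python
import heapq

def top_k(scores, k):
    # scores: dict docID -> cosine, with J non-zero entries
    heap = [(-score, doc) for doc, score in scores.items()]
    heapq.heapify(heap)                      # O(J)
    winners = []
    for _ in range(min(k, len(heap))):       # K pops, O(log J) each
        neg, doc = heapq.heappop(heap)
        winners.append((doc, -neg))
    return winners

print(top_k({"d1": 0.9, "d2": 0.3, "d3": 0.7}, 2))   # [('d1', 0.9), ('d3', 0.7)]
```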
Term Frequency Bag of Words model
- The vector representation doesn't consider the ordering of words in a document - Example: the following sentences have the same vectors: - John is quicker than Mary - Mary is quicker than John - This is called the bag of words model - This model does not track positional information
Scoring Documents Query-Document Matching
- We need a way of assigning a score to each query/document pair - Example: let's start with a one-term query and a score in [0,1]: - If the query term does not occur in the document => score = 0 - The more frequent the query term in the document, the higher the score (closer to 1 in this example)
Computing Cosines Efficiently: 3. Static Quality Scores
- We want top-ranking documents to be both RELEVANT and AUTHORITATIVE - RELEVANCE is modeled by cosine scores - AUTHORITY is a query-independent property of a document - Examples of authority signals: - Wikipedia among websites - Articles in certain newspapers - A paper with many citations - "Likes", "re-tweets" or "hits" ---> PageRank: a number indicating the authority of a site
Document Frequency idf weight
- df(t) is the document frequency of t, or the number of documents that contain t, or the length of the posting list of t - df(t) is an inverse measure of the informativeness of t - df(t) <= N, where N is total number of documents in the corpus - we define the idf (inverse document frequency) of t by idf(t) = log10 (N/df(t)) - We use log (N/df(t)) instead of N/df(t) to "dampen" the effect of idf.
Document Frequency Effect of idf on ranking
- idf has no effect on ranking one term queries - idf affects the ranking of documents for queries with at least two terms - For the query CAPRICIOUS PERSON, idf weighting makes occurrences of CAPRICIOUS count for much more in the final document ranking than occurrences of PERSON
How to Tell if Users are Happy
- results are relevant - many clicks - many purchases - repeat buyers
Weighting tf, df and normalization
- tf-idf weighting has many variants - Columns headed 'n' in the weighting table are acronyms for weighting schemes - Many search engines allow different weightings for queries vs. documents - SMART notation: denotes the combination in use in an engine with the notation ddd.qqq, using the acronyms from the previous table
Static Quality Scores: Top K by net score - fast methods
- Top K by net score - fast methods: - First idea: order all postings by g(d) - This gives a common ordering for all postings - Concurrently traverse the query terms' postings for: - Postings intersection - Cosine score computation - Advantages: - Under g(d)-ordering, top-scoring docs are likely to appear early in the postings traversal - In time-bound applications, this allows us to stop the postings traversal early (traverse doc-at-a-time)
Computing Cosines Efficiently: Techniques
1. Index Elimination 2. Champion Lists 3. Static Quality Scores 4. High and low lists 5. Impact-ordered Postings 6. Cluster Pruning
Vector Space Ranking Summary
1. Represent the query as weighted tf-idf vector 2. Represent each document as weighted tf-idf vector 3. Compute the cosine similarity score for the query vector and each document vector 4. Rank documents with respect to the query by score 5. Return the top K (e.g. K = 10) to the user
In lecture, we saw that metadata about documents is important. Why? - It allows to identify the field/zone where a specific term is found in a document. - It can be used to define Parametric Indexes for a more advanced search. - It is needed to encode zones in dictionary or postings for Parametric Indexes. - All of the statements are good reasons to justify the importance of metadata.
All of the statements are good reasons to justify the importance of metadata.
Index Elimination Example: - Query: Antony Brutus Caesar Calpurnia - We only compute scores for docs containing at least 3 of the 4 query terms - Scores are therefore only computed for docs 8, 16, and 32
Antony    -> 3 4 [8] [16] [32] 64 128
Brutus    -> 2 4 [8] [16] [32] 64 128
Caesar    -> 1 2 3 5 [8] 13 21 34
Calpurnia -> 3 [16] [32]
Why would you implement Tiered Indexes in your search engine? - Because it is a requirement of the course assignment. - Because that allows to skip less important documents in your collection. - Because that allows a more efficient scoring, starting with important documents first. - There is no reason to implement Tiered Indexes.
Because that allows a more efficient scoring, starting with important documents first.
Grep
Brute-force linear scan
Mark the *false* statement with regards to Hubs and Authorities? - A good hub page for a topic points to many authoritative pages for that topic - A good authority page for a topic is pointed to by many good hubs for that topic. - By using hub pages, we still cannot retrieve documents in other languages. - Using hubs is best suited for "broad topic" queries rather than for page-finding queries.
By using hub pages, we still cannot retrieve documents in other languages.
Corpus
Collection
B
Cosine similarity captures the geometric proximity in terms of A. the Euclidean distance between the points B. the angle between the vectors
What would be the smallest window to calculate the proximity score of the query "information retrieval" in the document "The retrieval of too much information is bad"? - Zero words, because the terms don't occur in the same order. - Four words. - Three words. - Five words.
Five words.
C
In Boolean retrieval, a query that ANDs three terms results in having to intersect three lists of postings. Assume the three lists are of size n, m, q, respectively, each being very large. Furthermore, assume that each of the three lists are unsorted. In terms of complexity, will your intersection algorithm benefit, or not, from sorting each list before merging them, with respect to merging the 3 unsorted lists? A. Sorting first will make it worse than merging the unsorted lists B. It will be the same C. It will benefit from sorting first
C
In a Web search engine, the Text Acquisition system consists of: A. Desktop crawlers B. Scanning documents C. Web crawlers
B
In ranked retrieval of multi-term queries, the scores for the query can be computed in either document-at-a-time or term-at-a-time. The following two sentences can be both true, both false, or one true and one false. Mark all true sentences: A. When using term-at-a-time, the final score of the document di is calculated before any calculations are performed for di+1 B. When using document-at-a-time, the final score of the document di is calculated before any calculations are performed for di+1
B
In the vector space model for information retrieval, the dimensions in the multi-dimensional space are A. the documents B. the terms
Based on the diagram of IR components that we saw in lecture, where would you place the parameters to fine-tune your query-doc scoring algorithm? - Within the linguistic modules. - It is an external input to apply to indexes and query. - Before the spell-check of the query. - Within the indexes.
It is an external input to apply to indexes and query.
Scoring Documents Jaccard coefficient: example - Find the Jaccard coefficient for the following documents - Query: ides of march (set Q) - Doc 1: Caesar died in march (set D1) - Doc 2: the long march (set D2) What is Jc(Q,D1) = |Q ∩ D1| / |Q ∪ D1|? What is Jc(Q,D2) = |Q ∩ D2| / |Q ∪ D2|?
Jc(Q,D1) = 1/6 Jc(Q,D2) = 1/5
Mark the *false* statement with regards to the Mean Average Precision (MAP)? - MAP is macro-averaging: each query counts equally. - MAP considers multi-level of relevance in documents, as opposite to binary relevance. - Most commonly used measure in scientific research. - MAP assumes user is interested in finding many relevant documents for each query.
MAP considers multi-level of relevance in documents, as opposite to binary relevance.
Consider the following figure representing the web as a graph with good (green with vertical-lined pattern) and bad (red with horizontal-lined pattern) pages/nodes in an interlinked structure. Which statement is true regarding the nodes 1-4? - Nodes 1-4 are bad. - Nodes 1-4 are good. - Nodes '1' and '3' are good, and '2' and '4' are bad. - Nodes '1' and '3' are good, '4' is bad, and '2' is unknown.
Nodes '1' and '3' are good, and '2' and '4' are bad.
Consider the following diagram depicting documents (dots) in a collection. The rectangle in the middle represents the documents retrieved for a given query. Mark the *true* statement with regards to its precision (P) and recall (R). - P=4/8 and R= 3/9 - R=3/7 and P= 3/9 - P=4/7 and R= 4/9 - R=4/7 and P= 3/7
P=4/7 and R= 4/9
Binary Assessment
In the F-measure F_beta: beta < 1 emphasizes precision, beta > 1 emphasizes recall, and beta = 1 balances the two (the balanced F1)
false
Ranked retrieval is a model where documents either match (1) or don't match (0) the queries (true or false)
Weighting Schemes tf-idf weighting: Scoring for a document given a query
Score(q,d) = sum over t in q∩d of tf-idf(t,d)
High and Low Lists
Keep two postings lists per term: a "high" list with the best-ranking docs and a "low" list with the rest as backup; consult the low list only if the high lists don't yield enough results
Mark the false statement with regards to the hub score [h(x)] and authority score [a(x)] of a page X when conducting Hyperlink-Induced Topic Search (HITS)? - Highest scores for h( ) and a( ) will define the hubs and authorities respectively. - In the iterative part, both initial values for h(x) and a(x) can be setup to one. - There is no reason to scale down h( ) and a( ) between iterations. - Relative values of scores h( ) and a( ) need few iterations to converge.
There is no reason to scale down h( ) and a( ) between iterations.
B
What is the main problem of using a term-document matrix for searching in large collections of documents? A. It is slow to search the documents B. It is an inefficient use of memory C. It is slow to search the terms
false
When using tf-idf as the ranking score on queries with just one term, the idf component has an effect on the final ranking. (true or false)
B, C, D, and E
Which of the following are examples of work within the Information Retrieval field? A. The design of a relational database for a library B. Classifying books into categories C. Automatically answering customer support questions D. Filtering for documents of interest E. Web search engines
Examples of Authority Signals
Wikipedia among websites Articles in certain newspapers A paper with many citations "Like" or "re-tweet" marks PageRank
Gain
a document's relevance
Zones
a region of the doc that can contain an arbitrary amount of text; unlike fields (title, author, etc.), a zone can be arbitrarily long
Consider the following representation of a web as a graph. Mark the *true* statement with regards to the hub score h(N) and the authority score a(N) for each node N after one iteration (iteration 1).
Node   Iter0: h(N) a(N)   Iter1: h(N) a(N)
A          1    1             0    2
B          1    1             1    2
C          1    1             1    1
D          1    1             3    0
- a(A) > a(B), and h(B) = h(C) - a(A) = a(B), and highest h(N) is for N=D - a(C) = a(D), and highest h(N) is for N=B - h(B) > h(A), and highest a(N) is for N=D
a(A) = a(B), and highest h(N) is for N=D
Unweighted Query Terms
assume each query term occurs only once; don't consider term frequency
Tiered Indexes
break indexes into tiers of importance; go down a tier only if the higher tiers don't yield enough results
Champion Lists
precompute, for each term, the r docs with the highest weights (its champion list) and only analyze docs from those lists
Static Quality Scores
combine cosine relevance and authority scores; benefit: with g(d)-ordered postings, traversal can stop early
Impact-Ordered Postings
order postings by term weight (impact) and compute scores only for docs whose weights are high enough, stopping early
Efficient Ranking
find the K docs in the collection "nearest" to the query
Index EL: High-Idf Terms Benefit
many docs get eliminated from the set
Index Elimination
narrow to only the docs with at least one query term
Parametric Indexes
query by metadata info
Cluster Pruning
select sqrt(N) docs at random to be leaders; attach each remaining doc to its nearest leader; at query time, go to the leader nearest the query and traverse only that group's docs
false
tf-idf decreases with the number of occurrences of the terms inside a document (true or false)
true
tf-idf increases with the rarity of the terms in the corpus (true or false)
Computing Cosines
the primary computational bottleneck
Query Term Proximity
the smallest window in which all the query words can be found (the window counts the query words themselves and the words in between)
Term Frequency Term-Document Count Matrix
- Consider the number of occurrences of a term in a document: - Each document is a count vector in N^|V| - e.g., the column for Julius Caesar:
Term     Julius Caesar
Antony        73
Brutus       157
Caesar       227
Brutus[7]    -> 2, 4, 8, 16, 32, 64, 128
Caesar[8]    -> 1, 2, 3, 5, 8, 16, 21, 34
Calpurnia[2] -> 13, 16
What is the most efficient query processing order?
(BRUTUS AND CALPURNIA) AND CAESAR - remember to start with the smallest set and keep cutting further
Term           Doc Freq
eyes            213,312
kaleidoscope     87,009
marmalade       107,913
skies           271,658
tangerine        46,653
trees           316,812
Recommend a query processing order for
(TANGERINE OR TREES) AND (MARMALADE OR SKIES) AND (KALEIDOSCOPE OR EYES) - Estimate each OR by the sum of its doc freqs and process in increasing order of those sums: first (KALEIDOSCOPE OR EYES) = 300,321, then (TANGERINE OR TREES) = 363,465, then (MARMALADE OR SKIES) = 379,571
Boolean Retrieval Model Optimization: OR
- (MADDING OR CROWD) AND (IGNOBLE OR STRIFE) - Get doc freqs for all terms - Estimate the size of each OR by the sum of its doc freqs (a conservative estimate) - Process in increasing order of OR sizes
Vector Space Similarity Length normalization
- A vector can be (length-) normalized by dividing each of its components by its length - Dividing a vector by its length makes it a unit vector - all vectors then lie on the surface of the unit hypersphere - Effect on the two documents d and d' (d appended to itself) from the earlier slide: they have identical vectors after length normalization (long and short docs now have comparable weights)
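A quick numeric illustration of the d vs. d' effect, using raw counts for simplicity (doubling every count models appending d to itself; the vector values are hypothetical):

```python
import numpy as np

d = np.array([3.0, 1.0, 2.0])     # hypothetical term-count vector for d
d_prime = 2 * d                   # d appended to itself

print(np.linalg.norm(d_prime - d))   # Euclidean distance: large (~3.74)
u = d / np.linalg.norm(d)
u_prime = d_prime / np.linalg.norm(d_prime)
print(u @ u_prime)                   # cosine after normalization: 1.0
```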
Computing Cosines Efficiently: 2. Champion Lists
- Approach: 1. Precompute, for each dictionary term t, the r docs of highest weight in t's postings - known as the champion list, fancy list, or top docs for t 2. At query time, only compute scores for docs in the champion list of some query term 3. Pick the K top-scoring docs from amongst these - Note that r has to be chosen at index build time - thus, it is possible that r < K
Vector Space Model Weight Matrix
- Binary -> count -> weight matrix - Each document is now represented by a real-valued vector of tf-idf weights: w(t,d) = [1 + log10 tf(t,d)] x log10(N/df(t)) - e.g., the column for Hamlet:
Term     Hamlet
Antony     2
Brutus     1.51
Caesar     0
Computing Cosines Efficiently Techniques
- Computing cosines is the primary computational bottleneck in scoring - We can avoid this cost, at the price of some precision - Is it a problem if we include a doc that is not in the true top K in our list of K output? - Not really - cosine is only a proxy for what users want - But it is important that the returned K are "close" to the top
Boolean Retrieval Model Optimization: AND
- Consider a query that is an AND of n terms - For each of the n terms, get its postings, then AND them together - Process in order of increasing document frequency: - start with the smallest set, then keep cutting further - (This is why document frequency is kept in the dictionary, separate from the postings)
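A sketch of this optimization, reusing the intersect() merge sketched earlier under Query Processing: AND; sorting postings by length orders the terms by increasing df:

```python
def and_query(terms, index):
    # index: term -> docID-sorted postings list
    if any(t not in index for t in terms):
        return []                          # a missing term empties the AND
    postings = sorted((index[t] for t in terms), key=len)  # increasing df
    result = postings[0]
    for p in postings[1:]:
        if not result:
            break                          # early exit: nothing left to cut
        result = intersect(result, p)      # merge from the earlier sketch
    return result
```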