CS 121 Final Combined Set


Term Frequency Term-document incidence matrix

- Each document is represented by a binary vector ∈ {0,1}^(|V|), a column of the matrix
- Example column for Hamlet: Antony 0, Brutus 1, Caesar 1

Why Indexing? Term-document Incidence Matrix

- Each cell is 1 if the document contains the term, 0 otherwise
- To answer a query, take the vector of each query term (complementing it for NOT) and bitwise AND them
- Example (BRUTUS AND CAESAR AND NOT CALPURNIA):
  - 1 1 0 1 0 0 (Brutus) AND
  - 1 1 0 1 1 1 (Caesar) AND
  - 1 0 1 1 1 1 (NOT Calpurnia)
  - = 1 0 0 1 0 0
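The bitwise AND of incidence vectors can be sketched in a few lines of Python. This assumes the three vectors belong to BRUTUS, CAESAR, and (complemented) CALPURNIA, over six documents; the document set is hypothetical.

```python
# Incidence vectors: one bit per document for each term.
incidence = {
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

def boolean_and(vectors):
    """Bitwise-AND a list of binary vectors, position by position."""
    result = vectors[0]
    for vec in vectors[1:]:
        result = [a & b for a, b in zip(result, vec)]
    return result

# BRUTUS AND CAESAR AND NOT CALPURNIA
complement = [1 - b for b in incidence["calpurnia"]]  # NOT Calpurnia
answer = boolean_and([incidence["brutus"], incidence["caesar"], complement])
print(answer)  # [1, 0, 0, 1, 0, 0]
```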

Ranked Retrieval Introduction Boolean limitations

- Feast or Famine: Boolean queries often produce either too few (=0) or too many (1000s of) results
  - For instance, Query 1: "standard user dlink 650" -> 200,000 hits
- It takes a lot of skill to come up with a query that produces a manageable number of hits
  - AND gives too few results; OR gives too many
- Good for expert users with a precise understanding of their needs and the collection (and for applications, which can easily analyze 1000s of results)
- Not good for the majority of users
  - Users do not want to (or know how to) write Boolean queries
  - They don't want to wade through 1000s of results

- Consider the following big collection:
  - N = 1 million documents, each with about 1,000 words
  - M = 500K terms (unique words)
  - Avg 8 bytes/word including spaces/punctuation

- Incidence Matrix would have half-a-trillion 0's and 1's - Most of the entries (99.8%) are 0 - Making the matrix very sparse

Scoring Documents Jaccard Coefficient: limitations

- It doesn't consider term frequency
- Rare terms in a collection are more informative than frequent terms; Jaccard does not account for this (e.g., stop words)
- Need a more sophisticated way of normalizing for length: use a different scheme, |A ∩ B| / sqrt(|A ∪ B|)

Scoring Documents Jaccard Coefficient

- Jaccard (Jc) is a commonly used measure of the overlap of terms in two sets A and B
- Jc(A,B) = |A ∩ B| / |A ∪ B|
- Jc(A,A) = 1
- Jc(A,B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size
- Always assigns a number between 0 and 1

Boolean Retrieval Model Query Process: OR, NOT

- The merge can be adapted for:
  - BRUTUS OR CAESAR: all docIDs, without duplicates
  - NOT BRUTUS: the gaps in the Brutus list

Boolean Retrieval Model Limitations

- No spell checking
- No index to capture positional information in docs
  - Proximity: Find Gates NEAR Microsoft
  - What about phrases? Stanford University
- Zones in documents: Find documents with (author = Ullman) AND (text contains automata)
- Does not consider term frequency information in docs
  - We store 1 vs. 0 occurrences of a search term, not 3 vs. 2 occurrences, etc.
  - Usually more occurrences seems better (relevance, ranking)

Why Indexing? Bigger collections: Inverted Index

- For each term t, only record the "1" positions
- Identify each document by a unique docID (document serial number)
- The list of docIDs can be an array or a linked list
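A minimal sketch of building such an inverted index in Python; the two sample documents are hypothetical.

```python
from collections import defaultdict

# Toy collection: docID -> text (hypothetical documents).
docs = {
    1: "caesar died in march",
    2: "the long march",
}

# For each term, record the list of docIDs that contain it.
index = defaultdict(list)
for doc_id in sorted(docs):               # visit docs in docID order...
    for term in set(docs[doc_id].split()):
        index[term].append(doc_id)        # ...so postings come out sorted

print(index["march"])  # [1, 2]
```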

Ranked Retrieval Introduction Advantages of Ranked Retrieval

- Rather than a set of documents satisfying a query expression, in RANKED RETRIEVAL the system returns an ordering over the (top) documents in the collection for a query
- Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
- Ranked retrieval has normally been associated with free text queries, and vice versa
- "Feast or Famine" is not an issue: the system produces a ranked result set, so large result sets are not a problem
  - Show the top k (≈ 5-10) results; do not overwhelm the user
- Works better than Boolean models for most users

Semi-Structured Data facilitates queries such as

- Retrieve documents in which Title contains fish AND Bullets contain food

Why is doing a grep not a good solution to Unstructured Data?

- Slow (for large corpus) - Other operations (find the word Romans near countrymen) would require a new grep

Boolean Retrieval Model

- The Boolean Retrieval model answers queries based on Boolean expressions (AND, OR, and NOT)
- Views documents as a set of terms
- Precise: a document either meets the condition or it doesn't
- Primary retrieval tool for 3 decades
- Many professional searchers (e.g., lawyers) prefer the Boolean model

Boolean Retrieval Model Query Processing: AND

- To process BRUTUS AND CAESAR:
  - Locate BRUTUS in the dictionary and retrieve its postings
  - Locate CAESAR in the dictionary and retrieve its postings
  - Merge the postings: walk through the two lists simultaneously
- If the list lengths are x and y, the merge takes O(x+y) operations
- CRUCIAL: postings sorted by docID
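The two-pointer merge can be sketched as follows, using sample postings lists sorted by docID.

```python
def intersect(p1, p2):
    """Merge two postings lists sorted by docID in O(x+y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 16, 21, 34]
print(intersect(brutus, caesar))  # [2, 8, 16]
```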

Why do we index?

- Unstructured Data - Term-document incidence - Inverted Index

Scoring Documents Query-Document Matching

- We need a way of assigning a score to a query/document pair - Example - Let's start with a one-term query and score 0-1: - If the query term does not occur in the document => score = 0 - The more frequent the query term in the document, the higher the score (closer to 1 for this example)

How to Tell if Users are Happy

- results are relevant - many clicks - many purchases - repeating buyers

What is a vector space model?

Algebraic model for representing text documents as vectors of identifiers, such as index terms; used in information filtering, information retrieval, indexing, and relevancy rankings

In lecture, we saw that metadata about documents is important. Why? - It allows identifying the field/zone where a specific term is found in a document. - It can be used to define Parametric Indexes for a more advanced search. - It is needed to encode zones in dictionary or postings for Parametric Indexes. - All of the statements are good reasons to justify the importance of metadata.

All of the statements are good reasons to justify the importance of metadata.


What are some special case simplifications with weighting query terms?

Assume each query term occurs only once, so there is no need to consider term frequency in the query; document frequency is already weighted into the document vectors; and there is no need to normalize

Describe index elimination

Basic cosine computation algorithm that only considers docs containing at least one query term; going further, one can consider only high-idf query terms and/or only docs containing many of the query terms

Why would you implement Tiered Indexes in your search engine? - Because it is a requirement of the course assignment. - Because that allows to skip less important documents in your collection. - Because that allows a more efficient scoring, starting with important documents first. - There is no reason to implement Tiered Indexes.

Because that allows a more efficient scoring, starting with important documents first.


Grep

Brute Linear Scan

Mark the *false* statement with regards to Hubs and Authorities? - A good hub page for a topic points to many authoritative pages for that topic - A good authority page for a topic is pointed to by many good hubs for that topic. - By using hub pages, we still cannot retrieve documents in other languages. - Using hubs is best suited for "broad topic" queries rather than for page-finding queries.

By using hub pages, we still cannot retrieve documents in other languages.

What are two aspects to consider in efficient ranking?

Choosing the K largest cosine values efficiently (can we do this without sorting all N scores?) and computing cosine values efficiently (can we do this without computing all N cosines?)

Corpus

Collection

B

Cosine similarity captures the geometric proximity in terms of A. the Euclidean distance between the points B. the angle between the vectors

How can a vector be length-normalized?

Divide each of its components by its length; makes it a unit vector. Now, long and short documents have comparable weights
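This operation can be sketched as:

```python
import math

def length_normalize(vec):
    """Divide each component by the vector's L2 length -> unit vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

v = [3.0, 4.0]          # length 5
unit = length_normalize(v)
print(unit)             # [0.6, 0.8] -- a unit vector
```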

What is a better measure for ranked retrieval, document frequency or collection frequency?

Document frequency

What is the relationship between relevance and term frequency?

Relevance does not increase proportionally with term frequency

What are some limitations of the Jaccard coefficient?

Doesn't consider term frequency, rare terms in a collection are more informative than frequent terms, and we need a more sophisticated way of normalizing length

Why is measuring vector space similarity with Euclidean distance a bad idea?

Euclidean distance is large for vectors of different lengths even though the distribution of terms in the two documents being compared are very similar
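A small numeric illustration: take a document vector and a copy with every count doubled (same term distribution, different length). The component values here are made up.

```python
import math

d  = [2.0, 1.0, 0.0]   # a document
d2 = [4.0, 2.0, 0.0]   # the same document "concatenated with itself"

euclid = math.dist(d, d2)
cosine = sum(a * b for a, b in zip(d, d2)) / (math.hypot(*d) * math.hypot(*d2))

print(round(euclid, 3))  # 2.236 -- large, despite identical distributions
print(round(cosine, 3))  # 1.0   -- cosine sees them as identical
```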

The F measure is an efficient way to evaluate unranked retrieval. Knowing that F=2PR/(P+R) where P refers to precision and R to recall. Which of the following statements is *true*? - F=2PR/(P+R) results from β = 0. This expression is giving more emphasis on precision. - F=2PR/(P+R) results from β =1. This expression is giving more emphasis on recall. - F=2PR/(P+R) results from β =1. This expression is balancing precision and recall. - F=2PR/(P+R) results from β =0. This expression is giving more emphasis on recall.

F=2PR/(P+R) results from β =1. This expression is balancing precision and recall.
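A sketch of the general F measure, F = (β²+1)PR / (β²P + R), of which F = 2PR/(P+R) is the β = 1 case; the precision/recall values below are made up.

```python
def f_measure(precision, recall, beta=1.0):
    """General F measure; beta=1 balances precision and recall."""
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.25
print(f_measure(p, r))  # 0.3333... == 2PR/(P+R)
```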

What would be the smallest window to calculate the proximity score of the query "information retrieval" in the document "The retrieval of too much information is bad" - Zero words, because terms don't occur in same order. - Four words. - Three words. - Five words.

Five words.

What would be the value of the window 'w' regarding the proximity score for the search query "Information Sciences" in the document "The School of Information and Computer Sciences is really competitive" - Zero, because the query terms are not consecutive. - Four words. - Three words. - Five words.

Four words.

What is a data structure that can be used for selecting the top K largest cosines?

Heap: a binary tree in which each parent node's value is greater than its children's. Takes O(J + K log J) to find the K "winners", where J is the number of elements
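In Python, `heapq.nlargest` does this selection without sorting all the scores; the docIDs and scores here are made up.

```python
import heapq

# Cosine scores for each doc (hypothetical values).
scores = {"d1": 0.31, "d2": 0.92, "d3": 0.07, "d4": 0.55, "d5": 0.74}

# Select the K=3 largest without fully sorting all N scores.
top_k = heapq.nlargest(3, scores.items(), key=lambda kv: kv[1])
print(top_k)  # [('d2', 0.92), ('d5', 0.74), ('d4', 0.55)]
```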

What is raw term frequency?

If a term occurs in one document ten times and once in another, raw term frequency implies that the one that has the term ten times is ten times more relevant but we don't want this

C

In Boolean retrieval, a query that ANDs three terms results in having to intersect three lists of postings. Assume the three lists are of size n, m, q, respectively, each being very large. Furthermore, assume that each of the three lists are unsorted. In terms of complexity, will your intersection algorithm benefit, or not, from sorting each list before merging them, with respect to merging the 3 unsorted lists? A. Sorting first will make it worse than merging the unsorted lists B. It will be the same C. It will benefit from sorting first

C

In a Web search engine, the Text Acquisition system consists of: A. Desktop crawlers B. Scanning documents C. Web crawlers

B

In ranked retrieval of multi-term queries, the scores for the query can be computed in either document-at-a-time or term-at-a-time. The following two sentences can be both true, both false, or one true and one false. Mark all true sentences: A. When using term-at-a-time, the final score of the document di is calculated before any calculations are performed for di+1 B. When using document-at-a-time, the final score of the document di is calculated before any calculations are performed for di+1

B

In the vector space model for information retrieval, the dimension in the multi-dimensional space are A. the documents B. the terms

What are the properties of tf-idf weighting?

Increases with the number of occurrences within a document and increases with the rarity of the term in the collection
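One common weighting variant from the IR literature, the log-tf scheme w = (1 + log10 tf) × log10(N/df), can be sketched as below; the exact scheme used in lecture may differ, and the tf/df values are made up.

```python
import math

def tf_idf(tf, df, n_docs):
    """Log-weighted tf-idf: (1 + log10 tf) * log10(N/df); 0 if tf == 0."""
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# Grows with occurrences within a document...
print(tf_idf(1, 1000, 1_000_000) < tf_idf(10, 1000, 1_000_000))   # True
# ...and with the rarity of the term in the collection.
print(tf_idf(5, 10_000, 1_000_000) < tf_idf(5, 100, 1_000_000))   # True
```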

What are techniques for computing cosines efficiently?

Index elimination, champion lists, static quality scores, high and low lists, impact-ordered postings, cluster pruning

What is the relationship between the document frequency of a term and the informativeness of the term?

Inverse measure

Based on the diagram of IR components that we saw in lecture, where would you place the parameters to fine-tune your query-doc scoring algorithm? - Within the linguistic modules. - It is an external input to apply to indexes and query. - Before the spell-check of the query. - Within the indexes.

It is an external input to apply to indexes and query.

Scoring Documents Jaccard coefficient: example

- Find the Jaccard coefficient for the following:
- Query: ides of march (set Q)
- Doc 1: Caesar died in march (set D1)
- Doc 2: the long march (set D2)
- What is Jc(Q,D1) = |Q ∩ D1| / |Q ∪ D1|?
- What is Jc(Q,D2) = |Q ∩ D2| / |Q ∪ D2|?

Jc(Q,D1) = 1/6 Jc(Q,D2) = 1/5
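The computation can be checked with a short sketch:

```python
def jaccard(a, b):
    """Jc(A,B) = |A ∩ B| / |A ∪ B|"""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

q  = "ides of march".split()
d1 = "caesar died in march".split()
d2 = "the long march".split()

print(jaccard(q, d1))  # 0.1666... (= 1/6)
print(jaccard(q, d2))  # 0.2      (= 1/5)
```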

Mark the *false* statement with regards to the Mean Average Precision (MAP)? - MAP is macro-averaging: each query counts equally. - MAP considers multiple levels of relevance in documents, as opposed to binary relevance. - Most commonly used measure in scientific research. - MAP assumes the user is interested in finding many relevant documents for each query.

MAP considers multiple levels of relevance in documents, as opposed to binary relevance.

What is the effect of the inverse document frequency on a one-word query?

No effect; for idf to affect the ranking, the query must contain at least two terms

Consider the following figure representing the web as a graph with good (green with vertical-lined pattern) and bad (red with horizontal-lined pattern) pages/nodes in an interlinked structure. Which statement is true regarding the nodes 1-4? - Nodes 1-4 are bad. - Nodes 1-4 are good. - Nodes '1' and '3' are good, and '2' and '4' are bad. - Nodes '1' and '3' are good, '4' is bad, and '2' is unknown.

Nodes '1' and '3' are good, and '2' and '4' are bad.

What is document frequency?

Number of documents that contain term t or the length of the posting list of t

What is the collection frequency of a term t?

Number of occurrences of the term t in the collection, counting multiple occurrences per document

What is term frequency?

Number of times that a term t occurs in a document d

What does the Jaccard (Jc) coefficient measure?

Overlap of terms in two sets A and B

Consider the following diagram depicting documents (dots) in a collection. The rectangle in the middle represents the documents retrieved for a given query. Mark the *true* statement with regards to its precision (P) and recall (R). - P=4/8 and R= 3/9 - R=3/7 and P= 3/9 - P=4/7 and R= 4/9 - R=4/7 and P= 3/7

P=4/7 and R= 4/9

Binary Assessment

β < 1 emphasizes precision; β > 1 emphasizes recall; β = 1 balances the two

Consider the following diagram depicting documents (dots) in a collection. The rectangle in the middle represents the documents retrieved for a given query. Mark the *true* statement with regards to its precision (P) and recall (R). - P=4/8 and R= 3/9 - R=3/9 and P= 3/7 - P=4/7 and R= 4/9 - R=4/7 and P= 3/7

R=3/9 and P= 3/7

What are two ways of ranking documents with vector space similarity?

Rank documents in increasing order of the angle between query and document, or equivalently in decreasing order of cosine(query, document)

false

Ranked retrieval is a model where documents either match (1) or don't match (0) the queries (true or false)

How are documents represented in a vector space model?

Real-valued vector of tf-idf weights

What are the key ideas with representing queries as vectors?

Represent queries as vectors as if they are documents as well; rank documents according to their proximity to the query in this space

What is the goal of scoring documents?

Returning documents that are most likely to be useful based on the search query

What is the goal with choosing K largest cosines?

Selection vs. sorting; typically we want to retrieve the top K docs in the cosine ranking for the query but not to totally sort all docs in the collection

How is "proximity" defined between vector space models?

Similarity of vectors; inverse of distance

Based on the diagram of IR components that we saw in lecture, which module below does not belong to the Indexes block? - Metadata in zones and fields. - Spell Correction. - Positional Information. - k-grams.

Spell Correction.

What is the difference between a term-document incidence matrix and a term-document count matrix?

Term-document count matrices consider the number of occurrences of a term in a document; each document is a count vector vs. binary vectors in the incidence matrices

High and Low Lists

Keep a "high" list of the highest-ranking docs for each term, and the rest in a "low" list as backup if more results are needed

Mark the false statement with regards to the hub score [h(x)] and authority score [a(x)] of a page X when conducting Hyperlink-Induced Topic Search (HITS)? - Highest scores for h( ) and a( ) will define the hubs and authorities respectively. - In the iterative part, both initial values for h(x) and a(x) can be set to one. - There is no reason to scale down h( ) and a( ) between iterations. - Relative values of scores h( ) and a( ) need few iterations to converge.

There is no reason to scale down h( ) and a( ) between iterations.

How can we come up with a more sophisticated way of normalizing length with Jaccard coefficients?

Use another scoring scheme: |A ∩ B| / sqrt(|A ∪ B|) instead of |A ∩ B| / |A ∪ B|

Imagine you are constructing Tiered Indexes to improve the efficiency of your search engine. Which of the following statements is *false*? - You will break index up into tiers of decreasing importance. - You can break the index by Authority or term frequency, among other scores. - Using Authority to break the index, the same document may appear in different tiers. - Using term frequency to break the index, same documents may appear in different tiers.

Using Authority to break the index, the same document may appear in different tiers.

What is the bag of words model?

Vector representation that doesn't consider the ordering of words

What are some characteristics of documents represented as vectors?

Very sparse vectors; most entries are zero because of the high-dimensionality (tens of millions of dimensions when applied to web search engines)

B

What is the main problem of using a term-document matrix for searching in large collections of documents? A. It is slow to search the documents B. It is an inefficient use of memory C. It is slow to search the terms

false

When using tf-idf as the ranking score on queries with just one term, the idf component has effect on the final ranking. (true or false)

B, C, D, and E

Which of the following are examples of work within the Information Retrieval field? A. The design of a relational database for a library B. Classifying books into categories C. Automatically answer customer support questions D. filtering for documents of interest E. Web search engines

Examples of Authority Signals

- Wikipedia among websites
- Articles in certain newspapers
- A paper with many citations
- "Like" or "retweet" marks
- PageRank

Can we avoid computing all cosines?

Yes, but with less precision; acceptable if a doc that is not in the top K is included in results but it is important that the K documents outputted are close to the top

Gain

a document's relevance

Zones

a region of the doc that contains an arbitrary amount of text; longer than fields (title, author, etc.)

Consider the following representation of a web as a graph. Mark the *true* statement with regards to the hub score h(N) and the authority score a(N) for each node N after one iteration (iteration 1).

        Iter0          Iter1
     h(N)  a(N)     h(N)  a(N)
A      1     1        0     2
B      1     1        1     2
C      1     1        1     1
D      1     1        3     0

- a(A) > a(B), and h(B) = h(C) - a(A) = a(B), and highest h(N) is for N=D - a(C) = a(D), and highest h(N) is for N=B - h(B) > h(A), and highest a(N) is for N=D

a(A) = a(B), and highest h(N) is for N=D

Unweighted Query Terms

Assume each query term occurs only once; don't consider term frequency

Tiered Indexes

break indices into tiers of importance, where you go down a tier if there's not enough results

Champion Lists

Precompute, for each term, the set of docs with the highest weights for that term, and only analyze that set at query time

Static Quality Scores

Combine cosine relevance with an authority score. Benefit: postings traversal can stop early

Impact-Ordered Postings

Sort postings by decreasing weight and compute scores only for docs whose weights are high enough

Efficient Ranking

find the K docs in the collection "nearest" to the query

Index EL: High-Idf Terms Benefit

many docs get eliminated from the set

Index Elimination

narrow to only the docs with at least one query term

Parametric Indexes

query by metadata info

Cluster Pruning

Select √N docs at random to be leaders; each remaining doc follows its nearest leader; to process a query, go to the nearest leader and then search its followers

false

tf-idf decreases with the number of occurrences of the terms inside a document (true or false)

true

tf-idf increases with the rarity of the terms in the corpus (true or false)

Computing Cosines

the primary computational bottleneck

Query Term Proximity

The window within which the query words can be found (the window includes the query words and the words in between)

Brutus[7] -> 2, 4, 8, 16, 32, 64, 128 Caesar[8] -> 1, 2, 3, 5, 8, 16, 21, 34 Calpurnia[2] -> 13, 16 what is the most efficient query?

(BRUTUS AND CALPURNIA) AND CAESAR - remember to start with the smallest set and keep cutting further

Term          Doc Freq
eyes          213,312
kaleidoscope   87,009
marmalade     107,913
skies         271,658
tangerine      46,653
trees         316,812

Recommend a query processing order for (TANGERINE OR TREES) AND (MARMALADE OR SKIES) AND (KALEIDOSCOPE OR EYES)

(KALEIDOSCOPE OR EYES) AND (TANGERINE OR TREES) AND (MARMALADE OR SKIES) - estimate each OR by the sum of its doc freqs (300,321 vs. 363,465 vs. 379,571) and process in increasing order
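The order follows from estimating each OR clause by the sum of its doc freqs (a conservative estimate of its result size) and processing in increasing order; a quick check:

```python
# Document frequencies from the exercise.
df = {"eyes": 213_312, "kaleidoscope": 87_009, "marmalade": 107_913,
      "skies": 271_658, "tangerine": 46_653, "trees": 316_812}

# Estimate each OR clause by the sum of its doc freqs, then sort ascending.
pairs = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]
order = sorted(pairs, key=lambda p: df[p[0]] + df[p[1]])
print(order)
# [('kaleidoscope', 'eyes'), ('tangerine', 'trees'), ('marmalade', 'skies')]
```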

Boolean Retrieval Model Optimization: OR

- (MADDING or CROWD) AND (ignoble OR strife) - Get doc. freqs for all terms - estimate the size of each OR by the sum of its doc.freqs (conservative) - Process in increasing order of OR sizes

Boolean Retrieval Model Optimization: AND

- Consider a query that is an AND of n terms
- For each of the n terms, get its postings, then AND them together
- Process in order of increasing document frequency:
  - start with the smallest set, then keep cutting further
- (This is why document frequency is kept in the dictionary)
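A sketch of processing an AND query rarest-term-first, using the Brutus/Caesar/Calpurnia postings from this set; for brevity this uses Python set intersection rather than the linear merge.

```python
from functools import reduce

postings = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "calpurnia": [13, 16],
}

def and_query(terms):
    # Process in order of increasing document frequency (postings length),
    # so each intersection starts from the smallest candidate set.
    ordered = sorted(terms, key=lambda t: len(postings[t]))
    lists = [postings[t] for t in ordered]
    return reduce(lambda a, b: sorted(set(a) & set(b)), lists)

print(and_query(["brutus", "caesar", "calpurnia"]))  # [16]
```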

Term Frequency Term-Document Count Matrix

- Consider the number of occurrences of a term in a document:
- Each document is a count vector in N^(|V|): a column of the matrix
- Example column for Julius Caesar: Antony 73, Brutus 157, Caesar 227

