Chapter 3 - Modeling

Which models take advantage of term-term-correlation?

- generalized vector space model
- fuzzy IR model
- set-based model
- language models

What is the document frequency df of an index term?

The number of documents that a term occurs in. If we have a term-document matrix, we just count the non-zero entries in the term's vector.

What is a ranking function?

A function that assigns scores to documents with regard to a given query. It takes a query from the query collection Q and a document from the document collection D, and assigns a numeric value indicating how relevant the document is for the query. Once all documents have been put through the function for a given query, they can be sorted by relevance, i.e. ranked.

Which 3 classical IR models do we have for representing unstructured text?

1. Boolean model
2. Vector model
3. Probabilistic model

We also have many IR models that build upon the classical ones and extend them in various ways.

If we were to include all words in our document collection in the vocabulary V, we would get a huge number of words. For example, "if," (with the comma attached) and "if" would be two different words in the vocabulary if we only split on spaces. A large vocabulary would in turn make the term-document matrix very large, so some document preprocessing is needed. What are the steps we take in preprocessing?

1. Structure recognition: we flatten the structure of the document
2. Removing punctuation, converting to lower case, etc.
3. Removing stopwords
4. Removing non-noun groups
5. Stemming
6. Indexing

The logical view of the document goes from full text to just a set of index terms.
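
A minimal sketch of steps 2, 3 and 5 in Python; the stopword list and the crude suffix-stripping stemmer are toy placeholders (a real system would use a curated stopword list and a proper stemmer such as Porter's):

```python
import re

# Toy stopword list; a real system would use a curated one.
STOPWORDS = {"the", "a", "an", "and", "or", "if", "of", "to", "in", "is"}

def preprocess(text: str) -> list[str]:
    """Reduce full text to a list of index terms (steps 2, 3 and 5)."""
    # Step 2: strip punctuation and lower-case, so "If," and "if" become one term.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Step 3: remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Step 5: stemming -- crude suffix stripping, for illustration only.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

print(preprocess("If the networks fail, the computers stop."))
# -> ['network', 'fail', 'computer', 'stop']
```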

We distinguish between 3 main types of IR models. Which types?

1. Those based on text: we use only the text of the documents to rank them with regard to a query.
2. Those based on links: commonly used on the web, where the link structure is important for obtaining a good ranking.
3. Those based on multimedia objects: multimedia objects are encoded rather differently than text documents, and they are therefore ranked differently, or retrieved without ranking.

We have a vocabulary of n distinct index terms. V={k₁,k₂,...,kₙ} How many patterns of term co-occurrence can we find?

2ⁿ. To illustrate, we can express each document as an n-dimensional binary vector where position i is 1 if kᵢ is present in the document and 0 if kᵢ is absent. For the vocabulary V={k₁,k₂,...,kₙ} each document then looks like e.g. (1,0,...,1). For a vocabulary of size 3, we get the co-occurrence patterns (0,0,0), (0,0,1), (0,1,0), ..., (1,1,1), so we get 2³=8 possible co-occurrence patterns.
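
A quick way to convince yourself of the 2ⁿ count, sketched in Python:

```python
from itertools import product

# Every co-occurrence pattern is a binary vector of length n,
# so there are exactly 2**n of them.
n = 3
patterns = list(product([0, 1], repeat=n))
print(len(patterns))                 # 8 == 2**3
print(patterns[0], patterns[-1])     # (0, 0, 0) (1, 1, 1)
```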

What is an IR model?

A logical framework for representing documents and queries. It is usually based on sets, vectors or probability distributions. It directly affects the computation of document ranks, which are then used to sort the documents returned in response to a given query. It is represented by a quadruple [D, Q, F, R(q_i, d_j)]:

- D = set of logical views (representations) of the documents in a collection
- Q = set of logical views (representations) of user information needs (queries)
- F = framework for modeling document representations, queries and their relationships, such as sets and Boolean relations, vectors and linear algebra operations, or sample spaces and probability distributions
- R(q_i, d_j) = ranking function that takes a query representation q_i ∈ Q and a document representation d_j ∈ D and outputs a number indicating how relevant the document is for that query; this is used for ranking

What does an IR system use as input?

A set of (query) index terms to index and retrieve documents.

What is the logical view of a document?

A summary of its contents, its "theme": what is the document about? Index terms provide a logical view of the documents. Full text is clearly the most complete logical view of a document, but implies higher computational costs. A small set of categories specified by human specialists provides the most concise logical view of a document, adding a semantic layer (a layer of meaning) to the document representation.

What is used to characterize the importance of an index term?

A weight w_i,j > 0 is associated with each index term k_i of a document d_j in the collection. For an index term k_i that does not appear in document d_j, w_i,j = 0.

What is an index term? What are index terms used for?

A word or group of consecutive words in a document. In its most general form, it is any word in the collection. In a more restricted interpretation, it is a preselected group of words that represents a key concept / topic in a document. A preselected set of index terms can be used e.g. to summarize the document contents.

What are the advantages (4) and disadvantages (2) of the vector model?

Advantages:
1. Term weighting improves retrieval quality.
2. Partial matching allows retrieval of documents that approximate the query conditions.
3. Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
4. Document-length normalization is naturally built into the cosine similarity ranking, since the formula already divides by the vector norms to find cos θ.

Disadvantages:
1. Index terms are assumed to be mutually independent (no term correlation). This means that searching for e.g. phrases, quotes or song lyrics (text where term order matters) can give inaccurate results.
2. Depending on where the ranking threshold is set, it can include many irrelevant documents.

How does the boolean model work?

All elements of the term-document matrix are either 1, to indicate presence of the term in the document, or 0, to indicate absence. A query q is a boolean expression on the index terms, e.g. q = k₁∧(k₃∨k₂).

We produce all term conjunctive components that match the query, c(q)=(1,0,1), c(q)=(1,1,0) and c(q)=(1,1,1), and combine them into the disjunction q_dnf = (1,0,1)∨(1,1,0)∨(1,1,1), called the disjunctive normal form. The document is also represented as a conjunctive component, e.g. c(d)=(1,0,1). If c(d) is not among the components of q_dnf, the document is deemed irrelevant for q. In other words, the similarity is

sim(d,q) = 1 if c(d) equals one of the components of q_dnf
sim(d,q) = 0 otherwise

Note that if the query contains terms that are not in the vocabulary, the results won't be affected: the disjunctive normal form will contain matching components with such a term both present and absent.

ALTERNATIVELY, the way it was done in the lecture: take the query q = k₁∧(k₂∨¬k₃). For each term, we take that term's vector over all documents from the term-document matrix and put it through the query. With a collection of 5 documents, the vectors for terms 1, 2 and 3 might be:

w1 = (1,1,0,0,0)
w2 = (0,1,0,1,0)
w3 = (1,0,0,0,1)

Putting them through the query:

(1,1,0,0,0) AND ((0,1,0,1,0) OR NOT(1,0,0,0,1))
= (1,1,0,0,0) AND ((0,1,0,1,0) OR (0,1,1,1,0))
= (1,1,0,0,0) AND (0,1,1,1,0)
= (0,1,0,0,0)

Thus, we see that only document 2 is relevant.
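
A minimal sketch of the lecture's column-wise evaluation in Python (assuming numpy), reusing the 5-document example above:

```python
import numpy as np

# Binary term-document matrix: one row per term k1..k3,
# one column per document d1..d5 (the example above).
w = np.array([
    [1, 1, 0, 0, 0],  # k1
    [0, 1, 0, 1, 0],  # k2
    [1, 0, 0, 0, 1],  # k3
], dtype=bool)

# Evaluate q = k1 AND (k2 OR NOT k3) over all documents at once.
relevant = w[0] & (w[1] | ~w[2])
print(np.where(relevant)[0] + 1)  # [2] -> only document 2 matches
```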

Why is image retrieval easier than e.g. video retrieval?

Because videos also include a time-dimension so the files become much larger and it is even harder to estimate if a video is relevant.

What is the difference between - collection frequency cf - term frequency tf - document frequency df

- Collection frequency cf: the number of times a term occurs in the entire collection
- Term frequency tf: the number of times a term occurs in a document
- Document frequency df: the number of documents that a term occurs in
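
All three frequencies fall out of the term-document matrix directly; a minimal sketch with made-up counts (assuming numpy):

```python
import numpy as np

# Term-document matrix of raw counts: rows = terms, columns = documents.
A = np.array([
    [2, 0, 1],   # term k1
    [0, 3, 0],   # term k2
])

tf_k1_d1 = A[0, 0]         # tf: a single (term, document) entry
cf = A.sum(axis=1)         # cf: sum each term's row
df = (A > 0).sum(axis=1)   # df: count each row's non-zero entries

print(tf_k1_d1, cf, df)    # 2 [3 3] [2 1]
```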

What is a term-term correlation matrix?

In the real world, many index terms are related: "computer" often comes together with "networks". We can use this fact to produce better search results. We obtain the term-term correlation matrix C by multiplying a term-document matrix with its own transpose. C establishes a relationship between any two terms ku and kv based on their joint co-occurrences inside documents of the collection: the higher the number of documents in which ku and kv co-occur, the stronger the correlation.
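
A minimal sketch (assuming numpy) with a made-up binary term-document matrix:

```python
import numpy as np

# Term-document matrix A (terms x documents); C = A @ A.T is terms x terms.
A = np.array([
    [1, 1, 0],   # "computer"
    [1, 1, 1],   # "networks"
    [0, 0, 1],   # "apple"
])

C = A @ A.T
print(C[0, 1])  # 2: "computer" and "networks" co-occur in 2 documents
print(C[0, 2])  # 0: "computer" and "apple" never co-occur
# For a binary A, the diagonal C[u, u] is term u's document frequency df.
```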

What is the difference between unstructured and structured IR models? What are semistructured IR models?

In unstructured IR models, text is simply a sequence of words. In structured IR models, we have titles, sections, subsections and paragraphs. Structured models are often called semistructured, because they also contain plain, unstructured text.

We have a vocabulary of n distinct index terms V={k₁,k₂,...,kₙ}. What do we mean when we say that we have observed the pattern [k₁,k₂,k₃] of term-co-occurrence in the collection?

Index terms k₁, k₂ and k₃ occur together (co-occur) in documents of the collection. Note that a co-occurrence pattern is a binary vector (see above), so it says nothing about the order in which the terms appear.

What is the term-document matrix?

It is a term × document matrix that, for each (term, document) pair, shows the frequency of occurrence of the term in that specific document.

What is a "bag of words" representation of a query / document?

It is when we use conjunctive components for representing documents / queries. If we have a vocabulary of 3 distinct index terms, V={k₁,k₂,k₃}, then each document/query will be represented as a 3-dimensional vector, e.g. c(d₁)=(0,1,1), meaning that document 1 contains the index terms k₂ and k₃. This is a boolean bag-of-words representation, which only records WHETHER an index term occurs in a document / query. There is also a more sophisticated bag-of-words representation that additionally counts the number of occurrences of each term. https://en.wikipedia.org/wiki/Bag-of-words_model
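
A minimal sketch of both representations in Python; the vocabulary and document are toy values:

```python
from collections import Counter

V = ["k1", "k2", "k3"]        # vocabulary
doc = ["k2", "k3", "k2"]      # a tokenized document

counts = Counter(doc)
binary_bow = [1 if k in counts else 0 for k in V]  # presence/absence only
count_bow = [counts[k] for k in V]                 # occurrence counts

print(binary_bow)  # [0, 1, 1]
print(count_bow)   # [0, 2, 1]
```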

What is the disadvantage of using human specified categories as logical views of documents?

It restricts recall, and might lead to a poor search experience, especially if the users are not specialists with detailed knowledge of the document collection. Recall = fraction of relevant documents that are retrieved.

How does document length normalization work?

It takes the rank (score) of a document and divides it by the document length. The length can be measured in many different ways:

- Size in bytes
- Number of words
- Vector norms (lengths)

For the vector-norm method, each term k is associated with a unit vector in a t-dimensional space, where t = total number of terms, and each term has a scalar weight w_k. A document is the vector composed of all its weighted term vector components, and the document length is the Euclidean norm of that vector:

|d_j| = √(Σ_k w_k,j²)
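
A minimal sketch of vector-norm normalization; the weights and the raw score are made-up toy values:

```python
import math

# Weighted term vector of a document (e.g. TF-IDF weights, toy values).
d = [0.5, 1.2, 0.0, 2.0]

doc_length = math.sqrt(sum(w * w for w in d))  # Euclidean (vector) norm
raw_score = 3.7                                # some unnormalized rank score
print(round(doc_length, 3), round(raw_score / doc_length, 3))  # 2.385 1.551
```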

What is the difference between precision and recall?

Precision: the fraction of retrieved documents that are relevant to the query (relevant retrieved documents / retrieved documents). Of all the documents that are retrieved, how many are relevant?

Recall: the fraction of the relevant documents that are successfully retrieved (relevant retrieved documents / relevant documents). Of all documents in the collection that are relevant, how many are retrieved?
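
Both measures are a couple of set operations; a minimal sketch (the document ids are made up):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    # Assumes both sets are non-empty.
    hits = len(retrieved & relevant)       # relevant documents retrieved
    return hits / len(retrieved), hits / len(relevant)

# 4 documents retrieved, 3 relevant in the collection, 2 of them retrieved.
print(precision_recall({1, 2, 3, 4}, {2, 4, 5}))  # (0.5, 0.666...)
```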

Why is document length normalization important?

Longer documents have a larger vocabulary and are more likely to be retrieved by a given query simply because they contain more terms. Document length normalization deals with this imbalance, and leads to a better ranking.

Are modern implementations of IR systems based on a single type of IR model?

No, they often combine elements from multiple IR models. E.g. web ranking functions combine characteristics of classic IR models with characteristics of link-based models to improve retrieval.

Are all index terms equally useful for describing document contents?

No. Some words have very little meaning in themselves. We call them stopwords, and they are often removed in the preprocessing phase of text. Even after the stopwords are removed, some words are more useful for uniquely describing a document. A word which appears in 100'000 documents is less useful than a word which appears in 50, as the less common word can narrow down the set of documents which might be of interest to the user.

What are the pros and cons of using term-term correlation?

Pros:
1. Results that better satisfy the user's information needs.

Cons:
1. More computationally expensive.

What are the pros and cons of retrieval based on index terms?

Pros: can be implemented efficiently and is simple to refer to in a query. Simplicity is important because it reduces the effort of query formulation on the part of the user.

Cons: expressing query intent using only a few words restricts the semantics of what can be expressed.

What are the pros (1) and cons (5) of the boolean model?

Pros:
1. Very simple because of binary index term weights.

Cons:
1. There is no ranking, which may lead to the retrieval of too many / too few documents.
2. Formulating boolean queries is cumbersome for most users.
3. Huge table - large memory usage.
4. Often a large number of 0-entries compared to 1-entries (a very sparse matrix).
5. Boolean operators allow only exact matches to terms. A document cannot partially match a query.

The memory limitation is the reason why inverted indexes are much more useful; a minimal sketch of an inverted index follows below.
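
A minimal sketch of an inverted index in Python: instead of a sparse term-document matrix, we store per term only the documents that contain it.

```python
from collections import defaultdict

# Toy collection: document id -> tokenized text.
docs = {1: ["computer", "networks"], 2: ["networks"], 3: ["apple"]}

# Inverted index: term -> posting list of document ids.
index = defaultdict(list)
for doc_id, terms in sorted(docs.items()):
    for term in set(terms):
        index[term].append(doc_id)

print(index["networks"])  # [1, 2] -- only the 1-entries are stored
```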

The distance between occurrences of two terms tells us something about their correlation. What does it tell us?

Terms that are closer together have a stronger correlation.

How does the vector space model work?

The document d and query q are represented as vectors in a |V|-dimensional vector space, where V is the vocabulary. The |V| index terms k are represented as unit vectors. Each document is represented as a |V|-dimensional vector of the form dj = (w_1j, w_2j, ..., w_|V|j), where the weights can be TF-IDF weights. The query q is also represented as a |V|-dimensional vector of weights: q = (w_1q, w_2q, ..., w_|V|q).

The similarity between document d and query q is measured by the angle between the two vectors: the smaller the angle, the more similar they are. Since d·q = |d|·|q|·cos θ, we get

cos θ = d·q / (|d|·|q|)

If cos θ is near 1, the angle is near 0, so we deem the document relevant to the query q.
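
A minimal sketch of cosine ranking (assuming numpy); the weight vectors are made-up toy values rather than real TF-IDF weights:

```python
import numpy as np

# TF-IDF-style weight vectors in a |V|-dimensional space (toy values).
docs = np.array([
    [0.0, 2.0, 1.0],   # d1
    [1.5, 0.0, 0.0],   # d2
    [1.0, 1.0, 1.0],   # d3
])
q = np.array([0.0, 1.0, 1.0])

# cos(theta) = (d . q) / (|d| * |q|), computed for every document at once.
cos = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
print(np.argsort(-cos) + 1)  # [1 3 2] -> document ids, best match first
```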

What is a term conjunctive component c(x)?

The term conjunctive component takes a query or document representation x and returns a binary vector indicating which vocabulary index terms are present (1) and absent (0) in x. If we have a vocabulary of 3 distinct index terms, V={k₁,k₂,k₃}, then each document/query will be represented as a 3-dimensional vector, e.g. c(d₁)=(0,1,1), meaning that document 1 contains the index terms k₂ and k₃.

Of the following document length normalization methods: - bytesize - word count - vector norms which is the most accurate?

The method using vector norms is most accurate, because it also takes the term weights into account, so more selective terms are emphasized. In the lecture example, the bytesize and word-count document lengths did not differ by more than 26%, while the vector-norm lengths differed by more than 100%. The vector-norm method is, however, the most computationally demanding.

What is the term frequency tf of an index term?

The number of times a term occurs in a document. If we have a term-document-matrix, we just find the entry for the term-document-pair.

What is the collection frequency cf of an index term?

The number of times a term occurs in the entire collection. If we have a term-document-matrix, we just sum the term-vector.

What is a vocabulary?

The set V of all distinct index terms in the document collection. As the collection grows, the size of the vocabulary also grows because of misspellings, various forms of numbers, and a variety of identification symbols / acronyms.

It is difficult to write a query for retrieving images, because describing an image to a computer, which only has pixel data, is difficult. Which other approach can be used?

The user may point at an image and the IR model can retrieve images that are similar based on pixel similarity. However, this approach is deceptive, as images can be very similar on pixel level, but completely different for humans. Using this approach, there is no ranking involved, so multimedia retrieval is generally very different from text retrieval.

When index terms are used to represent a document, why can't we just use all words of the document?

There are many words that have little meaning by themselves. In fact, only nouns / noun groups have meaning by themselves. Adjectives, adverbs and connectives are less useful, and work mainly as complements. Having all words as representations for a document is only really useful when we query for an exact match.

We search for "John Lennon" and there are 1000 documents that all contain the words "John" and "Lennon". How can we sort the documents to display first those that are more likely to interest the user?

We can't, based on the index terms alone: all 1000 documents match both terms equally well, so the query does not discriminate between them. We say that the amount of information associated with the query is 0.

In the vector space model, we use cosine similarity to see how similar a document d is to a query q. We get cos θ = d·q / (|d|·|q|). Can we simplify this?

Yes. The length of the query |q| is the same for every document, so we can drop it from the cosine similarity, giving

sim(d,q) ∝ d·q / |d|

The value won't equal cos θ anymore, but the *ranking* of the documents will be the same, which is what we care about.
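
A quick numeric check (assuming numpy, reusing the toy vectors from the cosine example above) that dropping |q| leaves the ordering unchanged:

```python
import numpy as np

docs = np.array([[0.0, 2.0, 1.0], [1.5, 0.0, 0.0], [1.0, 1.0, 1.0]])
q = np.array([0.0, 1.0, 1.0])

full = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
simplified = docs @ q / np.linalg.norm(docs, axis=1)  # |q| dropped

# The scores differ by the constant factor |q|, so the ranking is identical.
assert (np.argsort(-full) == np.argsort(-simplified)).all()
```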

