Information Retrieval I

IR Task - Question answering

Give a specific answer to a question

queries, TREC

IR evaluation methods are now used in many fields. They typically use a test collection of documents, _______, and relevance judgments. The most commonly used are _____ collections.

Dimensions of IR

IR is more than just text and more than just web search. People doing IR work with different media, different types of search applications, and different tasks.

IR Task - Classification

Identify relevant labels for documents

IR Task - Filtering

Identify relevant user profiles for a new document

Dimensions of IR - Applications

Web Search, vertical search, P2P search, Forum Search, Enterprise search, Desktop search, Literature search

Examples of Search Engines

Web search engines. Open source search engines are important for research and development; examples include Lucene, Lemur, Indri, and Galago.

Examples of database records

bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.

Most models describe statistical properties of text rather than linguistic properties

That is, they count simple text features such as words instead of parsing and analyzing sentences. The statistical approach to text processing started with Luhn in the 1950s. Linguistic features can be part of a statistical model.

One of the challenges for search engine design

To give good results for a range of queries, and better results for more specific queries.

Inverted indexes

or sometimes inverted files, are by far the most common form of index used by search engines.

Examples of types of Spam

spamdexing or term spam, link spam, "optimization"

Factors that influence a person's decision about what is relevant

tasks, context, novelty, style

Recall and precision

two examples of effectiveness measures
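
The card names the two measures without defining them. As a minimal sketch under the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved), the function name and the document ids below are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the search engine returned (illustrative data)
    relevant:  ids judged relevant in the test collection (illustrative data)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).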

An inverted index

Very simply, it contains a list, for every index term, of the documents that contain that index term. It is inverted in the sense of being the opposite of a document file that lists, for every document, the index terms it contains.
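
The definition above can be sketched in a few lines. This is an illustration only: it assumes whitespace tokenization and lowercasing, and the function name and sample documents are hypothetical rather than taken from any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each index term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: terms are lowercased, whitespace-separated words.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# The "document file": for every document, the text it contains.
docs = {
    1: "bank scandals in western mass",
    2: "bank records with account numbers",
    3: "western mass news stories",
}

# The inverted view: for every term, the documents that contain it.
index = build_inverted_index(docs)
print(index["bank"])     # [1, 2]
print(index["western"])  # [1, 3]
```

Looking up a query term is then a single dictionary access instead of a scan over every document.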

Examples of documents

web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, IM sessions, etc.

Common properties of documents

Significant text content; some structure (e.g., title, author, date for papers; subject, sender, destination for email)

Search Engine Issue - Adaptability

Changing and tuning search engine components, such as the ranking algorithm, indexing strategy, and interface, for different applications

Search Engine Issue - Scalability

Making everything work with millions of users every day and many terabytes of documents; distributed processing is essential

Search Engine Issue - Performance

Measuring and improving the efficiency of search (e.g., reducing response time, increasing query throughput, increasing indexing speed). Indexes are data structures designed to improve search efficiency; designing and implementing them are major issues for search engines.

Examples of Other Media

New applications increasingly involve video, photos, music, and speech. Content is difficult to describe and compare.

Course Goals

To help you to understand search engines, evaluate and compare them, and modify them for specific applications; Provide broad coverage of the important issues in information retrieval and search engines (includes underlying models and current research directions)

same topic, everything else

Topical relevance (________) vs. user relevance (______)

Example bank database query

Find records with balance > $50,000 in branches located in Amherst, MA. Matches are easily found by comparison with the field values of records.
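
To make the contrast with text search concrete, here is a rough sketch of the bank query as exact comparisons against field values. The records, account ids, and field names are invented for illustration.

```python
# Hypothetical bank records with well-defined fields.
records = [
    {"account": "A-100", "balance": 72_000, "branch_city": "Amherst", "branch_state": "MA"},
    {"account": "A-101", "balance": 12_500, "branch_city": "Amherst", "branch_state": "MA"},
    {"account": "A-102", "balance": 98_000, "branch_city": "Boston", "branch_state": "MA"},
]

# "Balance > $50,000 in branches located in Amherst, MA":
# every condition is a direct comparison with a field value.
matches = [
    rec
    for rec in records
    if rec["balance"] > 50_000
    and rec["branch_city"] == "Amherst"
    and rec["branch_state"] == "MA"
]

print([rec["account"] for rec in matches])  # ['A-100']
```

A record either satisfies the conditions or it does not; there is no notion of one match being better than another.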

Why exact matching of words is not enough

There are many different ways to write the same thing in a natural language like English, and some stories will be better matches than others.

Example search engine query

Query: bank scandals in western mass. This text must be compared to the text of entire news stories.

Big Issues in IR - Relevance

A relevant document contains the information that a person was looking for when they submitted a query to the search engine

Dimensions of IR - Tasks

Ad hoc search, Filtering, Classification, Question answering

They are also known as tuples in relational databases

Database records

Big Issues in IR - Evaluation

Experimental procedures and measures for comparing system output with user expectations. Originated in the Cranfield experiments in the 1960s.

IR Task - Ad-hoc search

Find relevant documents for an arbitrary text query

semantics, difficult

In database records, it is easy to compare fields with well-defined ________ to queries in order to find matches, whereas text is more ______

structure, storage

"Information retrieval is a field concerned with the ______, analysis, organization, _______, searching, and retrieval of information." (Salton, 1968)

A retrieval model

It is a formal representation of the process of matching a query and a document. It is the basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents.
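
As one possible illustration of how a retrieval model drives ranking, the sketch below scores each document by a simple term-frequency times inverse-document-frequency sum. This weighting is an assumption chosen for brevity, not the model used by any specific engine, and the function name and sample stories are hypothetical.

```python
import math
from collections import Counter


def rank(query, docs):
    """Return (doc_id, score) pairs, best match first.

    Assumed scoring: sum over query terms of tf * log(1 + N / df).
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:  # a Counter returns 0 for terms the document lacks
                score += tf[term] * math.log(1 + n_docs / df[term])
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "story1": "bank scandals rock western mass bank",
    "story2": "western mass weather report",
    "story3": "recipe for apple pie",
}
ranking = rank("bank scandals in western mass", docs)
print([doc_id for doc_id, _ in ranking])  # ['story1', 'story2', 'story3']
```

Unlike exact field matching, every document receives a score, so partial matches such as story2 still appear in the ranked list, just lower down.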

Spam (subfield)

It is called adversarial IR, since spammers are "adversaries" with different goals

The core issue of information retrieval

It is comparing the query text to the document text and determining what is a good match

Spam

It is one of the major issues for Web search because it affects the efficiency of search engines and, more seriously, the effectiveness of the results

Information Retrieval

It is the field of computer science that is most involved with R&D for search

Search Engine

It is the practical application of information retrieval techniques to large scale text collections

IR and Search Engines - Search Engines

Performance (Efficient search and indexing), Incorporating new data (Coverage and freshness), Scalability (Growing with data and users), Adaptability (Tuning for applications), Specific problems (e.g., spam)

Search Engine Issues

Performance, Dynamic data, Scalability, Adaptability

text, documents

The primary focus of IR since the 1950s has been on ____ and ______

expansion, feedback

Query refinement techniques such as query _______, query suggestion, and relevance ________ improve ranking

Explain Big Issues in IR - Relevance

Ranking algorithms used in search engines are based on retrieval models. Most models describe statistical properties of text rather than linguistic properties.

IR and Search Engines - Information Retrieval

Relevance (Effective ranking), Evaluation (Testing and measuring), Information needs (User interaction)

Big Issues in IR

Relevance, Evaluation, Users and Information Needs

Big Issues in IR - Users and Information Needs

Search evaluation is user-centered. Keyword queries are often poor descriptions of actual information needs. Interaction and context are important for understanding user intent. Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking.
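
Of the refinement techniques listed, query expansion is the simplest to sketch. The example below assumes a small hand-written table of related terms; both the table and the function name are hypothetical, and a real system would typically derive related terms from data rather than hard-code them.

```python
# Hypothetical table of related terms, written by hand for illustration.
RELATED_TERMS = {
    "mass": ["massachusetts"],
    "scandals": ["scandal", "fraud"],
}


def expand_query(query, related=RELATED_TERMS):
    """Return the original query terms followed by any related terms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:  # avoid adding a term twice
                expanded.append(extra)
    return expanded


expanded = expand_query("bank scandals in western mass")
print(expanded)
```

The expanded term list can match stories that say "Massachusetts" or "fraud" even though the user typed neither word.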

Dimensions of IR - Content

Text, Images, Video, Scanned docs, Audio, and Music.

Search Engine Issue - Dynamic Data

The "collection" for most real applications is constantly changing in terms of updates, additions, and deletions (e.g., web pages). Acquiring or "crawling" the documents is a major task; typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed). Updating the indexes while processing queries is also a design issue.

Database records

They are typically made up of well-defined fields (or attributes)

