Information Retrieval I
IR Task - Question answering
Give a specific answer to a question
queries, TREC
IR evaluation methods now used in many fields; Typically use test collection of documents, _______, and relevance judgments. Most commonly used are _____ collections
Dimensions of IR
IR is more than just text and more than just web search; people doing IR work with different media, different types of search applications, and different tasks
IR Task - Classification
Identify relevant labels for documents
IR Task - Filtering
Identify relevant user profiles for a new document
Dimensions of IR - Applications
Web Search, vertical search, P2P search, Forum Search, Enterprise search, Desktop search, Literature search
Examples of Search Engines
Web search engines; open source search engines are important for R&D, e.g., Lucene, Lemur, Indri, and Galago
Examples of database records
bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
Most models describe statistical properties of text rather than linguistic
i.e., counting simple text features such as words instead of parsing and analyzing sentences; the statistical approach to text processing started with Luhn in the 50s; linguistic features can be part of a statistical model
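The statistical approach described above can be illustrated with a minimal word-counting sketch (the function name is my own, for illustration only):

```python
from collections import Counter

def term_counts(text):
    """Count simple text features (words) with no linguistic parsing --
    the statistical view of text that most retrieval models rely on."""
    tokens = text.lower().split()
    return Counter(tokens)

counts = term_counts("to be or not to be")
# "to" and "be" each occur twice; "or" and "not" once
```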
One of the challenges for search engine design
To give good results for a range of queries, and better results for more specific queries
Inverted indexes
or sometimes inverted files, are by far the most common form of index used by search engines.
Examples of types of Spam
spamdexing or term spam, link spam, "optimization"
Factors that influence a person's decision about what is relevant
tasks, context, novelty, style
Recall and precision
two examples of effectiveness measures
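These two measures can be computed directly from the retrieved set and the relevance judgments. A minimal sketch (document IDs and the function name are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
# 2 of the 4 retrieved docs are relevant (p = 0.5);
# 2 of the 3 relevant docs were retrieved (r = 2/3)
```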
An inverted index
very simply, contains a list, for every index term, of the documents that contain that index term. It is inverted in the sense of being the opposite of a document file that lists, for every document, the index terms it contains
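The structure described above can be sketched in a few lines (document IDs and function name are illustrative, not from the source):

```python
def build_inverted_index(docs):
    """Map each index term to the list of documents containing it --
    the inverse of a document file, which maps each document to its terms."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

docs = {"d1": "bank scandals", "d2": "bank records", "d3": "news stories"}
index = build_inverted_index(docs)
# index["bank"] lists every document containing "bank": d1 and d2
```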
Examples of documents
web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, IM sessions, etc.
Common properties of documents
Significant text content, Some structure (e.g., title, author, date for papers; subject, sender, destination for email)
Search Engine Issue - Adaptability
Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications
Search Engine Issue - Scalability
Making everything work with millions of users every day, and many terabytes of documents; Distributed processing is essential
Search Engine Issue - Performance
Measuring and improving the efficiency of search e.g., reducing response time, increasing query throughput, increasing indexing speed; Indexes are data structures designed to improve search efficiency (designing and implementing them are major issues for search engines)
Examples of Other Media
New applications increasingly involve video, photos, music, and speech; content is difficult to describe and compare
Course Goals
To help you to understand search engines, evaluate and compare them, and modify them for specific applications; Provide broad coverage of the important issues in information retrieval and search engines (includes underlying models and current research directions)
same topic, everything else
Topical relevance (________) vs. user relevance (______)
Example bank database query
-Find records with balance > $50,000 in branches located in Amherst, MA. -Matches easily found by comparison with field values of records
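The field-comparison matching described above can be sketched as a simple filter (the record layout and values are hypothetical, for illustration only):

```python
# Hypothetical bank records with well-defined fields
records = [
    {"account": 1, "balance": 72000, "branch": "Amherst, MA"},
    {"account": 2, "balance": 30000, "branch": "Amherst, MA"},
    {"account": 3, "balance": 90000, "branch": "Boston, MA"},
]

# Exact comparison with field values: no ranking or text matching needed
matches = [r for r in records
           if r["balance"] > 50000 and r["branch"] == "Amherst, MA"]
# Only account 1 satisfies both conditions
```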
Why exact matching of words is not enough
-Many different ways to write the same thing in a natural language like English, -Some stories will be better matches than others
Example search engine query
-bank scandals in western mass, -This text must be compared to the text of entire news stories
Big Issues in IR - Relevance
A relevant document contains the information that a person was looking for when they submitted a query to the search engine
Dimensions of IR - Tasks
Ad hoc search, Filtering, Classification, Question answering
Database records
They are also known as tuples in relational databases
Big Issues in IR - Evaluation
Experimental procedures and measures for comparing system output with user expectations; Originated in Cranfield experiments in the 60s
IR Task - Ad-hoc search
Find relevant documents for an arbitrary text query
semantics, difficult
In Database Records, it is easy to compare fields with well-defined ________ to queries in order to find matches, and text is more ______
structure, storage
"Information retrieval is a field concerned with the ______, analysis, organization, _______, searching, and retrieval of information." (Salton, 1968)
A retrieval model
It is a formal representation of the process of matching a query and a document. It is the basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents.
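A toy version of such a model can make the idea concrete. The sketch below scores a document by query-term frequency and ranks by score; this is my own simplification, not a model from the source:

```python
from collections import Counter

def score(query, doc):
    """Toy retrieval model: score a document by how often it contains
    each query term (simple term-frequency matching)."""
    d_counts = Counter(doc.lower().split())
    return sum(d_counts[t] for t in query.lower().split())

def rank(query, docs):
    """Ranking algorithm built on the model: sort docs by descending score."""
    return sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)

docs = {"d1": "bank scandals in western mass",
        "d2": "bank records and balances",
        "d3": "local news stories"}
ranking = rank("bank scandals", docs)
# d1 matches both query terms, d2 one, d3 none
```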
Spam (subfield)
It is called adversarial IR, since spammers are "adversaries" with different goals
The core issue of information retrieval
It is comparing the query text to the document text and determining what is a good match
Spam
It is one of the major issues for Web search because it affects the efficiency of search engines and, more seriously, the effectiveness of the results
Information Retrieval
It is the field of computer science that is most involved with R&D for search
Search Engine
It is the practical application of information retrieval techniques to large scale text collections
IR and Search Engines - Search Engines
Performance (Efficient search and indexing), Incorporating new data (Coverage and freshness), Scalability (Growing with data and users), Adaptability (Tuning for applications), Specific problems (e.g. Spam)
Search Engine Issues
Performance, Dynamic data, Scalability, Adaptability
text, documents
Primary focus of IR since the 50s has been on ____ and ______
expansion, feedback
Query refinement techniques such as query _______, query suggestion, relevance ________ improve ranking
Explain Big Issues in IR - Relevance
Ranking algorithms used in search engines are based on retrieval models, Most models describe statistical properties of text rather than linguistic
IR and Search Engines - Information Retrieval
Relevance (Effective ranking ), Evaluation (Testing and measuring), Information needs (User interaction)
Big Issues in IR
Relevance, Evaluation, Users and Information Needs
Big Issues in IR - Users and Information Needs
Search evaluation is user-centered, Keyword queries are often poor descriptions of actual information needs, Interaction and context are important for understanding user intent, and Query refinement techniques such as query expansion, query suggestion, relevance feedback improve ranking
Dimensions of IR - Content
Text, Images, Video, Scanned docs, Audio, and Music.
Search Engine Issue - Dynamic Data
The "collection" for most real applications is constantly changing in terms of updates, additions, and deletions (e.g., web pages). Acquiring or "crawling" the documents is a major task; typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed). Updating the indexes while processing queries is also a design issue.
Database records
They are typically made up of well-defined fields (or attributes)