INF 141
What is Information Retrieval?
It's a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information
What is the frontier of a Web crawler?
It's the set of URLs that have been seen but not yet crawled
What are some examples of Uniform Resource Identifiers (URIs)?
http://www.ics.uci.edu/~lopes
ISBN 0-486-2777-3
rmi://filter.uci.edu
Consider the following sentences:
S1: "deep fry turkey is the bomb turkey"
S2: "I like slow slow roasted turkey"
S3: "turkey stuffing contains the turkey liver"
What is the inverse document frequency of the word slow?
log(3/1), since slow appears in 1 of the 3 sentences: idf = log(N/df) = log(3/1)
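A minimal Python sketch of this computation (the docs dictionary and the base-10 logarithm are illustrative choices):

    import math

    docs = {
        "S1": "deep fry turkey is the bomb turkey",
        "S2": "I like slow slow roasted turkey",
        "S3": "turkey stuffing contains the turkey liver",
    }

    def idf(term, docs):
        # document frequency: number of documents containing the term
        df = sum(1 for text in docs.values() if term in text.split())
        return math.log10(len(docs) / df)

    print(idf("slow", docs))  # log(3/1) ~= 0.477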
Consider the following query and corresponding postings lists (with doc ids only):
Query: master of software engineering
engineering: 4, 5, 10, 11, 14, 15, 16
master: 4, 11
of: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
software: 11, 15, 16
Assume Boolean retrieval with an AND operation between the terms. What is the best order for processing this query as fast as possible? (processing is left to right)
master AND software AND engineering AND of (process the shortest postings lists first, so intermediate results stay as small as possible)
Cosine similarity captures geometric proximity in terms of
the angle between the vectors
In the Vector Space Model, the data points in the multi-dimensional space are
the documents
tf-idf increases with the rarity of the terms in the corpus.
true
What would be the value of the window "w" in a proximity-weighted score for the search query "Information Sciences" in the document "The School of Information and Computer Sciences is really competitive"?
w = 4 (the smallest window containing both query terms spans "Information and Computer Sciences", which is 4 words)
Select the most efficient processing order, if any, for the Boolean query Q, considering the document frequency information from the table:
Q: web AND ranked AND retrieval
web: 154383
ranked: 623146
retrieval: 483259
(web AND retrieval) first, then merge with ranked; web and retrieval have the two smallest document frequencies, so intersecting them first keeps the intermediate result small
Consider the following sentences:
S1: "deep fry turkey is the bomb turkey"
S2: "I like slow slow roasted turkey"
S3: "turkey stuffing contains the turkey liver"
What is the tf-idf of the word turkey in S1?
0 (turkey occurs in all three sentences, so idf = log(3/3) = 0, and tf × idf = 0 regardless of the term frequency in S1)
Consider the following sentences:
S1: "deep fry turkey is the bomb turkey"
S2: "I like slow slow roasted turkey"
S3: "turkey stuffing contains the turkey liver"
What is the document frequency of the word slow?
1 (slow occurs in only one sentence, S2)
Consider the following sentences:
S1: "deep fry turkey is the bomb turkey"
S2: "I like slow slow roasted turkey"
S3: "turkey stuffing contains the turkey liver"
And the following query:
Q: turkey breast
What is the Jaccard coefficient between Q and S2?
1/6 (the only shared word is turkey, so |Q ∩ S2| = 1, and the union has 6 distinct words: turkey, breast, I, like, slow, roasted)
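For completeness, a minimal sketch of the set computation (plain Python):

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    q = "turkey breast".split()
    s2 = "I like slow slow roasted turkey".split()
    print(jaccard(q, s2))  # 1/6 ~= 0.1667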
A://B/C?D#E
A = scheme
B = authority (domain/hostname)
C = path
D = query
E = fragment
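As a quick check, Python's standard urllib.parse splits a URL into exactly these components (the query string and fragment here are made-up extensions of the example URL above):

    from urllib.parse import urlparse

    parts = urlparse("http://www.ics.uci.edu/~lopes?q=ir#top")
    print(parts.scheme)    # 'http'
    print(parts.netloc)    # 'www.ics.uci.edu'  (the authority)
    print(parts.path)      # '/~lopes'
    print(parts.query)     # 'q=ir'
    print(parts.fragment)  # 'top'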
User-agent: *
Disallow: /foo
Disallow: /bar

User-agent: Googlebot
Disallow: /baz/a

Given this robots.txt, what is Googlebot allowed to crawl?
Googlebot is allowed to crawl /foo and /bar but not /baz/a; its specific User-agent group overrides the * group
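This behavior can be verified with Python's standard urllib.robotparser; a minimal sketch, assuming the rules above:

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /foo",
        "Disallow: /bar",
        "",
        "User-agent: Googlebot",
        "Disallow: /baz/a",
    ]

    rp = RobotFileParser()
    rp.parse(rules)
    print(rp.can_fetch("Googlebot", "/foo"))    # True
    print(rp.can_fetch("Googlebot", "/baz/a"))  # False
    print(rp.can_fetch("OtherBot", "/foo"))     # False (falls back to the * group)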
A normal crawler fetches pages directly from the Web servers. However, your crawler used a cache server to fetch pages. Why?
Because having more than one hundred crawlers fetching pages directly could overload the ICS network if the crawlers are not properly developed.
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What is going on in lines #4-#6?
Check if there are no more URLs to download, and stop the crawler in that case
Cost Per Mille (CPM)
Cost for showing the ad 1000 times
Cost Per Click (CPC)
Cost charged when the user clicks on the ad after it is shown to them
The web is continuously changing. What is considered the best strategy to update the index of a search engine?
Create small temporary indexes in memory from modified and new pages, use them together with the main index to answer searches, and later merge them into the main index before memory fills up.
Consider a vocabulary of 4 words. Two documents have the following coordinates in that space:
D1: [0, 2, 1, 0]
D2: [1, 0, 1, 1]
Using cosine similarity as the ranking formula, what is the relative ranking of these documents for a query that has coordinates [1, 1, 1, 1]?
D2, D1 (cos(D2, Q) = 3/(2√3) ≈ 0.87 is larger than cos(D1, Q) = 3/(2√5) ≈ 0.67)
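A minimal sketch of the underlying arithmetic (plain Python, no external libraries):

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm

    q = [1, 1, 1, 1]
    print(cosine([1, 0, 1, 1], q))  # D2: ~0.87
    print(cosine([0, 2, 1, 0], q))  # D1: ~0.67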
What is the minimum information in a posting?
Doc id
"Web crawlers can and should send hundreds of requests per second to a Web site, because otherwise they will take a very long time to crawl"
False
A disk seek operation is 10 times slower than a main memory reference
False
All the crawler traps that exist on the web are deliberately created.
False
It is feasible to use a single file on disk as a direct replacement for RAM, and thus to avoid the creation of partial indexes, when building an index that is expected to reach several terabytes.
False
Reading 1 MB sequentially from memory is 2 times faster than reading 1 MB sequentially from disk
False
The deep web is a large part of the web that only has encrypted content, and thus it is not crawled nor indexed by normal search engines.
False
The right side of the architecture diagram pertains to processes that are done well before any query is issued
False
When using tf-idf as the ranking score on queries with just one term, the idf component has an effect on the final ranking.
False
tf-idf decreases with the number of occurrences of the terms inside a document
False
Which of the following are examples of work within the Information Retrieval field?
Filtering for documents of interest
Web search engines
Classifying books into categories
Vertical search
Gathering and finding information about a specific topic or domain segment?
Desktop search
Gathering and finding information on a single computer?
Web search
Gathering and finding information on the web?
Peer-to-peer search
Gathering and finding information on a network of independent nodes?
Enterprise search
Gathering and finding information within a company network?
What's "inverted" in an inverted index?
It "inverts" from docs->terms to terms->docs
What is the main problem of using a term-document matrix for searching in large collections of documents?
It is an inefficient use of memory: the matrix is extremely sparse, since most terms do not occur in most documents
In Boolean retrieval, a query that ANDs three terms results in having to intersect three lists of postings. Assume the three lists are of size n, m, q, respectively, each being very large. Furthermore, assume that each of the three lists are unsorted. In terms of complexity, will your intersection algorithm benefit, or not, from sorting each list before merging them, with respect to merging the 3 unsorted lists?
It will benefit from sorting first: sorting costs O(n log n + m log m + q log q), after which the lists can be intersected in a single linear scan, which is far better than the O(n·m·q) brute-force intersection of unsorted lists
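A minimal sketch of that linear-scan intersection for two sorted postings lists (for three lists, intersect the two shortest first and merge the result with the third):

    def intersect(p1, p2):
        # merge-style intersection of two sorted postings lists: O(len(p1) + len(p2))
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([4, 11], [11, 15, 16]))  # [11]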
What is a crawler trap?
It's a chain of URLs that may never end
Should crawlers hit the same web site as fast as possible as a strategy to crawl faster?
No
Consider the following pseudo code of an indexer that takes a sequence of documents, all of which have a URL d.url. We want to associate that URL with an integer, so as to use the integer in the postings. What is the complexity of this algorithm with respect to N = number of documents?
O(N²)
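The card's pseudocode is not reproduced above. A hypothetical sketch with the same quadratic behavior, assuming the indexer assigns integer ids by scanning a plain list of previously seen URLs:

    url_ids = []  # a document's integer id is its position in this list

    def get_doc_id(url):
        # the membership test and index() each scan the list: O(N) per document,
        # hence O(N^2) over N documents
        if url not in url_ids:
            url_ids.append(url)
        return url_ids.index(url)

    # Swapping the list for a dict {url: id} makes each lookup O(1) amortized,
    # bringing the whole pass down to O(N).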
In Boolean retrieval, a query that ANDs three terms results in having to intersect three lists of postings. Assume the three lists are of size n, m, q, respectively, each being very large. If you keep the lists unsorted, what best approximates the complexity of a 3-way intersection algorithm?
O(n × m × q)
What are HTTP status codes 2xx?
Page retrieved successfully
What are HTTP status codes 3xx?
Redirection
Which of the following methods can you use directly to detect pages or documents that are near duplicates?
Simhash, fingerprint
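A compact simhash sketch (Python; truncated md5 as the 64-bit token hash is an illustrative choice, not the canonical one):

    import hashlib

    def simhash(tokens, bits=64):
        # each token's hash casts a +1/-1 vote on every bit position
        v = [0] * bits
        for tok in tokens:
            h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << bits) - 1)
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1
        # the sign of each vote becomes one bit of the fingerprint
        return sum(1 << i for i in range(bits) if v[i] > 0)

    def hamming(a, b):
        # near-duplicate documents yield fingerprints at small Hamming distance
        return bin(a ^ b).count("1")

    d1 = "deep fry turkey is the bomb turkey".split()
    d2 = "deep fry turkey is the best turkey".split()
    print(hamming(simhash(d1), simhash(d2)))  # small value for near-duplicates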
What characterizes most queries on Web search engines?
Spelling mistakes
Short (just a few words)
Use of special characters and tokens with specific meaning (e.g. 5+3)
Within the architecture of a search engine, Homework 1 (tokenization) pertains to:
Text transformation
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What would happen if line #14 was removed?
The crawl would never end
What contributes to the relevance of a document with respect to a query in the context of a search engine?
The geographic location of the person who's querying
Textual similarity
The popularity of the document
The author of the document
Prior queries made by the same user
The geographic origin of the document
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What is going on in line #7?
The next URL is downloaded
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What is going on in line #3?
The next URL to be downloaded is picked from the frontier
How is advertisement integrated with Web search?
The user's query goes both to the search engine and the ad engine; the search engine retrieves the most relevant results, the ad engine uses an auction system on the query words.
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What would happen if line #15 was removed?
This crawler would crawl much faster
This crawler would be impolite
Consider the pseudo code of a simple in-memory indexer:
This indexer processes all documents it is given
This indexer will not work for sufficiently large data
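The card's pseudocode is not reproduced above; a minimal sketch of a simple in-memory indexer consistent with those two properties (all names are assumptions):

    from collections import defaultdict

    def index_documents(docs):
        # docs: iterable of (doc_id, text) pairs
        index = defaultdict(list)  # term -> list of doc ids
        for doc_id, text in docs:  # processes every document it is given
            for term in set(text.split()):
                index[term].append(doc_id)
        # the whole index lives in RAM, so this fails once the collection
        # outgrows the available memory
        return index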
Can we build indexes that are larger than the available memory?
We can offload the in-memory hashtable to files every so often, and merge those files at the end.
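A sketch of that offload-and-merge idea (Python; the file format, threshold, and function names are illustrative assumptions):

    import heapq
    from collections import defaultdict

    partial, partial_files, LIMIT = defaultdict(list), [], 100_000

    def add(term, doc_id):
        partial[term].append(doc_id)
        if len(partial) >= LIMIT:  # in-memory table considered "full"
            flush()

    def flush():
        # dump the in-memory table as sorted "term<TAB>postings" lines
        name = f"partial_{len(partial_files)}.idx"
        with open(name, "w") as f:
            for term in sorted(partial):
                f.write(f"{term}\t{','.join(map(str, partial[term]))}\n")
        partial_files.append(name)
        partial.clear()

    def merge(out_name):
        # stream-merge the sorted partial files (call flush() once more first),
        # so the final index never has to fit in memory
        files = [open(name) for name in partial_files]
        with open(out_name, "w") as out:
            last, merged = None, []
            for line in heapq.merge(*files):
                term, postings = line.rstrip("\n").split("\t")
                if last is not None and term != last:
                    out.write(f"{last}\t{','.join(merged)}\n")
                    merged = []
                last = term
                merged.extend(postings.split(","))
            if last is not None:
                out.write(f"{last}\t{','.join(merged)}\n")
        for f in files:
            f.close()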
Besides Web crawling, what other ways are there to obtain data from Web sites?
Web APIs provided by certain Web sites
Targeted downloads of specific URLs
Data dumps provided by companies and organizations
In a Web search engine, the Text Acquisition system consists of _____
Web crawlers
Good matches for search engines
a) Find out what is going on today at UCI
c) Find how to split words in Python
d) What is the weather like in Bali?
Good matches for relational databases
b) Find all female students whose last name is Smith
f) What were the temperature and humidity values registered in Crystal Cove between 9/1/2019 and 10/1/2019?