INF 141
What is Information Retrieval?
It's a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information
What is the frontier of a Web crawler?
It's the set of URLs that have been seen but not yet crawled
What are some examples of Uniform Resource Identifiers (URIs)?
http://www.ics.uci.edu/~lopes
ISBN 0-486-2777-3
rmi://filter.uci.edu
Consider the following sentences:
S1: "deep fry turkey is the bomb turkey"
S2: "I like slow slow roasted turkey"
S3: "turkey stuffing contains the turkey liver"
What is the inverse document frequency of the word slow?
log(3/1), since slow appears in 1 of the 3 sentences: idf = log(N/df) = log(3/1)
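A minimal Python sketch of this computation (the docs dictionary and the base-10 logarithm are illustrative choices):

    import math

    docs = {
        "S1": "deep fry turkey is the bomb turkey",
        "S2": "I like slow slow roasted turkey",
        "S3": "turkey stuffing contains the turkey liver",
    }

    def idf(term, docs):
        # document frequency: number of documents containing the term
        df = sum(1 for text in docs.values() if term in text.split())
        return math.log10(len(docs) / df)

    print(idf("slow", docs))  # log(3/1) ~= 0.477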
Consider the following query and corresponding postings lists (with doc ids only):
Query: master of software engineering
engineering: 4, 5, 10, 11, 14, 15, 16
master: 4, 11
of: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
software: 11, 15, 16
Assume Boolean retrieval with an AND operation between the terms. What is the best order for processing this query as fast as possible? (processing is left to right)
master AND software AND engineering AND of (process the shortest postings lists first, so intermediate results stay as small as possible)
Cosine similarity captures geometric proximity in terms of
the angle between the vectors
In the Vector Space Model, the data points in the multi-dimensional space are
the documents
tf-idf increases with the rarity of the terms in the corpus.
true
What would be the value of the window "w" in a proximity-weighted score for the search query "Information Sciences" in the document "The School of Information and Computer Sciences is really competitive"?
w = 4 (the smallest window containing both query terms spans "Information and Computer Sciences", which is 4 words)
Select the most efficient processing order, if any, for the Boolean query Q, considering the document frequency information from the table:
Q: web AND ranked AND retrieval
web: 154383
ranked: 623146
retrieval: 483259
(web AND retrieval) first, then merge with ranked; web and retrieval have the two smallest document frequencies, so intersecting them first keeps the intermediate result small
Consider the following sentences:
S1: "deep fry turkey is the bomb turkey"
S2: "I like slow slow roasted turkey"
S3: "turkey stuffing contains the turkey liver"
What is the tf-idf of the word turkey in S1?
0 (turkey occurs in all three sentences, so idf = log(3/3) = 0, and tf × idf = 0 regardless of the term frequency in S1)
Consider the following sentences:
S1: "deep fry turkey is the bomb turkey"
S2: "I like slow slow roasted turkey"
S3: "turkey stuffing contains the turkey liver"
What is the document frequency of the word slow?
1 (slow occurs in only one sentence, S2)
Consider the following sentences:
S1: "deep fry turkey is the bomb turkey"
S2: "I like slow slow roasted turkey"
S3: "turkey stuffing contains the turkey liver"
And the following query:
Q: turkey breast
What is the Jaccard coefficient between Q and S2?
1/6 (the only shared word is turkey, so |Q ∩ S2| = 1, and the union has 6 distinct words: turkey, breast, I, like, slow, roasted)
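For completeness, a minimal sketch of the set computation (plain Python):

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    q = "turkey breast".split()
    s2 = "I like slow slow roasted turkey".split()
    print(jaccard(q, s2))  # 1/6 ~= 0.1667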
A://B/C?D#E
A = scheme
B = authority (domain/hostname)
C = path
D = query
E = fragment
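As a quick check, Python's standard urllib.parse splits a URL into exactly these components (the query string and fragment here are made-up extensions of the example URL above):

    from urllib.parse import urlparse

    parts = urlparse("http://www.ics.uci.edu/~lopes?q=ir#top")
    print(parts.scheme)    # 'http'
    print(parts.netloc)    # 'www.ics.uci.edu'  (the authority)
    print(parts.path)      # '/~lopes'
    print(parts.query)     # 'q=ir'
    print(parts.fragment)  # 'top'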
User-agent: *
Disallow: /foo
Disallow: /bar

User-agent: Googlebot
Disallow: /baz/a

Given this robots.txt, what is Googlebot allowed to crawl?
Googlebot is allowed to crawl /foo and /bar but not /baz/a; its specific User-agent group overrides the * group
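This behavior can be verified with Python's standard urllib.robotparser; a minimal sketch, assuming the rules above:

    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /foo",
        "Disallow: /bar",
        "",
        "User-agent: Googlebot",
        "Disallow: /baz/a",
    ]

    rp = RobotFileParser()
    rp.parse(rules)
    print(rp.can_fetch("Googlebot", "/foo"))    # True
    print(rp.can_fetch("Googlebot", "/baz/a"))  # False
    print(rp.can_fetch("OtherBot", "/foo"))     # False (falls back to the * group)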
A normal crawler fetches pages directly from the Web servers. However, your crawler used a cache server to fetch pages. Why?
Because having more than one hundred crawlers fetching pages directly could overload the ICS network if the crawlers are not properly developed.
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What is going on in lines #4-#6?
Check if there are no more URLs to download, and stop the crawler in that case
Cost Per Mille (CPM)
Cost for showing the ad 1000 times
Cost Per Click (CPC)
Cost charged when the user clicks on the ad after it is shown to them
The web is continuously changing. What is considered the best strategy to update the index of a search engine?
Create small temporary indexes in memory from modified and new pages, use them together with the main index to answer searches, and later merge them into the main index before memory fills up.
Consider a vocabulary of 4 words. Two documents have the following coordinates in that space:
D1: [0, 2, 1, 0]
D2: [1, 0, 1, 1]
Using cosine similarity as the ranking formula, what is the relative ranking of these documents for a query that has coordinates [1, 1, 1, 1]?
D2, D1 (cos(D2, Q) = 3/(2√3) ≈ 0.87 is larger than cos(D1, Q) = 3/(2√5) ≈ 0.67)
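A minimal sketch of the underlying arithmetic (plain Python, no external libraries):

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm

    q = [1, 1, 1, 1]
    print(cosine([1, 0, 1, 1], q))  # D2: ~0.87
    print(cosine([0, 2, 1, 0], q))  # D1: ~0.67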
What is the minimum information in a posting?
Doc id
"Web crawlers can and should send hundreds of requests per second to a Web site, because otherwise they will take a very long time to crawl"
False
A disk seek operation is 10 times slower than a main memory reference
False
All the crawler traps that exist on the web are deliberately created.
False
It is feasible to use a single file on disk as a direct replacement for RAM, and thus to avoid the creation of partial indexes, when building an index that is expected to reach several terabytes.
False
Reading 1 MB sequentially from memory is 2 times faster than reading 1 MB sequentially from disk
False
The deep web is a large part of the web that only has encrypted content, and thus it is not crawled nor indexed by normal search engines.
False
The right side of the architecture diagram pertains to processes that are done well before any query is issued
False
When using tf-idf as the ranking score on queries with just one term, the idf component has an effect on the final ranking.
False
tf-idf decreases with the number of occurrences of the terms inside a document
False
Which of the following are examples of work within the Information Retrieval field?
Filtering for documents of interest
Web search engines
Classifying books into categories
Vertical search
Gathering and finding information about a specific topic or domain segment?
Desktop search
Gathering and finding information on a single computer?
Web search
Gathering and finding information on the web?
Peer-to-peer search
Gathering and finding information on a network of independent nodes?
Enterprise search
Gathering and finding information within a company network?
What's "inverted" in an inverted index?
It "inverts" from docs->terms to terms->docs
What is the main problem of using a term-document matrix for searching in large collections of documents?
It is an inefficient use of memory: the matrix is extremely sparse, since most terms do not occur in most documents
In Boolean retrieval, a query that ANDs three terms results in having to intersect three lists of postings. Assume the three lists are of size n, m, q, respectively, each being very large. Furthermore, assume that each of the three lists are unsorted. In terms of complexity, will your intersection algorithm benefit, or not, from sorting each list before merging them, with respect to merging the 3 unsorted lists?
It will benefit from sorting first: sorting costs O(n log n + m log m + q log q), after which the lists can be intersected in a single linear scan, which is far better than the O(n·m·q) brute-force intersection of unsorted lists
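A minimal sketch of that linear-scan intersection for two sorted postings lists (for three lists, intersect the two shortest first and merge the result with the third):

    def intersect(p1, p2):
        # merge-style intersection of two sorted postings lists: O(len(p1) + len(p2))
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([4, 11], [11, 15, 16]))  # [11]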
What is a crawler trap?
It's a chain of URLs that may never end
Should crawlers hit the same web site as fast as possible as a strategy to crawl faster?
No
Consider the following pseudo code of an indexer that takes a sequence of documents, all of which have a URL d.url. We want to associate that URL with an integer, so as to use the integer in the postings. What is the complexity of this algorithm with respect to N = number of documents?
O(N²)
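The card's pseudocode is not reproduced above. A hypothetical sketch with the same quadratic behavior, assuming the indexer assigns integer ids by scanning a plain list of previously seen URLs:

    url_ids = []  # a document's integer id is its position in this list

    def get_doc_id(url):
        # the membership test and index() each scan the list: O(N) per document,
        # hence O(N^2) over N documents
        if url not in url_ids:
            url_ids.append(url)
        return url_ids.index(url)

    # Swapping the list for a dict {url: id} makes each lookup O(1) amortized,
    # bringing the whole pass down to O(N).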
In Boolean retrieval, a query that ANDs three terms results in having to intersect three lists of postings. Assume the three lists are of size n, m, q, respectively, each being very large. If you keep the lists unsorted, what best approximates the complexity of a 3-way intersection algorithm?
O(n × m × q)
What are HTTP status codes 2xx?
Page retrieved successfully
What are HTTP status codes 3xx?
Redirection
Which of the following methods can you use directly to detect pages or documents that are near duplicates?
Simhash, fingerprint
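A compact simhash sketch (Python; truncated md5 as the 64-bit token hash is an illustrative choice, not the canonical one):

    import hashlib

    def simhash(tokens, bits=64):
        # each token's hash casts a +1/-1 vote on every bit position
        v = [0] * bits
        for tok in tokens:
            h = int(hashlib.md5(tok.encode()).hexdigest(), 16) & ((1 << bits) - 1)
            for i in range(bits):
                v[i] += 1 if (h >> i) & 1 else -1
        # the sign of each vote becomes one bit of the fingerprint
        return sum(1 << i for i in range(bits) if v[i] > 0)

    def hamming(a, b):
        # near-duplicate documents yield fingerprints at small Hamming distance
        return bin(a ^ b).count("1")

    d1 = "deep fry turkey is the bomb turkey".split()
    d2 = "deep fry turkey is the best turkey".split()
    print(hamming(simhash(d1), simhash(d2)))  # small value for near-duplicates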
What characterizes most queries on Web search engines?
Spelling mistakes
Short (just a few words)
Use of special characters and tokens with specific meaning (e.g. 5+3)
Within the architecture of a search engine, Homework 1 (tokenization) pertains to:
Text transformation
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What would happen if line #14 was removed?
The crawl would never end
What contributes to the relevance of a document with respect to a query in the context of a search engine?
The geographic location of the person who's querying
Textual similarity
The popularity of the document
The author of the document
Prior queries made by the same user
The geographic origin of the document
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What is going on in line #7?
The next URL is downloaded
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What is going on in line #3?
The next URL to be downloaded is picked from the frontier
How is advertisement integrated with Web search?
The user's query goes both to the search engine and the ad engine; the search engine retrieves the most relevant results, the ad engine uses an auction system on the query words.
#1  def run(self):
#2      while True:
#3          tbd_url = self.frontier.get_tbd_url()
#4          if not tbd_url:
#5              self.logger.info("Frontier is empty. Stopping Crawler.")
#6              break
#7          resp = download(tbd_url, self.config, self.logger)
#8          self.logger.info(
#9              f"Downloaded {tbd_url}, status <{resp.status}>, "
#10             f"using cache {self.config.cache_server}.")
#11         scraped_urls = scraper(tbd_url, resp)
#12         for scraped_url in scraped_urls:
#13             self.frontier.add_url(scraped_url)
#14         self.frontier.mark_url_complete(tbd_url)
#15         time.sleep(self.config.time_delay)
What would happen if line #15 was removed?
This crawler would crawl much faster
This crawler would be impolite
Consider the pseudo code of a simple in-memory indexer:
This indexer processes all documents it is given
This indexer will not work for sufficiently large data
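The card's pseudocode is not reproduced above; a minimal sketch of a simple in-memory indexer consistent with those two properties (all names are assumptions):

    from collections import defaultdict

    def index_documents(docs):
        # docs: iterable of (doc_id, text) pairs
        index = defaultdict(list)  # term -> list of doc ids
        for doc_id, text in docs:  # processes every document it is given
            for term in set(text.split()):
                index[term].append(doc_id)
        # the whole index lives in RAM, so this fails once the collection
        # outgrows the available memory
        return index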
Can we build indexes that are larger than the available memory?
We can offload the in-memory hashtable to files every so often, and merge those files at the end.
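A sketch of that offload-and-merge idea (Python; the file format, threshold, and function names are illustrative assumptions):

    import heapq
    from collections import defaultdict

    partial, partial_files, LIMIT = defaultdict(list), [], 100_000

    def add(term, doc_id):
        partial[term].append(doc_id)
        if len(partial) >= LIMIT:  # in-memory table considered "full"
            flush()

    def flush():
        # dump the in-memory table as sorted "term<TAB>postings" lines
        name = f"partial_{len(partial_files)}.idx"
        with open(name, "w") as f:
            for term in sorted(partial):
                f.write(f"{term}\t{','.join(map(str, partial[term]))}\n")
        partial_files.append(name)
        partial.clear()

    def merge(out_name):
        # stream-merge the sorted partial files (call flush() once more first),
        # so the final index never has to fit in memory
        files = [open(name) for name in partial_files]
        with open(out_name, "w") as out:
            last, merged = None, []
            for line in heapq.merge(*files):
                term, postings = line.rstrip("\n").split("\t")
                if last is not None and term != last:
                    out.write(f"{last}\t{','.join(merged)}\n")
                    merged = []
                last = term
                merged.extend(postings.split(","))
            if last is not None:
                out.write(f"{last}\t{','.join(merged)}\n")
        for f in files:
            f.close()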
Besides Web crawling, what other ways are there to obtain data from Web sites?
Web APIs provided by certain Web sites
Targeted downloads of specific URLs
Data dumps provided by companies and organizations
In a Web search engine, the Text Acquisition system consists of _____
Web crawlers
Good matches for search engines
a) Find out what is going on today at UCI
c) Find how to split words in Python
d) What is the weather like in Bali?
Good matches for relational databases
b) Find all female students whose last name is Smith
f) What were the temperature and humidity values registered in Crystal Cove between 9/1/2019 and 10/1/2019?