IR

What is the title of the textbook for this class?

Introduction to Information Retrieval

Google was successfully sued by the United States Federal government for offering ads by Canadian pharmacies. What did Google do that was wrong?

ads from rogue pharmacies were still appearing on Google despite a change in advertising policy

When Google must decide how to order the ads for a given query phrase, what formula does it use?

CTR * Bid Amt
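
The ordering rule above can be sketched in a few lines. This is only an illustration of sorting by CTR * bid; the ad names, click-through rates, and bids are made-up values, and real ad ranking involves further factors not covered in this card.

```python
def rank_ads(ads):
    """Order ads by CTR * bid, highest score first."""
    return sorted(ads, key=lambda ad: ad["ctr"] * ad["bid"], reverse=True)

# Hypothetical ads: CTR as a fraction, bid in dollars.
ads = [
    {"name": "A", "ctr": 0.02, "bid": 1.00},  # score 0.020
    {"name": "B", "ctr": 0.05, "bid": 0.50},  # score 0.025
    {"name": "C", "ctr": 0.01, "bid": 1.50},  # score 0.015
]
ranked = [ad["name"] for ad in rank_ads(ads)]
print(ranked)
```

Note that ad B outranks ad A despite the lower bid, because its higher click-through rate gives it the larger product.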

Define "stemming".

Reducing words to their morphological roots

Approximately how many websites are there? (circle the best answer): - 1 million - 10 million - 100 million - 1 billion - 10 billion - 100 billion

1 billion

What might a search engine do to quickly check that a newly discovered web page is identical to one that was already seen and indexed?

1) Produce fingerprints and test for similarity 2) Produce shingles and check from random positions in the document.

Name the 3 major phases that a search engine goes through:

1) Spider (a.k.a. crawler/robot) - builds corpus 2) The indexer - creates inverted indexes 3) Query processor - serves query results

Google first appeared on the web in (circle the best answer): - 1998 - 2004 - 2010

1998

In one sentence define: Dendrogram

A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.

We looked at two algorithms for classifying documents into groups. What are they called?

Agglomerative Clustering, K-means Clustering

Do Google, Yahoo and Bing track ALL clicks by users, or just clicks on ads?

All clicks by users

In the class we have often mentioned Google, Yahoo, and Bing as the major web search engines. However, others have also been mentioned. Name three:

AltaVista, Lycos, DuckDuckGo

What is Hadoop?

Apache Hadoop is an open-source implementation of map/reduce written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

Kleinberg's "Authoritative Sources in a Hyperlinked Environment" was covered in the Essential Papers section of the course and in lecture material. Consider Twitter as a set of hyperlinked resources - would you consider Twitter an authority or a hub in today's world? Explain your answer

Authority since each hyperlinked resource would contain information

What is the second largest web search engine in terms of revenue?

Baidu

How many characters does BING wait before auto suggesting a keyword to search on?

Bing does not even wait for the first character to be entered, they make use of previous queries and enter some possibilities before the user even types a single character

In the Google AdWords system what does CTR stand for?

Click Through Rate

Two improper techniques used to enhance a web page's ranking in search results are cloaking and page jacking. Define them both.

Cloaking: a technique in which the content presented to the search engine spider is different from that presented to the user's browser. Page jacking (plagiarized content): stealing content from a web site and copying it into another web site in order to siphon some of the original site's traffic to the copied pages.

Define "case folding".

Converting text to lower case.

True or False? Google, Yahoo, and Bing record all user clicks, both on ads and on organic search results.

True

True or False, Page rank is calculated for each search query and highly relevant pages will always have a higher page rank?

False. Page Rank is calculated prior to query and stored in index. Relevant documents are retrieved and then ranked as per relevance score. It is not necessary that all relevant pages will be shown at the top.

Write out the 3-grams for the phrase: "Fourscore and seven years ago our fathers brought forth a nation"

Fourscore and seven; and seven years; seven years ago; years ago our; ago our fathers; our fathers brought; fathers brought forth; brought forth a; forth a nation
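
A minimal sketch of how word-level n-grams like these can be generated, assuming tokens are split on whitespace only (the function name `word_ngrams` is made up for this example):

```python
def word_ngrams(text, n=3):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

grams = word_ngrams(
    "Fourscore and seven years ago our fathers brought forth a nation"
)
for gram in grams:
    print(gram)
```

A phrase of 11 words yields 11 - 3 + 1 = 9 three-word grams.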

What are the names of the Google and Yahoo! web crawlers?

Google - Googlebot; Yahoo - Yahoo! Slurp; Bing - Bingbot, Adidxbot, MSNbot, BingPreview

Google has two programs for advertisers, one that places ads next to search engine results and one that places ads on a website. Name both of the programs.

Google AdSense is used for placing ads on a website. Google AdWords is used for placing ads next to search engine results.

In one sentence define Heaps' Law

Heaps' law describes the number of distinct words (V) in a set of documents as a function of the total length of the documents in tokens (n): V = K * n^Beta

What is the difference between hard clustering and soft clustering?

In soft clustering a point can belong to more than one cluster, whereas in hard clustering it can belong to only one.

Where is the file in your answer above located?

In the root folder

What is a way to guarantee that an advertiser's ad will appear at the top of a Google or Bing results page?

Increase the bid amount exponentially.

Cho and Garcia-Molina describe strategies for distributed crawlers and they define three different ways the crawlers can interact: independent, dynamic assignment, or static assignment. In a few words define each

Independent: no coordination, every process follows its extracted links. Dynamic assignment: a central coordinator dynamically divides the web into small partitions and assigns each partition to a process. Static assignment: the web is partitioned and assigned without a central coordinator before the crawl starts.

A cryptographic hash function of file X has three main properties: it is easy to compute, it is difficult to find a file that has the same hash value, and what is the third property?

It is extremely computationally difficult to calculate an alphanumeric text that has a given hash.

Given sets S and T, define the Jaccard Similarity of S and T.

Jaccard Similarity = size(S Intersect T) / size(S Union T)
Jaccard Distance = 1 - size(S Intersect T) / size(S Union T)
Euclidean distance: d([x1,...,xn], [y1,...,yn]) = sqrt(Sum over i = 1...n of (xi - yi)^2)
S(D,w) is the set of shingles of width w of a document D
Resemblance(A, B) = size(S(A,w) Intersect S(B,w)) / size(S(A,w) Union S(B,w))
Containment(A, B) = size(S(A,w) Intersect S(B,w)) / size(S(A,w))
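
The set formulas above translate almost directly into code. The sketch below assumes word-level shingles split on whitespace; the function names and the two sample strings are made up for illustration.

```python
def jaccard(s, t):
    """Jaccard similarity: |S intersect T| / |S union T|."""
    s, t = set(s), set(t)
    if not s and not t:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(s & t) / len(s | t)

def shingles(text, w):
    """Set of word shingles of width w for a document."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w):
    """Jaccard similarity of the two documents' shingle sets."""
    return jaccard(shingles(a, w), shingles(b, w))

def containment(a, b, w):
    """Fraction of A's shingles that also occur in B."""
    sa = shingles(a, w)
    return len(sa & shingles(b, w)) / len(sa) if sa else 0.0

print(jaccard({1, 2, 3}, {2, 3, 4}))
print(resemblance("a b c d", "a b c e", 2))
print(containment("a b c d", "a b c e", 2))
```

Jaccard distance is then simply `1 - jaccard(s, t)`.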

List the four main features/functions that Apache Tika provides.

Language Detection, Document Type/MIME Type, Content Detection, Metadata Extraction.

Given a set of N queries and AVGPrec(N) the average precision of each query, what is the formula for the Mean Average Precision?

MAP = (sum of AVGPrec(n) for n = 1...N) / N
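
A small sketch of the calculation. It assumes each query's result list is given as a list of booleans (True = relevant) and that every relevant document for the query appears somewhere in that list, so average precision is divided by the number of relevant results seen. The function names and sample data are made up for this example.

```python
def average_precision(ranked_relevance):
    """Average of precision@k taken at each rank k holding a relevant result."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)

queries = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]
map_score = mean_average_precision(queries)
print(map_score)
```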

Name a cryptographic hashing method:

MD5

Name one of the authors of the textbook

Manning, Raghavan, Schütze

Define Spearman's footrule distance for two lists of n items without using mathematical symbols.

The sum, over all items, of the absolute difference between an item's position in the first ranked list and its position in the second.
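
As an illustration, the footrule distance can be computed as below. The sketch assumes both lists rank exactly the same items with no ties; the function name is made up.

```python
def spearman_footrule(a, b):
    """Sum of absolute position differences of each item across two rankings."""
    pos_in_b = {item: i for i, item in enumerate(b)}
    return sum(abs(i - pos_in_b[item]) for i, item in enumerate(a))

# "x" moves 2 places, "y" stays put, "z" moves 2 places.
distance = spearman_footrule(["x", "y", "z"], ["z", "y", "x"])
print(distance)
```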

Name three reasons that it is important to detect mirrors during deduplication

Mirroring is the single largest cause of duplication on the web; detecting mirrors saves resources (on the crawler end, as well as the remote host); it increases crawler politeness; it reduces the analysis that a crawler will have to do later.

Name the three types of spelling errors:

Non-word errors, typographical errors, cognitive errors

Which content type is NOT indexed by Google? (circle the best answer): - swf - xlsx - rtf - svg - None of the above as all of them are indexed

None of the above as all of them are indexed

When investigating click fraud, there are both online tests and offline tests. Give an example of: i) an online test. ii) an offline test.

Online Test: these test cases are checked as the clicks are registered. This can be from blocked ips and can be checked in real-time. Offline Test: done usually on a daily basis. Clicks from the same ip or rapid clicks within short intervals.

Several heuristic techniques were presented for speeding up the computation of search results. Mention two of them:

1) Only consider high-idf query terms 2) Only consider docs containing many (or all) of the query terms

What is a "parked domain"?

Parked domains are often used by businesses that want to have more than one web address for advertising purposes. Parked domains are additional domains hosted on your account which display the same website as your primary domain and share web statistics as well; however, you can give the parked domain its own email boxes.

In class you saw the Block Sort-Based Indexing algorithm. What was the algorithm attempting to minimize?

The number of disk seeks during sorting: each block is sorted in memory, and the sorted blocks are then merged on disk.

Stemming is a method for normalizing tokens. What is the name of a famous stemming algorithm?

Porter's Stemming Algorithm

What formula or law is expressed this way? log(y) = log(k * x^c)

Power Law

Define "stop words" and provide three examples

Stop words are words so common in a language that they carry little information for search queries; they are filtered out before or after natural language processing. Examples: a, the, be, by, do

When a search engine crawls the web and visits a website for the first time, what is the first file the crawler should look for?

robots.txt
Example - no robot visiting this domain should visit any URL starting with "/yoursite/temp/":
User-agent: *
Disallow: /yoursite/temp/
Example of the '*' wildcard:
User-agent: Slurp
Allow: /public*/
Disallow: /*_print*.html
Disallow: /*?sessionid

The HITS Algorithm developed by Jon Kleinberg identifies two types of web pages that have special significance. What are these two types of web pages?

Hubs: pages that serve as large directories of information on a given topic. Authorities: pages that best describe the information on a given topic.

How is the failure of a Map worker handled in the MapReduce framework?

The compute node of a Map worker fails - This is detected by the Master and all Map tasks that were assigned are re-done - The Master sets the status of each Map task to idle and re-schedules them when a worker becomes available - The Master informs each Reduce task of the location of its new input

What is de-duplication and give two examples of why it needs to be done.

The process of identifying and avoiding essentially identical web pages. With respect to web crawling, de-duplication refers to identifying identical and nearly identical web pages and indexing only a single version to return as a search result. The technique improves storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent.
Why?
Smarter crawling - avoid returning many duplicate results to a query; allow fetching from the fastest or freshest server.
Better connectivity analysis - combine in-links from the multiple mirror sites to get an accurate PageRank; avoid double counting out-links.
Add redundancy in result listings - "If that fails you can try: <mirror>/samepath".
Examples: 1. Same website with a different host name 2. Same website with different ads

Define tokenization

The task of chopping a document unit into pieces, called tokens, and possibly throwing away certain characters.

Inverted indices typically include for each term a list of documents where the term occurs. Some inverted indices also include the position of a word in a document which adds a lot of additional space. What advantage is gained by including the position of a word?

Positions make phrase queries possible: the engine can check that the query terms occur next to each other in a document. Bigrams can also be used for phrase queries, but storing positions takes less space than indexing bigrams.

Define token

A token is a unit of text, typically a word, produced by tokenization according to a set of rules.

Lucene/Solr uses two methods for ranking results. What are they?

Vector Space Model (VSM) of Information Retrieval and the Boolean model

What format does Apache Tika use to represent and return the parsed content of a document stream?

XHTML

Suppose the Pepsi Cola company wants to bid on the words Coca Cola whenever they are entered as a query, so a Pepsi Cola ad will appear. Is this legal?

Yes, as per the Rosetta Stone case this is exactly what happened.

Can a web page author claim his page is copyrighted if he forgets to insert a "Copyright ©" notice statement on the page?

Yes, it is no longer required.

Name the four types of protection for intellectual property.

copyright (for literary works, art, and music) patents (for inventions and processes) trademarks (for company and product names and logos) trade secrets (for recipes, code, and processes).

David Filo and Jerry Yang are (circle the best answer): - founders of Google - creators of spreadsheets - founders of Yahoo

founders of Yahoo

crawler4j was downloaded from what online source repository?

github.com

What is Google's reason for not telling an advertiser why each and every click was marked as valid?

if Google discloses this information, it opens itself to click fraud on a massive scale because, by doing so, it provides certain hints about how its invalid click detection methods work

In the formula for Discounted Cumulative Gain, how are documents appearing lower in a search result list penalized?

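Each document's gain is divided by the logarithm of its rank (for example rel_i / log2(i + 1) at rank i), so the lower a document appears in the list, the more its contribution is discounted.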
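
A short sketch of the discounting. It uses one common formulation, gain divided by log2(rank + 1); other variants leave rank 1 undiscounted and divide by log2(rank) from rank 2 onward. The relevance grades are made-up values.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )

# The same relevant document counts for less when it appears lower down.
print(dcg([1, 0]))
print(dcg([0, 1]))
```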

Given two sequences of length n, what is their maximum Kendall Tau distance?

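n(n-1)/2, reached when one sequence is the reverse of the other, so that every pair of items is in disagreement.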

Define the term-document incidence matrix

A sparse matrix where rows represent terms and columns represent documents. The value in the matrix is 1 if the document contains the term otherwise zero.

When an advertiser bids on a set of keywords he can ask for 3 different types of matches. Mention two of them and define them.

1. Broad Match - Broad match just matches the occurrence of the terms in the text. 2. Exact Match - This is used for exact match in keyword. This means only when the user enters the exact query that the ad should appear. 3. Phrase Match - Your ad appears when users search on the exact phrase 4. Negative Keyword - Negative keywords allow you to eliminate searches that you know are not related to your message

Distance measures are defined by 4 properties: 1. no distances are negative 2. d(x,y) = 0 iff x = y 3. d(x,y) = d(y,x) what is the fourth property?

d(x,y) <= d(x,z) + d(z,y) (the triangle inequality)

As a website grows and adds more pages with more links to web pages outside of the website, how is the total PageRank of the website affected?

It begins to distribute the score outside or leaks it outside. Page Rank Score decreases.

With respect to search engines, what does the term "relevance feedback" refer to?

Relevance feedback is a way by which the query processing can be improved after query interaction is done by the user.

What is Porter's Algorithm?

Stemming Algorithm that is used to produce stemming of words.

The terms "TF" and "IDF" are used in information retrieval. What do the terms stand for?

TF - Term Frequency: tf_ij = f_ij / max{f_ij}
IDF - Inverse Document Frequency: df_i = document frequency of term i = number of documents containing term i (df_i is always <= N, the total number of documents); idf_i = inverse document frequency of term i = log2(N / df_i)
Worked example: a document contains 3 terms with frequencies A(3), B(2), C(1). Assume the collection contains 10,000 documents and the document frequencies of these 3 terms are A(50), B(1300), C(250). The figures below use the natural logarithm rather than log2:
A: tf = 3/3; idf = ln(10000/50) = 5.3; tf.idf = 5.3
B: tf = 2/3; idf = ln(10000/1300) = 2.0; tf.idf = 1.4
C: tf = 1/3; idf = ln(10000/250) = 3.7; tf.idf = 1.2
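
The worked example can be checked with a few lines of code. One assumption to flag: the idf figures in the example match the natural logarithm rather than base 2, so the sketch uses `math.log`. Term B comes out at about 1.36 here because the idf is not rounded before multiplying.

```python
import math

N = 10_000                                  # documents in the collection
term_freq = {"A": 3, "B": 2, "C": 1}        # frequencies in the document
doc_freq = {"A": 50, "B": 1300, "C": 250}   # documents containing each term

max_freq = max(term_freq.values())
scores = {}
for term, freq in term_freq.items():
    tf = freq / max_freq                    # tf normalized by the max frequency
    idf = math.log(N / doc_freq[term])      # natural log, as in the example
    scores[term] = tf * idf
    print(f"{term}: tf = {tf:.2f}, idf = {idf:.2f}, tf.idf = {tf * idf:.2f}")
```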

In an inverted index why is it important to use numeric identifiers as opposed to URLs? Explain your answer (5 pts).

The URLs are normally replaced by numeric identifiers for compactness

When a search engine gets a query such as "what are the movie times for The Artist", how is it able to identify the local movie theaters?

Theater listings are already stored in the search engine's index, and the user's location (for example from GPS) lets it pinpoint nearby theaters.

State Zipf's Law

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.

Given the query a OR "b c" d where a, b, c, and d represent keywords being searched for, fully parenthesize the query as Google would do it. Insert any implied Boolean operators.

(a OR "b c") AND d

What is the better strategy for crawling a website? (circle the best answer): - breadth first search - depth first search

breadth first search

What effect does the following line in a web page have? <meta name=robots content="noindex,nofollow">

don't index this page and don't follow the links

What effect does the following line in a web page have? <meta name=robots content="noindex, follow, noarchive">

don't index this page, follow the links and Do not show a "Cached" link in search results.

A study of how to design a web page crawler to locate the best quality pages was done by Cho and Garcia-Molina. What measure of quality did they use? What algorithm did they determine would produce the highest quality pages in the shortest time?

Measure of quality - Coverage, Overlap, Quality, Communication. They computed the PageRank for all pages in their study, and they limited their crawl of each site to the first 3,000 pages encountered in a breadth-first search. Findings: firewall crawlers attain good, general coverage with low cost; cross-over ensures 100% quality, but suffers from overlap; replicating URLs and batch communication can reduce overhead.

An equivalent term for the word shingles is?

n gram

Describe in one sentence the search results that are returned by Google if you enter the query filetype: pdf

pdf results are returned.

What is the purpose of the Levenshtein Algorithm?

Spell Correction. It is used to measure how different two string sequences are.
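
A compact sketch of the standard dynamic-programming computation of Levenshtein (edit) distance, assuming unit cost for each insertion, deletion, and substitution. It keeps only two rows of the table at a time.

```python
def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))       # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                       # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,             # delete cs
                curr[j - 1] + 1,         # insert ct
                prev[j - 1] + cost,      # substitute (or match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
```

A spelling corrector can use this to rank dictionary words by how few edits separate them from the misspelled query term.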

Suppose one advertiser bids $1.00 for his ad to be displayed and a second advertiser bids $0.50 for his ad to be displayed and all other factors affecting ads are identical. If the first advertiser's ad is clicked on how much does he pay Google?

$0.51
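
The arithmetic behind this answer, as a sketch: it assumes a simplified second-price rule in which the winner pays one cent more than the next-highest bid and all other ranking factors are equal. Amounts are in cents to avoid floating-point rounding; the function name is made up.

```python
def price_paid_cents(bids_cents):
    """What the top bidder pays per click: next-highest bid plus one cent."""
    ordered = sorted(bids_cents, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two bids")
    return ordered[1] + 1

paid = price_paid_cents([100, 50])   # bids of $1.00 and $0.50
print(f"${paid / 100:.2f}")
```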

What is a popular data structure for organizing a lexicon that is especially useful to implement autocomplete or prefix matching?

Trie
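
A minimal trie sketch showing why the structure suits prefix matching: all words sharing a prefix sit under the same node. The class and method names and the sample lexicon are made up for this example.

```python
class Trie:
    """Each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix):
        """Return all stored words starting with prefix, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack:                      # walk the subtree under the prefix
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)

lexicon = Trie()
for word in ["search", "seattle", "solr", "sea"]:
    lexicon.insert(word)
print(lexicon.complete("sea"))
```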

Google retains a user's entire query history? (circle the best answer): - True - False

True

True or False, Google does NOT permit Boolean operators in queries?

False

True or False, Solr will return results in XML format?

True

True or false, Solr includes explicit operators for AND, OR, NOT?

True

True or false: Google does auto completion and spelling correction at the same time.

True

AltaVista, Lycos and InfoSeek are (circle the best answer): - apps for an iPhone - early web search engines - information retrieval systems

early web search engines

The definitions of Precision and Recall both have "the number of relevant items retrieved" as their numerator. What is the denominator for Recall?

# of relevant items

The definitions of Precision and Recall both have "the number of relevant items retrieved" as their numerator. What is the denominator for Precision?

# of retrieved items

Some browsers now include a feature that prevents "third-party cookies" from being placed on a browser. Name the three parties involved.

1. The browser/system from where the website is accessed 2. The website which is accessed 3. The cookies of the website which does not belong to the same domain which is actually accessed by the browser

Suppose there are only two web pages, each with only one link that points to the other web page. What will be the PageRank of each page?

Each will have a score of one as PR algorithm over large iterations will converge to one.
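
The convergence can be checked with a short iteration. The sketch assumes the unnormalized PageRank formula PR(p) = (1 - d) + d * (sum of PR(q) / outlinks(q) over pages q linking to p) with damping factor d = 0.85; the starting value of 0.5 is arbitrary.

```python
def pagerank_two_pages(d=0.85, iterations=200):
    """Two pages, each with a single link pointing to the other."""
    a = b = 0.5
    for _ in range(iterations):
        # Each page has exactly one out-link, so it passes on its full score.
        a, b = (1 - d) + d * b, (1 - d) + d * a
    return a, b

pr_a, pr_b = pagerank_two_pages()
print(round(pr_a, 6), round(pr_b, 6))
```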

What do the terms AdWords and AdCenter refer to?

Google Adwords is used for placing ads next to search engine results. Bing Ads (formerly Microsoft adCenter and MSN adCenter) is a service that provides pay per click advertising on both the Bing and Yahoo! search engines.

What is meant by Google's Universal Search?

Google Universal Search is a way that Google "blends" results from "vertical" search engines like Google Images or Google News into its web search listings.

Recall and Precision are two measures of the effectiveness of an Information Retrieval system. If A is the number of relevant records retrieved, B is the number of relevant records NOT retrieved, and C is the number of irrelevant records retrieved, define Recall and Precision in terms of A, B, and C.

Recall = #(relevant items retrieved) / #(all relevant items) = tp / (tp + fn) = A / (A + B)
Precision = #(relevant items retrieved) / #(all retrieved items) = tp / (tp + fp) = A / (A + C)
Accuracy = (tp + tn) / (tp + fp + fn + tn)
Error = 1 - Accuracy
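
In terms of the question's A, B, and C, the two measures can be computed as follows. The counts are made-up values chosen for easy arithmetic.

```python
def recall(a, b):
    """A = relevant retrieved, B = relevant NOT retrieved."""
    return a / (a + b)

def precision(a, c):
    """A = relevant retrieved, C = irrelevant retrieved."""
    return a / (a + c)

A, B, C = 30, 20, 10
print(recall(A, B))      # 30 of the 50 relevant records were found
print(precision(A, C))   # 30 of the 40 retrieved records are relevant
```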

A phonetic algorithm identifies words that sound the same but are spelled differently. Name one

Soundex Algorithm

What is the Soundex Algorithm?

Soundex is a phonetic algorithm for indexing names by their sound when pronounced in English.

What does DMCA stand for?

The Digital Millennium Copyright Act (DMCA) is a United States copyright law

Define Kendall's Tau distance in words, i.e. without using mathematical symbols

The Kendall tau rank distance is a metric that counts the number of pairwise disagreements between two ranking lists.
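
A small sketch of the pair-counting definition, assuming both lists rank the same items with no ties (the function name is made up):

```python
from itertools import combinations

def kendall_tau_distance(a, b):
    """Count item pairs that the two rankings order differently."""
    pos_in_b = {item: i for i, item in enumerate(b)}
    # combinations(a, 2) yields pairs (x, y) with x ranked before y in a.
    return sum(1 for x, y in combinations(a, 2) if pos_in_b[x] > pos_in_b[y])

items = ["w", "x", "y", "z"]
print(kendall_tau_distance(items, items))
print(kendall_tau_distance(items, list(reversed(items))))
```

Reversing a list of n items puts every pair in disagreement, which gives n(n-1)/2 pairs.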

Google and Bing both allow advertisers to restrict where their ads will be seen; the restriction can be by country, by state, by city. Name one way to accomplish this.

They can use IP address geolocation to determine the user's region. GPS location can also be used, if the user allows it, to target the ads further.

What is a "tracking pixel"?

Tracking pixels are small, typically transparent images on a web page that have special names which permit the loading of the web page to be tracked by a web server.

Google offers a variety of special operators that can be used to narrow a search. Define the following ones: filetype, site, allinanchor

filetype - restricts the search results by file type extension.
site - restricts the results to those websites in a domain.
allinanchor - restricts results to pages containing all query terms you specify in the anchor text on links to the page.
Other operators:
daterange - limits your search to a particular date or range of dates that a page was indexed by Google.
inanchor - restricts the results to pages containing the query terms you specify in the anchor text or links to the page.
intext - ignores link text, URLs, and titles, and only searches body text.
intitle - restricts the results to documents containing a particular word in the title.
inurl - restricts the results to documents containing a particular word in the URL.
cache - shows the version of a web page that Google has in its cache.
link - restricts the results to those web pages that have links to the specified URL.
related - lists web pages that are "similar" to a specified web page.
info - presents some information that Google has about a particular web page.
stocks - Google will treat the rest of the query terms as stock ticker symbols, and will link to a finance page showing stock information for those symbols.
phonebook - searches the entire Google phonebook.
rphonebook - searches residential listings only.
bphonebook - searches business listings only.
(As of 2010, Google's phone book feature has been officially retired. Both the phonebook and the rphonebook search operators have been dropped due to many complaints about privacy violations.)

