Define Accuracy
(tp + tn) / (tp + tn + fp + fn)
PageRank: the damping factor value and its effect:
- If too high, more iterations are required for convergence.
- If too low, the values repeatedly overshoot, both above and below the average; the numbers swing like a pendulum and never settle down.
As a website grows and adds more pages with more links to web pages outside of the website, how is the total PageRank of the website affected?
The site's PageRank begins to leak out through the external links, so the total PageRank of the website decreases.
What is the Soundex Algorithm?
a phonetic algorithm for indexing names by their sound when pronounced in English.
Zipf's law
A power law with c = -1: the frequency of any word is inversely proportional to its rank in the frequency table.
Define "stop words" and provide three examples
Words so common that they provide no information; they are typically removed before indexing. Examples: a, the, be, by, do.
Define Edit distance
the distance between two strings is the smallest number of insertions and deletions of single characters that will convert one string into the other
State Zipf's Law
the frequency of any word is inversely proportional to its rank in the frequency table
Define "stemming".
Reducing words to their morphological roots
Name the 3 major phases that a search engine goes through:
1) Spider (a.k.a. crawler/robot) - builds the corpus
2) Indexer - creates inverted indexes
3) Query processor - serves query results
A cryptographic hash function has four main properties
1. It is extremely easy (i.e., fast) to calculate a hash for any given data.
2. It is extremely computationally difficult to find an alphanumeric text that has a given hash.
3. A small change to the text yields a totally different hash value.
4. It is extremely unlikely that two slightly different messages will have the same hash.
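A minimal Python sketch of property 3 (a one-character change yields a totally different digest), using the standard library's hashlib; the input strings are made up for illustration:

import hashlib

# A one-character change produces a completely different digest.
for text in ("information retrieval", "information retrieval"):
    print(text, "->", hashlib.sha256(text.encode("utf-8")).hexdigest()[:16], "...")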
A distance measure must satisfy 4 properties
1. No negative distances: D(x,y) >= 0
2. D(x,y) = 0 iff x = y
3. D(x,y) = D(y,x) (symmetry)
4. D(x,y) <= D(x,z) + D(z,y) (triangle inequality)
Heaps' Law conclusions
1. The dictionary size continues to increase with more documents in the collection, rather than reaching a maximum vocabulary size.
2. The dictionary size is quite large for large collections.
Define the term-document incidence matrix
A sparse matrix whose rows represent terms and whose columns represent documents. An entry is 1 if the document contains the term and 0 otherwise.
A study of how to design a web page crawler to locate the best quality pages was done by Cho and Garcia-Molina. What measure of quality did they use? What algorithm did they determine would produce the highest quality pages in the shortest time?
They used PageRank as the measure of page quality (importance). They found that ordering the crawl by (partial) PageRank, i.e., crawling the pages with the highest estimated PageRank first, produced the highest-quality pages in the shortest time.
Do Google, Yahoo and Bing track ALL clicks by users, or just clicks on ads?
All clicks by users
power law
An equation of the form y = kx^c
Two parts of an inverted index
Dictionary and postings
Suppose there are only two web pages, each with only one link that points to the other web page. What will be the PageRank of each page?
Each will have a score of 1: over many iterations the PageRank algorithm converges to 1 for both pages, as the sketch below shows.
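A minimal Python sketch of that convergence, using the PageRank formula given later in this deck (PR(A) = (1-d) + d * PR(T)/C(T)) with d = 0.85; the starting values are arbitrary:

# Two pages A and B, each linking only to the other, so C(A) = C(B) = 1.
d = 0.85
pr_a, pr_b = 0.5, 1.5   # arbitrary starting values
for _ in range(50):
    pr_a, pr_b = (1 - d) + d * pr_b, (1 - d) + d * pr_a
print(pr_a, pr_b)       # both converge to 1.0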
Porter's algorithm
For a given set of rules, only the rule with the longest matching suffix is applied. For example, among the Step 1a rules (SSES -> SS, IES -> I, SS -> SS, S -> null), "caresses" stems to "caress" because SSES is the longest suffix that matches.
Write out the 3-grams for the phrase: "Fourscore and seven years ago our fathers brought forth a nation"
Fourscore and seven, and seven years, seven years ago, years ago our, ago our fathers, our fathers brought, fathers brought forth, brought forth a, forth a nation
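A small Python sketch that generates these word-level n-grams (the function name is illustrative):

def ngrams(text, n=3):
    """Return the word-level n-grams of a phrase."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "Fourscore and seven years ago our fathers brought forth a nation"
for g in ngrams(phrase):
    print(g)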
What are the names of the Google and Yahoo! web crawlers?
Google - Googlebot; Yahoo - Yahoo! Slurp; Bing - Bingbot, Adidxbot, MSNBot, BingPreview
Heaps' Law
Heaps' law describes the number of distinct words V in a set of documents as a function of the total number of words n. If V is the size of the vocabulary and n is the number of words: V = K * n^β, with constants K and 0 < β < 1. Typical constants: K ≈ 10-100, β ≈ 0.4-0.6 (approximately square-root growth).
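A quick Python illustration of how V grows with n under Heaps' law; K = 44 and β = 0.49 are assumed values inside the typical ranges above:

# Vocabulary size keeps growing with collection size and never plateaus.
K, beta = 44, 0.49
for n in (1_000, 100_000, 10_000_000):
    print(f"n = {n:>10,}  ->  V ≈ {K * n ** beta:,.0f}")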
Kleinberg's "Authoritative Sources in a Hyperlinked Environment" was covered in the Essential Papers section of the course and in lecture material. Consider Twitter as a set of hyperlinked resources - would you consider Twitter an authority or a hub in today's world? Explain your answer
Hub. Twitter mostly points outward: tweets aggregate links to many other resources across the web rather than serving as the definitive source of content on a topic, which matches Kleinberg's definition of a hub (a page that links to many authorities).
Where is the file in your answer above (robots.txt) located?
In the root directory of the website.
Cho and Garcia-Molina describe strategies for distributed crawlers and they define three different ways the crawlers can interact: independent, dynamic assignment, or static assignment. In a few words define each
Independent: no coordination; every process follows its own extracted links.
Dynamic assignment: a central coordinator dynamically divides the web into small partitions and assigns each partition to a process.
Static assignment: the web is partitioned and assigned without a central coordinator before the crawl starts.
The format of an inverted index
Index term, its document frequency (the number of documents containing the term), and a postings list of (document id, term frequency) pairs.
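A minimal Python sketch of building that structure from a toy collection (the documents are made up):

from collections import defaultdict, Counter

# term -> postings list of (doc_id, term frequency); df = len(postings).
docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "new home sales rise"}

index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term, tf in Counter(text.split()).items():
        index[term].append((doc_id, tf))

for term, postings in sorted(index.items()):
    print(term, len(postings), postings)   # term, df, postings list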
Given sets S and T, define the Jaccard Similarity of S and T.
Jaccard similarity = |S ∩ T| / |S ∪ T|
Jaccard distance = 1 - |S ∩ T| / |S ∪ T|
Euclidean distance: d([x1,...,xn], [y1,...,yn]) = sqrt(Σ_{i=1..n} (xi - yi)^2)
S(D,w) is the set of shingles of a document D of width w
Resemblance(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|
Containment(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w)|
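A small Python sketch computing w-shingles and their Jaccard similarity (the example sentences are made up):

def shingles(text, w=3):
    """S(D, w): the set of word shingles of width w."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(s, t):
    return len(s & t) / len(s | t)

a = shingles("a rose is a rose is a rose")
b = shingles("a rose is a flower which is a rose")
print(round(jaccard(a, b), 3))   # 3 shared trigrams out of 7 -> 0.429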
Name three reasons that it is important to detect mirrors during deduplication
- Mirroring is the single largest cause of duplication on the web
- Saves resources (on the crawler end as well as the remote host)
- Increases crawler politeness
- Reduces the analysis that a crawler will have to do later
Why use log-log plot for Zipf's law
On a log-log plot, power laws give a straight line with slope c.
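The one-line derivation: taking logs of the power law y = k x^c gives log y = log k + c log x, which is the equation of a straight line in (log x, log y) with slope c; for Zipf's law the slope is -1.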
PageRank formula
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where T1...Tn are the pages that point to page A (i.e., its citations), C(A) is the number of links going out of page A, and d is a damping factor that can be set between 0 and 1 (usually 0.85).
Recall and Precision are two measures of the effectiveness of an Information Retrieval system. If A is the number of relevant records retrieved, B is the number of relevant records NOT retrieved, and C is the number of irrelevant records retrieved, define Recall and Precision in terms of A, B, and C.
Recall = #(relevant items retrieved) / #(all relevant items) = tp/(tp + fn) = A / (A + B)
Precision = #(relevant items retrieved) / #(all retrieved items) = tp/(tp + fp) = A / (A + C)
Accuracy = (tp + tn) / (tp + fp + fn + tn)
Error = 1 - Accuracy
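A minimal Python sketch of these definitions; D here stands for the irrelevant records not retrieved (tn), and the counts are made up:

# A = tp (relevant retrieved), B = fn (relevant not retrieved),
# C = fp (irrelevant retrieved), D = tn (irrelevant not retrieved).
def recall(A, B):     return A / (A + B)
def precision(A, C):  return A / (A + C)
def accuracy(A, B, C, D):  return (A + D) / (A + B + C + D)

print(recall(40, 10), precision(40, 20), accuracy(40, 10, 20, 30))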
When a search engine crawls the web and visits a website for the first time, what is the first file the crawler should look for? What does robots.txt do?
robots.txt. The website announces what may (and may not) be crawled by placing a robots.txt file in its root directory.
Example - no robot visiting this domain should visit any URL starting with "/yoursite/temp/":
User-agent: *
Disallow: /yoursite/temp/
Example using '*' wildcards:
User-agent: Slurp
Allow: /public*/
Disallow: /_print.html
Disallow: /*?sessionid
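Python's standard library can parse these rules; a small sketch against the first example above (example.com is a placeholder host):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /yoursite/temp/"])
print(rp.can_fetch("*", "http://example.com/yoursite/temp/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))               # True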
The HITS Algorithm developed by Jon Kleinberg identifies two types of web pages that have special significance. What are these two types of web pages?
Some pages, known as hubs, serve as large directories of information on a given topic. Other pages, known as authorities, serve as the page(s) that best describe the information on a given topic.
Soundex Algorithm
Soundex is a phonetic algorithm for indexing names by their sound when pronounced in English.
The terms "TF" and "IDF" are used in information retrieval. What do the terms stand for?
TF - Term Frequency: tf_ij = f_ij / max{f_ij}
IDF - Inverse Document Frequency: df_i = document frequency of term i = number of documents containing term i (df_i is of course always <= N, the total number of documents); idf_i = log2(N / df_i)
Worked example (note: the numbers below use the natural log rather than log2): a document contains 3 terms with frequencies A(3), B(2), C(1); the collection contains 10,000 documents, and the document frequencies of these 3 terms are A(50), B(1300), C(250). Then:
A: tf = 3/3; idf = ln(10000/50) = 5.3; tf.idf = 5.3
B: tf = 2/3; idf = ln(10000/1300) = 2.0; tf.idf = 1.3
C: tf = 1/3; idf = ln(10000/250) = 3.7; tf.idf = 1.2
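A short Python sketch that reproduces the worked example (tiny differences from the card's numbers are rounding):

import math

N = 10_000                                  # documents in the collection
terms = {"A": (3, 50), "B": (2, 1300), "C": (1, 250)}  # term -> (f, df)
max_f = max(f for f, _ in terms.values())

for term, (f, df) in terms.items():
    tf = f / max_f
    idf = math.log(N / df)                  # the example uses the natural log
    print(term, round(tf, 2), round(idf, 1), round(tf * idf, 1))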
In an inverted index why is it important to use numeric identifiers as opposed to URLs? Explain your answer
For compactness: numeric identifiers are fixed-size and far shorter than URL strings, so the postings lists take much less space and are faster to process; a separate table maps the identifiers back to URLs.
What is de-duplication and give two examples of why it needs to be done.
The process of identifying and avoiding essentially identical web pages. With respect to web crawling, de-duplication refers to identifying identical and near-identical web pages and indexing only a single version to return as a search result. The technique improves storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent.
Why it needs to be done:
- Smarter crawling: avoid returning many duplicate results to a query; allow fetching from the fastest or freshest server
- Better connectivity analysis: combine in-links from the multiple mirror sites to get an accurate PageRank (measure of importance); avoid double-counting out-links
- Redundancy in result listings: if one copy fails, you can try <mirror>/samepath
- Reduced crawl time: crawlers need not crawl pages that are identical or near-identical. Ideally, given the web's scale and complexity, priority goes to content that has not already been seen before or has recently changed. This saves resources (on the crawler end as well as the remote host), increases crawler politeness, and reduces the analysis the crawler will have to do later.
Examples:
1. The same website under a different host name
2. The same website with different ads
Define tokenization, token, and token normalization
Tokenization is the task of chopping a document unit into pieces, called tokens, possibly throwing away certain characters. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit. Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens.
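A minimal Python sketch of the three ideas together, with a deliberately simple regex tokenizer (an illustration, not the course's exact tokenizer):

import re

def tokens(text):
    # Tokenization: chop the text into pieces, throwing away punctuation.
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    # Normalization: canonicalize tokens (here, just case folding).
    return token.lower()

print([normalize(t) for t in tokens("U.S.A. isn't the same as USA?")])
# ['u', 's', 'a', 'isn', 't', 'the', 'same', 'as', 'usa']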
True or false: Google does auto completion and spelling correction at the same time.
True
Define Cloaking
When the web server detects a request from a crawler, it returns a different page than it would return for a user request.
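A toy Python illustration of the idea; the handler and page strings are hypothetical, not a real server API:

# The server branches on the User-Agent header and returns different content.
CRAWLER_AGENTS = ("Googlebot", "Bingbot", "Slurp")

def serve(user_agent: str) -> str:
    if any(bot in user_agent for bot in CRAWLER_AGENTS):
        return "keyword-stuffed page shown only to crawlers"
    return "the page real visitors see"

print(serve("Mozilla/5.0"))
print(serve("Googlebot/2.1"))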
Can a web page author claim his page is copyrighted if he forgets to insert a "Copyright ©" notice statement on the page?
Yes; a copyright notice is no longer required for a page to be protected by copyright.
Define "case folding".
converting all uppercase letters to lower case
What effect does the following line in a web page have? <meta name=robots content="noindex,nofollow">
don't index this page and don't follow the links
Google offers a variety of special operators that can be used to narrow a search. Define the following ones: filetype, site, allinanchor
filetype - restricts the search results by file type extension.
site - restricts the results to websites in a given domain.
allinanchor - restricts results to pages containing all the query terms you specify in the anchor text of links to the page.
Related operators:
daterange - limits the search to a particular date or range of dates on which a page was indexed by Google.
inanchor - restricts the results to pages containing the query terms you specify in the anchor text of links to the page.
intext - ignores link text, URLs, and titles, and only searches body text.
intitle - restricts the results to documents containing a particular word in the title.
inurl - restricts the results to documents containing a particular word in the URL.
cache - shows the version of a web page that Google has in its cache.
link - restricts the results to web pages that have links to the specified URL.
related - lists web pages that are "similar" to a specified web page.
info - presents some information that Google has about a particular web page.
stocks - treats the rest of the query terms as stock ticker symbols and links to a finance page showing stock information for those symbols.
phonebook - searches the entire Google phonebook; rphonebook - residential listings only; bphonebook - business listings only. (As of 2010, Google's phone book feature has been officially retired; the phonebook: and rphonebook: operators were dropped after many complaints about privacy violations.)
Compute sketchD[i] (the min-hash sketch of document D)
- Compute the document's shingles and then the hash values of those shingles
- Map the hash values to 1..2^m; call the result f(s)
- Let p_i be a random permutation on 1..2^m
- sketchD[i] = MIN{ p_i(f(s)) } over all shingles s in D
- Do the above for 200 different permutations
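A small Python sketch of this procedure; the 200 random permutations are approximated with 200 salted hash functions, which is an implementation assumption, not the exact course method:

import hashlib, random

def minhash_sketch(shingle_set, num_perms=200, seed=42):
    """sketch[i] = min over all shingles of the i-th 'permuted' hash value."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perms)]
    sketch = []
    for salt in salts:
        sketch.append(min(
            int.from_bytes(hashlib.md5(f"{salt}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set))
    return sketch

a = {"a rose is", "rose is a", "is a rose"}
print(minhash_sketch(a)[:5])   # first 5 of the 200 sketch entries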