isbi qs
3) What is search engine spamming? Describe the difference between search engine spamming and search engine optimization.
Search Engine Optimization (SEO) is the process of improving the volume and quality of traffic to a web site from search engines via "natural" search results. SEO is informally divided into white-hat and black-hat SEO. White-hat SEO, ethical or "normal" SEO, means optimization activities that are not noticeable to people or to search engine algorithms. White-hat SEO implies weaving relevant keywords into the text of the document, acquiring on-topic links to the document in a "natural" way, increasing the document's Page Rank, improving the website's crawlability, etc. Search engine spamming is black-hat SEO. Its two main activities are stuffing the text with keywords so that the text no longer looks written for people, and acquiring low-quality links - off-topic links, links from link farms, too many reciprocal links, etc. Keyword spamming is noticeable to people; link spamming is easily visible to search engine link analysis algorithms.
Name five different reasons why some information on the web is not searchable by general purpose web search engines.
- The page containing the information has no incoming links, so the crawler of a search engine can never find it.
- The page may forbid crawler access via the robots exclusion protocol, which crawlers of general-purpose search engines respect.
- The page may require the user to be authorized before it is displayed; a crawler cannot log in.
- The information may be stored in a database that requires the user to fill in a complex form (choose parameters) to see the desired information. A crawler cannot fill in these forms and thus never gets to see the information.
- The information may be in an unusual data format that the crawler cannot read and therefore cannot match against any query.
What are important features in user interaction, and what is measured?
- Click-through rate and time on site - the more clicks and the more time spent on the site, the more relevant the search result is for that website.
- Bounce rate - how often searchers click the link and quickly return to the search results.
- Conversion rate (Google Analytics) - how often a visit to the website ends with a purchase, sign-up, etc.
- Returning visitors - if a visitor returns, the page is relevant to the search of the first visit.
- Social signals - the link is (re-)tweeted, liked, plussed, shared, bookmarked. The person's web-social authority matters here.
User interactions can be tracked through signed-in profiles, cookies, Google Analytics, Google Toolbar, and Google Chrome.
How can a webpage/site get a high page rank and trust?
Page Rank: In its original form, Page Rank was susceptible to link spam. Therefore, Google introduced the notion of trust: links from highly trusted seed sites transfer a part of that trust. The closer a page is to a highly trusted site in the chain of links, the higher its trust.
Topic: Google trusts links that move from page to page within the same topic. As soon as a link moves from one topic to another, things become difficult and messy; such links are more likely to be spam links.
Link diversity: Search engines love diversity; the more varied the link network is, the better it is for the individual page.
Trust - also offline. Important features in trust:
- Domain age - the older the better
- Hosting company reputation, i.e. who are your hosted neighbors?
- Postal address - is your physical address reputable?
- Phone number is a unique identifier - where is it mentioned?
- Mentions - wherever people talk about you, Google will find out whether they like you or not
- Authorship increases authority - Google wants to know who the author of a specific page is and will use this when calculating authority and trust
Link redirection means that the web server sends a new URL of the requested web document to the browser or web crawler, instead of providing the web document itself. Please give a few examples when link redirection is useful.
- Removing old pages and redirecting the user to new content in order to avoid broken links
- Redirecting to a local site (google.com -> google.se)
- Moving the site to a new domain (not recommended, but if necessary)
- Handy shortcut URL -> complete URL
- Enforcing one standard format of the link (www.sl.se -> sl.se)
- Redirecting to a secure connection (http:// -> https://)
What does a search friendly site include?
- Site map
- Robots Exclusion Protocol
- nofollow links
- Custom 404 (not found) page
- Redirect old links - 301 Moved Permanently
- Fix broken links and faulty redirections
- Unique content (no duplicates)
What are the ethics of crawling the web?
- Use well-behaved crawling - do not download very large files very often and do not send many requests to the server without waiting between them; that is, do not overload the server.
- Always do a "sniff test" - ask yourself whether it would be OK to tell your customers how you are using information from the internet for competitive intelligence.
- Always respect the Robots Exclusion Protocol - that is, respect what it says about which crawling is disallowed.
What different user identification methods are there? How do they work, and what are their advantages and disadvantages? (web usage mining)
- IP address + agent: assumes each unique IP address/agent pair is a unique user. Privacy level: low. Advantage: always available; no additional technology required. Disadvantage: not guaranteed to be unique; defeated by rotating IPs.
- Embedded session IDs: use dynamically generated pages to associate an ID with every hyperlink. Privacy level: low to medium. Advantage: always available; independent of IP addresses. Disadvantage: cannot capture repeat visitors; additional overhead for dynamic pages.
- Registration: the user explicitly logs in to the site. Privacy level: medium. Advantage: can track individuals, not just browsers. Disadvantage: many users won't register; not available before registration.
- Cookie: save an ID on the client machine. Privacy level: medium to high. Advantage: can track repeat visits from the same browser. Disadvantage: can be turned off by users.
- Software agents: a program loaded into the browser that sends back usage data. Privacy level: high. Advantage: accurate usage data for a single site. Disadvantage: likely to be rejected by users.
1. Why does the Boolean retrieval model allow selecting documents from a collection but does not allow sorting them?
A Boolean query yields two relevance values - "match" and "no match". The Boolean retrieval model does not measure "how much match".
Web Usage Mining can be divided into Data Preparation and Pattern Discovery & Analysis. 1. What steps do you take in Data Preparation?
1. Data collection: we need a suitable target data set to "mine". Types of data: usage data, content data, structure data, user data.
2. Data fusion and cleaning: with multiple web or application servers, data fusion means merging the log files. Data cleaning is typically site-specific: remove irrelevant references, remove irrelevant data fields, remove crawler references.
3. Data segmentation: the goal of web usage mining is to analyze the behavioral patterns and profiles of users interacting with a web site. Depending on the analysis, the data needs to be transformed and aggregated. Variables of interest: users and behavior (pageviews, sessions, episodes). The user's identity is not necessary, but it is necessary to distinguish between users (new visits or return visits); a user activity record is used to see which activity belongs to which user. This step also includes sessionization (the activity of one user is segmented into sessions; see the sketch after this answer) and episode identification (dividing sessions into even smaller sections, e.g. one activity such as adding an item to a cart).
4. Path completion: inferring missing user references due to caching. It requires extensive knowledge of the site structure and referrer information from server logs. Disambiguation between candidate paths is sometimes needed.
5. Data integration: integrating useful information into the set of user sessions (or episodes). This can be user data, product attributes, categories from operational databases, etc. Together with the usage data it is possible to discover important business intelligence metrics. Integrated data is often stored in a data warehouse.
6. Data modelling: modelling the data as a transaction matrix or according to some other schema with certain attributes (weights). Enriched representations include more information, such as semantic information from the content of a web page.
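Below is a minimal sketch (not from the course material) of the sessionization step, assuming each log entry has already been attributed to one user and sorted by time; the 30-minute timeout and the tuple layout are illustrative assumptions.

from datetime import datetime, timedelta

# Time-based sessionization: split one user's activity record into sessions
# whenever the gap between consecutive requests exceeds a timeout.
SESSION_TIMEOUT = timedelta(minutes=30)  # common convention, assumed here

def sessionize(activity_record):
    """activity_record: list of (timestamp, url) tuples for ONE user, sorted by time."""
    sessions = []
    current = []
    for timestamp, url in activity_record:
        if current and timestamp - current[-1][0] > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append((timestamp, url))
    if current:
        sessions.append(current)
    return sessions

# Example: two visits by the same user, separated by several hours.
log = [
    (datetime(2024, 1, 1, 10, 0), "/index.html"),
    (datetime(2024, 1, 1, 10, 5), "/products.html"),
    (datetime(2024, 1, 1, 15, 0), "/index.html"),   # a new session starts here
]
print(len(sessionize(log)))  # -> 2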
qq: What are the five tasks of topic detection and tracking (TDT)?
1. Story segmentation - to segment the news flow into different stories
2. Tracking - to follow news with similar topics
3. Detection - to group news with the same subject
4. Link detection - to find out whether two pieces of news discuss the same event
5. First story detection - to detect a completely novel piece of news in the news flow
What is the difference between syntagmatic and paradigmatic relations in semantics?
1. Syntagmatic relation: two words appear together in the same context (will group rabbit and carrot in the same topic). 2. Paradigmatic relation: two words appear in similar contexts but do not necessarily co-occur (will group carrot and lettuce in the same topic). Random indexing supports both syntagmatic and paradigmatic relations: the first uses the whole document as context, while the other uses a sliding window of a specified size. As the window size increases towards the size of the document, the two models become increasingly similar.
Web Usage Mining can be divided into Data Preparation and Pattern Discovery & Analysis. 2. What steps do you take in Pattern Discovery & Analysis?
Five different steps:
- Session and visitor analysis: basic statistical analysis (most frequently accessed pages, average view time of a page, average length of a path through the site, common entry and exit points) and Online Analytical Processing (OLAP), which allows flexible data analysis along different dimensions and at different levels of abstraction.
- Cluster analysis and visitor segmentation: clustering groups together a set of items with similar characteristics. Two common kinds of clusters are user clusters and page clusters. User clusters are groups of users with similar browsing patterns; they are useful for market segmentation, personalized web content and business intelligence (using demographic data). Page clusters are based on usage data or content data. Content-based clustering collects pages (or items) related to the same topic or category; usage-based clustering groups pages (or items) that are commonly accessed/purchased together.
- Association and correlation analysis: finding groups of items or pages that are commonly accessed or purchased together. Frequent itemsets are such commonly accessed groups. Association rules describe the relationships - which visits lead to other visits and with which frequency. This can, for example, indicate that a given promotional campaign is positively affecting sales.
- Analysis of sequential and navigational patterns: this adds the dimension of time to frequent itemsets and association rules. It can be used to predict future visit patterns, which is useful for trend analysis and advertisement aimed at particular groups, and to capture frequent navigational paths among user trails.
- Classification and prediction of user transactions: classification and prediction of users/items and user behavior.
We have a collection of 20 documents, 12 documents are relevant to the query. calculate the precision and recall values for the top 5, 10, and 15 retrieved documents. Then determine the interpolated precision values for the standard recall 0.1, 0.2, etc.
Top 5 retrieved, 4 relevant: precision 4/5 = 0.8, recall 4/12 = 0.33
Top 10 retrieved, 6 relevant: precision 6/10 = 0.6, recall 6/12 = 0.5
Top 15 retrieved, 10 relevant: precision 10/15 = 0.67, recall 10/12 = 0.83
Interpolated precision at the standard recall levels:
0.1 -> 0.8, 0.2 -> 0.8, 0.3 -> 0.8, 0.4 -> 0.67, 0.5 -> 0.67, 0.6 -> 0.67, 0.7 -> 0.67, 0.8 -> 0.67
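A small sketch of how the interpolated precision values above can be computed: for each standard recall level, take the maximum precision observed at any recall greater than or equal to that level. The numbers are the ones from the worked example.

# (recall, precision) at the measured cut-offs from the example above
observed = [
    (4 / 12, 4 / 5),     # top 5
    (6 / 12, 6 / 10),    # top 10
    (10 / 12, 10 / 15),  # top 15
]

for level in [r / 10 for r in range(1, 9)]:  # standard recall levels 0.1 .. 0.8
    candidates = [p for r, p in observed if r >= level]
    interpolated = max(candidates) if candidates else 0.0
    print(f"recall {level:.1f} -> interpolated precision {interpolated:.2f}")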
qq: What is RankBrain?
A collection of machine learning techniques that help Google match unknown terms and concepts. RankBrain powers Hummingbird with AI. It is created by big data analysis and machine learning: webpages, links, the Knowledge Graph, relevance feedback (user interactions with search results and webpages, Google Analytics data, social media signals), etc. Neural networks are used here. Building this intelligence is an offline process, just like indexing webpages. Google learns from users through sequences of queries from which it derives user intent. It also learns how people research a particular topic and what is most important, as well as relationships between texts, by observing how users click on links in search results.
qq (opinion mining): Document (or text, review) level opinion mining assumes that:
A document focuses on a single object and contains opinions from one opinion holder
3. What is a document index? Why do we need a document index?
A document index is a data structure that shows which words appear in which documents. A document index converts search in unstructured data into search in structured data, which speeds up the search process tremendously.
Two major differences between truncation and stemming
A major difference between truncation and stemming is that truncation is language independent, as it does not rely on any rules of the language being used. Stemming, on the other hand, brings different inflections of a word to their common representation, their "stem". Another difference is that stemming is used "behind the scenes", for example during indexing, whereas truncation is always used at the user's initiative, as it is really powerful and can generate a lot of results (e.g. if you search for di* the results would be endless).
4. Why do search engines analyze links between web documents?
A natural link to a document is like a vote for this document. By analyzing the "votes", search engines determine the importance of individual documents in their neighborhood.
What effect does the size of the sliding window have in Random Indexing (distributional semantics)?
A smaller window gives a more paradigmatic model (words that are in some sense interchangeable), while a larger window - up to the whole document - gives a more syntagmatic model (words that occur in the same domain). The sliding window is used to measure the similarity of different texts even if they do not use the exact same words. The size of the sliding window determines how many index vectors to the left and to the right of each word in a document are added to that word's context vector.
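A rough sketch of how a sliding window updates context vectors in Random Indexing; the dimensionality, window size and sparse index-vector construction are illustrative assumptions, not the exact parameters discussed in the course.

import numpy as np

DIM = 100     # dimensionality of the random index vectors (assumed)
WINDOW = 2    # 2 words to the left and 2 to the right of the focus word (assumed)
rng = np.random.default_rng(0)

def index_vector():
    # Sparse ternary random vector: a few +1/-1 entries, the rest zeros.
    v = np.zeros(DIM)
    positions = rng.choice(DIM, size=8, replace=False)
    v[positions] = rng.choice([-1.0, 1.0], size=8)
    return v

def train(tokens):
    index = {w: index_vector() for w in set(tokens)}   # word -> fixed index vector
    context = {w: np.zeros(DIM) for w in set(tokens)}  # word -> accumulated context vector
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j != i:
                # Add the neighbours' index vectors to the focus word's context vector.
                context[w] += index[tokens[j]]
    return context

tokens = "the rabbit eats the carrot in the garden".split()
vectors = train(tokens)
# Words that occur in similar sliding-window contexts end up with similar context vectors.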
qq: You have been asked to build your own system for competitive intelligence for your company. The system should be able to gather information from the web, store it for the future and present it in a suitable way for your boss. Which of the following would be suitable building blocks:
A web crawler to gather the information, a searchable news archive, a text summarization tool, a text clustering component, and a machine translation component.
qq (opinion mining): As an initial step in sentiment classification, it is common to extract phrases containing which types of words? The reason for doing this is that research has shown that such words are especially good indicators of subjectivity and opinions.
Adjectives and adverbs
Describe the advantages and disadvantages of the Boolean retrieval model.
Advantages: the user has some control over the search process and the results are exact (either match or no match). Main advantage from the quiz: the user designs a Boolean query where he or she can tell the search system in more detail what and how to search. Boolean queries are the only type of queries where the searcher has some control over the search process; they are suited to finding a needle in a haystack, i.e. specific searches where the searcher knows what he or she is looking for, which fits professional searchers. Disadvantages: because the results of a Boolean search are so exact (match/no match), there is no measure of how well a result matches the query, and therefore the results cannot be ranked - only filtered. The exact match may also lead to too few or too many retrieved documents, so the query gets reformulated several times until the searcher is satisfied.
How does a document index work?
All search engines use document indexing to "search the web". Document indexing is an offline process, which means that the work is carried out before a query is submitted, independently of the searcher. Basically, a document is transformed into a bag of words, which is then structured and indexed as a list of words with fast access to the data given a key. The data in this case is the position and order of the word occurrences.
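A minimal sketch of such an index (a positional inverted index), assuming whitespace tokenization; real systems add normalization, term weights, compression, etc.

from collections import defaultdict

# Minimal positional inverted index: word -> {doc_id: [positions]}.
def build_index(documents):
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

docs = {1: "search engines index the web", 2: "the web is large"}
index = build_index(docs)
print(index["web"])  # -> {1: [4], 2: [1]} : "web" occurs in both documents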
What is episode identification?
An episode is a subset or subsequence of a session comprised of semantically or functionally related pageviews, and episode identification is the classification of pageviews according to a domain ontology or concept hierarchy.
What is semantic tagging?
As opposed to structural tagging (headers, body and so on), semantic tagging gives items/tokens meaning or categorizes them according to what they represent. Using this, algorithms can interpret the meaning of the text, not just its structure.
qq: _____________can find groups of items or pages that are commonly accessed or purchased together. This, in turn, enables Web sites to organize the site content more efficiently, or to provide effective cross-sale product recommendations.
Association Rule discovery
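A toy sketch of association rule discovery over sessions, computing support and confidence for page pairs; the session data is made up for illustration.

from itertools import combinations
from collections import Counter

# Count pairs of pages accessed together in the same session and derive
# simple association rules with support and confidence.
sessions = [
    {"/home", "/product-a", "/cart"},
    {"/home", "/product-a"},
    {"/home", "/product-b"},
]

pair_counts = Counter()
item_counts = Counter()
for s in sessions:
    item_counts.update(s)
    pair_counts.update(combinations(sorted(s), 2))

n = len(sessions)
for (a, b), count in pair_counts.items():
    support = count / n                   # share of sessions containing both pages
    confidence = count / item_counts[a]   # confidence of the rule a -> b
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")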
What are the limitations of the bag-of-words document representation?
Bag of words means that syntax is ignored and words are studied as a group of items appearing in or missing from a document or document collection, together with their frequencies. The limitation is that this representation cannot capture text structure, the distance between words, or the beginning and end of the text.
quiz q: Why is the title of a webpage a good place for relevant keywords?
Because the title is shown in the search results and the user is more likely to click on a title that contains relevant keywords. Because the title tells what the page is about.
quiz q Why are search engines cautious about dynamic links/ pages?
Because they are afraid of spider traps - this is when dynamic links in a dynamic page point to other dynamic pages, and so on, and it never comes to an end - and because some parameters such as session ID or user ID create different URLs for the same content.
quiz Q Why do search engines consider link texts important?
Because they contain a concise statement of what the link's landing page is about; because they are usually created independently of the landing page's owner and therefore usually express someone else's opinion about the landing page; and because they are supposed to help people decide whether or not to follow the link, which makes them important texts.
What is the difference between black-hat and white-hat search engine optimization (SEO)?
Black-hat SEO includes web design elements deliberately hidden from human visitors of the website, or aggressive over-optimization. White-hat SEO is beneficial for both the human visitors and the search engines, and reflects natural development of the website. Successful SEO ends up somewhere in the middle.
What is the difficult part of Information Retrieval?
Calculating the similarity between the query and the different pieces of data (usually a piece of text, but it can also be a picture).
You want to add query expansion by using distributional semantics to your search engine. In order to be able to build the language model (word space model) used to extract the sets of related words you need to pre process the web pages in different ways using various language technology techniques. Detail the steps that you would take in the pre-processing and argue why these steps are necessary.
Cleaning: remove layout (HTML) and scripting (e.g. JavaScript) so that only the character strings making up the text are left for the computer to process further. Even though this is strictly not a language technology technique, without this step the computer does not know the difference between code, markup and natural language text.
Tokenization: divide character strings into words/tokens so that the computer can process them.
Stemming/lemmatizing: bring different word forms (inflections) to a common representation so that they are counted together during term frequency counts. Depending on how strict a conflation we want (keeping within part-of-speech or not) we may choose between stemming and lemmatization.
Compound splitting: split compound words into more common "sub-words". In some languages, stringing existing words together can form new words. While these may carry a lot of information, they are often much more infrequent than their constituent words.
Stop-word filtering: discount function words from term frequency counts. These are words that occur in almost every text and act as the "grammatical glue" when forming sentences from content words. It is the content words we want to count, since they indicate the topic of the text.
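A compact sketch of this pre-processing pipeline, with a deliberately tiny stop-word list and a crude suffix-stripping rule standing in for a real stemmer; all names and rules here are illustrative assumptions.

import re

STOP_WORDS = {"the", "a", "an", "is", "are", "to", "and", "of"}  # tiny illustrative list

def preprocess(html):
    # 1. Cleaning: strip script blocks and tags so only the text remains (crude regex for illustration).
    text = re.sub(r"<script.*?</script>|<[^>]+>", " ", html, flags=re.S)
    # 2. Tokenization: split the character string into word tokens.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    # 3. Stop-word filtering: drop function words that carry little topical content.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Stemming: crude suffix stripping (a real stemmer has many more rules,
    #    so this produces e.g. "runn" for "running").
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
    return tokens

print(preprocess("<p>The dogs are running to the parks</p>"))  # -> ['dog', 'runn', 'park']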
qq: Applications that recommend items to users are becoming increasingly prevalent on the Web. Some of these applications rely on the profiles of similar users to make recommendations, for example based on user ratings of items. What is this technique known as?
Collaborative filtering
What is the usage of Compound splitting and what problems can accure?
Compound splitting is the process of splitting compound words into their more frequent parts. A problem that can occur is that the parts get a new meaning: for example, it can be fruitful to split the word "steamboat" but not "white-collar". It is also important to take multi-word tokens into account. For example, York and New York are not the same place and should not be returned as similar; New York, in this case, is a multi-word token even if it is not graphically compounded. Idiomatic expressions also belong here, e.g. "break a leg" is not meant literally.
Why is cosine similarity a better query-document similarity measure than counting the words that the query and the document have in common?
Cosine similarity is better because it not only measures the frequency of words but also the importance of words in terms of their uniqueness in the document collection; this is what constitutes the term weight. Each document in the collection can be represented as a vector in an n-dimensional space where each unique term is a dimension. Each document vector has a length and a direction, and its coordinates are the term weights. Cosine similarity measures the similarity of a query and a document by the angle that separates them: the smaller the angle, the more similar the document and the query are. Cosine similarity therefore takes a much more holistic approach to calculating similarity, as there is more to the similarity between documents than the frequency of a word. Counting only common words tends to favour longer documents.
Why is cosine similarity better than simply counting the number of common words in two documents?
Cosine similarity measures the angle between two term vectors, and that angle does not favor longer or shorter documents. If we simply count the number of common words, then longer documents will always be favored, as there will be more matches.
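A small sketch illustrating the point: the long document matches the query terms many times but still gets a lower cosine score, because cosine compares directions, not raw counts. The vectors are made-up term weights.

import math

def cosine(a, b):
    # Cosine of the angle between two term-weight vectors; length-independent.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

query = [1, 1, 0]        # term weights for the query
short_doc = [2, 1, 0]    # short document about the same terms
long_doc = [20, 10, 30]  # long document that merely mentions the terms many times

print(cosine(query, short_doc))  # ~0.95: high similarity despite few matching words
print(cosine(query, long_doc))   # ~0.57: lower - length alone does not help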
Content data (web usage mining)
Data comprised of combinations of textual materials and images. The data sources used to deliver or generate this data include static HTML/XML pages, multimedia files, dynamically generated page segments from scripts, and collections of records from the operational databases.
What is data fusion and cleaning?
Data fusion: merging log files from multiple web or application servers. Data cleaning is typically site-specific: remove irrelevant references, remove irrelevant data fields, remove crawler references.
qq: What does BERT do?
Disambiguates the meaning of a word in the sentence; encodes the meaning of a word based on the sequence of words in the sentence.
quiz q Canonical URL is referred to by <link rel="canonical" href="the_canonical_url_goes_here"> in the header of a web page. What is the purpose of canonical URL?
Eliminate duplicate webpages; point to the webpage that should be indexed by the search engine instead of this webpage.
Before you can index the contents of web pages you need to prepare the documents in several ways. Design and motivate a pipeline of components that pre-process the web pages in stages.
First, non-textual content needs to be removed, such as HTML tags, CSS, JavaScript, etc. If we don't remove these, they will end up in the vector space model's vocabulary. The next step is tokenization: the text strings need to be split into tokens such as words, numbers and punctuation, since the computer only sees a string of characters. Thirdly, morphological analysis, so that different conjugations/inflections/forms of words are counted as one (dog/dogs). Most often stemming is used rather than lemmatization, so that we can conflate words from different word classes which roughly refer to the same concept, such as cycle/cycling. Finally, stop words need to be handled. If we don't do this when constructing the term vectors, the most common words in every document will be common non-content words such as syntactic markers and prepositions. However, we need to retain them in the index in order to facilitate phrase search.
qq: We can spell a URL differently: http://mysite.com/, http://www.mysite.com/, http://mysite.com/index.html. All spellings will load the same page. What's wrong with these spellings from SEO point of view?
Formally, these are different URLs. The page has several sets of weaker link analysis attribute values, instead of one strong set
qq: How does Google guess the user's information need apart from what's written in the keywords?
Google observes similar sequences of queries and learns the user's need. Google learns how people research a particular topic. Google learns the relevance of a query to the documents by observing how users interact with their search results: which links they click, which clicked pages the users abandon quickly, which links lead to a completed task (a purchase, a sign-up, a subscription to a newsletter, etc.), and which links are liked and tweeted.
qq What is Hummingbird?
Google's latest search platform, which addresses the user's information need, not just the keywords that the user types. Hummingbird brought a major change to the search algorithm, moving from keyword matching to semantic search. It estimates the user's intent (why people are looking for something, not just what they are looking for). It searches in context - previous queries, websites visited, the user's Gmail, geo-location and so on. It can account for conversational queries such as who, why, where, how, and delivers answers if it can. It matches concepts and relationships, not just keywords. The intelligence in the engine is powered by the Knowledge Graph and RankBrain. Google breaks down a query into parts, trying to determine what is being searched for; a word such as "where" implies that the user is looking for a place, for example. In order to answer a query, Google uses several information sources, such as "My Business", Google Maps and a registry of IP addresses or the geo-location of mobile devices.
What is the knowledge graph?
Google's Knowledge Graph contains information about concepts and related entities. The purpose is to deliver answers, not documents. It keeps users on Google's own pages and increases advertising revenue. It also makes it possible to improve search results by identifying concepts in the query and the documents, not just keywords.
what does link reputation measure?
How trustworthy the link to a website is. It depends on authorship, proof of the expertise of the content creator, the physical address, the phone number (a unique identifier) and where it is mentioned, mentions - what other people say about the company, domain age (older is better), the hosting company's reputation, click-through rate, time on site, bounce rate, conversion rate, and returning visitors.
Opinion spam refers to human activities that try to deliberately mislead readers or automated opinion mining systems by giving undeserving positive opinions to some target objects in order to promote them, and/or by giving unjust or false negative opinions on some other objects in order to damage their reputation. To achieve this, the spammer usually takes one or both of the following actions. What are they called?
Hype spam: to write undeserving positive reviews for the target objects in order to promote them Defaming spam: write unfair or malicious negative reviews for the target objects to damage their reputation
quiz q Why do search engines look for relevant keywords in image ALT tags?
Images describe the webpage and ALT tag texts describe the images
qq: What do we optimize when we do SEO?
Keywords and links associated with a webpage, crawler friendliness of the website, impression of a popular website with fresh and user generated content, mobile device friendliness of the website.
What is Lemmatization?
Lemmatization is the process of grouping together different inflected forms of a word so they can be analyzed as a single item (its lemma). In lemmatization, the conflated token needs to be an actual word (the base form of the word, which can be found in a dictionary). This requires grammatical knowledge (morphology and syntax) in order to guess which part of speech (noun, verb etc.) the word belongs to. This is language dependent and non-trivial. This method is not common in information retrieval due to its rigidity, computational requirements and difficulty of implementation.
What is Distributional Lexical Semantics?
Lexical semantics is the analysis of word meanings and the relations between them. Distributional Lexical Semantics adds the distributional dimension, looking at where and how words appear in a collection.
What is the difference between page rank and link analysis?
Link analysis is the analysis of incoming and outgoing links. There are algorithms for calculating the weight of links, of which page rank is one (a link from a domain with a higher page rank carries more weight).
qq: Which features contribute to the reputation and authority of the website?
Link diversity Links from trusted domains Domain age (old domains are more trustworthy) Reputable physical address (a real life organization is trusted more) Opinions about the company/product/brand in social media Task completion rate on the website- products purchased, subscriptions made, etc.
How can keywords be used on a webpage to get higher rankings?
Link reputation is important: due to the occurrence of link spam, a search engine will look at the relevance of a link. First it looks at the title of the document and whether it relates to the link, then whether there are relevant keywords around the link and whether the topic of the page is the same as the topic of the link's target page. The more convinced the search engine is that the link is relevant, the higher its reputation.
Keywords are important. The title is a good starting place for determining what a document is about. Other good places to look for relevant concepts and keywords are first-level (H1) headings and the document body, especially the beginning and the end of the document.
- Keywords in the file path and file name: a minor ranking factor. They are good to include, but the search engine will not rank them as important. Users might, however, be more likely to click on a link if they see relevant keywords in it.
- Image ALT tag: the information attached to an image, visible when the mouse passes over it. It is indexed by search engines as part of the explanation of a document.
- Keywords in link text: the link/anchor text that points to a web page contains the most important keywords that describe the page, because someone else has used those keywords to describe the page, which usually gives a good summary of what the document is about.
What is Geo-targeting and how does the search engine get the information?
Location-related searches are the reason that the same query gives different results in different countries. Websites have geo-targeting signals which a search engine can consider:
- Top-level domain (.uk, .se, etc.)
- Language
- IP address of the web server
- Geo-location of incoming links
- Location of the visitors
- Domain registration's postal address
- Google's My Business registration
- Search Console's geo-targeting settings
- Geo-coded images
qq: What is the advantage of optimizing for long tail keywords?
Long tail keywords have more words that more precisely describe the information need, therefore it is more likely that the page will be found by a searcher who needs that page
Why do we use normalized similarity?
Longer documents have more words than shorter documents, making it difficult and uneven to compare them. Normalized similarity accounts for document length by dividing the scalar product by the document length. The document length here is the number of terms appearing in the vector - a very long document with many different words will therefore be favoured less than a shorter document with fewer unique terms, provided these also match the query. The problem with this calculation is that shorter documents are now generally preferred over longer documents, which might not always be relevant. This can be solved by looking at similarity as the angle between a query and a document.
A very large share of searches on Google come from mobile devices, therefore Google pays close attention to how well websites perform on mobile devices. Please describe the activities of optimizing a website for better user experience on mobile devices and, thus, better ranking in search results on mobile devices
Mobile SEO: content is important, and so is the format. More than half of all Google searches come from mobile devices, so your site must be fully mobile compatible to be considered relevant by Google. There are two approaches to mobile-friendliness:
Dedicated mobile and desktop sites
- The mobile site will be heavily optimized for mobile use
- Pages with lots of mobile visitors (such as Facebook) often have dedicated mobile sites
- Disadvantages: two sites to maintain, where one (the mobile site) tends to have less content; redirection errors are possible, where users are sent to the wrong place or to a site that does not exist; Page Rank and authority may also be split, although this is not as problematic, since Google allows two pages if certain guidelines are followed
Responsive web design
- Google prefers this; no duplicates
- Adjusts itself to the size of the window
- Only one site to maintain
- Mobile users have full access to the content
- No faulty redirections (Google likes this)
- Disadvantages: dedicated, optimized mobile pages load faster; everything is a compromise, so always keep it concise; mobile-only features may need a separate page; older phones do not work well with RWD
qq: Which features of a website contribute to evaluation of user experience in the eyes of a search engine?
Mobile-friendliness Secure browser-server connection Lots of advertising Pop-up windows Slow website Fast website Website does not respond to user clicks Broken links
Which method from natural language processing can be used to identify entities representing possible opinion holders and opinion objects such as organizations and persons in a text?
Named entity recognition. Specifically made to identify the entities Persons and Organizations. Other entities such as Locations and Time points can also be identified by NER
What is opinion mining ?
Opinion mining is about automatically extracting opinions from evaluative texts. Evaluative texts are texts that express opinion in some way (about products, companies, politics, events etc.). It is a relatively new research area closely related to sentiment analysis (what sentiment a writer wants to communicate) and subjectivity analysis (is a text subjective or objective). There is often no clear distinction between sentiment analysis and opinion mining. A sentiment is more about feelings and an opinion is more related to thoughts and ideas.
quiz q what does page rank (PR) show?
PR shows how important the page is in terms of the number of incoming links and the importance of the linking pages.
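A minimal power-iteration sketch of PageRank over a made-up three-page link graph (damping factor 0.85, a common choice); it is meant only to show how incoming "votes" accumulate.

# Minimal PageRank by power iteration over a small link graph.
links = {        # page -> pages it links to
    "A": ["C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):
    new_rank = {}
    for p in pages:
        # Each linking page passes its rank on, divided among its outgoing links.
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(rank)  # C has the most incoming links (from A and B) and ends up with the highest rank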
page vs site SEO
Page:
- Keywords specific to the content of one page
- Page Rank assigned to one page
- Link texts describe one page
Website:
- Crawler-friendly website
- Domain trust
- Link diversity
- Geo-targeting signals
General:
- Targeted keywords - look at what keywords are important to the page and optimize for that
- Search-friendly site - look at how crawlers behave and optimize for that
- Inbound links
- Site authority
- Mobile SEO
qq: Depending on the goals of the analysis of usage data, this data needs to be transformed and aggregated at different levels of abstraction. In Web usage mining, what is the most basic level of data abstraction?
Pageview
What is a pageview?
A pageview is the most basic level of data abstraction in web usage mining. It is the user's "view" of a website, for example the links the user is following, what products the user adds to the shopping cart, what the user actually buys, etc.
What is path completion?
Path completion is about inferring missing user references due to caching: data that is frequently used can be served from the browser or a proxy cache instead of being requested from the server again, so the request is never registered as an activity in the log. It is a potentially important pre-processing task which is usually performed after sessionization. Path completion is supposed to find the missing user references, but it requires extensive knowledge of the site structure and referrer information from server logs. Disambiguation between candidate paths is sometimes needed.
2) What is precision and recall, what do they measure? Explain why precision drops as recall rises, and recall drops as precision rises.
Precision and recall are used to measure the quality of document retrieval. After the system has retrieved a number of documents relevant to the query, recall is the share of retrieved relevant documents among all the relevant documents in the collection. Recall measures the system's ability to retrieve as many relevant documents as possible. Precision is the share of relevant documents among all the retrieved documents. Precision measures the system's ability to provide as clean retrieval results as possible with few irrelevant documents in them. There is always a trade-off between precision and recall. If we want to have cleaner retrieval results, we have to leave out more relevant documents. If we want to retrieve as many relevant documents as possible, we have to accept more garbage in the results. One reason for such a trade-off is the cut-off line: the system sorts documents according to their similarity values, and then accepts only the top X documents. A larger X means higher recall and lower precision.
Precision and recall are two very traditional document retrieval quality measures. Please reason about other possible retrieval quality measures, feel free to invent your own. Motivate your choices. (Part 2)
Precision at k. For modern (web-scale) information retrieval, recall is no longer a meaningful metric, as many queries have thousands of relevant documents and few users will be interested in reading all of them. Precision at k documents (P@k) is still a useful metric (e.g., P@10 or "precision at 10" corresponds to the number of relevant results among the top 10 retrieved documents), but it fails to take into account the positions of the relevant documents among the top k. Another shortcoming is that for a query with fewer relevant results than k, even a perfect system will have a score less than 1. It is easier to score manually, since only the top k results need to be examined to determine whether they are relevant or not. Another option might be adding a trained machine-learning model in the future; the disadvantages are that training the model requires a lot of data and might take longer.
Relevance feedback is an effective method for improving search results. One way of applying relevance feedback in web search is observing how users interact with their search results. Please design the relevance-feedback part of a search engine. What would the search engine observe in order to improve the future search results? How would the search engine improve the future search results based on your designed relevance feedback?
(Re-read about relevance feedback on my own.) Relevance feedback means observing how users interact with the search results and how results are marked by the user as relevant or irrelevant. Which results are clicked, which are bounce-back results (where the user immediately clicks back without spending any time on the website), which results the user spent the most time on, and whether the user has to reformulate the query are all data points that can be recorded. User behavior like time spent on a page or site is indirect (implicit) feedback, while reformulating the query would be direct feedback. In the future, more relevance feedback could be gained if the search engine could observe a user copying the URL or content on a page; this would be a strong indication that the user found those pages relevant to their query. The search engine could also look at which links or content are shared on social media or saved in Google Docs and Drive. The search results could then boost the pages that were most heavily shared on social media or most often copied, cited, or used in some way.
difference between explicit and implicit relevance feedback
Relevance feedback is used for providing a system with information on the relevance of retrieved documents. In explicit relevance feedback, relevant documents are selected by the user and fed to the system as correct results. In implicit relevance feedback, the user is not asked to mark relevant documents; instead, relevance is inferred from user behavior, such as which links are clicked after reading the summary, time spent exploring a document, and eye tracking.
qq: Which option of mobile-friendly website design does Google prefer?
Responsive web design
How do search engines list and retrieve webpages?
Search engines use crawlers to index webpages, making them readable for machines.
1. The crawler retrieves a page: it extracts links and indexes the page content.
2. The crawler selects the next link to visit. Dynamic links and broken links may cause problems; a site map may help find and prioritize links; the Robots Exclusion Protocol keeps polite crawlers away from non-searchable areas.
3. Back to step 1.
Semantic search makes a search engine more intelligent than a plain keyword matching system could possibly be, and RankBrain is the "kitchen" where Google's intelligence is being created. Please describe the principles of how RankBrain works, what it does, motivate the use of different technologies. (Part 2)
Semantic search is about identifying concepts and recognizing the purpose of the user's search, not just finding similarity. Hummingbird is the Google search platform; it brought a major change to the search algorithm, moving from keyword matching to semantic search. Hummingbird estimates the user's intent (why people are looking for something, not just what they are looking for). It searches in context - previous queries, websites visited, the user's Gmail, geo-location and so on. RankBrain powers Hummingbird with AI. It is created by big data analysis and machine learning: webpages, links, the Knowledge Graph, relevance feedback (user interactions with search results and webpages, Google Analytics data, social media signals), etc. Neural networks are used here. Building this intelligence is an offline process, just like indexing webpages. Google learns from users through sequences of queries from which it derives user intent. It also learns how people research a particular topic and what is most important, as well as relationships between texts, by observing how users click on links in search results.
what is sentiment analysis?
Sentiment analysis is the sub-task of opinion mining that tries to assess the sentiment of an opinion holder on a specific topic or artefact. A sentiment is usually defined as a level of positive, neutral or negative, and it is calculated as a number in order to distinguish between strong and weak sentiments. A sentiment is more about feelings, while an opinion is more related to thoughts and ideas.
What is the primary data source for analyzing users' interactions on the Web? (or primary data sources used in web usage mining?)
Server log files (Web and application server logs).
What are server log files?
Server logs record the navigational behavior of visitors and are the primary data source used in web usage mining. They include, for example, information about which link the user clicked at what time, what browser and OS the user is using, the IP address of the client, etc.
qq: Which of the following make a webpage load slower?
Slow network connection Lots of advertising on the page Many attractive pictures on the page Website placed on several web servers Content generated by JavaScript
What is stemming?
Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). Stemming is important in natural language understanding (NLU) and natural language processing (NLP). Stemming is a morphological (and less often semantic) conflation based on rewrite rules. This is language dependent, as every language has its own morphological rules. Stemming rules: different rules are used to replace certain characters or combinations of characters at a certain position in a token. This allows, for example, changing the progressive form to the infinitive form (i.e. running -> run). Challenges - affix stripping:
- Suffix - morphological changes at the end of words
- Prefix - morphological changes at the beginning of words
- Infix - morphological changes inside words
What does stemming or lemmatization do with words?
Stemming means reducing a word to its stem in order to increase the search results, for example "facilities" to "facilit", so that documents with both "facility" and "facilities" can be found when searching for facilities. Lemmatization is the process of grouping together different inflected forms of a word so they can be analyzed as a single item; for example, "better" has "good" as its lemma.
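An illustration using NLTK (the library is my assumption, not something named in the notes; it needs nltk installed and the WordNet data downloaded). A stemmer yields a truncated stem, while the lemmatizer returns a dictionary base form and may need part-of-speech information.

# Requires: pip install nltk, plus nltk.download("wordnet") for the lemmatizer.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("facilities"))               # a truncated stem (e.g. "facil"), not a dictionary word
print(lemmatizer.lemmatize("facilities"))       # "facility" - an actual base form
print(lemmatizer.lemmatize("better", pos="a"))  # "good" - needs part-of-speech information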
2. What is the purpose of term weights in the vector-space retrieval model?
Term weights show the importance of individual terms in the document.
How does the vector space retrieval model work?
The vector space retrieval model is the most popular retrieval model. In it, similarity is calculated as a number; a bigger number means more similarity, which enables relevance ranking, as opposed to the Boolean retrieval model. The model looks at documents as a bag of words, meaning that structural information is lost. It is best for document clustering and categorization as well as finding similar documents - not for keyword-based searches.
quiz q: What would be a good domain name?
The company name, the product brand name
Structure data (web usage mining)
The designer's view of the content organization:
- Inter-page linkage structure (hyperlinks), e.g. site maps
- Intra-page structure, e.g. the tree structure over the space of tags
What does the distributional hypothesis state?
The distributional hypothesis states that the meaning of a word is defined by its use in language, which means that words with similar meanings (such as synonyms) occur in the same contexts.
What is the goal of aspect/feature based opinion mining?
The goal is finding opinions about some specific properties of an opinion object.
What is the goal of sessionization?
The goal of sessionization is to reconstruct the sequence of actions performed by one user during one visit to the website by segmenting the activity record of each user into sessions.
What is the goal of web usage mining?
The goal of web usage mining is to analyze the behavioral patterns and profiles of users interacting with a Web site
What is document clustering?
The information retrieval system decides by itself what groups of documents (clusters) there will be, depending on their similarity.
What is document Categorization?
The information retrieval system puts new documents into predefined groups of documents - text categories - that have sample documents as queries
Why do we get rid of stop words?
The most frequent words in a text are often not content words but function words, which tell us very little about the text in question. These therefore need to be removed in order to provide a proper representation of a document. This process is called stop-word removal, where stop words are words that do not carry much meaning (i.e. function words). Removing stop words is a form of noise reduction, minimizing the importance of very frequent, non-important words and of very infrequent and therefore useless words. Be aware that removing stop words is not always good: for example, when studying reviews it might not be wise to remove words like "not" that carry a lot of meaning in that context.
Imagine that you work at a SEO company, and your task is to optimize a small web shop. Please specify the five phases of search-engine optimization (two on-site optimization phases and two off-site optimization phases, plus optimization for mobile devices). Tell what SEO activities each phase requires.
The optimization phases are:
- Keyword selection: make sure your keywords are relevant, purchase oriented, perhaps long-tail, and placed in the title, headings, the beginning of the body text, link anchor texts, etc.
- Search-friendly site: site map, robots exclusion protocol, fixed broken links and redirections, no Flash and frames.
- Link optimization: Page Rank, link diversity, link reputation.
- Site authority: see question 12.
- Mobile-friendly: responsive web design, no Flash, no disturbing ads, fast-loading pages.
qq: What are the features of the web pages designed according to Responsive Web Design?
The page adapts its appearance to the size of the browser window. The size of the web page elements, e.g. images and tables, is defined proportionally to the size of the browser window; it is not defined in pixels.
Please tell which precision-recall curve is more suitable for a web search engine and which one for a library search system. Why?
The precision-recall curve can be divided into regions so that it is possible to look at, for example, only low recall, mid recall or high recall. Low recall is important in a web environment, where we only go through a short list of results in, for example, a search engine. In exhaustive information searching, however, we want high recall, since we want to look at all or many relevant documents. The latter type of searching could occur in a research situation at a library, for example.
You are constructing a search engine from scratch. Please reason around the pre-processing technique stop words filtering and stemming. The search engine that you have constructed performs stop word filtering and stemming of the document collection before indexing. How does this affect phrase search in an information retrieval setting? What extra modules would you need to implement in your search engine to deal with phrase search? Imagine you would search for the phrase "To be or not to be a student". (Part 2)
The problem introduced by stemming doesn't necessarily concern phrases: since words in both the query and the documents are reduced to their stems, the search engine can still find the documents containing the phrase (among other documents that contain a variation of the phrase). The bigger problem with stemming is that if the concrete form of the word matters in a particular query, then precision will be low for that query, since documents containing the other word forms will be found as well. Given the example phrase, I personally doubt that stemming will have a negative impact for this specific phrase. The bigger problem lies with stop-word filtering. While stop words (words like "to", "be", "a", "or") contribute little to the actual content or information (at least from a text processing/analysis point of view), they are of course very relevant when searching for an exact phrase. Given the above-mentioned list of stop words, stop-word filtering would essentially reduce the query to "not student", which would most likely return many irrelevant documents. Even worse: since the stop-word filtering is performed before indexing, there is no trace of these words in the document representation, which makes the phrase search close to impossible. To change this, one would have to leave the stop words in the index. It is fine to compute document similarity without the stop words, but the information about whether they exist or not must be in the index. After having calculated document similarity without using stop words (easy: just ignore the stop-word vector elements), one should then calculate a second similarity measure that includes stop words for the documents that were relevant according to the first similarity measure. This measure should include the positions of the words, since these are important for phrases. Of course, this only needs to be done if the phrase actually contains stop words. With the help of this second measure, reassess the relevance/similarity and report the results.
what is competitive intelligence?
The process of gathering information about the competitive environment to improve the company's ability to succeed. CI is a subset of BI, looking at things going on outside the company. The process is as follows:
- Identify what information is to be collected
- Identify possible information sources
- Evaluate what you can trust and what is useful
- Integrate information collected from different sources
- Interpret and analyze the information; draw conclusions and recommend actions
- Present the analyzed findings to the decision makers
Traditionally, the vector-space retrieval model does not consider document structure and text formatting - is the word placed in the title, a heading, a link text, or the main text; is the word emphasized by bold/italic? Nevertheless, we would like to include the document structure and text formatting into the query-document similarity calculation the same way as we include term frequency and inverted document frequency. Please reason how we could include the document structure and text formatting into the document representation in the vector-space retrieval model in order to use them for query-document similarity calculation.
The question itself suggests the answer: "we would like to include the document structure and text formatting into the query-document similarity calculation the same way as we include term frequency and inverted document frequency", i.e., we include them in the term weights, which are individual for each document. When we calculate term frequency today, each occurrence of a term is worth 1 point: term frequency is 1+1+1+... We can instead assign term points tp to each occurrence according to its formatting and calculate the term frequency as tp+tp+tp+..., where the numeric value of tp mirrors the formatting of that particular occurrence of the term. This approach has been tested before. The tricky part is measuring the importance/weight of the title, headings, bold text, etc. We could determine these weights experimentally, by running test queries and measuring the probability that matching words appear in the various document structure and text formatting elements. Another suggestion in the exam answers was to have a separate dimension for each term-formatting combination: if the term t appears in 3 formats, it gets 3 dimensions.
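A small sketch of the first approach in Python; the formatting weights (the tp values) are invented for illustration and not taken from the course material.

```python
# Hypothetical weights for where a term occurrence appears; the values are illustrative.
FORMAT_WEIGHTS = {"title": 5.0, "heading": 3.0, "bold": 2.0, "link_text": 2.0, "body": 1.0}

def weighted_term_frequency(occurrences):
    """occurrences: list of (term, format) pairs for one document.
    Returns term -> weighted frequency (a sum of tp points instead of 1+1+1+...)."""
    tf = {}
    for term, fmt in occurrences:
        tf[term] = tf.get(term, 0.0) + FORMAT_WEIGHTS.get(fmt, 1.0)
    return tf

doc = [("retrieval", "title"), ("retrieval", "body"), ("model", "bold"), ("model", "body")]
print(weighted_term_frequency(doc))  # {'retrieval': 6.0, 'model': 3.0}
```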
chapter 2 What is the Scalar Product and what is it used for?
The scalar product (dot product) of two vectors is, in its geometrical definition, the product of the vectors' lengths multiplied by the cosine of the angle between them. The result is just a number, not a vector, as it does not hold any directional information. The easiest way of calculating document similarity is to take the scalar product of the query vector and the document vector, i.e. the sum of the products of the corresponding coordinates.
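A minimal sketch in Python of scalar-product similarity between a query vector and a document vector; the term weights are illustrative.

```python
def scalar_product(query_vec, doc_vec):
    """Sum of products of corresponding coordinates; vectors are term -> weight dicts."""
    return sum(weight * doc_vec.get(term, 0.0) for term, weight in query_vec.items())

query = {"web": 1.0, "search": 1.0}
doc = {"web": 0.5, "search": 0.25, "engine": 0.75}
print(scalar_product(query, doc))  # 0.75
```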
What are the benefits of robots exclusion protocol (i) for the search engines and (ii) for the websites?
For the search engine: it saves its resources by not crawling and indexing irrelevant parts of the website and improves its search results by not delivering irrelevant pages. For the website: the search engine does not spend its crawling resources on irrelevant pages and instead allocates more resources to crawling and exposing the relevant pages of the site.
what is user activity record?
The sequence of logged activities belonging to the same user. The analysis of Web usage does not require knowledge about a user's identity; however, it is necessary to distinguish among different users.
What effect on the relations being modelled does the size of the sliding window have in Distributional Semantics?
The size of the sliding window affects what type of relation between the words in the vocabulary we model. A paradigmatic model uses a smaller window, while a syntagmatic model uses a larger window (up to the whole document).
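A small sketch of sliding-window co-occurrence counting in Python; the window size is the parameter discussed above, and the toy sentence is only for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window):
    """Count how often each ordered pair of words appears within `window` tokens of each other."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
small = cooccurrence_counts(tokens, window=1)            # tight contexts, more paradigmatic
large = cooccurrence_counts(tokens, window=len(tokens))  # whole text, more syntagmatic
print(small[("quick", "brown")], large[("quick", "dog")])  # 1 1
```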
In order to calculate recall we need to know the number of documents relevant to the query. How do we estimate the number of relevant documents in a very large document collection? Please outline this process.
Through pooling: a "cut-off" line is drawn in the result lists once the search is done, and what is above the line is treated as a representation of the entire document collection. For example, when comparing search engines, we can take e.g. the top 30 results from each search engine and count the total number of relevant hits across all of them (without duplicates); this becomes the estimated total number of relevant documents. For each search engine we can then divide the number of relevant hits in its results by this estimated total to get a recall value.
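A rough sketch of the pooling-based recall estimate in Python, assuming the relevance judgements for the pooled documents are already available; the result lists and document ids are made up.

```python
def pooled_recall(result_lists, judged_relevant, cutoff=30):
    """Estimate recall per system, using the union of relevant documents found
    in the top-`cutoff` results of all systems as the 'total relevant' set."""
    pool = set()
    for results in result_lists.values():
        pool |= {doc for doc in results[:cutoff] if doc in judged_relevant}
    recalls = {}
    for system, results in result_lists.items():
        found = {doc for doc in results[:cutoff] if doc in pool}
        recalls[system] = len(found) / len(pool) if pool else 0.0
    return recalls

result_lists = {"engineA": ["d1", "d2", "d3", "d9"], "engineB": ["d2", "d4", "d7", "d1"]}
judged_relevant = {"d1", "d2", "d4"}  # human judgements on the pooled documents
print(pooled_recall(result_lists, judged_relevant, cutoff=3))
```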
qq: What is First Input Delay?
Time between a user's interaction with a webpage and the webpage's response, i.e. the delay with which the website responds to a user's interaction.
qq: What does Largest Contentful Paint estimate?
Time for the page's main content to get loaded
qq: What does First Contentful Paint measure?
Time from when the page starts loading until some content is visible in the browser.
qq: what is business intelligence?
To continuously and in a structured way investigate and analyze the factors that affect the development and success of the company/organization
quiz q: What is the purpose of pooling?
To create a small subset of documents that hopefully (but not necessarily) contains all the documents relevant to a query, so that we can manually verify their relevance and then assume/define the total number of relevant documents in the collection for the purpose of measuring recall.
Please explain link diversity in the context of web link analysis by search engines.
To have high link diversity, a page should be linked from many different other sites (a variety of domains). The incoming links should also target different pages within the site (deep linking). Link diversity contributes to the site's authority and makes the site more trustworthy.
quiz q: What is the main purpose of a site map?
To help search engine crawlers find relevant pages on the site by listing them in the site map
quiz q: what is the purpose of query expansion?
To retrieve more relevant documents, those which the original query misses. Examples of query expansion are adding synonyms to the original query, including common spelling mistakes, and stemming.
Precision and recall measure the quality of document retrieval. Usually there is a trade-off: if we optimize the retrieval system for better precision, we lose recall; if we optimize the system for better recall, we lose precision. Please explain the cause of this trade-off. Now, please do some reasoning. How can we improve the retrieval technique and increase precision/recall while trying to lose less of the corresponding recall/precision? There is no straight answer. The task is to reason about different methods of improving precision/recall and to explain the corresponding loss of recall/precision. (Part 2)
For the trade-off it is helpful to consider the formulas for precision and recall: R = (retrieved relevant documents / all relevant documents), P = (retrieved relevant documents / all retrieved documents). First note that if a search engine reports more documents, recall cannot drop (the denominator does not change); it can only increase, since more relevant documents might be retrieved. Precision, however, may drop if the search engine returns irrelevant documents. Thus, if we want to optimize recall, we want to report as much as possible; if we want to optimize precision, we want to (ideally) report only truly relevant documents, and it does not matter if we miss some relevant documents (although it matters for recall!). In short: if we try to increase recall, we might report more irrelevant documents, so precision drops; if we try to increase precision, we might miss more relevant documents, so recall drops. The goal is to find a balance between the two.
Precision can be increased in many ways. For example, given an estimated relevance ranking of the retrieved documents, report only the first n findings. This will (if the ranking is good) filter out many irrelevant documents and thus increase precision, but recall may drop since relevant documents may have been left out as well. Another approach to achieve high precision is to use several well-performing retrieval systems, feed the query to all of them, and then report only the documents that all systems returned (these are very likely to be truly relevant, so precision increases). Once again, recall will probably drop because relevant documents are discarded. Recall can be improved by query expansion, for example by adding synonyms for some of the query words. This requires a synonym dictionary or a more sophisticated approach such as word similarity models. The impact on precision is clear: for some queries the addition of synonyms or similar words is harmful (for example book or song titles), since more unwanted (irrelevant) documents will be returned. Another form of query expansion is generalization or specialization. Consider the following: a vehicle is something very general, a car is a special kind of vehicle, and a VW is a special kind (brand) of car. Specialization could add the terms "car" and "truck" to the query "vehicle"; generalization would add the term "car" to the query "VW". Both have the potential to increase recall, since documents about cars or trucks may well be relevant when vehicles are of interest. Similar to the addition of synonyms, however, precision will probably drop for many queries where the specific term is what the user wants. My personal guess is that this is worse for generalization than for specialization: if a user searches for something very specific (for example "VW"), it is less likely that he or she is interested in something general (for example a document about the history of cars). (Note that this is just a personal guess; I don't know any studies to back this claim.) Another way to improve recall is to use stemming or truncation to bring words to a standard form: if the query is "opinions", the singular form will also be found. This will almost certainly increase recall, since more documents that are likely to be relevant will be retrieved. Precision may drop, however, for reasons similar to those explained above for query expansion with synonyms: if the exact verb or noun form is what the user wants, many irrelevant documents will be retrieved, so precision drops. Another very straightforward way to increase recall is spelling correction. If spelling mistakes are spotted (not trivial) and corrected, this will lead to many more retrieved documents (very few documents will contain the misspelled word).
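A small sketch of the cut-off idea in Python: precision and recall computed when only the top n ranked documents are reported; the ranking and relevance judgements are toy data.

```python
def precision_recall_at_n(ranked_results, relevant, n):
    """Precision and recall when only the top n ranked documents are reported."""
    reported = ranked_results[:n]
    hits = sum(1 for doc in reported if doc in relevant)
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranked = ["d3", "d1", "d7", "d9", "d2", "d5"]
relevant = {"d3", "d1", "d2", "d8"}
for n in (2, 4, 6):
    p, r = precision_recall_at_n(ranked, relevant, n)
    print(f"n={n}: precision={p:.2f} recall={r:.2f}")
# Reporting more documents never lowers recall, but precision tends to drop.
```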
qq: www.doctor-x.com is a popular website with many incoming links, high Page Rank, and high trustworthiness. www.star-riders.com is a website developed by and for amateur astronomers. Being a niche website, it has fewer visitors, fewer incoming links, and smaller Page Rank than www.doctor-x.com. Which site has pages that are likely to come higher up in the search results, www.doctor-x.com or www.star-riders.com?
Undefined, because the two sites are in different knowledge domains: they do not compete on Page Rank, nor do they compete for the same queries.
User data
User profile information:
- Demographic information
- User ratings
- Past purchases and visit history
Sources: operational databases, cookies, etc.
1) In the vector space model a document is represented as a vector and similarity between two documents as the angle between both vectors. Describe the analogies: what do documents and vectors have in common and how angles between vectors express the similarity between two documents. No formulas required. Describe your understanding of the issue.
We can imagine a vector as an arrow in n-dimensional space. The projection of a vector on each dimension is a coordinate. A vector has two properties - length and direction. Two vectors may have the same length but point in different directions, or have different lengths but point in the same direction. We can imagine a collection of documents as an n-dimensional space where unique terms are dimensions; there may be thousands of dimensions. A document in this space is a vector, called a term vector, where term weights are the coordinates of the vector. If two term vectors point in approximately the same direction, their respective documents contain approximately the same terms, and these terms are proportionally equally important. That means that the smaller the angle between two term vectors, the more similar the respective documents. The length of the term vectors, which depends on the number of words in the documents, is unimportant.
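A minimal Python sketch of the angle-based comparison using cosine similarity; the term weights are illustrative, and the second document is just a scaled-up (longer) version of the first to show that vector length does not matter.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two term vectors (term -> weight dicts).
    1.0 means the same direction, 0.0 means no terms in common."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

short_doc = {"web": 1.0, "search": 2.0}
long_doc = {"web": 10.0, "search": 20.0}  # same direction, ten times "longer"
print(cosine_similarity(short_doc, long_doc))  # ~1.0 -- length does not matter
```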
qq: Why do websites need evergreen content?
Websites need content that does not get out of date and so continuously attracts visitors and external links
quiz q: why is link redirection (so-called 301 redirect) useful?
When a browser or a crawler sends a URL to the web server, the web server replies with a new URL which guides the browser or crawler to the new location of the page or to another relevant webpage. This is useful when the page has been moved or deleted, when the domain name has changed, or to redirect the browser from the insecure to the secure HTTP protocol.
what is pooling?
Whether or not a document is relevant to a query is always decided by human judges. The problem is that it is not always possible to assess the relevance of all documents for each query, simply because this would require enormous amounts of evaluation. The solution is to identify a much smaller subset that contains most of the documents relevant to the query and then manually scan this subset. Some form of "magnet" is needed to create this subset, moving most or all of the relevant documents into it. The subset can be created by drawing a cut-off line in several search engine result lists and then gathering the relevant results into a common pool of relevant documents.
qq: Where does the knowledge graph get its data?
Wikipedia, Wikidata, the CIA World Factbook, and any popular structured database that Google can process.
Vector-space retrieval model uses term weights that measure importance of each term. How are these term weights calculated? Why are they calculated that way?
With binary coordinates, all words in a document are equally important for the similarity calculation, which is not always appropriate in a search. To reflect the relative importance of certain words, we need numeric weights instead of binary coordinates: a higher weight means that a term is both present and more important than terms with lower weights. Two properties are traditionally used to calculate the term weight: term frequency (tf, how frequent a term is in the document) and inverted document frequency (idf, how unique the term is in the entire document collection). A high idf for a word shows high uniqueness; words with low idf are common words like stop words. The tf-idf weight (tf multiplied by idf) shows how characteristic a certain word is of a specific document. Some documents are longer than others, therefore the term frequency must be normalized so that longer documents do not automatically get more weight. tf-idf is good to use because it combines how unique words are with how often they are used. If you only looked at how many times a word is used, you would not capture what is unique for this document; if you only looked at unique words, you would miss words that are important for this document.
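A small Python sketch of tf-idf weighting over a toy collection; the exact tf normalization and the logarithm base vary between textbooks, so this is only one common variant.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} vector per document."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)  # length normalization of tf
        vec = {term: (count / length) * math.log(n_docs / df[term])
               for term, count in counts.items()}
        vectors.append(vec)
    return vectors

docs = [
    "the web search engine indexes the web".split(),
    "the students wrote a search query".split(),
    "the engine room of the ship".split(),
]
for vec in tf_idf_weights(docs):
    print({t: round(w, 3) for t, w in vec.items()})
# 'the' gets weight 0 (it appears in every document); rarer terms get higher weights.
```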
quiz q why do search engines include link analysis in their search algorithms?
a page that receives many links from pages in the same topic is an authority within this topic; authorities are important pages. Links that go from an authority to other pages within the same topic show the way to other reasonably important pages. A link is like a vote from a page, a vote that is difficult to fake (so they thought): many received votes mean an important page.
qq: what is a preferential crawler vs a universal crawler?
a preferential crawler attempts to only download pages of certain types or topics whereas a universal crawler downloads any page
quiz q: What does bag of words mean?
a set of words in a document, where the frequency of each word is indicated. Bag of words means that syntax is ignored and words are studied as a group of items appearing in or missing from a document or document collection. It captures word frequency but cannot be used to measure text structure, distance between words, or the beginning or end of the text. Used in the Boolean retrieval model and the vector-space retrieval model.
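A minimal bag-of-words sketch in Python over a toy sentence; word order and structure are discarded, only frequencies remain.

```python
from collections import Counter

def bag_of_words(text):
    """Word frequencies only; word order, distance and structure are discarded."""
    return Counter(text.lower().split())

print(bag_of_words("to be or not to be"))
# Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```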
Please explain how evaluation of a program (artefact) takes place using the concepts, training data set (also sometimes called development set) and evaluation/test data set. Elaborate on when you have scarce training data, how do you solve the problem?
a training data set is used to train the machine learning algorithm, and the test data set is used to evaluate the trained system/model. If there is not much training data you can use cross-validation, for example 10-fold cross-validation: the system trains on 9 folds and evaluates on the remaining fold, repeats this for all 10 permutations of the folds, and then averages the results.
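A rough sketch of k-fold cross-validation in Python; train() and evaluate() are hypothetical stand-ins for a real learner and a real evaluation metric.

```python
def cross_validate(examples, k=10, train=None, evaluate=None):
    """Split examples into k folds; train on k-1 folds, evaluate on the remaining fold,
    and average the scores over all k runs."""
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_fold = folds[i]
        train_data = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(train_data)                   # hypothetical training function
        scores.append(evaluate(model, test_fold))   # hypothetical evaluation function
    return sum(scores) / k

# Illustrative stand-ins for a real learner and metric:
dummy_train = lambda data: len(data)
dummy_eval = lambda model, test: 1.0
print(cross_validate(list(range(50)), k=10, train=dummy_train, evaluate=dummy_eval))  # 1.0
```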
qq: You have been asked to build a system that monitors the Internet and gives an alert if interesting information on your competitors is written somewhere. You specifically want it to find two things: * Information that indicates that one of the companies among your competitors is planning to buy another one of your competitors. * Information that indicates that one of your competitors will be bought by a company that is NOT among the companies that you count as your competitors.
- a web crawler
- a named entity recognizer that recognizes company names
- a keyword search component that searches for companies in a list of known competitors
You want to improve your movie recommendation website by analyzing your users and their interactions with the website. The users can rate and review movies, suggest movies to their friends, add movies to their watch-list (i.e., a list of movies the user wants to watch), see what their friends have watched and are planning to watch. a. Describe one user identification method that would allow you to track users who are not logged in to the system. Mention at least one weakness of the method. b. What is path completion? Describe why it may be needed to describe a user's actual navigation path. c. What pattern discovery method would you use to identify users who like similar movies? Motivate your choice and describe how you would use this to add a feature to your website.
a. IP address + user agent: not guaranteed to be unique; defeated by rotating IPs. Embedded session IDs: cannot capture repeat visitors; additional overhead. Cookies: can be turned off by users. Software agents: likely to be rejected by users.
b. The process of inferring missing user references due to caching, i.e. when there is no record in the server logs because no HTTP request is made when the user revisits webpages that have already been downloaded.
c. Clustering of users based on their ratings of movies. This would allow groups of users who like similar movies to be identified. It can be used for providing movie recommendations: depending on the cluster to which a given user belongs, movies that other users in the cluster have rated highly, but that the given user has not watched, can be recommended to him/her.
You have been hired by a company selling virtual reality headsets. They want to know what opinions people are expressing about them. They also want to know which other companies that are mentioned in the same documents that mention them. a. You decide to use supervised machine learning to solve this task. What resources (such as data and tools) do you need to train and evaluate the system? b. What technique can be used to find the names of the other companies? c. When applying your system, you discover a large amount of opinions. Why is aggregation of opinions necessary? How would you present your results to the company?
a. Training and test data; annotations of the data (e.g. positive and negative opinions); some implementation of a machine learning algorithm, e.g. SVM.
b. Named entity recognition.
c. Why aggregation: a long list of opinions is not useful. Presentation: some way of aggregating, e.g. top opinions, number of opinions per polarity, visualisation.
quiz q what does a document index consist of?
an ordered list of words where each word is linked to information which tells in which documents the word occurs, in which position in a document the word occurs, and other application dependent information.
quiz q: why does high recall imply low precision, and high precision imply low recall?
because high recall requires retrieving more documents, which means that more irrelevant documents are also retrieved; and vice versa, high precision focuses on the top retrieved documents, leaving behind quite a few relevant documents. (Recall is the number of retrieved relevant documents divided by the total number of relevant documents; precision is the number of retrieved relevant documents divided by the number of retrieved documents.)
qq: The data of interest in Web usage mining are obtained through various sources and can be categorized into four primary groups: usage data, content data, structure data and user data.To which data type does the following belong: data comprised of combinations of textual materials and images. The data sources used to deliver or generate this data include static HTML/XML pages, multimedia files, dynamically generated page segments from scripts, and collections of records from the operational databases.
content data
What is truncation?
cutting something off
- Use an asterisk (*) to truncate a word
- Broadens/expands keyword search results
- Ex: psychology* retrieves psychology, psychologist, psychological, etc.
- Combines searches, ex: simulat* AND computer retrieves works about computer-related simulations, simulators, etc.
Truncation is left to the user to initiate and is language dependent.
quiz q: what do we have in term document vector space?
documents are represented as vectors in an n dimensional space where each dimension is a term and each coordinate of the vector is the weight of the term in the document.
qq: Porter's five forces
existing competition, threat of new entrants, power of buyers, power of suppliers, threat of substitution
What are the disadvantages of the scalar product?
It favors longer documents. The disadvantage of the scalar product is that longer documents are more likely to be judged relevant because they are more likely to contain matching terms; therefore the document length should be factored into the similarity score (normalization/weighting).
quiz q: how do we obtain interpolated precision value at a given recall value?
find the largest measured precision value for all recall values equal to or larger than the given recall value on the precision-recall curve, i.e. the largest measured precision value from the given recall value and to the right. Interpolated precision values never increase as recall increases (the curve stays level or falls).
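A small Python sketch of the interpolation over measured (recall, precision) points; the measured values are invented.

```python
def interpolated_precision(points, recall_level):
    """Largest measured precision at any recall >= recall_level.
    points: list of (recall, precision) measurements."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates) if candidates else 0.0

measured = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.75), (0.8, 0.5), (1.0, 0.3)]
for level in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(level, interpolated_precision(measured, level))
# The interpolated values never increase as the recall level grows.
```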
qq: How does Google discover relationships between terms and entities whose meaning Google does not know?
if different entities appear in the same context many times, they are apparently the same sort of thing; if Google sees pairs of entities in the same context many times, those pairs most likely express the same sort of relationship.
quiz q: how does robots exclusion protocol work?
it lists folders and individual files that should not be crawled and indexed in a robots.txt file. In the robots meta tag of HTML documents, we indicate whether the retrieved page should or should not be indexed.
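A minimal sketch of how a polite crawler could honor a robots.txt file, using Python's standard urllib.robotparser; the robots.txt content and URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for example.com
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/page.html
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("MyCrawler", "https://example.com/public/index.html"))   # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/report.html")) # False
```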
qq: What is considered good diversity of links coming to a website?
links from different domains (domain diversity); links targeting a wide range of pages inside the website (deep linking)
quiz q: What is the purpose of nofollow links?
nofollow signals that a link points to untrusted or off topic content and search engines should not consider the link in their link analysis algorithms.
qq: Another potentially important pre-processing task which is usually performed after sessionization is _______________which is necessary due to client- or proxy-side caching.
path completion
qq: What indicates a good search phrase to optimize for?
a popular search phrase used by many users but not present in many webpages. Ex: "drugstore nearby" is location related, this person is looking for a drugstore and is likely to visit it; "ibumetrin 400 mg best price", this person is ready to buy the product.
quiz q: what are the main features of the extended boolean retrieval model?
searcher has much more control over the search process. the model considers text structure and distance between words when it matches the query to a piece of text.
qq What are seed sites and what are they good for?
seed sites are highly trusted websites. The shortest link-wise distance between a webpage and some page on a seed site approximates the trustworthiness of the webpage: a shorter distance means higher trustworthiness.
qq: In the context of Web usage data,_________________can be used to capture frequent navigational paths among user trails. By using this approach, Web marketers can predict future visit trails which will be helpful in placing advertisements aimed at certain user groups.
sequential pattern mining
quiz q: What does cosine similarity measure?
similarity between two documents calculated as a function of the angle between the term vectors of these documents
qq: (opinion mining) A statement can be subjective or objective and it can have different polarities (often positive, negative, or neutral). Match the sentences below with the best description: "I have five green apples and two red apples", "Apples can have different colours", "Green apples are delicious"
"I have five green apples and two red apples" - subjective and neutral
"Apples can have different colours" - objective and neutral
"Green apples are delicious" - subjective and positive
What is a canonical URL?
the URL of the page that Google thinks is most representative of a set of duplicate pages on your site.
quiz q: what does inverse document frequency measure?
the uniqueness of a term in the document collection. Low IDF means that the term appears in many documents and high IDF means that the term appears in few documents. The more unique a word is, the better it describes a smaller subset of the collection. Uniqueness can be measured by the total number of documents divided by the number of documents where the word/term appears; the inverted document frequency is calculated by applying a logarithm to this ratio (idf = log(N / df), where N is the total number of documents and df is the number of documents containing the term). It doesn't matter which logarithmic base is used as long as all calculations use the same base.
quiz q: How is explicit relevance feedback carried out?
the user indicates which retrieved documents are relevant to the query; the system modifies the original query to include terms that are well represented in the documents marked relevant, and the modified query is submitted to the system again.
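One common way to implement the query modification step is a Rocchio-style update, sketched below in Python; the weighting constants alpha and beta are illustrative and not taken from the course material.

```python
def modify_query(query_vec, relevant_docs, alpha=1.0, beta=0.75):
    """Move the query vector towards the centroid of the documents marked relevant.
    All vectors are term -> weight dicts; alpha and beta are illustrative constants."""
    new_query = {term: alpha * weight for term, weight in query_vec.items()}
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) + beta * weight / len(relevant_docs)
    return new_query

query = {"jaguar": 1.0}
relevant = [{"jaguar": 0.5, "car": 0.8}, {"jaguar": 0.4, "engine": 0.6}]
print(modify_query(query, relevant))
# Terms well represented in the relevant documents ('car', 'engine') enter the query.
```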
quiz q What is the favorite way of search engines to obtain links that are eventually shown in search results?
the website owner writes the link into the sitemap; the crawlers discover new links by crawling the web and add these links to the search index.
quiz q: What is the purpose of robots exclusion protocol?
to instruct the crawlers which pages on the website should not be included in search engine indices.
quiz q: what is the purpose of a document index?
to speed up the search process by searching in structured data that represents the document collection instead of searching in an unstructured collection of documents.
qq: The analysis of Web usage does not require knowledge about a user's identity. However, it is necessary to distinguish among different users. We use the phrase _____________ to refer to the sequence of logged activities belonging to the same user.
user activity record (the recorded sequence of actions performed by a user that is found in a web service log)
What are the four levels of opinion mining?
• Document level • Paragraph level • Sentence level • Aspect level
Search engines consider a number of on-line and off-line features of websites and webpages in order to judge the authority and credibility of each website and webpage. Please specify these features and make an estimate how reliable they are, i.e., how difficult it is to fake each feature by black-hat SEO. Grade each feature on a scale 1 (not reliable, easy to fake) to 10 (reliable, difficult to fake). Motivate the grade in one-two sentences. Please observe that the question addresses only the features that determine the authority and credibility of websites and webpages.
- Page Rank. Grade 9. Difficult to fake a big PR.
- Link diversity. Grade 3-9. It is only as genuine as the links are; thousands of links are difficult to fake.
- Small distance to highly trusted websites. Grade 9. Trusted sites do not link to spammers.
- Domain age. Grade 5. You cannot fake it, but you can buy an old domain name.
- Reputable hosting company. Grade 5. Shady business avoids places where it can be prosecuted.
- Reputable physical address. Grade 10. You register the business with Google, and Google can verify with the "yellow pages" who the tenant at the given address is.
- Reputable mentions of the telephone number. Grade 5. It is easy to fake positive mentions, but no one fakes negative ones.
- Opinions about the brand in social media. Grade 9. You can fake a small number of opinions but not many of them.
- Clear authorship of the information. Grade 10. Normally people are aware of where their name appears and would object to false appearances.
- Spelling, grammar. Grade 10. If they are bad, the site is bad.
- Bounce rate, how much time people spend on the site. Grade 10. You cannot fake the behavior of many users.
- Task completion rate. Grade 10. It is never faked.
Usage data (web usage mining)
- Server log files: web server access logs, application server logs
- Navigational behavior of visitors
- Each HTTP request -> an entry in the server access log