BUS-S326 Final


Price Premium =

(Seller Price - Competitor Price), calculated for each seller-competitor pair in each transaction. Each transaction generates M observations (M: number of competing sellers).
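
As a rough illustration of the pairwise calculation, the sketch below computes one premium per competitor for a single transaction. The function name and the prices are made up for the example.

```python
def price_premiums(seller_price, competitor_prices):
    """Return one premium per seller-competitor pair for a single transaction."""
    return [seller_price - price for price in competitor_prices]

# Hypothetical transaction: the seller sold at 25.0 against M = 3 competitors.
premiums = price_premiums(25.0, [20.0, 27.5, 25.0])
print(premiums)  # [5.0, -2.5, 0.0]
```

With M = 3 competitors the transaction yields three observations, one per pair.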

sentiment analysis process

1) Break the document into its basic parts of speech, called POS tags, which identify the structural elements of a sentence (e.g., nouns, adjectives, verbs, and adverbs).
2) Algorithms identify sentiment-bearing phrases like "terrible service" or "cool atmosphere".
3) Each sentiment-bearing phrase earns a score on a logarithmic scale ranging from -10 to +10 (or a 0-1 scale, etc.).
4) The scores are combined to determine the overall sentiment of the document.
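
A toy sketch of steps 2 to 4 might look like the following. It skips POS tagging entirely and relies on a tiny hand-made phrase lexicon, so the phrases, scores, and the choice of averaging as the combination rule are all assumptions for illustration.

```python
# Hypothetical phrase lexicon with scores on a -10..+10 scale.
PHRASE_SCORES = {
    "terrible service": -8,
    "cool atmosphere": 6,
}

def document_sentiment(text, lexicon=PHRASE_SCORES):
    """Average the scores of the lexicon phrases found in the text."""
    lowered = text.lower()
    found = [score for phrase, score in lexicon.items() if phrase in lowered]
    return sum(found) / len(found) if found else 0.0

score = document_sentiment("Cool atmosphere, but terrible service.")
print(score)  # -1.0
```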

When running a regression we are making two assumptions:

1) There is a linear relationship between two variables (i.e., X and Y). 2) This relationship is additive. Technically, linear regression estimates how much Y changes when X changes by one unit.
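
To make "how much Y changes when X changes by one unit" concrete, here is a minimal least-squares sketch for a single predictor. The data points are invented; the slope returned is that per-unit change.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```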

publishing on the web

1. You create the web page on your computer.
2. You send the files to the IU Web server.
3. A web user requests your home page URL.
4. The IU Web server serves up your page.

price premium equation

2*weight(speedy, delivery) + 1*weight(great, service) + 1*weight(awful, packaging). Note: 2, 1, and 1 are the frequencies of each phrase in the review text.
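
The weighted sum can be sketched as below. The phrase frequencies come from the card; the weight values are invented placeholders, since the card does not give them.

```python
# Phrase frequencies from the card.
phrase_counts = {
    ("speedy", "delivery"): 2,
    ("great", "service"): 1,
    ("awful", "packaging"): 1,
}

# Hypothetical weights, chosen only so the arithmetic is easy to follow.
weights = {
    ("speedy", "delivery"): 1.5,
    ("great", "service"): 1.0,
    ("awful", "packaging"): -2.0,
}

# frequency * weight, summed over all phrases
text_score = sum(count * weights[phrase] for phrase, count in phrase_counts.items())
print(text_score)  # 2*1.5 + 1*1.0 + 1*(-2.0) = 2.0
```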

self selection bias

A bias that occurs because people who feel strongly about a subject are more likely to respond to survey questions than people who feel indifferent about it.

evaluating patterns in data analytics

A dataset can be divided into training and testing sets. The training set is used in learning. The testing data serves as ground truth for testing.
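
A minimal sketch of such a split, using only the standard library. The 25% test fraction and the fixed seed are arbitrary choices for the example.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = split_rows(range(8))
print(len(train_rows), len(test_rows))  # 6 2
```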

primary source

A direct source of information, such as a representative of the competitive firm being studied

Social Media Text Analysis

A technique to extract, analyze, and interpret hidden business insights from textual elements of social media content

search engine data analytics

Analyzing historical search data to gain insight into trend analysis, keyword monitoring, and advertisement spending statistics. Typical questions:
• How do people search for your brand?
• When does interest spike in your products or services?
• Which keywords drive more traffic?
• Which regions are interested in your brand?
• How are your competitors performing?

document categorization

Assign documents to pre-defined categories, e.g., classifying incoming email as spam or not, or sentiment analysis

How to store the words for fast lookup

Basic steps:
• Make a "dictionary" of all the words in all of the web pages.
• For each word, list all the documents it occurs in.
• Often omit very common words, called "stop words": the most common words are unlikely to help text mining, e.g., "the", "a", "an", "you".
• Sometimes stem the words: cats -> cat, running -> run.
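
The word-to-documents dictionary described here is usually called an inverted index. A small sketch, with stop-word removal but no stemming, and with made-up page contents:

```python
STOP_WORDS = {"the", "a", "an", "you"}

def build_index(pages):
    """Map each non-stop word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in pages.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            index.setdefault(word, set()).add(doc_id)
    return index

# Hypothetical three-page "web".
pages = {"p1": "the cat sat", "p2": "a cat ran", "p3": "you ran"}
index = build_index(pages)
print(sorted(index["cat"]))  # ['p1', 'p2']
```

Looking up a word is then a single dictionary access rather than a scan of every page.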

Human Brand

Celebrities as brands. Example: Nike's $1 billion deal with Cristiano Ronaldo highlights the economic value of a human brand on social media (the influencer effect).

textual elements of social media

Comments Tweets Blog posts Product reviews Status updates

Sequential concerns in reviews

Consumers adjust their reviews downward after observing prior negative reviews

regressions

Control for all variables that affect price premiums. Control for all numeric scores of reputation. Examine the effect of text: e.g., a seller with "fast delivery" has a premium of $7.95 over a seller with "slow delivery", everything else being equal.

explanatory model

Develop an understanding of the relationship between an outcome and a set of predictors. Focus on the nature of the relationship of each variable to the outcome. The entire data set is used to build the model. Data is typically limited; reliability of findings is judged by statistical criteria.

Page Rank Algorithm

Developed by Google's founders. Provides a ranking of the web pages that should be returned from a web search. Ranking is based in large part on how often other web pages link to a given page. More links from highly ranked pages increase a given page's rank.
• Idea: important pages are pointed to by other important pages.
• Method: each link from one page to another is counted as a "vote" for the destination page.
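
The "votes from important pages count more" idea can be sketched with a simple iterative computation. This is a textbook-style simplification, not Google's production algorithm; the damping factor of 0.85 and the three-page link graph are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # A page with no out-links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
            else:
                # Each out-link passes on an equal share of the page's rank.
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Hypothetical graph: A and B both link to C, and C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # C
```

C ends up on top because it collects two votes, and A outranks B because its single vote comes from the highly ranked C.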

Temporal concerns in reviews

Dissimilar preferences between customers who buy early versus those who buy later

session

Every time a page loads, the tracking code collects and sends updated information about the user's activity.
• These activities are grouped into a period of time called a "session".
• A session begins when a user navigates to a page that includes the tracking code.
• A session ends after 30 minutes of inactivity.
• If a user returns to a page after a session ends, a new session begins.
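
The 30-minute rule can be sketched as a pass over one user's hit timestamps. Timestamps here are plain minutes and are invented; treating a gap of more than 30 minutes (rather than exactly 30) as the cut-off is an assumption about the boundary case.

```python
SESSION_TIMEOUT_MIN = 30

def count_sessions(hit_minutes):
    """Count sessions for one user, given hit times in minutes."""
    sessions = 0
    last_hit = None
    for t in sorted(hit_minutes):
        # A new session starts on the first hit or after a long inactive gap.
        if last_hit is None or t - last_hit > SESSION_TIMEOUT_MIN:
            sessions += 1
        last_hit = t
    return sessions

print(count_sessions([0, 10, 35, 70, 75]))  # 2
```

The hits at 0, 10, and 35 stay in one session because no gap exceeds 30 minutes; the 35-minute gap before the hit at 70 starts a second one.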

exit rate

Of all pageviews of a page, exit rate is the percentage that were the last in their session.
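
A small sketch of the calculation, where each session is represented as an ordered list of page names (the sessions and page names are hypothetical):

```python
def exit_rate(sessions, page):
    """Share of pageviews of `page` that were the last pageview of a session."""
    views = sum(session.count(page) for session in sessions)
    exits = sum(1 for session in sessions if session and session[-1] == page)
    return exits / views if views else 0.0

sessions = [
    ["home", "pricing"],
    ["pricing", "home", "pricing"],
    ["home"],
]
pricing_exit_rate = exit_rate(sessions, "pricing")
print(round(pricing_exit_rate, 3))  # 0.667
```

"pricing" was viewed three times and was the last page in two sessions, giving 2/3.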

hyperlinks

Hyperlinks are not merely technical links between two websites; they also serve a symbolic purpose:
• Represent a reasonable approximation of a social relationship.
• Serve as validation or endorsement of the linked organization.
• Incoming links increase page authority (helps SEO page rankings).
• Websites mostly link to other websites of a similar nature, so hyperlinks serve as indicators of content similarity.

Self-presentational concerns in reviews

Negative information may be viewed as more useful than positive information

predictive model

Prediction is the key, not modeling elegance. Focus on predicting new observations. The data is divided into training and validation subsets. Data is plentiful (if not unlimited); reliability is judged by prediction on a hold-out sample of fresh data.

scoring reputation

Sellers are rated on these dimensions by buyers using modifiers (adjectives or adverbs), not numerical scores, e.g., "Fast shipping!" Then split the review text into parts and count the frequency of each part:
- Speedy delivery: 2
- Great service: 1
- Awful packaging: 1

Decompose reputation

Sellers are characterized by a set of fulfillment characteristics (packaging, delivery, and so on). Think of each characteristic as a dimension, represented by a noun, noun phrase, verb, or verbal phrase ("shipping", "packaging", "delivery", "arrived"). Scan the textual feedback to discover these dimensions.

downfalls of sentiment based analysis

Sentiment classification at the document and sentence (or clause) levels is not sufficient: it does not tell what people like and/or dislike. A positive opinion on an object does not mean that the opinion holder likes everything. A negative opinion on an object does not mean .....

secondary source

Someone who gets information from a primary source or another secondary source and may have altered that information, either intentionally or unintentionally

Basic Crawler Algorithm

• Start with a list of domain names and visit the home pages there.
• Look at the hyperlinks on each home page and follow those links to more pages.
• Keep a list of URLs visited and those still to be visited.
• Each time the program loads a new HTML page, add the links in that page to the list to be crawled.
• A web crawler is specifically designed to collect and store data about websites for indexing.
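
The visited / to-visit bookkeeping can be sketched without any network access by crawling a small in-memory "web". The `FAKE_WEB` dictionary and the `get_links` callback are stand-ins for fetching a page and parsing its hyperlinks.

```python
from collections import deque

def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited."""
    to_visit = deque(seed_urls)
    seen = set(seed_urls)  # everything visited or already queued
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited

# Hypothetical link structure standing in for real HTML pages.
FAKE_WEB = {
    "a.example": ["a.example/about", "b.example"],
    "a.example/about": ["a.example"],
    "b.example": [],
}
order = crawl(["a.example"], lambda url: FAKE_WEB.get(url, []))
print(order)  # ['a.example', 'a.example/about', 'b.example']
```

The `seen` set is what keeps the crawler from revisiting "a.example" when the about page links back to it.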

topic modeling

Surfaces themes or the key issues underlying a corpus and organizes the documents within the corpus according to those themes. The themes surfaced are hierarchical or nested.

Average Page Depth

The average number of pages on a site that visitors view during a single session. Repeated views of a single page are counted

macro level negative trend

The average reviewer is becoming more critical

Landing Page Optimization

The effort of optimizing a landing page for the clearest path to action, i.e., making it extremely relevant to the user's intention, keeping it clean (free of clutter and easy to navigate), professional, and effective, and clearly drawing the audience to ONE ACTION. Best if there are few or no distractions or links on the page, except the call to action.

Organic search results

The natural search results: the matches the search engine finds most relevant to a user's query. The rank of a web page in organic search reflects its success in attracting visitors to the site. Higher ranks align with greater visibility.

intention mining

To discover users' intentions (such as buy, sell, recommend, quit, desire, or wish) from natural-language social media text such as user comments, product reviews, tweets, and blog posts. Purpose: to find new potential customers who intend to buy a product (or service) and to serve existing customers who have trouble with a product.

concept mining

To extract ideas and concepts from documents. Purpose: to classify, cluster, and rank ideas from large amounts of social media text.

Golden Triangle

Top left corner. Not only in search engines.

results ranking

When a user requests specific information by entering keywords, the search engine:
1. Receives the query, then
2. Looks up the words in the index and retrieves many documents, then
3. Rank-orders the pages and extracts "snippets" or summaries containing the query words
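
A highly simplified sketch of the three steps follows. Ranking here is just "number of query words matched", and the snippet is simply the first few characters of the page; both are placeholder choices, and the pages are invented.

```python
def search(query, index, pages, snippet_width=30):
    """Look up query words in the index, rank matches, and return snippets."""
    scores = {}
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Most matched words first; ties broken by document id.
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return [(doc_id, pages[doc_id][:snippet_width]) for doc_id in ranked]

# Hypothetical pages and a word -> documents index built from them.
pages = {"p1": "cheap flights to Rome", "p2": "cheap hotels", "p3": "train tickets"}
index = {}
for doc_id, text in pages.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

results = search("cheap flights", index, pages)
print(results)  # [('p1', 'cheap flights to Rome'), ('p2', 'cheap hotels')]
```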

ISP measurements

a competitive intelligence service and consumer insights tool to help marketers compare website traffic. ex. HitWise

supervised learning methods

a model is created based on previous observations (or datasets). The training data consist of a set of training examples.
• The algorithm is trained on the type of data it can expect to analyze
• The goal is to predict the value of some variable
• e.g., information extraction, document classification
• Example: Crimson Hexagon
• Type of machine learning

trends mining

aka predictive analytics: use huge amounts of historical and real-time social media data to predict future events. Purpose: to exploit patterns in large amounts of data by using statistical techniques including machine learning, data mining, and social network analysis.

Linear Regression

an algorithm to find a precise line of fit for a set of data. Goal: predict a numerical target variable.
• Each row is a case
• Target variable is numeric

Discovery based approach in text analysis

apply more complex algorithms, based on the string-of-words representation. The string-of-words model preserves the contextual information contained in a text. More toward text mining than text analysis. Key tasks: information extraction, document summarization, document classification, topic modeling.

out links

are hyperlinks generated out of a website. Out-links attract essential, relevant, and valuable eyeballs. They are now also considered in search engine ranking. Out-linking to valuable and relevant content can help improve visitors' experience.

unsupervised learning methods

attempt to discover patterns rather than trying to fit the data into a predefined structure
• The goal is to look for patterns, groupings, or other ways to characterize the data
• e.g., clustering, topic modeling
• Type of machine learning

hyperlink analytics

deals with extracting, analyzing, and interpreting hyperlinks (e.g., in-links, out-links, and co-links).
• Hyperlink analytics reveals Internet traffic patterns and the sources of incoming or outgoing traffic to and from a website
• Used to study university rankings, political networks, and business competitiveness (e.g., using co-links to map competitive business positions)
• Does not examine internal links between pages within a website
• Does not measure the effectiveness of navigation within a website (more on landing page optimization)

Current trend of online reviews

decreasing ratings, possibly because of self-presentational concerns, sequential concerns, and temporal concerns. Macro-level negative trend.

Competitive intelligence

Definition: an intelligence system that helps managers assess their competition and vendors in order to become more efficient and effective competitors. It is forward looking: it is not enough to know what competitors have done in the past; it is also about anticipating the future.
• Make predictions about the future actions of firms, their products, and services
• Competitive intelligence guides business strategy

named entity extraction

extract the relevant information and ignore non-relevant information
• e.g., takes a document and provides answers to questions such as who (individual or organization), where, and what

Quick scan

for candidate

long scan

for relevance

Relation Extraction

how the entities are connected. Example: "Lucy Yan is a faculty at the Kelley School of Business."
• Who: Lucy Yan, Kelley School of Business
• Relation: faculty at
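
For this one sentence shape ("X is a Y at Z"), the extraction can be imitated with a regular expression. Real relation extraction relies on far more robust NLP methods; the pattern below is a hand-written assumption that only covers this template.

```python
import re

# Matches sentences of the form "<subject> is a/an <role> at [the] <object>."
PATTERN = re.compile(
    r"^(?P<subject>.+?) is an? (?P<relation>.+? at) (?:the )?(?P<object>.+?)\.?$"
)

def extract_relation(sentence):
    """Return (subject, relation, object), or None if the template does not fit."""
    match = PATTERN.match(sentence)
    if match is None:
        return None
    return match.group("subject"), match.group("relation"), match.group("object")

triple = extract_relation("Lucy Yan is a faculty at the Kelley School of Business.")
print(triple)  # ('Lucy Yan', 'faculty at', 'Kelley School of Business')
```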

Panel data

information collected from a group of consumers, organized into panels, over time ex. (ComScore, Alexa)

Acquisition

involves building awareness and acquiring user interest

Web Analytics

is the collection, analysis and reporting of Internet data for purposes of understanding and optimizing web usage, which translates into desired outcome.

landing page

is the page where visitors arrive after clicking on your promotional creative

conversion

is when a user becomes a customer and transacts with your business

behavior

is when users engage with your business

in links

links directed towards a website originating from other websites. A measure of a site's popularity (review). Important for website analytics, as both the quality and number of in-links can affect the search engine ranking of the website.

machine learning

methods or algorithms designed to learn the underlying patterns in the data and make predictions based on these patterns
• Produce accurate out-of-sample predictions
• ML methods can automatically learn which factors affect user behavior and how they interact with each other
• ML methods assume a model or structure to learn, but they use a general class of models that can be very rich

solution to biased reviews

multiple reviews; peer feedback, collect prior expectations

reporting bias

occurs when a source has the required knowledge but we question his or her willingness to convey it accurately

Two classes of search results

organic (unpaid) and advertisements (paid)

dynamic text

real-time social media user-generated text or statement to express an opinion about content or information posted over social media

clickstream

records information about a customer during a web surfing session such as what websites were visited, how long the visit was, what ads were viewed, and what was purchased

Price premiums measure

reputation

Reputation is captured in

text feedback

web presence

the degree to which they are getting recognized on the web

social media hyperlink analytics

the extraction and analysis of hyperlinks embedded within social media texts (e.g., tweets and comments). By extracting out-links and tracing them back to their sender, a network of the out-link structure can be created. By comparing out-links between tweets of different organizations, patterns can be revealed (e.g., citing domestic portals' news services/blogs, or no concentration in specific sites).

co-links/co-citation

the interconnectivity created by two or more websites (or web pages) that link to a joint website (or web page)
• If two websites receive a link from a third website, they are considered to be connected indirectly
• If two pages link to a third page, they are also considered to be co-linking
Co-links and co-citations are used by search engines to rank pages based on the words they contain. Search engine algorithms track topic relationships and rank search result pages based on this information.
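
The first case (two sites receiving links from the same third site) can be counted directly from out-link lists. The site names below are hypothetical, and the count is simply "how many sources link to both".

```python
from collections import Counter
from itertools import combinations

def colink_counts(out_links):
    """For each pair of targets, count the sources that link to both."""
    counts = Counter()
    for targets in out_links.values():
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts

# Hypothetical out-links of three source sites.
out_links = {
    "blog1": ["shopA", "shopB"],
    "blog2": ["shopA", "shopB", "shopC"],
    "news": ["shopC"],
}
counts = colink_counts(out_links)
print(counts[("shopA", "shopB")])  # 2
```

A higher count suggests the two sites are seen as related, which is how co-links can be used to map competitive positions.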

new users

the number of first-time users during the selected date range

Unique Pageviews

the number of sessions during which the specified page was viewed at least once

bounce rate

the percentage of single-page sessions in which there was no interaction with the page. A bounced session has a duration of 0 seconds.
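
A minimal sketch of the calculation, where each session is an ordered list of pages viewed. It assumes that a session with exactly one pageview had no further interaction, which real tracking tools determine from additional events; the sessions are made up.

```python
def bounce_rate(sessions):
    """Share of sessions that consist of a single pageview."""
    if not sessions:
        return 0.0
    bounces = sum(1 for session in sessions if len(session) == 1)
    return bounces / len(sessions)

sessions_sample = [
    ["home"],
    ["home", "pricing"],
    ["blog"],
    ["home", "blog", "pricing"],
]
site_bounce_rate = bounce_rate(sessions_sample)
print(site_bounce_rate)  # 0.5
```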

Search Engine Optimization

the process of maximizing the number of visitors to a particular website by ensuring that the site appears high on the list of results returned by a search engine. Testing web presence within the context of search:
• Identify keyword list
• Edit content
• Gain links
• Get indexed
A search engine result page is the list of results returned by a search engine in response to a user's query. It usually contains 10 organic listings and some paid listings at the top or right side of the page. The better your website is optimized for the search engine, the higher its ranking will be.

pageview

the total number of pages viewed

UGC (user generated content)

the various forms of online media content that are publicly available and created by end users

google's mission

to organize the world's information and make it universally accessible and useful

types of dynamic text

tweet, comment, discussion, conversation, reviews

static text

usually large in length and generated, updated, or deleted less frequently. Examples: wiki content, blog pages, Word documents, news transcripts.
• The purpose is to inform, educate, and elaborate

users

who have initiated at least one session during the date range

Markov Blanket Classifier

•Accounts for conditional feature dependencies •Allowed reduction of discriminating features from thousands of words to about 20 (movie review domain)

document summarization

• Accurately condense or paraphrase long text passages, allowing the reader to glean salient information without digesting the entire document
• E.g., news headlines, tables of contents, and abstracts
• Automated extractive summarization in web search engines

Aspect-based sentiment analysis method

• An opinion is a quintuple (entity, aspect, sentiment, holder, time), where:
• entity: target entity (or object)
• aspect: aspect (or feature) of the entity
• sentiment: +, -, or neu, a rating, or an emotion
• holder: opinion holder
• time: time when the opinion was expressed
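
One way to picture the quintuple is as a small record type. The field names follow the definition; the example values are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Opinion:
    entity: str     # target entity (or object)
    aspect: str     # aspect (or feature) of the entity
    sentiment: str  # "+", "-", or "neu" (could also be a rating or emotion)
    holder: str     # opinion holder
    time: str       # when the opinion was expressed

# Hypothetical opinion extracted from a product review.
opinion = Opinion("Phone X", "battery", "-", "reviewer_42", "2016-03-31")
print(opinion.aspect, opinion.sentiment)  # battery -
```

Storing the aspect alongside the sentiment is what lets the analysis say what people like or dislike, not just whether a review is positive overall.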

How to Rank High in the Results, Offsite

• Backlink building: the most critical factor is the PageRank. The greater the number of quality in-links to your website, the better.
• Obtain links from related/relevant websites
• Create new content that will hopefully attract links from other websites
• Link diversity, i.e., links from a variety of domains, e.g., .com, .net, .org, etc.
• Social sharing: establish a sound social media presence. People also use search engines embedded within social media channels to search for people, brands, products, and services.

reputation dimension examples

•Delivery and contract fulfillment (extent and speed) •Product quality and appropriate description •Packaging •Customer service •Price (!) •Responsiveness/Communication (speed and quality) •Overall feeling (transaction)

Crawler-based search engines

•Gather the contents of all web pages (using a program called a Crawler or Spider) •Organize the contents of the pages in a way that allows efficient retrieval (Indexing) •Take in a query, determine which pages match, and show the results (Ranking and display of results)

Eigenvector Centrality

•Having more friends does not by itself guarantee that someone is more important, but having more important friends provides a stronger signal •Eigenvector centrality tries to generalize degree centrality by incorporating the importance of the neighbors (undirected) •For directed graphs, we can use incoming or outgoing edges
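
The "important friends count more" idea can be sketched with repeated averaging over neighbors (power iteration) on an undirected graph. The four-node graph is invented. Adding each node's own score in every round is a standard trick that keeps the iteration from oscillating without changing the resulting ordering; it is a choice made for this sketch.

```python
def eigenvector_centrality(adjacency, iterations=100):
    """adjacency maps each node to the list of its neighbors (undirected)."""
    score = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # New score = own score + sum of neighbors' scores (iterating on A + I).
        new_score = {
            node: score[node] + sum(score[nb] for nb in neighbors)
            for node, neighbors in adjacency.items()
        }
        largest = max(new_score.values())
        score = {node: value / largest for node, value in new_score.items()}
    return score

# Hypothetical graph: A, B, C form a triangle; D is attached only to C.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
centrality = eigenvector_centrality(graph)
print(max(centrality, key=centrality.get))  # C
```

D has an important neighbor but only one of them, so it still ranks below A and B, which each have two well-connected neighbors.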

indexing

•Helps classify a website correctly for searching purposes •The data crawled or extracted is then indexed and stored in a database for quick access •Typically, record the following information about each page

do people trust reviews?

• Law of large numbers: single review = NO, multiple reviews = YES
• Peer feedback: number of useful votes
• Perceived usefulness is affected by: identity disclosure (users trust real people); a mixture of objective and subjective elements; readability and grammaticality

how page rank is used

•Locate the pages that contain the query text •Weight the "text score" with the "link score" •Rank results Lesson: PageRank of competitors matters! Do not obsess (only) about your own.

Machine Learning Approaches

• Machine learning algorithms: decision trees, Naïve Bayes, Maximum Entropy Classifier, Support Vector Machines (SVM), Markov Blanket Classifier

spongebob effect

• People choose movies they think they will like, and often they are right. Ratings only tell us that "fans of SpongeBob like SpongeBob" (self-selection).
• Oscar winners draw a wider audience, so their rating is much more representative of the general population.
• When SpongeBob gets a wider audience, his ratings drop.

possible reasons for biases

•People don't like to be critical •People do not post if they do not feel strongly about the product (positively or negatively)

How to Rank High in the Results, Onsite

•Position your keywords (title, headings, early on page) •Make text visible (no tiny fonts, no white-on-white) •"Alt text" for images: accessibility + search engines •Frames can kill (Flash, AJAX also problematic) •Have relevant content •Do not change topics •Build links (nice to build a real community) •Just say no to search engine spamming •Submit your key pages •Verify your listing often

Opinion Mining

• Sentiment analysis
• Draws on positive and negative word sets that convey human emotion or feeling
• Example: movie reviews. Positive words may come from love stories, comedies, and uplifting or "feel good" movies. Negative words are more likely to be associated with horror, violence, and tragedy.
• Note: there is nothing inherently good about the positive words or inherently bad about the negative words

How do Search Engines Discover Information?

• Start with a list of domain names, visit the home pages there
• Mostly HTML pages, but other file types too: PDF, Word, PPT, etc.
• A small fraction of the Web that search engines know about; no search engine is exhaustive
• Not the "live" Web, but the search engine's index
• Not the "deep" Web

why do we need competitive data?

•Understand how competitors perform •How competitors get visitors •Where customers go after visiting competitor's site •Demographics

Eye-Tracking: The F-pattern

• Users don't read text thoroughly
• The first two paragraphs must state the most important information
• Start subheads, paragraphs, and bullet points with information-carrying words
Example: Employment growth in the past decade has been in contract and temporary jobs. (New York Times, 3/31/2016)

web analytics objectives

• Web analytics for your own website: what customers look at, where they come from, how to engage them
• Web analytics for monitoring competitors: how customers behave in general, why they go to competitors, what they do there

Discovery-based text categorization can

•reduce the categorization imprecision •reduce the labor-intensiveness of developing comprehensive concept dictionaries

