Chapter 6: Extracting Meaning from Data on the Web

Data Pipeline: Document Frequency Matrix

With tokenization complete, it is possible to construct a new data frame (i.e., a matrix) where:
-Each row represents a document.
-Each column represents a distinct token.
-Each cell is a count of the token for a document.
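
As a minimal sketch of this structure, the snippet below builds a small document frequency matrix with Python's standard library. Python is used here only for illustration (the course material itself points to R), and the two sample documents and the function name `build_dfm` are assumptions made for the example.

```python
from collections import Counter

def build_dfm(tokenized_docs):
    """Return (vocabulary, matrix): one row per document, one column per distinct token."""
    # Columns: every distinct token across all documents, in sorted order.
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Each cell is the count of that column's token in this document.
        matrix.append([counts[tok] for tok in vocabulary])
    return vocabulary, matrix

# Hypothetical, already-tokenized documents.
docs = [
    ["a", "duck", "swims"],
    ["a", "duck", "is", "a", "duck"],
]
vocabulary, dfm = build_dfm(docs)
print(vocabulary)
for row in dfm:
    print(row)
```

Tokens that never occur in a document simply get a 0 in that row, which is why real DFMs tend to be mostly zeros.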

Social listening benefits

You can get consumer data directly from consumers themselves. This helps marketing professionals make better decisions about the 4 Ps (product, price, place, promotion).

Classification trees

Tree models where the target variable can take a discrete set of values (e.g., ham vs. spam)
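
To make the idea concrete, here is a hedged sketch of the smallest possible classification tree: a single split (a "stump") chosen by Gini impurity, written in plain Python. The token counts, labels, and helper names (`gini`, `best_split`, `predict`) are all hypothetical; a real project would use a library implementation and grow a deeper tree.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(rows, labels, n_features):
    """Pick the feature whose absent/present split gives the lowest weighted impurity."""
    best_feature, best_score = None, float("inf")
    for f in range(n_features):
        absent = [lab for row, lab in zip(rows, labels) if row[f] == 0]
        present = [lab for row, lab in zip(rows, labels) if row[f] > 0]
        score = (len(absent) * gini(absent) + len(present) * gini(present)) / len(labels)
        if score < best_score:
            best_feature, best_score = f, score
    return best_feature

# Hypothetical token counts per message; columns follow the order of `terms`.
terms = ["free", "meeting", "winner"]
rows = [[2, 0, 1], [1, 0, 0], [0, 1, 0], [0, 2, 0], [0, 0, 0]]
labels = ["spam", "spam", "ham", "ham", "ham"]

split = best_split(rows, labels, len(terms))

def predict(row):
    """Predict the majority training label on the same side of the split."""
    side = [lab for r, lab in zip(rows, labels) if (r[split] > 0) == (row[split] > 0)]
    return max(set(side), key=side.count)

print("split on:", terms[split])
print(predict([3, 0, 0]), predict([0, 1, 0]))
```

The target here is discrete (spam or ham), which is what makes this a classification tree rather than a regression tree.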

Data Scraping

-Computer-programmed extraction of information from individual computer screens, websites, or reports. -The legal and legitimate uses of data scraping focus primarily on scraping from public websites.
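
A rough sketch of programmed extraction, using only Python's standard-library `html.parser`. The page source is a hypothetical string so the example runs without a network connection, and the `HeadlineScraper` class is an illustrative name, not part of any tool mentioned in this set.

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collect the text inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_headline = True
            self.headlines.append("")   # start a new headline

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines[-1] += data  # append text to the current headline

# Hypothetical page source; a real scraper would download this from a public site.
page = """
<html><body>
  <h2>New phone review</h2><p>Battery life is great.</p>
  <h2>Laptop deals</h2><p>Prices dropped this week.</p>
</body></html>
"""
scraper = HeadlineScraper()
scraper.feed(page)
print(scraper.headlines)
```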

Data Scraping- Web Crawlers

-Web crawlers apply web scraping when firms index the web periodically to make searching quicker. -After extraction, the data is stored so that metrics can be calculated. -It is also possible to gain even greater control over web scraping by programming in languages like R.

Online Content Count Tools

-Calculate frequencies of words or phrases embedded within a media provider's website
-Ex: The New York Times has a tool to search for any word or phrase within articles it has published from 1851 to the present
-Counts can be adjusted by date range, article type, and section
-Public access

Native Social Listening Tools

Some social media outlets provide detailed analytics about who is interacting with your content (example: Facebook)

Do we want all tokens to be terms in our DFM?

-Casing (e.g., If vs. if)
-Punctuation (e.g., " ? ! , .)
-Numbers (e.g., 0, 56, 109)
-Every word (e.g., the, an, a)
-Symbols (e.g., <, @, #)
-Similar words (e.g., ran, run, runs, running)
-In our case, the answer for all of these is no. We remove them (lowercasing handles casing) and use stemming to transform similar words into one representation of the word.
-Pre-processing is a major part of text analytics!

Data Scraping and Content Analysis Tools

-Do I have to be a computer programmer? -No! There are countless programs (free and commercial) to help you mine the web. -Linguistic Inquiry and Word Count (LIWC) -Interprets text to reveal thoughts, attitudes, feelings, personality, and motivations of the author.

Octoparse

-Easily-configured visual scraping tool -Can run extractions on the cloud and on your own local machine -Exports the scraped data in TXT, CSV, HTML or Excel formats

Social listening examples

-Hootsuite -Facebook Insights -TweetReach

Search-Volume Data Tools

-Indexes the search volume of terms people use on search engines -Data can be downloaded into raw files -Public Access

Data Pipeline: Tokenization

-Key question: How to represent text as a data frame?
-Answer: Words become columns
-Take the following hypothetical document: "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck."
-First step in representation is decomposing a text document into distinct pieces, or tokens.
-Applying tokenization to our hypothetical document could produce the following tokens:
[If] [it] [looks] [like] [a] [duck] [,] [swims] [like] [a] [duck] [,] [and] [quacks] [like] [a] [duck] [,] [then] [it] [probably] [is] [a] [duck] [.]
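
One way to reproduce this token list is a short regular expression, sketched below in Python. The pattern is an assumption about how the tokenizer behaves (words and individual punctuation marks become separate tokens); real tokenizers, including those in R packages, offer more options.

```python
import re

def tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

document = ("If it looks like a duck, swims like a duck, "
            "and quacks like a duck, then it probably is a duck.")
tokens = tokenize(document)
print(tokens)
print(len(tokens), "tokens")
```

Note that casing and punctuation are still present at this stage; they are handled later in pre-processing.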

Text Analysis Tools

-Lexalytics -Sysomos -Clarabridge -Provalis -RapidMiner -Medallia -Luminoso -Etuma

Public Data Tools

-National & Local Governments: Census data and NCHS -Private Firms: AWS

Sentiment Analysis Tools

-NiFi -Hortonworks -OpenText -Brandwatch -StatSoft -Cision -Meltwater -Critical Mention

Search Volume Data Tools

-Search volume tools allow the user to see the popularity of specific topics -Google dominates this space with Trends and AdWords -Trends tracks how search interest in a topic rises and falls over time -AdWords also shows search popularity, but is designed to give marketing professionals opportunities for advertising.

Google Trends

-Simple tool that works just like a Google search -Indexes search interest on a 0-100 scale -Recently expanded service to include data from Google News, Google Images, Google Shopping, and YouTube
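
The 0-100 scale can be illustrated with a small rescaling sketch in Python. This assumes the index is set so the busiest period scores 100 and everything else is proportional to it; the weekly figures and the function name `to_interest_index` are hypothetical, and Google's exact normalization is not described in this set.

```python
def to_interest_index(values):
    """Rescale values so the largest becomes 100 and the rest are proportional."""
    peak = max(values)
    if peak == 0:
        return [0 for _ in values]
    return [round(100 * v / peak) for v in values]

# Hypothetical searches for a term per 10,000 total searches, one value per week.
weekly_searches = [20, 40, 80, 60]
print(to_interest_index(weekly_searches))
```

Because the scale is relative, an index of 50 means half the peak interest, not a specific number of searches.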

Spinn3r

-Spinn3r scrapes full content from blogs, news sites, social media, and RSS feeds. -Firehose API manages 95% of crawling and indexing work. -Scraped data can be filtered using keywords.

Steps of Content Analysis

-Structure the data -Clean the data -Extract meaning from the data

-Text analysis through tokenization and sentiment analysis through dictionaries -tm -quanteda

Other Search Volume Tools

-Tons of free, simple keyword research tools -Searchvolume.io -Serps.com -Wordstream.com -Bing and Yahoo offer services similar to Google Trends -Moz is a popular commercial service that analyzes search volume and reports metrics and suggested actions

Commercial Content analysis tools

-User pays for access -Linguistic Inquiry and Word Count (LIWC) -Translates words and phrases into psychological states

Content Analysis

-Using acquired data in meaningful ways -Studying digital text, photos, audio or visual formats of communication to further understand customers -Sentiment vs. text analysis

Content Analysis in R

-Various packages (tm, quanteda) -Parses text into meaningful pieces of sentences, and groups related phrases -Includes options for sentiment analysis using dictionaries

Commercial Data Tools

-Vast amounts of data (customer details, product information, trends, and more) -Typically structured -Costs $$$

Dexi.io

-Web-based scraping application that doesn't require any download -Browser-based tool that sets up crawlers to fetch data in real time -Has features that save the scraped data directly to Box.net and Google Drive or export it as JSON or CSV files -Supports scraping data anonymously using proxy servers

Data Pipeline: Hypothetical DFM

-Word ordering is not preserved! -This is known as the "bag-of-words" model. -The BOW model is a very common representation in text analytics
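
A quick way to see what "word ordering is not preserved" means, sketched in Python with two hypothetical sentences: their meanings differ, but their token counts are identical, so a bag-of-words model cannot tell them apart.

```python
from collections import Counter

doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()

bag_a = Counter(doc_a)   # token -> count, order discarded
bag_b = Counter(doc_b)

print(bag_a == bag_b)    # compares the two bags of words
print(doc_a == doc_b)    # compares the ordered token lists
```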

Summary of Standard Text Analytics Data Preprocessing Pipeline

1. Tokenize
2. Lowercase
3. Remove symbols, numbers, and punctuation
4. Get rid of stop words (e.g., the, an, a)
5. Stem: convert (ran, run, runs, running) to (run)
-Note: More complicated steps are possible, but this follows the 80/20 rule: 20 percent of the tools get you useful results 80 percent of the time.
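
The five steps can be sketched end to end in a few lines of Python. Everything here is illustrative: the stop-word list is deliberately tiny, and `STEM_LOOKUP` is a toy stand-in for a real stemmer (a rule-based stemmer strips suffixes and would typically not map an irregular form like "ran" to "run").

```python
import re

STOP_WORDS = {"the", "an", "a"}   # tiny illustrative list
STEM_LOOKUP = {"ran": "run", "runs": "run", "running": "run"}   # toy stand-in for a stemmer

def preprocess(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)             # 1. tokenize
    tokens = [t.lower() for t in tokens]                   # 2. lowercase
    tokens = [t for t in tokens if t.isalpha()]            # 3. keep purely alphabetic tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 4. drop stop words
    tokens = [STEM_LOOKUP.get(t, t) for t in tokens]       # 5. map similar words to one form
    return tokens

print(preprocess("The dog ran 3 miles; a cat runs, an owl is running!"))
```

Step 3 also drops mixed tokens such as "b2b", which may or may not be desirable depending on the data.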

Structure the data

Sort data scraped from websites or social media into a comprehensible format

Content Analysis vs. Social Listening

Social Listening:
-Gathers content from forums, comment sections, and social media
-Converts it from HTML to human-readable text without interpreting it (i.e., raw data)
Content Analysis:
-Allows for sentiment and text analysis of gathered data
-Serves segmentation and targeting activities, as well as gauging opinions about brands

Social listening

Social listening tools are platforms that connect to various social media networks in order to extract consumer data.

Multi-Platform Social Listening Tools

Allows connection to multiple platforms (example: Hootsuite)

Clean the data

Remove errors - irrelevant, incorrectly parsed, or unhelpful entries
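
A small cleaning pass might look like the Python sketch below. The rules chosen here (drop blank entries, collapse extra whitespace, drop case-insensitive duplicates) are assumptions for illustration; which entries count as irrelevant or unhelpful depends on the project.

```python
def clean_entries(entries):
    """Drop blank entries and case-insensitive duplicates, normalizing whitespace."""
    seen = set()
    cleaned = []
    for entry in entries:
        text = " ".join(entry.split())   # trim and collapse runs of whitespace
        if not text:
            continue                     # skip blank entries
        key = text.lower()
        if key in seen:
            continue                     # skip duplicates of something already kept
        seen.add(key)
        cleaned.append(text)
    return cleaned

# Hypothetical scraped comments.
raw = ["Great phone!", "  ", "great phone!", "Battery   died fast", ""]
print(clean_entries(raw))
```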

Content Analysis Tools

LIWC (commercial) or R (free, open source)

Building Predictive Classification Model

-Cross-validation is the basis of our model-building process
-It is a technique for assessing how accurately a predictive model will perform in production on brand-new data that it has never seen before
-Typically want to use a three-way split of training, validation, and test
-Since our data is fairly large, we are going to use a single decision tree algorithm for our model
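
The three-way split can be sketched in plain Python as follows. The 60/20/20 proportions, the fixed seed, and the function name `three_way_split` are assumptions for the example; the card does not specify them.

```python
import random

def three_way_split(rows, train_frac=0.6, valid_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)        # fixed seed makes the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]          # whatever remains
    return train, valid, test

rows = list(range(100))   # stand-in for 100 labeled documents
train, valid, test = three_way_split(rows)
print(len(train), len(valid), len(test))
```

The model is fit on the training set, tuned against the validation set, and scored once on the test set, which it has never seen.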

Opportunities with data

Data--> Social Listening Content

Extract meaning from the data

Dictionaries - convert word usage into psychological and sentimental information
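
A hedged sketch of dictionary-based scoring in Python: each token is looked up in a word list and the matches are tallied by category. The six-word dictionary below is hypothetical and far smaller than a real one such as LIWC.

```python
# Hypothetical mini-dictionary mapping words to a sentiment category.
SENTIMENT_DICTIONARY = {
    "love": "positive", "great": "positive", "happy": "positive",
    "hate": "negative", "awful": "negative", "broken": "negative",
}

def score_sentiment(tokens):
    """Count dictionary matches per category and report positive minus negative."""
    counts = {"positive": 0, "negative": 0}
    for token in tokens:
        category = SENTIMENT_DICTIONARY.get(token.lower())
        if category:
            counts[category] += 1
    counts["net"] = counts["positive"] - counts["negative"]
    return counts

review = "I love the camera but the battery is awful and the case arrived broken".split()
print(score_sentiment(review))
```

Words missing from the dictionary contribute nothing, so the quality of the result depends heavily on the dictionary's coverage.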

Data Pipeline: Some Considerations

Do we want all tokens to be terms in our DFM?
-Casing (e.g., If vs. if)
-Punctuation (e.g., " ? ! , .)
-Numbers (e.g., 0, 56, 109)
-Every word (e.g., the, an, a)
-Symbols (e.g., <, @, #)
-What about similar words (e.g., ran, run, runs, running)?

Single-Attribute Social Listening Tools

Focuses on a single aspect of a post such as the text or an image (example: TweetReach)
