Chapter 6: Extracting Meaning from Data on the Web
Data Pipeline: Document Frequency Matrix
With tokenization complete, it is possible to construct a new dataframe (i.e., a matrix) where: -Each row represents a document. -Each column represents a distinct token. -Each cell is the count of that token in that document.
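A minimal sketch of the matrix described above, written in Python as a stand-in for the R tooling these notes mention. The two documents and their tokens are hypothetical, and the dictionary-of-lists layout is just one convenient way to hold the rows.

```python
from collections import Counter

# Hypothetical, already-tokenized documents (not from the course material).
docs = {
    "doc1": ["the", "duck", "swims"],
    "doc2": ["the", "duck", "quacks", "at", "the", "duck"],
}

# Columns: every distinct token across all documents, in sorted order.
vocab = sorted({tok for toks in docs.values() for tok in toks})

# Rows: one list of counts per document, aligned with vocab.
dfm = {}
for name, toks in docs.items():
    counts = Counter(toks)  # a missing token counts as 0
    dfm[name] = [counts[term] for term in vocab]

print("     ", vocab)
for name, row in dfm.items():
    print(name, row)
```

Each row has one cell per column in vocab, so a token that never appears in a document simply gets a 0.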
Social listening benefits
You can get consumer data directly from consumers themselves. This helps marketing professionals make better decisions about the 4 Ps (product, price, place, and promotion).
Classification trees
Tree models in which the target variable takes a discrete set of values (e.g., ham/spam).
Data Scraping
-Computer-programmed extraction of information from individual computer screens, websites, or reports. -The legal and legitimate uses of data scraping focus primarily on scraping from public websites.
Data Scraping- Web Crawlers
-A web crawler uses web scraping when firms index the web periodically to make searching quicker. -After extraction, data is stored to calculate metrics. -It is also possible to gain even greater control over web scraping through programming in languages like R.
Online Content Count Tools
-Calculates frequencies of words or phrases embedded within a media provider's website -Ex: The New York Times has a tool to search for any word or phrase within articles published by the NY Times from 1851 to the present -Counts can be adjusted by date range, article type, and section -Public access
Native Social Listening Tools
Some social media outlets provide detailed analytics about who is interacting with your content (example: Facebook)
Do we want all tokens to be terms in our DFM?
-Casing (e.g. If vs if) -Punctuation (e.g. " ? ! , .) -Numbers (e.g. 0, 56, 109) -Every word (e.g. the, an, a) -Symbols (e.g. <, @, #) -What about similar words (e.g. ran, run, runs, running)? -In our case, the answer to all of these is no. We remove them and use stemming to reduce similar words to a single representation of the word. -Pre-processing is a major part of text analytics!
Data Scraping and Content Analysis Tools
-Do I have to be a computer programmer? -No! There are countless programs (free and commercial) to help you mine the web. -Linguistic Inquiry and Word Count (LIWC) -Interprets text to reveal thoughts, attitudes, feelings, personality, and motivations of the author.
Octoparse
-Easily-configured visual scraping tool -Can run extractions on the cloud and on your own local machine -Exports the scraped data in TXT, CSV, HTML or Excel formats
Social listening examples
-Hootsuite -Facebook Insights -TweetReach
Search-Volume Data Tools
-Indexes the search volume of terms people use on search engines -Data can be downloaded into raw files -Public Access
Data Pipeline: Tokenization
-Key question: How do we represent text as a data frame? -Answer: Words become columns. -Take the following hypothetical document: "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck." -The first step in representation is decomposing a text document into distinct pieces, or tokens. -Applying tokenization to our hypothetical document could produce the following tokens: [If] [it] [looks] [like] [a] [duck] [,] [swims] [like] [a] [duck] [,] [and] [quacks] [like] [a] [duck] [,] [then] [it] [probably] [is] [a] [duck] [.]
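One way to reproduce that token list is a small regular-expression tokenizer. This is a sketch in Python rather than the R packages named elsewhere in these notes, and the rule it uses (runs of letters, digits, or apostrophes are words; any other non-space character is its own token) is an assumption, not the course's tokenizer.

```python
import re

def tokenize(text):
    # A word is a run of letters, digits, or apostrophes;
    # every other non-whitespace character becomes its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

doc = ("If it looks like a duck, swims like a duck, and quacks like a duck, "
       "then it probably is a duck.")
tokens = tokenize(doc)
print(tokens)
```

Casing and punctuation are kept at this stage; deciding what to drop is a separate pre-processing step.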
Text Analysis Tools
-Lexalytics -Sysomos -Clarabridge -Provalis -RapidMiner -Medallia -Luminoso -Etuma
Public Data Tools
-National & Local Governments: Census data and NCHS -Private Firms: AWS
Sentiment Analysis Tools
-NiFi -Hortonworks -OpenText -Brandwatch -StatSoft -Cision -Meltwater -Critical Mention
Search Volume Data Tools
-Search volume tools allow the user to see the popularity of specific topics -Google dominates this space with Trends and AdWords -Trends is used to track the ups and downs of search interest over time -AdWords also shows search popularity, but is designed to give marketing professionals advertising opportunities.
Google Trends
-Simple tool that works just like a Google search -Indexes search interest on a 0-100 scale -Recently expanded service to include data from Google News, Google Images, Google Shopping, and YouTube
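A rough sketch of how a 0-100 index can be produced, assuming the scale is relative to the busiest period in the selected range (peak = 100). The weekly counts and the function name index_to_peak are hypothetical; Google does not publish raw counts through Trends.

```python
def index_to_peak(raw_counts):
    # Scale each value relative to the largest one, so the peak becomes 100.
    # Assumes at least one count is greater than zero.
    peak = max(raw_counts)
    return [round(100 * c / peak) for c in raw_counts]

weekly_searches = [20, 50, 40]  # hypothetical raw search counts
print(index_to_peak(weekly_searches))  # [40, 100, 80]
```

Because the index is relative, two terms with the same score in different charts do not necessarily have the same search volume.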
Spinn3r
-Spinn3r scrapes complete data from blogs, news sites, social media, and RSS feeds. -Its Firehose API manages 95% of the crawling and indexing work. -Scraped data can be filtered using keywords.
Steps of Content Analysis
-Structure the data -Clean the data -Extract meaning from the data
R
-Text analysis through tokenization and sentiment analysis through dictionaries -Packages: tm, quanteda
Other Search Volume Tools
-Tons of free, simple keyword research tools -Searchvolume.io -Serps.com -Wordstream.com -Bing and Yahoo offer services similar to Google Trends -Moz is a popular commercial service that analyzes search volume and reports metrics and suggested actions
Commercial Content analysis tools
-User pays for access -Linguistic Inquiry and Word Count (LIWC) -Translates words and phrases into psychological states
Content Analysis
-Using acquired data in meaningful ways -Studying digital text, photos, audio or visual formats of communication to further understand customers -Sentiment vs. text analysis
Content Analysis in R
-Various packages (e.g., tm, quanteda) -Parses text into meaningful pieces of sentences, and groups related phrases -Includes options for sentiment analysis using dictionaries
Commercial Data Tools
-Vast amounts of data (customer details, product information, trends, and more) -Typically structured -Costs $$$
Dexi.io
-Web-based scraping application that doesn't require any download -Browser-based tool that sets up crawlers to fetch data in real time -Can save the scraped data directly to Box.net and Google Drive or export it as JSON or CSV files -Supports scraping data anonymously using proxy servers
Data Pipeline: Hypothetical DFM
-Word ordering is not preserved! -This is known as the "bag-of-words" (BOW) model. -The BOW model is a very common representation in text analytics.
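A small illustration of what losing word order means, sketched in Python. The two sentences are hypothetical; they say opposite things but produce identical token counts, so a bag-of-words representation cannot tell them apart.

```python
from collections import Counter

# Two hypothetical documents with the same words in a different order.
doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()

bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a)
print(bag_a == bag_b)  # True: the counts match even though the meaning differs
```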
Summary of Standard Text Analytics Data Preprocessing Pipeline
1. Tokenize 2. Lowercase 3. Remove symbols, numbers, and punctuation 4. Remove stop words (e.g. the, an, a) 5. Stem: convert (ran, run, runs, running) to (run) -Note: More sophisticated processing is possible, but this pipeline follows the 80/20 rule: 20 percent of the tools will get you useful results 80 percent of the time.
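The five steps can be sketched end to end in Python. Everything here is illustrative: the stop-word list is tiny, toy_stem is a hand-rolled suffix stripper rather than a real stemming algorithm, and the irregular form "ran" is handled through a small lookup table because suffix stripping alone would not map it to "run".

```python
import re

STOP_WORDS = {"the", "an", "a"}   # tiny illustrative list (assumption)
IRREGULAR = {"ran": "run"}        # hand-written lookup (assumption)

def toy_stem(word):
    # Toy stemmer: irregular lookup, then strip "-ing" or a plural "-s".
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]      # undouble: "runn" -> "run"
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

def preprocess(text):
    tokens = re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)  # 1. tokenize
    tokens = [t.lower() for t in tokens]                          # 2. lowercase
    tokens = [t for t in tokens if t.isalpha()]                   # 3. keep purely alphabetic tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]           # 4. remove stop words
    return [toy_stem(t) for t in tokens]                          # 5. stem

print(preprocess("The dog ran, a cat runs, and 3 ducks are running!"))
# ['dog', 'run', 'cat', 'run', 'and', 'duck', 'are', 'run']
```

Note that step 3 as written also drops tokens containing apostrophes or digits (such as "don't" or "56th"); a production pipeline would likely treat those more carefully.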
Structure the data
Sort data scraped from websites or social media into a comprehensible format
Content Analysis vs. Social Listening
Social Listening: -Gathers content from forums, comment sections, and social media -Converts from HTML to text readable by humans, but not interpreted (i.e., raw data) Content Analysis: -Allows for sentiment and text analysis of gathered data -Serves segmentation and targeting activities, as well as opinions about brands
Social listening
Social listening tools are platforms that connect to various social media networks in order to extract consumer data.
Multi-Platform Social Listening Tools
Allows connection to multiple platforms (example: Hootsuite)
Clean the data
Remove errors - irrelevant, incorrectly parsed, or unhelpful entries
Content Analysis Tools
Commercial: LIWC. Free and open source: R
Building Predictive Classification Model
-Cross-validation is the basis of our model-building process. -It is a technique for assessing how accurately a predictive model will perform in production on brand-new data that it has never seen before. -Typically we want to use a three-way split of training, validation, and test data. -Since our data is fairly large, we are going to use a single decision tree algorithm for our model.
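A minimal sketch of the two ideas above in plain Python: a three-way split and k-fold cross-validation indices. The 60/20/20 proportions, the seed, and both function names are assumptions chosen for illustration; the notes do not specify them.

```python
import random

def three_way_split(items, train=0.6, validation=0.2, seed=42):
    # Shuffle a copy, then cut it into training, validation, and test parts.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def k_fold_indices(n, k):
    # Yield (training indices, held-out indices) for each of k folds.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, folds[i]

train_set, val_set, test_set = three_way_split(range(10))
print(len(train_set), len(val_set), len(test_set))  # 6 2 2

for training, held_out in k_fold_indices(6, 3):
    print(training, held_out)
```

In each fold the model would be fit on the training indices and scored on the held-out ones, so every observation is used for evaluation exactly once.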
Opportunities with data
Data--> Social Listening Content
Extract meaning from the data
Dictionaries - convert word usage into psychological and sentiment information
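A dictionary approach can be sketched as a simple lookup-and-count in Python. The word lists below are hypothetical and far smaller than any real sentiment dictionary (they are not taken from LIWC or any R package), and the score of positive matches minus negative matches is only one possible scoring rule.

```python
# Hypothetical mini-dictionaries (assumption, for illustration only).
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "terrible", "sad"}

def sentiment_score(tokens):
    # Count dictionary hits; positive words add 1, negative words subtract 1.
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return pos - neg

print(sentiment_score("i love this great phone".split()))        # 2
print(sentiment_score("terrible battery and i hate it".split())) # -2
```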
Data Pipeline: Some Considerations
Do we want all tokens to be terms in our DFM? -Casing (e.g. If vs if) -Punctuation (e.g. " ? ! , .) -Numbers (e.g. 0, 56, 109) -Every word (e.g. the, an, a) -Symbols (e.g. <, @, #) -What about similar words (e.g. ran, run, runs, running)
Single-Attribute Social Listening Tools
Focuses on a single aspect of a post such as the text or an image (example: TweetReach)