BI exam 2
All of the following are challenges associated with natural language processing EXCEPT
dividing up a text into individual words in English
Clickstream analysis does not need users to
enter their perceptions of the Web site or other feedback directly to be useful in determining their preferences.
Data mining tools' capabilities and ease of use are
essential
to perform text mining
first, impose structure to the data, then mine the structured data
85-90 percent of all corporate data is
in some kind of unstructured form (e.g., text)
Data mining can be very useful in detecting patterns such as credit card fraud and
is very helpful in increasing sales
What does the robustness of a data mining method refer to?
its ability to overcome noisy data to make somewhat accurate predictions
If using a mining analogy, _________________ would be a more appropriate term than "data mining."
knowledge mining
Data mining versus statistics
The main difference is that statistics starts with a well-defined proposition and hypothesis, whereas data mining starts with a loosely defined discovery statement. Data mining looks for data sets that are as "big" as possible; statistics looks for the right size of data.
Tapping into these information sources is not an option, but a
need to stay competitive
The data field "ethnic group" can be best described as
nominal data.
unstructured data
nonnumeric information that is typically formatted in a way that is meant for human eyes and not easily understood by computers
What are the two main types of Web analytics?
off-site and on-site Web analytics
DM extracts ________ from data
patterns
Myth or reality: If the data accurately reflect the business or its customers, any company can use data mining.
reality
Striking it rich
requires creative thinking
In text mining, if an association between two concepts has 7% support, it means
that 7% of the documents had both concepts represented in the same document.
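A minimal sketch of how support could be computed, assuming each document has already been reduced to the set of concepts it mentions. The function name `support` and the toy corpus are illustrative, not from the course material.

```python
def support(documents, concept_a, concept_b):
    """Fraction of documents in which both concepts appear together."""
    both = sum(1 for doc in documents if concept_a in doc and concept_b in doc)
    return both / len(documents)


# Hypothetical toy corpus: one set of concepts per document.
docs = [
    {"battery", "overheating"},
    {"battery", "price"},
    {"screen", "price"},
    {"battery", "overheating", "screen"},
]

print(support(docs, "battery", "overheating"))  # 2 of 4 documents -> 0.5
```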
Categorization and clustering of documents during text mining differ only in
the preselection of categories.
All of the following statements about data mining are true EXCEPT
the process aspect means that data mining should be a one-step process to results.
Sentiment analysis projects require a lexicon for use. If a project in English is undertaken, you must generally make sure to
use an English lexicon appropriate to the project at your discretion.
In text analysis, what is a lexicon?
a catalog of words, their synonyms, and their meanings
In the opening vignette, the architectural system that supported Watson used all the following elements EXCEPT
a core engine that could operate seamlessly in another domain without changes
Why Data Mining?
-More intense competition at the global scale -Recognition of the value in data sources -Availability of quality data on customers, vendors, transactions, Web, etc. -Consolidation and integration of data repositories into data warehouses -The exponential increase in data processing and storage capabilities; and decrease in cost -Movement toward conversion of information resources into nonphysical form
Data mining is
-the novel aspect means that previously unknown patterns are discovered. -the potentially useful aspect means that results should lead to some business benefit. -the valid aspect means that the discovered patterns should hold true on new data.
In the Influence Health case study, what was the goal of the system?
Increasing service use
Text Analytics =
Information Retrieval + Text Mining
What does advanced analytics for social media do?
It examines the content of online conversations.
IBM Watson going head-to-head with the best of the best in
Jeopardy
All of the following are challenges associated with natural language processing
-understanding the context in which something is said. -distinguishing between words that have more than one meaning. -recognizing typographical or grammatical errors in texts
Area under the ROC Curve (AUC)
-works with binary classification -Produces values from 0 to 1.0 -Random chance is 0.5 and perfect classification is 1.0 -Produces a good assessment for skewed class distributions too!
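One way to see where the 0.5 and 1.0 reference points come from: AUC equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way by brute force; the function name `auc` and the sample scores are assumptions for illustration.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    scores: model scores, where higher means "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs correct -> 0.75
```

A model that scores every case identically gets 0.5 (all ties), and a model that ranks every positive above every negative gets 1.0.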
Companies understand that when their product goes "viral," the content of the online conversations about their product does not matter, only the volume of conversations.
False
Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales.
False
Search engines are only used in the context of the World Wide Web (WWW).
False
Since little can be done about visitor Web site abandonment rates, organizations have to focus their efforts on increasing the number of new visitors.
False
Text analytics is the subset of text mining that handles information retrieval and extraction, plus data mining.
False
In the cancer research case study, data mining algorithms that predict cancer survivability with high predictive power are good replacements for medical professionals.
False. They are good at helping doctors, not replacing them.
Association consists of
Market Basket, Link Analysis, Sequence Analysis
Myth or reality: Data mining is not yet viable for mainstream business applications.
Myth
Myth or reality: Data mining is only for large firms that have lots of customer data.
Myth
Myth or reality: Data mining provides instant, crystal-ball-like predictions.
Myth
Myth or reality: Data mining requires a separate, dedicated database.
Myth
Myth or reality: Only those with advanced degrees can do data mining.
Myth
In sentiment analysis, which of the following is an implicit opinion?
The customer service I got for my TV was laughable
In sentiment analysis, it is hard to classify some subjects such as news as good or bad, but easier to classify others, e.g., movie reviews, in the same way
True
Representative applications of association rule mining include
-*In business:* cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration -*In medicine:* relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); and genes and their functions (to be used in genomics projects) •*Commercial* -IBM SPSS Modeler (formerly Clementine) -SAS Enterprise Miner -Statistica - Dell/Statsoft -... many more •*Free and/or Open Source* -KNIME -RapidMiner -Weka -R, ...
Types of patterns
-Association -Prediction -Cluster (segmentation) -Sequential (or time series) relationships
Most common standard processes
-CRISP-DM (Cross-Industry Standard Process for Data Mining) -SEMMA (Sample, Explore, Modify, Model, and Assess) -KDD (Knowledge Discovery in Databases)
k-Fold Cross Validation (rotation estimation)
-Estimation methodology for classification -Data is split into k mutually exclusive subsets and k training/testing experiments are conducted
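A rough sketch of the k-fold idea using only index lists: every record lands in the test set exactly once, and each of the k experiments trains on the remaining k-1 folds. The helper name `k_fold_indices` is hypothetical, and real projects would usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    indices = list(range(n_samples))
    # Build k mutually exclusive folds by taking every k-th index.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(n_samples=10, k=5):
    print("train:", train, "test:", test)
```

The accuracy estimate is then typically the average of the k test accuracies.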
single/simple split method
-Estimation methodology for classification -Simple split (or holdout or test sample estimation) -Split the data into 2 mutually exclusive sets: - training (~70%) and testing (30%) -For Neural Networks, the data is split into three sub-sets (training [~60%], validation [~20%], testing [~20%])
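A minimal sketch of the simple split, assuming a random shuffle followed by a 70/30 cut. The function name `simple_split`, the default fraction, and the fixed seed are illustrative choices.

```python
import random


def simple_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records, then cut them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = simple_split(range(10))
print(len(train), len(test))  # 7 3
```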
•Clustering results may be used to
-Identify natural groupings of customers -Identify rules for assigning new cases to classes for targeting/diagnostic purposes -Provide characterization, definition, labeling of populations -Decrease the size and complexity of problems for other data mining methods -Identify outliers in a specific domain (e.g., rare-event detection)
Electronic communication records (e.g., e-mail)
-Spam filtering -E-mail prioritization and categorization -Automatic response generation
Cluster analysis methods
-Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on. -Neural networks (adaptive resonance theory [ART], self-organizing map [SOM]) -Fuzzy logic (e.g., fuzzy c-means algorithm) -Genetic algorithms
Which steps account for roughly 85% of the total CRISP-DM project time?
-Step 1: Business Understanding -Step 2: Data Understanding -Step 3: Data Preparation
CRISP-DM is composed of six consecutive phases
-Step 1: Business Understanding -Step 2: Data Understanding -Step 3: Data Preparation -Step 4: Model Building -Step 5: Testing and Evaluation -Step 6: Deployment
Structured versus unstructured data
-Structured data: in databases -Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
K-means clustering algorithm
-k: predetermined number of clusters -Algorithm (Step 0: determine value of k) Step 1: Randomly generate k random points as initial cluster centers. Step 2: Assign each point to the nearest cluster center. Step 3: Re-compute the new cluster centers. Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable).
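The steps on this card can be sketched in plain Python. This is only an illustration: it assumes squared Euclidean distance, picks k of the data points as the random initial centers (one common variant of Step 1), and stops when assignments no longer change. The function name `k_means` and the sample points are made up.

```python
import random


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # Step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        if new_assignment == assignment:     # Convergence: assignments are stable
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep the old center if a cluster is empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(assignment)
```

On well-separated data like this the two groups are recovered, but in general the result depends on the random initial centers.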
Other names for data mining
-knowledge extraction, -pattern analysis, -knowledge discovery, -information harvesting, -pattern searching, -data dredging
In the opening vignette, the architectural system that supported Watson used all the following
-massive parallelism to enable simultaneous consideration of multiple hypotheses. -an underlying confidence subsystem that ranks and integrates answers -integration of shallow and deep knowledge.
Unstructured corporate data is doubling in size every
18 months
pattern
A mathematical (numeric and/or symbolic) relationship among data items
Prediction consists of
Classification, Regression, Time Series
Classification versus clustering
Classification: learns the function between the characteristic of things and their membership through a supervised learning process where both input and output variables are presented to the algorithm Clustering: learns the function between the characteristic of things and their membership through an unsupervised learning process where only the input variables are presented to the algorithm
Classification versus regression
Classification: what is being predicted is a class label (e.g., "sunny," "rainy," "cloudy"). Regression: what is being predicted is a numeric value (e.g., temperature).
Which broad area of data mining applications partitions a collection of objects into natural groupings with similar features?
Clustering
Segmentation consists of
Clustering, Outlier analysis
CRISP-DM stands for
Cross Industry Standard Process for Data Mining
structured data
Data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion.
Decision Trees
Employs a divide-and-conquer method
Definition of Data Mining
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.
Myth or Reality: Data mining is a multistep process that requires deliberate, proactive design and use.
Reality
Myth or reality: Because of the advances in database technology, a dedicated database is not required.
Reality
Myth or reality: Newer Web-based tools enable managers of all educational levels to do data mining
Reality
Myth or reality: The current state of the art is ready to go for almost any business type and/or size.
Reality
____________ present challenges for natural language processing.
Regional accents
Clustering partitions a collection of things into segments whose members share
Similar Characteristics
Text analytics versus text mining
Text analytics: a broader concept that includes information retrieval as well as information extraction, data mining, and web mining Text mining: primarily focused on discovering new and useful knowledge from the textual data sources
How many clusters?
There is not an optimal way to configure the number needed
What do voice of the market (VOM) applications of sentiment analysis do?
They examine customer sentiment at the aggregate level
Search engine optimization (SEO) is a means by which
Web site developers can increase Web site search rankings
In the Miami-Dade Police Department case study, predictive analytics helped to identify
a holistic view of the world of crime and criminals for better and faster reaction and management. NOT to identify the best schedule to pay the least overtime
Corpus (and corpora)
a large and structured set of texts prepared for the purpose of conducting knowledge discovery
The miner is often
an end user
In text mining, tokenizing is the process of
categorizing a block of text in a sentence
Current use of sentiment analysis in voice of the customer applications allows companies to
change their products or services in real time in response to customer sentiment.
DM environment is usually a
client-server or a Web-based information systems architecture.
In classification problems, the primary source for accuracy estimation is the
confusion matrix
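As an illustration, a confusion matrix can be built by counting (actual, predicted) pairs, and accuracy is the share of pairs on the diagonal. The function names and the sample labels below are assumptions, not part of the course material.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of cases where the predicted label matches the actual one."""
    matrix = confusion_matrix(actual, predicted)
    correct = sum(count for (a, p), count in matrix.items() if a == p)
    return correct / len(actual)


actual = ["pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos"]

print(confusion_matrix(actual, predicted))
print(accuracy(actual, predicted))  # 3 of 5 correct -> 0.6
```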
Source of data for DM is often a
consolidated data warehouse (not always!)
Data is the most
critical ingredient for DM, which may include soft/unstructured data
Data Mining Process
•A manifestation of the best practices •A systematic way to conduct DM projects •Moving from *Art to Science* for DM projects •Everybody has a different version
Association Rule Mining (market basket analysis)
•A very popular DM method in business •Finds interesting relationships (affinities) between variables (items or events) •Part of machine learning family •Employs unsupervised learning •There is no output variable •Also known as *market basket analysis* •Often used as an example to describe DM to ordinary people, such as the famous *"relationship between diapers and beers!"*
Data Mining versus Text Mining
•Both seek novel and useful patterns •Both are semi-automated processes
Classification techniques
•Decision tree analysis •Statistical analysis •Neural networks •Support vector machines •Case-based reasoning •Bayesian classifiers •Genetic algorithms •Rough sets
Association rule mining
•Input: the simple point-of-sale transaction data •Output: Most frequent affinities among items •Example: according to the transaction data... "Customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time." •How do you use such a pattern/knowledge? -Put the items next to each other -Promote the items as a package -Place items far apart from each other!
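The "70 percent of the time" figure in the example is what is usually called the confidence of a rule. A small sketch of how support and confidence could be computed from point-of-sale baskets follows; the function name `rule_metrics` and the five toy transactions are invented for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)            # share of all baskets
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus"},
    {"laptop", "antivirus", "service plan", "mouse"},
    {"mouse"},
]

print(rule_metrics(baskets, {"laptop", "antivirus"}, {"service plan"}))
# 3 of 5 baskets hold all items, 3 of the 4 laptop+antivirus baskets -> (0.6, 0.75)
```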
Data Mining Goes to Hollywood: Predicting Financial Success of Movies
•Goal: Predicting financial success of Hollywood movies before the start of their production process •How: Use of advanced predictive analytics methods •Results: promising
Text Mining Application Area
•Information extraction •Topic tracking •Summarization •Categorization •Clustering •Concept linking •Question answering
Additional Estimation Methodologies for Classification
•Leave-one-out -Similar to k-fold where k = number of samples •Bootstrapping -Random sampling with replacement •Jackknifing -Similar to leave-one-out •Area Under the ROC Curve (AUC) -ROC: receiver operating characteristics (a term borrowed from radar image processing)
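To make "random sampling with replacement" concrete, here is a small sketch of drawing one bootstrap sample. It assumes the common practice of training on the drawn records and testing on the records that were never drawn; the function name `bootstrap_sample` and the fixed seed are illustrative.

```python
import random


def bootstrap_sample(records, seed=0):
    """Draw len(records) items with replacement; return (in_bag, out_of_bag)."""
    rng = random.Random(seed)
    n = len(records)
    chosen = [rng.randrange(n) for _ in range(n)]   # the same index may repeat
    chosen_set = set(chosen)
    in_bag = [records[i] for i in chosen]
    out_of_bag = [r for i, r in enumerate(records) if i not in chosen_set]
    return in_bag, out_of_bag


records = list("abcdefghij")
in_bag, out_of_bag = bootstrap_sample(records)
print(in_bag)
print(out_of_bag)
```

Leave-one-out needs no randomness: it is k-fold with k equal to the number of records, so each record is the test set exactly once.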
Classification
•Most frequently used DM method •Part of the machine-learning family •Employs supervised learning •Learns from past data, classifies new data •The output variable is categorical (nominal or ordinal) in nature
Assessment methods for classification
•Predictive accuracy -Hit rate •Speed -Model building versus predicting/usage speed •Robustness •Scalability •Interpretability -Transparency, explainability
Ensemble Models for Predictive Analytics
•Produces more robust and reliable prediction models
Cluster Analysis for Data Mining
•Used for automatic identification of natural groupings of things •Part of the machine-learning family •Employs unsupervised learning •Learns the clusters of things from past data, then assigns new instances •There is no output/target variable •In marketing, it is also known as segmentation