BI exam 2

All of the following are challenges associated with natural language processing EXCEPT

dividing up a text into individual words in English

Clickstream analysis does not need users to

enter their perceptions of the Web site or other feedback directly to be useful in determining their preferences.

Data mining tools' capabilities and ease of use are

essential

to perform text mining

first, impose structure to the data, then mine the structured data

85-90 percent of all corporate data is

in some kind of unstructured form (e.g., text)

Data mining can be very useful in detecting patterns such as credit card fraud and

is very helpful in increasing sales

What does the robustness of a data mining method refer to?

its ability to overcome noisy data to make somewhat accurate predictions

If using a mining analogy, _________________ would be a more appropriate term than "data mining."

knowledge mining

Data mining versus statistics

The main difference is that statistics starts with a well-defined proposition and hypothesis, whereas data mining starts with a loosely defined discovery statement. Data mining looks for data sets that are as "big" as possible; statistics looks for the right size of data.

Tapping into these information sources is not an option, but a

need to stay competitive

The data field "ethnic group" can be best described as

nominal data.

unstructured data

nonnumeric information that is typically formatted in a way that is meant for human eyes and not easily understood by computers

What are the two main types of Web analytics?

off-site and on-site Web analytics

DM extracts ________ from data

patterns

Myth or reality: If the data accurately reflect the business or its customers, any company can use data mining.

reality

Striking it rich

requires creative thinking

In text mining, if an association between two concepts has 7% support, it means

that 7% of the documents had both concepts represented in the same document.
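
A minimal sketch of how support for a concept pair could be computed, assuming each document has already been reduced to the set of concepts it mentions. The corpus and the `support` helper are hypothetical illustrations, not taken from the course material.

```python
def support(documents, concept_a, concept_b):
    """Fraction of documents in which both concepts appear together."""
    both = sum(1 for doc in documents if concept_a in doc and concept_b in doc)
    return both / len(documents)


# Hypothetical corpus: each document is the set of concepts found in it.
docs = [
    {"battery", "overheating"},
    {"battery", "price"},
    {"screen", "price"},
    {"battery", "overheating", "screen"},
]

# 2 of the 4 documents mention both "battery" and "overheating".
pair_support = support(docs, "battery", "overheating")
print(pair_support)
```

A support of 7% would mean the same ratio comes out to 0.07 over the whole document collection.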

Categorization and clustering of documents during text mining differ only in

the preselection of categories.

All of the following statements about data mining are true EXCEPT

the process aspect means that data mining should be a one-step process to results.

Sentiment analysis projects require a lexicon for use. If a project in English is undertaken, you must generally make sure to

use an English lexicon appropriate to the project at your discretion.

In text analysis, what is a lexicon?

a catalog of words, their synonyms, and their meanings

In the opening vignette, the architectural system that supported Watson used all the following elements EXCEPT

a core engine that could operate seamlessly in another domain without changes

Why Data Mining?

-More intense competition at the global scale
-Recognition of the value in data sources
-Availability of quality data on customers, vendors, transactions, Web, etc.
-Consolidation and integration of data repositories into data warehouses
-The exponential increase in data processing and storage capabilities, and decrease in cost
-Movement toward conversion of information resources into nonphysical form

Data mining is

-the novel aspect means that previously unknown patterns are discovered.
-the potentially useful aspect means that results should lead to some business benefit.
-the valid aspect means that the discovered patterns should hold true on new data.

In the Influence Health case study, what was the goal of the system?

Increasing service use

Text Analytics =

Information Retrieval + Text Mining

What does advanced analytics for social media do?

It examines the content of online conversations.

IBM Watson going head-to-head with the best of the best in

Jeopardy

All of the following are challenges associated with natural language processing

-understanding the context in which something is said.
-distinguishing between words that have more than one meaning.
-recognizing typographical or grammatical errors in texts.

Area under the ROC Curve (AUC)

-Works with binary classification
-Produces values from 0 to 1.0
-Random chance is 0.5 and perfect classification is 1.0
-Produces a good assessment for skewed class distributions too!
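
One way to see where the 0.5 and 1.0 values come from: AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below uses that pairwise reading; the `auc` helper and the scores are made up for illustration.

```python
def auc(labels, scores):
    """Pairwise AUC for binary labels (1 = positive, 0 = negative)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
perfect = auc(labels, [0.9, 0.8, 0.2, 0.1])        # every positive outranks every negative
uninformative = auc(labels, [0.5, 0.5, 0.5, 0.5])  # scores carry no information
print(perfect, uninformative)
```

The double loop is fine for a handful of cases; for large data sets a rank-based computation would normally be used instead.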

Companies understand that when their product goes "viral," the content of the online conversations about their product does not matter, only the volume of conversations.

False

Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales.

False

Search engines are only used in the context of the World Wide Web (WWW).

False

Since little can be done about visitor Web site abandonment rates, organizations have to focus their efforts on increasing the number of new visitors.

False

Text analytics is the subset of text mining that handles information retrieval and extraction, plus data mining.

False

In the cancer research case study, data mining algorithms that predict cancer survivability with high predictive power are good replacements for medical professionals.

False (they are good at helping doctors)

Association consists of

Market basket analysis, link analysis, sequence analysis

Myth or reality: Data mining is not yet viable for mainstream business applications.

Myth

Myth or reality: Data mining is only for large firms that have lots of customer data.

Myth

Myth or reality: Data mining provides instant, crystal-ball-like predictions.

Myth

Myth or reality: Data mining requires a separate, dedicated database.

Myth

Myth or reality: Only those with advanced degrees can do data mining.

Myth

In sentiment analysis, which of the following is an implicit opinion?

The customer service I got for my TV was laughable

In sentiment analysis, it is hard to classify some subjects such as news as good or bad, but easier to classify others, e.g., movie reviews, in the same way

True

Representative applications of association rule mining include

-*In business:* cross-marketing, cross-selling, store design, catalog design, e-commerce site design, optimization of online advertising, product pricing, and sales/promotion configuration
-*In medicine:* relationships between symptoms and illnesses; diagnosis and patient characteristics and treatments (to be used in medical DSS); and genes and their functions (to be used in genomics projects)
•*Commercial*
-IBM SPSS Modeler (formerly Clementine)
-SAS Enterprise Miner
-Statistica - Dell/Statsoft
-... many more
•*Free and/or Open Source*
-KNIME
-RapidMiner
-Weka
-R, ...

Types of patterns

-Association
-Prediction
-Cluster (segmentation)
-Sequential (or time series) relationships

Most common standard processes

-CRISP-DM (Cross-Industry Standard Process for Data Mining)
-SEMMA (Sample, Explore, Modify, Model, and Assess)
-KDD (Knowledge Discovery in Databases)

k-Fold Cross Validation (rotation estimation)

-Estimation methodology for classification
-Data is split into k mutually exclusive subsets, and k training/testing experiments are conducted
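
The splitting scheme can be sketched as follows. This is an illustrative helper (`k_fold_splits` is a made-up name) that works on row indices only; in practice the rows would usually be shuffled first.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    indices = list(range(n_samples))
    # Build k mutually exclusive subsets (folds) of the row indices.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]  # fold i is held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each row is used for testing exactly once, and the k accuracy estimates are typically averaged.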

single/simple split method

-Estimation methodology for classification
-Simple split (or holdout or test sample estimation)
-Split the data into 2 mutually exclusive sets: training (~70%) and testing (~30%)
-For neural networks, the data is split into three subsets (training [~60%], validation [~20%], testing [~20%])
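
A minimal sketch of the two-way holdout split, assuming a 70/30 ratio and a fixed random seed so the result is repeatable. The `simple_split` name and its parameters are illustrative only.

```python
import random


def simple_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and cut them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


train, test = simple_split(range(100))
print(len(train), len(test))
```

The three-way split used for neural networks follows the same idea with two cut points instead of one.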

Clustering results may be used to

-Identify natural groupings of customers
-Identify rules for assigning new cases to classes for targeting/diagnostic purposes
-Provide characterization, definition, labeling of populations
-Decrease the size and complexity of problems for other data mining methods
-Identify outliers in a specific domain (e.g., rare-event detection)

Electronic communication records (e.g., e-mail)

-Spam filtering
-E-mail prioritization and categorization
-Automatic response generation

Cluster analysis methods

-Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
-Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
-Fuzzy logic (e.g., fuzzy c-means algorithm)
-Genetic algorithms

Which steps account for about 85% of the total CRISP-DM project time?

-Step 1: Business Understanding
-Step 2: Data Understanding
-Step 3: Data Preparation

CRISP-DM is composed of six consecutive phases

-Step 1: Business Understanding
-Step 2: Data Understanding
-Step 3: Data Preparation
-Step 4: Model Building
-Step 5: Testing and Evaluation
-Step 6: Deployment

Structured versus unstructured data

-Structured data: in databases
-Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on

K-means clustering algorithm

-k: predetermined number of clusters
-Algorithm (Step 0: determine the value of k)
Step 1: Randomly generate k random points as initial cluster centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Re-compute the new cluster centers.
Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable).
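
The steps above can be sketched in plain Python. This is an illustrative version, not a reference implementation: it assumes points are tuples of numbers, uses k randomly chosen data points as the initial centers (one common way to do Step 1), and accepts an optional `initial_centers` argument, a made-up convenience for reproducible runs.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, initial_centers=None, seed=0, max_iterations=100):
    # Step 1: choose k initial cluster centers.
    if initial_centers:
        centers = list(initial_centers)
    else:
        centers = random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Convergence: stop once the assignment of points is stable.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: re-compute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean_point(members)
    return centers, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```

Because the starting centers are random, different seeds can label the clusters in a different order or, on harder data, converge to different groupings.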

Other names for data mining

-knowledge extraction
-pattern analysis
-knowledge discovery
-information harvesting
-pattern searching
-data dredging

In the opening vignette, the architectural system that supported Watson used all the following

-massive parallelism to enable simultaneous consideration of multiple hypotheses.
-an underlying confidence subsystem that ranks and integrates answers.
-integration of shallow and deep knowledge.

Unstructured corporate data is doubling in size every

18 months

pattern

A mathematical (numeric and/or symbolic) relationship among data items

Prediction consists of

Classification, regression, time series

Classification versus clustering

Classification: learns the function between the characteristics of things and their membership through a supervised learning process where both input and output variables are presented to the algorithm.
Clustering: learns the function between the characteristics of things and their membership through an unsupervised learning process where only the input variables are presented to the algorithm.

Classification versus regression

Classification: what is being predicted is a class label (e.g., "sunny," "rainy," "cloudy").
Regression: what is being predicted is a numeric value (e.g., temperature).

Which broad area of data mining applications partitions a collection of objects into natural groupings with similar features?

Clustering

Segmentation consists of

Clustering, outlier analysis

CRISP-DM stands for

Cross Industry Standard Process for Data Mining

structured data

Data that (1) are typically numeric or categorical; (2) can be organized and formatted in a way that is easy for computers to read, organize, and understand; and (3) can be inserted into a database in a seamless fashion.

Decision Trees

Employs a divide-and-conquer method

Definition of Data Mining

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.

Myth or Reality: Data mining is a multistep process that requires deliberate, proactive design and use.

Reality

Myth or reality: Because of the advances in database technology, a dedicated database is not required.

Reality

Myth or reality: Newer Web-based tools enable managers of all educational levels to do data mining

Reality

Myth or reality: The current state of the art is ready to go for almost any business type and/or size.

Reality

____________ present challenges for natural language processing.

Regional accents

Clustering partitions a collection of things into segments whose members share

Similar Characteristics

Text analytics versus text mining

Text analytics: a broader concept that includes information retrieval as well as information extraction, data mining, and web mining.
Text mining: primarily focused on discovering new and useful knowledge from textual data sources.

How many clusters?

There is not an optimal way to configure the number needed

What do voice of the market (VOM) applications of sentiment analysis do?

They examine customer sentiment at the aggregate level

Search engine optimization (SEO) is a means by which

Web site developers can increase Web site search rankings

In the Miami-Dade Police Department case study, predictive analytics helped to identify

a holistic view of the world of crime and criminals for better and faster reaction and management (NOT the best schedule to pay the least overtime)

Corpus (and corpora)

a large and structured set of texts prepared for the purpose of conducting knowledge discovery

The miner is often

an end user

In text mining, tokenizing is the process of

categorizing a block of text in a sentence

Current use of sentiment analysis in voice of the customer applications allows companies to

change their products or services in real time in response to customer sentiment.

DM environment is usually a

client-server or a Web-based information systems architecture.

In classification problems, the primary source for accuracy estimation is the

confusion matrix
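
A small sketch of how a two-class confusion matrix is tallied and how accuracy is read off it. The labels below are invented for illustration, and the helper names are hypothetical.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Share of all cases that were classified correctly."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
cm = confusion_matrix(actual, predicted)
print(cm, accuracy(cm))
```

Other measures such as the true positive rate, TP / (TP + FN), are computed from the same four counts.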

Source of data for DM is often a

consolidated data warehouse (not always!)

Data is the most

critical ingredient for DM which may include soft/unstructured data

Data Mining Process

•A manifestation of the best practices
•A systematic way to conduct DM projects
•Moving from *Art to Science* for DM projects
•Everybody has a different version

Association Rule Mining (market basket analysis)

•A very popular DM method in business
•Finds interesting relationships (affinities) between variables (items or events)
•Part of machine learning family
•Employs unsupervised learning
•There is no output variable
•Also known as *market basket analysis*
•Often used as an example to describe DM to ordinary people, such as the famous *"relationship between diapers and beers!"*

Data Mining versus Text Mining

•Both seek novel and useful patterns
•Both are semi-automated processes

Classification techniques

•Decision tree analysis
•Statistical analysis
•Neural networks
•Support vector machines
•Case-based reasoning
•Bayesian classifiers
•Genetic algorithms
•Rough sets

Association rule mining

•Input: the simple point-of-sale transaction data
•Output: most frequent affinities among items
•Example: according to the transaction data, "Customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time."
•How do you use such a pattern/knowledge?
-Put the items next to each other
-Promote the items as a package
-Place items far apart from each other!
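
The "70 percent of the time" figure in the example is the rule's confidence. A minimal sketch of how support and confidence could be computed from point-of-sale baskets follows; the transactions are fabricated purely so the numbers work out, and `rule_metrics` is a hypothetical helper.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)  # share of all baskets with every item
    confidence = n_both / n_antecedent    # share of antecedent baskets that also hold the consequent
    return support, confidence


# Hypothetical point-of-sale data: 20 baskets, each a set of items.
transactions = (
    [{"laptop", "antivirus", "service plan"}] * 7
    + [{"laptop", "antivirus"}] * 3
    + [{"printer"}] * 10
)

support, confidence = rule_metrics(
    transactions, {"laptop", "antivirus"}, {"service plan"}
)
print(support, confidence)
```

The sketch assumes at least one basket contains the antecedent; otherwise the confidence is undefined.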

Data Mining Goes to Hollywood: Predicting Financial Success of Movies

•Goal: Predicting financial success of Hollywood movies before the start of their production process
•How: Use of advanced predictive analytics methods
•Results: promising

Text Mining Application Area

•Information extraction
•Topic tracking
•Summarization
•Categorization
•Clustering
•Concept linking
•Question answering

Additional Estimation Methodologies for Classification

•Leave-one-out
-Similar to k-fold where k = number of samples
•Bootstrapping
-Random sampling with replacement
•Jackknifing
-Similar to leave-one-out
•Area Under the ROC Curve (AUC)
-ROC: receiver operating characteristics (a term borrowed from radar image processing)
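
Two of these methodologies can be sketched on row indices alone. The helper names are made up, and the out-of-bag test set in the bootstrap version is one common convention rather than something stated on the card.

```python
import random


def leave_one_out(n_samples):
    """k-fold with k = n_samples: each row is the test set exactly once."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]


def bootstrap_sample(n_samples, seed=0):
    """Draw n_samples row indices at random WITH replacement."""
    rng = random.Random(seed)
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(train)
    # Rows never drawn ("out of bag") can serve as the test set.
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return train, out_of_bag


loo_splits = list(leave_one_out(5))
boot_train, boot_test = bootstrap_sample(10)
print(loo_splits)
print(boot_train, boot_test)
```

Because bootstrap sampling is done with replacement, some rows typically appear more than once in the training sample while others are left out.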

Classification

•Most frequently used DM method
•Part of the machine-learning family
•Employs supervised learning
•Learns from past data, classifies new data
•The output variable is categorical (nominal or ordinal) in nature

Assessment methods for classification

•Predictive accuracy
-Hit rate
•Speed
-Model building versus predicting/usage speed
•Robustness
•Scalability
•Interpretability
-Transparency, explainability

Ensemble Models for Predictive Analytics

•Produces more robust and reliable prediction models

Cluster Analysis for Data Mining

•Used for automatic identification of natural groupings of things
•Part of the machine-learning family
•Employs unsupervised learning
•Learns the clusters of things from past data, then assigns new instances
•There is not an output/target variable
•In marketing, it is also known as segmentation

