Chapter 4 Midterm

Open-source data mining tools include applications such as IBM SPSS Modeler and Dell Statistica.

False

Ratio data is a type of categorical data.

False

Statistics and data mining both look for data sets that are as large as possible.

False

In the Influence Health case, the company was able to evaluate over ________ million records in only two days.

195

In the terrorist funding case study, an observed price ________ may be related to income tax avoidance/evasion, money laundering, or terrorist financing.

deviation

Market basket analysis is a useful and entertaining way to explain data mining to a technologically less savvy audience, but it has little business significance.

False

K-fold cross-validation is also called sliding estimation.

False

Which data mining process/methodology is thought to be the most comprehensive, according to kdnuggets.com rankings?

CRISP-DM

________ was proposed in the mid-1990s by a European consortium of companies to serve as a nonproprietary standard methodology for data mining.

CRISP-DM

Which of the following is a data mining myth?

Data mining requires a separate, dedicated database.

In the data mining in Hollywood case study, how successful were the models in predicting the success or failure of a Hollywood movie?

The researchers claim that these prediction results are better than any reported in the published literature for this problem domain. Fusion classification methods attained up to 56.07% accuracy in correctly classifying movies and 90.75% accuracy in classifying movies within one category of their actual category. The SVM classification method attained up to 55.49% accuracy in correctly classifying movies and 85.55% accuracy in classifying movies within one category of their actual category.

What does the scalability of a data mining method refer to?

its ability to construct a prediction model efficiently given a large amount of data

Because of its successful application to retail business problems, association rule mining is commonly called ________.

market-basket analysis

The data field "ethnic group" can be best described as

nominal data.

Clustering partitions a collection of things into segments whose members share

similar characteristics.

All of the following statements about data mining are true EXCEPT

the process aspect means that data mining should be a one-step process to results.

In estimating the accuracy of data mining (or other) classification models, the true positive rate is

the ratio of correctly classified positives divided by the total positive count.
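
A minimal sketch of how this rate (and the related false positive rate) falls out of a confusion matrix, using made-up counts rather than figures from the chapter:

# Hypothetical confusion-matrix counts (illustrative only).
tp, fn = 85, 15   # actual positives: correctly vs. incorrectly classified
fp, tn = 10, 90   # actual negatives: incorrectly vs. correctly classified

true_positive_rate = tp / (tp + fn)         # 85 / 100 = 0.85
false_positive_rate = fp / (fp + tn)        # 10 / 100 = 0.10
accuracy = (tp + tn) / (tp + fn + fp + tn)  # 175 / 200 = 0.875
print(true_positive_rate, false_positive_rate, accuracy)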

List four myths associated with data mining.

• Data mining provides instant, crystal-ball-like predictions.
• Data mining is not yet viable for business applications.
• Data mining requires a separate, dedicated database.
• Only those with advanced degrees can do data mining.
• Data mining is only for large firms that have lots of customer data.

List six common data mining mistakes.

• Selecting the wrong problem for data mining
• Ignoring what your sponsor thinks data mining is and what it really can and cannot do
• Leaving insufficient time for data preparation
• Looking only at aggregated results and not at individual records
• Being sloppy about keeping track of the data mining procedure and results
• Ignoring suspicious findings and quickly moving on
• Running mining algorithms repeatedly and blindly
• Believing everything you are told about the data
• Believing everything you are told about your own data mining analysis
• Measuring your results differently from the way your sponsor measures them

What is the main reason parallel processing is sometimes used for data mining?

because of the massive data amounts and search efforts involved

Data preparation, the third step in the CRISP-DM data mining process, is more commonly known as ________.

data preprocessing

List 3 common data mining myths and realities.

1) Myth: Data mining provides instant, crystal-ball-like predictions. Reality: Data mining is a multistep process that requires deliberate, proactive design and use.
2) Myth: Data mining is not yet viable for mainstream business applications. Reality: The current state of the art is ready to go for almost any business type and/or size.
3) Myth: Data mining requires a separate, dedicated database. Reality: Because of the advances in database technology, a dedicated database is not required.
4) Myth: Only those with advanced degrees can do data mining. Reality: Newer Web-based tools enable managers of all educational levels to do data mining.
5) Myth: Data mining is only for large firms that have lots of customer data. Reality: If the data accurately reflect the business or its customers, any company can use data mining.

The ________ is the most commonly used algorithm to discover association rules. Given a set of itemsets, the algorithm attempts to find subsets that are common to at least a minimum number of the itemsets.

Apriori algorithm
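
A minimal sketch of the frequent-itemset counting at the heart of Apriori, using hypothetical baskets invented for illustration; the full algorithm additionally prunes candidate itemsets whose subsets are not frequent:

from collections import Counter
from itertools import combinations

# Hypothetical market baskets; each set is one customer transaction.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]
min_support = 3  # a pair is "frequent" if it appears in at least 3 baskets

# Count how often each pair of items appears together in a basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # each qualifying pair with its support count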

Describe cluster analysis and some of its applications.

Cluster analysis is an exploratory data analysis tool for solving classification problems. The objective is to sort cases (e.g., people, things, events) into groups, or clusters, so that the degree of association is strong among members of the same cluster and weak among members of different clusters. Cluster analysis is an essential data mining method for classifying items, events, or concepts into common groupings called clusters. The method is commonly used in biology, medicine, genetics, social network analysis, anthropology, archaeology, astronomy, character recognition, and even in MIS development. As data mining has increased in popularity, the underlying techniques have been applied to business, especially to marketing. Cluster analysis has been used extensively for fraud detection (both credit card and e-commerce fraud) and market segmentation of customers in contemporary CRM systems.
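
As a rough illustration, the sketch below segments hypothetical customer records with k-means (one common clustering algorithm, chosen here for brevity) via scikit-learn; the library and the toy numbers are assumptions, not part of the case material:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer records: [annual spend, store visits per month]
X = np.array([[200, 2], [220, 3], [800, 10], [780, 12], [400, 5], [410, 6]])

# Partition the records into 3 segments of similar customers.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each record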

Data mining can be very useful in detecting patterns such as credit card fraud, but is of little help in improving sales.

False

Data mining requires specialized data analysts to ask ad hoc questions and obtain answers quickly from the system.

False

Data that is collected, stored, and analyzed in data mining is often private and personal. There is no way to maintain individuals' privacy other than being very careful about physical data security.

False

In the Dell case study, the largest issue was how to properly spend the online marketing budget.

False

In the Miami-Dade Police Department case study, predictive analytics helped to identify the best schedule for officers in order to pay the least overtime.

False

In the cancer research case study, data mining algorithms that predict cancer survivability with high predictive power are good replacements for medical professionals.

False

In the opening case, police detectives used data mining to identify possible new areas of inquiry.

False

The entire focus of the predictive analytics system in the Infinity P&C case was on detecting and handling fraudulent claims for the company's benefit.

False

In the Target case study, why did Target send maternity ads to a teen?

Target's analytic model suggested she was pregnant based on her buying habits.

In lessons learned from the Target case, what legal warnings would you give another retailer using data mining for marketing?

If you look at this practice from a legal perspective, you would conclude that Target did not use any information that violates customer privacy; rather, they used transactional data that most every other retail chain is collecting and storing (and perhaps analyzing) about their customers. What was disturbing in this scenario was perhaps the targeted concept: pregnancy. There are certain events or concepts that should be off limits or treated extremely cautiously, such as terminal disease, divorce, and bankruptcy.

List five reasons for the growing popularity of data mining in the business world.

• More intense competition at the global scale driven by customers' ever-changing needs and wants in an increasingly saturated marketplace
• General recognition of the untapped value hidden in large data sources
• Consolidation and integration of database records, which enables a single view of customers, vendors, transactions, etc.
• Consolidation of databases and other data repositories into a single location in the form of a data warehouse
• The exponential increase in data processing and storage technologies
• Significant reduction in the cost of hardware and software for data storage and processing
• Movement toward the demassification (conversion of information resources into nonphysical form) of business practices

List and briefly describe the six steps of the CRISP-DM data mining process.

Step 1: Business Understanding — The key element of any data mining study is to know what the study is for. Answering such a question begins with a thorough understanding of the managerial need for new knowledge and an explicit specification of the business objective regarding the study to be conducted.
Step 2: Data Understanding — A data mining study is specific to addressing a well-defined business task, and different business tasks require different sets of data. Following the business understanding, the main activity of the data mining process is to identify the relevant data from many available databases.
Step 3: Data Preparation — The purpose of data preparation (more commonly called data preprocessing) is to take the data identified in the previous step and prepare it for analysis by data mining methods. Compared to the other steps in CRISP-DM, data preprocessing consumes the most time and effort; most believe that this step accounts for roughly 80 percent of the total time spent on a data mining project.
Step 4: Model Building — Here, various modeling techniques are selected and applied to an already prepared data set in order to address the specific business need. The model-building step also encompasses the assessment and comparative analysis of the various models built.
Step 5: Testing and Evaluation — In step 5, the developed models are assessed and evaluated for their accuracy and generality. This step assesses the degree to which the selected model (or models) meets the business objectives and, if so, to what extent (i.e., whether more models need to be developed and assessed).
Step 6: Deployment — Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. In many cases, it is the customer, not the data analyst, who carries out the deployment steps.

All of the following statements about data mining are true EXCEPT:

The ideas behind it are relatively new.

Describe the role of the simple split in estimating the accuracy of classification models.

The simple split (or holdout or test sample estimation) partitions the data into two mutually exclusive subsets called a training set and a test set (or holdout set). It is common to designate two-thirds of the data as the training set and the remaining one-third as the test set. The training set is used by the inducer (model builder), and the built classifier is then tested on the test set. An exception to this rule occurs when the classifier is an artificial neural network. In this case, the data is partitioned into three mutually exclusive subsets: training, validation, and testing.
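
A minimal sketch of the simple split using scikit-learn (an assumed tool, not named in the text), holding out one-third of a sample data set as the test set:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two-thirds for training (the inducer sees only this), one-third held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy measured on the holdout set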

Converting continuous valued numerical variables to ranges and categories is referred to as discretization.

True
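
For example, a short sketch using pandas (an assumed tool) to convert continuous ages into categorical ranges:

import pandas as pd

# Hypothetical continuous ages discretized into categorical ranges.
ages = pd.Series([23, 35, 47, 51, 64, 72])
age_group = pd.cut(ages, bins=[0, 30, 50, 70, 120],
                   labels=["young", "middle", "senior", "elderly"])
print(age_group.tolist())  # ['young', 'middle', 'middle', 'senior', 'senior', 'elderly']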

During classification in data mining, a false positive is an occurrence classified as true by the algorithm while being false in reality.

True

If using a mining analogy, "knowledge mining" would be a more appropriate term than "data mining."

True

In data mining, classification models help in prediction.

True

The cost of data storage has plummeted recently, making data mining feasible for more firms.

True

Using data mining on data about imports and exports can help to detect tax avoidance and money laundering.

True

When a problem has many attributes that impact the classification of different patterns, decision trees may be a useful approach.

True

Understanding customers better has helped Amazon and others become more successful. The understanding comes primarily from

analyzing the vast data amounts routinely collected.

In data mining, finding an affinity of two products to be commonly together in a shopping cart is known as

association rule mining.

Which broad area of data mining applications analyzes data, forming rules to distinguish between defined classes?

classification

Which broad area of data mining applications partitions a collection of objects into natural groupings with similar features?

clustering

As described in the Influence Health case study, customers are more often ________ services from a variety of healthcare service providers before selecting one.

comparing

In the Dell case study, engineers working closely with marketing used lean software development strategies and numerous technologies to create a highly scalable, singular ________.

data mart

Knowledge extraction, pattern analysis, data archaeology, information harvesting, pattern searching, and data dredging are all alternative names for ________.

data mining

Data are often buried deep within very large ________, which sometimes contain data from several years.

databases

One way to accomplish privacy and protection of individuals' rights when data mining is by ________ of the customer records prior to applying data mining applications, so that the records cannot be traced to an individual.

de-identification
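
A rough sketch of one de-identification approach: dropping direct identifiers and replacing them with a one-way key. The library, column names, and toy records are assumptions made for illustration; real de-identification typically also involves salted or random keys and handling of quasi-identifiers.

import hashlib
import pandas as pd

# Hypothetical records containing direct identifiers (names, SSNs).
records = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "ssn": ["111-22-3333", "444-55-6666"],
    "age": [34, 58],
    "diagnosis": ["flu", "asthma"],
})

# Replace identifiers with a one-way key so rows can still be linked
# without being traced to a person (salted/random keys used in practice).
records["record_key"] = records["ssn"].apply(
    lambda s: hashlib.sha256(s.encode()).hexdigest()[:12])
deidentified = records.drop(columns=["name", "ssn"])
print(deidentified)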

The basic idea behind a(n) ________ is that it recursively divides a training set until each division consists entirely or primarily of examples from one class.

decision tree
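
A minimal sketch of this recursive division using scikit-learn's decision tree on a sample data set (the library and data set are assumptions, not from the text):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Each split recursively divides the training set so that the resulting
# partitions are dominated by a single class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # text view of the recursive splits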

A data mining study is specific to addressing a well-defined business task, and different business tasks require

different sets of data.

Patterns have been manually ________ from data by humans for centuries, but the increasing volume of data in modern times has created a need for more automatic approaches.

extracted

While prediction is largely experience and opinion based, ________ is data and model based.

forecasting

In the Influence Health case study, what was the goal of the system?

increasing service use

Identifying and preventing incorrect claim payments and fraudulent activities falls under which type of data mining applications?

insurance

What does the robustness of a data mining method refer to?

its ability to overcome noisy data to make somewhat accurate predictions

In ________, an accuracy-estimation method for classification models, the complete data set is randomly split into k mutually exclusive subsets of approximately equal size; the model is trained and tested k times, each time leaving out one subset for testing and training on the rest.

k-fold cross-validation
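
A minimal sketch with scikit-learn (an assumed tool), using k = 10 folds:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Ten mutually exclusive folds; each fold is left out once for testing
# while the remaining nine are used for training.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())  # average accuracy across the 10 left-out folds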

Fayyad et al. (1996) defined ________ in databases as a process of using data mining methods to find useful information and patterns in the data.

knowledge discovery

There has been an increase in data mining to deal with global competition and customers' more sophisticated ________ and wants.

needs

Prediction problems where the variables have numeric values are most accurately defined as

regressions.
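
As a small illustration, a regression sketch with scikit-learn and made-up numeric data (both are assumptions, not from the text):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical numeric prediction: estimate sales from advertising spend.
ad_spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([25, 41, 62, 79, 102])

model = LinearRegression().fit(ad_spend, sales)
print(model.predict([[60]]))  # predicted sales for an unseen spend level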

Customer ________ management extends traditional marketing by creating one-on-one relationships with customers.

relationship

The data mining in cancer research case study explains that data mining methods are capable of extracting patterns and ________ hidden deep in large and complex medical databases.

relationships

Third party providers of publicly available data sets protect the anonymity of the individuals in the data set primarily by

removing identifiers such as names and social security numbers.

Whereas ________ starts with a well-defined proposition and hypothesis, data mining starts with a loosely defined discovery statement.

statistics

Briefly describe five techniques (or algorithms) that are used for classification modeling.

• Decision tree analysis. Decision tree analysis (a machine-learning technique) is arguably the most popular classification technique in the data mining arena.
• Statistical analysis. Statistical techniques were the primary classification algorithm for many years until the emergence of machine-learning techniques. Statistical classification techniques include logistic regression and discriminant analysis.
• Neural networks. These are among the most popular machine-learning techniques that can be used for classification-type problems.
• Case-based reasoning. This approach uses historical cases to recognize commonalities in order to assign a new case into the most probable category.
• Bayesian classifiers. This approach uses probability theory to build classification models based on the past occurrences that are capable of placing a new instance into a most probable class (or category).
• Genetic algorithms. This approach uses the analogy of natural evolution to build directed-search-based mechanisms to classify data samples.
• Rough sets. This method takes into account the partial membership of class labels to predefined categories in building models (collection of rules) for classification problems.
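
As a rough illustration, the sketch below runs three of these techniques (a decision tree, logistic regression as a statistical technique, and a naive Bayes classifier) on the same sample data set; scikit-learn and the data set are assumptions, not part of the text:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Cross-validated accuracy for three of the techniques described above.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=5000),
    "naive Bayes": GaussianNB(),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())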

