Chapter 8

A common practice in data partitioning is to partition some percentage of the data into the training data set and some percentage into the validation data set. Which of the answers below is consistent with the percentage of data in the training data set and the percentage of data in the validation data set? (60%,40%) (40%,60%) (70%,30%) (30%,70%)

(60%,40%)
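
For illustration, a minimal Python sketch of a 60/40 split; the 100 placeholder records are made up, not from the chapter:

    import random

    random.seed(1)
    records = list(range(100))          # stand-ins for 100 data records
    random.shuffle(records)             # randomize before partitioning

    cut = int(0.60 * len(records))      # 60% of the records
    training = records[:cut]            # used to build the model
    validation = records[cut:]          # used to assess predictive performance
    print(len(training), len(validation))   # 60 40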

The Manhattan distance is calculated using the formula |x1i − x1j| + |x2i − x2j| + |x3i − x3j| + ⋯ + |xki − xkj|. Calculate the Manhattan distance between Observation 1: (3,4) and Observation 2: (4,5). 1.41 7.62 2 10

|3 − 4| + |4 − 5| = 1 + 1 = 2
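
For illustration, a minimal Python sketch of this calculation (the coordinate values come from the question above):

    # Manhattan distance between Observation 1 (3, 4) and Observation 2 (4, 5)
    obs1 = (3, 4)
    obs2 = (4, 5)

    # Sum of absolute coordinate differences: |3 - 4| + |4 - 5| = 1 + 1 = 2
    manhattan = sum(abs(a - b) for a, b in zip(obs1, obs2))
    print(manhattan)  # 2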

Multiple Choice Question What is the term used to describe computer systems that demonstrate human-like intelligence and cognitive functions, such as deduction, pattern recognition, and the interpretation of complex data? Machine learning Artificial intelligence Data mining Business analytics

Artificial intelligence

Select all that apply With either CRISP-DM or SEMMA, it is important to fully understand the surrounding socioeconomic climate, business goals, and underlying issues at hand prior to preparing the data and choosing analysis techniques. Which of the following must be fully understood? Please select all that apply! Underlying issues for the business The surrounding socioeconomic climate Political implications Business goals

Business goals The surrounding socioeconomic climate Underlying issues for the business

Click and drag on elements in order The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology consists of six major phases. Please order the six phases from the start phase to end phase! Business understanding: The first phase focuses on understanding the data mining project and its objectives Evaluation: After developing data mining models, we evaluate the performance of competing models based on specific criteria Data understanding: This phase involves collecting raw data and conducting a preliminary analysis to understand the data Data preparation: Specific tasks include record and variable selection, data wrangling, and cleansing for subsequent analyses Modeling: This phase involves the selection and execution of several data mining techniques, including linear and logistic regression models Deployment: During this final phase, we develop a set of actionable recommendations based on the analysis results

Business understanding: The first phase focuses on understanding the data mining project and its objectives Data understanding: This phase involves collecting raw data and conducting a preliminary analysis to understand the data Data preparation: Specific tasks include record and variable selection, data wrangling, and cleansing for subsequent analyses Modeling: This phase involves the selection and execution of several data mining techniques, including linear and logistic regression models Evaluation: After developing data mining models, we evaluate the performance of competing models based on specific criteria Deployment: During this final phase, we develop a set of actionable recommendations based on the analysis results

What is the term for a table that summarizes classification outcomes obtained from the validation data set? Classification matrix Confusion matrix Summary table Validating table

Confusion matrix

Select all that apply In the field of data mining, there is a growing need for the establishment of standards in the area. When conducting data mining analysis, practitioners generally adopt two standards. Select all that apply! Multiple select question. Sample, Explore, Modify, Model, and Assess (SEMMA) American National Standards Institute (ANSI) International Organization for Standardization (ISO) Cross-Industry Standard Process for Data Mining (CRISP-DM)

Cross-Industry Standard Process for Data Mining (CRISP-DM) Sample, Explore, Modify, Model, and Assess (SEMMA)

What is the name of the chart that shows the improvement that a predictive model provides over a random selection in capturing the target class cases? ROC curve chart Performance chart Cumulative lift chart Decile-wise lift chart

Cumulative lift chart

Select all that apply It is sometimes more informative to have graphic representations to assess the predictive performance of data mining models. Which of the following are the most popular performance charts? Select all that apply! Multiple select question. Cumulative lift chart Decile-wise lift chart ROC curve Predictive chart

Cumulative lift chart Decile-wise lift chart ROC curve

Select all that apply What are two methods used to detect overfitting and provide objective assessment of the predictive performance of models? Select all that apply! Multiple select question. Data partitioning Cross-validation Over sampling Data mining

Data partitioning Cross-validation
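
For illustration, a minimal sketch of k-fold cross-validation using scikit-learn; the synthetic data set and the choice of logistic regression are assumptions, not the chapter's example:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=200, n_features=5, random_state=1)

    # Score the model on 5 different train/holdout folds; a large gap between
    # training accuracy and these scores is a sign of overfitting.
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print(scores.mean())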

This step in the CRISP-DM methodology includes record and variable selection, data wrangling, and cleansing data for subsequent analyses. For example, certain data mining techniques may require categorical variables to be transformed into binary variables. What is this step? Data understanding Data preparation Evaluation Modeling

Data preparation

What is the name of the chart that shows the improvement that a predictive model provides over a random selection but presents the information in 10 equal-sized intervals (e.g., every 10% of the observations)? ROC curve Performance chart Decile-wise lift chart Cumulative lift chart

Decile-wise lift chart

The key distinction between supervised and unsupervised techniques is that supervised data mining is: Effective for developing predictive models Effective for dimension reduction Effective for data exploration Effective for pattern recognition

Effective for developing predictive models

What is one of the most widely used measures for evaluating similarity with numerical variables? It is defined as the length of a straight line between two observations. Euclidean distance Algebraic manipulation Geometric calculations Manhattan distance

Euclidean distance
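
For illustration, a minimal Python sketch comparing the Euclidean (straight-line) distance with the Manhattan distance for the same two observations used earlier:

    import math

    obs1 = (3, 4)
    obs2 = (4, 5)

    # Euclidean: sqrt((3 - 4)^2 + (4 - 5)^2) = sqrt(2), about 1.41
    euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(obs1, obs2)))

    # Manhattan: |3 - 4| + |4 - 5| = 2
    manhattan = sum(abs(a - b) for a, b in zip(obs1, obs2))
    print(round(euclidean, 2), manhattan)  # 1.41 2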

Principal components are uncorrelated variables whose values are the weighted linear combinations of the original variables. Which principal component accounts for most of the variability in the data? Last First Base case

First

In situations where negative outcomes are not as important as positive outcomes, what is a more appropriate measure of similarity? Please choose the best answer. Jaccard's coefficient Manhattan coefficient Matching coefficient Euclidean coefficient

Jaccard's coefficient

Select all that apply Euclidean and Manhattan distance measures are suitable for numerical variables. When dealing with categorical variables, we rely on other measures of similarity. What are two commonly used measures for categorical and binary data? Please select all that apply! Multiple select question. Matching coefficient Euclidean distance Standardizing Jaccard's coefficient

Jaccard's coefficient Matching coefficient
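
For illustration, a minimal Python sketch of both coefficients; the 0/1 values are made up:

    obs1 = [1, 0, 1, 1, 0]
    obs2 = [1, 0, 0, 1, 1]

    n = len(obs1)
    matches = sum(a == b for a, b in zip(obs1, obs2))                # 1-1 and 0-0 matches
    both_one = sum(a == 1 and b == 1 for a, b in zip(obs1, obs2))    # 1-1 matches only
    both_zero = sum(a == 0 and b == 0 for a, b in zip(obs1, obs2))   # 0-0 matches only

    # Matching coefficient counts all matches, including 0-0 (negative) matches.
    matching = matches / n                  # 3 / 5 = 0.6

    # Jaccard's coefficient ignores 0-0 matches, which is useful when negative
    # outcomes are not as important as positive outcomes.
    jaccard = both_one / (n - both_zero)    # 2 / 4 = 0.5
    print(matching, jaccard)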

Specific actions in this step of the SEMMA data mining methodology include: relevant variables are selected, created, and/or transformed in order to prepare the data set for subsequent analyses. Which step is this? Sample Model Modify Explore

Modify

Select all that apply Variables are often measured on different scales, and this difference in scale distorts the true distance between observations and can lead to inaccurate results. It is common, therefore, to make the observations unit-free. How is this accomplished? Choose all that apply. Multiple select question. Collect new data Normalizing Standardizing Revise the scale

Normalizing Standardizing
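
For illustration, a minimal Python sketch of both approaches; the income values are made up:

    import statistics

    income = [35000, 52000, 41000, 78000, 60000]

    # Standardizing: z-score = (value - mean) / standard deviation
    mean = statistics.mean(income)
    stdev = statistics.stdev(income)
    z_scores = [(x - mean) / stdev for x in income]

    # Normalizing: rescale each value to the [0, 1] range
    lo, hi = min(income), max(income)
    normalized = [(x - lo) / (hi - lo) for x in income]

    print([round(z, 2) for z in z_scores])
    print([round(v, 2) for v in normalized])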

The '_________________________' technique involves intentionally selecting more samples from one class than from the other class or classes in order to adjust the class distribution of a data set.

Oversampling
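
For illustration, a minimal Python sketch of oversampling the rare class in the training data; the records and class labels are made up:

    import random

    random.seed(1)
    training = [
        {"x": 2.4, "y": 0}, {"x": 3.1, "y": 0}, {"x": 1.8, "y": 0},
        {"x": 2.9, "y": 0}, {"x": 3.6, "y": 0}, {"x": 4.2, "y": 1},
    ]

    class1 = [r for r in training if r["y"] == 1]   # the rare (target) class
    class0 = [r for r in training if r["y"] == 0]

    # Draw Class 1 records with replacement until the two classes are balanced.
    oversampled = class0 + random.choices(class1, k=len(class0))
    print(sum(r["y"] == 1 for r in oversampled), "of", len(oversampled), "records are Class 1")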

Select all that apply Unsupervised data mining requires no knowledge of the target variable. Common applications of unsupervised learning include: Select all that apply! Multiple select question. Classification Prediction Pattern recognition Dimension reduction

Pattern recognition Dimension reduction

What is the term that refers to the process of converting a set of high-dimensional data (data with a large number of variables) into data with fewer dimensions without losing much of the information in the original data? Pattern recognition Prediction Dimension reduction Classification

Dimension reduction

Select all that apply Examples of the use of prediction models include: Select all that apply! Multiple select question. Predict the selling price of a house Classify a list of prospective customers as buyers and nonbuyers Predict the classification of a list of prospective customers as buyers Predict the spending of a customer

Predict the selling price of a house Predict the spending of a customer

Which tool shows the sensitivity and specificity measures across all cutoff values and how accurately the model is able to classify both target and non target class cases overall? Performance chart Decile-wise lift chart Cumulative lift chart ROC curve

ROC curve

Click and drag on elements in order The Sample, Explore, Modify, Model, and Assess (SEMMA) methodology consists of five steps. Please order the five steps from the first step to the last step! Assess Model Explore Modify Sample

Sample Explore Modify Model Assess

The z-score measures the distance of a given observation from the sample mean in terms of standard deviations. The z-score is an example of making observations unit free. This is an example of '________________________' data.

Standardizing

Select all that apply Name 3 prediction model performance measures described in the chapter. The root mean square error (RMSE) The root mean error (RME) The mean error (ME) The mean absolute percentage error (MAPE)

The mean error (ME) The root mean square error (RMSE) The mean absolute percentage error (MAPE)
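
For illustration, a minimal Python sketch of the three measures; the actual and predicted values are made up:

    import math

    actual    = [200, 150, 300, 250]
    predicted = [210, 140, 330, 240]
    errors = [a - p for a, p in zip(actual, predicted)]

    me   = sum(errors) / len(errors)                               # mean error
    rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))    # root mean square error
    mape = sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors) * 100   # in percent
    print(round(me, 2), round(rmse, 2), round(mape, 2))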

Select all that apply Common applications of supervised data mining include classification and prediction models. What is true of a classification model? The objective is to predict the class memberships of new cases The objective is to predict the membership of a numerical value The target variable is categorical The target variable is numerical

The target variable is categorical The objective is to predict the class memberships of new cases

When creating a decile-wise lift chart, if the lift for the first 10% of the observations (first bar) is about 7.1, what does that mean? The top 7.1% of the observations selected by the model contain as many Class 1 cases as the 10% of the observations that are randomly selected The top 10% of the observations selected by the model contain 10x as many Class 1 cases as the 7.1% of the observations that are randomly selected Top 10% of observations selected by model contain 7.1x as many Class 1 cases as the 10% of the observations randomly selected The top 10% of the observations randomly selected contain 7.1x as many Class 1 cases as the 10% of the observations selected by the model

The top 10% of the observations selected by the model contain 7.1x as many Class 1 cases as the 10% of the observations that are randomly selected
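
For illustration, a minimal Python sketch of how a first-decile lift is computed; the counts are made up and happen to give a lift of about 7.2:

    n_class1_total = 50            # Class 1 cases in the whole validation set
    class1_in_top_decile = 36      # Class 1 cases in the model's top-scored 10%

    # A random 10% of the observations would be expected to capture 10% of all
    # Class 1 cases; the lift is the ratio of the model's capture to that baseline.
    expected_random = 0.10 * n_class1_total        # 5 cases
    lift_first_decile = class1_in_top_decile / expected_random
    print(lift_first_decile)   # 7.2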

Recall that data partitioning is the process of dividing a data set into a training, a validation, and an optional test data set. As a common practice, in the oversampling technique, which data set is oversampled? Training data set Validation data set Optional test data set All can be oversampled

Training data set

The terms artificial intelligence, machine learning, and data mining are often grouped together or used interchangeably because their definitions tend to overlap with no clear boundaries. True False

True

Assume that the target class, also called the success class (in general, the class of interest), is Class 1 and that the non target class is Class 0. In the confusion matrix there are 4 possible outcomes. When a Class 1 observation is correctly classified by the model, what would it be called? (according to the textbook) True positive (TP) True negative (TN) False negative (FN) False positive (FP)

True positive (TP)
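
For illustration, a minimal Python sketch that tallies all four confusion-matrix outcomes; the actual classes and model classifications are made up:

    actual     = [1, 0, 1, 1, 0, 0, 1, 0]
    classified = [1, 0, 0, 1, 0, 1, 1, 0]

    tp = sum(a == 1 and c == 1 for a, c in zip(actual, classified))  # true positives
    tn = sum(a == 0 and c == 0 for a, c in zip(actual, classified))  # true negatives
    fp = sum(a == 0 and c == 1 for a, c in zip(actual, classified))  # false positives
    fn = sum(a == 1 and c == 0 for a, c in zip(actual, classified))  # false negatives

    accuracy    = (tp + tn) / len(actual)
    sensitivity = tp / (tp + fn)      # share of Class 1 cases correctly classified
    specificity = tn / (tn + fp)      # share of Class 0 cases correctly classified
    print(tp, tn, fp, fn, accuracy, sensitivity, specificity)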

Select all that apply When selecting the cutoff values for performance measures, in some applications, the analyst may choose to increase or decrease the cutoff value to classify fewer or more observations into the target class. What are some reasons for doing this? Select all that apply! Multiple select question. Personal preference Even class distributions Uneven class distributions Asymmetric misclassification costs

Uneven class distributions Asymmetric misclassification costs

When is the RMSE performance measure most desirable? When you want to measure the magnitude of the errors When you want to show the relative error When large errors are particularly undesirable When you care that there is an error but not whether it is + or -

When large errors are particularly undesirable

A good predictive model would have a ROC curve that lies above the diagonal line. The greater the area between the ROC curve and the baseline, the '_______________' the model is.

better
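
For illustration, a minimal scikit-learn sketch of an ROC curve and its area; the synthetic data set and logistic regression model are assumptions, not the chapter's example:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=6, random_state=1)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.40, random_state=1)

    model = LogisticRegression().fit(X_train, y_train)
    scores = model.predict_proba(X_valid)[:, 1]        # predicted Class 1 probabilities

    # Sensitivity (tpr) and 1 - specificity (fpr) across all cutoff values.
    fpr, tpr, cutoffs = roc_curve(y_valid, scores)

    # An area of 0.5 matches the diagonal baseline; the closer to 1, the better.
    print(round(roc_auc_score(y_valid, scores), 3))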

Principal component analysis (PCA) transforms a large number of possibly correlated variables into a smaller number of uncorrelated variables called principal '________________'.

components
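
For illustration, a minimal scikit-learn sketch of PCA on standardized synthetic data; the data set is made up:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = make_classification(n_samples=300, n_features=5, random_state=1)

    X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to variable scale
    pca = PCA().fit(X_std)

    # The proportions are sorted from largest to smallest, so the first principal
    # component accounts for the largest share of the variability in the data.
    print(pca.explained_variance_ratio_.round(3))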

'________________' occurs when a predictive model is made overly complex to fit the quirks of given sample data. By making the model conform too closely to the sample data, its predictive power is compromised.

Overfitting

The formula for calculating the matching coefficient is (number of variables with matching outcomes) / (total number of variables). The '________________' the value of the matching coefficient, the more similar the two observations are.

higher

Data '______________' is the process of dividing a data set into a training, a validation, and, in some situations, an optional test data set.

partitioning

It is important to develop '_______________________' measures that evaluate how well an estimated model will perform in an unseen sample, rather than making the evaluation solely on the basis of the sample data used to build the model.

performance

'______________' measures gauge whether a group of observations are similar or dissimilar to one another.

similarity

The probability of a randomly selected case belonging to Class 1 is the '________________' of the diagonal line.

slope

Data mining uses many kinds of computational algorithms to identify hidden patterns and relationships in data. For developing predictive models, one tends to employ '_______________' data mining techniques.

supervised

