Data Mining Exam

Which is true of the creation of a decision tree?

Each variable is evaluated at each node to determine the splitting variable. The same variable may be used for splitting at different locations in the decision tree. Entropy reduction (information gain) is a common splitting method. A stopping criterion in creating a decision tree is when the tree has pure leaves.
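
As a rough illustration of entropy reduction, the sketch below scores one candidate split by information gain. The labels and the split are made up for the example, and the function names are hypothetical rather than taken from any particular tool.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node: 4 "yes" and 4 "no" records, split into two pure leaves.
parent = ["yes"] * 4 + ["no"] * 4
children = [["yes"] * 4, ["no"] * 4]
gain = information_gain(parent, children)
print(gain)
```

A tree builder would compute this gain for every candidate variable at a node and split on the one with the largest value; a split that yields pure leaves leaves no entropy to remove.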

In k-nearest-neighbor classification, choosing low values of k results in

Either (not sure): predicting the most frequent class label; avoiding over-smoothing; missing local noise in the training data; none of the above

What is true of k-nearest neighbor models?

Either (not sure): it is a supervised model that uses a specified model for finding "similar" records; it is an unsupervised model that somehow magically measures records' proximity; it is a supervised model that requires no target variable because it places records close together

When is it appropriate to use a bar graph?

For categorical data

When is it appropriate to use a line graph?

For data involving time series

When is it appropriate to use a scatterplot?

For demonstrating a relationship between two numerical variables

One common splitting method for decision trees is

Gini
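
A minimal sketch of the Gini impurity used to score splits, assuming class labels are given as plain lists. The function names and sample labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted Gini impurity of the child nodes produced by a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini(child) for child in children)


# Hypothetical split of 8 records into two mostly pure children.
before = gini(["a"] * 4 + ["b"] * 4)
after = weighted_gini([["a", "a", "a", "b"], ["b", "b", "b", "a"]])
print(before, after)
```

A lower weighted impurity after the split than before it indicates that the split separates the classes better.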

Creating classification decision tree models will

generally use a single categorical target variable

A leaf node of a decision tree

is not further split into additional branches

The most common clustering method is ___

k-means

The most common neural network algorithm is

Not: LMP (layer map procedure)

A neural network model

Not: have a tendency to overfit (memorize data)

Clustering is in the ___ category of data mining

See p. 12 of the textbook

Naive Bayes is a(n) ___ data mining task

See ppt. Options: classification, association, clustering, description

The confidence for the association rule: if milk, then orange juice, and the confidence for the association rule: if orange juice, then milk, ____

see ppt slide 12
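
The card leaves the answer to the slides. As a hedged illustration only, the sketch below computes both confidences on a made-up set of baskets; the two rules share the same count of baskets containing both items but divide by different antecedent counts, so the values need not match. The basket data and function name are hypothetical.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        raise ValueError("antecedent never occurs in the transactions")
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


# Hypothetical market baskets.
baskets = [
    {"milk", "orange juice"},
    {"milk", "bread"},
    {"milk", "orange juice", "eggs"},
    {"milk"},
    {"orange juice"},
]

milk_to_oj = confidence(baskets, "milk", "orange juice")  # 2 of the 4 milk baskets
oj_to_milk = confidence(baskets, "orange juice", "milk")  # 2 of the 3 orange juice baskets
print(milk_to_oj, oj_to_milk)
```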

Which statement is true of association analysis?

see ppt slide 3 (not: association analysis requires a target variable)

One method to possibly reduce the dimensionality of a supervised model is ___

To use the 5th dimension; to use principal components with eigenvalues > 1; to use principal components with negative eigenvalues

One should ___ for clustering

c & d

Text mining is the process of taking ___ text data and transforming it to ___ quantitative data to perform analysis

See ch. 20 ppt

What is a major difference between the data mining tasks of clustering and classification?

Classification is a supervised data mining task, whereas clustering is an unsupervised data mining task

Assessment methods for classification models include

Lift; classification matrix (confusion matrix); misclassification rate; error metrics
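
A small sketch of two of these measures, assuming a binary target coded 1 (positive) and 0 (negative). The labels, predictions, and function names are invented for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally the four cells of a binary classification (confusion) matrix."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["tp"] += 1
        elif p == positive:
            counts["fp"] += 1
        elif a == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def misclassification_rate(counts):
    """Share of records whose predicted class differs from the actual class."""
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total


# Hypothetical actual and predicted classes for 8 records.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
rate = misclassification_rate(counts)
print(counts, rate)
```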

The beta coefficients of a logistic regression model...

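May have a different effect on the predicted probability for a 1-unit change in an independent variable (say, from 3 to 4 versus from 5 to 6, while holding other model features constant); the coefficient itself, the change in log-odds per unit, stays the same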
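
One way to read this card: in a logistic model the change in log-odds per unit is fixed by the coefficient, while the change in predicted probability depends on where the 1-unit step starts. The sketch below uses a hypothetical single-predictor model; the intercept and coefficient values are assumptions chosen only for illustration.

```python
from math import exp, log

INTERCEPT = -4.0  # hypothetical value
BETA = 1.0        # hypothetical coefficient for the single predictor


def predicted_probability(x):
    """Logistic function applied to the linear predictor INTERCEPT + BETA * x."""
    return 1.0 / (1.0 + exp(-(INTERCEPT + BETA * x)))


def log_odds(x):
    """Log of the odds p / (1 - p) at predictor value x."""
    p = predicted_probability(x)
    return log(p / (1.0 - p))


# Probability change for a 1-unit step depends on the starting point...
change_3_to_4 = predicted_probability(4) - predicted_probability(3)
change_5_to_6 = predicted_probability(6) - predicted_probability(5)

# ...while the log-odds change for the same steps equals BETA both times.
log_odds_3_to_4 = log_odds(4) - log_odds(3)
log_odds_5_to_6 = log_odds(6) - log_odds(5)
print(change_3_to_4, change_5_to_6, log_odds_3_to_4, log_odds_5_to_6)
```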

Clustering requires standardization. One way to standardize is the min-max equation: (value of interest - minimum) / range. Given that a particular predictor has a mean of 10, a minimum value of 8, and a maximum value of 12, provide the range and use the min-max equation to standardize the following two values. Standardize the value 9. Standardize the value 11.

Range = 12 - 8 = 4. Standardized 9: (9 - 8) / (12 - 8) = 1/4 = 0.25. Standardized 11: (11 - 8) / (12 - 8) = 3/4 = 0.75.
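
The same arithmetic as a short sketch; `min_max` is a hypothetical helper name, not a function from any specific package.

```python
def min_max(value, minimum, maximum):
    """Min-max standardization: (value - minimum) / range, giving a result in [0, 1]."""
    value_range = maximum - minimum
    if value_range == 0:
        raise ValueError("range is zero; min-max standardization is undefined")
    return (value - minimum) / value_range


# Values from the card: minimum 8, maximum 12, so the range is 4.
standardized_9 = min_max(9, 8, 12)
standardized_11 = min_max(11, 8, 12)
print(standardized_9, standardized_11)
```

The mean of 10 given in the question is not needed for min-max standardization; only the minimum and maximum enter the formula.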

A common name used for association analysis is ___

A & d

A NN would typically have

An input, hidden, and output layer

Clustering algorithms seek to create clusters such that the ___ is large compared to the ___

Between-cluster variation, within-cluster variation
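
A minimal sketch of the two quantities, measured here as sums of squares on one-dimensional data. The cluster values and function names are made up for the example.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def within_cluster_ss(clusters):
    """Sum of squared distances from each point to its own cluster mean."""
    return sum((x - mean(c)) ** 2 for c in clusters for x in c)


def between_cluster_ss(clusters):
    """Size-weighted sum of squared distances from each cluster mean to the grand mean."""
    grand = mean([x for c in clusters for x in c])
    return sum(len(c) * (mean(c) - grand) ** 2 for c in clusters)


# Hypothetical, well-separated clusters of one-dimensional points.
clusters = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]

within = within_cluster_ss(clusters)
between = between_cluster_ss(clusters)
print(within, between)
```

For these points the between-cluster variation is far larger than the within-cluster variation, which is the pattern a good clustering aims for.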

Bayesian classifiers work only with ___ predictors

Categorical

What is the primary mathematical concept underlying Naive Bayes classifiers?

Conditional probability

Provide three example nodes from SAS EM that help visualize data and provide exploration outputs. What do they produce? Why is this important for business use?

Graph Explore provides interactive graphs that quickly visualize the data. It produces many different types of graphs. This is important for quick overviews. MultiPlot takes two different variables and graphs them together. This is good for business use because you can see the relationship and the trends of two variables on top of each other. Link Analysis is a way to see how variables are related to each other. It looks at what they have in common and what they do not have in common. This is important for business use because it helps you quickly identify why relationships do or do not exist.

A prestigious university has been tracking students for many years in their MBA program for the purposes of building a model to predict success. The graduates have been placed into three categories: underperforming, average, top performing. Extensive study has identified a number of significant predictor attributes. What data mining task (estimation or classification) would you recommend and why?

I would recommend classification because the graduates are placed into three ranked categories. There is no numerical determinant, and the response variable is very broad (success), and left undefined numerically. Classification should be used because we are not predicting an exact number or outcome, we are predicting a broader classified behavior.

A k-nearest neighbor model ___

Is one of the models in IBM SPSS Modeler and SAS EM; may be referred to as case-based reasoning; evaluates on k neighbors

What is true of logistic regression models?

It is a supervised model that uses a categorical target variable

Care must be used with association analysis because

It is easy to generate many rules, many of which may be useless

What is the biggest drawback of NN models?

Overfitting

One approach to developing models when the target variable contains a rare class is ___

Oversampling

When referring to Kohonen/SOM clustering models, SOM is

Self-organizing map

At a minimum, running a k-nearest neighbor model requires

Setting the number of neighbors to compare records to
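
A minimal k-nearest-neighbor sketch, assuming Euclidean distance and a simple majority vote. The training records, including one deliberately noisy label, and the function name are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training records.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda record: dist(record[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data: the last record is a noisy "B" inside the "A" region.
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
    ((1.1, 1.0), "B"),
]
query = (1.1, 1.05)

label_k1 = knn_predict(train, query, k=1)
label_k3 = knn_predict(train, query, k=3)
print(label_k1, label_k3)
```

With k = 1 the prediction follows the single noisy neighbor, while k = 3 lets the surrounding records outvote it. This is why the number of neighbors is the one setting the model cannot do without.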

Assuming that data mining techniques are to be used, for the following scenario, identify whether the best task is a supervised or unsupervised data mining task. Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers)

Supervised

Assuming that data mining techniques are to be used, for the following scenario, identify whether the best task is a supervised or unsupervised data mining task. Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets whose threat status is known

Supervised

What is the major difference between supervised (directed) and unsupervised (non-directed) data mining categories?

Supervised data mining requires a target variable whereas unsupervised data mining has no target variable

List all data mining tasks where NN are an appropriate methodology to use

Supervised tasks with continuous or categorical targets, and continuous or categorical predictors in any combination

Two data formats for input into association analysis software are referred to as __ or __

Tabular, transactional

Provide three example nodes from SPSS Modeler that help visualize data and provide exploration outputs. What do they produce? Why is this important for business use?

The Graphboard node visualizes data in different graph formats. It produces bar graphs, pie charts, scatterplots, and more. This is important for business use because these are quick ways to get the overall gist of a set of data. The Plot node visualizes data as scatterplots or line graphs. These types of graphs are important for business use because they show the relationship between two different, but possibly related, variables and their trends. The Distribution node visualizes data by using the frequencies of categorical data. It is similar to the Graphboard node because it produces graphs such as bar charts. This node is important for business use because it shows the distribution of a population in an easy-to-read graph.

When is it appropriate to use a histogram?

To display "how many" of each value occur in a data set

Assuming that data mining techniques are to be used, for the following scenario, identify whether the best task is a supervised or unsupervised data mining task. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions

Unsupervised

What are the axes of a ROC (Receiver operating characteristic) curve?

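Vertical axis: true positive rate (% of actual positives correctly classified). Horizontal axis: false positive rate (% of actual negatives incorrectly classified as positive).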
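
A rough sketch of how the points on a ROC curve can be computed, assuming a binary target coded 1/0 and a model score where higher means more likely positive. The scores, labels, and function name are made up for the example.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs, one per score threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical actual classes and model scores for 6 records.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis traces the curve from (0, 0) to (1, 1).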

The most appropriate data mining category for Naive Bayes classifiers is ___

supervised

An advantage of using a decision tree model would be

That a decision tree generates rules that can be easily explained and implemented

