Data Mining Exam
Which is true of the creation of a decision tree?
Each variable is evaluated at each node to determine the splitting variable; the same variable may be used for splitting at different locations in the decision tree; entropy reduction (information gain) is a common splitting method; a stopping criterion in creating a decision tree is when the tree has pure leaves
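To make "entropy reduction (information gain)" concrete, here is a minimal sketch in Python. The function names and the toy labels are assumptions made for illustration; they do not come from the course material or from any particular tool.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node holding 4 "yes" and 4 "no" records.
parent = ["yes"] * 4 + ["no"] * 4
pure_split = [["yes"] * 4, ["no"] * 4]                      # two pure leaves
mixed_split = [["yes", "yes", "no", "no"]] * 2              # children look like the parent

print(information_gain(parent, pure_split))   # largest possible gain for this node
print(information_gain(parent, mixed_split))  # no gain: the split separates nothing
```

A tree builder would compute this gain for every candidate variable at a node and split on the one with the highest value, which is why each variable is evaluated at each node.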
In k-nearest-neighbor classification, choosing low values of k results in
Either (not sure): Predicting the most frequent class label; Avoiding over-smoothing; Missing local noise in the training data; None of the above
What is true of k-nearest neighbor models?
Either (not sure): It is a supervised model that uses a specified model for finding "similar" records; It is an unsupervised model that somehow magically measures records' proximity; It is a supervised model that requires no target variable because it places records close together
When is it appropriate to use a bar graph?
For categorical data
When is it appropriate to use a line graph?
For data involving time series
When is it appropriate to use a scatterplot?
For demonstrating a relationship between two numerical variables
One common splitting method for decision trees is
Gini
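As a rough illustration of the Gini measure, the sketch below computes Gini impurity for a node. The helper name `gini` and the sample labels are assumptions for the example, not taken from the course slides.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))  # a pure node has impurity 0.0
print(gini(["a", "a", "b", "b"]))  # an evenly mixed two-class node has impurity 0.5
```

A split is preferred when the size-weighted impurity of the child nodes is lower than the impurity of the parent.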
Creating classification decision tree models will
generally use a single categorical target variable
A leaf node of a decision tree
is not further split into additional branches
The most common clustering method is ___
k-means
The most common neural network algorithm is
not - LMP (layer map procedure)
A neural network model
not - have a tendency to overfit (memorize data)
Clustering is in the __ category of data mining
pg 12 of textbook
Naive Bayes is a(n) __ data mining task
see ppt classification association clustering description
The confidence for the association rule: if milk, then orange juice, and the confidence for the association rule: if orange juice, then milk, ____
see ppt slide 12
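For reference while reviewing that slide: the confidence of "if A, then B" is usually defined as the share of transactions containing A that also contain B. The sketch below uses that standard definition with hypothetical baskets; the function name and the data are assumptions, not taken from the slides.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    both = [t for t in with_antecedent if consequent in t]
    return len(both) / len(with_antecedent)

# Hypothetical market baskets.
baskets = [
    {"milk", "orange juice"},
    {"milk"},
    {"milk", "bread"},
    {"orange juice", "milk"},
    {"orange juice"},
]

print(confidence(baskets, "milk", "orange juice"))   # 2 of the 4 milk baskets
print(confidence(baskets, "orange juice", "milk"))   # 2 of the 3 orange juice baskets
```

Both rules share the same numerator (baskets with both items) but divide by different counts, so the two values can differ.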
Which statement is true of association analysis?
see ppt slide 3 (not: association analysis requires a target variable)
One method to possibly reduce the dimensionality of a supervised model is ___
To use the 5th dimension; To use principal components with eigenvalues > 1; To use principal components with negative eigenvalues
One should ___ for clustering
c & d
Text mining is the process of taking ___ the text data and transforming it to __ quantitative data to perform analysis
ch 20 ppt
What is a major difference between the data mining tasks of clustering and classification?
Classification is a supervised data mining task, whereas clustering is an unsupervised data mining task
Assessment for classification models includes
Lift; classification matrix (confusion matrix); misclassification rate; error metrics
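Two of these measures, the classification (confusion) matrix and the misclassification rate, can be sketched in a few lines. The function names and the actual/predicted labels below are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count records for each (actual, predicted) pair of class labels."""
    return Counter(zip(actual, predicted))

def misclassification_rate(actual, predicted):
    """Share of records whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical results for 8 records (1 = positive class, 0 = negative class).
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix[(1, 1)], matrix[(1, 0)])  # true positives, false negatives
print(matrix[(0, 1)], matrix[(0, 0)])  # false positives, true negatives
print(misclassification_rate(actual, predicted))  # 2 wrong out of 8 = 0.25
```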
The beta coefficients of a logistic regression model...
Give a constant change in the log-odds for each 1-unit change of an independent variable, but the resulting change in predicted probability may differ (say, from 3 to 4 versus from 5 to 6) while holding other model features constant
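One way to read this answer: in a logistic regression, a coefficient is a fixed change in log-odds per unit, while the change in predicted probability depends on where the variable starts. The sketch below uses hypothetical coefficients (`b0 = -4`, `b1 = 1`) chosen only for illustration.

```python
from math import exp, log

def predict(x, b0=-4.0, b1=1.0):
    """Predicted probability from a one-variable logistic model (hypothetical coefficients)."""
    return 1.0 / (1.0 + exp(-(b0 + b1 * x)))

def log_odds(p):
    """Convert a probability to log-odds."""
    return log(p / (1.0 - p))

for start in (3, 5):
    p_before, p_after = predict(start), predict(start + 1)
    print(
        f"x: {start} -> {start + 1}  "
        f"log-odds change: {log_odds(p_after) - log_odds(p_before):.3f}  "
        f"probability change: {p_after - p_before:.3f}"
    )
```

The log-odds change equals `b1` in both rows, while the probability change is larger for the step from 3 to 4 than for the step from 5 to 6.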
Clustering requires standardization. One way to do standardization is using the min-max equation (value of interest - minimum)/Range. Given a particular predictor has a mean of 10, minimum value of 8 and a maximum value of 12, provide the range and use the min-max equation to standardize the following two values. Standardize the value 9: Standardize the value 11:
Range = 12 - 8 = 4. Standardized 9 = (9 - 8)/(12 - 8) = 1/4 = 0.25. Standardized 11 = (11 - 8)/(12 - 8) = 3/4 = 0.75.
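The same min-max arithmetic as a small sketch; the function name `min_max` is an assumption used only for this example.

```python
def min_max(value, minimum, maximum):
    """Min-max standardization: (value - minimum) / range."""
    return (value - minimum) / (maximum - minimum)

print(min_max(9, 8, 12))   # 0.25
print(min_max(11, 8, 12))  # 0.75
```

Note that the mean of 10 given in the question is not needed for min-max standardization; only the minimum and maximum are used.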
A common name used for association analysis is ___
a & d
A NN would typically have
An input, hidden, and output layer
Clustering algorithms seek to create clusters such that the ___ is large compared to the ___
Between-cluster variation, within-cluster variation
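A small sketch of this idea, using sums of squares on one-dimensional points. The helper names and the toy data are assumptions for illustration; real clustering tools work with multi-dimensional distances.

```python
def mean(values):
    return sum(values) / len(values)

def within_ss(clusters):
    """Sum of squared distances from each point to its own cluster mean."""
    return sum((x - mean(c)) ** 2 for c in clusters for x in c)

def between_ss(clusters):
    """Size-weighted squared distances from each cluster mean to the grand mean."""
    grand = mean([x for c in clusters for x in c])
    return sum(len(c) * (mean(c) - grand) ** 2 for c in clusters)

# The same six hypothetical points grouped two different ways.
good = [[1, 2, 3], [11, 12, 13]]   # tight, well-separated clusters
bad = [[1, 2, 11], [3, 12, 13]]    # clusters that mix the two groups

for name, clusters in (("good", good), ("bad", bad)):
    print(name, between_ss(clusters), within_ss(clusters))
```

The "good" grouping has a much larger between-cluster variation relative to its within-cluster variation, which is what a clustering algorithm is looking for.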
Bayesian classifiers work only with ___ predictors
Categorical
What is the primary mathematical concept underlying Naive Bayes classifiers?
Conditional probability
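To show how conditional probability drives a Naive Bayes classifier, the sketch below multiplies a class prior by per-predictor conditional probabilities and normalizes the result. The records, labels, and function name are hypothetical, and no smoothing is applied for unseen values.

```python
from collections import Counter

# Hypothetical training records: ((weather, day_type), label)
records = [
    (("sunny", "weekend"), "buy"),
    (("sunny", "weekday"), "buy"),
    (("rainy", "weekend"), "buy"),
    (("rainy", "weekday"), "skip"),
    (("rainy", "weekday"), "skip"),
    (("sunny", "weekday"), "skip"),
]

def posterior(features):
    """Return P(class | features), treating predictors as independent given the class."""
    class_counts = Counter(label for _, label in records)
    scores = {}
    for label, n_class in class_counts.items():
        score = n_class / len(records)  # prior P(class)
        for i, value in enumerate(features):
            matches = sum(1 for f, l in records if l == label and f[i] == value)
            score *= matches / n_class  # conditional P(predictor_i = value | class)
        scores[label] = score
    total = sum(scores.values())  # assumes at least one class has a nonzero score
    return {label: s / total for label, s in scores.items()}

print(posterior(("sunny", "weekday")))
```

Both predictors here are categorical, which is the setting these cards describe for Bayesian classifiers.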
Provide three example nodes from SAS EM that help visualize data and provide exploration outputs. What do they produce? Why is this important for business use?
Graph Explore provides interactive graphs that quickly visualize the data. It produces many different types of graphs. This is important for quick overviews. MultiPlot uses two different variables and graphs them together. This is good for business use because you are able to see the relationship and the trends of two variables on top of each other. Link Analysis is a way to see how variables are related to each other. It looks to see what they have in common and what they don't have in common. This is important for business use because it helps you quickly identify why relationships do or do not exist.
A prestigious university has been tracking students for many years in their MBA program for the purposes of building a model to predict success. The graduates have been placed into three categories: underperforming, average, top performing. Extensive study has identified a number of significant predictor attributes. What data mining task (estimation or classification) would you recommend and why?
I would recommend classification because the target variable is categorical: the graduates are placed into three ranked categories (underperforming, average, top performing). Success is not defined numerically, so we are not estimating an exact number; we are predicting which category a graduate falls into.
A k-nearest neighbor model ___
Is one of the models in IBM SPSS Modeler and SAS EM; May be referred to as Case Based Reasoning; Evaluates on k neighbors
What is true of logistic regression models?
It is a supervised model that uses a categorical target variable
Care must be used with association analysis because
It is easy to generate many rules, many of which may be useless
What is the biggest drawback of NN models?
Overfitting
One approach to developing models when the target variable contains a rare class is ___
Oversampling
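One simple form of oversampling is to duplicate rare-class records (sampling with replacement) until the classes are balanced. The sketch below assumes that form; the function name, the `label_of` parameter, and the data are hypothetical, and tools such as SAS EM or SPSS Modeler may implement oversampling differently.

```python
import random

def oversample(records, label_of, rare_label, seed=0):
    """Duplicate rare-class records at random until both groups are the same size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rare = [r for r in records if label_of(r) == rare_label]
    common = [r for r in records if label_of(r) != rare_label]
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra

# Hypothetical data: 2 "fraud" records among 10.
records = [(i, "fraud" if i < 2 else "ok") for i in range(10)]
balanced = oversample(records, lambda r: r[1], "fraud")

print(sum(1 for r in balanced if r[1] == "fraud"))  # 8
print(sum(1 for r in balanced if r[1] == "ok"))     # 8
```

Only the training data would normally be rebalanced this way; model assessment should still reflect the original class proportions.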
When referring to Kohonen/SOM clustering models, SOM is
Self-organizing map
At a minimum, running a k-nearest neighbor model requires
Setting the number of neighbors to compare records to
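A minimal sketch of k-nearest-neighbor classification, showing why the number of neighbors matters. The training points, the query, and the function name are made up for illustration; `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Classify the query by majority vote among its k nearest training records."""
    neighbors = sorted(train, key=lambda record: dist(record[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set: class A near (1, 1), class B near (5, 5),
# plus one noisy "B" record sitting inside the A region.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "A"),
    ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B"),
    ((1.5, 1.5), "B"),  # noise
]

query = (1.6, 1.6)
print(knn_predict(train, query, k=1))  # "B": a single neighbor follows the noisy record
print(knn_predict(train, query, k=5))  # "A": more neighbors smooth the noise out
```

With a very large k the vote would drift toward the most frequent class overall, so k is a trade-off between fitting local noise and over-smoothing.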
Assuming that data mining techniques are to be used, for the following scenario, identify whether the best task is a supervised or unsupervised data mining task. Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers)
Supervised
Assuming that data mining techniques are to be used, for the following scenario, identify whether the best task is a supervised or unsupervised data mining task. Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets whose threat status is known
Supervised
What is the major difference between supervised (directed) and unsupervised (non-directed) data mining categories?
Supervised data mining requires a target variable whereas unsupervised data mining has no target variable
List all data mining tasks where NN are an appropriate methodology to use
Supervised tasks with continuous or categorical targets, and continuous or categorical predictors in any combination
Two data formats for input into association analysis software are referred to as __ or __
Tabular, transactional
Provide three example nodes from SPSS Modeler that help visualize data and provide exploration outputs. What do they produce? Why is this important for business use?
The Graphboard node visualizes data in different graph formats. It produces bar graphs, pie charts, scatterplots, and more. This is important for business use because these are quick ways to get the overall gist of a data set. The Plot node visualizes data in the form of scatterplots or line graphs. These types of graphs are important for business use because they show the relationship between two different, but possibly related, variables and their trends. The Distribution node visualizes data by using the frequencies of categorical data. It is similar to the Graphboard node because it produces graphs such as bar charts. This node is important for business use because it shows the distribution of a population in an easy-to-read graph.
When is it appropriate to use a histogram?
To display "how many" of each value occur in a data set
Assuming that data mining techniques are to be used, for the following scenario, identify whether the best task is a supervised or unsupervised data mining task. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions
Unsupervised
What are the axes of a ROC (Receiver operating characteristic) curve?
Vertical axis: true positive rate (% of true positives); horizontal axis: false positive rate (% of false positives)
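The sketch below traces the points of a ROC curve by lowering the classification threshold one record at a time. The scores and labels are hypothetical, the function name is an assumption, and the code assumes all scores are distinct and that both classes are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged positive
    tp = fp = 0
    # Walk from the highest score to the lowest, flagging one more record each step.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical model scores with actual classes (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"horizontal (FPR) = {fpr:.2f}, vertical (TPR) = {tpr:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1); a better model climbs toward the top-left corner before moving right.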
The most appropriate data mining category for Naive Bayes classifiers is ___
supervised
An advantage of using a decision tree model would be
that a decision tree generates rules that can be easily explained and implemented