CIS 375 Final Practice Questions
Determining which customers are most likely to leave a business (or a social media site) is unsupervised learning
FALSE
In a confusion matrix, we want to minimize which of the following?
False positive and false negative
If you could only choose between a fitting graph and a learning curve to compare multiple models' performance, which should you choose?
Learning curve
If we have a well-defined cost-benefit matrix, which of the following curves can we use to evaluate model performance?
Profit curve
"Do my customers form natural groups?" is an example of clustering
TRUE
In the mathematical equation below, representing a linear boundary, variable "y" represents the ____________. y = b + w1x1 + w2x2 + w3x3 + ....
Target variable
What is the hold-out data used for?
Test a model
What is the purpose of the following statement? "berryrules <- subset(groceryrules, items %in% "berries")"
The subset() function filters a set of association rules. This statement keeps only the rules whose itemsets contain the item "berries" and stores them in berryrules.
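A minimal sketch of how that statement fits into a session, assuming the arules package and a groceryrules rules object as in the class example (both names are taken from the question, not defined here):

    library(arules)                                   # assumption: the rules were built with arules
    # Keep only the association rules whose itemsets contain the item "berries"
    berryrules <- subset(groceryrules, items %in% "berries")
    # Inspect the matching rules (items, support, confidence, lift)
    inspect(berryrules)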
What are training data and test data used for when we perform data mining tasks?
The training data is the initial dataset fed to the induction algorithm to build the model. The test data is then used to evaluate how accurately the resulting model performs on new, unseen instances.
Cluster analysis is under the category of ________________ learning
Unsupervised
What is the function we use to perform hierarchical clustering algorithm using R?
hclust()
What is the function we use to perform clustering by k-means algorithm using R?
kmeans()
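A minimal sketch of both clustering calls on a generic numeric data frame x (a placeholder name, not from the course material):

    # Hierarchical clustering: scale the features, compute distances, cluster, cut into k groups
    d  <- dist(scale(x))
    hc <- hclust(d, method = "complete")
    groups_hc <- cutree(hc, k = 3)

    # k-means clustering: ask directly for k clusters
    km <- kmeans(scale(x), centers = 3)
    groups_km <- km$cluster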
Regression is distinguished from classification by
numerical target variable
Examples of Applications of Data Mining
- Fraud detection
- Prediction of loan repayment
- Prediction of membership of social media site
Truth about overfitting
- Overfitting may be caused by the lack of representative instances in the training data
- Holdout method can help detect the issue of overfitting
- All data mining procedures have some tendency to overfit
Truth about model accuracy
- Accuracy is misleading with imbalanced data
- Accuracy doesn't make distinctions between false positives and false negatives
Characteristics of data mining
- DM extracts useful information and knowledge from large volumes of data by following a well-defined process
- DM revolves around data
- DM is a set of techniques for analyzing data to discover interesting knowledge or patterns in the data
Characteristics of a tree-structured model
- Made up of root, interior nodes, leaf nodes, and branches
- Every instance always ends up at a leaf node
- Each branch represents a distinct value of the attribute at that node
Describing Support Vector Machine Model
- SVMs can estimate class membership probability
- SVMs are based on supervised learning
- SVMs use the hinge loss function
Before performing cluster analysis, what should we do to prepare the data?
- Scale/standardize the features
- Deal with missing values
- Screen for unreasonable records
ROC Curve Characteristics
- The dashed line represents a random guessing strategy/model
- ROC curves are not an intuitive visualization for business stakeholders
- AUC measures the area under the ROC curve
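A sketch of how the curve and AUC can be produced in R, assuming the pROC package (an assumption; the course may have used a different package) and hypothetical vectors actual (true classes) and prob (predicted probabilities):

    library(pROC)                 # assumption: pROC is available
    roc_obj <- roc(actual, prob)  # build the ROC curve from truth + scores
    plot(roc_obj)                 # the diagonal reference line is the random-guessing model
    auc(roc_obj)                  # area under the ROC curve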
Classification/Decision Tree
- The target variable for a classification decision tree cannot be numeric
- Decision trees for classification recursively perform IG-based attribute selection
- Decision trees are one of the methods used for classification
How many steps are in the CRISP-DM data mining cycle/process?
6 (business understanding, data understanding, data preparation, modeling, evaluation, deployment)
Model accuracy can be calculated based on:
Confusion matrix
If we would like to create a confusion matrix for evaluating the model performance on the test data, which function shall we use?
CrossTable()
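A sketch of both ideas in R, assuming the gmodels package for CrossTable() and hypothetical vectors actual and predicted for the test set:

    library(gmodels)                                  # assumption: CrossTable() comes from gmodels
    # Confusion matrix of actual vs. predicted classes on the test data
    CrossTable(actual, predicted, prop.chisq = FALSE, dnn = c("actual", "predicted"))

    # Accuracy can be read off the same table: correct predictions / all predictions
    cm <- table(actual, predicted)
    accuracy <- sum(diag(cm)) / sum(cm)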
Estimating probability of default (or "write-off") for a new loan application is a regression problem
FALSE
In selecting informative attributes, we should look for attributes that produce subsets with highest entropy
FALSE
In the case of the biopsy dataset we discussed in class, when we use the ifelse() function to obtain classification results, the higher the cutoff value we set, the better the model performance we may get
FALSE
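A minimal sketch of that cutoff step, assuming a vector prob of predicted malignancy probabilities and a vector actual of true classes (the object and class names are placeholders for the class example):

    # Classify as "malignant" when the predicted probability exceeds the cutoff
    cutoff <- 0.5
    pred_class <- ifelse(prob > cutoff, "malignant", "benign")

    # Raising the cutoff trades false positives for false negatives;
    # it does not automatically improve overall performance
    table(pred_class, actual)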
Linear discriminant boundary uses attributes recursively to classify the data
FALSE
There is a target variable when we perform market basket analysis
FALSE
Transactional data displays the same format as the dataset we used for decision trees
FALSE
Tree-structured predictive models take into account multiple attributes (input variables) all at once, via the use of a mathematical formula
FALSE
When we customize the tuning process, we should set selectionFunction = "best" to choose the simplest candidate model
FALSE
What is the preferred method when pruning a decision/classification tree?
Post-pruning
Clustering attempts to segment the training dataset based on the input features/variables in order to make instances in the same cluster as _______________ as possible
Similar
Explain the basic difference between supervised and unsupervised data mining
Supervised data mining has a clear and specific target variable that we are interested in or trying to predict. Unsupervised data mining has no such target variable; it looks for patterns or groupings in the data, with no guarantee that the results are useful or have true meaning.
A decision tree model partitions the instance space into regions of similar instances via perpendicular decision boundaries (horizontal and vertical boundaries)
TRUE
A predictive model is a sort of formula to estimate the unknown value of interest, which we often call "the target"
TRUE
A training dataset should be as similar as possible to the original dataset in terms of class distribution in target variable
TRUE
An "objective (loss) function" measures the amount of classification error a model has for a given training dataset
TRUE
C5.0 uses entropy to identify the best decision tree splitting candidate
TRUE
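A minimal sketch of growing such a tree, assuming the C50 package and hypothetical credit_train/credit_test data frames with a factor target called default:

    library(C50)                                       # assumption: C50 package is installed
    # Grow a classification tree; entropy-based information gain picks each split
    credit_model <- C5.0(default ~ ., data = credit_train)
    # Predict classes for the hold-out data
    credit_pred <- predict(credit_model, credit_test)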
Classification models attempt to predict which class an instance/individual in a population belongs to
TRUE
Data mining is the application of various analytical techniques to find useful knowledge, patterns and relationships among data
TRUE
Every "leaf node" in a classification/decision tree represents a segment of the population, and the attributes and their values along the tree path give the characteristics of the segment
TRUE
Finding the features that differentiate customers into different groups is an example of an unsupervised learning task (think clustering)
TRUE
In ROC curves, the closer the curve is to the perfect classifier, the better it is at identifying positive values
TRUE
In a classification/decision tree induction (generation) process, the next attribute added is the one with the largest increase in information gain value
TRUE
In a classification/decision tree, the root node represents the application of segmentation of the dataset by the attribute with the highest information gain value
TRUE
In supervised segmentation, informative attributes increase model accuracy
TRUE
In the confusion matrix, a false negative occurs when a classifier predicts an instance as negative when it is a positive
TRUE
In the context of a classification/decision tree, an interior node represents a testing of an attribute
TRUE
In the context of a classification/decision tree, every instance/data point will correspond to one and only one path ending at a leaf node
TRUE
In the real-world, model developers often use several model performance tools (graphical and/or numerical) to choose a best model
TRUE
Logistic regression estimates the probability of class membership for a categorical target variable
TRUE
Overfitting occurs when a model learns the training data (nearly) perfectly but cannot be generalized to a new dataset
TRUE
Pruning is a technique for reducing the complexity of a decision tree model
TRUE
Random forests focus only on ensembles of decision trees
TRUE
The Support Vector Machine (SVM) approach classifies instances by finding the widest possible bar that fits between the points of two different classes
TRUE
The kappa statistic indicates agreement between a model's predictions and the true values
TRUE
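A small sketch of computing kappa from a confusion matrix, using the standard definition kappa = (observed agreement - chance agreement) / (1 - chance agreement); actual and predicted are hypothetical truth/prediction vectors:

    cm  <- table(actual, predicted)
    p_o <- sum(diag(cm)) / sum(cm)                     # observed agreement (accuracy)
    p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # agreement expected by chance
    kappa_stat <- (p_o - p_e) / (1 - p_e)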
The support of an itemset measures how frequently it occurs in the data, while a rule's confidence is a measurement of its predictive power or accuracy
TRUE
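Concretely, support(X) = transactions containing X / total transactions, and confidence(X -> Y) = support(X and Y) / support(X). A sketch of mining rules above given thresholds, assuming the arules package and a hypothetical groceries transactions object (the thresholds are illustrative only):

    library(arules)                                    # assumption: arules package
    groceryrules <- apriori(groceries,
                            parameter = list(support = 0.006,
                                             confidence = 0.25,
                                             minlen = 2))
    inspect(sort(groceryrules, by = "lift")[1:5])      # look at the five highest-lift rules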
The test dataset is used to evaluate the model in order to judge how well its characterization of the training data generalizes to new, unseen cases
TRUE
What is the problem with the holdout method (train-test split), and how should we address this problem?
With a single train-test split we can run into an overfitting problem, where the training accuracy is higher than the test accuracy and the performance estimate depends on that one split. To address this, we perform multiple splits, breaking the sample into folds (cross-validation) and averaging the results.
What should we set for the "family" parameter when performing logistic regression using glm() if we have a two-class categorical response?
family = binomial
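A minimal sketch of a two-class logistic regression and class-probability predictions, using hypothetical credit_train/credit_test data frames with a target called default:

    # Fit logistic regression for a two-class target
    logit_model <- glm(default ~ ., data = credit_train, family = binomial)
    # Predicted probabilities of class membership on the test set
    prob <- predict(logit_model, newdata = credit_test, type = "response")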
A fitting graph plots
generalization performance vs. model complexity
Information gain is used to
measure the change in entropy due to any amount of new information being added
A "measure of purity" known as entropy...
measures the impurity of a set
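A small sketch of both ideas in R, using the standard formulas (entropy = -sum of p_i * log2(p_i); information gain = parent entropy minus the weighted entropy of the children):

    # Entropy of a set, given its class proportions p (zero proportions contribute nothing)
    entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }

    # Information gain of a split: 'parent' is a vector of class proportions,
    # 'children' is a list of such vectors, 'weights' are the fractions of
    # instances sent to each child
    info_gain <- function(parent, children, weights) {
      entropy(parent) - sum(weights * sapply(children, entropy))
    }

    # Example: a 50/50 parent split into two much purer children
    info_gain(c(0.5, 0.5), list(c(0.9, 0.1), c(0.1, 0.9)), c(0.5, 0.5))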
The goal of Laplace "correction" is to
reduce the influence of segments (leaf nodes) with only a few instances
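A sketch of the two-class Laplace-corrected probability estimate, (n_c + 1) / (n + 2), as it is commonly presented (an assumption; the course may use a slightly different form). It pulls estimates from small leaves toward 0.5:

    # n_c = instances of the class in the leaf, n = total instances in the leaf
    laplace_estimate <- function(n_c, n) (n_c + 1) / (n + 2)

    laplace_estimate(2, 2)    # raw frequency would say 1.00; corrected estimate is 0.75
    laplace_estimate(20, 20)  # with more evidence the correction matters less: about 0.95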
Which function can we use for automated parameter tuning to search for optimal models using a choice of evaluation methods and metrics?
train()
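A sketch of automated tuning with caret's train(), assuming hypothetical credit data with a default target; the method, metric, and fold count are illustrative choices:

    library(caret)                                     # assumption: caret package
    ctrl <- trainControl(method = "cv", number = 10,
                         selectionFunction = "oneSE")  # "oneSE" favors the simplest good model
    set.seed(123)
    m <- train(default ~ ., data = credit, method = "C5.0",
               metric = "Kappa", trControl = ctrl)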