CIS 412: Quiz 1, Quiz 2, and Attendance Questions
Select ALL the statements that are TRUE about model accuracy. - Accuracy is misleading with imbalanced data - Accuracy determines the proportion of actual negatives that are correctly identified - Accuracy measures true positive rate - Accuracy doesn't make distinctions between false positives and false negatives
- Accuracy is misleading with imbalanced data - Accuracy doesn't make distinctions between false positives and false negatives
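A small sketch of why accuracy can mislead on imbalanced data. The 95/5 class split and the always-negative "model" are hypothetical values chosen for illustration, not data from the course.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical imbalanced data: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority (negative) class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)

# Recall on the positive class: share of actual positives that were found.
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_pos / sum(y_true)

print(acc, recall)
```

Accuracy looks high here even though the model never finds a positive instance, and the single accuracy number does not say whether the errors are false positives or false negatives.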
SVM uses a loss function known as hinge loss. Select ALL the following statements that are TRUE about hinge loss. - Hinge loss equals zero when a negative instance is on the negative side of the boundary - Hinge loss increases linearly with the distance when instances are on the wrong side of the boundary and beyond the margin - Hinge loss becomes positive when a negative instance is on the correct side of the margin - The farther away the instances are from the separating boundary, the less the loss
- Hinge loss equals zero when a negative instance is on the negative side of the boundary - Hinge loss increases linearly with the distance when instances are on the wrong side of the boundary and beyond the margin
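The two true statements can be checked with a minimal sketch of hinge loss. It assumes the common textbook form max(0, 1 - y*f(x)) with labels in {-1, +1} and the margin at 1; the function name and the sample decision values are illustrative.

```python
def hinge_loss(y, fx):
    """Hinge loss for a label y in {-1, +1} and a decision value fx = f(x)."""
    return max(0.0, 1.0 - y * fx)

# Negative instance well inside the negative side: no loss.
loss_correct = hinge_loss(-1, -2.0)

# Negative instance on the wrong (positive) side: loss grows linearly
# as the instance moves farther from the boundary.
loss_wrong_near = hinge_loss(-1, 1.0)
loss_wrong_far = hinge_loss(-1, 2.0)

print(loss_correct, loss_wrong_near, loss_wrong_far)
```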
The population, with 30 instances, shown below is considered a ______ class problem
2
Given the visual plot below, select the size (number) of nodes to be included in the optimal tree model.
20
What does the following figure describe?
5 fold cross validation
How many steps are in a CRISP-Data Mining cycle/process?
6
predictive model
A formula for estimating the target variable
Which of the following is NOT true regarding a Classification/Decision Tree? - All the variables in the decision tree model must be categorical - the target variable for a classification decision tree cannot be numeric - decision trees for classification recursively perform IG based attribute selection - decision tree is one of the methods that are used for classification
All the variables in the decision tree model must be categorical
Which of the following is NOT part of the data format spectrum in data mining? - unstructured visual data - unstructured textual data - structured numeric data - Big Data
Big Data
In the Classification/Decision Tree shown below, which variable has the highest information gain?
Body shape
Which of the following is NOT a characteristic of Data Mining? -DM extracts useful information and knowledge from large volumes of data by following a well-defined process - DM revolves around data - DM helps make decisions based on intuition
DM helps make decisions based on intuition
A training set is used to train and evaluate a data mining model.
False
After we have added a new, hypothetical instance (star in right chart), we can conclude that Support Vector Machines appear to be more prone to over-fitting than Logistic Regression.
False
Based on the following 10-fold Cross-Validation results for the MegaTelCo data, a Data Analyst would recommend using the Logistic Regression model based on the model performance.
False
Determining which customers are most likely to leave a business (or a social media site) is unsupervised learning.
False
Estimating probability of default (or "write-off") for a new loan application is a regression problem.
False
In selecting informative attributes, we should look for attributes that produce subsets with highest entropy.
False
Linear discriminant boundary uses attributes recursively to classify the data.
False
SVM can only deal with linearly separable data
False
Tree-structured predictive model creates a linear decision boundary to separate data points into different classes.
False
We expect to have as many hidden layers as possible to get a good model performance on test set.
False
In order to obtain a good model accuracy, we want to minimize which of the following? - True positive - True negative - True positive and false positive - False positive and false negative
False positive and false negative
A Fitting Graph plots ... - Generalization Performance vs. Model Complexity - True Positive rate vs. False Positive rate - True Positive rate vs. False Negative rate - Generalization Performance vs. Size of Training Dataset
Generalization Performance vs. Model Complexity
Determine the best initial attribute to segment the set of Stick-Figures shown below. Choose between two attributes: Head Shape and Body Shape. Entropy (parent/population) = 0.954 Show all your work. You may use the sample log table and simple calculator in your computer. (tips: create two tree structures based on Head Shape and Body Shape respectively, calculate information gain for each of the attributes and compare their values) IG = Entropy(parent) - [ p(c1) * Entropy(c1) + p(c2) * Entropy(c2) + ... ]
Head shape: entropy1 = -1/2*log2(1/2) - 1/2*log2(1/2) = 1; entropy2 = -4/6*log2(4/6) - 2/6*log2(2/6) = 0.918; IG1 = 0.954 - (2/8*entropy1 + 6/8*entropy2) ≈ 0.016. Body shape: entropy3 = -1/3*log2(1/3) - 2/3*log2(2/3) = 0.918; entropy4 = -1/5*log2(1/5) - 4/5*log2(4/5) = 0.722; IG2 = 0.954 - (3/8*entropy3 + 5/8*entropy4) ≈ 0.159. Body shape has the higher IG, so it is the better initial attribute.
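The same information gain calculation can be sketched in code. The class counts per subset are an assumption read off the fractions in the answer above (a parent of 5 and 3, which matches the stated entropy of 0.954); the function names are illustrative.

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) of a group described by its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Assumed class counts, inferred from the fractions in the worked answer.
parent = [5, 3]
head_children = [[1, 1], [4, 2]]   # subsets of size 2 and 6
body_children = [[1, 2], [4, 1]]   # subsets of size 3 and 5

ig_head = information_gain(parent, head_children)
ig_body = information_gain(parent, body_children)
print(round(ig_head, 3), round(ig_body, 3))
```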
Which of the following CANNOT be a rule derived from the Classification/Decision Tree shown below? - IF(Employed = yes) THEN Class = Not Write-off - IF(Employed = No) AND (Balance < 50k) THEN Class = Not Write-off - IF(Employed = No) AND (Balance >= 50k) AND (Age <45) THEN Class = Write-off - IF(Employed = No) AND (Balance >= 50k) AND (Age >= 45) THEN Class = Write-off
IF(Employed = No) AND (Balance >= 50k) AND (Age <45) THEN Class = Write-off
What do the following two figures describe? Select ALL that apply. - Loss function of SVM - Kernel - Network Topology - Mapping data into a higher dimension
Kernel - Mapping data into a higher dimension
In linear discriminant boundaries, we select the best possible line to separate the instances based on: - Best model performance - Loss function - AUC value - Estimated weights
Loss Function
When should the growth (generation) of a Classification/Decision Tree be stopped/terminated? Select ALL that apply.
Nodes are pure, there are no more instances to process
In the context of decision tree, if a categorical attribute to be used for segmentation of a dataset has m possible values, then the dataset will be segmented into _________ subsets/subgroups. - m + 1 - m / (m + n) - m - 1 - None of the above
None of the above (m)
Which of the following steps is NOT part of a Cross-Validation process? - Performing multiple splits - Systematic swapping - Obtaining new data sets - Computing the mean of multiple estimated performances
Obtaining new data sets
Select ALL the statements that are TRUE about overfitting: - Overfitting may be caused by the lack of representative instances in the training data - Holdout method can help detect the issue of overfitting - All data mining procedures have some tendency to overfit - Overfitting can be avoided by applying multiple data mining models to the dataset
Overfitting may be caused by the lack of representative instances in the training data - Holdout method can help detect the issue of overfitting - All data mining procedures have some tendency to overfit
Theoretically which is the preferred method when pruning a decision/classification tree?
Post-pruning
Which of the following components is NOT needed for implementing the Expected Value framework? - The Cost/Benefit Matrix - Probability based Confusion Matrix - The Confusion Matrix - The percentage of targeted instances
Probability based Confusion Matrix
What kind of visualization can we use to show the performance of a classification model at all classification thresholds?
ROC curve
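A rough sketch of how the points on an ROC curve arise from sweeping the classification threshold. The labels and scores are made-up illustrative values, and `roc_points` is a hypothetical helper, not a library function.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, starting at (0, 0), one per distinct threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; everything scored >= thr is called positive.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

Each point uses only the true positive rate and false positive rate, which is why the curve summarizes performance across all thresholds at once.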
Which of the following is NOT true about the ROC curve? - The dashed line represents a random guessing strategy/model - ROC graphs are dependent on the class proportions as well as the costs and benefits - ROC curves are not an intuitive visualization for business stakeholders - AUC measures the area under the ROC curve
ROC graphs are dependent on the class proportions as well as the costs and benefits
How much will this customer use the service?
Regression
Which of the following does NOT describe a Support Vector Machine model (SVM)? - SVMs can estimate class membership probability - SVMs are based on supervised learning - SVMs use the Hinge Loss function - SVMs can only classify linearly separable data
SVMs can only classify linearly separable data
Which of the following is NOT considered an application of Data Mining? - fraud detection - prediction of loan repayment - Summary of the population in Arizona - Prediction of membership of a social media site
Summary of the population in Arizona
Briefly explain the basic difference between supervised and unsupervised data mining.
Supervised data mining has a specific, quantifiable target that we are interested in or trying to predict. It has two main subgroups: classification and regression. Unsupervised data mining does not specify a purpose or target for the grouping, and there is no guarantee that the results will be meaningful or useful for any particular purpose. The difference, therefore, is that supervised methods have a specific, quantifiable target and unsupervised methods do not.
Which of the following Confusion Matrices is most preferable (i.e. a good model)? [Note: x > 0 and y > 0.] (TP: True Positives, TN: True Negatives, FN: False Negatives, FP: False Positives) TP: 0 | FP: x FN: y | TN: 0 TP: 0 | FP: x FN: 0 | TN: y TP: 0 | FP: 0 FN: y | TN: 0 TP: x | FP: 0 FN: 0 | TN: y
TP: x | FP: 0 FN: 0 | TN: y
Given a Confusion Matrix represented by (see image) Calculate its True Positive Rate (i.e., recall ) and True Negative Rate (i.e., specificity). Provide your answers in fraction form (e.g. x/y). (Clearly show your work.)
TPR = TP/(TP+FN) = 65/(65+9) = 65/74 ≈ 0.878; TNR = TN/(TN+FP) = 52/(52+4) = 52/56 = 13/14 ≈ 0.929
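The same rates can be computed in fraction form with a short sketch. The four counts are taken from the worked answer above; the variable names are illustrative.

```python
from fractions import Fraction

# Confusion-matrix counts as used in the worked answer.
tp, fn, tn, fp = 65, 9, 52, 4

tpr = Fraction(tp, tp + fn)  # recall / sensitivity
tnr = Fraction(tn, tn + fp)  # specificity; Fraction reduces 52/56 automatically

print(tpr, round(float(tpr), 3))
print(tnr, round(float(tnr), 3))
```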
In the mathematical equation below, representing a linear boundary, variable "y" represents the _________. y = b + w1x1 + w2x2 + w3x3 + ... - Input variable - Target variable - y-intercept - Slope
Target Variable
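As a minimal sketch, the linear equation can be evaluated directly. The intercept, weights, and input values below are hypothetical numbers chosen only to show how y is produced from the x's.

```python
def linear_score(b, weights, x):
    """Compute y = b + w1*x1 + w2*x2 + ... for one instance."""
    return b + sum(w * xi for w, xi in zip(weights, x))

# Hypothetical intercept, weights, and input values.
b = 1.0
weights = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]

y = linear_score(b, weights, x)  # the estimated target value
print(y)
```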
Model induction
The creation of models from data
Based on the figure below, which of the following is NOT valid in terms of basic characteristics of a neural network? - x1, x2, and x3 each process a single feature (predictor) in the dataset - Feature's value will be transformed by the corresponding node's activation function. - There is one hidden layer in this figure - The number of input nodes is predetermined by the number of features in the input data
There is one hidden layer in this figure
One reason why logistic regression uses the log-odds is: - It's easier to calculate - To make the estimated value (f(x)) range from -∞ to +∞ - We need negative values to present the possibility - None of the above are correct
To make the estimated value (f(x)) range from -∞ to +∞
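A short sketch of the log-odds transformation and its inverse, assuming the usual definitions log(p / (1 - p)) and 1 / (1 + e^(-f(x))). The function names are illustrative.

```python
from math import exp, log

def log_odds(p):
    """Map a probability in (0, 1) to the whole real line."""
    return log(p / (1 - p))

def sigmoid(fx):
    """Inverse of log_odds: map any real value back to a probability."""
    return 1 / (1 + exp(-fx))

for p in (0.1, 0.5, 0.9):
    fx = log_odds(p)
    # Probabilities below 0.5 give negative values, above 0.5 positive values.
    print(p, round(fx, 3), round(sigmoid(fx), 3))
```

Because the log-odds is unbounded in both directions, a linear function f(x) can be fitted to it without ever producing an invalid probability.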
What are training data and test data used for when we perform data mining tasks?
Training data is the input data for model induction. This means it is used to train the model so that it learns patterns in the data and can produce outputs. The training set is a certain percentage of the overall data set. Test data is held out from training and used to see how accurate the model is after it is trained, which helps reveal whether overfitting is happening. It is typically a smaller portion of the data than the training set and will show if there are issues with the model.
"Do my customers form natural groups?" is an example of clustering.
True
A predictive model is a sort of formula to estimate the unknown value of interest, which we often call "the target".
True
An "Objective (Loss) Function" measures the amount of classification error a model has for a given training dataset.
True
Classification models attempt to predict which class an instance in a population belongs to.
True
Data Mining is the application of various analytical techniques to find useful knowledge, patterns and relationship among data.
True
Deciding which customers are more likely to leave is an example of a classification problem.
True
Finding the features that differentiate customers into different groups is an example of an unsupervised learning task. [Hint: Think clustering!]
True
In a Classification/Decision Tree induction (generation) process, the next attribute added is the one with the largest increase in Information Gain value.
True
In a Classification/Decision Tree, the root node represents the attribute with the highest information gain value
True
In supervised segmentation, informative attributes increase model accuracy.
True
In the confusion matrix, a false negative occurs when a classifier predicts an instance as negative when it is a positive.
True
In the context of a Classification/Decision Tree, every instance/data point will correspond to one and only one path ending at a leaf node
True
Logistic regression estimates the probability of class membership over a categorical class
True
Over-fitting occurs when a model fits the training data perfectly but cannot be generalized to new data.
True
The Support Vector Machine (SVM) approach classifies instances by finding the widest possible bar that fits between points of two different classes.
True
When conducting supervised data mining, the values of all attributes except the target variable are known when the model is used.
True
Suppose you are working in a marketing team and trying to advertise a new product to your customers. You developed a classification model to identify the customers who would purchase the new product. If the model predicts that one particular customer is going to purchase this product, you will send him/her a promotional offer so they can purchase this product at a discounted price. To evaluate your classification model, you decided to use the expected value framework as the evaluation metric. To calculate the expected value you had to define the cost/benefit matrix. Given the information below: Profit to sell one product to a customer: $100 Cost to manufacture this product: $50 Cost to target (e.g., send a promotion offer) the customer who will purchase the product: $2 What are the costs (or benefits) associated with the True Positive cell and the False Positive cell in the confusion matrix?
A True Positive means we predict that someone will buy if we send them an offer, and they do buy. A False Positive means we predict that someone will buy if we send them an offer, and they do not buy anything. For a true positive, the $100 is stated as profit, so it is treated as already net of the $50 manufacturing cost; subtracting the $2 targeting cost gives a benefit of 100 - 2 = $98. The value of the false positive cell is -$2, because you lose the $2 spent targeting a customer who did not buy anything.
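To show how these cell values feed the expected value framework, here is a sketch that combines them with a confusion matrix of rates. The $98 and -$2 come from the answer above; the zero values for the FN and TN cells and all four rates are assumptions made up for illustration.

```python
def expected_value(rates, benefits):
    """Sum over confusion-matrix cells of p(cell) * value(cell)."""
    return sum(rates[cell] * benefits[cell] for cell in rates)

# TP and FP values from the answer; FN and TN assumed to be 0
# (no offer is sent, so no targeting cost and no sale is counted).
benefits = {"TP": 98.0, "FP": -2.0, "FN": 0.0, "TN": 0.0}

# Hypothetical cell rates (they must sum to 1).
rates = {"TP": 0.10, "FP": 0.05, "FN": 0.05, "TN": 0.80}

ev = expected_value(rates, benefits)
print(round(ev, 2))  # expected profit per customer under these assumptions
```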
Which of the following is NOT a characteristic of a tree-structured model? - two parents share children - Made up of root, interior nodes, leaf nodes, and branches - every instance always ends up at a leaf node - each branch represents a distinct value of the attribute at that node
Two parents share children
Below is the instance space when we use two attributes (age and balance) to predict write-off or non write-off. Plus signs denote non write-off. Filled dots denote write-off. If a new instance has a balance of 40,000 dollars and his age is 40, which class will he be assigned to?
Write-off
Which attribute did we use to do the first partitioning in the stick figure exercise in order to produce the pure final subgroups?
body shape
Did advertisements influence a consumer to purchase?
causal modeling
How likely is this consumer to respond to our campaign?
classification
Which dataset do we use for Software lab 2 demonstration?
credit data
Which of the following is NOT a step in a typical Data Mining process? - modeling - data understanding - data storage - data preparation
data storage
Which function do we use to get the first several rows in a data frame (suppose the data frame is called df)?
df.head()
information gain
difference in entropy between parent node and weighted sum of child node
Assume the image below is a representation of one leaf node in a classification/decision tree (having only two classes: + and -). Which of the following information can we get from this image? Select ALL that apply. - Information gain - entropy - total number of instances in this segment - all of the above
entropy - total number of instances in this segment
Knowing that the complexity of a model increases as the complexity of its linear equation increases, as well as the relations between model complexity and overfitting, which one of the following linear discriminants is most prone to over-fitting a training data set? - f(𝑥) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5(x1/x5) + w6x4^2 - f(𝑥) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5(x1/x5) - f(𝑥) = w0 + w1x1 + w2x2 + w3x3 + w4x4 - f(𝑥) = w0 + w1x1 + w2x2 + w3x3
f(𝑥) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5(x1/x5) + w6x4^2
Which of the following is NOT used to compare model performance? - Lift curve - fitting graph - ROC curve - Cumulative profit curve
fitting graph
entropy
how mixed up classes are in a group
In k-fold Cross-Validation method, what is used for training a model? - k + 1 folds - k folds - k - 1 folds - None of the above
k - 1 folds
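A bare-bones sketch of how k-fold cross-validation assigns folds: each round holds out one fold for testing and trains on the remaining k - 1. The helper name and the 10-instance, 5-fold example are illustrative, and the data is not shuffled here to keep the sketch short.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_instances))
    # Deal the indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Training data is the union of the other k - 1 folds.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_splits(10, 5))
for train_idx, test_idx in splits:
    print(len(train_idx), "train /", len(test_idx), "test")
```

The performance estimate is then the mean of the k per-round scores, and no new data set is collected at any point.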
Which of the following is NOT the data approach to deal with imbalanced data? - Oversampling - K fold cross validation - Under-sampling - None of the above
k fold cross validation
logistic regression
log odds
What items are commonly purchased together?
market basket analysis
Information gain: - measures the change in entropy due to any amount of new information being added - is only used to calculate entropy - is a measure of correlation between numeric variables - is prone to over-fitting
measures the change in entropy due to any amount of new information being added
A "measure of purity" known as entropy ...
measures the impurity of a set
Is there a single evaluation metric that is "right" for every data-mining task?
no
Regression is distinguished from classification by
numerical target variable
regression
numerical target variable
Calculate the entropy value for the set shown below. Show all your work. You may use the log table attached and the calculator in your computer. (Hint: show how you calculate p1, p2 to make sure you get some points) Entropy = - [p1 * log(p1) + p2 * log(p2) + ... ]
p1 = proportion of negatives = 3/4; p2 = proportion of positives = 1/4; entropy = -[3/4*log2(3/4) + 1/4*log2(1/4)] = -[3/4*(-0.415) + 1/4*(-2)] = -[-0.31125 + (-0.5)] = -[-0.81125] = 0.81125
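The hand calculation can be cross-checked with a few lines of code. This sketch takes class proportions directly; the function name is illustrative, and the small difference from 0.81125 comes from the rounded log table value.

```python
from math import log2

def entropy(proportions):
    """Entropy (base 2) from class proportions that sum to 1."""
    return -sum(p * log2(p) for p in proportions if p > 0)

h = entropy([3 / 4, 1 / 4])   # the set from the question
h_mixed = entropy([0.5, 0.5])  # maximally mixed two-class set

print(round(h, 4), h_mixed)
```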
A Fitting Graph plots
performance vs. model complexity
The goal of Laplace "correction" is to ... - increase the influence of segments (leaf nodes) with only a few instances - reduce the influence of segments (leaf nodes) with only a few instances
reduce the influence of segments (leaf nodes) with only a few instances
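A small sketch of the Laplace correction for a two-class leaf, assuming the common binary form (n + 1) / (n + m + 2), where n is the count of the class of interest and m the count of the other class. The leaf counts below are hypothetical.

```python
def laplace_estimate(n, m):
    """Laplace-corrected class probability for a binary leaf."""
    return (n + 1) / (n + m + 2)

# Hypothetical leaves that are both "pure" on raw frequency (n / (n + m) = 1.0).
small_leaf = laplace_estimate(2, 0)    # only 2 instances
large_leaf = laplace_estimate(20, 0)   # 20 instances

print(small_leaf, round(large_leaf, 3))
```

The small leaf is pulled much further toward 0.5 than the large one, which is how the correction tempers estimates based on only a few instances.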
Test data
the data for model evaluation
What does the "k" mean in k-means cluster analysis?
the number of clusters
Target variables
the unknown value
Which attribute should be used to create the segmentation for the 10-instance facebook sample data?
usage hours