Data Mining Exam 1: Lecture 4
A drawback of using a validation set is that there will be less data available for training. True False
True
Which of the following options is/are true for K-fold cross-validation? A. An increase in K will result in more time required to cross-validate the result. B. Higher values of K will result in higher confidence in the cross-validation result compared to lower values of K. C. If K=N, it is called leave-one-out cross-validation, where N is the number of observations. D. All of the above.
D. All of the above
Validation Set
Divide training data into 2 parts: Training set & Validation Set
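A minimal sketch of this split, assuming scikit-learn is available; X, y, and the 80/20 ratio are illustrative placeholders, not from the lecture:

```python
# Carve a validation set out of the training data
# (X, y are placeholder arrays; the 80/20 split is an assumption).
from sklearn.model_selection import train_test_split

X = [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]
y = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 20% held out for validation
print(len(X_train), len(X_val))  # 8 2 -- less data left for training
```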
A classifier trained on less training data is less likely to overfit. True False
False
A decision tree is learned by minimizing information gain. True False
False
Training error is more important than test error when evaluating the performance of a model. True False
False
Identifying the best attributes in a decision tree:
We use the ID3 heuristic, which chooses the attribute to split on based on entropy
Overfitting
When the model is too complex, training error is small but test error is large
Underfitting
When the model is too simple, both training and test errors are large
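A hedged illustration of both cards above, assuming scikit-learn: a depth-1 stump is too simple and may underfit, while an unbounded tree can memorize the training set and overfit. The synthetic dataset and the two depths are assumptions for demonstration only:

```python
# Compare train/test accuracy for a too-simple and a too-complex tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, None):  # depth=1 tends to underfit; no limit can overfit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```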
Ensemble Classifiers List:
- Boosting
- Bagging
- Random Forests
Base Classifiers List:
- Decision tree-based methods
- Rule-based methods
- Nearest-neighbor
- Neural networks
- Naive Bayes and Bayesian belief networks
- Support vector machines
Classification Errors include
- Training errors (apparent errors)
- Test errors (errors committed on the test set)
- Generalization errors (expected error of a model over a random selection of records from the same distribution)
Cross Validation
Partition the data into k disjoint subsets.
- k-fold: train on k-1 partitions, test on the remaining one
- Leave-one-out: k = n
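A sketch of k-fold cross-validation with scikit-learn; k=5 and the iris dataset are assumptions (setting n_splits equal to the number of records would give leave-one-out, i.e. k = n):

```python
# 5-fold cross-validation of a decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())  # average accuracy across the 5 folds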
What is a decision tree?
- An inductive learning task: using particular facts to make more generalized conclusions
- A predictive model based on a branching series of Boolean tests
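As a sketch of "a branching series of Boolean tests," here is a hypothetical hand-written tree for predicting commute time; the attributes leave_hour and raining are invented for illustration:

```python
# Each branch is a Boolean test; each leaf is a prediction.
def predict_commute(leave_hour, raining):
    if leave_hour < 8:   # test 1: leave early?
        return "short"
    elif raining:        # test 2: bad weather?
        return "long"
    else:
        return "medium"

print(predict_commute(7, False))  # short
```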
Below are the 8 actual values of a target variable. [0,0,0,1,1,1,1,1] What is the correct way to compute the entropy of the target variable?
B. -(5/8) log2(5/8) - (3/8) log2(3/8)
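A quick worked check of this answer in Python; the entropy helper is a generic sketch, not lecture code:

```python
# Entropy of [0,0,0,1,1,1,1,1]: three 0s and five 1s.
from math import log2

def entropy(labels):
    n = len(labels)
    probs = [labels.count(v) / n for v in set(labels)]
    return -sum(p * log2(p) for p in probs)

print(entropy([0, 0, 0, 1, 1, 1, 1, 1]))  # ~0.954
```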
Which one of the following statements regarding decision trees is NOT correct? A. Decision Tree is an inductive learning task that uses particular facts to make more generalized conclusions. B. Decision trees work more effectively with continuous attributes. C. The time for building a tree may be higher than another type of classifier. D. The trees may suffer from error propagation.
B. Decision trees work more effectively with continuous attributes.
Classification Techniques include
- Base Classifiers
- Ensemble Classifiers
Decision trees can be represented as rules instead of graphically
But this may be much harder to read
Which of the following is NOT true when identifying the best attributes for Decision Tree? A. Choose the best attribute(s) to split the remaining instances and make that attribute a decision node B. ID3 heuristic can help to determine the best attribute C. ID3 splits based on attributes with the highest entropy D. ID3 must use discrete (or discretized) attributes
C. ID3 splits based on attributes with the highest entropy
Entropy is minimized when
all values of the target attribute are the same. Ex: if we know that commute time will always be short, then entropy = 0
Entropy is
a measure of disorder (uncertainty) in the data
ID3 splits on the attribute with the lowest entropy, and we calculate the entropy across all values of an attribute as
the weighted sum of subset entropies
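A sketch of the weighted sum, using an invented binary attribute over the 8 labels from the earlier card; entropy() is the same generic helper as in the sketch above:

```python
# Weighted entropy of a split: each subset's entropy, weighted by its size.
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

def split_entropy(attribute_values, labels):
    n = len(labels)
    total = 0.0
    for v in set(attribute_values):
        subset = [lbl for a, lbl in zip(attribute_values, labels) if a == v]
        total += (len(subset) / n) * entropy(subset)  # weight by subset size
    return total

attr = ["a", "a", "a", "a", "b", "b", "b", "b"]  # hypothetical attribute
lab  = [0, 0, 0, 1, 1, 1, 1, 1]
print(split_entropy(attr, lab))  # ~0.406, lower than entropy(lab) ~0.954
```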
Entropy is maximized when
there is an equal chance of all values for the target attribute (i.e. the result is totally random and equally likely)
We can also measure information gain
which moves inversely with entropy: information gain = entropy of the parent node - weighted entropy of the subsets after the split, so the lowest-entropy split is also the highest-gain split
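Continuing the sketch above (this reuses entropy(), split_entropy(), attr, and lab from the previous block):

```python
# Gain = parent entropy minus weighted split entropy, so minimizing the
# split entropy is the same as maximizing the information gain.
def information_gain(attribute_values, labels):
    return entropy(labels) - split_entropy(attribute_values, labels)

print(information_gain(attr, lab))  # ~0.954 - 0.406 = 0.548
```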