Data Mining Exam 1: Lecture 4

A drawback of using a validation set is that less data is available for training. True or False?

True

Which of the following options is/are true for K-fold cross-validation?
A. Increasing K increases the time required to cross-validate the result.
B. Higher values of K give higher confidence in the cross-validation result than lower values of K.
C. If K = N, where N is the number of observations, it is called leave-one-out cross-validation.
D. All of the above.

D. All of the above

Validation Set

Divide training data into 2 parts: Training set & Validation Set
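
The split described on this card can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `train_validation_split`, the 20% hold-out fraction, and the fixed seed are all assumptions chosen for the example.

```python
import random

def train_validation_split(records, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the (hypothetical) split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    # The first n_validation records are held out; the rest are used for training.
    return shuffled[n_validation:], shuffled[:n_validation]

train_set, validation_set = train_validation_split(range(10), validation_fraction=0.2)
```

Every record held out for validation is one the model cannot train on, which is the drawback noted on the first card.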

A classifier trained on less training data is less likely to overfit. True or False?

False

A decision tree is learned by minimizing information gain. True or False?

False

Training error is more important than test error when evaluating the performance of a model. True or False?

False

Identifying the best attributes in a decision tree:

Use the ID3 heuristic, which chooses the attribute to split on based on entropy

Overfitting

When the model is too complex, training error is small but test error is large

Underfitting

When the model is too simple, both training and test errors are large

Ensemble Classifiers List:

- Boosting
- Bagging
- Random Forests

Base Classifiers List:

- Decision tree based methods
- Rule-based methods
- Nearest-neighbor
- Neural Networks
- Naive Bayes and Bayesian Belief Networks
- Support Vector Machines

Classification Errors include

- Training errors (apparent errors)
- Test errors (errors committed on the test set)
- Generalization errors (expected error of a model over a random selection of records from the same distribution)

Cross Validation

- Partition the data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining one
- Leave-one-out: k = n
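
The partitioning above can be sketched as a generator that yields index lists for each fold. This is an illustrative sketch only: the name `k_fold_indices` is hypothetical, and the records are assumed to be already shuffled, so folds are taken as contiguous blocks.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    indices = list(range(n))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                 # the held-out partition
        train = indices[:start] + indices[start + size:]   # the other k-1 partitions
        yield train, test
        start += size

folds = list(k_fold_indices(n=6, k=3))          # 3-fold on 6 records
leave_one_out = list(k_fold_indices(n=4, k=4))  # k = n: each test fold has one record
```

Each record appears in exactly one test fold. Because a model is trained once per fold, larger k means more training runs, which is why the time cost grows with k.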

What is a decision tree?

- An inductive learning task: using particular facts to make more generalized conclusions
- A predictive model based on a branching series of Boolean tests

Below are the 8 actual values of a target variable: [0,0,0,1,1,1,1,1]. What is the correct way to compute the entropy of the target variable?

B. -5/8 log2(5/8) + (-3/8 log2(3/8))
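
As a quick check of this answer, the same calculation can be written as a short function. This is a sketch; the function name `entropy` and the use of `collections.Counter` are choices made for the example.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(values)
    # Sum -p * log2(p) over the observed classes, where p is the class frequency.
    return sum(-(count / total) * log2(count / total)
               for count in Counter(values).values())

target = [0, 0, 0, 1, 1, 1, 1, 1]
h = entropy(target)  # -5/8 log2(5/8) - 3/8 log2(3/8), about 0.954 bits
```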

Which one of the following statements regarding decision trees is NOT correct?
A. A decision tree is an inductive learning task that uses particular facts to make more generalized conclusions.
B. Decision trees work more effectively with continuous attributes.
C. The time for building a tree may be higher than for another type of classifier.
D. The trees may suffer from error propagation.

B. Decision trees work more effectively with continuous attributes.

Classification Techniques include

- Base Classifiers
- Ensemble Classifiers

Decision trees can be represented as rules instead of graphically

However, the rule form may be much harder to read
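
To illustrate the tree-to-rules idea, here is a small sketch that turns each root-to-leaf path into one rule. The nested-dictionary tree format, the function name `tree_to_rules`, and the toy attributes (`weather`, `accident`) are all hypothetical, invented for this example.

```python
def tree_to_rules(node, conditions=()):
    """Return one 'IF ... THEN ...' rule per root-to-leaf path."""
    if not isinstance(node, dict):
        # Leaf: the node is a class label.
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules += tree_to_rules(child, conditions + (test,))
    return rules

# Hypothetical toy tree predicting commute time.
toy_tree = {
    "attribute": "weather",
    "branches": {
        "sunny": "short",
        "rainy": {"attribute": "accident",
                  "branches": {"yes": "long", "no": "medium"}},
    },
}
rules = tree_to_rules(toy_tree)
```

Even this two-level tree produces three rules that repeat the `weather = rainy` test, which hints at why rule lists for larger trees become harder to read.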

Which of the following is NOT true when identifying the best attributes for a decision tree?
A. Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
B. The ID3 heuristic can help to determine the best attribute
C. ID3 splits based on attributes with the highest entropy
D. ID3 must use discrete (or discretized) attributes

C. ID3 splits based on attributes with the highest entropy

Entropy is minimized when

all values of the target attribute are the same. Ex: if we know that commute time will always be short, then entropy = 0

Entropy is

the measure of disinformation, i.e. of uncertainty (disorder) in a set of values

ID3 splits on the attribute with the lowest entropy; the entropy over all values of an attribute is calculated as

the weighted sum of subset entropies
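
The weighted sum can be sketched as follows: group the target labels by attribute value, compute each group's entropy, and weight it by the group's share of the records. The function names and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def split_entropy(attribute_values, labels):
    """Weighted sum of the entropies of the subsets induced by an attribute."""
    subsets = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        subsets[value].append(label)
    total = len(labels)
    # Each subset's entropy is weighted by the fraction of records it contains.
    return sum(len(s) / total * entropy(s) for s in subsets.values())

# Hypothetical attribute with two values, over the 8-record target used earlier.
attribute = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
weighted = split_entropy(attribute, labels)  # 0.5 * H([0,0,0,1]) + 0.5 * H([1,1,1,1])
```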

Entropy is maximized when

there is an equal chance of all values for the target attribute (i.e. the result is totally random and equally likely)
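
The two extremes (entropy of 0 when the outcome is certain, maximum entropy when all outcomes are equally likely) can be checked numerically. This is a small sketch; the function name `entropy_from_probabilities` is hypothetical.

```python
from math import log2

def entropy_from_probabilities(probabilities):
    """Shannon entropy (in bits) of a discrete distribution; zero terms are skipped."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

certain = entropy_from_probabilities([1.0])          # one outcome always occurs
fair_coin = entropy_from_probabilities([0.5, 0.5])   # two equally likely outcomes
four_way = entropy_from_probabilities([0.25] * 4)    # four equally likely outcomes
skewed = entropy_from_probabilities([0.9, 0.1])      # two outcomes, unequal chances
```

For n equally likely values the maximum is log2(n) bits, so a fair coin gives 1 bit and four equally likely values give 2 bits; any skewed distribution over the same values falls below that maximum.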

We can also measure information gain

which is the reduction in entropy produced by a split: the lower the weighted entropy after the split, the higher the information gain
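
Information gain is commonly defined as the entropy of the target before a split minus the weighted entropy after it. The sketch below uses that standard definition to compare two candidate attributes; the attribute names and data are hypothetical, and this is an illustration rather than a full ID3 implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    after = 0.0
    for value in set(attribute_values):
        subset = [l for v, l in zip(attribute_values, labels) if v == value]
        after += len(subset) / total * entropy(subset)
    return entropy(labels) - after

labels = [0, 0, 0, 1, 1, 1, 1, 1]
candidates = {  # hypothetical attributes
    "informative": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "uninformative": ["x", "y", "x", "y", "x", "y", "x", "y"],
}
gains = {name: information_gain(values, labels) for name, values in candidates.items()}
best = max(gains, key=gains.get)  # the attribute with the highest gain
```

Choosing the attribute with the highest gain is equivalent to choosing the one with the lowest weighted entropy after the split, since the entropy before the split is the same for every candidate.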

