CS170 Midterm 2
What are some reasons a decision tree can overfit data
1) If there are too many branches, some of which may reflect noise or anomalies in the training data 2) This results in poor accuracy on unseen data
What are the disadvantages of decision trees?
1) May suffer from overfitting 2) classifies by rectangular partitioning 3) can be large 4) does not handle streaming data easily
What are the properties of a distance measure?
1) Symmetry: D(A,B) = D(B,A) 2) Constancy of self-similarity: D(A,A) = 0 3) Positivity: if D(A,B) = 0 then A == B 4) Triangle inequality: D(A,C) <= D(A,B) + D(B,C)
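As a sanity check, Euclidean distance (a common choice) satisfies all four properties; a quick sketch on three sample points:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

A, B, C = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)

# 1) Symmetry: D(A,B) == D(B,A)
assert euclidean(A, B) == euclidean(B, A)
# 2) Constancy of self-similarity: D(A,A) == 0
assert euclidean(A, A) == 0
# 3) Positivity: D(A,B) == 0 only if A == B
assert euclidean(A, B) > 0
# 4) Triangle inequality: D(A,C) <= D(A,B) + D(B,C)
assert euclidean(A, C) <= euclidean(A, B) + euclidean(B, C)
```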
What are the 4 major ways that we compare algorithms
1) accuracy 2) speed and scalability 3) robustness 4) interpretability
How do we improve accuracy when we are given X features
1) Remove features, so that we are only left with the most informative ones 2) Create new features by combining old features together
Describe K Fold Cross Validation
1) Split the data into k groups 2) For each unique group: take that group as the hold-out / test data set 3) Take the remaining groups as the training data set 4) Fit a model on the training set and evaluate it on the test set 5) Retain the evaluation score and discard the model 6) Repeat steps 2-5 until every group has served as the test set once 7) accuracy = (# of correct classifications) / (# of instances in our dataset)
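The steps above can be sketched as a minimal k-fold loop; the toy majority-class "classifier" here is only a hypothetical stand-in for a real fit/predict pair:

```python
from collections import Counter

def k_fold_accuracy(data, labels, k, fit, predict):
    # Steps 1-6 above: split into k groups, hold each group out once,
    # fit on the rest, and pool the number of correct test classifications.
    folds = [list(range(i, len(data), k)) for i in range(k)]   # round-robin split
    correct = 0
    for test_idx in folds:
        train_idx = [i for i in range(len(data)) if i not in test_idx]
        model = fit([data[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(predict(model, data[i]) == labels[i] for i in test_idx)
    return correct / len(data)                                 # step 7

# Toy stand-in classifier: always predict the majority training label.
fit = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda model, x: model

acc = k_fold_accuracy(list(range(10)), [0] * 8 + [1] * 2, k=5, fit=fit, predict=predict)
```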
Nearest neighbors is sensitive to irrelevant features, how can we fix this?
1) Use more training data 2) Ask an expert 3) Use statistical tests to determine which features are important and which aren't 4) Use feature subsets (forward selection, backward selection, bidirectional search)
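Option 4 (forward selection) can be sketched as a greedy loop; the `score` function below is a hypothetical stand-in for whatever accuracy estimate (e.g. cross-validation) the search would really use:

```python
def forward_selection(features, score):
    # Greedy forward selection: start with no features and repeatedly add the
    # single feature that most improves the score; stop when nothing helps.
    selected, best = [], score([])
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            return selected
        gains = {f: score(selected + [f]) for f in candidates}
        winner = max(gains, key=gains.get)
        if gains[winner] <= best:
            return selected
        selected.append(winner)
        best = gains[winner]

# Hypothetical score: reward the (assumed) informative features 'x' and 'y',
# with a small penalty per feature selected.
score = lambda s: len([f for f in s if f in {'x', 'y'}]) - 0.1 * len(s)
chosen = forward_selection(['x', 'y', 'z'], score)
```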
How can we estimate the accuracy of a machine learning algo A) By having it guess the label of several instances for which we know the correct label B) Using k-nearest neighbors C) Using cross-fitting D) By folding the data
A, By having it guess the label of several instances for which we know the correct label
Which of the following statements is true about decision tree classifiers? A) Decision trees are prone to overfitting, especially when a tree is particularly deep. B) Decision trees cannot handle categorical features and require all features to be numerical. C) Decision tree classifiers always result in higher prediction accuracy compared to other classification algorithms. D) Decision tree classifiers are not affected by the order or arrangement of features in the dataset.
A, Decision trees are prone to overfitting, especially when a tree is particularly deep
Which of the following statements is TRUE about the Naive Bayes classifier A) Naive Bayes assumes that all features are independent of each other B) Naive Bayes is a non-parametric classification algo C) Naive Bayes is not suitable for text classification tasks D) Naive Bayes requires a large amount of training data to perform well
A, Naive Bayes assumes that all features are independent of each other
Which statements are TRUE about nearest neighbor classifier (may be more than one) A) It is sensitive to irrelevant features and this can be remedied by using more training data B) It is sensitive to noise, but this can be remedied by using the k-nearest neighbors instead of 1 nearest neighbor C) It is sensitive to irrelevant features and this can be remedied by searching over feature subset D) It is NOT sensitive to noise
A,B,C
Which of the following statements are TRUE about overfitting (may be more than one) A) It happens when we try to reduce the classifier's error too much B) It happens when we use complex models to exactly fit the training data C) It happens when we use huge amounts of training data D) Although overfitting achieves lower error on the training data, it will have higher error on future unseen data
A, B, D
Which statement is FALSE about Naive Bayes A) It is called naive because of the feature independence assumption B) It is sensitive to irrelevant features C) A new instance will be classified based on the probability of it belonging to each class D) It is fast and space efficient
B, it is sensitive to irrelevant features
If we have an information gain of 0, is this a bad question or a good question?
A bad question: an information gain of 0 means the answer tells us nothing about the class labels
Which of the following is true about linear classifiers? A) Linear classifiers can only separate data that is linearly separable. B) Linear classifiers are not suitable for high-dimensional datasets. C) Linear classifiers can handle nonlinear relationships in the data. D) Linear classifiers always result in overfitting.
C, Linear classifiers can handle nonlinear relationships in the data
Which statement is TRUE about choosing the right classifier for a dataset A) If classifiers X and Y yield the same accuracy on the dataset, we should choose the more complex classifier B) If on a dataset, classifier X yields 99.9% accuracy and classifier Y yields 99% accuracy, we should choose classifier X. C) If classifiers X and Y yield the same accuracy on the dataset, we should choose the simpler classifier. D) The choice of classifier is not important. The only thing that affects the accuracy is feature selection and generation.
C, if classifiers X and Y yield the same accuracy on the dataset, we should choose the simpler classifier
For the nearest neighbor algorithm, what function does it use to choose the nearest neighbor?
It can use any distance function to measure which neighbor is nearest; Euclidean distance is the most common choice.
How do we select the best attribute to split a mixed node when constructing a decision tree A) We select the attribute that results in more homogeneous clusters/nodes B) We select the attribute that results in a higher information gain C) We select the attributes that results in a greater reduction of entropy D) All of the above
D), all of the above
When solving a problem using machine learning, what should we mostly focus on? A) Finding the most informative features B) Finding meaningful ways to combine features to get a new feature C) Finding the right classifier D) Both finding the most informative features and finding meaningful ways to combine features to get a new feature
D, Both finding the most informative features and finding meaningful ways to combine features to get a new feature
How can we increase the accuracy of a machine learning algo for solving a problem? A) By choosing the right set of features B) by choosing the right classifier C) by using more training data D) by generating new features out of the existing features E) All of the above
E, All of the above
What are the advantages of decision trees?
Easy to understand and easy to generate
What is information gain
Expected reduction in entropy
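A worked sketch using the standard entropy formula H = -Σ p·log2(p): the gain is the parent node's entropy minus the size-weighted entropy of the child nodes after a split.

```python
import math

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions p in this node.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    # Expected reduction in entropy from splitting the parent:
    # IG = H(parent) - sum(|child| / |parent| * H(child)).
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ['+', '+', '-', '-']        # maximally mixed: entropy = 1.0
perfect = [['+', '+'], ['-', '-']]   # pure children -> gain 1.0 (great question)
useless = [['+', '-'], ['+', '-']]   # children as mixed as parent -> gain 0.0
```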
What is pre-pruning?
Halts tree construction early: do not split a node if the split results in a poor goodness measure (below some threshold). Drawback: it is difficult to choose an appropriate threshold.
How long does it take to construct a linear classifier model
Linear time: O(n), where n is the number of instances in our dataset
What is the time to construct a nearest neighbor classifier model
None; nearest neighbor is a lazy learner, so there is no model-construction phase and all the work happens at classification time.
How long does it take to run a greedy feature selection search?
Exhaustive search over all feature subsets takes O(2^n) time (there are 2^n - 1 non-empty subsets); greedy searches such as forward selection only evaluate O(n^2) subsets.
What is the runtime of K-Means
O(KTN), N = Objects, T = iterations, K = Clusters
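A minimal 1-D sketch showing where the K, T, and N factors come from; deterministic "first k points" initialization is assumed here for simplicity, whereas real implementations typically initialize randomly:

```python
def k_means(points, k, iters=10):
    # O(K*T*N): each of the T iterations compares all N points against K centers.
    # 1-D points for brevity; real data would use vectors and a distance function.
    centers = points[:k]                  # simple deterministic initialization
    for _ in range(iters):                # T iterations
        clusters = [[] for _ in range(k)]
        for p in points:                  # N objects ...
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)  # ... x K centers
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep old if empty).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers
```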
How long does it take to use a nearest neighbor classifier
O(n) per query, where n is the number of training instances; not good for large datasets
How do we correct overfitting on a decision tree
Pre-pruning and post-pruning
What is Naive Bayes trying to predict?
The probability of Class C, given our observation O
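A minimal sketch of that prediction rule for categorical features: P(C | O) is proportional to the prior P(C) times the product of per-feature likelihoods, under the naive independence assumption. Laplace (add-one) smoothing is assumed here so an unseen feature value doesn't zero out a class.

```python
from collections import Counter

def naive_bayes_predict(X_train, y_train, x):
    # Score each class C by P(C) * product over features i of P(x_i | C),
    # treating the features as independent given the class ("naive").
    n = len(y_train)
    scores = {}
    for c, count in Counter(y_train).items():
        score = count / n                              # prior P(C)
        rows = [row for row, y in zip(X_train, y_train) if y == c]
        for i, v in enumerate(x):
            matches = sum(1 for row in rows if row[i] == v)
            score *= (matches + 1) / (len(rows) + 2)   # Laplace-smoothed P(x_i | C)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy categorical dataset: (outlook, temperature) -> play?
X_train = [('sunny', 'hot'), ('sunny', 'mild'), ('rain', 'mild'), ('rain', 'cool')]
y_train = ['no', 'no', 'yes', 'yes']
pred = naive_bayes_predict(X_train, y_train, ('sunny', 'hot'))
```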
What does an entropy of 1 mean?
For a two-class problem, an entropy of 1 means the set is maximally mixed (a 50/50 class split), so it takes the maximum effort to label; this is bad.
nearest neighbors is sensitive to outliers, how do we fix this?
Use the k-NN algo: measure the k nearest neighbors instead of the single closest neighbor, and classify our point as the majority class among them. Note: this also makes the algo more robust to noise.
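A sketch of the majority-vote rule; the toy training set includes one deliberately mislabeled point to show k = 3 outvoting an outlier that k = 1 would follow:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (point, label) pairs. Take the k nearest points by
    # Euclidean distance and return the majority label among them, so a
    # single noisy neighbor can no longer flip the prediction on its own.
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'),
         ((0.4, 0.5), 'b'),            # deliberate outlier / noisy label
         ((5, 5), 'b'), ((5, 6), 'b')]
label_k1 = knn_classify(train, (0.5, 0.5), k=1)  # fooled by the outlier
label_k3 = knn_classify(train, (0.5, 0.5), k=3)  # outlier gets outvoted
```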
What is k-fold cross-validation used for?
Used to estimate the accuracy of our classifier
How long does it take to use/test a linear classifier model?
constant time: O(1)
What is data editing
Data editing removes data points from the training set of a nearest neighbor classifier; this is done to speed up the algo.
What is entropy?
The minimum effort needed to label a set of objects; it measures how mixed (impure) the set is.
What does it mean if a dataset is linear separable?
A dataset is linearly separable if it can be perfectly separated by a linear classifier (a line/hyperplane).
What is post-pruning?
Removes branches from a fully grown tree. This produces a sequence of progressively pruned trees; use a test data set to decide which one performs best.