Midterm

Which one of the following statements is NOT true?

Decision trees can easily handle missing data.

Which of the following statements is true with respect to parameterized and instance-based models?

Parameterized models usually have a bounded number of parameters and hence a lower memory footprint.

What is a key difference between Decision Tree Algorithm (JR Quinlan) and Random Tree Algorithm (A Cutler)?

'A' is not true, as both of them can be used as a regression method. 'C' is not true, since the Random Tree Algorithm must pick a random feature to split on. 'D' is obviously not true, and 'B' is definitely a key difference between these algorithms and how each generates the decision tree.

When building a decision tree, one of the most important parts of the algorithm is to pick the feature to split on. Which of the following is correct regarding decision trees and how to pick the feature to split on?

(A) is wrong, since the correlation value of each field does not guarantee the largest information gain. (B) A decision tree is a non-parametric machine learning algorithm. (C) is true. (D) Applying bagging would lengthen the training time.

There are two vectors X, Y in R^n. a, b in R. Which statement below is correct?

(a) RMSE(X, Y) = sqrt(sum((X - Y)^2) / n) (b) RMSE(X, Y) != RMSE(aX + b, Y) (c) corrcoef(X, Y) = (E[XY] - E[X]E[Y]) / (sqrt(E[X^2] - E[X]^2) * sqrt(E[Y^2] - E[Y]^2))
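
A minimal NumPy sketch of these formulas (the vectors X, Y and scalars a, b are made-up illustration data):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical data
    Y = np.array([1.1, 1.9, 3.2, 3.8])
    a, b = 2.0, 5.0

    rmse = np.sqrt(np.mean((X - Y) ** 2))                  # definition in (a)
    rmse_shifted = np.sqrt(np.mean((a * X + b - Y) ** 2))
    print(rmse, rmse_shifted)                              # they differ, consistent with (b)

    # (c): correlation via the expectation form, checked against np.corrcoef
    corr_manual = (np.mean(X * Y) - X.mean() * Y.mean()) / (X.std() * Y.std())
    print(corr_manual, np.corrcoef(X, Y)[0, 1])            # should match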

Which of the following is a classification problem? (i) Predicting housing price at a particular area based on size of the house (ii) Determining whether or not a credit card transaction was a fraud (iii) Recognizing communities within large groups of people in social networks

(i) is a regression problem, since it involves the prediction of a continuous value (housing price) (ii) is a classification problem, since the prediction is a class label (fraud or not) (iii) is a form of unsupervised learning (clustering). The output is a set of unlabelled clusters, each representing a community of similar people.

List the learning algorithms in terms of the time to train the model, from fastest to slowest

- KNN has the fastest training time, as we just need to add the new data points to our model.
- Linear regression takes longer than KNN, as it needs to compute a line or curve that best fits the data.
- Linear regression is still faster than a decision tree, since for each split in the tree we need to compute the best feature to split on as well as the split value, and then recursively build the rest of the tree from there for all of the remaining data.

Which of the following three statements regarding overfitting is/are true:

1) For a KNN learner, decreasing K is likely to increase overfitting.
2) For a parametric regression learner, increasing d (the degree of the equation) is likely to decrease overfitting.
3) For a bootstrap aggregating learner, implementing adaptive boost has no impact on overfitting when compared with using simple bagging.

(1) is a true statement; as K is decreased, there are fewer and fewer "nearest neighbors" to consider for a given input at query time, and taking the mean of fewer neighbors increases the chance of overfitting. (2) is a false statement; increasing d means that the model equation curve will end up having more "twists and turns" in order to fit the data more closely, thus increasing the chance of overfitting. (3) is a false statement; implementing AdaBoost may result in a greater likelihood of overfitting than simple bagging, provided that m (the number of bags) grows large enough. This is one of the possible drawbacks of AdaBoost, which tries to make adjustments in the creation of each successive bag based on the sets of data that had particularly poor results in the previous bags.

Between KNN, Linear Regression, and Decision Trees, select which machine learning method you would use for each for the following scenarios: 1) to minimize query time, 2) is the easiest method to incrementally add new training data to an existing model, 3) to minimize the space required to store the model, 4) to minimize training time.

1) Linear Regression is the fastest querying method, as querying is just an O(1) multiplication and addition of values. Decision trees require an O(log n) search, and KNN requires comparing distances to many points.
2) KNN is the easiest learner to add data to, as the new data point is just added to a database or index structure. There are incremental algorithms for regression and decision trees, but they are not as simple as KNN.
3) Linear Regression minimizes model space, as it only needs to store the coefficients of the model. Decision trees can be very large, and KNN needs to store many data points.
4) KNN can be considered to have the minimum training time because it only needs to store data points during the training phase. It doesn't need to select features like decision trees or fit a line like linear regression.

In bagging, assume that the training set D is of size n and there are m bagging training sets, each of size n' = n, i.e. n' = n = size(D). In each new bagging training set, samples are randomly chosen from D with replacement. Then roughly what fraction of the samples in each bagging training set are unique, the rest being duplicates?

1 - 1/e ≈ 63.2%. Each of the n draws misses a given sample with probability (1 - 1/n), so the chance a particular sample never appears in a bag is (1 - 1/n)^n ≈ 1/e; hence roughly 1 - 1/e of the samples in each bag are unique.
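
A quick simulation sketch (NumPy, with an arbitrary n and number of trials) that checks the 1 - 1/e figure:

    import numpy as np

    n = 10000          # assumed training-set size for the simulation
    trials = 50
    fractions = []
    for _ in range(trials):
        bag = np.random.randint(0, n, size=n)          # draw n samples with replacement
        fractions.append(len(np.unique(bag)) / n)      # fraction of unique samples in the bag

    print(np.mean(fractions))        # ~0.632
    print(1 - 1 / np.e)              # 0.6321...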

Consider the following two problems:

1. An advertisement agency would like to know which ad to present to a customer next based on previous ad-click history. 2. A student would like to know the optimal number of hours to spend studying to maximize performance on exams. What is the best approach (model) for solving each problem? Marketing data analysis such as Problem 1 is well-suited to an instance based model because customer preferences are not likely to be quantifiable. They are more emotional. On the other hand, we expect studying versus grades to be a nice numerical model where we will see some optimum where additional hours of studying has a negligible effect on test performance. So Problem 2 would make a great parametric model.

Which of the following is a correct set of supervised and unsupervised learning problems?

1. Predicting the origin of wines using chemical analysis.
2. Differentiating among three species of flowers using physical characteristics.
3. Determining the species of mushroom using physical characteristics.
4. Anticipating the quality of wines using several features.
5. Arranging the same type of fruit at one place among four kinds of fruits using physical characteristics.

1. Supervised learning problem. Chemical analysis data are used to predict the label (origin of wines). 2. Unsupervised learning problem. Clustering has to be used to differentiate among the three species of flowers. 3. Supervised learning problem. Physical characteristics are used to determine the label (species of mushroom). 4. Supervised learning problem. Several features are used to anticipate the label (quality of wines). 5. Unsupervised learning problem. Clustering has to be used to arrange the same type of fruit at one place.

1. A k-NN model will underfit the data for values of k nearing the total number of training samples. 2. The solutions for a linear regression problem obtained through the normal equations and through iterative least-squares gradient descent will be identical. For the above two statements, determine whether they are true or false and select the correct combination.

1. True. The k-NN model will underfit the data: as k -> N, all samples have an influence on the predicted label, and eventually, when k is too large, it will always predict the majority class. 2. True. The cost function is convex, so there is only one global minimum.

Suppose you used four different regression models (A, B, C, D) and achieved the following values :

A. Low correlation, Low RMSE B. Low correlation, High RMSE C. High correlation, High RMSE D. High correlation, Low RMSE. Which answer choice depicts the arrangement of the regression models in terms of prediction quality, worst quality first? RMSE is a much stronger factor than correlation for measuring predictor quality. There may be cases where the predictions are highly correlated with the truth but RMSE is also high, which reduces the usefulness of the regression model.

How are instances sampled to fill the third bag D3 in a boosting algorithm?

A boosting algorithm focuses its training on instances with a significant prediction error under the joint model of all previous bags. To do so, instances from the training set that were predicted poorly are weighted so that they are more likely to be selected when creating a new bag.
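
A minimal sketch of that error-weighted sampling step (NumPy; the labels and the previous ensemble's predictions are hypothetical placeholders):

    import numpy as np

    y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # training labels (made up)
    y_ensemble = np.array([1.1, 2.5, 2.0, 4.1, 5.4])   # joint prediction of the bags built so far

    errors = np.abs(y_true - y_ensemble)
    weights = errors / errors.sum()                     # larger error -> larger selection probability

    n = len(y_true)
    next_bag = np.random.choice(n, size=n, replace=True, p=weights)
    print(next_bag)   # indices for bag D3, biased toward poorly predicted instances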

Which of the following learning problems is a classification problem?

A classification problem has discrete answers. Of all the questions, only guessing an animal will result in a discrete answer. All other possible answers have continuous answers.

If you are building a decision tree model, which is most likely to be true as the leaf size increases?

A greater leaf size will cause the model to not fit the training (in-sample) data perfectly, and will make its correlation go down as the leaf size increases (in general), because individual cases in the tree are no longer handled individually; they are decided by an average or voting method across the contents of the leaf. This will also cause the RMSE to increase, since these one-off cases are not handled individually but instead take on the average or vote of the other elements going into the leaf.

When plotting a validation curve, what is an example of overfitting? MSE = mean squared error. MSE is always positive. Values closer to 0 are better.

A higher mean square error (MSE) on the validation set is less likely to generalize the model correctly from the training set. A is an example of underfitting. B is an example of overfitting. C is an example that generalizes well.

The best way to reduce overfitting in bagging is to:

A increases the number of learners "voting" on the output, which makes overfitting less likely to occur. B could help overfitting if the new learner is less prone to overfitting, but it could also make overfitting worse. C and D can actually make overfitting worse.

Which of the following is an example of an unsupervised learning problem?

A is a supervised learning problem where features are used to predict a label. B is the only one that is not trying to find a specific target variable. It is clustering stocks together based on similarity (K Means) and not classifying them with specific labels in mind. C is a supervised learning problem where the features are the answers to the questions which are used to determine the type of animal. D is a supervised learning regression problem where the features are used to predict a continuous value.

Which of the following best describes Bagging?

A is boosting. B is incorrect because bagging randomly creates subsets of the data. D is incorrect because we do not include test data in the subset generation.

Which is not an acceptable way to alleviate overfitting?

A is acceptable because different types of models develop bias in different ways; these biases are less pronounced if we average many different types of learners. B is also acceptable because bagging reduces the severity of the influence of outliers or faulty data. Therefore C is the answer.

What is a reason one might decide to use linear regression instead of kNN for a particular regression problem?

A is false because as you decrease "k", kNN is more likely to overfit the data. C is false because kNN essentially requires no training time, whereas linear regression has to find parameters. D is false because A and C are false. B is true because the nearest neighbors in the data set will remain the same no matter how far away you move beyond the edge of the data (the extrapolation is a straight line).

What causes Overfitting?

A is how you can tell you have overfit your data. B is what causes overfitting. The closer you get to modeling your training data exactly, the lower the error on your training set will be and the higher the error when running against your test data (or real data).

Overfitting is made worse by

A is nonsense, but might make someone pause if they didn't know the material. B is the correct answer, because making your training data fit too perfectly causes testing data to have a higher error rate. C and D, of course, help generalize your predictions, decreasing the overfit error.

What is the goal of Supervised Learning?

A is the correct definition of Supervised Learning, B is a definition of Reinforcement Learning, C is a definition of Unsupervised Learning, and D is not really machine learning at all, just an arbitrary answer.

Which of the following is a true statement regarding supervised and unsupervised learning?

A is the definition of supervised learning. In unsupervised learning the correct output is not known. B is false as both are examples of supervised learning. C is not always true. A is the only correct answer

Which of the following statements is false?

A is true as functions represented by decision trees are nonlinear. B is true as decision trees applied to regression functions always learn step functions. C is true as decision trees can always have decision rules that model any train set perfectly. D is false because C is true.

Given a dataset with X features and the Y values are given as [2,3,4,8,2,5,4,8,7,6,9,4,1,3,7,5,6,9] for a sample, what kind of supervised learning would you use?

A is true because we can see 9 classes, and B is also true because the values can be treated as continuous. This is similar to the wine dataset problem, where either regression or classification can be used.

Which of the following statements is correct regarding the properties of KNN, linear regression and decision trees in general?

A is wrong --> linear regression generally handles regression problems. B is wrong --> KNN has the lowest cost of learning; in fact, KNN does not require an explicit learning step. C is correct --> KNN needs pair-wise computation to get the distance from all entries in the dataset. D is wrong --> decision trees require the least effort to handle missing data.

Which of the following algorithms is the bagging procedure we learned in class?

A is wrong because we usually use the same sample size in each bag. C is the AdaBoost algorithm. D is the stacking algorithm.

What does a low RMSE on training data and high RMSE on test data imply?

A low error on training data, but high error on test data implies that the learner is overfitted to the training data causing poor results on test data.

Which best characterizes a balanced decision tree?

A perfect decision tree requires calculations to determine the best attributes to split on. Thus, it will take longer to build than a random tree.

Which of the following best distinguishes unsupervised learning from reinforcement learning?

A reward signal or a training label is what distinguishes reinforcement learning and supervised learning from unsupervised learning. (B) is clearly wrong. (C) Normalization is not the distinguishing difference between these classes of algorithms. (D) Both are often generative but don't have to be.

How is the following quote best related to the topic of overfitting? Occam's razor: Prefer the simplest hypothesis that fits the data

A simple hypothesis is analogous to a model which is generalized. In contrast, a model that seeks to fit all of the data may be much more complicated. The particular peril of a model that is over-complicated is that it may not adequately explain new data, because it is too specialized.

Which of the following statements is true about overfitting?

A single random tree learner with leaf size 1 will fit the training data exactly, which makes it likely to be overfit. A, C, and D are all false: A. Overfitting occurs in regions where the in-sample error is decreasing and the out-of-sample error is increasing. C. A k-nearest neighbor model with a k value of 1 is more likely to be overfit than one with a k value of 50. D. A polynomial model of degree 50 is more likely to be overfit than one of degree 1.

Why is using cross-validation more reliable in determining overfitting than doing a simple dataset split (test and train)?

A) Because you use multiple complementary sets, you reduce variability; the results you see from cross-validation are therefore truer to the real accuracy of your model. Cross-validation gets around the problem of being unlucky and choosing a biased data split. B is true but has nothing to do with reliability. For C, the structure can change, but cross-validation will not help you with this. D is false: cross-validation actually uses the data more efficiently because you have multiple splits for your training and test sets.

Which of the below statements is true about the construction of a random tree?

A) is correct and is a key benefit over other algorithms (like KNN). B) is not correct, as we are specifically avoiding splitting on information gain to save algorithm time. C) is not correct, as you never want to build the model using the test data; one needs a dataset that is not represented in the model to test against for out-of-sample performance. D) is not correct, as this could lead to one side of the tree having no leaves.

Which statement below about overfitting is correct?

A) Not correct, because in a weak model, or a model which is not highly complex, there is no overfitting: both training and testing errors can be high. B) This is correct; it is the definition of overfitting. C) Not correct. When overfitting occurs, training error is low but testing error is high. D) Not correct. When k is smaller, overfitting is more likely to occur. When k equals 1, the model fits the training data perfectly but may perform poorly on testing data.

Which of following statements about boosting is NOT correct?

A, B, and C are correct statements. The statement D is false. Adaboost is susceptible to outliers. AdaBoost tends to treat outliers as "hard" cases and put tremendous weights on outliers.

What method, by itself, is not used for reducing variance error in individual decision trees?

A, B, and D are common techniques for reducing overfitting. By itself, randomly selecting splitting decisions does not reduce overfitting; the tree will still fit one leaf for each distinct data point.

While building a decision tree, which of the following approaches in selecting the best feature to split on is the fastest?

A, B, C, and D can all be used to determine the best feature to split on, and hence all seem plausible. However, A, B, and C require low-to-high amounts of mathematical computation over the entire input dataset (at least O(N)), whereas D requires only simple random number generation within a given range.

Which of the following statements is false?

A. A weak learner is defined to be a classifier which is only slightly correlated with the true classification. B. Boosting converts weak learners into a strong learner.

Comparing Bagging with the Random Tree algorithm, which one of the following statements is not correct?

A. Correct; "train different subsets of the training data" is the core of the bagging algorithm. B. Not correct; Bagging is more like a framework that repeats other algorithms like Random Tree or Linear Regression and eventually combines the different training results together. Bagging must work together with these algorithms. C. Correct; "more bags" means randomly drawing more subsets of the training data. D. Correct, because Bagging causes more "sample data" to be collected to help train the prediction model.

How is a random tree built using A. Cutler's approach different than a standard decision tree using JR Quinlan's approach?

A. Randomization occurs in selecting factors and split values, not in randomly scrambling the input data, so this is false. B. It is much easier to build a random tree because measures of information gain do not have to be studied and implemented, so this is false. C. Correct answer. Random trees built by Cutler's method are fast learners because they select the factors and their split values randomly, whereas the standard tree relies on measures of information gain to select the best factor. These measures are more complex to build and analyze and are more computationally demanding. D. Factors can appear multiple times in either algorithm, so this is false.

What should we use to determine the quality of predictions for any regression model?

A. Regression is used for continuous values, so this metric won't take into account how close the predicted value is to the actual one. For example, it would return 0 accuracy if the true value was 5 and the predicted value in one case was 5.2 and the other case was 10. B. RMSE is a good measure of how close the predicted value is to the true value. C and D. They won't take into account the absolute values of error, so if there is a prediction of +10 and -10 than the actual values, the error would average out to zero according to these metrics.

About boosting, which of the following is TRUE?

A: False. Random forest uses bagging, rather than boosting. B: False. Boosting models are harder to parallelize since training is sequential. C: True. By the definition of boosting, it combines a set of weak learners into a single strong learner. D: False. Boosting models can still overfit with many boosting steps.

Which of the following statements about the comparison between supervised learning and unsupervised learning is correct?

A: Wrong. Unsupervised learning also has the problem of overfitting; for example, in the KNN algorithm we can make k=1. B: Wrong. The query time for supervised learning is faster than for unsupervised learning, but the training time is usually longer since it needs to generate the prediction function. C: Correct. Generally, supervised learning is better than unsupervised learning in this respect, since supervised learning only needs to store the model (function), while unsupervised learning needs to store all the training data. D: Wrong. Unsupervised learning also has bias; for example, in KNN we assume that points near each other should belong to the same group, which is also a kind of bias.

Which of the following statements about Boosting and Bagging is true?

A: Wrong. The main goal of boosting is to increase predictive force. B: Right. C: Wrong. Bagging builds each classifier/model independently of the previous one to decrease variance (each model is independent). D: Wrong. It is designed so there is no correlation between trees, so it is a Bagging method.

If we don't care about the speed, for modeling stock prices with a decision tree, which of the following feature selection method will have the best performance?

A: false. "We don't care about the speed", which is the main advantage of random feature selection B: false. Stock prices are continuous, we need to use a regression tree C: best choice in here. D: false.

Which ensemble method strives to improve the learner by focusing on areas where the system is not performing well?

Ada Boost is an improvement upon the Bagging ensemble learner. While training and testing with Bagging, some data points will have significant errors. Those data points with errors are more likely to be picked for the next bag when using Ada Boost, then a new model is built and tested. This is how Ada Boost focuses on areas where the system is not performing well.

Which one of the following models is NOT suitable to create AdaBoost?

AdaBoost should use weak learners. SVM is a very complex model which will cause overfitting of training data.

How does AdaBoost handle training data with high error when selecting bags?

AdaBoost uses training data to test the model and determine error metrics for each point in the training data. When choosing the next bag AdaBoost weights each instance in the training data according to the error. Higher error values for an instance in the training data mean the instance is more likely to be selected for the next bag.

Which one is more likely to overfit as the number of bags increases?

AdaBoost tries hard to match those data points that are off or are outliers, and thus it is more susceptible to overfitting.

Which strategy has a chance of protecting a supervised learning algorithm from damage when some rows are contaminated by a deliberately hostile supervisory signal?

Adding features to the decision tree does not protect against damage because the target data is still wrong. A hostile supervisory signal would not be identified by a clustering analysis; perfectly good outlier data would be removed instead. Training a supervised learning algorithm on random noise rows would not decrease damage; it would increase it. Ensembling many supervised learning classifiers to work together like a democracy, voting on where the exact target is, averaging their responses, and correcting each algorithm in proportion to its stake in the vote, would be an effective resistance to a hostile supervisory signal. An attacker may be able to stage an effective attack that heavily damages one of the algorithms, but the vote of the damaged model would be washed out by the ineffectiveness of the attack on all the others.

Which of the following techniques can be used to reduce overfitting:

All three options are standard techniques used to reduce overfitting.

In a random tree learner, if a leaf size is provided we use it as a limit, putting up to leaf_size values in a leaf. What is the ideal value (SplitVal) that should be used to represent this leaf with multiple entries?

Ans: The mean of the values that belong to this leaf. In a random tree learner, if a leaf size is provided we use it as a limit, putting up to leaf_size values in a leaf, and we need to use a sensible aggregate value to make sure the decision tree functions. Of the given choices, (A) first value and (D) maximum value are clearly not correct answers. We also cannot pick a (B) random value, as that would hurt accuracy. Of the given choices, (C) the mean is the best. If the student has worked through MC3 P1, this should be easy to answer.

What is the one of the main benefits/goals of utilizing an ensemble learner made by bagging with multiple KNN learners over using a single KNN learner with the same k value?

Answer B is correct as the bias in each of the individual learners will be smoothed out when aggregated. Reduced testing error is one of the main reasons bagging is utilized. Answer A is incorrect as the query time for the KNN ensemble learner will be greater due to the need to query multiple learners. Answer C is incorrect as the training time for KNN algorithms is negligible. Answer D is incorrect because the goal is to reduce overfitting of the in-sample, or training data.

A strength of Perfect Random Trees Ensembles is that they are not:

Answer C. The answer is the opposite of a correct statement; it's a "did you read the paper we were supposed to read" type of question.

As part of the decision tree overfitting-prevention technique known as 'rule post-pruning', how many rules will be generated for a given tree before the pruning has begun?

Answer d): According to Chapter 3 in Mitchell, one rule is created for each path from the root node to a leaf node. Since there is only one path from a given leaf to the root, there are as many rules as there are leaf nodes.

Suppose we have a machine learning problem where X is the feature space and Y is the output space. To be considered a classification problem, which of the following must be true:

Answer: b) A classification problem has a finite output space.

In KNN based models, which of the following scenarios involving values of K is likely to produce an overfit model?

As K decreases, the KNN model produces values that are closer and closer to the actual training data points. Therefore, when a KNN model is trained using a training set and a lower value of K, the model becomes too specific to that set and produces larger errors on any testing set. This is why overfitting occurs. When K=1, KNN will produce the most overfit model.
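
A small illustrative sketch (NumPy, with made-up noisy 1-D data and a toy KNN regressor) showing the in-sample vs. out-of-sample gap for K=1 versus a larger K:

    import numpy as np

    def knn_predict(x_train, y_train, x_query, k):
        # average the y values of the k nearest training points (1-D features)
        return np.array([y_train[np.argsort(np.abs(x_train - xq))[:k]].mean() for xq in x_query])

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, 200))
    y = np.sin(x) + rng.normal(0, 0.3, 200)          # noisy target (made up)
    x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]

    for k in (1, 20):
        rmse_in = np.sqrt(np.mean((knn_predict(x_tr, y_tr, x_tr, k) - y_tr) ** 2))
        rmse_out = np.sqrt(np.mean((knn_predict(x_tr, y_tr, x_te, k) - y_te) ** 2))
        print(k, rmse_in, rmse_out)   # K=1: in-sample RMSE is 0 while out-of-sample is larger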

Which of the following is true about measuring the quality of predictions (training RMSE and training correlation)?

As RMSE increases, which indicates poorer performance, correlation generally decreases. However, there might be cases where the correlation does not decrease, as in a biased model.

Which of the following is NOT an approach to pick a feature to split on while building a decision tree?

As learned in the class, any feature can be used multiple times to split on.

Which of the following techniques can be used to reduce overfitting in decision trees?

As per Mitchell (p. 77), post-pruning the decision tree is therefore important to avoid overfitting in decision tree learning.

What best explains the risky nature of selling short?

As short selling involves "borrowed" shares of stock that the investor must return later, the investor is responsible for covering any increase in price that the stock has undergone during the time period. The stock's price can increase without bound, and as a result, the investor can be responsible to pay back a theoretically unlimited amount (whereas, in comparison, a "long" position on a stock can lose only 100% of its value if the stock price goes to zero).

Which is true when comparing kNN to decision trees?

As the number of samples in a data set increases, the computational complexity of a kNN learner grows.

What kind of learners do we use in Ada boosting?

As the number of weak learners increases, the bias converges exponentially fast while the variance increases by geometrically diminishing magnitudes, which means it overfits much more slowly than other methods.

Consider a decision tree, with leaf size greater than 1 i.e. we aggregate several nodes into a leaf.

As you know, the leaves of decision trees contain the output Y (prediction/classification/association) values, and since in our example tree leaves can aggregate multiple nodes (up to leaf size), we need a way to calculate that aggregated leaf value. If our aim is regression, then the average of the aggregated nodes' values gives us the correct Y value. What if our aim is classification? For classification we need to vote, to select the value that most prominently classifies the instance. The mode function gives us that most prominent value, as it is the most frequent value among the aggregated nodes.
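
A tiny sketch of the two aggregation choices (NumPy/collections; the leaf values are made-up examples):

    import numpy as np
    from collections import Counter

    leaf_values = np.array([3.0, 3.0, 4.0, 5.0, 3.0])   # Y values that fell into one leaf

    regression_leaf = leaf_values.mean()                             # average, for regression
    classification_leaf = Counter(leaf_values).most_common(1)[0][0]  # mode (vote), for classification

    print(regression_leaf, classification_leaf)   # 3.6 vs 3.0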

Which of the following is NOT true when comparing between kNN, decision trees, and linear regression?

B is not true, as linear regression is most likely to take the least time to query among the three. It uses mathematical functions to query, unlike the other two, which also need to traverse and/or perform iterations in order to get results. (A is true because kNN is a lazy learning algorithm, which defers most of the computation until testing (i.e. classification); C is true because using a leaf size of 1 and the same test dataset as the training dataset (in-sample) would always return the known, accurate result; D is true because linear regression is not reliable when the data is highly non-linear.)

When looking to build a decision tree, is it better to pick the splitting feature randomly or through a calculated method such as feature correlation or Gini index?

B is the best choice for this question. Although answer A is correct in saying the tree will be more accurate, this comes at a speed cost as doing that calculation is expensive and an overly accurate tree can lead to overfitting. Answer C is correct in saying a random tree is faster, but if there is only one tree it may not be a good representation of the actual data. D is just not correct as a group of calculated trees is just extra slow and they will all be the same, therefore will not be any better as a group.

Which of the following statements comparing parameterized models to K-nearest neighbors (as an example of instance-based models) is FALSE?

B is the correct answer because all of the heavy processing for KNN happens at query time. Learning is basically just loading up the data (which rules out A and D, because both are true). C is true because optimal KNN requires remembering all of the data (though there are ways of giving up some accuracy to reduce storage) while a parameterized model can work fine just by remembering the values the optimizer came up with for the equation. It doesn't have to remember all of the data used to generate the equation (ruling out C).

Which one of the following is a regression problem?

B is the correct answer because it is a question for approximating a numeric value, whereas other choices ask for identifying a class like the name of a fruit.

Which of the following statements best describe Boosting?

B is wrong in terms of "randomly". C is wrong in terms of "equal weight". D is wrong, since overfitting might occur as the number of iterations increases.

Which of the following models is 'not' considered to be overfitted?

B is correct because this model tries to generalize to other, unseen data sets by learning the trends from the training data, and is thus not prone to overfitting. A is wrong because, with too many parameters relative to the observations, the model overreacts to minor fluctuations in the training data, lowering its predictive performance. C is wrong because the efficacy of the model is determined not by its performance on the training data but by its ability to perform well on unseen data. D is wrong because it is the general scenario of overfitting: high out-of-sample error with low in-sample error.

Bagging is typically used when you want to reduce the variance while retaining the bias. What is the correct approach when using this method?

In bagging (bootstrap aggregating), you sample the data with replacement. For each of those sets, use the same type of learner to train a model, then average the outputs (or vote) to get the result.
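
A compact sketch of that procedure (NumPy, using a simple line fit as a stand-in for the base learner; the data is made up):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 100)
    y = 3.0 * x + rng.normal(0, 2.0, 100)          # made-up linear data with noise

    def train_bag(x, y):
        # one bag: sample with replacement, then fit the same kind of learner (a line here)
        idx = rng.integers(0, len(x), size=len(x))
        return np.polyfit(x[idx], y[idx], 1)        # slope and intercept

    models = [train_bag(x, y) for _ in range(20)]   # 20 bags, same learner type

    x_query = np.array([2.0, 5.0, 8.0])
    preds = np.mean([np.polyval(m, x_query) for m in models], axis=0)   # average the outputs
    print(preds)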

Bagging works especially well for unstable predictive algorithms - algorithms whose output undergoes major changes in response to small changes in the training data. For which of these predictive algorithms Bagging may not give substantial gains in accuracy?

Bagging methods rely on the instability of the classifiers to improve their performance; since k-NN is fairly stable (for large enough k) with respect to resampling, bagging may fail in its attempt to improve the performance significantly for a k-NN classifier.

What is not true about bagging?

Bagging samples multiple new training data sets from the original training data set uniformly with replacement.

We want to apply bagging method to improve the performance of a decision tree classifier. Please select the statement that bagging fits:

Because each model is built independently.

Which of the following problems can be addressed by using a proper classification algorithm?

Because this problem is a multi-class classification problem with 10 classes, 0 to 9. One can train a classification model and predict the written digits to be one of {0, 1, ... 9}. If one doesn't know this, the answer can still be found by observing that only the predicted value in B takes discrete values.

After training with the k-nearest neighbor (pattern) classification algorithm, your model is 100% accurate in fitting the training data. But when applied to real data, it generates large errors and is only 25% accurate. Which method below might work when optimizing your model?

To address overfitting, we usually use fewer features, more training data, and a larger k in the kNN algorithm, so (a) and (d) are wrong. (c) is no help with overfitting.

Why is boosting less commonly used with decision trees than bagging?

The boosting algorithm focuses only on results that were incorrectly predicted in the previous iteration. By doing so, it reduces the effective size of the data with each subsequent step. This limited set of samples will inevitably lead to overfitting.

Boosting is a variation of bagging in that it:

Boosting is a variation of bagging in that it is a step-wise ensemble method focused on areas where the models have not performed well, and it weights those areas in each subsequent model. All the other answers incorrectly describe the boosting algorithm.

Which of the following is NOT true about boosting?

Boosting is an ensemble learner which runs weak learners in sequence to improve performance. However, it does not help with overfitting (and can actually make it worse).

Which of the following is correct for boosting?

Boosting focuses on where the training is not performing well, so it can help cut errors.

In general, compared with bagging, which answer about boosting is true?

Boosting uses weak performance samples repeatedly. It will increase the bias and decrease the variance from the data selection point of view.

What is the main difference between the Bagging and Boosting algorithms?

Boosting's main difference from bagging is that it uses the model from the last bag to determine which data points it performs poorly on and should be emphasized in the new bag. It does this by calculating the error for each data point and randomly chooses the data for the new bag proportional to the error associated with the data point.

Accuracy alone is a sufficient metric for classification model evaluation.

Both accuracy and recall are important metrics to evaluate a classification machine learning model.

Pruning decisions trees is one way to avoid overfitting. Which of the following describes a valid way to prune a decision tree?

C describes a correct process for pruning, wherein a fully grown tree is pruned bottom-up in order to avoid overfitting to training data. The other options describe other processes that may or may not help overfitting.

Which of the following statements is true about decision tree?

C is correct, decision trees can be used for either regression or classification problems, with small differences in training and aggregating result. A is plainly wrong; for B, the model with bagging contains multiple trees and would take more memory space; D is a bit tricky, but we want the tree to be as balanced as possible.

For the Random Tree Algorithm in A Cutler's paper, comparing to the traditional Decision Tree Algorithm proposed by JR Quinlan, which of the following best describes the keyword "random":

C is the right answer, because in A. Cutler's random tree algorithm the randomness has two parts: randomly selecting feature i, and randomly picking two values of Xi and taking their mean. A is wrong, because calculating the median is a little slower, which runs counter to A. Cutler's purpose (though A is the most attractive wrong answer, since it provides a reason for using the median value). B is wrong because computing the correlation for every feature is expensive. D is wrong because in this algorithm only two values of Xi are selected.

Your boss gives you a dataset of daily returns for a given stock, and daily factor value data (n number of arbitrary factors) corresponding to that stock. He asks that you build a predictive model using the factor values at time t corresponding to the returns at time t+1. Your boss intends to query this model at future days with that days' factor values to get some numerical prediction for the stock's next day returns. Most generally, what kind of learning problem can your task be described as:

C, Regression. Given that your model will be used for 'numerical prediction', we can determine that this is a regression problem. Parametric/Non-Parametric simply describe different types of models you can use to solve a learning problem, but they don't describe what type of learning problem it is (although the uninformed test taker might be inclined to answer 'parametric'). Classification problems are those where you are trying to classify an object into a type - clearly not what your boss is asking you to do.

In bagging, we have a dataset that's divided into a training set (with n instances) and a testing set. Then, we make (m) bags and in each bag, we randomly select (n') instances from the training set with replacement (some instances may be repeated). Which of the following sentences is true about this method?

Choice A is wrong because we will have (m) models. Choice C is wrong because we take the mean of the outputs of the models if it's regression or we take majority vote if it's classification. Choice D is wrong because we can never be sure about the number of unique samples in the bags due to the randomness of the selection. Choice B is correct because in practice we know that an ensemble of learners can give better predictive performance than a single model.

Which method is not included in supervised learning?

Clustering is a method of unsupervised learning. We can derive the structure by clustering the data based on relationships among the variables in the data without feedback based on the prediction results.

When should we use the 4th Root of the mean of the 4th powers (i.e. RMSE but instead of squaring the errors are raised to their 4th powers and instead of square rooting, the numbers are rooted by a power of 4) instead of RMSE changing nothing else?

Correct answer is B: because the 4th power of a large error will be much greater than the square of the same error, we penalize outliers significantly more than we would with RMSE.
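
A quick sketch comparing the two metrics on made-up residuals that include one outlier:

    import numpy as np

    errors = np.array([0.5, -0.3, 0.2, 10.0])      # hypothetical residuals; 10.0 is an outlier

    rmse = np.mean(errors ** 2) ** 0.5
    r4me = np.mean(errors ** 4) ** 0.25             # 4th root of the mean of the 4th powers

    print(rmse, r4me)   # the 4th-power metric is pulled much closer to the outlier's magnitude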

Please select the correct statement

Correct answer is C, because in KNN, as K decreases the model is more likely to overfit the data. Think about K=1, which will overfit the data (a more specific model), whereas K=N (where N > 1) gives a more general model. A, B, and D are incorrect statements.

Q1. Which of the following measures of building a decision tree will NOT lead to successfully classifying at least 50% of the training data?

Correct answer is D) None of the above. Note that the question asks you to compare based on performance on training data and tests the fundamental concepts of how trees are built. An easier variant would be performance on test data; in that case, C would be our answer.

Consider the options below and select the appropriate choice for the kind of trees that are preferred and why:

Correct answer is a) Shorter trees are preferred because they fit the data with the simplest hypothesis and avoid overfitting. Attribute selection measures based on information gain, like entropy and the Gini index, place high-information-gain features close to the root of the tree, resulting in shorter trees.

You are given a task to build 2 different (regression) decision trees based on the same data set. Method A: A decision tree that is built in shortest possible time, Method B: A decision tree that is most balanced. Which of the following 'pair' of statements for the above mentioned methods is most likely to be true regarding the criteria for choosing the (i) 'feature to split on' and (ii) 'split value' for that feature while building the respective decision trees?

Correct Answer: B; Selecting the feature randomly and then choosing the split value randomly will build the decision tree fastest time (based on Adele Cutler's approach) and splitting the feature based on correlation followed by choosing the split value based on median might take longer to build but will build a tree that will be very close to being balanced (based on Ross Quinlan's approach).
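
A side-by-side sketch of the two split-selection strategies the answer describes (NumPy; data is an assumed array with the features in the leading columns and Y in the last column, and the function names are my own):

    import numpy as np

    def fast_random_split(data):
        # Method A (fastest to build): pick a random feature and a random split value
        i = np.random.randint(data.shape[1] - 1)
        return i, np.random.uniform(data[:, i].min(), data[:, i].max())

    def balanced_split(data):
        # Method B (most balanced): pick the feature most correlated with Y, split at its median
        x, y = data[:, :-1], data[:, -1]
        corrs = [abs(np.corrcoef(x[:, i], y)[0, 1]) for i in range(x.shape[1])]
        i = int(np.argmax(corrs))
        return i, np.median(x[:, i])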

Which of the following is TRUE about overfitting?

Correct answer is (A) because an overfitted model tries so hard to fit to the training data perfectly ("memorize" training data rather than learning it) that the resulting model does not generalize at all.

Which of the following statements is TRUE about overfitting?

Correct answer is (A) because an overfitted model tries so hard to fit to the training data perfectly ("memorize" training data rather than learning it) that the resulting model does not generalize to new, unseen data well.

Which of the following is wrong about parameterized models and instance-based models?

Correct answer is A because parameterized models are a family of distributions that can be described using a finite number of parameters.

When building a decision tree, which method for determining the feature to split on is most likely to result in a shallower tree?

Correct answer is A) since splitting on features with high correlation will tend to more quickly and cleanly separate training examples, ultimately requiring fewer splits to separate the records.

A machine learning problem dataset has only two unique values to predict, 0 and 1, which clearly can be applied to a classification algorithm. What is the greatest challenge from using a regression machine learning algorithm on this problem?

Correct answer is B) because A, C, and D are all false and if you require a probability propensity of 0 versus 1, predicted values outside of [0, 1], which many regression algorithms would result in, wouldn't make any sense.

The primary difference between K nearest neighbor (KNN) and Kernel regression is:

Correct answer is B) because KNN weighs points equally and Kernel regression uses the distances between considered points to assign weighted values.

Q1: In creating a decision tree, when choosing a feature to split on, what is the most correct use of correlation?

Correct answer is D, because correlation can be either positive or negative.

As a decision tree learns more and more training data (starting from the first instance), does test error (out-of-sample error) go:

Correct answer is a) because a decision tree with very few training instances will not have enough conditions to classify (or regress) new samples so test error will start high. As we add training samples, the tree becomes more generalized and test error goes down. Eventually though, as we add too many training samples, the tree starts to overfit the training data so test error starts to go back up because the tree has lost its ability to generalize.

You use bagging with decision trees such that each tree learns from a random subset of your training data. You might want to:

Correct answer is b) because a great advantage of random decision trees is the speed of learning in comparison to traditional decision trees. a) is wrong because decision trees don't require normalization. c) is wrong because, while each tree would use information gain, they would not be identical because each was trained on a different subset of the data.

If I have an overfit model, I will have high variability out-of-sample (since we have likely fit on noise). If I use a bagging procedure instead, why might I expect lower variability and lower likelihood of overfitting?

Correct answer is b), because model averaging means that the prediction error of any particular single learner used in the bagging process is averaged out by the predictions of the other learners in the ensemble. (a is untrue; d is nonsensical but plausible, and KNN will likely increase in error variability as the number of observations in each group falls; c refers to boosting, not bagging.)

Which model has the lowest query cost out of KNN, decision tree, and linear regression?

Correct answer is c), since calculating a value from the regression equation scales best as the data grows (always a constant cost of 1).

Which of the following approach is considered to be unsupervised learning algorithm?

Correct answer is c), because K-means is a clustering algorithm which uses unlabeled datasets. All the other algorithms use labeled datasets and are considered supervised learning algorithms.

Which of the following statements is true about regression/classification?

Correct answer is c), because regression gives a numerical prediction, which is the stock price in this case, whereas classification assigns an object to a class.

Which of the following choices does not belong to supervised regression learning?

Correct answer is c), because supervised learning needs both predictors X and a response Y, but clustering only has predictors X and does not have a response Y.

Assume k is the number of nearest neighbors for a KNN learner, d is the degree of a polynomial model learner, and bags are the number of bags used for a bagging learner. Which machine learner is least likely to overfit?

Correct answer is c) because while a KNN learner with k=1 is likely to overfit, the large number of bags will reduce the likelihood of overfitting. Note that d) is incorrect since a polynomial model with a large degree is likely overfit and large numbers of bags with boosting can further contribute to overfitting.

Which of the following is correct about parametric and instance-based models:

Correct classification of parametric and instance-based models. Parametric models: Logistic Regression, Linear Discriminant Analysis, Perceptron, Naive Bayes, Simple Neural Networks Instance-Based models: k-Nearest Neighbour (kNN), Learning Vector Quantization (LVQ), Self-Organizing Map (SOM), Locally Weighted Learning (LWL)

Which is a better way to choose the split for a single tree?

Correlation better shows which feature is related to the result.

Typically classification problems involve sorting elements into categories. How could a regression problem be turned into a classification problem?

Creating a category for every single answer is impossible for a regression problem; there is an uncountably infinite number of possibilities. However, values can be sorted into ranges.
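
A minimal sketch of turning a continuous target into classes by binning (NumPy; the bin edges are arbitrary examples):

    import numpy as np

    y = np.array([0.03, -0.12, 0.41, 0.07, -0.55])   # continuous targets (made up)
    bin_edges = np.array([-0.25, 0.0, 0.25])          # arbitrary range boundaries

    classes = np.digitize(y, bin_edges)    # class 0, 1, 2, or 3 depending on the range y falls in
    print(classes)                          # [2 1 3 2 0]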

Which of the following statements are true: 1. Quinlan Trees select feature which has lowest correlation with output values 2. Cutler Trees uses correlation/information gain to select feature to split on

Cutler trees are random trees and choose features to split on randomly, whereas Quinlan trees select feature with highest correlation/information gain.

What is the range of possible values for the output of a linear correlation calculation?

D is correct because by the mathematical definition of linear correlation, the only possible resultant values are in the range -1 to 1, inclusive.

Under what condition would you choose to use KNN rather than Decision Tree?

D is correct since KNN needs almost no training time. The only requirement is to store the data so it can be used later during query time. A is wrong since both need all the data stored. B is wrong because fair representative data benefits both methods. C is wrong since Decision Trees are better for approximating discrete-valued target functions.

In which case would a parametric model be preferable to an instance based model?

D is really the only answer that can be represented mathematically and therefore lends itself to a parameterized model. We don't know that relationships exist between any of the other comparisons. A car can certainly drive fast and not get in a crash, and vice versa. A skyscraper will definitely increase in weight as floors increase, but there is no simple mathematical formula, as a lot depends on the construction, size, and use of the building. Restaurant visits by income definitely do not have a mathematical relationship and should be handled with an instance-based method.

What are some reasons that could explain a high out-of-sample error in your model?

D is the answer because high out-of-sample error (i.e., the RMSE for the test set) is indicative of low correlation between the factors and the label in your test set. Knowing that your training set is overfitting can also provide some insight into why your model may have high out-of-sample error. A could potentially be an answer, but without any assessment of how the test data is performing, it's difficult to assess why the RMSE on the test set may be high. B is incorrect because, generally speaking, if the correlation between Ypredict and Ytest is high, you'll see a lower RMSE; the answer choice also states "Low RMSE", but the question refers to high out-of-sample error (i.e., high RMSE), so this is incorrect. C runs into the same issue as A: even though it seems that the model is fitting the training data well, it's difficult to draw conclusions without more details of the model's behavior on the test data. The second part of that answer also tells you that the model is overfitting on the test data, which would mean the out-of-sample error is expected to be low. Thus, this answer is also incorrect.

If two variables are highly correlated, what does that tell you about the two variables?

D is the correct answer. A is wrong: the correlation can be negative while the variables are still highly correlated. B is wrong for the same reason as A; it can be a negative correlation. C is wrong: the assumption is incorrect, as it is possible to also have correlation with other variables.

Which of the following statements is FALSE with regards to supervised and unsupervised learning algorithms?

D.) Correct, because the whole idea behind unsupervised learning is that labeled data is not required. They attempt to cluster the data into a structure to make sense of it. A.) is wrong because that is the idea behind a supervised learner, you give it labeled data and attempt to perform classification or regression B.) is wrong because you can just not pass in the label portion of the dataset into the algorithm or you can pass it in and allow it to use that as a feature to cluster on. C.) is wrong because lesson "03-01 how machine learning is used at a hedge fund" slide 4 lists these as supervised learning algorithms.

Which of the following machine learning algorithms is faster than the rest?

Deciding which one is the fastest depends on whether we are looking at training time or testing time.

Pick the choice where the statements regarding KNN, Decision Tree and Linear regression are all true

Decision tree is a supervised learner. It is an eager learner because it builds a classification model based on the training data and then uses it for classifying test data. It takes both numeric and nominal features. It is used for classification and not regression. KNN is an unsupervised learner. It is a lazy learner. It is computationally intensive because it needs to go through all the data to give the result. Linear regression does not store any data. As soon as an equation is formed to fit the data, the data is thrown away. It takes only numeric features.

Which of the following statement is incorrect?

Decision tree is an instance-based learning algorithm, and it does not make any assumptions about the distribution of the data.

Which learning algorithm has the following characteristics: query time = log(n_samples), training time = n_samples*n_features*log(n_samples)?

A decision tree traverses roughly log(n_samples) splits during a query. Training time is proportional to n_samples*n_features*log(n_samples).

Which one of the models below is a parametric model?

Decision tree, KNN and neural networks are instance based models. On the other hand, a linear regression model is a parameterized model

Which of the following best help to reduce overfitting?

Decreasing the amount of training data is the opposite of what an ensemble learner with bagging does: we want to increase the variety of training samples, and bagging allows us to create more training samples from our training set. Answer (c) is incorrect because we do not want our model to fit the data exactly, as this leads to overfitting and fitting the noise. Answer (d) is incorrect because (a) and (c) are incorrect.

Suppose you have a regression model with 5 parameters, and a KNN model with k=5. What could you do to reduce the overfitting in both the parametric model as well as the non-parametric, instance-based model?

Decreasing the number of parameters in a regression model reduces overfitting, while increasing the k value used in KNN reduces overfitting, so the answer is (B).

Supervised learning differs from unsupervised clustering in that supervised learning requires

Definition of Supervised learning

Aziz has been tweaking his decision tree algorithm to improve performance. In his most recent change, he decided to test each input parameter (X_i) in sequence at each successive level of the tree (i.e. parameter 0 at the root, parameter 1 at level 1, parameter 2 at level 2 and so on, looping back as needed). Previously Aziz had chosen which parameter to test at each branch by selecting the one with the highest correlation to Y values for the remaining training data. When he updates his decision tree in this way Aziz is developing...

Determining correlation has been completely eliminated for all parameters at every branch of the tree. Since correlation must be recalculated over all records within a branch/subtree, eliminating it in this way greatly speeds up the tree construction time. The minor downside is that it could result in an unbalanced tree.

For a number N of training samples, which of the following is faster to train?

Doesn't really matter what K is, KNN is the fastest in training because all it does is store the samples in memory. No computation, nothing else going on. The slowest would probably be decision trees of leaf size 1 because it takes more recursive steps to get to 1 observation.

Given a training set of 6000 samples and testing set of 4000 samples, which of the following is false regarding Bagging?

Each bag should contain the same number of samples as the training data, which in this case is 6000.

Let N be the number of samples in our training set. Which of the following statements is true about bagging?

Each bag usually contains N' samples where N' <= N, therefore statements A, B, and D are incorrect. Random selection of N samples from the training set of size N with replacement will give on average (1-1/e)*N = 0.632*N distinct samples; if N'<N, this number will be even less. Therefore expectation of the number of distinct samples is usually less than 0.65*N.
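
A quick way to sanity-check the 0.632 figure is to simulate the sampling-with-replacement step. This is an illustrative sketch in NumPy; the training-set size and number of trials are made up:

import numpy as np

# Simulate bagging: draw N samples with replacement from a training set of
# size N and count how many distinct originals end up in the bag.
rng = np.random.default_rng(0)
N = 6000          # illustrative training-set size
trials = 1000

fractions = []
for _ in range(trials):
    bag = rng.integers(0, N, size=N)           # indices drawn with replacement
    fractions.append(len(np.unique(bag)) / N)  # fraction of distinct samples

print(np.mean(fractions))   # about 0.632, matching (1 - 1/e)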

Given a set of training data which contains both features and labels, which of the following Machine Learning algorithms would be good choices for predicting the labels for a set of test data which contains only features? (1) Decision Trees (2) K Nearest Neighbors (3) Linear Regression (4) Random Forests

Each of the four algorithms is suitable for solving a supervised learning problem. Each of these was discussed as a valid supervised learning approach in lectures and in the context of the MC3_P1 project.

Which of the following is NOT an example of a classification problem?

FICO score is a continuous value, and predicting it is thus a regression problem.

Which of the following options does NOT improve the performance of decision tree construction?

For A, it reduces the number of subtrees. For B and D, finding the best-correlated feature and the median can be time consuming. For C, a well-balanced tree requires the split value to divide the data evenly, which requires the median of all the feature values, so it takes a lot of time.

Which of the following three models, kNN, decision trees, and linear regression, can be guaranteed to achieve perfect in-sample prediction accuracy?

For kNN with k=1, the closest observation to every in-sample query point will be the point itself, so it will output the same value. For decision trees with leaf_size=1, the tree will be expanded until it perfectly fits the in-sample data, and every in-sample query point will be represented in the tree. Choice A is wrong because k=n (n = number of training observations) will always output the mean; however, if the student is not familiar with the algorithm this answer may be appealing. Choice B is wrong since increasing the leaf size usually increases in-sample error. Choice C is wrong because a second-order polynomial regression will only fit the data perfectly if there are three observations. It is possible for a linear regression to achieve perfect in-sample accuracy, but only if we include a degree n-1 polynomial, so students who do not understand linear regressions may choose choice C.

Which of the following three models, kNN, decision trees, and linear regression, can be guaranteed to achieve perfect in-sample prediction accuracy (assuming there are no one-to-many relationships between the X to Y values)?

For kNN with k=1, the closest observation to every in-sample query point will be the point itself, so it will output the same value. For decision trees with leaf_size=1, the tree will be expanded until it perfectly fits the in-sample data, and every in-sample query point will be represented in the tree. Choice A is wrong because k=n (n = number of training observations) will always output the mean; however, if the student is not familiar with the algorithm this answer may be appealing. Choice B is wrong since increasing the leaf size usually increases in-sample error. Choice C is wrong because a second-order linear regression is only guaranteed to fit the data perfectly if there are three observations. It is possible for a linear regression to achieve perfect in-sample accuracy, but only if we include a degree n-1 polynomial, so students who do not understand linear regressions may choose choice C.

Which algorithm would you use for a classification problem where all features are binary, i.e. 0/1?

For binary attributes, it makes sense intuitively to opt for decision trees. However, we can use K-NN as well because the data points that have similar binary attribute values will be closer to each other, so K-NN would fit the model well. Since it is given that it is a classification problem, the output value doesn't matter.

Concerning the cost of learning(training) and cost of query with regard to kNN, decision trees, and linear regression learners, which of the following statements is true?

For cost of learning: kNN < Linear Regression < Decision Tree, and for cost of query: Linear Regression < Decision Tree < kNN.

Consider the two problems: 1) We want to estimate how many students attending Georgia tech will be attracted to a particular beer as its bitterness(hoppiness) increases. 2) We want to estimate the total Kinetic Energy of moving automobiles based on their total mass. Which type of model would best solve each of the above problems? {KE = .5*m*(v^2) }

For problem 2, we can start with an equation for kinetic energy and then use a parametric learner to find the correct parameters. However, in the case of students' preference for beer taste in problem 1, we do not have an underlying formula to begin with.

For simple bagging, how can we choose the number of learners that will help reduce overfitting?

For simple bagging, increasing the number of learners m helps reduce overfitting: as m increases, the predictions become more generalized because the predictions of the individual learners are averaged.

When compared to trees produced with random features and random splits (PERT), trees that are produced with the GINI criterion (CART), in general:

From the chart on page 3 of Cutler's paper on PERT and Cutler's remarks in the paper, the CART trees had significantly fewer nodes for perfectly-fitting trees, even though PERT was trained much more quickly.

Which among the following is NOT a metric to determine the best feature to split on?

Gaussian Naive Bayes is a parameter-estimation technique for Naive Bayes classifiers; it is used to handle continuous values during parameter estimation. It is not a metric for choosing a feature to split on.

Which is NOT a correct way to reduce overfitting?

Getting higher testing accuracy by training the model with testing data is useless because the model will overfit to the testing data.

Which of the following would always give a perfect performance (zero RMS error / 100% correlation) for in-sample (i.e. training) data after learning that same training data?

Going over all possibilities: Linear regression (polynomial the size of the training set): This can work for some algorithms. The model can have enough information to create a polynomial that goes through all training points. Linear regression (y=ax+b): Nope. Degree 1 polynomial doesn't have enough information. Random Forest (homework version, with sampling repetitions for bagging), leaf_size = 1: Nope. Not all the training data is used for training, so learner will almost never have enough information and the question asks for "always". kNN, k=1: Yes. Serious overfitting. The prediction will just grab the training point stored in training and return it as is. Random Tree (homework version), leaf size = 1: Yes. Serious overfitting as well. Each training point has its own leaf in the tree for perfect training set predictions. Hence, the correct answer is D, "All of the above".

Which of the following is more likely to happen if you end up with n out of M bags containing the exact same selection of elements? (2 <= n < M).

Having equal bags will lead to having equal submodels, which in turn will bias the overall result producing a model "overfitted" towards the train data of those models. This invalidates A and B. C is wrong because bagging allows repetition (lecture 03-04). If you CAN recreate bags (no worries about time penalties mentioned) then diversity in your models will lead to a less overfitted ensemble model.

What is the impact of increasing the number of bags used in a Random Tree learner on out-of-sample RMSE when the leaf size is fixed? Assume that the same learning data is used for all bags.

Since the same learning data is used for all bags, with a large test data set there will not be much of a change in out-of-sample RMSE.

Which of the following statements are True?

I. Parametric models are more space efficient than instance based models, but their training time is relatively slower.
II. Querying time of Parametric models is faster when compared to Instance Based Models.
III. Parametric models are more space efficient than instance based models and their training time is relatively faster.
IV. Querying time of Parametric models is slower when compared to Instance Based Models.
Parametric models do not store any data; the parameters need to be recomputed when more data is gathered, and only the solution equation is used to query. Instance-based models, in contrast, store all data points and consult the data to query.

Consider the below decision tree algorithms. Which algorithm uses "information gain" as splitting criteria?

ID3 uses information gain as splitting criteria

Assume a dataset with all discrete features. In Random Tree, an attribute is selected at random to split the data. In ID3 and C4.5, we calculate Entropy and Information Gain to determine the best attribute to split the data on. Let's suppose at each step, we instead choose an attribute with least Information Gain and also suppose we can reuse/repeat features for the nodes. What is the effect on the resulting tree when compared to a tree built using ID3 and C4.5 ?

IG enables ordering of nodes in a decision tree with the most important nodes at the top. This enables the branches to be as short as possible. If attributes are selected with the least IG, the tree will grow much larger in depth and size (provided the features can be repeated).

Which of the problems below is most appropriate for a classification algorithm (as opposed to regression)?

Identifying a tumor as malignant or benign is a classic example of a classification problem. The other three answers are all continuous numerical values, for which classification is not commonly appropriate.

If you want to normalize your data, which of the following commands in Python is correct? (df is short for data frame)

If you want to normalize your data, you should divide all the data by the first row, and the command df.ix[0,:] slices exactly the first row out; therefore the correct answer is A.
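
For illustration, here is a small sketch of that normalization in current pandas; the df.ix indexer referenced above has since been removed from pandas, so iloc is used here, and the data frame contents are made up:

import pandas as pd

# Hypothetical price data; in practice df would come from reading CSV files.
df = pd.DataFrame({"SPY": [100.0, 102.0, 101.0],
                   "XOM": [50.0, 49.0, 51.5]})

# Normalize so every column starts at 1.0 by dividing by the first row.
normed = df / df.iloc[0]
print(normed)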

Which of the following is not true about arbitrage as a hedge fund investing strategy?

In Arbitrage strategy, the profit margins are small and opportunities are rapidly competed away by other arbitrageurs, hence hedge funds use programmed or high-frequency trading systems to act on opportunities very quickly.

We are ______ likely to overfit as K increases in KNN and ______ likely to overfit as D increases in a parametric model.

In KNN you take the average of the K nearest neighbors, so a higher K leads to an average that is less biased towards a single point. With parametric models, as we increase the polynomial degree we are allowing the equation to become quite specific to the data itself.

Which of the following is TRUE about overfitting?

In a KNN model, if we increased k, say k=N, then all neighbors would be considered and hence there is no overfitting; it occurs when K is small. In a polynomial model, if d is small, say d=1, then it is a linear model and cannot overfit; when d is larger, it is more likely to overfit. In the project, we saw that with the number of bags fixed, the larger the leaf size, the better the performance of the model. Also, with the leaf size fixed, the larger the number of bags, the better the performance of the model. So the correct answer is B.

Which of these correctly describes bagging?

In bagging, if we have N data points for training, we pick one data point to put in each bag N times, i.e., we pick N data points for each bag with replacement. This usually results in each bag containing about 60% unique data points. Then we can create B such bags, build a model for each bag, and let these models vote to give the final answer.
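
A minimal sketch of that bag-then-vote/average loop, using made-up data and a stand-in 1-nearest-neighbor learner (purely illustrative, not the course's BagLearner API):

import numpy as np

rng = np.random.default_rng(2)
train_x = rng.uniform(0, 10, size=(100, 1))
train_y = 2 * train_x[:, 0] + rng.normal(0, 1, 100)

B, N = 20, len(train_y)
bags = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)      # N draws with replacement per bag
    bags.append((train_x[idx], train_y[idx]))

def predict_1nn(bag_x, bag_y, query):
    # Stand-in learner: 1-nearest-neighbor on the bag's data.
    return bag_y[np.argmin(np.abs(bag_x[:, 0] - query))]

query = 5.0
ensemble_pred = np.mean([predict_1nn(bx, by, query) for bx, by in bags])
print(ensemble_pred)    # the bags' individual answers are averaged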

In general, which of the following models has the slowest query time?

In general, K-Nearest-Neighbor is the slowest to query because all the data is used in calculating the prediction. There may be an edge case (only one data point) where it could match the other models, but in general it will be slowest to query.

Which of the following best describes the part of an error function's graph (i.e. RMSE) where overfitting is occurring between the test data set and its training data set?

In overfitting, a model is produced that follows the training data too closely, to the detriment of the model's usefulness on new data. This is most visible in the error graph when there is a downtick in the error function for the in-sample tests, showing a better result for that data set, while at the same time the error in the testing data set (out of sample) is increasing.

Which answer defines overfitting?

In-sample error is decreasing while out-of-sample error is increasing. When we train on in-sample data, we can expect its RMSE to decrease; in contrast, when we test on out-of-sample data, we can expect its RMSE to increase.

You've been asked to implement a rain prediction decision tree algorithm for a new line of sensors to be deployed across the globe at many diverse locations and weather conditions. You have access to one year's worth of data collected from one of the sensors located in New Mexico, where the climate is rather dry throughout most of the year. It includes the following features: average daily temperature, average daily pressure, and includes the total amount of rain observed for each day (in inches). After training your model on a randomly selected set of 60% of the data and using the remaining 40% of the data for testing the trained model, you obtain the following results:

In-sample error: 0.55
In-sample correlation: 0.5
Out-of-sample error: 0.4
Out-of-sample correlation: 0.65
Based on the information collected, and given that the sensor does not suffer from any noise artifacts (i.e., assume perfect measurements), does the model suffer from overfitting? This model may very well generalize across other locations, but it may not. It may also be overfitting to New Mexico weather, and/or the model may be too complex. Both our training and test data are insufficient to reach this conclusion, given the goal of deploying this model across many diverse locations.

Which of the following statements regarding supervised learning and unsupervised learning is correct?

In supervised learning we know what the output should look like and there is a relationship between the input and the output

A benefit of creating a decision tree by splitting data with random splits instead of splits based on correlation is:

In the JR Quinlan decision tree algorithm, the correlation has to be calculated for each column. This is an expensive operation. To speed up the algorithm, A Cutler used random splits. Random splits don't prevent overfitting.

Which of the following statements is true

Instance based models are lazy-learners and hence store all of the training data, which requires more storage space than parameterized models which only store the parameters of the model and throw away the training data after the model has been built.

Which option is true for instance based models?

Instance-based models deal with the actual training data, so they make fewer assumptions.

List one advantage and one disadvantage of Instance Based Models over Parameterized Models:

Instance-based models, for example KNN or decision trees, do not build a functional form to summarize the data; that is why they assume less about the data. Instead they learn from the instances of the data points. Since they learn from each instance of the data, as the data set grows the computational complexity of the model also grows linearly. Also, since instance-based models depend on the number of instances of the data, a big data set allows better precision of the model. And because they do not assume anything about the data and do not form a function to summarize it, they can form a more precise predictive model on unseen data.

For which of the following applications would an instance-based model have the greatest benefits over a parameterized model?

Instance based models such as KNN do not take very long to train a model and adding new data just requires the addition of new data points to the existing model. Querying speed is an advantage of parameterized models, so answer A is incorrect. Instance-based models have much larger space requirements than parameterized models so Answer C is incorrect and they are unbiased so answer D is not a good choice either.

For which of the following applications would an instance-based model have the greatest benefits over a parameterized model? An application whose priorities are...

Instance based models such as KNN do not take very long to train a model and adding new data just requires the addition of new data points to the existing model. Querying speed is an advantage of parameterized models, so answer A is incorrect. Instance-based models have much larger space requirements than parameterized models so Answer C is incorrect and they are unbiased so answer D is not a good choice either.

The major advantage of an "Instance-Based model" as compared to a "Parameterized model" is

Instance-Based models seek to best fit the training data in constructing the mapping function, whilst maintaining some ability to generalize to unseen data. All other options are benefits of Parameterized models over Instance-Based models

Which type of model is best-suited to adding new data?

Instance-based is best to add new data because you don't need to change an existing underlying model. Even though it is easy to add testing and/or training data, those aren't models.

Which of the following algorithm(s) can easily be extended to streaming data (i.e. online algorithms)?

Instance-based models are easy to extend in the online setting, as it would just involve adding the incremental data to the model set. The other options are not easily extensible and are an active research topic.

In regard to the K-NN model, Linear Regression and Decision Trees, which of the following is true?

A KNN model does not need training/learning, while Linear Regression and Decision Trees do.

Overfitting happens when a model learns the nuances of the training data too well. The reason for this is:

It is important for a ML algorithm to be able to generalize. We are not selecting the perfect answer, but the best answer from the data provided and what we've seen before.

Which of the following statements is FALSE about Boosting?

It runs multiple times on training dataset, not testing dataset.

Which of the following is an incorrect statement about bagging?

It's the other way around, "Bagging is an approach to ensemble learning that is based on bootstrapping." .

An instructor was curious about the association between students and class forum data. The instructor wanted to know two things: (1) Can you predict the grade of a student based on their post volume to the class forum? (2) Are there any other inferences that can be made about students and class forum data? Given data that contains past semester student information (grades, classes taken, age, etc.) and class forum data (post volume, ratings, length, instructor comments, etc.), which type of learning algorithm would be best suited for the above items?

Item (1) can be addressed with labeled training data gathered from student grades and their class forum post volume. A supervised learning algorithm is best suited for this scenario. Item (2) would be best served using an unsupervised learning algorithm in order to cluster the data into potentially interesting groupings.

Under what circumstances would you prefer decision trees over K-NN?

K-NN takes no time to train, but very long to query. K-NN can be updated easily as it can learn incrementally due to its instance-based learning paradigm. Both algorithms are used for classification so that is not a deciding factor.

Which of the following statements are false?

K-nearest neighbor is actually a type of instance-based model. The other answer choices are true.

Of the three learning algorithms, K-nearest neighbor, decision trees, and linear regression, which is the most resource intensive operation to query test data and why?

KNN. In order to classify any new data point using KNN, the entire data set must be used, meaning the training data must be held in memory. This is not true for decision tree or regression learners, and it results in the cost of query for KNN being the highest of the three, especially as the training data set becomes very large.

Which of the following is the most accurate answer regarding a K Nearest Neighbor (KNN) learning algorithm?

KNN Learners are fast to train and slow to query. Essentially all the data is stored as a model so it takes a lot of space. When K is a smaller number like 1, overfitting is very likely.

Select a learner with least cost of learning and a learner with least cost of query.

KNN has the least cost of learning, and linear regression has the least cost of query

A single learner running KNN, another running a decision tree model and another running linear regression tree all begin training identical copies of a data set at the same time. Assuming a reasonably large data set, which learner will finish first? Choose the best answer.

KNN has the least learning time cost out of the listed models.

Which of the following does not require training?

KNN just stores the data, so it spends no time on training.

Which of the following learner has the fastest training speed given the same data set, under the condition that overfitting won't occur?

KNN should have the fastest training speed because it is instance-based learning, which only needs to store all the data during the training phase. But with K=1 it is almost certain to overfit, as is a decision tree with leaf_size=1, so A and C are not correct. As for B and D, they generally will not overfit, but a decision tree needs to split the data recursively and select the median of each split, which requires more training time than linear regression. So B is correct.

Which of the following would NOT be an advantage of KNN in comparison to Linear Regression?

KNN, an instance based model also known as 'Memory Based Learning' requires more memory than parameterized models like Linear Regression. The extra memory requirement is used to store problem instances. It keeps training data and uses it to compare to new problem instances.

Which of the following CANNOT reduce or eliminate overfitting?

As learned from MC3P1, A, B and D can improve the model's performance. Increasing the amount of training data can also help, while decreasing it cannot. So the answer is C.

What best describes a key benefit of parameterized models?

Learning is fast because parameterized models only are required to estimate the parameters of the function.

For which value of k from the following options does the KNN model have the least in sample RMS error?

The lesser the value of k, the more closely the model fits the data, resulting in lower in-sample RMS error (in the case of k=1 the error will be the lowest, i.e., 0).
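
A small sketch illustrating the k=1 case with a hand-rolled kNN (not the course's API; the data points here are made up):

import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    # Average the labels of the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(train_x - query_x, axis=1)
    nearest = np.argsort(dists)[:k]
    return train_y[nearest].mean()

train_x = np.array([[0.0], [1.0], [2.0], [3.0]])
train_y = np.array([0.5, 1.5, 1.0, 3.0])

# With k=1, each training point is its own nearest neighbor, so in-sample
# predictions equal the labels exactly and the in-sample RMSE is 0.
preds = np.array([knn_predict(train_x, train_y, x, k=1) for x in train_x])
rmse = np.sqrt(np.mean((preds - train_y) ** 2))
print(rmse)   # 0.0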

You were asked to develop a demo learner app as quickly as possible. Soon you realize the given dataset is not so big, which naturally makes you concerned about overfitting. Given this short notice, you will prefer

Linear Regression, because it generalizes smaller datasets well and is easier to train than decision trees. Both KNN and decision trees suffer from overfitting problems on smaller datasets.

Which of the following regression tools are parametric? - Linear Regression - k Nearest Neighbour - Decision Trees

Linear Regression involves representing a model using a set of parameters, such that it can be represented as a polynomial or a line that fits the data. k Nearest Neighbour and Decision Trees instead involve recording a set of training data in some data structure, which is then queried against to predict a value.

Being given a very large dataset, and ensuring that the models do not trivially overfit (k and leaf size both greater than 1 for kNN and Decision Trees), select the best possible order when it comes to memory space for saving the model.

Linear Regression is the best in terms of space for saving the model. For decision trees and kNN: if the leaf size is strictly greater than 1, the size of the stored model is at most a number strictly less than the number of rows in the data. The space used is then lower for decision trees than for kNN, which needs to store the whole dataset.

Out of K-Nearest Neighbor, linear regression and decision tree, which model has the least query time?

Linear Regression model has least query time since it only calculates output from known parameters and equation. Decision tree takes longer to query since it must traverse the full tree until it reaches a leaf. KNN must calculate the distance from each data point and takes longer at query time than linear regression.

Which of the following statements about properties of k-NN versus, decision trees, and linear regression is true?

Linear regression doesn't eliminate the correlation between your input variables. Sometimes introducing too many variables may overfit your training set, and that will cause the model to fail.

Which learning method is the fastest query-wise?

Linear regression is fastest because calculating the prediction is as simple as plugging in all of your independent variable values and computing the result (to get y, the dependent variable). kNN can be quite slow because you have to calculate the distance from your query to all of the other data elements (potentially thousands), sort the distances, take the top k values, and calculate the mean of those top k values. Decision trees are fairly fast at query time, but they are slower than linear regression; in general, query time is O(log(n_samples)). A tree's path is variable depending on the input, unlike linear regression, which is one plug-and-play equation. Neural networks are also slower: this black-box learning method is computationally intensive, particularly if the network topology is complex.

You are tasked with building a model to predict y-values. The x-values have only 1 degree of freedom. You are given historical data with x-values between 0 and 100 to work with. However, your client wishes to predict y-values for x-values between 150 and 200. Which type of model should you build for the client?

Linear regression is the correct answer because it is a parametric model that can be used to extrapolate values beyond the range of inputs whereas an instance based model cannot.

Which of the following learning algorithms is the fastest to query (assuming a non-trivial model)?

Linear regression is the fastest to query, as it simply involves substituting the input into a linear equation. Decision trees would be next fastest, but involve a number of comparisons for a non-trivial model. kNN for any value of k is the slowest, as all training data points must be considered.

Which of the following learners is not susceptible to overfitting?

Linear regression models are susceptible to overfitting as the degree of polynomials increases. Random tree learners overfit when the leaf size of the learner is small. Ensemble learners using boosting tend to overfit as the number of bags increases. According to Leo Breiman, random forests (bag learners composed of random tree learners) do not overfit.

We want to select a classifier mapping (x, y) coordinates to the corresponding quadrant. Which of the following algorithms would perform the worst in terms of accuracy?

Linear regression works well only when there exist parallel decision boundaries in the feature space, while decision trees can have decision boundaries orthogonal to the feature axes. k-Nearest Neighbor would also work well using Euclidean distance as the similarity function.

Order the algorithm starting from fastest to slowest for querying

Linear regression: you just need to do multiplication. Decision tree: you only need to traverse the tree (log(n)). KNN: you need to compute the distance from every point in the dataset to the query.

Which of the following is not a supervised learning algorithm?

Linear regression, decision tree and KNN are all supervised learning algorithms we talked about in class, which leaves k-means as the only answer. Indeed, k-means is an unsupervised clustering algorithm. Even though we didn't talk about k-means in class, it should not be a problem, since the other three are clearly supervised for anyone who is following along in the class.

Which of the following pairs best fill the blank for regression & classification, respectively, in the situation: "Forecasting the change in _______ of a stock's closing price from one day to the next."

Magnitude is a continuous number (regression), and direction is a discrete category (classification).

Select the statement below that best reflects when overfitting occurs in a parametrized polynomial model and a KNN model:

More degrees of freedom tailor a parameterized polynomial to the specific data it is being trained on, while a KNN model overfits the training data when fewer data points are averaged when calculating the nearest neighbors.

When building a decision tree, which method will result in the GREATEST impact on the results produced from querying the final tree?

Most measures of information gain, including the three mentioned in A, B, and C, are very consistent with one another. As such, the metric used for information gain will have very little effect relative to pruning the tree after constructing it.

You're part-timing it for a Canadian investment firm. An algorithm that you developed in the States can predict tomorrow's stock prices with an RMSE of 5 US dollars. The Canadians have a similar algorithm, but it works in Canadian dollars. What RMSE value would indicate the Canadians' algorithm is equivalent to your own, given the exchange rate is one US dollar to 'X' Canadian dollars?

My topic was: measuring the quality of predictions using RMSE, correlation, other? (X^2) indicates X squared, or X * X. This question is supposed to illustrate that RMSE scales linearly with the magnitude of the input. MSE scales quadratically, but RMSE scales linearly.
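
A small numerical sketch of that linear scaling, with made-up prices and an assumed exchange rate X (both values are illustrative only):

import numpy as np

rng = np.random.default_rng(1)
actual_usd = rng.uniform(20, 200, size=100)          # hypothetical prices in USD
predicted_usd = actual_usd + rng.normal(0, 5, 100)   # predictions with roughly $5 error

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

X = 1.35   # assumed USD -> CAD exchange rate, for illustration only
print(rmse(actual_usd, predicted_usd))            # about 5 (USD)
print(rmse(actual_usd * X, predicted_usd * X))    # exactly X times larger (CAD)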

Which of the following statements is TRUE in regards to overfitting of linear regression models?

Note: D = number of dimensions, N = number of test inputs. Overfitting is more likely as D increases because the resulting polynomial will produce a line that more closely matches the training data, which most likely contains either some random error or some noise.

ID3 and its extension C4.5 are two different methods used to build a decision tree. Which of these statements is true only for ID3?

Only one attribute at a time is tested for making a decision. The algorithm uses entropy and information gain to find the best attribute to split the data on. There are various splitting criteria based on the impurity of a node; the main aim of a splitting criterion is to reduce the impurity of a node.

Which of the following is false about supervised and unsupervised learning?

Only supervised learning requires training data with corresponding result values, unsupervised learning only requires training data.

Which model is most likely to overfit? Let n = the number of training instances.

Options A and D would be very likely to underfit the data and Linear regression is less likely to overfit than using a "not-so-weak" learner with adaptive boosting.

In Machine Learning, why is overfitting considered bad?

Overfit models exhibit all of the above symptoms:
- They do not generalize, but memorize the relationship
- They cannot effectively filter out noise from the relationship
- They violate Occam's razor principle

Which of the following is true in regards to overfitting:

Overfitting describes noise, not data, which is answer A. B is false, as overfitting has a high correlation coefficient in the training dataset. C is false, as overfitting does not perform well on testing datasets. D is false, as you simply retrain your model using different parameters.

Which of the following techniques can minimize overfitting when you have a small dataset?

Overfitting happens when a model contains errors and noise from the training data. Because of the small nature of the dataset, there is not enough data to effectively analyze the data and thus errors and noise will be highly evident. By using roll cross validation with slicing, more data can be effectively created.

You've built a new machine learning algorithm for trading that can be tuned for accuracy by adjusting a parameter you call "A" for accuracy. When you use the algorithm to train predictive models using historic stock data you find that models created with a large "A" values are very accurate (error rates are low) whereas models created with a small "A" values are not very accurate (error rates are high). When you test these same models using real-time stock data, however, you find just the opposite. What do you think might cause this problem?

Overfitting is a common problem in machine learning. The algorithm builds a model that matches the training data very closely but that model does not generalize well when faced with data it has never seen before, so it performs poorly against the 'test' data. In this case the training data is the historic stock data and the test data is the real-time stock data.

Which of the following statements is TRUE in regards to overfitting of linear regression models? Note: D = number of dimensions, N = number of test inputs

Overfitting is more likely as D increases because the resulting polynomial will produce a line that more closely matches the training data, which most likely contains either some random error or some noise.

Which of the following statements is TRUE in regards to overfitting of linear regression models? Note: D = number of dimensions

Overfitting is more likely as D increases because the resulting polynomial will produce a line that more closely matches the training data, which most likely contains either some random error or some noise.

Which of the following best describes the definition of "overfitting"?

Overfitting is when a machine learning algorithm's model is too specific to its training data. So, it's the area where error is continuing to decrease on that training data but error is increasing for unseen test data, as the learner's model does not represent that data well.

What is overfitting?

Overfitting is when a model trusts the training data too much, so that the model also captures noise; this leads to poor performance on the testing set.

Which of the following is NOT true about overfitting?

Overfitting should be suspected when the TRAINING error is zero. Because of the increase in complexity, flexible models fit the training set too closely, causing overfitting and possibly a zero training error. The other options are correct.

Which of the following statements about parameterized models is TRUE?

Parameterized models are called that because they expect some parameters/features identified to train on. Instance-based models do not need features to be identified beforehand as they train best over large sets of training data where the user may not have a good idea of what the makeup of the data looks like.

Which of these is typically the greatest advantage to using a parameterized model rather than an instance-based model?

Parameterized models are usually more expensive to train than instance-based models, which are more storage intensive but often do little processing during training. kNN is an example of an instance-based model, while a linear regression model is an example of a parameterized model. The advantage of a parameterized model is that the hard work in training the model often pays off in query performance: a linear model can be computed against new data very quickly once it has been fit to the data. Depending on the expressiveness of the model, many parameterized models can easily overfit data, such as a high-order polynomial regression model, so c) is not a good choice, and parameterized models such as a neural network or support vector machine are at a minimum not any easier to understand than an instance-based model like kNN, so d) is not a good choice either.

Problem 1: A dataset for the types of fruits, with columns for color, length, diameter, density and taste. Predict the type of a particular fruit based on available data. Problem 2: A dataset for the price of houses with columns for number of rooms, number of bathrooms, square-footage, location, and price of the house. Predict the price of other houses based on the data available. What would you use for the above two problems?

Regression is used to predict a continuous value based on the available data (when the target is a continuous number rather than a category), therefore it can predict the price of houses. Classification is used to predict the category based on the given labeled data, therefore the type of fruit can be predicted.

For KNN algorithm, when does overfitting take place?

The professor's video lectures on learning provide the following definition of overfitting: overfitting occurs when in-sample error decreases while out-of-sample error increases.

Which of the following are true for ID3 algorithm of building a decision tree?

Pruning, handling of numeric attributes and missing values were introduced in C4.5 algorithm which was an extension of the ID3 algorithm. ID3 algorithm uses information gain for splitting criteria.

Which one of the following statements is TRUE:

Query cost for linear regression is very low, since you just need to apply the formula. While the learning cost for KNN is very low, querying KNN is where the model needs to go through all of the data samples to find the k nearest neighbours.

Compared to the others, which of the below Decision Tree construction processes would save us computational cost for feature selection most?

Quinlan's algorithm in options A, B, C requires some optimization to determine the "best" feature that splits the data, hence those options are more time-consuming. Cutler's methodology simply picks a random feature to split the data and forms a random decision tree, so it is much faster to grow a tree in this way. Such random trees can then be ensembled with algorithms like Bagging and Boosting, so that model accuracy won't be affected, surprisingly.

"Based on the four classifiers below, which one is most likely to overfit: Classifier #

RMSE (Train Data) RMSE (Test Data) 1 0.2 0.3 2 0.6 0.5 3 0.3 0.3 4 0.3 0.7 " RMSE out sample > RMSE in sample. Higher difference in classifier 4 compared to classifier 2.

You cross scatter plot two vectors of data with different units and see a Gaussian blob of points spread across the graph. Which of the following is the most correct description of the data?

RMSE cannot be calculated because the vectors have different units. Correlation coefficient is a unit-less comparison that is normalized by the standard deviation of each vector. Correlation coefficient is small because the change in one variable cannot explain the change in the other.

The Root Mean Square Error equation computes the sum of squared differences of which data points?

The RMSE equation is the square root of the mean of the squared differences between the Y test values and the Y predicted values: RMSE = sqrt(mean((y - y_pred)**2)).
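
A runnable version of that one-liner, with illustrative arrays:

import numpy as np

y_test = np.array([3.0, 2.5, 4.0, 5.0])   # actual values from the test set
y_pred = np.array([2.8, 2.7, 3.5, 5.2])   # values predicted by the model

# RMSE: square the per-point differences, average them, take the square root.
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(rmse)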

Which of the following is not true about RMSE?

RMSE is a convex function, which is why it is widely used as an optimization objective.

Can learning algorithms produce different predictions on a given test set but with the same RMSE of the predictions (up to 4 decimal points)?

RMSE is a measure of error. There can be more than one set of predictions that have the same value of RMSE (up to a certain decimal point), given non-zero RMSE. Randomized learning algorithms can potentially produce such results given an adequate number of training runs.

Which of the following is not true about RMSE?

RMSE is used for regression problems and not classification problems because the RMSE value gives a continuous variable as an answer as opposed to a discrete variable that would be used for a classification problem.

Which of the following is NOT correct about the difference between Cutler's Random Tree algorithm and Quinlan's ID3 algorithm?

Random Tree algorithm does not randomly select the route while it's traversing. It rather selects a feature randomly.

Why would someone prefer a random decision tree over a normal (deterministic built) decision tree?

Random decision trees can be built faster because correlations and means don't have to be calculated for each split. However, they are generally a little larger (since splits aren't 100% optimal), which means they take longer to query and take more storage space. A random decision tree and a normal decision tree have similar accuracy, so it is not MORE accurate.

How is a random tree built using A. Cutler's approach different than a standard decision tree using JR Quinlan's approach?

Random decision trees using A. Cutler's method are very fast learners because of the random factor selection as opposed to determining a "best" factor to split on using entropy or correlation used when building a tree using Quinlan's approach.

Which decision tree algorithm has the least overfitting?

The advantage of a random forest over a single decision/regression tree mostly lies in the bootstrap sampling of both examples and features during training and the ensemble averaging at the end, so overfitting is largely eliminated. Boosting is more sensitive than Random Forests to outliers and parameter choices, so overfitting can occur there.

For building a decision tree, why might you choose the split feature of a node randomly instead of using entropy or Gini index?

Random selection is chosen because it is significantly faster to choose a feature randomly than to determine the information gain of each attribute using entropy or Gini index.
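
A rough sketch contrasting the two per-node choices; the correlation score below stands in for an information-gain or Gini calculation, and all names and data are illustrative:

import numpy as np

def best_feature_by_correlation(data_x, data_y):
    # Quinlan-style choice: score every feature (here by |correlation| with Y)
    # and pick the best one -- O(n_features * n_samples) work per node.
    scores = [abs(np.corrcoef(data_x[:, i], data_y)[0, 1])
              for i in range(data_x.shape[1])]
    return int(np.argmax(scores))

def random_feature(data_x, rng):
    # Cutler-style choice: pick any feature index at random -- O(1) per node.
    return int(rng.integers(data_x.shape[1]))

rng = np.random.default_rng(0)
data_x = rng.normal(size=(500, 8))
data_y = data_x[:, 3] + rng.normal(scale=0.1, size=500)
print(best_feature_by_correlation(data_x, data_y))   # 3, the informative feature
print(random_feature(data_x, rng))                    # any of 0..7, chosen instantly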

Compared to the normal decision tree, what is the key feature of the random tree algorithm?

A random tree builds faster, and it randomly selects the feature to split on. There is no difference between the two algorithms' outputs.

What is the most likely reason for overfitting when training a random tree learner?

Random trees with small maximum leaf sizes will grow very large in an attempt to perfectly classify every training instance. This will result in a lower training error, but the algorithm will have a harder time generalizing to unseen data.

Which of the following tasks would be best accomplished using regression learning method?

Regression method enables us to predict the value of a continuous variable. In this question, stock price is most likely to be in the form of real number, therefore a continuous variable. The other tasks are mostly seeking discrete values.

In which case is it better to use a regression learner than a classification learner?

Regression returns continuous values while classification gives discrete values. Option A has the potential to be any real number. The other options have discrete labels that apply to them such as a single number from the set 1 to 10 or yes or no.

What technique is often used to avoid overfitting when working with a machine learning algorithm?

Regularization introduces additional information to the program that can prevent overfitting when successful. Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. Obtaining more training data also helps solve overfitting.

What does a high RMSE tell us about the quality of the model in terms of its predictions and correlation of data?

Root Mean Square Error (RMSE) indicates the absolute fit of the model to the data. Higher values of RMSE indicate a worse fit (and thus a lower prediction accuracy). This is because the formula takes the squared difference between the predicted and the actual values and thus, a high RMSE indicates a large difference between predicted and actual. Correlation and RMSE have an inverse relationship so a high RMSE yields a low correlation.

What is the key difference in the way a random tree is implemented that leads to faster training times over a classic decision tree?

Selecting a random feature to split the data is the key difference between random trees and classic decision trees. By not having to measure correlation to the labels for thousands of data points and just selecting a random feature, training times are sped up considerably.

Rank the following in decreasing order of query cost: KNN, Decision Trees and Linear Regression.

Since Linear Regression is parameterized, it only requires a simple computation at query time, hence a low query cost. A decision tree requires traversal of the tree down to a leaf and hence needs more time. KNN has the highest query cost, as the K nearest neighbours need to be located at query time.

Given a small dataset with an extremely large number of predictors, which model is expected to perform better: an instance-based model or a parametric model?

Since instance-based models use previously seen instances to make predictions, it is difficult to fit little data in a high-dimensional space: the little data will be spread across all dimensions, so the data points are sparsely distributed and each point has little predictive power. Parametrized models, on the other hand, will fit a given model using the data points by trying to optimize an objective function. Therefore, even with only a few data points, they can still make a reasonable prediction.

Which of the following quality metrics is robust to outliers?

Since mean absolute error gives equal weight to all data points, it is robust to outliers.

When recursively building a non-random regression tree, which methodology is most suitable for calculating the split value of a discrete feature?

Since this is a regression tree, the target variable is continuous. Thus, with a discrete variable, the most plausible choice is to split by variance minimization.
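
A sketch of choosing a split value by variance minimization; this is an illustrative helper with made-up data, not the project's API:

import numpy as np

def best_split_by_variance(feature, y):
    # Try each distinct feature value as a split point and keep the one that
    # minimizes the size-weighted variance of y in the two resulting groups.
    best_val, best_score = None, np.inf
    for val in np.unique(feature)[:-1]:
        left, right = y[feature <= val], y[feature > val]
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best_score:
            best_val, best_score = val, score
    return best_val

feature = np.array([0, 0, 1, 1, 2, 2])
y = np.array([1.0, 1.2, 1.1, 0.9, 5.0, 5.2])
print(best_split_by_variance(feature, y))   # 1: splitting at <=1 separates low from high y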

When utilizing Bagging with Regression Trees, which of the following is usually true?

Since we are doing regression and not classification, we must take the mean of the predictions of the trees which will smooth the individual results and reduce overfitting.

Why is roll forward cross validation used instead of standard cross validation in financial applications?

Standard cross validation allows "peeking into the future", because random selection of train and test sets might pick data points from the future for the training set.
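
A minimal sketch of generating roll-forward splits over a time-ordered dataset; it works on indices only, and all names and sizes are illustrative:

def roll_forward_splits(n_samples, n_folds):
    # Each fold trains on an initial stretch of the time series and tests on
    # the slice that immediately follows it, so the test set is always in the
    # "future" relative to the training set (no peeking).
    fold = n_samples // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_idx = list(range(0, i * fold))
        test_idx = list(range(i * fold, (i + 1) * fold))
        yield train_idx, test_idx

for train_idx, test_idx in roll_forward_splits(n_samples=10, n_folds=4):
    print(train_idx[-1], test_idx)   # training always ends before testing begins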

"To avoid Overfitting in Decision trees: (1)

Stop splitting when it is no longer statistically significant (2) Grow the tree, then post-prune " Both of these are methods used to effectively not grow a tree or reduce the size of a tree that is already too large, i.e a tree which perfectly fits all the samples in the training data set and hence, is overfitting.

What distinguishes supervised from unsupervised learning?

Supervised learning accepts input data along with the expected outcome. For example, MC3-Project 1 is an example of supervised learning since we are provided with wine quality data where the data provided are various factors used to rate the wine and the quality that was selected by an individual. Unsupervised learning would attempt to rate the wine without knowing the quality assigned to it by a human.

Supervised learning differs from unsupervised clustering in that supervised learning requires

Supervised learning and unsupervised clustering both require at least one input attribute. But in supervised learning, the output datasets are provided which are used to train the machine and get the desired outputs whereas in unsupervised learning no output datasets are provided, instead the data is clustered into different classes.

Suppose you have a dataset with features based on characteristics of animals. Each animal has a classification assigned to it, but you don't know what the classification means or what animal is represented by each data row. Your task is to classify new animal rows by finding other animals similar to it. Could this best be formulated as a supervised or unsupervised learning model?

Supervised learning is the machine learning task of inferring a function from labeled training data. It does not matter if you don't know what the label means as long as it is labeled.

Which of the following is not an example of supervised learning?

Supervised learning is the type of learning for which both sample input and output are provided to train the machine and get the desired output, while in unsupervised learning, no output data sets are provided and the goal is to find similarities in the training data and cluster data into different classes. Answers a, c and d are applications that use historical data to train the machine and make prediction on output. However, b does not have output data set; the goal of the machine is to find the similarities among genes and put them into clusters. Therefore, b is not an example of supervised learning.

Which of the following involves an unsupervised learning algorithm?

Supervised learning makes use of data to predict the outcome/label, while unsupervised learning tries to find the hidden pattern in the data. Choice A is supervised learning, because the algorithm tries to predict the stock price (continuous) based on input data. Choice B is supervised learning, because the algorithm tries to predict the recommendations (categorical) based on input data. Choice C is supervised learning, because the algorithm tries to predict the movement of the stock price (rise or drop) based on input data (actually it is very similar to Choice A). Choice D is unsupervised learning, because the algorithm does not try to predict anything, but instead tries to find the hidden pattern in the data by clustering the stocks into two groups.

Which of the following doesn't allow you to do supervised learning, but allows unsupervised learning?

Supervised learning optimizes the model by minimizing the difference between prediction and ground truth, so it needs to know the output for input data. Unsupervised learning doesn't have such requirement for training.

Which of the following is a key characteristic of supervised learning?

Supervised learning requires both input data and results (labels) to generalize to a model.

Unlike linear regression and decision tree learners, KNN learners are:

The KNN algorithm does all of its computation at query time. This makes it extremely fast to train since it only has to save the training set. Querying is slow because the distance function must be calculated for every point each time a query is made.

John is working on building a decision tree for a dataset. He can do it in 3 ways: (a) Build random trees, (b) Build a decision tree where the Gini index is used for splitting, (c) Build a BagLearner with number of bags = 20 which uses randomly built decision trees. What would you suggest to him if he is most concerned about the time taken for learning and wants to speed up the process?

Random trees are the fastest way of building trees, since we randomly select the feature and its split value. So option A is correct. Option B is incorrect, as calculating the Gini index at each split takes additional time and is slower compared to random trees. Option C is incorrect, since it involves resampling the dataset for each tree and building 20 decision trees, which is much slower than building 1 decision tree. Option D is incorrect, since they don't all take the same amount of time for learning, as explained above.

One of the main goals with decision trees is to split the data as quickly and efficiently as possible into similar groups. In the Decision Trees lecture, we were provided with three potential measures for determining the best feature to split on. Which of the following is NOT one of those measures?

The Sharpe Ratio is a measure used to calculate risk-adjusted returns, not similarity of a group.

Suppose you are building a machine learning algorithm with training data and then diligently applying test data to check accuracy. At what point do you fear the algorithm may be suffering from overfitting?

The algorithm can be built in a manner that is too precise, i.e. the degrees of freedom are too high. When this occurs, the algorithm fits the training data extremely well, but any new data, such as the test set, is not accurately represented. The answer choices are given in a manner that can easily make a test taker second-guess what the expected action is as an algorithm is developed.

Which one is a SUPERVISED event?

Answer B provides feedback from former similar events that can be used for predicting and assessing the current event, while the other choices don't mention feedback from a similar event in the past.

RMSE is more sensitive to ______ than other measures of error.

The answer is (A) because the squaring process in the RMSE calculation gives disproportionate weight to very large errors. This can affect the quality of your prediction results if a large error is a problem in your choice situation.

Which of the following is correct about overfitting?

The answer is (d). Overfitting means the model tries to fit the training data so perfectly that its results do not generalize at all.

Which of the following is an advantage of a parametric learner?

The answer is B. A parametric model takes the training data and generates a mathematical equation that can be used to determine the result of a future data point. Because of this, only the parameters need to be stored and the training data can be discarded. A is incorrect because the model does not have access to all of the training data, and even if it did, exploring that data to determine the correct result would not be as fast as the parameterized model. C is incorrect because there is no limit to the number of features. D is incorrect because calculating the parameters on the fly both requires access to the training data and is expensive.

In terms of linear regression and kNN performance metrics, which one of these is likely to be correct?

The answer is C. In terms of total model space in linear regression, if we are learning a 3rd order polynomial we only need to save 4 numbers; since a float in Python is 64 bits (8 bytes), 4 numbers take only 32 bytes in total. With kNN, on the other hand, we need to store all the data. In terms of compute time for training, kNN needs no time to train itself, so 0 seconds is a logical number, whereas linear regression has to process all the data to find the parameters. In terms of compute time for querying, linear regression only evaluates a polynomial function given the X values, while kNN needs to sort across all the data, which takes some time to evaluate. So, ranked by better performance, total model space favors linear regression, compute time for training favors kNN, and compute time for querying favors linear regression. A, B and D do not match this order.

A kNN model is using k=1. How would we expect its RMSE score to be for In-Sample and Out-Sample?

The answer is C because kNN using k=1 which indicates that the model is striving really hard to model the dataset exactly. Therefore, it is overfit. Because it is overfit, we expect RMSE to be low for In-Sample and high for Out-Sample querying or testing. RMSE is an error metric. The higher the RMSE, the higher the error.

The root-mean-square error is a measure of ?

The answer is (D). It is a measure of the forecast accuracy of the predictive model.

Which one of the following statement of Bagging and Boosting is correct?

The chance of selecting each data point is weighted according to the error of that data point in the previous bag.

Which of the following are generally true of instance-based models?

The correct answer is (D). (A) is incorrect because instance-based models directly use the sample data, and therefore typically cannot make any guesses outside the bounds of the original data. (B) is incorrect because querying is slow, as the data is being consulted and the solution calculated, as opposed to consulting a minimal, pre-built model. (C) is incorrect because they have to store all the data presented to them. (D) is correct, because in most cases the data only needs to be stored to update the "model," whereas in parameterized learning the whole model must be retrained. Any initial calculations that may be performed on the data to update the model tend to be minimal, and significantly faster than retraining an entire parameterized model.

Which of the following is most likely to result in overfitting?

The correct answer is (a) because, for an adaptive boost ensemble learner, as the number of bags increases, the learner tries to assign more and more specific data points to subsequent bags in an attempt to describe all of the data points contained in the training dataset. In all other cases, overfitting is reduced by implementing the actions suggested.

Which of the following would be a good reason for choosing a parametric-based learner?

The correct answer is (c). Parametric learners are typically slower to train, but have very fast query response times. Answers a), b), and d) all describe reasons for choosing an instance-based learner.

Is random tree learning supervised or unsupervised, and why?

The correct answer is A because supervised is defined as inferring from labeled training data. The alternative plausible answer is C because random trees do not really make use of the labels during tree construction.

When fitting a K-NN model to data, when would we expect to see overfitting and why?

The correct answer is A which was explained in the lectures. This is because the model fits the training data very closely and it may actually be fitting noise which would lead to errors when classifying the test data set. Answer B is wrong because K-NN is actually quite good at fitting data that is not linearly separable. Answer C is wrong, a large K is actually what protects against overfitting. Answer D is wrong because a low amount of noise actually makes it easy to create a model without overfitting.

When applying bagging to a dataset of size n, which most accurately describes the number of times an instance will be included in the bags:

The correct answer is C. Bags sample with replacement from the dataset, and the sampling procedure doesn't vary between bags. A is false, since the dataset is not split evenly. B is false, since the bags sample with replacement, not without. D is false, since it refers to a different related technique: boosting.

Which of the following statement is correct regarding unsupervised learning and supervised learning?

The correct answer is C. In supervised learning, all data is labeled and the algorithms learn to predict the output from the input data. In unsupervised learning, all data is unlabeled and the algorithms learn the inherent structure from the input data. Supervised learning includes linear regression, decision trees, random forests, classification, and support vector machines for classification. Unsupervised learning includes K-means clustering, self-organizing maps, and association, but does not include support vector machines for classification.

Which algorithm for building decision trees tends to produce trees the fastest?

The correct answer is b), because it does not base the feature to split on or the split value on an optimality criterion (such as information gain or correlation) like the other algorithms do.

What is the best way of improving an overfitting model, and why?

The correct answer is c) because by fitting the training data less well, the model can focus on the most important features of the training data, which are the ones most likely to remain true for additional unseen data, while avoiding basing its predictions on unimportant, erroneous, or coincidental details present in the data used to build the model.

RMSE stands for which of the following:

The correct definition of RMSE is Root Mean Square Error

A machine learning algorithm fits a line to a scatter plot of data points. The line has a slope of -2 and a correlation coefficient of -0.9. Choose the answer that best describes the correlation of the data.

The correlation coefficient, not the slope of the line, determines the correlation of the data points. Correlation coefficients run from -1 to 1, with 1 being the strongest positive correlation and -1 the strongest inverse correlation, so C is the correct answer.

Suppose we are measuring correlation between price and sales of a commodity. Which of the following number will we most likely to get?

The correlation should range from -1 to 1, and the minus sign means they are inversely correlated, since a drop in price will increase sales.

When using Numpy.corrcoef() to measure the quality of predictions, it returns values in the range of -1 to 1. Which of the following interpretations of the range of values is correct about the correlation?

The correlation value describes the closeness of the data when plotting Ytest vs. Ypredict as well as the slope of the line. Positive numbers reflect the correct slope of the line.

Among decision trees, linear regression and KNN, which is the least costly at learning and which is the least costly at query?

The cost at learning is O(n) for linear regression and O(nlogn) for decision tree and O(1) for KNN, so the least cost at learning is KNN. The cost at query is O(1) for linear regression and O(logn) for decision tree and O(n) for KNN, so the least cost at query is linear regression.

Select the python function for finding correlation 'c' of 'A' and 'B'. (using numpy)

cr = numpy.corrcoef(B, A) creates a 2x2 matrix where the correlation between A and B is given by cr[1,0] or cr[0,1].
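A usage sketch with toy arrays (any equal-length numeric arrays work the same way):

    import numpy as np

    A = np.array([1.0, 2.0, 3.0, 4.0])
    B = np.array([2.1, 3.9, 6.2, 8.1])
    cr = np.corrcoef(B, A)    # 2x2 matrix; diagonal entries are 1
    c = cr[0, 1]              # correlation of A and B; same value as cr[1, 0]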

What is the difference between classification tree and regression tree?

The difference between a classification tree and a regression tree depends on the data and what you want to do with it. A classification tree has a discrete target variable and is used for classifying. A regression tree is built on continuous numeric values and is used for predicting.

Which of the following is an example of a problem best solved using parametric modeling?

The distance travelled by a cannon ball is an example of a parametric modeling problem because you can start with a mathematical equation (an estimate of the underlying behavior of the system) to express how it will behave. The other choices are better solved by non-parametric or instance based modeling

Given m training points and n testing points, where n is much greater than m, rank the total query costs from least to greatest of kNN, balanced decision tree ("DT"), and linear regression ("LR") models.

The fact that there are many more testing points than training points has no bearing on the answer. kNN queries in O(m), DT in O(log(m)), LR in O(1). Total query time would then be O(n*m), O(n*log(m)), and O(n), respectively. This gives us the ranking of LR < DT < kNN.

You are attempting to add ML capabilities to an embedded system with limited resources. You have a training data set of 10,000 elements and will be able to update remotely. Which type of learner would best fit this system?

The limited resources of the embedded system make the parametric regression method best. Updates to the parameters can be done via remote update and should take no more memory than when initially designed.

What is the main difference between the decision tree algorithm and the random tree algorithm?

The main difference between the two algorithms is that the decision tree algorithm deterministically selects the best feature to split on and the random tree algorithm randomly selects a feature to split on. An additional difference is that the decision tree algorithm uses the median value of the feature as the split value and the random tree algorithm uses the mean of two randomly selected data points of the feature.

You train a random forest decision tree with a leaf size of 1 then query it on the in-sample data set, it outputs a result with an RMSE of 0 and a correlation of 1. This result could mean:

The model is likely overfit. An RMSE of zero means there is no error, and the correlation of 1 means the prediction labels are the same as the in-sample actual labels. This likely means it has failed to generalize and won't produce accurate results out-of-sample.

The most likely reason for overfitting in a kNN classifier can be

The most likely reason for overfitting in a kNN classifier is choosing small values of K. For instance, when k = 1, the model might fit the training data almost perfectly and the in-sample error will be quite low, but it might not generalize well, leading to high out-of-sample error. Considering larger values of k helps us build a more generalized model, reducing overfitting.

Which of those is a key factor in parametric models?

Parametric models try to form a function and then learn the coefficients of this function using the training data. Choices A and D are correct for instance-based models, while B is just a random choice unrelated to the subject.

What is the approximate percentage of unsampled data in a data set of size K if random sampling with replacement is used to generate a sample of size K during bagging.

The fraction of unique data points sampled from a set of size K, drawn with replacement, is approximately 1 - 1/e, or about 0.63; it has been mentioned several times as 0.6 in the class and Piazza posts. The fraction of unsampled data would therefore be about 1 - 0.6 = 0.4.
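A quick simulation sketch of that 1 - 1/e figure (any large K gives roughly the same fractions):

    import numpy as np

    K = 100000
    bag = np.random.randint(0, K, size=K)    # sample K indices with replacement
    unique_frac = np.unique(bag).size / K    # ~0.63 sampled at least once
    unsampled_frac = 1 - unique_frac         # ~0.37 never sampled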

Which of the following learning problem is regression

The price of the house is a continuous value, while the two types of email are categorical, the price range is a discrete value, and so is the rating of a movie on Rotten Tomatoes (100 discrete values, a lot but still discrete).

Which of the following is not true about bagging?

The purpose of bagging is to leverage the good performance of different models for different datasets. It does not mandate all the bags to use the same training model.

Which of the following is not true about boosting ?

The question asks for a false statement about boosting. While boosting usually uses weak learners to generate an ensemble strong learner, it is because weak learners are cheaper. Strong learners can also be used in Boosting to produce an ensemble boosted model.

You are working for an online video service. Your boss wants you to recommend a simple algorithm for generating video recommendations. New videos are added all the time so training time needs to be kept to a minimum. Which of the following would be the best choice, with the correct reasoning?

The question asks for the best algorithm with regards to training time, with the correct reasoning. The answer is B) KNN, as there is no 'training' required.

Which of the following is not true about boosting ?

The question is to pick the false statement about Boosting. While an ensemble of many weak learners can be combined to perform better than a single strong learner, an ensemble of strong learners can also be used. The reason it is not usually used is that it's cheaper to construct weak learners than strong ones.

Which of the following is not a characteristic of AdaBoost?

The sample data chosen for each model is not independent of the others. Each data point in the sample has a weight that is assigned based on the last model's error.

Which one of the following statements is NOT true.

Statement C is NOT correct because linear regression only looks at linear relationships between variables; it assumes there is a straight-line relationship between them. That can be incorrect for some data sets. For example, the relationship between income and age is curved, not linear.

Consider a Bag Learner with the following parameters: learner=RTLearner, bag_size=1. Which of the following choices regarding overfitting is correct?

The student should realize that a bag_size=1 is simply the RTLearner. Answer explanations are: A) Incorrect. There will be overfitting for some datasets B) Incorrect. For datasets with overfitting it will occur as leaf size decreases (not increases away from 1) C) Correct. When a dataset is chosen that exhibits overfitting it will happen when we move towards a leaf size of 1. D) Incorrect. Overfitting is never continuously present

In terms of the cost of query, which of the following algorithms is considered to be the worst or slowest for a decision tree learner?

The system would have to compute the distance from the query to all the individual data points, sort them, and find the nearest K data points.

In terms of the cost of query, which of the following algorithms is considered to be the worst or slowest?

The system would have to compute the distance from the query to all the individual data points, sort them, and find the nearest K data points.

Why is feature selection important when building a decision tree?

The thing we're optimizing for when selecting a feature while building a decision tree is to split the decision space in half (or get as close to that as we can). This ensures a balanced and as-shallow-as-possible tree. B is false: we can split on a feature more than once. C: we don't need to use all features. D is false.

Which of these models would be a simple yet effective predictor of the time it would take for a car in a controlled test to come to a complete stop given different speeds when the brakes were initially applied?

This is a simple scenario with extremely predictable behavior based on only a few factors. There would be a high correlation between how fast the car is moving and how long it would take to stop and a very low correlation with other factors.

Given a data set containing information about oranges with the following header: size, date of production, color and quality, you are asked to train a machine learning model so that you can estimate how good an orange is given its size and date of production. What kind of machine learning approach should you choose?

This is a typical classification problem that Supervised Learning is great for.

Which of these statements are true? I. K-NN takes less time to train, but more time to query than decision trees II. Decision trees take less time to train, but more time to query than K-NN III. Decision trees are more prone to overfit than K-NN IV. K-NN is more prone to overfit than decision trees

This is based on the properties of the three types of learners: Statement I is true: K-NN is the quickest to build and the most expensive to query among the three types of learners, and decision trees fall in the middle on both criteria. Statement II. is false because Statement I is true. Statement III. is true because decision trees have a higher tendency to specialize on the training data, especially with smaller leaf sizes, whereas K-NN is susceptible to overfitting only for very small values of K. Statement IV. is false because Statement III is true. Therefore, A. I & III is the only right answer.

We can identify overfitting by plotting in sample and out of sample model error against model complexity. Overfitting is characterized by portions of the graph with:

This is the definition of overfitting given in the lecture. The test taker should either be able to remember this or reason it from the ML project. If in sample error is decreasing while out of sample is increasing it indicates that the model is fitting the training set well, but not generalizing to the test data.

When constructing the base learner for a PERT implementation what combination of steps are randomized?

This question was derived from the Adele Cutler paper 'PERT - Perfect Random Tree Ensembles'. Number of bags may be randomly selected, but that is not part of building the base learner. Number of times to resolve a tie was stated as 10 in the paper.

As a result of overfitting, how do training errors and validation errors vary?

Training error decreases continuously whereas validation error first decreases and then increases. As we overfit the data, training error decreases because we have trained on the data set at the lowest granularity. At the same time, when running on the test data, the error increases when we overfit, since the test set doesn't actually match the fit at that granularity.

Out of the given choices, in a decision tree model for which value of leaf size would you expect your training set RMSE to be the highest?

Training set RMSE increases as we increase the decision tree leaf size, because we become less precise in predicting the response variable. RMSE would be highest for the largest leaf size, in this case 30, i.e. answer (d).

Which statement is False?

Training time for KNN is not dependent on size of training set

Given a large set of data for training and knowing you will have missing and malformed data that is not in your training data sets what is the best option to implement a learning algorithm and why?

Trees don't require normalized data and can easily handle missing data; kNN and linear regression do not. kNN will also slow down considerably depending on the distance computation and the number of neighbors.

Professor Balch has highlighted that the decision tree learner you have submitted is overfitting. This is MOST likely because:

Typically, once the training set RMSE continues decreasing while the testing set RMSE begins rising, your model is overfitting. The reverse holds for correlation: increasing correlation on the training set combined with decreasing correlation on the testing set also typically signals overfitting.

What is a reason ensemble learners are beneficial over single learners?

Using an ensemble learner is beneficial because it combines multiple types of learners. Every type has its own natural bias, and using an ensemble gives you their strengths (C) while reducing their weaknesses (B). And because it is a group of learners, there is less overfitting.

Determine which one is NOT correct about the comparison between a random tree and a decision tree created from information (JR Quinlan).

The statement that is NOT correct is that we can always build a perfect random tree which fits the training data perfectly.

What's the most fundamental difference between k-means clustering and random decision tree and why?

We are looking for fundamental difference between k-means clustering and random decision tree. K-means assumes no prior knowledge of label/class in the training data, ie. unsupervised. On the contrary, random decision tree requires ground truth labels/classes in order to train, i.e, a supervised learning process.

In Ada boosting, what should we do with the points that are not well predicted in the first bag?

We should weight the data according to its error, letting the model become more representative.

In AdaBoost, what kind of learners are usually combined for producing the boosted output?

Weak learners - The output of the weak learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier.

Suppose that I am interested in buying a house built in year x_1 with x_2 bathrooms, x_3 total square feet, and x_4 floors.

What kind of method should I use to determine the approximate price I should be paying for this house? House prices are a continuous variable, and it is obviously not a clustering problem, so it is regression.

What value of K in a K nearest neighbor model will overfit the data?

When K = 1, the model fits every point in the training data exactly, and therefore overfitting occurs.

Which of the following statements are true as they relate to overfitting?

When a hypothesis exists that works on the entire dataset while not matching the sample data as well, that shows that the original hypothesis is overfitted to the sample data. Ref pg 67 Machine Learning by Tom M. Mitchell

As we increase the leaf size in a Random Tree, k in the KNN algorithm, or the degree d of the polynomial in Linear Regression, is the model more likely to overfit? The order is Random Tree, KNN, Linear Regression.

When increasing the leaf size in a Random Tree, the model should not become more overfit, so the first one is false. When increasing k in the KNN algorithm, the model should not become more overfit, because it averages over a larger set, so that one is also false. When increasing the degree of the polynomial in Linear Regression, the model should become more overfit, because a higher-degree polynomial fits the training data more closely, so the last one is true. So the answer is False, False, True.

Which of the following models is more likely to overfit?

When k decreases, kNN is going to overfit. Bagging can reduce overfitting, therefore the correct answer is not C or D. When we compare k=2 with k=10, k=2 is more likely to overfit. Hence, the correct answer is B.

Which of the statements regarding root mean square error (RMSE) is true? (n = number of instances; assume n > 1 with some random variables)

When k=1, the RMSE will be zero for the training set, since each query matches one-to-one and returns the exact match. When k=n, the training-set RMSE will equal the RMSE of simply predicting the training-set average. Test set performance depends on the distribution of the test sets.

Which of the following statement is TRUE?

When noise is present in the dataset, a decision tree tends to overfit the data, which reduces the performance of the learner. Therefore the decision tree is sensitive to noise.

We are training a parametric regression model and ending up with many learners with different degrees of freedom (d). We then test these learners and plot the in sample errors (training data) and out of sample error (testing data) vs degrees of freedom (0~d) to detect overfitting, which of the following statements indicates overfitting?

When the d increases, the model fits the training sample better and better and thus the in sample error always goes down. When the in sample error is decreasing and the out of sample error is increasing, the overfitting occurs.

Regression is a better method when the output is a ______ variable, whereas Classification is the better method when output is a _______ variable.

When trying to determine whether to use a regression method, or a classification method an important consideration is the required output. Classification methods are used when we would like to find similarities in the data and make predictions based on those similarities therefore it produces categorical output. Regression on the other hand uses the data to estimate parameters to predict a relationship between the data and the response, therefore the output is a continuous variable.

Which one is true about overfitting?

When using cross-validation, if the accuracy on the training folds is higher than on the test folds, overfitting is occurring. So cross-validation can reveal overfitting.

When we are doing very simple bagging for a certain model, we would like to "bag" the training dataset into smaller ones. For one data point A, where can we find it?

When we are doing bagging, the random choice of data should be with replacement, so that one data point can end up in different bags, or in the same bag many times. This is to avoid some potential bias in the training dataset.

You have a data set with 1000s of features (X) but the end result for each row (y) can be determined by the first 3 features and the rest of the features are useless.

Which technique is ideal to learn from such a data set? Decision trees start by calculating the information gain of each column. The feature with the highest information gain is selected as the root node, the feature with the second-highest gain is selected as the next node, and so on. The least-contributing features will have very little or zero information gain. With an appropriate threshold value for the information gain, we can avoid selecting the less useful columns when constructing the tree.

Which of the following is NOT true of bagging?

While boosting does increase predictive power, bagging is still useful for algorithms with high variance like decision trees. Each method has their purpose.

Of the following examples, which would you address using an unsupervised learning algorithm:

With A,C and D choices, the input data has already been labeled which indicates supervised learning.

Unsupervised learning is different from supervised learning because unsupervised learning is good at...

With little to no training sets available, unsupervised learning does not require a perfect training set. Unsupervised learning is good at clustering by significance and labeling data points even if the clusters are small.

Which of the following is false?

You can fit higher-order polynomials to the data using linear regression by squaring the features and including those squared values as additional features.
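A minimal sketch of that idea with toy data and illustrative names: a quadratic is fit with an ordinary least-squares solver by adding the squared feature as an extra column.

    import numpy as np

    x = np.linspace(-3, 3, 50)
    y = 2 * x**2 - x + 1 + np.random.normal(scale=0.1, size=x.size)

    # Design matrix with columns [x^2, x, 1]; the model is still linear in its parameters.
    X = np.column_stack([x**2, x, np.ones_like(x)])
    coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)   # roughly [2, -1, 1]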

Lets take an example of a problem where given data about the size of houses and historic prices in the real estate market, you need to predict their future prices (assuming all other factors are constant). Which Machine learning method would be best suitable to solve this ?

You can test your model using past data, so this is definitely an example of supervised learning, and there is no data classification involved in this example.

You work for a company that specializes in field equipment for coleopterists (scientists who study beetles). Your boss approaches you about a new product the company is developing. The order "Coleoptera" constitutes almost a quarter of all known animal species, more than any other order, and your company wants to produce a hand-held beetle identifier that will use measurements of various parts of a specific beetle (input by hand from a scientist in the field) and output a predicted species. He asks you to build the prediction model for this device and gives you a data set of 20,231 examples, each one containing 12 continuous numeric attributes representing measurements of various features on the scarab (wing length, horn length, abdomen diameter, leg segments, etc.).

Your company wants the device to work in such a way that that as new specimens of known species are encountered in the field, this data can be input by the scientists into the device. If the scientist has definitively identified a bug to be of a particular species, she will plug in the scarab's 12 measurements in the device along with the classification. This data must then be available to the scientist right away for making new predictions. Given this requirement, which is the most appropriate choice of algorithm for this project? d is the correct answer because with k-NN you can incorporate new data into your model "on the fly" without having to rebuild the model. You simply add the new instances with their classifications and they are instantly available for new classifications.

"What would be the optimal algorithm for a solution that requires fast training time.

a. kNN b. Decision Tree c. Linear Regression d. Neural network " A linear regression algorithm would be the best choice based on its minimal training costs.

In the bagging, how to combine the results from querying different learners to generate the output value?

According to our lecture, we should take the mean of the outputs from all the learners to generate the predicted value.

What is the output of this python code?

array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
array = array[:3, :3]
array = array[1:, 1:]
print(array)

np.array(...) ->
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]

array[:3, :3] ->
[[ 1  2  3]
 [ 5  6  7]
 [ 9 10 11]]

array[1:, 1:] ->
[[ 6  7]
 [10 11]]

Select the correct answer with regard to the concept of bagging:

b and c are incorrect: bagging helps to reduce overfitting and decreases variance. a is incorrect since bagging uses replacement to select the samples that go in each bag, and n' is usually equal to n. The correct answer is D: if we increment m, then we will be increasing the number of models to be built during the bagging process.

Which following is correct in terms of adaboost(boosting)?

(a): overfitting is more likely to occur in AdaBoost than in simple bagging. (b): It is bagging (not boosting) that just draws a random subset of training samples with replacement from the training set. (d): Incorrectly predicted training examples from the previous step would have increased weights.

Which of the following methods would you choose in order to minimize the running time for a single decision tree?

Because the other three methods all need to calculate information gain before deciding which feature to split on, the random approach has the lowest running time.

In Boosting, it's preferred to use ___ learners. For mis-classified data points, the algorithm will apply ___ weights to them.

Boosting creates strong learners from weak learners, and puts more weight on the misclassified points.

How does the accuracy of a random tree built using A. Cutler's approach compare with a standard decision tree built using JR Quinlan's approach?

Due to the random choice of feature and split value, a single random tree performs worse than a standard tree. However, if we build multiple trees with bagging, the performance can surpass a standard tree.

Which of the following methods are BEST suited for building instance based models?

k-NN and kernel regression are non-parametric methods that can be used for building an instance-based model. Kernel regression takes into account the weighting of the nearest data points, whereas k-NN gives equal weight to the nearest data points.

In terms of the expressiveness of the model, which one is most restrictive: kNN, decision trees, or linear regression?

kNN and decision trees are non-parametric models, and they do not assume any characteristic of the given data. Linear regression, on the other hand, assumes that the features can linearly explain the outcome.

We train a model with low and high temperature data for every day from 2010-2015 in Neverland. The temperatures in Neverland vary smoothly for the most part, increasing steadily from January to August and then falling till December. But every year, from a period of 15th to 20th of March, Neverland experiences a cold streak that results in temperatures being around 20 degrees lower than normal. Choose the option that is correct for a kNN and a Parametric Regression learner from the following. (Assume k<10 for the kNN.)

kNN is more sensitive to local changes in the data. Parametric Regression will not be affected much by this dip for such a small slice of the data (5:365), but will try to fit a curve to the overall data. When queried with this date, kNN will use the outlier data for the cold streak found in the training data rather than the entire training set. So kNN will be able to predict the cold streak, while Parametric Regression will not predict it.

Which of the following parametric learning has the lowest training cost?

kNN undoubtedly has the lowest training cost, but linear regression is the only parametric learner among all four choices.

The kNN algorithm belongs to which family of algorithms? (Choose one)

The kNN algorithm depends on instances seen in the training set; hence it is an instance-based algorithm.

Which one of these learning methods is the fastest at query time?

Linear regression is the fastest at query time because once you calculate the parameters, you just need to plug the X values into the equation and evaluate it. In the case of K nearest neighbors, you need to consult all the data points, calculate the distance to each of them, and take the mean of the closest ones, so querying is really slow. For a decision tree the query time is on the order of log2(n), so it depends on how many data points you have.

Which among the following is NOT an example of Regression vs Classification

The size of the tumor is a numerical measure, and the probability of cancer is also a numerical measure, hence both of them are regression cases.

You are given a set of data with a known target variable. Your task is to build a model to predict the target variable. The target variable is a continuous numerical value. What type of learning method would be best suited for this task?

The target variable is in the data set, therefore this is a supervised learning problem; the target variable is a continuous numerical value, so regression is best suited for this task.

Which of the following statements is correct with regards to overfitting?

The closer the value of k gets to 1, the more the in-sample error decreases, as the model fits the training data more or less perfectly, whereas the out-of-sample error increases because the model cannot fit unseen data as well.

How about the training time of a random tree built using A. Cutler's approach and a standard decision tree using JR Quinlan's approach?

Choosing the feature and split value randomly speeds up the training process and makes building a random tree faster.

Suppose you want to train a machine learning model to predict whether a stock price will increase or drop in coming week. Which machine learning model can't be applied here?

This is a classification problem, and all the other methods except linear regression can be used for classification. Logistic regression is the option meant to confuse students who didn't understand the models.

What is one noticeable difference between JR Quinlan's and A Cutler's implementations of their respective decision trees?

"Page 7 of the How to Learn a decision tree slides - JR Quinlan's algorithm -> SplitVal = data[:,i].median(). Page 8 - A Cutler's algorithm -> SplitVal= (data[random,i]+data[random,i])/2 ."

What is not an example of how to determine the best feature in a decision tree that was discussed in class?

In the course, we specifically discussed using entropy, correlation, and the gini index in order to determine the best feature in a tree. When discussing code, we looked at using correlation, but both correlation and entropy "should" return similar results. According to research, variance reduction is another common way to determine the best feature. We did not, however, discuss this in class.

Which of the following correctly describe unsupervised learning?

In unsupervised learning the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data

Pearson correlation coefficient, r, can take a range of values from +1 to -1. While deciding the feature to split on, what value of correlation coefficient of a feature with respect to target would serve as the best split?

Information gain is maximum when there is maximum correlation. Positive or negative values indicate if they are positively or negatively correlated, either of which would serve as a good split as long as they are strongly correlated.

What is the purpose of using entropy?

Information gain is the amount of information acquired that increases certainty in prediction. The more information we get, the more certain we are about the outcome. Predictors or features serve as sources of such information, and each usually provides a different level of it. We can measure the usefulness of predictors by how much information they provide. For this we can calculate entropy, a value of uncertainty: the higher the entropy, the more uncertain we are in prediction. Conversely, the lowest entropy indicates the highest information gain. Entropy can change when different features are included or excluded from the model. The difference between the former and current entropies is the information gain, which helps determine the most relevant features.
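A hedged sketch of entropy and information gain for a binary-label array (helper names are illustrative, not from the lectures):

    import numpy as np

    def entropy(labels):
        if labels.size == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def information_gain(labels, left_mask):
        # Entropy before the split minus the size-weighted entropy after it.
        n = labels.size
        left, right = labels[left_mask], labels[~left_mask]
        after = (left.size / n) * entropy(left) + (right.size / n) * entropy(right)
        return entropy(labels) - after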

In which of the following models is overfitting most likely to occur?

Non-parametric and non-linear models are more likely to cause overfitting because of their flexibility in learning the target function. The learner ends up adjusting to random features of the training data, so its performance on unseen data is worse compared to the training data.

In which of the following models is overfitting most likely to occur?

Non-parametric and non-linear models have a lot of flexibility in terms of the target function. The learner ends up adjusting to random features of the training data, so its performance on unseen data becomes worse compared to the training data.

Between KNN, linear regression, and decision trees, which would be the best method in terms of prediction accuracy to use for extrapolating beyond the domain of a given data set?

Of all the methods mentioned, only linear regression is a parametric model capable of extrapolation. KNN and decision trees simply use the edge values of the data set to predict the values beyond them.

Which of the following statements is TRUE in regards to overfitting of linear regression models? Note: D = number of dimensions

Overfitting is more likely as D increases because the resulting polynomial will produce a line that more closely matches the test data, which most likely contains either some random error or some noise.

Which of the following would be a reason NOT to use a parametric regression learning model?

Parametric regression is a biased learning model, meaning that we will provide an initial guess as to its shape. A non-parametric approach would be more suitable for problems that are hard to model mathematically.

Which of the following metric is most suitable in determining whether prediction quality linearly matches up with actual data?

Pearson correlation determines the linear relationship between two variables.

When comparing the performance of a bootstrap aggregating from bags=[1toN] of Random Decision Trees leaf size=[1toK] on a training set of shape [500,4] and a testing set of 700 elements, why is using RMSE unreliable?

RMSE is a scale-dependent function, so when comparing against multiple variables (in this case against the number of bags and number of leaves), then it is not an effective measure. RMSE is only effective in evaluating single-variable distributed models.

Supervised learning differs from unsupervised clustering in that supervised learning requires

Supervised learning and unsupervised clustering both require at least one input attribute. But in supervised learning, the output datasets are provided and are used to train the machine to get the desired outputs, whereas in unsupervised learning no output datasets are provided; instead, the data is clustered into different classes.

Which of the following statements about k-NN and linear regression is true?

k-NN is a non-parametric model that takes the average of the y values of the k nearest neighbors to output a prediction. Therefore, it cannot extrapolate to values outside the observations' range. On the other hand, linear regression is a parametric model whose prediction is based on a linear combination of parameters. Once a function is found, the model can evaluate any point in its domain and output a prediction that may be outside the range seen in the training data.

Which of the following is true of the Random Tree Algorithm proposed in A Cutler's paper, as compared to JR Quinlan's Decision Tree Algorithm proposal?

(D) is correct, because (B) and (C) are True. (B) is true, since A Cutler's proposal removes the need to compute information gain (e.g. Correlation, Entropy) and calculate median row value. (C) is true, as it accurately described both the randomness in the Split Factor AND the split value. (A) is not true, because the median row value determination is an aspect of JR Quinlan's Decision Tree Algorithm proposal, not A Cutler's paper.

While building a decision tree, which of the following methods would most likely cause the tree to take longer to query?

A - Choosing a smaller leaf size would cause the tree to branch out more from a node, increasing its depth and complexity. This would in turn result in larger trees and more time to query them. B - Using information gain would actually shorten the query time, as it would find the best feature to split on. C - Limiting the depth would result in a shorter tree, which would be easier to query. D - Choosing the median of all points from the split feature would mean that we divide our data equally between the left and right nodes. This would also shorten query time. It may be easy to confuse what causes the tree to take longer to build with what causes it to take longer to query.

For which of the following problems will a regression learning algorithm be applicable as opposed to a classification algorithm?

A - requires you to predict the percentage of breast-cancer occurrence, which is a continuous output, while all the other answers are classification problems.

You have been tasked with building a learning model that must be trained with incoming sample data in real-time. Which model would be the most suitable and why?

A K-Nearest Neighbor model does not need to be trained when we add new data points. Query times can be slower, but since there are no requirements with regard to query performance, KNN is the most suitable.

Which statement below is correct related to AdaBoost?

A and C are wrong. In AdaBoost, we choose both randomly from the training data and from instances that did not fit well in previous learners. B is wrong: AdaBoost is more likely to result in overfitting because it concentrates on the data points that did not fit well. D is correct: AdaBoost is sensitive to noisy data and outliers, because it concentrates on the instances that were misclassified by previous learners. If there is a major outlier, AdaBoost will be more likely to try to fit it.

In this class, we learn about overfitting. Which of the examples below do you think will likely contribute to/describe overfitting?

A is correct. Setting N=1 for KNN learners will generally lead to overfitting, because training error will always be 0 as existing samples will always be matched to themselves, but will not scale well to unseen data. B is wrong. The model used in the learner is too simple to represent the underlying data, and will generally lead to high error, even if more data is used to learn the model. This phenomenon is called underfitting, instead of overfitting. C is correct. Fitting a complicated (5 degree polynomial) model to data generated from a simple 1 degree polynomial will lead to, like A, good in-sample error but poor out-sample error. n-folds Cross validation can be used with existing data to verify this. D is the correct answer, since both A and C are correct.

Which one (select only one) of the following is NOT true about boosting?

A is true because boosting applies more weight on hardest examples, but those can be noise. Boosting is more likely to overfit when the number of bags increases than simple bagging. So B is not true. Boosting can be applied on any weak learner by applying more weight on the harder examples with larger errors, and has no parameter to tune. So C is true. D is true, because boosting applies more weight on hardest examples, and the examples with the highest weight often turn out to be outliers.

Which of the following is true about Bagging (without boosting)?

A is wrong because bagging doesn't cause underfitting; it improves performance. If anything, with an increase in the number of bags to a very large number, it might lead to some overfitting. B is wrong because for classification problems we can always take the most frequent value instead of the mean. C is wrong because 'without replacement' would mean running the learner on the same set of data multiple times, which defeats the whole concept of bagging (ensemble learners). Hence D.

You are designing a cat face detection mobile app. Even though most of the app will be driven by server side API's most of the face detection logic will reside on the mobile app itself. What type of machine learning algorithm would you choose to use and why?

A parameterized approach is preferable in this scenario, as the resulting model can be fairly lightweight and fast to execute given the storage constraints and offline requirements of a mobile device. An instance-based method, while also quick to execute, would require storing all of the input data needed to accurately predict what's being classified (which may be significant).

what is an example of a parametric based learning algorithm

A parametric model captures all its information about the data within its parameters. All you need to know for predicting a future data value from the current state of the model is just its parameters

Your coworker, who has limited knowledge of modeling, is tasked with developing a model to predict the temperature from a number of different observation stations, each of which report weather-related metrics such as atmospheric pressure. However, many of the stations are randomly missing various pieces of equipment, and so many of your observations are missing a few random measurements. Assume your coworker doesn't understand imputation. Between a linear regression learner, kNN, and a decision tree, which model would you expect to offer the best performance out of the box?

Decision trees easily handle missing data, and do not algorithmically depend on complete observations. This is in contrast to KNN (many distance measures need each feature to be non-missing) and linear regression (need all features to be non-missing or encode missingness as an indicator - can't just ignore it).

When is overfitting most likely to occur?

D) Overfitting occurs when a model learns noises from training data. It is more likely to overfit on actual test dataset since the model "over-learned" from the training dataset.

Which of the following is better judged as an unsupervised learning algorithm instead of a supervised learning algorithm? Suppose all the algorithms are either supervised or unsupervised learning algorithms.

Even if we have information about their names, that shouldn't be treated as the Ys; this problem is still an unsupervised one where we only use the X, and it is actually more like a clustering algorithm. For A, it is a classification algorithm; Y is whether they are granted a credit card. For C, it is a regression algorithm; Y is the price. For D, it is a classification algorithm; Y is the digit which the vector/image represents.

When building a random tree for an array of data, which group of statements randomly selects a feature and randomly chooses a split value from that feature?

The first statement randomly chooses the feature by using data.shape[1]-1 as the high value. The second and third statements randomly choose two different rows, whose values determine the split value.
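A self-contained sketch of that kind of statement group (toy data; the last column is Y, so data.shape[1]-1 excludes it when picking the feature):

    import numpy as np

    data = np.array([[1.0, 5.0, 0.2, 10.0],
                     [2.0, 4.0, 0.4, 12.0],
                     [3.0, 3.0, 0.6, 14.0]])

    feature = np.random.randint(0, data.shape[1] - 1)   # random feature column
    row1 = np.random.randint(0, data.shape[0])          # first random row
    row2 = np.random.randint(0, data.shape[0])          # second random row
    split_val = (data[row1, feature] + data[row2, feature]) / 2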

In the decision tree algorithm by J R Quinlan, if the split value is calculated to be the second largest element among the data rows of the best feature column, instead of the median of the data rows of the best feature column. What will be the worst case height (i.e. maximum) of the decision tree? Assume n to be the number of rows of data.

For a data set of n rows, if we build the decision tree as per the question, then there will be (n-1) edges from the root of the tree till the deepest leaf in the worst case scenario. Hence worst case height of the decision tree will be (n-1). If second largest data element repeats in more than one row, then height of tree will be less than n-1. Hence only the worst case scenario i.e. maximum height of tree possible is asked.

When building decision trees, instead of randomly choosing a feature index to split on, we can select the one with lowest node impurity. Given a split, let R1 and R2 be its left and right regions. For a region Ri (i=1,2), let Ni be the number of samples of this region, and let pi be the proportion of samples with label 1, then we compute Gini index of this region as Qi = 2*pi*(1-pi). When choosing a feature to split, we are minimizing the weighted impurity measure f = N1*Q1 + N2*Q2.

For an example dataset in the format of {(Label, F1, F2):(1,1,0),(1,0,1),(1,1,1),(1,1,0),(0,0,0),(0,0,1)}, the weighted impurity measure f of F1 is _______, and we should choose feature ____ as the root split. The weighted impurity measures of feature F1 and F2 are 4/3, 8/3 respectively. So we should choose F1 to split.
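A short sketch that reproduces those numbers for the toy dataset (rows are (label, F1, F2)):

    import numpy as np

    data = np.array([[1,1,0],[1,0,1],[1,1,1],[1,1,0],[0,0,0],[0,0,1]])
    labels = data[:, 0]

    def weighted_gini(feature_col):
        total = 0.0
        for v in (0, 1):                       # the two regions induced by the split
            region = labels[feature_col == v]
            p = region.mean()                  # proportion of label-1 samples in the region
            total += region.size * 2 * p * (1 - p)
        return total

    f1 = weighted_gini(data[:, 1])   # 4/3
    f2 = weighted_gini(data[:, 2])   # 8/3, so split on F1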

What is the fundamental feature of supervised learning that distinguishes it from unsupervised learning? For all following choices (A through D), X is an mxn matrix where each column corresponds to one feature/factor and each row represent one data instance. Y is vector of size m in which all prediction values are stored. For supervised learning:

For supervised learning, the feature/factor data X and its corresponding prediction value Y are both provided. Each Y value must be associated with each row in X.

I need to predict the price of gold by the end of the year. Which algorithms should I use?

Gold prices are represented by a continuous numerical value, which makes linear regression the answer for this question. The other solutions are classification-based algorithms, which are best suited for discrete variables

Assuming all three models are equally accurate in their predictions, which model - kNN, decision tree, or linear regression - is favorable, and why?

If the accuracy is the same for all three models, then linear regression is the best method, as it has the fastest query time and, in general, lower training times.

Supervised learning differs from unsupervised clustering in that supervised learning requires

In supervised learning, the output datasets are provided and are used to train the machine to get the desired outputs, whereas in unsupervised learning no datasets are provided; instead, the data is clustered into different classes.

Which of the following is not a possible cause for overfitting?

"B. Limited number of input variables are given in the sample." is not a cause for overfitting. Just the opposite, too many variables without proper selection will always lead to overfitting. A,C and D are all common causes for overfitting.

When building a decision tree, choosing the factor to split a node on can be done using entropy or the factor can be chosen randomly. Which is an advantage of choosing the factor to split on randomly?

Calculating entropy for the factors is the most expensive step of the algorithm. Choosing the factor randomly speeds up the algorithm.

What kind of error does RMSE emphasize?

Squaring an error of 8 gives 64, while squaring an error of 2 gives only 4, so the larger errors are emphasized much more.

The main difference between supervised and unsupervised learning is...

Correct answer is b) because supervised learning maps inputs to outputs.

Which of the following statements about the prediction performance of bagging is true

A, C, and D are really saying the same thing: that bagging will usually not improve on a stable learner (kNN). Rather, bagging is expected to reduce the variance of an unstable learner (a decision tree). The reason the former won't benefit is that, in kNN's case, removing a couple of the points won't significantly change the hypothesis. In a sense, that is what bagging does for every bag before performing a vote between them (for A, C, and D, most bags are expected to give similar answers).

Which of the following statement about boosting is TRUE?

A. Boosting aims to decrease bias, and sequentially extracts information from the "residual" of the previous model; starting with high-variance models is likely to cause overfitting. B. The training error decreases, but not necessarily monotonically. C. The boosted classifier is a linear combination of weak classifiers and can form a more complex decision boundary. D. Each model is added sequentially based on information from previous models, not independently.

Assume we have a model and we have tested in sample data and out of sample data over a specific parameter (leaf size, degrees of freedom etc). What should we observe on the in sample and out of sample errors to identify overfit:

According to lecture 03-03, Assessing a Learning Algorithm, the graph shows the in-sample (training) error continuing to decrease while the out-of-sample (test) error is increasing. This shows the model is overfit to the training data.

Which is most likely to overfit?

AdaBoost subsequently attempts to fit specific dataset poorly performed in previous bags over and over, which makes the prediction to be accurate for training dataset and causes overfitting. Therefore, incrementing the number of bags in AdaBoost is most likely to overfit.

Which of the following statements is true about boosting (as implemented by Adaboost) and overfitting?

Adaboost is designed to give high weight to weak learners that have different errors from the rest of the weak learners. This results in lower average error.

Which decision tree algorithm is better in terms of performance for use in a ensemble learner

Adele Cutler's algorithm performs better because randomly selecting features and values to split on significantly speeds up the algorithm. Also, since randomness and averaging are involved, the outcome of the ensemble learner is more accurate in terms of predicting correct Y values. It is not 'C' because KNN is not even a decision tree algorithm.

Which one of the following choices is most accurate for boosting?

All the other answers more accurately describe bagging. The only answer that strictly applies to boosting is C.

Supervised learning is?

B is the true definition per the class content that correctly describes supervised learning. A is unsupervised learning. C is an answer that "sounds reasonable", and D is an answer that incorrectly extrapolates from MC3-P1.

Per Cutler's algorithm, how is randomization used by a random decision tree learner?

Cutler's paper defines how to build a single random tree learner, introducing the method of using randomization to both select a factor and its split value.

Bagging leads to improvements in machine learning procedures by reducing the variance in data. This is least true for :

Bagging can mildly degrade the performance of stable methods such as K-nearest neighbors

What is the main difference between the approach of boosting and bagging when creating each weak learner?

Bagging creates weak learners from uniform samples with replacement from the input space. Boosting checks the error rate of each weak learner and its preceding weak learners for values that are not well predicted. Those values are weighted higher when selecting input values for a weak learner so that they are better predicted in subsequent weak learners.
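
A hedged sketch of the difference in how the two methods draw training data; the weighting scheme here is a simplified, illustrative boosting-style rule, not the exact AdaBoost formula from the lecture:

    import numpy as np

    rng = np.random.default_rng(0)

    def bagging_sample(n):
        # Bagging: uniform sampling with replacement; every point is equally likely.
        return rng.choice(n, size=n, replace=True)

    def boosting_sample(n, weights):
        # Boosting: points that earlier learners predicted poorly carry higher weight,
        # so they are more likely to appear in the next learner's training sample.
        return rng.choice(n, size=n, replace=True, p=weights / weights.sum())

    weights = np.ones(10)
    weights[[3, 7]] = 5.0   # pretend points 3 and 7 were badly predicted so far
    print(np.bincount(boosting_sample(10, weights), minlength=10))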

What is the main advantage of using Ensemble learning methods, such as bagging?

Bagging involves drawing several subsamples, or bags, selected randomly with replacement from the training data set. Each of these bags is then used to train a different model. Each model is then used to make predictions, which are averaged to produce the final prediction. The inherent feature of selecting bootstrap samples and then averaging the outputs reduces the variance compared to using the output of one specific model, such as KNN or a decision tree.
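
A minimal sketch of that procedure, assuming a base learner class with fit/predict methods; SimpleBagLearner and make_learner are illustrative names, not the course's BagLearner API:

    import numpy as np

    class SimpleBagLearner:
        # Train several copies of a base learner, each on a bootstrap sample,
        # and average their predictions (regression setting).
        def __init__(self, make_learner, bags=20, seed=0):
            self.make_learner = make_learner
            self.bags = bags
            self.rng = np.random.default_rng(seed)
            self.learners = []

        def fit(self, X, y):
            n = len(y)
            for _ in range(self.bags):
                idx = self.rng.choice(n, size=n, replace=True)  # one bag: sample with replacement
                learner = self.make_learner()
                learner.fit(X[idx], y[idx])
                self.learners.append(learner)

        def predict(self, X):
            # Averaging the bags' outputs reduces variance relative to a single model.
            return np.mean([lrn.predict(X) for lrn in self.learners], axis=0)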

You are given a black box estimator that can accept three genes at a time, and outputs an estimate of the person's height. The procedure of bagging will be most useful in what scenario?

Bagging will work to average out the small biases and random variance around the true prediction, and produce a better estimator than an individual prediction using any three genes alone.

Which statement is true regarding boosting using weak learner?

Because boosting can't improve upon total random guessing.

In bagging, where M stands for the number of different bags and N' is the number of samples used to train each bag, N' is

Because in 03-04 Ensemble Learners, Bagging And Boosting video, the instructor notes state that in most implementations N' = N because the training data is sampled with replacement (making about 60% of the instances in each bag unique)
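
The "about 60% unique" figure can be checked with a quick simulation; for large N the expected unique fraction approaches 1 - 1/e, roughly 0.632:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10_000
    bag = rng.choice(N, size=N, replace=True)   # one bag: N draws with replacement
    print(len(np.unique(bag)) / N)              # roughly 0.63, i.e. about 60% unique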

Which statement is true regarding boosting using weak learner?

Because random guessing can't be improved into a strong learner through boosting.

Which of the following statements is true about Regression versus Classification Learning.

Because regression models usually do expensive processing beforehand to build the model, and they produce an equation that is easily used for querying.

In which scenario are you most likely to encounter overfitting?

Because when K = 1 the model will retrieve the exact value for y in the training data; it is taking the average of only k = 1 data point.

Which of the following is true about KNN?

Because you store all individual data points rather than a parametric representation of the model.

Which of the following comparisons between Bagging and Boosting is true?

Boosting is more likely to cause overfitting as the number of bags approaches infinity because it takes past model prediction error into consideration when resampling, causing the in-sample RMSE to drop at the cost of overfitting. The rest of the answers are blatantly false: both bagging and boosting reduce RMSE, and bagging does perform bootstrap sampling.

Which of the following option can be the characteristic of both Quinlan and Cutler decision trees, in other words, which of the following feature can be used by both of them?

Correct answer: Feeding random data for the decision tree creation and using an ensemble of learners. Explanation: Quinlan's decision tree algorithm uses correlation to pick the feature to split on and then recommends taking the mean of the values of that feature. These are unique properties of Quinlan's decision tree algorithm. On the other hand, Cutler's decision tree makes use of random selection of the feature to split on. However, it is always possible to feed random data to both Quinlan's and Cutler's decision tree algorithms and then use an ensemble of learners to improve the prediction.

Which choices can make the prediction more accurate?

C. Using more bag learners gives the decision tree ensemble better training.

Which of the following improvements in the C4.5 algorithm over ID3 for building a tree makes it less sensitive to features with a large number of values (example feature: SSN)?

C4.5 measures gain ratio in addition to the information gain used by ID3, which avoids choosing features such as SSN that produce low-entropy splits but are not useful.

Which of the following statements is wrong on the comparison between decision tree algorithms CART and C4.5?

CART uses Gini impurity as a measure of the frequency of incorrect labeling. C4.5 utilizes information entropy concept, similar to ID3.

Hand written digit recognition (0-9) is a good application of supervised machine learning. Examples of handwritten digits and their corresponding mapping to integers [0,9] are provided as input for training. Which of the following statements are true about it?

Classification is a form of supervised machine learning where given an input, the expected response is a label from a known set of discrete classes

Which method is not included in supervised learning?

Clustering is a method of unsupervised learning, which is based on relationships among the variables in the data without feedback based on the prediction results.

Which statement below is true about Random Forests and overfitting? (Random Forests = bagging Random Trees.)

Correct answer is (C). Overfitting refers to high variance between training and test data. Bagging reduces variance by averaging across base learners. While Random Tree base learners might overfit, e.g., leaf size=1 (one unique sample per branch), bagging across trees with randomly selected features will average out the variance. So (A) is false. (B) is false since the performance gains from increasing the number of bags imply lower variance, which cannot cause increased overfitting. Averaging smoothes; it cannot inject external variance and so only (C) is true.

When building a decision tree, what is the benefit of finding the correlation of each feature (X1, X2, X3...) with the labels (Y)

Correct answer is a) because a higher absolute correlation between a feature (X1, X2, ...) and the labels (Y) indicates that it will be the best feature to split on to build the best-performing decision tree.
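
A small illustrative sketch of that idea (not the project's reference implementation): compute each feature's absolute correlation with Y and split on the feature with the largest value.

    import numpy as np

    def best_feature_by_correlation(X, y):
        # Index of the feature whose absolute correlation with y is largest.
        corrs = [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
        return int(np.argmax(corrs))

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
    print(best_feature_by_correlation(X, y))   # expected: 1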

Comparing a decision tree model using information gain to a decision tree model using randomized feature selection and splits with the same leaf size, the decision tree using information gain will likely require

Correct answer is b) because the computational complexity of using information gain is greater than choosing random splits, resulting in greater training time. Trees using information gain tend to be shorter than random trees, resulting in reduced prediction time.

What value would indicate the strongest direct correlation coefficient?

Correlation coefficients range from -1 to 1 continuously. -1 is the strongest inverse correlation; 0 indicates no correlation; 1 indicates the strongest direct correlation.

What is the purpose of using entropy?

Information gain is the amount of information acquired that increases certainty in prediction. The more information we get the more certain we are going to be about the outcome. Predictors or features serve as sources of such information and each usually provides different levels of this information. We can measure the usefulness of predictors by how much information they provide. For this we can calculate entropy or value of uncertainty - the higher is entropy the more uncertain we are in prediction. Entropy can change when different features are included or excluded from the model. The difference between the former and current entropies is the information gain that helps to determine the most relevant features.
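
A short sketch of how entropy and information gain can be computed for a candidate binary split, assuming discrete class labels:

    import numpy as np

    def entropy(labels):
        # Shannon entropy of a label array; higher means more uncertainty.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(parent, left, right):
        # Entropy before the split minus the weighted entropy after it.
        n = len(parent)
        after = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        return entropy(parent) - after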

Which option among the following best describes features of an instance based learning model such as KNN?

Instance-based models mostly need to store the training data, which they refer to during the query process. This makes learning faster (as it is just a storage operation), querying slower (as the model must process the training data to answer the query), and requires more memory to store the training data sets.
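
A minimal sketch showing why training is cheap and querying is expensive for an instance-based learner such as KNN (the KNNLearner class here is illustrative, not the project code):

    import numpy as np

    class KNNLearner:
        def __init__(self, k=3):
            self.k = k

        def fit(self, X, y):
            # "Training" is just storing the data, hence fast but memory-hungry.
            self.X = np.asarray(X, dtype=float)
            self.y = np.asarray(y, dtype=float)

        def query_one(self, x):
            # Each query scans every stored point, hence slow at query time.
            dists = np.linalg.norm(self.X - x, axis=1)
            nearest = np.argsort(dists)[: self.k]
            return self.y[nearest].mean()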

Task of inferring a model from labelled training data is called

Labelled datasets are provided in supervised learning for training and testing the data to infer a predictive model.

Consider a regression decision tree modeled with 1000 data points. Which combination of leaf size and bootstrap aggregating will result in the LEAST amount of overfitting?

Larger leaf sizes rather than small ones prevent overfitting. Bagging reduces overfitting.

Which of the following techniques DO NOT help a Decision Tree generalize better (i.e. avoid overfitting)?

Limiting the number of training samples increases the bias of the tree. Although it may make the tree simpler, it does not necessarily make the tree generalize better. The other options all may help make the tree generalize better.

Which of the following methods will benefit least from bagging. (Assume perfectly uniform random sampling and sufficiently high number of bags)

Linear fitting is a convex problem for which the best possible fit will be almost the same when trained on different random samples of data (because we are assuming perfectly uniform random sampling). Thus bagging will create the same model over and over, and their combination will again be the same model as the one trained on the original data. Option d is incorrect because even if the data has a very good linear fit, it only means that linear regression can be better than the other two methods; the question asks for an improvement in performance relative to the non-bagging version of the same model.

Considering Parametric and Instance based learners - which of the following is FALSE

Linear regression is a parametric model based learner and hence the addition of new data will only increase the training time but the time to query will not change.

Your company is investigating two different models for use in their company, kNN and linear regression. Which of the below is a false claim made by the investigator?

Linear regression is not faster at training than kNN.

While measuring the performance of a learner model, which of the below options is the best?
1. RMSE must be preferred over correlation because correlation gives scaled results but RMSE gives absolute results
2. RMSE must be preferred over correlation because performance can't be measured through correlation
3. Correlation must be preferred over RMSE because correlation provides information about scaled similarity between data points
4. Neither correlation nor RMSE can measure the learner's performance

RMSE must be preferred because correlation gives scaled results but RMSE gives absolute results.

When creating a model for predicting an outcome, should overfitting be considered as a rubric for determining the quality of the model, and why?

Reducing overfitting improves predictive quality: the less rigid and more general a model becomes, the higher the quality of its predictions.

Which one is true with respect to regression problems in machine learning?

Regression is always supervised learning, since we use a set of input and output data to predict unseen data.

Which of the following is used as a measure for calculating the best fit for samples predicted by a model?

Root mean square deviation is the method which is used for measuring the difference between the values predicted by a model or an estimator and the values actually observed.

Which of the following ways, when used to select feature to split on in creating a decision tree, is going to create the biggest tree, provided leaf size is fixed?

The answer is C because randomly selecting a feature to split on has the least information gain. Thus, each split at random won't be as effective as other methods, causing the produced tree to be bigger.

Among the following algorithms, which one doesn't belong to supervised learning?

The difference between supervised and unsupervised learning is whether the data come with answers (labels). A, B, and D have all been given answers, while C has not. Clustering relies on associations within the data because no answers are provided.

When applying the adaptive boosting algorithm to random decision trees, which type of observation do we care about most after assigning equal weight to each tree?

The out-of-sample generalization and its accuracy should be the focus over the other options.

Which one is correct?

Overfitting is more likely to happen when the training set is small.

Suppose you have a constant stream of new data and you can summarize your data points accurately using some algorithm like Reservoir Sampling (https://en.wikipedia.org/wiki/Reservoir_sampling). Your requirement is to be able to query (i.e classify a test point) based on previous data at any time. Which classification algorithm will you choose?

The primary requirement is to add new data. Although querying KNN takes longer (O(n log n)), by limiting N using sampling it will be faster at query time than rebuilding a decision tree or linear regression model.
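
For reference, a compact sketch of the reservoir sampling idea the question assumes (algorithm R), which bounds the number of stored points N:

    import random

    def reservoir_sample(stream, k, seed=0):
        # Keep a uniform random sample of size k from a stream of unknown length,
        # so the stored KNN dataset never grows unboundedly.
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randint(0, i)      # replace an existing slot with probability k/(i+1)
                if j < k:
                    reservoir[j] = item
        return reservoir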

KNN overfitting typically occurs when ...

The relationship is different from that of polynomial degree to error. Overfitting appears as K decreases: in-sample error decreases while out-of-sample error increases.

Cutler's random tree building method differs from JR Quinlan's tree building method mainly in two aspects. Which of the following answers is one of the two main differences between the two tree building methods?

There are two primary differences between the two decision tree building methods: 1. The feature i to split on at each level is determined randomly; it is not determined using information gain or correlation, etc. 2. The split value for each node is determined by randomly selecting two samples of data and taking the mean of their Xi values. Choice B happens to be one of them.
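
A hedged sketch of that random split rule (random feature, split value from two randomly chosen samples); the function name and interface here are illustrative:

    import numpy as np

    def random_split(X, rng):
        # Pick a random feature, then set the split value to the mean of that
        # feature's values at two randomly chosen rows.
        feature = rng.integers(X.shape[1])
        r1, r2 = rng.integers(X.shape[0], size=2)
        return feature, (X[r1, feature] + X[r2, feature]) / 2.0

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 4))
    print(random_split(X, rng))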

What is the difference between knn and kernel regression?

This answer is correct because kernel regression takes the distance of each data point from the query into consideration and weights data points accordingly, while KNN weights all of its nearest neighbors equally.
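
A small sketch of the distinction; the Gaussian kernel here is just an illustrative choice of weighting:

    import numpy as np

    def knn_predict(x, X, y, k=3):
        # Plain KNN: each of the k nearest neighbors counts equally.
        d = np.linalg.norm(X - x, axis=1)
        return y[np.argsort(d)[:k]].mean()

    def kernel_predict(x, X, y, bandwidth=1.0):
        # Kernel regression: closer points get larger weights, farther points smaller.
        d = np.linalg.norm(X - x, axis=1)
        w = np.exp(-(d ** 2) / (2 * bandwidth ** 2))
        return np.sum(w * y) / np.sum(w)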

Which is NOT a primary implementation component of Cutler's Random Tree that is not present in the methodology originally proposed by JR Quinlan?

This is taken from the RTLearner component of the MC3-P1 assignment and clearly documented here: http://quantsoftware.gatech.edu/MC3-Project-1#Part_1:_Implement_RTLearner_.2830.25.29 . Leaf_size does not vary within the tree; it is fixed at the time the tree is created.

Which of the following is better judged as an unsupervised learning algorithm instead of a supervised learning algorithm? Suppose all the algorithms are either supervised or unsupervised learning algorithms.

This problem is an unsupervised one where we only use the X and no Y is provided or used; it is more like a clustering problem. For A, it is a classification problem: Y is whether they are granted a credit card. For C, it is a regression problem: Y is the price. For D, it is a classification problem: Y is the digit which the vector/image represents.

Let the error of hypothesis h over training data be error_train (h) and the error of h over the entire distribution be error_distribution (h). Then a hypothesis h over-fits the training data if there is an alternative hypothesis h' such that:

This question is to test the definition of over-fitting. When hypothesis h has smaller error on training data but larger error on actual data than an alternative hypothesis h', it over-fits the training data.
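
Stated as a formula, in the notation of the question: h overfits the training data if there exists an alternative hypothesis h' with error_train(h) < error_train(h') and error_distribution(h) > error_distribution(h').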

