Naive Bayes & Ensemble & Trees


Q28) Which splitting algorithm is better with categorical variable having high cardinality? Information Gain Gain Ratio Change in Variance None of these

B For categorical variables with high cardinality, gain ratio is preferred over the other splitting criteria. Refer to slide 72 of this presentation.

15) Now consider only one splitting on both (one on X1 and one on X2) feature. You can split both features at any point. Would you be able to classify all data points correctly? A) TRUE B) FALSE

B No such pair of splits exists: at least one data point is always misclassified.

Q9) Look at the below image: The red dots represent original data input, while the green line is the resultant model. q9_image How do you propose to make this model better while working with decision tree? Let it be. The model is general enough Set the number of nodes in the tree beforehand so that it does not overdo its task Build a decision tree model, use cross validation method to tune tree parameters Both B and C All A, B and C None of these

C A. As the image shows, the model is not general enough: it takes outliers and noise into account when making predictions, so it overfits the data. B. Fixing the number of nodes could give an optimal tree, but choosing this value well beforehand is very hard because it requires extensive cross-validation to generalize. C. Tuning the tree parameters with cross-validation is the best way to ensure generalizability.

Q41) Which of the following statement is correct about XGBOOST parameters: Learning rate can go upto 10 Sub Sampling / Row Sampling percentage should lie between 0 to 1 Number of trees / estimators can be 1 Max depth can not be greater than 10 1 1 and 3 1, 3 and 4 2 and 3 2 4

D 1 and 4 are wrong statements, whereas 2 and 3 are correct. Therefore D is true. Refer this article.

Q37) In Random Forest, which of the following is randomly selected? Number of decision trees features to be taken into account when building a tree samples to be given to train individual tree in a forest B and C A, B and C

D Option A is false because the number of trees has to be decided before building the forest; it is not random. Options B and C are true.

10) Which of the following algorithms is not an example of an ensemble learning algorithm? A) Random Forest B) Adaboost C) Extra Trees D) Gradient Boosting E) Decision Trees

E A decision tree doesn't aggregate the results of multiple trees, so it is not an ensemble algorithm.

8. What kind of distance metric(s) are suitable for categorical variables to find the closest neighbors? (A) Euclidean distance. (B) Manhattan distance. (C) Minkowski distance. (D) Hamming distance.

D Hamming distance counts the positions at which two equal-length strings or records differ, which makes it suitable for categorical variables.
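As a minimal sketch of the idea, the snippet below computes the Hamming distance between categorical records and picks the closest neighbor. The feature names and records are hypothetical, chosen only for illustration.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of features")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical categorical records: (color, size, shape)
query = ("red", "small", "round")
candidates = [
    ("red", "large", "round"),    # differs in 1 feature
    ("blue", "large", "square"),  # differs in 3 features
    ("red", "small", "round"),    # identical to the query
]

distances = [hamming_distance(query, c) for c in candidates]
nearest = candidates[distances.index(min(distances))]
print(distances, nearest)
```

Unlike Euclidean or Manhattan distance, this needs no numeric encoding of the categories; only equality between values is used.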

8) Which of the following is true about training and testing error in such case? Suppose you want to apply AdaBoost algorithm on Data D which has T observations. You set half the data for training and half for testing initially. Now you want to increase the number of data points for training T1, T2 ... Tn where T1 < T2.... Tn-1 < Tn. A) The difference between training error and test error increases as number of observations increases B) The difference between training error and test error decreases as number of observations increases C) The difference between training error and test error will not change D) None of These

B As we have more and more data, training error increases and testing error decreases, and both converge to the true error.

6) Which of the following algorithms doesn't use learning rate as one of its hyperparameters? Gradient Boosting Extra Trees AdaBoost Random Forest A) 1 and 3 B) 1 and 4 C) 2 and 3 D) 2 and 4

D Random Forest and Extra Trees don't have learning rate as a hyperparameter.

Q17) Which of the following is not possible in a boosting algorithm? Increase in training error. Decrease in training error Increase in testing error Decrease in testing error Any of the above

A Boosting algorithms fit each new estimator to the errors left by the previous ones, so the training error always decreases.

2) Which of the following is/are true about boosting trees? In boosting trees, individual weak learners are independent of each other It is the method for improving the performance by aggregating the results of weak learners A) 1 B) 2 C) 1 and 2 D) None of these

B In boosting, individual weak learners are not independent of each other because each tree corrects the results of the previous tree. Both bagging and boosting can be considered methods for improving the base learners' results.

14) If you consider only feature X2 for splitting. Can you now perfectly separate the positive class from negative class for any one split on X2? A) Yes B) No

B It is also not possible.

Q22) Given 1000 observations, Minimum observation required to split a node equals to 200 and minimum leaf size equals to 300 then what could be the maximum depth of a decision tree? 1 2 3 4 5

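B With a minimum of 200 observations to split and a minimum leaf size of 300, the deepest tree splits the 1000 observations into 300 and 700, then the 700 into 300 and 400. The 400-observation node cannot be split into two leaves of at least 300 each. So the tree is complete after 2 splits, and the depth is 2.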
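The reasoning above can be sketched as a small loop. This is an illustrative helper, not a library function: it assumes the deepest tree is obtained by peeling off the smallest allowed leaf at each level and continuing to split the larger child.

```python
def max_tree_depth(n_samples, min_samples_split, min_samples_leaf):
    """Deepest chain of splits allowed by the two size constraints."""
    depth = 0
    n = n_samples
    # A node can be split only if it is large enough to split and
    # large enough to produce two leaves of the minimum size.
    while n >= min_samples_split and n >= 2 * min_samples_leaf:
        n -= min_samples_leaf  # send the minimum to one leaf, keep the rest
        depth += 1
    return depth


print(max_tree_depth(1000, 200, 300))
```

For the question's numbers the loop visits node sizes 1000, 700 and 400, and stops at 400 because two leaves of 300 no longer fit.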

19) Which of the following is true about Gradient Boosting trees? In each stage, introduce a new regression tree to compensate for the shortcomings of the existing model We can use the gradient descent method to minimize the loss function A) 1 B) 2 C) 1 and 2 D) None of these

C Both are true and self explanatory

9) In random forest or gradient boosting algorithms, features can be of any type. For example, it can be a continuous feature or a categorical feature. Which of the following option is true when you consider these types of features? A) Only Random forest algorithm handles real valued attributes by discretizing them B) Only Gradient boosting algorithm handles real valued attributes by discretizing them C) Both algorithms can handle real valued attributes by discretizing them D) None of these

C Both can handle real valued features.

Q30) Suppose we have missing values in our data. Which of the following method(s) can help us to deal with missing values while building a decision tree? Let it be. Decision Trees are not affected by missing values Fill dummy value in place of missing, such as -1 Impute missing value with mean/median All of these

D All the options are correct. Refer this article.

Q5) Now let's take multiple features into account. Outlet_Location_Type Outlet_Type Item_Outlet_Sales Tier 1 Supermarket Type1 3735.1380 Tier 3 Supermarket Type2 443.4228 Tier 1 Supermarket Type1 2097.2700 Tier 3 Grocery Store 732.3800 Tier 3 Supermarket Type1 994.7052 If have multiple if-else ladders, which model is best with respect to RMSE? if "Outlet_Location_Type" is 'Tier 1': return 2500 else: if "Outlet_Type" is 'Supermarket Type1': return 1000 elif "Outlet_Type" is 'Supermarket Type2': return 400 else: return 700 if "Outlet_Location_Type" is 'Tier 3': return 2500 else: if "Outlet_Type" is 'Supermarket Type1': return 1000 elif "Outlet_Type" is 'Supermarket Type2': return 400 else: return 700 if "Outlet_Location_Type" is 'Tier 3': return 3000 else: if "Outlet_Type" is 'Supermarket Type1': return 1000 else: return 500 if "Outlet_Location_Type" is 'Tier 1': return 3000 else: if "Outlet_Type" is 'Supermarket Type1': return 1000 else: return 450

D The RMSE values of the four models are: A: 581.50, B: 1913.36, C: 2208.36, D: 535.75. Option D has the lowest value.
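The two lowest values can be checked with a short sketch. Only models A and D from the question are coded here; the function names are hypothetical, and the RMSE formula is the one given elsewhere in this set.

```python
from math import sqrt

# (Outlet_Location_Type, Outlet_Type, Item_Outlet_Sales) rows from the question
rows = [
    ("Tier 1", "Supermarket Type1", 3735.1380),
    ("Tier 3", "Supermarket Type2", 443.4228),
    ("Tier 1", "Supermarket Type1", 2097.2700),
    ("Tier 3", "Grocery Store", 732.3800),
    ("Tier 3", "Supermarket Type1", 994.7052),
]


def model_a(location, outlet):
    if location == "Tier 1":
        return 2500
    if outlet == "Supermarket Type1":
        return 1000
    if outlet == "Supermarket Type2":
        return 400
    return 700


def model_d(location, outlet):
    if location == "Tier 1":
        return 3000
    return 1000 if outlet == "Supermarket Type1" else 450


def rmse(model):
    squared = [(model(loc, out) - sales) ** 2 for loc, out, sales in rows]
    return sqrt(sum(squared) / len(rows))


print(round(rmse(model_a), 2), round(rmse(model_d), 2))
```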

Q21) Provided n < N and m < M. A Bagged Decision Tree with a dataset of N rows and M columns uses____rows and ____ columns for training an individual intermediate tree. N, M N, M n, M n, m

C Bagged trees use all the columns but only a sample of the rows. So randomization is done on the observations, not on the columns.
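A minimal sketch of that sampling scheme, assuming rows are drawn with replacement (a bootstrap sample) and using small hypothetical values for N, M and n:

```python
import random


def bootstrap_rows(n_rows, sample_size, seed=0):
    """Draw row indices with replacement, as in a bootstrap sample."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(sample_size)]


N, M, n = 10, 4, 6           # hypothetical dataset shape and sample size
columns = list(range(M))     # bagging keeps every column for every tree

# Input for three intermediate trees: n sampled rows, all M columns.
tree_inputs = [(bootstrap_rows(N, n, seed=t), columns) for t in range(3)]
for row_idx, col_idx in tree_inputs:
    print(len(row_idx), "rows,", len(col_idx), "columns")
```

A random forest would additionally restrict the columns considered at each split, which is what separates answer C from answer D in the question.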

1) Which of the following is/are true about bagging trees? In bagging trees, individual trees are independent of each other Bagging is the method for improving the performance by aggregating the results of weak learners A) 1 B) 2 C) 1 and 2 D) None of these

C Both statements are true. In bagging, the individual trees are independent of each other because each is built on a different subset of features and samples.

B A D C Can't Say

C Decision Boundaries of decision trees are always perpendicular to X and Y axis.

Q35) For parameter tuning in a boosting algorithm, which of the following search strategies may give best tuned model: Random Search. Grid Search. A or B Can't say

C For a given search space, random search picks hyperparameter combinations at random and usually needs much less time to reach a good configuration. Grid search deterministically tries every combination; it is a brute-force approach and takes much longer to produce output. Either random search or grid search may give the best tuned model. It depends on how much time and resources can be allocated to the search.

Q24) Generally, in terms of prediction performance which of the following arrangements are correct: Bagging>Boosting>Random Forest>Single Tree Boosting>Random Forest>Single Tree>Bagging Boosting>Random Forest>Bagging>Single Tree Boosting >Bagging>Random Forest>Single Tree

C Generally speaking, boosting algorithms perform better than bagging algorithms. Between bagging and random forest, random forest works better in practice because its trees are less correlated than bagged trees. And ensembles generally perform better than single models.

Q13) Random forests (While solving a regression problem) have the higher variance of predicted result in comparison to Boosted Trees (Assumption: both Random Forest and Boosted Tree are fully optimized). True False Cannot be determined

C It completely depends on the data, the assumption cannot be made without data.

24) In gradient boosting it is important to use the learning rate to get optimum output. Which of the following is true about choosing the learning rate? A) Learning rate should be as high as possible B) Learning Rate should be as low as possible C) Learning Rate should be low but it should not be very low D) Learning rate should be high but it should not be very high

C The learning rate should be low but not very low; otherwise the algorithm takes very long to train because you need to increase the number of trees.

25) [True or False] Cross validation can be used to select the number of iterations in boosting; this procedure may help reduce overfitting. A) TRUE B) FALSE

A

17) What will be the minimum accuracy you can get? A) Always greater than 70% B) Always greater than and equal to 70% C) It can be less than 70% D) None of these

C Refer to the table below for models M1, M2 and M3. Each model is 70% accurate, but the majority vote is correct on only 6 of the 10 rows (60%).

Actual  M1  M2  M3  Output
1       1   0   0   0
1       1   1   1   1
1       1   0   0   0
1       0   1   0   0
1       0   1   1   1
1       0   0   1   0
1       1   1   1   1
1       1   1   1   1
1       1   1   1   1
1       1   1   1   1
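The table above can be reproduced with a few lines. This sketch only transcribes the table's predictions and applies a simple majority vote; the variable names are illustrative.

```python
# Predictions of three models on ten observations whose true label is 1,
# transcribed from the table above.
actual = [1] * 10
m1 = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
m2 = [0, 1, 0, 1, 1, 0, 1, 1, 1, 1]
m3 = [0, 1, 0, 0, 1, 1, 1, 1, 1, 1]


def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Majority vote: predict 1 when at least two of the three models say 1.
ensemble = [1 if a + b + c >= 2 else 0 for a, b, c in zip(m1, m2, m3)]

for name, pred in [("M1", m1), ("M2", m2), ("M3", m3), ("vote", ensemble)]:
    print(name, accuracy(pred, actual))
```

The vote falls below 70% here because the models' errors overlap on the same rows; when errors fall on different rows, voting can instead push accuracy above any single model.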

Q42) What can be the maximum depth of decision tree (where k is the number of features and N is the number of samples)? Our constraint is that we are considering a binary decision tree with no duplicate rows in sample (Splitting criterion is not fixed). N N - k - 1 N - 1 k - 1

C The answer is N-1. An example of max depth would be when splitting only happens on the left node.

Q16) Which of the following could not be result of two-dimensional feature space from natural recursive binary split? q16_image 1 only 2 only 1 and 2 None

A 1 is not possible. Therefore, Option A is correct. For more details, refer to Page 308 of ESL (Elements of Statistical Learning).

Q23) Consider a classification tree for whether a person watches 'Game of Thrones' based on features like age, gender, qualification and salary. Is it possible to have following leaf node? q25_image Yes No Can't say

A A node can be split on a feature, as long as it gives information after split. So even though the above split does not reduce the classification error, it improves the Gini index and the cross-entropy. Refer Pg. 314 of ISLR.
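To see how a split can leave the classification error unchanged while still improving the Gini index, here is a small sketch. The class counts are hypothetical and are not taken from the question's image.

```python
def class_error(counts):
    """Misclassification rate when the node predicts its majority class."""
    return 1 - max(counts) / sum(counts)


def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def weighted(measure, children):
    """Average an impurity measure over child nodes, weighted by size."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)


parent = (9, 3)               # hypothetical (yes, no) counts
children = [(5, 0), (4, 3)]   # both children still predict "yes"

print(class_error(parent), weighted(class_error, children))
print(gini(parent), weighted(gini, children))
```

Both leaves predict the same class, so the error rate stays at 0.25, but one leaf becomes pure, which lowers the weighted Gini index from 0.375 to about 0.286.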

Q43) Boosting is a general approach that can be applied to many statistical learning methods for regression or classification. True False

A Boosting is an ensemble technique and can be applied to various base algorithms

5) Which of the following is true about "max_depth" hyperparameter in Gradient Boosting? Lower is better parameter in case of same validation accuracy Higher is better parameter in case of same validation accuracy Increase the value of max_depth may overfit the data Increase the value of max_depth may underfit the data A) 1 and 3 B) 1 and 4 C) 2 and 3 D) 2 and 4

A Increasing the depth beyond a certain value may overfit the data, and when two depth values give the same validation accuracy we prefer the smaller depth in the final model.

A) Outlook B) Humidity C) Windy D) Temperature

A Information gain increases with the average purity of subsets. So option A would be the right answer.

Context 12-15 Consider the following figure for answering the next few questions. In the figure, X1 and X2 are the two features and each data point is represented by a dot (-1 is the negative class and +1 is the positive class). You first split the data on feature X1 (say the splitting point is x11), shown in the figure as a vertical line. Every value less than x11 is predicted as the positive class and every value greater than x11 as the negative class. 12) How many data points are misclassified in the above image? A) 1 B) 2 C) 3 D) 4

A Only one observation is misclassified: one negative-class point lies to the left of the vertical line, where it is predicted as positive.

Q20) Boosted decision trees perform better than Logistic Regression on anomaly detection problems (Imbalanced Class problems). True, because they give more weight for lesser weighted class in successive rounds False, because boosted trees are based on Decision Tree, which will try to overfit the data

A Option A is correct

Q8) Next, we want to find which feature would be better for splitting root node (where root node represents entire population). For this, we will set "Reduction in Variance" as our splitting method. Outlet_Location_Type Item_Fat_Content Item_Outlet_Sales Tier 1 Low Fat 3735.1380 Tier 3 Regular 443.4228 Tier 1 Low Fat 2097.2700 Tier 3 Regular 732.3800 Tier 3 Low Fat 994.7052 The split with lower variance is selected as the criteria to split the population. q8_image Between Outlet_Location_Type and Item_Fat_Content, which is the better feature to split on? Outlet_Location_Type Item_Fat_Content will not split on both

A Option A is correct because Outlet_Location_Type gives the greater reduction in variance. You can perform a calculation similar to the one in Q7.

Q7) We could improve our model by selecting the feature which gives a better prediction when we use it for splitting (It is a process of dividing a node into two or more sub-nodes). Outlet_Location_Type Item_Fat_Content Item_Outlet_Sales Tier 1 Low Fat 3735.1380 Tier 3 Regular 443.4228 Tier 1 Low Fat 2097.2700 Tier 3 Regular 732.3800 Tier 3 Low Fat 994.7052 In this example, we want to find which feature would be better for splitting root node (entire population or sample and this further gets divided into two or more homogeneous sets). Assume splitting method is "Reduction in Variance" i.e. we split using a variable, which results in overall lower variance. q7_image What is the resulting variance if we split using Outlet_Location_Type? ~298676 ~298676 ~3182902 ~2222733 None of these

A Option A is correct. The steps to solve this problem are: 1. Calculate the mean of the target values for "Tier 1", then the variance of those values around that mean. 2. Similarly, calculate the variance for "Tier 3". 3. Take the weighted mean of the two variances, weighting each tier by its share of the observations.
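The steps above can be sketched as follows. This assumes the population variance (dividing by the group size), which is what reproduces the ~298676 figure; the helper names are illustrative.

```python
from collections import defaultdict

# (Outlet_Location_Type, Item_Fat_Content, Item_Outlet_Sales)
rows = [
    ("Tier 1", "Low Fat", 3735.1380),
    ("Tier 3", "Regular", 443.4228),
    ("Tier 1", "Low Fat", 2097.2700),
    ("Tier 3", "Regular", 732.3800),
    ("Tier 3", "Low Fat", 994.7052),
]


def variance(values):
    """Population variance: mean squared deviation from the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_variance(rows, feature_index):
    """Weighted mean of the target variance within each feature value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[-1])
    n = len(rows)
    return sum(len(g) / n * variance(g) for g in groups.values())


by_location = split_variance(rows, 0)
by_fat_content = split_variance(rows, 1)
print(round(by_location), round(by_fat_content))
```

The same function answers the companion question on Item_Fat_Content: the split with the lower weighted variance is the better one.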

Q29) There are "A" features in a dataset and a Random Forest model is built over it. It is given that there exists only one significant feature of the outcome - "Feature1". What would be the % of total splits that will not consider the "Feature1" as one of the features involved in that split (It is given that m is the number of maximum features for random forest)? Note: Considering random forest select features space for every node split. (A-m)/A (m-A)/m m/A Cannot be determined

A Option A is correct. Each split considers m of the A features chosen at random, so the probability that "Feature1" is not among them is (A-m)/A.
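One way to check this is to count feature subsets directly. The sketch assumes the m candidate features at each split are drawn uniformly without replacement; the function name and the values 24 and 5 are only illustrative.

```python
from math import comb


def prob_feature_excluded(total_features, m):
    """Fraction of size-m feature subsets that leave out one given feature."""
    # Subsets avoiding the feature are chosen from the other A - 1 features.
    return comb(total_features - 1, m) / comb(total_features, m)


A, m = 24, 5  # hypothetical values
print(prob_feature_excluded(A, m), (A - m) / A)
```

Both expressions agree because C(A-1, m) / C(A, m) simplifies to (A-m)/A.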

4) In Random forest you can generate hundreds of trees (say T1, T2 .....Tn) and then aggregate the results of these tree. Which of the following is true about individual(Tk) tree in Random Forest? Individual tree is built on a subset of the features Individual tree is built on all the features Individual tree is built on a subset of observations Individual tree is built on full set of observations A) 1 and 3 B) 1 and 4 C) 2 and 3 D) 2 and 4

A Random forest is based on the bagging concept, which considers a fraction of the samples and a fraction of the features for building the individual trees.

11) Suppose you are using a bagging based algorithm say a RandomForest in model building. Which of the following can be true? Number of tree should be as large as possible You will have interpretability after using RandomForest A) 1 B) 2 C) 1 and 2 D) None of these

A Since Random Forest aggregates the results of different weak learners, we would want as many trees as is feasible. Random Forest is a black-box model, so you lose interpretability after using it.

23) Now, Consider the learning rate hyperparameter and arrange the options in terms of time taken by each hyperparameter for building the Gradient boosting model? Note: Remaining hyperparameters are same 1. learning rate = 1 2. learning rate = 2 3. learning rate = 3 A) 1~2~3 B) 1<2<3 C) 1>2>3 D) None of these

A Since learning rate doesn't affect time so all learning rates would take equal time.

7) Which of the following algorithm would you take into the consideration in your final model building on the basis of performance? Suppose you have given the following graph which shows the ROC curve for two different classification algorithms such as Random Forest(Red) and Logistic Regression(Blue) A) Random Forest B) Logistic Regression C) Both of the above D) None of these

A Random Forest has the larger AUC in the picture, so I would prefer Random Forest.

Q 1) The data scientists at "BigMart Inc" have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product based on these attributes and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store during a defined period. Which learning problem does this belong to? Supervised learning Unsupervised learning Reinforcement learning None

A Supervised learning is the machine learning task of inferring a function from labeled training data. Here historical sales data is our training data and it contains the labels / outcomes.

20) True-False: The bagging is suitable for high variance low bias models? A) TRUE B) FALSE

A Bagging is suitable for high-variance, low-bias models, that is, for complex models.

21) Which of the following is true when you choose the fraction of observations for building the base learners in a tree-based algorithm? A) Decreasing the fraction of samples used to build a base learner will decrease variance B) Decreasing the fraction of samples used to build a base learner will increase variance C) Increasing the fraction of samples used to build a base learner will decrease variance D) Increasing the fraction of samples used to build a base learner will increase variance

A The answer is self-explanatory.

Q10) Which methodology does Decision Tree (ID3) take to decide on first split? Greedy approach Look-ahead approach Brute force approach None of these

A The process of top-down induction of decision trees (TDIDT) is an example of a greedy algorithm, and it is by far the most common strategy for learning decision trees from data. Read here.

Q34) Decision Trees are not affected by multicollinearity in features: TRUE FALSE

A The statement is true. For example, if there are two 90% correlated features, decision tree would consider only one of them for splitting.

26) When you use the boosting algorithm you always consider the weak learners. Which of the following is the main reason for having weak learners? To prevent overfitting To prevent under fitting A) 1 B) 2 C) 1 and 2 D) None of these

A To prevent overfitting, since the complexity of the overall learner increases at each step. Starting with weak learners implies the final classifier will be less likely to overfit.

29) In which of the following scenario a gain ratio is preferred over Information Gain? A) When a categorical variable has very large number of category B) When a categorical variable has very small number of category C) Number of categories is the not the reason D) None of these

A For high-cardinality categorical variables, gain ratio is preferred over information gain.
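A toy comparison shows why. In this sketch (hypothetical data, plain-Python helpers rather than any library API), a row-ID-like feature with one category per row gets the same information gain as a genuinely useful binary feature, but a lower gain ratio because its split information is larger.

```python
from collections import Counter
from math import log2


def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(feature, labels):
    # Split information is the entropy of the feature's own value distribution.
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info else 0.0


labels = ["yes", "yes", "no", "no"]
row_id = ["a", "b", "c", "d"]             # high cardinality: one value per row
weather = ["sun", "sun", "rain", "rain"]  # two categories

for name, feat in [("row_id", row_id), ("weather", weather)]:
    print(name, information_gain(feat, labels), gain_ratio(feat, labels))
```

Information gain cannot tell the two features apart, while gain ratio penalizes the many-valued one.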

Q11) There are 24 predictors in a dataset. You build 2 models on the dataset: 1. Bagged decision trees and 2. Random forest Let the number of predictors used at a single split in bagged decision tree is A and Random Forest is B. Which of the following statement is correct? A >= B A < B A >> B Cannot be said since different iterations use different numbers of predictors

A Random Forest uses a subset of predictors for model building, whereas bagged trees use all the features at once.

Q27) Boosting is said to be a good classifier because: It creates all ensemble members in parallel, so their diversity can be boosted. It attempts to minimize the margin distribution It attempts to maximize the margins on the training data None of these

B A. Trees are sequential in boosting. They are not parallel B. Boosting attempts to minimize residual error which reduces margin distribution C. As we saw in B, margins are minimized and not maximized.

Q39) While tuning the parameters "Number of estimators" and "Shrinkage Parameter"/"Learning Rate" for boosting algorithm.Which of the following relationship should be kept in mind? Number of estimators is directly proportional to shrinkage parameter Number of estimators is inversely proportional to shrinkage parameter Both have polynomial relationship

B It is generally seen that smaller learning rates require more trees to be added to the model and vice versa. So when tuning parameters of boosting algorithm, there is a trade-off between learning rate and number of estimators

Q31) To reduce under fitting of a Random Forest model, which of the following method can be used? Increase minimum sample leaf value increase depth of trees Increase the value of minimum samples to split None of these

B Only option B is correct. A: increasing the minimum number of samples per leaf reduces the depth of the trees, indirectly increasing underfitting. B: increasing the depth helps reduce underfitting. C: increasing the minimum number of samples required to split a node does not reduce underfitting; it restricts the trees further. Therefore B is true.

Q32) While creating a Decision Tree, can we reuse a feature to split a node? Yes No

A Yes, a decision tree recursively considers all the features at each node.

A) 1 B) 2 C) 3 D) 4

B Scenarios 2 and 4 have the same validation accuracy, but we would select 2 because, for equal accuracy, the lower depth is the better hyperparameter value.

Q19) Let's say we have m numbers of estimators (trees) in a boosted tree. Now, how many intermediate trees will work on modified version (OR weighted) of data set? 1 m-1 m Can't say None of the above

B The first tree in boosted trees works on the original data, whereas all the rest work on modified version of the data.

Context 22-23 Suppose you are building a Gradient Boosting model on data which has millions of observations and 1000's of features. Before building the model you want to consider different parameter settings for time measurement. 22) Consider the hyperparameter "number of trees" and arrange the options in terms of time taken by each hyperparameter for building the Gradient Boosting model? Note: remaining hyperparameters are same Number of trees = 100 Number of trees = 500 Number of trees = 1000 A) 1~2~3 B) 1<2<3 C) 1>2>3 D) None of these

B Building 1000 trees takes the most time and building 100 trees takes the least, which is option B.

Q44) Predictions of individual trees of bagged decision trees have lower correlation in comparison to individual trees of random forest. TRUE FALSE

B This is False because random Forest has more randomly generated uncorrelated trees than bagged decision trees. Random Forest considers only a subset of total features. So individual trees that are generated by random forest may have different feature subsets. This is not true for bagged trees.

Q26) When using Random Forest for feature selection, suppose you permute values of two features - A and B. Permutation is such that you change the indices of individual values so that they do not remain associated with the same target as before. For example: You notice that permuting values does not affect the score of model built on A, whereas the score decreases on the model trained on B.Which of the following features would you select from the following solely based on the above finding? (A) (B)

B This is called mean decrease in accuracy when using random forest for feature selection. Intuitively, if shuffling the values is not impacting the predictions, the feature is unlikely to add value.

28) How to select best hyperparameters in tree based models? A) Measure performance over training data B) Measure performance over validation data C) Both of these D) None of these

B We always consider the validation results to compare with the test result.

Q3) The below created if-else statement is called a decision stump: Our model: if "Outlet_Location" is "Tier 1": then "Outlet_Sales" is 2000, else "Outlet_Sales" is 1000 Now let us evaluate the model we created above on following data: Evaluation Data: Outlet_Location_Type Item_Outlet_Sales Tier 1 3735.1380 Tier 3 443.4228 Tier 1 2097.2700 Tier 3 732.3800 Tier 3 994.7052 We will calculate RMSE to evaluate this model. The root-mean-square error (RMSE) is a measure of the differences between values predicted by a model or an estimator and the values actually observed. The formula is : rmse = (sqrt(sum(square(predicted_values - actual_values)) / number of observations)) What would be the RMSE value for this model? ~23 ~824 ~680318 ~2152

B Calculating the RMSE value using the formula above, we get ~824 as our answer.

Q4) For the same data, let us evaluate our models. The root-mean-square error (RMSE) is a measure of the differences between values predicted by a model or an estimator and the values actually observed. Outlet_Location_Type Item_Outlet_Sales Tier 1 3735.1380 Tier 3 443.4228 Tier 1 2097.2700 Tier 3 732.3800 Tier 3 994.7052 The formula is : rmse = (sqrt(sum(square(predicted_values - actual_values)) / num_samples)) Which of the following will be the best model with respect to RMSE scoring? if "Outlet_Location_Type" is "Tier 1": then "Outlet_Sales" is 2000, else "Outlet_Sales" is 1000 if "Outlet_Location_Type" is "Tier 1": then "Outlet_Sales" is 1000, else "Outlet_Sales" is 2000 if "Outlet_Location_Type" is "Tier 3": then "Outlet_Sales" is 500, else "Outlet_Sales" is 5000 if "Outlet_Location_Type" is "Tier 3": then "Outlet_Sales" is 2000, else "Outlet_Sales" is 200

A Calculate the RMSE value for each if-else model: A: 824.81, B: 1656.82, C: 1437.19, D: 2056.07. The model in option A has the lowest value, and the lower the RMSE, the better the model.
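The RMSE comparison of the four if-else models above can be reproduced with a short sketch. The tuple encoding of each stump is an illustrative choice, not part of the original question.

```python
from math import sqrt

# (Outlet_Location_Type, Item_Outlet_Sales) evaluation data from the question
data = [
    ("Tier 1", 3735.1380),
    ("Tier 3", 443.4228),
    ("Tier 1", 2097.2700),
    ("Tier 3", 732.3800),
    ("Tier 3", 994.7052),
]

# Each stump: (tier tested, prediction if it matches, prediction otherwise)
stumps = {
    "A": ("Tier 1", 2000, 1000),
    "B": ("Tier 1", 1000, 2000),
    "C": ("Tier 3", 500, 5000),
    "D": ("Tier 3", 2000, 200),
}


def rmse(stump):
    tier, if_match, otherwise = stump
    errors = [(if_match if loc == tier else otherwise) - sales
              for loc, sales in data]
    return sqrt(sum(e * e for e in errors) / len(data))


scores = {name: rmse(stump) for name, stump in stumps.items()}
best = min(scores, key=scores.get)
print({name: round(value, 2) for name, value in scores.items()}, best)
```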

Q40) Let's say we have m number of estimators (trees) in a XGBOOST model. Now, how many trees will work on bootstrapped data set? 1 m-1 m Can't say None of the above

C All the trees in XGBoost will work on bootstrapped data. Therefore, option C is true

Q6) Till now, we have just created predictions using some intuition based rules. Hence our predictions may not be optimal.What could be done to optimize the approach of finding better predictions from the given data? Put predictions which are the sum of all the actual values of samples present. For example, in "Tier 1", we have two values 3735.1380 and 2097.2700, so we will take ~5832 as our prediction Put predictions which are the difference of all the actual values of samples present. For example, in "Tier 1", we have two values 3735.1380 and 2097.2700, so we will take ~1638 as our prediction Put predictions which are mean of all the actual values of samples present. For example, in "Tier 1", we have two values 3735.1380 and 2097.2700, so we will take ~2916 as our prediction

C We will take that value which is more representative of the data. Given all three options, central tendency, mean value would be a better fit for the data.

27) To apply bagging to regression trees which of the following is/are true in such case? We build N regression trees with N bootstrap samples We take the average of the N regression trees Each tree has a high variance with low bias A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1,2 and 3

D All of the options are correct and self explanatory

Q12) Why do we prefer information gain over accuracy when splitting? Decision Tree is prone to overfit and accuracy doesn't help to generalize Information gain is more stable as compared to accuracy Information gain chooses more impactful features closer to root All of these

D All the above options are correct

Q2) Before building our model, we first look at our data and make predictions manually. Suppose we have only one feature as an independent variable (Outlet_Location_Type) along with a continuous dependent variable (Item_Outlet_Sales). Outlet_Location_Type Item_Outlet_Sales Tier 1 3735.14 Tier 3 443.42 Tier 1 2097.27 Tier 3 732.38 Tier 3 994.71 We see that we can possibly differentiate in Sales based on location (tier 1 or tier 3). We can write simple if-else statements to make predictions. Which of the following models could be used to generate predictions (may not be most accurate)? if "Outlet_Location" is "Tier 1": then "Outlet_Sales" is 2000, else "Outlet_Sales" is 1000 if "Outlet_Location" is "Tier 1": then "Outlet_Sales" is 1000, else "Outlet_Sales" is 2000 if "Outlet_Location" is "Tier 3": then "Outlet_Sales" is 500, else "Outlet_Sales" is 5000 Any of the above

D All the options would be correct. All the above models give a prediction as output and here we are not talking about most or least accurate.

Q45) Below is a list of parameters of Decision Tree. In which of the following cases higher is better? Number of samples used for split Depth of tree Samples for leaf Can't Say

D For all three options A, B and C, it is not necessary that if you increase the value of parameter the performance may increase. For example, if we have a very high value of depth of tree, the resulting tree may overfit the data, and would not generalize well. On the other hand, if we have a very low value, the tree may underfit the data. So, we can't say for sure that "higher is better".

13) Which of the following splitting point on feature x1 will classify the data correctly? A) Greater than x11 B) Less than x11 C) Equal to x11 D) None of above

D If you search any point on X1 you won't find any point that gives 100% accuracy.

Q15) Which of the following tree based algorithm uses some parallel (full or partial) implementation? Random Forest Gradient Boosted Trees XGBOOST Both A and C A, B and C

D Only Random Forest and XGBoost have parallel implementations. Random Forest is very easy to parallelize, whereas XGBoost can be only partially parallel. In Random Forest, all trees can grow in parallel, and the outputs of the individual trees are combined at the end. XGBoost does not build multiple trees in parallel like Random Forest, because it needs the predictions after each tree to update the gradients. Instead, it parallelizes the work WITHIN a single tree to create branches independently.

Q38) Which of the following are disadvantages of the Decision Tree algorithm? Decision tree is not easy to interpret Decision tree is not a very stable algorithm Decision Tree will overfit the data easily if it perfectly memorizes it Both B and C

D Option A is False, as decision trees are very easy to interpret. Option B is True, as decision trees are highly unstable models. Option C is True, as a decision tree also tries to memorize noise. So option D is True.

Context 16-17 Suppose you are working on a binary classification problem with 3 input features, and you chose to apply a bagging algorithm (X) on this data. You chose max_features = 2 and n_estimators = 3. Now, assume that each estimator has 70% accuracy. Note: Algorithm X aggregates the results of the individual estimators by maximum voting. 16) What will be the maximum accuracy you can get? A) 70% B) 80% C) 90% D) 100%

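D Refer to the table for models M1, M2 and M3. With maximum voting, a point is classified correctly whenever at least two of the three estimators are right, so if the three estimators make their errors on different data points, the ensemble can reach 100% accuracy.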
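
The 100% case can be reproduced with a small constructed example. The labels and the positions of the errors below are hypothetical (they are not the original table for M1, M2 and M3); they are chosen so that each estimator is wrong on a different 30% of the points.

```python
from collections import Counter

# Hypothetical ground-truth labels for 10 data points
truth = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

def flip_at(labels, wrong_indices):
    """Copy the labels, flipping the ones at wrong_indices to simulate errors."""
    return [1 - y if i in wrong_indices else y for i, y in enumerate(labels)]

# Each estimator is wrong on 3 of 10 points, and the error sets do not overlap
m1 = flip_at(truth, {0, 1, 2})
m2 = flip_at(truth, {3, 4, 5})
m3 = flip_at(truth, {6, 7, 8})

def accuracy(pred):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def majority_vote(*models):
    """Per data point, return the label predicted by most estimators."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*models)]

ensemble = majority_vote(m1, m2, m3)
print(accuracy(m1), accuracy(m2), accuracy(m3), accuracy(ensemble))
```

Every point has at most one wrong vote, so the majority is always right. If the errors overlapped instead, the ensemble accuracy would drop, which is why 100% is the maximum rather than the expected value.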

Q14) Assume everything else remains same, which of the following is the right statement about the predictions from decision tree in comparison with predictions from Random Forest? Lower Variance, Lower Bias Lower Variance, Higher Bias Higher Variance, Higher Bias Lower Bias, Higher Variance

D The predicted values in Decision Trees have low Bias but high Variance when compared to Random Forests. This is because a random forest attempts to reduce variance by bootstrap aggregation. Refer to topic 15.4 of The Elements of Statistical Learning.

Q36) Imagine a two-variable predictor space having 10 data points. A decision tree is built over it with 5 leaf nodes. How many distinct regions will be formed in the predictor space? 25 10 2 5

D Each leaf node corresponds to exactly one region, so the predictor space will be divided into 5 regions. Therefore, option D is correct.

Q33) Which of the following is a mandatory data pre-processing step(s) for XGBOOST? Impute Missing Values Remove Outliers Convert data to numeric array / sparse matrix Input variable must have normal distribution Select the sample of records for each tree/ estimators 1 and 2 1, 2 and 3 3, 4 and 5 3 5 All

D XGBoost doesn't require most of the pre-processing steps, so among the steps listed above only converting the data to numeric form is required.

3) Which of the following is/are true about Random Forest and Gradient Boosting ensemble methods? Both methods can be used for classification tasks Random Forest is used for classification whereas Gradient Boosting is used for regression tasks Random Forest is used for regression whereas Gradient Boosting is used for classification tasks Both methods can be used for regression tasks A) 1 B) 2 C) 3 D) 4 E) 1 and 4

E Both algorithms are designed for classification as well as regression tasks.

Q25) In which of the following application(s), a tree based algorithm can be applied successfully? Recognizing moving hand gestures in real time Predicting next move in a chess game Predicting sales values of a company based on their past sales A and B A, B, and C

E Option E is correct as we can apply tree based algorithm in all the 3 scenarios.

15. Which of the following statements is TRUE about the Bayes classifier? (A) Bayes classifier works on the Bayes theorem of probability. (B) Bayes classifier is an unsupervised learning algorithm. (C) Bayes classifier is also known as maximum apriori classifier. (D) It assumes the independence between the independent variables or features.

Option-A Explanation: Bayes classifier internally uses the concept of Bayes theorem for doing the predictions for unseen data points.

13. True or False: In a naive Bayes algorithm, when an attribute value in the testing record has no example in the training set, then the entire posterior probability will be zero. (A) True (B) False (C) Can't be determined (D) None of these.

Option-A Explanation: If a particular attribute value never appears in the training set, its estimated conditional probability is zero, and because Naive Bayes multiplies the conditional probabilities together, the entire posterior becomes zero. This is known as the zero-probability problem in the Naive Bayes algorithm.
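
A common remedy for this zero-probability problem is additive (Laplace) smoothing, which the card itself does not mention. The sketch below is a minimal illustration under that assumption; the attribute values, the class and the helper name `likelihood` are all hypothetical.

```python
def likelihood(value, class_values, n_distinct_values, alpha=0.0):
    """Estimate P(value | class) with additive smoothing of strength alpha.

    alpha = 0 is the plain frequency estimate; alpha = 1 is Laplace smoothing.
    """
    count = class_values.count(value)
    return (count + alpha) / (len(class_values) + alpha * n_distinct_values)

# Hypothetical training values of one attribute for records of class "yes"
outlook_yes = ["sunny", "sunny", "rainy", "rainy", "rainy"]
n_values = 3  # the attribute can be sunny, rainy or overcast

# "overcast" never occurs in the training records of this class
unsmoothed = likelihood("overcast", outlook_yes, n_values)
smoothed = likelihood("overcast", outlook_yes, n_values, alpha=1.0)

# Any product that includes the unsmoothed estimate collapses to zero
prior_yes = 0.6          # hypothetical class prior
other_likelihood = 0.5   # hypothetical likelihood of another attribute
print(prior_yes * other_likelihood * unsmoothed)
print(prior_yes * other_likelihood * smoothed)
```

With smoothing, an unseen value gets a small non-zero probability, so the remaining attributes still influence the posterior.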

11. Which of the following is FALSE about Correlation and Covariance? (A) A zero correlation does not necessarily imply independence between variables. (B) Correlation and covariance values are the same. (C) The covariance and correlation are always the same sign. (D) Correlation is the standardized version of Covariance.

Option-B Explanation: Correlation is defined as covariance divided by the product of the two standard deviations and is therefore the standardized version of covariance. The two values are generally not the same.
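
The relationship between the two quantities can be checked numerically. The sketch below uses sample (n - 1) estimates and made-up data; the function names are chosen for this example only.

```python
import math

def covariance(xs, ys):
    """Sample covariance with the usual n - 1 denominator."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]
ys_scaled = [y * 100 for y in ys]  # same relationship, different units

# Covariance changes with the units; correlation does not
print(covariance(xs, ys), covariance(xs, ys_scaled))
print(correlation(xs, ys), correlation(xs, ys_scaled))

# Zero correlation without independence: y is fully determined by x
sym_x = [-2.0, -1.0, 0.0, 1.0, 2.0]
sym_y = [x * x for x in sym_x]
print(correlation(sym_x, sym_y))
```

The last example relates to option (A): the correlation is zero even though y depends entirely on x.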

6. Which one of the following statements is TRUE for a Decision Tree? (A) Decision tree is only suitable for the classification problem statement. (B) In a decision tree, the entropy of a node decreases as we go down a decision tree. (C) In a decision tree, entropy determines purity. (D) Decision tree can be used only for numeric-valued and continuous attributes.

Option-B Explanation: Entropy measures the impurity of a node. Splits are chosen to reduce impurity, so entropy decreases as we go down the decision tree.

9. In the Naive Bayes algorithm, suppose that prior for class w1 is greater than class w2, would the decision boundary shift towards the region R1(region for deciding w1) or towards region R2(region for deciding w2)? (A) towards region R1. (B) towards region R2. (C) No shift in decision boundary. (D) It depends on the exact value of priors.

Option-B Explanation: Because the prior for w1 is greater than the prior for w2, more of the feature space is assigned to w1. Region R1 grows, so the decision boundary shifts towards region R2.
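
The direction of the shift can be seen in a simple special case. The sketch below assumes a single feature with Gaussian class-conditional densities that share the same standard deviation and have means mu1 < mu2; those assumptions, the numbers and the function name are illustrative and not taken from the question. Under them, the point where the two posteriors are equal is the midpoint of the means plus sigma^2 * ln(P(w1) / P(w2)) / (mu2 - mu1).

```python
import math

def decision_boundary(mu1, mu2, sigma, prior1):
    """Point x where P(w1 | x) = P(w2 | x) for two equal-variance 1-D Gaussians.

    Assumes mu1 < mu2, so w1 is decided to the left of the boundary (region R1)
    and w2 to the right of it (region R2).
    """
    prior2 = 1.0 - prior1
    midpoint = (mu1 + mu2) / 2.0
    shift = sigma ** 2 * math.log(prior1 / prior2) / (mu2 - mu1)
    return midpoint + shift

equal = decision_boundary(0.0, 2.0, 1.0, prior1=0.5)
skewed = decision_boundary(0.0, 2.0, 1.0, prior1=0.8)
print(equal, skewed)
```

With equal priors the boundary sits at the midpoint of the two means. Raising the prior of w1 moves it to the right, that is, towards R2, which enlarges R1.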

5. The robotic arm will be able to paint every corner in the automotive parts while minimizing the quantity of paint wasted in the process. Which learning technique is used in this problem? (A) Supervised Learning. (B) Unsupervised Learning. (C) Reinforcement Learning. (D) Both (A) and (B).

Option-C Explanation: Here the robot learns from the environment by receiving rewards for positive actions and penalties for negative actions.

12. In Regression modeling we develop a mathematical equation that describes how, (Predictor-Independent variable, Response-Dependent variable) (A) one predictor and one or more response variables are related. (B) several predictors and several response variables are related. (C) one response and one or more predictors are related. (D) All of these are correct.

Option-C Explanation: In the regression problem statement, we have one or more independent variables but only one dependent variable.

2. Which of the following statements is False in the case of the KNN Algorithm? (A) For a very large value of K, points from other classes may be included in the neighborhood. (B) For a very small value of K, the algorithm is very sensitive to noise. (C) KNN is used only for classification problem statements. (D) KNN is a lazy learner.

Option-C Explanation: We can use KNN for both regression and classification problem statements. In classification, we predict the majority class among the K nearest neighbors, while in regression we predict the average of the target values of the K nearest neighbors.
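
Both uses of KNN can be sketched with one shared neighbor search. The example below assumes a single numeric feature and absolute distance, and the data points are made up; it is a toy illustration rather than a reference implementation.

```python
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training rows (feature, target) closest to the query."""
    return sorted(train, key=lambda row: abs(row[0] - query))[:k]

def knn_classify(train, query, k):
    """Classification: majority class among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Regression: average target value of the k nearest neighbors."""
    targets = [target for _, target in k_nearest(train, query, k)]
    return sum(targets) / len(targets)

# Hypothetical data: (feature, class label) and (feature, numeric target)
class_data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]
reg_data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (10.0, 100.0)]

print(knn_classify(class_data, 8.5, k=1))  # only the closest point votes
print(knn_classify(class_data, 8.5, k=5))  # every point votes
print(knn_regress(reg_data, 2.1, k=3))
```

With k=5 the neighborhood of the query at 8.5 contains all three "a" points, so the far-away majority class wins. This is the effect described in option (A).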

1. How do we perform Bayesian classification when some features are missing? (A) We assume the missing values are the mean of all values. (B) We ignore the missing features. (C) We integrate the posterior probabilities over the missing features. (D) Drop the features completely.

Option-C Explanation: Here we don't use the general methods of handling missing values. Instead, we integrate the posterior probabilities over the missing features for better predictions.

14. Which of the following is NOT True about Ensemble Techniques? (A) Bagging decreases the variance of the classifier. (B) Boosting helps to decrease the bias of the classifier. (C) Bagging combines the predictions from different models and then finally gives the results. (D) Bagging and Boosting are the only available ensemble techniques.

Option-D Explanation: Apart from bagging and boosting, there are various other ensemble techniques such as Stacking, Extra Trees classifier, Voting classifier, etc.

4. The following data is used to apply a linear regression algorithm with the least squares regression line Y = a1X. Then, the approximate value of a1 is given by: (X: independent variable, Y: dependent variable)
X | 1 | 20 | 30 | 40
Y | 1 | 400 | 800 | 1300
(A) 27.876 (B) 32.650 (C) 40.541 (D) 28.956

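Option-D Explanation: Hint: Use the ordinary least squares method. For a line through the origin, a1 = ΣXY / ΣX² = (1 + 8000 + 24000 + 52000) / (1 + 400 + 900 + 1600) = 84001 / 2901 ≈ 28.956.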
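
The ordinary least squares fit of a line through the origin, Y = a1X, has the closed form a1 = ΣXY / ΣX². This follows from setting the derivative of Σ(Y - a1X)² with respect to a1 to zero. The short check below applies it to the four data points from the question; the variable names are chosen for this example.

```python
# Data from the question
xs = [1, 20, 30, 40]
ys = [1, 400, 800, 1300]

# Least squares slope for a line through the origin: a1 = sum(x*y) / sum(x*x)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
a1 = sum_xy / sum_xx

print(sum_xy, sum_xx, round(a1, 3))
```

Note that this is the no-intercept model stated in the question. A fit with an intercept term would give a different slope.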

10. Which of the following statements is FALSE about Ridge and Lasso Regression? (A) These are types of regularization methods to solve the overfitting problem. (B) Lasso Regression is a type of regularization method. (C) Ridge regression shrinks the coefficient to a lower value. (D) Ridge regression lowers some coefficients to a zero value.

Option-D Explanation: Ridge regression never drops any feature; instead, it shrinks the coefficients. Lasso regression, however, drops some features by making their coefficients exactly zero. Therefore, Lasso is also used as a Feature Selection Technique.
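
The difference is easiest to see in a simplified setting. The sketch below assumes an orthonormal design (XᵀX = I), with Ridge minimizing ½‖y − Xβ‖² + (λ/2)‖β‖² and Lasso minimizing ½‖y − Xβ‖² + λ‖β‖₁. Under those assumptions each coefficient has a closed form in terms of its ordinary least squares value. The coefficients and λ are made-up numbers, and real data rarely satisfies the orthonormality assumption.

```python
import math

def ridge_shrink(beta_ols, lam):
    """Ridge under an orthonormal design: proportional shrinkage, never exactly zero."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding, small values become zero."""
    magnitude = max(abs(beta_ols) - lam, 0.0)
    return math.copysign(magnitude, beta_ols) if magnitude else 0.0

# Hypothetical ordinary least squares coefficients and penalty strength
ols = [3.0, -0.4, 0.2]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]
print(ridge)
print(lasso)
```

Ridge scales every coefficient towards zero but keeps all of them, while Lasso sets the two small coefficients to exactly zero, which is the feature-selection behavior described above.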

3. Which of the following statements is TRUE? (A) Outliers should always be identified and removed from a dataset. (B) Outliers can never be present in the testing dataset. (C) An outlier is a data point that is significantly close to other data points. (D) The nature of our business problem determines how outliers are used.

Option-D Explanation: The nature of a business problem often determines how outliers are used, e.g., in problems with class imbalance such as Credit Card Fraud detection, where the records for the fraud class are far fewer than those for the non-fraud class.

7. How do you choose the right node while constructing a decision tree? (A) An attribute having high entropy (B) An attribute having high entropy and information gain (C) An attribute having the lowest information gain. (D) An attribute having the highest information gain.

Option-D Explanation: We first select the attribute that has the maximum information gain.
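
Information gain is the entropy of the parent node minus the weighted average entropy of the child nodes produced by a split. The sketch below computes it for two attributes on a tiny made-up dataset; the attribute names, rows and helper functions are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Parent entropy minus the weighted entropy of the groups split by attribute."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

# Hypothetical training rows
rows = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]

gains = {attr: information_gain(rows, attr, "play") for attr in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, best)
```

Here "outlook" separates the classes perfectly while "windy" tells us nothing, so "outlook" would be chosen for the split.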

