CISD 43


25) The figure below shows AUC-ROC curves for three logistic regression models. Different colors show curves for different hyperparameter values. Which of the following AUC-ROC curves will give the best result? A) Yellow B) Pink C) Black D) All are same

A) The best classifier is the one with the largest area under the curve, and the yellow line has the largest area under the curve.

1) True-False: Is Logistic regression a supervised machine learning algorithm? A) TRUE B) FALSE

A) True. Logistic regression is a supervised learning algorithm because it uses true labels for training. A supervised learning algorithm requires input variables (X) and a target variable (Y) when you train the model.

19) Suppose, You applied a Logistic Regression model on a given data and got a training accuracy X and testing accuracy Y. Now, you want to add a few new features in the same data. Select the option(s) which is/are correct in such a case. Note: Consider remaining parameters are same. A) Training accuracy increases B) Training accuracy increases or remains the same C) Testing accuracy decreases D) Testing accuracy increases or remains the same

A and D) Adding more features to the model will increase the training accuracy, because the model has more information with which to fit the data. Testing accuracy increases if the added feature is found to be significant.

10) What would be the range of p in such case? A) (0, inf) B) (-inf, 0 ) C) (0, 1) D) (-inf, inf)

C) For values of x over the real line from −∞ to +∞, the logistic function gives output in the interval (0, 1).
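A quick numerical check of this (a minimal sketch using NumPy):

```python
import numpy as np

def logistic(x):
    # The logistic (sigmoid) function maps any real x into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-30.0, -5.0, 0.0, 5.0, 30.0])
print(logistic(xs))  # near 0, ..., 0.5, ..., near 1 -- always inside (0, 1)
```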

ensemble models - weakness

However there are also downsides. Most notably, some ensemble techniques, particularly boosting, are prone to overfitting. You also lose a lot of the transparency that individual models offer. You saw this clearly in the Random Forest example, where the easy explanations offered by decision trees are abstracted away by the forest.

ensemble models - strengths

Most notable is of course their performance. Ensemble models are often some of the most accurate techniques to apply to a problem. They also tend to have low variance because they're built from multiple internal models.

2) Which of the following is an example of a deterministic algorithm? A) PCA B) K-Means C) None of the above

(A) A deterministic algorithm is one whose output does not change across runs. PCA gives the same result if we run it again on the same data, but k-means does not, because its output depends on the random initialization of the centroids.

18) Adding a non-important feature to a linear regression model may result in. 1. Increase in R-square 2. Decrease in R-square A) Only 1 is correct B) Only 2 is correct C) Either 1 or 2 D) None of these

(A) After adding a feature to the feature space, whether that feature is important or unimportant, the R-squared always increases (or at worst stays the same).

32) Which of the following value of K will have least leave-one-out cross validation accuracy? A) 1NN B) 3NN C) 4NN D) All have same leave one out error

(A) In 1-NN under leave-one-out, each point will always be misclassified (its nearest remaining neighbour belongs to the other class), which means you will get 0% accuracy.

34) Suppose we have a dataset which can be trained with 100% accuracy with help of a decision tree of depth 6. Now consider the points below and choose the option based on these points. Note: All other hyper parameters are same and other factors are not affected. 1. Depth 4 will have high bias and low variance 2. Depth 4 will have low bias and low variance A) Only 1 B) Only 2 C) Both 1 and 2 D) None of the above

(A) Fitting a decision tree of depth 4 to such data will most likely underfit. In the case of underfitting, you have high bias and low variance.

4) Which of the following statement(s) is / are true for Gradient Descent (GD) and Stochastic Gradient Descent (SGD)? 1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function. 2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration. 3. In GD, you either use the entire data or a subset of training data to update a parameter in each iteration. A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1,2 and 3

(A) In SGD, each iteration uses a single randomly chosen sample (or a small random batch), whereas in GD each iteration uses all of the training observations.
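A minimal sketch of the difference, using linear regression with a squared-error objective (NumPy; the data, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

def grad(w, Xb, yb):
    # Gradient of the mean squared error for linear regression.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w_gd, w_sgd, lr = np.zeros(2), np.zeros(2), 0.1
for _ in range(200):
    w_gd -= lr * grad(w_gd, X, y)                   # GD: all observations per update
    i = rng.integers(len(y))
    w_sgd -= lr * grad(w_sgd, X[i:i+1], y[i:i+1])   # SGD: one random sample per update

print(w_gd, w_sgd)  # both drift towards the true weights [2, -3]
```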

27) It is possible to construct a k-NN classification algorithm based on this black box alone. Note: Where n (number of training observations) is very large compared to k. A) TRUE B) FALSE

(A) In the first step, you pass an observation (q1) to the black-box algorithm, which returns the nearest observation and its class. In the second step, you throw that nearest observation out of the training data and pass in q1 again; the black box returns the next nearest observation and its class. You repeat this procedure k times and take a majority vote over the k returned classes.
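A sketch of that procedure in Python. The 1-NN black box is simulated here by a hypothetical `one_nn_black_box` helper; only its "nearest observation and its class" interface matters:

```python
import numpy as np
from collections import Counter

def one_nn_black_box(X, y, q):
    # Hypothetical stand-in for the black box: returns the index and
    # class of the single training observation nearest to the query q.
    i = int(np.argmin(np.linalg.norm(X - q, axis=1)))
    return i, y[i]

def knn_via_black_box(X, y, q, k):
    X, y = X.copy(), list(y)
    votes = []
    for _ in range(k):
        i, label = one_nn_black_box(X, y, q)
        votes.append(label)
        X = np.delete(X, i, axis=0)  # throw the returned neighbour out...
        del y[i]                     # ...and query the black box again
    return Counter(votes).most_common(1)[0][0]  # majority vote of k labels
```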

28) Instead of using 1-NN black box we want to use the j-NN (j>1) algorithm as black box. Which of the following option is correct for finding k-NN using j-NN? 1. J must be a proper factor of k 2. J > k 3. Not possible A) 1 B) 2 C) 3

(A) Same as question number 27

Context 38-39 Imagine, you have a 28 * 28 image and you run a 3 * 3 convolution neural network on it with the input depth of 3 and output depth of 8. Note: Stride is 1 and you are using same padding. 38) What is the dimension of output feature map when you are using the given parameters. A) 28 width, 28 height and 8 depth B) 13 width, 13 height and 8 depth C) 28 width, 13 height and 8 depth D) 13 width, 28 height and 8 depth

(A) With "same" padding and stride 1, the spatial dimensions are preserved, so the output is 28 wide and 28 high, with depth 8 (the number of filters). In general, output size = (N − F + 2P)/S + 1, where N is the input size, F the filter size, P the padding and S the stride.
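A small helper that evaluates this formula (a sketch; exact padding conventions vary between frameworks):

```python
def conv_output_size(n, f, stride=1, padding="same"):
    # "same" padding with stride 1 preserves the spatial size;
    # "valid" uses (N - F)/S + 1 with no zero padding.
    p = (f - 1) // 2 if padding == "same" else 0
    return (n - f + 2 * p) // stride + 1

print(conv_output_size(28, 3))                   # 28 -> answer A: 28 x 28 x 8
print(conv_output_size(28, 3, padding="valid"))  # 26
```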

8) Below are the 8 actual values of target variable in the train file. [0,0,0,1,1,1,1,1] What is the entropy of the target variable? A) -(5/8 log(5/8) + 3/8 log(3/8)) B) 5/8 log(5/8) + 3/8 log(3/8) C) 3/8 log(5/8) + 5/8 log(3/8) D) 5/8 log(3/8) - 3/8 log(5/8)

(A) The formula for entropy is H = −Σ pᵢ log(pᵢ). Here p(1) = 5/8 and p(0) = 3/8, so H = −(5/8 log(5/8) + 3/8 log(3/8)). So the answer is A.
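Checking this numerically (a minimal NumPy sketch, using natural logs):

```python
import numpy as np

def entropy(labels):
    # H = -sum(p_i * log(p_i)) over the class proportions p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

print(entropy([0, 0, 0, 1, 1, 1, 1, 1]))
# ~0.66 nats, the same value as -(5/8*log(5/8) + 3/8*log(3/8))
```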

21) In ensemble learning, you aggregate the predictions for weak learners, so that an ensemble of these models will give a better prediction than prediction of individual models. Which of the following statements is / are true for weak learners used in ensemble model? 1. They don't usually overfit. 2. They have high bias, so they cannot solve complex learning problems 3. They usually overfit. A) 1 and 2 B) 1 and 3 C) 2 and 3 D) Only 1 E) Only 2 F) None of the above

(A) Weak learners are only confident about a particular part of the problem. They usually don't overfit, which means weak learners have low variance and high bias.

6) Imagine, you are working with "Analytics Vidhya" and you want to develop a machine learning algorithm which predicts the number of views on the articles. Your analysis is based on features like author name, number of articles written by the same author on Analytics Vidhya in past and a few other features. Which of the following evaluation metric would you choose in that case? 1. Mean Square Error 2. Accuracy 3. F1 Score A) Only 1 B) Only 2 C) Only 3 D) 1 and 3 E) 2 and 3 F) 1 and 2

(A) The number of views of an article is a continuous target variable, so this falls under regression, and mean squared error is the appropriate evaluation metric.

10) Skip gram model is one of the best models used in Word2vec algorithm for words embedding. Which one of the following models depict the skip gram model? A) A B) B C) Both A and B D) None of these

(B) Both models are used in the Word2vec algorithm: model 1 represents the CBOW model, whereas model 2 represents the skip-gram model.

33) Suppose you are given the below data and you want to apply a logistic regression model for classifying it in two given classes. You are using logistic regression with L1 regularization. Where C is the regularization parameter and w1 & w2 are the coefficients of x1 and x2. Which of the following option is correct when you increase the value of C from zero to a very large value? A) First w2 becomes zero and then w1 becomes zero B) First w1 becomes zero and then w2 becomes zero C) Both becomes zero at the same time D) Both cannot be zero even after very large value of C

(B) Looking at the image, we see that classification can be done efficiently using x2 alone, so w1 becomes zero first. As the regularization parameter increases further, w2 also gets closer and closer to zero.

12) [True or False] LogLoss evaluation metric can have negative values. A) TRUE B) FALSE

(B) Log loss cannot have negative values: every predicted probability p lies in (0, 1), so each term −log(p) is non-negative, and log loss is an average of such terms.
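You can confirm this with scikit-learn's implementation (a quick sketch; the toy labels and probabilities are made up):

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.35]   # predicted probability of the positive class
print(log_loss(y_true, y_prob))  # always >= 0, since -log(p) >= 0 for p in (0, 1]
```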

A feature F1 can take certain values: A, B, C, D, E, & F, and represents the grade of students from a college. 1) Which of the following statements is true in the following case? A) Feature F1 is an example of a nominal variable. B) Feature F1 is an example of an ordinal variable. C) It doesn't belong to any of the above categories. D) Both of these

(B) Ordinal variables are variables that have an order among their categories. For example, grade A should be considered a higher grade than grade B.

39) What is the dimensions of output feature map when you are using following parameters. A) 28 width, 28 height and 8 depth B) 13 width, 13 height and 8 depth C) 28 width, 13 height and 8 depth D) 13 width, 28 height and 8 depth

(B) Same as above

11) Let's say, you are using activation function X in hidden layers of neural network. At a particular neuron for any given input, you get the output as "-0.0001". Which of the following activation function could X represent? A) ReLU B) tanh C) SIGMOID D) None of these

(B) The function is tanh, because the output range of tanh is (−1, 1); ReLU and sigmoid can never output a negative value.

5) Which of the following hyper parameter(s), when increased may cause random forest to over fit the data? 1. Number of Trees 2. Depth of Tree 3. Learning Rate A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 2 and 3 F) 1,2 and 3

(B) Usually, increasing the depth of the trees causes overfitting. Learning rate is not a hyperparameter of random forest. Increasing the number of trees does not cause overfitting; if anything, it reduces variance.

29) Suppose you are given 7 Scatter plots 1-7 (left to right) and you want to compare Pearson correlation coefficients between variables of each scatterplot. Which of the following is in the right order? 1. 1<2<3<4 2. 1>2>3 > 4 3. 7<6<5<4 4. 7>6>5>4 A) 1 and 3 B) 2 and 3 C) 1 and 4 D) 2 and 4

(B) From plot 1 to 4 the correlation is decreasing in absolute value, while from plot 4 to 7 the correlation is increasing in magnitude but negative (for example 0, −0.3, −0.7, −0.99).

13) Which of the following statements is/are true about "Type-1" and "Type-2" errors? 1. Type1 is known as false positive and Type2 is known as false negative. 2. Type1 is known as false negative and Type2 is known as false positive. 3. Type1 error occurs when we reject a null hypothesis when it is actually true. A) Only 1 B) Only 2 C) Only 3 D) 1 and 2 E) 1 and 3 F) 2 and 3

(E) In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a "false positive"), while a type II error is incorrectly retaining a false null hypothesis (a "false negative").

17) In the previous question, suppose you have identified multi-collinear features. Which of the following action(s) would you perform next? 1. Remove both collinear variables. 2. Instead of removing both variables, we can remove only one variable. 3. Removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. A) Only 1 B) Only 2 C) Only 3 D) Either 1 or 3 E) Either 2 or 3

(E) You shouldn't remove both features, because removing both loses all of the information they carry. Either remove only one feature, or use a regularization algorithm such as L1 (lasso) or L2 (ridge).

11) Suppose you want to predict the class of the new data point x=1 and y=1 using Euclidean distance in 3-NN. To which class does this data point belong? A) + Class B) - Class C) Can't say D) None of these

A) All three nearest points are of the + class, so this point will be classified as + class.

23) A company has built a kNN classifier that gets 100% accuracy on training data. When they deployed this model on the client side, it was found that the model is not at all accurate. Which of the following might have gone wrong? Note: The model was successfully deployed and no technical issues were found at the client side except the model performance A) It is probably an overfitted model B) It is probably an underfitted model C) Can't say D) None of these

A) An overfitted model seems to perform well on training data, but it is not generalized enough to give the same results on new data.

7) Which of the following is true about Manhattan distance? A) It can be used for continuous variables B) It can be used for categorical variables C) It can be used for categorical as well as continuous D) None of these

A) Manhattan distance is designed for calculating the distance between real-valued (continuous) features.

9) Which of the following will be the Euclidean distance between the two data points A(1,3) and B(2,3)? A) 1 B) 2 C) 4 D) 8

A) sqrt((1−2)^2 + (3−3)^2) = sqrt(1 + 0) = 1
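Both distances can be checked with SciPy (a quick sketch; `cityblock` is SciPy's name for the Manhattan distance):

```python
from scipy.spatial.distance import cityblock, euclidean

A, B = (1, 3), (2, 3)
print(euclidean(A, B))  # sqrt((1-2)^2 + (3-3)^2) = 1.0
print(cityblock(A, B))  # |1-2| + |3-3| = 1  (Manhattan)
```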

1) [True or False] k-NN algorithm does more computation on test time rather than train time. A) TRUE B) FALSE

A) The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the testing phase, a test point is classified by assigning it the label most frequent among the k training samples nearest to that query point; hence the higher computation at test time.

18) When you find noise in data which of the following option would you consider in k-NN? A) I will increase the value of k B) I will decrease the value of k C) Noise can not be dependent on value of k D) None of these

A To be more sure of which classifications you make, you can try increasing the value of k.

26) True-False: It is possible to construct a 2-NN classifier by using the 1-NN classifier? A) TRUE B) FALSE

A You can implement a 2-NN classifier by ensembling 1-NN classifiers

15) Which of the following will be true about k in k-NN in terms of bias? A) When you increase k, the bias increases B) When you decrease k, the bias increases C) Can't say D) None of these

A) A large k means a simpler model, and a simpler model is always considered to have high bias.

2) In the image below, which would be the best value for k, assuming that the algorithm you are using is k-Nearest Neighbor? A) 3 B) 10 C) 20 D) 50

B) Validation error is lowest when the value of k is 10, so it is best to use this value of k.

24) You are given the following 2 statements; find which option is/are true in the case of k-NN. 1. For a very large value of k, we may include points from other classes in the neighborhood. 2. For a too-small value of k, the algorithm is very sensitive to noise. A) 1 B) 2 C) 1 and 2 D) None of these

C Both the options are true and are self explanatory.

28) Following are two statements given for the k-NN algorithm; which of the statement(s) is/are true? 1. We can choose the optimal value of k with the help of cross-validation 2. Euclidean distance treats each feature as equally important A) 1 B) 2 C) 1 and 2 D) None of these

C Both the statements are true

19) In k-NN it is very likely to overfit due to the curse of dimensionality. Which of the following option would you consider to handle such problem? 1. Dimensionality Reduction 2. Feature selection A) 1 B) 2 C) 1 and 2 D) None of these

C) In such a case you can use either a dimensionality reduction algorithm or a feature selection algorithm.

25) Which of the following statements is true for k-NN classifiers? A) The classification accuracy is better with larger values of k B) The decision boundary is smoother with smaller values of k C) The decision boundary is linear D) k-NN does not require an explicit training step

D) Option A is not always true: you have to ensure that the value of k is neither too high nor too low. Option B is not true: the decision boundary gets more jagged, not smoother, as k decreases. Option C: the boundary need not be linear, for the same reason as option B. Option D is true: k-NN requires no explicit training step.

5) Which of the following statement is true about k-NN algorithm? 1. k-NN performs much better if all of the data have the same scale 2. k-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large 3. k-NN makes no assumptions about the functional form of the problem being solved A) 1 and 2 B) 1 and 3 C) Only 1 D) All of the above

D) All of the above statements hold for the k-NN algorithm.

decision tree - strength

Decision trees are great, and their easily represented set of rules is a powerful feature for modeling and even more so for conveying that model to a more general audience.

5. Which of the following is a disadvantage of decision trees? 1. Factor analysis 2. Decision trees are robust to outliers 3. Decision trees are prone to overfitting 4. None of the above

Decision trees are prone to overfitting

Random Forest - weaknesses

Firstly in either classification or regression it will not predict outside of sample, meaning it will only return values that are within a range it has seen before. Random Forests can also get rather large and slow if you let them grow too wildly. The biggest disadvantage, however, is the lack of transparency in the process. Random Forest is often referred to as a "black box" model, meaning it will give you an output but very little insight into how it got there. You'll run into a few more of these throughout the course.

downside of decision trees - weakness

Firstly, there is a randomness to their generation, which can lead to variance in estimates. There is no hard and fast rule for how the tree is built, so it doesn't build the same way every time. You saw this above when we discussed the random_state argument. In addition, they are incredibly prone to overfitting, particularly if you allow them to grow too deep or complex. Also note that because they work from information gain, they are biased towards the dominant class, so balanced data is needed. High variance and a propensity to overfit are serious problems.

18. You run gradient descent for 15 iterations with a=0.3 and compute J(theta) after each iteration. You find that the value of J(Theta) decreases quickly and then levels off. Based on this, which of the following conclusions seems most plausible? 1. Rather than using the current value of a, use a larger value of a (say a=1.0) 2. Rather than using the current value of a, use a smaller value of a (say a=0.1) 3. a=0.3 is an effective choice of learning rate 4. None of the above

a=0.3 is an effective choice of learning rate

24) In previous question, if you train the same algorithm for tuning 2 hyper parameters say "max_depth" and "learning_rate". You want to select the right value against "max_depth" (from given 10 depth values) and learning rate (from given 5 different learning rates). In such cases, which of the following will represent the overall time? A) 1000-1500 second B) 1500-3000 Second C) More than or equal to 3000 Second D) None of these

(D) Same as question number 23.

27) Which of the following images shows the cost function for y = 1? The following is the loss function in logistic regression (y-axis: loss function, x-axis: log probability) for a two-class classification problem. Note: Y is the target class. A) A B) B C) Both D) None of these

A) A is the correct answer, as the loss function decreases as the log probability increases.

11) In above question what do you think which function would make p between (0,1)? A) logistic function B) Log likelihood function C) Mixture of both D) None of them

A Explanation is same as question number 10

15) The logit function(given as l(x)) is the log of odds function. What could be the range of logit function in the domain x=[0,1]? A) (- ∞ , ∞) B) (0,1) C) (0, ∞) D) (- ∞, 0)

A For our purposes, the odds function has the advantage of transforming the probability function, which has values from 0 to 1, into an equivalent function with values between 0 and ∞. When we take the natural log of the odds function, we get a range of values from -∞ to ∞.

20) Choose which of the following options is true regarding One-Vs-All method in Logistic Regression. A) We need to fit n models in n-class classification problem B) We need to fit n-1 models to classify into n classes C) We need to fit only 1 model to classify into n classes D) None of these

A) If there are n classes, then n separate logistic regression models have to be fit, where the probability of each class is predicted against the rest of the classes combined.
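A minimal sketch with scikit-learn, using the 3-class iris dataset to show that one-vs-all fits n binary models:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # n = 3 classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))        # 3 -- one binary logistic model per class
```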

9) Which of the following algorithms do we use for Variable Selection? A) LASSO B) Ridge C) Both D) None of these

A) Lasso applies an absolute-value (L1) penalty; as the penalty increases, some of the variable coefficients may become exactly zero, which performs variable selection.

18) How will the bias change on using high(infinite) regularisation? Suppose you have given the two scatter plot "a" and "b" for two classes( blue for positive and red for negative class). In scatter plot "a", you correctly classified all data points using logistic regression ( black line is a decision boundary). A) Bias will be high B) Bias will be low C) Can't say D) None of these

A) The model will become very simple, so the bias will be very high.

16) Which of the following option is true? A) Linear Regression errors values has to be normally distributed but in case of Logistic Regression it is not the case B) Logistic Regression errors values has to be normally distributed but in case of Linear Regression it is not the case C) Both Linear Regression and Logistic Regression error values have to be normally distributed D) Both Linear Regression and Logistic Regression error values have not to be normally distributed

A) Only A is true. Refer to this tutorial: https://czep.net/stat/mlelr.pdf

24) Suppose, above decision boundaries were generated for the different value of regularization. Which of the above decision boundary shows the maximum regularization? A) A B) B C) C D) All have equal regularization

A) More regularization means a larger penalty, and therefore a less complex decision boundary, which is what the first figure, A, shows.

3) True-False: Is it possible to design a logistic regression algorithm using a Neural Network Algorithm? A) TRUE B) FALSE

A) True. A neural network is a universal approximator, so it can implement the logistic regression algorithm; a single neuron with a sigmoid activation is exactly logistic regression.

7) One of the very good methods to analyze the performance of Logistic Regression is AIC, which is similar to R-Squared in Linear Regression. Which of the following is true about AIC? A) We prefer a model with minimum AIC value B) We prefer a model with maximum AIC value C) Both but depend on the situation D) None of these

A) We select the best logistic regression model as the one with the least AIC. For more information refer to this source: http://www4.ncsu.edu/~shu3/Presentation/AIC.pdf

4) True-False: Is it possible to apply a logistic regression algorithm on a 3-class Classification problem? A) TRUE B) FALSE

A) True. We can apply logistic regression to a 3-class classification problem by using the One-vs-All method.

Random Forest - regression

As a regression it is typically the average or mean that is returned.

2) True-False: Is Logistic regression mainly used for Regression? A) TRUE B) FALSE

B) Logistic regression is a classification algorithm; don't be confused by the word "regression" in its name.

5) Which of the following methods do we use to best fit the data in Logistic Regression? A) Least Square Error B) Maximum Likelihood C) Jaccard distance D) Both A and B

B) Logistic regression is trained using maximum likelihood estimation.

30) Can a Logistic Regression classifier do a perfect classification on the below data? Note: You can use only X1 and X2 variables where X1 and X2 can take only two binary values(0,1). A) TRUE B) FALSE C) Can't say D) None of these

B) No. Logistic regression only forms a linear decision surface, and the examples in the figure are not linearly separable.

12) Which of the following figure will represent the decision boundary as given by above classifier? A) B) C) D)

B) Option B is the right answer. The boundary is represented by y = g(−6 + x2), which matches options A and B. Option B is correct because when you put x2 = 6 into the equation, you get y = g(0) = 0.5, so the point x2 = 6 lies on the boundary; for x2 < 6 the argument is negative and the output falls into the y = 0 region, as option B shows.

17) Which of the following is true regarding the logistic function for any value "x"? Note: Logistic(x): is a logistic function of any number "x" Logit(x): is a logit function of any number "x" Logit_inv(x): is a inverse logit function of any number "x" A) Logistic(x) = Logit(x) B) Logistic(x) = Logit_inv(x) C) Logit_inv(x) = Logit(x) D) None of these

B) The logistic function is the inverse of the logit function, i.e. Logistic(x) = Logit_inv(x). See: https://en.wikipedia.org/wiki/Logit

16) Which of the following will be true about k in k-NN in terms of variance? A) When you increase k, the variance increases B) When you decrease k, the variance increases C) Can't say D) None of these

B) A simpler model (larger k) is considered a lower-variance model, so decreasing k increases the variance.

8) [True-False] Standardisation of features is required before training a Logistic Regression. A) TRUE B) FALSE

B Standardization isn't required for logistic regression. The main goal of standardizing features is to help convergence of the technique used for optimization.

21) Below are two different logistic models with different values for β0 and β1. Which of the following statement(s) is true about β0 and β1 values of two logistics models (Green, Black)? Note: consider Y = β0 + β1*X. Here, β0 is intercept and β1 is coefficient. A) β1 for Green is greater than Black B) β1 for Green is lower than Black C) β1 for both models is same D) Can't Say

B) For the black curve, β0 = 0 and β1 = 1; for the green curve, β0 = 0 and β1 = −1. So β1 for Green is lower than for Black.

20) Below are two statements. Which of the following is true for both statements? 1. k-NN is a memory-based approach, in that the classifier immediately adapts as we collect new training data. 2. The computational complexity for classifying new samples grows linearly with the number of samples in the training dataset in the worst-case scenario. A) 1 B) 2 C) 1 and 2 D) None of these

C Both are true and self explanatory

29) Imagine you have been given the below graph of logistic regression, which shows the relationship between the cost function and the number of iterations for 3 different learning rate values (different colors show different curves at different learning rates). Suppose you save the graph for future reference but forget to save the values of the different learning rates. Now you want to find out the relation between the learning rate values of these curves. Which of the following will be the true relation? Note: 1. The learning rate for blue is l1 2. The learning rate for red is l2 3. The learning rate for green is l3 A) l1>l2>l3 B) l1 = l2 = l3 C) l1 < l2 < l3 D) None of these

C) With a low learning rate the cost function decreases slowly, whereas with a large learning rate it decreases very fast. Hence l1 < l2 < l3.

14) Suppose you have been given a fair coin and you want to find out the odds of getting heads. Which of the following option is true for such a case? A) odds will be 0 B) odds will be 0.5 C) odds will be 1 D) None of these

C) Odds are defined as the ratio of the probability of success to the probability of failure. For a fair coin both probabilities are 1/2, so the odds are (1/2)/(1/2) = 1.

22) Which of the following above figure shows that the decision boundary is overfitting the training data? A) A B) B C) C D)None of these

C) In figure C the decision boundary is not smooth, which indicates that it is overfitting the training data.

23) What do you conclude after seeing this visualization? 1. The training error in first plot is maximum as compare to second and third plot. 2. The best model for this regression problem is the last (third) plot because it has minimum training error (zero). 3. The second model is more robust than first and third because it will perform best on unseen data. 4. The third model is overfitting more as compare to first and second. 5. All will perform same because we have not seen the testing data. A) 1 and 3 B) 1 and 3 C) 1, 3 and 4 D) 5

C) The trend in the graphs looks like a quadratic trend in the independent variable X. A higher-degree polynomial (right graph) may have very high accuracy on the training population, but it is expected to fail badly on the test dataset. The left graph has maximum training error because it underfits the training data.

28) Suppose, Following graph is a cost function for logistic regression. Now, How many local minimas are present in the graph? A) 1 B) 2 C) 3 D) 4

C There are three local minima present in the graph

4) Which of the following option is true about k-NN algorithm? A) It can be used for classification B) It can be used for regression C) It can be used in both classification and regression

C We can also use k-NN for regression problems. In this case the prediction can be based on the mean or the median of the k-most similar instances.
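A minimal regression sketch with scikit-learn, where the prediction is the mean of the k nearest targets (the toy data is made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
knr = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(knr.predict([[2.5]]))  # mean of the 3 nearest targets: (1+2+3)/3 = 2.0
```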

26) What would you do if you want to train logistic regression on the same data so that it takes less time and gives comparatively similar (though maybe not identical) accuracy? Suppose you are using a Logistic Regression model on a huge dataset. One of the problems you may face with such huge data is that logistic regression will take a very long time to train. A) Decrease the learning rate and decrease the number of iterations B) Decrease the learning rate and increase the number of iterations C) Increase the learning rate and increase the number of iterations D) Increase the learning rate and decrease the number of iterations

D) Decreasing the number of iterations will certainly reduce training time, but on its own it will not reach similar accuracy. To get comparable (though not identical) accuracy in less time, you need to increase the learning rate while decreasing the number of iterations.

13) If you replace coefficient of x1 with x2 what would be the output figure? A) B) C) D)

D Same explanation as in previous question.

6) Which of the following evaluation metrics can not be applied in case of logistic regression output to compare with target? A) AUC-ROC B) Accuracy C) Logloss D) Mean-Squared-Error

D) Logistic regression is a classification algorithm, so its output is not a real-valued quantity; mean squared error therefore cannot be used to evaluate it.

20) Imagine, you are solving a classification problems with highly imbalanced class. The majority class is observed 99% of times in the training data. Your model has 99% accuracy after taking the predictions on test data. Which of the following is true in such a case? 1. Accuracy metric is not a good idea for imbalanced class problems. 2. Accuracy metric is a good idea for imbalanced class problems. 3. Precision and recall metrics are good for imbalanced class problems. 4. Precision and recall metrics aren't good for imbalanced class problems. A) 1 and 3 B) 1 and 4 C) 2 and 3 D) 2 and 4

(A) Refer to question number 4 in this article.

26) What would you do in PCA to get the same projection as SVD? A) Transform data to zero mean B) Transform data to zero median C) Not possible D) None of these

(A) When the data has a zero mean vector, PCA has the same projections as SVD; otherwise, you have to centre the data first before taking the SVD.
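A quick check of this equivalence (a sketch with NumPy and scikit-learn; PCA centres internally, so its axes match the SVD of the centred data up to sign):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) + 5.0   # deliberately non-zero mean

Xc = X - X.mean(axis=0)              # centre the data first
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca = PCA(n_components=2).fit(X)

# Principal axes equal the right singular vectors, up to sign.
print(np.allclose(np.abs(pca.components_), np.abs(Vt[:2])))  # True
```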

3) [True or False] The Pearson correlation between two variables is zero, but their values can still be related to each other. A) TRUE B) FALSE

(A) Consider Y = X². They are not merely associated; one is a deterministic function of the other, yet the Pearson correlation between them is 0.
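A quick demonstration (a minimal NumPy sketch): y is fully determined by x, yet the Pearson correlation is zero because it only measures linear association:

```python
import numpy as np

x = np.linspace(-3, 3, 101)     # symmetric around zero
y = x ** 2                      # perfectly determined by x
print(np.corrcoef(x, y)[0, 1])  # ~0.0 up to floating-point noise
```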

35) Which of the following options can be used to get global minima in k-Means Algorithm? 1. Try to run algorithm for different centroid initialization 2. Adjust number of iterations 3. Find out the optimal number of clusters A) 2 and 3 B) 1 and 3 C) 1 and 2 D) All of above

(D) All of these options can be tuned to help reach the global minimum.

25) Given below is a scenario for training error TE and validation error VE for a machine learning algorithm M1. You want to choose a hyperparameter (H) based on TE and VE.

H   TE    VE
1   105    90
2   200    85
3   250    96
4   105    85
5   300   100

Which value of H will you choose based on the above table? A) 1 B) 2 C) 3 D) 4 E) 5

(D) H = 4 gives the lowest training error (105) together with the lowest validation error (85), so option D is the best choice.

30) You can evaluate the performance of a binary class classification problem using different metrics such as accuracy, log-loss, F-Score. Let's say, you are using the log-loss function as evaluation metric. Which of the following option is / are true for interpretation of log-loss as an evaluation metric? 1. If a classifier is confident about an incorrect classification, then log-loss will penalise it heavily. 2. For a particular observation, the classifier assigns a very small probability for the correct class then the corresponding contribution to the log-loss will be very large. 3. Lower the log-loss, the better is the model. A) 1 and 3 B) 2 and 3 C) 1 and 2 D) 1,2 and 3

(D) Options are self-explanatory.

14) Which of the following is/are one of the important step(s) to pre-process the text in NLP based projects? 1. Stemming 2. Stop word removal 3. Object Standardization A) 1 and 2 B) 1 and 3 C) 2 and 3 D) 1,2 and 3

(D) Stemming is a rudimentary rule-based process of stripping suffixes ("ing", "ly", "es", "s", etc.) from a word. Stop words are words that carry no relevance to the context of the data, for example is/am/are. Object standardization is also a good way to pre-process text.

7) Given below are three images (1,2,3). Which of the following options is correct for these images? A) 1 is tanh, 2 is ReLU and 3 is SIGMOID activation functions. B) 1 is SIGMOID, 2 is ReLU and 3 is tanh activation functions. C) 1 is ReLU, 2 is tanh and 3 is SIGMOID activation functions. D) 1 is tanh, 2 is SIGMOID and 3 is ReLU activation functions.

(D) The range of the SIGMOID function is [0,1]. The range of the tanh function is [-1,1]. The range of the ReLU function is [0, ∞). So option D is the right answer.

16. In which of the following cases will K-means clustering fail to give good results? 1) Data points with outliers 2) Data points with different densities 3) Data points with nonconvex shapes 1. 1 and 2 2. 2 and 3 3. 1, 2, and 3 4. 1 and 3

1, 2, and 3

random forest - strength and description

1. Very good performance (speed, low variance, high accuracy) when abundant data is available. 2. Use bootstrapping/bagging to initialize each tree with different data. 3. Use only a subset of variables at each node. 4. Use a random optimization criterion at each node. 5. Project features on a random different manifold at each node.

27) In k-NN what will happen when you increase/decrease the value of k? A) The boundary becomes smoother with increasing value of K B) The boundary becomes smoother with decreasing value of K C) Smoothness of boundary doesn't dependent on value of K D) None of these

A The decision boundary would become smoother by increasing the value of K

6) Which of the following machine learning algorithm can be used for imputing missing values of both categorical and continuous variables? A) K-NN B) Linear Regression C) Logistic Regression

A) The k-NN algorithm can be used for imputing missing values of both categorical and continuous variables.
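For continuous features this can be done directly with scikit-learn's `KNNImputer` (a minimal sketch; categorical features would need to be encoded numerically first):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])
print(KNNImputer(n_neighbors=2).fit_transform(X))
# each NaN is replaced by the mean of that feature over the 2 nearest rows
```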

10) Which of the following will be Manhattan Distance between the two data point A(1,3) and B(2,3)? A) 1 B) 2 C) 4 D) 8

A) |1−2| + |3−3| = 1 + 0 = 1. (Manhattan distance sums absolute coordinate differences; there is no square root.)

Entropy

A measure of disorder or randomness. https://arxiv.org/ftp/arxiv/papers/1405/1405.2061.pdf

10. Which of the following is an example of feature extraction? 1. Constructing a bag-of-words vector from an email 2. Applying PCA to project high-dimensional data 3. Removing stopwords in a sentence 4. All of the above

All of the above

3. The most widely used metrics and tools to assess a classification model are: 1. Confusion matrix 2. Cost-sensitive accuracy 3. Area under the ROC curve 4. All of the above

All of the above

6. How do you handle missing or corrupted data in a dataset? 1. Drop missing rows or columns 2. Replace missing values with mean/median/mode 3. Assign a unique category to missing values 4. All of the above

All of the above

Random Forest - classification

As a classifier the most popular outcome (the mode) is returned.

Context 13-14: Suppose you have been given the following 2-class data, where "+" represents the positive class and "−" represents the negative class. 13) Which of the following values of k in k-NN would minimize the leave-one-out cross-validation accuracy? A) 3 B) 5 C) Both have same D) None of these

B) 5-NN will have the least leave-one-out cross-validation error.

12. Which of the following is true about Naive Bayes ? 1. Assumes that all the features in a dataset are equally important 2. Assumes that all the features in a dataset are independent 3. Both A and B 4. None of the above options

Both A and B

15. Which of the following techniques can be used for normalization in text mining? 1. Stemming 2. Lemmatization 3. Stop Word Removal 4. Both A and B

Both A and B

4. Which of the following is a good test dataset characteristic? 1. Large enough to yield meaningful results 2. Is representative of the dataset as a whole 3. Both A and B 4. None of the above

Both A and B

7. What is the purpose of performing cross-validation? 1. To assess the predictive performance of the models 2. To judge how the trained model performs outside the sample on test data 3. Both A and B

Both A and B

8. Why is second order differencing in time series needed? 1. To remove stationarity 2. To find the maxima or minima at the local point 3. Both A and B 4. None of the above

Both A and B

17. Which of the following is a reasonable way to select the number of principal components "k"? 1. Choose k to be the smallest value so that at least 99% of the variance is retained. 2. Choose k to be 99% of m (k = 0.99*m, rounded to the nearest integer). 3. Choose k to be the largest value so that 99% of the variance is retained. 4. Use the elbow method

Choose k to be the smallest value so that at least 99% of the variance is retained

bagging in random forest

Each tree selects a subset of observations with replacement to build the training set. Replacement here means it can simply choose the same observation multiple times, which is only really a problem when there are few observations. It puts the observation "back in the bag", if you will, where it can be pulled and chosen again.
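A sketch of one bootstrap draw (NumPy; the ten dummy observations are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
observations = np.arange(10)
# Sampling WITH replacement: an observation goes "back in the bag"
# and can be drawn again for the same tree's training set.
bag = rng.choice(observations, size=len(observations), replace=True)
print(sorted(bag))  # duplicates appear; some observations stay out of the bag
```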

Ensemble models

Ensemble models are essentially models made up of other models. These component models are often models that are simpler than would be necessary to accurately predict the desired outcome on their own. In the case of Random Forest, those sub models are Decision Trees. Random Forest generates many Decision Trees and combines them to generate a single prediction via a voting process.

Node of decision tree

A node is a question or a rule. Nodes are either root nodes (the first node), interior nodes (follow-up questions), or leaf nodes (endpoints). Every node except a leaf node contains a rule, which is the question we're asking. In terms of flow, you start at the root node and follow branches through interior nodes until you arrive at a leaf node. Each rule divides the data into a certain number of subgroups, typically two, with binary "yes or no" questions being particularly common. It is important to note that all data has to have a way to flow through the tree; it cannot simply disappear or fail to be contained in the tree.

9. When performing regression or classification, which of the following is the correct way to preprocess the data? 1. Normalize the data → PCA → training 2. PCA → normalize PCA output → training 3. Normalize the data → PCA → normalize PCA output → training 4. None of the above

Normalize the data → PCA → training

20. Suppose you have trained a logistic regression classifier and it outputs on a new example x a prediction hθ(x) = 0.2. This means 1. Our estimate for P(y=1 | x) is 0.8 2. Our estimate for P(y=0 | x) is 0.8 3. Our estimate for P(y=1 | x) is 0.2 4. Our estimate for P(y=0 | x) is 0.2

Our estimate for P(y=0 | x) is 0.8, since hθ(x) estimates P(y=1 | x) = 0.2

1. Which of the following is a widely used and effective machine learning algorithm based on the idea of bagging? 1. Decision Tree 2. Regression 3. Classification 4. Random Forest

Random Forest

random subspace in random forest

Random forests also typically use a random subset of features for each split. This means that each time it has to perform a split or generate a rule, it only looks at the random subspace created by a random subset of the features as possibilities for generating that rule. This helps avoid the correlation problem, because the trees are not built with the same available features at every point. As a general rule, for a dataset with x features, √x features are used for classification and x/3 for regression.
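In scikit-learn, these conventions map onto the `max_features` parameter (a sketch, assuming you follow the √x and x/3 rules of thumb):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# sqrt(x) candidate features per split for classification,
# a third of the features per split for regression.
clf = RandomForestClassifier(max_features="sqrt")
reg = RandomForestRegressor(max_features=1 / 3)
```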

11. What is pca.components_ in Sklearn? 1. Set of all eigen vectors for the projection space 2. Matrix of principal components 3. Result of the multiplication matrix 4. None of the above options

Set of all eigen vectors for the projection space

branches or path

The links between nodes are called branches or paths.

2. To find the minimum or the maximum of a function, we set the gradient to zero because: 1. The value of the gradient at extrema of a function is always zero 2. Depends on the type of problem 3. Both A and B 4. None of the above

The value of the gradient at extrema of a function is always zero

14. How can you prevent a clustering algorithm from getting stuck in bad local optima? 1. Set the same seed value for each run 2. Use multiple random initializations 3. Both A and B 4. None of the above

Use multiple random initializations

decision tree

A hierarchical arrangement of criteria that predict a classification or a value. We used decision trees here as a classifier. You can also use them for regression, and we'll cover a regression version next, which follows the same principles.

15) Suppose you want to project high dimensional data into lower dimensions. The two most famous dimensionality reduction algorithms used here are PCA and t-SNE. Let's say you have applied both algorithms respectively on data "X" and you got the datasets "X_projected_PCA" , "X_projected_tSNE". Which of the following statements is true for "X_projected_PCA" & "X_projected_tSNE" ? A) X_projected_PCA will have interpretation in the nearest neighbour space. B) X_projected_tSNE will have interpretation in the nearest neighbour space. C) Both will have interpretation in the nearest neighbour space. D) None of them will have interpretation in the nearest neighbour space.

(B) The t-SNE algorithm considers nearest-neighbour points when reducing the dimensionality of the data, so the reduced dimensions retain an interpretation in nearest-neighbour space. With PCA this is not the case.

31) Which of the following is the leave-one-out cross-validation accuracy for 3-NN (3-nearest neighbor)? A) 0 B) 0.4 C) 0.8 D) 1

(C) In leave-one-out cross-validation, we select (n−1) observations for training and 1 observation for validation. Treat each point as the validation point and find the 3 points nearest to it. Repeating this procedure for all points, every positive-class point in the figure is classified correctly, but the negative-class points are misclassified. Hence you get 80% accuracy.

40) Suppose, we were plotting the visualization for different values of C (Penalty parameter) in SVM algorithm. Due to some reason, we forgot to tag the C values with visualizations. In that case, which of the following option best explains the C values for the images below (1,2,3 left to right, so C values are C1 for image1, C2 for image2 and C3 for image3 ) in case of rbf kernel. A) C1 = C2 = C3 B) C1 > C2 > C3 C) C1 < C2 < C3 D) None of these

(C) C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly. For large values of C, the optimization will choose a smaller-margin hyperplane.

36) Imagine you are working on a project which is a binary classification problem. You trained a model on training dataset and get the below confusion matrix on validation dataset. Based on the above confusion matrix, choose which option(s) below will give you correct predictions? 1. Accuracy is ~0.91 2. Misclassification rate is ~ 0.91 3. False positive rate is ~0.95 4. True positive rate is ~0.95 A) 1 and 3 B) 2 and 4 C) 1 and 4 D) 2 and 3

(C) The accuracy (correct classification rate) is (50+100)/165, which is approximately 0.91. The true positive rate measures how often you predict the positive class correctly; here it is 100/105 ≈ 0.95, also known as "sensitivity" or "recall".

9) Let's say you are working with categorical feature(s) and you have not looked at the distribution of the categorical variable in the test data. You want to apply one-hot encoding (OHE) on the categorical feature(s). What challenges may you face if you apply OHE on a categorical variable of the train dataset? A) All categories of the categorical variable are not present in the test dataset. B) The frequency distribution of categories is different in train as compared to the test dataset. C) Train and Test always have the same distribution. D) Both A and B E) None of these

(D) Both are true. OHE will fail to encode categories that are present in test but not in train, which is one of the main challenges of applying OHE. The challenge in option B is also real: you need to be extra careful applying OHE when the frequency distribution differs between train and test.

23) Which of the following options is true for the overall execution time of 5-fold cross-validation with 10 different values of "max_depth"? A) Less than 100 seconds B) 100-300 seconds C) 300-600 seconds D) More than or equal to 600 seconds E) None of the above F) Can't estimate

(D) Each fold at depth "2" in 5-fold cross-validation takes 10 seconds for training and 2 seconds for testing, so 5 folds take 12*5 = 60 seconds. Since we are searching over 10 depth values, the search takes 60*10 = 600 seconds. But training and testing a model at depths greater than 2 take longer than at depth 2, so the overall time is greater than or equal to 600 seconds.

Context: 16-17 Given below are three scatter plots for two features (Image 1, 2 & 3 from left to right). 16) In the above images, which of the following is/are example of multi-collinear features? A) Features in Image 1 B) Features in Image 2 C) Features in Image 3 D) Features in Image 1 & 2 E) Features in Image 2 & 3 F) Features in Image 3 & 1

(D) In Image 1 the features have high positive correlation, whereas in Image 2 they have high negative correlation, so in both images the pair of features is an example of multicollinear features.

22) Which of the following options is/are true for K-fold cross-validation? 1. Increase in K will result in higher time required to cross validate the result. 2. Higher values of K will result in higher confidence on the cross-validation result as compared to lower value of K. 3. If K=N, then it is called Leave one out cross validation, where N is the number of observations. A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1,2 and 3

(D) A larger k means less bias towards overestimating the true expected error (the training folds are closer to the total dataset) and higher running time (you approach the limiting case of leave-one-out CV). We also need to consider the variance between the k folds' accuracies when selecting k.

19) Suppose you are given three variables X, Y and Z. The Pearson correlation coefficients for (X, Y), (Y, Z) and (X, Z) are C1, C2 & C3 respectively. Now, you have added 2 to all values of X (i.e. the new values are X+2), subtracted 2 from all values of Y (i.e. the new values are Y-2), and Z remains the same. The new coefficients for (X,Y), (Y,Z) and (X,Z) are given by D1, D2 & D3 respectively. How do the values of D1, D2 & D3 relate to C1, C2 & C3? A) D1 = C1, D2 < C2, D3 > C3 B) D1 = C1, D2 > C2, D3 > C3 C) D1 = C1, D2 > C2, D3 < C3 D) D1 = C1, D2 < C2, D3 < C3 E) D1 = C1, D2 = C2, D3 = C3 F) Cannot be determined

(E) Correlation between features does not change if you add or subtract a constant from the features.

37) For which of the following hyperparameters is a higher value better for the decision tree algorithm? 1. Number of samples used for split 2. Depth of tree 3. Samples for leaf A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1, 2 and 3 E) Can't say

(E) For all three parameters, it is not necessarily the case that increasing the value improves performance. For example, with a very high depth of tree, the resulting tree may overfit the data and fail to generalize; with a very low value, it may underfit. So we can't say for sure that "higher is better".

Ensemble models - three types

However, most ensemble models fall into three main categories. Bagging is one such ensemble technique. In bagging you take subsets of the data and train a model on each subset. Then the subsets are allowed to simultaneously vote on the outcome, either taking a majority or a mean. You just saw this in action with Random Forests, the most popular bagging technique. Another ensemble technique is called boosting. Rather than build multiple models simultaneously like bagging, boosting uses the output of one model as an input into the next in a form of serial processing. These models then get daisy-chained together sequentially until some stopping condition is met. We'll cover boosting methods later. Lastly, stacking is a two phase process. In the first phase multiple models are trained in parallel. Then in the second phase those models are used as inputs into a final model to give your prediction. This approach combines the parallel approach embodied by bagging with the serial approach of boosting to create a hybrid of the two.
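All three categories have off-the-shelf scikit-learn counterparts (a minimal sketch on synthetic data; the toy base models are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier())   # parallel vote on subsets
boosting = GradientBoostingClassifier()                 # serial daisy-chain of models
stacking = StackingClassifier(                          # phase-1 models feed a phase-2 model
    estimators=[("tree", DecisionTreeClassifier()),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression())

for model in (bagging, boosting, stacking):
    print(type(model).__name__, model.fit(X, y).score(X, y))
```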

Random forests - general

Instead of one decision tree you have several: a forest, in which each tree gets a vote on the outcome for a given observation. You can specify how each tree is built, using information gain and entropy as in Decision Tree, or other methods like Gini impurity. You also get to control the number of estimators you want to generate, i.e. the number of trees in the forest. Here there is a tradeoff between how much variance you can explain and the computational complexity, and it is pretty easily tunable. As you increase the number of trees in the forest, the accuracy should converge, as eventually the additional learning from another tree approaches zero.

ID3

Iterative Dichotomiser 3. Used for nominal, unordered input data only. Every split has branching factor b, where b is the number of values a variable can take (e.g. the bins of a discretized variable). The tree has as many levels as input variables. Essentially, ID3 goes through every feature to find the most valuable attribute and then splits on it. It moves further and further down the tree until it either reaches a pure class or meets a terminating condition.

19. What is a sentence parser typically used for? 1. It is used to parse sentences to check if they are utf-8 compliant. 2. It is used to parse sentences to derive their most likely syntax tree structures. 3. It is used to parse sentences to assign POS tags to all tokens. 4. It is used to check if sentences can be parsed into meaningful tokens.

It is used to parse sentences to derive their most likely syntax tree structures.

13. Which of the following statements about regularization is not correct? 1. Using too large a value of lambda can cause your hypothesis to underfit the data. 2. Using too large a value of lambda can cause your hypothesis to overfit the data. 3. Using a very large value of lambda cannot hurt the performance of your hypothesis. 4. None of the above

None of the above

