CISD43 MIDTERM


Random Forest - classification - regression

As a regressor, it is typically the average (mean) of the trees' predictions that is returned.
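A minimal sketch (not from the course materials; the toy data is made up for illustration) showing that a random forest regressor's prediction is the mean of its trees' predictions:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.8])

reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# The forest's prediction equals the average of the individual trees' outputs.
per_tree = [t.predict([[3.5]])[0] for t in reg.estimators_]
print(reg.predict([[3.5]])[0], np.mean(per_tree))  # the two values match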

random forest - strength and description

1. Very good performance (speed, low variance, high accuracy) when abundant data is available. 2. Uses bootstrapping/bagging to initialize each tree with different data. 3. Uses only a subset of variables at each node. 4. Uses a random optimization criterion at each node. 5. Projects features onto a different random manifold at each node. Points 2 and 3 are shown in the sketch below.
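A hedged scikit-learn sketch of points 2 and 3: bootstrap=True gives each tree a different bootstrap sample (bagging), and max_features limits the variables considered at each split (the iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(
    n_estimators=100,    # number of trees in the forest
    bootstrap=True,      # each tree trains on a bootstrap sample (bagging)
    max_features='sqrt'  # random subset of features tried at each node
).fit(X, y)
print(clf.predict(X[:3]))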

Entropy

A measure of disorder or randomness.
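A small illustration of this (a sketch, assuming the base-2 logarithm commonly used for entropy in decision trees): entropy is highest for a 50/50 split and zero for a pure node.

import numpy as np

def entropy(p):
    # Shannon entropy of a class-probability distribution
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # skip zero probabilities (0 * log 0 is treated as 0)
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))  # 1.0 -> maximum disorder for two classes
print(entropy([1.0, 0.0]))  # 0.0 -> a pure, perfectly ordered node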

6. How do you handle missing or corrupted data in a dataset? Drop missing rows or columns Replace missing values with mean/median/mode Assign a unique category to missing values All of the above

All of the above: drop missing rows or columns; replace missing values with the mean/median/mode; assign a unique category to missing values.
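A hedged pandas sketch of the three strategies named above (the column names and values are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['NY', None, 'LA']})

df.dropna()                         # drop rows with missing values
df['age'].fillna(df['age'].mean())  # replace with the mean
df['city'].fillna('Missing')        # assign a unique category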

3. The most widely used metrics and tools to assess a classification model are: Confusion matrix Cost-sensitive accuracy Area under the ROC curve All of the above

All of the above: confusion matrix, cost-sensitive accuracy, and area under the ROC curve.
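A minimal sketch of two of these tools with scikit-learn (y_true, y_pred, and y_score are made-up toy values):

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
y_score = [0.1, 0.6, 0.8, 0.9]  # predicted probabilities for class 1

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(roc_auc_score(y_true, y_score))    # area under the ROC curve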

Random Forest - classification - classifier

As a classifier, the most popular outcome (the mode) is returned.

5. Which of the following is a disadvantage of decision trees? Factor analysis Decision trees are robust to outliers Decision trees are prone to be overfit None of the Above

Decision trees are prone to be overfit

Node of decision tree

Each node is a question or a rule. Nodes are either root nodes (the first node), interior nodes (follow-up questions), or leaf nodes (endpoints). Every node except a leaf node contains a rule, which is the question we're asking. In terms of flow, you start at the root node and follow branches through interior nodes until you arrive at a leaf node. Each rule divides the data into a certain number of subgroups, typically two, with binary "yes or no" questions being particularly common. It is important to note that all data has to have a way to flow through the tree; it cannot simply disappear or fall outside the tree.
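A quick way to see these node types in practice (a sketch; the iris dataset and max_depth value are chosen purely for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree_clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
# Each "|---" line with a condition is a node's rule (root or interior);
# lines ending in "class: ..." are the leaf nodes where the data lands.
print(export_text(tree_clf, feature_names=iris.feature_names))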

1. Which of the following is a widely used and effective machine learning algorithm based on the idea of bagging? Decision Tree Regression Classification Random Forest

Random Forest

branches or paths

The links between nodes are called branches or paths.

decision tree

A hierarchical arrangement of criteria that predicts a classification or a value. We used decision trees here as a classifier. You can also use them for regression; we'll cover a regression version next, which follows the same principles.

Finally we are going to build our decision tree.
# Function to perform training with entropy
# clf_entropy is the actual decision tree classifier
# max_depth=3: the tree goes only three layers deep before it stops
# min_samples_leaf=5: every leaf node must contain at least 5 samples

from sklearn.tree import DecisionTreeClassifier
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
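Typical next steps (a sketch; X_train, y_train, and X_test are assumed to come from an earlier train_test_split, as covered later in this set):

clf_entropy.fit(X_train, y_train)     # train the tree on the training split
y_pred = clf_entropy.predict(X_test)  # predict labels for the test split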

Write the code to print the length of the balance_data dataset.

print('Dataset length:',len(balance_data))

4. Which of the following is a good test dataset characteristic? Large enough to yield meaningful results Is representative of the dataset as a whole Both A and B None of the above

Both A and B: large enough to yield meaningful results, and representative of the dataset as a whole.

This is the tree import needed for the DecisionTreeClassifier.

from sklearn import tree
Definition: A decision tree classifier. Parameters: criterion {"gini", "entropy"}, default="gini" - the function to measure the quality of a split.

1. [True or False] The k-NN algorithm does more computation at test time than at train time. A) TRUE B) FALSE

A) TRUE. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the testing phase, a test point is classified by assigning the label that is most frequent among the k training samples nearest to that query point - hence the higher computation at test time.
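A hedged sketch with scikit-learn's KNeighborsClassifier (toy data): fit() essentially just stores the training set, while predict() does the neighbor search.

from sklearn.neighbors import KNeighborsClassifier

X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)    # cheap: memorizes the training samples
print(knn.predict([[1.5]]))  # costly part: distance search at test time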

7. What is the purpose of performing cross-validation? To assess the predictive performance of the models To judge how the trained model performs outside the sample on test data Both A and B

Both A and B: to assess the predictive performance of the models, and to judge how the trained model performs outside the sample on test data.
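A short cross-validation sketch with scikit-learn (5-fold here; the dataset and model are chosen purely for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print(scores.mean())  # estimate of out-of-sample predictive performance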

decision tree - strength

Decision trees are great, and their easily represented set of rules is a powerful feature for modeling and even more so for conveying that model to a more general audience.

downside of decision trees - weakness

First, there is a randomness to their generation, which can lead to variance in estimates: there is no hard and fast rule for how the tree is built, so it doesn't build the same way every time (you saw this above when we discussed the random_state argument). In addition, they are incredibly prone to overfitting, particularly if you allow them to grow too deep or complex. Also note that because they work from information gain, they are biased toward the dominant class, so balanced data is needed. High variance and a propensity to overfit are serious problems.

9) Which of the following will be the Euclidean distance between the two data points A(1,3) and B(2,3)? A) 1 B) 2 C) 4 D) 8

Solution: (A) sqrt((1-2)^2 + (3-3)^2) = sqrt(1^2 + 0^2) = 1
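The same computation in Python (a sketch using scipy; cityblock is the Manhattan distance asked about in question 7 below):

from scipy.spatial.distance import cityblock, euclidean

A, B = (1, 3), (2, 3)
print(euclidean(A, B))  # 1.0
print(cityblock(A, B))  # also 1 for these points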

18) Adding a non-important feature to a linear regression model may result in: 1. Increase in R-squared 2. Decrease in R-squared A) Only 1 is correct B) Only 2 is correct C) Either 1 or 2 D) None of these

Solution: (A) Increase in R-squared. After adding a feature to the feature space, whether that feature is important or not, R-squared always increases (it never decreases).

34) Suppose we have a dataset which can be trained with 100% accuracy with the help of a decision tree of depth 6. Now consider the points below and choose the option based on these points. Note: All other hyperparameters are the same and other factors are not affected. 1. Depth 4 will have high bias and low variance 2. Depth 4 will have low bias and low variance A) Only 1 B) Only 2 C) Both 1 and 2 D) None of the above

Solution: (A) Only 1. Depth 4 will have high bias and low variance. If you fit a decision tree of depth 4 to such data, it is more likely to underfit; in the case of underfitting you have high bias and low variance.

37) For which of the following hyperparameters is a higher value better for the decision tree algorithm? 1. Number of samples used for split 2. Depth of tree 3. Samples for leaf A) 1 and 2 B) 2 and 3 C) 1 and 3 D) 1, 2 and 3 E) Can't say

Solution: (E) Can't say. For all three hyperparameters, it is not necessarily true that a higher value improves performance. For example, with a very high tree depth the resulting tree may overfit the data and fail to generalize, while with a very low value the tree may underfit. So we can't say for sure that "higher is better".

7) Which of the following is true about Manhattan distance? A) It can be used for continuous variables B) It can be used for categorical variables C) It can be used for categorical as well as continuous D) None of these

Solution: A. Manhattan distance is designed for calculating the distance between real-valued (continuous) features.

8) When you find noise in the data, which of the following options would you consider in k-NN? A) I will increase the value of k B) I will decrease the value of k C) Noise cannot be dependent on the value of k D) None of these

Solution: A. To be more confident in the classifications you make, you can try increasing the value of k.

5) Which of the following will be true about k in k-NN in terms of bias? A) When you increase k, the bias will increase B) When you decrease k, the bias will increase C) Can't say D) None of these

Solution: A. A large k means a simpler model, and a simpler model is always considered to have higher bias.

23) A company has built a kNN classifier that gets 100% accuracy on training data. When they deployed this model on the client side, it was found that the model is not at all accurate. Which of the following might have gone wrong? Note: the model was deployed successfully and no technical issues were found at the client side apart from the model's performance. A) It is probably an overfitted model B) It is probably an underfitted model C) Can't say D) None of these

Solution: A. It is probably an overfitted model. An overfitted model appears to perform well on the training data but fails to generalize to new data at the client side.

27) In k-NN, what will happen when you increase/decrease the value of k? A) The boundary becomes smoother with increasing value of K B) The boundary becomes smoother with decreasing value of K C) Smoothness of the boundary doesn't depend on the value of K D) None of these

Solution: A. The decision boundary becomes smoother with increasing value of K.

26) True or False: It is possible to construct a 2-NN classifier by using the 1-NN classifier. A) TRUE B) FALSE

Solution: A) TRUE. You can implement a 2-NN classifier by ensembling 1-NN classifiers.

16) Which of the following will be true about k in k-NN in terms of variance? A) When you increase k, the variance will increase B) When you decrease k, the variance will increase C) Can't say D) None of these

Solution: B. A simpler model (larger k) is considered a lower-variance model; decreasing k makes the model more complex and increases variance.

24) You are given the following 2 statements; which of these is/are true in the case of k-NN? 1. In case of a very large value of k, we may include points from other classes in the neighborhood. 2. In case of a too-small value of k, the algorithm is very sensitive to noise. A) 1 B) 2 C) 1 and 2 D) None of these

Solution: C. Both are true: with a very large value of k we may include points from other classes in the neighborhood, and with a too-small value of k the algorithm is very sensitive to noise.

28) The following two statements are given for the k-NN algorithm; which statement(s) is/are true? 1. We can choose the optimal value of k with the help of cross-validation 2. Euclidean distance treats each feature as equally important A) 1 B) 2 C) 1 and 2 D) None of these

Solution: C. Both statements are true: we can choose the optimal value of k with the help of cross-validation (see the sketch below), and Euclidean distance treats each feature as equally important.
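A sketch of statement 1 in practice, searching for a good k with cross-validation via GridSearchCV (the candidate k values and dataset are made up for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(KNeighborsClassifier(),
                      {'n_neighbors': [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the k that scored best under cross-validation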

19) k-NN is very likely to overfit due to the curse of dimensionality. Which of the following options would you consider to handle such a problem? 1. Dimensionality reduction 2. Feature selection A) 1 B) 2 C) 1 and 2 D) None of these

Solution: C. In such a case you can use either a dimensionality reduction algorithm or a feature selection algorithm.

4) Which of the following options is true about the k-NN algorithm? A) It can be used for classification B) It can be used for regression C) It can be used for both classification and regression

Solution: C. We can also use k-NN for regression problems. In this case the prediction can be based on the mean or the median of the k most similar instances.

20) Two statements are given below. Which of the following will be true about both statements? 1. k-NN is a memory-based approach, meaning the classifier immediately adapts as we collect new training data. 2. The computational complexity for classifying new samples grows linearly with the number of samples in the training dataset in the worst-case scenario. A) 1 B) 2 C) 1 and 2 D) None of these

Solution: C) 1 and 2. Both are true: k-NN is a memory-based approach, so the classifier immediately adapts as we collect new training data, and the computational complexity for classifying new samples grows linearly with the number of training samples in the worst case.

5) Which of the following statements is true about the k-NN algorithm? A. k-NN performs much better if all of the data have the same scale B. k-NN works well with a small number of input variables (p), but struggles when the number of inputs is very large C. k-NN makes no assumptions about the functional form of the problem being solved D. All of the above

Solution: D. All of the above statements are true of the k-NN algorithm.

25) Which of the following statements is true for k-NN classifiers? A) The classification accuracy is better with larger values of k B) The decision boundary is smoother with smaller values of k C) The decision boundary is linear D) k-NN does not require an explicit training step

Solution: D. k-NN does not require an explicit training step. Option A: not always true; you have to ensure that k is neither too high nor too low. Option B: not true; the decision boundary can be a bit jagged. Option C: same as option B. Option D: true.

6) Imagine you are working with "Analytics Vidhya" and you want to develop a machine learning algorithm that predicts the number of views on articles. Your analysis is based on features like author name, the number of articles written by the same author on Analytics Vidhya in the past, and a few other features. Which of the following evaluation metrics would you choose in that case? 1. Mean Squared Error 2. Accuracy 3. F1 Score A) Only 1 B) Only 2 C) Only 3 D) 1 and 3 E) 2 and 3 F) 1 and 2

Solution: (A) Mean Squared Error. The number of views of an article is a continuous target variable, which falls under regression. So mean squared error is used as the evaluation metric.

What Python import do we use for an accuracy score?

from sklearn.metrics import accuracy_score
Definition: In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
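Typical usage (toy labels for illustration):

from sklearn.metrics import accuracy_score
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75 (3 of 4 correct)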

What Python import is used for a train/test split?

from sklearn.model_selection import train_test_split
Definition: a function in sklearn's model_selection module for splitting data arrays into two subsets: one for training data and one for testing data. With this function, you don't need to divide the dataset manually. By default, train_test_split makes random partitions for the two subsets.
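Typical usage (a sketch; the test_size and random_state values are arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100)
print(len(X_train), len(X_test))  # 105 45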

20) Imagine you are solving a classification problem with a highly imbalanced class. The majority class is observed 99% of the time in the training data. Your model has 99% accuracy after taking predictions on test data. Which of the following is true in such a case? 1. The accuracy metric is not a good idea for imbalanced class problems. 2. The accuracy metric is a good idea for imbalanced class problems. 3. Precision and recall metrics are good for imbalanced class problems. 4. Precision and recall metrics aren't good for imbalanced class problems. A) 1 and 3 B) 1 and 4 C) 2 and 3 D) 2 and 4

Solution: (A) 1 and 3. The accuracy metric is not a good idea for imbalanced class problems; precision and recall metrics are good for imbalanced class problems.

