Data Mining Quizzes


When K = 5 (K is the number of classes), how many binary classifiers are created using the all-vs-all approach to combine binary classifiers into a multi-class classifier?

C(K, 2) = (K * (K - 1)) / 2 = (5 * (5 - 1)) / 2 = 10
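The pairwise count can be sanity-checked with a short snippet (the function name is illustrative): all-vs-all trains one binary classifier per unordered pair of classes, which is exactly "K choose 2".

```python
# All-vs-all (one-vs-one) builds one binary classifier per
# unordered pair of classes, i.e. C(K, 2) classifiers.
from math import comb

def num_pairwise_classifiers(k: int) -> int:
    """Number of all-vs-all binary classifiers for k classes."""
    return comb(k, 2)  # equals k * (k - 1) // 2

print(num_pairwise_classifiers(5))  # 10
```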

DNN differs from ANN by having many nodes in the hidden layer True False

False

Historical bias issues could be solved by adding more training data. True False

False

A data scientist built a logistic regression model for a hospital to predict whether a patient has breast cancer. The features include age (x1) and some test result (x2). The model is 1/(1+e^(-f(x))), where f(x) = 0.1*x1 -2*x2. Now there are two patients, patient 1 has features (35, 0.5) and patient 2 has features (40, 0.2). Who is more likely to have breast cancer based on model f? Patient 1 Patient 2 Same

f(patient 1) = (0.1 * 35) - (2 * 0.5) = 3.5 - 1.0 = 2.5; f(patient 2) = (0.1 * 40) - (2 * 0.2) = 4.0 - 0.4 = 3.6. Patient 2
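A minimal sketch of this worked answer, using the question's coefficients (the logistic link is monotone, so the larger linear score also yields the larger predicted probability):

```python
from math import exp

def f(x1: float, x2: float) -> float:
    # Linear score from the question: f(x) = 0.1*x1 - 2*x2
    return 0.1 * x1 - 2 * x2

def sigmoid(z: float) -> float:
    # Logistic model: 1 / (1 + e^(-z)); increasing in z.
    return 1 / (1 + exp(-z))

score1 = f(35, 0.5)  # patient 1 -> 2.5
score2 = f(40, 0.2)  # patient 2 -> 3.6
print(round(sigmoid(score1), 3), round(sigmoid(score2), 3))
```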

A decision tree always needs to be large, i.e., having many nodes in the tree, such that the tree can capture any possible patterns that exist in the training data. True False

False

After the "evaluation" phase is completed, a data scientist needs to proceed to deployment and cannot go back to update the previous steps because they are final. True False

False

Most of the time, data mining refers to human experts making predictions based on their prior experience with the data. True False

False

Restricting the use of sensitive features such as gender and race in an AI model will eliminate biases against these two features. True False

False

The base learners in Random forest and Adaboost are trained sequentially. True False

False

The best SVM model is usually the one with the largest margin regardless of how large the hinge loss is, since it is most robust to outliers. True False

False

When building a regression model, we desire a model with larger residuals on the training data. True False

False

When predicting some target variable, more features will provide more information, so models built on more features will always achieve better performance than models using fewer features True False

False

Clustering differs from supervised learning in that it uses similarity or distance measures in its algorithms. True False

False (They differ in that supervised learning has a target variable but clustering does not. Some supervised learning may also use similarity and distance measures.)

Which of the following is FALSE? - Adaboost always uses the original target variable in the dataset as the target variable to train each base learner - Gradient boost often has better performance than random forest because it uses more base learners than random forest - The weight for each base learner (e.g., decision stump) in Adaboost is determined by the weights associated with its misclassified examples

Gradient boost often has better performance than random forest because it uses more base learners than random forest (Gradient Boost and AdaBoost often need fewer base learners than random forest because of the "coordination" of base learners during training.)

What is the type of the feature: Tesla stock price (sample values: 190,80,220, etc) Numeric Categoric

Numeric

There are 3 people described by 3 features (note that all features are rescaled to 0 and 1). Who is more similar (i.e., smaller distance) to Rachel? Please use Manhattan distance. Rachel: - Age = 0.32 - Personality = 0.62 - Relationship = 0.2 Monica: - Age = 0.33 - Personality = 0.58 - Relationship = 0.7 Phoebe: - Age = 0.34 - Personality = 0.78 - Relationship = 0.6 Monica Phoebe They are the same

Monica = |0.33 - 0.32| + |0.58 - 0.62| + |0.7 - 0.2| = 0.55; Phoebe = |0.34 - 0.32| + |0.78 - 0.62| + |0.6 - 0.2| = 0.58. Monica
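The distance calculation can be reproduced with a short snippet (Manhattan distance is the sum of absolute coordinate differences; feature order is Age, Personality, Relationship):

```python
def manhattan(a, b):
    # L1 (Manhattan) distance: sum of absolute differences per feature.
    return sum(abs(x - y) for x, y in zip(a, b))

rachel = (0.32, 0.62, 0.2)
monica = (0.33, 0.58, 0.7)
phoebe = (0.34, 0.78, 0.6)

print(round(manhattan(rachel, monica), 2))  # 0.55
print(round(manhattan(rachel, phoebe), 2))  # 0.58
```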

____________ models are more likely to overfit More complicated Simpler

More complicated

Which one of the following metrics can NOT be used to evaluate a regression model? MAE Accuracy MSE R-squared

Accuracy

Which of the following is NOT true about DNN? DNN can work with structured and unstructured data Training DNN usually takes a long time DNN is very slow in making a prediction DNN does automated "feature engineering" via multiple hidden layers

DNN is very slow in making a prediction

CRISP-DM is enforced to make sure that the experiences of the data scientists will have a significant impact on the output of the data mining process True False

False

Which scenario has data leakage? A. Predicting whether a criminal will re-offend in the future. Features include: his/her age, previous offenses, marital status, etc. B. Predicting the salary of a fresh graduate student before he finds a job. Features include: the GPA of the student, degree, major, age, internship experience (if there is any), etc. C. Predicting the total amount of money a fundraiser will receive on GoFundMe before posting to GoFundMe. Features include: the characteristics of the project, the personal information of the fundraiser, the total number of donors, time of the year, etc.

C

For the same model above, the doctor who uses the model thinks there are too many false negatives (positive class is defined as having breast cancer), which means many patients with breast cancer are diagnosed as healthy. The default cut-off value for the logistic regression model is 0.5. Without changing the model, how should the doctor change the cut-off value to avoid having too many false negatives (i.e., to predict more positives)? Decrease the cut-off value Increase the cut-off value Do not change the cut-off value

Decrease the cut-off value

Which of the following is true about random forest? - Each base learner is trained on a subset of training data. - For random forest to perform well, each base learner needs to be as accurate as possible so one often set the number of variables for each tree to be large - The trees in a random forest are trained sequentially to complement each other

Each base learner is trained on a subset of training data (Each tree is built using a random bootstrapped sample of observations with a random subset of features)

What is the entropy of the target variable: type of chocolate? There are a total of 60 chocolates in the box: 20 milk chocolates, 10 black chocolates, and 30 almond chocolates. (Approximate to three decimal places) hint: pay special attention to the log base, i.e., how many "classes" are there in the data?

Entropy = -((20/60) * log_3(20/60)) - ((10/60) * log_3(10/60)) - ((30/60) * log_3(30/60)) = 0.921
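The entropy arithmetic can be checked with a small snippet; note the log base matches the number of classes (3), as the hint suggests, so a uniform distribution would give entropy exactly 1.

```python
from math import log

def entropy(counts, base):
    """Entropy of a class distribution given raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts]
    return -sum(p * log(p, base) for p in probs if p > 0)

# 20 milk, 10 black, 30 almond chocolates; 3 classes -> log base 3
print(round(entropy([20, 10, 30], base=3), 3))  # 0.921
```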

Which of the following does NOT help improve the performance of an ensemble model? Increase the number of base learners Increase the correlation of base learners Increase the performance of base learners

Increase the correlation of base learners

What is the precision? True Positive (TP) = Actual + & Pred + = 10 False Negative (FN) = Actual + & Pred - = 10 False Positives (FP) = Actual - & Pred + = 40 True Negative (TN) = Actual - & Pred - = 10

Precision = TP / (TP + FP) = 10 / (10 + 40) = 0.2

What is the recall/sensitivity? True Positive (TP) = Actual + & Pred + = 10 False Negative (FN) = Actual + & Pred - = 10 False Positives (FP) = Actual - & Pred + = 40 True Negative (TN) = Actual - & Pred - = 10

Recall/Sensitivity = TP / (TP + FN) = 10 / (10 + 10) = 0.5
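Both confusion-matrix metrics above can be reproduced with a short snippet using the same counts (TP = 10, FN = 10, FP = 40, TN = 10):

```python
def precision(tp: int, fp: int) -> float:
    # Of all predicted positives, how many are actually positive.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of all actual positives, how many were predicted positive.
    return tp / (tp + fn)

tp, fn, fp, tn = 10, 10, 40, 10
print(precision(tp, fp))  # 0.2
print(recall(tp, fn))     # 0.5
```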

Which of the following command will convert a categorical feature into numeric features through binary coding (one-hot-encoding)? Recode -> as categoric Recode -> indicator variable Recode -> Quantiles

Recode -> indicator variable

Which type of data mining problem is it: Predicting the daily new COVID-19 cases in Iowa Regression Classification Clustering Co-occurrence grouping

Regression

Larger entropy of the target variable within a certain group of instances indicates ______________ homogeneity in this group in terms of the target variable. Larger Smaller

Smaller

When performing pre-pruning for decision trees via setting a minimum split size as the stopping criteria (i.e., stop splitting a node if the number of observations in the node is less than the minimum split size), a larger minimum split size is likely to produce a _____________ tree Smaller Larger

Smaller

If a model has very different predictive performance (e.g., error rate) on the training and testing set, it is likely that The model has high variance The model has high bias

The model has high variance

The kernel functions calculate relationships between every pair of points as if They are in the lower dimensions They are in the same dimensions as the original data They are in the higher dimensions

They are in the higher dimensions

When performing feature selection, using an optimum search strategy is often not practical in real applications True False

True (Consider all combinations of features. If a set contains n elements, then the number of subsets of the set is 2^n)
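The 2^n claim behind this answer can be verified by brute-force enumeration on a tiny, hypothetical feature set (the feature names below are illustrative, not from the question):

```python
from itertools import combinations

def num_subsets(n: int) -> int:
    # Each of n features is either included or excluded: 2**n subsets.
    return 2 ** n

# Cross-check by enumerating all subsets of a small feature set.
features = ["age", "income", "degree"]
count = sum(1 for r in range(len(features) + 1)
            for _ in combinations(features, r))
print(num_subsets(3), count)  # both 8
```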

Which of the two has higher complexity, i.e., needs to build more models? all-vs-all one-vs-all

all-vs-all
