Business Intelligence Final Exam
Name 10 hyper-parameters we have tuned in our labs in this course with regard to all the models we have had.
- DT: splitting criterion, depth
- NB: none
- KNN: K
- SVM: C
- MLP: depth (number of layers), number of neurons in each layer (width), epochs, batch size, activation functions, dropout, etc.
- K-Means: K
- Apriori: support and confidence thresholds
- Ensembles, you say!
Three major centrality measures in Network Analysis:
- Degree (and weighted degree) centrality
- Betweenness centrality
- Closeness centrality
From sklearn.svm, ---1--- is used for regression tasks, whereas ---2--- is used for classification.
1- SVR 2- SVC
The ---1--- in OLS regression measures the proportion of variance in the dependent variable explained by the model, adjusted for the number of predictors. It penalizes the addition of irrelevant predictors, unlike ---2---, which may increase even with the addition of insignificant predictors.
1. Adjusted R-squared 2. R-squared
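The penalty can be illustrated with a small sketch. The helper name `adjusted_r2` and the example numbers are hypothetical; the formula is the usual adjustment for n observations and p predictors.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: a fourth predictor raises R-squared only slightly,
# so the adjusted value goes down instead of up.
before = adjusted_r2(0.800, n=50, p=3)
after = adjusted_r2(0.801, n=50, p=4)

print(round(before, 4), round(after, 4))
```

Here R-squared rises from 0.800 to 0.801, while the adjusted value falls, which is the behaviour the question describes.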
Given the following clustering results for the BartRider dataset, what is the probability that an individual in cluster 2 owns a house?
17%
---1--- attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k in the ---2--- algorithm.
1: Elbow method 2: K-Means
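A minimal sketch of the elbow method, assuming scikit-learn is available and using synthetic data from `make_blobs` (the data and the range of k are illustrative choices, not the course dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (an assumption for illustration only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the within-cluster sum of squared distances to the centroids
    inertias.append(model.inertia_)

for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 1))
```

Plotting `inertias` against k and looking for the point where the curve flattens gives the "elbow".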
From sklearn.neural_network, ---1--- is used to train a model for classification tasks, whereas ---2--- is used for numeric prediction.
1: MLPClassifier 2: MLPRegressor
In classification with AdaBoost, ---1--- instances are given ---2--- weights to allow subsequent weak learners to focus more on ---3---.
1: misclassified 2: higher 3: classifying them correctly
In OLS regression, a 1.______ p-value (< 0.05) indicates 2.______ evidence against the null hypothesis, suggesting 3.______ significance of the variable.
1: smaller 2: stronger 3: greater
Item X has appeared 545 times in our transactional data, while item Y has been observed 2513 times. Out of the total 9835 transactions, these items have co-appeared 271 times. Calculate the confidence of the rule X --> Y.
271/545=0.497
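The arithmetic can be checked with a short sketch; the variable names are illustrative, and the counts are the ones given in the question.

```python
n_transactions = 9835
count_x = 545     # transactions containing X
count_xy = 271    # transactions containing both X and Y

support_xy = count_xy / n_transactions       # support of the rule
support_x = count_x / n_transactions         # support of the antecedent
confidence_x_to_y = support_xy / support_x   # equivalent to count_xy / count_x

print(round(confidence_x_to_y, 3))  # 0.497
```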
Based on the model's performance on the test data for predicting bad-buy cars, how many NOT bad-buy cars are classified as bad buy?
2712
In K-means clustering, how are initial cluster centroids typically selected? a) Randomly b) Based on the largest cluster size c) According to the closest data points d) Determined by the smallest distance
a) Randomly
Which library is often used for creating data visualizations in Python for data mining tasks? a) Pandas b) Seaborn c) NumPy d) Scikit-learn
b) Seaborn
What is NOT a cause of overfitting: a) Insufficient training data b) Noise in data c) Model's complexity d) Balanced dataset
d) Balanced dataset
How to avoid overfitting? Mention three approaches.
Data strategies
- Secure sufficient data
- Identify and handle potential outliers and noise
Evaluation strategies
- Identify overfitting by comparing the model's performance on training and testing data
- Avoid a big gap between the training and testing accuracy
Model strategies
- Select a proper algorithm and manage model complexity
- Compare different algorithms
- Lower model complexity via method-specific parameters
True or False: Regression trees are not suitable for handling categorical predictors in the input features.
False
True or False: In regression trees, splitting variables are chosen based on p-values calculated from the data.
False
True or False: The significance of predictor variables in regression trees is determined by their coefficients.
False
Which model is a lazy learner: a) KNN b) DT c) MLP d) SVM
KNN
The key feature of SVMs is their ability to map the problem into a higher dimension space using a process known as the _______. This process allows SVM to learn concepts that were not explicitly measured in the original data.
Kernel Trick
Name two general approaches for CF in recommender systems and explain the differences, pros and cons.
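Memory-based approach: works directly on the stored user-item interactions. In user-user CF, similarity is computed between users based on their interactions or preferences; in item-item CF, similarity is computed between items based on the users who interacted with them. It is simple and easy to explain, but the similarity computations scale poorly as users and items grow, and it struggles with sparse data. Model-based approach: learns a predictive model (for example, matrix factorization) from the interactions and uses it to predict unseen preferences. It generally scales better and copes better with sparsity, but it requires training and its recommendations are harder to interpret.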
Which of the following is NOT an example of a classification problem? a) Predicting stock prices b) Identifying spam emails c) Distinguishing customer churners d) Predicting whether a county votes to legalize gaming
Predicting stock prices is NOT an example of a classification problem. It typically falls under the regression problem category, where the goal is to predict a continuous value (the stock price) rather than categorizing data into classes as in classification problems.
The strength of association between predictor and dependent variables is reflected in the ________ value in OLS regression, indicating the proportion of variance explained by the model.
R-squared
Given the following clustering results for the BartRider dataset, what is the probability that an individual in cluster 1 is not a Bart rider? Be as precise as possible.
1 − 0.023742 = 0.976258, or about 98%.
Based on the model's performance on the test data for predicting bad-buy cars, what is the probability of catching a bad-buy car by this model?
Recall: 56%
What is association rule mining primarily used for? a) Discovering relationships between items b) Predicting numeric values c) Grouping similar data points d) Image recognition
a) Discovering relationships between items
What could be the purpose of the following code snippet when working with scikit-learn? from sklearn import preprocessing a) Doing min-max normalization b) Importing a dataset c) Training a machine learning model d) Visualizing data
a) Doing min-max normalization
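As a hedged illustration of what the import enables, the sketch below applies `preprocessing.MinMaxScaler` to a made-up single-feature array (the values are hypothetical):

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical single-feature data
values = np.array([[10.0], [20.0], [40.0]])

scaler = preprocessing.MinMaxScaler()   # rescales each column to the range [0, 1]
scaled = scaler.fit_transform(values)

print(scaled.ravel())
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.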
To handle missing values in a Pandas DataFrame, what code can you use to replace them with the mean value of the column 'column_name'? a) df.fillna(df['column_name'].mean()) b) df.fillna(mean) c) df.replace_na(mean) d) df.fillna(df.mean())
a) df.fillna(df['column_name'].mean())
In the context of Ordinary Least Squares (OLS), R-squared (aka the coefficient of determination) measures: a) The sum of squared residuals b) The proportion of the variance in the dependent variable explained by the independent variables c) The mean squared error d) The variance of the residuals
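A small sketch on a hypothetical frame shows what this call does in practice. Note that, as written, it fills every missing value in the frame with that one column's mean; the second statement is one way to restrict the fill to the column itself.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 'column_name' stands in for the real column.
df = pd.DataFrame({
    "column_name": [1.0, np.nan, 3.0],
    "other": [np.nan, 5.0, 6.0],
})

# Fills every NaN in the frame with the mean of 'column_name' (2.0 here).
filled_all = df.fillna(df["column_name"].mean())

# Fills only 'column_name', leaving other columns untouched.
df["column_name"] = df["column_name"].fillna(df["column_name"].mean())
```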
b) The proportion of the variance in the dependent variable explained by the independent variables.
What is the primary purpose of cross-validation in machine learning? a) To overfit the model b) To evaluate the model's performance on multiple subsets of data c) To train the model d) To avoid data preprocessing
b) To evaluate the model's performance on multiple subsets of data
In the carAuction dataframe, 'Auction' and 'Color' are categorical variables. Which of the following is true if we run the following code cell twice in Colab? carAuction = pd.get_dummies(carAuction, columns=['Auction', 'Color'], drop_first=True) a) We get the dummy-coded version of the selected columns and the original 'Auction' and 'Color' columns will not be dropped b) We get an error in the second run because the 'Auction' and 'Color' columns are no longer in the carAuction dataframe after the first run c) We get an error in the first run because the syntax is not correct d) We get the dummy-coded columns in the first run; the second run does not return any error, but it does not do anything either
b) We get an error in the second run because the 'Auction' and 'Color' columns are no longer in the carAuction dataframe after the first run
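The behaviour can be reproduced on a tiny stand-in frame. The rows below are made up for illustration; only the column names follow the question.

```python
import pandas as pd

# Hypothetical miniature stand-in for the carAuction dataframe
carAuction = pd.DataFrame({
    "Auction": ["ADESA", "MANHEIM", "OTHER"],
    "Color": ["RED", "BLUE", "RED"],
    "VehicleAge": [3, 5, 2],
})

# First run: the original columns are replaced by dummy columns.
carAuction = pd.get_dummies(carAuction, columns=["Auction", "Color"], drop_first=True)

# Second run: the original columns no longer exist, so pandas raises KeyError.
try:
    carAuction = pd.get_dummies(carAuction, columns=["Auction", "Color"], drop_first=True)
    second_run_failed = False
except KeyError:
    second_run_failed = True

print(list(carAuction.columns), second_run_failed)
```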
Fill in the blank of c= a) 1000, 10000, 1500 b) 1000, 100, 10000 c) 100, 5000, 15000 d) 15000, 5000, 100
c) 100, 5000, 15000
What is the purpose of the following code when applied to a Decision Tree classifier? clf = DecisionTreeClassifier(max_depth=3) clf.fit(X_train, y_train) a) It displays a decision tree visualization with depth of 3 b) It evaluates the model's performance c) It creates a Decision Tree classifier with a maximum depth of 3 and a default splitting criterion d) It returns error because we are not specifying the split criterion
c) It creates a Decision Tree classifier with a maximum depth of 3 and a default splitting criterion
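A runnable sketch of the same two lines, assuming the iris dataset as a stand-in for the lab data (the dataset and split are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3)  # criterion is left at its default
clf.fit(X_train, y_train)

print(clf.criterion)    # scikit-learn's default criterion is "gini"
print(clf.get_depth())  # the fitted tree is never deeper than max_depth
```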
What is the difference between precision and recall of the positive class in classification metrics? a) Precision and recall are the same and can be used interchangeably. b) Precision measures the number of true positives, while recall measures the number of false positives. c) Precision measures the accuracy of predictions of a model, while recall measures the model's ability to capture all positives. d) Precision measures false positive rate, while recall measures true positive rate.
c) Precision measures the accuracy of predictions of a model, while recall measures the model's ability to capture all positives.
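The two metrics differ only in their denominators, which a short sketch with hypothetical confusion-matrix counts makes explicit:

```python
# Hypothetical counts for the positive class
true_positives = 40
false_positives = 10   # predicted positive, actually negative
false_negatives = 60   # predicted negative, actually positive

# Of everything predicted positive, how much was right?
precision = true_positives / (true_positives + false_positives)

# Of everything actually positive, how much was caught?
recall = true_positives / (true_positives + false_negatives)

print(precision, recall)  # 0.8 0.4
```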
What evaluation metric is commonly used to assess the quality of K-means clustering results? a) Accuracy b) Precision c) Silhouette coefficient d) F1-score
c) Silhouette coefficient
In the association rule X --> Y, X is considered the antecedent, and Y is the consequent. The confidence of an association rule is calculated as: a) Support of the antecedent divided by the support of the consequent b) Support of the rule divided by the support of the consequent c) Support of the rule divided by the support of the antecedent d) Support of the consequent divided by the support of the antecedent
c) Support of the rule divided by the support of the antecedent
Which of the following statements accurately describes what a boxplot represents? a) A boxplot shows the distribution of data, including the range, mean, and median b) A boxplot is used exclusively for displaying categorical data c) A boxplot displays the exact values of each data point in a dataset d) A boxplot represents the spread of data and identifies potential outliers
d) A boxplot represents the spread of data and identifies potential outliers
What is the primary advantage of regression trees (and decision trees in general)? a) Speed of training b) Overfitting c) Non-linearity handling d) Interpretability
d) Interpretability
What distinguishes Mean Absolute Error (MAE) from Root Mean Squared Error (RMSE) in evaluating models? a) MAE is more sensitive to outliers than RMSE and penalizes larger errors more heavily. b) RMSE calculates the absolute differences between predicted and actual values, while MAE squares these differences before averaging. c) MAE provides an unambiguous measure of the average magnitude of errors, while RMSE emphasizes the impact of smaller errors. d) RMSE amplifies the effect of larger errors due to the squaring operation, while MAE treats all errors equally in the averaging process.
d) RMSE amplifies the effect of larger errors due to the squaring operation, while MAE treats all errors equally in the averaging process.
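A small numeric sketch, using made-up predictions, shows the effect: both prediction sets have the same MAE, but the one with a single large error has a higher RMSE.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))

actual = [10, 10, 10, 10]
even_pred = [12, 12, 12, 12]    # four errors of 2
spiky_pred = [10, 10, 10, 18]   # one error of 8

print(mae(actual, even_pred), rmse(actual, even_pred))     # 2.0 2.0
print(mae(actual, spiky_pred), rmse(actual, spiky_pred))   # 2.0 4.0
```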
Given the following chart showing the train and test error with respect to the model's complexity, the model is underfit at ---1---, overfit at ---2---, should continue training at ---3---, and is near optimal at ---4---. a) Blue, Yellow, Red, Green b) Red, Yellow, Green, Blue c) Yellow, Red, Blue, Green d) Red, Yellow, Green, Blue
d) Red, Yellow, Green, Blue
Consider two rules of X --> Y and Y --> X, a) They have the same confidence but potentially different lift and support. b) They have the same confidence and support but different lift c) They have the same support, confidence, and lift d) They have the same support and lift but potentially different confidence
d) They have the same support and lift but potentially different confidence
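The symmetry can be verified numerically. The sketch below reuses the X and Y counts from the confidence question earlier in this exam; the variable names are illustrative.

```python
n = 9835
count_x, count_y, count_xy = 545, 2513, 271

# Support depends only on the co-occurrence count, so it is shared by both rules.
support = count_xy / n

# Confidence divides by the antecedent count, which differs between the rules.
conf_x_to_y = count_xy / count_x
conf_y_to_x = count_xy / count_y

# Lift divides confidence by the support of the consequent.
lift_x_to_y = conf_x_to_y / (count_y / n)
lift_y_to_x = conf_y_to_x / (count_x / n)

print(round(conf_x_to_y, 3), round(conf_y_to_x, 3))
print(round(lift_x_to_y, 3), round(lift_y_to_x, 3))
```

Both lift values reduce to the same expression, n × count_xy / (count_x × count_y), which is why lift is symmetric while confidence is not.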
If you want to retrieve the details of the Pokémon at index 3 on the pokemon dataframe, which code snippet should you use? a) pokemon.loc[3] b) pokemon.iloc[-3] c) pokemon.iloc[:, 3] d) pokemon.iloc[3, :]
d) pokemon.iloc[3, :]
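A brief sketch on a hypothetical five-row stand-in for the pokemon dataframe shows how the positional selections differ (the rows and columns are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the pokemon dataframe
pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander", "Charmeleon"],
    "HP": [45, 60, 80, 39, 58],
})

row_at_3 = pokemon.iloc[3, :]        # the row at position 3, all columns
third_from_end = pokemon.iloc[-3]    # negative positions count from the end

print(row_at_3["Name"], third_from_end["Name"])
```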
In Naive Bayes classification, the classifier assumes that features are 1.______ given the class, and they are 2.______ from each other.
independent; unrelated
Increasing the ------- reduces the number of rules generated as the output of Apriori.
minimum support and confidence thresholds