Business Intelligence Final Exam


Name 10 hyper-parameters we have tuned in our labs in this course with regard to all the models we have had.

- DT: splitting criterion, depth
- NB: none
- KNN: K
- SVM: C
- MLP: depth (number of layers), number of neurons in each layer (width), epochs, batch size, activation functions, dropout, etc.
- K-Means: K
- Apriori: support and confidence thresholds
- Ensembles, you say!

Three major centrality measures in network analysis:

- Degree (and weighted degree) centrality
- Betweenness centrality
- Closeness centrality

From sklearn.svm, ---1--- is used for regression tasks, whereas ---2--- is used for classification.

1- SVR 2- SVC
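As a minimal sketch of the distinction, assuming scikit-learn is installed and using tiny made-up data (the arrays and parameter values below are illustrative, not from the course labs):

```python
# SVC predicts class labels; SVR predicts numeric values.
from sklearn.svm import SVC, SVR

X = [[0.0], [1.0], [2.0], [3.0]]   # one made-up feature
y_class = [0, 0, 1, 1]             # categorical target for classification
y_value = [0.1, 0.9, 2.1, 2.9]     # numeric target for regression

clf = SVC(kernel="linear", C=1.0).fit(X, y_class)
reg = SVR(kernel="linear", C=1.0).fit(X, y_value)

class_pred = clf.predict([[2.5]])  # array holding one class label
value_pred = reg.predict([[2.5]])  # array holding one float
```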

The ---1--- in OLS regression measures the proportion of variance in the dependent variable explained by the model, adjusted for the number of predictors. It penalizes the addition of irrelevant predictors, unlike ---2---, which may increase even with the addition of insignificant predictors.

1. Adjusted R squared 2. R-squared
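The penalty can be sketched with the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The function name and the sample numbers below are made up for illustration:

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

r2 = 0.80  # hypothetical R-squared, held fixed
adj_5 = adjusted_r2(r2, n_obs=100, n_predictors=5)
adj_20 = adjusted_r2(r2, n_obs=100, n_predictors=20)
# With R-squared unchanged, more predictors give a lower adjusted value.
```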

Given the following clustering results for the BartRider dataset, what is the probability that an individual in cluster 2 owns a house?

17%

The ---1--- attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k in the ---2--- algorithm.

1: Elbow method 2: K-Means
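One way the elbow method might be sketched, assuming scikit-learn and synthetic blob data (the dataset, the range of k, and the random seed are all illustrative choices):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances (lower = more homogeneous).
    inertias.append(model.inertia_)
```

Plotting `inertias` against k and looking for the point where the curve flattens gives the "elbow".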

From sklearn.neural_network, ---1--- is used to train a model for classification tasks, whereas ---2--- is used for numeric prediction.

1: MLPClassifier 2: MLPRegressor

In classification with AdaBoost, ---1--- instances are given ---2--- weights to allow subsequent weak learners to focus more on ---3---.

1: misclassified 2: higher 3: classifying them correctly

In OLS regression, a 1._______ p-value (< 0.05) indicates 2.______evidence against the null hypothesis, suggesting 3.__________significance of the variable.

1: smaller 2: stronger 3: greater

Item X has appeared 545 times in our transactional data, while item Y has been observed 2513 times. Out of the total 9835 transactions, these items have co-appeared 271 times. Calculate the confidence of the rule X --> Y.

271/545=0.497

Based on the model's performance on the test data for predicting bad-buy cars, how many NOT bad-buy cars are classified as bad buy?

2712

In K-means clustering, how are initial cluster centroids typically selected? a) Randomly b) Based on the largest cluster size c) According to the closest data points d) Determined by the smallest distance

a) Randomly

Which library is often used for creating data visualizations in Python for data mining tasks? a) Pandas b) Seaborn c) NumPy d) Scikit-learn

b) Seaborn

What is NOT a cause for overfitting: a) Insufficient training data b) Noises in data c) Model's complexity d) Balanced dataset

d) Balanced dataset

How to avoid overfitting? Mention three approaches.

Data strategies:
- Secure sufficient data
- Identify and handle potential outliers and noise

Evaluation strategies:
- Identify overfitting by comparing the model's performance on training and testing data
- Avoid a big gap between the training and testing accuracy

Model strategies:
- Select a proper algorithm and manage model complexity
  - Compare different algorithms
  - Lower model complexity via method-specific parameters

True or False: Regression trees are not suitable for handling categorical predictors in the input features.

False

True or False: In regression trees, splitting variables are chosen based on p-values calculated from the data.

False

True or False: The significance of predictor variables in regression trees is determined by their coefficients.

False

Which model is a lazy learner: a) KNN b) DT c) MLP d) SVM

a) KNN

The key feature of SVMs is their ability to map the problem into a higher dimension space using a process known as the _______. This process allows SVM to learn concepts that were not explicitly measured in the original data.

Kernel Trick

Name two general approaches for CF in recommender systems and explain the differences, pros and cons.

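Memory-based approach: predictions come directly from similarities computed on the stored interaction data. In User-User CF, similarity is computed between users based on their interactions or preferences. Conversely, in Item-Item CF, similarity is calculated between items based on the users who interacted with them, focusing on item-item relationships rather than user-user relationships.

Model-based approach: a predictive model (for example, matrix factorization) is trained on the interaction data and then used to generate recommendations.

Memory-based methods are simple and easy to explain, but they scale poorly and struggle with sparse data. Model-based methods generally scale better and cope better with sparsity, but they are harder to interpret and need retraining as the data changes.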

Which of the following is NOT an example of a classification problem? a) Predicting stock prices b) Identifying spam emails c) Distinguishing customer churners d) Predicting whether a county votes to legalize gaming

Predicting stock prices is NOT an example of a classification problem. It typically falls under the regression problem category, where the goal is to predict a continuous value (the stock price) rather than categorizing data into classes as in classification problems.

The strength of association between predictor and dependent variables is reflected in the ________ value in OLS regression, indicating the proportion of variance explained by the model.

R-squared

Given the following clustering results for the BartRider dataset, what is the probability that an individual in cluster 1 is not a Bart rider? Be as precise as possible.

1 − 0.023742 = 0.976258, or approximately 98%.

Based on the model's performance on the test data for predicting bad-buy cars, what is the probability of catching a bad-buy car by this model?

Recall of the bad-buy class: 56%

What is association rule mining primarily used for? a) Discovering relationships between items b) Predicting numeric values c) Grouping similar data points d) Image recognition

a) Discovering relationships between items

What could be the purpose of the following code snippet when working with scikit-learn? from sklearn import preprocessing a) Doing min-max normalization b) Importing a dataset c) Training a machine learning model d) Visualizing data

a) Doing min-max normalization
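A minimal sketch of min-max normalization with the `preprocessing` module, assuming scikit-learn is installed; the input values are made up:

```python
from sklearn import preprocessing

values = [[10.0], [20.0], [40.0]]      # one made-up numeric column
scaler = preprocessing.MinMaxScaler()  # rescales each column to the range [0, 1]
scaled = scaler.fit_transform(values)  # smallest value maps to 0, largest to 1
```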

To handle missing values in a Pandas DataFrame, what code can you use to replace them with the mean value of the column 'column_name'? a) df.fillna(df['column_name'].mean()) b) df.fillna(mean) c) df.replace_na(mean) d) df.fillna(df.mean())

a) df.fillna(df['column_name'].mean())
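A small sketch of what this call does, assuming pandas and NumPy are installed. The DataFrame is made up, and the dictionary form shown last is an alternative added here for comparison, not part of the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column_name": [1.0, np.nan, 3.0],
    "other": [10.0, 20.0, np.nan],
})

col_mean = df["column_name"].mean()  # NaN is skipped, so the mean is 2.0

# Passing a scalar fills missing values in EVERY column with that scalar.
filled_all = df.fillna(col_mean)

# Passing a dict restricts the fill to the named column only.
filled_one = df.fillna({"column_name": col_mean})
```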

In the context of Ordinary Least Squares (OLS), R-squared (aka the coefficient of determination) measures: a) The sum of squared residuals b) The proportion of the variance in the dependent variable explained by the independent variables c) The mean squared error d) The variance of the residuals

b) The proportion of the variance in the dependent variable explained by the independent variables.

What is the primary purpose of cross-validation in machine learning? a) To overfit the model b) To evaluate the model's performance on multiple subsets of data c) To train the model d) To avoid data preprocessing

b) To evaluate the model's performance on multiple subsets of data

In the carAuction dataframe, 'Auction' and 'Color' are categorical variables. Which of the following is true if we run the following code cell twice in Colab? carAuction = pd.get_dummies(carAuction, columns=['Auction', 'Color'], drop_first=True) a) We get the dummy-coded version of the selected columns and the original 'Auction' and 'Color' columns will not be dropped b) We get an error in the second run because the 'Auction' and 'Color' columns are no longer in the carAuction dataframe after the first run c) We get an error in the first run because the syntax is not accurate d) We get the dummy-coded columns in the first run; the second run does not return any error, but it does not do anything either

b) We get an error in the second run because the 'Auction' and 'Color' columns are no longer in the carAuction dataframe after the first run
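The behavior can be reproduced on a tiny stand-in DataFrame, assuming pandas is installed. The rows below are made up and are not the course's carAuction data:

```python
import pandas as pd

carAuction = pd.DataFrame({
    "Auction": ["ADESA", "MANHEIM", "ADESA"],
    "Color": ["RED", "BLUE", "RED"],
    "Price": [5000, 7000, 6500],
})

# First run: the original columns are replaced by dummy-coded columns.
carAuction = pd.get_dummies(carAuction, columns=["Auction", "Color"], drop_first=True)
columns_after_first_run = list(carAuction.columns)

# Second run: 'Auction' and 'Color' no longer exist, so pandas raises KeyError.
try:
    carAuction = pd.get_dummies(carAuction, columns=["Auction", "Color"], drop_first=True)
    second_run_error = None
except KeyError as exc:
    second_run_error = exc
```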

Fill in the blank of c= a) 1000, 10000, 1500 b) 1000, 100, 10000 c) 100, 5000, 15000 d) 15000, 5000, 100

c) 100, 5000, 15000

What is the purpose of the following code when applied to a Decision Tree classifier? clf = DecisionTreeClassifier(max_depth=3) clf.fit(X_train, y_train) a) It displays a decision tree visualization with depth of 3 b) It evaluates the model's performance c) It creates a Decision Tree classifier with a maximum depth of 3 and a default splitting criterion d) It returns error because we are not specifying the split criterion

c) It creates a Decision Tree classifier with a maximum depth of 3 and a default splitting criterion
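A runnable sketch of the same two lines, assuming scikit-learn is installed. The iris dataset and the train/test split are stand-ins chosen here, since the question does not say which data is used:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3)  # no criterion given, so the default is used
clf.fit(X_train, y_train)

criterion = clf.criterion  # scikit-learn's default splitting criterion is "gini"
depth = clf.get_depth()    # fitted tree depth, capped by max_depth
```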

What is the difference between precision and recall of the positive class in classification metrics? a) Precision and recall are the same and can be used interchangeably. b) Precision measures the number of true positives, while recall measures the number of false positives. c) Precision measures the accuracy of predictions of a model, while recall measures the model's ability to capture all positives. d) Precision measures false positive rate, while recall measures true positive rate.

c) Precision measures the accuracy of predictions of a model, while recall measures the model's ability to capture all positives.
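The two metrics can be sketched from confusion-matrix counts. The counts below are invented for illustration and are not taken from the bad-buy model:

```python
# Hypothetical counts for the positive class.
true_positives = 40   # positives predicted as positive
false_positives = 10  # negatives predicted as positive
false_negatives = 60  # positives the model missed

# Precision: of everything predicted positive, how much was really positive?
precision = true_positives / (true_positives + false_positives)

# Recall: of all real positives, how many did the model capture?
recall = true_positives / (true_positives + false_negatives)
```

Here the model is usually right when it predicts the positive class (precision 0.8) but captures fewer than half of the actual positives (recall 0.4).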

What evaluation metric is commonly used to assess the quality of K-means clustering results? a) Accuracy b) Precision c) Silhouette coefficient d) F1-score

c) Silhouette coefficient

In the association rule X -> Y, X is considered the antecedent, and Y is the consequent. The confidence of an association rule is calculated as: a) Support of the antecedent divided by the support of the consequent b) Support of the rule divided by the support of the consequent c) Support of the rule divided by the support of the antecedent d) Support of the consequent divided by the support of the antecedent

c) Support of the rule divided by the support of the antecedent

Which of the following statements accurately describes what a boxplot represents? a) A boxplot shows the distribution of data, including the range, mean, and median b) A boxplot is used exclusively for displaying categorical data c) A boxplot displays the exact values of each data point in a dataset d) A boxplot represents the spread of data and identifies potential outliers

d) A boxplot represents the spread of data and identifies potential outliers

What is the primary advantage of regression trees (and decision trees in general)? a) Speed of training b) Overfitting c) Non-linearity handling d) Interpretability

d) Interpretability

What distinguishes Mean Absolute Error (MAE) from Root Mean Squared Error (RMSE) in evaluating models? a) MAE is more sensitive to outliers than RMSE and penalizes larger errors more heavily. b) RMSE calculates the absolute differences between predicted and actual values, while MAE squares these differences before averaging. c) MAE provides an unambiguous measure of the average magnitude of errors, while RMSE emphasizes the impact of smaller errors. d) RMSE amplifies the effect of larger errors due to the squaring operation, while MAE treats all errors equally in the averaging process.

d) RMSE amplifies the effect of larger errors due to the squaring operation, while MAE treats all errors equally in the averaging process.
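A short sketch of the difference, using two made-up error lists with the same total absolute error (the helper names are illustrative):

```python
import math

def mae(errors):
    """Mean absolute error: every error counts in proportion to its size."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error: squaring makes large errors dominate."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

even_errors = [1, 1, 1, 1]   # four small errors
spiky_errors = [0, 0, 0, 4]  # one large error, same total

mae_even, mae_spiky = mae(even_errors), mae(spiky_errors)
rmse_even, rmse_spiky = rmse(even_errors), rmse(spiky_errors)
```

MAE is 1.0 for both lists, while RMSE rises from 1.0 to 2.0 for the list with the single large error.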

Given the following chart showing the train and test error with respect to the model's complexity, the model is underfit at --1--, overfit at --2--, should continue training at --3--, and is near optimal at --4-- a) Blue, Yellow, Red, Green b) Red, Yellow, Green, Blue c) Yellow, Red, Blue, Green d) Red, Yellow, Green, Blue

d) Red, Yellow, Green, Blue

Consider two rules of X --> Y and Y --> X, a) They have the same confidence but potentially different lift and support. b) They have the same confidence and support but different lift c) They have the same support, confidence, and lift d) They have the same support and lift but potentially different confidence

d) They have the same support and lift but potentially different confidence
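This can be checked with the counts from the earlier confidence question in this set (545, 2513, 271 out of 9835 transactions). The variable names are illustrative:

```python
n_total, n_x, n_y, n_xy = 9835, 545, 2513, 271

# Support of the rule depends only on the co-occurrence count, so it is symmetric.
support_xy = n_xy / n_total

# Confidence divides by the antecedent's count, so direction matters.
conf_x_to_y = n_xy / n_x
conf_y_to_x = n_xy / n_y

# Lift = support(X and Y) / (support(X) * support(Y)); swapping X and Y changes nothing.
lift = support_xy / ((n_x / n_total) * (n_y / n_total))
```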

If you want to retrieve the details of the Pokémon at index 3 on the pokemon dataframe, which code snippet should you use? a) pokemon.loc[3] b) pokemon.iloc[-3] c) pokemon.iloc[:, 3] d) pokemon.iloc[3, :]

d) pokemon.iloc[3, :]
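A sketch of how the `iloc` options differ, assuming pandas is installed. The small DataFrame below is made up and stands in for the course's pokemon data:

```python
import pandas as pd

pokemon = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander", "Charmeleon"],
    "Type": ["Grass", "Grass", "Grass", "Fire", "Fire"],
    "HP": [45, 60, 80, 39, 58],          # illustrative values
    "Attack": [49, 62, 82, 52, 64],      # illustrative values
})

row_3 = pokemon.iloc[3, :]         # the row at position 3, all columns
col_3 = pokemon.iloc[:, 3]         # the column at position 3, all rows
third_from_end = pokemon.iloc[-3]  # the third row counting from the end
```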

In Naive Bayes classification, the classifier assumes that features are ---1--- given the class, and that they are ---2--- from each other.

1: independent 2: unrelated

Increasing the _______ reduces the number of rules generated as the output of Apriori.

minimum support and confidence thresholds

