474 - Final Exam


If you are using kNN for binary classification with k=3, and the three nearest neighbors have the values 0, 0, 1, what do you predict for this record?

0
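The majority vote behind this answer can be sketched in a few lines. This is only an illustration in Python (the course itself uses R), and the function name `knn_vote` is hypothetical.

```python
from collections import Counter

def knn_vote(neighbor_labels):
    """Return the most common class label among the k nearest neighbors."""
    counts = Counter(neighbor_labels)
    # most_common(1) gives a list with one (label, count) pair
    return counts.most_common(1)[0][0]

# Two of the three neighbors are 0, so the predicted class is 0
print(knn_vote([0, 0, 1]))
```

With an odd k in a binary problem there can be no tie, which is one reason k=3 is a convenient choice.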

In leave-one-out cross-validation, how many models are trained if your total dataset has 10,000 records?

10,000
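To see why the count equals the number of records, here is a minimal sketch of how leave-one-out splits could be generated. It is written in Python for illustration only, and `loocv_splits` is a hypothetical helper, not a function from any library used in the course.

```python
def loocv_splits(n_records):
    """Yield (train_indices, held_out_index) pairs, one pair per record."""
    for held_out in range(n_records):
        # Every record except the held-out one goes into the training set
        train = [i for i in range(n_records) if i != held_out]
        yield train, held_out

# A tiny dataset of 5 records produces 5 splits, so 5 models would be trained
splits = list(loocv_splits(5))
print(len(splits))
```

Each record is held out exactly once, so a dataset of n records always yields n fits.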

If you had a categorical variable measuring degree level in your dataset with the values (HS, BS, PHD) and wanted to one-hot encode it (a.k.a. make dummy variables), how many columns would be created in your data to accomplish that?

3
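A rough sketch of the encoding, assuming plain Python dictionaries stand in for data-frame rows. The helper `one_hot` and the `degree_` column prefix are made up for this example.

```python
def one_hot(values, levels, prefix="degree"):
    """Turn each categorical value into one 0/1 indicator per level."""
    return [
        {f"{prefix}_{level}": int(value == level) for level in levels}
        for value in values
    ]

levels = ["HS", "BS", "PHD"]
rows = one_hot(["HS", "BS", "PHD", "BS"], levels)

# Three levels give three indicator columns
print(len(rows[0]))
print(rows[1])
```

Dropping one of the three columns would give the k-1 dummy coding that linear regression typically needs.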

What does a Naïve Bayes model consist of?

A bunch of look-up tables. These tables contain marginal and conditional probabilities used to generate any predictions you may need.

What is regularization and how does it relate to the bias-variance tradeoff?

Regularization (sometimes referred to as shrinkage) means to put constraints on the model's parameter coefficients as they are learned so some become smaller than they would without constraints (e.g. OLS). The effect of shrinking or regularizing the parameter coefficient estimates will increase the bias (because the model can become simpler - think of Lasso pushing some betas all the way to 0), but will simultaneously reduce the model variance.

A meteorologist believes many variables are important to accurately predict the temperature next week. Which would be ideal for the meteorologist to use: ridge regression or lasso?

Ridge regression. It keeps all of the variables in the model as drivers, whereas lasso can shrink some coefficients all the way to zero.

List all the predictive methods we covered in class up through SVM and indicate if they are "statistical learning" or "machine learning".

SL = Linear regression, Logistic regression, Naïve Bayes, Lasso, Ridge. ML = kNN, Kernel methods, Trees, Random Forests, AdaBoost, GBM, Neural Networks, Deep Learning, SVM.

Say you were trying to predict which customers were likely to leave your company (churn vs. non-churn). In your dataset, 95% of the records were non-churners (which is a good thing), and 5% left the company (churned). Which is the ideal statistical performance measure to evaluate your predictive model based on the possibilities provided? a. Overall accuracy b. Specificity c. Sensitivity d. churners that they actually churned

Sensitivity: how well we identified the customers who actually churned.

What is the difference between statistical learning and machine learning?

Statistical learning methods will have assumptions about the data. Machine learning methods don't assume anything

You want to estimate the effect that a new leadership development program has on whether team members stay with or leave the company.

Supervised learning; classification problem

Meijer needs to better predict the demand of their toilet paper products each day. They typically use variables such as store location, previous sales history, and toilet paper brand as their variables.

Supervised learning; regression problem

Could you regularize a logistic regression model - why or why not?

Sure. As we saw from the caret model list, there are many variants of the methodologies we learned which are regularized.

What is NOT a general-purpose lexicon one might use for sentiment analysis?

THOR

What is the issue with the validation set approach to cross-validation?

There is only one fit, made on the observations selected for the training set, and we would expect different fits for different training sets.

Which is NOT a popular cross-validation design in predictive modeling?

Upsample cross validation

What is not a popular approach to scaling data?

Upscaling

What are three popular ways to normalize your numeric features?

Z-score standardization, min-max normalization, decimal scaling
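The three approaches can be sketched as follows. This is an illustrative Python version using only the standard library. It assumes the population standard deviation for the z-score (R's `scale()` uses the sample version), and the function names are hypothetical.

```python
from statistics import mean, pstdev

def z_score(xs):
    """Center at 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]  # fails if all values are equal

def min_max(xs):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]  # fails if all values are equal

def decimal_scaling(xs):
    """Divide by the smallest power of 10 that brings every |x| below 1."""
    max_abs = max(abs(x) for x in xs)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs]

data = [12, 250, -999]
print(z_score(data))
print(min_max(data))
print(decimal_scaling(data))
```

All three keep the ordering of the values and only change the scale they are measured on.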

Consider an example where you have just a few more rows/observations than you have input features and performed OLS to fit a linear regression model. What can be said in regard to the variance in the bias-variance trade-off? a. Beta estimates can have a lot of variation, thus resulting in overfitting and poor predictions on new observations b. Beta estimates tend to have low variance, thus perform well on test observations c. You do not have enough data to estimate the beta coefficients d. Variance will not be impacted by a smaller dataset, but bias will

a. Beta estimates can have a lot of variation, thus resulting in overfitting and poor predictions on new observations

You are trying to predict a target variable containing 0s and 1s. Which method would NOT be appropriate? a. Linear Regression - numeric problems b. Logistic regression c. Naive bayes d. kNN

a. Linear Regression - numeric problems

Say you were trying to predict which customers were likely to leave your company (churn vs. non-churn). In your dataset, 95% of the records were non-churners (which is a good thing), and 5% left the company (churned). Which is the ideal statistical performance measure to evaluate your predictive model based on the possibilities provided? a. Matthews correlation coefficient b. Overall accuracy c. Area under the ROC curve (AUC) d. F-score

a. Matthews correlation coefficient

What would be the ideal thing to do when training your predictive model so it leads to what you define as "best" for your business case? a. When training/fitting the model, optimize based on the statistical performance measure that you will be evaluating the model on instead of a different metric. b. Try training/fitting your models using different statistical performance measures (e.g. metric=) to see what performs best. c. When training/fitting the model, optimize based on the statistical performance measure that leads to the lowest variation (e.g. RMSE for regression or log-loss for classification). d. All options are true

a. When training/fitting the model optimize based on the statistical performance measure that you will be evaluating the model on instead of a different metric.

Say you were trying to predict which customers were likely to leave your company (churn vs. non-churn). In your dataset, 95% of the records were non-churners (which is a good thing), and 5% left the company (churned). When you modeled this, you labeled churners as 1s and non-churners as 0s. Before you calculated your confusion matrix, you used a .4 probability cutoff threshold instead of the typical .5 cutoff. How do you think that will impact your sensitivity? a. Will likely increase sensitivity or be the same as using the .5 b. Will likely decrease sensitivity or be the same as using the .5 c. Will definitely not change the sensitivity d. Will change the confusion matrix values but sensitivity will remain the same

a. Will likely increase sensitivity or be the same as using the .5
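The effect of the cutoff can be checked with a small sketch. The labels and predicted probabilities below are invented purely for illustration, and the Python helper `sensitivity` is hypothetical.

```python
def sensitivity(y_true, y_prob, cutoff):
    """Share of actual 1s that are predicted as 1 at the given cutoff."""
    predicted = [1 if p >= cutoff else 0 for p in y_prob]
    true_pos = sum(1 for y, yhat in zip(y_true, predicted) if y == 1 and yhat == 1)
    false_neg = sum(1 for y, yhat in zip(y_true, predicted) if y == 1 and yhat == 0)
    return true_pos / (true_pos + false_neg)

# Made-up data: 1 = churner, 0 = non-churner
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.70, 0.45, 0.30, 0.42, 0.20, 0.10, 0.55, 0.05]

print(sensitivity(y_true, y_prob, 0.5))
print(sensitivity(y_true, y_prob, 0.4))
```

Lowering the cutoff can only move records from predicted 0 to predicted 1, so the number of true positives never goes down and sensitivity either rises or stays the same.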

Which variable selection approach begins with all variables in the model and sequentially eliminates them one by one in each iteration until you are left with only statistically significant features? a. Forward selection b. Backward selection c. Stepwise selection d. Lasso

b. Backward selection

What is a true statement about predictive modeling? a. A complex model is desired when it performs just as well as simpler model b. Including irrelevant features leads to unnecessary complexity c. A simpler model is desired when it performs just as well as a more complex model d. Two options are true

d. Two options are true (b. Including irrelevant features leads to unnecessary complexity, and c. A simpler model is desired when it performs just as well as a more complex model)

What is a false statement about h2o? a. The h2o library is just an API b. It's an ideal solution when you're looking for an interpretable model c. It provides you functionality to estimate how long a model experiment will run d. You initialize your cluster using h2o.init()

b. It's an ideal solution when you're looking for an interpretable model

Your client tasks you with a prediction problem and requires that you provide an interpretable model that clearly shows the relationship of inputs to your output. Which is NOT a method you will consider? a. Linear regression b. Naive bayes c. kNN d. Logistic Regression

b. Naive bayes c. kNN

Sensitivity and Positive Predictive Value (PPV) are closely related, just like Specificity is closely related to Negative Predictive Value (NPV). What is a false statement about these performance measures? a. Sensitivity is the percentage of true 1s that are identified b. Specificity is the unconditional analog to NPV, which accounts for the event's prevalence c. PPV and NPV are used when you are trying to explain how well your model predicts each class respectively. d. Sensitivity and specificity are used when you are evaluating performance of correctly classified observations conditioned on the actual class.

b. Specificity is the unconditional analog to NPV, which accounts for the event's prevalence

Which is not a plot function from base graphics? boxplot(), qplot(), hist(), ggplot()

qplot() and ggplot() (both come from the ggplot2 package; boxplot() and hist() are base graphics functions)

What is a false statement about the lasso? a. In the shrinkage penalty, the betas are taken in absolute value (instead of squared) and summed b. Since some of the estimated beta coefficients can be shrunk all the way to 0, this helps identify drivers (ie variable selection), which helps make the model easier to interpret. c. As lambda increases, sometimes an estimated beta coefficient will increase, but will always eventually decrease. d. Lasso will usually lead to a smaller sum of squared errors compared to OLS for linear regression

c. As lambda increases, sometimes an estimated beta coefficient will increase, but will always eventually decrease.

Which of the following is not a tidy data principle? a. Each type of observational unit is a value b. Each variable is a column c. Each table has multiple tokens per row d. Each observation is a row

c. Each table has multiple tokens per row (the principle is one token per row, not multiple)

What is a false statement about ridge regression? a. When lambda = 0, the estimated beta coefficients are the same as those obtained from OLS b. As lambda increases, flexibility decreases, resulting in a bias increase (because the model is simpler) and a variance decrease (less variation from dataset to dataset) c. It does not matter if you standardize the predictors before applying ridge regression d. Ridge will tend to shrink your estimated beta coefficients toward zero but not make them go all the way to zero.

c. It does not matter if you standardize the predictors before applying ridge regression

Which is a false statement? a. Linear regression uses the OLS fitting procedure b. Kernel methods are like kNN but they just use a different weighting scheme called a kernel to weight the nearest neighbors c. Naive bayes can be used for both regression and classification problems d. Logistic regression uses the MLE fitting procedure

c. Naive bayes can be used for both regression and classification problems (it is used only for classification, not regression)

Generally speaking, among the following models which would tend to have the lowest bias but highest variance? a. Linear regression b. Knn c. Neural network d. Lasso

c. Neural network

Your client has tasked you with trying to classify products as "sellers vs. non-sellers". The idea is that in the upcoming season, the retailer wants to make sure they put products on the shelves that are more likely to sell than not sell. The data you have is fairly balanced in sellers/non-sellers. Which statistical performance measure would be ideal to choose your "best" candidate model if you care about predicting sellers just as well as non-sellers? a. Sensitivity b. PPV c. Overall Accuracy d. AUC

c. Overall Accuracy (the data is balanced, so this is acceptable) and d. AUC (a balance between sensitivity and specificity: the more area under the curve the better, an AUC of 1 probably means something is wrong, and .5 is horrible)

Your client has tasked you with trying to classify products as "sellers vs. non-sellers". The idea is that in the upcoming season, the retailer wants to make sure they put products on the shelves that are more likely to sell than not sell. After discussing your best model with your client and prediction expectations, they tell you that they not only want to predict seller/non-seller accurately, but also want to make sure the probabilities provided by the model are meaningful, because they want to use the probabilities to rank which predicted sellers will be added to the shelves in order until they run out of shelf space. What can help demonstrate how good your predictive probabilities are? a. Cohen's kappa statistic b. Log-loss statistic c. Probability calibration plot d. AUC

c. Probability calibration plot

An auto parts retailer needs to forecast demand for their next year-long planning horizon. 90% of the products they are generating a forecast for sell only one unit or do not sell. How would you frame this problem as a prediction problem?

classification

A potential client is skeptical about trying predictive modeling to support targeted marketing initiatives. She has provided you their database, which contains previous customers' purchasing behavior over time. You are also able to see under what conditions customers purchased products (e.g. used in-store coupon, mailed flyer coupons, etc.). You decide to develop a predictive model for an upcoming marketing initiative she plans to run. What might you show her to help persuade her to use your analytics skills to help in that initiative? a. A lift chart b. A gains chart c. A probability calibration plot d. All could provide valuable talking points to show expectations of using a predictive model

d. All could provide valuable talking points to show expectations of using a predictive model

What is a false statement about logistic regression? a. The errors/residuals do not have to be normally distributed b. The features are assumed to be independent c. Logistic regression can be used for classification problems having more than two classes, which is called multinomial logistic regression d. It is very robust at estimating the beta coefficients on small datasets, meaning it is unlikely to overfit to the training data.

d. It is very robust at estimating the beta coefficients on small datasets, meaning it is unlikely to overfit to the training data.

You are predicting demand for various products for your client. They want you to report model performance using MAPE (mean absolute percentage error). Which is a false statement about MAPE? a. Advantage of being scale independent, so frequently used to compare forecast performance between different data series. b. Cannot be used if there are 0 demand values because you can't divide by 0 c. MAPE tends to prefer models that under-forecast. d. MAPE has a lower (0%) and upper bound (100%) making it very interpretable.

d. MAPE has a lower (0%) and upper bound (100%) making it very interpretable.
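A quick numeric check shows why statement d is false. The sketch below is illustrative Python with invented demand values, and `mape` is a hypothetical helper.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, expressed in percent."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(ratios) / len(ratios)

# Over-forecasting 10 units as 40 is a 300% error on that record,
# so the average goes well past 100%
print(mape([10, 20], [40, 20]))

# A zero in the actual demand makes the statistic undefined
try:
    mape([0, 20], [5, 20])
except ZeroDivisionError:
    print("MAPE is undefined when actual demand is 0")
```

An under-forecast can never be off by more than 100% of a positive actual value, while an over-forecast has no such cap, which is why MAPE tends to favor models that under-forecast.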

What is a false statement about regularization techniques? a. In regression, they shrink the beta coefficient estimates b. Shrinking the coefficient estimates in absolute value can significantly reduce their variance c. The two most popular techniques for shrinking the regression coefficients toward zero are ridge regression and least absolute shrinkage and selection operator (lasso) d. Shrinking the coefficient estimates in absolute value can significantly increase their variance

d. Shrinking the coefficient estimates in absolute value can significantly increase their variance

When you are framing a predictive analytics problem as you learn about it from the client, what is something that would not matter in your thought process? a. How the model would be used b. If the problem is a regression or classification type problem c. What performance measure(s) would be ideal to measure the accuracy and potential impact of using the model d. If model interpretability will be required or not e. All are important considerations

e. All are important considerations

In regression-type problems, since we are just summarizing the errors from the actual values, why would we need to consider alternative statistical performance measures? a. Some statistics do NOT allow for comparison among different series b. Sometimes you want to compare accuracy on an individual series (forecasts for milk this month or another month) or between multiple different series (milk demand, hamburger demand). c. You might want a statistic that is on a standardized scale (e.g. between 0 and 1) for comparison purposes. d. Some statistics weight large errors more heavily than smaller ones. e. All options are valid reasons.

e. All options are valid reasons.

A ggplot is made up of two main components, a _______object and at least one __________.

A ggplot() object and at least one geom layer

Which knn model would be considered most flexible and likely to overfit?

k=1

What base R function could be used to fit a linear regression model?

lm()

What should method= be set to in the train() function from caret to train a linear regression model?

method="lm"

How would you create a 3 x 3 grid for 9 plots using base graphics?

par(mfrow=c(3,3))

When discussing logistic regression, the range of odds was ______ and the range of log-odds was _________, which is why log-odds is used.

range of odds = [0, ∞), range of log-odds = (-∞, ∞)
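To inspect the two ranges directly, the following sketch evaluates odds and log-odds at a few probabilities. It is illustrative Python, and the function names are hypothetical.

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds, also called the logit (requires 0 < p < 1)."""
    return math.log(odds(p))

for p in [0.001, 0.1, 0.5, 0.9, 0.999]:
    print(p, odds(p), log_odds(p))
```

Odds never drop below 0, while log-odds are negative for p below 0.5, zero at p = 0.5, and positive above it, so a linear combination of features can be mapped onto them without any range restriction.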

A handy function that gives you a count of each unique value in a vector is?

table()

Which R function is used to tokenize the text?

unnest_tokens()

If you normalize numeric features before you train a linear regression model, will that affect the interpretation of the beta coefficients or will the units of measure remain the same?

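Yes, it affects the interpretation: each beta coefficient is then expressed in terms of the normalized feature rather than its original unit of measure.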

What is a simple way to remove stop words in your text using R?

%>% anti_join(stopwords)

You are developing a predictive model and are concerned with your error. You believe most of the error is due to the method/estimation technique you are using. This type of error is called________________.

Reducible Error

What is the purpose of normalizing the data?

To ensure they are all in similar or comparable scale

How can influential points impact your linear regression model? How might you identify an influential point?

An influential point is a data point that significantly "influences" the estimate of your beta coefficients. To identify an influential point you could look at a plot of the Cook's distances as well as see where that point's residual shows up in the histogram of residuals.

Which lexicon generally has more negative words?

Bing

What fitting procedure is used to estimate the beta coefficients in linear regression?

Ordinary least squares (OLS)

Predictive analytics is the process of extracting information from data to make __________ about future outcomes.

Prediction or estimates

The Bing lexicon____

Categorizes words into binary positive or negative categories

What is true about z-score standardization?

It centers the data at 0 with a standard deviation of 1

What is a popular way to explore categorical-type features?

Cross-tabulation tables

Which methods are good considerations to try to predict/estimate probabilities when those probabilities will be used for "ranking" purposes?

Logistic regression and naive bayes

What is a great thing to create about your data as you learn more about it?

Data dictionary

Where will you spend most of your time in the data mining or analysis process?

Data preparation

During the tokenization process, say "Purdue University" was a token. Which is a true statement?

Purdue University is a bi-gram

How might you find outliers?

Examine distribution, z-score and IQR

Which is a true statement when determining which predictive model is "best" among competing models?

First you need to identify which of the predictive models are "candidate models", meaning models that are not overfit. Then compare their test statistics, using the measure that is most relevant to the business problem.

Based on your PA experience in this course, what best describes what MUST be done to deliver a valid predictive solution in practice?

Follow a structured process such as CRISP-DM or INFORMS CAP

Describe a good strategy to develop a decision tree model.

Grow a large overfit tree, then prune it back using cross-validation.

You are developing a predictive model and are concerned with your error. You believe that incorporating new variables about your customers or products would reduce your ______________.

Irreducible error

What is the problem with the mean imputation approach?

It can change a variable's distribution and can cause an artificial spike at the mean

Does it matter whether you use all or k-1 dummy variables when using knn?

It is preferred that you use all k dummy variables and not eliminate any when using knn

What does heteroskedasticity mean in linear regression and how might you check that it exists in your model?

It means that the model's residual errors are not random and potentially follow some sort of pattern. Modelers also refer to heteroskedasticity as "non-constant variance" because the errors do not have a consistent spread around zero, which OLS assumes (i.e. homoskedasticity). To check for heteroskedasticity you could plot the residual errors on the y-axis versus the index of your data points (x-axis).

Which cross-validation is best at finding a good balance among error rate estimation and computational run time?

K-fold

How does kNN differ from Kernel Methods?

kNN averages the closest neighbors, while kernel regression would use a kernel (a weighting mechanism) to weight the neighbors differently. This can help obtain more accurate predictions among your data points that are on the edge of your data.

If the purpose of your model is to predict for several thousands of new records, what would be the disadvantage of using kNN prediction?

kNN is computationally intensive because it has to calculate distances to all the data points to identify the k closest neighbors.

Say you had a dataset that had 50 potential features and 100 observations. Which predictive model design would be ideal for such a situation?

LOOCV

What constraint is on the Lasso model compared to the Ridge Regression model?

Lasso uses the L1 norm (takes the sum of the betas in absolute value), while ridge uses the L2 norm (takes the sum of the squares of the betas).
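The two penalty terms can be computed side by side. This is a minimal Python sketch, assuming the intercept is left out of the penalty (a common convention). The coefficient values and function names are made up for illustration.

```python
def lasso_penalty(betas, lam):
    """L1 penalty: lambda times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in betas)

def ridge_penalty(betas, lam):
    """L2 penalty: lambda times the sum of squared coefficients."""
    return lam * sum(b * b for b in betas)

betas = [1.0, -2.0, 0.5]  # hypothetical slope coefficients, intercept excluded
lam = 2.0                 # hypothetical tuning parameter

print(lasso_penalty(betas, lam))
print(ridge_penalty(betas, lam))
```

Squaring makes the ridge penalty on a coefficient near zero very small, which is why ridge shrinks coefficients toward zero without eliminating them, while the absolute-value penalty in lasso can push them exactly to zero.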

When would you use the Lasso model versus a Ridge Regression model?

Lasso would be preferred over ridge if you expect based on the context of your problem that the number of significant features/drivers is fairly small, and thus want to identify what those primary model drivers are. For ridge, you might expect many features have some statistical impact and thus driving their parameters all the way to zero (which would eliminate that feature from your model) would not be ideal.

Once you decide if you are going to model your problem using regression-type techniques or classification-type techniques, which of the following is NOT an appropriate metric used to assess classification-type techniques?

MSE

A client has tasked you with developing predictive models that will help them identify what the optimal markdown percentage should be for their seasonal items. Which is a true statement?

The model should be interpretable

What fitting procedure is used to estimate the beta coefficients in a logistic regression model?

Maximum likelihood estimation (MLE)

What normalization approach sets data from 0-1?

Min-max

Below are some methods we will learn and apply this semester which are more flexible/complex than others. Which option would be considered most flexible, and thus difficult to interpret, and challenging to provide inferences?

Neural Networks

If you convert your categorical variables to one-hot encoding (aka dummy variables), do you need to remove a one-hot encoding column like you would in linear regression?

Nope

A better way to identify differences among categorical group levels is plot the data in a?

Normalized view

In R, coders will sometimes use the pipe %>% operator within their code. What does the pipe do?

Takes the output of what comes before it and feeds that into the first argument of the next function

What does linearity assumption imply in the OLS assumptions?

That the model is linear in the model parameters. Meaning, that each parameter (beta coefficient) has a constant effect on the response.

What makes Naïve Bayes naïve?

The assumption that features are independent.

Explain the bias-variance tradeoff using an example.

The bias-variance tradeoff is the relationship between how well the model you use can fit your data (bias) and how consistently the model performs on data you didn't use to fit/train it (variance). If I tried to predict the weather next week, that's probably a complex problem and using a linear regression would probably be too simple, thus I could expect a lot of model bias. On the flip side, evaluating this simple model on datasets I didn't use to train the model would likely lead to similar/consistent (and likely poor) performance, thus low variance.

How does the caret library determine the "best" tuning parameter lambda and model (see slide 46)?

The decision is based on the metric you define as the modeler in the train() function (e.g. metric = "")

In a predictive model, the Y variable is not referred to as ____________.

The input

What's the key idea of the bias-variance tradeoff?

The more flexible the model, the lower the bias, but the higher the variance in the test error. This means that as you assess different datasets, you can expect much different test error estimates.

What are some of the problems of using a linear regression to fit a binary classification problem?

The obvious problems are that the homoskedasticity and normality of errors assumptions will be violated. Another issue is that linear regression does not guarantee the predictions lie between 0 and 1, which must happen because that is the range where probability is defined.

In context of a text analysis, tokenization is___.

The process of splitting text into tokens

As a model becomes more flexible, what is likely to eventually happen to the training set accuracy and test set accuracy?

The training set accuracy will likely get better but the testing accuracy will get worse due to overfitting

In ridge regression, the lambda term is called ___________.

The tuning parameter (or hyperparameter)

What is the purpose of normalizing numeric features? Meaning what potential benefit can it provide?

To ensure your model inputs have an even playing field, in that the method you're using does not weight variables having larger or smaller values more or less than it should. Putting the features on the same scale will remove the chance of this type of bias when the model learns.

What is the purpose of cross-validation?

To estimate/gauge the true error rate/accuracy of our model

In the validation set approach, what is the purpose of splitting your data into training and testing data sets?

The training set is used to build the predictive model. The testing set is used to gauge the generalizability of the model for future observations.
