srm


What is a scatterplot matrix?

Arranges the scatterplots for every pair of variables into a square matrix

For the K-nearest neighbors classifier, which of the following is/are true as K increases? I. Flexibility increases II. Squared bias increases III. Variance decreases

II and III only

You are given the following three statistical learning tools: I. Cluster Analysis II. Logistic Regression III. Ridge Regression Determine which of the above are examples of supervised learning

II and III only

Determine which of the following statements regarding statistical learning methods is/are true. I. Methods that are highly interpretable are more likely to be highly flexible. II. When inference is the goal, there are clear advantages to using a lasso method versus a bagging method. III. Using a more flexible method will produce a more accurate prediction against unseen data.

II only

Determine which of the following statements about hierarchical clustering is/are true. I. The method may not assign extreme outliers to any cluster. II. The resulting dendrogram can be used to obtain different numbers of clusters. III. The method is not robust to small changes in the data.

II, III

Which of the following is/are considered a benefit of K-means clustering over hierarchical clustering? I. Running the algorithm once is guaranteed to find clusters with the global minimum of the total within-cluster variation. II. It is less restrictive in its clustering structure. III. There are fewer areas of consideration in clustering a dataset.

II, III

Which of the following is/are true for a Poisson regression? I. A square root link is as appropriate as a log link. II. If the model is adequate, the deviance is a realization from a chi-square distribution. III. A large Pearson chi-square statistic indicates that overdispersion is likely more severe.

II, III

Determine which of the following statements is/are true. I. The number of clusters must be pre-specified for both K-means and hierarchical clustering. II. The K-means clustering algorithm is less sensitive to the presence of outliers than the hierarchical clustering algorithm. III. The K-means clustering algorithm requires random assignments while the hierarchical clustering algorithm does not.

III only

You are given the following statements on the bias-variance tradeoff: I. Bias refers to the error arising from the method's sensitivity towards the training data set. II. Variance refers to the error arising from the assumptions made in the statistical learning tool. III. The variance of a statistical learning method increases as the method's flexibility increases. Determine which of the statements is/are true.

III only

For any statistical learning method, which of the following increases monotonically as flexibility increases? I. Training MSE II. Test MSE III. Squared Bias IV. Variance

IV.

How do we know if a variable's distribution is similar to the theoretical distribution using a qq plot?

If a majority of the points follow the superimposed line, we can conclude that the variable's distribution is similar to the theoretical distribution

What types of charts are used to identify stationarity?

control charts

Training MSE is consistently _________ than the test MSE at every level of flexibility

less

A smoother fit means a ______________ f hat

less flexible

What are branches?

lines that connect any two nodes

Low flexibility -> _________ variance -> _________ squared bias

low variance, high squared bias

Choosing a more complex form helps to ____________ the bias

lower

The mean of Y is supplied by __________ and everything else about the distribution of Y is supplied by __________

mean of Y -> f(x1, x2, ... , xp); everything else about Y -> error term

How many distinct principal components does a dataset have?

min(n-1, p)

Generally a _________ flexible method produces the lowest test MSE

moderately

Non parametric methods are considered to be _______ flexible than parametric methods

more

A rougher fit means a _______________ f hat

more flexible

Y is "__________"

signal plus noise

What is a standardized variable?

(variable - sample mean)/sample standard deviation
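
A minimal sketch of the formula above, using only the Python standard library. The function name `standardize` and the data are made up for illustration; the sample standard deviation is assumed to use the n - 1 divisor.

```python
import statistics

def standardize(values):
    """Return (value - sample mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(v - mean) / sd for v in values]

heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical observations
z = standardize(heights)
```

After standardizing, the variable has sample mean 0 and sample standard deviation 1, which is why it is common to do this before clustering or shrinkage methods.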

How do we find the bayes error rate?

1 - sum over x of max over c of P(X = x, Y = c), where x = (x1, ... , xp). 1. For each x value, find the largest joint probability across the categories c. 2. Sum those maximum probabilities over all x values. 3. Subtract the result of step 2 from 1.
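
The three steps above can be sketched on a small joint probability table. The table values, the category names, and the function names are all invented for illustration.

```python
import math

# Hypothetical joint probabilities P(X = x, Y = c); they sum to 1.
joint = {
    1: {"A": 0.30, "B": 0.10},
    2: {"A": 0.05, "B": 0.25},
    3: {"A": 0.15, "B": 0.15},
}

def bayes_classifier(joint, x):
    """Category with the largest probability at x.

    Maximizing the joint P(X = x, Y = c) over c picks the same category
    as maximizing the conditional P(Y = c | X = x).
    """
    return max(joint[x], key=joint[x].get)

def bayes_error_rate(joint):
    """1 minus the sum, over x, of the largest joint probability at that x."""
    return 1 - sum(max(probs.values()) for probs in joint.values())

error = bayes_error_rate(joint)  # 1 - (0.30 + 0.25 + 0.15)
```

Here the maxima are 0.30, 0.25, and 0.15, so the Bayes error rate is 1 - 0.70 = 0.30.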

What types of parameters does the result of clustering depend on?

1. Choice of k in k-means clustering. 2. Choice of number of clusters, linkage, and dissimilarity measure in hierarchical clustering. 3. Choice to standardize variables

For each linkage method below, identify the inter-cluster dissimilarity used: 1. Complete 2. Single 3. Average 4. Centroid

1. Complete -> largest dissimilarity 2. Single -> smallest dissimilarity 3. Average -> arithmetic mean 4. Centroid -> dissimilarity between the cluster centroids
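
As a rough illustration, the four linkages can be computed for two small clusters. Euclidean distance is assumed as the dissimilarity measure, and the points and function names are made up.

```python
import math
from itertools import product

def pairwise(a, b):
    """All dissimilarities between a member of cluster a and a member of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def complete(a, b):
    return max(pairwise(a, b))          # largest pairwise dissimilarity

def single(a, b):
    return min(pairwise(a, b))          # smallest pairwise dissimilarity

def average(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)              # arithmetic mean of pairwise dissimilarities

def centroid(a, b):
    ca = [sum(coord) / len(a) for coord in zip(*a)]
    cb = [sum(coord) / len(b) for coord in zip(*b)]
    return math.dist(ca, cb)            # dissimilarity between the two centroids

cluster_a = [(0.0, 0.0), (0.0, 2.0)]    # hypothetical clusters
cluster_b = [(3.0, 0.0), (5.0, 0.0)]
```

For these points the single linkage is 3, the complete linkage is sqrt(29), and the centroid linkage is the distance between (0, 1) and (4, 0), i.e. sqrt(17).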

What are the steps to creating a decision tree?

1. Construct a large tree with 𝑔 terminal nodes using recursive binary splitting. 2. Obtain a sequence of best subtrees, as a function of 𝜆, using cost complexity pruning. 3. Choose 𝜆 by applying 𝑘-fold cross validation. Select the 𝜆 that results in the lowest cross-validation error. 4. The best subtree is the subtree created in step 2 with the selected 𝜆 value.

What are the steps for bagging?

1. Create 𝑏 bootstrap samples from the original training dataset. 2. Construct a decision tree for each bootstrap sample using recursive binary splitting. 3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all 𝑏 trees.
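
The bootstrap-and-aggregate part of these steps can be sketched as follows. To keep the example short, the base learner `fit_nearest` is a hypothetical stand-in (it predicts the response of the closest training x) rather than a decision tree grown by recursive binary splitting; all names and data are assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_prediction(data, x_new, b, fit, classification=False, seed=0):
    rng = random.Random(seed)
    predictions = []
    for _ in range(b):
        model = fit(bootstrap_sample(data, rng))  # steps 1 and 2: resample, then fit
        predictions.append(model(x_new))
    if classification:
        # step 3 (classification): most frequent category across the b models
        return Counter(predictions).most_common(1)[0][0]
    # step 3 (regression): average of the b predictions
    return sum(predictions) / len(predictions)

def fit_nearest(sample):
    """Stand-in base learner (NOT a tree): response of the closest x in the sample."""
    return lambda x: min(sample, key=lambda obs: abs(obs[0] - x))[1]

labelled = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
            (10.0, "high"), (11.0, "high"), (12.0, "high")]
numeric = [(float(x), 2.0 * x) for x in range(1, 7)]

vote = bagged_prediction(labelled, 1.5, 25, fit_nearest, classification=True)
mean_pred = bagged_prediction(numeric, 3.2, 25, fit_nearest)
```

Swapping `fit_nearest` for a tree-fitting function would give bagging as described in the card; the resampling and aggregation logic stays the same.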

What are the steps to create random forests?

1. Create 𝑏 bootstrap samples from the original training dataset. 2. Construct a decision tree for each bootstrap sample using recursive binary splitting. At each split, a random subset of 𝑘 variables are considered. 3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all 𝑏 trees.

Give two examples of unit root tests

1. Dickey-Fuller test 2. Augmented Dickey-Fuller test

What are 4 advantages of trees?

1. Easy to interpret and explain 2. Can be presented visually 3. Manage categorical variables without the need of dummy variables 4. Mimic human decision-making

What are the steps for forward stepwise selection?

1. Fit all g simple linear regression models. The model with the largest R^2 is M1. 2. For p = 2, ... , g, fit the models that add one of the remaining predictors to Mp-1. The model with the largest R^2 is Mp. 3. Choose the best model among M0, ... , Mg using a selection criterion of choice.

What are the steps for the backward stepwise selection?

1. Fit the model with all g predictors, Mg. 2. For p = g - 1, ... ,1, fit the models that drop one of the predictors from Mp+1. The model with the largest R^2 is Mp. 3. Choose the best model among M0, ... ,Mg using a selection criterion of choice.

What are the steps for the best subset selection?

1. For p = 0, 1, ... , g, fit all gCp models with p predictors. The model with the largest R^2 is Mp. 2. Choose the best model among M0, ... , Mg using a selection criterion of choice.
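
Counting the number of fitted models makes the computational difference between the selection procedures above concrete. This is a small arithmetic sketch; the function names are made up.

```python
from math import comb

def best_subset_fits(g):
    """Models fitted by best subset selection: C(g, p) for each p = 0, ..., g."""
    return sum(comb(g, p) for p in range(g + 1))  # equals 2 ** g

def stepwise_fits(g):
    """Models fitted by forward (or backward) stepwise selection.

    One null (or full) model, then g candidates at the first step,
    g - 1 at the next, and so on down to 1.
    """
    return 1 + sum(g - p + 1 for p in range(1, g + 1))  # equals 1 + g(g + 1)/2

counts = {g: (best_subset_fits(g), stepwise_fits(g)) for g in (3, 10, 20)}
```

With g = 10 predictors, best subset selection fits 1,024 models while a stepwise procedure fits 56.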

How do we use the k-nearest neighbors method?

1. Identify the "center of the neighborhood" 2. Starting from the "center of the neighborhood," identify the k nearest training observations 3. The category prediction, y hat, is the most frequent category among the k training observations

What are the steps for k-Nearest Neighbors (KNN)?

1. Identify the "center of the neighborhood", i.e. the location of an observation with inputs x1, ... ,xp. 2. Starting from the "center of the neighborhood", identify the k nearest training observations. 3. For classification, y hat is the most frequent category among the k observations; for regression, y hat is the average of the response among the k observations. k is inversely related to flexibility
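
A minimal sketch of these steps, assuming Euclidean distance and training data stored as (inputs, response) pairs. The function names and the data are invented for illustration.

```python
import math
from collections import Counter

def k_nearest(train, x_new, k):
    """Steps 1 and 2: the k training observations closest to x_new."""
    return sorted(train, key=lambda obs: math.dist(obs[0], x_new))[:k]

def knn_classify(train, x_new, k):
    """Step 3 (classification): most frequent category among the k neighbors."""
    labels = [y for _, y in k_nearest(train, x_new, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, x_new, k):
    """Step 3 (regression): average response among the k neighbors."""
    ys = [y for _, y in k_nearest(train, x_new, k)]
    return sum(ys) / len(ys)

train_cls = [((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
             ((6.0, 6.0), "blue"), ((7.0, 5.5), "blue")]
train_reg = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
```

Setting k equal to the number of training observations always returns the overall majority class, which is the least flexible fit; k = 1 is the most flexible.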

There is no universal approach to handling multicollinearity. Name two approaches.

1. It is possible to accept it when there is a suppressor variable. 2. It can be eliminated by using a set of orthogonal predictors

What are the various violations and issues for the linear model? Explain each.

1. Misspecified model equation: refers to incorrectly assuming the true form of f and to failing to include appropriate predictors. 2. Residuals with non-zero averages: an average residual that is far from 0 suggests that some aspect of the linear regression is incorrect. 3. Heteroscedasticity: the variance of the error term is not constant across observations. This leads to an unreliable MSE because it views several parameters as one parameter. 4. Dependent errors: dependent or correlated error terms will behave predictably from observation to observation. 5. Non-normal errors: when we have non-normal errors we likely cannot conclude that certain estimates follow a t-distribution or an F-distribution. 6. Multicollinearity or collinearity: interpretations of bj's become less reliable. With perfect multicollinearity we may fail to distinguish which predictors are truly meaningful to a model. 7. Outliers and high leverage points: extreme residuals and unusual sets of predictor values. 8. High dimensions: high dimensional data (large p) causes overfitting to be very likely.

What are 2 disadvantages of trees?

1. Not robust 2. Do not have the same degree of predictive accuracy as other statistical methods

What are the steps of k-Means Clustering?

1. Randomly assign a cluster to each observation. This serves as the initial cluster assignments. 2. Calculate the centroid of each cluster. 3. For each observation, identify the closest centroid and reassign to that cluster. 4. Repeat steps 2 and 3 until the cluster assignments stop changing.
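
The four steps can be sketched in plain Python, assuming Euclidean distance. One deliberate assumption: the random initial assignment deals shuffled observations round-robin so that no cluster starts empty. Names and data are illustrative.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def k_means(points, k, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial assignment (round-robin over a shuffled order,
    # an assumption made here so every cluster has at least one member).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, idx in enumerate(order):
        labels[idx] = position % k
    centroids = [None] * k
    while True:
        # Step 2: centroid of each cluster (an emptied cluster keeps its old centroid).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
        # Step 3: reassign each observation to its closest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            return labels, centroids
        labels = new_labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # two obvious groups
labels, centroids = k_means(points, k=2, seed=1)
```

Because the result depends on the random start, in practice the algorithm is run several times and the run with the smallest total within-cluster variation is kept.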

What are the steps for k-fold cross-validation?

1. Randomly divide all available observations into k folds. 2. For v = 1, ... ,k, obtain the vth fit by training with all observations except those in the vth fold. 3. For v = 1, ... , k , use y hat from the vth fit to calculate a test MSE estimate with observations in the vth fold. 4. To calculate CV error, average the k test MSE estimates in the previous step.
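
A sketch of these steps using simple linear regression as the fitted model. The choice of model, the function names, and the data are assumptions made for illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares simple linear regression; returns a prediction function."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def k_fold_cv_error(xs, ys, k, fit=fit_line, seed=0):
    # Step 1: randomly divide the observations into k folds.
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[v::k] for v in range(k)]
    fold_mses = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Step 2: fit using every observation except those in this fold.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        # Step 3: test MSE estimate on the held-out fold.
        fold_mses.append(sum((ys[i] - model(xs[i])) ** 2 for i in fold) / len(fold))
    # Step 4: the CV error is the average of the k estimates.
    return sum(fold_mses) / k

xs = [float(x) for x in range(10)]
exact_line = [2 * x + 1 for x in xs]   # perfectly linear response
curved = [x ** 2 for x in xs]          # a line cannot fit this exactly

cv_exact = k_fold_cv_error(xs, exact_line, k=5)
cv_curved = k_fold_cv_error(xs, curved, k=5)
loocv = k_fold_cv_error(xs, curved, k=len(xs))  # LOOCV is the case k = n
```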

What are the steps for hierarchical clustering?

1. Select the dissimilarity measure and linkage to be used. Treat each observation as its own cluster. 2. For 𝑘 = 𝑛, 𝑛 − 1, ... , 2: • Compute the inter-cluster dissimilarity between all 𝑘 clusters. • Examine all 𝑘(𝑘 − 1)/2 pairwise dissimilarities. The two clusters with the lowest inter-cluster dissimilarity are fused. The dissimilarity indicates the height in the dendrogram at which these two clusters join.
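
The fusing loop can be sketched as below, assuming complete linkage and absolute difference as the dissimilarity on one-dimensional data. Function names and data are illustrative, not a reference implementation.

```python
from itertools import combinations

def complete_linkage(a, b, dist):
    """Largest pairwise dissimilarity between members of clusters a and b."""
    return max(dist(p, q) for p in a for q in b)

def agglomerate(points, dist, linkage=complete_linkage):
    """Fuse clusters until one remains.

    Returns the clusters at every level (keyed by number of clusters)
    and the dendrogram heights of the successive fusions.
    """
    clusters = [[p] for p in points]  # each observation starts as its own cluster
    levels = {len(clusters): [list(c) for c in clusters]}
    heights = []
    while len(clusters) > 1:
        # Examine every pair of current clusters and pick the least dissimilar.
        pairs = combinations(range(len(clusters)), 2)
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist))
        heights.append(linkage(clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        levels[len(clusters)] = [list(c) for c in clusters]
    return levels, heights

points = [0, 1, 5, 6, 20]
levels, heights = agglomerate(points, dist=lambda p, q: abs(p - q))
```

A single run records every level, so `levels[2]` or `levels[3]` can be read off without rerunning the algorithm; this mirrors cutting the dendrogram at different heights.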

Simon uses a statistical learning method to estimate the number of ears of corn produced per acre of land. He has multiple training data sets and applies the same statistical method to all of them. The results are not identical, but very similar, between the training data sets. Which of the following best describes the statistical learning method? A. The method has low variance B. The method has high variance C. The method has high training error D. The method has low training error E. The method has low bias

A

What is a stump?

A decision tree with only one internal node

What can a double smoothing procedure be used for?

A double smoothing procedure can be used to forecast time series data with a linear trend.

What is a non-parametric method?

A functional form for f is not specified.

Which is better, test MSE or training MSE?

Test MSE is the better measure: a lower test MSE means more accurate predictions on unseen data

When is a random walk model a good fit?

A random walk model is a good fit if the time series possesses a unit root

What is a unit root test?

A unit root test is used to evaluate the fit of a random walk model.

Determine which of the following statements about prediction is true. (A) Each of several candidate regression models must produce the same prediction. (B) When making predictions, it is assumed that the new observation follows the same model as the one used in the sample. (C) A point prediction is more reliable than an interval prediction. (D) A wider prediction interval is more informative than a narrower prediction interval. (E) A prediction interval should not contain the single point prediction.

B

How is bagging related to random forests?

Bagging is a special case of random forests in which all p predictors are considered at each split (k = p).

What is inference?

Comprehension of f and the influence that the x's have on the y

Determine which of the following is not a criticism of the validation set approach or out-of-sample validation procedure in selecting the best model. A The results of the validation set approach are not robust. B Statisticians do not agree on the choice of the subset sizes. C The validation set error tends to overestimate the test error. D The validation set error is likely to have substantial variance as an estimator of the test error. E It takes a considerable amount of time to calculate the validation set error for each of the candidate models.

D

Determine which of the following statements is NOT true about clustering methods. (A) Clustering is used to discover structure within a data set. (B) Clustering is used to find homogeneous subgroups among the observations within a data set. (C) Clustering is an unsupervised learning method. (D) Clustering is used to reduce the dimensionality of a dataset while retaining explanation for a good fraction of the variance. (E) In K-means clustering, it is necessary to pre-specify the number of clusters.

D

What is training data?

Data that is used to train a learning method to obtain the optimal f hat

What part of the linear regression line do dummy variables affect?

Dummy variables define a distinct intercept for each class. Without the interaction between a dummy variable and a predictor, the dummy variable cannot additionally affect that predictor's regression coefficient.

Determine which of the following statements is true (A) Linear regression is a flexible approach (B) Lasso is more flexible than a linear regression approach (C) Bagging is a low flexibility approach (D) There are methods that have high flexibility and are also easy to interpret (E) None of (A), (B), (C), or (D) are true

E

In wrongly assuming the multiple linear regression error terms are independent, which of the following issues is/are likely to occur? I. The model suffers from severe multicollinearity. II. The estimated standard errors are smaller than they should be. III. Most of the residuals stay close to 0.

II

Determine which of the following statements is/are true. I. The leverage for each observation in a linear model must be between 1/n and 1. II. The n leverages in a linear model must sum to the number of explanatory variables. III. If an explanatory variable is uncorrelated with all other explanatory variables, the corresponding variance inflation factor would be zero.

I

Determine which of the following statements is/are drawbacks of using forward selection compared to backward selection in model selection. I. Forward selection may miss variables that are jointly important. II. Forward selection is not guaranteed to find the best possible model. III. Forward selection ignores the presence of outliers and high leverage points.

I only

From an investigation of the residuals of fitting a linear regression by ordinary least squares it is clear that the spread of the residuals increases as the predicted values increase. Observed values of the dependent variable range from 0 to 100. Determine which of the following statements is/are true with regard to transforming the dependent variable to make the variance of the residuals more constant. I. Taking the logarithm of one plus the value of the dependent variable may make the variance of the residuals more constant. II. A square root transformation may make the variance of the residuals more constant. III. A logit transformation may make the variance of the residuals more constant.

I, II

Determine which of the following indicates that a nonstationary time series can be represented as a random walk I. A control chart of the series detects a linear trend in time and increasing variability. II. The differenced series follows a white noise model. III. The standard deviation of the original series is greater than the standard deviation of the differenced series

I, II, III

Determine which of the following is/are arguments for preferring simpler models when selecting variables. I. A simpler model is easier to explain and interpret. II. Parsimonious models often perform well on fitting out-of-sample data. III. Extraneous variables can cause severe multicollinearity, leading to difficulty in interpreting individual coefficients.

I, II, III

Determine which of the following statements describe the advantages of using an alternative fitting procedure, such as subset selection and shrinkage, instead of least squares. I. Doing so will likely result in a simpler model II. Doing so will likely improve prediction accuracy III. The results are likely to be easier to interpret

I, II, III

Determine which of the following statements is/are true about Pearson residuals. I. They can be used to calculate a goodness-of-fit statistic. II. They can be used to detect if additional variables of interest can be used to improve the model specification. III. They can be used to identify unusual observations

I, II, III

You are given a set of n observations, each with p features. Determine which of the following statements is/are true with respect to clustering methods. I. The n observations can be clustered on the basis of the p features to identify subgroups among the observations. II. The p features can be clustered on the basis of the n observations to identify subgroups among the features. III. Clustering is an unsupervised learning method and is often performed as part of an exploratory data analysis.

I, II, III

Consider the following statements: I. Principal Component Analysis (PCA) provide low-dimensional linear surfaces that are closest to the observations. II. The first principal component is the line in p-dimensional space that is closest to the observations. III. PCA finds a low dimension representation of a dataset that contains as much variation as possible. IV. PCA serves as a tool for data visualization. Determine which of the statements are correct.

I, II, III, IV

Determine which of the following statements about clustering is/are true. I. Cutting a dendrogram at a lower height will not decrease the number of clusters. II. K-means clustering requires plotting the data before determining the number of clusters. III. For a given number of clusters, hierarchical clustering can sometimes yield less accurate results than K-means clustering.

I, III

Determine which of the following statements concerning decision tree pruning is/are true. I. The recursive binary splitting method can lead to overfitting the data. II. A tree with more splits tends to have lower variance. III. When using the cost complexity pruning method, α = 0 results in a very large tree

I, III

Determine which of the following statements is/are true about clustering methods: I. If K is held constant, K-means clustering will always produce the same cluster assignments. II. Given a linkage and a dissimilarity measure, hierarchical clustering will always produce the same cluster assignments for a specific number of clusters. III. Given identical data sets, cutting a dendrogram to obtain five clusters produces the same cluster assignments as K-means clustering with K = 5.

II

What is 𝜆 used for in boosting?

It controls the rate at which boosting learns

What is flexibility?

It describes how closely f hat is able to follow the data

What is the first partial least squares direction, z1?

It is a linear combination of standardized predictors x1, ... ,xp, with coefficients based on the relation between xj and y.

How do we calculate every subsequent partial least squares direction?

It is calculated iteratively as a linear combination of "updated predictors" which are the residuals of fits with the "previous predictors" explained by the previous direction.

What is interpretability?

It is f hat's ability to be understood

When is smoothing appropriate?

It is only appropriate for time series data without a linear trend.

For k-means clustering, does the algorithm need to be repeated?

It needs to be repeated for each k

Rank LOOCV, k-fold CV, and validation set from least to greatest with respect to bias.

LOOCV < k-fold CV < validation set

We have the following methods: - Classification trees - Boosting - Subset selection - Bagging - Least squares - Lasso - Regression trees Arrange those in the following three categories: 1. Less flexible/More interpretable 2. Moderately flexible/Interpretable 3. More flexible/Less interpretable

Less flexible/More interpretable: - Lasso - Subset selection Moderately flexible/Interpretable: - Least squares - Regression trees - Classification trees More flexible/Less interpretable: - Bagging - Boosting

What are the steps for boosting?

Let 𝑧1 be the actual response variable, 𝑦. 1. For 𝑘 = 1, 2, ... , 𝑏: • Use recursive binary splitting to fit a tree with 𝑑 splits to the data with 𝑧k as the response. • Update 𝑧k by subtracting 𝜆 ⋅ 𝑓k(𝐱) hat, i.e. let 𝑧_k+1 = 𝑧k − 𝜆 ⋅ 𝑓k(𝐱) hat. 2. Calculate the boosted model prediction as 𝑓(𝐱) hat = ∑ from k = 1 to b 𝜆 ⋅ 𝑓k(𝐱) hat.
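
A compact sketch of the boosting loop on one predictor, assuming each tree is a stump (d = 1 split) chosen to minimize the residual sum of squares. The names `fit_stump`, `boost`, and `training_mse`, and the data, are hypothetical.

```python
def fit_stump(xs, zs):
    """Best single split on x by residual sum of squares; returns a predictor."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left = [z for x, z in zip(xs, zs) if x < cut]
        right = [z for x, z in zip(xs, zs) if x >= cut]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        rss = (sum((z - mean_l) ** 2 for z in left)
               + sum((z - mean_r) ** 2 for z in right))
        if best is None or rss < best[0]:
            best = (rss, cut, mean_l, mean_r)
    _, cut, mean_l, mean_r = best
    return lambda x: mean_l if x < cut else mean_r

def boost(xs, ys, b, lam):
    z = list(ys)                      # z1 is the actual response
    trees = []
    for _ in range(b):
        tree = fit_stump(xs, z)       # fit a tree to the current residuals
        trees.append(tree)
        z = [zi - lam * tree(x) for x, zi in zip(xs, z)]  # shrink and update
    # Boosted prediction: sum of the shrunken tree predictions.
    return lambda x: sum(lam * t(x) for t in trees)

def training_mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [float(x) for x in range(1, 9)]
ys = [x ** 2 for x in xs]             # made-up nonlinear response
mse_one_tree = training_mse(boost(xs, ys, b=1, lam=0.1), xs, ys)
mse_many_trees = training_mse(boost(xs, ys, b=200, lam=0.1), xs, ys)
```

Training error keeps falling as b grows, which is why b is usually chosen by cross-validation rather than by training fit.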

How do we find the training MSE?

MSE calculated from training data

What is Cook's Distance?

Measure that combines leverage and residuals

What is leverage?

Leverage measures an observation's influence in predicting the response

You are given the following three statistical learning tools: I. Boosting II. K-Nearest Neighbors (KNN) III. Regression Tree Determine which of the above are examples of unsupervised learning.

None are examples of unsupervised learning

What is the difference between the training set and validation set?

Only the observations in the training set are used to attain the fitted model, and those in the validation set are used to estimate the test MSE

What are internal nodes?

Points along the tree where splits occur

Do polynomials have a constant slope?

A polynomial does not change by a constant amount per unit increase in its variable, i.e. it has no constant slope.

What is a parametric method?

Specifies a functional form for f that includes free parameters. Data is used to estimate these parameters.

What is stationarity in a time series?

Stationarity describes how something does not vary with respect to time.

(T/F) Is smoothing related to weighted least squares?

T

(T/F) The number of directions, g, is a measure of flexibility

T

For hierarchical clustering, how many times does the algorithm need to be performed?

The algorithm only needs to be performed once for any number of clusters

What is a downside of a parametric method?

The chosen form may be significantly unlike the true f

What do terminal nodes or leaves represent?

The partitions of the predictor space

How are the cross validation approaches better than the validation set approach?

The validation set approach has unstable results and will tend to overestimate the test MSE. The two other approaches mitigate these issues.

Is the variance explained by each subsequent principal component always less than or greater than the variance explained by the previous principal component?

The variance explained by each subsequent principal component is always less than the variance explained by the previous principal component.

What do we use studentized residuals for?

They are realizations of the t-distribution and can help identify outliers.

What is a downside of a non-parametric method?

This method is inadequate w/o an abundance of observations in the dataset

What is the bayes classifier?

To find the bayes classifier we find the category that maximizes the conditional probability: P(Y = c | X1 = x1, ... , Xp = xp)

What is a qq plot used for?

Used to identify whether a variable is distributed similarly to a certain theoretical distribution

How do we calculate the leave-one-out cross validation (LOOCV) error?

We calculate LOOCV error as a special case of k-fold cross-validation where k = n

How do we get the lowest MSE (in terms of variance and bias)?

We want both low variance and low squared bias

What is prediction?

We want the output of f hat, i.e. y hat, as an estimate of y

What is overfitting?

When f hat fits the training data so well that it does not properly capture f (the true relationship)

Holt-Winter double exponential smoothing is a generalization of the ____________________

double exponential smoothing

What is the f in a classification problem?

f is a decision function that classifies an observation as one category of the response variable

High flexibility -> __________ variance -> ____________ squared bias

high variance, low squared bias

For parametric methods, more free parameters mean __________ flexibility

higher

Higher flexibility -> ___________ variance

higher

Lower MSE -> ______________ accuracy

higher

In a scatterplot matrix, the variable name listed in the column corresponds to the __________ axis

horizontal

Training MSE decreases as flexibility _____________

increases

What is an explanatory variable?

input/independent variable, x

What type of nodes have child nodes?

internal nodes, not terminal nodes

Flexibility is _____________ related to interpretability

inversely

In ridge and lasso regression, lambda is _______ related to flexibility

inversely

k is __________ related to flexibility

inversely

Random Forests: What happens to the correlation between predictions if we decrease k?

it decreases

R^2 is a poor measure for model comparison because...

it will increase simply by adding more predictors to a model

Bagging: Does increasing b cause overfitting?

no

Random Forests: Does increasing b cause overfitting?

no

What type of method is the k-nearest neighbors?

non-parametric method

In ridge and lasso regression, with a finite lambda, ________ of the ridge estimates will _________ but the lasso estimates could ________

none of the ridge estimates will = 0 but the lasso estimates could = 0

What is a response variable?

output/dependent variable, y

Logistic regression is typically used for _____________ responses

qualitative

What are ordinal variables?

qualitative variable w/ a logical order

What are nominal variables?

qualitative variable w/ no logical order

What are categorical variables?

qualitative; class/level

Regression problems involve a ________________ response variable

quantitative

What are count variables?

quantitative; discrete; integers

What are continuous variables?

quantitative; intervals

Bagging: How does it affect variance?

reduces variance

In ridge and lasso regression, x1, ... ,xp are....

scaled predictors

How do we approximate the MSE?

sum of (yi - y hat i)^2 divided by n
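
As a quick worked example of this formula (the function name and numbers are made up):

```python
import math

def mse(actual, predicted):
    """Sum of squared prediction errors divided by the number of observations."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

y = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]
value = mse(y, y_hat)  # squared errors are 1, 0, 4
```

The squared errors are 1, 0, and 4, so the MSE is 5/3. The same function gives the training MSE or a test MSE estimate depending on which observations are passed in.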

Y = ____________ + ____________

systematic + random; f(x1, x2, ... , xp) + e

How are the directions z1, ... ,zg used in a multiple linear regression?

they are used as predictors

High flexibility improves predictions on the ______________ data

training

When residuals have a larger spread for larger predictions, one solution is to...

transform the response variable with a concave function

Test MSE is _________ shaped

u-shaped (for flexibility on x-axis and MSE on y-axis)

Rank LOOCV, k-fold CV, and validation set from least to greatest with respect to variance

validation set < k-fold CV < LOOCV

What is a centered variable?

variable - sample mean

What is a scaled variable?

variable/sample standard deviation

What is homoscedasticity?

variance is constant for all observations

In a scatterplot matrix, the variable name listed in the row corresponds to the __________ axis

vertical

How many dummy variables are needed to represent w classes of a categorical predictor?

w - 1 dummy variables where one of the classes acts as a baseline
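
A small sketch of this encoding, with a hypothetical helper `dummy_encode` and made-up class names. The baseline class is the one whose rows are all zeros.

```python
def dummy_encode(values, baseline):
    """Encode a categorical predictor with w classes as w - 1 dummy columns."""
    levels = sorted(set(values) - {baseline})  # every class except the baseline
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

regions = ["north", "south", "east", "north"]  # w = 3 classes
levels, rows = dummy_encode(regions, baseline="east")
```

With three classes only two columns are produced; an observation in the baseline class is identified by zeros in both.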

What is supervised learning?

we analyze the response variable through the explanatory variables (monitored by y)

What is unsupervised learning?

we analyze the variables without a response variable (not monitored by y)

Are all principal components uncorrelated with one another?

yes

Boosting: Can increasing b cause overfitting?

yes

Does boosting reduce bias?

yes

Is out-of-bag error a valid estimate of test error?

yes

