Statistical Learning Midterm 1


What does 95% confidence mean?

In the long run, if we drew repeated samples, 95% of the confidence intervals we construct using this procedure would contain the true unknown value of the parameter.
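The repeated-sampling reading can be checked with a small simulation. This is a rough sketch under assumptions not in the original card: normally distributed data, a known standard deviation, and hypothetical values for `mu`, `sigma`, `n`, and `reps`.

```python
import math
import random

def coverage_rate(mu=10.0, sigma=2.0, n=30, reps=4000, seed=0):
    """Fraction of simulated 95% intervals that contain the true mean mu."""
    rng = random.Random(seed)
    # Known-sigma normal interval: sample mean +/- 1.96 * sigma / sqrt(n).
    half_width = 1.96 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / reps

rate = coverage_rate()
print(f"empirical coverage: {rate:.3f}")
```

The printed fraction should land close to 0.95. Any single interval either contains the true mean or it does not.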

Interpreting coefficients in logistic regression

Increasing X1 by one unit changes the log odds by β1, or equivalently multiplies the odds by e^(β1). For example, a one-unit increase in credit card balance is associated with an increase in the log odds of defaulting on credit card payments by 0.0055 units.
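As a quick illustration, the coefficient 0.0055 from the example can be converted from the log-odds scale to an odds multiplier. The function name and the 100-unit change are hypothetical, chosen only for this sketch.

```python
import math

# Coefficient from the credit card example (log-odds scale).
beta1 = 0.0055

def odds_multiplier(beta, delta_x=1.0):
    """Factor by which the odds are multiplied when X increases by delta_x."""
    return math.exp(beta * delta_x)

one_unit = odds_multiplier(beta1)             # one-unit increase in balance
hundred_units = odds_multiplier(beta1, 100)   # hypothetical 100-unit increase
print(round(one_unit, 4), round(hundred_units, 3))
```

A one-unit increase multiplies the odds by only about 1.0055, but the effect compounds: the multiplier for a 100-unit increase is e^(0.55), not 100 times the one-unit effect.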

Least squares line

Obtained by fitting a line to the sample using the method of least squares. The least squares line produces an unbiased estimator of the population regression line.

Prediction intervals vs. confidence intervals

Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for f(X) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).

Curse of dimensionality

Spreading observations over many dimensions results in a phenomenon in which a given observation has no nearby neighbors. That is, the K observations that are nearest to a given test observation x0 may be very far away from x0 in p-dimensional space when p is large, leading to a very poor prediction of f(x0) and hence a poor KNN fit.

Log odds

log(p/(1-p))
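A minimal sketch of the odds and the log odds as functions of a probability p. The function names are illustrative and not part of the original notes.

```python
import math

def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds, also called the logit (requires 0 < p < 1)."""
    return math.log(odds(p))

for p in (0.2, 0.5, 0.8):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
```

A probability of 0.5 gives odds of 1 and log odds of 0. Probabilities p and 1-p give log odds of equal size and opposite sign.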

Prior Probability

■ Let π_k represent the overall or prior probability that a randomly chosen observation comes from the kth class. This is the (prior) probability that a given observation is associated with the kth category of the response variable Y. ■ For example, you have to predict whether the fruit is an apple or a banana. If I tell you the fruit is from the west coast, there is a higher prior probability that the fruit is an apple. This prior probability allows us to build a better model.

Inference vs. Prediction: Interpretability vs. Accuracy

■ Linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. ■ Highly non-linear models can potentially provide quite accurate predictions for Y, but this comes at the expense of a less interpretable model for which inference is more challenging. ■ The more complicated the model, the harder it is to interpret inference.

Why can't we use linear regression for qualitative variables with more than two levels?

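■ Coding a qualitative response as numbers (e.g., 1, 2, 3) makes linear regression treat the levels as ordered (from low to high) and equally spaced. ■ Qualitative variables with more than two levels generally do not have a natural order, so the fitted model would depend on an arbitrary coding.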

LDA v. Logistic Regression

■ Logistic regression and LDA are similar in that they both produce linear decision boundaries. ■ The difference is that logistic regression uses the maximum likelihood function to estimate coefficients, whereas LDA uses the estimated mean and variance from a normal distribution. ■ When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem. ■ If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model. ■ Linear discriminant analysis is popular when we have more than two response classes ■ Logistic regression performs better than LDA when observations are not drawn from a Gaussian distribution with a common covariance matrix in each class.

t-statistic

■ Measures the number of standard deviations that β1-hat is away from 0. ■ We use the t-statistic because the true standard deviation of the errors is unknown and must be estimated from the data. ■ Used for hypothesis testing in simple linear regression. t = (β1-hat - 0)/SE(β1-hat)
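The formula can be traced end to end on a toy data set. This is a sketch only: the data are made up, and SE(β1-hat) is computed with the usual simple-regression formula RSE/√(Σ(xi - x-bar)^2), which the card itself does not state.

```python
import math

def slope_t_statistic(x, y):
    """Return (beta1_hat, SE(beta1_hat), t) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = sxy / sxx                      # least squares slope
    beta0 = y_bar - beta1 * x_bar          # least squares intercept
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))         # estimate of sigma
    se_beta1 = rse / math.sqrt(sxx)
    return beta1, se_beta1, (beta1 - 0.0) / se_beta1

# Made-up data, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta1, se_beta1, t = slope_t_statistic(x, y)
print(round(beta1, 3), round(se_beta1, 4), round(t, 1))
```

Here the slope is many standard errors away from zero, so the t-statistic is large and the null hypothesis β1 = 0 would be rejected.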

Logistic regression

■ When the outcome variable in a regression is binary, we should use logistic instead of linear regression. ■ If we use linear regression for a binary variable, we sometimes get negative probabilities or probabilities higher than 1, neither of which makes sense ■ Logistic regression solves this problem by bounding our probability between zero and 1.

Collinearity

■ When two predictor variables are closely related to one another. ■ Collinearity limits your ability to separate out the individual effects of collinear variables on the response. ■ Inflates the standard errors of the coefficient estimates, which lowers the t-statistic and reduces the power of the hypothesis test. ■ Many different pairs of coefficient values can yield a similar RSS. ■ To detect collinearity, look at the correlation matrix of the predictors. ■ An element of the matrix that is large in absolute value indicates a pair of highly correlated variables.

Pros of parametric models

● Simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters in the linear model than it is to fit an entirely arbitrary function f. ● More interpretable ● Doesn't need as large a sample size

How to estimate f?

1. Linear models 2. Non-linear models 3. Parametric models 4. Non-parametric models

Logistic regression steps

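1. Model the probability using the logistic function, p(X) = e^(β0+β1X)/(1+e^(β0+β1X)), which ensures our probability is within 0 and 1. 2. Rearrange to get the odds: p(X)/(1-p(X)) = e^(β0+β1X). 3. Log the odds to get the logit, log(p(X)/(1-p(X))) = β0+β1X, which is linear in X. 4. Fit the model (estimate β0 and β1) using maximum likelihood.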
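A short sketch of the logistic function from step 1. The slope 0.0055 is taken from the credit card example elsewhere in these notes; the intercept and the balance values are hypothetical.

```python
import math

def logistic(x, beta0, beta1):
    """Probability p(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-z))

beta0 = -10.65   # hypothetical intercept, for illustration only
beta1 = 0.0055   # slope from the credit card example

p_low = logistic(1000.0, beta0, beta1)
p_high = logistic(2000.0, beta0, beta1)
print(round(p_low, 4), round(p_high, 4))

# Taking the log of the odds recovers the linear predictor b0 + b1*x.
recovered = math.log(p_high / (1.0 - p_high))
print(round(recovered, 4))
```

Both probabilities stay strictly between 0 and 1, and the recovered log odds is linear in x. That is why the coefficients are interpreted on the log-odds scale.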

Three sorts of uncertainty associated with prediction

1. Reducible error: sample vs. population -> use confidence interval 2. Model bias: linear model not good approx. for true f 3. Irreducible error: due to the random error term -> use prediction interval

Confusion Matrix

A convenient way to identify the number of type I errors (false positives) and type II errors (false negatives) in our model. The rows indicate the predicted categories and the columns indicate the real categories.

P-value

A p-value is the probability, assuming the null hypothesis is true, of observing a value for your test statistic at least as extreme as the one observed in your dataset.

Hypothesis testing in simple linear regression: how far is far enough?

Depends on the accuracy of β1-hat: ■ If SE(β1-hat) is small, then even relatively small values of β1-hat may provide strong enough evidence that β1≠0. ■ If SE(β1-hat) is large, then β1-hat must be large in order for us to reject the null hypothesis.

How large does the F-statistic need to be before we can reject H0 and conclude that there is a relationship?

Depends on the sample size: ■ When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. ■ In contrast, a larger F-statistic is needed to reject H0 if n is small.

How do you estimate the prior probability?

Estimating π_k is easy if we have a random sample of Ys from the population. We simply compute the fraction of the training observations that belong to the kth class.
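A minimal sketch of this estimate, using a made-up list of training labels:

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate each class prior as its share of the training labels."""
    counts = Counter(labels)
    n = len(labels)
    return {k: count / n for k, count in counts.items()}

# Hypothetical training responses.
labels = ["apple", "banana", "apple", "apple", "banana"]
priors = estimate_priors(labels)
print(priors)
```

The estimated priors always sum to 1. They are only trustworthy if the training data are a random sample from the population.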

Unsupervised learning

Exploring the data and discovering patterns without a clear purpose. There are inputs but no supervising output.

What do you do when there are non-linear associations in the data?

If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as log(X), √X, and X^2, in the regression model.

What type of distribution do we expect the t-statistic to have if there is no relationship between X and Y?

If there is no relationship between X and Y, then we expect that the t-statistic will have a t-distribution with n-2 degrees of freedom (because 2 degrees of freedom are lost in estimating the coefficients).

What is statistical learning?

A set of approaches for estimating f.

What happens to the model as K decreases?

As K decreases, the model becomes more complex (more flexible) because each prediction is based on fewer nearby neighbors; with K=1 it depends on the single nearest neighbor. This lowers bias but raises variance.

What happens to the training error and test error in KNN as flexibility increases?

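As the level of flexibility (assessed using 1/K) increases, the training error consistently decreases, while the test error follows a U-shape: it decreases at first and then increases again once the model starts to overfit.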

Standard error

Roughly the average amount by which the sample mean (μ-hat) differs from the population mean (μ). Var(μ-hat) = SE(μ-hat)^2 = σ^2/n
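A small sketch of the formula, with a hypothetical σ of 2, shows how the standard error shrinks as n grows:

```python
import math

def standard_error_of_mean(sigma, n):
    """SE of the sample mean: square root of sigma^2 / n."""
    return math.sqrt(sigma ** 2 / n)

se_25 = standard_error_of_mean(2.0, 25)
se_100 = standard_error_of_mean(2.0, 100)
print(round(se_25, 3), round(se_100, 3))
```

Quadrupling the sample size only halves the standard error, because n sits under a square root.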

What does a small RSE indicate?

Better model fit.

What types of models should we use for qualitative variables?

Choose models from the classification toolkit (logistic regression, KNN, and discriminant analysis).

What types of models should we use for quantitative variables?

Choose models from the regression toolkit (simple and multiple linear regression).

Linear Discriminant Analysis (LDA)

Instead of working with the conditional probability of the outcomes (as in logistic regression), LDA looks at how the predictors are distributed within each class and then uses Bayes theorem to estimate the conditional probabilities indirectly.

Supervised learning

Involves building a statistical model for predicting or estimating an output based on one or more inputs. Usually uses theory-driven models.

Model complexity and sample size

Match the model complexity to the data resources, NOT to the target complexity. Even if the true function is complex, adding complexity to the model does not help you predict the true form if you have insufficient data because using a smaller sample increases variance. Only use complex functions with large sample sizes.

Why would we ever choose to use a more restrictive method instead of a very flexible approach?

More interpretability

Measures of model fit

RSE and R2

What units is the RSE measured in?

RSE is measured in units of Y, so it is not always clear what constitutes a good RSE.

What is f?

Some fixed but unknown function of X. Y = f(X) + e In this formulation, f represents the systematic information that X provides about Y.

Total sum of squares (TSS)

TSS measures the total variance in Y, and can be thought of as the amount of variability inherent in the response before the regression is performed.

Bayes decision boundary

The Bayes classifier's prediction is determined by the Bayes decision boundary. E.g., an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.

Why is the training MSE not the ideal way to choose a model?

The training MSE keeps falling as model complexity increases and can be driven all the way to zero, even while the test MSE rises. That means you can't use the training MSE to judge how well your model will fit new data.

Population regression line

The best linear approximation of the true relationship between Y and X.

Reducible error

The error that can be reduced through modeling. [f(x) - fhat(x)]^2

Logistic versus linear regression: interpretability

The major advantage of the linear model is its interpretability, whereas the logistic model is less interpretable. Assume we have a binary outcome variable (Y) with 1 and 0 as its only possible values. ■ In the linear model, if β1 is 0.05, that means that a one-unit increase in X1 is associated with a 0.05 (5 percentage point) increase in the probability that Y is 1. ■ In the logistic model, if β1 is 0.05, that means that a one-unit increase in X1 is associated with a 0.05 increase in the log odds that Y is 1.

What is the best model?

The one with the lowest TEST MSE.

Specificity of a Classifier

The percentage of actual negatives that are correctly identified as negative.

Sensitivity of a Classifier

The percentage of actual positives that are correctly identified as positive.
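Sensitivity and specificity can both be read off the four cells of a confusion matrix. The counts below are made up for illustration.

```python
def sensitivity(tp, fn):
    """Share of actual positives that the classifier labels positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that the classifier labels negative."""
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts.
tp, fn = 40, 10   # actual positives: caught vs. missed (type II errors)
tn, fp = 90, 10   # actual negatives: cleared vs. false alarms (type I errors)

print(sensitivity(tp, fn), specificity(tn, fp))
```

Lowering the classification threshold typically raises sensitivity at the cost of specificity, so the two are usually reported together.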

What happens if your confidence interval contains the number zero?

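Then you cannot reject the null hypothesis that the coefficient is zero at the corresponding significance level (e.g., 5% for a 95% interval). The data are consistent with no association, so you can't conclude that the coefficient is different from zero.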

Hypothesis testing in simple linear regression

To test the null hypothesis, we need to determine whether β1-hat is sufficiently far from zero so that we can be confident that β1≠0.

Cons of parametric models

Too simplistic: the model we choose will not usually match the true unknown form of f. If you have some variables that have a linear relationship with Y and some that have a non-linear relationship with Y, then your model is biased. We can address this problem by choosing flexible models that can fit many different possible functional forms of f.

What type of interval do we use when trying to predict the average response?

Use a confidence interval. E.g., we use a confidence interval to quantify the uncertainty surrounding the average sales over a large number of cities.

Inference

Using statistical learning to learn about the direction and/or magnitude of predictors' influence on the outcome variable. ■ Does human activity cause climate change? ■ Will administering drug X increase chances of survival in a patient?

Prediction

Using statistical learning to learn the value of the outcome variable. ■ How much rainfall will California have in 2050? ■ How much solar photo-voltaic will be installed in the US in 2025?

Irreducible error

Var (e), also understood as the variance associated with the error term.

How well does the LDA classifier perform?

We can compute the Bayes error rate (possible only when the true distribution is known, as with simulated data) and the LDA test error rate. The smaller the gap between the LDA and Bayes error rates, the better the LDA classifier is performing. Recall that the Bayes error rate is the smallest possible error rate.

Posterior probability

We refer to p_k (X) as the posterior probability that an observation X=x belongs to the kth class, given the predictor value for that observation.

How are coefficients estimated in simple vs. multiple linear regression?

We use the least squares approach to estimate coefficients in both simple and multiple linear regression.

What type of interval do we use when trying to predict an individual response?

When you are trying to predict an individual response you get a lot more noise and you have to use prediction intervals. E.g., a prediction interval can be used to quantify the uncertainty surrounding sales for a particular city.

Problem with overfitting

When you overfit, the model learns the noise in the training data rather than the true signal, so it predicts poorly on new observations.

Decision boundary: K=1 vs. K=100

With K=1, the decision boundary is overly flexible, whereas with K=100 it is not sufficiently flexible.

Odds

p/(1-p)

Interaction terms

■ We can relax the additive assumption by adding an interaction term (product of X1 and X2). ■ Adding an interaction term means that the effect of X1 on Y is no longer constant because X2 will change the impact of X1 on Y. ■ We can interpret the coefficient of the interaction term as the increase in the effect of X1 on Y for a one-unit increase in X2 (or vice versa).

Confidence intervals

■ A confidence interval for some unknown parameter is an interval that, under repeated sampling, will contain the true parameter with some specified probability. ■ The range is defined in terms of lower and upper limits computed from the sample of data.

Test error rate (classification)

■ A good classifier is one for which the test error is smallest. ■ The test error rate is minimized, on average, by the Bayes classifier.

Residual sum of squares (RSS)

■ A measure of the discrepancy between the data and an estimation model. ■ The least squares approach chooses coefficients to minimize the RSS. ■ A small RSS indicates a tight fit of the model to the data.

Small p-value

■ A small p-value tells you it's very unlikely you would observe that result if the Ho were true. It is evidence AGAINST the Ho. ■ If we see a small p-value, then we can infer that there is an association between the predictor and the response.

Bayes classifier

■ A very simple classifier that assigns each observation to the most likely class, given its predictor values. ■ We should assign a test observation with predictor vector x0 to the class j for which the conditional probability is largest: P(Y=j | X=x0)

Outliers

■ An outlier is a point for which yi is far from the value predicted by the model. ■ To identify outliers, plot the studentized residuals, which put the residuals on a standard scale. ■ These are computed by dividing each residual by its estimated standard error. ■ Observations whose studentized residuals are greater than 3 in absolute value are possible outliers. ■ Once you've identified an outlier, determine why it is an outlier (measurement error or something else?). ■ If the outlier is a legitimate, important observation, then you shouldn't exclude it from the analysis.

KNN Classifier

■ For real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. ■ KNN is an approach that estimates the conditional distribution of Y given X and then classifies a given observation to the class with highest estimated probability.

KNN classifying procedure

■ Given a positive integer K, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. ■ It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j. ■ Finally, KNN applies the Bayes rule and classifies the test observation x0 to the class with the largest probability.
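The three steps above can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: it assumes Euclidean distance, uses made-up two-dimensional training points, and breaks ties by list order.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x0, k):
    """Return (predicted class, estimated class probabilities) for x0."""
    # Step 1: find the k training points closest to x0 (Euclidean distance).
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x0))
    neighborhood = order[:k]
    # Step 2: estimate P(Y = j | X = x0) as the share of neighbors in class j.
    counts = Counter(train_y[i] for i in neighborhood)
    probs = {label: count / k for label, count in counts.items()}
    # Step 3: assign x0 to the class with the largest estimated probability.
    predicted = max(probs, key=probs.get)
    return predicted, probs

# Hypothetical training data: two well-separated clusters.
train_x = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]

print(knn_predict(train_x, train_y, (2, 2), k=3))
print(knn_predict(train_x, train_y, (2, 2), k=5))
```

With k=3 all neighbors of (2, 2) are blue. With k=5 two orange points enter the neighborhood and the estimated probability for blue drops to 0.6, which shows how a larger K smooths the estimate.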

How do you identify and deal with heteroscedasticity?

■ Heteroscedasticity is when the variances of the error terms are non-constant. ■ It can be identified from the presence of a funnel shape in the residual plot. ■ You can deal with heteroscedasticity by doing non-linear transformations or using weighted least squares.

Rule of thumb for choosing between logistic and linear regression

■ If the probabilities that you're modeling are extreme—close to 0 or 1—then you probably have to use logistic regression. ■ But if the probabilities are more moderate—say between 0.2 and 0.8—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.

Residual standard error (RSE)

■ In general, σ (the standard deviation of the error term) is not known, but it can be estimated from the data. This estimate of σ is known as the residual standard error. ■ The RSE is roughly the average amount that the response will deviate from the true regression line. ■ Considered a measure of the lack of fit of the model. RSE = √(RSS/(n-2))

Logistic versus linear regression: fit

■ In many situations, the linear and logistic model give results that are practically indistinguishable except that the logistic estimates are harder to interpret. ■ For the logistic model to fit better than the linear model, it must be the case that the log odds are a linear function of X, but the probability is not. ■ And for that to be true, the relationship between the probability and the log odds must itself be nonlinear.

Hypothesis testing in multiple linear regression

■ In the simple linear regression setting, we can simply check whether β1=0. ■ In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero. Ho: β1=β2=⋯=βp=0 Ha: at least one βj≠0

Smoothness v. Flexibility

■ Increasing smoothness reduces flexibility. ■ Reducing smoothness increases flexibility but may result in overfitting. It reduces bias but increases variance.

Non-parametric model

■ Infinite set of parameters. ■ You learn new things from increasing the sample size. ■ Does not make assumptions about the functional form.

Why calculate the F-statistic in addition to p-values?

■ It is not enough that the individual p-values for the β's are very small. ■ Especially when the number of predictors p is large, one or more of the individual p-values might be very small just by chance, even if there is no true relationship between the predictor(s) and the response! ■ The F-statistic does not suffer from this problem because it adjusts for the number of predictors.

Multicollinearity

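■ When collinearity exists among three or more variables, which can happen even if no single pair of variables is highly correlated. ■ The correlation matrix may not reveal this, so to detect multicollinearity, use the variance inflation factor (VIF).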

Maximum likelihood method

■ Method for estimating regression coefficients in logistic regression. ■ We seek estimates for β0 and β1 such that the predicted probability of Y for each individual corresponds as closely as possible to the individual's observed Y. ■ In other words, we try to find β0 and β1 such that plugging these estimates into the model for p(x) yields a number close to 1 for all individuals who defaulted on their credit card payments, and a number close to zero for all individuals who did not. ■ This intuition can be formalized using a mathematical equation called a likelihood function.

High-leverage points

■ Observations with unusual values for xi. ■ The leverage statistic is always between 1/n and 1. ■ The average leverage for all the observations is always equal to (p + 1)/n.
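For simple linear regression (p = 1) the leverage statistic has the closed form h_i = 1/n + (x_i - x-bar)^2 / Σ(x_j - x-bar)^2. That formula is not given on the card. The sketch below applies it to made-up data with one unusual x value.

```python
def leverage_simple_regression(x):
    """Leverage h_i of each observation in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Made-up predictor values; the last one is far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverage_simple_regression(x)
average_leverage = sum(h) / len(h)

print([round(hi, 3) for hi in h])
print(round(average_leverage, 3))  # equals (p + 1)/n = 2/5 here
```

The observation at x = 20 has leverage close to 1, well above the average of (p + 1)/n, which flags it as a high-leverage point.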

KNN vs. linear regression

■ Parametric methods such as least squares linear regression will tend to outperform non-parametric approaches such as KNN regression when: 1. The parametric form that has been selected is close to the true form of f. 2. There is a small number of observations per predictor. ■ As the true relationship becomes more non-linear, KNN regression achieves a lower error rate than linear regression. But if the non-linearity is slight, the linear model predicts better than KNN regression. ■ Curse of dimensionality: with many predictors, the nearest neighbors may be far away, so KNN can lose to linear regression even when f is non-linear.

Density Function

■ Probability (density) of the predictor taking a given value, given the class. ■ Let f_k(X) = P(X=x | Y=k) denote the density function of X for an observation that comes from the kth class. ■ f_k(X) is relatively large if there is a high probability that an observation in the kth class has X ≈ x. ■ f_k(X) is small if it is very unlikely that an observation in the kth class has X ≈ x.

What happens to the R-squared when we add more variables to the model?

■ R2 will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. ■ This happens because you are increasing flexibility by adding more variables, and therefore you improve the model's fit. ■ But higher fit does not necessarily mean more predictive accuracy.

Bayes error rate

■ The Bayes classifier produces the lowest possible test error rate. ■ It is greater than zero when the classes overlap in the true population.

Extensions of the linear model

■ The additive assumption says that the effect of changes in a predictor Xi on the response Y is independent of the values of the other predictors. ■ The linear assumption states that the change in the response Y due to a one-unit change in Xi is constant, regardless of the value of Xi.

Two ways to deal with collinearity

■ The first is to drop one of the problematic variables from the regression. This can usually be done without much compromise to the regression fit. ■ The second solution is to combine the collinear variables together into a single predictor (for example, the average of their standardized versions).

Non-constant variance of error terms

■ The linear regression model assumes that the error terms have a constant variance. ■ However, it is often the case that the variances of the error terms are non-constant. ■ This is called heteroscedasticity.

R-squared

■ The proportion of explained variance. ■ Always takes a value between 0 and 1. ■ An R2 near 0 indicates that the regression did not explain much of the variability in the response. ■ R2 tells you how well your model is fitted, but it does not say anything about the predictive accuracy of your model. R^2=(TSS-RSS)/TSS
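A minimal sketch of the R^2 formula, evaluated at its two extremes on made-up responses:

```python
def r_squared(y, y_hat):
    """R^2 = (TSS - RSS) / TSS for observed y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    return (tss - rss) / tss

y = [3.0, 5.0, 7.0, 9.0]                      # hypothetical responses
perfect_fit = r_squared(y, y)                 # fitted values equal the data
mean_only = r_squared(y, [6.0, 6.0, 6.0, 6.0])  # every fitted value is y-bar

print(perfect_fit, mean_only)
```

A fit that reproduces the data gives R^2 = 1. Predicting the mean for every observation leaves RSS equal to TSS and gives R^2 = 0.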

Training error rate (classification)

■ The proportion of mistakes that are made if we apply our classifier to training observations (i.e. the fraction of incorrect classifications). ■ Most common approach for quantifying the accuracy of our estimate f-hat.

Correlation of the error terms

■ The standard errors that are computed for the estimated regression coefficients are based on the assumption of uncorrelated error terms. ■ If the error terms are correlated, we may have an unwarranted sense of confidence in our model.

Training data vs. Test data

■ Training data is what we use to fit our model. ■ Test data is what we use to measure goodness of fit. ■ The test data is a sample taken from your data set. ■ The training data is whatever is left. ■ Never use your test data to fit your model. ■ But also, never lose your test data.

Which is typically lower: training or test error rates?

■ Training error rates will usually be lower than test error rates. ■ The reason is that we specifically adjust the parameters of our model to do well on the training data.

F-statistic

■ Used for hypothesis testing in multiple linear regression. ■ When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1. ■ If the alternative hypothesis is true, we expect F to be greater than 1. F=((TSS-RSS)/p)/(RSS/(n-p-1))
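The formula translates directly into code. The sums of squares, sample size, and number of predictors below are hypothetical values chosen to keep the arithmetic easy to follow.

```python
def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    return ((tss - rss) / p) / (rss / (n - p - 1))

# Hypothetical fit: 23 observations, 2 predictors.
f_value = f_statistic(tss=100.0, rss=40.0, n=23, p=2)
print(f_value)
```

Here the numerator is 60/2 = 30 and the denominator is 40/20 = 2, so F = 15, far above the value near 1 expected under the null hypothesis.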

Residual Plot

■ Useful graphical tool for identifying non-linearity. ■ Given a simple linear regression model, we can plot the residuals (e = y - y-hat) versus the predictor (x). ■ Ideally the residual plot will show no discernible pattern. ■ The presence of a pattern may indicate a problem with some aspect of the linear model.

Parametric model

■ Uses a finite set of parameters. ■ The functional form stays fixed as the sample size increases; more data only makes the parameter estimates more precise. ■ Assumes a functional form for f. In real life, we will never know what the true functional form is.

Polynomial regression

■ Using exponents for your variables. ■ If using Xi^2 improves the model, why not include Xi^3, Xi^4, or Xi^5? ■ Polynomial regressions give you better fit but not a better model.

Variance Inflation Factor (VIF)

■ VIF is the ratio of the variance of βi-hat when fitting the full model to the variance of βi-hat when fit on its own. ■ The smallest possible value is 1, which indicates the complete absence of collinearity. ■ As a rule of thumb, a VIF value higher than 5 or 10 is problematic.
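In practice the VIF for predictor j is often computed as 1/(1 - R^2), where R^2 comes from regressing Xj on all the other predictors. With only two predictors that R^2 is the squared correlation between them. The sketch below covers only this two-predictor special case and uses made-up data.

```python
import math

def pearson_r(a, b):
    """Sample correlation between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Made-up predictors: one correlated pair, one uncorrelated pair.
correlated = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
uncorrelated = vif_two_predictors([-2, -1, 0, 1, 2], [2, -1, -2, -1, 2])

print(round(correlated, 3), round(uncorrelated, 3))
```

A correlation of 0.8 gives a VIF of about 2.78, while uncorrelated predictors give exactly 1. With more than two predictors the R^2 has to come from a full auxiliary regression.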

