STAT 417 Exam #1
What is linear discriminant analysis?
Linear discriminant analysis estimates the probability that a new set of inputs belongs to each class; the predicted class is the one with the highest probability. LDA uses Bayes' Theorem to estimate these probabilities: if the output class is k and the input is x, Bayes' Theorem gives the posterior probability that the observation belongs to class k (see the formula below).
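In the standard notation (π_k is the prior probability of class k and f_k(x) is the density of X within class k), the posterior probability that LDA maximizes is:

P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}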
Collinearity in linear regression? Detection?
Occurs when two or more predictors are highly correlated; makes it difficult to separate the predictive ability of each predictor. -> Can also make parameter estimates unstable (estimates may not make sense) & inflate standard errors ■ Collinearity can be detected by a correlation matrix between predictors, or by using variance inflation factors (VIFs) -> A VIF above 5 or 10 should be investigated (see the sketch below)
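A minimal sketch (synthetic data, not from the course notes) of computing VIFs, assuming statsmodels and pandas are available:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)                    # unrelated predictor
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF for each column of the design matrix (the 'const' column can be ignored)
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)  # x1 and x2 should show VIFs well above 10; x3 should be near 1
```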
What are qualitative predictors?
Predictors which can't be easily quantified such as gender, student status, or ethnicity
Supervised learning overview
Supervised learning problems are based on data that have inputs and outputs. - The outputs are known as the supervising data. - The purpose of supervised learning is to be able to estimate or predict the output given the values of the inputs.
K-Nearest Neighbor bias-variance tradeoff
■ k = 1: low bias, high variance ■ k = 100: high bias, low variance
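A minimal sketch (synthetic data, not from the course) comparing k = 1 and k = 100 with scikit-learn to see the tradeoff in training vs. test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))

# k = 1 typically fits the training data perfectly (low bias) but generalizes
# worse (high variance); k = 100 smooths the boundary (higher bias, lower variance).
```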
Overall error rate (confusion matrix)
(FN + FP) / n
6 potential problems in linear regression?
1. Non-linearity of the response-predictor relationships. 2. Correlation of error terms. 3. Non-constant variance of error terms. 4. Outliers. 5. High-leverage points. 6. Collinearity
What are two tasks we can perform once we've estimated the function f between X and Y?
1.) Prediction: for a specified value of X, try to guess the corresponding value of Y 2.) Inference: learn about the relationship between X and Y; what is the actual form of f ? Which predictors are related to Y? Is the relationship linear or more complicated?
Statistical learning problems differ from other statistical problems in two ways:
1.) The number of observed data points is often very large. 2.) The data are often high dimensional.
What is the logistic function?
The S-shaped function used in logistic regression (a generalized linear model, GLM) to map the linear predictor to a probability between 0 and 1 - Inverting it gives the log odds (logit)
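With a single predictor, the logistic function in the usual notation is:

p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}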
What is multiple regression?
A linear regression performed on multiple independent variables. -> extends simple linear regression to p predictors.
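In the usual notation, the multiple regression model with p predictors is:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon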
What is logistic regression?
A regression with an outcome variable that is categorical and independent variables that can be a mix of continuous and/or categorical variables -> Models the probability of response given the values of the predictors
What is parametric estimation? Positives/negatives?
Estimation in which we: 1.) Make some specific assumptions about the form of f. 2.) Use the training data to fit that form of f. For example, we may assume f is linear and then fit an appropriate model to the data (linear regression) Positives: easy to implement, very well understood, easy to interpret Negatives: the true f may not be linear / may not follow the assumed form, in which case the fit will be poor
What is the likelihood function?
Function in logistic regression which is maximized to find estimates for our regression parameters
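For simple logistic regression with p(x) = P(Y = 1 | X = x), the likelihood being maximized is:

\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)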
T-test for regression
H_0: the coefficient is 0; H_a: the coefficient is nonzero. We use the t-test to get a p-value and infer accordingly.
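For simple linear regression, the test statistic (compared to a t-distribution with n - 2 degrees of freedom) is:

t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}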
Variance-Bias tradeoff overview
High bias means that our model won't be accurate because it doesn't have the capacity to capture the signal in the data, whereas high variance means that our model won't be accurate because it overfit to the data it was trained on, and thus won't generalize well to new unseen data. The more complex (less smooth) the model, the higher the variance but the lower the bias; the smoother the model, the lower the variance but the higher the bias.
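The tradeoff follows from the decomposition of the expected test MSE at a point x_0:

E\bigl[(y_0 - \hat{f}(x_0))^2\bigr] = \mathrm{Var}\bigl(\hat{f}(x_0)\bigr) + \bigl[\mathrm{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2 + \mathrm{Var}(\varepsilon)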
Why is deciding which predictors are significant in multiple regression complicated? What techniques might we use?
If we reject H_0, we then ask which predictors should be in the model. However, the significance of a predictor depends on what other predictors are in the model! So we should really look at all 2^p possible models, but this is a lot of work and increases Type I error. Instead, we use algorithms for model selection such as forward selection, backward elimination, and stepwise algorithms.
What is nonparametric estimation? Positives/negatives?
In nonparametric estimation, we make very few assumptions about f and fit without constraints. Positives: few assumptions Negatives: more difficult to implement, more difficult to interpret
How do parametric / nonparametric methods fit with prediction/inference?
In prediction problems you don't need to interpret, just give a good prediction: nonparametric may work better. Meanwhile, for inference you need the interpretability so parametric is likely better.
Three-level predictor in linear regression
Create additional dummy variables; a three-level predictor needs two dummy variables (in general, a predictor with L levels needs L - 1 dummies).
Confusion matrix
Cross-classified frequency table of the true classification of each response against the classification predicted by the model
Nonconstant variance (heteroscedasticity) in linear regression? Detection? Solution?
- Linear regression assumes that the error terms have a constant variance with V(ε_i) = σ^2 ■ The key tool in detecting nonconstant variance is the plot of residuals against the fitted values ■ To help with this problem, try a variance stabilizing transformation on the response (Y) such as sqrt(Y) or log(Y)
Correlated error terms in linear regression? Detection?
- The computation of standard errors requires the assumption that the error terms are uncorrelated -> If error terms are correlated, standard error estimates will tend to be too small, p-values too small (rejecting the null hypothesis too often), and confidence intervals too narrow. If there is a time index to the data, can plot residuals against the time index. Otherwise, can be difficult to detect this in the data.
K-Nearest Neighbor overview
A supervised learning technique that classifies a new observation by finding similarities ("nearness") between this new observation and the existing data.
Bayes Theorem? Posterior probability?
The posterior probability is P(Y = k | X = x): the probability that the response is class k given the observed predictors. Any observation with X = x is classified as Y = k when P(Y = k | X = x) is the largest (Bayes classifier)
What to do if your classification model is not very sensitive?
Can move the boundary of the Bayes classifier (e.g. 1/2 to 1/5 will classify more observations as "yes" and increase sensitivity) -> However, this will lower specificity
Quadratic Discriminant Analysis (QDA) Overview / Assumptions
Classifier / generative model which is similar to LDA but drops the common variance assumption. -> Now, each class's distribution can have a different σ^2_k -> Note that the resulting comparison of posteriors (e.g. p_1(x) > p_2(x)) contains an x^2 term (quadratic)
When to use LDA vs. QDA?
LDA better for smaller training data sets, QDA better for large data sets
How do we find estimate coefficients in linear regression?
Least squares method: minimize the residual sum of squares (RSS) using the training data.
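For simple linear regression, in the usual notation, the quantity minimized and the resulting estimates are:

RSS = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2, \qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}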
What is the mean squared error (MSE)?
A measure of how well f_hat predicts the response: the average squared difference between the observed responses and the predicted values
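In the usual notation:

MSE = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{f}(x_i)\bigr)^2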
Specificity (definition/formula)
Rate at which the negative classification is identified = TN / N
Sensitivity (definition/formula)
Rate at which the positive classification is identified = TP / P = (1- FN error rate)
F-statistic interpretation
Tests H_0: β_1 = β_2 = ... = β_p = 0 (no predictor is related to the response) against H_a: at least one β_j is nonzero. A large F-statistic is evidence against H_0; the rejection region & p-values are computed using the F-distribution
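In the usual notation:

F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}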
What is the standard error?
Roughly speaking, the standard error tells us the average amount that the estimate µ̂ differs from the actual value of µ.
Coefficient of Determination (R^2)
The fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. ■ TSS measures total variation ■ TSS - RSS is the amount of variation in response due to relationship with X ■ R^2 = proportion of variation in responses explained by X
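In the usual notation:

R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}, \qquad TSS = \sum_i (y_i - \bar{y})^2, \qquad RSS = \sum_i (y_i - \hat{y}_i)^2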
What is an unbiased estimator (regression)?
The least squares regression line is an unbiased estimator of the population regression line
Nonlinearity in linear regression? How is it detected?
The linear regression's underlying assumption is that there is a linear relationship between Y and the X1, ... , X_p predictors. However, if the assumption of linearity is not correct then the model will not fit well and the corresponding predictions will be unreliable. ■ We can detect nonlinearity by plotting our fitted values (ŷ) against the residuals. -> In a good fit, the residuals should look random centered around 0
What is the log odds / logit?
The log odds, or logit, is log( p(X) / (1 - p(X)) ) = β_0 + β_1X; the logit is linear with respect to our predictor. We find our regression estimates using maximum likelihood estimation
What is a classification problem? Training data form?
The problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Training data: {(x1, y1), (x2, y2), ... , (xn, yn)} with y1, ... , yn being qualitative (categorical) --> want to predict y from x.
Bias Variance Tradeoff in QDA (covariance matrix) vs. LDA
There are (1/2)p (p+1) parameters to estimate in a p x p covariance matrix -> If both K & p are large, QDA uses many more parameters. ■ LDA is less flexible -> low variance, high bias ■ QDA is more flexible -> high variance, low bias
Training/test error for classification?
Training error: proportion of misclassified observations in the training data Test error: proportion of misclassified observations in the test data
Unsupervised learning overview
In unsupervised learning problems there is input data but no output data. - The purpose of unsupervised learning is to understand the structure of the data.
Outliers in linear regression? Detection?
Values where y_i and ŷ_i are far apart (large residual) - Outliers will not usually affect estimates, but standard errors can become inflated & p-values too large ■ Plots of residuals can be used to detect outliers (look for large residuals) -> Plots using studentized residuals are better (residuals divided by their standard error) ■ Look for studentized residuals w/ an absolute value of >2 or >3
What is the fundamental assumption between our p independent variables X1, X2, ... , Xp and our dependent variable Y?
We assume that Y and X are related/associated; that is, X contains information about the behavior of Y. Specifically, assume the value of Y is equal to some function of X plus random error: Y = f(X) + ε
What assumptions do we make regarding the density functions in LDA with p = 1?
We assume that the class-k density is normal with mean µ_k and variance σ^2, where the variance σ^2 is the same for every class.
What assumption do we make regarding our random error ε ?
We assume ε is random error with expectation (mean) 0. That is, E(ε) = 0.
How do we gauge how well a model fits the data in multiple regression?
We can't use R^2 directly as this gets inflated as p increases. ■ Adjusted R^2 is used: adjusts for the number of predictors
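In the usual notation:

\text{Adjusted } R^2 = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}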
Additivity assumption / interaction terms in linear regression
We generally have the additivity assumption in linear models: the effect of one predictor is independent of the effects of the other predictors. However, if we wish to allow the effect of one predictor to depend on the value of another, we may include interaction terms (e.g. X_1 X_2). This relaxes the additivity assumption.
Two-level predictor in linear regression
We simply create an indicator or dummy variable that takes on two possible numerical values.
LDA with p > 1
X is now a p-dimensional vector; has a p-dimensional normal distribution with mean vector µ_k and covariance matrix Σ -> We then have a more complex density function but otherwise the same process of calculating posterior probabilities. -> Estimate density functions, prior probabilities, and covariance matrix Σ using training data.
What is standard deviation?
a computed measure of how much scores vary around the mean score
Multiple logistic regression
an extension of logistic regression in which two or more independent variables are included in the model -> Similarly estimate via maximum likelihood estimation
High leverage points in regression? Detection?
■ High leverage points are unusual x values -> can have a great effect on parameter estimates ■ May be difficult to detect when p is larger (multiple regression) so you can use the leverage statistic h_i -> the average leverage is (p + 1)/n, so any leverage values well above (p + 1)/n should be investigated ■ Can generally just limit our x range to omit the high leverage point
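For simple linear regression, the leverage statistic has the closed form:

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}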
Training data vs test data?
■ Training data is what we use to fit our model. ■ Test data is what we use to measure goodness of fit. ■ The test data is a sample taken from your data set. ■ The training data is whatever is left. ■ Never use your test data to fit your model. ■ But also, never lose your test data. --> We use the test MSE as a better assessment of our model