Exam SRM
Common stats for comparing forecasts
ME, MPE, MSE, MAE, MAPE; we choose the forecast with the smallest values of these statistics
F stat=
the t statistic squared (for a test of a single predictor, F = t^2)
Interaction depth phi
controls the interaction order and complexity of the boosted model; it is the number of splits allowed in each tree
ME
mean error; measures recent trends that are not anticipated by the model
We constrain the loadings so that their sum of squares equals 1 (normalized)
true
With B sufficiently large, out of bag error is virtually equivalent to LOOCV
true
Gini index
a measure of total variance across the K classes; takes on a small value if all of the pmk's are close to 0 or 1; a measure of node purity, since a small value indicates that a node contains predominantly observations from a single class
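For reference, the formula behind this description (K classes; pmk is the proportion of training observations in region m from class k):

G_m = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})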
Disadvantages of trees
- do not have the same level of predictive accuracy as some of the other regression and classification approaches - can be very non-robust, a small change in data can cause a large change in the final estimated tree
Advantages of trees
- easy to explain - closely mirrors human decision making - can be displayed graphically and easily interpreted - can easily handle qualitative predictors without need for dummy variables
Random walk
- for this model, we filter the data by taking differences - the partial sums of a white noise process define a random walk model - commonly used time series model
How does one recognize that an autoregressive model may be a suitable candidate model?
1. Since the model is stationary, we can use a control chart to examine the data graphically and search for stability 2. adjacent realizations of an AR(1) model should be related, which can be detected using a scatterplot of yt versus yt-1 3. we can recognize an AR(1) model through its autocorrelation structure
Autocorrelation structure of AR(1) model
1. the correlation between points k time units apart turns out to be b1^k 2. the absolute values of the autocorrelations of an AR(1) process become smaller as the lag k increases 3. to aid in model identification, we match the observed autocorrelations rk to the quantities pk that we expect from the theory
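In symbols, and using b1 for the AR(1) slope as in these cards, the theoretical lag-k autocorrelation of a stationary AR(1) process (|b1| < 1) is:

\rho_k = b_1^{\,k}, \qquad k = 1, 2, \ldots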
Why Stepwise Selection is better
Best subset selection cannot be applied with very large p; it also suffers from a large search space and from its ability to fit the training data too well, which can lead to overfitting and high variance of the predictions; stepwise methods explore a far more restricted set of models
When there is no relationship between response and predictors, we expect F-stat to be
Close to 1
R squared adjusted
RSS always decreases and R squared always increases as more variables are added; a large value of adjusted R squared indicates a model with a small test error; adjusted R squared pays a price for adding unnecessary variables to the model
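A common way to write the adjustment, assuming d predictors in the candidate model:

\text{Adjusted } R^2 = 1 - \frac{RSS/(n - d - 1)}{TSS/(n - 1)}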
Disadvantage of forward stepwise selection
Tends to do well in practice, but it is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors
RSS measures
The amount of variability that is left unexplained after performing the regression
Linear Assumption
The change in the response Y due to a one unit change in Xj is constant regardless of the value of Xj
Additive assumption
The effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors
Recursive Binary Splitting
Top-down, greedy approach, used because it is infeasible to consider every possible partition of the feature space; we make the best split at that particular step instead of looking ahead and picking a split that will lead to a better tree in some future step
AR(1) model can be viewed as a generalization of a white noise process and random walk model; b1=0, white noise process; b1=1, random walk
True
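As an equation (a sketch using the b0, b1 notation of these cards, with c_t the white noise error term):

y_t = b_0 + b_1 y_{t-1} + c_t; \qquad b_1 = 0 \text{ gives a white noise process (plus a constant)}, \quad b_1 = 1 \text{ gives a random walk}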
In forecasting, the primary concern is for the most recent part of the series
True
In generalized linear models, the choice of the variance function, not the choice of the distribution, drives the most important inference properties
True
A link between longitudinal and cross-sectional models can be established through the notion of a white noise process
True
When analyzing longitudinal data, transformation is an important tool used to filter a dataset, because it helps to reduce increasing variability through time
True
Time series
ordered measurements over time of a single process, denoted y1, y2, ..., yT
Autoregressive model of order 2
a stationary process in which there is a linear relationship between yt and its two previous values, yt-1 and yt-2
Principal Component Analysis
a tool used for data visualization or data pre-processing before supervised techniques are applied; when faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set
2 ways the three patterns can be combined
additive (T+S+e) and partially multiplicative (T*S+e)
Entropy
an alternative to the Gini index; takes on a value near 0 if the pmk's are all near 0 or near 1
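For reference, the corresponding formula (same pmk notation as the Gini index card):

D_m = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}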
MSE
can detect more patterns than ME
MAE
can detect more trend patterns than ME; its units are the same as those of the dependent variable
MAPE
can detect more trend patterns than MPE, since it examines the error relative to the actual value
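For reference, the usual definitions behind the five comparison statistics, written in terms of one-step forecast errors e_t = y_t - \hat{y}_t over a holdout sample of T* periods (this notation is assumed, not from the cards):

ME = \frac{1}{T^*}\sum e_t, \quad MPE = \frac{100}{T^*}\sum \frac{e_t}{y_t}, \quad MSE = \frac{1}{T^*}\sum e_t^2, \quad MAE = \frac{1}{T^*}\sum |e_t|, \quad MAPE = \frac{100}{T^*}\sum \left|\frac{e_t}{y_t}\right|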
Shrinkage parameter lambda
controls the rate at which boosting learns, very small lambda can require using a very large value of B in order to achieve good performance
Alpha (CCP)
controls the tradeoff between a subtree's complexity and its fit to the training data; when alpha=0, T=T0 (the full tree); as alpha increases, there is a price to pay for having a tree with many terminal nodes
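The quantity being minimized in cost complexity pruning (regression-tree case; |T| denotes the number of terminal nodes of subtree T):

\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|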
Autocorrelation statistic (r1)
the correlation of a series with itself, lagged by one period; it summarizes the linear relationship between yt and yt-1; if r1>0, the process is positively autocorrelated, meaning a positive relationship between yt and yt-1, and we can use yt-1 to explain yt in a regression model (autoregression)
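A sketch of the sample lag-k autocorrelation statistic (r1 is the k = 1 case; \bar{y} is the series mean):

r_k = \frac{\sum_{t=k+1}^{T} (y_{t-k} - \bar{y})(y_t - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}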
Conditional least squares estimates of b0 and b1 are approximated for autoregressive models because
the differences arise because we have no explanatory variable for y1, the first observation; the effect is typically small in most series and diminishes as the series length increases
Classification error is sufficiently sensitive for tree growing
false; it is not sufficiently sensitive, so two other measures (the Gini index and entropy) are preferable for tree growing
Increasing B in boosting doesn't lead to overfitting
false; boosting can overfit if B is too large (although overfitting tends to occur slowly), so we use cross-validation to select B
Average, complete, and single linkage are not most popular with statisticians
false; they are the most popular among statisticians
Drawbacks of Linear Probability Model
fitted values can be poor; heteroscedasticity; residual analysis is meaningless
Negative Binomial advantage compared to Poisson
has greater flexibility because it has 2 parameters; the Poisson is a limiting case of the NB and is nested within the NB distribution; the NB arises from a mixture of Poisson variables
How large does rk need to be to be considered statistically significantly different from 0 in absolute value?
if rk exceeds 2se(rk)=2(1/sqrt(T)) in absolute value it may be significantly non zero, with 5% level of significance
Random forests
improvement over bagged trees by way of a small tweak that decorrelates the trees; each time a split is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors; we typically choose m= sqrt(p); on average (p-m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance
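A minimal scikit-learn sketch of the idea on synthetic data (the data and settings are illustrative, not from the cards); max_features="sqrt" plays the role of m = sqrt(p):

```python
# Random forest sketch: B bootstrapped trees, m = sqrt(p) candidate predictors per split.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,     # B: number of bootstrapped trees
    max_features="sqrt",  # m = sqrt(p) predictors considered at each split
    oob_score=True,       # track the out-of-bag error as a built-in validation estimate
    random_state=1,
)
rf.fit(X, y)
print(rf.oob_score_)      # OOB R^2; with B large this behaves much like cross-validation
```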
Characteristics that apply to b0 and b1
least squares estimators, max-likelihood estimators
MPE
a measure of trend, like ME, but it examines the error relative to the actual value
Longitudinal data
measurements of a process that evolves over time
ARCH and GARCH models
models that quantify and forecast changing volatility; the concept of changing variability over time seems at odds with our notions of stationarity, but we can allow for changing variances by conditioning on the past and still retain a weakly stationary model
Ordinal Dependent Variable
ordered categorical variable
Stochastic processes
ordered collections of random variables that quantify a process of interest
Filtering
procedure used to reduce observations to white noise
Autocorrelations and autoregressive models
provides techniques for detecting subtle trends in time and models to accommodate these trends
Bagging
reduces the variance of a statistical learning method; since we generally cannot take many training sets from the population, we take repeated bootstrap samples from the single training set, build a separate prediction model on each bootstrapped training set, and average the resulting predictions
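A minimal scikit-learn sketch on synthetic data (illustrative only); the default base estimator in BaggingRegressor is a decision tree, so this is bagged trees:

```python
# Bagging sketch: fit the same tree learner on B bootstrap samples and average the predictions.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

bag = BaggingRegressor(
    n_estimators=200,  # B: number of bootstrapped training sets / trees
    oob_score=True,    # out-of-bag error estimate (see the OOB card above)
    random_state=1,
)
bag.fit(X, y)
print(bag.oob_score_)  # OOB R^2 averaged over the bagged trees
```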
Fixed seasonal effects model
seasonal time series can be modeled using trig functions; "fixed effects" means the relationships are constant over time, unlike exponential smoothing and autoregression, which help us model trends that change over time and recent events
y=xt-xt-1
stationary mean
y=ln(xt)-ln(xt-1)
stationary mean and variance
y=ln(xt)
stationary variance
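A small NumPy sketch of the three filtering transforms in the cards above (x is a hypothetical positive series):

```python
# Filtering transforms: differencing stabilizes the mean, logging stabilizes the variance,
# and log-differencing addresses both.
import numpy as np

x = np.array([100.0, 105.0, 112.0, 118.0, 127.0, 135.0])  # hypothetical series

diff = np.diff(x)              # y_t = x_t - x_{t-1}:          stationary mean
log_x = np.log(x)              # y_t = ln(x_t):                stationary variance
log_diff = np.diff(np.log(x))  # y_t = ln(x_t) - ln(x_{t-1}):  stationary mean and variance
```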
One step forecast residuals
the difference between the actual value and its fitted (one-step-ahead forecast) value
Strong Stationarity requires
that the entire distribution of yt be constant over time, not just the mean and variance
Principal components regression
to perform we simply use principal components as predictors in a regression model in place of the original larger set of variables
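A minimal scikit-learn sketch of principal components regression on synthetic data (illustrative; M = 3 components is an arbitrary choice that would normally come from a scree plot or cross-validation):

```python
# PCR sketch: standardize, replace the predictors with M principal components, then run OLS.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

pcr = Pipeline([
    ("scale", StandardScaler()),   # PCA is scale-sensitive, so standardize first
    ("pca", PCA(n_components=3)),  # keep M = 3 principal components
    ("ols", LinearRegression()),   # ordinary least squares on the component scores
])
pcr.fit(X, y)
```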
Boosting
trees are grown sequentially: each tree is grown using information from previously grown trees; no bootstrapping is involved; this approach learns slowly; given the current model, we fit a decision tree to the residuals from the model rather than to the outcome Y; we then add this new decision tree into the fitted function in order to update the residuals
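A minimal scikit-learn sketch tying together the boosting tuning parameters from these cards (B, lambda, interaction depth), on synthetic data for illustration:

```python
# Boosting sketch: B shallow trees grown sequentially on residuals, with shrinkage.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

boost = GradientBoostingRegressor(
    n_estimators=1000,   # B: chosen by cross-validation, since a too-large B can overfit
    learning_rate=0.01,  # lambda: a very small value learns slowly and needs a larger B
    max_depth=2,         # interaction depth: controls the interaction order of the model
    random_state=1,
)
boost.fit(X, y)
```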
Any of the 3 approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal
true
As with bagging, random forests will not overfit if we increase B, so in practice we use a value of B sufficiently large for the error rate to have settled down
true
Average and complete are generally preferred over single because they tend to yield more balanced dendrograms
true
Bagging improves prediction accuracy at the expense of interpretability
true
Bagging, random forest, and boosting can improve predictive performance of trees
true
Constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction of phi2 to be orthogonal to the direction of phi1
true
For classification trees, we are often not only interested in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region
true
For a white noise process, b1=0 and thus pk=0 for all lags k
true
Like running averages, exponential smoothing estimates provide greater weight to more recent observations; however the weight function is smooth
true
Principal component loading vectors are the directions in feature space along which the data varies the most, and the principal component scores are projections along these directions
true
RSS cannot be used as the criterion for binary splits in classification trees; instead we use the classification error rate
true
The first principal component of a set of features X1, X2,...,Xp is the normalized linear combination of the features that has the largest variance
true
The number of trees B is not a critical parameter with bagging, using a very large value of B will not lead to overfitting
true
The second principal component is the linear combination of X1, X2,..., Xp that has maximal variance out of all linear combinations that are uncorrelated with the first principal component
true
We typically decide the number of principal components needed to explain a sizable amount of the variation in the data by using a scree plot; we choose the smallest number of principal components required, eyeballing the plot and looking for the point at which the PVE drops off
true
Nominal Dependent Variable
unordered categorical variable
Ridge Regression
very similar to least squares regression, except that the coefficients are estimated by minimizing a slightly different quantity; the tuning parameter lambda serves to control the relative impact of the shrinkage penalty on the regression coefficient estimates; as lambda increases, the ridge coefficient estimates shrink towards 0
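The "slightly different quantity" being minimized, in standard notation:

\sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2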
Logit threshold interpretation
we do not observe the propensity, but we observe when the propensity crosses a threshold
Cost Complexity Pruning
also known as weakest link pruning; instead of considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter alpha
Autoregressive model of order 1
only the immediate past (yt-1) is used as a predictor; b0 may be any fixed constant, but b1 is restricted to be between -1 and 1, so the AR(1) series yt is stationary
If there is a relationship between response and predictors, we expect F stat to be
Greater than 1, and we expect the alternative hypothesis to be true
As K increases, flexibility decreases, bias increases, the method produces a decision boundary closer to linear, and variance decreases (for K-nearest neighbors)
True
As long as there is some variability in the white noise process (variance of c greater than 0), the random walk is nonstationary in the variance
True
Can use binary variables or trig functions to capture seasonal effects
True
Collinearity reduces the accuracy of the estimates of the regression coefficients, so the standard error of the betas will grow, t stat will decline, and the power of the hypothesis test is reduced
True
For lasso regression, as lambda increases, flexibility decreases, variance decreases, squared bias increases, training error increases, and test error follows a u shape
True
Forecast of a future value of a white noise process is just the average of the process
True
Forward stepwise selection is the only viable subset method when p is very large
True
If model is purely multiplicative, yt=T*S*e, it can be made additive by taking logs of both sides
True
In Poisson regression models, we anticipate heteroscedastic dependent variables, which means ordinary residuals are useless so we use Pearson residuals
True
In theory, the models with the largest r squared adjusted will have only correct variables and no noise variables
True
K-fold cross validation with k<n has a computational advantage to LOOCV and often gives more accurate estimates of the test error rate because of the bias-variance tradeoff
True
LOOCV has far less bias than validation set approach, it tends not to overestimate the test error as much
True
Logistic Regression is parametric, but regression trees and KNN are not
True
Once all the correct variables have been included in the model, adding additional noise variables will lead to only a small decrease in RSS
True
Prediction intervals are always wider than confidence intervals because they incorporate both the irreducible error and the reducible error
True
Random walk is not a stationary process because variability and possibly the mean depends on time point at which the series is observed
True
Regular periodic behavior is often found in business and economic data
True
Removing the high leverage observation tends to have a much more substantial impact on the least squares line than removing the outlier
True
Residuals (e1, e2,..) must sum to zero
True
Ridge regression works best in situations where the least squares estimates have high variance
True
Smallest possible value for VIF=1, means complete absence of collinearity
True
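For reference, the usual formula behind this card, where R^2_{X_j|X_{-j}} is the R^2 from regressing X_j on all of the other predictors:

VIF(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j|X_{-j}}}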
Stability of process is a basic concern with processes that evolve over time
True
The accuracy of the prediction of y depends on both the reducible error and the irreducible error
True
The least squares line always passes through (xbar, ybar)
True
The variance of a statistical learning method increases as the method's flexibility increases
True
Tolerance is the reciprocal of VIF
True
Variance increases monotonically as flexibility increases
True
When fitting models to data with binary or count dependent variable, it is common to observe that the variance exceeds that anticipated by the fit of the mean parameters, called overdispersion
True
When inference is the goal, there are clear advantages to using a lasso method vs. a bagging method
True
The width of the prediction interval for a random walk grows as the forecast horizon l grows, reflecting our diminishing ability to predict far into the future
True
White noise process
a stationary process that displays no apparent patterns through time, i.i.d. series
Validation & Cross Validation
can be used to directly estimate the test error, has an advantage because it makes fewer assumptions about the true underlying model
What are 2 solutions for collinearity?
Drop one of the problematic variables, or combine the collinear variables together into a single predictor
Pearson goodness of fit stat
a measure of how well the model fits the data; if the model is specified correctly, the statistic should approximately equal n-(k+1)
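In the Poisson case this statistic takes the familiar chi-square form (a sketch, with \hat{\mu}_i the fitted mean for observation i):

\sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}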
Weakly Stationary
- E(yt) does not depend on t, [E(y4)=E(y8)] - covariance between ys and yt depends only on the difference between time units |t-s|, [Cov(y6,y8)=Cov(y4,y6)] - has constant mean and variance
Time Series analysis process
- goal of analysis is to go backwards, and decompose the series into the 3 components - each component can then be forecast, which will provide us with forecasts that are reasonable and easy to interpret
How do we identify a series as a realization from a random walk?
1. examine series and decide whether it's stationary (using control charts) 2. if series is nonstationary we can use control chart to detect linear trend in time and increasing variability as t gets larger 3. we can use control charts to detect lack of pattern in differences of the series (white noise process) 4. we can compare the standard deviation of the original series and the differenced series, if series can be represented by random walk, we expect a substantial reduction in the standard deviation when taking differences
RSE info
Considered a measure of the lack of fit of the model to the data, we want it to be small, it's measured in units of y, so it's not always clear what constitutes a good RSE
Backward stepwise selection
Begins with full model and then it iteratively removes the least useful predictor one by one; also not guaranteed to yield the best model and it requires that n>p so that full model can be fit
Choosing the Optimal model
We can indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting, or we can directly estimate the test error using either a validation set or a cross-validation approach
To test relationship between response and predictors, we check if the betas equal zero by
Computing the F statistic: F = (SSR/p) / (RSS/(n-p-1))
RSE
Estimate of the standard deviation of the error, the average amount that response will deviate from the true regression line
Bias of statistical learning method increases as the model's flexibility increases
False
Hierarchical Principle
If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
How large does the F stat need to be before we reject H0?
It depends on n and p; when n is large, an F stat that is just a little larger than 1 might still provide evidence against H0; a larger F stat is needed to reject H0 if n is small
Why is the white noise model both the least important and the most important of time series models?
Least important: the model assumes that the observations are unrelated to one another, which is unlikely for most series of interest Most important: our modeling efforts are directed toward reducing a series to a white noise process; after all patterns are filtered from the data, the uncertainty is irreducible
Can the training RSS and R squared be used to select from among a set of models with different numbers of variables
No
Problems that may occur when fitting a linear regression model
Non-linearity of the data; correlation of error terms; non-constant variance of error terms; outliers; high leverage points; collinearity
3 types of patterns in a time series
Trends in time (Tt); seasonal patterns (St); random or irregular patterns (et)
Heteroscedasticity
Presence of funnel shape in residual plot, variance of error terms may increase with the value of the response, a solution would be to transform the response y using a concave function
R-squared statistic
Provides an alternative measure of fit, takes form of a proportion, proportion of variance explained, always takes on values between 0 and 1, independent of scale of y
R squared =
SSR/TSS
A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity
True
It is clear the true relationship isn't additive if
the p-value for the interaction term is low; then there is strong evidence for the alternative hypothesis (an interaction effect is present)
R squared measures
The proportion of variability in Y that can be explained by using X, we want it to be close to 1
R squared in MLR
The square of the correlation between the response and the fitted linear model; an r squared close to 1 indicates that the model explains a large portion of the variance in the response variable
TSS measures
The total variance in the response Y, the amount of variability inherent in the response before regression is performed
Best subset selection
To perform it, we fit a separate least squares regression for each possible combination of the p predictors (2^p models in total)
When does F stat work?
When p is relatively small, and small compared to n; if p is greater than n, then there are more betas to estimate than observations to estimate them from, and in this case we cannot use least squares to build the model, so the F stat cannot be used
Models used to address issue of excess zeros
Zero inflated and hurdle models; the zero inflated model accommodates overdispersion, and the hurdle model accommodates both under- and overdispersion
Lasso Regression
an alternative to ridge regression that overcomes the disadvantage that ridge regression always generates a model involving all of the predictors (under ridge, increasing lambda will not result in the exclusion of any variables)
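The lasso criterion, which differs from ridge only in using an absolute-value (L1) penalty:

\sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|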
Seasonal (St)
aspects of the series that repeat themselves periodically
Forward Stepwise Selection
computationally efficient alternative to best subset selection, begins with null model, then adds predictors one by one until all are in the model; at each step the variable that gives the greatest additional improvement to the fit is added to the model
Trends in time (Tt)
long term, slow evolution, most important in long term forecasts
Seasonal adjustment
removal of seasonal patterns
Random or irregular patterns (et)
short term movements that are typically harder to anticipate
Shrinkage Methods
shrinking coefficient estimates can significantly reduce their variance, the two best known techniques for shrinking towards 0 are ridge regression and the lasso
Exposure
to extend the basic Poisson model, we allow the mean to vary in proportion to this known amount
Least squares regression is ridge regression with lambda equal to 0
true
Logistic Regression
we represent the linear combination of explanatory variables as the logit of the success probabilities
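In symbols, with \pi_i the success probability for observation i:

\text{logit}(\pi_i) = \ln\!\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}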
How do we deal with the drawbacks of the linear probability model?
we use alternative models in which we express the expectation of the response as a (nonlinear) function of the explanatory variables; there are two important cases: the logit and probit models