Exam SRM

Common stats for comparing forecasts

ME, MPE, MSE, MAE, MAPE; we choose the forecast with the smallest values of these statistics
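
A minimal Python sketch (illustrative only; `actual` and `forecast` are assumed NumPy arrays of held-out values) computing these five statistics:

```python
import numpy as np

def forecast_stats(actual, forecast):
    """Common statistics for comparing forecasts against held-out actuals."""
    e = actual - forecast        # forecast errors
    pe = 100 * e / actual        # percentage errors
    return {
        "ME":   np.mean(e),            # mean error
        "MPE":  np.mean(pe),           # mean percentage error
        "MSE":  np.mean(e ** 2),       # mean squared error
        "MAE":  np.mean(np.abs(e)),    # mean absolute error
        "MAPE": np.mean(np.abs(pe)),   # mean absolute percentage error
    }
```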

F stat=

(t statistic) squared; when testing a single coefficient, the F statistic equals the square of the t statistic

Interaction depth d

the number of splits in each tree; controls the complexity of the boosted ensemble and the interaction order of the model, since d splits can involve at most d variables

ME

measures recent trends that aren't anticipated by the model

We constrain the loadings so that their sum of squares equal 1 (normalized)

true

With B sufficiently large, out of bag error is virtually equivalent to LOOCV

true

Gini index

a measure of total variance across the K classes; takes on a small value if all the pmk's are close to 0 or 1; a measure of node purity, since a small value indicates that a node contains predominantly observations from a single class

Disadvantages of trees

- do not have the same level of predictive accuracy as some of the other regression and classification approaches
- can be very non-robust: a small change in the data can cause a large change in the final estimated tree

Advantages of trees

- easy to explain
- closely mirrors human decision making
- can be displayed graphically and easily interpreted
- can easily handle qualitative predictors without the need for dummy variables

Random walk

- commonly used time series model
- for this model, we filter the data by taking differences
- the partial sums of a white noise process define a random walk model
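
A short simulation sketch (synthetic data, not from any source) showing that a random walk is the partial-sum series of a white noise process and that differencing filters it back to white noise:

```python
import numpy as np

rng = np.random.default_rng(1)
c = rng.normal(size=200)     # white noise process c1, ..., cT
y = np.cumsum(c)             # random walk: partial sums of the white noise

# Filtering by taking differences recovers the white noise series
diffs = np.diff(y)
print(np.allclose(diffs, c[1:]))   # True
```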

How does one recognize that an autoregressive model may be a suitable candidate model?

1. since the model is stationary, we can use a control chart to examine the data graphically and search for stability
2. adjacent realizations of an AR(1) model should be related, which can be detected using a scatterplot
3. we can recognize an AR(1) model through its autocorrelation structure

Autocorrelation structure of AR(1) model

1. the correlation between points k time units apart turns out to be b1^k
2. the absolute values of the autocorrelations of an AR(1) process become smaller as the lag k increases
3. to aid in model identification, we use the idea of matching the observed autocorrelations rk to the quantities pk that we expect from the theory
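
A quick simulation check (synthetic AR(1) data; b1 = 0.6 chosen arbitrarily) that the sample autocorrelations rk track the theoretical values b1^k:

```python
import numpy as np

rng = np.random.default_rng(42)
T, b0, b1 = 10_000, 0.0, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = b0 + b1 * y[t - 1] + rng.normal()   # AR(1) recursion

def lag_autocorr(y, k):
    """Sample lag-k autocorrelation r_k."""
    ybar = y.mean()
    return np.sum((y[k:] - ybar) * (y[:-k] - ybar)) / np.sum((y - ybar) ** 2)

for k in (1, 2, 3):
    print(k, round(lag_autocorr(y, k), 3), round(b1 ** k, 3))   # r_k vs. b1^k
```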

Why Stepwise Selection is better

Best subset selection cannot be applied with very large p; it also suffers from statistical problems because its large search space and ability to fit the training data so well can lead to overfitting and high variance of the predictions. Stepwise methods explore a far more restricted set of models.

When there is no relationship between response and predictors, we expect F-stat to be

Close to 1

R squared adjusted

RSS always decreases and R squared always increases as more variables are added; adjusted R squared pays a price for adding unnecessary variables to the model, so a large value indicates a model with a small test error
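
The standard adjusted R-squared formula behind this card, in LaTeX:

```latex
R^2_{adj} = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}
```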

Disadvantage of forward stepwise selection

Tends to do well in practice, but it is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors

RSS measures

The amount of variability that is left unexplained after performing the regression

Linear Assumption

The change in the response Y due to a one unit change in Xj is constant regardless of the value of Xj

Additive assumption

The effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors

Recursive Binary Splitting

Top down, greedy approach, because it is infeasible to consider every possible partition of feature space; we make the best split at that particular step instead of looking ahead and picking split that will lead to a better tree in some future step

AR(1) model can be viewed as a generalization of a white noise process and random walk model; b1=0, white noise process; b1=1, random walk

True

In forecasting, the primary concern is for the most recent part of the series

True

In generalized linear models, the choice of variance function, not the choice of the distribution, drives the most important inference properties

True

Link between longitudinal and cross sectional models can be established through the notion of white noise process

True

When analyzing longitudinal data, transformation is an important tool used to filter a dataset, because it helps to reduce increasing variability through time

True

Time series

measurements of a single process over time, denoted by y1, y2, ..., yT

Autoregressive model of order 2

a stationary process in which there is a linear relationship between yt and its two previous values, yt-1 and yt-2

Principal Component Analysis

a tool used for data visualization or data pre-processing before supervised techniques are applied; when faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set
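
A minimal scikit-learn sketch (random data standing in for any n-by-p feature matrix X) of computing scores, loadings, and the proportion of variance explained:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))             # placeholder feature matrix

Z = StandardScaler().fit_transform(X)     # center and scale before PCA
pca = PCA().fit(Z)
scores = pca.transform(Z)                 # principal component scores
loadings = pca.components_                # loading vectors (each row has unit norm)
pve = pca.explained_variance_ratio_       # proportion of variance explained (scree plot input)
```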

2 ways the three patterns can be combined

additive (T+S+e) and partially multiplicative (T*S+e)

Entropy

alternative to gini index, will take on a value near 0 if the pmk's are all near 0 or near 1
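
A small sketch of both node-impurity measures (formulas as described in the cards above; the class proportions are illustrative):

```python
import numpy as np

def gini(p):
    """Gini index for the class proportions pmk in one node."""
    p = np.asarray(p)
    return np.sum(p * (1 - p))

def entropy(p):
    """Entropy for the class proportions pmk in one node (0*log 0 treated as 0)."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))       # impure node: larger values
print(gini([0.99, 0.01]), entropy([0.99, 0.01]))   # nearly pure node: values near 0
```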

MSE

can detect more patterns than ME

MAE

can detect more trend patterns than ME; its units are the same as the dependent variable

MAPE

can detect more trend patterns than MPE; it examines error relative to the actual value

Shrinkage parameter lambda

controls the rate at which boosting learns, very small lambda can require using a very large value of B in order to achieve good performance

Alpha (CCP)

controls the tradeoff between the subtree's complexity and its fit to the training data; when alpha = 0, T = T0; as alpha increases, there is a price to pay for having a tree with many terminal nodes
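
A scikit-learn sketch (using a built-in dataset purely for illustration) of trading off alpha against cross-validated accuracy along the cost-complexity pruning path:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the pruning path of the unpruned tree T0
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Larger alpha penalizes terminal nodes more heavily, giving smaller subtrees
for alpha in path.ccp_alphas[:: max(1, len(path.ccp_alphas) // 5)]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    print(round(alpha, 4), round(cross_val_score(tree, X, y, cv=5).mean(), 3))
```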

Autocorrelation statistic (r1)

correlation of a series with itself; it summarizes the linear relationship between yt and yt-1; if r1 > 0, the process is positively autocorrelated, a positive relationship between yt and yt-1; we can then use yt-1 to explain yt in a regression model (autoregression)

Conditional least squares estimates of b0 and b1 are approximated for autoregressive models because

differences arise because we have no explanatory variable for y1, the first observation; the difference is typically small in most series and diminishes as the series length increases

Classification error is sufficiently sensitive for tree growing

false; classification error is not sufficiently sensitive for tree growing, so two other measures (the Gini index and entropy) are preferable

Increasing B in boosting doesn't lead to overfitting

false: we use cross validation to choose B

Average, complete, and single linkage are not most popular with statisticians

false; they are most popular

Drawbacks of Linear Probability Model

- fitted values can be poor
- heteroscedasticity
- residual analysis is meaningless

Negative Binomial advantage compared to Poisson

- has greater flexibility because it has 2 parameters
- Poisson is a limiting case of NB; it is nested within the NB distribution
- NB arises from a mixture of Poisson variables

How large does rk need to be to be considered statistically significantly different from 0 in absolute value?

if rk exceeds 2 se(rk) = 2(1/sqrt(T)) in absolute value, it may be considered significantly nonzero at approximately the 5% level of significance

Random forests

improvement over bagged trees by way of a small tweak that decorrelates the trees; each time a split is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors; we typically choose m= sqrt(p); on average (p-m)/p of the splits will not even consider the strong predictor, and so other predictors will have more of a chance
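
A minimal sketch with simulated data showing the m = sqrt(p) tweak via scikit-learn's max_features option, plus the out-of-bag score mentioned in an earlier card:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=16, noise=1.0, random_state=0)

# Considering only m = sqrt(p) predictors at each split decorrelates the trees
rf = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                           oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)   # out-of-bag R^2, an almost-free estimate of test performance
```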

Characteristics that apply to b0 and b1

least squares estimators, max-likelihood estimators

MPE

like ME, a measure of trend, but it examines error relative to the actual value

Longitudinal data

measurements of a process that evolves over time

ARCH and GARCH models

models that quantify and forecast changing volatility; the concept of changing variability over time seems at odds with our notions of stationarity, but we can allow for changing variances by conditioning on the past and still retain a weakly stationary model

Ordinal Dependent Variable

ordered categorical variable

Stochastic processes

ordered collections of random variables that quantify a process of interest

Filtering

a procedure used to reduce a series of observations to a white noise process

Autocorrelations and autoregressive models

provide techniques for detecting subtle trends in time and models to accommodate those trends

Bagging

reduces the variance of a statistical learning method; take repeated bootstrap samples from the single training set, build a separate prediction model using each bootstrapped training set, and average the resulting predictions
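
A short sketch of bagging on simulated data (scikit-learn's BaggingRegressor defaults to tree base learners; B = n_estimators):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

# Bootstrap the training set, fit a tree to each sample, average the predictions
bag = BaggingRegressor(n_estimators=500, oob_score=True, random_state=0)
bag.fit(X, y)
print(bag.oob_score_)   # with B large, the OOB error behaves like LOOCV error
```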

Fixed seasonal effects model

seasonal time series can be modeled using trigonometric functions; "fixed effects" means the relationships are constant over time, unlike exponential smoothing and autoregression, which help us model trends that change over time and emphasize recent events

yt = xt - xt-1

stationary mean

yt = ln(xt) - ln(xt-1)

stationary mean and variance

yt = ln(xt)

stationary variance

One step forecast residuals

the difference between the actual value and its fitted (one-step-ahead forecast) value

Strong Stationarity requires

that the entire distribution of yt be constant over time, not just the mean and variance

Principal components regression

to perform we simply use principal components as predictors in a regression model in place of the original larger set of variables

Boosting

trees are grown sequentially: each tree is grown using information from previously grown trees; no bootstrapping is involved; this approach learns slowly: given the current model, we fit a decision tree to the residuals from the model rather than the outcome Y; we then add this new decision tree into the fitted function in order to update the residuals
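
A from-scratch sketch of this slow-learning idea on simulated data (B, lambda, and the tree size are arbitrary choices; max_depth stands in for the interaction depth d):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

B, lam, d = 200, 0.01, 1           # number of trees, shrinkage lambda, tree depth
f_hat = np.zeros(len(y))           # current model, starts at 0
r = y.astype(float).copy()         # residuals, start as the outcome itself
trees = []

for b in range(B):
    tree = DecisionTreeRegressor(max_depth=d, random_state=b)
    tree.fit(X, r)                      # fit a small tree to the residuals, not to Y
    f_hat += lam * tree.predict(X)      # add a shrunken copy into the fitted function
    r = y - f_hat                       # update the residuals
    trees.append(tree)
```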

Any of the 3 approaches might be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal

true

As with bagging, random forests will not overfit if we increase B, so in practice we use a value of B sufficiently large for the error rate to have settled down

true

Average and complete are generally preferred over single because they tend to yield more balanced dendrograms

true

Bagging improves prediction accuracy at the expense of interpretability

true

Bagging, random forest, and boosting can improve predictive performance of trees

true

Constraining Z2 to be uncorrelated with Z1 is equivalent to constraining the direction of phi2 to be orthogonal to the direction of phi1

true

For classification trees, we are often not only interested in the class prediction corresponding to a particular terminal node region, but also in the class proportions among the training observations that fall into that region

true

For a white noise process, b1 = 0; thus pk = 0 for all lags k

true

Like running averages, exponential smoothing estimates provide greater weight to more recent observations; however the weight function is smooth

true

Principal component loading vectors are the directions in feature space along which the data vary the most, and the principal component scores are the projections along these directions

true

RSS cannot be used as the criterion for making binary splits in classification trees; instead we use the classification error rate

true

The first principal component of a set of features X1, X2,...,Xp is the normalized linear combination of the features that has the largest variance

true

The number of trees B is not a critical parameter with bagging, using a very large value of B will not lead to overfitting

true

The second principal component is the linear combination of X1, X2,..., Xp that has maximal variance out of all linear combinations that are uncorrelated with the first principal component

true

We typically decide the number of principal components needed to explain a sizable amount of the variation in the data by using a scree plot; we choose the smallest number of principal components required to explain a sizable proportion of the variance; we eyeball the scree plot and look for the elbow, the point at which the PVE drops off

true

Nominal Dependent Variable

unordered categorical variable

Ridge Regression

very similar to least squares regression except the coefficients are estimated by minimizing a slightly different quantity; the tuning parameter lambda serves to control the relative impact of the shrinkage penalty on the regression coefficient estimates; as lambda increases, the ridge coefficient estimates shrink towards 0
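
A minimal sketch on simulated data (scikit-learn calls the tuning parameter `alpha`) showing the coefficients shrinking toward 0 as lambda grows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # standardize so the penalty treats predictors equally

for lam in (0.01, 1, 100, 10_000):      # lambda near 0 is essentially least squares
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(lam, np.round(coefs, 2))      # estimates shrink toward 0 as lambda increases
```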

Logit threshold interpretation

we do not observe the propensity, but we observe when the propensity crosses a threshold

Cost Complexity Pruning

weakest link pruning, instead of considering every subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter alpha

Autoregressive model of order 1

when only the immediate past is used as a predictor; b0 may be any fixed constant, but b1 is restricted to be between -1 and 1 so that the AR(1) series yt is stationary

If there is a relationship between response and predictors, we expect F stat to be

Greater than 1, providing evidence in favor of the alternative hypothesis

As K increases, flexibility decreases, bias increases, the method produces a decision boundary close to linear, and variance decreases (for K-nearest neighbors)

True

As long as there is some variability in the white noise process (variance of c greater than 0), the random walk is nonstationary in the variance

True

Can use binary variables or trig functions to capture seasonal effects

True

Collinearity reduces the accuracy of the estimates of the regression coefficients, so the standard error of the betas will grow, t stat will decline, and the power of the hypothesis test is reduced

True

For lasso regression, as lambda increases, flexibility decreases, variance decreases, squared bias increases, training error increases, and test error follows a u shape

True

Forecast of a future value of a white noise process is just the average of the process

True

Forward stepwise selection is the only viable subset method when p is very large

True

If model is purely multiplicative, yt=T*S*e, it can be made additive by taking logs of both sides

True

In Poisson regression models, we anticipate heteroscedastic dependent variables, which means ordinary residuals are useless so we use Pearson residuals

True

In theory, the models with the largest r squared adjusted will have only correct variables and no noise variables

True

K-fold cross validation with k < n has a computational advantage over LOOCV and often gives more accurate estimates of the test error rate because of the bias-variance tradeoff

True

LOOCV has far less bias than validation set approach, it tends not to overestimate the test error as much

True

Logistic Regression is parametric, but regression trees and KNN are not

True

Once all the correct variables have been included in the model, adding additional noise variables will lead to only a small decrease in RSS

True

Prediction intervals are always wider than confidence intervals because they incorporate both the reducible error and the irreducible error

True

Random walk is not a stationary process because variability and possibly the mean depends on time point at which the series is observed

True

Regular periodic behavior is often found in business and economic data

True

Removing the high leverage observation tends to have a much more substantial impact on the least squares line than removing the outlier

True

Residuals (e1, e2,..) must sum to zero

True

Ridge regression works best in situations where the least squares estimates have high variance

True

Smallest possible value for VIF=1, means complete absence of collinearity

True

Stability of process is a basic concern with processes that evolve over time

True

The accuracy of the prediction of y depends on both the reducible error and the irreducible error

True

The least squares line always passes through (xbar, ybar)

True

The variance of a statistical learning method increases as the method's flexibility increases

True

Tolerance is the reciprocal of VIF

True

Variance increases monotonically as flexibility increases

True

When fitting models to data with binary or count dependent variable, it is common to observe that the variance exceeds that anticipated by the fit of the mean parameters, called overdispersion

True

When inference is the goal, there are clear advantages to using a lasso method vs. a bagging method

True

The width of the prediction interval for a random walk grows as the forecast horizon l grows, reflecting our diminishing ability to predict far into the future

True

White noise process

a stationary process that displays no apparent patterns through time, i.i.d. series

Validation & Cross Validation

can be used to directly estimate the test error, has an advantage because it makes fewer assumptions about the true underlying model

What are 2 solutions for collinearity?

Drop one of the problematic variables, or combine the collinear variables into a single predictor

Pearson goodness of fit stat

a measure of how well the model fits the data; if the model is specified correctly, the statistic should approximately equal n - (k+1)
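
The statistic behind this card for a Poisson fit, in LaTeX (standard form; mu-hat is the fitted mean):

```latex
\chi^2_{Pearson} = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}
\approx n - (k + 1) \quad \text{when the model is specified correctly}
```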

Weakly Stationary

- E(yt) does not depend on t, e.g. E(y4) = E(y8)
- the covariance between ys and yt depends only on the difference between time units |t-s|, e.g. Cov(y6,y8) = Cov(y4,y6)
- has constant mean and variance

Time Series analysis process

- the goal of the analysis is to go backwards and decompose the series into the 3 components
- each component can then be forecast, which provides forecasts that are reasonable and easy to interpret

How do we identify a series as a realization from a random walk?

1. examine the series and decide whether it is stationary (using control charts)
2. if the series is nonstationary, we can use a control chart to detect a linear trend in time and increasing variability as t gets larger
3. we can use control charts to detect the lack of pattern in the differences of the series (white noise process)
4. we can compare the standard deviation of the original series and the differenced series; if the series can be represented by a random walk, we expect a substantial reduction in the standard deviation when taking differences (see the sketch below)
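
A tiny simulation sketch of step 4 (synthetic random walk; the exact numbers will vary): the standard deviation drops substantially after differencing:

```python
import numpy as np

rng = np.random.default_rng(7)
series = np.cumsum(rng.normal(size=500))   # simulated random walk
diffs = np.diff(series)                    # the filtered (differenced) series

# A substantial reduction in standard deviation suggests a random walk representation
print(round(series.std(), 2), round(diffs.std(), 2))
```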

RSE info

Considered a measure of the lack of fit of the model to the data, we want it to be small, it's measured in units of y, so it's not always clear what constitutes a good RSE

Backward stepwise selection

Begins with full model and then it iteratively removes the least useful predictor one by one; also not guaranteed to yield the best model and it requires that n>p so that full model can be fit

Choosing the Optimal model

Can indirectly estimate test error by making an adjustment to the training error to account for bias due to over fitting; can directly estimate the test error, using either a validation set or cross validation set approach

To test relationship between response and predictors, we check if the betas equal zero by

Computing the F stat = (SSR/p) / (RSS/(n - p - 1)), where SSR = TSS - RSS

RSE

Estimate of the standard deviation of the error, the average amount that response will deviate from the true regression line

Bias of statistical learning method increases as the model's flexibility increases

False

Hierarchical Principle

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant

How large does the F stat need to be before we reject H0?

It depends on n and p; when n is large, an F stat that is just a little larger than 1 might still provide evidence against H0; a larger F stat is needed to reject H0 if n is small

Why is the white noise model both the least important and the most important of time series models?

Least important: the model assumes that the observations are unrelated to one another, which is unlikely for most series of interest
Most important: our modeling efforts are directed toward reducing a series to a white noise process; after all patterns are filtered from the data, the remaining uncertainty is irreducible

Can the training RSS and R squared be used to select from among a set of models with different numbers of variables

No

Problems that may occur when fitting a linear regression model

- non-linearity of the data
- correlation of error terms
- non-constant variance of error terms
- outliers
- high leverage points
- collinearity

3 types of patterns in a time series

- Trends in time (Tt)
- Seasonal (St)
- Random or irregular patterns (et)

Heteroscedasticity

Presence of a funnel shape in the residual plot; the variance of the error terms may increase with the value of the response; a solution would be to transform the response Y using a concave function such as log Y or sqrt(Y)

R-squared statistic

Provides an alternative measure of fit, takes form of a proportion, proportion of variance explained, always takes on values between 0 and 1, independent of scale of y

R squared =

SSR/TSS = (TSS - RSS)/TSS = 1 - RSS/TSS

A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity

True

It is clear the true relationship isn't additive if,

The p-value for the interaction term is low, so there is strong evidence for the alternative hypothesis

R squared measures

The proportion of variability in Y that can be explained by using X, we want it to be close to 1

R squared in MLR

The square of the correlation between the response and the fitted linear model; an r squared close to 1 indicates that the model explains a large portion of the variance in the response variable

TSS measures

The total variance in the response Y, the amount of variability inherent in the response before regression is performed

Best subset selection

To perform we fit a separate least squares regression for each possible combination of the p predictors

When does F stat work?

When p is relatively small, and small compared to n; if p is greater than n, then there are more betas to estimate than observations to estimate from, in this case we cannot use least squares to build model, so F stat cannot be used

Models used to address issue of excess zeros

Zero-inflated and hurdle models; the zero-inflated model accommodates overdispersion, and the hurdle model accommodates both under- and overdispersion

Lasso Regression

alternative to ridge regression that overcomes disadvantage that ridge regression will always generate a model involving all predictors, and increasing lambda will not result in exclusion of any variables

Seasonal (St)

aspects that repeat themselves periodically

Forward Stepwise Selection

computationally efficient alternative to best subset selection, begins with null model, then adds predictors one by one until all are in the model; at each step the variable that gives the greatest additional improvement to the fit is added to the model

Trends in time (Tt)

long term, slow evolution, most important in long term forecasts

Seasonal adjustment

removal of seasonal patterns

Random or irregular patterns (et)

short term movements that are typically harder to anticipate

Shrinkage Methods

shrinking coefficient estimates can significantly reduce their variance, the two best known techniques for shrinking towards 0 are ridge regression and the lasso

Exposure

to extend basic Poisson model, we allow mean to vary by this known amount

Ridge regression with lambda equal to 0 is just least squares regression

true

Logistic Regression

we represent the linear combination of explanatory variables as the logit of the success probabilities

How do we deal with the drawbacks of the linear probability model?

we use alternative models in which we express the expectation of the response as a function of explanatory variables, there are two cases: logit and probit
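
A minimal statsmodels sketch on simulated binary data (the coefficients and sample size are made up) fitting the two cases, logit and probit:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=300)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))     # true success probabilities (logit form)
y = rng.binomial(1, p)

X = sm.add_constant(x)
logit_fit = sm.Logit(y, X).fit(disp=0)     # logit case
probit_fit = sm.Probit(y, X).fit(disp=0)   # probit case
print(logit_fit.params, probit_fit.params)
```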

