SRM Info

Properties of AR(1) model

E[Y] = β₀/(1 − β₁); Var[Y] = σ²/(1 − β₁²); lag-k autocorrelation ρₖ = β₁^k

Tweedie distribution

E[Y] = µ; Var[Y] = φµ^d, 1 < d < 2

Inverse Gaussian distribution

E[Y] = (−2θ)^(−1/2); Var[Y] = φ(−2θ)^(−3/2)

Heterogeneity models

E[Y|α] = µ = exp(α + x^Tβ), where exp(α) is a continuous random variable with E[exp(α)] = 1 and Y|α ~ Poisson

Ridge regression shrinks

coefficient estimates → reduces VARIANCE

White noise

collection of iid random variables; stationary. Forecast: predicted y(n+l) = ȳ. Standard error of the forecast: s_y√(1 + 1/n). PI = predicted y ± t·se, where the df for the t-value is based on the n used to predict.

used to select tuning parameter for shrinkage methods (ridge and lasso)

cross validation

R output for classification trees

deviance = −2∑∑ n(m,c)·ln(p(m,c)); residual mean deviance = deviance/(n − |T|), where |T| is the number of terminal nodes

k-fold cross validation requires fitting a model

e.g. 10-fold cross validation: split the data set into 10 folds; each fold serves as the test set once while the remaining 9 folds are used for training. Each observation is used for training 9 times and for testing 1 time. (See the sketch below.)
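
A minimal sketch of this 10-fold procedure, assuming scikit-learn is available; the data and the linear model are placeholders, not part of the original notes.

    # Hypothetical sketch of 10-fold cross validation with scikit-learn
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                               # placeholder predictors
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)   # placeholder response

    fold_mses = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])          # fit on the 9 training folds
        fold_mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))  # test on the held-out fold

    cv_error = np.mean(fold_mses)   # k-fold CV error = average of the k test MSEs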

advantages of trees

easy to interpret and explain; can be presented visually; handle categorical variables without the need for dummy variables; mimic human decision making

Anscombe Residual

ei = [t(yi) − E(t(Yi))]/√Var(t(Yi)), where t(·) is a function chosen so that t(Yi) is approximately normally distributed

a very negative residual has a small fitted value (in SLR) indicates

either a small or large explanatory variable value (depending on the sign of b₁), i.e. a point with large leverage

Linear exponential family

f(y) = exp[(yθ − b(θ))/φ + a(y, φ)]; E(Y) = b'(θ); Var(Y) = φb''(θ). Members: Normal, Binomial, Poisson, Gamma, Inverse Gaussian

hierarchical clustering is robust? true or false

false; hierarchical clustering is performed only once, so small changes in the data can change the result

as K increases (k-nearest neighbors)

flexibility decreases...

Ridge regression, as budget parameter s increases,

flexibility increases

pca serves as a tool

for data visualization

correlation-based distance should not be used for hierarchical clustering

for datasets with two features

Ridge Regression

goal is to minimize SSE + λ∑bj², equivalent to minimizing SSE subject to ∑bj² ≤ s (the budget); λ is inversely related to flexibility

Lasso Regression

goal is to minimize SSE + λ∑|bj|, equivalent to minimizing SSE subject to ∑|bj| ≤ s (the budget); λ is inversely related to flexibility. (A sketch comparing the two penalties follows.)
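
A hedged sketch of the two penalties using scikit-learn's Ridge and Lasso, assumed available; alpha plays the role of λ up to the library's scaling of the SSE term, and the data are simulated for illustration.

    # Sketch: ridge and lasso shrinkage; alpha corresponds to lambda (up to scaling conventions)
    import numpy as np
    from sklearn.linear_model import Ridge, Lasso, LinearRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    y = 3 * X[:, 0] + rng.normal(size=50)

    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=10.0).fit(X, y)   # penalizes sum of squared coefficients
    lasso = Lasso(alpha=0.5).fit(X, y)    # penalizes sum of absolute coefficients; can zero some out

    print(ols.coef_)      # unpenalized estimates
    print(ridge.coef_)    # shrunk toward zero, none exactly zero
    print(lasso.coef_)    # some coefficients exactly zero (variable selection)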

k means clustering is

greedy

trigonometric functions d should be at most

half the seasonal base

the principal component score for any given observation will

have different signs, but the same magnitude, under both approaches

negative binomial model is a special case of

the heterogeneity model, because a heterogeneity model can be constructed as a Poisson-gamma mixture and the negative binomial is a special case of that mixture

key differences between k-means and hierarchical clustering

hierarchical is performed once; k-means is performed many times. Clusters rely on these parameters: k-means: choice of K; hierarchical: linkage, number of clusters, dissimilarity measure; both: choice to standardize variables. (See the sketch below.)
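
A minimal sketch, assuming scikit-learn and scipy are available, contrasting the choices each method needs; the two-cluster data are simulated and not part of the original notes.

    # Sketch: k-means (choose K, many restarts) vs hierarchical clustering (choose linkage, dissimilarity, cut)
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
    X = StandardScaler().fit_transform(X)            # both methods: decide whether to standardize

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)      # k-means: choice of K; n_init reruns the greedy algorithm
    Z = linkage(X, method="complete")                                # hierarchical: choice of linkage + dissimilarity
    hc_labels = fcluster(Z, t=2, criterion="maxclust")               # choice of number of clusters = where to cut the dendrogram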

with high-dimensional data, ____ becomes unreliable

high dimensions = too flexible = too closely fitted model; SSE too small; R² and MSE not accurate; CIs off

# of terminal nodes and flexibility

higher number of terminal nodes = more flexible

box plot median

is the bold line in IQR box

curse of dimensionality

issues with high dimensions: the quantity of variables can dilute the quality of the data; more explanatory variables ≠ better model

PCA provides a

low-dimensional linear surface that is closest to the observations

canonical links

normal: µ
binomial: ln(µ/(n − µ))
Poisson: ln µ
negative binomial: ln(µ/(r + µ))
gamma: −1/µ
inverse Gaussian: −1/(2µ²)

if a poisson model with exposures is refit without exposures

the coefficients will be different

The first principal component is extracted by finding

the direction along which the data varies the most

when predictors are highly correlated (PCA)

the first principal component explains much of the variability in the dataset

the more severe the multicollinearity

the less reliable the regression coefficient estimates, but MSE is still reliable

an R^2 close to 1 indicates

the model suffers from high dimensions (too many predictors)

most flexible to least

think most interpretable to least, then flip; e.g. boosting (most flexible), linear regression, ridge regression (least flexible), where ridge regression is the most interpretable

which is superior, hierarchical or k-means?

neither, both have pros and cons

some key ideas for smoothing AR(1)

no linear trend in time; related to WLS; double smoothing can forecast data with a linear trend; Holt-Winters double exponential smoothing is a generalization of double smoothing

Cross Validation

- estimates test MSE with available data
- types: validation set approach, k-fold, LOOCV
- the validation set approach has unstable results and tends to overestimate test MSE
- validation set error has the most bias; LOOCV error has the highest variance

Link Functions

- Identity link: h(µ) = µ
- Logit link: h(µ) = ln(µ/(1 − µ))
- Log link: h(µ) = ln(µ)
- Canonical link: h(µ) = (b')⁻¹(µ)

if an explanatory variable is uncorrelated with all other explanatory variables, VIF would be

1

Three drawbacks of linear models

1. Heteroskedasticity
2. Meaningless residuals
3. Poor fitted values, because the model is arbitrary

Random forests

1. Create b bootstrapped samples from the original data.
2. Construct a tree for each bootstrapped sample with recursive binary splitting; at each split, consider a random subset of k variables.
3. Predict the response of a new observation by averaging (regression) or taking the mode (classification).
Properties: increasing b does NOT cause overfitting; decreasing k reduces the correlation between predictions; common choices are k = p/3 (regression) and k = √p (classification). (See the sketch below.)
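
A minimal sketch, assuming scikit-learn, of a random forest with k = p/3 features per split and the out-of-bag error; the data are simulated for illustration.

    # Sketch: random forest with b trees, each split considering a random subset of k predictors
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 6))                        # p = 6 predictors
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

    rf = RandomForestRegressor(
        n_estimators=500,      # b bootstrapped trees; increasing b does not cause overfitting
        max_features=2,        # k = p/3 predictors considered at each split (regression rule of thumb)
        oob_score=True,        # out-of-bag error as an estimate of test error
        random_state=0,
    ).fit(X, y)
    print(rf.oob_score_)       # OOB score (scikit-learn reports R^2 here, not MSE)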

Bagging

1. Create b bootstrapped samples from the original data.
2. Construct a tree for each bootstrapped sample with recursive binary splitting, considering all p variables at each split.
3. Predict the response of a new observation by averaging (regression) or taking the mode (classification).
Properties: increasing b does NOT cause overfitting; REDUCES VARIANCE; the OOB error is a valid estimate of test error; bagging is a special case of random forests (k = p).

Boosting

1. For k = 1, 2, ..., b: a. use recursive binary splitting to fit a tree with d splits to the current residuals zₖ; b. update zₖ by subtracting λfₖ(x).
2. Calculate the boosted model as f(x) = ∑λfₖ(x).
Properties: increasing b DOES cause overfitting; REDUCES BIAS; d controls complexity; λ controls the learning rate (slower learning, i.e. smaller λ, is preferred). (See the sketch below.)
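
A minimal sketch, assuming scikit-learn, of a boosted tree model in which learning_rate plays the role of λ and max_depth approximates the number of splits d; the data are simulated.

    # Sketch: boosting with a small learning rate and shallow trees
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(4)
    X = rng.normal(size=(200, 4))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

    gbm = GradientBoostingRegressor(
        n_estimators=1000,     # b trees; unlike bagging/forests, too many CAN overfit
        learning_rate=0.01,    # lambda: smaller = slower learning, usually paired with more trees
        max_depth=2,           # controls tree complexity (roughly the interaction depth d)
        random_state=0,
    ).fit(X, y)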

Deviance for testing GLM

1. Scaled deviance is good for comparing nested models.
2. Small deviance = good model.
3. The saturated model's deviance is 0.
Scaled deviance: D* = 2[l(sat) − l(b)]. Deviance statistic: D = φD*.

VIF

VIFj = 1/(1 − Rj²) = [sx²(n − 1)/MSE]·se(bj)²; tolerance is the reciprocal of VIF; ** a test for collinearity **; detects multicollinearity if VIF > 10. (See the sketch below.)
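
A minimal sketch, assuming scikit-learn, of computing VIFj by regressing xj on the other predictors; the collinear data are simulated for illustration.

    # Sketch: VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the remaining predictors
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(5)
    x1 = rng.normal(size=100)
    x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # deliberately collinear with x1
    x3 = rng.normal(size=100)
    X = np.column_stack([x1, x2, x3])

    def vif(X, j):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        return 1.0 / (1.0 - r2)

    print([round(vif(X, j), 1) for j in range(X.shape[1])])   # VIF > 10 flags multicollinearity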

likelihood ratio test

2[l(b_full) − l(b_reduced)]; reject H0 if the statistic ≥ the chi-square critical value; df = p_full − p_reduced

xbar chart

A plot of sample means over time used to assess whether a process is in control.

Logit and probit graphs

Are very similar

unreliable/unstable b's causes

D MM HHH ∙ Misspecified model equation ∙ Heteroscedasticity ∙ Dependent errors ∙ Multicollinearity ∙ High leverage points ∙ High dimensions

difference between CI and PI

CI is possible range for E(y|x) PI is possible range for y|x

Cp and AIC vs BIC

Cp and AIC choose the same model as optimal; BIC favors models with smaller p compared to Cp/AIC when n ≥ 8

when GLM has normally distributed response, identity link, and homoskedasticity

GLM = MLR; MLE estimates of b = OLS estimates of b; deviance = SSE

BIC formula

Linear model: [SSE + ln(n)·p·MSE]/[n·MSE]. Non-linear model: −2l(b) + ln(n)·(# of estimated parameters)

AIC formula

Linear model: [SSE + 2p·MSE]/[n·MSE]. Non-linear model: −2l(b) + 2·(# of estimated parameters). (A small worked computation of both criteria follows.)
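
A tiny worked computation of the linear-model AIC and BIC formulas above; the values of n, p, SSE, and MSE are made up for illustration.

    # Worked example of the linear-model AIC/BIC formulas on made-up numbers
    import math

    n, p = 100, 4          # observations and predictors (hypothetical)
    SSE, MSE = 250.0, 2.5  # hypothetical sums of squares from a fitted model

    AIC = (SSE + 2 * p * MSE) / (n * MSE)              # = (250 + 20) / 250 = 1.08
    BIC = (SSE + math.log(n) * p * MSE) / (n * MSE)    # = (250 + 46.05) / 250 ≈ 1.18
    print(AIC, BIC)        # BIC penalizes extra parameters more heavily once ln(n) > 2, i.e. n >= 8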

unreliable MSE potential causes

MHOH ∙ Misspecified model equation ∙ Heteroscedasticity ∙ Outliers ∙ High dimensions

regression coefficients with GLM are estimated with

MLE; the resulting estimates are approximately normally distributed

violations and issues

MRZ HDN MOLD ∙ Misspecified model equation ∙ Residuals with non-zero averages ∙ Heteroscedasticity ∙ Dependent errors ∙ Non-normal errors ∙ Multicollinearity ∙ Outliers ∙ High leverage points ∙ High-dimensional issues

when there is heteroskedasticity

MSE is unreliable; therefore adjusted R², the F statistic, etc. are unreliable

se(y+1)

√(MSE[1 + 1/n + (x(n+1) − xbar)²/∑(xi − xbar)²])

se(y)

√(MSE[1/n + (x* − xbar)²/∑(xi − xbar)²])

se(b0)

√(MSE[1/n + xbar²/∑(xi − xbar)²])

F- statistic formula

MSR/MSE = (SSR/p)/(SSE/(n-p-1))

best model for each use...

Nested: LPD (likelihood ratio, partial F test, deviance). Non-nested: ABR (AIC, BIC, R²adj). Predictors: chi-square goodness of fit = significance of predictors

stationary

Not related to time: the mean and variance do not depend on t and are equal across all values of t; the covariance between y_s and y_t depends only on t − s

AR(1)

Only the immediate past value of yt-1 is used to predict yt

a method that reduces dimensions

PCA regression and partial least squares

Pearson chi square statistic

Pearson residual: ei = (yi − µi)/√(φv(µi)). Pearson chi-square statistic: ∑ei².
- a large value = overdispersion is more severe
- to address overdispersion, inflate the variance with δ
- δ = Pearson chi-square stat/(n − p − 1)

Cook's distance

Di = [ei²hi]/[MSE(p + 1)(1 − hi)²]; when given the standardized residual ri: Di = ri²hi/[(p + 1)(1 − hi)], where ri = ei/(s√(1 − hi))

Leverage formula

SLR: hi = 1/n + (xi − x_bar)²/∑(xj − x_bar)²; MLR: hi = se(ŷi)²/MSE; 1/n ≤ hi ≤ 1; ∑hi = p + 1; leverages are the diagonal values of X(X^TX)^(-1)X^T. (See the sketch below.)
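
A minimal sketch, assuming numpy, that computes leverages as the diagonal of the hat matrix and checks the properties listed above; the design matrix is simulated.

    # Sketch: leverages as the diagonal of the hat matrix H = X (X'X)^{-1} X'
    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=30)
    X = np.column_stack([np.ones(30), x])          # design matrix with an intercept column

    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)                                 # leverages h_i

    print(h.min() >= 1 / 30, h.max() <= 1.0)       # each leverage lies between 1/n and 1
    print(np.isclose(h.sum(), X.shape[1]))         # sum of leverages = p + 1 (number of columns)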

another formula for SSR

SSR = b₁²∑(xi-x_bar)²

fixed seasonal effects - trigonometric funcion

St = ∑[β₁·sin + β₂·cos]; letting β₁ = a·cos(b) and β₂ = a·sin(b) rewrites each sin/cos pair as a single sine wave with amplitude a and phase shift b

s²y formula

TSS/(n-1)

when lambda = 0 ridge and lasso

are the same as the OLS estimates

inversions can happen with _____ linkage (hierarchical clustering)

centroid

Unconditional variance for ARCH and GARCH models

Var(ε)=θ/[1-∑γj - ∑δj]

weighted least squares

Var(ε) = σ²/w; equivalent to running OLS with √w·y as the response (and each predictor scaled by √w)

Lag k autocorrelation of white noise process

Will always be zero, for all lags k ≥ 1

Seasonal model

yt = b0 + b1·y(t−g) + ... + bp·y(t−pg); smoothed: yt = b0 + b1·t + St + error

Logit transformation requires range of

(0, 1); does not reduce heteroskedasticity; does not shrink the range of values of the dependent variable

Max-scaled R^2 = R²ms

[1-exp{2[l₀-l(b)]/n}]/[1-exp{2l₀/n}]

Mallows Cp

[SSE + 2p·MSE]/n

Pseudo-R^2 = R²pse

[l(b)-l₀]/[l(sat)-l₀]

R-chart

a chart used to monitor variability (time series)

PCA finds a low dimensional representation of

a dataset that contains as much variation as possible

Latent class model

a discrete mixture which models each subgroup with its own distribution: (1 − p)·P(Y=y|α=1) + p·P(Y=y|α=2), y = 0, 1, ..., where α indicates the subgroup

Negative binomial model (poisson dist)

arises as a Poisson-gamma mixture (the Poisson is a limiting case of the negative binomial); pmf = C(y + r − 1, r − 1)·π^r(1 − π)^y, y = 0, 1, ...; E(Y) = r(1 − π)/π; Var(Y) = r(1 − π)/π²; a log link is commonly used

in picking a purity measure, you want

a lower value

when residuals have a larger spread for larger predictions

a possible solution would be to transform y with a concave function

in random forests ____ variables are considered at each split

a subset of variables

Control Chart

a time-ordered diagram used to determine whether observed variations are abnormal; can also detect trends in time and non-stationarity

a small R^2 indicates

adding more predictors would likely improve the model

the Tweedie distribution can model

aggregate loss with Poisson frequency and Gamma severity

the likelihood ratio chi-squared statistic tests

H0: all non-intercept coefficients are zero vs. H1: at least one is non-zero; df = p − 1

in boosting ____ variables are considered at each split

all variables

Multicollinearity

an issue with every regression model; leads to inflated variance; no universal approach to handling it; can be eliminated by using orthogonal predictors

scatterplots can detect ______ type of relationship between two variables

any

balanced dendrogram indicates

average or complete linkage

matrix for coefficients (SLR)

b = (X^TX)^(-1)X^Ty; the column of all 1's corresponds to the intercept

PCA is not useful if variables are not correlated

because PCA is supposed to produce uncorrelated PCs; if the variables are already uncorrelated, it won't accomplish anything

AR(1) parameter estimates

b1 = ∑[(y(t−1) − ybar−)·(yt − ybar+)]/∑(y(t−1) − ybar−)²; b0 = ybar+ − b₁·ybar−; s² = (1/(n − 3))∑et²; Var[yt] = s²/(1 − b₁²). (See the sketch below.)
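
A minimal sketch, assuming numpy, of the AR(1) estimation formulas above applied to a simulated series; ybar− and ybar+ are the means of the lagged and current values.

    # Sketch: estimating an AR(1) model by regressing y_t on y_{t-1} (conditional least squares)
    import numpy as np

    rng = np.random.default_rng(7)
    n, beta0, beta1 = 200, 1.0, 0.6
    y = np.zeros(n)
    for t in range(1, n):                         # simulate an AR(1) series
        y[t] = beta0 + beta1 * y[t - 1] + rng.normal()

    y_lag, y_cur = y[:-1], y[1:]                  # ybar- uses y_1..y_{n-1}; ybar+ uses y_2..y_n
    b1 = np.sum((y_lag - y_lag.mean()) * (y_cur - y_cur.mean())) / np.sum((y_lag - y_lag.mean()) ** 2)
    b0 = y_cur.mean() - b1 * y_lag.mean()
    resid = y_cur - (b0 + b1 * y_lag)
    s2 = np.sum(resid ** 2) / (n - 3)             # s^2 with n - 3 degrees of freedom
    var_y = s2 / (1 - b1 ** 2)                    # estimated Var[y_t]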

AR(1) is a meandering process if

b1 >0

tweedie distribution is not suitable for count models

because it is part continuous and part discrete

Modeling trees

the best split minimizes the weighted average of the impurity at each node; cost complexity pruning finds the sequence of subtrees that minimizes (1/n)∑ nm·Im + λ|T|

ridge regression produces

biased estimators; this is the trade-off made in ridge regression in order to reduce the variance of the estimates

regression in high dimensions

can use lasso or ridge

ideal plots to study the distribution of a variable

box plot and qq plot

with a least squares regression, LOOCV estimate for test MSE

can be calculated by fitting a model once

Alternative Count models

can incorporate Poisson distributions while letting the mean differ from the variance of the response:
- negative binomial
- latent class model
- zero-inflated model
- hurdle model
- heterogeneity model
Which can let mean > variance? ** only the hurdle model **

PCR

if all principal components are used, PCR = OLS; addresses multicollinearity; reduces overfitting; recommended to standardize predictors; **** NOT useful for feature selection

the first principal component is the line

in a p-dimensional space that is closest to the observations

when simple linear regression residual plot is n shaped,

indicates a quadratic relationship between the response and the explanatory variable; therefore, the model is likely missing a key (squared) predictor

GLM facts about link functions

the inverse of the mean function is the link function; the choice of mean and variance functions drives inference for the linear exponential family; canonical link = inverse of the mean function

confidence interval = prediction interval when

irreducible error is zero

total within-cluster variation

is guaranteed to decrease as more iterations are performed until final assignment

transforming the response variable with a concave function

is not a reasonable way to deal with multicollinearity

clustering and categorical variables

k-means cannot handle categorical variables; hierarchical clustering can

modeling techniques that perform variable selection

lasso regression

zero inflated model is a special case of

latent class model

exploratory data analysis

learns about relationships between observations or features

bagging makes a model

less interpretable

ARCH model

models the conditional variance of a time series: σ²t = θ + γ₁ε²(t−1) + ... + γp·ε²(t−p), with θ > 0, γj ≥ 0, ∑γj < 1

To be free from multicollinearity

must use uncorrelated/orthogonal predictors. PCR uses principal components (which are orthogonal) as predictors. NOT: lasso, forward selection, or best subset selection

no backward selection when

n < p

disadvantages of trees

not robust; do not have the same degree of predictive accuracy as other statistical models

interaction depth

number of splits in a tree

we can cluster

observations on the basis of features OR features on the basis of observations

Binary Response Variable

odds = p/(1 − p)
Link functions: logit, probit (Φ⁻¹(µ)), complementary log-log (ln(−ln(1 − µ)))
Log-likelihood: l(β) = ∑[y·lnµ + (1 − y)·ln(1 − µ)]
Score equations: [∂/∂β]l(β) = ∑[xi·(y − µ)·µ′/(µ(1 − µ))] = 0
Deviance statistic: D = 2∑[y·ln(y/µ) + (1 − y)·ln((1 − y)/(1 − µ))]
Pearson residual: ei = (y − µ)/√(µ(1 − µ))
Pearson chi-square: ∑ei²
(See the sketch below.)
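
A hedged sketch of fitting a binary-response GLM with a logit link using statsmodels (assumed available); the data are simulated and the attribute names are statsmodels', not the notation of these notes.

    # Sketch: binary-response GLM (logit link) via maximum likelihood
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    x = rng.normal(size=200)
    p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))        # true mean via the inverse logit function
    y = rng.binomial(1, p)

    X = sm.add_constant(x)
    fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # MLE via iteratively reweighted least squares
    print(fit.params)          # coefficient estimates
    print(fit.deviance)        # deviance statistic D
    print(fit.pearson_chi2)    # sum of squared Pearson residuals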

Poisson distributions are inadequate at handling

overdispersion

sum of leverages must equal

p + 1

for poisson, when you address overdispersion

p-values change and standard errors change; there is no change in the estimated means; probabilities may change because the distribution is no longer Poisson

Random Walk

partial sums of a white noise process; not stationary. wt = yt − y(t−1); forecast: predicted y(n+l) = yn + l·w̄ (where w̄ estimates E[w]); se = s_w·√l; the differenced series is a white noise process; sample variance of the series > sample variance of the differenced series

Impurity measures

pure = contains observations from the same class. Gini and cross-entropy are sensitive to PURITY: good for TREE GROWING. The classification error rate is NOT sensitive to purity: good for TREE PRUNING.

best graph for non-linear model residuals

qq plot

deviance residuals are either positive or negative depending on

raw residuals

zero-inflated model (poisson dist)

requires selecting a discrete distribution valid on integers starting with 0, then constructs a discrete mixture: y = 0: (1 − p) + p·P(Y=0|α=2); y > 0: p·P(Y=y|α=2); p can be modeled with a logit link

hurdle model (poisson dist)

requires selecting a discrete distribution valid on integers starting with 1; distinguishes zero claims from non-zero: y = 0: (1 − p); y > 0: p·P(Y=y|α=2). Zero-truncated Poisson pmf: [1/(1 − e^(−λ))]·e^(−λ)λ^y/y!, y = 1, 2, ...; with zero truncation the mean can be greater than the variance

in pca, variables are

centered, i.e. xi − meani, and typically also scaled to have standard deviation one

info in PCA

scaling has a significant effect on the result of PCA; a scree plot is used to determine the number of PCs needed; loadings are only unique up to the sign

ideal model for studying relationship between two variables

scatter plot

inaccurate se's (ignoring MSE): potential causes

shrunken: dependent errors; inflated: multicollinearity

loadings are unique up to a sign flip - condition

sign flips should be consistent amongst all loadings

skewed dendrogram indicates

single linkage

in wrongly assuming MLR errors are independent (dependent errors)

standard errors are smaller than they should be, because there would be covariance terms unaccounted for

Goodness of fit test

stat = ∑(n_c − n·q_c)²/(n·q_c), where n_c is the observed count in interval c. Reject H0 if stat ≥ the chi-square critical value, df = w − g − 1, where w = # of mutually exclusive split intervals and g = # of free parameters

as realizations of a t-distribution..

studentized residuals can help identify outliers

within cluster variation

sum of squared pairwise Euclidean distances within the cluster, divided by the cluster size; always equals 2 × the sum of squared distances from each observation to the cluster centroid

if a GLM is adequate,

the scaled deviance is a realization of a chi-square distribution, because D* = 2[loglikelihood(model 1) − loglikelihood(model 2)] compares nested models (the fitted model is nested in the saturated model); for Poisson, deviance = scaled deviance

In SLR, the F-statistic is always..

the square of the t-statistic

when you see a funnel shaped residuals graph

there is heteroskedasticity; WLS works for funnels in either direction; if the spread goes from small on the left to big on the right, a concave transformation of y can be used too

you can combine all predictors with high VIF into one predictor

to deal with multicollinearity

unit root test

to identify if a random walk is a good fit for a time series; Dickey-Fuller & augmented Dickey-Fuller tests

to find subgroups based on PATTERNS, not numbers

use hierarchical clustering with correlation-based distance

Poisson Regression Model

uses log link: h(µ) = lnµ
Log-likelihood function: ∑[y·lnµ − µ − ln(y!)]
Score equations: [∂/∂β]l(β) = ∑xi(y − µ) = 0
Information matrix: I = ∑µxx^T
Deviance statistic: D = 2∑[y(ln(y/µ) − 1) + µ]
Pearson residual: ei = (y − µ)/√µ
Pearson chi-square stat: ∑(y − µ)²/µ
(See the sketch below.)
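
A hedged sketch of a Poisson regression with exposures in statsmodels (assumed available); consistent with the earlier card on exposures, dropping the exposure term changes the fitted coefficients. The data are simulated.

    # Sketch: Poisson regression with a log link; exposure enters via statsmodels' exposure argument
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(9)
    x = rng.normal(size=300)
    exposure = rng.uniform(0.5, 2.0, size=300)
    mu = exposure * np.exp(0.2 + 0.7 * x)          # E[Y] = exposure * exp(x'beta)
    y = rng.poisson(mu)

    X = sm.add_constant(x)
    fit = sm.GLM(y, X, family=sm.families.Poisson(), exposure=exposure).fit()
    print(fit.params)                        # roughly (0.2, 0.7); refitting without exposure= changes these
    print(fit.pearson_chi2 / fit.df_resid)   # rough overdispersion check (near 1 for a well-specified Poisson)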

GARCH Model

uses past conditional variances to model the conditional variance: σ²t = θ + γ₁ε²(t−1) + ... + γp·ε²(t−p) + δ₁σ²(t−1) + ... + δq·σ²(t−q)

if pca in a biplot is scaled,

variables have mean zero and variance one; therefore you cannot determine their variances from the plot

OLS bias is always

zero

AR(1) is a generalization of

white noise or a random walk model, depending on b1

Estimating moving average forecast

ŷ(t+l) = ŝ(t) + [2(ŝ(t) − ŝ⁽²⁾(t))/(k − 1)]·l, where ŝ⁽²⁾ is the doubly smoothed series

estimating AR(1) model

ŷt = b0 + b1·y(t−1); l-step forecast standard error: se = s√(1 + b1² + b1⁴ + ... + b1^(2(l−1)))

proportion of variance explained by a principal component

PVE of component m = ∑(scores for component m)² divided by the total sum of squares ∑∑xij²; if the variables are centered and scaled, the denominator equals n × (number of variables). (See the sketch below.)
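
A minimal sketch, assuming scikit-learn, that computes the proportion of variance explained from the squared scores and checks it against the library's explained_variance_ratio_; the data are simulated.

    # Sketch: proportion of variance explained by each principal component, on standardized data
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(10)
    X = rng.normal(size=(100, 4))
    X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)      # correlated columns so PC1 dominates

    Z = StandardScaler().fit_transform(X)               # center and scale each variable
    pca = PCA().fit(Z)
    scores = pca.transform(Z)

    pve_manual = (scores ** 2).sum(axis=0) / (Z ** 2).sum()   # sum of squared scores / total sum of squares
    print(pve_manual)
    print(pca.explained_variance_ratio_)                # should match, up to tiny numerical differences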

k-fold cross validation error

CV error = (1/k)∑MSEi; compare between models; lower = better

sum of squared one-step prediction errors

∑[yt − ŝ(t−1)]²

poor inferences (ignoring MSE and se) potential causes

∙ Misspecified model equation ∙ Non-normal errors

