SRM Info
Properties of AR(1) model
E[Y] = β₀/(1−β₁), Var[Y] = σ²/(1−β₁²), ρₖ = β₁ᵏ (lag-k autocorrelation)
Tweedie distribution
E[Y] = µ, Var[Y] = φµᵈ, 1 < d < 2
Inverse Gaussian distribution
E[Y] = (−2θ)^(−1/2), Var[Y] = φ(−2θ)^(−3/2)
Heterogeneity models
E[Y|α] = µ = exp(α + xᵀβ), where exp(α) is a continuous random variable with E[exp(α)] = 1, and Y|α ~ Poisson
Ridge regression shrinks
coefficient estimates → reduces VARIANCE
White noise
collection of iid random variables; stationary. Forecast: ŷ(n+l) = ȳ; se of forecast = s_y√(1 + 1/n); PI = forecast ± t·se ** df for t is based on the n observations used to predict
used to select tuning parameter for shrinkage methods (ridge and lasso)
cross validation
R output for classification trees
deviance = −2∑∑n(m,c)·ln(p(m,c)); residual mean deviance = deviance/(n − g), where g = number of terminal nodes
k-fold cross validation requires fitting a model
e.g., 10-fold cross validation: split the data set into 10 folds; each fold serves as the test set once while the remaining 9 are used for training, so each fold is used in training 9 times and as the test set 1 time
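A minimal sketch of 10-fold cross-validation on made-up data (LinearRegression is just a placeholder model):
```python
# Hypothetical sketch of 10-fold CV on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mses = []
for train_idx, test_idx in kf.split(X):
    # each fold is the test set once; the other 9 folds form the training set
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

cv_error = np.mean(fold_mses)  # k-fold CV estimate of test MSE
```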
advantages of trees
easy to interpret and explain; can be presented visually; handle categorical variables without the need for dummy variables; mimic human decision making
Anscombe Residual
ei = [t(yi) − E(t(Yi))]/√Var(t(Yi)), where t(·) is a function chosen so that t(Yi) is approximately normally distributed
a very negative residual has a small fitted value (in SLR) indicates
either a small or a large value of the explanatory variable (depending on the sign of b₁), i.e., potentially large leverage
Linear exponential family
f(y) = exp[(yθ − b(θ))/φ + a(y, φ)]; E[Y] = b′(θ); Var[Y] = φb″(θ); members: Normal, Binomial, Poisson, Gamma, Inverse Gaussian (mnemonic: N G B/P I.G.)
hierarchical clustering is robust? true or false
false; it is only performed once
as K increases (k-nearest neighbors)
flexibility decreases...
Ridge regression, as budget parameter s increases,
flexibility increases
pca serves as a tool
for data visualization
correlation-based distance should not be used for hierarchical clustering
for datasets with two features
Ridge Regression
goal is to minimize SSE + λ∑bj², or equivalently minimize SSE subject to the budget constraint ∑bj² ≤ s; λ is inversely related to flexibility
Lasso Regression
goal is to minimize SSE + λ∑|bj|, or equivalently minimize SSE subject to the budget constraint ∑|bj| ≤ s; λ is inversely related to flexibility
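A minimal sketch of both shrinkage methods in scikit-learn; the alpha argument plays the role of λ, and the data are made-up placeholders:
```python
# Hypothetical sketch: ridge vs lasso on made-up data; alpha plays the role of lambda.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -1.5, 0.0, 0.5]) + rng.normal(size=200)

ridge = Ridge(alpha=10.0).fit(X, y)  # penalizes sum of bj^2: coefficients shrink toward 0
lasso = Lasso(alpha=0.5).fit(X, y)   # penalizes sum of |bj|: some coefficients may hit exactly 0

print(ridge.coef_)  # shrunken but typically all nonzero
print(lasso.coef_)  # some may be exactly zero, so lasso can perform variable selection
```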
k means clustering is
greedy
trigonometric functions d should be at most
half the seasonal base
the principal component score for any given observation will
have opposite signs but the same magnitude under the two approaches (loadings are unique only up to a sign flip)
negative binomial model is a special case of
the heterogeneity model, because the heterogeneity model can be a Poisson-gamma mixture and the negative binomial is exactly that mixture
key differences between k-means and hierarchical clustering
hierarchical clustering is performed once; k-means is performed many times. Clusters rely on these parameters - k-means: choice of k; hierarchical: linkage, number of clusters (where to cut), dissimilarity measure; both: choice to standardize variables
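A minimal sketch contrasting the two approaches on made-up two-feature data; the linkage choice and k = 2 are illustrative assumptions:
```python
# Hypothetical sketch contrasting k-means and hierarchical clustering.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(4, 1, (25, 2))])
X = StandardScaler().fit_transform(X)  # both methods: choice to standardize variables

# k-means: choose k up front; run many random starts (n_init), keeping the best assignment
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# hierarchical: performed once; choose linkage and dissimilarity, then cut the dendrogram
Z = linkage(X, method="complete", metric="euclidean")
hc_labels = fcluster(Z, t=2, criterion="maxclust")
```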
with high-dimensional data, ____ becomes unreliable
high dimensions = too flexible = too closely fitted model; SSE too small; R² and MSE not accurate; CIs off
# of terminal nodes and flexibility
higher number of terminal nodes = more flexible
box plot median
is the bold line in IQR box
curse of dimensionality
issues with high dimensions: the quantity of variables can dilute the quality of the data; more explanatory variables ≠ a better model
PCA provides a
low-dimensional linear surface that is closest to the observations
canonical links
normal: µ; binomial: ln(µ/(n−µ)); poisson: ln µ; negative binomial: ln(µ/(r+µ)); gamma: −1/µ; inverse Gaussian: −1/(2µ²)
if a Poisson model with exposures is refit without exposures
the coefficients will be different
The first principal component is extracted by finding
the direction along which the data varies the most
when predictors are highly correlated (PCA)
the first principal component explains much of the variability in the dataset
the more severe the multicollinearity
the less reliable the regression coefficient estimates, but MSE is still reliable.
an R^2 close to 1 indicates
possible overfitting from high dimensions (too many predictors)
most flexible to least
think most interpretable to least, then flip; i.e., boosting (most flexible), linear regression, ridge regression (least flexible), where ridge regression is the most interpretable
which is superior, hierarchical or k-means?
neither, both have pros and cons
some key ideas for smoothing AR(1)
no linear trend in time; related to WLS; double smoothing can forecast data with a linear trend; Holt-Winters double exponential smoothing is a generalization of double smoothing
Cross Validation
- estimates test MSE with available data - types: validation set approach, k-fold, LOOCV - the validation set approach has unstable results and tends to overestimate test MSE - validation set error has the most bias; LOOCV error has the highest variance
Link Functions
- identity link: h(µ) = µ - Logit link: h(µ) = ln(µ/(1−µ)) - Log link: h(µ) = ln(µ) - Canonical link: h(µ) = (b′)⁻¹(µ)
if an explanatory variable is uncorrelated with all other explanatory variables, VIF would be
1
Three drawbacks of linear models
1. Heteroskedasticity 2. Meaningless residuals 3. Poor fitted values, because the linear model form is arbitrary
Random forests
1. create B bootstrapped samples from the original data 2. construct a tree for each bootstrapped sample with recursive binary splitting; at each split consider only k variables from a random subset 3. predict the response of a new observation by averaging (regression) or taking the mode (classification). Properties: increasing B does NOT cause overfitting; decreasing k reduces correlation between predictions; good choices are k = p/3 (regression) and k = √p (classification)
Bagging
1. create B bootstrapped samples from the original data 2. construct a tree for each bootstrapped sample with recursive binary splitting, considering all p variables at each split 3. predict the response of a new observation by averaging or taking the mode. Properties: increasing B does NOT cause overfitting; REDUCES VARIANCE; the OOB error is a valid estimate of test error; bagging is a special case of random forests (k = p)
Boosting
1. for k = 1, 2, ..., B: a. use recursive binary splitting to fit a tree fₖ with d splits to the current residuals zₖ b. update zₖ by subtracting λfₖ(x) 2. calculate the boosted model as f(x) = ∑λfₖ(x). Properties: increasing B DOES cause overfitting; REDUCES BIAS; d controls complexity; λ controls the learning rate (want slower learning, i.e., smaller λ)
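A minimal sketch of the three ensemble methods (bagging, random forest, boosting) using scikit-learn; the data and the n_estimators, max_features, max_depth, and learning_rate values are illustrative assumptions, not recommendations:
```python
# Hypothetical sketch of bagging, random forest, and boosting on made-up data.
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# bagging: B bootstrapped trees, all predictors considered at each split
bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)

# random forest: like bagging, but only a random subset of predictors per split (max_features ~ p/3)
rf = RandomForestRegressor(n_estimators=100, max_features=2, random_state=0).fit(X, y)

# boosting: trees fit sequentially to residuals; d = max_depth, lambda = learning_rate
boost = GradientBoostingRegressor(n_estimators=100, max_depth=2, learning_rate=0.05).fit(X, y)
```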
Deviance for testing GLM
1. scaled deviance is good for comparing nested models 2. small deviance = good model 3. the saturated model has deviance 0. Scaled deviance: D* = 2[l(sat) − l(b)]; deviance statistic: D = φD*
VIF
VIFj = 1/[1 − Rj²] = [sx²(n − 1)·se(bj)²]/MSE; tolerance is the reciprocal of VIF ** test for collinearity ** detects multicollinearity if VIF > 10
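A minimal sketch of the VIFj = 1/(1 − Rj²) computation on made-up data with two nearly collinear predictors:
```python
# Hypothetical sketch of VIF on made-up collinear data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

vifs = []
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2_j = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])  # Rj^2
    vifs.append(1.0 / (1.0 - r2_j))

print(vifs)  # VIF > 10 for x1 and x2 flags severe multicollinearity
```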
likelihood ratio test
2[l(b_full) − l(b_reduced)]; reject H₀ if the statistic ≥ the chi-square critical value; df = p_full − p_reduced
xbar chart
A plot of sample means over time used to assess whether a process is in control.
Logit and probit graphs
Are very similar
unreliable/unstable b's causes
D MM HHH ∙ Misspecified model equation ∙ Heteroscedasticity ∙ Dependent errors ∙ Multicollinearity ∙ High leverage points ∙ High dimensions
difference between CI and PI
CI is possible range for E(y|x) PI is possible range for y|x
Cp and AIC vs BIC
Cp and AIC choose the same model as optimal; BIC favors models with smaller p compared to Cp/AIC when n ≥ 8 (since ln n > 2)
when GLM has normally distributed response, identity link, and homoskedasticity
GLM = MLR; MLE estimates of b = OLS estimates of b; Deviance = SSE
BIC formula
Linear model: [SSE + ln(n)·p·MSE]/[n·MSE]; non-linear model: −2l(b) + ln(n)·(# of estimated parameters)
AIC formula
Linear model: [SSE + 2p·MSE]/[n·MSE]; non-linear model: −2l(b) + 2·(# of estimated parameters)
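A minimal sketch implementing the linear-model AIC/BIC formulas from the two cards above; here p = number of predictors excluding the intercept and MSE = SSE/(n − p − 1), which are assumed conventions for illustration:
```python
# Hypothetical sketch of the linear-model AIC/BIC formulas.
import numpy as np

def aic_bic_linear(sse, n, p):
    mse = sse / (n - p - 1)
    aic = (sse + 2 * p * mse) / (n * mse)
    bic = (sse + np.log(n) * p * mse) / (n * mse)
    return aic, bic

print(aic_bic_linear(sse=120.0, n=50, p=3))  # smaller is better for both criteria
```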
unreliable MSE potential causes
MHOH ∙ Misspecified model equation ∙ Heteroscedasticity ∙ Outliers ∙ High dimensions
regression coefficents with GLM are estimated with
maximum likelihood estimation (MLE); the resulting estimators are approximately normally distributed
violations and issues
MRZ. HDN MOLD Misspecified model equation Residuals with non-zero averages Heteroscedasticity Dependent errors Non-normal errors Multicollinearity Outliers High leverage points High-dimensional issues
when there is heteroskedasticity
MSE is unreliable; therefore adjusted R², the F statistic, etc. are unreliable.
se(y+1)
√(MSE[1 + 1/n + (x(n+1) − xbar)²/∑(xi − xbar)²])
se(y)
√(MSE[1/n + (x* − xbar)²/∑(xi − xbar)²])
se(b0)
√(MSE[1/n + xbar²/∑(xi − xbar)²])
F- statistic formula
MSR/MSE = (SSR/p)/(SSE/(n-p-1))
best model for each use...
Nested: LPD (likelihood ratio, partial F test, deviance); non-nested: ABR (AIC, BIC, adjusted R²); significance of predictors: chi-square goodness-of-fit test
stationary
Not related to time; therefore, the mean and variance should not depend on t and should be equal across all time points; the covariance between ys and yt depends only on the lag t − s
AR(1)
Only the immediate past value of yt-1 is used to predict yt
a method that reduces dimensions
principal components regression (PCR) and partial least squares (PLS)
Pearson chi square statistic
Pearson residual = ei = (yi − µi)/√(φv(µi)); Pearson chi-square statistic = ∑ei² - a large value means overdispersion is more severe - to address overdispersion, inflate the variance with δ - δ = Pearson chi-square stat/(n − p − 1)
Cook's distance
Di = ei²hi/[MSE(p + 1)(1 − hi)²]; when given the standardized residual ri = ei/(s√(1 − hi)): Di = ri²hi/[(p + 1)(1 − hi)]
Leverage formula
SLR: hi = 1/n + (xi − xbar)²/∑(xj − xbar)²; MLR: hi = se(ŷi)²/MSE; 1/n ≤ hi ≤ 1; ∑hi = p + 1; the hi are the diagonal values of X(XᵀX)⁻¹Xᵀ
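A minimal numpy sketch of the leverage and Cook's distance formulas above, on made-up data:
```python
# Hypothetical sketch: leverages (diagonal of the hat matrix) and Cook's distance.
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept column
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                                   # leverages; h.sum() equals p + 1
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b                                    # residuals
mse = e @ e / (n - p - 1)
cooks_d = e**2 * h / (mse * (p + 1) * (1 - h) ** 2)
```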
another formula for SSR
SSR = b₁²∑(xi-x_bar)²
fixed seasonal effects - trigonometric function
St = ∑i[β₁i·sin(2πit/SB) + β₂i·cos(2πit/SB)], where SB is the seasonal base; let β₁ = a·cos(b), β₂ = a·sin(b)
s²y formula
TSS/(n-1)
when lambda = 0 ridge and lasso
are the same: both equal the OLS estimates
inversions can happen with _____ linkage (hierarchical clustering)
centroid
Unconditional variance for ARCH and GARCH models
Var(ε)=θ/[1-∑γj - ∑δj]
weighted least squares
Var(εi) = σ²/wi; equivalent to running OLS with √wi·yi as the response and √wi·xi as the predictors
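A minimal sketch of the WLS-as-transformed-OLS equivalence, assuming known weights wi and made-up data:
```python
# Hypothetical sketch: WLS via OLS on sqrt(w)-scaled response and predictors.
import numpy as np

rng = np.random.default_rng(6)
n = 80
x = rng.uniform(1, 5, size=n)
w = 1.0 / x                                       # assumed known weights
y = 2.0 + 3.0 * x + rng.normal(scale=np.sqrt(1.0 / w))

X = np.column_stack([np.ones(n), x])
sw = np.sqrt(w)
b_wls = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]  # OLS on transformed data
print(b_wls)  # matches the estimates from the weighted normal equations
```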
Lag k autocorrelation of white noise process
Will always be zero, for all lags k
Seasonal model
Yt = b0 + b1Y(t-g) + ... + bpY(t-pg) smoothed: Yt = b0 + b1t + St + error
Logit transformation requires range of
(0, 1); does not reduce heteroskedasticity; does not shrink the range of values of the dependent variable
Max-scaled R^2 = R²ms
[1-exp{2[l₀-l(b)]/n}]/[1-exp{2l₀/n}]
Mallows Cp
[SSE + 2p·MSE]/n
Pseudo-R^2 = R²pse
[l(b)-l₀]/[l(sat)-l₀]
R-chart
a chart used to monitor variability (time series)
PCA finds a low dimensional representation of
a dataset that contains as much variation as possible
Latent class model
a discrete mixture which models each subgroup with its own distribution: (1−p)·P(Y=y|α=1) + p·P(Y=y|α=2), y = 0, 1, ..., where α denotes the subgroup
Negative binomial model (poisson dist)
arises as a Poisson-gamma mixture (the Poisson is its limiting case as r → ∞); pmf = C(y+r−1, r−1)·π^r(1−π)^y, y = 0, 1, ...; E[Y] = r(1−π)/π; Var[Y] = r(1−π)/π²; the log link is common
in picking a purity measure, you want
a lower value
when residuals have a larger spread for larger predictions
a possible solution would be to transform y with a concave function
in random forests ____ variables are considered at each split
a subset of variables
Control Chart
a time-ordered diagram used to determine whether observed variations are abnormal; can also detect trends in time and nonstationarity
a small R^2 indicates
adding more predictors would likely improve the model
the Tweedie distribution can model
aggregate loss with Poisson frequency and Gamma severity
the likelihood ratio chi-squared statistic tests
H₀: all non-intercept coefficients are zero vs. at least one is non-zero; df = p − 1
in boosting ____ variables are considered at each split
all variables
Multicollinearity
an issue in every regression model; leads to inflated variance of coefficient estimates; no universal approach to handling it; can be eliminated by using orthogonal predictors
scatterplots can detect ______ type of relationship between two variables
any
balanced dendrogram indicates
average or complete linkage
matrix for coefficients (SLR)
b = (XᵀX)⁻¹Xᵀy; the column of all 1's in X corresponds to the intercept
PCA is not useful if variables are not correlated
because PCA is supposed to produce uncorrelated PCs; if the variables are already uncorrelated, it won't do much
AR(1) parameter estimates
b₁ = ∑[(y(t−1) − ȳ₋)(yt − ȳ₊)]/∑(y(t−1) − ȳ₋)²; b₀ = ȳ₊ − b₁ȳ₋; s² = [1/(n − 3)]∑et²; Var[Yt] = s²/(1 − b₁²)
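A minimal sketch of these AR(1) estimates on a simulated series (the simulation parameters are illustrative assumptions):
```python
# Hypothetical sketch of AR(1) estimation by regressing y_t on y_{t-1}.
import numpy as np

rng = np.random.default_rng(7)
n, beta0, beta1 = 200, 1.0, 0.6
y = np.zeros(n)
for t in range(1, n):
    y[t] = beta0 + beta1 * y[t - 1] + rng.normal()

y_lag, y_cur = y[:-1], y[1:]
ybar_minus, ybar_plus = y_lag.mean(), y_cur.mean()      # ybar_- and ybar_+
b1 = np.sum((y_lag - ybar_minus) * (y_cur - ybar_plus)) / np.sum((y_lag - ybar_minus) ** 2)
b0 = ybar_plus - b1 * ybar_minus
e = y_cur - (b0 + b1 * y_lag)
s2 = np.sum(e**2) / (n - 3)                             # matches the card's s^2 formula
var_y = s2 / (1 - b1**2)                                # estimated Var[Yt]
```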
AR(1) is a meandering process if
b1 >0
tweedie distribution is not suitable for count models
because it is part continuous and part discrete (a point mass at zero plus a continuous density on positive values)
Modeling trees
best split minimizes the weighted average of the impurity of the resulting nodes; cost-complexity pruning finds the sequence of subtrees that minimizes (1/n)∑ nₘ·Iₘ + λ|T|, where |T| = number of terminal nodes
ridge regression produces
biased estimators; this is the trade-off made in ridge regression in order to reduce the variance of the coefficient estimates
regression in high dimensions
can use lasso or ridge regression
ideal plots to study the distribution of a variable
box and qq plot
with a least squares regression, LOOCV estimate for test MSE
can be calculated by fitting a model once
Alternative Count models
can incorporate the Poisson distribution while letting the mean differ from the variance of the response: - negative binomial - latent class model - zero-inflated model - hurdle model - heterogeneity model. Which can let mean > var? ** only the hurdle model **
PCR
if all principal components are used, PCR = OLS; addresses multicollinearity; reduces overfitting; recommended to standardize predictors; NOT useful for feature selection
the first principal component is the line
in a p-dimensional space that is closest to the observations
when simple linear regression residual plot is n shaped,
indicates a quadratic relationship between the response and the explanatory variable; therefore, the model is likely missing a key predictor (e.g., a squared term)
GLM facts about link functions
the link function is the inverse of the mean function; the choice of mean and variance functions drives inference for the LEF; the canonical link is the inverse of the mean function b′
confidence interval = prediction interval when
irreducible error is zero
total within-cluster variation
is guaranteed to decrease as more iterations are performed until final assignment
transforming the response variable with a concave function
is not a reasonable way to deal with multicollinearity
clustering and categorical variables
k-means cannot handle categorical variables; hierarchical clustering can
modeling techniques that perform variable selection
lasso regression
zero inflated model is a special case of
latent class model
exploratory data analysis
learns about relationships between observations or features
bagging makes a model
less interpretable
ARCH model
models the conditional variance of a time series: σ²(t) = θ + γ₁ε²(t−1) + ... + γₚε²(t−p), with θ > 0, γj ≥ 0, ∑γj < 1
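A minimal sketch simulating an ARCH(1) process to illustrate the conditional-variance recursion and the unconditional variance θ/(1 − γ₁); the parameter values are illustrative assumptions:
```python
# Hypothetical sketch of an ARCH(1) simulation.
import numpy as np

rng = np.random.default_rng(8)
n, theta, gamma1 = 5000, 0.2, 0.5
eps = np.zeros(n)
sigma2 = np.zeros(n)
sigma2[0] = theta / (1 - gamma1)                  # start at the unconditional variance
eps[0] = rng.normal(scale=np.sqrt(sigma2[0]))
for t in range(1, n):
    sigma2[t] = theta + gamma1 * eps[t - 1] ** 2  # conditional variance recursion
    eps[t] = rng.normal(scale=np.sqrt(sigma2[t]))

print(eps.var(), theta / (1 - gamma1))            # sample variance ~ unconditional variance
```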
To be free from multicollinearity
must use uncorrelated/orthogonal predictors. PCR uses principal components (which are orthogonal) as predictors. NOT :Lasso, forward selection, and best subset selection
no backward selection when
n < p
disadvantages of trees
not robust; do not have the same degree of predictive accuracy as other statistical models
interaction depth
number of splits in a tree
we can cluster
observations on the basis of features OR features on the basis of observations
Binary Response Variable
odds = p/(1−p); link functions: Logit, Probit (Φ⁻¹(µ)), Complementary log-log (ln(−ln(1−µ))); log-likelihood: l(β) = ∑[y·lnµ + (1−y)·ln(1−µ)]; score equations: ∂l(β)/∂β = ∑[xi(y−µ)µ′/(µ(1−µ))] = 0; deviance statistic: D = 2∑[y·ln(y/µ) + (1−y)·ln((1−y)/(1−µ))]; Pearson residual: ei = (y−µ)/√(µ(1−µ)); Pearson chi-square: ∑ei² = ∑(y−µ)²/[µ(1−µ)]
Poisson distributions are inadequate at handling
overdispersion
sum of leverages must equal
p + 1
for Poisson, when you address overdispersion
p-values change and standard errors change; no change in the estimated mean; probabilities may change because it is no longer Poisson
Random Walk
partial sums of a white noise process; not stationary; wt = yt − y(t−1); forecast ŷ(n+l) = yn + l·w̄, se = s_w√l; the differenced series is a white noise process; the sample variance of the differenced series < the sample variance of the original series
Impurity measures
pure = a node contains observations from (mostly) the same class; Gini and cross-entropy are sensitive to PURITY, good for TREE GROWING; the classification error rate is NOT sensitive to purity, good for TREE PRUNING
best graph for non-linear model residuals
qq plot
deviance residuals are either positive or negative depending on
the sign of the corresponding raw residual
zero-inflated model (poisson dist)
requires selecting a discrete distribution valid on integers starting with 0 then constructs a discrete mixture y=0: (1-p) + pP(Y=0|α=2) y>0: pP(Y=y|α=2) can model p with logit link
hurdle model (poisson dist)
requires selecting a discrete distribution valid on integers starting with 1; distinguishes zero claims from non-zero. y=0: (1-p) y>0: pP(Y=y|α=2) zero-truncated: pmf = 1/(1-e^-λ)*(e^-λ*λ^y/y!), y = 1,2,.. if zero-truncated, mean can be greater than variance
in pca, variables are
centered, i.e., xi − mean(xi), and typically also scaled to unit variance
info in PCA
scaling has a significant effect on the result of PCA; a scree plot is used to determine the number of PCs needed; loadings are only unique up to sign
ideal model for studying relationship between two variables
scatter plot
inaccurate se's (ignoring MSE): potential causes
shrunken se's = dependent errors; inflated se's = multicollinearity
loadings are unique up to a sign flip - condition
the sign flip must be applied consistently across all loadings of a given component
skewed dendrogram indicates
single linkage
in wrongly assuming MLR errors are independent (dependent errors)
standard errors are smaller than they should be, because covariance terms go unaccounted for
Goodness of fit test
stat = ∑(n_c − n·q_c)²/(n·q_c); reject H₀ if stat ≥ the chi-square critical value with df = w − g − 1, where w = # of mutually exclusive split intervals and g = # of free parameters
as realizations of a t-distribution..
studentized residuals can help identify outliers
within cluster variation
the sum of squared Euclidean distances between all pairs of observations in the cluster, scaled by 1/|Ck|; this always equals 2 × the sum of squared distances from each observation to the cluster centroid
if a GLM is adequate,
the scaled deviance is approximately a realization of a chi-square distribution, because it equals 2[loglikelihood(model 1) − loglikelihood(model 2)] for nested models; for Poisson, deviance = scaled deviance (φ = 1)
In SLR, the F-statistic is always..
the square of the t-statistic
when you see a funnel shaped residuals graph
there is heteroskedasticity; WLS works for funnels in either direction; if the spread grows from small (left) to large (right), transforming y with a concave function can also help
you can combine all predictors with high VIF into one predictor
to deal with multicollinearity
unit root test
to identify whether a random walk is a good fit for a time series; Dickey-Fuller & augmented Dickey-Fuller tests
to find subgroups based off of PATTERNS not numbers
use hierarchical clustering with a correlation-based distance
Poisson Regression Model
uses the log link: h(µ) = lnµ; log-likelihood function: ∑[y·lnµ − µ − ln(y!)]; score equation: ∂l(β)/∂β = ∑xi(y−µ) = 0; information matrix: I = ∑µxxᵀ; deviance statistic: D = 2∑[y(ln(y/µ) − 1) + µ]; Pearson residual: ei = (y−µ)/√µ; Pearson chi-square stat: ∑(y−µ)²/µ
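A minimal sketch of a Poisson GLM with log link via statsmodels, on simulated placeholder data:
```python
# Hypothetical sketch of Poisson regression with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 200
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.8 * x)                  # log link: ln(mu) = b0 + b1*x
y = rng.poisson(mu)

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                           # MLE estimates of b0, b1
print(fit.deviance)                         # deviance statistic D
print(fit.pearson_chi2)                     # sum of (y - mu_hat)^2 / mu_hat
```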
GARCH Model
uses past conditional variances, in addition to past squared errors, to model the conditional variance: σ²(t) = θ + γ₁ε²(t−1) + ... + γₚε²(t−p) + δ₁σ²(t−1) + ... + δ_qσ²(t−q)
if pca in a biplot is scaled,
variables have mean zero and variance one; therefore the biplot cannot be used to compare the variances of the original variables
OLS bias is always
zero
AR(1) is a generalization of
white noise or a random walk model, depending on b1
Estimating moving average forecast
ŷ(t+l) = ŝt + [2(ŝt − ŝt⁽²⁾)/(k − 1)]·l, where ŝt⁽²⁾ is the doubly smoothed series and k is the moving-average length
estimating AR(1) model
yt = b₀ + b₁y(t−1); l-step forecast se = s√(1 + b₁² + b₁⁴ + ... + b₁^(2(l−1)))
proportion of variance explained by a principal component
variance explained by a PC = ∑(its scores)²/n (with centered variables); PVE = that divided by the total variance, which equals the number of variables when the data are standardized
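A minimal numpy sketch of the proportion of variance explained on made-up centered data (SVD-based PCA):
```python
# Hypothetical sketch of PVE per principal component.
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated placeholder data
Xc = X - X.mean(axis=0)                                   # center (and optionally scale) first

_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                                        # principal component scores
pve = (scores**2).sum(axis=0) / (Xc**2).sum()             # sum of squared scores / total SS
print(pve, pve.sum())                                     # PVEs sum to 1
```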
k-fold cross validation error
∑MSEᵢ/k; compare between models; lower = better
sum of squared one-step prediction errors
∑[y(t) - s(t-1)]²
poor inferences (ignoring MSE and se) potential causes
∙ Misspecified model equation ∙ Non-normal errors