MGSC372 Multiple Choice (PowerPoints 8-16)
seasonally adjusted data
It contains the time series components T, C, and I
the fundamental idea of the ANOVA concept
the fundamental idea is that if the null hypothesis is true (i.e. µ₁ = µ₂ = ... = µₚ), then the p populations are statistically identical. Consequently, whichever way we calculate the variance (between populations or within populations), we should get similar results (i.e. F = MSTR/MSE should be close to 1)
testing for residual correlation: if the residual for year t is positive...
there is a tendency for the residual in year t+1 to also be positive
identification of potential models
at the identification stage, you do not need to worry about the sign of the ACF or PACF, or about the speed at which an exponentially declining ACF or PACF approaches 0; these depend on the sign or actual value of the AR and MA coefficients. In some instances, the exponentially declining ACF alternates between positive and negative values. ACF and PACF plots from real data are never as clean as textbook plots
residual term in an estimated multiple regression model
eᵢ = yᵢ - ŷᵢ eᵢ = yᵢ - (β̂₀+β̂₁x₁+...+β̂ₖxₖ)
Bartlett's test
follows a Chi-square distribution with p-1 degrees of freedom
detecting unequal variances
for each treatment, construct a box plot or dot plot for y and look for differences in variability. In this situation, we can test the hypothesis H₀: σ₁²=σ₂²=...=σₚ² vs. H₁: at least one σⱼ² differs, using one of the 3 tests for the homogeneity of variance
tests for main effects are only relevant when...
no interaction exists between factors. for this reason, the test for interaction is generally performed first, and if it is present, a test on the main effects is not performed
why differencing?
non-stationary series have an autocorrelation function that declines slowly (exponentially) rather than quickly declining to zero; you must difference such a series until it is stationary before you can identify the process
specific seasonal relative
(Y/4QCMA)*100
4 points to remember regarding the definition of the Durbin-Watson statistic (d)
1) 0 ≤ d ≤ 4 2) if residuals are uncorrelated, d≈2 3) if residuals are positively correlated, d<2 (if very highly positive, d≈0) 4) if residuals are negatively correlated, d>2 (if very highly negative, d≈4)
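a minimal numpy sketch of the definition behind these four points (the residual series is simulated purely for illustration):

```python
import numpy as np

def durbin_watson(e):
    """d = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2, so 0 <= d <= 4."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
e = rng.normal(size=200)
print(durbin_watson(e))             # uncorrelated residuals: close to 2
print(durbin_watson(np.cumsum(e)))  # strongly positively correlated: near 0
```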
2 methods for variable selection in regression analysis
1) Akaike's information criterion 2) Bayesian information criterion (both included on formula sheet) (note, for both formulas, r is the total number of parameters, including the constant term) the lower the value, the better the model!
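the formula sheet itself is not reproduced in this deck, so as a sketch only, here are the common SSE-based forms of the two criteria (an assumption on my part, check them against your sheet):

```python
import numpy as np

def aic(sse, n, r):
    # r = total number of parameters, including the constant term
    return n * np.log(sse / n) + 2 * r

def bic(sse, n, r):
    # penalty uses ln(n) instead of 2, so BIC punishes extra
    # parameters more heavily once n >= 8 -> more parsimonious models
    return n * np.log(sse / n) + r * np.log(n)

# lower value = better model, e.g. comparing two candidate fits:
print(aic(sse=120.0, n=50, r=4), aic(sse=118.5, n=50, r=6))
```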
2 guidelines for interpreting VIFs
1) Any VIF > 10 suggests severe multicollinearity. Start to suspect problems when VIF > 5 2) If all VIFs are less than 1/(1-R²) where R² is the coefficient of determination in the model with all independent variables present, then multicollinearity is not strong enough to affect the coefficient estimates (i.e. variables are more strongly correlated to Y than to each other)
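a short statsmodels sketch of guideline 1 (the data are simulated; x3 is built to nearly duplicate x1 so its VIF blows up):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=100)  # x3 ~ x1: multicollinearity
X = sm.add_constant(X)

for i in range(1, X.shape[1]):  # column 0 is the constant
    print(f"VIF for x{i}: {variance_inflation_factor(X, i):.1f}")
# x1 and x3 show VIFs far above 10; x2 stays near 1
```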
5 different degrees of freedom in two way ANOVA with replication
1) Factor A: n₁-1 2) Factor B: n₂-1 3) Interaction: (n₁-1)(n₂-1) 4) Error: n₁n₂(r-1) 5) Total: n₁n₂r-1 where r = # of replications
3 tests for the homogeneity of variance
1) Hartley test 2) Bartlett's test 3) Modified Levene's test
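two of the three are available in scipy (sample values below are hypothetical); Hartley's test is easy enough to compute by hand as max(sᵢ²)/min(sᵢ²):

```python
from scipy import stats

# hypothetical samples, one per treatment
a = [4.1, 5.0, 4.7, 5.2, 4.4]
b = [3.9, 4.8, 6.1, 3.5, 5.6]
c = [5.3, 4.9, 5.5, 5.0, 5.1]

print(stats.bartlett(a, b, c))                 # Bartlett's test (normal populations)
print(stats.levene(a, b, c, center='median'))  # modified Levene (Brown-Forsythe)
```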
leverages are a measure of the distance between the _____1_____ for _____2_____ and the _____3_____ for _____4_____.
1) x values 2) an observation 3) mean of x values 4) all observations
4 steps showing how to isolate the cyclical and random components
1) Y/T = CSI 2) Y/TS = CI (deseasonalized % of trend) 3) a 3 period moving average (3MA) is taken; the averaging process reduces irregularity, leaving just C 4) CI/C = I
9 pieces of ANOVA notation: Yij, Tj, nj, Ybarj, n, p, Y double bar, µj, σ
1) Yij = the ith observation from the jth treatment (population) 2) Tj = the total of the jth sample 3) nj = the size of the jth sample 4) Y-barj = Tj/nj = the mean of the jth sample 5) n = ∑nj = the total # of observations 6) p = # of treatments (populations) 7) Y double bar = the overall (grand) mean of all the data combined 8) µj = the mean for the jth treatment (population) 9) σ = √MSE = the common standard deviation for all treatments (populations)
5 time series components
1) Yt = data at time t 2) Tt = trend 3) Ct = cyclical component 4) St = seasonal component 5) It (or Rt) = irregular (or random) component
3 assumptions of ANOVA
1) all populations are normally distributed 2) all populations have the same variance 3) independent random samples from each population
4 steps to calculating seasonal indices
1) arrange the specific seasonal relatives [(Y/4QCMA)*100] according to their respective seasons 2) find the median for each season 3) determine the "adjustment factor" = (# of seasons*100)/sum of medians 4) obtain seasonal indices by multiplying each median by the adjustment factor (the seasonal indices should add to 400 if using quarterly data, 1200 if monthly data)
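a pandas sketch of these four steps, assuming `ssr` (a hypothetical name) is a Series of specific seasonal relatives with a quarterly DatetimeIndex:

```python
import pandas as pd

def seasonal_indices(ssr: pd.Series) -> pd.Series:
    """ssr holds the specific seasonal relatives [(Y/4QCMA)*100]."""
    medians = ssr.groupby(ssr.index.quarter).median()  # step 2: median per season
    adjustment = (4 * 100) / medians.sum()             # step 3: adjustment factor
    return medians * adjustment                        # step 4: indices sum to 400
```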
Cook's D: if the calculated percentile value of the F distribution with (k+1, n-[k+1]) degrees of freedom is... (3 levels)
1) between 0 and .3: not influential 2) between .3 and .5: mildly influential 3) greater than .5: influential
3 steps to detecting non-normal populations and resolving them
1) for each treatment, construct a histogram, normal probability plot, or other graphical display that detects skewness 2) conduct formal tests for normality (e.g. Anderson-Darling) for which the H₀ is that the probability distribution is normal. (note: in practice, the normality assumption will often not be exactly satisfied, so these tests are of limited use in practice) 3) if the distribution departs greatly from normality, a normalizing transformation such as ln(y) or √y may be necessary
3 steps of autoregressive integrated moving average (ARIMA) models
1) identification 2) estimation 3) diagnostic checking
because D is calculated using _____1_____ and _____2_____ it considers whether an observation is unusual with respect to both x and y values
1) leverage values 2) standardized residuals
3 measures of forecast error
1) mean absolute error (MAE) 2) mean squared error (MSE) (also called mean squared deviation - MSD) 3) mean absolute percentage error (MAPE) see formula sheet for all!
_____1______ departures from the _____2______ assumption will generally not invalidate the results of the regression analysis; regression analysis is robust with regard to this assumption.
1) moderate 2) normality (i.e. if the data is not badly skewed and has one major central peak, we can be confident using the model)
2 types of time series models
1) multiplicative, Yt = Tt*Ct*St*It (the one we care about) 2) additive, Yt = Tt+Ct+St+It
residual analysis: 3 things that can be detected through
1) normality of error terms 2) constant variance 3) independence of the error terms
2 possible causes of outliers within the model
1) omission of an important variable 2) omission of higher order terms
2 plots to detect lack of fit
1) plot residuals ei on the vertical axis against each of the independent variables x₁, x₂, ..., xκ on the horizontal axis (residual plot) 2) plot the residuals on the vertical against the predicted value ŷ on the horizontal axis (residuals vs fits)
It is important to choose a model that is consistent with the _____1_____ associated with the phenomenon under consideration, even if the _____2_____ is lower.
1) scientific fact 2) R²
2 notes on ratio-to-moving-average
1) the first 2 and last 2 values of a time series are lost in the totalling process 2) the ratio-to-MA values are not pure seasonal indices because they contain the irregular component in addition to the seasonal component.
2 reasons to transform y
1) to make y values satisfy model assumptions 2) to make the deterministic portion of the model a better approximation to the mean value of the transformed variable
(similar to ANOVA for regression) the total variation comes from 2 sources:
1) treatment (between) 2) error (within)
Factor A: _______1_______ Factor B: _______2_______
1) treatments 2) blocks
4 notes on lags
1) we can plot the pairs (yₜ, yₜ₋₁) to investigate a possible first lag relationship 2) the Durbin-Watson statistic is used to detect significant autocorrelation of lag 1 3) the autocorrelation coefficient for the kth lag is rₖ 4) the question we seek to answer: what lag length k might be interesting to investigate?
4 guidelines for identifying models using ACF and PACF plots
1) you are entitled to treat any non-significant values as 0 (i.e. ignore values that lie within the confidence intervals on the plot) 2) you do not HAVE to ignore them, particularly if they continue the pattern of the statistically significant values 3) an occasional autocorrelation will be significant by chance alone. 4) you can ignore a statistically significant autocorrelation if it is isolated, preferably at a high lag, and if it does not occur at a seasonal lag
in the context of ANOVA, what are 2 reasons why t test can be more flexible than an ANOVA F test?
1) you may choose a one sided alternative instead 2) you may want to run a t test assuming unequal variance if you're not sure that your 2 populations have the same std. deviation σ
2 stabilizing transformations
1) √y 2) ln(y)
4 assumptions of the multiple linear regression model
1. *E(error) = 0* 2. *Normality* - values are normally distributed 3. *No multicollinearity* - independent variables are not highly correlated with each other 4. *Homoscedasticity* - similar variance of error terms across values of the independent variable
3 reasons why multicollinearity is a problem
1. The std. error of the regression coefficients are inflated 2. estimated regression coefficients must be interpreted as the average change in the dependent variable per unit change in an independent variable *when all other variables are held constant* 3. inferential statistics on the regression parameters are not reliable (t tests, CIs for βi: large std. error means large CIs and small t-stats, researcher will accept too many H0s)
4 ways to detect multicollinearity
1. high correlation between pairs of variables (correlation matrix) 2. estimated regression coefficients change when a variable is added or removed 3. there are conflicting results between F and t tests (e.g. overall F is significant but individual t values are not) 4. variance inflation factor (VIF) > 5
4 advantages of experimental variables
1. user controls experiment 2. variable values can be assigned so independent variables are not correlated and multicollinearity can be eliminated 3. cause and effect relationships can be inferred 4. randomization can be controlled by assigning a range of values to the independent variable
observations with large leverage may...
exert considerable influence on the fitted value, and thus the regression model
influential observation
one whose removal would significantly affect the regression equation
Does AIC or BIC choose a more parsimonious model and why?
BIC chooses the more parsimonious model as it imposes a higher penalty for including extra parameters (ln[n] in the numerator of the second part of the expression!)
Case 1: if all populations are approximately normal, use...
Bartlett's test
this methodology can be used to decide whether to use a pure AR or MA model or an ARMA model
Box-Jenkins
formula for Cook's D (not on the formula sheet!)
Dᵢ = (eᵢ²/[(k+1)MSE])·[hᵢ/(1-hᵢ)²] where: k = the number of x coefficients in the regression model, eᵢ = the ith residual, hᵢ = the ith leverage value
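the card's formula translated directly into numpy (the residuals, leverages, MSE, and k would come from your fitted model):

```python
import numpy as np

def cooks_d(e, h, mse, k):
    """D_i = (e_i^2 / [(k+1)*MSE]) * (h_i / (1 - h_i)^2), one D per observation."""
    e, h = np.asarray(e, dtype=float), np.asarray(h, dtype=float)
    return (e ** 2 / ((k + 1) * mse)) * (h / (1 - h) ** 2)
```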
First difference
Dₜ = Yₜ - Yₜ₋₁
other commonly used MAs besides 3MA
Henderson's MA (not on final)
Hypothesis test for equal variances
H₀: (σ₁²/σ₂²) = 1 H₁: (σ₁²/σ₂²) ≠ 1 (where σ₁² is the larger and σ₂² is the smaller) TS: F = (s²larger/s²smaller) F= (MSE larger/MSE smaller) CV: F tables Accept H₀ if F≤CV Reject H₀ if F>CV Rejecting the hypothesis of equal variance means that the variances are unequal i.e. there is heteroscedasticity present
ANOVA test of hypothesis
H₀: µ₁ = µ₂ = ... = µₚ H₁: not all µⱼ are equal (at least one mean is different) TS: F = (MSTR/MSE) = (MSB/MSW) CV: Fα(p-1, nₜ-p) [one way] Do not reject H₀ if F ≤ CV, reject H₀ if F > CV
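scipy will run this one-way F test directly (samples hypothetical):

```python
from scipy import stats

g1 = [20.1, 19.8, 21.2, 20.5]  # treatment 1
g2 = [22.3, 21.9, 23.0, 22.5]  # treatment 2
g3 = [20.0, 20.7, 19.5, 21.1]  # treatment 3

f_stat, p_value = stats.f_oneway(g1, g2, g3)  # F = MSTR/MSE
print(f_stat, p_value)                        # reject H0 if p < alpha
```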
Values of I in ARIMA
I refers to the degree of differencing I=0: differencing is not necessary I=1: first difference I=2: second difference, Zₜ = Dₜ - Dₜ₋₁ = (Yₜ - Yₜ₋₁) - (Yₜ₋₁ - Yₜ₋₂) = Yₜ - 2Yₜ₋₁ + Yₜ₋₂ note: it is rare to take higher order differences than 2 (i.e. 0, 1, or 2 are usually sufficient)
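numpy's diff makes both orders of differencing one-liners (toy values for illustration):

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 8.0, 9.0])
print(np.diff(y))       # first difference:  D_t = Y_t - Y_{t-1} -> [ 2. -1.  4.  1.]
print(np.diff(y, n=2))  # second difference: Z_t = D_t - D_{t-1} -> [-3.  5. -3.]
```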
Portmanteau tests' goal
Instead of studying the correlation coefficients rₖ one at a time, the idea of this class of tests is to consider a whole set of rₖ values (e.g. r₁ through r₁₂) all at once.
rationale for the Bonferroni correction
It is used to reduce the chances of obtaining false-positive results (i.e. type 1 errors) when multiple pairwise tests are performed on a single set of data.
Case 2: if all populations are clearly not normal, use...
Levene's test
Ljung-Box Q statistic formula (not on formula sheet)
Q = n(n+2) ∑ [rₖ²/(n-k)] where k = time lag, rₖ = the autocorrelation coefficients, and n = # of observations
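a direct numpy translation of this formula (statsmodels' acorr_ljungbox computes the same statistic with p-values, if you prefer a library routine):

```python
import numpy as np

def ljung_box_q(r, n):
    """Q = n(n+2) * sum_{k=1..K} r_k^2 / (n - k), for r = [r_1, ..., r_K]."""
    r = np.asarray(r, dtype=float)
    k = np.arange(1, len(r) + 1)
    return n * (n + 2) * np.sum(r ** 2 / (n - k))
```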
Ri² in VIF formula
Rᵢ² is the multiple coefficient of determination in the multiple regression model that expresses xᵢ as a function of all the other independent variables
R² when multicollinearity is present
R² and the predictive power of the regression model remain unaffected by multicollinearity (i.e. a model with a high R² and significant F test can be used for prediction purposes)
Between/Within/Total
SSB SSW SSTO, where SSTO = SSB + SSW MSB = SSB/(p-1), MSW = SSW/(nₜ-p) F = MSB/MSW = [SSB/(p-1)]/[SSW/(nₜ-p)]
Treatment/Error/Total
SSTR SSE SSTO, where SSTO = SSTR + SSE MSTR = SSTR/(p-1), MSE = SSE/(nₜ-p) F = MSTR/MSE = [SSTR/(p-1)]/[SSE/(nₜ-p)]
the moving average (MA) model of order q
Yₜ = θ₀ + eₜ - θ₁eₜ₋₁ - θ₂eₜ₋₂ - ... - θqeₜ₋q where eₜ is the error term for period t. By convention, the θ terms in the model are preceded by a minus sign
the autoregressive (AR) model of order p
Yₜ = φ₀ + φ₁Yₜ₋₁ + φ₂Yₜ₋₂ + ... + φpYₜ₋p + eₜ where Yₜ₋₁, Yₜ₋₂, ... are time-lagged values and eₜ is the residual error term.
ratio-to-moving-average
Yt = TCSI, 4QCMA = TC Therefore: Y/4QCMA = SI the ratio-to-moving-average includes only the seasonal and irregular components of the time series
autoregressive moving average (ARMA) model of order (p,q)
a combination of the AR and MA models: Yₜ = φ₀ + φ₁Yₜ₋₁ + φ₂Yₜ₋₂ + ... + φpYₜ₋p + eₜ - θ₁eₜ₋₁ - θ₂eₜ₋₂ - ... - θqeₜ₋q e.g. ARMA(1,1): Yₜ = φ₀ + φ₁Yₜ₋₁ + eₜ - θ₁eₜ₋₁
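in statsmodels, an ARMA(p,q) model is fit as ARIMA with d=0 (the series below is simulated as a placeholder):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
y = rng.normal(size=300)               # placeholder stationary series

fit = ARIMA(y, order=(1, 0, 1)).fit()  # order=(p, d, q); d=0 -> ARMA(1,1)
print(fit.params)                      # estimated phi and theta coefficients
```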
leverage
a measure of how influential an observation is; the larger the leverage value, the more influence the observed y has on its predicted value
time series
a sequence of observations collected from a process at fixed (and usually equally spaced) points in time.
ANOVA definition
a statistical test of significance for the equality of several (2 or more) population ("treatment") means H₀: µ₁ = µ₂ = ... = µₚ H₁: not all µⱼ are equal
stationary time series
a time series is said to be stationary if there is no systematic change in the mean (i.e. no trend), no systematic change in the variance, and if strictly periodic variations have been removed; a longitudinal measure in which the process generating returns is identical over time
stabilizing transformation
a transformation that reduces heteroscedasticity
correlation plots: why?
after a time series has been made stationary by differencing, the next step in fitting an ARIMA model is to determine whether AR or MA terms, or a combination of the 2, are needed to correct any autocorrelation that remains in the differenced series. By looking at ACF and PACF plots of the differenced series, you can tentatively identify the numbers of AR or MA terms that are needed.
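the standard statsmodels plotting helpers for this step (series simulated; in practice you would plot your own differenced data):

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(3)
y = np.cumsum(rng.normal(size=200))  # non-stationary walk
d = np.diff(y)                       # difference first, then identify

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
plot_acf(d, ax=axes[0])   # cut-off here suggests the MA order q
plot_pacf(d, ax=axes[1])  # cut-off here suggests the AR order p
plt.show()
```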
cyclical
alternating periods of expansion and contraction of more than one year's duration
the Bonferroni correction
an adjustment made to α values when several statistical tests are being performed simultaneously on a single data set. To perform the correction, divide the α value by the number of comparisons being made (e.g. if k hypotheses are being tested, the new level of significance would be α/k, and the statistical power of the study is then calculated based on this modified α)
lag
an interval of time between observations in a time series (e.g. a one quarter lag between data points)
outlier
an observation with a residual greater than 3 standard deviations (>3σ), or equivalently with a standardized residual >3.
Cook's Distance (D)
an overall measure of the impact of the ith observation on the n fitted values
Tukey's multiple comparisons
a more sophisticated test for the equality of means. In Minitab output, means that do not share a letter are significantly different
a Bonferroni CI
construct k CIs each with confidence level 1-(α/k), note that there will be one CI for each population (formula is included on formula sheet!)
longitudinal (time series) data analysis
data are collected by making repeated observations on a process over time, and the past behavior of these series can be examined to try to predict future behavior
standardized residuals
denoted zᵢ for the ith observation, this is the residual eᵢ divided by the standard error of the estimate s (s=√MSE) "Standardized residuals normalize your data in regression analysis, [it is] a measure of the strength of the difference between observed and expected values." (see formula sheet!)
a test for heteroscedasticity
divide the sample of observations based on all values of ŷ. How many observations fall above or below a certain value?
if there is strong evidence of residual correlation...
doubt is cast on the least squares results and any inferences drawn from them
one possible action you can take if an outlier represents a real and plausible value in your dataset?
exclude the data from the statistical analysis and write an exception report to explain the absence of the value from the dataset
normal probability plot
graphs the *residuals* against the *expected values of the residuals* under the assumption of normality. if the assumption is true, then a residual should approximately equal its expected value, resulting in a straight line graph (points should lie within 95% confidence limits of the straight line representing the normal distribution)
resolving non-stationarity with regard to the mean
if a time series plot is not stationary with regard to the mean (i.e. it exhibits a trend), try differencing the series.
resolving non-stationarity with regard to the variance
if a time series plot is not stationary with regard to the variance, try a transformation (e.g. logarithmic, square root, etc.)
note on logarithmic transformations
if any values in the dataset are equal to zero, a logarithmic transformation cannot be used, as ln(0) is undefined. Use a square root transformation instead
outliers are often caused by...
improper measurement
what to look for in residual plots and residual vs fits plots
in each plot, look for trends, dramatic changes in variability, and/or more than 5% of residuals lying outside 2 standard deviations of 0; any of these patterns indicate a problem with model fit
computing leverage
in regression analysis it is known that the predicted value for the ith observation, ŷᵢ, can be written as a linear combination of the observed values (y₁, y₂, ..., yₙ). Thus for each i = 1 to n: ŷᵢ = hᵢ₁y₁ + hᵢ₂y₂ + ... + hᵢᵢyᵢ + ... + hᵢₙyₙ where hᵢᵢ (written hᵢ) is the leverage of the ith observation, meaning it measures the influence of the observed value yᵢ on its own predicted value ŷᵢ
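a numpy sketch of the standard matrix route to these h values (not on the formula sheet): the hᵢ are the diagonal of the hat matrix H = X(X'X)⁻¹X':

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X'; h_i = H[i, i].
    X must already include a column of ones for the intercept."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)
```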
note about the 3MA
it is the simplest possible MA and is not very efficient at removing the irregular component
trend
long-term growth or decline in a time series (linear is the only one we care about) Tt = β₀ + β₁t
the Bonferroni correction: the probability of identifying at least 1 significant result due to chance increases as...
more hypotheses are tested. for example, a researcher testing 20 hypotheses with α=.05: P(at least 1 sig. result) = 1 - P(no sig. results) = 1 - (.95)²⁰ = .6415 this implies that there is a 64.15% chance of identifying at least one significant result by chance alone, even if all 20 null hypotheses are true
observations with large D values may be...
outliers
in the AR model, forecasts are based exclusively on...
previous values of the variable to be forecasted
seasonal
repetitive patterns completing themselves within a year (i.e. seen in monthly and quarterly data)
irregular (random)
residual movements in the time series after all other factors have been removed, follows no recognizable pattern
Modified Levene's test
similar to a one way ANOVA based on the absolute deviations of the observations in each sample from their medians. if the p value < α, reject H₀ and conclude that the variances are unequal
Hartley test
test statistic = max sᵢ²/min sᵢ² (the largest sample variance over the smallest); reject H₀: σ₁²=σ₂²=...=σₚ² for large values of the test statistic
the idea of ANOVA
testing the equality of means extended to more than 2 groups
the rationale for finding the 4 quarter centered moving average (4QCMA)
the 4QCMA contains only the trend T and the cyclical component C. The totalling process removes the seasonal effect and dividing the 8QCMT by 8 (the averaging process) removes the random component
finding the 4 quarter centered moving average (4QCMA)
the 8QCMT is divided by 8 to obtain the 4QCMA. This can be directly compared to the original data value for the corresponding time period
F tests for ANOVA and regression
the ANOVA test of hypothesis (H₀: µ₁ = µ₂ = ... = µₚ) is equivalent to the regression test of hypothesis (H₀: β₁ = β₂ = ... = βₚ₋₁ = 0); while the formulas for computing the F statistic are different, they result in the same F value
partial correlation
the amount of correlation between 2 variables which is not explained by their mutual correlations with a given specified set of other variables
partial autocorrelation
the amount of correlation between a variable and a lag of itself that is not explained by correlations at all lower order lags
autocorrelation
the correlation of a series with its own past history
centering seasonal data
the four-quarter moving totals (4QMT) are not "centered," meaning they do not correspond to the same time scale as the original data. the data are centered by adding 2 adjacent 4QMTs to get an 8 quarter centered moving total (8QCMT); the 8QCMT's midpoint falls on an actual quarter of the original series
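a pandas sketch of this centering, assuming `y` (a hypothetical name) is a quarterly Series; the shifts handle the alignment described above:

```python
import pandas as pd

def four_qcma(y: pd.Series) -> pd.Series:
    four_qmt = y.rolling(4).sum()               # 4-quarter moving totals
    eight_qcmt = four_qmt + four_qmt.shift(-1)  # add 2 adjacent 4QMTs -> 8QCMT
    return (eight_qcmt / 8).shift(-1)           # divide by 8, center on a real quarter
```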
any interpretation of main effects depends on the interaction and therefore...
the main effects must be retained in the model even if their p values are not significant
replications
the number of data points for each factor/block combination
lag length
the number of periods back that our series reaches
multicollinearity specifically affects...
the reliability of the least-squares estimate of the regression coefficients
the first lag
the series of values yₜ₋₁
computing partial correlation
the square root of the coefficient of partial determination (the coefficient of partial determination is the % reduction in error variance achieved by adding, e.g., X₃ to the regression model of Y on X₁ and X₂)
t statistics for autocorrelation functions (ACF) and partial autocorrelation functions (PACF)
the t-statistic can be used to test whether a particular lag equals zero. A t statistic with an absolute value greater than 1.25 for lags 1-3, or greater than 2 for lags 4+, indicates a lag not equal to zero.
treatment variation
the variation *between* groups
error variation
the variation *within* groups
residual plot: if the cluster of error points becomes more spread out as predicted Y values increase...
then heteroscedasticity is present
residual plot: if there is a tight cluster of points above zero and a more spread out cluster below...
then there is a failure of normality
residual plot: if a band of points forms around zero...
then there is no evidence that any of the 3 assumptions have been violated
a note on stabilizing transformations
there are many possible transformations of the form Y^k where k can be any real exponent (see Box-Cox)
multicollinearity: if 2 independent variables are moderately or highly correlated...
there are variables in the model that are performing the same job in explaining the variation in the dependent variable (i.e. they are not needed in the model together.)
ACF and PACF plots of ARIMA processes
they typically show exponential declines in both ACF and PACF plots
residual plot: if the cluster of points forms a wavy pattern that moves above and below zero...
this is a sign of nonlinearity
Ljung-Box Q statistic (extension of Box-Pierce test)
this is a test for overall model adequacy (a sort of lack-of-fit test); it tests to see if the entire set is significantly different from the zero set (a member of the class of tests known as "Portmanteau")
interpreting Cook's D
to interpret, compare D to the F distribution with (k+1, n-[k+1]) degrees of freedom to determine the corresponding percentile. Generally, if the percentile value is >50%, the observation has a major influence on the fitted values and should be examined
Hypothesis test with Ljung-Box Q statistic
use the Ljung-Box Q statistic to test the null hypothesis that the autocorrelations for all lags up to lag k are equal to 0 H₀: model is adequate H₁: model is not adequate if the p value associated with the Q statistic is significant (p<.05), then the model is considered *inadequate*; if all p values > .05, do not reject H₀
the Durbin-Watson test
used for time series data to detect serial correlations in residuals
residual plot
used to test the assumptions of the multiple regression model: plots residuals against the independent variable X or against the predicted Y values
residual analysis
using residuals to detect departure from assumptions
experimental data
values of the independent variable are controlled
observational data
values of the independent variable are uncontrolled
if the ANOVA is significant and the data fails to pass a test of normality or equal variances...
we must deem any conclusions drawn from the ANOVA invalid
when does heteroscedasticity occur?
when regression results produce error terms that are of significantly varying degrees across different observations
leverage values provide information about...
whether an observation has *unusual predictor values* compared to the rest of the data
if Yt is non-stationary in the mean, Dt...
will often be stationary (occasionally, second differencing will be necessary to achieve stationarity)
defining variables for two way ANOVA, factorial design: experimental design with 2 factors A and B, where A has 3 levels and B has 2
x₁ : 1 if Factor A level 1, 0 if not x₂ : 1 if Factor A level 2, 0 if not x₃ : 1 if Factor B level 1, 0 if not (see complete formula on formula sheet!)
reciprocal transformation of the independent variable
y = β₀+β₁x₁*+ε where x₁* = 1/x₁, i.e. y = β₀+β₁(1/x₁)+ε
note on ACF and PACF plots
you must learn to pick out what is essential in any given plot; always check the ACF and PACF plots of the residuals in case your identification is wrong
error term in a multiple regression model
εᵢ = yᵢ - E(yᵢ) εᵢ = yᵢ - (β₀+β₁x₁+...+βₖxₖ)