Data Cleaning
Leverage
tells us how far the observed values for the case are from the mean values of the set of IVs • For one IV: leverage = hii = 1/n + (Xi - MX)^2 / Σ(X - MX)^2, where hii is the leverage for case i, n is the number of cases, MX is the mean of X, and the denominator is the sum of squared deviations of each value of X from MX • The right-hand term asks: how much does this case contribute? What proportion of all the deviations from MX does this case's deviation account for?
• These diagnostics examine three things:
· 1) Leverage: How unusual is a case in terms of its values on the IVs? • Does not involve Y • How extreme is this case (or these cases) in terms of the predictors? • How much leverage does this point have over your regression?
· 2) Discrepancy: The difference between the predicted and observed values on the DV • How extreme is the case in reference to the DV?
· 3) Influence: The amount that the overall regression (or its coefficients) would change if the outlier were removed (involves both the IVs and the DV) • Conceptually, influence is the product of leverage and discrepancy • How does the regression change with removal of the outlier?
· We should examine all three • Outliers can produce high values on one but not another
Outliers can increase and decrease correlations.
· Outliers can impact correlations A LOT · Even changing the sign!
Data Cleaning
• As we have discussed, correlation and regression are affected by influential data points • When outliers are present, regression analyses may produce results that strongly reflect a small number of atypical cases rather than a general relationship observed in the data
DFFITS has a minimum magnitude of 0
• This occurs when case i has no effect on Ypred - When DFFITS = 0, case i falls exactly on the regression line • The value can be positive or negative - Negative when Ypred.i(i) > Ypred.i, that is, when the predicted value with case i excluded is bigger than the predicted value with it included
• The jackknife can be an excellent for means, variances, correlations, regression coefficients - It is bad for the median. Why?
• Even in the simplest case, leaving one observation out only shifts which value falls at each percentile, so the recomputed medians take on just a few distinct values; because the median does not change smoothly, the jackknife variance estimate for it is unreliable
• All of our transformations can be applied to correct for negative skew as well
• First, we reflect the data - Find X: the absolute value of the maximum value of the data, and add 1 to that value - We subtract each data value from X • This results in a positive skew with values larger than zero • Then we conduct the transformations
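• A rough R sketch of this reflect-then-transform idea (the variable neg.skew and its data-generating call are purely illustrative, not from the slides):
set.seed(1)
neg.skew <- 100 - rgamma(1000, shape = 1.7, scale = 30/1.7)   # made-up negatively skewed variable

X <- abs(max(neg.skew)) + 1      # X = absolute value of the maximum, plus 1
reflected <- X - neg.skew        # subtract each value from X: now positively skewed, all values >= 1

log.reflected <- log10(reflected)   # then apply a positive-skew transformation

hist(neg.skew); hist(reflected); hist(log.reflected)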
• Let's do a quick example to demonstrate bootstrapping of a simple statistic: the mean
• I am going to generate a non-normal sample of 20 individuals - We will then calculate the 95% CI of the mean in the traditional way • I will then draw 10,000 samples from my sample, each of 20 individuals, with replacement - I will then use this empirical distribution to calculate the 95% CI of the mean
• Why would you use the more focused DFBETAS versus the global statistics?
• If a researcher is interested in a particular coefficient, this is helpful - For instance, a researcher wants to know the relationship between school performance and IQ, and she suspects that controlling for family income will obscure the relationship - She can analyze: performance = b0 + b*income + b*IQ - This controls for income, and she is interested in the influence of outliers on b*IQ
• Let's interpret the leverage formula:
• If case i has a score at the value of Mx then the numerator is zero; the quantity on the right is zero; and hii is 1/n - 1/n is thus the minimum leverage - The maximum of hii is 1 • We see that as n increases, the leverage of a particular observation decreases • As case i's score gets further from Mx, hii increases in size • Mean leverage = (k + 1)/n where k is # of IVs
• In the current example, the predicted value for our outlier case was 17.00 when it was included
• If we exclude that case and create a new regression equation, we can generate a new predicted value of that case: 121.05 • Thus, the change (numerator) is 17 - 121.05, which is -104.05
- Typically, we want to measure a latent construct, but we must transform that construct into something observable
• If we want to measure stress, which has some distribution in nature, we might measure the number of stressful life events • Further, the effect of stress may be non-linear; perhaps after a few stressors, each additional stressor has a bigger impact, so y = stressors^2 might be more appropriate than a simple sum of the number of stressful life events
• Finally, it is worth noting that terminology differs across software
• Internally studentized residual is AKA - standardized residual, studentized residual, "SRESID" (SPSS), "Studentized Residual" (SAS) • Externally studentized residual is AKA - studentized residual (ugh), studentized deleted residual, "SDRESID" (SPSS), "Rstudent" (SAS)
• This process works for the mean, but it can be used to estimate the SE for many statistics
• It can also be used for, say, a regression coefficient - Resample multiple times from your sample, conduct a regression in each sample, and estimate the variation of the coefficients - Helpful for identifying bias.
• Cross-validation can leave out m observations at a time, which is similar to a delete-m jackknife: hold a subset out, fit the model on the remaining data, then repeat with a different subset held out, over and over
• It is also common to conduct double cross-validation: - 1) Divide the data in half - 2) Use each as a training sample and compare the emergent models - 3) Test each model in the other sample
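• A minimal R sketch of double cross-validation along these lines (the data set dat and its variables are simulated just for illustration):
set.seed(1)
dat <- data.frame(x = rnorm(200))
dat$y <- 2 + 0.5 * dat$x + rnorm(200)

half <- sample(rep(1:2, each = 100))    # 1) divide the data in half at random
A <- dat[half == 1, ]
B <- dat[half == 2, ]

fit.A <- lm(y ~ x, data = A)            # 2) use each half as a training sample
fit.B <- lm(y ~ x, data = B)

cor(predict(fit.A, newdata = B), B$y)   # 3) test each model in the other half
cor(predict(fit.B, newdata = A), A$y)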
• Here, we have plotted the leverage values of the original (left) and outlier (right) datasets
• Note how two points have high-ish leverage at left, but this is not visible at right
• Here, we have plotted the DFFITS values for the original (left) and outlier (right) data
• Note that the outlier has a huge value (-10.13)
• Here, we have plotted the externally studentized residuals for the original (left) and outlier (right) data
• Note the extremity of the outlier • A caution: if you keep removing outliers, you will end up with ever-smaller data sets that have outliers of their own
• There are a variety of data transformations available, and we will discuss some of the most popular
• Note: There are alternatives to transforming raw data - You can change your analytic strategy (e.g., rank-order or tetrachoric correlations) - You can sometimes change your estimators to modify your tests (e.g., using MLR in SEM)
• Unfortunately, repeated sampling is not usually plausible
• One workaround is the bootstrap • Named after a story in The Adventures of Baron Munchausen, wherein the Baron falls into a lake and saves himself by pulling himself up by his own bootstraps • Allows us to get more information out of the dataset than is initially apparent.
• Original data: M: 31.3, range: 1.29-157.16, SD: 23.15
• Log2: M: 4.55, range: 0.37-7.30, SD: 1.28
• Ln: M: 3.16, range: 0.25-5.06, SD: 0.82
• Log10: M: 1.37, range: 0.11-2.20, SD: 0.35
• Note how higher bases pull in the data more and more, and restrict the SD more and more
• Bootstrapping can be computationally intensive, but modern programs tend to handle it relatively well and quickly
• The number of samples drawn can vary, but some simulation work suggests 200 samples should be sufficient for SE estimation in many cases • Our confidence in the bootstrap, like most statistics, grows as the sample gets larger
• DFFITS ("difference in fit, standardized") is:
• DFFITSi = (Ypred.i - Ypred.i(i)) / sqrt(MSE(i) × hii) • The numerator tells how the predicted value for case i would change, in raw score units of Y, if case i were deleted • The denominator (based on the mean squared error with case i deleted and case i's leverage) standardizes this value, so DFFITS estimates the number of SDs by which the predicted value of Yi would change if case i were deleted
• An alternative to DFFITS is Cook's D
• Di = Σj (Ypred.j - Ypred.j(i))^2 / ((k + 1) × MSE) • This equation shows that D compares the predicted values when case i is included versus deleted, squared, summed over all cases, and standardized • D has a minimum magnitude of 0, and higher values indicate greater influence of case i on the model fit (D must be non-negative)
• Another transformation for positively skewed data is taking the square root of values
• This is particularly helpful for count variables • Ex: How many symptoms does each person endorse? • Count variables are different from continuous variables and can require different analytic methods • Count variables often have a mean proportional to the variance rather than to the SD
• Comparing the transformations on the same data:
- Original data: M: 31.3, range: 1.29-157.16, SD: 23.15
- Log2: M: 4.55, range: 0.37-7.30, SD: 1.28
- Ln: M: 3.16, range: 0.25-5.06, SD: 0.82
- Log10: M: 1.37, range: 0.11-2.20, SD: 0.35
- Square root: M: 5.23, range: 1.14-12.54, SD: 1.98
• Another transformation is the reciprocal transformation (i.e., x becomes 1/x)
• This is particularly helpful when you have a large positive skew • For instance, you are measuring rat time in a maze; some rats take 30s but some take 300s because they completely fail • Taking the reciprocal of time (i.e., speed) can reduce the effect of outliers, particularly on the SD • So, if you have rat times of: • 10, 11, 13, 14, 15, 45, 450s • A reciprocal transformation produces: • 0.100, 0.091, 0.077, 0.071, 0.067, 0.022, 0.002 • Note that differences among the longer times are much reduced from what they were in the original units • Outliers have considerably less effect on the size of the SD
• So, imagine we want to determine the average time students take to complete the GRE
• We can sample 40 students, find the mean completion time, and calculate the SE using the information from the sample • We could get a better estimate of the SE by taking multiple samples of 40 people and seeing what the distribution of the means from each sample is
• However, we commonly transform data
• We conduct linear transformations: - Changing raw scores into z scores
Resampling for Regression
• When we attempt to estimate population parameters, we sample from the population • We then evaluate the resulting statistics in terms of their variation (SE) • In resampling, we instead treat our observations as a data set and draw samples from our sample repeatedly • This is extremely powerful, often necessary, and moves us away from theory (a theoretical infinite-sampling distribution) toward an empirical sampling distribution
• The bootstrap uses your sample as an approximation of the population
• You then take multiple samples of size n from your data, with replacement, to approximate samples of size n from the population: - 1) Collect your sample of size n - 2) Resample with replacement a large number of times (10,000, or however many), each resample of size n, and calculate your statistic of interest (e.g., the mean) for each one - These estimates form an empirical sampling distribution - 3) Use the SD of these multiple estimates to estimate the SE
• Use the bootstrap: - When you don't have a defined sampling distribution (and you need a SE) - To learn more from the data than you thought you knew - To move away from the theoretical approach and toward an observed (empirical) sampling distribution
• You can create MANY different resamples, and each time you resample, you calculate your statistic of interest
Regression diagnostics are
• case statistics · They produce a value for each case (observation)
If both of our variables were IVs, we could calculate their
• centroid (denoted by x) • Note that cases 8 and 15 have higher leverage than cases 4 and 12
Both DFFITS and Cook's D are
• deletion statistics that compare aspects of the regression equation when case i is included versus deleted (like studentized residuals) - They are closely related and provide redundant information, so one can be used
Visual inspection is good for
• extreme values · What about values less visually extreme? · What about multivariate extremity?
Leverage values identify cases far from the mean of the IV(s) that have
• greater potential to influence regression equation results • Whether a particular case actually influences the regression also depends on the discrepancy between the observed and predicted values of Y
DFBETAS values are positive if
• the inclusion of case i increases the coefficient, and negative if it decreases the coefficient • Interpretation guidelines: - For small samples, you can create an index plot for each coefficient - Histograms or stem and leaf displays can be used in larger samples • Investigate cases with relatively larger-magnitude DFBETAS values compared to other cases - Rules of thumb suggest influential cases are: • More extreme than +/- 1 for small/moderate samples • More extreme than +/- (2 / sqrt(n)) for larger samples - Keep in mind that these rules of thumb may or may not apply to your data
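• One way to apply these rules of thumb with R's built-in dfbetas() function (the simulated data and object names are only illustrative):
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 0.4 * dat$x1 + 0.3 * dat$x2 + rnorm(50)
model <- lm(y ~ x1 + x2, data = dat)

db <- dfbetas(model)                     # one column per coefficient, including the intercept
cutoff <- 2 / sqrt(nrow(dat))            # larger-sample rule of thumb
which(apply(abs(db), 1, max) > cutoff)   # cases flagged on any coefficient
# for small samples, an index plot per coefficient:
# for (j in 1:ncol(db)) plot(db[, j], main = colnames(db)[j])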
Discrepancy
• is the distance between the predicted and observed values of Y • We could use residuals for this, that is: ei = Yi - Y-hat i
Measures of influence combine information from
• measures of leverage and discrepancy to inform about how a regression equation would change if case i were removed from the data • These are divided into: - Global measures: Provide information about how case i affects overall characteristics of the regression equation (DFFITS, Cook's D) - Specific measures: Provide information about how case i affects each individual regression coefficient (DFBETAS) - Both types should be examined
To address this problem, one can use
• studentized residuals • The notion here is that as a case's score on the IV(s) gets further from the centroid, the estimate of the value of the residual for that case gets more precise - That is, the variance of ei gets smaller • Think about it like this: As a value gets more extreme, the regression line will be pulled toward it, until it eventually hits that point (and the residual will be estimated with higher accuracy)
When data do not fit certain assumptions, we can
• transform them • Transformations can seem "dishonest" - It may feel like you are unhappy with your data, and so you manipulate them - Tukey referred to them as "re-expressions" rather than "transformations"
We can evaluate externally studentized residuals
• visually or via a t distribution
In general:
• In general, Di = DFFITSi^2 × MSE(i) / ((k + 1) × MSE), so DFFITSi^2 is approximately (k + 1) × Di • But remember that DFFITS can be positive or negative, while D can only be 0 or greater
• The externally studentized residual follows these steps:
- 1) Calculate the regression equation in the sample with case i removed - 2) Find the predicted value for case i using this new equation (in which case i had no role in estimating the parameters), Ypred.i(i) - 3) Find the difference between the observed Yi and this prediction, which we call di: di = Yi - Ypred.i(i) • The key point is that the predicted value comes from a model in whose estimation case i played no role
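• In R, rstudent() returns externally studentized (deleted) residuals directly; here is a small sketch of the deletion logic with simulated data (all names are illustrative):
set.seed(1)
dat <- data.frame(x = rnorm(30))
dat$y <- 2 + 0.5 * dat$x + rnorm(30)
model <- lm(y ~ x, data = dat)

ext.res <- rstudent(model)                               # externally studentized residuals for every case

model.no1 <- lm(y ~ x, data = dat[-1, ])                 # step 1: refit with case 1 removed
yhat1.del <- predict(model.no1, newdata = dat[1, ])      # step 2: Ypred.1(1) from the new equation
d1 <- dat$y[1] - yhat1.del                               # step 3: d1 = Y1 - Ypred.1(1)
d1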
• There are two approaches for identifying cases with high leverage:
- 1) Plot the distribution of hii and identify cases with substantially higher leverage values than other cases • Index plots (as I included before) are helpful with small Ns • Stem and leaf displays are helpful for larger Ns - 2) Identify cases with leverage values that fall above a rule-of-thumb cutoff • Belsley, Kuh, & Welsch (1980) propose using values of hii > 2(k + 1)/n in large samples and > 3(k + 1)/n in small samples
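• A quick R sketch of the cutoff approach (the data are simulated for illustration; hatvalues() gives hii for each case):
set.seed(1)
dat <- data.frame(x = rnorm(40))
dat$y <- 1 + 0.6 * dat$x + rnorm(40)
model <- lm(y ~ x, data = dat)

h <- hatvalues(model)             # leverage hii for each case
k <- length(coef(model)) - 1      # number of IVs
n <- nrow(dat)

plot(h)                           # index plot, useful for small N
which(h > 2 * (k + 1) / n)        # Belsley, Kuh, & Welsch large-sample cutoff
which(h > 3 * (k + 1) / n)        # small-sample cutoff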
• In addition to SE estimation, we can also use bootstrapping methods to estimate CIs
- 1) Take a sample - 2) Resample with replacement from within the sample - 3) Use the sample percentiles to estimate the desired population percentiles - 4) Use these percentiles and the observed mean to estimate CI
• If the regression model fits the data, the externally studentized residuals will follow a t distribution with df = n - k - 1
- About 5% of the cases are expected to be greater than about 2.0 in absolute magnitude for moderate to large samples • In large samples, though, that would be many cases to check (e.g., 500 in 10,000) • So, many analysts use larger cut-offs (3.0, 3.5, 4.0) in large samples
• An alternative method to the bootstrap is the jackknife, but it can be used for largely the same purpose
- Case deletion statistic - Estimate the variance and bias of a statistic from a random sample - It is consistent: As N increases, the estimate approaches the true value • One systematically recomputes the statistic estimate, each time leaving out one or more observations from the sample
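• A minimal leave-one-out jackknife for the mean in R (the sample is simulated; the SE line is the standard jackknife variance formula):
set.seed(1)
x <- rexp(20)                    # illustrative non-normal sample
n <- length(x)

jack.means <- sapply(1:n, function(i) mean(x[-i]))   # recompute the mean, leaving each case out once

jack.se <- sqrt((n - 1) / n * sum((jack.means - mean(jack.means))^2))
jack.se
sd(x) / sqrt(n)                  # for the mean, this matches the usual SE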
• Because they are relatively interchangeable, you can use DFFITS or D • How can you identify high global influence cases?
- Detect outliers via an index plot (as shown before) in small samples - Use a rule of thumb. These include: • Absolute value of DFFITS > 1 in small/medium samples or > 2*sqrt((k+1)/n) in large samples • For D, you can use 1.0 or the critical value of F at alpha = .05 with df = (k + 1, n - k - 1)
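• A short R sketch applying these cutoffs with the built-in dffits() and cooks.distance() functions (data simulated for illustration):
set.seed(1)
dat <- data.frame(x = rnorm(40))
dat$y <- 1 + 0.6 * dat$x + rnorm(40)
model <- lm(y ~ x, data = dat)

k <- length(coef(model)) - 1
n <- nrow(dat)

dff <- dffits(model)
D   <- cooks.distance(model)

plot(dff); plot(D)                              # index plots for small samples
which(abs(dff) > 2 * sqrt((k + 1) / n))         # large-sample DFFITS rule of thumb
which(D > 1)                                    # simple Cook's D rule of thumb
which(D > qf(.95, k + 1, n - k - 1))            # or the F critical value at alpha = .05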
• Visualizing your data is critical as demonstrated by Anscombe's Quartet
- Four datasets of N = 11 each • These datasets each have an X and Y variable with: MX = 9, s2X = 11, MY = 7.5, s2Y = 4.1, rXY = .816, Y-hat = 3.00 + .500X • What can you say about these variables' distributions?
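• Anscombe's quartet ships with R in the datasets package, so this can be checked directly (a quick sketch):
data(anscombe)

sapply(anscombe, mean)                              # X means are 9 and Y means are about 7.5 in all four sets
sapply(anscombe, var)                               # variances are likewise (nearly) identical
cor(anscombe$x1, anscombe$y1); cor(anscombe$x4, anscombe$y4)             # r is about .816 in each set
coef(lm(y1 ~ x1, data = anscombe)); coef(lm(y4 ~ x4, data = anscombe))   # same regression line

par(mfrow = c(2, 2))                                # ...but the plots look nothing alike
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
                    xlab = paste0("x", i), ylab = paste0("y", i))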
• In other words, the right side of the distribution will be compressed more than will the left side
- In log10, the value of 10 gets reduced to 1, but the value of 1,000 gets reduced to 3 • This results in more symmetry for positively skewed distributions • It also reduces the SD if your numbers are large
• This basic leverage idea generalizes to multiple regression
- In that case, we are interested in how far case i's score on each of the IVs is from the IVs' centroid • The centroid is the point corresponding to the mean of each of the IVs taken together (the multivariate mean)
• We standardize this -104.05 value using the denominator
- In this case, the full DFFITS result for this outlier case is -10.13 • In other words, we find that Ypred for this case decreased by more than 10 SDs when this case was included in the regression
• There are two types of studentized residuals:
- Internally studentized • This takes the precision of the estimate of the residual into account • It does not follow a standard statistical distribution, making interpretation difficult - Externally studentized • This considers what would happen if the outlying case were deleted • It is a deletion statistic • Used most often, and almost always better than the internally studentized residual
• Leverage is closely related to Mahalanobis distance, which is used in cluster analyses and is a geometric distance from a centroid
- Mahalanobis = (n - 1)h*ii - Some software provides this - Different cutoffs necessary for interpretation
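• The relationship can be verified in R: mahalanobis() gives squared distances from the centroid, which should equal (n - 1) times the centered leverage (simulated predictors, illustrative names):
set.seed(1)
X <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
y <- 1 + 0.5 * X$x1 - 0.3 * X$x2 + rnorm(30)
model <- lm(y ~ x1 + x2, data = X)

n <- nrow(X)
h.centered <- hatvalues(model) - 1/n               # centered leverage h*ii
md2 <- mahalanobis(X, colMeans(X), cov(X))         # squared Mahalanobis distances from the centroid

all.equal(unname(md2), unname((n - 1) * h.centered))   # TRUE, up to rounding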
• In examining our original data, we see that:
- Minimum hii is 1/n = 1/15 = 0.0667 - Mean hii = (k + 1)/n = (1 + 1)/15 = .13 • In our data, the mean number of years since PhD (X) is 7.67 - For one observation, where X = 8, hii = .0670 • Note how close this is to min(hii), given that this person's X is near Mx - Another observation, where X = 18, hii = .43 • Note how this X's difference from Mx produces hii much higher than mean hii
• Alternatively, we can express DFFITS in terms of the externally studentized residual and leverage
- Note how measures of influence are the product of both leverage and discrepancy: DFFITSi = ti × sqrt(hii / (1 - hii)), where ti is the externally studentized residual for case i
• Let's do an example with X predicting Y
- Note that the last observation is a bit of an outlier on X and a notable outlier on Y - What do we expect for leverage and discrepancy?
> x <- c(41, 22, 2, 43, 11, 44, 51, 12, 45, 90)
> y <- c(30, 1, 30, 15, 25, 52, 43, 4, 44, 1500)
> my.data <- cbind(x, y)
> model <- lm(y ~ x)
> pairs(my.data)
> lev <- hat(model.matrix(model))
> plot(lev)
• The logarithmic transformation is useful whenever the SD is proportional to the mean and/or when the data is positively skewed
- Remember that a logarithm is a power - log10(25) is the power to which 10 must be raised to result in a value of 25 • Thus, log10(25) = 1.39794, so 10^1.39794 = 25 • Using log10: log10(10) = 1, log10(100) = 2, log10(1000) = 3... • The log pulls the values in the tail closer in, reducing positive skew
• Finally, note that not all stats packages report the same leverage
- SAS reports standard leverage ("Hat Diag H") - SPSS reports the centered leverage (h*ii), which removes the intercept as if Xs and Y were centered • Thus, it excludes potential impact on the intercept • min(h*ii) = 0 and max(h*ii) = 1-1/n
• By removing some data points, one overcomes influence of particular Y values on regression estimates
- That is, the coefficients are computed without case i in the training data, and the model is then tested in the validation data with case i included - This helps the model generalize beyond the training set
- Cross-validation is a method for examining a statistical predictive model
- Subsets of the data are held out as validation sets while the model is fit in a training set - One computes the model in the training set (where it is, in a sense, forced to fit) and then tests it in the validation set - One can then average the quality of prediction across cross-validation samples - Rather than collecting multiple samples, cross-validation subsets the existing data - If the model performs similarly in the training and validation sets, this suggests generalizability - The best approach is still to collect distinct data sets and vary the sampling, so you do not have a sea of homogeneous samples but rather samples that allow for stronger inferences about generalizability
• DFFITS and Cook's D are measures of global influence
- That is, they tell you how much impact case i has on the regression equation altogether • DFBETAS, on the other hand, is a deletion statistic that measures case i's impact on specific regression coefficients - It is the only diagnostic we've talked about that gives you more than one value per case
- The jackknife is more specialized, and it only gives a single variance estimate of the statistic
- The bootstrap produces an (empirical) distribution of the statistic and then computes a variance from it • This is more flexible, but it is more computationally intensive - The jackknife is easier to apply to complex sampling designs • Ex: representative sampling designs that require correct weighting, such as a group that should be weighted to its 10% share of the population even though it makes up only 3% of the sample - Simple resampling in the bootstrap does not handle such designs in the same way
• However, this isn't always helpful
- The outliers can pull the regression line toward them, as we saw before, minimizing the raw residual of the outlier and increasing the residuals for other cases - See how the outlier pulled the regression line toward it - Note how large the raw residual for the outlier would have been if the original-data regression line were used - This underestimates the residual of the outlier and overestimates the residuals of everything else - Here, we have plotted the raw residuals for the original (left) and outlier (right) data - Note that the outlier has a small-ish raw residual, smaller than many non-outlier values, because it pulled the regression line toward itself
• What the bootstrap has done is calculate a mean for a sample of 20 individuals
- Then, it draws a new sample of 20 individuals and calculates a new mean • And then it does that again - 10,000 times • The result is a sampling distribution of means
• The traditional mean and CI are:
> mean(sample.data)
[1] 1.145428
> CI.standard
[1] 0.4795397 1.8113165
• Our bootstrapped mean and CI are:
> mean(the.boot.means)
[1] 1.13966
> CI.boot
     2.5%     97.5%
0.5905015 1.7782094
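• A sketch of how such a bootstrap might be run in base R; the object names echo the output above, but the sample here is simulated, so the exact numbers will differ:
set.seed(1)
sample.data <- rexp(20)                        # an illustrative non-normal sample of 20

CI.standard <- mean(sample.data) +             # traditional 95% CI of the mean
  c(-1, 1) * qt(.975, 19) * sd(sample.data) / sqrt(20)

the.boot.means <- replicate(10000,             # 10,000 resamples of size 20, with replacement
  mean(sample(sample.data, 20, replace = TRUE)))

CI.boot <- quantile(the.boot.means, c(.025, .975))   # percentile CI from the empirical distribution

mean(sample.data); CI.standard
mean(the.boot.means); CI.boot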
• Both the bootstrap and jackknife estimate the variability of a statistic without parametric assumptions
- They tend to approximate one another - Difference: the bootstrap gives different estimates each time it is run on the same data (because of random resampling), while the jackknife produces the same estimate every time (because the removal is systematic)
• We can also use bootstrapping to get estimates for regression coefficients and their sampling distributions
- This allows for estimation of an empirical SE and significance testing • Let's do this for Y predicted from variables X and Z - We will draw 10,000 samples of N = 100 from our data, with replacement • This allows us to see bias we would not see theoretically - Deviation of the bootstrap distribution from normality indicates bias • Here the bootstrapped CIs are tighter around the estimate than the theoretical ones (a higher likelihood of a significant effect, i.e., of a CI not including zero) - It is still better to look at precise p values than to look only at CIs • The bootstrap identifies bias and can correct for it
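• One way to sketch this in base R, resampling cases (rows) with replacement and refitting the regression each time (the data frame and variable names are invented for illustration; the slides' own code is not shown):
set.seed(1)
dat <- data.frame(X = rnorm(100), Z = rnorm(100))
dat$Y <- 1 + 0.4 * dat$X + 0.2 * dat$Z + rnorm(100)

boot.coefs <- t(replicate(10000, {
  idx <- sample(nrow(dat), replace = TRUE)     # resample N = 100 cases with replacement
  coef(lm(Y ~ X + Z, data = dat[idx, ]))       # refit and keep the coefficients
}))

apply(boot.coefs, 2, sd)                                # empirical (bootstrap) SEs
apply(boot.coefs, 2, quantile, probs = c(.025, .975))   # percentile CIs per coefficient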
• You can use any base for the log:
- This is equivalent to measuring length in different units (inches, feet, cm, light years) - Common bases: 2, 3, 10, ln (e = 2.718) • Regardless of base, logs are only defined for positive numbers - So, if you have negative or near-zero values, add a scalar to each value to correct - Also, x^0 = 1, so logx(1) = 0, regardless of base
• First, there is no substitute for intimate knowledge of your data
- Visual inspection of the values or a scatterplot would reveal an outlier of 60 years since PhD - Is this possible? Is this probable? • After inspection of values, it may be that the data were entered incorrectly • If not, we will discuss what to do
• We also conduct non-linear transformations:
- We measure the time it takes a rat to run a maze, but we may report its speed, a reciprocal • A rat takes 10 seconds to run a maze (10/1); it runs one maze per 10 seconds (1/10) - We measure sound in terms of physical energy, but we report it in decibels, a log transformation
• When we do this for our faculty outlier case:
- raw residual (from the original model, with this case included): -11.00 - deleted difference di (from the model estimated without this case, the basis of the externally studentized residual): -115.05 • Thus, we see how deletion-based residuals overcome the pulling of the regression line toward outliers
• For our data, I conducted transformations with bases 2, 10, and ln
> log2.pos.skew = log2(pos.skew)
> log10.pos.skew = log10(pos.skew)
> ln.pos.skew = log(pos.skew)
• I then calculated the mean, range, and SD for each of these transformations - I plot them with the same axes
• I created a positively skewed data set of 1,000 observations in R:
> pos.skew = rgamma(1000, shape=1.7, scale=30/1.7) + 1
> hist(pos.skew, xlim = range(0,200), ylim = range(0,250))
• This created a randomly generated gamma distribution: - M: 31.3, range: 1.29-157.16, SD: 23.15
R calculates DFBETAS slightly differently:
• Plugging in, the numerator is 13.662 - .5192, and the whole equation works out to 62.72... • Note that what we get from this calculation maps onto what R gave us from the dfbetas function
• Three illustrative outliers:
- This outlier has an unusual Y value, but it has little influence on the regression line (in part because it is in the middle of the X range)
- This outlier has high leverage (due to its extreme X value) but no influence or discrepancy (because it falls on the line) - Leverage alone does not indicate influence or discrepancy
- This outlier has high leverage (extreme X) and some discrepancy, resulting in strong influence
(DON'T NEED TO KNOW) R calculates DFBETAS slightly differently:
DFBETASk(i) = (bk - bk(i)) / sqrt(MSE(i) × ckk), where bk is the kth regression coefficient and bk(i) is that coefficient with case i deleted, MSE(i) is the mean squared error with case i deleted, and ckk is the kth diagonal element of the unscaled covariance matrix of X (i.e., (X'X)^-1) • We can use this formula to reproduce in R what we found with the dfbetas function • First, we can calculate a new model with observation 10 missing from our data and compare the two results (abbreviated here)
• DFBETAS is:
DFBETASij = (Bj - Bj(i)) / SE(Bj(i)) • The numerator is the difference between the B coefficients calculated with case i included and excluded • The denominator standardizes the value, permitting comparison across coefficients • Each case will have k + 1 DFBETAS values: one for each of the k predictors and one for the intercept
• Multivariate leverage is tricky to calculate by hand; matrix algebra makes it easy:
H = X(X'X)^-1 X' • Here, H is called the "hat matrix" because it puts the "hat" (i.e., ^) on y to get Ypred - Leverages (hii) are on the diagonal of H
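• A quick check of this formula in R with a small simulated design matrix (hatvalues() returns the same diagonal):
set.seed(1)
x1 <- rnorm(15); x2 <- rnorm(15)
y  <- 2 + 0.5 * x1 + 0.3 * x2 + rnorm(15)

X <- cbind(1, x1, x2)                     # design matrix, including the intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)     # H = X(X'X)^-1 X'

y.pred <- H %*% y                         # H puts the "hat" on y
all.equal(as.vector(y.pred), unname(fitted(lm(y ~ x1 + x2))))   # TRUE

diag(H)                                   # leverages hii on the diagonal
all.equal(unname(diag(H)), unname(hatvalues(lm(y ~ x1 + x2))))  # TRUE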
• We will explore an example from Cohen, Cohen, West, and Aiken • 15 faculty members were assessed in the original dataset for: - X: number of years since PhD conferral (range: 2-18) - Y: number of publications (range: 2-48) • A second dataset contained an outlier: - One value of X was changed • 6 years since PhD changed to 60 years
If we remove the problem case, our estimates are still somewhat different from those for the original data, and significance is reduced because power depends on sample size.