AP Stat Ch. 10
Re-expressing Goal 2: Make the spread of several groups (as seen in side-by-side boxplots) more alike, even if their centers differ
Groups with a common spread are easier to compare; taking logs often makes the individual boxplots more symmetric and their spreads more nearly equal; can also reveal problems in the data
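A quick numeric check of the idea above, as a minimal sketch. The three groups are hypothetical (each is the previous one multiplied by 10), and `iqr` is a helper name introduced here for illustration.

```python
import math
import statistics

def iqr(values):
    """Interquartile range (Q3 - Q1), the box length in a boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

# Hypothetical groups whose spread grows along with their center.
groups = {
    "A": [8, 10, 12, 15, 20],
    "B": [80, 100, 120, 150, 200],
    "C": [800, 1000, 1200, 1500, 2000],
}

raw_iqr = {name: iqr(vals) for name, vals in groups.items()}
log_iqr = {name: iqr([math.log10(v) for v in vals]) for name, vals in groups.items()}

for name in groups:
    print(f"{name}: raw IQR = {raw_iqr[name]:.1f}, log10 IQR = {log_iqr[name]:.3f}")
```

On the raw scale the boxes differ in length by a factor of 100 from A to C; after taking logs the centers still differ, but the box lengths match.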
When none of the data values is zero or negative...
LOGARITHMS can help. Try taking logs of the x-variable, the y-variable, or both. Then pick the matching model: exponential (log y vs. x), logarithmic (y vs. log x), or power (log y vs. log x).
Ways to model/summarize data
Requires that:
-Data have a simple structure
-Symmetry, for summaries of center and spread and for using a Normal model
-Equal variation across groups, when we compare groups with boxplots or want to compare their centers
-Linear shape in a scatterplot, so that correlation can summarize the scatter and regression can fit a linear model
Reasons to consider re-expression:
-Make the distribution of a variable more symmetric
-Make the spread across different groups more similar
-Make the form of a scatterplot straighter
-Make the scatter around the line in a scatterplot more consistent
Ways to handle bent relationships
-Straighten the data, then fit a line
-Use the calculator shortcut to create a curve
Don't choose a model based on R^2 alone.
A high R^2 does NOT mean the pattern is straight. MAKE A PICTURE. Before you fit a line, always look at the pattern in the scatterplot. After you fit the line, check for linearity again by plotting the residuals.
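To see why R^2 alone is not enough, here is a minimal sketch that fits a line to deliberately curved, hypothetical data (y = x^2) and then looks at the residuals. `fit_line` is a small least-squares helper written for this example, not a calculator or library function.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = list(range(1, 11))
ys = [x ** 2 for x in xs]          # hypothetical curved data: y = x^2

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# R^2 = 1 - (unexplained variation) / (total variation)
mean_y = sum(ys) / len(ys)
sse = sum(r ** 2 for r in residuals)
sst = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - sse / sst

print(f"R^2 = {r_squared:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```

R^2 comes out at about 0.95, yet the residuals are positive at both ends and negative in the middle. That bend in the residual plot is the sign that a line is the wrong form.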
Watch out for data far from 1.
Data values that are all very far from 1 may not be much affected by re-expression unless their range is very large
-Re-expressing numbers between 1 and 100 has a much greater effect than re-expressing numbers between 100,001 and 100,100
-When all the values are large (years, for example), subtract a constant to bring them back near 1
-Consider "years since 1950" as the variable to re-express
-If the data start at 1950, avoid creating a zero by using "years since 1949" instead
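The effect can be checked numerically. This sketch compares how much the square root "bends" each range by dividing the first step between re-expressed values by the last step; `step_ratio` is a measure made up for this illustration, not a standard statistic.

```python
import math

def step_ratio(values):
    """First sqrt step divided by last sqrt step.

    A ratio near 1 means the square root acts almost like a straight line
    over this range, so re-expressing changes the shape very little.
    """
    roots = [math.sqrt(v) for v in values]
    return (roots[1] - roots[0]) / (roots[-1] - roots[-2])

near = list(range(1, 101))                 # 1 to 100
far = list(range(100_001, 100_101))        # 100,001 to 100,100
shifted = [v - 100_000 for v in far]       # subtract a constant to get back near 1

print(f"near 1:     {step_ratio(near):.2f}")
print(f"far from 1: {step_ratio(far):.4f}")
print(f"shifted:    {step_ratio(shifted):.2f}")
```

The ratio is well above 1 for the values near 1 and almost exactly 1 for the values far from 1. Subtracting 100,000 restores the full effect.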
Don't expect your model to be perfect.
We are not looking for the RIGHT MODEL...we are looking for a USEFUL model.
Watch out for negative data values
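cannot re-express negative values with a power that is not a whole number (or with logs), and cannot re-express zeros with negative powers (or logs); one possible cure for zeros and small negative values is to add a constant (1/2 and 1/6 are often used) to bring all the data values above zero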
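A minimal sketch of the problem and the cure, using hypothetical data that contain a zero. The reciprocal fails on the zero; adding 1/2 to every value first lets the re-expression go through.

```python
values = [0, 1, 4, 9, 25]           # hypothetical data containing a zero

def reciprocal(vals):
    """The -1 rung of the ladder, negated so larger values stay larger."""
    return [-1 / v for v in vals]

try:
    reciprocal(values)
    failed = False
except ZeroDivisionError:           # a zero cannot take a negative power
    failed = True

shift = 0.5                         # 1/2, one of the constants mentioned above
shifted = [v + shift for v in values]
rescued = reciprocal(shifted)

print("reciprocal failed on raw data:", failed)
print("after adding 1/2:", [round(r, 3) for r in rescued])
```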
Ladder of Powers: When you take a negative power, the
direction of the relationship will change; you can always change the sign of the re-expressed response variable (for example, use -1/y instead of 1/y) if you want to keep the same direction
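A small check of the direction change, with hypothetical y-values that increase as x increases. `is_increasing` is a helper written for this example.

```python
def is_increasing(values):
    """True when each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

y = [2.0, 4.0, 5.0, 8.0, 10.0]      # hypothetical response, rising with x

recip = [1 / v for v in y]          # power -1: the direction flips
neg_recip = [-1 / v for v in y]     # change the sign to restore the direction

print("y increasing:    ", is_increasing(y))
print("1/y increasing:  ", is_increasing(recip))
print("-1/y increasing: ", is_increasing(neg_recip))
```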
Re-expressing Goal 3: Make form of scatterplot more linear
a straight form is easier to model; a re-expression such as taking logs can often straighten a curved relationship
Re-expressing Goal 1: Make distribution of variable more symmetric
easier to summarize the center of a symmetric distribution; for nearly symmetric distributions, use the mean and standard deviation. If the distribution is also unimodal, the re-expressed distribution may be closer to the Normal model, so the 68-95-99.7 Rule can be used
Ladder of Powers orders:
the effects that re-expressions have on data. Ex: if taking the square roots of all the values in a variable helps but not quite enough, move farther down the ladder to the logarithm or the reciprocal root; those re-expressions have a similar but even stronger effect on the data. If you go too far, you can go back up the ladder.
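Stepping down the ladder can be sketched in a few lines. The data are hypothetical right-skewed values, and skewness (the average cubed z-score) stands in here for "how lopsided the histogram looks"; both `reexpress` and `skewness` are helpers written for this illustration.

```python
import math

def reexpress(values, power):
    """Apply one rung of the Ladder of Powers to positive data.

    Power 0 stands for the logarithm; negative powers are negated so that
    the ordering of the values is preserved.
    """
    if power == 0:
        return [math.log10(v) for v in values]
    if power < 0:
        return [-(v ** power) for v in values]
    return [v ** power for v in values]

def skewness(values):
    """Average cubed z-score: positive = skewed right, near 0 = symmetric."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

data = [1, 2, 2, 3, 3, 3, 4, 5, 8, 15, 40]   # hypothetical right-skewed values
ladder = [1, 0.5, 0, -0.5, -1]               # raw, sqrt, log, -1/sqrt, -1/y

skews = {p: skewness(reexpress(data, p)) for p in ladder}
for p in ladder:
    print(f"power {p:>4}: skewness = {skews[p]:.2f}")
```

Each step down the ladder pulls in the right tail a little more. When the skewness passes zero and turns negative, the re-expression has gone too far and it is time to step back up.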
Ladder of Powers
farther you move away from original data ("1" position), greater the effect of re-expression on data
Re-expressing Goal 4: Make the scatter in a scatterplot spread out evenly rather than thickening at one end
having even scatter is a condition of many methods of Statistics
We re-express data to:
improve symmetry, make scatter around a line more constant, or make a scatterplot more linear
The Ladder of Powers or the...
log-log approach can help us find a good re-expression
Ladder of Powers: "0"
logs
-For measurements that CANNOT be negative, and for values that grow by percentage increases (salaries, populations)
-When in doubt, start here
-If your data have zeros, try adding a small constant to all values before taking logs
We seek a useful...
model, not perfection (or even "the best")
Ladder of Powers
places in order the effects that many re-expressions have on the data
Ladder of Powers: Power "1"
raw data; no change at all; "home base"
-The farther you step from here, up or down the ladder, the greater the effect
-Data that can take on both + and - values with no bounds are less likely to benefit from re-expression
Re-expression
re-express data by applying the logarithm, the square root, the reciprocal, or some other mathematical operation to all values of a variable
Models won't be perfect, but that...
re-expression can lead to a useful model
Beware of multiple modes.
re-expression can make a skewed unimodal histogram more nearly symmetric, but cannot pull separate modes together; it may make the separation of the modes clearer, making it easier to analyze each one individually
Watch out for scatterplots that turn around
re-expression can straighten many bent relationships, but not those that go up and then down (or down and then up); refuse to analyze such data with methods that require a linear form
Ladder of Powers: Power "1/2"
square root of y; for counted data, start here
Don't stray too far from the ladder.
taking y-values to an extremely high power may artificially inflate R^2, but the result will not be a useful or meaningful model; stick to powers between -2 and 2
Power
x-axis: log(x); y-axis: log(y)
-Goldilocks model: use when one of the ladder's powers is too big and the next is too small
-Re-expression equation: log ŷ = a + b log x
-Calculator's curve: PwrReg, ŷ = ax^b
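A minimal sketch of the power model in Python rather than on the calculator. The data are hypothetical and noise-free (generated from y = 3x^2) so the recovered values are easy to check; `fit_line` is a least-squares helper written for this example.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 2 for x in xs]        # hypothetical data from y = 3x^2

# Straighten: regress log(y) on log(x).
log_x = [math.log10(x) for x in xs]
log_y = [math.log10(y) for y in ys]
b, log_a = fit_line(log_x, log_y)    # log y-hat = log_a + b * log x

# Back-transform the intercept to get the curve form y-hat = a * x^b.
a = 10 ** log_a

print(f"log y-hat = {log_a:.3f} + {b:.3f} log x")
print(f"y-hat = {a:.3f} * x^{b:.3f}")
```

The slope of the log-log line is the power b. The multiplier in the curve form is 10 raised to the intercept of the log-log line, so the two a's are not the same number.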
Logarithmic
x-axis: log(x); y-axis: y
-Use for a wide range of x-values, or a scatterplot that descends rapidly at the left but levels off toward the right
-Re-expression equation: ŷ = a + b log x
-Calculator's curve: LnReg, ŷ = a + b ln x
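The two forms of the logarithmic model differ only in the base of the log. This sketch fits hypothetical noise-free data (y = 5 + 2 log x) against log10(x), then converts the slope to the natural-log form; `fit_line` is a least-squares helper written for this example.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 10, 100, 1000, 10000]                 # a wide range of x-values
ys = [5 + 2 * math.log10(x) for x in xs]       # hypothetical data

# Straighten: regress y on log10(x).
b, a = fit_line([math.log10(x) for x in xs], ys)

# Same curve written with natural logs: log10(x) = ln(x) / ln(10).
b_ln = b / math.log(10)

x_new = 50
pred_log10 = a + b * math.log10(x_new)
pred_ln = a + b_ln * math.log(x_new)
print(f"y-hat = {a:.3f} + {b:.3f} log x")
print(f"y-hat = {a:.3f} + {b_ln:.3f} ln x")
```

Both versions give the same prediction for any x; only the slope changes, by a factor of ln(10).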
Exponential
x-axis: x; y-axis: log(y)
-This is the "0" power in the ladder approach; useful for values that grow by percentage increases
-Re-expression equation: log ŷ = a + bx
-Calculator's curve: ExpReg, ŷ = ab^x
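A minimal sketch of the exponential model. The data are hypothetical and noise-free (a quantity starting at 2 and growing 50% per step), and `fit_line` is a least-squares helper written for this example.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [0, 1, 2, 3, 4, 5]
ys = [2 * 1.5 ** x for x in xs]      # hypothetical data: 50% growth per step

# Straighten: regress log(y) on x.
log_y = [math.log10(y) for y in ys]
slope, intercept = fit_line(xs, log_y)   # log y-hat = intercept + slope * x

# Back-transform both coefficients to get the curve form y-hat = a * b^x.
a = 10 ** intercept
b = 10 ** slope

print(f"log y-hat = {intercept:.3f} + {slope:.3f} x")
print(f"y-hat = {a:.3f} * {b:.3f}^x")
```

Here b is the growth factor per unit of x (1.5 means a 50% increase), which is why this model suits values that grow by percentage increases.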
Ladder of Powers: Power "2"
y^2; try this for unimodal distributions that are skewed to the left