AP Stat Chapter 9 Vocab: Re-expressing Data: Get it Straight!
Goals of re-expression
1. Make the distribution of a variable (as seen in its histogram, for example) more symmetric - for example, take logs of the data to remove skew.
2. Make the spread of several groups (as seen in side-by-side box plots) more alike, even if their centers differ - this makes them easier to compare and can reveal problems in the data.
3. Make the form of a scatterplot more nearly linear.
4. Make the scatter in the scatterplot spread out evenly rather than thickening at one end.
Power of "0"
Although mathematicians define the "0-th" power differently, for us the place is held by the logarithm. It does not matter whether you use the base-10 log or the natural log. Measurements that cannot be negative, and especially values that grow by percentage increases such as salaries or populations, often benefit from a log re-expression. When in doubt, start here. If your data have zeros, try adding a small constant to all values before finding the logs.
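A minimal sketch of the log re-expression described above, including the "add a small constant" trick for zeros. The function name log_reexpress, the shift value of 1.0, and the data are hypothetical and chosen only for illustration.

```python
import math

def log_reexpress(values, shift=0.0):
    """Return log10(value + shift) for every value.

    `shift` is the small constant added when the data contain zeros.
    """
    shifted = [v + shift for v in values]
    if min(shifted) <= 0:
        raise ValueError("all values must be positive before taking logs")
    return [math.log10(v) for v in shifted]

# Made-up values growing by 50% per step, with a zero at the start.
raw = [0, 10, 15, 22.5, 33.75]
logged = log_reexpress(raw, shift=1.0)

# Base 10 and natural logs differ only by a constant factor, ln(10),
# so either base straightens the data equally well.
base_ratio = math.log(50) / math.log10(50)
```

Without the shift, the zero in this list would make the log undefined, which is why the function raises an error instead of returning a value.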
R²
Ex: 99.85% of the total variance in log(weight) can be explained by (attributed to) the variance in log(length).
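As a rough illustration of how an R² like the one in this example could be computed, the sketch below squares the correlation between two lists. The lengths and weights are made up (weight is set to exactly 0.5 × length³), so they are not the data behind the 99.85% figure, and r_squared is a hypothetical helper name.

```python
import math

def r_squared(x, y):
    """Squared correlation between x and y (simple linear regression R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical data: weight follows an exact cubic power law in length.
lengths = [10, 20, 30, 40, 50]
weights = [0.5 * length ** 3 for length in lengths]

r2_raw = r_squared(lengths, weights)
r2_log = r_squared([math.log10(v) for v in lengths],
                   [math.log10(v) for v in weights])
```

With these invented values, R² on the log-log scale comes out essentially 1, while R² on the raw scale is noticeably lower because the raw relationship is curved.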
Power of -1/2
The (negative) reciprocal square root, -1/√y. An uncommon re-expression, but sometimes useful. Changing the sign to take the negative of the reciprocal square root preserves the direction of relationships, making things a bit simpler.
Power of -1
The (negative) reciprocal, -1/y. Ratios of two quantities (mph, for example) often benefit from a reciprocal. (You have about a 50-50 chance that the original ratio was taken in the "wrong" order for simple statistical analysis and would benefit from re-expression.) Often, the reciprocal will have simple units (hours per mile). Change the sign if you want to preserve the direction of the relationship. If your data have zeros, try adding a small constant to all values before finding the reciprocal. Example of 1/y: look at gallons per 100 miles rather than mpg.
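The mpg example can be sketched in a few lines. The mpg values are made up, and gallons_per_100_miles is a hypothetical helper name; multiplying the reciprocal by 100 only rescales it to friendlier units.

```python
def gallons_per_100_miles(mpg):
    """Reciprocal re-expression of fuel economy, scaled to 100 miles."""
    if mpg <= 0:
        raise ValueError("mpg must be positive to take a reciprocal")
    return 100.0 / mpg

# Made-up fuel economy values, in miles per gallon.
mpg_values = [10, 20, 25, 50]

# Plain reciprocal (times 100): larger mpg now gives a SMALLER number.
gp100 = [gallons_per_100_miles(m) for m in mpg_values]

# Negative reciprocal: larger mpg still gives a larger number,
# so the direction of any relationship is preserved.
negative_reciprocal = [-1.0 / m for m in mpg_values]
```

Here gp100 runs from 10.0 down to 2.0 as mpg rises, while negative_reciprocal keeps increasing, which is the point of changing the sign.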
Ladder of Powers
The Ladder of Powers places in order the effects that many re-expressions have on the data (p. 237). The farther you move from the original data, the greater the effect on any curvature.
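One way to picture the ladder is as a single function that takes a value and a rung. The sketch below is an illustration only: the name ladder is hypothetical, it assumes strictly positive data, it uses the base-10 log for the "0" rung, and it flips the sign on negative powers so that order is preserved.

```python
import math

def ladder(y, power):
    """Re-express a positive value y using one rung of the Ladder of Powers."""
    if y <= 0:
        raise ValueError("this sketch assumes strictly positive data")
    if power == 0:
        return math.log10(y)      # the "0" rung is held by the logarithm
    if power < 0:
        return -(y ** power)      # negative reciprocal-type rungs
    return y ** power             # 2: square, 1: raw data, 1/2: square root

# The rungs named in these notes, from the top of the ladder down.
rungs = [2, 1, 0.5, 0, -0.5, -1]
example = [ladder(4, p) for p in rungs]
```

Stepping down the list of rungs from 1 pulls in a long upper tail more and more strongly; stepping up to 2 does the opposite.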
Power of 1
The raw data (no change at all). This is "home base." The farther you step from here up or down the ladder, the greater the effect. Data that can take on both positive and negative values with no bounds are less likely to benefit from re-expression.
Power of 2
The square of the data values, y². Try this for unimodal distributions that are skewed to the left.
Power of 1/2
The square root of the data values, √y. Counts often benefit from a square root re-expression. For counted data, start here.
Re-expression
We re-express data by taking the logarithm, the square root, the reciprocal, or some other mathematical operation on all values of a variable. (p. 232)
Power model
x-axis: log(x); y-axis: log(y). The Goldilocks model: When one of the ladder's powers is too big and the next is too small, this one may be just right.
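A small sketch of fitting a power model: regress log(y) on log(x), then undo the logs to predict. The data are made up to follow y = 3x² exactly, and fit_line and predict are hypothetical helper names.

```python
import math

def fit_line(x, y):
    """Least-squares line for y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

# Hypothetical data following y = 3 * x^2.
x = [1, 2, 4, 8, 16]
y = [3 * v ** 2 for v in x]

# Power model: log(y) = intercept + slope * log(x)
intercept, slope = fit_line([math.log10(v) for v in x],
                            [math.log10(v) for v in y])

def predict(x_new):
    """Back-transform the fitted line to the original y scale."""
    return 10 ** (intercept + slope * math.log10(x_new))
```

The fitted slope plays the role of the power (about 2 here), and 10 raised to the intercept recovers the multiplier (about 3).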
Logarithmic model
x-axis: log(x); y-axis: y. A wide range of x-values, or a scatterplot descending rapidly at the left but leveling off toward the right, may benefit from trying this model.
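A sketch of the logarithmic model: only x is re-expressed, and y stays on its original scale. The data are invented to follow y = 10 - 3·log10(x), which drops quickly at the left and levels off to the right; fit_line and predict are hypothetical helper names.

```python
import math

def fit_line(x, y):
    """Least-squares line for y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

# Hypothetical data with a very wide range of x-values.
x = [1, 10, 100, 1000, 10000]
y = [10 - 3 * math.log10(v) for v in x]

# Logarithmic model: y = intercept + slope * log(x)
intercept, slope = fit_line([math.log10(v) for v in x], y)

def predict(x_new):
    """No back-transformation is needed because y was never re-expressed."""
    return intercept + slope * math.log10(x_new)
```

Because y is untouched, predictions come out directly in the original units.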
Exponential model
x-axis: x; y-axis: log(y). This model is the "0" power in the ladder approach, useful for values that grow by percentage increases.
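A sketch of the exponential model: regress log(y) on x, then read the percentage growth off the slope. The data are made up to grow by exactly 5% per step from a starting value of 100, and fit_line is a hypothetical helper name.

```python
import math

def fit_line(x, y):
    """Least-squares line for y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

# Hypothetical values growing 5% per time step.
x = [0, 1, 2, 3, 4, 5]
y = [100 * 1.05 ** t for t in x]

# Exponential model: log(y) = intercept + slope * x
intercept, slope = fit_line(x, [math.log10(v) for v in y])

growth_factor = 10 ** slope       # multiplier per unit of x
starting_value = 10 ** intercept  # predicted y when x = 0
```

With these invented values the growth factor comes back as about 1.05, that is, roughly 5% growth per step.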