Statistics 9
measures of dispersion
(range, IQR, standard deviation, and variance) tell us how much spread or variation exists in our data.
measures of central tendency skewness
- In most symmetrical distributions, the mean, the median and the mode are the same.
- In a positively skewed (right-skewed) distribution, the mean is the largest measure of central tendency.
- In a negatively skewed (left-skewed) distribution, the mode is the largest measure of central tendency.
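As a rough illustration of the ordering described above, the sketch below compares the three measures on two small data sets. Both data sets are made up for this example (they are not from the course material), and real skewed data will not always order this neatly.

```python
import statistics

# Hypothetical data with a long right tail (positively skewed).
right_skewed = [2, 3, 3, 4, 5, 6, 20]
# Hypothetical data with a long left tail (negatively skewed).
left_skewed = [1, 15, 16, 17, 18, 18, 19]

def central_tendency(data):
    """Return (mean, median, mode) for a data set with a single mode."""
    return (statistics.mean(data), statistics.median(data), statistics.mode(data))

right_mean, right_median, right_mode = central_tendency(right_skewed)
left_mean, left_median, left_mode = central_tendency(left_skewed)

# Right tail pulls the mean up; left tail pulls the mean down.
print("right skewed:", right_mean, right_median, right_mode)
print("left skewed: ", left_mean, left_median, left_mode)
```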
Unlike the measures of central tendency, the measures of dispersion are biased in that they will usually underestimate the actual population dispersion if sample data is used.
Here's why. The range (max-min) for {2, 4, 6, 8, 10} is 8 (10-2). Can you take a sample of three numbers from {2, 4, 6, 8, 10} whose range exceeds 8? No way. At best you could choose the 2 and the 10 and get a range that equals, but never exceeds, the range of the original data.
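The claim can be checked by brute force. This sketch lists every possible sample of three distinct numbers from {2, 4, 6, 8, 10} and computes each sample's range; the variable names are only illustrative.

```python
from itertools import combinations

population = [2, 4, 6, 8, 10]
population_range = max(population) - min(population)

# Range of every possible 3-number sample (no repeats).
sample_ranges = [max(s) - min(s) for s in combinations(population, 3)]

# How many samples reach the full population range?
full_range_count = sum(r == population_range for r in sample_ranges)

print("population range:", population_range)
print("largest sample range:", max(sample_ranges))
print("samples matching the population range:",
      full_range_count, "of", len(sample_ranges))
```

No sample exceeds the population range, and only the samples that happen to contain both 2 and 10 match it.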
In general, skewness and kurtosis are both measures of shape
Specifically, skewness is a measure of symmetry and kurtosis is a measure of steepness.
In terms of kurtosis, we can describe our data as platykurtic (flatter than a normal curve), mesokurtic (a normal curve), or leptokurtic (steeper than a normal curve).
The steeper the curve, the smaller the measures of dispersion. The flatter the curve, the greater the measures of dispersion.
x̅
a statistic
measures of central tendency
mean, median, mode, weighted mean
standard deviation
tells us approximately (not exactly) how far each number differs from the mean of the distribution.
the mode is the most frequently occurring value in a set of data.
there can be more than one or no mode
the symbol for the population mean is
μ (called mu)
the mode of this data set is? (2, 3, 4, 5, 6, 7, and 8)
no mode
for example, the median of (10, 20, 30, 40, 50) is 30
Since we have 5 numbers, the formula becomes .5 X (5-1) + 1 = 3. The median is not 3 itself but the third number in the ordered list, which is 30.
the mode of this data set is? (2, 3, 4, 5, 6, 7 and 2)
2
the mode of this data set is? (2, 3, 4, 5, 6, 7, 7, and 2)
2 and 7
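The three mode questions above can be reproduced with a small helper. This is a sketch that follows the convention used on these cards: when no value repeats, it reports no mode (an empty list). The function name `modes` is made up for this example.

```python
from collections import Counter

def modes(data):
    """Return a sorted list of the most frequent values.

    Assumption (matching the cards): if no value occurs more than once,
    the data set is treated as having no mode and [] is returned.
    """
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)

print(modes([2, 3, 4, 5, 6, 7, 8]))
print(modes([2, 3, 4, 5, 6, 7, 2]))
print(modes([2, 3, 4, 5, 6, 7, 7, 2]))
```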
to find the weighted mean
: 1) multiply each score by its weight or value, 2) add your products (the values obtained in step 1), and 3) then divide by the sum of the weights (usually 100 when computing grades).
add all your data and divide by the number of pieces of data you used
=ΣX/N.
The arithmetic mean
an average that is calculated by adding up a set of quantities and dividing the sum by the total number of quantities in the set.
restrictions:
because the mean uses all of the data, it is the most sensitive to outliers (extreme values). for example, the mean and median are both 30 for (10, 20, 30, 40, 50). the median remains 30 if we include an outlier (10, 20, 30, 40, 500). but the mean is greatly affected and now equals 120.
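The outlier example above is easy to verify with Python's standard `statistics` module. This is just a check of the numbers on the card, not a general test for outliers.

```python
import statistics

original = [10, 20, 30, 40, 50]
with_outlier = [10, 20, 30, 40, 500]

# The mean uses every value, so one extreme value moves it a lot.
mean_before = statistics.mean(original)
mean_after = statistics.mean(with_outlier)

# The median only depends on the middle position, so it stays put.
median_before = statistics.median(original)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```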
The IQR (Q3-Q1)
is also a range, but it excludes the top 25% of your data and the bottom 25% of your data. In other words, it is the range for the middle 50% of your data. It is found by subtracting the 25th percentile (Q1) from the 75th percentile (Q3). As a reminder, the median is always Q2. The IQR in the figure below is 77-64=13.
The range (max-min)
is found by subtracting the smallest number in a data set from the largest number in the data set. For example, the range for 1, 3, 6, 8, 4, and 5 is 8-1=7. Its weakness is that it only uses two numbers.
although the mean is the most often used measure of central tendency,
it has certain restrictions and requirements
there are several types of means,
not just the arithmetic mean
μ
parameter of interest
Measures of Dispersion (variability or spread)
-Range: Max - Min
-IQR: Q3 - Q1
-Standard deviation: tells us on average approximately how far the data are from the mean; it is the square root of the variance
-Variance: tells us on average approximately how far the data are from the mean, in squared units
-Coefficient of variation: allows us to compare standard deviations measured on different scales
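The measures listed above can be computed with Python's standard `statistics` module. This sketch reuses the data set 1, 3, 6, 8, 4, 5 from the range card and makes two assumptions: quartiles are taken with the "inclusive" method (which interpolates between positions using N-1), and the coefficient of variation is computed in its usual form, the standard deviation divided by the mean.

```python
import statistics

data = [1, 3, 6, 8, 4, 5]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# IQR: Q3 - Q1, using quartiles interpolated between data positions.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Sample variance and standard deviation (N-1 in the denominator).
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Coefficient of variation: spread relative to the size of the mean.
coefficient_of_variation = std_dev / statistics.mean(data)

print("range:", data_range)
print("IQR:", iqr)
print("variance:", variance)
print("standard deviation:", std_dev)
print("coefficient of variation:", coefficient_of_variation)
```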
P/100 X (N-1) + 1 where P is the percentile you want to locate
since the median is always the 50th percentile, the formula could be written as .5 X (N-1) + 1= location of median
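The location formula can be turned into a short function. This is a sketch, not a standard library routine: the name `percentile` is made up here, and when the location falls between two positions it assumes linear interpolation between the neighboring values, so a location of 3.5 gives the value halfway between the 3rd and 4th numbers.

```python
import math

def percentile(data, p):
    """Value at the p-th percentile using location = p/100 * (N-1) + 1."""
    ordered = sorted(data)              # data must be in order first
    n = len(ordered)
    location = p / 100 * (n - 1) + 1    # 1-based position in the ordered data
    lower = math.floor(location)        # whole-number part of the position
    fraction = location - lower         # how far toward the next position
    if lower >= n:                      # location is the last position
        return ordered[-1]
    below = ordered[lower - 1]
    above = ordered[lower]
    return below + fraction * (above - below)

odd_median = percentile([10, 20, 30, 40, 50], 50)
even_median = percentile([10, 20, 30, 40, 50, 60], 50)
print(odd_median, even_median)
```

The same function gives Q1 and Q3 by passing 25 and 75, so an IQR can be computed as `percentile(data, 75) - percentile(data, 25)`.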
If you choose three numbers at random, much of the time you will not get both 2 and the 10, which means much of the time you will underestimate the range that exists in the population. The same principle applies for the standard deviation and the variance.
This is why you will notice the formula we use relies on N-1 in the denominator rather than N. Dividing by a smaller number increases the resulting quotient.
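To see the effect of N-1, the sketch below takes every possible sample of three numbers from {2, 4, 6, 8, 10} and averages the sample variances computed both ways. It assumes sampling with replacement (the same number can be drawn more than once), which is the setting where dividing by N-1 makes the average come out exactly right; the helper name is illustrative.

```python
from itertools import product

population = [2, 4, 6, 8, 10]

def sum_of_squared_deviations(values):
    """Sum of squared distances from the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

# Population variance: divide by N because we have the whole population.
population_variance = sum_of_squared_deviations(population) / len(population)

# Every ordered sample of size 3, drawn with replacement (an assumption).
samples = list(product(population, repeat=3))

# Average sample variance when dividing by N (here 3) ...
average_dividing_by_n = sum(
    sum_of_squared_deviations(s) / 3 for s in samples
) / len(samples)

# ... and when dividing by N-1 (here 2).
average_dividing_by_n_minus_1 = sum(
    sum_of_squared_deviations(s) / 2 for s in samples
) / len(samples)

print("population variance:", population_variance)
print("average using N:", average_dividing_by_n)
print("average using N-1:", average_dividing_by_n_minus_1)
```

Dividing by N underestimates the population variance on average, while dividing by N-1 recovers it.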
In terms of skewness, we can describe our data as no skewness or symmetrical, positively skewed/right tailed, or negatively skewed/left tailed.
In terms of skewness, positive and negative do not mean good or bad. Positive and negative simply refer to the direction of the dominant tail.
mode is typically less affected by outliers than
the mean or median
over the long run, it is not as representative as the mean, but still less affected by outliers, and can be used with ordinal, interval or ratio data.
the mean, however, cannot always be computed with ordinal data.
the median is the middle number or the 50th percentile- just make sure data are in order first
the median of (2, 1, 3) is 2 not 1.
requirements:
to compute the mean, interval or ratio data is required. under some circumstances, ordinal data can also be used, but it is impossible to compute a mean with nominal-level data
Example 1:
what is your average if you make a 90 on a final exam worth 60% of your grade and a 50 on a paper worth 40% of your grade? (90 x 60) + (50 x 40) = 5400 + 2000 = 7400, 7400/100 = 74.
Example 2:
what is your average speed if you drive 20 minutes at 65 mph and 60 minutes at 30 mph? (65 x 20) + (30 x 60) = 1300 + 1800 = 3100, 3100/80 minutes=38.75 mph.
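Both examples follow the same three steps (multiply, add, divide by the sum of the weights), which can be sketched as one small function. The name `weighted_mean` is made up for this illustration.

```python
def weighted_mean(values, weights):
    """Multiply each value by its weight, add, then divide by total weight."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(v * w for v, w in zip(values, weights))
    return weighted_sum / total_weight

# Example 1: exam (90, weight 60) and paper (50, weight 40).
grade_average = weighted_mean([90, 50], [60, 40])

# Example 2: 65 mph for 20 minutes and 30 mph for 60 minutes.
average_speed = weighted_mean([65, 30], [20, 60])

print(grade_average, average_speed)
```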
the formula is the same regardless: M=ΣX/N
where Σ (sigma) means sum, X refers to your data, and N refers to how much data you have.
the symbol for the sample mean is
x̅ (called x bar)
if you have an even number (10, 20, 30, 40, 50, 60)
you may have a median that doesn't exist in your data set: .5 X (6-1) + 1 = 3.5. the 3.5th number is 35 (the mean of 30 and 40)
since the symbols are hard to reproduce
you will usually just see a capital, italicized M used.