stats chapter 3
spread
(aka dispersion) of the distribution describes the variability in the data, which is the degree to which the data values are either clustered together or spread out; numerical measures of spread include the range, variance, standard deviation, and interquartile range
general z-score formula
(observation-mean)/standard deviation
for bell-shaped data, approx 68% of observations will have z-score between
-1 and 1
for bell-shaped data, approx 95% of observations will have z-score between
-2 and 2
for bell-shaped data, approx 99.7% (almost all) of observations will have z-score between
-3 and 3 (it is quite unusual to have a z-score larger than 3 or smaller than -3; observations with z-scores greater than 3 in absolute value are considered outliers)
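A quick check of where the 68%, 95%, and 99.7% figures above come from. This is a minimal sketch that assumes an exactly normal (idealized bell-shaped) model, so real data will only match approximately. It uses `statistics.NormalDist` from Python's standard library; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# Idealized bell-shaped model: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Hypothetical helper: proportion of a normal distribution
    with z-score between -k and k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"z-score between -{k} and {k}: {share_within(k):.1%}")
```

Under this assumption the three shares come out to roughly 68.3%, 95.4%, and 99.7%, which the notes round to 68%, 95%, and 99.7%.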
an observation that is unusually large or small relative to the other values in a data set is called an outlier. outliers are typically attributable to one of the following causes:
-the measurement is observed, recorded, or entered into the computer incorrectly
-the measurement comes from a different population
-the measurement is correct but represents a rare (chance) event
for bell-shaped distributions: what percentage will fall within 1 standard deviation of the mean?
68%
for bell-shaped distributions: what percentage will fall within 2 standard deviations of the mean?
95%
for bell-shaped distributions: what percentage will fall within 3 standard deviations of the mean?
99.7% (essentially all)
boxplot
a graph of the 5 number summary. to make one:
-draw a number line long enough to include the max and min values
-above the line, draw a box from the lower quartile Q1 to the upper quartile Q3
-draw a line inside the box at the median
-draw a line from the lower end of the box to the smallest observation that is not an outlier, and a second line from the upper end of the box to the largest observation that is not an outlier; these lines are called whiskers
-any data value less than the lower fence or greater than the upper fence is considered an outlier and is marked with an asterisk
resistant
a numerical summary of data is said to be resistant if extreme observations (very large or small compared to the rest of the data) have little, if any, influence on its value
variance
an average of the squares of the deviations of the observations from their mean
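A small worked example of the variance as an "average" of squared deviations. It assumes the sample version, which divides by n - 1 rather than n (the definition above does not say which divisor is meant), and the data values are made up.

```python
from statistics import mean, variance

data = [2, 4, 4, 5, 10]      # made-up sample
x_bar = mean(data)           # sample mean, here 5

# Squared deviation of each observation from the mean: 9, 1, 1, 0, 25.
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Sample variance: sum of squared deviations divided by n - 1 (assumed divisor).
s_squared = sum(squared_deviations) / (len(data) - 1)

print(s_squared)             # 9.0
print(variance(data))        # the standard library's sample variance agrees
```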
finding quartiles:
arrange the data in order and find the median M (the second quartile); the median of the observations below M is Q1, and the median of the observations above M is Q3
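The quartile steps above can be sketched as follows. This is one reading of the method: when n is odd, M itself is left out of both halves (some textbooks and software use other conventions, so results can differ slightly). The function name `quartiles` and the data are made up, and the sketch assumes at least two observations.

```python
from statistics import median

def quartiles(values):
    """Hypothetical helper: return (Q1, M, Q3) by the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations below M
    upper = ordered[half + n % 2:]    # observations above M (skips M when n is odd)
    return median(lower), median(ordered), median(upper)

q1, m, q3 = quartiles([7, 2, 10, 4, 9, 5, 30, 6, 8])
print(q1, m, q3)   # 4.5 7 9.5
```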
z-score is zero if...
data value equals the mean
measures of position
describe the relative position of a particular data value within the entire set of data. two examples are z-scores and percentiles
shape
described by mentioning any symmetry or skewness, the number of peaks, any clusters or gaps, and any unusually high or low observations (called outliers)
center
describes a "typical" or "representative" data value; numerical measures of center include the mean and median
to describe a distribution:
discuss its shape, provide an appropriate measure of center, and provide an appropriate measure of spread
interquartile range
distance between the first and third quartiles: IQR = Q3-Q1. it's a measure of spread; the more spread out the data are, the larger the IQR tends to be. it represents the range of the middle 50% of observations and is not affected by outliers, so it is a RESISTANT measure of spread
most common percentiles are quartiles, which
divide the data into 4 equal parts
Q1 Q2 Q3
first quartile (Q1) is the 25th percentile; second quartile (Q2) is the median (M), which is the 50th percentile; third quartile (Q3) is the 75th percentile
pth percentile
for any set of n observations arranged in order, the pth percentile is a number such that p% of the observations fall below the pth percentile and (100-p)% fall above it (think SAT scores)
outlier
has a value that is unusually large or small relative to the other observations
when will positive z-score occur?
if a data value is larger than the mean
when will negative z-score occur?
if a data value is smaller than the mean
difference in median regarding number of observations:
if n is odd, the median M is the middle observation in the ordered sample; if n is even, the median M is the average of the two middle observations in the data set
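The odd/even rule above, written out by hand as a minimal sketch. The function name `median_by_hand` and the data are made up; in practice `statistics.median` from the standard library does the same job.

```python
def median_by_hand(values):
    """Hypothetical helper: median via the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n is odd: the single middle observation
        return ordered[mid]
    # n is even: average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_hand([3, 1, 7]))       # 3
print(median_by_hand([3, 1, 7, 10]))   # 5.0
```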
why is a measure of center alone not enough to describe a distribution well?
it doesn't indicate the degree to which the data are spread out (ie the amount of dispersion)
for approx symmetric distributions, report what?
mean and standard deviation
resistance of mean
mean is NOT resistant (its value changes noticeably when an outlier is added or removed)
generally, if the shape of the distribution is approximately symmetric, then ...
mean is about equal to median
generally, if the shape of the distribution is skewed right, then...
mean is larger than median
generally, if the shape of the distribution is skewed left, then...
mean is smaller than median
if there is an outlier, take the ...
median
for distributions that are skewed or contain outliers, report what?
median and interquartile range
resistance of median
median is a resistant measure of center! (median is just about the same without the outlier)
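A small illustration of resistance, using made-up numbers: the same data set is summarized with and without one extreme value. It is only a sketch of the idea, not a general proof.

```python
from statistics import mean, median

values = [10, 12, 13, 15, 15]      # made-up data with no outlier
with_outlier = values + [95]       # same data plus one extreme value

# Without the outlier the mean and median agree.
print(mean(values), median(values))

# With the outlier the mean is pulled upward; the median barely moves.
print(mean(with_outlier), median(with_outlier))
```

Here the mean jumps from 13 to about 26.7, while the median only moves from 13 to 14.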
median (M)
middle observation when the observations are listed in order
what does 5 number summary include?
minimum, Q1, M, Q3, and maximum
what should box plots be used for?
moderate to large data sets (it takes at least 5 numbers to make one anyway); also use them only for unimodal distributions, because a box plot hides multiple peaks (such as bimodality)
mode
observation that occurs most frequently; describes a typical observation in terms of the most common outcome, but the mode need not be near the center of a distribution; most often used to describe the category of a categorical variable that has the highest frequency
sigma (σ)
population standard deviation
sigma^2
population variance
when asked to compare two distributions, you need to discuss the similarities and/or differences in their shapes, centers, and spreads;
provide measures of center and spread for each distribution, and discuss which one is larger/smaller. when comparing two distributions, you should always use the same measures of center and spread for both distributions, otherwise the comparison isn't valid. if a group is strongly skewed or has outliers, it is usually best to compare the medians and interquartile ranges for all groups; otherwise compare means and standard deviations
s
sample standard deviation
s^2
sample variance
fences
serve as cutoff points for determining outliers; if a data value is less than the lower fence or greater than the upper fence, it's considered an outlier
lower fence = Q1 - 1.5(IQR)
upper fence = Q3 + 1.5(IQR)
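The fence formulas can be sketched as below. The function name `fences` and the data are made up, and the quartiles are simply stated for this data set (Q1 = 4.5 and Q3 = 9.5 under the median-of-halves method) rather than computed in the code.

```python
def fences(q1, q3):
    """Hypothetical helper: return (lower fence, upper fence)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 4, 5, 6, 7, 8, 9, 10, 30]   # made-up data
q1, q3 = 4.5, 9.5                      # quartiles assumed for this data set

lower, upper = fences(q1, q3)

# Anything outside the fences is flagged as an outlier.
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper, outliers)   # -3.0 17.0 [30]
```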
when you describe the distribution of a quantitative variable, you need to mention 3 characteristics:
shape, center, spread
range
the difference between the largest and smallest observations; simplest and easiest measure of spread to compute, but it's SEVERELY affected by outliers (R = max-min)
z-score for an observation represents...
the distance that a data value is from the mean in terms of the number of standard deviations
if the distribution is close to symmetric or only mildly skewed, what measure of center should be used?
the mean is usually preferred because it uses the numerical value of all the observations
if a distribution is highly skewed, what measure of center should be used?
the median is usually preferred over the mean because it better represents what is typical
for distributions on box plots that are approximately symmetric,
the median will be near the center of the box and the left and right whiskers will be roughly the same length
for distributions on box plots that are skewed right,
the median will be slightly left of the center of the box and the right whisker will be longer than the left whisker (or there may be high outliers)
for distributions on box plots that are skewed left,
the median will be slightly right of the center of the box and the left whisker will be longer than the right whisker (or there may be low outliers)
standard deviation
the most popular summary of spread, which represents a typical value for how far the data fall from their mean; uses all data values in its computations
the most common measures of spread for a quantitative variable are ...
the range, the interquartile range, the variance, and the standard deviation
as well as the center, we are also interested in..
the spread (aka variability, consistency, dispersion)
mean (average)
the sum of the observations divided by the number of observations (x bar means sample mean and mu means population mean)
when the data are either skewed left/right...
there are extreme values in the tail, which tend to pull the mean in the direction of the tail
sometimes, the median may be a better measure of central tendency than the mean
true
properties of the standard deviation, s:
the value is always greater than or equal to zero; the greater the spread of the data, the larger the value of s; when all the observations are the same value, s=0 (there is no spread in the data); strong skewness or a few outliers can greatly increase s, so s is NOT resistant; s has the same units of measurement as the original observations, but the variance s^2 is in squared units (for example chips^2, which has no natural meaning)
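A few of these properties can be checked on small made-up data sets. This is a sketch using `statistics.stdev`, which computes the sample standard deviation s; the list names are arbitrary.

```python
from statistics import stdev

same = [4, 4, 4, 4]               # every observation identical
tight = [3, 4, 4, 5]              # clustered around 4
wide = [1, 3, 5, 7]               # same center, more spread out
with_outlier = [3, 4, 4, 5, 40]   # the tight data plus one extreme value

print(stdev(same))                  # 0.0, no spread at all
print(stdev(tight), stdev(wide))    # s grows as the data spread out
print(stdev(with_outlier))          # one outlier inflates s, so s is not resistant
```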
the shape of a distribution influences...
whether the mean is larger or smaller than the median
population z-score:
z = (x - mu)/sigma
sample z-score:
z = (x - xbar)/s
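A minimal sketch of the two z-score formulas side by side. The data and the function name `z_score` are made up. `statistics.stdev` gives the sample standard deviation s and `statistics.pstdev` gives the population standard deviation sigma; which one applies depends on whether the values are treated as a sample or as the whole population.

```python
from statistics import mean, pstdev, stdev

def z_score(x, center, spread):
    """Hypothetical helper: (observation - mean) / standard deviation."""
    return (x - center) / spread

scores = [60, 70, 80, 90, 100]   # made-up data, mean 80

# Treating the values as a sample: use xbar and s.
z_sample = z_score(100, mean(scores), stdev(scores))

# Treating the same values as a whole population: use mu and sigma.
z_population = z_score(100, mean(scores), pstdev(scores))

print(round(z_sample, 2), round(z_population, 2))
```

For the observation 100 the two z-scores are about 1.26 and 1.41. Both are positive because 100 is above the mean, and an observation equal to the mean (80 here) would get a z-score of zero.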