MATH 1680 Chapter 3
What does a z-score represent?
A z-score represents the distance of a data value from the mean, measured in standard deviations.
What does a z-score measure?
A z-score measures the number of standard deviations an observation is above or below the mean. For example, a z-score of 1.24 means the data value is 1.24 standard deviations above the mean. A z-score of −2.31 means the data value is 2.31 standard deviations below the mean.
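The interpretation above can be sketched in a few lines of Python. The function name `z_score` and the mean of 70 and standard deviation of 5 are hypothetical values chosen only to reproduce the two z-scores mentioned in the card.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Hypothetical distribution: mean 70, standard deviation 5
print(z_score(76.2, 70, 5))   # about 1.24, so 1.24 standard deviations above the mean
print(z_score(58.45, 70, 5))  # about -2.31, so 2.31 standard deviations below the mean
```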
Explain how to compute the arithmetic mean of a variable.
Add all the values of the variable in the data set and divide by the number of observations
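A minimal sketch of that computation, assuming the data are stored in a Python list. The name `arithmetic_mean` and the scores are made up for illustration.

```python
def arithmetic_mean(values):
    """Add all the values and divide by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)

scores = [82, 77, 90, 71, 85]   # hypothetical quantitative data
print(arithmetic_mean(scores))  # 405 / 5 = 81.0
```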
When an observation that is much larger than the rest of the data is added to a data set, the value of the median will increase substantially.
False
When comparing two populations, what does a larger standard deviation imply about dispersion?
It implies that there is a greater dispersion or spread of the distribution provided the variable of interest from the two populations has the same unit of measure.
What does a measure of central tendency describe?
It numerically describes the average or "typical" data value. In everyday language the word average often represents the arithmetic mean (to compute the arithmetic mean of a set of data, the data must be quantitative).
For a distribution that is symmetric, which of the following is true?
Mean = median
If the shape of a distribution is symmetric, which measure of central tendency and which measure of dispersion should be reported?
Mean should be the measure of central tendency and standard deviation should be the measure of dispersion
Is standard deviation resistant? Why or why not?
Standard deviation is NOT resistant because a single extreme value can dramatically increase or decrease it.
List the four steps for checking for outliers by using quartiles.
Step 1: Determine the first and third quartiles of the data. Step 2: Compute the interquartile range, IQR = Q3 − Q1. Step 3: Determine the fences, which serve as cutoff points for determining outliers: Lower fence = Q1 − 1.5(IQR), Upper fence = Q3 + 1.5(IQR). Step 4: If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier.
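The fence check can be sketched as follows, assuming the quartiles from Step 1 have already been found. The function name `find_outliers`, the data, and the quartile values 4.5 and 8.5 are hypothetical.

```python
def find_outliers(data, q1, q3):
    """Return the lower fence, the upper fence, and any values outside them."""
    iqr = q3 - q1                    # interquartile range
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers

data = [2, 4, 5, 6, 7, 8, 9, 30]     # hypothetical data, Q1 = 4.5 and Q3 = 8.5
print(find_outliers(data, 4.5, 8.5)) # (-1.5, 14.5, [30])
```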
List the three steps for finding quartiles.
Step 1: Arrange the data in ascending order. Step 2: Determine the median, M, or second quartile, Q2. Step 3: Divide the data set into two halves: the observations less than M and the observations greater than M. The first quartile, Q1, is the median of the bottom half, and the third quartile, Q3, is the median of the top half. Do not include M in these halves.
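These three steps might look like this in Python. The sketch assumes the halves are split by position in the sorted list, so the middle observation is dropped when n is odd; library routines often use other quartile conventions and can give slightly different values. The name `quartiles` and the data are illustrative.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-halves method."""
    ordered = sorted(data)                 # Step 1: ascending order
    n = len(ordered)
    m = median(ordered)                    # Step 2: the median, Q2
    half = n // 2
    lower = ordered[:half]                 # Step 3: bottom half, M excluded
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), m, median(upper)

print(quartiles([13, 1, 9, 5, 11, 3, 7]))  # (3, 7, 11)
```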
What is the range of a variable?
The difference between the largest and smallest data value
Define the first, second, and third quartiles
The first quartile, denoted Q1, divides the bottom 25% of the data from the top 75%. The second quartile, Q2, divides the bottom 50% of the data from the top 50%. The third quartile, Q3, divides the bottom 75% of the data from the top 25%.
If a data set has many values that are "far" from the mean, how is the standard deviation affected?
The further an observation is from the mean, the larger the squared deviation. If a data set has many observations that are "far" from the mean, then the sum of the squared deviations will be large and the standard deviation will be large.
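To see the effect numerically, the sketch below compares two hypothetical data sets with the same mean. It assumes the population formula (dividing by N); the sample formula divides by n − 1 instead, but the comparison comes out the same way.

```python
from math import sqrt

def population_std_dev(values):
    """Square root of the average squared deviation about the mean."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / n)

close_to_mean = [48, 49, 50, 51, 52]     # hypothetical, mean 50
far_from_mean = [10, 30, 50, 70, 90]     # hypothetical, mean 50

print(population_std_dev(close_to_mean)) # about 1.41
print(population_std_dev(far_from_mean)) # about 28.28
```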
Explain the circumstances for which the interquartile range is the preferred measure of dispersion. What is an advantage that the standard deviation has over the interquartile range?
The interquartile range is preferred when the data are skewed or have outliers. An advantage of the standard deviation is that it uses all the observations in its computation.
Define the interquartile range, IQR.
The interquartile range, IQR, is the range of the middle 50% of the observations in a data set. That is, the IQR is the difference between the third and first quartiles, found using the formula IQR = Q3 − Q1.
Why is the median resistant but the mean is not?
The mean is not resistant because when data are skewed, there are extreme values in the tail, which tend to pull the mean in the direction of the tail. The median is resistant because the median of a variable is the value that lies in the middle of the data when arranged in ascending order and does not depend on the extreme values of the data.
Define the mode of a variable.
The observation of the variable that occurs most frequently in the data set
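A short sketch of finding the mode, which works for qualitative as well as quantitative data. The function name `modes` is hypothetical, and returning every value tied for the highest count is an assumption of this sketch; some courses instead report "no mode" when no value repeats.

```python
from collections import Counter

def modes(values):
    """Return the value(s) that occur most frequently."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

print(modes(["red", "blue", "red", "green"]))  # ['red']
print(modes([1, 2, 2, 3, 3]))                  # [2, 3], a tie
```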
List the conditions for determining when to use mean:
Use mean when data are quantitative and the frequency distribution is roughly symmetric
State the reason that we compute the mean.
We compute the mean because much of statistical inference is based on the mean
For a distribution that is skewed right, the median is ___________ of the box
left of center
For a distribution that is skewed left, which of the following is true?
mean < median
For a distribution that is skewed right, which of the following is true?
mean > median
In distributions that are skewed to the right, what is the relationship of the mean, median, and mode?
mean > median > mode
After all, the mean and median are close in value for symmetric data, and the ___________ is the better measure of central tendency for skewed data.
median
Therefore, when the distribution of data is highly skewed or contains extreme observations, it is best to use the _________ as the measure of central tendency and the interquartile range as the measure of dispersion because these measures are resistant.
median
Which measure, the mean or the median, is resistant?
The median is resistant; the mean is not.
With ____________ z-scores, we need to be careful when deciding the better outcome. For example, when comparing marathon finishing times, the lower z-score is better because it is more standard deviations below the mean.
negative
The interpretation of the interquartile range is the range of the middle 50% of the data. The more spread a set of data has, the higher the interquartile range will be. The interquartile range, IQR, is a __________ measure of dispersion.
resistant
Which measure, the mean or the median, is least affected by extreme observations?
the median
The _______ represents the number of standard deviations an observation is from the mean.
z score
The sum of the deviations about the mean always equals _______
zero
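This can be checked on any data set. The sketch below uses exact fractions so that rounding does not hide the result; it assumes integer data, and the values are made up.

```python
from fractions import Fraction

def deviations_about_mean(values):
    """Return x - mean for every observation, using exact arithmetic."""
    mean = Fraction(sum(values), len(values))
    return [x - mean for x in values]

data = [3, 8, 10, 15]            # hypothetical data, mean 9
deviations = deviations_about_mean(data)
print(sum(deviations))           # 0
```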
Empirical Rule
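For a bell-shaped distribution, approximately 68% of the data lie within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.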
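As a rough illustration, assuming the usual statement of the Empirical Rule (the three percentages apply to the intervals within 1, 2, and 3 standard deviations of the mean of a bell-shaped distribution), the intervals can be computed like this. The mean of 100 and standard deviation of 15 are hypothetical.

```python
def empirical_rule_intervals(mean, std_dev):
    """Map k = 1, 2, 3 to (lower bound, upper bound, approximate % of data)."""
    percentages = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        k: (mean - k * std_dev, mean + k * std_dev, pct)
        for k, pct in percentages.items()
    }

for k, (low, high, pct) in empirical_rule_intervals(100, 15).items():
    print(f"about {pct}% of the data lie between {low} and {high}")
```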
______ is not resistant
Range
Range, standard deviation, and variance are not resistant.
True
When an observation that is much larger than the rest of the data is added to a data set, the value of the mean will
increase
Define what it means for a numerical summary of data to be resistant.
A numerical summary of data is said to be resistant if values that are extreme (very large or small) relative to the data do not affect its value substantially
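A quick experiment makes the definition concrete: add one extreme value to a small data set and compare how far the mean and the median move. The data are invented for illustration.

```python
from statistics import mean, median

data = [10, 12, 13, 15, 16]      # hypothetical data
with_extreme = data + [200]      # same data plus one very large observation

print(mean(data), median(data))                  # 13.2 and 13
print(mean(with_extreme), median(with_extreme))  # about 44.3 and 14.0

mean_shift = mean(with_extreme) - mean(data)        # large: the mean is not resistant
median_shift = median(with_extreme) - median(data)  # small: the median is resistant
```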
When describing the shape of a distribution from a boxplot, be sure to justify your conclusion. Possible areas to discuss:
Compare the length of the left whisker to the length of the right whisker. Note the position of the median in the box. Compare the distance between the median and the first quartile to the distance between the median and the third quartile. Compare the distance between the median and the minimum value to the distance between the median and the maximum value.
Define variance.
The variance of a variable is the square of the standard deviation.
What does a positive z-score for a data value indicate? What does a negative z-score indicate?
If a data value is larger than the mean, the z-score is positive. If a data value is smaller than the mean, the z-score is negative. If the data value equals the mean, the z-score is zero.
If the shape of a distribution is skewed left or skewed right, which measure of central tendency and which measure of dispersion should be reported? Why?
It is best to use the median as the measure of central tendency and the interquartile range as the measure of dispersion because these measures are resistant
Symmetric box plot
Median is in the center of the box. Left and right whiskers are roughly the same length.
Skewed right box plot
Median is left of center in the box. Left whisker is shorter than right whisker.
Skewed left box plot
Median is right of center in the box. Left whisker is longer than right whisker.
List the conditions for determining when to use median:
Use median when the data are quantitative and the frequency distribution is skewed left or skewed right
What is an outlier?
Outliers are extreme observations in data sets. They can occur by chance, because of errors in measurement of a variable, during data entry, or from errors in sampling.
Which measure of dispersion is resistant?
The interquartile range (IQR). Quartiles are resistant, and for this reason they are used to define a resistant measure of dispersion.
List the three steps in finding the median of a data set.
Step 1: Arrange the data in ascending order. Step 2: Determine the number of observations, n. Step 3: Determine the observation in the middle of the data set. If the number of observations is odd, the median is the data value exactly in the middle of the data set; that is, the observation in the (n+1)/2 position. If the number of observations is even, the median is the mean of the two middle observations; that is, the mean of the observations in the n/2 position and the (n/2) + 1 position.
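The steps translate fairly directly into code. In this sketch the positions in the card are 1-based while Python lists are 0-based, so each position is shifted down by one; the function name `find_median` and the data are illustrative.

```python
def find_median(values):
    """Median via the three steps: sort, count, pick the middle."""
    ordered = sorted(values)             # Step 1: ascending order
    n = len(ordered)                     # Step 2: number of observations
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:                       # Step 3, odd n: position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # Step 3, even n: mean of positions n / 2 and (n / 2) + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(find_median([7, 1, 5, 3, 9]))  # 5
print(find_median([7, 1, 5, 3]))     # 4.0
```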
List the five steps for drawing a boxplot.
Step 1: Determine the lower and upper fences: LF = Q1 − 1.5(IQR) and UF = Q3 + 1.5(IQR), where IQR = Q3 − Q1. Step 2: Draw a number line long enough to include the maximum and minimum values. Insert vertical lines at Q1, M, and Q3. Enclose those vertical lines in a box. Step 3: Label the lower and upper fences with a temporary mark. Step 4: Draw a line from Q1 to the smallest data value that is larger than the lower fence. Draw a line from Q3 to the largest data value that is smaller than the upper fence. These lines are called whiskers. Step 5: Plot any data values less than the lower fence or greater than the upper fence as outliers. Outliers are marked with an asterisk (*). Remove the temporary marks labeling the fences.
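The numbers needed for Steps 4 and 5 (where the whiskers end and which points get an asterisk) can be computed before drawing anything. This sketch assumes Q1 and Q3 are already known and treats a value exactly equal to a fence as inside the fences, consistent with Step 5; the function name and data are hypothetical.

```python
def whiskers_and_outliers(data, q1, q3):
    """Return (left whisker end, right whisker end, outliers) for a boxplot."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    # Whiskers reach the most extreme observations that are not outliers
    return min(inside), max(inside), outliers

data = [2, 4, 5, 6, 7, 8, 9, 30]             # hypothetical data, Q1 = 4.5, Q3 = 8.5
print(whiskers_and_outliers(data, 4.5, 8.5)) # (2, 9, [30])
```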
List the conditions for determining when to use mode:
Use mode when the most frequent observation is the desired measure of central tendency or the data are qualitative.
What values does the five-number summary consist of?
The five-number summary of a set of data consists of the smallest data value, Q1, the median, Q3, and the largest data value. We use the five-number summary to learn information about the extremes of the data set. The summary is organized as: Minimum Q1 M(edian) Q3 Maximum
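A possible way to assemble the summary in Python, assuming quartiles are found as the medians of the lower and upper halves with the middle observation left out when n is odd. The function name `five_number_summary` and the data are made up.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, M, Q3, maximum)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle observation when n is odd
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

data = [22, 12, 40, 18, 25, 15, 31, 20, 28]  # hypothetical data
print(five_number_summary(data))             # (12, 16.5, 22, 29.5, 40)
```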
What symbols are used to represent the population mean and the sample mean?
The population arithmetic mean, μ (pronounced "mew"), is a parameter that is computed using data from all the individuals in a population. The sample arithmetic mean, x̄ (pronounced "x-bar"), is a statistic that is computed using data from individuals in a sample.
What is the mean of the data?
The value such that a histogram of the data is perfectly balanced, with equal weight on each side of the mean
Define the median of a variable.
The value that lies in the middle of the data when arranged in ascending order. We use M to represent the median.
The standard deviation can be negative.
False
True or False: When comparing two populations, the larger the standard deviation, the more dispersion the distribution has, provided that the variable of interest from the two populations has the same unit of measure.
True, because the standard deviation describes how far, on average, each observation is from the typical value. A larger standard deviation means that observations are more distant from the typical value, and therefore, more dispersed.