Chapters 3 & 4
Sample arithmetic mean
Computed using sample data. The sample mean is a statistic
Median
Divides the lower 50% of the data from the upper 50%. It is a special case of the more general concept of the percentile.
Most widely used 3 measures of central tendency
Mean, median, and mode
An insurance company crashed four cars of the same model at 5 miles per hour. The costs of repair for each of the four crashes were $442, $440, $462, and $206. Compute the mean, median, and mode cost of repair.
Mean: $387.50 Median: $441.00 No mode
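A minimal Python sketch of this computation, using the standard library for the mean and median. The helper `modes` is a hypothetical name introduced here; it returns an empty list when no value repeats, matching the "no mode" convention used in these cards.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return all values tied for the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every observation occurs exactly once: no mode
    return [value for value, count in counts.items() if count == highest]


costs = [442, 440, 462, 206]  # repair costs in dollars, from the card above

print(mean(costs))    # 387.5
print(median(costs))  # 441.0
print(modes(costs))   # []
```

The low $206 value pulls the mean well below the median, which is why the median is the more resistant summary here.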
Quartiles
The most common percentiles. They divide the data into fourths (4 equal parts).
Weighted mean formula
Multiplying each value of the variable by its corresponding weight, summing these products, and dividing the result by the sum of the weights.
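The formula described above can be sketched in Python as follows. The function name `weighted_mean` and the GPA figures (grade points and credit hours) are illustrative assumptions, not data from these cards.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical semester: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 1]

print(weighted_mean(grade_points, credit_hours))  # 3.25
```

Here the 4-credit course counts four times as much as the 1-credit course, so the result (3.25) differs from the unweighted mean of the three grades (3.0).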
What does it mean if r=0?
No linear relationship exists between the variables.
Dispersion
The degree to which the data are spread out
What does it mean to say that two variables are positively associated?
There is a linear relationship between the variables, and whenever the value of one variable increases, the value of the other variable increases.
Another measure of central tendency is the trimmed mean. It is computed by determining the mean of a data set after deleting the smallest and largest observed values. Compute the trimmed mean for the data given in the accompanying table. Is the trimmed mean resistant to changes in the extreme values in the given data?
Trimmed mean: 0.875
True or False: When comparing two populations, the larger the standard deviation, the more dispersion the distribution has, provided that the variable of interest from the two populations has the same unit of measure.
True
The sum of all deviations about the mean must equal
Zero
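A quick check of this fact in Python, as a sketch using the sample 23, 14, 1, 6, 21 that appears elsewhere in this set. It also shows why the last deviation has no freedom: it is forced to be the negative of the sum of the others.

```python
data = [23, 14, 1, 6, 21]

x_bar = sum(data) / len(data)              # sample mean, 13.0
deviations = [x - x_bar for x in data]     # each value minus the mean

print(deviations)       # [10.0, 1.0, -12.0, -7.0, 8.0]
print(sum(deviations))  # 0.0

# The last deviation is determined by the first n - 1 deviations.
forced_last = -sum(deviations[:-1])
print(forced_last)      # 8.0
```

With data whose mean is not exactly representable in floating point, the sum may come out as a tiny nonzero number rather than exactly 0.0.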
The standard deviation is used in conjunction with the mean to numerically describe distributions that are
bell shaped and symmetric
Exploratory data analysis
Exploring data through summaries; the term was coined by John Tukey
How to check for outliers
(1) Determine Q1 and Q3 (2) Compute the IQR (3) Determine the fences (which serve as cutoff points for determining outliers) - Lower fence = Q1 - 1.5(IQR) - Upper fence = Q3 + 1.5(IQR) (4) If a data value is less than the lower fence or greater than the upper fence, it is an outlier
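The fence steps can be sketched in Python as below, with the upper fence taken as Q3 + 1.5(IQR). The quartiles are passed in rather than computed, because textbooks and software differ in how they locate Q1 and Q3. The quartile values are the ones from the violent-crime card in this set; the function names and the three sample rates are hypothetical.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Return the values that fall below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]


q1, q3 = 273.8, 529.1                 # quartiles from the violent-crime card
lower_fence, upper_fence = fences(q1, q3)
print(round(lower_fence, 2), round(upper_fence, 2))

rates = [120.0, 388.5, 950.2]         # hypothetical rates for illustration
print(find_outliers(rates, q1, q3))   # [950.2]
```

The fences come out to about -109.15 and 912.05, so only a rate above roughly 912 would be flagged; a negative lower fence simply means no low outliers are possible for this variable.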
2nd Quartile
Q2. Divides the bottom 50% from the top 50%. Equal to the 50th percentile. Equal to the median of the entire set of data.
3rd Quartile
Q3. Divides the bottom 75% from the top 25%. Equal to the 75th percentile
Z-score
Represents the distance that a data value is from the mean in terms of the number of standard deviations. It is obtained by subtracting the mean from the data value and dividing this result by the standard deviation. There is both a population and sample z-score. It is unitless, with a mean 0 and SD 1.
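A short Python sketch of the population z-score, using the nine pulse rates from this set treated as a population. The function names are assumptions made for illustration. Standardizing every value shows the property stated above: the z-scores have mean 0 and standard deviation 1.

```python
from statistics import fmean, pstdev


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def z_scores(population):
    """Population z-scores for every value in the data set."""
    mu = fmean(population)
    sigma = pstdev(population)
    return [z_score(x, mu, sigma) for x in population]


pulses = [60, 63, 65, 69, 71, 77, 82, 86, 89]
zs = z_scores(pulses)

print(z_score(80, 70, 5))     # 2.0: two standard deviations above the mean
print(round(fmean(zs), 6))    # approximately 0
print(round(pstdev(zs), 6))   # approximately 1
```

For a sample z-score, the same function applies with the sample mean and sample standard deviation passed in instead.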
Mode
The most frequent observation of the variable that occurs in the data set. Can be qualitative or quantitative data.
What makes the range less desirable than the standard deviation as a measure of dispersion?
The range does not use all the observations. The range of a variable is the difference between the largest data value and the smallest data value. The range is less desirable than the standard deviation as a measure of dispersion because it is computed using only two values in the data set (the largest and smallest).
Is the mean pulse rate of sample 1 (76) an overestimate of, an underestimate of, or equal to the population mean (73.6)?
The sample mean overestimates the population mean
Median
The value that lies in the middle of the data when arranged in ascending order.
The 5th percentile of the weight of males 36 months of age in a certain city is 11.0 kg.
5% of 36-month-old males weigh 11.0 kg or less, and 95% of 36-month-old males weigh more than 11.0 kg.
Scatter Diagram
A graph that shows the relationship between 2 quantitative variables measured on the same individual. Each individual in the data set is represented by a point in the scatter diagram. The explanatory variable is plotted on the horizontal axis, and the response variable is plotted on the vertical axis.
Average
A measure of central tendency that numerically describes the typical data value
A histogram of a set of data indicates that the distribution of the data is skewed right. Which measure of central tendency will likely be larger, the mean or the median? Why?
The mean will likely be larger because the extreme values in the right tail tend to pull the mean in the direction of the tail.
Weighted Mean
Used when certain data values have a higher importance or weight associated with them. Example: GPA, with the weights equal to the number of credit hours in each course. The value of the variable is equal to the grade converted to a point value
Negatively Associated
Two variables are linearly related, and above-average values of one variable are associated with below-average values of the other variable (and vice versa)
Positively Associated
Two variables are linearly related, and above-average values of one variable are associated with above-average values of the other. Likewise, below-average values of one variable are associated with below-average values of the other.
Dividing by n results in...
an underestimate of the population variance, so we divide by a smaller number (n-1) to make the estimate larger
Since raw data cannot be retrieved from a frequency table, we assume that, within each class, the mean of the data values is equal to
the class midpoint, then multiply the class midpoint by the frequency, and this product is expected to be close to the sum of the data that lie within each class, and repeat the process for each class and sum the results. This sum approximates the sum of all the data
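The procedure above can be sketched in Python as follows. The class midpoints and frequencies are hypothetical values chosen for illustration, and `grouped_mean` is an assumed name.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate the mean from a frequency table using class midpoints."""
    if len(midpoints) != len(frequencies):
        raise ValueError("need one frequency per class midpoint")
    n = sum(frequencies)  # total number of observations
    if n == 0:
        raise ValueError("the table contains no observations")
    # midpoint * frequency approximates the sum of the data in each class
    approx_total = sum(m * f for m, f in zip(midpoints, frequencies))
    return approx_total / n


# Hypothetical frequency table with three classes.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

print(grouped_mean(midpoints, frequencies))  # 16.0
```

This is a weighted mean in which the midpoints are the values and the frequencies are the weights.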
Steps in finding the median of a data set
(1) Arrange the data in ascending order (2) Determine the number of observations, n (3) Determine the observation in the middle of the data set - If the number of observations is odd, then the median is the data value that is exactly in the middle of the data set. That is, the median is the observation that lies in the (n + 1)/2 position - If the number of observations is even, then the median is the mean of the 2 middle observations in the data set. That is, the median is the mean of the observations that lie in the n/2 position and the (n/2) + 1 position
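The three steps translate fairly directly into Python. This is a sketch; `median_by_steps` is a hypothetical name, and the positions in the steps are 1-based while Python lists are 0-based, hence the `- 1` adjustments.

```python
def median_by_steps(data):
    """Median found by sorting, counting, and locating the middle position(s)."""
    ordered = sorted(data)            # step 1: ascending order
    n = len(ordered)                  # step 2: number of observations
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:                    # step 3, odd n: position (n + 1) / 2
        return ordered[(n + 1) // 2 - 1]
    # step 3, even n: mean of positions n / 2 and n / 2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median_by_steps([442, 440, 462, 206]))       # 441.0
print(median_by_steps([7, 12, 21, 34, 41, 48]))    # 27.5
print(median_by_steps([3, 1, 2]))                  # 2
```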
Find the population mean or sample mean as indicated. Sample: 23, 14, 1, 6, 21
13
The median for the given set of six ordered data values is 27.5: 7 12 21 __ 41 48. What is the missing value?
34
The following data for a random sample of banks in two cities represent the ATM fees for using another bank's ATM. Compute the range and sample standard deviation for ATM fees for each city. Which city has the most dispersion based on range? Which city has more dispersion based on the standard deviation? City A (1.50, 1.00, 1.50, 1.50, 1.50) City B (2.25, 1.00, 1.75, 0.00, 2.00)
City A range = 0.50 City B range = 2.25 City A SD = 0.22 City B SD = 0.91
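These values can be reproduced with a few lines of Python. This sketch uses `statistics.stdev`, which divides by n - 1 and so matches the sample standard deviation; `value_range` is a hypothetical helper name.

```python
from statistics import stdev

city_a = [1.50, 1.00, 1.50, 1.50, 1.50]
city_b = [2.25, 1.00, 1.75, 0.00, 2.00]


def value_range(data):
    """Largest data value minus smallest data value."""
    return max(data) - min(data)


for name, fees in (("City A", city_a), ("City B", city_b)):
    print(name, "range:", value_range(fees), "SD:", round(stdev(fees), 2))
```

Both cities have the same mean fee ($1.40), so the difference between them shows up only in the measures of dispersion.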
Which city has the most dispersion based on standard deviation?
City B, because it has a higher standard deviation
Which city has the most dispersion based on range?
City B, because it has a higher range.
Arithmetic mean
Computed by determining the sum of all the values of the variable in the data set and dividing by the number of observations. Generally referred to as the mean
Sample variance
Computed by determining the sum of the squared deviations about the sample mean and dividing the result by n-1.
Population arithmetic mean (mu, μ)
Computed using all the individuals in a population. The population mean is a parameter
Outliers
Extreme observations. Should always be checked for in data analysis. When encountered, their origins should be investigated. Can occur by chance or through errors in measurement, sampling, or data entry. Extreme values are sometimes common within a population, which can produce outliers in a sample.
Multimodal
If a data set has 3 or more data values that occur with the highest frequency
No mode
If no observation occurs more than once
kth percentile
Is a value such that k % of observations are less than or equal to the value
For the histogram on the right determine whether the mean is greater than, less than, or approximately equal to the median. Justify your answer.
Mean < M because the histogram is skewed left
The following data represent the pulse rates (beats per minute) of nine students enrolled in a statistics course. Treat the nine students as a population: 60, 63, 65, 69, 71, 77, 82, 86, 89.
Population mean: 73.6
The following data represent the pulse rates (beats per minute) of nine students enrolled in a statistics course. Treat the nine students as a population. (68, 72, 79, 88, 60, 77, 86, 65, 73)
Population variance: 76.8 Population SD: 8.8
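A sketch of the same calculation in Python. Because the nine students are treated as a population, it uses `statistics.pvariance` and `statistics.pstdev`, which divide by N rather than n - 1.

```python
from statistics import pstdev, pvariance

pulses = [68, 72, 79, 88, 60, 77, 86, 65, 73]  # treated as a population

print(round(pvariance(pulses), 1))  # 76.8
print(round(pstdev(pulses), 1))     # 8.8
```

Calling `statistics.variance` on the same list would give a larger value, since it divides the same sum of squared deviations by 8 instead of 9.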
1st Quartile
Q1. Divides the bottom 25% from the top 75%. Equal to the 25th percentile
Interquartile Range
Resistant to extreme values, so it is the preferred measure of dispersion based on quartiles. Is the range of the middle 50% of the observations. IQR = Q3 - Q1. Similar to SD and range in that the more spread out a set of data is, the higher the IQR will be.
Sample 1: 60, 86, 82 Sample 2: 89, 82, 65
Sample 1 mean: 76 Sample 2 mean: 78.7
Find the sample variance and standard deviation. (23, 12, 6, 7, 10)
Sample variance = 46.3 SD = 6.8
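The definition of the sample variance (sum of squared deviations about the sample mean, divided by n - 1) can be written out by hand in Python. This is a sketch; the function names are assumptions.

```python
from math import sqrt


def sample_variance(data):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)


def sample_sd(data):
    """Square root of the sample variance."""
    return sqrt(sample_variance(data))


data = [23, 12, 6, 7, 10]
print(round(sample_variance(data), 1))  # 46.3
print(round(sample_sd(data), 1))        # 6.8
```

The same functions reproduce the results for the two size-3 pulse samples in this set: (86, 65, 88) and (79, 65, 77).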
Determine the sample variance and sample standard deviation of the following two simple random samples of size 3. Sample 1: (86, 65, 88)
Sample variance: 162.3 Sample SD: 12.7
Determine the sample variance and sample standard deviation of the following two simple random samples of size 3. Sample 2: (79, 65, 77)
Sample variance: 57.3 Sample SD: 7.6
n represents
Size of sample
A random sample of 15 college students were asked "How many hours per week typically do you work outside the home?" Their responses are shown on the right. Determine the shape of the distribution of hours worked by drawing a frequency histogram and computing the mean and median. Which measure of central tendency better describes hours worked? (2, 8, 9, 10, 11, 17, 18, 18, 19, 21, 21, 24, 25, 26, 32)
Symmetric. Mean: 17.4 Median: 18. The mean best describes the data.
Explain the circumstances for which the interquartile range is the preferred measure of dispersion. What is an advantage that the standard deviation has over the interquartile range?
The interquartile range is preferred when the data are skewed or have outliers. An advantage of the standard deviation is that it uses all the observations in its computation.
Is the mean pulse rate of sample 2 (78.7) an overestimate of, an underestimate of, or equal to the population mean (73.6)?
The sample mean overestimates the population mean
Response Variable
The variable whose value can be explained by the value of the explanatory or predictor variable. It is a dependent variable, while the explanatory variable is independent. Ex: the speed of a golf club head is the explanatory variable, and the distance the golf ball travels is the response variable.
What does it mean to say that the linear correlation coefficient between two variables equals 1? What would the scatter diagram look like?
When the linear correlation coefficient is 1, there is a perfect positive linear relation between the two variables. The scatter diagram would contain points that all lie on a line with a positive slope.
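To illustrate, the sketch below computes r for points that lie exactly on a line. These cards do not state a formula for r; the code assumes the standard definition (sum of products of deviations divided by the square root of the product of the sums of squared deviations). The function name and the data points are hypothetical.

```python
from math import sqrt


def linear_correlation(xs, ys):
    """Linear correlation coefficient r for paired data (standard formula, assumed here)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4]
rising = [2 * x + 1 for x in xs]     # points on a line with positive slope
falling = [-3 * x + 10 for x in xs]  # points on a line with negative slope

print(linear_correlation(xs, rising))   # r is 1 (up to rounding)
print(linear_correlation(xs, falling))  # r is -1 (up to rounding)
```

The size of the slope does not matter: any set of points on a rising line gives r = 1, and any set on a falling line gives r = -1.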
Resistant
When the value of a numerical summary of data is not substantially affected by extreme values (very large or very small)
Population standard deviation
σ (sigma); obtained by taking the square root of the population variance
Because the Empirical Rule requires that the distribution be bell shaped, while Chebyshev's Inequality applies to all distributions, the Empirical Rule provides results that are more
precise
3 numerical measures for describing dispersion, or the spread, of data
range, variance, and standard deviation
Sample standard deviation
s; obtained by taking the square root of the sample variance
The most popular methods for numerically describing the distribution of a variable
standard deviation and the mean, because these 2 measures are used for most types of statistical inference
Roman letters are used to represent
statistics
The procedure for approximating the variance and standard deviation from grouped data is similar to
that of finding the mean from grouped data; and because we do not have access to the original data, the variance is approximate
If data have a distribution that is bell shaped, the Empirical Rule can be used to determine
the % of data that will lie within k standard deviations of the mean
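The approximate mean from grouped data is
close to the actual mean but generally not exactly equal to it, because each class midpoint only stands in for the data values in that class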
The further an observation is from the mean...
the larger the absolute value of the deviation
When the word average is used in the media, it usually refers to
the mean
We use M to represent
the median
The larger the standard deviation
the more dispersion the distribution has, provided that the variable of interest from the 2 populations has the same unit of measure
The mean measures the center of the distribution, while the standard deviation measures
the spread of the distribution
The Greek letter capital sigma tells us
the terms are to be added
The standard deviation is the typical deviation from the
mean
The median is resistant while the...
mean is not resistant
Symmetric
mean roughly equal to median
Skewed right
mean substantially larger than median
Skewed left
mean substantially smaller than median
Degrees of Freedom
n-1, because the first n-1 observations have freedom to be whatever value they wish, but the nth value has no freedom. It must be whatever value forces the sum of the deviations about the mean to equal zero.
We cannot determine the value of the mean or median of data that are:
nominal (only mode)
Variance is based on the
Deviation about the mean
True or False: A data set will always have exactly one mode.
False
Bimodal
If a data set has 2 data values that occur with the highest frequency
The histogram on the right represents the connection time in seconds to an internet provider. Determine which measure of central tendency better describes the "center" of the distribution.
Median
Biased
Whenever a statistic consistently over or underestimates a parameter
Range
Simplest measure of dispersion. The data must be quantitative. Also seen as R. Is the difference between the largest data value and the smallest data value. Range = R = Largest data value - smallest data value. Is affected by extreme values.
N represents
Size of population
Greek letters are used to represent
parameters
Chebyshev's Inequality
used to determine a lower bound on the % of observations that lie within 'k' standard deviations of the mean, where k > 1. The bound is obtained regardless of the basic shape of the distribution (skewed left, right, or symmetric)
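The lower bound from Chebyshev's Inequality is (1 - 1/k^2) * 100%. A minimal Python sketch follows; the function name is an assumption.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k**2) * 100


print(chebyshev_lower_bound(2))            # 75.0
print(round(chebyshev_lower_bound(3), 1))  # 88.9
```

For comparison, the Empirical Rule gives about 95% within 2 standard deviations for bell-shaped data, which is why its results are described as more precise: Chebyshev's bound of at least 75% holds for any shape but is much looser.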
To obtain an unbiased estimate of population variance...
we divide the sum of the squared deviations about the sample mean by n-1
Violent crimes include rape, robbery, assault, and homicide. The following is a summary of the violent-crime rate (violent crimes per 100,000 population) for all states of a country in a certain year. Q1 =273.8, Q2 = 388.5, Q3 = 529.1
25% of the states have a violent-crime rate that is 273.8 crimes per 100,000 population or less. 50% of the states have a violent-crime rate that is 388.5 crimes per 100,000 population or less. 75% of the states have a violent-crime rate that is 529.1 crimes per 100,000 population or less. IQR = 255.3 (The middle 50% of all observations have a range of 255.3 crimes per 100,000 population.)
The 90th percentile of the length of newborn females in a certain city is 54.3 cm.
90% of newborn females have a length of 54.3 cm or less, and 10% of newborn females have a length that is more than 54.3 cm.
Is the trimmed mean resistant to changes in the extreme values for the given data?
Yes, because changing the extreme values does not change the trimmed mean.