Statistics Ch 3 Review
Variance
of a variable, is the square of the standard deviation
Comparing two populations
If we are comparing two populations, then the larger the standard deviation, the more dispersion the distribution has, provided that the variable of interest from the two populations has the same unit of measure
Summary
This chapter concentrated on describing distributions numerically. Measures of central tendency are used to indicate the typical value in a distribution. Three measures of central tendency were discussed. The mean measures the center of gravity of the distribution. The data must be quantitative to compute the mean. The median separates the bottom 50% of the data from the top 50%. The data must be at least ordinal to compute the median. The mode measures the most frequent observation. The data can be either quantitative or qualitative to compute the mode. The median is resistant to extreme values, while the mean is not. Measures of dispersion describe the spread in the data. The range is the difference between the highest and lowest data values. The standard deviation is based on the average squared deviation about the mean. The variance is the square of the standard deviation. The range, standard deviation, and variance are not resistant. The mean and standard deviation are used in many types of statistical inference. The mean, median, and mode can be approximated from grouped data. The standard deviation can also be approximated from grouped data. We can determine the relative position of an observation in a data set using z-scores and percentiles. A z-score denotes how many standard deviations an observation is from the mean. Percentiles determine the percent of observations that lie above and below an observation. The interquartile range is a resistant measure of dispersion. The upper and lower fences can be used to identify potential outliers. Any potential outlier must be investigated to determine whether it was the result of a data-entry error or some other error in the data-collection process, or is an unusual value in the data set. The five-number summary provides an idea about the center and spread of a data set through the median and the interquartile range.
The length of the tails in the distribution can be determined from the smallest and largest data values. The five-number summary is used to construct boxplots. Boxplots can be used to describe the shape of the distribution and to visualize outliers
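The z-score mentioned in the summary can be illustrated with a short Python sketch. The data set is hypothetical and is treated as a whole population, so the population standard deviation is used; the function name z_score is made up for this illustration.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical population, chosen so the arithmetic is easy to check by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mu = mean(data)        # population mean: 5
sigma = pstdev(data)   # population standard deviation: 2.0

z = z_score(9, mu, sigma)
print(z)  # 2.0, i.e. the value 9 lies two standard deviations above the mean
```

A negative z-score would mean the observation lies below the mean.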
Biased
Whenever a statistic consistently underestimates a parameter, it is said to be this. To obtain an unbiased estimate of the population variance, we divide the sum of squared deviations about the sample mean by n-1. Ex.) Suppose you work for a carnival in which you must guess a person's age. After 20 people come to your booth, you notice that you have a tendency to underestimate people's age (you guess too low). What would you do about this? Originally, your guesses were biased. To remove the bias, you increase your guess. This is what dividing by n-1 in the sample variance formula accomplishes.
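One way to see the effect of dividing by n-1 is to list every possible sample from a tiny population and average the resulting variance estimates. The sketch below is in Python and assumes a hypothetical three-value population and samples of size 2 drawn with replacement; all names are made up for this illustration.

```python
from itertools import product

population = [2, 4, 6]   # hypothetical population
N = len(population)
mu = sum(population) / N
pop_var = sum((x - mu) ** 2 for x in population) / N   # true population variance

def sum_sq_dev(sample):
    """Sum of squared deviations about the sample mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)

n = 2
# Every ordered sample of size n drawn with replacement (9 samples here).
samples = list(product(population, repeat=n))

# Average estimate when dividing by n versus dividing by n-1.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(pop_var, avg_div_n, avg_div_n_minus_1)
```

Averaged over all samples, dividing by n comes out below the population variance, while dividing by n-1 matches it.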
Caution: mean
Whenever you hear the word average, be aware that the word may not always be referring to the mean. One average could be used to support one position, while another average could be used to support a different position
Resistant
a numerical summary of data is said to be this, if extreme values (very large or small) relative to the data do not affect its value substantially
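A quick Python check makes the idea of resistance concrete. The numbers are hypothetical; the point is only to compare how the mean and the median react when one extreme value is added.

```python
from statistics import mean, median

values = [30, 32, 35, 38, 40]      # hypothetical data
with_extreme = values + [400]      # same data plus one very large value

print(mean(values), median(values))              # 35 and 35
print(mean(with_extreme), median(with_extreme))  # mean jumps above 95, median is 36.5
```

The median barely moves, so it is resistant; the mean is pulled strongly toward the extreme value, so it is not.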
Nominal data
are qualitative data that cannot be written in any meaningful order. We cannot determine the value of the mean or median of data that is classified as this. The only measure of central tendency that can be determined for this kind of data is the mode
Quartiles
divide data sets into fourths, or four equal parts
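The sketch below computes quartiles in Python using one common textbook convention: the second quartile is the median, and the first and third quartiles are the medians of the bottom and top halves (leaving out the middle value when n is odd). Other conventions exist and can give slightly different answers. The fences use the usual 1.5 x IQR rule, which is an assumption here because these notes do not state the formula. The data and function name are hypothetical.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # bottom half
    upper = xs[half + 1:] if n % 2 else xs[half:]   # top half, skipping the middle value if n is odd
    return median(lower), median(xs), median(upper)

data = [1, 3, 4, 6, 7, 9, 10, 12, 40]   # hypothetical data
q1, q2, q3 = quartiles(data)
iqr = q3 - q1                            # interquartile range

# Assumed convention: fences sit 1.5 * IQR beyond the first and third quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
potential_outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)       # 3.5 7 11.0 7.5
print(potential_outliers)    # [40]
```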
Outliers
extreme observations are referred to as these. These distort both the mean and the standard deviation, because neither is resistant. Because these measures often form the basis for most statistical inference, any conclusions drawn from a set of data that contains these can be flawed
Note
if a data set has many observations that are "far" from the mean, the sum of the squared deviations will be large, and therefore the standard deviation will be large
Multimodal
if a data set has three or more data values that occur with the highest frequency, the data set is said to be this
Bimodal
if a data set has two modes
Using the Empirical Rule
if data have a distribution that is bell shaped, the Empirical Rule can be used to determine the percentage of data that will lie within k standard deviations of the mean
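As a rough illustration in Python, the sketch below uses the standard Empirical Rule percentages (approximately 68%, 95%, and 99.7% for k = 1, 2, 3), which are not listed in these notes. The mean of 100 and standard deviation of 15 are hypothetical, and the function name is made up.

```python
# Approximate percentage of bell-shaped data within k standard deviations of the mean.
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}

def empirical_interval(mean, sd, k):
    """Return (low, high, approximate percent of data between low and high)."""
    return mean - k * sd, mean + k * sd, EMPIRICAL_RULE[k]

# Hypothetical bell-shaped variable with mean 100 and standard deviation 15.
for k in (1, 2, 3):
    low, high, pct = empirical_interval(100, 15, k)
    print(f"about {pct}% of the data lie between {low} and {high}")
```

The rule only applies when the distribution is roughly bell shaped; for other shapes these percentages should not be assumed.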
No mode
if no observation occurs more than once, we say the data has this
Population variance
is σ squared (sigma squared), the square of the population standard deviation
Exploratory data analysis
is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task
Class midpoint
is found by adding consecutive lower class limits and dividing the result by 2
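A small Python sketch of this rule follows, together with the grouped-data approximation of the mean mentioned in the summary. The class limits and frequencies are hypothetical; the final entry in lower_limits is the lower limit of the class that would come next, which is needed to get the midpoint of the last class.

```python
# Hypothetical classes 0-9, 10-19, 20-29, 30-39.
lower_limits = [0, 10, 20, 30, 40]   # last entry: lower limit of the class that would follow
frequencies = [2, 5, 8, 5]

# Midpoint of each class: add consecutive lower class limits and divide by 2.
midpoints = [(a + b) / 2 for a, b in zip(lower_limits, lower_limits[1:])]

# Approximate mean from grouped data: frequency-weighted average of the midpoints.
approx_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / sum(frequencies)

print(midpoints)     # [5.0, 15.0, 25.0, 35.0]
print(approx_mean)   # 23.0
```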
Sample variance
is s squared
Identifying the shape of a distribution from a boxplot (or from a histogram)
is subjective. When identifying the shape of a distribution from a graph, be sure to support your opinion.
Dispersion
is the degree to which the data are spread out
Population standard deviation
of a variable is the square root of the sum of squared deviations about the population mean divided by the number of observations in the population, N. That is, it is the square root of the mean of the squared deviations about the population mean.
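The definition translates almost word for word into Python. This is a minimal sketch with a hypothetical population; the function name is made up for illustration.

```python
from math import sqrt

def population_sd(values):
    """Square root of the mean of the squared deviations about the population mean."""
    N = len(values)
    mu = sum(values) / N
    return sqrt(sum((x - mu) ** 2 for x in values) / N)

population = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical population
print(population_sd(population))          # 2.0
```

Here the mean is 5, the squared deviations add up to 32, and 32 / 8 = 4, whose square root is 2.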
Arithmetic mean
of a variable, is computed by adding all the values of the variable in the data set and dividing by the number of observations
Range (R)
of a variable, is the difference between the largest and the smallest data value. That is, Range = R = largest data value - smallest data value
Mode
of a variable, is the most frequent observation of the variable that occurs in the data set
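The Python sketch below finds the mode or modes of a data set and labels the result using the terms from these notes (no mode, bimodal, multimodal). It assumes a non-empty data set; the function names and example data are hypothetical. Because it only counts values, it works for qualitative data as well.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []   # no observation occurs more than once: no mode
    return [value for value, count in counts.items() if count == top]

def describe_modes(data):
    """Label a data set by its number of modes."""
    k = len(modes(data))
    if k == 0:
        return "no mode"
    if k == 1:
        return "one mode"
    return "bimodal" if k == 2 else "multimodal"

print(modes(["red", "blue", "red"]), describe_modes(["red", "blue", "red"]))
print(modes([1, 1, 2, 2, 3]), describe_modes([1, 1, 2, 2, 3]))
print(modes([1, 2, 3]), describe_modes([1, 2, 3]))
```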
Median
of a variable, is the value that lies in the middle of the data when arranged in ascending order. We use M to represent this.
Sample standard deviation
s, of a variable is the square root of the sum of squared deviations about the sample mean divided by n-1, where n is the sample size
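For comparison with the population version, here is a minimal Python sketch of the sample standard deviation, which divides by n-1 instead of N. The sample and the function name are hypothetical.

```python
from math import sqrt

def sample_sd(values):
    """Square root of the sum of squared deviations about the sample mean, divided by n-1."""
    n = len(values)
    xbar = sum(values) / n
    return sqrt(sum((x - xbar) ** 2 for x in values) / (n - 1))

sample = [1, 5, 9]          # hypothetical sample
print(sample_sd(sample))    # 4.0
```

Here the sample mean is 5, the squared deviations add up to 32, and 32 / (3 - 1) = 16, whose square root is 4.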
Mean
the arithmetic mean is generally referred to as this
Describe the distribution
this will mean to describe its shape (skewed left, skewed right, symmetric), its center (mean or median), and its spread (standard deviation or interquartile range)
Population arithmetic mean
μ (pronounced "mew"), is computed using all the individuals in a population. This is also a parameter.
Standard deviation
uses all the data values in the computations
Degrees of freedom
we call n-1 this, because the first n-1 observations have freedom to be whatever they wish, but the nth value has no freedom. It must be whatever value forces the sum of the deviations about the mean to equal zero
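A tiny Python example shows why the last value has no freedom. It assumes a hypothetical sample of size 3 whose mean is known to be 7: once the first two observations are chosen, the third is forced.

```python
n = 3
xbar = 7               # hypothetical known sample mean
free_values = [3, 8]   # the first n-1 observations can be anything

# The nth value must make the total equal n * xbar.
last_value = n * xbar - sum(free_values)

sample = free_values + [last_value]
deviations = [x - xbar for x in sample]

print(last_value)        # 10
print(sum(deviations))   # 0
```

Changing either free value changes last_value, but the deviations about the mean always add up to zero.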
What do we use to represent statistics and parameters?
we usually use Greek letters to represent parameters and Roman letters to represent statistics
Steps in Finding the Median of a Data Set
1.) Arrange the data in ascending order
2.) Determine the number of observations, n
3.) Determine the observation in the middle of the data set
If the number of observations is odd, then the median is the data value exactly in the middle of the data set. That is, the median is the observation that lies in the (n+1)/2 position.
If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the observations that lie in the n/2 position and the (n/2)+1 position.
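The steps above can be sketched in Python as follows. Positions in the steps are counted from 1, while Python lists are indexed from 0, so each position is shifted down by one. The function name and example data are hypothetical.

```python
def median_by_steps(data):
    xs = sorted(data)        # step 1: arrange in ascending order
    n = len(xs)              # step 2: number of observations
    if n % 2 == 1:
        # odd n: the observation in position (n+1)/2
        return xs[(n + 1) // 2 - 1]
    # even n: mean of the observations in positions n/2 and (n/2)+1
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

print(median_by_steps([7, 1, 5]))      # 5
print(median_by_steps([4, 1, 3, 2]))   # 2.5
```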
Three numerical measures for describing the dispersion of data
1.) Range 2.) Standard deviation 3.) Variance
Sample arithmetic mean
x (with a line above x, pronounced "x-bar"), is computed using sample data. This is a statistic.
