Statistics: Descriptive- Chapter 2
Shapes of distribution
The graphs of freq dist can have different shapes: symmetric, uniform, skewed left (negatively), skewed right (positively)
box-and-whisker plot (or boxplot)
highlights important features of a data set. Five values (the five-number summary) must be known in order to construct a box-and-whisker plot
bimodal
if two data entries have the same greatest freq, each entry is a mode and the data set is called bimodal
population variance
in a population data set, the mean of the squares of the deviations. Population variance = σ² = Σ(x - μ)² / N = sum of the squared deviations / population size
interquartile range (IQR)
measures the spread of the middle 50% of the data. the calculation is: IQR = Q3 - Q1
organize data sets by grouping the data into:
intervals called classes and forming a frequency distribution
frequency histogram
is a bar graph that represents the frequency distribution of a data set. *class limits must be converted to class boundaries so that consecutive bars touch *first, find the gap between the upper limit of one class and the lower limit of the next class and divide it by 2. Then subtract this half-gap from each lower class limit and add it to each upper class limit.
outlier
is a data entry that is out of place with the other data numbers * the data entry that is far removed from the other entries in the data set Example: for the data values given: {3, 6, 4, 3, 1, 6, 77}, the data entry 77 is an outlier
cumulative frequency graph (ogive)
is a line graph that displays the cumulative frequency of each class at its upper class boundary. *the upper boundaries are marked on the horizontal axis, and the cumulative frequencies are marked on the vertical axis
frequency polygon
is a line graph that emphasizes the continuous change in frequencies
frequency distribution
is a table that shows classes or intervals of data entries with a count of the number of entries in each class *used to show patterns or trends in a data set
measure of central tendency
is a value that represents a typical, or central, entry of a data set
range
is the difference between the maximum and minimum entries in a data set Range = Maximum - Minimum
class width
is the distance between lower (or upper) limits of consecutive classes
lower class limit
is the least number that can belong to the class
frequency f of a class:
is the number of data entries in the class
a z-score of -2.5 is considered very unusual (T/F)
False. A z-score of -2.5 is considered unusual
cumulative frequency (of a class)
is the sum of the frequencies of that class and all previous classes. the cumulative frequency of the last class is equal to the sample size n
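A minimal Python sketch of the running total described above. The class frequencies are made-up values for illustration, not data from these notes.

```python
from itertools import accumulate

# Hypothetical frequencies for a five-class frequency distribution.
frequencies = [3, 7, 10, 6, 4]

# Each entry is the frequency of that class plus all previous classes.
cumulative = list(accumulate(frequencies))

print(cumulative)                          # [3, 10, 20, 26, 30]
print(cumulative[-1] == sum(frequencies))  # True: the last value equals n
```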
characteristics to look for when organizing and describing a data set
its center, its variability (or spread), and its shape
positive z-score
means that the x-value is greater than the mean
negative z-score
means that the x-value is less than the mean
z-score of exactly zero
means x-value is equal to the mean
standard deviation
measure of typical amount an entry deviates from the mean *a large standard deviation indicates that the data is spread out away from the mean while a small standard deviation indicates the data is clustered close together near the mean
How do outliers affect the central tendencies?
the mean is heavily influenced by outliers because they are included in the calculation * the median is usually not influenced by outliers much since outliers usually fall at the beginning or end of the data set. * the mode is usually not influenced by outliers since they typically occur just once
weighted mean
the mean of a data set that has entries of varying weights x̅ = Σ (x * w) / Σ w = Σ (data values * weights) / Σ weights
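One way to sketch the weighted mean formula in Python. The function name `weighted_mean` and the scores and weights are hypothetical, chosen only to illustrate Σ(x * w) / Σw.

```python
def weighted_mean(values, weights):
    """Return sum(x * w) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Hypothetical scores with integer weights (for example, credit hours).
scores = [80, 90, 70]
weights = [5, 3, 2]

result = weighted_mean(scores, weights)
print(result)  # 81.0, since (400 + 270 + 140) / 10 = 81
```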
Z-scores
the number of standard deviations that a value, x, falls from the mean, μ. The formula is: z = (value - mean) / standard deviation = (x - μ) / σ *for bell-shaped data, z-scores usually fall between -2 and 2, since about 95% of the data lie in this range according to the Empirical Rule. A z-score outside this range occurs about 5% of the time and would be considered unusual
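A short Python sketch of the z-score calculation and the "unusual" cutoff of 2 used in these notes. The function names and the example values (x = 60, mean 70, standard deviation 4) are assumptions for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


def is_unusual(z):
    """True when the z-score falls outside the range -2 to 2."""
    return abs(z) > 2


z = z_score(60, 70, 4)
print(z)              # -2.5: the value is 2.5 standard deviations below the mean
print(is_unusual(z))  # True
```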
class boundaries
the numbers that separate classes without forming gaps between them *if data entries are integers, subtract 0.5 from each lower limit to find the lower class boundaries. *to find the upper class boundaries, add 0.5 to each upper limit. *the upper boundary of a class will equal the lower boundary of the next higher class
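The boundary rule can be sketched in Python as follows. The helper `class_boundaries` and the three integer classes are illustrative assumptions; the half-gap works out to 0.5 because the limits are integers.

```python
def class_boundaries(lower_limits, upper_limits):
    """Turn class limits into (lower boundary, upper boundary) pairs."""
    # Half the gap between one class's upper limit and the next class's lower limit.
    half_gap = (lower_limits[1] - upper_limits[0]) / 2
    return [(lower - half_gap, upper + half_gap)
            for lower, upper in zip(lower_limits, upper_limits)]


# Hypothetical integer classes 9-16, 17-24, 25-32.
boundaries = class_boundaries([9, 17, 25], [16, 24, 32])
print(boundaries)  # [(8.5, 16.5), (16.5, 24.5), (24.5, 32.5)]
```

Each upper boundary matches the lower boundary of the next class, so the histogram bars touch.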
population standard deviation
the square root of the population variance for a population data set of N entries. population standard deviation = σ = √σ² = √( Σ(x - μ)² / N )
sample standard deviation
the square root of the sample variance *used to estimate the population standard deviation from a sample. sample standard deviation = s = √s² = √( Σ(x - x̅)² / (n - 1) )
After constructing an expanded frequency distribution, what should the sum of the relative frequencies be? Explain
the sum of the relative freq must be 1 or 100% because it is the sum of all portions or percentage of the data.
Σ (sigma)
the uppercase Greek letter sigma ( Σ) is used throughout statistics to indicate a summation of values
what is the advantage of using a stem-and-leaf plot instead of a histogram? What is a disadvantage?
unlike the histogram, the stem-and-leaf plot still contains the original data values. However, some data are difficult to organize in a stem-and-leaf plot
Five Number Summary
used to create a box-and-whisker plot 1. minimum 2. Q1 3. Q2, or median 4. Q3 5. Maximum
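A Python sketch of the five-number summary. It assumes the common textbook method in which Q1 and Q3 are the medians of the lower and upper halves of the ordered data, with the middle entry left out of both halves when n is odd. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of halves."""
    if len(data) < 2:
        raise ValueError("need at least two data entries")
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, skip the middle entry so it belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


summary = five_number_summary([4, 15, 9, 12, 16, 8, 11, 19, 14])
print(summary)                  # (4, 8.5, 12, 15.5, 19)
print(summary[3] - summary[1])  # 7.0, the interquartile range Q3 - Q1
```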
Describe the difference btw the calculation of population standard deviation and that of sample standard deviation
when calculating the population standard deviation, you divide the sum of the squared deviations by N, then take the square root of that value. When calculating the sample standard deviation, you divide the sum of the squared deviations by n - 1, then take the square root of that value.
Sum of squares
when you add the squares of the deviations, you compute a quantity called the sum of squares, denoted SSx
upper class limit
is the greatest number that can belong to the class
construct a sample data set for which n = 7, x̅ = 9, and s = 0
{9, 9, 9, 9, 9, 9, 9}
Construct the described data set: median and mode are the same
Example: 1, 2, 2, 2, 3 (answers may vary)
Construct the described data set: Mean is not representative of a typical number in the data set
Example: 2, 5, 7, 9, 35 (answers may vary)
Some quantitative data sets do not have medians (T/F)
False. Every quantitative data set has a median.
The 50th percentile is equivalent to Q1 (T/F)
False. The 50th percentile is equivalent to Q2
the mean and median of a data set are both fractiles. (T/F)
False. the median of a data set is a fractile, but the mean may or may not be a fractile depending on the distribution of the data
In a frequency distribution, the class width is the distance between the lower and upper limits of a class (T/F)
False: Class width is the difference between lower or upper limits of consecutive classes
An ogive is a graph that displays relative frequencies (T/F)
False: an ogive is a graph that displays cumulative freq.
How is a Pareto chart different from a standard vertical bar graph?
In a Pareto chart, the height of each bar represents frequency or relative frequency and the bars are positioned in order of decreasing height with the tallest bar at the left
Find the range, mean, variance, and standard deviation of the sample data set: 4 15 9 12 16 8 11 19 14
Range = 15 x̅ = 12 s squared = 21 s = 4.6
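The answer above can be checked step by step in Python. This is only a sketch of the sample formulas (divide the sum of squares by n - 1); the variable names are illustrative.

```python
data = [4, 15, 9, 12, 16, 8, 11, 19, 14]
n = len(data)

data_range = max(data) - min(data)
mean = sum(data) / n

# Sum of squares: add up the squared deviations from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in data)

# Sample variance divides by n - 1, not n.
sample_variance = sum_of_squares / (n - 1)
sample_std = sample_variance ** 0.5

print(data_range, mean, sample_variance, round(sample_std, 1))  # 15 12.0 21.0 4.6
```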
Find the range, mean, variance, and standard deviation of the population data set: 9 5 9 10 11 12 7 7 8 12
Range = 7 μ = 9 σ squared = 4.8 standard deviation = 2.2
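As a cross-check, the same population values can be computed with Python's standard `statistics` module, where `pvariance` and `pstdev` divide by N. The results should agree with the answer above after rounding.

```python
from statistics import mean, pstdev, pvariance

data = [9, 5, 9, 10, 11, 12, 7, 7, 8, 12]

data_range = max(data) - min(data)
mu = mean(data)
sigma_squared = pvariance(data)  # population variance: divides by N
sigma = pstdev(data)             # population standard deviation

print(data_range, mu, round(sigma_squared, 1), round(sigma, 1))
```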
min= 17, max= 135, 8 classes; what is the class width, the lower class limits, and the upper class limits
class width=15; lower class limits: 17,32,47,62,77,92,107,122; upper class limits: 31,46,61,76,91,106,121,136
relative frequency histogram
*has the same shape and the same x-axis as a frequency histogram. the only difference is that the y-axis measures relative frequencies instead of frequencies *uses the same information as a frequency histogram but compares the frequency of each class to the total number of entries
median
*the data entry located in the middle of an ordered data set. *the data is arranged in order from least to greatest so that the data value in the middle can be determined. *measures the center of an ordered data set by dividing it into two equal parts. If the data set has an odd number of entries, the median is the middle data entry. if the data set has an even number of entries, the median is the mean of the two middle data entries.
relative frequency (of a class)
*is the portion or percentage of the data that falls in that class. *To find the relative frequency of a class, divide the frequency f by the sample size n. relative frequency= class frequency / sample size = f/n * the sample size is the total number of data values *you can write as a fraction, decimal, or percent. *the sum of the relative freq of all classes should be equal to 1, or 100%. *due to rounding, the sum may be slightly less than or greater than 1. Such as 0.99 or 1.01
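A small Python sketch of relative frequencies, including the rounding effect mentioned above. The class frequencies are hypothetical.

```python
# Hypothetical class frequencies; the sample size n is their total.
frequencies = [3, 7, 10, 6, 4]
n = sum(frequencies)

# Relative frequency of each class: f / n.
relative = [f / n for f in frequencies]

# Rounding each value to two decimals can push the total slightly off 1.
rounded = [round(r, 2) for r in relative]

print(rounded)                 # [0.1, 0.23, 0.33, 0.2, 0.13]
print(round(sum(rounded), 2))  # 0.99
```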
mean
*the average *is the sum of the data entries divided by the number of entries. there are two types of mean: the population mean and the sample mean
frequency histogram (properties)
* has the following properties: 1. the horizontal scale is quantitative and measures the data values 2. the vertical scale measures the frequencies of the class 3. consecutive bars must touch
sample variance
*computed from a sample and used to estimate the population variance. sample variance = s² = Σ(x - x̅)² / (n - 1) = sum of the squared deviations / (sample size - 1)
midpoint (of a class)
*the center of each class *is the sum of the lower and upper limits of the class divided by two. the midpoint is sometimes called the class mark. midpoint = (lower class limit + upper class limit) / 2
dot plot
each data entry is represented by a point that is placed above an axis. *with a dot plot, the specific data entries can be determined, and it displays the distribution of the data
stem-and-leaf plot
*a way to display quantitative data *contains the original data values *an easy way to display and sort data *two-digit numbers are separated into a stem (the data's first digit) and a leaf (the data's second digit) *if the data has more than two digits, the stem is typically the leading digits and the leaf is the last digit *similar to a histogram but still contains the original data values Example: data entries 382, 397, and 398 give stem 38 | leaf 2 and stem 39 | leaves 7 8
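A minimal Python sketch of building a stem-and-leaf plot. It assumes non-negative integers and uses the last digit as the leaf; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Group each entry's last digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf([398, 382, 397])
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 38 | 2
# 39 | 7 8
```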
frequency distribution (guidelines for constructing)
1. decide on the number of classes. this will be given in each problem. 2. find the class width. class width = range / number of classes. always round up to the next whole number (EX: 2.8 rounds up to 3. 2.3 also rounds up to 3) 3. find the class limits. use the minimum as the lower limit of the first class. then add the class width to the lower limit of the previous class. this will give you the lower limit of the next class. then find the upper limits: for integer data, each upper limit is one less than the lower limit of the next class. classes canNOT overlap. 4. tally the data by marking where each data entry belongs in the table. 5. count the tally marks to find the total frequency of each class
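The steps above can be sketched in Python for integer data. The function names are hypothetical. One assumption goes beyond these notes: when range / number of classes is already a whole number, the width is still bumped up by one so that the maximum fits in the last class.

```python
def class_limits(minimum, maximum, num_classes):
    """Return (class width, lower limits, upper limits) for integer data."""
    # Round range / number of classes up to the next whole number.
    # Assumption: an exact quotient is also increased by one.
    width = (maximum - minimum) // num_classes + 1
    lowers = [minimum + i * width for i in range(num_classes)]
    uppers = [lower + width - 1 for lower in lowers]
    return width, lowers, uppers


def tally(data, lowers, uppers):
    """Count how many entries fall in each class (limits are inclusive)."""
    return [sum(lower <= x <= upper for x in data)
            for lower, upper in zip(lowers, uppers)]


width, lowers, uppers = class_limits(9, 64, 7)
print(width)   # 8
print(lowers)  # [9, 17, 25, 33, 41, 49, 57]
print(uppers)  # [16, 24, 32, 40, 48, 56, 64]

# Hypothetical data entries, tallied into the classes above.
print(tally([9, 12, 17, 30, 64], lowers, uppers))  # [2, 1, 1, 0, 0, 0, 1]
```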
sample mean (x̅, read as "x bar")
the mean of a sample, that is, of only a portion of the entries in a population. x̅ = Σ x / n = sum of the data / sample size
in terms of displaying data, how is a stem-and-leaf plot similar to a dot plot?
Both allow you to see how data are distributed, to determine specific data entries, and to identify unusual data values
What is the difference between class limits and class boundaries?
Class limits determine which numbers can belong to each class. Class boundaries are the numbers that separate classes without forming gaps between them.
min=9, max=64, 7 classes; what is the class width, the lower class limits, and the upper class limits?
Class width=8; lower class limits: 9,17,25,33,41,49,57; upper class limits: 16,24,32,40,48,56,64
population mean ( μ, pronounced mu)
The mean of a numerical set that includes all the numbers within the entire group. μ = Σ x / N= sum of the data / number of data entries
A data set can have the same mean, median, and mode. (T/F)
True
The mean is the measure of central tendency most likely to be affected by an outlier. (T/F)
True
When each data class has the same frequency, the distribution is symmetric (T/F)
True
the second quartile is the median of an ordered data set (T/F)
True
Pareto chart
a bar graph where the height of each bar represents the frequency or relative frequency of a category. the bars are in order from the tallest to the shortest. Commonly seen in newspapers and on tv
pie chart
a circle that is divided into sectors that represent categories. the area of each sector is proportional to the frequency of each category
gaps
a data set can have one or more outliers, causing gaps in distribution
time series chart
a graph of a time series, which is a data set composed of quantitative entries taken at regular intervals of time
paired data sets
a data set that contains points, where the x-value of one set is paired with a y-value of another set *each entry in one data set corresponds to one entry in a second data set, the sets are called paired data sets
clusters
a distribution can have several gaps caused by outliers, or clusters of data. * they occur when several types of data are included in the one data set
scatter plot
a graph of data points with an x-axis and y-axis. *one way to graph paired data sets, where the ordered pairs are graphed as points in a coordinate plane. *used to show the relationship between two quantitative variables
symmetric
a vertical line can be drawn down the center of the graph, and the graph is about the same on both sides. * the mean, median, and modes are equal
uniform
all classes have the same frequency. * the mean and median are equal. *a uniform distribution is also symmetric
fractiles
numbers that divide an ordered data set into equal parts * median is a fractile b/c it divides a data set into two equal parts
deviation
of an entry, x, in a population data set is the difference between the entry and the mean μ of the data set deviation = x - μ
what are some benefits of representing data sets using frequency distributions?
organizing the data into freq dist. may make patterns within the data more evident
name some ways to display qualitative data graphically
pie chart, Pareto chart
Discuss the similarities and the differences btw the Empirical Rule and Chebychev's Theorem
Similarities: both estimate proportions of the data contained within k standard deviations of the mean. Differences: the Empirical Rule assumes the distribution is bell-shaped; Chebychev's Theorem makes no such assumption
what are some benefits of using graphs of frequency distribution?
sometimes it is easier to ID patterns of a data set by looking at a graph of the freq. dist.
name some ways to display quantitative data graphically.
stem-and-leaf plot, dot plot, histogram, scatter plot, time series chart
Chebychev's Theorem
gives the minimum portion of data that lies within k standard deviations (k > 1) of the mean: at least 1 - 1/k² *k = 2: in any data set, at least 75% of the data lie within 2 standard deviations of the mean *k = 3: in any data set, at least 88.9% of the data lie within 3 standard deviations of the mean
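A short Python sketch of the Chebychev bound 1 - 1/k². The function name `chebychev_lower_bound` is illustrative. The bound is a minimum, so the actual portion of data within k standard deviations can be larger.

```python
def chebychev_lower_bound(k):
    """Minimum portion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


print(chebychev_lower_bound(2))            # 0.75
print(round(chebychev_lower_bound(3), 3))  # 0.889
```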
what is an advantage of using the range as a measure of variation? what is the disadvantage?
the advantage of the range is that it is easy to calculate. the disadvantage is that it uses only two entries from the data set
mode
the data entry that occurs with the greatest freq. *a data set can have one mode, more than one mode, or no mode.
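A Python sketch that returns every mode of a data set. It assumes the usual textbook convention that a data set with no repeated entry has no mode; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(data):
    """Return the entries with the greatest frequency, or [] if none repeats."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumption: no repeated entry means no mode
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([1, 2, 2, 2, 3]))          # [2]     one mode
print(modes([3, 6, 4, 3, 1, 6, 77]))   # [3, 6]  bimodal
print(modes([1, 2, 3]))                # []      no mode
```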
skewed left (negatively skewed)
the graph has a tail that elongates to the left. * a tail has values with a lower frequency. *the mean is less than the median which is less than the mode
skewed right (positively skewed)
the graph has a tail that elongates to the right. *the mode is less than the median which is less than the mean
Why is the standard deviation used more frequently than the variance? (Hint: Consider the units of the variance)
the units of variance are squared, which makes them hard to interpret (example: dollars squared). the standard deviation has the same units as the data
quartiles
there are three that divide an ordered data set into four equal parts *first quartile: about 1/4 of the data is on or below Q1 *second quartile or median: about 1/2 of the data is on or below Q2 *third quartile: about 3/4 of the data is on or below Q3
Empirical Rule
for data with a bell-shaped distribution, the standard deviation has the following characteristics: *about 68% of the data lies between (x̅ - s, x̅ + s), or within one standard deviation of the mean *about 95% of the data lies between (x̅ - 2s, x̅ + 2s), or within two standard deviations of the mean *about 99.7% of the data lies between (x̅ - 3s, x̅ + 3s), or within three standard deviations of the mean
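The percentages in the Empirical Rule can be reproduced in Python from an exactly normal (bell-shaped) distribution using `statistics.NormalDist`. This is a sketch under that assumption; real data that is only roughly bell-shaped will match these figures only approximately.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def proportion_within(k):
    """Portion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    # Prints values close to 68%, 95%, and 99.7%.
    print(k, round(proportion_within(k) * 100, 1))
```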