Statistics- Chapter 3
3.3 How to Make A Box-And-Whisker Plot
- 1. Draw a vertical scale to include the lowest and highest data values. - 2. To the right of the scale, draw a box from Q1 to Q3. - 3. Include a solid line through the box at the median level. - 4. Draw vertical lines, called whiskers, from Q1 to the lowest value and from Q3 to the highest value.
3.2 Standard Deviation of Grouped Data
Page 107 and 108 for reference if needed.
3.3 Whisker
Vertical lines that are drawn from Q1 to the LOWEST value, and from Q3 to the HIGHEST value.
3.1 How to find the Median
- 1. ORDER the data from smallest to largest. - 2. For a distribution with an odd number of data values, MEDIAN = MIDDLE DATA VALUE. - 3. For a distribution with an even number of data values, MEDIAN = SUM OF TWO MIDDLE VALUES DIVIDED BY 2.
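The steps above can be sketched in Python. The function name `median` and the two sample lists are illustrative assumptions, not taken from the textbook.

```python
def median(values):
    """Median found by ordering the data and taking the middle value(s)."""
    ordered = sorted(values)          # step 1: order smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd count, take the middle value
        return ordered[mid]
    # step 3: even count, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))      # middle of 1, 5, 7
print(median([8, 2, 6, 4]))   # average of 4 and 6
```

The sketch assumes a non-empty list of numbers.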
3.1 How to compute a 5% trimmed mean
- 1. ORDER the data from smallest to largest. - 2. Delete the bottom 5% of the data and the top 5% of the data. NOTE: If the calculation of 5% of the number of data values does not produce a whole number, round to the nearest integer. example- 5% of 34 values is 1.7, so delete 2 values from each end. - 3. Compute the mean of the remaining 90% of the data. - HELPFUL HINT: 5% of total data entries. example- with 20 entries, 5% of 20 is 1, so eliminate 1 from the top and 1 from the bottom of the ordered data set, then compute the mean of the remaining 18 entries.
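A minimal Python sketch of the trimming procedure. The name `trimmed_mean`, the `percent` parameter, and the 20-value data set (19 ordinary values plus one extreme value) are assumptions made for illustration.

```python
def trimmed_mean(values, percent=5):
    """Mean of the data left after trimming `percent`% from each end."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of values to delete from each end, rounded to the nearest
    # whole number (halves round up).
    k = int(n * percent / 100 + 0.5)
    remaining = ordered[k:n - k]
    return sum(remaining) / len(remaining)


data = list(range(1, 20)) + [1000]   # 20 entries, one extreme value
print(sum(data) / len(data))         # ordinary mean, pulled up by 1000
print(trimmed_mean(data))            # mean of the middle 18 entries
```

With 20 entries, 5% is 1 value, so the smallest value (1) and the largest value (1000) are dropped before averaging.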
3.2 Population Variance σ²
- Population variance is, σ² - The formula for population variance is, σ² = ∑(x - µ)² / N
Percentile
For whole number P (where 1 ≤ P ≤ 99), the Pth percentile of a distribution is a value such that P% of the data fall at or below it and (100 - P)% of the data fall at or above it.
3.3 How to Compute Quartiles
- 1. Order the data from smallest to largest. - 2. Find the median. This is the 2nd quartile. - 3. The first quartile Q1 is then the median of the lower half of the data; that is, it is the median of the data falling BELOW the Q2 position (and not including Q2). - 4. The third quartile Q3 is the median of the upper half of the data; that is, it is the median of the data falling ABOVE the Q2 position (and not including Q2).
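The median-of-halves procedure above can be sketched as follows. The helper names `median_of_sorted` and `quartiles` and the sample data are assumptions; note that library routines often use a different quartile convention and may give slightly different values.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of each half, excluding Q2."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # data below the Q2 position
    upper = ordered[half + n % 2:]    # data above the Q2 position
    return (median_of_sorted(lower),
            median_of_sorted(ordered),
            median_of_sorted(upper))


q1, q2, q3 = quartiles([2, 4, 5, 7, 8, 10, 11])
print(q1, q2, q3)        # quartiles of the 7 values
print(q3 - q1)           # interquartile range
```

For an odd number of values the middle value itself is left out of both halves; for an even number the data split cleanly in two.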
3.1 Population Mean- µ
- If the data comprise the entire POPULATION, we use the symbol µ, pronounced "mew," to represent the mean. - UPPER CASE N = NUMBER OF DATA VALUES IN THE POPULATION. - µ = ∑x DIVIDED BY N.
3.2 Population Size
- Population size is upper case N. - We note that the formula for µ is the same as the formula for x̄ (the sample mean). - The formulas for σ² and σ have the same form as the formulas for s² and s (sample variance and sample standard deviation). - The only difference is that the population size N is used in the denominator instead of n - 1 (the sample size minus 1). - Also, µ is used instead of x̄ in the formulas for σ² and σ.
3.2 Population Mean
- Population mean is, µ - If the data comprise the entire POPULATION, we use the symbol µ, pronounced "mew," to represent the mean. - UPPER CASE N = NUMBER OF DATA VALUES IN THE POPULATION. - µ = ∑x DIVIDED BY N.
3.2 Population Standard Deviation σ
- Population standard deviation is, σ - The formula for population standard deviation is, σ = the square root of [∑(x - µ)² / N] - The computation formula for standard deviation is, σ = the square root of [(∑x² - (∑x)² / N) / N].
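As a check that the defining formula and the computation formula agree, here is a short Python sketch. The function names and the eight data values are illustrative assumptions, and the data are treated as an entire population.

```python
import math


def pop_std_defining(values):
    """sigma = sqrt( sum (x - mu)^2 / N )"""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def pop_std_computation(values):
    """sigma = sqrt( (sum x^2 - (sum x)^2 / N) / N )"""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / n)


data = [2, 4, 4, 4, 5, 5, 7, 9]      # mean is 5, sum of squares is 32
print(pop_std_defining(data))
print(pop_std_computation(data))
```

Both functions give a standard deviation of 2 for this data set, since 32 / 8 = 4 and the square root of 4 is 2.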
3.2 Sum of Squares
- The DEFINING FORMULA for sum of squares is, ∑(x - x̄)² - PG 95 for reference.
3.1 Trimmed Mean
- A measure of center that is MORE RESISTANT than the mean but still sensitive to specific data values. - A trimmed mean is the mean of the data values left after "trimming" a specific percentage of the smallest and largest data values from the data set. - Usually a 5% trimmed mean is used. - This implies that we trim the lowest 5% of the data as well as the highest 5% of the data. - A similar procedure is used for a 10% trimmed mean.
3.2 Variance and Standard Deviation
- A measure of the distribution, or spread of data around an expected value, either sample or population. - Formulas for variance and standard deviation differ slightly depending on whether we are using a sample or the entire population. - Sample variance and sample standard deviation are used to describe the spread of data about the mean x̄.
3.1 Resistant Measure
- An average that is not influenced by extremely high or low data values. - The mean IS NOT a resistant measure of center because we can make the mean as large as we want by changing the size of only one data value. - The median IS more resistant.
3.1 Mean
- An average that uses the exact value of each entry of data. - To compute the mean, we add the values of all the entries and then divide by the number of entries. - MEAN = SUM OF ALL ENTRIES DIVIDED BY NUMBER OF ENTRIES.
3.2 Coefficient of Variation CV
- CV is used to compare measurements from different populations; it expresses the standard deviation as a percentage of the sample or population mean. - If x̄ and s represent the sample mean and sample standard deviation, respectively, then the sample CV is defined to be, CV = s/x̄ MULTIPLIED BY 100%. - If µ and σ represent the population mean and population standard deviation, respectively, then the population CV is defined to be, CV = σ/µ MULTIPLIED BY 100%.
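A brief Python sketch of the sample CV, using the standard library `statistics` module. The function name `sample_cv` and both data sets are assumptions chosen to show that the CV does not depend on the scale of the measurements.

```python
import statistics


def sample_cv(values):
    """Sample coefficient of variation, as a percentage: s / x-bar * 100."""
    s = statistics.stdev(values)      # sample standard deviation (n - 1)
    x_bar = statistics.mean(values)
    return s / x_bar * 100


small_units = [10, 20, 30]
large_units = [100, 200, 300]
print(sample_cv(small_units))   # about 50 (percent)
print(sample_cv(large_units))   # about 50 (percent)
```

The two data sets have very different standard deviations (10 and 100) but the same spread relative to their means, which is what the CV measures.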
3.2 Chebyshev's Theorem
- For any data set (either population or sample) and for any constant k greater than 1, the proportion of the data that must lie within k standard deviations on either side of the mean is AT LEAST 1 - 1 / k².
3.2 Results of Chebyshev's Theorem
- For any set of data: - at least 75% of the data fall in the interval from µ - 2σ to µ + 2σ. - at least 88.9% of the data fall in the interval from µ - 3σ to µ + 3σ. - at least 93.8% of data fall in the interval from µ - 4σ to µ + 4σ.
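The percentages listed above come from evaluating 1 - 1 / k² at k = 2, 3, and 4. A minimal Python sketch (the function name `chebyshev_bound` is an assumption):

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    # Print the bound as a percentage with one decimal place.
    print(k, f"{chebyshev_bound(k) * 100:.1f}%")
```

For k = 2 the bound is 1 - 1/4 = 0.75; for k = 3 it is 1 - 1/9, about 0.889; for k = 4 it is 1 - 1/16 = 0.9375, which the list above rounds to 93.8%.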
3.1 Distribution Shapes and Averages
- In general, when a data distribution is mound-shaped symmetrical, the values for the MEAN, MEDIAN, AND MODE ARE THE SAME OR ALMOST THE SAME. - For Skewed-left distributions, the MEAN IS LESS THAN THE MEDIAN AND THE MEDIAN IS LESS THAN THE MODE. - For skewed-right distributions the MODE IS THE SMALLEST VALUE, THE MEDIAN IS THE NEXT LARGEST AND THE MEAN IS THE LARGEST. - PG. 88 for figure 3.1 examples.
3.1 For large ordered data sets of size "n"
- It is convenient to have a formula to find the middle of the data set. - For an ordered data set of size "n," POSITION OF THE MIDDLE VALUE = (n + 1) DIVIDED BY 2.
3.1 Data Types and Averages
- MODE: can be used for all four levels of data: nominal, ordinal, interval, and ratio. example- the modal color of all passenger cars sold last year might be blue. - MEDIAN: may be used with data at the ordinal, interval, or ratio level. example- if we ranked the passenger cars in order of customer satisfaction, we could identify the median satisfaction level. - MEAN: Our data need to be at the interval or ratio level (although there are exceptions in which the mean of ordinal-level data is computed). example- we can certainly find the mean model year of used passenger cars sold or the mean price of new passenger cars. - Another issue of concern is that of taking averages of averages. example- if the values 520, 640, 730, 890, & 920 represent the mean monthly rents for five different apartment complexes, we can't say that 740 (the mean of the five numbers) is the mean monthly rent of all apartments. We need to know the number of apartments in each complex before we can determine an average based on the number of apartments renting at each designated amount.
3.2 Range
- Measure of variation - The difference between the largest and smallest values of a data distribution. LARGEST VALUE MINUS SMALLEST VALUE - an average taken by itself may not always be very meaningful so we need a statistical cross-reference that measures the spread of the data; the range. - example- two suppliers have the same range and mean, how do they differ? supplier 1 provides more cartons that weigh closer to the mean.
3.2 Outlier
- One indicator that a data value might be an "outlier" is that it is more than 2.5 standard deviations from the mean.
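A small Python sketch of this rule of thumb. The function name `possible_outliers`, the use of the sample standard deviation, and the ten data values are all assumptions made for illustration.

```python
import statistics


def possible_outliers(values, threshold=2.5):
    """Values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # assumption: sample standard deviation
    return [x for x in values if abs(x - mean) > threshold * s]


data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]
print(possible_outliers(data))   # only 50 is flagged
```

Here the mean is 15.2 and s is about 12.3, so 50 sits roughly 2.8 standard deviations above the mean while every other value is within half a standard deviation.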
3.1 Average
- One number that is used to describe the entire sample or population. - We will study only three major ones: Mean, Median, & Mode. - Always use the highest number when deciding, and report it.
3.3 Quartiles
- Percentiles that divide the data into fourths; they are used so frequently that they have their own name. - The first quartile Q1 is the 25th percentile. - The second quartile Q2 is the median. - The third quartile Q3 is the 75th percentile.
3.1 Geometric Mean
- The average used when data consist of percentages, ratios, or growth rates. - For n data values, the nth root of the product of the n numbers. PG. 93 for details. - Note that for the same data, the harmonic mean is less than or equal to the geometric mean, which is less than or equal to the arithmetic mean.
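A minimal Python sketch of the geometric mean, together with a check of the ordering harmonic ≤ geometric ≤ arithmetic. The function name `geometric_mean` and the two-value data set are assumptions; the sketch assumes all values are positive.

```python
import math
import statistics


def geometric_mean(values):
    """nth root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))


data = [2, 8]
g = geometric_mean(data)             # square root of 2 * 8 = 16
h = statistics.harmonic_mean(data)   # 2 / (1/2 + 1/8)
a = statistics.mean(data)            # (2 + 8) / 2
print(h, g, a)
print(h <= g <= a)
```

For these two values the harmonic mean is 3.2, the geometric mean is 4, and the arithmetic mean is 5.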
3.1 Median
- The central value of an ordered distribution. - When you are given a median, you know there are an equal number of data values in the ordered distribution that are above and below it. - If the extreme values of data set change, the median usually does not change, which is why it is used for house prices (mansions & lower priced homes)
Interquartile Range
- The interquartile range tells us the spread of the middle half of the data. - The median or Q2 is a popular measure of the center utilizing relative position, and a useful measure of data spread utilizing relative position is the IQR. - IQR = Q3 - Q1
3.3 Five-number Summary
- The quartiles together with the low and high data values give us a very useful "five-number summary" of the data and their spread. - Five number summary is *Lowest Value, *Q1, *Median, *Q3, *Highest Value.
3.1 Sample Mean- x̄
- The symbol for the mean of a SAMPLE distribution of x values is x̄ , read "x BAR" - LOWER CASE n = NUMBER OF DATA VALUES IN THE SAMPLE. - x̄ = ∑x DIVIDED BY n.
3.1 Mode
- The value that occurs most frequently in a set of data. - For large data sets it is useful to order, or sort, the data before scanning them for a mode. - Not every data set has a mode. - The mode is not very stable.
3.3 Box-and-Whisker Plot
- Using the five-number summary to create a graphic sketch of the data. - These plots provide another useful technique from exploratory data analysis (EDA) for describing data.
3.1 How to compute the weighted average
- Weighted average = ∑xw DIVIDED BY ∑w - Where x is the data value and w is the weight assigned to that data value. - The sum is taken over all data values.
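A short Python sketch of the weighted average: the sum of each value times its weight, divided by the sum of the weights. The function name `weighted_average` and the scores of 80 and 90 are hypothetical; the 40/60 weights mirror a midterm worth 40% and a final worth 60%.

```python
def weighted_average(values, weights):
    """Sum of x * w divided by the sum of the weights w."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


scores = [80, 90]    # hypothetical midterm and final exam scores
weights = [40, 60]   # midterm counts 40%, final counts 60%
print(weighted_average(scores, weights))   # (80*40 + 90*60) / 100
```

The result, 86, is closer to the final exam score than the plain mean of 85 because the final carries more weight.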
3.1 Summation Symbol- ∑ & ∑x
- When we compute the mean, we sum the given data. - ∑ is a convenient notation to indicate the sum. - x represents any value in the data set, The notation ∑x reads "THE SUM OF ALL GIVEN x VALUES"
3.2 Sample Standard Deviation s
- s, is sample standard deviation. - The DEFINING FORMULA for standard deviation is, s = the square root of [∑(x - x̄)² / (n - 1)] - Where x is a member of the data set, x̄ is the mean, and n is the number of data values. The sum is taken over all the data values. - PG 95 & 96 for reference.
3.2 Sample Variance s²
- s², is the sample variance. - The DEFINING FORMULA for variance is, s² = ∑(x - x̄)² / (n - 1) - Where x is a member of the data set, x̄ is the mean, and n is the number of data values. The sum is taken over all the data values. - PG 95 & 96 for reference.
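The defining formulas for s² and s can be sketched in Python as follows. The function names and the five data values are illustrative assumptions; the standard library's `statistics.variance` is used only as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """s^2 = sum (x - x_bar)^2 / (n - 1)"""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """s is the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 4, 6, 8, 10]          # mean is 6, sum of squares is 40
print(sample_variance(data))     # 40 / 4
print(sample_std(data))          # square root of 10
print(statistics.variance(data)) # library cross-check
```

Dividing by n - 1 rather than n is what distinguishes these sample formulas from the population formulas.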
3.1 Harmonic Mean
- the average used when data consist of rates of change, such as speeds. - For n data values, HARMONIC MEAN = n / ∑(1 / x)
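A minimal Python sketch of the harmonic mean formula. The function name `harmonic_mean` and the two speeds are hypothetical: imagine driving a fixed distance at 60 mph and returning over the same distance at 40 mph.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the n values."""
    return len(values) / sum(1 / x for x in values)


speeds = [60, 40]              # hypothetical speeds out and back, in mph
print(harmonic_mean(speeds))   # about 48, not the arithmetic mean of 50
```

The slower leg takes more time, so the average speed for the whole trip is pulled below the arithmetic mean of the two speeds.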
3.1 Weighted Average
- when we want to average numbers but we want to assign more importance, or weight, to some of the numbers. - example- grade based on final exam and midterm both worth 100 points, but final is worth 60% of grade and midterm is worth 40%. To determine the average score that would reflect these different weights you'll need the weighted average.