Statistics Algebra

Measures of relative position

Measures of where an observation stands in relation to other values in the data set. There are two principal methods of communicating relative position: percentiles and standard scores.

High standard deviation

More variability

Properties of the mode

1. A data set does not have to have a mode.
2. A data set can have more than one mode.
3. If a mode exists for a data set, the mode is a value in the data set.
4. Not affected by outliers in the data set.
5. Only measure of center appropriate for qualitative data.

Properties of the median

1. Easy to compute by hand.
2. The middle number of the ordered data set.
3. Only determined by middle values of a data set, and not affected by extreme numbers.
4. Useful measure of center for skewed distributions.
5. Is not necessarily a value in the data set.

Finding sample variance

1. FIRST calculate the mean 2. Then follow formula 3. Round to 1 decimal

Steps to construct a box plot

1. Find the five-number summary of the data given.
2. Begin with a horizontal (or vertical) number line that contains the minimum and maximum values of the data.
3. Draw a small line segment above (or next to) the number line to represent each of the 1st and 3rd quartiles.
4. Connect the line segment that represents the first quartile to the line segment representing the third quartile, forming a box.
5. Draw a line to represent the median inside the box just formed.
6. Draw a line from the minimum value to the first quartile.
7. Draw a line from the third quartile to the maximum value.

Determining the most appropriate measures of center

1. For QUALITATIVE data, the mode should be used.
2. For QUANTITATIVE data, the mean should be used, unless the data set contains outliers or is skewed.
3. For QUANTITATIVE data sets that are skewed or contain outliers, the median should be used.
If all three measures of center are plotted on a distribution that is skewed to the right, then the mean is the measure of center farthest to the right.

Steps to determine the Pth percentile

1. Form an ordered array by placing the data in order from smallest to largest.
2. Calculate l, the location of the Pth percentile in the ordered array, using the formula l = n * P/100.
3. Determine if l is a whole number or a decimal:
-If the formula results in a decimal value for l, the location is the next largest integer. The value of the percentile is the value at this location in the ordered array.
-If the formula results in a whole number, the percentile's value is the mean of the value in that location and the one in the next largest location in the ordered array.
It is important to remember that the result of step 2 IS NOT THE PERCENTILE; it is the location of the percentile in the ordered array. Thus, if the result of step 2 is 14.2, then the desired percentile would be the 15th value in the ordered list.
For example, find the 40th percentile of the following data: 5, 8, 6, 10, 20, 7, 6, 10, 19, 3, 4, 16. The ordered array is 3, 4, 5, 6, 6, 7, 8, 10, 10, 16, 19, 20, and l = 12 * 40/100 = 4.8, which rounds up to 5. The 5th value in the ordered array is 6, so the 40th percentile is 6.
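The location rule can be sketched in a few lines of Python. This is only an illustration of the steps on this card: the function name `percentile_value` is made up for the example, and it assumes P is a whole number strictly between 0 and 100 and that the data set is not empty.

```python
def percentile_value(data, p):
    """Value of the p-th percentile using the location rule l = n * p / 100.

    Assumes p is an integer with 0 < p < 100 (an assumption of this sketch).
    """
    ordered = sorted(data)                 # step 1: ordered array
    n = len(ordered)
    whole, remainder = divmod(n * p, 100)  # step 2: l = whole + remainder/100
    if remainder != 0:
        # Decimal location: use the next largest integer position (1-based).
        return ordered[whole]              # position whole + 1 -> index whole
    # Whole-number location: mean of that position and the next one.
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [5, 8, 6, 10, 20, 7, 6, 10, 19, 3, 4, 16]
print(percentile_value(data, 40))  # location 4.8 -> 5th ordered value
```

Integer arithmetic (`divmod`) is used so that the whole-number test does not depend on floating-point rounding.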

Properties of the mean

1. Most familiar and widely used.
2. Its value is affected by every value in the data set.
3. Is not necessarily a value in the data set.
4. Appropriate choice for quantitative data with no outliers.

Alternative method to approximate quartiles

1. Order the data set.
2. Find the median; this will be Q2.
3. Next, use the median to divide the data set into an upper half and a lower half. For a data set with an odd number of data values, do not include the median in either half. If there is an even number of data values, the median falls between the two middle values, so it is not included in either the upper or lower half of the data anyway.
4. The first quartile, Q1, is the median of the lower half of the data.
5. The third quartile, Q3, is the median of the upper half of the data.
This approximation method results in the values Q1, MED, Q3 given on the TI-84 calculator. By not including the median when dividing the data set into halves, we are actually calculating a hinge.
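As a rough sketch of the approximation method, the halves can be taken with list slices and the median from Python's standard `statistics` module. The function name `approx_quartiles` is hypothetical, and the sketch assumes at least two data values.

```python
from statistics import median


def approx_quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves approximation."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [7, 10, 9, 3, 3, 4, 6, 13, 14, 2, 15, 5, 11]
q1, q2, q3 = approx_quartiles(data)
print(q1, q2, q3)
```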

What a box plot can tell you

1. The distribution of the data is skewed left if:
a. The line from the minimum value to the first quartile is longer than the line from the third quartile to the maximum value, OR
b. The median is closer to the third quartile.
2. The distribution of the data is skewed right if:
a. The line from the third quartile to the maximum value is longer than the line from the minimum value to the first quartile, AND
b. The median is closer to the first quartile.
3. The distribution of the data is symmetric if:
a. The line from the minimum value to the first quartile is the same length as the line from the third quartile to the maximum value, AND
b. The median is centered between the first quartile and the third quartile.
4. The smaller the range, the smaller the deviation. Similarly, the larger the range, the larger the deviation.
5. The box itself represents the middle 50% of the data, which is described by the interquartile range (IQR = Q3 - Q1).

Statistical measures that define the center

Arithmetic mean, median, mode

Statistical tools to make sense of a large group of data

Centrality: data values cluster around one central value which provides a focal point for the data set, a location of sorts.
Dispersion
Shape

Chebyshev's Theorem

Chebyshev's Theorem is helpful when the empirical rule cannot be used. However, Chebyshev's Theorem simply gives a minimum estimate; it is NOT exact. The proportion of data that lies within K standard deviations of the mean is at least $1 - 1/K^2$, for $K > 1$.
K=2: At least $1 - 1/2^2 = 3/4 = 75\%$ of the data lies within 2 standard deviations of the mean.
K=3: At least $1 - 1/3^2 = 8/9 \approx 88.9\%$ of the data lies within 3 standard deviations of the mean.
Example: Suppose that in one town the mean income is $37,400 with a standard deviation of $4,200. What percentage of households earn between $24,800 and $50,000?
-Since we do not know that the data is bell-shaped, we cannot apply the empirical rule, but we can apply Chebyshev's Theorem. To do so, we need to know how many standard deviations $24,800 and $50,000 are from the mean. By subtracting, we can find how far each of these figures is from the mean. Then, dividing by the standard deviation, we can convert these differences into numbers of standard deviations.
$24,800 - $37,400 = -$12,600 and -$12,600 / $4,200 = -3
$50,000 - $37,400 = $12,600 and $12,600 / $4,200 = 3
Thus, these incomes lie 3 standard deviations from the mean (above and below). Chebyshev's Theorem can be applied with K=3: at least 88.9% of households earn between $24,800 and $50,000.
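The income example can be checked with a short Python sketch. The helper name `chebyshev_lower_bound` is invented for illustration; it only evaluates the bound 1 - 1/K^2 and makes no claim about the actual proportion.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound requires k > 1")
    return 1 - 1 / k**2


mean_income, sd_income = 37_400, 4_200
low, high = 24_800, 50_000

# Convert each endpoint into a number of standard deviations from the mean.
k_low = (mean_income - low) / sd_income
k_high = (high - mean_income) / sd_income

bound = chebyshev_lower_bound(k_low)
print(k_low, k_high, f"at least {bound:.1%}")
```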

Standard deviation

Computed directly from the variance. The standard deviation is the SQUARE ROOT OF THE VARIANCE! The standard deviation provides a measure of how much we might expect a typical member of the data set to differ from the mean. The greater the standard deviation, the more the data is "SPREAD OUT." Note that, by definition, the standard deviation cannot be negative: THE ST DEV IS THOUGHT OF AS AN AVERAGE DISTANCE AND IS COMPUTED FROM A SQUARE ROOT. If the standard deviation is 0, then all of the data values must be the same. The standard deviation allows us to interpret differences from the mean with some sense of scale. For instance, if a data set consisted of gas prices in various towns, a difference of even a single dollar would be considered very large. The standard deviation allows us to make judgments of whether a difference is large or small, in a systematic way.
$s = \sqrt{s^2}$ ...the sample standard deviation
$\sigma = \sqrt{\sigma^2}$ ...the population standard deviation
Follow rounding rule.

Describing a data value

Describing a data value by its number of standard deviations from the mean is a fundamental concept in statistics. It is used as a standardization technique, a yardstick that will be used to describe properties of data sets, and to compare the relative values of data from different data sets. Standard scores are useful in comparing data values from populations with different means and standard deviations. For example, they could be used to determine if you scored better on the ACT exam or the SAT exam, assuming you took both.

Example of percentile method and approximation method

Find the three quartiles for the data set: 7, 10, 9, 3, 3, 4, 6, 13, 14, 2, 15, 5, 11
a. Use the percentile method to find the quartiles.
-We must first put the data in order (2, 3, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14, 15). To find the first quartile, we want to find the 25th percentile, so P=25. Using the formula for the location of a percentile, we get the following: l = n * P/100 = 13 * 25/100 = 3.25. Rounding up to the next whole number, we can say that the 4th value, which is 4, is the first quartile, thus Q1=4.
-Using the formula again for the 50th and 75th percentiles (locations 6.5 and 9.75, rounded up to the 7th and 10th values), we find that Q2=7 and Q3=11.
b. Use the approximation method to find the quartiles.
-Put the data in order (2, 3, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14, 15).
-Find the median. Since there are 13 data values, the middle value (7th value) will be the median, which is 7.
-Divide the data into two halves. We DO NOT INCLUDE THE MEDIAN IN EITHER HALF OF THE DATA.
-The lower half of the data is 2, 3, 3, 4, 5, 6. The median of the lower half is 3.5, thus Q1=3.5.
-The upper half of the data is 9, 10, 11, 13, 14, 15. The median of the upper half of the data is 12, thus Q3=12.

Deviation

Given a point A and a data point x, (x - A) represents how far x deviates from A. This difference is called a deviation: the difference between a data value and another measure, such as the mean. The mean is considered a point of centrality because the deviations from the mean on the positive side and negative side are equal. The sample mean can be interpreted as a center of gravity. The sum of deviations from the mean is always equal to 0. On the other hand, if we calculate the deviations about any other value, the deviations do not balance. A desirable characteristic of a central value would be to have the positive and negative deviations equal to each other in absolute value.
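A small Python sketch makes the balancing property concrete. It reuses the data 0, 1, 2, 3, 9 from the range card; the helper name `deviations_about` is made up for this illustration, and with other data the sum may be off from 0 by floating-point rounding.

```python
def deviations_about(data, a):
    """Deviations (x - a) of every data value x from the point a."""
    return [x - a for x in data]


data = [0, 1, 2, 3, 9]
mean = sum(data) / len(data)            # 3.0

about_mean = deviations_about(data, mean)
about_other = deviations_about(data, 2)  # 2 is the median, not the mean

print(about_mean, sum(about_mean))    # negative and positive parts cancel
print(about_other, sum(about_other))  # deviations about 2 do not balance
```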

Outlier effects

If the median and mode are close to the same value but the mean is noticeably smaller (or larger), the data set likely contains an outlier that is much smaller (or larger) than the rest of the data. The value for the mean will be pulled toward any outlier; thus, the mean is pulled towards the tail of a skewed distribution. For this reason, if a quantitative data set has an outlier or is skewed, you should use the median.

Box plots

If we want to represent a five-number summary graphically, we can use a graph called a box plot. The box plot is constructed from the five-number summary measures:
-the smallest data value
-the first quartile
-the median
-the third quartile
-the largest data value
A vertical line represents each of the five-number summary values. A box plot is also referred to as a box and whisker plot. The box refers to the rectangle that is created from joining the lines representing the 1st and 3rd quartiles. This box represents the INTERQUARTILE RANGE (IQR).

Pth percentile of a data value

Instead of finding the value that represents a given percentile, we can also take a specific value and approximate its corresponding percentile. Solving the location formula for P gives the percentile of a particular data value: $P = \frac{l}{n} \cdot 100$
Where..
P = the percentile, rounded to the nearest whole number.
l = the number of values in the data set less than or equal to the given value.
n = the number of data values in the sample.
ROUNDING RULE: When calculating a percentile, ALWAYS round to the nearest whole number.
Example: To determine the percentile of a score of 50 on a screening test, the number of data values less than or equal to 50 must be counted. In the ordered array of 40 scores, there are 12 data values less than or equal to 50, so the resulting percentile would be: P = 12/40 * 100 = 30. Hence a score of 50 on the screening test corresponds to the 30th percentile. Thus, approximately 30% of the scores were less than or equal to 50. If the resulting percentile is 47.5, round to 48; the score is then better than approximately 48% of all other scores on the test.
By computing percentiles, we have changed the data's scaling. We can see the data from a new perspective. From the percentiles, it is clear that a score of 67 is much better than a score of 50. This 17-point difference in score translates into an 18-point difference on the percentile scale.
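The percentile of a data value can be sketched as follows. The card does not list the 40 screening-test scores, so the `scores` list below is invented purely so that 12 of its 40 values are at most 50; the function name `percentile_of_value` is also hypothetical. Rounding is done half-up with integer arithmetic, because Python's built-in `round` rounds halves to the nearest even number.

```python
def percentile_of_value(data, value):
    """Percentile of `value`: P = l / n * 100, rounded to a whole number."""
    n = len(data)
    count = sum(1 for x in data if x <= value)  # l: values <= given value
    # Round half up without floats: floor(100*count/n + 0.5).
    return (200 * count + n) // (2 * n)


# Made-up scores (39, 40, ..., 78): exactly 12 of the 40 are <= 50.
scores = list(range(39, 79))
print(percentile_of_value(scores, 50))  # 12/40 * 100
```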

Dispersion

Is the data widely scattered or tightly grouped around the central point?

Coefficient of variation (CV)

Make comparisons mathematically even if the data does not have the same unit of measurement. The CV is the ratio of the standard deviation to the mean, expressed as a percentage.
For sample data, it is defined as: $CV = \frac{s}{\bar{x}} \cdot 100\%$
For a population, it is: $CV = \frac{\sigma}{\mu} \cdot 100\%$
When comparing the variation of data sets, many times the unit of measure will be different. The coefficient of variation standardizes the variation measure by dividing it by the mean. The division has one interesting side effect: the unit of measure is removed from the statistic. The coefficient of variation allows us to compare the spread of data from 2 different sources.
For example, Data set A: mean = 35 cm, st dev = 6 cm. CV = 6 cm / 35 cm * 100% = 17.1%, which means that the variation is 17.1% of the mean value. Comparing set A (CV of 17.1%) with set B (CV of 1.6%), A has the larger relative standard deviation.
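A minimal Python sketch of the sample CV, using the standard `statistics` module. The function names and the small height lists are illustrative assumptions, not data from the card; the same heights are given in centimeters and in meters to show that the unit drops out.

```python
from statistics import mean, stdev


def cv_from_summary(sd, avg):
    """Coefficient of variation (in percent) from a standard deviation and mean."""
    return sd / avg * 100


def sample_cv(data):
    """Sample coefficient of variation (in percent) computed from raw data."""
    return cv_from_summary(stdev(data), mean(data))


print(round(cv_from_summary(6, 35), 1))  # data set A from the card

heights_cm = [150, 160, 170]  # hypothetical values
heights_m = [1.5, 1.6, 1.7]   # the same heights in meters
print(sample_cv(heights_cm), sample_cv(heights_m))
```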

Percentiles

One way of calculating a value's relative position is to divide the data into equal parts and state in which part the value lies. If we divide the data into 100 PARTS, the divisions are called percentiles. Percentiles tell you approximately what percentage of the data lie at or below a given value. For example, in data sets that do not contain significant quantities of identical data, the 30th percentile is a value such that about 30 percent of the values are below it. Oftentimes, standardized scores are reported in terms of a percentile. If you scored in the 81st percentile on the SAT, then 81% of SAT scores are less than or equal to your score. There are many different methods for calculating percentiles and very little agreement on which one is the best. Various statistical software packages actually allow you to choose the method you want to use. Each method could result in a different answer depending on the size and variation of your data set. Despite the controversy on how best to calculate percentiles, we will use the method described under "Data value of the Pth percentile."

Variance

POPULATION VARIANCE: The variance of a data set containing the complete set of population data is given by: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
where...
$x_i$ is the ith data value in the set
$\mu$ is the population mean
N is the size of the population
SAMPLE VARIANCE: The variance of a data set containing sample data is given by: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
where...
$x_i$ is the ith data value in the data set
$\bar{x}$ is the sample mean
n is the size of the sample
Follow rounding rule.
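The two formulas differ only in the denominator, which a short Python sketch can show side by side. The function names are made up for the example, and the data 0, 1, 2, 3, 9 is borrowed from the range card; Python's standard `statistics.pvariance` and `statistics.variance` compute the same quantities.

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [0, 1, 2, 3, 9]
print(population_variance(data))  # squared deviations sum to 50; 50 / 5
print(sample_variance(data))      # 50 / 4
```

For the same data the sample variance is the larger of the two whenever the values are not all identical, because it divides the same sum by a smaller number.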

Five-number summary

Quartiles are used in a numerical description, aptly called the five-number summary because it contains 5 numbers:
-the minimum value
-the first quartile (Q1)
-the median or second quartile (Q2)
-the third quartile (Q3)
-the maximum value
The five-number summary is made up of these 5 numbers listed in order from smallest to largest.
Example: Find the five-number summary for the data given. Use the approximation method to calculate the quartiles. These values will match those produced by the TI-84 calculator. The data is: 7, 10, 9, 3, 3, 4, 6, 13, 14, 2, 15, 5, 11.
Solution:
-Put the data in order (2, 3, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14, 15)
-Q1=3.5
-Q2=7
-Q3=12
-Minimum value is 2
-Maximum value is 15
Five-number summary: 2, 3.5, 7, 12, 15

Measures of dispersions

Range
Variance, standard deviation: the most common measures of variability; both provide numerical measures of how the data varies around the mean. If the data is tightly packed around the mean, the variance and the standard deviation will be relatively small. On the other hand, if the data is widely dispersed around the mean, the variance and standard deviation will be relatively large.

Rounding rule

Round variances to one more decimal place than the highest number of decimal places contained in the data. If you are not given the actual data, but have a mean, round to the same number of decimal places as the mean.

Mean

Suppose there are n observations in a data set consisting of the observations x1, x2, x3, ..., xn. The mean is the sum of the observations divided by n: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$. The formula can also be written as $\bar{x} = \frac{\sum x_i}{n}$.
Rounding rule: round to one more decimal place than the highest number of decimal places contained in the data.
$\bar{x}$ = SAMPLE MEAN
$\mu$ = POPULATION MEAN
n = SAMPLE SIZE
N = POPULATION SIZE
The sum of the deviations from the mean is always equal to 0. The mean will always be pulled towards any outliers.
Ex: the average price of similar televisions at different stores. This is quantitative data that does not have outliers since the TVs at the different stores are similar.

Standard score example

Suppose you scored an 88 on your biology test and a 90 on your psychology test. The means and standard deviations are given below.
BIOLOGY: MEAN 78, ST DEV 10
PSYCHOLOGY: MEAN 81, ST DEV 11
a. What are the standard scores for your two tests? Starting with the biology test, the mean is 78 and the standard deviation is 10. Placing these values in the equation for the standard score gives: $z = \frac{88 - 78}{10} = 1.00$. For the psychology test, the mean is 81 and the standard deviation is 11. Placing these values into the equation for the standard score gives: $z = \frac{90 - 81}{11} \approx 0.82$.
b. On which test did you perform relatively best? On the biology test, you scored 1.00 standard deviations above the mean, compared to only 0.82 standard deviations above the mean for the psychology test. Even though the raw score on the psychology test is larger than the raw score on the biology test, relative to the mean and variability in the data sets, THE PERFORMANCE ON THE BIOLOGY TEST WAS SLIGHTLY BETTER. Once again, changing the scale of the data has beneficial effects. It enables comparison of two measurements drawn from different populations.
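The comparison can be reproduced with a few lines of Python. The function name `z_score` is an assumption of this sketch; it rounds to two decimal places, following the rounding rule for standard scores.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean, to 2 decimals."""
    return round((x - mean) / sd, 2)


biology = z_score(88, mean=78, sd=10)
psychology = z_score(90, mean=81, sd=11)

print(biology, psychology)
print("biology" if biology > psychology else "psychology", "was relatively better")
```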

The interquartile range (IQR)

The difference between the third quartile and first quartile, and is seen as the size of the box in a box-and-whisker plot. *It is the range of the middle 50% of the data.* Given by IQR=Q3-Q1. The "whiskers" are the lines that extend to reach the minimum and maximum values.

Hinge

The lower hinge is a rough approximation for the first quartile, and the upper hinge is a rough approximation for the third quartile. The approximation method will always be equivalent to the percentile method when there is an EVEN NUMBER of data values.

Median

The median of a set of observations is the data value in the MIDDLE OF AN ORDERED ARRAY. The same number of data values is on either side of the median value. If the number of data values is an even number, then the median is the mean of the two middle numbers. First the data set MUST BE ORDERED! The median divides the area of the distribution in half. Ex: the average salary of football players in the NFL. This is quantitative data that has outliers since the superstars on the team make substantially more than the typical players.

Mode

The mode is the value in the data set that occurs most frequently. If all of the data values occur only once, or they each occur an equal number of times, the data set is considered to have no mode. If only one value occurs the most, then the data set is said to be unimodal. If exactly two values occur equally often and more than all the others, then the data set is said to be bimodal. If more than two values occur equally often and more than all the others, the data set is said to be multimodal. USED FOR NOMINAL DATA AND ORDINAL DATA! Graphically, the mode is the highest peak of the distribution. Ex: The average hair color of college men.
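The classification rules can be sketched with `collections.Counter` from the Python standard library. The function name `describe_modes` and the sample lists are invented for illustration. The sketch assumes a non-empty data set, and it treats a data set with a single distinct value as unimodal, which is a choice the card does not spell out.

```python
from collections import Counter


def describe_modes(data):
    """Return (modes, label) following the no mode / uni / bi / multimodal rules."""
    counts = Counter(data)
    top = max(counts.values())
    modes = [value for value, c in counts.items() if c == top]
    # Every distinct value occurs equally often: no mode.
    if len(counts) > 1 and len(modes) == len(counts):
        return [], "no mode"
    label = {1: "unimodal", 2: "bimodal"}.get(len(modes), "multimodal")
    return modes, label


print(describe_modes([1, 2, 3]))
print(describe_modes([1, 1, 2, 2, 3]))
print(describe_modes(["brown", "black", "brown", "blond"]))  # qualitative data
```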

T/F

The population variance and sample variance are the same value for the same data set. FALSE! The population variance and sample variance have different denominators in their formula, so these values are almost always different.

Range

The range is the difference between the largest and smallest data value. Given the data: 0, 1, 2, 3, 9. The largest value is 9 and smallest value is 0, thus the range is 9-0=9.

Standard scores

The standard score, also known as the z-score, is a measure of relative position with respect to the mean and variability (as measured by the standard deviation) of the data set. A standard score transforms a data value into the number of standard deviations that value is from the mean.
The standard score for a POPULATION value is given by: $z = \frac{x - \mu}{\sigma}$
Where..
x = the value of interest from the population
$\mu$ = the population mean
$\sigma$ = the population standard deviation
The standard score for a SAMPLE value is given by: $z = \frac{x - \bar{x}}{s}$
Where..
x = the value of interest from the sample
$\bar{x}$ = the sample mean
s = the sample standard deviation
ROUNDING RULE: We will round standard scores to TWO decimal places.

Variance for grouped data

The variance for grouped data can easily be estimated using the relationship between variance and standard deviation. Recall that the variance is the standard deviation squared.

Data value of the Pth percentile

To find the data value for the Pth percentile, the location of the data value in the set is given by: l = n * P/100
where...
l = the location of the Pth percentile in the ordered array of values
n = the sample size
P = the Pth percentile
When using this formula, you MUST follow these 2 rules:
1. If the formula results in a decimal value for l, the location is the NEXT LARGEST INTEGER.
2. If the formula results in a whole number, the percentile's value is the mean of the value in that location and the one in the next largest location.

Quartiles

We can divide the data into as many parts as we wish. If we divide a data set into four parts, the numbers that form the divisions are called quartiles. If we wanted to divide a line segment into four parts, we would draw 3 lines. Similarly, when dividing a data set into 4 parts, we use 3 quartiles.
Q1 = FIRST QUARTILE: 25% of the data values are less than or equal to this value.
Q2 = SECOND QUARTILE: 50% of the data values are less than or equal to this value.
Q3 = THIRD QUARTILE: 75% of the data values are less than or equal to this value.
To find the quartiles, first note that they are equivalent to percentiles. The 1st quartile is equivalent to the 25th percentile, the 2nd quartile is equivalent to the 50th percentile, and the 3rd quartile is equivalent to the 75th percentile. Thus, to find the first quartile of a data set, we can use the method described previously to find the 25th percentile of the data set.

Weighted mean

Used when each data value in the set does not hold the same relative importance. To calculate a weighted mean for a sample, first multiply each value by its respective weight. Then divide the sum of these products by the sum of the weights to obtain the mean: $\bar{x} = \frac{\sum (w_i \cdot x_i)}{\sum w_i}$. The procedure for calculating the weighted mean for a population is the same but with the notation $\mu$.
Ex: Walter wants to calculate his overall average in his US history course. The syllabus in Walter's class states that the final grade is determined by tests (35%), homework (25%), quizzes (10%), and final exam (30%). His scores are: Tests: 86, HW: 95, Quizzes: 89, Final: 92.
$\bar{x} = \frac{86(0.35) + 95(0.25) + 89(0.10) + 92(0.30)}{0.35 + 0.25 + 0.10 + 0.30} = \frac{90.35}{1.00} = 90.35$
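Walter's average can be reproduced with a short Python sketch. The function name `weighted_mean` is hypothetical; it assumes the two lists have the same length and that the weights do not sum to zero.

```python
def weighted_mean(values, weights):
    """Sum of value * weight products divided by the sum of the weights."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)


scores = [86, 95, 89, 92]            # tests, homework, quizzes, final exam
weights = [0.35, 0.25, 0.10, 0.30]   # syllabus weights

print(round(weighted_mean(scores, weights), 2))
```

Dividing by the sum of the weights means the weights do not have to add up to 1; percentages such as 35, 25, 10, 30 would give the same result.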

Empirical rule

When the distribution of a set of data is approximately BELL-SHAPED, the empirical rule can be used to estimate the percentage of values within a few standard deviations of the mean. The empirical rule is as follows:
-Approximately 68% of the data lies within 1 standard deviation of the mean.
-Approximately 95% of the data lies within 2 standard deviations of the mean.
-Approximately 99.7% of the data lies within 3 standard deviations of the mean.
Example: The distribution of heights of 5-year-old girls is bell-shaped with a mean of 106.68 centimeters and a standard deviation of 3.81 centimeters.
a. What percentage of 5-year-old girls are between 95.25 and 118.11 centimeters tall?
-Since we know the data is bell-shaped, we can apply the empirical rule. We need to know how many standard deviations 95.25 and 118.11 are from the mean. By subtracting, we can find how far each of these figures is from the mean. Then, dividing by the standard deviation, we can convert these differences into numbers of standard deviations.
95.25 - 106.68 = -11.43, then -11.43 / 3.81 = -3
118.11 - 106.68 = 11.43, then 11.43 / 3.81 = 3
Thus, these heights lie 3 standard deviations from the mean (above and below). According to the empirical rule, approximately 99.7% of values lie within 3 standard deviations of the mean. Therefore, we can say that approximately 99.7% of 5-year-old girls are between 95.25 and 118.11 cm tall.
Although the empirical rule is handy for bell-shaped distributions of data, it cannot be applied to other distributions. Chebyshev's Theorem is helpful when the empirical rule cannot be used. However, Chebyshev's Theorem simply gives a minimum estimate; it is NOT exact.
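The height example can be sketched in Python as a lookup on the number of standard deviations. The function name `empirical_rule_percent` and the lookup table are assumptions of this sketch; it only handles intervals of the form mean minus k to mean plus k standard deviations for k = 1, 2, 3, and it presumes the data is bell-shaped, which the code cannot check.

```python
import math

# Approximate percentages from the empirical rule, keyed by k.
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_rule_percent(mean, sd, low, high):
    """Approximate percent of bell-shaped data between low and high."""
    k_low = (mean - low) / sd
    k_high = (high - mean) / sd
    k = round(k_low)
    # Tolerate tiny floating-point error when comparing with a whole number.
    symmetric = math.isclose(k_low, k, abs_tol=1e-9) and math.isclose(k_high, k, abs_tol=1e-9)
    if not symmetric or k not in EMPIRICAL_RULE:
        raise ValueError("interval must be mean +/- k standard deviations, k = 1, 2, 3")
    return EMPIRICAL_RULE[k]


print(empirical_rule_percent(106.68, 3.81, 95.25, 118.11))
```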

Standard deviation formula for grouped data

$s = \sqrt{\frac{n\left[\sum (f \cdot x^2)\right] - \left[\sum (f \cdot x)\right]^2}{n(n-1)}}$
n = sample size
x = class midpoint
f = class frequency
FREQUENCY TABLE SET UP/COLUMNS: Grade -> Frequency (f) -> Midpoint (x) -> Frequency*Midpoint (f*x) -> Frequency*Midpoint^2 (f*x^2)
Do not round for the table unless otherwise stated.
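The grouped-data formula can be sketched in Python by building the two table sums directly. The function name `grouped_std` and the midpoints and frequencies below are made up for illustration; n is taken as the sum of the frequencies.

```python
import math


def grouped_std(midpoints, freqs):
    """Sample standard deviation estimated from class midpoints and frequencies."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))       # column f*x
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))  # column f*x^2
    return math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))


# Hypothetical grade classes 60-69, 70-79, 80-89, 90-99 with their midpoints.
midpoints = [64.5, 74.5, 84.5, 94.5]
freqs = [2, 5, 8, 3]
print(grouped_std(midpoints, freqs))
```

The result matches the ordinary sample standard deviation of a data set in which each midpoint is repeated as many times as its frequency.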

