Chapter 3

Boxplots

-A *boxplot* is a graph that presents the five-number summary along with some additional information about a data set. -There are several different kinds of boxplots. -The one we describe here is sometimes called a *modified boxplot*

Mean

-The mean of a data set is a measure of center. -If we imagine each data value to be a weight, then the mean is the point at which the data set balances. -To find the mean of a list of numbers, add the numbers, then divide by how many numbers there are. -The mean is the same thing as the average.

Procedure for Finding the Median

-Step 1: Arrange the data values in increasing order. -Step 2: Determine the number of data values, n. -Step 3: If n is *odd*, the median is the middle number. In other words, the median is the value in position *(n + 1)/2*. If n is *even*, the median is the average of the two middle numbers. That is, the median is the *average* of the values in positions *n/2* and *n/2 + 1*
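
The three steps above can be sketched in Python. This is a minimal illustration, not a library implementation; the function name `median` and the sample lists are assumptions made for the example.

```python
def median(values):
    """Median computed with the three-step procedure."""
    ordered = sorted(values)              # Step 1: increasing order
    n = len(ordered)                      # Step 2: number of data values
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Step 3 (n odd): value in position (n + 1)/2.
        # Subtract 1 because Python lists are indexed from 0.
        return ordered[(n + 1) // 2 - 1]
    # Step 3 (n even): average of the values in positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median([7, 1, 5]))     # odd n: middle value
print(median([7, 1, 5, 3]))  # even n: average of the two middle values
```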

z-Score

-The z-score of an individual data value tells *how many standard deviations that value is from its population mean*. -Let x be a value from a population with mean μ and standard deviation σ. The z-score for x is *z = (x − μ)/σ*. -For example: who is taller relative to their gender, a man 73 inches tall or a woman 68 inches tall? One way to answer this question is with a *z-score*.
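
A short sketch of the height comparison, assuming the usual definition z = (x − μ)/σ. The card does not give the population means or standard deviations, so the values below (69 and 3 inches for men, 64 and 2.5 inches for women) are made-up placeholders used only to show the arithmetic.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mu) / sigma


# Hypothetical population parameters, for illustration only.
z_man = z_score(73, mu=69, sigma=3)
z_woman = z_score(68, mu=64, sigma=2.5)

print(round(z_man, 2), round(z_woman, 2))
print("woman" if z_woman > z_man else "man", "is taller relative to their gender")
```

With these placeholder parameters the woman has the larger z-score, so she is taller relative to her population even though the man is taller in inches. Different parameters could reverse the conclusion.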

Quartiles

-There are *three special percentiles* which divide a data set into four pieces, each of which contains approximately one quarter of the data. -These values are called the *quartiles*

Right Skewed Histogram

-When a data set is *skewed to the right*, there are large values in the right tail. -Because the median is resistant while the mean is not, the mean is generally more affected by these large values. -Therefore for a data set that is skewed to the right, *the mean is often greater than the median.*

Notation

A list of n numbers is denoted by x1, x2, x3, ... xn

Rounding

In general, it is good practice to round the mean to *one more decimal place* than the data that appear in the original data set.

Resistant

-A statistic is *resistant* if its value is *not affected much by extreme values* (large or small) in the data set. -The median is resistant, but the mean is not. -One important difference between the mean and the median is that the formula for the mean uses every value in the data set, but the formula for the median depends only on the middle number or the middle two numbers. -This is particularly important for data sets in which one or more numbers are unusually large or unusually small. -In most cases, these extreme values will have a *large influence on the mean*, but *little or no influence on the median*.
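
A small sketch of resistance, using Python's standard `statistics` module. The two data sets are invented for the example; they differ only in the largest value.

```python
import statistics

clean = [10, 12, 13, 15, 20]
with_outlier = [10, 12, 13, 15, 200]   # largest value replaced by an extreme one

# The mean uses every value, so the extreme value pulls it upward.
print(statistics.mean(clean), statistics.mean(with_outlier))

# The median depends only on the middle value, so it does not move.
print(statistics.median(clean), statistics.median(with_outlier))
```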

The Range is Not Used in Practice

-Although the range is easy to compute, it is *not often used in practice*. -The reason is that the range involves only two values from the data set: the largest and the smallest. -The measures of spread that are most often used are the *variance and the standard deviation*, which use every value in the data set.

Outliers

-An outlier is a *value that is considerably larger or considerably smaller than most of the values in a data set.* -Some outliers result from errors; for example a misplaced decimal point may cause a number to be much larger or smaller than the other values in a data set. -Some outliers are correct values, and simply reflect the fact that the population contains some extreme values.

The Median

-Another measure of center. -The *median* is a number that splits the data set in half, so that half the data values are less than the median and half of the data values are greater than the median. -The procedure for computing the median differs, depending on whether the number of observations in the data set is even or odd.

The Mode

-Another value that is sometimes classified as a measure of center is the *mode*. -However, this *isn't really accurate*: the mode can be the largest value in a data set, or the smallest, or anywhere in between. -The mode of a data set is the *value that appears most frequently*. -If two or more values are tied for the most frequent, they are all considered to be modes. -If no value appears more than once, we say that the data set has no mode.

Determining Skewness

-Boxplots can help determine the skewness of a data set. -If the *median is closer to the first quartile* than to the third quartile, or the upper whisker is longer than the lower whisker, *the data are skewed to the right*. -If the *median is closer to the third quartile* than to the first quartile, or the lower whisker is longer than the upper whisker, *the data are skewed to the left*. -If the median is *approximately halfway between the first and third quartiles*, and the two whiskers are approximately equal in length, *the data are approximately symmetric*.

Percentile

-For some data it is often useful to *compute measures of positions other than the center*, to get a more detailed description of the distribution. -Percentiles provide a way to do this. -*Percentiles divide a data set into hundredths.* -For a number p between 1 and 99, the *pth percentile* separates the lowest p% of the data from the highest *(100 - p)%*

Sample Means

-If x1,x2,...,xn is a sample, then the mean is called the *sample mean* and is denoted with the symbol *x̅* -We will use a *lowercase n* to denote a sample size

Bell-Shaped Histogram

-Many histograms have a single mode near the center of the data, and are approximately symmetric. -Such histograms are often referred to as *bell-shaped*

Interquartile Range

-One method for detecting outliers involves a measure called the *Interquartile Range* -The interquartile range is found by *subtracting the first quartile from the third quartile*: *IQR = Q3 - Q1*

Standard Deviation & Resistance

-Recall that a statistic is *resistant* if its value is *not affected* much by extreme data values. -*The standard deviation is not resistant.* -That is, the standard deviation is *affected by extreme data values*

Left Skewed Histogram

-Similarly, when a data set is *skewed to the left, the mean is often less than the median.*

z-Scores & The Empirical Rule

-Since the z-score is the number of standard deviations from the mean, we can *easily interpret the z-score for bell-shaped populations using The Empirical Rule.* -When a population has a histogram that is approximately bell-shaped, then: -Approximately *68% of the data will have z-scores between -1 and 1*. -Approximately *95% of the data will have z-scores between -2 and 2* -All, or almost *all of the data will have z-scores between -3 and 3.*

Computing the Percentile Corresponding to a Given Value

-Sometimes we are given a value from a data set, and wish to *compute the percentile corresponding to that value*. -Step 1: Arrange the data in increasing order. -Step 2: Let x be the data value whose percentile is to be computed. Use the following formula to compute the percentile: *Percentile = [(Number of values less than x + 0.5)/(Number of values in the data set)] x 100* -Step 3: Round this result to the nearest integer. *This is the percentile corresponding to the value x.*
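
The formula above can be sketched as follows. The function name `percentile_of_value` and the data are assumptions for the example. Ties at exactly .5 are rounded up here, which is one reasonable reading of "nearest integer"; Python's built-in `round` would round such ties to the nearest even integer instead.

```python
import math


def percentile_of_value(data, x):
    """Percentile corresponding to the value x in data."""
    n = len(data)
    if n == 0:
        raise ValueError("data set is empty")
    # Steps 1-2: sorting is not strictly needed just to count values below x.
    below = sum(1 for v in data if v < x)
    raw = (below + 0.5) * 100 / n
    # Step 3: round to the nearest integer (halves rounded up).
    return math.floor(raw + 0.5)


data = [3, 5, 8, 10, 12, 15, 18, 20, 21, 25]
print(percentile_of_value(data, 15))   # 5 values are below 15: (5.5/10) x 100
```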

Population Means

-Sometimes we need to discuss the mean of all the values in a *population*. -The mean of a population is called the *population mean* and is denoted by *μ* (the Greek letter mu). -We will use an *uppercase N* to denote a population size.

Approximating the Standard Deviation with *Grouped Data*

-Step 1: Approximate the mean of the frequency distribution. -Step 2: Find the midpoint of each class. -Step 3: For each class, subtract the mean from the class midpoint to obtain *(Midpoint - Mean)*. -Step 4: For each class, square the differences obtained in Step 3 to obtain *(Midpoint - Mean)^2*. -Step 5: Multiply by the frequency to obtain *(Midpoint - Mean)^2 x (Frequency)*. -Step 6: Add the products *(Midpoint - Mean)^2 x (Frequency)* over all classes. -Step 7: To compute the *population variance*, divide the sum obtained in Step 6 by *N* (the sum of all the frequencies). To compute the *sample variance*, divide the sum obtained in Step 6 by *n - 1*, where n is the sum of all the frequencies. -Step 8: Take the square root of the variance obtained in Step 7. The result is the *standard deviation*.
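
A minimal sketch of the eight steps, assuming the classes are already reduced to (midpoint, frequency) pairs. The function name `grouped_std`, the `sample` flag, and the frequency table are illustrative assumptions.

```python
import math


def grouped_std(pairs, sample=True):
    """Approximate standard deviation from (midpoint, frequency) pairs."""
    total = sum(freq for _, freq in pairs)                  # sum of frequencies
    if total < 2 and sample:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(mid * freq for mid, freq in pairs) / total   # Step 1
    # Steps 3-6: (Midpoint - Mean)^2 x Frequency, added over all classes.
    weighted_sq = sum((mid - mean) ** 2 * freq for mid, freq in pairs)
    # Step 7: divide by N for a population, by n - 1 for a sample.
    variance = weighted_sq / (total - 1 if sample else total)
    return math.sqrt(variance)                              # Step 8


pairs = [(5, 2), (15, 5), (25, 3)]       # hypothetical frequency table
print(grouped_std(pairs, sample=False))  # population version
print(grouped_std(pairs, sample=True))   # sample version
```

For this table the approximate mean is 16 and the weighted squared deviations add up to 490, so the population variance is 490/10 = 49 and the sample variance is 490/9.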

Procedure for Drawing a Boxplot

-Step 1: Compute the *first quartile, the median, and the third quartile*. -Step 2: Draw *vertical lines* at the *first quartile, the median, and the third quartile*. -Step 3: Draw *horizontal lines between the first and third quartiles* to complete the box. -Step 4: Compute the *lower and upper outlier boundaries*. -Step 5: Find the largest data value that is less than the upper outlier boundary. Draw a horizontal line from the third quartile to this value. This horizontal line is called a *whisker*. -Step 6: Find the smallest data value that is greater than the lower outlier boundary. Draw a horizontal line (whisker) from the first quartile to this value. -Step 7: Determine which values, if any, are outliers. Plot each outlier separately.

Approximating the Mean with *Grouped Data*

-Step 1: Compute the midpoint of each class. -Step 2: For each class, multiply the class midpoint by the class frequency. -Step 3: Add the products *(Midpoint)x(Frequency)* over all classes. -Step 4: Divide the sum obtained in Step 3 by the *sum of all frequencies*.
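
The four steps above can be sketched as follows. Each class is written as (lower limit, lower limit of the next larger class, frequency), so the midpoint is the average of the first two entries. The function name `grouped_mean` and the table are assumptions for the example; for the last class, the second entry is the lower limit the next class would have.

```python
def grouped_mean(classes):
    """Approximate mean from (lower_limit, next_lower_limit, frequency) triples."""
    total_freq = sum(freq for _, _, freq in classes)
    if total_freq == 0:
        raise ValueError("all frequencies are zero")
    products = 0.0
    for lower, next_lower, freq in classes:
        midpoint = (lower + next_lower) / 2   # Step 1
        products += midpoint * freq           # Steps 2-3
    return products / total_freq              # Step 4


classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]   # hypothetical frequency table
print(grouped_mean(classes))   # (5 x 2 + 15 x 5 + 25 x 3) / 10
```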

Procedure for Computing the Population Variance

-Step 1: Compute the population mean μ (add all values together, divide by the number of values there are) -Step 2: For each population value xi compute *xi - μ.* (This is called the deviation for the value xi. ) -Step 3: Square the deviations to obtain the quantity *(xi - μ)^2*. -Step 4: Sum the squared deviations to obtain the quantity *Σ(xi - μ)^2*. -Step 5: Divide the sum obtained in Step 4 by the population size *N* to obtain the population variance σ^2.
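
The five steps above can be sketched as follows. The function name `population_variance` and the eight values, treated here as a complete population, are assumptions for the example.

```python
def population_variance(values):
    """Population variance computed with the five-step procedure."""
    N = len(values)
    if N == 0:
        raise ValueError("population is empty")
    mu = sum(values) / N                         # Step 1: population mean
    deviations = [x - mu for x in values]        # Step 2: deviations xi - mu
    squared = [d ** 2 for d in deviations]       # Step 3: squared deviations
    return sum(squared) / N                      # Steps 4-5: sum, divide by N


population = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(population))   # mean 5, squared deviations sum to 32
```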

IQR Method for Detecting Outliers

-Step 1: Find the first quartile, *Q1 (L = 25/100 x n)*, and the third quartile, *Q3 (L = 75/100 x n)*. -Step 2: Compute the interquartile range: *IQR = Q3 - Q1*. -Step 3: Compute the *outlier boundaries*. These boundaries are the cutoff points for determining outliers: *Lower Outlier Boundary = Q1 - 1.5(IQR)* and *Upper Outlier Boundary = Q3 + 1.5(IQR)*. -Step 4: Any data value that is less than the lower outlier boundary or greater than the upper outlier boundary is considered to be an outlier.
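
A sketch of the IQR method. The quartiles use the index rule L = (p/100) x n from this chapter: when L is a whole number, average the values in positions L and L + 1; otherwise round L up. Other quartile conventions exist and can give slightly different boundaries. The function names and the data are assumptions for the example.

```python
def percentile(data, p):
    """pth percentile (integer p from 1 to 99) using the index L = (p/100) x n."""
    ordered = sorted(data)
    n = len(ordered)
    if (p * n) % 100 == 0:                       # L is a whole number
        L = p * n // 100
        return (ordered[L - 1] + ordered[L]) / 2  # positions L and L + 1
    L = p * n // 100 + 1                          # round L up
    return ordered[L - 1]


def iqr_outliers(data):
    """Return (lower boundary, upper boundary, list of outliers)."""
    q1, q3 = percentile(data, 25), percentile(data, 75)   # Step 1
    iqr = q3 - q1                                          # Step 2
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr          # Step 3
    return lower, upper, [x for x in data if x < lower or x > upper]  # Step 4


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50]
print(iqr_outliers(data))
```

Here n = 12, so Q1 = 3.5 and Q3 = 9.5, the IQR is 6, and the boundaries are -5.5 and 18.5. Only the value 50 falls outside them.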

Coefficient of Variation

-The *coefficient of variation* (CV for short) tells *how large the standard deviation is relative to the mean*. -It can be used to compare the spreads of data sets whose values have different units. -The coefficient of variation is found by *dividing the standard deviation by the mean.*
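
A brief sketch of comparing spreads with the CV. The function name and the two sets of summary statistics (one imagined in pounds, one in inches) are made-up values used only to illustrate the division.

```python
def coefficient_of_variation(std_dev, mean):
    """CV = standard deviation divided by the mean (unit-free)."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return std_dev / mean


# Hypothetical summaries in different units.
cv_weights = coefficient_of_variation(std_dev=4, mean=50)     # pounds
cv_lengths = coefficient_of_variation(std_dev=0.5, mean=2)    # inches

print(cv_weights, cv_lengths)
```

Although the weights have the larger standard deviation, the lengths have the larger CV, so they are more spread out relative to their mean.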

First Quartile

-The *first quartile*, denoted Q1, is the *25th percentile*. -Q1 separates the lowest 25% of the data from the highest 75%. *L = (25/100) x n*

Five-Number Summary

-The *five-number summary* of a data set consists of the median, the first quartile, the third quartile, the smallest value, and the largest value. -These values are generally arranged in order. -The five-number summary of a data set consists of the following quantities: 1. Minimum 2. First Quartile 3. Median 4. Third Quartile 5. Maximum

Second Quartile

-The *second quartile*, denoted Q2, is the *50th percentile* -Q2 separates the lower 50% of the data from the upper 50%. -*Q2 is the same as the median.*

Third Quartile

-The *third quartile*, denoted Q3, is the *75th percentile*. -Q3 separates the lowest 75% of the data from the highest 25%. *L = (75/100) x n*

Population Variance

-The average of the squared deviations is the *population variance*. -Let *x1, x2, ..., xN denote the values in a population of size N*. -Let *μ denote the population mean*. -The population variance, denoted by σ^2, is: *σ^2 = Σ(xi − μ)^2 / N*

Deviation

-The difference between a population value, x, and the population mean, μ, is *x − μ* -This difference is called a *deviation*. -Values less than the mean will have negative deviations, and values greater than the mean will have positive deviations. -Data sets with *a lot of spread* will have *many large squared deviations*, while those with *less spread* will have *smaller squared deviations*.

Mode for Qualitative Data

-The mean and median can only be computed for quantitative data. -The mode, on the other hand, can be computed for *quantitative data and qualitative data* -For qualitative data, the mode is *the most frequently appearing category*.

Describing the Shape of a Data Set

-The mean and median measure the center of a data set in different ways. -When a data set is *symmetric, the mean and median are equal*.

A Misconception About the Mean

-The mean is not necessarily a typical value for the data. -In fact, the mean may be a value that could not possibly appear in the data set.

Population Standard Deviation

-The population standard deviation σ is the *square root of the population variance σ^2*

Degrees of Freedom

-The quantity *n − 1* is sometimes called the *degrees of freedom* for the sample standard deviation. -The reason is that the *deviations x − x̅ always sum to 0*. -Thus, if we know the first n − 1 deviations, we can compute the nth one. -The number of *degrees of freedom* for the sample variance *is one less than the sample size.*
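
A tiny sketch of why only n − 1 deviations are free. The three-value sample is invented for the example; with other data, floating-point arithmetic may give a sum that is only approximately 0.

```python
data = [3, 7, 8]
x_bar = sum(data) / len(data)                 # sample mean
deviations = [x - x_bar for x in data]        # x - x_bar for each value

total = sum(deviations)                       # the deviations sum to 0
# The last deviation is determined by the first n - 1 of them.
last_from_others = -sum(deviations[:-1])

print(deviations, total, last_from_others)
```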

The Range

-The range of a data set is a measure of spread. -That is, it measures how spread out the data are. -The range of a data set is the difference between the largest and the smallest value. *Range = Largest Value - Smallest Value*

Sample Standard Deviation

-The sample standard deviation s is the *square root of the sample variance s^2*

Standard Deviation

-The standard deviation of a sample is denoted *s*, and the standard deviation of a population is denoted *σ*. -*CAUTION*: Don't round off the variance when computing the standard deviation.

The Variance

-The variance is a measure of *how far the values in a data set are from the mean*, on the average. -When a data set has a *small amount of spread*, most of the values *will be close to the mean*. -When a data set has a *larger amount of spread*, more of the data values will be *far from the mean*. -The variance is computed slightly differently for populations and samples. (The population variance is presented first)

Finding the Midpoint of a Class

The midpoint of a class is found by taking the *average* of the lower class limit and the lower limit of the next larger class.

Sample Variance: Why Divide by n -1?

-When computing the sample variance, we use the sample mean to compute the deviations. -For the population variance we use the population mean for the deviations. -It turns out that the *deviations using the sample mean tend to be a bit smaller* than the deviations using the population mean. -If we were to divide by n when computing a sample variance, the value would tend to be a bit smaller than the population variance. -It can be shown mathematically that the appropriate correction is to *divide the sum of the squared deviations by n -1* rather than n.

Sample Variance

-When the data values come from a sample rather than a population, the variance is called the *sample variance*. -The procedure for computing the sample variance is a bit different from the one used to compute a population variance. -In the formula, the mean μ is replaced by the sample mean, *x̅*, and the denominator is *n - 1* instead of N. -The sample variance is denoted by s^2.
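
A minimal sketch of the sample variance, with the sample mean in place of μ and n − 1 in the denominator. The function name and the eight sample values are assumptions for the example; the result is compared with `statistics.variance` from Python's standard library, which also divides by n − 1.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance s^2: squared deviations from x_bar, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


sample = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(sample)       # 32 / 7
s = math.sqrt(s2)                  # sample standard deviation

print(s2, statistics.variance(sample), s)
```

Taking the square root of the unrounded variance follows the caution about not rounding the variance before computing the standard deviation.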

Notation for the Mean

-When we wish to write down a list of n numbers without specifying what the numbers are, we often write x1,x2,...,xn. -To indicate that we are adding these numbers, we write *Σx* which represents the sum of these numbers

Procedure for Computing Percentiles

Step 1: Arrange the data in increasing order. Step 2: Let *n* be the number of values in the data set. For the pth percentile, compute the index *L = (p/100) x n*. Step 3: If *L is a whole number*, then the *pth percentile is the average of the number in position L and the number in position L + 1*. If L is *not a whole number*, round it up to the *next higher whole number*. The *pth percentile is the number in the position corresponding to the rounded-up value.*
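
The percentile procedure can be sketched as follows, assuming p is an integer from 1 to 99. The function name `percentile` and the data are assumptions for the example. Integer arithmetic is used to test whether L is a whole number, which avoids floating-point rounding in (p/100) x n.

```python
def percentile(data, p):
    """pth percentile using the index L = (p/100) x n."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(data)                  # Step 1
    n = len(ordered)                        # Step 2
    if n == 0:
        raise ValueError("data set is empty")
    if (p * n) % 100 == 0:
        # Step 3, L whole: average the values in positions L and L + 1.
        L = p * n // 100
        return (ordered[L - 1] + ordered[L]) / 2
    # Step 3, L not whole: round up and take the value in that position.
    L = p * n // 100 + 1
    return ordered[L - 1]


data = [12, 5, 8, 20, 15, 3, 10, 18, 25, 21]
print(percentile(data, 30))   # L = 3: average of positions 3 and 4
print(percentile(data, 75))   # L = 7.5: rounded up to position 8
```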

