Statistics Class 4

median

the number that splits a data set in half, so that half of the data values are less than it and half are greater than it; 1. arrange the data values in increasing order 2. determine the number of data values, n 3. if n is odd, the median is the middle number; if n is even, the median is the average of the middle two numbers
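
The three steps above can be sketched in Python. This is a minimal illustration, not a library routine; the function name `median` and the sample lists are chosen here for the example.

```python
def median(values):
    """Median via the sort-and-pick-the-middle procedure."""
    ordered = sorted(values)          # step 1: increasing order
    n = len(ordered)                  # step 2: number of data values
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                    # odd n: the middle number
        return ordered[mid]
    # even n: average of the middle two numbers
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))      # 5
print(median([4, 1, 3, 2]))   # 2.5
```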

percentiles

provide a way to compute measures of position other than the center, to get a more detailed description of the distribution; divide a data set into hundredths; for a number p between 1 and 99, the pth percentile separates the lowest p% of the data from the highest (100−p)%

population variance (σ²)

the average of the squared deviations; let x₁,x₂,...,xN denote the values in a population of size N; let µ denote the population mean; the population variance, denoted by σ², is σ²=∑(x−µ)²/N; NEVER negative

mean notation

- a list of n numbers is denoted x₁,x₂,...,xn - ∑x represents the sum of these numbers: ∑x=x₁+x₂+...+xn

IQR method for finding outliers

1. find the first quartile, Q₁, and the third quartile, Q₃, of the data set 2. compute the interquartile range (IQR): IQR=Q₃−Q₁ 3. compute the outlier boundaries; these boundaries are the cutoff points for determining outliers: lower outlier boundary=Q₁−1.5×IQR; upper outlier boundary=Q₃+1.5×IQR 4. any data value that is less than the lower outlier boundary or greater than the upper outlier boundary is an outlier
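
A rough Python sketch of the IQR method follows. It assumes the quartiles are found with this set's percentile procedure (L=(p/100)×n, averaging positions L and L+1 when L is whole, rounding up otherwise); other textbooks and software use different quartile conventions and may give slightly different boundaries. The names `percentile_value` and `iqr_outliers` and the data are illustrative only.

```python
def percentile_value(ordered, p):
    """pth percentile of an already sorted list (p is a whole number, 1..99)."""
    scaled = p * len(ordered)                 # L = scaled / 100
    whole, remainder = divmod(scaled, 100)
    if remainder == 0:                        # L is a whole number
        # average of the values in positions L and L+1 (1-based)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # otherwise round L up: position whole+1, which is index whole
    return ordered[whole]


def iqr_outliers(values):
    ordered = sorted(values)
    q1 = percentile_value(ordered, 25)        # step 1
    q3 = percentile_value(ordered, 75)
    iqr = q3 - q1                             # step 2
    lower = q1 - 1.5 * iqr                    # step 3
    upper = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower or x > upper]  # step 4
    return lower, upper, outliers


print(iqr_outliers([1, 3, 4, 5, 6, 7, 8, 30]))  # (-2.5, 13.5, [30])
```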

procedure for computing the data value corresponding to a given percentile

1. arrange the data in increasing order 2. let n be the number of values in the data set; for the pth percentile, compute the value: L=(p/100)×n 3. if L is a whole number, then the pth percentile is the average of the number in position L and the number in position L+1; if L is not a whole number, always round it UP to the next higher whole number; the pth percentile is the number in the position corresponding to the rounded-up value
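
As a small sketch of this procedure in Python, assuming p is a whole number between 1 and 99 and positions are counted from 1 (so position L is list index L−1). The function name and the data are made up for the example.

```python
def percentile_value(values, p):
    ordered = sorted(values)                  # step 1: increasing order
    n = len(ordered)
    scaled = p * n                            # step 2: L = scaled / 100
    whole, remainder = divmod(scaled, 100)    # integer arithmetic avoids float error
    if remainder == 0:                        # step 3: L is a whole number
        # average of the values in positions L and L+1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up to position whole+1, which is index whole
    return ordered[whole]


data = [2, 4, 5, 7, 8, 10, 12, 15]
print(percentile_value(data, 25))  # 4.5  (L = 2, average of positions 2 and 3)
print(percentile_value(data, 60))  # 8    (L = 4.8, rounded up to position 5)
```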

boxplot

a graph that presents the five-number summary along with some additional information about a data set; there are several kinds, including a modified boxplot

mean (arithmetic mean)

the average; round it to one more decimal place than the data; does not necessarily represent a "typical" data value; more influenced by extreme data values than the median, because its computation uses every value in the data set

sample standard deviation (s)

s is the square root of the sample variance s²; s=√s²

coefficient of variation (CV)

tells how large the standard deviation is relative to the mean; can be used to compare the spreads of data sets whose values have different units; found by dividing the standard deviation by the mean: CV=σ/µ
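
A minimal sketch of the CV computation. The two summaries below (heights in cm, weights in kg) are hypothetical numbers, used only to show that the CV lets spreads in different units be compared.

```python
def coefficient_of_variation(std_dev, mean):
    if mean == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return std_dev / mean


# Hypothetical summaries: heights in cm, weights in kg.
height_cv = coefficient_of_variation(8.0, 160.0)
weight_cv = coefficient_of_variation(12.0, 60.0)
print(height_cv)              # 0.05
print(weight_cv)              # 0.2
print(weight_cv > height_cv)  # True: weights are more spread out relative to their mean
```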

z-scores and the Empirical Rule

when a population has a histogram that is approximately bell-shaped, then - approximately 68% of the data will have z-scores between -1 and 1 - approximately 95% of the data will have z-scores between -2 and 2 - all, or almost all, of the data will have z-scores between -3 and 3

population standard deviation (σ)

σ is the square root of the population variance σ²; σ=√σ²

determining skewness from a boxplot

- if the median is closer to the first quartile than to the third quartile, or the upper whisker is longer than the lower whisker, the data are skewed to the right - if the median is closer to the third quartile than to the first quartile, or the lower whisker is longer than the upper whisker, the data are skewed to the left - if the median is approximately halfway between the first and third quartiles, and the two whiskers are approximately equal in length, the data are approximately symmetric

data set is approximately symmetric

mean is approximately equal to median

data set is skewed to the right

mean is noticeably greater than median

data set is skewed to the left

mean is noticeably less than median

deviation

the difference between a population value (x) and the population mean (µ) (x−µ); values less than the mean will have negative deviations, and values greater than the mean will have positive deviations; so we square the deviations to make them all positive; data sets with a lot of spread will have many large squared deviations, while those with less spread will have smaller squared deviations

range

the difference between the largest value and the smallest value in a data set (largest value - smallest value)

degrees of freedom and sample size

the number of degrees of freedom for the sample variance is one less than the sample size

standard deviation

the square root of the variance; taking the square root puts the measure of spread back in the same units as the data, since the units of the variance are the squared units of the data; is NOT resistant: it is affected by extreme data values

mode

the value that appears most frequently in a data set; if two or more values are tied for the most frequent, they are all considered to be modes; if no value appears more than once, we say that the data set has no mode

z-score (z)

the z-score of an individual data value tells how many standard deviations that value is from its population mean; so, e.g., a value one standard deviation above the mean has a z-score of 1; a value two standard deviations below the mean has a z-score of -2; z=(x-µ)/σ; less useful for populations that are not bell-shaped; ALSO, x=µ+z(σ)
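
The two formulas, z=(x−µ)/σ and x=µ+zσ, can be sketched as a pair of Python functions. The population values µ=100 and σ=15 are assumed for illustration.

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma


def value_from_z(z, mu, sigma):
    """Recover the data value from its z-score."""
    return mu + z * sigma


print(z_score(130, 100, 15))       # 2.0
print(value_from_z(-2, 100, 15))   # 70
```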

quartiles

three percentiles (25th, 50th, 75th) that are used more often than the others; divide the data into four parts, each of which contains approximately one quarter of the data; every data set has three quartiles

Chebyshev's Inequality

very rough approximation; in any data set, the proportion of the data that will be within K standard deviations of the mean is at least 1−1/K² (for K>1); specifically, by setting K=2 or K=3, we obtain the following results: - at least 3/4 (75%) of the data will be within two standard deviations of the mean (between xbar−2s and xbar+2s) - at least 8/9 (88.9%) of the data will be within three standard deviations of the mean (between xbar−3s and xbar+3s)
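
A short sketch that computes the Chebyshev bound and compares it with the observed proportion in a deliberately skewed data set. The data are invented for the example; `statistics.mean` and `statistics.stdev` are from Python's standard library.

```python
import statistics


def chebyshev_lower_bound(k):
    if k <= 1:
        raise ValueError("the bound is only informative for K > 1")
    return 1 - 1 / k**2


def proportion_within(values, k):
    """Observed proportion of values within k sample standard deviations of the mean."""
    xbar = statistics.mean(values)
    s = statistics.stdev(values)
    inside = [x for x in values if xbar - k * s <= x <= xbar + k * s]
    return len(inside) / len(values)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # one extreme value
print(chebyshev_lower_bound(2))            # 0.75
print(round(chebyshev_lower_bound(3), 3))  # 0.889
print(proportion_within(data, 2) >= chebyshev_lower_bound(2))  # True
```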

degrees of freedom

the name sometimes given to the quantity n−1 used in computing the sample variance and sample standard deviation; it's called this because the deviations always sum to 0; thus, if we know the first n−1 deviations, we can compute the nth one

procedure for computing the percentile corresponding to a given data value

1. arrange the data in increasing order 2. let x be the data value whose percentile is to be computed; use the following formula to compute the percentile: percentile=100×((number of values less than x) + 0.5)/number of values in the data set 3. round the result to the nearest whole number
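
A minimal sketch of this formula in Python. One assumption is labeled in the code: exact halves are rounded up, since the card only says "nearest whole number" (Python's built-in `round` would instead round halves to the nearest even number). The data are illustrative.

```python
import math


def percentile_of_value(values, x):
    ordered = sorted(values)                        # step 1 (counting does not strictly need it)
    below = sum(1 for v in ordered if v < x)        # number of values less than x
    raw = 100 * (below + 0.5) / len(ordered)        # step 2
    return math.floor(raw + 0.5)                    # step 3; assumption: halves round up


data = [2, 4, 5, 7, 8, 10, 12, 15]
print(percentile_of_value(data, 10))  # 69  (100 * 5.5 / 8 = 68.75)
```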

procedure for constructing a (modified) boxplot

1. compute the first quartile, the median, and the third quartile 2. draw vertical lines at the first quartile, the median, and the third quartile; draw horizontal lines between the first and third quartiles to complete the box 3. compute the lower and upper outlier boundaries 4. find the largest data value that is less than the upper outlier boundary; draw a horizontal line (whisker) from the third quartile to this value 5. find the smallest data value that is greater than the lower outlier boundary; draw a horizontal line (whisker) from the first quartile to this value 6. determine which values, if any, are outliers; plot each outlier separately with an x

procedure for computing the population variance

1. compute the population mean µ 2. for each population value x, compute the deviation x−µ 3. square the deviations, to obtain quantities (x−µ)² 4. sum the squared deviations obtaining ∑(x−µ)² 5. divide the sum obtained in step 4 by the population size N to obtain the population variance σ²
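
The five steps can be followed line by line in Python. This is a sketch with an invented eight-value population, chosen so the arithmetic is easy to check by hand.

```python
def population_variance(population):
    N = len(population)
    mu = sum(population) / N                      # step 1: population mean
    deviations = [x - mu for x in population]     # step 2: deviations x - mu
    squared = [d ** 2 for d in deviations]        # step 3: squared deviations
    total = sum(squared)                          # step 4: sum of squared deviations
    return total / N                              # step 5: divide by N


print(population_variance([2, 4, 4, 4, 5, 5, 7, 9]))  # 4.0
```

Here µ=5 and the squared deviations sum to 32, so σ²=32/8=4 and σ=2.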

procedure for computing the sample variance

1. compute the sample mean xbar 2. for each sample value x, compute the difference x−xbar; this quantity is called a deviation 3. square the deviations, to obtain quantities (x−xbar)² 4. sum the squared deviations, obtaining ∑(x−xbar)² 5. divide the sum obtained in step 4 by n−1 to obtain the sample variance s²
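
A matching sketch for the sample variance. The only difference from the population version is the divisor n−1 (the degrees of freedom). The sample is made up for illustration.

```python
def sample_variance(sample):
    n = len(sample)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    xbar = sum(sample) / n                        # step 1: sample mean
    deviations = [x - xbar for x in sample]       # step 2: deviations x - xbar
    squared = [d ** 2 for d in deviations]        # step 3: squared deviations
    total = sum(squared)                          # step 4: sum of squared deviations
    return total / (n - 1)                        # step 5: divide by n - 1


print(sample_variance([1, 2, 3, 4, 5]))  # 2.5
```

For comparison, Python's standard library offers `statistics.variance` (sample) and `statistics.pvariance` (population).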

resistant

a statistic is resistant if its value is not affected much by extreme values (large or small) in the data set (e.g., median is resistant, mean is not)

variance

a measure of how far the values in a data set are from the mean, on average

interquartile range (IQR)

a measure of spread that is often used to detect outliers; IQR=Q₃-Q₁

outlier

a value that is considerably larger or considerably smaller than most of the values in a data set; some result from errors; some are correct, and simply reflect the fact that the population contains some extreme values; DO NOT delete an outlier unless it is certain that it is an error

sample variance (s²)

the variance when the data values come from a sample rather than a population; let x₁,...,xn denote the values in a sample of size n; s²=∑(x−xbar)²/(n−1); or, equivalently, s²=(∑x²−n·xbar²)/(n−1)

five-number summary

consists of the following quantities of a data set: minimum, first quartile, median/second quartile, third quartile, maximum

first quartile

denoted Q₁, is the 25th percentile; Q₁ separates the lowest 25% of the data from the highest 75%; computation: L=25/100×n

second quartile

denoted Q₂, is the 50th percentile; Q₂ separates the lower 50% of the data from the upper 50%; same as the median

third quartile

denoted Q₃, is the 75th percentile; Q₃ separates the lowest 75% of the data from the highest 25%; computation: L=75/100×n

approximating the mean with grouped data

for rough estimates and continuous, NOT discrete, variables: 1. compute the midpoint of each class (found by taking the average of the lower class limit and the lower class limit of the next larger class) 2. for each class, multiply the class midpoint by the class frequency 3. add the products midpoint×frequency over all classes 4. divide the sum obtained in step 3 by the sum of the frequencies
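
A sketch of the grouped-mean approximation. It assumes each class is supplied as a hypothetical tuple (lower class limit, lower class limit of the next class, frequency); that representation and the frequency table are invented for the example.

```python
def grouped_mean(classes):
    """classes: list of (lower_limit, next_lower_limit, frequency) tuples."""
    total = 0
    n = 0
    for lower, next_lower, freq in classes:
        midpoint = (lower + next_lower) / 2   # step 1: class midpoint
        total += midpoint * freq              # steps 2-3: sum of midpoint x frequency
        n += freq
    return total / n                          # step 4: divide by the sum of frequencies


# Classes 0-10, 10-20, 20-30 with frequencies 2, 5, 3.
classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(grouped_mean(classes))  # 16.0
```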

sample mean (xbar)

if x₁,x₂,...,xn is a sample, the sample mean is: xbar=(x₁+x₂+...+xn)/n=∑x/n

population mean (µ)

if x₁,x₂,...,xN is a population, the population mean is: µ=(x₁+x₂+...+xN)/N=∑x/N

measures of spread

numerical summaries of data that describe how spread out the data values are

measures of center

numerical summaries of data that describe the center of the data (e.g., mean, median, mode); the mean may not be a value in the data set

measures of position

numerical summaries of data that specify the proportion of the data that is less than a given value

procedure for approximating the standard deviation with grouped data

rough estimate for CONTINUOUS variables: 1. compute the midpoint of each class; then compute the mean 2. for each class, subtract the mean from the class midpoint to obtain midpoint-mean 3. for each class, square the difference obtained in step 2 to obtain (midpoint-mean)², and multiply by the frequency to obtain (midpoint-mean)²×frequency 4. add the products of (midpoint-mean)²×frequency over all classes 5. compute the sum of the frequencies n; to compute the population variance, divide the sum obtained in step 4 by n; to compute the sample variance, divide the sum obtained in step 4 by n-1 6. take the square root of the variance obtained in step 5; the result is the standard deviation
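
The six steps can be sketched as follows, again assuming each class is given as a hypothetical (lower class limit, next lower class limit, frequency) tuple. The `sample` flag, a name chosen here, switches between the n−1 and n divisors from step 5.

```python
import math


def grouped_std_dev(classes, sample=True):
    """classes: list of (lower_limit, next_lower_limit, frequency) tuples."""
    midpoints = [(lo + nxt) / 2 for lo, nxt, _ in classes]   # step 1: midpoints
    freqs = [f for _, _, f in classes]
    n = sum(freqs)                                           # step 5: sum of frequencies
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n  # step 1: grouped mean
    # steps 2-4: sum of (midpoint - mean)^2 x frequency over all classes
    total = sum((m - mean) ** 2 * f for m, f in zip(midpoints, freqs))
    variance = total / (n - 1) if sample else total / n      # step 5
    return math.sqrt(variance)                               # step 6


classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(grouped_std_dev(classes, sample=False))           # 7.0
print(round(grouped_std_dev(classes, sample=True), 2))  # 7.38
```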

Empirical Rule

when a population has a histogram that is approximately bell-shaped, then: - approximately 68% of the data will be within one standard deviation of the mean; in other words, approx 68% of data will be between µ−σ and µ+σ - approximately 95% of the data will be within two standard deviations of the mean; in other words, approx 95% of data will be between µ−2σ and µ+2σ - all, or almost all, of the data will be within three standard deviations of the mean; in other words, all/almost all of data will be between µ−3σ and µ+3σ
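
As a small sketch, the three intervals in the rule can be computed directly from µ and σ. The values µ=100 and σ=15 are assumed for illustration, and the rule itself only applies when the histogram is approximately bell-shaped.

```python
def empirical_rule_intervals(mu, sigma):
    """Intervals expected to hold about 68%, 95%, and almost all of the data."""
    return {k: (mu - k * sigma, mu + k * sigma) for k in (1, 2, 3)}


print(empirical_rule_intervals(100, 15))
# {1: (85, 115), 2: (70, 130), 3: (55, 145)}
```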

