stats chapter 3

spread

(aka dispersion) of the distribution describes the variability in the data, which is the degree to which the data values are either clustered together or spread out; numerical measures of spread include the range, variance, standard deviation, and interquartile range

general z-score formula

(observation-mean)/standard deviation
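
The formula above can be sketched as a small function. The name `z_score` and the exam numbers (mean 70, standard deviation 10) are made up for illustration.

```python
def z_score(observation, mean, std_dev):
    """Return how many standard deviations the observation is from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (observation - mean) / std_dev


# Hypothetical exam scores with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))  # 1.5  (above the mean)
print(z_score(70, 70, 10))  # 0.0  (equal to the mean)
print(z_score(60, 70, 10))  # -1.0 (below the mean)
```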

for bell-shaped data, approx 68% of observations will have z-score between

-1&1

for bell-shaped data, approx 95% of observations will have z-score between

-2&2

for bell-shaped data, approx 99.7% (almost all) of observations will have z-score between

-3&3 (quite unusual to have a z-score larger than 3 or smaller than -3; observations with z-scores greater than 3 in absolute value are considered outliers)

an observation that is unusually large or small relative to the other values in a data set is called an outlier. outliers are typically attributable to one of the following causes:

- the measurement is observed, recorded, or entered into the computer incorrectly
- the measurement comes from a different population
- the measurement is correct but represents a rare (chance) event

for bell-shaped distributions: what percentage will fall within 1 standard deviation of the mean?

68%

for bell-shaped distributions: what percentage will fall within 2 standard deviations of the mean?

95%

for bell-shaped distributions: what percentage will fall within 3 standard deviations of the mean?

99.7% (essentially all)
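
As a rough check of the 68%, 95%, and 99.7% figures, the sketch below computes the share of an exactly normal distribution that lies within k standard deviations of the mean. Real bell-shaped data only approximate this, and the helper name `share_within` is an assumption for illustration.

```python
import math


def share_within(k):
    """Share of an exactly normal distribution with z-score between -k and k."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```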

boxplot

a graph of the 5 number summary. to make one:
- draw a number line long enough to include the max and min values
- above the line, draw a box from the lower quartile Q1 to the upper quartile Q3
- draw a line inside the box at the median
- draw a line from the lower end of the box to the smallest observation that is not an outlier, and a second line from the upper end of the box to the largest observation that is not an outlier; these lines are called whiskers
- mark any data value less than the lower fence or greater than the upper fence with an asterisk; these values are considered outliers

resistant

a numerical summary of data is said to be resistant if extreme observations (very large or small compared to the rest of the data) have little, if any, influence on its value

variance

an average of the squares of the deviations of the observations from their mean
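
A minimal sketch of that calculation, with made-up data. One detail the card glosses over: the sample variance (s^2) divides the sum of squared deviations by n-1, while the population variance (sigma^2) divides by n. The function name and the `sample` flag are assumptions for illustration.

```python
def variance(data, sample=True):
    """Average squared deviation from the mean (n-1 for a sample, n for a population)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor


data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations, mean is 5
print(variance(data, sample=False))      # 4.0 (sum of squares 32 divided by 8)
print(round(variance(data), 3))          # 4.571 (sum of squares 32 divided by 7)
```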

finding quartiles:

arrange the data in order and find the median M (the second quartile); the median of the observations below M is Q1, and the median of the observations above M is Q3
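
The steps above can be sketched as follows. This assumes the convention that M itself is left out of both halves when n is odd; other textbooks and software use slightly different rules, so results may differ. The function name `quartiles` and the data are made up.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-each-half method."""
    values = sorted(data)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = values[:half]
    # When n is odd, skip the middle observation (M) itself.
    upper = values[half + 1:] if n % 2 else values[half:]
    return median(lower), median(values), median(upper)


print(quartiles([1, 3, 5, 7, 9, 11, 13]))  # (3, 7, 11)
print(quartiles([2, 4, 6, 8]))             # (3.0, 5.0, 7.0)
```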

z-score is zero if...

data value equals the mean

measures of position

describe the relative position of a particular data value within the entire set of data. two examples are z-scores and percentiles

shape

described by mentioning any symmetry or skewness, the number of peaks, any clusters or gaps, and any unusually high or low observations (called outliers)

center

describes a "typical" or "representative" data value; numerical measures of center include the mean and median

to describe a distribution:

discuss its shape, provide an appropriate measure of center, and provide an appropriate measure of spread

interquartile range

the distance between the first and third quartiles: IQR = Q3-Q1. it's a measure of spread; the more spread out the data are, the larger the IQR tends to be. it represents the range of the middle 50% of observations and is not affected by outliers, so it is a RESISTANT measure of spread

most common percentiles are quartiles, which

divide the data into 4 equal parts

Q1 Q2 Q3

first quartile (Q1) is the 25th percentile; second quartile (Q2) is the median (M), which is the 50th percentile; third quartile (Q3) is the 75th percentile

pth percentile

for any set of n observations arranged in order, the pth percentile is a number such that p% of the observations fall below the pth percentile and (100-p)% fall above it (think SAT scores)
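
One way to see the idea is to go the other direction: given a value, count what percent of the observations fall below it. This is only a sketch; the name `percent_below` and the scores are made up, and software packages use several different percentile rules.

```python
def percent_below(data, value):
    """Percent of observations that are strictly less than the given value."""
    if not data:
        raise ValueError("data must not be empty")
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical test scores
print(percent_below(scores, 90))  # 70.0, so a score of 90 sits at about the 70th percentile
```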

outlier

has a value that is significantly larger or smaller than all of the other observations

when will positive z-score occur?

if a data value is larger than the mean

when will negative z-score occur?

if a data value is smaller than the mean

difference in median regarding number of observations:

if n is odd, the median M is the middle observation in the ordered sample; if n is even, the median M is the average of the two middle observations in the data set
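
The odd/even rule can be sketched directly. The function name and data are made up for illustration.

```python
def median(values):
    """Middle observation if n is odd, average of the two middle ones if n is even."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 2]))     # 2   (n is odd: the middle observation)
print(median([4, 1, 3, 2]))  # 2.5 (n is even: average of 2 and 3)
```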

why is a measure of center alone not enough to describe a distribution well?

it doesn't indicate the degree to which the data are spread out (ie the amount of dispersion)

for approx symmetric distributions, report what?

mean and standard deviation

resistance of mean

mean is NOT resistant (its value changes noticeably when an outlier is removed)

generally, if the shape of the distribution is approximately symmetric, then ...

mean is about equal to median

generally, if the shape of the distribution is skewed right, then...

mean is larger than median

generally, if the shape of the distribution is skewed left, then...

mean is smaller than median

if there is an outlier, take the ...

median

for distributions that are skewed or contain outliers, report what?

median and interquartile range

resistance of median

median is a resistant measure of center! (median is just about the same without the outlier)
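
A quick illustration with made-up numbers, using Python's standard `statistics` module: adding one extreme value moves the mean a lot but barely moves the median.

```python
from statistics import mean, median

data = [10, 12, 13, 15, 16]    # hypothetical observations
with_outlier = data + [100]    # same data plus one extreme value

print(mean(data), median(data))                            # 13.2 13
print(round(mean(with_outlier), 2), median(with_outlier))  # 27.67 14.0
```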

median (M)

middle observation when the observations are listed in order

what does 5 number summary include?

minimum, Q1, M, Q3, and maximum

what should box plots be used for?

moderate to large data sets (it takes at least 5 numbers to make one anyway); also only for unimodal distributions! they hide bimodality (multiple peaks)

mode

observation that occurs most frequently; describes a typical observation in terms of the most common outcome, but the mode need not be near the center of a distribution; most often used to describe the category of a categorical variable that has the highest frequency

sigma (σ)

population standard deviation

sigma^2

population variance

when asked to compare two distributions, you need to discuss the similarities and/or differences in their shapes, centers, and spreads;

provide measures of center and spread for each distribution, and discuss which one is larger/smaller. when comparing two distributions, you should always use the same measures of center and spread for both distributions, otherwise the comparison isn't valid. if a group is strongly skewed or has outliers, it is usually best to compare the medians and interquartile ranges for all groups; otherwise compare means and standard deviations

s

sample standard deviation

s^2

sample variance

fences

serve as cutoff points for determining outliers; if a data value is less than the lower fence or greater than the upper fence, it's considered an outlier; lower fence = Q1-1.5(IQR); upper fence = Q3+1.5(IQR)
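
A minimal sketch of the fence rule, assuming Q1 and Q3 have already been found. The function names and numbers are made up for illustration.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Data values below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Hypothetical quartiles Q1 = 10 and Q3 = 20, so IQR = 10.
print(fences(10, 20))                               # (-5.0, 35.0)
print(find_outliers([12, 15, 40, -8, 20], 10, 20))  # [40, -8]
```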

when you describe the distribution of a quantitative variable, you need to mention 3 characteristics:

shape, center, spread

range

the difference between the largest and smallest observations; simplest and easiest measure of spread to compute, but it's SEVERELY affected by outliers (R = max-min)

z-score for an observation represents...

the distance that a data value is from the mean in terms of the number of standard deviations

if the distribution is close to symmetric or only mildly skewed, what measure of center should be used?

the mean is usually preferred because it uses the numerical value of all the observations

if a distribution is highly skewed, what measure of center should be used?

the median is usually preferred over the mean because it better represents what is typical

for distributions on box plots that are approximately symmetric,

the median will be near the center of the box and the left and right whiskers will be roughly the same length

for distributions on box plots that are skewed right,

the median will be slightly left of the center of the box and the right whisker will be longer than the left whisker (or there may be high outliers)

for distributions on box plots that are skewed left,

the median will be slightly right of the center of the box and the left whisker will be longer than the right whisker (or there may be low outliers)

standard deviation

the most popular summary of spread, which represents a typical value for how far the data fall from their mean; uses all data values in its computations

the most common measures of spread for a quantitative variable are ...

the range, the interquartile range, the variance, and the standard deviation

as well as the center, we are also interested in..

the spread (aka variability, consistency, dispersion)

mean (average)

the sum of the observations divided by the number of observations (x bar means sample mean and mu means population mean)

when the data are either skewed left/right...

there are extreme values in the tail, which tend to pull the mean in the direction of the tail

sometimes, the median may be a better measure of central tendency than the mean

true

properties of the standard deviation, s:

value will always be greater than or equal to zero; the greater the spread of the data, the larger the value of s; when all the observations are the same value, s=0 (there is no spread in the data); strong skewness or a few outliers can greatly increase s, so s is NOT resistant; s has the same units of measurement as the original observations, but the variance s^2 is in squared units (e.g. chips^2, which is hard to interpret)
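
Two of these properties can be checked quickly with made-up data, using `stdev` (the sample standard deviation) from Python's standard `statistics` module.

```python
from statistics import stdev

same = [5, 5, 5, 5]            # every observation identical
data = [10, 12, 13, 15, 16]    # hypothetical observations
with_outlier = data + [100]    # same data plus one extreme value

print(stdev(same))                        # 0.0 (no spread at all)
print(stdev(data) < stdev(with_outlier))  # True (one outlier inflates s)
```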

the shape of a distribution influences...

whether the mean is larger or smaller than the median

population z-score:

z=(x-mu)/sigma

sample z-score:

z=(x-xbar)/s

