stat ch 3
how to make a 5 numb summary in statcrunch
1. stat 2. summary stat 3. GRAPH - boxplot 4. draw boxes horizontally
how do we find all the measures of variation in stat crunch??
1. stat 2. summary stat - columns 3. we can click on what we wanna see, we can look for the mean, median , mode, and coefficient of variation etc.
to find the mean, median, and mode on statcrunch
1. stats 2. summary stats and columns 3. click mean, median, mode, min and max (note: RANGE IS NOT MIDRANGE. WE HAVE TO DO THE MIDRANGE BY HAND: click MIN and MAX for ur column, then ADD those 2 and divide by 2. subtracting them gives the RANGE, not the midrange)
if we had our approx mean and we were given a real mean, what is the formula we use to check and see if it's <5%?
1. we subtract to find the difference between the real mean and the approx mean, ie real mean - approx mean 2. the formula is: (difference / actual mean) x 100%
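A quick sketch of that check in Python. The function name and the two means are made up just to show the arithmetic, and taking the absolute value of the difference is an assumption so the percent never comes out negative.

```python
def percent_error(actual_mean, approx_mean):
    """Percent difference between the approximate mean and the actual mean."""
    # Assumption: use the absolute difference so the sign does not matter.
    difference = abs(actual_mean - approx_mean)
    return difference / actual_mean * 100


# Hypothetical numbers, only for illustration.
error = percent_error(actual_mean=65.0, approx_mean=64.0)
print(round(error, 2))   # the percent difference
print(error < 5)         # True means the approximation counts as "good"
```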
IQ scores have a bell-shaped distribution with a mean of 100 and a standard deviation of 15. What percentage of IQ scores are between 70 and 130?
95% cuz thats 2 standard devs out
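The same reasoning as a small Python sketch: build the interval that sits k standard deviations on each side of the mean and look up the matching empirical-rule percent. The helper name and the lookup table are just illustrative, and this only applies to bell-shaped data.

```python
# Empirical rule percents for 1, 2 and 3 standard deviations (bell-shaped data only).
EMPIRICAL_PERCENT = {1: 68, 2: 95, 3: 99.7}


def empirical_interval(mean, sd, k):
    """Interval reaching k standard deviations below and above the mean."""
    return (mean - k * sd, mean + k * sd)


# IQ scores: mean 100, standard deviation 15, going out 2 standard deviations.
low, high = empirical_interval(100, 15, 2)
print(low, high)              # 70 130
print(EMPIRICAL_PERCENT[2])   # 95
```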
What is a boxplot?
A boxplot (or box-and-whisker-diagram) is a graph of a data set that consists of a LINE extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, Q1, the median, and the third quartile, Q3. THE LINE IS MIN TO MAX THE BOX IS Q1-Q3
a boxplot is a graphical representation of what
A boxplot is a graphical representation of the 3 quartiles ( 25, 50, and 75%) and the minimum and maximum values, referred to as the "5 number summary".
what does it mean if a statistic is resistant?
A statistic is resistant if the presence of extreme values (outliers) does not cause it to change very much.
what is the RANGE rule of thumb (not dealing w signif numbers) , and what is the equation?
To roughly ESTIMATE the standard deviation from a collection of known sample data, use the range rule of thumb: take the range (the range is the MAX value - the MIN value) and divide it by 4. this gives u the approximate STANDARD DEVIATION, ie s ≈ range / 4
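A minimal sketch of the rule of thumb, using the seven exam scores that show up in the percentile example in these notes. The function name is made up, and the result is only a rough estimate of the standard deviation.

```python
def estimate_sd_from_range(values):
    """Range rule of thumb: standard deviation is roughly (max - min) / 4."""
    data_range = max(values) - min(values)
    return data_range / 4


exam_scores = [92, 50, 83, 72, 74, 81, 46]
print(estimate_sd_from_range(exam_scores))   # (92 - 46) / 4 = 11.5
```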
the mode can describe categorical data, ex)
Tom, Jill and Mary are running for class president. Everyone in the class writes down one name on a piece of paper for their vote. The data would like tom, tom, jill, jill, jill, mary, tom, etc.....
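Here is a small Python sketch of finding the mode of categorical data, using only the seven votes actually listed above (whatever comes after the "etc" is unknown, so it is left out). It uses `statistics.multimode` from the standard library, which returns every value tied for the greatest frequency.

```python
from statistics import multimode

# Only the votes written out in the example; the rest of the class is unknown.
votes = ["tom", "tom", "jill", "jill", "jill", "mary", "tom"]

# multimode lists every value that shares the highest count,
# in the order each one first appears in the data.
modes = multimode(votes)
print(modes)
```

With just these seven votes tom and jill tie, so this little slice of the data is bimodal. If nothing repeats at all, `multimode` hands back every value, which is the "no mode" case.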
what is N
represents the number of data values in a POPULATION
if we want to compare the variation of 2 samples and their means are v close, we just look at the std deviation: the LARGER the std dev, the more variation it has! but what do we do if the two sample means are NOT v close? what do we calculate then?
if the two sample means are not the same or very close, we have to calculate the coefficient of variation for each sample, and compare them.
IS THE MEAN SENSITIVE? IS IT RESISTANT?
the mean is V SENSITIVE to every data value. One extreme value can affect it dramatically. The mean is NOT RESISTANT
population standard deviation
ie the square root of sigma squared
calculate the following z score example
if the data value = 60, the mean (xbar) = 70, and the standard deviation (s) = 5, then z = (60 - 70) / 5 = -2, so it's 2 standard deviations BELOW THE MEAN
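The same calculation as a tiny Python sketch (the function name is just for illustration). The parentheses matter: subtract the mean first, then divide by the standard deviation.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value sits above (+) or below (-) the mean."""
    return (value - mean) / sd


z = z_score(60, 70, 5)
print(z)   # -2.0, so 2 standard deviations below the mean
```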
what are the 3 types of modes?
Bimodal - two data values occur with the same greatest frequency Multimodal - more than two data values occur with the same greatest frequency No Mode - no data value is repeated
what is x?
is the variable usually used to represent the individual data values.
Sigma (ie the population standard deviation)
same idea as the sample formula but w/ capital N and the population mean: take each population value - the population mean, square all of these to make them positive, add them up, divide by the total number of values in ur population (N), and then take the SQUARE ROOT, ie σ = square root of [ Σ(x - μ)^2 / N ]
What is the range rule of thumb FOR identifying significant values?
it says any value lower than the mean - 2 std dev is SIGNIF LOW, any value bigger than the mean + 2 std dev is SIGNIF HIGH, and anything in between is not significant, ie those that lie within the interval are "usual values"
what is a modified boxplot
it's a boxplot but the outliers are not a part of the max and min lines; each outlier gets its own symbol plotted away from the rest
STATCRUNCH WILL CALCULATE ALL 3 QUARTILES UNDER "SUMMARY STATS!"
just click that u want them
what is relative standing a measure of? NOTE: THE MOST IMPORTANT CONCEPT IS THE Z SCORE
measures of relative standing, which are numbers showing the location of data values relative to the other values within a data set. They can be used to compare values from different data sets, or to compare values within the same data set. The most important concept is the z score.
How do we tell if our mean approximation is good?
our approximation is good if the difference between our approximation and the ACTUAL mean is LESS than 5%
finding the percentile of a data value FORMULA ** WE MUST KNOW HOW TO DO THIS BC STATCRUNCH WONT DO IT FOR US KNOW KNOW KNOW !
percentile = number of values less than x / total number of values multiplied by 100
ex) find the percentile of a score of 83 out of exam scores of 92, 50, 83, 72, 74, 81, 46
put in order, then count: there are 5 scores lower than 83 out of 7 scores, so 5/7 x 100 = 71.428, which rounds to 71, ie the 71st percentile
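Since StatCrunch won't do this one, here is a minimal Python sketch of the same by-hand formula, run on the exam scores from the example. The function name is made up.

```python
def percentile_of(value, data):
    """Percent of data values that are strictly less than the given value."""
    count_below = sum(1 for x in data if x < value)
    return count_below / len(data) * 100


exam_scores = [92, 50, 83, 72, 74, 81, 46]
p = percentile_of(83, exam_scores)
print(round(p))   # 5 of the 7 scores are below 83, so this rounds to 71
```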
what is n
represents the number of data values in a SAMPLE
What is the interquartile range (IQR)? whats the formula, helps us do what
the IQR is the spread of the middle 50% of the data. it is found w/ Q3 - Q1 = IQR. HELPS US IDENTIFY OUTLIERS!
what is the EMPIRICAL RULE (aka the 68-95-99.7 rule)? THINK PERCENTS
For ONLY data sets having a distribution that is approximately NORMAL, the following properties apply: - About 68% of all data values fall within 1 standard deviation of the mean. - About 95% of all values fall within 2 standard deviations of the mean. - About 99.7% of all values fall within 3 standard deviations of the mean. (here we are using the standard deviation as a measuring stick)
the four measures of center are
Mean Median Mode Midrange note! midrange is only included here to emphasize that there are different ways to determine the "center" of a set of data. Typically only mean, median, and mode are studied. AND KNOW: In a symmetric and bell-shaped distribution, the mean, median, and mode ARE THE SAME!!!
what do measures of variation describe?
Measures of Variation describe how "spread out" our data is. And, specifically, HOW FAR does our data as a whole differ or deviate FROM THE MEAN.
what are the usual values in the rule of thumb?
Minimum "usual" value (mean) - 2 × (standard deviation) Maximum "usual" value (mean) + 2 × (standard deviation) note, all we are doing is xbar (mean) +/- 2s
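A short Python sketch of xbar +/- 2s, using the exam numbers from the empirical rule example in these notes (mean about 65, standard deviation about 15). The function names and labels are just for illustration.

```python
def usual_range(mean, sd):
    """Minimum and maximum 'usual' values: mean - 2*sd and mean + 2*sd."""
    return (mean - 2 * sd, mean + 2 * sd)


def classify(value, mean, sd):
    """Label a value with the range rule of thumb."""
    low, high = usual_range(mean, sd)
    if value < low:
        return "significantly low"
    if value > high:
        return "significantly high"
    return "usual"


print(usual_range(65, 15))    # (35, 95)
print(classify(30, 65, 15))   # below 35
print(classify(70, 65, 15))   # inside the interval
```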
Whenever a value is less than the mean, its corresponding z score is negative
NEGATIVES MATTER WHEN FINDING Z SCORES YALL, and once again z scores > 2 or < -2 are unusual values
in a data set what do outliers do / what do they affect?
Outliers can have a dramatic effect on the mean and the standard deviation. THEY MESS UP THE HISTOGRAM AND MAKE IT LOOK SKEWED. Outliers can also have a dramatic effect on the scale of the histogram, so that the true nature of the distribution is totally obscured.
What are quartiles? what are the 3 quartiles : Q1, Q2 *NOTE, Q2 IS THE MEDIAN*, Q3
Quartiles are just specific percentiles. QUARTILE 1 (Q1) = the 25th percentile. Q2 = the 50th percentile, NOTE!!! Q2 IS THE MEDIAN. Q3 = the third quartile = the 75th percentile (75% of the values are below it and 25% are above it)
standard deviation notations: s = , s^2 = , σ (sigma) = , σ^2 =
s = sample standard deviation (s is a statistic!) s^2 = sample VARIANCE (it's the variance bc it's squared) σ = POPULATION standard deviation (sigma is a parameter) σ^2 = population variance (variance bc it's squared)
why do we divide by n-1 instead of n when we use a sample to estimate the variation of a large population?
when we are using a sample to make inferences about a larger population, dividing by n tends to UNDERESTIMATE the population variation, so we divide by n - 1 instead to bump the estimate up. For example, 100/20 = 5 but 100/19 ≈ 5.3. Dividing by 1 less gives us a larger result.
what must the sum of the deviations from the mean always be?
The sum of the deviations always equals zero.
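To see the n vs n - 1 difference with real numbers, here is a small sketch using Python's standard library: `statistics.pvariance` divides by the number of values and `statistics.variance` divides by one less. The four values are the chocolate chip counts used later in these notes.

```python
from statistics import pvariance, variance

chip_counts = [22, 22, 26, 24]

divide_by_n = pvariance(chip_counts)          # sum of squared deviations / 4
divide_by_n_minus_1 = variance(chip_counts)   # sum of squared deviations / 3

print(divide_by_n)
print(divide_by_n_minus_1)
print(divide_by_n_minus_1 > divide_by_n)      # dividing by 1 less gives a bigger result
```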
measures of center attempt to answer what questions about a value to describe a data set
The value exactly in the middle? The most frequent value? The Average? The mid-point between largest value and the smallest?
What is the Z score (aka standardized value) , what does it mean when its neg or pos?
the Z score is the # of standard deviations that a given value x is above or below the mean, ie the distance a data value is from the mean, measured in standard devs. if the z score is positive the value lies above the mean, and if it is negative it lies below the mean. the mean of a data set has a z score of 0
What is standard deviation? and if x represents a data value then its deviation from the mean is written (x - xbar), ie x - the mean
the average deviation, or difference, of all the data values from the mean.
sample variance
the m would be the xbar or the mean in this pic... to find s^2 we take each data value - the mean, square it, add them all up, and we DIVIDE by n - 1
μ "mu" is a parameter, what does it mean
the mean of a population. it's Σx/N, ie mu = adding all the data values of an entire population and dividing by capital N
is the median sensitive? is the median resistant?
the median IS RESISTANT. That is, it is not sensitive to outliers.
is the midrange resistant?
the midrange is not resistant. That is, it is very sensitive to data values that are far removed from the bulk of the data.
s=
the square root of the variance (s^2)
a z score formula for a SAMPLE and for a POPULATION
to find z of a sample u: take ur data value - the sample mean, divided by the sample standard deviation, ie z = (x - xbar) / s. for a population it's the same except u minus the mean of the population and divide by the standard dev of the population, ie z = (x - μ) / σ
Find the standard deviation of these numbers of chocolate chips: 22, 22, 26, 24 note; the mean is 23.5
we take each chip count, minus the mean, square them all separately, then add them all up and divide by the total number of cookies (4) minus 1, and take the SQUARE ROOT to get the standard dev: s = square root of [ (22-23.5)^2 + (22-23.5)^2 + (26-23.5)^2 + (24-23.5)^2 ] / (4-1) = square root of 11/3 ≈ 1.91
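The same steps spelled out in Python as a sketch, with a cross-check against `statistics.stdev` from the standard library (which also divides by n - 1). Variable names are just for illustration.

```python
from math import sqrt
from statistics import stdev

chip_counts = [22, 22, 26, 24]

mean = sum(chip_counts) / len(chip_counts)             # 23.5
deviations = [x - mean for x in chip_counts]           # each value minus the mean
squared = [d ** 2 for d in deviations]                 # square each deviation
sample_variance = sum(squared) / (len(chip_counts) - 1)  # divide by n - 1
s = sqrt(sample_variance)                              # square ROOT, not square

print(sum(deviations))   # the deviations add up to zero
print(round(s, 2))       # about 1.91
print(round(stdev(chip_counts), 2))
```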
x̄ is a statistic, what does it mean
(the pic without the i) ie the mean of the sample values xbar = the sum of all the sample data values (divided by). / how many there are
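how do we find the median?
1. WE MUST FIRST PUT ALL THE VALUES IN ORDER 2. if there's an odd number of values, the median is the one in the middle 3. if there are two middle values (even number of values), take the 2 middle #'s and add them, then divide by 2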
measures of relative standing ( aka measures of position) theres 4
1. Z-score (standardized scores) 2. percentiles 3. quartiles 4. minimum and maximum data values
how do we identify outliers??? what do we calculate
1. calculate the IQR: Q3 - Q1 2. calculate the extreme, which is E = (1.5) x (IQR) 3. any data value smaller than Q1 - E or bigger than Q3 + E is an outlier
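A minimal Python sketch of those steps. It takes Q1 and Q3 as inputs (for example, read off StatCrunch's summary stats) instead of computing them, because different tools use slightly different quartile methods. The function name, the quartiles, and the data values are all made up for illustration.

```python
def find_outliers(values, q1, q3):
    """Return the values below Q1 - E or above Q3 + E, where E = 1.5 * IQR."""
    iqr = q3 - q1
    extreme = 1.5 * iqr
    lower_fence = q1 - extreme
    upper_fence = q3 + extreme
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical example: Q1 = 20 and Q3 = 30, so IQR = 10 and E = 15.
# Anything below 5 or above 45 counts as an outlier.
data = [3, 18, 22, 25, 31, 46]
print(find_outliers(data, q1=20, q3=30))   # [3, 46]
```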
WHAT DOES THE 5 NUMBER SUMMARY CONTAIN
1. min value 2. q1 3. q2 4. q3 5. max value
what is the median?
The Median is the measure of center that is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. its the MIDDLE LOL. if it's an even number of data values tho, u take the two middle numbers, add them, and divide by two to get ur median
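A quick Python sketch of both cases, using data that already appears in these notes: the seven exam scores (odd count) and the four chocolate chip counts (even count). `statistics.median` from the standard library does the sorting and the middle-value step for us.

```python
from statistics import median

exam_scores = [92, 50, 83, 72, 74, 81, 46]   # 7 values, odd count
chip_counts = [22, 22, 26, 24]               # 4 values, even count

print(sorted(exam_scores))    # values in order, middle one is the median
print(median(exam_scores))    # 74
print(median(chip_counts))    # (22 + 24) / 2 = 23.0
```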
What is the midrange? whats the midrange formula?
The Midrange is the value MIDWAY between the MAXimum and MINimum values in the original data set. the midrange = (the MAX value + the MIN value) / 2
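Because this one has to be done by hand, here is a tiny Python sketch that shows the midrange next to the range so the two don't get mixed up. It uses the exam scores from the percentile example, and the function names are made up.

```python
def midrange(values):
    """Halfway point between the largest and smallest value: (max + min) / 2."""
    return (max(values) + min(values)) / 2


def data_range(values):
    """Range: max - min. This is a measure of variation, not of center."""
    return max(values) - min(values)


exam_scores = [92, 50, 83, 72, 74, 81, 46]
print(midrange(exam_scores))     # (92 + 46) / 2 = 69.0
print(data_range(exam_scores))   # 92 - 46 = 46
```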
what is the mode? CAN THERE BE MORE THAN 1 MODE?
The Mode of a data set is the value that occurs with the greatest frequency. note: A data set can have one, more than one, or no mode. note: the mode is always an actual data value. and NOTE! the mode is the only measure of center that can be used to describe categorical data.
what are the 3 measures of variation?
The Range The Variance The Standard Deviation
what is the range ? whats the formula to get it? is the range sensitive to other values ?
The Range of a set of data values is the difference between the maximum data value and the minimum data value. Range = (maximum value) - (minimum value) It is very sensitive to extreme values; therefore, it is not as useful as other measures of variation.
what is the STANDARD deviation
The Standard Deviation is the SQUARE ROOT of the Variance. The standard deviation is expressed in the same units of measure as the original data.
what is the variance?
The Variance is the value we get when we SQUARE ALL the deviations, add (+) them up and divide: by N (the total number of values) for a population, or by n - 1 for a sample. The variance is not expressed in the same units of measure as the original data. IT IS EXPRESSED AS THE SQUARE OF THE ORIGINAL UNITS
what is the coefficient of variation ? what is the equation?
The coefficient of variation (or CV) describes the standard deviation relative to the mean. u take the sample standard deviation / the sample mean and multiply by 100%, ie CV = (s / xbar) x 100% (but this can be done in stat crunch)
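A minimal Python sketch of the CV for a sample, using the four chocolate chip counts from these notes. The function name is made up; `statistics.stdev` is the sample standard deviation (divides by n - 1).

```python
from statistics import mean, stdev


def coefficient_of_variation(sample):
    """CV = (sample standard deviation / sample mean) x 100, as a percent."""
    return stdev(sample) / mean(sample) * 100


chip_counts = [22, 22, 26, 24]
cv = coefficient_of_variation(chip_counts)
print(round(cv, 1))   # the standard deviation is roughly 8% of the mean
```

To compare two samples whose means are far apart, compute this for each sample and compare the two percents.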
examples of the use of empirical rule WITH EMPIRICAL RULE THINK PERCENT
We had 30 data values with a mean of about 65 and a standard deviation of about 15. If we go out 1 std deviation from the mean in each direction, we create an interval of (50, 80). According to the empirical rule, about 68% of those data values, ie roughly 20 exam scores, should be between 50% and 80%. If we look at the data for the exams we see that 19 exam scores fell between 50% and 80%. One was exactly 50%. if we go out 2 std deviations we go up to 65 + (15x2) (or just 80 + 15) and down to 65 - (15x2) (or just 50 - 15), which gives (35, 95), so about 95% of the data should be between these
is the mode resistant?
YEAH!
Since most of our data, 95%, lies within an interval of 2 standard deviations of the mean, we consider any values outside of this interval to be highly unusual. IE IF A DATA VALUE IS MORE THAN 2 STANDARD DEVIATIONS ABOVE OR BELOW THE MEAN WE CALL IT UNUSUAL
YEP
what is the measure of center tho
a measure of center is the value at the center or middle of a data set.
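what does symmetry in the box plot indicate
a symmetric distribution, which is what we'd expect from a normal (bell-shaped) distribution (note: symmetry alone doesn't prove the data is normal)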
what is the mean ?
aka the average. The mean (arithmetic mean) is the measure of center obtained by adding (+) ALL the values together and dividing by the total number of data values. note: the mean may not be an actual data value present in the data
percentiles are what? note: if a student is in the 70th percentile it doesn't mean that she scored a 70%, it means she scored higher than 70% of the people who took that test, ie it also implies only about 30% of people scored as high or higher than her
cool
the box plot is legit the MIN, q1, q2, q3, then the max so
cool
example: we have 40 cookie data values based on how many chips a cookie has, find the percentile for a cookie with 23 chips. We count and see there are 10 cookies with FEWER than 23 chips! so we take 10/40 x 100 = 25! a cookie w/ 23 chips is in the 25th percentile, meaning it has more chips than 25% of the cookies, and the other 75% of cookies have 23 chips or more
cool and note we round the percentiles to whole numbers
what is ∑ ? (uppercase sigma/summation)
denotes the sum of a set of values.
how do we approximate the mean from a frequency distribution?
so, we approximate all sample values in each class by the class midpoint. So we multiply each class midpoint by its class frequency, add, and divide by the total frequency. ex) we know there is a frequency of 2 in the class of 30-39, so we find the midpoint by going (30+39)/2 = 34.5, and we MULTIPLY this midpoint of 34.5 x (TIMES) 2 (the frequency), and we REPEAT THIS FOR ALL OF THE CLASSES AND THEIR midpoints. once we have multiplied each midpoint by its frequency and added all of these numbers together, THEN we divide that number by the TOTAL of the frequencies to approximate the mean. Ex) (34.5)(2) + (44.5)(4) + (54.5)(3) + (64.5)(8) + (74.5)(9) + (84.5)(3) + (94.5)(1) = 1945, and 1945 / 30 (total freq) = 64.83
the deviations always add up to zero (the negatives cancel out the positives), so how do we get a standard deviation that isn't just zero? we SQUARE each deviation before adding them up!
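The same frequency-table calculation as a Python sketch. Only the 30-39 class limits are stated in the example; the other class limits below are an assumption worked backward from the midpoints (44.5, 54.5, and so on).

```python
# (lower class limit, upper class limit, frequency)
# Assumption: classes after 30-39 are inferred from the midpoints in the example.
classes = [
    (30, 39, 2), (40, 49, 4), (50, 59, 3), (60, 69, 8),
    (70, 79, 9), (80, 89, 3), (90, 99, 1),
]

weighted_sum = 0.0
total_frequency = 0
for lower, upper, frequency in classes:
    midpoint = (lower + upper) / 2          # e.g. (30 + 39) / 2 = 34.5
    weighted_sum += midpoint * frequency    # midpoint times its frequency
    total_frequency += frequency

approx_mean = weighted_sum / total_frequency
print(weighted_sum, total_frequency)   # 1945.0 30
print(round(approx_mean, 2))           # 64.83
```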
yea
is large variation in a data set bad?
yes