AP Statistics Semester 1
Completely Randomized Design
Experimental units assigned to the treatments completely by chance
Cumulative frequency
Add the values with all the smaller values
Statistically significant
An observed effect so large that it would rarely occur by chance
Describing the sampling distribution of barx1- barx2
Liquid L: mean 27 ounces of soft drink, standard deviation .8 ounces, 20 cups. Liquid M: mean 17 ounces, standard deviation .5 ounces, 25 cups. Mean = 27-17 = 10 ounces. Standard deviation = SQRT((.8^2/20)+(.5^2/25)) = .205 ounces
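The arithmetic on this card can be checked with a short Python sketch. The function name `sd_diff_means` is made up for illustration, and the formula assumes the two samples are independent.

```python
import math

def sd_diff_means(sd1, n1, sd2, n2):
    """SD of the sampling distribution of xbar1 - xbar2 (independent samples)."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Liquid L: mean 27, SD .8, 20 cups; Liquid M: mean 17, SD .5, 25 cups
mean_diff = 27 - 17
sd_diff = sd_diff_means(0.8, 20, 0.5, 25)
print(mean_diff, round(sd_diff, 3))
```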
Cluster Sample
Group individuals located near each other into clusters, then take an SRS of the clusters and include every individual in the chosen clusters
Mean and standard deviation of sampling distribution of Bar X
Mean of bar x is mu SD= standard deviation of population/SQRT(n)
Unimodal
Have a single peak
Sample size for desired margin of error
Solve Z star * SQRT(Phat(1-Phat)/n) <= ME for n; use p hat = .5 when there is no prior estimate of p
Multiplying by a constant
To change units like meters to feet
Boxplot
-A central box is drawn from the first quartile to the third quartile -a line in the box marks the median -lines extend from the box out to the minimum and maximum -outliers are marked with special symbols like an asterisk
Misleading charts
-Charts that start anywhere other than 0 -Bar charts that have different widths per category
Standard deviation facts
-Should only be used when mean is the measure of center -is always greater than or equal to 0 -has the same unit of measurement as the original observations -is not resistant, so outliers can make it much larger
Density Curve
-always on or above the horizontal axis -has an area of exactly 1
Variability of a statistic
Described by the spread of its sampling distribution. The spread is determined by the size of the random sample. Larger samples give smaller spreads. Population size does not matter just the sample size
Conditional distribution
Describes the values of one variable for only one group: the individuals who share a specific value of another variable
Probability model
Description of some chance process that consists of two parts: a sample space S and a probability for each outcome
Simple Random Sample (SRS)
Every group of n individuals in the population has an equal chance to be selected as the sample
Explanatory variable
Explain changes in response variable
C% confidence interval
Gives an interval of plausible values for a parameter. This interval is point estimate +/- margin of error. The difference between the point estimate and the true parameter value will be less than the margin of error in C% of all samples
Confidence level C
Gives the overall success rate of the method for calculating the confidence interval
Population distribution
Gives values of the variable for all individuals in the population
Parameter
Number that describes a characteristic of the population
Population distribution & distribution of sample data vs. sampling distribution
Population distribution & distribution of sample data describe individuals, while a sampling distribution describes how a statistic varies in many samples
Confidence interval formula
Statistic +/- (critical value * standard deviation of the statistic)
Interquartile range
Third quartile- 1st quartile
Determining sample size example
Wants a margin of error of no more than .03 at 95% confidence. 1.96*SQRT(.5(1-.5)/n) <= .03, so (1.96/.03)*SQRT((.5)(1-.5)) <= SQRT(n), so (1.96/.03)^2 (.5)(1-.5) <= n, so 1067.111 <= n. Round up: n = 1068
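A minimal Python sketch of the same sample-size calculation. The function name and its defaults are assumptions made for illustration, and `NormalDist().inv_cdf` plays the role of invNorm.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence=0.95, p_guess=0.5):
    """Smallest n with z* * sqrt(p(1-p)/n) <= margin."""
    # z* leaves (1 - confidence)/2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Solve for n, then round UP because n must be a whole number
    return math.ceil(z_star**2 * p_guess * (1 - p_guess) / margin**2)

n_needed = sample_size_for_proportion(0.03)
print(n_needed)
```

Using p_guess = .5 is the conservative choice because it makes p(1-p) as large as possible.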
Confidence interval example
799 teens and 2253 adults were asked about their use of social media. 80% of teens and 69% of adults said they used social media. 95% confidence interval: (.8-.69) +/- 1.96*SQRT(((.8)(.2)/799)+((.69)(.31)/2253)) = .11 +/- .034 = (.076,.144)
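The interval above can be reproduced with a small Python sketch. `two_prop_interval` is a hypothetical helper name, and the sketch assumes the Random, 10%, and Large Counts conditions have already been checked.

```python
import math
from statistics import NormalDist

def two_prop_interval(p1, n1, p2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 from two independent samples."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

# Teens: 80% of 799; adults: 69% of 2253
low, high = two_prop_interval(0.80, 799, 0.69, 2253)
print(round(low, 3), round(high, 3))
```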
Interpreting confidence levels
If we take many samples of the same size from this population, about 95% of them will result in an interval that captures the actual parameter value
Difference of random variables
If D = X - Y, then mean of D = mean X - mean Y
Experiment
Deliberately imposes some treatment on individuals to measure their responses
Cumulative relative frequency
Divide the entries in the cumulative frequency column by the total number of values and multiply by 100 for the percent
Relative frequency
Divide the value by the total number and multiply by 100 to get the percent
Subtracting a constant (for error)
Error = guess - actual answer. Subtracting 13 from every guess moves each data set 13 to the left and shows the error in the guesses
Finding a critical value
Find the critical value for an 80% confidence interval. 20% is left out and split in half for each tail of the curve, so the area in each tail is .1. Invnorm(.1,0,1) = -1.28, so the critical value is z* = 1.28
We can increase the power of a sig test by
Increasing the sample size, increasing the significance level, or increasing the difference that is important to detect between the null and alternative parameter values
Mean and SD of bar x example
Mu- 25 microorganisms per liter SD- 7 microorganisms per liter SRS of 10 adults. Mean of bar x is 25 microorganisms per liter because bar x is an unbiased estimator of Mu. SD of bar x = 7/SQRT(10) = 2.214
Multiplying or dividing by a constant effect on random variable (b)
Multiplies (or divides) measures of center, location, and spread by b (spread by |b|) -shape of the distribution does not change
Mean of discrete random variable
Multiply each value by its probability and add up the products
Statistic
Number that describes a characteristic of a sample
Residuals
Observed Y- Predicted Y
Observational study
Observes individuals and measures variables of interest but does not attempt to influence the responses
Nonresponse
Occurs when an individual chosen for the sample can't be contacted or refuses to participate
Undercoverage
Occurs when some members cannot be chosen in a sample
Binomial coefficient
Binomial probability: P(X=k) = (n nCr k) * p^k * (1-p)^(n-k). The binomial coefficient n nCr k = n!/(k!(n-k)!) counts the ways to choose which k of the n trials are successes. Ex. 5 total kids, 2 kids have O blood: 5!/(2!3!) = (5*4*3*2*1)/((2*1)(3*2*1)). Cancel out the 3*2*1 to get (5*4)/(2*1) = 10
Geometric probability formula
P(Y=K)=(1-P)^(k-1) * P
Using binomial probability formula
P(X>3) if more than 3 of the 5 kids have type O blood: 5 (from 5 nCr 4)*(.25^4)*(.75^1) + 1*(.25^5)*(.75^0) = .01465+.00098 = .01563 = 1.6% chance
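As a cross-check of the blood-type calculation, here is a short Python sketch of the binomial formula. The helper names `binom_pmf` and `binom_cdf` are made up to mirror the calculator's binompdf and binomcdf; `math.comb` gives nCr.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    """P(X <= k): add the probabilities for 0, 1, ..., k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# 5 kids, each with probability .25 of type O blood
p_more_than_3 = 1 - binom_cdf(3, 5, 0.25)   # P(X > 3) = P(X = 4) + P(X = 5)
print(comb(5, 2), round(p_more_than_3, 5))
```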
Difference between pie and bar charts
Pie charts MUST ADD UP TO 100% OR IT'S A PAC-MAN CHART
Cumulative relative frequency graph
Plot a point corresponding to the cumulative relative frequency in each class
Random sampling
Using a chance process to determine which members to include in the sample
Bias
Using a method that favors certain outcomes over the others
One Proportion z test on calculator
Using previous example STAT Tests 1 Prop Z Test P0= .08 X=47 N:500 Proportion: >P0
Significance test for p1-p2
Using previous example. H_0: p1-p2 = 0 H_a: p1-p2 =/= 0, where p1-p2 is the difference in the proportions of students who didn't eat breakfast at the 2 schools. Random- yes 10%- yes Large counts- all are greater than 10. P hat 1 = 19/80 = .2375, P hat 2 = 26/150 = .1733, .2375-.1733 = .0642. With combined P hat = .1957, (.1957)(.8043) = .1574, so z = .0642/SQRT((.1574/80)+(.1574/150)) = 1.17. 2 Prop z Test- P value- .2427. There is not convincing evidence that the true proportions of students who didn't eat breakfast at the two schools are different
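A rough Python sketch of the two-proportion z test on this card, using the combined (pooled) p hat in the standard deviation. The function name is hypothetical, and the sketch assumes a two-sided alternative.

```python
import math
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided z test of H_0: p1 - p2 = 0 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value

# School 1: 19 of 80 skipped breakfast; school 2: 26 of 150
z, p_value = two_prop_z_test(19, 80, 26, 150)
print(round(z, 2), round(p_value, 4))
```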
A confidence interval for two sided test
Using previous problem. P hat +/- z* SQRT((P hat)(1-P hat)/n). InvNorm(.025,0,1) = -1.96, so z* = 1.96 for 95% confidence. .6 +/- 1.96*SQRT((.6)(.4)/150) = .6 +/- .078 = (.522,.678). We are 95% confident that this interval captures the true proportion of students who say they have never had a cigarette
Normal probability distributions example
What's the probability that a randomly chosen woman had height between 68 and 70 inches Mean= 64 in Standard deviation= 2.7 inches (68-64)/2.7= 1.48 (70-64)/2.7= 2.22 Find Scores in table A .9868-.9306= .0562
Using calculator for probability distributions
What's the probability that a randomly chosen woman had height between 68 and 70 inches? Mean= 64 in Standard deviation= 2.7 inches Normalcdf(lower- 68, upper- 70, mean- 64, SD- 2.7) = about .056
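Outside the calculator, the same normal probability can be sketched in Python with the standard library's `NormalDist`, whose `cdf` method gives the area to the left of a value (the variable names are just for illustration).

```python
from statistics import NormalDist

# Women's heights: mean 64 inches, SD 2.7 inches
heights = NormalDist(mu=64, sigma=2.7)

# Area between 68 and 70 = (area left of 70) - (area left of 68)
p_between = heights.cdf(70) - heights.cdf(68)
print(round(p_between, 3))
```

The result differs slightly from the Table A answer because the table rounds z-scores to two decimal places.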
Computing the test statistic and p value
A battery company wants to test H_0: Mu = 30 H_a: Mu > 30. SRS of 15 batteries: BarX = 33.9, S_x = 9.8. t = (33.9-30)/(9.8/SQRT(15)) = 1.54. Degrees of freedom = 15-1 = 14. Because H_a is Mu > 30, the p value is the area for T > 1.54. 2nd VARS Tcdf(1.54,1E99,14) = .0729
Checking conditions example
A confidence interval for the true proportion p of red beads in the container. The class sample has 107 red beads and 144 white beads (n = 251). Large counts: n*P hat = 251(107/251) = 107 >= 10 and n(1-P hat) = 251(144/251) = 144 >= 10
Regression line
Summarizes the relationship between two variables- line of best fit
Binomial setting examples
Suppose a parent has 5 children. Let X be the number of children with type O blood- trials are independent, there is a fixed number of trials, and each has the same success rate, so it IS BINOMIAL. Turn over the first ten cards of a deck and let Y be the number of aces you observe- NOT INDEPENDENT because the probability of getting an ace changes depending on the cards already turned over, so NOT BINOMIAL. Turn over the top card, put it back in the deck, and keep doing this until you get an ace- NO FIXED NUMBER OF TRIALS SO NOT BINOMIAL
1 Sample t test for a mean on calculator
TESTS T-Test DATA Mu_0= 5 List= (whatever list you have the data on) Frequency- 1 Mu: <Mu_0
Variance of the sum of independent random variables
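If T = X + Y and X and Y are independent, (SD of T)^2 = (SD of X)^2 + (SD of Y)^2. Add variances, not standard deviations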
Standard Normal Table
Table of areas under a standard normal curve
Continuous random variables
Takes all values in an interval of numbers. Probability distribution described by a density curve
Random Variable
Takes numerical values that describe the outcomes of some chance process
Chebyshev's inequality
For any distribution (not just normal curves), at least 1-(1/k^2) of the observations fall within k standard deviations of the mean
Probability distribution
Tossing a coin 3 times has 8 equally likely outcomes: 0 heads= 1/8 1 head= 3/8 2 heads= 3/8 3 heads= 1/8 -gives all possible values and their probabilities
Describing sampling distribution of phat1-phat2
Two bags of colored goldfish crackers. Bag 1 has 25% red crackers and bag 2 has 35% red crackers. Teacher takes an SRS of 50 crackers from bag 1 and an SRS of 40 crackers from bag 2. Shape- 50(.25)=12.5 50(.75)=37.5 40(.35)=14 40(.65)=26. All are at least 10, so approximately normal. Mean- .25-.35 = -.10. Standard deviation- SQRT(((.25)(.75)/50)+((.35)(.65)/40)) = .0971
Connection between mutually exclusive and independence
Two mutually exclusive events (each with nonzero probability) can never be independent, because if one event happens the other event is guaranteed not to happen
Finding binomial coefficient on calculator
Type the total number N Math PRB nCr Then type in the number needed So like 5,Math,PRB, NCr, 2= 10
Extrapolation
Use of a regression line for prediction outside the interval of values of x. Often not accurate
Point estimators examples
A. Quality inspectors want to estimate the mean lifetime Mu of the AA batteries produced in an hour at a factory. They select fifty random batteries- the sample mean bar x is the point estimator B. What proportion of high schoolers smoke? 2792 said they smoked out of 15,425- P hat = 2792/15425 is the point estimate C. Quality control inspectors in A want to investigate the variability in battery lifetime by estimating the population variance sigma^2- the sample variance S_x^2 is the point estimator
Informed consent
All individuals must give informed consent
Data Analysis
Gathering, organizing, analyzing, and interpreting Data
Independent random variables
If knowing whether any event involving X has occurred tells us nothing about the occurrence of any event involving Y
Symmetric
If the left and right sides are approximately mirror images of each other
Pie Charts and Bar Charts
Analyzing categorical data
Variables
Characteristics of individuals
Inference
Drawing conclusions that go beyond the data
Outliers
Values that stand out
Confidence interval for a difference between two means
(Barx1- barx2) +/- t*SQRT((s1^2/n1)+(s2^2/n2))
Confidence interval for a difference between two proportions
(Phat1-phat2) +/- z* SQRT((phat1(1-phat1)/n1)+(phat2(1-phat2)/n2))
Test statistic
(Statistic-parameter)/standard deviation of statistic
Confidence interval on calculator
(Using previous problem) Stat Tests 1-PropZInt (MAKE SURE IT'S EXACTLY 1-PROPZINT) X= 246 N= 439 C-level- .95 Calculate (Make sure to press enter only once after calculate or else you'll only get the higher interval) You should get the lower and upper endpoints of the interval and the sample proportion as well as N
Standardizing
(X-Mean)/Standard deviation= Z score
Finding Combined phat
(X1+X2)/(n1+n2) School 1 said 19/80 students didn't have breakfast while school 2 said 26/150 students didn't have breakfast Combined PHat= (19+26)/(80+150)= 45/230= .1957
Variance of discrete random variables
(X1-mean)^2 *(Probability) + (x2-mean)^2 *(P) etc Standard deviation is square root of variance
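To see the mean and variance formulas for a discrete random variable in action, here is a small Python sketch using the three-coin-toss distribution from this set (X = number of heads). The variable names are only for illustration.

```python
import math

values = [0, 1, 2, 3]           # possible numbers of heads
probs = [1/8, 3/8, 3/8, 1/8]    # their probabilities (must add to 1)

# Mean: multiply each value by its probability and add
mean = sum(x * p for x, p in zip(values, probs))

# Variance: weight each squared deviation by its probability and add
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
sd = math.sqrt(variance)
print(mean, variance, round(sd, 3))
```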
One sample t test statistic
(barx-Mu)/(S_x/SQRT(n))
Individuals
(the subjects) objects described by a data set
Effects of Adding or Subtracting a constant
-changes measures of center and location -does not change shape of the distribution or measures of spread (range,IQR, Sx)
Effects of multiplying or dividing
-changes measures of center, location, and spread -does not change the shape of the distribution
Assessing normality
-plot the data and see if it is symmetric or bell shaped -check whether it follows the 68-95-99.7 rule -making a normal probability plot
Normal Curves
-symmetric, single peaked, bell shaped -mean is mu, standard deviation sigma -changing the mean without changing the standard deviation moves the normal curve along the horizontal axis without changing the spread -the standard deviation controls the spread. Curves with larger standard deviations are more spread out
Rules of probability
-the probability of any event is a number between 0 and 1 -All possible outcomes together must have probabilities that add up to 1 -if all outcomes in the sample space are equally likely, the probability that event A occurs is P(A)= number of outcomes corresponding to event A/ total number of outcomes in the sample space -The probability that an event does not occur is 1 minus the probability that the event does occur. Event (not A)= A^c (complement of A) -If two events have no outcomes in common, the probability that one or the other occurs is the sum of their individual probabilities
Analyzing random variables on calculator
-values in list 1 - probabilities in list 2 -1 Variable stats
Venn Diagram
Displays sample space of two events and shows the union and intersection of the two
Area to z-scores on calculators
1. 2nd VARS 2. InvNormal Question will ask what is the x percentile Area= x percentile Mean-0 Standard deviation- 1
Z-scores to areas on calculators
1. 2nd VARS 2. Normalcdf If questions asks for greater than a value Lower= Value Upper= 1E99 (2nd , ) Mean- 0 Standard deviation- 1
Making a normal probability plot
1. Arrange data from smallest to largest and record percentiles 2. Use invNorm to find Z-scores 3. Plot each observation x against its z-score. If it looks like a straight line, the data are approximately normal
Criteria for establishing causation when we can't do an experiment
1. Association is very strong 2. Association is consistent 3. Larger values of the explanatory variable are associated with stronger responses 4. Alleged cause precedes the effect in time 5. Alleged cause is plausible
Types of variables
1. Categorical- cannot do math with, places into groups (zip code, phone number) 2. Quantitative- numbers you can do math with (hint- FINDING AVERAGE)
Principles of experimental design
1. Comparison- use a design that compares two or more treatments 2. Random assignment- use chance to assign experimental units to treatments 3. Control- keep other variables that might affect the response the same for all groups 4. Replication- use enough experimental units in each group so that any differences in the effects of the treatments can be distinguished from chance differences between the groups
Facts about correlation
1. Correlation makes no distinction between explanatory and response variables 2. R does not change when we change units of measurement 3. R has no unit of measurement 4. Correlation does not imply causation 5. Correlation requires both variables to be quantitative 6. Correlation only measures linear relationships 7. A value close to 1 or -1 does not mean it is linear. USE YOUR EYES 8. R is strongly affected by outliers so it is not resistant 9. Correlation is not a complete summary of two variable data
Making histogram on a calculator
1. Stat 2. Edit (enter the data in a list) 3. 2nd Y= and select the first plot unless it is already in use 4. Find histogram icon 5. Press zoom and zoom stat
Choosing SRS- calculator
1. Math 2. PRB 3. randInt( 4. Enter min, max
Moving beyond a point estimate
1. Sample mean is 240.8, but bar x will not be exactly Mu 2. The 68-95-99.7 rule says bar x is within 2(5)= 10 of Mu in about 95% of samples, which is 2 standard deviations away from the mean (5 is the SD of bar x, n=16) 3. 240.8+10= 250.8 240.8-10= 230.8, so Mu is plausibly between 230.8 and 250.8
Determining sample size
1. Significance level- which is the bigger risk, a type 1 or a type 2 error? If type 1, decrease alpha (the significance level), for example to .01. If type 2, increase alpha. 2. Effect size- how large a difference between the null parameter value and the actual parameter value is important to detect 3. Power- what chance do we want our study to have to detect a difference of the size we think is important
Finding Standard deviation
1. Stat 2. Calc 3. 1 Var Stats 4. Hit enter 5. S_x IS the standard deviation
Calculating R and linear regression
1. Stat 2. Calc 3. LinReg(ax+b)
Paired Data
1. Subtract the data so you can get 1 Table
How to make a residual plot
1. Turn on plot by 2nd y= 2. Make sure it's on the first option: scatterplot 3. Click down to the y list 4. 2nd stat (list) 5. Go to residual for your Y list
Normal calculations involving n example
1500 college students are asked how far away their home is 35% attend college within 50 miles of their home Find the probability that the random sample will give a result within 2% of the true value (33% & 37%) Mean= p= .35 SD= SQRT((.35)(.65)/1500)= .0123 Normalcdf(.33,.37,.35,.0123)= .8961 About 90% of samples of size 1500 will give a result within 2 percentage points of the truth
Finding t values on calculator
2nd VARS InvT( (For 95% confidence interval based on SRS of size n=12) Area- .025 Df- 11 Calculate- should get -2.2009
Geometric probability on calculator
2ND VARS Geometpdf= P(Y=k) Geometcdf= P(Y<=k) Enter p= probability of success and x value= k
Binomial probability on calculator
2nd VARS Binompdf= P(X=k) Binomcdf= P(X<=k) Trials= total n= 5 trials for blood type, p= probability of O blood= .25, x value= 3 kids. Binomcdf(5,.25,3)= .9844. When looking for P(X>k) use 1-Binomcdf(n,p,k)
One sample t interval for Mu
40 light duty engines of the same type were tested; the mean reading was 1.2675 and the SD was .3332 A. Construct a 95% confidence interval for the mean amount of NOX emitted. Bar x +/- t* (S_x/SQRT(n)) InvT(.025,39)= -2.023, so t*= 2.023. 1.2675 +/- 2.023(.3332/SQRT(40)) = 1.2675 +/- .1066 = (1.1609,1.3741)
68-95-99.7 rule
68% of the observations fall within 1 standard deviation of the mean 95% fall within 2 standard deviations 99.7% fall within three standard deviations ONLY WORKS FOR NORMAL CURVES
Stem plot
A display for fairly small data sets
Critical value
A multiplier that makes the interval wide enough to have the stated capture rate (the 95% with 2 standard deviations)
Performing a significance test about the mean
A researcher measures the dissolved oxygen (DO) level at 15 randomly chosen locations along a stream. The results are 4.53, 5.04, 3.29, 5.23, 4.13, 5.50, 4.83, 4.40, 5.42, 6.38, 4.01, 4.66, 2.87, 5.73, 5.55. A DO level below 5 puts aquatic life at risk. H_0: Mu = 5 H_a: Mu < 5, where Mu is the mean dissolved oxygen level. Random: 15 RANDOM spots are picked 10%: there are an infinite number of locations so this is fine Normal/Large Sample- n < 30, so CHECK NORMALITY USING GRAPHS. 1 Variable stats- BarX= 4.77 S_x= .94. t = (4.77-5)/(.94/SQRT(15)) = -.94. Degrees of freedom- 14. Because H_a is Mu < 5, the p value is the area for T < -.94: Tcdf(-1E99,-.94,14) = .1816. We fail to reject H_0; there is not convincing evidence that the mean DO level is below 5
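The summary statistics and t statistic for the dissolved oxygen data can be reproduced with a few lines of Python. This is only a sketch: it stops at the test statistic, because Python's standard library has no t distribution, so the p value still comes from tcdf on the calculator.

```python
import math
from statistics import mean, stdev

do_levels = [4.53, 5.04, 3.29, 5.23, 4.13, 5.50, 4.83, 4.40,
             5.42, 6.38, 4.01, 4.66, 2.87, 5.73, 5.55]

n = len(do_levels)
x_bar = mean(do_levels)     # sample mean
s_x = stdev(do_levels)      # sample SD (divides by n - 1)
mu_0 = 5                    # value of Mu under H_0

t_stat = (x_bar - mu_0) / (s_x / math.sqrt(n))
df = n - 1
print(round(x_bar, 2), round(s_x, 2), round(t_stat, 2), df)
```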
Treatment
A specific condition applied to the individuals
Null Hypothesis
A statement of no difference or the original hypothesis. The claim we seek evidence against. Ex. H_0:P=.8
Point estimator
A statistic that provides an estimate of a population parameter. That value is called a point estimate
Unbiased estimator
A statistic used to estimate a parameter is an unbiased estimator if the mean of its sampling distribution is equal to the value of the parameter being estimated
Sample
A subset of individuals that we actually collect data from
Influential
A value is influential if removing it would distinctly change the result of the calculation. Outliers are often influential
Linear transformations example
A. Temperature x of water follows a normal distribution with mean 34 degrees Celsius and standard deviation 2 degrees Celsius. y = 32 + (9/5)x. Mean y = 32 + 9/5*mean x = 32 + 9/5(34) = 93.2 Fahrenheit. SD y = 9/5*SD x = 9/5*2 = 3.6 Fahrenheit B. Bath water should be between 90-100 Fahrenheit: NormalCDF(90,100,93.2,3.6) = .7835
Can we use a t* critical value to calculate a confidence interval for the population mean
A. To estimate the average GPA at your school you randomly select 50 students from classes you take- NO, because the sample is biased: students from your own classes are not a random sample of the school B. Small sample with strong skew or outliers- NO, the data still have to be relatively symmetric C. Some skew but NO OUTLIERS- YES
A two sided test
According to the CDC, 50% of high school students have never smoked a cigarette. 150 random students are sampled and 90 say they have never smoked a cigarette. H_0: p = .5 H_a: p =/= .5, where p = the proportion of students who have never smoked a cigarette. Random- the sample is 150 RANDOM students 10%- reasonable to assume at least 1500 students go to the high school Large counts- 150(.5) = 75 >= 10 and 150(1-.5) = 75 >= 10. P hat = 90/150 = .6. z = (.6-.5)/SQRT((.5)(.5)/150) = 2.45. 1 PROP Z TEST(.5,90,150,=/=P_0): p value = .014. We reject H_0 because the p value is less than alpha = .05. We have convincing evidence that the proportion of all students in the high school who would say they have never smoked a cigarette is different from .5
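For comparison with the calculator's 1-PropZTest, a minimal Python sketch of the two-sided test on this card follows. The function name is made up for illustration and the conditions are assumed to be checked already.

```python
import math
from statistics import NormalDist

def one_prop_z_test_two_sided(x, n, p0):
    """z statistic and two-sided p value for H_0: p = p0."""
    p_hat = x / n
    # The standard deviation uses p0 because we compute assuming H_0 is true
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 90 of 150 students say they have never smoked; H_0: p = .5
z, p_value = one_prop_z_test_two_sided(90, 150, 0.5)
print(round(z, 2), round(p_value, 3))
```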
Mean
Add up all the values and divide by the number of values. The mean is sensitive to extreme values
Discrete random variables
Adding all probabilities must add up to 1 -fixed set of values with gaps in between
Adding or subtracting a constant effect on random variable (a)
Adds a to measures of center and location (mean, median, percentiles) -does not change shape or measures of spread (Range, IQR, SD)
Least-Squares regression Line
Aka line of best fit: the line that makes the sum of the squared residuals as small as possible
Confidentiality
All individual data must be kept confidential
Institutional review board
All studies must be reviewed in advance by this board. -board protects safety and well being of subjects
Distribution
All the different values of a variable and the frequency of those values
Sampling without replacement
An airline trained 25 officers- 15 male and 10 female. Of the 8 captains chosen, 5 are female; is the selection fair? 8 nCr 5*(.4)^5*(.6)^3 = .124, but the correct probability is .106. The binomial probability assumes that the chance of choosing a female stays the same at 40% for every pick. Because the sample of 8 is almost 1/3 of the 25 officers (far more than 10%), sampling without replacement changes that chance and the binomial probability is off
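One way to see where .106 comes from is to count combinations directly, since the 8 captains are chosen without replacement. The Python sketch below does that and compares it with the binomial value; the helper names are made up for illustration.

```python
from math import comb

def exact_prob(k, successes, population, draws):
    """P(exactly k successes) when drawing WITHOUT replacement."""
    ways = comb(successes, k) * comb(population - successes, draws - k)
    return ways / comb(population, draws)

def binom_pmf(k, n, p):
    """Binomial P(X = k), which assumes p stays fixed on every draw."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

exact = exact_prob(5, 10, 25, 8)      # 5 of the 10 women among 8 picks
approx = binom_pmf(5, 8, 0.4)
print(round(exact, 3), round(approx, 3))
```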
Factors
Another term for explanatory variables
Event
Any collection of outcomes from some chance process. Designated by capital letters. Example: A = the sum of two dice equals five, P(A) = 4/36
Geometric Setting
Arises when we perform independent trials of the same chance process and record the number of trials it takes to get one success. The probability p of success must be the same on each trial
Convenience sample
Asking the first people you see
Sampling distribution of a difference between two means
Average height of 10 year old girl- 56.4 inches Standard deviation- 2.7 inches Random sample of 12 girls Spread- 2.7/SQRT(12)= .78 inches Average height of 10 year old boy- 55.7 inches Standard deviation- 3.8 inches Random sample of 8 boys Spread- 3.8/SQRT(8)= 1.34 inches
One sample t interval for a population mean
Bar x +/- t* (S_x/SQRT(n))
Bias, variability, and shape as described by a dart board
Bias means the aim is off and we miss the bulls-eye High variability means the results are scattered So we need Low bias and low variability
Voluntary response sample
Consists of people who choose themselves by responding to a general invitation
Simulating a sampling distribution
Choosing 500 SRSs of size n=20 from a population of 200 chips, 100 red and 100 blue. Explain what a dot at .15 is- in that SRS of 20 chips there were three red chips
Stratified Random Sample
Classify the population into groups then choose a separate SRS for each group
Census
Collects data from every individual
Matched Pairs Design
Create blocks by matching pairs of similar experimental units
Dot plots and stem plots
Describe quantitative data
Describing scatterplots
Direction- negative or positive association Form- curved or linear Strength- how closely the points follow the form
Checking for independence
Dominant hand   Male   Female   Total
Right           39     51       90
Left            7      3        10
Total           46     54       100
P(Left handed|Male) = 7/46 = .152 and P(Left handed) = 10/100 = .10. They are not independent because the two probabilities are not the same
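The independence check on the handedness table can be sketched in Python as below. The dictionary layout and variable names are just one illustrative way to store a two-way table.

```python
import math

# Two-way table counts: (dominant hand, gender)
counts = {
    ("right", "male"): 39, ("right", "female"): 51,
    ("left", "male"): 7,   ("left", "female"): 3,
}

total = sum(counts.values())
males = counts[("right", "male")] + counts[("left", "male")]
lefties = counts[("left", "male")] + counts[("left", "female")]

p_left = lefties / total                          # P(left handed)
p_left_given_male = counts[("left", "male")] / males   # P(left handed | male)

# Independent only if the conditional and unconditional probabilities match
independent = math.isclose(p_left, p_left_given_male)
print(round(p_left, 3), round(p_left_given_male, 3), independent)
```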
Central Limit Theorem
Draw an SRS of size n from any population with mean mu and finite SD sigma. When n is large, the sampling distribution of the sample mean bar x is approximately normal
Identifying outliers
Values below first quartile - (1.5*IQR) are low outliers. Values above third quartile + (1.5*IQR) are high outliers
Significance Test
Formal procedure for using observed data to decide between two competing claims
Block
Group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to the treatments
Type 1 Error probability examples
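H_0: p = .08 H_a: p > .08, where p is the proportion of all potatoes with blemishes in the shipment; sample of 500 potatoes at Alpha = .05. Shape- approximately normal because np and n(1-p) are at least 10 Center- .08 SD- SQRT((.08)(.92)/500) = .01213. INVNORM(.95,.08,.01213) = about .10, so we reject H_0 when P hat is about .10 or more. If H_0 is true that happens 5% of the time, so the probability of a Type 1 error is Alpha = .05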
Two sided test using t values
H_0: Mu = 5 H_a: Mu =/= 5 N= 37 T= -3.17 Tcdf(-1E99,-3.17,36) and then multiply the answer by two to count both tails
Finding probabilities involving the sample mean example
Height of young women Mean- 64.5 inches SD- 2.5 inches Find probability a girl is taller than 66.5 inches Normalcdf(66.5,1 E 99, 64.5, 2.5)= .2119 Find the probability that the mean height of an SRS of 10 young women exceeds 66.5 inches SD= 2.5/SQRT(10)= .79 NormalCDF(66.5,1 E 99, 64.5, .79)= .0057
Confidence interval example
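How do the sizes of longleaf pine trees in the north and south compare? Number of each- 30 North: mean- 23.7; SD- 17.5 South: mean- 34.53; SD- 14.26 (34.53-23.7) +/- 1.699*SQRT((14.26^2/30)+(17.5^2/30)) = 10.83 +/- 7.00 = (3.83,17.83), where 1.699 is t* for 90% confidence with df = 29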
Parameters and statistics example
Identify population, parameter, sample, and statistic A poll asked a random sample of 515 U.S. adults whether or not they believe in ghosts. 160 said yes. Population: all US adults Parameter is p, the proportion of all us adults who believe in ghosts Sample is 515 interviewed And the statistic is p hat= 160/515
Performing a significance test
If more than .08 of the potatoes have blemishes the truck will be sent back. A random sample of 500 potatoes had 47 with blemishes. Alpha- .05 H_0: p = .08 H_a: p > .08, where p is the proportion of potatoes with blemishes. Random- 500 random potatoes 10%- reasonable to assume at least 5000 potatoes in the shipment Large counts- 500(.08) = 40 >= 10 and 500(.92) = 460 >= 10. P hat = 47/500 = .094. z = (.094-.08)/SQRT((.08)(.92)/500) = 1.15. NormalCDF(1.15,1E99,0,1) = .1251 = p value. Because the p value of .1251 is greater than Alpha = .05, we fail to reject H_0. There is not convincing evidence that more than 8% of the potatoes in the shipment have blemishes
Conditional Probability
If one event has happened, the chance that another event will happen. P(B|A)- event B occurs given that event A has occurred. Ex. Probability a randomly selected student is male given that the student has pierced ears: P(Male|Pierced ears)
Skewed to the left or right
If the left or right side (tail) is much longer than the other. The direction of the skew is the side with the longer tail
Independent events
If the occurrence of one event does not change the probability that the other event will happen: P(A|B) = P(A) and P(B|A) = P(B)
Significance level Alpha
If the p value is smaller than alpha, we say the results of a study are statistically significant and we reject the null Hypothesis
Normal/large sample condition for sample means
If the population distribution is normal, so is the sampling distribution of bar x. This is true no matter what the sample size n is. If the population distribution is not normal, the central limit theorem says that the sampling distribution of bar x will be approximately normal if n is greater than or equal to 30
Mutually Exclusive or Disjoint
If two events have no outcomes in common and so can never occur together
Type 2 Error
If we fail to reject H_0 when H_a is true
Type 1 Error
If we reject H_0 when H_0 is true
Computing a test statistic
In an SRS of 50 free throws a player made 32 (P hat = .64), where the claimed rate is 80% (an average of 40 out of 50). z = (.64-.8)/SQRT((.8)(.2)/50) = -2.83. Normalcdf(-1E99,-2.83,0,1) = .0023. We can reject H_0!
Five Number Summary
Minimum, First Quartile, Median, Third Quartile, Maximum
Calculating Binomial probabilities
Using the previous example, the probability that a kid has O blood is .25. What's the probability that out of 5 kids none have O blood? .75^5 = .2373. What's the probability exactly one has O blood? 5*.25*.75^4 = .39551, because any one of the five children could be the one with O blood. Probability exactly two have O blood: 10*.25^2*.75^3 = .26367
Histogram
Looks at the distribution in groups (classes) of equal width, such as 1-5, 6-10, etc
Curve is symmetric
Mean and median are the same
Curve is skewed right
Mean is to the right of the median (vice versa for skewed left)
Sampling distribution of a sample proportion
Mean of P hat = p. Standard deviation = SQRT(p(1-p)/n). As n increases the sampling distribution becomes approximately normal
Calculations using Central limit theorem
Mean= 1 hour SD= 1 hour. Company will service 70 random air conditioners in the city and plans to budget an average of 1.1 hours per unit. Will this be enough? Standard deviation of bar x = 1/SQRT(70) = .12. NormalCDF(1.1,1E99,1,.12) = .2023, so there is about a 20% chance that the work won't be completed in the budgeted time
Mean of a geometric random variable
Mean= 1/P
Formulas for mean and standard deviation of the sampling distribution of BarX1-BarX2
Mean= Mu1-Mu2 Spread- SQRT((sigma 1^2/n1)+(sigma 2^2/n2))
Mean and SD of binomial random variable
Mean= Total number * Probability SD= SQRT(NP(1-P))
Correlation
Measures direction and strength of linear relationship R is greater than or equal to -1 and less than or equal to 1
Response variable
Measures outcome of study
Standard Deviation & Variance
Measures the typical distance of values from the mean. Deviations = x - the mean. Variance = the sum of all the (x - the mean)^2 divided by n-1. Standard deviation = the square root of the variance
Double-Blind
Neither the subjects nor those who interact with them and measure the response know which treatment a subject received. Only the statistician knows until the end of the experiment
Linear transformations (multiplying and adding) effects on a random variable Y=a+bx
New Mean- a+(b * Old Mean) Standard deviation- |b|*old SD
Standard normal distribution
Normal distribution with mean 0 and standard deviation 1
Confidence interval for p no. 2
Out of 439 teens, 246 said yes to having had sex. Find a 95% confidence interval. Invnorm(.025,0,1) = -1.96, so z* = 1.96. P hat = 246/439 = .56. .56 +/- 1.96*SQRT((.56)(.44)/439) = .56 +/- .046 = (.514,.606)
Sample proportion of successes
P Hat= Count of successes/size of sample= X/n
Proportions for samples and populations
P is simply a population proportion P hat is used to estimate the unknown parameter P
Rejecting or not rejecting H_0
P value small- reject H_0 P value large- fail to reject H_0
General addition rule
P(A U B)= P(A) + P(B) - P(A and B)
Multiplication rule for independent events
P(A and B) = P(A) * P(B). The probability an individual O-ring functions properly is .977, so the probability all six function properly is (.977)(.977)(.977)(.977)(.977)(.977) = .87
General Multiplication rule
P(A and B) = P(A) * P(B|A) Ex. 93% of teenagers use the internet and 55% of online teens have a profile on a social networking site. Find the probability that a randomly selected teen uses the internet and has posted a profile: P(Online and Profile) = P(Online)*P(Profile|Online) = (.93)(.55) = .5115
Calculating conditional probabilities
P(A|B) = P(A and B)/P(B) and vice versa: P(B|A) = P(A and B)/P(A)
Finding probability of at least one
P(At least one positive) = 1 - P(no positives). One rapid test has probability .004 of producing a false positive. If 200 people who are free of the disease are tested, what is the chance at least one false positive will arise? .996^200 = .4486, 1-.4486 = .5514 chance at least one person will receive a false positive
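The complement trick on this card is easy to sketch in Python. The function name is hypothetical, and the calculation assumes the 200 test results are independent.

```python
def prob_at_least_one(p_single, n_trials):
    """P(at least one occurrence) = 1 - P(none), for independent trials."""
    p_none = (1 - p_single) ** n_trials
    return 1 - p_none

# False positive rate .004 per test, 200 disease-free people tested
p_any_false_positive = prob_at_least_one(0.004, 200)
print(round(p_any_false_positive, 4))
```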
Experimental units
Smallest collection of individuals to which treatments are applied. When units are human beings they are usually called subjects
Binomial distribution
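The probability distribution of a binomial random variable X, determined by the number of trials n and the probability of success p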
Sample space S
The list of all possible outcomes
Spread
The lowest point to the highest point
Interpreting confidence intervals
Question: if the presidential election were held today, would you vote for A or B? The 95% confidence interval for A is (.48,.54) Interpreting the confidence interval- we are 95% confident that the interval from .48 to .54 captures the true proportion of all registered voters who favor candidate A What is the point estimate- .51, it's the midpoint of the interval
Conditions for performing a significance test about a difference in proportions
Random 10% Large counts
Conditions for performing inference about mu1- mu2
Random 10% Normal/Large sample- both populations are normal or both sample sizes are large (n1>=30 and n2>=30)
Conditions for performing a significance test about a mean
Random 10% of population Normal/Large Sample- the population has a normal distribution or the sample size is at least 30 If n<30, use a graph of the sample data to assess normality
Conditions for performing a significance test about a proportion
Random sample 10% rule Large counts- n*p_0 and n(1-p_0) are both at least 10, where p_0 is the proportion stated in H_0
Normal approximation to a binomial
A random sample of 2500 adults was asked if they agreed or disagreed; 60% agree Mean- 2500*.6= 1500 Large counts- np= 1500 and n(1-p)= 1000 are both at least 10, so the distribution is approximately normal SD- SQRT((2500)(.6)(.4))= 24.49 Find the probability that at least 1520 agree NormalCDF(1520,100000000,1500,24.49)= 21%
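A sketch of the normal approximation on this card, with `NormalDist(...).cdf` standing in for NormalCDF. It assumes n = 2500 and p = .6 as above.

```python
from math import sqrt
from statistics import NormalDist

n, p = 2500, 0.6

# Large counts check: np and n(1 - p) must both be at least 10
large_counts_ok = n * p >= 10 and n * (1 - p) >= 10

mean_x = n * p                    # mean of the binomial count
sd_x = sqrt(n * p * (1 - p))      # SD of the binomial count

# P(X >= 1520) under the approximating normal distribution
prob = 1 - NormalDist(mean_x, sd_x).cdf(1520)
print(large_counts_ok, round(mean_x), round(sd_x, 2), round(prob, 2))
```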
Conditions for constructing a confidence interval about a proportion
Random: the data comes from a well designed random sample 10%- check that n<1/10N Large count- both n*p hat and n(1-p hat) are at least 10
Geometric settings and random variable example
Y= number of picks it takes to correctly match the lucky day Probability of success on each pick- 1/7 Since we are counting the number of trials until the first success, Y is geometric
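A sketch of geometric probabilities for the lucky day example, assuming the standard formula P(Y = k) = (1 - p)^(k - 1) * p. The function names are hypothetical.

```python
def geometric_pmf(k, p):
    """P(Y = k): the first success happens on trial k."""
    return (1 - p) ** (k - 1) * p

def geometric_cdf(k, p):
    """P(Y <= k): at least one success within the first k trials."""
    return 1 - (1 - p) ** k

p = 1 / 7  # chance of matching the lucky day on any one pick

for k in range(1, 4):
    print(k, round(geometric_pmf(k, p), 4))
print("within 3 picks:", round(geometric_cdf(3, p), 4))
```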
Placebo effect
Responding favorably even when they take a placebo
Role of independence examples
SAT math score X: mean- 519 SD- 115 SAT reading score Y: mean- 507 SD- 111 Mean X + Mean Y= 1026 Cannot compute the standard deviation of X + Y because the scores are not independent: a student who scores high on one most likely scores high on the other
Standard error of the sample mean
SE_barx= (S_x)/SQRT(N) where S_x is the standard deviation of the sample
Calculate confidence interval for p
Took an SRS of beads from a container and got 107 red beads and 144 white beads A. Calculate and interpret a 90% confidence interval for p B. It is claimed that 50% of the beads in the container are red. Comment on this claim A. P hat= 107/251= .426 InvNorm(.05,0,1)= -1.645, so z*= 1.645 P hat +/- z* * SQRT(P hat(1-P hat)/n) .426 +/- 1.645*SQRT((.426)(1-.426)/251)= (.375,.477) B. Because .5 is not in the interval, there is reason to doubt the claim
Confidence intervals on calculator
STAT, TESTS, 2-PropZInt x1- 639 (from (.8)(799)) n1- 799 x2- 1555 n2- 2253 C-Level- .95
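A sketch of what the calculator's two-proportion z interval computes, assuming the usual formula (p hat1 - p hat2) +/- z* times the standard error. The function name is made up; the inputs are the counts from the card above.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, conf_level):
    """Confidence interval for p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # Standard error of p_hat1 - p_hat2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

low, high = two_prop_z_interval(639, 799, 1555, 2253, 0.95)
print(round(low, 3), round(high, 3))
```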
Residual plot
Scatterplot of residuals against the explanatory variable
Sampling distribution of phat1-phat2
Shape- when n1p1, n1(1-p1), n2p2, and n2(1-p2) are all at least 10, the distribution is approximately normal Standard deviation of p hat1 - p hat2 is SQRT((p1(1-p1)/n1) + (p2(1-p2)/n2))
Influences on sample size
Significance level- a smaller significance level needs a larger sample Power- a higher power needs a larger sample Effect size- detecting a smaller difference between the null parameter value and the actual parameter value needs a larger sample
Four step process of simulation
State- ask a question of interest about some chance process Plan- describe how to use a chance device to imitate one repetition of the process. Tell what you will record at the end of each repetition Do- perform the simulation Conclude- use the results of your simulation to answer the question of interest
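A sketch of the four steps as code, using the false positive question (200 disease-free people, .004 chance of a false positive each) as the State step. The seed and the number of repetitions are arbitrary choices for this example.

```python
import random

# State: what is the chance of at least one false positive among 200 tests?

def one_repetition(rng, n_people=200, p_false_positive=0.004):
    """Plan: imitate 200 tests; record whether any false positive occurred."""
    return any(rng.random() < p_false_positive for _ in range(n_people))

def simulate(n_reps=5000, seed=1):
    """Do: repeat the process many times and return the proportion of hits."""
    rng = random.Random(seed)
    hits = sum(one_repetition(rng) for _ in range(n_reps))
    return hits / n_reps

# Conclude: the proportion estimates the probability of at least one false positive
estimate = simulate()
print(estimate)
```

The estimate should land near the exact value 1 - .996^200, and it gets closer as the number of repetitions grows.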
Differences between sample and population means
The Greek letter Mu is population mean Bar X is sample mean
One sided vs. two sided hypothesis
The alternative hypothesis is one sided if it states that the parameter is smaller than the null value or if it states that it is larger than the null value. It is two sided if it states that the parameter is different from the null value
Alternative Hypothesis
The claim about the population that we are trying to find evidence for, in place of the null hypothesis ex. H_a: p<.8
Binomial random variable
The count X of successes in a binomial setting
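A sketch of the binomial probability formula P(X = k) = (n choose k) p^k (1 - p)^(n - k), applied to six independent O-rings that each function with probability .977. The function name is hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials, success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 6, 0.977  # six O-rings, each functions with probability .977

# Probability all six function properly
print(round(binomial_pmf(6, n, p), 2))

# The probabilities over k = 0..n add to 1
print(round(sum(binomial_pmf(k, n, p) for k in range(n + 1)), 6))
```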
Range
The difference between the largest value and the smallest
Sampling distribution
The distribution of values taken by the statistic in all possible samples of the same size from the same population
Population
The entire group of individuals we want information from
Coefficient of determination- R^2
The fraction of the variation in the values of y that is accounted for by the least squares regression line Example: If all the points fall exactly on the regression line then r^2= 1, and all the variation in y is accounted for by the linear relationship with x
Simulation
The imitation of chance behavior based on a model that accurately reflects the situation
First quartile
The median of the values left of the median
Third quartile
The median of the values right of the median
Median
The midpoint of the distribution -if the number of values is odd the median is the center observation -if the number of values is even the median is the average of the two center observations -The median is a resistant measure of center
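A sketch of the median and quartile cards in code. It assumes the convention described on the quartile cards: Q1 and Q3 are the medians of the values left and right of the median, so the median itself is left out when the number of values is odd. The data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                    # values left of the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]   # values right of the median
    return median(lower), median(ordered), median(upper)

print(quartiles([1, 3, 4, 6, 8, 9, 12]))  # odd number of values
print(quartiles([1, 2, 3, 4]))            # even number of values
```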
Center
The midpoint of the values
Geometric Random Variable
The number of trials Y it takes to get a success in a geometric setting.
Shape
The overall shape of the distribution (skewed right, etc.)
Interpreting a P value
The p value is a conditional probability, computed assuming H_0 is true. Ex. The NIH recommends a calcium intake of 1300 mg per day for teens. The NIH says that teens aren't getting enough calcium. H_0: mu= 1300 H_a: mu<1300 where mu is the true mean daily calcium intake in the population of teens The researchers found bar x to be 1198 and the p value to be .1404 "Assuming the true mean daily calcium intake for teens is 1300 mg, there is a .1404 probability of getting a sample mean of 1198 mg or less just by chance."
Percentile
The percent of observations less than the chosen observation
Geometric distribution
The probability distribution of Y
P value
The probability, computed assuming H_0 is true, that the statistic (such as Phat or bar x) would take a value as extreme as or more extreme than the one actually observed, in the direction specified by Ha is called the p value of the test
Probability
The proportion of times the outcome would occur in a very long series of repetitions
Randomized block design
The random assignment of experimental units to treatments is carried out separately within each block
Degrees of freedom
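The statistic t= (bar x - mu)/(S_x/SQRT(n)) has a t distribution with degrees of freedom df= n-1 when the population is normal If the population is not normal, the statistic has approximately a t_n-1 distribution when the sample size is large enough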
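A sketch of computing the t statistic and its degrees of freedom. The sample values are hypothetical (not the researchers' data), and the null value 1300 is borrowed from the calcium example only for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu_0):
    """Return (t, df) where t = (x_bar - mu_0) / (s_x / sqrt(n)) and df = n - 1."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)   # s_x / sqrt(n)
    t = (mean(sample) - mu_0) / standard_error
    return t, n - 1

sample = [1200, 1100, 1300, 1250, 1150]  # hypothetical daily calcium intakes (mg)
t, df = one_sample_t(sample, 1300)
print(round(t, 3), df)
```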
Marginal Distributions
The totals at the bottom and far right margins of a table
10% condition
When taking an SRS of size n from a population of size N, we can use a binomial distribution to model the count of successes in the sample as long as n<1/10N
Standard Error
The standard deviation of a statistic when it is estimated from data Ex. for p hat: SQRT((p hat(1-p hat))/n)
Confounding
When two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other
Binomial Setting
When we perform several independent trials of the same chance process and record the number of times that a particular outcome occurs Four conditions Binary- each trial is either a success or a failure Independent- the result of one trial should not tell us anything about the result of another Number- the number of trials should be fixed Success- same probability of success on each trial
Rules for adding random variables
X= Pete's passengers: mean- 3.75, SD- 1.0897 Y= Erin's passengers: mean- 3.1, SD- .943 Pete charges $150 per passenger, Erin charges $175 per passenger Calculate the mean and SD of the total amount, assuming X and Y are independent Pete's mean- 150*3.75= $562.50 Pete's SD- 150*1.0897= 163.46 Erin's mean- 175*3.1= $542.50 Erin's SD- 175*.943= 165.03 Sum mean- 562.50 + 542.50= 1105 Sum variance- (163.46)^2 + (165.03)^2= 53,954.07 Sum SD- SQRT(53,954.07)= 232.28
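A sketch of the two rules used on this card: multiplying by a constant scales the mean and SD, and variances (not SDs) add for a sum. Adding variances assumes X and Y are independent. The function names are made up for the example.

```python
from math import sqrt

def scale(mean, sd, b):
    """Multiply a random variable by b: mean scales by b, SD by |b|."""
    return b * mean, abs(b) * sd

def add_independent(mean1, sd1, mean2, sd2):
    """Sum of two INDEPENDENT random variables: means add, variances add."""
    return mean1 + mean2, sqrt(sd1 ** 2 + sd2 ** 2)

pete_mean, pete_sd = scale(3.75, 1.0897, 150)   # Pete's collected money
erin_mean, erin_sd = scale(3.1, 0.943, 175)     # Erin's collected money

total_mean, total_sd = add_independent(pete_mean, pete_sd, erin_mean, erin_sd)
print(round(total_mean, 2), round(total_sd, 2))
```

The code carries unrounded SDs through the calculation, so the final SD can differ in the second decimal from a hand calculation that rounds at each step.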
Regression line equation
Y(hat)= a(y intercept) +b(slope)x
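A sketch of fitting the regression line y hat = a + bx by least squares and computing r^2, assuming the standard formulas b = sum((x - x bar)(y - y bar))/sum((x - x bar)^2) and a = y bar - b * x bar. The data are hypothetical points that fall exactly on a line, so r^2 should come out as 1.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b): y intercept and slope of the least squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def r_squared(xs, ys):
    """Fraction of the variation in y accounted for by the regression line."""
    a, b = least_squares(xs, ys)
    y_bar = mean(ys)
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

xs = [1, 2, 3, 4]   # hypothetical explanatory values
ys = [3, 5, 7, 9]   # exactly y = 1 + 2x
print(least_squares(xs, ys), r_squared(xs, ys))
```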