Stats 119
Five Number Summary
Consists of the minimum, Q1, median, Q3, and maximum. Can be found in 1-VAR-STATS by scrolling to the second page (STATS > CALC > 1-VAR-STATS).
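For checking calculator output by hand, here is a minimal Python sketch of the five number summary. It assumes the quartiles are the medians of the lower and upper halves of the sorted data (leaving the middle value out when n is odd) and that there are at least two values; the function name and the sample data are made up for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule (assumed)."""
    data = sorted(values)
    half = len(data) // 2      # size of each half; the middle value is skipped when n is odd
    lower = data[:half]        # lower half of the sorted data
    upper = data[-half:]       # upper half of the sorted data
    return (data[0], median(lower), median(data), median(upper), data[-1])

summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)  # (1, 2.5, 5, 7.5, 9)
```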
If we are interested in the Mean what is next?
Decide if the original distribution is normal or non-normal
If we are interested in a number what is next?
Decide whether rule of thumb is satisfied (i.e. both np and nq are greater than or equal to 10.)
Types of Bias-Undercoverage
Entire population targeted is not included in the design of the sample. -Be on the lookout for any mention of a certain group in the sample design (if they mention they only sampled females, or people of a certain age group, or people from a certain region) and check that the group mentioned is the same as the population they are interested in. If not, you have undercoverage.
Determining question types-Binomial Question
Exact Questions- Will ask the probability that a single # is our outcome. Calculator: binompdf(n, p, k) calculates the probability of getting exactly k successes. Inequality Questions- Will ask the probability that our outcome is part of a range of values. Step 1: Write out all possible outcomes: 0, 1, 2, ..., n. Step 2: Circle the outcomes that we're interested in. Step 3: Use either binompdf to add up all outcomes we're interested in OR use binomcdf. Calculator: binomcdf(n, p, k) adds up the probabilities from 0 up to and including k. -Just like the normal distribution, you need to subtract from 1 for "at least" or "more than" questions: "at least k" is 1 - binomcdf(n, p, k-1), and "more than k" is 1 - binomcdf(n, p, k).
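The two calculator functions can be mimicked in a few lines of Python, which may help when checking which outcomes an inequality question actually covers. This is only a sketch: the function names copy the calculator's, and n = 10, p = 0.5 are made-up values.

```python
from math import comb

def binompdf(n, p, k):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binomcdf(n, p, k):
    """Probability of 0 up to and including k successes."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

n, p = 10, 0.5                       # hypothetical values
exactly_5 = binompdf(n, p, 5)        # "exactly 5"
at_most_2 = binomcdf(n, p, 2)        # "at most 2" -> outcomes 0, 1, 2
at_least_3 = 1 - binomcdf(n, p, 2)   # "at least 3" -> subtract outcomes 0, 1, 2
more_than_3 = 1 - binomcdf(n, p, 3)  # "more than 3" -> subtract outcomes 0, 1, 2, 3
```

Note how "at least 3" and "more than 3" subtract different cumulative values; circling the outcomes first (Step 2) is what prevents the off-by-one mistake.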
Linear Regression-Other Definitions
Extrapolation- Don't make predictions outside the range for which you have data. You may have to determine which of the below characteristics a certain point has on a scatterplot or residual plot. Outlier: Far away from other points, on either x or y. Leverage: Far away from other points, only in regard to the x value. Influential Point: Removing the point would result in a very different regression line.
Finding Confidence Intervals
Find the point estimate (p-hat or x-bar), then the margin of error. Subtract and add the margin of error to your point estimate to obtain an interval. Calculator: These can all be found under STATS->TESTS. Proportions = 1-PropZInt. Means (sd known) = ZInterval. Means (sd unknown) = TInterval.
Approximate Normal Questions-Inference
For MEANS, it is essential that you read carefully to see if you've been given sd (population standard deviation, which means z) or s (sample sd, which means t). This will decide whether your critical value or test statistic is a z* or a t*. -In both real life and on the final exam, the questions tend to be weighted towards t* questions, but that does not mean we can assume anything. -Read carefully; know that on multiple choice the wrong method will be an answer choice!
Finding Values from Given Confidence Interval
Given a Confidence interval: (lower, upper) you can obtain the point estimate and margin. Point Est.=(upper+lower)/2 ME=(upper-lower)/2
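A short sketch of the two formulas above, using a made-up interval of (0.42, 0.58); the function name is hypothetical.

```python
def unpack_interval(lower, upper):
    """Recover the point estimate and margin of error from a confidence interval."""
    point_estimate = (upper + lower) / 2   # midpoint of the interval
    margin_of_error = (upper - lower) / 2  # half the interval width
    return point_estimate, margin_of_error

estimate, margin = unpack_interval(0.42, 0.58)  # roughly 0.50 and 0.08
```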
Determining question types- Inverse Calculations
If at any point in the question they give you a percentage, percentile or quartile, you're doing an inverse question. These questions will usually also ask for a value. Do not get these confused with the sample proportion questions! You need the word NORMAL before you even consider an inverse question. These problems boil down to the following sequence: % -> Z -> X Step 0: Draw your curve! This stops you from choosing the opposite value. Step 1: % -> Z: Convert the given area to a Z-Score using invNorm in your calculator -Beware, invNorm takes the left side probability! Step 2: Z -> X: Convert that Z into X using the corresponding INVERSE formula on your formula sheet. Calculator: invNorm(left area, mean, sd)
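The % -> Z -> X sequence can be sketched with Python's standard library. The mean of 100, sd of 15, and the "top 10%" cutoff are made-up values, and the inverse formula used here (X = mean + Z * sd) is assumed to be the one on the formula sheet.

```python
from statistics import NormalDist

def inv_norm(left_area, mean=0.0, sd=1.0):
    """Value with the given area to its LEFT, like invNorm(left area, mean, sd)."""
    return NormalDist(mean, sd).inv_cdf(left_area)

# "What score puts you in the top 10%?" -> 10% on the right means 90% on the left.
mean, sd = 100, 15              # hypothetical values
z = inv_norm(0.90)              # Step 1: % -> Z
x = mean + z * sd               # Step 2: Z -> X (assumed inverse formula)
x_direct = inv_norm(0.90, mean, sd)  # same answer in one step
```

Feeding in 0.10 instead of 0.90 would return the bottom-10% cutoff, which is the "opposite value" mistake that drawing the curve is meant to catch.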
Determining question types- Normal Distribution
You are doing one of these if the word "normal" is anywhere in the question or if you're given a mean and S.D. First you need to decide if it's a regular normal question or a sample mean question. This is as simple as seeing if you have a sample size. Then, determine if it's a direct or inverse calculation.
Probability
If you can answer the question intuitively without notation, do it that way! If you're not sure where to start, translate everything into probability notation and see if a contingency table or the formula sheet can help out.
Experimental Designs
If you're asked a question about experimental design, cross out anything that isn't what you're being asked for! Some experimental designs have methods almost identical to our sampling designs (SRS v. completely randomized and stratified v. block design). These will distract us if we aren't careful.
Determining question types- # questions
If you're given n and p, then asked about a #, you're doing one of these types of questions. Your first step needs to be to check the rule of thumb, to see if you're doing a binomial or approx normal question. You can get answers using the wrong method, but they'll be wrong answers.
Approximate Normal Questions-Proportion
If you're given n and p, then asked about a proportion, you're doing one of these types of questions. These are the same as direct normal question, but you need to first calculate the mean and standard deviation from the "proportion" section of your formula sheet.
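As a sketch of that first step, the code below assumes the "proportion" section of the formula sheet gives mean = p and sd = sqrt(pq/n), the standard formulas for a sample proportion. The values p = 0.5 and n = 100 are made up, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def p_hat_distribution(p, n):
    """Approximate normal distribution of p-hat (assumes np and nq are both >= 10)."""
    return NormalDist(p, sqrt(p * (1 - p) / n))

dist = p_hat_distribution(0.5, 100)      # mean 0.5, sd 0.05
p_more_than_52 = 1 - dist.cdf(0.52)      # right-side area, so subtract from 1
```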
Types of Bias-Response Bias
Interviewee's responses are untrue or incorrect. -Look for any mention of inappropriate behavior by the interviewer or interviewee, or any reason the interviewee may lie: the way the question was asked, who was asking the question, etc.
Disjoint vs Independent
It is very important to understand the difference between disjoint and independent events. Expect at least one question that involves you understanding the difference. -Disjoint: Mutually exclusive. Two events cannot happen at the same time. --P(AnB)=0 -Independent: The outcome of one event does not influence the outcome of the other. --P(A|B)=P(A) --P(AnB)=P(A)*P(B) --Any time you see independence given in a problem, write down P(AnB)=P(A)*P(B); it will probably be used in the problem.
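The two definitions turn into two different checks on P(AnB). This is a small sketch with made-up probabilities and hypothetical function names.

```python
def is_disjoint(p_a_and_b):
    """Disjoint events can never happen together: P(AnB) = 0."""
    return p_a_and_b == 0

def is_independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """Independent events satisfy P(AnB) = P(A)*P(B), up to rounding."""
    return abs(p_a_and_b - p_a * p_b) <= tol

# Hypothetical example: P(A) = 0.5, P(B) = 0.4
independent_case = is_independent(0.5, 0.4, 0.20)  # 0.5 * 0.4 = 0.20, so independent
dependent_case = is_independent(0.5, 0.4, 0.30)    # 0.30 is not 0.20, so not independent
disjoint_case = is_disjoint(0.0)                   # no overlap at all
```

With these numbers, disjoint events would have P(AnB) = 0 rather than 0.20, so they fail the independence check; the two ideas are not interchangeable.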
What are the assumptions necessary for the BINOMIAL
(B) There are only two (BINARY) outcomes: Outcome of interest(success) and its complement(failure) (I) The trials or observations are INDEPENDENT (N)There are a fixed number (N) trials or observations (S) The probability of SUCCESS is the same for each trial
Displaying Data
Know the types of graphs for the two different types of data. -Categorical Data: Pie charts and bar graphs -Quantitative Data: Histograms, boxplots, and stemplots
Hypothesis Testing-Step 4
Make (and justify) a conclusion. -p-value less than or equal to alpha: reject the null (RTN). -p-value greater than alpha: fail to reject the null (FTRN).
Describing and Displaying Data- Measures of Center
Mean: Average of all data values. Median: Middle value of all data values --Robust to outliers or skew since it ignores all data to either side of the middle. Remember, your mean will move in the direction of a tail (skew) or outliers. Assume you will have at least one question giving you a mean and median and asking about the shape of the distribution. -A mean larger than the median means the distribution is skewed right (positive skew) -A mean smaller than the median means the distribution is skewed left (negative skew) -A mean close to the median means the distribution is approximately symmetric
Linear Regression-Correlation Coefficient (r)
Measures the strength and direction of a linear relationship between two variables. Weak: 0 < r < 0.3 (Pos.), -0.3 < r < 0 (Neg.) Moderate: 0.3 < r < 0.7 (Pos.), -0.7 < r < -0.3 (Neg.) Strong: 0.7 < r < 1 (Pos.), -1 < r < -0.7 (Neg.) -The interpretation for a correlation coefficient is: "There is a strong negative linear relationship between x and y" - -1 (< or equal) r (< or equal) 1 --The closer r is to -1 or 1, the closer the points on the scatterplot fall to a straight line. -Be able to calculate r from R^2 --r = (+-) sqroot R^2 --The sign on r will be the same as the sign of the slope; don't forget to write R^2 as a decimal first!
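A sketch of the R^2 -> r calculation, with a made-up R^2 of 64% and a made-up negative slope; the function name is hypothetical.

```python
from math import copysign, sqrt

def r_from_r_squared(r_squared_percent, slope):
    """Convert R^2 (given as a percent) to r, taking the sign from the slope."""
    r_squared = r_squared_percent / 100  # write R^2 as a decimal first
    return copysign(sqrt(r_squared), slope)

r = r_from_r_squared(64, slope=-2.5)  # about -0.8: a strong negative relationship
```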
Displaying Data-Boxplots
Memorize your fence formulas. -Mild lower fence: Q1 - 1.5(IQR) -Mild upper fence: Q3 + 1.5(IQR) -Extreme lower fence: Q1 - 3(IQR) -Extreme upper fence: Q3 + 3(IQR) Know how to use your fences to determine if a value is an outlier. Each portion of the boxplot represents 25% of the data; the box itself represents the interquartile range (IQR, middle 50%) of the data.
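The fence formulas can be sketched as below. The labels "mild outlier" and "extreme outlier" for values beyond each pair of fences are an assumption about how the course names them, and Q1 = 10, Q3 = 20 are made-up values.

```python
def fences(q1, q3):
    """Return the mild and extreme fences as (lower, upper) pairs."""
    iqr = q3 - q1
    return {
        "mild": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "extreme": (q1 - 3 * iqr, q3 + 3 * iqr),
    }

def classify(value, q1, q3):
    """Label a value by which fences it falls outside (labels are assumed)."""
    f = fences(q1, q3)
    if value < f["extreme"][0] or value > f["extreme"][1]:
        return "extreme outlier"
    if value < f["mild"][0] or value > f["mild"][1]:
        return "mild outlier"
    return "not an outlier"

example = fences(10, 20)  # IQR = 10 -> mild (-5, 35), extreme (-20, 50)
```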
Linear Regression-Regression Line
Most commonly tested concepts in terms of regression: -Interpretation of slope: For each 1 unit increase in x, y increases/decreases by the slope. -Make predictions: Plug the given x into the regression equation to find the prediction (y-hat) -Calculate a residual. --Memorize formula: e = y - (y-hat) (error is true value minus predicted value) -Calculate r or R^2
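A sketch of the prediction and residual steps, using a made-up regression line y-hat = 2 + 3x and a made-up observation; the function names are hypothetical.

```python
def predict(intercept, slope, x):
    """Plug x into the regression equation to get y-hat."""
    return intercept + slope * x

def residual(observed_y, predicted_y):
    """e = y - (y-hat): true value minus predicted value."""
    return observed_y - predicted_y

y_hat = predict(intercept=2, slope=3, x=4)  # 2 + 3*4 = 14
e = residual(observed_y=12, predicted_y=y_hat)  # 12 - 14 = -2
```

The residual here is negative, meaning the line predicted too high for this point.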
Gathering Data-Sampling Designs
Most commonly tested is the difference between stratified and cluster sampling: Stratified- Subjects within a group are SIMILAR for some characteristic or set of characteristics. We choose a sample from EACH group. Cluster- Subjects within a group are DISSIMILAR. We choose ENTIRE group(s) at random.
Rule of thumb is not satisfied(Prop.)
N/A
What three things are we interested in when we are asked to find a mean or a standard deviation or to describe a sampling distribution?
Number Proportion Mean
Relationship Between Errors
-Alpha=Significance level= probability of Type I error -Beta = probability of Type II error -1 - beta = power! Errors are a tradeoff. If you decrease one, the other will increase. Power moves with Type I error.
Hints and Tips for Sampling without replacement
-The most common thing asked for is the probability of getting EXACTLY 1 of a type of item; do not forget that there are two ways to pick this if we choose 2 items!! -Remember that in sampling without replacement, you need to determine how the first pick affects the probability in the second pick
Given Formulae
-P(A or B)= P(A u B) = P(A) + P(B) - P(A n B) --Should be used for any "or" question. --Can also be used to solve for "and" with a little algebra... P(A u B) = P(A) + P(B) - P(A n B) -P(A and B) = P(A n B) = P(A|B)*P(B) --Only use if you have a conditional! Otherwise, try using the formula above. --If you have P(B|A), don't fret: P(A n B) = P(B n A)...so P(A n B) = P(B|A)*P(A)
Linear Regression-Measures of Predictive Power
-R^2/Coefficient of Determination/Percent of Variability... --All three of the above are ways R^2 can be referred to, don't let yourself be confused! --Can find from correlation coefficient r, by squaring and multiplying by 100% ---R^2 = (r)^2 * 100% --"The percentage of variability in Y that is explained by the regression line (or X)" --As R^2 gets closer to 100%, the predictive power for our model gets BETTER. Residuals - e = observed y (truth) - predicted y (predicted) = y - (y-hat) -Negative residual: prediction is too high (overpredict) -Positive residual: prediction is too low (underpredict)
Sampling without replacement - Recognizing Sampling without Replacement
-The simplest question will say, "without replacement" -A more difficult question will leave these keywords out. Instead, you'll need to notice a few key things. --The number of total items will be given, and likely fairly small < 30. --We will want to come away with two items, not just recording a characteristic of those two items. Selecting a committee and purchasing items are good examples of this.
What are the assumptions necessary for inference on a MEAN (S.D. known)
1. Sampled values must be INDEPENDENT. 2. Sample must come from SRS. 3. Sample must be less than 10% OF POPULATION. 4. If the population is non-normal, the RULE OF THUMB must be satisfied (n is greater than or equal to 30). The Central Limit Theorem says that if a sample is LARGE enough, a non-normal distribution will have an APPROXIMATELY NORMALLY distributed SAMPLE MEAN.
What are the assumptions necessary for inference on a MEAN (S.D. Unknown)
1. Sampled values must be INDEPENDENT. 2. Sample must come from SRS. 3. Sample must be less than 10% OF POPULATION. 4. If the population is non-normal, the RULE OF THUMB must be satisfied (n is greater than or equal to 40). The Central Limit Theorem says that if a sample is LARGE enough, a non-normal distribution will have an APPROXIMATELY NORMALLY distributed SAMPLE MEAN.
What are the assumptions necessary for inference on PROPORTION
1. Sampled values must be INDEPENDENT. 2. Sample must come from SRS. 3. Sample must be less than 10% OF POPULATION. This is the thing most often written incorrectly. It seems counter-intuitive that we want a small sample size, but if we take too large a sample we'd have to deal with the fact that SRS is sampling without replacement (i.e. not independent). If given the population size, be on the lookout for a violation of this assumption. 4. The RULE OF THUMB must be satisfied (np and nq are greater than or equal to 10). Be sure to check both np and nq!
Type of probability Questions-Conditional
A conditional probability just means that we already know some outcome has occurred. There are a number of ways to realize you're dealing with a conditional probability question: -Both variables have been mentioned, but there is no "and" word (remember "but" is another word that means "and") -The key words to let you know which variable we already know are OF, IF, and GIVEN. The "probability of A given B" or "Of those in B, what is the probability A would occur" are two ways to denote a conditional probability. P(A|B) = P(AnB)/P(B) -As soon as you realize they've told you something is given, write the line in your notation and write whatever is given BEHIND the line. -The formula for conditional probability is given on the formula sheet, just be careful changing the As and Bs to whatever your letters are. Note: The given portion is ALWAYS the denominator.
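To see the formula at work, here is a sketch using a made-up two-way table of counts; the labels A and B and every number are hypothetical.

```python
# Hypothetical contingency table of counts (100 individuals in total).
counts = {
    ("A", "B"): 30, ("A", "not B"): 20,
    ("not A", "B"): 10, ("not A", "not B"): 40,
}
total = sum(counts.values())

p_a_and_b = counts[("A", "B")] / total                         # "and": one cell
p_b = (counts[("A", "B")] + counts[("not A", "B")]) / total    # everything in column B
p_a_given_b = p_a_and_b / p_b                                  # given portion is the denominator

# Shortcut straight from the table: "of those in B" means divide by the B total.
p_a_given_b_from_counts = counts[("A", "B")] / (counts[("A", "B")] + counts[("not A", "B")])
```

Both routes give 30 out of 40, or 0.75.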
What does a Z score/Standard Normal give us?
A z-score gives us the number of standard deviations away from the mean a value is, as well as whether that value is below or above the mean. You can find the Z-score for a value with the given formula: Z = (value - mean)/S.D. A Z-score of 2 means the value is 2 S.D. above the mean. A Z-score of -1 means the value is 1 standard deviation below the mean.
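A sketch of the z-score calculation, assuming the usual formula z = (value - mean) / sd; the mean of 100 and sd of 15 are made up.

```python
def z_score(value, mean, sd):
    """Number of standard deviations the value sits above (+) or below (-) the mean."""
    return (value - mean) / sd

above = z_score(130, mean=100, sd=15)  # 2.0: two S.D. above the mean
below = z_score(85, mean=100, sd=15)   # -1.0: one S.D. below the mean
```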
Describing and Displaying Data- Measures of spread
All of these give us a measure of how spread out our data is (or how much our values vary from one another). -S.D. (and Variance): measures the spread or variability of the data. --s is greater than or equal to 0; the closer s is to 0, the less spread out our data is --For s to be exactly 0, all data values must be identical --Will get very large with a skewed distribution or with outliers -Interquartile Range (IQR) = Q3-Q1 --Robust to outliers or skew since it cuts off the 25% of data on either end
Describing and Displaying Data-Describing Distributions Numerically
All our measures in this section can be found in the 1-Var Stats output for your TI-83 or TI-84. Know how to find all numeric measures. Which measures to use depends on the shape of the distribution: -Symmetric distribution: measure of center is the mean, measure of spread is the sd -Skewed distribution or one with outliers: measure of center is the median, measure of spread is the IQR. Expect to be asked which measure of center or spread to use for a described (or graphically presented) data set. Use the above chart to determine and pay close attention to whether they're asking for a center or spread.
Types of Bias-Non-Response
An individual selected into the sample cannot be contacted or refuses to cooperate. -Any mention of choosing people into the sample and people not responding - whether not being home for interview or phone call, refusing to participate.
Types of Bias
Any question about biases will ask for the most prevalent type of bias. Even if you can argue that a few types listed below are answers (they usually will be), be on the lookout for those keys to let you know what the problem being tested is
Determining question types- Normal Questions
Any question that has the word "normal" in the question stem (or if you're given a mean, sd and n). We now have to look out for sample mean questions. If you're given a sample size (n), it's probably a sample mean question. They still divide into two types, but use the correct formula! Direct- Given a value, asked for a probability or a percent. Inverse- Given a probability or a percent, asked for a value.
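The only change for a sample mean question is the standard deviation. This sketch assumes the "correct formula" is the usual one, sd of x-bar = sd / sqrt(n); the mean of 100, sd of 15, and n of 25 are made up.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_distribution(mean, sd, n):
    """Distribution of x-bar: same mean, standard deviation shrunk by sqrt(n)."""
    return NormalDist(mean, sd / sqrt(n))

x_bar = sample_mean_distribution(100, 15, 25)  # sd of x-bar is 15/5 = 3
p_above_106 = 1 - x_bar.cdf(106)               # direct question about the sample mean
p_single_above_106 = 1 - NormalDist(100, 15).cdf(106)  # same question for ONE value
```

The two probabilities differ a lot (roughly 0.02 versus 0.34), which is why spotting the sample size matters.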
Gathering Data- Types of Data
Be able to identify if a variable is quantitative (numerical, takes a quantity) or categorical (divides people into categories). Remember, not all numerical variables are quantitative: your Red ID, SSN, or zip code are all numerical variables that divide people into categories.
Experiment & Observational Studies
Be able to tell the differences between an observational study and an experiment. If we assign individuals to a treatment group (this can be as innocuous as asking them to eat a serving of vegetables or watch a commercial), it's an experiment. If we just ask them about or observe their behavior, then it's an observational study.
Experimental Designs-Block Design
Blocks of similar subjects are formed and within each block, they are randomly assigned to treatment groups. -Similar to a stratified set up, if we first divide our subjects into groups that are similar then randomly assign from each group into our treatment groups, we have a block design.
Experiment Terminology
Of all the experiment terminology, it's most likely you'll be asked about the number of treatments or to list all the treatments. Remember that treatments include all possible combinations of the levels of each factor. If given a description of an experiment, you should be able to identify each of the following: Experimental units/Subjects: Individuals you are studying in the experiment. Response Variable: Outcome or dependent variable. This is what we are ultimately measuring or interested in (our y variable in a regression context). Factors: The explanatory (independent, x) variables that are thought to influence the response variable studied. Treatment: A specific condition applied to the subjects (a combination consisting of a level of each factor). A good experiment will usually employ the following to help CONTROL for lurking variables. Know these definitions. Placebo- A dummy treatment. Used to control for the placebo effect. Blinding- A double-blind experiment is one where neither the participant nor the researcher taking the measurements knows who had which treatment. A single-blind experiment is one where the participants do not know which treatment they have been assigned. Helps reduce bias! A last definition associated with experiments that you may be asked about: -Statistically Significant: An observed effect so large that it would rarely occur by chance.
Displaying Data- Stem Plots
On back-to-back stemplots, always make sure you read away from the stem and that you're answering for the group the question is about!
Sample Sizes
Remember: if a previous proportion or estimate isn't given and we're trying to estimate a PROPORTION, use p-hat=0.5! In mean questions, it's all about reading carefully for which type of sd you've been given. -In sd unknown questions, you'll need to pretend that s is sd and use the top formula just to get an n, so you can have a df to use in your t table for the REAL formula. You do not need to do the second step if n is more than 60 in the first step. Cautions: Again, if p-hat isn't given, use p-hat=0.5. Remember, always ROUND UP! We can't have 1/2 of a person. That weird arrow is a reminder that we do two steps if we have a sample standard deviation.
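The formula sheet itself is not reproduced here, so the sketch below assumes the standard sample size formulas: n = (z*)^2 * p-hat * q-hat / ME^2 for a proportion and n = (z* * sd / ME)^2 for a mean with sd known. It covers only the z-based first step, not the follow-up t step, and all function names and inputs are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def z_star(confidence):
    """Critical value z* for a two-sided confidence level such as 0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def n_for_proportion(margin, confidence, p_hat=0.5):
    """Sample size for a proportion; p_hat defaults to 0.5 when no estimate is given."""
    z = z_star(confidence)
    return ceil(z**2 * p_hat * (1 - p_hat) / margin**2)  # always ROUND UP

def n_for_mean(margin, confidence, sd):
    """Sample size for a mean using sd (or s standing in for sd in the first step)."""
    z = z_star(confidence)
    return ceil((z * sd / margin) ** 2)  # always ROUND UP

n_prop = n_for_proportion(margin=0.03, confidence=0.95)  # 1067.07... rounds up to 1068
n_mean = n_for_mean(margin=2, confidence=0.95, sd=15)    # 216.08... rounds up to 217
```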
Errors in Hypothesis Testing
Remember that all errors are conditional probabilities and that Type I error is RTN, if H0 is true. When you make an error, your decision and the truth are not the same thing. -Type I - H0: TRUTH, Ha: DECISION -Type II - H0: DECISION, Ha: TRUTH -Power: a correct decision, and moreover the decision we want - Ha: DECISION and TRUTH
Confounding V Lurking Variables
Remember the drawings from class. Confounding variables add another relationship with the response variable (in addition to the one between the existing explanatory and response variables), which could be the entire cause of the relationship we saw between the explanatory and response variable. -Two variables are confounded when you can't tell which of them (or whether it's the combination) had an effect on the response variable. --Example: You might want to test if a fertilizer helps your garden produce more tomatoes. Suppose you spread it on half the plants and record the number of tomatoes, but you spread it on the sunny half, leaving the shady half unfertilized! Now, even if you have more tomatoes on your fertilized plants, you don't know whether that's because of fertilizer or sunshine (or the two together, which is actually the most likely case). -A lurking variable is sometimes referred to as common response: it's a variable that drives two other variables, creating the impression of an association between them while in reality the two variables are BOTH response variables. --Example: Suppose a researcher finds a strong association between the number of computers per capita and life expectancy: countries with fewer computers have lower life expectancy. Do we think that the computers affect life expectancy in some way? NO! The general socioeconomic status (which could be measured in something like gross domestic product (GDP)) is likely causing both the number of computers and the life expectancy to rise.
Confidence Interval Formula
Remember your PROPORTION formula decomposes into the parts of the confidence interval: phat +- Z*(sqroot phat qhat/n) phat= Point Estimate Z*(sqroot phat qhat/n)=ME (sqroot phat qhat/n)=SE Same thing is true for your MEAN formula x-line +- t*(s/sqroot n) The mean formula for when sd is known doesn't have a standard error because the standard deviation is actually known so the term in the place where SE would be is the real standard deviation.
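The decomposition of the proportion formula can be sketched as follows, with a made-up sample of 50 successes out of 100 at 95% confidence; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Return the point estimate, SE, ME, and interval for a proportion."""
    p_hat = successes / n                      # point estimate
    q_hat = 1 - p_hat
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * q_hat / n)               # standard error
    me = z_star * se                           # margin of error
    return p_hat, se, me, (p_hat - me, p_hat + me)

p_hat, se, me, interval = proportion_interval(50, 100)
# p_hat = 0.5, se = 0.05, me is about 0.098, interval is about (0.402, 0.598)
```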
Displaying Data-Histograms
Remember, the first step should be filling any missing information into your histogram: putting any missing x values down and writing how many observations are in each bar at the top. Know how to find the median, Q1, and Q3 intervals for a histogram. -Enter the column values in L1 and frequencies in L2, then do 1-VAR-STATS L1, L2. Know how to find probabilities using a given histogram.
If we are interested in a proportion what is next?
Rule of thumb satisfied? (i.e. BOTH np and nq are greater than or equal to 10.)
Experimental Designs-Completely Randomized Design
Similar to an SRS setup, each subject is equally likely to get assigned to any treatment
Properties of a Student's Distribution
Similar to the normal in that it is a symmetric bell curve, but with two commonly tested differences. There is more area in the tails to account for the fact that we're estimating the standard deviation. Approaches a normal distribution as the degrees of freedom or sample size increases.
Table with ways to spot statistics and parameters
Statistic: p-hat, x-bar, s; keywords: sample, our sample size (n), estimate, survey, poll. Parameter: p, mean, sd; keywords: population, known, true, assume, suppose.
Hypothesis Testing-Step 1
Step 1: Identify Hypotheses. Always set H0: mean = mean0 (H0: p = p0). -mean0 (p0) is the given mean (proportion) we're interested in testing. -Usually easiest to determine x-bar (p-hat), which means mean0 (p0) is the other mean (proportion) mentioned. --For p-hat, we can calculate it using x/n --For x-bar, we can calculate it using a list of values. Read the prompt to determine the alternative hypothesis. -Ha: mean > mean0 (p > p0): increase, greater than, more than, better -Ha: mean < mean0 (p < p0): decrease, smaller than, less than -Ha: mean (not equal) mean0 (p (not equal) p0): changed, different than, not equal to. IF THE PROMPT HAS AT LEAST (GREATER THAN OR EQUAL TO) OR AT MOST (LESS THAN OR EQUAL TO), THAT IS THE NULL AND ITS COMPLEMENT IS THE ALTERNATIVE!
Hypothesis Testing-Step 2
Step 2: Calculate Test Statistic Use the formula sheet. Read carefully for mean problems for what type of sd we were given.
Hypothesis Testing-Step 3
Step 3: Find the p-value. If you calculated a Z in step 2, look up the negative Z in your Normal Table, using the alternative to determine the direction to look up. For two-sided tests, multiply your p-value by 2. -The two-sided test (not equal) is looking for the total in BOTH tails. (YOUR P-VALUE SHOULD ALWAYS BE LESS THAN .5!) If you calculated a t in step 2, look up the positive t in your t table. -First you need to calculate df=n-1 -Then look for where t would fall along that row; you will get a range for your p-value. For two-sided tests, multiply both sides of the range by 2!
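For the Z case, the table lookup can be sketched with the standard normal distribution in Python's standard library (the t case is left out because the standard library has no t distribution). The labels "less", "greater", and "two-sided" and the test statistic of 1.96 are made up for illustration.

```python
from statistics import NormalDist

def p_value_from_z(z, alternative):
    """p-value for a z test statistic; alternative is 'less', 'greater', or 'two-sided'."""
    standard_normal = NormalDist()
    if alternative == "less":
        return standard_normal.cdf(z)              # area to the left of z
    if alternative == "greater":
        return 1 - standard_normal.cdf(z)          # area to the right of z
    if alternative == "two-sided":
        return 2 * standard_normal.cdf(-abs(z))    # one tail from the negative Z, doubled
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

one_sided = p_value_from_z(1.96, "greater")    # about 0.025
two_sided = p_value_from_z(1.96, "two-sided")  # about 0.05
```

When the test statistic falls in the direction of the alternative, the one-sided answer equals the area below the negative Z, which is the table lookup described above.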
Hypothesis Testing-Step 5
Step 5: Interpret your decision RTN: There is enough evidence to conclude the true mean (proportion) of ______ (whatever we're testing) is ____(less than, greater than, not equal to - depends on alternative) ____. FTRN: There is NOT enough evidence to conclude the true mean (proportion) of ___ (Whatever we're testing) is ____(less than, greater than, not equal to -- depends on alternative) ___. -The interpretations should end with the ALTERNATIVE HYPOTHESIS.
Experimental Designs-Matched Pairs
Subjects are paired according to variables that affect the response and then randomly assigned, with one going to the treatment group and the other to the control group. -Keep an eye out for before and after studies. They are a matched pair design where each subject is their own control! If it seems like everyone got the treatment, check to see if you've been given a before and after study.
Definition of P-Value
The definition of p-value is another conditional probability: Probability of getting a sample value as or more extreme than that obtained in our sample (or a test statistic as or more extreme than the one obtained from our sample), GIVEN that the true proportion is actually the null proportion (i.e. the null hypothesis is true.)
Cluster Sampling
The population is divided into groups that are dissimilar on characteristics. We choose entire clusters at random and combine 1 or more clusters to get our overall sample.
Stratified Sampling
The population is divided into groups that are similar on some characteristic or set of characteristics. We then choose people at random (by SRS) from each group (we sample every group rather than taking a few entire groups!) and combine those samples into our overall sample.
Voluntary Response Bias
The type of bias associated with voluntary response samples. -Web surveys, mail-in surveys, call-in surveys
Contingency Table
There should be at least one contingency problem. It will either be one which can be solved with the aid of a contingency table or one with a given table. -For the created table, you can then be asked for one of the other "and" probabilities: just pick it off the table! -For both, the most commonly asked for things are the "or" probabilities and the conditional probabilities.
Approximate Normal Questions-#
These are the same as direct normal question, but you need to first calculate the mean and standard deviation from the "number" section of your formula sheet.
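As a sketch of that first step, the code below assumes the "number" section of the formula sheet gives mean = np and sd = sqrt(npq), the standard binomial formulas. It applies no continuity correction, which is also an assumption about the course's method; n = 100 and p = 0.3 are made up.

```python
from math import sqrt
from statistics import NormalDist

def count_distribution(n, p):
    """Approximate normal distribution of the count X (assumes np and nq are both >= 10)."""
    return NormalDist(n * p, sqrt(n * p * (1 - p)))

dist = count_distribution(100, 0.3)   # mean 30, sd sqrt(21), about 4.58
p_fewer_than_25 = dist.cdf(25)        # left-side area, read directly
```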
Statistical Significance
Things to know about statistical significance: -Statistical significance is equivalent to rejecting the null. -We have statistical significance when the difference between our sample proportion and null proportion is LARGE ENOUGH that it would RARELY have happened by chance. -This is very different than practical significance! By taking a very large sample, I can make any difference statistically significant. However, if the magnitude of that difference is very small, it may not be meaningful in reality. --Consider a researcher who sampled 2000 sharks of a particular type to determine that they weigh, on average, 1 pound less than another type of shark. His findings may be statistically significant, but the difference is so small that it is relatively pointless.
Determining question types- Direct Calculations
This is when we're given a value and asked to find a probability or percent above or below that value (or between two values). Normally this is as easy as looking at the question: did they ask you "what is the probability..." or "what percent?" Step 0: Draw your curve! This stops you from choosing the opposite area. There will be a keyword letting you know which side you're interested in. Step 1: X -> Z: Convert the given value to a Z-score using the corresponding DIRECT formula in the formula sheet. Step 2: Z -> %: Convert that Z into a probability (area under the curve) by looking it up in the Z-table. Remember that your table gives out the left side probability. You'll need to subtract from 1 to get the correct area for right side ("at least" or "more than") questions. Calculator: normalcdf(lower, upper, mean, sd)
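The X -> Z -> % steps can be sketched as below, using a made-up normal distribution with mean 100 and sd 15. The function names are hypothetical, and the Z-table lookup is replaced by the standard normal cdf.

```python
from statistics import NormalDist

def left_area(x, mean, sd):
    """P(X < x): convert to Z, then read the left side probability."""
    z = (x - mean) / sd                 # Step 1: X -> Z
    return NormalDist().cdf(z)          # Step 2: Z -> % (what the Z-table gives)

def right_area(x, mean, sd):
    """P(X > x): subtract the left side probability from 1."""
    return 1 - left_area(x, mean, sd)

def between_area(lower, upper, mean, sd):
    """P(lower < X < upper), like normalcdf(lower, upper, mean, sd)."""
    return left_area(upper, mean, sd) - left_area(lower, mean, sd)

p_more_than_130 = right_area(130, 100, 15)        # about 0.0228
p_between = between_area(85, 115, 100, 15)        # about 0.6827
```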
What two types of z-score questions can you get?
Two types of questions: Unusual: Scores further away from zero are more unusual, regardless of sign. Better: Be careful about direction if asking who performed "better": for a test, a higher z-score would be better since scoring higher on the test is better; however, for a race, a lower z-score is better since a smaller time is actually a better performance.
Interpreting Confidence Intervals
We can be ____%(Confidence level) CONFIDENT that the TRUE(population) PROPORTION/MEAN of ____ is in the interval.
Systematic Sampling
We choose every nth item
Simple Random Sample (SRS)
We choose people at random (e.g. picking names out of a hat). Each member of the population has an equal chance of being included.
Original distribution is Nonnormal
x-bar ~ AN(mean, S.D.)
Original distribution is Normal
x-bar ~ N(mean, S.D.)
Rule of thumb is satisfied(#)
X ~ AN(mean, S.D.)
Rule of thumb is not satisfied(#)
X ~ B(n, p)
Principles of Experimental Design
You may be asked questions regarding the basic principles of a good experiment, so make sure you know these 3! -Control: Control the experimental conditions for all treatment groups to ensure that lurking variables do not bias the results; part of this is including a control (non-treatment) group. -Randomization: The experimental units must be RANDOMLY assigned to the treatments. -Replication: Replication of our experiment must be used to reduce chance variation in the results. All of the above requirements are important because they help control for lurking variables.
Determining question types-Number/Proportion Questions
You'll be given n (usually a sample size) and p (as a decimal or percent, possibly even as a ratio). Number- Find the probability that fewer than 30 people attended a concert. --Binom- if np or nq is less than 10 --AN- if np and nq are greater than or equal to 10. Proportion- Find the probability that more than 52% of people use liquid handsoap.
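The rule-of-thumb check for a number question is mechanical enough to sketch in code; the function name and the example values are made up.

```python
def number_question_method(n, p):
    """Pick the method for a # question using the np >= 10 and nq >= 10 rule of thumb."""
    q = 1 - p
    if n * p >= 10 and n * q >= 10:
        return "approximately normal"
    return "binomial"

small_sample = number_question_method(20, 0.3)    # np = 6, so binomial
large_sample = number_question_method(100, 0.3)   # np = 30 and nq = 70, so approx normal
lopsided = number_question_method(50, 0.9)        # nq = 5, so binomial even though np = 45
```

The last case shows why both np and nq need checking.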
Gathering data- Basic Definitions
You'll want to be able to differentiate between the following four definitions: population, sample, parameter and statistic. If given an example of a statistical study, could you identify each part? -Population: The entire group of individuals that we want information about. Any numeric value from the population is a parameter. --Known, suppose, assume and population are all keywords signaling a parameter -Sample: Subset of the population that we actually examine to gather information about the population. Any numeric value from the sample is a statistic. --Survey, sample, poll and estimate are all keywords signaling a statistic
Rule of thumb is satisfied(Prop.)
p-hat ~ AN(mean, S.D.)