stats 4 dummies
correlation
"Correlation" is not the same as "association"; specifically, correlation is "linear association".
proportion
# of individuals within range/ total # of individuals in data set
discrete examples
# of pets, # of siblings
Margin of Error
Half the width of the confidence interval: (upper - lower) divided by 2. For the interval (.4, .7), (.7 - .4) / 2 = .15
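A quick way to check the half-width rule is a minimal Python sketch like the one below. The function name `margin_of_error` is made up for illustration; the endpoints are the ones on the card.

```python
def margin_of_error(lower, upper):
    """Half the width of a confidence interval."""
    return (upper - lower) / 2

# Interval endpoints from the card: lower = .4, upper = .7
me = margin_of_error(0.4, 0.7)
print(round(me, 2))  # 0.15
```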
ways to draw a random sample
(1) Simple Random Sampling (SRS) (2) Stratified Random Sampling (3) Cluster Sampling (4) Multistage Sampling (5) Systematic Random Sampling
replication
(a) Repeating conditions within an experiment to determine the reliability of effects and increase internal validity; (b) repeating whole experiments to determine the generality of findings of previous experiments to other subjects, settings, and/or behaviors
response bias
(answers incorrectly or question is misleading), when respondents' answers may be affected by survey design
one tailed
(directional) test Rejection region is located in just one end/ tail of the sampling distribution 5% is placed all in one end/ tail
two tailed tests
(non-directional) test Rejection regions are located in both ends/ tails of the sampling distribution Always use two-tailed in this class (and research) mu ≠ mu1 Split the 5% (2.5% each) between two ends/ tails
Degrees of freedom for chi square
(r - 1)(c - 1)
non-response bias
(refuse to participate or fail to answer), bias introduced to a sample when a large fraction of those sampled fails to respond
sampling bias
(undercoverage)exists when a sample is not representative of the population from which it was drawn
Inferential Statistics
-Lets us generalize beyond our actual data -Designed to help make decisions and test our hypotheses -Infers your data -(z-tests, t-tests, ANOVA, Chi square)
Descriptive statistics
-Organize and summarize variability in our actual data -Makes things concise and easy to understand -Describes your data -(frequency, mean, median, mode, correlation, regression)
the 1.5xIQR rule
-Q3+(1.5)(IQR) -Q1-(1.5)(IQR)
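The rule can be sketched in Python as follows. The data values and the helper name `fences` are made up, and quartile conventions differ between textbooks: this sketch assumes Q1 and Q3 are the medians of the lower and upper halves, leaving the middle value out when n is odd.

```python
from statistics import median

def fences(data):
    """Return (lower_fence, upper_fence) from the 1.5 x IQR rule."""
    values = sorted(data)
    half = len(values) // 2
    q1 = median(values[:half])    # median of the lower half
    q3 = median(values[-half:])   # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [2, 4, 5, 6, 7, 8, 9, 25]   # made-up values
low, high = fences(data)
outliers = [x for x in data if x < low or x > high]
print(low, high, outliers)  # -1.5 14.5 [25]
```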
inference
Using results from a sample statistic value to draw conclusions about the population parameter.
predicted y (symbolized by y-hat )
Value for y at a specified x as predicted by the regression equation; computed by plugging the value for x into the equation and solving for y.
Discrete Quantitative
Values form a set of whole numbers
Continuous quantitative
Values measured on an interval (time)
σ2
Variance of a population or distribution.
equal variance (or equal standard deviation)
Variances (or standard deviations) for each of the treatment groups (or samples) in ANOVA are all equal. In regression, the variances of the y's at each x are all assumed to be equal.
natural variation
Variation from object to object within a population.
bias (all can contribute to)
Volunteer Response Sample Convenience Sample Not identifying the population correctly
association vs. causation
We can only argue causation from association if the results having significant association are from an experiment.
probability rules
1. The probability P(A) of any event A satisfies 0 ≤ P(A) ≤ 1. 2. If S is the sample space, then P(S) = 1. 3. The disjoint addition rule: if A and B are disjoint, P(A or B) = P(A) + P(B). 4. For any event A, P(A does not occur) = 1 - P(A).
Empirical rule (3)
1) 68% of observations fall within 1 standard deviation of the mean 2) 95% of observations fall within 2 standard deviations of the mean 3) 99.7% of observations fall within 3 standard deviations of the mean
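These three percentages are areas under a Normal curve, so they can be checked with the standard library's `statistics.NormalDist`. This is only a sketch; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal: mean 0, standard deviation 1

# Area under the curve within 1, 2 and 3 standard deviations of the mean
areas = [z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)]

for k, area in zip((1, 2, 3), areas):
    print(k, round(area, 4))   # roughly 0.68, 0.95 and 0.997
```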
Regression problems (4)
1) extrapolation 2) influential outliers 3) correlation does not imply causation 4) lurking variables
What are the steps in the computation of a t test to test the significance of a correlation coefficient?
1. A statement of the null and research hypotheses. 2. Setting the level of risk (or the level of significance or Type I error) associated with the null hypothesis. 3. and 4. Selection of the appropriate test statistic. 5. Determination of the value needed for rejection of the null hypothesis using the appropriate table of critical values for the particular statistic. 6. A comparison of the obtained value and the critical value is made. 7. and 8. Making a decision.
The Binomial Setting
1. There are a fixed number n of observations. 2. The n observations are all independent. That is, knowing the result of one observation does not change the probabilities we assign to other observations. 3. Each observation falls into one of just two categories, which for convenience are called "success" and "failure". 4. The probability of a success, p, is the same for each observation.
Practical importance
A difference between the observed statistic and the claimed parameter value that is large enough to be worth reporting. To assess practical importance, look at the numerator of the test statistic and ask "Is it worth anything?" If yes, then results are also of practical importance. Note: Do not assess practical importance unless results are statistically significant.
sampling distribution of x-bar
A distribution of the sample mean; a list of all the possible values for x-bar together with the frequency (or probability) of each value.
histogram
A graphical display of a quantitative data set; data are grouped into intervals (usually of equal width) and a bar is drawn over each interval having height proportional to the frequency (or percentage) of values in the interval. Values of the variable are given on the x axis and frequencies (or percentages) are given on the y axis. Histograms are examined to determine shape, center and spread.
histogram:
A graphical display of a quantitative data set; data are separated into intervals of equal width and a bar is drawn over the interval having height equal to the frequency (or percentage) of values in the interval. Values of the variable are given on the x axis and frequencies (or percentages) are given on the y axis. (Hence, a histogram gives a distribution.) Histograms are described by shape, center and spread. Used for large data sets.
pie chart:
A graphical display of categorical data using a "pie"; each category is represented as a slice where the size of the slice is proportional to the percentage of data in that category.
stemplot (also called stem and leaf plot):
A graphical representation of a quantitative data set. Leading values of each data point are presented as stems and second digits are given as leaves.
bar graph
A graphical representation of categorical data. Names of each category are listed on the x-axis and a bar that has height representing the frequency (or percentage) in that category is placed over each category name.
mall-intercept sample
A sample where respondents are contacted in a shopping mall or similar location. Often the method of selection is haphazard although occasionally systematic.
estimate of a parameter
A single value or a range of values used to estimate a parameter.
interaction
A situation that occurs in an experiment when the effect of one explanatory variable on the response variable is not the same across all levels of another explanatory variable.
What is the null hypothesis?
A statement of equality between sets of variables.
What is the research hypothesis?
A statement of inequality between two variables.
standard error
A statistic providing an estimate of the possible magnitude of error. The larger the standard error of measurement, the less reliable the score. The standard error of the estimate is a rough measure of the average amount of predictive error (average amount y deviates from y')
robust
A statistical procedure that is insensitive to moderate deviations from an assumption upon which it is based; e.g., t procedures give P-values and confidence levels that are very close to correct even when data are not normally distributed.
experiment
A study where treatments are deliberately imposed on the individuals in the study before data is gathered in order to observe their responses to the treatment.
spread
A summary number representing variability of the observations. Measures of spread include range, interquartile range, and standard deviation
resistant measure
A summary number that is not affected by outliers. The median is a resistant measure of center.
location measure
A summary number that tells the location (typically the center) of a data set on the number line.
random number table
A table consisting of the digits 0 through 9 in equal proportions such that the digit in any position in the table is independent of the digits in neighboring positions (i.e., there is no pattern in the order of the digits.)
approximate t test:
A test for comparing the means of two independent samples or two treatments where the test statistic has an approximate t distribution. This is the preferred two sample test, but it requires statistical software.
Chi-square test statistic
A test statistic computed from data that has an approximate chi-square distribution.
scatterplot
A two dimensional plot used to examine direction, form and strength of the relationship between two quantitative variables.
categorical (or qualitative) variable:
A variable that can be classified into groups or categories such as gender and religion.
response variable:
A variable that gives the outcomes of interest of the study (may not be a number); also called the dependent variable.
lurking variable:
A variable that has an important effect on the relationship among the variables in a study but is not taken into account.
explanatory variable
A variable that may or may not explain the outcomes (responses) of a study, also called independent or predictor variable.
lurking variable
A variable that the researcher is not necessarily interested in studying but which affects the relationship between the explanatory variable and the response variable.
quantitative variable
A variable with numerical values such as height or weight. This type of data required for both variables in regression analysis.
lack of realism
A weakness in experiments where the setting of the experiment does not realistically duplicate the conditions we really want to study.
How large a sample?
ALWAYS ROUND UP
Condition for expected counts in chi square
All expected counts greater than 5
Why is central limit theorem so important?
Allows us to compute probabilities on x-bar and p-hat
Alternate Hypothesis and null
Always written in terms of the population parameter (mew), never the sample statistic
Normal distribution (a density curve)
Always on or above the horizontal axis Has area exactly 1 underneath curve Area under the curve for any interval is the proportion of all values that fall in that range
right-tailed alternative hypothesis:
An alternative hypothesis that states the parameter value of a treatment or population is greater than some number or the parameter from another treatment or population.
left-tailed alternative hypothesis
An alternative hypothesis that states the parameter value is less than some number or the parameter from another treatment or population.
one-sided or one-tailed:
An alternative hypothesis where the researcher is interested in deviations in only one direction. (" < " or " > " is in Ha.)
Association
An association exists between two variables if particular values for one variable are more likely to occur with certain values of the other variable
expected count
An estimate of how many observations should be in a cell of a two way table if there were no association between the row and column variables.
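As a hedged illustration, expected counts are usually computed as (row total × column total) / grand total, and the chi-square statistic then compares them with the observed counts. The two-way table below is made up and the variable names are only for this sketch.

```python
observed = [[10, 20],
            [30, 40]]   # made-up two-way table of counts

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# expected count = (row total * column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)
print(expected)              # [[12.0, 18.0], [28.0, 42.0]]
print(round(chi_square, 3))  # 0.794
print(df)                    # 1
```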
Confidence interval
An estimate of the value of a parameter in interval form with an associated level of confidence; in other words, a list of reasonable or plausible values for the parameter based on the value of a statistic. E.g. a confidence interval for µ gives a list of possible values that µ could be based on the sample mean.
double blind study
An experiment where neither the subjects nor the diagnosticians (e.g. doctor or nurse) know which treatment is administered to whom.
randomized block design (RBD)
An experimental design where treatments are randomly allocated within each block.
random outcome
An individual outcome from a random phenomenon.
one sample t procedure for mean
An inferential procedure using the mean from one sample to test or estimate the population mean; the test statistic follows a t distribution.
one sample z procedure for proportion
An inferential procedure using the proportion from one sample to test or estimate the population proportion; the approximate distribution of the test statistic is z or standard normal.
Outliers
An observation in a data set that lies an abnormal distance from other values in a random sample population. > Q3 + (1.5)*IQR < Q1 - (1.5) * IQR
outlier:
An observation that falls outside the overall pattern of the data set. Can be detected by checking: observation < Q1 - 1.5 IQR or observation > Q3 + 1.5 IQR.
influential point
An observation that substantially alters the fitted regression equation.
lower tailed alternative hypothesis
Another name for a left-tailed alternative hypothesis.
variable
Any characteristic of an individual or object; it may take on any number of values either categorical or numerical.
Variable
Any characteristic that is observed for the subjects in a study
t-test
Any test of significance where the test statistic can be modeled with the t-distribution; used when σ is unknown.
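For a one-sample test of a mean, the usual statistic is t = (x-bar - claimed mean) / (s / sqrt(n)) with n - 1 degrees of freedom. The sketch below plugs in the numbers from the fishing line question elsewhere in this set (claimed mean 30, n = 25, x-bar = 27.994, s = 0.846); the function name is hypothetical.

```python
from math import sqrt

def one_sample_t(x_bar, mu_0, s, n):
    """t statistic: (sample mean - claimed mean) / standard error."""
    standard_error = s / sqrt(n)
    return (x_bar - mu_0) / standard_error

t = one_sample_t(x_bar=27.994, mu_0=30, s=0.846, n=25)
df = 25 - 1
print(round(t, 2), df)  # -11.86 24
```

A t value this far below zero is consistent with the left-tailed research question on that card; turning it into a P-value still needs a t table or software.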
The way to set up statistical problems
Ask: - what are the n individuals/units in the sample (of size "n"?) - what is being recorded about those n individuals/ units? - is that a number (quantitative) or a statement (categorical)?
Parameters of a binomial distribution for X successes in n observations
B(n,p) n is the number of observations. p is the probability of success on each observation. X is the count of successes, and can be any whole number between 0 and n.
27. What are the assumptions necessary for the above regression analysis?
B. The y's at each x are Normally and independently distributed with equal variances. NOT C. The samples are random and the sample sizes are large.
Slope
B1, the amount that mean of y changes when x increases by one unit
Linear Regression Null
B=0
What is the prob that an individual will be less than--use this
BECAUSE IT DOESN'T USE A SAMPLING DISTRIBUTION
Boxplots vs. bar graphs
Bar graphs are for categorical data
Why is it so important to use tstar and not Zstar for a confidence interval for mew?
Because you only know s, not sigma
interviewer bias
Bias introduced into survey results by body language, voice intonation, gender, race, etc. of an interviewer
selection bias:
Bias introduced into sample results due to how the units were selected for sampling.
measurement bias
Bias introduced into survey results because of poorly worded questions, interviewer effects, measuring instrument difficulties, etc.
respondent bias
Bias resulting from respondents lying when asked about illegal or unpopular behavior, forgetting or confusing past behavior, having no knowledge about the question content and not wanting to appear stupid, etc.
under-coverage bias
Bias that occurs in sample results because a segment of the population with a certain characteristic is not sampled.
Y intercept
B0, the predicted value of y when x=0
• How do you distinguish between matched pairs and two sample t
Can do it with twins, a before and after measurement with one person, a couple treated as a unit, two treatments on the same person -- ALWAYS MATCHED PAIRS WITH THESE. Two sample t: random sample of BYU and random sample of U of U -- compare them
Types of variables (3)
Categorical Discrete Quantitative Continuous Quantitative
causation
Changes in the explanatory variable directly affect the response variable. Experiments are needed to verify causation.
How are conditions from a t-test checked?
Check if data came from an SRS and check a plot of the data. (Are there outliers or strong skew?)
Sampling Distribution of a Count
Choose an SRS of size n from a population with proportion p of successes. When the population is much larger than the sample, the count X of successes in the sample has approximately the binomial distribution with parameters n and p.
Interval estimate for a mean
Confidence interval
When to use nPhat vs npnot tests
For a confidence interval we don't have a pnot, so use nPhat; for a hypothesis test, use npnot
linear quantitative
Correlation measures the strength of _____ relationship between two ________ variables.
1. The chi-square test statistic measures A. the difference between the observed statistic and the claimed parameter value. B. by how many standard errors the two sample statistics differ. C. the number of squared deviations the observed values are from the mean. D. the differences between the observed and expected counts.
D -- NOT THE CLAIMED PARAMETER VALUE, BUT OBSERVED AND EXPECTED COUNTS--JUST DIFFERENCE
Regression response variable
Dependent variable, the outcome variable on which comparisons are made
shape
Description of the overall pattern of a histogram using terms including symmetric, right skewed, left skewed, flat (uniform), bell-shaped, etc.
measurement variation
Differences in repeated measurements on the same object.
cluster random sample
Divide the population into small groups and select any number of entire groups.
Graphs for quantitative variables
Dot plot, stem and leaf plot, histogram, boxplot
individual
Each object or unit described or examined in a data set
Subject
Entities that we measure in a study
b
Estimated (sample) slope in a regression equation.
a
Estimated (sample) y-intercept in a regression equation.
mutually exclusive events
Events that cannot occur together. A single person's birth month can't be both January and February Pr( A or B) = Pr(A) + Pr(B)
X
Explanatory variable in regression analysis.
True or False: A 95% confidence interval for mew gives us a set of reasonable values for the response variable
FALSE, it's a set of reasonable values for the population mean
The tails of a standard Normal curve are fatter than the tails of a t-distribution curve with 26 degrees of freedom. False True
False, the opposite is true
The probability that the value for μ is in this 95% confidence interval, (27.595, 28.293), is 0.95.
False. It is either in there or it is not. Either 100% of the time or 0% of the time.
Box plot (5)
Fence 1: Q1-(1.5)*(Q3-Q1) Q1: lower half median Q2: median Q3: upper half median Fence 2: Q3+(1.5)*(Q3-Q1)
Factorial Notation
For a given number n, its factorial n! is n! = n ⋅ (n-1) ⋅ (n-2) ... 3 ⋅ 2 ⋅ 1 And 0! = 1
association
For quantitative data, large values of one variable tend to occur with large (or small) values of another variable. For categorical data, certain responses for one variable tend to occur with certain responses of the other variable.
Scatter plot
For two quantitative variables Explanatory variable is the x variable Response variable is the y variable
Z score
Gives the number of standard deviations the observation is from the mean, and the direction (outliers include any z score above 3 or below -3)
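A z score is computed as (observation - mean) / standard deviation. A minimal sketch, with made-up numbers and an illustrative function name:

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean (sign gives direction)."""
    return (x - mean) / sd

print(z_score(130, 100, 15))  # 2.0  (two standard deviations above the mean)
print(z_score(85, 100, 15))   # -1.0 (one standard deviation below the mean)
```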
Nominal
Hair color (N)
Bell shaped
Histogram in which the bars make a pyramid
Bimodal
Histogram where the bars make two bell shapes
interquartile range
IQR=Q3-Q1; spans the middle 50% of the data (the 25% of observations just below the median and the 25% just above it)
Binomial Probability
If X has the binomial distribution with n observations and probability p of success on each observation, the possible values of X are 0, 1, 2, ..., n. If k is any one of these values, P(X = k) = C(n, k) p^k (1 - p)^(n - k), where C(n, k) = n! / (k! (n - k)!).
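The standard binomial formula, P(X = k) = C(n, k) p^k (1 - p)^(n - k), can be evaluated with `math.comb`. This is a sketch with made-up values of n, p and k; `binomial_pmf` is an illustrative name, not a calculator function.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial count X with n observations and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binomial_pmf(3, 10, 0.5))  # 0.1171875

# The probabilities over all possible values 0, 1, ..., n add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(total)  # 1.0
```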
What is the difference between causation and association?
Just because things are related and share something in common with one another has no bearing on whether there is a causal relationship between the two.
2 Sample z for proportions requirement
Just np checks and randomization
positive association
Large values of one variable tend to occur with large values of another variable and small values of one variable tend to occur with small values of the other.
c
Level of confidence
α
Level of significance or probability of a type I error (probability of rejecting a true null hypothesis).
Asked for P Value-
Look up Negative Z
Normal CDF
Lower,Upper, Mean, SD
Given, N, P, X- asked to get mean+ SD NP Bigger than 10-
M= NP SD= Square root of NPQ
3. Gas mileage for 10 cars with dirty air filters and clean air filters was studied. Each car was tested once with a clean air filter and once with a dirty air filter (with the order of the testing randomized.) The research question is: "Do cars get better miles per gallon on average with clean air filters?" What type of study
MATCHED PAIRS
tstar times s/sqrtn
Margin of error for one-sample t confidence interval for μ
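A small sketch of this calculation, using s = 0.846 and n = 25 from the fishing line question in this set. The value t* = 2.064 is an assumption taken from a standard t table for 95% confidence with 24 degrees of freedom; the function name is made up.

```python
from math import sqrt

def t_margin_of_error(t_star, s, n):
    """Margin of error for a one-sample t interval: t* times s / sqrt(n)."""
    return t_star * s / sqrt(n)

# t_star = 2.064 is the 95% table value for df = 24 (assumed, looked up by hand)
me = t_margin_of_error(t_star=2.064, s=0.846, n=25)
print(round(me, 3))  # 0.349
```

The interval itself is then x-bar minus and plus this margin.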
C→Q
Matched pairs t Two-sample t ANOVA
Matched pairs t vs. 2 sample t for means hypothesis or 2 sample z for proportions
Matched pairs: μd = 0. 2 sample t: μ1 - μ2 = 0 OR μ1 = μ2. 2 sample z: p1 - p2 = 0
Range
Max-min
skewed left
Mean < Median
population mean (μ)
Mean of all the observations in the population.
μ
Mean of the sampling distribution of x-bar.
center
Mean, Median, Mode
quartiles
Measures of position that divide an ordered group of data into 4 subgroups or parts.
skewed right
Median < Mean
Parameter in hypothesis of matched pairs vs 2 sample t
Mewd for matched pairs-the minus for the others
BinomPDF
N,P,K
Can categorical data have a normal distribution?
NO
Quality control
Nine consecutive either above or below the midline means a problem. One outside 3 SAMPLE standard deviations out signals a problem
Null Hypothesis for testing whether a linear relationship exists -- regression analysis in general?
No linear relationship --
Parameter
Numerical summary of a population (usually unknown)
Statistic
Numerical summary of a sample
TInterval(x, sx, n, c)
OR TInterval(List, Freq, c) The output of this function is the c-level confidence interval for the population mean when the population standard deviation is unknown and a sample mean x and a sample standard deviation sx have been computed from a sample of size n. Alternatively a data list can be entered, along with a frequency list, instead of the computed statistics.
Categorical variables
Observation belongs to a set of categories (hair color, gender)
1. A certain brand of fishing line claims to have an average breaking strength of 30 pounds. A group of fishermen become angry because this brand of line seems to break so easily and test 25 randomly selected lines of this brand. The mean breaking strength is 27.994 pounds with a standard deviation of 0.846 pounds. A plot of the data follows. Do these data provide sufficient evidence for the fishermen to conclude that the average breaking strength is less than claimed? A. What type of study is this?
Observational study
Quantitative Variables
Observations that take on numerical values
quartile
One of the three values that divide the ordered data set into quarters.
One-sample procedures
One-sample Z for proportions One-sample t for means
sample surveys
Opinion polls are examples of _________, designed to ask questions of a small group of people in the hope of learning something about the entire population
Calculate Test Statistic-
P Hat is calculated by X divided by N
P(A ∩ B')
P(A ∩ B') = P(A) - P(A ∩ B)
P(A ∪ B)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
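Both set identities can be checked by counting outcomes on one roll of a fair die. The events A and B below are made up for the illustration, and exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {2, 4, 6}                   # even number
B = {5, 6}                      # greater than 4

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_union = prob(A) + prob(B) - prob(A & B)   # P(A ∪ B)
p_a_not_b = prob(A) - prob(A & B)           # P(A ∩ B')

print(p_union, p_union == prob(A | B))       # 2/3 True
print(p_a_not_b, p_a_not_b == prob(A - B))   # 1/3 True
print(1 - p_union)                           # 1/3, neither A nor B occurs
```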
Observed effect = x-bar - μ0. If observed effect decreases, what happens to P-value?
P-value increases as observed effect decreases because x-bar gets closer to μ0 = 10
Not all counts have binomial distributions.
Pay attention to the binomial setting.
If a researcher wants to examine the relationship between variables and not the difference, what test statistics should he or she use?
Pearson Correlation Coefficient
Graphs for categorical variables
Pie chart or bar graph
direction strength form
Positive/Negative Strong/Moderate/Weak Linear/Non-linear
y-hat
Predicted y.
Interval estimate for an individual value
Prediction interval
Sampling distribution of the mean
Probability distribution of means for all possible random samples of a given size for some population.
significance level (α)
Probability of a Type I error, i.e. probability of rejecting a true null hypothesis; the largest risk of rejecting a true null hypothesis that a researcher is willing to take .
β
Probability of a type II error
1 - P(A ∪ B)
Probability that Events A and B do not occur.
p
Proportion (or percentage) of a population.
p-hat
Proportion (or percentage) of a sample.
population proportion ( p )
Proportion (or percentage) of all the observations in the population having a certain characteristic.
Lower fence
Q1 - 1.5 IQR
Quartiles
Q1: 25th percentile = median of the lower half including Q2 if odd Q2: 50th percentile = median Q3: 75th percentile = median of the upper half including Q2 if odd
Upper fence
Q3 + 1.5 IQR
correlation (conditions)
(1) Quantitative variables condition: Are the two variables quantitative? Check: See how the data were collected. (2) Straight enough condition: Is the assumption of a linear relationship appropriate? Check: Look at the scatter plot. (3) Outliers: Are there outliers that might distort the relationship? Check: Look at the scatter plot.
random error
Random unsystematic differences (random error) Random error: combined effects of all uncontrolled factors on the scores of individual subjects Individual differences AND experimental error Variability in general
Which condition is the most important?
Randomization
Spread
Range Inter-Quartile Range 5-number Summary Variance Standard Deviation (SD)
What is the range for a correlation coefficient?
Range between -1 and +1.
Central Limit Theorem
Regardless of the shape of the underlying population, the shape of the sampling distribution of the mean approximates a normal curve if the sample size is sufficiently large.
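A small simulation can make this concrete. The sketch below assumes a strongly right-skewed population (exponential with mean 1 and standard deviation 1) and a sample size of 40; both choices, and the seed, are arbitrary. The sample means should center near 1 with a standard deviation near 1 / sqrt(40).

```python
import random
from statistics import mean, stdev

random.seed(1)   # fixed seed so the run is repeatable

n = 40           # size of each sample
reps = 2000      # number of samples drawn

# Population: exponential with mean 1 and standard deviation 1 (right skewed)
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

print(round(mean(sample_means), 2))    # should land near the population mean, 1
print(round(stdev(sample_means), 2))   # should land near 1 / sqrt(40), about 0.16
```

A histogram of `sample_means` would look roughly bell shaped even though the population is not.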
Q→Q
Regression inference
How do you interpret a p value associated with a correlation coefficient?
The p-value is a number between 0 and 1 giving the probability of getting a correlation at least as extreme as the one observed if the null hypothesis (no correlation in the population) were true. It is not an estimate of the population correlation coefficient.
Y
Response variable in regression analysis.
Statistical significance
Results of a study that differ too much from what we expected to attribute to chance variation alone.
Least squares line has the smallest sum of..
SQUARED residuals
Randomization checks
SRS for observational studies; RAT (random allocation of treatments) for experiments
independent samples
SRS's collected separately from each of two (or more) disjoint populations; matched pairs data are considered to be dependent samples
r
Sample correlation coefficient
Simple random sample (SRS)
Sample is chosen in such a way that every possible sample of the same size (and so every subject) is equally likely to be selected for the study
x-bar
Sample mean
n
Sample size
s
Sample standard deviation.
multistage samples
Sampling schemes that combine several methods are called ______
Symbols
Sigma--sd for population. Mew--Mean for Population. S---SD for sample. Xbar--mean for sample
Binomial Distributions are models for ___?
Some categorical variables, typically representing the number of successes in a series of n independent trials.
O. Are the conclusions also practically significant? (fish problem)
Some fishermen would say 3 pounds is enough so that they can't catch the fish they want to. They would say these results are practically significant. Others don't fish for those huge fish and they would not care so they would say that the results are not practically significant.
standard deviation
Square root of Variance. A measure of distance from the mean, NEVER NEGATIVE (zero only when all scores are equal). Approximately the average amount that scores in a distribution vary from the mean. The bigger the SD, the more spread there is. BASICALLY, SD summarizes the spread, and helps us understand the meaning of individual scores.
Standard deviation
Square root of the variance: compute (X - mean)^2 for each observation, add all of these, divide the sum by (n-1), then take the square root
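The same steps written out in Python, checked against `statistics.stdev`, which also divides by n - 1. The scores are made up for the illustration.

```python
from math import sqrt
from statistics import stdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data

mean = sum(scores) / len(scores)
squared_deviations = [(x - mean) ** 2 for x in scores]
variance = sum(squared_deviations) / (len(scores) - 1)   # divide by n - 1
sd = sqrt(variance)

print(mean, round(sd, 3))        # 5.0 2.138
print(round(stdev(scores), 3))   # 2.138, same answer from the standard library
```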
σ
Standard deviation of a population or distribution.
σ/sq. root of n
Standard deviation of the sampling distribution of x-bar
s/sq. root of n
Standard error of x-bar; estimates standard deviation of the sampling distribution of x-bar
s/sqrtn
Standard error of xbar
Σ
Summation symbol
Whenever n > 30, we can reliably use one-sample t procedures even for skewed data or data with outliers provided the data were collected with an SRS. True False
TRUE This is because the Central Limit Theorem is so powerful. However, you should always plot your data. You could have such an extreme outlier that you shouldn't use a one-sample t procedure.
Inference for Regression
Test and confidence interval for slope of true regression line (or population regression line). Confidence and prediction intervals for µy
Explanatory (independent variable)
The X-variable. A variable that affects or explains the outcome. Not all pairs of variables are explanatory/response variables. They could be unrelated.
Response (dependent)
The Y-variable usually measures an outcome
sampling frame
The _______ is a list of individuals from which the sample is drawn.
explained variation
The amount of total variation in the y's that is accounted for by a regression model; it is equal to Σ(ŷ − ȳ)²
follow-up analysis
The analysis performed on data after an overall test on the equality of multiple means or the equality of multiple proportions is found to be significant. It determines which means or which proportions differ from which.
fail to reject H0
The appropriate statistical conclusion in hypothesis testing when the P-value is greater than α; equivalently, conclude that "There is not enough evidence to believe Ha."
reject H0
The appropriate statistical conclusion when the P-value is less than or equal to α; conclude that "There is enough evidence to believe Ha."
Variance
The average of the squared deviations of each observation from the mean
random
The best way to avoid bias is to select individuals for the sample at ______
Binomial Mean and Standard Deviation
The center and spread of the binomial distribution for a count X are defined by mean µ and standard dev. σ: µ = np, σ = √(np(1-p))
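A short sketch of both formulas, with a cross-check of the mean against the full distribution (the sum of k times P(X = k)). The values n = 10 and p = 0.5 are made up.

```python
from math import comb, sqrt

n, p = 10, 0.5   # made-up values

mu = n * p                      # mean: np
sigma = sqrt(n * p * (1 - p))   # standard deviation: sqrt(np(1 - p))

# Cross-check the mean against the full distribution: sum of k * P(X = k)
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
mean_from_pmf = sum(k * prob for k, prob in enumerate(pmf))

print(mu, round(sigma, 3), mean_from_pmf)  # 5.0 1.581 5.0
```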
Binomial Distribution
The count X of successes in the binomial setting has the binomial distribution with parameters n and p. The parameter n is the number of observations and p is the probability of a success on any one observation. The possible values of X are the whole numbers ranging from 0 to n.
deviation
The difference (distance) between an observation and the mean of all the observations in a data set, or the difference between an observation and the corresponding regression model estimate.
interquartile range (IQR)
The difference between Q3 and Q1 (i.e., Q3 - Q1); the length of the box in a boxplot
residual (y − ŷ)
The difference between the actual y and the predicted y.
μ1 - μ2
The difference between the means of two populations.
statistical significance
The difference between the observed statistic and the claimed parameter value as given in H0 is too large to be due to chance alone. To assess, ask "Is P-value < α?" If yes, then results are statistically significant
practical significance:
The difference between the observed statistic and the claimed parameter value is large enough to be worth reporting. To assess practical significance, look at the numerator of the test statistic and ask "Is the difference important?" If yes, then results are also of practical significance
p1 - p2
The difference between the proportions of two populations.
Interquartile Range
The difference between the scores (or estimated scores) at the 75th percentile and the 25th percentile. Used more than the range because it eliminates extreme scores.
x-bar1-x-bar2
The difference between two sample means
p-hat1-p-hat2
The difference between two sample proportions
Matched pairs t-test--which graph to check?
The differences graph
sampling distribution (theoretical):
The distribution of a statistic; a list of all the possible values of a statistic together with the frequency (or probability) of each value.
normality of Y at each X
The distribution of all the Y values at each possible value of X is normal.
population distribution
The distribution of all the observations in a population.
conditional distribution (conditional percents)
The distribution of the values in a single row (or a single column) of a two-way table.
population
The entire group of individuals of interest in a study.
Type II error
The error made when a false null hypothesis is not rejected. (i.e. you fail to reject H0 when H0 is false.)
type II error
The error made when a false null hypothesis is not rejected; the error of believing H0 when Ha is true.
Type I error
The error made when a true null hypothesis is rejected. (i.e. you reject H0 when H0 is true.)
type I error
The error made when a true null hypothesis is rejected; the error of believing Ha when H0 is true.
Complement
The event not occurring. The probability that Event A will not occur is denoted by P(A').
variability
The extent to which the scores in a data set tend to vary from each other and from the mean.
law of large numbers
The fact that the average ( x̄ ) of observed values in a sample will tend to get closer and closer to μ as the sample size increases.
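A small simulation can make the law of large numbers concrete. This is only a sketch: the fair six-sided die (μ = 3.5), the seed, and the sample sizes are assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed (an arbitrary choice) so the run is repeatable

# Simulate rolls of a fair six-sided die; the population mean is 3.5.
rolls = [random.randint(1, 6) for _ in range(100_000)]

# Sample mean of the first n rolls, for growing n.
running_means = {n: mean(rolls[:n]) for n in (10, 100, 1_000, 10_000, 100_000)}

for n, x_bar in running_means.items():
    print(f"n = {n:>7}: x-bar = {x_bar:.4f}")
```

The sample mean for the largest n should sit very close to 3.5, while the small-n means can wander.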
Null Hypothesis
The hypothesis of no difference or no change. This is the hypothesis that the researcher assumes to be true until sample results indicate otherwise. Generally it is the hypothesis that the researcher wants to disprove. (Note: Interpretations of P-value and statistically significant need to say something about "if H0 is true" in order to be correct.)
Alternative hypothesis
The hypothesis that the researcher wants to prove or verify; a statement that the value of a parameter is either "less than," "greater than," or "not equal to" the value claimed in H0.
least squares regression line
The line that minimizes the sum of squared residuals.
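A minimal sketch of how the least squares line can be computed, assuming a small made-up data set. The helper names `least_squares` and `sum_squared_residuals` are illustrative, not from the source.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x that minimizes the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept: the line passes through (x-bar, y-bar)
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of (observed y - predicted y)^2 for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]  # made-up explanatory values
ys = [2, 4, 5, 4, 5]  # made-up response values

a, b = least_squares(xs, ys)
ssr_fit = sum_squared_residuals(xs, ys, a, b)
ssr_other = sum_squared_residuals(xs, ys, a, b + 0.1)  # any other line does worse
```

For these data the fitted line is ŷ = 2.2 + 0.6x, and nudging the slope increases the sum of squared residuals.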
regression
The mathematical modeling of relationships between numerical variables.
margin of error for 95% confidence
The maximum amount that a statistic value will differ from the parameter value for the middle 95% of the statistics. (Note: Changing the level of confidence changes the percentage of interest, e.g. 95%.)
Margin of error
The maximum amount that a statistic value will differ from the parameter value for the middle 95% of the distribution of all possible statistics. (Note: 95% can be changed to any other level of confidence.) This only accounts for sampling variability
How do we interpret margin of error?
The maximum size of the error where the error is the difference between the sample mean, x̄, and the population mean, μ.
Measures of Center for a density curve
The median of a density curve is the equal-areas point, the point that divides the area under the curve in half. The mean of a density curve is the balance point, at which the curve would balance if made of solid material.
Median
The midpoint of the observations when they are ordered from smallest to largest (resistant to outliers)
Central Limit Theorem (CLT)
The name of the theorem stating that the sampling distribution of a statistic (e.g. x̄ ) is approximately normal whenever the sample is large and random.
z-score
The number of standard deviations a value or observation is from the mean; a standardized x-value.
Binomial Coefficient
The number of ways of arranging k successes in n observations, with constant probability p of success, in an unordered sequence.
Mode
The observation that shows up the most in the data set
residual (Vertical Deviation)
The observed y minus the predicted y; denoted: y − ŷ; prediction error.
Dependent
The occurrence of Event A changes the probability of Event B.
Independent
The occurrence of Event A does not change the probability of Event B. P(A ∩ B) = P(A)P(B); P(A|B) = P(A)
independent events
The occurrence of one event has no effect on the probability that the other event will occur. For independent events, multiply the probabilities of the individual outcomes to find the probability that these outcomes will occur together (the multiplication, or "and," rule). Examples: dealing cards with replacement (always use the full deck); birthdays of a married couple (both being born in May). Pr(A and B) = [Pr(A)] [Pr(B)]
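The multiplication rule can be checked by brute force on a small sample space. This sketch assumes two fair dice and three made-up events; exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a true/false function of an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] % 2 == 0      # first die is even
B = lambda o: o[1] == 6          # second die shows 6
C = lambda o: o[0] + o[1] >= 11  # sum is at least 11

p_a, p_b, p_c = prob(A), prob(B), prob(C)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_b_and_c = prob(lambda o: B(o) and C(o))

independent_ab = p_a_and_b == p_a * p_b  # True: the dice do not affect each other
independent_bc = p_b_and_c == p_b * p_c  # False: a 6 makes a large sum more likely
```

Here P(A and B) = 1/12 = P(A)P(B), so A and B are independent, while P(B and C) = 1/18 differs from P(B)P(C) = 1/72, so B and C are dependent.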
dependent variable
The outcome, the variable that is dependent on the treatment
LinRegTInt(Xlist, Ylist, Freq, c)
The output of this function includes the c-level confidence interval estimate for the population slope β. Additionally, the output can include a and b for the least-squares line equation, r and r², and the standard error s.
LinReg(a+bx) Xlist, Ylist
The output of this function includes the intercept a and the slope b for the least-squares line of best fit for the data pairs entered via the lists. Additionally (with DiagnosticsON), the correlation coefficient r and coefficient of determination r² will be shown.
LinRegTTest(Xlist, Ylist, Freq, alternate hypothesis)
The output of this function includes the t-test statistic and the P-value for the test given the alternate hypothesis entered. Additionally, the output can include a and b for the least-squares line equation, r and r², and the standard error s.
T-Test(μ0, x̄, s, n, alternate hypothesis)
The output of this function includes the t-test statistic, the P-value, and the point estimate for μ for the test given the null hypothesis, statistics, and alternate hypothesis entered. Alternatively a data list can be entered, along with a frequency list, instead of the computed statistics.
2-SampTTest(x̄1, sx1, n1, x̄2, sx2, n2, alternate hypothesis, Pooled)
The output of this function includes the t-test statistic, the P-value, the point estimates for μ1 and μ2, the sample standard deviations for the test given the null hypothesis, statistics, and alternate hypothesis entered. Alternatively a data list can be entered, along with a frequency list, instead of the computed statistics.
Z-Test(μ0, σ, x̄, n, alternate hypothesis)
The output of this function includes the z-test statistic, the P-value, and the point estimate for μ for the test given the null hypothesis, statistics, and alternate hypothesis entered. Alternatively a data list can be entered, along with a frequency list, instead of the computed statistics.
2-PropZTest(x1, n1, x2, n2, alternate hypothesis)
The output of this function includes the z-test statistic, the P-value, and the point estimates for p1 and p2 for the test given the null hypothesis, statistics, and alternate hypothesis entered.
1-PropZTest(p0, x, n, alternate hypothesis)
The output of this function includes the z-test statistic, the P-value, and the point estimate for p for the test given the null hypothesis, statistics, and alternate hypothesis entered.
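For readers without the calculator, the same three outputs can be sketched in Python. This is not the TI-84's implementation; the function name `one_prop_z_test`, the `alternative` labels, and the example numbers are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_prop_z_test(p0, x, n, alternative="two-sided"):
    """One-proportion z test: returns (z, P-value, p-hat)."""
    p_hat = x / n
    se = sqrt(p0 * (1 - p0) / n)  # standard error computed under H0: p = p0
    z = (p_hat - p0) / se
    std_normal = NormalDist()     # standard normal: mean 0, sd 1
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    else:  # two-sided: count both tails
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value, p_hat

# Made-up example: 60 successes in 100 trials, testing H0: p = 0.5 vs Ha: p > 0.5.
z, p_value, p_hat = one_prop_z_test(p0=0.5, x=60, n=100, alternative="greater")
```

With these numbers z is 2 and the one-sided P-value is about 0.023, so the result would be significant at α = 0.05.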
2-SampZTest(σ1, σ2, x̄1, n1, x̄2, n2, alternate hypothesis)
The output of this function includes the z-test statistic, the P-value, the point estimates for μ1 and μ2, the sample standard deviations for the test given the null hypothesis, statistics, and alternate hypothesis entered. Alternatively a data list can be entered, along with a frequency list, instead of the computed statistics.
χ²-Test (Observed matrix, Expected matrix)
The output of this function is the χ²-test statistic and the P-value for the test with a default alternate hypothesis of "The events are not independent." Note that the Expected matrix entered can be empty or some other irrelevant matrix. In the process of this function, the TI-84 will overwrite that matrix with the values of the Expected matrix.
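The expected counts and the chi-square statistic can be worked out by hand or with a short script. The sketch below assumes a made-up 2×2 observed table and a hypothetical helper `chi_square_statistic`. It stops at the statistic and degrees of freedom because Python's standard library has no chi-square distribution for the P-value.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom, expected counts) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected count for each cell = (row total * column total) / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r - 1)(c - 1)
    return chi2, df, expected

observed = [[20, 30],
            [30, 20]]  # made-up counts
chi2, df, expected = chi_square_statistic(observed)
```

For this table every expected count is 25, the statistic is 4, and there is 1 degree of freedom.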
2-SampZInt(σ1, σ2, x̄1, n1, x̄2, n2, c)
The output of this function is the c-level confidence interval for the difference between the means of two independent populations with known standard deviations σ1 and σ2 and sample means x̄1 and x̄2 computed from samples of size n1 and n2, respectively. Alternatively data lists can be entered, along with frequency lists, instead of the computed statistics.
2-SampTInt(x̄1, sx1, n1, x̄2, sx2, n2, c, Pooled)
The output of this function is the c-level confidence interval for the difference between the means of two independent populations with unknown standard deviations and sample means x̄1 and x̄2 and sample standard deviations sx1 and sx2 computed from samples of size n1 and n2, respectively. Alternatively data lists can be entered, along with frequency lists, instead of the computed statistics.
2-PropZInt(x1, n1, x2, n2, c)
The output of this function is the c-level confidence interval for the difference between the population proportions of two independent populations with point estimates p̂1 = x1/n1 and p̂2 = x2/n2.
1-PropZInt(x, n, c)
The output of this function is the c-level confidence interval for the population proportion p when a point estimate p̂ = x/n has been determined.
ZInterval(stdev, x, n, c)
The output of this function is the c-level confidence interval for the population mean when the population standard deviation is known and a sample mean x̄ has been computed from a sample of size n. Alternatively a data list can be entered, along with a frequency list, instead of the computed statistics.
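A rough Python equivalent of the z interval, assuming the usual formula x̄ ± z*·σ/√n. The function name `z_interval` and the example numbers are made up; this is a sketch, not the calculator's own code.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sigma, x_bar, n, c):
    """c-level confidence interval for a population mean when sigma is known."""
    z_star = NormalDist().inv_cdf((1 + c) / 2)  # critical value, e.g. about 1.96 for c = 0.95
    margin = z_star * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Made-up example: sigma = 15, sample mean 100, n = 36, 95% confidence.
lower, upper = z_interval(sigma=15, x_bar=100, n=36, c=0.95)
margin_of_error = (upper - lower) / 2  # (upper - lower) divided by 2
```

The margin of error here is about 4.9, so the interval runs from roughly 95.1 to 104.9.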
Level of confidence
The percent of the time that the confidence interval estimation procedure will give you intervals containing the value of the parameter being estimated. (Note: This can only be defined in terms of probability as follows: "The probability that the confidence interval to be computed (before data are gathered) will contain the value of the parameter." After data are collected, level of confidence is no longer a probability because a calculated confidence interval either contains the value of the parameter or it doesn't.)
Regression r-square (R^2)
The percentage of variation in y explained by x. Example: R^2 = __% of the variation in y is explained by x. ** If R is close to 1 or -1, correlation is strong
Binomial distributions describe ___? And are used when ___?
The possible number of times that a particular event will occur in a sequence of observations. They are used when we want to know the probability of the number of times an occurrence takes place.
β
The probability of failing to reject a false null hypothesis (probability of a type II error).
P-value
The probability of getting a test statistic as extreme or more extreme than the value actually observed assuming H0 is true.
p-value
The probability of getting data (summarized with the test statistic) as extreme or more extreme than the one observed (in the direction of the alternative hypothesis) assuming H0 is true
power (1 - β)
The probability of rejecting a false null hypothesis.
Significance level (symbolized by α)(probability of a type I error)
The probability of rejecting a true null hypothesis; equivalently, the largest risk a researcher is willing to take of rejecting a true null hypothesis.
Level of significance
The probability of rejecting a true null hypothesis; equivalently, the largest risk a researcher is willing to take of rejecting a true null hypothesis. Probability of Type 1 error. Alpha.
Conditional Probability
The probability that Event A occurs, given that Event B has occurred. The conditional probability of Event A, given Event B, is denoted by the symbol P(A|B).
Intersection
The probability that Events A and B both occur is the probability of the intersection of A and B. The probability of the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B are mutually exclusive, P(A ∩ B) = 0.
Union
The probability that Events A, B, or both occur is the probability of the union of A and B. The probability of the union of Events A and B is denoted by P(A ∪ B) .
Percentile
The pth percentile is a value such that p percent of the observations fall below or at the value
Empirical Rule
The rule gives the approximate % of observations within 1 standard deviation (68%), 2 standard deviations (95%) and 3 standard deviations (99.7%) of the mean when the histogram is well approximated by a normal curve
Population
The set of all the subjects of interest
Sample
The set of subjects for which we have data
population standard deviation (σ)
The standard deviation of all observations in a population; a measure of the variability of all the population values about their mean.
What does standard error estimate?
The standard deviation of the sampling distribution of x-bar
Mean
The sum of observations divided by the number of observations (sensitive to outliers)
Chi-square distribution
The theoretical distribution that models the test statistic for doing Chi-Square tests
pooled sample proportion
The value used for p̂ when computing the standard error of p̂1 − p̂2 for the two-sample proportion z test statistic.
sampling variability
The variability of a statistic from one sample to the next; a measure of sampling variability necessary to effectively do inference.
unexplained variation:
The variation of the y's about the regression equation; equals the sum of squared residuals.
y-intercept
The y value where the regression line intercepts (crosses) the y-axis.
five number summary
These five values: minimum, Q1, median, Q3, maximum; preferred numerical summary when data are very skewed or outliers are present
pooled sample proportion (how to compute, for the two-sample proportion z test statistic)
To compute, add the number of successes in both samples and divide by the sum of the two sample sizes.
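The pooling step described above (add the successes, divide by the combined sample size) can be sketched as follows. The function name and the counts are made up for illustration. The z formula is the standard two-proportion test statistic, which is assumed here since the source does not spell it out.

```python
from math import sqrt

def pooled_two_prop_z(x1, n1, x2, n2):
    """Return (pooled p-hat, z statistic) for H0: p1 = p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)  # successes in both samples / total sample size
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    return p_pooled, z

# Made-up counts: 45 of 100 successes in sample 1, 30 of 100 in sample 2.
p_pooled, z = pooled_two_prop_z(x1=45, n1=100, x2=30, n2=100)
```

Here the pooled proportion is 75/200 = 0.375 and z is about 2.19.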
Mutually Exclusive
Two events are mutually exclusive or disjoint if they cannot occur at the same time.
bivariate data
Two measurements are made on each unit.
Graphs
Two way table: 2 categorical. Scatter-- 2 Quant. Histogram--1 Quantitative. Box-plot--quantitative. Bar graph--categorical data.
C→C
Two-sample z for proportions; chi-square test
Types of histograms (6)
Uniform Bimodal Bell shaped Skewed right (mean>median) Skewed left (mean<median) Symmetric (mean=median)
symmetric distributions
Use Mean and Variance/SD
skewed distributions
Use Median and IQR or 5-number summary.
prediction
Using a regression equation to estimate a value of the response variable for a given value of the explanatory variable.
continuous random variable
X takes on all values in an interval of numbers
Regression line equation
Y = B0 + B1x
What value should be used for a 95% interval for tstar?
YOU DON'T KNOW THIS UNTIL YOU KNOW THE DEGREES OF FREEDOM AND LOOK AT THE T TABLE!
Linear Regression test
You test normality of histogram or stemplot of residuals
Measure of position
Z score, z=(observation - mean)/ standard deviation
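As a quick worked example of the z-score formula, assuming a made-up distribution with mean 100 and standard deviation 15:

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations an observation lies from the mean."""
    return (observation - mean) / standard_deviation

z_high = z_score(130, mean=100, standard_deviation=15)  # 2 standard deviations above the mean
z_low = z_score(85, mean=100, standard_deviation=15)    # 1 standard deviation below the mean
```

A positive z means the observation is above the mean; a negative z means it is below.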
yˆ =
a + bx
negative correlation
a correlation in which there is inverse agreement between the location of cases on X and Y. As scores on X increase, scores on Y decrease.
positive correlation
a correlation in which there is direct agreement between the location of cases on X and Y. As scores on X increase, scores on Y increase.
discrete
a countable number of outcomes
Random Sampling
a method of selecting participants for a study so that everyone in a population has an equal chance of being in the study
parameter
a number that describes some characteristic of the population
matched-pairs design
a randomized blocked experiment in which each block consists of a matching pair of similar experimental units
Haphazard sample
a sample obtained when a researcher attempts to emulate a true chance mechanism or tries to pick a representative sample based on their idea of what the population looks like
Convenience sample
a sample of easily accessible units in a population
Voluntary response sample
a sample of units from a population that select themselves for inclusion
voluntary response sample
a sample of units from a population that select themselves for inclusion -people with strong opinions are most likely to respond
union
a set containing all and only the members of two or more given sets
Statistics
a set of concepts and procedures that helps us understand variability. Helps us organize, summarize, interpret, and generalize our data
event
a set of outcomes in a probability experiment
If the most serious error is the type I error, then α should be set at
α should be set as small as possible (like 0.01) if the type I error is most serious.
trial
a single attempt at a probability experiment
lurking variables
a variable that is not among the explanatory or response variables in a study but may influence the response variable
response
a variable whose values are compared across different treatments, dependent variable, what is measured
41. Which one of the following is NOT part of the definition for P-value?
a) Probability that the null hypothesis is true.
60. What characterizes a probability sample but not a sample of convenience?
a) Some type of random device is used to obtain a probability sample. b) Their probabilities can be computed. c) All possible probability samples can be listed. d) Inferences can appropriately be made from probability samples. e) All of the above.
39. To standardize means to
a) subtract the mean from a value and then divide by the standard deviation.
The t-distribution with 8 degrees of freedom has ____________________ the standard Normal distribution.
a) the same center but is more spread out than
33. The least squares regression line is the line for which
a. the sum of the squared residuals is minimized.
examples of quantitative
age height time
quantitative
age in years (type of variable)
complement of an event
all of the outcomes NOT in an event
probability of an event
always between 0 and 1 (inclusive)
prediction interval
an interval estimate of plausible values for a single observation of Y at a specified value of X
relative frequency
another term for proportion; number of times event occurs divided by the number of trials (long run)
law of large numbers
as the number of independent trials increases, the proportion of occurrences of any given outcome approaches a particular number "in the long run"
null hypothesis
asserts that nothing special is happening with respect to some characteristic of the underlying population: NO difference in the groups you're testing; the means of different groups are the same. We want to disprove this.
29. How is level of confidence determined?
b) Subjectively determined by the researcher.
15. The margin of error in a confidence interval covers only which kind of errors?
b) errors due to random sampling
State the null hypothesis for testing independence (Chi square test).
b. H0: No relationship between type of community and internet usage. (H0: p1 = p2 = p3; conditional distributions are the same for all 3 communities.) Since these data are from an SRS with two questions, we should do a test of independence. Answer "a" would be valid if these data were from a stratified sample of community type.
State the alternative hypothesis for testing independence. (Chi square)
b. Ha: Relationship exists between type of community and internet usage. ---I think that if we had more than two questions and a stratified sample, the answer would be this: Ha: At least one pi differs from the others.
Least-Squares Regression Line
best-fit line; the line that minimizes the sum of the squares of the vertical distances of the data points from the line
BinomCDF
binomcdf(n, p, k)
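A sketch of what binomcdf(n, p, k) computes, assuming it returns the cumulative probability P(X ≤ k) for a binomial count X. The argument order follows the card above; the helper `binompdf` and the example are illustrative.

```python
from math import comb

def binompdf(n, p, k):
    """P(X = k): the binomial coefficient times p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(X <= k): add up the probabilities of 0, 1, ..., k successes."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

# Made-up example: at most 3 heads in 10 tosses of a fair coin.
p_at_most_3 = binomcdf(10, 0.5, 3)
```

For 10 fair tosses this is (1 + 10 + 45 + 120)/1024 = 0.171875.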
disjoint
both events cannot occur in same "probability experiment"
58. Why do we compare different treatment groups in experiments?
c) To neutralize the effects of lurking variables and measure treatment differences.
places individuals into one of several groups or categories
categorical variable
(x-bar chart)
chart used to monitor a process to determine whether it is in control or out of control.
histogram
create equal-length intervals or "bins" - obtain frequencies or relative frequencies of cases in each bin. - bins need to span the whole range of the data, but not overlap. - The bars will be of equal thickness since the intervals are of equal length. - There should be no gaps between bars unless they are meaningful.
59. Which one of the following is NOT a difficulty in experimentation?
d) Using blocking to remove variation associated with a lurking variable from experimental error.
subjective frequency
degree of belief that an outcome will occur (short run)
mathematical approximations to histograms -smooth curve
density curve
relative frequency distribution
describes the fraction (or %) of times a value occurs in the dataset for this variable.
frequency distribution for a variable
describes the number of times a value occurs in the dataset for this variable.
residual plot:
diagnostic plot of the residuals versus the explanatory variable used to assess how well the regression line fits the data; complete scatter with a shoe box shape is good; curvature indicates that a non-linear model would better fit the data, and a megaphone pattern indicates the standard deviation of y is not the same for all values of x.
range
difference between the minimum value and maximum value
y hat
the predicted value of y from the regression line (the residual is the distance/difference between the actual y and the predicted y)
the ________ of a quantitative variable tells us what values the variable takes on and how often it takes those values
distribution
sampling distribution of p̂
distribution of the sample proportion; a list of all the possible values for p̂ together with the frequency (or probability) of each value.
t distribution
distribution specified by degrees of freedom used to model test statistics for the sample mean, differences between sample means, etc. where σ ('s) is (are) unknown.
correlation _________ describe curved relationships between variables
does not
Constant with linear regression
don't use it
quantitative visual aid
dot plot, stem and leaf plot, histogram
Ordinal
dress size (type of variable)
categorical
each observation belongs to a type or a set
randomness
each subject of the population has the same chance of being included in the sample
subjects
entities measured in a study
type 2 error
fail to reject a false null hypothesis Saying there is not an effect when there is -occurs when sample size is too small
linear regression:
finding the line that best describes how the response variable linearly depends on the explanatory variable.
a probability model with a finite sample space is called...
finite or discrete
regression line
fitted through the points of a scatterplot that summarize and describes the relationship between the variables
consists of the smallest observation, the first quartile, the median, the third quartile and the largest observation, written in order from smallest to largest
five number summary
how to tell which way a graph is skewed
follow the tail
examples of categorical
gender, species, political preference
histogram
graph using vertical bars to portray the frequencies of outcomes
categorical (or qualitative)
hair color (type of variable)
Correlation (properties) linear
has no units; the correlation between x and y is the same as between y and x; ranges between -1 and +1; not affected by changes in center or scale of either variable; sensitive to outliers; only measures strength of ________ association
approximate sampling distribution:
The distribution of x̄-values obtained from repeatedly taking simple random samples of the same size from the same population. (An x̄ is computed from each sample.)
the final digit of each observation
leaf
Outlier
less than Q1 - (1.5*IQR) or more than Q3 + (1.5*IQR)
extreme outlier
less than Q1 - (3*IQR) or more than Q3 + (3*IQR)
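The two fence rules above can be applied with a few lines of Python. This sketch assumes made-up data and uses `statistics.quantiles` with the "inclusive" method. Quartile conventions differ between textbooks and calculators, so Q1 and Q3 may not match every source exactly.

```python
from statistics import quantiles

def iqr_fences(data, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [2, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up observations

low, high = iqr_fences(data)                 # 1.5 * IQR rule
extreme_low, extreme_high = iqr_fences(data, k=3)  # 3 * IQR rule

outliers = [x for x in data if x < low or x > high]
extreme_outliers = [x for x in data if x < extreme_low or x > extreme_high]
```

With these data Q1 = 5 and Q3 = 9, so the fences are -1 and 15 (or -7 and 21 for the extreme rule), and 30 is flagged by both.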
What things affect margin of error?
level of confidence and sample size--POPULATION SIZE DOES NOT
alpha is AKA?
level of significance
bias sampling
likely to under- or over-represent groups in the population that tend to have different values of a variable under investigation -undercoverage -nonresponse -response
to assign probabilities in a finite model
list the probabilities of all the individual outcomes; these probabilities must be numbers between 0 and 1 that add to 1; the probability of any event is the sum of the probabilities of the outcomes making up the event
regression analysis
make quantitative predictions of one variable from the values of another
symmetric distribution
mean = median
not resistant to outliers and skewness
mean and standard deviation
What is the parameter being tested in context? (FOR THE MATCHED PAIRS CARS EXPERIMENT)
mean difference between mpg with clean air filters and mpg with dirty air filters
μ
mean of the population distribution
Dbar
mean of the sample of differences
negatively skewed
mean, then median, then mode, tail on left
Mean and stdev
meaningless for categorical data, but not when you're talking about the sampling distribution of phat.
measures of variability
measures that indicate the degree of dispersion or spread of the data; include range, variance, and standard deviation
Pearson product-moment correlation (r)
measures the strength of linear relationship between two normally distributed interval or ratio scale variables
correlation ( r )
measures the strength of the linear relationship between two quantitative variables
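A short sketch of how r can be computed from the usual sums of squares, using made-up data and a hypothetical helper `correlation`. It also illustrates two properties: swapping x and y, or changing units, leaves r unchanged.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists of numbers."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]   # made-up explanatory values
scores = [2, 4, 5, 4, 5]  # made-up response values

r = correlation(hours, scores)
r_swapped = correlation(scores, hours)                      # x and y exchanged
r_rescaled = correlation([60 * h for h in hours], scores)   # hours converted to minutes
```

For these data r is about 0.77 in all three cases.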
inferential stats
method of making decisions or predictions based on data obtained
descriptive stats
methods for summarizing data
Positively skewed
mode then median then mean, tail on right
measures of central tendency
mode, median, mean. Types of descriptive statistics; answers "what is a typical score?"; describes the center
normal probability model
most common and arguably one of the most important
mode
most frequently occurring value, resistant
sample statistic
numerical summary of a sample taken from a population
observational study
observes individuals and measures variables of interest but does not attempt to influence the responses
selection bias
occurs when some part of the target population is not the sampled population
The data for the box plot:
only quantitative data
normalcdf(lower bound, upper bound, μ, standard dev)
output is the probability that a random variable (with a normal distribution) falls in the interval from lower bound to upper bound. The default values for mean and standard deviation correspond to the standard normal curve where μ=0 and stdev=1
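The same probability can be sketched in Python with `statistics.NormalDist`. The function below mirrors the argument order of the card; its name and the example values are assumptions for illustration, not the calculator's code.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """P(lower < X < upper) for X ~ Normal(mu, sigma); defaults give the standard normal."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

p_standard = normalcdf(-1.96, 1.96)               # middle area of the standard normal curve
p_shifted = normalcdf(85, 115, mu=100, sigma=15)  # within 1 standard deviation of the mean
```

The first call gives about 0.95 and the second about 0.68, matching the empirical rule.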
independent
outcome of one event doesn't depend on outcome of the other event
independent trials
outcome of one trial does not affect the outcome of any other trial
matched pairs design
randomized blocked experiment in which each block consists of a matching pair of similar experimental units
Non-probability samples have no____
randomness
population parameter
refers to the numeric summary of population
type 1 error
reject a null hypothesis that is actually true. Saying there is an effect when there isn't. Reduce Type I by decreasing α to .01 or .001, BUT this will increase your chance of a Type II error
significant result
test of significance that yields a P-value less than α; an observed effect that is larger than could reasonably be expected due to chance alone.
T ratio is
test statistic in linear regression
Degrees of Freedom with matched pairs
the # of pairs minus one
Confidence interval--what goes in it?
the PARAMETER
response variable
the quantitative measure from units or people of interest; dependent variable; predicted variable; y
conditions for inference
the relationship is linear in the population; the response varies normally about the population regression line; observations are independent; the standard deviation of the responses is the same for all values of x (NO FUNNEL SHAPE IN RESIDUALS)
if the distribution is exactly symmetric then the mean and median are...
the same
The theoretical conditions necessary for performing a one-sample t procedure are:
the sample was randomly selected and the population is Normally distributed.
complement
the set of all outcomes in S that are not contained in A
sample space
the set of all possible outcomes
in a skewed distribution the mean is usually farther out in which direction
the tail
bias
the tendency to systematically favor certain parts of the population over others
sampling distribution of (r)
the theoretical distribution of values of r obtained from all possible samples of size N drawn from a population in which there is no correlation between X and Y
experimental units
the individuals (subjects) in the experiment to which the treatments are applied
influential data point
the x value is relatively high or low compared to other data; point has a large residual for the regression line without using that data point
continuous examples
time, age, weight
independent variable
treatment is manipulated by the researcher
correlation is not resistant
true
β is the area in the fail to reject H0 region under the curve defined by Ha. True False
true
What is the margin of error
tstar times the standard error (s/sqrt N)
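A worked example of this formula, as a sketch. The value of t* must come from a t table or software for the right degrees of freedom. Here t* = 2.064 is the table value for 95% confidence with df = 24, and s and n are made up.

```python
from math import sqrt

def t_margin_of_error(t_star, s, n):
    """Margin of error = t* times the standard error s / sqrt(n)."""
    return t_star * s / sqrt(n)

# Made-up example: n = 25 (df = 24), s = 10, t* = 2.064 from a t table.
margin = t_margin_of_error(t_star=2.064, s=10, n=25)
```

The standard error is 10/5 = 2, so the margin of error is 2.064 × 2 = 4.128.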
disjoint event
two or more events with no outcomes in common.
randomize
use chance to assign experimental units to treatments
approx symmetric distribution
use mean and standard deviation
Random sampling
use of chance to select a sample, is central principle of statistical sampling
skewed distributions or distributions with extreme outliers
use median and quartiles
outlier
value falling far from the rest of the data
Parameter of a normal distribution
values that uniquely identify the distribution. For the Normal distribution they are the mean µ ("mu") and standard deviation σ ("sigma") of the population. Note that the mean and standard deviation computed from actual observations (data) are denoted by ȳ and s, respectively. These are not parameters.
a characteristic that is observed on an individual that can change between individuals -corresponds to the column
variable
explanatory
variable that is manipulated, or the treatment
standard deviation squared
variance
residual
vertical distance between a point and a regression line; residuals are squared so positives and negatives do not cancel one another out
outcome
what was observed on a trial
negative relationship
when above-average values of one tend to accompany below-average values of the other
Nonresponse bias
when an individual chosen for the sample can't be contacted or refuses to participate
Response bias
when an individual does not answer accurately or honestly for any reason (embarrassment, want to please, wording of question)
positive relationship
when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together
Undercoverage bias
when some groups in the population are left out of the process of choosing the sample
confounding:
when the effects of the lurking variable and the explanatory variable on the response variable cannot be distinguished from each other.
How do you compute the value of the test statistic for a matched pairs t?
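t = (d̄ − μ0) / (s_d / √n), where μ0 = zero, d̄ is the mean of the sample of differences, s_d is their standard deviation, and n is the number of pairs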
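A sketch of the standard matched pairs t statistic, t = (d̄ − μ0)/(s_d/√n) with μ0 = 0, computed from the differences. The mpg values are made up, loosely following the clean versus dirty air filter example, and the function name is illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def matched_pairs_t(first, second, mu0=0.0):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    diffs = [a - b for a, b in zip(first, second)]  # one difference per pair
    n = len(diffs)
    d_bar = mean(diffs)   # mean of the sample of differences
    s_d = stdev(diffs)    # sample standard deviation of the differences
    t = (d_bar - mu0) / (s_d / sqrt(n))
    return t, n - 1       # df = number of pairs minus one

clean_mpg = [25, 27, 30, 22, 28]  # made-up mpg with clean air filters
dirty_mpg = [24, 25, 29, 22, 26]  # made-up mpg with dirty air filters, same cars

t, df = matched_pairs_t(clean_mpg, dirty_mpg)
```

The differences are 1, 2, 1, 0, 2, giving d̄ = 1.2, t of about 3.21, and 4 degrees of freedom.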
discrete random variable
x takes fixed set of possible values - whole numbers
residual calculation
y - y hat
What is the difference between σ and s?
σ is the standard deviation of a population whereas s is the standard deviation of a sample.
equation relations
• μy = α + βx • ŷ = a + bx
Between what two values are 95% of the bottle weights
○ If you know the mean and the standard deviation parameters it's just the mean plus/minus 2 times the Stdev
• least squares line is the line for which the sum of squared
○ RESIDUALS (variability of y's about the line, not the mean of the y's) is minimized.
Why is the condition of random selection or random allocation so important?
○ So we can make valid inferences
mean
sum of values/# of observations, not resistant
representative
Is my sample __________ of my population or can the results of my sample be generalized to the population that i intend to study?
What do you already know about a correlation coefficient?
It can act as its own test statistic.
What does a correlation coefficient represent?
It is a numerical index that reflects the relationship between two variables.
consists of all but the right most digit
stem
Observations must meet these requirements
- the total number of observations n is fixed in advance - each observation falls into just 1 of 2 categories: success or failure - the outcomes of all n observations are statistically independent - all n observations have the same probability of "success", p.
three main principles for experimental design
-control -randomize -replication
convenience sample
A sample type where the researcher contacts those subjects who are readily available and does not use any random selection. The results are almost always biased.
displays for quantitative
stem-n-leaf display box plot histogram
margin of errors
1 / √n (a quick approximation to the 95% margin of error for a sample proportion); how well the sample estimate predicts the population percent
what quartile is the median?
2nd quartile and is resistant to outliers
When to use chi
3 or more categories for either exp. variable or response variable
Empirical rule #1
68% includes mean minus 1 standard deviation and mean plus 1 standard deviation
Empirical rule #2
95% includes mean minus 2 standard deviations and mean plus 2 standard deviations
Empirical rule #3
99.7% includes the mean minus 3 standard deviations and mean plus 3 standard deviations
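The three empirical rule percentages can be checked against a Normal model. This is a minimal sketch using Python's standard library `statistics.NormalDist`; the helper name `proportion_within` is hypothetical.

```python
from statistics import NormalDist


def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a Normal(mu, sigma) distribution within k SDs of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    # Roughly 68%, 95%, and 99.7% for k = 1, 2, 3.
    print(k, round(proportion_within(k), 4))
```

The result does not depend on the particular mean and standard deviation, which is why the rule applies to any Normal distribution.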
Multiple comparisons
Performing two or more tests of significance on the same data set. This inflates the overall α (probability of making a type I error) for the tests. (The more comparisons performed, the greater the chances of falsely rejecting at least one true null hypothesis.)
Residual
= y − ŷ, the difference between the observed y and the predicted y
Normal distribution
A bell-shaped symmetric density curve that is often used as a model for data or other random variables; specified by μ and σ.
direction of relationship
A characteristic of data in a scatterplot that is identified as either a positive or negative association.
degrees of freedom
A characteristic of the t-distribution (and other distributions like F and χ²) indicating the amount of information available in the data. A complete definition of "degrees of freedom" is beyond the scope of an introductory statistics course.
bias (sampling)
A condition that occurs when the design of a study systematically favors certain outcomes.
unbiased:
A condition where the mean of all possible statistics equals the parameter that the statistic estimates.
left skewed
A density curve where the left side of the distribution extends in a long tail. (Mean < median.)
right skewed distribution
A density curve where the right side of the distribution extends in a long tail; (mean > median).
non-probability sample
A sample selected without randomization; hence, the probability of obtaining a particular sample cannot be computed.
block
A group of experimental units sharing some common characteristic. In a randomized complete block design, random allocation of treatments is carried out separately within each group.
17. What is a distribution of a random variable?
A list of possible values of a variable together with how often each value occurs.
distribution
A list of the possible values of a variable together with the frequency (or probability) of each value.
distribution:
A list or a graph that shows the possible values of a variable together with the frequency of each value.
Q1 (First Quartile):
A location measure of the data that has approximately one-fourth or 25% of the data below it.
Q3 (Third Quartile)
A location measure of the data that has approximately three-fourths or 75% of the data below it.
regression equation
A mathematical formula for a straight line that models a linear relationship between two quantitative variables.
standard deviation of p-hat
A measure of the variability of the sampling distribution of p-hat; equals √(p(1−p)/n)
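A short sketch of the formula above, assuming the true proportion p is known; `sd_p_hat` is a hypothetical helper name.

```python
import math


def sd_p_hat(p, n):
    """Standard deviation of the sampling distribution of p-hat: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


print(sd_p_hat(0.5, 100))  # 0.05
```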
slope ( β = parameter symbol; b = statistic symbol):
A measure of the average rate of change in the response variable for every one unit increase in the explanatory or independent variable.
variance
A measure of the average squared deviation of the data about the mean.
probability
A measure of the proportion of times an outcome occurs in a very long series of repetitions indicating the likelihood of the outcome.
correlation coefficient
A measure of the strength of the linear relationship between two quantitative variables.
randomization
A method of assigning experimental units to treatment groups that eliminates bias and gives each unit the same probability of being assigned to any treatment group
voluntary response
A method of sample selection that consists of people choosing themselves by responding to a general appeal
test statistic
A numerical value calculated from the sample information assuming H0 is true; used to obtain P-value.
dotplot
A one dimensional plot of a quantitative data set where each value in the data set is represented by a dot above its corresponding location on the x axis.
random phenomenon
A phenomenon with outcomes that are individually unpredictable, but follow a predictable distribution in the long run (i.e., in a very large number of repetitions).
boxplot
A plot of data that incorporates the maximum observation, the minimum observation, the first quartile, the second quartile (median) and the third quartile.
in control
A process functioning within acceptable limits.
probability sample
A sample chosen using some type of random device. The probability of any specific sample can be computed and is greater than zero.
Normal Approximation for Binomial Distributions
If n is large and p is not too close to 0 or 1, the binomial distribution can be approximated by a Normal distribution: B(n, p) ≈ N(µ = np, σ = √(np(1−p))). It can generally be used when np ≥ 10 and n(1−p) ≥ 10. This approximation can be improved with a continuity correction.
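To see how good the approximation is, the sketch below compares an exact binomial probability with the Normal approximation, with and without the continuity correction. The values n = 100, p = 0.5, r = 45 are illustrative assumptions, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def binom_cdf(n, p, r):
    """Exact P(X <= r) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r + 1))


def normal_approx_cdf(n, p, r, continuity=True):
    """Normal approximation to P(X <= r), using mu = np and sigma = sqrt(np(1-p))."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    cutoff = r + 0.5 if continuity else r  # continuity correction adds half a unit
    return NormalDist(mu, sigma).cdf(cutoff)


n, p, r = 100, 0.5, 45  # np = 50 and n(1-p) = 50, so both checks pass
exact = binom_cdf(n, p, r)
with_cc = normal_approx_cdf(n, p, r, continuity=True)
without_cc = normal_approx_cdf(n, p, r, continuity=False)
print(exact, with_cc, without_cc)
```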
For fixed α, how can we increase power?
Increasing sample size will decrease the spread, making both curves taller and skinnier. That will increase power.
Regression explanatory variable
Independent variable, the groups to be compared with respect to values on the response variable
Asked for Percent or probability-Binomial
Inequality corresponding to sign.
data
Information collected on individuals.
IQR
Interquartile Range; IQR = Q3 - Q1
Asked for quartile, normal distribution.
invNorm(percentage to left, mean, SD)
___________ show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class
histograms
standard deviation
a measure of how far the values typically fall from the mean
The distribution of a count depends on
how the data are produced.
1. draw and label a number line that includes the range of the distribution 2. draw a central box from Q1 to Q3 3. note the median M inside the box 4. extend lines from the box out to the min and max values that are NOT outliers
how to make a boxplot
5 number summary
includes the minimum, first quartile, median, third quartile, & the maximum
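A minimal sketch of computing the five-number summary. Quartile conventions differ between textbooks and software; this version assumes the "median of each half" convention (excluding the overall median when n is odd), and `five_number_summary` is a hypothetical helper name.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # Skip the middle observation when n is odd.
    upper = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


# Illustrative data only.
data = [1, 3, 4, 7, 8, 10, 12, 15, 20]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
print(minimum, q1, med, q3, maximum, iqr)
```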
As degrees of freedom_____ the t-distribution curve approaches normal
increase
The shape of the t-distribution gets closer and closer to the shape of standard Normal distribution as the degrees of freedom
increase.
an object described by data ex: people, animals, households -corresponds to the row
individuals
Statistical inference Definition
inferring something about the population based on the observed sample.
When many tests of significance are performed on one set of data, the researcher is guilty of performing multiple analyses and
inflating the overall α.
Appropriate interpretation of interval vs. of level of confidence
Interval: we are confident that the parameter lies within this particular interval. Level: the confidence interval procedure yields intervals that contain the parameter that percentage of the time.
InvNorm
invNorm(left area/probability, mean, sd)
Power
is the probability of rejecting H0 when H0 is false.
Standard deviation for the first formula
is the standard deviation of pHAT
When computing a test statistic for a matched pairs t, μ₀
is zero (almost always)
The mean of every t-distribution
is zero just like the standard Normal distribution
How to know if it's np checks
it always is for proportions
resistant
outliers have little effect on the values
which outlier is influential
outliers in the x direction; they change the equation of the line or the correlation coefficient
1-VarStats(List)
output is the mean of the list (x̄), Σx, Σx², the sample standard deviation (Sx), the population standard deviation (σx), the length of the list (n), and the five-number summary
binompdf(n, p, r)
output is the probability of exactly r successes out of n trials in a binomial experiment where the probability of success in a single trial is p
binomcdf(n, p, r)
output is the probability of no more than r successes out of n trials in a binomial experiment where the probability of success in a single trial is p
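For readers without the calculator, the two binomial functions can be sketched in Python. These are hypothetical re-implementations that reuse the calculator's names and argument order (n, p, r); they are not the calculator's own code.

```python
import math


def binompdf(n, p, r):
    """P(X = r): probability of exactly r successes in n trials."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)


def binomcdf(n, p, r):
    """P(X <= r): probability of no more than r successes in n trials."""
    return sum(binompdf(n, p, k) for k in range(r + 1))


print(binompdf(10, 0.5, 3))  # 0.1171875
print(binomcdf(10, 0.5, 3))  # 0.171875
```

A probability such as P(X ≥ 4) then follows from the complement: 1 - binomcdf(10, 0.5, 3).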
invNorm(p, μ, σ)
output is the value of the random variable with left tail probability p. This can be used to determine critical z values.
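The same lookup can be sketched with Python's standard library, where `statistics.NormalDist.inv_cdf` plays the role of the calculator function. The wrapper name `inv_norm` is a hypothetical stand-in that mirrors the calculator's argument order.

```python
from statistics import NormalDist


def inv_norm(p, mean=0.0, sd=1.0):
    """Value of a Normal(mean, sd) variable with left tail probability p."""
    return NormalDist(mean, sd).inv_cdf(p)


# Critical z value for a two-tailed test at the 5% level (2.5% in each tail).
z_star = inv_norm(0.975)
print(round(z_star, 2))  # 1.96
```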
invT(p, degrees of freedom)
output is the t-value with left tail probability p. This can be used to determine critical t values.
variable:
particular characteristic of an individual
displays used for categorical
pie and bar chart
categorical visual aide
pie chart, bar graph
data:
pieces of information about individuals organized into variables
continuous
possible values form an interval, infinite set of values
For fixed α, increasing sample size increases
power.
-always on or above x-axis -always positive -has an area of exactly 1 underneath curve
properties of density curve
a variable that is observed as a number
quantitative variable
facts about correlation coefficient
r is always between -1 and 1; r > 0 indicates positive association; values near 0 indicate a weak relationship; r = ±1 indicates a perfect linear relationship
shows quantitative variable over time, can see if any trend is occurring over time
time plot
________ separate each observation into a stem and a leaf that are then plotted to display the distribution
stemplots
the variable one suspects is affected by the explanatory variable -measure outcome of study -variable you wish to predict -interest to study
response variable
r vs. r-squared
r²: percentage of variation in y that can be explained by x; r: strength and direction of the linear relationship
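A small sketch of both quantities on made-up data; the function name `correlation` is a hypothetical helper that implements the usual Pearson formula by hand.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r between two quantitative variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_squared = r**2  # share of the variation in y explained by x
print(round(r, 4), round(r_squared, 4))
```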
Sample standard deviation symbol
s
Standard Error
s/√n
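A minimal sketch of the standard error of the sample mean, using the sample standard deviation from Python's `statistics.stdev`; the data and the function name `standard_error` are illustrative assumptions.

```python
import math
from statistics import stdev


def standard_error(sample):
    """Standard error of the sample mean: s / sqrt(n)."""
    return stdev(sample) / math.sqrt(len(sample))


sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_error(sample))
```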
what is x-bar
sample mean or average
T-star and sample size
sample size is not a condition
simple random sample
a sample of size n in which each set of n elements in the population has an equal chance of selection
simple random sample SRS
all samples of size n are equally likely to be selected from the population of interest
Bias
sampling needs to be done in a way that we avoid _____
most useful graph to display the relationship between two quantitative variables
scatterplot
What kind of sample: California houses called through random digit dialing--each respond asked a question
simple random sample
uniform probability model
simplest of all continuous probability models; every outcome in the range of possible values is equally likely
Simple random sample (SRS)
An SRS of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected
mean<median
skewed to the left
mean>median
skewed to the right
-can NOT be a negative -can be 0
standard deviation
σ
standard deviation of population distribution
z score
standard score indicating the number of standard deviations a value lies above or below the mean; z = (value - mean)/standard deviation
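As a worked illustration, a standard score is the value minus the mean, divided by the standard deviation. The numbers below are made up, and `z_score` is a hypothetical helper name.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


print(z_score(85, 70, 10))  # 1.5, i.e. 1.5 SDs above the mean
print(z_score(60, 70, 10))  # -1.0, i.e. 1 SD below the mean
```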
alternative hypothesis
statement about the predicted relation between the groups you're studying; opposite of the null hypothesis; something special is happening; the means of different groups are NOT the same
29. Two variables are confounded when
the effect of one variable on the response variable cannot be separated from the effect of the other variable on the response variable.
46. A test of significance is intended to assess
the evidence provided by data AGAINST the null hypothesis in favor of the alternative hypothesis.
Standard error of the sample proportion
√(p̂(1−p̂)/n), the standard deviation formula with p-hat in place of p
median
the midpoint of observations, resistant
stratified random sample
the population is divided into distinct groups. Members are selected at random from each group.
y-hat
the predicted value of the response variable for a given value of x