stats 4 dummies

correlation

"Correlation" is not the same as "association"; specifically, it is "linear association".

proportion

# of individuals within range/ total # of individuals in data set

discrete examples

# of pets, #of siblings

Margin of Error

Half the width of the confidence interval: (upper - lower) divided by 2. For example, for the interval (.4, .7): (.7 - .4) / 2 = .15
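
As a rough illustration of the card above, the margin of error can be recovered from the two endpoints of a confidence interval. The function name `margin_of_error` is made up for this sketch.

```python
def margin_of_error(lower, upper):
    """Half the width of a confidence interval: (upper - lower) / 2."""
    return (upper - lower) / 2


# Interval endpoints taken from this card and from the fishing-line card later in the set.
print(margin_of_error(0.4, 0.7))
print(margin_of_error(27.595, 28.293))
```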

ways to draw a random sample

(1) Simple Random Sampling (SRS) (2) Stratified Random Sampling (3) Cluster Sampling (4) Multistage Sampling (5) Systematic Random Sampling

replication

(a) Repeating conditions within an experiment to determine the reliability of effects and increase internal validity; (b) repeating whole experiments to determine the generality of findings of previous experiments to other subjects, settings, and/or behaviors

response bias

(answers incorrectly or question is misleading), when respondents' answers may be affected by survey design

one tailed

(Directional test.) The rejection region is located in just one end/tail of the sampling distribution; the full 5% is placed in that one tail.

two tailed tests

(Non-directional test.) Rejection regions are located in both ends/tails of the sampling distribution. Always use two-tailed in this class (and research). mu ≠ mu1. Split the 5% (2.5% each) between the two tails.

Degrees of freedom for chi square

(r - 1)(c - 1)

non-response bias

(refuse to participate or fail to answer), bias introduced to a sample when a large fraction of those sampled fails to respond

sampling bias

(undercoverage)exists when a sample is not representative of the population from which it was drawn

Inferential Statistics

-Lets us generalize beyond our actual data -Designed to help make decisions and test our hypotheses -Infers your data -(z-tests, t-tests, ANOVA, Chi square)

Descriptive statistics

-Organize and summarize variability in our actual data -Makes things concise and easy to understand -Describes your data -(frequency, mean, median, mode, correlation, regression)

the 1.5xIQR rule

-Q3+(1.5)(IQR) -Q1-(1.5)(IQR)

inference

Using results from a sample statistic value to draw conclusions about the population parameter.

predicted y (symbolized by y-hat )

Value for y at a specified x as predicted by the regression equation; computed by plugging the value for x into the equation and solving for y.

Discrete Quantitative

Values form a set of whole numbers

Continuous quantitative

Values measured on an interval (time)

σ²

Variance of a population or distribution.

equal variance (or equal standard deviation)

Variances (or standard deviations) for each of the treatment groups (or samples) in ANOVA are all equal. In regression, the variances of the y's at each x are all assumed to be equal.

natural variation

Variation from object to object within a population.

bias (all can contribute to)

Volunteer Response Sample Convenience Sample Not identifying the population correctly

association vs. causation

We can only argue causation from association if the results having significant association are from an experiment.

probability rules

1. The probability P(A) of any event A satisfies 0 ≤ P(A) ≤ 1. 2. If S is the sample space, then P(S) = 1. 3. The disjoint addition rule: if A and B are disjoint, P(A or B) = P(A) + P(B). 4. For any event A, P(A does not occur) = 1 - P(A).

Empirical rule (3)

1) 68% of observations fall within 1 standard deviation of the mean 2) 95% of observations fall within 2 standard deviations of the mean 3) 99.7% of observations fall within 3 standard deviations of the mean
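
The three percentages can be checked against an exact Normal curve. This is a small sketch using Python's standard-library `statistics.NormalDist`; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist


def share_within(k, mean=0.0, sd=1.0):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The exact values are slightly different from the rounded 68%, 95% and 99.7% figures, which is why the rule is called empirical.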

Regression problems (4)

1) extrapolation 2) influential outliers 3) correlation does not imply causation 4) lurking variables

What are the steps in the computation of a t test to test the significance of a correlation coefficient?

1. A statement of the null and research hypotheses. 2. Setting the level of risk (or the level of significance or Type I error) associated with the null hypothesis. 3. and 4. Selection of the appropriate test statistic. 5. Determination of the value needed for rejection of the null hypothesis using the appropriate table of critical values for the particular statistic. 6. A comparison of the obtained value and the critical value is made. 7. and 8. Making a decision.

The Binomial Setting

1. There are a fixed number n of observations. 2. The n observations are all independent. That is, knowing the result of one observation does not change the probabilities we assign to other observations. 3. Each observation falls into one of just two categories, which for convenience are called "success" and "failure". 4. The probability of a success, p, is the same for each observation.

Practical importance

A difference between the observed statistic and the claimed parameter value that is large enough to be worth reporting. To assess practical importance, look at the numerator of the test statistic and ask "Is it worth anything?" If yes, then results are also of practical importance. Note: Do not assess practical importance unless results are statistically significant.

sampling distribution of x-bar

A distribution of the sample mean; a list of all the possible values for x-bar together with the frequency (or probability) of each value.

histogram

A graphical display of a quantitative data set; data are grouped into intervals (usually of equal width) and a bar is drawn over each interval having height proportional to the frequency (or percentage) of values in the interval. Values of the variable are given on the x axis and frequencies (or percentages) are given on the y axis. Histograms are examined to determine shape, center and spread.

histogram:

A graphical display of a quantitative data set; data are separated into intervals of equal width and a bar is drawn over the interval having height equal to the frequency (or percentage) of values in the interval. Values of the variable are given on the x axis and frequencies (or percentages) are given on the y axis. (Hence, a histogram gives a distribution.) Histograms are described by shape, center and spread. Used for large data sets.

pie chart:

A graphical display of categorical data using a "pie"; each category is represented as a slice where the size of the slice is proportional to the percentage of data in that category.

stemplot (also called stem and leaf plot):

A graphical representation of a quantitative data set. Leading values of each data point are presented as stems and second digits are given as leaves.

bar graph

A graphical representation of categorical data. Names of each category are listed on the x-axis and a bar that has height representing the frequency (or percentage) in that category is placed over each category name.

mall-intercept sample

A sample where respondents are contacted in a shopping mall or similar location. Often the method of selection is haphazard although occasionally systematic.

estimate of a parameter

A single value or a range of values used to estimate a parameter.

interaction

A situation that occurs in an experiment when the effect of one explanatory variable on the response variable is not the same across all levels of another explanatory variable.

What is the null hypothesis?

A statement of equality between sets of variables.

What is the research hypothesis?

A statement of inequality between two variables.

standard error

A statistic providing an estimate of the possible magnitude of error. The larger the standard error of measurement, the less reliable the score. The standard error of the estimate is a rough measure of the average amount of predictive error (average amount y deviates from y').

robust

A statistical procedure that is insensitive to moderate deviations from an assumption upon which it is based; e.g., t procedures give P-values and confidence levels that are very close to correct even when data are not normally distributed.

experiment

A study where treatments are deliberately imposed on the individuals in the study before data is gathered in order to observe their responses to the treatment.

spread

A summary number representing variability of the observations. Measures of spread include range, interquartile range, and standard deviation

resistant measure

A summary number that is not affected by outliers. The median is a resistant measure of center.

location measure

A summary number that tells the location (typically the center) of a data set on the number line.

random number table

A table consisting of the digits 0 through 9 in equal proportions such that the digit in any position in the table is independent of the digits in neighboring positions (i.e., there is no pattern in the order of the digits.)

approximate t test:

A test for comparing the means of two independent samples or two treatments where the test statistic has an approximate t distribution. This is the preferred two sample test, but it requires statistical software.

Chi-square test statistic

A test statistic computed from data that has an approximate chi-square distribution.

scatterplot

A two dimensional plot used to examine direction, form and strength of the relationship between two quantitative variables.

categorical (or qualitative) variable:

A variable that can be classified into groups or categories such as gender and religion.

response variable:

A variable that gives the outcomes of interest of the study (may not be a number); also called the dependent variable.

lurking variable:

A variable that has an important effect on the relationship among the variables in a study but is not taken into account.

explanatory variable

A variable that may or may not explain the outcomes (responses) of a study, also called independent or predictor variable.

lurking variable

A variable that the researcher is not necessarily interested in studying but which affects the relationship between the explanatory variable and the response variable.

quantitative variable

A variable with numerical values such as height or weight. This type of data required for both variables in regression analysis.

lack of realism

A weakness in experiments where the setting of the experiment does not realistically duplicate the conditions we really want to study.

How large a sample?

ALWAYS ROUND UP

Condition for expected counts in chi square

All expected counts greater than 5

Why is central limit theorem so important?

Allows us to compute probabilities on x-bar and p-hat

Alternate Hypothesis and null

Always μ (mu)

Normal distribution (a density curve)

Always on or above the horizontal axis Has area exactly 1 underneath curve Area under the curve for any interval is the proportion of all values that fall in that range

right-tailed alternative hypothesis:

An alternative hypothesis that states the parameter value of a treatment or population is greater than some number or the parameter from another treatment or population.

left-tailed alternative hypothesis

An alternative hypothesis that states the parameter value is less than some number or the parameter from another treatment or population.

one-sided or one-tailed:

An alternative hypothesis where the researcher is interested in deviations in only one direction. (" < " or " > " is in Ha.)

Association

An association exists between two variables if particular values for one variable are more likely to occur with certain values of the other variable

expected count

An estimate of how many observations should be in a cell of a two way table if there were no association between the row and column variables.
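
A small sketch of how expected counts feed into the chi-square statistic. It assumes the usual textbook formulas, which the cards do not spell out: expected count = (row total × column total) / grand total, and chi-square = sum of (observed - expected)² / expected. The table values and function names are hypothetical.

```python
def expected_counts(table):
    """Expected counts for a two-way table under no association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Chi-square statistic and its degrees of freedom, (r - 1)(c - 1)."""
    expected = expected_counts(table)
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


observed = [[10, 20], [30, 40]]  # made-up counts for illustration
print(expected_counts(observed))
print(chi_square(observed))
```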

Confidence interval

An estimate of the value of a parameter in interval form with an associated level of confidence; in other words, a list of reasonable or plausible values for the parameter based on the value of a statistic. E.g. a confidence interval for µ gives a list of possible values that µ could be based on the sample mean.

double blind study

An experiment where neither the subjects nor the diagnosticians (e.g. doctor or nurse) know which treatment is administered to whom.

randomized block design (RBD)

An experimental design where treatments are randomly allocated within each block.

random outcome

An individual outcome from a random phenomenon.

one sample t procedure for mean

An inferential procedure using the mean from one sample to test or estimate the population mean; the test statistic follows a t distribution.
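
A minimal sketch of the one-sample t statistic, assuming the standard formula t = (x-bar - mu0) / (s / sqrt(n)) with n - 1 degrees of freedom. The numbers are the fishing-line figures quoted elsewhere in this set (claimed mean 30, sample mean 27.994, s = 0.846, n = 25); the function name is made up.

```python
import math


def one_sample_t(sample_mean, mu0, s, n):
    """Return the t statistic and its degrees of freedom."""
    standard_error = s / math.sqrt(n)  # standard error of x-bar
    return (sample_mean - mu0) / standard_error, n - 1


t_stat, df = one_sample_t(sample_mean=27.994, mu0=30, s=0.846, n=25)
print(round(t_stat, 2), df)
```

The P-value still has to come from a t table or software; this sketch stops at the test statistic.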

one sample z procedure for proportion

An inferential procedure using the proportion from one sample to test or estimate the population proportion; the approximate distribution of the test statistic is z or standard normal.

Outliers

An observation in a data set that lies an abnormal distance from other values in a random sample from a population. Flagged when observation > Q3 + (1.5)*IQR or observation < Q1 - (1.5)*IQR

outlier:

An observation that falls outside the overall pattern of the data set. Can be detected by checking: observation < Q1 - 1.5 IQR or observation > Q3 + 1.5 IQR.
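
The check described above can be sketched in a few lines. Q1 and Q3 are passed in directly because different textbooks and software compute quartiles slightly differently; the function names are hypothetical.

```python
def iqr_fences(q1, q3):
    """Lower and upper fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """True if the value falls below the lower fence or above the upper fence."""
    lower, upper = iqr_fences(q1, q3)
    return value < lower or value > upper


print(iqr_fences(10, 20))
print(is_outlier(40, 10, 20))
```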

influential point

An observation that substantially alters the fitted regression equation.

lower tailed alternative hypothesis

Another name for a left-tailed alternative hypothesis.

variable

Any characteristic of an individual or object; it may take on any number of values either categorical or numerical.

Variable

Any characteristic that is observed for the subjects in a study

t-test

Any test of significance where the test statistic can be modeled with the t-distribution; used when σ is unknown.

The way to set up statistical problems

Ask: - what are the n individuals/units in the sample (of size "n"?) - what is being recorded about those n individuals/ units? - is that a number (quantitative) or a statement (categorical)?

Parameters of a binomial distribution for X successes in n observations

B(n,p) n is the number of observations. p is the probability of success on each observation. X is the count of successes, and can be any whole number between 0 and n.

27. What are the assumptions necessary for the above regression analysis?

B. The y's at each x are Normally and independently distributed with equal variances. NOT C. The samples are random and the sample sizes are large.

Slope

B1, the amount that mean of y changes when x increases by one unit

Linear Regression Null

B=0

What is the prob that an individual will be less than--use this

BECAUSE IT DOESN'T USE A SAMPLING DISTRIBUTION

Boxplots vs. bar graphs

Bar graphs are for categorical data

Why is it so important to use t* and not z* for a confidence interval for μ?

Because you only know s, not sigma

interviewer bias

Bias introduced into survey results by body language, voice intonation, gender, race, etc. of an interviewer

selection bias:

Bias introduced into sample results due to how the units were selected for sampling.

measurement bias

Bias introduced into survey results because of poorly worded questions, interviewer effects, measuring instrument difficulties, etc.

respondent bias

Bias resulting from respondents lying when asked about illegal or unpopular behavior, forgetting or confusing past behavior, having no knowledge about the question content and not wanting to appear stupid, etc.

under-coverage bias

Bias that occurs in sample results because a segment of the population with a certain characteristic is not sampled.

Y intercept

B0, the predicted value of y when x=0

• How do you distinguish between matched pairs and two sample t

Matched pairs: can do it with twins, a before and after measurement with one person, a couple treated as a unit, two treatments on the same person. ALWAYS MATCHED PAIRS WITH THESE. Two sample t: random sample of BYU and random sample of U of U, then compare them.

Types of variables (3)

Categorical Discrete Quantitative Continuous Quantitative

causation

Changes in the explanatory variable directly affect the response variable. Experiments are needed to verify causation.

How are conditions from a t-test checked?

Check if data came from an SRS and check a plot of the data. (Are there outliers or strong skew?)

Sampling Distribution of a Count

Choose an SRS of size n from a population with proportion p of successes. When the population is much larger than the sample, the count X of successes in the sample has approximately the binomial distribution with parameters n and p.

Interval estimate for a mean

Confidence interval

When to use n·p-hat vs. n·p0 checks

For a confidence interval we don't have a p0, so use n·p-hat

linear quantitative

Correlation measures the strength of _____ relationship between two ________ variables.

1. The chi-square test statistic measures A. the difference between the observed statistic and the claimed parameter value. B. by how many standard errors the two sample statistics differ. C. the number of squared deviations the observed values are from the mean. D. the differences between the observed and expected counts.

D -- NOT THE CLAIMED PARAMETER VALUE, BUT OBSERVED AND EXPECTED COUNTS--JUST DIFFERENCE

Regression response variable

Dependent variable, the outcome variable on which comparisons are made

shape

Description of the overall pattern of a histogram using terms including symmetric, right skewed, left skewed, flat (uniform), bell-shaped, etc.

measurement variation

Differences in repeated measurements on the same object.

cluster random sample

Divide the population into small groups and select any number of entire groups.

Graphs for quantitative variables

Dot plot, stem and leaf plot, histogram, boxplot

individual

Each object or unit described or examined in a data set

Subject

Entities that we measure in a study

b

Estimated (sample) slope in a regression equation.

a

Estimated (sample) y-intercept in a regression equation.

mutually exclusive events

Events that cannot occur together. A single person's birth month can't be both January and February. Pr(A or B) = Pr(A) + Pr(B)

X

Explanatory variable in regression analysis.

True or False: A 95% confidence interval for μ gives us a set of reasonable values for the response variable

FALSE. It's a set of reasonable values for the population mean

The tails of a standard Normal curve are fatter than the tails of a t-distribution curve with 26 degrees of freedom. False True

False, the opposite is true

The probability that the value for μ is in this 95% confidence interval, (27.595, 28.293), is 0.95.

False. It is either in there or it is not. Either 100% of the time or 0% of the time.

Box plot (5)

Fence 1 (lower): Q1-(1.5)*(Q3-Q1) Q1: lower half median Q2: median Q3: upper half median Fence 2 (upper): Q3+(1.5)*(Q3-Q1)

Factorial Notation

For a given number n, its factorial n! is n! = n ⋅ (n-1) ⋅ (n-2) ... 3 ⋅ 2 ⋅ 1 And 0! = 1

association

For quantitative data, large values of one variable tend to occur with large (or small) values of another variable. For categorical data, certain responses for one variable tend to occur with certain responses of the other variable.

Scatter plot

For two quantitative variables Explanatory variable is the x variable Response variable is the y variable

Z score

Gives the number of standard deviations an observation is from the mean, and in which direction. (Outliers include any z-score above 3 or below -3.)
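
A short sketch of the z-score calculation, assuming the standard formula z = (observation - mean) / standard deviation, which the card describes in words. The example values are made up.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean; the sign gives the direction."""
    return (x - mean) / sd


print(z_score(130, mean=100, sd=15))  # above the mean, so positive
print(z_score(85, mean=100, sd=15))   # below the mean, so negative
```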

Nominal

Hair color (N)

Bell shaped

Histogram in which the bars make a pyramid

Bimodal

Histogram where the bars make two bell shapes

interquartile range

IQR = Q3 - Q1. Covers the middle 50% of the data: the 25% of observations just below the median and the 25% just above it

Binomial Probability

If X has the binomial distribution with n observations and probability p of success on each observation, the possible values of X are 0, 1, 2, ..., n. If k is any one of these values, P(X = k) = (n choose k) p^k (1 - p)^(n - k).
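
A minimal sketch of the binomial probability calculation, assuming the standard formula P(X = k) = C(n, k) · p^k · (1 - p)^(n - k). The function name `binomial_pmf` and the example values are illustrative only.

```python
import math


def binomial_pmf(n, p, k):
    """P(X = k) for X with the binomial distribution B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 5 successes in 10 observations with p = 0.5.
print(binomial_pmf(10, 0.5, 5))

# The probabilities over k = 0, 1, ..., n should add up to 1.
print(sum(binomial_pmf(10, 0.5, k) for k in range(11)))
```

This is the same quantity a calculator's binomial pdf function returns for inputs n, p, k.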

What is the difference between causation and association?

Just because things are related and share something in common with one another has no bearing on whether there is a causal relationship between the two.

2 Sample z for proportions requirement

Just np checks and randomization

positive association

Large values of one variable tend to occur with large values of another variable and small values of one variable tend to occur with small values of the other.

c

Level of confidence

α

Level of significance or probability of a type I error (probability of rejecting a true null hypothesis).

Asked for P Value-

Look up Negative Z

Normal CDF

Lower,Upper, Mean, SD

Given, N, P, X- asked to get mean+ SD NP Bigger than 10-

M= NP SD= Square root of NPQ

3. Gas mileage for 10 cars with dirty air filters and clean air filters was studied. Each car was tested once with a clean air filter and once with a dirty air filter (with the order of the testing randomized.) The research question is: "Do cars get better miles per gallon on average with clean air filters?" What type of study

MATCHED PAIRS

tstar times s/sqrtn

Margin of error for one-sample t confidence interval for μ
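
A sketch of this margin-of-error formula, t* · s / sqrt(n). The critical value has to come from a t table or software; the value 2.064 used here is the table value for 95% confidence with 24 degrees of freedom and is supplied as an assumption. The s and n are the fishing-line figures from elsewhere in this set.

```python
import math


def t_margin_of_error(t_star, s, n):
    """Margin of error for a one-sample t confidence interval for the mean."""
    return t_star * s / math.sqrt(n)


T_STAR_95_DF24 = 2.064  # assumed table value: 95% confidence, df = n - 1 = 24
me = t_margin_of_error(T_STAR_95_DF24, s=0.846, n=25)
print(round(me, 3))
```

The interval itself is x-bar minus and plus this margin.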

C→Q

Matched pairs t Two-sample t ANOVA

Matched pairs t vs. 2 sample t for means hypothesis or 2 sample z for proportions

Matched pairs: μd. 2 sample t: μ1 - μ2 = 0 or μ1 = μ2. 2 sample z: p1 - p2 = 0

Range

Max-min

skewed left

Mean < Median

population mean (μ)

Mean of all the observations in the population.

μ

Mean of the sampling distribution of x-bar.

center

Mean, Median, Mode

quartiles

Measures of position that divide an ordered group of data into 4 subgroups or parts.

skewed right

Median < Mean

Parameter in hypothesis of matched pairs vs 2 sample t

μd for matched pairs; a difference (μ1 - μ2 or p1 - p2) for the others

BinomPDF

N,P,K

Can categorical data have a normal distribution?

NO

Quality control

Nine consecutive either above or below the midline means a problem. One outside 3 SAMPLE standard deviations out signals a problem

Null Hypothesis for testing whether a linear relationship exists -- regression analysis in general?

No linear relationship --

Parameter

Numerical summary of a population (usually unknown)

Statistic

Numerical summary of a sample

TInterval(x, sx, n, c)

OR TInterval(List, Freq, c) The output of this function is the c-level confidence interval for the population mean when the population standard deviation is unknown and a sample mean x and a sample standard deviation sx have been computed from a sample of size n. Alternatively a data list can be entered, along with a frequency list, instead of the computed statistics.

Categorical variables

Observation belongs to a set of categories (hair color, gender)

1. A certain brand of fishing line claims to have an average breaking strength of 30 pounds. A group of fishermen become angry because this brand of line seems to break so easily and test 25 randomly selected lines of this brand. The mean breaking strength is 27.994 pounds with a standard deviation of 0.846 pounds. A plot of the data follows. Do these data provide sufficient evidence for the fishermen to conclude that the average breaking strength is less than claimed? A. What type of study is this?

Observational study

Quantitative Variables

Observations that take on numerical values

quartile

One of the three values that divide the ordered data set into quarters.

One-sample procedures

One-sample Z for proportions One-sample t for means

sample surveys

Opinion polls are examples of _________, designed to ask questions of a small group of people in the hope of learning something about the entire population

Calculate Test Statistic-

P Hat is calculated by X divided by N

P(A ∩ B')

P(A ∩ B') = P(A) - P(A ∩ B)

P(A ∪ B)

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
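
Both set identities above can be verified by counting outcomes. The sketch below uses one roll of a fair six-sided die as a made-up example, with A = "even" and B = "greater than 4".

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # faces of a fair die (illustrative sample space)
A = {2, 4, 6}                # event: roll is even
B = {5, 6}                   # event: roll is greater than 4


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


# General addition rule versus direct counting of the union.
p_union_rule = prob(A) + prob(B) - prob(A & B)
p_union_direct = prob(A | B)

# P(A and not B) versus P(A) - P(A and B).
p_a_not_b_rule = prob(A) - prob(A & B)
p_a_not_b_direct = prob(A - B)

print(p_union_rule, p_union_direct)
print(p_a_not_b_rule, p_a_not_b_direct)
```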

Observed effect = x-bar - μ0. If observed effect decreases, what happens to P-value?

P-value increases as observed effect decreases because x-bar gets closer to μ0 = 10

Not all counts have binomial distributions.

Pay attention to the binomial setting.

If a researcher wants to examine the relationship between variables and not the difference, what test statistics should he or she use?

Pearson Correlation Coefficient

Graphs for categorical variables

Pie chart or bar graph

direction strength form

Positive/Negative Strong/Moderate/Weak Linear/Non-linear

y-hat

Predicted y.

Interval estimate for the number

Prediction interval

Sampling distribution of the mean

Probability distribution of means for all possible random samples of a given size for some population.

significance level (α)

Probability of a Type I error, i.e. probability of rejecting a true null hypothesis; the largest risk of rejecting a true null hypothesis that a researcher is willing to take .

β

Probability of a type II error

1 - P(A ∪ B)

Probability that neither Event A nor Event B occurs.

p

Proportion (or percentage) of a population.

p-hat

Proportion (or percentage) of a sample.

population proportion ( p )

Proportion (or percentage) of all the observations in the population having a certain characteristic.

Lower fence

Q1 - 1.5 IQR

Quartiles

Q1: 25th percentile = median of the lower half including Q2 if odd Q2: 50th percentile = median Q3: 75th percentile = median of the upper half including Q2 if odd
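
The quartile rule on this card (median of each half, with the overall median included in both halves when the count is odd) can be sketched as follows. The function name is made up, and note that calculators and software often use other quartile conventions, so results may differ slightly.

```python
import statistics


def quartiles_inclusive(data):
    """Q1, Q2, Q3 using medians of halves; the median joins both halves if n is odd."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower, upper = ordered[: half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[half:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )


print(quartiles_inclusive([5, 1, 4, 2, 3]))
print(quartiles_inclusive([1, 2, 3, 4]))
```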

Upper fence

Q3 + 1.5 IQR

correlation (conditions)

(1) Quantitative variables condition: Are the two variables quantitative? Check: See how the data were collected. (2) Straight enough condition: Is the assumption of a linear relationship appropriate? Check: Look at the scatter plot. (3) Outliers: Are there outliers that might distort the relationship? Check: Look at the scatter plot.

random error

Random unsystematic differences (random error) Random error: combined effects of all uncontrolled factors on the scores of individual subjects Individual differences AND experimental error Variability in general

Which condition is the most important?

Randomization

Spread

Range Inter-Quartile Range 5-number Summary Variance Standard Deviation (SD)

What is the range for a correlation coefficient?

Range between -1 and +1.

Central Limit Theorem

Regardless of the shape of the underlying population, the shape of the sampling distribution of the mean approximates a normal curve if the sample size is sufficiently large.
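
A small simulation sketch of the theorem. The population here is an exponential distribution with mean 1 and standard deviation 1, chosen only because it is strongly right-skewed; the sample size, repetition count and seed are arbitrary assumptions.

```python
import random
import statistics


def sample_means(sample_size, repetitions, seed=0):
    """Means of repeated random samples drawn from a skewed (exponential) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(repetitions)
    ]


means = sample_means(sample_size=40, repetitions=2000)

# The sample means should center near the population mean (1.0), with spread
# close to sigma / sqrt(n) = 1 / sqrt(40).
print(statistics.fmean(means))
print(statistics.stdev(means), 1 / 40**0.5)
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is skewed.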

Q→Q

Regression inference

How do you interpret a p value associated with a correlation coefficient?

The p-value is a number between 0 and 1 representing the probability that data at least this extreme would have arisen if the null hypothesis were true.

Y

Response variable in regression analysis.

Statistical significance

Results of a study that differ too much from what we expected to attribute to chance variation alone.

Least squares line has the smallest...

SQUARED residuals

Randomization checks

SRS for observational RAT for experiments

independent samples

SRS's collected separately from each of two (or more) disjoint populations; matched pairs data are considered to be dependent samples

r

Sample correlation coefficient

Simple random sample (SRS)

Sample is chosen in such a way that every subject is equally likely to be selected for the study

x-bar

Sample mean

n

Sample size

s

Sample standard deviation.

multistage samples

Sampling schemes that combine several methods are called ______

Symbols

σ (sigma): SD for population. μ (mu): mean for population. s: SD for sample. x-bar: mean for sample

Binomial Distributions are models for ___?

Some categorical variables, typically representing the number of successes in a series of n independent trials.

O. Are the conclusions also practically significant? (fish problem)

Some fishermen would say 3 pounds is enough so that they can't catch the fish they want to. They would say these results are practically significant. Others don't fish for those huge fish and they would not care so they would say that the results are not practically significant.

standard deviation

Square root of Variance. A measure of distance from the mean, ALWAYS POSITIVE. Approximately the average amount that scores in a distribution vary from the mean The bigger the SD, the more spread there is. BASICALLY, SD summarizes the spread, and helps us understand the meaning of individual scores.

Standard deviation

Square root of the variance: take each (X - mean)^2, add all of these, divide by (n - 1), then take the square root
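
The recipe on this card, written out step by step and compared with the standard library's `statistics.stdev`, which also divides by n - 1. The data values are made up.

```python
import math
import statistics


def sample_sd(data):
    """Sample standard deviation: sqrt of sum of squared deviations over (n - 1)."""
    mean = sum(data) / len(data)
    squared_deviations = [(x - mean) ** 2 for x in data]
    variance = sum(squared_deviations) / (len(data) - 1)
    return math.sqrt(variance)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data
print(sample_sd(scores))
print(statistics.stdev(scores))
```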

σ

Standard deviation of a population or distribution.

σ/sq. root of n

Standard deviation of the sampling distribution of x-bar

s/sq. root of n

Standard error of x-bar; estimates standard deviation of the sampling distribution of x-bar

s/sqrtn

Standard error of xbar

Σ

Summation symbol

Whenever n > 30, we can reliably use one-sample t procedures even for skewed data or data with outliers provided the data were collected with an SRS. True False

TRUE This is because the Central Limit Theorem is so powerful. However, you should always plot your data. You could have such an extreme outlier that you shouldn't use a one-sample t procedure.

Inference for Regression

Test and confidence interval for slope of true regression line (or population regression line). Confidence and prediction intervals for µy

Explanatory (independent variable)

The X-variable A variable that affects or explain the outcome Not all pairs of variables are explanatory/response variables. They could be unrelated.

Response (dependent)

The Y-variable usually measures an outcome

sampling frame

The _______ is a list of individuals from which the sample is drawn.

explained variation

The amount of total variation in the y's that is accounted for by a regression model; it is equal to Σ(y-hat − y-bar)^2

follow-up analysis

The analysis performed on data after an overall test on the equality of multiple means or the equality of multiple proportions is found to be significant. It determines which means or which proportions differ from which.

fail to reject H0

The appropriate statistical conclusion in hypothesis testing when the P-value is greater than α; equivalently, conclude that "There is not enough evidence to believe Ha."

reject H0

The appropriate statistical conclusion when the P-value is less than or equal to α; conclude that "There is enough evidence to believe Ha."

Variance

The average of the squared deviations of each observation from the mean

random

The best way to avoid bias is to select individuals for the sample at ______

Binomial Mean and Standard Deviation

The center and spread of the binomial distribution for a count X are defined by mean µ and standard dev. σ: µ = np, σ = √(np(1-p))
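
The two formulas on this card, µ = np and σ = sqrt(np(1 - p)), as a tiny sketch. The function name and the example values are made up.

```python
import math


def binomial_mean_sd(n, p):
    """Mean and standard deviation of a binomial count X with parameters n and p."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean, sd


print(binomial_mean_sd(100, 0.5))
```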

Binomial Distribution

The count X of successes in the binomial setting has the binomial distribution with parameters n and p. The parameter n is the number of observations and p is the probability of a success on any one observation. The possible values of X are the whole numbers ranging from 0 to n.

deviation

The difference (distance) between an observation and the mean of all the observations in a data set, or the difference between an observation and the corresponding regression model estimate.

interquartile range (IQR)

The difference between Q3 and Q1 (i.e., Q3 - Q1); the length of the box in a boxplot

residual (y − y-hat)

The difference between the actual y and the predicted y.

μ1 - μ2

The difference between the means of two populations.

statistical significance

The difference between the observed statistic and the claimed parameter value as given in H0 is too large to be due to chance alone. To assess, ask "Is P-value < α?" If yes, then results are statistically significant

practical significance:

The difference between the observed statistic and the claimed parameter value is large enough to be worth reporting. To assess practical significance, look at the numerator of the test statistic and ask "Is the difference important?" If yes, then results are also of practical significance

p1 - p2

The difference between the proportions of two populations.

Interquartile Range

The difference between the scores (or estimated scores) at the 75th percentile and the 25th percentile. Used more than the range because it eliminates extreme scores.

x-bar1-x-bar2

The difference between two sample means

p-hat1-p-hat2

The difference between two sample proportions

Matched pairs t-test--which graph to check?

The differences graph

sampling distribution (theoretical):

The distribution of a statistic; a list of all the possible values of a statistic together with the frequency (or probability) of each value.

normality of Y at each X

The distribution of all the Y values at each possible value of X is normal.

population distribution

The distribution of all the observations in a population.

conditional distribution (conditional percents)

The distribution of the values in a single row (or a single column) of a two-way table.

population

The entire group of individuals of interest in a study.

Type II error

The error made when a false null hypothesis is not rejected (i.e., you fail to reject H0 when H0 is false).

type II error

The error made when a false null hypothesis is not rejected; the error of believing H0 when Ha is true.

Type I error

The error made when a true null hypothesis is rejected. (i.e. you reject H0 when H0 is true.)

type I error

The error made when a true null hypothesis is rejected; the error of believing Ha when H0 is true.

Complement

The event not occurring. The probability that Event A will not occur is denoted by P(A').

variability

The extent to which the scores in a data set tend to vary from each other and from the mean.

law of large numbers

The fact that the average (x̄) of observed values in a sample will tend to get closer and closer to μ as the sample size increases.

Null Hypothesis

The hypothesis of no difference or no change. This is the hypothesis that the researcher assumes to be true until sample results indicate otherwise. Generally it is the hypothesis that the researcher wants to disprove. (Note: Interpretations of P-value and statistically significant need to say something about "if H0 is true" in order to be correct.)

Alternative hypothesis

The hypothesis that the researcher wants to prove or verify; a statement about the value of a parameter that is either "less than," "greater than," "not equal to."

least squares regression line

The line that minimizes the sum of squared residuals.
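
To make the definition concrete, the following sketch fits a line ŷ = a + bx using the standard closed-form least-squares formulas and then computes the residuals. The function name and the four data points are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x that minimizes the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, chosen only for illustration
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]
a, b = least_squares_line(xs, ys)

# Residual = observed y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sum(r ** 2 for r in residuals)  # the quantity the line minimizes
print(a, b, sse)
```

Note that the residuals of a least-squares line with an intercept always sum to zero, which is why the squared residuals are what gets minimized.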

regression

The mathematical modeling of relationships between numerical variables.

margin of error for 95% confidence

The maximum amount that a statistic value will differ from the parameter value for the middle 95% of the statistics. (Note: Changing the level of confidence changes the percentage of interest, e.g. 95%.)

Margin of error

The maximum amount that a statistic value will differ from the parameter value for the middle 95% of the distribution of all possible statistics. (Note: 95% can be changed to any other level of confidence.) This only accounts for sampling variability

How do we interpret margin of error?

The maximum size of the error, where the error is the difference between the sample mean, x̄, and the population mean, μ.

Measures of Center for a density curve

The median of a density curve is the equal-areas point, the point that divides the area under the curve in half. The mean of a density curve is the balance point, at which the curve would balance if made of solid material.

Median

The midpoint of the observations when they are ordered from smallest to largest (resistant to outliers)

Central Limit Theorem (CLT)

The name of the theorem stating that the sampling distribution of a statistic (e.g. x̄) is approximately normal whenever the sample is large and random.

z-score

The number of standard deviations a value or observation is from the mean; a standardized x-value.

Binomial Coefficient

The number of ways of arranging k successes in n observations, with constant probability p of success, in an unordered sequence.

Mode

The observation that shows up the most in the data set

residual (Vertical Deviation)

The observed y minus the predicted y; denoted y − ŷ; the prediction error.

Dependent

The occurrence of Event A changes the probability of Event B.

Independent

The occurrence of Event A does not change the probability of Event B. P(A ∩ B) = P(A)P(B); P(A|B) = P(A)
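
The two conditions on this card can be checked numerically. The sketch below uses a standard 52-card deck as an assumed example (A = "draw an ace", B = "draw a heart") and exact fractions to avoid rounding issues.

```python
from fractions import Fraction

def is_independent(p_a, p_b, p_a_and_b):
    """Events are independent exactly when P(A and B) = P(A) * P(B)."""
    return p_a_and_b == p_a * p_b

# Assumed example: one card drawn from a standard 52-card deck
p_ace = Fraction(4, 52)             # P(A)
p_heart = Fraction(13, 52)          # P(B)
p_ace_of_hearts = Fraction(1, 52)   # P(A and B)

independent = is_independent(p_ace, p_heart, p_ace_of_hearts)

# Equivalent check: P(A|B) = P(A and B) / P(B) should equal P(A)
p_ace_given_heart = p_ace_of_hearts / p_heart
print(independent, p_ace_given_heart == p_ace)
```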

independent events

The occurrence of one event has no effect on the probability that the other event will occur. For independent events, multiply the probabilities of the individual outcomes to find the probability that these outcomes will occur together (the multiplication, or "and," rule): Pr(A and B) = Pr(A) × Pr(B). Examples: dealing cards with replacement (always use the full deck); birthdays of a married couple (both being born in May).

dependent variable

The outcome, the variable that is dependent on the treatment

LinRegTInt(Xlist, Ylist, Freq, c)

The output of this function includes the c-level confidence interval estimate for the population slope. Additionally, the output can include a and b for the least-squares line equation, r and r², and the standard error s.

LinReg(a+bx) Xlist, Ylist

The output of this function includes the intercept a and the slope b for the least-squares line of best fit for the data pairs entered via the lists. Additionally (with DiagnosticsON), the correlation coefficient r and coefficient of determination r² will be shown.

LinRegTTest(Xlist, Ylist, Freq, alternate hypothesis)

The output of this function includes the t-test statistic and the P-value for the test given the alternate hypothesis entered. Additionally, the output can include a and b for the least-squares line equation, r and r², and the standard error s.

T-Test(μ0, x̄, s, n, alternate hypothesis)

The output of this function includes the t-test statistic, the P-value, and the point estimate for μ for the test given the null hypothesis, statistics, and alternate hypothesis entered. Alternatively, a data list can be entered, along with a frequency list, instead of the computed statistics.

2-SampTTest(x1, sx1, n1, x2, sx2, n2, alternate hypothesis, Pooled)

The output of this function includes the t-test statistic, the P-value, the point estimates for μ1 and μ2, and the sample standard deviations for the test given the null hypothesis, statistics, and alternate hypothesis entered. Alternatively, data lists can be entered, along with frequency lists, instead of the computed statistics.

Z-Test(μ0, σ, x̄, n, alternate hypothesis)

The output of this function includes the z-test statistic, the P-value, and the point estimate for μ for the test given the null hypothesis, statistics, and alternate hypothesis entered. Alternatively, a data list can be entered, along with a frequency list, instead of the computed statistics.

2-PropZTest(x1, n1, x2, n2, alternate hypothesis)

The output of this function includes the z-test statistic, the P-value, and the point estimates for p1 and p2 for the test given the null hypothesis, statistics, and alternate hypothesis entered.

1-PropZTest(p0, x, n, alternate hypothesis)

The output of this function includes the z-test statistic, the P-value, and the point estimate for p for the test given the null hypothesis, statistics, and alternate hypothesis entered.

2-SampZTest(σ1, σ2, x̄1, n1, x̄2, n2, alternate hypothesis)

The output of this function includes the z-test statistic, the P-value, the point estimates for μ1 and μ2, and the sample standard deviations for the test given the null hypothesis, statistics, and alternate hypothesis entered. Alternatively, data lists can be entered, along with frequency lists, instead of the computed statistics.

χ²-Test (Observed matrix, Expected matrix)

The output of this function is the χ²-test statistic and the P-value for the test with a default alternate hypothesis of "The events are not independent." Note that the Expected matrix entered can be empty or some other irrelevant matrix; in the process of running this function, the TI-84 will overwrite that matrix with the expected values.
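
For readers without the calculator, the sketch below shows one way the expected counts and the chi-square statistic can be computed by hand from an observed two-way table. It is not the TI-84's implementation; the function name and the 2×2 table are hypothetical, and the P-value lookup is left out.

```python
def chi_square_independence(observed):
    """Return (expected counts, chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count = (row total * column total) / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)  # (r - 1)(c - 1)
    return expected, chi2, df

# Hypothetical observed counts
observed = [[10, 20],
            [30, 40]]
expected, chi2, df = chi_square_independence(observed)
print(expected, chi2, df)
```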

2-SampZInt(σ1, σ2, x̄1, n1, x̄2, n2, c)

The output of this function is the c-level confidence interval for the difference between the means of two independent populations with known standard deviations σ1 and σ2 and sample means x̄1 and x̄2 computed from samples of size n1 and n2, respectively. Alternatively, data lists can be entered, along with frequency lists, instead of the computed statistics.

2-SampTInt(x1, sx1, n1, x2, sx2, n2, c, Pooled)

The output of this function is the c-level confidence interval for the difference between the means of two independent populations with unknown standard deviations, using sample means x̄1 and x̄2 and sample standard deviations sx1 and sx2 computed from samples of size n1 and n2, respectively. Alternatively, data lists can be entered, along with frequency lists, instead of the computed statistics.

2-PropZInt(x1, n1, x2, n2, c)

The output of this function is the c-level confidence interval for the difference between the population proportions of two independent populations with point estimates p̂1 = x1/n1 and p̂2 = x2/n2.

1-PropZInt(x, n, c)

The output of this function is the c-level confidence interval for the population proportion p when a point estimate p̂ = x/n has been determined.

ZInterval(stdev, x, n, c)

The output of this function is the c-level confidence interval for the population mean when the population standard deviation is known and a sample mean x̄ has been computed from a sample of size n. Alternatively, a data list can be entered, along with a frequency list, instead of the computed statistics.
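
The same interval can be reproduced off the calculator. This is a minimal sketch using Python's standard library, not the TI-84's own routine; the function name `z_interval` and the example inputs (σ = 15, x̄ = 100, n = 36, c = 0.95) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def z_interval(stdev, x_bar, n, c):
    """Return (lower, upper, margin of error) for a c-level z confidence interval for a mean."""
    if not 0 < c < 1 or n <= 0 or stdev <= 0:
        raise ValueError("need 0 < c < 1, n > 0, and stdev > 0")
    z_star = NormalDist().inv_cdf((1 + c) / 2)   # critical value for the middle c of the curve
    margin = z_star * stdev / math.sqrt(n)       # z* times the standard deviation of x-bar
    return x_bar - margin, x_bar + margin, margin

# Hypothetical inputs
lower, upper, margin = z_interval(stdev=15, x_bar=100, n=36, c=0.95)
print(round(lower, 2), round(upper, 2), round(margin, 2))
```

Raising the confidence level or shrinking the sample size widens the interval, which matches the card on what affects margin of error.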

Level of confidence

The percent of the time that the confidence interval estimation procedure will give you intervals containing the value of the parameter being estimated. (Note: This can only be defined in terms of probability as follows: "The probability that the confidence interval to be computed (before data are gathered) will contain the value of the parameter." After data are collected, level of confidence is no longer a probability because a calculated confidence interval either contains the value of the parameter or it doesn't.)

Regression r-square (R^2)

The percentage of variation in y explained by x. Example: R^2 = __% of the variation in y is explained by x. ** If R is close to 1 or -1, the correlation is strong.

Binomial distributions describe ___? And are used when ___?

The possible number of times that a particular event will occur in a sequence of observations. They are used when we want to know the probability of the number of times an occurrence takes place.

β

The probability of failing to reject a false null hypothesis (probability of a type II error).

P-value

The probability of getting a test statistic as extreme or more extreme than the value actually observed assuming H0 is true.

p-value

The probability of getting data (summarized with the test statistic) as extreme or more extreme than the one observed (in the direction of the alternative hypothesis) assuming Ho is true
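
As a rough illustration of "as extreme or more extreme, in the direction of the alternative," the sketch below converts a z test statistic into a P-value. It assumes the test statistic follows the standard normal distribution under H0; the function name and the labels "less", "greater", and "two-sided" are illustrative choices.

```python
from statistics import NormalDist

def z_p_value(z, alternative="two-sided"):
    """P-value for a z test statistic, assuming H0 is true and the statistic is standard normal."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "less":
        return std_normal.cdf(z)                    # area to the left of z
    if alternative == "greater":
        return 1 - std_normal.cdf(z)                # area to the right of z
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))     # both tails
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

# Hypothetical test statistic
p_two_sided = z_p_value(1.96)
p_greater = z_p_value(1.96, "greater")
print(round(p_two_sided, 4), round(p_greater, 4))
```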

power (1 - β)

The probability of rejecting a false null hypothesis.

Significance level (symbolized by α)(probability of a type I error)

The probability of rejecting a true null hypothesis; equivalently, the largest risk a researcher is willing to take of rejecting a true null hypothesis.

Level of significance

The probability of rejecting a true null hypothesis; equivalently, the largest risk a researcher is willing to take of rejecting a true null hypothesis. Probability of Type 1 error. Alpha.

Conditional Probability

The probability that Event A occurs, given that Event B has occurred. The conditional probability of Event A, given Event B, is denoted by the symbol P(A|B).

Intersection

The probability that Events A and B both occur is the probability of the intersection of A and B. The probability of the intersection of Events A and B is denoted by P(A ∩ B). If Events A and B are mutually exclusive, P(A ∩ B) = 0.

Union

The probability that Events A, B, or both occur is the probability of the union of A and B. The probability of the union of Events A and B is denoted by P(A ∪ B) .

Percentile

The pth percentile is a value such that p percent of the observations fall at or below the value.

Empirical Rule

The rule gives the approximate % of observations within 1 standard deviation (68%), 2 standard deviations (95%), and 3 standard deviations (99.7%) of the mean when the histogram is well approximated by a normal curve.

Population

The set of all the subjects of interest

Sample

The set of subjects for which we have data

population standard deviation (σ)

The standard deviation of all observations in a population; a measure of the variability of all the population values about their mean.

What does standard error estimate?

The standard deviation of the sampling distribution of x-bar

Mean

The sum of observations divided by the number of observations (sensitive to outliers)

Chi-square distribution

The theoretical distribution that models the test statistic for doing Chi-Square tests

pooled sample proportion

The value used for p̂ when computing the standard error for the two-sample proportion z test statistic.

sampling variability

The variability of a statistic from one sample to the next; a measure of sampling variability necessary to effectively do inference.

unexplained variation:

The variation of the y's about the regression equation; equals the sum of squared residuals.

y-intercept

The y value where the regression line intercepts (crosses) the y-axis.

five number summary

These five values: minimum, Q1, median, Q3, maximum; preferred numerical summary when data are very skewed or outliers are present

pooled sample proportion for the two-sample proportion z test statistic (how to compute)

To compute, add the number of successes in both samples and divide by the sum of the two sample sizes.
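
The computation described above (successes in both samples divided by the combined sample size) can be sketched as follows. The standard-error and z formulas are the usual pooled two-proportion ones; the function name and the counts are hypothetical.

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Return (pooled proportion, z statistic) for testing H0: p1 = p2."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p_hat1 = x1 / n1
    p_hat2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # total successes / total sample size
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; z is undefined")
    z = (p_hat1 - p_hat2) / se
    return pooled, z

# Hypothetical counts: 40 successes out of 100 versus 30 out of 100
pooled, z = two_prop_z(40, 100, 30, 100)
print(pooled, round(z, 3))
```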

Mutually Exclusive

Two events are mutually exclusive or disjoint if they cannot occur at the same time.

bivariate data

Two measurements are made on each unit.

Graphs

Two way table: 2 categorical. Scatter-- 2 Quant. Histogram--1 Quantitative. Box-plot--quantitative. Bar graph--categorical data.

C→C

Two-sample Z for proportions Chi-square test

Types of histograms (6)

Uniform; Bimodal; Bell shaped; Skewed right (mean > median); Skewed left (mean < median); Symmetric (mean = median)

symmetric distributions

Use Mean and Variance/SD

skewed distributions

Use Median and IQR or 5-number summary.

prediction

Using a regression equation to estimate a value of the response variable for a given value of the explanatory variable.

continuous random variable

X takes on all values in an interval of numbers

Regression line equation

y = β0 + β1x

What value should be used for a 95% interval for tstar?

YOU DON'T KNOW THIS UNTIL YOU KNOW THE DEGREES OF FREEDOM AND LOOK AT THE T TABLE!

Linear Regression test

You test normality of histogram or stemplot of residuals

Measure of position

Z score, z=(observation - mean)/ standard deviation
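
A one-line computation makes the formula concrete. The function name and the numbers (an observation of 85 from a distribution with mean 70 and standard deviation 10) are made up for illustration.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations an observation lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (observation - mean) / sd

# Hypothetical values
z = z_score(85, 70, 10)
print(z)  # positive: the observation is above the mean
```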

yˆ =

a + bx

negative correlation

a correlation in which there is inverse agreement between the location of cases on X and Y. As scores on X increase, scores on Y decrease.

positive correlation

a correlation in which there is direct agreement

discrete

a countable number of outcomes

Random Sampling

a method of selecting participants for a study so that everyone in a population has an equal chance of being in the study

parameter

a number that describes some characteristic of the population

matched-pairs design

a randomized blocked experiment in which each block consists of a matching pair of similar experimental units

Haphazard sample

a sample obtained when a researcher attempts to emulate a true chance mechanism or tries to pick a representative sample based on their idea of what the population looks like

Convenience sample

a sample of easily accessible units in a population

Voluntary response sample

a sample of units from a population that select themselves for inclusion

voluntary response sample

a sample of units from a population that select themselves for inclusion -people with strong opinions are most likely to respond

union

a set containing all and only the members of two or more given sets

Statistics

a set of concepts and procedures that helps us understand variability; helps us organize, summarize, interpret, and generalize our data

event

a set of outcomes in a probability experiment

If the most serious error is the type I error, then α should be set at

α should be set as small as possible (like 0.01) if the type I error is most serious.

trial

a single attempt at a probability experiment

lurking variables

a variable that is not among the explanatory or response variables in a study but may influence the response variable

response

a variable whose values are compared across different treatments, dependent variable, what is measured

41. Which one of the following is NOT part of the definition for P-value?

a) Probability that the null hypothesis is true.

60. What characterizes a probability sample but not a sample of convenience?

a) Some type of random device is used to obtain a probability sample. b) Their probabilities can be computed. c) All possible probability samples can be listed. d) Inferences can appropriately be made from probability samples. e) All of the above.

39. To standardize means to

a) subtract the mean from a value and then divide by the standard deviation.

The t-distribution with 8 degrees of freedom has ____________________ the standard Normal distribution.

a) the same center but is more spread out than

33. The least squares regression line is the line for which

a. the sum of the squared residuals is minimized.

examples of quantitative

age height time

quantitative

age in years (type of variable)

complement of an event

all of the outcomes NOT in an event

probability of an event

always between 0-1

prediction interval

an interval estimate of plausible values for a single observation of Y at a specified value of X

relative frequency

another term for proportion; number of times event occurs divided by the number of trials (long run)

law of large number

as the number of independent trials increases, the proportion of occurrences of any given outcome approaches a particular number "in the long run"

null hypothesis

asserts that nothing special is happening with respect to some characteristic of the underlying population: NO difference in the groups you're testing; the means of different groups are the same. We want to disprove this.

29. How is level of confidence determined?

b) Subjectively determined by the researcher.

15. The margin of error in a confidence interval covers only which kind of errors?

b) errors due to random sampling

State the null hypothesis for testing independence (Chi square test).

b. H0: No relationship between type of community and internet usage (H0: p1 = p2 = p3; the conditional distributions are the same for all 3 communities). Since these data are from an SRS with two questions, we should do a test of independence. Answer "a" would be valid if these data were from a stratified sample of community type.

State the alternative hypothesis for testing independence. (Chi square)

b. Ha: Relationship exists between type of community and internet usage. ---I think that if we had more than two questions and a stratified sample, the answer would be this: Ha: At least one pi differs from the others.

Least-Squares Regression Line

best-fit line: the line that minimizes the sum of the squares of the vertical distances of the data points from the line

BinomCDF

binomcdf(n, p, k)
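
On the TI-84, binomcdf(n, p, k) returns the cumulative probability P(X ≤ k). The sketch below reproduces that quantity in Python by summing binomial probabilities; the helper names are illustrative and this is not the calculator's own algorithm.

```python
import math

def binom_pmf(n, p, k):
    """P(X = k) for a binomial count with n observations and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_cdf(n, p, k):
    """P(X <= k): sum the probabilities of 0, 1, ..., k successes."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    return sum(binom_pmf(n, p, i) for i in range(k + 1))

# Hypothetical example: probability of at most 1 head in 2 fair coin tosses
prob = binom_cdf(2, 0.5, 1)
print(prob)
```

Here `math.comb(n, k)` is the binomial coefficient, the number of ways to arrange k successes among n observations.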

disjoint

both events cannot occur in same "probability experiment"

58. Why do we compare different treatment groups in experiments?

c) To neutralize the effects of lurking variables and measure treatment differences.

places individuals into one of several groups or categories

categorical variable

(x-bar chart)

chart used to monitor a process to determine whether it is in control or out of control.

histogram

create equal-length intervals or "bins" - obtain frequencies or relative frequencies of cases in each bin. - bins need to span the whole range of the data, but not overlap. - The bars will be of equal thickness since the intervals are of equal length. - There should be no gaps between bars unless they are meaningful.

59. Which one of the following is NOT a difficulty in experimentation?

d) Using blocking to remove variation associated with a lurking variable from experimental error.

subjective frequency

degree of belief that an outcome will occur (short run)

mathematical approximations to histograms -smooth curve

density curve

relative frequency distribution

describes the fraction (or %) of times a value occurs in the dataset for this variable.

frequency distribution for a variable

describes the number of times a value occurs in the dataset for this variable.

residual plot:

diagnostic plot of the residuals versus the explanatory variable used to assess how well the regression line fits the data; complete scatter with a shoe box shape is good; curvature indicates that a non-linear model would better fit the data, and a megaphone pattern indicates the standard deviation of y is not the same for all values of x.

range

difference between the minimum value and maximum value

y hat

the predicted value of y from the regression line (the residual is the difference between the actual y and this predicted y)

the ________ of a quantitative variable tells us what values the variable takes on and how often it takes those values

distribution

sampling distribution of p̂

distribution of the sample proportion; a list of all the possible values for p̂ together with the frequency (or probability) of each value.

t distribution

distribution specified by degrees of freedom used to model test statistics for the sample mean, differences between sample means, etc. where σ ('s) is (are) unknown.

correlation _________ describe curved relationships between variables

does not

Constant with linear regression

don't use it

quantitative visual aide

dot plot, stem and leaf plot, histogram

Ordinal

dress size (type of variable)

categorical

each observation belongs to a type or a set

randomness

each subject of the population has the same chance of being included in the sample

subjects

entities measured in a study

type 2 error

fail to reject a false null hypothesis; saying there is not an effect when there is; more likely to occur when the sample size is too small

linear regression:

finding the line that best describes how the response variable linearly depends on the explanatory variable.

a probability model with a finite sample space is called...

finite or discrete

regression line

fitted through the points of a scatterplot that summarize and describes the relationship between the variables

consists of the smallest observation, the first quartile, the median, the third quartile and the largest observation, written in order from smallest to largest

five number summary

how to tell which way a graph is skewed

follow the tail

examples of categorical

gender, species, political preference

histogram

graph using vertical bars to portray the frequencies of outcomes

categorical (or qualitative)

hair color (type of variable)

Correlation (properties) linear

has no units; the correlation between x and y is the same as between y and x; ranges between -1 and +1; not affected by changes in center or scale of either variable; sensitive to outliers; only measures strength of ________ association

approximate sampling distribution:

The distribution of x̄-values obtained from repeatedly taking simple random samples of the same size from the same population. (An x̄ is computed from each sample.)

the final digit of each observation

leaf

Outlier

less than Q1 - (1.5*IQR) or more than Q3 + (1.5*IQR)

extreme outlier

less than Q1 - (3*IQR) or more than Q3 + (3*IQR)
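
The two fence rules above (1.5×IQR for outliers, 3×IQR for extreme outliers) can be applied as in the sketch below. Quartile conventions differ between textbooks and calculators; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, and the data set is made up.

```python
import statistics

def find_outliers(data, multiplier=1.5):
    """Return values below Q1 - multiplier*IQR or above Q3 + multiplier*IQR."""
    # Assumption: quartiles computed with the 'inclusive' interpolation method
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - multiplier * iqr
    high_fence = q3 + multiplier * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical data with one unusually large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 50]
outliers = find_outliers(data)               # 1.5 * IQR rule
extreme_outliers = find_outliers(data, 3)    # 3 * IQR rule
print(outliers, extreme_outliers)
```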

What things affect margin of error?

level of confidence and sample size--POPULATION SIZE DOES NOT

alpha is AKA?

level of significance

bias sampling

likely to under or over represent groups in the population that tend to have different values of a variable under investigation -undercoverage -nonresponse response

to assign probabilities in a finite model

list the probabilities of all the individual outcomes; these probabilities must be numbers between 0 and 1 that add to 1; the probability of any event is the sum of the probabilities of the outcomes making up the event

regression analysis

make quantitative predictions of one variable from the values of another

symmetric distribution

mean = median

not resistant to outliers and skewness

mean and standard deviation

What is the parameter being tested in context? (FOR THE MATCHED PAIRS CARS EXPERIMENT)

mean difference between mpg with clean air filters and mpg with dirty air filters

μ

mean of the population distribution

Dbar

mean of the sample of differences

negatively skewed

mean, then median, then mode, tail on left

Mean and stdev

meaningless for categorical data, but not when you're talking about the sampling distribution of phat.

measures of variability

measures that indicate the degree of dispersion or spread of the data; include range, variance, and standard deviation

Pearson product-moment correlation (r)

measures the strength of linear relationship between two normally distributed interval or ratio scale variables

correlation ( r )

measures the strength of the linear relationship between two quantitative variables

inferential stats

method of making decisions or predictions based on data obtained

descriptive stats

methods for summarizing data

Positively skewed

mode then median then mean, tail on right

measures of central tendency

mode, median mean Types of Descriptive Statistics answers "what is a typical score?" describes the center

normal probability model

most common and arguably one of the most important

mode

most frequently occurring value, resistant

sample statistic

numerical summary of a sample taken from a population

observational study

observes individuals and measures variables of interest but does not attempt to influence the responses

selection bias

occurs when some part of the target population is not in the sampled population

The data for the box plot:

only quantitative data

normalcdf(lower bound, upper bound, μ, σ)

The output is the probability that a random variable (with a normal distribution) falls in the interval from lower bound to upper bound. The default values for the mean and standard deviation correspond to the standard normal curve, where μ = 0 and σ = 1.
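
The same probability can be computed without the calculator. This is a small sketch built on Python's `statistics.NormalDist`, not the TI-84's implementation; the function name mirrors the calculator command only for readability.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Probability that a normal random variable falls between lower and upper."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Standard normal curve: area within 1 and within 2 standard deviations of the mean
within_one = normalcdf(-1, 1)
within_two = normalcdf(-2, 2)
print(round(within_one, 4), round(within_two, 4))
```

The two printed areas are the exact values behind the 68% and 95% figures of the empirical rule.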

independent

outcome of one event doesn't depend on outcome of the other event

independent trials

outcome of one trial does not affect the outcome of any other trial

matched pairs design

randomized blocked experiment in which each block consists of a matching pair of similar experimental units

Non-probability samples have no____

randomness

population parameter

refers to the numeric summary of population

type 1 error

reject a null hypothesis that is actually true; saying there is an effect when there isn't. Reduce Type I errors by decreasing α to .01 or .001, BUT this will increase your chance of a Type II error.

significant result

test of significance that yields a P-value less than α; an observed effect that is larger than could reasonably be expected due to chance alone.

T ratio is

test statistic in linear regression

Degrees of Freedom with matched pairs

the # of pairs minus one

Confidence interval--what goes in it?

the PARAMETER

response variable

the quantitative measure from units or people of interest; dependent variable; predicted variable; y

conditions for inference

the relationship is linear in the population; the response varies normally about the population regression line; observations are independent; the standard deviation of the responses is the same for all values of x (NO FUNNEL SHAPE IN RESIDUALS)

if the distribution is exactly symmetric then the mean and median are...

the same

The theoretical conditions necessary for performing a one-sample t procedure are:

the sample was randomly selected and the population is Normally distributed.

complement

the set of all outcomes in S that are not contained in A

sample space

the set of all possible outcomes

in a skewed distribution the mean is usually farther out in which direction

the tail

bias

the tendency to systematically favor certain parts of the population over others

sampling distribution of (r)

the theoretical distribution of values of r obtained from all possible samples of size N drawn from a population in which there is no correlation between X and Y

experimental units

the individuals (subjects or objects) in the experiment to which treatments are applied

influential data point

the x value is relatively high or low compared to other data; point has a large residual for the regression line without using that data point

continuous examples

time, age, weight

independent variable

treatment is manipulated by the researcher

correlation is not resistant

true

β is the area in the fail-to-reject-H0 region under the curve defined by Ha. True or False?

true

What is the margin of error

tstar times the standard error (s/sqrt N)

disjoint event

two or more events with no outcomes in common.

randomize

use chance to assign experimental units to treatments

approx symmetric distribution

use mean and standard deviation

Random sampling

use of chance to select a sample, is central principle of statistical sampling

skewed distributions or distributions with extreme outliers

use median and quartiles

outlier

value falling far from the rest of the data

Parameter of a normal distribution

values that uniquely identify the distribution. For the Normal distribution they are the mean µ ("mu") and standard deviation σ ("sigma") of the population. Note that the mean and standard deviation computed from actual observations (data) are denoted by ȳ and s, respectively; these are not parameters.

a characteristic that is observed on an individual that can change between individuals -corresponds to the column

variable

explanatory

variable that is manipulated, or the treatment

standard deviation squared

variance

residual

vertical distance between a point and a regression line; residuals are squared so positives and negatives do not cancel one another out

outcome

what was observed on a trial

negative relationship

when above-average values of one tend to accompany below-average values of the other

Nonresponse bias

when an individual chosen for the sample can't be contacted or refuses to participate

Response bias

when an individual does not answer accurately or honestly for any reason (embarrassment, want to please, wording of question)

positive relationship

when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together

Undercoverage bias

when some groups in the population are left out of the process of choosing the sample

confounding:

when the effects of the lurking variable and the explanatory variable on the response variable cannot be distinguished from each other.

How do you compute the value of the test statistic for a matched pairs t?

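t = (d̄ − μ0) / (s_d / √n), where μ0 = 0, d̄ is the mean of the sample differences, s_d is their standard deviation, and n is the number of pairs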
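
A matched pairs t statistic is a one-sample t statistic computed on the paired differences, with a hypothesized mean difference of zero and (number of pairs − 1) degrees of freedom. The sketch below works under that standard setup; the function name and the five differences are hypothetical.

```python
import math
import statistics

def matched_pairs_t(differences, mu0=0.0):
    """Return (t statistic, degrees of freedom) for the mean of paired differences."""
    n = len(differences)
    if n < 2:
        raise ValueError("need at least two pairs")
    d_bar = statistics.mean(differences)     # mean of the sample of differences
    s_d = statistics.stdev(differences)      # sample standard deviation of the differences
    if s_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = (d_bar - mu0) / (s_d / math.sqrt(n))
    return t, n - 1                          # df = number of pairs minus one

# Hypothetical paired differences (e.g., measurement 1 minus measurement 2 for each pair)
differences = [1, 2, 3, 4, 5]
t, df = matched_pairs_t(differences)
print(round(t, 4), df)
```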

discrete random variable

x takes fixed set of possible values - whole numbers

residual calculation

y - y hat

What is the difference between σ and s?

σ is the standard deviation of a population whereas s is the standard deviation of a sample.

equation relations

• μy = α + βx (population regression line) • ŷ = a + bx (sample regression line)

Between what two values are 95% of the bottle weights

○ If you know the mean and standard deviation parameters, it's just the mean plus/minus 2 times the standard deviation

• least squares line is the line for which the sum of

○ SQUARED RESIDUALS (variability of y's about the line, not the mean of the y's) is minimized.

Why is the condition of random selection or random allocation so important?

○ So we can make valid inferences

mean

sum of values/# of observations, not resistant

representative

Is my sample __________ of my population or can the results of my sample be generalized to the population that i intend to study?

What do you already know about a correlation coefficient?

It can act as its own test statistic.

What does a correlation coefficient represent?

It is a numerical index that reflects the relationship between two variables.

consists of all but the rightmost digit

stem

Observations must meet these requirements (binomial setting)

- the total number of observations n is fixed in advance - each observation falls into just 1 of 2 categories: success or failure - the outcomes of all n observations are statistically independent - all n observations have the same probability of "success", p.

three main principles for experimental design

-control -randomize -replication

convenience sample

A sample type where the researcher contacts those subjects who are readily available and does not use any random selection. The results are almost always biased.

displays for quantitative

stem-and-leaf display, box plot, histogram

margin of error

1/√n (a rough 95% margin for a sample proportion); measures how well the sample estimate predicts the population percent
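
A small Python sketch comparing the quick 1/√n rule with the usual 95% margin for a proportion, 1.96·√(p̂(1 − p̂)/n). The sample size and the helper names are assumptions for illustration.

```python
import math

def quick_margin(n):
    """Rule-of-thumb margin of error: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

def margin_95(p_hat, n):
    """95% margin of error for a sample proportion."""
    return 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)

n = 400  # made-up sample size
quick = quick_margin(n)
worst_case = margin_95(0.5, n)  # p-hat = 0.5 gives the largest margin
print(quick, worst_case)
```

The quick rule slightly overstates the margin, so it works as a conservative shortcut.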

what quartile is the median?

2nd quartile and is resistant to outliers

When to use chi

3 or more categories for either the explanatory variable or the response variable

Empirical rule #1

68% includes mean minus 1 standard deviation and mean plus 1 standard deviation

Empirical rule #2

95% includes mean minus 2 standard deviations and mean plus 2 standard deviations

Empirical rule #3

99.7% includes the mean minus 3 standard deviations and mean plus 3 standard deviations
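
The three empirical rule percentages can be checked with a short Python sketch, assuming an exactly Normal distribution and using `statistics.NormalDist` from the standard library (Python 3.8+).

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(k, round(area, 4))
```

The exact areas are close to, but not exactly, 68%, 95% and 99.7%; the rule rounds them for easy recall.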

Multiple comparisons

: Performing two or more tests of significance on the same data set. This inflates the overall α (probability of making a type I error) for the tests. (The more comparisons performed, the greater the chances of falsely rejecting at least one true null hypothesis.)
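
A rough Python sketch of how the overall α grows, assuming the k tests are independent and each uses the same α (a simplification; real tests on one data set are often correlated). The function name is hypothetical.

```python
def overall_alpha(alpha, k):
    """Chance of at least one false rejection in k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(k, round(overall_alpha(0.05, k), 4))
```

With ten tests at α = 0.05, the chance of at least one type I error is already about 40%.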

Residual

= y − ŷ, the difference between the observed y and the predicted y

Normal distribution

A bell-shaped symmetric density curve that is often used as a model for data or other random variables; specified by μ and σ.

direction of relationship

A characteristic of data in a scatterplot that is identified as either a positive or negative association.

degrees of freedom

A characteristic of the t-distribution (and other distributions like F and χ²) indicating the amount of information available in the data. A complete definition of "degrees of freedom" is beyond the scope of an introductory statistics course.

bias (sampling)

A condition that occurs when the design of a study systematically favors certain outcomes.

unbiased:

A condition where the mean of all possible statistics equals the parameter that the statistic estimates.

left skewed

A density curve where the left side of the distribution extends in a long tail. (Mean < median.)

right skewed distribution

A density curve where the right side of the distribution extends in a long tail; (mean > median).

non-probability sample

A sample selected without randomization; hence, the probability of obtaining a particular sample cannot be computed.

block

A group of experimental units sharing some common characteristic. In a randomized complete block design, random allocation of treatments is carried out separately within each group.

17. What is a distribution of a random variable?

A list of possible values of a variable together with how often each value occurs.

distribution

A list of the possible values of a variable together with the frequency (or probability) of each value.

distribution:

A list or a graph that shows the possible values of a variable together with the frequency of each value.

Q1 (First Quartile):

A location measure of the data that has approximately one-fourth or 25% of the data below it.

Q3 (Third Quartile)

A location measure of the data that has approximately three-fourths or 75% of the data below it.

regression equation

A mathematical formula for a straight line that models a linear relationship between two quantitative variables.

standard deviation of p-hat

A measure of the variability of the sampling distribution of p-hat; equals √(p(1 − p)/n)

slope ( β = parameter symbol; b = statistic symbol):

A measure of the average rate of change in the response variable for every one unit increase in the explanatory or independent variable.

variance

A measure of the average squared deviation of the data about the mean.

probability

A measure of the proportion of times an outcome occurs in a very long series of repetitions indicating the likelihood of the outcome.

correlation coefficient

A measure of the strength of the linear relationship between two quantitative variables.

randomization

A method of assigning experimental units to treatment groups that eliminates bias and gives each unit the same probability of being assigned to any treatment group

voluntary response

A method of sample selection that consists of people choosing themselves by responding to a general appeal

test statistic

A numerical value calculated from the sample information assuming H0 is true; used to obtain P-value.

dotplot

A one dimensional plot of a quantitative data set where each value in the data set is represented by a dot above its corresponding location on the x axis.

random phenomenon

A phenomenon with outcomes that are individually unpredictable, but follow a predictable distribution in the long run (i.e., in a very large number of repetitions).

boxplot

A plot of data that incorporates the maximum observation, the minimum observation, the first quartile, the second quartile (median) and the third quartile.

in control

A process functioning within acceptable limits.

probability sample

A sample chosen using some type of random device. The probability of any specific sample can be computed and is greater than zero.

Normal Approximation for Binomial Distributions

If n is large and p is not too close to 0 or 1, the binomial distribution can be approximated by a Normal distribution: B(n, p) ≈ N(µ = np, σ = √(np(1 − p))). It can generally be used when np ≥ 10 and n(1 − p) ≥ 10. This approximation can be improved with a continuity correction.
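
A minimal Python sketch comparing an exact binomial probability with its Normal approximation, with and without the continuity correction. The values n = 100, p = 0.5, k = 55 are chosen only for illustration, and `binom_cdf` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def binom_cdf(n, p, k):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n, p, k = 100, 0.5, 55
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

exact = binom_cdf(n, p, k)
approx_plain = NormalDist(mu, sigma).cdf(k)     # no continuity correction
approx_cc = NormalDist(mu, sigma).cdf(k + 0.5)  # continuity correction: use k + 0.5

print(round(exact, 4), round(approx_plain, 4), round(approx_cc, 4))
```

Here np = 50 and n(1 − p) = 50, so the condition holds, and the corrected approximation lands much closer to the exact value than the uncorrected one.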

For fixed α, how can we increase power?

Increasing the sample size will decrease the spread, making both curves taller and skinnier. That will increase power.
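
A hedged Python sketch of this effect for one simple case: a one-sided z-test with known σ, where the true mean sits `delta` above the null value. Under those assumptions power = 1 − Φ(z* − delta·√n/σ). The function name and the numbers are illustrative, not from the cards.

```python
import math
from statistics import NormalDist

def power_one_sided_z(n, delta, sigma, alpha=0.05):
    """Power of a one-sided z-test when the true mean is delta above the null mean."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # critical value for fixed alpha
    shift = delta * math.sqrt(n) / sigma      # grows as the sample size grows
    return 1 - NormalDist().cdf(z_crit - shift)

powers = {n: power_one_sided_z(n, delta=0.5, sigma=1.0) for n in (10, 20, 40, 80)}
for n, pw in powers.items():
    print(n, round(pw, 3))
```

α stays at 0.05 throughout; only n changes, and the power rises with it.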

Regression explanatory variable

Independent variable, the groups to be compared with respect to values on the response variable

Asked for Percent or probability-Binomial

Inequality corresponding to sign.

data

Information collected on individuals.

IQR

Interquartile Range: IQR = Q3 - Q1

Asked for quartile, normal distribution.

invNorm(area to the left, mean, SD)

___________ show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class

histograms

standard deviation

a measure of how far the values typically fall from the mean

The distribution of a count depends on

how the data are produced.

1. draw and label a number line that includes the range of the distribution 2. draw a central box from Q1 to Q3 3. note the median M inside the box 4. extend lines from the box out to the min and max values that are NOT outliers

how to make a boxplot

5 number summary

includes the minimum, first quartile, median, third quartile, & the maximum
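
A Python sketch of the five number summary together with the IQR and the 1.5×IQR outlier rule. Quartile conventions differ between textbooks and software; this sketch assumes the "median of each half" convention (median excluded from both halves when n is odd). The data are made up.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # observations below the median position
    upper = xs[half + (n % 2):]  # observations above it (median skipped if n is odd)
    return (xs[0], statistics.median(lower), statistics.median(xs),
            statistics.median(upper), xs[-1])

data = [1, 3, 4, 6, 7, 8, 9, 12, 30]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
print(minimum, q1, med, q3, maximum, iqr, outliers)
```

In a boxplot, the whiskers would stop at 1 and 12, and 30 would be drawn as a separate outlier point.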

As degrees of freedom_____ the t-distribution curve approaches normal

increase

The shape of the t-distribution gets closer and closer to the shape of standard Normal distribution as the degrees of freedom

increase.

an object described by data ex: people, animals, households -corresponds to the row

individuals

Statistical inference Definition

inferring something about the population based on the observed sample.

When many tests of significance are performed on one set of data, the researcher is guilty of performing multiple analyses and

inflating the overall α.

Appropriate interpretation of interval vs. of level of confidence

Interval: we are confident that the parameter lies within this particular interval. Level: the confidence interval procedure yields intervals that contain the parameter that percentage of the time.

InvNorm

invNorm(left area/probability, mean, sd)

Power

is the probability of rejecting H0 when H0 is false.

Standard deviation for the first formula

is the standard deviation of pHAT

When computing a test statistic for a matched pairs t, μ₀

is zero (almost always)

The mean of every t-distribution

is zero just like the standard Normal distribution

How to know if it's np checks (np ≥ 10 and n(1 − p) ≥ 10)

it always is for proportions

resistant

outliers have little effect on the values

which outlier is influential

outliers in the x direction that change the equation of the line or the correlation coefficient

1-VarStats(List)

output is the mean of the list, Σx, Σx², the sample standard deviation, the population standard deviation, the length of the list, and the five number summary

binompdf(n, p, r)

output is the probability of exactly r successes out of n trials in a binomial experiment where the probability of success in a single trial is p

binomcdf(n, p, r)

output is the probability of no more than r successes out of n trials in a binomial experiment where the probability of success in a single trial is p
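
For readers without the calculator, a rough Python equivalent of these two functions can be sketched with `math.comb`. The Python names `binompdf` and `binomcdf` below simply mirror the calculator commands; they are not library functions.

```python
import math

def binompdf(n, p, r):
    """P(X = r): exactly r successes in n trials with success probability p."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

def binomcdf(n, p, r):
    """P(X <= r): no more than r successes in n trials."""
    return sum(binompdf(n, p, i) for i in range(r + 1))

exactly_3 = binompdf(10, 0.5, 3)
at_most_3 = binomcdf(10, 0.5, 3)
at_least_4 = 1 - at_most_3  # complement rule for "4 or more"
print(exactly_3, at_most_3, at_least_4)
```

The complement line shows the usual trick for "at least" questions: subtract the cdf at one less than the target from 1.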

invNorm(p, mean, stdev)

output is the value of the random variable with left tail probability p. This can be used to determine critical z values.
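
A rough Python counterpart of invNorm, sketched with `statistics.NormalDist.inv_cdf` from the standard library (Python 3.8+). The wrapper name `inv_norm` and the mean 100 / SD 15 example are assumptions for illustration.

```python
from statistics import NormalDist

def inv_norm(p, mean=0.0, sd=1.0):
    """Value with left-tail probability p under a Normal(mean, sd) distribution."""
    return NormalDist(mean, sd).inv_cdf(p)

z_star = inv_norm(0.975)             # critical z for a 95% two-sided interval
third_quartile = inv_norm(0.75, 100, 15)  # Q3 of a Normal with mean 100, SD 15
print(round(z_star, 3), round(third_quartile, 2))
```

The standard library has no t-distribution, so an invT equivalent would need a third-party package or a t table.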

invT(p, degrees of freedom)

output is the t-value with left tail probability p. This can be used to determine critical t values.

variable:

particular characteristic of an individual

displays used for categorical

pie and bar chart

categorical visual aid

pie chart, bar graph

data:

pieces of information about individuals organized into variables

continuous

possible values form an interval, infinite set of values

For fixed α, increasing sample size increases

power.

-always on or above the x-axis -never negative -has an area of exactly 1 underneath the curve

properties of density curve

a variable that is observed as a number

quantitative variable

facts about correlation coefficient

-r is always between -1 and 1 -r > 0 indicates a positive association -values near 0 indicate a weak linear relationship -r = ±1 indicates a perfect linear relationship

shows quantitative variable over time, can see if any trend is occurring over time

time plot

________ separate each observation into a stem and a leaf that are then plotted to display the distribution

stemplots

the variable one suspects is affected by the explanatory variable -measure outcome of study -variable you wish to predict -interest to study

response variable

r vs. rsquared

r²: the percentage of the variation in y that can be explained by x. r: the strength and direction of the linear relationship.
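
A small Python sketch that computes r and r² by hand for a made-up data set, assuming the usual formula r = Sxy / √(Sxx·Syy). The helper name `correlation` is illustrative.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r between two quantitative variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
r_squared = r ** 2  # share of the variation in y explained by the line
print(round(r, 4), round(r_squared, 2))
```

Here r is positive (direction) and fairly large (strength), while r² says about 60% of the variation in y is explained by x.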

Sample standard deviation symbol

s

Standard Error

s/√n

what is x-bar

sample mean or average

T-star and sample size

sample size is not a condition

simple random sample

a sample of size n in which each set of n elements in the population has an equal chance of selection

simple random sample SRS

all samples of size n are equally likely to be selected from the population of interest

Bias

sampling needs to be done in a way that we avoid _____

most useful graph to display the relationship between two quantitative variables

scatterplot

What kind of sample: California households called through random digit dialing, with each respondent asked a question

simple random sample

uniform probability model

simplest of all continuous probability models; the probability is spread evenly over the range of possible values, so every outcome is equally likely

Simple random sample (SRS)

a sample of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected

mean<median

skewed to the left

mean>median

skewed to the right

-can NOT be a negative -can be 0

standard deviation

σ

standard deviation of population distribution

z score

standard score indicating the number of standard deviations above or below the mean: (value - mean)/standard deviation
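
A brief Python sketch of the z score calculation. The mean of 100 and standard deviation of 15 are assumed values for illustration only.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

above = z_score(130, mean=100, sd=15)  # two standard deviations above the mean
below = z_score(85, mean=100, sd=15)   # one standard deviation below the mean
print(above, below)
```

The sign gives the direction from the mean and the size gives the distance in standard deviations.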

alternative hypothesis

-statement about the predicted relation between the groups you're studying -opposite of the null hypothesis -something special is happening -the means of different groups are NOT the same

29. Two variables are confounded when

the effect of one variable on the response variable cannot be separated from the effect of the other variable on the response variable.

46. A test of significance is intended to assess

the evidence provided by data AGAINST the null hypothesis in favor of the alternative hypothesis.

Standard error of the sample proportion

√(p̂(1 − p̂)/n), the standard deviation formula with p̂ in place of p

median

the midpoint of observations, resistant

stratified random sample

the population is divided into distinct groups. Members are selected at random from each group.

y-hat

the predicted value of the response variable for a given value of x

