Stats Final Exam
The least-squares regression line
of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Susan is curious whether the consumption of sweets, mainly donuts, increases as people get older. She asked 100 customers at a local Krispy Kreme how many donuts they eat in a month, y, and recorded their age, x. She obtained the regression line y = 35 − 1.23x. What is the meaning of the slope of the regression line?
On average, monthly donut consumption goes down by 1.23 donuts for each one-year increase in age.
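A quick way to see the slope interpretation is to plug two ages one year apart into the line y = 35 − 1.23x from the card above. The ages 20 and 21 in this sketch are arbitrary choices for illustration; any two ages one unit apart give the same difference.

```python
def predicted_donuts(age):
    """Predicted monthly donut count from the line y = 35 - 1.23x."""
    return 35 - 1.23 * age

# Arbitrary pair of ages one year apart (illustrative values only).
change = predicted_donuts(21) - predicted_donuts(20)
print(round(change, 2))  # -1.23
```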
categorical variable
places an individual into one of several groups or categories.
mean μ and standard deviation σ of the
population.
Correlation is usually written as
r
cycles
regular up-and-down movements
observed y − predicted y = y − ŷ
residual
sampling design:
Describes exactly how to choose a sample from the population
block design
Design of an experiment where the random assignment of individuals to treatments is carried out separately within each block.
law of large numbers
Draw observations at random from any population with finite mean µ. As the number of observations drawn increases, the mean of the observed values tends to get closer and closer to the mean µ of the population.
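The law of large numbers can be watched in a small simulation. This sketch assumes a fair six-sided die as the population (mean µ = 3.5); the seed and the sample sizes are arbitrary choices, not part of the original card.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable

def mean_of_die_rolls(n):
    """Mean of n simulated rolls of a fair six-sided die (population mean 3.5)."""
    total = sum(random.randint(1, 6) for _ in range(n))
    return total / n

# The sample mean should settle closer to 3.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```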
extrapolation
the extension of a graph, curve, or range of values by inferring unknown values from trends in the known data.
For something to be statistically significant
the observed effect needs to be so large that it would rarely occur by chance alone
Vets are using an anti-cancer drug in horses. A group of vets wanted to find out how many others have given the drug to their horses. They obtained a list of all the vets treating large animals, including horses, and sent questionnaires to all the vets on the list. Such a survey is called a census. The response rate was 40%. What is the population? (a) The population is all vets. (b) The population is all vets treating large animals. (c) The population is all vets in the U.S. treating large animals, including horses. (d) All answer choices are correct.
The population is all vets in the U.S. treating large animals, including horses.
What is the meaning of p-value?
The probability that we would observe our sample statistic, or one more extreme, given that the null hypothesis is true.
A relationship that holds for several groups is reversed when all the groups are combined
Simpson's paradox.
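A small numeric sketch can make the reversal concrete. The counts below are made up for illustration (they are not from any real study): treatment A has the higher success rate inside each group, yet B has the higher rate once the groups are combined.

```python
# Hypothetical (successes, trials) counts for treatments A and B in two groups.
groups = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a proportion."""
    return successes / trials

# Within each group, A does better than B.
for name, g in groups.items():
    print(name, rate(*g["A"]), rate(*g["B"]))

# Combine the groups by adding successes and trials for each treatment.
combined = {
    t: (sum(g[t][0] for g in groups.values()),
        sum(g[t][1] for g in groups.values()))
    for t in ("A", "B")
}
# After combining, B does better than A: the direction has reversed.
print("combined", rate(*combined["A"]), rate(*combined["B"]))
```

The reversal happens because A was used mostly in the group where success is rare, so group membership acts as a lurking variable.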
Q3-Q1
IQR
quantitative variable
Takes numerical values for which arithmetic operations such as adding and averaging make sense. The values are usually recorded with a unit of measurement such as seconds or kilograms.
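finding the median
Arrange the n observations in order; the median is at position (n + 1)/2 in the ordered list. When n is even, that position falls halfway between the two middle observations, so average them.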
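The (n + 1)/2 location rule can be sketched in a few lines. This is a minimal illustration of the rule for hand calculation, not a replacement for a library routine; the function name is hypothetical.

```python
def median_by_position(values):
    """Median found with the (n + 1)/2 position rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based location in the ordered list
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that observation.
        return ordered[int(position) - 1]
    # Even n: the position ends in .5, so average the two neighbors.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

print(median_by_position([5, 1, 3]))     # 3
print(median_by_position([4, 1, 3, 2]))  # 2.5
```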
Find the proportion, rounded to two decimal places, of observations from a standard Normal distribution that satisfy z > 1.
0.16. Use a Normal cdf function with lower bound 1 and a very large upper bound.
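The same area can be checked without a calculator. This sketch uses `NormalDist` from Python's standard `statistics` module and takes the upper tail as 1 minus the cumulative area below z = 1.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area to the right of z = 1 is 1 minus the area to the left.
proportion = 1 - standard_normal.cdf(1)
print(round(proportion, 2))  # 0.16
```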
Simple Conditions for Inference about a Mean
1. We have an SRS from the population of interest. There is no nonresponse or other practical difficulty. The population is large compared to the size of the sample. 2. The variable we measure has an exactly Normal distribution N(μ, σ) in the population. 3. We don't know the population mean μ, but we do know the population standard deviation σ.
You measure the lifetime of a random sample of 64 tires of a certain brand. The sample mean is x̄ = 50 months. Suppose that the lifetimes for all tires follow a Normal distribution with unknown mean μ, and the standard error is calculated to be 1.225. What is a 95% confidence interval for μ?
50 ± 2.45 months (using z* ≈ 2 from the 68-95-99.7 rule; with z* = 1.96 the margin of error is about 2.40)
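The interval has the form estimate ± z* × standard error. The sketch below computes the margin two ways, assuming the card's answer comes from the rule-of-thumb z* ≈ 2, and comparing it with the more exact critical value from `NormalDist.inv_cdf`.

```python
from statistics import NormalDist

x_bar = 50.0            # sample mean, in months
standard_error = 1.225  # given in the question

# Rule-of-thumb critical value from the 68-95-99.7 rule.
margin_rule_of_thumb = 2 * standard_error

# More exact 95% critical value: area 0.95 between -z* and z*.
z_star = NormalDist().inv_cdf(0.975)
margin_exact = z_star * standard_error

print(round(margin_rule_of_thumb, 2))  # 2.45
print(round(margin_exact, 2))          # 2.4
print(x_bar - margin_exact, x_bar + margin_exact)
```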
Two-way tables compare
two categorical variables
ecological correlation
A correlation based on averages rather than on individuals.
density curve
A curve that is always on or above the horizontal axis and has area exactly 1 underneath it. It describes the overall pattern of a distribution. The area underneath it and above any range of values is the proportion of all observations that fall in that range
histogram:
A graph of the distribution of one quantitative variable. To construct, first divide the range of the data into classes of equal width. Next, count the individuals in each class. Then mark the horizontal axis in the units of measurement for the variable. The vertical axis contains the scale of counts. Each bar represents a class. The base of the bar covers the class, and the bar height is the class count.
block
A group of individuals that are known before the experiment to be similar in some way that is expected to affect the response to the treatments.
probability model
A mathematical description of a random phenomenon consisting of two parts: a sample space S and a way of assigning probabilities to events.
statistic
A number that can be computed from the sample data without making use of any unknown parameters. In practice, we often use this to estimate an unknown parameter.
parameter
A number that describes the population. In statistical practice, the value is not known because we can rarely examine the entire population.
voluntary response sample
A sample composed of people who choose themselves by responding to a broad appeal. These samples are biased because people with strong opinions are most likely to respond.
stratified random sample
A sample composed of separate SRSs chosen for different strata of a population (groups of individuals that are similar in some way that is important to the response).
convenience sample:
A sample selected by taking the members of the population that are easiest to reach.
experiment:
A study that deliberately imposes some treatment on individuals in order to observe their responses. The purpose is to study whether the treatment causes a change in the response
observational study
A study that observes individuals and measures variables of interest but does not attempt to influence the responses. The purpose is to describe some group or situation.
sample survey
A survey conducted on a sample from the population of all individuals about which we desire information. We base conclusions about the population on data from the sample.
test of significance
A test to assess the evidence provided by data against a null hypothesis H0 in favor of an alternative hypothesis Ha.
lurking variable
A variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
explanatory variable:
A variable that may explain or influence changes in a response variable. Also called an independent variable or predictor variable.
response variable
A variable that measures an outcome of a study. Also called a dependent variable.
Probability rules
All possible outcomes together must have probability 1. Because some outcome must occur on every trial, the sum of the probabilities for all possible outcomes must be exactly 1. Rule 2. If S is the sample space in a probability model, then P (S) = 1.
simple random sample
Also denoted as SRS, this sample of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected.
randomized comparative experiment
An experiment that uses both comparison of two or more treatments and random assignment of subjects to treatments.
statistically significant
An observed effect so large that it would rarely occur by chance.
event:
An outcome or set of outcomes of a random phenomenon. That is, it is a subset of the sample space.
Probability rules
Any probability is a number between 0 and 1. Any proportion is a number between 0 and 1, so any probability is also a number between 0 and 1. An event with probability 0 never occurs, and an event with probability 1 occurs in every trial. An event with probability 0.5 occurs in half the trials in the long run. Rule 1. The probability P (A) of any event A satisfies 0 ≤ P(A) ≤ 1.
treatment
Any specific experimental condition applied to the subjects. If an experiment has more than one factor, this is a combination of specific values of each factor.
test statistic:
Calculated from the sample data, measures how far the data diverge from what we would expect if the null hypothesis H0 were true. Large values of the statistic show that the data are not consistent with H0.
standard deviation:
Denoted by s, measure of the spread about the mean of a distribution. Equal to the square root of the variance.
significance level:
Denoted by α, the probability of a Type I error of any fixed level test.
critical value:
Chosen so that the standard Normal curve has area C between −z* and z*.
principles of experimental design
1. Control: restrict the effects of lurking variables on the response, most simply by comparing two or more treatments. 2. Randomization: use chance to assign subjects to treatments. 3. Replication: use enough subjects in each group to reduce chance variation in the results.
null hypothesis
Denoted by H0, the claim being tested by a statistical test. Usually this is a statement of "no effect" or "no difference".
alternative hypothesis
Denoted by Ha, the claim about a population that we are trying to find evidence for.
Probability rules
If two events have no outcomes in common, the probability that one or the other occurs is the sum of their individual probabilities. If one event occurs in 40% of all trials, a different event occurs in 25% of all trials, and the two can never occur together, then one or the other occurs on 65% of all trials because 40% + 25% = 65%. Rule 3. Two events A and B are disjoint if they have no outcomes in common and so can never occur together. If A and B are disjoint, P (A or B) = P(A) + P(B)
marginal distribution
In a two-way table of counts, the distribution of values of one of the categorical variables among all individuals described by the table.
y = a + bx
In this equation, b is the slope, the amount by which y changes when x increases by one unit. The number a is the intercept, the value of y when x = 0.
completely randomized design:
In this experimental design, all the subjects are allocated at random among all the treatments.
density curve
Is always on or above the horizontal axis. Has area exactly 1 underneath it.
The Median
Is resistant
What does "statistically significant difference" mean in describing the outcome of a randomized comparative study? (a) The differences in responses between the groups are small. (b) The differences in responses between the groups are large. (c) The observed differences in the responses between the groups are not likely due to chance and are more likely explained by the fact that different treatments were applied. (d) The observed differences were no more than what might reasonably occur by chance if there was no effect due to the treatments.
The observed differences in the responses between the groups are not likely due to chance and are more likely explained by the fact that different treatments were applied.
Right
Many economic variables have distributions that are skewed
The mean
Not a resistant measure
matched pairs design
One type of block design of an experiment that combines matching with randomization. This design compares just two treatments. Each subject receives both treatments in a random order, or the subjects are matched in pairs as closely as possible, and one subject in each pair receives each treatment.
If a hypothesis test is significant at level α = 0.10, what is known about the P-value?
The P-value is less than or equal to 0.10.
statistical inference
Provides methods for drawing conclusions about a population from sample data.
1.5 × IQR
Rule for outliers
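The 1.5 × IQR rule flags an observation as a suspected outlier when it falls more than 1.5 × IQR below Q1 or above Q3. The sketch below uses made-up data and assumes the textbook-style quartile convention (Q1 and Q3 are the medians of the lower and upper halves of the ordered data); software packages often use slightly different quartile rules.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations below the median's position
    upper = ordered[half + (n % 2):]  # observations above the median's position
    return median(lower), median(upper)

def suspected_outliers(values):
    """Observations more than 1.5 * IQR outside the quartiles."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [1, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data
print(quartiles(data))           # (3.5, 8.5)
print(suspected_outliers(data))  # [30]
```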
Distribution
Tells us what values a variable takes and how often it takes these values.
residual
The difference between an observed value of the response variable and the value predicted by the regression line.
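To tie residuals back to the least-squares line y = a + bx, the sketch below fits a line to a tiny made-up data set and lists the residuals. It assumes the usual least-squares formulas, slope b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept a = ȳ − b·x̄; the data values are hypothetical.

```python
def least_squares(xs, ys):
    """Return intercept a and slope b of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical responses

a, b = least_squares(xs, ys)

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(a, 2), round(b, 2))          # 2.2 0.6
print([round(e, 2) for e in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
```

The residuals from a least-squares line always sum to zero (up to rounding), which is a handy check on hand calculations.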
The correlation between the height and weight of teenagers aged 13-18 is found to be around r = 0.7. Suppose we use the height x of a teenager to predict the weight y of the teenager. We can conclude that: (a) the fraction of variation in weights explained by the least-squares regression line of weight on height is 0.49; (b) height is generally 80% of a teenager's weight; (c) about 70% of the time, age will accurately predict weight; (d) the least-squares regression line of y on x would have a slope of 0.7.
The fraction of variation in weights explained by the least-squares regression line of weight on height is 0.49, because r² = 0.7² = 0.49. Note that r does not equal the slope.
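The arithmetic behind this answer is short enough to check directly. The standard deviations below are hypothetical values chosen only to show that the slope, b = r × s_y / s_x, generally differs from r.

```python
r = 0.7  # correlation from the question

# Fraction of the variation in y explained by the least-squares line.
r_squared = r ** 2
print(round(r_squared, 2))  # 0.49

# Hypothetical standard deviations of height (x) and weight (y).
s_x, s_y = 4.0, 20.0
slope = r * s_y / s_x
print(round(slope, 2))  # 3.5
```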
subjects:
The individuals studied in an experiment, particularly when they are people.
What is the placebo effect
The positive response to a dummy treatment
Probability rules
The probability that an event does not occur is 1 minus the probability that the event does occur. If an event occurs in (say) 70% of all trials, then it fails to occur in the other 30%. The probability that an event occurs and the probability that it does not occur always add to 100%, or 1. Rule 4. For any event A, P(A does not occur) = 1 - P(A)
P-value
The probability, computed assuming that H0 is true, that the test statistic would take a value as extreme or more extreme than that actually observed. The smaller it is, the stronger the evidence against H0 provided by the data.
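As a rough illustration, the sketch below turns a z test statistic into a two-sided P-value and compares it with a few significance levels. The observed value z = 2.17 is hypothetical, and the two-sided alternative is an assumption of this example.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard Normal Z, computed assuming H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

z_observed = 2.17  # hypothetical test statistic
p = two_sided_p_value(z_observed)
print(round(p, 3))  # 0.03

# A result is significant at level alpha when the P-value is at most alpha.
for alpha in (0.10, 0.05, 0.01):
    print(alpha, "significant" if p <= alpha else "not significant")
```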
probability
The proportion of times an outcome of a random phenomenon would occur in a very long series of repetitions.
sample space:
The set of all possible outcomes of a random phenomenon.
confidence level:
The success rate of the method that produces the confidence interval. Gives the probability that the interval will capture the true parameter value in repeated samples.
confounded
Two variables (explanatory variables or lurking variables) are said to be this when their effects on a response variable cannot be distinguished from each other.
confidence interval
Uses sample data to estimate an unknown population parameter with an indication of how accurate the estimate is and of how confident we are that the result is correct. It has two parts: an interval calculated from the data and a confidence level. It often has the form: estimate ± margin of error.
nonresponse
When an individual chosen for the sample can't be contacted or refuses to participate.
undercoverage
When some groups in the population are left out of the process of choosing the sample.
the sampling distribution of a statistic
The distribution of values taken by the statistic in all possible samples of the same size from the same population.
a small p-value in a test of significance indicates
a large amount of evidence against the null hypothesis
What are some sources of error that are not covered by the margin of error? (a) Voluntary response bias. (b) Nonresponse bias. (c) Undercoverage bias. (d) All of these. (e) None of these.
all of these
What is a 95% confidence interval?
An interval computed by a method that, in 95% of all samples, produces an interval containing the true value of the parameter of interest.
A vet is interested in studying the causes of cancer in horses. In order to find a causal relationship between diet and cancer, the vet needs to use a: (a) double-blind observational study; (b) non-randomized experimental design in which the examining veterinarian is blinded to the diet the horse is fed; (c) comparative randomized experiment; (d) survey questionnaire.
comparative randomized experiment
For the given scatterplot, the correlation between the two quantitative variables x and y was determined to be 0.68. Which of the following is true? (a) Because the correlation is positive, we know that high values of one of the variables are always associated with high values of the other variable. (b) Correlation is inappropriate here because the relationship between the variables does not appear to be linear. (c) The result is surprising because the plot seems to suggest that there may be a negative association between the variables. (d) To get a better idea of the true relationship, the values of the observations should be standardized before calculating r.
Correlation is inappropriate here because the relationship between the variables does not appear to be linear.
Increase the sample size. Often, the most practical way to decrease the margin of error is to increase the sample size. ... Reduce variability. The less that your data varies, the more precisely you can estimate a population parameter. ... Use a one-sided confidence interval. ... Lower the confidence level.
decrease the width of a confidence interval
You can describe the overall pattern of a scatterplot by the
direction, form, and strength of the relationship.
An experiment investigated the effect of dosage (5, 10, or 15 mg) and method of delivery (pill or patch) of a new drug on the incidence of a particular effect. Four subjects were randomly assigned to each of the treatment combinations. In this experiment, what are the factors? (a) The four people in the experiment. (b) The six treatment combinations of dosage and method of delivery. (c) The new and old drugs. (d) Dosage and method of delivery.
dosage and method of delivery
The explanatory variables in an experiment are often called
factors
Other things being equal, the margin of error of a confidence interval gets smaller as
the confidence level C decreases, the population standard deviation σ decreases, or the sample size n increases.
The objects described by a set of data. They may be people, but they may also be animals or things.
individuals
lurking variable
is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
Rebecca obtains a P-value of 0.03. This would be significant at
level α = 0.05
A study of video game usage and eyesight recruited 200 high school students who play video games. Each of these students was paired with someone similar (same sex, age, race, etc.) who does not play video games. Eyesight was measured for all students. What kind of design is this? (a) Matched pairs design. (b) Randomized block design. (c) Completely randomized design. (d) Uncontrolled experiment. (e) Convenience design.
matched pairs design
The WAIS is a common IQ test for adults. The distribution of WAIS scores for a person aged 16 is approximately Normal with mean 100 and standard deviation 15. What are the mean and standard deviation of the sampling distribution of the average WAIS score of an SRS of 10 people?
mean = 100, s.d. = 15/√10 ≈ 4.7434
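The standard deviation of the sample mean is σ/√n, while its mean stays at μ. A short check of the numbers in this card:

```python
import math

mu, sigma, n = 100, 15, 10  # population mean, population s.d., sample size

mean_of_x_bar = mu                 # the sample mean is centered at mu
sd_of_x_bar = sigma / math.sqrt(n)  # spread shrinks by a factor of sqrt(n)

print(mean_of_x_bar)          # 100
print(round(sd_of_x_bar, 4))  # 4.7434
```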
Correlation and least-squares regression lines are
not resistant.
Some researchers have noted that adolescents who spend a lot of time playing video or computer games are at greater risk of depression and violence. This is an example of what? (a) A valid conclusion, because more time yielding more aggression is a positive association. (b) An observational study with lurking variables that may explain the association. (c) A single-blind experiment, because the subjects knew they were watching TV. (d) A paired data experiment, because we are studying aggression and TV watching.
observational study with lurking variables that may explain the association
Tests and confidence intervals for a population proportion p when the data are an SRS of size n are based on the
sample proportion p̂.
central limit theorem
says that when n is large, the sampling distribution of the sample mean x̄ is approximately Normal: x̄ is approximately N(μ, σ/√n).
The most useful graph for displaying the relationship between two quantitative variables is a
scatterplot.
Describe the overall pattern of a histogram by its
shape, center, and variability. You will sometimes see variability referred to as spread.
regression line
summarizes the relationship between two variables, but only in a specific setting: one of the variables helps explain or predict the other. That is, regression describes a relationship between an explanatory variable and a response variable.
All Normal curves have the same overall shape
symmetric, single-peaked, bell-shaped. Any specific Normal curve is completely described by giving its mean μ and its standard deviation σ.
An instructor collected homework grades and quiz grades from students in her class. She calculated the least-squares regression line to be quiz grade = 18.04 + 0.788 × (homework grade). What does the intercept 18.04 represent?
The predicted quiz grade when the homework grade is 0.
a long-term upward or downward movement over time
trend
Any characteristic of an individual. This can take different values for different individuals.
variable
An average of the squares of the deviations of a set of observations from their mean. Equal to the standard deviation squared.
variance
The sample size needed to obtain a confidence interval with approximate margin of error m for a population proportion is
n = (z*/m)² p*(1 − p*), where p* is a guessed value for the sample proportion p̂, and z* is the standard Normal critical value for the level of confidence you want. If you use p* = 0.5 in this formula, the margin of error of the interval will be less than or equal to m no matter what the value of p̂ is.
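A minimal sketch of the sample-size calculation, assuming the standard formula n = (z*/m)² p*(1 − p*) and rounding up to the next whole subject. The function name, the 3% margin, and the 95% level are illustrative choices, not from the original card.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(m, confidence=0.95, p_guess=0.5):
    """Smallest n giving margin of error about m, using n = (z*/m)^2 p*(1 - p*)."""
    # z* leaves area `confidence` between -z* and z* under the standard Normal curve.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z_star / m) ** 2 * p_guess * (1 - p_guess)
    return math.ceil(n)  # always round up so the margin is not exceeded

print(sample_size_for_proportion(0.03))  # 1068
```

Using p* = 0.5 is the conservative choice, since p*(1 − p*) is largest there.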
The level C large-sample confidence interval for p is
p̂ ± z* √(p̂(1 − p̂)/n), where z* is the critical value for the standard Normal curve with area C between −z* and z*. Use this interval only when both the number of successes and the number of failures in the sample are at least 15.
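The interval can be sketched in a few lines, assuming the usual large-sample form p̂ ± z* √(p̂(1 − p̂)/n). The function name and the counts (60 successes in 100 trials) are hypothetical; the check on successes and failures mirrors the "at least 15" condition in the card.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion p."""
    failures = n - successes
    if successes < 15 or failures < 15:
        raise ValueError("need at least 15 successes and 15 failures")
    p_hat = successes / n
    # z* leaves area `confidence` between -z* and z*.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(60, 100)   # hypothetical sample
print(round(low, 3), round(high, 3))  # 0.504 0.696
```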
