Stats Final Exam

Referring to the output, which of the following shows the correct calculation for a 95% confidence interval for β? Use df = 10 (not the correct df). (The slope is -0.0073 and the standard error of the slope is 0.00042.)

-0.0073 ± (2.228)(0.00042)
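The interval can be evaluated numerically as a quick check. This is a minimal sketch: the variable names are illustrative, and 2.228 is the t critical value for df = 10 quoted in the card.

```python
# 95% confidence interval for the slope: b ± t* × SE(b)
slope = -0.0073      # estimated slope b (from the card)
se_slope = 0.00042   # standard error of the slope (from the card)
t_star = 2.228       # t critical value for 95% confidence with df = 10

margin = t_star * se_slope
lower, upper = slope - margin, slope + margin
print(f"({lower:.5f}, {upper:.5f})")
```

Both endpoints are negative, so the interval does not contain 0.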

Define -1 through 1

-1 is a perfect negative relationship. Between -1 and 0 is a moderate negative relationship. 0 is no linear relationship. Between 0 and 1 is a moderate positive relationship. 1 is a perfect positive relationship.

For a biology project, you measure the weight, in grams, and the tail length, in millimeters (mm), of a group of mice. The equation of the least-squares line for predicting tail length from weight is: predicted tail length = 20 + 3*weight. Suppose a mouse weighing 20 grams has a 78 mm tail. What is the residual for this mouse?

-2 mm. A residual is how much our observed point differs from our predicted point, given a value for x. In this case we observed a mouse with a weight of 20 grams (our x value) and a tail length of 78 mm. To find our predicted point we plug 20 into our least-squares regression line: predicted tail length = 20 + 3*20 = 80. To find the residual we calculate (observed value - predicted value) = 78 - 80 = -2.
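The same arithmetic as a short sketch. The function name is made up for illustration; the line 20 + 3*weight and the observed values come from the card.

```python
def predicted_tail_length(weight_g):
    """Least-squares line from the card: predicted tail length = 20 + 3*weight."""
    return 20 + 3 * weight_g

observed = 78                          # observed tail length in mm
predicted = predicted_tail_length(20)  # prediction for a 20-gram mouse
residual = observed - predicted        # residual = observed y - predicted y
print(predicted, residual)
```

A negative residual means the observed point lies below the regression line.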

The mean height of American women in their twenties is about 64 inches, and the standard deviation is about 2.7 inches. The mean height of men the same age is about 69.3 inches, with standard deviation about 2.8 inches. If the correlation between the heights of husbands and wives is about r = 0.5, what is the slope of the regression line used to predict the husband's height (Y) from the wife's height (X) in young couples? Note: b = r(sy / sx)

0.5185. For this problem we are asked to find the slope given r and the standard deviations of our two variables. sy is the standard deviation of the husband's height and sx is the standard deviation of the wife's height. (The scenario told us which variables were X and Y in the last sentence.) Thus we get the equation b = 0.5(2.8/2.7), which comes out to 0.5185.
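A brief numeric check of b = r(sy / sx), using only the values given in the card (the variable names are illustrative):

```python
r = 0.5     # correlation between husbands' and wives' heights
s_y = 2.8   # SD of husbands' heights (Y), in inches
s_x = 2.7   # SD of wives' heights (X), in inches

b = r * (s_y / s_x)   # slope of the regression line of Y on X
print(round(b, 4))
```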

How to calculate r

1. Plot data and check for linear form (if nonlinear, stop!)
2. Compute the means and standard deviations of the x's and y's
3. Calculate (xi - xbar)/sx
4. Calculate (yi - ybar)/sy
5. Multiply the results from steps 3 and 4 for each individual
6. Sum the products from step 5
7. Divide by n - 1
Properties of r: ranges between -1 and 1; does not have units of measurement
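The computational steps (2 through 7) can be sketched as follows. The data are hypothetical and chosen only to illustrate the procedure; plotting (step 1) is left out.

```python
import statistics

# Hypothetical data, used only to illustrate the steps
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Step 2: means and sample standard deviations
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Steps 3 and 4: standardize each x and each y
z_x = [(xi - x_bar) / s_x for xi in x]
z_y = [(yi - y_bar) / s_y for yi in y]

# Steps 5 to 7: multiply pairwise, sum, and divide by n - 1
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)
print(round(r, 4))
```

Because each value is standardized before multiplying, the units cancel, which is why r has no unit of measurement.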

Regression Inference

1. x and y are linearly related. 2. The response variable has a normal distribution at each fixed value of x. 3. σy is the same for all x values.

Researchers collected data on the number of breeding pairs of Scarlet Macaws in an isolated area of an Amazon rainforest in each of 8 years (X) and the percentage of males who returned the next year (Y). The data show that the percentage returning is lower after successful breeding seasons and that the relationship is roughly linear. Refer to the regression output given below. What percentage of returning males would we expect in a given year if we observed 25 breeding pairs during the previous year? Simple linear regression results: Dependent Variable: percent.returned Independent Variable: breeding.pairs percent.returned = 136.682 - 3.218 breeding.pairs Sample size: 8 R (correlation coefficient) = -0.8329 R-sq = 0.6937 Estimate of error standard deviation: 9.460 Parameter estimates:

56.23%. To find the predicted percentage of returning males given that we observed 25 breeding pairs the previous year, we plug this value into our least-squares regression equation. We are trying to predict a value of y given a value for x. Looking at the output, we can determine that the equation is: percent.returned = 136.682 - 3.218*breeding.pairs. Plugging in the value of 25 for breeding.pairs, we get: percent.returned = 136.682 - 3.218*25. Solving this equation, we get percent.returned = 56.23%.
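A quick check of the prediction, using the intercept 136.682 and slope -3.218 from the regression output (the function name is illustrative):

```python
def predict_percent_returned(breeding_pairs):
    """Fitted line from the output: percent.returned = 136.682 - 3.218 * breeding.pairs."""
    return 136.682 - 3.218 * breeding_pairs

prediction = predict_percent_returned(25)
print(round(prediction, 2))
```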

Researchers collected data on the number of breeding pairs of Scarlet Macaw in an isolated area of an Amazon rainforest in each of 8 years (X) and the percentage of males who returned the next year (Y). The data show that the percentage returning is lower after successful breeding seasons and that the relationship is roughly linear. The following shows a StatCrunch regression output for these data. What percentage of the variation in the percent of returning males can be explained by the number of breeding pairs? Simple linear regression results: Dependent Variable: percent.returned Independent Variable: breeding.pairs percent.returned = 136.682 - 3.218 breeding.pairs Sample size: 8 R (correlation coefficient) = -0.8329 R-sq = 0.6937 Estimate of error standard deviation: 9.460

69%. The general interpretation of R-squared is "the percent of variation in y that is explained by the variation in x." To find the value of R-squared in the output, refer to the row labeled "R-sq".
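In simple linear regression, R-sq is the square of the correlation coefficient, so the two values in the output can be checked against each other. A minimal sketch using the reported r:

```python
r = -0.8329            # correlation coefficient from the output
r_squared = r ** 2     # proportion of variation in y explained by x
print(round(r_squared, 4), f"{r_squared:.0%}")
```

Squaring removes the sign, so R-sq alone does not show whether the relationship is positive or negative.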

correlation

= Covariance/(std x * std y). It tells us nothing about the size of the slope (only its sign). For correlation you need a quantitative response and a quantitative explanatory variable (or simply two quantitative variables; it doesn't really matter which is the response and which is the explanatory).

How do you decide whether a number is a parameter or statistic?

A number is a parameter if it summarizes the measurements for ALL individuals; it is a statistic if it summarizes the measurements for a sample of individuals.

How do you distinguish between a parameter and a population?

A parameter is either a mean (for quantitative data) or a proportion (for categorical data). The population is the collection of individuals.

Explanatory Variable

An explanatory variable is one that explains changes in the response variable. It can be anything that might affect the response variable.

Researchers deliberately overfed 16 young adults for 8 weeks. They measured fat gain (in kilograms) and change in energy use (in calories) from activities other than deliberate exercise. These activities, called nonexercise activities (NEA), include fidgeting, daily living, etc. The correlation between fat gain and NEA was r = -0.7786. What does the correlation coefficient tell us about the relationship between fat gain and NEA? Suppose fat gain was measured in pounds instead of kilograms. How would the value of r change? What is the range of possible values for r?

As NEA increases, fat gain decreases, on average. r would not change. -1.0 ≤ r ≤ +1.0
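The claim that r does not change with the unit of measurement can be checked numerically. The sketch below uses hypothetical values (not the study's data) and a small helper function written for illustration.

```python
import statistics

def correlation(xs, ys):
    """Sample correlation: sum of products of standardized values, divided by n - 1."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical values, not the study's data
nea_change = [-90, 50, 140, 250, 400, 560]            # change in NEA (calories)
fat_gain_kg = [4.2, 3.9, 3.0, 2.4, 1.7, 1.1]          # fat gain in kilograms
fat_gain_lb = [kg * 2.20462 for kg in fat_gain_kg]    # same gains in pounds

r_kg = correlation(nea_change, fat_gain_kg)
r_lb = correlation(nea_change, fat_gain_lb)
print(round(r_kg, 4), round(r_lb, 4))
```

Converting kilograms to pounds rescales both the deviations and the standard deviation by the same factor, so the standardized values, and therefore r, stay the same.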

What ability does regression give you that correlation does not?

Calculate mean of y given x (for all the x's what is the mean y)

Suppose a Stat 121 professor would like to predict the mean final exam score of the students who obtained 80% on exam 3. What kind of interval should she use?

Confidence interval for μy

Which of the following makes NO distinction between an explanatory variable X and a response variable Y (i.e., you can interchange the roles of X and Y and get the same result)?

Correlation

Plot these data in a scatterplot of mileage versus speed and verify that the correlation between speed and mileage is r = 0. Obviously there is a strong relationship between speed and mileage. Why is the correlation 0, despite this relationship?

Correlation describes the strength of linear relationships between quantitative variables, not curved relationships.
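A small illustration of a strong curved relationship with r = 0. The card's data set is not reproduced here, so the speed and mileage values below are hypothetical, chosen to rise and then fall symmetrically.

```python
import statistics

def correlation(xs, ys):
    """Sample correlation: sum of products of standardized values, divided by n - 1."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Hypothetical data: mileage peaks at a moderate speed and drops off on both sides
speed = [20, 30, 40, 50, 60]
mileage = [24, 28, 30, 28, 24]

r = correlation(speed, mileage)
print(abs(round(r, 6)))
```

The positive products on one side of the peak cancel the negative products on the other side, so r comes out at 0 even though mileage clearly depends on speed.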

b

Estimated (sample) slope in a regression equation

a

Estimated (sample) y-intercept in a regression equation

X

Explanatory variable in regression analysis (independent)

If we know the value of b, the slope of the regression line, we can accurately guess the value for the correlation coefficient without looking at the scatterplot.

False

Facts about r

Has values between -1 and +1. Has no unit of measure. Measures strength of linear relationship. Both variables must be quantitative. x and y can be interchanged. Affected by outliers.

In the following scatterplot, the number of Methodist ministers in Boston per year is plotted against number of barrels of rum imported for the same year. What is a possible lurking variable that could be influencing the strength of this relationship?

Increase in population of Boston is a lurking variable. As the population increased, so did the number of Methodist ministers and the demand for imported rum. (just look for an increase or decrease in variables to find a lurking variable)

how to decrease margin of error

Increase sample size, Decrease level of confidence

Below is a scatterplot depicting the linear relationship between 19 high school students' scores on the ACT and their scores on a reading test. What effect would removing the outlier near the bottom of the scatterplot from the data set have on the correlation coefficient?

It would increase the correlation between ACT score and reading score.

α (alpha)

Level of significance, or probability of a type I error (probability of rejecting a true null hypothesis). Also: true population y-intercept in a regression equation.

C --> Q

Matched pairs t for means; two-sample t for means; ANOVA

A manufacturer of traditional computer keyboards has redesigned their keyboard to help with carpal tunnel syndrome. A company official plans a test to make sure that the redesign has not significantly slowed the typing speed of those using the new keyboard. Thirty secretaries will type a page of text using both the newly designed keyboard and the traditional keyboard. Whether the secretary types first on the new design or the traditional design will be randomized. The time required to type the page of text for each type of keyboard will be recorded. Which procedure should be used for this test? What is the response variable, and is it quantitative or categorical? What is the explanatory variable, and is it quantitative or categorical?

Matched pairs t test for mean difference. Response: time required to type a page of text, which is quantitative. Explanatory: type of keyboard, which is categorical.

Many colleges offer online versions of courses that are also taught in the classroom. It often happens that the students who enroll in the online version do better than the classroom students on the course exams. This is not because online instruction is more effective than classroom teaching, but because the people who sign up for online courses are often quite different from the classroom students. Three of the following statements describe lurking variables for why online students do better, but one does NOT. Which one does NOT give a valid lurking variable?

Material is presented online rather than in a classroom. Answer B is not a valid lurking variable. The online and classroom versions of the course should have the same material presented. That's why they would offer both versions, so that students could learn the material in a different setting if needed.

μ

Mean of a population; mean of the sampling distribution of x bar

For right skewed data the ___________ is bigger. For left skewed data the ____________ is bigger.

Mean, Median

The parameter is either for a ____________ or a ______________

Mean, Proportion

For introductory statistics, suppose the correlation between exam score and the time (in hours) that elapsed since the exam period began before the student started taking the exam is -0.96. Because the negative relationship is so strong, can you say that taking the exam late on the last day would cause you to get a lower score than you would have had you taken it earlier?

No, because there is a potential lurking variable of preparedness. Having more time to prepare for the exam is a potential lurking variable. As time since the opening of the exam increases, there has also been more time for students to study for the exam, and thus a higher chance for them to do well on their exam.

Should we use the regression line from the analysis to predict the price of a house that is 6000 square feet in size? Why or why not?

No. That would be extrapolating beyond the range of our data - something we should never do.

N(μ, σ)

Normal distribution with mean, µ, and standard deviation, σ.

Observed and Expected Counts

Observed counts are whole numbers because you are counting people, but expected counts don't have to be because they are just an estimate. For the χ² test to be appropriate, all expected counts must be at least 5. Make sure to check that on the test.

One-sample procedures

One-sample z for proportions; one-sample t for means

y-hat

Predicted y

β (beta)

Probability of a type II error (probability of failing to reject a false null hypothesis). Also: true population slope in a regression equation.

p

Proportion of a population; mean of the sampling distribution of p-hat

Q to Q

Regression inference

Type 1 error

Rejecting null hypothesis when it is true

Y

Response variable in regression analysis (dependent variable)

r

Sample correlation coefficient

n

Sample size

Formula for standard deviation of p-hat

SD of p-hat = √(p(1-p)/n)
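A short sketch of the formula. The population proportion and sample size are hypothetical values chosen for illustration.

```python
import math

p = 0.5    # hypothetical population proportion
n = 100    # hypothetical sample size

sd_p_hat = math.sqrt(p * (1 - p) / n)   # SD of the sampling distribution of p-hat
print(sd_p_hat)
```

Because n sits under the square root, quadrupling the sample size only halves the standard deviation of p-hat.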

If the p-value for testing the hypotheses listed in the prompt is 0.0014, what is the appropriate conclusion at α = 0.05?

Since the p-value is less than α, conclude that temperature is useful for predicting pH.

σ

Standard deviation of a population

s

Standard deviation of a sample

σ/√n

Standard deviation of the sampling distribution of x bar

√(p-hat(1 - p-hat)/n)

Standard error of p hat; estimates standard deviation of the sampling distribution of p-hat

s/√n

Standard error of x bar; estimates standard deviation of the sampling distribution of x bar

How do you know if the data have a strong or weak relationships on a scatterplot?

Strong: points are concentrated about the form. Weak: points are loosely scattered about the form. If the pattern is nearly flat (slope near zero), then the relationship is very weak: even if the cloud of points is tight, a cloud sitting on its side shows a very weak relationship.

μ1-μ2

The difference between the means of two populations

p1-p2

The difference between the proportions of two populations

p-hat1 - p-hat2

The difference between the proportions of two samples

Just something to study

The general form of the regression line is y=a+bx. y represents the dependent variable which in this scenario is Price. x represents the independent variable which is Size. a represents the y-intercept. In the output table this is found at the estimate of the intercept. b represents the slope. In the output this is found at the estimate of the slope.

slope interpretations

The general interpretation of slope is: for every one-unit increase in x, we expect y to increase/decrease by b units, on average. Example: if the size of a house increases by 1 square foot, we would expect the price of the house to increase by 1.055 thousands of dollars, on average.

y-intercept interpretation

The general interpretation of y-intercept is: when x is 0, we would expect y to be "a". Example: when the size of a house is 0 square feet, we would expect this house to sell for -22.000 thousands of dollars.

A researcher investigated whether the temperature of milk, x, can be used to predict the milk's pH content, y. She predicted the pH level of a glass of milk with a temperature of 8.45 degrees to be ŷ= 6.795. A 95% confidence interval for the mean pH of all milk with temperature 8.45 degrees is calculated to be (6.76, 6.83). Which of the following is the correct interpretation of this confidence interval? Suppose we have a glass of milk whose temperature is 8.45 degrees. We calculate a 95% prediction interval for the pH of this glass of milk to be (6.71, 6.88). Which of the following is the correct interpretation of this prediction interval? True or False: The prediction interval for the pH of a glass of milk whose temperature is 8.45 degrees is wider than the confidence interval for the mean pH of all glasses of milk whose temperature is 8.45 degrees.

Confidence interval: the mean pH level for all glasses of milk having a temperature of 8.45 degrees is somewhere between 6.76 and 6.83, with 95% confidence. Prediction interval: the pH level for a glass of milk whose temperature is 8.45 degrees is somewhere between 6.71 and 6.88, with 95% confidence. True.

Response Variable

The response variable is the focus of a question in a study or experiment.

What does the standard error of x bar estimate?

The standard deviation of the sampling distribution of x-bar

A study was done on the graduation rates from two universities: Eastern College and Southern College. Data were collected for the year 2013 in a variety of majors, and it was found that Eastern did better overall. However, in each of the majors separately, Southern did better. Which of the following statements is correct?

This is an example of Simpson's Paradox. Simpson's Paradox refers to when multiple data sets are combined and give one result, but looking at the data sets separately gives the opposite result. This is the case in this scenario because, looking at all majors together, Eastern College does better, whereas when looking at the separate majors, Southern College does better.

True or False: Assuming all the conditions for the test are met, if the test on slope (H0: β = 0) is significant, then the corresponding least squares regression line can be used for prediction.

True

True or False: The correlation coefficient measures the strength (and direction) of only the linear relationship between x and y.

True. The correlation coefficient is only valid for linear relationships. It measures the strength and direction of a relationship assuming that it is a linear one.

C to C

Two-sample z for proportions; chi-square test

How to find p-value in chi-square table

Use a chi-square table with (r - 1) × (c - 1) degrees of freedom to get a p-value, where r = number of rows and c = number of columns in the table.

Suppose a 95% confidence interval for β was calculated to be (-4.56, 2.34). What is the correct interpretation of this confidence interval?

We are 95% confident that the true slope of the linear relationship between temperature and pH lies between -4.56 and 2.34.

Suppose the predicted time is 15.9 seconds for when a cheetah has 1555 feet to run. Interpret this predicted y value in context.

When a cheetah runs 1555 feet, the predicted time to accelerate to maximum speed is 15.9 seconds.

When do you use the central limit theorem?

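When the population distribution is not normal: if the sample size is large, the sampling distribution of x bar is still approximately normal.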

When do we use ANOVA? What are the correct hypotheses when comparing different group means? What conditions do we need to check?

When we want to compare three or more means. H0: μ1 = μ2 = μ3 (you can keep going). Ha: at least one μi is different from the others (write this verbatim). Just one of the μ's has to be different to reject the null hypothesis and say it's statistically significant. Conditions: Randomization (either random samples from each population or random assignment). Normality (plot the data and make sure there are no outliers or skewness, or make sure the sample sizes are large: n1, n2, n3 bigger than 30). Equal population SDs (if max sd / min sd < 2, then we're OK).
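The equal-SD rule of thumb (largest sample SD divided by smallest sample SD less than 2) can be sketched as below. The three groups are hypothetical data made up for illustration.

```python
import statistics

# Hypothetical samples from three groups
groups = {
    "A": [10, 12, 14, 16, 18],
    "B": [11, 13, 14, 15, 17],
    "C": [10, 13, 15, 17, 20],
}

# Sample standard deviation of each group
sds = {name: statistics.stdev(values) for name, values in groups.items()}

# Rule of thumb from the card: max sd / min sd < 2
ratio = max(sds.values()) / min(sds.values())
equal_sd_ok = ratio < 2
print(round(ratio, 3), equal_sd_ok)
```

This only checks the equal-SD condition; randomization and normality still have to be checked separately.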

A study was done to determine whether the proportion of women who had accidents on Friday the 13th was greater than the proportion of men who had accidents on Friday the 13th. To find out, a random sample was taken of 2,000 individuals. Each individual was asked their gender and whether they had been involved in an accident on Friday the 13th. What is the response variable, and is it quantitative or categorical? What procedure should be used for this test? What is the explanatory variable, and is it quantitative or categorical?

Response: whether the person had an accident on Friday the 13th, which is categorical. Procedure: two-sample z test for proportions. Explanatory: gender of the person, which is categorical.

How to find expected count

expected = (row total × column total) / table total. Expected counts are the cell counts that would be expected if the null hypothesis were true (no association).
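A minimal sketch of the expected-count formula, together with the (r - 1) × (c - 1) degrees of freedom and the "at least 5" check. The observed table is hypothetical.

```python
# Hypothetical 2 x 2 table of observed counts (rows = groups, columns = outcomes)
observed = [[10, 20],
            [30, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# expected = (row total * column total) / table total, for every cell
expected = [[rt * ct / table_total for ct in col_totals] for rt in row_totals]

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Condition for the chi-square test: every expected count is at least 5
all_at_least_5 = all(e >= 5 for row in expected for e in row)
print(expected, df, all_at_least_5)
```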

type 2 error

failing to reject a false null hypothesis

How to select appropriate statistical procedure to use. If Y is quantitative, the procedure is about...... If Y is categorical, the procedure is about..... Next determine how many samples were taken and how many variables are being studied One sample and one variable..... Two samples/groups and one variable..... Three or more samples/groups and one variable..... One sample and two variables.... the research question is about associations or linear relationships... the research question is about prediction....

Means; proportions; one-sample procedure; two-sample procedure; ANOVA or chi-square; chi-square or correlation/regression analysis; chi-square or correlation analysis; regression analysis

residual formula

observed y - predicted y (y- yhat)

How do you compute the value of the test statistic for a matched pairs t?

t = (xbar - μ0) / (s/√n), where xbar and s are the mean and standard deviation of the differences and μ0 = 0
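A sketch of the matched pairs t statistic computed from paired differences. The differences below are hypothetical; in practice they would be one difference per matched pair.

```python
import math
import statistics

# Hypothetical differences (one per matched pair)
differences = [2, -1, 3, 0, 1]
n = len(differences)

d_bar = statistics.mean(differences)   # mean of the differences
s_d = statistics.stdev(differences)    # sample SD of the differences
mu_0 = 0                               # hypothesized mean difference

t = (d_bar - mu_0) / (s_d / math.sqrt(n))
df = n - 1                             # degrees of freedom for the t distribution
print(round(t, 4), df)
```

The test is a one-sample t test applied to the differences, which is why only one mean and one standard deviation appear.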

Power (and how to increase it)

The probability that you reject H0 when H0 is false. Increase power by increasing sample size.

Upper and lower bound formula

μ - 2σ and μ + 2σ.

Extrapolation

Use of the regression line to estimate the mean of y for x far outside the x-range of the data. Problem: there is no information on the nature of the relationship outside the x-range. (With extrapolation, we often haven't tested farther than a certain point on the regression line, so we can't assume that following the line is accurate all the way up or down.) All relationships have to bend somewhere.

Regression

y-hat = a + bx. Summarizes the linear pattern of a scatterplot using the best-fitting straight line. a = intercept, b = slope, y-hat = predicted y-value (mean of y for a given x). (y represents the dependent variable, x represents the independent variable.) Regression draws a firm distinction between explanatory and response variables.

All proportion problems deal with....

z's not t's

