Math 227 - Stats


After running the following R code: Age.model <- lm(Salary ~ Age, data = SalaryGender) Age.model The following is outputted in the R Console: A teacher is 46 years old. What is the Age.model's prediction for their Salary (in thousands of dollars)? 52.5245

-9.305 + 1.319*46 = 51.369
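
A minimal sketch of the arithmetic behind this prediction, using the coefficients reported for Age.model (intercept -9.305, slope 1.319). The names b0, b1, age, and predicted_salary are illustrative and not part of the original code.

```r
# Coefficients as reported for lm(Salary ~ Age, data = SalaryGender)
b0 <- -9.305   # intercept
b1 <- 1.319    # slope for Age
age <- 46

# Model prediction, in thousands of dollars
predicted_salary <- b0 + b1 * age
print(predicted_salary)
```

With the fitted model object available, predict(Age.model, data.frame(Age = 46)) should give the same value up to rounding of the printed coefficients.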

Using the SpeedDating data frame, fit a model of LikeM using SharedInterestsM as the explanatory variable. What is the 95% confidence interval for β1?

.35 to .53

If you use shuffle() to create a randomized sampling distribution of b1 (a group difference) based on a sample of data, what will be the mean of the resulting sampling distribution?

0

The mean maximum swim velocity when wearing a wetsuit (i.e., Wetsuit) is 1.51 m/sec. If the margin of error is 0.08 m/sec, what's the range of possible values within which you're 95% confident that actual population mean would fall?

1.43 m/sec to 1.59 m/sec
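
The interval is the estimate plus or minus the margin of error, which can be written out as:

```latex
\[
1.51 \pm 0.08
\;\Rightarrow\;
[\,1.51 - 0.08,\; 1.51 + 0.08\,] = [\,1.43,\; 1.59\,] \text{ m/sec}
\]
```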

Using the SpeedDating data frame, run favstats on AttractiveF. At what value would the sum of squared errors (sum of squares) be lowest? favstats(~AttractiveF, data=SpeedDating) min Q1 median Q3 max mean sd n missing 1 5 6 8 10 6.273723 1.917694 274 2

6.3 ~ mean

The best-fitting model using AttractiveM to predict LikeM can be specified like this: $\text{LikeM}_i = b_0 + b_1\,\text{AttractiveM}_i + e_i$. Which of the following is an INCORRECT interpretation of the confidence interval for β1 in this model?

95% of all LikeM ratings have this relationship with AttractiveM ratings.

The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: $\text{Wetsuit}_i = b_0 + b_1\,\text{NoWetsuit}_i + e_i$. If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, which of the following is NOT a correct interpretation?

95% of all Wetsuit velocities have this relationship with the NoWetsuit velocity.

Which of the following is the correct interpretation of PRE (0.98) in the supernova table above?

98% of the SS from the empty model can be explained by adding NoWetsuit to the complex model.

Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). How would we depict the null model of maximum velocity in a wetsuit on this plot?

A horizontal line at the mean of Wetsuit

If you created a bootstrapped sampling distribution of 10,000 means from your sample of SpeedUp, what qualities would you expect it to have?

A roughly normal shape, and a standard deviation smaller than the standard deviation of the sample
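
A rough base-R sketch of why the bootstrapped means vary less than the individual scores. The 12 values below are hypothetical stand-ins, not the real Wetsuits data, and replicate() with sample() is used here in place of the do() and resample() helpers from the course.

```r
set.seed(1)
# Hypothetical sample of 12 SpeedUp values (m/sec); not the real Wetsuits data
speed_up <- c(0.08, 0.10, 0.07, 0.08, 0.10, 0.04,
              0.07, 0.06, 0.08, 0.06, 0.09, 0.10)

# Resample with replacement (same size as the sample) and keep each mean
boot_means <- replicate(10000, mean(sample(speed_up, replace = TRUE)))

sd(speed_up)      # spread of the individual scores
sd(boot_means)    # spread of the bootstrapped means (the standard error)
mean(boot_means)  # lands close to mean(speed_up)
```

A histogram of boot_means should look roughly normal and be centered near the sample mean.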

Which distribution would you use to create a confidence interval around a parameter estimate?

A sampling distribution

What kind of distribution would this code create? do(10000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12))

A sampling distribution of bootstrapped slopes

Recall that the variable SpeedUp is the difference between swimming with Wetsuit versus NoWetsuit. Why might you want to find the point below which 2.5% of bootstrapped sample means for SpeedUp fall, and the point above which 2.5% of simulated sample means for SpeedUp fall?

All of the above.

Here is the supernova table for Age2Group.model: i. Interpret the PRE. ii. Why is the degrees of freedom for Model equal to 1?

Answer 1: 0.14 of the total variation in salary is explained by the age groups. Answer 2: The Age2Group model estimates one more parameter than the empty model (b1, the increment for the second age group), so it uses one additional degree of freedom.

Suppose you make the plot below to further explore the idea that Age would predict a teacher's Salary. i. If you fit an empty model to this data, how would you depict it on this plot? Options: "A vertical line that shows the mean Age."; "You would not be able to represent the empty model visually because it is a single number."; "A horizontal line that shows the mean Salary."; "A diagonal line that bisects the cloud of points." ii. If you fit a model that predicts Salary by including Age as a quantitative explanatory variable, how many parameters would the model have?

Answer 1: A horizontal line that shows the mean Salary. Answer 2: Two: b0 (the intercept) and b1 (the slope for Age).

If we write the empty model in GLM notation, Y i = b 0 + e i, i. What is the value of Y i? Select one: Each individual teacher's salary. ii. What is the value of b 0? Select one: The mean salary of this sample, 52.5245. iii. What is the value of e i? Select one: The difference between each teacher's salary and the mean salary of this sample.

Answer 1:Each individual teacher's salary. Answer 2:The mean salary of this sample, 52.5245. Answer 3:The difference between each teacher's salary and the mean salary of this sample.

After running supernova(Age.model), the following gets outputted in the R Console: i. Which of the following is the correct interpretation of MS Total (1782.607)? Select one: [ Select ] ["This is, roughly, the total number of points in the data frame.", "This is, roughly, the average squared residual from the mean.", "This is, roughly, the standard deviation from the mean.", "This is, roughly, the total number of squared means based on the empty model."] ii. Which of the following is the correct interpretation of PRE (0.23) in the supernova table? Select one: [ Select ] ["23% of the SS from the empty model can be explained by adding Age to the complex model.", "The Age model's SS total will be 23% of the SS total from the empty model.", "23% of the teachers' salaries in the data frame can be predicted with their Age.", "23% of the Age model can be proportionally reduced by the empty model."]

Answer 1: This is, roughly, the average squared residual from the mean. Answer 2: 23% of the SS from the empty model can be explained by adding Age to the complex model.

If you use lm() to fit the empty model for LikeM, and then use confint() to find the confidence interval, what does the confidence interval tell you?

B. It gives you a range of possible β0 s that could have generated your sample. C. It gives you a range of possible μ s that could have generated your sample. Both B and C are correct.

Why might the mean be a good simple model for the distribution of salaries, below:

Because the mean is a model that balances the deviations from the model and minimizes the sum of squared residuals.

Above we show the favstats() for females' ratings of their date's intelligence (IntelligentF). Why might it be helpful to examine the distribution of means from random samples of n=273 drawn from a normal distribution with the same mean and standard deviation as IntelligentF?

Because the resulting sampling distribution would give us an idea of how much random sample means could vary.

Use the DataCamp window above to construct a faceted histogram of how much fun males perceive females to be (FunM) by males' race (RaceM) in the SpeedDating data frame. Which of the race groups looks most like the panel below?

Caucasian

One researcher wondered if some of the variation in the difference in velocity came from the type of swimmer they were. Triathletes swim in wetsuits more often than competitive swimmers do, and she worried that their experience would influence the results of this study. Above is a faceted histogram of difference in velocity by the Type of athlete. The two vertical lines depict the mean of the swimmer group and the mean of the triathlete group. What would be the PRE of this model: Type.model <- lm(SpeedUp ~ Type, data = Wetsuits)

Close to 0

What will the following code do? resample(Wetsuits, 12)

Create a new sample from the observations in Wetsuits

FunM.model <-lm(LikeM ~ FunM, data = SpeedDating) Using the R code above, we fit this model: Yi=b0+b1Xi+ei What does the Xi refer to?

Each male's value on FunM

TRUE or FALSE: Correlation implies causation.

False

Assume we run these two lines of code: AttractiveF.stats <- favstats(~ AttractiveF, data = SpeedDating) rnorm(100, AttractiveF.stats$mean, AttractiveF.stats$sd) What will the second line of code (in red) do?

Generate a random sample of 100 data points from a normal distribution with the same center and spread as AttractiveF.

If the sampling distribution of means is normal, the underlying population distribution is:

Impossible to tell

Sample distributions are made up of _________; sampling distributions are made up of _________.

Individual scores; sample statistics

SpeedUp contains the swimming velocity with Wetsuit minus NoWetSuit. Could these differences in swimming velocity be normally distributed in the population?

It is possible.

When you add an explanatory variable to your model, what should be the effect on the Total Sum of Squares (from the empty model)?

It should remain unchanged.
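
A small base-R sketch of why SS Total stays put: it is computed from the empty model alone, and adding an explanatory variable only splits it into SS Model and SS Error. The data and the names x, y, ss_total, ss_error, ss_model, and pre are made up for illustration.

```r
set.seed(2)
# Hypothetical data; x and y are invented for this example
x <- rnorm(50, mean = 40, sd = 10)
y <- 5 + 1.2 * x + rnorm(50, sd = 15)

empty_model   <- lm(y ~ NULL)   # predicts the mean for every case
complex_model <- lm(y ~ x)

ss_total <- sum(resid(empty_model)^2)    # error left by the empty model
ss_error <- sum(resid(complex_model)^2)  # error left by the complex model
ss_model <- ss_total - ss_error          # error reduced by adding x

pre <- ss_model / ss_total               # proportional reduction in error
c(ss_total = ss_total, ss_model = ss_model, ss_error = ss_error, PRE = pre)
```

Here ss_total depends only on y and its mean, so it is the same no matter which explanatory variable is added.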

You fit a regression model, then construct a 95% confidence interval for the estimate of β1. If the confidence interval includes 0, what does this mean?

It suggests we should retain the empty model.

If you decide you want to increase your level of confidence for your estimate of LikeM (from 95% to 99%), what will happen to your confidence interval?

It will become wider.

If you decide you want to increase your level of confidence in your estimate of Wetsuit (from 95% to 99%), what will happen to your confidence interval?

It will become wider.
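
The effect of the confidence level can be checked with the level argument of confint(). This is a sketch on hypothetical ratings rather than the SpeedDating or Wetsuits data; the names ratings, ci95, and ci99 are illustrative.

```r
set.seed(3)
# Hypothetical ratings; not the SpeedDating data
ratings <- rnorm(100, mean = 6.5, sd = 1.8)
empty_model <- lm(ratings ~ NULL)

ci95 <- confint(empty_model, level = 0.95)
ci99 <- confint(empty_model, level = 0.99)

diff(ci95[1, ])  # width of the 95% interval
diff(ci99[1, ])  # width of the 99% interval, which is larger
```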

What's the value in using the t distribution?

It works well as a model of the sampling distribution if the sample size is small, or standard deviation of the population is unknown.

What would the sampling distribution of means look like for samples of n=1?

It would have the same shape and standard deviation as the population distribution.

If you increase your sample size in a study, how does it affect the 95% confidence interval around a parameter estimate?

It would make the confidence interval narrower.

FunM.model <- lm(LikeM ~ FunM, data = SpeedDating) confint(FunM.model) Using the code above, we created a model to predict LikeM using FunM as the explanatory variable. We then constructed 95% confidence intervals around the parameter estimates. The output is pictured below. If we repeated this study and found a larger standard error, what would be different about the confidence interval for β1?

It's likely that the 95% confidence interval for β1 would be wider.

If, in the population, females' mean intelligence ratings of males is 7.7, with a standard deviation of 1.2, how likely is it that we would randomly draw a sample of n=276 with a mean of less than 7.5?

Less than 5%
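
One way to check this answer is to treat the sampling distribution of the mean as normal, with standard error sigma / sqrt(n). A minimal sketch under that assumption; the variable names are illustrative.

```r
mu    <- 7.7   # population mean from the question
sigma <- 1.2   # population standard deviation from the question
n     <- 276   # sample size

se <- sigma / sqrt(n)                       # standard error of the sample mean
p_below <- pnorm(7.5, mean = mu, sd = se)   # P(sample mean < 7.5)
p_below
```

The sample mean of 7.5 sits nearly 2.8 standard errors below 7.7, so the probability comes out well under 5%.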

Suppose there is a correlation between a teacher's salary and their age. There might not be a causal relationship, due to gender. Gender is called a: Regression line; Causation variable; Correlation coefficient; Confounding variable

Confounding variable

After running lm(Salary~Age, data=SalaryGender), the following is outputted in the R Console: Call: lm(formula = Salary ~ Age, data = SalaryGender) Coefficients: (Intercept) Age -9.305 1.319 According to our model, someone who is 0 years old would have a salary of -9.305 (thousands of dollars). This of course makes no sense. It doesn't make sense because of: Modeling; Extrapolation; Causation; Regression; Correlation

Extrapolation

After running the code, cor(Salary ~ Age, data = SalaryGender), the following result is outputted in the R Console: [1] 0.4770431 Now that we know r = 0.47 for a regression line, what percent of variation is explained by the explanatory variable? Options: 50%; 100%; 0%; We can't determine the value; 53%; 0.47; 47%

Not "We can't determine the value." Squaring r gives r² = 0.477² ≈ 0.23, so about 23% of the variation is explained, consistent with the PRE of 0.23 in the supernova table for Age.model.

We used AgeM to predict ratings of fun (FunM). The F value for this model in the table above is .02. What does this F ratio tell us?

None of the above.

Which scatterplot shows a larger value of Pearson's r? closer to a line

Plot B

Which of the following would have the same exact value?

Population mean and sampling distribution mean

Which of the following describes the error pointed to by the green label? Options: e_i = y_i - ŷ; r; Grand Mean; Negative residual; Positive residual; PRE

Positive residual

What will the following code do? xqt(.025, df = 999)

Return t critical for a sample size of 1000
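
xqt() comes from the mosaic package. As a rough equivalent, base R's qt() returns the same kind of quantile without drawing a plot. A minimal sketch; t_lower is an illustrative name.

```r
# Lower 2.5% cutoff of the t distribution with 999 degrees of freedom
# (df = n - 1 for a sample of 1000 when estimating a single mean)
t_lower <- qt(0.025, df = 999)
t_lower    # negative, close to -1.96
-t_lower   # the matching upper cutoff, by symmetry
```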

The best-fitting model using NoWetsuit to predict Wetsuit can be specified like this: $\text{Wetsuit}_i = b_0 + b_1\,\text{NoWetsuit}_i + e_i$. If the confidence interval for β1 is .9547 m/sec plus or minus 0.118 m/sec, how big is the standard error of the sampling distribution of b1?

Roughly .118 divided by 2

The best-fitting model using AttractiveM to predict LikeM can be specified like this: $\text{LikeM}_i = b_0 + b_1\,\text{AttractiveM}_i + e_i$. If the 95% confidence interval for β1 is 0.7139 plus or minus 0.0814, how big is the standard error of the sampling distribution of b1?

Roughly 0.0814 divided by 2
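
Both of these answers rest on the rule of thumb that a 95% margin of error is about two standard errors. Written out for the two intervals above:

```latex
\[
\text{margin of error} \approx 2 \times SE
\quad\Rightarrow\quad
SE \approx \frac{0.0814}{2} \approx 0.041,
\qquad
SE \approx \frac{0.118}{2} = 0.059 \text{ m/sec}
\]
```

The exact multiplier is the critical t value for the model's degrees of freedom, which is somewhat larger than 2 for small samples, hence "roughly".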

How can sampling distributions help us interpret our data?

Sampling distributions give us a way to assess whether a relationship we've observed in our data is likely to have occurred just by chance.

What is the difference between Standard Deviation and Standard Error?

Standard Error applies to a sampling distribution; Standard Deviation applies to sample or population distributions.

If the distribution of NoWetsuit was more variable (that is, has a greater standard deviation) than the distribution of Wetsuit, what would be true about the confidence interval of NoWetsuit compared to Wetsuit?

The NoWetsuit confidence interval would be wider.

If the researchers are interested in whether wearing wetsuits affects swimming velocity, what is the outcome variable of interest?

The difference in velocity between swimming in NoWetsuit compared to Wetsuit.

Above we have included the ANOVA tables for Wetsuit = NoWetsuit + other stuff. Which distance is the basis of the SS Error?

The distance between data points and the NoWetsuit model's prediction.

What would this code show us? SDob1 <- do(100000) * b1(Wetsuit ~ NoWetsuit, data = resample(Wetsuits, 12)) SDob1 <- arrange(SDob1, desc(b1)) SDob1$b1[2500]

The highest population increment that could have produced our sample and it would still be considered likely

The best-fitting model using NoWetsuit velocity to predict Wetsuit is this: $Y_i = 0.1423 + 0.9547 X_i + e_i$. How should we interpret 0.9547?

The increment to add on to the prediction of Wetsuit for every 1 m/sec of NoWetsuit

If you bootstrap a sampling distribution based on your sample of data, what will be the mean of the bootstrapped distribution?

The mean of your sample

For which scatterplot does the slope of the regression line equal the correlation coefficient? gf_point(Salary ~ Age, data = SalaryGender, size = 4, color = "black")%>% gf_lm() gf_point(zSalary ~ zAge, data = SalaryGender, size = 4, color = "firebrick")%>% gf_lm() Both of them. Neither of them. The one on the left. The one on the right.

The one on the right.
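
The slope equals r only once both variables are converted to z-scores, which is what the zSalary ~ zAge plot does. A sketch with hypothetical data (not SalaryGender), using scale() as one assumed way to standardize:

```r
set.seed(4)
# Hypothetical ages and salaries; not the SalaryGender data
age    <- rnorm(100, mean = 45, sd = 10)
salary <- -9 + 1.3 * age + rnorm(100, sd = 25)

# Standardize both variables (mean 0, sd 1)
z_age    <- as.numeric(scale(age))
z_salary <- as.numeric(scale(salary))

r         <- cor(salary, age)
slope_raw <- coef(lm(salary ~ age))[["age"]]        # in salary units per year
slope_z   <- coef(lm(z_salary ~ z_age))[["z_age"]]  # in sd units

c(r = r, slope_raw = slope_raw, slope_z = slope_z)
```

slope_z matches r, while slope_raw depends on the units of the two variables.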

When we calculate the residuals from both the empty model and the complex model, what is similar about these two sets of residuals?

The residuals represent the difference between the data and the model's prediction.

If you want to know if a regression model is better than a simple model in terms of making a prediction, what parameter should you make a sampling distribution of?

The slope

What is standard error?

The standard deviation of the sampling distribution of an estimate

The sampling distribution of means above was created with these three lines of code: SincereF.stats <- favstats(~ SincereF, data = SpeedDating) SDoM <- do(10000) * mean(rnorm(272, SincereF.stats$mean, SincereF.stats$sd)) gf_histogram(~ mean, data = SDoM, color = "darkorchid4", fill = "darkorchid1") What is true about the standard error of this distribution?

The standard error of this distribution is smaller than SincereF.stats$sd.

Imagine we drew two random samples from a population, and measured each case sampled on the same outcome variable. One sample had an n = 30, the other an n = 60. Which of the following statements is true?

The sum of squares of the larger sample would almost certainly be greater than the sum of squares of the smaller sample.

If we used this code to fit the empty model: Empty.model <- lm(Salary ~ NULL, data = SalaryGender) And then used the predict() function to make a prediction for each teacher's salary, what value would it predict for each teacher?

The value would be the mean salary of this sample and would be the same for each teacher.
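
A quick check of this behavior on a tiny hypothetical data frame (the five salaries below are invented, in thousands of dollars):

```r
# Hypothetical salaries; not the SalaryGender data
salaries <- data.frame(Salary = c(38, 45, 52, 61, 70))

Empty.model <- lm(Salary ~ NULL, data = salaries)

predict(Empty.model)   # one prediction per row, all identical
mean(salaries$Salary)  # the same value as every prediction
```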

The plot above shows males' liking of female (LikeM) as a function of whether or not they want to date them again (DecisionM, Yes or No). What does the plot show?

There were females that males liked a lot, but with whom they did not want to go out on another date.

What's the purpose of generating a sampling distribution of means of SpeedUp by resampling (also called bootstrapping)?

This distribution can help you quantify how much your best estimate of the population mean could vary.

SpeedUp contains the swimming velocity with Wetsuit minus NoWetsuit. The histogram above was created with this code: gf_histogram(~ SpeedUp, data = Wetsuits, bins = 6, fill = "black", color = "royalblue1", alpha = .8) How would you modify this code to look at the distribution of swimmers that were faster with a wetsuit and those that were faster with no wetsuit?

This histogram shows that all swimmers were faster with a wetsuit.

Which of the following lines of R code would create a new variable called SpeedUp that contains the difference between swimming with Wetsuit versus NoWetsuit?

Wetsuits$SpeedUp <- Wetsuits$Wetsuit - Wetsuits$NoWetsuit

You're interested in females' ratings of males' intelligence. You simulate 500 samples of 276 ratings, calculate the mean of each sample, and plot the resulting distribution of means in a histogram. What will be the mean of this sampling distribution?

Whatever mean you set when you ran the simulation

After running the following R code: Age.model <- lm(Salary ~ Age, data = SalaryGender) Age.model The following is outputted in the R Console: Which of the following represents the fitted model?

$Y_i = -9.305 + 1.319 X_i + e_i$

Given this distribution of data, could the population of females' ratings of males' intelligence be shaped like a normal distribution?

Yes, it is possible because samples often do not look exactly like the populations that they were drawn from.

If you've identified your confidence interval for SpeedUp, what exactly are you confident about?

You're confident that the true effect of wearing a wetsuit on swimming velocity lies within it.

Take a look at the model that we fit in this output. How would we represent this number with General Linear Model (GLM) notation? coefficient (intercept):

b0

You're interested in computing the confidence interval around the estimated mean of SpeedUp (how much a swimmer's velocity increases by having a wetsuit). What should you add to the following code in order to do so? Empty.model <- lm(SpeedUp ~ NULL, data = Wetsuits)

confint(Empty.model)

Presumably, a person's swimming velocity wearing just their swimsuit (NoWetsuit) will predict their maximum velocity while wearing a wetsuit (Wetsuit). Which is the best way to take a look at this relationship (between Wetsuit and NoWetsuit)?

gf_point(Wetsuit ~ NoWetsuit, data = Wetsuits)

Here is a depiction of the relationship between Salary and Age2Group: When Age2Group is included as an explanatory variable in our model of Salary, we write: SalaryGender$Age2Group <- ntile(SalaryGender$Age, 2) SalaryGender$Age2Group <- factor(SalaryGender$Age2Group, levels = c(1,2), labels = c("young", "old")) lm(formula = Salary ~ Age2Group, data = SalaryGender) As a result, the following is printed in the R Console:

i. How would you interpret 31.87? This represents the difference in average salaries for teachers in the old group relative to the young group. ii. In $Y_i = 36.59 + 31.87 X_i + e_i$, what does $X_i$ stand for? Whether a teacher is in the old group or not (1 for old, 0 for young, since young is the reference level). iii. When Age2Group is included in our model to explain variation in Salary, how is error from this more complex model calculated? The deviation of each teacher's Salary from the mean Salary of their Age2Group (young or old).

If we add more participants to the SpeedDating study, which of these could not be affected?

β0

