MKT317 Exam 2 reading Q's

Suppose the weather forecast predicts that the high temperature tomorrow will be 102 degrees Fahrenheit. The manager at the ice cream stand decides to offer an 8% discount rate on all ice cream sold tomorrow. Using the model below, what is the predicted sales for tomorrow? Assume that the temperature forecast is correct, and the high temperature is exactly 102 degrees. hatSales = 401 + 2703(Discount) - 102(Under 95) - 1336(Discount*Under 95)

617.24

Is the following a correct or incorrect interpretation of the model whose output appears above: "The average Sales in the South is about $80,000 higher than in the West"

Incorrect interpretation (i.e. this is either a false statement, or we do not have enough information to know if this statement is true or false).

Which model would be more appropriate to describe the relationship between time and the child mortality rate?

Question9to13_Model <- lm(log(Child.mortality) ~ Year + name, data=Gapminder_ChildMortlity_1950_2020)

How do we know that this model was constructed incorrectly?

This model is incorrect because the output does not include dummy variables; the classmate should have used as.factor(AGENCY).
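As a rough illustration of what dummy coding means here, the sketch below turns a numeric category code into 0/1 indicator columns, leaving out the first level as the reference. It is a Python stand-in, not the R internals: the function name `dummy_code` and the four agency codes are assumptions made for the example, and dropping the first level mirrors R's default treatment coding for factors.

```python
def dummy_code(value, levels):
    """Return 0/1 indicators for every level except the first (the reference level)."""
    return [1 if value == level else 0 for level in levels[1:]]

# Hypothetical agency codes stored as numbers in the data set.
agency_levels = [1, 2, 3, 4]

agency_1_dummies = dummy_code(1, agency_levels)  # reference level: all zeros
agency_4_dummies = dummy_code(4, agency_levels)  # only the Agency 4 indicator is 1

print(agency_1_dummies, agency_4_dummies)
```

Without this step, a numeric AGENCY column would be treated as one quantitative variable with a single slope, which is why the model output in the question shows no dummy variables.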

In the previous question, you answered the question "After controlling for size, is there a statistically significant difference in average price when comparing homes that are 0-5 years old vs. homes that are more than 5 years old?" What was the p-value that you used to obtain your answer? (Hint: this p-value will be associated with one of the dummy variables for the variable AgeGroup.)

3.01*10^(-9)

What is the R-squared given in the summary of the model output for the model named MyPanelModel? Answer with four or more decimal places.

0.8533

Use the model output given above, which is a model whose dependent variable is selling price (in thousands of dollars). The estimated selling price of a 1000 square foot home that was sold by Agency #4 is _____ thousand dollars. (Round to the nearest whole number)

86.0

y-axis: observed values and x-axis: predicted values; y-axis: residuals and x-axis: predicted values

Add the line y=x to help visualize model accuracy; add the line y=0 to help visualize model accuracy

Suppose we have data that were used to create a multiple linear regression model. A plot of the prediction error vs. the Predicted value of Y is given below. Does the model fit the data?

No; the model is not accurate. EXPLANATION: The dots show a pattern; they are not randomly dispersed around the line y=0 (in purple).

Suppose we created a multiple linear regression model in R that we named MODEL. What R commands do we use for which purpose?

Determine the observed values: View(name of data set); these are the Y-values in the data set. Calculate the predicted value of Y using the MODEL: predict(MODEL). Calculate the prediction error for an observation based on the MODEL: residuals(MODEL) (the residuals are the prediction errors).
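To make the three quantities concrete, here is a small sketch that fits a one-variable least-squares line by hand and then lists the observed values, the predicted values, and the residuals. It is a Python stand-in for the idea behind predict(MODEL) and residuals(MODEL), not those R functions themselves, and the data points are made up for the example.

```python
# Hypothetical data: x is a single predictor, y holds the observed values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# Least-squares slope and intercept for a simple linear regression.
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean

# Predicted value for each observation, and residual = observed - predicted.
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for yi, pi, ri in zip(y, predicted, residuals):
    print(f"observed={yi:.2f} predicted={pi:.2f} residual={ri:.2f}")
```

Plotting the residuals against the predicted values (with the line y=0) or the observed against the predicted values (with the line y=x) gives the two diagnostic plots described in the other cards.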

Using the data above, will the commands below create the desired linear regression model? Model <- lm(Spending ~ FICO + Years + AgeGroup + as.factor(Segment), data=DF) summary(Model)

Yes - this will create the desired model.

Is the following a correct or incorrect interpretation of the model whose output appears above: "After controlling for advertising budget and segment, the average Sales in the South is $80,000 higher than in the West."

correct interpretation

In the first scatter plot, the price axis labels the values 300, 1000, 3000, 10000, 30000. Is price being graphed using the linear scale or the log scale?

log scale

Using the data above, will the commands below create the desired linear regression model? Model <- lm(Spending ~ FICO + Years + as.factor(AgeGroup) + as.factor(Segment), data=DF) summary(Model)

Yes - this will create the desired model.

For homes greater than 2500 square feet, when we compare homes of the same size, is there a statistically significant difference in average home prices between homes on corner lots and non-corner lots?

Yes; for homes greater than 2500 square feet, after controlling for size, the average price for homes on non-corner lots is higher than the average price of homes on corner lots.

Use the correct numerical answers for the two multiple choice questions above to complete the following sentences. We can compare the average price of good vs. fair 1.5-carat diamonds in two ways: The average price of a 1.5-carat Good cut diamond is $______ ...more expensive than the average price of a 1.5-carat Fair cut diamond. The average price of a 1.5-carat Good cut diamond is

2004, 30

Suppose the weather forecast predicts that the high temperature next Friday will be 82 degrees Fahrenheit. The manager at the ice cream stand decides to offer an 8% discount rate on all ice cream sold next Friday. Using the model below, what is the predicted sales for next Friday? Assume that the temperature forecast is correct, and the high temperature is exactly 82 degrees. hatSales = 401 + 2703(Discount) - 102(Under 95) - 1336(Discount*Under 95)

408.36
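The two ice cream questions use the same model and differ only in the value of the "Under 95" dummy variable. A minimal sketch of the arithmetic follows, assuming the discount enters as a proportion (0.08 for 8%) and that the dummy equals 1 when the high temperature is below 95 degrees; the function name is made up for the example.

```python
def predicted_sales(discount, high_temp_f):
    """Evaluate hatSales = 401 + 2703(Discount) - 102(Under 95) - 1336(Discount*Under 95)."""
    # Assumption: the dummy is 1 for days with a high under 95 F, otherwise 0.
    under_95 = 1 if high_temp_f < 95 else 0
    return 401 + 2703 * discount - 102 * under_95 - 1336 * discount * under_95

hot_day = predicted_sales(0.08, 102)   # dummy = 0, so only the first two terms remain
cool_day = predicted_sales(0.08, 82)   # dummy = 1, so all four terms contribute

print(round(hot_day, 2), round(cool_day, 2))
```

On the 102-degree day both terms containing the dummy drop out, which is why that prediction is just 401 + 2703(0.08).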

Were there any key words / important aspects of the model that let you decide between a change in "units (deaths per thousand)" vs. a percentage change? How did you know (using the R output and not the graphs) if it was an increase or a decrease?

Because it is a log model, the change is a percentage change. It is a decrease because the coefficient for Year is negative.

Is the following statement true or false: "The results of this model must be incorrect, because the estimated value for East is -90000, and it's impossible to have a negative amount for Sales."

FALSE: This model appears to have been computed correctly; it's ok for the coefficient for East to be negative.

Using the data above, will the commands below create the desired linear regression model? Model <- lm(as.factor(Spending) ~ FICO + Years + as.factor(AgeGroup) + as.factor(Segment), data=DF) summary(Model)

No - this will not create the desired model.

Use the three scatter plots created above to answer this question (plots 4, 5, and 6). If we compare homes of the same size, does there appear to be a difference in average home price when comparing corner and non-corner lots?

Sometimes. For homes that are less than 2500 square feet, there does not appear to be a difference in the average price of equally sized homes when comparing corner and non-corner lots; however, for homes larger than 2500 square feet, corner-lot homes are less expensive than non-corner-lot homes of the same size.

The predicted sales from customers in the domestic region with an average unit price of $500 is ..., and the predicted sales from customers in the domestic region with an average unit price of $550 is

11623.67, 12867.94

Suppose we created a multiple linear regression model using the variable location (the location is either NORTH or SOUTH, and they are included in the model in dummy variable format). The model we obtained is Estimated average selling price (in thousands of dollars) = 13.9 + 0.07(size, in sqft) + 2(SOUTH) Use the model to compute the estimated average selling price of a 2000 square foot house that is in the NORTH location.

153.9 thousand dollars

Use the correct numerical answers in the two multiple choice questions above to complete the following sentences. We can compare the average price of good vs. fair 3-carat diamonds in two ways: The average price of a 3-carat Good cut diamond is $ ...more expensive than the average price of a 3-carat Fair cut diamond. The average price of a 3-carat Good cut diamond is

6604, 30

In the question above, you answered a question about statistical significance, which means a p-value was involved. What is the p-value you used to answer the question "For homes larger than 2500 square feet, when we compare homes of the same size, is there a statistically significant difference in average home prices between homes on corner lots and non-corner lots?"

7.11*10^(-10)

Using the population of houses represented by the data in the HOUSEDATA data set, we can conclude that when we are comparing houses of the same size, the average price is higher in the North than in the South.

false

What is the predicted price of a 3-carat Fair cut diamond?

ln(Price) = 8.12 + 1.72*ln(3) = 10.00961, and e^(10.00961) = 22239

Since color D diamonds appear to be more expensive once we have accounted for the size, but the average price of a color J diamond is highest overall, it's likely that among diamonds included in the data set, there are many large (and therefore expensive) color J diamonds, and many small (and therefore less expensive) color D diamonds.

true

Create both of the plots above. In the second plot, we see additional arguments xlab and ylab. Comparing the two plots you made: what do these arguments do?

They allow us to label the x-axis and y-axis.

True or False: Using the model output above (which is additionally included in this question), we can conclude that the average value of Y at the "Before" time slot equals 9.05.

False. EXPLANATION: The intercept is the estimate of the average Y when all variables we see in the table equal 0. If "time before" equals 0, then the time associated with the interpretation of the intercept must be "after". If the dummy variables for ID2 through ID6 are all equal to 0, then the intercept must be giving information about ID #1. So the intercept tells us that the predicted Y-value for ID #1 in the "after" time is 9.05.

What is the predicted price of a 0.75-carat, Ideal-cut, D-color, IF-clarity, diamond? Rounding: use at least two decimal places in each of your intermediate steps.

5300

In diamonds, it is very difficult to distinguish between colors D and E, and it appears that consumers can't tell the difference either! From our model, we can conclude that when comparing diamonds of the same carat, there is not a statistically significant difference in average price when comparing colors D and E. We were able to make this conclusion using the p-value for the dummy variable of Color E. This p-value equals: (round to three or more decimal places).

.208

In the Question 07 above, you answered a question about statistical significance, which means a p-value was involved. What is the p-value you used to answer the question "When looking at homes of the same size, is there a statistically significant difference in average home price when comparing homes in the North and homes in the South?"

0.2998

In the suburban location, after controlling for the variation between stores, every week, the average revenue increased about $_____ thousand dollars. (Round to at least two decimal places)

12.45

For the model Predicted Total Sales = 16*(Average Unit Price)^(1.196) we can conclude that the predicted total sales becomes ______ higher whenever the average unit price doubles.

129%

Suppose you have a data set containing information for every Meijer in Michigan. In this data set, we have the store location, the total revenue from grocery sales in June 2020, and the total revenue from cleaning supply sales in June 2020. This is an example of

Cross-Sectional Data

Using the data above, will the commands below create the desired linear regression model? Model <- lm(Spending ~ as.factor(FICO) + Years + as.factor(AgeGroup) + as.factor(Segment), data=DF) summary(Model)

No - this will not create the desired model.

Using the data above, will the commands below create the desired linear regression model? Model <- lm(Spending ~ FICO + Years + as.factor(AgeGroup) + Segment, data=DF) summary(Model)

No - this will not create the desired model.

Suppose we have data that were used to create a multiple linear regression model. A plot of the observed value vs. the Predicted value of Y is given below. Does the model fit the data?

No; the model is not accurate.

What is a criticism of the output from the summary of the model called Question9to13_Model?

The output is incredibly long, and the majority of the output we are not interested in (i.e. we want to ignore most of the coefficients table).

Suppose we have data from a lot of variables. We used this data to construct a model to predict the variable Y (based on all the x-variables) Based on the plot below, what can we conclude about the relationship between the x-variables and the y-variable?

The plot does not allow us to determine which, if any, of the x-variables are correlated with the y-variable; we need more information.

In the model output given above, what is the reference level for the variable Advertising budget?

This is a trick question - advertising budget is a quantitative variable, and only categorical variables have reference levels.

For homes less than or equal to 2500 square feet, when we compare homes of the same size, is there a statistically significant difference in average home prices between homes on corner lots and non-corner lots?

no

Since the average price of a color J diamond is higher than the average price of a color D diamond, it means that people are willing to pay more for diamonds that are color J than color D; color J is the more desired color.

false

True or False: When comparing homes of the same size, if a house is sold by Agency 4, the average price decreases about 17.39 thousand dollars.

false

Complete the following interpretation: In the suburban location, after controlling for the often unknown reasons that might cause variability between individual stores,

there is a statistically significant change in revenue over time; the marketing campaign appears to be generally successful in the suburban areas.

In urban areas, after controlling for variability between stores, there is not a statistically significant increase in revenue over time. We know this because the p-value associated with the estimate of the coefficient of the variable Time in the fixed effects model we created is ______, which is much larger than 0.05.

0.365

We concluded that for homes in the city where the data were collected, there is not a statistically significant difference between homes on corner lots and homes on non-corner lots. What was the p-value used that allowed us to make this conclusion? Answer with two (or more) decimal places.

0.985

In question 05 above, you answered a question about statistical significance, which means a p-value was involved. What is the p-value you used to answer the question "When looking at homes in the city in general, is there a statistically significant difference in average home price when comparing homes in the North and homes in the South?"

1.92*10^(-11)

Suppose we created a multiple linear regression model using the variable location (the location is either NORTH or SOUTH, and they are included in the model in dummy variable format). The model we obtained is Estimated average selling price (in thousands of dollars) = 13.9 + 0.07(size, in sqft) + 2(SOUTH) Use the model to compute the estimated average selling price of a 2000 square foot house that is in the SOUTH location.

155.9 thousand dollars
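Both location questions plug into the same equation; only the SOUTH dummy changes. A short sketch of that arithmetic is below, assuming SOUTH is coded 1 for the South and 0 for the North (the reference level). The function name is an assumption made for the example.

```python
def estimated_price_thousands(size_sqft, location):
    """Evaluate 13.9 + 0.07(size, in sqft) + 2(SOUTH), in thousands of dollars."""
    south = 1 if location == "SOUTH" else 0  # NORTH is the reference level
    return 13.9 + 0.07 * size_sqft + 2 * south

north_price = estimated_price_thousands(2000, "NORTH")
south_price = estimated_price_thousands(2000, "SOUTH")

print(round(north_price, 1), round(south_price, 1))
```

The gap between the two estimates equals the SOUTH coefficient, 2 thousand dollars, at every house size because this model has no interaction term.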

The predicted sales for a Domestic customer who purchases items with an average unit price of $10 is___ ..., and the predicted sales for an international customer who purchases items with an average unit price of $10 is

179, 2078

What is the predicted price of a 3-carat Good cut diamond?

28843 explanation: ln(Price) = 8.12 + 1.72*ln(3) + 0.26 =10.26961, and e^(10.26961) = 28842.64

After controlling for variability between countries, every ten years, the average child mortality rate decreased _____ %.

29.5%
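The card does not show the fitted coefficient for Year, so the sketch below only illustrates how a coefficient in a log model converts to a percentage change per decade, and works backwards from the 29.5% answer to the yearly coefficient that would imply it. The function name and the back-calculated coefficient are assumptions for illustration, not values from the R output.

```python
import math

def decade_pct_change(year_coefficient):
    """Percent change in Y over 10 years when log(Y) changes by year_coefficient per year."""
    return (math.exp(10 * year_coefficient) - 1) * 100

# Back out the yearly coefficient implied by a 29.5% decrease per decade (illustrative only).
implied_year_coefficient = math.log(1 - 0.295) / 10

print(round(implied_year_coefficient, 4))
print(round(decade_pct_change(implied_year_coefficient), 1))
```

A negative coefficient on Year always gives a negative percentage change, which is how the direction (a decrease) can be read from the output without looking at the graphs.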

How do we know that "When comparing individuals with the same value for Unit Price, there is not a statistically significant difference in predicted sales when comparing the four segments?"

Because ALL of the p-values for ALL of the dummy variables created from the Segment variable were large.

The plot above suggests that once we control for carat, there is a difference in price based on clarity. We note that the x-axis and y-axis are both graphed using a log scale, indicating that we wish to expand upon a power model predicting price when given the carat. What R commands do we use to add clarity to the power model for price and carat?

ClarityModel <- lm(log(price) ~ log(carat) + clarity, data=DiamondSales)

When comparing homes of the same size and same distance from the train, what can we say about the difference in average prices between the North and the South?

When comparing homes of the same size and same distance from the train, there is not a significant difference in average price between homes in the North and the South.

True or False: The average selling price for homes sold by Agency 2 is about 18.72 thousand dollars less than the average selling price of homes sold by Agency 1.

False. The 18.72 came from a model that includes SIZE, so the interpretation must include "among homes of the same size".

Recall that these data include information about individual customers, and the unit price indicates that average unit price of all items that a customer has purchased - customers with a small value for Unit Price often purchase cheaper items, and customers who have a large value for Unit Price generally tend to purchase more expensive items. The Sales represents the total sales from all orders from that individual. We have the model Predicted Total Sales = 16*(Average Unit Price)^(1.196) Using the model, we can conclude that the predicted total sales of an individual who purchases items with an average unit price of $100

is about $3945
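The power model on this card also appears in the earlier question about doubling the average unit price. The sketch below evaluates it at $100 and shows where the "129% higher" figure comes from; the function name is made up for the example.

```python
def predicted_total_sales(avg_unit_price):
    """Evaluate Predicted Total Sales = 16 * (Average Unit Price)^1.196."""
    return 16 * avg_unit_price ** 1.196

sales_at_100 = predicted_total_sales(100)

# Doubling the unit price multiplies the prediction by 2^1.196, whatever the starting price.
doubling_ratio = predicted_total_sales(200) / predicted_total_sales(100)
pct_higher = (doubling_ratio - 1) * 100

print(round(sales_at_100, 2), round(pct_higher, 1))
```

Because the 16 cancels in the ratio, the percentage increase from doubling depends only on the exponent.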

For the population of houses represented by the data in the HOUSEDATA data set, the average home price is different when comparing houses in the North and the South.

true

The independent t-tests that we learned in STT 200 or STT 315 can be considered a special case (or simplification) of linear regression.

true
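One way to see the connection: regress Y on a single 0/1 dummy variable, and the fitted slope equals the difference between the two group means, which is the quantity the independent t-test examines. The sketch below checks only that equality on made-up numbers; it does not compute the p-value.

```python
# Hypothetical data: group is a 0/1 dummy variable, y is the response.
group = [0, 0, 0, 1, 1, 1]
y = [10.0, 12.0, 11.0, 15.0, 14.0, 16.0]

g_mean = sum(group) / len(group)
y_mean = sum(y) / len(y)

# Least-squares slope and intercept for y regressed on the dummy.
slope = sum((g - g_mean) * (yi - y_mean) for g, yi in zip(group, y)) / sum(
    (g - g_mean) ** 2 for g in group
)
intercept = y_mean - slope * g_mean

# Group means computed directly.
mean_0 = sum(yi for g, yi in zip(group, y) if g == 0) / group.count(0)
mean_1 = sum(yi for g, yi in zip(group, y) if g == 1) / group.count(1)

print(intercept, slope, mean_0, mean_1 - mean_0)
```

The intercept is the mean of the reference group and the slope is the difference in means, so testing whether the slope is zero asks the same question as the two-sample test.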

Note that this question is worth more points than the two-option multiple choice questions that preceded it! Create the model from Question 9 that was a better fit for the data (i.e. use the correct answer to question 9). What is the multiple R-squared of this model? Round to four decimal places.

.9411

Note that this question is NOT a continuation of the previous question! The data set is the same, but the model and variables used are different. For this question we will use the Motor Trend Car Data, which is saved as the data set mtcars in R. Use R to estimate the coefficients of the statistical model below: hatmpg= b0+b1(wt) + b2(hp)+ b3(wt*hp) What is the value that you obtain for b3 ?

0.02785

Create the multiple linear regression model where the independent variables are SIZE, TRAIN.DIST, and three of the four dummy variables associated with AGENCY. What percentage of the variation in price is explained by the variables SIZE, TRAIN.DIST, and AGENCY? In other words, what is the Multiple R-squared of this model (as shown at the bottom of the summary output)?

0.9460

What is the predicted price of a 1.5-carat Good cut diamond?

8755 explanation: ln(Price) = 8.12 + 1.72*ln(1.5) + 0.26 = 9.0774, and e^(9.0774) = 8755

In the previous question, we set up an R command to answer the question "after controlling for individual variation, is there a statistically significant relationship between X and Y, and if so, what is it?" Create that model in R. What can we conclude?

After controlling for individual variability, there is a statistically significant relationship between X and Y. For any given individual, whenever X increases 1 unit, we expect Y to increase about 1.22 units.

Suppose you would like to use the mtcars data set. You would like mpg to be the dependent variable, and would like to include independent variables wt, disp, and cyl. How would you create a multiple linear regression model where there is an interaction between wt and cyl, but disp is not involved in an interaction? i.e. you wish to create hatmpg = b0 + b1(wt) + b2(disp) + b3(cyl) + b4(wt*cyl)

Model <- lm(mpg ~ wt*cyl + disp, data=mtcars)

What is a piece of information that we can learn from the plots that we are not able to determine from the two previous regression models that we have created?

The plots indicate that revenue increases quickly the first two weeks and then slowly between weeks 2 and 4 for rural locations, but it is the opposite for suburban locations, where revenue increases slowly the first couple of weeks, and then quickly between weeks 2 and 4. We cannot get this information from the linear regression models we created.

Using the result of our model, we can determine a general trend for the model. We will fill in the missing number (represented by _____ in the options below) in the next question. The general trend of the conclusion will be: After controlling for the variability that exists between countries, every ten years, the child mortality will

decrease ____ %

Is the following a correct or incorrect interpretation of the model whose output appears above: "After controlling for advertising budget and segment, the average Sales in the South increases by $80,000."

incorrect interpretation

Continue using the HOUSEDATA data set, but with a different model. We will use the model with an interaction term: Model2 <- lm(PRICE ~ SIZE + TAX + SPEC.FEATS + AGE + LOCATION + LOT*SIZE + TRAIN.DIST, data=HOUSEDATA) Suppose a different house is listed for sale. It is 1500 square feet, is charged $750 in annual property taxes, is located in the South, is a Non-corner lot, is 30 years old, has 2 special features, and is 3.0 miles away from the train. We are 95% confident that this house will have a selling price between $___ and $___

91399.0, 154378.0

Suppose we have data about monthly revenue (in millions of dollars) for a large corporation. Additionally, we have data for Marketing budget (in millions of dollars) and Region (Americas, Europe, Middle East, Asia, and Africa). The results of a multiple linear regression model are given directly above this question. Is there a statistically significant interaction between Region and Marketing budget with respect to Monthly Revenue?

Yes, there is a statistically significant interaction between Region and Marketing budget with respect to Monthly Revenue.

After controlling for size, is there a statistically significant difference in average price when comparing homes that are 0-5 years old vs. homes that are more than 5 years old?

Yes; after controlling for size, New Construction homes are, on average, more expensive than Not New homes.

Suppose we have data that were used to create a multiple linear regression model. A plot of the prediction error vs. the Predicted value of Y is given below. Does the model fit the data?

Yes; the model is generally accurate.

In the plot above (Figure 3), is the y-axis graphed using the linear scale or the log scale?

linear scale

In the question above, you answered a question about statistical significance, which means a p-value was involved. What is the p-value you used to answer the question "For homes less than or equal to 2500 square feet, when we compare homes of the same size, is there a statistically significant difference in average home prices between homes on corner lots and non-corner lots?" Answer with 4 (or more) decimal places.

0.2339

The diamond sales data set includes data collected from 1 year with data for 1000 round-cut diamonds. (Since all diamonds are collected in one year, we consider this as an example of when all the data come from "one moment in time"). The data set includes the selling price of the diamond (in dollars), the weight (in carats), the color, cut, and clarity of the diamond, as well as size/dimension measurements (x, y, z, table). What type of data is the DiamondSales data?

Cross-sectional data. Explanation: Every diamond only appears once in the data set, and all diamonds were sold at about the same time.

Suppose we have the multiple linear regression model created from a quantitative variable, X, and a categorical variable, Group (there are two groups: Group A and Group B). GroupA in the equation below is a dummy variable. Suppose we have the following model, where all p-values for all coefficients are very small. revenue = 1 + 2X + 3(GroupA) + 4(X*GroupA) True or False: Once we've controlled for X, the average revenue for Group A is 3 higher than the average revenue for Group B.

False Explanation: The interpretation given in the true/false statement can only be made for parallel models for Group A and Group B. An example of a model where this statement would be true is: revenue = 1 + 2X+ 3(GroupA)

Suppose we have the multiple linear regression model created from a quantitative variable, X, and a categorical variable, Group (there are two groups: Group A and Group B). GroupA in the equation below is a dummy variable. Suppose we have the following model, where all p-values for all coefficients are very small. revenue = 1 + 2X + 3(GroupA) + 4(X*GroupA) True or False: Once we've controlled for X, the average revenue for Group A is 7 higher than the average revenue for Group B.

False explanation: The interpretation given in the true/false statement can only be made for parallel models for Group A and Group B. An example of a model where this statement would be true is: revenue = 1 + 2X+ 7(GroupA)

Is the following a correct or incorrect interpretation of the model whose output appears above: "The average Sales in the South is $80,000 higher than in any other location"

Incorrect interpretation (i.e. this is either a false statement, or we do not have enough information to know if this statement is true or false).

Using the data above, will the commands below create the desired linear regression model? Model <- lm(Spending ~ FICO + Years + AgeGroup + Segment, data=DF) summary(Model)

No - this will not create the desired model.

Overall (not accounting for agency), the predicted price of a 2000 square foot house is

The model created in this section, which was created using the R commands Model <- lm(PRICE ~ SIZE + as.factor(AGENCY), data=HOUSEDATA), does not give us enough information to answer this question. explanation : The model that was given accounts for Agency, and all interpretations and computations will account for Agency. If we don't want to account for agency (and only square footage), then we need to make a new model where the only x-variable is SIZE; the current model does not provide enough information to make the desired calculation.

True or False: Using the model output above (which is additionally included in this question), we can conclude that after we have accounted for variability between individuals, there is a statistically significant difference in average Y when comparing the before and after times.

True EXPLANATION : The p-value for the dummy variable for before is very small, indicating that when all else in the model has been controlled for (i.e. individual variation), then there is a statistically significant difference in average Y between the "before" and "after" times.

Suppose we have the multiple linear regression model created from a quantitative variable, X, and a categorical variable, Group (there are two groups: Group A and Group B). GroupA in the equation below is a dummy variable. Suppose we have the following model, where all p-values for all coefficients are very small. revenue = 1 + 2X + 3(GroupA) + 4(X*GroupA) True or False: Whenever X increases 1 unit, the average revenue increases 4 more units in Group A than in Group B.

True explanation: The interaction term measures the difference in the slope!
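The three Group A / Group B cards all use revenue = 1 + 2X + 3(GroupA) + 4(X*GroupA). A small sketch of that equation shows why no single number describes the gap between the groups, while the difference in slopes is a constant 4. The function name is made up for the example.

```python
def revenue(x, group_a):
    """Evaluate revenue = 1 + 2X + 3(GroupA) + 4(X*GroupA); group_a is the 0/1 dummy."""
    return 1 + 2 * x + 3 * group_a + 4 * x * group_a

# Gap between Group A and Group B at several values of X: it equals 3 + 4X, so it is not constant.
gaps = [revenue(x, 1) - revenue(x, 0) for x in (0, 1, 2)]

# Slope in each group: the change in revenue when X increases by 1 unit.
slope_a = revenue(1, 1) - revenue(0, 1)
slope_b = revenue(1, 0) - revenue(0, 0)

print(gaps, slope_a, slope_b, slope_a - slope_b)
```

The gap is 3 only at X = 0 and 7 only at X = 1, which is why both "3 higher" and "7 higher" are false as general statements.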

Which of the interpretations below can we make about the relationship between carat and price?

When looking at diamonds with the same clarity, whenever the carat increases 10%, the price increases about 20%. Explanation: Let's look at an example: calculating the predicted price for a 1-carat IF diamond and the predicted price of a 1.1-carat IF diamond. (These are done using a scientific calculator; R is not needed.) For a 1-carat IF diamond: ln(price) = 7.82562 + 1.8*ln(1) + 1.05 = 8.87562, so the price = e^(8.87562) = $7,155. For a 1.1-carat IF diamond: ln(price) = 7.82562 + 1.8*ln(1.1) + 1.05 = 9.05, so the price = e^(9.05) is about $8,500. Comparing these prices, we get (8500-7155)/7155 * 100% = about 19% (which is about 20%).
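The same percentage can be reached without computing either price, because the intercept and the clarity term cancel when two diamonds of the same clarity are compared. A minimal sketch, using only the exponent 1.8 on ln(carat) from the explanation; the function name is made up for the example.

```python
def pct_price_change(carat_increase_pct, log_carat_coefficient=1.8):
    """Percent change in price when carat rises by carat_increase_pct, holding clarity fixed."""
    ratio = (1 + carat_increase_pct / 100) ** log_carat_coefficient
    return (ratio - 1) * 100

ten_pct_more_carat = pct_price_change(10)

print(round(ten_pct_more_carat, 1))
```

The result is a little under 19%, which the card rounds to "about 20%".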

When controlling for segment, is there a statistically significant correlation between log(unit price) and log(sales), (and hence a statistically significant correlation between unit price and sales after controlling for segment)?

Yes, we know this because the p-value for log(unitprice) is very small.

When looking at homes in the city in general, is there a statistically significant difference in average home price when comparing homes in the North and homes in the South?

Yes; the average price for homes in the North is greater than the average price for homes in the South.

Carat measures the weight of a diamond, however it is often used to describe its size. While this is not completely accurate, using the carat to describe size is generally accepted. So we can conclude: The average price of a good cut diamond is ________ higher than the average price of a fair cut diamond of the same size.

about 30%

We would like to answer the question, "after controlling for individual variation, does the average value of Y change over time - in other words, is there a statistically significant difference in average Y when comparing the before and the after times?" Suppose the name of the data set is ExampleData. What model do we use? (Hint: a screenshot of the desired model output is given immediately after this question).

lm(Y ~ Time + as.factor(ID), data=ExampleData) explanation: Accounting for individual variation means include as.factor(ID) as an independent variable - as.factor() because the ID variable is categorical but listed in numeric format. The desired interpretation does not mention X, only Time. So we use Time as an independent variable. Time is categorical, but we do not need to use as.factor(Time) in this question, as it is entered in text format.

In the plot above (Figure 4), is the y-axis graphed using the linear scale or the log scale?

log scale

Suppose you have a data set containing the daily revenue (in dollars) for the Lake Lansing Meijer from 2010 to 2020. This data set is an example of

time series data explanation: Values for one variable (Revenue) are being recorded for one store (Lake Lansing Meijer) at regular time periods (every day).

Continue using the HOUSEDATA data set and Model1, which uses the variables SIZE, TAX, as.factor(AGENCY), LOCATION, and SPEC.FEATS. Suppose a house is listed for sale. It is 3000 square feet, is charged $1600 in annual property taxes, is being sold by Agency 4, is located in the North, has 8 special features, and is 4.5 miles away from the train. We are 95% confident that this house will have a selling price between $____ and $____

217957.0, 257008.0

Suppose we have data that were used to create a multiple linear regression model. A plot of the observed value vs. the Predicted value of Y is given below. Does the model fit the data?

Yes; the trend in the model reflects the general trend in the data. EXPLANATION: It's ok for the dots to be spread out. They generally follow the trend y=x (i.e. the dots do not show a strong pattern that is not linear), so a linear regression model "fits" the data.

We would like to answer this question "after controlling for individual variation, is there a statistically significant relationship between X and Y, and if so, what is it?" The variables X, Y, Time, and ID are saved in a data set named ExampleData. What model should we use to answer the question about the relationship between X and Y (after controlling for individual variation)?

lm(Y ~ X + as.factor(ID), data=ExampleData) explanation: The desired interpretation asked about individual variation, so we should include as.factor(ID) as an independent variable; the as.factor() is because ID is a numeric categorical variable. The interpretation also asks about a relationship between X and Y, so Y can be the dependent variable and X should be another independent variable. The interpretation does not mention time, so we do not include time in the model.

Suppose you have a data set containing information for every Meijer in Michigan. In this data set, we have the store location, the total revenue from grocery sales for every month from April 2020 to June 2020, and the total revenue from cleaning supply sales for every month from April 2020 to June 2020. This is an example of

panel data

Use the model output given above, which is a model whose dependent variable is selling price (in thousands of dollars). The estimated selling price of a 1000 square foot home that was sold by Agency #1 is _____ thousand dollars. (Round to the nearest whole number)

103.0

What is the predicted price of a 1.5-carat Fair cut diamond?

6751 explanation: ln(Price) = 8.12 + 1.72*ln(1.5) = 8.8174, and e^(8.8174) = 6751
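
The arithmetic in the explanation can be checked directly in R, using only the two coefficients quoted there (8.12 and 1.72); the variable names are just for illustration.

```r
# Coefficients as quoted in the explanation: intercept 8.12, slope 1.72 on ln(carat).
carat <- 1.5
log_price <- 8.12 + 1.72 * log(carat)   # log() in R is the natural log
price <- exp(log_price)                 # undo the log to get back to dollars

print(round(log_price, 4))
print(round(price))
```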

Suppose you would like to use the HOUSEDATA data set and evaluate if there is an interaction between LOCATION and SIZE with respect to PRICE. Which model would you use?

Model <- lm(PRICE ~ LOCATION*SIZE, data=HOUSEDATA)

Use the bar graph created above to answer this question (plot 3). Looking at the houses in the data set overall, does there appear to be a difference in average prices when we compare corner and non-corner lots?

No; there does not appear to be a large difference in average home price when comparing corner lots and non-corner lots.

Below is a scatterplot containing summary data for all nations with available data (image created on gapminder.org). The scatter plot indicates the average national percentage of out-of-pocket health spending for individuals, as well as the life expectancies of each country. The dots are color-coded by income group, and the size of the dot represents the population of the nation. The green dots represent countries with high or upper middle income, and the blue dots represent lower middle or low income. Suppose we were interested in describing the relationship between out-of-pocket healthcare spending and life expectancy. We are considering adding the dummy variables associated with income group into the model. Which type of model is most likely to fit the data the most accurately?

A model that includes an interaction between income group and out-of-pocket total health spending. For such a model, each income group is associated with a linear equation, each with possibly different slopes and different intercepts. explanation: The green dots show a decreasing trend, but the blue dots follow an almost horizontal trend or possibly increasing trend. Since the dots are not following parallel trends, the plot suggests there is an interaction.

IF is an abbreviation for "internally flawless." What is an interpretation that we can make that involves comparing internally flawless diamonds with other diamonds of the same carat?

After controlling for carat, internally flawless diamonds are on average 2.86 times more expensive than I1 diamonds.

When we use this model to make an interpretation about how price will change if any one of the x-variables changes, the general trend will always involve a percentage change in price. Why? Examples of these interpretations involving a percentage change in price: "After controlling for carat, clarity, and color, the predicted price of an ideal cut diamond is 21% more than a fair cut diamond." or "After controlling for cut, clarity, and color, every additional 10% increase in carat will increase the predicted price by 19.7%."

Because we used log(price) as the dependent variable in the original linear regression model (what was entered into R). Since logs are associated with percentage change, any "general trend" will involve a percentage change in price.
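
A short sketch of why a log(price) model produces percentage statements. The two helper functions and the coefficient values are hypothetical: the coefficients are back-solved from the 21% and 19.7% figures quoted in the question, not taken from real model output.

```r
# Percent change in price when a dummy variable switches from 0 to 1,
# given its coefficient b in a model for log(price).
pct_change_dummy <- function(b) (exp(b) - 1) * 100

# Percent change in price when x rises by pct_x percent,
# given the coefficient b on log(x) in a model for log(price).
pct_change_loglog <- function(b, pct_x) ((1 + pct_x / 100)^b - 1) * 100

# Hypothetical coefficients, back-solved so the output matches the quoted examples.
b_dummy <- log(1.21)                # would correspond to "21% more"
b_logx  <- log(1.197) / log(1.10)   # would correspond to "19.7% per 10% increase"

print(pct_change_dummy(b_dummy))
print(pct_change_loglog(b_logx, 10))
```

Because the coefficients act on log(price), exponentiating turns an additive effect into a multiplicative one, which is what a percentage change is.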

The p-values for VVS1 and VVS2 are both very small (these are for diamonds with very very slight inclusions). True or False: Because these two p-values are very small, we can conclude that among diamonds of the same carat, there is a statistically significant difference in average price when comparing VVS1 and VVS2 diamonds.

FALSE or not enough information. Explanation: We do not have enough information to make a comparison between VVS1 and VVS2 diamonds that are the same carat. The p-values compare each clarity with the "reference level," which is I1 diamonds (diamonds with inclusions that are visible without needing magnification). The small p-value for VVS1 means that among diamonds with the same carat, there is a statistically significant difference in average price between I1 and VVS1 diamonds. The small p-value for VVS2 means that among diamonds with the same carat, there is a statistically significant difference in average price between I1 and VVS2 diamonds. This model does not allow us to make a comparison between VVS1 and VVS2 diamonds: for that, we would need to make a new model that omits either the dummy variable for VVS1 or the dummy variable for VVS2.
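
One way to build the "new model" described above is to change the reference level with relevel(), so that VVS1 becomes the omitted dummy. This is a sketch on simulated data; the data frame, its column names, and the numbers in it are assumptions for illustration only.

```r
# Hypothetical data; column names and values are assumptions.
set.seed(7)
clarity_levels <- c("I1", "VVS1", "VVS2")
Diamonds <- data.frame(
  carat   = runif(90, 0.3, 2),
  clarity = factor(rep(clarity_levels, each = 30), levels = clarity_levels)
)
shift <- c(I1 = 0, VVS1 = 1.0, VVS2 = 0.9)
Diamonds$price <- exp(8 + 1.7 * log(Diamonds$carat) +
                      unname(shift[as.character(Diamonds$clarity)]) +
                      rnorm(90, sd = 0.1))

# Default reference level is I1, so both dummy p-values compare against I1.
ModelI1 <- lm(log(price) ~ log(carat) + clarity, data = Diamonds)

# Make VVS1 the reference level; the clarityVVS2 row now compares VVS2 with VVS1.
Diamonds$clarity <- relevel(Diamonds$clarity, ref = "VVS1")
ModelVVS1 <- lm(log(price) ~ log(carat) + clarity, data = Diamonds)

print(names(coef(ModelI1)))
print(names(coef(ModelVVS1)))
```

In the second model, the p-value on the clarityVVS2 row is the one that speaks to a VVS1 vs. VVS2 comparison among diamonds of the same carat.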

Above, we created the model LOCATIONModel <- lm(PRICE ~ LOCATION, data=HOUSEDATA). What can we interpret from the output of this model? (Ignore output from all previous models.)

When not accounting for any other piece of information about homes, the average selling price of houses in the South is $27,789 less than the average selling price of houses in the North, and this difference is statistically significant.

When looking at homes of the same size, is there a statistically significant difference in average home price when comparing homes in the North and homes in the South?

no

You may wish to create an example by calculating the predicted price of diamonds for colors D and J that have the same carat (i.e. both 1-carat, or both 2-carat, etc.). When we compare diamonds of the same carat, color D diamonds are on average approximately _______ more expensive than color J diamonds.

one and a half times

Suppose we have a multiple linear regression model created from a quantitative variable, X, and a categorical variable, Group (there are two groups: Group A and Group B). GroupA in the equation below is a dummy variable. Suppose we have the following model, where the p-values for all coefficients are very small: revenue = 1 + 2X + 3(GroupA) + 4(X*GroupA). True or False: Once we've controlled for X, the average revenue for Group A is 4 higher than the average revenue for Group B.

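False. Explanation: The interpretation given in the true/false statement can only be made for parallel models for Group A and Group B. In the model above, the difference between Group A and Group B at the same X is 3 + 4X, which changes with X. An example of a model where this statement would be true is: revenue = 1 + 2X + 4(GroupA)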
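
The gap between the two groups can be computed directly from the equation in the question. The function names below are made up for illustration; the coefficients are the ones given in the question and in the parallel-model example.

```r
# Model from the question: revenue = 1 + 2X + 3(GroupA) + 4(X*GroupA)
revenue_interaction <- function(x, group_a) 1 + 2 * x + 3 * group_a + 4 * x * group_a

# Parallel model from the explanation: revenue = 1 + 2X + 4(GroupA)
revenue_parallel <- function(x, group_a) 1 + 2 * x + 4 * group_a

x <- c(0, 1, 2, 5)

# Group A minus Group B at the same X.
gap_interaction <- revenue_interaction(x, 1) - revenue_interaction(x, 0)
gap_parallel    <- revenue_parallel(x, 1) - revenue_parallel(x, 0)

print(gap_interaction)   # equals 3 + 4 * x, so it depends on X
print(gap_parallel)      # the same value at every X
```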

Suppose you would like to use the HOUSEDATA data set and evaluate if there is an interaction between AGENCY and SIZE with respect to PRICE. Which model would you use? Hint: Remember that AGENCY is a numeric categorical variable!

Model <- lm(PRICE ~ SIZE*as.factor(AGENCY), data=HOUSEDATA)
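
A sketch of what the `*` and the as.factor() each contribute, using a small simulated data set as a hypothetical stand-in for HOUSEDATA (three agencies coded 1, 2, 3; all values made up).

```r
# Hypothetical stand-in for HOUSEDATA.
set.seed(3)
Houses <- data.frame(
  SIZE   = runif(45, 1000, 3000),
  AGENCY = rep(1:3, each = 15)
)
Houses$PRICE <- 100 + 0.05 * Houses$SIZE * Houses$AGENCY + rnorm(45, sd = 5)

WithFactor    <- lm(PRICE ~ SIZE * as.factor(AGENCY), data = Houses)
WithoutFactor <- lm(PRICE ~ SIZE * AGENCY, data = Houses)

# With as.factor(): a dummy and a SIZE slope adjustment for each non-reference agency.
print(names(coef(WithFactor)))
# Without as.factor(): AGENCY is treated as a single number, which is not what we want.
print(names(coef(WithoutFactor)))
```

In both formulas, `A * B` expands to the two main effects plus the `A:B` interaction terms, so the interaction p-values are found in the rows whose names contain a colon.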

Using the data above, will the commands below create the desired linear regression model? Model <- lm(as.factor(Spending) ~ as.factor(FICO) + as.factor(Years) + as.factor(AgeGroup) + as.factor(Segment), data=DF) summary(Model)

No - this will not create the desired model.

