Probability & Statistics Module #3
Students at a large state university system are upset over the rate at which their fees have increased in the last five years (2005-2010). A small group of them appears before the state legislature and reports a predicted fee for 2020 based on their model. What error could they be accused of?
extrapolating
A student in an intro stats course collects data at her university. She wants to model the relationship between student jobs and GPA. She collects a random sample of students and asks each for their GPA and the number of hours per week they work. What graph should she make to check the conditions for linear regression?
A scatterplot of work hours and GPA.
Shown on the right is a scatterplot of the production budgets (in millions of dollars) vs. the running time (in minutes) for major release movies in 2005. Dramas are plotted as red x's and all other genres are plotted as blue dots. A separate least squares regression line has been fitted to each group. a) What are the units for the slopes of these lines? b) In what way are dramas and other movies similar with respect to this relationship? c) In what way are dramas different from other genres of movies with respect to this relationship?
A. Million dollars per minute B. They have the same rate of increase in budget per increase in runtime. C. On average dramas cost about $20 million less for the same runtime.
A random sample of records of home sales from Feb. 15 to Apr. 30, 1993, from the files maintained by the Albuquerque Board of Realtors gives the Price and Size (in square feet) of 117 homes. A regression to predict Price (in thousands of dollars) from Size has an R² of 71.4%. The residuals plot indicated that a linear model is appropriate. Complete parts a through c below. a) What are the variables and units in this regression? b) What units does the slope have? c) Do you think the slope is positive or negative? Explain.
A. Price (in thousands of dollars) is y and Size (in square feet) is x B. The slope has units of thousands of dollars per square foot. C. The slope is positive. As the size of the home increases, the price should also increase.
Tell what each of the residual plots to the right indicates about the appropriateness of the linear model that was fit to the data. A. Residuals start high at the left, dip in the middle, and curve up to the right. B. Residuals are scattered widely across the plot. C. Residuals start low at the left, rise in the middle, and curve down to the right.
A. The curved pattern in the residuals plot indicates that the linear model is not appropriate. The relationship is not linear. B. The scattered residuals plot indicates an appropriate linear model. C. The curved pattern in the residuals plot indicates that the linear model is not appropriate. The relationship is not linear.
A researcher wants to determine if the nicotine content of a cigarette is related to "tar". A collection of data (in milligrams) on 29 cigarettes produced the accompanying scatterplot, residuals plot, and regression analysis. a) Is the linear model appropriate here? Explain. b) R² is 92.4%. Explain what this value means in context.
A. The linear model could be appropriate. There is some curvature to the residuals but not enough to completely disregard the linear model. Some more data points may be required. B. The linear model on tar content accounts for 92.4% of the variability in nicotine content.
Is there evidence that the age at which women get married has changed over the past 100 years? The accompanying scatterplot shows the trend in age at first marriage for American women. a) Is there a clear pattern? Describe the trend. b) Is the association strong? c) Is the correlation high? Explain. d) Is a linear model appropriate? Explain.
A. The trend appears to be linear up to about 1940, but from 1940 to about 1970 the trend appears to be nonlinear. From 1975 or so to the present, the trend appears to be linear. B. The association appears relatively strong. C. No, as a whole the graph is clearly nonlinear. D. No, a linear model would not be appropriate, although one could fit a linear model to the period from 1975 to 2003.
A least squares regression line was calculated to relate the length (cm) of newborn boys to their weight in kg. The line is weight = −5.22 + 0.1635 length. Explain in words what this model means. Should new parents (who tend to worry) be concerned if their newborn's length and weight don't fit this equation? a) What does the given model mean? b) Should new parents (who tend to worry) be concerned if their newborn's length and weight don't fit this equation?
A. The weight of a newborn boy can be predicted as −5.22 kg plus 0.1635 kg per cm of length. B. No, because this is a model fit to data. No particular baby should be expected to fit this model exactly.
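To see what the model in the card above produces, here is a minimal sketch that evaluates the line for one newborn. The 50 cm length is a hypothetical value chosen for illustration; it does not come from the original problem.

```python
def predicted_weight_kg(length_cm: float) -> float:
    """Predicted weight (kg) from the fitted line weight = -5.22 + 0.1635 * length."""
    return -5.22 + 0.1635 * length_cm

# Hypothetical newborn length (cm), chosen only for illustration.
example_length_cm = 50.0
example_weight_kg = predicted_weight_kg(example_length_cm)
print(f"Predicted weight: {example_weight_kg:.3f} kg")
```

A real baby of that length will usually weigh somewhat more or less than the printed value, which is the point of answer B.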
If you find any outliers or high leverage points in your data, you should delete them from the analysis.
False
A variable that is not part of the model but affects the way variables in the model appear to be related is called a(n) _____________.
Lurking variable
There is a strong correlation between the temperature and the number of skinned knees on playgrounds. Does this tell us that warm weather causes children to trip?
No. In warm weather, more children will go outside and play.
Some friends of yours in a political science class are angry about a new town ordinance restricting off-campus parties. They make an online survey asking students' opinions. This type of sampling might be classified as a __________ sample.
convenience
For many people, breakfast cereal is an important source of fiber in their diets. Cereals also contain potassium, a mineral shown to be associated with maintaining a healthy blood pressure. An analysis of the amount of fiber (in grams) and the potassium content (in milligrams) in servings of 77 breakfast cereals produced the regression model Potassium=38+27Fiber. If your cereal provides 9 grams of fiber per serving, how much potassium does the model estimate you will get?
281 milligrams of potassium
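The arithmetic behind this answer can be checked with a short sketch. The function name is an assumption made for illustration; the coefficients are the ones given in the question.

```python
def predicted_potassium_mg(fiber_g):
    """Potassium (mg) predicted by the model Potassium = 38 + 27 * Fiber."""
    return 38 + 27 * fiber_g

# 9 grams of fiber per serving, as stated in the question.
estimate_mg = predicted_potassium_mg(9)
print(estimate_mg)  # 281
```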
Which of the following is a true statement about residuals? a) The regression line is the line that minimizes the standard deviation of the residuals. b) The residual plot for a model that is a good fit to the data should not show any pattern or have any unusual features. c) A residual is the difference between the actual data value and the value predicted by the model. d) All of the above
All of the above
Least squares means that the square of the largest residual is as small as it could possibly be.
False. Least squares means that the sum of the squares of all the residuals is minimized.
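The idea that least squares minimizes the sum of squared residuals can be illustrated with a small sketch. The data points below are invented for illustration and are not from any problem in this set; the slope and intercept use the standard closed-form least squares formulas.

```python
# Hypothetical (x, y) data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def sum_squared_residuals(intercept, slope):
    """Sum of squared residuals (observed y minus predicted y) for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Closed-form least squares estimates.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
ls_slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
ls_intercept = y_bar - ls_slope * x_bar

best_ssr = sum_squared_residuals(ls_intercept, ls_slope)

# Nudge the line in several directions; each nudged line should do worse.
nudged_ssr = [
    sum_squared_residuals(ls_intercept + d_int, ls_slope + d_slope)
    for d_int in (-0.5, 0.0, 0.5)
    for d_slope in (-0.2, 0.0, 0.2)
    if (d_int, d_slope) != (0.0, 0.0)
]
print(best_ssr, min(nudged_ssr))
```

Every nudged line has a larger sum of squared residuals than the fitted one, even though the fitted line need not pass through any data point.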
Choose the linear model that passes through the most data points on the scatterplot.
False. The line usually touches none of the points. Choose the line that minimizes the sum of the squared residuals.
A student in an intro stats course collects data at her university. She wants to model the relationship between student jobs and GPA. She collects a random sample of students and asks each for their GPA and the number of hours per week they work. She checks the conditions and makes a linear model. If GPA is the response variable, what units will the slope of her line be?
GPA points/hr
Noting a recent study predicting the increase in cell phone costs, a friend remarks that by the time he's a grandfather, no one will be able to afford a cell phone. Explain where his thinking went awry.
He is extrapolating into the future. It is impossible to know if a trend like this will continue so far into the future.
A CEO complains that the winners of his "rookie junior executive of the year" award often turn out to have less impressive performance the following year. He wonders whether the award actually encourages them to slack off. Can you offer a better explanation? Which of the following is a better explanation for why the winners of the "rookie junior executive of the year" award often turn out to have less impressive performance the following year?
Perhaps they weren't really better than other rookie executives, but just happened to have a lucky year.
An analysis of the amount of fiber (in grams) and the potassium content (in milligrams) in servings of 77 breakfast cereals produced the regression model Potassium=39+28Fiber. Explain what the slope means.
The model predicts that cereals will have approximately 28 more milligrams of potassium for every additional gram of fiber.
In justifying his choice of a model, a student wrote, "I know this is the correct model because R² = 99.4%." a) Is this reasoning correct? Explain. b) Does this model allow the student to make accurate predictions? Explain.
A. No. The scatterplot should be examined first to see if the conditions are satisfied. B. No, the linear model might not fit the data everywhere.
A regression of total revenue on ticket sales determined by a concert production company is given below. Revenue = -12,289 + 33.12 * Ticket Sales a) Management is considering adding a stadium-style venue that would seat 10,000. What does this model predict that revenue would be if the new venue were to sell out? b) Why would it be unwise to assume that this model accurately predicts revenue for this situation?
A. Revenue = $318,911 B. An extrapolation this far from the data is unreliable.
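The sell-out prediction in answer A follows directly from the model. A minimal sketch, with a function name that is assumed for illustration:

```python
def predicted_revenue(ticket_sales):
    """Revenue ($) predicted by Revenue = -12,289 + 33.12 * Ticket Sales."""
    return -12_289 + 33.12 * ticket_sales

# A sold-out 10,000-seat venue, as described in the question.
sellout_revenue = predicted_revenue(10_000)
print(f"${sellout_revenue:,.0f}")
```

The code will evaluate the line for any ticket count, which is exactly why extrapolation is tempting: nothing in the formula signals that 10,000 seats may lie far outside the data used to fit it.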
You are doing a study for a non-profit group helping at-risk children in your city. Suppose you know that 14.2% of the children in your city live in poverty. This percentage is an example of a __________.
Population parameter
Which matters more about a sample you draw from a population?
The size of the sample, not the fraction of the population it represents
A study based on data in which no one manipulates any experimental factors is called an _______________.
observational study
The model predicts that cereals will have approximately 28 more milligrams of potassium for every additional gram of fiber.
The true potassium contents of cereals vary from the predicted amounts with a standard deviation of 30.98 milligrams.
The residuals are the observed y-values minus the y-values predicted by the linear model.
True. The residuals are the observed y-values minus the y-values predicted by the linear model.
To look for outliers, and to check the Equal Variance Assumption, a ____________ should be created.
residual plot
If our data in a scatterplot is "straight enough" we can model our data with a _____.
linear model
Is there any evidence that an animal's gestation period is related to the animal's lifespan? The scatterplot shows Gestation Period (in days) vs. Life Expectancy (in years) for 18 species of mammals. The highlighted point at the far right represents humans. a) For these data, r=0.541. This is not a very strong relationship. Do you think the association would be stronger if humans were removed? Explain. b) Is there reasonable justification for removing humans from the data set? Explain. c) Here are the scatterplot and regression analysis for the 17 nonhuman species. Comment on the strength of the association. d) Interpret the slope of the line. e) A certain mammal has a life expectancy of about 24 years. Estimate the expected gestation period of this species.
A. Stronger. Both slope and correlation would increase. B. Yes, restricting the study to nonhuman animals would justify it. C. The association is moderately strong D. On average, for every year increase in life expectancy, the gestation period increases by about 12.97 days. E. 395.4 days
A concert production company examined its records. The manager made the following scatterplot. The company places concerts in two venues, a smaller, more intimate theater (plotted with blue circles) and a larger auditorium-style venue (red x's). a) Describe the relationship between talent cost and total revenue. (Remember: direction, form, strength, outliers.) b) How are the results for the two venues similar? c) How are they different?
A. The scatterplot shows a strong positive linear relationship between talent cost and total revenue. There is 1 outlier that stands apart from the majority of the data. B. Both venues show an increase of revenue with talent cost. C. The larger venue has greater variability. Revenue for that venue is more difficult to predict.
You recently began an internship at your local chapter of savethepigeons.com. Concerned about a city ballot initiative dealing with the environment, you conduct a telephone survey of local residents. What are some possible sources of bias in your results?
Response bias, non-response bias, and undercoverage of the population
You are trying to study the amount of financial aid students at your University receive. You sample 50 students and find out the average size of their financial aid packages. The average of your sample is a __________.
Sample statistic
An analysis of spending by a sample of credit card bank cardholders shows that spending by cardholders in January (Jan) is related to their spending in December (Dec). The assumptions and conditions of the linear regression seemed to be satisfied and an analyst was about to predict January spending using the model Jan=$612.07+$0.403•Dec. Another analyst worried that different types of cardholders might behave differently. She examined the spending patterns of the cardholders and placed them into five market segments. Then she plotted the data using different colors and symbols for the five different segments. Look at this plot carefully and discuss why she might be worried about the predictions from the model.
The different segments are not scattered at random throughout the residual plot. Each segment may have a different relationship.
For many people, breakfast cereal is an important source of fiber in their diets. Cereals also contain potassium, a mineral shown to be associated with maintaining a healthy blood pressure. An analysis of the amount of fiber (in grams) and the potassium content (in milligrams) in servings of 77 breakfast cereals produced the regression model Potassium=38+27Fiber. From this model you can estimate a cereal's potassium content from the amount of fiber it contains. In this context, what does it mean to say that a cereal has a negative residual?
The potassium content is actually lower than the model predicts for a cereal with that much fiber.
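A residual is the observed value minus the predicted value, so a negative residual means the observation falls below the line. The sketch below uses a hypothetical cereal (5 g of fiber, 150 mg of potassium) that is not from the original data set.

```python
def predicted_potassium_mg(fiber_g):
    """Potassium (mg) predicted by the model Potassium = 38 + 27 * Fiber."""
    return 38 + 27 * fiber_g

# Hypothetical cereal, invented for illustration only.
fiber_g = 5
observed_potassium_mg = 150

predicted_mg = predicted_potassium_mg(fiber_g)
residual_mg = observed_potassium_mg - predicted_mg  # observed minus predicted
print(predicted_mg, residual_mg)  # 173 -23
```

Here the cereal has 23 mg less potassium than the model predicts for 5 g of fiber, so its residual is negative.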
Researchers collected data on the annual mortality rate (deaths per 100,000) for males in 20 large towns and the water hardness in terms of the calcium concentration (parts per million, ppm) in the drinking water. a) The display to the right shows the relationship between mortality and calcium concentration for these towns. Describe what you see in this scatterplot, in context. b) Here is the regression analysis of mortality and calcium concentration. What is the regression equation? c) Interpret the slope of this line in context. d) Explain the meaning of the y-intercept of the line. e) The largest residual has a value of 81.2. Explain what this value means. f) The hardness of a certain town's municipal water is about 239 ppm of calcium. Use this equation to predict the mortality rate in this town. g) Explain the meaning of R-squared in this situation.
A. There is a fairly strong, negative, linear relationship between calcium concentration and mortality rate. Towns with harder water tended to have lower mortality rates. B. Mortality = 1852.377 − 5.031 × Calcium C. For each additional ppm of calcium, the model predicts a decrease of 5.031 deaths per 100,000 in the mortality rate. D. The model predicts that a town with 0 ppm calcium concentration would have a mortality rate of 1852.377 deaths per 100,000. E. The town had 81.2 more deaths per 100,000 people than the model predicts. F. The town is expected to have a mortality rate of 649.968 deaths per 100,000 people. G. 97.4% of the variability in the mortality can be accounted for by a linear model on calcium concentration.
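The prediction in answer F can be reproduced from the intercept and slope given in answer B. A minimal sketch, with a function name assumed for illustration:

```python
def predicted_mortality(calcium_ppm):
    """Deaths per 100,000 predicted by Mortality = 1852.377 - 5.031 * Calcium."""
    return 1852.377 - 5.031 * calcium_ppm

# Town with about 239 ppm of calcium, as stated in part f.
town_prediction = predicted_mortality(239)
print(f"{town_prediction:.3f} deaths per 100,000")
```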
Which of the following is a characteristic of a good experiment?
Comparative, randomized, placebo-controlled, and double-blind
A regression analysis of 117 homes for sale produced the following model, where price is in thousands of dollars and size is in square feet. Price=47.88+0.068(Size) a) Explain what the slope of the line says about housing prices and house size. b) What price would you predict for a 2500-square-foot house in this market? c) A real estate agent shows a potential buyer a 1300-square-foot house, saying that the asking price is $6500 less than what one would expect to pay for a house of this size. What is the asking price? d) What is the $6500 called?
A. For every additional square foot of area of a house, the price is predicted to increase by $68. B. $217,880 C. $129,780 D. Residual
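Answers B and C can be worked through step by step. This is a sketch using the model from the question; the variable names are assumptions, and prices are kept in thousands of dollars until the final print.

```python
def predicted_price_thousands(size_sqft):
    """Price (thousands of $) predicted by Price = 47.88 + 0.068 * Size."""
    return 47.88 + 0.068 * size_sqft

# Part b: predicted price for a 2500-square-foot house.
price_2500 = predicted_price_thousands(2500)

# Part c: the asking price is $6,500 (6.5 thousand) below the prediction,
# so the residual is -6.5 and asking price = predicted + residual.
residual_thousands = -6.5
asking_1300 = predicted_price_thousands(1300) + residual_thousands

print(f"${price_2500 * 1000:,.0f}")
print(f"${asking_1300 * 1000:,.0f}")
```

The slope of 0.068 thousand dollars per square foot is the $68 per square foot in answer A.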
The scatterplot provided shows the gestation periods and life expectancies for several animal species. The plot contains two points that may be of concern. The point in the upper right corner of this scatterplot is for elephants, and the other point at the far right is for hippos. a) By removing one of these points, we could make the association appear to be stronger. Which point? Explain. b) Would the slope of the line increase or decrease? c) Should we just keep removing animals to increase the strength of the model? Explain. d) The slope of the scatterplot's regression line is 15.5. If we remove elephants from the scatterplot, the slope of the regression line becomes 11.6 days per year. Do you think elephants were an influential point? Explain.
A. Removing hippos would make the association stronger, since hippos are more of a departure from the pattern. B. Increase, because the point for the hippos is below the regression line. C. No. Only data points that are outliers should be removed. D. Yes, removing it lowered the slope significantly.
Players in any sport who are having great seasons, turning in performances that are much better than anyone might have anticipated, often are pictured on the cover of Sports Illustrated. Frequently, their performances then falter somewhat, leading some athletes to believe in a "Sports Illustrated jinx." Similarly, it is common for phenomenal rookies to have less stellar second seasons, the so-called "sophomore slump." While fans, athletes, and analysts have proposed many theories about what leads to such declines, a statistician might offer a simpler (statistical) explanation. Explain. What would be a better explanation for the decrease in performance of the Sports Illustrated cover athlete?
People on the cover are usually there for outstanding performances. Because they are so far from the mean, the performance in the next year is likely to be closer to the mean.
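Regression to the mean can be seen in a small simulation. This is a sketch under stated assumptions: each athlete's yearly performance is modeled as a stable skill plus independent luck, both drawn from standard normal distributions, and the top 1% in year one stand in for "cover athletes". None of these numbers come from real sports data.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n_athletes = 10_000
# Assumed model: performance = stable skill + independent yearly luck.
skill = [random.gauss(0, 1) for _ in range(n_athletes)]
year1 = [s + random.gauss(0, 1) for s in skill]
year2 = [s + random.gauss(0, 1) for s in skill]

# "Cover athletes": the top 1% of year-one performances.
cutoff = sorted(year1)[int(0.99 * n_athletes)]
stars = [i for i in range(n_athletes) if year1[i] >= cutoff]

mean_year1 = sum(year1[i] for i in stars) / len(stars)
mean_year2 = sum(year2[i] for i in stars) / len(stars)
print(f"Stars, year 1 mean: {mean_year1:.2f}")
print(f"Stars, year 2 mean: {mean_year2:.2f}")
```

In this model the stars remain above average in year two but, as a group, fall back toward the mean, with no jinx involved.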
A group of students in your intro stats class design an experiment to test whether popcorn stored in the freezer pops better (fewer kernels left un-popped) than room temperature popcorn. In addition, they also want to test different power levels on their microwaves. The factor(s) in this experiment is(are) ___________________.
temperature and power