STA 301 KCs
To what extent does vehicle mileage predict price on the used pickup truck market? The file trucks.csv contains data on 46 pickup trucks listed for sale online, including the following variables: year: year of vehicle; price: listed price for vehicle in dollars; mileage: number of miles on vehicle odometer at time of listing; make: vehicle brand. Based on the data in this sample, which of the following statements is accurate? (1) A 2002 Ford truck with mileage of 87,226 has a listed price of $15,495 --- a poor offer for a potential buyer relative to what we'd predict for a pickup with that number of miles. (2) When vehicle mileage increases we expect, on average, that price will also increase. (3) If a used pickup somehow had zero mileage on the odometer, we'd expect its listed price to be $14,419.
# load the tidyverse for the %>% pipe
library(tidyverse)
# fit linear model and view coefficients (intercept and slope estimates)
lm(price ~ mileage, data = trucks) %>% coef()
#    (Intercept)        mileage
# 14419.37617711    -0.06429944
# what price does our model predict for a truck with 87226 miles?
14419 + (-0.064 * 87226)
The greenbuildings.csv data set contains data on thousands of commercial real-estate properties nationwide. The Rent variable is the rent charged to tenants in that building, in dollars per square foot per year. What is the interquartile range (IQR) of rents in this data set?
$14.68
Imagine that you sell vintage posters and memorabilia from classic video games. You collect the following data on customers of your Etsy webpage: Spend: amount spent US: a dummy variable for whether the visitor had a US-based IP address (US = 1) vs. US = 0 for rest of the world Nin: a dummy variable for whether the visitor's search request was for a Nintendo-related product (Nin = 1), versus a product for some other video-game platform (Nin = 0) You bust out RStudio and apply your STA 301-acquired skills to fit this regression model for Spend in terms of US, Nin, and the interaction of these two variables: Spend = 15.5 - (2.7*US) + (4.6*Nin) - (1.4*US*Nin) + error What is the predicted spend for a US-based customer searching for a Nintendo product?
$16.00
Imagine that you sell vintage posters and memorabilia from classic video games. You collect the following data on customers of your Etsy webpage: Spend: amount spent US: a dummy variable for whether the visitor had a US-based IP address (US = 1) vs. US = 0 for rest of the world Nin: a dummy variable for whether the visitor's search request was for a Nintendo-related product (Nin = 1), versus a product for some other video-game platform (Nin = 0) You bust out RStudio and apply your STA 301-acquired skills to fit this regression model for Spend in terms of US, Nin, and the interaction of these two variables: Spend = 15.5 - (2.7*US) + (4.6*Nin) - (1.4*US*Nin) + error What is the predicted spend for a non-US-based customer searching for a Nintendo product?
$20.10. With US = 0 and Nin = 1, the predicted spend is 15.5 + 4.6 = 20.1.
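The two Etsy predictions above can be checked by plugging the dummy values into the fitted equation. The following is a minimal R sketch; predict_spend() is a hypothetical helper name used only for illustration and is not part of the course materials.

```r
# Fitted model from the question:
#   Spend = 15.5 - 2.7*US + 4.6*Nin - 1.4*US*Nin
# predict_spend() is a hypothetical helper, named here for illustration only.
predict_spend <- function(US, Nin) {
  15.5 - 2.7 * US + 4.6 * Nin - 1.4 * US * Nin
}

us_nintendo     <- predict_spend(US = 1, Nin = 1)  # US-based, Nintendo search
non_us_nintendo <- predict_spend(US = 0, Nin = 1)  # non-US, Nintendo search

print(c(us_nintendo = us_nintendo, non_us_nintendo = non_us_nintendo))
```

The interaction coefficient only enters when both dummies equal 1, which is why it drops out of the non-US prediction.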
What factors predict how much money a film will earn? A data scientist at a top movie studio fits a multiple regression model to predict domestic Gross earnings ($ millions) from: the film Budget ($ millions), the film Run Time (minutes) Rotten Tomatoes Rating score [0, 100] Gross = -27 + (1.1 * Budget) - (0.43 * RunTime) + (2.6 * Rating) + error The predicted Gross earnings for a film with a $75 million Budget, 120-minute Run Time, and a Rotten Tomatoes Rating of 77 is closest to which of the following?
$204 million
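As a quick check of the film prediction above, the fitted equation can be evaluated directly. This is a sketch only; predict_gross() and its argument names are assumptions introduced for illustration.

```r
# Fitted model from the question (all money in $ millions):
#   Gross = -27 + 1.1*Budget - 0.43*RunTime + 2.6*Rating
# predict_gross() is a hypothetical helper name.
predict_gross <- function(budget, run_time, rating) {
  -27 + 1.1 * budget - 0.43 * run_time + 2.6 * rating
}

# $75 million budget, 120-minute run time, rating of 77
gross <- predict_gross(budget = 75, run_time = 120, rating = 77)
print(gross)
```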
Hoping to compete with the likes of Amazon Web Services and Microsoft Azure in the competitive cloud computing market, software-as-a-service provider Nimbus 2K has bid on a major contract. There is a 15% chance the firm wins the contract and earns a profit of $1,000,000. There is a 10% chance the firm wins the contract but with higher expenses, earning a profit of $750,000. Otherwise, the firm loses the contract and earns no profit. The expected value of the contract profit for Nimbus 2K is:
$225,000 = (0.15 * 1000000) + (0.10 * 750000) + (0.75 * 0)
The greenbuildings.csv data set contains data on thousands of commercial real-estate properties nationwide. The Rent variable is the rent charged to tenants in that building, in dollars per square foot per year. What is the median rent in this data set?
$25.20
The luxurious Ludo Bagman Casino offers a new dice game in which players roll a fair 6-sided die: Rolling a "1" wins $20; Rolling a "2" wins $10; any other outcome results in no winnings. Let W be a random variable that denotes the amount of winnings from playing one round of this game. The expected value of W is which of the following?
$5.00
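Both expected-value questions above (the Nimbus 2K contract and the dice game) follow the same recipe: multiply each outcome by its probability and add. A minimal R sketch, with variable names chosen here purely for illustration:

```r
# Dice game: a fair 6-sided die, "1" wins $20, "2" wins $10, anything else $0
dice_winnings <- c(20, 10, 0, 0, 0, 0)
dice_probs    <- rep(1 / 6, 6)
expected_w    <- sum(dice_winnings * dice_probs)

# Nimbus 2K contract: three mutually exclusive outcomes
contract_profit <- c(1000000, 750000, 0)
contract_probs  <- c(0.15, 0.10, 0.75)
expected_profit <- sum(contract_profit * contract_probs)

print(c(expected_w = expected_w, expected_profit = expected_profit))
```

In both cases the probabilities sum to 1, which is worth confirming before trusting the weighted sum.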
This is one of multiple questions about the same scenario. NCAA coaches may be compensated with salary as well as with an annual bonus. The amount of that bonus may be contingent on team performance as well as meeting off-field thresholds with respect to players' grades and conduct. The distribution of salary and bonus is something that coaches negotiate with the Athletic Director at a NCAA school. A college football coach-who-will-not-be-named is evaluating the job market and wants to predict what sort of bonus compensation he might expect to earn at various NCAA schools. Using NCAA Salaries data, he fits a model with the following variables: MaxBonus: maximum annual bonus (in millions of dollars); Salary: annual salary (in millions of dollars); SEC: whether the school is in the Southeastern Conference (SEC = 1) or not (SEC = 0). The model equation is: MaxBonus = 0.45 + (0.15 * Salary) + (0.84 * SEC) - (0.11 * Salary * SEC) The predicted maximum annual bonus for a coaching position at a non-SEC school with a salary of $2 million is closest to which of the following?
$750,000
The file hmda.csv contains Home Mortgage Disclosure Act data used in research on discriminatory practices in mortgage lending in Boston. The file contains information on 2380 mortgage applicants including the following variables: deny: dummy variable for whether or not the mortgage application was denied (1 = yes denied, 0 = no approved); black: is the individual Black? poor_credit: does the individual have a poor public credit record? selfemp: is the individual self-employed? hschool: does the individual have a high-school diploma? lvrat: loan to value ratio; pirat: payments to income ratio. Does having a high school diploma make one less likely to have poor credit? Use the variables hschool and poor_credit to calculate a large-sample 95% confidence interval for the difference in proportions of individuals with poor credit across the two groups. This interval is closest to which of the following?
(-0.14, 0.08)
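The large-sample interval asked for above comes from the Central Limit Theorem: the difference in sample proportions plus or minus a normal critical value times its standard error. The sketch below spells out that formula in base R. The counts are made-up illustration values and are NOT taken from hmda.csv, and diff_prop_ci() is a hypothetical helper name.

```r
# Large-sample (CLT) confidence interval for a difference in proportions p1 - p2.
# diff_prop_ci() is a hypothetical helper written for illustration.
diff_prop_ci <- function(x1, n1, x2, n2, level = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # standard error of p1 - p2
  z  <- qnorm(1 - (1 - level) / 2)                      # about 1.96 for 95%
  c(lower = (p1 - p2) - z * se, upper = (p1 - p2) + z * se)
}

# Made-up counts: 30 of 200 in group 1, 45 of 250 in group 2
ci <- diff_prop_ci(x1 = 30, n1 = 200, x2 = 45, n2 = 250)
print(ci)
```

The built-in prop.test() produces a similar interval, but it applies a continuity correction by default, so its endpoints can differ slightly from this hand calculation.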
This is one of several questions about the data in georgia2000.csv, which contains Georgia's county-level voting data from the 2000 presidential election. The 2000 election was among the most controversial in history; it turned on an esoteric set of issues surrounding voting machines and vote counts, and wasn't finally resolved until the Supreme Court's ruling in Bush v. Gore. This file contains information for all 159 counties in Georgia. For our purposes, the relevant variables are: votes: number of votes recorded; ballots: number of ballots cast; optical: whether the county used optical scan voting machines (Yes or No); poor: coded 1 if more than 25% of the residents in a county live below 1.5 times the federal poverty line, coded 0 otherwise; perAA: percent of people in the county who are African-American; urban: coded 1 if the county is predominantly urban, coded 0 otherwise; gore: number of votes for Gore; bush: number of votes for Bush. Use the mutate() function as per the following code to define a new variable that takes on the value TRUE for a county where Bush won versus FALSE for a county where Gore won: georgia2000 = georgia2000 %>% mutate(bush_win = (bush > gore)) Did Bush win more frequently in poorer counties or in richer counties? Use the variable poor and the new bush_win variable to calculate a 95% confidence interval for the difference in proportions of poor counties where Bush won relative to the proportion of rich counties where Bush won. This confidence interval is closest to which of the following?
(-0.39, -0.14)
Revisit the data in creatinine.csv from an earlier knowledge check. Recall that each row represents a medical patient. The variables are: age: patient's age in years; creatclear: patient's creatinine clearance rate in mL/minute, a measure of kidney health (higher is better). Fit a linear model that predicts a patient's creatinine clearance rate in terms of their age. Construct a 95% bootstrap confidence interval for the model slope (using 10000 iterations). This interval is closest to which of the following?
(-0.70, -0.55)
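The bootstrap interval for a slope described above works by resampling rows with replacement, refitting the model each time, and taking percentiles of the refitted slopes. The sketch below does this in base R on simulated data so that it runs without creatinine.csv. The simulated intercept, slope, and noise level are made-up values, and the loop uses 2000 iterations rather than 10000 only to keep it quick; the course's own bootstrap helpers may look different.

```r
set.seed(301)

# Simulated stand-in data (NOT the real creatinine.csv values)
n   <- 150
age <- runif(n, min = 18, max = 90)
creatclear <- 148 - 0.62 * age + rnorm(n, sd = 7)
sim <- data.frame(age = age, creatclear = creatclear)

# Slope estimated from the original (simulated) sample
slope_hat <- coef(lm(creatclear ~ age, data = sim))[["age"]]

# Bootstrap: resample rows with replacement, refit, keep the slope
boot_slopes <- replicate(2000, {
  rows <- sample(nrow(sim), replace = TRUE)
  coef(lm(creatclear ~ age, data = sim[rows, ]))[["age"]]
})

# Percentile-based 95% bootstrap confidence interval
ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
print(ci)
```

With the real data, the same steps apply after replacing sim with the data frame read from the file.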
The file CPS85.csv contains data from the 1985 Current Population Survey (CPS), used to supplement information from the U.S. Census in between official census years. These data consist of a random sample of U.S. residents with information on their wages, sex, number of years of education, years of work experience, occupational status, region of residence, and union membership status. Consider the wage and sex variables. Form a 95% confidence interval, based on the Central Limit Theorem, for the difference in average wage between females and males (that is, for the female average minus the male average). This interval is closest to which of the following?
(-2.97, -1.27)
The file vaccine.csv contains data from a phase 3 randomized vaccine trial. Persons at high risk for SARS-CoV-2 infection were randomly assigned in a 1:1 ratio to receive two injections of mRNA-1273 (the vaccine) or a placebo 28 days apart. Variables in the data frame include: group: was the participant randomly assigned to the placebo group or the vaccine group? covid: did the participant develop symptoms of illness (covid) or did they remain healthy? Use this data to answer the following questions. Form a 95% large-sample confidence interval (i.e., based on the Central Limit Theorem) for the difference in the proportion of participants who developed symptoms of covid between the vaccine and placebo group. This interval is closest to which of the following?
(0.0096, 0.0133)
Using the EPA cars model results, fill in the blank below with the value closest to your answer: Among vehicles in the "Car" Category (baseline), CO2 emissions have changed at an average rate of _____ grams per mile each year since 1984, holding constant engine displacement.
-2.83
Using the EPA cars model results, fill in the blank below with the value closest to your answer: Among vehicles in the "TwoSeater" Category, CO2 emissions have changed at an average rate of _____ grams per mile each year since 1984, holding constant engine displacement.
-3.239
Match each description below to the most appropriate probability model: Binomial, Bivariate Normal, Normal, or Poisson.
-A formal model of the shape and spread of the sampling distribution of most summary statistics: Normal -Approximately 95% of these random variables fall within plus or minus two standard deviations of their mean: Normal -Observing a fixed number of random events with binary outcomes: Binomial -The expected value of this type of random variable is the number of times an event could occur multiplied by the probability that the event will occur each time: Binomial -This probability model describes the joint distribution of two random variables: Bivariate Normal -This distribution is used to model the number of times that some event occurs in a specified interval of time or space: Poisson -This model is governed by a single rate parameter: Poisson -This probability model incorporates correlation as a parameter: Bivariate Normal
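Two of the matching statements above can be verified numerically: a Binomial expected value equals the number of trials times the success probability, and a Poisson model's mean equals its single rate parameter. The sketch below uses base R's dbinom() and dpois(); the particular values of n, p, and lambda are arbitrary illustration choices.

```r
# Binomial: E[X] should equal n * p
n <- 20
p <- 0.3
k <- 0:n
binom_mean <- sum(k * dbinom(k, size = n, prob = p))

# Poisson: E[X] should equal the rate lambda
# (the infinite sum is truncated at 200, far beyond where any mass remains)
lambda <- 4
j <- 0:200
pois_mean <- sum(j * dpois(j, lambda = lambda))

print(c(binom_mean = binom_mean, n_times_p = n * p,
        pois_mean = pois_mean, lambda = lambda))
```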
Which of the following statements about R scripts is/are correct? Select all correct answers.
-A script is a file that collects multiple statements (i.e. lines of R code) in a single document. -Scripts make it simple to save your work and pick up where you left off, without having to remember what you've accomplished already. -One way to run R statements from a script is to highlight those statements and then hit Control-Enter on the keyboard. -Scripts make it easy to modify a complex analysis by adding or changing steps in the middle of a long chain of statements.
This is one of multiple questions about the same analysis of the data in utsat.csv, as per instructions provided previously. Make a regression table for your multiple regression model and focus on the School-level dummy variable coefficients. Based on these results, which of the following statements is/are accurate? Choose all correct statements.
-A typical student in the Engineering school is expected to have a GPA that is about 0.19 points lower than a typical student in the Architecture school who has equivalent SAT Math and SAT Verbal scores. -The school of Architecture is represented by the baseline/intercept. -Architecture students and Fine Arts students (with the same SAT Math and SAT Verbal scores) have nearly identical average graduating GPAs.
Which of the following are correct statements about the analysis of variance (ANOVA)? Select TRUE or FALSE as the best answer for each.
-An ANOVA table attempts to attribute credit to individual predictor variables included in the model: TRUE -An ANOVA table tracks the improvements in R-squared associated with each variable: TRUE -If predictors are correlated, changing the sequence in which these variables are added to the model changes the information in the ANOVA table about the relative importance of individual variables: TRUE -There can only be one correct ANOVA table for any given model in the context of an observational study involving correlated predictors: FALSE -ANOVA is most useful when working with experimental data --- when predictors are often not correlated: TRUE -The ANOVA table is not the fitted model itself but rather an attempt to partition credit for the model's predictive power across the different predictors: TRUE -The ANOVA table shows each model estimate along with a p-value associated with the estimate: FALSE -ANOVA is most useful when working with observational data --- when predictors are often correlated: FALSE
Which of the following is/are accurate statements about the analysis of variance (ANOVA)? Select all correct answers.
-An ANOVA table attempts to attribute credit to the individual variables included in the model. -An ANOVA table tracks the change in R-squared as variables are added to the model one at a time. -The construction of an ANOVA table is inherently sequential. -An ANOVA table can help you decide whether a given data set calls for an interaction between variables.
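The sequential nature of an ANOVA table, noted in the answers above, is easy to see by fitting the same model with the predictors entered in two different orders. The sketch below uses R's built-in mtcars data (not a course data set) because its wt and hp variables are correlated; the variable names on the left-hand side are illustrative.

```r
# Same model, two orderings of correlated predictors
fit_wt_first <- lm(mpg ~ wt + hp, data = mtcars)
fit_hp_first <- lm(mpg ~ hp + wt, data = mtcars)

tab_wt_first <- anova(fit_wt_first)
tab_hp_first <- anova(fit_hp_first)

# Sum of squares credited to wt depends on whether it enters first or second
ss_wt_entered_first  <- tab_wt_first["wt", "Sum Sq"]
ss_wt_entered_second <- tab_hp_first["wt", "Sum Sq"]

# The fitted model itself (and its R-squared) does not depend on the ordering
r2_wt_first <- summary(fit_wt_first)$r.squared
r2_hp_first <- summary(fit_hp_first)$r.squared

print(c(ss_wt_entered_first = ss_wt_entered_first,
        ss_wt_entered_second = ss_wt_entered_second))
```

The credit assigned to wt changes with the ordering even though both fits describe the same model, which is why no single ANOVA table is "the" correct one when predictors are correlated.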
Match each of the following applications of bootstrapping that we have studied with its analogous large-sample inference procedure.
-Bootstrapping a sample mean and making a confidence interval: One-sample t-test -Bootstrapping a proportion and making a confidence interval: One-sample proportions test -Bootstrapping the difference in means between two groups and making a confidence interval for the difference: Two-sample t-test -Bootstrapping the difference in proportions between two groups and making a confidence interval for the difference: Two-sample test for equality of proportions
The file greenbuildings.csv contains data on 7280 commercial real-estate properties. Each row refers to a single building. The two variables of interest here are the building's age (in years) and class (A/B/C, indicating the overall quality of the building). Use ggplot to create a faceted histogram of building ages, faceted by class. Use this histogram to determine which of the following statements is/are accurate. Choose all correct statements. Note: you might find this easier if you make a density histogram.
-Buildings in class A tend to be newer, relative to buildings in the other two classes. -Fewer than half the buildings in Class C are newer than 50 years old. -More than half the buildings in Class A are newer than 50 years old. -Buildings in Class C are older, on average, than buildings in Class A or B.
Which of the following is true of the t distribution? Select all correct answers.
-Calculating a confidence interval based on the t-distribution is effectively computing a confidence interval based on the Central Limit Theorem and de Moivre's equation, with a minor correction that widens the interval to account for small sample sizes. -The t distribution is unimodal and symmetric, similar to the normal distribution. -When we use an estimate of sigma (σ) rather than the true σ (as assumed with de Moivre's equation), it introduces in our calculation an extra source of uncertainty for which the t-distribution attempts to compensate. -Use of the t-distribution does not assume that we know the population standard deviation sigma (σ). -The RStudio function t.test() is based on this distribution.
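The "minor correction that widens the interval" mentioned above shows up directly in the critical values: for a 95% interval, the t multiplier is larger than the normal multiplier, and the gap shrinks as the degrees of freedom grow. A short base R sketch, with the degrees of freedom chosen arbitrarily for illustration:

```r
# 95% two-sided critical value under the normal distribution
z_crit <- qnorm(0.975)

# Corresponding t critical values for increasing degrees of freedom
dfs    <- c(5, 10, 30, 100, 1000)
t_crit <- qt(0.975, df = dfs)

print(data.frame(df = dfs, t_crit = round(t_crit, 3), z_crit = round(z_crit, 3)))
```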
In which of the following contexts would the normal distribution be an appropriate probability model?
-Characterizing uncertainty about a phenomenon that may be conceptualized as the sum or average of many independent events of comparable magnitude: normal model is appropriate -Characterizing a phenomenon for which large deviations from the mean -- of three standard deviations or more -- are frequent events: normal model is NOT appropriate -The distribution visualized with a histogram looks approximately symmetric, bell-shaped, and with thin tails: The normal model is appropriate -As an approximation for a large-N binomial model: The normal model is appropriate -Describing a situation where we count the number of discrete events over a fixed time interval, under the assumption that successive events are independent and occur at a constant rate: The normal model is NOT appropriate
Match each of the following legacy large-sample inference procedures with the corresponding 'shortcut' function(s) that we use for that procedure in RStudio:
-Confidence interval for a mean: t.test() -Confidence interval for a proportion: prop.test() -Confidence interval for a difference of means: t.test() -Confidence interval for a difference of proportions: prop.test() -Confidence interval for a regression slope: lm() + confint()
Which of the following statements is/are true of "colliders" in the context of regression modeling? Select all correct answers.
-Confounders tend to be causally "upstream" of X and Y, while colliders are causally "downstream" of X and Y. -If our goal is to isolate a partial relationship between X and Y, confounders should be included in the model, while colliders should be excluded.
Which of the following statements about correlation is/are accurate? Select TRUE or FALSE below for each statement.
-Correlation takes on only values ranging between -1 and 1: TRUE -Interpreting correlation depends on the numerical units with which the X and Y variables are measured: FALSE -Correlation of X and Y = Correlation of Y and X: TRUE -The sign of a correlation coefficient gives the direction (positive, negative) of the association between x and y: TRUE -Correlation is one of the parameters of the Binomial distribution: FALSE -Correlation is one of the parameters of the bivariate normal distribution: TRUE -Correlation measures the extent to which random variables X1 and X2 tend to more often be on the same side of their respective means, or whether they tend to be on the opposite side of their means: TRUE -Correlation will change if there are any changes in the center or scale of either X or Y variable: FALSE -Correlation is sensitive to outliers: TRUE
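Several of the correlation statements above (symmetry, invariance to changes of center and scale, and sensitivity to outliers) can be checked with base R's cor(). The small vectors below are made-up illustration values, not course data.

```r
# Made-up illustration data with a clear positive association
x <- c(2, 4, 5, 7, 9, 12)
y <- c(1, 3, 2, 6, 8, 9)

r_xy <- cor(x, y)
r_yx <- cor(y, x)                       # symmetry: same value as r_xy

# Shift and rescale both variables with positive multipliers: correlation is unchanged
r_rescaled <- cor(10 * x + 32, y / 1000)

# Replace the last y value with an extreme outlier: correlation changes a lot
y_outlier <- c(y[-6], -40)
r_outlier <- cor(x, y_outlier)

print(c(r_xy = r_xy, r_yx = r_yx, r_rescaled = r_rescaled, r_outlier = r_outlier))
```

A single extreme point is enough to flip the sign here, which is the sense in which correlation is sensitive to outliers.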
Which of the following are among standard practices in Machine Learning? Select all correct answers.
-Define a performance metric to provide a numerical summary of the quality of the model's predictions. -Split your data into a training set and a testing set. -Always test the performance of your model on a data set that wasn't used to fit the model in the first place.
Which of the following estimators generally have sampling distributions that are unimodal and symmetric (i.e., normally distributed)? Select all correct answers.
-Differences of means -Regression model coefficients -Medians -Proportions -Interquartile ranges -Standard deviations
This is one of multiple questions about the same scenario --- similar to the GSS model on this quiz but not based on the same data. A data scientist fits a model to investigate the extent to which the effect of work experience on salary is the same for males and females (using years of Experience and Sex as predictor variables): Salary = 35000 + (3300 * Experience) - (2000 * Female) - (300 * Experience * Female) Which of the following statements about possible discriminatory compensation practices looks accurate, in light of the fitted model? Choose all correct statements.
-Females with 20 years of experience seem to get paid $8000 less than males of equivalent experience. -Both males and females seem to get paid more with increasing experience, but the expected increase in salary for an additional year of experience seems to be higher for males. -Females with 0 years of experience seem to get paid $2000 less than males of equivalent experience.
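The statements above follow from comparing the fitted equation for females (Female = 1) with the one for males (Female = 0) at the same experience level. A minimal R sketch; predict_salary() and salary_gap() are hypothetical helper names introduced for illustration.

```r
# Fitted model from the question:
#   Salary = 35000 + 3300*Experience - 2000*Female - 300*Experience*Female
predict_salary <- function(experience, female) {
  35000 + 3300 * experience - 2000 * female - 300 * experience * female
}

# Predicted female salary minus predicted male salary at equal experience
salary_gap <- function(experience) {
  predict_salary(experience, female = 1) - predict_salary(experience, female = 0)
}

gap_0  <- salary_gap(0)    # gap at 0 years of experience
gap_20 <- salary_gap(20)   # gap at 20 years of experience

# Per-year slopes: the interaction term is an offset to the male slope
slope_male   <- 3300
slope_female <- 3300 - 300

print(c(gap_0 = gap_0, gap_20 = gap_20,
        slope_male = slope_male, slope_female = slope_female))
```

The gap widens by $300 per year of experience because the interaction coefficient lowers the female slope relative to the male slope.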
After an auspicious data science internship, the Environmental Protection Agency offers you a full-time position to continue studying how vehicle design characteristics impact fuel economy using data on 234 vehicles in the "mpg" data frame. Load these data in RStudio by entering the following on your script: library(tidyverse); data(mpg) Fit a linear model to predict highway miles per gallon (hwy) from the following: displ: engine displacement (liters); drv: drive type (4wheel, front wheel, or rear wheel); the interaction of displ and drv. Fit this model without bootstrapping; use the results to evaluate the statements below and to indicate whether each is True or False.
-For a one-liter displacement increase in 4wheel drive vehicles, we are 95% confident of a decrease between 2.36--3.40 highway MPG: TRUE -The relationship between engine displacement and highway MPG for 4wheel drive is statistically significantly different (with 95% confidence) than the relationship between these two variables for front-wheel drive: FALSE -Based on these model estimates, we predict that a front-wheel drive vehicle with 4-liter engine displacement will get approximately 23 highway MPG: TRUE -For rear-wheel drive vehicles, we expect a one-liter increase in engine displacement to result in a decrease of 0.92 highway MPG: TRUE -Results suggest with 95% confidence that rear-wheel drive vehicles could get higher gas mileage (relative to 4wheel drive), perhaps as much as 3.34 highway MPG higher: TRUE -For a one-liter increase in engine displacement in front-wheel drive vehicles, we estimate a change in highway MPG equal to approximately -0.72: FALSE
# get_regression_table() is provided by the moderndive package
lm(hwy ~ displ + drv + displ:drv, data = mpg) %>% get_regression_table()
Match each of the following with Statistics, Machine Learning, or both.
-Fundamentally about learning from data: Both -Uses lots of regression analysis: Both -About helping people learn from data.: Statistics -About helping machines learn and improve from data: Machine Learning -Focused mainly on understanding and interpreting: Statistics -Focused mainly on performing and predicting: Machine Learning -Supports automated decision-making algorithms capable of improving from experience: Machine Learning -Supports human decision making: Statistics
Suppose a statistical analysis produces a p-value equal to 0.051 under some null hypothesis. Which of the following can we conclude? Choose all correct statements.
-If -- before the analysis was conducted -- the data scientist declared 0.01 to be the arbitrary level of significance, they would fail to reject the null hypothesis. -This p-value provides a nearly identical amount of practical evidence against a null hypothesis as would a p-value equal to 0.049. -If -- before the poll was conducted -- the data scientist declared 0.10 to be the arbitrary level of significance, they would reject the null hypothesis.
This is one of multiple questions about the same scenario. The General Social Survey (GSS) is a renowned nationally representative sociological survey of adults in the United States conducted since 1972. With the data in GSS.csv, fit a linear regression model for Income using the following predictor variables: a main effect for Gender, a main effect for Age, and the interaction between Gender and Age. Which of the following statements looks correct in light of your fitted model? Choose all correct statements.
-Income for males is about $11,600 lower than for females only at the nonsensical value of age=0, but not at any feasible ages where people actually might earn income. -Income seems to increase with age at different rates for males and females, and the entirely positive range of a confidence interval suggests that this difference in slopes is statistically significant.
This is one of several questions about the data in georgia2000.csv, which contains Georgia's county-level voting data from the 2000 presidential election. The 2000 election was among the most controversial in history; it turned on an esoteric set of issues surrounding voting machines and vote counts, and wasn't finally resolved until the Supreme Court's ruling in Bush v. Gore. This file contains information for all 159 counties in Georgia. For our purposes, the relevant variables are: votes: number of votes recorded; ballots: number of ballots cast; optical: whether the county used optical scan voting machines (Yes or No); poor: coded 1 if more than 25% of the residents in a county live below 1.5 times the federal poverty line, coded 0 otherwise; perAA: percent of people in the county who are African-American; urban: coded 1 if the county is predominantly urban, coded 0 otherwise; gore: number of votes for Gore; bush: number of votes for Bush. Use the mutate() function to define a new variable called ucountPct: ucountPct = 100 * (ballots - votes) / ballots that measures the percentage of ballots (0-100) in a county that were "undercounted" or "spoiled" (i.e., ballots cast where no legal vote was recorded because the machine could not read the vote). One of the main legal issues in the election was whether some counties had inferior equipment leading to higher undercount rates than others. Use 10000 iterations to calculate a 95% bootstrap confidence interval for the mean difference in undercount percentages between counties where optical voting equipment was used versus not used. Use this interval to evaluate the statements below. Select all correct answers.
-It seems that the undercount rate was higher in counties where optical equipment was used, but this difference is not statistically significant at the 5% level. -The 95% confidence interval for the difference in undercount percentages between optical and non-optical counties is approximately -0.5% to 1.1%.
Indicate below whether each of the following statements about R libraries is true or false:
-Libraries need to be re-downloaded and re-installed each time you want to use them: False -Libraries need to be loaded each time you want to use them: True -A library is a piece of software that provides additional functionality to R, beyond what's contained in the basic R installation: True -Libraries, also called "packages," are installed from within RStudio itself: True -Once you load a library/package, you need never do so again when starting new RStudio sessions: False -If RStudio was a smart phone, library packages are like apps: True
Which of the following statements about interactions and/or correlated predictors are correct? Choose all correct statements.
-One simple way to diagnose an interaction between a numerical predictor and a categorical predictor is to make a plot -- to visualize a separate trend line in each category and to compare their slopes. -The interaction term does not directly encode the slope. Rather, it is an offset measuring the difference in slopes at different levels of a categorical variable. -There could be an interaction effect between two variables even if those variables are statistically independent of one another. -Interaction terms are used in regression models to describe situations where the relationship between some X predictor variable and response variable Y depends on some other variable in the context.
The file CPS85.csv contains data from the 1985 Current Population Survey (CPS), used to supplement information from the U.S. Census in between official census years. These data consist of a random sample of U.S. residents with information on their wages, sex, number of years of education, years of work experience, occupational status, region of residence, and union membership status. Consider the wage and educ variables, which describe each respondent's hourly wage and their number of years of formal schooling. Fit a linear regression model that predicts wages in terms of years of education, and form a 95% "large-sample" confidence interval (i.e. based on the Central Limit Theorem) for the coefficients of this regression model. Which of the following statements about your analysis is/are accurate? Select all correct answers.
-Our best guess, in light of the sample, is that one additional year of schooling is associated with an extra $0.75 in hourly wages. -We are 95% confident that an extra year of formal schooling is associated with an increase in wages of somewhere between about $0.60 and $0.91.
Which of the following is true of randomization in experimental design? Select all correct answers.
-Randomization works, essentially, by flipping a coin independently for each subject in a study: heads, you get the treatment; tails, you get the control. -Randomization ensures balance, on average, even for possible confounding factors of which the experimenter is not aware. -The fundamental source of uncertainty in an experiment arises from the random assignment of experimental units to treatment or control.
Match each term below to its correct definition.
-Some aspect of the world about which we'd like to learn using data: Estimand -The set of all possible cases that might have been included in a data set: Population -This word describes a study design in which individual units are tracked through time: Longitudinal -Any systematic discrepancy between a sample and the corresponding population: Sampling bias -Using statistical computing to repeatedly simulate the random process that generated a sample: Monte Carlo simulation
Netflix collects data every time a subscriber uses its platform, including the variables listed below. Which of these variables are categorical?
-The U.S. state in which the subscriber resides -The genre of the show/movie -The day of the week
The data set in students.csv consists of 8239 rows, each of them representing a particular student from a university in Berlin. Characteristics corresponding to that student are represented by variables in each column, including: stud.id: a unique identifier for each student; height: height in centimeters; weight: weight in kilograms; gender: Male or Female; age: years; religion: Catholic, Muslim, Orthodox, Other, or Protestant; nc.score: standardized test score [1, 4]; major / minor: Biology, Economics and Finance, Environmental Sciences, Mathematics and Statistics, Political Science, or Social Sciences; online.tutorial: did the student take any online classes? (1 = Yes, 0 = No); graduated: did the student graduate? (1 = Yes, 0 = No); salary: self-reported starting salary ($). Imagine that you are going to make plots to visualize some of these variables. First, consider which plot type will best serve the story that you hope to tell with the graphic. Match each 'data visualization objective' below with the MOST appropriate plot type.
-The association between student height and student weight: Scatter plot -The distribution of salary: Histogram -Comparison of the distribution of standardized test scores (nc.score) for males versus females: Side-by-side boxplots -Of the students who graduated versus didn't graduate, what proportion in each group received the online tutorial? And how does this association differ for male students versus female students? Faceted bar plot -How does the distribution of student age differ across different levels of the Religion variable: Side-by-side boxplots -How many students have declared each category of minor? bar plot
Based on the estimated coefficients of your EPA cars model, which of the following statements is/are accurate? Select all correct statements.
-The average CO2 emissions of wagons, small cars, and trucks have gone down over time, holding engine displacement constant, but not as fast as average CO2 emissions for vehicles in the "Car" category. -When comparing vehicles of similar engine displacements, SUV/Vans seem to have the steepest decrease in CO2 emissions over time.
Which of the following statements are true of the bivariate normal distribution? Select all correct answers.
-The bivariate normal distribution is capable of describing both positive and negative associations. -In a bivariate normal model, the strength of association between X1 and X2 is described by the correlation parameter.
This is one of multiple questions about the same model. In 1990, the United Nations created a single measure that ranges between zero and one -- the Human Development Index (HDI) -- to summarize health, education, and economic status for world countries. The following is a fitted model to predict HDI from life expectancy and expected years of schooling: HDI=−0.34+0.01⋅LifeExpectancy+0.02⋅SchoolYears+Error Which of the following are correct interpretations of the β (beta) coefficient for LifeExpectancy? Select all correct answers.
-The change in HDI associated with a one-year increase in life expectancy, holding other predictors constant. -A partial slope.
Which of the following is true of the normal distribution? Select all correct answers.
-The normal distribution is also known informally as a "bell curve" and more formally as a Gaussian distribution. -The probability is about 95% that a normal random variable X will take on a value within two standard deviations of its expected value (mean). -The normal distribution is a formal mathematical description of the shape and spread of sampling distributions for many estimators. -The normal distribution is characterized with two parameters: the mean and the standard deviation. -Normal distributions are unimodal and symmetric.
Which of the following is true of the normal distribution model? Select all correct answers.
-The normal distribution originated as an approximation to the binomial distribution. -The area under a normal density curve represents probability. -The normal distribution has "thin tails" because large outliers are unlikely to occur. -It is a family of bell-shaped density curves, each with a different mean and standard deviation. -Phenomena that don't look obviously normal can be sometimes described using the normal distribution as a building block.
Match the random variable described with its correct type: Discrete or Continuous?
-The number of people who visit the Perry-Castañeda Library tomorrow: Discrete -Length of time required for a phone battery to lose its charge: Continuous -The sum of the numbers from rolling two dice: Discrete -The amount of rainfall in Austin in Nov: Continuous -The count of Haribo gummy bears in a bag: Discrete -The number of mortgages approved in Travis County last week: Discrete -The distance that a car travels with one tank of gas: Continuous -The weight of the contents of a bag of Haribo gummy bears: Continuous -The number of cards drawn from a deck until a Queen is selected: Discrete
Match each description of the Patriots' coin toss example below to its corresponding major element of our process for detecting anomalies in data.
-The pre-game coin tosses were truly random, and the Patriots' coin-toss wins can therefore be explained by blind luck: null hypothesis -The number of Patriots' coin-toss wins: test statistic -This probability distribution provided context for the observed data: P(test statistic | null hypothesis) -We calculated this number to assess the extent to which our test statistic looked plausible under the null hypothesis: p-value
In a linear regression model, we describe data by an equation: yi=β0+β1∗xi+ei Which of the following is accurate for this equation? Select all correct answers.
-The predictor variable is represented by xi -Model error is represented by ei
Use R to simulate 100000 runs of weighted coin flips where N=150 and the probability of heads on each flip is 0.75. Remember the functions do() and nflip(). Which of these following statements is/are accurate, based on your simulation? Select all correct answers.
-The probability of seeing 100 or fewer heads is roughly 1%. -The probability of seeing 112 or fewer heads is roughly 50% -The probability of seeing 120 or more heads is roughly 9%
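The three probabilities above can be checked with a short simulation. This is a minimal sketch in base R that uses rbinom() as a stand-in for the course's do() and nflip() workflow (that substitution and the seed are assumptions, not the method the question asks for). The exact Binomial values are computed alongside for comparison.

```r
# Sketch: simulate 100000 runs of 150 weighted coin flips with P(heads) = 0.75.
# rbinom() replaces mosaic's do() * nflip() here; the seed is arbitrary.
set.seed(301)
heads <- rbinom(100000, size = 150, prob = 0.75)

# Simulated probabilities for the three answer choices
p_100_or_fewer <- mean(heads <= 100)
p_112_or_fewer <- mean(heads <= 112)
p_120_or_more  <- mean(heads >= 120)

# Exact Binomial probabilities for comparison
exact <- c(pbinom(100, 150, 0.75),
           pbinom(112, 150, 0.75),
           1 - pbinom(119, 150, 0.75))

print(round(c(p_100_or_fewer, p_112_or_fewer, p_120_or_more), 3))
print(round(exact, 3))
```

Note that "120 or more" is the complement of "119 or fewer", which is why the exact calculation uses pbinom(119, ...).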
Consider the following plot that shows the sepal width vs. sepal length for 50 individual flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. Which of the following statements about this plot is/are accurate? Select all correct answers.
-The species of each flower has been mapped to the color of each point. -The sepal width of each flower has been mapped to the vertical (y) location of each point.
This is one of multiple questions about the same scenario. NCAA coaches may be compensated with salary as well as with an annual bonus. The amount of that bonus may be contingent on team performance as well as meeting off-field thresholds with respect to players' grades and conduct. The distribution of salary and bonus is something that coaches negotiate with the Athletic Director at an NCAA school. A college football coach-who-will-not-be-named is evaluating the job market and wants to predict what sort of bonus compensation he might expect to earn at various NCAA schools. Using NCAA Salaries data, he fits a model with the following variables: MaxBonus: maximum annual bonus (in millions of dollars) Salary: annual salary (in millions of dollars) SEC: whether the school is in the Southeastern Conference (SEC = 1) or not (SEC = 0) The model equation is: MaxBonus=0.45+0.15∗Salary+0.84∗SEC−0.11∗(Salary∗SEC) Which of the following statements is accurate, in light of the fitted model? Select all accurate statements below.
-This model incorporates an interaction term. -As salaries increase at SEC schools, bonuses also tend to increase. However, that increase is slower and more gradual than at non-SEC schools. -At non-SEC schools, bonuses increase at a rate of $150,000 for a $1 million increase in salary.
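The slope comparison behind these answers can be worked out directly from the coefficients. The sketch below is illustrative only: predict_bonus() is a hypothetical helper, not something from the course materials, and the coefficients are copied from the model equation in the question.

```r
# Fitted coefficients copied from the model equation in the question
b0 <- 0.45
b_salary <- 0.15
b_sec <- 0.84
b_interaction <- -0.11

# Hypothetical helper: predicted MaxBonus (in $ millions)
predict_bonus <- function(salary, sec) {
  b0 + b_salary * salary + b_sec * sec + b_interaction * salary * sec
}

# Slope of MaxBonus with respect to Salary in each group
slope_non_sec <- b_salary                  # 0.15, i.e. $150,000 per $1 million of salary
slope_sec     <- b_salary + b_interaction  # 0.04, i.e. $40,000 per $1 million of salary

print(c(non_sec = slope_non_sec, sec = slope_sec))
```

The SEC slope is still positive, so bonuses rise with salary in both groups, but more slowly at SEC schools.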
Best practices in good experimental design include which of the following? Select all correct answers.
-Use a control group. -Use blocking when you can.
As discussed in class, which of the following is/are among ways that we diagnose interactions and decide whether or not to include an interaction term in our model? Select all accurate answers.
-Using an ANOVA table to assess the extent to which adding an interaction term improves the predictive power of a model. -Going ahead and fitting a model with an interaction, then examining a confidence interval for the interaction term. -Applying subject-matter knowledge about the data science context. -Creating a plot to visually diagnose the presence (or lack thereof) of an interaction between variables.
Statistical inference comprises a set of methods to quantify uncertainty. In which of the following data science situations is statistical inference likely to be useful? Use the dropdown choices below to select the best answer for each situation.
-We want to use data to make a prediction about the future, and we expect the future to be similar to the past: Statistical inference is likely to be useful -We have data from a census: Statistical inference is not likely to be useful -We want to generalize to a population based on a representative sample from a well-designed study: Statistical inference is likely to be useful -Our observations are subject to systematic biases: Statistical inference is not likely to be useful -Sampling variability is the greatest source of error in our data collection process: Statistical inference is likely to be useful
This is one of multiple questions about the same model. In 1990, the United Nations created a single measure that ranges between zero and one -- the Human Development Index (HDI) -- to summarize health, education, and economic status for world countries. The following is a fitted model to predict HDI from life expectancy and expected years of schooling: HDI=−0.34+0.01⋅LifeExpectancy+0.02⋅SchoolYears+Error Which of the following are correct interpretations of the β (beta) slope coefficient for SchoolYears? Select all correct answers.
-When comparing countries with similar life expectancies but a difference of one year in expected years of schooling, we'd expect their HDIs to differ by 0.02. -We expect HDI to increase by 0.02 for every one-year increase in expected schooling, after adjusting for the simultaneous effect of life expectancy on HDI.
In a phase 3 randomized vaccine trial, participants at high risk for SARS-CoV-2 infection were randomly assigned in a 1:1 ratio to receive two injections of mRNA-1273 or placebo 28 days apart. Variables in the data frame include: group: was the participant randomly assigned to the placebo group or the vaccine group? covid: did the participant develop symptoms of illness (covid) or did they remain healthy? The researchers would like to use bootstrapping with at least 100,000 iterations to calculate a 95% confidence interval for the difference in the proportions of participants who developed covid across the two groups (placebo and vaccine). Which of the following R functions will be required as part of the code to run this proposed analysis? Select all correct answers.
-diffprop() -do() -resample() -confint()
Which of the following are common steps in feature engineering? Select all correct answers.
-encoding categorical variables as dummy variables -extracting useful features from raw data such as date and time stamps -nonlinear transformations of existing numerical variables -combining or summarizing existing variables
The bivariate normal distribution is based on two variables (X1 and X2) and is described in terms of five parameters. Which of the following are among these five parameters? Select all correct answers.
-the standard deviation of X1 -the mean of X2 -the correlation of X1 and X2
The file vaccine.csv contains data from a phase 3 randomized vaccine trial. Persons at high risk for SARS-CoV-2 infection were randomly assigned in a 1:1 ratio to receive two injections of mRNA-1273 (the vaccine) or a placebo 28 days apart. Variables in the data frame include: group: was the participant randomly assigned to the placebo group or the vaccine group? covid: did the participant develop symptoms of illness (covid) or did they remain healthy? Use these data to answer the following questions. What proportion of study participants in the vaccine group developed symptoms of covid? Enter your answer as a decimal proportion between 0 and 1, rounded to four decimal places (e.g. 0.0432 for 4.32%, 0.5000 for 50%).
0.0007
The file vaccine.csv contains data from a phase 3 randomized vaccine trial. Persons at high risk for SARS-CoV-2 infection were randomly assigned in a 1:1 ratio to receive two injections of mRNA-1273 (the vaccine) or a placebo 28 days apart. Variables in the data frame include: group: was the participant randomly assigned to the placebo group or the vaccine group? covid: did the participant develop symptoms of illness (covid) or did they remain healthy? Use these data to answer the following questions. What proportion of study participants in the placebo group developed symptoms of covid? Enter your answer as a decimal proportion between 0 and 1, rounded to four decimal places (e.g. 0.0432 for 4.32%, 0.5000 for 50%).
0.0122
The long-run rate of defective iPhones coming off the assembly line is 0.6% when all manufacturing processes are working correctly. Because testing each phone for defects would be cost prohibitive, a random sample of 500 phones are inspected every 2 hours to determine if the manufacturing processes are working correctly. In the last assembly line sample of 500 iPhones, 7 iPhones proved to be defective. Use 100000 --- one hundred thousand --- simulations to calculate a p-value under the null hypothesis that the assembly line is still operating at its long-run rate of defective iPhones, and that the observation of 7 defects out of 500 phones is due to chance. The p-value is closest to which of the following?
0.03
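A minimal sketch of the p-value calculation, assuming base R's rbinom() in place of the course's simulation helpers (the seed is arbitrary). The test statistic is the number of defects in a sample of 500, and the p-value is the probability of 7 or more defects when the true defect rate is 0.6%.

```r
# Simulate 100000 samples of 500 phones under the null hypothesis (defect rate 0.006)
set.seed(301)
defects <- rbinom(100000, size = 500, prob = 0.006)

# p-value: share of simulated samples with at least 7 defects
p_sim <- mean(defects >= 7)

# Exact Binomial tail probability for comparison: P(X >= 7) = 1 - P(X <= 6)
p_exact <- 1 - pbinom(6, size = 500, prob = 0.006)

print(round(c(simulated = p_sim, exact = p_exact), 3))
```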
Based on data from 2019, 22.8% of Americans claim no religious affiliation. In a random sample of size n=250 of those living in America, what is the probability of sampling 45 or fewer people with no religious affiliation? Round your answer to 3 decimal places.
0.039
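Assuming a Binomial model with n = 250 and success probability 0.228 (the natural reading of the question, though the source does not show its working), the answer is a single cumulative probability:

```r
# P(X <= 45) for X ~ Binomial(n = 250, p = 0.228)
p_45_or_fewer <- pbinom(45, size = 250, prob = 0.228)
print(round(p_45_or_fewer, 3))
```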
Revisit the data in creatinine.csv from an earlier knowledge check. Recall that each row represents a medical patient. The variables are: age: patient's age in years. creatclear: patient's creatinine clearance rate in mL/minute, a measure of kidney health (higher is better). Fit a linear model to predict creatinine clearance rates in terms of the patient's age. Use bootstrapping with 10000 simulations to generate the sampling distribution for the model slope. The bootstrap standard error of the slope is closest to which of the following?
0.04
Suppose a packaging system fills boxes such that the weights are normally distributed with a mean of 16.3 ounces and a standard deviation of 0.21 ounces. What is the probability that a box weighs at most 16 ounces? Report your answer to 2 decimal places.
0.08
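For the box-weight questions, each probability is an area under the normal curve with mean 16.3 and standard deviation 0.21. A minimal sketch for the "at most 16 ounces" case, using base R's pnorm():

```r
# P(weight <= 16) for a Normal(mean = 16.3, sd = 0.21) distribution
p_at_most_16 <- pnorm(16, mean = 16.3, sd = 0.21)
print(round(p_at_most_16, 2))  # 0.08
```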
The UT Quadball team scores an average of 7.7 goals per match. Assuming that goal-scoring can be reasonably described by a Poisson distribution, which of the following is the probability that the team scores exactly 10 goals during their next match?
0.09
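"Exactly 10 goals" is a single point probability, so it uses the Poisson probability mass function rather than a cumulative probability. A minimal sketch:

```r
# P(X = 10) for X ~ Poisson(lambda = 7.7)
p_exactly_10 <- dpois(10, lambda = 7.7)
print(round(p_exactly_10, 2))  # 0.09
```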
FiveThirtyEight's aggregate poll model for presidential approval ratings estimated that 52.1% of Americans approve of Joe Biden. Use a Binomial model to find the probability of finding at least 70 people who approve of Biden in a random sample of 120 Americans. This probability is closest to which of the following?
0.10
1 - pbinom(69, 120, 0.521)
sum(dbinom(70:120, 120, 0.521))
Census data from a particular county show that 19% of adult residents in the county are Hispanic. Suppose 72 residents receive a summons for jury duty, and 9 of them are Hispanic. An attorney protests that there are too few Hispanics in the panel of potential jurors; he feels certain that 9 out of 72 is an anomalously small fraction, given the wider community demographics. The p-value for this alleged under-representation of Hispanic residents is closest to which of the following? Use a simulation with at least 10000 iterations. (And remember: the attorney is asserting that the observed fraction is anomalously small, not anomalously large. Try drawing a picture of the sampling distribution.)
0.100
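Because the claim is that 9 is anomalously small, the p-value is a lower-tail probability: the chance of 9 or fewer Hispanic residents among 72 summonses if each has a 19% chance. The sketch below uses base R's rbinom() in place of the course's simulation helpers (an assumption; the seed is arbitrary).

```r
# Simulate 10000 jury panels of 72 under the null hypothesis (p = 0.19)
set.seed(301)
hispanic_counts <- rbinom(10000, size = 72, prob = 0.19)

# Lower-tail p-value: share of simulated panels with 9 or fewer
p_sim <- mean(hispanic_counts <= 9)

# Exact Binomial probability for comparison
p_exact <- pbinom(9, size = 72, prob = 0.19)

print(round(c(simulated = p_sim, exact = p_exact), 3))
```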
This is one of multiple questions about the same analysis of the data in utsat.csv, as per instructions above. Based on the coefficients of your fitted model, the change in graduating GPA that we expect to see from a 100-point change in SAT Math scores (holding School and SAT Verbal score constant) is closest to which of the following?
0.12
Suppose a packaging system fills boxes such that the weights are normally distributed with a mean of 16.3 ounces and a standard deviation of 0.21 ounces. What is the probability that a box weighs between 16.4 and 16.5 ounces? Report your answer to 2 decimal places.
0.15
Suppose a packaging system fills boxes such that the weights are normally distributed with a mean of 16.3 ounces and a standard deviation of 0.21 ounces. What is the probability that a box weighs more than 16.5 ounces? Report your answer to 2 decimal places.
0.17
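The "between" and "more than" versions of the box-weight question come from the same cumulative distribution function. A minimal sketch: an interval probability is a difference of two pnorm() values, and an upper-tail probability is one minus pnorm().

```r
# Box weights: Normal(mean = 16.3, sd = 0.21)
mu <- 16.3
sigma <- 0.21

# P(16.4 < weight < 16.5): difference of two cumulative probabilities
p_between <- pnorm(16.5, mu, sigma) - pnorm(16.4, mu, sigma)

# P(weight > 16.5): upper tail
p_above <- 1 - pnorm(16.5, mu, sigma)

print(round(c(between = p_between, above = p_above), 2))  # 0.15 and 0.17
```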
This is one of multiple questions about the same analysis of the data in utsat.csv, as per instructions provided previously. What proportion of overall variation in student GPA does your multiple regression model predict? Answer in decimal form (e.g. Fifty-two percent = 0.52) and round your answer to two decimal places.
0.18
The UT Quadball team scores an average of 7.7 goals per match. Assuming that goal-scoring can be reasonably described by a Poisson distribution, which of the following is the probability that the team scores at least 10 goals in their next match?
0.25
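"At least 10" for a count variable means 10, 11, 12, and so on, which is the complement of "9 or fewer". A minimal sketch of that complement rule with ppois():

```r
# P(X >= 10) = 1 - P(X <= 9) for X ~ Poisson(lambda = 7.7)
p_at_least_10 <- 1 - ppois(9, lambda = 7.7)

# Equivalent call using the upper tail directly
p_at_least_10_alt <- ppois(9, lambda = 7.7, lower.tail = FALSE)

print(round(p_at_least_10, 2))  # 0.25
```

A common slip is to write 1 - ppois(10, 7.7), which gives the probability of 11 or more instead.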
This is one of multiple questions about the same analysis of the data in utsat.csv, as per instructions provided previously. Calculate standardized coefficients for your multiple regression model. Fill in the blank with a number rounded to two decimal places: A one standard-deviation change in SAT Verbal scores is associated with a _______ standard-deviation change in graduating GPA, holding School and SAT Math score constant.
0.26
The distribution of fifty years' worth of S&P 500 monthly stock returns has a mean of 0.89% and standard deviation of 4.42%. You plot these data in a histogram and observe that the distribution of individual monthly returns is approximately normal. The probability that a randomly selected individual monthly return will be greater than 3% is closest to which of the following?
0.32
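Treating monthly returns as Normal(0.89, 4.42) in percentage points, as the question states, the answer is an upper-tail area. A minimal sketch:

```r
# P(return > 3%) for returns ~ Normal(mean = 0.89, sd = 4.42), in percent
p_above_3 <- 1 - pnorm(3, mean = 0.89, sd = 4.42)
print(round(p_above_3, 2))  # 0.32
```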
Use the training data (wine_train.csv) to calculate root mean squared error (RMSE) for the larger model. Round your answer to three decimal places.
0.691
Use the training data (wine_train.csv) to calculate root mean squared error (RMSE) for the smaller model. Round your answer to three decimal places.
0.73
Using the EPA cars model results, fill in the blank below with the value closest to your answer: Holding constant vehicle Category and years since 1984, a one-standard deviation change in engine displacement (displ) is associated with a _____-standard deviation change in CO2 emissions.
0.74
Use the testing data (wine_test.csv) to calculate root mean squared error (RMSE) for the smaller model. Round your answer to three decimal places.
0.745
A private wealth manager notes that 45% of their clients have bonds in their portfolio, while 15% have options in their portfolio. Of those clients who invest in bonds, 27% also have invested in options. Which of the following is closest to the probability that a client invests in bonds given that they have invested in options?
0.81
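The question gives P(options | bonds) but asks for P(bonds | options), so the conditional probability has to be reversed with Bayes' rule. A worked sketch of the arithmetic (variable names are illustrative):

```r
# Given quantities from the question
p_bonds <- 0.45                # P(bonds)
p_options <- 0.15              # P(options)
p_options_given_bonds <- 0.27  # P(options | bonds)

# Joint probability: P(bonds and options) = P(options | bonds) * P(bonds)
p_both <- p_options_given_bonds * p_bonds

# Bayes' rule: P(bonds | options) = P(bonds and options) / P(options)
p_bonds_given_options <- p_both / p_options
print(p_bonds_given_options)  # 0.81
```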
The UT Quadball team scores an average of 7.7 goals per match. Assuming that goal-scoring can be reasonably described by a Poisson distribution, which of the following is the probability that the team scores at most 10 goals in their next match?
0.84
Disney+ is a new on-demand, ad-free streaming service with content from Disney, Pixar, Marvel, Star Wars, National Geographic, and 20th Century Fox. The Disney data science team observes that 40% of video streamers who accept an offer for a free two-week trial period will convert to buy a subscription. Assuming streamers' decisions are independent, what is the probability that the next four trials will result in at least one subscription?
0.87
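"At least one subscription" is easiest to compute as one minus the probability of no subscriptions in four independent trials. A minimal sketch showing both the direct arithmetic and the equivalent Binomial call:

```r
# Each trial converts with probability 0.40, independently
p_convert <- 0.40

# P(at least one of four converts) = 1 - P(none convert)
p_at_least_one <- 1 - (1 - p_convert)^4

# Same result from the Binomial model: 1 - P(X = 0), X ~ Binomial(4, 0.40)
p_at_least_one_binom <- 1 - dbinom(0, size = 4, prob = p_convert)

print(round(p_at_least_one, 2))  # 0.87
```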
As a junior data scientist for Uber, your first project focuses on surge pricing in Austin. Part of this work entails accurately modeling the number of ride requests within a given geographic location, for which a Poisson distribution is appropriate. From historical data you find that app users request, on average, λ = 18.7 rides per minute between midnight and 2:00 AM in the half mile radius of the intersection of Sixth Street and Congress Avenue. Uber wants to avoid a situation in which the supply of available drivers is inconsistent with demand for rides. The probability that there are no more than 25 requests per minute in the downtown Austin area during this late night interval is closest to which of the following?
0.936
ppois(25, 18.7)
Use the testing data (wine_test.csv) to calculate root mean squared error (RMSE) for the larger model. Round your answer to three decimal places.
0.946
De Moivre's equation is a function of which of the following? (1) The variability of a single data point in the distribution. (2) The number of data points that you are averaging to calculate a sample mean. (3) The confidence level associated with our interval.
1 and 2
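De Moivre's equation gives the standard error of a sample mean as the standard deviation of a single data point divided by the square root of the sample size. The sketch below uses a hypothetical helper, de_moivre_se(), to show that only those two inputs appear; the confidence level is not part of the formula.

```r
# Hypothetical helper: standard error of a sample mean via de Moivre's equation
# sigma = standard deviation of a single data point, n = number of points averaged
de_moivre_se <- function(sigma, n) {
  sigma / sqrt(n)
}

se_n25  <- de_moivre_se(sigma = 10, n = 25)   # 2
se_n100 <- de_moivre_se(sigma = 10, n = 100)  # 1

print(c(n25 = se_n25, n100 = se_n100))
```

Quadrupling the sample size halves the standard error, which is the square-root relationship the equation describes.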
Does having a high school diploma make one less likely to have poor credit? Use the variables hschool and poor_credit to calculate a large-sample 95% confidence interval for the difference in proportions of individuals with poor credit across the two groups. Which of the following is true of this analysis? (1) In this sample, the proportion of poor-credit applicants with high school diplomas is approximately 2.9% lower than the proportion of poor-credit applicants who have not finished high school. (2) Our interval suggests that applicants with a high school diploma may have a lower probability of poor credit, a higher probability of poor credit, or no difference in the probability of poor credit (relative to those who do not finish high school). (3) The difference in proportions between these two groups is statistically significant at the 0.05 alpha level.
1 and 2
What factors predict how much money a film will earn? A data scientist at a top movie studio fits a multiple regression model to predict domestic Gross earnings ($ millions) from: the film Budget ($ millions), the film Run Time (minutes) average Rotten Tomatoes Rating score [0, 100] Gross = -27 + (1.1 * Budget) - (0.43 * RunTime) + (2.6 * Rating) + error Which of the following is/are an accurate interpretation of the beta slope coefficient for Rating? (1) An estimated slope for a partial relationship. (2) The change that we expect to see in Gross earnings for a one-point increase in a film's Rotten Tomatoes Rating if we compare films with different Ratings but with the same Budget and RunTime. (3) We expect Gross earnings to increase by $2.6 million for every one-point increase in a film's Rotten Tomatoes Rating, regardless of the film's Budget and RunTime.
1 and 2
Which of the following are among guidelines for data scientists (Lesson 15.6) on what variables to INCLUDE in fitting multiple regression models? (1) It is essential to incorporate variables that directly affect both the outcome (Y) and the particular X predictor of interest. (2) It is beneficial, but not strictly essential, to include variables that affect Y even if they are not correlated with a particular X predictor of interest. (3) Always include an interaction term for two X predictors if those predictors are both main effects in the model.
1 and 2
Which of the following is a discrete random variable? (1) The number of customers waiting in line at Franklin BBQ when it opens tomorrow morning. (2) The count of typos on a page. (3) The time required for a plane to fly from Houston Hobby to Dallas Love Field.
1 and 2
Which of the following is true of the expected value of a random variable? (1) It is an average of possible outcomes of the random variable weighted by their associated probabilities. (2) It might make sense to pursue a business opportunity involving a negative expected value if, for example, it reduced some risk. (3) The Law of Large Numbers indicates that for any random phenomenon such as flipping a fair coin, any "unlucky streak" will soon be balanced out by a lucky streak.
1 and 2
Which of the following statements about bootstrapping is/are accurate? (1) Each bootstrapped sample must be of the same size as the original sample. (2) Each bootstrapped sample may contain duplicates and omissions from the original sample. (3) Each bootstrapped sample must be sampled without replacement from a different population.
1 and 2
Which of the following statements about correlation is/are correct? (1) The sign of a correlation coefficient for two variables X and Y gives the direction (positive or negative) of the slope coefficient representing the association between X and Y. (2) Correlation has no units. (3) Correlation ranges from 0 to 1.
1 and 2
How does the distribution of team offensive rankings (offense variable) differ with respect to whether or not a team made the playoffs? Identify all accurate statements below. (1) The 25th percentile of offense rankings for teams that made the playoffs looks roughly equal to (i.e., within 0.1 of) the 75th percentile for teams that did not make the playoffs. (2) For teams that appeared in the playoffs, the offense ranking distribution is mildly skewed to the left. (3) During 2000-2019, there were two teams that did not make the playoffs with outstanding offense rankings (exceeding the third quartile of their distribution by more than 1.5 times the interquartile range).
1 and 3
Reasons to include an interaction term in our model include which of the following? (1) To estimate context-specific effects of some predictor variable on the outcome (y). (2) If the joint effect of two variables on the outcome can be correctly modeled as the sum of the main effects associated with each variable. (3) Looking at an ANOVA table suggests that an interaction term noticeably improves the predictive power of the model.
1 and 3
The Normal distribution would be an appropriate probability model in which of the following contexts? (1) Characterizing uncertainty about a phenomenon that may be conceptualized as the sum or average of many independent events of comparable magnitude. (2) Characterizing uncertainty about a phenomenon for which extremely large outliers are a regular occurrence. (3) The distribution visualized with a histogram looks approximately symmetric, bell-shaped, and with thin tails.
1 and 3
Which of the following are among key elements of bootstrapping? (1) Each bootstrap sample is the same size as the original sample. (2) The original sample must be a census from the target population of interest. (3) Each bootstrap sample is sampled with replacement from the original sample.
1 and 3
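The two key elements named above (same size as the original sample, sampled with replacement) map directly onto the arguments of base R's sample(). This is a minimal sketch with a made-up data vector; the course's own workflow uses mosaic's do() and resample(), so treat the base R version as an assumed equivalent.

```r
# Hypothetical original sample (values invented for illustration)
original <- c(12, 15, 9, 20, 17, 14, 11, 18)
n <- length(original)

set.seed(301)  # arbitrary seed for reproducibility

# One bootstrap sample: same size as the original, drawn WITH replacement,
# so it may contain duplicates and omit some original values
one_boot <- sample(original, size = n, replace = TRUE)

# Repeat many times to approximate the sampling distribution of the mean
boot_means <- replicate(10000, mean(sample(original, size = n, replace = TRUE)))

# Bootstrap standard error of the sample mean
boot_se <- sd(boot_means)
print(round(boot_se, 2))
```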
Which of the following is true of overfitting? (1) It is the primary reason that we split our data into training and testing sets when fitting machine learning models. (2) It is more likely to happen when the size of the data set is large or the model that we are fitting has few parameters. (3) It occurs when a model memorizes the random noise in a particular data set rather than learning an underlying pattern that would generalize to unseen data.
1 and 3
Which of the following is true of the Binomial distribution? (1) The model describes the result of a fixed number of random events each having a binary outcome. (2) The outcome of each random event depends on the outcomes of preceding random events in a Binomial situation. (3) Binomial random variables are characterized by two parameters: the number of trials and the success probability on each trial.
1 and 3
Which of the following is true of the normal distribution? (1) The normal distribution is a formal mathematical description of the shape and spread of sampling distributions for sample means. (2) The normal distribution is characterized with two parameters: the median and the IQR. (3) Approximately 95% of the observations in a normal distribution will take on a value within plus or minus 2 standard deviations of the mean.
1 and 3
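The "about 95% within 2 standard deviations" rule in statement (3) can be confirmed from the standard normal cumulative distribution function. A minimal sketch:

```r
# Probability that a normal random variable falls within 2 SDs of its mean
within_2sd <- pnorm(2) - pnorm(-2)
print(round(within_2sd, 4))  # 0.9545

# The result does not depend on the particular mean and standard deviation
mu <- 70
sigma <- 10
within_2sd_general <- pnorm(mu + 2 * sigma, mu, sigma) - pnorm(mu - 2 * sigma, mu, sigma)
```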
Which of the following is/are among key elements of bootstrapping? (1) Each bootstrap sample is the same size as the original sample. (2) Each bootstrap sample is an exact copy of the original sample. (3) Each bootstrap sample is sampled with replacement from the original sample.
1 and 3
Which of the following statements is true of dummy variables? (1) In general, a grouping variable with K categories produces K-1 dummy variables. (2) In a fitted model, the coefficient on a dummy variable represents the average value of the outcome (y) whenever the dummy variable is 1. (3) In a fitted model, the coefficient on a dummy variable represents the difference in the average outcome (y) between two conditions: when the dummy variable is 1, versus when the dummy variable is 0.
1 and 3
Which of the following statements is true of the Central Limit Theorem? (1) The mean of a sufficiently large sample has an approximately normal sampling distribution. (2) To apply the Central Limit Theorem, a variable must be normally distributed in the underlying population of interest. (3) The sampling distributions of a sample mean looks more normal as the size of the sample N increases.
1 and 3
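Statements (1) and (3) can be illustrated by simulation. The sketch below draws samples from a strongly skewed population (an exponential distribution, chosen here only as an example of a non-normal population) and looks at the distribution of their means; the sample size and seed are arbitrary assumptions.

```r
# Population: Exponential(rate = 1), which is skewed, with mean 1 and SD 1
set.seed(301)
n <- 50

# Sampling distribution of the sample mean across 10000 repeated samples
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

# Center and spread should be close to 1 and 1 / sqrt(n)
center <- mean(sample_means)
spread <- sd(sample_means)
print(round(c(center = center, spread = spread, predicted = 1 / sqrt(n)), 3))

# hist(sample_means) would show a roughly bell-shaped histogram
```

Rerunning with a larger n should make the histogram look more normal, even though the underlying population is not normal.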
You collect data on daily sales of pints of guacamole at a local grocery store. The following jitter plot shows sales versus two variables: whether it's a weekend or a weekday, and whether there are free samples of guacamole. The orange dots show the group means for each situation. Which of the following statements is/are accurate, in light of this picture? (1) The effect of offering free samples looks larger on a weekend than it does on a weekday. (2) The joint effect of the weekend and free sample variables on sales looks separable: that is, equal to the sum of the individual effects. (3) A regression model to predict sales in terms of these two predictors should also include an interaction between the weekend variable and the free samples variable.
1 and 3
At your data science internship with the Environmental Protection Agency, you are researching how vehicle design characteristics impact fuel economy using data on 234 vehicles in the "mpg" data frame. Load these data in RStudio by entering the following on your script: library(tidyverse); data(mpg) Investigate the distribution of city miles per gallon (cty) for each vehicle class category (2seater, compact, midsize, minivan, pickup, subcompact, SUV). Suggested methods include some combination of plots, tables, and comparing numerical summaries. Based on your analysis, which of the following statements is/are accurate with respect to how the distribution of city miles per gallon (cty) varies by vehicle class? (1) Pickups have the worst average city gas mileage (i.e., the lowest mean). (2) Compact and midsize vehicles have the same median. (3) Subcompact vehicles have the largest interquartile range.
1 and 3
library(mosaic)  # provides favstats() and the formula interface for mean() and median()
ggplot(mpg) + geom_boxplot(aes(x=class, y=cty))
favstats(cty ~ class, data = mpg)
median(cty ~ class, data = mpg) %>% round(2)
mean(cty ~ class, data = mpg) %>% round(2)
xtabs(~class, data = mpg)
The average salary of 30 top quarterbacks in the 2017 National Football League was just over $13,000,000. A linear regression to predict Salary from Total QBR (an overall measure of performance based on various game statistics) produced the following equation: Salary = 150,000 + 225,000 * Total QBR Tom Brady (then of the New England Patriots) had a Total QBR of 83 and was paid $14,000,000 in 2016. Which of the following can we conclude based on the results of this linear model? (1) In 2016, Brady was underpaid by about $4.8 million, versus what we'd predict given his QBR. (2) Top NFL quarterback salaries increase, on average, by about $150,000 for every one-point increase in the player's Total QBR. (3) A quarterback with a Total QBR = 70 has a predicted salary of $15,900,000.
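1 and 3
# Brady's actual salary minus his predicted salary (negative means underpaid)
14000000 - (150000 + (225000 * 83))
# predicted salary for a quarterback with Total QBR = 70
150000 + (225000 * 70)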
Identify all accurate statements. In the context of hypothesis testing, the test statistic: (1) is used to measure the strength of evidence in the data against the null hypothesis. (2) should be less than 0.05 in order to reject the null hypothesis. (3) directly measures the probability that the null hypothesis is false.
1 only
In April 2021, Coinbase filed to go public on the Nasdaq under the symbol "COIN". The Coinbase data scientists wanted to predict their IPO's opening price from firm-specific and market-level features. They split the data frame into an 80% training set and 20% testing set, then fit three models of different sizes. The models' predictive performance (in terms of RMSE) is summarized in this table: Model 1: 5 parameters, in-sample RMSE $20.51, out-of-sample RMSE $23.38. Model 2: 14 parameters, in-sample RMSE $17.37, out-of-sample RMSE $18.10. Model 3: 47 parameters, in-sample RMSE $14.01, out-of-sample RMSE $26.57. Which of the following is true of this set of models? (1) Here we see evidence that simpler models generally show less decline in performance with the testing data relative to performance with training data. (2) Model 1 has the best in-sample RMSE. (3) The model with the best overall predictive performance is Model 3.
1 only
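One way to read the table is to compute, for each model, the gap between out-of-sample and in-sample RMSE. The sketch below types the table values in by hand; the data frame and column names are made up for illustration.

```r
# RMSE values copied from the table in the question
rmse_table <- data.frame(
  model      = c(1, 2, 3),
  parameters = c(5, 14, 47),
  rmse_in    = c(20.51, 17.37, 14.01),
  rmse_out   = c(23.38, 18.10, 26.57)
)

# Degradation: how much worse each model does on the testing set
rmse_table$gap <- rmse_table$rmse_out - rmse_table$rmse_in
rmse_table

# Model with the lowest out-of-sample RMSE
best_model <- rmse_table$model[which.min(rmse_table$rmse_out)]
best_model
```

The 47-parameter model has the lowest in-sample RMSE but by far the largest gap, which is the pattern usually associated with overfitting.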
Which of the following statements is accurate with respect to sampling distributions? (1) A sampling distribution is the theoretical distribution of sample estimates across many repeated samples. (2) A sampling distribution summarizes the results of only one particular sample. (3) A sampling distribution is the same thing as the distribution of individual data points (i.e., cases) collected in a sample.
1 only
Which of the following statements is/are true of the Binomial distribution? (1) The Binomial model describes the result of a fixed number of random events each having a binary outcome. (2) The outcome of each random event depends on the outcomes of preceding random events in a Binomial situation. (3) Binomial random variables are characterized by two parameters: the mean and standard deviation.
1 only
Calculate the standard deviation of the number of wins for teams with an above-average strength of schedule that did make the playoffs. Round your answer to two decimal places and enter below.
1.44
nfl %>% filter(strength > mean(strength), playoffs == "yes") %>% summarize(SD = sd(wins)) %>% round(2)
The time required to complete the OWL Standardized Test is approximately normally distributed with a mean of 70 minutes and a standard deviation of 10 minutes. The percentage of test-takers who require less than one hour to complete the exam is closest to which of the following?
16%
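A quick check of this answer in R, treating completion time as Normal with mean 70 and standard deviation 10 and using 60 minutes for "one hour" (the variable name is just for illustration):

```r
# P(X < 60) for X ~ Normal(mean = 70, sd = 10)
p_under_hour <- pnorm(60, mean = 70, sd = 10)
round(p_under_hour, 3)
```

Sixty minutes sits one standard deviation below the mean, so this is the lower-tail area at z = -1.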
The file CPS85.csv contains data from the 1985 Current Population Survey (CPS), used to supplement information from the U.S. Census in between official census years. These data consist of a random sample of U.S. residents with information on their wages, sex, number of years of education, years of work experience, occupational status, region of residence, and union membership status, among other things. Consider the exper variable, a numerical variable for the number of years of work experience for each respondent. Form a 95% "large-sample" confidence interval (i.e. based on the Central Limit Theorem) for the population mean of years of U.S. residents' work experience. The upper limit of this interval is closest to which of the following?
19 years
A common practice in machine learning is splitting a data set into training and testing sets. Which of the following is true of this method? (1) Models are fit on the testing data set only, while model predictive performance is evaluated on the training data set. (2) Models that are overfit tend to see large degradation in performance when comparing the training to the testing sets. (3) Predictive performance on the testing set is regarded as more important than performance on the training set.
2 and 3
Fundamental pillars of good experimental designs include which of the following? (1) Bootstrapping (2) The use of blocking designs when possible (3) Randomization of experimental units to treatment and control groups
2 and 3
Imagine that you sell vintage posters and memorabilia from classic video games. You collect the following data on customers of your Etsy webpage: Spend: amount spent US: a dummy variable for whether the visitor had a US-based IP address (US = 1) vs. US = 0 for rest of the world Nin: a dummy variable for whether the visitor's search request was for a Nintendo-related product (Nin = 1), versus a product for some other video-game platform (Nin = 0) You bust out RStudio and apply your STA 301-acquired skills to fit this regression model: Spend = 15.5 - (2.7*US) + (4.6*Nin) - (1.4*US*Nin) + error Which of the following statements about this model is/are accurate? (1) This model assumes that the effect of the US and Nintendo variables are separable (i.e., their joint effect is equal to the sum of the individual effects). (2) This model incorporates an interaction. (3) Non US-based customers searching for non-Nintendo products are expected to spend $15.50, on average.
2 and 3
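To see how the interaction term plays out, the fitted equation can be evaluated for all four combinations of the two dummy variables. This is a small sketch; `predict_spend` is a hypothetical helper name, not something from the course materials.

```r
# Fitted equation from the question, written as a function of the two dummies
predict_spend <- function(US, Nin) {
  15.5 - 2.7 * US + 4.6 * Nin - 1.4 * US * Nin
}

# All four combinations of US and Nin
grid <- expand.grid(US = c(0, 1), Nin = c(0, 1))
grid$Spend <- predict_spend(grid$US, grid$Nin)
grid
```

The row with US = 0 and Nin = 0 is the baseline of $15.50. The row with US = 1 and Nin = 1 differs from the sum of the two individual effects by the interaction coefficient, -1.4.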
The Amazon e-commerce data science team fits a model to gain understanding of the extent to which having an Amazon Prime membership leads customers to buy more from the platform than customers without a Prime membership. The data set includes two variables: total Sales revenue for a customer account and a dummy variable representing whether the customer had a Prime membership (0 = no, 1 = yes). The fitted model equation is: Sales = 764 + 1323 * Prime + e The data-science team calculates R-squared for this model to be 0.24: R^2 = 0.24 This R-squared estimate indicates which of the following? (1) If using this model to predict Sales, we expect a typical model error of plus or minus about a quarter ($0.24). (2) Approximately 1/4 (24%) of the variability in Sales is predictable in terms of variation between the two groups (Prime versus no-Prime). (3) Approximately 3/4 (76%) of the variation in customer Sales is explained by factors other than whether or not they have a Prime membership.
2 and 3
Which of the following is true of fitting a multiple regression model? (1) We interpret each β (beta) coefficient as representing an overall relationship between y and the corresponding x. (2) We interpret each β (beta) coefficient as representing a partial relationship between y and the corresponding x, holding the other predictors constant. (3) If two predictor X variables are correlated, the difference between their overall relationships with a response variable Y and their partial relationships with a response variable Y may be very important in interpreting modeling results effectively.
2 and 3
A 95% confidence interval for the mean based on the Central Limit Theorem and de Moivre's equation generally takes the form: $\bar{x} \pm 1.96 \cdot \sigma / \sqrt{n}$ If we instead wanted to calculate an 80% confidence interval for the mean, which elements of this formula would be different? (1) The center of the interval, $\bar{x}$. (2) The multiplier of 1.96. (3) The term $\sigma / \sqrt{n}$.
2 only
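The multiplier is the only piece that depends on the confidence level. A short sketch of where it comes from, using the standard normal quantile function (variable names are illustrative):

```r
# Critical values for two-sided normal-based intervals
conf_levels <- c(0.80, 0.95)
z_star <- qnorm(1 - (1 - conf_levels) / 2)
round(z_star, 2)
```

The sample mean and the standard error are computed from the data in the same way at either confidence level.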
Follow above instructions for the epa_cars.csv data set. The EV level of the Powertrain variable corresponds to battery EVs --- that is, electric vehicles like Teslas and Nissan Leafs. (In case you're curious, for EVs, the government reports the cityMPG variable as the "equivalent" MPG based on the vehicle's battery usage.) Which of the following statements about the PowertrainEV coefficient is accurate? (1) EVs have an average cityMPG of 82.6 MPG (equivalent). (2) EVs have an average cityMPG that is 82.6 MPG (equivalent) higher than that of vehicles with a Diesel Powertrain. (3) EVs have an average cityMPG of 102.8 MPG (equivalent).
2 only
The benefits of randomization in an experiment include which of the following? (1) It completely prevents confounders from directly affecting the outcome variable. (2) On average, it balances the confounders between the treatment group and the control group. (3) It helps ensure that the results of a study will generalize accurately to the wider population.
2 only
At your data science internship with the Environmental Protection Agency, you are researching how vehicle design characteristics impact fuel economy using data on 234 vehicles in the "mpg" data frame. Load these data in RStudio by entering the following on your script: library(tidyverse); data(mpg) Fit a linear model to predict city miles per gallon (cty) from the following variables: displ: engine displacement; cyl: the number of cylinders; year: year of manufacture. Use bootstrapping with 10000 iterations to generate confidence intervals for your model estimates and model fit statistics. The lower limit of the 95% confidence interval for the residual standard error ("sigma") is closest to which of the following?
2.0
This is one of multiple questions about the same scenario. The Amazon e-commerce data science team fits a model to gain understanding of the extent to which having an Amazon Prime membership leads customers to buy more from the platform than customers without a Prime membership. The data set includes two variables: total Sales revenue for a customer account and a dummy variable representing whether the customer had a Prime membership (0 = no, 1 = yes). The fitted model equation is: Sales=764+1323∗Prime+e Predicted sales revenue for a customer who is a Prime member is closest to which of the following?
2087
The data set in students.csv consists of 8239 rows, each of them representing a particular university student, with a set of characteristics corresponding to that student. These variables include: height: student height in centimeters gender: Male or Female Student heights follow a Normal distribution. And similarly, the distribution of heights for Male students is also Normally distributed. The probability that a randomly-selected male student will be taller than 185 centimeters is closest to which of the following?
23%
males = students %>% filter(gender == 'Male')
1 - pnorm(185, mean(males$height), sd(males$height))
students %>% filter(gender == 'Male') %>% summarize(mean = mean(height), SD = sd(height))
1 - pnorm(185, 179, 7.99)
pnorm(185, 179, 7.99, lower.tail = FALSE)  # same upper-tail probability, without the "1 -"
Assume that 12% of passengers at United States airports are randomly selected for additional screening, and that the probability of selection is independent across passengers. Two traveling companions are passing through airport security. Which of the following is closest to the probability that at least one of them will be chosen for additional screening?
23%
# P(at least one) = 1 - P(none) = 1 - (P(notS) * P(notS))
1 - (0.88 * 0.88)
1 - pbinom(0, size = 2, prob = 0.12)
dbinom(1, 2, 0.12) + dbinom(2, 2, 0.12)
# P(at least one) = P(S1 or S2) = P(S1) + P(S2) - P(S1, S2)
0.12 + 0.12 - (0.12 * 0.12)
A large sample of real estate ads from central Austin indicates that 68% of homes for sale have garages, 21% have swimming pools, and 16% have both a garage and a pool. What is the probability that a home for sale in this area has neither a pool nor a garage?
27%
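The arithmetic behind this answer uses the addition rule followed by the complement rule. A minimal sketch with the percentages from the question (variable names are illustrative):

```r
p_garage <- 0.68
p_pool   <- 0.21
p_both   <- 0.16

# Addition rule: P(garage or pool) = P(garage) + P(pool) - P(both)
p_either <- p_garage + p_pool - p_both

# Complement rule: P(neither) = 1 - P(garage or pool)
p_neither <- 1 - p_either
p_neither
```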
American Express introduced a new credit card promotion giving awards to customers who make at least 20 purchases in a month. The data science team wants to know if the proportion of customers making 20+ purchases/month has changed from the 13% level observed before the promotion. They collect a random sample of 1,000 accounts and learn that 10 accounts had at least 20 purchases in the month following the new promotion. In this context, the null hypothesis could be summarized as which of the following? (1) 10 out of 1,000 of customers are making at least 20 purchases each month. (2) The new credit card promotion will result in more than 13% of customers with at least 20 purchases each month. (3) We'd expect 130 of every 1,000 accounts to make at least 20 purchases in a month.
3 only
Let's fit a group-wise model to describe differences in post-graduation starting salaries associated with bachelor's degrees from four UT Austin colleges: Radio/TV/Film, Nursing, Business Administration, and Chemical Engineering. Our fitted equation looks like this: Salary = 33,100 + (31,500 * Nursing) + (42,500 * Business) + (56,800 * ChemEng) Which of the following interpretations of this model is accurate? (1) Radio/TV/Film degree holders are expected to earn a salary of $163,900, on average, during their first post-graduation year. (2) Nursing degree holders are expected to earn a salary of $31,500, on average, during their first post-graduation year. (3) Business Administration degree holders are expected to earn $42,500 more than Radio/TV/Film students during their first post-graduation year.
3 only
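In baseline/offset form, each group's expected salary is the baseline plus that group's offset. A small sketch using the coefficients from the fitted equation (the vector names are illustrative):

```r
# Baseline group is Radio/TV/Film; the other coefficients are offsets from it
baseline <- 33100
offsets  <- c(RTF = 0, Nursing = 31500, Business = 42500, ChemEng = 56800)

# Expected starting salary for each group
group_means <- baseline + offsets
group_means
```

The offsets are differences relative to Radio/TV/Film, not salaries in their own right.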
This question revisits the data on video-game reaction time that we analyzed in class and in Lesson 14. Specifically, we'll consider three variables from this data set (two that we looked at before, and one that we didn't): PictureTarget.RT: the subject's reaction time in milliseconds when presented with a visual stimulus FarAway: a dummy variable for whether the stimulus (i.e. the "bad guy" popping up on the screen) was near (0) or far away (1) within the visual scene. SceneLetter: a letter code (b, c, d, e, or f) indicating which of five scenes the subject was presented with. (Note: the letters start at b; there is no scene a in this data set) Based on this model output, which of the following statements is/are accurate? (1) This model includes an interaction term for the FarAway and SceneLetter variables. (2) Scene c is associated with the lowest reaction time, on average. (3) Scene e is associated with the highest reaction time, on average.
3 only
Which of the following is an accurate statement about overfitting? (1) If we see a model's predictive performance deteriorate sharply on the testing set relative to the training set, the problem is likely that we don't have enough features or interactions in the model. (2) Overfitting is largely a theoretical concern, and not something that can typically happen in practice. (3) Overfitting occurs when a model essentially "memorizes" the pattern of random noise in the training data.
3 only
Which of the following is true of p-values? Identify all correct answers. (1) A p-value represents a conditional probability that the null hypothesis is correct, given the data that we have observed. (2) A p-value represents the probability of observing our test statistic or something more extreme, assuming that the null hypothesis is correct. (3) Smaller p-values are indicative of more evidence against the null hypothesis.
2 and 3
Follow above instructions for the epa_cars.csv data set. What does the model predict for the expected cityMPG for a vehicle of the Category SUV/Van with a Hybrid Powertrain? Round your answer to one decimal place.
30.2
The file nycflights13.csv contains data on all domestic flights departing the three major New York City area airports (LGA, JFK, and EWR) in 2013, including the following variables: dest: flight destination airport carrier: abbreviation representing commercial airline sched_dep_time: hour of scheduled departure on a 24-hour clock dep_delay: departure delay in minutes distance: distance from origin to destination in miles origin: flight origin airport (EWR, LGA, or JFK) Wrangle the data to identify which airline (carrier) flew most often from the NYC area to San Antonio (dest == 'SAT'). How many flights did that carrier make from the NYC area to San Antonio in 2013? Enter your answer below.
330
The distribution of OWL Standardized Test scores is normal with a mean of 600 points and a standard deviation of 100 points. Students who cherish the ambition to join the Auror Office must earn a test score greater than 777. The proportion of test-takers who exceed this threshold is closest to which of the following?
4%
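This answer is an upper-tail normal probability. A quick check in R with the mean and standard deviation from the question (the variable name is illustrative):

```r
# P(X > 777) for X ~ Normal(mean = 600, sd = 100)
p_auror <- pnorm(777, mean = 600, sd = 100, lower.tail = FALSE)
round(p_auror, 3)
```

A score of 777 is 1.77 standard deviations above the mean, so only a small share of the distribution lies beyond it.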
A data scientist fits a group-wise model to assess the extent to which perceptions of policies related to the covid-19 pandemic in New England may be predicted by residence in 1 of the 6 states in the region (Maine, Vermont, New Hampshire, Massachusetts, Connecticut, or Rhode Island). In a model written in "baseline/offset" form, how many dummy variables should this model use to encode the grouping variable "state"?
5
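The count can be confirmed by asking R to build the design matrix for a six-level factor. This is a sketch with one made-up row per state; the state abbreviations stand in for the full names.

```r
# One row per state, just to expose the factor levels
state <- factor(c("ME", "VT", "NH", "MA", "CT", "RI"))

# Default baseline/offset (treatment) coding: an intercept plus one dummy
# for every level except the baseline
X <- model.matrix(~ state)
colnames(X)

n_dummies <- ncol(X) - 1  # drop the intercept column
n_dummies
```

In general, a grouping variable with k levels needs k - 1 dummy variables, because the baseline level is absorbed into the intercept.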
This is one of multiple questions about the data in utsat.csv, which contains the SAT scores and graduating college GPAs for UT students. The relevant variables in this data set are: SAT.V: score on the verbal section of the SAT (200-800) SAT.Q: score on the quantitative section of the SAT (200-800) SAT.C: combined SAT score School: college or school at which the student first matriculated (not necessarily where they ended up) GPA: college GPA upon graduation, on a 4-point scale This data constitutes a census of a specific population: every UT student who entered UT in a specific recent year and went on to graduate from UT within 6 years. But what happens if we take samples from this population? Use the data to answer the following question. Simulate 10,000 samples of size n=250 from this data set. Based on this Monte Carlo simulation, the standard error of the mean SAT Verbal score for a sample of size 250 is closest to which of the following?
5
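The simulation can be sketched in base R. Since utsat.csv is not reproduced here, the code uses a synthetic stand-in population whose size, mean, and spread are invented for illustration, so its output will not match the real data exactly.

```r
set.seed(301)

# Hypothetical stand-in for the SAT.V column of utsat.csv (invented values)
population <- round(rnorm(5000, mean = 595, sd = 85))

# Monte Carlo: draw 10,000 samples of size 250 and record each sample mean
sample_means <- replicate(10000, mean(sample(population, size = 250)))

# The standard error is the standard deviation of the sampling distribution
se_mc <- sd(sample_means)
se_mc

# de Moivre's equation for comparison: sigma / sqrt(n)
se_formula <- sd(population) / sqrt(250)
se_formula
```

With the real data, `population` would be replaced by the SAT.V column read from the file.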
Based on this model output, the predicted reaction time for scene e when FarAway=0 is closest to which of the following?
576 milliseconds
Next, consider a bigger model for lights as the outcome, using all the predictors from your baseline model in the previous question (i.e., all predictors except Appliances), in addition to all combinations of two-way interactions of these predictors. (Remember the lm() function notation for including interactions in large models.) Fit this model to the data in household_train.csv, and calculate its out-of-sample RMSE on the data in household_test.csv. This RMSE is closest to which of the following?
6.4 watt-hours
This is one of multiple questions about the same scenario. The General Social Survey (GSS) is a renowned nationally representative sociological survey of adults in the United States conducted since 1972. With the data in GSS.csv, fit a linear regression model for Income using the following predictor variables: a main effect for Gender, a main effect for Age, and the interaction between Gender and Age. The model estimate for the interaction between Gender (Male) and Age is closest to which of the following?
685
A data scientist fits a groupwise linear model to predict Revenue from research grants in terms of affiliation of the research team with 1 of the 8 academic Institutions in the University of Texas system (UT Arlington, UT Austin, UT Dallas, UT El Paso, UT Permian Basin, UT Rio Grande Valley, UT San Antonio, or UT Tyler). If this model uses "baseline/offset" form, how many dummy variables should this model use to encode the variable Institution?
7
The files household_train.csv and household_test.csv contain data on energy consumption for a single household in Belgium. The goal here is to build a model that is capable of estimating household energy usage based on temperature and humidity readings -- both in individual rooms throughout the house, as well as for the air outdoors, as measured using data from the nearby weather station at Chievres airport. Each row in the data set represents power consumption over a single 10-minute period, and the data covers about 4.5 months in total. The household temperature and humidity conditions were monitored with a ZigBee wireless sensor network, and the energy data was logged every 10 minutes with energy meters. The variables are as follows: lights, energy use of light fixtures in the house in watt-hours Appliances, energy use in watt-hours T1 through T9: temperature (Celsius) in 9 individual rooms RH_1 through RH_9: relative humidity in 9 individual rooms To: Temperature outside (from Chievres weather station), in Celsius Pressure: (from Chievres weather station), in mm Hg RH_out: Humidity outside (from Chievres weather station), in % Wind speed (from Chievres weather station), in m/s Visibility (from Chievres weather station), in km Tdewpoint (from Chievres weather station), dewpoint outside First consider a baseline model for lights as the outcome, using main effects for all predictors except Appliances. (Remember the lm() function notation for specifying predictors in large models.) Fit this model to the data in household_train.csv, and then calculate out-of-sample RMSE on the data in household_test.csv. This RMSE is closest to which of the following numbers?
7.08 watt-hours
This is one of multiple questions about the same scenario. The Amazon e-commerce data science team fits a model to gain understanding of the extent to which having an Amazon Prime membership leads customers to buy more from the platform than customers without a Prime membership. The data set includes two variables: total Sales revenue for a customer account and a dummy variable representing whether the customer had a Prime membership (0 = no, 1 = yes). The fitted model equation is: Sales=764+1323∗Prime+e Match each equation component with its appropriate label below.
764: the baseline; 1323: the offset; Sales: response variable; Prime: categorical predictor variable; e: residual
Around 34% of registered voters in the U.S identify as independents, according to 2020 electorate data. Consider a random sample of 25 registered voters. Assuming that the number of registered independents in the sample is a Binomial random variable, what is the expected value? Round your answer to one decimal place.
8.5
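For a Binomial random variable the expected value is n times p. The sketch below computes it that way and then again from the definition of expected value, using R's Binomial probability function (variable names are illustrative):

```r
n <- 25    # sample size
p <- 0.34  # probability that a given voter is an independent

# Shortcut formula for the Binomial mean
expected <- n * p
expected

# Same quantity from the definition: sum over x of x * P(X = x)
expected_def <- sum(0:n * dbinom(0:n, size = n, prob = p))
expected_def
```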
This is one of multiple questions about the same scenario --- similar to the GSS model on this quiz but not based on the same data. A data scientist fits a model to investigate the extent to which the effect of work experience on salary is the same for males and females (using years of Experience and Sex as predictor variables): Salary=35000+3300⋅Experience−2000⋅Female−300⋅(Experience⋅Female) What salary would we predict for a female with 16 years of work experience? Round your answer to the nearest hundred dollars.
81,000
This is one of multiple questions about the same scenario --- similar to the GSS model on this quiz but not based on the same data. A data scientist fits a model to investigate the extent to which the effect of work experience on salary is the same for males and females (using years of Experience and Sex as predictor variables): Salary=35000+3300⋅Experience−2000⋅Female−300⋅(Experience⋅Female) What salary would we predict for a male with 16 years of work experience? Round your answer to the nearest hundred dollars.
87,800
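Both salary predictions come from plugging values into the same fitted equation, with Female set to 1 or 0. A minimal sketch, where `predict_salary` is a hypothetical helper name:

```r
# Fitted equation from the question, with Female coded 1 = female, 0 = male
predict_salary <- function(experience, female) {
  35000 + 3300 * experience - 2000 * female - 300 * experience * female
}

salary_female <- predict_salary(16, female = 1)
salary_male   <- predict_salary(16, female = 0)
c(female = salary_female, male = salary_male)
```

For males both Female terms drop out, leaving 35000 + 3300 * 16. For females the intercept shifts down by 2000 and the slope on Experience falls from 3300 to 3000.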
Follow above instructions for the epa_cars.csv data set. What does the model predict for the expected cityMPG for a vehicle of the Car (Category) with a Gasoline Powertrain? Round your answer to one decimal place.
?
Which of the following is the best general definition of a discrete random variable?
A random variable is discrete if its possible outcomes are whole numbers, or can otherwise be enumerated like whole numbers.
According to guidelines for data scientists (Lesson 15.6), which of the following describe variables that we typically EXCLUDE from multiple regression models? (1) Variables that do not explain any variability in the response variable y. (2) Variables that convey information about y that is redundant to that information conveyed by other variables in the model. (3) Variables that represent a common effect of both x and y (rather than a common cause of both x and y).
All of the above (1, 2, and 3)
Around 34% of registered voters in the U.S identify as independents, according to 2020 electorate data. Consider a random sample of 25 registered voters. Which of the following is/are among assumptions we make in modeling the number of registered independents in the sample as a Binomial random variable? (1) We are observing a fixed number of random events (i.e., each person in the sample). (2) Each random event may be considered as a binary "yes" or a "no" outcome. (3) Knowing that one member of the sample is an independent (or not) does not change the probability that any other member of the sample is an independent (or not).
All of the above (1, 2, and 3)
Based on this model output, which of the following statements is/are accurate? (1) This model includes an interaction term for the FarAway and SceneLetter variables. (2) For scene d when FarAway=1, this model predicts a reaction time of approximately 560 milliseconds. (3) We are 95% confident that reaction time increases by between 20 milliseconds and 75 milliseconds when the stimulus is FarAway.
All of the above (1, 2, and 3)
During class we fit a multiple regression model to predict the listing price of a house in Saratoga County, New York. We concluded that the variable age (age of the house in years) should be included in a model attempting to isolate the partial relationship between fireplaces and price. Which of the following were among the reasons that we decided to include age in the model? (1) The confidence interval for the age coefficient did not contain zero. (2) The inclusion of the age variable affected the magnitude of the coefficient of interest: fireplaces. (3) Age was correlated with both the response variable (price) and the predictor variable of interest (fireplaces).
All of the above (1, 2, and 3)
Hooli's streaming entertainment division is developing an exciting new TV series, for which they decide to invest in a Super Bowl commercial. They hope that name recognition for the series will increase from the 35% historical average for similar content with standard marketing budgets. After the commercial runs, the data science team contact a random sample of 1000 people to ask if they recognize the new series name. Their analysis produces a p-value equal to 0.051. Which of the following may we conclude from this result? (1) If -- before the poll was conducted -- the team declared 0.01 to be the arbitrary cutoff value for statistical significance, they would fail to reject the null hypothesis. (2) If -- before the poll was conducted -- the team declared 0.10 to be the arbitrary cutoff value for statistical significance, they would reject the null hypothesis. (3) This p-value provides a nearly identical amount of practical evidence against a null hypothesis as would a p-value equal to 0.049.
All of the above (1, 2, and 3)
How does effective feature engineering help machine learning modelers to avoid overfitting? (1) Identifying the most useful features helps us build better predictive models. (2) Identifying irrelevant attributes in training data and removing them from the learned model helps us to balance the number of model parameters relative to the size of our sample. (3) Your model can't memorize complicated, non-generalizable facts about the past if you force it to ignore all facts except the simplest ones.
All of the above (1, 2, and 3)
The file CPS85.csv contains data from the 1985 Current Population Survey (CPS), used to supplement information from the U.S. Census in between official census years. These data consist of a random sample of U.S. residents with information on their wages, sex, number of years of education, years of work experience, occupational status, region of residence, and union membership status. Consider the married and union variables, and compute the proportion of those married among union members relative to the proportion of those married among non-union members. Form a 95% "large-sample" confidence interval (i.e. based on the Central Limit Theorem) for the difference between these two proportions (that is, marriage proportion for non-union members, minus the marriage proportion for those who are union members). Which of these statements about your analysis is/are accurate? Choose all correct statements. (1) Of the survey respondents who are not union members, about 63% are married. (2) Of the survey respondents who are union members, about 75% are married. (3) The confidence interval for the difference in marriage proportions for union members versus non-union members is approximately (-0.22, -0.01).
All of the above (1, 2, and 3)
Which of the following are key ingredients of a confidence interval based on the Central Limit Theorem? (1) A summary statistic from your sample (2) A critical value based on the level of confidence associated with the interval (3) A formula for the standard error of your summary statistic
All of the above (1, 2, and 3)
Which of the following is true of the margin of error and confidence intervals? (1) The number referred to as the "margin of error" is not a characteristic of a particular sample but rather associated with the sampling procedure. (2) If we calculate the 95% confidence interval for a parameter as (0.49, 0.61), then the margin of error is 0.06. (3) A 95% confidence interval for the sample proportion from a poll indicates that 95% of the intervals produced by repeating the poll's sampling procedure would contain the true population proportion of interest.
All of the above (1, 2, and 3)
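Statement (2) can be checked directly: for a symmetric interval, the margin of error is half the interval's width. A minimal sketch (variable names are illustrative):

```r
ci <- c(lower = 0.49, upper = 0.61)

# Center of the interval = the point estimate
estimate <- mean(ci)

# Margin of error = half the width of the interval
margin <- unname(diff(ci)) / 2

c(estimate = estimate, margin = margin)
```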
Which of the following statements is true of "statistical significance" in the context of multiple regression modeling? (1) An estimated partial relationship is considered to be statistically significant if zero is not a plausible value for that partial slope in the model. (2) We generally express the level of statistical significance (e.g., 0.05) as the complement of the confidence level for an interval (e.g., 95%). (3) Statistical significance for an estimated partial relationship does not mean that the corresponding predictor variable is important in practical terms.
All of the above (1, 2, and 3)
Which of the following statements is/are correct? (1) Sampling variance refers to non-systematic (random) differences between our estimand and our estimate. (2) Sampling bias refers to systematic (non-random) differences between our estimand and our estimate. (3) Bootstrapping helps us to quantify the statistical uncertainty we have about our sample estimate and what it can tell us about some population estimand.
All of the above (1, 2, and 3)
Which of the following statements is/are true of root mean-squared error (RMSE)? (1) The RMSE measure may be conceptualized as the average error we'd expect a model to make when predicting future data points. (2) Lower values of RMSE indicate a more accurate model, relative to higher values. (3) In STA 301, we will use the modelr library in RStudio to calculate RMSE when fitting predictive models.
All of the above (1, 2, and 3)
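RMSE is the square root of the average squared prediction error. The sketch below computes it by hand in base R on the built-in mtcars data, which stands in here for the course data sets. The 80/20 split, the choice of predictors, and the helper name `rmse_mpg` are all assumptions made for illustration.

```r
set.seed(301)

# 80/20 train/test split of the built-in mtcars data
train_rows <- sample(nrow(mtcars), size = round(0.8 * nrow(mtcars)))
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# A small linear model for mpg, fit on the training set only
fit <- lm(mpg ~ wt + hp, data = train)

# RMSE = sqrt(mean(squared prediction errors)); hard-coded to the mpg response
rmse_mpg <- function(model, data) {
  errors <- data$mpg - predict(model, newdata = data)
  sqrt(mean(errors^2))
}

rmse_in  <- rmse_mpg(fit, train)  # in-sample
rmse_out <- rmse_mpg(fit, test)   # out-of-sample
c(in_sample = rmse_in, out_of_sample = rmse_out)
```

With the modelr package loaded, `rmse(fit, test)` should give the same out-of-sample number without the hand-written helper.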
Which of these statements is an accurate interpretation of the plot seen above? (1) As the model becomes more complex, the estimate of in-sample error becomes increasingly optimistic. (2) Degradation in prediction when moving from in-sample data to out-of-sample data is a sign of overfitting. (3) The measure plotted on the vertical y-axis may be conceptualized as the standard deviation of future forecasting errors made by the model.
All of the above (1, 2, and 3)
Why do we need a control group in experimental design? Identify the accurate statements below. (1) To rule out natural change and variation (as distinct from change related to some experimental treatment) as an explanation for the outcomes we observe. (2) To rule out alternative explanations for the outcome of the experiment related to placebo effects. (3) To provide a basis for comparison for those who received the treatment.
All of the above (1, 2, and 3)
The difference between statistical significance and practical significance is an important distinction in data science. Fill in the blanks below with the appropriate type of "significance." ____________ can usually be assessed by checking whether or not zero falls within the range of plausible values represented by a confidence interval. When we ask about the extent to which the numerical magnitude of something is large enough to matter in the data science context, we are concerned with ______________. _______________ necessarily involves subjective judgment on what effects are "large enough to matter." Variable units and measurement scales matter when we are assessing _________________.
Answer 1: Statistical significance Answer 2: practical significance. Answer 3: Practical significance Answer 4: practical significance.
The _________ function allows us to change our unit of analysis by altering the row structure of a data frame. Use the _____________ function to define new variables or edit some aspect of existing variables. The ________________ function is often used to sort a data frame in a sequential order with respect to one or more variables. Use the ______________ function to isolate a subset of cases in a data frame based on some criteria.
Answer 1: group_by() Answer 2: mutate() Answer 3: arrange() Answer 4: filter()
There is no statistical test for _____________. When we ask how large an effect a predictor variable has on an outcome variable, in context-specific terms, we are interested in _______________. ______________ may be assessed by calculating a p-value. The units of variables and measurement scales of variables matter when we are assessing _______________. ______________ is usually assessed by looking at a confidence interval and reasoning about the range of plausible effect sizes in the context of the problem. ________________ can usually be assessed by checking whether or not zero falls within the range of plausible values represented by a confidence interval. If determining whether or not a measured effect may be distinguished from zero, we are interested in statistical significance.
Answer 1: practical significance Answer 2: practical significance Answer 3: Statistical significance Answer 4: practical significance Answer 5: Practical significance Answer 6: Statistical significance Answer 7: statistical significance
To visualize the relationship between age and logins, make a ________. Describe the distribution of age with a histogram. Of the customers who unsubscribed versus those who did not unsubscribe, what proportion in each group watched The Mandalorian? And does this difference in proportions hold for female customers relative to customers who do not identify as female? These comparisons can be made visually with a faceted bar plot. Use a ________ (options: histogram, boxplot, bar plot, scatter plot, faceted bar plot) to compare the count of customers who fall into each category of the mandalorian variable. How does the distribution of customer age differ for customers who live in a city relative to those who do not live in cities? Show this with side-by-side boxplots.
Answer 1: scatter plot Answer 2: a histogram Answer 3: a faceted bar plot Answer 4: bar plot Answer 5: side-by-side boxplots
The data in olympics_top20.csv contains information on every Olympic medalist in the top 20 sports by participant count, all the way back to 1896. Use these data to answer the following: Which single event in the 2012 London games had the heaviest median male competitor? Hint: Create a data frame that only contains cases that meet the above criteria for the variables sex and year. Then create subgroups by the event variable.
Athletics Men's Shot Put
This is one of multiple questions about the same scenario. NCAA coaches may be compensated with salary as well as with an annual bonus. The amount of that bonus may be contingent on team performance as well as meeting off-field thresholds with respect to players' grades and conduct. The distribution of salary and bonus is something that coaches negotiate with the Athletic Director at an NCAA school. A college football coach-who-will-not-be-named is evaluating the job market and wants to predict what sort of bonus compensation he might expect to earn at various NCAA schools. Using NCAA Salaries data, he fits a model with the following variables: MaxBonus: maximum annual bonus (in millions of dollars) Salary: annual salary (in millions of dollars) SEC: whether the school is in the Southeastern Conference (SEC = 1) or not (SEC = 0) The model equation is: MaxBonus = 0.45 + 0.15*Salary + 0.84*SEC - 0.11*(Salary*SEC) The coefficient of 0.84 in the equation above indicates that:
Bonuses are about $840,000 higher at SEC schools versus non-SEC schools, assuming the same coach salary.
Match each grade component with the correct percentage weight used to calculate course grades.
Class Participation: 5% Knowledge Checks: 20% Midterm Exam: 20% Homework: 25% Final Exam: 30%
Follow above instructions for the epa_cars.csv data set. The baseline vehicle in this model corresponds to which of the following levels of the Powertrain variable?
Diesel
Lyra is 31 years old, outspoken, and very bright. She majored in philosophy as an undergraduate at Oxford. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which of the following events is more probable? Event 1: Lyra is a bank teller Event 2: Lyra is a bank teller and is active in the feminist movement
Event 1 is more probable
During a global pandemic, a researcher decides to conduct an experiment on the efficacy of contact tracing apps at public universities. Which of the following is the most important question that we should ask in evaluating the strength/quality of this experimental study design and its capacity to provide a balanced comparison? Choose the best answer based on the basic principles of experimental design that we discussed in class and in Lesson 12.
How do schools that have contact tracing apps and schools that do not have contact tracing apps differ in ways that might also predict the spread of the virus?
In RStudio, load the dataset diamonds (built into the tidyverse package) by running the following commands in your R script: library(tidyverse); data(diamonds) The data set describes features of almost 54,000 diamonds, including the following variables: price: price in U.S. dollars carat: weight of the diamond cut: the quality of the cut, from Fair to Ideal color: a letter code indicating the color of the diamond Which color of diamond in this data set has the highest median weight, in carats?
J
FiveThirtyEight published an analysis of patterns in Super Bowl television commercials from the ten brands that aired the most spots during 2000-2020. These data are in the superbowl.csv file. Run the following code to remove missing (NA) values from the YouTube view_count and like_count variables: superbowl = superbowl %>% filter(!is.na(view_count), !is.na(like_count)) Then, use more data wrangling to answer the following. Which brand has the lowest average view_count for Super Bowl ads in this sample?
Kia
Suppose we're trying to build a smart-phone app that can take a picture of a food item, extract meaningful features from the raw pixels in the image, and use a regression model to classify the image as a "Hot Dog" or "Not Hot Dog." Does this sound more like we need the tools of statistics or machine learning? Why?
Machine learning, because we care chiefly about the predictive accuracy of the classification model in the context of a larger pipeline of hardware and software.
At Matt's El Rancho restaurant, dine-in customers can get a free rainbow sherbet dessert if (1) they order Bob Armstrong dip as an appetizer or if (2) they order a dinner entree. The manager has observed that: 60% of diners get the free dessert, 45% order the appetizer dip, and 33% order a dinner entree. Based on these estimates, the probability that a randomly selected diner orders both appetizer dip and a dinner entree is closest to which of the following?
NOT 15%
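The 15% guess can be checked with the addition rule, P(A or B) = P(A) + P(B) - P(A and B). The sketch below uses only the three percentages quoted in the question and assumes that every diner who qualifies actually takes the free dessert, so that the 60% figure equals P(dip or entree). The variable names are illustrative.

```r
# Percentages stated in the question
p_dessert <- 0.60  # P(dip OR entree), assuming all qualifying diners take the dessert
p_dip     <- 0.45  # P(dip)
p_entree  <- 0.33  # P(entree)

# Addition rule rearranged:
# P(dip AND entree) = P(dip) + P(entree) - P(dip OR entree)
p_both <- p_dip + p_entree - p_dessert
p_both  # about 0.18
```

Under that assumption the overlap works out to roughly 18%, which is consistent with 15% being marked incorrect.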
A tech journalist blogged recently about a perceived lack of female founders in the latest batch of startups accepted to the Hooli Hive accelerator, operated by one of the world's leading digital technology platforms. Recent industry statistics indicate that 33% of US tech startups have at least one female founder. Of the 75 founding teams admitted to the Hooli Hive accelerator, only 16 included a female founder. Is this seemingly low proportion of females an indication of gender bias at Hooli? On the other hand, could this observation simply be the result of natural sampling variability in founding team demographics? You are on the Hooli data science team and want to help mount a defense by quantifying the extent to which the observed data are consistent with the null hypothesis. Which of the following represents the null hypothesis in this scenario?
NOT prob = 0.21
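The value 0.21 is the observed sample proportion (16/75), not the null hypothesis. The null hypothesis takes the industry-wide rate of 0.33 as the true probability. Here is a minimal sketch of how the observed count could be compared against that null, using an exact binomial calculation in base R rather than a simulation. The variable names are illustrative, and the one-sided tail is an assumption about how the test would be framed.

```r
n_teams  <- 75    # founding teams admitted
n_female <- 16    # teams with at least one female founder
p_null   <- 0.33  # industry-wide rate, used as the null hypothesis

# Observed sample proportion (this is the 0.21 figure, not the null value)
p_hat <- n_female / n_teams

# Probability of seeing 16 or fewer such teams if the null were true
p_value <- pbinom(n_female, size = n_teams, prob = p_null)

p_hat
p_value
```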
To what extent does vehicle mileage predict price on the used pickup truck market? The file trucks.csv contains data on 46 pickup trucks listed for sale online, including the following variables: year: year of vehicle price: listed price for vehicle in dollars mileage: number of miles on vehicle odometer at time of listing make: vehicle brand Based on the data in this sample, the change in price that we expect for a 1000-mile increase in vehicle mileage is closest to which of the following?
NOT: -$643 lm1 = lm(price ~ mileage, data = trucks) coef(lm1) (Intercept) mileage 14419.37617711 -0.06429944 -0.064 * 1000 = -64
The file trucks.csv contains data on 46 pickup trucks listed for sale online, including the following variables: year: year of vehicle price: listed price for vehicle in dollars mileage: number of miles on vehicle odometer at time of listing make: vehicle brand Use the mutate() function to create a new categorical variable that indicates whether each truck is a GMC (make == "GMC") or not. Based on the data in this sample, do GMC trucks have the same average mileage as non-GMC trucks?
NOT: GMC trucks in this sample have lower mileage, on average, and that difference is approximately 13,100 miles. # create new categorical variable for GMC trucks = trucks %>% mutate(GMC = ifelse(make == 'GMC', 'GMC', 'not_GMC')) # difference of group means observed in sample mean(mileage ~ GMC, data = trucks) diffmean(mileage ~ GMC, data = trucks)
This is one of several questions about the data in georgia2000.csv, which contains Georgia's county-level voting data from the 2000 presidential election. The 2000 election was among the most controversial in history; it turned on an esoteric set of issues surrounding voting machines and vote counts, and wasn't finally resolved until the Supreme Court's ruling in Bush v. Gore. This file contains information for all 159 counties in Georgia. For our purposes, the relevant variables are: votes: number of votes recorded ballots: number of ballots cast optical: whether the county used optical scan voting machines (Yes or No) poor: coded 1 if more than 25% of the residents in a county live below 1.5 times the federal poverty line; coded 0 otherwise perAA: percent of people in the county who are African-American urban: coded 1 if the county is predominantly urban; coded 0 otherwise gore: number of votes for Gore bush: number of votes for Bush Use the mutate() function to define a new variable called ucountPct: ucountPct = 100 * (ballots - votes) / ballots that measures the percentage of ballots (0-100) in a county that were "undercounted" or "spoiled" (i.e., ballots cast where no legal vote was recorded because the machine could not read the vote). One of the main legal issues in the election was whether some counties had inferior equipment leading to higher undercount rates than others. Form a 95% bootstrap confidence interval for the mean difference in undercount percentages between urban and non-urban counties. Use this interval to evaluate the statements below. Select all correct answers.
Non-urban counties had higher undercount rates, and the confidence interval for the difference does not contain 0.
The plot above (based on a dataset from Motor Trend magazine on the design and performance of 32 car types) displays the distribution of fuel consumption (in miles per gallon) by the number of forward gears on the automobile. Which of the following can we conclude based on this plot? (1) Cars with 5 gears get the highest median miles per gallon. (2) Cars with 3 gears have the largest interquartile range. (3) The maximum miles per gallon for cars with 3 gears is less than the minimum miles per gallon for cars with 4 gears.
None of the above
Why would we bootstrap a statistical model? Identify all accurate statements. (1) To check whether our data form a true random sample from the population. (2) To eliminate the effect of sampling bias on the error of our estimate. (3) To reduce the effect of sampling variance on the uncertainty of our estimate.
None of the above
Texas leads the nation in number of farms and ranches, with 248,416 farms and ranches covering 127 million acres. Of the annual $25 billion in agricultural output, $1.1 billion comes from corn. Corn requires a lot of water to grow, however. Suppose we collect some data from a block-designed experiment on a new variety of corn. This new variety is bred to require less water during the crucial "silking" phase of growth. Variables for each plot of land are: Bushels: output of corn from that plot in number of bushels Treatment: whether plot was seeded with the new variety of corn (Treatment = 1) or the conventional control variety (Treatment = 0) Water: inches of irrigation per day on that plot (ranging from 0 to 0.3 inches per day) The fitted regression equation is: Bushels = 2.0 + (0.9 * Treatment) + (3.0 * Water) - (1.8 * Treatment * Water) Which of the following statements is accurate based on these model results?
Output increased at a rate of 1.2 bushels per inch of irrigated water on plots receiving the new variety of corn. 3.0 - 1.8 = 1.2
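The 3.0 - 1.8 = 1.2 arithmetic can be verified by coding the fitted equation and comparing predictions one inch of water apart. This is only a sketch: the function name is illustrative, and the one-inch step lies outside the observed 0 to 0.3 range. It is used purely to read off the slope of a linear function.

```r
# Fitted equation from the question
predict_bushels <- function(treatment, water) {
  2.0 + 0.9 * treatment + 3.0 * water - 1.8 * treatment * water
}

# Slope with respect to water = change in prediction per one-inch increase
slope_control   <- predict_bushels(0, 1) - predict_bushels(0, 0)  # main effect: 3.0
slope_treatment <- predict_bushels(1, 1) - predict_bushels(1, 0)  # 3.0 - 1.8 = 1.2

c(control = slope_control, treatment = slope_treatment)
```

The interaction coefficient (-1.8) is the amount by which the water slope changes when Treatment switches from 0 to 1.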
Which of the following represent(s) the concept of "sampling WITH replacement"? Select all correct answers.
Professor Snape selects a sample of students to "cold call" during each of his NEWT-level Potions classes. For each question, he uses a Resampulus charm, wherein his wand randomly points to a student irrespective of who was called previously. There is no limit on the number of times that an individual student might be selected during any given class session.
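In R, the difference between the two sampling schemes comes down to the replace argument of sample(). A small sketch with a hypothetical class roster (the names and the seed are made up for illustration):

```r
set.seed(301)  # arbitrary seed so the draws are reproducible

students <- c("Harry", "Hermione", "Ron", "Draco", "Neville")  # hypothetical roster

# With replacement: a student can be called again after being called once,
# so we can draw more names than there are students.
calls_with <- sample(students, size = 8, replace = TRUE)

# Without replacement: each student can be called at most once.
calls_without <- sample(students, size = 5, replace = FALSE)

calls_with
calls_without
```

Drawing 8 names from 5 students is only possible with replacement, and it guarantees at least one repeat.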
The data in olympics_top20.csv contains information on every Olympic medalist in the top 20 sports by participant count, all the way back to 1896. Use these data to answer the following question. Which single women's event had the greatest variability in competitors' heights across the entire history of the Olympics, as measured by the standard deviation? Hint: Create a data frame that only contains cases that meet the above criteria for the variable sex. Then create subgroups by the event variable before calculating standard deviation for each.
Rowing Women's Coxed Fours
The colleges.csv data set includes the following variables: -PercentOnFinancialAid: Percentage of students that receive some financial aid at the institution -AdmissionYield: Percentage of accepted applicants that decide to enroll at the institution Which of the following plots would be the best choice to visualize the association between the PercentOnFinancialAid variable and the AdmissionYield variable?
Scatter plot
Suppose we're looking at COVID-19 data from every county in the US, and we're trying to understand the relationship between various social-distancing measures taken in that county (x) and the growth rate of the virus in that county. We build a regression model that relates each county's COVID-19 growth rate (y) to several predictors that measure the extent of each county's social distancing behavior. Our goal is to understand how different measures of social distancing seem related to the COVID-19 growth rate. Does this sound more like we need the tools of statistics or machine learning? Why?
Statistics, because we care chiefly about helping stakeholders (policy-makers, health professionals, etc) understand and interpret an important partial relationship.
Which of the following is among key principles of Machine Learning discussed in STA 301?
Subject-matter knowledge in the data science context is essential for good feature engineering.
A data scientist fits a regression model to predict product sales in terms of TV, radio, and newspaper advertising budgets for the products. Which of the following is the best interpretation of the radio coefficient in the multiple regression model results displayed above?
The average change in product sales associated with a one-unit increase in radio spending while holding constant spending for TV and newspaper.
The long-run rate of defective iPhones coming off the assembly line is 0.6% when all manufacturing processes are working correctly. Because testing each phone for defects would be cost prohibitive, a random sample of 500 phones is inspected every 2 hours to determine if the manufacturing processes are working correctly, or if there may be an issue leading to a higher rate of production defects. Halting the production line unnecessarily leads to lost revenue from fewer units shipped. On the other hand, producing defective phones is bad for brand integrity. Which of the following represents a Type I Error in this context?
The process is working correctly and the plant manager temporarily halts production.
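To make the Type I error concrete, we can compute how often a healthy process would trigger a false alarm under a given decision rule. The question does not specify a rule, so the thresholds below ("halt if a sample contains k or more defective phones") are hypothetical and chosen only for illustration.

```r
n_phones <- 500    # phones inspected in each sample
p_ok     <- 0.006  # long-run defect rate when the process is working correctly

# Hypothetical decision rule: halt the line if a sample has k or more defects.
# The Type I error rate is P(X >= k) when the process is actually fine.
type1_rate <- function(k) {
  pbinom(k - 1, size = n_phones, prob = p_ok, lower.tail = FALSE)
}

rates <- sapply(5:10, type1_rate)
names(rates) <- 5:10
rates
```

Raising the threshold lowers the false-alarm rate, at the cost of being slower to catch a real problem (a Type II error).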
Consider both models fit with the wine training data: (1) a smaller model that regresses quality versus all other variables (main effects only), and (2) a larger model that regresses quality versus all other variables plus all combinations of their two-way interactions. Based on the predictive accuracy (RMSE) of these two models on unseen data, which model seems superior?
The smaller model
Match the summary statistic below with its correct application as a measure of center or measure of variability for the distribution of a numerical variable.
Variance: Measure of variability Standard deviation: Measure of variability Interquartile range: Measure of variability Mean: Measure of center Median: Measure of center
A confidence interval using de Moivre's equation is valid if which of the following conditions is true?
We are calculating a confidence interval for a population mean based on a sample mean.
Streaming platforms such as Disney+ strive for low churn rates, i.e., minimizing the proportion of customers who cancel their subscription. They use data to make personalized recommendations for content that will keep customers on the platform. One recent popular series was The Mandalorian. The disney.csv file has information on a random sample of 5000 customers, including the following variables: unsubscribe: indicator of whether or not the customer cancelled their subscription ("unsubscribe") or did not cancel ("stay") mandalorian: yes/no indicator for whether the customer viewed any of The Mandalorian series female: yes/no indicator for whether the customer identifies as female city: yes/no indicator for whether the customer lives in an urban area age: age of the customer in years logins: the number of times the customer logged on in the previous week Consider the following two events: (1) a customer viewed The Mandalorian (2) a customer identified as female Based on the data in this sample, are these two events independent?
Yes, these events are independent because the probability that a female viewed The Mandalorian is equal to the probability that any customer in the sample viewed The Mandalorian.
# P(mandalorian)
xtabs(~mandalorian, data=disney) %>% prop.table() %>% round(2)
# P(female)
xtabs(~female, data=disney) %>% prop.table() %>% round(2)
# P(mandalorian | female)
xtabs(~mandalorian + female, data=disney) %>% prop.table(margin = 2) %>% round(2)
# joint probabilities
xtabs(~mandalorian + female, data=disney) %>% prop.table() %>% round(2)
A conditional (if/then) statement about some event that may not have actually occurred is known as:
a counterfactual.
The data frame in marketing.csv contains information on the impact of three advertising media on the unit sales of 200 different products: youtube: YouTube advertising budget ($ thousands) facebook: Facebook advertising budget ($ thousands) newspaper: Newspaper advertising budget ($ thousands) sales: sales volume (in thousands of units) Which of the following best describes the distribution of the sales variable?
approximately symmetric
Streaming platforms such as Disney+ use data to make personalized recommendations for content that will keep customers on the platform. The disney.csv file has information on a random sample of 5000 customers, including their age in years. The distribution of the age variable is BEST described as which of the following?
approximately symmetric # age histogram disney %>% ggplot() + geom_histogram(aes(x=age))
A student is facing an exam in Ancient Runes for which they are totally unprepared. The exam contains 10 multiple choice questions, each with 4 answer choices (only one of which is correct). Instead of definitely getting a zero by skipping the exam, they decide to instead show up and guess randomly on each question. Which of the following R functions will return the probability that the student gets exactly half of the questions correct for a 50% score on the exam?
dbinom(5, 10, 0.25)
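As a sanity check, the dbinom() call can be compared against the binomial formula written out by hand. The variable names below are illustrative.

```r
n_questions <- 10
p_correct   <- 0.25  # one correct answer out of four choices

# Built-in binomial probability of exactly 5 correct answers
p_builtin <- dbinom(5, size = n_questions, prob = p_correct)

# Same quantity from the formula: C(10, 5) * p^5 * (1 - p)^5
p_manual <- choose(n_questions, 5) * p_correct^5 * (1 - p_correct)^5

c(builtin = p_builtin, manual = p_manual)  # both about 0.058
```

Note that dbinom() gives the probability of exactly 5 correct, whereas pbinom() would give the cumulative probability of 5 or fewer.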
Choose the best answer. Holding other factors constant, increasing the size of a sample used to calculate a confidence interval will:
decrease the standard error.
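For a sample mean, the reason is visible in de Moivre's equation, where the standard error is the standard deviation divided by the square root of the sample size. A tiny sketch follows. The standard deviation of 10 is a made-up value used only to show the pattern.

```r
# de Moivre's equation for the standard error of a sample mean
standard_error <- function(sigma, n) sigma / sqrt(n)

sigma  <- 10                          # hypothetical standard deviation
se_100 <- standard_error(sigma, 100)  # 10 / 10 = 1
se_400 <- standard_error(sigma, 400)  # 10 / 20 = 0.5

c(n_100 = se_100, n_400 = se_400)
```

Quadrupling the sample size only halves the standard error, because n enters through its square root.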
Suppose we have a random sample of U.S. voters in a data frame called voters. In this data frame, there is a binary variable called House2022, which represents whether the voter intends to vote for a Republican or Democrat in their local election for the 2022 U.S. House of Representatives. Which of the following R commands would allow you to generate a bootstrap sampling distribution (using ten thousand simulations) for the proportion of all U.S. voters who intend to vote for a Democrat in the 2022 midterm House elections?
do(10000) * prop(~House2022, data=resample(voters))
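The same resampling idea can be written out in base R, which may help show what the do() and resample() combination is doing conceptually. This is a sketch, not the course's implementation: the house2022 vector is a simulated stand-in for voters$House2022, and the sample size of 500 is arbitrary.

```r
set.seed(301)  # arbitrary seed for reproducibility

# Simulated stand-in for voters$House2022 (hypothetical data)
house2022 <- sample(c("Democrat", "Republican"), size = 500, replace = TRUE)
n <- length(house2022)

# Each iteration: resample n voters WITH replacement, then recompute the proportion
boot_props <- replicate(10000, {
  resampled <- sample(house2022, size = n, replace = TRUE)
  mean(resampled == "Democrat")
})

sample_prop <- mean(house2022 == "Democrat")
sample_prop
quantile(boot_props, c(0.025, 0.975))  # 95% percentile interval
```

The spread of boot_props approximates the sampling variability of the sample proportion.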
An estimator is said to be 'asymptotically normal'...
if its sampling distribution is approximately normal for large enough samples.
A data set used by a marketing team contains the following information on 46 different internet advertising campaigns: total ad spend (measured in dollars of total advertising cost for that campaign) visibility (measured in impressions for all ads across the campaign) They fit a linear model to predict a campaign's visibility from its ad spend. What units does the slope have?
impressions per dollar
A financial planner wants to determine whether men or women are more likely to contribute to an IRA. She observes that both men and women contribute at a rate of 35%. This indicates that, in her data set, gender and contribution rate are:
independent.
A sampling distribution:
is the distribution of values of a summary statistic that we'd expect to see under repeated realizations of the same random data-generating process.
To what extent does vehicle mileage predict price on the used pickup truck market? The file trucks.csv contains data on 46 pickup trucks listed for sale online, including the following variables: year: year of vehicle price: listed price for vehicle in dollars mileage: number of miles on vehicle odometer at time of listing make: vehicle brand Fit a linear model for price in terms of mileage and use bootstrapping with 10000 iterations to generate confidence intervals for model estimates and model fit statistics. Based on this analysis, we are 95% confident that the proportion of variability in price explained by the regression on mileage is at least _______ . Which of the following is closest to the value that you calculate to fill in the blank above?
lm_boot = do(10000) * lm(price ~ mileage, data = resample(trucks)) confint(lm_boot)
Does the proportion of customers who unsubscribe differ with respect to whether or not they have viewed The Mandalorian? Use bootstrapping with 1000 iterations to generate a 95% confidence interval for the true difference in proportions. The lower limit of this interval is closest to which of the following?
not 0.12
# sample estimates
prop(unsubscribe ~ mandalorian, data = disney)
diffprop(unsubscribe ~ mandalorian, data = disney)
# bootstrapped interval
boot_disney = do(1000) * diffprop(unsubscribe ~ mandalorian, data = resample(disney))
confint(boot_disney)
The file CPS85.csv contains data from the 1985 Current Population Survey (CPS), used to supplement information from the U.S. Census in between official census years. These data consist of a random sample of U.S. residents with information on their wages, sex, number of years of education, years of work experience, occupational status, region of residence, and union membership status. Consider the sector variable, a categorical variable for the industry in which the respondent works (clerical, construction, management, manufacturing, professional, sales, service, or other). Use the mutate() function to create a new binary categorical variable called clerical, which takes on the values 'clerical' or 'not_clerical' depending on the value of the existing variable sector. Use this new variable clerical to form a 95% "large-sample" confidence interval (i.e. based on the Central Limit Theorem) for the population proportion of U.S. residents who work in clerical jobs. The upper limit of this interval is closest to which of the following?
not 15%
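A large-sample interval for a proportion has the form p_hat plus or minus z times sqrt(p_hat * (1 - p_hat) / n). The helper below is a sketch of that formula. The function name is made up, z = 1.96 is one common choice (some courses round it to 2), and the counts in the example are hypothetical, not taken from CPS85.csv.

```r
# Large-sample (CLT-based) confidence interval for a proportion
prop_ci <- function(successes, n, z = 1.96) {
  p_hat <- successes / n
  se    <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of the sample proportion
  c(lower = p_hat - z * se, estimate = p_hat, upper = p_hat + z * se)
}

# Hypothetical counts for illustration only (NOT the CPS85 values)
ci <- prop_ci(successes = 40, n = 400)
ci
```

With the real data, successes would be the number of rows where clerical == 'clerical' and n the number of rows in the data frame.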
Follow above instructions for the epa_cars.csv data set. What is the estimated coefficient on the PowertrainGasoline dummy variable in your fitted model? Round your answer to one decimal place.
not 23.1
Amazon's market share in worldwide cloud infrastructure during Q4 2021 exceeded the combined market share of Microsoft and Google, its two largest competitors. Amazon's market share was estimated as 33% with the following 95% confidence interval: [0.31, 0.35]. Which of the following may we conclude from this confidence interval? (1) The margin of error for this interval is not a characteristic of this particular analysis, but rather is associated with the sampling procedure that generated this analysis. (2) The margin of error for this interval is 0.04. (3) There is a 95% probability that the sample estimate of market share (33%) is equal to Amazon's true market share.
not 3 only
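Statement (2) can be checked directly from the interval endpoints. Under the usual convention that the margin of error is the half-width of a symmetric interval, 0.04 is the full width of the interval rather than its margin of error.

```r
ci_lower <- 0.31
ci_upper <- 0.35

width           <- ci_upper - ci_lower        # full width of the interval: 0.04
margin_of_error <- width / 2                  # half-width: 0.02
estimate        <- (ci_lower + ci_upper) / 2  # midpoint: 0.33

c(width = width, margin_of_error = margin_of_error, estimate = estimate)
```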
Payton Hobart is running for class president. Pre-election polls have generally agreed that he has the support of 63% of the student body. However, his chief data scientist recently surveyed a random sample of 75 students and observed that only 44 of them were Hobart supporters. She decides to conduct a hypothesis test to determine if her recent poll was anomalous, or whether it was consistent with the other pre-election polls. The null hypothesis for this test assumes which of the following for the probability (p) that a randomly chosen member of the student body will support Hobart?
p = 0.63
Hooli, one of the world's largest tech firms, is researching opportunities to develop data centers in various locations around the world. They collect information on each potential site including the property size in acres, the country in which the site is located, estimated cost to acquire the site, and whether or not competitors have a presence within 100 miles of the site. Which of the following represents the cases in this data set?
site locations
A data set with information on United States colleges includes a variable called PercentOnFinancialAid, the percentage of students who receive some financial aid, as seen below in a histogram. The distribution of this variable is best described as which of the following?
skewed to the left
The data frame in marketing.csv contains information on the impact of three advertising media on the unit sales of 200 different products: youtube: YouTube advertising budget ($ thousands) facebook: Facebook advertising budget ($ thousands) newspaper: Newspaper advertising budget ($ thousands) sales: sales volume (in thousands of units) Which of the following best describes the distribution of the newspaper variable?
skewed to the right
The data set in students.csv consists of 8239 rows, each of them representing a particular student from a university in Berlin. Characteristics corresponding to that student are represented by variables in each column, including: stud.id: a unique identifier for each student height: height in centimeters weight: weight in kilograms gender: Male or Female age: years religion: Catholic, Muslim, Orthodox, Other, or Protestant nc.score: standardized test score [1, 4] major and minor: Biology, Economics and Finance, Environmental Sciences, Mathematics and Statistics, Political Science, or Social Sciences online.tutorial: did the student take any online classes? (1 = Yes, 0 = No) graduated: did the student graduate? (1 = Yes, 0 = No) salary: self-reported starting salary (euro) Match each of the following variables from the list above with its correct type: Numerical or Categorical.
stud.id: categorical weight: numerical nc.score: numerical salary: numerical graduated: categorical age: numerical religion: categorical
Which team in which year had the largest average weekly attendance?
the 2016 Cowboys nfl %>% group_by(team, year) %>% summarize(avg = mean(attendance)) %>% arrange(desc(avg))
In a simple regression model: y_i = b0 + b1*x_i + e_i, with a dummy variable (0 or 1) predictor x, the coefficient b1 may be interpreted as:
the differential effect on the outcome of having the dummy variable equal to 1, rather than 0.
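One way to see this is to fit the model on a small data set and compare b1 with the difference in group means. The eight data points below are made up for illustration.

```r
# Small hypothetical data set: x is a 0/1 dummy, y is the outcome
x <- c(0, 0, 0, 0, 1, 1, 1, 1)
y <- c(10, 12, 11, 13, 15, 17, 16, 18)

fit <- lm(y ~ x)
b0  <- unname(coef(fit)["(Intercept)"])  # mean of y when x = 0
b1  <- unname(coef(fit)["x"])            # shift in mean when x = 1 rather than 0

# Difference in group means computed directly
diff_means <- mean(y[x == 1]) - mean(y[x == 0])

c(b0 = b0, b1 = b1, diff_means = diff_means)
```

With a single dummy predictor, the intercept is the mean of the x = 0 group and b1 is the difference between the two group means.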
Choose the best answer to complete the sentence below. If X and Y are two random variables associated with the same uncertain outcome, the joint distribution for X and Y is:
the set of all possible joint outcomes, together with their associated probabilities.
In RStudio, load the dataset diamonds (built into the tidyverse package) by running the following commands in your R script: library(tidyverse); data(diamonds) The data set describes features of almost 54,000 diamonds, including the following variables: price: price in U.S. dollars carat: weight of the diamond cut: the quality of the cut, from Fair to Ideal color: a letter code indicating the color of the diamond The interquartile range of price for diamonds with an 'Ideal' cut is closest to which of the following?
$3,800
The data frame in marketing.csv contains information on the impact of three advertising media on the unit sales of 200 different products: -youtube: YouTube advertising budget ($ thousands) -facebook: Facebook advertising budget ($ thousands) -newspaper: Newspaper advertising budget ($ thousands) -sales: sales volume (in thousands of units) Which of the following is the q=0.90 quantile (or 90th percentile) of the youtube variable in this sample?
$313,728
A store owner fits a linear model predicting daily sales revenue ($) from the number of customers who visited the store each day. The equation is: Sales_i = 10.56 + 5.23 * Customers_i + e_i If 70 customers visit the shop tomorrow, the daily sales revenue predicted by this linear model is closest to which of the following?
$377
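The prediction is just the fitted equation evaluated at 70 customers, with the error term set to zero. A short sketch (the function name is illustrative):

```r
# Fitted line from the question: intercept 10.56, slope 5.23 dollars per customer
predict_sales <- function(customers) 10.56 + 5.23 * customers

predicted <- predict_sales(70)  # 10.56 + 366.10 = 376.66
predicted
```

Rounded to the nearest dollar, this gives the $377 answer.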
The data frame in marketing.csv contains information on the impact of three advertising media on the unit sales of 200 different products: -youtube: YouTube advertising budget ($ thousands) -facebook: Facebook advertising budget ($ thousands) -newspaper: Newspaper advertising budget ($ thousands) -sales: sales volume (in thousands of units) Which of the following is the q=0.90 quantile (or 90th percentile) of the facebook variable in this sample?
$52,224
Consider the following data frame on five airplane models operated by major commercial carriers.
CaseID  Manufacturer  Model  First flight  MaxPassengers
1       Boeing        737    1967          215
2       Boeing        747    1969          605
3       Boeing        787    2009          330
4       Airbus        A320   1987          186
5       Airbus        A380   2005          853
Which of the following statements about this data frame is/are accurate? Select all correct answers.
-"Model" is a categorical variable. -"First flight" is a numerical variable. -"CaseID" is a numerical variable. -Each case in this data frame corresponds to an airplane manufacturer. -"Max passengers" is a categorical variable. -"Manufacturer" is a categorical variable.
This is one of three questions about the data in creatinine.csv. Each row represents a patient in a doctor's office. The variables are: age: patient's age in years. creatclear: patient's creatinine clearance rate in mL/minute, a measure of kidney health (higher is better). Fit a linear model that predicts a patient's creatinine clearance rate in terms of their age. Use this model to answer the following question. Which of the following model estimates represents the change in creatinine clearance rates that we expect to see as a function of age for this sample? (This number should have units mL/minute per year.)
-0.62
Which of the following plot design choices should generally be AVOIDED in data visualization? Select all correct answers.
-A barplot with truncated y-axis -3D designs
This is one of three questions about the data in creatinine.csv. Each row represents a patient in a doctor's office. The variables are: age: patient's age in years. creatclear: patient's creatinine clearance rate in mL/minute, a measure of kidney health (higher is better). Fit a linear model that predicts a patient's creatinine clearance rate in terms of their age. Use this model to answer the following question. Which of the following statements about this model are correct? Choose all correct answers.
-About 67% of the total variation in patients' creatinine clearance rate can be predicted by their ages. -The typical error made by this model, as measured by the residual standard deviation, is approximately 6.9 mL/min.
In 2019, The Walt Disney Company earned more than $11 billion worldwide with eight of the year's ten highest-grossing films, as depicted by the left-hand chart below. The right-hand chart shows the trajectory of Disney's domestic box-office sales, bolstered by a series of brilliant cinematic acquisitions such as Marvel Studios and Lucasfilm. What plot types are represented in this data visualization? Select TWO correct answers
-Bar plot -Line graph
The file billboard.csv contains data on every song to appear on the weekly Billboard Top 100 chart during 1959 through 2020. Each row of this data frame corresponds to a single song in a single week. Variables include: performer: who performed the song; song: the title of the song; year: year (1959 to 2020); week: chart week of that year (1, 2, etc.); week_position: what position that song occupied that week on the Billboard Top 100 chart. Use your skills in data wrangling to make a table of the top 10 most popular songs in the data set, as measured by the total number of weeks that a song spent on the Billboard Top 100. Which of the following performers appear in this table? Select all correct answers. HINTS: Your table should have 10 rows and 3 columns: performer, song, and count, where count represents the number of weeks that song appeared in the Billboard Top 100. Make sure the entries are sorted in descending order of the count variable, so that more popular songs appear at the top of the table. You'll want to use both performer and song in any group_by() operations, to account for the fact that multiple unique songs can share the same title. The n() function will be useful to generate counts.
-Carrie Underwood -Adele -AWOLNATION -Imagine Dragons
Match the following R commands with their appropriate use.
-Computes the sample proportion for a binary (yes/no) variable: prop -Repeatedly execute the same statement/s many times: do -Take a random sample of rows from a data frame: sample -Simulate a sequence of binary events and count the number of "yes" outcomes: nflip
Which of the following statements about our course Midterm Exam is accurate? Select all correct answers.
-If you don't have RStudio installed and working on your computer, you will not be able to complete all the questions on the exam. -You will take the midterm exam using your own computer in our classroom during class. -The midterm exam will be open-notes and open-book and open-internet.
Which of the following statements about R libraries are correct?
-Libraries need to be loaded each time you want to use them. -A library is a piece of software that provides additional functionality to RStudio, beyond what's contained in the basic R installation.
Match the terms below to their correct definitions.
-Population: The set of all possible cases that might have been included in a data set. -Sample: A specific selection of cases from the population. -Data frame: A tabular representation of a data set in which the rows correspond to cases and the columns to variables. -Code book: A file or separate list that provides all the necessary information to interpret each variable in a data set -Unit of analysis: The type of entity you choose to focus on in a data analysis -Sampling bias: Any systematic discrepancy between a sample and the corresponding population
A data-science team at a large grocery chain observes the quantity of ice cream cartons sold at different levels of outside temperature each day. Their goal is to use a statistical model to understand how changes in temperature predict changes in consumer demand for ice cream, as measured by quantity sold on a given day. Select all correct answers below that describe this model.
-Quantity sold should be the response variable, and temperature should be the predictor variable. -If the data scientists use a linear model, the model intercept represents, mathematically, what we'd expect ice cream sales quantity to be if the outside temperature was exactly 0.
Required materials for this class include which of the following? Select all correct answers
-R and RStudio installed on your computer. -A free online book, "Data Science in R: A Gentle Introduction" -Chrome or Firefox web browser for working in Canvas, of particular importance during quizzes and exams. -Laptop or desktop computer with modern and updated operating system (MacOS, Windows or Linux).
Which of the following are accurate statements about R^2? Select all correct answers.
-R^2 ranges from 0 to 1 -If a linear model produces an R^2 equal to 0.77, it indicates that 23% of the variability in Y is predicted by factors other than variation in X
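Both statements can be illustrated with a small sketch. It uses R's built-in mtcars data as a stand-in, which is an assumption for illustration only and not one of the course data sets:

```r
# Illustration on R's built-in mtcars data (not a course data set)
fit <- lm(mpg ~ wt, data = mtcars)

# Share of the variation in mpg that is predictable from wt
r2 <- summary(fit)$r.squared

# The remaining share is attributable to everything other than wt
unexplained <- 1 - r2

# With a single predictor, R^2 equals the squared correlation,
# which is why it must fall between 0 and 1
squared_cor <- cor(mtcars$mpg, mtcars$wt)^2

print(c(r2 = r2, unexplained = unexplained, squared_cor = squared_cor))
```

An R^2 of 0.77 would therefore leave 1 - 0.77 = 0.23 of the variability unexplained by X.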
Match the sampling terms below to the best description below.
-Random sample: A sample in which every member of the population is equally likely to be included. -Longitudinal sample: A sample based on individual cases tracked over time with respect to one or more variables. -Cross-sectional sample: A sample based on various attributes of individual cases collected at a single point in time. -Convenience sample: A sample based on individual units that were not selected at random from the population of interest. -Census: A sample that comprises all cases in the population.
Match the following data wrangling operation with its corresponding R function or feature.
-Select a subset of cases in a data frame that meet certain criteria: filter() -Generate statistics to characterize sets of values: summarize() -Alter the structure of a data frame: group_by() -Add new variables to the data frame based on existing variables: mutate() -Combine multiple operations into a single sequence: The pipe operator: %>%
Which of the following statements about the Final Exam is accurate? Select all correct answers.
-The Final Exam must be completed alone. You may not seek or receive aid in answering questions from any classmate or any other person. -You will take the final exam using your own computer in the location of your choice. -If your final exam score is higher than your midterm score, then it will replace your midterm score when calculating your final course grade. -The final exam will be open-notes and open-book and open-internet.
Consider the data in power_christmas.csv, which contains hourly data on the electrical grid load in Texas on Christmas day in each of three years: 2010, 2011, and 2012. Each row corresponds to a single hour. The three variables in this data frame are: hour: hour of the day, where 0 = midnight, 12 = noon, 23 = 11 PM, etc. date: the calendar date of the data point. Three levels, corresponding to Christmas day in 2010, 2011, and 2012. ERCOT: the hourly peak power demand on the ERCOT grid during that hour. Use this data and ggplot() to make a line graph of peak power demand (y) versus hour of the day (x), faceted by date. Based on this plot, please select all accurate statements from those below.
-The highest single-hour peak demand in this data set occurred in 2012. -2012 exhibits a different afternoon pattern than the other two years, because in that year peak demand continues to increase between hour=11 and hour=15. -The lowest single-hour peak demand in this data set occurred in 2012.
The plot below shows people's perceptions of probability associated with various English phrases. Specifically, survey respondents were asked to provide a numerical probability that they felt best corresponded to a given phrase, such as "Highly likely." Which of the following is true of the plot below? Select all correct answers.
-The horizontal orientation of boxplots is effective because it makes it easy to read the phrase categories. -Side-by-side boxplots are an effective plot design to display the distribution of a numerical variable (assigned probability) across different levels of a categorical variable (phrases representing perceptions of probability). -The design choice to use a different color for every category is not effective in interpreting the data because the colors do not encode information useful for comparison.
Which of the following statements about course policies on Class Participation is accurate? Select all correct answers.
-The number of available class participation points for the semester will not be determined in advance. -Students who earn 70% of available class participation points will receive full credit on the class participation portion (5%) of their course grade. -Class participation points may be earned from in-class activities including polls, reading quizzes, and short feedback surveys.
Bitcoin is a decentralized digital currency that functions without a central bank. It was first released as open-source software in 2009. Since then, many of these cryptocurrencies have raised money through Initial Coin Offerings (ICOs). A data frame gives information on nine ICOs that have raised more than $100 million. For each ICO, the dataset lists an ID (numbered from 1 to 9), the name of the offered currency, the location of the ICO, the date of the ICO, the price of Bitcoin on the date of the ICO, and the amount of funding raised (in millions of U.S. dollars). Which of the following variables in this data frame is/are numerical? Select all correct answers.
-The price of Bitcoin on the date of the ICO -Amount raised
This is one of multiple questions about the data in CPS85.csv. This data is from the 1985 Current Population Survey (CPS), which is used to supplement U.S. census information between census years. These data consist of a random sample of people living in the U.S., with information on wage and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. Consider the married variable, which indicates whether someone is married or single. Form a bootstrap sampling distribution, using at least 10,000 bootstrap samples, for the proportion of respondents who are married. Use this sampling distribution to select which of the following statements is/are true. Select all accurate statements. Note: you might get numbers that differ from our numbers by a few tenths of a percentage point. This is down to Monte Carlo variability. So if a statement looks basically correct but the numbers are slightly different in that third decimal place, mark it as correct. We are not trying to trick you with subtle differences in rounding.
-The standard error for the proportion of married respondents in this sample is about 0.02, or 2%. -We're 95% confident that the proportion of married individuals in the wider U.S. population is somewhere between about 61.5% and 69.5%.
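The bootstrap procedure described above can be sketched in base R. This is an illustrative alternative to the course's own workflow, and the 0/1 vector below is made-up data, not the CPS85 sample:

```r
set.seed(301)  # arbitrary seed so the run is reproducible

# Hypothetical 0/1 sample: 1 = married, 0 = single (NOT the CPS85 data)
married <- rep(c(1, 0), times = c(130, 70))

# Each bootstrap sample redraws all n cases with replacement,
# and we record the proportion married in each resample
boot_props <- replicate(10000, mean(sample(married, replace = TRUE)))

# Bootstrap standard error: the standard deviation of the resampled proportions
boot_se <- sd(boot_props)

# Percentile-based 95% confidence interval
boot_ci <- quantile(boot_props, c(0.025, 0.975))

print(boot_se)
print(boot_ci)
```

Because the resampling is random, repeated runs with different seeds differ slightly in the third decimal place, which is the Monte Carlo variability the question mentions.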
Is the "musical diversity" of the Billboard Top 100 changing over time? Let's measure the musical diversity of a given year as the number (count) of unique songs that appeared in the Billboard Top 100 that year. Use the code below to count the number of unique songs that appeared on the Top 100 in each year: yearlycounts = billboard %>% group_by(year, song, performer) %>% summarize(n = n()) %>% group_by(year) %>% summarize(n = n()) After running the code above, you should have a new data frame in the RStudio data environment called "yearlycounts". Use this new data frame to make a line graph that plots the measure of musical diversity over the years. The x axis should show the year, while the y axis should show the number of unique songs (i.e., our calculated measure of musical diversity). Based on your line graph, evaluate the statements below. Select all accurate statements.
-The year 2020 saw more than 800 unique songs on the Billboard Top 100 for the first time since before the 1970s. -The 1960s had the greatest musical diversity relative to other decades. -The time period that we observed with lowest musical diversity is around the year 2000.
The data in nobel_winners.csv has information on all Nobel Prizes awarded from 1901 through 2016. Make a side-by-side boxplot of the winners' ages (age_of_winner), stratified by gender. Use this plot to evaluate the statements below. Select all accurate statements.
-The youngest Nobel winner in this data set was female. -Male Nobel winners have a higher median age.
Which of the following statements about outliers is/are accurate? Select all correct answers.
-There is no generally accepted formal definition of what constitutes an outlier. -If an outlier noticeably changes the results of your analysis, it's a good idea to report results both with and without the outlier included.
Make sure you're familiar with the concept of an "object" from our course packet, which is essential to coding in R. Then consider the following block of R code. objectA = 10; objectB = 2*objectA + 10; objectC = objectB/6 Which of the following statements about this code block are correct?
-This block of code illustrates the assignment of values to objects using R's assignment operator (=). -This code block will run just fine if you type it directly into the console. But the best practice here would be to type the commands into a script instead, and then run those commands from the script. -If we were to run this block of code, one result will be that objectB should now store the value 30.
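Written out one statement per line, as it would appear in a script, the code block from the question runs as follows. The print() calls are additions for illustration:

```r
# Assign values to objects with R's assignment operator (=)
objectA = 10
objectB = 2*objectA + 10   # uses the value stored in objectA
objectC = objectB/6        # uses the value stored in objectB

# Inspect the stored values
print(objectB)
print(objectC)
```

Here objectB stores 2*10 + 10 = 30, so objectC stores 30/6 = 5.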
ggplot(tvshows) + geom_point(aes(x=GRP, y=PE)) + facet_wrap(~Genre) Which of the following statements about this code block are correct? Choose all correct answers.
-This code block illustrates the use of faceting, and will produce a panel of multiple scatter plots (one plot for each genre). -This plot will show the GRP variable on the horizontal axis and the PE variable on the vertical axis.
t1 = xtabs(~acl + lollapalooza, data=aclfest); t1 %>% prop.table(margin=2) %>% round(3) Which of the following statements about this code block are correct?
-This code creates a table of counts and stores it in an object called t1. -The pipe operator (%>%) is always used to feed the result of one calculation into the next calculation, as illustrated in this code block.
Which of the following statements about independence are correct? Select all correct answers.
-Two events A and B are independent if P(A) = P(A | B) -Two events A and B are independent if P(A,B) = P(A) • P(B)
Which of the following instructions should students follow for each homework assignment? Select all correct answers.
-Upload the homework write-up as a PDF file on the Canvas assignment page. -Do not include or re-type the homework assignment questions or instructions in your write-up. -Save and submit the R script that you created for analysis to complete the homework write-up. The R script will not be graded, but you must submit the script to receive credit on the write-up.
Which of the following are among best practices for effective plots? Select all correct answers.
-Use faceting to show the same basic plot across multiple conditions. -Present relevant comparisons and avoid irrelevant ones. -Incorporate clear labels and annotations that help the viewer make sense of the graphic.
Which of the following statements about course policies on Knowledge Checks (KCs) is accurate? Select all correct answers.
-We drop your lowest KC score in calculating your overall KC average for the semester -Any requests to drop an additional KC or for an extension must be accompanied with a verification of absence from Student Emergency Services -Canvas will save the highest score from your quiz attempts. For example, if you earn 90% on your second attempt and 84% on your third attempt, Canvas will retain your score of 90%
Which of the following statements about course grading policies is accurate? Select all correct answers.
-While we reserve the right to lower grade cutoff values (i.e., make them more generous) at our sole discretion, we will not raise them. -As with all McCombs classes, course grades are not numerically rounded (e.g., an 89.99 is not rounded up to 90). -The cutoff for a B grade is 84%.
Match the R function on the left to its primary purpose in fitting linear models on the right.
-lm: fit the model -coef: print the estimated parameters of the model -rsquared: calculate a measure of model fit -resid: generate error terms for each observed data point -geom_point: make a scatterplot
Match the definitions below with their appropriate concept / term.
-some aspect of the world about which we'd like to learn using data: estimand -a summary statistic designed to estimate some aspect of the world about which we'd like to learn using data: estimator -use of statistical computing to repeatedly simulate the random process that generated our data: Monte Carlo simulation -the average value of a summary statistic under repeated sampling from the same random process that generated our original sample: expected value -the standard deviation of a sampling distribution: standard error -data for the entire population of interest: census
The file nycflights13.csv contains data on all domestic flights departing the three major New York City area airports (LGA, JFK, and EWR) in 2013, including the following variables: dest: flight destination airport; carrier: abbreviation representing commercial airline; sched_dep_time: hour of scheduled departure on a 24-hour clock; dep_delay: departure delay in minutes; distance: distance from origin to destination in miles; origin: flight origin airport (EWR, LGA, or JFK). Suppose that you want to calculate the average departure delay in minutes (dep_delay) for Delta Airlines flights (carrier == 'DL') to Atlanta Hartsfield-Jackson (dest == 'ATL') scheduled to depart (sched_dep_time) after 8:00 AM. Which of the following functions will you need to calculate this summary statistic for a subset of cases in the data frame? Select all appropriate functions for this objective.
-summarize() -filter() -mean()
When we calculate your final homework average, how many homework grades do we drop?
0
There was fierce competition for tickets to the Jimmy Fallon show hosted on the UT Austin campus. A student who entered the ticket lottery had only a 5% chance of getting a ticket. However, students in the marching band had a 30% chance of getting a ticket. Of all students who entered the lottery, 7% were in the marching band. What proportion of lottery entrants received a ticket and were in the marching band? Enter your answer as a decimal rounded to three digits. For example, 53.7% = 0.537.
0.021
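The answer follows from the multiplication rule, P(ticket and band) = P(ticket | band) * P(band). A minimal sketch with illustrative object names:

```r
# Probabilities given in the question
p_band <- 0.07               # P(in marching band)
p_ticket_given_band <- 0.30  # P(ticket | in marching band)

# Multiplication rule for a joint probability
p_ticket_and_band <- p_ticket_given_band * p_band
print(round(p_ticket_and_band, 3))
```

The overall 5% ticket rate is not needed for this particular joint probability.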
Use the music festivals dataset (aclfest.csv) --- as seen in class and in Lesson 3 on Counting --- to answer the following. What is the joint probability that a band from this sample played both Outside Lands and Bonnaroo in the same year? Express your answer as a probability between 0 and 1, and round your answer to 3 decimal places.
0.026
This is one of multiple questions about the data in utsat.csv, which contains the SAT scores and graduating college GPAs for UT students. The relevant variables in this data set are: SAT.V: score on the verbal section of the SAT (200-800); SAT.Q: score on the quantitative section of the SAT (200-800); SAT.C: combined SAT score; School: college or school at which the student first matriculated (not necessarily where they ended up); GPA: college GPA upon graduation, on a 4-point scale. This data constitutes a census of a specific population: every UT student who entered UT in a specific recent year and went on to graduate from UT within 6 years. But what happens if we take samples from this population? Use the data to answer the following question. Simulate 10,000 samples of size n=100 from this data set. Based on this Monte Carlo simulation, the standard error of the proportion of Business students represented in a sample of size 100 is closest to which of the following? Hint: you might find this easier if you use mutate() to define a new binary variable to identify Business students.
0.04
The file plays_top50.csv contains data about 15,000 users of a music streaming service. The first column is a unique numerical identifier for the user. The remaining columns are for the top 50 artists most frequently streamed by this particular subset of users. The entries in the data frame represent "did play" (1) and "did not play" (0), with 1 meaning that a given user streamed a given artist at least once during the data-collection period. For a randomly selected user from this sample, what is P(plays Franz Ferdinand)? Express your answer as a number between 0 and 1, rounded to 3 decimal places.
0.059
This is one of multiple questions about the data in utsat.csv, which contains the SAT scores and graduating college GPAs for UT students. The relevant variables in this data set are: SAT.V: score on the verbal section of the SAT (200-800); SAT.Q: score on the quantitative section of the SAT (200-800); SAT.C: combined SAT score; School: college or school at which the student first matriculated (not necessarily where they ended up); GPA: college GPA upon graduation, on a 4-point scale. This data constitutes a census of a specific population: every UT student who entered UT in a specific recent year and went on to graduate from UT within 6 years. But what happens if we take samples from this population? Use the data to answer the following question. Simulate 10,000 samples of size n=50 from this data set. Based on this Monte Carlo simulation, the standard error of the mean GPA for a sample of size 50 is closest to which of the following?
0.07
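The Monte Carlo idea in this question can be sketched in base R. The population below is simulated and purely hypothetical (it is NOT the utsat data), so the number it prints is not the quiz answer:

```r
set.seed(301)  # arbitrary seed so the run is reproducible

# Hypothetical population of GPAs, clipped to the 0-4 scale (NOT the utsat data)
population_gpa <- pmin(4, pmax(0, rnorm(5000, mean = 3.2, sd = 0.5)))

# Draw 10,000 samples of size n = 50 (without replacement) and
# record the mean GPA of each sample
sample_means <- replicate(10000, mean(sample(population_gpa, size = 50)))

# Standard error = standard deviation of the sampling distribution
mc_se <- sd(sample_means)
print(mc_se)

# Rough theoretical benchmark: population sd divided by sqrt(n)
print(sd(population_gpa) / sqrt(50))
```

The simulated standard error should land close to the benchmark sd/sqrt(n), which is a useful sanity check on any simulation of this kind.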
microfi_households %>% filter(village == 46) %>% summarize(loan_prop = mean(loan)) %>% round(3) Upon running these commands, you should see a three-digit decimal number printed out under the word "loan_prop" in your console. What number do you see?
0.073
The file plays_top50.csv contains data about 15,000 users of a music streaming service. The first column is a unique numerical identifier for the user. The remaining columns are for the top 50 artists most frequently streamed by this particular subset of users. The entries in the data frame represent "did play" (1) and "did not play" (0), with 1 meaning that a given user streamed a given artist at least once during the data-collection period. For a randomly selected user from this sample, what is P(plays Bob Dylan | plays the Beatles)? Express your answer as a number between 0 and 1, rounded to 3 decimal places.
0.194
The data in the CPS85.csv file are from the 1985 Current Population Survey (CPS), which is used to supplement U.S. census information between census years. The data frame is a random sample of people living in the U.S. for whom we have information on wages and other characteristics, including sex, number of years of education, years of work experience, occupational status, region of residence, and union membership status. Consider the wage variable, which refers to the hourly wage (in nominal 1985 dollars) of each respondent. Form a bootstrap sampling distribution, using at least 10,000 bootstrap samples, for the mean wage of respondents. The bootstrap standard error of the mean wage is closest to which of the following?
0.2
Research from 2019 indicates that 60% of US adults ages 18-29 have used Snapchat, 65% have used Instagram, and 47% have used both. What is the probability that someone in this demographic uses neither Snapchat nor Instagram? Express your answer as a probability between 0 and 1, rounded to two decimal places.
0.22
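This answer combines the addition rule with the complement rule. A minimal sketch with illustrative object names:

```r
# Probabilities given in the question
p_snap  <- 0.60   # P(used Snapchat)
p_insta <- 0.65   # P(used Instagram)
p_both  <- 0.47   # P(used both)

# Addition rule: subtract the overlap so it is not counted twice
p_either <- p_snap + p_insta - p_both

# Complement rule: "neither" is the complement of "at least one"
p_neither <- 1 - p_either
print(round(p_neither, 2))
```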
Research indicates that 60% of US adults ages 18-29 have used Snapchat, 65% have used Instagram, and 47% have used both. What is the probability that someone in this demographic uses neither Snapchat nor Instagram?
0.22
The file plays_top50.csv contains data about 15,000 users of a music streaming service. The first column is a unique numerical identifier for the user. The remaining columns are for the top 50 artists most frequently streamed by this particular subset of users. The entries in the data frame represent "did play" (1) and "did not play" (0), with 1 meaning that a given user streamed a given artist at least once during the data-collection period. For a randomly selected user from this sample, what is P(plays Coldplay or plays Muse)? Express your answer as a number between 0 and 1, rounded to 3 decimal places.
0.234
The British ocean liner Titanic sank into the North Atlantic Ocean on April 15, 1912. This contingency table displays the survival outcomes for 2,201 passengers:
Passenger Class or Crew  Survived  Did not survive
1st class                203       122
2nd class                118       167
3rd class                178       528
Crew                     212       673
What is P(Crew | Survived)?
0.298
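The conditional probability can be computed directly from the counts by restricting attention to the Survived column. A minimal sketch; the object names are illustrative:

```r
# Counts from the contingency table in the question
titanic <- matrix(
  c(203, 122,
    118, 167,
    178, 528,
    212, 673),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    group   = c("1st class", "2nd class", "3rd class", "Crew"),
    outcome = c("Survived", "Did not survive")
  )
)

# P(Crew | Survived) = surviving crew / all survivors
p_crew_given_survived <- titanic["Crew", "Survived"] / sum(titanic[, "Survived"])
print(round(p_crew_given_survived, 3))
```

Conditioning on Survived means the denominator is the 711 survivors, not all 2,201 people in the table.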
A media buyer cross-tabulates her data set with information about 200 brands in her portfolio and the platforms on which each brand is allocating advertising resources.
                       Buy ads on Facebook?
                       Yes    No
Buy ads on    Yes      80     53
YouTube?      No       46     21
Of the brands that (Yes) buy ads on Facebook, the proportion that (No) do not buy ads on YouTube is closest to which of the following?
0.37
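This is a column-wise conditional proportion, the same calculation that prop.table() with margin = 2 performs. A minimal sketch built from the counts in the question, with illustrative object names:

```r
# Rows: buys ads on YouTube; columns: buys ads on Facebook
ads <- matrix(
  c(80, 53,
    46, 21),
  nrow = 2, byrow = TRUE,
  dimnames = list(youtube = c("Yes", "No"), facebook = c("Yes", "No"))
)

# margin = 2 divides each cell by its column total,
# giving proportions conditional on the Facebook column
col_props <- prop.table(ads, margin = 2)

# P(no YouTube ads | buys Facebook ads)
p_no_youtube_given_facebook <- col_props["No", "Yes"]
print(round(p_no_youtube_given_facebook, 2))
```

The denominator is the 126 brands that buy Facebook ads (80 + 46), not all 200 brands.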
Use the music festivals dataset (aclfest.csv) --- as seen in class and in Lesson 3 on Counting --- to answer the following. What is the conditional probability that a band played Coachella, given that they played ACL Fest in the same year? Express your answer as a probability between 0 and 1, and round your answer to 3 decimal places.
0.392
This contingency table displays the distribution of the 538 electoral college votes in the 2016 Presidential Election with respect to geographical region and election outcome (Republican or Democratic).
Total Electoral Votes  Democratic  Republican
South                  0           162
West                   98          30
Northeast              101         29
Midwest                30          88
Consider the following events: (a) an electoral vote is cast in the South region; (b) an electoral vote is cast for the Democratic candidate. Using probabilities derived from the contingency table above, which of the statements below is/are correct? The two events South and Democratic: (1) are mutually exclusive. (2) are not independent. (3) are equally likely.
1 and 2
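Each of the three statements can be checked numerically from the counts in the table. A minimal sketch with illustrative object names:

```r
# Electoral votes by region and party, from the table in the question
votes <- matrix(
  c(0,   162,
    98,  30,
    101, 29,
    30,  88),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    region = c("South", "West", "Northeast", "Midwest"),
    party  = c("Democratic", "Republican")
  )
)
total <- sum(votes)

p_south <- sum(votes["South", ]) / total
p_dem   <- sum(votes[, "Democratic"]) / total
p_south_and_dem <- votes["South", "Democratic"] / total

# (1) Mutually exclusive: the joint probability is zero
mutually_exclusive <- p_south_and_dem == 0

# (2) Independent only if P(A, B) = P(A) * P(B)
independent <- isTRUE(all.equal(p_south_and_dem, p_south * p_dem))

# (3) Equally likely only if P(A) = P(B)
equally_likely <- isTRUE(all.equal(p_south, p_dem))

print(c(mutually_exclusive = mutually_exclusive,
        independent = independent,
        equally_likely = equally_likely))
```

Two events with nonzero probabilities that never occur together cannot be independent, which is why statements (1) and (2) hold at the same time.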
Which of the following is true of standard error and the similar-sounding but conceptually different "margin of error"? (1) The number referred to as the "margin of error" is not a characteristic of a particular sample but rather associated with the sampling procedure. (2) The "margin of error" --- usually operationalized as one or two multiples of the standard error --- is a colloquial term without a fixed formal definition. (3) The "margin of error" always means the same thing: it is the standard deviation of the sampling distribution.
1 and 2
The greenbuildings.csv data set contains data on thousands of commercial real-estate properties nationwide. The stories variable is the height of the building in stories. Based on this distribution, what is the z-score for a 33-story building?
1.58
In RStudio, load the dataset diamonds -- built into the tidyverse package -- by running the following commands in your R script: library(tidyverse); data(diamonds) The data set describes features of almost 54,000 diamonds, including the following variables: price: price in U.S. dollars; carat: weight of the diamond; cut: the quality of the cut, from Fair to Ideal; color: a letter code indicating the color of the diamond. What is the median weight (carat variable) for diamonds with a price of at least $10,000? Round your answer to two decimal places.
1.69
This is one of three questions about the data in creatinine.csv. Each row represents a patient in a doctor's office. The variables are: age: patient's age in years. creatclear: patient's creatinine clearance rate in mL/minute, a measure of kidney health (higher is better). Fit a linear model that predicts a patient's creatinine clearance rate in terms of their age. Use this model to answer the following question. Based on this model, what creatinine clearance rate should we expect, on average, for a 55-year-old? Round your answer to the nearest whole number.
114
FiveThirtyEight published an analysis of patterns in Super Bowl television commercials from the ten brands that aired the most spots during 2000--2020. These data are in the superbowl.csv file. Run the following code to remove missing (NA) values from the YouTube view_count and like_count variables: superbowl = superbowl %>% filter(!is.na(view_count), !is.na(like_count)) Then, use more data wrangling to answer the following. The median like_count for advertisements that feature animals is closest to which of the following?
118
The data in olympics_top20.csv contains information on every Olympic medalist in the top 20 sports by participant count, all the way back to 1896. Use these data to answer the following question. The 95th percentile of heights for female competitors in 'Athletics' events (i.e., track and field) is closest to which of the following? Hint: Create a data frame that only contains cases that meet the above criteria for variables sex and sport. Then, calculate the 95th percentile.
183 centimeters
The file nycflights13.csv contains data on all domestic flights departing the three major New York City area airports (LGA, JFK, and EWR) in 2013, including the following variables: dest: flight destination airport; carrier: abbreviation representing commercial airline; sched_dep_time: hour of scheduled departure on a 24-hour clock; dep_delay: departure delay in minutes; distance: distance from origin to destination in miles; origin: flight origin airport (EWR, LGA, or JFK). Let's define a "long-haul flight" as one with a distance exceeding 2500 miles. Wrangle the data to identify which of the three NYC airports (origin) had the most long-haul flights in 2013. How many long-haul flights originated at that airport in 2013? Enter your answer below.
9,471
The data in olympics_top20.csv contains information on every Olympic medalist in the top 20 sports by participant count, all the way back to 1896. Use these data to answer the following: How has the average age of Olympic swimmers changed over time? Does the trend look different for male swimmers relative to female swimmers? Create a data frame to visualize these groups over time, then plot the data with a line graph. Which of the following is true of this plot? (1) The mean age of male and female swimmers has generally increased since the 1950s. (2) After 1924, male swimmers are generally a few years older than female swimmers. (3) The two groups had an approximately equal average age in the 2000 Summer Olympics.
All of the above (1, 2, and 3)
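The summary data frame and line graph could be built roughly as follows. This is a base R sketch on made-up data. The column names `year` and `age` and the codes "Swimming", "M", and "F" are assumptions, and the course's tidyverse verbs (group_by, summarize, ggplot) would do the same job.

```r
# Toy stand-in for olympics_top20.csv; values are made up for illustration.
olympics <- data.frame(
  year  = c(1952, 1952, 1952, 1952, 2000, 2000, 2000, 2000, 2000),
  sex   = c("M", "M", "F", "F", "M", "M", "F", "F", "M"),
  sport = c(rep("Swimming", 8), "Athletics"),
  age   = c(20, 22, 17, 19, 23, 25, 22, 24, 30)
)

# Keep swimmers only, then compute the mean age for each year and sex.
swimmers <- subset(olympics, sport == "Swimming")
mean_age <- aggregate(age ~ year + sex, data = swimmers, FUN = mean)
mean_age

# Line graph: one line per sex, mean age over time.
plot(mean_age$year, mean_age$age, type = "n",
     xlab = "Year", ylab = "Mean age of swimmers")
for (s in unique(mean_age$sex)) {
  one_sex <- subset(mean_age, sex == s)
  lines(one_sex$year, one_sex$age, lty = ifelse(s == "F", 1, 2))
}
legend("topleft", legend = c("F", "M"), lty = c(1, 2))
```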
The graphic seen above needs improvement for reasons that include which of the following? (1) The use of a three-dimensional figure is distorting such that the "Apple" segment (19.5%) appears larger than the "Other" segment (21.2%). (2) Colorblind individuals who may be in the audience may struggle to interpret the plot. (3) The use of a legend rather than direct labeling of the segments asks the audience to repeatedly pivot back and forth from legend to plot when interpreting the graphic.
All of the above (1, 2, and 3)
Which of the following three statements is/are accurate? The grammar of graphics is a theoretical framework for data visualization that: (1) defines a set of rules for creating graphics by combining different types of layers. (2) is implemented in R with the ggplot2 package. (3) conceptualizes a statistical graphic as a mapping of data variables to aesthetic attributes of geometric objects.
All of the above (1, 2, and 3)
The data in the CPS85.csv file are from the 1985 Current Population Survey (CPS), which is used to supplement U.S. census information between census years. The data frame is a random sample of people living in the U.S. for whom we have information on wages and other characteristics, including sex, number of years of education, years of work experience, occupational status, region of residence, and union membership status. Consider the union variable, which indicates whether someone is in a union or not. Form a bootstrap sampling distribution, using at least 10,000 bootstrap samples, for the proportion of respondents who are not in a union in this sample. Use this to form a 95% confidence interval for the proportion of Americans who are NOT in a union: Lower bound closest to: 0.79 Upper bound closest to: 0.85
Answer 1: 0.79 Answer 2: 0.85
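A percentile bootstrap for a proportion can be sketched in base R as below. The vector `union_status`, its labels "Not" and "Union", and its counts are made up for illustration, so the resulting interval is not the answer above. The course may use a different bootstrapping helper, but the resample-and-recompute logic is the same.

```r
set.seed(301)  # makes the sketch reproducible

# Toy stand-in for the union column of CPS85.csv; labels and counts are made up.
union_status <- c(rep("Not", 80), rep("Union", 20))

# Each bootstrap sample: resample with replacement, recompute the proportion.
boot_props <- replicate(10000, {
  resampled <- sample(union_status, size = length(union_status), replace = TRUE)
  mean(resampled == "Not")
})

# Percentile-based 95% confidence interval.
ci <- quantile(boot_props, probs = c(0.025, 0.975))
ci
```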
The data in the CPS85.csv file are from the 1985 Current Population Survey (CPS), which is used to supplement U.S. census information between census years. The data frame is a random sample of people living in the U.S. for whom we have information on wages and other characteristics, including sex, number of years of education, years of work experience, occupational status, region of residence, and union membership status. Consider the wage variable, which refers to the hourly wage (in nominal 1985 dollars) of each respondent. Form a bootstrap sampling distribution, using at least 10,000 bootstrap samples, for the MEDIAN wage of CPS survey respondents. Use this to form a 95% confidence interval for the MEDIAN wage in the wider population: Lower bound closest to: 7.4 Upper bound closest to: 8.5
Answer 1: 7.4 Answer 2: 8.5
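Only the statistic changes when bootstrapping a median, which a small helper makes explicit. In this sketch `boot_ci` is a hypothetical helper name (not from any package) and the wage values are made up, so the interval it prints is not the answer above.

```r
set.seed(301)  # makes the sketch reproducible

# Percentile bootstrap CI for any statistic of a numeric vector.
# boot_ci is a hypothetical helper, not a function from any package.
boot_ci <- function(x, stat, n_boot = 10000, level = 0.95) {
  boot_stats <- replicate(n_boot, stat(sample(x, size = length(x), replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2))
}

# Toy stand-in for the wage column of CPS85.csv; values are made up.
wage <- c(3.5, 4.0, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0,
          8.5, 9.0, 10.0, 11.0, 12.5, 15.0, 20.0, 24.98)

ci_median <- boot_ci(wage, median)
ci_median
```

Passing `mean` instead of `median` would give the corresponding interval for the mean wage.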
The file greenbuildings.csv contains data on 7280 commercial real-estate properties. Each row refers to a single building. The two variables of interest here are the building's age (in years) and class (A/B/C, indicating the overall quality of the building). Use ggplot to create a faceted histogram of building ages, faceted by class. Use this histogram to determine which of the following statements is/are accurate. Choose all correct statements. Note: you might find this easier if you make a density histogram.
Buildings in class A tend to be newer, relative to buildings in the other two classes. Fewer than half the buildings in Class C are newer than 50 years old. More than half the buildings in Class A are newer than 50 years old. Buildings in Class C are older, on average, than buildings in Class A or B.
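A faceted density histogram along these lines could be drawn with ggplot2 as sketched below. The data frame is a made-up stand-in, and the column names `age` and `class` are assumed to match the variable names given in the question. The bin width of 10 years is an arbitrary choice.

```r
library(ggplot2)

# Toy stand-in for greenbuildings.csv; values are made up for illustration.
buildings <- data.frame(
  age   = c(5, 12, 20, 35, 30, 45, 60, 75, 55, 80, 95, 110),
  class = rep(c("A", "B", "C"), each = 4)
)

# Density histogram of age, one panel per class, stacked for easy comparison.
p <- ggplot(buildings, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 10) +
  facet_wrap(~class, ncol = 1) +
  labs(x = "Building age (years)", y = "Density")
p

# Numeric check of the "newer than 50 years" statements: share under 50 by class.
share_under_50 <- tapply(buildings$age < 50, buildings$class, mean)
share_under_50
```

Stacking the panels in one column puts all three classes on a shared age axis, which makes the "newer" and "older" comparisons easier to read off.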
Students who feel that their homework was not evaluated fairly via the peer grading process will have an opportunity to appeal. Students who want to appeal a homework grade should do which of the following?
Complete the Homework Appeal Survey posted on Canvas within one week after grades are posted for that homework.
The colleges.csv data set includes the following variable: AdmissionYield: percentage of accepted applicants who decide to enroll at the institution. Which of the following plots would be the best choice to visualize the distribution of the AdmissionYield variable?
Histogram
Bitcoin is a decentralized digital currency that functions without a central bank. It was first released as open-source software in 2009. Since then, many of these cryptocurrencies have raised money through Initial Coin Offerings (ICOs). A data frame gives information on nine ICOs that have raised more than $100 million. For each ICO, the dataset lists an ID (numbered from 1 to 9), the name of the cryptocurrency, the location of the ICO, the date of the ICO, the price of Bitcoin on the date of the ICO, and the amount of funding raised (in millions of U.S. dollars). Which of the following best represents the cases of this data frame?
ICOs
Hermione earned a score of 97 on her Arithmancy midterm exam, for which the average score was 88 points and the standard deviation was 6 points. Lavender earned a score of 72 on her Astronomy midterm exam, for which the average score was 83 points and the standard deviation was 4 points. Parvati earned a score of 85 on her Ancient Runes midterm exam, for which the average score was 80 points and the standard deviation was 7 points. Which of these three midterm scores is most unusual/extreme?
Lavender's score in Astronomy is the most extreme.
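The comparison can be checked with z-scores, which measure how many standard deviations each score sits from its exam's mean. The score with the largest absolute z-score is the most extreme:

```latex
z = \frac{x - \bar{x}}{s}
\qquad
z_{\text{Hermione}} = \frac{97 - 88}{6} = 1.5
\qquad
z_{\text{Lavender}} = \frac{72 - 83}{4} = -2.75
\qquad
z_{\text{Parvati}} = \frac{85 - 80}{7} \approx 0.71
```

Since |-2.75| is the largest of the three, Lavender's score is the most unusual even though it is below average rather than above.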
The plot that we choose depends on the comparison we are trying to make and the type of data that we have. Consider this measure: the percentage of games won by the University of Texas women's soccer team in each season since the program was established in 1993. Which of the following plots would be the best choice to display the trajectory of this measure over time?
Line graph
The Austin City Council conducted a random sample of Austin residents on whether or not they approve of the upcoming 8.75-cent tax rate election to fund a mass transit plan including light rail, new bus routes, and a downtown subway system. Of the 1734 Austinites in the random sample, 51% approved of the plan. The standard error of this estimate was 1.25%, and the margin of error was set to be plus-or-minus two standard errors (2.5%). Which of the following may we conclude from this analysis? (1) Of Austinites in the overall population, 51% approve of the mass transit plan. (2) We are 100% certain, in light of the survey, that the proportion of the overall population of Austinites who approve of the transit plan is between 48.5% - 53.5%. (3) Nothing useful, because a random sample size of 1734 respondents is not sufficient to give us reliable insight about the approval rate in the overall population of more than a million people.
None of the above
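The interval implied by the stated margin of error works out as follows, using only the figures given in the question:

```latex
\hat{p} \pm 2\,\mathrm{SE} = 0.51 \pm 2(0.0125) = 0.51 \pm 0.025
\quad\Longrightarrow\quad [0.485,\ 0.535]
```

This is an interval of plausible values for the population approval rate, not a guarantee. Statement (1) treats the sample estimate as the exact population value, statement (2) claims 100% certainty when a two-standard-error interval corresponds to roughly 95% confidence, and statement (3) overlooks that the precision of a random sample depends on the sample size rather than on the size of the population.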
You earn a summer internship at an emerging start-up firm in the competitive smart phone market. On Day 1, you are asked to characterize the age distribution of new customers who have opened accounts during 2022. You use R to create the following summary table (age group: number of new accounts): Younger than 25 years: 6770; 25-34 years: 10481; 35-54 years: 12495; 55-64 years: 9704; 65 years and older: 853. But of course you also want to include a visualization in your report. Which of the following plots would be the best choice to display the data in this table?
Bar plot
The marketing team at VRBO, a vacation rental platform, wants to gain market share from competitors such as Airbnb and Hotels.com by attracting travelers to book accommodations on their platform. Survey results indicate that in the past year, 23% of the target market segment have used Airbnb, 35% have used Hotels.com, and 13% have used both Airbnb and Hotels.com. Consider two events in this market segment: A is the event that a randomly-selected individual has used Airbnb H is the event that a randomly-selected individual has used Hotels.com Which of the following is true of events A and H?
Events A and H are not independent and are not mutually exclusive.
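Both properties can be checked directly from the survey figures, reading "used both" as the probability of the intersection of A and H:

```latex
P(A)\,P(H) = 0.23 \times 0.35 = 0.0805 \neq 0.13 = P(A \cap H)
\quad\Rightarrow\quad \text{not independent}

P(A \cap H) = 0.13 > 0
\quad\Rightarrow\quad \text{not mutually exclusive}
```

Independence would require the joint probability to equal the product of the two marginal probabilities, and mutual exclusivity would require the joint probability to be zero. Neither holds here.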
If the distribution of a numerical variable is unimodal and skewed to the left, then the median of the distribution is:
greater than the mean.