sta kc vocab
Which of the following are among the five parameters of the bivariate normal distribution? (1) the correlation of X1 and X2 (2) the standard deviation of X1 (3) the joint probability of X1 and X2
1,2
Which of the following are among the core elements of a typical machine-learning pipeline? (1) Bootstrapping (2) Hypothesis tests (3) Feature engineering
3
The grammar of graphics is a theoretical framework for data visualization that: - is implemented in R with the ggplot2 package. - All of these are correct. - conceptualizes a statistical graphic as a mapping of data variables to aesthetic attributes of geometric objects. - defines a set of rules for creating graphics by combining different types of layers.
All of these are correct.
The National Weather Service collects annual data for the Austin metropolitan area. Of the following, which is a continuous random variable? - Count of tornado warnings - Number of days of precipitation - Annual rain accumulation
Annual rain accumulation
In order for a confidence interval based on de Moivre's equation to be valid, which of the following conditions must be true?
We must be forming a confidence interval for a population mean based on a sample mean.
Which of the following are among insights from our "Morning Show Science" lecture about how to be a smarter consumer of data-driven research? Select all correct answers. - Look at the original study in the journal where it was published, not just the summary in a media news outlet. - Consider what kind of study it was (e.g., an animal model or research involving human subjects). - How big was the sample used in the study? - What was the size of the effect observed? - Is this a single study or one of many pointing to similar results? - How was the study funded? - Was the study pre-registered?
all
A sampling distribution:
is the distribution of values of a summary statistic that we'd expect to see under repeated realizations of the same random data-generating process.
In the context of hypothesis testing, the test statistic: - measures the strength of evidence in the data against the null hypothesis. - should be less than 0.05 in order to reject the null hypothesis. - directly measures the probability that the null hypothesis is false. - All of these answers are correct.
measures the strength of evidence in the data against the null hypothesis.
Before an election, polling agencies estimate the percentage of voters who will vote for a particular candidate. Their estimates will be affected by random sampling variance. The best way to reduce the sampling variance of their estimate is to: - None of these answers is correct. - minimize bias when designing the survey. - study a larger population. - use a larger random sample.
use a larger random sample.
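As a rough illustration of why a larger random sample helps, the sketch below computes the standard error of a sample proportion for two poll sizes. The 50% support figure and the sample sizes of 400 and 1,600 are made-up values, and the formula assumes simple random sampling.

```python
import math

def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion, assuming simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical true support of 50% and two hypothetical poll sizes.
se_small = proportion_standard_error(0.5, 400)
se_large = proportion_standard_error(0.5, 1600)

print(f"n = 400:  standard error = {se_small:.4f}")
print(f"n = 1600: standard error = {se_large:.4f}")
```

Quadrupling the sample size halves the standard error, because the standard error shrinks in proportion to 1/sqrt(n).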
Which of the following are correct statements about "for" loops and "do" loops in R? - A "do" loop can always replace a "for" loop. - "For" loops are useful for chaining the results of computations together, with the result of one computation feeding into the next computation. - A "for" loop will always have a "counting" or "iterator" variable. - "do" loops allow us to repeat a calculation or simulation many times, as long as we don't require that the results of one simulation can affect the results of another simulation.
- "For" loops are useful for chaining the results of computations together, with the result of one computation feeding into the next computation. - A "for" loop will always have a "counting" or "iterator" variable. - "do" loops allow us to repeat a calculation or simulation many times, as long as we don't require that the results of one simulation can affect the results of another simulation.
Which of the following is true of Akaike information criterion (AIC)? Select all correct answers. - Higher AIC values indicate a better-fit model. - AIC is among the most common measures of predictive model performance. - AIC has a "penalizing" element such that it inflates in-sample RMSE, an optimistic proxy for out-of-sample RMSE. - Confidence intervals are more useful than AIC in the context of Machine Learning pipelines.
- AIC is among the most common measures of predictive model performance. - AIC has a "penalizing" element such that it inflates in-sample RMSE, an optimistic proxy for out-of-sample RMSE.
Which of the following statements about blocking and randomization are correct? - Both completely randomized designs and block designs control for confounding variables, including confounding variables that we don't even know about. - A randomized block design does not control for the placebo effect. - In a block design, participants within each block receive the same treatment. - In general, we'd expect that a completely randomized design controls for confounding variables better than randomization within blocks.
- Both completely randomized designs and block designs control for confounding variables, including confounding variables that we don't even know about.
Which of the following is true of observational studies? - None of these answers is correct. - The term "natural experiment" is a type of study where a researcher manipulates a natural phenomenon in order to observe a cause-and-effect relationship. - Cohort studies may be either prospective or retrospective. - Prospective studies involve looking backward in time to reconstruct a sequence of events involving the treatment and outcome variables.
- Cohort studies may be either prospective or retrospective.
Which of the following are among strategies to avoid overfitting? Select all correct answers. - Identifying all possible confounders and including them in the model. - Retrospectively sifting through data, cherry-picking a complicated detail that happened to be true. - Disallow complicated explanations by fitting only simple models to the data - If you use a big model, measure its performance on a testing set of unseen data.
- Disallow complicated explanations by fitting only simple models to the data - If you use a big model, measure its performance on a testing set of unseen data.
Which of the following statements is true about the normal random walk model, Y[t] = Y[t-1] + e[t], where e[t] is a normal shock with mean zero and standard deviation sigma? - Each shock is independent of all prior shocks. - The smaller sigma is, the quicker the system tends to wander away from its initial state. - The state of the system at time t, denoted Y[t], is independent of the state of the system at time t-1. - This random walk model depends upon the initial state of the system.
- Each shock is independent of all prior shocks. - This random walk model depends upon the initial state of the system.
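The random walk Y[t] = Y[t-1] + e[t] can be sketched with a loop in which each state feeds into the next. This is a minimal simulation, not a reference implementation: the function name, the seed, the step count, and the two starting values are all assumptions chosen for illustration.

```python
import random

def simulate_random_walk(y0, sigma, n_steps, rng):
    """Simulate Y[t] = Y[t-1] + e[t], where e[t] ~ Normal(0, sigma)."""
    path = [y0]
    for _ in range(n_steps):
        shock = rng.gauss(0.0, sigma)   # each shock is drawn independently
        path.append(path[-1] + shock)   # the new state builds on the previous one
    return path

# Two walks driven by identical shocks (same seed) but different initial states.
walk_from_0 = simulate_random_walk(0.0, 1.0, 100, random.Random(1))
walk_from_10 = simulate_random_walk(10.0, 1.0, 100, random.Random(1))

print(walk_from_0[-1], walk_from_10[-1])
```

Because both walks receive the same shocks, they stay exactly 10 units apart at every step, which shows how the path depends on the initial state.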
Which of the following statements is/are true of "colliders"? Select all correct answers. - "Collider" is just another synonym for a confounder. - Exploratory data analysis (like plots and simple model fits) will help you tell the difference between a confounder and a collider. - If our goal is to isolate a partial relationship between X and Y, confounders and colliders must both be adjusted for in our regression model. - If our goal is to isolate a partial relationship between X and Y, confounders should be included in the model, while colliders should be excluded.
- If our goal is to isolate a partial relationship between X and Y, confounders should be included in the model, while colliders should be excluded.
Which of the following statements about interactions and/or correlated predictors are correct? Choose all correct statements. - If two predictors x1 and x2 are very strongly correlated with each other, then it will be easier to isolate the partial effect of each predictor on y, compared with the situation where x1 and x2 are uncorrelated. - Interaction terms are used in regression to describe situations where the relationship between some x and the outcome y depends on context. - Anytime two predictor variables x1 and x2 are correlated, we would expect to need an interaction between those two variables in our model. - It is not possible to have an interaction between x1 and x2 in a situation where x1 and x2 are uncorrelated with each other.
- Interaction terms are used in regression to describe situations where the relationship between some x and the outcome y depends on context.
Suppose we're trying to build a smart-phone app that can take a picture of a food item, extract meaningful features from the raw pixels in the image, and use a regression model to classify the image as a "Hot Dog" or "Not Hot Dog." Does this sound more like we need the tools of statistics or machine learning? Why? - Statistics, because we're using a regression model. - Machine learning, because we care chiefly about the predictive accuracy of the classification model in the context of a larger pipeline of hardware and software. - Statistics, because the most important criterion for evaluating the app is whether the reason it predicts one way or another can be interpreted by human beings. - Machine learning, because we are trying to isolate the partial relationship between image color and whether it's a hot dog.
- Machine learning, because we care chiefly about the predictive accuracy of the classification model in the context of a larger pipeline of hardware and software.
Which of the following statements about matching are correct? Select all correct answers. - Matching can help to balance observed confounders between the treatment and control groups. - Matching is done in the absence of randomization. - Matching happens after the experimenter randomizes units to treatment or control groups. - Matching involves discarding cases in the original data set for which no close match is identified. - Matching happens before the experimenter randomizes units to treatment and control groups. - Matching is used to construct a data set that is balanced with respect to one or more known variables, from an initially unbalanced data set.
- Matching can help to balance observed confounders between the treatment and control groups. - Matching is done in the absence of randomization. - Matching involves discarding cases in the original data set for which no close match is identified. - Matching is used to construct a data set that is balanced with respect to one or more known variables, from an initially unbalanced data set.
Which of the following is true of randomization in experimental design? Select all correct answers. - The goal of randomization is to construct a sample that represents the population as well as possible. - Randomization works, essentially, by flipping a coin independently for each subject in a study: heads, you get the treatment; tails, you get the control. - The fundamental source of uncertainty in an experiment arises from the random assignment of experimental units to treatment or control. - We use randomization to identify nuisance factors that could affect the outcome of the experiment.
- Randomization works, essentially, by flipping a coin independently for each subject in a study: heads, you get the treatment; tails, you get the control. - The fundamental source of uncertainty in an experiment arises from the random assignment of experimental units to treatment or control.
Suppose we consider two positively correlated measurements (X1 and X2) made of some underlying system. Which of the following statements about "regression to the mean" is accurate? Select all correct answers. - Regression to the mean implies that if X1 is above its mean, then X2 is likely to be below its mean in order to make their average balance out. - Regression to the mean implies that if X1 is above its mean, then X2 will also be above its mean. - Regression to the mean implies that if X1 is above its mean, then X2 is also more likely than not to be above its mean -- but most likely by a smaller amount than X1 was. - Regression to the mean is a consequence of the general observation that most extreme events in life are a product of both systematic factors and luck.
- Regression to the mean implies that if X1 is above its mean, then X2 is also more likely than not to be above its mean -- but most likely by a smaller amount than X1 was. - Regression to the mean is a consequence of the general observation that most extreme events in life are a product of both systematic factors and luck.
Which of the following are among standard practices in Machine Learning? Select all correct answers. - Overfitting the training data. - Split your data into a training set and a testing set. - Include as many features/predictors in your model as you can possibly find. - Always test the performance of your model on a data set that wasn't used to fit the model in the first place. - Define a performance metric that provides a numerical summary of the quality of your model's predictions.
- Split your data into a training set and a testing set. - Always test the performance of your model on a data set that wasn't used to fit the model in the first place. - Define a performance metric that provides a numerical summary of the quality of your model's predictions.
Suppose we're looking at COVID-19 data from every county in the US, and we're trying to understand the relationship between various social-distancing measures taken in that county (x) and the growth rate of the virus in that county. We build a regression model that relates each county's COVID-19 growth rate (y) versus several predictors that measure the extent of each county's social distancing behavior. Our goal is to understand how different measures of social distancing seem related to the COVID-19 growth rate. Does this sound more like we need the tools of statistics or machine learning? Why? - Machine learning, because our focus is on understanding cause and effect. - Machine learning, because all we care about is making good predictions. - Statistics, because we care chiefly about helping stakeholders (policy-makers, health professionals, etc) understand and interpret an important partial relationship. - Statistics, because we are using regression modeling.
- Statistics, because we care chiefly about helping stakeholders (policy-makers, health professionals, etc) understand and interpret an important partial relationship.
Which of the following statements is/are true of mean-squared predictive error (MSPE) and root mean-squared predictive error (RMSPE)? Select all correct answers. - The RMSPE measure may be conceptualized as the standard deviation of a model's future forecasting errors. - Conventionally, we report RMSPE when sharing results of predictive models or using predictions to make decisions, because RMSPE is given in the original Y variable units. - We estimate MSPE ("put a hat on it") because calculating the true MSPE would require us theoretically to complete the totally impractical task of averaging over all possible samples of new data points. - The MSPE measure is a property of a fitted model, not a property of an individual data point.
- The RMSPE measure may be conceptualized as the standard deviation of a model's future forecasting errors. - Conventionally, we report RMSPE when sharing results of predictive models or using predictions to make decisions, because RMSPE is given in the original Y variable units. - We estimate MSPE ("put a hat on it") because calculating the true MSPE would require us theoretically to complete the totally impractical task of averaging over all possible samples of new data points. - The MSPE measure is a property of a fitted model, not a property of an individual data point.
Which of the following statements are true of the bivariate normal distribution? Select all correct answers. - The bivariate normal distribution is capable of describing both positive and negative associations. - In a bivariate normal model, the strength of association between X1 and X2 is described by a single correlation parameter. - The bivariate normal distribution is fully parametrized by 4 numbers: two means and two standard deviations, one for each variable. - If X1 and X2 follow a bivariate normal distribution, then X1 and X2 are necessarily positively correlated.
- The bivariate normal distribution is capable of describing both positive and negative associations. - In a bivariate normal model, the strength of association between X1 and X2 is described by a single correlation parameter.
Which of the following is true of the normal distribution model? Select all correct answers. - The normal distribution has "thin tails" because large outliers are unlikely to occur. - A normal random variable is a discrete random variable. - It is a family of bell-shaped density curves, each with a different mean and standard deviation. - The parameters of the normal distribution are the median and the rate. - Phenomena that don't look obviously normal can be sometimes described using the normal distribution as a building block. - The normal distribution originated as an approximation to the binomial distribution. - The area under a normal density curve represents probability.
- The normal distribution has "thin tails" because large outliers are unlikely to occur. - It is a family of bell-shaped density curves, each with a different mean and standard deviation. - Phenomena that don't look obviously normal can be sometimes described using the normal distribution as a building block. - The normal distribution originated as an approximation to the binomial distribution. - The area under a normal density curve represents probability.
In the assigned Wired article, which of the following were among Dr. Andrew Gelman's criticisms of the term "p-hacking"? Select all correct answers. - "P-hacking" is insufficiently common, and the issues it encompasses are insufficiently important, to warrant any special term for the practice. - "P-hacking" is an inappropriately derogatory term for a perfectly legitimate scientific practice. - The term "p-hacking" wrongly implies that most abuse of this form is outright cheating, when in reality it's rarely something the researcher actually intends to do. - The term "p-hacking" obscures how common it is for researchers to get lost in all the decisions that go into data analysis, and not even realize that they've gone astray.
- The term "p-hacking" wrongly implies that most abuse of this form is outright cheating, when in reality it's rarely something the researcher actually intends to do. - The term "p-hacking" obscures how common it is for researchers to get lost in all the decisions that go into data analysis, and not even realize that they've gone astray.
Why do we need a control group in experimental design? Choose all correct answers. - To rule out alternative explanations for the outcome of the experiment related to placebo effects. - To rule out natural change and variation (as distinct from change related to some experimental treatment) as an explanation for the outcomes we observe. - To identify possible confounders in the treatment group. - To provide a basis for comparison for those who received the treatment.
- To rule out alternative explanations for the outcome of the experiment related to placebo effects. - To rule out natural change and variation (as distinct from change related to some experimental treatment) as an explanation for the outcomes we observe. - To provide a basis for comparison for those who received the treatment.
Best practices in good experimental design include which of the following? Select all correct answers. - Use a control group. - When an experimental design uses blocking, randomization is not required for a rigorous study. - Use blocking when you can. - When there is a potential confounding variable for which it is impractical or impossible to use blocking, the researcher should use their best judgment to decide which study participants will go in the treatment group and which will go into the control group.
- Use a control group. - Use blocking when you can.
Which of the following are among guidelines for data scientists on what variables to exclude from multiple regression models? Select all correct answers. - Important confounding variables without which the model is vulnerable to omitted-variable bias. - Variables that do not explain any variability in the response variable y. - Variables that represent a common effect of both x and y (rather than a common cause of both x and y). - Variables that convey information about y that is redundant to that information conveyed by other variables in the model.
- Variables that do not explain any variability in the response variable y. - Variables that represent a common effect of both x and y (rather than a common cause of both x and y). - Variables that convey information about y that is redundant to that information conveyed by other variables in the model.
Select all correct answers. While randomized controlled trials (RCTs) are considered the gold standard of evidence, they are not a panacea because: - assigning the treatment to participants may be unethical. - they are the only way to establish causality. - they can potentially be very expensive. - it may be hard to generalize to a wider population if participants are systematically different from that population. - they can only balance potential confounders that we know about.
- assigning the treatment to participants may be unethical. - they can potentially be very expensive. - it may be hard to generalize to a wider population if participants are systematically different from that population.
Which of the following statements about correlation are correct? Select all correct answers. - If the correlation between X and Y is very high (close to 1), then we would expect corresponding high levels of regression to the mean for X and Y. - cor(X, Y) = cor(Y, X) - Correlation takes on only values ranging between -1 and 1. - Correlation depends on the units in which the X and Y variables are measured.
- cor(X, Y) = cor(Y, X) - Correlation takes on only values ranging between -1 and 1.
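The properties in this answer can be checked numerically. The sketch below uses a hand-written Pearson correlation and made-up temperature and sales figures; the helper name and the data are illustrative assumptions only.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: daily temperature in Celsius and units sold.
celsius = [10.0, 15.0, 20.0, 25.0, 30.0]
sales = [12.0, 18.0, 21.0, 30.0, 33.0]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]  # same temperatures, different units

r_celsius = pearson_correlation(celsius, sales)
r_fahrenheit = pearson_correlation(fahrenheit, sales)
r_swapped = pearson_correlation(sales, celsius)

print(r_celsius, r_fahrenheit, r_swapped)
```

All three values agree up to floating-point rounding: swapping the arguments or changing the measurement units leaves the correlation unchanged.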
The normal distribution would be an appropriate probability model in which of the following contexts? (1) As an approximation for a large-N binomial model (2) Characterizing a phenomenon for which large deviations from the mean, of three standard deviations or more, are frequent events. (3) Describing a situation where we count the number of events over a fixed time interval, under the assumption that successive events are independent and occur at a constant rate.
1
Which of the following is a continuous random variable? (1) The high temperature in Austin today. (2) The number of undergraduate majors at UT Austin. (3) The count of Texas counties in which the majority of registered voters are affiliated with the Democratic party.
1
Which of the following is true of standard error and the similar-sounding but conceptually different "margin of error"? (1) The number referred to as the "margin of error" is not a characteristic of a particular sample but rather associated with the sampling procedure. (2) The "margin of error" --- usually operationalized as one or two multiples of the standard error --- is a colloquial term without a fixed formal definition. (3) The "margin of error" always means the same thing: it is the standard deviation of the sampling distribution.
1 & 2
Which of the following statements about bootstrapping is/are correct? (1) Each bootstrapped sample must be of the same size as the original sample. (2) Each bootstrapped sample may contain duplicates and omissions from the original sample. (3) Each bootstrapped sample must be sampled without replacement from a different population.
1 & 2
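A minimal sketch of the bootstrap, assuming the statistic of interest is the sample mean: each resample is drawn with replacement from the original sample and has the same size as the original. The data values, the seed, and the number of resamples are made up for illustration.

```python
import random
import statistics

def bootstrap_means(data, n_boot, rng):
    """Return the means of n_boot resamples drawn with replacement from data."""
    n = len(data)
    means = []
    for _ in range(n_boot):
        # Same size as the original sample; duplicates and omissions are expected.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    return means

rng = random.Random(42)
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]  # made-up observations
boot = bootstrap_means(sample, 2000, rng)
boot_se = statistics.stdev(boot)  # bootstrap estimate of the standard error

print(f"sample mean = {statistics.fmean(sample):.3f}, bootstrap SE = {boot_se:.3f}")
```

The spread of the resampled means approximates the sampling variability of the mean. It quantifies uncertainty but does not reduce it.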
Which of the following statements is/are correct? (1) Sampling variance refers to non-systematic (random) differences between our estimand and our estimate. (2) Sampling bias refers to systematic (non-random) differences between our estimand and our estimate. (3) Bootstrapping helps us to reduce the statistical uncertainty we have about our estimand, allowing us to arrive at an answer with a higher degree of confidence.
1 & 2
Which of the following are key ingredients of a confidence interval based on the Central Limit Theorem? (1) A summary statistic (e.g. a mean) from your sample (2) A multiple z, based on a tail area from the normal distribution. (3) A formula for the standard error of your summary statistic. (4) A bootstrapped sampling distribution, usually simulated with at least 10,000 Monte Carlo samples.
1, 2, 3
Which of the following statements is true of the Central Limit Theorem? (1) The mean of a sufficiently large sample has an approximately normal sampling distribution. (2) If sample data plotted on a histogram show a distribution of individual observations that does NOT look normal, the sampling distribution of the sample mean of these observations will also necessarily NOT look normal. (3) The sampling distribution of a sample mean looks more normal as the size of the sample N increases.
1, 3
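One way to see the Central Limit Theorem at work is a small simulation. The sketch below draws samples from a strongly right-skewed exponential distribution and measures the skewness of the resulting sample means for two sample sizes. The choice of distribution, the seed, the sample sizes, and the number of repetitions are all assumptions for illustration.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric distribution)."""
    center = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return statistics.fmean([((v - center) / spread) ** 3 for v in values])

def simulate_sample_means(n, n_reps, rng):
    """Means of n_reps samples of size n from an Exponential(1) distribution."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_reps)
    ]

rng = random.Random(2024)
skew_n2 = skewness(simulate_sample_means(2, 5000, rng))
skew_n50 = skewness(simulate_sample_means(50, 5000, rng))

print(f"skewness of sample means, N = 2:  {skew_n2:.2f}")
print(f"skewness of sample means, N = 50: {skew_n50:.2f}")
```

Skewness is only one rough indicator of normality, but it should move toward zero as N grows even though the individual observations are far from normal.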
Which of the following are among guidelines for data scientists on what variables to include in fitting multiple regression models? (1) It is essential to incorporate variables that directly affect both the outcome (Y) and the particular X predictor of interest. (2) It is beneficial, but not strictly essential, to include variables that affect Y even if they are not correlated with a particular X predictor of interest. (3) Always include an interaction term for two X predictors if those predictors are both main effects in the model.
1,2
Which of the following are common steps in feature engineering? (1) nonlinear transformations of numerical variables (2) combining or summarizing variables in the data frame to create new variables (3) bootstrapping a regression model to get confidence intervals
1,2
Which of the following is a discrete random variable? (1) The number of customers waiting in line at Franklin BBQ when it opens tomorrow morning. (2) The count of typos on a page. (3) The time required for a plane to fly from Houston Hobby to Dallas Love Field.
1,2
In November 2020, Airbnb made a long-awaited filing to go public on the Nasdaq (under the symbol "ABNB"). Their data science team wanted to predict the IPO opening price from both firm-specific and market-level features. They fit three models --- splitting the data into an 80% training set and 20% testing set --- and summarized predictive performance in this table: Which of the following is true of this set of models? (1) The model with the best predictive performance is Model 2. (2) The big model suffers relatively little degradation in performance when moving from in-sample data to out-of-sample data. (3) Here we see evidence that simpler models show less degradation in performance (than do more complex models) when moving from in-sample to out-of-sample data.
1,3
Reasons to include an interaction term in our model include which of the following? (1) To estimate context-specific effects of some predictor variable on the outcome (y). (2) If the joint effect of two variables on the outcome can be correctly modeled as the sum of the main effects associated with each variable. (3) Looking at an ANOVA table suggests that an interaction term noticeably improves the predictive power of the model.
1,3
Which of the following are correct statements about multiple regression? (1) In observational studies, using a regression model to adjust for confounding variables is largely an after-the-fact, statistical process (as opposed to the active manipulation of a predictor variable of interest, while explicitly holding constant the levels of other relevant variables). (2) Multiple regression controls for all possible confounders, even those of which the researcher is not aware and/or cannot include as a model predictor. (3) A multiple regression equation attempts to isolate a set of partial relationships between the response variable and each of the predictor variables included in the model.
1,3
Which of the following is true of overfitting? (1) It is the reason that we split our data into training and testing sets. (2) It is more likely to happen when the size of the data set is large or the model that we are fitting has few parameters. (3) It occurs when a model memorizes the random noise in a particular data set rather than learning an underlying pattern.
1,3
Which of the following is true of the normal distribution? (1) Observing a normal random variable more than three standard deviations beyond its expected value is an unlikely, rare event. (2) All normal random variables have the same mean and variance. (3) The area under a normal density curve represents probability.
1,3
Which of the following statements is true of dummy variables? (1) In general, a grouping variable with K categories produces K-1 dummy variables. (2) In a fitted model, the coefficient on a dummy variable represents the average value of the outcome (y) whenever the dummy variable is 1. (3) In a fitted model, the coefficient on a dummy variable represents the difference in the average outcome (y) between two conditions: when the dummy variable is 1, versus when the dummy variable is 0.
1,3
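The interpretation of a dummy-variable coefficient can be verified on a toy example. This sketch fits a one-predictor least-squares line by hand and compares the fitted slope with the difference in group means. The data and the helper name are made up for illustration.

```python
import statistics

def fit_simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    xbar = statistics.fmean(x)
    ybar = statistics.fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Made-up data: dummy = 1 marks the hypothetical "treated" condition.
dummy = [0, 0, 0, 1, 1, 1, 1]
outcome = [10.0, 12.0, 11.0, 15.0, 17.0, 16.0, 14.0]

intercept, slope = fit_simple_regression(dummy, outcome)
mean_when_0 = statistics.fmean([y for d, y in zip(dummy, outcome) if d == 0])
mean_when_1 = statistics.fmean([y for d, y in zip(dummy, outcome) if d == 1])

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"difference in group means = {mean_when_1 - mean_when_0:.2f}")
```

The intercept matches the average outcome when the dummy is 0, and the coefficient on the dummy matches the difference between the two group averages rather than the average of the dummy = 1 group.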
A 95% confidence interval for the mean based on the Central Limit Theorem and de Moivre's equation generally takes the form: $\bar{x} \pm 1.96 \cdot \sigma/\sqrt{n}$. If we instead wanted to calculate an 80% confidence interval for the mean, which elements of this formula would be different? (1) The center of the interval, $\bar{x}$ (the sample mean). (2) The multiplier of 1.96. (3) The term $\sigma/\sqrt{n}$.
2
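To make the role of the multiplier concrete, the sketch below computes the normal-distribution multiplier for a given confidence level and builds both intervals. The sample mean of 100, standard deviation of 15, and sample size of 25 are hypothetical numbers, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def z_multiplier(confidence):
    """Normal quantile that leaves (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

def mean_confidence_interval(xbar, sigma, n, confidence):
    """Interval of the form xbar +/- z * sigma / sqrt(n)."""
    z = z_multiplier(confidence)
    standard_error = sigma / math.sqrt(n)
    return xbar - z * standard_error, xbar + z * standard_error

# Hypothetical inputs: sample mean 100, sigma 15, n = 25.
ci_95 = mean_confidence_interval(100.0, 15.0, 25, 0.95)
ci_80 = mean_confidence_interval(100.0, 15.0, 25, 0.80)

print(f"z for 95%: {z_multiplier(0.95):.3f}, interval: {ci_95}")
print(f"z for 80%: {z_multiplier(0.80):.3f}, interval: {ci_80}")
```

Only the multiplier changes (about 1.96 versus about 1.28). The center and the standard error are the same in both intervals, so the 80% interval is simply narrower.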
Based on the article you read about a century of stock and bond correlations, which of the following statements is true about the correlation between monthly returns for stocks and for bonds? (1) In times of high and variable inflation, the correlation between the returns of stocks and bonds held over a similar period has generally been large and positive. (2) The correlations of returns of major asset classes have, over the previous decades, switched signs (from positive to negative or vice-versa). (3) Stock prices and bond yields have been generally uncorrelated since around the end of the 20th century.
2
Professor Snape administers two tests (Test 1 and Test 2) to every student in his NEWT-level Potions class. Historically, the average student performance across both tests is 50%, and the correlation of an individual student's scores across Test 1 and Test 2 is r = 0.61. Professor Snape identifies the five worst performers in his class, who together averaged 30% on Test 1, and offers them a remedial review session, hoping to help them improve their scores on Test 2. Which of the following statements about this context is correct? (1) If the five students who received tutoring improve their scores on Test 2, and their average on Test 2 is significantly different from 30% at the 0.05 level, then this provides evidence that the tutoring worked. (2) We'd expect the five students who received tutoring to improve their score on Test 2, on average, even if the tutoring is useless, because of regression to the mean. (3) If the tutoring has no effect on performance, then we'd expect this group of five students also to score around 30% on Test 2, on average.
2
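The size of the expected improvement can be estimated with a short calculation. This sketch assumes that both tests have the same mean (50%) and the same standard deviation, and that scores are related linearly as in a bivariate normal model. Those assumptions are not stated in the question, and the function name is illustrative.

```python
def expected_followup_score(mean, correlation, first_score):
    """Predicted second score when both tests share the same mean and SD."""
    return mean + correlation * (first_score - mean)

# Values from the question: class average 50, r = 0.61, group averaged 30 on Test 1.
predicted_test2 = expected_followup_score(50.0, 0.61, 30.0)

print(f"Expected Test 2 average with no tutoring effect: {predicted_test2:.1f}%")
```

Under these assumptions the group would be expected to average roughly 37.8% on Test 2 even if the review session did nothing, so an improvement over 30% is not by itself evidence that the tutoring worked.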
The benefits of randomization in an experiment include which of the following? (1) It prevents the confounders from directly affecting the outcome variable. (2) On average, it balances the confounders between the treatment group and the control group. (3) It helps ensure that the results of a study will generalize accurately to the wider population.
2
Which of the following are correct statements about overfitting? (1) Overfitting is largely a theoretical concern, and not something that can typically happen in practice. (2) Overfitting occurs when a model memorizes the pattern of random noise in a data set. (3) If we see that model predictive performance deteriorates on the testing set relative to its predictive performance of the training set, the problem is likely that we don't have enough features or interactions in the model.
2
A common practice in machine learning is splitting a data set into training and testing sets. Which of the following is true of this method? (1) Models are fit on the testing data set only, while model predictive performance is evaluated on the training data set. (2) Models that are overfit tend to see large degradation in performance when comparing the training to the testing sets. (3) Predictive performance on the testing set is regarded as more important than performance on the training set.
2,3
Choose the true statements about the relationship between bootstrapping and confidence intervals based on the Central Limit Theorem (CLT). (1) Bootstrap confidence intervals typically give very different results from confidence intervals calculated using classical inference methods based on the Central Limit Theorem. (2) Bootstrapping offers an alternative to the use of CLT-based confidence intervals, removing the need to know or derive an explicit formula for the standard error of a summary statistic. (3) Variations on de Moivre's equation exist for many common summary statistics, allowing one to compute confidence intervals based on the normal distribution.
2,3
Which of the following are among important considerations when using ANOVA to understand regression models? (1) There can only be one correct ANOVA table for any given model in the context of an observational study involving correlated predictors. (2) ANOVA is very useful for regression models fit to experimental data, but generally less useful for regression models fit to observational data. (3) The ANOVA table is not the fitted model itself but rather an attempt to partition credit for the model's predictive power across the different variables.
2,3
Which of the following is true of fitting a multiple regression model? (1) We interpret each β (beta) coefficient as representing an overall relationship between y and the corresponding x. (2) We interpret each β (beta) coefficient as representing a partial relationship between y and the corresponding x, holding the other predictors constant. (3) If two predictor X variables are correlated, the difference between their overall relationships with a response variable Y and their partial relationships with a response variable Y may be very important in interpreting modeling results effectively.
2,3
Which of the following statements is true of Machine Learning (ML)? (1) Machines are used to make important decisions at every step of the pipeline, with humans playing little to no role. (2) Good feature engineering typically involves domain-specific knowledge. (3) Linear regression is a sensible starting point in understanding many ML applications.
2,3
As per class discussion and the assigned course packet reading, which of the following statements is true of multiple regression? (1) Using a regression model to "isolate" or "adjust for" variables is an experimental process, i.e. one that involves manipulating the variable of interest while holding others constant. (2) Using regression to isolate a partial relationship can usually produce study results that offer even stronger evidence of causality than the level of evidence that we expect from a randomized controlled trial. (3) A confounder is some variable that is correlated with both the response variable and a predictor variable.
3
Which of the following statements about elasticities are correct? (1) An elasticity describes the growth rate in some outcome (y) over time (t), and is usually associated with exponential growth models. (2) One way to estimate an elasticity from data is to run a regression for log(y) versus x; the slope of this line provides an estimate of the elasticity for y vs x. (3) An elasticity describes relative percentage change in y as a function of x.
3
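Study note: statement (2) above is wrong because an elasticity comes from regressing log(y) on log(x), not log(y) on x. The sketch below uses made-up price and quantity data that follow an exact power law with exponent -1.5; the helper `ols_slope` and the variable names are hypothetical.

```python
import math


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Made-up data: quantity = 100 * price^(-1.5), so the true elasticity is -1.5.
prices = [1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
quantities = [100.0 * p ** -1.5 for p in prices]

log_p = [math.log(p) for p in prices]
log_q = [math.log(q) for q in quantities]

# Slope of log(y) vs log(x): this is the elasticity.
elasticity = ols_slope(log_p, log_q)
# Slope of log(y) vs x: a growth-rate style slope, not an elasticity.
semi_log_slope = ols_slope(prices, log_q)

print(f"log-log slope (elasticity): {elasticity:.3f}")
print(f"log(y) vs x slope:          {semi_log_slope:.3f}")
```

Only the log-log slope recovers the exponent of the power law.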
Which of the following statements is true of these common Feature Selection methods? (1) With Forward Selection, we start with a working model that includes the entire scope---that is, all features under consideration for inclusion in the model. (2) With Backward Selection, we iteratively consider all possible one-variable additions to a working model. (3) Stepwise Selection is an iterative process that continues until there is no one-variable addition or deletion that improves model performance.
3
This is one of multiple questions about the same scenario. The Amazon e-commerce data science team fits a model to gain understanding of the extent to which having an Amazon Prime membership leads customers to buy more from the platform than customers without a Prime membership. The data set includes two variables: total sales revenue for a customer account and a dummy variable representing whether the customer had a Prime membership. The fitted model equation is: Sales = 764 + 1323 * Prime + e. Match each equation component with its appropriate label below.
764 - the baseline; 1323 - the offset; Sales - response variable; Prime - categorical predictor variable; e - residual
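Study note: with a single dummy predictor, the least-squares intercept equals the mean of the group coded 0 (the baseline) and the slope equals the difference in group means (the offset). The sketch below uses six invented sales figures, chosen only so that the group means reproduce 764 and 1323; they are not the Amazon data.

```python
import statistics

# Invented sales figures (not real data), chosen so that the group means
# are 764 (no Prime) and 2087 (Prime).
non_prime_sales = [700.0, 764.0, 828.0]
prime_sales = [2000.0, 2087.0, 2174.0]

# Dummy variable: 0 = no Prime membership, 1 = Prime membership.
prime = [0] * len(non_prime_sales) + [1] * len(prime_sales)
sales = non_prime_sales + prime_sales

# Ordinary least squares for Sales = baseline + offset * Prime + e.
x_bar = statistics.mean(prime)
y_bar = statistics.mean(sales)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(prime, sales))
sxx = sum((x - x_bar) ** 2 for x in prime)
offset = sxy / sxx
baseline = y_bar - offset * x_bar

print(f"baseline (intercept): {baseline:.1f}")
print(f"offset (slope):       {offset:.1f}")
print(f"difference in means:  "
      f"{statistics.mean(prime_sales) - statistics.mean(non_prime_sales):.1f}")
```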
Which of the following design choices should generally be avoided in data visualization? Select all correct answers. A barplot with truncated y-axis Axis labels 3D designs Plots that use color to encode information Displaying percentages that sum to 100%.
A barplot with truncated y-axis; 3D designs
During class we fit a multiple regression model to predict the listing price of a house in Saratoga County, New York. We concluded that the variable age (age of the house in years) should be included in a model attempting to isolate the partial relationship between fireplaces and price. Which of the following were among the reasons that we decided to include age in the model? - Age was correlated with both the response variable (price) and the predictor variable of interest (fireplaces). - The confidence interval for the age coefficient did not contain zero. - The inclusion of the age variable affected the magnitude of the coefficient of interest: fireplaces.
All of these answers are correct.
Which of the following are correct statements about the analysis of variance (ANOVA)? - An ANOVA table can help you decide whether a given data set calls for an interaction between variables. - An ANOVA table tracks the improvement in R-squared as we add variables to the model one at a time. - An ANOVA table attempts to attribute credit to the individual variables included in the model.
All of these answers are correct.
Which of the following are key components of bootstrapping? Using the same size as the original sample in each simulation. Repeatedly resampling from the original sample to track the extent to which the estimate varies across samples. Sampling with replacement
All of these answers are correct.
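Study note: the three bootstrap ingredients above (same sample size, repeated resampling, sampling with replacement) can be sketched in a few lines. The sample here is simulated and purely illustrative; `bootstrap_means` is a hypothetical helper, and the comparison with sd / sqrt(n) is only a sanity check for the special case of a sample mean.

```python
import random
import statistics

# Illustrative "original sample": 50 simulated measurements (not real data).
data_rng = random.Random(42)
sample = [data_rng.gauss(100, 15) for _ in range(50)]


def bootstrap_means(data, n_boot=2000, seed=0):
    """Means of n_boot resamples, each the same size as data."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices samples WITH replacement; k=n keeps the original sample size.
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_boot)]


boot = bootstrap_means(sample)
# The spread of the bootstrap means estimates the standard error.
boot_se = statistics.stdev(boot)
formula_se = statistics.stdev(sample) / len(sample) ** 0.5

print(f"bootstrap standard error: {boot_se:.2f}")
print(f"sd / sqrt(n):             {formula_se:.2f}")
```

The bootstrap measures how much the estimate varies from resample to resample; it does not shrink that variation.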
Why would we bootstrap a statistical model?
Bootstrapping is used to assess the uncertainty of our estimate due to sampling variance. (Note: assess, not minimize.)
Which of the following are among sources of natural experiments? Select all correct answers. Bureaucratic rules Confounding variables Experimental randomization Hurricanes Lotteries Blocking
Bureaucratic rules; Hurricanes; Lotteries
Professor Snape seeks to encourage more real-time participation during online lectures for his large potions class. He surveys a sample of students who attend his optional Remedial Potions review sessions on Friday nights to rate some of his ideas for how to make class sessions more engaging. This study design is most compromised / limited by which of the following problems?
Convenience sampling
Which of the following statements about exponential growth/decay models and power laws are correct? Select all correct answers. Exponential growth models can be interpreted in terms of a doubling time for the y variable, while power laws are usually interpreted in terms of elasticities. Both power laws and exponential models involve a base (b) raised to some power (r). But in an exponential model, the x variable is part of the power (r), whereas in a power law, the x variable is part of the base b. A power law is a particular type of exponential growth/decay model that allows one to estimate an elasticity. In both exponential and power-law models, a plot of log(y) versus x will look like it is well described by a linear relationship.
Exponential growth models can be interpreted in terms of a doubling time for the y variable, while power laws are usually interpreted in terms of elasticities. Both power laws and exponential models involve a base (b) raised to some power (r). But in an exponential model, the x variable is part of the power (r), whereas in a power law, the x variable is part of the base b.
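Study note: a short numerical check of the two interpretations above. The model forms y = a * exp(rate * t) and y = a * x^beta, and all parameter values, are assumptions chosen for illustration.

```python
import math


def exponential(t, a=100.0, rate=0.05):
    """Exponential growth: the variable t sits in the exponent."""
    return a * math.exp(rate * t)


def power_law(x, a=100.0, beta=0.5):
    """Power law: the variable x sits in the base."""
    return a * x ** beta


# Doubling time for exponential growth: ln(2) / rate.
t_double = math.log(2) / 0.05
ratio_after_doubling_time = exponential(10 + t_double) / exponential(10)

# Elasticity for a power law: a 1% increase in x changes y by about beta percent.
pct_change_y = (power_law(1.01 * 20.0) / power_law(20.0) - 1) * 100

print(f"doubling time: {t_double:.2f} time units")
print(f"growth factor over one doubling time: {ratio_after_doubling_time:.3f}")
print(f"% change in y for a 1% change in x: {pct_change_y:.3f}")
```

The growth factor over one doubling time does not depend on the starting point, and the percentage response of the power law does not depend on the starting x.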
Match the terms below from the Machine Learning vernacular to their counterparts in the language of Regression Modeling.
Features - Predictor variables; Target - Outcome variable; Supervised Learning - Regression; Attributes extracted from available training data - Predictor variables; Prediction from a learned model - Outcome variable
Match each of the following with Statistics, Machine Learning, or both.
Fundamentally about learning from data - Both; Uses lots of regression analysis - Both; About helping people learn from data - Statistics; About helping machines learn and improve from data - Machine Learning; Focused mainly on understanding and interpreting - Statistics; Focused mainly on performing and predicting - Machine Learning; Supports automated decision-making algorithms capable of improving from experience - Machine Learning; Supports human decision making - Statistics
The City of Austin data science team is designing a survey to learn about citizens' modes of commuting to work. Which of the following survey questions is the best example of unbiased wording?
How many days a week do you use a bicycle to go from home to work?
Which of the following statements about outliers are correct? The mean and standard deviation are uninfluenced by extreme outliers. It's considered best practice to remove observations that fall more than 1.5 times the IQR away from the median, since these outliers can distort your analysis. If an outlier noticeably changes the results of your analysis, it's a good idea to report results both with and without the outlier. There is no generally accepted definition of what constitutes an outlier.
If an outlier noticeably changes the results of your analysis, it's a good idea to report results both with and without the outlier. There is no generally accepted definition of what constitutes an outlier.
Which of the following are measures to summarize the center of a distribution of a numerical variable? Select all applicable answers. Skewness Range Median Mean Variance
Median; Mean
When analyzing data from a "multi-factor" experiment, it is common to conduct an analysis of variance. In an ANOVA table: - we track the statistical significance of our first variable, which should increase as variables are added sequentially. - we list all the fitted coefficients for all variables in the model. - we track the change in R-squared (R^2), which should decrease as variables are added sequentially.
None of these answers is correct.
Which of the following represent(s) the concept of "sampling WITH replacement"?
Professor Snape selects a sample of students to "cold call" during each of his NEWT-level Potions classes. For each question, he uses a Resampulus charm, wherein his wand randomly points to a student irrespective of who was called previously. There is no limit on the number of times that an individual student might be selected during any given class session.
In the 1940s, researchers at the British Medical Research Council began following over time two separate and initially healthy groups: smokers and non-smokers. They tracked the incidence of lung cancer in both groups. They found that daily smokers of at least 35 cigarettes increased their odds of dying of lung cancer by a factor of 40. This study had which of the following designs?
Prospective Cohort Study
Which of the following is true of randomization in the context of an experiment? - Randomization ensures balance, on average, even for possible confounding factors of which the experimenter is not aware. - The fundamental source of uncertainty in an experiment arises from the random assignment of nuisance factors as confounders. - Randomization is not necessary when an experimental design is double-blind. - None of these answers is correct.
Randomization ensures balance, on average, even for possible confounding factors of which the experimenter is not aware.
Match each of the following study designs with its corresponding rank in the "hierarchy" of study designs discussed during class.
Randomized controlled trials - 1; Quasi-experiments with good mechanisms for randomization - 2; Prospective cohort studies - 3; Retrospective cohort studies - 4; Other study types - 5
Which of the following are measures to summarize the variability or spread of a distribution? Select all applicable answers. z-score Range Mode Interquartile Range Skewness Standard Deviation
Range; Interquartile Range; Standard Deviation
In military flight training, student officers must learn how to land a small plane. Instructors accompany the students individually and grade their performance numerically. Each student in a class of 30 students landed a small plane on Monday and was evaluated. Those students with the highest scores were praised; those students with the lowest scores were criticized. On Wednesday, the same students each tried the exercise again and were again evaluated. The instructors noticed that the students with the highest scores on Monday did not do as well Wednesday, on average, although they still did reasonably well. But the students with the lowest scores on Monday did better Wednesday, on average, although they were still sub-par. The military concluded that praise should be withheld and that all students should be criticized in order to improve performance. This strategy to improve student performance did not take into consideration which of the following statistical phenomena? Choose the best answer.
Regression to the mean
Which of the following is the best characterization of the term "researcher degrees of freedom"?
The choices that researchers must make in designing and running a study and in analyzing the data.
Netflix collects data every time a subscriber uses its platform, including the variables listed below. Which of these variables are categorical?
The day of the week; The U.S. state in which the subscriber resides; The genre of the show/movie
Match the random variable described with its correct type: discrete or continuous?
The number of people who visit the Perry-Castañeda Library tomorrow - Discrete; Length of time required for a phone battery to lose its charge - Continuous; The sum of the numbers from rolling two dice - Discrete; The amount of rainfall in Austin in November - Continuous; The count of Haribo gummy bears in a bag - Discrete; The number of mortgages approved in Travis County last week - Discrete; The distance that a car travels with one tank of gas - Continuous; The weight of the contents of a bag of Haribo gummy bears - Continuous; The number of cards drawn from a deck until a Queen is selected - Discrete
The long-run rate of defective iPhones coming off the assembly line is 0.6% when all manufacturing processes are working correctly. Because testing each phone for defects would be cost prohibitive, a random sample of 500 phones is inspected every 2 hours to determine if the manufacturing processes are working correctly, or if there may be an issue leading to a higher rate of production defects. Halting the production line unnecessarily leads to lost revenue from fewer units shipped. On the other hand, producing defective phones is bad for brand integrity. Which of the following represents a Type I Error in this context?
The process is working correctly and the plant manager temporarily halts production.
Online opinion polls open for responses from the general public are most likely to be compromised because of which of the following?
They rely on voluntary response samples.
Suppose a statistical analysis produces a p-value equal to 0.051 under some null hypothesis. Which of the following can we conclude? Choose all correct statements. The null hypothesis has a 5.1% chance of being correct. If we reject the null hypothesis, there is a 5.1% chance we are making a mistake. This p-value provides a nearly identical amount of practical evidence against a null hypothesis as would a p-value equal to 0.049. The probability that we would have observed our specific test statistic, if the null hypothesis were true, is 5.1%. If -- before the poll was conducted -- the data scientist declared 0.10 to be the arbitrary level of significance, they would reject the null hypothesis. If -- before the analysis was conducted -- the data scientist declared 0.01 to be the arbitrary level of significance, they would fail to reject the null hypothesis.
This p-value provides a nearly identical amount of practical evidence against a null hypothesis as would a p-value equal to 0.049. If -- before the poll was conducted -- the data scientist declared 0.10 to be the arbitrary level of significance, they would reject the null hypothesis. If -- before the analysis was conducted -- the data scientist declared 0.01 to be the arbitrary level of significance, they would fail to reject the null hypothesis.
Why might we use bootstrapping along with calculating a sample estimate? Select all correct answers. To check whether our data form a true random sample from the population. To simulate the variability inherent to the sampling process. To reduce or, if possible, eliminate uncertainty associated with our estimate. To measure the extent to which our data constitutes a biased sample from the population.
To simulate the variability inherent to the sampling process.
Which of the following statements about independence are correct? Select all correct answers. Two events A and B are independent if P(A,B) = P(A) • P(B) Two events A and B are independent if P(A) = P(A | B) Two events A and B are independent if P(A) = P(B) Two events A and B are independent if they cannot both happen.
Two events A and B are independent if P(A,B) = P(A) • P(B) Two events A and B are independent if P(A) = P(A | B)
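Study note: both correct definitions above can be verified exactly on a small sample space. The sketch below enumerates two fair dice; the events chosen are illustrative and not from the question. It also shows that two events that cannot both happen are not independent.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event (a function of an outcome)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def joint(a, b):
    """P(A, B): probability that both events happen."""
    return prob(lambda o: a(o) and b(o))


def first_even(o):
    return o[0] % 2 == 0


def sum_is_seven(o):
    return o[0] + o[1] == 7


def first_is_one(o):
    return o[0] == 1


def first_is_two(o):
    return o[0] == 2


# Independent pair: P(A, B) = P(A) * P(B), and equivalently P(A | B) = P(A).
product_rule_holds = joint(first_even, sum_is_seven) == prob(first_even) * prob(sum_is_seven)
conditional = joint(first_even, sum_is_seven) / prob(sum_is_seven)

# Mutually exclusive pair: P(A, B) = 0, which is not P(A) * P(B).
exclusive_independent = joint(first_is_one, first_is_two) == prob(first_is_one) * prob(first_is_two)

print("first die even vs sum is 7, product rule holds:", product_rule_holds)
print("P(first even | sum is 7) =", conditional, "and P(first even) =", prob(first_even))
print("mutually exclusive events independent?", exclusive_independent)
```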
Fill in the blank with the appropriate answer. In the context of scientific research, pre-registration: - occurs when study designers outline their plan to collect and analyze data before the study commences. - limits flexibility of scientists to capitalize on serendipity in reporting genuinely new findings that they did not anticipate during the study design process. - limits opportunities for p-hacking and the abuse of researcher degrees of freedom.
all
Which of the following are correct statements about the analysis of variance (ANOVA)? (1) An ANOVA table attempts to attribute credit to individual predictor variables included in the model. (2) An ANOVA table tracks the improvements in R-squared associated with each variable. (3) If predictors are correlated, changing the sequence in which these variables are added to the model changes the information in the ANOVA table about the relative importance of individual variables.
all
Which of the following statements is true of "statistical significance" in the context of multiple regression modeling? - Statistical significance for an estimated partial relationship does not mean that the corresponding predictor variable is important in practical terms. - We generally express the level of statistical significance as the opposite of the confidence level. - An estimated partial relationship is considered to be statistically significant if zero is not a plausible value for that partial slope in the model.
all
Holding other factors constant, increasing the size of a sample used to calculate a confidence interval will:
decrease the standard error.
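Study note: for a sample mean, the standard error is sd / sqrt(n), so a larger sample gives a smaller standard error and a narrower interval. The sketch below assumes a sample mean of 50 and a standard deviation of 10, both invented, and uses the normal-based multiplier 1.96 for an approximate 95% interval.

```python
import math


def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


def approx_ci95(mean, sd, n):
    """Approximate 95% confidence interval for a mean (normal-based)."""
    se = standard_error(sd, n)
    return mean - 1.96 * se, mean + 1.96 * se


se_100 = standard_error(10.0, 100)
se_400 = standard_error(10.0, 400)

print(f"n = 100: SE = {se_100}, CI = {approx_ci95(50.0, 10.0, 100)}")
print(f"n = 400: SE = {se_400}, CI = {approx_ci95(50.0, 10.0, 400)}")
```

Quadrupling the sample size halves the standard error, because n enters through its square root.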
If the distribution of a numerical variable is unimodal and skewed to the left, then the median is:
greater than the mean.
Recall bias:
is a systematic error that occurs when participants do not remember previous events or experiences accurately or omit details: the accuracy and volume of memories may be influenced by subsequent events and experiences.
Which of the following is true of p-values? A p-value represents the probability of observing our specific test statistic, assuming that the null hypothesis is correct. Larger p-values are indicative of more evidence against the null hypothesis. A p-value represents a conditional probability that the null hypothesis is correct, given the data that we have observed.
none
Select all correct answers. Statistical inference comprises a set of methods to quantify uncertainty. Inference is appropriate in a variety of common data-science situations, including when: we have data from a census. generalizing from a sample to a wider population is not relevant to our data science goals. our observations are subject to measurement error. we are trying to make a prediction about the future based on data, and the future may look different from the past in ways that are material to the data science context. when we wish to generalize from our sample to a wider population, and our data arise from a convenience sample. analyzing results from a randomized experiment. we are trying to make a prediction about the future based on data, and it is reasonable to assume that the future will resemble the past/present in ways that are relevant to the given data science context. systematic measurement biases are likely to be larger than random and repeatable sources of statistical variability. the data arise from an intrinsically random or variable process.
our observations are subject to measurement error. analyzing results from a randomized experiment. we are trying to make a prediction about the future based on data, and it is reasonable to assume that the future will resemble the past/present in ways that are relevant to the given data science context. the data arise from an intrinsically random or variable process.
A formula that defines each term in a sequence using the preceding terms in that sequence is said to be:
recursive
If determining whether or not a measured effect can be distinguished from zero, we are interested in ___________. There is no statistical test for ___________. When we ask how large an effect a predictor variable has on an outcome variable, in context-specific terms, we are interested in ___________. ___________ may be assessed by calculating a p-value. The units of variables and measurement scales of variables matter when we are assessing ___________. ___________ is usually assessed by looking at a confidence interval and reasoning about the range of plausible effect sizes in the context of the problem. ___________ can usually be assessed by checking whether or not zero falls within the range of plausible values represented by a confidence interval.
statistical significance practical significance practical significance statistical significance practical significance practical significance statistical significance