STA 301H Final

objectA = 10
objectB = 2*objectA + 10
objectC = objectB/6

Which of the following statements about this code block are correct?

-This block of code illustrates the assignment of values to objects using R's assignment operator (=) -If we were to run this block of code, one result will be that objectB should now store the value 30 -This block will run just fine if you type it directly into the console. But the best practice would be to type the commands into a script instead, and then run those commands from the script
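
As a quick check of the second statement, the block can be run with the arithmetic spelled out in comments. This is only a sketch of the same three assignments; the print calls are added for illustration.

```r
objectA = 10
objectB = 2 * objectA + 10   # 2 * 10 + 10 = 30
objectC = objectB / 6        # 30 / 6 = 5

print(objectB)
print(objectC)
```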

ggplot(tvshows) + geom_point(aes(x=GRP, y=PE)) + facet_wrap(~Genre)

-This code block illustrates the use of faceting, and will produce a panel of multiple scatter plots (one plot for each genre) -This plot will show the GRP variable on the horizontal axis and the PE variable on the vertical axis

Suppose a statistical analysis produces a p-value equal to 0.051 under some null hypothesis. Which of the following can we conclude?

-This p-value provides a nearly identical amount of practical evidence against a null hypothesis as would a p-value equal to 0.049 -If -- before the analysis was conducted -- the data scientist declared 0.01 to be the arbitrary level of significance, they would fail to reject the null hypothesis -If -- before the analysis was conducted -- the data scientist declared 0.10 to be the arbitrary level of significance, they would reject the null hypothesis

Which of the following statements is true about the normal random walk model, Y[t] = Y[t-1] +e[t], where e[t] is a normal shock with mean zero and standard deviation sigma?

-This random walk model depends upon the initial state of the system -Each shock is independent of all prior shocks

Which of the following statements about independence are correct? Select all correct answers.

-Two events A and B are independent if P(A,B) = P(A) * P(B) -Two events A and B are independent if P(A) = P(A | B)

Which of the following are key components of bootstrapping?

-Using the same sample size as the original sample in each simulation -Sampling with replacement -Repeatedly resampling from the original sample to track the extent to which the estimate varies across samples
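
The three components above can be sketched in base R. This is a minimal illustration, not the course's own code: the data are simulated, and the names original, boot_means, and boot_se are hypothetical.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical "original sample", simulated only for illustration
original <- rnorm(50, mean = 10, sd = 2)
n <- length(original)

# Each bootstrap sample has the same size as the original
# and is drawn with replacement from the original sample
boot_means <- replicate(2000, mean(sample(original, size = n, replace = TRUE)))

# The spread of the resampled means approximates the standard error of the mean
boot_se <- sd(boot_means)
print(boot_se)
```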

Approximately 8% of males are color blind. Consider a random sample of 25 men. Which of the following are among assumptions we make in modeling the number of color blind males in the sample as a binomial random variable?

-We are observing a fixed number of random events (i.e., each person in the sample) -Each random event may be considered as a "yes" or a "no" -Knowing that one member of the sample is color-blind (or not) does not change the probability that any other member of the sample is color-blind (or not)

Which of the following are among guidelines for data scientists on what variables to include in fitting multiple regression models? (1) It is essential to incorporate variables that directly affect both the outcome (Y) and the particular X predictor of interest (2) It is beneficial, but not strictly essential, to include variables that affect Y even if they are not correlated with a particular X predictor of interest (3) Always include an interaction term for two X predictors if those predictors are both main effects in the model

1 and 2

Which of the following is a discrete random variable? (1) The number of customers waiting in line at Franklin BBQ when it opens tomorrow morning (2) The count of typos on a page (3) The time required for a plane to fly from Houston Hobby to Dallas Love Field

1 and 2

Which of the following is true of standard error and the similar-sounding but conceptually different "margin of error"? (1) The number referred to as the "margin of error" is not a characteristic of a particular sample but rather associated with the sampling procedure (2) The "margin of error" --- usually operationalized as one or two multiples of the standard error --- is a colloquial term without a fixed formal definition (3) The "margin of error" always means the same thing: it is the standard deviation of the sampling distribution

1 and 2

Which of the following statements about bootstrapping is/are correct? (1) Each bootstrapped sample must be of the same size as the original sample (2) Each bootstrapped sample may contain duplicates and omissions from the original sample (3) Each bootstrapped sample must be sampled without replacement from a different population

1 and 2

Which of the following statements is/are correct? (1) Sampling variance refers to non-systematic (random) differences between our estimand and our estimate (2) Sampling bias refers to systematic (non-random) differences between our estimand and our estimate (3) Bootstrapping helps us to reduce the statistical uncertainty we have about our estimand, allowing us to arrive at an answer with a higher degree of confidence

1 and 2

A data scientist fits an exponential growth model to estimate the rate of growth in daily COVID-19 Austin-area hospitalizations (Y) during the surge in June 2020. The fitted equation is below, where T=0 corresponds to June 1: log(Y) = 2.5 + 0.06*T Which of the following is true of the fitted model above? (1) The expected number of average daily hospitalizations (Y) at Day 0 (June 1) was approximately 12 (2) This model would predict approximately 3.2 hospitalizations on the 12th day after June 1 (t=12) (3) Each additional day in June saw 6% more hospitalizations than the previous day, on average

1 and 3
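
The three statements can be checked by evaluating the fitted equation directly. A small sketch, assuming log() in the model is the natural log; the function name predict_y is made up for illustration.

```r
intercept <- 2.5
slope <- 0.06

# Back-transform the fitted log-scale equation to the original scale
predict_y <- function(t) exp(intercept + slope * t)

day0 <- predict_y(0)             # about 12.2, so statement (1) holds
day12 <- predict_y(12)           # about 25; 3.22 is log(Y) at t = 12, not Y
daily_growth <- exp(slope) - 1   # about 0.062, i.e. roughly 6% per day

print(c(day0, day12, daily_growth))
```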

Reasons to include an interaction term in our model include which of the following? (1) To estimate context-specific effects of some predictor variable on the outcome (y) (2) If the joint effect of two variables on the outcome can be correctly modeled as the sum of the main effects associated with each variable (3) Looking at an ANOVA table suggests that an interaction term noticeably improves the predictive power of the model

1 and 3

Which of the following are correct statements about multiple regression? (1) In observational studies, using a regression model to adjust for confounding variables is largely an after-the-fact, statistical process (as opposed to the active manipulation of a predictor variable of interest, while explicitly holding constant the levels of other relevant variables) (2) Multiple regression controls for all possible confounders, even those of which the researcher is not aware and/or cannot include as a model predictor (3) A multiple regression equation attempts to isolate a set of partial relationships between the response variable and each of the predictor variables included in the model

1 and 3

Which of the following is true of the normal distribution? (1) Observing a normal random variable more than three standard deviations beyond its expected value is an unlikely, rare event (2) All normal random variables have the same mean and variance (3) The area under a normal density curve represents probability

1 and 3

Which of the following statements is true of dummy variables? (1) In general, a grouping variable with K categories produces K-1 dummy variables (2) In a fitted model, the coefficient on a dummy variable represents the average value of the outcome (y) whenever the dummy variable equals 1 (3) In a fitted model, the coefficient on a dummy variable represents the difference in the average outcome (y) between two conditions: when the dummy variable is 1, versus when the dummy variable is 0

1 and 3

Which of the following statements is true of the Central Limit Theorem? (1) The mean of a sufficiently large sample has an approximately normal sampling distribution (2) If sample data plotted on a histogram show a distribution of individual observations that does NOT look normal, the sampling distribution of the sample mean of these observations will also necessarily NOT look normal (3) The sampling distribution of a sample mean looks more normal as the size of the sample N increases

1 and 3

The McCombs MS Program Office wants to analyze a data frame whose rows represent students, and whose columns contain two variables from a survey: (1) the student's MS Program affiliation: 1. Business Analytics 2. Finance 3. IT Management 4. Marketing (2) the student's mode of attendance for the Summer 2021 term: 1. In-person 2. Virtual Which of the following would be an appropriate plot to visualize students' responses to these questions across the whole data set? (1) Faceted barplot (2) Scatter plot (3) Boxplot

1 only

The normal distribution would be an appropriate probability model in which of the following contexts? (1) As an approximation for a large-N binomial model (2) Characterizing a phenomenon for which large deviations from the mean, of three standard deviations or more, are frequent events (3) Describing a situation where we count the number of events over a fixed time interval, under the assumption that successive events are independent and occur at a constant rate

1 only

Which of the following is a continuous random variable? (1) The high temperature in Austin today (2) The number of undergraduate majors at UT Austin (3) The count of Texas counties in which the majority of registered voters are affiliated with the Democratic party

1 only

Which of the following are key ingredients of a confidence interval based on the Central Limit Theorem? (1) A summary statistic (e.g. a mean) from your sample (2) A multiple z, based on a tail area from the normal distribution (3) A formula for the standard error of your summary statistic (4) A bootstrapped sampling distribution, usually simulated with at least 10,000 Monte Carlo samples

1, 2, and 3

Suppose that you have a house worth $200,000, and that there is a 1 in 23,021 chance that your home will experience a flood this year. Suppose that, if a flood happens, there's a 90% chance that it will cause partial damage worth $20,000, and a 10% chance that it will total your house (i.e. cause $200,000 worth of damage). What is the expected value of flood damage to your house this year?

1.65
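
The answer follows from weighting the expected damage given a flood by the chance of a flood. A short sketch of that arithmetic (variable names are illustrative):

```r
p_flood <- 1 / 23021

# Expected damage if a flood happens: 90% partial, 10% total loss
damage_given_flood <- 0.90 * 20000 + 0.10 * 200000   # 38000

# Unconditional expected damage for the year, in dollars
expected_damage <- p_flood * damage_given_flood
print(round(expected_damage, 2))
```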

A data scientist fits an exponential growth model to estimate the rate of growth in daily COVID-19 Austin-area hospitalizations (Y) during the surge in June 2020. The fitted equation is below, where T=0 corresponds to June 1: log(Y) = 2.5 + 0.06*T In the Austin area during June 2020, the number of COVID-19 hospitalizations doubled roughly every:

12 days
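
For an exponential model with daily growth rate r on the natural-log scale, the doubling time is log(2)/r. A sketch of the calculation, assuming log() in the fitted model is the natural log:

```r
growth_rate <- 0.06   # slope on T from the fitted equation

# Solve exp(growth_rate * t) = 2 for t
doubling_time <- log(2) / growth_rate
print(doubling_time)  # a little under 12 days
```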

A data scientist fits an exponential growth model to estimate the rate of growth in daily COVID-19 Austin-area hospitalizations (Y) during the surge in June 2020. The fitted equation is below, where T=0 corresponds to June 1: log(Y) = 2.5 + 0.06*T Which of the following can we say about this model? (1) The relationship between days since June 1 and the number of daily hospitalizations (Y) is assumed to be approximately linear (2) This equation describes multiplicative change in Y over time (3) This model can be fit as a linear regression, i.e., using ordinary least squares

2 and 3

Choose the true statements about the relationship between bootstrapping and confidence intervals based on the Central Limit Theorem (CLT) (1) Bootstrapping confidence intervals typically give very different results from confidence intervals calculated using classical inference methods based on the Central Limit Theorem (2) Bootstrapping offers an alternative to the use of CLT-based confidence intervals, removing the need to know or derive an explicit formula for the standard error of a summary statistic (3) Variations on de Moivre's equation exist for many common summary statistics, allowing one to compute confidence intervals based on the normal distribution

2 and 3

Which of the following are among important considerations when using ANOVA to understand regression models? (1) There can only be one correct ANOVA table for any given model in the context of an observational study involving correlated predictors (2) ANOVA is very useful for regression models fit to experimental data, but generally less useful for regression models fit to observational data (3) The ANOVA table is not the fitted model itself but rather an attempt to partition credit for the model's predictive power across the different variables

2 and 3

Which of the following is true of fitting a multiple regression model? (1) We interpret each beta coefficient as representing an overall relationship between y and the corresponding x (2) We interpret each beta coefficient as representing a partial relationship between y and the corresponding x, holding the other predictors constant (3) If two predictor X variables are correlated, the difference between their overall relationships with a response variable Y and their partial relationships with a response variable Y may be very important in interpreting modeling results effectively

2 and 3

The Amazon e-commerce data science team fits a model to gain understanding of the extent to which having an Amazon Prime membership leads customers to buy more from the platform than customers without a Prime membership. The data set includes two variables: total sales revenue for a customer account and a dummy variable representing whether the customer had a Prime membership. The fitted model equation is: Sales = 764 + 1323 * Prime + e What sales revenue do we predict for a customer who is a Prime member?

2087

Which of the following are correct statements about the analysis of variance (ANOVA)? (1) An ANOVA table attempts to attribute credit to individual predictor variables included in the model (2) An ANOVA table tracks the improvements in R-squared associated with each variable (3) If predictors are correlated, changing the sequence in which these variables are added to the model changes the information in the ANOVA table about the relative importance of individual variables

All of these answers are correct

Bitcoin is a decentralized digital currency that functions without a central bank. It was first released as open-source software in 2009. Since then, many of these so-called cryptocurrencies have raised money through Initial Coin Offerings (ICOs). A data frame gives information on nine ICOs that have raised more than $100 million. For each ICO, the dataset lists an ID (numbered from 1 to 9), the name of the ICO, the location of the ICO, and the amount raised by the ICO in millions of U.S. dollars. Which of the following variables in this data frame is/are numerical? Select all correct answers

Amount raised

Before an election, polling agencies estimate the percentage of voters who will vote for a particular candidate. Their estimates will be affected by random sampling variance. The best way to reduce the sampling variance of their estimate is to:

Use a larger random sample

Hoping to compete with the likes of Amazon Web Services and Microsoft Azure in the competitive cloud computing market, software-as-a-service provider Nimbus 2k has bid on a major contract -There is a 15% chance the firm wins the contract and earns a profit of $1,000,000 -There is a 10% chance the firm wins the contract but with higher expenses, earning a profit of $750,000 -Otherwise, the firm loses the contract and they earn no profit The expected value of the contract profit for Nimbus 2k is:

$225,000
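
The expected value is the probability-weighted sum of the three profit outcomes. A sketch, where the 75% chance of no profit is inferred from the other two probabilities:

```r
profits <- c(1000000, 750000, 0)
probs <- c(0.15, 0.10, 0.75)   # 0.75 = 1 - 0.15 - 0.10 (contract lost)

expected_profit <- sum(profits * probs)
print(expected_profit)
```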

The owner of Honeydukes uses data from their point-of-sale system to fit a linear model predicting daily sales revenue (in USD) from the number of customers who visited the store that day. The equation is: Sales_i = 10.56 + 5.23 * Customers_i + e_i If 70 customers visit the shop tomorrow, the daily sales revenue predicted by this linear model is closest to which of the following?

$377

The luxurious Ludo Bagman Casino offers a new dice game in which players roll a fair 6-sided die: -Rolling a "1" wins $20 -Rolling a "2" wins $10 -any other outcome results in no winnings Let W be a random variable that denotes the amount of winning from playing one round of this game. The expected value of W is closest to which of the following?

$5.00
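
With a fair die, each face has probability 1/6, so the expected winnings are the plain average of the six payouts. A sketch:

```r
# Payout for faces 1 through 6
winnings <- c(20, 10, 0, 0, 0, 0)

# Each face is equally likely, so E[W] is the mean payout
expected_w <- mean(winnings)
print(expected_w)
```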

The model equation is: MaxBonus = 0.45 + 0.15 * Salary + 0.84 * SEC - 0.11 * (Salary * SEC) What is the predicted maximum annual bonus for a coaching position at a non-SEC school with a salary of $2 million?

$750,000

A data scientist fits a model to investigate the extent to which the effect of work experience on salary is the same for males and females (using years of Experience and Sex as predictor variables): Salary = 35,000 +3300 * Experience - 2000 * Female - 300 * (Experience * Female) What salary would we predict for a female with 16 years of work experience?

$81,000

A data scientist fits a model to investigate the extent to which the effect of work experience on salary is the same for males and females (using years of Experience and Sex as predictor variables): Salary = 35,000 +3300 * Experience - 2000 * Female - 300 * (Experience * Female) What salary would we predict for a male with 16 years of work experience?

$87,800
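
Both salary predictions come from plugging values into the same fitted equation, with Female coded 1 for females and 0 for males. A sketch; the function name predict_salary is hypothetical.

```r
# Fitted interaction model from the card; female is a 0/1 dummy
predict_salary <- function(experience, female) {
  35000 + 3300 * experience - 2000 * female - 300 * experience * female
}

female_16 <- predict_salary(16, female = 1)   # 35000 + 52800 - 2000 - 4800
male_16 <- predict_salary(16, female = 0)     # 35000 + 52800

print(c(female_16, male_16))
```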

A data-science team at a large grocery chain observes the quantity of ice cream cartons sold at different levels of outside temperature each day. Their goal is to use a statistical model to understand how changes in temperature predict changes in consumer demand for ice cream, as measured by quantity sold on a given day. Which correct answers describe this model?

-Quantity sold should be the response variable, and temperature should be the predictor variable -If the data scientists use a linear model, the model intercept represents, mathematically, what we'd expect ice cream sales quantity to be if the outside temperature was exactly 0

Which of the following are accurate statements about R^2?

-R^2 ranges from 0 to 1 -If a linear model produces a R^2 equal to 0.77, it indicates that 23% of the variability in Y is predicted by factors other than variation in X

Netflix collects data every time a subscriber uses its platform, including the variables listed below. Which of these variables are categorical?

-The U.S. state in which the subscriber resides -The genre of the show/movie -The day of the week

During class we fit a multiple regression model to predict the listing price of a house in Saratoga County, New York. We concluded that the variable age (age of the house in years) should be included in a model attempting to isolate the partial relationship between fireplaces and price. Which of the following were among the reasons that we decided to include age in the model?

-The confidence interval for the age coefficient did not contain zero -The inclusion of the age variable affected the magnitude of the coefficient of interest: fireplaces -Age was correlated with both the response variable (price) and the predictor variable of interest (fireplaces)

Statistical inference comprises a set of methods to quantify uncertainty. Inference is appropriate in a variety of common data-science situations, including when:

-The data arise from an intrinsically random or variable process -Analyzing results from a randomized experiment -We are trying to make a prediction about the future based on data, and it is reasonable to assume that the future will resemble the past/present in ways that are relevant to the given data science context -Our observations are subject to measurement error

t1 = xtabs(~acl + lollapalooza, data=aclfest)
t1 %>% prop.table(margin=2) %>% round(3)

Which of the following statements about this code block are correct?

-The pipe operator (%>%) is always used to feed the result of one calculation into the next calculation as illustrated in this code block -This code creates a table of counts and stores it in an object called t1

Which of the following statements about outliers are correct?

-There is no generally accepted definition of what constitutes an outlier -If an outlier noticeably changes the results of your analysis, it's a good idea to report results both with and without the outlier

In order for a confidence interval based on de Moivre's equation to be valid, which of the following conditions must be true?

We must be forming a confidence interval for a population mean based on a sample mean

Which of the following design choices should generally be avoided in data visualization?

-3D designs -A barplot with truncated y-axis

Which of the following are correct statements about "for" loops and "do" loops in R?

-A "for" loop will always have a "counting" or "iterator" variable -"Do" loops allow us to repeat a calculation or simulation many times, as long as we don't require that the results of one simulation can affect the results of another simulation -"For" loops are useful for chaining the results of computations together, with the result of one computation feeding into the next computation

In 1990, the United Nations created a single measure that ranges between zero and one -- the Human Development Index (HDI) -- to summarize health, education, and economic status for world countries. The following is a fitted model to predict HDI from life expectancy and expected years of schooling: HDI = -0.34 + 0.01 * LifeExpectancy + 0.02 * SchoolYears + Error Which of the following are correct interpretations of the beta coefficient for LifeExpectancy?

-A partial slope -The change in HDI associated with a one-year increase in life expectancy, holding other predictors constant

Which of the following statements about scripts are correct?

-A script is a file that collects multiple statements (i.e. lines of R code) in a single document -Scripts make it simple to save your work where you left off, without having to remember what you've accomplished already -One way to run R statements from a script is to highlight those statements and then hit Control-Enter on the keyboard -Scripts make it easy to modify a complex analysis by adding or changing steps in the middle of a long chain of statements

Which of the following are correct statements about the analysis of variance (ANOVA)?

-An ANOVA table can help you decide whether a given data set calls for an interaction between variables -An ANOVA table attempts to attribute credit to the individual variables included in the model -An ANOVA table tracks the improvement in R-squared as we add variables to the model one at a time

The model equation is: MaxBonus = 0.45 + 0.15 * Salary + 0.84 * SEC - 0.11 * (Salary * SEC) Which of the statements are correct, in light of the fitted model parameters?

-Bonuses increase with salary at SEC schools, but more slowly than at non-SEC schools -Bonuses increase with salary at a rate of $150,000 per million dollars of salary at non-SEC schools

A data scientist fits a model to investigate the extent to which the effect of work experience on salary is the same for males and females (using years of Experience and Sex as predictor variables): Salary = 35,000 +3300 * Experience - 2000 * Female - 300 * (Experience * Female) Which of the following statements about possible discriminatory compensation practices looks correct, in light of the fitted model parameters?

-Both males and females seem to get paid more with increasing experience but the expected increase in salary for an additional year of experience seems to be higher for males -Females with 20 years of experience seem to get paid $8000 less than males of equivalent experience -Females with 0 years of experience seem to get paid $2000 less than males of equivalent experience

Which of the following statements about exponential growth/decay models and power laws are correct?

-Both power laws and exponential models involve a base (b) raised to some power (r). But in an exponential model, the x variable is part of the power (r), whereas in a power law, the x variable is part of the base b -Exponential growth models can be interpreted in terms of a doubling time for the y variable, while power laws are usually interpreted in terms of elasticities

Which of the following are measures to summarize the variability or spread of a distribution?

-Interquartile range -Standard deviation -Range

Which of the following is true of the normal distribution model?

-It is a family of bell-shaped density curves, each with a different mean and standard deviation -The normal distribution originated as an approximation to the binomial distribution -Phenomena that don't look obviously normal can be sometimes described using the normal distribution as a building block -The normal distribution has "thin tails" because large outliers are unlikely to occur -The area under a normal density curve represents probability

Which of the following statements about R libraries are correct?

-Libraries need to be loaded each time you want to use them -A library is a piece of software that provides additional functionality to R, beyond what's contained in the basic R installation

Which of the following are measures to summarize the center of a distribution of a numerical variable?

-Mean -Median

In a linear regression model, we describe data by an equation: y_i = B0 + B1 * x_i + e_i Which of the following is accurate for this equation?

-Model error is represented by e_i -The predictor variable is represented by x_i

In 1990, the United Nations created a single measure that ranges between zero and one -- the Human Development Index (HDI) -- to summarize health, education, and economic status for world countries. The following is a fitted model to predict HDI from life expectancy and expected years of schooling: HDI = -0.34 + 0.01 * LifeExpectancy + 0.02 * SchoolYears + Error Which of the following are correct interpretations of the beta slope coefficient for SchoolYears?

-We expect HDI to increase by 0.02 for every one-year increase in expected schooling, after adjusting for the simultaneous effect of life expectancy on HDI -When comparing countries with similar life expectancies but a difference of one year in expected years of schooling, we'd expect their HDIs to differ by 0.02

In a phase 3 randomized vaccine trial, participants at high risk for SARS-CoV-2 infection were randomly assigned in a 1:1 ratio to receive two injections of mRNA-1273 or placebo 28 days apart. Variables in the data frame include: -group: was the participant randomly assigned to the placebo group or the vaccine group? -covid: did the participant develop symptoms of illness (covid) or did they remain healthy? The researchers would like to use bootstrapping with at least 100,000 iterations to calculate a 95% confidence interval for the difference in the proportions of participants who developed covid across the two groups (placebo and vaccine). Which of the following R functions will be required as part of the code to run this proposed analysis?

-diffprop -confint -do -resample

The grammar of graphics is a theoretical framework for data visualization that:

-is implemented in R with the ggplot2 package -conceptualizes a statistical graphic as a mapping of data variables to aesthetic attributes of geometric objects -defines a set of rules for creating graphics by combining two different types of layers

There was fierce competition for tickets to the Jimmy Fallon show hosted on the UT Austin campus. A student who entered the ticket lottery had only a 5% chance of getting a ticket. However, students in the marching band had a 30% chance of getting a ticket. Of all students who entered the lottery, 7% were in the marching band. What proportion of lottery entrants received a ticket and were in the marching band?

0.021

Based on the data from 2019, 22.8% of Americans claim no religious affiliation. In a random sample of size n=250 of those living in America, what is the probability of sampling 45 or fewer people with no religious affiliation?

0.039
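
This is a binomial lower-tail probability, which base R's pbinom computes directly. A sketch, assuming the count of unaffiliated people in the sample is modeled as Binomial(250, 0.228):

```r
n <- 250
p <- 0.228

# P(X <= 45) for X ~ Binomial(n, p)
prob_45_or_fewer <- pbinom(45, size = n, prob = p)
print(round(prob_45_or_fewer, 3))
```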

Suppose the time it takes for a purchasing agent to complete an online order is normally distributed with a mean of 8 minutes and a standard deviation of 2 minutes. What is the probability that it will take longer than 11 minutes for the agent to complete an online order?

0.067
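
The upper-tail area of a normal distribution is available from pnorm with lower.tail = FALSE. A sketch for this card; 11 minutes is 1.5 standard deviations above the mean.

```r
# P(X > 11) for X ~ Normal(mean = 8, sd = 2)
prob_longer <- pnorm(11, mean = 8, sd = 2, lower.tail = FALSE)
print(round(prob_longer, 3))
```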

Suppose a packaging system fills boxes such that the weights are normally distributed with a mean of 16.3 ounces and a standard deviation of 0.21 ounces. What is the probability that a box weighs less than 16 ounces?

0.077

From a fair, shuffled 52-card deck, you draw the king of clubs and set it aside. What is the probability that the next card you draw will be a queen?

0.078

Research from 2019 indicates that 60% of U.S. adults aged 18-29 have used Snapchat, 65% have used Instagram, and 47% have used both Snapchat and Instagram. What is the probability that someone in this demographic uses neither Snapchat nor Instagram?

0.22

Imagine a jar containing 2500 normal "fair" coins. Into this jar, a friend places a single two-headed coin. Your friend then gives the jar a good shake, and you draw a single coin at random, with all 2500 coins in the jar equally likely to be drawn. You want to know whether the coin that you have drawn is the two-headed coin, but it's against the rules to simply look at both sides, so you conduct a statistical test for two-headedness by flipping the coin 12 times. Now suppose the coin comes up heads on all 12 flips. In light of this evidence, what is the probability that you are holding a two-headed coin?

0.62
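
The answer comes from Bayes' rule. The sketch below assumes 2500 fair coins plus the one two-headed coin (2501 in total); the card's wording is ambiguous on the total, but reading it as 2500 coins in total gives the same answer after rounding.

```r
n_fair <- 2500                              # assumption: 2500 fair coins plus 1 two-headed
prior_two_headed <- 1 / (n_fair + 1)
prior_fair <- n_fair / (n_fair + 1)

lik_two_headed <- 1                         # a two-headed coin always shows heads
lik_fair <- 0.5^12                          # 12 heads in a row from a fair coin: 1/4096

# Bayes' rule: P(two-headed | 12 heads)
posterior <- (prior_two_headed * lik_two_headed) /
  (prior_two_headed * lik_two_headed + prior_fair * lik_fair)
print(round(posterior, 2))
```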

At the Social Local Global Mobile manufacturing center, the quality engineer is called to the production floor if the smart phone assembly line encounters a problem. The engineer must then decide which of two actions to take to fix the problem: 1) make a simple adjustment that can sometimes fix the problem within an hour; or 2) shut down the entire production line (requiring an entire 8-hour shift) to guarantee a fix. Naturally, management would prefer that the assembly line not be shut down. In the past, the production line has experienced three problems associated with different smart phone parts: (1) the screen accounts for 10% of problems (2) the memory card accounts for 30% of problems (3) and the battery accounts for 60% of problems. From long experience, the engineering team has learned a set of conditional probabilities that each of these three problems can be fixed using the short, simple adjustment. The sim

0.64

Only 15% of all days in Central Texas are rainy. Moreover, you know the track record of the weather app on your phone: when it rains, your app gives the correct forecast (and correctly says it will rain tomorrow) 90% of the time. When it does not rain, your weather app raises a false alarm (and incorrectly says it will rain tomorrow) 5% of the time. Suppose your app says that it's going to rain tomorrow. What is the probability that it actually will rain?

0.76
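
This is another Bayes' rule calculation: the probability of rain given a rain forecast. A sketch using the three numbers from the card (variable names are illustrative):

```r
p_rain <- 0.15
p_forecast_given_rain <- 0.90   # app correctly predicts rain
p_forecast_given_dry <- 0.05    # false alarm rate

# Total probability that the app forecasts rain
p_forecast <- p_rain * p_forecast_given_rain + (1 - p_rain) * p_forecast_given_dry

# P(rain | forecast says rain)
p_rain_given_forecast <- p_rain * p_forecast_given_rain / p_forecast
print(round(p_rain_given_forecast, 2))
```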

As a junior data scientist for Uber, your first project focuses on surge pricing in Austin. Part of this work entails accurately modeling the number of ride requests within a given geographic location, for which a Poisson distribution is appropriate. From historical data you find that app users request, on average 18.7 rides per minute between midnight and 2:00am in the half mile radius of the intersection of Sixth Street and Congress Avenue. Uber wants to avoid a situation in which the supply of available drivers is inconsistent with demand for rides. What is the probability that there are no more than 25 requests per minute in the downtown Austin area during this late night interval?

0.936
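
"No more than 25 requests" is a Poisson lower-tail probability, which base R's ppois gives directly. A sketch, assuming requests per minute follow a Poisson distribution with rate 18.7 as the card states:

```r
rate_per_minute <- 18.7

# P(X <= 25) for X ~ Poisson(18.7)
prob_at_most_25 <- ppois(25, lambda = rate_per_minute)
print(round(prob_at_most_25, 3))
```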

American Express introduced a new credit card promotion giving awards to customers who make at least 20 purchases in a month. The data science team wants to know if the proportion of customers making 20+ purchases/month has changed from the 13% level observed before the promotion. A random sample of 1,000 accounts revealed that 10 had at least 20 purchases in the month following the new promotion. In this context, the null hypothesis could be summarized as which of the following? (1) 10 out of 1000 of customers are making at least 20 purchases each month (2) The new credit card promotion will result in more than 13% of customers with at least 20 purchases each month (3) We'd expect 130 of every 1000 accounts to make at least 20 purchases in a month

3 only

The national weather service collects annual data for the Austin metropolitan area. Of the following, which is a continuous random variable? (1) Count of tornado warnings (2) Number of days of precipitation (3) Annual rain accumulation

3 only

Which of the following statements about elasticities are correct? (1) An elasticity describes the growth rate in some outcome (y) over time (t), and is usually associated with exponential growth models. (2) One way to estimate an elasticity from data is to run a regression for log(y) versus x; the slope of this line provides an estimate of the elasticity for y vs. x. (3) An elasticity describes relative percentage change in y as a function of x.

3 only

Which of the following statements is true of multiple regression? (1) Using a regression model to "isolate" or "adjust for" variables is an experimental process, i.e. one that involves manipulating the variable of interest while holding others constant. (2) Using regression to isolate a partial relationship can usually produce study results that offer even stronger evidence of causality than the level of evidence that we expect from a randomized controlled trial. (3) A confounder is some variable that is correlated with both the response variable and a predictor variable.

3 only

The distribution of fifty years' worth of S&P 500 monthly stock returns has a mean of 0.89% and standard deviation of 4.42%. You plot these data in a histogram and observe that the distribution of individual monthly returns is approximately normal. The probability that a randomly selected individual monthly return will be greater than 3% is closest to which of the following?

32%
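
The 32% comes from the upper tail of a normal distribution with the stated mean and standard deviation. A minimal sketch with base R's `pnorm` (variable names are illustrative):

```r
# Normal model for monthly S&P 500 returns, in percent.
mu <- 0.89
sigma <- 4.42

# z-score of a 3% return
z <- (3 - mu) / sigma

# P(return > 3%): upper-tail probability
p_above_3 <- pnorm(3, mean = mu, sd = sigma, lower.tail = FALSE)

round(p_above_3, 2)  # 0.32
```

A 3% return sits a little under half a standard deviation above the mean, so roughly a third of months exceed it.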

A data scientist fits a groupwise linear model to predict Revenue from research grants in terms of affiliation of the research team with 1 of the 8 academic Institutions in the University of Texas system (UT Arlington, UT Austin, UT Dallas, UT El Paso, UT Permian Basin, UT Rio Grande Valley, UT San Antonio, or UT Tyler). If this model uses "baseline/offset" form, how many dummy variables should this model use to encode the variable Institution?

7
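
In baseline/offset form, one level serves as the baseline (absorbed into the intercept) and every other level gets its own dummy variable, so 8 institutions need 7 dummies. A small sketch using base R's `model.matrix`; the data frame here is a made-up stand-in with one row per institution, not the data from the question:

```r
# One row per institution, purely to inspect the dummy encoding.
institutions <- c("UT Arlington", "UT Austin", "UT Dallas", "UT El Paso",
                  "UT Permian Basin", "UT Rio Grande Valley",
                  "UT San Antonio", "UT Tyler")
inst <- data.frame(Institution = factor(institutions))

# Design matrix in baseline/offset form: intercept plus one dummy per
# non-baseline level (the first factor level is the baseline).
design <- model.matrix(~ Institution, data = inst)

n_dummies <- ncol(design) - 1  # drop the intercept column
n_dummies  # 7
```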

In the context of hypothesis testing, the test statistic:

Measures the strength of evidence in the data against the null hypothesis

Holding other factors constant, increasing the size of a sample used to calculate a confidence interval will:

Decrease the standard error

The model equation is: MaxBonus = 0.45 + 0.15 * Salary + 0.84 * SEC - 0.11 * (Salary * SEC). True or False: The coefficient of 0.84 indicates that max bonuses are about $840,000 higher at SEC schools versus non-SEC schools, regardless of the coach's salary.

False
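
The statement is false because of the interaction term: the SEC versus non-SEC gap is 0.84 - 0.11 * Salary, which changes with salary. A short sketch that evaluates the fitted equation; the salary values are hypothetical and expressed in whatever units the model uses:

```r
# Fitted model from the question (coefficients copied from the card).
max_bonus <- function(salary, sec) {
  0.45 + 0.15 * salary + 0.84 * sec - 0.11 * (salary * sec)
}

# Hypothetical salary values, in the model's own salary units.
salary <- c(0, 2, 5, 8)

# SEC minus non-SEC prediction at each salary: 0.84 - 0.11 * salary
sec_gap <- max_bonus(salary, sec = 1) - max_bonus(salary, sec = 0)
round(sec_gap, 2)  # 0.84 0.62 0.29 -0.04
```

The gap equals 0.84 only when Salary is zero, and it shrinks as salary rises.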

If the distribution of a numerical variable is unimodal and skewed to the left, then the median is:

Greater than the mean

A data set used by a marketing team contains the following information on 46 different advertising campaigns: -total ad spend (measured in dollars of total advertising cost for that campaign) -visibility (measured in impressions for all ads across the campaign) They fit a linear model to predict a campaign's visibility from its ad spend. What units does the slope have?

Impressions per dollar

What is a correct statement about interactions and/or correlated predictors?

Interaction terms are used in regression to describe situations where the relationship between some x and the outcome y depends on context

A sampling distribution:

Is the distribution of values of a summary statistic that we'd expect to see under repeated realization of the same random data-generating process

The plot that we choose depends on the comparison we are trying to make, among other things. Suppose we have the following data set: The percentage of games won by each University of Texas women's soccer team in every season since the program was established in 1993. Which of the following plots would be an appropriate choice to visualize winning percentage over time based on these data?

Line graph

When analyzing data from a "multi-factor" experiment, it is common to conduct an analysis of variance. In an ANOVA table:

None of these answers are correct: -We track the statistical significance of our first variable, which should increase as variables are added sequentially -We list all the fitted coefficients for all variables in the model -We track the change in R-squared (R^2), which should decrease as variables are added sequentially

Why would we bootstrap a statistical model?

None of these answers are correct: -To minimize the effect of sampling bias on the error of our estimate -To check whether our data forms a true random sample from the population -To reduce the effect of sampling variance on the uncertainty of our estimate

Which of the following is true of p-values?

None of these are correct: -A p-value represents the probability of observing our specific test statistic, assuming that the null hypothesis is correct -A p-value represents a conditional probability that the null hypothesis is correct, given the data that we have observed -Larger p-values are indicative of more evidence against the null hypothesis

The Austin City Council conducted a random sample of Austin residents on whether or not they approve of the upcoming 8.75-cent tax rate election to fund a mass transit plan including light rail, new bus routes, and a downtown subway system. Of the 1734 Austinites in the random sample, 51% approved of the plan. The standard error of this estimate was 1.25%, and the margin of error was set to be plus-or-minus two standard errors (2.5%). We can conclude that:

None of these are correct: -51% of Austinites in the overall population approve of the mass transit plan -We're 100% certain, in light of the survey, that the proportion of Austinites in the wider population who approve of the transit plan is between 48.5%-53.5% -Nothing useful, because 1734 randomly sampled respondents isn't enough to give us reliable information about a population of more than a million people

What is a representation of the concept of "sampling WITH replacement"?

Professor Snape selects a sample of students to "cold call" during each of his NEWT-level Potions classes. For each question, he uses a Resampulus charm, wherein his wand randomly points to a student irrespective of who was called previously. There is no limit on the number of times that an individual student might be selected during any given class session

A formula that defines each term in a sequence using the preceding terms in that sequence is said to be:

Recursive

The Amazon e-commerce data science team fits a model to understand the extent to which having an Amazon Prime membership leads customers to buy more from the platform than customers without a Prime membership. The data set includes two variables: total sales revenue for a customer account and a dummy variable representing whether the customer had a Prime membership. The fitted model equation is: Sales = 764 + 1323 * Prime + e. The data science team calculates R-squared for this model to be 0.24 (R^2 = 0.24). This measure indicates which of the following?

Sources of variation within customer groups (Prime and non-Prime) dwarf the variation between the groups

Bitcoin is a decentralized digital currency that functions without a central bank. It was first released as open-source software in 2009. Since then, many of these so-called cryptocurrencies have raised money through Initial Coin Offerings (ICOs). A data frame gives information on nine ICOs that have raised more than $100 million. For each ICO, the dataset lists an ID (numbered from 1 to 9), the name of the ICO, the location of the ICO, and the amount raised by the ICO in millions of U.S. dollars. Which of the following represent the cases of this data frame?

The ICOs

Many situations involve two sources of information about some uncertain event: 1) background information, i.e. how often the event tends to occur generally; 2) case-specific data, i.e. how likely it is that we'd see the data at hand if the event had actually occurred. When people focus only on the case-specific data and ignore the background information, this is referred to as:

The base-rate fallacy

Which of the following are named elements of the Bayes' rule equation?

The prior, the posterior, and the likelihood

The long-run rate of defective iPhones coming off the assembly line is 0.6% when all manufacturing processes are working correctly. Because testing each phone for defects would be cost prohibitive, a random sample of 500 phones is inspected every 2 hours to determine if the manufacturing processes are working correctly, or if there may be an issue leading to a higher rate of production defects. Halting the production line unnecessarily leads to lost revenue from fewer units shipping. On the other hand, producing defective phones is bad for brand integrity. Which of the following represents a Type I Error in this context?

The process is working correctly and the plant manager temporarily halts production

A manufacturer of video game controllers is concerned that their controller may be difficult for left-handed users. Suppose that 22% of the population is left-handed. Consider a sample of 12 customers. Can the number of left-handed gamers in the sample be modeled as a binomial random variable?

Yes, if we assume that each customer has a 22% chance of being left-handed
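
Under that assumption (and treating the 12 customers as independent), the count of left-handed customers follows a binomial distribution with n = 12 and p = 0.22. A brief sketch with base R's `dbinom` (variable names are illustrative):

```r
# Binomial model: number of left-handed customers out of 12.
n <- 12
p_left <- 0.22

# Probability of each possible count, 0 through 12
pmf <- dbinom(0:n, size = n, prob = p_left)

# Expected number of left-handed customers in the sample
expected_lefties <- n * p_left  # 2.64

# Probability that nobody in the sample is left-handed: (1 - p_left)^n
p_none <- dbinom(0, size = n, prob = p_left)
```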

Suppose we have a random sample of U.S. voters in a data frame called voters. In this data frame, there is a binary variable called House2022, which represents whether the voter intends to vote for a Republican or Democrat in their local election for the 2022 U.S. House of Representatives. Which of the following R commands would allow you to generate a bootstrap sampling distribution (using ten thousand simulations) for the proportion of all U.S. voters who intend to vote for a Republican in the 2022 midterm House elections?

do(10000) * prop(~House2022, data=resample(voters))
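
The command above relies on the mosaic package. For readers without it, the same idea (resample rows with replacement, recompute the proportion, repeat 10,000 times) can be sketched in base R. The `voters` data below is simulated purely for illustration, and the level names "Republican" and "Democrat" are assumptions, not taken from the course data:

```r
set.seed(1)

# Simulated stand-in for the voters data frame (hypothetical values).
voters <- data.frame(
  House2022 = sample(c("Republican", "Democrat"), size = 500, replace = TRUE)
)

# Bootstrap: resample rows with replacement and recompute the proportion.
boot_props <- replicate(10000, {
  rows <- sample(nrow(voters), replace = TRUE)
  mean(voters$House2022[rows] == "Republican")
})

# Bootstrap standard error and a 95% percentile interval
sd(boot_props)
quantile(boot_props, c(0.025, 0.975))
```

Each resample has the same size as the original sample, which is what makes the spread of `boot_props` a fair stand-in for sampling variability.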

Payton Hobart is running for class president. Pre-election polls have generally agreed that he has the support of 63% of the student body. However, his chief data scientist recently surveyed a random sample of 75 students and observed that only 44 of them were Hobart supporters. She decides to conduct a hypothesis test to determine if her recent poll was anomalous, or whether it was consistent with the other pre-election polls. The null hypothesis for this test assumes which of the following for the probability (p) that a randomly chosen member of the student body will support Hobart?

p = 0.63
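
Under this null hypothesis, the number of supporters in a sample of 75 follows a binomial distribution with p = 0.63. A sketch of how the test could be carried out in base R; using the lower tail for the one-sided p-value is an assumption here, since the question does not say which alternative the analyst used:

```r
# Null model for the Hobart poll.
n <- 75
p0 <- 0.63
observed <- 44

# Expected number of supporters if the null is true
expected <- n * p0  # 47.25

# One-sided p-value: chance of 44 or fewer supporters under the null
p_lower <- pbinom(observed, size = n, prob = p0)

# Exact two-sided binomial test from base R's stats package
p_two_sided <- binom.test(observed, n, p = p0)$p.value

c(lower_tail = p_lower, two_sided = p_two_sided)
```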

