Business Analytics Final
According to the Central Limit Theorem, the means of random samples from distributions will be normally distributed IF ...
, assuming the samples are sufficiently large?
If an independent variable has a p-value of 0.0050, which of the following could represent the Lower 95% and the Upper 95% for that variable?
-235.62; -5.64 The p-value, 0.0050, is less than 0.05 so the independent variable is significant at the 5% significance level. Therefore, the 95% confidence interval for the coefficient of the independent variable does not include zero. The interval between -235.62 and -5.64 does not contain zero.
If p-value is less than .05 then....
.we can be confident that the relationship is significant at the 5% significance level and, equivalently, at the 95% confidence level.
For a standard normal distribution (µ=0, σ=1), the area under the curve less than 1.5 is 93.32%. What is the approximate percentage of the area under the curve less than -1.5?
6.68% 1-93.32%=6.68% is the area under the curve greater than 1.5. Since the normal distribution is symmetric, 6.68% is also the area under the curve less than -1.5.
Based on the regression model, forecast the expected production volume when there are 112 factory workers.
= intercept + (variable *112)
Points scattered all over without pattern
A low R2 and high p-value indicates that the independent variable explains little variation in the dependent variable and the linear relationship is not significant. Since the data points are widely dispersed and do not indicate a clear linear pattern, this relationship likely has a low R2 and high p-value.
A company randomly surveys 15 VIP customers and records their customer satisfaction scores out of a possible 100 points. Based on the data provided, calculate a 90% confidence interval to estimate the true satisfaction score of all VIP customers.
First calculate the mean and standard deviation of the sample, which you can do in Excel using either the descriptive statistics tool or the AVERAGE and STDEV.S functions. The mean and standard deviation are approximately 76.60 and 11.28 respectively. Since the sample size is only 15, use the function CONFIDENCE.T(alpha, standard_dev, size) to find the margin of error using the t-distribution. Here, this is approximately CONFIDENCE.T(0.1,11.28,15)=5.13. The lower bound of the 90% confidence interval is the mean minus the margin of error, 76.60-5.13=71.47. The upper bound of the 90% confidence interval is the mean plus the margin of error, 76.60+5.13=81.73. You must link directly to values in order to obtain the correct answer.
A store owner is interested in opening a second shop. She wants to estimate the true average daily revenue of her current shop to decide whether expanding her business is a good idea. The store owner takes a random sample of 60 days over a six-month period and finds that the mean revenue of those days is 3,472.00 dollars with variance 315,900.20 square dollars. Calculate a 95% confidence interval to estimate the true average daily revenue.
First calculate the sample standard deviation, which is equal to the square root of the variance. The sample standard deviation is $562.05. Then find the margin of error using the Excel function CONFIDENCE.NORM(alpha, standard_dev, size). Here, CONFIDENCE.NORM(0.05,SQRT(B2),60)=CONFIDENCE.NORM(0.05,562.05,60)=$142.22. The lower bound of the 95% confidence interval is the mean minus the margin of error, $3,472.00-$142.22=$3,329.78. The upper bound of the 95% confidence interval is the mean plus the margin of error, $3,472.00+$142.22=$3,614.22. You must link directly to values in order to obtain the correct answer.
Regression Analysis with multiple data
From the Data menu, select Data Analysis, then select Regression. The Input Y Range is B1:B31 and the Input X Range is C1:D31. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.
A researcher at a university wants to better understand how study habits affect student grades. To determine whether there is a link between hours spent studying and exam scores, the researcher recruits fifty students in an advanced calculus class to record their study hours over the semester, and also obtains permission to see their scores on the final exam. Perform a regression analysis, where the average number of hours spent studying per week over the two-month data collection period is the independent variable and the score on the final exam is the dependent variable. Be sure to include the residuals and residual plot in your analysis.
From the Data menu, select Data Analysis, then select Regression. The Input Y Range is B1:B51 and the Input X Range is A1:A51. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.
A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site has an average of 124,247 unique listeners per day. The manager finds that the p-value for the hypothesis test is approximately 0.0743. How would you interpret the p-value?
If the average number of unique daily listeners per day is still 131,520, the likelihood of obtaining a sample with a mean at least as extreme as 124,247 is 7.43%. The null hypothesis is that the average number of unique daily listeners per day has not changed, that is, it is still 131,520. Therefore, the p-value of 0.0743 indicates that if the average number of unique daily listeners is still 131,520, the likelihood of obtaining a sample with a mean at least as extreme as 124,247 is 7.43%%.
The owner of an ice cream shop wants to determine whether there is a relationship between ice cream sales and temperature. The owner collects data on temperature and sales for a random sample of 30 days and runs a regression to determine if there is a relationship between temperature (in degrees) and ice cream sales. The p-value for the two-sided hypothesis test is 0.04. How would you interpret the p-value?
If there is no relationship between temperature and sales, the chance of selecting a sample this extreme would be 4%. Correct. The null hypothesis is that there is no relationship. The p-value indicates how likely we would be to select a sample this extreme if the null hypothesis is true.
How do you increase confidence interval?
Increase confidence level + increase sample size
If the street fair organizer wanted to compare the explanatory power of the original model and the following new regression model, which value should he consult for the new model?
It is important to use the Adjusted R2 to compare two regression models that have a different number of independent variables. 0.9225 is the Adjusted R2 of the new model.
What is the correlation coefficient of the relationship between the average weekly hours spent studying and the score on the final exam?
Multiple R Value
One sided vs two sided
One-sided hypothesis tests test for either an increase or a decrease, but not both. Two-sided hypothesis tests test for a change in either direction. The following are one-sided hypothesis tests: Test whether incoming students at a business school receive better grades in their classes if they've taken an on-line program covering basic material Test whether users of a commercial website are less likely to make a purchase if they are required to set up a user account on the site The following are two-sided hypothesis tests: Test whether there is a difference between men's and women's usage of a mobile fitness app Test whether the number of listeners of a streaming music service has changed after they changed the user interface
The following data set provides the 2012 revenue (in billions of dollars) for the top 75 companies as declared by the Fortune 500 rankings. What amount do 60% of the companies earn equal to or less than?
PERCENTILE.INC(B2:B76,0.60)=$74.40 billion. You must link directly to values in order to obtain the correct answer.
P VALUE IS ..
REGRESSION/SIGNIFICANCE F
A sporting goods store manager wants to forecast annual sneaker revenues based on the type of sport (running, tennis, or walking), color (red, blue, white, black, or violet) and its target audience (men or women). How many independent variables should the manager include in her multiple regression analysis? Please enter your answer as an integer; that is with no decimal point.
Sales revenue is the dependent variable. Type of sport, color, and target audience are categorical variables which must be represented using dummy variables. Recall that it is necessary to use one fewer dummy variables than the number of options in a category. Thus, type of sport should be represented by 3-1=2 dummy variables, color should be represented by 5-1=4 dummy variables, and target audience should be represented by 2-1=1 dummy variables, for a total of 2+4+1=7 independent variables.
A curious student in a large economics course is interested in calculating the percentage of his classmates who scored lower than he did on the GMAT; he scored 490. He knows that GMAT scores are normally distributed and that the average score is approximately 540. He also knows that 95% of his classmates scored between 400 and 680. Based on this information, calculate the percentage of his classmates who scored lower than he did.
Since GMAT scores are normally distributed, we know that P(μ-1.96σ ≤ x ≤ μ+1.96σ) = 95%. Thus, to find the standard deviation, subtract the lower bound from the mean and divide by 1.96. The standard deviation of the distribution is (B1-B2)/1.96 = (540-400)/1.96 = 71.4. (Note that because the normal curve is symmetrical, we could calculate the same value using (B3-B1)/1.96 = (680-540)/1.96 = 71.4). To find the cumulative probability, P(x ≤ 490), use the Excel function NORM.DIST(x, mean, standard_dev, TRUE). Here, NORM.DIST(B4,B1,71.4,TRUE) = NORM.DIST(490,540,71.4,TRUE) = 0.24, or 24%. Approximately 24% of his classmates scored lower than he did. You must link directly to the values in order to obtain the correct answer
Which of the formulas would calculate the statistic that is MOST APPROPRIATE for comparing the variability of two data sets with different distributions?
Standard Deviation/Mean This is the formula for the coefficient of variation, the best statistic to compute to compare the variability of two data sets with different distributions. Dividing by the mean provides a measure of the distribution's variation relative to the mean
R-square indicates what percentage of the variability in the dependent variable is explained by the regression line. if r sqaured = 0.7059 then...
That 71% of the variability in the number of chairs produced can be explained by whether the shift is in the morning or evening and whether it is a weekday shift or weekend shift.
A professor presents a summary of grades to her students. A student asks why the mean of the data set is so much smaller than the median. Which of the following is most likely?
The distribution of the data is skewed to the left When the distribution of data is skewed to the left, the mean is most likely less than the median. The extreme values in the left tail pull the mean towards them.
Bias vs Unbias
The following are biased questions: Isn't Daft Punk a better band than Oasis? Research has linked carbon emissions to global warming. Do you think the US government should enact legislation to limit carbon emissions? Do you enjoy the work of such literary giants as William Shakespeare? Do you think people benefit from taking overpriced diet supplements? The following are unbiased questions: Do you believe that current popular music is better, worse, or about the same quality as popular music from 20 years ago? Do you think women should be drafted into the military?
A journalist wants to determine the average annual salary of CEOs in the S&P 1,500. He does not have time to survey all 1,500 CEOs but wants to be 95% confident that his estimate is within $50,000 of the true mean. The journalist takes a preliminary sample and estimates that the standard deviation is approximately $449,300. What is the minimum number of CEOs that the journalist must survey to be within $50,000 of the true average annual salary? Remember that the z-value associated with a 95% confidence interval is 1.96. (Please enter your answer as an integer; that is, as a whole number with no decimal point.)
The formula for calculating the minimum required sample size is n≥(z X*s/m)^2, where M=50.000 is the desired margin of error for the confidence interval, s=$449,300 is the sample standard deviation, and z=1.96. Using these data we find that 1.96449,30050,0002=310.20 Since n must be an integer (let's not even think of what 0.20 CEOs would look like!) and n must be greater than or equal to 310.20, we must round up to 311. Since 310.20 is closer to 310 than to 311, we would normally round 310.20 down to 310. However, in this case we must round up to find the smallest integer that satisfies the equation. Therefore, the minimum required sample size is 311.
How can we tell which independent variables are significant at the 95% confidence level?
The intercept is not an independent variable and doesnt contain a zero
A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site now has an average of 124,247 unique listeners per day. The manager finds that the p-value for the hypothesis test is approximately 0.0743. What can be concluded at the 95% confidence level?
The manager should fail to reject the null hypothesis; there is not enough evidence to conclude that the number of unique daily listeners has changed. Since the p-value, 0.0743, is greater than the significance level, 0.05, the manager should fail to reject the null hypothesis.
A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site has an average of 124,247 unique listeners per day.
The null hypothesis is that number of unique listeners per day has not changed. Thus, μ=131,520 is the null hypothesis. The alternative hypothesis is that the number of unique listener per day has changed. Thus, μ≠131,520 is the alternative hypothesis.
The owner of a local health food store recently started a new ad campaign to attract more business and wants to test whether average daily sales have increased. Historically average daily sales were approximately $2,700. After the ad campaign, the owner took a random sample of forty-five days and found that daily average sales had increased to $2,984. What is store owner's null hypothesis?
The null hypothesis is the opposite of the hypothesis you are trying to substantiate. Since the owner wants to test for an increase, the null hypothesis is μ≤$2,700. Remember that the null hypothesis is always based on historical information.
If the two-sided p-value of a given sample is 0.0020, what is the one-sided p-value for that sample mean?
The one-sided p-value is half of the two-sided p-value. Thus, the one-sided p-value is 0.00202=0.0010.
An airport shuttle company forecasts the number of hours its drivers will work based on the distance to be driven (in miles) and the number of jobs (each job requires the pickup and drop-off of one set of passengers) using the following regression equation: Travel time=-0.60+0.05(distance)+0.75(number of jobs) On a given day, Victor and Sofia drive approximately the same distance but Sofia has two more jobs than Victor. If Victor worked for 4 hours, for how long can the company expect Sofia to work? Please enter your answer rounded to one digit to the right of the decimal point. For example, if you think Sofia would work 236.7134 hours, enter 236.7.
The only difference between the workloads of the two drivers is the number of jobs each has; Sofia has two additional jobs. Therefore the company can expect Sofia to work the four hours Victor worked, plus an additional 0.75 hours for each of the two additional jobs, that is, 4+0.75(2)=5.5 hours. For your answer to be graded as correct, it must be rounded to one digit to the right of the decimal point, as specified in the question.
A business school professor is interested to know if watching a video about the Central Limit Theorem helps students understand it. To assess this, the professor tests students' knowledge both immediately before they watch the video and immediately after. The professor takes a sample of students, and for each one compares their test score after the video to their score before the video. Using the data below, calculate the p-value for the following hypothesis test:
The p-value of the one-sided hypothesis test is T.TEST(array1, array2, tails, type)=T.TEST(B2:B31,C2:C31,1,1), which is approximately 0.0128. You must designate this test as a one-sided test (that is, assign the value 1 to the tails argument) and as a type 1 (a paired test) because you are testing the same students on the same knowledge at two points in time. You must link directly to values in order to obtain the correct answer.
A college student is interested in testing whether business majors or liberal arts majors are better at trivia. The student gives a trivia quiz to a random sample of 30 business school majors and finds the sample's average score is 86. He gives the same quiz to 30 randomly selected liberal arts majors and finds the sample's average score is 89. Using the data provided below, calculate the p-value for the following hypothesis test:
The p-value of the two-sided hypothesis test is T.TEST(array1, array2, tails, type)=T.TEST(A2:A31,B2:B31,2,3), which is approximately 0.0524. You must designate this test as a two-sided test (that is, assign the value 2 to the tails argument) and as a type 3 test (an unpaired test with unequal variances) because you are testing two different samples. You must link directly to values in order to obtain the correct answer.
A hotel CEO wants to determine whether average customer spending on incidentals at the hotel has changed since the implementation of a new rewards program. Historically the average amount spent per person per day was approximately $70. After the implementation of the rewards program, the hotel randomly sampled 2,000 guests and calculated the sample's summary statistics. The average amount spent per person per day in the sample was approximately $83 with a standard deviation of $14. Select the function that would correctly calculate the 95% range of likely sample means.
The range of likely sample means is centered at the historical population mean, in this case $70. Since this is a 95% range of likely sample means, alpha equals 0.05. Excel's CONFIDENCE.NORM function syntax is CONFIDENCE.NORM(alpha, standard_dev, size).
Consider the four outliers in the 2012 revenue data: companies with revenue of $237 billion, $246 billion, $447 billion, and $453 billion. If we removed these companies from the data set, what would happen to the standard deviation?
The standard deviation would decrease. The standard deviation gives more weight to observations that are further from the mean. Therefore, removing the outliers would decrease the standard deviation.
A college student is interested in testing whether business majors or liberal arts majors are better at trivia. The student gives a trivia quiz to a random sample of 30 business school majors and finds the sample's average test score is 86. He gives the same quiz to 30 randomly selected liberal arts majors and finds the sample's average quiz score is 89. The student finds that the p-value for the hypothesis test equals approximately 0.0524. What can be concluded at α=5%?
The student should fail to reject the null hypothesis and conclude that there is insufficient evidence of difference between business and liberal arts majors' knowledge of trivia. Since the p-value, 0.0524, is greater than the significance level, 0.05, the student should fail to reject the null hypothesis and conclude that there is insufficient evidence of difference between business and liberal arts majors' knowledge of trivia. Because the null hypothesis is that there is no difference between the two types of majors, this answer is correct.
Dummy vs Quant
Time to run a marathon, height, size of flat-screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete's jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete's jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning's number is 10, whereas Peyton Manning's was 18. However, you can't interpret them to mean that Peyton is 80% more than Eli in some way.
If the mean of a normally distributed population is -10 with a standard deviation of 2, what is the likelihood of obtaining a value less than or equal to -7?
To calculate the likelihood of obtaining a value less than or equal to -7, P(x≤-7), use the Excel function NORM.DIST(x, mean, standard_dev, TRUE). Here, NORM.DIST(-7,B1,B2,TRUE)=NORM.DIST(-7,-10,2,TRUE)=0.93, or 93%. Approximately 93% of the population falls in the area under the curve less than or equal to -7.
The owner of a local health food store recently started a new ad campaign to attract more business and wants to test whether average daily sales have increased. Historically average daily sales were approximately $2,700. After the ad campaign, the owner took another random sample of forty-five days and found that average daily sales were $2,984 with a standard deviation of approximately $585. Calculate the upper bound of the 95% range of likely sample means for this one-sided hypothesis test using the CONFIDENCE.NORM function.
To construct a 95% range of likely sample means, calculate the margin of error using the function CONFIDENCE.NORM(alpha, standard_dev, size). However, CONFIDENCE.NORM finds the margin of error for a two-sided hypothesis test and this question asks for the upper bound of a one-sided test. To find the upper bound for the one-sided test you must first determine what two-sided test would have a 5% rejection region on the right side. Since the distribution of sample means is symmetric, a two-sided test with a 10% significance level would have a 5% rejection region on the left side of the normal distribution and a 5% rejection region on the right side. Thus, the upper bound for a two-sided test with alpha=0.1 will be the same as the upper bound on a one-sided test with alpha=0.05. The margin of error is CONFIDENCE.NORM(alpha, standard_dev, size)= CONFIDENCE.NORM(0.1,C3,C4)=CONFIDENCE.NORM(0.1,585,45)=$143.44. The upper bound of the 95% range of likely sample means for this one-sided hypothesis test is the population mean plus the margin of error, which is approximately $2,700+$143.44=$2,843.44.
How do you create lagged data?
To create the lagged variable, copy the data from C2:C32 and paste it into D3:D33. For example, 93,460 should be in cell D3; 85,220 should be in cell D4, and so on. Cell D2 should be blank and there is a new row of data that has an entry only in cell D33.
IQ scores are known to be normally distributed. The mean IQ score is 100 and the standard deviation is 15. What percent of the population has an IQ between 85 and 105?
To find P(85≤x≤105), the percent of the population has an IQ between 85 and 105, first compute the cumulative probability, P(x≤105), using the Excel function NORM.DIST(x, mean, standard_dev, TRUE). Then subtract the cumulative probability P(x≤85). Here NORM.DIST(105,B1,B2,TRUE)-NORM.DIST(85,B1,B2,TRUE)=NORM.DIST(105,100,15,TRUE)-NORM.DIST(85,100,15,TRUE)=0.63-0.16=0.47, or 47%. Approximately 47% of people have IQ scores between 85 and 105.
IQ scores are known to be normally distributed. The mean IQ score is 100 and the standard deviation is 15. What percent of the population has an IQ over 115?
To find P(x>115), the percent of the population has an IQ over 115, first compute the cumulative probability, P(x≤115), using the Excel function NORM.DIST(x, mean, standard_dev, TRUE). Here NORM.DIST(115,B1,B2,TRUE)=NORM.DIST(115,100,15,TRUE)=0.84, or 84%. Thus, P(x>115)=1-P(x≤115)=1-0.84=0.16, or 16%
which p-value(s) would reject the null hypothesis for a two-sided test at the 90% confidence level.
To reject the null hypothesis at the 90% confidence level, the p-value must be less than 1-0.90=0.10.
A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site now has an average of 124,247 unique listeners per day. Using the data provided below, calculate the p-value for the following hypothesis test: H0:μ=131,520
To use Excel's T.TEST function for a hypothesis test with one sample, you must create a second column of data that will act as a second sample. So, first enter the historical average unique listeners into each cell in column B associated with a day in the sample; that is, enter 131,520 into cells B2 to B31. Then, the p-value of the two-sided hypothesis test is T.TEST(array1, array2, tails, type)=T.TEST(A2:A31,B2:B31,2,3), which is approximately 0.0743. You must link directly to values in order to obtain the correct answer.
An engineer designing a new type of bridge wants to test the stress and load bearing capabilities of a prototype before beginning construction. Her null hypothesis is that the bridge's stress and load capabilities are safe. Select which type of error would be worse. Make sure that the type of error is matched with the correct definition.
Type II; the engineer deems the bridge safe and moves onto construction even though it is not actually safe The type II error is that the engineer deems the bridge safe and moves onto construction even though it is not actually safe. This would be worse than presuming that a safe bridge is unsafe.
A college football coach has decided to recruit only the heaviest 15% of high school football players. He knows that high school players' weights are normally distributed and that this year, the mean weight is 225 pounds with a standard deviation of 43 pounds. Calculate the weight at which the coach should start recruiting players.
Use the properties of the normal distribution to solve this problem. Since the coach is only interested in recruiting the heaviest 15% of players, calculate the weight that 85% of players weigh less than. The Excel function NORM.INV(probability, mean, standard_dev) returns the inverse of a normal cumulative distribution function. Here, NORM.INV(0.85,B1,B2)=NORM.INV(0.85,225,43)=269.57 indicates that 85% of players weigh less than 268.95 pounds. Hence, 15% of high school players weigh 269.57 pounds or more.
IQ scores are known to be normally distributed. The mean IQ score is 100 and the standard deviation is 15. The top 25% of the population (ranked by IQ score) have IQ's above what value?
Use the properties of the normal distribution to solve this problem. Since you are only interested in the top 25%, calculate the IQ at which 75% of people are below. The Excel function NORM.INV(probability, mean, standard_dev) returns the inverse of a normal cumulative distribution function. Here, NORM.INV(0.75,B1,B2)=NORM.INV(0.75,100,15)=110 indicates that 75% of people have IQ's lower than 110. Hence, 25% of people have IQ's greater than 110.
The R Square value measures ???
how much of the total variation in the dependent variable
Using the regression model, forecast equation
intercept + known data
A college student is interested in testing whether business majors or liberal arts majors are better at trivia. The student gives a trivia quiz to a random sample of 30 business majors and finds the sample's average score is 86. He gives the same quiz to 30 randomly selected liberal arts majors and finds the sample's average score is 89. What is the alternative hypothesis of this test?
not equal to; The alternative hypothesis is the claim that is being tested. Since the student wants to test whether there is a difference between business school majors' and liberal arts majors' trivia scores, the alternative hypothesis is that the mean scores are not equal.
Multicollinearity
occurs when two or more independent variables are highly correlated Multicollinearity means that two or more of the independent variables are collinear, meaning they are highly correlated. One or more the independent variables may not be significant because the variable with which it is correlated serves as a proxy variable. Multicollinearity is usually not an issue when the regression model is only being used for forecasting Multicollinearity is typically not a problem when the model is being used for forecasting, especially if the predicative power of the model is increased by the additional variable(s).
How do you calculate coefficient
stand dev/mean
Create a regression model...
use labels, sometimes its b1:c37 to capture multiple data sets
r squared =
variation
If the mean weight of all students in a class is 140 pounds with a variance of 201.64 square pounds, what is the z-value associated with a student whose weight is 165 pounds?
z=x−μσ=165−14014.2≈1.76. The standard deviation, 14.2, is the square root of the variance, 201.64.