Module 5, BA Module 4, BA Module 3, BA Module 1 & 2


For a normal distribution with mean 100 and standard deviation 10, find the probability of obtaining a value greater than or equal to 80 but less than or equal to 115.

First find the cumulative probability of 115 using the function NORM.DIST(115,B1,B2,TRUE). Then calculate the cumulative probability of 80 using the function NORM.DIST(80,B1,B2,TRUE). Then find the difference between the two: NORM.DIST(115,B1,B2,TRUE)-NORM.DIST(80,B1,B2,TRUE)=0.91, or 91%. 91% of the population lies between 80 and 115.
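The same subtraction of cumulative probabilities can be checked outside Excel. This is a minimal sketch using Python's standard-library `statistics.NormalDist` as a stand-in for NORM.DIST with the cumulative flag set to TRUE; the variable names are illustrative only.

```python
from statistics import NormalDist

# Mean and standard deviation from the question (cells B1 and B2 in the Excel answer).
dist = NormalDist(mu=100, sigma=10)

# P(80 <= X <= 115) = cumulative probability at 115 minus cumulative probability at 80.
p_between = dist.cdf(115) - dist.cdf(80)

print(round(p_between, 2))  # about 0.91
```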

If you are performing a hypothesis test based on a 90% confidence level, what are your chances of making a type I error?

10%

How much variation in production volume can be explained by the number of factory workers?

57.56% The percent of variation in production volume that can be explained by the number of factory workers is represented by the R2 value. The R2 value is 57.56%.

What is the probability of obtaining a value less than or equal to two standard deviations below the mean?

Approximately 2%

What probability falls within one standard deviation of the mean?

Approximately 68% The phrase "within one standard deviation of the mean" means "between one standard deviation below the mean and one standard deviation above the mean."

distribution is symmetric

Because the distribution is symmetric and has only one peak, the mean, median, and mode are equal.

margin of error for large sample greater than 30

CONFIDENCE.NORM

We want to know the current average height and weight of citizens in each country that belongs to the European Union.

Cross-Sectional

A researcher finds a positive correlation between the number of traffic lights in a town or city and the number of crimes committed each month in that town. The hidden variable is population. Cities with a greater number of people have more traffic and thus need more traffic lights. These cities also have more people who can commit crimes (and be victims of crimes), and more crimes are committed.

Example of a hidden variable

A scientist believes that, over the years, the number of major earthquakes has been decreasing. To test his hypothesis, the scientist collects data on the number of earthquakes above magnitude 7.0 on the Richter scale that have occurred each year from 1900 to 2012. Using the data below, create a scatter plot with year on the horizontal axis.

From the Insert menu, select Scatter, then select Scatter With Only Markers. The Input Y Range is B1:B114 and the Input X Range is A1:A114. You must check the Labels in first row box to ensure that the scatter plot's axes are appropriately labeled.

If we specify a 75% confidence level, what percentage of sample means do we expect to fall in the rejection region?

25% The rejection region contains the sample means that fall outside the range set by the confidence level. 1-0.75=0.25, that is, 25%.

The scatter plot below displays the relationship between two variables. Which of the following options most accurately describes the R2 value and the p-value of this relationship?

Low R2; high p-value (i.e., p-value greater than 0.05) A low R2 and a high p-value indicate that the independent variable explains little variation in the dependent variable and that the linear relationship is not significant. Since the data points are widely dispersed and do not indicate a clear linear pattern, this relationship likely has a low R2 and high p-value.

Let's use the coefficient of variation to compare a few data sets. Which of the following data sets has the highest relative variation? Mean=7.00; Standard Deviation=1.64 Mean=82.76; Standard Deviation=189.53 Mean=150.82; Standard Deviation=201.15 Mean=172.00; Standard Deviation=25.54

Mean=82.76; Standard Deviation=189.53
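To see why the second data set wins, the coefficient of variation (standard deviation divided by mean) can be computed for all four options. This is a small sketch; the labels A through D are hypothetical names for the four options in the order listed in the question.

```python
# (mean, standard deviation) for each option, in the order given in the question.
datasets = {
    "A": (7.00, 1.64),
    "B": (82.76, 189.53),
    "C": (150.82, 201.15),
    "D": (172.00, 25.54),
}

# Coefficient of variation = standard deviation / mean.
cv = {name: sd / mean for name, (mean, sd) in datasets.items()}

# The data set with the largest CV has the highest relative variation.
highest = max(cv, key=cv.get)

for name, value in cv.items():
    print(name, round(value, 3))
print("Highest relative variation:", highest)
```

Option B has a standard deviation more than twice its mean, which is why its relative variation is the largest even though option C has the larger standard deviation in absolute terms.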

Now let's look at the model which includes house size, distance from Boston, and lot size. Are all of the independent variables significant at the 5% significance level?

No The p-value for lot size is 0.1975, which is greater than 0.05, indicating that the relationship is not significant. You may also notice that the range indicated by the lower and upper bounds of the 95% confidence interval for lot size (-2.20, 10.13) contains zero. However, note that house size and distance from Boston are significant.

A hidden variable, such as GDP, may explain variation in oil consumption across various countries, and provide more clarity than looking solely at the number of barrels of oil consumed.

Not an example of a hidden variable

On the basis of the resulting p-value, would we reject the null hypothesis or fail to reject the null hypothesis at the 0.05 significance level?

Reject the null hypothesis Because the p-value, 0.0000, is less than the significance level, we should reject the null hypothesis.

example of biased

SHOULD WE HAVE STRICTER GUN LAWS TO PREVENT SCHOOL SHOOTINGS? SECONDHAND SMOKE CAN CAUSE CANCER. SHOULD WE BAN SMOKING FROM RESTAURANTS AND OTHER PUBLIC SPACES? UNDERAGE DRINKING IS OKAY, RIGHT? MOST NEW CAR OWNERS PURCHASED AN EXTENDED WARRANTY FROM THE DEALERSHIP. ARE YOU GOING TO BUY ONE? CHILDHOOD OBESITY IS ON THE RISE. DO YOU THINK KIDS SPEND TOO MANY HOURS PLAYING COMPUTER GAMES?

According to the Central Limit Theorem, the means of random samples from which of the following distributions will be normally distributed, assuming the samples are sufficiently large?

The heights of basketball players The sum of two dice The annual income of HBS Online graduates

If two data sets, A and B, have equal standard deviations, which of the following statements is true? Select all that apply.

The variance of A must equal the variance of B The variance is equal to the square of the standard deviation. If the standard deviation of A and B are equal, then the variances must also be equal. Note that another option is also correct.

An engineer designing a new type of bridge wants to test the stress and load bearing capabilities of a prototype before beginning construction. Her null hypothesis is that the bridge's stress and load capabilities are safe. Select which type of error would be worse. Make sure that the type of error is matched with the correct definition.

Type II; the engineer deems the bridge safe and moves onto construction even though it is not actually safe The type II error is that the engineer deems the bridge safe and moves onto construction even though it is not actually safe. This would be worse than presuming that a safe bridge is unsafe.

Are the results for Total Units Ordered (Units) significant at the 95% confidence level?

Yes A test's results are significant if the p-value is less than the significance level. In this case, the p-value, 0.0169, is less than 0.05, so the results are significant.

Are the relationships between selling price and house size, and between selling price and distance both significant at the 95% confidence level?

Yes The p-values for the independent variables (house size and distance), 0.0000 and 0.0033, respectively, are less than 0.05, so we can be confident that the relationship between price, house size, and distance is significant.

Assuming we take sufficiently large samples, which of the following populations will have a distribution of sample means that is normal? Select all that apply.

a non symmetrical

confidence interval.

depends on the sample's mean, standard deviation, and sample size. As we'll see, a confidence interval also depends on how "confident" we would like to be that the range contains the true mean of the population. For example, if our sample mean is 50, we might calculate a 95% confidence interval from 42 to 58. We can be 95% confident that our interval contains the true population mean.

survey

researchers ask questions and record self-reported responses from a random sample of a population.

t-test (p-value)

the most common method used for hypothesis tests.

Correlation Coefficient(R)

±√(R²)

The manager now has reason to believe that showing old classics has increased the customer satisfaction rating. For this one-sided hypothesis test, what alternative hypothesis should he use?

μ>6.7 The manager has reason to believe that the new artistic approach has increased the average customer satisfaction, so for a one-sided test he should use the alternative hypothesis Ha: μ>6.7. This is the claim he wishes to substantiate.

If the mean weight of all students in a class is 165 pounds with a variance of 234.09 square pounds, what is the z-value associated with a student whose weight is 140 pounds?

-1.63 z = (x−μ)/σ = (140−165)/15.3 ≈ −1.63. The standard deviation, 15.3, is the square root of the variance, 234.09.
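The two steps, taking the square root of the variance and then standardizing, can be sketched in a few lines of Python. The variable names are illustrative.

```python
import math

mean_weight = 165      # pounds
variance = 234.09      # square pounds
student_weight = 140   # pounds

# The standard deviation is the square root of the variance.
sigma = math.sqrt(variance)

# z-value: how many standard deviations the student is from the mean.
z = (student_weight - mean_weight) / sigma

print(round(sigma, 1), round(z, 2))
```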

A movie theater manager wants to determine whether popcorn sales have increased since the theater switched from using "butter-flavored topping" to real butter. Historically the average popcorn revenue per weekend day was approximately $3,500. After the theater started using real butter, the manager randomly sampled 12 weekend days and calculated the sample's summary statistics. The average revenue per weekend day in the sample was approximately $4,200 with a standard deviation of $140. Select the function that would correctly calculate the 90% range of likely sample means.

3,500±CONFIDENCE.T(0.10,140,12) Correct. The range of likely sample means is centered at the historical population mean, in this case $3,500. Because the sample contains fewer than 30 data points, we use CONFIDENCE.T. Excel's CONFIDENCE.T function syntax is CONFIDENCE.T(alpha, standard_dev, size). Because we wish to construct a 90% range of likely sample means, alpha equals 0.10.

What is the significance level for a 95% confidence level?

5% Significance level=1-confidence level. 1-0.95=0.05, that is, 5%.

What percentile does the median represent?

50% Remember that half of a distribution's data points are less than or equal to the median. Therefore, the median is equal to the 50th percentile, because 50% of the data points are equal to or below this value.

The mean score on a particular standardized test is 500, with a standard deviation of 100. To assess whether a training course has been effective in improving scores on the test, we take a random sample of 100 students from the course and find that the average score of this sample is 550. Which function would correctly calculate the 95% range of likely sample means under the null hypothesis?

500 ± CONFIDENCE.NORM(0.05,100,100) The range of likely sample means is centered at the historical population mean, 500. Because our sample is larger than 30, we can assume the distribution of sample means is roughly normal, due to the central limit theorem, and use the CONFIDENCE.NORM function.
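For intuition, the margin that CONFIDENCE.NORM(alpha, standard_dev, size) returns is the z-value for the chosen confidence level times the standard deviation divided by the square root of the sample size. The sketch below reproduces that calculation in Python; `confidence_norm` is a hypothetical helper written for this card, not a library function.

```python
import math
from statistics import NormalDist

def confidence_norm(alpha, standard_dev, size):
    """Margin of error for a two-sided range, mirroring Excel's CONFIDENCE.NORM."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    return z * standard_dev / math.sqrt(size)

historical_mean = 500
margin = confidence_norm(0.05, 100, 100)
lower, upper = historical_mean - margin, historical_mean + margin

print(round(lower, 1), round(upper, 1))
# The sample mean of 550 lies above the upper bound of the range.
print(550 > upper)
```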

The mean score on a particular standardized test is 500, with a standard deviation of 100. To assess whether a training course has been effective in improving scores on the test, we take a random sample of 20 students from the course and find that the average score of this sample is 550. Which function would correctly calculate the 95% range of likely sample means under the null hypothesis?

500 ± CONFIDENCE.T(0.05,100,20) MEAN

Now suppose we take a sample of 25 students, taking the same standardized test, which has a mean score of 500 and a standard deviation of 100, and find that the average score of this sample is 530. Which function would correctly calculate the 95% range of likely sample means under the null hypothesis?

500 ± CONFIDENCE.T(0.05,100,25) The range of likely sample means is centered at the historical population mean, 500. Because our sample is less than 30, we cannot assume that the sample means are normally distributed, and so we should use CONFIDENCE.T rather than the CONFIDENCE.NORM function.

The mean of women's heights is 63.5 inches and the standard deviation is 2.5 inches. Select the two heights that define the lower and upper bounds of a range that encompasses about 99.7% of all women's heights.

56 inches A range that is three standard deviations from the mean encompasses about 99.7% of a normal distribution. In this case, the lower bound of the range equals the mean minus three standard deviations =63.5-3(2.5)=56 inches.

Suppose we wanted to calculate a 90% range of likely sample means for the movie theater example. Select the function that would correctly calculate this range.

6.7±CONFIDENCE.NORM(0.10,2.8,196) The range of likely sample means is centered at the historical population mean, in this case 6.7. Since this is a 90% range of likely sample means, alpha equals 0.10.

Suppose we wanted to calculate a 90% range of likely sample means for the movie theater example but our sample size had been only 15. (Assume the same historical population mean, sample mean, and sample standard deviation.) Select the function that would correctly calculate this range.

6.7±CONFIDENCE.T(0.10,2.8,15) The range of likely sample means is centered at the historical population mean, in this case 6.7. We must use CONFIDENCE.T since the sample size is less than 30.

A grocery store owner wants to analyze how weather, day of the week, and time of day are related to the number of transactions completed per hour. Which of the following hypothesis tests is NOT conducted in the multiple regression model that contains these variables?

A hypothesis test for the significance of day of the week on time of day, provided number of transactions remain constant

Recall that the owner of a local health food store recently started a new ad campaign to attract more business and wants to know if average daily sales have increased. Historically average daily sales were approximately $2,700. The upper bound of the 95% range of likely sample means for this one-sided test is approximately $2,843.44. If the owner took a random sample of forty-five days and found that daily average sales were now $2,984, what can she conclude at the 95% confidence level?

Average daily sales have increased Since the sample mean, $2,984, falls outside the range of likely sample means (which has an upper bound=$2,843.44), the store owner can reject the null hypothesis that μ≤$2,700 at a 95% confidence level. Since she can reject the null hypothesis, she can essentially accept the alternative hypothesis and conclude the average daily sales have increased.

For a normal distribution with mean 47 and standard deviation 6, find the upper value and lower value of the range of outcomes associated with the middle 25%.

Because the normal curve is symmetrical, the middle 25% corresponds to the 12.5% range below the mean and the 12.5% range above the mean. Thus we need to find the values associated with the cumulative probabilities 50%-12.5%=37.5% and 50%+12.5%=62.5%. Since NORM.INV(0.375,B1,B2)=45 and NORM.INV(0.625,B1,B2)=49, the middle 25% of the probability lies between 45 and 49. Note that these values define a symmetric range centered at the mean. You must link directly to cells to obtain the correct answer.
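The same lookup can be sketched in Python, where `NormalDist.inv_cdf` plays the role of NORM.INV. The variable names are illustrative.

```python
from statistics import NormalDist

dist = NormalDist(mu=47, sigma=6)

# The middle 25% runs from the 37.5th percentile to the 62.5th percentile.
lower = dist.inv_cdf(0.375)
upper = dist.inv_cdf(0.625)

print(round(lower), round(upper))
```

Both bounds sit the same distance from the mean, which is the symmetry the answer points out.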

coefficient of variation

Coefficient of Variation = Standard Deviation / Mean. Example: CV(Stock A) = 4.50/25.70 ≈ 0.175; CV(Stock B) = 4.50/100.00 = 0.045. Stock A is more volatile.

Correlation values close to either +1 or -1 indicate strong relationships. Values closer to +1 indicate a strong positive relationship, where an increase in A is associated with a similar increase in B. Values closer to -1 indicate a strong negative relationship, where an increase in A is associated with a decrease in B.

Correlation coefficients that approach 1 or approach -1 indicate a strong relationship. A correlation coefficient that is 1 indicates the two variables are perfectly coordinated, while a correlation coefficient of -1 indicates the two variables are perfectly INVERSELY coordinated. A correlation coefficient of zero doesn't tell you very much: the two variables could be totally unrelated to each other, or they might make a scatter plot with a curve, which is a relationship.

Cross-Sectional

Cross-sectional data contain data that measure an attribute across multiple different subjects (e.g. people, organizations, countries) at a given moment in time or during a given time period. The average oil consumption of ten countries in 2012 is an example of cross-sectional data. Managers use cross-sectional data to compare metrics across multiple groups.

A student finds that there is a positive correlation between the volume of music and the prevalence of acne. The hidden variable is age; teenagers tend to listen to louder music and have more acne.

Example of a hidden variable

A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site has an average of 124,247 unique listeners per day. The manager finds that the p-value for the hypothesis test is approximately 0.0743. How would you interpret the p-value?

If the average number of unique daily listeners is still 131,520, the likelihood of obtaining a sample with a mean at least as extreme as 124,247 is 7.43%. The null hypothesis is that the average number of unique daily listeners has not changed, that is, it is still 131,520. Therefore, the p-value of 0.0743 indicates that if the average number of unique daily listeners is still 131,520, the likelihood of obtaining a sample with a mean at least as extreme as 124,247 is 7.43%.

Since the p-value, 0.0026, is less than the 0.05 significance level, we reject the null hypothesis and conclude that the customer satisfaction rating has changed. How would you interpret the p-value of 0.0026?

If the null hypothesis is true, the likelihood of obtaining a sample with a mean at least as extreme as 7.3 is 0.26% The p-value of 0.0026 indicates that if the population mean were actually still 6.7, there would be a very small possibility, just 0.26%, of obtaining a sample with a mean at least as extreme as 7.3. Equivalently, since 7.3-6.7=0.6, this p-value tells us that if the null hypothesis is true, the probability of obtaining a sample with a mean less than 6.7-0.6=6.1 or greater than 6.7+0.6=7.3 is 0.26%.

The manager of a factory that is making an average of 14,000 pints of ice cream a day decides to start playing music for employees, believing this decision will increase both employee morale and productivity. However, the manager is concerned about the possibility that the music could distract employees, thereby decreasing productivity. After a month of playing music, the factory was making an average of 13,518 pints a day. The manager runs a two-sided hypothesis test to determine if the number of pints produced has changed. The p-value of the test is 0.238. What does this say about ice cream production?

If there were no actual change in the average number of pints of ice cream produced daily, the chance of seeing average ice cream production as low as 13,518 pints would be 23.8%.

When performing a hypothesis test based on a 95% confidence level, what are the chances of making a type II error?

It is not possible to tell without more information. A type II error occurs when we fail to reject the null hypothesis when the null hypothesis is actually false. The confidence level does not provide any information about the likelihood of making a type II error. Calculating the chances of making a type II error is quite complex and beyond the scope of this course.

In addition to deciding how to select a random sample, we also must determine how large the sample should be. The appropriate sample size depends on how accurate we want our estimates of the population parameters to be. Suppose we want to sample from two populations—the first population comprises 5,000 observations and the second population comprises 5 million. If we take a sample of size 1,000 from the first population, how many times larger does the sample need to be from the second population to ensure the same level of accuracy?

No larger We might expect that for a larger population, a larger sample size is needed to achieve a given level of accuracy, but this is not necessarily true. A sample of 1,000 is often a satisfactory representation of a population numbering in the millions, as long as the sample is randomly selected and representative of the entire population.

Market researchers at a corporation assess the sales and revenue for the corporation's hot dog subsidiary, but do not pay attention to the fact that many people in their market are vegetarians. The researchers' lack of understanding about the dietary habits of the market is a hidden variable.

Not an example of a hidden variable

A retail store owner offers a small discount on the same-day delivery service she offers for her store's products. In the week following the discount offer, sales via the delivery service jumped by 50%. The hidden variable is weather; it rained throughout that week and more people opted for delivery rather than going to the store.

Not an example of a hidden variable Although the weather is probably correlated with the increase in same-day delivery, it is not related to the discount, and so does not function as a hidden variable between weather and the discount.

Suppose we want to know whether students who attend a top business school have higher earnings than those who attend lower-ranked business schools. To find out, we collect the average starting salaries of recent graduates from the top 100 business schools in the U.S. We then compare the salaries of those who attended the schools ranked in the top 50 to the salaries of those who did not. Should we perform a one-sided hypothesis test or a two-sided test?

One-sided Since we are interested only in whether the average salaries of people who attended the top 50 business schools are higher than the salaries of those who did not, we should perform a one-sided test. If we were interested in learning whether the salaries of the people who went to the top 50 business schools were different (either higher or lower) than those from the other schools, we would conduct a two-sided test.

One-sided hypothesis test vs Two-sided hypothesis test

One-sided hypothesis test: TEST WHETHER INCOMING STUDENTS AT A BUSINESS SCHOOL RECEIVE BETTER GRADES IN THEIR CLASSES IF THEY'VE TAKEN AN ON-LINE PROGRAM COVERING BASIC MATERIAL; TEST WHETHER USERS OF A COMMERCIAL WEBSITE ARE LESS LIKELY TO MAKE A PURCHASE IF THEY ARE REQUIRED TO SET UP A USER ACCOUNT ON THE SITE.
Two-sided hypothesis test: TEST WHETHER THERE IS A DIFFERENCE BETWEEN MEN'S AND WOMEN'S USAGE OF A MOBILE FITNESS APP; TEST WHETHER THE NUMBER OF LISTENERS OF A STREAMING MUSIC SERVICE HAS CHANGED AFTER THEY CHANGED THE USER INTERFACE.

Suppose again that the movie theater manager had gathered a sample that had an average customer satisfaction rating of 7.05 but in this case had firm convictions that if the average rating had changed, it had increased. Given what you know about the relationship between the p-values of one-sided and two-sided tests, would you reject or fail to reject the null hypothesis, H0: μ≤6.7, at a 5% significance level? As noted above, for a two-sided test with H0: μ=6.7 and Ha: μ≠6.7, the p-value of 7.05 is approximately 0.07.

Reject the null hypothesis The p-value for a one-sided hypothesis test is half the p-value of a two-sided test for the same value. The p-value for 7.05 for the two-sided hypothesis test was 0.07, so the p-value for 7.05 for the one-sided test is 0.035. Because 0.035 is less than the significance level, 0.05, we reject the null hypothesis and conclude that the average customer satisfaction rating has increased. Note that the outcomes of one-sided and two-sided tests can be different. Just because we did not reject the null hypothesis for the two-sided test does not mean that we will have the same result for the one-sided test.

We have found that for the movie theater example, the p-value for the one-sided hypothesis test is 0.0013. Assuming a 0.05 significance level, what would you conclude?

Reject the null hypothesis and conclude that the average satisfaction rating has increased Because the p-value is less than the specified significance level of 0.05, we reject the null hypothesis. Our alternative hypothesis, the claim we wish to substantiate, is μ>6.7, so by rejecting the null hypothesis we are able to conclude that the average satisfaction rating has increased.

A curious student in a large economics course is interested in calculating the percentage of his classmates who scored lower than he did on the GMAT; he scored 490. He knows that GMAT scores are normally distributed and that the average score is approximately 540. He also knows that 95% of his classmates scored between 400 and 680. Based on this information, calculate the percentage of his classmates who scored lower than he did.

Since GMAT scores are normally distributed, we know that P(μ-1.96σ ≤ x ≤ μ+1.96σ) = 95%. Thus, to find the standard deviation, subtract the lower bound from the mean and divide by 1.96. The standard deviation of the distribution is (B1-B2)/1.96 = (540-400)/1.96 = 71.4. (Note that because the normal curve is symmetrical, we could calculate the same value using (B3-B1)/1.96 = (680-540)/1.96 = 71.4.) To find the cumulative probability, P(x ≤ 490), use the Excel function NORM.DIST(x, mean, standard_dev, TRUE). Here, NORM.DIST(B4,B1,71.4,TRUE) = NORM.DIST(490,540,71.4,TRUE) = 0.24, or 24%. Approximately 24% of his classmates scored lower than he did. You must link directly to the values in order to obtain the correct answer.
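The two steps, backing out the standard deviation from the 95% range and then taking a cumulative probability, can be sketched in Python. It uses the same 1.96 multiplier as the answer; the variable names are illustrative.

```python
from statistics import NormalDist

mean_score = 540
lower_bound = 400   # lower end of the range holding 95% of classmates
his_score = 490

# 95% of a normal distribution lies within about 1.96 standard deviations of the mean.
sigma = (mean_score - lower_bound) / 1.96

# Share of classmates scoring at or below 490.
p_lower = NormalDist(mu=mean_score, sigma=sigma).cdf(his_score)

print(round(sigma, 1), round(p_lower, 2))
```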

For a normal distribution with mean 222 and standard deviation 17, find the value associated with the cumulative probability of 88%.

Since NORM.INV(0.88,B1,B2)=242, 242 is the value that corresponds to a cumulative probability of approximately 88%. That is, 242 is the 88th percentile of this distribution.

A college football coach has decided to recruit only the heaviest 15% of high school football players. He knows that high school players' weights are normally distributed and that this year, the mean weight is 225 pounds with a standard deviation of 43 pounds. Calculate the weight at which the coach should start recruiting players.

Since the coach is only interested in recruiting the heaviest 15% of players, calculate the weight that 85% of players weigh less than. The Excel function NORM.INV(probability, mean, standard_dev) returns the inverse of a normal cumulative distribution function. Here, NORM.INV(0.85,B1,B2)=NORM.INV(0.85,225,43)=269.57 indicates that 85% of players weigh less than 269.57 pounds. Hence, 15% of high school players weigh 269.57 pounds or more.
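As a cross-check on the cutoff, here is a short Python sketch in which `NormalDist.inv_cdf` stands in for NORM.INV. The key move is converting "heaviest 15%" into the cumulative probability 0.85 before inverting.

```python
from statistics import NormalDist

weights = NormalDist(mu=225, sigma=43)

top_share = 0.15
# The heaviest 15% start where 85% of players weigh less.
cutoff = weights.inv_cdf(1 - top_share)

print(round(cutoff, 2))
```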

IQ scores are known to be normally distributed. The mean IQ score is 100 and the standard deviation is 15. The top 25% of the population (ranked by IQ score) have IQ's above what value?

Since you are only interested in the top 25%, calculate the IQ at which 75% of people are below. The Excel function NORM.INV(probability, mean, standard_dev) returns the inverse of a normal cumulative distribution function. Here, NORM.INV(0.75,B1,B2)=NORM.INV(0.75,100,15)=110 indicates that 75% of people have IQ's lower than 110. Hence, 25% of people have IQ's greater than 110.

Which of the following formulas would calculate the statistic that is MOST APPROPRIATE for comparing the variability of two data sets with different distributions?

Standard Deviation/Mean This is the formula for the coefficient of variation, the best statistic to compute to compare the variability of two data sets with different distributions. Dividing by the mean provides a measure of the distribution's variation relative to the mean.

Randomly Sorting Data

Step 1: Before we generate random ID numbers, type "Random ID" in cell A1 to label column A.
Step 2: In cell A2, enter the function =RAND() to generate a random ID number between 0 and 1.
Step 3: Copy and paste the function from cell A2 into cells A3:A26 so that all 25 phone numbers are assigned a random ID number. You can use auto-fill instead of copying and pasting.
Step 4: Now we need to sort the phone numbers. Highlight the data in column A and column B, excluding the labels, and select Sort Ascending from the Data menu.
Note that the RAND function generates a new random number for each phone number every time the spreadsheet is recalculated. Therefore, even though the phone numbers were actually sorted, the (new) random numbers will not appear in order. The sorting was based on the previously assigned random numbers. After sorting, the 25 phone numbers on the list are in random order. If we wanted to draw a random sample of 10 phone numbers, we would start at the top of the list and choose the first 10 people.
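The same tag-then-sort idea can be sketched in Python. The phone numbers below are made-up placeholders, not data from the exercise, and `random.random()` plays the role of =RAND().

```python
import random

# Hypothetical placeholders for the 25 phone numbers in column B.
phone_numbers = [f"555-01{i:02d}" for i in range(25)]

# Assign each phone number a random ID between 0 and 1 (column A).
tagged = [(random.random(), number) for number in phone_numbers]

# Sort ascending by random ID, which puts the phone numbers in random order.
tagged.sort(key=lambda pair: pair[0])
shuffled = [number for _, number in tagged]

# The random sample is the first 10 entries of the sorted list.
sample = shuffled[:10]
print(sample)
```

Unlike the spreadsheet, the random IDs here are generated once and stored, so they do not change after the sort.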

Creating the Output / summary statistics pc1

Step 1: From the Data menu, select Data Analysis, then select Descriptive Statistics.
Step 2: Enter the appropriate Input Range. The Input Range is the oil consumption data in column A with its label, A1:A11. Make sure to include A1, the cell containing the label, when inputting your range and check the Labels in first row box, as this ensures that your output table will be appropriately labeled.
Step 3: Enter the appropriate Output Range, in this case enter D1. (oil consumption label) This cell is the top left hand cell in which the output table will appear. Be sure to always use the specified Output Range, so that the calculated values are placed in the blue cells for grading.
Step 4: Be sure to select Summary Statistics so that the output table is generated.

Creating Scatter Plots

Step 1: From the Insert menu, select Scatter, then select Scatter With Only Markers.
Step 2: Enter the appropriate Input Y Range and Input X Range. The Input Y Range is the weight data in column C with its label, C1:C11. (usually later letter) The Input X Range is the height data in column B with its label, B1:B11. Make sure to include the cells containing labels when inputting your ranges and check the Labels in first row box, as this ensures that your scatter plot will be appropriately labeled.

The spreadsheet below contains a partial view of data about U.S. corn acreage planted (in millions of acres) and the amount of corn (in millions of bushels) in storage from the previous year at the beginning of the year for each year from 1976 to 2013. We wish to use the data to predict the number of acres of corn that will be planted, based on the beginning corn stock in storage. Which variable is the independent variable?

Stock of Corn at Start of Year (in million bushels) (later one)

For a normal distribution with mean 47 and standard deviation 6, find the probability of obtaining a value greater than or equal to 45.

Subtract the cumulative probability of 45 from 1. 1-NORM.DIST(45,B1,B2,TRUE)=0.63, or 63%. Approximately 63% of the population lies above 45. (B1: mean; B2: standard deviation)

Which of the following Excel formulas or tools would correctly calculate the average hourly hot dog sales over a two-day period from the data shown below? SELECT ALL THAT APPLY.

The Descriptive Statistics tool AVERAGE(B2:B17) SUM(B2:B17)/COUNT(B2:B17)

The following data set lists the prices for thirty houses in and around Boston, Massachusetts. Create a histogram of the data using the bins provided in column D.

The Input Range is B1:B31 (selling price $) and the Bin Range is D1:D8. (selling price) You must check the Labels in first row box since we included B1 and D1 to ensure that the histogram's axes are appropriately labeled.

A travel agent wants to determine how much the average client is willing to pay for a weekend at an all-expense paid resort. The agent surveys 30 clients and finds that the average willingness to pay is $2,500 with a standard deviation of $840. However, the travel agent is not satisfied and wants to be 95% confident that the sample mean falls within $150 of the true average. What is the minimum number of clients the travel agent should survey? Note that z=1.96 for a 95% confidence interval. Please give your answer as an integer with no decimal point and no digits to the right of the decimal point.

The correct answer is 121.

Which of the following is the MOST LIKELY result of using a survey with biased questions?

The data in your sample will differ in a systematic way from data based on unbiased random selections from the population.

An internet marketing firm compiled a data set of the number of seconds website visitors stay on one of its client's homepage before abandoning the site. The firm presented the summary statistics for the data set to the client. The client asked why the mean of the data set is so much larger than the median. Which of the following is most likely true?

The distribution of the data is skewed to the right When the distribution of data is skewed to the right, the mean is most likely greater than the median. The extreme values in the right tail pull the mean towards them.

Given the general regression equation, ŷ = a + bx, which of the following describes ŷ? Select all that apply.

The expected value of the dependent variable; the value we are trying to predict

A journalist wants to determine the average annual salary of CEOs in the S&P 1,500. He does not have time to survey all 1,500 CEOs but wants to be 95% confident that his estimate is within $50,000 of the true mean. The journalist takes a preliminary sample and estimates that the standard deviation is approximately $449,300. What is the minimum number of CEOs that the journalist must survey to be within $50,000 of the true average annual salary? Remember that the z-value associated with a 95% confidence interval is 1.96. Please enter your answer as an integer; that is, as a whole number with no decimal point.

The formula for calculating the minimum required sample size is n ≥ (z·s/M)², where M=$50,000 is the desired margin of error for the confidence interval, s=$449,300 is the sample standard deviation, and z=1.96. Using these data we find that (1.96 × 449,300 / 50,000)² = 310.20. Since n must be an integer (let's not even think of what 0.20 CEOs would look like!) and n must be greater than or equal to 310.20, we must round up to 311. Since 310.20 is closer to 310 than to 311, we would normally round 310.20 down to 310. However, in this case we must round up to find the smallest integer that satisfies the equation. Therefore, the minimum required sample size is 311.
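The round-up rule can be sketched in a few lines of Python. The helper name `min_sample_size` is hypothetical and not part of the course material; it simply applies n ≥ (z·s/M)² and takes the ceiling.

```python
import math

def min_sample_size(z, s, margin):
    """Smallest integer n satisfying n >= (z * s / margin) ** 2."""
    return math.ceil((z * s / margin) ** 2)

# CEO salary question: z = 1.96, s = 449,300, M = 50,000.
print(min_sample_size(1.96, 449_300, 50_000))  # 311

# Travel agent question: z = 1.96, s = 840, M = 150.
print(min_sample_size(1.96, 840, 150))  # 121
```

Using the ceiling rather than ordinary rounding is the point of the exercise: 310 would leave the margin of error slightly larger than requested.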

Amazon sampled 5,000 observations from each of the three different storage types and recorded "1" if there was a defect in the bin and "0" if there was not. The mean defect rate and standard deviation for each storage type are provided below. Using a 95% confidence level, calculate your best estimate of the true defect rate of storage type 1.

The lower bound of the 95% confidence interval is the sample mean minus the margin of error, B2-CONFIDENCE.NORM(0.05,B3,B4)=0.0251, or 2.51%. The upper bound of the 95% confidence interval is the sample mean plus the margin of error, B2+CONFIDENCE.NORM(0.05,B3,B4)=0.0345, or 3.45%.

Calculate the 95% confidence interval for the true population mean based on a sample with x̄=225, s=8.5, and n=45.

The lower bound of the 95% confidence interval is the sample mean minus the margin of error, that is B1-CONFIDENCE.NORM(0.05,B2,B3)=222.52, where B1 is the mean, B2 is the standard deviation, and B3 is the sample size. The upper bound of the 95% confidence interval is the sample mean plus the margin of error, that is B1+CONFIDENCE.NORM(0.05,B2,B3)=227.48.
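For readers who want to reproduce the bounds without a spreadsheet, here is a minimal Python sketch. The function `confidence_norm` is a hypothetical helper written to mirror what CONFIDENCE.NORM(alpha, standard_dev, size) computes (a z-value times s/√n); it is not an Excel or library API.

```python
from math import sqrt
from statistics import NormalDist

def confidence_norm(alpha, standard_dev, size):
    """Two-sided margin of error: z * s / sqrt(n), with z taken at 1 - alpha/2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / sqrt(size)

# Sample from the question: mean 225, s = 8.5, n = 45.
mean, s, n = 225, 8.5, 45
margin = confidence_norm(0.05, s, n)
lower, upper = mean - margin, mean + margin

print(round(lower, 2), round(upper, 2))  # 222.52 227.48
```

Calling `confidence_norm(0.01, s, n)` instead gives the wider 99% margin for the same sample.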

A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site now has an average of 124,247 unique listeners per day. The manager finds that the p-value for the hypothesis test is approximately 0.0743. What can be concluded at the 95% confidence level?

The manager should fail to reject the null hypothesis; there is not enough evidence to conclude that the number of unique daily listeners has changed. Since the p-value, 0.0743, is greater than the significance level, 0.05, the manager should fail to reject the null hypothesis.

Median

The middle value of a data set. The same number of data points fall above and below the median. To find the median, first arrange the values in order of magnitude. If the total number of data points is odd, the median is the value that lies in the middle. If the total number is even, the median is the average of the two middle values.

A business school professor is interested to know if watching a video about the Central Limit Theorem helps students understand it. To assess this, the professor tests students' knowledge both immediately before they watch the video and immediately after. The professor takes a sample of students, and for each one compares their test score after the video to their score before the video. Using the data below, calculate the p-value for the following hypothesis test: H0:μafter≤μbefore Ha:μafter>μbefore

The p-value of the one-sided hypothesis test is T.TEST(array1, array2, tails, type)=T.TEST(B2:B31,C2:C31,1,1), which is approximately 0.0128. You must designate this test as a one-sided test (that is, assign the value 1 to the tails argument) and as a type 1 (a paired test) because you are testing the same students on the same knowledge at two points in time. You must link directly to values in order to obtain the correct answer.

A college student is interested in testing whether business majors or liberal arts majors are better at trivia. The student gives a trivia quiz to a random sample of 30 business school majors and finds the sample's average score is 86. He gives the same quiz to 30 randomly selected liberal arts majors and finds the sample's average score is 89. Using the data provided below, calculate the p-value for the following hypothesis test: H0: μBusiness = μLiberal Arts; Ha: μBusiness ≠ μLiberal Arts

The p-value of the two-sided hypothesis test is T.TEST(array1, array2, tails, type)=T.TEST(A2:A31,B2:B31,2,3), which is approximately 0.0524. You must designate this test as a two-sided test (that is, assign the value 2 to the tails argument) and as a type 3 test (an unpaired test with unequal variances) because you are testing two different samples. You must link directly to values in order to obtain the correct answer.

Calculate the 90% confidence interval for the proportion of voters who cast their ballot for the candidate.

The political campaign can be 90% confident that the true population proportion of voters who cast a ballot for the candidate is between 29.7% and 43.9%. Since the sample size is greater than 30, we can determine the margin of error using the CONFIDENCE.NORM function. The margin of error is CONFIDENCE.NORM(0.10,STDEV.S(B2:B126),125)=0.071, or approximately 7%. The lower bound of the 90% confidence interval is the sample mean minus the margin of error, AVERAGE(B2:B126)-CONFIDENCE.NORM(0.10, STDEV.S(B2:B126),125)=0.368-0.071=0.297, or 29.7%. Similarly, the upper bound of the 90% confidence interval is the sample mean plus the margin of error, AVERAGE(B2:B126)+CONFIDENCE.NORM(0.10, STDEV.S(B2:B126),125)=0.368+0.071=0.439, or 43.9%.

What happens to the sample mean and standard deviation as you increase the sample size?

The sample mean and standard deviation generally become closer to the population mean and standard deviation As we increase the sample size, the sample includes more members of the population, so it is less likely to include only unusual values. Therefore, as the sample grows, the sample mean and standard deviation approach the population mean and standard deviation.

What happens to the sample mean and standard deviation as you take new samples of equal size?

The sample mean and standard deviation vary but remain fairly close to the population mean and standard deviation. Since each sample is randomly selected, the mean and standard deviation vary from one sample to the next. However, since the sample size is fairly large, each sample's mean and standard deviation are fairly close to the population mean and standard deviation. We'll learn more about how to select a good sample later.

significance level

The significance level defines the rejection region by specifying the threshold for deciding whether or not to reject the null hypothesis. When the p-value of a sample mean is less than the significance level, we reject the null hypothesis. The significance level is the area of the rejection region, meaning the area under the distribution of sample means over the rejection region. Equivalently, it is the probability of rejecting the null hypothesis when the null hypothesis is actually true.

Consider the four outliers in the 2012 revenue data: companies with revenue of $237 billion, $246 billion, $447 billion, and $453 billion. If we removed these companies from the data set, what would happen to the standard deviation?

The standard deviation would decrease. The standard deviation gives more weight to observations that are further from the mean. Therefore, removing the outliers would decrease the standard deviation.

A college student is interested in testing whether business majors or liberal arts majors are better at trivia. The student gives a trivia quiz to a random sample of 30 business school majors and finds the sample's average test score is 86. He gives the same quiz to 30 randomly selected liberal arts majors and finds the sample's average quiz score is 89. The student finds that the p-value for the hypothesis test equals approximately 0.0524. What can be concluded at α=5%?

The student should fail to reject the null hypothesis and conclude that there is insufficient evidence of a difference between business and liberal arts majors' knowledge of trivia. The null hypothesis is that there is no difference between the two types of majors. Since the p-value, 0.0524, is greater than the significance level, 0.05, the student should fail to reject it.

We want to know if a company's profits have increased after it started advertising more.

Time Series

Time Series

Time series data contain data about a given subject in temporal order, measured at regular time intervals (e.g. minutes, months, or years). U.S. oil consumption from 2002 through 2012 is an example of a time series. Managers collect and analyze time series to identify trends and predict future outcomes.

The owner of a local health food store recently started a new ad campaign to attract more business and wants to test whether average daily sales have increased. Historically average daily sales were approximately $2,700. After the ad campaign, the owner took another random sample of forty-five days and found that average daily sales were $2,984 with a standard deviation of approximately $585. Calculate the upper bound of the 95% range of likely sample means for this one-sided hypothesis test using the CONFIDENCE.NORM function.

To construct a 95% range of likely sample means, calculate the margin of error using the function CONFIDENCE.NORM(alpha, standard_dev, size). However, CONFIDENCE.NORM finds the margin of error for a two-sided hypothesis test and this question asks for the upper bound of a one-sided test. To find the upper bound for the one-sided test you must first determine what two-sided test would have a 5% rejection region on the right side. Since the distribution of sample means is symmetric, a two-sided test with a 10% significance level would have a 5% rejection region on the left side of the normal distribution and a 5% rejection region on the right side. Thus, the upper bound for a two-sided test with alpha=0.1 will be the same as the upper bound on a one-sided test with alpha=0.05. The margin of error is CONFIDENCE.NORM(alpha, standard_dev, size)= CONFIDENCE.NORM(0.1,C3,C4)=CONFIDENCE.NORM(0.1,585,45)=$143.44. The upper bound of the 95% range of likely sample means for this one-sided hypothesis test is the population mean plus the margin of error, which is approximately $2,700+$143.44=$2,843.44.
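The one-sided reasoning above can be checked with a short Python sketch. It assumes `statistics.NormalDist` in place of Excel, and the variable names are illustrative. Taking the z-value at 0.95 directly is the same as the two-sided calculation with alpha=0.1.

```python
from math import sqrt
from statistics import NormalDist

# Values from the question: historical mean 2,700, s = 585, n = 45.
historical_mean, s, n = 2700, 585, 45

# One-sided 5% rejection region on the right: z at cumulative probability 0.95,
# which matches the two-sided z for alpha = 0.10 (1 - 0.10 / 2 = 0.95).
z = NormalDist().inv_cdf(0.95)
margin = z * s / sqrt(n)
upper_bound = historical_mean + margin

print(round(margin, 2), round(upper_bound, 2))  # 143.44 2843.44
```

Since the sample mean of $2,984 lies above this upper bound, it falls in the rejection region under these numbers.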

A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site now has an average of 124,247 unique listeners per day. Using the data provided below, calculate the p-value for the following hypothesis test: H0: μ=131,520; Ha: μ≠131,520

To use Excel's T.TEST function for a hypothesis test with one sample, you must create a second column of data that will act as a second sample. So, first enter the historical average unique listeners into each cell in column B associated with a day in the sample; that is, enter 131,520 into cells B2 to B31. Then, the p-value of the two-sided hypothesis test is T.TEST(array1, array2, tails, type)=T.TEST(A2:A31,B2:B31,2,3), which is approximately 0.0743. You must link directly to values in order to obtain the correct answer.

Before we determine the significance of the results, let's look at the direction of the change. What is the effect of changing the shopping cart design on Total Units Ordered (Units) and Ordered Product Sales (OPS)?

Units increased and OPS increased For each test, the mean of the treatment is larger than the mean of the control. This indicates that changing the design of the shopping cart increased both Units and OPS. The Mean Difference and % Mean Difference confirm the increases.

A researcher wants to select a random sample of consumers for a study. Generate a random ID number between 0 and 1,000 for each consumer in the spreadsheet.

Use the function =RAND()*1000 in cells A2:A30 to generate random numbers for each consumer.

Suppose that you have a sample with a mean of 50. You construct a 95% confidence interval and find that the lower and upper bounds are 42 and 58. What does this 95% confidence interval around the sample mean indicate? Select all that apply.

We are 95% confident that the population mean lies between 42 and 58. The 95% confidence interval is a range around the sample mean. We can say that we are 95% confident that the true population mean is within this range, based on the methods we used to calculate the range. If we were to construct similar intervals for 100 samples drawn from this population, on average 95 of the intervals will contain the true population mean.

A p-value to test the significance of a linear relationship between two variables was calculated to be 0.0210. What can we conclude? SELECT ALL THAT APPLY.

We can be 90% confident that there is a significant linear relationship between the two variables. We can be 95% confident that there is a significant linear relationship between the two variables.

Are the results for Ordered Product Sales (OPS) significant at the 95% confidence level?

Yes. A test's results are significant if the p-value is less than the significance level. In this case, the p-value, 0.0339, is less than 0.05, so the results are significant.

Calculate the 90% confidence interval for the true population mean based on a sample with x̄=15, s=2, and n=100.

You can separate the calculations into separate formulas or you can combine them into one calculation. The margin of error is based on the significance level (1-confidence level, or 1-0.90=0.1), the standard deviation (in B2), and the sample size (in B3). We can compute the margin of error using the Excel function CONFIDENCE.NORM(0.10,B2,B3). The lower bound of the 90% confidence interval is the sample mean minus the margin of error, that is B1-CONFIDENCE.NORM(0.10,B2,B3)=14.67. The upper bound of the 90% confidence interval is the sample mean plus the margin of error, that is B1+CONFIDENCE.NORM(0.10,B2,B3)=15.33. You must link directly to cells to obtain the correct answer.

Suppose the movie theater manager had gathered a sample that had an average customer satisfaction rating of 7.05. For the two-sided test with H0: μ=6.7 and Ha: μ≠6.7, the p-value is approximately 0.07. Would you reject or fail to reject the null hypothesis, μ=6.7, at the 5% significance level?

Fail to reject the null hypothesis. Because the p-value, 0.07, is greater than the significance level, 0.05, we do not have enough evidence to reject the null hypothesis, so we would fail to reject it.

For a normal distribution with mean 50 and standard deviation 6, find the values associated with the middle 50%.

Lower value: 46; upper value: 54. To use the NORM.INV function we need to think in terms of cumulative probabilities. Because the normal curve is symmetrical, the middle 50% corresponds to the 25% range just below the mean and the 25% range just above the mean. Thus we need to find the values associated with the cumulative probabilities 50%-25%=25% and 50%+25%=75%. Since NORM.INV(0.25,B1,B2)=46 and NORM.INV(0.75,B1,B2)=54 (both rounded to the nearest whole number), the middle 50% of the probability lies between 46 and 54.
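A quick way to confirm these bounds outside Excel is sketched below. It assumes `NormalDist.inv_cdf` from Python's standard library as the counterpart of NORM.INV; the names are illustrative.

```python
from statistics import NormalDist

# Values from the question: mean 50, standard deviation 6.
dist = NormalDist(mu=50, sigma=6)

# The middle 50% runs from the 25th to the 75th percentile.
lower = dist.inv_cdf(0.25)
upper = dist.inv_cdf(0.75)

print(round(lower), round(upper))  # 46 54
```

The unrounded values are symmetric about the mean of 50, as the symmetry argument predicts.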

Calculate the 99% confidence interval for the true population mean based on the same sample with x¯=225, s=8.5, and n=45. Notice how the width of the 99% confidence interval differs from the width of the 95% confidence interval of the same data.

Lower bound: mean-CONFIDENCE.NORM(0.01, standard_dev, size). Upper bound: mean+CONFIDENCE.NORM(0.01, standard_dev, size). Here mean=225, standard_dev=8.5, and size=45.

standard deviation

A measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

How large must our sample be for the 68% confidence interval to be within 1 kg/m² of the true average BMI? Since we don't know σ, the standard deviation of the population, let's use the standard deviation of our previous sample (s=7.10) as an estimate.

n≥51 If we use the equation n ≥ (z·s/M)² to calculate n, we find n ≥ (1 × 7.10 / 1)² = 50.41. However, because the sample size must be a whole number, we have to round up, so n≥51. We round up even though 0.41 is less than 0.50 because if 50.41 is the smallest value that satisfies the inequality, then 50 will not satisfy it.

An automotive manufacturer has developed a new type of tire that the research team believes to increase fuel efficiency. The manufacturer wants to test if there is an increase in the mean gas mileage of mid-sized sedans that use the new type of tire, compared to 32 miles per gallon, the historic mean gas mileage of mid-sized sedans not using the new tires. The automotive manufacturer should perform a _____________ hypothesis test to _____________.

one-sided, analyze a change in a single population The manufacturer believes that the new tires change fuel efficiency in a single direction (i.e., that efficiency increases) and thus should use a one-sided hypothesis test. The automotive manufacturer is analyzing the change of a single population mean compared to the known historic population mean of gas mileage in mid-sized sedans.

hidden variable

A variable that is correlated with each of two variables (such as ice cream sales and snow shovel sales) that are not fundamentally related to each other. Season is correlated with ice cream sales (people are more likely to buy ice cream in the summer when the weather is hot). Season is also correlated with snow shovel sales (people are more likely to buy snow shovels in winter when the weather is cold and snow begins to fall). However, there is no direct connection between ice cream sales and snow shovel sales.

Recall that the z-value associated with a value measures the number of standard deviations the value is from the mean. If a particular standardized test has an average score of 500 and a standard deviation of 100, what z-value corresponds to a score of 350?

z=(x-µ)/σ. Here x= 350, µ=500, the population mean, and σ=100, the population standard deviation. Thus z = (350-500)/100 = (-150)/100 = -1.5

A manager of a factory wants to know if a new quality check protocol has decreased the number of units a worker produces in a day. Before the new protocol, a worker could produce 27 units per day. What alternative hypothesis should the manager use to test this claim?

µ < 27 units The manager wants to know if the new quality check protocol has decreased the average number of units a worker can produce per day. For a one-sided test, the manager should use the alternative hypothesis Ha: μ<27 units. This is the claim the manager wishes to substantiate.

A manager of a factory wants to know if a new quality check protocol has decreased the number of units a worker produces in a day. Before the new protocol, a worker could produce 27 units per day. What null hypothesis should the manager use to test this claim?

µ ≥ 27 units This is the null hypothesis. Remember that the null and alternative hypotheses are opposites.

A manager of a factory wants to know if the average number of workplace accidents is different for workers who attended an equipment safety training compared to those who did not attend. What null hypothesis should the manager use to test this claim?

µattended = µdid not attend If the manager's alternative hypothesis is that the average number of workplace accidents has changed between the two groups of workers, then the null hypothesis is that the average number of accidents has remained the same.

A manager of a factory wants to know if the average number of workplace accidents is different for workers who attended an equipment safety training compared to those who did not attend. What alternative hypothesis should the manager use to test this claim?

µattended ≠ µdid not attend The manager has reason to believe that the training has changed the average number of workplace accidents between the two groups of workers. For a two-sided test, the manager should use the alternative hypothesis Ha: µattended ≠ µdid not attend. This is the claim the manager wishes to substantiate.

A streaming music site changed its format to focus on previously unreleased music from rising artists. The site manager now wants to determine whether the number of unique listeners per day has changed. Before the change in format, the site averaged 131,520 unique listeners per day. Now, beginning three months after the format change, the site manager takes a random sample of 30 days and finds that the site has an average of 124,247 unique listeners per day. SELECT THE TWO ANSWERS below that represent the correct null and alternative hypotheses.

μ=131,520 The null hypothesis is that the number of unique listeners per day has not changed. Thus, μ=131,520 is the null hypothesis. μ≠131,520 The alternative hypothesis is that the number of unique listeners per day has changed. Thus, μ≠131,520 is the alternative hypothesis.

A college student is interested in testing whether business majors or liberal arts majors are better at trivia. The student gives a trivia quiz to a random sample of 30 business majors and finds the sample's average score is 86. He gives the same quiz to 30 randomly selected liberal arts majors and finds the sample's average score is 89. What is the alternative hypothesis of this test?

μBusiness ≠ μLiberal Arts The alternative hypothesis is the claim that is being tested. Since the student wants to test whether there is a difference between business school majors' and liberal arts majors' trivia scores, the alternative hypothesis is that the mean scores are not equal.

Suppose we want to know whether students who attend a top business school have higher earnings. What is the null hypothesis?

μtop 50≤μnot top 50 The null hypothesis is the claim we assume to be true. It is the opposite of the alternative hypothesis—the claim we wish to substantiate. In this case, our alternative hypothesis is that people who attended a school ranked in the top 50 earn more than those who did not. The opposite of this is that people who attended a school ranked in the top 50 earn less than or equal to those who did not.

For the one-sided hypothesis test, what should the movie theater manager use as the null hypothesis?

μ≤6.7 If our alternative hypothesis is that the average satisfaction rating has increased, then the null hypothesis is that the rating is the same or lower. Thus, if our alternative hypothesis is that μ>6.7, our null hypothesis is that μ≤6.7.

The owner of a local health food store recently started a new ad campaign to attract more business and wants to test whether average daily sales have increased. Historically average daily sales were approximately $2,700. After the ad campaign, the owner took a random sample of forty-five days and found that daily average sales had increased to $2,984. What is store owner's null hypothesis?

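μ≤$2,700 The null hypothesis is the opposite of the hypothesis you are trying to substantiate. The owner wants to test for an increase, so the alternative hypothesis is μ>$2,700 and the null hypothesis is μ≤$2,700. In addition, the null hypothesis is always based on historical information, here the historical average of $2,700 rather than the new sample mean of $2,984.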

What is the standard deviation of the distribution of sample means?

σ / √n: the standard deviation divided by the square root of the sample size

Below is a partial regression output table, which of the following values most likely belongs in the Lower 95% cell for the independent variable in the output table?

-2.45 Since the p-value, 0.3956, is greater than 0.05, the linear relationship is not significant at the 95% confidence level. Therefore, the 95% confidence interval of the slope must contain zero. The confidence interval is centered around the slope of 1.78, so the lower and upper bounds must be equally distant from the slope. The Upper 95% value (6.01) minus the slope coefficient (1.78) is 4.23, so the Lower 95% is 1.78-4.23=-2.45.
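The symmetry argument can be written out as a small Python sketch. The variable names are illustrative; the numbers are the slope and Upper 95% value quoted in the answer above.

```python
# Values quoted in the answer: slope coefficient and its Upper 95% bound.
slope = 1.78
upper_95 = 6.01

# The interval is centered on the slope, so both bounds sit the same
# distance (the half-width) from it.
half_width = upper_95 - slope
lower_95 = slope - half_width

# An interval that contains zero means the slope is not significant at the 5% level.
contains_zero = lower_95 <= 0 <= upper_95

print(round(lower_95, 2), contains_zero)  # -2.45 True
```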

Which of the following 95% confidence intervals for a regression line's slope indicates that the linear relationship is not significant at the 5% level? Select all that apply.

-20.00; 5.00 The range between -20.00 and 5.00 contains zero, which indicates that the linear relationship is not significant at the 5% level. Note that another option is also correct. -0.36; 0.55 The range between -0.36 and 0.55 contains zero, which indicates that the linear relationship is not significant at the 5% level. Note that another option is also correct.

Given the regression equation, SellingPrice=13,490.45+255.36(HouseSize), which of the following values represents the value of HouseSize at which the regression line intersects the horizontal axis?

-52.83 square feet The regression line intersects the horizontal axis when Selling Price = $0, that is, when House Size = -52.83 square feet. 13,490.45+ 255.36*(-52.83)=$0.00 (actually, -52.82914, which rounds to -52.83).
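To see where -52.83 comes from, set SellingPrice to zero and solve for HouseSize. A minimal Python sketch, with illustrative variable names:

```python
# Coefficients from the regression equation in the question.
intercept = 13_490.45
slope = 255.36

# Solve 0 = intercept + slope * house_size for house_size.
x_intercept = -intercept / slope

print(round(x_intercept, 2))  # -52.83
```

A negative house size has no physical meaning, which is a reminder that the line should not be read far outside the range of the observed data.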

What are the bounds of the 95% confidence interval for the coefficient, ValentinesDay? Consult the regression output table above.

-761.38; -125.85 -761.38 is listed under "Lower 95%" for ValentinesDay, and -125.85 is listed under "Upper 95%" for ValentinesDay, so these are the bounds of the 95% confidence interval for the coefficient. The estimated coefficient is the center of the range and can be found under "Coefficients."

Earlier in this module, we found that the correlation coefficient between house size and selling price is 0.86. What is the R2 of the best fit line that describes the relationship between selling price and house size?

0.74 Remember that for a single variable linear regression, R² is the square of the correlation coefficient. Here, the correlation coefficient is 0.86, so R² = 0.86² = 0.74.

If the expected production volume when there are 120 workers is approximately 131,958 units, which of the following equations would provide a reasonable estimate of the 68% prediction interval for the output of those 120 workers?

131,958±14,994.93 A reasonable estimate of the prediction interval is the point forecast (131,958) plus or minus the z-value times the standard error of the regression (14,994.93). As usual, the z-value is based on the desired level of confidence. Since we want a 68% prediction interval, the z-value is equal to one. Therefore 131,958±14,994.93 is the best option.
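The interval can be spelled out numerically with a short Python sketch. It follows the answer's approximation of z=1 for 68% confidence; the variable names are illustrative.

```python
# Values from the answer: point forecast and standard error of the regression.
point_forecast = 131_958
standard_error = 14_994.93

# The answer uses z = 1 as the approximate z-value for a 68% interval.
z = 1
lower = point_forecast - z * standard_error
upper = point_forecast + z * standard_error

print(round(lower, 2), round(upper, 2))  # 116963.07 146952.93
```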

Suppose we want to assign dummy variables to months (Jan-Dec) and day of week (Sun-Sat). How many dummy variables do we need?

17 For each category, we must use one fewer dummy variables than the number of options for that category. Since month and day of week are separate categories, we should subtract one for each category. Thus we would use 12-1=11 variables for month and 7-1=6 variables for day of week, giving a total of 17 dummy variables.

The following are qualitative variables:

2010, 2012, Christmas, NewYears, MemorialDay, PayDay, SuperBowl, and Thanksgiving

Here is the correct regression line—the best fit line through the data. What is your estimate of the slope of this line, that is, the average change in selling price as house size increases by one square foot?

250 Pick two points on the x-axis—let's say 1,000 and 2,000—and see what the corresponding points are on the y-axis. According to the regression line, the expected selling price of a 1,000 square foot house is approximately $250,000, and for a 2,000 square foot house is approximately $500,000. Therefore, as house size increases by 1,000 square feet, price increases, on average, by approximately $250,000. To find the average change in price as house size increases by one square foot, we divide $250,000 by 1,000. We find that as house size increases by one square foot, price increases, on average, by approximately $250. (rise/run)

Given the regression equation, SellingPrice=13,490.45+255.36(HouseSize), which of the following values represents the average change in selling price as house size increases by one square foot?

255.36 255.36 dollars/square foot is the line's slope, which is equal to the average change in selling price as house size increases by one square foot.

If we wanted to compare the explanatory power of this model against a model that excludes the independent variables Christmas, Halloween, and MemorialDay, which value should we use? Consult the regression output table above.

39.94% We use the Adjusted R² to compare the explanatory power of models with different numbers of independent variables.

Now suppose we take a sample and find the average satisfaction rating to be 7.3. What should be the center of the range of likely sample means? Remember that H0: μ=6.7 and Ha: μ≠6.7.

6.7 We always start a hypothesis test by assuming that the null hypothesis is true. Thus, the center of the range of likely sample means is the historical average—the average specified by the null hypothesis, in this case is 6.7. Remember, the null hypothesis is that showing old classics has not changed the average satisfaction rating.

A sporting goods store manager wants to forecast annual sneaker revenues based on the type of sport (running, tennis, or walking), color (red, blue, white, black, or violet) and its target audience (men or women). How many independent variables should the manager include in her multiple regression analysis? Please enter your answer as an integer; that is with no decimal point.

7 Sales revenue is the dependent variable. Type of sport, color, and target audience are categorical variables which must be represented using dummy variables. Recall that it is necessary to use one fewer dummy variables than the number of options in a category. Thus, type of sport should be represented by 3-1=2 dummy variables, color should be represented by 5-1=4 dummy variables, and target audience should be represented by 2-1=1 dummy variable, for a total of 2+4+1=7 independent variables.
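The "one fewer than the number of options" rule is easy to express in code. Below is a minimal Python sketch; the helper `count_dummy_variables` and the dictionary layout are assumptions made for illustration.

```python
def count_dummy_variables(categories):
    """Each categorical variable needs (number of options - 1) dummy variables."""
    return sum(len(options) - 1 for options in categories.values())

# Sneaker question: sport, color, and target audience.
sneakers = {
    "sport": ["running", "tennis", "walking"],
    "color": ["red", "blue", "white", "black", "violet"],
    "audience": ["men", "women"],
}
print(count_dummy_variables(sneakers))  # 7

# Month and day-of-week question: 12 months and 7 days.
calendar = {"month": list(range(12)), "day_of_week": list(range(7))}
print(count_dummy_variables(calendar))  # 17
```

The omitted option in each category serves as the baseline against which the other options' coefficients are interpreted.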

Let's forecast the selling price of a 1,500 square foot house using the regression equation, SellingPrice=13,490.45+255.36(HouseSize)

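Use =SUMPRODUCT(array1, [array2], [array3], ...), which multiplies corresponding cells and adds up the products: SUMPRODUCT(B2:B3,C2:C3)=B2*C2+B3*C3=B2*1+B3*1500. Here B2 holds the intercept, B3 holds the house size coefficient, C2 holds 1, and C3 holds 1500.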
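As a rough Python equivalent of the SUMPRODUCT call, assuming the cells hold the rounded coefficients from the regression equation (the spreadsheet itself may hold more decimal places, so its result can differ slightly):

```python
# Assumed cell contents, using the rounded coefficients from the equation.
coefficients = [13490.45, 255.36]  # B2: intercept, B3: house size coefficient
inputs = [1, 1500]                 # C2: 1 for the intercept, C3: square feet


def sumproduct(a, b):
    """Multiply matching entries and add the products, like SUMPRODUCT."""
    return sum(x * y for x, y in zip(a, b))


forecast = sumproduct(coefficients, inputs)
print(round(forecast, 2))  # 396530.45
```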

Net Relationship

A multiple regression model determines the net effect of an independent variable on a dependent variable. The net effect controls for all other factors (independent variables) included in the regression model. For example, in a regression model including both distance and house size as independent variables, the coefficient for house size controls for distance. That is, the regression determines the average change in selling price if a house's size increases by one square foot but its distance from Boston does not change. Coefficients in multiple regression are net with respect to variables included in the model and gross with respect to variables that are omitted from the model.

Gross Relationship

A single variable linear regression model determines the gross effect of an independent variable on a dependent variable. For example, the gross effect of house size on selling price is the average change in selling price when house size increases by one square foot. Since no other independent variables are included in the model, the coefficient for house size may pick up the effect of other factors related to selling price.

graph: close together

High R2 (0.99): A large portion of the variation in y is explained by the regression line. Low p-value (0.0000): There is a significant linear relationship between the dependent and independent variables.

A real estate developer has data on a number of U.S. National financial variables for each quarter from 1995 to 2001. The variables are housing starts (in thousands), the housing price index (a measure of average housing selling prices), unemployment rate, average disposable income, and home owner vacancy rates. A partial view of the data is below. If the developer wanted to create a regression model to predict housing starts from all the other financial variables, which of the following would be INDEPENDENT variables? (Select all that apply.)

House Price Index, Unemployment Rate, Disposable Income, and Home Owner Vacancy Rates are the independent variables used to create the regression model. Housing Starts (thousands) is the dependent variable used to create the regression model. Year and Quarter is not included as a dependent or independent variable.

The owner of an ice cream shop wants to determine whether there is a relationship between ice cream sales and temperature. The owner collects data on temperature and sales for a random sample of 30 days and runs a regression to determine if there is a relationship between temperature (in degrees) and ice cream sales. The p-value for the two-sided hypothesis test is 0.04. How would you interpret the p-value?

If there is no relationship between temperature and sales, the chance of selecting a sample this extreme would be 4%. Correct. The null hypothesis is that there is no relationship. The p-value indicates how likely we would be to select a sample this extreme if the null hypothesis is true.

Consider the p-value corresponding to the independent variable's coefficient in the regression shown. Do you think the p-value is less than 0.05 or greater than 0.05?

Less than 0.05 A p-value less than 0.05 indicates that we can be 95% confident that the true slope is not zero, that is, that there is a significant linear relationship between the two variables. This graph provides strong evidence that there is a significant linear relationship between the two variables.

graph: far apart

Lower R2 (0.70): A smaller portion of the variation in y is explained by the regression line than in the previous graph. Low p-value (0.0000): There is a significant linear relationship between the dependent and independent variables, even though the R2 is lower than in the previous graph.

The organizer of a late night street fair in a popular tourist city wants to analyze the relationship between daily revenue and the following variables: the number of male visitors, the number of female visitors, the number of retail stands, the number of food (and beverage) stands, and the number of performances that take place on a given night. The regression output table is provided below. Based on these results and using a 10% significance level, the organizer thinks he can improve the model. He wants to try removing at least one variable from the analysis to create and compare new models. Which variable or variables would you recommend that he consider removing from the regression model? SELECT ALL THAT APPLY.

Number of Male Visitors The p-value of "Number of Male Visitors", 0.2016, is greater than 0.1, so the organizer should consider removing this variable from the regression model. Number of Performances The p-value of "Number of Performances", 0.5412, is greater than 0.1, so the organizer should consider removing this variable from the regression model.

Suppose the average satisfaction rating of the sample is 3.5 out of 10. Which of the following do you think would be the correct conclusion? Remember that H0:μ=6.7 and Ha:μ≠6.7.

Reject the null hypothesis The null hypothesis is that the average satisfaction rating has not changed (μ=6.7). Drawing a sample with an average satisfaction rating of 3.5 from a population that has an average rating of 6.7 is extremely unlikely, so we would reject the null hypothesis and conclude that the average satisfaction rating is no longer 6.7. Note that 3.5 is the same distance (3.2) from 6.7 as 9.9 is from 6.7. Since the distribution of sample means is symmetric, we can conclude that 3.5 and 9.9 have the same (very low) likelihood of being drawn from a population with a mean of 6.7. We will see shortly the key roles the distribution of sample means and the central limit theorem play in hypothesis testing.

Let's return to the movie theater example and focus on the sample taken after the manager changes the theater's artistic focus. Suppose the average satisfaction rating of the sample is 9.9 out of 10. Which of the following do you think would be the correct conclusion? Remember that H0:μ=6.7 and Ha:μ≠6.7.

Reject the null hypothesis The null hypothesis is that the average satisfaction rating has not changed, that is, that the population mean μ is still equal to 6.7. Drawing a sample with an average satisfaction rating of 9.9 from a population that has an average rating of 6.7 is extremely unlikely, so we would almost certainly reject the null hypothesis and conclude that the average satisfaction rating is no longer 6.7.

Which model would we use to predict the price of a house that is 2,700 square feet?

SellingPrice=13,490.45+255.36(HouseSize) Since we have data about just one independent variable, we should use a single variable regression model. This is a single variable linear regression model, in which house size is the only independent variable.

Suppose we want to forecast selling price based on house size and distance from Boston. Which equation should we use to forecast the price of a house that is 2,700 square feet and 15 miles from Boston?

SellingPrice=194,986.59+244.54(HouseSize)-10,840.04(DistancefromBoston) Since we have data about two independent variables, house size and distance from Boston, we should use the multiple regression model with those two variables.

What is the expected change in production volume, on average, as the number of factory workers decreases by five?

Since the slope represents the average change in production volume as the number of factory workers increases by one, the average change in production volume as the number of factory workers decreases by five is 1,638.98(-5)=-8,194.9. (1,638.98 is the coefficient for the number of factory workers.)

Adding the Best Fit Line to a Scatter Plot

Step 1 Create a scatter plot with "House Size (Sqft)" on the horizontal axis and "Selling Price ($)" on the vertical axis. Include the labels when inputting your ranges so that the scatter plot is appropriately labeled. Step 2 Select Chart Tools from the Insert menu. Then select Layout, then select Trendline. Check the Display Equation box to display the equation of the best fit line.

Regression Analysis with Dummy Variables

Step 1 From the Data menu, select Data Analysis, then select Regression. Step 2 Enter the appropriate Input Y Range and Input X Range: The Input Y Range is the range of the dependent variable, in this case selling price. The data with its label are in column C, C1:C31. The Input X Range is the range of the independent variable, in this case the dummy variable, "SAT (0=low, 1=high)". The data with its label are in column B, B1:B31. Since we included the cells containing the variables' labels when inputting the ranges, check the Labels box. Step 3 Scroll down and make sure to check the Residuals and Residual Plots boxes, as this ensures that the output table will include that information. As we saw with scatter plots, residual plots may be less helpful when we have a dummy variable, but we will choose to view the residual plot for the sake of completeness.

lagged variable.

Step 1: Copy the advertising data in range C2:C11. Step 2: To create the lagged variable, paste the advertising data into the range D3:D12 in Column D, under the title "Previous Year's Advertising." That is, the value from C2 will be pasted into D3, from C3 into D4, and so on until the value in C11 is pasted into D12. For example, in D3, the value for 2005 Previous Year's Advertising will be the advertising expenditure for 2004, $35,000. When completed properly, Row 12 should contain only one observation (in D12). Since we do not have advertising data for 2003, we do not know Previous Year's Advertising for 2004; thus, D2 should be blank. Note: Rather than copying and pasting, you may also choose to link directly to cells (for example, cell D3 would contain the formula =C2).
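The shift-by-one-row idea can be sketched in Python. Only the 2004 figure ($35,000) comes from the card; every other advertising value below is hypothetical, and the variable names are made up for illustration.

```python
# Advertising for 2004-2013 (column C). Only the 2004 value is from the text;
# the rest are hypothetical placeholders.
advertising = [35000, 38000, 41000, 40000, 45000,
               47000, 52000, 50000, 55000, 58000]

# Column D: shift everything down one row, leaving the first row blank.
lagged = [None] + advertising  # covers 2004-2014, like D2:D12

# Line the columns up by year; 2014 has a lagged value but no advertising yet.
rows = list(zip(range(2004, 2015), advertising + [None], lagged))

# Only rows with both values present can go into a regression.
usable = [row for row in rows if None not in row]

print(rows[1])      # (2005, 38000, 35000)
print(len(usable))  # 9
```

The nine usable rows correspond to 2005 through 2013, which matches using rows 3 to 11 of the sheet.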

Regression model, including the residuals and residual plots, with lagged data.

Step 1: Select Data, then Data Analysis, then Regression. Step 2: Enter your Input Y range as B3:B11. (Notice that we cannot use the data for Sales in B2 since we do not have an entry for D2) Step 3: Enter your Input X range as C3:D11. (Notice that we cannot use the data for Advertising for 2004 in C2 since we do not have an entry for D2. Moreover, we cannot use the data in D12 since we don't have data for other variables for 2014.) Step 4: Check the Residuals and Residual Plot boxes, but DO NOT check the Labels box. Click OK to start the regression analysis.

In order to create a regression model to analyze the relationship between housing starts and the other financial variables, which cell references should be entered?

The "Input Y Range" denotes the cell reference for the dependent variable, Housing Starts. The data for the dependent variable are in B1:B81. The "Input X Range" denotes the cell references for the independent variables: House Price Index, Unemployment Rate, Disposable Income, and Home Owner Vacancy Rates. The data for the independent variables are in C1:F81. Data contained in column A, Year and Quarter, are not included as a dependent or independent variable in the regression model.

Perform a single variable linear regression analysis to analyze the relationship between gross box office sales and home video units. Make sure to include the residuals and residual plot in your analysis.

The Input Y Range is C1:C149 and the Input X Range is B1:B149. You must check the Labels box since we included C1 and B1 to ensure that the regression output table is appropriately labeled. You must also check Residuals and Residual Plots boxes so that you are able to analyze the residuals.

Based on the regression output, what proportion of the variability in revenue can be accounted for by whether the Red Sox are playing away? Enter the value of the percentage with exactly ONE digit to the right of the decimal place. See the drop bar if you need more detail on how to round your answer.

The R Square value measures how much of the total variation in the dependent variable (in this case, revenue) is explained by the independent variable (in this case, away game). As shown in the regression output, the R-square value is 0.2252, or approximately 22.5%. You must have followed the rounding instructions in the question and entered exactly 22.5 to be graded as correct.

Residual Sum of Squares

The Residual Sum of Squares is the amount of variation that is left unexplained by the regression line, that is, the sum of the squared differences between the predicted and observed values. That is exactly what this graph shows.

Total Sum of Squares

The Total Sum of Squares measures the total variation in y. It equals the sum of the squared differences between the observed values of y and the mean of y. That is exactly what the graph shows. (graph = squares overlapping and in a straight line)
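To see how the two sums of squares relate to R2, here is a small Python sketch on made-up data (the x and y values are hypothetical, not from the course). It fits a least-squares line by hand, then uses the standard identity R2 = 1 - Residual SS / Total SS.

```python
x = [1, 2, 3, 4]  # hypothetical independent variable
y = [2, 3, 5, 6]  # hypothetical dependent variable

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a single variable regression.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
predicted = [intercept + slope * xi for xi in x]

# Total variation in y, and the part the line leaves unexplained.
total_ss = sum((yi - y_bar) ** 2 for yi in y)
residual_ss = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

r_squared = 1 - residual_ss / total_ss
print(round(total_ss, 2), round(residual_ss, 2), round(r_squared, 2))
```

For this toy data the total is 10, the residual part is 0.2, and R2 comes out to 0.98.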

Alternative Hypothesis (Ha):

The alternative hypothesis (the opposite of the null hypothesis) is the theory or claim we are trying to substantiate. If our data allow us to nullify the null hypothesis, we substantiate the alternative hypothesis.

The sports bar owner runs a regression to test whether there is a relationship between Red Sox away games and daily revenue. Which of the following statements about the regression output is true? SELECT ALL THAT APPLY.

The average daily revenue for days when the Red Sox do not play away is $1,768.32. This option is true. $1,768.32, the intercept coefficient, is the average daily revenue on days when the Red Sox do not play away. The average daily revenue for days when the Red Sox play away is $2,264.57. This option is true. The average daily revenue on days when the Red Sox play away is $1,768.32+496.25=$2,264.57. (496.25 is the coefficient for the away-game dummy variable, which is 1 for yes and 0 for no.)
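A minimal Python sketch of how a dummy variable switches the extra revenue on and off, using the two coefficients quoted above (the function and constant names are made up for illustration):

```python
INTERCEPT = 1768.32        # average revenue when the Red Sox do not play away
AWAY_COEFFICIENT = 496.25  # extra revenue on away-game days


def expected_revenue(away_game: int) -> float:
    """away_game is the dummy variable: 1 for an away game, 0 otherwise."""
    return INTERCEPT + AWAY_COEFFICIENT * away_game


print(round(expected_revenue(0), 2))  # 1768.32
print(round(expected_revenue(1), 2))  # 2264.57
```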

Next, using only the data, calculate the average selling price for homes that are in school districts where students perform well on the SAT (SAT=1).

The average selling price of homes, given they are located in school districts where students have high SAT scores can be calculated as AVERAGEIF(B2:B31,1,C2:C31)=$809,100.

Next, calculate the same quantity—the expected selling price of a home in a school district that has average SAT scores below 1700 (SAT=0)—but do it using only the data and standard Excel functions, without the regression model.

The average selling price of homes, given they are located in school districts where students have low SAT scores can be calculated as AVERAGEIF(B2:B31,0,C2:C31)=$389,376.
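For readers more comfortable with code, AVERAGEIF can be imitated in Python as below. The four rows are hypothetical and are not the course data, so the averages will not match the $809,100 and $389,376 figures.

```python
# Hypothetical rows of (SAT dummy, selling price); not the course data.
homes = [(1, 800000), (0, 400000), (1, 820000), (0, 380000)]


def average_if(rows, flag):
    """Average the prices of the rows whose dummy equals flag."""
    prices = [price for dummy, price in rows if dummy == flag]
    return sum(prices) / len(prices)


print(average_if(homes, 1))  # 810000.0
print(average_if(homes, 0))  # 390000.0
```

With a single dummy variable, these two group averages are what the regression reproduces: the intercept for SAT=0 and the intercept plus the coefficient for SAT=1.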

Using the new model, forecast the daily revenue when there are 10 retail stands and 15 food stands open, and approximately 1,500 women visiting.

The expected daily revenue is B15+(1500*B16)+(10*B17)+(15*B18)=$49,485. You must link directly to values in order to obtain the correct answer.

Based on the scientist's regression model, forecast the number of earthquakes above magnitude 7.0 that will occur in 2019.

The expected number of earthquakes above magnitude 7.0 that will occur in 2019 is B15+B16*2019=14.4. You must link directly to the cell values in order to obtain the correct answer.

Use the 2012 model to develop a baseline forecast for Frozen's home video units. Assume that Disney Studios estimated that the gross box office sales for Frozen would be approximately $360 million.

The expected number of home video units that will be sold is B15+B16*360=7,074 thousand. You must link directly to the values in order to obtain the correct answer. (B15: intercept coefficient; B16: gross box office coefficient)

Use the multiple regression model SellingPrice=194,986.59+244.54(HouseSize)-10,840.04(DistancefromBoston), where HouseSize is in square feet and DistancefromBoston is in miles, to predict the selling price of a house that is 1,500 square feet and 10 miles from Boston.

The expected selling price of a 1,500 square foot home that is 10 miles from Boston is B15+B16*1,500+B17*10=$453,397.59. You must link directly to the values in order to obtain the correct answer.
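A quick Python check using the rounded coefficients printed in the equation gives about $453,396.19, slightly below the $453,397.59 above. The gap presumably comes from the spreadsheet cells holding unrounded coefficients, which would explain the advice to link to the cells. The function name is made up for illustration.

```python
def selling_price(house_size_sqft: float, miles_from_boston: float) -> float:
    """Multiple regression forecast using the rounded published coefficients."""
    return 194986.59 + 244.54 * house_size_sqft - 10840.04 * miles_from_boston


estimate = selling_price(1500, 10)
print(round(estimate, 2))  # about 453396.19
```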

Given the regression equation, SellingPrice=13,490.45+255.36(HouseSize), what do you expect the selling price of a 2,000 square foot home to be?

The expected selling price of a 2,000 square foot home is B2+B3*2000=$524,217.93.

Use the single variable regression model with house size as the independent variable to predict the selling price of a house that is 2,700 square feet.

The expected selling price of a 2,700 square foot home is B2+B3*2700=$702,972.54. You must link directly to the values in order to obtain the correct answer. (B2: intercept coefficient; B3: house size coefficient)

Use the multiple regression model with house size and distance from Boston as the independent variables to predict the selling price of a house that is 2,700 square feet and 15 miles from Boston.

The expected selling price of a 2,700 square foot home that is 15 miles from Boston is B2+B3*2700+B4*15=$692,646.51. You must link directly to the values in order to obtain the correct answer. (B2: intercept coefficient; B3: house size coefficient; B4: distance from Boston coefficient)

Given the regression equation, SellingPrice=13,490.45+255.36(HouseSize), what do you expect the selling price of a 6,000 square foot home to be?

The expected selling price of a 6,000 square foot home is B2+B3*6000=$1,545,672.87. You must link directly to the values in order to obtain the correct answer.

Use the regression model to calculate the expected selling price of a home in a school district that has average SAT scores below 1700 (SAT=0).

The expected selling price of homes in school districts where students have low SAT scores is B15+B16*0=B15=$389,376. You must link directly to the values in order to obtain the correct answer. (B15: intercept coefficient; B16: coefficient for SAT, where 0=low and 1=high)

Null Hypothesis (H0)

The null hypothesis is a statement about a topic of interest. It is typically based on historical information or conventional wisdom. We always start a hypothesis test by assuming that the null hypothesis is true and then test to see if we can nullify it—that's why it's called the "null" hypothesis. The null hypothesis is the opposite of the hypothesis we are trying to prove (the alternative hypothesis).

An airport shuttle company forecasts the number of hours its drivers will work based on the distance to be driven (in miles) and the number of jobs (each job requires the pickup and drop-off of one set of passengers) using the following regression equation: Travel time=-0.60+0.05(distance)+0.75(number of jobs) On a given day, Victor and Sofia drive approximately the same distance but Sofia has two more jobs than Victor. If Victor worked for 4 hours, for how long can the company expect Sofia to work? Please enter your answer rounded to one digit to the right of the decimal point. For example, if you think Sofia would work 236.7134 hours, enter 236.7.

The only difference between the workloads of the two drivers is the number of jobs each has; Sofia has two additional jobs. Therefore the company can expect Sofia to work the four hours Victor worked, plus an additional 0.75 hours for each of the two additional jobs, that is, 4+0.75(2)=5.5 hours.

Which of the following independent variables are significant at the p < .05 level?

The p-value column in the bottom table gives the significance level of each variable. The only p-values that are less than .05 are for the Intercept (which we do not assess for significance) and ERA. Thus, ERA is the only independent variable that is significant at p < .05. Note also that ERA is the only independent variable with a 95% confidence interval that does not contain 0. Significant: ERA. Not significant: Runs, Strikeouts, Completed Games. (runs: 0.01; ERA: -0.12; rest: 0)

Based on the regression model, the expected daily production volume with 112 factory workers is 118,846 units. The human resource department noted that 123,415 units were produced on the most recent day on which there were 112 factory workers. What is the residual of this data point?

The residual is equal to the historically observed value minus the regression's predicted value (ε=y-ŷ). 112 factory workers historically produced 123,415 units, whereas the regression model predicts that 112 workers would produce 118,846 units. The residual is the difference between these two values: 123,415 units - 118,846 units = 4,569 units.

How would the width of the actual prediction interval (at a 95% confidence level) for a 3,000 square foot home differ from the width of the actual prediction interval (at a 95% confidence level) for a 2,000 square foot home, given that the average home size is approximately 1,750 square feet?

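The width of the actual prediction interval for a 3,000 square foot home would be larger than the width of the prediction interval for a 2,000 square foot home. Prediction intervals widen as the value of the independent variable moves farther from its mean, and 3,000 square feet is farther from the average home size of 1,750 square feet than 2,000 square feet is.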

The regression output table

It is divided into three main parts: the Regression Statistics table, the ANOVA table, and the Regression Coefficients table. Although this course does not cover some of the ANOVA (Analysis of Variance) measures, we've included the definitions for completeness. The Residual Output table appears only if we select Residuals when inputting data in the regression dialog box.

experiment

researchers divide a sample into two or more groups. One group is a "control group," which has not been manipulated. In the "treatment group (or groups)," they manipulate a variable and then compare the treatment group(s) responses to the responses of the control group.

observational study

researchers observe and collect data about a sample (e.g., people or items) as they occur naturally, without intervention, and analyze the data to investigate possible relationships.

A human resources department wants to understand the relationship between the number of factory workers and production volume, which is measured in units produced per day. Perform a regression analysis, where the number of workers is the independent variable and production volume is the dependent variable. Be sure to include the residuals and residual plot in your analysis.

From the Data menu, select Data Analysis, then select Regression. The Input Y Range is A1:A21 and the Input X Range is B1:B21. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

Multiple Regression (two or more independent variables)

SellingPrice=194,986.59+244.54(HouseSize)-10,840.04(DistancefromBoston)

What is the null hypothesis (H0) of the movie theater example? Recall that the historical average customer satisfaction rating is 6.7 out of 10.

μ=6.7 The null hypothesis is that the new artistic approach of showing old classics has not affected the average customer satisfaction rating; that is, the new average customer satisfaction rating is the same as its historical value of 6.7 out of 10. Therefore H0:μ=6.7.

What is the alternative hypothesis (Ha) of the movie theater example? Recall that the historical average customer satisfaction rating is 6.7 out of 10.

μ≠6.7 The alternative hypothesis is that the new artistic approach of showing old classics has changed the average satisfaction rating. Therefore Ha:μ≠6.7. Note that Ha:μ≠6.7 is the opposite of H0:μ=6.7, which confirms our understanding that the alternative hypothesis is the opposite of the null hypothesis.

right tailed

(basically more frequency on the left) This graph has a tail that extends out the right side. As selling price increases, the frequency of each bin is much lower than that of the bins below it. Therefore, we infer that this distribution is skewed to the right, or right-tailed.

Recall that the z-value associated with a value measures the number of standard deviations the value is from the mean. Given that the average height of all women is 63.5 inches and the standard deviation is 2.5 inches, what z-value corresponds to 61 inches?

-1 z=(x−μ)/σ, so z=(61−63.5)/2.5=−2.5/2.5=−1
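The same calculation as a tiny Python sketch (the function name z_value is made up for illustration):

```python
def z_value(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


z = z_value(61, 63.5, 2.5)
print(z)  # -1.0
```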

Select the p-value(s) at which you would reject the null hypothesis for a two-sided test at the 90% confidence level. SELECT ALL THAT APPLY.

0.0250 To reject the null hypothesis at the 90% confidence level, the p-value must be less than 1-0.90=0.10. 0.0250 is less than 0.10 so we can reject the null hypothesis. 0.0500 To reject the null hypothesis at the 90% confidence level, the p-value must be less than 1-0.90=0.10. 0.0500 is less than 0.10 so we can reject the null hypothesis. 0.0900 To reject the null hypothesis at the 90% confidence level, the p-value must be less than 1-0.90=0.10. 0.0900 is less than 0.10 so we can reject the null hypothesis.
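The decision rule used in each option can be written as a one-line check. This is only a sketch; the function name is hypothetical.

```python
def reject_null(p_value: float, confidence_level: float) -> bool:
    """Reject H0 when the p-value is below 1 - confidence level."""
    significance_level = 1 - confidence_level
    return p_value < significance_level


for p in (0.0250, 0.0500, 0.0900):
    print(p, reject_null(p, 0.90))  # True for all three
```

A p-value such as 0.25 would return False at the same confidence level.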

If the one-sided p-value of a given sample mean is 0.0150, what is the two-sided p-value for that sample mean?

0.0300 The two-sided p-value is double the one-sided p-value. Since the one-sided p-value is 0.0150, the two-sided p-value is 0.0150*2=0.0300.

If the two-sided p-value of a given sample is 0.0020, what is the one-sided p-value for that sample mean?

0.0010 The one-sided p-value is half of the two-sided p-value. Thus, the one-sided p-value is 0.0020/2=0.0010.

If the two-sided p-value of a given sample mean is 0.0040, what is the one-sided p-value for that sample mean?

0.0020 The one-sided p-value is half of the two-sided p-value. Since the two-sided p-value is 0.0040, the one-sided p-value is 0.0040/2=0.0020.

Which of the following p-values would indicate that we can be 95% confident that there is a significant linear relationship between two variables? Select all that apply.

0.0025 0.0100

Range

10 The range is the difference between the maximum value and the minimum value. We can see from the histogram that the maximum value in this data set is 10 and the minimum value is 0, so the range equals 10-0=10.

For a standard normal distribution (µ=0, σ=1), the area under the curve less than 1.25 is 0.894. What is the approximate percentage of the area under the curve less than -1.25?

0.106 1-0.894=0.106 is the area under the curve for all values greater than 1.25. Since the normal distribution is symmetric, 0.106 is also the area under the curve for all values less than -1.25.
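The symmetry argument can be confirmed with Python's standard library, where NormalDist.cdf plays the role of a cumulative NORM.S.DIST:

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

below_upper = standard.cdf(1.25)   # area to the left of 1.25
below_lower = standard.cdf(-1.25)  # area to the left of -1.25

print(round(below_upper, 3))  # 0.894
print(round(below_lower, 3))  # 0.106
```

The two areas add up to 1, which is the symmetry the card relies on.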

What is the correlation coefficient of the relationship between the average weekly hours spent studying and the score on the final exam?

0.5049 0.5049 is the Multiple R value. Remember that for single variable linear regression, Multiple R, which is the square root of R2, is equal to the absolute value of the correlation coefficient. The regression coefficient for Average Weekly Hours Studying (0.03, as shown in the bottom table of the output) is positive, so the slope of the regression line is positive. Therefore, the correlation coefficient must also be positive.

If you are performing a hypothesis test based on a 0.10 significance level (10%), what are your chances of making a type I error?

10% The probability of a type I error is equal to the significance level (which is 1-confidence level). A 10% significance level indicates that there is a 10% chance of making a type I error.

Suppose we want to assign dummy variables to months (Jan-Dec). How many dummy variables do we need?

11 We always have one fewer dummy variable than the number of options. Since there are 12 months, there would be 11 dummy variables.

If you are performing a hypothesis test based on a 20% significance level, what are your chances of making a type I error?

20% The probability of a type I error is equal to the significance level, which is 1-confidence level.

If the variance of a data set is 9, what is the standard deviation?

3 The standard deviation is equal to the square root of the variance. If the variance is 9, then the standard deviation must be 3.

If a particular standardized test has a mean score of 500 and standard deviation of 100, what percentage of test-takers score between 500 and 600?

34% 600 is one standard deviation above the mean (600-500=100=1*100=1*stdev). We know that approximately 68% of the distribution is within 1 standard deviation of the mean, and the distribution is symmetric. Therefore approximately 34% must fall between the mean and 1 standard deviation above the mean.
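Assuming the test scores are normally distributed, the rule-of-thumb figure can be compared with the exact area using Python's standard library:

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)

# Area between the mean (500) and one standard deviation above it (600).
share = scores.cdf(600) - scores.cdf(500)
print(round(share, 4))  # 0.3413
```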

How much of the variation in home video units can be explained by gross box office sales?

80.36% R2 is the amount of variation in home video units that is explained by this model. 80.36% of the variation in home video units can be explained by the relationship with gross box office sales.

If the average height of all women is 63.5 inches and the standard deviation is 2.5 inches, approximately what percentage of women are between 58.5 and 68.5 inches tall?

95% 58.5 and 68.5 inches are two standard deviations from the mean, that is 63.5±2(2.5). According to the rules of thumb, approximately 95% of women's heights fall within two standard deviations of the mean.

A political campaign supporting a particular candidate conducted an exit poll of 125 voters to estimate the proportion of voters who cast a ballot in favor of its candidate. In Column B, create a dummy variable for the voters' responses, where 1 indicates a vote for the candidate and 0 indicates a vote against the candidate. You must use a spreadsheet function to create the dummy variable. Your answer will be graded as incorrect if you manually enter the dummy variable data.

=IF(A1="For",1,0)
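The same 0/1 coding can be sketched in Python. The five responses are hypothetical stand-ins for column A; the real poll has 125 rows.

```python
# Hypothetical exit poll responses standing in for column A.
responses = ["For", "Against", "For", "For", "Against"]

# Same logic as the IF formula: 1 for a vote in favor, 0 otherwise.
dummies = [1 if response == "For" else 0 for response in responses]

# The mean of a 0/1 dummy variable is the sample proportion voting in favor.
proportion_for = sum(dummies) / len(dummies)

print(dummies)         # [1, 0, 1, 1, 0]
print(proportion_for)  # 0.6
```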

Let's go ahead and figure out what percentage of women are shorter than 63 inches.

=NORM.DIST(63,B1,B2,TRUE).
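A Python cross-check, assuming B1 and B2 hold the mean (63.5 inches) and standard deviation (2.5 inches) of women's heights quoted in the other height cards:

```python
from statistics import NormalDist

# Assumed contents of B1 (mean) and B2 (standard deviation).
heights = NormalDist(mu=63.5, sigma=2.5)

share_shorter = heights.cdf(63)  # like NORM.DIST(63, B1, B2, TRUE)
print(round(share_shorter, 4))   # 0.4207
```

Under those assumptions, roughly 42% of women are shorter than 63 inches.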

Two houses are the same size, but located in different neighborhoods: House B is five miles farther from Boston than House A. If the selling price of House A was $450,000, what would we expect to be the selling price of House B?

Approximately $396,000 Since the two houses are the same size, to predict the expected difference in selling prices we should use the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size). This value, -$10,840.04/mile, is found in the multiple regression model. House B is five miles farther from Boston than House A, so House B's expected selling price is House A's selling price plus the net effect of distance on selling price: $450,000-$10,840.04(5 miles)=$450,000-$54,200.20=$395,799.80.

Assume we have created two single variable linear regression models and a multiple regression model to predict selling price based on HouseSize alone, DistancefromBoston alone, or both. The three models are as follows, where HouseSize is in square feet and DistancefromBoston is in miles: SellingPrice=13,490.45+255.36(HouseSize); SellingPrice=686,773.86-15,162.92(DistancefromBoston); SellingPrice=194,986.59+244.54(HouseSize)-10,840.04(DistancefromBoston). House A and House B are the same size, but located in different neighborhoods: House B is five miles closer to Boston than House A. If the selling price of House A is $450,000, what would we expect to be the selling price of House B?

Approximately $504,000 The houses are the same size, so we use the net effect of distance from the multiple regression model, -$10,840.04/mile. Distance decreases by five miles, so the expected selling price increases by 5*$10,840.04=$54,200.20: $450,000+$54,200.20=$504,200.20.

Suppose the average satisfaction rating of the sample is 7.0 out of 10. Which of the following do you think would be the correct conclusion? Remember that H0:μ=6.7 and Ha:μ≠6.7.
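Both House B scenarios (five miles farther and five miles closer) use the same net coefficient, which a short Python sketch makes explicit. The function and constant names are made up for illustration.

```python
# Net effect of distance from the multiple regression model (dollars per mile).
NET_DISTANCE_EFFECT = -10840.04


def adjusted_price(base_price: float, extra_miles: float) -> float:
    """Expected price of a same-size house that is extra_miles farther away."""
    return base_price + NET_DISTANCE_EFFECT * extra_miles


farther = adjusted_price(450000, 5)   # five miles farther from Boston
closer = adjusted_price(450000, -5)   # five miles closer to Boston

print(round(farther, 2))  # 395799.8
print(round(closer, 2))   # 504200.2
```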

Do not reject the null hypothesis Although we can't be completely sure without doing the analysis, it would probably not be that unusual to draw a sample that has a mean of 7.0 if the average customer satisfaction rating has not changed, and is still 6.7. Therefore, we would probably fail to reject the null hypothesis. To be certain whether this is the case, we would have to complete the hypothesis test—that is, construct the range around the historical population mean and see whether or not 7.0 falls in that range.

standard normal curve

For a standard normal curve, we know the mean is 0 and the standard deviation is 1, so we could find a cumulative probability using =NORM.DIST(x,0,1,TRUE). Alternatively, we can use Excel's NORM.S.DIST function: =NORM.S.DIST(z, cumulative). The "S" in this function indicates it applies to a standard normal curve.
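As a rough cross-check outside Excel, the same cumulative probability can be sketched with Python's standard-library `statistics.NormalDist`. The z-value of 1.0 is an illustrative assumption, not a number from the card.

```python
from statistics import NormalDist

z = 1.0  # illustrative z-value (assumption, not from the card)

# Counterpart of =NORM.DIST(z,0,1,TRUE): mean and standard deviation given explicitly.
p_general = NormalDist(mu=0, sigma=1).cdf(z)

# Counterpart of =NORM.S.DIST(z,TRUE): NormalDist() defaults to mean 0, std dev 1.
p_standard = NormalDist().cdf(z)

print(round(p_general, 4), round(p_standard, 4))
```

Both calls give the same area to the left of z, which is the point of the card: NORM.S.DIST is a shortcut for the standard normal case.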

Determine which variables are significant—at either the 99% or 95% confidence level—and which are not significant at either level. Make sure to choose the highest level of significance for each variable.

For a variable to be significant at the 99% confidence level, its p-value must be less than 1-0.99=0.01. Likewise, a variable is significant at the 95% confidence level if its p-value is less than 0.05. If the p-value of a variable is greater than 0.05, the variable is not significant at the 95% (or 99%) level.
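The decision rule above can be sketched as a small helper. The function name and the sample p-values are hypothetical and only illustrate the thresholds.

```python
def highest_significance(p_value: float) -> str:
    """Return the highest confidence level at which a variable is significant."""
    if p_value < 0.01:      # 1 - 0.99
        return "99%"
    if p_value < 0.05:      # 1 - 0.95
        return "95%"
    return "not significant"

# Hypothetical p-values, chosen only to exercise each branch.
for p in (0.0004, 0.032, 0.27):
    print(p, "->", highest_significance(p))
```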

The following are quantitative variables:

FreeIndependent, Group, Rate, SpecialEvent, TotalRewards, VIP, and Wholesale

example of unbiased

HOW MANY MILES DID YOU WALK YESTERDAY? SHOULD CYCLISTS BE REQUIRED TO WEAR HELMETS? SHOULD NUTRITIONAL INFORMATION BE LISTED AT FAST FOOD RESTAURANTS?

A real estate developer has data on a number of U.S. National financial variables for each quarter from 1995 to 2001. The variables are housing starts (in thousands), the housing price index (a measure of average housing selling prices), unemployment rate, average disposable income, and home owner vacancy rates. A partial view of the data is below. If the developer wanted to create a regression model to predict housing starts from all the other financial variables, which of the following would be INDEPENDENT variables? (Select all that apply.)

House Price Index Unemployment Rate Disposable Income Home Owner Vacancy Rates

Let's find the 99th percentile for women's heights.

In cell B4 enter the function =NORM.INV(0.99,B1,B2)

You report a confidence interval to your boss but she says that she wants a narrower range. SELECT ALL of the ways you can reduce the width of the confidence interval.

Increase the sample size Increasing the sample size provides a more accurate representation of the population and therefore, reduces the width of the confidence interval. Note that another option is also correct.

Which of the following would increase the width of the confidence interval? Select all that apply.

Increasing the sample mean Decreasing the sample size

If you are performing a hypothesis test based on a 90% confidence level, what are your chances of making a type II error?

It is not possible to tell without more information

Assuming the same level of confidence, how does the width of the confidence interval for small samples compare with that for large samples?

It is wider

IQ scores are known to be normally distributed. The mean IQ score is 100 and the standard deviation is 15. What percent of the population has an IQ between 85 and 105?

NORM.DIST(105,B1,B2,TRUE)-NORM.DIST(85,B1,B2,TRUE)=NORM.DIST(105,100,15,TRUE)-NORM.DIST(85,100,15,TRUE)=0.63-0.16=0.47, or 47%. Approximately 47% of people have IQ scores between 85 and 105.

IQ scores are known to be normally distributed. The mean IQ score is 100 and the standard deviation is 15. What percent of the population has an IQ over 115?

NORM.DIST(115,B1,B2,TRUE)=NORM.DIST(115,100,15,TRUE)=0.84, or 84%. Thus, P(x>115)=1-P(x≤115)=1-0.84=0.16, or 16%.
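The two IQ calculations can be reproduced with a minimal Python sketch, using `statistics.NormalDist` as a stand-in for Excel's NORM.DIST with the cumulative flag set to TRUE. The variable names are illustrative.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# P(85 <= x <= 105): difference of two cumulative probabilities.
p_between = iq.cdf(105) - iq.cdf(85)

# P(x > 115): complement of the cumulative probability at 115.
p_over_115 = 1 - iq.cdf(115)

print(f"between 85 and 105: {p_between:.2f}")
print(f"over 115: {p_over_115:.2f}")
```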

Do you feel comfortable with the prediction you just made for a 6,000 square foot house?

No 6,000 square feet lies quite far outside the range of our historical housing data. Remember that there is greater uncertainty as we forecast outside of the historical range of the data, so we probably should not feel very comfortable with this prediction.

Based on the residual plot, do you think that this regression model is a good fit?

No The linear model does not appear to be a good fit because the residuals are not randomly distributed. The residuals form a funnel shape, which indicates that they are heteroskedastic. That is, the size of the residuals grows (in absolute value) as the average weekly hours studying decreases.

What can be concluded from the fact that the correlation coefficient between the acceptance rate at the top 100 U.S. MBA programs and the percent of students in those programs who are employed upon graduation is -0.32?

On average, as the acceptance rate decreases, the percent of students employed upon graduation increases. -0.32 is negative which indicates that, on average, as acceptance rate decreases, the percent of students employed upon graduation increases.

confidence interval

Remember to add and subtract the margin of error from the sample mean to calculate the lower and upper bounds of the confidence interval.

Single Variable Linear Regression (one independent variable)

SellingPrice=13,490.45+255.36(HouseSize) SellingPrice=686,773.86-15,162.92(DistancefromBoston)

cumulative probability?

The cumulative probability associated with a certain number is the probability of obtaining any value less than or equal to that number, which is the area to the left of that number.

What is the center value of the distribution of the sample means?

The population mean (μ). According to the Central Limit Theorem, if we take enough large samples, the mean of the set of sample means equals the population mean.

The regression table below shows the relationship among selling price, distance from Boston, and lot size. Are both independent variables significant at the 5% significance level?

Mode

The value that occurs most frequently in a data set.

Before beginning a hypothesis test, an analyst specified a significance level of 0.10. Which of the following is true?

There is a 10% chance of rejecting the null hypothesis when it is actually true. The significance level specifies how different the observed sample mean has to be from the mean expected under the null hypothesis before we reject the null hypothesis. A significance level of 0.10 means that the observed sample mean is so different from the mean expected under the null hypothesis that it would only occur 10% of the time if the null hypothesis were true.

A food truck operator has traditionally sold 75 bowls of noodle soup each day. He moves to a new location and after a week sees that he has averaged 85 bowls of noodle soup sales each day. He runs a one-sided hypothesis test to determine if his daily sales at the new location have increased. The p-value of the test is 0.031. How should he interpret the p-value?

There is a 3.1% chance of obtaining a sample with a mean of 85 or higher assuming that the true mean sales at the new location is still equal to or less than 75 bowls a day. The p-value provides us with the likelihood of the sample value equal to or more extreme than the observed sample value if the null hypothesis is true. In this case the p-value of 0.031 tells us that there would be a 3.1% chance of the sample value of 85 or above being observed if the null hypothesis were true.

Which of the following is an example of a hidden variable?

There is a correlation between the number of firefighters who show up at a fire and how much damage the fire causes. The hidden variable is the size of the fire. A hidden variable is one that is correlated with each of two variables that are not fundamentally related to each other. In this case, the size of the fire leads to a call for more firefighters, and the size of the fire also generally leads to more damage. The number of firefighters does not lead to a greater amount of fire damage.

The manager now has reason to believe that showing old classics has increased the customer satisfaction rating. Recall that the historical average satisfaction rating was 6.7 and that the random sample of 196 moviegoers has an average satisfaction rating of 7.3 and a standard deviation of 2.8. Calculate the upper bound of the 95% range of likely sample means for this one-sided hypothesis test using the CONFIDENCE.NORM function.

To find the upper bound for the one-sided test we must first determine what two-sided test would have a 5% rejection region on the right side. Since the distribution of sample means is symmetric, a two-sided test with a 10% significance level would have a 5% rejection region on the left side of the normal distribution and a 5% rejection region on the right side. Thus, the upper bound for a two-sided test with alpha=0.1 will be the same as the upper bound on a one-sided test with alpha=0.05. The margin of error is CONFIDENCE.NORM(0.1,C3,C4)=0.33. The upper bound of the 95% range of likely sample means for this one-sided hypothesis test is the population mean plus the margin of error, which is approximately 6.7+0.33=7.03.
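A minimal Python sketch of this calculation follows. The helper `confidence_norm` is a hypothetical stand-in for Excel's CONFIDENCE.NORM(alpha, standard_dev, size), assuming it computes the z-value for a two-sided alpha times the standard deviation over the square root of the sample size.

```python
from math import sqrt
from statistics import NormalDist

def confidence_norm(alpha: float, std_dev: float, size: int) -> float:
    """Hypothetical stand-in for Excel's CONFIDENCE.NORM (two-sided margin of error)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * std_dev / sqrt(size)

historical_mean = 6.7   # mean under the null hypothesis
sample_std = 2.8
sample_size = 196

# A one-sided test at alpha=0.05 shares its upper bound with a two-sided test at alpha=0.10.
margin = confidence_norm(0.10, sample_std, sample_size)
upper_bound = historical_mean + margin

print(f"margin of error: {margin:.2f}")
print(f"upper bound: {upper_bound:.2f}")
```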

For a normal distribution with mean 222 and standard deviation 17, find the value associated with the top 28%.

To use the NORM.INV function we need to think in terms of cumulative probabilities. The value associated with the top 28% is the same as the value corresponding to the bottom 100%-28%=72%, so we need to find the value associated with a cumulative probability of 72%. Using NORM.INV(0.72,B1,B2)=232, we find that 28% of the distribution's values are greater than 232. If we wish, instead of first computing 100%-28%=72%, we can embed that formula in the function using NORM.INV(1-0.28,B1,B2)=232.
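The same lookup can be sketched in Python, where `NormalDist.inv_cdf` plays the role of NORM.INV. It takes a cumulative probability, so the top 28% is passed as 1 - 0.28.

```python
from statistics import NormalDist

dist = NormalDist(mu=222, sigma=17)

# The value with 28% of the distribution above it has 72% of the distribution below it.
cutoff = dist.inv_cdf(1 - 0.28)

print(round(cutoff))
```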

A car manufacturing executive introduces a new method to install a car's brakes that is much faster than the previous method. He needs to test whether the brakes installed with the new method are as safe and effective as those installed with the previous method. His null hypothesis is that the brakes installed using the new method are as safe as those installed using the old method. In this situation, would it be worse to make a type I error or a type II error?

Type II A type II error, or false negative, would be that the brakes are actually not safe but the manufacturer deems them safe and proceeds with the new installation method. This would be worse than returning to the slower method, because the unsafe cars could cause injuries or fatal accidents.

Do you feel comfortable with the prediction you just made for a 2,000 square foot house?

Yes 2,000 lies well within the range of our historical housing data, so we can feel relatively comfortable with this prediction.

Is the relationship between selling price and house size significant at the 95% confidence level?

Yes Since the p-value for the independent variable (house size), 0.0000, is less than 0.05, we can be confident that the relationship between price and house size is significant. Recall that the p-value for the intercept does not determine the significance of the relationship between the dependent and independent variable, so even though the p-value for the intercept is greater than 0.05, we can still say that the relationship between price and house size is significant.

The scientist performs additional analyses and observes that the number of major earthquakes does appear to be decreasing but wonders whether the relationship is statistically significant. Based on the partial regression output below and a 5% significance level, is the year statistically significant in determining the number of earthquakes above magnitude 7.0?

Yes. Since the p-value is not provided, the confidence interval for the coefficient should be used. Since the 95% confidence interval, from -0.11 to -0.04, does not contain zero, the coefficient for year is statistically significant.
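The confidence-interval check can be sketched as a small helper. The function name is hypothetical. The first interval is the one quoted in this card, and the second is an illustrative interval that straddles zero.

```python
def ci_excludes_zero(lower: float, upper: float) -> bool:
    """True when a coefficient's confidence interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0

print(ci_excludes_zero(-0.11, -0.04))  # interval for the year coefficient
print(ci_excludes_zero(-0.50, 0.30))   # illustrative interval that contains zero
```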

Mean

average: The mean is equal to the sum of all of the data points in a set, divided by n, the number of data points.

type II error

false negative (we incorrectly fail to reject the null hypothesis when it is actually not true)

R2 - R-squared

measures how closely a regression line fits a data set

How large must our sample size be for the 95% confidence interval to be within 1 kg/m2 of the true average BMI? Since we don't know σ, the standard deviation of the population, let's use the standard deviation of our previous sample (s=7.10) as an estimate.

n≥194. If we use the equation n≥(z*s/M)^2 to calculate n, we find n≥(1.96*7.10/1)^2≈193.66. However, because the sample size must be a whole number, we have to round up, so n≥194.
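A short Python sketch of the sample-size arithmetic, using the card's z-value of 1.96, sample standard deviation of 7.10, and margin of error of 1. The variable names are illustrative.

```python
from math import ceil

z = 1.96        # z-value for 95% confidence
s = 7.10        # sample standard deviation, used as an estimate of sigma
margin = 1.0    # desired margin of error in kg/m2

n_exact = (z * s / margin) ** 2
n_required = ceil(n_exact)  # sample size must be a whole number, so round up

print(round(n_exact, 2), n_required)
```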

type I error

often called a false positive (we incorrectly reject the null hypothesis when it is actually true)

Suppose we want to know whether students who attend a top business school have higher earnings. What is the alternative hypothesis?

μ(top 50) > μ(not top 50). The alternative hypothesis is the claim we wish to substantiate. In this case, we want to establish that people who attended a school ranked in the top 50 earn more than those who did not, so μ(top 50) > μ(not top 50).

Let's return to our Disney example. What do you estimate to be the R2 of the regression line that describes the relationship between home video units and 2011 gross box office sales?

0.80 The independent variable explains a lot of the variation in the dependent variable, but not quite all of it. In total, the data points are close to the best fit line, but they do not lie on it. Thus, an R2 of 0.80 seems like a good estimate. wrong: -0.80 Remember that R2 is a value between 0 and 1, so a negative value is not possible. For a single variable linear regression, the correlation coefficient equals the positive or negative square root of R2, and can range from -1 to 1.

If the street fair organizer wanted to compare the explanatory power of the original model and the following new regression model, which value should he consult for the new model?

0.9225 It is important to use the Adjusted R2 to compare two regression models that have a different number of independent variables. 0.9225 is the Adjusted R2 of the new model.

Assume we have created two single-variable linear regression models and a multiple regression model to predict selling price based on HouseSize alone, DistancefromBoston alone, or both. The three models are as follows, where HouseSize is in square feet and DistancefromBoston is in miles: SellingPrice=13,490.45+255.36(HouseSize) SellingPrice=686,773.86-15,162.92(DistancefromBoston) SellingPrice=194,986.59+244.54(HouseSize)-10,840.04(DistancefromBoston) House A and House B are the same size, but located in different neighborhoods: House B is five miles closer to Boston than House A. If the selling price of House A is $450,000, what would we expect to be the selling price of House B?

Approximately $504,000 Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model. House B is five miles closer to Boston than House A so House B's expected selling price is: House A's selling price+net effect of distance on selling price ≈ $450,000+$10,840.04(5 miles) ≈ $450,000+$54,200.20 ≈ $504,200.20
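The net-effect reasoning can be sketched in Python. Only the distance coefficient from the multiple regression model is needed, because house size is held constant. The function and variable names are illustrative assumptions.

```python
# Net effect of distance from the multiple regression model (dollars per mile).
DISTANCE_COEFFICIENT = -10_840.04

def expected_price(base_price: float, change_in_miles: float) -> float:
    """Expected price of a same-size house whose distance from Boston differs by change_in_miles."""
    return base_price + DISTANCE_COEFFICIENT * change_in_miles

house_a = 450_000.00
house_b_closer = expected_price(house_a, -5)   # five miles closer to Boston
house_b_farther = expected_price(house_a, 5)   # five miles farther from Boston

print(f"{house_b_closer:,.2f}")
print(f"{house_b_farther:,.2f}")
```

A negative change in miles (closer to Boston) raises the expected price, and a positive change lowers it, matching the two versions of this question in the set.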

Given the regression equation, SellingPrice=13,490.45+255.36(HouseSize), if you increase the square footage of a house by 100 square feet, what would happen to the selling price?

Average selling price would increase by approximately $25,500. The slope of our regression line, about 255 dollars/square foot, describes the expected change in price when house size increases by one square foot. If the square footage increases by 100 square feet, the expected price increases by 100 times the slope. Therefore, the average increase in price as square footage increases by 100 square feet is 100($255)=$25,500.

Suppose the average satisfaction rating of the sample is 6.8 out of 10. Which of the following do you think would be the correct conclusion? Remember that H0: μ=6.7 and Ha: μ≠6.7.

Do not reject the null hypothesis. If the average customer satisfaction rating has not changed (μ=6.7), it would not be unusual to draw a sample that has a mean of 6.8. Therefore, we would probably fail to reject the null hypothesis.

good fit

When a linear model is a good fit, the residuals are randomly scattered above and below the horizontal axis. When a linear model is not a good fit, we see patterns, such as curves or heteroskedasticity, in the residuals.

How many independent variables are there in the model Caesars uses? Consult the regression output table above.

38. There are 38 independent variables in this model. You could have found this either by counting the number of independent variables or by looking at the Regression df, which represents the number of independent variables.

Based on the regression model, forecast the expected production volume when there are 112 factory workers.

The expected production volume when there are 112 factory workers is B15+B16*112=118,846. You must link directly to values in order to obtain the correct answer.

The best point forecast for the selling price of a 2,500 square foot house is the expected selling price of a 2,500 square foot home, approximately 13,490 + 255.36(2,500) = $652,000. Given that the standard error of the regression is about $151,000, which of the following would give the BEST estimate for the prediction interval for a 2,500 square foot home with approximately 95% confidence?

$652,000 ± 2($151,000) A prediction interval is centered at a point forecast, in this case $652,000. The standard error of the regression is multiplied by 2 since we wish to estimate the prediction interval at the 95% confidence level. Note that we are using 2 to approximate the z-value for a 95% prediction interval. The actual z-value corresponding to 95% (for sufficiently large samples) is 1.96.
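The approximate prediction interval can be sketched as follows, using the card's rounded point forecast and standard error. The multiplier of 2 is the card's approximation of the 1.96 z-value.

```python
point_forecast = 652_000   # expected price of a 2,500 square foot house (rounded, from the card)
standard_error = 151_000   # standard error of the regression
multiplier = 2             # approximates the z-value of 1.96 for 95% confidence

lower = point_forecast - multiplier * standard_error
upper = point_forecast + multiplier * standard_error

print(f"${lower:,} to ${upper:,}")
```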

For use in a linear regression model, categorize which of the following variables should be represented as dummy variables and which can be represented as quantitative variables.

Dummy variables: shoe color, number on an athlete's jersey, gender, ice cream flavor. Quantitative variables: time to run a marathon, height, size of flat-screen television, hours spent studying CORe, calories in desserts. Time to run a marathon, height, size of flat-screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete's jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athletes' jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning's number is 10, whereas Peyton Manning's was 18. However, you can't interpret them to mean that Peyton is 80% more than Eli in some way.

Use the regression model to calculate the expected selling price of a home in a school district that has average SAT scores above 1700 (SAT=1).

The expected selling price of homes in school districts where students have average SAT scores above 1700 is B15+B16*1=B15+B16=$809,100. You must link directly to the values in order to obtain the correct answer.

The owner of a Boston sports bar believes that, on average, her restaurant is busier on days when the Red Sox play an away game (a game played at another team's stadium), but she wants to be sure before adding more staff. To test whether this is true, she takes a random sample of 50 days over the course of the baseball season and records the total daily revenue, along with whether the Red Sox were playing away that day (1 if yes, 0 if no). Using the data provided, perform a regression analysis to determine the effect of Red Sox away games on revenue. Be sure to include the residuals and residual plot in your analysis.

From the Data menu, select Data Analysis, then select Regression. The Input Y Range is A1:A51 and the Input X Range is B1:B51. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

Given the general regression equation, ŷ=a+bx, which of the following describes ŷ? Select all that apply.

The expected value of y The dependent variable The value we are trying to predict

The spreadsheet below contains data about the current and lagged variables from the pop-culture blogger's tweets and the number of followers she gained that week. Create a regression model to predict the number of followers from the current week, the previous week, and the two weeks prior. Be sure to include the residuals and residual plots in your analysis.

Because the lagged variables have no values for the earliest weeks, you should select only rows with complete data and leave the Labels box unchecked. From the Data menu, select Data Analysis, then select Regression. The Input Y Range is B4:B18 and the Input X Range is C4:E18. You must check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

Based on the following partial regression output table, from which the information on the coefficients' t-statistics and p-values has been removed, which of the independent variables are significant at the 95% confidence level? SELECT ALL THAT APPLY.

Variable A The 95% confidence interval for the variable's coefficient does not contain 0, which indicates that Variable A is significant at the 95% confidence level. The p-value (not shown) of Variable A, is 0.0001. Since it is less than 1-0.95=0.05, its value confirms that the variable is significant at the 95% confidence level. Variable D The 95% confidence interval for the variable's coefficient does not contain 0, which indicates that Variable D is significant at the 95% confidence level. The p-value (not shown) of Variable D, is 0.0028. Since it is less than 1-0.95=0.05, its value confirms that the variable is significant at the 95% confidence level.

Based on the segment of the output table shown below for the regression analysis of the U.S. motion picture industry's 2011 home video units vs. 2011 gross box office revenues, is there evidence of a significant linear relationship between these two variables?

Yes Since the p-value of the independent variable, 0.0000, is less than 0.05, we can be 95% confident that there is a significant linear relationship between gross box office and home video units. We could also note that (19.58; 22.95), the 95% confidence interval for the slope, does not contain zero.

Is the relationship between Red Sox away games and average daily revenues significant at the 95% confidence level? Choose the correct answer with the corresponding correct reasoning.

Yes, because the p-value of the independent variable is less than 0.05 Since the p-value, 0.0005, is less than 0.05, we can be confident that the relationship is significant at the 5% significance level and, equivalently, at the 95% confidence level.

