MSIT

which test to use

2 quantitative variables: t-test for the regression slope. 2 categorical variables: chi-squared test of independence.

The Gross Domestic Product (GDP) is the best way to measure the strength of a country's economy, and consists of the total value of everything produced by a country. Download the GDP data set and open directly in JMP.

Click Analyze → Tabulate and drag "GDP ($ billion)" into the drop zone for columns, then drag Mean/Median/Std Dev/Interquartile Range on top of the default Sum.

some data points increase/decrease

Mean and SD will change; median and IQR stay the same "if" the points do not cross the median.

all data points increase/decrease

Mean and median change; IQR and SD stay the same (when every point shifts by the same amount).
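A minimal sketch of the card above, assuming every point is shifted by the same constant. The data values are made up for illustration and do not come from the course data sets; `summary` is a hypothetical helper name.

```python
import statistics

def summary(values):
    """Return the mean, median, sample SD, and IQR of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical data, for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]
shifted = [x + 10 for x in data]  # every point increases by 10

before = summary(data)
after = summary(shifted)

# Measures of center move by the shift; measures of spread do not.
for name in ("mean", "median", "sd", "iqr"):
    print(name, before[name], after[name])
```

The mean and median both move up by 10, while the SD and IQR print the same value before and after.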

spread

The more spread out the data, the bigger the prediction errors and the less accurate the predictions. The farther the average distance from the mean, the greater the standard deviation.

cluster sampling

The population is split into groups (clusters); some clusters are randomly selected, and everyone in a selected cluster is included. Very cheap, not very precise.

p̂ (p-hat)

Sample proportion: number in the sample with the characteristic / sample size n

Each year, the Federal Reserve Board must estimate the total cash holdings of U.S. financial institutions as of July 1. In order to compensate for the different sizes of bank, all financial institutions are classified as small, medium, or large. Staff members select a random sample of institutions from each classification.

stratified

A large retail store analyzes its monthly advertising budget ($ thousands) vs. amount of monthly sales ($ millions), and finds a correlation of 0.67. Which type of test should the retail store use to investigate if advertising budget is related to sales?

t-test for regression slope

z-score

For a single value: z = (value − mean)/SD. For a sample proportion: z = (p̂ − p)/sqrt(p(1 − p)/n).
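The two formulas on this card can be sketched as small functions. The function names and the example inputs are hypothetical, chosen only to show the arithmetic.

```python
import math

def z_score(value, mean, sd):
    """z-score of a single value: how many SDs it lies from the mean."""
    return (value - mean) / sd

def z_for_proportion(p_hat, p, n):
    """z-score of a sample proportion, using the SE under the hypothesized p."""
    se = math.sqrt(p * (1 - p) / n)
    return (p_hat - p) / se

# Made-up inputs, for illustration only.
z_value = z_score(75, 70, 10)                 # 5 units above the mean, SD = 10
z_prop = z_for_proportion(0.55, 0.50, 100)    # SE = sqrt(0.25/100) = 0.05

print(z_value, z_prop)
```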

which graph to use

1 categorical: bar or pie chart. 1 quantitative: histogram or boxplot. 1 quantitative across levels of a categorical: parallel boxplots. 2 quantitative: scatterplot.

convincing evidence

If p-value < α: reject H0 (a wrong decision here is a Type I error). If p-value > α: do not reject H0 (a wrong decision here is a Type II error).

A telemarketing firm is worried about the high turnover of employees. The manager wonders if younger workers are more likely to quit when they find other jobs. A random sample of 40 workers who quit is taken, recording their age and number of weeks worked before quitting.

H0: β1 = 0. α = 0.10 > p-value. Interval to estimate the average: narrower.

stratified random sampling

Divided into groups, and some people are randomly selected within every group. Expensive, time consuming, but very precise

finding p-value

JMP > distribution calculator > enter the z-score; for a 2-tailed test select "X <= q1 OR X > q2".

Experts recommend that adults get 8 hours of sleep per night. Researchers want to show that the average adult isn't getting enough sleep. Which of these results would provide the most convincing evidence?

The most convincing evidence would be provided by the greatest difference (7 hours instead of 8) with the largest sample size

mean and standard deviation

good for unimodal, symmetric distributions

median and IQR

good for skewed or contains outliers

Which factors affect laptop speed? Time to complete various tasks were measured on 15 randomly selected laptops, together with the cache size and memory.

(a) 95% of the actual laptop speeds will be within 2(14.787) of their predicted values. 95% of the prediction errors will have a size smaller than 2(14.787). (b) The confidence interval is computed as: (sample estimate) ± (critical value)SE. For Memory, the sample estimate is given by 7.37 with SE = 2.17. Get the critical value from the JMP t calculator with DF = n−(number of variables)−1 = 15−2−1 = 12 (or simply look up DF for Error in the Analysis of Variance table). Enter Central Probability = 0.9, which shows 1.7823. The confidence interval is then computed as 7.37 ± (1.7823)(2.17) = (3.5024, 11.2376). (c) The p-value = 0.0001 for the F-test. Since the p-value < α, there is sufficient evidence to claim Ha: the model is useful. (d) R2-Adjusted could increase or decrease depending whether the new variable makes a significant contribution to predicting the response variable. In this case, the new variable makes a significant contribution to the model, so R2-Adjusted will increase implying lower RMSE
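The interval in part (b) can be re-derived in a few lines. This is only a sketch: the critical value 1.7823 is copied from the card (the standard library has no t quantile function), and the variable names are made up.

```python
# Degrees of freedom for error: n - (number of predictors) - 1
n_laptops = 15
n_predictors = 2          # cache size and memory
df_error = n_laptops - n_predictors - 1

# Values quoted on the card for the Memory coefficient.
estimate = 7.37
se = 2.17
t_crit = 1.7823           # from the JMP t calculator, central probability 0.9

margin = t_crit * se
lower, upper = estimate - margin, estimate + margin

print(df_error, round(lower, 4), round(upper, 4))
```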

Traffic departments try to encourage drivers to slow down by placing speed-measuring devices on roads that display how fast the driver is going. In one recent test, traffic officials recorded the average speed of cars on a street close to an elementary school with a posted speed limit of 25 mph. A sample of 35 randomly chosen cars yielded an average speed of 29.6 mph with standard deviation 14.0 mph.

(a) Both H0 and Ha are based on the hypothesized value μ = 25, rather than the sample statistic ȳ = 29.6. (b) The test statistic is given by (sample statistic − hypothesized value) / SE. The sample statistic is ȳ = 29.6, and SE = 14.0 / sqrt(35) = 2.3664, so the test statistic = (29.6 − 25) / 2.3664 = 1.9439. (c) The p-value < α, so there is sufficient evidence to conclude Ha: μ > 25. (d) For sample sizes n ≥ 30, hypothesis tests for the mean are valid regardless of the shape of the data. (e) In the formula for SE, dividing by a larger sample size leads to a smaller SE. This results in a larger test statistic and smaller p-value.
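A short sketch of parts (b) and (e), using the numbers from the card. The larger sample size of 140 is a made-up value, used only to show that SE shrinks and the test statistic grows when n increases and everything else stays fixed.

```python
import math

def t_statistic(y_bar, mu0, s, n):
    """Return (SE, test statistic) for a one-sample test of the mean."""
    se = s / math.sqrt(n)
    return se, (y_bar - mu0) / se

# Numbers from the card: n = 35 cars, mean 29.6 mph, SD 14.0 mph, H0: mu = 25.
se, t_stat = t_statistic(29.6, 25.0, 14.0, 35)

# Hypothetical larger sample with the same mean and SD.
se_big, t_big = t_statistic(29.6, 25.0, 14.0, 140)

print(round(se, 4), round(t_stat, 4))
print(round(se_big, 4), round(t_big, 4))
```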

A food distribution company is responsible for the filling and packaging of boxes of nuts. The boxes are labeled as containing 1 kg of nuts, but there have been several complaints that the boxes contain fewer nuts. To investigate this complaint, a consumer watchdog agency tests a sample of 81 randomly chosen boxes. The weights of these boxes are given in the Nuts data set.

(a) Click Analyze → Distribution, then drag 'weight' into the Y, Columns box and click OK. Click the second little red triangle (next to 'weight') followed by Test Mean. Enter the hypothesized value = 1000, and click OK. To test if the true weight is less than 1,000, use the left side p-value given by "Prob < t", which is 0.1351. (b) Since 0.1351 > 0.05, there is insufficient evidence to conclude the average weight is less than 1,000 grams.

To test that car companies publish honest values for the miles per gallon of their cars, the EPA randomly spot checks the mpg values of a sample of cars every year. The mpg data set gives the mpg values for a random sample of cars of the same model in one recent such check. The advertised fuel economy for this model was 29 mpg.

(a) Click Analyze → Distribution, then drag Mileage into the Y, Columns box and click OK. Click the second little red triangle (next to Mileage) followed by Test Mean. Enter the hypothesized value = 29, and click OK. To test if the true mpg is different than 29, use a 2-sided test (H0: μ = 29 vs. Ha: μ ≠ 29). The 2-sided p-value is given by "Prob > |t|". (b) The confidence interval limits are given under the Summary Statistics column ("Upper 95% mean" and "Lower 95% mean"). (c) Since 0.0381 < 0.05, there is sufficient evidence to reject H0 and conclude the true average mpg for this model is different than 29 mpg.

In an effort to reduce boarding time, an airline tries a new method of boarding its planes. Historically, only 32% of passengers were satisfied with the boarding process. A random sample of 71 passengers using the new boarding process found 43% who said they were satisfied with the boarding process.

(a) First compute the test statistic, which is given by (sample statistic − hypothesized value)/SE. Here, the sample statistic is p̂ = 0.43, the hypothesized value is the historical value 0.32, and SE = sqrt(0.32 * 0.68 / 71) = 0.0554. This gives a test statistic of 1.9870. In the JMP Normal calculator, keep Mean = 0, Std. Dev. = 1 and select "X > q" for the right tail (to show *improved* customer satisfaction), then input Value = 1.9870. (b) The airline is testing H0: p = 0.32 (the same proportion of customers were satisfied) vs. Ha: p > 0.32 (a greater proportion of customers were satisfied). A Type I error occurs when in reality H0 is true (the same proportion of passengers were satisfied), but the airline concludes the opposite.
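The JMP Normal calculator step in part (a) can be mirrored with Python's standard library. This is a sketch under the card's own numbers; `NormalDist` stands in for the calculator and is not part of the original answer.

```python
import math
from statistics import NormalDist

p0 = 0.32      # historical (hypothesized) proportion
p_hat = 0.43   # sample proportion
n = 71         # sample size

se = math.sqrt(p0 * (1 - p0) / n)   # SE uses the hypothesized value p0
z = (p_hat - p0) / se

# Right-tail area ("X > q") under the standard Normal curve.
p_value = 1 - NormalDist().cdf(z)

print(round(se, 4), round(z, 4), round(p_value, 4))
```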

A movie company is considering producing a movie of either Genre A or Genre B. The viewer ratings for a random sample of 100 movies from each genre are shown below:

(a) In Genre A the majority of viewer ratings are concentrated close to the center, so both the standard deviation and IQR would be smaller. In Genre B there are more viewer ratings farther from the center, so both the standard deviation and IQR would be larger. (b) A small standard deviation means that the spread of movie ratings is closely concentrated around the average rating. (c) Increasing any of the points would cause the mean to increase. None of these points crossed over the median, however, so it would remain the same. (d) The new center will now be 5 instead of 50, and the points will be a lot closer to 5 than they were to 50. The spacing between the data points will be a lot smaller; they will spread out over a narrower range so both measures of spread would decrease. (e) An outlier is any point farther than 3 SD's from the mean. In this case, 50 + 3(16) = 98.

Many people feel that CEO salaries have gotten out of control. An international management consulting firm developed a model to help companies decide on suitable salaries. In the early stages of the modeling process, the consulting firm took a random sample of 100 companies and recorded the CEO salary as well as 10 possible predictor variables, as shown in the CEO Salaries JMP file.

(a) First, note that gender and board membership are both categorical variables. Use a chi-squared test of independence to determine if two categorical variables are related. After opening the data in JMP, click Analyze → Fit Y by X, drag Board Member? into the Y,Response box and Gender into the X,Factor box, click OK. The mosaic plot shows that female CEOs have a higher percentage of No for board membership, while male CEOs have a higher percentage of Yes. (b) The above output shows the p-value below "Prob>ChiSq" in the Pearson row. (c) Experience and Salary are both quantitative variables, so we need to fit a regression model. This can be done in two ways: (i) Click Analyze → Fit Y by X, drag Salary into the Y,Response box and Experience into the X,Factor box, click OK, then click the little red triangle at the top left → Fit Line; or (ii) Click Analyze → Fit Model, drag Salary into the Y box and Experience into the Construct Model Effects box, click Run. The regression output shows R2 = 0.6133, so take the square root to get the correlation (R) (and note the slope is positive so no sign adjustment is necessary). (d) The slope is given at the bottom of the regression output in the Experience row, below "Estimate". (e) The correlation does not depend on the units of measurement, so it would stay the same and therefore also R2. However, the slope measures the change in Y per unit X so if these units change, the slope will too.

According to census data, 43% of households have at least one child below the age of 18. A children's clothing store in a certain city wanted to know if the households in this city differed from those in the general nation. A sample of randomly chosen households in this city found 48% with at least one child below 18.

(a) H0 assumes the initial value from census data, p = 0.43. To test if the households in this city differed from the nationwide average, we need a 2-sided test, Ha : p ≠ 0.43. The sample result 0.48 is never used for setting up H0 or Ha. (b) Since the p-value < α, there is sufficient evidence to claim Ha, namely, sufficient evidence to claim this city is different than the nationwide average. (c) The p-value gives the probability of obtaining a sample result, p̂, more extreme than the current sample value 0.48, assuming that H0 is true (households in this city are the same as the nation overall).

A company would like to know if employees in various job types feel differently about a proposed change to the retirement benefit plan. The table below shows the results from a survey of 236 randomly chosen employees

(a) H0: Opinion is not related to job type vs. Ha: Opinion is related to job type (b) DF = (no. of column categories − 1)(no. of row categories − 1) = (3−1)(2−1) = 2. Enter the test statistic into the Chi Square calculator in JMP, and compute the tail area on the right side, which is 0.0594. (Chi squared tests always compute the p-value on the right side.) (c) 0.0594 > α = 0.05, so there is insufficient evidence of Ha, namely, insufficient evidence that opinion is related to job type.
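A small sketch of the DF rule in part (b). The right-tail function below relies on a special case: with exactly 2 degrees of freedom the chi-squared right-tail area is exp(−x/2), so it must not be used for other DF. The test statistic 6.0 is a made-up value (the card does not quote the statistic), and the function names are hypothetical.

```python
import math

def chi_square_df(n_columns, n_rows):
    """DF for a test of independence: (columns - 1)(rows - 1)."""
    return (n_columns - 1) * (n_rows - 1)

def right_tail_df2(statistic):
    """Right-tail area of a chi-squared distribution with DF = 2 only."""
    return math.exp(-statistic / 2)

df = chi_square_df(3, 2)               # the 3-by-2 table from the card

# Hypothetical test statistic, for illustration only.
p_value = right_tail_df2(6.0)

print(df, round(p_value, 4))
```

For other degrees of freedom, a chi-squared calculator such as the one in JMP is still needed.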

In order to study consumer preferences for health care reform in the US, researchers from the University of Michigan surveyed US households. Heads of household were asked whether they are in favor of, neutral about, or opposed to a national health insurance program in which all Americans are covered and costs are paid by tax dollars. The 439 responses are summarized in the table below. You wish to test if there is sufficient evidence to conclude that opinions are unevenly divided on the issue of national health insurance.

(a) H0: pfavor = pneutral = poppose = 1/3 vs. Ha: at least one p ≠ 1/3 (b) The sample size condition for chi-squared tests specifies that the expected count of each cell should be at least 5. (n ≥ 30 is the condition for hypothesis tests of the mean; np is the condition for hypothesis tests of a single proportion.) (c) Since 0.0006 < 0.05, there is sufficient evidence of Ha. That is, sufficient evidence that opinions are unequally divided among US households.

Lotteries have become an important source of state funds. However, critics argue that people in certain risk groups give up a relatively large part of their income to play the lottery: poorer, less educated, older people and people with many children. To test these beliefs, a random sample of 123 frequent lottery players was taken. The response variable is the percentage of household income spent on lottery tickets.

(a) H0: βEducation = βAge = βChildren = βIncome = 0 vs. Ha: at least one β ≠ 0 (b) The overall model is useful if the p-value for the F-test is less than 0.05. (c) Education is the predictor variable, so when it increases by 1 unit (= 1 year), the response variable is predicted to decrease by 0.46... for the same values of the other variables. (d) Education and Income have p-value < 0.05, indicating they are useful predictors, after adjusting for the other variables in the model. (e) In this case the variable with the largest P-value is Children, so we remove it

A financial analyst computed the assets to liability ratio for 60 "healthy" and 44 "failing" firms. The computer output shows the test statistic t = 1.917 and p-value = 0.058.

(a) H0: μhealthy − μfailing = 0 vs. Ha: μhealthy − μfailing > 0 (b) Since 0.058 > 0.05, there is insufficient evidence of Ha, namely, insufficient evidence that healthy firms have a higher average asset to liability ratio. (c) The 95% confidence interval corresponds to α = 0.05. Since 0.058 > 0.05, there is insufficient evidence to reject H0. This implies 0 is one of the plausible values (since we did not reject it), and would be inside the interval. (d) If there really is a difference between healthy and failed firms, then Ha must be true. Whenever Ha is true but we make a mistake, this constitutes a Type II error.

Of the 5,750 first-year students who enrolled at UGA in August 2018, about 16% came from outside the state of Georgia. How many out-of-state students would we expect to see in a random sample of 20 freshmen? The graph below shows the sampling distribution of p̂ for a sample of size 20.

(a) Here we are dealing with proportions. The standard error gives the average distance between the sample result [proportion] and the population value [proportion]. (b) Since SE = sqrt(pq/n), where q = 1 − p, it will decrease if n increases. For large sample sizes such as 100, the sampling distribution will be approximately Normal.
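A quick check of part (b) with the numbers from the question (p = 0.16, n = 20) and the larger sample size of 100 mentioned in the answer. The function name is hypothetical.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * q / n), q = 1 - p."""
    return math.sqrt(p * (1 - p) / n)

p = 0.16                                # proportion of out-of-state students

expected_in_sample = p * 20             # expected out-of-state count in 20 freshmen
se_20 = se_proportion(p, 20)
se_100 = se_proportion(p, 100)

print(round(expected_in_sample, 1), round(se_20, 4), round(se_100, 4))
```

Going from n = 20 to n = 100 multiplies n by 5, so SE is divided by sqrt(5).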

How common are body piercings among college students? One statistics professor asked a large class of students to anonymously report the number of piercings they had.

(a) There are 60 students with 0 body piercings and 10 students with 1 body piercing, leading to (60 + 10)/118 = 0.593. (b) There is one main peak (unimodal), with a longer tail on the right side (right-skewed). (c) The median = 0 because slightly more than half of the 118 values are 0. All the other statistics are greater than 0.

McMillan Assembly has a contract to assemble components for radar systems to be used by the U.S. military. The time required to complete one part of the assembly is approximately normal with a mean of 34 hours and a standard deviation of 5.3 hours. (Use 4 decimal places for all answers below.)

(a) In the JMP Normal calculator: enter Mean = 34 and Std. Dev. = 5.3. Select "X <= q", and put 33 into the "Input: Value" box. (b) Select "Input Probability and calculate values" on the left side, then choose "Central probability" on the right side and enter 0.5 into the "Input: Probability" box. (c) Keeping the selection "Input Probability and calculate values" on the left side, select "Right tail probability" on the right side and put 0.03 into the "Input: Probability" box.
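The three calculator steps can be reproduced with the standard library as a cross-check. This sketch assumes the questions are: (a) the probability of finishing in at most 33 hours, (b) the bounds of the middle 50% of times, and (c) the time exceeded by only 3% of assemblies; the card gives only the calculator settings, so these readings are inferred.

```python
from statistics import NormalDist

assembly_time = NormalDist(mu=34, sigma=5.3)   # hours

# (a) "X <= q" with q = 33
p_at_most_33 = assembly_time.cdf(33)

# (b) Central probability 0.5: cut 25% off each tail
lower_mid = assembly_time.inv_cdf(0.25)
upper_mid = assembly_time.inv_cdf(0.75)

# (c) Right tail probability 0.03: the 97th percentile
cutoff_top_3pct = assembly_time.inv_cdf(1 - 0.03)

print(round(p_at_most_33, 4))
print(round(lower_mid, 4), round(upper_mid, 4))
print(round(cutoff_top_3pct, 4))
```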

The marketing division of an electronics retailer wanted to know how much time website visitors spent looking at the customer reviews of the various products. They plan to take a random sample of 43 visitors to their website and record how long they spent looking at the various customer reviews. Based on these sample results, they will construct a 90% confidence interval.

(a) Margin of error = (critical value)SE, where SE = s/sqrt(n). Dividing by larger n results in a smaller SE and smaller ME. (The same result is also true for proportions.) Since it is the same confidence, however, it would have the same chance of capturing the true average viewing time. (The interval would be narrower and more precise -- which is better, but still have the same chance of capturing the true average value, which is determined only by the confidence level.) (b) Higher confidence requires a wider interval to capture the unknown parameter. This occurs through a larger critical value. Also, more confidence provides a greater chance of capturing the true average viewing time.

A credit card company studies how frequently consumers of various age groups use some type of card (either debit or credit) vs cash or check when making purchases. Sample data for 280 randomly chosen customers shows the use of cards by four age groups:

(a) Note that payment method and age group are both categorical variables. To investigate if two categorical variables are related, we should do a chi-squared test of independence. (b) For a chi-squared test for independence (two variables), we have DF = (no. of column categories − 1)(no. of row categories − 1) = (4−1)(2−1) = 3. Enter the test statistic into the Chi Square calculator in JMP, and compute the tail area on the right side, which is 0.0145. (Chi square tests always compute the p-value on the right side.) (c) Since 0.0145 < 0.05, there is sufficient evidence of Ha (payment method is related to age). For 2 variables (age and payment method, in this case), H0 is always that the 2 variables are not related, and Ha is that the variables are related to each other.

A mattress company would like to claim that sleeping on their state-of-the-art mattresses leads to greater productivity at work. They select two random samples of workers, half of which will sleep on a regular mattress and the other half will sleep on their special mattress. The next day, they give all the employees a difficult task and rate their performance on a scale from 1-100 (higher is better).

(a) Since the company tested two different samples of employees, they are using a test of two independent means. (b) The company would get much stronger evidence their mattresses work by using a paired design. In the original test of two independent means, it's possible that employees who are naturally more productive end up in the regular mattress sample and employees who are naturally less productive end up in the special mattress sample, leading to very weak evidence. Seeing how much specific employees improve their productivity on the special vs. regular mattresses leads to a better measure of how much they help people be productive.

A cinema was considering showing a certain film, but wanted to know what early reviewers thought about the film. A histogram of ratings given by early reviewers is shown below, together with the 5-number summary

(a) Since the ratings are skewed, we should use the median to describe the typical value, and the IQR to measure the spread. The mean, standard deviation and range should only be used for symmetric data without outliers. (b) Increasing any of the values would increase the mean (average). Since none of these values cross the median, however, the median (midpoint of the data) remains the same. (c) Raising the minimum value to 4 would not change Q1 or Q3 (the boundaries of the lower and upper quarters of the data set) so the IQR would remain the same. However, the standard deviation would decrease because it is based on the spread of all the data, and increasing the lower ratings would decrease the spread of these ratings from the center. (d) The median is 7.2, which implies that 50% of the early reviewers gave it a rating of 7.2 or higher. The percentage who gave it a rating of 7.5 or higher must therefore be less than 50%.

Many CEOs say they would read more, but don't have the time. A company that provides training courses for speed reading would like to offer a free trial to a random selection of 50 CEOs from Fortune 500 companies. If their course helps the CEOs read faster, they hope that employees and other CEOs would purchase their course. They plan to measure the speeds at which these CEOs currently read, have them take the training course, re-measure their speeds and compute the average increase in reading speed.

(a) Since the same CEOs are being measured, this should be a paired t-test. (b) When a sample is randomly chosen, the results can be generalized to the group from which it was chosen. Note that random assignment (not selection) is what makes it possible to determine what caused reading speeds to increase. (c) For inference based on the mean, the sample size should be at least 30. Skewness is not relevant for large sample sizes. (d) The 95% interval corresponds to α = 0.05. Since 0.011 < 0.05, the H0 value of 0 can be rejected, and 0 is outside the 95% interval. For 99% confidence, α = 0.01 so 0.011 > 0.01; the H0 value of 0 cannot be rejected, and 0 is inside the 99% interval.

When fitting a regression model, which conditions should be checked?

-When plotting the original data on a scatterplot, there should be a linear trend (in order to fit a linear model). -When plotting the residuals, the points should be randomly scattered around 0 without any pattern. -The histogram of residuals should show a Normal (bell) shape.

According to US Census data, 26% of Americans have a college degree. An IT recruiting company in a certain city would like to know if the residents in this city differ from the nationwide average. They do a survey and compute the following 98% confidence interval for the proportion of residents in this city who have a college degree: (0.13, 0.22).

(a) Since the tested value 0.26 is outside the confidence interval, 0.26 cannot be one of the plausible values, and we should reject H0. (b) Since 0.26 is outside the interval, H0 should be rejected. When H0 is rejected, we must have p-value < α, where α = 0.02 (for 98% confidence).

An airline would like to estimate the overall satisfaction level of its passengers. On a certain day, it randomly chooses 9 passengers on each of its flights and asks for their satisfaction level (1-5).

(a) The airline would ideally like information about all of its passengers. The population is the large group consisting of all people about whom it would like to obtain information. (b) The population parameter (or parameter of interest) is the number they would have liked to know, usually this number is unknown. This is the number that would theoretically be computed from the population (all passengers, in this case). In this case they would have liked to know the satisfaction level of all their passengers. (c) The sampling frame consists of all the people who were eligible to be chosen. In this case the sample was chosen from the passengers who were flying on that particular day, so anybody who was flying that day could potentially have been chosen. (d) The sample statistic is the number computed from the sample (the people who actually took part in the survey). (e) The sample consists of the people who actually took part in the survey. (f) Stratified sampling occurs when a few people are chosen from every group, which is the case here. It would have been cluster if they had chosen a few flights at random and asked everybody on the chosen flights.

A sample of 44 randomly chosen students at a certain school yielded an average commuting time of 22.3 minutes with a standard deviation of 5.1 minutes.

(a) The confidence interval is given by (sample statistic) ± (critical value)SE. The sample statistic is ȳ = 22.3 and SE = s/sqrt(n) = 5.1/sqrt(44) = 0.7689. To get the critical value: use the JMP t-calculator, enter DF = 43 and select "Input probability and calculate values". Enter Central Probability = 0.9, which shows 1.6811. The confidence interval is then given by 22.3 ± 1.6811(0.7689) = (21.0075, 23.5925). (b) There is 90% confidence of capturing the true average commuting time inside this interval. (c) Here we're testing H0: μ = 21 vs. Ha: μ ≠ 21. Since the H0 value of 21 is outside the confidence interval, we should reject H0, so we conclude there is sufficient evidence of Ha: μ ≠ 21. (d) As above, we're testing H0: μ = 21 vs. Ha: μ ≠ 21. Since the H0 value of 21 is outside the confidence interval, we should reject H0. Rejecting H0 implies a small p-value < α, where α is the complement of the confidence level (α = 1 − confidence level). 90% confidence implies α = 0.10, so in this case we must have p-value < 0.10. (e) More confidence requires a wider interval in order to have a higher chance of capturing the true value. Margin of error is always half the width of the confidence interval, so a wider confidence interval implies a greater margin of error. (f) The formula for sample size is n = (z* · s / ME)² = (2.5758 · 5.1 / 1.2)² = 119.84, which rounds up to 120 (sample size always rounds up). To get z* = 2.5758: use the JMP Normal calculator with mean = 0 and SD = 1. Select "Input Probability" and enter Central Probability = 0.99.
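The sample-size calculation in part (f) can be sketched as follows, using the values quoted in the answer (99% confidence, s = 5.1, ME = 1.2). `NormalDist().inv_cdf` stands in for the JMP Normal calculator; the variable names are made up.

```python
import math
from statistics import NormalDist

confidence = 0.99
s = 5.1                 # sample standard deviation (minutes)
margin_of_error = 1.2   # desired margin of error (minutes)

# Central probability 0.99 leaves 0.005 in each tail.
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

n_raw = (z_star * s / margin_of_error) ** 2
n_required = math.ceil(n_raw)   # sample size always rounds up

print(round(z_star, 4), round(n_raw, 2), n_required)
```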

The GoDawgs Computer Company manufactures computers and delivers them directly to customers who order online. Their competitive edge is price and speed of delivery. Maintaining a large inventory helps achieve the objective of speedy delivery, but is expensive. To lower that cost, the operations manager wants to optimize inventory using both daily demand and delivery time.

(a) The confidence interval is given by (sample statistic) ± (critical value)SE. The sample statistic is ȳ = 362.9. To get the critical value: use the JMP t calculator for the average. Enter DF = 33, select "Input probability and calculate values", enter Central Probability = 0.95, which shows 2.0345. The interval is then given by 362.9 ± 2.0345(13.6) = (335.2308, 390.5692). (b) Since H0: μ = 387 is inside the interval, we cannot reject 387 as a plausible value.

A group of educational specialists would like to estimate the percentage of UGA students who usually bring a laptop or tablet to class. They randomly select 25 UGA classes, asking them if they usually bring a laptop or tablet to class; they get a response from all the students enrolled in these classes. The total sample size is 1800, of which 504 (28%) usually bring a laptop or tablet to class.

(a) The exact value of the parameter of interest is unknown, and they would like to estimate it: the percentage of all UGA students who usually bring a laptop or tablet to class. (b) Cluster sampling (c) The email survey would not work due to large voluntary bias. Even if a larger number such as 7000 responded, it would likely focus more on the wrong type (e.g. those who are very enthusiastic about bringing a laptop or tablet to class).

A marketing consultant wanted to test if two well-known companies were equally popular with males and females. They took a sample of 50 randomly selected males and 50 randomly selected females, and asked them whether they had recently visited either company's website. The results are shown below:

(a) The expected count is given by: (row total)(column total)/(overall total) = 50(60)/100 = 30. (b), (c) Since there are two categorical variables, we need to do a chi-squared test of independence. H0: gender is not related to website visits vs. Ha: gender is related to website visits. Company A has observed counts very close to the expected counts. This implies a small test statistic and a large p-value (insufficient evidence the variables are related). Company B is very different from the expected values in H0, so it would have a larger test statistic and smaller p-value.
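A sketch of the expected-count formula from part (a), plus how one cell contributes to the chi-squared statistic. The observed counts 31 and 45 are made up to contrast a cell close to its expected count with one far from it; they are not the values from the table, and the function names are hypothetical.

```python
def expected_count(row_total, column_total, overall_total):
    """Expected cell count under H0 (the two variables are not related)."""
    return row_total * column_total / overall_total

def cell_contribution(observed, expected):
    """One cell's contribution to the chi-squared statistic: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected

expected = expected_count(50, 60, 100)       # the card's example

# Hypothetical observed counts, for illustration only.
close_cell = cell_contribution(31, expected)
far_cell = cell_contribution(45, expected)

print(expected, round(close_cell, 4), round(far_cell, 4))
```

Cells far from their expected counts add much more to the test statistic, which is why they lead to a smaller p-value.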

The study abroad office claims 62% of students would study abroad if it didn't delay their graduation. A survey of 248 students revealed 142 who would study abroad if it didn't delay their graduation.

(a) The hypothesized value is H0: p = 0.62. This can be shown wrong in 2 ways: if p has increased or decreased, so we need to consider both sides and do a two-sided test: Ha: p ≠ 0.62. (c) For proportions, we always use the Normal calculator. In JMP, keeping mean = 0 and Std. Dev. = 1, select "X <= q1 OR X > q2" since Ha is 2-sided, and enter Value 1 = -1.5379 and Value 2 = 1.5379. This shows the p-value = 0.1241. (d) There is insufficient evidence of Ha (study abroad office is wrong).
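The two-sided p-value from the calculator step can be cross-checked in Python. This sketch takes the test statistic −1.5379 as given on the card and uses `NormalDist` in place of the JMP Normal calculator.

```python
from statistics import NormalDist

z = -1.5379   # test statistic quoted on the card

# "X <= q1 OR X > q2": add both tails, which are equal by symmetry.
standard_normal = NormalDist()          # mean = 0, SD = 1
left_tail = standard_normal.cdf(-abs(z))
p_value = 2 * left_tail

print(round(p_value, 4))
```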

To help determine how much soda to stock, a beverage company at Sanford stadium recorded the average temperature at the middle of the game, as well as the total number of items (cans/bottles) of soda that were sold (in thousands).

(a) The predicted value is ŷ = -104.13 + 1.52(83) = 22.03; the actual sales = 19.80. The residual is then: Actual − Predicted = 19.80 − 22.03 = -2.23. (b) Looking at the scatterplot, this point is far away from the rest of the data and regression line. Adding a data point that does not fit the trend of the data would lead to increased errors, and a decrease in the correlation. Looking at the graph reveals that it would also flatten the slope. (c) The percentage variation in Y (soda sold) that can be accounted for on the basis of X (temperature) is given by R² = (0.957)² = 0.916 or 91.6%. (d) The data show a linear relationship for temperatures between 68 and 92. We cannot assume the model will continue with the same linear trend beyond 92; the line could flatten or become steeper, for instance. So this model should not be used to make any predictions for temperatures less than 68 or greater than 92.
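Parts (a), (c), and (d) can be sketched with the numbers on the card. The function name `predict_sales` and the range check are illustrative additions, not part of the original answer.

```python
INTERCEPT = -104.13
SLOPE = 1.52
TEMP_MIN, TEMP_MAX = 68, 92   # range of temperatures in the data

def predict_sales(temperature):
    """Predicted soda sales (thousands); refuses to extrapolate."""
    if not TEMP_MIN <= temperature <= TEMP_MAX:
        raise ValueError("temperature outside the range of the data")
    return INTERCEPT + SLOPE * temperature

predicted = predict_sales(83)
actual = 19.80
residual = actual - predicted   # Actual - Predicted

r = 0.957
r_squared = r ** 2              # share of variation in Y explained by X

print(round(predicted, 2), round(residual, 2), round(r_squared, 3))
```

Calling `predict_sales(95)` raises an error, matching the warning in part (d) about predictions outside 68 to 92.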

A statistics teacher asked each of her students to flip a coin 10 times. Each student recorded the proportion of heads in their sample. The proportions are shown in the graph below.

(a) The distribution of results from many different samples is called the sampling distribution. (b) A larger sample size would have sample results that are a lot closer to the true value of 0.5, that is, the sample results would be more concentrated around 0.5 and have a smaller range.

A medical study was conducted to study the relationship between systolic blood pressure and weight (kg) and age (days) for infants. Babies were selected from 22 randomly selected hospitals nationwide that specialize in delivering premature and on-time babies

(a) The scatterplot is linear, the residuals are random and show constant variance, and the histogram shows a bell-shaped curve, so all the conditions are met. (b) The interval to estimate the blood pressure of an individual baby with the specified age and weight is the wider interval: (91.81, 111.31). This is called a prediction interval.

An online sports-clothing retailer investigated how much it should spend on advertising to attract potential customers to its website. They fitted a regression model to predict their daily sales ($) based on the number of website clicks received per day.

(a) The slope of 8.04 gives the predicted change in Y (= daily sales) per unit of X (= clicks ). (b) The residual-by-x plot does not show a curved shape, so the linearity condition is met. It does show a fan shape, however, indicating the constant variance condition is not met. The histogram shows a uniform rather than Normal (bell) shape, indicating the Normality condition is not met.

The mean salary for all NFL players in 2018 was about $2.2 million. If we took many samples of 10 NFL players, what values would the sample mean take? The graph below shows the sampling distribution of ȳ for a sample of size 10 from the NFL population.

(a) The standard error gives the average error between the sample statistic and the true population value. In this case, we are dealing with means rather than proportions, so it gives the average error between the sample mean and the population mean. (b) In the formula for standard error (SE), dividing by a larger sample size results in a smaller SE. A larger sample size also results in the sampling distribution becoming closer to a Normal distribution.
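A minimal sketch of part (b), assuming the usual formula SE = SD / sqrt(n) for a sample mean. The population SD of 3.0 ($ million) is a made-up value for illustration; it is not given in the problem.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 3.0  # hypothetical SD in $ millions, for illustration only
for n in (10, 40, 160):
    # Each time n is quadrupled, the SE is cut in half.
    print(n, round(standard_error(sd, n), 3))
```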

Has the average starting salary of business majors increased? In 2016, the average starting salary for a college graduate with a business degree was $49,230. In 2017, a sample of 39 randomly chosen business graduates yielded an average starting salary of $51,067 with a standard error of $1,343

(a) The test statistic is (sample statistic − hypothesized value) / SE = (51,067 − 49,230) / 1,343 = 1.3678. (b) The hypotheses being tested are: H0: μ = $49,230 vs. Ha: μ > $49,230. If in reality the average starting salary had increased (Ha is true), but you conclude the opposite (H0), this is called a Type II error
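The test statistic in part (a) can be reproduced with a small helper. The function name is an illustrative choice; the inputs are the sample mean, hypothesized mean, and SE from the problem.

```python
def standardized_statistic(sample_stat, hypothesized, se):
    """(sample statistic - hypothesized value) / SE."""
    return (sample_stat - hypothesized) / se

t_stat = standardized_statistic(51067, 49230, 1343)
print(round(t_stat, 4))
```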

In 2014, the Pew Research Center randomly sampled 1850 adults, in order to estimate the proportion of Internet users between 18 and 34 years old (Gen Y "millennials"). Common wisdom holds that this group of young adults is among the heaviest users of the Internet. The researchers used random digit dialing to call the 1850 adults, and found that 29% of the Internet users in the sample were between 18 and 34.

(a) The true proportion of Internet users between 18 and 34 years old is the parameter and 29% is a statistic. (b) The sampling frame is the list of all adult internet users from which Pew Research selected the sample. This study likely suffers from non-response bias, but not voluntary response. Voluntary response is a problem when NO statistical sampling method is used, for instance posting the survey online and accepting whoever decides to answer it.

Capital Credit Union (CCU) recently began issuing a new credit card. Managers at CCU have been monitoring how customers have been using the card. Suppose a random sample of 364 customers reveals the monthly amount at gas stations is approximately normal with an average of $189 and a standard deviation of $37. (Use 4 decimals for the questions below.)

(a) This is asking for the z-score: z = (269 − 189) / 37 = 2.1622. (b) Use the Normal calculator in JMP; enter Mean = 189 and Std. Dev. = 37. In the lower right panel select "q1 < X <= q2" and enter the values 201 and 263.
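Outside JMP, the same two calculations can be sketched with Python's standard library. This is an alternative to the JMP Normal calculator, not the method the course uses; `NormalDist` is from the built-in `statistics` module.

```python
from statistics import NormalDist

spend = NormalDist(mu=189, sigma=37)  # monthly gas-station spending

# (a) z-score for a value of 269
z = (269 - spend.mean) / spend.stdev

# (b) P(201 < X <= 263), same as "q1 < X <= q2" in the JMP calculator
prob_between = spend.cdf(263) - spend.cdf(201)

print(round(z, 4))
print(round(prob_between, 4))
```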

A retail store with locations in several cities sent out large amounts of flyers in each city and recorded the number of sales enquiries it received during the 2 week period after the mailing.

(a) This point would be an outlier in the graph (top left corner). This would lead to more errors overall, so R² would decrease and RMSE would increase. (b) The lowest value of flyers is 750 or 800 (thousand). We cannot extend the line down to 0 thousand because this would be extrapolation -- the line could very well level off or become steeper or follow some other nonlinear pattern and there is no indication that it would continue with the same slope or linear pattern as before. (c) The histogram should show a bell-shape (Normal), but this is not the case here. The scatterplot shows a linear trend which is correct, so the only problem is with the histogram.

Production engineers at Sinotron believe that a modified layout on its assembly lines might increase average worker productivity. They decide to study the average hourly production output for 39 randomly selected employees before and after the production line modification. Management wants to know if the modification has improved the average hourly production.

(a) This should be a paired t-test because the same people were measured before and after the modification. (b) The 95% interval corresponds to α = 0.05. Since 0.0017 < 0.05, 0 can be rejected and would be outside the 95% interval. For 99% confidence we have α = 0.01 with 0.0017 < 0.01, so 0 can be rejected and would be outside the 99% interval.

Labor officials investigated the wages paid to workers in the shipping and distribution facility of a large online retailer

(a) To decrease the margin of error: increase the sample size, decrease the confidence level, and/or decrease the standard deviation. (b) Use the formula for sample size: n = (z* s / ME)² = (2.5758 * 187 / 26)² = 343.21, which rounds up to 344. (Sample size always rounds up.) To get z* = 2.5758, use the JMP Normal calculator with mean = 0 and SD = 1, select "Input probability" and enter Central Probability = 0.99.
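The sample-size formula in part (b) can be sketched as follows. The function name is illustrative; the critical value comes from the standard library instead of the JMP calculator.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(confidence, sd, margin):
    """n = (z* * sd / ME)^2, always rounded up."""
    # Central probability `confidence` leaves (1 - confidence)/2 in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_star * sd / margin) ** 2)

n = sample_size_for_mean(confidence=0.99, sd=187, margin=26)
print(n)
```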

A travel agency is putting together tour packages for various cities and needs to decide which hotels to include as part of the tour. The average nightly rate for 4 and 5 star hotels in two of these cities is shown below:

(a) To determine which price is relatively more expensive compared to other 4 star hotels in the same city, compare the z-scores: New York: z = (338 − 269) / 43 = 1.605. Los Angeles: z = (310 − 231) / 58 = 1.362. New York has the higher z-score, so it is relatively more expensive. (b) (i) A bar chart should be used for categorical data. (ii) A time series plot is the only appropriate graph to show the change over time of a quantitative variable such as number of annual travelers.
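The comparison in part (a) amounts to computing one z-score per city and picking the larger. A minimal sketch, with the (price, mean, SD) values from the answer above and illustrative names:

```python
def z_score(value, mean, sd):
    """(value - mean) / SD."""
    return (value - mean) / sd

# (price, city mean, city SD) for 4 star hotels, as quoted in the answer
hotels = {"New York": (338, 269, 43), "Los Angeles": (310, 231, 58)}

z = {city: z_score(*values) for city, values in hotels.items()}
relatively_most_expensive = max(z, key=z.get)

for city, score in z.items():
    print(city, round(score, 3))
print(relatively_most_expensive)
```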

To help determine risk and calculate appropriate insurance premiums, health insurance companies often conduct studies to investigate broad lifestyle habits. One such study aimed to predict the percent of people who smoke in a given state, based on the percentage of people who ate the recommended 5 daily servings of fruits and vegetables in that state. Data were collected from all 50 states.

(a) To investigate if a linear relationship exists, we test H0: β1 = 0 (no linear relationship) vs. Ha: β1 ≠ 0 (a linear relationship exists). (b) The slope gives the predicted change in Y per one-unit increase in X: for every 1 percentage point increase in the % of people who eat the recommended 5 daily servings of fruits and vegetables, the predicted percentage of smokers decreases by 0.8917 percentage points. (c) For n = 50, the value of DF = 50-1-1 = 48. To determine if there is a linear relationship, we do a 2-sided test (Ha: β1 ≠ 0). In the JMP t calculator, enter the negative and positive test statistic, -1.375 and 1.375, and select "X <= q1 OR X > q2", which gives a p-value of 0.1755.

Has the popularity of online search engines changed over the past few years? In 2015, 12% of searches were done with Bing, 67% with Google, 9% with Other search engines and 12% with Yahoo (in alphabetical order, to match the JMP data file). A random sample of searches in 2019 yielded the results shown in the Search Engines JMP file.

(a) Use a Chi-squared goodness-of-fit test to determine if the counts from a single categorical variable follow specific proportions (in this case, market share in 2015). (b) H0 is specified by the theoretical proportions (in this case, market share in 2015). (c) After opening the data in JMP, click Analyze → Distribution, drag Search Engine into the Y, Columns box and Count into the Freq box, click OK. On the next screen, click the second red triangle at the top left → Test Probabilities and enter the probabilities from H0: 0.12, 0.67, 0.09, 0.12. The test statistic is given below "ChiSq" in the row labeled "Pearson", with the p-value below "Prob>ChiSq" in the same row. (d) Since the p-value of 0.1208 > 0.05, there is insufficient evidence of Ha: the popularity of the search engines has changed.
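For reference, the Pearson statistic that JMP reports is the sum of (observed − expected)² / expected over the categories. The sketch below uses made-up observed counts (the real ones are in the Search Engines JMP file, which is not reproduced here), so its result will not match the JMP output.

```python
# H0 proportions from the problem: Bing, Google, Other, Yahoo
h0_props = [0.12, 0.67, 0.09, 0.12]

# HYPOTHETICAL observed counts, for illustration only (not the JMP data)
observed = [130, 650, 100, 120]

total = sum(observed)
expected = [p * total for p in h0_props]

# Pearson chi-squared goodness-of-fit statistic
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # categories minus 1

print(round(chi_sq, 4), df)
```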

To help with marketing to prospective students, an MBA program would like to know if the MBA specialty chosen by students in its program is related to their undergraduate degree. A random sample of MBA students yielded the following results:

- Chi-squared test of independence (2 categorical variables). - p-value < α, so there is sufficient evidence of a relationship.

Hannah will only buy a car if it can be shown to have an average fuel efficiency greater than 34 mpg. A random sample of 32 test drives yields an average of 35 mpg with standard error of 0.553.

(a) We would like to show μ > 34, so this must be Ha. H0 is based on the same value, μ = 34. (b) The test statistic is given by (sample statistic − hypothesized value) / SE. The sample statistic is ȳ = 35 and the hypothesized value is μ = 34, so the test statistic is (35 − 34) / 0.553 = 1.8083. (c) Always use the t curve for the average. Enter DF = 32 − 1 = 31, select "X > q" and input the test statistic, 1.8083. This shows the p-value = 0.0401. (Compute the area on the right side because Ha points to the right.) (d) The p-value < 0.05, so there is sufficient evidence of Ha. That is, sufficient evidence the average mpg is greater than 34. (e) Larger values of α make it easier to cross the rejection boundary and reject H0 / claim Ha, in other words, easier to claim Ha: μ > 34 (Hannah will buy the car). (f) The p-value of 0.0401 gives the probability of getting the sample result of 35 (or higher, since Ha points to the right)... if H0 is true (μ = 34 mpg). (g) In reality, μ = 34, which is H0. If H0 is true but we make a mistake, this constitutes a Type I error.

A market research group compared the prices of groceries at Kroger and Publix. Prices of 43 items were obtained from both Kroger and Publix (the same items at each store).

(a) paired t-test. (b) A paired t-test would provide stronger evidence of a difference because it measures the same items

Most people want cars that are both powerful and fuel efficient. Consider mpg vs engine size; a scatterplot of these measurements for a random selection of cars revealed a negative trend with two large residuals (outliers) and R = −0.75.

(a) t-test for the regression slope. (b) If the outliers are removed from the data, the strength of the relationship (R²) would increase and the prediction errors (RMSE) would decrease.

To help remain competitive, traditional grocery stores are constantly testing new technologies. One of these is a live price tracker attached to shopping carts, which helps customers remain on budget. Is there a difference between the average amounts spent with and without a live price tracker? Download the Shopping Carts.jmp data file, which contains the amounts spent by shoppers with and without a live price tracker. When shoppers entered the store, they were randomly assigned a shopping cart -- either one with a live price tracker, or one without a live price tracker.

(a) Two sample t-test. (b) After opening the data in JMP, select Analyze → Fit Y by X and drag Amount Spent into the Y, Response box, and Price Tracker into the X, Factor box. Click OK, then click the little red triangle at the top left next to Oneway Analysis → t Test. Since the grocery store wanted to see if there was a difference, a 2-sided test is needed... the appropriate p-value is given next to "Prob >|t|". (c) Since the p-value < 0.05, there is sufficient evidence of a difference. (d) The JMP output shows the difference is -2.8228, which it says is calculated as "Without − With". If Without − With = -$2.82, then With = Without + $2.82, implying With is $2.82 greater than Without. (e) Since the shoppers were randomly assigned a shopping cart, it is reasonable to conclude that the difference is indeed caused by whether or not they received a live price tracker.

A marketing team designed a promotional web page to increase online sales. Visitors to the home page were randomly directed to the old page or the newly designed page. During a day of tests, 350 customers were randomly assigned. The 175 customers who were directed to the old page spent $282 on average, while those who went to the new page spent $304 on average

(a) Two sample t-test. (b) To show the new webpage has higher sales, we need to show μnew > μold, which corresponds to Ha: μnew − μold > 0. (c) Since 0.06 > 0.05, they would have concluded there is insufficient evidence that the new webpage increased sales (H0 is not rejected). If this conclusion is wrong, then Ha is really true, which implies a Type II error.

The director of operations for the State Bank and Trust recently performed a study of the time bank customers spent from the moment they arrived in the parking lot until they exited the front door after completing banking business. Data from a random sample of 1296 customers over a 4-week period revealed a unimodal and symmetric shape, with a mean of 21.3 minutes and a standard deviation of 4.5 minutes.

(a) z = (28.3 − 21.3) / 4.5 = 1.556. (b) The value 30.3 is exactly 2 standard deviations above the mean, so there is 95% inside the interval and 5% outside, split equally between the lower and upper tails: 2.5% on each side.
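A quick check of both parts, assuming a Normal model with the stated mean and SD. The 95% figure in (b) is the empirical rule; the exact Normal tail beyond 2 SDs is slightly smaller than 2.5% (about 2.3%), which the sketch makes visible.

```python
from statistics import NormalDist

times = NormalDist(mu=21.3, sigma=4.5)  # minutes per customer visit

# (a) z-score for 28.3 minutes
z = (28.3 - 21.3) / 4.5

# (b) upper-tail area beyond 30.3 minutes (2 SDs above the mean)
upper_tail = 1 - times.cdf(30.3)

print(round(z, 3))
print(round(upper_tail, 4))
```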

UGA wants to survey freshmen about their experience filling out the FAFSA (Free Application for Federal Student Aid). They decide to randomly select 8 First Year Odyssey classes and ask every student in the selected classes to fill out a survey

- Cluster sample. - It is quicker and easier than both simple random and stratified sampling since you go to a smaller number of FYO classes. It is not "unlike" a systematic sample in that both methods involve randomness.

An investment advisory company analyzed the financial characteristics for a sample of Fortune 500 companies randomly selected from a particular issue of Fortune magazine. Download the Financial data set and open directly in JMP

(a) Cross-sectional data. (b) The easiest way to see the average sales for each company type is via the Tabulate function. Click Analyze → Tabulate, drag "Sales" into the drop zone for columns, "Type" into the drop zone for rows, and drag Mean on top of Sum. This shows that Oil companies have the highest mean sales. (The same result is also found using the Median, if you made histograms and noticed the skewness.) (c) The company type with the highest standard deviation will have the greatest variation, i.e. individual companies will be spread out the most from each other. Using the previous Tabulate window, drag Std Dev on top of Mean (or right next to it, in the same row). The same result is also found using the IQR, if you noticed the skewness in the histograms.

An insurance company wants to develop a better understanding of its claims paid out for medical malpractice lawsuits. Its records show claim payments amounts, as well as information about the presiding physician and the claimant for a number of recently adjudicated or settled lawsuits. Download the Medical Malpractice data set and open directly in JMP. Use 2 decimals for all questions below.

- Click Analyze → Distribution and drag Attorney into the Y, Columns box, or click Analyze → Tabulate and drag Attorney into the drop zone for rows followed by "% of Total" into the drop zone for columns. - Subjects need to be randomly assigned in order to conclude cause. - Click Graph → Graph Builder, drag Amount onto the Y axis (response variable) and Age onto the X axis (explanatory variable). The best-fitting line can be obtained via the 3rd icon from the left at the top of the graph; this line is almost flat. (mean age)

Banks charge different interest rates for different loans. A random sample of 211 loans made by banks for the purchase of new automobiles was studied to identify variables that explain the interest rate charged. A multiple regression model was constructed using 4 explanatory variables. Response variable: Loan interest rate Explanatory Variables: Loan Size ($ thousands), Length of Loan (months), Percent Down Payment, Years at Current Address. R2 = 0.239; R2-Adjusted = 0.224 RMSE = 1.255

(a) H0: βLS = βLL = βDP = βCA = 0 vs. Ha: at least one β ≠ 0. (b) Use an F test for overall model usefulness, which has test statistic given by the F Ratio (in the Analysis of Variance table). (c) An increase of 1 unit of Loan Size (= $1,000) is predicted to decrease Loan Interest Rate (= Y) by 1.162, the magnitude of b1, while keeping all the other variables fixed. (The negative coefficient implies a decrease in loan interest rate.) (d) Only Loan Size and Years at Current Address have P-value < α = 0.05, so these are the only two useful variables. (e) Remove the single variable with the largest P-value, provided this P-value > 0.05, and re-fit the model.

A comparison of the drive-through service times (in seconds) during the lunch hour for various fast food restaurants yielded the following boxplots.

- Q1 is approximately 250, while Q3 is approximately 300. This gives IQR = Q3 − Q1 = 50. - (c)(i) The value 175 is between the median and Q3, which implies the percentage above it is less than 50% but greater than 25%. (ii) The value 230 is between the min and Q1, which implies the percentage above it is more than 75%. (d) There is only 1 true statement: A z-score of −1.79 means the value is 1.79 standard deviations below the mean.

In some cities, the subway transportation system utilizes the honor system, in which passengers are requested to buy tickets but are not forced to.

- (a) Since no previous estimate of p is available, use p = 0.5. Then n = (z*/ME)² pq = (2.5758/0.02)²(0.5)(0.5) = 4146.72, which rounds up to 4147 (sample size always rounds up). - (b)(i) The confidence interval is given by (sample statistic) ± (critical value)(SE). Here, the sample statistic is p̂ = 0.18 and SE = sqrt(0.18*0.82 / 360) = 0.0202. The critical value is obtained through the JMP Normal calculator with mean = 0 and SD = 1. Select "Input probability" and enter Central Probability = 0.95, which shows 1.96. Then the interval is given by 0.18 ± (1.96)(0.0202) = (0.1403, 0.2197). (ii) Since the confidence interval contains 0.2 but not 0.25, only manager A could be correct.
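Both calculations can be sketched in Python, using the standard library for the critical values instead of the JMP calculator. Function and variable names are illustrative; 0.20 and 0.25 are the two claimed values compared against the interval in (b)(ii).

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """z* that leaves (1 - confidence)/2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# (a) sample size with p = q = 0.5, ME = 0.02, 99% confidence
n_needed = math.ceil((z_critical(0.99) / 0.02) ** 2 * 0.5 * 0.5)

# (b)(i) 95% interval for p-hat = 0.18 from a sample of 360
p_hat, sample_n = 0.18, 360
se = math.sqrt(p_hat * (1 - p_hat) / sample_n)
margin = z_critical(0.95) * se
interval = (p_hat - margin, p_hat + margin)

# (b)(ii) which claimed values fall inside the interval?
inside = {claim: interval[0] <= claim <= interval[1] for claim in (0.20, 0.25)}

print(n_needed)
print(tuple(round(x, 4) for x in interval))
print(inside)
```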

Fire damage in the US costs insurance companies billions of dollars every year. The time taken to arrive at a fire is critical; one insurance company conducted a study of 24 fires, recording the distance to the nearest fire station (in miles) and percent of building destroyed by the fire. The following results were obtained:

- The confidence interval is given by: (sample estimate of slope) ± (critical value)(SE). Find the critical value: In the JMP t calculator, enter DF = 24−1−1 = 22. Select "Input probability", then choose "Central probability" and enter 0.9, which gives 1.7171. - Since 0 is outside the confidence interval, there is sufficient evidence that a linear relationship exists between distance from the fire station and percent fire damage, so yes: distance from the fire station would be useful to predict the percent of fire damage. - The confidence interval contains the population slope with 90% confidence.

A GMAT tutoring company gave a free practice test to prospective students, with the following results:

- Finding outliers: compute IQR = Q3 − Q1, then the fences Q1 − 1.5(IQR) and Q3 + 1.5(IQR), and check whether a value falls outside the fences. - Finding distance: take the difference between the median and each quartile. - Because there is an outlier on the lower end and the scores are left skewed, the mean will be less than the median.
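The fence rule can be sketched as a small function. The quartiles and scores below are hypothetical, since the actual GMAT results table is not reproduced here.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# HYPOTHETICAL quartiles and scores, for illustration only
lower, upper = outlier_fences(q1=500, q3=600)
scores = [320, 480, 550, 610, 700]
outliers = [s for s in scores if s < lower or s > upper]

print(lower, upper)
print(outliers)
```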

Many local businesses support the idea of an Internet sales tax, helping them compete with large online retailers. A nationwide survey of 1352 randomly chosen Americans found a 90% confidence interval for the proportion who support Internet sales tax to be (0.38, 0.50).

-one sample z-interval -There is only one true statement: with 90% confidence, the true value of the proportion lies inside the interval. The interval does not contain a randomly chosen person, nor 90% of the people, nor the sample proportion. - The margin of error is half the interval width: (0.50 − 0.38) / 2 = 0.06 -More confidence requires a wider interval (in order to capture the parameter with greater confidence); a wider interval implies a greater margin of error since margin of error = half the interval width. The wider interval is obtained via a larger critical value (z*)
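Since the interval is (sample proportion) ± (margin of error), both pieces can be recovered from the endpoints. A minimal sketch with illustrative names:

```python
lower, upper = 0.38, 0.50  # 90% confidence interval from the survey

margin_of_error = (upper - lower) / 2  # half the interval width
p_hat = (upper + lower) / 2            # midpoint is the sample proportion

print(round(margin_of_error, 2))
print(round(p_hat, 2))
```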

A realtor in Athens would like to advise her clients how much houses with different numbers of bedrooms typically cost. She takes a random sample of 200 houses currently for sale and records the asking price as well as number of bedrooms. Download the JMP data file: Athens House Prices.

After opening the data in JMP, click Analyze → Distribution, drag Price into the Y, Columns box and Bedrooms into the By box, click OK. -(a),(b) The mean house price for each category of Bedrooms is given in the Summary Statistics column, as well as the 95% confidence limits ('Lower 95% Mean' and 'Upper 95% Mean'). -Since $200,000 is contained inside the interval, there is insufficient evidence to reject it. -The margin of error only takes into account the variability/error due to the random selection process and any additional errors will need to be added on top of this.

The owner of a gas station expects the price of gas to drop soon so does not want to buy too much gas, but still enough to cover their daily sales. From previous experience, the number of gallons sold per day can be modeled by a Normal model with mean 1159 gallons and standard deviation of 128 gallons

In JMP, enter Mean = 1159 and Std. Dev. = 128. (a) Select "X > q" for the right tail and enter 1300 into the "Input: Value" box. (b) Select "q1 < X <= q2" and enter 1025 and 1345. (c) On the left side, select "Input probability and calculate values". On the right side, select "Central probability" and enter 0.6 into the "Input: Probability" box. (d) They want to buy enough gas to cover almost all of their demand... there is only 0.02 probability of having a demand more than what they buy, so this is the right tail of distribution (high gas sales). On the left side, select "Input probability and calculate values". On the right side, select "Right tail probability" and enter 0.02 into "Input: Probability" box.
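The four JMP calculator steps map onto two operations: `cdf` for "input value, get probability" and `inv_cdf` for "input probability, get value". The sketch below uses Python's standard library as a stand-in for the JMP calculator; variable names are illustrative.

```python
from statistics import NormalDist

demand = NormalDist(mu=1159, sigma=128)  # gallons sold per day

# (a) right tail: P(X > 1300)
p_above_1300 = 1 - demand.cdf(1300)

# (b) P(1025 < X <= 1345)
p_between = demand.cdf(1345) - demand.cdf(1025)

# (c) central probability 0.6 leaves 0.2 in each tail
central_low = demand.inv_cdf(0.20)
central_high = demand.inv_cdf(0.80)

# (d) right-tail probability 0.02 means 0.98 lies below the cutoff
stock_level = demand.inv_cdf(1 - 0.02)

print(round(p_above_1300, 4), round(p_between, 4))
print(round(central_low, 1), round(central_high, 1))
print(round(stock_level, 1))
```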

A financial services company provides investors with a choice of many different mutual funds. For each $1,000 invested, the annual returns by their group of funds can be modeled by a Normal distribution with a mean of $1,152 and standard deviation of $651. Use 4 decimals for all questions below.

On Normal calculator in JMP, enter Mean = 1152 and Std. Dev. = 651: (a) Select "X > q" for the right tail and enter 1500 in the "Input: Value" box. (b) Select "X <= q" for the left tail and enter 0 in the "Input: Value" box. (c) Select "q1 < X <= q2" and enter 631 and 2454. (d) Keeping the selection "q1 < X <= q2", enter 1478 and 2194. (e) On the left side, select "Input probability and calculate values", then on the right side select "Right tail probability" and enter 0.14 in the "Input: Probability" box. (f) Keeping the left side on "Input probability", select "Left tail probability" and enter 0.01 in the "Input: Probability" box.

A large grocery chain has comment boxes that allow customers to drop in comment cards, anonymously rating their impressions of the shopping experience. Over the course of the year, they collect 1,258 ratings. Which of the following is a reasonable critique of this sampling design?

The 1,258 customers who chose to fill out comment cards may not be representative of all customers.

What type of educational background do CEOs have? In one survey, 338 randomly selected CEOs of medium and large companies were asked whether they had an MBA degree, and economists found 105 who had an MBA degree. The economists were interested in testing if the proportion of CEOs with MBAs has changed from the value of 21% that was true 5 years ago.

The conditions are met: the sample is random and np = 338(0.21) ≥ 10 and nq = 338(0.79) ≥ 10
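The condition check can be written out directly. The sketch also computes the one-proportion z statistic, (p̂ − p) / sqrt(p(1 − p)/n), which goes one step beyond the answer above; variable names are illustrative.

```python
import math

n, with_mba, p0 = 338, 105, 0.21

# Success/failure condition under H0: both expected counts at least 10
expected_yes = n * p0
expected_no = n * (1 - p0)
conditions_met = expected_yes >= 10 and expected_no >= 10

# One-proportion z statistic, with the SE computed under H0
p_hat = with_mba / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

print(round(expected_yes, 2), round(expected_no, 2), conditions_met)
print(round(z, 2))
```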

One commonly cited metric in car commercials is time to accelerate from 0 to 60 mph. What car features might explain faster acceleration (shorter times)? A data set includes weight (in pounds), horsepower, and acceleration (in seconds) for 392 different types of car. Consider two models to predict acceleration.

The simple regression would have higher overall error (RMSE) and lower R².

An article in The Wall Street Journal asks whether more than 50% of American workers would prefer being given $200 rather than a day off from work. A survey based on a random sample of 1040 American workers yielded a p-value of 0.0238. What is the meaning of this p-value?

There is 0.0238 probability of getting the observed sample result (or something more extreme), if H0 were true

boxplot- which result in smaller p-value

bigger difference between the sample mean and null hypothesis value(in problem)

standard deviation

if points are close to center, small standard deviation

