QMB Exam 3
A farmer wants to decide if he should plant corn or lima beans. No matter what he plants he will get either a good, average, or bad crop yield. For corn, a good yield will give him a profit of $7,000, an average yield will give a profit of $3,000, and a bad yield will give a profit of $1,000. For lima beans, the profits for good, average, and bad yields are as follows: $5,000, $3,500, and $2,000. For either crop, the probability of a good yield is 0.4, the probability of an average yield is 0.5, and the probability of a bad yield is 0.1. What are the outcomes for lima beans?
$5,000, $3,500, $2,000
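The card only asks for the outcomes, but the same payoffs and probabilities can be combined into an expected profit for each crop. A minimal sketch using the figures from the scenario; the names `payoffs`, `probs`, and `expected_value` are illustrative and not from the course material.

```python
# Profit for each action under each state of nature (from the scenario).
payoffs = {
    "corn": {"good": 7000, "average": 3000, "bad": 1000},
    "lima beans": {"good": 5000, "average": 3500, "bad": 2000},
}
# Probability of each state of nature (the same for either crop).
probs = {"good": 0.4, "average": 0.5, "bad": 0.1}


def expected_value(action):
    """Probability-weighted average profit for one action."""
    return sum(probs[state] * profit for state, profit in payoffs[action].items())


for crop in payoffs:
    print(crop, round(expected_value(crop), 2))
```

With these numbers corn works out to 4,400 and lima beans to 3,950.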
Kruskal Wallis test assumptions
-SRS -independent groups
Wilcoxon Signed Rank Test assumptions
-dependent random samples -quantitative response
Two way ANOVA assumptions
-equal variance -independence -normally distributed
One-Way ANOVA Assumptions
-equal variance (roughly same std. dev.) -independence -normally distributed response variables
Wilcoxon Rank Sum Test assumptions
-independent random samples -distributions are roughly the same
Friedman's Test for Randomized Block
-used for 3 or more blocks to compare -rank each row across instead of together
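"Rank each row across" can be sketched as follows. The ratings are made up, the helper `ranks` is illustrative, and the statistic shown is the usual Friedman formula without a tie correction.

```python
def ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


# Hypothetical ratings: each row is one block (a reviewer),
# each column is one treatment (a brand).
ratings = [
    [7, 5, 9],
    [6, 4, 8],
    [8, 6, 7],
]

# Rank within each row, not across the whole table.
row_ranks = [ranks(row) for row in ratings]

n = len(ratings)     # number of blocks
k = len(ratings[0])  # number of treatments
rank_sums = [sum(row[j] for row in row_ranks) for j in range(k)]

# Friedman statistic (no correction for ties), compared to chi-square with k - 1 df.
friedman_stat = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
print(rank_sums, round(friedman_stat, 3))
```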
How many times should we sample the data with replacement when we are bagging?
1000 times
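A rough sketch of the resampling step, assuming made-up data. In real bagging a tree would be fitted to each resample; here the sample mean stands in for the model so the example stays short.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

data = [12, 15, 9, 20, 18, 11, 14, 17]  # hypothetical observations
n_resamples = 1000                       # the count given on the card

bootstrap_means = []
for _ in range(n_resamples):
    # Draw len(data) observations with replacement.
    resample = random.choices(data, k=len(data))
    bootstrap_means.append(statistics.mean(resample))

# Bagging aggregates across resamples; the "model" here is just the mean.
bagged_estimate = statistics.mean(bootstrap_means)
print(round(bagged_estimate, 2))
```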
A doughnut shop wants to determine if there is a difference in doughnut sales at different times of the day and for different types of doughnuts. They are open in the morning, afternoon, and night, and offer the following flavors: vanilla, chocolate, red velvet, and marbled. There were a total of 48 sales recorded. The shop conducted a two-way ANOVA test and found an F test statistic for Flavor of 14.87. What would be the numerator degrees of freedom for the F test statistic to determine if the factor flavor was significant?
3
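The answer comes from the number of flavor levels minus one. A small sketch of the full degrees-of-freedom breakdown for a two-way ANOVA with interaction; the function name is illustrative.

```python
def two_way_anova_df(levels_a, levels_b, n_total):
    """Degrees of freedom for a two-way ANOVA with an interaction term."""
    df_a = levels_a - 1
    df_b = levels_b - 1
    df_interaction = df_a * df_b
    df_error = n_total - levels_a * levels_b
    return {"A": df_a, "B": df_b, "interaction": df_interaction, "error": df_error}


# Flavor has 4 levels, time of day has 3 levels, and there are 48 sales.
df = two_way_anova_df(levels_a=4, levels_b=3, n_total=48)
print(df)
```

The four pieces add up to 47, which is the total degrees of freedom (48 - 1).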
A grocery store company wanted to know how well some of their local stores were doing. In order to find out, they hired 4 different reviewers to rate 8 local stores. What are the degrees of freedom?
7
Kruskal-Wallis test
A nonparametric statistical test used to compare three or more unpaired (independent) samples where the outcome is either ordinal or continuous with a skewed distribution.
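A sketch of how the Kruskal-Wallis H statistic is built from pooled ranks, using made-up data with no ties. The helper `ranks` and the variable names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


# Hypothetical responses for three independent groups.
groups = {"A": [27, 31, 35], "B": [40, 44, 47], "C": [52, 58, 61]}

# Rank all observations together, ignoring group labels.
pooled = [x for values in groups.values() for x in values]
pooled_ranks = ranks(pooled)
n_total = len(pooled)

# Add up the pooled ranks within each group.
weighted_sum = 0.0
start = 0
for values in groups.values():
    n_i = len(values)
    rank_sum = sum(pooled_ranks[start:start + n_i])
    weighted_sum += rank_sum ** 2 / n_i
    start += n_i

h_stat = 12 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)
print(round(h_stat, 3))
```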
Wilcoxon Signed Rank Test
A nonparametric statistical test used to compare two paired (dependent) samples where the outcome of interest is ordinal or continuous with a skewed distribution.
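A sketch of the signed-rank calculation on made-up paired data: take the differences, rank their absolute values, then sum the ranks separately for positive and negative differences. The names are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


# Hypothetical paired measurements on the same five subjects.
before = [10, 12, 9, 14, 11]
after = [12, 11, 13, 17, 16]

# Differences within each pair; zero differences are dropped.
diffs = [a - b for a, b in zip(after, before) if a != b]

# Rank the absolute differences, then split the ranks by sign.
abs_ranks = ranks([abs(d) for d in diffs])
w_plus = sum(r for r, d in zip(abs_ranks, diffs) if d > 0)
w_minus = sum(r for r, d in zip(abs_ranks, diffs) if d < 0)
print(w_plus, w_minus)
```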
One way ANOVA null hypothesis
All population means are equal
In One Way ANOVA Tests, which steps could be taken to check the nearly normal condition?
Check the histogram of each group (level). Check the boxplots of each group (level).
In what way is a random forest similar to bagged trees?
Each tree is made by sampling with replacement
Identify the factor and levels in the following scenario: The Mini Donut Factory in Tampa is trying out new frostings for their mini donuts (PB&J, pistachio, orange, and champagne). All of the doughnuts are fried as usual, but the frostings are randomly assigned. Twenty five regular customers are asked to rate the doughnuts from 1-4, with 1 being the best. The doughnuts were delivered to the customers in a random order every time.
Factor = frosting type. Levels = PB&J, pistachio, orange, and champagne
T/F Assuming all assumptions of normality are met, we prefer to use non-parametric tests over parametric tests because non-parametric tests are more powerful.
False
T/F For a time series with simple exponential smoothing, a larger alpha gives a smoother series than a smaller alpha.
False
T/F For a time series for simple moving average, a shorter period will be smoother than a larger period because shorter periods do not react as quickly.
False
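The statement is false because averaging over more points damps fluctuations more. A sketch on a made-up series, where "roughness" (the mean absolute change between consecutive points) is an illustrative measure and not a course definition.

```python
def moving_average(series, period):
    """Simple moving average over a sliding window of the given period."""
    return [sum(series[i:i + period]) / period
            for i in range(len(series) - period + 1)]


def roughness(series):
    """Mean absolute change between consecutive points (lower is smoother)."""
    changes = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(changes) / len(changes)


# Hypothetical sales series.
sales = [20, 26, 23, 17, 22, 30, 27, 19, 24, 31, 28, 21]

short_ma = moving_average(sales, 2)
long_ma = moving_average(sales, 4)
print(round(roughness(short_ma), 3), round(roughness(long_ma), 3))
```

On this series the period-4 average changes far less from point to point than the period-2 average.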
T/F If there is significant interaction, we look at the main effects and also look at multiple comparisons for every level of each factor.
False
T/F The following is a valid null and alternative hypothesis for a Kruskal-Wallis Test. A shoe company wants to know if three groups of workers have different salaries. Ho: at least one salary center is different. Ha: the centers for salary are all the same.
False
For the following scenario, would you utilize a Wilcoxon Signed Rank or Friedman's Rank test? A researcher wanted to test the ratings of three different brands of paper towels. Each brand had 7 reviewers.
Friedman's Rank test
In Friedman's Test for a Randomized Block Design, what is the correct null hypothesis?
Ho: All the medians are equal.
If a modeling method is described as a black box, what does this indicate?
In a black box model, the equation that evolves is not interpretable.
Given the following scenario, would you use the Wilcoxon Rank Sum or the Mann Whitney Test? Determining if there is a difference in ratings in maintenance quality at two TireZone locations. Forty customers were randomly selected from the location on 13th Street, and forty five customers were randomly selected from the Newberry Road location.
Mann-Whitney Test
What type of model is best to use when there is a consistent percentage increase in the data?
Multiplicative regression-based model
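A consistent percentage increase means each period multiplies the previous value by a constant factor, so the log of the series follows a straight line. A sketch on made-up data growing 5% per year; the helper `slope` is a plain least-squares slope written for this example.

```python
import math


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Hypothetical series that grows 5% every year.
years = list(range(8))
sales = [1000 * 1.05 ** t for t in years]

# Fit a line to log(sales); the slope converts back to a growth rate.
log_slope = slope(years, [math.log(y) for y in sales])
growth_rate = math.exp(log_slope) - 1
print(round(growth_rate, 4))
```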
Wilcoxon rank-sum test (Mann-Whitney U test)
Nonparametric statistical test used to compare two unpaired (independent) samples where the outcome of interest is ordinal or continuous and not normally distributed.
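A sketch of the two statistics named on this card, using made-up data with no ties. The rank sum W for the first group and the Mann-Whitney U differ only by a constant, U = W - n1(n1 + 1)/2. The helper `ranks` is illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


# Hypothetical ratings from two independent samples.
group_1 = [3.1, 4.5, 2.8, 5.0]
group_2 = [3.9, 5.6, 6.2, 4.8, 6.0]

# Rank the pooled data, then sum the ranks belonging to group 1.
pooled_ranks = ranks(group_1 + group_2)
n1 = len(group_1)
w = sum(pooled_ranks[:n1])

# Mann-Whitney U for group 1, derived from the rank sum.
u = w - n1 * (n1 + 1) / 2
print(w, u)
```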
One Way ANOVA alternative hypothesis
Not all means are equal
What method is used in neural networks to transform the data?
Propagation - transforming the data into an "S"-shaped curve
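One common S-shaped curve is the logistic (sigmoid) function. The card does not say which curve the course uses, so treat this as an illustration only.

```python
import math


def sigmoid(x):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-x))


# Values rise slowly, then quickly near 0, then level off: an S shape.
for x in (-4, -2, 0, 2, 4):
    print(x, round(sigmoid(x), 3))
```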
nonparametric tests assumptions
SRS
What differentiates single blind experiments from double blind experiments?
Single blind means the participants do not know what treatment they received
Determine if the following is a supervised or unsupervised model. A local restaurant sends a 20% off coupon to its email subscribers. Based on past data, what is the typical amount that customers with certain incomes spend when they use the coupon?
Supervised Regression
An internet research firm gathers data on billions of Google search queries across thousands of categories. The analysts do not need to clean the data, because the data is always perfectly clean and accurate. A typical data set of queries may have a minimum of 100,000 rows of data. The research firm's main revenue source is its consulting operation, in which analysts and associates consult for major firms in many different industries, including transportation, retail, manufacturing, and technology/software. The data consists of text, quantitative variables, categorical variables, video, and others. However, these firms only require data consulting services about once every 5 years, so the demand for the data is slow. Does the scenario above meet the four qualifications of big data?
The data meets the volume and variety requirements, but not the velocity and veracity requirements
Suppose you ran an additive regression-based model of price vs time (years) and the point estimate for change in price per year was 1140. Interpret the slope.
The price tended to increase by 1140 dollars per year.
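In an additive model the slope is a constant dollar change per unit of time. A sketch on made-up prices built to rise by exactly 1140 per year; the starting price of 20000 is an assumption for illustration.

```python
# Hypothetical prices that rise by a fixed 1140 dollars each year.
years = [0, 1, 2, 3, 4, 5]
prices = [20000 + 1140 * t for t in years]

# Least-squares slope of price on year.
x_bar = sum(years) / len(years)
y_bar = sum(prices) / len(prices)
fitted_slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in zip(years, prices))
    / sum((x - x_bar) ** 2 for x in years)
)
print(round(fitted_slope, 2))
```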
When looking at an interaction plot, how can we tell if interaction may be present?
The slopes of the two lines cross or may cross at some point
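A sketch of that check with made-up cell means: compute the slope of each line in the interaction plot and compare them. Equal slopes give parallel lines; unequal slopes mean the lines cross or would cross if extended. The factor names are hypothetical.

```python
# Hypothetical cell means. Each key is one line on the plot (a level of
# factor B); the pair holds the means at the two levels of factor A.
cell_means = {
    "morning": (40, 55),
    "night": (50, 45),
}

# Slope of each line across the two levels of factor A.
slopes = {line: right - left for line, (left, right) in cell_means.items()}

# A nonzero difference in slopes suggests interaction may be present.
interaction_contrast = slopes["morning"] - slopes["night"]
print(slopes, interaction_contrast)
```

A formal conclusion still depends on the interaction p-value from the two-way ANOVA.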
What are the key differences between trends and seasonal components in time series data?
Trends are consistent patterns (either linear or curved) that move in either a negative or positive direction; seasonal components are fluctuations in the data that occur around the same time every period.
Determine if the following are valid null and alternative hypotheses for a Wilcoxon Rank Sum Test. Is there a difference in time for Army versus Navy recruits to finish an obstacle course? Ho: Army and Navy recruits have the same distribution Ha: The distributions are different for Army and Navy recruits
True
T/F Data Mining cannot automatically find beneficial patterns for a business.
True
T/F Interaction is when the effect of one factor depends on the level of the other factor.
True
T/F Kendall's Tau is used to measure monotonicity, which is the degree to which a relationship trends in one direction.
True
In terms of Big Data and Data Mining, what is the correct definition of variety?
Variety refers to the different types of data. Big Data tends to have many different types of data.
What is the correct definition of boosting?
When we fit a small tree model with limited splits, reweight the data depending on how it is classified, fit a new tree, and repeat
Given the following scenario, would you use the Wilcoxon Rank Sum or the Mann Whitney Test? Comparing the ratings of a new social documentary that discussed relations between men and women. There were twenty men and twenty women in the sample.
Wilcoxon Rank Sum Test
What type of test would you utilize for the following scenario? A researcher wanted to test the ratings of two different coffee brands. Each brand had 10 reviewers.
Wilcoxon Signed Rank Test
Imagine you are performing a Two-Way Anova test, and you find a p-value for interaction of 0.2368. What is the correct interpretation of this p-value?
With a p-value greater than all alpha levels, we fail to reject the Ho. Continue on to look at the main effects
Imagine you are performing a Two-Way Anova test, and you find a p-value for interaction of 0.0001. What is the correct interpretation of this p-value?
With a p-value less than all alpha levels, we reject the Ho. Do not look at main effects. Look at multiple comparisons for each level of each factor
A country club wanted to see if it could increase its revenue from its members. Every weekend, they offered half-off drinks from 12pm-2pm and 6pm-8pm. They also record whether or not the members are retired. Is there blocking in this experiment? And, if so, what is the blocking element?
Yes; being retired or not
Two-way ANOVA
a method used to study the effects of two factors on a response variable; tests for interaction between factors
Lucky's Market wanted to see if they could increase the number of sales per day before they closed their locations nationwide. To test this, they randomly chose half of their 80 locations and sent 50% off coupons to their customers only at those locations, while the other half of locations did not receive the 50% off coupons. Which of these answers represents an element of replication in this study? a. 40 locations b. while the other half of locations did not receive the 50% off coupons c. they randomly chose half d. their 80 locations
a. 40 locations
What is the correct null hypothesis for the Kruskal-Wallis Test? a. Ho: The centers for all the groups are the same b. Ho: All of the slope coefficients are equal to zero. c. Ho: At least one of the centers is different d. Ho: All of the population means are equal to zero.
a. Ho: The centers for all the groups are the same
Select the best scenario for when you might want to know specific information about a black box model. a. If you need to determine if a car owner is likely to submit an insurance claim. b. If you need to forecast demand for the next week at a grocery store. c. If you need to predict the number of sales at a restaurant for the next period.
a. If you need to determine if a car owner is likely to submit an insurance claim.
Select the best scenario for when you might NOT want to know the black box model. a. If you need to predict the number of packages shipped through the mail each week. b. If you need to determine if a car owner is likely to submit an insurance claim. c. If you need to determine if a stock is likely to increase or decrease in price. d. If you need to determine if an incarcerated person should be released on good behavior.
a. If you need to predict the number of packages shipped through the mail each week.
A large movie theater chain was interested in seeing if they could increase sales by adding a monthly subscription service that allowed subscribers to see an unlimited amount of movies every month, instead of paying for tickets for each movie. To test this, they randomly selected half of their 400 theaters and introduced the new service, while the other half did not have the service. Which of these answers would be the best way to have control in this study? a. they had a group that opened at the standard pricing structure b. Use a computer to randomly pick half of the theaters c. their 200 locations
a. they had a group that opened at the standard pricing structure
Friedman's Test null hypothesis
all medians are equal
Friedman's Test alternative hypothesis
not all medians are equal (at least one differs)
What is the appropriate next step if there is significant interaction? a. Remove the interaction term from the model. b. Do not look at the main effects, but look at multiple comparisons for each of the levels of each factor c. Continue to look at the main effects
b. Do not look at the main effects, but look at multiple comparisons for each of the levels of each factor
What is the correct null hypothesis for a One way ANOVA test with three different groups? a. Ho: at least one of the population means is different. b. Ho: μ1 =μ2 =μ3 c. Ho: μ1≠μ2≠μ3 d. Ho: ybar1=ybar2=ybar3
b. Ho: μ1 =μ2 =μ3
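The null hypothesis is tested with an F statistic that compares variation between group means to variation within groups. A sketch on made-up data for three groups; the variable names are illustrative.

```python
# Hypothetical responses for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

# F = mean square between / mean square within (MSE).
f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print(round(f_stat, 3))
```

Here the numerator has k - 1 = 2 degrees of freedom and the denominator has 9 - 3 = 6.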
What is the correct definition of interaction? a. Interaction is how correlated two factors are b. Interaction is when the effects of one factor are NOT similar across all levels of the other factor c. Interaction is the difference in p-value for two factors
b. Interaction is when the effects of one factor are NOT similar across all levels of the other factor
For a time series for exponential smoothing, which of the following is the smoothest? a. alpha = 0.7 b. alpha = 0.3 c. alpha = 0.5 d. alpha = 0.9
b. alpha = 0.3
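A sketch comparing the four alpha values on a made-up series. Each smoothed value is alpha times the new observation plus (1 - alpha) times the previous smoothed value, so a low alpha leans more on the past. "Roughness" is an illustrative measure, not a course definition.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing, started at the first observation."""
    smoothed = [series[0]]
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


def roughness(series):
    """Mean absolute change between consecutive points (lower is smoother)."""
    changes = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(changes) / len(changes)


# Hypothetical sales series.
sales = [20, 26, 23, 17, 22, 30, 27, 19, 24, 31, 28, 21]

results = {a: roughness(exponential_smoothing(sales, a)) for a in (0.3, 0.5, 0.7, 0.9)}
smoothest_alpha = min(results, key=results.get)
print(smoothest_alpha)
```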
A large supermarket chain was interested in seeing if they could increase their holiday sales by starting their Black Friday sales 2 hours earlier than normal. To test this, they randomly selected half of their 500 stores and told them to start their sales at the new starting time, while the other half of the stores would start at the standard starting time. They also collected information on what region of the US the stores were in. Which of these answers would be the best way to achieve randomization in this study? a. allow store managers to volunteer b. use a computer program to select the locations that open earlier and those that don't c. have an unbiased person just pick half of the stores to open early
b. use a computer program to select the locations that open earlier and those that don't
What other method does Random forest start out the same as?
bagging
What is the correct alternative hypothesis for the Wilcoxon Rank Sum Test? a. Ha: The population slope coefficient is not equal to zero. b. Ha: the two samples come from the same distribution. c. Ha: one distribution is shifted in location higher or lower than the other. d. Ha: The population means are equal.
c. Ha: one distribution is shifted in location higher or lower than the other.
When examining boxplots, which of the following characteristics would tell us that there was a violation of one of the assumptions for One Way ANOVA? a. The box plots didn't have any outliers. b. The medians were not the same between the plots. c. One of the plots had a range that was three times the other plots. d. The box plots all had the same range.
c. One of the plots had a range that was three times the other plots.
What does it indicate to have interaction in Two-Way ANOVA? a. When the effects of two factors are completely independent of each other. b. When the effects of two factors are too closely correlated. c. The effect of one factor is not similar across all levels of the other factor d. The effect of one variable is similar across all levels of the other factor
c. The effect of one factor is not similar across all levels of the other factor
In which one of the four scenarios would you consider a non-parametric test? a. When we want to look at lag autoregressive models b. When we have time series data c. When we don't have quantitative data, but rather could have survey responses such as "Strongly agree", "Agree", etc. d. When we want to look at Residual plots
c. When we don't have quantitative data, but rather could have survey responses such as "Strongly agree", "Agree", etc.
In which one of the four scenarios would you consider a non-parametric test? a. When we have time series b. When we want to look at lag autoregressive models c. When we might want to replace numeric values with ranks if we are concerned about outliers d. When we want to look at Residual plots
c. When we might want to replace numeric values with ranks if we are concerned about outliers
Determine if the following is a OLAP, Classification, or Regression Model. Dollar General sends out a $5 off coupon to its email subscribers. Based on past behavioral patterns, what region of town has the most customers that use a high number of coupons?
classification model
One-way ANOVA test
compares three or more population means, e.g., the effect of major on salary
What is the correct alternative hypothesis for the Kruskal-Wallis Test? a. Ha: All of the population means are not equal to zero. b. Ha: The centers for all the groups are the same c. Ha: All of the slope coefficients are not equal to zero. d. Ha: At least one of the centers is different.
d. Ha: At least one of the centers is different.
A model handles outliers well, handles many potential predictor variables easily, and sorts through predictor variables repeatedly to divide the data into groups to best determine the response variable.
decision tree
In boosting, does misclassified data get a higher weight or lower weight?
higher
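A simplified sketch of one reweighting step. Real boosting algorithms such as AdaBoost compute the adjustment from the tree's error rate; the fixed `boost_factor` here is an assumption made only to show the direction of the change.

```python
# Five observations start with equal weights.
weights = [0.2] * 5
# Hypothetical result of the first small tree: True means misclassified.
misclassified = [False, True, False, False, True]

boost_factor = 2.0  # illustrative value, not from any specific algorithm

# Increase the weight of misclassified points, then rescale to sum to 1.
raw = [w * boost_factor if miss else w for w, miss in zip(weights, misclassified)]
total = sum(raw)
new_weights = [w / total for w in raw]
print([round(w, 3) for w in new_weights])
```

The next tree is then fitted with these weights, so it concentrates on the points the previous tree got wrong.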
An autoregressive model uses _____________ variables, and is also better at modeling ______________ trends.
lagged, quarterly
In simple exponential smoothing, which is smoother, a high alpha or a low alpha?
low
In the CRISP Model for Data Mining, what can be done after there is a business understanding and a data understanding?
modeling
Kendall's Tau is used to measure
monotonicity
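A sketch of Kendall's tau in its simplest form (tau-a, assuming no ties): count pairs of points that move in the same direction versus opposite directions. The data are made up to show that a curved but always-increasing relationship still gets a tau of 1.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x2 - x1) * (y2 - y1)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # nonlinear, but always increasing
print(kendall_tau(x, y), kendall_tau(x, y[::-1]))
```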
A model is an automatic, flexible, non-linear regression model, not interpretable, and retains all predictor variables although it may limit the effects of some.
neural network
A sales manager at an insurance company wants to determine if there is a significant relationship between the income of customers and the value of the monthly premium that the customers pay. He tells all his coworkers about this endeavor and he only uses certain friends' customer data in the sample. Does this scenario meet the Independence Assumption?
no, the manager was biased in his sampling
The head coach of the Kansas City Chiefs wanted to know how many players visit the team's athletic trainer after practice for extra stretching, medical attention, or any other reason. He wanted to know if reminding the players about the services available affects the number of people that go to the trainer. He worked with his athletic trainer to set up a camera in the training room to track the number of players who visit after practice. After several practices, the head coach and the trainer watched the footage and recorded the number of players. They recorded and analyzed the results. What type of study is this?
observational
MSE will be _________ with a 2-factor ANOVA
smaller
A farmer is trying to decide if he should plant carrots or peas. Depending on what he decides to plant, he will have different probabilities of getting a good, medium, or bad yield. When referring to the crop yield, is this an action or a state of nature?
state of nature
Kruskal Wallis null hypothesis
the pop. distribution of the response variable is the same for all of the groups / the pop. medians are the same for all of the groups
Kruskal Wallis alternative hypothesis
the pop. distribution of the response variable is not the same for all of the groups / the pop. medians are not the same for all of the groups
Wilcoxon Signed Rank Test null hypothesis
the population median of the differences is 0
Wilcoxon Signed Rank Test alternative hypothesis
the population median of the differences is not 0
Wilcoxon rank-sum test (Mann-Whitney U test) alternative hypothesis
the two populations have different medians
Wilcoxon rank-sum test (Mann-Whitney U test) null hypothesis
the two populations have the same median
Given the following situation, determine if the data is rank data. Before Super Bowl LIV, the owner of an official NFL apparel store wanted to see if there was a difference in the value of sales between the Kansas City Chiefs and San Francisco 49ers themed items. He gathered data from the last two weeks, and right before the Super Bowl aired he conducted his hypothesis test. Assume both teams have the same products listed at the same price. He randomly selected thirty purchases of Kansas City items and thirty purchases of San Francisco items. Is this rank data?
this is not ranked data