Business Analytics quiz 1-5
In testing the difference between two independent population means, it is assumed that the level of measurement of the variable is at least _____________. an ordinal scale a nominal scale a ratio scale an interval scale
an interval scale
If 100 megabytes of storage cost a penny, a terabyte of storage would cost $1,000 $1,000,000 $100 $10,000 $100,000
$100
If 100 megabytes of storage cost a penny, a petabyte of storage would cost $1,000,000 $100 $10,000 $1,000 $100,000
$100,000
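The arithmetic behind the two storage-cost cards can be checked with a short sketch. It assumes decimal (SI) units, which the cards do not state explicitly: 1 terabyte = 10^6 megabytes and 1 petabyte = 10^9 megabytes. The function name is illustrative.

```python
# Assumption: decimal (SI) units, 1 TB = 10**6 MB and 1 PB = 10**9 MB.
MB_PER_TB = 10**6
MB_PER_PB = 10**9


def storage_cost_dollars(megabytes: int, cents_per_100_mb: int = 1) -> float:
    """Cost in dollars when every 100 MB costs `cents_per_100_mb` cents."""
    cents = megabytes // 100 * cents_per_100_mb
    return cents / 100


terabyte_cost = storage_cost_dollars(MB_PER_TB)  # 10,000 pennies
petabyte_cost = storage_cost_dollars(MB_PER_PB)  # 10,000,000 pennies
print(terabyte_cost, petabyte_cost)
```

Under these assumptions the terabyte works out to $100 and the petabyte to $100,000, matching the answers on the cards.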
If two variables do not have a strong (linear) relationship, the correlation coefficient between the two variables will be close to -1 +1 0 None of the above
0
If the random variable of x is normally distributed, ______% of all possible observed values of x will be within two standard deviations of the mean 95.44 85.00 99.73 68.26
95.44
According to the empirical rule, the percentage of data that will fall within 3 standard deviations of the average in a normal distribution is approximately 99% 95% 90% 75% 89%
99%
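The percentages quoted for the normal distribution can be reproduced from the error function. This is a minimal sketch using only the standard library; the helper name `normal_coverage` is illustrative. The textbook figures 68.26 and 95.44 come from four-decimal z tables, so the exact values differ in the last digit.

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Prints the coverage as a percentage, rounded to two decimals.
    print(k, round(100 * normal_coverage(k), 2))
```

The three values correspond to the roughly 68%, 95%, and 99.7% of the empirical rule.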
In testing the difference between two independent population means, it is assumed that the level of measurement is at least _____________. a ratio variable a nominal variable an interval variable an ordinal variable
an interval variable
An experiment studies the number of tickets written each day by campus police for illegal parking by The University of Akron students. This variable is Nominal Ordinal Ratio Interval
Ratio
Which activity is performed in the Deployment phase of the CRISP-DM process? Try different analytical techniques Evaluate results Produce project plan Assess whether the model met business objectives or not Review project
Review project
A Stem & Leaf plot is similar to a histogram but is usually a more informative display of relatively small data sets
True
A histogram is only appropriate for variables whose values are numerical and measured on an interval or ratio scale
True
A one-way ANOVA is a method that allows us to estimate and compare the effects of several treatments on a response variable
True
A statistical hypothesis is a statement about the value of a population parameter
True
A two-tailed test is one where the alternative hypothesis (Ha) indicates no direction and uses ≠
True
An outlier is an observation in a data set, which is far removed in value from the others in the data set
True
Big data can generate significant financial value across many companies and industry sectors
True
Big data make companies smart
True
CRISP-DM is an iterative process
True
Data Mining is a complex process requiring various tools and different people
True
Data mining should be viewed as a process
True
Getting data in order is so critical to analytics that most organizations have to undertake substantial data management efforts before they can do a lot of analysis
True
In Multiple Regression analysis, a t-test is used in testing the significance of an individual independent variable
True
In a supervised DM technique, the algorithms learn which values of the target variable are associated with predictor variables
True
Pie and bar charts are used to summarize nominal and ordinal data
True
Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non-bankrupt firms is a supervised technique
True
Quartiles are values that divide a sample of data into four groups containing (as far as possible) equal numbers of observations
True
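As an illustration of the quartile definition, the sketch below uses `statistics.quantiles` from the Python standard library on a made-up sample. The data are hypothetical, and the default "exclusive" method is only one of several quartile conventions, so other software may report slightly different cut points.

```python
import statistics

# Hypothetical sample of 11 observations (not from the quiz).
sample = [7, 3, 11, 1, 9, 5, 2, 10, 4, 8, 6]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(sample, n=4)
print(q1, q2, q3)

# The middle cut point is the median.
print(q2 == statistics.median(sample))
```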
Regression analysis is an example of a supervised learning technique
True
The error term in the regression model describes the effects of all factors other than the independent variables on y (response variable)
True
The mean and median are the same for a normal distribution
True
The residual is the difference between the observed value of the dependent variable and the predicted value of the dependent variable
True
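The residual definition can be made concrete with a small least-squares fit. This is a sketch on made-up data, with the fit written out by hand so that it needs no external library; `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical observations (not from the quiz).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]

# Residual = observed value minus predicted value.
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]
print(residuals)
```

With an intercept in the model, the least-squares residuals sum to zero up to floating-point error.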
The three basic building blocks of business analytics are technology, process, and people
True
The yearly amount of snowfall in Akron over 10 years can be categorized as Ratio data
True
For a fixed sample size, the lower we set alpha, the higher is the __________. Type I error Type II error Random error p-value
Type II error
______________ refers to the accuracy or trustworthiness of the big data Volume Value Veracity Velocity None of the above
Veracity
Business understanding phase includes the following except Determine business goals & objectives Initial assessment of tools and techniques Determine data mining goals & objectives Visualize data
Visualize data
Records that have outlier values should always be removed from a dataset during data preparation and cleaning
False
Which of the following statements about this Pareto chart is false? It is a specialized form of a bar chart. It is a visualization tool for continuous data only. It is based on the "80/20 rule." All of the above statements are false.
It is a visualization tool for continuous data only.
Data has been collected on visitors' viewing habits at a bank's website. Which technique is used to identify pages commonly viewed during the same visit to the website? Classification Association Prediction Clustering
Association
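A rough sketch of the idea behind association analysis: count how often pairs of pages appear in the same visit and express that as a share of all visits (the "support" of the pair). The page names and visits are invented for illustration, and real association-rule mining adds further measures such as confidence and lift.

```python
from collections import Counter
from itertools import combinations

# Hypothetical visits, each recorded as the set of pages viewed.
visits = [
    {"home", "loans", "rates"},
    {"home", "rates"},
    {"loans", "rates", "contact"},
    {"home", "loans", "rates"},
]

pair_counts = Counter()
for pages in visits:
    # Sort so that each pair is always stored in the same order.
    for pair in combinations(sorted(pages), 2):
        pair_counts[pair] += 1

# Support: fraction of visits in which both pages were viewed.
support = {pair: count / len(visits) for pair, count in pair_counts.items()}
print(pair_counts.most_common(3))
```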
An acceptable residual plot exhibits Constant error variance Increasing error variance A curved pattern Decreasing error variance
Constant error variance
Which of the following is not one of the four drivers of business analytics? Desire to analyze big data Desire to optimize business operations Predict new business opportunity Comply with regulatory requirements
Desire to analyze big data
Data was collected on students at a nearby university on their GPA at the end of their freshman year and their living accommodations (dorm, off-campus, and other). For which form of housing is the distribution of GPAs the most symmetric (least skewed)? Dorm Off-campus Other None of the above
Dorm
A box plot can be used to summarize nominal data
False
A model's accuracy rate on the training data set is a better measure of the model's predictive ability than its accuracy rate on the validation data set
False
A negative correlation coefficient (r) implies a weak relationship among the variables
False
A scatterplot can be drawn with a set of two categorical data
False
Analytics help companies to be efficient but not effective
False
Business Intelligence is a subset of Business Analytics
False
CRISP-DM model is dependent on industry sector and technology used
False
CRISP-DM process model consists of five phases
False
For a hypothesis test about a population mean, if the level of significance (alpha) is less than the p-value, the null hypothesis is rejected
False
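The decision rule behind this card, written as a one-line check. The sketch assumes the common textbook convention of rejecting the null hypothesis when the p-value is less than alpha; the function name is illustrative.

```python
def reject_null(p_value: float, alpha: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value is smaller than alpha."""
    return p_value < alpha


print(reject_null(0.01))  # p-value below alpha
print(reject_null(0.20))  # alpha below p-value
```

When alpha is less than the p-value, as in the card's statement, the function returns False and the null hypothesis is not rejected.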
If the random variable of x is normally distributed, 68.26% of all possible observed values of x will be within two standard deviations of the mean
False
In a simple linear regression model, the coefficient of determination (R-Square) not only indicates the strength of the relationship between independent and dependent variables, but also shows whether the relationship is positive or negative
False
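A short sketch of why R-Square cannot show direction: squaring the correlation coefficient discards its sign. The data are made up, and `pearson_r` is an illustrative helper written by hand to avoid external libraries.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one increasing relationship and its mirror image.
x = [1, 2, 3, 4, 5]
y_up = [2.1, 3.9, 6.2, 7.8, 10.1]
y_down = [-value for value in y_up]

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

# Opposite signs, identical R-Square in simple linear regression.
print(r_up, r_down, r_up**2, r_down**2)
```

The same example shows that a negative r can describe a relationship just as strong as a positive one.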
In multiple regression analysis, if the simple correlation coefficient (rxy) between the dependent variable and one of the independent variables is .95, then this indicates that multicollinearity exists
False
Most charts, graphs, and other visualizations would be considered predictive models
False
Predictive analytics involves more mathematically complex models than prescriptive analytics
False
The controller of a chain of toy stores is interested in determining whether there is any difference in the weekly sales of store 1 and store 2. The weekly sales are normally distributed. This problem should be analyzed using one-way ANOVA
False
The final stage of the CRISP-DM data mining process is "Evaluation."
False
The five Vs of Big Data are volume, velocity, volatility, variety, and veracity
False
The three basic building blocks of business analytics are technology, data, and people
False
The variance inflation factor measures the correlation between the dependent variable and the rest of the independent variables in the regression model
False
This is a valid alternative hypothesis: The average weight of desks made on assembly line one is the same as the average weight of desks made on assembly line two
False
This is a valid null hypothesis: The average weight of desks made on assembly line one is different from the average weight of desks made on assembly line two
False
Today, business analytics are data driven rather than business driven
False
Two variables (x1 and y) have a high correlation coefficient (rx1y). Therefore, it can be concluded that changes in x1 cause y to change
False
Unstructured data increases the veracity in the data
False
When the F test is used to test the overall significance of a multiple regression model, if the null hypothesis is rejected, it can be concluded that all of the independent variables X1, X2, X3,... Xk are significantly related to the dependent variable Y
False
When using simple regression analysis, if there is a strong correlation between the independent and dependent variable, then we can conclude that an increase in the value of the independent variable causes an increase in the value of the dependent variable
False
Multicollinearity between independent variables is severe if the variance inflation factor is Greater than 10 Less than 5 Substantially less than 1 Between -1 and +1
Greater than 10
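For intuition, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other independent variables. With only two predictors, that R² is simply the squared correlation between them, which the sketch below uses. The data are invented, and the two-predictor shortcut is an assumption of this example rather than a general method.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x_a, x_b):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x_a, x_b)
    return 1 / (1 - r**2)


# Hypothetical predictors (not from the quiz).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]  # nearly a copy of x1
x3 = [3, 1, 4, 1, 5, 9]              # only loosely related to x1

vif_high = vif_two_predictors(x1, x2)
vif_low = vif_two_predictors(x1, x3)
print(vif_high, vif_low)
```

The nearly duplicated predictor gives a VIF far above 10, while the loosely related one stays below 5.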
Please identify the type of visualization shown in the graphic: Bar Graph Histogram Stem & Leaf Plot Pareto Plot Scatterplot
Histogram
Which of the following questions is typically addressed via a Business Intelligence project? What will be the impact if we acquire a competing product? What is the optimal product mix? Why are we losing 10% of our most valuable customers? In which country did we have the highest sales last quarter?
In which country did we have the highest sales last quarter?
The primary use of stepwise regression is to identify the most significant ___________ that should be included in the multiple regression model Dependent variables dummy variables Independent variables correlated variables
Independent variables
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.) Select the appropriate test/model to determine if there is a relationship between age and household income. ANOVA Linear Regression Chi-Square test/ Contingency model t-test of means
Linear Regression
An important factor in selecting software for word-processing and database management systems is the time required to learn how to use the system. To evaluate three file management systems, a firm designed a test involving word processing operators. Since operator variability was believed to have an impact, each of the operators was trained on all three of the file management systems.

ANALYSIS OF VARIANCE TABLE
SOURCE     SUM OF SQ'S   D.F.   MEAN SQ.   F(D.F./8)   P-VALUE
System          103.33      2      51.67       56.36   0.000019
Operator         64.67      4      16.17       17.63   0.000494
ERROR             7.33      8       0.92
TOTAL           175.33     14

If we are using an alpha value of .05 then we would conclude that A) differences exist among both systems and operators. B) differences exist among systems only. C) differences exist among operators only. D) no differences exist.
A) differences exist among both systems and operators.
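The ANOVA table in this card can be checked with a few lines of arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and each F statistic is a mean square divided by the error mean square. The sketch copies the sums of squares, degrees of freedom, and p-values from the card; because those inputs are rounded, the recomputed F values differ slightly from the printed 56.36 and 17.63.

```python
# (sum of squares, degrees of freedom), copied from the card's table.
table = {
    "System": (103.33, 2),
    "Operator": (64.67, 4),
    "Error": (7.33, 8),
}
# p-values as printed on the card (not recomputed here).
p_values = {"System": 0.000019, "Operator": 0.000494}
alpha = 0.05

mean_sq = {name: ss / df for name, (ss, df) in table.items()}
f_stats = {
    name: mean_sq[name] / mean_sq["Error"] for name in ("System", "Operator")
}
significant = {name: p < alpha for name, p in p_values.items()}

print(mean_sq)
print(f_stats)
print(significant)
```

Both p-values fall below alpha = .05, which is why the card concludes that differences exist among both systems and operators.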
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.) Select the appropriate test/model to analyze differences in mean household income based on the four different levels of marital status. t-test Chi-square test/contingency table Linear regression ANOVA
ANOVA
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.). Select the appropriate test/model to analyze differences in mean household income based on the four different levels of marital status ANOVA T-Test Linear Regression Chi-Square test
ANOVA
Suppose everyone who visits our retail website either gets one of two promotional offers, or no promotion at all. We want to see if making the promotional offers makes a difference in sales. What statistical method would you recommend for this analysis? P-test Z-test ANOVA T-test
ANOVA
CRISP-DM is a hierarchical process model that consists of Phases Generic tasks Specialized tasks Process instances All of the above
All of the above
What is the benefit of running a pilot project during the final phase (Operationalize) of an analytics project? Limit risk Learn about performance constraints Learn what is needed to retrain the model over time All of the above
All of the above
Analytics are not applicable when You have no historical data Variables are difficult to measure There is no time to perform analysis (rapid decision making) Totally unstructured decisions All of the above are true
All of the above are true
___________ is an iterative variable selection procedure that allows an independent variable to be deleted from a multiple regression model during the next iteration Mixed Elimination Forward Elimination Backward Elimination Stepwise regression
Backward Elimination
Which of the following data visualization charts might include "whiskers"? Stem and leaf plot Pareto chart Pie chart Boxplot
Boxplot
Which of the following tasks would NOT be part of the data understanding phase in CRISP-DM? Exploring the data Collecting the data Building the model Describing the data
Building the model
We can test "goodness of fit" or "independence" of categorical variables by using which of the following distributions? A) Binomial distribution B) F-distribution C) Chi-Square distribution D) Normal distribution E) t-distribution
C) Chi-Square distribution
Which of the following statements is not a property of the normal probability distribution? A. The area under the normal curve to the right of the mean is equal to the area under the normal curve to the left of the mean. B. The normal distribution is symmetric. C. 95.44% of all possible observed values of the random variable x are within plus or minus three standard deviations of the population mean. D. The mean, median, and mode are equal.
C. 95.44% of all possible observed values of the random variable x are within plus or minus three standard deviations of the population mean.
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.) For the variable City service satisfaction (scale of 1-5), identify the type of data. Categorical/Ordinal Continuous Categorical/nominal
Categorical/Ordinal
A survey handed out to customers in a local mall asked the following questions: marital status -- including single (S), married (M), widowed (W), divorced (D); annual household income (in $); age (in years); and overall satisfaction with city services (on a scale from 1 to 5 with 1 being poor and 5 being excellent.) For the variable Marital Status (S/M/W/D) identify the data type. Categorical/Ordinal Categorical/nominal Continuous
Categorical/nominal
Accuracy and completeness of data is very important for analytics. The following data - 02/31/2020 is Both complete and accurate Complete but not accurate Not complete but accurate Not complete and not accurate None of the above
Complete but not accurate
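The date on this card has every field filled in (complete) but names a day that does not exist (not accurate). A minimal check with the standard library's `datetime.strptime` makes the distinction concrete; the month/day/year format string is an assumption about how the date is meant to be read.

```python
from datetime import datetime


def is_valid_date(text: str, fmt: str = "%m/%d/%Y") -> bool:
    """Return True when `text` parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for impossible dates such as February 31.
        return False


print(is_valid_date("02/31/2020"))  # February never has 31 days
print(is_valid_date("02/29/2020"))  # 2020 is a leap year
```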
Which is not considered a defining characteristic of Big Data? Volume Processing Complexity Data Structure Data Quality
Data Quality
When describing present-day analytics, all of the following are correct except Fact-based decision-making Centrally managed Entire organization Decision focus Data driven
Data driven
In which CRISP-DM phase is the data partitioned into training and validation data sets? Modeling Evaluation Data preparation Data Understanding Business understanding
Modeling
Data mining models can do the following except Predict sales based on historical data Classify into groups based on characteristics Assign customers to different segments Optimize Taste & Price of a product Associate products that are bought together
Optimize Taste & Price of a product
The data mining project manager meets with the data-warehousing manager to discuss how the data will be collected. Identify the CRISP-DM phase for this task Phase 1 Phase 2 Phase 3 Phase 5 Phase 6
Phase 2
The analysts meet to discuss whether the neural network or decision tree models should be implemented. Identify the CRISP-DM phase for this task Phase 1 Phase 2 Phase 3 Phase 4 Phase 6
Phase 4
WNBA scouts research UA basketball star Rachel Tecca's college scoring history to estimate how much they would blow away other teams if she signed with them. This is an example of Predictive Analytics Prescriptive analytics Descriptive Analytics None of the above
Predictive Analytics
____________________ is a form of analytics which examines data to answer the question "what should be done?" or "what can we do to make XYZ happen?" Descriptive Exploratory Predictive Prescriptive
Prescriptive
Assumptions of a regression model can be evaluated by plotting and analyzing the ________ dependent variables independent variables p values error terms
error terms
If the simple correlation coefficient between two independent variables is greater than .95, then ______________________ is considered to be severe multicollinearity coefficient of determination interaction
multicollinearity
Significant _________ may exist when the overall F-statistic is significant and the individual t statistics for all independent variables are insignificant autocorrelation independence multicollinearity outliers
multicollinearity
How many outlier records appear to be present in this distribution? zero one two unable to determine based on given information
one
In a multiple regression model, the ratio of MSRegression/MSError yields which statistic, used to test the overall model? the F statistic. the wrong statistic. the Chi-Square statistic. the t statistic.
the F statistic.