Data Analytics Exam
Demand for a product and the forecasting department's forecast (naïve model) for a product are shown below. Compute the mean squared error. Period: 1, 2, 3, 4 Actual Demand: 12, 15, 14, 18 Forecasted Demand: --, 12, 15, 16
4.67. MSE = [(15-12)^2 + (14-15)^2 + (18-16)^2] / 3 = 14 / 3 = 4.67
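The arithmetic can be checked with a short Python sketch. The variable names are illustrative, and period 1 is skipped because no forecast is given for it.

```python
# Demand data from the question; None marks the missing period-1 forecast.
actual = [12, 15, 14, 18]
forecast = [None, 12, 15, 16]

# Forecast error = actual - forecast, for periods that have a forecast.
errors = [a - f for a, f in zip(actual, forecast) if f is not None]

# Mean squared error: average of the squared forecast errors.
mse = sum(e ** 2 for e in errors) / len(errors)
print(round(mse, 2))
```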
Which of the following is not present in a time series?
Operational variations
The mean absolute error, mean squared error, and mean absolute percentage error are all methods to measure the accuracy of a forecast. These methods measure forecast accuracy by ____
determining how well a particular forecasting method is able to reproduce the time series data that are already available
The _____ the lift ratio, the _____ the association rule.
higher; stronger
An exponential trend pattern occurs when ______.
the percentage change between periods in the value of the variable is relatively constant
What would be the value of the sum of squares due to regression (SSR) if the total sum of squares (SST) is 25.32 and the sum of squares due to error (SSE) is 6.89?
18.43. The three quantities are related as SST = SSR + SSE, so SSR = SST - SSE = 25.32 - 6.89 = 18.43.
A better understanding of consumer behavior through analytics directly leads to
A better understanding of consumer behavior through analytics leads to the better use of advertising budgets, more effective pricing strategies, improved forecasting of demand, improved product line management, and increased customer satisfaction and loyalty.
A _____ is a graphical summary of data previously summarized in a frequency distribution.
A common graphical presentation of quantitative data is a histogram. This graphical summary can be prepared for data previously summarized in a frequency, a relative frequency, or a percent frequency distribution.
Data dashboards are a type of _____analytics.
Descriptive. Descriptive analytics encompasses the set of techniques that describe what has happened in the past, and data dashboards are one such technique. Big data, by contrast, is simply a set of data that cannot be managed, processed, or analyzed with commonly available software in a reasonable amount of time.
In the graph of the simple linear regression equation, the parameter β0 represents the _____ of the true regression line.
In the graph of the simple linear regression equation, the parameter β0 represents the y-intercept of the true regression line.
Jaccard's coefficient is different from the matching coefficient in that the former _____.
Jaccard's coefficient considers only matches of nonzero entries. It is a measure of similarity between observations consisting solely of binary categorical variables, whereas the matching coefficient also counts matches of zero entries.
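A minimal Python sketch of the difference, using two made-up binary observations (the vectors and names are assumptions for illustration, not from the source):

```python
# Two hypothetical observations described by five binary variables.
a = [1, 0, 0, 1, 0]
b = [1, 0, 0, 0, 0]

pairs = list(zip(a, b))
both_one = sum(1 for x, y in pairs if x == 1 and y == 1)
both_zero = sum(1 for x, y in pairs if x == 0 and y == 0)

# Matching coefficient: all matches (1-1 and 0-0) over all variables.
matching = (both_one + both_zero) / len(pairs)

# Jaccard's coefficient: 1-1 matches over the variables that are not 0-0.
jaccard = both_one / (len(pairs) - both_zero)

print(matching, jaccard)
```

Here the three shared zeros raise the matching coefficient but are ignored by Jaccard's coefficient.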
Single linkage is a measure of calculating dissimilarity between clusters by _____.
Single linkage is a measure of calculating dissimilarity between clusters by considering only the two most similar observations in the two clusters.
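As a rough sketch, single linkage can be computed as the smallest pairwise distance between members of the two clusters. The points below are hypothetical, and Euclidean distance is an assumption; any dissimilarity measure could be used.

```python
import math

# Two hypothetical clusters of 2-D observations.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]

# Single linkage: distance between the two closest (most similar)
# observations, one taken from each cluster.
single_linkage = min(math.dist(p, q) for p in cluster_a for q in cluster_b)
print(single_linkage)
```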
The correlation coefficient will always take values _____.
The correlation coefficient will always take values between -1 and +1.
Assessing the regression model on data other than the sample data that was used to generate the model is known as _____.
cross-validation
In the moving averages method, the order k determines the ____
number of time series values under consideration
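A small sketch of how the order k enters the calculation, reusing the demand values 12, 15, 14, 18 from the first question as sample data. The function name is hypothetical.

```python
def moving_average_forecast(series, k):
    """Forecast the next period as the mean of the k most recent values."""
    if k < 1 or k > len(series):
        raise ValueError("k must be between 1 and the series length")
    return sum(series[-k:]) / k

demand = [12, 15, 14, 18]
forecast_k3 = moving_average_forecast(demand, 3)  # uses 15, 14, 18
forecast_k2 = moving_average_forecast(demand, 2)  # uses 14, 18
print(round(forecast_k3, 2), forecast_k2)
```

A larger k averages over more values and so responds more slowly to recent changes.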
A _____ is an interval estimate of an individual y value, given values of the independent variables.
prediction interval
With reference to exponential forecasting models, a parameter that provides the weight given to the most recent time series value in the calculation of the forecast value is known as the _____
smoothing constant
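One way to see the role of the smoothing constant is a simple exponential smoothing sketch. The value alpha = 0.2, the sample series, and the choice to start the first forecast at the period-1 actual are all assumptions made for illustration.

```python
alpha = 0.2                # smoothing constant (assumed value)
actual = [12, 15, 14, 18]  # sample series, periods 1 to 4

# forecasts[i] is the forecast for period i + 2; the period-2 forecast
# is initialised to the period-1 actual value.
forecasts = [actual[0]]
for y in actual[1:]:
    # New forecast = alpha * latest actual + (1 - alpha) * latest forecast.
    forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])

print([round(f, 3) for f in forecasts])
```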
The decisions concerning an organization's goals and future plans are called _____.
Strategic decisions. Strategic decisions involve higher-level issues concerned with the overall direction of the organization.
A positive forecast error indicates that the forecasting method _____ the dependent variable.
Underestimated. A positive forecast error indicates that the forecasting method underestimated the dependent variable. Forecast error = actual value - forecasted value, so a positive error means that the actual value was larger than the forecast.
A _____ decision involves higher-level issues and is concerned with the overall direction of the organization, defining the overarching goals and aspirations for the organization's future.
A strategic decision involves higher-level issues and is concerned with the overall direction of the organization, defining the overarching goals and aspirations for the organization's future.
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a _____.
A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a dendrogram.
A visual representation of a document or set of documents in which the size of the word is proportional to the frequency with which the word appears is called a _____.
A word cloud is a visual representation of a document or set of documents in which the size of the word is proportional to the frequency with which the word appears.
Which of the following measures of forecast accuracy is susceptible to the problem of positive and negative forecast errors offsetting one another?
Mean forecast error. Because positive and negative forecast errors tend to offset one another, the mean forecast error is not a very useful measure of forecast accuracy.
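The offsetting effect is easy to see with made-up numbers. The errors below are hypothetical and chosen so that they cancel exactly.

```python
# Hypothetical forecast errors (actual - forecast) for four periods.
errors = [4, -4, 3, -3]

# Mean forecast error: signed errors cancel out.
mfe = sum(errors) / len(errors)

# Mean absolute error: signs are removed before averaging.
mae = sum(abs(e) for e in errors) / len(errors)

print(mfe, mae)
```

The mean forecast error of zero suggests a perfect forecast even though every period missed by 3 or 4 units.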
If the forecasted value of the time series variable for period 2 is 22.5 and the actual value observed for period 2 is 25, what is the forecast error in period 2?
Forecast error is the amount by which the forecasted value differs from the observed value. Forecast error = actual value - predicted value. For the given values, the forecast error in period 2 is computed as 25 - 22.5 = 2.5.
_____ is the most critical step of the decision-making process.
Identifying and defining the problem. Step 1, identifying and defining the problem, is the most critical. Only if the problem is well-defined, with clear metrics of success or failure (step 2), can a proper approach for solving the problem (steps 3 and 4) be devised. Decision making concludes with the choice of an alternative (step 5).
Which of the following sources of big data is not publicly available?
Medical records
The difference between the observed value of the dependent variable and the value predicted using the estimated regression equation is known as the____
The difference between the observed value of the dependent variable and the value predicted using the estimated regression equation is known as the residual.
Suppose for a particular week, the forecasted sales were $4,000. The actual sales were $3,000. What is the value of the mean absolute percentage error?
The mean absolute percentage error is |(3,000 - 4,000) / 3,000| × 100 = 33.3%.
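The same calculation as a Python sketch. With a single period, the mean is just that period's absolute percentage error.

```python
actual_sales = [3000]
forecast_sales = [4000]

# Absolute percentage error per period: |(actual - forecast) / actual| * 100.
ape = [abs((a - f) / a) * 100 for a, f in zip(actual_sales, forecast_sales)]

# MAPE is the mean of the absolute percentage errors.
mape = sum(ape) / len(ape)
print(round(mape, 1))
```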
_____ refers to the number of times a collection of items occurs together in a transaction data set.
The number of times that a collection of items occurs together in a transaction data set is known as the support count.
The population parameters that describe the y-intercept and slope of the line relating y and x, respectively, are _____.
The population parameters that describe the y-intercept and slope of the line relating y and x, respectively, are β0 and β1.
The process of converting a word to its stem, or root word, is referred to as _____.
The process of converting a word to its stem, or root word, is referred to as stemming.
When a decision maker is faced with several alternatives and an uncertain set of future events, s/he uses _____ to develop an optimal strategy.
decision analysis
The strength of the association rule is known as _____ and is calculated as the ratio of the confidence of an association rule to the benchmark confidence.
Lift. The strength of the association rule is known as lift and is calculated as the ratio of the confidence of an association rule to the benchmark confidence.
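A minimal sketch of the lift calculation for a rule "if bread, then jelly". The five transactions are invented for illustration. Confidence is taken as support(antecedent and consequent) / support(antecedent), and benchmark confidence as support(consequent) / number of transactions.

```python
# Hypothetical transaction data set.
transactions = [
    {"bread", "jelly"},
    {"bread", "jelly", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"jelly", "milk"},
]

antecedent = {"bread"}
consequent = {"jelly"}

def support_count(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

confidence = support_count(antecedent | consequent) / support_count(antecedent)
benchmark = support_count(consequent) / len(transactions)
lift = confidence / benchmark
print(round(confidence, 3), benchmark, round(lift, 3))
```

A lift above 1 indicates that the antecedent makes the consequent more likely than it is across all transactions.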
Which one of the following is used in predictive analytics?
Linear regression, time series analysis, some data-mining techniques, and simulation, often referred to as risk analysis, all fall under the banner of predictive analytics.
_____ is the dissimilarity measure that is more robust to outliers than Euclidean distance.
Manhattan distance is the distance traveled as if traveled along rectangular city blocks. It measures distance between observations by adding the lengths of perpendicular line segments of any two observations. In contrast, Euclidean distance corresponds to a straight line "as the crow flies" distance between the two observations.
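The two measures can be compared on a pair of hypothetical observations (the coordinates are assumptions chosen to give round numbers):

```python
import math

u = (1.0, 2.0)
v = (4.0, 6.0)

# Manhattan distance: sum of absolute differences along each variable.
manhattan = sum(abs(a - b) for a, b in zip(u, v))

# Euclidean distance: straight-line distance between the observations.
euclidean = math.dist(u, v)

print(manhattan, euclidean)
```

Because Euclidean distance squares each difference, a single extreme value inflates it more than it inflates the Manhattan distance.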
The scatter chart below displays the residuals versus the independent variable, x. Which of the following conclusions can be drawn based upon this scatter chart?
The residual distribution is not normally distributed.
The simplest measure of variability is the _____.
The simplest measure of variability is the range.
The _____ is a measure of the error that results from using the estimated regression equation to predict the values of the dependent variable in the sample.
The sum of squares due to error (SSE) is a measure of the error in using the estimated regression equation to predict the values of the dependent variable in the sample. By contrast, the sum of squares due to regression (SSR) measures how much the predicted y values on the estimated regression line deviate from the mean y value.
Demand for a product and the forecasting department's forecast (naïve model) for a product are shown below. Compute the mean squared error. Period: 1, 2, 3, 4, Actual Demand: 12, 15, 14, 18 Forecasted Demand: --, 12, 15, 16
4.67. MSE = [(15-12)^2 + (14-15)^2 + (18-16)^2] / 3 = 14 / 3 = 4.67
A collection of text documents to be analyzed is called a _____.
A collection of text documents to be analyzed is called a corpus.
A popular measure for weighing terms based on frequency and uniqueness is _____.
A popular measure for weighing terms based on frequency and uniqueness is term frequency times inverse document frequency.
_____ acts as a representative of the population.
A subset of the population is known as a sample and acts as a representative of the population.
____ is used to test the hypothesis that the values of the regression parameters β1, β2, ..., βq are all zero.
An F test. An F test is used to test the hypothesis that the values of the regression parameters β1, β2, ..., βq are all zero.
An analysis of items frequently co-occurring in transactions is known as _____.
An analysis of items frequently co-occurring in transactions is known as market basket analysis.
Which statement is true of an association rule?
An association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets.
Suppose we had a data set from a call center where customers were asked to choose between the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variables, which of the following combinations would yield an entry "customer service"?
An entry of "customer service" would be captured using a dummy variable of 0 for "hear account information," a dummy variable of 0 for "billing questions," and a dummy variable of 1 for "customer service." Therefore, the correct combination is 001.
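The encoding can be sketched in a few lines of Python. The function name is hypothetical, and the option order follows the question.

```python
# Options in the order given in the question.
options = ["hear account information", "billing questions", "customer service"]

def dummy_code(choice):
    """Return one 0-1 dummy variable per option, as a string of digits."""
    return "".join("1" if choice == option else "0" for option in options)

print(dummy_code("customer service"))
```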
____ is a method of calculating dissimilarity between clusters by calculating the distance between the centroids of the two clusters.
Centroid linkage. It uses the averaging concept of cluster centroids to define between-cluster similarity: the similarity between two clusters is defined as the similarity of the centroids of the two clusters.
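A rough sketch of centroid linkage on two hypothetical clusters, assuming Euclidean distance between the centroids:

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

# Two hypothetical clusters of 2-D observations.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (6.0, 0.0)]

# Centroid linkage: distance between the two cluster centroids.
centroid_distance = math.dist(centroid(cluster_a), centroid(cluster_b))
print(centroid(cluster_a), centroid(cluster_b), centroid_distance)
```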
The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is called _____.
Clustering can be employed during the data preparation step to identify variables or observations that can be aggregated or removed from consideration. Cluster analysis is commonly used in marketing to divide consumers into different homogeneous groups, a process known as market segmentation.
The _____ is a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation
Coefficient of determination is a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation.
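As an illustration, the coefficient of determination can be computed as r^2 = SSR / SST. The sketch below reuses SST = 25.32 and SSE = 6.89 from the sums-of-squares question earlier in this set. Pairing those numbers with this card is an assumption made only to have something to compute.

```python
sst = 25.32  # total sum of squares
sse = 6.89   # sum of squares due to error

ssr = sst - sse        # sum of squares due to regression
r_squared = ssr / sst  # proportion of variability in y that is explained

print(round(ssr, 2), round(r_squared, 4))
```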
_____ is the amount by which the predicted value differs from the observed value of the time series variable.
Forecast error
Optimization models can be used to _____.
GE Asset Management uses optimization models to decide how to invest its own cash received from insurance policies and other financial products.
The letter grades (A, B, C, D, F) of business analysis students are recorded by a professor. This variable's classification _____.
If arithmetic operations cannot be performed on the data, they are considered categorical data.
Compute the third quartile for the following data. 10, 15, 17, 21, 25, 12, 16, 11, 13, 22
Quartiles divide data into four parts, with each part containing approximately one-fourth, or 25 percent, of the observations. This can be calculated with the Excel function =QUARTILE.EXC(range,3) = 21.25.
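The same result can be reproduced in Python. This sketch assumes that `statistics.quantiles` with `method="exclusive"` follows the same interpolation rule as Excel's QUARTILE.EXC, which holds for this data set.

```python
import statistics

data = [10, 15, 17, 21, 25, 12, 16, 11, 13, 22]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
print(q3)
```

With n = 10 values, the third quartile sits at position 0.75 × (10 + 1) = 8.25 in the sorted data, a quarter of the way from 21 to 22.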
Scores on Ms. Bond's test have a mean of 70 and a standard deviation of 11. David has a score of 52 on Ms. Bond's test. Scores on Ms. Nash's test have a mean of 64 and a standard deviation of 6. Steven has a score of 52 on Ms. Nash's test. Which student has the higher standardized score?
Rationale: David's standardized score is (52 - 70) / 11 = -1.64 and Steven's standardized score is (52 - 64) / 6 = -2.00. Therefore, David has the higher standardized score.
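A short sketch of the standardized-score comparison (the function name is illustrative):

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

david = z_score(52, 70, 11)  # Ms. Bond's test
steven = z_score(52, 64, 6)  # Ms. Nash's test

print(round(david, 2), round(steven, 2))
```

Both scores are below their class means, and David's is the less negative of the two.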
Suppose for a particular week, the forecasted sales were $4,000. The actual sales were $3,000. What is the value of the mean absolute percentage error?
Rationale: The mean absolute percentage error is |(3,000 - 4,000) / 3,000| × 100 = 33.3%.
Single linkage can be used to measure the distance between clusters that are the _____ in cluster analysis.
Single linkage is a measure of calculating dissimilarity between clusters by considering only the two most similar observations in the two clusters.
Compute the coefficient of variation for the following sample data. 32, 41, 36, 24, 29, 30, 40, 22, 25, 37
The coefficient of variation indicates how large the standard deviation is relative to the mean. The coefficient of variation is (6.75 / 31.6) × 100 = 21.36%.
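A sketch of the calculation using the sample standard deviation (n - 1 in the denominator). Without rounding the standard deviation to 6.75 first, the result comes out at about 21.37% rather than 21.36%.

```python
import statistics

data = [32, 41, 36, 24, 29, 30, 40, 22, 25, 37]

mean = statistics.mean(data)
std_dev = statistics.stdev(data)  # sample standard deviation

# Coefficient of variation: standard deviation as a percentage of the mean.
cv = std_dev / mean * 100
print(round(mean, 1), round(std_dev, 2), round(cv, 2))
```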
Compute the mode for the following data. 12, 16, 19, 10, 12, 11, 21, 12, 21, 10
The mode is the value that occurs most frequently in a data set. The value 12 occurs with the greatest frequency. Therefore, the mode is 12.
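The frequency count can be checked with a small Python sketch:

```python
from collections import Counter

data = [12, 16, 19, 10, 12, 11, 21, 12, 21, 10]

# Count how often each value occurs, then take the most frequent one.
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)
```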
A normally distributed error term with a mean of zero would _____.
The practical implication of normally distributed errors with a mean of zero and a constant variation for any given combination of values of x1, x2, . . . , xq is that the regression estimates are unbiased, possess consistent accuracy, and tend to err in small amounts rather than in large amounts.
The process of dividing text into separate terms is referred to as _____.
The process of dividing text into separate terms is referred to as tokenization.
The strength of a cluster can be measured by comparing the average distance in a cluster to the distance between cluster centroids. One rule of thumb is that the ratio for between-cluster distance to within-cluster distance should exceed what value for useful clusters?
The ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful clusters.
Below is a histogram for the number of days that it took Wyche Accounting to perform audits in the last quarter of last year. What is the relative frequency of the 21-24 bin?
The relative frequency of a bin equals the fraction or proportion of items belonging to a class. Relative frequency of a bin = frequency of the bin /n = 5/20 = 0.25.
Hierarchical clustering using _____ results in a sequence of aggregated clusters that minimizes the loss of information between the individual observation level and the cluster level.
Ward's minimum variance can be used to measure the distance between clusters in cluster analysis.