BSAD


In a study where the least squares estimates were based on 34 sets of sample observations, the total sum of squares and regression sum of squares were found to be: SST = 4.53 and SSR = 4.21. What is the error sum of squares?

0.32 - SSE = SST − SSR = 4.53 − 4.21 = 0.32.

Four investment return distributions are the same, except for skewness (e.g., same mean and standard deviation). Which investment would you choose to increase the likelihood of positive returns?

0.75 - If all four investment distributions are the same with the exception of skewness, you would want to invest in the option that has the largest positive skewness (0.75) because that distribution implies a greater probability of extremely large gains.

In reviewing purchases at Costco on a given Saturday, 525 transactions out of 1,500 included toilet paper, detergent, and clothing or {toilet paper, detergent} => {clothing}. Calculate the support of the association rule.

0.350 - The calculation is support = number of transactions including antecedent and consequent divided by total number of transactions. Support = 525 ÷ 1,500 = 0.350.
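
As a rough illustration, the support calculation above can be sketched in Python. The function and parameter names are assumptions made for this example, not part of the course material.

```python
def support(n_rule_transactions: int, n_total_transactions: int) -> float:
    """Share of all transactions that contain both antecedent and consequent."""
    return n_rule_transactions / n_total_transactions


# Counts from the card: 525 of 1,500 transactions contain
# toilet paper, detergent, and clothing.
rule_support = support(525, 1500)
print(round(rule_support, 3))  # 0.35
```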

Marcus wants to include the month of the year in the analysis as categories. How many dummy variables will be needed?

11 - If a given k categories = 12, then k − 1, or 12 − 1 = 11 dummy variables.

Ann is analyzing a data set that contains two variables, Job Title and 401K. 401K contains the name of the three companies that carry the retirement accounts. It is mandatory to have an account, thus no observation is blank. If 401K was transformed to dummy variables, how many should be created?

2 - The dummy variables would cover the three possible options for the company being used for the 401K funds. Given k categories of a variable, the general rule is to create k − 1 dummy variables, using the last category as reference. For 401k we only need to define two dummy variables (k − 1 = 3 − 1 = 2). Creating a third dummy would create data redundancy.
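
A minimal sketch of the k − 1 rule, assuming the last category in sorted order is used as the reference. The company names are placeholders invented for the example, and `make_dummies` is a hypothetical helper rather than a library function.

```python
def make_dummies(values, reference=None):
    """Return one dict of 0/1 dummy variables per observation (k - 1 dummies)."""
    categories = sorted(set(values))
    if reference is None:
        reference = categories[-1]  # assumption: last category is the reference
    kept = [c for c in categories if c != reference]
    return [{c: int(v == c) for c in kept} for v in values]


# Placeholder names for the three 401K companies.
providers = ["Company A", "Company C", "Company B", "Company C"]
rows = make_dummies(providers)

print(len(rows[0]))  # 2 dummy variables for 3 categories
print(rows[1])       # reference category: every dummy is 0
```

An observation in the reference category is identified by all dummies being 0, which is why a third dummy would be redundant.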

Using the following pruning table, which tree is the minimum error tree?
Level  CP    Num Splits  Rel Error  X Error  X Std Dev
1      0.50  0           1.00       1.17     0.050735
2      0.44  1           0.50       0.73     0.061215
3      0.00  2           0.06       0.10     0.030551

3 - Tree 3 has the lowest cross-validation error (X Error = 0.10).
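
One way to pick the minimum error tree programmatically is sketched below. The values are read from the pruning table in the question; the dictionary keys are names chosen for this example.

```python
# Rows of the pruning table from the question.
pruning_table = [
    {"level": 1, "cp": 0.50, "splits": 0, "rel_error": 1.00, "x_error": 1.17, "x_std": 0.050735},
    {"level": 2, "cp": 0.44, "splits": 1, "rel_error": 0.50, "x_error": 0.73, "x_std": 0.061215},
    {"level": 3, "cp": 0.00, "splits": 2, "rel_error": 0.06, "x_error": 0.10, "x_std": 0.030551},
]

# The minimum error tree is the row with the smallest cross-validation error.
min_error_tree = min(pruning_table, key=lambda row: row["x_error"])
print(min_error_tree["level"])  # 3
```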

Carmen is a professor at a local university. After collecting data on her Introduction to Business course for a year, she wants to calculate the z-score for a student who scores 91 on the final exam. The mean and the standard deviation scores on the exam are 75 and 5, respectively. Calculate the z-score.

3.20 - We use the z-score to find the relative position of an observation by dividing the difference of the observation from the mean by the standard deviation: z = (x − x̄)/s = (91 − 75)/5 = 3.20.
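
The same calculation as a short Python sketch; the function name is an assumption for illustration only.

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


# Exam score of 91 with mean 75 and standard deviation 5.
z = z_score(91, 75, 5)
print(z)  # 3.2
```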

If R2 = 0.37, then how much of the sample variation in y is explained by the sample regression equation?

37%

Four observations were binned into one group. In this group, the values are: 40, 45, 38, and 33. What is the average of the group?

39

If R2 = 0.62, then how much of the sample variation in y is explained by the sample regression equation?

62%

In the following equation ŷ = 30,000 + 4x with given sales (y in dollars) and marketing (x in $500), what does the equation imply?

An increase of one unit in marketing is associated with an increase of $2,000 in sales.

Amazon uses searches and items purchased to create future product marketing recommendations. Additionally, demographics drive additional potential products to be recommended. To do this, what type of market basket analysis is used?

Association Rule

Which option is not one of the three common strategies used in creating ensemble models? Options: bagging; boosting; bootstrapping; random forest.

Bootstrapping

When a target variable is categorical, the CART algorithm produces a __________ tree to predict the class memberships of new cases.

Classification

To what cluster should Record 2 be assigned, given the following distances to the cluster centroids?
Record ID  Dist.Cluster-1  Dist.Cluster-2  Dist.Cluster-3  Dist.Cluster-4
1          4.0503384       3.830521        1.793345        4.166309
2          1.9593924       4.235910        3.394509        3.899235
3          3.8503278       1.230822        2.977934        2.781135

Cluster 1

To what cluster should Record 3 be assigned, given the following distances to the cluster centroids?
Record ID  Dist.Cluster-1  Dist.Cluster-2  Dist.Cluster-3  Dist.Cluster-4
1          4.0503384       3.830521        1.793345        4.166309
2          1.9593924       4.235910        3.394509        3.899235
3          3.8503278       1.230822        2.977934        2.781135

Cluster 2
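
Assigning each record to the cluster with the smallest centroid distance can be sketched as follows. The distances are those given in the question; `nearest_cluster` is a helper name invented for this example.

```python
# Distances from each record to the centroids of clusters 1 through 4.
distances = {
    1: [4.0503384, 3.830521, 1.793345, 4.166309],
    2: [1.9593924, 4.235910, 3.394509, 3.899235],
    3: [3.8503278, 1.230822, 2.977934, 2.781135],
}


def nearest_cluster(dists):
    """Return the 1-based number of the cluster with the smallest distance."""
    return dists.index(min(dists)) + 1


assignments = {record: nearest_cluster(d) for record, d in distances.items()}
print(assignments)  # {1: 3, 2: 1, 3: 2}
```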

The following results are a subset of a study on the demographics of a city population. Participants were asked to respond if male (1) or female (0), current annual salary, and if they were raised in a suburb (1) or in a city (0). Based on the hierarchical clustering results, which of the following is not a valid observation that can be made?
Cluster     Salary  Gender  City
1 (N = 21)  58,980  0.4782  0.4691
2 (N = 68)  60,873  0.4102  0.568
3 (N = 41)  72,390  0.5142  0.2351
4 (N = 34)  82,080  0.6501  0.5041

Cluster 4 has the highest average salary, with 65% of participants being female.

Of the following options, which is not accurate for clustering?
Euclidean distance or Manhattan distance measures for numerical variables and matching.
AGNES takes each observation in the data initially and forms its own cluster.
Hierarchical clustering commonly follows agglomerative and divisive clustering.
Cluster analysis is where small amounts of data are organized against larger statistical sets.

Cluster analysis is where small amounts of data are organized against larger statistical sets.

When using R for Agglomerative Clustering, the plot function is used to create the dendrogram as well as a banner plot. What function is used to split these results into distinct clusters?

cutree

The primary purpose of a(n) _____________ is to support decision-making and provide a composite view of the organization.

Data Warehouse

Which term represents data items, events, or things stored in a database file?

Entity

Under the association rule, a lift ratio between 0 and 1 indicates a positive association.

False - A lift ratio between 0 and 1 indicates a negative association. The lift ratio must be greater than 1 for a positive association to exist.
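
A small sketch of how a lift ratio is commonly computed: the confidence of the rule divided by the support of the consequent. The transaction counts below are hypothetical and chosen only to illustrate a lift above 1.

```python
def lift_ratio(n_total, n_antecedent, n_consequent, n_both):
    """Confidence of the rule divided by the overall share of the consequent."""
    confidence = n_both / n_antecedent            # P(consequent | antecedent)
    expected_confidence = n_consequent / n_total  # P(consequent)
    return confidence / expected_confidence


# Hypothetical counts, not taken from the study set.
lift = lift_ratio(n_total=1000, n_antecedent=200, n_consequent=400, n_both=120)
print("positive association" if lift > 1 else "negative or no association")
```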

Boosting is an ensemble modeling strategy that uses the bootstrap aggregation technique to create multiple training data sets by repeatedly sampling the original data with replacement.

False - Bagging is an ensemble modeling strategy that uses the bootstrap aggregation technique to create multiple training data sets by repeatedly sampling the original data with replacement. Boosting is an ensemble modeling strategy that forces the model to pay more attention to cases that are misclassified or have large prediction errors in previous trees through a weighted sampling process.
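
The sampling step of bagging can be sketched with the standard library, assuming a plain list of training cases. The data and the helper name `bootstrap_samples` are made up for illustration, and fitting a model on each sample is left out.

```python
import random


def bootstrap_samples(data, n_samples, seed=0):
    """Draw n_samples training sets, each sampled with replacement from data."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.choices(data, k=len(data)) for _ in range(n_samples)]


original = list(range(1, 11))  # hypothetical training cases 1..10
samples = bootstrap_samples(original, n_samples=3)

for sample in samples:
    # Each sample has the original size; some cases repeat and some are left out.
    print(sorted(sample))
```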

Converting observations into z-scores is also called doubling the observation.

False - Converting observations into z-scores is also called standardizing the observation.

Changing an individual's date of birth to age, combining height and weight to create body mass index, calculating percentages, or converting values to natural logarithms are examples of transforming categorical data.

False - Each of these examples converts data into numerical data values as opposed to transforming them into categories.

For interval estimates for the response variable y, the prediction interval is narrower than the confidence interval.

False - For interval estimates for the response variable y, the prediction interval is wider than the confidence interval because of the added uncertainty in predicting the individual value of y.

Simple mean imputation is the best route for replacing large quantities of missing values within a data set without distorting the relationship among variables.

False - If the number of missing values is relatively small, then simple mean imputation fills in the observations without biasing the results. However, in large quantities, simple mean imputation will distort the data, leading to biased results.

In a scatter plot diagram, if there is no discernable pattern, then there is a positive relationship between the numerical variables.

False - If there is no discernable pattern, then there is no relationship between the variables.

A pure subset contains leaf nodes where cases have contradicting values to the target variable, to enhance the variable case outcomes and allow for further splits.

False - Pure subsets contain leaf nodes that contain the same value of the target variable. There is no need to further split pure subsets.

R2 in linear regression is the correlation coefficient.

False - R2 in linear regression is the coefficient of determination, which is the proportion of the sample variation in the response variable that is explained by the sample regression equation. The correlation coefficient measures the strength and direction of the linear relationship between two variables.

Regression analysis captures the relationship between only two distinct variables.

False - Regression analysis captures the relationship between 2 or more variables.

If-Then logical statements are constructed with the If portion being the consequent and the Then being the antecedent.

False - The If portion is the antecedent and the Then portion is the consequent.

Because clustering is essentially an unsupervised technique for data exploration, the appropriate technique would be the one that makes the fewest clusters.

False - The ability of a clustering method to discover useful hidden patterns of the data depends on how it is implemented. Because clustering is essentially an unsupervised technique for data exploration, the appropriate technique would be the one that makes the most sense conceptually, not simply the one with the fewest clusters.

In data sets that contain outliers, the arithmetic mean is used as the measure of the central location.

False - The arithmetic mean is the primary measure of the central location. However, the median is used when the mean can be misleading due to outliers.

In a decision tree, the recursive process of partitions continues and only terminates when the Gini index reaches 0.5.

False - The process continues until all partitions become pure subsets (Gini index of 0).

Using the following transactions, what is the frequency distribution?
Transaction  Items
001          Latte, Scone, Muffin
002          Coffee, Muffin
003          Espresso, Egg, Fruit Cup
004          Coffee, Egg, Muffin
005          Scone, Latte, Muffin
006          Latte, Scone, Fruit Cup
007          Latte, Muffin, Egg
008          Coffee, Muffin, Fruit Cup
009          Espresso, Scone, Cookie
010          Latte, Muffin, Cookie

Latte-5; Scone-4; Muffin-7; Egg-3; Espresso-2; Coffee-3; Fruit Cup-3; Cookie-2
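
The item counts above can be checked with a short sketch using `collections.Counter`; the transactions are copied from the question.

```python
from collections import Counter

# Transactions 001 through 010 from the question.
transactions = [
    ["Latte", "Scone", "Muffin"],
    ["Coffee", "Muffin"],
    ["Espresso", "Egg", "Fruit Cup"],
    ["Coffee", "Egg", "Muffin"],
    ["Scone", "Latte", "Muffin"],
    ["Latte", "Scone", "Fruit Cup"],
    ["Latte", "Muffin", "Egg"],
    ["Coffee", "Muffin", "Fruit Cup"],
    ["Espresso", "Scone", "Cookie"],
    ["Latte", "Muffin", "Cookie"],
]

# Count how many transactions contain each item.
item_counts = Counter(item for basket in transactions for item in basket)
print(item_counts["Muffin"], item_counts["Latte"], item_counts["Scone"])  # 7 5 4
```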

Using the following transactions, what is the frequency distribution?
Transaction  Items
001          Latte, Scone, Muffin
002          Coffee, Scone
003          Espresso, Egg, Fruit Cup
004          Scone, Egg, Muffin
005          Scone, Latte, Muffin
006          Latte, Scone, Fruit Cup
007          Latte, Muffin, Egg
008          Coffee, Muffin, Fruit Cup
009          Espresso, Scone, Cookie
010          Latte, Muffin, Cookie

Latte-5; Scone-6; Muffin-6; Egg-3; Espresso-2; Coffee-2; Fruit Cup-3; Cookie-2

When using the CART algorithm, the Gini index is used in the classification tree; in a regression tree, however, __________ is used to measure impurity.

Mean squared error

In the presence of outliers in a data set (extremely small or large values), it is preferred to use the _____________ instead of the _____________ to impute missing values.

Median; Mean
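
To see why the median is preferred when outliers are present, here is a minimal sketch with hypothetical values, one of which is an extreme outlier. `impute` is a helper written for this example, not a library function.

```python
from statistics import mean, median


def impute(values, strategy):
    """Replace None entries with the median or the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical data: 400 is an outlier, None marks a missing value.
values = [38, 40, None, 45, 400]

print(impute(values, "median"))  # missing value becomes 42.5
print(impute(values, "mean"))    # missing value becomes 130.75
```

The mean-based fill is pulled far above the typical observations by the single outlier, while the median-based fill stays close to them.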

Which tree is the least complex and contains the smallest validation error? Options: best-pruned tree; full-grown tree; minimum error tree; categorical tree.

Minimum Error Tree

When using R for Agglomerative Clustering, the cutree function is used to split results into distinct clusters. What function is used to create the dendrogram as well as a banner plot?

plot

If SST = 5,000 and SSE = 625, then the coefficient of determination is

R2 = 1 − SSE/SST = 1 − 625/5,000 = 0.875, or about 0.88.

When performing an analysis, one technique is called RFM. Which of the following is not reflective of RFM? Options: recency; frequency; monetary; relevancy.

Relevancy

If SSE = 200 and SSR = 300, then the coefficient of determination is

SST = SSR + SSE = 300 + 200 = 500; R2 = SSR/SST = 300/500 = 0.60.
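
A short sketch of this calculation, assuming only SSR and SSE are known. The function name is chosen for this example.

```python
def r_squared(ssr: float, sse: float) -> float:
    """Coefficient of determination from explained and unexplained variation."""
    sst = ssr + sse  # total variation = explained + unexplained
    return ssr / sst


# Values from the card: SSR = 300, SSE = 200.
r2 = r_squared(ssr=300, sse=200)
print(r2)  # 0.6
```

Because SST = SSR + SSE, the same value follows from 1 − SSE/SST = 1 − 200/500.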

If the correlation coefficient is computed to be −0.85, this means the relationship between the two variables is __________.

Strong and Negative

The correlation coefficient for rent and square footage is computed to be 0.84. This means the relationship between the two variables is __________.

Strong and Positive

What data preparation technique is Maeve using when she extracts a payroll data set into two separate files, one for hourly employees and one for salary employees?

Subsetting

Which method uses the farthest distance between a pair of observations that do not belong to the same cluster?

The Complete Linkage Method
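
Complete linkage can be sketched as the largest pairwise distance between observations in two different clusters. Euclidean distance is assumed here, and the points are hypothetical.

```python
from itertools import product
from math import dist


def complete_linkage(cluster_a, cluster_b):
    """Farthest Euclidean distance between a point in cluster_a and one in cluster_b."""
    return max(dist(p, q) for p, q in product(cluster_a, cluster_b))


# Hypothetical two-dimensional observations.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (5, 0)]

print(complete_linkage(cluster_a, cluster_b))  # 5.0
```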

Which description best fits the following tree structure for loan debt balance with a single age predictor?

The average loan debt balances of the two partitions are $42,964 and $32,980, respectively, when Age = 35.

The standard deviation of midterm scores and the final exam are 12.0 and 10.0, respectively. Which of the two exams is riskier and why?

The midterm exam is riskier because the standard deviation is higher.

Which of the chapter recommended guidelines is violated in the graph below?

The simplest graph should be used for a given set of data. Axis should be clearly marked and labeled. On a bar chart or histogram, bar widths should always be consistent as differing widths may create distortions. Be mindful of the upper limits on the vertical axis to prevent compression, hiding variant details.

A dummy variable takes on a value of 1 or 0 to describe two categories of a variable.

True

A line chart displays the numerical variable of a series of data points connected by a line.

True

Converting data from one structure to another is called data transformation.

True

Decision trees produced by the CART algorithm are binary, meaning that there are two branches for each decision node.

True

In understanding the association rules, it is best to think of them as an If-Then statement.

True

The process of retrieving, cleaning, integrating, transforming, and enriching data to support analysis is called data wrangling.

True

The strategy of removing observations with missing data is called omission.

True

To view only a portion of the data that is of interest, subsetting is used.

True

A subset with the highest degree of impurity is when a 50% and 50% split occur between classes.

True - When half the cases belong to one and the other half to the other, the subset is considered to have the highest degree of impurity, meaning the two classes are not separated as well as they could be. In comparison, a "pure" subset happens when 100% of the cases belong to one class and 0% to the other class.
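
The two extremes described above can be checked with the usual Gini index formula, 1 minus the sum of squared class proportions. This is a minimal sketch with made-up class labels.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Hypothetical subsets of a two-class target variable.
print(gini_index(["yes", "no", "yes", "no"]))  # 0.5, the most impure two-class split
print(gini_index(["yes", "yes", "yes"]))       # 0.0, a pure subset
```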

In a boxplot, if the median is in the center of the box and the left and right whiskers are equidistant from their respective quartiles, a symmetrical shape is implied.

True - A boxplot is also used to informally gauge the shape of the distribution. Symmetry is implied if the median is in the center of the box and the left and right whiskers are equidistant from their respective quartiles. If not, skewness is implied.

Constructing a contingency table allows for a clear visualization of the relationship between two categorical variables.

True - A contingency table allows for the two variables to be easily viewed and analyzed for relationships.

If approximately p percent of the observations have values less than the pth percentile, then approximately (100 − p) percent of the observations have values greater than the pth percentile.

True - A percentile is a measure of relative location that splits a data set into two parts: p and 100 − p, where p = the percent of observations in the data set.

A regression model treats all predictor variables as numerical, where observations of a categorical variable are first converted into numerical values.

True - A regression model treats all predictor variables as numerical, where observations of a categorical variable are first converted into numerical values. Recall from Chapter 1 that the observations of a numerical variable represent meaningful numbers, whereas the observations of a categorical variable represent different categories.

A scatterplot with a categorical variable allows for the dynamic view of the addition of a categorical variable to the numeric plot points adding an additional layer of visible detail.

True - By adding a categorical variable to the numeric scatterplot, an additional layer of data detail is visible.

When working with numerical variables, the frequency distribution is equal to the number of observations that falls into each interval.

True - The frequency distribution for a numerical variable is determined by the number of observations that fall into each predetermined interval.

The response variable is the outcome of a variable, whereas the predictor is the input variable(s).

True - The outcome of a variable, called the response variable, is related to one or more other input variables, called the predictor variables.

The total sum of squares (SST) can be broken into two: explained variation and unexplained variation.

True - The total sum of squares, SST, can be broken down into two components: explained variation and unexplained variation.

Bill wants to calculate the width of each interval by using the approximation formula. He first created a frequency distribution with 5 intervals. The minimum and maximum for the variable are −32.4 and 65.56, respectively. Calculate the width of each interval for Bill.

Using the approximation formula, the interval is calculated as (65.56 − (−32.40)) ÷ 5 = 19.592.
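
The approximation formula, (maximum − minimum) ÷ number of intervals, as a short sketch. The function name is an assumption, and the result is rounded only to hide floating-point noise.

```python
def interval_width(minimum: float, maximum: float, n_intervals: int) -> float:
    """Approximate width of each interval in a frequency distribution."""
    return (maximum - minimum) / n_intervals


# Values from the card: minimum −32.4, maximum 65.56, 5 intervals.
width = interval_width(-32.4, 65.56, 5)
print(round(width, 3))  # 19.592
```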

Based on goodness-of-fit measures, which is the preferred model based on the results below?

Using the lower standard error of the estimate, Model 3 is the best choice. Using the highest value of adjusted R2, Model 3 is the best choice. Even though the coefficient of determination for Model 3 is also the highest, it cannot be used to determine goodness-of-fit unless all three models have the same number of predictor variables, which is unknown in this example.

Based on the following values for income, what are the possible split points? {13,465, 14,432, 28,763, 34,876, 41,967, 52,997}

{13948.5, 21597.5, 31819.5, 38421.5, 47482}

Based on the following sorted 20 values for age, what are the possible split points? {20, 22, 24, 26, 28, 31, 33, 34, 35, 40, 42, 43, 46, 47, 49, 50, 52, 54, 55, 57}

{21, 23, 25, 27, 29.5, 32, 33.5, 34.5, 37.5, 41, 42.5, 44.5, 46.5, 48, 49.5, 51, 53, 54.5, 56}
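
The possible split points in the two cards above are the midpoints between consecutive sorted values. A minimal sketch, with a helper name chosen for illustration:

```python
def split_points(values):
    """Midpoints between consecutive distinct sorted values."""
    ordered = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]


incomes = [13465, 14432, 28763, 34876, 41967, 52997]
ages = [20, 22, 24, 26, 28, 31, 33, 34, 35, 40,
        42, 43, 46, 47, 49, 50, 52, 54, 55, 57]

print(split_points(incomes))     # [13948.5, 21597.5, 31819.5, 38421.5, 47482.0]
print(len(split_points(ages)))   # 19 split points for 20 distinct values
```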
