INFS 343 NING YANG EXAM STUDY GUIDE


When a target variable is categorical, the CART algorithm produces a __________ tree to predict the class memberships of new cases. A. classification B. regression C. minimum D. pruned

A

Which of the following statements is correct? A. CART stands for classification and regression tree. B. Regression tree predicts the class membership of future cases. C. Classification tree predicts the value of the outcome variable of future cases. D. CART algorithm cannot predict binary outcome variables.

A

Which tree is the least complex and contains the smallest validation error? A. best-pruned tree B. full-grown tree C. minimum error tree D. categorical tree

A

Please match the measurement scales A. is categorical. Attributes can be ordered. It reflects labels or names. B. is the most sophisticated. It has a true zero point AND reflects the absence of characteristics. C. is numerical. The distance is meaningful. Zero value is arbitrary and does not reflect the absence of characteristics. D. is the least sophisticated. It is categorical. Values differ by labels or names.

A = Ordinal B = Ratio C = Interval D = Nominal

When examining the relationship between two numerical variables, a scatter plot is a simple, yet useful, graphical tool. What does each point in a scatter plot represent? A. Two x-axis comparisons B. A paired observation with one x-axis point and one y-axis point. (x1, y1) C. Multiple paired observations such as (x1, x2), (y2, y3) D. An unpaired observation but two y-axis comparisons

B

When constructing a histogram, we typically mark off the interval limits along the horizontal axis. What does the height of each bar represent? Choose all that are correct responses. A. The type of response B. Relative frequency C. Frequency of each interval D. Number of intervals

B and C

Please match the definitions with their descriptions. A. more decision rules applied B. top node of tree, first available to which a split is applied C. classification or predictions given, no more splitting. Root node: Leaf/terminal node: Interior node:

B, C, A

Which of the following are reasons for data professionals to learn data wrangling skills? Select all that apply! A. Analytics professionals are superior to all other IT professionals B. Organizations will be able to make decisions more rapidly C. Analytics professionals need broader skill sets than data mining techniques D. Analytics professionals can no longer rely on the IT department to provide data

B, C, and D

Using the simple mean imputation strategy, what value would be placed in the missing observation in x1? X1: 73 79 91 ? 100 A. No idea B. 86 C. 84 D. 69

B
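
A quick check of the arithmetic, as a minimal sketch: simple mean imputation replaces the missing entry with the mean of the observed values. The variable names are illustrative, and `None` is just one assumed way to mark the gap.

```python
# Observed values of x1 from the question; None marks the missing entry.
x1 = [73, 79, 91, None, 100]

# Mean of the values that are actually present: 343 / 4.
observed = [v for v in x1 if v is not None]
mean_value = sum(observed) / len(observed)

# Replace the missing entry with that mean.
imputed = [mean_value if v is None else v for v in x1]

print(mean_value)         # 85.75
print(round(mean_value))  # 86, which matches option B
```

The exact mean is 85.75, so option B (86) is the rounded value.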

Amazon uses searches and items purchased to create future product marketing recommendations. Additionally, demographics drive additional potential products to be recommended. To do this, what type of market basket analysis is used? A. Information rule B. Supervised data analysis C. Association rule D. L-mean

C

Which statement is not correct regarding cluster analysis? A. Observations are similar within a cluster, dissimilar across clusters. B. Useful exploratory analysis by summarizing a large number of observations in a data set into a small number of clusters. C. Cluster characteristics or profiles help us understand and describe the different groups. D. The fewer clusters it generates, the better the model is.

D

A compilation of facts, figures or other contents, both numerical and nonnumerical?

Data

____________ includes the question, "how is data shared with third parties?"

Data Privacy

Organizing, summarizing, tabulating, and visualizing data.

Descriptive analytics

T or F: Data warehouse can be designed to support the marketing department for analyzing customer behaviors, and it contains only the data relevant to such analyses.

F

T or F: If-Then logical statements are constructed with the If portion being the consequent and the Then being the antecedent.

F

T or F: Most experts agree that only about 5% of all data used in business decisions are structured data.

F

T or F: Social media data such as Twitter, YouTube, Facebook, and blogs are examples of structured data.

F

T or F: There are four types of hierarchical cluster analysis, AGNES, DIANA, bottom-up, and top-down.

F

T or F: A pure subset contains leaf nodes where cases have contradicting values to the target variable, to enhance the variable case outcomes and allow for further splits.

F

Business analytics is the analysis of data by using tools and techniques to: Select all that apply! A. gain insight from the data B. develop actionable decisions C. improve business performance D. stay professional

Gain insight from the data, develop actionable decisions, and improve business performance.

Data _____________ is a process that an organization uses to acquire, organize, store, manipulate, and distribute data.

Management

Gender is an example of which measurement scale?

Nominal

A _____________ consists of all observations or items of interest in an analysis.

Population

___________ ___________ is the use of the algorithms that allow the computer to identify complex processes and patterns without any specific guidance from the analyst.

Unsupervised learning

Which of the following "V's" are considered to be the 3 V's that characterize big data? Select all that apply! A. Valuable B. Velocity C. Variety D. Volume

Velocity, Variety, and Volume

T or F: The strategy of removing observations with missing data is called omission.

T

T or F: In order to analyze trends, we often transform raw data values into percentages.

T

T or F: In understanding the association rules, it is best to think of them as an If-Then statement.

T

T or F: One of the biggest advantages of R is that it is free.

T

T or F: Supervised data mining, unlike unsupervised data mining, has predefined target variable or outcome variables.

T

T or F: Transformation of date values is often performed to help bring useful information out of the data.

T

T or F: z-score measures the relative location of an observation and indicates whether it is an outlier.

T

T or F: The correlation coefficient describes both the direction and the strength of the linear relationship between x and y.

T
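
To see how one number carries both direction (the sign) and strength (the magnitude), here is a minimal sketch of the Pearson correlation coefficient. The data are hypothetical and not from the course; `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical data: y rises with x in one case and falls in the other.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

r_up = pearson_r(x, rising)     # perfectly linear, positive direction
r_down = pearson_r(x, falling)  # perfectly linear, negative direction
print(r_up, r_down)
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 a weak one.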

In reviewing purchases at Costco on a given Saturday, 815 transactions out of 2,200 included toilet paper, detergent, and clothing, which corresponds to the rule (toilet paper, detergent) => (clothing). Calculate the support of the association rule.

0.3705
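
The support of a rule is the share of all transactions that contain every item in the rule. A minimal sketch of the calculation, with `support` as an illustrative helper name:

```python
def support(itemset_count: int, total_transactions: int) -> float:
    """Fraction of all transactions that contain every item in the rule."""
    return itemset_count / total_transactions

# 815 of the 2,200 transactions contain toilet paper, detergent, and clothing.
rule_support = support(815, 2200)
print(round(rule_support, 4))  # 0.3705
```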

A large lecture class has 280 students. The professor has announced that the mean score on an exam is 74 with a standard deviation of 8. The distribution of scores is bell-shaped. How many standard deviations above the mean would a score of 90 be?

2

A percentile is technically a measure of location; however, it is also used as a measure of relative position because it is so easy to interpret. If you know that your raw score corresponds to the 75th percentile, then approximately what percentage of students had scores lower than yours?

75%

In R, we use the _____ function to check the data type of a variable. A. class() B. check() C. type() D. No idea.

A

Examples of transforming numerical data include transforming: Select all that apply! A. Individual's date of birth to age B. Calculating Percentages C. Combining height and weight to create body mass index D. There is no need to transform data

A, B, and C

What are the three common strategies when creating ensemble models? A. bagging B. boosting C. bootstrapping D. random forest

A, B, and D

CART is built using partitioned data sets: Select all that apply! A. training data set B. exercising data set C. validation data set D. test data set

A, C, D

Subsetting can also be used to eliminate unwanted data such as: Select all that apply! A. Observations that contain low-quality data B. Observations that are in sets of data C. Observations that contain missing values D. Observations that contain outliers

A, C, and D

The mean and the standard deviation of scores on an accounting exam are 74 and 8, respectively. The mean and the standard deviation of scores on a marketing exam are 78 and 10, respectively. Find the z-scores for a student who scores 90 in both classes.

Accounting: z = (90 − 74)/8 = 2. Marketing: z = (90 − 78)/10 = 1.2.
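
Both values come from z = (x − mean) / standard deviation. A minimal sketch using the numbers in the question, with `z_score` as an illustrative helper name:

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# The same raw score of 90 in two classes with different distributions.
accounting = z_score(90, mean=74, std_dev=8)
marketing = z_score(90, mean=78, std_dev=10)

print(accounting)  # 2.0
print(marketing)   # 1.2
```

The same raw score sits further above the mean in accounting (2 standard deviations) than in marketing (1.2), so it is the relatively stronger result.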

Recall that we use nominal and ordinal measurement scales to represent categorical variables. Which of the examples below represent a nominal scale representation of a categorical variable? A. Performance of a manager (excellent, good, fair, poor). B. Profit and inventory level of a distribution center C. Marital status (single, married, widowed, divorced, separated) D. The temperature of the resort location

C

What is not an advantage of a decision tree model? A. It is easy to interpret. B. Output is displayed as one or more upside-down trees. C. It can only deal with categorical target variables. D. It represents a set of If-Then rules.

C

Which of the following is a common graphical method that allows us to determine whether two numerical variables are related in some systematic way? A. Stacked column chart B. Pie chart C. Scatter plot D. Contingency table

C

Which two of the following are common measures of shape? A. Range B. MAD or the Mean absolute deviation C. Skewness coefficient D. Kurtosis coefficient

C and D

_____________ is an unsupervised data mining technique that groups data into categories that share some similar characteristic or trait.

Clustering

Which statement about R is not correct? A. R is a free, open-source software and programming language developed in 1995. B. R was developed at the University of Auckland as an environment for statistical computing and graphics. C. R has become one of the dominant programming languages for statistics and data analytics. D. R does not run on Mac or Linux systems.

D

T or F: The most popular query language used today is SQL or structured query language. This popular query language is used for manipulating data in a relational database using relatively simple and intuitive commands.

T

Association rule analysis is popular in many fields, including A. medicine B. retail C. entertainment D. all of the above

D

Of the following options, which is not accurate for clustering? A. Euclidean distance or Manhattan distance measures for numerical variables and matching. B. AGNES takes each observation in the data initially and forms its own cluster. C. Hierarchical clustering commonly follows agglomerative and divisive clustering. D. Cluster analysis is where small amounts of data are organized against larger statistical sets.

D
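
For reference, the two distance measures named in option A can be sketched as follows. The points are hypothetical, and the function names are illustrative rather than taken from any course library.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

# Two hypothetical observations measured on two numerical variables.
obs_1 = (0, 0)
obs_2 = (3, 4)

print(euclidean(obs_1, obs_2))  # 5.0
print(manhattan(obs_1, obs_2))  # 7
```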

The standard deviation of midterm scores and the final exam are 12.5 and 12.0, respectively. Which of the two exams is riskier and why? A. Both the midterm and the final share the same amount of risk. B. The midterm exam is riskier because the standard deviation is lower. C. There is not enough information to determine which is the riskier of the two. D. The midterm exam is riskier because the standard deviation is higher.

D

What are the three broad categories of analytics techniques designed to extract value from data? Select all that apply! A. Summarization B. Prescriptive C. Predictive D. Descriptive

Prescriptive, Predictive, and Descriptive

Business analytics translates data into decisions to improve business performance through ______________ .

Quantitative Tools

T or F: A 'variable' is defined as when a characteristic of interest differs in kind or degree among various observations.

T

T or F: A subset with the highest degree of impurity is when a 50% and 50% split occur between classes.

T
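
One way to see why a 50/50 split is the most impure case is to compute an impurity measure. The sketch below assumes the Gini index, a measure commonly used with CART; the card itself does not name a measure, and the class counts are hypothetical.

```python
def gini_impurity(class_counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# Hypothetical two-class subsets of 100 cases each.
even_split = gini_impurity([50, 50])   # 0.5, the maximum for two classes
mostly_one = gini_impurity([90, 10])   # about 0.18
pure = gini_impurity([100, 0])         # 0.0, a pure subset

print(even_split, round(mostly_one, 2), pure)
```

With two classes the index peaks at 0.5 for an even split and falls to 0 as the subset becomes pure.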

T or F: Constructing a contingency table allows for a clear visualization of the relationship between two categorical variables.

T

T or F: Converting data from one structure to another is called data transformation.

T

T or F: Decision trees produced by the CART algorithm are binary, meaning that there are two branches for each decision node.

T

___________ ___________ ____________ is designed to identify events that tend to occur together. It is also known as affinity analysis or market basket analysis.

association rule learning

Data Warehouse Process: ETL is three separate functions combined into a single tool. ETL stands for __________, ____________, and ____________.

extract, transform, and load

The ____________ range is the difference between the third quartile and the first quartile.

interquartile
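
A minimal sketch of the calculation using Python's standard library. The data are hypothetical, and quartile conventions differ between textbooks and software, so the `method="inclusive"` choice here is an assumption and other conventions may give slightly different quartiles.

```python
import statistics

# Hypothetical sample, already sorted for readability.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
print(q1, q3, iqr)  # 3.0 7.0 4.0
```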

