INFO 320_Exam 2

The data in a data set are often said to be ______ and ______ before they have been preprocessed.

"dirty", "raw"

Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond to a 25-year-old customer who spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a 53-year-old customer who spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two observations using Euclidean distance.

75.39
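
As a quick check of the answer above, here is a minimal Python sketch that computes the Euclidean distance between u = (25, 350) and v = (53, 420). The function name euclidean_distance is only illustrative and does not come from the course material.

```python
import math

def euclidean_distance(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = (25, 350)  # age 25, spent $350
v = (53, 420)  # age 53, spent $420

# sqrt(28^2 + 70^2) = sqrt(5684)
print(round(euclidean_distance(u, v), 2))  # 75.39
```

Note that the spending variable dominates the result because it is on a larger scale than age; this sketch does not standardize the variables.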

Data preparation includes all of the following except which task?

Calculating the confidence ratio for all association rules

____________________ measures cluster similarity by calculating the distance between the centroids of the two clusters.

Centroid linkage

The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is called

Cluster analysis

__________________ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations in the two clusters.

Complete linkage

Supervised learning

For prediction and classification

missing completely at random (MCAR)

If the missing value is a random occurrence

Which of the following is true for Euclidean distances?

It is commonly used as a method of measuring dissimilarity between quantitative observations.

Which statement is true of an association rule?

It is ultimately judged on how actionable it is and how well it explains the relationship between item sets.

In which of the following scenarios would it be appropriate to use hierarchical clustering?

When binary or ordinal data needs to be clustered.

Median linkage method

analogous to group average linkage, except that it uses the median of the similarities computed between all pairs of observations between the two clusters. The use of the median reduces the effect of outliers.

Association rules

convey the likelihood of certain items being purchased together

The strength of the association rule is known as ____________ and is calculated as the ratio of the confidence of an association rule to the benchmark confidence.

lift
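
A rough sketch of how support count, confidence, and lift relate, written in plain Python. The toy transactions and all function names are made up for illustration. The benchmark confidence is taken here to be the share of all transactions that contain the consequent, which is the usual textbook definition but is an assumption as far as this study set goes.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    return both / support_count(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence of the rule divided by the benchmark confidence."""
    benchmark = support_count(transactions, consequent) / len(transactions)
    return confidence(transactions, antecedent, consequent) / benchmark

# Hypothetical toy data: five shopping baskets
transactions = [
    {"bread", "jelly"},
    {"bread", "jelly", "milk"},
    {"bread"},
    {"milk"},
    {"jelly", "milk"},
]

# Rule: if bread, then jelly
print(support_count(transactions, {"bread", "jelly"}))
print(confidence(transactions, {"bread"}, {"jelly"}))
print(lift(transactions, {"bread"}, {"jelly"}))
```

A lift above 1 suggests the antecedent makes the consequent more likely than it is in the data set overall.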

An analysis of items frequently co-occurring in transactions is known as

market basket analysis

Ward's method

merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible. Tends to produce clearly defined clusters of similar size.

If a variable is missing measurements for a large number of observations

removing this variable from consideration may be an option

Observation

set of recorded values of variables associated with a single entity

Observation refers to the

set of recorded values of variables associated with a single entity

If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the distance between two observations of a cluster?

the hypotenuse

Dimension Reduction

the process of removing variables from the analysis without losing any crucial information

Goal of Clustering

to segment observations into similar groups based on the observed variables

Centroid linkage

uses the averaging concept of cluster centroids to define between-cluster similarity. The centroid is the average observation of a cluster. The similarity between two clusters is then defined as the similarity of the centroids of the two clusters.

Dendrogram

A chart that depicts the set of nested clusters resulting at each step of aggregation. It visually summarizes the output from a hierarchical clustering (for example, one that uses the matching coefficient to measure similarity between observations and the group average linkage method to measure similarity between clusters).

Which is NOT a primary option for addressing missing data?

To generate random data to replace the missing values

In preparing categorical variables for analysis, it is usually best to

Convert the categories to binary, dummy variables

In which of the following data-mining process steps is the data manipulated to make it suitable for formal modeling?

Data preparation

Variable Representation

Includes dimension reduction. A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. It is best to encode categorical variables with 0-1 dummy variables.

Identification of Outliers and Erroneous Data

Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools

missing at random (MAR)

If the missing values are not completely random (i.e., correlated with the values of some other variables)

The endpoint of a k-means clustering algorithm occurs when

No further changes are observed in cluster structure and number

In K-means clustering, k represents the

Number of clusters

Data-mining approaches can be separated into two categories:

Supervised learning and unsupervised learning

Which of the following reasons contributed to the increase in the use of data-mining techniques in business?

The ability to electronically warehouse data

If a model's implications depend on the inclusion or exclusion of outliers, then one should spend additional time to track down

The cause of the outliers

The increase in the use of data-mining techniques in business has been caused largely by three events

The explosion in the amount of data being produced and electronically tracked; the ability to electronically warehouse these data; the affordability of computer power to analyze the data

Measuring Similarity between Observations

The goal of cluster analysis is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. We need explicit measurements of similarity or dissimilarity. Euclidean distance is the most common method to measure dissimilarity between observations.

Data Preparation

Treatment of Missing Data; Identification of Outliers and Erroneous Data; Variable Representation

The process of eliminating variables from formal analysis without losing any crucial information is called

dimension reduction

Jaccard's coefficient is different from the matching coefficient in that the former

does not count matching zero entries while the latter does.
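
To make the difference concrete, here is a small Python sketch of both coefficients for two observations coded as 0-1 dummy variables. The function names and the sample observations are illustrative only.

```python
def matching_coefficient(a, b):
    """Share of variables on which the two observations agree (1-1 or 0-0)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def jaccard_coefficient(a, b):
    """Like the matching coefficient, but 0-0 agreements are left out
    of both the numerator and the denominator."""
    both_one = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    both_zero = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    # Undefined (division by zero) if every entry is a 0-0 match.
    return both_one / (len(a) - both_zero)

a = (1, 0, 0, 1, 0)
b = (1, 0, 0, 0, 0)

print(matching_coefficient(a, b))  # 0.8  (4 of 5 entries agree)
print(jaccard_coefficient(a, b))   # 0.5  (1 shared 1 out of 2 non-zero-zero entries)
```

The three shared zeros inflate the matching coefficient but have no effect on Jaccard's coefficient.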

The __________ the lift ratio, the ____________ the association rule.

higher; stronger

missing not at random (MNAR)

if the reason that the value is missing is related to the value of the variable

When clustering only by dummy variables that represent categorical variables, the simplest measure of similarity between two observations is called the

matching coefficient.

Single-linkage clustering method

the similarity between two clusters is defined by the similarity of the pair of observations that are the most similar. Single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster. However, a cluster formed by merging two clusters that are close with respect to single linkage may also consist of pairs of observations that are very different. The reason is that there is no consideration of how different an observation may be from other observations in a cluster as long as it is similar to at least one observation in that cluster.

If a considerable number of observations have missing values

then replacing them with a value that seems reasonable may be useful, as it does not decrease the number of observations

Suppose we had a data set from a call center where customers were asked to choose between the following three options: hear account information, billing questions, and customer service. Using the given order of the three options, and using 0-1 dummy variables to encode the categorical variables, which of the following combinations would yield an entry "customer service"?

001
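
The encoding in this question can be sketched in a few lines of Python. The helper name dummy_encode is hypothetical; it simply puts a 1 in the position of the chosen category and 0 everywhere else.

```python
def dummy_encode(value, categories):
    """Return a 0-1 string with a 1 in the position of `value`."""
    return "".join("1" if value == c else "0" for c in categories)

# Order of the options as given in the question
options = ["hear account information", "billing questions", "customer service"]

print(dummy_encode("hear account information", options))  # 100
print(dummy_encode("billing questions", options))         # 010
print(dummy_encode("customer service", options))          # 001
```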

The strength of a cluster can be measured by comparing the average distance in a cluster to the distance between cluster centroids. One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should exceed what value for useful clusters?

1.0

Dealing with missing data requires:

Understanding of why the data are missing and the impact of the missing data

The goal of _____________________ is to use the variable values to identify relationships between observations.

Unsupervised Learning

_____________________ approaches are designed to describe patterns and relationships in large data sets with many observations of many variables.

Unsupervised learning

Cluster Analysis

commonly used in marketing to divide consumers into different homogeneous groups (market segmentation). Can also be used to identify outliers.

Average linkage is a measure of calculating dissimilarity between two clusters by

computing the average distance between every pair of observations between two clusters.

Single linkage is a measure of calculating dissimilarity between clusters by

considering only the two most similar observations in the two clusters.
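
The linkage measures in this set (single, complete, average, centroid) differ only in how they summarize the distances between two clusters. A minimal Python sketch, assuming Euclidean distance between observations and using made-up clusters A and B:

```python
import math
from itertools import product

def pair_distances(c1, c2):
    """Euclidean distance for every pair with one observation per cluster."""
    return [math.dist(p, q) for p, q in product(c1, c2)]

def single_linkage(c1, c2):
    return min(pair_distances(c1, c2))   # two most similar observations

def complete_linkage(c1, c2):
    return max(pair_distances(c1, c2))   # two most dissimilar observations

def average_linkage(c1, c2):
    d = pair_distances(c1, c2)
    return sum(d) / len(d)               # average over all pairs

def centroid(cluster):
    """Average observation of a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def centroid_linkage(c1, c2):
    return math.dist(centroid(c1), centroid(c2))

# Hypothetical clusters
A = [(0, 0), (0, 2)]
B = [(3, 0), (5, 0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("centroid", centroid_linkage)]:
    print(name, round(fn(A, B), 3))
```

math.dist requires Python 3.8 or later.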

McQuitty's method

when merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2

Data preparation makes heavy use of the:

descriptive statistics and data visualization methods

A cluster's _____________ can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster in a dendrogram.

durability

Euclidean distance can be used to measure the distance between________________ in cluster analysis.

observations

k-means clustering is the process of

organizing observations into distinct groups based on a measure of similarity.

Hierarchical Clustering

starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster, creating a series of nested clusters. Each iteration corresponds to an increased level of aggregation by decreasing the number of distinct clusters. Determines the similarity of two clusters by considering the similarity between the observations.
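
The merging procedure described above can be sketched in plain Python. This is only an illustration: it assumes Euclidean distance and single linkage, the function names are hypothetical, and it favors readability over speed. It returns the distance at which each merge happens, which is the information a dendrogram displays.

```python
import math

def single_linkage(c1, c2):
    """Distance between the two closest observations of the clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, linkage=single_linkage):
    """Return the merges as (distance, merged_cluster) tuples, in order."""
    clusters = [[p] for p in points]  # each observation starts alone
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest dissimilarity
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        d = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        # replace the two clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append((d, merged))
    return merges

# Hypothetical observations
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
for d, cluster in agglomerate(points):
    print(round(d, 2), cluster)
```

With n observations there are always n - 1 merges, and each one reduces the number of distinct clusters by one.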

A ___________ refers to the number of times a collection of items occur together in a transaction data set

support count

Treatment of Missing Data (primary options)

To discard observations with any missing values; to discard any variable with missing values; to fill in missing entries with estimated values; to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values

Unsupervised learning

To detect patterns and relationships in the data. Thought of as high-dimensional descriptive analytics. Designed to describe patterns and relationships in large data sets with many observations of many variables.

Options for replacing the missing entries for a variable include replacing the missing value with the variable's mode, mean, or median. Imputing values in this manner is truly valid only if variable values are

Missing Completely at Random (MCAR)
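
A minimal sketch of this kind of imputation using Python's standard statistics module. Missing entries are represented here as None, and the function name impute and the sample data are assumptions made for illustration.

```python
import statistics

def impute(values, method="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    summary = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[method]
    fill = summary(observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with two missing entries
ages = [25, None, 53, 41, None, 41]

print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))
```

Filling every gap with one summary value keeps all observations but shrinks the variable's spread, which is one reason the approach is only defensible when values are missing completely at random.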

K-Means Clustering

assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are similar
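
A bare-bones sketch of the usual k-means loop in Python, assuming Euclidean distance and starting centroids supplied by the caller (real implementations typically choose them at random). The function name k_means and the sample points are illustrative. The loop stops when no observation changes cluster.

```python
import math

def k_means(points, start_centroids, max_iter=100):
    """Return (assignment, centroids); k is the number of starting centroids."""
    centroids = list(start_centroids)
    assignment = None
    for _ in range(max_iter):
        # assign each observation to its nearest centroid
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no change in cluster structure: stop
            break
        assignment = new_assignment
        # recompute each centroid as the average of its assigned observations
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical observations and starting centroids (k = 2)
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
assignment, centroids = k_means(points, [(0, 0), (10, 10)])
print(assignment)  # [0, 0, 1, 1]
print(centroids)   # [(1.0, 1.5), (8.5, 8.0)]
```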

Complete linkage clustering method

defines the similarity between two clusters as the similarity of the pair of observations that are the most different. Will consider two clusters to be close if their most different pair of observations are close. Can be distorted by outlier observations.

Group average linkage clustering method

defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters

A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a

dendrogram
