121 Exam 3

what is KNN based on?

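- distance (similarity) between records: a new record is classified according to its K nearest neighbors - in marketing it supports customer identification (who are profitable or defecting customers?), customer retention (how to find loyal customers?), and customer development (cross-selling, direct marketing)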

The purpose of factor analysis can be best described as:

- data simplification - to summarize the information contained in a large number of metric measures with a smaller number of summary measures (factors)

pure portion in recursive partitioning

contains records mostly from one class - the purer the portion, the better the results

factor loadings

correlation between factor scores and the original variables

collinearity

correlation of independent variables with each other, which can bias estimates of regression coefficients - look for correlation between independent variables of .30 or greater - a VIF of 10 or greater shows collinearity
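
The two checks above can be sketched in a few lines. This is a minimal illustration that assumes exactly two independent variables, in which case the VIF reduces to 1 / (1 - r^2); the data and the helper names `pearson_r` and `vif_two_predictors` are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x1, x2):
    """VIF when there are exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical data: ad spend and number of ads, which move together closely.
ad_spend = [10, 12, 15, 19, 25, 30]
num_ads = [5, 6, 8, 9, 13, 15]

r = pearson_r(ad_spend, num_ads)
vif = vif_two_predictors(ad_spend, num_ads)
print(f"r = {r:.3f}, VIF = {vif:.1f}")
print("flagged by correlation rule (|r| >= .30):", abs(r) >= 0.30)
print("flagged by VIF rule (VIF >= 10):", vif >= 10)
```

With more than two independent variables, each VIF would come from regressing one predictor on all the others, which this sketch does not cover.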

scaling of coefficients

a method of directly comparing the magnitudes of the regression coefficients of independent variables by scaling them in the same units or by standardizing the data

cluster analysis

a procedure for identifying subgroups of individuals or items that are similar with regard to certain variables or measurements

metric scale

a type of quantitative scale that provides the most precise measurement

dummy variable

a way of representing two-group (dichotomous), nominally scaled independent variables by coding one group as 0 and the other as 1 in regression analysis - aka binary variables

coefficient of determination

aka R^2 - a measure of the percentage of the variation in the dependent variable explained by variations in the independent variables
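
As a rough sketch of the definition above, R^2 can be computed as 1 minus the ratio of unexplained to total variation. The sales figures and fitted values below are hypothetical and are not the output of an actual regression; `r_squared` is an illustrative helper name.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that is explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    # Total variation of the dependent variable around its mean.
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    # Variation left unexplained by the fitted values.
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_residual / ss_total

# Hypothetical dependent variable and fitted values from some regression model.
sales = [20, 24, 31, 35, 40]
fitted = [21, 25, 30, 35, 39]

print(f"R^2 = {r_squared(sales, fitted):.3f}")
```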

using validation error to prune

*minimum error tree: has the lowest error rate on validation data *best pruned tree: smallest tree within one standard error of minimum error - this adds a bonus for simplicity *the smaller the splits, the clearer the rules are
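
The two selection rules above can be illustrated as follows. This is only a sketch: the candidate trees, their validation error rates, and their standard errors are made-up numbers, and the function names are hypothetical.

```python
# Hypothetical candidate trees produced by pruning back a fully grown tree:
# (number of splits, validation error rate, standard error of that rate)
candidates = [
    (1, 0.210, 0.018),
    (3, 0.160, 0.016),
    (5, 0.142, 0.015),
    (8, 0.135, 0.015),
    (12, 0.139, 0.015),
    (20, 0.151, 0.016),
]

def minimum_error_tree(trees):
    """Tree with the lowest error rate on validation data."""
    return min(trees, key=lambda t: t[1])

def best_pruned_tree(trees):
    """Smallest tree whose error is within one standard error of the minimum."""
    min_tree = minimum_error_tree(trees)
    threshold = min_tree[1] + min_tree[2]
    eligible = [t for t in trees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])

print("minimum error tree:", minimum_error_tree(candidates))
print("best pruned tree:  ", best_pruned_tree(candidates))
```

In this made-up example the best pruned tree has fewer splits than the minimum error tree, which is the "bonus for simplicity" at work.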

Eigenvalue in factor analysis

- 1 acts as a benchmark in factor analysis - if a factor's eigenvalue is > 1, it explains more than the average amount of variance
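
A small sketch of the benchmark above, assuming the eigenvalues come from a correlation matrix of standardized variables, so that there is one eigenvalue per variable and they sum to the number of variables. The eigenvalues are hypothetical and `summarize_factors` is an illustrative name.

```python
def summarize_factors(eigenvalues):
    """Apply the eigenvalue > 1 benchmark to a list of factor eigenvalues."""
    n_variables = len(eigenvalues)  # assumption: one eigenvalue per original variable
    return [
        {
            "factor": i,
            "eigenvalue": value,
            # An "average" factor has eigenvalue 1, i.e. a share of 1 / n_variables.
            "share_of_variance": value / n_variables,
            "retain": value > 1.0,
        }
        for i, value in enumerate(eigenvalues, start=1)
    ]

# Hypothetical eigenvalues for a five-variable analysis (they sum to 5).
eigenvalues = [2.4, 1.3, 0.6, 0.4, 0.3]
summary = summarize_factors(eigenvalues)
retained = [row["factor"] for row in summary if row["retain"]]

for row in summary:
    print(row)
print("factors retained:", retained)
```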

Key ideas of Classification Tree include_______

- Recursive partitioning: repeatedly split the records into 2 parts to achieve maximum homogeneity within the new parts - Pruning the tree: simplifying the tree by pruning peripheral branches to avoid overfitting

what is a factor?

- a linear combination of variables that are correlated to each other - a weighted summary score of a set of related variables and each measure is first weighted according to how much it contributes to the variation of each factor

nominal or categorical

- a type of nonmetric qualitative data scale that only uses numbers to indicate membership in a group - most mathematical and statistical procedures cannot be applied to nominal data

Which of the following are hierarchical clustering methods?

- agglomerative methods * linkage methods, including single, complete, and average linkage
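
The three linkage rules differ only in how the distance between two clusters is defined. The sketch below uses the common textbook definitions (closest pair, farthest pair, mean over all pairs) with Euclidean distance; the points and the function name `linkage_distance` are hypothetical.

```python
import math

def linkage_distance(cluster_a, cluster_b, method="single"):
    """Distance between two clusters of points under a given linkage rule."""
    pairwise = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":      # distance between the closest pair
        return min(pairwise)
    if method == "complete":    # distance between the farthest pair
        return max(pairwise)
    if method == "average":     # mean distance over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")

# Two hypothetical clusters of two-dimensional points.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]

for method in ("single", "complete", "average"):
    print(method, round(linkage_distance(cluster_a, cluster_b, method), 3))
```

An agglomerative procedure would repeatedly merge the two clusters with the smallest such distance until the desired number of clusters remains.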

discriminant coefficients

- aka discriminant weights - estimate of the discriminatory power of a particular independent variable within multiple discriminant analysis - computed by means of the discriminant analysis program - independent variables with large discriminatory power (large differences between groups) have large weights - independent variables with little discriminatory power (small differences between groups) have small weights

KNN method can be used for ________ in Marketing

- classification - identifying customers - retaining customers - developing customers through cross-selling and direct marketing

classification matrix

- classifies people and objects - a matrix or table that shows the percentages of people or things correctly and incorrectly classified by the discriminant model
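
A minimal sketch of such a matrix, assuming the actual and predicted group memberships are already available as two lists. The groups and predictions are hypothetical, and `classification_matrix` is an illustrative name rather than a function from any particular package.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """For each actual group, the percentage predicted into each group."""
    groups = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = {}
    for a in groups:
        row_total = sum(counts[(a, p)] for p in groups)
        matrix[a] = {
            p: (100.0 * counts[(a, p)] / row_total if row_total else 0.0)
            for p in groups
        }
    return matrix

# Hypothetical results from a discriminant model for ten people.
actual = ["buyer"] * 4 + ["nonbuyer"] * 6
predicted = ["buyer", "buyer", "buyer", "nonbuyer",
             "nonbuyer", "nonbuyer", "nonbuyer", "nonbuyer", "buyer", "nonbuyer"]

matrix = classification_matrix(actual, predicted)
for group, row in matrix.items():
    print(group, {p: round(pct, 1) for p, pct in row.items()})
```

The diagonal entries are the percentages classified correctly; the off-diagonal entries are the misclassifications.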

purpose of cluster analysis

- classify objects or people into some number of mutually exclusive and exhaustive groups so that those within a group are as similar as possible to one another - clusters should be homogeneous internally and heterogeneous externally

Which of the following multivariate procedures does not include a dependent variable in its analysis?

- cluster analysis - factor analysis

Data mining typically involves four classes of tasks. What are those four classes?

- clustering, classification, modeling, application

goals of multiple discriminant analysis

- determine if there are statistically significant differences between the average discriminant score profiles of 2 or more groups - establish a model for classifying individuals or objects into groups on the basis of their values in the independent variables - determine how much of the difference in the average score profiles of the groups is accounted for by each independent variable

advantages of classification trees

- easy to use and understand - produce rules that are easy to interpret and implement - do not require the assumptions of statistical models - can work without extensive handling of missing data

factor analysis process

- factor scores - factor loadings - naming the factors - decide how many factors to retain

what kind of questions can multiple discriminant analysis answer?

- how are consumers who purchase various brands different from those who do not purchase the brand? - how do consumers who show high purchase probabilities for a new product differ in demographic and lifestyle characteristics from those with low purchase probabilities? - how do consumers who frequent one fast-food restaurant differ in demographic and lifestyle characteristics from those who frequent another fast-food restaurant? - how do consumers who have chosen either indemnity insurance, HMO coverage, or PPO coverage differ from one another in regard to healthcare use, perceptions, and attitudes?

What is true about KNN?

- it is used for both classification and regression - works better if all of the data have the same scale - works well with a small number of input variables, but struggles when the number of inputs is very large - makes no assumptions about the functional form of the problem being solved - when you decrease K, the bias decreases and the variance increases

types of classification methods

- logistic regression - KNN/Clustering - classification tree

disadvantages of tree classification

- may not perform well where there is structure in the data that is not well captured by horizontal or vertical splits - since the process deals with one variable at a time, there is no way to capture interactions between variables - best when variables are independent of each other because when variables interact with each other they can skew the results

Which of the following are requirements of KNN?

- database: set of stored records - measure of the distance - K, the number of nearest neighbors, to retrieve - the remaining options describe K-means clustering rather than KNN: number of clusters, K, must be specified; each cluster is associated with a centroid; each point is assigned to the cluster with the closest centroid

tips on reducing impurity in recursive partitioning

- obtain the combined impurity measure (weighted average of individual rectangles) - at each successive stage, compare this measure across all possible splits in all variables - choose the split that reduces the impurity the most

recursive partitioning steps

- pick one of the predictor variables, Xi - pick a value of Xi, say Si, that divides the training data into 2 (not necessarily equal) portions - measure how "pure" or homogeneous each of the resulting portions is - algorithm tries different variables Xi and split values Si to maximize the purity in the initial split - after the "maximum purity" split, repeat the process for a second split, and so on
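
The search for a single "maximum purity" split can be sketched as below. The sketch makes several assumptions not stated in the cards: purity is measured with the Gini index, candidate split values are midpoints between consecutive observed values, and the combined impurity is the size-weighted average of the two portions. The records and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(records, labels):
    """Try every predictor and midpoint; return (impurity, variable index, split value)."""
    n = len(records)
    best = None
    for var in range(len(records[0])):
        values = sorted(set(r[var] for r in records))
        for low, high in zip(values, values[1:]):
            split = (low + high) / 2  # midpoint between consecutive values
            left = [y for r, y in zip(records, labels) if r[var] <= split]
            right = [y for r, y in zip(records, labels) if r[var] > split]
            # Combined impurity: weighted average of the two portions.
            combined = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or combined < best[0]:
                best = (combined, var, split)
    return best

# Hypothetical training data: (income in $000s, years as customer) and a class label.
records = [(30, 1), (35, 4), (40, 2), (60, 3), (65, 1), (70, 4)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

impurity, variable, value = best_split(records, labels)
print(f"split variable {variable} at {value}, combined impurity {impurity}")
```

A full tree would apply `best_split` again to each resulting portion, and so on, which is the "recursive" part of recursive partitioning.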

purpose of factor analysis

- purpose is to simplify data - summarize info contained in a large number of metric measures with a smaller number of summary measures (factors) - has no dependent variable

Key goals in classification/decision trees

- recursive partitioning - pruning the tree

causation

- regression analysis shows correlation, but not causation - inference that a change in one variable is responsible for an observed change in another variable

cluster analysis is particularly valuable for what type of marketing strategy?

- segmenting consumers into groups to identify a particular consumer base - finding similarities between consumers

Which of the following statement regarding KNN is true?

- KNN: compute distance to other training records - identify the K nearest neighbors - use the class labels of the nearest neighbors to determine the class label of the unknown record - fundamentally measures distance using Euclidean distance - assign membership to new members depending on the K - determine which K is best by choosing the value of K that has the lowest error rate in validation data - by contrast, K-means clustering: select K points as the initial centroids - repeat: form K clusters by assigning all points to the closest centroid, then recompute the centroid of each cluster - stop when the centroids don't change
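
The KNN steps in this card (compute distances to the training records, identify the K nearest neighbors, use their class labels, and pick K by validation error) can be sketched as follows. The data are hypothetical, majority voting is assumed as the way labels are combined, and the function names are illustrative.

```python
import math
from collections import Counter

def knn_classify(train_records, train_labels, new_record, k):
    """Classify `new_record` by majority vote among its k nearest training records."""
    distances = sorted(
        (math.dist(record, new_record), label)
        for record, label in zip(train_records, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def validation_error(train_records, train_labels, valid_records, valid_labels, k):
    """Share of validation records that are misclassified for a given k."""
    wrong = sum(
        knn_classify(train_records, train_labels, record, k) != label
        for record, label in zip(valid_records, valid_labels)
    )
    return wrong / len(valid_records)

# Hypothetical customers described by two already-comparable measurements.
train_records = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["loyal", "loyal", "loyal", "defector", "defector", "defector"]
valid_records = [(1.5, 1.5), (6.5, 6.5)]
valid_labels = ["loyal", "defector"]

print(knn_classify(train_records, train_labels, (2, 2), k=3))
for k in (1, 3, 5):
    error = validation_error(train_records, train_labels, valid_records, valid_labels, k)
    print(f"k = {k}: validation error = {error}")
```

In practice the value of K with the lowest validation error would be kept; with data this cleanly separated every K tried here performs equally well.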

Which of the following statement regarding Pruning is true?

- tree after pruning/splits leads to decisions/rules to classify new info easily off of rules - rules that are made tend to be too customized and are inapplicable in other situations - prune when there is 100% purity or 0% impurity - let tree grow to the full extent then prune it back by finding the point at which the validation error begins to rise *this yields a set of trees of different sizes and associated error rates - use validation error to prune: *minimum error tree: has lowest error rate on validation data *best pruned tree: smallest tree within one standard error of minimum error; this adds a bonus for simplicity

Euclidean distance

- way of measuring distance for clustering - standardization may be necessary if scales differ - D = SQRT((x1 - x2)^2 + (y1 - y2)^2)
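
The formula and the standardization remark can be illustrated together. This is a sketch under assumptions: z-scores (using the sample standard deviation) are taken as the standardization method, and the customer data are hypothetical.

```python
import math
import statistics

def euclidean(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(records):
    """Convert each column to z-scores so that no single variable dominates the distance."""
    columns = list(zip(*records))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(record, means, stdevs))
        for record in records
    ]

# Hypothetical customers: (age in years, income in dollars).
customers = [(25, 40000), (45, 41000), (30, 90000)]
z = standardize(customers)

# On the raw data, income (the larger-scaled variable) dominates the distance.
print("raw:         ", euclidean(customers[0], customers[1]), euclidean(customers[0], customers[2]))
# After standardization, age and income contribute on comparable scales.
print("standardized:", euclidean(z[0], z[1]), euclidean(z[0], z[2]))
```

In this made-up example the first customer's nearest neighbor changes once the variables are standardized, which is why the scale of the inputs matters for distance-based methods.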

pruning the tree

- we let the tree grow to the full extent, then prune it back by finding the point at which validation error begins to rise to cut off unnecessary branches - yields a set of trees of different sizes and associated error rates

discriminant score

- aka Z score - the score derived for each individual or object by means of the equation - basis for predicting the group to which the particular object or individual belongs

how to handle collinearity

1) drop one of the variables from the analysis 2) combine the correlated variables to form a new composite independent variable

ways to find distance in clustering

1) visually, by eye: can lead to many inaccuracies 2) Euclidean distance: the most common and more accurate approach

If the goal is to classify business travelers into distinct groups based on their responses to 20 questions on preferences to mode of transportation, hotel accommodation, and ethnic food, which of the following techniques would be most appropriate?

Cluster analysis

In factor analysis, the correlation between factor scores and the original variables is known as the factor ________

Factor loadings

T/F: While running KNN, generally speaking, the larger the K is, the more accurate the results would be.

False - a larger K isn't always better; you have to ensure that the value of K is neither too high nor too low

T/F: Cluster analysis is useful for clustering groups of questions together into factors.

False. Cluster analysis is useful for identifying objects or people that are similar with regard to certain variables or measurements, then classifying them into a number of mutually exclusive and exhaustive groups

T/F: Classification tree is able to capture interactions between variables.

False. Factor analysis is used to capture interaction between variables

T/F: Minimum error tree is the tree with 0% error rate in the validation data.

False. Minimum error tree is the tree with the lowest error rate on validation data

T/F: The goal of factor analysis is to identify as many factors as possible.

False. The goal of factor analysis is to simplify the data by summarizing a large number of measures with a smaller number of factors

T/F: If a company was interested in the impact that advertising and sales price had on sales, cluster analysis would be an appropriate tool to use.

False. They would want to use multiple regression analysis to estimate the effects of advertising and sales price on sales

Gini index for a rectangle containing a number of classes

I(A): - 0 when all cases belong to the same class and are 100% pure - max value when all classes are equally represented - the smaller the better/purer
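
The properties listed above follow from the usual definition of the Gini index, I(A) = 1 minus the sum of squared class proportions in rectangle A. The cards do not spell out the formula, so it is stated here as an assumption; the class counts below are hypothetical.

```python
def gini_index(class_counts):
    """Gini index I(A) = 1 - sum of squared class proportions in rectangle A."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_index([10, 0]))  # all cases in one class: completely pure
print(gini_index([5, 5]))   # two classes equally represented: maximum for two classes
print(gini_index([8, 2]))   # mostly one class: somewhere in between
```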

T/F: Generally, the naming of factors in factor analysis is a somewhat subjective process

True

T/F: If the goal of an analysis is to group respondents into mutually exclusive and collectively exhaustive subgroups, the preferred procedure would be cluster analysis.

True

T/F: There is no dependent variable in a factor analysis.

True

multivariate analysis

a general term for statistical procedures that simultaneously analyze multiple measurements on each individual or object under study

perceptual mapping

appropriate when the goal is to analyze consumer perception of companies, products, brands, and the like

Which of the following statement regarding classification methods is TRUE?

classification methods include: - logistic regression: odds and probability - KNN/Clustering: distance - classification tree: partitioning/midpoint

multiple discriminant analysis

enables the researcher to predict group membership on the basis of 2 or more independent variables - dependent variable is nominal or categorical

multiple regression analysis

enables the researcher to predict the level or magnitude of a dependent variable based on the levels of more than one independent variable - an extension of bivariate regression - fits a plane to observations in a multidimensional space

regression coefficients

estimates of the effect of individual independent variables on the dependent variable - b value in regression equations

Summarizing a number of survey questions into a single concept could be achieved with which type of analysis?

factor analysis

Cluster analysis produces mutually exclusive and exhaustive groups such that the individuals or objects grouped are ______________ within and ______________ between groups.

homogeneous within and heterogeneous between groups

An insurance company wants to know whether the subscribers to its HMO, PPO, or traditional single indemnity plan exhibit better health characteristics. They have the following data for each of their subscribers: % of body weight over optimum weight; blood pressure; cholesterol and hours per week of exercise. The analysis will group subscribers as either healthy, somewhat healthy or not healthy. Which of the following would be the best procedure for making such a determination?

multiple discriminant analysis

types of multivariate analysis procedures

- multiple regression analysis - multiple discriminant analysis - cluster analysis - factor analysis - perceptual mapping - conjoint analysis

K-means cluster analysis

multivariate procedure that groups people or objects into clusters based on their closeness/distance to each other in a number of variables
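
A bare-bones sketch of the procedure, assuming Euclidean distance as the measure of closeness and initial centroids supplied by the caller. The points are hypothetical and `k_means` is an illustrative name, not a library function.

```python
import math

def k_means(points, initial_centroids, max_iterations=100):
    """Assign points to the closest centroid, recompute centroids, repeat until stable."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        # Form clusters by assigning every point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            closest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[closest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # centroids no longer change
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical respondents measured on two variables, with K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, initial_centroids=[(1, 1), (8, 8)])

for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

Every point ends up in exactly one cluster, so the resulting groups are mutually exclusive and exhaustive.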

Which of the following are nonhierarchical clustering methods?

nonhierarchical: - K means

factor analysis

permits the analyst to reduce a set of variables to a smaller set of factors or composite variables by identifying underlying dimensions in the data - for factor analysis to be appropriate, the variables must be correlated

conjoint analysis

provides a basis for estimating the utility that consumers associate with different product features or attributes - popular with marketers to determine what features a new product or service should have and how it should be priced - more powerful, more flexible, and usually less expensive way to address the important issues than the traditional concept test approach

recursive partitioning

repeatedly split the records into 2 parts to achieve maximum homogeneity within the new parts

the overfitting problem

rules made are too customized and are inapplicable in other situations - leads to an overgrown tree * means there is 100% purity in training data - some identified rules are only applicable to a particular data set - past a certain point, the error rate for new (validation) data starts to increase - low predictive accuracy on new data - need to prune when overgrown

pruning the tree

simplify the tree by pruning peripheral branches to avoid overfitting

Which of the following can be used to measure the degree of classification homogeneity?

the Gini index can be used to measure the degree of homogeneity within a classification tree - the purer, the better - 0 when all cases belong to the same class - max value when all classes are equally represented (0.50 in binary case) - the smaller, the better

what is the goal of discriminant analysis?

the prediction of a categorical variable

utilities

the relative value of attribute levels determined through conjoint analysis

A company that is trying to determine which 3 distinct target markets to focus its marketing efforts on might use which statistical tool?

they would use K means cluster analysis

