121 Exam 3
what is KNN based on?
- customer identification: who are profitable or defect customers? - customer retention: how to find loyal customers? - customer development: cross-selling, direct marketing
The purpose of factor analysis can be best described as:
- data simplification - to summarize the information contained in a large number of metric measures with a smaller number of summary measures (factors)
pure portion in recursive partitioning
contains records mostly from one class - the purer the portion, the better the results
factor loadings
correlation between factor scores and the original variables
collinearity
correlation of independent variables with each other, which can bias estimates of regression coefficients - look for correlation between independent variables of .30 or greater - a VIF of 10 or greater shows collinearity
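A minimal sketch of the VIF check mentioned above, assuming only two independent variables. In that case the R^2 from regressing one predictor on the other equals their squared correlation, so VIF = 1 / (1 - r^2). The function names and the sample data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for the two-predictor case only: 1 / (1 - r^2).

    Assumption: with more predictors, r^2 must be replaced by the R^2
    from regressing one predictor on all of the others.
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical, made-up predictors that move almost in lockstep.
ad_spend = [1, 2, 3, 4, 5]
sales_calls = [1, 2, 3, 4, 6]
print(pearson_r(ad_spend, sales_calls))
print(vif_two_predictors(ad_spend, sales_calls))
```

A correlation well above .30 and a VIF above 10 would both flag this pair under the rules of thumb in the card.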
scaling of coefficients
a method of directly comparing the magnitudes of the regression coefficients of independent variables by scaling them in the same units or by standardizing the data
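One common way to do this scaling is the standardized (beta) coefficient: multiply the raw coefficient by the standard deviation of its independent variable and divide by the standard deviation of the dependent variable. The sketch below assumes that approach; the function name and data are hypothetical.

```python
import statistics

def standardized_coefficient(b, x, y):
    """Convert a raw regression coefficient b into a standardized beta.

    b: raw coefficient of x in the regression equation
    x: observed values of the independent variable
    y: observed values of the dependent variable
    """
    return b * statistics.stdev(x) / statistics.stdev(y)

# Hypothetical data in which y = 2 * x exactly, so the raw b is 2.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(standardized_coefficient(2, x, y))
```

Because the result is unit-free, betas for variables measured in dollars, percentages, or counts can be compared directly.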
cluster analysis
a procedure for identifying subgroups of individuals or items that are similar in regards to certain variables or measurements
metric scale
a type of quantitative scale that provides the most precise measurement
dummy variable
a way of representing 2 group, or dichotomous, nominally scaled independent variables by coding one group as 0 and the other as 1 in regression analysis - aka binary variables
coefficient of determination
aka R^2 measure of the percentage of the variation in the dependent variable explained by variations in the independent variable
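As a quick illustration, R^2 can be computed as 1 minus the ratio of unexplained variation to total variation. This is a sketch with made-up numbers; `y_hat` stands for predictions from whatever regression was fitted.

```python
def r_squared(y, y_hat):
    """Share of the variation in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_residual = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # unexplained
    ss_total = sum((a - mean_y) ** 2 for a in y)               # total
    return 1 - ss_residual / ss_total

# Hypothetical observed values and model predictions.
observed = [1, 2, 3, 4]
predicted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(observed, predicted))
```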
using validation error to prune
*minimum error tree: has the lowest error rate on validation data *best pruned tree: smallest tree within one standard error of minimum error - this adds a bonus for simplicity *the smaller the splits, the clearer the rules are
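The two selection rules above can be sketched as follows. The candidate trees and their error rates are made up, and the standard error is assumed to be the usual binomial one, sqrt(e * (1 - e) / n), for an error rate e measured on n validation records.

```python
import math

def choose_trees(candidates, n_validation):
    """Pick the minimum error tree and the best pruned tree.

    candidates: list of (number_of_leaves, validation_error_rate)
    n_validation: number of records in the validation data
    """
    # Minimum error tree: lowest validation error (ties go to the smaller tree).
    min_tree = min(candidates, key=lambda t: (t[1], t[0]))
    error = min_tree[1]
    # Assumed binomial standard error of that minimum error rate.
    std_error = math.sqrt(error * (1 - error) / n_validation)
    # Best pruned tree: smallest tree within one standard error of the minimum.
    eligible = [t for t in candidates if t[1] <= error + std_error]
    best_pruned = min(eligible, key=lambda t: t[0])
    return min_tree, best_pruned

# Hypothetical pruning sequence: (leaves, validation error rate).
trees = [(1, 0.30), (3, 0.12), (5, 0.105), (9, 0.10), (15, 0.11)]
print(choose_trees(trees, n_validation=400))
```

Here the 5-leaf tree is slightly less accurate than the 9-leaf tree but falls inside the one-standard-error band, which is the "bonus for simplicity".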
Eigenvalue in factor analysis
- 1 acts as a benchmark in factor analysis - if a factor's eigenvalue is > 1, the factor explains more than the average amount of variance
Key ideas of Classification Tree include_______
- Recursive partitioning: repeatedly split the records into 2 parts to achieve maximum homogeneity within the new parts - Pruning the tree: simplifying the tree by pruning peripheral branches to avoid overfitting
what is a factor?
- a linear combination of variables that are correlated to each other - a weighted summary score of a set of related variables and each measure is first weighted according to how much it contributes to the variation of each factor
nominal or categorical
- a type of nonmetric qualitative data scale that only uses numbers to indicate membership in a group - most mathematical and statistical procedures cannot be applied to nominal data
Which of the following are hierarchical clustering methods?
- agglomerative methods *linkage methods including: single, complete, and average linkage
discriminant coefficients
- aka discriminant weights - estimate of the discriminatory power of a particular independent variable within multiple discriminant analysis - computed by means of the discriminant analysis program - independent variables with large discriminatory power (large differences between groups) have large weights - independent variables with little discriminatory power (small differences between groups) have small weights
KNN method can be used for ________ in Marketing
- classification - identifying customers - retaining customers - developing customers through cross-selling and direct marketing
classification matrix
- classifies people and objects - a matrix or table that shows the percentages of people or things correctly and incorrectly classified by the discriminant model
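A small sketch of building such a matrix from actual and predicted group labels. The labels are hypothetical; the matrix here holds counts, and the overall share classified correctly (often called the hit ratio) is returned alongside it.

```python
from collections import Counter

def classification_matrix(actual, predicted):
    """Return (matrix, hit_ratio).

    matrix[a][p] is the number of cases whose actual group is a
    and whose predicted group is p.
    """
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = {a: {p: counts[(a, p)] for p in labels} for a in labels}
    correct = sum(counts[(label, label)] for label in labels)
    return matrix, correct / len(actual)

# Hypothetical output of a discriminant model for five consumers.
actual = ["buyer", "buyer", "non", "non", "non"]
predicted = ["buyer", "non", "non", "non", "buyer"]
matrix, hit_ratio = classification_matrix(actual, predicted)
print(matrix)
print(hit_ratio)
```

Dividing each row by its row total would turn the counts into the percentages the card describes.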
purpose of cluster analysis
- classify objects or people into some number of mutually exclusive and exhaustive groups so that those within a group are as similar as possible to one another - clusters should be homogeneous internally and heterogeneous externally
Which of the following multivariate procedures does not include a dependent variable in its analysis?
- cluster analysis - factor analysis
Data mining typically involves four classes of tasks. What are those 4 task classes?
- clustering, classification, modeling, application
goals of multiple discriminant analysis
- determine if there are statistically significant differences between the average discriminant score profiles of 2 or more groups - establish a model for classifying individuals or objects into groups on the basis of their values in the independent variables - determine how much of the difference in the average score profiles of the group is accounted for by each independent variable
advantages of classification trees
- easy to use and understand - produce rules that are easy to interpret and implement - do not require the assumptions of statistical models - can work without extensive handling of missing data
factor analysis process
- factor scores - factor loadings - naming the factors - decide how many factors to retain
what kind of questions can multiple discriminant analysis answer?
- how are consumers who purchase various brands different from those who do not purchase the brand? - how do consumers who show high purchase probabilities for a new product differ in demographic and lifestyle characteristics from those with low purchase probabilities? - how do consumers who frequent one fast-food restaurant differ in demographic and lifestyle characteristics from those who frequent another fast-food restaurant? - how do consumers who have chosen either indemnity insurance, HMO coverage, or PPO coverage differ from one another in regard to healthcare use, perceptions, and attitudes?
What is true about KNN?
- it is used for both classification and regression - works better if all of the data have the same scale - works well with a small number of input variables, but struggles when the number of inputs is very large - makes no assumptions about the functional form of the problem being solved - when you decrease K, the bias decreases and the variance increases
types of classification methods
- logistic regression - KNN/Clustering - classification tree
disadvantages of tree classification
- may not perform well where there is structure in the data that is not well captured by horizontal or vertical splits - since the process deals with one variable at a time, there is no way to capture interactions between variables - best when variables are independent of each other because when variables interact with each other they can skew the results
Which of the following are requirements of KNN?
- number of clusters, K, must be specified - database: set of stored records - measure of the distance - K, the number of nearest neighbors, to retrieve - each cluster is associated with a centroid - each point is assigned to the cluster with the closest centroid
tips on reducing impurity in recursive partitioning
- obtain the combined impurity measure (weighted average of individual rectangles) - at each successive stage, compare this measure across all possible splits in all variables - choose the split that reduces the impurity the most
recursive partitioning steps
- pick one of the predictor variables, Xi - pick a value of Xi, say Si, that divides the training data into 2 (not necessarily equal) portions - measure how "pure" or homogeneous each of the resulting portions is - the algorithm tries different values of Xi and Si to maximize the purity in the initial split - after the "maximum purity" split, repeat the process for a second split, and so on
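The first of these splits can be sketched for a single predictor as below. This is an illustration only: it assumes the Gini index as the purity measure, tries midpoints between consecutive distinct values as candidate split points, and uses made-up data. A full tree would repeat the search over every predictor and then within each resulting portion.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when every record belongs to the same class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (split_value, combined_impurity) for one predictor.

    Combined impurity is the weighted average of the Gini index
    of the two portions created by the split.
    """
    n = len(labels)
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        split = (low + high) / 2  # candidate split: midpoint of neighbors
        left = [lab for v, lab in zip(values, labels) if v <= split]
        right = [lab for v, lab in zip(values, labels) if v > split]
        combined = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or combined < best[1]:
            best = (split, combined)
    return best

# Hypothetical records: age and whether the customer responded.
ages = [20, 25, 30, 45, 50, 60]
responded = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(ages, responded))
```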
purpose of factor analysis
- purpose is to simplify data - summarize info contained in a large number of metric measures with a smaller number of summary measures (factors) - has no dependent variable
Key goals in classification/decision trees
- recursive partitioning - pruning the tree
causation
- regression analysis shows correlation, but not causation - inference that a change in one variable is responsible for an observed change in another variable
cluster analysis is particularly valuable for what type of marketing strategy?
- segmenting consumers into groups to identify a particular consumer base - finding similarities between consumers
Which of the following statements regarding KNN is true?
- select K points as the initial centroids - repeat - form K clusters by assigning all points to the closest centroid - recompute the centroid of each cluster - repeat until the centroids don't change - assign membership to new members depending on the K - determine which K is best out of the total number of observations - fundamentally measures distance using Euclidean distance - choose value of K that has the lowest error rate in validation data - compute distance to other training records - identify k nearest neighbors - use class labels of nearest neighbors to determine the class label of unknown record
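The last three steps in this list (compute distances, identify the k nearest neighbors, use their class labels) can be sketched as below. This is a minimal illustration with made-up training records, Euclidean distance, and a simple majority vote; the function name is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training records."""
    # Step 1: compute the distance from the query to every training record.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Step 2: keep the k nearest neighbors.
    nearest = distances[:k]
    # Step 3: the most common class label among them is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training records with two classes, "A" and "B".
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(2, 2), k=3))
```

To choose K, the same function could be run over a validation set for several values of K, keeping the one with the lowest error rate.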
Which of the following statements regarding Pruning is true?
- tree after pruning/splits leads to decisions/rules to classify new info easily off of rules - rules that are made tend to be too customized and are inapplicable in other situations - prune when there is 100% purity or 0% impurity - let tree grow to the full extent then prune it back by finding the point at which the validation error begins to rise *this yields a set of trees of different sizes and associated error rates - use validation error to prune: *minimum error tree: has lowest error rate on validation data *best pruned tree: smallest tree within one standard error of minimum error; this adds a bonus for simplicity
Euclidean distance
- way of clustering by distance - standardization may be necessary, if scales differ - D = sqrt((x1 - x2)^2 + (y1 - y2)^2)
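A short sketch of the distance formula and of why standardization can matter. The records are made up (age in years, income in dollars), and z-scores based on the population standard deviation are assumed as the standardization method.

```python
import math
import statistics

def euclidean_distance(p, q):
    """Straight-line distance between two records with the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(rows):
    """Convert every column to z-scores so that no variable dominates.

    Assumption: no column is constant (its standard deviation is not 0).
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]

# Hypothetical customers: (age, income). Income is on a much larger scale.
customers = [(30, 40000), (40, 42000), (50, 90000)]
print(euclidean_distance(customers[0], customers[1]))  # driven by income
scaled = standardize(customers)
print(euclidean_distance(scaled[0], scaled[1]))        # both variables count
```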
pruning the tree
- we let the tree grow to the full extent, then prune it back by finding the point at which validation error begins to rise to cut off unnecessary branches - yields a set of trees of different sizes and associated error rates
discriminant score
- aka Z score - the score derived for each individual or object by means of the equation - basis for predicting the group to which the particular object or individual belongs
how to handle collinearity
1) drop one of the variables from the analysis 2) combine the correlated variables to form a new composite independent variable
ways to find distance in clustering
1) visually, by eye: can lead to a lot of inaccuracy 2) Euclidean distance: the most common and more accurate
If the goal is to classify business travelers into distinct groups based on their responses to 20 questions on preferences to mode of transportation, hotel accommodation, and ethnic food, which of the following techniques would be most appropriate?
Cluster analysis
In factor analysis, the correlation between factor scores and the original variables is known as the factor ________
loading (factor loadings)
T/F: While running KNN, generally speaking, the larger the K is, the more accurate the results would be.
False - a larger K isn't always better; you have to ensure that the value of K is neither too high nor too low
T/F: Cluster analysis is useful for clustering groups of questions together into factors.
False. Cluster analysis is useful for identifying objects or people that are similar with regard to certain variables or measurements, then classifying them into a number of mutually exclusive and exhaustive groups
T/F: Classification tree is able to capture interactions between variables.
False. Factor analysis is used to capture interaction between variables
T/F: Minimum error tree is the tree with 0% error rate in the validation data.
False. Minimum error tree is the tree with the lowest error rate on validation data
T/F: The goal of factor analysis is to identify as many factors as possible.
False. The goal of factor analysis is to simplify the data by summarizing many measures with a smaller number of factors
T/F: If a company was interested in the impact that advertising and sales price had on sales, cluster analysis would be an appropriate tool to use.
False. They would want to use multiple regression analysis to estimate the effects of advertising and sales price on sales
Gini index for a rectangle containing a number of classes
I(A): - 0 when all cases belong to the same class and are 100% pure - max value when all classes are equally represented - the smaller the better/purer
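These properties can be checked with a short sketch, assuming the standard definition of the index: 1 minus the sum of the squared class proportions in the rectangle. The class labels below are made up.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of one rectangle: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_index(["yes", "yes", "yes", "yes"]))  # one class only: pure
print(gini_index(["yes", "yes", "no", "no"]))    # two classes, equally split
print(gini_index(["a", "b", "c"]))               # three classes, equally split
```

With m equally represented classes the index reaches its maximum of 1 - 1/m, which is 0.50 in the two-class case.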
T/F: Generally, the naming of factors in factor analysis is a somewhat subjective process
True
T/F: If the goal of an analysis is to group respondents into mutually exclusive and collectively exhaustive subgroups, the preferred procedure would be cluster analysis.
True
T/F: There is no dependent variable in a factor analysis.
True
multivariate analysis
a general term for statistical procedures that simultaneously analyze multiple measurements of search data
perceptual mapping
appropriate when the goal is to analyze consumer perception of companies, products, brands, and the like
Which of the following statement regarding classification methods is TRUE?
classification methods include: - logistic regression: odds and probability - KNN/Clustering: distance - classification tree: Partitioning/midpoint
multiple discriminant analysis
enables the researcher to predict group membership on the basis of 2 or more independent variables - dependent variable is nominal or categorical
multiple regression analysis
enables the researcher to predict the level or magnitude of a dependent variable based on the levels of more than one independent variable - an extension of bivariate regression - fits a plane to observations in a multidimensional space
regression coefficients
estimates of the effect of individual independent variables on the dependent variable - b value in regression equations
Summarizing a number of survey questions into a single concept could be achieved with which type of analysis?
factor analysis
Cluster analysis produces mutually exclusive and exhaustive groups such that the individuals or objects grouped are ______________ within and ______________ between groups.
homogeneous within and heterogeneous between groups
An insurance company wants to know whether the subscribers to its HMO, PPO, or traditional single indemnity plan exhibit better health characteristics. They have the following data for each of their subscribers: % of body weight over optimum weight; blood pressure; cholesterol and hours per week of exercise. The analysis will group subscribers as either healthy, somewhat healthy or not healthy. Which of the following would be the best procedure for making such a determination?
multiple discriminant analysis
types of multivariate analysis procedures
- multiple regression analysis - multiple discriminant analysis - cluster analysis - factor analysis - perceptual mapping - conjoint analysis
K means cluster analysis
multivariate procedure that groups people or objects into clusters based on their closeness to each other on a number of variables
K-means cluster analysis
multivariate procedure that groups people or objects into clusters based on their closeness/distance to each other in a number of variables
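A minimal sketch of the K-means procedure, assuming Euclidean distance and initial centroids supplied by the caller: assign every point to the closest centroid, recompute each centroid as the mean of its cluster, and repeat until the centroids stop changing. The data points and function name are hypothetical.

```python
import math

def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, clusters) after the centroids stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iterations):
        # Form K clusters by assigning each point to the closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            closest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[closest].append(point)
        # Recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # centroids did not change, so the procedure has converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical respondents scored on two variables, with K = 2.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(data, initial_centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The result depends on K and on the starting centroids, which is why K has to be specified before the analysis is run.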
Which of the following are nonhierarchical clustering methods?
nonhierarchical: - K means
factor analysis
permits the analyst to reduce a set of variables to a smaller set of factors or composite variables by identifying underlying dimensions in the data - for factor analysis to be appropriate, the variables must be correlated
conjoint analysis
provides a basis for estimating the utility that consumers associate with different product features or attributes - popular with marketers to determine what features a new product or service should have and how it should be priced - more powerful, more flexible, and usually less expensive way to address the important issues than the traditional concept test approach
recursive partitioning
repeatedly split the records into 2 parts to achieve maximum homogeneity within the new parts
the overfitting problem
rules made are too customized and are inapplicable in other situations - leads to an overgrown tree * means there is 100% purity in training data - some identified rules are only applicable to a particular data set - past a certain point, the error rate for the new (validation) data starts to increase - low predictive accuracy on new data - need to prune when overgrown
pruning the tree
simplify the tree by pruning peripheral branches to avoid overfitting
Which of the following can be used to measure the degree of classification homogeneity?
the Gini index can be used to measure the degree of homogeneity within a classification tree - the lower the index, the purer the node - 0 when all cases belong to the same class - max value when all classes are equally represented (0.50 in binary case) - the smaller, the better
what is the goal of discriminant analysis?
the prediction of a categorical variable
utilities
the relative value of attribute levels determined through conjoint analysis
A company that is trying to determine which 3 distinct target markets to focus its marketing efforts on might use which statistical tool?
they would use K means cluster analysis