GBUS 607 Chapters 5 & 15
5. Accuracy Measures (under Judging Classifier Performance)
-Different accuracy measures can be derived from the classification matrix
-A main accuracy measure is the estimated misclassification rate, also called the overall error rate: the number of misclassified records divided by n (n = total number of cases in the validation dataset); see the formula below
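For reference, the formula the note points to can be written as follows (standard two-class definition, where n_{1,2} is the number of class-1 cases classified as class 2 and n_{2,1} is the reverse):

```latex
% Overall error rate (estimated misclassification rate), two-class case
\[
\text{err} = \frac{n_{1,2} + n_{2,1}}{n}
\]
```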
Chapter 15: Cluster Analysis
-Grouping data, or dividing a large dataset into smaller datasets, according to some observed patterns or similarity. Examples: the periodic table, portfolios, uniforms
-Helps us understand the underlying characteristics of the data and predict membership
-It is a form of unsupervised learning
-The distance between two data points is used to measure their similarity (or dissimilarity)
-Euclidean distance between records x = (x1, ..., xp) and y = (y1, ..., yp): d(x, y) = sqrt((x1 - y1)^2 + ... + (xp - yp)^2)
From Lecture: observations are put together in a group based on similar properties/patterns/characteristics
-Euclidean distance NEEDS NUMERICAL INPUT (see the sketch below)
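A minimal numpy sketch of the Euclidean distance between two numerical records (the values are made up for illustration):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two numerical records x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Example: two records measured on the same two numerical variables
print(euclidean_distance([22, 49000], [25, 51000]))
```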
9. Generalizing to More than Two Classes (under Judging Classifier Performance)
-Incorporating prior probabilities of the various classes is still done in the same manner
-Evaluating misclassification costs becomes much more complicated
-For an m-class case we have m(m-1) types of misclassifications
-Constructing a matrix of misclassification costs thus becomes prohibitively complicated
Judging Classifier Performance
-Misclassification error: a natural criterion for judging the performance of a classifier is its probability of misclassification
-Misclassification means that an observation belongs to one class but the model classifies it as a member of a different class
1. Benchmark: The Naïve Rule
-The naïve rule is to classify every record as a member of the majority class (see the sketch after this list)
-The naïve rule is used as a baseline or benchmark for evaluating the performance of more complicated classifiers
-A classifier that uses external predictor information should outperform the naïve rule
-In the numerical-outcome case, R^2 measures the distance between the fit of the model to the data and the fit of the naïve rule (the sample mean, y-bar) to the data
-Similarly, the naïve rule for classification relies solely on the outcome (y) information and excludes any additional predictor information
2. Class Separation
-If the classes are well separated by the predictor information, even a small dataset will suffice in finding a good classifier
-If the classes are not separated at all by the predictors, even a very large dataset will not help
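A small sketch of the naïve-rule benchmark, assuming a list of 0/1 class labels from a validation set (the labels are invented):

```python
from collections import Counter

def naive_rule_accuracy(actual_classes):
    """Accuracy of always predicting the majority class."""
    counts = Counter(actual_classes)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(actual_classes)

# Illustrative validation labels: 1 = class of interest, 0 = other
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
print(naive_rule_accuracy(labels))   # majority class is 0, accuracy 0.7
```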
6. Propensities and Cutoff for Classification (under Judging Classifier Performance)
-The first step is to estimate the probability that a case belongs to each of the classes
-These estimated probabilities are also called propensities
-Propensities are used either as an interim step for generating predicted class membership or for rank-ordering the records by their probability of belonging to a class of interest
4. Lift Chart
-The goal is to search, among a set of new records, for a subset of records that gives the highest cumulative predicted values
-A lift chart compares the model's predictive performance to a baseline model that has no predictors
-It is only relevant when we are searching for a set of records that gives the highest cumulative predicted values
-It is not relevant if we are interested in predicting the outcome for each new record
-The lift chart is based on ordering the set of records of interest by their predicted value, from high to low
-We then accumulate the actual values and plot their cumulative value on the y-axis as a function of the number of records accumulated (see the sketch below)
-This curve is compared to assigning the naïve prediction (the average) to each record and accumulating these average values, which results in a diagonal line
-The farther the lift curve is from the diagonal benchmark line, the better the model is doing at separating records with high-value outcomes from those with low-value outcomes
-The same idea applies to the decile lift chart, where the ordered records are grouped into 10 deciles, and for each decile the chart presents the ratio of model lift to naïve benchmark lift
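A rough sketch of how such a lift chart could be assembled, using invented predicted and actual values and matplotlib for the plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative validation records: predicted and actual outcome values
predicted = np.array([120, 80, 200, 50, 150, 90, 30, 170])
actual    = np.array([110, 70, 210, 40, 160, 100, 20, 150])

# Sort records by predicted value, from high to low, then accumulate actual values
order = np.argsort(-predicted)
cum_actual = np.cumsum(actual[order])
n_records = np.arange(1, len(actual) + 1)

# Naive benchmark: accumulate the average actual value for each record (diagonal line)
baseline = n_records * actual.mean()

plt.plot(n_records, cum_actual, label="model (records sorted by prediction)")
plt.plot(n_records, baseline, "--", label="naive benchmark")
plt.xlabel("# records accumulated")
plt.ylabel("cumulative actual value")
plt.legend()
plt.show()
```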
4. Using the Validation Data (under Judging Classifier Performance)
-To obtain an honest estimate of future classification error, we use the classification matrix computed from the validation data
-If we summarize our results in a classification matrix for the training data as well, that matrix is not a useful performance estimate because of the danger of overfitting
-We can, however, compare the training-data classification matrix to the validation-data classification matrix in order to detect overfitting, although we expect somewhat inferior results on the validation data
-A large discrepancy between training and validation performance might be indicative of overfitting
Two Distinct predictive uses of classifiers
1. Classification: aimed at predicting class membership for new records
2. Ranking: detecting, among a set of new records, the ones most likely to belong to a class of interest
Judging ranking performance
1. Lift Chart for Binary Data
-Lift charts are also called lift curves, gains curves, or gains charts for a binary outcome
-The lift curve helps us determine how effectively we can "skim the cream" by selecting a relatively small number of cases and getting a relatively large portion of the responders
-The input required to construct a lift curve is a validation dataset that has been "scored" by assigning to each case the propensity that it will belong to a given class
-In the tax-cheat example, the model gives an estimate of the extent to which we will encounter more and more noncheaters as we proceed through the sorted data, starting with the records most likely to be tax cheats
-This describes the case where our goal is to obtain a rank ordering among the records according to their class membership propensities
2. Decile Lift Charts (see the sketch after this list)
-The decile chart aggregates all the lift information into 10 buckets
-Each bar shows, on the y-axis, the factor by which our model outperforms a random assignment of 0s and 1s, taking one decile at a time
3. Beyond Two Classes
-A lift chart cannot be used with a multiclass classifier unless a single "important class" is defined and the classifications are reduced to "important" and "unimportant" classes
4. Lift Charts Incorporating Costs and Benefits
-We need a classifier that assigns to each record a propensity that it belongs to a particular class
5. Lift as a Function of Cutoff
-We could also plot the lift as a function of the cutoff value
-The only difference is the scale on the x-axis
-When the goal is to select the top records based on a certain budget, lift vs. number of records is preferable
-In contrast, when the goal is to find a cutoff that distinguishes well between the two classes, lift vs. cutoff value is more useful
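A rough sketch of a decile-wise lift calculation for a binary outcome, using pandas and invented propensities (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative scored validation set: model propensity and actual 0/1 outcome
rng = np.random.default_rng(0)
propensity = rng.random(200)
actual = (rng.random(200) < propensity).astype(int)   # higher propensity -> more 1s
df = pd.DataFrame({"actual": actual, "propensity": propensity})

# Sort by propensity (high to low) and cut the records into 10 equal-sized deciles
df = df.sort_values("propensity", ascending=False).reset_index(drop=True)
df["decile"] = pd.qcut(np.arange(len(df)), 10, labels=list(range(1, 11)))

# Decile lift = response rate in the decile / overall (naive) response rate
overall_rate = df["actual"].mean()
print(df.groupby("decile", observed=True)["actual"].mean() / overall_rate)
```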
3 main types of outcomes of interest
1. Predicted numerical value: when the outcome variable is numerical (e.g., house price)
2. Predicted class membership: when the outcome variable is categorical (e.g., buyer/nonbuyer)
3. Propensity: the propensity of class membership, when the outcome variable is categorical (e.g., propensity to default)
k-Means clustering - how does it work?
Algorithm 1: data is grouped into k clusters in such a way that data points within each cluster are more similar to each other than to data in other clusters.
How it works (see the sketch after this list):
¤Step 1: Given n objects, initialize k cluster centers (also called centroids). NOTE: k should be at least 2; a single cluster is trivial
¤Step 2: Assign each object to its closest center
¤Step 3: Update the center for each cluster (reassign points that are far from their center; the update corrects the error)
¤Step 4: Repeat Steps 2 and 3 until there is no change in any cluster center
From LECTURE:
-takes the average of each cluster from the initial iteration
-minimize how far the distances are within each group, then move the center to the calculated average of each cluster
-we decide how many clusters (k) we want, not how many points end up in each cluster
-the centroid is multidimensional
-pick k random points, then assign each observation to whichever of the k centers is closest
-find the cluster centers in each iteration, then create/identify new cluster centers (move them to the new centers, then the observations can be classified)
-only reassign the points that are causing issues (move them to other centers)
-captures similar characteristics and places observations into the right groups/centers
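A minimal sketch of these steps using scikit-learn's KMeans (the data and the choice k = 2 are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative numerical data: rows are observations, columns are variables
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# k must be chosen in advance; the algorithm then iterates:
# assign each point to its nearest centroid, recompute centroids, repeat
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster membership of each observation
print(kmeans.cluster_centers_)  # final centroids (averages within each cluster)
```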
Evaluating Predictive Performance
-Classical statistical measures of performance are aimed at finding a model that fits well to the data on which the model was trained
-In data mining we are interested in models that have high predictive accuracy when applied to new records
-R^2 and the standard error of estimate are common metrics in classical regression modeling, and residual analysis is used to gauge goodness of fit in that situation
-When assessing prediction performance, in all cases the measures are based on the validation set, which serves as more objective ground than the training set for assessing predictive accuracy
-Validation records are more similar to the future records to be predicted, in the sense that they are not used to select predictors or to estimate the model coefficients
NOTES from BOOK chapter 15 cont...
Distance measures for categorical (binary) data:
-Matching coefficient: the proportion of variables on which the two records match (counts both 0-0 and 1-1 matches)
-Jaccard's coefficient: the same idea, but ignores 0-0 matches (see the sketch after this list)
Note: there are also measures for mixed data, such as continuous plus binary data
Measuring distance between two clusters:
-Minimum distance: distance between the pair of observations (one in cluster A, one in cluster B) that are closest
-Maximum distance: distance between the pair of observations (one in A, one in B) that are farthest apart
-Average distance: average of all possible distances between observations in one cluster and observations in the other
-Centroid distance: distance between the two cluster centroids. NOTE: a cluster centroid is the vector of measurement averages across all the observations in that cluster
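A small sketch of the two binary-data coefficients above, for two invented records of 0/1 variables:

```python
import numpy as np

def matching_coefficient(a, b):
    """Proportion of variables on which the two records match (0-0 and 1-1)."""
    a, b = np.asarray(a), np.asarray(b)
    return np.mean(a == b)

def jaccard_coefficient(a, b):
    """Like the matching coefficient, but positions where both records are 0 are ignored."""
    a, b = np.asarray(a), np.asarray(b)
    keep = ~((a == 0) & (b == 0))          # drop the 0-0 positions
    return np.mean(a[keep] == b[keep])

rec1 = [1, 0, 1, 1, 0, 0]
rec2 = [1, 0, 0, 1, 0, 1]
print(matching_coefficient(rec1, rec2))  # counts 0-0 matches
print(jaccard_coefficient(rec1, rec2))   # ignores 0-0 matches
```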
Chapter 5
Evaluating Predictive Performance
-Danger of overfitting to the training data
-Average error measures: MAPE and RMSE (based on validation data)
-ROC Curve -> popular chart for assessing method performance at different cutoff values
-Ranking: accurately classify the most interesting or important cases
Hierarchical (Agglomerative) clustering
How it works (see the sketch after this list):
¤Each data point starts out as its own cluster (no need to pre-specify how many clusters to start with, unlike the k-means algorithm)
¤Clusters are then merged successively based on similarity (distance)
¤Eventually there will be one large cluster that contains all the data points
¤The graph showing the mergers is called a dendrogram (branches)
¤The larger the vertical distance, the higher the level of dissimilarity
From LECTURE:
-merge whichever two points/clusters are closest together
-the vertical distance shows how much the dissimilarities differ
-points start out as independent clusters
-no iterations, unlike the k-means algorithm
From Exercise:
-the output shows the average distance between each cluster center and the observations in that cluster
-lowest average distance = most cohesive cluster
-an average distance of 0 = only 1 point in that cluster/group
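A sketch of agglomerative clustering and its dendrogram using scipy (the data and the "average" linkage choice are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative numerical data
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Successively merge the closest clusters; each point starts as its own cluster
Z = linkage(X, method="average", metric="euclidean")

# Dendrogram: larger vertical distances = merges between more dissimilar clusters
dendrogram(Z)
plt.ylabel("distance at merge")
plt.show()
```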
8. Asymmetric Misclassification Costs (under Judging Classifier Performance)
-Focuses on how effective we are at identifying the members of one particular class
-Assumes that the error of misclassifying a case belonging to one class is more serious than misclassifying a case of the other class
-Performance is then summarized by the average cost of misclassification per classified observation (see the sketch below)
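A minimal sketch of computing an average misclassification cost per observation from a classification matrix and a cost matrix (both sets of numbers are invented):

```python
import numpy as np

# Rows = true class, columns = predicted class (counts from a validation set)
confusion = np.array([[201, 25],    # true C1: 201 correct, 25 misclassified
                      [ 85, 2689]]) # true C2: 85 misclassified, 2689 correct

# Costs: correct classifications cost 0; the two error types cost differently
# (here, missing a C1 member is assumed to be 10x as costly)
costs = np.array([[0.0, 10.0],
                  [1.0,  0.0]])

avg_cost = (confusion * costs).sum() / confusion.sum()
print(avg_cost)   # average misclassification cost per classified observation
```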
NOTES from BOOK chapter 15:
Popular distance metrics / other distance measures for numerical data:
-Correlation-based similarity: sometimes it is more natural or convenient to work with a similarity between observations rather than a distance; a common similarity measure is the squared Pearson correlation coefficient
-Statistical distance (also called Mahalanobis distance): takes into account the correlation between measurements. With this distance, measurements that are highly correlated with other measurements do not contribute as much as those that are uncorrelated or mildly correlated
-Manhattan distance: looks only at the absolute differences rather than squared differences
-Maximum coordinate distance: looks only at the measurement on which observations i and j deviate most
NOTE: normalize numerical measurements before computing the Euclidean distance, to convert all measurements to the same scale; normalize continuous measurements (see the sketch below)
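A sketch of normalizing measurements to z-scores and then computing several of the distances above with scipy (the data are invented):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative data: age (years) and income ($) are on very different scales
X = np.array([[25.0,  49000.0],
              [47.0, 120000.0],
              [33.0,  56000.0]])

# Normalize each column to mean 0, std 1 so no single variable dominates the distance
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

print(squareform(pdist(Xz, metric="euclidean")))   # Euclidean distance
print(squareform(pdist(Xz, metric="cityblock")))   # Manhattan (absolute) distance
print(squareform(pdist(Xz, metric="chebyshev")))   # maximum coordinate distance
```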
Prediction vs Classification Method
Prediction methods are used for generating numerical predictions
Classification methods are used for generating propensities; by applying a cutoff value to the propensities, we can generate predicted class memberships
3. Comparing Training and Validation Performance (under Evaluating Predictive Performance)
Residuals
-Residuals based on the training set tell us about model fit
-Residuals based on the validation set (called "prediction errors") measure the model's ability to predict new data (predictive performance)
-Training errors should be smaller than the validation errors
-The more complex the model, the more it will overfit the training data
-In the extreme case of overfitting, the training errors would be zero while the validation errors would be nonzero and non-negligible
-Discrepancies can also be due to outliers, especially a large training error
-Validation errors may show slightly higher positive errors (underpredictions) than the training errors, as reflected by the median and the outliers
Strengths and Weaknesses of Hierarchical Clustering
Strengths:
¤Computationally efficient: if there are n data points, there are exactly n-1 merges. Also, no need to specify the number of clusters up front as in k-means
Weaknesses:
¤Unlike in k-means, where a data point may be reassigned among clusters, in hierarchical clustering once a merge has been done it cannot be undone
¤Works for numerical data only
Note from LECTURE:
-dendrogram = the branch diagram showing the merges
Strengths and Weaknesses of k-Means
Strengths:
¤The centroid is recalculated with each iteration
¤Data points get reassigned to new clusters as needed
Weaknesses:
¤Need to specify k, the number of clusters, in advance (the algorithm expects you to state how many clusters you want)
¤Works only when a mean is defined, i.e., works for numerical data
Note on choosing k: with only 2 clusters, points may get lumped into overly large groups; too many points per group gives a generalized average value (which doesn't reflect what you are looking for), while groups of only 2 are too localized
From LECTURE:
-we can add additional clusters, but too many is not good/not practical; the goal is to reduce the number of clusters. Somewhat loose groups are okay for grouping
-ask: are the points under each cluster as cohesive as possible?
3. The Classification Matrix (under Judging Classifier Performance)
-Most accuracy measures are derived from the classification matrix, also called the confusion matrix (see the sketch below)
-This matrix summarizes the correct and incorrect classifications that a classifier produced for a certain dataset
-Rows and columns of the classification matrix correspond to the true and predicted classes, respectively
-The classification matrix gives estimates of the true classification and misclassification rates
-If we have a large enough dataset and neither class is very rare, our estimates will be reliable
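A minimal sketch of building a classification (confusion) matrix and overall error rate on validation data with scikit-learn (the labels are invented):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative actual and predicted classes for a validation set
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Rows = actual (true) class, columns = predicted class
print(confusion_matrix(actual, predicted))
print("accuracy:", accuracy_score(actual, predicted))
print("error rate:", 1 - accuracy_score(actual, predicted))
```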
7. Performance in Unequal Importance of Classes (under Judging Classifier Performance)
The following pair of accuracy measures is the most popular:
1. Sensitivity: a classifier's ability to detect the important class (C1) members correctly
2. Specificity: a classifier's ability to rule out C2 members correctly
ROC Curve (see the sketch after this list)
-A more popular method for plotting the two measures is through ROC (Receiver Operating Characteristic) curves
-The ROC curve plots the pairs of sensitivity and 1 - specificity as the cutoff value increases from 0 to 1
-Better performance is reflected by curves that are closer to the top left corner
-The comparison curve is the diagonal, which reflects the performance of the naïve rule using varying cutoff values
-A common metric is the "area under the curve" (AUC), which ranges from 1 (perfect separation) down to 0.5 (no better than the naïve benchmark)
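A sketch of an ROC curve and AUC using scikit-learn (the propensities and labels are invented):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative validation set: actual class (1 = important class) and model propensity
actual     = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
propensity = [0.9, 0.2, 0.6, 0.8, 0.4, 0.55, 0.7, 0.1, 0.3, 0.65]

# fpr = 1 - specificity, tpr = sensitivity, one pair per cutoff value
fpr, tpr, _ = roc_curve(actual, propensity)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(actual, propensity):.2f}")
plt.plot([0, 1], [0, 1], "--", label="naive rule (diagonal)")
plt.xlabel("1 - specificity")
plt.ylabel("sensitivity")
plt.legend()
plt.show()
```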
Oversampling
-The same stratified sampling procedure is sometimes called weighted sampling or undersampling, the latter referring to the fact that the more plentiful class is undersampled relative to the rare class
-Oversampling is the term used here
-Oversampling is one way of incorporating misclassification costs into the training process
-Oversampling without replacement in accord with the ratio of costs is the optimal solution, but it may not always be practical
-The exact ratio of costs is difficult to determine
-When it comes time to assess and predict model performance, we will need to adjust for the oversampling in one of two ways (see the sketch below):
1. Score the model on a validation set that has been selected without oversampling
2. Score the model on an oversampled validation set, and reweight the results to remove the effects of oversampling
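A rough sketch of oversampling the rare class in the training data with pandas (the data and balance ratio are invented); as noted above, the validation set would be left untouched or its results reweighted:

```python
import pandas as pd

# Illustrative training data with a rare class (label 1)
train = pd.DataFrame({
    "x":     [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "label": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

rare     = train[train["label"] == 1]
frequent = train[train["label"] == 0]

# Oversample the rare class with replacement until the classes are balanced
rare_oversampled = rare.sample(n=len(frequent), replace=True, random_state=0)
train_balanced = pd.concat([frequent, rare_oversampled]).sample(frac=1, random_state=0)

print(train_balanced["label"].value_counts())
```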
NOTES from BOOK chapter 15 cont...
Under the hierarchical algorithm, different clustering (linkage) options:
-Single linkage: the distance measure used is the minimum distance (distance between the nearest pair of observations in the two clusters, one observation in each cluster)
-Complete linkage: the distance between two clusters is the maximum distance (between the farthest pair of observations)
-Average linkage: based on the average distance between clusters (across all possible pairs of observations)
-Centroid linkage: based on centroid distance, where clusters are represented by their mean value for each variable, which forms a vector of means; the distance between two clusters is the distance between these two vectors
-Ward's method: a measure of "error sum of squares" that measures the difference between individual observations and a group mean
(see the sketch after this list)
Validating clusters - coming up with meaningful/valid clusters:
1. Cluster interpretability: is the interpretation of the resulting clusters reasonable? Interpret the characteristics of each cluster
2. Cluster stability: do cluster assignments change significantly if some of the inputs are altered slightly? To check stability, partition the data and see how well clusters formed on one part apply to the other part
3. Cluster separation: examine the ratio of between-cluster variation to within-cluster variation to see whether the separation is reasonable
4. Number of clusters: the number of resulting clusters must be useful, given the purpose of the analysis
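The linkage options above map onto scipy's method argument; a brief sketch comparing them on the same invented data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# single = minimum distance, complete = maximum, average = average distance,
# centroid = distance between cluster centroids, ward = error sum of squares
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```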
1. Benchmark: The Average (under Evaluating Predictive Performance)
Using the average outcome:
-The prediction for a new record is simply the average outcome of the records in the training set (y-bar)
-A good predictive model should outperform this benchmark criterion in terms of predictive accuracy
A few popular numerical measures (see the sketch after this list):
1. MAE or MAD (mean absolute error/deviation): gives the magnitude of the average absolute error
2. Average error: similar to MAD except that it retains the sign of the errors, so negative errors cancel out positive errors of the same magnitude; gives an indication of whether the predictions are, on average, over- or under-predicting the response
3. MAPE (mean absolute percentage error): gives a percentage score of how much the predictions deviate, on average, from the actual values
4. RMSE (root mean squared error): similar to the standard error of estimate in linear regression, except that it is computed on the validation data rather than on the training data
5. SSE (total sum of squared errors)
Note: all of these measures are influenced by outliers; to check for outlier effects, we can compute median-based measures or simply plot a histogram or boxplot of the errors
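A small sketch computing these measures on invented validation-set predictions:

```python
import numpy as np

# Illustrative actual and predicted numerical outcomes for a validation set
actual    = np.array([100.0, 150.0, 200.0, 120.0, 180.0])
predicted = np.array([110.0, 140.0, 190.0, 135.0, 160.0])

errors = actual - predicted
print("MAE / MAD:", np.mean(np.abs(errors)))
print("Average error:", np.mean(errors))                        # keeps the sign
print("MAPE (%):", 100 * np.mean(np.abs(errors / actual)))
print("RMSE:", np.sqrt(np.mean(errors ** 2)))
print("SSE:", np.sum(errors ** 2))
```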