Data Anal Final

Modeling Essentials

1. Determine the type of prediction: provide a rule to transform a measurement into a prediction (prediction rules).
2. Select useful inputs: have a means of choosing useful inputs from a potentially large number of choices (split searching).
3. Optimize complexity: be able to adjust the complexity of the model to compensate for captured noise (pruning).

Exact Bayesian Classifier

1. Find all other records that are just like the target record, where the predictor values are exactly the same.
2. Determine which classes those matches belong to and which class is most prevalent.
3. Assign the target record to the most prevalent class.
The more predictor variables we have, the harder it becomes to find exact matches.

e

2.718

Decision Optimization

Accuracy: proportion of agreement between outcome and prediction, driven by high counts of true positives and true negatives.
Misclassification: disagreement between outcome and prediction.
Squared error = (target - estimate)^2
The aim is to minimize the disagreement between outcome and prediction.
Sensitivity: how many of the target cases are we able to classify correctly? Sensitivity = true positives / all actual positives
Specificity = true negatives / all actual negatives
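
The measures above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name and the counts are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, misclassification rate, sensitivity and specificity."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # true positives / all actual positives
        "specificity": tn / (tn + fp),  # true negatives / all actual negatives
    }


# Hypothetical counts, not taken from the notes.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and misclassification always sum to 1, so minimizing one maximizes the other.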

CART - Accuracy

Accuracy = (true positives + true negatives) / total cases

Agglomerative Methods

Begin with n clusters (each record is its own cluster). Keep joining records into clusters until one cluster is left (the entire dataset). Most popular. Problems: data driven, computationally intensive, only makes one pass, unstable. A small change to variables can lead to new clusters

Log Reg Impact of Single Predictors

For each predictor X, there is a coefficient Beta, and an associated standard error. The associated p-value indicates the statistical significance of the relationship between the predictor and the outcome. Low p-values indicate statistical significance or a relation that is most likely not due to chance. A statistically significant relationship is not necessarily a predictor which has a large impact on the outcome. If a sample is very large, p-values will be naturally smaller

Sequential Variable Selection

Forward Selection: creates a sequence of models of increasing complexity. Adds inputs with good p-values until no significant inputs are available.
Backward: creates a sequence of models of decreasing complexity. Starts with a full model and eliminates inputs that have poor p-values.
Stepwise: combines both forward and backward selection. As inputs are added, the algorithm re-evaluates the statistical significance of included inputs and removes those that are no longer eligible.

Bayesian Classifier Categorical Variables

The Bayesian classifier only works with categorical variables. Continuous variables that are used must be binned or transformed into dummy variables. It is the only model that works only with categorical predictors.

Descriptive Modeling

With basic descriptive modeling techniques such as RFM, you can identify potentially profitable customers. With predictive modeling, however, you can produce risk scores, probabilities, and automated, real-time, fact-based decisions. Descriptive statistics tell you about your sample and help you react to things that have happened in the past. They don't give you any insight into how things will behave in the future.

ROC Curve

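The baseline is the performance of a random model. The ROC curve bows above the baseline; the more it bows toward the top-left corner, the more predictive the model. A perfect model's ROC curve would reach the point (0, 1): a false positive rate of 0 and a true positive rate of 1.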

Lift/Gains Chart

The baseline represents the overall response rate. A model with good predictive power has PV+ well above the baseline at small depths, with PV+ falling toward the response rate as depth increases. The gains chart is used in marketing to decide how deep in a database to go. Lift = PV+ / response rate. At a given depth on the chart, selecting cases with the model yields lift times as many responders as selecting at random.
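
As a rough illustration of the lift formula, the sketch below ranks cases by model score, takes the top fraction (the depth), and divides the PV+ of that selection by the overall response rate. The function name, scores, and outcomes are hypothetical.

```python
def lift_at_depth(scores, outcomes, depth):
    """Lift when selecting the top `depth` fraction of cases ranked by score."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, round(depth * len(ranked)))
    selected = [outcome for _, outcome in ranked[:n_selected]]
    pv_plus = sum(selected) / n_selected            # responders among selected cases
    response_rate = sum(outcomes) / len(outcomes)   # baseline
    return pv_plus / response_rate


# Hypothetical model scores and actual outcomes (1 = responder).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_20 = lift_at_depth(scores, outcomes, depth=0.2)
lift_100 = lift_at_depth(scores, outcomes, depth=1.0)
print(lift_20, lift_100)
```

At a depth of 100% every case is selected, so PV+ equals the response rate and the lift is 1.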

z-score

z=(X-meanX)/St.Dev.

Log Reg SAS Steps

1. Add the data source to the diagram.
2. Data Partition node to split the data.
3. Variable Clustering node to remove redundant variables.
4. Regression node to generate a model.
5. Model Comparison node to determine the best model.

Log Reg Modeling Essentials

1. Determine the type of prediction: the prediction formula.
2. Select useful inputs: variable clustering and selection.
3. Optimize complexity: best model from sequence.
Goal: predict which class a new observation will belong to. We begin with many variables and get rid of the redundant ones with variable clustering. To determine the class, we must first find the probability estimate and then compare it to the cutoff value.

Naive Bayes Calculations

1. Establish a cutoff probability: cutoff = 0.5
2. Calculate the probability that each predictor category occurs for the class of interest (conditional probability): P(fraud|legaltrouble,small) = 0.5
3. Multiply the probabilities of the observed predictor categories together. Then multiply this number by the proportion of records belonging to the class of interest and divide by the sum of the same products for all classes (Naive Bayes probability): Pnb(fraud|legaltrouble,small) = 0.87
4. Apply the cutoff. Since 0.87 is above the cutoff, all companies that have legal trouble and are small are automatically classified as fraudulent.
5. The same process is applied for every other combination of predictor categories.
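
The calculation in step 3 can be sketched as follows. The ten training records, the field names, and the function name are hypothetical, so the resulting probability is not meant to reproduce the figures quoted above.

```python
from collections import Counter


def naive_bayes_probability(records, target_class, observed):
    """Naive Bayes probability of target_class given the observed categories.

    records:  list of (predictor_dict, class_label) pairs
    observed: dict mapping predictor name -> category of the new record
    """
    class_counts = Counter(label for _, label in records)
    numerators = {}
    for label, count in class_counts.items():
        value = count / len(records)  # proportion of records in this class
        for name, category in observed.items():
            matches = sum(1 for x, y in records if y == label and x[name] == category)
            value *= matches / count  # P(category | class)
        numerators[label] = value
    return numerators[target_class] / sum(numerators.values())


# Hypothetical training data: (prior legal trouble, company size) -> class.
rows = [
    ("yes", "small", "truthful"), ("no", "small", "truthful"),
    ("no", "large", "truthful"), ("no", "large", "truthful"),
    ("no", "small", "truthful"), ("no", "small", "truthful"),
    ("yes", "small", "fraud"), ("yes", "large", "fraud"),
    ("no", "large", "fraud"), ("yes", "large", "fraud"),
]
records = [({"legal": legal, "size": size}, label) for legal, size, label in rows]

p_fraud = naive_bayes_probability(records, "fraud", {"legal": "yes", "size": "small"})
is_fraud = p_fraud > 0.5  # apply the cutoff from step 1
print(round(p_fraud, 3), is_fraud)
```

If a category never occurs with a class in the training data, that class's product becomes 0, which is the zero-probability problem noted among the Naive Bayes disadvantages.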

Log Reg Example: Personal Loan

1. Partition the data.
2. Create dummy variables for categorical predictors.
3. Construct a logistic regression model, where Prob(1) is the probability that the personal loan is accepted:
Prob(1) = 1 / (1 + e^-logit)
Odds(1) = e^logit
4. Find the estimated coefficients: Beta0 = -6.3525, Beta1(Income) = 0.0392
5. Final model: Prob(1) = 1 / (1 + e^-(-6.3525 + 0.0392X))
We are predicting the probability of the categorical class given income as the predictor, and classifying the customer as a loan acceptor or non-acceptor.
Odds(1) = e^(-6.3525 + 0.0392(0)) = 0.0017. The odds that a customer with 0 income will accept the loan are 0.0017. These are the base-case odds.
Odds(1) = e^(-6.3525 + 0.0392(100)) = 0.088. The odds that a customer with 100K income will accept the loan are 0.088.
Moving from income = 0 to income = 100 multiplies the odds of accepting by e^(0.0392*100), roughly 50 times the base-case odds.
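
The arithmetic in this example can be checked with a short sketch that uses the coefficients listed in step 4 (Beta0 = -6.3525, Beta1 = 0.0392). The function names are illustrative, and income is assumed to be measured in thousands, as in the example.

```python
import math

BETA0 = -6.3525  # intercept from step 4
BETA1 = 0.0392   # income coefficient from step 4


def logit(income):
    return BETA0 + BETA1 * income


def odds(income):
    return math.exp(logit(income))


def probability(income):
    return 1 / (1 + math.exp(-logit(income)))


base_odds = odds(0)            # odds at income = 0
odds_100 = odds(100)           # odds at income = 100
factor = odds_100 / base_odds  # equals e^(BETA1 * 100)

print(round(base_odds, 4), round(odds_100, 3), round(factor, 1))
print(round(probability(100), 3))
```

The probability at income = 100 is still small because odds of about 0.088 correspond to a probability of 0.088 / 1.088.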

Log Reg Two Step Process

1. calculate estimates of the probabilities of belonging to each class (the probability of belonging to class 1) 2. a cutoff value is used to classify each case into one of the classes

CART Disadvantages

CARTs are sensitive to changes in data; even a slight change can result in very different splits. Since splits are made on single predictors rather than on combinations of predictors, relationships between predictors may be missed. (Linear relationship models are better at capturing these.) Trees perform poorly on datasets that are best split diagonally. This can be fixed by creating new combination predictors or by using a linear relationship model instead. Classification trees require large datasets to be most effective. Random Forests attempt to fix this issue by creating multiple classification trees from the data to create a forest. Their output is combined to obtain the best classifier. Computationally, CARTs are hard to grow and prune.

Chi-Squared Automatic Interaction Detection CHAID

CHAID is a recursive partitioning method that employs the chi-squared test for independence to test whether splitting a node improves the purity by a statistically significant amount. At each decision node, we split on the predictor that has the strongest association with the response variable. This association is measured by the p-value of a chi-squared test of independence. Tree growth stops when, for the best predictor, the test shows no significant improvement. This method is only suitable for categorical variables. Continuous variables must be binned into categorical groups.

Cluster Analysis

Clustering aims to segment data into sets of homogeneous groups for the purpose of generating insights.

Conditional Probability

Conditional probability is the probability of event A, given that event B has occurred P(A|B)

Pattern Discovery Applications

Data reduction: exploiting patterns in data to create a more compact representation of the original.
Novelty detection: seeking unique or previously unobserved data patterns.
Profiling: creating rules that isolate clusters based on demographics or behavioral measurements.
Market Basket Analysis: analyzing streams of transaction data for combinations of items that occur more commonly than expected.
Sequence: examining data for sequences of items that occur more commonly than expected.

Pattern Discovery

Demographic characteristics, customer types, customer activity for fraud detection, segmenting locations for new stores

Tree Structure

Each split on the chart is depicted as a node branching off into two successor nodes. If there was a split at x=10, everything <=10 would move down the left, everything >10 would move down the right. Moving down the tree, you eventually end up at a rectangular terminal node that classifies the target. These are also known as the leaves. A cutoff value can be established to determine the proportion needed for determining the leaf class. This is useful when success classes are rare and require leverage.

CART Measures of Impurity = Entropy

Entropy = -(proportion belonging to A) * log2(proportion belonging to A) - (proportion belonging to B) * log2(proportion belonging to B)
0 = purest; log2(number of classes) = every class is represented equally, least pure
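
A minimal sketch of the entropy measure, generalized to any number of classes. The function name is illustrative, and the term for a class with proportion 0 is treated as 0, which is the usual convention.

```python
import math


def entropy(proportions):
    """Entropy of a node given the proportion of records in each class."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)


pure = entropy([1.0, 0.0])      # all records in one class
even_two = entropy([0.5, 0.5])  # two equally represented classes
even_four = entropy([0.25] * 4)  # four equally represented classes
print(pure, even_two, even_four)
```

The maximum for equally represented classes is log2(number of classes), so it is 1 for two classes and 2 for four.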

CART - Error Rate

Error Rate = (false positives + false negatives) / total cases

Logistic Regression Advantages

Even small datasets can be used for building logistic regression classifiers. Once the model has been estimated, it is computationally fast and cheap to classify, even with large samples.

CART Measures of Impurity = GINI

Gini index = 1 - (proportion of units belonging to class A)^2 - (proportion of units belonging to class B)^2
The higher the Gini index, the less pure the node. 0 = purest. Also known as the Simpson Diversity Index.
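
The Gini index can be sketched in the same way as entropy. The function name is illustrative, and the formula is written for any number of classes rather than only A and B.

```python
def gini(proportions):
    """Gini impurity of a node given the proportion of records in each class."""
    return 1 - sum(p ** 2 for p in proportions)


gini_pure = gini([1.0, 0.0])   # every record in one class
gini_even = gini([0.5, 0.5])   # two equally represented classes
gini_mixed = gini([0.9, 0.1])  # mostly one class
print(gini_pure, gini_even, gini_mixed)
```

For two classes the index peaks at 0.5 when both classes are equally represented.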

Classification Rules from Trees

IF(income>92.5) AND (education>1.5) AND (income>114.5) THEN CLASS = 1

Logworth

In the identification of split points, all midpoints between consecutive values of a variable are considered. The selected split point is the one that maximizes the Logworth of the resulting partition. Logworth = -log10(chi-squared p-value)

Log Reg Computing Parameter Estimates

Logistic regression models determine the relation between Y and the Beta parameters with the maximum likelihood method, which finds the parameter estimates that maximize the chances of obtaining the data that we have. Algorithms to calculate logistic regression coefficient estimates are less robust than linear regression algorithms. Computed estimates are most reliable for well-behaved datasets with even distributions of the 0 and 1 classes. It also helps to have a small number of predictors relative to the sample size. Collinearity between predictor variables can also lead to difficulties.

Logistic Regression Formula

Logit = Beta0 + Beta1X1 + Beta2X2
Log(odds) = Beta0 + Beta1X1 + Beta2X2
Log(p / (1 - p)) = Beta0 + Beta1X1 + Beta2X2
Logit(p) = Beta0 + Beta1X1 + Beta2X2
Probability = Odds / (1 + Odds)
Probability = 1 / (1 + e^-logit)
Odds = p / (1 - p)
Odds = e^logit
Odds = probability of target / probability of non-target
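
The conversions between logit, odds, and probability can be sketched as small helper functions. The names are illustrative; each function is a direct transcription of one of the identities above.

```python
import math


def logit_to_odds(logit):
    return math.exp(logit)


def odds_to_probability(odds):
    return odds / (1 + odds)


def logit_to_probability(logit):
    return 1 / (1 + math.exp(-logit))


def probability_to_logit(p):
    return math.log(p / (1 - p))


for odds in (0.1, 1, 10):
    print(odds, round(odds_to_probability(odds), 3))
```

Going through the odds or going directly from the logit gives the same probability, which is a quick way to check that the identities are consistent.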

Logistic Regression Model

Logit = Beta0 + Beta1X1 + Beta2X2...
Log(Odds) = Beta0 + Beta1X1 + Beta2X2...
Logit = Log(Odds)
Probability of Class 1 = 1 / (1 + e^-logit)
The Logit can take any value from negative to positive infinity. The Logit is the dependent variable and is modeled as a linear function of the predictors. The higher the Logit, the greater the odds. Low or negative Logits result in low odds. The higher a Beta coefficient, the greater the predictor's association with higher probabilities of belonging to the target class. Positive coefficients in the logit model translate to coefficients larger than 1 in the odds model. Negative coefficients in the logit model translate to coefficients smaller than 1 in the odds model.

Naive Bayes Advantages

Naive Bayes classifiers are simple, computationally efficient, and have good classification performance. Naive Bayes often outperforms more sophisticated classifiers, even if the predictors are not fully independent, and also when the number of predictors is large. It handles purely categorical data very well and works well with very large datasets.

Naive Bayes Characteristics

Naive Bayes is a data-driven, not model-driven, prediction method. No statistical model is fitted to the data. Its one assumption is that the predictor variables are independent within each class. (The independence doesn't have to hold exactly, as long as it is good enough.) The final prediction of Naive Bayes does not differ greatly from that of the Exact Bayes classifier. Naive Bayes focuses its attention on complex interactions between predictors and classes and local structures.

Naive Bayes Disadvantages

Naive Bayes requires a very large number of records for good results. When a predictor category is not represented in the training data, the model automatically assigns it a probability of 0. Naive Bayes produces biased results when the goal is to predict the probability of membership in a particular class. It is good for classifying or ranking, not for estimating probabilities: the probability rankings are more accurate than the actual probability estimates.

Odds

Odds = p / (1 - p)
Odds = e^logit
Odds = probability of target / probability of non-target
A one-unit increase in X multiplies the odds by e^Beta.
The odds of belonging to class 1 are the ratio of the probability of belonging to class 1 to the probability of belonging to class 0.
High probability = high odds
Odds 1 = Probability 0.5 (50/50)
Odds 10 = Probability 0.91
Odds 0.1 = Probability 0.091

Odds Ratio

Odds Ratio is different from Odds. An odds ratio is the ratio of two odds. Odds ratios show the strength of the association between the predictor and the response variable.
logit(p) = -.7567 + .4373*(gender)
Odds ratio = Odds(female) / Odds(male)
Odds ratio = e^(-.7567 + .4373) / e^(-.7567) = e^.4373 = 1.55
An odds ratio of 1.55 means that females have 1.55 times the odds of having the outcome compared to males.
Odds ratio = 1: no association
Odds ratio > 1: the group in the numerator has higher odds of having the event
Odds ratio < 1: the group in the denominator has higher odds of having the event
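
The odds ratio in this example can be reproduced with a short sketch. It assumes gender is coded 1 for female and 0 for male, which is what the calculation above implies; the variable names are illustrative.

```python
import math

INTERCEPT = -0.7567
GENDER_COEF = 0.4373  # assumed coding: gender = 1 for female, 0 for male

odds_female = math.exp(INTERCEPT + GENDER_COEF * 1)
odds_male = math.exp(INTERCEPT + GENDER_COEF * 0)
odds_ratio = odds_female / odds_male

print(round(odds_female, 3), round(odds_male, 3), round(odds_ratio, 2))
```

The intercept cancels in the division, so the odds ratio for a 0/1 predictor is simply e raised to its coefficient.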

Log Odds

Odds can be any non-negative number. Using odds instead of probability gets rid of the upper bound. Log(Odds) also gets rid of the lower bound and allows negative numbers. Odds are not symmetric; Log(Odds) is symmetric.
Log(odds) 0 = Probability 50%
Log(odds) is highly negative for low probabilities
Log(odds) is highly positive for high probabilities

Avoid Overfitting CART

Overfitting leads to poor performance with actual/new data. The error rate on the training data keeps decreasing as the number of levels in the tree grows, while the tree becomes more and more overfitted. This is because the tree models the local noise instead of the actual relationship between the variables. As the number of levels increases, the splits are based on an ever-decreasing number of sample observations and thereby become less reliable.

PV+

PV+ = true positives / total predicted positives

PV-

PV- = true negatives / total predicted negatives

Log Reg Variable Selection

Performance of the model is based on the validation data. Simpler models that perform roughly as well as more complex models are preferred. Performance on validation data may be overly optimistic compared to performance on new data.

Pattern Discovery Cautions

Poor data quality, opportunity, intervention, separability, obviousness, and nonstationarity.

Classification Trees for Two or More Classes

Pretty much the exact same thing, except now the terminal nodes can have more possible classifications

Logistic Response Function

Probability = 1 / (1 + e^-logit)
For any values of the X predictors on the right-hand side, the result will always fall within the 0-1 range.

Logistic Regression Profiling

Profiling is when logistic regression is used on data where the class is known, in order to find similarities between observations within each class (in terms of predictor variables). Example: finding factors that differentiate between male and female top executives.

Recursive Partitioning

Recursive: operating on the results of prior divisions. The area of the X-Y chart is divided again and again until all of the points/outcomes on the chart are in their own homogeneous rectangles. It begins by splitting the area in two, with each side being as homogeneously classified as possible. Further splits are made within those splits to further divide the points. The ultimate goal is to divide the space up into rectangles that are as pure as possible. The split point is determined with an algorithm that examines each predictor variable and all possible split values for each of these variables to determine the best possible split. All possible split values are located at the midpoint between two consecutive data points. These split points are ranked according to how much they reduce impurity. Reduction in impurity = overall impurity before the split - combined impurity of the child rectangles. Recursive partitioning works with categorical variables by transforming them into binary dummy variables.
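
The split search for a single numeric predictor can be sketched as below. This is a simplified illustration rather than a full tree-growing algorithm: it assumes 0/1 class labels, uses the Gini index as the impurity measure, and combines the child impurities as a size-weighted average. The function names and data are hypothetical.

```python
def gini_of_labels(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1 - p ** 2 - (1 - p) ** 2


def best_split(values, labels):
    """Return (split_value, impurity_reduction) for the best midpoint split."""
    distinct = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    parent = gini_of_labels(labels)
    n = len(values)
    best = None
    for split in candidates:
        left = [y for x, y in zip(values, labels) if x <= split]
        right = [y for x, y in zip(values, labels) if x > split]
        children = (len(left) * gini_of_labels(left)
                    + len(right) * gini_of_labels(right)) / n
        reduction = parent - children
        if best is None or reduction > best[1]:
            best = (split, reduction)
    return best


# Hypothetical data: the classes separate cleanly between 6 and 8.
values = [2, 4, 6, 8, 10, 12]
labels = [0, 0, 0, 1, 1, 1]
split_value, reduction = best_split(values, labels)
print(split_value, reduction)
```

A full recursive partitioner would repeat this search over every predictor and then again inside each resulting rectangle.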

Regression Trees

Regression Trees are trees for numerical response variables. They follow the same general process, whereby many splits are made and for each split we attempt to minimize impurity. The value of the terminal/leaf node is determined by averaging the training data that ends up in that leaf. The impurity of a node is computed by subtracting the leaf mean from each point in the leaf, squaring these deviations, and summing them. It is the sum of squared deviations from the leaf mean. The lowest impurity is 0, which occurs when all values in the node are equal. Performance of the regression tree is measured in the same way as for other predictive models: root mean squared error and lift charts.
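
The leaf value and leaf impurity described above can be sketched as two small functions. The names and the sample values are hypothetical.

```python
def leaf_prediction(values):
    """Predicted value of a leaf: the mean of the training responses in it."""
    return sum(values) / len(values)


def leaf_impurity(values):
    """Sum of squared deviations from the leaf mean."""
    mean = leaf_prediction(values)
    return sum((v - mean) ** 2 for v in values)


constant_leaf = [5.0, 5.0, 5.0]  # all values equal, so impurity is 0
mixed_leaf = [1.0, 2.0, 3.0]
print(leaf_prediction(mixed_leaf), leaf_impurity(mixed_leaf), leaf_impurity(constant_leaf))
```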

Pruning - Sampling Error

Sampling error can also be incorporated for a better prune. Sampling error accounts for the probability of the minimum varying if we had a different data sample. The best pruned tree is the smallest tree that has an error within one standard error of the minimum error tree.
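
The one-standard-error rule can be sketched as follows. The candidate trees, their validation errors, and the standard error are hypothetical numbers, and tree size is measured here by the number of leaves, which is an assumption.

```python
def best_pruned_tree(candidates):
    """Smallest tree whose error is within one standard error of the minimum.

    candidates: list of dicts with keys 'leaves', 'error' and 'std_error'.
    """
    minimum = min(candidates, key=lambda tree: tree["error"])
    threshold = minimum["error"] + minimum["std_error"]
    eligible = [tree for tree in candidates if tree["error"] <= threshold]
    return min(eligible, key=lambda tree: tree["leaves"])


# Hypothetical sequence of pruned trees.
candidates = [
    {"leaves": 15, "error": 0.14, "std_error": 0.02},
    {"leaves": 7, "error": 0.15, "std_error": 0.02},
    {"leaves": 3, "error": 0.20, "std_error": 0.02},
    {"leaves": 1, "error": 0.40, "std_error": 0.02},
]
chosen = best_pruned_tree(candidates)
print(chosen["leaves"], chosen["error"])
```

Here the 15-leaf tree has the minimum error, but the 7-leaf tree is within one standard error of it and is therefore preferred.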

Best Variable Selection

Selection is by the 1 - R^2 ratio; the smaller the ratio, the better.
1 - R^2 ratio = (1 - R^2 own cluster) / (1 - R^2 next cluster)
This ratio quantifies how dissimilar an input is from its own cluster compared to how dissimilar it is to other clusters. The inputs with the lowest ratio are thought to best represent their cluster. R^2 = how much variance is explained by the variable. This tells the impact of each variable on the logit. Validation error = average squared error.

CART - Sensitivity

Sensitivity = true positives / total actual positives

Dendrogram

Shows cluster hierarchy

CART - Specificity

Specificity = true negatives / total actual negatives

Divisive Methods

Start with one all-inclusive cluster. Repeatedly divide it into smaller clusters.

Pruning - Cost Complexity Criterion

The CART algorithm uses the cost complexity criterion of a tree to generate a sequence of trees that are successively smaller (until there is only the root node). Cost Complexity Criterion = misclassification error (training data) + penalty factor for the size of the tree. We start with a fully grown tree and increase the penalty factor until the cost complexity criterion of the full tree exceeds that of a smaller tree. Among all trees of a given size, choose the one with the lowest CC. This is repeated for each size of the tree.
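
A minimal sketch of the criterion, assuming the penalty takes the common form alpha times the number of leaves (the notes say only "penalty factor for the size of the tree"). The candidate trees and their training errors are hypothetical.

```python
def cost_complexity(error, n_leaves, alpha):
    """Training misclassification error plus a size penalty (assumed form)."""
    return error + alpha * n_leaves


def best_tree_for_penalty(trees, alpha):
    """Pick the (n_leaves, error) pair with the lowest cost complexity."""
    return min(trees, key=lambda tree: cost_complexity(tree[1], tree[0], alpha))


# Hypothetical (number of leaves, training error) pairs, one per tree size.
trees = [(1, 0.40), (3, 0.20), (7, 0.15), (15, 0.14)]

for alpha in (0.0, 0.01, 0.05, 0.2):
    leaves, error = best_tree_for_penalty(trees, alpha)
    print(alpha, leaves)
```

As the penalty grows, the preferred tree shrinks from the full tree toward the root node, which is how the sequence of successively smaller trees arises.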

Log Reg Cutoff Probability

The basic rationale is to assign an observation to the class in which its probability of membership is the highest. The cutoff can be chosen to maximize overall accuracy, to maximize sensitivity, to minimize false positives, or with a cost-based approach that minimizes the expected cost of misclassification.

Log Reg Evaluating Performance

The goal is to find a model that accurately classifies observations to their class using only the predictor information. We aim to find a model that best identifies the members of the class of interest.
Lift Chart: the amount of lift over the base curve indicates, for a given number of cases measured on the x axis, the number of additional acceptors that can be identified using the model.
Decile Chart: taking the top X% of records that are ranked by the model as most probable to belong to class 1 yields Y times as many acceptors as a random selection.

CART

The goal of classification and regression trees is to classify or predict an outcome based on a set of predictors. The output is a set of rules that results in an outcome prediction.

Logistic Regression Classification

The goal of logistic regression is to predict which class a new observation will belong to. It classifies based off the values of its predictors. It produces the probability p that the event will occur. Logistic regression requires that the dependent output variable Y is categorical. Continuous Y variables can be binned and transformed into binary data. X predictors can be continuous or categorical.

Stopping Tree Growth

Tree growth can be limited with: 1. number of splits (tree depth) 2. minimum number of observations per decision node 3. minimum reduction in impurity for a new split to occur

Pearson Chi-squared Test

The Pearson chi-squared test quantifies the independence of the counts in the table's columns. It tells you whether the proportion of the outcome class is the same in each child node of the split node. It essentially measures how similar the two nodes are. The larger the chi-squared, the less similar the groups are. The smaller the chi-squared, the more similar the groups are. The smaller the p-value, the less similar the groups are. Hence, a large chi-squared value or a small p-value indicates it is unlikely that the two groups are similar. The child nodes should have very different proportions from the parent node; otherwise the split would be pointless.
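
For a binary split and a binary outcome, the test can be sketched with the standard library alone. The 2x2 counts are hypothetical, the p-value formula is specific to one degree of freedom, and the Logworth line assumes a base-10 logarithm.

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared statistic for a 2x2 table of counts.

    Rows are the two child nodes; columns are the two outcome classes.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical split: (events, non-events) in the left and right child nodes.
table = ((30, 70), (60, 40))
chi2 = chi_squared_2x2(table)
p_value = p_value_1df(chi2)
logworth = -math.log10(p_value)  # assumed base-10 logarithm
print(round(chi2, 3), p_value, round(logworth, 2))
```

The two child nodes have event rates of 30% and 60%, so the statistic is large and the p-value is small, which marks this as a worthwhile split.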

Classification and Regression Trees

This method of classification performs well across a wide range of situations without requiring much effort from the analyst. CART is highly understandable by the consumer due to its easily understandable classification rules, as long as the tree isn't too large. CART works by recursively partitioning the space containing predictor variables into homogenous segments. The full tree is then pruned using validation data.

Evaluating CART performance

Training data is used to grow the tree. The fully grown tree is overfitted, with up to 100% accuracy on the training data. Validation data is then used to assess the performance of the tree. A third, test data set can be used to assess the accuracy of the final tree.

CART Advantages

Tree methods are useful for variable selection, with the most important predictors showing up at the top of the tree. There is little effort required from users, as there is no need to transform or select variables. CART is intrinsically robust to outliers, since splits are based on the ordering/grouping of observations and not their absolute magnitude. Since CARTs don't assume a specific relationship between response and predictor variables, they are nonlinear and nonparametric. This allows them to capture a wide range of relationships. Classification trees are useful classifiers when the data is spaced adequately for horizontal and vertical splitting. Trees are able to handle missing data without having to impute values or delete observations with missing values. CART can be extended to incorporate an importance ranking of the predictor variables in terms of their impact on the quality of the classification.

Log Reg Variable Clustering

Two-phase approach for input variable reduction:
1. Variable clustering to minimize input redundancy (group highly correlated predictor variables).
2. A sequential selection method to remove irrelevant inputs and find a model with good predictive performance on the validation set.
Input redundancy results in model instability. Small changes in data may lead to a radically different model.

Unsupervised Classification

Unsupervised classification is the grouping of cases based on similarities in input values. Cluster analysis is an unsupervised classification technique. The resulting clusters and their characteristics can be applied to classify new cases. Unsupervised cluster analysis can often be a step in the predictive modeling process

Euclidean Distance

Used to measure distance between records. The variables must first be normalized with z-scores so that each contributes on the same scale.
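
The effect of normalizing before measuring distance can be sketched as below. The records are hypothetical, and the sample standard deviation is used for the z-scores, which is an assumption.

```python
import math
from statistics import mean, stdev


def z_scores(column):
    """Standardize a column: (x - mean) / standard deviation."""
    center, spread = mean(column), stdev(column)
    return [(x - center) / spread for x in column]


def euclidean(a, b):
    """Euclidean distance between two records of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical records: income in thousands and age in years.
income = [50, 100, 150]
age = [30, 40, 50]

raw_records = list(zip(income, age))
normalized_records = list(zip(z_scores(income), z_scores(age)))

raw_distance = euclidean(raw_records[0], raw_records[1])
normalized_distance = euclidean(normalized_records[0], normalized_records[1])
print(round(raw_distance, 2), round(normalized_distance, 3))
```

Without normalization the income difference dominates the distance simply because income is measured on a larger scale.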

Pruning

Validation data is used to prune back the deliberately overgrown tree built from the training data. The weakest branches, which hardly reduce the error rate, are removed. Branches made with just a few points are removed (these capture only noise). Pruning consists of successively selecting a decision node and redesignating it as a terminal node by cutting off the splits below it. Pruning stops when the validation error begins to rise. At each stage of pruning, multiple trees are possible. The Cost Complexity Criterion is used to select the best tree. The performance of the selected pruned tree on the validation data is not fully reflective of its performance on new data, because the validation data was used in the pruning process. Hence, there is a slight bias in favor of the validation data. This is when it is useful to employ a test dataset.

Predictive Modeling

Predictive modeling techniques, paired with scoring and good models, enable you to make decisions about the past, present, and future. They are more powerful and forward looking. Examples include marketing, churn analysis, credit scoring, and fraud detection.

Hierarchical Clustering

Sequentially groups cases based on the distance between cases and clusters. Begins with a 22x22 distance matrix if you have 22 cases. Not practical with large datasets.

