BUAN 6356 Exam 2
OLS regression
Y = alpha + betax + E, where Y takes values in {0,1}
difference between classification and clustering& association rules
classification: supervised learning, dependent variable is discrete/nominal, essence is to find rules to separate objects into predetermined classes, needs at least training and validation datasets to train classifiers -clustering and association rules: unsupervised learning, essence is that clusters/rules are not predetermined
fraud
classifying credit card purchases into fraudulent and legitimate ones
examples of classification
classifying customers as loyal vs likely to discontinue contract, classifying credit risk as high vs low
classification applications
direct marketing, customer profiling, fraud, prediction of stock price, medical diagnosis
lift chart
graphically represents the improvement a mining model provides compared against a random guess and measures the change in terms of a lift score -a model whose curve is closer to the top and left is a better model
supervised learning
has a dependent variable, e.g. class A or B?
unsupervised learning
has no dependent variable, there is no predetermined class
testing data
helps estimate the performance of the chosen classifier
validation data
helps tune classifiers, e.g. pruning a decision tree, or help choose among different classifiers
overfitting problem
many possible splitting rules can near perfectly classify the training data but may not generalize to future datasets, particularly a problem with small datasets -too many branches end up fitting noise in the training data set -effect: great accuracy on training data, poor accuracy on unseen samples -error rate on validation data (e.g. average squared error) increases as the number of splits (complexity) increases
predictive task
mapping instances onto a predefined set of classes - classification
normalization
normalize to remove scale effect - divide by stdev -usually performed before PCA
new data
only input information is available
logistic regression model
p = 1/(1+e^-(alpha+betax+E)) -p = probability that event Y occurs, i.e. p(Y=1) -constrains estimated probabilities between 0 and 1 -if (alpha+betax)=0, p=.5 -as (alpha+betax) increases, p approaches 1 -as (alpha+betax) decreases, p approaches 0 -this expression is equivalent: ln(p/(1-p)) = alpha+betax+E -p/(1-p) = the odds -ln(p/(1-p)) = the natural logarithm of the odds, the log odds or logit -this model overcomes problems of linear regression: it selects regression coefficients to force predicted values for Y to be between 0 and 1; produces s-shaped regression predictions rather than a straight line; selects coefficients through maximum likelihood estimation
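a minimal sketch of the logistic transform in Python, assuming numpy and hypothetical values for alpha and beta:

    import numpy as np

    alpha, beta = -2.0, 0.5          # hypothetical coefficients
    x = np.array([0.0, 4.0, 10.0])   # hypothetical predictor values

    logit = alpha + beta * x         # alpha + betax, the log odds
    p = 1 / (1 + np.exp(-logit))     # logistic transform keeps p between 0 and 1
    print(p)                         # p = .5 exactly where alpha + betax = 0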
customer profiling
predicting whether a customer will respond positively to a product
direct marketing
predicting whether or not a customer will respond to mailing
neural network
simplified model of the human brain -a function approximator: transforms inputs to outputs
Mean Squared Error (MSE)
average of (actual-predicted)^2 over all cases
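a quick sketch with made-up numbers, assuming numpy (MSE is the average of the squared errors):

    import numpy as np

    actual    = np.array([10.0, 12.0, 9.0, 15.0])   # hypothetical observed values
    predicted = np.array([11.0, 11.5, 8.0, 16.0])   # hypothetical model predictions
    mse = np.mean((actual - predicted) ** 2)        # average of (actual-predicted)^2
    print(mse)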
if the target is categorical
use accuracy rate, misclassification rate, etc. -categorical: a prediction is either correct or not
if target is numerical
use mean squared error -numerical: a prediction can be close rather than exact, e.g. sales
scoring
using new dataset where the value of the dependent variable is unknown
backward propagation of errors
weights are adjusted by observing errors in output and propagating adjustments back through the network
odds ratio
(odds(X=x+1)/odds(X=x))
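since the log odds are linear in x, a one-unit increase in x multiplies the odds by e^beta; a toy check with hypothetical coefficients:

    import numpy as np

    alpha, beta = -2.0, 0.5                 # hypothetical coefficients
    odds = lambda x: np.exp(alpha + beta * x)

    x = 3.0
    print(odds(x + 1) / odds(x))            # odds ratio
    print(np.exp(beta))                     # same value: e^beta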
advantages of neural networks
-can handle most types of response - discrete/continuous -does not require prior knowledge of relationships between inputs and outputs -nonlinear relationships and interactions taken care of inherently -robust, works well even when training sample contains errors -prediction accuracy is comparable with other statistical methods
item-user matrix
-cells are user preferences for items -preferences can be ratings or binary
disadvantages of neural networks
-complex setup process -potentially long training time -black box, no insight into relationships between predictors and outcome -non-intuitive model - difficult to interpret results -no statistical test to assess importance of weights -may lead to overfitting
evaluating model performance
-commonly used statistics to compare alternative models (or to evaluate the performance of a single model) -compare the model that includes only alpha versus the model that includes alpha, beta -model chi square, percent correct predictions, pseudo R2
collaborative filtering
-community of users -opinions on items obtained implicitly or explicitly -opinions of others are used to predict a user's unknown opinion -key idea: filter info based on user preference -user based collaborative filtering: find users who share the same rating patterns with the user of interest, use ratings from these like minded users to estimate a prediction for the user of interest -item based collaborative filtering: build item-item matrix of relationships between pairs of items to find and recommend similar items
structure of neural network
-composed of many neurons that cooperate to perform the desired function -the output of a neuron is a function of the weighted sum of its inputs plus a bias: output = f(w1i1+...+wnin + bias), where f is the activation function
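a minimal sketch of one neuron's output, with hypothetical weights/inputs and a sigmoid assumed as the activation function:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))          # squashes the weighted sum into (0, 1)

    weights = np.array([0.4, -0.7, 0.2])     # hypothetical weights w1..wn
    inputs  = np.array([1.0, 2.0, 0.5])      # hypothetical inputs i1..in
    bias    = 0.1
    output = sigmoid(np.dot(weights, inputs) + bias)   # f(w1i1 + ... + wnin + bias)
    print(output)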
growing the tree using chi square & logworth
-compute the logworth of the best split for each interval-valued input variable -then select the variable that has the highest logworth and use its split, suppose it is balance -under each of the two balance nodes, we then find the logworth of the best split of age and continue the process -stopping criteria for splitting: threshold on the significance of the chi square value for splitting, no more improvement (lowest misclassification rate, the proportion of mismatch between prediction and observation)
chi square statistic
-computes a measure of how different the number of observations in each of the four cells is compared to the expected numbers; the p-value associated with the null hypothesis is computed -logworth of the p-value: logworth = -log10(pvalue) -low p-value -> high logworth -the split that generates the highest logworth for a given input variable is selected
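a sketch of the chi square test and logworth on a made-up 2x2 contingency table, assuming scipy:

    import numpy as np
    from scipy.stats import chi2_contingency

    # hypothetical 2x2 table: rows = the two split groups, columns = default / non-default
    table = np.array([[30, 70],
                      [10, 90]])

    chi2, pvalue, dof, expected = chi2_contingency(table)
    logworth = -np.log10(pvalue)             # low p-value -> high logworth
    print(chi2, pvalue, logworth)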
decision tree strengths
-easy to understand and interpret -easy to implement -running time is low -very popular
entropy
-entropy(A) = -sum(q_k log2(q_k)) -q_k = proportion of cases in set A that belong to class k (of m classes) -ranges between 0 (most pure) and log2(m) (equal representation of all classes)
information gain
-expected reduction in impurity (entropy, gini index) -suppose node N gets partitioned into M child nodes {c1...cM} given attribute A -information gain = entropy reduction = entropy of N - sum of weighted entropies of c1...cM -gain(A) = entropy of N - sum(weighted entropy of ci) -select the attribute with the highest information gain
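a sketch of entropy and information gain for a binary class using the definitions above (made-up counts):

    import numpy as np

    def entropy(counts):
        q = np.array(counts, dtype=float)
        q = q[q > 0] / q.sum()                  # class proportions (drop empty classes)
        return -np.sum(q * np.log2(q))          # -sum(q log2(q))

    parent   = [10, 10]                         # hypothetical class counts at node N
    children = [[8, 2], [2, 8]]                 # counts in child nodes c1, c2 after splitting on A

    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    print(entropy(parent) - weighted)           # gain(A) = expected reduction in entropy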
input variable selection
-for each split of an input variable a chi square statistic is calculated -a contingency table is formed that maps default and non-default against the partitioned input variable -for example, the null hypothesis might be that there is no difference between people with income <50k and >50k; if the null hypothesis is rejected, splitting leads to a purer subgroup -the lower the p-value, the more likely we are to reject the null hypothesis, meaning the income split is a discriminating factor
pearson correlation
-for each user pair, find co-rated items and calculate the correlation between the vectors of their ratings for those items -note that the average rating for each user is computed across all products they rated, not just the co-rated ones -correlation = sum((r1-r1bar)*(r2-r2bar)) / (sqrt(sum((r1-r1bar)^2)) * sqrt(sum((r2-r2bar)^2)))
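a sketch of the user-user Pearson correlation over co-rated items, with each user's mean taken over all items they rated (hypothetical ratings; NaN marks unrated items):

    import numpy as np

    r1 = np.array([4.0, np.nan, 3.0, 5.0, 2.0])   # hypothetical ratings of user 1 (NaN = not rated)
    r2 = np.array([5.0, 3.0, 2.0, np.nan, 1.0])   # hypothetical ratings of user 2

    r1bar, r2bar = np.nanmean(r1), np.nanmean(r2) # means over all rated items, not just co-rated
    co = ~np.isnan(r1) & ~np.isnan(r2)            # items rated by both users
    num = np.sum((r1[co] - r1bar) * (r2[co] - r2bar))
    den = np.sqrt(np.sum((r1[co] - r1bar) ** 2)) * np.sqrt(np.sum((r2[co] - r2bar) ** 2))
    print(num / den)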
gini index
-gini index for set A containing m classes of the response variable: I(A) = 1 - sum(q_k^2) -q_k = proportion of cases in set A that belong to class k -maximum value when all classes are equally represented, .5 in the binary case
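a quick check of the formula with made-up counts (.5 is the maximum in the binary case):

    import numpy as np

    counts = np.array([10, 10], dtype=float)   # hypothetical class counts in set A
    q = counts / counts.sum()                  # proportion of cases in each class
    print(1 - np.sum(q ** 2))                  # I(A) = 1 - sum(q^2) -> 0.5 here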
explanatory modeling
-goal: explain relationship between predictors and target
Principal Component Analysis
-goal: reduce a set of numerical variables -idea: remove overlap of information between these variables; information is measured by the sum of the variances of the variables -final product: a smaller number of numerical variables that contain most of the information -how?: create new variables that are linear combinations of the original variables, i.e. they are weighted averages of the original variables -these new variables are uncorrelated (no information overlap) and only a few contain most of the original information -the new variables are called principal components
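a minimal sketch on made-up data, assuming scikit-learn; variables are standardized first to remove scale effects:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))                     # hypothetical data: 100 rows, 4 numeric variables
    X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)    # make two variables share information

    Xs = StandardScaler().fit_transform(X)            # divide by stdev to remove scale effect
    pca = PCA()
    scores = pca.fit_transform(Xs)                    # principal components: uncorrelated linear combinations
    print(pca.explained_variance_ratio_)              # a few components carry most of the variance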
growing decision tree
-growing tree involves successively partitioning the data: recursively partitioning -if an input variable is binary, then the two categories can be used to split the data -if an input variable is interval, a splitting value is used to classify the data into two segments
recommender system
-guide people to interesting materials based on information from other similar people or information from other similar items
lift
-improvement obtained via modeling -confidence/support of consequent
classification terminology
-inputs = predictors = independent variables -outputs = responses = dependent variables: categorical outputs use classification techniques, numeric outputs use prediction/regression techniques -models = classifiers -with classification, we want to use a model to predict what output will be obtained from given inputs
common mistakes with predictive analysis
-learn things that aren't true, patterns may not represent any underlying rule -self selection bias: preaching to wrong population? -data may be at the wrong level of detail -learn things that are true, but not useful - most rules learned are already known -knowledge that cannot be used -data integrity
cosine similarity
-like the correlation coefficient, except do not subtract the means -cold start problem: for users with just one item, or items with just one user, neither cosine similarity nor correlation produces a useful metric
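a sketch of cosine similarity between two users' rating vectors on co-rated items (hypothetical ratings; note the means are not subtracted):

    import numpy as np

    r1 = np.array([4.0, 3.0, 5.0])             # hypothetical ratings on co-rated items
    r2 = np.array([5.0, 2.0, 4.0])
    print(np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2)))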
structural breaks
-logical breaks in a dataset, e.g. pooled vs split -chi squared value > critical value -> statistically different -or just add a dummy variable
three steps of classification
-model building: use training dataset -model validation: use validation dataset; controlling overfitting is a major purpose -model application, i.e. scoring
model chi squared
-model log likelihood ratio is LR = -2ln(L(alpha)/L(alpha, beta)) = -2[ln(L(alpha)) - ln(L(alpha, beta))] -the LR statistic is distributed chi square with i degrees of freedom, where i is the number of independent variables -use the model chi squared statistic to determine if the model (alpha, beta) is statistically significant
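a sketch of the LR test from two fitted log likelihoods (hypothetical values; scipy assumed for the chi square p-value):

    from scipy.stats import chi2

    ll_null = -250.0          # hypothetical ln L(alpha): intercept-only model
    ll_full = -230.0          # hypothetical ln L(alpha, beta): model with i predictors
    i = 3                     # number of independent variables

    LR = -2 * (ll_null - ll_full)         # -2[ln L(alpha) - ln L(alpha, beta)]
    print(LR, chi2.sf(LR, df=i))          # p-value from chi square with i degrees of freedom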
training
-modifying weights to better approximate desired function -based on training data -supervised training: supplies network with inputs and desired outputs, response of network to inputs is measured, weights modified to reduce difference between actual and desired outputs -unsupervised training: only supplies inputs, network adjusts its own weights so that similar inputs cause similar outputs, network identifies patterns and differences in inputs without external guidance
multilayer feedforward networks
-most common neural network -extension of the perceptron -multiple layers: hidden layers between input and output layers -activation function is not simply a threshold, usually a sigmoid function -more general function approximator, not limited to linear problems -information flows in one direction, the outputs of one layer act as inputs to the next layer
comparing measures
-no single measure is better than the others; factors to consider: speed and scalability, robustness, interpretability
omitted variable bias
-omitted variables can result in bias in the coefficient estimates -conduct LR test -chi squared value < critical value -> coefficients not statistically significant
epoch
-one iteration through the process of providing network with input and updating network's weights -typically many epochs required to train network
receiver operating characteristic (ROC) curve
-plots performance of a classifier in terms of its true positive rate (vertical axis) and false positive rate (horizontal axis) -true positive rate = #true positives/(#true positives + #false negatives) = sensitivity -false positive rate = #false positives/(#false positives + #true negatives) = 1-specificity -a model with a curve closer to the top and left is a good model
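a sketch of the two rates from hypothetical confusion-matrix counts at a single threshold (the ROC curve traces these rates across all thresholds):

    tp, fn = 40, 10           # hypothetical true positives, false negatives
    fp, tn = 15, 85           # hypothetical false positives, true negatives

    tpr = tp / (tp + fn)      # sensitivity (vertical axis)
    fpr = fp / (fp + tn)      # 1 - specificity (horizontal axis)
    print(tpr, fpr)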
typical recommendation process
-prediction -ranking -recommendation
avoid overfitting
-prepruning: halt tree construction early, do not split a node if this would result in the purity measure falling below a threshold -postpruning: remove branches from a fully grown tree to get a sequence of progressively pruned trees; use a different dataset to select the best pruned tree (i.e. cross validation)
confusion matrix
-records source of error: false positives and negatives -can be used for model comparison
decision tree
-series of nested tests -each node (root and interior) represents a test on an input variable, either a nominal attribute or a numeric attribute -each leaf is a class assignment and provides a distribution over all possible classes; use the decision tree to classify new instances -number of levels of the tree -> complexity
entropy: case of two classes
-set S with p elements of class P and n elements of class N -entropy of set S is: E(S) = -(p/(p+n))log2(p/(p+n)) - (n/(p+n))log2(n/(p+n))
reducing categories
-single categorical variable with m categories is typically transformed into m or m-1 dummy variables -each dummy variable takes values of 0 or 1 -problem: can end up with too many variables -solution: reduce by combining categories that are close to each other -use pivot tables to assess outcome variable sensitivity
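a sketch of creating m-1 dummy variables with pandas on a hypothetical categorical column:

    import pandas as pd

    df = pd.DataFrame({"region": ["north", "south", "east", "south", "west"]})  # hypothetical data
    dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)    # m-1 dummy variables
    print(dummies)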
constructing lift chart
-sort cases in the test set based on their predicted probability of belonging to the class of interest, in descending order: the ranking of a case is more important than the predicted probability -divide the sorted cases into deciles, each interval representing 1/10 of the entire sample -for each decile, calculate the success rate -for each decile, calculate the lift factor = success rate of the decile / success rate of the entire test set
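a sketch of the decile lift calculation with pandas on hypothetical predicted probabilities and outcomes:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"prob": rng.uniform(size=1000)})                  # hypothetical predicted probabilities
    df["actual"] = (rng.uniform(size=1000) < df["prob"]).astype(int)     # hypothetical outcomes

    df = df.sort_values("prob", ascending=False).reset_index(drop=True)  # rank cases by predicted probability
    df["decile"] = df.index // 100 + 1                                   # 10 deciles of 100 cases each

    overall = df["actual"].mean()                                        # success rate of entire test set
    print(df.groupby("decile")["actual"].mean() / overall)               # lift factor per decile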
User-based collaborative filtering
-start with a single user who will be the target of recommendations -find other users who are most similar based on comparing preference vectors
maximum likelihood estimation
-statistical method for estimating the coefficients of the model -the likelihood function measures the probability of observing the particular set of dependent variable values (p1...pn) that occur in the sample -the higher the value of L, the higher the probability of observing (p1...pn) in the sample
interpreting coefficients
-the odds is intuitive -since ln(p/(1-p)) = alpha + betax + E -> p/(1-p) = e^(alpha+betax+E) -e^beta is the effect of the independent variable on the odds, i.e. the odds ratio
decision tree construction
-trees are constructed in a top-down recursive partitioning manner -all training examples are at the root initially -attributes are assumed to be categorical, discretized if necessary -examples are partitioned recursively based on selected input variables -basic algorithms are greedy -popular methods: ID3, C4.5, CART; similar ideas but they differ in how the tree is grown, splitting criteria, pruning methods, and termination criteria -basic idea: because trees are constructed by recursively partitioning instances, the training examples are split into purer and purer subgroups; group A is purer than group B if the members of A are more similar to each other than the members of B
motivation for logistic regression
-using regression models for classification -however the dependent variable is sometimes limited, e.g. voting, morbidity/mortality -logistic regression can be a good choice when the dependent variable is a dummy variable -dummy variable: 0/1 variable that is a binary choice
activation function
-usually a sigmoid function: f(kx) = 1/(1+e^-kx), where k = steepness/scaling factor
decision tree weaknesses
-volatile: small changes in the underlying data result in very different models -cannot capture interactions between variables -can result in large error -volatility is reduced through bootstrap aggregation
how are weights obtained?
-weights are critical to usefulness, i.e. to determining the appropriate function -obtained through training
attribute selection
-which attribute should be used for a split: choose the attribute that best partitions the relevant population into purer groups at each decision node -other measures of impurity: entropy and gini index -then use information gain to decide which attribute to use: how informative is the attribute in distinguishing among instances from different classes -ideally also try to minimize the number of splits/nodes to make the tree more compact and accurate
