Supervised Data Mining
Accuracy
# of correct classifications/total number of test cases
lift chart y axis
lift
cumulative gains chart x axis
% of cases in the dataset
cumulative gains y axis
% true positives
Cohen's kappa definition
(model accuracy-random accuracy)/ (1 - random accuracy)
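A minimal Python sketch, using assumed confusion-matrix counts purely for illustration, of how random accuracy (chance agreement computed from the marginals) feeds into this formula:

# Hypothetical confusion-matrix counts (assumed for illustration)
TP, FP, TN, FN = 40, 10, 35, 15
total = TP + FP + TN + FN

model_accuracy = (TP + TN) / total
# Chance agreement from the marginals:
# P(pred pos) * P(actual pos) + P(pred neg) * P(actual neg)
chance_pos = ((TP + FP) / total) * ((TP + FN) / total)
chance_neg = ((TN + FN) / total) * ((TN + FP) / total)
random_accuracy = chance_pos + chance_neg

kappa = (model_accuracy - random_accuracy) / (1 - random_accuracy)
print(kappa)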
classifier accuracy definition
(true positives + true negatives)/All
F-measure definition
2*precision*recall / (precision + recall)
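A short Python sketch, again with assumed counts, showing how the accuracy, error rate, precision, recall, and F-measure definitions in this deck compute from a single confusion matrix:

# Hypothetical confusion-matrix counts (assumed for illustration)
TP, FP, TN, FN = 40, 10, 35, 15
total = TP + FP + TN + FN

accuracy = (TP + TN) / total                     # correct / all
error_rate = (FP + FN) / total                   # equivalently 1 - accuracy
precision = TP / (TP + FP)                       # exactness (penalizes FPs)
recall = TP / (TP + FN)                          # completeness (penalizes FNs)
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
print(accuracy, error_rate, precision, recall, f_measure)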
AUC
quantifies the performance of the model by calculating the area under the ROC curve
Accuracy limitations
assumes FP and FN costs are equal and the class distribution is roughly balanced (close to 50/50)
Lift
a measure of how much better a classification model performs than a baseline naïve prediction
impurity
Measure of the heterogeneity of observations in a classification tree
Coefficient of Determination AKA
Rsquared
Rsquared definition
SSR/SST
classification goal
To learn a classification model from the data that can be used to predict the value of a nominal/ordinal label attribute for new cases/instances
learning
a computer system is said to learn from data D to perform task T if, after learning, the system's performance on T improves as measured by M
root mean square error
a goodness-of-fit measure that also accounts for the number of variables; it gives an approximate measure of how far, on average, observations fall from the predicted values
bi
a linear regression coefficient
overfit
a model so closely tailored to the dataset used to develop it that it is less effective on other data
node
a point of splitting in the tree
Data
a set of data records described by k independent attributes and a dependent attribute
t test meaning
a test used for statistical significance; absolute values over 2 are generally significant
leaf (terminal node)
an end point in the tree
yi
an observed value of y
lift range and meaning
any positive number; higher is better, and values above 1 indicate improvement over the baseline
Decision trees data
categorical or continuous, binomial or multinomial
two types of supervised learning
classification and regression
supervised data mining uses
classification or estimation
alternate measures of impurity
Gini impurity or classification error
lift chart x axis
% of cases (deciles)
RMSE meaning
lower is better
supervised data mining aka
machine learning, supervised learning, inductive learning
MLRM (multiple linear regression model)
many ind var and one dep var
F-measure
harmonic mean of precision and recall
lift definition
model precision / percentage of positive cases in the dataset
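A sketch of lift at the top decile with made-up scores and labels (all values are assumptions for illustration): precision among the highest-scoring 10% of cases divided by the overall positive rate:

# Hypothetical model scores and actual labels (1 = positive)
scores = [0.9, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,    1,   0,   1,   0,   0,   0,   0]

# Sort cases by score descending and take the top decile (1 of 10 cases here)
ranked = sorted(zip(scores, labels), reverse=True)
top = ranked[: len(ranked) // 10]

precision_in_top = sum(lbl for _, lbl in top) / len(top)
baseline_rate = sum(labels) / len(labels)   # percentage of positive cases
lift = precision_in_top / baseline_rate     # model precision / baseline
print(lift)                                 # > 1 means better than naive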
simple linear regression
one ind var and one dep var
Leave-one-out cross-validation
used for small datasets; each fold of the cross-validation has only a single test case, and all the rest of the data is used for training
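A minimal sketch using scikit-learn (assuming it is available; the iris data and decision tree are stand-ins): each fold trains on n-1 cases and tests on the single held-out case:

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One fold per case: train on n-1 cases, test on the remaining one
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(scores.mean())  # average accuracy over all n folds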
Multivariate segmenting
using more than one segmenting criterion
when to use recall
when FNs cost more
when to use F-measure
when FPs and FNs have similar costs
when to use Precision
when FPs cost more
class imbalance problem
when one class forms a significant majority (or minority) of the dataset
ROC curve best curve possible
would reach the upper left corner where it identifies all TP and no FP
population linear regression equation
y = B0 + B1x + E, where E is the error term
estimated regression equation
y = b0 + b1x1 + e, where e is the residual
basic linear regression equation
yhat = b0 + b1x1
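A small NumPy sketch with made-up data (all values are assumptions) estimating b0 and b1 by least squares for yhat = b0 + b1x1:

import numpy as np

# Hypothetical data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates for yhat = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

yhat = b0 + b1 * x   # predicted values
print(b0, b1)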
Regression goal
To learn a regression model from the data that can be used to predict (estimate) the value of an interval/ratio attribute for new cases/instances
linear regression data type
quantitative, though can be qualitative
kappa range and meaning
ranges from 0 to 1; bigger is better
F-measure range and meaning
ranges between 0 and 1 where 0 is the worst and 1 is a perfect score
precision range and meaning
ranges between 0 and 1 where 0 is the worst and 1 is a perfect score
AUC range and meaning
ranges from 0 to 1 where 1 would be a perfect score
what does ROC stand for
receiver operating characteristic
two specifications of recall
sensitivity and specificity
ROC curve
shows TP rate plotted against the FP rate for a given classifier performance
confusion matrix
shows how a classifier labels each case relative to the actual class value of that case
cumulative gains chart
shows how many more positive cases are identified using the model than using no model
Constructing a ROC Curve
sort cases by model score descending; starting from the origin, move up for each actual positive (TP) and right for each actual negative (FP)
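A sketch of this construction with hypothetical scores and labels: each actual positive steps the curve up by 1/P, each actual negative steps it right by 1/N, and accumulating the area while stepping right yields the AUC:

# Hypothetical scores and actual labels (1 = positive), for illustration
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   1,   0,   0]

P = sum(labels)              # actual positives
N = len(labels) - P          # actual negatives

tpr, fpr, auc = 0.0, 0.0, 0.0
for _, lbl in sorted(zip(scores, labels), reverse=True):
    if lbl == 1:
        tpr += 1 / P          # actual positive: step up
    else:
        fpr += 1 / N          # actual negative: step right
        auc += tpr * (1 / N)  # accumulate area under the current height
print(auc)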
decision tree approach
splits the data by horizontal and vertical boundaries into regions of similarity
what does SSE stand for?
sum of squares error
what does SSR stand for?
sum of squares regression
what does SST stand for?
sum of squares total
t test definition
t = (b1 - B1) / sb1, where sb1 is the standard error of b1
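A worked sketch with made-up data, taking B1 = 0 under the null hypothesis; sb1 is the standard error of estimate divided by the square root of the sum of squared deviations of x:

import numpy as np

# Hypothetical data and fit (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

SSE = np.sum((y - (b0 + b1 * x)) ** 2)
s = np.sqrt(SSE / (n - 2))                      # standard error of estimate
sb1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))  # standard error of b1
t = (b1 - 0) / sb1                              # |t| > 2 is generally significant
print(t)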
generalizability
the ability of a model to describe not only the data used to build it, but other datasets it hasn't been exposed to
n-fold cross validation
the available data is partitioned into n equally sized, discrete subsets; each subset is used once for testing while the remaining n-1 subsets are combined to form the training set
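A brief scikit-learn sketch (assuming it is available; the dataset and model are stand-ins) of n = 5 fold cross-validation:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# n = 5 folds: each fold is the test set once, the other 4 form the training set
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print(scores.mean())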
ybar
the mean of the observed y values
information gain
the reduction in entropy due to new information being added (e.g., splitting on an attribute)
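A small sketch with an assumed toy split, computing entropy (plus the Gini and classification-error alternates) and the information gain of the split:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def classification_error(counts):
    return 1 - max(counts) / sum(counts)

# Hypothetical parent node with 10 positives and 10 negatives, split into
# two children: [8, 2] and [2, 8] (assumed values for illustration)
parent, left, right = [10, 10], [8, 2], [2, 8]
n = sum(parent)

weighted_children = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
info_gain = entropy(parent) - weighted_children  # reduction in entropy
print(info_gain, gini(parent), classification_error(parent))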
SSR
the sum of squared differences between the predicted values and the mean
SST
the sum of squared differences between the observed values and the mean
SSE
the sum of squared differences between the observed and predicted values
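Continuing the made-up regression data from the sketch above: the three sums of squares and R-squared = SSR/SST, with SST = SSR + SSE as a consistency check:

import numpy as np

# Hypothetical data and least-squares fit (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

SSE = np.sum((y - yhat) ** 2)         # observed vs predicted
SSR = np.sum((yhat - y.mean()) ** 2)  # predicted vs mean
SST = np.sum((y - y.mean()) ** 2)     # observed vs mean; SST = SSR + SSE
r_squared = SSR / SST
print(r_squared)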
b0
the intercept
testing
the model is given an unseen portion of the data to assess its accuracy
specificity
the negative version of recall; true negatives / all actual negatives
coefficient of determination meaning
the proportion of the total variation in the dep var that is explained by variation in the ind var
sensitivity
the positive version of recall; true positives / all actual positives
yhat
the predicted value of y
p value meaning
the probability of obtaining the observed result by chance; very small p values (smaller than alpha) are significant
standard error of estimate meaning
the standard deviation of the observations around the regression line
holdout set
the test set
ind var
the var used to explain the dep var
dep var
the variable we wish to explain
null hypothesis for linear regression
there is no relationship; H0: B1 = 0
specificity definition
true negatives / (true negatives + false positives) = TN/N
sensitivity definition
true pos / (true pos + false negative) OR TP/P
precision definition
true positives/ (true pos + false pos)
recall
completeness; measures how good predictions are with regard to false negatives
pruning
cutting off branches of the tree to avoid overfitting the data
residual / error
deviation of an observed value from the prediction
pros of tree induction
easy to understand, easy to implement, easy to use, computationally cheap
precision meaning
exactness; measures how good predictions are with regard to false positives
FP and FN represent what in a cost profit matrix
expected costs
TP and TN represent what in a cost profit matrix
expected profits
error rate definition
(false positives + false negatives) / all, OR 1 - accuracy
classifier accuracy meaning
percentage of cases that are correctly classified
error rate meaning
percentage of test set cases that are incorrectly classified
Lift Chart
plots the actual lift at each decile of cases
regression
predicting a value of a continuous variable for a given case
Classification
predicting class membership for a given case
purposes of regression analysis
prediction, diagnosis
profit/cost matrix
profits and costs are assigned to each cell of a confusion matrix to calculate the expected value (EV) of applying a classifier to the dataset
Cohen's Kappa
provides a ratio of the model's accuracy above chance to the best possible accuracy above chance