Supervised Data Mining

Accuracy

# of correct classifications/total number of test cases

lift chart y axis

lift

cumulative gains chart x axis

% of cases in the dataset

cumulative gains y axis

% of all positive cases captured (true positives)

Cohen's kappa definition

(model accuracy - random accuracy) / (1 - random accuracy)
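
As a rough illustration, the kappa formula can be applied to the four cells of a binary confusion matrix. The counts and the function name `cohens_kappa` are made up for this sketch, and random accuracy is assumed to be the agreement expected from the row and column totals.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    model_accuracy = (tp + tn) / total
    # Chance agreement: P(predicted pos) * P(actual pos) + P(predicted neg) * P(actual neg)
    chance_pos = ((tp + fp) / total) * ((tp + fn) / total)
    chance_neg = ((fn + tn) / total) * ((fp + tn) / total)
    random_accuracy = chance_pos + chance_neg
    return (model_accuracy - random_accuracy) / (1 - random_accuracy)


# Hypothetical counts: accuracy is 0.8, chance agreement is 0.5
kappa = cohens_kappa(tp=40, fp=10, fn=10, tn=40)
print(round(kappa, 3))
```

With these counts the model is 0.3 above chance out of a possible 0.5, so kappa comes out at about 0.6.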

classifier accuracy definition

(true positives + true negatives)/All

f measure definition

2*precision*recall / (precision + recall)
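
A minimal sketch of the F measure computed from confusion matrix counts. The function name and the counts are hypothetical, and the sketch assumes at least one true positive so that no division by zero occurs.

```python
def f_measure(tp, fp, fn):
    """F measure: 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)   # exactness
    recall = tp / (tp + fn)      # completeness
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 0.75, recall = 0.6
f_score = f_measure(tp=30, fp=10, fn=20)
print(round(f_score, 3))
```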

AUC

Area under the ROC curve; a single number that quantifies the performance of the model

Accuracy limitations

Only a reliable measure when FP and FN costs are about the same and the class distribution is roughly balanced (near 50/50)

Lift

Measures how much better a classification model performs than a baseline naïve prediction

impurity

Measure of the heterogeneity of observations in a classification tree

Coefficient of Determination AKA

R-squared

R-squared definition

SSR/SST (equivalently, 1 - SSE/SST)
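
The ratio can be checked on a small example. The sketch below fits a one-predictor least-squares line to made-up data and computes the three sums of squares; the helper name `fit_line` and the data are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Least-squares estimates b0 (intercept) and b1 (slope) for one predictor."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


xs = [1, 2, 3, 4, 5]   # made-up data
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
y_bar = sum(ys) / len(ys)
y_hat = [b0 + b1 * x for x in xs]

ssr = sum((p - y_bar) ** 2 for p in y_hat)           # explained variation
sse = sum((y - p) ** 2 for y, p in zip(ys, y_hat))   # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)              # total variation
r_squared = ssr / sst
print(round(r_squared, 3))
```

For this data SSR is about 3.6, SSE about 2.4 and SST is 6, so roughly 60% of the variation in y is explained by x.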

classification goal

To learn a classification model from the data that can be used to predict the value of a nominal/ordinal label attribute for new cases/instances

learning

a computer system is said to learn from D to perform the task T if, after learning, the system's performance on T improves as measured by M

root mean square error

a goodness-of-fit measure: the square root of the mean squared difference between observed and predicted values; it approximates the typical distance of an observation from its prediction
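
RMSE is commonly computed as the square root of the mean squared residual. A short sketch under that assumption, with made-up observed and predicted values:

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: residuals are 1, 0 and -2
rmse_value = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(round(rmse_value, 3))
```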

bi

a linear regression coefficient

overfit

a model that is so closely tailored to the dataset used to develop it that it is less effective on other data

node

a point of splitting in the tree

Data

a set of data records described by k independent attributes and a dependent attribute

t test meaning

a test of statistical significance; absolute values over about 2 are generally significant

leaf (terminal node)

an end point in the tree

yi

an observed value of y

lift range and meaning

any non-negative number; 1 matches the baseline, values above 1 beat it, and higher is better

Decision trees data

categorical or continuous, binomial or multinomial

two types of supervised learning

classification and regression

supervised data mining uses

classification or estimation

alternate measures of impurity

gini impurity or classification error
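
For comparison, the sketch below computes entropy (assumed here to be the primary impurity measure these are alternatives to) alongside Gini impurity and classification error. Each function takes the class proportions in a node; the function names are illustrative.

```python
import math


def entropy(proportions):
    """Entropy in bits; zero proportions contribute nothing."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)


def gini_impurity(proportions):
    return 1 - sum(p ** 2 for p in proportions)


def classification_error(proportions):
    return 1 - max(proportions)


mixed = [0.5, 0.5]   # most heterogeneous two-class node
pure = [1.0, 0.0]    # perfectly homogeneous node
for name, node in (("mixed", mixed), ("pure", pure)):
    print(name, entropy(node), gini_impurity(node), classification_error(node))
```

All three measures are zero for a pure node and reach their maximum when the classes are evenly mixed.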

lift chart x axis

% of cases (typically by decile)

RMSE meaning

lower is better

supervised data mining aka

machine learning, supervised learning, inductive learning

MLRM (multiple linear regression model)

many ind var and one dep var

F measure

harmonic mean of precision and recall

lift definition

model precision / proportion of positive cases in the whole dataset
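
A small worked example of the lift ratio, assuming the baseline is the share of positive cases in the whole dataset. All counts and parameter names are hypothetical.

```python
def lift(tp, fp, positives_in_data, cases_in_data):
    """Lift = precision among cases the model flags / overall positive rate."""
    model_precision = tp / (tp + fp)
    base_rate = positives_in_data / cases_in_data
    return model_precision / base_rate


# Hypothetical: the model flags 50 cases, 30 of them truly positive,
# while only 100 of 1000 cases overall are positive.
lift_value = lift(tp=30, fp=20, positives_in_data=100, cases_in_data=1000)
print(round(lift_value, 2))
```

Here the flagged group is about six times richer in positives than a random selection would be.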

simple linear regression

one ind var and one dep var

Leave-one-out cross-validation

used for small data sets; each fold of the cross-validation has only a single test example, and all the rest of the data is used for training
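
A minimal sketch of how leave-one-out splits could be generated. The function name and the toy records are assumptions; a real workflow would fit and score a model inside the loop.

```python
def leave_one_out_splits(records):
    """Yield (train, test) pairs where each test set holds exactly one record."""
    for i in range(len(records)):
        train = records[:i] + records[i + 1:]
        test = [records[i]]
        yield train, test


loo_splits = list(leave_one_out_splits(["a", "b", "c", "d"]))
for train, test in loo_splits:
    print("train:", train, "test:", test)
```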

Multivariate segmenting

using more than one segmenting criterion

when to use recall

when FN cost more

when to use f measure

when FP and FN have similar costs

when to use Precision

when FP cost more

class imbalance problem

when one class makes up a large majority of the cases and the other class is rare

ROC curve best curve possible

would reach the upper left corner where it identifies all TP and no FP

population linear regression equation

y = beta0 + beta1*x + E, where E is the random error term

estimated regression equation

y = b0 + bixi + e, where e is the residual

basic linear regression equation

yhat=b0+bixi

Regression goal

To learn a regression model from the data that can be used to predict (estimate) the value of an interval/ratio attribute for new cases/instances

linear regression data type

quantitative, though qualitative variables can be included (e.g., coded as dummy variables)

kappa range and meaning

ranges from -1 to 1; 0 means no better than chance and 1 is perfect agreement; bigger is better

f measure range and meaning

ranges between 0 and 1 where 0 is the worst and 1 is a perfect score

precision range and meaning

ranges between 0 and 1 where 0 is the worst and 1 is a perfect score

AUC range and meaning

ranges from 0 to 1, where 1 would be a perfect score and 0.5 is no better than random guessing

what does ROC stand for

receiver operating characteristic

two specifications of recall

sensitivity and specificity

ROC curve

shows the TP rate plotted against the FP rate for a given classifier as the score threshold varies

confusion matrix

shows how a classifier labels each case relative to the actual class value of that case
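
One way to tabulate a binary confusion matrix is to count (actual, predicted) pairs. The label lists below are made up, and 1 is assumed to mark the positive class.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count the four outcomes for binary labels (1 = positive, 0 = negative)."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)],  # actual positive, predicted positive
        "FN": counts[(1, 0)],  # actual positive, predicted negative
        "FP": counts[(0, 1)],  # actual negative, predicted positive
        "TN": counts[(0, 0)],  # actual negative, predicted negative
    }


actual_labels = [1, 1, 1, 0, 0, 0, 0, 1]
predicted_labels = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(actual_labels, predicted_labels)
print(matrix)
```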

cumulative gains chart

shows how many more positive cases are identified with the model than with no model
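
A sketch of how the points of a cumulative gains chart might be computed: rank cases by model score, then track the share of all positives found as more cases are included. The scores, labels and function name are hypothetical.

```python
def cumulative_gains(scores, labels):
    """Return (fraction of cases, fraction of positives captured) points."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_cases = len(ranked)
    total_positives = sum(labels)
    found = 0
    points = []
    for seen, (_, label) in enumerate(ranked, start=1):
        found += label  # label is 1 for a positive case, 0 otherwise
        points.append((seen / total_cases, found / total_positives))
    return points


gains = cumulative_gains([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 0, 0])
for fraction_of_cases, fraction_of_positives in gains:
    print(round(fraction_of_cases, 2), round(fraction_of_positives, 2))
```

With no model, the expected curve is the diagonal: contacting 50% of cases would capture about 50% of positives.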

Constructing a ROC Curve

sort cases by model score in descending order; start at the origin with the highest score and move up for each actual positive (TP) and right for each actual negative (FP)
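
The usual construction can be sketched as follows: rank cases by score from highest to lowest, start at (0, 0), step up by 1/P for each actual positive and right by 1/N for each actual negative. The data and function names are made up, and the sketch assumes no tied scores. The area under the resulting curve is added up with the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points, starting from the origin."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1   # move up
        else:
            fp += 1   # move right
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


roc = roc_points([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 0, 0])
auc_value = area_under_curve(roc)
print(round(auc_value, 3))
```

In this example 8 of the 9 positive/negative pairs are ranked correctly, which matches an AUC of about 0.889.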

decision tree approach

splits the data with axis-parallel (horizontal and vertical) boundaries into regions of similar cases

what does SSE stand for?

sum of squares error

what does SSR stand for?

sum of squares regression

what does SST stand for?

sum of squares total

t test definition

t = (b1-B1)/sb1
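
A worked sketch of the t statistic for a simple linear regression slope. It assumes the standard error of the slope is the standard error of estimate divided by the square root of the sum of squared x deviations, and it tests against B1 = 0. The data are made up.

```python
import math

xs = [1, 2, 3, 4, 5]   # made-up data
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(sse / (n - 2))   # standard error of estimate
s_b1 = s_e / math.sqrt(sxx)      # standard error of the slope
hypothesized_slope = 0.0         # B1 under the null hypothesis
t_stat = (b1 - hypothesized_slope) / s_b1
print(round(t_stat, 3))
```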

generalizability

the ability of a model to describe not only the data used to build it, but other datasets it hasn't been exposed to

n-fold cross validation

the available data is partitioned into n equally sized, disjoint subsets. one subset is used to test and the remaining n-1 subsets are combined to make the training set; this is repeated n times so that each subset is used for testing once
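
A minimal sketch of the partitioning step. The function name is hypothetical, and records are dealt into folds in round-robin order rather than shuffled first, which a real workflow would usually do.

```python
def n_fold_splits(records, n):
    """Yield (train, test) pairs; each of the n folds serves as the test set once."""
    folds = [records[i::n] for i in range(n)]  # near-equal, non-overlapping folds
    for i in range(n):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test


fold_splits = list(n_fold_splits(list(range(10)), n=5))
for train, test in fold_splits:
    print("test:", test, "train size:", len(train))
```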

ybar

the mean of the observed y values

information gain

the change in entropy due to any amount of new information being added
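
In a classification tree, this is usually computed as the parent node's entropy minus the size-weighted entropy of the child nodes. A short sketch under that assumption, with made-up labels and illustrative function names:

```python
import math
from collections import Counter


def label_entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * label_entropy(child) for child in children)
    return label_entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4
gain = information_gain(parent, [left, right])
print(round(gain, 3))
```

A split that separated the classes perfectly would have a gain of 1 bit here; this split gains about 0.278.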

SSR

the sum of squared differences between each predicted value and the mean

SST

the sum of squared differences between each observed value and the mean

SSE

the sum of squared differences between each observed value and its predicted value

b0

the intercept

testing

the model is given an unseen section of data to assess the model accuracy

specificity

the negative version of recall; correctly predicted negatives / all actual negatives

coefficient of determination meaning

the portion of the total variation in the dep var that is explained by variation in the ind var

sensitivity

the positive version of recall; # of correctly predicted positives / all actual positives

yhat

the predicted value of y

p value meaning

the probability of seeing a result at least this extreme by chance alone if the null hypothesis is true; very small p values (smaller than alpha) are significant

standard error of estimate meaning

the standard deviation of the variation of observations around the regression line

holdout set

the test set

ind var

the var used to explain the dep var

dep var

the variable we wish to explain

null hypothesis for linear regression

there is no relationship; H0: B1 = 0

specificity definition

true neg / (true neg + false pos) = TN/N

sensitivity definition

true pos / (true pos + false negative) OR TP/P

precision definition

true positives/ (true pos + false pos)
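
The three ratios can be computed side by side from the same confusion matrix. This sketch uses the standard definitions, in which specificity is TN / (TN + FP); the counts and function name are made up.

```python
def rates(tp, fp, fn, tn):
    """Sensitivity, specificity and precision from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # TP / all actual positives
    specificity = tn / (tn + fp)   # TN / all actual negatives
    precision = tp / (tp + fp)     # TP / all predicted positives
    return sensitivity, specificity, precision


sensitivity, specificity, precision = rates(tp=40, fp=5, fn=10, tn=45)
print(round(sensitivity, 3), round(specificity, 3), round(precision, 3))
```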

recall

completeness; measures how good predictions are with regard to false negatives

pruning

cutting off branches of the tree to avoid overfitting the data

residual / error

deviation of an observed value from the prediction

pros of tree induction

easy to understand, easy to implement, easy to use, computationally cheap

precision meaning

exactness; measures how good predictions are with regard to false positives

FP and FN represent what in a cost profit matrix

expected costs

TP and TN represent what in a cost profit matrix

expected profits

error rate definition

(false positives + false negatives) / all OR 1 - Accuracy
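
A quick check that the two forms agree, using hypothetical counts:

```python
def accuracy_and_error_rate(tp, fp, fn, tn):
    """Return (accuracy, error rate) from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    error_rate = (fp + fn) / total
    return accuracy, error_rate


accuracy, error_rate = accuracy_and_error_rate(tp=45, fp=5, fn=10, tn=40)
print(round(accuracy, 2), round(error_rate, 2), round(1 - accuracy, 2))
```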

classifier accuracy meaning

percentage of cases that are correctly classified

error rate meaning

percentage of test set cases that are incorrectly classified

Lift Chart

plots the actual lift at each decile of cases

regression

predicting a value of a continuous variable for a given case

Classification

predicting class membership for a given case

purposes of regression analysis

prediction, diagnosis

profit/cost matrix

profits and costs are assigned to each cell in a confusion matrix to calculate the expected value (EV) of applying a classifier to the dataset
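
A sketch of the expected value calculation: weight each cell's profit or cost by how often that outcome occurs. The counts and dollar values are invented for illustration, with costs entered as negative numbers.

```python
def expected_value_per_case(counts, values):
    """Sum over cells of (share of cases in the cell) * (profit or cost of the cell)."""
    total = sum(counts.values())
    return sum(counts[cell] / total * values[cell] for cell in counts)


# Hypothetical confusion matrix counts and per-case profits (+) and costs (-)
counts = {"TP": 30, "FP": 20, "FN": 10, "TN": 140}
values = {"TP": 50.0, "FP": -10.0, "FN": -20.0, "TN": 0.0}
ev = expected_value_per_case(counts, values)
print(round(ev, 2))
```

With these numbers the classifier is worth about 5.5 per case: 7.5 from true positives, less 1.0 each for false positives and false negatives.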

Cohens Kappa

provides a ratio of the model's accuracy above chance to the best possible accuracy above chance

