Data Mining

Positive

What happens to the decision boundary if the cost of a false positive is high and we would like to be more conservative when predicting positive? Choose a (positive or negative) cut-off value?
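
A minimal sketch of the idea, assuming a model that outputs a signed score f(x) and labels an instance positive when the score exceeds a cut-off. The scores and the cut-off of 0.5 are hypothetical values chosen for illustration:

```python
def predict_positive(scores, cutoff=0.0):
    """Label an instance positive only when its score f(x) exceeds the cutoff."""
    return [s > cutoff for s in scores]

scores = [-1.2, -0.3, 0.2, 0.8, 1.5]          # hypothetical f(x) values
default = predict_positive(scores)            # cut-off at 0
conservative = predict_positive(scores, 0.5)  # positive cut-off

# A positive cut-off predicts fewer positives, so fewer false positives are possible.
print(sum(default), sum(conservative))
```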

Testing set

data used to assess the likely future performance of a model

Training set

data used to build models

Validation set

data used to tune parameters in models

Accuracy

(# of correct decisions made) / (total # of decisions made) = 1 - error rate = (TP + TN) / total
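
A short sketch of the calculation. The confusion-matrix counts are hypothetical and only illustrate the formula:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all decisions that were correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total

acc = accuracy(tp=40, tn=45, fp=5, fn=10)  # hypothetical counts
error_rate = 1 - acc
print(acc)
```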

F1-score

2 * (precision * recall) / (precision + recall); on a precision-recall curve, precision is the y-axis and recall is the x-axis

Bar plot, dot plot, mosaic plot

3 types of data visualization for categorical attributes

Box plot, histogram, scatter plot

3 types of data visualization for numerical attributes

Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

6 phases of CRISP-DM

Kernel

A similarity measure in a transformed feature space

Higher

An instance being further from the separating boundary leads to a (higher or lower) probability of being in one class or the other?

Hinge Loss

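Loss that penalizes points on the wrong side of the boundary and also points that are correctly classified (positive above or negative below the line) but still inside the margin (the "bar" around the decision boundary). The latter are penalized because they are not far enough from the boundary.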
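
A minimal sketch contrasting hinge loss with zero-one loss. It assumes the common convention that labels are -1 or +1 and that the margin edges sit at f(x) = +1 and f(x) = -1; the scores are hypothetical:

```python
def hinge_loss(y, fx):
    """y is the true label in {-1, +1}; fx is the signed score f(x)."""
    return max(0.0, 1.0 - y * fx)

def zero_one_loss(y, fx):
    """0 for a correct decision, 1 for an incorrect one."""
    return 0 if y * fx > 0 else 1

far_correct = hinge_loss(+1, 2.0)     # correct and outside the margin: no loss
inside_margin = hinge_loss(+1, 0.4)   # correct but inside the margin: penalized
misclassified = hinge_loss(+1, -0.5)  # wrong side: penalized more

print(far_correct, inside_margin, misclassified)
```

Zero-one loss would score the second point as 0, which is the difference this card describes.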

A model fits the training set too closely and does not generalize to unseen data

Briefly explain what overfitting is

Cross Industry Standard Process for Data Mining

CRISP-DM

-(5/6)log2(5/6) - (1/6)log2(1/6) ≈ 0.65

Calculating entropy: target variable = yes or no, with 5 yes and 1 no
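
The same calculation as a short sketch; the helper name `entropy` is an illustrative choice, not part of the original card:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

h = entropy([5, 1])  # 5 yes, 1 no
print(round(h, 2))   # 0.65
```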

Cross Validation

Generalization performance
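
A rough sketch of how k-fold cross-validation splits the data so that every instance is held out exactly once. The function name `k_fold_indices` and the sizes n=10, k=5 are assumptions for illustration; model fitting and scoring are left out:

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each instance is held out exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(n=10, k=5))
for train, test in folds:
    # A model would be built on `train` and evaluated on `test` here;
    # averaging the k test scores estimates generalization performance.
    print(len(train), test)
```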

Regression

Company X wants to know how much return on investment it is going to get based on the funds it has invested in marketing a new product... what type of problem is this?

Overfitting avoidance

Complexity Control

Domain-knowledge validation

Comprehensibility

Information Gain

Difference between the entropy of the parent and the weighted average entropy of the children
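
A minimal sketch of the computation, assuming entropy as the impurity measure and children weighted by their share of the parent's instances. The class counts are hypothetical:

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split: a 5/5 parent divided into a 4/1 child and a 1/4 child.
ig = information_gain(parent=[5, 5], children=[[4, 1], [1, 4]])
print(round(ig, 3))
```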

Co-occurrence grouping

Frequent itemset mining, association rule discovery, market-basket analysis; there is nothing to predict, only associations to find among items often bought together

Linear Discriminant Function

General form: f(x) = w0 + w1x1 + w2x2 + ... Training method: minimize zero-one loss (aka misclassification error)
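
A small sketch of evaluating the general form and classifying by the sign of f(x). The weights and the instance are hypothetical, and classifying by sign is the usual convention rather than something stated on the card:

```python
def linear_discriminant(x, weights, intercept):
    """f(x) = w0 + w1*x1 + w2*x2 + ... for one instance x."""
    return intercept + sum(w * xi for w, xi in zip(weights, x))

score = linear_discriminant(x=[1.0, 2.0], weights=[2.0, 0.5], intercept=-1.0)
label = "positive" if score > 0 else "negative"
print(score, label)
```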

Entropy

How mixed up classes are

A

In a medical study, Group A has 70 smokers and 30 non-smokers, and Group B has 85 smokers and 15 non-smokers. Which has the higher entropy?

B

Jim is inviting a few friends to his house and he wants to collect information about their attendance. Friend A says he will come 100% for sure, Friend B says she has a 50% chance of coming, and Friend C says he is busy and only has a 20% chance of coming. Who gives the answer with the highest entropy?
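
Both entropy comparisons (the smoker groups and Jim's friends) can be checked with a short sketch; `binary_entropy` is an illustrative helper name:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome event with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

group_a = binary_entropy(70 / 100)
group_b = binary_entropy(85 / 100)

friends = {name: binary_entropy(p) for name, p in [("A", 1.0), ("B", 0.5), ("C", 0.2)]}
most_uncertain = max(friends, key=friends.get)

print(group_a > group_b, most_uncertain)
```

Entropy peaks when the two outcomes are equally likely, which is why the 70/30 group and the 50% answer come out highest.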

Regression

Numeric target

Logistic

Log odds

SVMs

Maximum margin

ROC Curve

Ranking

S

Supervised or Unsupervised? Classification

U

Supervised or Unsupervised? Clustering

U

Supervised or Unsupervised? Co-occurrence grouping/frequent itemset

BOTH

Supervised or Unsupervised? Data reduction

S

Supervised or Unsupervised? Ranking

S

Supervised or Unsupervised? Regression

higher, more

The ____ the entropy value, the ___ uncertain/impure the data is

SVM

The output of the training process is a set of support vectors and the corresponding weights. High performance, but very slow, so it is not used frequently in the real world

True

True/False Cross-validation is used to estimate generalization performance

False

True/False Finding the characteristics that differentiate my most profitable customers from my less profitable customers is an example of an unsupervised learning task

False

True/False For supervised data mining the value of the target variable is known when the model is used to predict future unseen data

False

True/False The best way to deal with missing values in a feature is to always remove observations with missing values

False

True/False The difference between supervised and unsupervised learning is supervised learning has a categorical target variable and unsupervised learning has a numeric target variable

False

True/False The points on a model's precision-recall curve represent the cost of different classifications

True

True/False We can build unsupervised data mining models when we lack labels for the target variable in the training data

True

True/False When implementing CRISP-DM, a data scientist often needs to go through the operation for several iterations

B

Which is NOT true about overfitting? A. If a model is overfitting, it will have a poor generalization performance. B. Overfitting happens when the model is overly simplified. C. A hold-out set can be used to examine overfitting. D. Overfitting can be avoided by tuning parameters on a validation set or via cross-validation.

D

Which of the following does NOT describe SVM (support vector machine)? A. SVM can be applied when the data are not linearly separable. B. The decision boundaries are determined by the support vectors. Other training data can be ignored. C. SVM makes a prediction by evaluating the similarity between the new instance and the support vectors, usually represented by a kernel function. D. SVM uses hinge loss as a loss function, which is measured as the distance from the error point to the decision boundary.

D

Which of the following is true about logistic regression? A. Logistic regression is a regression model and needs a numerical target variable. B. Logistic regression can generate non-linear decision boundaries without feature engineering. C. Logistic regression can directly work with any form of data so no data transformation is required for categorical attributes. D. Logistic regression predicts probability of membership in the positive class.

C

Which of the following models has a decision boundary different than others? A. Linear regression B. Logistic regression C. CART D. SVM with a linear kernel

logistic regression

a model for the probability of class membership; the linear function f(x) estimates the log odds (logit) of the positive class, and the class probability is p+ = 1 / (1 + e^(-f(x)))
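
A minimal sketch of the probability formula. The function name and the sample scores are illustrative assumptions:

```python
import math

def positive_class_probability(fx):
    """p+ = 1 / (1 + e^(-f(x))), where fx is the log odds f(x)."""
    return 1.0 / (1.0 + math.exp(-fx))

p_boundary = positive_class_probability(0.0)  # on the boundary: 0.5
p_far = positive_class_probability(3.0)       # far on the positive side

print(p_boundary, round(p_far, 3))
```

Scores further from zero give probabilities closer to 0 or 1, which matches the card on distance from the separating boundary.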

Data leakage

a variable collected in historical data gives information on the target variable (info that appears in historical data but is not actually available when the decision has to be made)

Zero-one loss

assigns a loss of zero for a correct decision and one for an incorrect decision

pure

certain in the outcome, homogeneous with respect to the target variable

Support vectors

data points from the training set that lie on the edge of the margin (the "bar") around the decision boundary

Categorical/nominal data

data that has two or more categories, but there is no intrinsic ordering to the categories

ordinal data

data that has two or more categories with a clear ordering of the categories

Margin

distance between the decision boundary (line) and the edge of the margin (bar end) in an SVM

Classification

find a decision boundary that separates one class from the other; supervised segmentation

Predictive Model

formula learned from old data for estimating the unknown value of interest for some new data

Clustering

grouping individuals together by their similarity so that individuals in the same group are more similar to each other than those in other groups

Specificity

how good a test is at avoiding false alarms; TN / (TN + FP)

Recall/sensitivity

how good a test is at detecting the positives; TP / (TP + FN)

Precision

how many of the positively classified were relevant; TP / (TP + FP)
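
Specificity, recall, precision, and the F1-score can all be read off one confusion matrix. A sketch with hypothetical counts:

```python
def specificity(tn, fp):
    return tn / (tn + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f1_score(p, r):
    return 2 * (p * r) / (p + r)

# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, fn, tn = 40, 10, 20, 30

spec = specificity(tn, fp)
rec = recall(tp, fn)
prec = precision(tp, fp)
f1 = f1_score(prec, rec)

print(round(spec, 3), round(rec, 3), round(prec, 3), round(f1, 3))
```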

soft margin

maximize the margin while minimizing hinge loss; the parameter C controls the trade-off between the two

hard margin

maximize the margin, allowing no mistakes; requires the data to be linearly separable

entropy

measures uncertainty/impurity

p+ / (1 - p+); for the example, odds = 0.9 / 0.1 = 9

odds of an event; example: p(x) = 0.9
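
A short sketch of the odds calculation, with the log odds added because logistic regression models that quantity. The helper names are illustrative:

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

def log_odds(p):
    """Natural log of the odds (the logit)."""
    return math.log(odds(p))

o = odds(0.9)
lo = log_odds(0.9)
print(round(o, 2), round(lo, 3))
```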

Classification tree

partition space of examples with axis-parallel decision boundaries

CRISP-DM

process that places a structure on the problem life cycle through 6 phases; used to maintain reasonable consistency, repeatability, and objectivity

Data reduction

reduce the dimensionality of the data to focus on the most important information; replace a large dataset with a smaller one

imputation

replacing missing data with substituted values estimated from the data set
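
One simple form is mean imputation, sketched below. Choosing the mean is an assumption made for illustration (the card does not name a method), and the values are hypothetical:

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [20, None, 30, 40, None]  # hypothetical feature with missing values
filled = impute_mean(ages)
print(filled)
```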

Unsupervised

the model is NOT provided with the results (y) during training

Supervised

training data includes both input (x) and result (y)

Regression

value estimation; given an input x, predict a numerical value for the target variable y

