Data Mining
Positive
What happens to the decision boundary if the cost of a false positive is high and we would like to be more conservative when predicting positive? Do we shift the cut-off value in the (positive or negative) direction?
Testing set
data used to assess the likely future performance of a model
Training set
data used to build models
Validation set
data used to tune parameters in models
Accuracy
# of correct decisions made / total # of decisions made = 1 - error rate = (TP + TN) / total
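The accuracy formula above can be sketched as follows (the confusion-matrix counts are made up for illustration):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = # correct decisions / total # decisions = (TP + TN) / total."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 90 correct decisions out of 100
print(accuracy(tp=40, tn=50, fp=5, fn=5))  # 0.9
```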
F1-score
2 * (precision * recall) / (precision + recall); on a precision-recall curve, precision is the y axis and recall is the x axis
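A minimal sketch of the F1 formula, the harmonic mean of precision and recall (the input values are illustrative):

```python
def f1_score(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    return 2 * (precision * recall) / (precision + recall)

# Hypothetical precision/recall values
print(f1_score(0.75, 0.6))  # 2 * 0.45 / 1.35 = 2/3
```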
Bar Plot Dot Plot Mosaic Plot
3 types of data visualization for categorical attributes
Box Plot Histogram Scatter Plot
3 types of data visualization for numerical attributes
Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment
6 phases of CRISP-DM
Kernel
A similarity measure in transferred feature space
Higher
An instance being further from the separating boundary leads to a (higher or lower) probability of being in one class or the other?
Hinge Loss
Loss incurred by points inside the margin or on the wrong side of the boundary. Points that are correctly classified (positive above or negative below the line) but still within the margin bar also incur loss, because they are not far enough from the decision boundary.
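The standard hinge loss is max(0, 1 - y*f(x)) with labels y in {-1, +1}; a minimal sketch (the f(x) values are illustrative):

```python
def hinge_loss(y, fx):
    """Hinge loss: max(0, 1 - y*f(x)), with y in {-1, +1}.
    Zero only when the point is correctly classified AND outside the margin."""
    return max(0.0, 1.0 - y * fx)

print(hinge_loss(+1, 2.0))   # correct and outside the margin -> 0.0
print(hinge_loss(+1, 0.5))   # correct but inside the margin -> 0.5
print(hinge_loss(+1, -1.0))  # misclassified -> 2.0
```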
A model is too fit to the training set and does not generalize to the unseen data
Briefly explain what overfitting is
Cross Industry Standard Process for Data Mining
CRISP-DM
-(5/6)log2(5/6) - (1/6)log2(1/6) ≈ 0.65
Calculating entropy Target variable = yes or no 5 yes and 1 no
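The entropy calculation above (5 yes, 1 no) can be checked with a short helper:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([5, 1]), 2))  # -(5/6)log2(5/6) - (1/6)log2(1/6) -> 0.65
print(entropy([1, 1]))            # maximally mixed two-class data -> 1.0
```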
Cross Validation
Generalization performance
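The idea behind k-fold cross-validation can be sketched in plain Python: split the data into k folds, and let each fold serve once as the hold-out set while the rest is used for training (the n and k values below are illustrative):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal, contiguous folds."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Each fold is held out once; performance is averaged over the k runs.
print(kfold_indices(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```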
Regression
Company X wants to know how much return on investment it is going to get based on the funds it has invested in marketing a new product... what type of problem is this?
Overfitting avoidance
Complexity Control
Domain-knowledge validation
Comprehensibility
Information Gain
Difference between parents and children
Co-occurrence grouping
Frequent itemset mining, association rule discovery, market-basket analysis; nothing to predict, just associating items that are often bought together
Linear Discriminant Function
General form: f(x) = w0 + w1x1 + w2x2 + ...; training method: minimize zero-one loss (aka misclassification error)
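The general form f(x) = w0 + w1x1 + w2x2 + ... can be evaluated directly; a minimal sketch with made-up weights, classifying by the sign of f(x):

```python
def linear_discriminant(weights, x):
    """f(x) = w0 + w1*x1 + w2*x2 + ...; classify by the sign of f(x)."""
    w0, *ws = weights
    return w0 + sum(w * xi for w, xi in zip(ws, x))

# Hypothetical weights [w0, w1, w2] and instance [x1, x2]
f = linear_discriminant([-1.0, 2.0, 0.5], [1.0, 2.0])  # -1 + 2*1 + 0.5*2 = 2.0
print("positive" if f > 0 else "negative")
```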
Entropy
How mixed up classes are
A
In a medical study, Group A has 70 smokers and 30 non-smokers and Group B has 85 smokers and 15 non-smokers. Which group has the higher entropy?
B
Jim is inviting a few friends to his house and he wants to collect information about their attendance. Friend A says he will come 100% for sure, Friend B says she has a 50% chance of coming, and Friend C says he is busy and only has a 20% chance of coming. Who gives the answer with the highest entropy?
Regression
Numeric target
Logistic
Log odds
SVMs
Maximum margin
ROC Curve
Ranking
S
Supervised or Unsupervised? Classification
U
Supervised or Unsupervised? Clustering
U
Supervised or Unsupervised? Co-occurrence grouping/frequent itemset mining
BOTH
Supervised or Unsupervised? Data reduction
S
Supervised or Unsupervised? Ranking
S
Supervised or Unsupervised? Regression
higher, more
The ____ the entropy value, the ___ uncertain/impure the data is
SVM
The output of the training process is a set of support vectors and the corresponding weights. High performance, but training is very slow, so SVMs are not used frequently in the real world
True
True/False Cross-validation is used to estimate generalization performance
False
True/False Finding the characteristics that differentiate my most profitable customers from my less profitable customers is an example of an unsupervised learning task
False
True/False For supervised data mining the value of the target variable is known when the model is used to predict future unseen data
False
True/False The best way to deal with missing values in a feature is to always remove observations with missing
False
True/False The difference between supervised and unsupervised learning is supervised learning has a categorical target variable and unsupervised learning has a numeric target variable
False
True/False The points on a model's precision-recall curve represent the cost of different classifications
True
True/False We can build unsupervised data mining models when we lack labels for the target variable in the training data
True
True/False When implementing CRISP-DM, a data scientist often needs to go through the process for several iterations
B
Which is NOT true about overfitting? A. If a model is overfitting, it will have a poor generalization performance. B. Overfitting happens when the model is overly simplified. C. A hold-out set can be used to examine overfitting. D. Overfitting can be avoided by tuning parameters on a validation set or via cross-validation.
D
Which of the following does NOT describe SVM (support vector machine)? A. SVM can be applied when the data are not linearly separable B. The decision boundaries are determined by the support vectors. Other training data can be ignored. C. SVM makes a prediction by evaluating the similarity between the new instance and the support vectors, usually represented by a kernel function. D. SVM uses hinge loss as a loss function, which is measured as the distance from the error point to the decision boundary
D
Which of the following is true about logistic regression? A. Logistic regression is a regression model and needs a numerical target variable. B. Logistic regression can generate non-linear decision boundaries without feature engineering. C. Logistic regression can directly work with any form of data so no data transformation is required for categorical attributes. D. Logistic regression predicts probability of membership in the positive class.
C
Which of the following models has a decision boundary different than others? A. Linear regression B. Logistic regression C. CART D. SVM with a linear kernel
logistic regression
a model for the probability of class membership; estimates the class probability via the log odds (logit): p+ = 1 / (1 + e^(-f(x)))
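The logistic link p+ = 1 / (1 + e^(-f(x))) can be sketched directly (the f(x) values are illustrative):

```python
import math

def prob_positive(fx):
    """Logistic (sigmoid) link: p+ = 1 / (1 + e^(-f(x)))."""
    return 1 / (1 + math.exp(-fx))

print(prob_positive(0.0))  # a point on the decision boundary -> 0.5
print(prob_positive(3.0))  # far on the positive side -> ~0.95
```

Note how the probability grows with the distance f(x) from the boundary, matching the "further from the boundary, higher probability" card above.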
Data leakage
a variable collected in historical data gives information on the target variable (info that appears in historical data but is not actually available when the decision has to be made)
Zero-one loss
assigns a loss of zero for a correct decision and one for an incorrect decision
pure
certain in the outcome, homogeneous with respect to the target variable
Support vectors
data points selected from the training set, they happen to lie on the edge of the decision boundary bar
Categorical/nominal data
data that has two or more categories, but there is no intrinsic ordering to the categories
ordinal data
data that has two or more categories, has a clear ordering of the variables
Margin
distance between the separating line and the edge of the margin bar in SVM
Classification
find a decision boundary that separates one class from the other; supervised segmentation
Predictive Model
formula learned from old data for estimating the unknown value of interest for some new data
Clustering
grouping individuals together by their similarity so that individuals in the same group are more similar to each other than those in other groups
Specificity
how good a test is at avoiding false alarms TN / (TN + FP)
Recall/sensitivity
how good a test is at detecting the positives TP / (TP + FN)
Precision
how many of the positively classified were relevant TP / (TP + FP)
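The three confusion-matrix metrics above can be computed together; a minimal sketch with made-up counts:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Specificity, recall (sensitivity), and precision from a confusion matrix."""
    return {
        "specificity": tn / (tn + fp),  # how good a test is at avoiding false alarms
        "recall": tp / (tp + fn),       # how good a test is at detecting the positives
        "precision": tp / (tp + fp),    # how many of the positively classified were relevant
    }

# Hypothetical counts
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
print(m)  # specificity = 45/55, recall = 40/45, precision = 40/50
```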
soft margin
maximize the margin while minimizing the hinge loss; the parameter C controls the trade-off between the two
hard margin
maximize the margin, allowing no mistakes; requires the data to be linearly separable
entropy
measures uncertainty/impurity
p+ / (1 - p+); odds = 0.9 / 0.1 = 9
odds of an event; example: p(x) = 0.9
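The odds formula p+ / (1 - p+) from the card above, checked on the same example:

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

print(odds(0.9))  # 0.9 / 0.1 = 9 (up to floating-point rounding)
```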
Classification tree
partition the space of examples with axis-parallel decision boundaries
CRISP-DM
process that places a structure on the problem: a life cycle of 6 phases, used to maintain reasonable consistency, repeatability, and objectiveness
Data reduction
reduce the dimensionality of the data to focus on the most important structure; replace a large dataset with a smaller representative one
imputation
replacing missing data with substituted values estimated from the data set
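A common simple form of imputation is mean imputation; a minimal sketch (missing values are represented here as None, and the data are made up):

```python
def mean_impute(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

print(mean_impute([1.0, None, 3.0, None]))  # [1.0, 2.0, 3.0, 2.0]
```

This is one alternative to simply removing observations with missing values, which (as the true/false card above notes) is not always the best way to handle them.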
Unsupervised
the model is NOT provided with the results (y) during training
Supervised
training data includes both input (x) and result (y)
Regression
value estimation; given an input x, predict a numerical value for the target variable y