Intro Data Mining Midterm
Accuracy Formula
# correct predictions/total # predictions
Simple Matching Coeff
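# matching attributes / # attributes. Count the attributes by value pair: f00 = # attributes where x = 0 and y = 0; f01 = # where x = 0 and y = 1; f10 = # where x = 1 and y = 0; f11 = # where x = 1 and y = 1. SMC = (f00 + f11) / (f00 + f01 + f10 + f11)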
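A minimal sketch of the SMC computation for two binary vectors. The function name and the example vectors are illustrative assumptions, not part of the notes.

```python
def simple_matching_coefficient(x, y):
    """SMC = (f00 + f11) / (f00 + f01 + f10 + f11) for equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # A position where both values agree is either an f00 or an f11 match.
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

# Hypothetical example vectors: f00 = 7, f01 = 2, f10 = 1, f11 = 0.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc = simple_matching_coefficient(x, y)  # 7 matches out of 10 attributes
```

Here the 0-0 matches dominate the score even though the vectors share no 1s, which is why SMC suits symmetric binary attributes.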
Accuracy
closeness of measurements to the true value
Precision
closeness of repeated measurements to one another
Aggregation
combining two or more objects into a single object, e.g. reducing 365 days into 12 months; provides a higher-level view of the data
Covariance matrix
the covariance of two attributes measures the degree to which they vary together; its magnitude depends on the scales of the attributes, so it cannot be used to judge the strength of the relationship, which is why correlation is preferred
Evaluating Performance of classifier (4)
1. Holdout method: the labeled data is split; part is used for training and part for testing
2. Random sampling: repeated holdout; its weakness is that there is no control over the number of times a specific record is used for testing and training
3. Cross-validation: each record is used the same number of times for training and exactly once for testing
4. Bootstrap: training records are sampled with replacement
2 Important properties of rule based classifiers
1. Mutually exclusive rules: no two rules are triggered by the same record, which ensures every record is covered by at most one rule 2. Exhaustive rules: there is a rule for each combination of attribute values, which ensures every record is covered by at least one rule
2 ways to handle overfitting in decision tree induction
1. Pre-pruning (early stopping): the tree-growing algorithm is halted before it generates a fully grown tree 2. Post-pruning: the tree is grown to full size and then trimmed bottom-up
Sparsity
for some data sets, most attributes of an object have values of 0; in practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated
Pessimistic error estimate
generalization error = sum of training error and a penalty term for model complexity
Hunt's Algorithm
grown in recursive fashion by partitioning the training records into successively purer subsets - look at example in notes: starts with Defaulted = No then grows from there
Learn-one-rule function
grows rule in greedy fashion -generates an initial rule r and keeps refining rule until a certain stopping criteria is met - rule then pruned to improve its generalization error
Cosine similarity = 1 vs = 0
if cosine sim = 1, the angle between x and y is 0 degrees and x and y are the same except for magnitude; if cosine sim = 0, the angle between x and y is 90 degrees and x and y do not share any terms
Feature Weighting
important features assigned higher weight
General to specific rule growing strategy
initial rule: r: {} -> y empty set covers all examples and y contains target class - new conjunctions added to improve rule
Artificial Neural Networks
interconnected assembly of nodes and directed links which can be used to learn the functional relationship between a set of inputs and outputs
Data Cube
multidimensional representation of the data together with all possible totals (aggregates)
Dimensionality of a data set
number of attributes
Predictive Tasks
objective is to predict the value of a particular attribute based on the values of other attributes - regression and classification
Feature Creation
often possible to create new attributes that capture the important info in a data set much more effectively
Which summary statistic is more useful for continuous data?
percentile
Pros/cons of pre vs post-pruning
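Pre-pruning: avoids generating overly complex subtrees, but it is difficult to choose the right stopping threshold. Post-pruning: tends to give better results than pre-pruning because pruning decisions are based on the fully grown tree, but it is more computationally expensive.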
Regression
predictive task used for continuous variables - predict the output value using training data.
Classification
predictive task used for discrete target variables - goal: previously unseen records should be assigned a class as accurately as possible
Nominal, qualitative or quantitative
qualitative
Ordinal, qualitative or quantitative
qualitative
Interval, qualitative or quantitative
quantitative
Measurement scale
rule that associates a numerical or symbolic value with an attribute of an object
Dicing
selecting a subset of cells by specifying a range of attribute values
Perceptron activation function
sign function: outputs 1 if its argument is positive and -1 if it is negative
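A small sketch of a perceptron prediction with the sign activation. The weights, bias, and inputs are hypothetical values chosen for illustration; mapping an argument of exactly 0 to -1 is also an assumption, since the definition above leaves that case open.

```python
def sign(z):
    # Assumption: an argument of exactly 0 is mapped to -1.
    return 1 if z > 0 else -1

def perceptron_predict(weights, bias, x):
    """Weighted sum of the inputs plus bias, passed through the sign activation."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sign(weighted_sum)

# Hypothetical weights: the output is 1 when at least two of three binary inputs are 1.
weights = [0.3, 0.3, 0.3]
bias = -0.4
two_on = perceptron_predict(weights, bias, [1, 1, 0])  # 0.6 - 0.4 > 0
one_on = perceptron_predict(weights, bias, [1, 0, 0])  # 0.3 - 0.4 < 0
```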
Bayesian theorem
statistical principle for combining prior knowledge of classes with new evidence gathered from data
Correlation Matrix
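takes each covariance and divides it by the product of the two attributes' standard deviations: corr(x, y) = cov(x, y) / (s_x * s_y), so every entry lies in [-1, 1]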
Classification model
target function f which classifies data
Classification
task of assigning object to one of several predefined categories
curse of dimensionality
the difficulties associated with analyzing high-dimensional data
model underfitting
the model is too simple and has yet to learn the true structure of the data, so both training error and test error are high
Training error vs generalization error/test error
Training error: # of misclassification errors committed on the training records. Generalization error: expected error on previously unseen records.
Validation set
training set divided; trained with 2/3, tested with 1/3 - used to estimate generalization/test error
if decision boundary in SVM not linear
transform data into higher dimensional space using kernel trick
Discretization
transforming continuous attribute into a categorical one
Feature subset selection
use subset of the features (if redundant or irrelevant features are present)
Jaccard Coeff
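used to handle asymmetric binary attributes; ignores 0-0 matches. Given the counts f00 = # attributes where x = 0 and y = 0, f01 = # where x = 0 and y = 1, f10 = # where x = 1 and y = 0, f11 = # where x = 1 and y = 1: J = f11 / (f01 + f10 + f11)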
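A minimal sketch of the Jaccard coefficient built from the four counts. The helper names and example vectors are illustrative assumptions; returning 0.0 when both vectors are all zeros is also an assumption, since the formula is undefined in that case.

```python
def binary_counts(x, y):
    """Return (f00, f01, f10, f11) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return f00, f01, f10, f11

def jaccard(x, y):
    _, f01, f10, f11 = binary_counts(x, y)
    denominator = f01 + f10 + f11
    # Assumption: two all-zero vectors get similarity 0.0 instead of an error.
    return f11 / denominator if denominator else 0.0

# Hypothetical example vectors: f00 = 3, f01 = 1, f10 = 1, f11 = 1.
x = [1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0]
j = jaccard(x, y)  # 1 / (1 + 1 + 1)
```

The three 0-0 matches do not affect the result, which is the point of using Jaccard for asymmetric attributes.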
Support Vector Machine
want to maximize the margin between classes
Dimensionality Reduction 3 benefits
- data mining algorithms work better - makes model more understandable - more easily visualized data
Ensemble method and what does it aim to decrease
predicts the class label by aggregating the predictions of multiple classifiers; aims to decrease variance
How to deal with rules which are not mutually exclusive
Not mutually exclusive means that a single record could trigger more than one rule. 1. Ordered rules/decision list: rules are ordered in decreasing order of priority, and a record is classified by the highest-ranked rule it triggers 2. Unordered rules: a test record may trigger multiple classification rules; the consequent of each triggered rule counts as a vote for a particular class, and the majority wins
Sequential covering algorithm
- used for building rule based classifier - rules grown in greedy fashion - a rule is desirable if it covers most of the positive (matching) examples and none of the negative examples - once a rule is found, the training records covered by the rule are eliminated - generates a set of rules per class until stopping criteria is met
Specific to general rule growing strategy
-one positive example is randomly chosen as initial seed for rule growing process - during refinement, rule is generalized by removing one of its conjuncts - refinement step repeated until rule starts covering negative examples
Error rate
1 - accuracy or #wrong predictions/total
Fill in the blanks: Training set -build--> _____1._____ --test--> ____2.________ --evaluate--> _____3._____
1. Classification model 2. test set with unknown values 3. confusion matrix
Classification model useful for (2)
1. Descriptive modeling: explanatory tool to distinguish between objects of different classes 2. Predictive modeling: used to predict class label of unknown records
3 things to do about missing values
1. Eliminate Data Objects or Attributes 2. Estimate Missing Values 3. Ignore missing values
3 Approaches to feature selection
1. Embedded approaches - algorithm itself decides which attributes to use 2. Filter Approaches - before data mining algorithm is run 3. Wrapper Approaches - use the target data mining algorithm as a black box to find best subsets of attributes
3 Ways to create features
1. Feature extraction - creation of new set of features from original raw data 2. mapping the data to a new space - often using transformations 3. features constructed out of original features
Post pruning 2 types of trimming
1. Subtree replacement: replace a subtree with a new leaf node whose class label is the majority class of the records associated with the subtree 2. Subtree raising: replace the subtree with its most frequently used branch
3 Types of nodes in decision trees
1. root node 2. internal node 3. leaf/terminal node
4 Ways to estimate generalization errors
1. Use the training error: select the model with the lowest training error rate, assuming that this also means a low generalization error
2. Incorporate model complexity: since the chance of overfitting increases with model complexity, simpler models should be chosen
3. Estimate statistical bounds: since the generalization error is usually higher than the training error, a statistical correction is computed as an upper bound on the training error
4. Use a validation set: the training set is divided, e.g. trained with 2/3 and tested with 1/3
Rule evaluation
1. statistical test 2. laplace and m-estimate: takes into account rule coverage 3. FOIL information gain
Contour Plot
3 dimensional data, 2 attributes specify position in a plane, 3rd has a continuous value (such as temp)
Types of splits: Binary attributes: Nominal: Ordinal: Continuous:
Binary: binary split
Nominal: multiway or binary (grouped)
Ordinal: multiway or binary (grouped, as long as the grouping does not violate the order property of the attribute)
Continuous: multiway (ranges) or binary comparison test, e.g. annual income > 80K
List categorical and numeric types
Categorical: nominal and ordinal Numeric: Interval and ratio
Rule Based Classifier
Collection of "if...then..." rules of the form r_i: (condition_i) -> y_i, where condition_i is the rule antecedent/precondition and y_i is the rule consequent
Why is accuracy not best for evaluating rules?
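Example: r1 covers 50 positive and 5 negative examples (accuracy 50/55, about 90.9%); r2 covers 2 positive and 0 negative examples (accuracy 100%). r2 has the higher accuracy, but r1 is clearly the better rule because its coverage is much larger.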
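A short sketch comparing plain accuracy with the Laplace estimate on the two rules above. The Laplace formula (f+ + 1) / (n + k), with k the number of classes, is the standard textbook form and is assumed here rather than taken from these notes; the function names are illustrative.

```python
def rule_accuracy(n_pos, n_neg):
    """Fraction of covered examples that are positive."""
    return n_pos / (n_pos + n_neg)

def laplace(n_pos, n_neg, k=2):
    """Laplace estimate (f+ + 1) / (n + k); k = number of classes (assumed 2 here)."""
    return (n_pos + 1) / (n_pos + n_neg + k)

# r1 covers 50 positive and 5 negative examples; r2 covers 2 positive and 0 negative.
acc_r1, acc_r2 = rule_accuracy(50, 5), rule_accuracy(2, 0)
lap_r1, lap_r2 = laplace(50, 5), laplace(2, 0)   # 51/57 and 3/4
```

Accuracy ranks r2 first, while the Laplace estimate ranks r1 first because it penalizes rules with very low coverage.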
Pivoting
aggregating over all dimensions except two
The bias-variance tradeoff
bias and variance are inversely related: changing model complexity to reduce one tends to increase the other
Pearson's Correlation
Correlation between two attributes measures how strongly they are linearly related; it always lies in the range -1 to 1. A correlation of -1 or 1 means x and y have a perfect linear relationship. A correlation of 0 means there is no linear relationship: an increase in one is not associated with a consistent increase or decrease in the other (a nonlinear relationship may still exist).
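A minimal sketch showing why correlation is easier to interpret than covariance: rescaling one attribute changes the covariance but not the correlation. The function names and data are illustrative assumptions, and the sample (n - 1) form of covariance is assumed.

```python
import math

def sample_covariance(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson(x, y):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

x = [1.0, 2.0, 3.0, 4.0]
y_small = [2.0, 4.0, 6.0, 8.0]          # y = 2x
y_large = [200.0, 400.0, 600.0, 800.0]  # same relationship, different scale

cov_small = sample_covariance(x, y_small)
cov_large = sample_covariance(x, y_large)
corr_small = pearson(x, y_small)
corr_large = pearson(x, y_large)
```

Both correlations come out at (approximately) 1, while the two covariances differ by a factor of 100.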
Quality of classification rule can be evaluated using (2)
Coverage(r) = |A| / |D|, where |A| = # records that satisfy the rule antecedent/precondition and |D| = total # records. Accuracy(r) = |A n y| / |A|, where |A n y| = # records that satisfy both the rule antecedent/precondition and the rule consequent.
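A small sketch of computing coverage and accuracy for one rule. The records, attribute names, and the rule itself are made up for illustration; returning an accuracy of 0.0 for a rule that covers nothing is an assumption.

```python
def rule_quality(records, antecedent, consequent):
    """Return (coverage, accuracy) of the rule 'antecedent -> consequent'."""
    covered = [r for r in records if antecedent(r)]           # the set A
    coverage = len(covered) / len(records)                    # |A| / |D|
    correct = sum(1 for r in covered if r["label"] == consequent)
    # Assumption: a rule that covers no records gets accuracy 0.0.
    accuracy = correct / len(covered) if covered else 0.0     # |A n y| / |A|
    return coverage, accuracy

# Hypothetical data set D with five records.
records = [
    {"outlook": "sunny", "label": "play"},
    {"outlook": "sunny", "label": "play"},
    {"outlook": "sunny", "label": "stay"},
    {"outlook": "rainy", "label": "stay"},
    {"outlook": "overcast", "label": "play"},
]

# Hypothetical rule: (outlook = sunny) -> play
coverage, accuracy = rule_quality(records, lambda r: r["outlook"] == "sunny", "play")
```

The rule covers 3 of 5 records, and 2 of those 3 have the predicted class.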
Association Analysis
Descriptive task - used to discover patterns that describe strongly associated features in the data; create dependency rules which predict occurrence of an item based on occurrence of other items - useful application: finding groups of genes that have related functionality
Direct Method vs Indirect method of building rule based classifier
Direct Method: extract classification rules from data Indirect Method: extract classification rules from other classification models (such as decision trees)
Discretization, which is preferred: equal width or equal frequency?
Equal frequency: puts the same number of objects into each interval, so it is not distorted by outliers the way equal width is
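A minimal sketch contrasting the two discretization schemes on data with one outlier. The function names and values are illustrative assumptions; ties are broken by sort order, which is a simplification.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (nearly) the same number of objects."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # 100 is an outlier
width_bins = equal_width_bins(values, 3)
freq_bins = equal_frequency_bins(values, 3)
```

With equal width, the outlier stretches the range so that eight of the nine values fall into the first bin and the middle bin is empty; equal frequency still puts three values in each bin.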
Clustering Similarity Measure
Euclidean distance if the attributes are continuous
Convert these 5 categorical values to three binary attributes {awful, poor, OK, good, great}
First map the values to integers: awful = 0, poor = 1, OK = 2, good = 3, great = 4. Then write each integer in binary using three bits:
value  integer  x1  x2  x3
awful  0        0   0   0
poor   1        0   0   1
OK     2        0   1   0
good   3        0   1   1
great  4        1   0   0
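The same encoding can be sketched in a few lines. The function name is an illustrative assumption; the ordering of the levels and the three-bit width follow the card above.

```python
LEVELS = ["awful", "poor", "OK", "good", "great"]

def to_binary_attributes(value, levels=LEVELS, n_bits=3):
    """Map a categorical value to its integer code, then to n_bits binary attributes."""
    code = levels.index(value)
    # Most significant bit first, so the result reads as [x1, x2, x3].
    return [(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]

encoded = {level: to_binary_attributes(level) for level in LEVELS}
```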
Decision list
For rule-based classifiers: an ordered rule set in which rules are ranked in decreasing order of priority; a record is classified by the highest-ranked rule it triggers
Interval vs Ratio
Interval: differences between values are meaningful. Ratio: differences and ratios are both meaningful, and 0 is a true zero (e.g. temperature in Fahrenheit is interval, but temperature in Kelvin is ratio because 0 is meaningful)
Distinctness, Order, Addition, Multiplication: which of these 4 properties do these have: Nominal attributes: Ordinal attributes: Interval attributes: Ratio attributes:
Nominal attributes: distinctness. Ordinal attributes: distinctness, order. Interval attributes: distinctness, order, addition. Ratio attributes: all 4.
Nominal vs Ordinal
Nominal: values are just different names; can only test whether they are equal or not (e.g. eye color, gender). Ordinal: values provide enough information to order objects (<, >) (e.g. {good, better, best})
Normalization vs Standardization
Normalization rescales the values into the range [0, 1]. This is useful when all attributes need to have the same positive scale. However, it is sensitive to outliers, which compress the remaining values into a narrow part of the range. Standardization rescales the data to have a mean (μ) of 0 and a standard deviation (σ) of 1 (unit variance).
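A minimal sketch of both rescalings. It assumes min-max scaling for normalization and the population standard deviation for standardization; the function names and data are illustrative, and the input must contain at least two distinct values.

```python
import statistics

def min_max_normalize(values):
    """Rescale to [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1: (v - mu) / sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

values = [10.0, 20.0, 30.0, 40.0, 50.0]
normalized = min_max_normalize(values)
standardized = standardize(values)
```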
Posterior probability
P(Y|X) Y - class X - attribute
Using bayesian theorem for classification
P(Y|X) = P(X|Y) P(Y) / P(X), where X denotes the attribute set and Y denotes the class variable
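A worked numeric sketch of the theorem for a two-class problem. All probabilities are made-up values for illustration, and the evidence P(X) is expanded with the law of total probability.

```python
def posterior(prior_y, likelihood_given_y, likelihood_given_not_y):
    """P(Y|X) = P(X|Y) P(Y) / P(X) for a binary class variable Y."""
    # P(X) = P(X|Y) P(Y) + P(X|not Y) P(not Y)
    evidence = (likelihood_given_y * prior_y
                + likelihood_given_not_y * (1 - prior_y))
    return likelihood_given_y * prior_y / evidence

# Hypothetical numbers: P(Y) = 0.3, P(X|Y) = 0.8, P(X|not Y) = 0.1
p_y_given_x = posterior(0.3, 0.8, 0.1)  # 0.24 / (0.24 + 0.07)
```

Observing X raises the probability of class Y from the prior of 0.3 to roughly 0.77.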
PCA
Principal Component Analysis: a linear algebra technique for dimensionality reduction of continuous attributes. It finds new attributes that 1. are linear combinations of the original attributes 2. are orthogonal to each other 3. capture the maximum amount of variation in the data
Ratio, qualitative or quantitative
Quantitative
Rule-based vs class-based ordering scheme
Rule-based: orders rules by some rule quality measure Class-based: rules that belong to same class appear together in rule set
Similarity Measures for Binary Data (3)
Simple Matching Coefficient, Jaccard Coeff, Cosine Similarity
Sampling Approaches (2)
1. Simple random sampling 2. Stratified sampling: samples are drawn from pre-specified groups of objects
Progressive Sampling
Start with small sample and increase until sample of sufficient size has been obtained
model overfitting
a model that fits training data too well can have poor generalization error
Bias
a systematic variation from the quantity being measured
Problem with binarization
can cause unintended relations among transformed attributes; thus for association analysis need asymmetric binary values
RIPPER Algorithm: for 2 classes
chooses majority class as default and then learns the rules for detecting the minority class
RIPPER Algorithm: for multiclass
classes ordered according to frequencies; during first iteration start with least frequent class labeled as positive examples and all others labeled as negative - sequential covering method used to generate rules that discriminate between + and - - then moves on to next class
Descriptive Tasks
derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in the data - clustering, association analysis, anomaly detection
Clustering
descriptive task - seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters
Anomaly Detection
descriptive task - the task of identifying observations whose characteristics are significantly different from the rest of the data; applications include: detection of fraud, network intrusions
Bias
difference between average prediction and actual value
Bagging
draw N bootstrap samples, retrain the model on each sample, and average the results (or take a majority vote for class labels)
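A minimal sketch of bagging for a two-class problem. The base model is a one-dimensional threshold stump chosen purely for illustration, and the data, names, and number of models are assumptions; predictions are combined by majority vote.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Pick the threshold t minimizing training errors for 'predict 1 if x > t'."""
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum(1 for x, y in sample if (1 if x > t else 0) != y)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(data, n_models, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = rng.choices(data, k=len(data))
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Each stump votes; the majority class wins."""
    votes = Counter(1 if x > t else 0 for t in models)
    return votes.most_common(1)[0][0]

# Hypothetical (x, label) pairs: small x is class 0, large x is class 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging_fit(data, n_models=25)  # odd count avoids tied votes
low = bagging_predict(models, 0.5)
high = bagging_predict(models, 9.5)
```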
Cosine similarity
each document has relatively few non-zero attributes, so similarity should not depend on the number of 0-0 matches: cos(x, y) = (x · y) / (||x|| ||y||), where x · y is the dot product and ||x|| is the length (Euclidean norm) of x
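A short sketch of the formula for two term-count vectors. The function name and the example vectors are illustrative assumptions.

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)"""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])  # differ only in magnitude
no_shared_terms = cosine_similarity([1, 0, 0], [0, 3, 0])  # orthogonal vectors
```

The first pair gives a similarity of (approximately) 1 and the second gives 0, matching the two extreme cases of the measure.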
k-fold cross-validation
Evaluates the performance of a classifier. In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. One subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The process is repeated k times so that each subsample is used exactly once for testing.
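A minimal sketch of how the k train/test splits can be generated from record indices. The function name, the fixed seed, and the slicing scheme are assumptions; when n is not divisible by k the folds differ in size by at most one.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Return a list of k (train_indices, test_indices) pairs over records 0..n-1."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k (nearly) equal-sized folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits

splits = k_fold_indices(n=10, k=5)
```

Each record index appears in exactly one test fold and in the training set of the other k − 1 rounds.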
Holdout method
Evaluates the performance of a classifier. In the holdout method, data points are randomly assigned to two sets d0 and d1, usually called the training set and the test set, respectively. The size of each set is arbitrary, although typically the test set is smaller than the training set. The model is then trained on d0 and tested on d1.
Random sampling
Evaluates the performance of a classifier by repeating the holdout method several times. Its weakness is that there is no control over the number of times a specific record is used for testing and training.
