Intro Data Mining Midterm

Accuracy Formula

number of correct predictions / total number of predictions

Simple Matching Coeff

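number of matching attribute values / number of attributes
f00: number of attributes where x = 0 and y = 0
f01: number of attributes where x = 0 and y = 1
f10: number of attributes where x = 1 and y = 0
f11: number of attributes where x = 1 and y = 1
SMC = (f00 + f11) / (f00 + f01 + f10 + f11)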

Accuracy

closeness of measurements to the true value

Precision

closeness of repeated measurements to one another

Aggregation

combining two or more objects into a single object, e.g. reducing 365 days to 12 months; provides a higher-level view of the data

Covariance matrix

the covariance of two attributes measures the degree to which they vary together, but its magnitude depends on the scale of the attributes, so it cannot be used to judge the strength of the relationship; this is why correlation is preferred

Evaluating Performance of classifier (4)

1. Holdout method: the labeled data is split; part is used for training, part for testing
2. Random sampling: repeated holdout method; drawback is that there is no control over the number of times a specific record is used for testing and training
3. Cross-validation: each record is used the same number of times for training and exactly once for testing
4. Bootstrap: training records are sampled with replacement

2 Important properties of rule based classifiers

1. Mutually exclusive: no two rules are triggered by the same record; ensures every record is covered by at most one rule
2. Exhaustive: there is a rule for each combination of attribute values; ensures every record is covered by at least one rule

2 ways to handle overfitting in decision tree induction

1. Pre-pruning (early stopping): the tree-growing algorithm is halted before it generates a fully grown tree
2. Post-pruning: the tree is grown fully and then trimmed

Sparsity

for some data sets, most attributes of an object have values of 0; in practical terms, sparsity is an advantage because usually only the non-zero values need to be stored and manipulated

Pessimistic error estimate

generalization error = sum of training error and a penalty term for model complexity

Hunt's Algorithm

grown in recursive fashion by partitioning the training records into successively purer subsets - look at example in notes: starts with Defaulted = No then grows from there

Learn-one-rule function

grows a rule in greedy fashion: generates an initial rule r and keeps refining it until a stopping criterion is met; the rule is then pruned to improve its generalization error

Cosine similarity = 1 vs = 0

if cosine similarity = 1, the angle between x and y is 0° and x and y are the same except for magnitude
if cosine similarity = 0, the angle between x and y is 90° and x and y do not share any terms

Feature Weighting

important features assigned higher weight

General to specific rule growing strategy

initial rule r: {} -> y, where the empty antecedent covers all examples and y is the target class; new conjuncts are added to improve the rule

Artificial Neural Networks

interconnected assembly of nodes and directed links which can be used to learn the functional relationship between a set of inputs and outputs

Data Cube

multidimensional representation of the data together with all possible totals (aggregates)

Dimensionality of a data set

number of attributes

Predictive Tasks

objective is to predict the value of a particular attribute based on the values of other attributes - regression and classification

Feature Creation

often possible to create new attributes that capture the important info in a data set much more effectively

Which summary statistic is more useful for continuous data?

percentile

Pros/cons of pre vs post-pruning

pre-pruning: avoids generating overly complex subtrees, but it is difficult to choose the threshold at which to stop
post-pruning: tends to give better results than pre-pruning, but is more computationally expensive

Regression

predictive task used for continuous variables - predict the output value using training data.

Classification

predictive task used for discrete target variables - goal: previously unseen records should be assigned a class as accurately as possible

Nominal, qualitative or quantitative

qualitative

Ordinal, qualitative or quantitative

qualitative

Interval, qualitative or quantitative

quantitative

Measurement scale

rule that associates a numerical or symbolic value with an attribute of an object

Dicing

selecting a subset of cells by specifying a range of attribute values

Perceptron activation function

sign function: outputs 1 if its argument is positive and -1 if it is negative
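
As a rough illustration, a perceptron prediction can be sketched as the sign of a weighted sum plus a bias. The weights, bias, and function names below are hypothetical, and mapping an argument of exactly 0 to -1 is an assumption, since the definition above only covers positive and negative arguments.

```python
def sign(z):
    """Sign activation: 1 for a positive argument, -1 otherwise.

    Assumption: an argument of exactly 0 is mapped to -1.
    """
    return 1 if z > 0 else -1


def perceptron_predict(weights, bias, x):
    """Return the sign of the weighted sum of the inputs plus the bias."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, x))
    return sign(weighted_sum + bias)


# Hypothetical model: fires when at least two of three binary inputs are 1.
weights = [0.3, 0.3, 0.3]
bias = -0.4

two_inputs_on = perceptron_predict(weights, bias, [1, 1, 0])
one_input_on = perceptron_predict(weights, bias, [1, 0, 0])
```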

Bayesian theorem

statistical principle for combining prior knowledge of classes with new evidence gathered from data

Correlation Matrix

takes the covariance of each pair of attributes and divides it by the product of their standard deviations

Classification model

target function f which classifies data

Classification

task of assigning object to one of several predefined categories

curse of dimensionality

the difficulties associated with analyzing high-dimensional data

model underfitting

the model is too simple or has seen too little training data and has yet to learn the true structure of the data; both training and test error are high

Training error vs generalization error/test error

training error: number of misclassification errors committed on the training records
generalization error: expected error on previously unseen records

Validation set

training set divided; trained with 2/3, tested with 1/3 - used to estimate generalization/test error

if decision boundary in SVM is not linear

transform data into higher dimensional space using kernel trick

Discretization

transforming continuous attribute into a categorical one

Feature subset selection

use subset of the features (if redundant or irrelevant features are present)

Jaccard Coeff

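used to handle asymmetric binary attributes; given
f00: number of attributes where x = 0 and y = 0
f01: number of attributes where x = 0 and y = 1
f10: number of attributes where x = 1 and y = 0
f11: number of attributes where x = 1 and y = 1
J = f11 / (f01 + f10 + f11)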
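
A minimal Python sketch of these counts, computing the Jaccard coefficient and, for comparison, the simple matching coefficient. The function names and example vectors are illustrative assumptions, as is returning 0 for the Jaccard coefficient when neither vector contains a 1.

```python
def binary_counts(x, y):
    """Return (f00, f01, f10, f11) for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pairs = list(zip(x, y))
    f00 = pairs.count((0, 0))
    f01 = pairs.count((0, 1))
    f10 = pairs.count((1, 0))
    f11 = pairs.count((1, 1))
    return f00, f01, f10, f11


def smc(x, y):
    """Simple matching coefficient: counts 0-0 and 1-1 matches equally."""
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f00 + f11) / (f00 + f01 + f10 + f11)


def jaccard(x, y):
    """Jaccard coefficient: ignores 0-0 matches."""
    f00, f01, f10, f11 = binary_counts(x, y)
    denominator = f01 + f10 + f11
    # Assumption: return 0.0 when neither vector has any 1s.
    return f11 / denominator if denominator else 0.0


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc_xy = smc(x, y)          # dominated by the seven 0-0 matches
jaccard_xy = jaccard(x, y)  # no shared 1s
```

On sparse vectors like these, the two measures disagree sharply, which is why the Jaccard coefficient is preferred when only the 1s carry information.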

Support Vector Machine

want to maximize the margin between classes

Dimensionality Reduction 3 benefits

- data mining algorithms work better
- the model becomes more understandable
- the data can be visualized more easily

Ensemble method and what does it aim to decrease

- predicts the class label by aggregating the predictions of multiple classifiers
- aims to decrease variance

How to deal with rules which are not mutually exclusive

- this means that a single record could trigger more than one rule
1. Ordered rules (decision list): rules are ordered in decreasing order of priority; a record is classified by the highest-ranked rule it triggers
2. Unordered rules: a test record may trigger multiple classification rules; the consequent of each rule counts as a vote for a particular class, and the majority wins

Sequential covering algorithm

- used for building a rule-based classifier
- rules are grown in greedy fashion
- a rule is desirable if it covers most of the positive (matching) examples and none of the negative examples
- once a rule is found, the training records covered by the rule are eliminated
- generates a set of rules per class until a stopping criterion is met

Specific to general rule growing strategy

- one positive example is randomly chosen as the initial seed for the rule-growing process
- during refinement, the rule is generalized by removing one of its conjuncts
- the refinement step is repeated until the rule starts covering negative examples

Error rate

1 - accuracy, or number of wrong predictions / total number of predictions
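
A small Python sketch showing that the two forms of the error rate agree. The labels and function names are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def error_rate(y_true, y_pred):
    """Fraction of predictions that do not match the true labels."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical labels: 6 of the 8 predictions are correct.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
err = error_rate(y_true, y_pred)
```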

Fill in the blanks: Training set -build--> _____1._____ --test--> ____2.________ --evaluate--> _____3._____

1. classification model
2. test set with unknown class labels
3. confusion matrix

Classification model useful for (2)

1. Descriptive modeling: explanatory tool to distinguish between objects of different classes
2. Predictive modeling: used to predict the class label of unknown records

3 things to do about missing values

1. Eliminate data objects or attributes
2. Estimate missing values
3. Ignore missing values

3 Approaches to feature selection

1. Embedded approaches: the algorithm itself decides which attributes to use
2. Filter approaches: features are selected before the data mining algorithm is run
3. Wrapper approaches: use the target data mining algorithm as a black box to find the best subset of attributes

3 Ways to create features

1. Feature extraction: creation of a new set of features from the original raw data
2. Mapping the data to a new space, often using transformations
3. Feature construction: new features built out of the original features

Post pruning 2 types of trimming

1. replace a subtree with a new leaf node whose class label is determined from the majority class of the records affiliated with the subtree
2. replace a subtree with its most frequently used branch

3 Types of nodes in decision trees

1. root node
2. internal node
3. leaf (terminal) node

4 Ways to estimate generalization errors

1. Training error: select the model with the lowest training error rate, assuming that this also means a low generalization error
2. Incorporating model complexity: since the chance of overfitting increases with model complexity, simpler models should be preferred
3. Estimating statistical bounds: since the generalization error is usually higher than the training error, a statistical correction is computed as an upper bound on the training error
4. Using a validation set: the training set is divided; the model is trained on 2/3 and tested on 1/3

Rule evaluation

1. Statistical test
2. Laplace and m-estimate: take rule coverage into account
3. FOIL's information gain

Contour Plot

3 dimensional data, 2 attributes specify position in a plane, 3rd has a continuous value (such as temp)

Types of splits: Binary attributes: Nominal: Ordinal: Continuous:

Binary: binary split
Nominal: multiway or binary (grouped)
Ordinal: multiway or binary (grouped, as long as the grouping does not violate the order property of the attribute)
Continuous: multiway or binary comparison test, e.g. annual income > 80K

List categorical and numeric types

Categorical: nominal and ordinal
Numeric: interval and ratio

Rule Based Classifier

Collection of "if...then..." rules of the form r_i: (condition_i) -> y_i
condition_i: rule antecedent (precondition)
y_i: rule consequent

Why is accuracy not best for evaluating rules?

e.g. r1 covers 50 positive and 5 negative examples (accuracy 50/55 ≈ 90.9%); r2 covers 2 positive and 0 negative examples (accuracy 100%). r2 has the higher accuracy, but r1 is clearly the better rule because it has far better coverage

Pivoting

aggregating over all dimensions except two

The bias-variance tradeoff

bias and variance are inversely related to each other

Pearson's Correlation

The correlation between two attributes measures the strength of their linear relationship and is always in the range -1 to 1. A correlation of -1 or 1 means x and y have a perfect linear relationship. A correlation of 0 means there is no linear relationship: an increase in x is not associated with a consistent increase or decrease in y (a nonlinear relationship may still exist).
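
A minimal sketch of Pearson's correlation in plain Python, computed as the covariance divided by the product of the standard deviations (the common 1/n factors cancel). The function name and data are illustrative assumptions. The third example shows a correlation of 0 for data with an exact but nonlinear relationship.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


perfect_positive = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
perfect_negative = pearson([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])
# y = x^2 on a symmetric range: fully determined by x, yet uncorrelated.
nonlinear = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```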

Quality of classification rule can be evaluated using (2)

Coverage(r) = |A| / |D|, where |A| is the number of records that satisfy the rule antecedent (precondition) and |D| is the total number of records
Accuracy(r) = |A ∩ y| / |A|, where |A ∩ y| is the number of records that satisfy both the rule antecedent and the rule consequent
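
A short Python sketch of both measures, with accuracy taken over the records the rule covers (|A|). The records, attribute names, and the rule are invented for illustration.

```python
def coverage_and_accuracy(records, antecedent, consequent):
    """Return (coverage, accuracy) of the rule `antecedent -> consequent`.

    `antecedent` is a function that returns True when a record satisfies
    the rule's precondition; `consequent` is the predicted class label.
    """
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if r["label"] == consequent]
    coverage = len(covered) / len(records)
    # Assumption: a rule that covers nothing is given accuracy 0.0.
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical training records.
records = [
    {"gives_birth": "yes", "body_temp": "warm", "label": "mammal"},
    {"gives_birth": "yes", "body_temp": "warm", "label": "mammal"},
    {"gives_birth": "no", "body_temp": "warm", "label": "bird"},
    {"gives_birth": "yes", "body_temp": "cold", "label": "fish"},
    {"gives_birth": "yes", "body_temp": "warm", "label": "other"},
]


def gives_birth_and_warm(r):
    """Antecedent of the rule: (gives_birth = yes) and (body_temp = warm)."""
    return r["gives_birth"] == "yes" and r["body_temp"] == "warm"


coverage, accuracy = coverage_and_accuracy(records, gives_birth_and_warm, "mammal")
```

Here the rule covers 3 of the 5 records and is correct on 2 of those 3.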

Association Analysis

Descriptive task - used to discover patterns that describe strongly associated features in the data; create dependency rules which predict occurrence of an item based on occurrence of other items - useful application: finding groups of genes that have related functionality

Direct Method vs Indirect method of building rule based classifier

Direct method: extract classification rules directly from the data
Indirect method: extract classification rules from other classification models (such as decision trees)

Discretization, which is preferred: equal width or equal frequency?

Equal frequency: puts the same number of objects into each interval, so it is not affected by outliers the way equal width is
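
A rough Python sketch of the two binning schemes on data with one outlier. The function names and values are illustrative assumptions. The equal-frequency version assigns bins by rank, so tied values can land in different bins; that is a simplification.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; width would be zero")
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding about n / k values each."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # 100 is an outlier
by_width = equal_width_bins(values, 3)
by_frequency = equal_frequency_bins(values, 3)
```

With equal width, the outlier stretches the range so that every other value lands in the first bin and the middle bin is empty; equal frequency still puts three values in each bin.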

Clustering Similarity Measure

Euclidean distance if the attributes are continuous

Convert these 5 categorical values to three binary attributes {awful, poor, OK, good, great}

First map the values to integers: awful = 0, poor = 1, OK = 2, good = 3, great = 4. Then write each integer in binary using three attributes:
value   x1  x2  x3
awful   0   0   0
poor    0   0   1
OK      0   1   0
good    0   1   1
great   1   0   0
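
The same mapping can be sketched in Python: number the values in order, then write each number in binary with just enough bits. The function name is a hypothetical helper, not from the original notes.

```python
def ordinal_to_binary(values):
    """Map each value to the binary digits of its position in `values`."""
    # Number of bits needed to represent the largest index (at least 1).
    n_bits = max(1, (len(values) - 1).bit_length())
    return {
        value: [int(bit) for bit in format(index, f"0{n_bits}b")]
        for index, value in enumerate(values)
    }


encoding = ordinal_to_binary(["awful", "poor", "OK", "good", "great"])
```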

Decision list

For rule-based classifiers: an ordered rule set in which rules are ranked in decreasing order of priority; a record is classified by the highest-ranked rule it triggers

Interval vs Ratio

Interval: differences between values are meaningful
Ratio: differences, ratios, and a true zero are all meaningful (e.g. temperature in Fahrenheit is interval, but temperature in Kelvin is ratio because 0 is a true zero)

Distinctness, Order, Addition, Multiplication Which of these 4 do these have: Nominal attributes: Ordinal attributes: Interval attributes: Ratio attributes:

Nominal attributes: distinctness
Ordinal attributes: distinctness, order
Interval attributes: distinctness, order, addition
Ratio attributes: all 4

Nominal vs Ordinal

Nominal: values are just different names; can only be compared for equality (e.g. eye color, gender)
Ordinal: values provide enough information to order objects (<, >) (e.g. {good, better, best})

Normalization vs Standardization

Normalization (min-max scaling) rescales the values into the range [0, 1]. This is useful when all parameters need to have the same positive scale. However, it is sensitive to outliers, which compress the remaining values into a narrow part of the range.
Standardization rescales the data to have a mean (μ) of 0 and a standard deviation (σ) of 1 (unit variance).
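
A minimal Python sketch of both rescalings using only the standard library. The function names and sample values are illustrative assumptions, and the population standard deviation is used for standardization.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; range would be zero")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("all values are identical; std would be zero")
    return [(v - mean) / std for v in values]


values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std 2
normalized = min_max_normalize(values)
standardized = standardize(values)
```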

Posterior probability

P(Y|X), where Y is the class and X is the attribute set

Using bayesian theorem for classification

P(Y|X) = P(X|Y) P(Y) / P(X), where X denotes the attribute set and Y denotes the class variable
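
A small worked example in Python for a two-class problem. The priors and likelihoods are invented numbers; P(X) is obtained by summing P(X|Y) P(Y) over the classes.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)."""
    return likelihood * prior / evidence


# Hypothetical probabilities for one observed attribute set X.
p_y = {"yes": 0.2, "no": 0.8}          # priors P(Y)
p_x_given_y = {"yes": 0.9, "no": 0.3}  # likelihoods P(X|Y)

# Evidence P(X) by the law of total probability: 0.18 + 0.24 = 0.42.
p_x = sum(p_x_given_y[c] * p_y[c] for c in p_y)

posteriors = {c: posterior(p_y[c], p_x_given_y[c], p_x) for c in p_y}

# Classify by picking the class with the largest posterior.
predicted = max(posteriors, key=posteriors.get)
```

Although X is three times more likely under "yes", the larger prior for "no" still gives "no" the higher posterior (4/7 against 3/7).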

PCA

Principal Component Analysis: a linear algebra technique for dimensionality reduction of continuous attributes; it finds new attributes that
1. are linear combinations of the original attributes
2. are orthogonal to each other
3. capture the maximum amount of variation in the data

Ratio, qualitative or quantitative

Quantitative

Rule-based vs class-based ordering scheme

Rule-based: orders individual rules by some rule quality measure
Class-based: rules that belong to the same class appear together in the rule set

Similarity Measures for Binary Data (3)

Simple Matching Coefficient, Jaccard Coeff, Cosine Similarity

Sampling Approaches (2)

1. Simple random sampling
2. Stratified sampling: samples are drawn from pre-specified groups of objects

Progressive Sampling

Start with small sample and increase until sample of sufficient size has been obtained

model overfitting

a model that fits training data too well can have poor generalization error

Bias

a systematic variation from the quantity being measured

Problem with binarization

can cause unintended relations among transformed attributes; thus for association analysis need asymmetric binary values

RIPPER Algorithm: for 2 classes

chooses majority class as default and then learns the rules for detecting the minority class

RIPPER Algorithm: for multiclass

classes ordered according to frequencies; during first iteration start with least frequent class labeled as positive examples and all others labeled as negative - sequential covering method used to generate rules that discriminate between + and - - then moves on to next class

Descriptive Tasks

derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in the data - clustering, association analysis, anomaly detection

Clustering

descriptive task - seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than observations that belong to other clusters

Anomaly Detection

descriptive task - the task of identifying observations whose characteristics are significantly different from the rest of the data; applications include: detection of fraud, network intrusions

Bias

difference between average prediction and actual value

Bagging

draw N bootstrap samples (sampled with replacement), retrain the model on each sample, then average the results (or take a majority vote for classification)

Cosine similarity

each document has relatively few non-zero attributes, so similarity should not depend on the number of 0-0 matches
cos(x, y) = (x · y) / (||x|| ||y||), where x · y is the dot product and ||x|| is the length (norm) of x
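
A minimal Python sketch of the formula, assuming documents are represented as term-count vectors. The function name and vectors are illustrative. The two calls show the extreme cases: same direction (similarity 1) and no shared terms (similarity 0).

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Same direction, different magnitude: angle of 0 degrees.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
# No term appears in both documents: angle of 90 degrees.
no_shared_terms = cosine_similarity([1, 0, 3], [0, 5, 0])
```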

k-fold cross-validation

used to evaluate the performance of a classifier. The original sample is randomly partitioned into k equal-sized subsamples. A single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The process is repeated k times so that each subsample is used exactly once for testing.
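
A rough sketch of how the k train/test splits can be generated in plain Python. The function name, the fixed seed, and the round-robin assignment of shuffled indices to folds are illustrative choices; when the number of records is not a multiple of k, fold sizes differ by at most one.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and the number of records")
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # random partition, reproducible
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_records=10, k=5))
```

Each record index appears in exactly one test fold and in the training portion of the other k − 1 splits.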

Holdout method

used to evaluate the performance of a classifier. The data points are randomly assigned to two sets, d0 and d1, usually called the training set and the test set, respectively. The size of each set is arbitrary, although typically the test set is smaller than the training set. The model is then trained on d0 and tested on d1.

Random sampling

used to evaluate the performance of a classifier: the holdout method repeated several times. Drawback: there is no control over the number of times a specific record is used for testing and training
