Data Mining Final
Confusion Matrix
A table of counts showing how often each class is reported (predicted) when a given class was actually present. For a binary classifier the four cells are the TP, FP, TN, and FN counts.
Classifier Accuracy
((TP + TN) / (TP + TN + FP + FN)) x 100
Minkowski distance
(|x1 - x2|^h + |y1 - y2|^h)^(1/h)
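A minimal sketch of the Minkowski distance, generalized to points with any number of dimensions. The function name `minkowski` and the sample points are illustrative assumptions. Setting h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance.

```python
def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Absolute differences keep the result valid for odd or fractional h.
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)


p, q = (1, 2), (4, 6)            # hypothetical sample points
manhattan = minkowski(p, q, 1)   # |1 - 4| + |2 - 6|
euclidean = minkowski(p, q, 2)   # sqrt(3^2 + 4^2)
```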
The quality of a clustering method depends on
• the similarity measure used by the method • its implementation • its ability to discover some or all of the hidden patterns
Challenges in performance estimation
1. Enough samples in train and test sets: • If the dataset has a relatively large number of samples (n) compared to the number of features (p) (i.e. n >> p), a model trained on a training set and tested on a test set is believed to provide realistic error estimates that capture the true characteristics of the dataset. • However, if the dataset has fewer samples (n) than features (p) (i.e. n << p), the error estimates can be inaccurate and misleading. • Appropriate train and test sets (with a minimum number of samples) should be determined from a predetermined confidence interval for model performance before the model is created. 2. Handling imbalanced datasets: • Models are affected by an imbalance in the number of samples in each class. • Classifiers trained on imbalanced datasets tend to assign test instances to the majority class (i.e. the class with the most samples), which is often the least important class. • These misclassifications of samples that belong to the minority class degrade the overall performance of the model.
I
= - Σ_c p(c) log2(p(c))
Ires
= - Σ_v p(v) Σ_c p(c|v) log2(p(c|v))
False Positive Rate (Fall Out)
= FP/ ( FP+TN)
True Negative Rate (Specificity)
= TN / (FP+TN)
True Positive Rate (Sensitivity)
= TP / (TP+FN) = 1 - FN rate
Manhattan distance
= |x1 - x2| + |y1 - y2|
Precision (p)
= TP / (TP + FP). The fraction of positive predictions that are actually positive (positive predictive value).
Dendrogram
A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster
Information Entropy-Theoretic Approach
Gain(A) = I - Ires(A)
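The entropy I, the residual entropy Ires, and the gain can be sketched as below. The function names and the toy attribute/label lists are assumptions made for illustration; probabilities are estimated as relative frequencies.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """I = -sum over classes c of p(c) * log2(p(c))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def residual_entropy(values, labels):
    """Ires = sum over attribute values v of p(v) * entropy of the labels with that v."""
    n = len(labels)
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(v, []).append(c)
    return sum(len(group) / n * entropy(group) for group in groups.values())


def gain(values, labels):
    """Gain(A) = I - Ires(A)."""
    return entropy(labels) - residual_entropy(values, labels)


# Hypothetical toy data: the first attribute predicts the class perfectly,
# the second one carries no information about it.
play = ["no", "no", "yes", "yes"]
informative = ["sunny", "sunny", "rain", "rain"]
uninformative = ["a", "b", "a", "b"]
```

On this toy data the informative attribute recovers the full 1 bit of class entropy, while the uninformative one has a gain of 0.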
Kernel Functions
Instead of computing the dot product on the transformed (higher-dimensional) data, it is mathematically equivalent to apply a kernel function to the original data
K-Medoids
Instead of taking the mean value of the objects in a cluster as the reference point, use a medoid: the most centrally located object in the cluster. This addresses the sensitivity of k-means to outliers. • PAM (Partitioning Around Medoids): starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if doing so improves the total distance of the resulting clustering • PAM works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity) • Efficiency improvement on PAM: CLARANS (randomized re-sampling)
Jaccard coefficient
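Jaccard(i, j) = q/(q+r+s), where q: # of attributes that are 1 for both i and j, r: # that are 1 for i and 0 for j, s: # that are 0 for i and 1 for j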
k Nearest Neighbor (KNN)
K-Nearest Neighbor can be used for classification/prediction tasks. Step 1: Using a chosen distance metric, compute the distance between the new example and all past examples. Step 2: Choose k examples that are closest to the new example. Step 3: Work out the predominant class of those k nearest neighbors - the predominant class is your prediction for the new example. i.e. classification is done by majority vote of the k nearest neighbors.
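The three steps above can be sketched as follows, assuming Euclidean distance (via `math.dist`, Python 3.8+) as the chosen metric. The training points and labels are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train, new_point, k):
    """train is a list of (point, label) pairs; returns the majority label."""
    # Steps 1 and 2: rank all past examples by distance and keep the k closest.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], new_point))[:k]
    # Step 3: majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 8), "B")]
prediction = knn_predict(train, (2, 2), k=3)
```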
Issues with Euclidean Distance
• Scaling of values: since each numeric attribute may be measured in different units, they should be standardized. • Weighting of attributes: • Manual weighting: weights may be suggested by experts • Automatic weighting: weights may be computed based on discriminatory power or other statistics (e.g. in SAS, the weighted dimension is based on the correlation to the target variable) • Treatment of categorical variables: various ways of assigning distance between categories are possible
Content-based filtering
See what a customer has bought in the past and use this information to predict what he would like in the future.
Overfitting
An induced tree may overfit the training data • Too many branches, some may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples
Laplacian correction
Add 1 to the count of each case when a zero count is encountered, so that no probability is estimated as exactly zero
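A small sketch of the correction, assuming it is applied to the value counts of one attribute within one class. The function name and the counts are hypothetical. Adding 1 to every count also adds the number of distinct values to the denominator, so the estimates still sum to 1.

```python
from fractions import Fraction


def laplace_probabilities(counts):
    """counts maps each attribute value to its count within one class."""
    # One extra (pseudo) observation per value keeps every estimate above zero.
    total = sum(counts.values()) + len(counts)
    return {value: Fraction(count + 1, total) for value, count in counts.items()}


counts = {"low": 0, "medium": 990, "high": 10}   # hypothetical counts
probs = laplace_probabilities(counts)
```

Without the correction, the zero count for "low" would force the whole product of conditional probabilities to zero.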
Hierarchical Clustering
AGNES (agglomerative) and DIANA (divisive)
Proximity Measure for Nominal Attributes
• Simple matching • Jaccard coefficient • Cosine similarity
Recall (r)
Sometimes referred to as the TP rate or sensitivity = TP / (TP+FN) = 1 - FN rate
Gini Index (CART)
The attribute that provides the smallest gini index or largest reduction in impurity is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Holdout
The holdout method is considered the simplest form of performance estimation. It partitions the data, prior to training, into two disjoint sets: a train set and a test set. • The train set is used to train the chosen classifier for model generation during the training phase. • During the train phase: • The optimal values of the model parameters are determined • An appropriate performance measure is evaluated. • Testing phase: • The testing set is used to obtain an unbiased estimate of the generalized performance of the models. • The holdout estimates of error could be misleading when the testing set is not large enough to provide good error estimates. • Therefore, we choose a 60:40 split of training:testing set.
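A minimal sketch of a holdout partition using the 60:40 ratio mentioned above. The function name, the fixed seed, and the integer "samples" are assumptions for illustration.

```python
import random


def holdout_split(samples, train_fraction=0.6, seed=0):
    """Shuffle a copy of the samples and cut it into disjoint train/test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(100))                  # stand-in for 100 data samples
train_set, test_set = holdout_split(samples)
```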
SVM—Linearly Separable
Separating hyperplane: W · x + b = 0. The margin is bounded by H1: w0 + w1 x1 + w2 x2 >= 1 for y = +1, and H2: w0 + w1 x1 + w2 x2 <= -1 for y = -1. Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
True Negative (TN)
When a sample belonging to the true negative class (C - ) is correctly classified as negative
False Positive (FP)
When a sample belonging to the true negative class (C - ) is misclassified as positive
True Positive (TP)
When a sample belonging to the true positive class (C + ) is correctly classified as positive
False Negative (FN)
When a sample belonging to the true positive class (C + ) is misclassified as negative
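The four outcome counts combine into the rates used throughout these notes. The sketch below uses hypothetical counts and an assumed function name, and returns accuracy as a fraction rather than a percentage.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the common rates from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),         # TP rate, sensitivity
        "specificity": tn / (tn + fp),    # TN rate
        "fall_out": fp / (fp + tn),       # FP rate = 1 - specificity
    }


m = classification_metrics(tp=40, fp=10, tn=45, fn=5)   # hypothetical counts
```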
Binary Attributes
a nominal attribute with only two categories or states: 0 or 1. Boolean.
Average link
average distance between an element in one cluster and an element in the other • Centroid: distance between the centroids of two clusters, i.e., dist(Ki , Kj ) = dist(Ci , Cj ) • Medoid: distance between the medoids of two clusters, i.e., dist(Ki , Kj ) = dist(Mi , Mj )
Cosine measure of similarity
cos(d1, d2) = (d1 · d2) / (||d1|| x ||d2||), where · is the dot product and ||d|| is the length (Euclidean norm) of vector d
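A short sketch of the cosine measure: the dot product of the two vectors divided by the product of their lengths. The function name and the sample vectors are illustrative, and the sketch assumes neither vector is all zeros.

```python
from math import sqrt


def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])   # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])             # perpendicular vectors
```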
Simple matching
d(i, j) = (p - m)/ p m: # of matches, p: total # of variables
Distance for Asymmetric Binary Values
d(i, j) = (r+s)/(q+r+s)
Distance for Symmetric Binary Values
d(i, j) = (r+s)/(q+r+s+t)
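The binary distances and the Jaccard coefficient can be sketched from a single contingency count. This assumes the usual meaning of the symbols, which these notes do not spell out: q counts attributes that are 1 in both objects, r those that are 1 in i and 0 in j, s those that are 0 in i and 1 in j, and t those that are 0 in both. The two sample objects are hypothetical.

```python
def binary_counts(i, j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    pairs = list(zip(i, j))
    q = pairs.count((1, 1))
    r = pairs.count((1, 0))
    s = pairs.count((0, 1))
    t = pairs.count((0, 0))
    return q, r, s, t


def symmetric_distance(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(i, j):
    q, r, s, _ = binary_counts(i, j)   # 0/0 matches are ignored
    return (r + s) / (q + r + s)


def jaccard(i, j):
    q, r, s, _ = binary_counts(i, j)
    return q / (q + r + s)


obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 0, 1, 0, 0]
```

With these definitions the Jaccard coefficient is 1 minus the asymmetric binary distance.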
Reduction in Impurity
gini(D) - giniA(D)
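A sketch of the reduction in impurity for one candidate split. It assumes the standard definition gini(D) = 1 - Σ p(c)^2, which is not written out in these notes, and takes giniA(D) as the size-weighted Gini index of the partitions. The class labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """gini(D) = 1 - sum of squared class proportions (assumed definition)."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(partitions):
    """giniA(D): Gini index of each partition, weighted by partition size."""
    n = sum(len(part) for part in partitions)
    return sum(len(part) / n * gini(part) for part in partitions)


d = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4
reduction = gini(d) - gini_split([left, right])
```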
Clustering
is grouping of objects (or data points). • Ambiguous: as there are many ways of grouping • Subjective: as relies on the application using it.
Complete link
largest distance between an element in one cluster and an element in the other
Disadvantage of Information gain
measure is biased towards attributes with many values
Proximity
refers to either similarity or dissimilarity
A good clustering method
should have • High intra-class similarity: Cohesive within clusters • Low inter-class similarity: Distinctive between clusters
Single link
smallest distance between an element in one cluster and an element in the other
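The single, complete, and average link measures differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and two hypothetical clusters:

```python
from itertools import product
from math import dist


def pairwise_distances(c1, c2):
    """All distances between an element of c1 and an element of c2."""
    return [dist(a, b) for a, b in product(c1, c2)]


def single_link(c1, c2):
    return min(pairwise_distances(c1, c2))      # smallest pairwise distance


def complete_link(c1, c2):
    return max(pairwise_distances(c1, c2))      # largest pairwise distance


def average_link(c1, c2):
    distances = pairwise_distances(c1, c2)
    return sum(distances) / len(distances)      # mean pairwise distance


k1 = [(0, 0), (0, 1)]
k2 = [(3, 0), (4, 0)]
```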
Euclidean distance
sqrt((x1- x2)^2 + (y1 - y2)^2)
Cluster analysis
the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).
C4.5
uses gain ratio (a normalization of information gain) to overcome information gain's bias towards attributes with many values • The attribute with the maximum gain ratio is selected as the splitting attribute • Disadvantage: tends to prefer unbalanced splits in which one partition is much smaller than the others
Distance with Ordinal Variables
zif = (rif - 1)/ (Mf - 1) Example: freshman:0; sophomore: 0.33; junior: 0.67; senior:1
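The rank normalization can be checked against the example values above. In this sketch `rank` stands for rif (1-based) and `num_states` for Mf; both names are illustrative.

```python
def normalize_rank(rank, num_states):
    """z = (r - 1) / (M - 1): maps ranks 1..M onto the interval [0, 1]."""
    return (rank - 1) / (num_states - 1)


years = ["freshman", "sophomore", "junior", "senior"]
z = {name: normalize_rank(rank, len(years)) for rank, name in enumerate(years, start=1)}
```

Once mapped to [0, 1], the values can be compared with any numeric distance, such as the Manhattan or Euclidean distance.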
Bayesian Classification
• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities • Foundation: Based on Bayes' Theorem. • Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data • Theoretical Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Dissimilarity matrix
• A triangular matrix • n data points, but registers only the distance (d) • d(i, i+1) is the distance between row i and row i+1 of the data matrix
Naïve Bayes Classifier
• Advantages • Easy to implement • Good results obtained in most of the cases • Disadvantages • Assumption: class conditional independence, therefore loss of accuracy • Practically, dependencies exist among variables • E.g., hospitals: patients: Profile: age, family history, etc. Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc. • Dependencies among these cannot be modeled by Naïve Bayes Classifier
SVM Classifiers
• Advantages • Prediction accuracy is generally high; even on high dimensional data. • Robust, works when training examples contain errors • Fast evaluation of the learned target function • Criticism • Long training time • Difficult to understand the learned function (weights) • Not easy to incorporate domain knowledge • A relatively new classification method for both linear and nonlinear data • It uses a nonlinear mapping to transform the original training data into a higher dimension • Within the new dimensions, it searches for the linear optimal separating hyperplane (i.e., "decision boundary") • SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors) SVM searches for the hyperplane with the largest margin, i.e., maximum marginal hyperplane (MMH)
Grid-based approach
• Based on a multiple-level granularity structure • Typical methods: STING, WaveCluster, CLIQUE
Density-based approach
• Based on connectivity and density functions • Typical methods: DBSCAN, OPTICS, DenClue
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down, recursive, divide-and-conquer manner • Note: Attributes are categorical (if continuous-valued, they are discretized in advance) 1. At start, all the training examples are at the root • Root in the DT represents a selected attribute 2. Samples are partitioned recursively based on selected attribute 3. Attributes are selected based on a statistical measure (e.g., information gain) Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning - majority voting is employed for classifying the leaf • There are no samples left
Major weakness of agglomerative clustering methods
• Can never undo what was done previously • Does not scale well: time complexity of at least O(n^2), where n is the total number of objects
Partitioning approach
• Construct various partitions and then evaluate them by a user specified criterion, e.g., minimizing the sum of square errors • Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach
• Create a hierarchical decomposition of the set of data (or objects) using a user specified criterion • Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Building Profiles from Data
• Data Needed • Personal information, preferences & interests • Registration data, including demographic data • Customer ratings • Purchasing data • What was bought, when and where • Browsing & visitation data • Clickstream (Weblog files) • Building customer profiles • Demographic (e.g., name, address, age) • Behavioral (e.g., favorite type of book - adventure, largest transaction - $295) • Things learned from data
Issues that affect classification: Data Preparation
• Data cleaning • Preprocess data in order to reduce noise and handle missing values • Data transformation • Generalize and/or normalize data • Feature Selection (Relevance analysis) • Remove the irrelevant or redundant attributes • Correlation Analysis • Attribute Subset Selection
Separation of clusters
• Exclusive (e.g., an object belongs to only one cluster) vs. non-exclusive (e.g., an object may belong to more than one cluster; think of a hierarchical dendrogram)
Properties of Distance Measures
• For any object A, dist(A, A) = 0 • For all objects A, B, and C • dist(A, B) ≥ 0, (non-negative/positivity) • dist(A, B) = dist(B, A) (symmetric) • dist(A, C) ≤ dist(A, B) + dist(B, C) (triangular inequality)
Comparing Attribute Selection Measures
• Information gain: • biased towards multivalued attributes • Gain ratio: • tends to prefer unbalanced splits in which one partition is much smaller than the others • Gini index: • biased to multivalued attributes • has difficulty when # of classes is large • tends to favor tests that result in equal-sized partitions and purity in both partitions
DIANA (Divisive Analysis)
• Inverse order of AGNES • Eventually each node forms a cluster on its own
Distance Functions
• Manhattan distance • Euclidean distance • Hamming Distance • Cosine of the angle between vectors
Issues that affect classification: Performance Evaluation
• Measures of Accuracy • classifier accuracy: predicting class label • predictor accuracy: guessing value of predicted attributes • Speed • time to construct the model (training time) • time to use the model (classification/prediction time) • Robustness: handling noise and missing values • Scalability: efficiency in disk-resident databases • Interpretability • understanding and insight provided by the model • Other measures, e.g., goodness of rules, decision tree size or compactness of classification rules
How to handle continuous-valued attributes?
• Method 1: Discretize continuous values and treat them as categorical values • E.g., age : <20, 20...30, 30...40, 40...50, >50. • Method 2: Determine the best split point • Sort the values of A in increasing order • Possible split point: the midpoint between each pair of adjacent values • (ai + ai+1)/2 is the mid point between the values of ai and ai+1 • The point with the maximum information gain for A is selected as the split point for A • Split: Based on split point P • Set of tuples in D satisfying A ≤ P vs. those with A > P
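Method 2 can be sketched as a scan over the midpoints between adjacent sorted values, keeping the midpoint with the highest information gain. The function names and the age/label data are hypothetical, and entropy is estimated from relative frequencies.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def best_split(values, labels):
    """Return (split point, information gain) for a continuous attribute."""
    pairs = list(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    distinct = sorted(set(values))
    best_point, best_gain = None, -1.0
    # Candidate split points are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        point = (low + high) / 2
        left = [c for v, c in pairs if v <= point]    # tuples with A <= P
        right = [c for v, c in pairs if v > point]    # tuples with A > P
        residual = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if base - residual > best_gain:
            best_point, best_gain = point, base - residual
    return best_point, best_gain


ages = [22, 25, 30, 41, 46, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
point, split_gain = best_split(ages, buys)
```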
Classification Two Step Process
• Model construction: describing a set of predetermined classes • Model usage: for classifying future or unknown objects
Collaborative Filtering: Drawback
• Needs real time recommendation • Scale - millions of customers, thousands of items • Works well only once a "critical mass" of preference has been obtained • Need a very large number of consumers to express their preferences about a relatively large number of products. • Consumer input is difficult to get • Solution: identify preferences that are implicit in people's actions • For example, people who order a book implicitly express their preference for the book they buy over other books • Works well but results are not as good as the results achieved using explicit ratings.
Dissimilarity measure (Distance function)
• Numerical measure of how different two data objects are • Lower when objects are more alike • Minimum dissimilarity is often 0 • Upper limit varies [0,1] or [0, inf) depending on the definition
Strengths of K-Nearest Neighbor
• Often work well for classes that are hard to separate using parametric methods or the splits used by decision trees. • Simple to implement and use • Comprehensible - easy to explain prediction • Robust to noisy data by averaging k-nearest neighbors. • Some appealing applications (e.g. personalization)
Benefits of using Performance estimation strategies
• Performance estimation strategies are used to keep a model's error estimates from being overfit, i.e. overly optimistic (lower than the true error rate). • Fact: most applications have only a finite set of relevant samples, which is often insufficient for testing a hypothesis using classification models. • Over-fitting is a prominent issue in several applications, as exemplified by small sample sets • Especially those that have many features (p).
Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set
Unsupervised learning (clustering)
• The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data; there are no predefined classes (i.e., learning by observations, vs. learning by examples in the supervised case) • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms
Why Is SVM Effective on High Dimensional Data?
• The complexity of trained classifier is characterized by the # of support vectors rather than the dimensionality of the data • The support vectors are the essential or critical training examples —they lie closest to the decision boundary (MMH)
K-fold cross validation
• K-fold cross validation is the most widely used performance estimation technique in data analytics applications. • It divides the data set into k disjoint (independent) subsets with an equal (or nearly equal) number of samples in each. • Each of the k disjoint subsets is referred to as a 'fold,' hence the name k-fold. • The process is iterative: at each iteration, one of the k subsets (chosen at random) is used as the test set for performance estimation, and the remaining k-1 subsets are combined to form the training set used to train the model.
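A minimal sketch of how the folds might be generated, working on sample indices only. The function name and seed are assumptions. In this version the samples are shuffled once and each fold then serves as the test set exactly once, which is one common way to realize the random choice described above.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding by k gives k disjoint folds of equal or nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n=10, k=5))
```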
Personalization Process
• Understand-Deliver-Measure Cycle
Handling imbalanced datasets
• Unsupervised approaches: rely on various resampling strategies • Supervised (algorithmic) approaches: • Rely on weighting and thresholding strategies that prioritize minority classes to counter the class imbalance caused by the majority classes. • These strategies include adjusting the decision threshold or one-class learning rather than multi-class learning.
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
• Use a statistical technique called bootstrapping to create several smaller samples (subsets), each fits in memory • Each subset is used to create a tree, resulting in several trees • These trees are examined and used to construct a new tree T' • It turns out that T' is very close to the tree that would be generated using the whole data set together • Adv: requires only two scans of DB, an incremental alg.
AGNES (Agglomerative Nesting)
• Use the single-link method and the dissimilarity matrix • Merge nodes that have the least dissimilarity • Proceeds iteratively in a non-descending fashion • Eventually all nodes belong to the same cluster
Prediction
• models continuous-valued functions, i.e., predicts unknown or missing values
Data matrix
• n data points (rows) with p dimensions (columns)
Classification
• predicts categorical class labels (discrete or nominal) • classifies new data based on the training set and the corresponding target values (class labels)
Why is decision tree induction popular?
• relatively faster learning speed (than other classification methods) • convertible to simple and easy to understand classification rules • can be used with SQL queries for accessing databases • comparable classification accuracy with other methods
Clustering Requirements and Challenges
☐ Scalability • Clustering all the data instead of only on samples ☐ Ability to deal with different types of attributes • Numerical, binary, categorical, ordinal, and mixture of these ☐ Constraint-based clustering • User may specify inputs or constraints • Use domain knowledge to determine input parameters ☐ Interpretability and usability ☐ Others ☐ Discovery of clusters with arbitrary shape ☐ Ability to deal with noisy data ☐ Incremental clustering and insensitivity to input order ☐ High dimensionality
K-Means
1. Partition objects into k non-empty subsets 2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster) 3. Assign each object to the cluster with the nearest seed point 4. Go back to Step 2, stop when the assignment does not change Strength: Efficient: O(n×t×k), where n is # objects, k is # clusters, and t is # iterations. Normally, t & k << n Comment: Often terminates at a local optimum. Weakness: 1. Applicable only to objects in a continuous n-dimensional space • Use the k-modes method for categorical data • In comparison, k-medoids can be applied to a wide range of data 2. Need to specify k, the number of clusters, in advance 3. Sensitive to noisy data and outliers 4. Not suitable for discovering clusters with non-convex shape
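The loop in steps 2 to 4 can be sketched as below. As a simplifying assumption, this version starts from caller-supplied seed points instead of an initial partition, uses Euclidean distance, and keeps the old centroid if a cluster ends up empty. The points and seeds are hypothetical.

```python
from math import dist


def k_means(points, seeds, max_iter=100):
    """Return (cluster index per point, final centroids)."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:     # stop when nothing changes
            break
        assignment = new_assignment
        # Recompute each centroid as the mean point of its cluster.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(points, seeds=[(1, 1), (8, 8)])
```

Different seeds can lead to different final clusterings, which is the local-optimum behavior noted above.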
Unsupervised approaches
1. The random over-sampling of minority class with replacement of samples, 2. Random under-sampling of majority class, 3. Directed over-sampling (the choice of which samples to replace is informed rather than random), and 4. Directed under-sampling (the choice of samples to eliminate is informed).
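The two random strategies (items 1 and 2) can be sketched as follows; the directed variants would replace the random choice with an informed one. The function names, seed, and class sizes are assumptions for illustration.

```python
import random


def random_oversample(minority, target_size, seed=0):
    """Grow the minority class by drawing extra samples with replacement."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra


def random_undersample(majority, target_size, seed=0):
    """Shrink the majority class by keeping a random subset (no replacement)."""
    return random.Random(seed).sample(majority, target_size)


majority = list(range(100))            # stand-in for 100 majority samples
minority = [-n for n in range(1, 11)]  # stand-in for 10 minority samples
oversampled = random_oversample(minority, target_size=len(majority))
undersampled = random_undersample(majority, target_size=len(minority))
```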
Symmetric Binary
Both outcomes are equally important.
Asymmetric Binary
Both outcomes are not equally important.
Recommendation Technologies
Collaborative filtering Content-based filtering Rule-based approach
How to decide "K" in KNN
Computational cost: For a large database, we'd have to compute the distance between the new example and every old example, and then sort by distance, which can be very time-consuming. Possible resolutions are: • sampling: store only relevant samples of the historic data so that you have fewer distances to compute.
Similarity measure
Distance-based (e.g., Euclidean, Manhattan) vs. connectivity-based (e.g., density or contiguity)
Similarity measure (Similarity function)
E.g. (Correlation) • Numerical measure of how alike two data objects are • Value is higher when objects are more alike • Often falls in the range [0,1]
Collaborative filtering
Find the closest customers and recommend based on what closest customers bought • Starts with a history of people's personal preferences • Uses a distance function - people who like the same things are "close" • Uses "votes" which are weighted by distances, so close neighbor votes count more • Ex: David and Don: 6; David and Rachel: 1; Don's value: 1/7; Rachel's value: 6/7
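One reading of the example, offered as an assumption rather than the notes' stated method: the numbers 6 and 1 are distances from David, and each neighbor's vote weight is its inverse distance divided by the sum of all inverse distances. That reading reproduces the 1/7 and 6/7 values exactly.

```python
from fractions import Fraction


def inverse_distance_weights(distances):
    """Normalize 1/distance so that the vote weights sum to 1."""
    inverse = {name: Fraction(1, d) for name, d in distances.items()}
    total = sum(inverse.values())
    return {name: weight / total for name, weight in inverse.items()}


# Distances from David to each neighbor, taken from the example.
weights = inverse_distance_weights({"Don": 6, "Rachel": 1})
```

Rachel is six times closer to David than Don is, so her vote counts six times as much.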
Rule-based approach
Identify business rules about what products should be recommended • Example: IF a customer fits a certain profile (e.g. male, age 25-35), THEN recommend a certain set of products.
Area under the curve (AUC)
It is a relative measure that ranges from 0 to 1 in the ROC space (see Figure). • A classifier is believed to perform well if the AUC is higher and approaches closer to 1, and vice versa.
Types of Classifiers
Linear and non-linear
Three-way split
One alternative to the holdout technique is the three-way split. In the three-way split: Model selection and Performance (true error) estimates are computed at the same time. This technique splits the data into three independent sets: • The training set, • The validation set, and • The testing set. The validation set consists of a set of samples that are used to fine-tune the estimated parameters of the model selected using the train set. This fine-tuning enables the removal of biases from the true error estimates created during the model training using the train set.
Partitioning method
Partitioning a dataset / database D of p objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci ) • Global optimal: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen'67, Lloyd'57/'82): Each cluster is represented by the center of the cluster • k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
Personalization
Personalization/customization tailors certain offerings by providers to consumers based on knowledge about them with certain goals in mind.
Two approaches to avoid overfitting
Prepruning: Halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold Postpruning: Remove branches from a "fully grown" tree - get a sequence of progressively pruned trees • Use a validation set of data (different from the training data) to decide which is the "best pruned tree"
ROC-curve
The receiver operating characteristic (ROC) graph plots the true positive rate against the false positive rate of a classifier in the ROC space. • The FP rate (1 - specificity) on the x-axis vs. • The sensitivity (TP rate) on the y-axis. • A point in the ROC space represents a classifier by its (FP rate, TP rate) coordinates, measured on a test set. • If the curve is skewed toward the southeast corner of the ROC space: • The classifier exhibits a higher FP rate and a lower TP rate. • The classifier is conservative when it is biased toward false positive classifications along with a lower TP rate. • If the ROC curve of a classifier falls along the diagonal: • It is believed that the classifier has no bias towards the TP rate or the FP rate. • The classifier performs like a random guess, as in deciding by flipping a coin (heads or tails).
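A sketch of how ROC points and the area under the curve might be computed from classifier scores. It assumes labels are 1 (positive) or 0 (negative), that higher scores mean "more positive", and that no two scores are tied. The scores and labels are hypothetical, and the AUC uses the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points, lowering the threshold one sample at a time."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

A classifier that ranks every positive above every negative reaches an area of 1, while points along the diagonal give an area of 0.5.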
