CSC529 Week1
Top-Down Tree Construction
(1) Apply S to D to find splitting criterion (2) if (t is not a leaf node) (3) Create children nodes of t (4) Partition D into children partitions (5) Recurse on each partition (6) endif
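A minimal Python sketch of this schema (illustrative only, not any specific algorithm from the list below; find_best_split stands in for a split-selection method S and is assumed to return None when t should be a leaf, or an (attribute, value) pair for a binary numeric split):

def build_tree(D, find_best_split, min_size=2):
    # D is a list of (feature_dict, class_label) pairs
    labels = [y for _, y in D]
    majority = max(set(labels), key=labels.count)
    # leaf if the partition is pure or too small
    if len(set(labels)) == 1 or len(D) < min_size:
        return {"leaf": True, "label": majority}
    split = find_best_split(D)                 # (1) apply S to D to find splitting criterion
    if split is None:                          # (2) t is a leaf node
        return {"leaf": True, "label": majority}
    attr, value = split
    left = [(x, y) for x, y in D if x[attr] <= value]    # (4) partition D
    right = [(x, y) for x, y in D if x[attr] > value]
    if not left or not right:                  # degenerate split -> make t a leaf
        return {"leaf": True, "label": majority}
    return {"leaf": False, "split": split,     # (3) create children of t
            "left": build_tree(left, find_best_split, min_size),    # (5) recurse
            "right": build_tree(right, find_best_split, min_size)}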
decision tree Split selection algorithms
(CART, C4.5, QUEST, CHAID, CRUISE, ...)
EP: The data is split on the variable "Income >= 75K". 50 customers make 75K or more; the entropy of that partition is 0.8. The entropy of the partition containing the remaining customers is 0.6. What is the information gain?
0.35
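Worked out, assuming the parent set is the 200-customer dataset from the next card (entropy 1) and that the remaining 150 customers form the second partition:
\[ \mathrm{Gain} = 1 - \left( \tfrac{50}{200}\cdot 0.8 + \tfrac{150}{200}\cdot 0.6 \right) = 1 - (0.20 + 0.45) = 0.35 \]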
EP: A data set of 200 customers is collected and used to train a decision tree. 100 of these customers have churned (left the company). What is the entropy?
1
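Worked out: each class (churned / not churned) has probability 100/200 = 0.5, so
\[ H = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 0.5 + 0.5 = 1 \]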
What is a data mining model?
A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values.
Decision Trees
A decision tree T encodes d (a classifier or regression function) in the form of a tree.
Gini index
A measure of impurity (based on relative frequencies of classes in a set of instances) The attribute that provides the smallest Gini index (or the largest reduction in impurity due to the split) is chosen to split the node Possible Problems: Biased towards multivalued attributes; similar to Info. Gain. Has difficulty when # of classes is large
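A small illustrative Python helper for the Gini index of a set of class labels (function name is mine; it computes the standard 1 minus the sum of squared class frequencies):

from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels: 1 - sum_i p_i^2
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes"] * 10))               # pure node: 0.0
print(gini(["yes"] * 5 + ["no"] * 5))   # 50/50 node: 0.5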
decision tree (leaf node)
A node t in T without children
decision tree Data access methods
CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator
decision tree (leaf nodes classification problem)
Classification problem: Node t is labeled with one class label c in dom(C)
Many data mining and analytics tasks involve the comparison of objects and determining their similarities (or dissimilarities)
Clustering Nearest-neighbor search, classification, and prediction Characterization and discrimination Automatic categorization Correlation analysis
Why use data mining today?
Competitive pressure! Competition on service, not only on price Personalization CRM Security, homeland defense
Data matrix
Conceptual representation of a table Cols = features; rows = data objects n data points with p dimensions Each row in the matrix is the vector representation of a data object
Probabilistic Belief
Consider a world where a dentist D meets with a new patient P D is interested in only whether P has a cavity; so, a state is described with a single proposition - Cavity Before observing P, D does not know if P has a cavity, but from years of practice, he believes Cavity with some probability p and ¬Cavity with probability 1-p The proposition is now a random variable and (Cavity, p) is a probabilistic belief
Common Distance Measures for Numeric Data
Consider two vectors x and y (rows in the data matrix) Common Distance Measures: Manhattan distance: d(x,y) = |x1 - y1| + ... + |xp - yp| Euclidean distance: d(x,y) = sqrt((x1 - y1)^2 + ... + (xp - yp)^2)
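A short Python illustration of the two measures on plain list vectors (helper names are mine):

import math

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # L2 distance: square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = [1, 2, 3], [4, 6, 3]
print(manhattan(x, y))  # 3 + 4 + 0 = 7
print(euclidean(x, y))  # sqrt(9 + 16 + 0) = 5.0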
examples of distances
Cosine of the angle between vectors Manhattan distance Euclidean distance Hamming Distance
Why do data selection?
Data Sources are Expensive Obtaining Data Loading Data into Database Maintaining Data Most Fields are not useful Names Addresses Code Numbers
data transformation includes
Data cleaning Combine related data sources Create common units Generate new fields Sampling
Why Create Common Units?
Data exists at different Granularity Levels Customers Transactions Products Data Mining requires a common Granularity Level (often called a Case) Mining usually occurs at "customer" or similar granularity
Data Mining Step in Detail
Data preprocessing (data selection, data transformation) Data mining model construction Model evaluation
different types of classifiers
Decision Trees Simple Bayesian models Nearest neighbor methods Logistic regression Neural networks Linear discriminant analysis (LDA) Quadratic discriminant analysis (QDA) Density estimation methods
A data mining model can be described at two levels (functional level)
Describes model in terms of its intended usage Classification Clustering
K-Nearest-Neighbor Strategy(combination functions) Weighted Voting: not so "democratic": How can weights be obtained?
Distance-based: closer neighbors get higher weights; the "value" of the vote is the inverse of the distance (may need to add a small constant); the weighted sum for each class gives the combined score for that class; to compute confidence, need to take the weighted average Heuristic: the weight for each neighbor is based on domain-specific characteristics of that neighbor
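A sketch of the distance-based weighting described here, assuming the k neighbors' distances and class labels have already been found (names are illustrative):

from collections import defaultdict

def weighted_vote(neighbors, eps=1e-6):
    # neighbors: list of (distance, class_label) pairs for the k nearest neighbors
    # each vote is weighted by the inverse of the distance; eps is the small constant
    # that avoids division by zero
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)
    winner = max(scores, key=scores.get)
    confidence = scores[winner] / sum(scores.values())   # weighted share of the winning class
    return winner, confidence

print(weighted_vote([(0.5, "yes"), (1.0, "yes"), (2.0, "no")]))  # ('yes', ~0.86)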
Representation of objects as vectors:
Each data object (item) can be viewed as an n-dimensional vector, where the dimensions are the attributes (features) in the data Example (employee DB): Emp. ID 2 = <M, 51, 64000> The vector representation allows us to compute distance or similarity between pairs of items using standard vector operations, e.g., Euclidean distance or cosine similarity
decision tree (internal node splitting predicate)
Each internal node has an associated splitting predicate. Most common are binary predicates. Example predicates: Age <= 20 Profession in {student, teacher} 5000*Age + 3*Salary - 10000 > 0
Why Combine Data Sources?
Enterprise Data typically stored in many heterogeneous systems Keys to join systems may or may not be present Heuristics must be used when keys are missing Time-based matching Situation-based matching
K-Nearest-Neighbor Strategy(classification)
Find the class label for each of the k neighbor Use a voting or weighted voting approach to determine the majority class among the neighbors (a combination function) Weighted voting means the closest neighbors count more Assign the majority class label to x
decision tree pruning
For a tree T, the misclassification rate and the mean-squared error rate depend on P, but not on D. The goal is to do well on records randomly drawn from P, not to do well on the records in D If the tree is too large, it overfits D and does not model P. The pruning method selects the tree of the right size.
A data mining model can be described at two levels:
Functional level Representational level
Classification: Definition
Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
K-Nearest-Neighbor Strategy
Given object x, find the k most similar objects to x The k nearest neighbors Variety of distance or similarity measures can be used to identify and rank neighbors Note that this requires comparison between x and all objects in the database
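A minimal Python sketch of this strategy, assuming Euclidean distance and a simple majority-vote combination function:

import math
from collections import Counter

def knn_classify(x, data, k=3):
    # data: list of (vector, class_label) pairs; x: query vector
    # scans all objects (no indexing), ranks them by Euclidean distance,
    # and returns the majority class among the k nearest neighbors
    ranked = sorted((math.dist(x, v), label) for v, label in data)
    top_k = [label for _, label in ranked[:k]]
    return Counter(top_k).most_common(1)[0][0]

data = [([1, 1], "A"), ([1, 2], "A"), ([8, 9], "B"), ([9, 9], "B")]
print(knn_classify([2, 2], data, k=3))  # "A"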
Bayes' Rule - An Example
Given: P(Cavity) = 0.1 P(Toothache) = 0.05 P(Cavity|Toothache) = 0.8 Bayes' rule tells: P(Toothache | Cavity) = (0.8 x 0.05)/0.1 = 0.4
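The general form of Bayes' rule being applied here:
\[ P(\mathrm{Toothache} \mid \mathrm{Cavity}) = \frac{P(\mathrm{Cavity} \mid \mathrm{Toothache}) \, P(\mathrm{Toothache})}{P(\mathrm{Cavity})} = \frac{0.8 \times 0.05}{0.1} = 0.4 \]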
classifier requirements on the model
High accuracy Understandable by humans, interpretable Fast construction for very large training databases
Why Use Data Mining Today?
Human analysis skills are inadequate Availability of data, storage, computing power, expertise, and off-the-shelf software An abundance of data Commercial support (Python, R, SAS, AWS, SPSS, Azure, Hadoop, etc.)
The Knowledge Discovery Process (steps)
Identify business (or other) problem Data mining Action Evaluation and measurement Deployment and integration into business processes
data selection
Identify target datasets and relevant fields
K-Nearest-Neighbor Strategy(prediction)
Identify the value of the target attribute for the k neighbors Return the weighted average as the predicted value of the target attribute for x
Proximity Measure for Nominal Attributes
If object attributes are all nominal (categorical), then proximity measures are used to compare objects Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary attribute) Method 1: Simple matching: d(i,j) = (p - m) / p, where m = # of matches and p = total # of variables Method 2: Convert to Standard Spreadsheet format For each attribute A create M binary attributes for the M nominal states of A Then use standard vector-based similarity or distance metrics
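An illustrative Python sketch of Method 1 (simple matching dissimilarity; function name is mine):

def simple_matching_distance(obj1, obj2):
    # obj1, obj2: equal-length tuples of nominal attribute values
    # d(i,j) = (p - m) / p, with p = number of attributes, m = number of matches
    p = len(obj1)
    m = sum(1 for a, b in zip(obj1, obj2) if a == b)
    return (p - m) / p

print(simple_matching_distance(("red", "S", "cash"), ("red", "M", "cash")))  # 1/3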
Measuring Distance or Similarity
In order to group similar items, we need a way to measure the distance between objects (e.g., records) Often requires the representation of objects as "feature vectors"
Vector-Based Similarity Measures
In some situations, distance measures provide a skewed view of data E.g., when the data is very sparse and 0's in the vectors are not significant In such cases, typically vector-based similarity measures are used Most common measure: Cosine similarity Given the dot product of two vectors, the cosine similarity is: cos(x,y) = (x · y) / (||x|| ||y||)
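A small Python sketch of cosine similarity between two vectors given as plain lists (no zero-vector guard; purely illustrative):

import math

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0, 1, 1], [1, 1, 0, 1]))  # 2/3 ≈ 0.667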
Gain Ratio
Information Gain measure tends to be biased in favor of attributes with a large number of values Gain Ratio normalizes the Information Gain with respect to the total entropy of all splits based on values of an attribute Used by C4.5 (the successor of ID3) But, tends to prefer unbalanced splits (one partition much smaller than others)
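In symbols (the standard C4.5 form, where the D_j are the partitions produced by splitting on attribute A):
\[ \mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}, \qquad \mathrm{SplitInfo}(A) = -\sum_j \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \]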
examples of data mining models
Linear regression model Classification model Decision Tree Naïve Bayes K-Nearest Neighbor Clustering
Decision Trees: Summary
Many applications of decision trees There are many algorithms available for: Split selection Pruning Handling Missing Values Data Access Decision tree construction is still an active research area (after 20+ years!) Challenges: Performance, scalability, evolving datasets, new applications
Why Data Cleaning?
Missing Data Unknown demographic data Impute missing values when possible Incorrect Data Hand-typed default values (e.g. 1900 for dates) Misplaced Fields Data does not always match documentation Missing Relationships Foreign keys missing or dangling
Why Sampling?
Most real datasets are too large to mine directly (> 200 million cases) Apply random sampling to reduce data size and improve error estimation Always sample at analysis granularity (case/"customer"), never at transaction granularity.
Similarity
Numerical measure of how alike two data objects are Value is higher when objects are more alike Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are Lower when objects are more alike Minimum dissimilarity is often 0 Upper limit varies
K-Nearest-Neighbor Strategy(combination functions)
Once the Nearest Neighbors are identified, the "votes" of these neighbors must be combined to generate a prediction Voting: the "democracy" approach poll the neighbors for the answer and use the majority vote the number of neighbors (k) is often taken to be odd in order to avoid ties works when the number of classes is two if there are more than two classes, take k to be the number of classes plus 1
Conditional Probabilities
P(A | B) = the conditional (posterior) probability of A given B P(A | B) = P(A, B) / P(B) P(A ∧ B) = P(A, B) = P(A | B) · P(B) P(A ∧ B) = P(A, B) = P(A) · P(B), if A and B are independent We say that A is independent of B if P(A | B) = P(A) A and B are independent given C if: P(A | B, C) = P(A | C) -> P(A, B | C) = P(A | C) P(B | C)
EP: For Bayes theorem to be applied, the following relationship between hypothesis H and evidence E must hold.
P(H|E) + P(~H| E) = 1
Basic Axioms of Probability
P(True) = 1 and P(False) = 0 P(A ∧ B) = P(A) · P(B | A) P(¬A) = 1 - P(A) if A ≡ B, then P(A) = P(B) P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Many of today's real-world applications rely on the computation of similarities or distances among objects
Personalization Recommender systems Document categorization Information retrieval Target marketing
Examples of Classification Task
Predicting tumor cells as benign or malignant Classifying credit card transactions as legitimate or fraudulent Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil Categorizing news stories as finance, weather, entertainment, sports, etc
Examples of commercial data mining support software
Python R SAS AWS Azure SPSS Mahout Spark And on...and on...and on
Why Generate New Fields?
Raw data fields may not be useful by themselves Simple transformations can improve mining results dramatically: Customer start date -> Customer tenure Recency, Frequency, Monetary values Fields at wrong granularity level must be aggregated
decision tree data access
Recent development: Very large training databases, both in-memory and on secondary storage Goal: Fast, efficient, and scalable decision tree construction, using the complete training database.
Types of Data
Relational data and transactional data Spatial and temporal data, spatio-temporal observations Time-series data Text Voice Images, video Mixtures of data Sequence data Features from processing other data sources
Entropy and Information gain
Review slides 48 - 56
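For reference, the standard two-class definitions (p and n are the class counts, matching the notation in the 'choosing the best feature' card later in this set):
\[ I(p,n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} \]
\[ E(A) = \sum_i \frac{p_i + n_i}{p+n}\, I(p_i, n_i), \qquad \mathrm{Gain}(A) = I(p,n) - E(A) \]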
Minkowski Distance
Review slides 85-88 Note that Euclidean and Manhattan distances are special cases
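For reference, the standard definition (h = 1 gives Manhattan, h = 2 gives Euclidean):
\[ d(x,y) = \left( \sum_{i=1}^{p} |x_i - y_i|^{h} \right)^{1/h} \]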
A data mining model can be described at two levels (representational level)
Specific representation of a model Log-linear model Classification tree
Decision Tree Construction (Three algorithmic components)
Split selection (CART, C4.5, QUEST, CHAID, CRUISE, ...) Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping) Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)
Examples of data sources (an abundance of data)...
Supermarket scanners, POS data Preferred customer cards Credit card transactions Direct mail response Call center records ATM machines Demographic data Sensor networks Cameras Web server logs Customer web site trails
Data Mining Techniques (two main)
Supervised learning Unsupervised learning
EP: Which of the following are true about decision trees?
The algorithm to produce a decision tree is deterministic. Decision trees do not make the naive assumption of independence among variables. A variable can occur multiple times in a tree. Variables can be nominal or numeric.
"Lazy" Classifiers
The approach defers all of the real work until new instance is obtained; no attempt is made to learn a generalized model from the training set Less data preprocessing and model evaluation, but more work has to be done at classification time
Data mining definition (valid)
The patterns hold in general.
Probabilistic Belief State
The world has only two possible states, which are respectively described by Cavity and ¬Cavity The probabilistic belief state of an agent is a probability distribution over all the states that the agent thinks possible
classifier goals
To produce an accurate classifier/regression function To understand the structure of the problem
Decision Tree Construction
Top-down tree construction schema: Examine training database and find best splitting predicate for the root node Partition training database Recurse on each child node
decision tree (leaf nodes regression problem)
Two choices: Piecewise constant model: t is labeled with a constant y in dom(Y). Piecewise linear model: t is labeled with a linear model Y = y_t + Σ a_i X_i
decision tree choosing the 'best' feature
Use Information Gain to find the "best" (most discriminating) feature Assume there are two classes, P and N (e.g., P = "yes" and N = "no") Let the set of instances S (training data) contain p elements of class P and n elements of class N The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p,n): I(p,n) = -Pr(P) log2 Pr(P) - Pr(N) log2 Pr(N) Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n) In other words, the entropy of a set of instances S is a function of the probability distribution of classes among the instances in S.
Human analysis skills are inadequate because of...
Volume and dimensionality of the data High data growth rate
Data mining definition (useful)
We can devise actions from the patterns.
Data mining definition (understandable)
We can interpret and comprehend the patterns.
Data mining definition (novel)
We did not know the pattern beforehand.
deductive reasoning
a logical process in which a conclusion is based on the concordance of multiple premises that are generally assumed to be true. Deductive reasoning is sometimes referred to as top-down logic.
inductive reasoning
a logical process in which multiple premises, all believed true or found true most of the time, are combined to obtain a specific conclusion. Inductive reasoning is often used in applications that involve prediction, forecasting, or behavior.
Proximity refers to
a similarity or dissimilarity
EP: With Bayes theorem the probability of hypothesis H, specified by P(H), is referred to as...
an a priori (prior) probability
Supervised learning (two main)
classification regression
EP: Classification problems are distinguished from estimation problems in that...
classification problems require the output attribute to be categorical.
Distance-Based Classification (basic idea)
classify new instances based on their similarity to or distance from instances we have seen before Sometimes called "instance-based learning" Save all previously encountered instances Given a new instance, find those instances that are most similar to the new one Assign new instance to the same class as these "nearest neighbors"
Unsupervised learning (two main)
clustering association rules
Bayes' theorem
describes the probability of an event, based on conditions that might be related to the event. For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to more accurately assess the probability that they have cancer.
decision tree pruning methods
direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping
Distance can be defined as a dual of a similarity measure
dist(X,Y) = 1 - sim(X,Y)
Properties of Distance Measures:
for all objects A and B, dist(A, B) ≥ 0, and dist(A, B) = dist(B, A) for any object A, dist(A, A) = 0 dist(A, C) ≤ dist(A, B) + dist(B, C)
Corollaries of Bayes' Rule
if Pr(E | H) = 0, then Pr(H | E) = 0 (E and H are mutually exclusive) suppose that Pr(E | H1) = Pr(E | H2), so that the hypotheses H1 and H2 give the same information about a piece of evidence E; then Pr(H1 | E) / Pr(H2 | E) = Pr(H1) / Pr(H2) in other words, these assumptions imply that the evidence E will not affect the relative probabilities of H1 and H2
K-Nearest-Neighbor Strategy: Impact of k on predictions
in general, different values of k affect the outcome of classification we can associate a confidence level with predictions (this can be the % of neighbors that are in agreement) a problem is that no single category may get a majority vote if there are strong variations in results for different choices of k, this is an indication that the training set is not large enough
K-Nearest-Neighbor Strategy(combination functions) Weighted Voting: not so "democratic": Advantage of weighted voting
introduces enough variation to prevent ties in most cases; helps distinguish between competing neighbors
supervised learning characteristics
labels known well-defined goal learn g(x) that is a good approximation to f(x) from training sample D well-defined error metrics accuracy, RMSE, ROC, ...
Distance (or Similarity) Matrix
Registers only the pairwise distances (or similarities) between n data points A triangular matrix is sufficient because the matrix is symmetric
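A small sketch of building such a matrix, assuming numpy and scipy are available (pdist returns the condensed triangular form; squareform expands it to the full symmetric n x n matrix):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 1.0]])   # 3 data points, 2 dimensions
condensed = pdist(X, metric="euclidean")   # pairwise distances, upper triangle only
D = squareform(condensed)                  # full symmetric 3 x 3 distance matrix
print(D)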
decision tree (internal node)
node with children
P(A)
probability of event A
P(A | B)
probability of event A given that event B occurred
P(A ∩ B)
probability that both events A and B occur
P(A ∪ B)
probability that event A or event B (or both) occurs
K-Nearest-Neighbor Strategy(combination functions) Weighted Voting: not so "democratic"
similar to voting, but the votes of some neighbors count more "shareholder democracy?" question is which neighbor's vote counts more?
Data mining definition
the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.