CSC529 Week1


Top-Down Tree Construction

For a node t, training database D, and split selection method S:
(1) Apply S to D to find the splitting criterion
(2) if (t is not a leaf node)
(3) Create children nodes of t
(4) Partition D into children partitions
(5) Recurse on each partition
(6) endif
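A minimal runnable sketch of this schema in Python, assuming numeric features, binary splits of the form "feature <= threshold", and entropy-based split selection; the names (build_tree, best_split) and the tiny dataset are illustrative, not from the course material:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Step (1): apply the split selection method S to the data D."""
    parent, best = entropy(labels), None              # best = (gain, feature, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[:-1]:   # candidate thresholds
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = parent - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def build_tree(rows, labels, min_gain=1e-9):
    split = best_split(rows, labels)
    # Step (2): t is a leaf if the node is pure or no split improves impurity
    if len(set(labels)) == 1 or split is None or split[0] <= min_gain:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, f, t = split
    # Steps (3)-(5): create children, partition D, and recurse on each partition
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {"split": (f, t),
            "left": build_tree([r for r, _ in left], [y for _, y in left]),
            "right": build_tree([r for r, _ in right], [y for _, y in right])}

# Tiny made-up example: (age, income) -> churn label
X = [[25, 40], [30, 80], [45, 90], [50, 30]]
y = ["yes", "no", "no", "yes"]
print(build_tree(X, y))
```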

decision tree Split selection algorithms

(CART, C4.5, QUEST, CHAID, CRUISE, ...)

EP: The data is split upon the variable, "Income >= 75K". 50 customers make more than 75K. The entropy of that split of the data is 0.8. The entropy for the split of the other customers is 0.6. What is the information gain?

0.35
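Worked out, assuming the parent set is the 200-customer data from the next card (entropy 1), so the remaining 150 customers form the second partition:
$\text{Gain} = 1 - \tfrac{50}{200}(0.8) - \tfrac{150}{200}(0.6) = 1 - 0.20 - 0.45 = 0.35$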

EP: A data set of 200 customers is collected and used to train a decision tree. 100 of these customers have churned (left the company). What is the entropy?

1
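Worked out: with 100 of 200 customers churned, each class has probability 1/2, so
$H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$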

What is a data mining model?

A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values.

Decision Trees

A decision tree T encodes d (a classifier or regression function) in the form of a tree.

Gini index

A measure of impurity based on the relative frequencies of classes in a set of instances. The attribute that yields the smallest Gini index (i.e., the largest reduction in impurity due to the split) is chosen to split the node. Possible problems: biased towards multivalued attributes (similar to Information Gain); has difficulty when the number of classes is large.
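A minimal sketch of the Gini impurity computation, assuming class labels arrive as a plain Python list (the function name is illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes"] * 10))              # 0.0  (pure node)
print(gini(["yes"] * 5 + ["no"] * 5))  # 0.5  (maximally impure two-class node)
```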

decision tree (leaf node)

A node t in T without children

decision tree Data access methods

CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator

decision tree (leaf nodes classification problem)

Classification problem: Node t is labeled with one class label c in dom(C)

Many data mining and analytics tasks involve the comparison of objects and determining their similarities (or dissimilarities)

Clustering
Nearest-neighbor search, classification, and prediction
Characterization and discrimination
Automatic categorization
Correlation analysis

Why use data mining today?

Competitive pressure!
Competition on service, not only on price
Personalization
CRM
Security, homeland defense

Data matrix

Conceptual representation of a table: columns = features, rows = data objects. Represents n data points with p dimensions; each row in the matrix is the vector representation of a data object.
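A small NumPy illustration, assuming three made-up data objects described by two numeric features:

```python
import numpy as np

# n = 3 data objects (rows), p = 2 features (columns), e.g., age and salary
X = np.array([[25, 40000],
              [51, 64000],
              [33, 52000]])
print(X.shape)  # (3, 2): n rows, p columns
print(X[1])     # vector representation of the second data object
```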

Probabilistic Belief

Consider a world where a dentist D meets with a new patient P. D is interested only in whether P has a cavity, so a state is described with a single proposition, Cavity. Before observing P, D does not know if P has a cavity, but from years of practice he believes Cavity with some probability p and ¬Cavity with probability 1 - p. The proposition is now a random variable and (Cavity, p) is a probabilistic belief.

Common Distance Measures for Numeric Data

Consider two vectors x and y (rows in the data matrix). Common distance measures:
Manhattan distance: $d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$
Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$

examples of distances

Cosine of the angle between vectors Manhattan distance Euclidean distance Hamming Distance
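A sketch of these measures with NumPy, assuming dense numeric vectors (0/1 vectors for Hamming); the function names are illustrative:

```python
import numpy as np

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def hamming(x, y):
    return np.sum(x != y)  # number of positions where the vectors differ

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(manhattan(a, b), euclidean(a, b), cosine_sim(a, b))       # 6.0, 3.74..., 1.0
print(hamming(np.array([1, 0, 1, 1]), np.array([1, 1, 0, 1])))  # 2
```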

Why do data selection?

Data sources are expensive:
Obtaining data
Loading data into a database
Maintaining data
Most fields are not useful:
Names
Addresses
Code numbers

data transformation includes

Data cleaning
Combine related data sources
Create common units
Generate new fields
Sampling

Why Create Common Units?

Data exists at different granularity levels: customers, transactions, products. Data mining requires a common granularity level (often called a case); mining usually occurs at the "customer" or similar granularity.

Data Mining Step in Detail

Data preprocessing (data selection, data transformation)
Data mining model construction
Model evaluation

different types of classifiers

Decision trees
Simple Bayesian models
Nearest neighbor methods
Logistic regression
Neural networks
Linear discriminant analysis (LDA)
Quadratic discriminant analysis (QDA)
Density estimation methods

A data mining model can be described at two levels (functional level)

Describes the model in terms of its intended usage, e.g., classification or clustering.

K-Nearest-Neighbor Strategy(combination functions) Weighted Voting: not so "democratic": How can weights be obtained?

Distance-based: closer neighbors get higher weights; the "value" of a vote is the inverse of the distance (a small constant may need to be added). The weighted sum for each class gives the combined score for that class; to compute confidence, take the weighted average. Heuristic: the weight for each neighbor is based on domain-specific characteristics of that neighbor. A sketch of the distance-based approach follows below.
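A sketch of inverse-distance weighted voting, assuming the k nearest neighbors have already been found and are passed in as (distance, class_label) pairs; the small constant eps guards against division by zero for exact matches:

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-6):
    """neighbors: list of (distance, class_label) pairs for the k nearest neighbors."""
    scores = defaultdict(float)
    for dist, label in neighbors:
        scores[label] += 1.0 / (dist + eps)   # closer neighbors get higher weight
    predicted = max(scores, key=scores.get)
    confidence = scores[predicted] / sum(scores.values())  # weighted share of the winner
    return predicted, confidence

print(weighted_vote([(0.5, "churn"), (1.0, "churn"), (0.2, "stay")]))
```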

Representation of objects as vectors:

Each data object (item) can be viewed as an n-dimensional vector, where the dimensions are the attributes (features) in the data. Example (employee DB): Emp. ID 2 = <M, 51, 64000>. The vector representation allows us to compute the distance or similarity between pairs of items using standard vector operations, e.g., Euclidean distance or cosine similarity.

decision tree (internal node splitting predicate)

Each internal node has an associated splitting predicate; most common are binary predicates. Example predicates:
Age <= 20
Profession in {student, teacher}
5000*Age + 3*Salary - 10000 > 0

Why Combine Data Sources?

Enterprise data is typically stored in many heterogeneous systems, and keys to join the systems may or may not be present. Heuristics must be used when keys are missing: time-based matching, situation-based matching.

K-Nearest-Neighbor Strategy(classification)

Find the class label for each of the k neighbors. Use a voting or weighted voting approach to determine the majority class among the neighbors (a combination function); weighted voting means the closest neighbors count more. Assign the majority class label to x.

decision tree pruning

For a tree T, the misclassification rate and the mean-squared error rate depend on P, but not on D. The goal is to do well on records randomly drawn from P, not to do well on the records in D. If the tree is too large, it overfits D and does not model P; the pruning method selects the tree of the right size.

A data mining model can be described at two levels:

Functional level Representational level

Classification: Definition

Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model; usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

K-Nearest-Neighbor Strategy

Given an object x, find the k most similar objects to x (the k nearest neighbors). A variety of distance or similarity measures can be used to identify and rank the neighbors. Note that this requires comparing x against all objects in the database.
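A minimal k-NN sketch, assuming numeric feature vectors, Euclidean distance, and unweighted majority voting; as the card notes, it compares x against every stored object:

```python
import math
from collections import Counter

def knn_classify(x, training_data, k=3):
    """training_data: list of (feature_vector, class_label) pairs."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # rank all stored objects by distance to x and keep the k closest
    neighbors = sorted(training_data, key=lambda item: dist(x, item[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [([25, 40], "churn"), ([30, 80], "stay"), ([45, 90], "stay"), ([50, 30], "churn")]
print(knn_classify([28, 45], train, k=3))  # "churn"
```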

Bayes' Rule - An Example

Given: P(Cavity) = 0.1, P(Toothache) = 0.05, P(Cavity | Toothache) = 0.8. Bayes' rule tells us: P(Toothache | Cavity) = (0.8 × 0.05) / 0.1 = 0.4

classifier requirements on the model

High accuracy
Understandable by humans, interpretable
Fast construction for very large training databases

Why Use Data Mining Today?

Human analysis skills are inadequate
Availability of data, storage, computing power, expertise, and off-the-shelf software
An abundance of data
Commercial support (Python, R, SAS, AWS, SPSS, Azure, Hadoop, etc.)

The Knowledge Discovery Process (steps)

Identify the business (or other) problem
Data mining
Action
Evaluation and measurement
Deployment and integration into business processes

data selection

Identify target datasets and relevant fields

K-Nearest-Neighbor Strategy(prediction)

Identify the value of the target attribute for the k neighbors. Return the weighted average as the predicted value of the target attribute for x.

Proximity Measure for Nominal Attributes

If object attributes are all nominal (categorical), proximity measures are used to compare objects. A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute).
Method 1: simple matching, $d(i, j) = \frac{p - m}{p}$, where m is the number of matching attributes and p the total number of attributes.
Method 2: convert to standard spreadsheet format; for each attribute A, create M binary attributes for the M nominal states of A, then use standard vector-based similarity or distance metrics.
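A sketch of both methods, assuming objects are given as equal-length lists of nominal values (names and data are illustrative):

```python
def simple_matching_distance(x, y):
    """Method 1: d(x, y) = (p - m) / p, with m = # matches and p = # attributes."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: expand one nominal attribute into M binary attributes."""
    return [1 if value == s else 0 for s in states]

print(simple_matching_distance(["red", "small", "round"], ["red", "large", "round"]))  # 1/3
print(one_hot("yellow", ["red", "yellow", "blue", "green"]))  # [0, 1, 0, 0]
```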

Measuring Distance or Similarity

In order to group similar items, we need a way to measure the distance between objects (e.g., records). This often requires the representation of objects as "feature vectors".

Vector-Based Similarity Measures

In some situations, distance measures provide a skewed view of the data, e.g., when the data is very sparse and 0's in the vectors are not significant. In such cases, vector-based similarity measures are typically used. Most common measure: cosine similarity. With the dot product of two vectors $x \cdot y = \sum_i x_i y_i$, the cosine similarity is $\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$.

Gain Ratio

The Information Gain measure tends to be biased in favor of attributes with a large number of values. Gain Ratio normalizes the Information Gain with respect to the total entropy of all splits based on the values of an attribute. Used by C4.5 (the successor of ID3), but it tends to prefer unbalanced splits (one partition much smaller than the others).

examples of data mining models

Linear regression model
Classification model
Decision tree
Naïve Bayes
K-nearest neighbor
Clustering

Decision Trees: Summary

Decision trees have many applications, and there are many algorithms available for split selection, pruning, handling missing values, and data access. Decision tree construction is still an active research area (after 20+ years!). Challenges: performance, scalability, evolving datasets, new applications.

Why Data Cleaning?

Missing data: unknown demographic data; impute missing values when possible
Incorrect data: hand-typed default values (e.g., 1900 for dates)
Misplaced fields: data does not always match documentation
Missing relationships: foreign keys missing or dangling

Why Sampling?

Most real datasets are too large to mine directly (> 200 million cases). Apply random sampling to reduce data size and improve error estimation. Always sample at analysis granularity (case/"customer"), never at transaction granularity.

Similarity

Numerical measure of how alike two data objects are. The value is higher when objects are more alike. Often falls in the range [0, 1].

Dissimilarity (e.g., distance)

Numerical measure of how different two data objects are. Lower when objects are more alike. The minimum dissimilarity is often 0; the upper limit varies.

K-Nearest-Neighbor Strategy(combination functions)

Once the nearest neighbors are identified, their "votes" must be combined to generate a prediction. Voting is the "democracy" approach: poll the neighbors for the answer and use the majority vote. The number of neighbors (k) is often taken to be odd in order to avoid ties; this works when the number of classes is two. If there are more than two classes, take k to be the number of classes plus 1.

Conditional Probabilities

P(A | B) = the conditional (posterior) probability of A given B
P(A | B) = P(A, B) / P(B)
P(A ∧ B) = P(A, B) = P(A | B) · P(B)
P(A ∧ B) = P(A, B) = P(A) · P(B), if A and B are independent
We say that A is independent of B if P(A | B) = P(A)
A and B are independent given C if P(A | B, C) = P(A | C), which implies P(A, B | C) = P(A | C) · P(B | C)

EP: For Bayes theorem to be applied, the following relationship between hypothesis H and evidence E must hold.

P(H|E) + P(~H| E) = 1

Basic Axioms of Probability

P(True) = 1 and P(False) = 0
P(A ∧ B) = P(A) · P(B | A)
P(¬A) = 1 - P(A)
If A ≡ B, then P(A) = P(B)
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

Many of today's real-world applications rely on the computation of similarities or distances among objects

Personalization
Recommender systems
Document categorization
Information retrieval
Target marketing

Examples of Classification Task

Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.

Examples of commercial data mining support software

Python, R, SAS, AWS, Azure, SPSS, Mahout, Spark... and on, and on, and on

Why Generate New Fields?

Raw data fields may not be useful by themselves. Simple transformations can improve mining results dramatically: customer start date -> customer tenure; recency, frequency, and monetary values. Fields at the wrong granularity level must be aggregated.

decision tree data access

Recent development: very large training databases, both in-memory and on secondary storage. Goal: fast, efficient, and scalable decision tree construction, using the complete training database.

Types of Data

Relational data and transactional data
Spatial and temporal data, spatio-temporal observations
Time-series data
Text
Voice
Images, video
Mixtures of data
Sequence data
Features from processing other data sources

Entropy and Information gain

Review slides 48 - 56

Minkowski Distance

Review slides 85-88. Note that Euclidean and Manhattan distances are special cases.
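For reference, the general form is $d(x, y) = \left(\sum_{i=1}^{p} |x_i - y_i|^h\right)^{1/h}$; h = 1 gives Manhattan distance and h = 2 gives Euclidean distance.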

A data mining model can be described at two levels (representational level)

Specific representation of a model, e.g., a log-linear model or a classification tree

Decision Tree Construction (Three algorithmic components)

Split selection (CART, C4.5, QUEST, CHAID, CRUISE, ...)
Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping)
Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator)

Examples of data sources (an abundance of data)...

Supermarket scanners, POS data
Preferred customer cards
Credit card transactions
Direct mail response
Call center records
ATM machines
Demographic data
Sensor networks
Cameras
Web server logs
Customer web site trails

Data Mining Techniques (two main)

Supervised learning Unsupervised learning

EP: Which of the following are true about decision trees?

The algorithm to produce a decision tree is deterministic. Decision trees do not make the naive assumption of independence among variables. A variable can occur multiple times in a tree. Variables can be nominal or numeric.

"Lazy" Classifiers

The approach defers all of the real work until a new instance is obtained; no attempt is made to learn a generalized model from the training set. Less data preprocessing and model evaluation, but more work has to be done at classification time.

Data mining definition (valid)

The patterns hold in general.

Probabilistic Belief State

The world has only two possible states, which are described by Cavity and ¬Cavity, respectively. The probabilistic belief state of an agent is a probability distribution over all the states that the agent thinks possible.

classifier goals

To produce an accurate classifier/regression function; to understand the structure of the problem

Decision Tree Construction

Top-down tree construction schema:
Examine the training database and find the best splitting predicate for the root node
Partition the training database
Recurse on each child node

decision tree (leaf nodes regression problem)

Two choices:
Piecewise constant model: t is labeled with a constant y in dom(Y).
Piecewise linear model: t is labeled with a linear model $Y = y_t + \sum_i a_i X_i$.

decision tree choosing the 'best' feature

Use Information Gain to find the "best" (most discriminating) feature. Assume there are two classes, P and N (e.g., P = "yes" and N = "no"), and let the set of instances S (training data) contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined in terms of entropy, $I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$. Note that Pr(P) = p / (p + n) and Pr(N) = n / (p + n). In other words, the entropy of a set of instances S is a function of the probability distribution of classes among the instances in S.
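The gain for splitting S on an attribute A with values $v_1, \dots, v_k$ then follows the standard ID3 form: $E(A) = \sum_{j=1}^{k} \frac{p_j + n_j}{p + n}\, I(p_j, n_j)$ and $\text{Gain}(A) = I(p, n) - E(A)$, where $p_j$ and $n_j$ are the class counts in the subset of S with $A = v_j$.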

Human analysis skills are inadequate because of...

Volume and dimensionality of the data; high data growth rate

Data mining definition (useful)

We can devise actions from the patterns.

Data mining definition (understandable)

We can interpret and comprehend the patterns.

Data mining definition (novel)

We did not know the pattern beforehand.

deductive reasoning

a logical process in which a conclusion is based on the concordance of multiple premises that are generally assumed to be true. Deductive reasoning is sometimes referred to as top-down logic.

inductive reasoning

a logical process in which multiple premises, all believed true or found true most of the time, are combined to obtain a specific conclusion. Inductive reasoning is often used in applications that involve prediction, forecasting, or behavior.

Proximity refers to

a similarity or dissimilarity

EP: With Bayes theorem the probability of hypothesis H, specified by P(H), is referred to as...

an a priori probability

Supervised learning (two main)

classification, regression

EP: Classification problems are distinguished from estimation problems in that...

classification problems require the output attribute to be categorical.

Distance-Based Classification (basic idea)

Classify new instances based on their similarity to, or distance from, instances we have seen before; sometimes called "instance-based learning". Save all previously encountered instances; given a new instance, find those instances that are most similar to the new one and assign the new instance to the same class as these "nearest neighbors".

Unsupervised learning (two main)

clustering, association rules

Bayes' theorem

describes the probability of an event, based on conditions that might be related to the event. For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to more accurately assess the probability that they have cancer.
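In its standard form: $P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$, where $P(H)$ is the prior (a priori) probability of the hypothesis and $P(H \mid E)$ is its posterior probability given the evidence E.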

decision tree pruning methods

direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping

Distance can be defined as a dual of a similarity measure

dist(X,Y) = 1 - sim(X,Y)

Properties of Distance Measures:

For all objects A and B, dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
For any object A, dist(A, A) = 0
dist(A, C) ≤ dist(A, B) + dist(B, C)

Corollaries of Bayes' Rule

If Pr(E | H) = 0, then Pr(H | E) = 0 (E and H are mutually exclusive). Suppose that Pr(E | H1) = Pr(E | H2), so that the hypotheses H1 and H2 give the same information about a piece of evidence E; then Pr(H1 | E) / Pr(H2 | E) = Pr(H1) / Pr(H2). In other words, these assumptions imply that the evidence E will not affect the relative probabilities of H1 and H2.

K-Nearest-Neighbor Strategy: Impact of k on predictions

In general, different values of k affect the outcome of classification. We can associate a confidence level with predictions (this can be the % of neighbors that are in agreement); a problem is that no single category may get a majority vote. If there are strong variations in results for different choices of k, this is an indication that the training set is not large enough.

K-Nearest-Neighbor Strategy(combination functions) Weighted Voting: not so "democratic": Advantage of weighted voting

Introduces enough variation to prevent ties in most cases; helps distinguish between competing neighbors

supervised learning characteristics

Labels are known
Well-defined goal: learn g(x) that is a good approximation to f(x) from the training sample D
Well-defined error metrics: accuracy, RMSE, ROC, ...

Distance (or Similarity) Matrix

Registers only the pairwise distance (or similarity) between the n data points; a triangular, symmetric matrix.

decision tree (internal node)

A node t in T with children

P(A)

probability of event A

P(A | B)

probability of event A given that event B occurred

P(A ∩ B)

probability that both events A and B occur

P(A ∪ B)

probability that event A or event B (or both) occurs

K-Nearest-Neighbor Strategy(combination functions) Weighted Voting: not so "democratic"

Similar to voting, but the votes of some neighbors count more ("shareholder democracy?"). The question is which neighbors' votes should count more.

Data mining definition

the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

