CIS 375 Final Study Guide Questions
Unigrams
This is a sentence: - N = 1 - this, is, a, sentence
Bigrams
This is a sentence: - N = 2 - this is, is a, a sentence
Trigrams
This is a sentence: - N = 3 - this is a, is a sentence
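The unigram, bigram, and trigram examples above can be reproduced with a short sketch. The function name `ngrams` and the choice to lowercase and split on whitespace are assumptions made for illustration.

```python
def ngrams(text, n):
    """Return the word n-grams of text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This is a sentence"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
```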
Probability Estimation - Adjustment
To address the problem of small samples in tree-based class probability estimation, instead of simply computing the frequency, - we often use a "smoothed" version of the frequency-based estimate - the goal is to moderate the influence of leaves with only a few instances p(c) = (n + 1) / (n + m + 2) where n is the number of examples in the leaf belonging to class c, and m is the number of examples not belonging to class c
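A minimal sketch of the smoothed estimate p(c) = (n + 1) / (n + m + 2), showing how it pulls a tiny leaf's estimate away from the extremes. The function name and the example leaf counts are hypothetical.

```python
def smoothed_probability(n, m):
    """Smoothed class probability for a leaf with n examples of class c and m others."""
    return (n + 1) / (n + m + 2)

# Hypothetical leaf with 2 positives and 0 negatives: raw frequency would be 1.0
small_leaf = smoothed_probability(2, 0)
# Hypothetical leaf with 20 positives and 0 negatives: more evidence, less moderation
large_leaf = smoothed_probability(20, 0)
```

The small leaf's estimate is moderated much more strongly than the large leaf's, which is the intended effect.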
ROC Graph with 5 Classifiers
Top Left Corner: the perfect classification strategy Bottom Left Corner: the strategy of never issuing a positive classification Top Right Corner: the strategy of always issuing positive classifications for all instances Dashed Line (the diagonal): the random guessing strategy - No classifier should fall in the lower right triangle (it would perform worse than random guessing) - One point in ROC space is superior to another if it is to the northwest of it (i.e., its true positive rate is higher and its false positive rate is no worse) - Shows the relative trade-off between the benefits a classifier provides (true positive rate) and the costs it incurs (false positive rate)
Bag of Words
Treat every document as just a collection of individual words - ignore grammar, word order, sentence structure, and (usually) punctuation - treat every word in a document as a potentially important keyword (feature) of the document What will be the feature's value in a given document? - each token is represented by a one (if the token is present in the document) or a zero (if the token is not present in the document)
Overfitting in linear functions
Two (almost identical) linear discriminants, logistic regression and SVM: - after adding a new instance, the logistic regression model adjusted itself, while the SVM hardly moved - after adding a different new instance, the logistic regression model moved considerably more in response (logistic regression appears to overfit more than SVMs) - the outlier instances introduced should not have a strong influence on the models (they contribute little to the "mass" of instances)
Profit Curve Critical Conditions
Two critical conditions underlying the profit calculation: - the class priors (the proportion of positive and negative instances in the target population) - the costs and benefits (the expected profit is specifically sensitive to the relative levels of costs and benefits for the different cells of the cost-benefit matrix) These conditions are often uncertain or unstable
Classification / Decision Tree
Structure Characteristics: - made up of nodes: root, interior nodes, leaf nodes, and branches - interior nodes represent testing of an attribute (decision node) - each branch represents a distinct value of the attribute at that node - every data point (instance) will correspond to one and only one path (ending at a leaf node) - therefore, each leaf represents a segment of the population - we use information gain, which tells us how important a given attribute is, to decide the ordering of the attributes as nodes of a decision tree
Problems with Model Accuracy
Suppose we have biopsy data: 1000 instances, 950 benign, 50 malignant (imbalanced data) A model that labels every instance benign gets Model Accuracy = 950 / 1000 = 0.95 Can we say this is a very good model? No. - accuracy is misleading with imbalanced data - accuracy makes no distinction between the two kinds of classification errors - false positives and false negatives
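A short sketch of why 0.95 accuracy can be misleading here. It assumes a hypothetical model that labels every biopsy as benign; the label encoding (0 = benign, 1 = malignant) is also an assumption.

```python
# 950 benign (0) and 50 malignant (1) instances, as in the example
actual = [0] * 950 + [1] * 50
# Hypothetical model that always predicts "benign"
predicted = [0] * 1000

pairs = list(zip(actual, predicted))
accuracy = sum(a == p for a, p in pairs) / len(pairs)
false_negatives = sum(a == 1 and p == 0 for a, p in pairs)
# Fraction of malignant cases the model actually catches
recall = sum(a == 1 and p == 1 for a, p in pairs) / sum(actual)
```

Accuracy comes out at 0.95 even though every malignant case is missed, which is exactly the kind of error that accuracy alone hides.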
TF (Term Frequency) - IDF (Inverse Document Frequency)
TF-IDF assigns a value to each term in a document based on - frequency - rarity TFIDF(t,d) = TF(t,d) * IDF(t) - TF counts term t within document d, so it is specific to document d - IDF measures the sparseness of term t across the entire corpus
Why Text is Difficult
Text is "Unstructured": - linguistic structure is intended for human communication and not computers - Word order matters sometimes Text Can be Dirty: - people write ungrammatically, misspell words, abbreviate unpredictably, and punctuate randomly - even when text is flawlessly expressed it may contain: synonyms, homographs (e.g., bow), abbreviations, etc. - Context Matters
Support Vector Machine (SVM)
The Basic Idea: best "fit" is the center line of the widest "bar" (not line) that separates the instances The Goal: maximize the distance between the outer edges (of the bar), the "margin" around the linear discriminant (aka, the distance from the decision boundary to the nearest training instance) - support vectors are the points from each class that are the closest to the bar - each class must have at least one support vector - using the support vectors alone, it is possible to define the maximum margin classifier
Expected Value
The computation provides a framework that is useful in organizing thinking about data-analytic problems It decomposes data-analytic thinking into: - the structure of the problem (decision outcome) - the elements of the analysis that can be extracted from the data (probability) - the elements of the analysis that need to be acquired from other sources (business value) EV = p(o1) * v(o1) + p(o2) * v(o2) + p(o3) * v(o3) + ... where o = a possible decision outcome, p(o) = the probability of decision outcome o, v(o) = the business value of o
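The EV formula can be sketched as a weighted sum over outcomes. The probabilities and dollar values below are hypothetical numbers chosen only to illustrate the arithmetic.

```python
def expected_value(outcomes):
    """outcomes: list of (probability, value) pairs, one per possible decision outcome."""
    return sum(p * v for p, v in outcomes)

# Hypothetical offer: 5% of consumers respond (worth $99), 95% do not (costs $1)
ev = expected_value([(0.05, 99.0), (0.95, -1.0)])
```

With these made-up numbers the expected value is about $4 per targeted consumer, so targeting would be worthwhile on average.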
Model Induction
The creation of models from data
Test Data
The data for model evaluation
Training Data
The input data for the induction algorithm
Term Frequency Representation
Use the word count (frequency) in the document instead of just a zero or one - the importance of a term in a document increases with the number of times that term occurs - Each sentence is considered a separate document - A simple bag-of-words approach using term frequency would produce a table of term counts
Algorithm Ensemble Techniques
The main objective of ensemble methodology is to improve the performance of single classifiers - the approach involves constructing several two-stage classifiers from the original data and then aggregating their predictions
Kernels
The mapping of data into a higher dimension - Polynomial kernels are often used, especially with degree 2 - Radial basis function kernels (Gaussian kernels) are a good first choice for problems requiring nonlinear models - a decision boundary that is a hyperplane in the mapped feature space is similar to a decision boundary that is a hypersphere in the original space - the mapped feature space can have an infinite number of dimensions, a feat that would be impossible otherwise
Overfitting
The model is memorizing the data it has seen and is unable to generalize to unseen examples
Underfitting
The model is unable to capture the relationship between the input examples and the target values
Selecting Informative Attributes
The most common splitting criterion is information gain (IG) - based on an (im)purity measure called entropy - entropy measures the disorder of a set - zero at minimal disorder (the set's members all have the same, single property) - one at maximal disorder for a two-class set (the properties are equally mixed)
Boosting
We improve performance by concentrating modeling efforts on the data that results in more errors - we train a sequence of models where more weight is given to examples that were misclassified by earlier iterations
Lift Example
We operate a small convenience store where people buy beer, lottery tickets, etc. We estimate that: - 30% of all transactions involve beer - 40% of all transactions involve lottery tickets - 20% of the transactions include both beer and lottery tickets Is this simply due to chance occurrence? What we know: P(beer) = 0.3, P(lottery tickets) = 0.4, P(beer, lottery tickets) = 0.2 Lift(A,B) = p(A,B) / (p(A) * p(B)) The probability of chance co-occurrence: P(beer) * P(lottery tickets) = 0.3 * 0.4 = 0.12 Lift(beer, lottery tickets) = p(beer, lottery tickets) / (p(beer) * p(lottery tickets)) = 0.2 / 0.12 = 1.67 - It means that buying beer and lottery tickets together is about one and two-thirds (1 and 2/3) times more likely than one would expect by chance
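The lift calculation above, written as a small function. The function and variable names are assumptions; the probabilities are the ones given in the example.

```python
def lift(p_ab, p_a, p_b):
    """Ratio of the observed co-occurrence to the co-occurrence expected by chance."""
    return p_ab / (p_a * p_b)

# P(beer, lottery tickets) = 0.2, P(beer) = 0.3, P(lottery tickets) = 0.4
beer_lottery_lift = lift(0.2, 0.3, 0.4)
```

A lift above 1 suggests the items co-occur more often than chance would predict; a lift of exactly 1 would be consistent with independence.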
Cross-Validation/K folds Validation
We perform "multiple splits", systematically "swapping out" samples for testing - with five folds, cross-validation iterates training and testing five times - in each iteration of the cross-validation, a different fold (ex: fold 1) is chosen as the test data and the remaining folds (folds 2, 3, 4, and 5) are combined and used as training data - final accuracy = average(T1, T2, T3, T4, T5) = (T1 + T2 + T3 + T4 + T5) / 5
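A minimal sketch of how the folds could be formed and the per-fold accuracies averaged. It splits indices in order without shuffling (real use would normally shuffle first), and the per-fold accuracies are hypothetical placeholders rather than results from a trained model.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) once for each of k folds."""
    indices = list(range(n_instances))
    fold_size = n_instances // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder
        end = start + fold_size if fold < k - 1 else n_instances
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

splits = list(k_fold_splits(10, 5))

# Hypothetical accuracies T1..T5, one per fold
fold_accuracies = [0.80, 0.85, 0.75, 0.90, 0.80]
final_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

Every instance is used for testing exactly once and for training in the other four iterations.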
Manhattan Distance
The sum of the (unsquared) pairwise distances along each dimension - also called taxicab distance d_Manhattan(X,Y) = ||X - Y||_1 = |x1 - y1| + |x2 - y2| + ... It represents the total street distance you would have to travel in a place like midtown Manhattan to get between 2 points
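The Manhattan distance formula as a one-line function. The function name and the two example points are hypothetical.

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

# |1 - 4| + |2 - 6| = 3 + 4
d = manhattan_distance((1, 2), (4, 6))
```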
K-means Clustering - Means
The word "means" in k-means literally corresponds to the centroids, and we represent each centroid by the arithmetic mean (average) of the values along each dimension for the instances in the cluster - Ex: we average all x values of the points in a given cluster to form the x-coordinate of the centroid, and average all y values to form the centroid's y-coordinate
How many neighbors should we use?
k nearest neighbors - there is no simple answer to how many neighbors should be used - k = 1: we will get an extremely complex model - k = n (total number of instances): we predict the average value in the entire dataset for each case - Decision boundaries created by a 1-nearest neighbor classifier follow very specific boundaries around the training instances - The 1-nearest neighbor classifier is very sensitive to outliers, and overfits very strongly Basic Idea: things that are similar in some ways are probably similar in other ways as well
Support (of association)
Measures how frequently an itemset occurs in the data support(X) = count(X) / N - N is the number of transactions in the database - count(X) is the number of transactions in which the itemset X appears Ex: {candy bar}: 2/5 = 0.4, {get well card, flowers}: 3/5 = 0.6 - we require rules to apply to at least 0.01% of all transactions (prevalence of co-occurrences)
Induction
the creation of models (general rules) from data
Classifying vs Ranking
With a simple classifier model, - we get one confusion matrix (CM) for the entire dataset - cannot "sort" the instances by their "likelihood" to take an action With a ranking classifier model, - we get one CM for a chosen "threshold" - can "sort" the instances by their "likelihood"
Measuring Sparseness
Words of different frequencies - words should not be too common or too rare - When a term is too common (ex: it occurs in every document), it is not useful for classification because it does not distinguish anything - When a term is too rare, it is not a good basis for a meaningful cluster - Impose both an upper and a lower limit on the number (or fraction) of documents in which a word may occur
Co-Occurrence as a rule
"if A occurs then B is likely to occur as well"
Confidence (of the rule)
"likelihood" in the association; it measures the rule's predictive power or accuracy P(Y|X) = confidence(X -> Y) = support(X,Y) / support(X) It is defined as the support of the itemset containing both X and Y divided by the support of the itemset containing only X Ex: the rule {flowers} -> {get well card} - P(flowers) = 4/5 = 0.8 - P(flowers, get well card) = 3/5 = 0.6 - Confidence(flowers -> get well card) = P(flowers, get well card) / P(flowers) = 0.6 / 0.8 = 0.75 - this means that a purchase involving flowers is accompanied by a purchase of a get well card 75% of the time
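Support and confidence can be sketched over a list of transactions. The five transactions below are a made-up dataset, chosen only so that the counts match the worked example (flowers in 4 of 5, flowers with a get well card in 3 of 5); the function names are also assumptions.

```python
# Hypothetical transactions constructed to reproduce the example's counts
transactions = [
    {"flowers", "get well card", "soda"},
    {"plush toy bear", "flowers", "balloons", "candy bar"},
    {"get well card", "candy bar", "flowers"},
    {"plush toy bear", "balloons", "soda"},
    {"flowers", "get well card", "soda"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """support(X and Y together) divided by support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

s_flowers = support({"flowers"}, transactions)
s_both = support({"flowers", "get well card"}, transactions)
rule_confidence = confidence({"flowers"}, {"get well card"}, transactions)
```

Note that confidence is directional: swapping X and Y changes the denominator and generally gives a different value.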
Synthetic Minority Over-Sampling Technique (SMOTE) (for imbalanced data)
- Avoids the overfitting that occurs when exact replicas of minority instances are added to the main dataset - A subset of data is taken from the minority class as an example, and new, similar synthetic instances are created from it - These synthetic instances are then added to the original dataset, and the new dataset is used as a sample to train the classification models
Ranking a Set of Cases by Their Model-Produced Scores
- Decide on which top N cases you will take an action - The instances above the threshold are classified as positive
Overfitting in decision/classification trees
- Decision trees are highly flexible; they can find a pattern given enough data - If we continue to split the data, eventually the subsets will be pure, meaning all instances in any chosen subset will have the same value for the target variable - if we allow tree models to grow without bound, they can eventually fit the data to an arbitrary precision - the accuracy on holdout data is likely to increase up to a certain point of model complexity - but beyond that point, the tree starts to overfit
Logistic Regression vs Tree Induction
- For smaller training-set sizes, logistic regression yields better generalization accuracy than tree induction - Classification trees are a more flexible model representation than logistic regression - Flexibility of tree induction can be an advantage with larger training sets
Multi-layer neural network use
- Hidden layers, which are neurons stacked in between inputs and outputs, allow neural networks to learn more complicated features - A multilayer network adds one or more hidden layers that process the signals from the input nodes prior to reaching the output node - Most multilayer networks are fully connected, which means that every node in one layer is connected to every node in the next layer - the feature inputs are fed into the input layer - the weighted outputs of the input layer nodes are fed into the hidden layer - the weighted outputs of the last hidden layer are inputs to the output layer In a two-layer stack model: - a set of logistic regressions is learned from the original input features - then a final logistic regression is learned using the outputs of the first set of logistic regressions as its input features
Similarity and Distance
- If two things - people, companies, products - are similar in some ways they often share other characteristics as well - Data mining procedures are often based on "grouping" things by similarity or "searching for" the right sort of similarity - if two objects can be represented as feature vectors, then we can compute the distance between them
Profit Curves
- If we have a well-defined cost-benefit matrix, we can determine the threshold where our expected profit is above a desired level (usually zero) - The idea behind it is to combine expected value/profit and the rank of instances - This graph is based on a test set of 1,000 consumers to predict if they are going to accept an offer and purchase - We rank the instances by the score produced by a classifier, from highest to lowest - As we lower the threshold from highest to lowest, one level at a time, we compute and record the corresponding expected profit
Visualization Evaluation
- Measure and visualize generalization performance of predictive models - Be careful using accuracy alone (as sole performance evaluator) - Use cross-validation - Use confusion matrix as a useful tool for breaking out types of errors - Use expected value and profit curves to incorporate external information (costs and benefits) - Use AUC as a good metric for model performance comparison (against a random/baseline model)
ROC Space and Graph
- A method which can accommodate uncertainty by visualizing the "entire space of performance possibilities" - ROC stands for receiver operating characteristic - First used to measure the ability of radar receiver operators to analyze radar signals in World War II - In a 2-dimensional space, a ROC graph plots a classifier's false positive rate on the x-axis and true positive rate on the y-axis - For each discrete classifier we produce a pair of FP rate and TP rate and we plot that single point in ROC space
Challenge of Imbalanced Classification
- Most machine learning classification algorithms are sensitive to imbalance in the classes - Standard classifier algorithms like decision trees and logistic regression have a bias towards classes which have a larger number of instances - They tend to only predict the majority class; the features of the minority class are treated as noise and are often ignored - Thus, there is a higher probability of misclassifying the minority class than the majority class
Nearest Neighbors
- One of the most common uses of similarity in data science is to find the most similar instances - Given a new instance whose label we want to predict, suppose we take its 3 nearest neighbors and look at their known target values
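A minimal sketch of a 3-nearest-neighbor prediction. Combining the neighbors' target values by majority vote and measuring similarity with Euclidean distance are assumptions here (the card only says we look at the neighbors' known values), and the training instances are made up.

```python
from collections import Counter
import math

def knn_predict(query, instances, k=3):
    """instances: list of (feature_vector, label) pairs; returns the majority label."""
    # Sort training instances by Euclidean distance to the query and keep the k closest
    neighbors = sorted(instances, key=lambda inst: math.dist(query, inst[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled instances
training = [
    ((1.0, 1.0), "yes"),
    ((1.0, 2.0), "yes"),
    ((2.0, 2.0), "no"),
    ((8.0, 8.0), "no"),
    ((9.0, 9.0), "no"),
]
prediction = knn_predict((1.0, 1.2), training, k=3)
```

The three closest neighbors here vote two "yes" to one "no", so the new instance is labeled "yes".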
Other Evaluation Metrics Drawn from the Confusion Matrix
- Precision - Recall - F-Score - Specificity
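A sketch of how these four metrics are computed from confusion-matrix counts, using their standard definitions. The F-score shown is the common F1 variant (harmonic mean of precision and recall), which is an assumption about which F-score is meant; the counts are hypothetical.

```python
def precision(tp, fp):
    """Of the instances predicted positive, the fraction that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the actual positives, the fraction predicted positive (true positive rate)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Of the actual negatives, the fraction predicted negative (true negative rate)."""
    return tn / (tn + fp)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 130
metrics = {
    "precision": precision(tp, fp),
    "recall": recall(tp, fn),
    "f1": f1_score(tp, fp, fn),
    "specificity": specificity(tn, fp),
}
```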
Clustering Around Centroids
- Represent each cluster by its "cluster center" or "centroid" - The star (centroid) is not one of the instances, but it is the geometric center of a group of instances - The most popular centroid-based clustering algorithm is called "k-means" clustering
Model Performance with Imbalanced Classes
- Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20% (in reality, datasets can get far more imbalanced than this) Ex: about 2% of accounts are defrauded per year (most fraud detection domains are heavily imbalanced) - the conversion rates of online ads have been estimated to lie between 10^-3 and 10^-6 - factory production defect rates typically run about 0.1%
The Confusion Matrix
- Separates out the decisions made by the classifier, making explicit how one class is being confused for another - The columns are labeled with the actual classes and the rows with the predicted classes - The errors of the classifier are the false positives and false negatives False Negatives: when the classifier predicts an instance as negative when it is actually positive False Positives: when the classifier predicts an instance as positive when it is actually negative
Artificial Neural Networks
- Simulates the biological neural network - Mimics the key aspects of the brain, in particular its ability to learn from experience - Can be built by connecting layers of logistic regression models A single artificial neuron: - x = input signals received by the dendrites - y = the output signal - w = each dendrite's signal is weighted according to its importance - f = the input signals are summed by the cell body and the signal is passed on according to an activation function - A neural network is a set of connected input/output units, where each connection has a weight associated with it - A neural network learns by adjusting the weights so as to correctly classify the training data and hence, after the testing phase, to classify unknown data - A neural network needs a long time for training and has a high tolerance for noisy and incomplete data
Clustering Evaluation
- Single vs. multiple runs of k-means (results depend on the initial centroid locations) - k-means clustering is efficient (it runs quite fast) - k = ? (experiment with different k values and see which one generates good clustering results)
Linear Decision Boundary
- Straight boundary line that is not perpendicular to the axes - Not a perfect segmentation, but very close - Referred to as a linear classifier
Associations
- Support - Confidence - Lift
Ranking a set of cases by their model-produced scores
- Take actions on the cases at the top of the ranked list - We may decide to take the top N cases (most likely churners) above a given threshold - You have a limited budget for actions (ex: target the most promising prospects for a fixed amount of marketing budget)
Activation Function
- The activation function is the mechanism by which the artificial neuron processes incoming information and passes it throughout the network - ANN activation functions can be chosen based on their ability to demonstrate mathematical characteristics and their ability to accurately model relationships among data - the threshold activation function is "on" only after the input signals meet a threshold - the primary detail that differentiates these activation functions is the output signal range - Most commonly used is the sigmoid activation function, which mimics the biological activation function with a smooth curve
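A sketch contrasting the threshold and sigmoid activation functions inside a single artificial neuron. The function names, the default threshold of 0, and the example inputs and weights are all assumptions for illustration.

```python
import math

def threshold_activation(x, threshold=0.0):
    """Outputs 1 ("on") only once the summed input meets the threshold, else 0."""
    return 1.0 if x >= threshold else 0.0

def sigmoid(x):
    """Smooth S-shaped curve with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, activation=sigmoid):
    """Weight each input signal, sum them, and pass the sum through the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Hypothetical inputs and weights whose weighted sum is exactly 0
y = neuron_output([1.0, 2.0], [0.5, -0.25])
y_threshold = neuron_output([1.0, 2.0], [0.5, -0.25], activation=threshold_activation)
```

The two functions differ mainly in their output range and smoothness: the threshold function jumps between 0 and 1, while the sigmoid changes gradually.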
Area Under the ROC Curve (AUC)
- The area under a classifier's curve expressed as a fraction of the unit square (its value ranges from zero to one) - The AUC is useful when a single number is needed to summarize performance, or when nothing is known about the operating conditions (ROC curve provides more information than its area)
Lift Curve
- The cumulative response curve is sometimes called a "lift curve" - The lift of a classifier means the relative advantage the classifier provides over random guessing - So the lift is the degree to which the classifier of interest pushes up the positive instances in a list above the negative instances - The lift for a random guessing model is 1 - The lift curve is the value of the cumulative response curve at a given x-point divided by the diagonal line (y=x) value at that point - Ex: "our model gives a 2X lift"
Inverse Document Frequency
- The fewer documents in which a term occurs, the more significant it is likely to be to the documents in which it does occur - This sparseness of a term, t, is commonly measured by an equation called inverse document frequency (IDF): IDF(t) = 1 + log(total number of documents / number of documents containing t) - When a term is very rare (ex: in a corpus of 100 documents), the denominator - number of documents containing t - becomes smaller, while the numerator - the total number of documents - remains constant, so IDF is quite high for a rare term
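A minimal sketch of the IDF formula for a corpus of 100 documents. The card does not state the base of the logarithm, so the natural log used here is an assumption; the function name is also hypothetical.

```python
import math

def idf(total_docs, docs_containing_term):
    """IDF(t) = 1 + log(total documents / documents containing t), natural log assumed."""
    return 1 + math.log(total_docs / docs_containing_term)

rare_term_idf = idf(100, 1)      # term appears in only 1 of 100 documents
common_term_idf = idf(100, 100)  # term appears in every document
```

A term that appears in every document gets the minimum IDF of 1, while a term in a single document gets a much larger value.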
K-means Clustering - k
- The k in k-means simply refers to the number of clusters that we would like to find in the data - The value for k is "pre-determined" by us - This is in contrast to hierarchical clustering, where we derive the ideal number of clusters after we see the dendrogram
Data Characteristics
- Volume of data - Variety of data - Powerful computers - More efficient algorithms - Structured - numeric data - Un-structured - textual data
Using Expected Value to Frame Classifier Evaluation
- We evaluate the set of decisions, aggregated decisions made by a model when we apply it to a set of examples - Ex: does a classification tree work better than a linear discriminant model for a particular problem? - we acquire external information to estimate the costs and benefits for each case - we tally the counts for the different cells of the confusion matrix
ROC Curve
- We looked at "discrete" classifiers and therefore we plotted one point for each corresponding classifier in the ROC space - We examine a ranking model that produces a set of points (a curve) in ROC space - Each threshold value produces a different point in ROC space - When we connect these points/dots, they look like a solid-line curve in this graph, or at least a step-function
Building a Confusion Matrix
- We simply count the instances for each of the 4 cases in a confusion matrix - Let's consider a "loan" default-prediction problem: here we have 2 classes (whether a loan defaults or not)
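Counting the four cases for a loan default-prediction problem can be sketched as follows. The label strings and the six example loans are hypothetical; "default" is treated as the positive class by assumption.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="default"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Hypothetical actual outcomes and model predictions for six loans
actual = ["default", "default", "no default", "no default", "no default", "default"]
predicted = ["default", "no default", "no default", "default", "no default", "default"]
cm = confusion_counts(actual, predicted)
```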
Learning Curves vs Fitting Graphs
- a learning curve shows the generalization performance plotted against the AMOUNT of training data used - a fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model COMPLEXITY - fitting graphs generally are shown for a fixed amount of training data
Selecting the "Best" Possible Line
- depends on the objective (loss) function - loss functions measure the amount of classification error a model has for a given training dataset - the loss function determines how much "penalty" should be assigned to an instance, based on the error in the model's predicted value - by maximizing or minimizing the objective function, we can find the optimal values for the weights
Holdout Method
- in order to evaluate the generalization performance of a model, we need to estimate its accuracy on "holdout" (test) data - holdout data is data that is not used for building the model, but for which we know the target variable's value - the holdout data comes from the original dataset
Objective Functions
- absolute error: error(y; x) = | y - f(x) | - squared error: error(y; x) = (y - f(x))^2 Linear regression, logistic regression and support vector machines are all very similar instances of the same fundamental technique, each with a different objective: Linear regression = minimize squared errors Logistic regression = maximize likelihood - most commonly used linear classification algorithm Support vector machines = maximize the margin - hinge loss + widest margin between two classes
Entropy
- measures the level of disorder of a subgroup/subset entropy = -p1*log2(p1) - p2*log2(p2) - p3*log2(p3) - ... - where p(i) is the proportion of members with property i within the subset - p(i) = 1 when all members have property i - p(i) = 0 when none of the members have property i Entropy is: - zero (0) for a minimally disordered set (all members have the same, single property value) - one (1) for a maximally disordered two-class set (members have equally mixed property values) - we want attributes that produce subsets with the lowest entropy
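The entropy formula as a short function, evaluated on a pure set, an evenly mixed two-class set, and a skewed set. The function name and the 70/30 example are hypothetical; terms with a proportion of 0 are skipped, following the usual convention that 0 * log2(0) is treated as 0.

```python
import math

def entropy(proportions):
    """entropy = -sum(p_i * log2(p_i)) over the class proportions of a set."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

pure = entropy([1.0, 0.0])    # all members share one property value
mixed = entropy([0.5, 0.5])   # two property values, equally mixed
skewed = entropy([0.7, 0.3])  # somewhere in between
```

The skewed set lands between the two extremes, which is why splits that produce more lopsided subsets are preferred.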
Why is overfitting bad?
- overfitting (and underfitting) leads to poor predictions on new (unseen) datasets - we want our predictive models to be applicable not just to the particular training set, but to the general population from which it comes
Supervised Segmentation
- the goal is to segment/separate/partition a given population into subgroups that have different values for the target variable - but at the same time have similar values (for the target variable) within each subgroup ("purity") - there are often many attributes that make up an instance and not all are useful in determining the value of its target variable - therefore, we need a way to rank the importance of such variables with respect to predicting the value of the target - we want to determine which attributes segment our instances into subgroups that are pure (homogeneous) with respect to the target variable (every member of the subgroup has the same value for the target)
Objective / Loss Function
- the loss function incurs no penalty for an example that is not on the wrong side of the margin - the loss function determines how much penalty should be assigned to an instance (based on its distance from the separating boundary) SVMs use a loss function known as hinge loss (since it looks like a door hinge): - for a classifier with positive and negative classes - x-axis = the distance of an instance from the separating boundary - y-axis = the loss incurred by an instance - when a negative instance is on the negative side of the boundary (classified correctly), the hinge loss = 0 - when a negative instance is on the positive side of the boundary (classified incorrectly), the hinge loss > 0 - a negative instance on the negative side is on the correct side of the margin - a negative instance on the positive side is on the incorrect side of the margin The hinge loss becomes positive when an instance is: 1. on the wrong side of the boundary 2. beyond the margin - the loss then increases linearly with the distance - the farther an instance is from the separating boundary, the greater the loss (penalty)
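A sketch of hinge loss in its common textbook form, max(0, 1 - y * f(x)). This formulation is an assumption (the card describes the shape but gives no formula): it encodes the class as y = +1 or -1 and scales the signed score f(x) so that the margin edges sit at +1 and -1.

```python
def hinge_loss(y, score):
    """y is +1 or -1; score is the signed output f(x) of the linear discriminant."""
    return max(0.0, 1.0 - y * score)

# Negative instance well on the negative side: no penalty
loss_correct = hinge_loss(-1, -2.0)
# Negative instances on the positive side: penalty grows linearly with distance
loss_wrong_near = hinge_loss(-1, 0.5)
loss_wrong_far = hinge_loss(-1, 3.0)
```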
The number of nodes in each layer
- the number of input nodes is predetermined by the number of features in the input data - the number of output nodes is predetermined by the number of outcomes to be modeled or the class of outcomes - the number of hidden nodes is left to the user to decide prior to training the model - the appropriate number depends on the number of input nodes, the amount of training data, the amount of noisy data, and the complexity of the learning task among many other factors
How to detect overfitting
- we can't know how well our model will perform on new data until we actually test it - to find out, we can split our initial dataset into separate training and test subsets - if our model does much better on the training set than on the test set, then we're likely overfitting - ex: if our model predicted with 99% accuracy on the training set but only 55% accuracy on the test set, then we have overfitting
Profit Curve Steps
1. All four curves begin and end at the same point: "When no customers are targeted, there are no expenses and zero profit, and at the right side, everyone is targeted, so every classifier performs the same" 2. In between, we will see some differences by classifier, depending on how each classifier orders the customers Which classifier performs the best: - Classifier 1 is best when a budget constraint is considered (we can only afford to target at most 8% of the total customer base) - Classifier 2 is best when no budget constraint is considered, by targeting the top-ranked 50% of customers In summary, profit curves allow us to spot which classifier is best and help us achieve the goal (ex: maximizing expected profit with a limited budget)
Common Data Mining Tasks
1. Classification and class probability estimation - how likely is this consumer to (or will s/he) respond to our campaign? 2. Regression - how much will she use the service? 3. Similarity Matching - can we find consumers similar to my best customers? 4. Clustering - do my customers form groups? 5. Co-occurrence Grouping - also known as frequent itemset mining, association rule discovery, and market-basket analysis - what items are commonly purchased together? 6. Profiling (behavior description) - what does "normal behavior" look like? (ex: as a baseline to detect fraud) - what is the typical cell phone usage of this customer segment? 7. Data Reduction - which latent dimensions describe the consumer taste preferences? 8. Link Prediction - since John and Jane share 2 friends, should John become Jane's friend? 9. Causal Modeling - did advertisements influence a consumer to purchase?
Two Approaches for Clustering
1. Hierarchical Clustering 2. Clustering Around Centroids (k-Means Clustering)
Term Frequency - Process
1. Normalize case (make all terms/tokens lowercase) 2. Stem words/tokens (ex: "Players, Played, ..." -> "Play") 3. Remove stop words (ex: "a", "in", "on", etc.)
Causes of Overfitting
1. Overfitting due to presence of noise - mislabeled instances may contradict the class labels of other similar records 2. Overfitting due to lack of representative instances - lack of representative instances in the training data can prevent refinement of the learning algorithm 3. Overfitting due to complex model - if the algorithm is too complex or flexible (it has too many input features or it's not properly regularized), it can end up "memorizing the noise" instead of finding the signal
K-means Algorithm
1. Randomly select k cluster centers 2. Assign cases to closest center 3. Update cluster centers 4. Reassign cases 5. Repeat steps 3 and 4 until convergence
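The five steps can be sketched in plain Python for 2-D points. To keep the example reproducible, the initial centers are passed in rather than chosen randomly (step 1 would normally pick them at random, for instance with `random.sample`); squared Euclidean distance, the empty-cluster rule, and the sample points are assumptions for illustration.

```python
def k_means(points, initial_centers, max_iterations=100):
    """points, initial_centers: lists of (x, y) tuples. Returns (centers, assignments)."""
    centers = list(initial_centers)
    assignments = None
    for _ in range(max_iterations):
        # Steps 2 and 4: assign each case to its closest center
        new_assignments = [
            min(range(len(centers)),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_assignments == assignments:
            break  # Step 5: converged, no case changed cluster
        assignments = new_assignments
        # Step 3: move each center to the mean of the cases assigned to it
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return centers, assignments

# Hypothetical data: two visibly separated groups of three points each
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignments = k_means(points, initial_centers=[(1, 1), (8, 8)])
```

Because the result depends on the starting centers, it is common to run the algorithm several times from different random starts and keep the best clustering.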
Pre-Processing of Text
1. The case should be normalized - every term is in lowercase 2. Words should be stemmed - suffixes are removed - ex: noun plurals are transformed to singular forms - if the word ends in 'ed', remove the 'ed' - if the word ends in 'ing', remove the 'ing' - if the word ends in 'ly', remove the 'ly' 3. Stop-words should be removed - very common words such as "the", "and", "of", and "on" are removed
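A deliberately naive sketch of the three pre-processing steps. The suffix rules mirror the ones listed above, but the minimum-length guards, the tiny stop-word list, and the sample sentence are assumptions; a real stemmer (such as the Porter stemmer) handles many more cases.

```python
STOP_WORDS = {"the", "and", "of", "on", "a", "in"}  # tiny illustrative list

def naive_stem(word):
    """Strip 'ing', 'ed', or 'ly', or a plural 's', from words long enough to keep a stem."""
    for suffix in ("ing", "ed", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def preprocess(text):
    # 1. Normalize case, and trim punctuation from token edges
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    # 2. Stem each non-empty token
    stemmed = [naive_stem(t) for t in tokens if t]
    # 3. Remove stop words
    return [t for t in stemmed if t not in STOP_WORDS]

tokens = preprocess("The players played quickly on the field.")
```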
ROC Graphs and Curves Example
1. The example set consists of 100 positives and 100 negatives 2. The model assigns a score to each instance and the instances are ordered decreasing from bottom to top 3. As we lower the threshold, whenever we pass a positive instance we take a step upward (increasing true positives), and whenever we pass a negative instance we take a step rightward (increasing false positives) - Lower the threshold to 0.99 and pass the first positive instance - Lower the threshold to 0.98 and pass another positive instance - Lower the threshold to 0.96 and pass the first negative instance (take a step rightward)
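The step-up, step-right construction can be sketched as follows, along with a trapezoid-rule area under the resulting curve. The five scored instances are a hypothetical miniature of the 200-instance example (reusing the scores 0.99, 0.98, and 0.96), and tied scores are not handled specially.

```python
def roc_points(scores, labels):
    """labels: 1 = positive, 0 = negative. Returns the (FP rate, TP rate) points."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Lowering the threshold is the same as walking down the ranked list
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1  # passed a positive: step upward
        else:
            fp += 1  # passed a negative: step rightward
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a list of (x, y) points ordered by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical scored instances
scores = [0.99, 0.98, 0.96, 0.90, 0.88]
labels = [1, 1, 0, 1, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

The curve starts at the bottom left corner (nothing classified positive) and ends at the top right corner (everything classified positive).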
Expected Profit
= p(Y,p)*b(Y,p) + p(N,p)*b(N,p) + p(N,n)*b(N,n) + p(Y,n)*b(Y,n) True Positives: p(Y,p) * b(Y,p) False Negative: p(N,p) * b(N,p) True Negative: p(N,n) * b(N,n) False Positive: p(Y,n) * b(Y,n) - Ex: this expected value means that if we apply this model to a population of prospective customers (unseen customers) and mail offers to those the model classifies as positive (likely to respond), we can expect to make an average of about $50 profit per consumer
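The expected profit sum over the four confusion-matrix cells can be sketched with two dictionaries keyed by (predicted, actual). All rates and dollar values below are hypothetical and are not the numbers behind the $50 example.

```python
def expected_profit(rates, benefits):
    """Sum of p(cell) * b(cell) over the four (predicted, actual) cells."""
    return sum(rates[cell] * benefits[cell] for cell in rates)

# Hypothetical cell rates p(predicted, actual); they must sum to 1
rates = {("Y", "p"): 0.10, ("N", "p"): 0.05, ("N", "n"): 0.70, ("Y", "n"): 0.15}
# Hypothetical benefits: a response earns $99, a wasted offer costs $1
benefits = {("Y", "p"): 99.0, ("N", "p"): 0.0, ("N", "n"): 0.0, ("Y", "n"): -1.0}

profit_per_consumer = expected_profit(rates, benefits)
```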
Predictive Model
A formula for estimating the unknown value of interest: the target
An expected rate matrix
A sample confusion matrix with predictions of how customers will respond to a targeted marketing offer and buy the advertised product
Model
A simplified representation of reality created to serve a purpose
Training set size in overfitting
A single holdout set gives us only a single estimate of a model's generalization performance. If we split the dataset differently, will the training set size influence the model's performance? - yes
Fitting Graph and Overfitting
- A fitting graph plots the accuracy (or error) of a model (y-axis) as a function of degree of model complexity (x-axis) - A fitting graph shows the "difference" between: 1. a model's accuracy on the training data 2. the accuracy on the holdout data as model complexity changes - If a model/algorithm is too complex (it has too many input features), it can end up "memorizing the noise" instead of finding the signal
Term-Frequency Representation Example
- A sample document: "Microsoft Corp and SKYPE Global today announced that they have entered into a definitive agreement under which Microsoft will acquire Skype, the leading Internet communications company, for $8.5 billion in cash from the investor group led by Silver Lake. The agreement has been approved by the boards of directors of both Microsoft and Skype." TFR: Skype - 3, Microsoft - 3, Agreement - 2, Global - 1, etc...
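The counts in this example can be reproduced with a short sketch. It assumes case is normalized to lowercase first (so "SKYPE" and "Skype" count as the same term) and that a token is any run of letters; both choices are assumptions made for the illustration.

```python
import re
from collections import Counter

document = (
    "Microsoft Corp and SKYPE Global today announced that they have entered "
    "into a definitive agreement under which Microsoft will acquire Skype, "
    "the leading Internet communications company, for $8.5 billion in cash "
    "from the investor group led by Silver Lake. The agreement has been "
    "approved by the boards of directors of both Microsoft and Skype."
)

def term_frequencies(text):
    """Count how often each lowercase term occurs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = term_frequencies(document)
for term in ("skype", "microsoft", "agreement", "global"):
    print(term, counts[term])
# skype 3
# microsoft 3
# agreement 2
# global 1
```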
Network Topology
- A set of neurons called input nodes receives unprocessed signals directly from the input data - Each input node is responsible for processing a single feature in the dataset - The feature's value will be transformed by the corresponding node's activation function - The signals sent by the input nodes are received by the output node, which uses its own activation function to generate a final prediction (denoted here as p)
N-Gram Sequences
- Adjacent pairs are commonly called bi-grams - "bag of n-grams up to 2" - Ex: "the quick brown fox jumps" (with the stop-word "the" removed) would be transformed into {quick, brown, fox, jumps, quick_brown, brown_fox, fox_jumps} - Useful when particular phrases are significant but their component words may not be - They greatly increase the size of the feature set
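A "bag of n-grams up to 2" can be built with a small helper like the one below. The function name `bag_of_ngrams` is hypothetical, and the input is assumed to be a token list from which stop-words such as "the" have already been removed.

```python
def bag_of_ngrams(tokens, max_n=2):
    """Return the set of all n-grams of length 1..max_n, joined with '_'."""
    bag = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag.add("_".join(tokens[i : i + n]))
    return bag

tokens = ["quick", "brown", "fox", "jumps"]
print(sorted(bag_of_ngrams(tokens)))
# ['brown', 'brown_fox', 'fox', 'fox_jumps', 'jumps', 'quick', 'quick_brown']
```

Four tokens already produce seven features, which illustrates why n-grams greatly increase the size of the feature set.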
Associations Among Facebook Likes
- Users can "like" items on Facebook - Each user has a "basket" of likes, formed by aggregating all of that user's likes - Do certain likes tend to co-occur more frequently than we would expect by chance?
Plain Accuracy
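Accuracy = (number of correct decisions made) / (total number of decisions made) = (TP + TN) / (total number of decisions made) - Unfortunately, it is sometimes too simplistic and it has some well-known problems (ex: with severe class imbalance, a classifier can obtain high accuracy simply by always guessing the most frequent class)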
Advantages and Disadvantages of ROC Graphs/Curves
Advantages: - summarizes performance for any particular cutoff - independent of class priors and cost/benefits Disadvantages: - not a particularly intuitive visualization for most business stakeholders
Bagging
An abbreviation of bootstrap aggregating - the conventional bagging algorithm involves generating n different bootstrap training samples (sampled with replacement), training the algorithm on each bootstrapped sample separately, and then aggregating the predictions at the end
Cumulative Response Curve
An alternative visualization framework that might not have all the nice properties of ROC curves but is more intuitive and easy even for a layperson to understand - Plots the hit rate on the y-axis and the percentage of the population that is targeted on the x-axis - Hit rate is the same as true positive rate, which we can measure as the percentage of positives which are correctly classified
Cosine Distance
Often used in text classification to measure the similarity of two documents d_cosine(X, Y) = 1 - (X · Y) / (||X||_2 * ||Y||_2) Particularly useful when you want to ignore differences in scale across instances - Ex: in text classification you may want to ignore whether one document is much longer than another, and just concentrate on the textual content
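As a small sketch of the formula, the function below computes cosine distance for two equal-length feature vectors (for example, term counts). The name `cosine_distance` is an assumption for the example, and the vectors are assumed to be non-zero.

```python
import math

def cosine_distance(x, y):
    """1 minus the cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))   # ||X||_2
    norm_y = math.sqrt(sum(b * b for b in y))   # ||Y||_2
    return 1 - dot / (norm_x * norm_y)

short_doc = [1, 2, 0]    # hypothetical term counts
long_doc = [2, 4, 0]     # same content, twice as long
print(cosine_distance(short_doc, long_doc))   # approximately 0: scale is ignored
print(cosine_distance([1, 0], [0, 1]))        # 1.0: no terms in common
```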
Generalization and Overfitting
Overfitting: a tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points Generalization: a model applies to data that were not used to build the model - estimated by simulating the use scenario via holdout data - reserve holdout data for evaluation - cross-validation gives better statistics on generalization performance - holdout methods are used only to estimate generalization performance
Precision
P = TP / (TP + FP) = True Positives / Predicted Positives - precision, also called positive predictive value, is defined as the number of correctly classified positive instances divided by the total number of predicted positive instances - or, out of all the predicted positives, how many we predicted correctly; precision should be high
How do we choose a proper threshold?
Performance Visualization Curves: to select the proper threshold value, we consider several diagrams: - cumulative profit curves - ROC: receiver operating characteristics (ROC space, ROC curves) - AUC - Lift Curves
Link Prediction
Predict connections between data items Ex: "Since you and Karen share 10 friends, maybe you'd like to be Karen's friend?"
Pros and Cons of SVM
Pros: - Accuracy - Works well on smaller cleaner datasets - It can be more efficient because it uses a subset of training points Cons: - Isn't suited to larger datasets as the training time with SVMs can be high - Less effective on noisier datasets
Recall
R = TP / (TP + FN) = True Positives / Actual Positives - recall is also called sensitivity or true positive rate - it is defined as the number of correctly classified positive instances divided by the total number of actual positive instances - or, out of all the actual positives, how many we have predicted correctly; recall should be high
Under-Sampling
Randomly down-samples the majority class Advantages: it can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge Disadvantages: - it can discard potentially useful information which could be important for building rule classifiers - the sample chosen by random under sampling may be a biased sample, and it will not be an accurate representative of the population, thereby, resulting in inaccurate results with the actual test data set
Oversampling
Randomly replicates minority instances to increase their population Advantages: - unlike under sampling this method leads to no information loss - Outperforms under-sampling Disadvantages: it increases the likelihood of overfitting as it is more likely to get the same samples in the training and in the test data (the test data is no longer independent from training data)
Instance
Represents a fact or data point - described by a set of attributes
Specificity
S = TN / (TN + FP) = True Negatives / Actual Negatives - specificity, also called the true negative rate, determines the proportion of actual negatives that are correctly identified
Dendrogram
Shows explicitly the hierarchy of the clusters - on the x-axis we have 6 data points - on the y-axis we represent the distance between the clusters - we can "clip" the dendrogram with a horizontal line (shown as a dotted red line)
Pruning
Simplifies a decision tree to prevent over-fitting to noise in the data Pre-Pruning: - stops growing a branch when information becomes unreliable - minimum instance stopping criterion: limit "tree size" by specifying a minimum number of instances for each leaf - information gain tests at every leaf Post-Pruning: - takes a fully-grown decision tree and discards unreliable parts - estimate whether replacing a set of leaves or a branch with ONE leaf would reduce accuracy (if not, go ahead and prune) - post-pruning is preferred in practice
Hierarchical Clustering
Six points (instances) - A, B, C, D, E, and F - are arranged on a 2-dimensional instance space - how would you generate clusters - you can use the Euclidean distance between a pair of points or between a point and clusters - if they are closer to each other on the instance space, they are more similar to each other, thus can be grouped into a same cluster - Focuses on the similarities between the individual instances and how similarities link them together - Also allows the data analysts to "see" the groupings - the landscape of data similarity
Ranking Instead of Classifying
Sort test instances by their scores in decreasing order - with a ranking classifier, "a classifier and a threshold" together (NOT ALONE INDEPENDENTLY) produce a single confusion matrix - Whenever the threshold changes, the confusion matrix may change as well because the numbers of TP, FP, FN, and TN change as we change the threshold - As we lower the threshold, instances will move up from the 'No' row to the 'Yes' row of the confusion matrix - An instance which was considered as a NO before is now classified as YES, so the counts in the confusion matrix change as a result - Each different threshold produces a different classifier, resulting in a different confusion matrix
Avoid Overfitting
Complexity control: finding the right balance between the fit to the data and the complexity of the model - different data mining algorithms adopt different methods to control the complexity of the model ex: - decision trees, we have various ways for trying to keep the tree from getting too big (too complex) - for equations like logistic regression, linear regression, etc, complexity can be controlled by choosing a "right" set of attributes or a general methodology called regularization
Handling Imbalanced Data
Data Approach: - under-sampling - oversampling - SMOTE Algorithm Approach: - boosting - bagging
Euclidean Distance
Distance(A,B) = sqrt( (xb - xa)^2 + (yb - ya)^2 ) Pythagorean theorem, which tells us that the distance between A and B is given by the "square root of the summed squares of the lengths of the other two sides of the triangle" - when an object has n features, n dimensions, the general equation for euclidean distance in n dimensions is sqrt( (d1a - d1b)^2 + (d2a - d2b)^2 + ... + (dna - dnb)^2 ) What does it mean: - it does not have any unit, so it does not have any meaningful interpretation per se - it is only useful for comparing the similarity of one pair of instances to that of another pair
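The n-dimensional formula translates directly into code. In this minimal sketch the function name and the sample points are made up for illustration; the last lines show the intended use, which is comparing one pair of instances against another pair.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences across all features."""
    return math.sqrt(sum((da - db) ** 2 for da, db in zip(a, b)))

print(euclidean_distance((0, 0), (3, 4)))   # 5.0 (the classic 3-4-5 triangle)

# The number matters only in comparison: A is more similar to B than to C.
A, B, C = (1, 2, 3), (1, 2, 4), (5, 6, 7)
print(euclidean_distance(A, B) < euclidean_distance(A, C))   # True
```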
The Impact of Document Length
Documents of various lengths - long documents have more words than shorter ones - prefer to normalize the term frequencies with respect to "document length" The raw term frequencies are normalized in some way - such as by dividing each by the total number of words in the document - or the frequency of the specific term in the corpus
Logistic Regression
Estimating the probability of class membership (a numeric quantity) over a categorical class - it is a class probability estimation model and not a regression model - the linear function, f(x), ranges from -infinity to infinity, but a probability ranges from 0 to 1 - so we model the log-odds that an instance x belongs to the positive class - odds: "the ratio of the probability of the event occurring to the probability of the event not occurring" - log-odds: log( p(x) / (1 - p(x)) ) - the maximum likelihood model gives the highest probabilities to the positive examples and the lowest probabilities to the negative examples
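The link between the unbounded linear score and a probability can be sketched as a pair of inverse functions. Setting f(x) equal to the log-odds and solving for p(x) gives p(x) = 1 / (1 + e^(-f(x))); the function names below are assumptions for the example.

```python
import math

def log_odds(p):
    """Map a probability in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))

def probability(f):
    """Inverse of log_odds: map a linear score f(x) back into (0, 1)."""
    return 1 / (1 + math.exp(-f))

print(probability(0))               # 0.5: a score of zero means even odds
print(probability(log_odds(0.8)))   # approximately 0.8: the two functions undo each other
print(probability(-5), probability(5))   # large scores approach 0 and 1
```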
Co-Occurrences
Example: customers who bought diapers also bought beer Applications: Online shopping, distribution center, etc But many co-occurrences are due to chance alone, so we place a constraint that such rules must apply to some minimum % of the data.
F-Score
F-Score = (2 * Recall * Precision) / (Recall + Precision) - it is difficult to compare two models with different precision and recall, so to make them comparable, we use F-score - it is the harmonic mean of precision and recall; compared to the arithmetic mean, the harmonic mean punishes extreme values more; F-score should be high
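A short sketch shows how the F-score is built from precision and recall, which in turn come from confusion-matrix counts. The counts used here (8 true positives, 2 false positives, 8 false negatives) are invented for illustration.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * r * p / (r + p)

p = precision(tp=8, fp=2)   # 0.8
r = recall(tp=8, fn=8)      # 0.5
print(p, r, f_score(p, r))

# The harmonic mean is pulled toward the smaller value more than the
# arithmetic mean is: here it lies below (0.8 + 0.5) / 2 = 0.65.
print(f_score(p, r) < (p + r) / 2)   # True
```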
Type 2 Error
False Negatives
Type 1 Error
False Positives
Regression
For each instance/individual in a population, it attempts to estimate the numerical value of some variables - target variable: numerical (a real value, "dollars") Ex: how much will a particular customer use a certain service? how much will a particular user use Facebook next month?
Classification
For each instance/individual in a population, it attempts to predict which of a small set of classes it belongs to - target variable: categorical (often binary, yes/no) Ex: will a particular user quit a social media site? what is the probability that it will rain tomorrow?
Text Representation
Goal: Take a set of documents - each of which is a free-form sequence of words - and turn it into our familiar feature-vector form - A document is composed of individual tokens or terms - A collection of documents is called a corpus - Each document is one instance: but we don't know in advance what the features will be
Lift
How much more frequently does this association occur than we would expect by chance? Lift(A,B) = p(A,B) / ( p(A) * p(B) ) It is the ratio of the joint probability to the product of the individual probabilities
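As an illustration, lift can be estimated from a list of "baskets" by using relative frequencies as the probabilities. The four baskets below are invented for the example, and the function name `lift` is hypothetical.

```python
def lift(baskets, a, b):
    """Lift(A,B) = p(A,B) / (p(A) * p(B)), estimated from basket frequencies."""
    n = len(baskets)
    p_a = sum(1 for basket in baskets if a in basket) / n
    p_b = sum(1 for basket in baskets if b in basket) / n
    p_ab = sum(1 for basket in baskets if a in basket and b in basket) / n
    return p_ab / (p_a * p_b)

baskets = [
    {"beer", "diapers"},
    {"beer", "diapers"},
    {"beer"},
    {"milk"},
]
print(lift(baskets, "beer", "diapers"))   # about 1.33
```

Here p(beer) = 0.75, p(diapers) = 0.5, and p(beer, diapers) = 0.5, so the lift is 0.5 / 0.375, or about 1.33. A lift above 1 means the pair co-occurs more often than chance would predict.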
Expected Value Framework
In the real world, false positives and false negatives are very different kinds of errors with very different costs and benefits Consider a medical diagnosis problem: - cost of false positive error: expensive, inconvenient, stressful - cost of false negative error: life-threatening - Y (diagnosed as cancer) - N (diagnosed as no cancer)
A cost-benefit matrix
In many real-world cases, specifying the costs and benefits may take a lot of time and effort, and sometimes we will only get their approximate ranges Ex: True Positives: - we provide a promotional offer to a customer, and that customer responds and buys the product - so the benefit in this case is the revenue of $200 minus the product cost of $100 minus the mailing cost of $1, therefore the benefit becomes $99 False Positives: - we classify a consumer as a likely responder and mail the marketing materials, which incurs a fixed cost of $1, but the targeted customer does not respond; therefore, the benefit in this case is negative $1 False Negatives: - we assume that this product is only available via this offer - we predicted that a customer would not buy, so we did not offer the product to him, but he would have bought the product if it had been offered - in this case, no marketing money was spent and nothing was gained in return, so the net benefit is zero True Negatives: - we did not offer a deal to a consumer, and he wouldn't have bought it even if we had offered it to him - again the net benefit in this case is zero - Product price: $200 - Product marginal cost: $100 - Targeting costs: $1
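Combining this cost-benefit matrix with the expected profit formula, expected profit = sum of p(outcome) * b(outcome), can be sketched as follows. The benefit values come from the example above; the four joint probabilities are purely hypothetical numbers chosen so that they sum to 1, not figures from the course. Keys follow the p(predicted, actual) notation, so ("Y", "n") is a false positive.

```python
# Benefits b(predicted, actual) from the example above, in dollars.
benefit = {
    ("Y", "p"): 200 - 100 - 1,   # true positive: price - cost - mailing = 99
    ("Y", "n"): -1,              # false positive: mailing cost only
    ("N", "p"): 0,               # false negative: nothing spent, nothing gained
    ("N", "n"): 0,               # true negative
}

# Hypothetical joint probabilities p(predicted, actual); they must sum to 1.
rate = {
    ("Y", "p"): 0.10,
    ("Y", "n"): 0.20,
    ("N", "p"): 0.05,
    ("N", "n"): 0.65,
}

def expected_profit(rate, benefit):
    """Sum of p(outcome) * b(outcome) over the four confusion-matrix cells."""
    return sum(rate[cell] * benefit[cell] for cell in benefit)

print(round(expected_profit(rate, benefit), 2))   # 9.7 dollars per consumer
```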
Decision Tree Summary
Intuition for entropy: higher entropy means less pure set of instances Information gain: metric for measuring gain in knowledge of target variable from knowing an attribute Decision trees: - popular, easy to implement, easy to use, easy to interpret models that work well - recursively perform IG-based attribute selection on set of instances - can also be used for probability estimation, regression
Kappa
Kappa = (Pr(a) - Pr(e)) / (1 - Pr(e)) = (observed accuracy - expected accuracy) / (1 - expected accuracy) - the kappa adjusts accuracy by accounting for the possibility of a correct prediction by chance alone - this is especially important for datasets with severe class imbalance because a classifier can obtain high accuracy simply by always guessing the most frequent class - kappa values typically range from zero (agreement no better than chance) to a maximum of one, which indicates perfect agreement between the model's predictions and the true values - depending on how a model is to be used, the interpretation of the kappa statistic might vary; one common interpretation is as follows - poor agreement = less than 0.20 - fair agreement = 0.20 to 0.40 - moderate agreement = 0.40 to 0.60 - good agreement = 0.60 to 0.80 - very good agreement = 0.80 to 1.00 * it's important to note that these categories are subjective
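A small sketch makes the chance correction concrete for a two-class confusion matrix. Expected accuracy Pr(e) is computed here as the probability that prediction and truth agree if they were independent; the function name and both sets of counts are invented for illustration.

```python
def kappa(tp, fp, fn, tn):
    """Kappa for a 2x2 confusion matrix of counts."""
    n = tp + fp + fn + tn
    pr_a = (tp + tn) / n                        # observed accuracy
    pred_pos, pred_neg = (tp + fp) / n, (fn + tn) / n
    act_pos, act_neg = (tp + fn) / n, (fp + tn) / n
    pr_e = pred_pos * act_pos + pred_neg * act_neg   # agreement expected by chance
    return (pr_a - pr_e) / (1 - pr_e)

# A reasonably good model on balanced data: accuracy 0.85, kappa about 0.7.
print(kappa(tp=40, fp=10, fn=5, tn=45))

# Always guessing the majority class on imbalanced data:
# accuracy is 0.95, but kappa is 0 because chance alone explains it.
print(kappa(tp=0, fp=0, fn=5, tn=95))
```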
Information Gain
Measures how much an attribute decreases entropy over the entire segmentation it creates IG(parent, children) = entropy(parent) - [p(c1) * entropy(c1) + p(c2) * entropy(c2) + ... ] - where p(ci) is the proportion of instances belonging to child ci - the goal is to use an attribute that produces higher IG values
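The formula can be sketched with each set of instances represented simply as a list of class labels. Entropy is taken here as the usual -sum(p * log2(p)) over the classes in a set; the helper names and the ten-instance example are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """-sum(p * log2(p)) over the class proportions in a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """entropy(parent) minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# An attribute that splits the parent into two pure children: IG = 1 bit.
print(information_gain(parent, [["yes"] * 5, ["no"] * 5]))

# An attribute whose children are as mixed as the parent: IG = 0.
mixed = ["yes", "yes", "no", "no"]
print(information_gain(mixed, [["yes", "no"], ["yes", "no"]]))
```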
Context Matters Example
Movie Review: "the first part of this movie is far better than the second. The acting is poor and it gets out-of-control by the end, with the violence overdone and an incredible ending, but still fun to watch." - Consider whether the overall sentiment is for or against the film. Is the word incredible positive or negative? - It is difficult to evaluate any particular word or phrase here without taking into account the entire context
Clustering
Another application of our fundamental notion of similarity - the basic idea is that we want to find groups of objects (consumers, businesses, etc.), where the objects within groups are similar, but the objects in different groups are not so similar
Supervised Learning Algorithms
Decision Tree: learning tasks = classification Logistic Regression: learning tasks = classification Neural Networks: learning tasks = classification and numeric prediction Support Vector Machines: learning tasks = classification and numeric prediction
CRISP Data Mining Process
Business Understanding: understand or create the business problem Data Understanding: understand the strengths and limitations of the data Data Preparation: convert the data into the right format Modeling: apply data mining techniques to the data Evaluation: assess the performance of the data mining results Deployment: put into real use to realize ROI
Linear vs Nonlinear Support Vector Machines
Linear SVM: - find the fattest "bar" that linearly separates the instances; the best "fit" is defined as the center line of that widest bar Nonlinear SVM: - consider "higher-order" combinations of the original features - 1. squared features, 2. products of features, etc - x1^2, x2^2, x1 * x2, etc
