CS529 Final Exam
Given 9356 classifiers, each with a probability of error of 0.5, combined by simple voting, what is the probability of error of the ensemble?
0.5
Buckshot Algorithm
Combines HAC and K-Means clustering. First randomly take a sample of instances of size sqrt(n). Run group-average HAC on this sample, which takes only O(n) time. Use the results of HAC as initial seeds for K-means. The overall algorithm is O(n) and avoids the problems of bad seed selection.
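A minimal sketch of Buckshot under these assumptions (scikit-learn for HAC and k-means; the function name buckshot is illustrative, and sqrt(n) >= k is assumed):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def buckshot(X, k, rng=np.random.default_rng(0)):
    n = len(X)
    # 1. Draw a random sample of size sqrt(n).
    sample = X[rng.choice(n, size=int(np.sqrt(n)), replace=False)]
    # 2. Run group-average HAC on the sample to get k clusters.
    hac = AgglomerativeClustering(n_clusters=k, linkage="average").fit(sample)
    # 3. Use the HAC cluster centroids as seeds for k-means on the full data.
    seeds = np.array([sample[hac.labels_ == c].mean(axis=0) for c in range(k)])
    return KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)
```

Because HAC only ever sees sqrt(n) points, its quadratic cost stays at O(n), matching the card's claim.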
Data preprocessing
Data selection: identify target datasets and relevant fields.
Data transformation:
- Data cleaning
- Combine related data sources
- Create common units
- Generate new fields
- Sampling
Motivations of ensemble methods
Ensemble models improve accuracy and robustness over single-model methods.
Applications: distributed computing; privacy-preserving applications; large-scale data with reusable models; multiple sources of data.
Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach).
Lift Chart
Example: a lift chart shows how much more likely we are to receive respondents to a survey than if we contact a random sample of customers. For example, by contacting only 10% of customers, as ranked by the predictive model, we will reach 3 times as many respondents as if we use no model. Y axis: absolute number of successes (probability of success using the model at the given sample size). X axis: sample size.
In practice, different types of classification errors often incur different costs. Give examples of classification errors.
Examples:
- Terrorist profiling ("not a terrorist" is correct 99.99% of the time)
- Loan decisions
- Oil-slick detection
- Fault diagnosis
- Promotional mailing
Boosting is an ensemble method that uses sampling with replacement of a dataset and learns independently several classifiers.
FALSE
Ensemble methods can be used for classification but not prediction.
FALSE
Ensemble methods can be used for supervised learning but not unsupervised learning.
FALSE
For methods like boosting or bagging to work well it is necessary that the sub-samples from the dataset used in each classifier are as similar as possible.
FALSE
Random decision trees are computationally expensive to build. (T/F)
FALSE
Random subspaces (like random forest) are meta-learning methods that learn a set of classifiers using datasets that have subsets of attributes of the original data.
FALSE - Random Forest is an instance of the Random Subspace method, which is another term for Attribute Bagging or Feature Bagging. An ensemble of models employing the random subspace method can be constructed with the following algorithm (see the sketch below):
1. Let the number of training points be N and the number of features in the training data be D.
2. Choose L, the number of individual models in the ensemble.
3. For each individual model l, choose dl (dl < D), the number of input features for l. It is common to use a single value of dl for all the individual models.
4. For each individual model l, create a training set by choosing dl features from D without replacement, and train the model.
To apply the ensemble to an unseen point, combine the outputs of the L individual models by majority voting or by combining the posterior probabilities.
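A hedged sketch of that algorithm, assuming scikit-learn decision trees as the individual models (all helper names are illustrative, and integer class labels are assumed for the vote):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_subspace(X, y, L=10, d=None, rng=np.random.default_rng(0)):
    D = X.shape[1]
    d = d or max(1, D // 2)                 # features per model (d < D)
    models = []
    for _ in range(L):
        feats = rng.choice(D, size=d, replace=False)  # d features, no replacement
        clf = DecisionTreeClassifier().fit(X[:, feats], y)
        models.append((feats, clf))
    return models

def predict_random_subspace(models, X):
    # Majority vote over the L individual models.
    votes = np.array([clf.predict(X[:, feats]) for feats, clf in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```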
Information Gain
Gain(A) = I(p,n) - E(A)
Stationary Assumption
Transition probabilities are independent of time.
A hypergraph is a graph with at least one hyperedge.
True
Hard SVM attempts to completely separate positive and negative observations. It is therefore prone to overfitting.
True
In soft SVM, each point is assigned an alpha, controlling its participation as a support vector.
True
Kernel PCA can be used to denoise a dataset.
True
Kernel tricks may transform an n-dimensional space into a higher order space to make the problem linearly separable.
True
One motivation of ensemble clustering is that an obvious distance measure may not exist and we must induce it, which is not always easy, especially in multidimensional spaces.
True
Support vector machines attempt to maximize the margin between classes.
True
The hypergraph partitioning algorithm can be used to aggregate the output of several clustering algorithms. In such an approach, a hyperedge connects objects put into the same cluster by a clustering algorithm and hyperedges are then cut to induce the final partitions.
True
The latent variables computed by the EM algorithm can be conceptualized by assuming the existence of additional unobserved features. (T/F)
True
When performing ensemble clustering, the results of several clusterings can be combined by using the percentage of clusterings that assign two objects to different clusters as a new distance function. This is an object-based approach.
True
Random forests are an example of: 1. Bagging 2. Boosting
Bagging
Which of the following ensemble techniques can be easily parallelized? 1. Bagging 2. Boosting
Bagging only.
Which of the following are considerations when performing hierarchical agglomerative clustering? Mark all that apply. 1. Size of k 2. Similarity function 3. Seed selection 4. Linking method
2. Similarity function and 4. Linking method. HAC needs neither k nor seed selection: it starts with every instance in its own cluster and repeatedly merges the most similar pair, which is why bottom-up hierarchical clustering is called hierarchical agglomerative clustering or HAC.
Different Types of Classifiers
- Decision trees
- Simple Bayesian models
- Nearest-neighbor methods
- Logistic regression
- Neural networks
- Linear discriminant analysis (LDA)
- Quadratic discriminant analysis (QDA)
- Density estimation methods
Bagging
Easily parallelized. Creates ensembles by repeatedly resampling the training data at random: given a training set of size n, create m samples of size n by drawing n examples from the original data with replacement. Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are replicates. Combine the m resulting models using a simple majority vote. Bagging decreases error by decreasing the variance due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed. In short (see the sketch below):
- Bootstrap: sampling with replacement; each sample contains around 63.2% of the original records.
- Aggregation: train a classifier on each bootstrap sample and use majority voting to determine the class label of the ensemble classifier.
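A minimal bagging sketch under those assumptions (scikit-learn decision trees as the unstable base learner; names are illustrative, and integer class labels are assumed for the vote):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, m=25, rng=np.random.default_rng(0)):
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)    # draw n records WITH replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def vote(models, X):
    # Simple majority vote over the m bootstrap models.
    preds = np.array([mdl.predict(X) for mdl in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)
```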
A clustering ensemble can accept a hard partitioning, but not a soft partitioning from its base models. (T/F)
False
Like decision trees, support vector machines can output a class but not a probability.
False
Like kernel support vector machines, kernel PCA transforms the space into a higher dimension in order to discover a linear separation between two classes.
False
Since kernel PCA produces a higher order space, it is not suitable for visualizing the results of a cluster analysis.
False
Support vector machines can be extended to a multiclass problem by increasing the number of kernel tricks.
False
Support vector machines cannot model curved decision boundaries.
False
Tuning the parameters of SVM kernels usually requires specialized domain knowledge.
False
When performing ensemble clustering, clusters can be represented over the assignment space. Then the clusters themselves can be clustered. Objects are assigned to their most associated consensus cluster. This is an object-cluster model.
False
Soft consensus combines multiple partitionings of a set of objects into a single consolidated clustering.
False. Each instance in a soft ensemble is represented by a concatenation of r posterior membership probability distributions obtained from the constituent clustering algorithms. We can define a distance measure between two instances using the Kullback-Leibler (KL) divergence, which calculates the "distance" between two probability distributions.
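A minimal KL-divergence sketch in plain numpy (the eps smoothing is an illustrative convenience, not from the card):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete membership distributions.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

# e.g. kl_divergence([0.9, 0.1], [0.6, 0.4]) > 0; it is 0 only when p == q.
```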
The EM algorithm finds the globally maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. (T/F)
False The EM algorithm is used to find *locally* maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly.
Transition Matrix - Stochastic Matrix
Rows must sum to 1.
Bayes Rule
Suppose we have two sentences, H (the hypothesis) and E (the evidence). Then Pr(H|E) = Pr(E|H) * Pr(H) / Pr(E). Example: P(Cavity) = 0.1, P(Toothache) = 0.05, P(Cavity | Toothache) = 0.8. Bayes' rule gives: P(Toothache | Cavity) = P(Cavity | Toothache) * P(Toothache) / P(Cavity) = (0.8 x 0.05) / 0.1 = 0.4.
Divisive Clustering
(partitional, top-down) methods separate all examples immediately into clusters.
What number am I thinking of?
42
Markov Models
Assume dependence on more recent observations. A Markov system has N states, and there are discrete time steps. At a given time step, the system is in a given state. Between timesteps, the next state is chosen randomly based on a probability distribution at the current state.
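A small simulation of such a system, assuming numpy; the 3-state transition matrix is made up for illustration:

```python
import numpy as np

# Each row is Pr(next state | current state); rows of a stochastic matrix sum to 1.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
assert np.allclose(P.sum(axis=1), 1.0)

rng = np.random.default_rng(0)
state = 0
for t in range(5):
    # The next state depends only on the current state (Markov property).
    state = rng.choice(3, p=P[state])
    print(t + 1, state)
```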
Given a dataset with 10 features, kernel PCA can at most compute 10 principal components.
False
In a two-class problem, g_i(x_o) and g_j(x_o) are computed and observation o is assigned to class i if g_i(x_o) < g_j(x_o).
False
In order to improve performance, support vector machines identify the observations that are the most challenging to predict and give them more weight in the classifier.
False
Like LDA, kernel PCA relies on the variance between and within classes to construct new components.
False
Like PCA, kernel PCA is able to capture nonlinear structure in the dataset.
False
Decision Tree Pruning Method
For a tree T, the misclassification rate and the mean-squared error rate depend on P, but not on D. The goal is to do well on records randomly drawn from P, not to do well on the records in D. If the tree is too large, it overfits D and does not model P. The pruning method selects the tree of the right size.
Bayes Rule - example #2: Medical Diagnosis. Suppose we know from statistical data that flu causes fever in 80% of cases, that approximately 1 in 10,000 people have flu at a given time, and that approximately 1 out of every 50 people is suffering from fever: Pr(fever | flu) = 0.8, Pr(flu) = 0.0001, Pr(fever) = 0.02.
Given a patient with fever, does she have flu? Answer by applying Bayes' rule: Pr(flu | fever) = [ Pr(fever | flu) * Pr(flu) ] / Pr(fever) = 0.8 x 0.0001 / 0.02 = 0.004
Which of the following ensemble techniques works well when the errors of the classifiers are strongly correlated? Bagging Boosting
Neither
Conditional Probabilities
P(A | B) = the conditional (posterior) probability of A given B
P(A | B) = P(A, B) / P(B)
P(A ^ B) = P(A, B) = P(A | B) * P(B)
P(A ^ B) = P(A, B) = P(A) * P(B), if A and B are independent
We say that A is independent of B if P(A | B) = P(A).
A and B are independent given C if: P(A | B, C) = P(A | C) and P(A ^ B | C) = P(A | C) * P(B | C).
AdaBoost is susceptible to outliers
TRUE
Markov Property
The state of the system at time t+1 depends solely on the state of the system at time t.
Agglomerative Clustering
(bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
Which of the following are considerations when performing k-means clustering? 1. Size of k 2. Similarity function 3. Seed selection 4. Linking method
1. Size of k 2. Similarity function 3. Seed selection Note: k-means repeatedly reallocates points to their nearest centroid (minimizing the within-cluster variance), so no linking method is involved.
The buckshot algorithm integrates kmeans clustering by... 1. Using the results of the hierarchical clustering to validate the kmeans clusters 2. Using the results of the hierarchical clustering to seed the kmeans clusters 3. Using the results of the kmeans clustering to validate the hierarchical clusters
2. Using the results of the hierarchical clustering to seed the kmeans clusters
Which of the following is a fundamental difference between bagging and boosting? 1. Bagging is used for supervised learning. Boosting is used for unsupervised clustering. 2. Bagging gives varying weights to training instances. Boosting gives equal weight to all training instances. 3. Bagging does not take the performance of previously built models into account when building a new model. With boosting, each new model is built on the results of previous models. 4. Boosting is used for supervised learning. Bagging is used with unsupervised clustering.
3. Bagging does not take the performance of previously built models into account when building a new model. With boosting, each new model is built on the results of previous models.
Which of the following is correct with respect to random forest compared to decision trees? 1. Random forests are more difficult to interpret but often less accurate. 2. Random forests are easier to interpret but often more accurate. 3. Random forests are more difficult to interpret but often more accurate. 4. None of the above.
3. Random forests are more difficult to interpret but often more accurate.
Decision Trees
A decision tree T encodes d (a classifier or regression function) in form of a tree. A node t in T without children is called a leaf node. Otherwise t is called an internal node.
Gini index
A measure of impurity (based on the relative frequencies of classes in a set of instances). The attribute that yields the smallest Gini index (or the largest reduction in impurity due to the split) is chosen to split the node. Possible problems: biased towards multivalued attributes (similar to information gain); has difficulty when the number of classes is large.
The 0.632 bootstrap
In each draw, a particular instance has a probability of 1 - 1/n of not being picked. Over n draws with replacement, the probability of never being picked is (1 - 1/n)^n, which approaches e^-1 = 0.368, so each bootstrap sample contains about 63.2% of the unique instances.
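A quick numerical check of both the per-draw formula and the resulting 63.2% figure (numpy assumed):

```python
import numpy as np

n = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)       # draw n records with replacement
print(len(np.unique(sample)) / n)         # fraction of unique records: ~0.632
print(1 - (1 - 1 / n) ** n)               # closed form: ~1 - e**-1 = 0.632
```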
An ensemble clustering approach based on kmeans may include... Mark all that apply. 1. Several iterations of kmeans using different values of k. 2. Several iterations of kmeans using different seeds for the initial clusters. 3. Several iterations of kmeans using different samples of the data. 4. Several iterations of kmeans using different features from the data.
All of the above
Hierarchical Agglomerative Clustering
Assumes a similarity function for determining the similarity of two instances. Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. The history of merging forms a binary tree or hierarchy.
As you increase the number of classifiers in an ensemble, the classification error decreases until the error approaches zero. (T/F)
False
Clustering n objects with m features through the EM algorithm, produces an ensemble of n x m models, the results of which are then aggregated to produce a final assignment of objects to clusters. (T/F)
False
Components computed from kernel PCA should not be used in logistic regression because logistic regression already computes higher order spaces.
False
Ensemble clustering can provide data for visualization tools to inspect cluster membership and boundaries, but not the number of clusters.
False
Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model; usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
How does one build ensembles?
How to build ensembles:
1. Heterogeneous ensembles (same training data, different learning algorithms)
2. Manipulate the training data (same learning algorithm, different training data): bagging & boosting
3. Manipulate the input features (use different subsets of the attribute set)
4. Manipulate the output targets (same data, same algorithm; convert multiclass problems into many two-class problems)
5. Inject randomness into the learning algorithms
Cluster Similarity
How to compute the similarity of two clusters, each possibly containing multiple instances?
- Single link: similarity of the two most similar members.
- Complete link: similarity of the two least similar members.
- Group average: average similarity between members.
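The three linking methods as one function over a pairwise similarity matrix (a numpy sketch; names are illustrative):

```python
import numpy as np

def cluster_similarity(sim, A, B, method="average"):
    # sim is an n x n pairwise similarity matrix; A, B are lists of member indices.
    pair = sim[np.ix_(A, B)]          # similarities between members of A and B
    if method == "single":            # two MOST similar members
        return pair.max()
    if method == "complete":          # two LEAST similar members
        return pair.min()
    return pair.mean()                # group average
```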
The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p,n):
I(p,n) = -Pr(P) log2 Pr(P) - Pr(N) log2 Pr(N)
where Pr(P) = p / (p+n) and Pr(N) = n / (p+n). Note also that log2(x) = ln(x) / ln(2), so equivalently I(p,n) = -Pr(P) ln(Pr(P))/ln(2) - Pr(N) ln(Pr(N))/ln(2).
Example: I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94, where log2(9/14) = ln(9/14)/ln(2), etc. Whichever attribute has the largest information gain is selected as the root.
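The entropy formula and the worked value as code (plain Python; the function name I mirrors the card's notation):

```python
from math import log2

def I(p, n):
    # Entropy of a set with p positive and n negative instances.
    pp, pn = p / (p + n), n / (p + n)
    return -pp * log2(pp) - pn * log2(pn)

print(round(I(9, 5), 2))   # 0.94

# Gain(A) = I(p, n) - E(A), where E(A) is the expected entropy after
# splitting on attribute A, weighted by branch sizes.
```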
The Knowledge Discovery Process
1. Identify the business (or other) problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes
AdaBoost
- Initially, set uniform weights on all the records.
- At each round, create a bootstrap sample based on the weights.
- Train a classifier on the sample and apply it to the original training set.
- Records that are wrongly classified have their weights increased; records that are classified correctly have their weights decreased.
- If the error rate is higher than 50%, start over.
- The final prediction is a weighted average of all the classifiers, with weights representing each classifier's training accuracy.
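A compact sketch of this loop. Note one deliberate swap: instead of drawing a bootstrap sample from the weights, it passes the weights directly to the learner via scikit-learn's sample_weight, a common AdaBoost.M1 formulation; labels are assumed to be in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=20):
    n = len(X)
    w = np.full(n, 1 / n)                      # uniform initial weights
    ensemble = []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # weighting instead of resampling
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5:                         # worse than chance: start over
            w = np.full(n, 1 / n)
            continue
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight hits
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    # Weighted vote: each hypothesis counts in proportion to its accuracy.
    return np.sign(sum(a * s.predict(X) for a, s in ensemble))
```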
Distance and Similarity Measures
Many of today's real-world applications rely on the computation of similarities or distances among objects: personalization, recommender systems, document categorization, information retrieval, target marketing.
Boosting
Not easily parallelized. Originally developed by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990). Revised into a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996). Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong. General loop: set all examples to have equal uniform weights; for t from 1 to T, learn a hypothesis ht from the weighted examples, then decrease the weights of the examples ht classifies correctly. The base (weak) learner must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting. During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data.
Basic Axioms of Probability
P(True) = 1 and P(False) = 0
P(A ^ B) = P(A) * P(B | A)
P(-A) = 1 - P(A)
If A ≡ B, then P(A) = P(B)
P(A \/ B) = P(A) + P(B) - P(A ^ B)
Suppose a blood test is 90% effective in detecting a disease. It also falsely diagnoses that a healthy person has the disease 3% of the time. If 10% of those tested have the disease, what is the probability that a person who tests positive will actually have the disease? (i.e., find P(disease | positive)) P(disease) = 0.10, P(-disease) = 0.90, P(positive | disease) = 0.90, P(positive | -disease) = 0.03.
P(disease | positive) = P(positive | disease) * P(disease) / [ P(positive | disease) * P(disease) + P(positive | -disease) * P(-disease) ] = (0.90)(0.10) / ( (0.90)(0.10) + (0.03)(0.90) ) = 0.77
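The same computation as a small function (plain Python; the name is illustrative):

```python
def posterior(p_pos_given_d, p_d, p_pos_given_not_d):
    # Bayes' rule with the evidence expanded by total probability.
    num = p_pos_given_d * p_d
    return num / (num + p_pos_given_not_d * (1 - p_d))

print(round(posterior(0.90, 0.10, 0.03), 2))   # 0.77
```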
Given three classifiers, each with a probability of error of 0.2 combined by simple voting, what is the probability of error of the ensemble?
P(err) = P(exactly 2 wrong) + P(all 3 wrong). P(exactly 2 wrong) = 3 (the number of ways to choose which two classifiers are wrong) times the probability that those two are wrong times the probability that the remaining classifier is correct. So P(err) = 3 * (0.2^2) * 0.8 + 0.2^3 = 0.096 + 0.008 = 0.104.
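A general version of this computation for any odd number of voters (plain Python; even ensembles would additionally need a tie-breaking rule). It reproduces 0.104 and shows why voting cannot help when each classifier's error is 0.5:

```python
from math import comb

def ensemble_error(k, p):
    # Probability that a strict majority of k independent classifiers,
    # each with error p, is wrong (k assumed odd).
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

print(round(ensemble_error(3, 0.2), 3))    # 0.104
print(round(ensemble_error(101, 0.5), 3))  # 0.5 -- voting never helps at p = 0.5
```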
More measures
Percentage of retrieved documents that are relevant: precision = TP / (TP + FP)
Percentage of relevant documents that are returned: recall = TP / (TP + FN)
Examples of Classification Task
Predicting tumor cells as benign or malignant Classifying credit card transactions as legitimate or fraudulent Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil Categorizing news stories as finance, weather, entertainment, sports, etc
Recall
Proportion of true positives to all actual positive elements (true positives and false negatives). Recall = # True Positives / (# True Positives + # False Negatives)
Precision
Proportion of true positives to all predicted positives. Precision = # True Positives / (# True Positives + # False Positives)
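Both measures from raw confusion counts (plain Python, with made-up example numbers):

```python
def precision(tp, fp):
    return tp / (tp + fp)   # fraction of predicted positives that are correct

def recall(tp, fn):
    return tp / (tp + fn)   # fraction of actual positives that are found

# e.g. tp=8, fp=2, fn=4  ->  precision 0.8, recall ~0.67
print(precision(8, 2), round(recall(8, 4), 2))
```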
Transition Matrix - Doubly Stochastic Matrix
Rows and columns each sum to 1.
Traffic Light Example: Pr(green) = 0.45, Pr(red) = 0.45, Pr(yellow) = 0.1. We know that the police are perfect enforcers (i.e., we get a ticket if and only if the light is red when we enter the intersection). Now we enter the intersection without getting a ticket; what are the probabilities that the light was green, red, or yellow?
Since we got no ticket, we know that the light could not have been red; in other words, Pr(red | no-ticket) = 0. We also have Pr(no-ticket | green) = Pr(no-ticket | yellow) = 1, and Pr(no-ticket) = Pr(green) + Pr(yellow) = 0.55. Using Bayes' rule: Pr(yellow | no-ticket) = Pr(no-ticket | yellow) * Pr(yellow) / Pr(no-ticket) = 0.1 / 0.55 = 2/11. Similarly, Pr(green | no-ticket) = 0.45 / 0.55 = 9/11.
Stacking
Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any ensemble technique, although in practice, a single-layer logistic regression model is often used as the combiner. Stacking typically yields performance better than any single one of the trained models. It has been successfully used on both supervised learning tasks (regression, classification and distance learning) and unsupervised learning (density estimation). It has also been used to estimate bagging's error rate. It has been reported to out-perform Bayesian model-averaging. The two top-performers in the Netflix competition used blending, which may be considered to be a form of stacking.
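A minimal stacking sketch, assuming scikit-learn's StackingClassifier with the single-layer logistic-regression combiner the card mentions (the dataset is synthetic, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression())   # combiner trained on base predictions

# Hold out the last 100 points to check generalization of the stacked model.
print(stack.fit(X[:400], y[:400]).score(X[400:], y[400:]))
```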
Classifier ensembles usually have better performance than stand-alone classifiers because they combine the different points of view of the individual classifiers.
TRUE
For an ensemble to work the following conditions must be true:
The errors of the classifiers must not be strongly correlated (think about the exam example: if everyone knows by heart exactly the same chapters, will it help to solve the test in groups?). The error of each individual classifier making up the ensemble needs to be less than 0.5 (at least better than chance).
Data mining
The exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns.
How to build ensembles: the two dominant approaches belong to category 2, manipulating the training data (same learning algorithm, different training data).
They are: bagging and boosting
Like PCA, kernel PCA relies on Eigen decomposition in order to extract new features.
True
Like kernel support vector machines, kernel PCA may require tuning of the kernel parameters.
True
The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. (T/F)
True In statistics, an expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
Non-hierarchical Clustering
Typically must provide the number of desired clusters, k. Randomly choose k instances as seeds, one per cluster. Form initial clusters based on these seeds. Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when clustering converges or after a fixed number of iterations.
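A numpy-only sketch of that loop (illustrative; it assumes no cluster empties out during reallocation, which a production version would have to handle):

```python
import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    seeds = X[rng.choice(len(X), k, replace=False)]   # k random instances as seeds
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - seeds[None], axis=2)
        labels = d.argmin(axis=1)                     # reallocate to nearest seed
        new = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, seeds):                   # stop when clustering converges
            break
        seeds = new
    return labels, seeds
```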
Choosing the best feature
Use information gain to find the "best" (most discriminating) feature. Assume there are two classes, P and N (e.g., P = "yes" and N = "no"), and let the set of instances S (the training data) contain p elements of class P and n elements of class N. The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined in terms of entropy: I(p,n) = -Pr(P) log2 Pr(P) - Pr(N) log2 Pr(N), where Pr(P) = p / (p+n) and Pr(N) = n / (p+n). In other words, the entropy of a set of instances S is a function of the probability distribution of classes among the instances in S.
ROC curves ("receiver operating characteristic")
Used in signal detection to show the trade-off between hit rate and false alarm rate over a noisy channel. The y axis shows the percentage of true positives in the sample; the x axis shows the percentage of false positives. In general, an ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection; the false positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out. An ROC curve demonstrates several things:
- It shows the trade-off between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
- The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test; the closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
- The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.
- The area under the curve is a measure of accuracy.
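A short sketch of computing an ROC curve and its area, assuming scikit-learn's roc_curve and roc_auc_score (synthetic data, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

X, y = make_classification(n_samples=500, random_state=0)
scores = LogisticRegression().fit(X[:400], y[:400]).predict_proba(X[400:])[:, 1]

fpr, tpr, thresholds = roc_curve(y[400:], scores)   # TPR vs FPR per threshold
print(roc_auc_score(y[400:], scores))               # area under the curve
```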