CS529 Final Exam

Given 9356 classifiers, each with a probability of error of 0.5, combined by simple voting, what is the probability of error of the ensemble?

0.5

Buckshot Algorithm

Combines HAC and K-Means clustering. First, randomly take a sample of instances of size sqrt(n). Run group-average HAC on this sample, which takes only O(n) time. Use the results of HAC as initial seeds for K-means. The overall algorithm is O(n) and avoids problems of bad seed selection.

Data preprocessing

Data selection: identify target datasets and relevant fields. Data transformation: data cleaning; combine related data sources; create common units; generate new fields; sampling.

Motivations of ensemble methods

Ensemble models improve accuracy and robustness over single-model methods. Applications: distributed computing; privacy-preserving applications; large-scale data with reusable models; multiple sources of data. Efficiency: a complex problem can be decomposed into multiple sub-problems that are easier to understand and solve (divide-and-conquer approach).

Lift Chart

Example: a lift chart shows how many more survey respondents we are likely to reach by using a predictive model than by contacting a random sample of customers. For example, by contacting only 10% of customers based on the predictive model, we reach 3 times as many respondents as with no model. Y axis: absolute number (probability of success using the model at the given sample size). X axis: sample size.

In practice, different types of classification errors often incur different costs. Give examples of classification errors.

Examples: terrorist profiling (predicting "not a terrorist" is correct 99.99% of the time); loan decisions; oil-slick detection; fault diagnosis; promotional mailing.

Boosting is an ensemble method that uses sampling with replacement of a dataset and learns independently several classifiers.

FALSE

Ensemble methods can be used for classification but not prediction.

FALSE

Ensemble methods can be used for supervised learning but not unsupervised learning.

FALSE

For methods like boosting or bagging to work well it is necessary that the sub-samples from the dataset used in each classifier are as similar as possible.

FALSE

Random decision trees are computationally expensive to build. (T/F)

FALSE

Random subspaces (like random forest) are meta-learning methods that learn a set of classifiers using datasets that have subsets of attributes of the original data.

FALSE - Random Forest is a Random Subspace. Random Subspace is another term for Attribute Bagging or Feature Bagging. An ensemble of models employing the random subspace method can be constructed using the following algorithm: Let the number of training points be N and the number of features in the training data be D. Choose L to be the number of individual models in the ensemble. For each individual model l, choose dl (dl < D) to be the number of input variables for l. It is common to have only one value of dl for all the individual models. For each individual model l, create a training set by choosing dl features from D without replacement and train the model. Now, to apply the ensemble model to an unseen point, combine the outputs of the L individual models by majority voting or by combining the posterior probabilities.

Information Gain

Gain(A) = I(p,n) - E(A)

Stationary Assumption

Transition probabilities are independent of time.

A hypergraph is a graph with at least one hyperedge.

True

Hard svm attempts to completely separate positive and negative observations. It is therefore prone to overfitting.

True

In softsvm, each point is assigned an alpha, controlling its participation as a support vector.

True

Kernel PCA can be used to denoise a dataset.

True

Kernel tricks may transform an n-dimensional space into a higher order space to make the problem linearly separable.

True

One motivation of ensemble clustering is that an obvious distance measure may not exist and we must induce it, which is not always easy, especially in multidimensional spaces.

True

Support vector machines attempt to maximize the margin between classes.

True

The hypergraph partitioning algorithm can be used to aggregate the output of several clustering algorithms. In such an approach, a hyperedge connects objects put into the same cluster by a clustering algorithm and hyperedges are then cut to induce the final partitions.

True

The latent variables computed by the EM algorithm can be conceptualized by assuming the existence of additional unobserved features. (T/F)

True

When performing ensemble clustering, the results of several clusters can be combined by using the percentage of clusterings that assign two objects into different clusters as a new distance function. This is an object based approach.

True

Random forests are an example of: 1. Bagging 2. Boosting

Bagging

Which of the following ensemble techniques can be easily parallelized? 1. Bagging 2. Boosting

Bagging only.

Which of the following are considerations when performing hierarchical agglomerative clustering? Mark all that apply. 1. Size of k 2. Similarity function 3. Seed selection 4. Linking method

Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC.

Different Types of Classifiers

Decision Trees Simple Bayesian models Nearest neighbor methods Logistic regression Neural networks Linear discriminant analysis (LDA) Quadratic discriminant analysis (QDA) Density estimation methods

Bagging

Easily parallelized. Create ensembles by repeatedly randomly resampling the training data. Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement. Each bootstrap sample will on average contain 63.2% of the unique training examples; the rest are replicates. Combine the m resulting models using a simple majority vote. Decreases error by decreasing the variance in the results due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is slightly changed. Bootstrap: sampling with replacement; each sample contains around 63.2% of the original records. Bootstrap aggregation: train a classifier on each bootstrap sample, then use majority voting to determine the class label of the ensemble classifier.

A clustering ensemble can accept a hard partitioning, but not a soft partitioning from its base models. (T/F)

False

Like decision trees, support vector machines can output a class but not a probability.

False

Like kernel support vector machines, kernel PCA transforms the space into a higher dimension in order to discover a linear separation between two classes.

False

Since kernel PCA produces a higher order space, it is not suitable for visualizing the results of a cluster analysis.

False

Support vector machines can be extended to a multiclass problem by increasing the number of kernel tricks.

False

Support vector machines cannot model curved decision boundaries.

False

Tuning the parameters of SVM kernels usually requires specialized domain knowledge.

False

When performing ensemble clustering, clusters can be represented over the assignment space. Then the clusters themselves can be clustered. Each object is assigned to its most associated consensus cluster. This is an object-cluster model.

False

Soft consensus combines multiple partitionings of a set of objects into a single consolidated clustering.

False. Each instance in a soft ensemble is represented by a concatenation of r posterior membership probability distributions obtained from the constituent clustering algorithms. We can define a distance measure between two instances using the Kullback-Leibler (KL) divergence, which calculates the "distance" between two probability distributions.

The EM algorithm finds the globally maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. (T/F)

False. The EM algorithm is used to find *locally* maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly.

Transition Matrix - Stochastic Matrix

Rows must sum to 1.

Bayes Rule

Suppose we have two sentences, H (the hypothesis) and E (the evidence). Pr(H | E) = Pr(E | H) * Pr(H) / Pr(E). Example: P(Cavity) = 0.1, P(Toothache) = 0.05, P(Cavity | Toothache) = 0.8. Bayes' rule gives: P(Toothache | Cavity) = (0.8 * 0.05) / 0.1 = 0.4
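
The cavity arithmetic can be reproduced with a few lines of Python. This is a minimal sketch; the function name `bayes_posterior` and its parameter names are illustrative choices, not part of the course material.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return Pr(H | E) = Pr(E | H) * Pr(H) / Pr(E)."""
    if evidence <= 0:
        raise ValueError("Pr(E) must be positive")
    return likelihood * prior / evidence


# Here H = Toothache and E = Cavity, so the likelihood is P(Cavity | Toothache).
p_toothache_given_cavity = bayes_posterior(likelihood=0.8, prior=0.05, evidence=0.1)
print(round(p_toothache_given_cavity, 3))  # 0.4
```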

Divisive Clustering

(partitional, top-down) separate all examples immediately into clusters.

What number am I thinking of?

42

Markov Models

Assume dependence on more recent observations. A Markov system has N states, and there are discrete time steps. At a given time step, the system is in a given state. Between timesteps, the next state is chosen randomly based on a probability distribution at the current state.
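
A small simulation may make the definition concrete. The sketch below assumes a hypothetical 3-state chain (the matrix values are invented for illustration); each row holds the probability distribution over next states for one current state, so every row must sum to 1.

```python
import random

# Hypothetical transition matrix: TRANSITIONS[i][j] = Pr(next = j | current = i).
TRANSITIONS = [
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.0, 0.5, 0.5],
]


def is_row_stochastic(matrix, tol=1e-9):
    """Check that every row is a valid probability distribution."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def step_distribution(dist, matrix):
    """Advance a distribution over states by one time step."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def simulate(matrix, start, steps, rng):
    """Sample a path; the next state depends only on the current state."""
    state, path = start, [start]
    for _ in range(steps):
        state = rng.choices(range(len(matrix)), weights=matrix[state])[0]
        path.append(state)
    return path


path = simulate(TRANSITIONS, start=0, steps=10, rng=random.Random(0))
after_one_step = step_distribution([1.0, 0.0, 0.0], TRANSITIONS)
```

Because the same matrix is used at every step, this sketch also satisfies the stationary assumption.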

Given a dataset with 10 features, kernel PCA can at most compute 10 principal components.

False

In a two-class problem, g_i(x_o) and g_j(x_o) are computed and observation o is assigned to class i if g_i(x_o) < g_j(x_o).

False

In order to improve performance, support vector machines identify the observations that are the most challenging to predict and give them more weight in the classifier.

False

Like LDA, kernel PCA relies on the variance between and within classes to construct new components.

False

Like PCA, kernel PCA is able to capture nonlinear structure in the dataset.

False

Decision Tree Pruning Method

For a tree T, the misclassification rate and the mean-squared error rate depend on P, but not on D. The goal is to do well on records randomly drawn from P, not to do well on the records in D If the tree is too large, it overfits D and does not model P. The pruning method selects the tree of the right size.

Bayes Rule - example #2: Medical Diagnosis. Suppose we know from statistical data that flu causes fever in 80% of cases, approximately 1 in 10,000 people have flu at a given time, and approximately 1 out of every 50 people is suffering from fever: Pr(fever | flu) = 0.8, Pr(flu) = 0.0001, Pr(fever) = 0.02

Given a patient with fever, does she have flu? Answer by applying Bayes' rule: Pr(flu | fever) = [ Pr(fever | flu) * Pr(flu) ] / Pr(fever) = 0.8 x 0.0001 / 0.02 = 0.004

Which of the following ensemble techniques works well when the errors of the classifiers are strongly correlated? 1. Bagging 2. Boosting

Neither

Conditional Probabilities

P(A | B) = the conditional (posterior) probability of A given B
P(A | B) = P(A, B) / P(B)
P(A ^ B) = P(A, B) = P(A | B) * P(B)
P(A ^ B) = P(A, B) = P(A) * P(B), if A and B are independent
We say that A is independent of B if P(A | B) = P(A)
A and B are independent given C if P(A | B, C) = P(A | C), in which case P(A ^ B | C) = P(A | C) * P(B | C)

AdaBoost is susceptible to outliers

TRUE

Markov Property

The state of the system at time t+1 depends solely on the state of the system at time t.

Agglomerative Clustering

(bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.

Which of the following are considerations when performing k-means clustering? 1. Size of k 2. Similarity function 3. Seed selection 4. Linking method

1. Size of k 2. Similarity function 3. Seed selection Note: KMeans minimizes the within-class variance, so there is no distance measure considered.

The buckshot algorithm integrates kmeans clustering by... 1. Using the results of the hierarchical clustering to validate the kmeans clusters 2. Using the results of the hierarchical clustering to seed the kmeans clusters 3. Using the results of the kmeans clustering to validate the hierarchical clusters

2. Using the results of the hierarchical clustering to seed the kmeans clusters

Which of the following is a fundamental difference between bagging and boosting? 1. Bagging is used for supervised learning. Boosting is used for unsupervised clustering. 2. Bagging gives varying weights to training instances. Boosting gives equal weight to all training instances. 3. Bagging does not take the performance of previously built models into account when building a new model. With boosting, each new model is built on the results of previous models. 4. Boosting is used for supervised learning. Bagging is used with unsupervised clustering.

3. Bagging does not take the performance of previously built models into account when building a new model. With boosting, each new model is built on the results of previous models.

Which of the following is correct with respect to random forest compared to decision trees? 1. Random forests are more difficult to interpret but often less accurate. 2. Random forests are easier to interpret but often more accurate. 3. Random forests are more difficult to interpret but often more accurate. 4. None of the above.

3. Random forests are more difficult to interpret but often more accurate.

Decision Trees

A decision tree T encodes d (a classifier or regression function) in the form of a tree. A node t in T without children is called a leaf node. Otherwise, t is called an internal node.

Gini index

A measure of impurity (based on relative frequencies of classes in a set of instances). The attribute that provides the smallest Gini index (or the largest reduction in impurity due to the split) is chosen to split the node. Possible problems: biased towards multivalued attributes (similar to information gain); has difficulty when the number of classes is large.
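
The card does not spell out the formula, so the sketch below assumes the standard definition, Gini = 1 - sum of squared class frequencies, and scores a split by the size-weighted average of its partitions. The function names and the toy labels are illustrative only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels (0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_of_split(partitions):
    """Size-weighted Gini impurity of the partitions produced by a split."""
    total = sum(len(part) for part in partitions)
    return sum(len(part) / total * gini(part) for part in partitions)


# Hypothetical node with 4 "yes" and 4 "no" records and two candidate splits.
pure_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]
```

Under this definition the attribute producing `pure_split` would be preferred, since its weighted Gini index is the smaller of the two.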

The 0.632 bootstrap

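A particular instance has a probability of 1 - 1/n of not being picked in a single draw. Over the n draws of a bootstrap sample, the probability that it is never picked is (1 - 1/n)^n, which approaches e^-1 = 0.368 for large n, so about 63.2% of the instances appear in each sample.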
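
The 0.632 figure can be checked numerically. The sketch below is illustrative only (the helper names and the dataset of 1000 integers are assumptions): it compares 1 - (1 - 1/n)^n with the fraction of distinct records in one simulated bootstrap sample.

```python
import math
import random


def expected_unique_fraction(n):
    """Expected share of distinct records in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def bootstrap_sample(data, rng):
    """Draw len(data) records from data with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(0)
data = list(range(1000))
sample = bootstrap_sample(data, rng)
observed = len(set(sample)) / len(data)

print(round(expected_unique_fraction(len(data)), 3))  # 0.632
print(round(1.0 - math.exp(-1), 3))                   # 0.632
```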

An ensemble clustering approach based on kmeans may include... Mark all that apply. 1. Several iterations of kmeans using different values of k. 2. Several iterations of kmeans using different seeds for the initial clusters. 3. Several iterations of kmeans using different samples of the data. 4. Several iterations of kmeans using different features from the data.

All of the above

Hierarchical Agglomerative Clustering

Assumes a similarity function for determining the similarity of two instances. Starts with all instances in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster. The history of merging forms a binary tree or hierarchy.
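
The merge loop can be sketched in a few lines. This is a naive, cubic-time illustration, not an efficient implementation; it assumes points are tuples of floats and uses Euclidean distance, so "most similar" is taken to mean "smallest distance". The names `hac` and `closest_pair_distance` are made up for this example.

```python
import math


def closest_pair_distance(a, b):
    """Cluster distance taken as the distance between the two closest members."""
    return min(math.dist(p, q) for p in a for q in b)


def hac(points, linkage=closest_pair_distance):
    """Naive HAC: return the merge history as (cluster_a, cluster_b) pairs."""
    clusters = [[p] for p in points]  # start with every instance on its own
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters that is currently most similar.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


points = [(0.0,), (0.1,), (5.0,), (5.2,), (9.0,)]
history = hac(points)
```

Each entry in `history` corresponds to one internal node of the binary tree, so n points always produce n - 1 merges.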

As you increase the number of classifiers in an ensemble, the classification error decreases until the error approaches zero. (T/F)

False

Clustering n objects with m features through the EM algorithm produces an ensemble of n x m models, the results of which are then aggregated to produce a final assignment of objects to clusters. (T/F)

False

Components computed from kernel PCA should not be used in logistic regression because logistic regression already computes higher order spaces.

False

Ensemble clustering can provide data for visualization tools to inspect cluster membership and boundaries, but not number.

False

Classification: Definition

Given a collection of records (training set). Each record contains a set of attributes; one of the attributes is the class. Find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

How does one build ensembles?

How to build ensembles: 1. Heterogeneous ensembles (same training data, different learning algorithms) 2. Manipulate training data (same learning algorithm, different training data) - bagging & boosting 3. Manipulate input features (use different subsets of the attribute sets) 4. Manipulate output targets (same data, same algorithm, convert multiclass problems into many two-class problems) 5. Inject randomness to learning algorithms.

Cluster Similarity

How to compute similarity of two clusters each possibly containing multiple instances? Single Link: Similarity of two most similar members. Complete Link: Similarity of two least similar members. Group Average: Average similarity between members.
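
The three linking methods differ only in how they aggregate the pairwise similarities. The sketch below assumes a hypothetical similarity for numbers, 1 / (1 + |x - y|), and takes the group average over cross-cluster pairs only; both choices are assumptions made for illustration.

```python
from itertools import product
from statistics import mean


def similarity(x, y):
    """Hypothetical similarity: 1 when equal, approaching 0 when far apart."""
    return 1.0 / (1.0 + abs(x - y))


def single_link(a, b):
    """Similarity of the two most similar members."""
    return max(similarity(x, y) for x, y in product(a, b))


def complete_link(a, b):
    """Similarity of the two least similar members."""
    return min(similarity(x, y) for x, y in product(a, b))


def group_average(a, b):
    """Average similarity over all cross-cluster pairs."""
    return mean([similarity(x, y) for x, y in product(a, b)])


a, b = [0.0, 1.0], [2.0, 4.0]
print(single_link(a, b))    # 0.5
print(complete_link(a, b))  # 0.2
```

For any pair of clusters the group average lies between the complete-link and single-link values.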

The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p,n): I(p,n) = -Pr(P) * log2(Pr(P)) - Pr(N) * log2(Pr(N)), or equivalently I(p,n) = -Pr(P) * ln(Pr(P)) / ln(2) - Pr(N) * ln(Pr(N)) / ln(2). Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n). Note also: log2(x) = ln(x) / ln(2)

I(9,5) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.94, where log2(9/14) = ln(9/14) / ln(2), etc. Whichever attribute has the largest information gain will be selected as root.
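
The I(9,5) value can be reproduced in code. In the sketch below, `entropy` follows the I(p,n) definition directly; `information_gain` additionally assumes the usual definition of E(A) as the size-weighted entropy of the subsets produced by attribute A, which these cards do not state. The three-way split is a hypothetical example.

```python
import math


def entropy(p, n):
    """I(p, n) in bits for p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:  # treat 0 * log2(0) as 0
            prob = count / total
            result -= prob * math.log2(prob)
    return result


def information_gain(p, n, partitions):
    """Gain(A) = I(p, n) - E(A); partitions lists (p_i, n_i) per value of A."""
    expected = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in partitions)
    return entropy(p, n) - expected


print(round(entropy(9, 5), 3))  # 0.94

# Hypothetical attribute splitting the 14 examples into three subsets.
gain = information_gain(9, 5, [(2, 3), (4, 0), (3, 2)])
```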

The Knowledge Discovery Process

Identify the business (or other) problem; data mining; action; evaluation and measurement; deployment and integration into business processes.

AdaBoost

1. Initially, set uniform weights on all the records.
2. At each round, create a bootstrap sample based on the weights.
3. Train a classifier on the sample and apply it to the original training set.
4. Records that are wrongly classified have their weights increased.
5. Records that are classified correctly have their weights decreased.
6. If the error rate is higher than 50%, start over.
7. The final prediction is a weighted average of all the classifiers, with each weight representing the training accuracy.
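
The reweighting step can be sketched as follows. The card does not give the exact update, so this assumes the standard AdaBoost formulas (classifier weight alpha = 0.5 * ln((1 - error) / error), record weights scaled by exp(+alpha) or exp(-alpha) and then renormalized); the function name and the four-record example are illustrative.

```python
import math


def adaboost_round(weights, correct):
    """One reweighting round.

    weights: current record weights (summing to 1).
    correct: booleans, True where this round's classifier was right.
    Returns (new_weights, alpha), or None when the weighted error
    exceeds 0.5 (the "start over" case).
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error > 0.5:
        return None
    error = max(error, 1e-12)  # avoid division by zero for a perfect classifier
    alpha = 0.5 * math.log((1.0 - error) / error)
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], alpha


weights = [0.25, 0.25, 0.25, 0.25]
new_weights, alpha = adaboost_round(weights, [True, True, True, False])
```

With this update the misclassified record ends up carrying half of the total weight, which forces the next classifier to pay attention to it.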

Distance and Similarity Measures

Many of today's real-world applications rely on the computation of similarities or distances among objects: personalization, recommender systems, document categorization, information retrieval, target marketing.

Boosting

Not easily parallelized. Originally developed by computational learning theorists to guarantee performance improvements on fitting training data for a weak learner that only needs to generate a hypothesis with a training accuracy greater than 0.5 (Schapire, 1990). Revised to be a practical algorithm, AdaBoost, for building ensembles that empirically improves generalization performance (Freund & Schapire, 1996). Examples are given weights. At each iteration, a new hypothesis is learned and the examples are reweighted to focus the system on examples that the most recently learned classifier got wrong.
General loop:
Set all examples to have equal uniform weights.
For t from 1 to T do:
  Learn a hypothesis, h_t, from the weighted examples.
  Decrease the weights of examples h_t classifies correctly.
The base (weak) learner must focus on correctly classifying the most highly weighted examples while strongly avoiding over-fitting. During testing, each of the T hypotheses gets a weighted vote proportional to its accuracy on the training data.

Basic Axioms of Probability

P(True) = 1 and P(False) = 0
P(A ^ B) = P(A) * P(B | A)
P(-A) = 1 - P(A)
If A ≡ B, then P(A) = P(B)
P(A \/ B) = P(A) + P(B) - P(A ^ B)

Suppose a blood test is 90% effective in detecting a disease. It also falsely diagnoses that a healthy person has the disease 3% of the time. If 10% of those tested have the disease, what is the probability that a person who tests positive will actually have the disease (i.e., find P(disease | positive))? P(disease) = 0.10, P(-disease) = 0.90, P(positive | disease) = 0.90, P(positive | -disease) = 0.03

P(disease | positive) = P(positive | disease) * P(disease) / [ P(positive | disease) * P(disease) + P(positive | -disease) * P(-disease) ] = (0.90)(0.10) / ( (0.90)(0.10) + (0.03)(0.90) ) = 0.77
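
The same calculation in code, with the denominator P(positive) expanded over the diseased and healthy groups. This is a minimal sketch; the function and parameter names are illustrative.

```python
def posterior_from_rates(prior, true_positive_rate, false_positive_rate):
    """P(disease | positive), with P(positive) computed by total probability."""
    p_positive = true_positive_rate * prior + false_positive_rate * (1.0 - prior)
    return true_positive_rate * prior / p_positive


p = posterior_from_rates(prior=0.10, true_positive_rate=0.90, false_positive_rate=0.03)
print(round(p, 2))  # 0.77
```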

Given three classifiers, each with a probability of error of 0.2 combined by simple voting, what is the probability of error of the ensemble?

P(err) = P(2 wrong) + P(3 wrong). P(2 wrong) = 3 (the number of possible combinations of two wrong classifiers) times the probability of two of the three being wrong times the probability of the remaining classifier being correct. P(err) = 3 * (0.2^2) * 0.8 + (0.2^3) = 0.104
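
The calculation generalizes to any small number of classifiers through the binomial distribution. The sketch below assumes the classifiers err independently and, for an even number of voters, that ties are broken by a fair coin; neither assumption is stated on the card.

```python
from math import comb


def majority_vote_error(n, p):
    """Probability that a simple majority of n independent classifiers is wrong."""
    error = sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1)
    )
    if n % 2 == 0:  # a tie is possible; assume it is resolved by a coin flip
        k = n // 2
        error += 0.5 * comb(n, k) * p**k * (1 - p) ** (n - k)
    return error


print(round(majority_vote_error(3, 0.2), 3))  # 0.104
```

With p = 0.5 the function returns 0.5 for any n it can evaluate, which is why voting cannot help classifiers that are no better than chance.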

More measures

Percentage of retrieved documents that are relevant: precision = TP / (TP + FP). Percentage of relevant documents that are returned: recall = TP / (TP + FN).

Examples of Classification Task

Predicting tumor cells as benign or malignant Classifying credit card transactions as legitimate or fraudulent Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil Categorizing news stories as finance, weather, entertainment, sports, etc

Recall

Proportion of true positives to all actual positive elements (true positives and false negatives). Recall = # True Positives / (# True Positives + # False Negatives)

Precision

Proportion of true positives to all predicted positives (true positives and false positives). Precision = # True Positives / (# True Positives + # False Positives)
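
Both measures can be computed from paired lists of actual and predicted labels. This is a minimal sketch with made-up labels; returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for the given positive class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(actual, predicted)
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3 and recall is 1/2.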

Transition Matrix - Doubly Stochastic Matrix

Rows and columns sum to 1

Traffic Light Example: Pr(green) = 0.45, Pr(red) = 0.45, Pr(yellow) = 0.1. We know that the police are perfect enforcers (i.e., we get a ticket if and only if the light is red when we enter the intersection). Now we enter the intersection without getting a ticket; what are the probabilities that the light was green, red, or yellow?

Since we got no ticket, we know that the light could not have been red (in other words, Pr(red | no-ticket) = 0). We also have Pr(no-ticket | green) = Pr(no-ticket | yellow) = 1. Using Bayes' rule we get: Pr(yellow | no-ticket) = Pr(no-ticket | yellow) * Pr(yellow) / Pr(no-ticket) = 0.1 / 0.55 = 2/11. Similarly, we can show that Pr(green | no-ticket) = 0.45 / 0.55 = 9/11.
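
With several competing hypotheses, the posterior is the normalized product of prior and likelihood. The sketch below reproduces the traffic light numbers; the function name and the dictionary layout are illustrative choices.

```python
def posterior(priors, likelihoods):
    """Pr(h | e) for every hypothesis h, given Pr(h) and Pr(e | h)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # Pr(e), by total probability
    return {h: joint[h] / evidence for h in joint}


priors = {"green": 0.45, "red": 0.45, "yellow": 0.10}
no_ticket_given = {"green": 1.0, "red": 0.0, "yellow": 1.0}
result = posterior(priors, no_ticket_given)
```

`result` maps green to 9/11, red to 0, and yellow to 2/11.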

Stacking

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, then stacking can theoretically represent any ensemble technique, although in practice, a single-layer logistic regression model is often used as the combiner. Stacking typically yields performance better than any single one of the trained models. It has been successfully used on both supervised learning tasks (regression, classification and distance learning) and unsupervised learning (density estimation). It has also been used to estimate bagging's error rate. It has been reported to out-perform Bayesian model-averaging. The two top-performers in the Netflix competition used blending, which may be considered to be a form of stacking.

Classifier ensembles usually have better performance than stand alone classifiers because they combine the different points of view of each individual classifier.

TRUE

For an ensemble to work the following conditions must be true:

The errors of the classifiers must not be strongly correlated (think about the exam example: if everyone knows by heart exactly the same chapters, will it help to solve the test in groups?). The error rates of the individual classifiers making up the ensemble need to be less than 0.5 (at least better than chance).

Data mining

The exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns.

How to build ensembles: the two dominant approaches belong to category 2, manipulating training data (same learning algorithm, different training data).

They are: bagging and boosting

Like PCA, kernel PCA relies on Eigen decomposition in order to extract new features.

True

Like kernel support vector machines, kernel PCA may require tuning of the kernel parameters.

True

The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the loglikelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected loglikelihood found on the E step. (T/F)

True. In statistics, an expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

Non-hierarchical Clustering

Typically must provide the number of desired clusters, k. Randomly choose k instances as seeds, one per cluster. Form initial clusters based on these seeds. Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering. Stop when clustering converges or after a fixed number of iterations.
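
The steps above describe k-means-style clustering and can be sketched as follows. This is a naive illustration that assumes points are tuples of floats, Euclidean distance, and centroids recomputed as cluster means; the toy data and the `kmeans` signature are assumptions.

```python
import math
import random


def kmeans(points, k, rng, max_iter=100):
    """Naive k-means; returns (centroids, assignments)."""
    centroids = rng.sample(points, k)  # random seeds, one per cluster
    assignments = None
    for _ in range(max_iter):
        # Reallocate every instance to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:  # converged
            break
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, k=2, rng=random.Random(0))
```

On data this well separated the result does not depend on the seeds, but in general a poor choice of seeds can lead to a poor clustering.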

Choosing the best feature

Use Information Gain to find the "best" (most discriminating) feature. Assume there are two classes, P and N (e.g., P = "yes" and N = "no"). Let the set of instances S (training data) contain p elements of class P and n elements of class N. The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined in terms of entropy, I(p,n). Note that Pr(P) = p / (p+n) and Pr(N) = n / (p+n). In other words, the entropy of a set of instances S is a function of the probability distribution of classes among the instances in S.

ROC curves ("receiver operating characteristic")

Used in signal detection to show the trade-off between hit rate and false alarm rate over a noisy channel. The y axis shows the percentage of true positives in the sample; the x axis shows the percentage of false positives in the sample. In general, an ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection in machine learning. The false positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity). The ROC curve is thus the sensitivity as a function of fall-out. An ROC curve demonstrates several things: It shows the trade-off between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity). The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test. The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test. The area under the curve is a measure of accuracy.
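
The curve and its area can be computed directly from classifier scores. The sketch below is illustrative: it assumes higher scores mean "more likely positive", labels are 0 or 1 with both classes present, and the area is taken with the trapezoidal rule. The scores and labels are made up.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last record sharing this score.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

For these scores the area is 8/9: of the nine positive-negative pairs, eight are ranked in the right order.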

