Final Practice Exam

Provide the formula for accuracy in terms of TP, TN, FP, and FN.

(TP + TN) / (TP + TN + FP + FN)
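
As a minimal sketch of the formula above, the following function computes accuracy from the four confusion-matrix counts. The function name and the example counts are hypothetical and chosen only for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total

# Hypothetical counts, not taken from the exam.
print(accuracy(tp=40, tn=50, fp=5, fn=5))  # 0.9
```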

Given a training set with 5+ and 10- examples, a) What is the entropy value associated with this data set? You need not simplify your answer to a numerical value.

-(1/3)log2(1/3) - (2/3)log2(2/3)
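
If a numerical value is wanted, the expression can be evaluated as in the sketch below. It assumes base-2 logarithms (the usual convention for decision trees, but an assumption here), and the helper name `entropy` is illustrative.

```python
import math

def entropy(counts):
    """Entropy of a class distribution given per-class counts (log base 2 assumed)."""
    total = sum(counts)
    # Classes with zero examples contribute nothing, so they are skipped.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

h = entropy([5, 10])  # 5 positive and 10 negative examples
print(round(h, 3))    # 0.918
```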

Given a training set with 5+ and 10- examples, b) What is the Gini index associated with this data set? In this case you should simplify your result, although you may express the answer as a fraction rather than a decimal.

1 - [(1/3)^2 + (2/3)^2] = 1 - 1/9 - 4/9 = 1 - 5/9 = 4/9
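
The same arithmetic can be checked with exact fractions. This is a small sketch; the helper name `gini` is an assumption for illustration.

```python
from fractions import Fraction

def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions, kept as an exact fraction."""
    total = sum(counts)
    return 1 - sum(Fraction(c, total) ** 2 for c in counts)

print(gini([5, 10]))  # 4/9
```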

The nearest neighbor algorithm relies on having a good notion of similarity, or distance. In class we discussed several factors that can make it non-trivial to have a good similarity metric. What were two of the factors?

A good similarity metric requires that the scales of the features are similar. For example, if one feature varies from 1 to 100 and another from 1 to 1,000,000, then the second feature dominates the distance and the values should be rescaled. Another problem is that some features may be much less important than others, and yet by default all features are considered equally important. Also, redundant or highly correlated features will throw off the distance metric, because the related features will be overvalued.
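
The scaling problem can be illustrated with a short sketch. The four data points are hypothetical, and min-max rescaling is just one common choice, assumed here for illustration. On the raw data the large-range feature dominates; after rescaling, the nearest neighbor of the first point changes.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def nearest(rows, i):
    """Index of the row closest to rows[i], excluding rows[i] itself."""
    others = (j for j in range(len(rows)) if j != i)
    return min(others, key=lambda j: euclidean(rows[i], rows[j]))

# Hypothetical data: feature 0 ranges over 1..100, feature 1 over 1..1,000,000.
points = [(1, 500_000), (100, 500_100), (1, 1_000_000), (50, 1)]

print(nearest(points, 0))                 # 1 (feature 1 dominates the raw distance)
print(nearest(min_max_scale(points), 0))  # 2 (both features now count equally)
```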

Discuss the basic difference between the agglomerative and divisive hierarchical clustering algorithms and mention which type of hierarchical clustering algorithm is more commonly used.

Agglomerative methods start with each object as an individual cluster and then incrementally build larger clusters by merging clusters. Divisive methods, on the other hand, start with all points belonging to one cluster and then split a cluster apart at each iteration. The agglomerative method is more common.
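
The bottom-up merging process can be sketched as follows. This is a minimal illustration on hypothetical one-dimensional points, and it assumes single-link (closest-pair) distance between clusters. Other linkage choices are equally valid.

```python
def single_link(c1, c2):
    """Distance between two clusters: the closest pair of points, one from each."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, k):
    """Start with one cluster per point and merge the two closest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]                          # safe because j is always greater than i
    return clusters

# Hypothetical 1-D data.
print(agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], k=3))
# [[1.0, 1.5], [5.0, 5.2], [9.0]]
```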

How does an ordinal feature differ from a nominal feature? Explain in one or two sentences.

An ordinal feature is a nominal feature whose values have a natural ordering (e.g., small, medium, large), whereas the values of a nominal feature are just distinct labels with no order.

How can you convert a decision tree into a rule set? Explain the process.

Create one rule per leaf node by traversing the conditions from the root node to the leaf and conjoining those conditions. Note that the rules would be mutually exclusive, meaning that only one rule could "fire" at a time.
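
A minimal sketch of this traversal is shown below. The tree, its conditions, and the tuple representation are all hypothetical and are used only to make the example self-contained.

```python
# Hypothetical tree: an internal node is (condition, yes_subtree, no_subtree);
# a leaf is just a class label string.
tree = ("outlook == sunny",
        ("humidity > 75", "no", "yes"),
        "yes")

def tree_to_rules(node, path=()):
    """Return one rule per leaf by conjoining the conditions on the root-to-leaf path."""
    if isinstance(node, str):  # leaf: emit a rule for the path taken so far
        antecedent = " AND ".join(path) if path else "TRUE"
        return [f"IF {antecedent} THEN class = {node}"]
    condition, yes_branch, no_branch = node
    return (tree_to_rules(yes_branch, path + (condition,))
            + tree_to_rules(no_branch, path + (f"NOT ({condition})",)))

for rule in tree_to_rules(tree):
    print(rule)
```

Because every example follows exactly one root-to-leaf path, exactly one of the generated rules applies to it.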

The k-means clustering algorithm that we studied will automatically find the best value of k as part of its normal operation.

False

FN

False Negative

FP

False Positive

List two reasons why data mining is popular now and it wasn't as popular 20 years ago.

Faster computers, cheaper memory, more data being routinely recorded (e.g., popularity of the Web and devices like smartphones), and to a lesser degree better algorithms.

Does the Ripper rule learner build rules from general to specific or specific to general?

It builds rules from general to specific. It starts with a rule where the antecedent has no conditions and then adds conditions one at a time.

What does it mean if the rule set for a rule learner is exhaustive?

It means that the rules will collectively cover every possible example.

In the figure below, there are two clusters. They are connected by a line which represents the distance used to determine inter-cluster similarity.

MIN

Given a training set with 5+ and 10- examples c) If you generated a decision tree with just the root node for the examples in this data set, what class value would you assign and what would be the training-set error rate associated with this (very short) decision tree?

The majority class is the negative class, so classify every example as negative. The training error rate is then 5/15 = 1/3.
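
The root-only tree is simply a majority-class predictor, which can be sketched as follows. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

labels = ["+"] * 5 + ["-"] * 10  # the 5 positive and 10 negative training examples

# A root-only tree predicts the most common class for every example.
majority, count = Counter(labels).most_common(1)[0]
error_rate = Fraction(len(labels) - count, len(labels))

print(majority, error_rate)  # - 1/3
```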

Are the two clusters shown below well separated? Circle and justify your answer.

No, they are not well separated, because some points in each cluster are closer to points in the other cluster than to points in the same cluster.

Sally measures the pressure of all of the tires coming into her garage for an oil change and records the values. Unknown to her, her tire gauge is miscalibrated and adds 3 psi to each reading. According to the definition of noise used by our textbook, is this error introduced by the tire gauge considered noise? Answer "yes" or "no" and justify your answer.

No, since noise must be random, not systematic.

Provide the formula for precision and recall using TP, TN, FP, and FN.

Precision = TP/(TP + FP) Recall = TP/(TP + FN)
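
Both formulas can be written as small helper functions. This is a sketch in which the function names and the example counts are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Of the examples predicted positive, the fraction that really are positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of the examples that really are positive, the fraction that were found."""
    return tp / (tp + fn)

# Hypothetical counts, not taken from the exam.
print(precision(tp=40, fp=10))         # 0.8
print(round(recall(tp=40, fn=20), 4))  # 0.6667
```

Note that TN appears in neither formula: both measures focus only on the positive class.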

If we build a classifier and evaluate it on the training set and the test set: Which data set provides best accuracy estimate on new data?

Test Set

You need to split on attribute a1 in your decision tree. The attribute has 8 values. Why might a two-way split be better than an 8-way split? What might be a problem with the 8-way split?

The 8-way split can lead to the problem of data fragmentation. The data will be split up excessively leaving smaller amounts of data available for future splits.

The algorithm that we used to do association rule mining is the Apriori algorithm. This algorithm is efficient because it relies on and exploits the Apriori property. What is the Apriori property?

The Apriori property states that if an itemset is frequent, then all of its subsets must also be frequent.

What is the curse of dimensionality?

The curse of dimensionality is that as the number of features increases, the density of the data points within the instance space decreases, which makes it harder to find patterns. For example, if you have 100 data points and one feature, the space is likely to be dense, but if you have the same 100 data points and 100 features, the space will be quite sparse.
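
One rough way to see the sparsity is to divide each feature into bins and ask what fraction of the resulting cells could possibly contain a data point. The sketch below assumes 10 bins per feature, an arbitrary choice made only for illustration.

```python
def max_occupied_fraction(n_points: int, n_features: int, bins: int = 10) -> float:
    """Upper bound on the fraction of grid cells that can contain at least one point."""
    cells = bins ** n_features  # the number of cells grows exponentially with features
    return min(1.0, n_points / cells)

for d in (1, 2, 3, 100):
    print(d, max_occupied_fraction(100, d))
```

With 100 points, one or two features can still fill every cell, three features leave at least 90% of the cells empty, and with 100 features essentially the entire space is empty.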

A database has 4 transactions, shown below.
TID   Date      items_bought
T100  10/15/04  {K, A, D, B}
T200  10/15/04  {D, A, C, E, B}
T300  10/19/04  {C, A, B, E}
T400  10/22/04  {B, A, D}
Assuming a minimum level of support min_sup = 60% and a minimum level of confidence min_conf = 80%: (a) Find all frequent itemsets (not just the ones with the maximum width/length) using the Apriori algorithm. Show your work; just showing the final answer is not acceptable. For each iteration show the candidate and acceptable frequent itemsets. You should show your work similar to the way the example was done in the PowerPoint slides.

The final answer is: {{A}, {B}, {D}, {A, B}, {A, D}, {B, D}, {A, B, D}}. With 4 transactions and min_sup = 60%, an itemset must appear in at least 3 transactions to be frequent.
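
The level-by-level search can be sketched as below. This is a simplified illustration of the Apriori idea on the four transactions from the question, not necessarily the exact procedure shown in the course slides. The function names are assumptions.

```python
from itertools import combinations

transactions = [
    {"K", "A", "D", "B"},       # T100
    {"D", "A", "C", "E", "B"},  # T200
    {"C", "A", "B", "E"},       # T300
    {"B", "A", "D"},            # T400
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset.issubset(t) for t in transactions) / len(transactions)

def apriori(transactions, min_sup):
    """Return all frequent itemsets, built one size (level) at a time."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_sup]
    frequent, k = [], 1
    while level:
        frequent.extend(level)
        k += 1
        previous = set(level)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, k - 1))}
        level = sorted((c for c in candidates if support(c, transactions) >= min_sup),
                       key=sorted)
    return frequent

for itemset in apriori(transactions, min_sup=0.6):
    print(sorted(itemset), support(itemset, transactions))
```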

What classifier induction algorithm can effectively generate the most expressive classifiers, in terms of the decision boundaries that can be formed? Which is the least expressive? Rank order them from most to least expressive. Briefly justify your ordering.

The induction algorithms are decision trees, linear classifiers, and nearest neighbor. Most expressive: nearest neighbor, which can form arbitrarily shaped decision boundaries. Middle: decision trees, which build their boundaries from axis-parallel splits. Least expressive: linear classifiers, which are limited to a single straight-line (hyperplane) boundary.

Sometimes a data set is partitioned such that a validation set is provided. What is the purpose of the validation set?

The validation set is used to select among multiple models or to tune a specific model (which can be viewed as a family of models). In this sense it is used like a test set, in that it is used for evaluation, but it is not like a test set in that it cannot be used for reporting the performance of the model. That is, the validation set chooses a model (for example, the right amount of pruning), and then that model can be evaluated on a test set and that performance number can be reported. If one data set could be used both for finding the best model and for reporting its performance, you could generate 1 million models, pick the one that does best on that data set, and report that performance. But it is likely that the chosen model does not really have the best performance and just happened to do best on that one data set.

If we build a classifier and evaluate it on the training set and the test set: Which data set would we expect to have the higher accuracy?

Training Set

A density-based clustering algorithm can generate non-globular clusters.

True

In association rule mining, the generation of the frequent itemsets is the computationally intensive step.

True

Our use of association analysis will yield the same frequent itemsets and strong association rules whether a specific item occurs once or three times in an individual transaction.

True

TN

True Negative

TP

True Positive

We generally will be more interested in association rules with high confidence. However, often we will not be interested in association rules that have a confidence of 100%. Why? Then specifically explain why association rules with 99% confidence may be interesting (i.e., what might they indicate)?

While we generally prefer association rules with high confidence, a rule with 100% confidence most likely represents some already known fact or policy (e.g., savings account → checking account may just indicate that all customers are required to have a checking account if they have a savings account). Rules with 99% confidence are interesting not because of the 99% part but because of the 1% part. These are the exceptions to the rule. They may indicate, for example, that a policy is being violated. They might also indicate that there is a data entry error. Either way, it would be interesting to understand why the 1% do not follow the general pattern.

Are decision trees easy to interpret? (circle one)

Yes

List all of the strong association rules, along with their support and confidence values, that match the following metarule, where X is a variable representing customers and item_i denotes variables representing items (e.g., "A", "B", etc.): ∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) → buys(X, item3). Hint: don't worry about the fact that the statement above uses relations. The point of the metarule is to tell you to only worry about association rules of the form X ∧ Y → Z (or {X, Y} → Z if you prefer that notation). That is, you don't need to worry about rules of the form X → Z.

buys(X, A) ∧ buys(X, B) → buys(X, D) (75%, 75%): not strong
buys(X, A) ∧ buys(X, D) → buys(X, B) (75%, 100%): strong
buys(X, B) ∧ buys(X, D) → buys(X, A) (75%, 100%): strong
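
The support and confidence values above can be recomputed with a short sketch. It uses the four transactions and the thresholds (min_sup = 60%, min_conf = 80%) from the earlier question. The function and constant names are illustrative.

```python
transactions = [
    {"K", "A", "D", "B"},       # T100
    {"D", "A", "C", "E", "B"},  # T200
    {"C", "A", "B", "E"},       # T300
    {"B", "A", "D"},            # T400
]
MIN_SUP, MIN_CONF = 0.60, 0.80

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset.issubset(t) for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) divided by support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def is_strong(antecedent, consequent):
    """A rule is strong if it meets both the support and confidence thresholds."""
    return (support(antecedent | consequent) >= MIN_SUP
            and confidence(antecedent, consequent) >= MIN_CONF)

# Every rule of the form {item1, item2} -> item3 built from the itemset {A, B, D}.
for item in "DBA":
    consequent = frozenset(item)
    antecedent = frozenset("ABD") - consequent
    print(sorted(antecedent), "->", item,
          support(antecedent | consequent),
          confidence(antecedent, consequent),
          is_strong(antecedent, consequent))
```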

