Decision Tree WEEK 2


What are advantages of decision trees? Define scalability.

- Relatively fast learning speed compared to other classification methods.
- Convertible to simple, easy-to-understand classification rules.
- Can use SQL queries to access databases.
- Classification accuracy comparable with other methods.

Scalability is the ability to classify data sets with millions of examples and hundreds of attributes at reasonable speed.

What is gini index?

A measure of the impurity of a data partition, used to evaluate candidate splits when building a decision tree. The split with the lowest Gini index is the best attribute split.
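
A minimal sketch (assuming Python; gini and gini_split are illustrative names, not from the card) of how the Gini index of a candidate split can be computed:

    from collections import Counter

    def gini(labels):
        # Gini impurity of one partition: 1 - sum of squared class proportions.
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(partitions):
        # Weighted Gini index of a split: sum over partitions of |Dj|/|D| * gini(Dj).
        total = sum(len(p) for p in partitions)
        return sum(len(p) / total * gini(p) for p in partitions)

    # The candidate split with the lowest weighted Gini index wins.
    print(gini_split([["yes", "yes", "no"], ["no", "no"]]))  # ~0.267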

How can we enhance basic decision tree induction?

Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals, for example, thresholding a numerical value as greater than 80 or less than or equal to 80.

Handle missing attribute values: assign the most common value of the attribute, impute the mean for missing numeric values, or assign a probability to each of the possible values.

Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication. The first two enhancements are sketched below.
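
A minimal sketch (the threshold of 80 is the card's example; the scores and the rest are assumed) of mean imputation followed by threshold discretization:

    scores = [91, 78, None, 85, 60, None, 80]

    # Handle missing values: impute the mean of the observed values.
    observed = [s for s in scores if s is not None]
    mean = sum(observed) / len(observed)
    scores = [mean if s is None else s for s in scores]

    # Dynamically define a discrete-valued attribute from the threshold 80.
    buckets = ["> 80" if s > 80 else "<= 80" for s in scores]
    print(buckets)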

What is a decision tree?

A classifier built in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root, and all attributes are categorical. Examples are partitioned recursively based on selected attributes, where test attributes are selected on the basis of a heuristic or statistical measure. Partitioning stops when all samples at a given node belong to the same class, or when no more attributes or samples remain to partition.
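
A quick illustration of inducing such a tree (the card names no toolkit; scikit-learn and its iris dataset are assumptions):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    # Top-down induction: examples start at the root and are partitioned
    # recursively on the selected test attributes.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
    print(export_text(tree))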

What are the two splits based on continuous attributes?

Discretization to form an ordinal categorical attribute, either static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering). Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut, which can be more compute-intensive, as sketched below.
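
A minimal sketch (assuming Python; best_cut is an illustrative name) of the binary-decision search: try the midpoint between each pair of adjacent sorted values as the cut v and keep the one with the lowest weighted Gini index:

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_cut(values, labels):
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_v, best_score = None, float("inf")
        for i in range(1, n):
            v = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate cut
            left = [lab for val, lab in pairs if val < v]
            right = [lab for val, lab in pairs if val >= v]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best_score:
                best_v, best_score = v, score
        return best_v, best_score

    print(best_cut([60, 70, 75, 85, 90], ["no", "no", "no", "yes", "yes"]))
    # -> (80.0, 0.0): a pure cut between 75 and 85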

How can we best select attributes in a decision tree?

Information gain: biased toward multivalued attributes. Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others. Gini index: biased toward multivalued attributes, has difficulty when the number of classes is large, and tends to favor tests that result in equal-sized partitions with purity in both partitions. The first two measures are sketched below.
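
A minimal sketch (assuming Python; the 9-yes/5-no example is illustrative) showing how gain ratio normalizes information gain by the split information, which is what counteracts the bias toward multivalued attributes:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(parent, partitions):
        n = len(parent)
        remainder = sum(len(p) / n * entropy(p) for p in partitions)
        return entropy(parent) - remainder

    def gain_ratio(parent, partitions):
        n = len(parent)
        split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partitions)
        return info_gain(parent, partitions) / split_info

    parent = ["yes"] * 9 + ["no"] * 5
    parts = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
    print(info_gain(parent, parts))   # ~0.247
    print(gain_ratio(parent, parts))  # ~0.156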

What is Hunt's algorithm?

Let Dt be the set of training records that reach node t. General procedure: if all records in Dt belong to the same class yt, then t is a leaf node labeled yt. If Dt is an empty set, then t is a leaf node labeled by the default class yd. If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset. A compact sketch follows.
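
A compact sketch of the procedure under simplifying assumptions (the attribute to test is picked naively rather than by a selection measure; grow is an illustrative name):

    from collections import Counter

    def grow(records, labels, attrs, default):
        if not records:                # Dt empty: leaf labeled default class yd
            return default
        if len(set(labels)) == 1:      # all records in one class yt: leaf yt
            return labels[0]
        if not attrs:                  # no attributes left: majority class
            return Counter(labels).most_common(1)[0][0]
        a = attrs[0]                   # attribute test (selection heuristic omitted)
        majority = Counter(labels).most_common(1)[0][0]
        tree = {a: {}}
        for v in set(r[a] for r in records):
            pairs = [(r, l) for r, l in zip(records, labels) if r[a] == v]
            sub_r = [r for r, _ in pairs]
            sub_l = [l for _, l in pairs]
            tree[a][v] = grow(sub_r, sub_l, attrs[1:], majority)  # recurse on subset
        return tree

    recs = [{"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
    print(grow(recs, ["no", "yes", "yes"], ["outlook"], "no"))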

What are the two splits based on nominal attributes?

Multi-way split: use as many partitions as distinct values. Binary split: divides values into two subsets; the optimal partitioning must be found, as sketched below.
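
A minimal sketch (illustrative) of why the binary split requires a search: a nominal attribute with k distinct values has 2**(k-1) - 1 candidate two-subset partitionings to evaluate:

    from itertools import combinations

    def binary_partitions(values):
        values = sorted(values)
        for size in range(1, len(values) // 2 + 1):
            for left in combinations(values, size):
                right = tuple(v for v in values if v not in left)
                if size < len(values) - size or left < right:  # skip mirror duplicates
                    yield left, right

    for left, right in binary_partitions(["small", "medium", "large"]):
        print(left, "vs", right)  # 2**(3-1) - 1 = 3 partitionings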

What is overfitting? How can it affect a decision tree? What methods can prevent overfitting in a decision tree?

Overfitting occurs when a model fits the training data so closely that training error is near zero but the model generalizes poorly to test data. An induced tree may overfit the training data by growing too many branches, some of which reflect anomalies due to noise, leading to poor accuracy on unseen samples. For example, an unseen observation may reach a point in the tree where it has an unknown value for the tested attribute, producing an inaccurate classification. Prepruning: halt tree construction early by refusing to split a node if the goodness measure of the split falls below a threshold. Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, then use a set of data different from the training data to decide the "best pruned tree." Both ideas are sketched below.
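
A brief illustration of both ideas (scikit-learn usage and the dataset are assumptions; the card names no toolkit): prepruning via a split-quality threshold, and postpruning by choosing among progressively pruned trees on held-out data:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    # Prepruning: halt construction early when a split's impurity decrease
    # falls below a threshold.
    pre = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X_tr, y_tr)

    # Postpruning: compute a sequence of progressively pruned trees and pick
    # the one that scores best on data held out from training.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
    post = max(
        (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
         for a in path.ccp_alphas),
        key=lambda t: t.score(X_val, y_val),
    )
    print(pre.score(X_val, y_val), post.score(X_val, y_val))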

