Chapter 5

List Applications of Decision Trees in Real Life

1. Biomedical engineering (identifying features to be used in implantable devices)
2. Financial analysis (customer satisfaction with a product or service)
3. Astronomy (classifying galaxies)
4. System control
5. Manufacturing and production (quality control, semiconductor manufacturing, etc.)
6. Medicine (diagnosis, cardiology, psychiatry)
7. Physics (particle detection)

What are the most commonly used Classification Algorithms?

1. Decision Tree
2. Random Forest
3. Naive Bayes Classifier
4. Support Vector Machines (SVM)
5. Logistic Regression

What are the Disadvantages of Classification with Decision Trees?

1. Easy to overfit.
2. Decision boundaries are restricted to being parallel to the attribute axes.
3. Decision tree models are often biased toward splits on features that have a large number of levels.
4. Small changes in the training data can result in large changes to the decision logic.
5. Large trees can be difficult to interpret, and the decisions they make may seem counterintuitive.

Horse Tree - Example for Decision Tree Classifier and Random Forest Classifier

1. Import the libraries and the dataset.
2. Preprocess the data using get_dummies and label encoders.
3. Impute the missing values using the most frequent values.
4. Fit a decision tree classifier on the transformed data.
5. Print the accuracy.
6. Fit a random forest classifier on the transformed data.
7. Print the accuracy.
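The steps above can be sketched roughly as follows. This is a minimal illustration, not the original exercise: the small `raw` table is a hypothetical stand-in for the horse dataset, the column names are invented, and it assumes pandas and scikit-learn are installed. For brevity it imputes before splitting, following the step order on the card; in practice the imputer is usually fit on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Step 1: hypothetical stand-in data (the real horse dataset is not reproduced here).
raw = pd.DataFrame({
    "age": ["adult", "young", "adult", "adult", "young", "adult", "young", "adult"] * 5,
    "pulse": [66, 88, np.nan, 164, 104, np.nan, 48, 60] * 5,
    "pain": ["mild", "severe", None, "severe", "mild", "mild", None, "mild"] * 5,
    "outcome": ["lived", "died", "lived", "died", "died", "lived", "lived", "lived"] * 5,
})

# Step 2: label-encode the target and one-hot encode the categorical features.
y = LabelEncoder().fit_transform(raw["outcome"])
X = pd.get_dummies(raw.drop(columns="outcome"), columns=["age", "pain"], dtype=float)

# Step 3: fill remaining missing values with each column's most frequent value.
X_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y, test_size=0.25, random_state=0
)

# Steps 4-5: decision tree and its accuracy on the held-out rows.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree.predict(X_test))
print("Decision tree accuracy:", tree_acc)

# Steps 6-7: random forest and its accuracy on the same held-out rows.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print("Random forest accuracy:", forest_acc)
```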

What are the Advantages of Classification with Decision Trees?

1. Inexpensive to construct.
2. Extremely fast at classifying unknown records.
3. Easy to interpret for small-sized trees.
4. Accuracy comparable to other classification techniques for many simple datasets.
5. Excludes unimportant features.

List the Features and Advantages of Random Forest

1. It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
2. It runs efficiently on large databases.
3. It can handle thousands of input variables without variable deletion.
4. It gives estimates of which variables are important in the classification.
5. It generates an internal unbiased estimate of the generalization error as the forest building progresses.
6. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.

What are the disadvantages of Random Forest?

1. Random forests have been observed to overfit for some data sets with noisy classification or regression tasks.
2. For data that includes categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data.

List Use Cases of Classification Algorithms

1. Sentiment Analysis 2. Fraud Detection 3. Face Detection

Define the Classifier Model workflow

Step 1: Train a classifier model on the available labeled data.
Input -> Feature Extractor -> Features -> ML Algorithm
Label -> ML Algorithm
Step 2: Test the classifier model by predicting on unseen data and validating the output.
Input -> Feature Extractor -> Features -> Classifier Model -> Label

What is entropy?

A measure of disorder or randomness. Entropy measures the impurity of a collection of examples, and it depends on the distribution of the random variable. More generally, entropy measures the amount of information in a random variable.
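As a small illustration of entropy as an impurity measure, the sketch below computes Shannon entropy in bits for a list of class labels. The function name `entropy` and the example labels are assumptions made for this example, and the base-2 logarithm is one common choice rather than something the card specifies.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum p * log2(1/p) over the classes that actually occur, where p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

pure = ["fit", "fit", "fit", "fit"]        # one class only: no disorder
mixed = ["fit", "fit", "unfit", "unfit"]   # 50:50 mix: maximum disorder for two classes

print(entropy(pure))
print(entropy(mixed))
```

A collection with a single class has the lowest possible entropy, and an even two-class mix has the highest.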

What is a Decision Node?

A sub-node that splits into further sub-nodes.

What is a Branch / Sub-Tree?

A subsection of the entire tree.

Explain the Decision Tree Classifier

A tree-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. A path from the root to a leaf represents a classification rule.
Example: Is a person fit?
Age < 30?
  Yes: Eats lots of pizza?
    Yes: Unfit
    No: Fit
  No: Exercises in the morning?
    Yes: Fit
    No: Unfit
Step 1: Using the decision tree algorithm, we start at the tree root and split the data on the feature that results in the largest information gain (IG), that is, the largest reduction in uncertainty about the final decision.
Step 2: In an iterative process, we repeat this splitting procedure at each child node until the leaves are pure. This means that the samples at each leaf node all belong to the same class.
Step 3: In practice, we may set a limit on the depth of the tree to prevent overfitting. We compromise somewhat on purity here, as the final leaves may still have some impurity.
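To show how a path from root to leaf acts as a classification rule, here is a hand-written sketch of the fitness example as nested tests. The function and parameter names are hypothetical. A real decision tree learner would derive these tests from data instead of having them coded by hand.

```python
def classify_fitness(age, eats_lots_of_pizza, exercises_in_morning):
    """Follow the example tree from the root test down to a leaf label."""
    if age < 30:
        # Left subtree: younger than 30, so the next test is about pizza.
        return "Unfit" if eats_lots_of_pizza else "Fit"
    # Right subtree: 30 or older, so the next test is about morning exercise.
    return "Fit" if exercises_in_morning else "Unfit"

print(classify_fitness(25, eats_lots_of_pizza=True, exercises_in_morning=False))
print(classify_fitness(45, eats_lots_of_pizza=True, exercises_in_morning=True))
```

Each call evaluates only the tests on one path, which is why classifying a record with a trained tree is fast.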

What is Accuracy? What is its significance?

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy alone is not a reliable measure of a classifier's real performance, because it yields misleading results if the data set is unbalanced (the number of observations in different classes varies greatly).
For example, consider a 2-class problem:
Number of Class 0 examples = 9990
Number of Class 1 examples = 10
If the model predicts every example to be Class 0, then the accuracy of the model is 9990/10000 = 99.9%. The accuracy is misleading because the model does not detect any Class 1 example.
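The imbalanced example can be checked with a few lines of code. This sketch treats Class 1 as the positive class, which is an assumption of the example, and builds the label lists directly from the counts given on the card.

```python
# 9990 examples of Class 0 and 10 of Class 1; the model always predicts Class 0.
actual = [0] * 9990 + [1] * 10
predicted = [0] * 10000

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # Class 1 correctly detected
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # Class 0 correctly detected
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # Class 0 flagged as Class 1
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # Class 1 missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy:.3f}")
```

The accuracy is high even though every Class 1 example ends up as a false negative.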

What is bagging?

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
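A minimal sketch of the idea, under stated assumptions: the base learner here is a one-feature threshold rule (a decision stump) chosen only to keep the example short, and the names `fit_stump`, `bagging_fit`, and `bagging_predict` as well as the toy data are hypothetical. Each model is trained on a bootstrap sample (drawn with replacement), and predictions are aggregated by majority vote.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold with the fewest training errors; predict 1 when x >= threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum(1 for x, y in points if (1 if x >= threshold else 0) != y)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = rng.choices(points, k=len(points))  # sample with replacement
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(1 if x >= threshold else 0 for threshold in models)
    return votes.most_common(1)[0][0]

# Toy one-feature data as (x, label) pairs.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0.5), bagging_predict(models, 9.5))
```

Using an odd number of models avoids tied votes in a two-class problem.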

What type of Algorithm is Classification?

Classification is a supervised learning task, because the training input contains labels.

What is a cost matrix?

A cost matrix is similar to a confusion matrix, except that it records the cost of each wrong or right prediction. Example: the cost of failing to identify cancer (a false negative) is higher than the cost of a false positive, because the patient is more likely to die without treatment.
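One way to use a cost matrix is to weight each cell of the confusion matrix by its cost and add up the result. The counts and costs below are invented for illustration only; the point is that a missed cancer case (FN) is charged far more than a false alarm (FP).

```python
# Hypothetical confusion-matrix counts for a cancer screening model.
confusion = {"TP": 40, "FN": 10, "FP": 20, "TN": 930}

# Hypothetical cost per outcome: correct predictions cost nothing,
# a missed case is much more expensive than a false alarm.
cost = {"TP": 0, "FN": 100, "FP": 5, "TN": 0}

# Total cost is the count of each outcome multiplied by its cost.
total_cost = sum(confusion[outcome] * cost[outcome] for outcome in confusion)
print(total_cost)
```

Two models with the same accuracy can have very different total costs if their errors fall in different cells.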

What is Splitting?

The division of a node into two or more sub-nodes.

What is an Ensemble Algorithm?

Ensemble algorithms combine more than one algorithm, of the same or different kinds, to classify objects. For example, running predictions with Naive Bayes, SVM, and Decision Tree classifiers and then taking a vote to decide the final class of the test object.
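The voting step can be sketched as a simple majority vote. The three predictions below are hard-coded placeholders standing in for the outputs of trained Naive Bayes, SVM, and Decision Tree models, and the class names are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

# Placeholder predictions from three different classifiers for one test object.
votes = {"naive_bayes": "spam", "svm": "ham", "decision_tree": "spam"}

final_class = majority_vote(list(votes.values()))
print(final_class)
```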

What is a false negative?

A false negative occurs when the model predicts negative but the actual class is positive. Example: cancer screening, where the model predicts no cancer but the patient actually has cancer. In vulnerability scanning, a false negative report occurs when the scanner misses a vulnerability and fails to alert the administrator to a dangerous situation, which is far more dangerous than a false positive.

When is a Node Pure?

When all of its data (100%) belongs to a single class.

What is Information Gain?

Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute. The higher the information gain, the more effective the attribute is at classifying the training data.
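As a sketch of the definition, information gain can be computed as the parent's entropy minus the size-weighted entropy of the child groups produced by a split. The helper names, the attribute names `age` and `shoe_size`, and the toy labels are all hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the child groups."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["fit"] * 4 + ["unfit"] * 4

# Each candidate attribute partitions the same parent examples differently.
splits = {
    "age": [["fit"] * 4, ["unfit"] * 4],                    # separates the classes
    "shoe_size": [["fit", "fit", "unfit", "unfit"]] * 2,    # leaves both groups mixed
}

gains = {name: information_gain(parent, groups) for name, groups in splits.items()}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

The attribute with the highest gain is the one a tree learner would choose for the split.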

What is the use of pd.get_dummies?

It is a pandas function that turns a categorical variable into a series of zeros and ones (dummy variables). Example: pd.get_dummies(data, columns=cat_feats, drop_first=True). The drop_first parameter controls whether the reference level is kept or removed, that is, whether k or k-1 dummies are created from k categorical levels. The default is drop_first=False, meaning the reference level is not dropped and k dummies are created from k levels. If you set drop_first=True, the reference column is dropped after encoding.
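The effect of drop_first is easiest to see side by side. This sketch assumes pandas is installed; the `color` column and its values are hypothetical, and `cat_feats` simply mirrors the variable name used on the card.

```python
import pandas as pd

# Hypothetical data with one categorical column that has k = 3 levels.
data = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
cat_feats = ["color"]

# Default (drop_first=False): one dummy column per level, k columns in total.
all_levels = pd.get_dummies(data, columns=cat_feats)

# drop_first=True: the first level is dropped, leaving k - 1 columns.
reduced = pd.get_dummies(data, columns=cat_feats, drop_first=True)

print(list(all_levels.columns))
print(list(reduced.columns))
```

With drop_first=True, a row where all remaining dummy columns are zero corresponds to the dropped reference level.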

What is pruning?

It is the process of removing sub-nodes of a decision node.

What is a confusion matrix?

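It is also known as an error matrix. It is a table that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class. Some libraries, such as scikit-learn, use the transposed convention (rows are actual classes, columns are predicted classes).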
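A small sketch of building a 2x2 confusion matrix by counting (predicted, actual) pairs, with rows as predicted classes and columns as actual classes as described on the card. The label lists are made-up examples, and 1 is assumed to be the positive class.

```python
from collections import Counter

# Hypothetical labels: 1 = positive, 0 = negative.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count how often each (predicted, actual) combination occurs.
counts = Counter(zip(predicted, actual))
tp = counts[(1, 1)]  # predicted positive, actually positive
fp = counts[(1, 0)]  # predicted positive, actually negative
fn = counts[(0, 1)]  # predicted negative, actually positive
tn = counts[(0, 0)]  # predicted negative, actually negative

# Rows: predicted (positive, negative). Columns: actual (positive, negative).
matrix = [[tp, fp],
          [fn, tn]]
for row in matrix:
    print(row)
```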

What is a Random Forest Classifier?

It is an ensemble tree-based learning algorithm. The random forest classifier is a set of decision trees, each built from a randomly selected subset of the training set. It aggregates the votes from the different decision trees to decide the final class of the test object.

What is a Leaf / Terminal Node?

A node that does not split. A branch with entropy E = 0 (a pure node) is also a leaf node.

When does overfitting of decision trees occur?

Overfitting occurs when the learning algorithm continues to develop hypotheses that reduce training set error at the cost of increased test set error.

Which Attribute is the best Classifier?

The attribute with the highest information gain is selected as the splitting attribute.

What is the Root Node?

The node that represents the entire population or sample, which then gets divided further.

What is Classification?

The process of grouping things based on their similarities. A machine learning task that identifies the class to which an instance belongs.

What is a false positive?

When the model predicts positive but the actual class is negative. Example: a pregnancy test where the model predicts pregnant (positive) but the person is actually not pregnant.

When is a Node Impure?

When its data belongs to more than one class. A node whose data is split 50:50 between two classes is maximally impure.

What is sensitivity (also called recall or hit rate)?

The probability that a patient with a disease will have a positive result, or the ability of a test to detect disease. Formula: TP / (TP + FN). In the imbalanced example from the accuracy card (a model that predicts every example as Class 0), the model has a 100% recognition rate for Class 0 and a 0% recognition rate for Class 1.

What is specificity?

The probability that a patient without the disease will have a negative result. Formula: TN / (TN + FP)
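The sensitivity and specificity formulas can be applied to the imbalanced example from the accuracy card, where the model predicts Class 0 for all 10000 examples. Treating Class 1 as the positive class is an assumption of this sketch, and the function names are hypothetical.

```python
def sensitivity(tp, fn):
    """Share of actual positives that the model detects: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that the model detects: TN / (TN + FP)."""
    return tn / (tn + fp)

# All 10 Class 1 examples are missed; all 9990 Class 0 examples are correct.
tp, fn, tn, fp = 0, 10, 9990, 0

print(sensitivity(tp, fn))
print(specificity(tn, fp))
```

Reporting both values exposes the problem that accuracy alone hides: the model never detects the positive class.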

What is a true negative?

When the model predicts negative and the actual class is also negative. Example: a pregnancy test where the model predicts not pregnant and the person is actually not pregnant.

What is a true positive?

When the model predicts positive and the actual class is also positive. Example: a pregnancy test where the model predicts pregnant and the person is actually pregnant.
