Classification Algorithms


The Random Forest Algorithm is a supervised classification algorithm.

Is the Random Forest Algorithm a Supervised or Unsupervised classification algorithm?

1. Less risk of overfitting than with a single Decision Tree 2. Usually more accurate than a Decision Tree

What are the Pros of using a Random Forest Classification Algorithm?

Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For classification problems, this might be the mode (most common) class value; for regression problems, the mean or median output value. Similarity is typically measured with a distance metric such as Euclidean, Manhattan, or Minkowski distance.

How does the K-Nearest Neighbors Algorithm predict new data points?
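The lookup-and-summarize procedure described above can be sketched in plain Python. This is a minimal illustration, not any particular library's implementation; the function and variable names are my own.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k, task="classification"):
    # Rank every training instance by distance to the query point,
    # then summarize the outputs of the k nearest neighbors.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    outputs = [y for _, y in neighbors]
    if task == "classification":
        return Counter(outputs).most_common(1)[0][0]  # mode class value
    return sum(outputs) / k                           # mean output for regression

X = [(1, 1), (1, 2), (4, 4), (5, 5)]
y = ["a", "a", "b", "b"]
print(knn_predict(X, y, (1.5, 1.5), k=3))  # majority vote among the 3 nearest -> "a"
```

Note that all the work happens at prediction time: there is no training step, which is exactly the 'lazy' behavior described in the cards below.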

Classification is computed from a simple majority vote of the K nearest neighbors of each point.

How does the K-Nearest Neighbors Algorithm classify data?

K-nearest neighbor (K-NN) Algorithm makes predictions based on how close a new data point is to a known data point. K-NN is considered 'lazy' as it does not attempt to construct a general internal model, but simply stores instances of the training data.

How does the K-Nearest Neighbors Algorithm work?

1. No assumptions about data - useful, for example, for nonlinear data 2. Simple - The algorithm is simple to implement. 3. Robust - The algorithm is robust to noise in the data. 4. 'Just in Time' Calculations - The algorithm performs calculations when a prediction is needed, not ahead of time. 5. Versatile - useful for classification or regression

What are the Pros of using a K-Nearest Neighbors algorithm?

1. Instability - A decision tree delivers promising results only if the information is precise and accurate; even a slight change in the input data can cause large changes in the tree. 2. Complexity - If the dataset is huge, with many rows and columns, designing a decision tree with many branches is a very complex task. 3. Overfitting - Can create complex trees that do not generalize well. 4. Costs - Cost can also be a major factor, because constructing a complex decision tree requires advanced knowledge of quantitative and statistical analysis.

What are the Cons of using a Decision Tree?

1. High computational resource demand. 2. Difficult to implement. 3. Somewhat of a 'black box' model, which makes it difficult to explain - you don't know the specifics of the individual Decision Trees used in calculating Random Forest outcomes.

What are the Cons of using a Random Forest Classification Algorithm?

You need to remove attributes unrelated to the output variable or correlated to other attributes. This is not a top-performing classification algorithm.

What are the Cons of using a logistic regression model?

1. Need to determine the value of K. 2. High Computational Cost - It has to compute the distance from each new instance to all the training samples, and you have to hang on to your entire training dataset. 3. "Curse of dimensionality" - Distance measures can break down in very high dimensions, negatively affecting performance. 4. Sensitive to irrelevant features and the scale of the data.

What are the Cons of using the K-Nearest Neighbors Algorithm?

It is a poor estimator outside of ideal problems; its probability estimates degrade when the feature-independence assumption is badly violated.

What are the Cons of using the Naive Bayes algorithm?

1. Simplicity - A decision tree is simple to understand, visualize, and explain. 2. Flexibility - We can build a decision tree on numerical as well as categorical data. 3. Robust - The Decision Tree is a robust model across a broad range of problems. 4. Fast - They are also time-efficient with large data. 5. Little Prep - It requires little data preparation before training.

What are the Pros of using a Decision Tree Algorithm?

Decision Trees are supervised learning because they use labeled data.

Are Decision Trees supervised or unsupervised learning and why?

a classification algorithm

If you wanted to predict whether or not a customer will churn, what subset of machine learning algorithms would you use?

Naive Bayes is built from two types of probabilities calculated from your training data: 1. The probability of each class (the prior). 2. The conditional probability of each x value given each class. Once calculated, this probability model is used to make predictions for new data.

How does the Naive Bayes Algorithm make predictions?
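The two kinds of probabilities above - class priors and per-class conditional probabilities - can be sketched as a tiny Gaussian Naive Bayes, assuming real-valued features follow a bell curve as described in the cards below. This is an illustrative sketch with my own function names, not a library implementation.

```python
import math

def fit(X, y):
    # For each class, store its prior plus a per-feature (mean, variance)
    # pair that parameterizes the Gaussian class-conditional likelihood.
    model = {}
    for cls in set(y):
        rows = [x for x, label in zip(X, y) if label == cls]
        prior = len(rows) / len(X)
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) or 1e-9
            stats.append((mean, var))
        model[cls] = (prior, stats)
    return model

def predict(model, x):
    # P(class | x) is proportional to P(class) * product of P(x_i | class),
    # using log-probabilities for numerical stability.
    best, best_score = None, float("-inf")
    for cls, (prior, stats) in model.items():
        score = math.log(prior)
        for v, (mean, var) in zip(x, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = cls, score
    return best

X = [(1.0,), (1.2,), (5.0,), (5.2,)]
y = ["low", "low", "high", "high"]
model = fit(X, y)
print(predict(model, (1.1,)))  # -> "low"
```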

Random Forest is an Ensemble algorithm - an implementation of bootstrap aggregation (bagging). Bootstrapping is a statistical method for estimating a quantity from a data sample: instead of just taking the mean of your single data set, you take the mean of many different resamples of your data set and average all of those means to give a better estimation of the true mean. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimation of the true output value.

How does the Random Forest Classification Algorithm work?
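The bootstrap idea at the heart of bagging can be shown in miniature: resample the data many times with replacement, compute a statistic on each resample, and average the results. (This illustrates the statistical mechanism only; a real Random Forest fits a decision tree to each resample rather than taking a mean.)

```python
import random

def bootstrap_mean(data, n_resamples=200, seed=0):
    # Draw many resamples (with replacement) from one dataset, compute the
    # mean of each, and average those means for a more stable estimate.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    return sum(means) / n_resamples

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(bootstrap_mean(data))  # lands close to the plain sample mean of 5.0
```

A Random Forest follows the same pattern at prediction time: each tree votes, and the votes are aggregated.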

Naive Bayes uses labeled data to predict probability, so it is supervised machine learning.

Is the Naive Bayes Classification Algorithm Supervised or Unsupervised?

Interpretable: Good for understanding the influence of several independent variables on a single outcome variable. Flexible: You can 'snap' predictions to 1 or 0 based on a rule (like < .5 = 0), or you can use the output as is - a probability of being class 1. Easy to implement: Good for creating a benchmark. Efficient: Does not require many computational resources.

What are the Pros of using a logistic regression model?

1. Works with a smaller sample size of training data than other classifiers. 2. Fast compared to more sophisticated methods. 3. Simple and powerful.

What are the Pros of using the Naive Bayes Algorithm?

Decision Nodes - where the data is split on an attribute. Decision Links - each represents a rule. Decision Leaves - the final outcomes.

What are the main components of a decision tree?

Splitting - The process of partitioning the data into subsets. Splitting can be done on various factors, e.g. on the basis of gender, height, or class. Pruning - The process of shortening the branches of the decision tree, limiting the tree's depth. Tree Selection - The process of finding the smallest tree that fits the data.

What are the steps in modeling with a Decision Tree Algorithm?

The K-NN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other. "Birds of a feather flock together."

What assumption does the K-Nearest Neighbors Algorithm make?

Naive Bayes assumes that all predictive variables are independent of each other. Real-valued data is commonly assumed to follow a Gaussian distribution (bell curve) so that probabilities can be estimated. The assumption of independence between every pair of features is unrealistic for real-world data, yet the algorithm is effective for tasks such as document classification and spam filtering.

What assumptions does the Naive Bayes Algorithm make?

Instead of keeping outcomes as a probability between 0 and 1, you use a rule or condition that decides on a binary value or a yes or no decision. (< .5 = 0, >= .5 = 1)

What does it mean to 'snap' logistic regression predictions to 1 or 0, and how do you do it?
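The snapping rule above is a one-liner in practice. A minimal sketch (the function name is my own):

```python
def snap(probabilities, threshold=0.5):
    # Convert class-1 probabilities into hard 0/1 labels using the rule
    # from the card: p < 0.5 -> 0, p >= 0.5 -> 1.
    return [1 if p >= threshold else 0 for p in probabilities]

print(snap([0.12, 0.5, 0.93]))  # [0, 1, 1]
```

Keeping the threshold as a parameter lets you trade false positives against false negatives rather than always using 0.5.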

Being based on feature similarity means that how closely out-of-sample features resemble the training set determines how a given data point is classified.

What does it mean to say that K-Nearest Neighbors Algorithm is based on feature similarity?

K-NN does not use the training data points to do any generalizations. There is either no explicit training phase or very minimal training. Lack of generalization means that K-NN hangs onto all of the training data because it is needed during the testing phase.

What does it mean to say that K-Nearest Neighbors is a 'lazy' algorithm?

Non-parametric means that the model structure is determined from the data. The model does not assume an underlying data distribution. KNN could and probably should be one of the first choices for a classification study when there is little or no prior knowledge about the distribution of the data.

What does it mean to say that K-Nearest Neighbors is a non-parametric technique, and when is it good to use this technique?

Entropy is the measure of how disordered your data is. It is used in Decision Trees because the ultimate goal of a DT is to tidy the data - to group similar data points into the same classes.

What is entropy in a Decision Tree Algorithm and why is it used?
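The "disorder" measure above is Shannon entropy, which a decision tree tries to drive toward zero with each split. A small sketch:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits: 0 for a pure group, 1 for a 50/50 binary split.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["yes", "yes", "no", "no"]))   # 1.0: a 50/50 split is maximally disordered
print(entropy(["yes", "yes", "yes", "yes"])) # a pure group has zero entropy
```

A split is chosen because it reduces entropy the most, i.e. it produces the tidiest child groups.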

The model is really a regression, so the goal is to find the values for the coefficients that weight each input variable.

What is the goal of the logistic regression model?
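The weighted sum described above is squashed through the sigmoid function so the output lands between 0 and 1. A sketch with hypothetical, already-fitted coefficients (the numbers here are made up purely to show the shape of the model):

```python
import math

def predict_proba(coeffs, intercept, x):
    # Logistic regression: a linear combination of the inputs, passed
    # through the sigmoid to yield a probability between 0 and 1.
    z = intercept + sum(b * v for b, v in zip(coeffs, x))
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients for a two-feature model.
p = predict_proba([0.8, -1.2], 0.1, [2.0, 1.0])
print(round(p, 3))  # 0.622
```

Training is the search for the coefficient values; prediction is just this weighted sum plus the sigmoid.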

A value between 0 and 1 that represents the probability of one class over the other

What is the outcome of a logistic regression?

You should transform your real-valued data toward a normal (bell curve) distribution, because Gaussian Naive Bayes assumes a Gaussian distribution in order to estimate probabilities.

What must you do with your data if you want to model with the Naive Bayes Algorithm?

K-NN is a supervised machine learning algorithm because it relies on labeled input data to learn a function that produces an appropriate output when given new, unlabeled data. (We also set K, the number of neighbors, as a model parameter.)

What type of Machine Learning Algorithm is K-Nearest Neighbors?

a logistic regression algorithm

What type of algorithm is technically a regression algorithm but considered a part of classification algorithms because it predicts binary outcomes?

Classification problems have discrete values as their outputs. The output is a class membership (the model predicts a class - a discrete value), e.g. likes pineapple/does not like pineapple, yes/no, pig/not a pig. The object is assigned to the class most common among its k nearest neighbors. You cannot perform mathematical operations on classification output because the numbers (-1, 0, 1) are purely representational; it would be meaningless to add or subtract them.

What type of output does a classification problem have?

A regression problem has a real number (a number with a decimal point) as its output. Output is the value for the object (predicts continuous values). This value is the average (or median) of the values of its k nearest neighbors. We have an independent variable (or set of independent variables) and a dependent variable (the thing we are trying to guess given our independent variables). Each row is typically called an example, observation, or data point, while each column (not including the label/dependent variable) is often called a predictor, dimension, independent variable, or feature.

What type of output does a regression problem have?

K-NN can solve both classification and regression problems

What types of problems can be solved using K-Nearest Neighbors?

Decision Trees are used to solve either classification or regression problems. They repeatedly split the data into groups until the resulting subsets are as homogeneous as possible.

What types of problems do you use a Decision Tree algorithm to solve?

1. Credit ratings - collecting financial characteristics and comparing people with similar financial features in a database. By the very nature of a credit rating, people who have similar financial details would be given similar credit ratings, so a bank could use the existing database to predict a new customer's credit rating without having to perform all the calculations. e.g. - Should the bank give a loan to an individual? Would an individual default on his or her loan? Is that person closer in characteristics to people who defaulted or did not default on their loans? 2. Political science - classifying a potential voter as "will vote" or "will not vote", or as "vote Democrat" or "vote Republican". 3. More advanced examples include handwriting detection (like OCR), image recognition, and even video recognition.

What would be some applications of K-Nearest Neighbors?

Based on their purchase and browsing history, what promos should I offer to my customers? Learn from IB to develop methods for prospecting new customers

What would be some ideal problems to solve using the Naive Bayes Algorithm?

