Data Mining Chapter 7 - K-Nearest-Neighbor

Summary: For prediction, take the average of the response values of the k nearest neighbors. Choose an appropriate k based on validation error rates or prediction metrics

"Curse of dimensionality" - need to limit # of predictors

Typical range of k:

1 - 20

If the 3 nearest neighbors of a new record are 2 owners and 1 non-owner, what is the probability that the new household is an owner or a non-owner?

2/3 chance of being an owner, 1/3 chance of being a non-owner

What is K nearest neighbor?

Algorithm used for classification (of a categorical outcome) or prediction (of a numerical response)

What does k=1 mean?

Classify a new record as belonging to the same class as its nearest neighbor

What is the classification rule?

Compute the distances between the record to be classified (e.g., a validation record) and the existing records in the training set, then assign the majority class of its k nearest neighbors
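
A minimal sketch of this rule in Python, assuming NumPy arrays of already-normalized training predictors, training labels, and one new record; names such as classify_knn and X_train are illustrative, not from the chapter:

```python
import numpy as np

def classify_knn(X_train, y_train, new_record, k=3):
    """Classify one record by majority vote of its k nearest training neighbors."""
    # Euclidean distance from the new record to every training record
    distances = np.sqrt(((X_train - new_record) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among those neighbors' class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: income and lot size for three owners (1) and a non-owner (0)
X_train = np.array([[60.0, 18.4], [85.5, 16.8], [64.8, 21.6], [51.0, 14.0]])
y_train = np.array([1, 1, 1, 0])
print(classify_knn(X_train, y_train, np.array([60.0, 20.0]), k=3))  # -> 1
```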

KNN is ____, not model-driven

Data-driven

In prediction, what is usually used instead of the misclassification error rate to choose k?

RMSE or average error metric
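
A hedged sketch of that selection for a numerical outcome, using scikit-learn's KNeighborsRegressor to score each candidate k by validation RMSE; the variable names and the pre-made train/validation split are assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Assumes X_train, y_train, X_valid, y_valid are already prepared (and normalized).
def best_k_by_rmse(X_train, y_train, X_valid, y_valid, k_values=range(1, 21)):
    """Return the k with the lowest validation RMSE, plus the RMSE for every k tried."""
    rmse_by_k = {}
    for k in k_values:
        model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
        preds = model.predict(X_valid)
        rmse_by_k[k] = np.sqrt(mean_squared_error(y_valid, preds))
    return min(rmse_by_k, key=rmse_by_k.get), rmse_by_k
```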

What are the advantages of using KNN?

Simple and intuitive; makes no assumptions about the data; can be very powerful with a large training set

How do you measure 'nearby'?

The most popular distance measure is the Euclidean distance between two records' predictor values

KNN makes no _____ about the data

assumptions

k is the number of nearby neighbors to be used to _____

classify the new record

When using weighted average, weight ______ with distance

decreases
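
One common way to make the weight decrease with distance is inverse-distance weighting; a small sketch under that assumption (the eps guard and the names are illustrative):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, new_record, k=5, eps=1e-8):
    """Predict a numerical outcome as a distance-weighted average of the k nearest neighbors."""
    distances = np.sqrt(((X_train - new_record) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Closer neighbors get larger weights; eps guards against division by zero
    weights = 1.0 / (distances[nearest] + eps)
    return np.average(y_train[nearest], weights=weights)
```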

Instead of relying on fitting a model like linear regression, KNN relies on ____

finding "similar" records in the training data

A drawback of using KNN is that the required size of training set ____ with # of predictors, p

increases exponentially

What is the simplest case?

k = 1, the one-nearest-neighbor rule.

Nearest neighbor means the same as

lowest Euclidean distance

For k > 1, use a ____ decision rule to classify a record

majority

For the cut-off value: classification is based on _____.

the majority decision rule; "majority" is closely linked to the cut-off value

High values of k provide more ___, less __, but may miss ____

more smoothing, less noise, but may miss local structure

To remove bias, put the _____ values in the formula

normalized

What does near mean?

records with similar predictor values X1, X2, ... Xp

How do you calculate the Euclidean distance?

square the differences between corresponding predictor values, sum them, and then take the square root of the total
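
A worked sketch of that recipe (the record values are made up for illustration):

```python
import numpy as np

def euclidean_distance(record_a, record_b):
    """Square the differences, sum them, then take the square root."""
    diffs = np.asarray(record_a) - np.asarray(record_b)
    return np.sqrt((diffs ** 2).sum())

# Example: two records with predictor values (X1, X2)
print(euclidean_distance([60.0, 18.4], [64.8, 21.6]))  # sqrt(4.8^2 + 3.2^2) ≈ 5.77
```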

k=5 means use ____

the 5 nearest records

k = N is ___

the entire data set

Proportion of k neighbors that belong to a class is an estimate of ____

the probability of a new record belonging to that class
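
A brief sketch of that probability estimate, reusing the same distance idea (the names are illustrative):

```python
import numpy as np

def knn_class_probabilities(X_train, y_train, new_record, k=3):
    """Estimate P(class) for a new record as the proportion of its k neighbors in each class."""
    distances = np.sqrt(((X_train - new_record) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return dict(zip(labels, counts / k))

# With 2 owner and 1 non-owner neighbors this returns {0: 1/3, 1: 2/3},
# matching the 2/3 owner / 1/3 non-owner example above.
```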

When using KNN for prediction of a numerical outcome, what is used instead of a majority vote?

the average of the response values of the k nearest neighbors, which becomes the predicted value for the validation record
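
A minimal sketch of that plain (unweighted) average, with illustrative names; compare the distance-weighted version sketched earlier:

```python
import numpy as np

def knn_predict_average(X_train, y_train, new_record, k=5):
    """Predict a numerical outcome as the plain average of the k nearest responses."""
    distances = np.sqrt(((X_train - new_record) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()
```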

Low values of k capture _____ but also ____

local structure in data but also noise

Typically choose that value of k which has ____

lowest error rate in validation data
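
A hedged sketch of that choice for classification, using scikit-learn and a pre-made validation split (the variable names are placeholders):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Assumes X_train, y_train, X_valid, y_valid are already prepared (and normalized).
def best_k_by_error_rate(X_train, y_train, X_valid, y_valid, k_values=range(1, 21)):
    """Return the k with the lowest validation misclassification error rate."""
    error_by_k = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        # Misclassification error rate = 1 - accuracy on the validation data
        error_by_k[k] = 1 - accuracy_score(y_valid, model.predict(X_valid))
    return min(error_by_k, key=error_by_k.get), error_by_k
```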

In KNN, Classify the record as whatever the _____ is among the nearby records, a.k.a the "neighbors"

predominant class

Majority decision rule says the record is classified as a member of ____

the majority class of the k neighbors

What is the basic idea of KNN for a categorical outcome?

For a given record to be classified, identify nearby records - each record consists of an outcome, or class membership, and predictor values

What is a drawback of using KNN with a large training set?

It takes a long time to find distances to all the neighbors and then identify the nearest ones.

Classification results can be reported as:

Misclassification error rate or Classification/confusion matrix

Summary: Find the distance between the record to be classified and all records in the training set

Select the k nearest records, then classify the record according to a majority vote of those neighbors

When k = N (the entire data set is used), what rule do you use?

The "naïve rule": classify all records as belonging to the majority class.

Sometimes predictors need to be standardized to ____ scales before computing distances

equalize
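
A small sketch of z-score standardization to equalize scales, fitting the scaling on the training data only (that train-only fitting and the names are assumptions, not from the chapter):

```python
import numpy as np

def standardize(X_train, X_new):
    """Equalize predictor scales: subtract the training mean, divide by the training std."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / std, (X_new - mean) / std
```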

