Data Mining Chapter 7 - K-Nearest-Neighbor
Summary: For prediction, take the average of the responses of the k nearest neighbors. Choose an appropriate k based on validation error rates/prediction metrics
"Curse of dimensionality" - need to limit # of predictors
Typical range of k:
1 - 20
If the 3 nearest neighbors of a new record are 2 owners and 1 non-owner, what is the probability of the new household being an owner or non-owner?
2/3 chance of being owner, 1/3 for non-owner
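A minimal sketch of how those proportions fall out of the neighbor labels (the 2-owner/1-non-owner labels below are the hypothetical example from the card above):

```python
from collections import Counter

# Hypothetical labels of the 3 nearest neighbors (2 owners, 1 non-owner)
neighbor_labels = ["owner", "owner", "non-owner"]

# The proportion of neighbors in each class estimates the class probability
probs = {label: n / len(neighbor_labels) for label, n in Counter(neighbor_labels).items()}
print(probs)  # {'owner': 0.666..., 'non-owner': 0.333...}
```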
What is K nearest neighbor?
Algorithm used for classification (of a categorical outcome) or prediction (of a numerical response)
What does k=1 mean?
Classify a new record as belonging to the same class as its nearest neighbor
What is the classification rule?
Compute the distances between the record to be classified (e.g., a validation record) and the existing records in the training set, then classify it according to its k nearest neighbors
KNN is ____, not model-driven
Data-driven
In prediction, what is usually used instead of the misclassification error rate to choose k?
RMSE or average error metric
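A quick sketch of computing RMSE on a validation set (the actual/predicted values below are made up for illustration):

```python
import math

# Hypothetical actual and predicted responses on the validation set
actual    = [10.0, 12.5, 9.0, 11.0]
predicted = [9.5, 13.0, 8.0, 11.5]

# RMSE: square the errors, average them, take the square root
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(rmse)
```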
What are the advantages of using KNN?
Simple and intuitive; no assumptions about the data; can be very powerful with a large training set
How do you measure 'nearby'?
The most popular distance measure is the Euclidean distance between the two records' predictor values
k.n.n. Makes no _____ about the data
assumptions
k is the number of nearby neighbors to be used to _____
classify the new record
When using weighted average, weight ______ with distance
decreases
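A sketch of one common weighting scheme, inverse-distance weighting, where weight decreases as distance grows (the distances and responses below are hypothetical):

```python
# Hypothetical distances and responses of the k nearest neighbors
distances = [0.5, 1.0, 2.0]
responses = [100.0, 120.0, 90.0]

# Inverse-distance weights: weight decreases as distance increases
# (a small constant could be added to guard against a zero distance)
weights = [1.0 / d for d in distances]
prediction = sum(w * y for w, y in zip(weights, responses)) / sum(weights)
print(prediction)  # 104.28..., pulled toward the closest neighbor
```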
Instead of relying on fitting a model like linear regression, KNN relies on ____
finding "similar" records in the training data
A drawback of using KNN is that the required size of training set ____ with # of predictors, p
increases exponentially
What is the Simplest case?
k=1, One-nearest neighbor.
Nearest neighbor means the same as
lowest Euclidean distance
For k > 1, use a ____ decision rule to classify a record
majority
For the cut-off value, classification is based on _____.
the majority decision rule; "majority" corresponds to a cut-off value of 0.5 on the proportion of neighbors in a class, and that cut-off can be adjusted
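A sketch of classification with an adjustable cut-off; a cut-off of 0.5 reproduces the majority rule (the helper name and labels are hypothetical):

```python
def classify_by_cutoff(neighbor_labels, target_class, cutoff=0.5):
    # Share of the k neighbors belonging to the target class
    share = neighbor_labels.count(target_class) / len(neighbor_labels)
    return target_class if share >= cutoff else "other"

print(classify_by_cutoff(["owner", "owner", "non-owner"], "owner"))        # owner
print(classify_by_cutoff(["owner", "owner", "non-owner"], "owner", 0.75))  # other
```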
High values of k provide more ___, less __, but may miss ____
more smoothing, less noise, but may miss local structure
To remove scale bias, put the _____ values in the distance formula
normalized
What does near mean?
records with similar predictor values X1, X2, ... Xp
How do you calculate the Euclidean distance?
square the differences between the two records' predictor values, sum them, and then take the square root of the total
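That recipe, as a small sketch:

```python
import math

def euclidean_distance(record_a, record_b):
    # Square the differences, sum them, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(record_a, record_b)))

print(euclidean_distance([3.0, 4.0], [0.0, 0.0]))  # 5.0
```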
k=5 means use ____
the 5 nearest records
k = N is ___
the entire data set
Proportion of k neighbors that belong to a class is an estimate of ____
the probability of a new record belonging to that class
When using k-NN for prediction of a numerical outcome, what is used instead of a majority vote?
use the average of the response values from the k nearest neighbors as the predicted value for the validation record
Low values of k capture _____ but also ____
local structure in data but also noise
Typically choose that value of k which has ____
lowest error rate in validation data
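A sketch of that selection step; the per-k validation error rates below are made up, and in practice each would come from scoring k-NN on the validation data:

```python
# Hypothetical validation error rates for a handful of k values
validation_error = {1: 0.30, 3: 0.22, 5: 0.18, 7: 0.20, 9: 0.25}

# Pick the k with the lowest validation error
best_k = min(validation_error, key=validation_error.get)
print(best_k)  # 5
```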
In k-NN, classify the record as whatever the _____ is among the nearby records, a.k.a. the "neighbors"
predominant class
Majority decision rule says the record is classified as a member of ____
the majority class of the k neighbors
What is k.n.n. basic idea of categorical outcome?
For a given record to be classified, identify nearby records - each record consists of an outcome, or class membership, and predictor values
What is a drawback of using KNN with a large training set?
It takes a long time to find distances to all the neighbors and then identify the nearest ones.
Classification results can be reported as:
Misclassification error rate or Classification/confusion matrix
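A sketch of both reports, computed from hypothetical validation results:

```python
from collections import Counter

# Hypothetical actual vs. predicted classes on the validation set
actual    = ["owner", "owner", "non-owner", "non-owner", "owner"]
predicted = ["owner", "non-owner", "non-owner", "owner", "owner"]

# Confusion matrix as counts of (actual, predicted) pairs
confusion = Counter(zip(actual, predicted))
misclassification_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(dict(confusion))
print(misclassification_rate)  # 0.4
```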
Summary: Find distance between record-to-be-classified and all other records in the training set
Select the k nearest records, then classify it according to the majority vote of those neighbors
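Putting the whole summary together, a minimal self-contained sketch (the toy training set is hypothetical):

```python
import math
from collections import Counter

def knn_classify(new_record, training_records, training_labels, k=3):
    # 1. Find the distance from the new record to every training record
    distances = sorted(
        (math.dist(new_record, rec), label)
        for rec, label in zip(training_records, training_labels)
    )
    # 2. Select the k nearest; 3. take the majority vote of their labels
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy training set
X = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
y = ["owner", "owner", "non-owner", "non-owner"]
print(knn_classify([2.0, 1.5], X, y, k=3))  # owner
```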
When you have the entire data set, what rule do you use?
The "naïve rule"--classify all records according to majority class.
Sometimes predictors need to be standardized to ____ scales before computing distances
equalize
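A sketch of z-score standardization for one predictor column (the income values are hypothetical; a predictor on a large scale would otherwise dominate the distance):

```python
import statistics

def standardize(column):
    # Rescale one predictor to zero mean and unit standard deviation
    mu = statistics.mean(column)
    sigma = statistics.stdev(column)
    return [(x - mu) / sigma for x in column]

# Hypothetical predictor measured on a much larger scale than the others
income = [40000.0, 55000.0, 70000.0, 85000.0]
print(standardize(income))
```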