K-NN
Dominance of Attributes
The distance between neighbors can be dominated by attributes measured on relatively large numeric scales. Because of this, it is important to rescale the data.
How else can distance be measured?
Euclidean, Manhattan, Minkowski
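A minimal sketch of the three distance measures named above, written with NumPy; the example points a and b are made up for illustration, and Minkowski distance of order p reduces to Manhattan at p=1 and Euclidean at p=2.

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

def minkowski(x, y, p):
    # Minkowski distance of order p; p=1 gives Manhattan, p=2 gives Euclidean.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print("Euclidean:", minkowski(a, b, 2))     # sqrt((1-4)^2 + (2-0)^2 + 0) = sqrt(13)
print("Manhattan:", minkowski(a, b, 1))     # |1-4| + |2-0| + 0 = 5
print("Minkowski (p=3):", minkowski(a, b, 3))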
KNN Summary
-can be applied to data from any distribution
-very simple and intuitive
-good classification if the number of samples is large enough
-choosing the best K may be difficult
-computationally heavy, but improvement is possible
-need a large number of samples for accuracy
What are the disadvantages of using KNN?
-need a lot of space to store all examples
-difficult to explain the "knowledge" that has been mined from the data
-with distance-based learning, it is not clear which type of distance to use or which attributes will produce the best results (should we use all attributes, or only certain attributes?)
-takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)
What are the advantages of using KNN?
-simple to implement and use
-comprehensible - easy to explain a prediction
-robust to noisy data by averaging the k nearest neighbors
-effective if the training data is large
Finding the Best K (more in depth)
-split your data into training, test, and validation sets
-use the training data and a range of k values from 1 to 10 to classify the observations in the test data set
-the best value of k is defined to be the one that resulted in the largest average correct classification rate in the test data set
-recombine the training and test datasets back into one dataset; this value of k and this dataset are then used to classify the validation data (see the sketch below)
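A hedged sketch of that procedure using scikit-learn; the Iris dataset, the split sizes, and the k range 1..10 are assumptions chosen only for illustration.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Hold out a validation set, then split the remainder into training and test sets.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Try k = 1..10 and keep the k with the highest correct classification rate on the test set.
scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_test, y_test)
best_k = max(scores, key=scores.get)

# Recombine training and test data, then evaluate the chosen k on the validation set.
X_combined = np.vstack([X_train, X_test])
y_combined = np.concatenate([y_train, y_test])
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_combined, y_combined)
print("best k:", best_k, "validation accuracy:", final_model.score(X_val, y_val))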
How do we classify using KNN?
...
How do you measure the distance between two members of our dataset?
...
What is instance-based (memory based) learning?
...
K-NN Algorithm
1. Determine the parameter K = the number of nearest neighbors
2. Calculate the distance between the query instance and all the training samples
3. Sort the distances and determine the nearest neighbors based on the K-th minimum distance
4. Gather the category Y of the nearest neighbors
5. Use the simple majority of the categories of the nearest neighbors as the prediction value for the query instance
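A minimal from-scratch sketch of the five steps above, assuming Euclidean distance and a simple majority vote; the training points, labels, and query are made up.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k):
    # Step 2: distance from the query instance to every training sample.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 3: sort distances and take the indices of the K nearest samples.
    nearest = np.argsort(distances)[:k]
    # Step 4: gather the categories (labels) of the nearest neighbors.
    labels = y_train[nearest]
    # Step 5: simple majority vote as the prediction.
    return Counter(labels).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.2], [5.8, 6.1]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"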
How can we best choose/weight attributes?
So far, KNN predictions have been determined by a simple majority vote. An alternative is to give greater weight to the more similar neighbors and less weight to those that are further away, which also helps avoid ties. The weighted score is then used to choose the class of the new record. Similarity weight: 1/(distance^2). Determine the contribution of each neighbor by summing the similarity weights and dividing each similarity weight by this sum; then sum the contributions for each possible response and assign the class with the largest total. Determining the best K is less important when weighting, because the contribution of each neighbor is moderated by its distance. Some attributes may be more important than others in determining the best classification. This can be handled by removing attributes from the classification model that do not help the accuracy, or by using weights on the attributes that place greater emphasis on the significant attributes.
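A hedged sketch of the distance-weighted vote described above: each neighbor contributes 1/(distance^2), the weights are normalized to sum to 1, and the class with the largest total contribution wins. The small eps guard against zero distances and the toy data are my own additions for illustration.

import numpy as np

def weighted_knn_classify(X_train, y_train, query, k, eps=1e-12):
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    weights = 1.0 / (distances[nearest] ** 2 + eps)  # similarity weight 1/d^2 (eps avoids division by zero)
    weights = weights / weights.sum()                # each neighbor's contribution
    totals = {}
    for label, w in zip(y_train[nearest], weights):
        totals[label] = totals.get(label, 0.0) + w   # sum contributions per class
    return max(totals, key=totals.get)               # class with the largest total

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.2], [5.8, 6.1]])
y_train = np.array(["A", "A", "B", "B"])
print(weighted_knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"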
Why and how do we rescale data?
Because of the dominance of attributes: rescaling allows parameters to be compared fairly. Two methods used to rescale data are normalization and standardization.
How do we choose the best k?
In theory, when an infinite number of samples is available, the larger the k, the better the classification. The caveat is that all k neighbors have to be close to x, which is possible when infinite samples are available but impossible in practice because the number of samples is finite. In practice: 1. k should be large enough that the error rate is minimized (k too small will lead to noisy decision boundaries) 2. k should be small enough that only nearby samples are included (k too large will lead to over-smoothed boundaries)
Issues with Nearest Neighbor Methods
Intelligibility - a decision can be explained by the contributions of the neighbors, but NN methods lack a specific decision model
Dimensionality - having too many irrelevant attributes may confuse distance calculations
Computational Efficiency - querying the database for predictions can be very expensive
KNN
K-Nearest Neighbors (K-NN) is a classification model. With this model, a new example is assigned to the most common class among the K examples that are most similar to it.
What is the "curse of dimensionality?"
Wrong classification due to the presence of many irrelevant attributes. Distance usually depends on all the attributes and assumes all of them have the same effect on distance. The similarity metrics do not consider the relative importance of attributes, which results in inaccurate distances and hurts classification precision. In high dimensions, if there are a lot of irrelevant features, normalization will not help: the Euclidean distance will be dominated by noise.
How can KNN be used for prediction?
Given a new example whose target variable we want to predict, we scan through all the training examples and choose several that are most similar to it. Then we predict the new example's target value based on the nearest neighbors' known target values. Classification: look at the nearest neighbors and derive the target class for the new example. Regression: derive the target value from the mean or median of the neighbors.
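A minimal sketch of the regression case: the predicted target is the mean (or median) of the k nearest neighbors' known target values. The toy one-dimensional data are made up for illustration.

import numpy as np

def knn_regress(X_train, y_train, query, k, use_median=False):
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Predict from the neighbors' known target values.
    return np.median(neighbor_targets) if use_median else neighbor_targets.mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.5, 2.5, 3.5, 12.0])
print(knn_regress(X_train, y_train, np.array([2.2]), k=3))  # mean of 1.5, 2.5, 3.5 = 2.5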
Rescaling Data
Normalization of the data is very important when dealing with parameters of different units and scales. All parameters should be on the same scale for a fair comparison between them. (Two methods: normalization and standardization.)
Normalization
Scales all numeric variables to the range [0, 1] (formula in slides); however, if there are outliers in your set, normalizing the data will scale the "normal" data to a very small interval.
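A short sketch assuming the usual min-max form, x' = (x - min) / (max - min); the exact formula in the slides may differ, and the sample values are made up.

import numpy as np

x = np.array([10.0, 20.0, 30.0, 100.0])
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # the outlier 100 squeezes the other values into a small interval near 0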
What are similarity and distance?
Similarity is at the core of many data mining methods: if two things are similar in some ways, they often share other characteristics as well. (Some business cases use similarity for classification and regression, grouping similar items into clusters, providing recommendations to people, and reasoning from similar cases.) Typically, distances between data objects are used to determine similarity. Distance is just a number with no units; it has no meaningful interpretation on its own, but it is useful for comparing the similarity of one pair of instances with that of another pair.
Standardization
Will transform the dataset to have a mean of 0 and a variance of 1 (equation in slides); when using standardization, your data are not bounded (unlike normalization).
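A short sketch assuming the usual z-score form, z = (x - mean) / std; the exact equation in the slides may differ, and the sample values are made up.

import numpy as np

x = np.array([10.0, 20.0, 30.0, 100.0])
z = (x - x.mean()) / x.std()
print(z.mean(), z.var())  # approximately 0 and 1; the values are not bounded to [0, 1]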