Quiz 3: Similarity Based Learning
Levenshtein Edit Distance
-Minimum number of edits (insertions, deletions and substitutions) to convert one term into another -Example: cyanotic to cyanosis takes 2 changes
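A minimal Python sketch of the Levenshtein dynamic program (the function name levenshtein is only illustrative), which confirms the cyanotic/cyanosis example:

def levenshtein(a, b):
    # dp[i][j] = minimum edits to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                      # delete everything in a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                      # insert everything in b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(a)][len(b)]

print(levenshtein("cyanotic", "cyanosis"))  # 2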
Lazy Learning
-Another name for similarity-based learning -So called because no work is done during training and all the work is done during testing
Euclidean Distance disadvantages for Time Series
-The two series need to be the same length -Sensitive to misalignments or shifts in time -Dynamic Time Warping is used instead
Similarity-Based Learning
-Does not involve construction of an explicit function relating features to target (i.e. regression equation) -Classifies new instances based on direct comparison to known training instances -Training is very easy, just memorizing training instances -Testing can be computationally expensive, requiring comparison to all past training instances
Right value of K
-Determined empirically by trying different values, using a hold-out set or cross-validation within the training data -If K is too low, the model can be sensitive to noise and lead to overfitting -If K is too high, it can lose the pattern in the data and lead to underfitting
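A rough sketch of picking k empirically with cross-validation, assuming scikit-learn is available; the candidate k values and the iris dataset are only placeholders:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # any labeled dataset works here

# Score several candidate values of k with 5-fold cross-validation
# and keep the one with the best mean accuracy.
scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)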
Data Normalization
-KNN is very sensitive to this -Features with larger numeric ranges should not dominate distance calculations, so numeric values must be normalized to the range 0 to 1
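A small sketch of min-max normalization to the 0-1 range (the feature columns are made up):

import numpy as np

X = np.array([[50.0, 1.2],    # e.g. weight in kg, height in m
              [90.0, 1.9],
              [70.0, 1.6]])

# Rescale each column to [0, 1] so no single feature dominates the distance.
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)
print(X_norm)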
Nearest Neighbor Method
-Might be the easiest machine learning method -Classifies new instances based on direct comparison to known training instances by calculating the distance between the test point and every training instance -Picks the closest training example and assigns the test instance to the same category as that closest example
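A minimal 1-nearest-neighbor sketch using Euclidean distance (the toy training data is invented):

import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_test):
    # Distance from the test point to every stored training instance.
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Assign the class of the single closest training example.
    return y_train[np.argmin(dists)]

X_train = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]])
y_train = np.array(["healthy", "sick", "healthy"])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.15, 0.18])))  # healthy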
Implicit Classification Function
-Not explicitly calculated -A classification function learned from the regions of the feature space carved out by the training examples -Can be seen as creating decision boundaries in the feature space, like regions on a map
Similarity Measures for Binary (T/F) Features
-Russell-Rao -Sokal-Michener -Jaccard Similarity
Similarity (vs Distance)
-The smaller the distance, the more similar two points are -Similarity = 1 - distance (where 1 is the maximum possible distance)
Notes on K-Nearest Neighbors
-Training takes no time -Testing an example requires computing distances to all TRAINING examples, which can be computationally expensive -May not be suited for real-time predictions
K is typically an odd number (nearest neighbor)
-True -Done so classification ties can be broken easily when taking the majority vote
Categorical (Nominal) Features
-Use Manhattan distance -Treat the difference between two category values as 0 if they are the same and 1 if they are different
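A tiny sketch of that rule for purely nominal features (the feature values are invented):

def categorical_manhattan(a, b):
    # Each mismatching category contributes 1, each match contributes 0.
    return sum(0 if x == y else 1 for x, y in zip(a, b))

print(categorical_manhattan(["red", "sedan", "manual"],
                            ["red", "suv", "manual"]))  # 1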
Edit Distance
-Used to measure distance between strings of unbounded lengths -Computes how many edits will change one string to another -Detects similarity between words, sentences, phrases, and even DNA sequences
Weighted K-Nearest Neighbors
-A way to avoid the drawbacks of high K values -Farther neighbors receive less weight -Closer neighbors receive more weight
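A sketch of distance-weighted voting, where each of the k neighbors votes with weight 1/distance (the toy data is invented):

import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_test, k=3):
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # Closer neighbors get larger weights; the epsilon avoids division by zero.
        votes[y_train[i]] += 1.0 / (dists[i] + 1e-9)
    return max(votes, key=votes.get)

X_train = np.array([[0.1, 0.1], [0.2, 0.3], [0.9, 0.9], [0.8, 0.7]])
y_train = np.array(["A", "A", "B", "B"])
print(weighted_knn_predict(X_train, y_train, np.array([0.25, 0.25])))  # A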
Cosine Similarity
-If documents can be represented as vectors, the cosine of the angle between those vectors represents how similar they are -Best used when the ratios of feature values matter rather than their scale -Particularly useful with text data since documents can differ in length
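A minimal cosine-similarity sketch on bag-of-words style count vectors (the vectors are made up):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; it ignores vector length,
    # so a short and a long document with the same word ratios look alike.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([2, 0, 1, 3])   # word counts for a short document
doc2 = np.array([4, 0, 2, 6])   # a longer document, same proportions
print(cosine_similarity(doc1, doc2))  # ~1.0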
Manhattan Distance
A measure of travel through a grid system, like navigating around the buildings and blocks of Manhattan, NYC. = Sum(abs(Ai - Bi))
Cosine similarity is good for which type of data
Data in which the ratio and not the scale of feature values matter
To measure distance between two time series, how does dynamic time warping (DTW) distance overcome the disadvantages of Euclidean distance?
By first finding the best alignment between the two time series
Euclidean distance is the best distance measure for all datasets
False
Dynamic Time Warping
Computes the best alignment between the two series before computing the distance
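A hedged sketch of the classic DTW dynamic program, with no windowing or other refinements (the series values are arbitrary):

import numpy as np

def dtw_distance(s, t):
    n, m = len(s), len(t)
    # dtw[i, j] = cost of the best alignment of s[:i] with t[:j]
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # Extend the cheapest of: match both, stretch s, stretch t.
            dtw[i, j] = cost + min(dtw[i - 1, j - 1],
                                   dtw[i - 1, j],
                                   dtw[i, j - 1])
    return dtw[n, m]

# The second series is a shifted, longer version of the first; DTW still
# aligns them even though they are not the same length.
print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2, 1, 0]))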
Which distance/similarity will be best suited to find closest word(s) for auto-spelling correction?
Edit distance
Numeric Distance measures
Euclidean and Manhattan distances
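Both numeric distances in a short NumPy sketch (the example points are arbitrary):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # grid / taxicab distance
print(euclidean, manhattan)                 # ~3.742 6.0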
Russell-Rao similarity
Fraction of features that matched as true (true in both examples)
Sokal-Michener similarity
Fraction of features that matched as either true or false (same value in both examples)
Jaccard Similarity
Fraction of features that matched as true, excluding the features that are false in both examples
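A sketch computing all three binary similarities on two boolean feature vectors (the example vectors are invented):

def binary_similarities(a, b):
    n = len(a)
    both_true  = sum(1 for x, y in zip(a, b) if x and y)
    both_false = sum(1 for x, y in zip(a, b) if not x and not y)
    russell_rao    = both_true / n
    sokal_michener = (both_true + both_false) / n
    jaccard        = both_true / (n - both_false)   # shared falses are ignored
    return russell_rao, sokal_michener, jaccard

a = [True, True, False, False, True]
b = [True, False, False, False, True]
print(binary_similarities(a, b))  # (0.4, 0.8, 0.666...)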
Nearest Neighbor method is NOT an example of:
Information-based learning
Other names for Similarity-Based Learning
Instance-based, case-based, exemplar-based, memory-based, lazy learning
K-nearest neighbors can be extended for regression by
Predicting the average target values of the k nearest neighbors
Numeric Targets in K-Nearest Neighbors
Predicting the average target value of the k nearest neighbors; a distance-weighted average can also be used
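A sketch of k-NN regression returning either a plain or a distance-weighted average of the neighbors' targets (the toy data is invented):

import numpy as np

def knn_regress(X_train, y_train, x_test, k=3, weighted=False):
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    if weighted:
        # Closer neighbors contribute more to the prediction.
        w = 1.0 / (dists[nearest] + 1e-9)
        return np.sum(w * y_train[nearest]) / np.sum(w)
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.5, 2.5, 3.5, 11.0])
print(knn_regress(X_train, y_train, np.array([2.2])))                  # plain average
print(knn_regress(X_train, y_train, np.array([2.2]), weighted=True))   # weighted average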
All features should be normalized to take values between 0 and 1 to
Prevent some features from numerically dominating other features
For all binary features, fraction of features that matched is called
Sokal-Michener similarity
Nearest neighbor incorrect classification
The solution is to find the k nearest neighbors and assign the majority class among them, where k is a parameter
What is true about k-nearest neighbors?
The right value of k for a dataset should be determined empirically (by experimentation, e.g., with a hold-out set or cross-validation)
K-nearest neighbor when used for classification task implicitly defines a classification function
True
Euclidean distance
the straight-line distance, or shortest possible path, between two points = SqRt(Sum((Ai - Bi)^2)) From lecture, with two features A and B: = sqrt((A_test - A_train)^2 + (B_test - B_train)^2)