Quiz 3: Similarity-Based Learning

Levenshtein Edit Distance

-Minimum number of single-character edits (insertions, deletions and substitutions) needed to convert one term into another -Example: cyanotic to cyanosis takes 2 changes (substitute t with s, then the final c with s)
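
The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not a tuned implementation; the function name `levenshtein` is chosen here for the example, and every edit is assumed to cost 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]

print(levenshtein("cyanotic", "cyanosis"))  # prints 2
```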

Lazy Learning

-Another name for similarity-based learning -So called because almost no work is done during training (instances are just stored) and all the work is done during testing

Euclidean Distance disadvantages for Time Series

-Both series need to be the same length -Sensitive to mismatches or shifts along the time axis -Use Dynamic Time Warping instead

Similarity-Based Learning

-Does not involve construction of an explicit function relating features to target (e.g. a regression equation) -Classifies new instances based on direct comparison to known training instances -Training is very easy, just memorizing training instances -Testing can be computationally expensive, requiring comparison to all stored training instances

Right value of K

-Empirically determined out of different values using a hold-out set or through cross-validation within the training data -If K is too low, it can be sensitive to noise and lead to overfitting -If K is too high, it can lose the pattern in the data and lead to underfitting
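
One way to pick K empirically is leave-one-out cross-validation within the training data. The sketch below is only an illustration: the helper names `knn_predict` and `leave_one_out_accuracy` and the toy data are made up for this example, and squared Euclidean distance is assumed.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training examples closest to query."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Hold out each example in turn and classify it from the rest."""
    hits = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) == label:
            hits += 1
    return hits / len(data)

# Hypothetical toy data: two well-separated 1-D clusters of three points each.
data = [((1.0,), "A"), ((2.0,), "A"), ((3.0,), "A"),
        ((7.0,), "B"), ((8.0,), "B"), ((9.0,), "B")]

for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(data, k))
```

On this toy data, k = 1 and k = 3 classify every held-out point correctly, while k = 5 lets the three points of the opposite cluster outvote the two remaining points of the correct one, which is the underfitting effect of a K that is too high.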

Data Normalization

-KNN is very sensitive to this -Features with larger numeric ranges should not dominate distance calculations, so numeric values are normalized to the range 0 to 1
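
A minimal sketch of min-max normalization to the range 0 to 1, applied column by column. The function name `min_max_normalize` and the age/income rows are hypothetical; a constant column is mapped to 0.0 here as an assumption to avoid dividing by zero.

```python
def min_max_normalize(rows):
    """Rescale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # assumption: constant column -> 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical rows of (age, income): income would dominate any raw distance.
rows = [[20, 30000], [40, 90000], [30, 60000]]
print(min_max_normalize(rows))  # [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
```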

Nearest Neighbor Method

-Might be the easiest machine learning method -Classifies new instances based on direct comparison to known training instances by calculating the distance between test point and every training instance -Picks the closest training example and assigns the test instance to the same category as closest training example
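
The procedure above fits in a few lines. This sketch assumes numeric features, Euclidean distance via the standard library's `math.dist`, and a hypothetical `train` list of (features, label) pairs.

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training example closest to query."""
    # Compare the query against every stored training instance.
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label

# Hypothetical training instances.
train = [((1.0, 1.0), "a"), ((5.0, 5.0), "b")]
print(nearest_neighbor(train, (1.5, 2.0)))  # prints a
```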

Implicit Classification Function

-Not explicitly calculated -Learned classification function based on regions of the feature space divided by training examples -Can be seen as drawing decision boundaries in the feature space, like regions on a map

Similarity Measures for Binary (T/F) Features

-Russell-Rao -Sokal-Michener -Jaccard Similarity

Similarity (vs Distance)

-The smaller the distance between two points, the more similar they are -Similarity = 1 - distance (if 1 is the maximum distance)

Notes on K-Nearest Neighbors

-Training takes no time -Classifying test examples requires computing distances to all TRAINING examples, which can be computationally expensive -May not be suited for real-time predictions

K is typically an odd number (nearest neighbor)

-True -Done so that classification ties are avoided and a majority class can be found (guaranteed only when there are two classes)

Categorical (Nominal) Features

-Use Manhattan distance -Assume the difference between two category values is 0 if they are the same and 1 if they are different
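
A small sketch of Manhattan distance over categorical features, using the usual convention that matching values contribute 0 and differing values contribute 1. The name `categorical_distance` and the example records are hypothetical.

```python
def categorical_distance(a, b):
    """Sum of per-feature differences: 0 for equal categories, 1 for different ones."""
    return sum(0 if x == y else 1 for x, y in zip(a, b))

# Hypothetical records of (color, body type, transmission).
print(categorical_distance(("red", "suv", "manual"), ("red", "sedan", "auto")))  # prints 2
```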

Edit Distance

-Used to measure distance between strings of arbitrary length -Counts how many edits are needed to change one string into another -Detects similarity between words, sentences, phrases, and even DNA sequences

Weighted K-Nearest Neighbors

-Way to avoid the drawbacks of high K values -Farther neighbors receive less weight -Closer neighbors receive more weight
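
A sketch of distance-weighted voting. The weighting scheme is not specified above; inverse distance (1 / d) is assumed here as one common choice, and `weighted_knn_predict`, `eps` and the toy data are made up for the example.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train, query, k, eps=1e-9):
    """Each of the k nearest neighbors votes with weight 1 / distance (an assumption)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        # eps avoids division by zero when the query equals a training point.
        votes[label] += 1.0 / (math.dist(features, query) + eps)
    return max(votes, key=votes.get)

# Hypothetical 1-D data: one close "a" and two farther "b" neighbors.
train = [((0.0,), "a"), ((4.0,), "b"), ((5.0,), "b")]
print(weighted_knn_predict(train, (1.0,), k=3))  # prints a
```

With k = 3 a plain majority vote would return "b" (two votes to one), but the single close neighbor carries weight 1 against roughly 0.58 for the two distant ones combined.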

Cosine Similarity

-If documents can be represented as vectors, the cosine of the angle between those vectors represents how similar they are -Best used when the ratios between feature values matter rather than their magnitudes -Particularly useful with text data since documents can differ in length
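
A minimal sketch of cosine similarity as the dot product divided by the product of the vector lengths. It assumes both vectors are non-zero; the word-count vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word counts: the second document is the first one repeated twice.
short_doc = [1, 2, 3]
long_doc = [2, 4, 6]
print(cosine_similarity(short_doc, long_doc))
```

Because the two count vectors have the same ratios between features, their similarity is 1 (up to floating-point rounding) even though the documents differ in length.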

Manhattan Distance

The distance travelled through a grid system, like navigating around the buildings and blocks of Manhattan, NYC. = Sum(abs(Ai - Bi))

Cosine similarity is good for which type of data

Data in which the ratios between feature values, not their scale, matter

To measure distance between two time series, how does dynamic time warping (DTW) distance overcome the disadvantages of Euclidean distance?

By first finding the best alignment between the two time series

Euclidean distance is the best distance measure for all datasets

False

Dynamic Time Warping

Computes the best alignment between the two series before computing the distance
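
A sketch of the classic dynamic-programming form of DTW, assuming absolute difference as the per-point cost and no window constraint. The name `dtw_distance` and the two series are chosen for this example.

```python
import math

def dtw_distance(s, t):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(s), len(t)
    # cost[i][j] = cheapest cumulative cost of aligning s[:i] with t[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s[i - 1] - t[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # s[i-1] also matched an earlier t point
                                 cost[i][j - 1],      # t[j-1] also matched an earlier s point
                                 cost[i - 1][j - 1])  # both series advance together
    return cost[n][m]

a = [0, 1, 2, 1, 0]
b = [0, 0, 1, 2, 1, 0]  # same shape, shifted by one step and one sample longer
print(dtw_distance(a, b))  # prints 0.0
```

The two series differ in length and are shifted, so Euclidean distance does not apply directly, yet the alignment matches the repeated leading 0 to a single point and the distance comes out as zero.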

Which distance/similarity will be best suited to find closest word(s) for auto-spelling correction?

Edit distance

Numeric Distance measures

Euclidean and Manhattan distances
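
Both numeric distances can be written directly from their formulas. This is a small sketch with function names chosen for the example.

```python
import math

def manhattan(a, b):
    """Sum of absolute per-feature differences: Sum(abs(Ai - Bi))."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance: Sqrt(Sum((Ai - Bi)^2))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(manhattan((0, 0), (3, 4)))  # prints 7
print(euclidean((0, 0), (3, 4)))  # prints 5.0
```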

Russell-Rao similarity

Fraction of all features that are true in both examples

Sokal-Michener similarity

Fraction of all features that match, whether both true or both false

Jaccard Similarity

Fraction of features that are true in both examples, after excluding the features that are false in both examples
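
The three binary similarity measures differ only in which matches they count. A minimal sketch, with helper names and example vectors made up for illustration:

```python
def match_counts(a, b):
    """Return (both_true, both_false, total) for two equal-length boolean vectors."""
    both_true = sum(1 for x, y in zip(a, b) if x and y)
    both_false = sum(1 for x, y in zip(a, b) if not x and not y)
    return both_true, both_false, len(a)

def russell_rao(a, b):
    both_true, _, total = match_counts(a, b)
    return both_true / total

def sokal_michener(a, b):
    both_true, both_false, total = match_counts(a, b)
    return (both_true + both_false) / total

def jaccard(a, b):
    both_true, both_false, total = match_counts(a, b)
    remaining = total - both_false  # drop features that are false in both
    # Assumption: return 0.0 when every feature is false in both examples.
    return both_true / remaining if remaining else 0.0

a = [True, True, False, False, True]
b = [True, False, False, False, True]
print(russell_rao(a, b), sokal_michener(a, b), jaccard(a, b))
```

Here two features are true in both, two are false in both and one differs, giving 2/5 for Russell-Rao, 4/5 for Sokal-Michener and 2/3 for Jaccard.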

Nearest Neighbor method is NOT an example of:

Information based learning

Other names for Similarity-Based Learning

Instance-based, case-based, exemplar-based, memory-based, lazy learning

K-nearest neighbors can be extended for regression by

Predicting the average target value of the k nearest neighbors
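
A minimal sketch of k-nearest neighbors for regression, assuming Euclidean distance and an unweighted mean. The name `knn_regress` and the toy data are hypothetical.

```python
import math

def knn_regress(train, query, k):
    """Predict the mean target of the k training examples closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical (features, numeric target) pairs.
train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_regress(train, (2.0,), k=3))  # prints 20.0
```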

Numeric Targets in K-Nearest Neighbors

Predicting the average value of the k nearest neighbors; weighting each neighbor by its closeness gives a weighted average

All features should be normalized to take values between 0 and 1 to

Prevent some features from numerically dominating other features

For all binary features, fraction of features that matched is called

Sokal-Michener similarity

Nearest neighbor incorrect classification

Solution is to find the k nearest neighbors and assign the majority class, where k is a parameter

What is true about k-nearest neighbors?

The right value of k for a dataset should be determined empirically (by experiment on the data, not by reasoning alone)

K-nearest neighbors, when used for a classification task, implicitly defines a classification function

True

Euclidean distance

The straight-line distance, or shortest possible path, between two points. = Sqrt(Sum((Ai - Bi)^2)) From lecture: = sqrt((A_test - A_weight)^2 + (B_test - B_weight)^2)

