Data Mining Exam 1: Lecture 3
The cosine measure often falls in the range [-1,1]. True False
True
Similarity/Dissimilarity for Simple Attributes: Interval or Ratio
d = |p-q|; s = -d, or s = 1/(1+d)
Similarity/Dissimilarity for Simple Attributes: Ordinal
d = |p-q|/(n-1); s = 1 - |p-q|/(n-1), where the values are mapped to integers 0 to n-1 and n is the number of values
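A minimal Python sketch of the ordinal formulas, assuming the ordered values are mapped to ranks 0 to n-1. The function names and the five-level rating scale are hypothetical and chosen only for illustration.

```python
def ordinal_dissimilarity(p, q, levels):
    """d = |p - q| / (n - 1), with each value replaced by its rank 0..n-1."""
    n = len(levels)
    rank = {value: i for i, value in enumerate(levels)}
    return abs(rank[p] - rank[q]) / (n - 1)


def ordinal_similarity(p, q, levels):
    """s = 1 - d."""
    return 1 - ordinal_dissimilarity(p, q, levels)


# Hypothetical ordinal scale, listed from lowest to highest.
levels = ["poor", "fair", "ok", "good", "wonderful"]

print(ordinal_dissimilarity("fair", "good", levels))  # 0.5
print(ordinal_similarity("fair", "good", levels))     # 0.5
```

Dividing by n-1 keeps d in [0,1]: the lowest and highest values are exactly 1 apart.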
Simple Matching Coefficient
(F11+F00) / (F01 + F10 + F11 + F00)
Similarity Measure
A numerical measure of how alike two data objects are. It is higher when objects are more alike and often falls in the range [0,1].
Dissimilarity Measure
A numerical measure of how different two data objects are. It is lower when objects are more alike; the minimum dissimilarity is often 0, and the upper limit varies.
Similarity/Dissimilarity for Simple Attributes: Nominal
d = 0 if p = q, d = 1 if p != q; s = 1 if p = q, s = 0 if p != q
Which of the following statements regarding data exploration is NOT true? A. Data exploration can make use of humans' abilities to recognize patterns. B. In data mining, clustering and anomaly detection can only be applied to handle data exploration tasks. C. Data exploration can help to select the right tool for preprocessing or analysis. D. A preliminary exploration of the data can help to better understand its characteristics.
B. In data mining, clustering and anomaly detection can only be applied to handle data exploration tasks.
For the following vectors, x and y, x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1), the Jaccard coefficient of x and y is: A. 0.4 B. 0.5 C. 0.6 D. 0.7
C. 0.6
Which of the following statements regarding similarity and dissimilarity measures is NOT true? A. Similarity measure often falls in the range [0,1]. B. The dissimilarity measure is lower when objects are more alike. C. Dissimilarity measures always have an invariant upper limit. D. Proximity refers to a similarity or dissimilarity.
C. Dissimilarity measures always have an invariant upper limit.
Mahalanobis Distance
A distance used to determine the similarity of an unknown sample set to a known one. It takes into account the correlations of the data set and is scale-invariant.
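The card gives no formula, so the sketch below assumes the commonly used definition, sqrt((p - q)^T * inverse(Sigma) * (p - q)), where Sigma is the covariance matrix of the known data set. Some course materials omit the square root. To stay dependency-free it handles only two-dimensional points; the function name and the covariance values are illustrative assumptions.

```python
import math


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between 2-D points p and q.

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)).
    Assumes the square-root form of the distance.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    squared = (dx * (inv[0][0] * dx + inv[0][1] * dy)
               + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(squared)


# Hypothetical covariance with positively correlated attributes.
cov = ((0.3, 0.2), (0.2, 0.3))
a, b, c = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

print(mahalanobis_2d(a, b, cov))  # about 2.236
print(mahalanobis_2d(a, c, cov))  # about 2.0
```

Point c is farther from a than b is in Euclidean terms, yet closer in Mahalanobis terms, because the step from a to c runs along the direction in which the two attributes vary together. With the identity matrix as the covariance, the result reduces to the Euclidean distance.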
Euclidean Distance
dist(p, q) = sqrt(sum over k = 1..n of (p_k - q_k)^2), where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes of data objects p and q
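As a quick illustration, the Euclidean formula can be sketched in plain Python. The function name and the sample points are assumptions made for the example.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared differences across all attributes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


print(euclidean_distance((0, 2), (3, 6)))  # 5.0
```

Here the attribute differences are 3 and 4, so the distance is sqrt(9 + 16) = 5.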
Similarity Between Binary Vectors
F01 = # of attributes where p was 0 and q was 1; F10 = # of attributes where p was 1 and q was 0; F00 = # of attributes where p was 0 and q was 0; F11 = # of attributes where p was 1 and q was 1
Jaccard Coefficient
F11/(F01 + F10 + F11)
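The four counts, the Simple Matching Coefficient, and the Jaccard coefficient can be checked with a short Python sketch. It reuses the vectors x and y from the multiple-choice card earlier in this set; the function names are assumptions.

```python
def binary_counts(p, q):
    """Return (f01, f10, f00, f11) for two equal-length 0/1 vectors."""
    pairs = list(zip(p, q))
    f01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    f10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    f00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    f11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    return f01, f10, f00, f11


def simple_matching(p, q):
    """SMC = (F11 + F00) / (F01 + F10 + F11 + F00)."""
    f01, f10, f00, f11 = binary_counts(p, q)
    return (f11 + f00) / (f01 + f10 + f11 + f00)


def jaccard(p, q):
    """J = F11 / (F01 + F10 + F11); undefined when both vectors are all zeros."""
    f01, f10, f00, f11 = binary_counts(p, q)
    return f11 / (f01 + f10 + f11)


x = (1, 1, 0, 1, 0, 1)
y = (1, 1, 1, 0, 0, 1)

print(binary_counts(x, y))  # (1, 1, 1, 3)
print(jaccard(x, y))        # 0.6
print(simple_matching(x, y))
```

The Jaccard coefficient ignores the 0-0 matches, which is why it gives 3/5 = 0.6 here while the SMC gives 4/6.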
If two objects have a cosine measure of 1, they are identical. True False
False
Minkowski Distance is a
Generalization of Euclidean Distance
Cosine Similarity
If d1 and d2 are document vectors, then cos(d1, d2) = (d1 · d2) / (||d1|| * ||d2||), where · is the vector dot product and ||d|| is the length of vector d
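A minimal Python sketch of the cosine formula follows. It also illustrates the earlier true/false card: two vectors that point in the same direction have a cosine of 1 even when they are not identical. The function name and the sample vectors are assumptions.

```python
import math


def cosine_similarity(d1, d2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)  # undefined if either vector is all zeros


d1 = (1, 2, 3)
d2 = (2, 4, 6)  # same direction as d1, twice the length

print(cosine_similarity(d1, d2))          # approximately 1.0
print(cosine_similarity((1, 0), (0, 1)))  # 0.0, orthogonal vectors
```

Because the lengths are divided out, the measure depends only on the angle between the vectors, not on their magnitudes.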
Correlation
Measures the linear relationship between objects, and ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
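The sketch below assumes that "correlation" here means Pearson's correlation coefficient, the usual choice for measuring a linear relationship. The function name and the sample data are illustrative.

```python
import math


def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # undefined if either input is constant


x = (-3, -2, -1, 0, 1, 2, 3)
linear = tuple(2 * v + 1 for v in x)  # perfectly linear in x
squared = tuple(v * v for v in x)     # fully determined by x, but not linearly

print(pearson_correlation(x, linear))   # approximately 1.0
print(pearson_correlation(x, squared))  # 0.0
```

The second result shows that a correlation of 0 rules out only a linear relationship: y = x^2 is completely determined by x yet has zero correlation with it.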
Minkowski Distance Examples
r = 1: city block (Manhattan, L1 norm) distance; r = 2: Euclidean distance; r → infinity: supremum (L_max norm) distance, the maximum difference between any component of the vectors
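The three special cases can be compared with a short Python sketch, assuming the standard Minkowski form dist = (sum of |p_k - q_k|^r)^(1/r). The function name and the sample points are assumptions.

```python
import math


def minkowski_distance(p, q, r):
    """Minkowski distance of order r; pass math.inf for the supremum distance."""
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        # Limit as r grows without bound: the largest single difference.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


p, q = (0, 2), (3, 6)  # attribute differences are 3 and 4

print(minkowski_distance(p, q, 1))         # 7.0, city block
print(minkowski_distance(p, q, 2))         # approximately 5.0, Euclidean
print(minkowski_distance(p, q, math.inf))  # 4, supremum
```

For the same pair of points, the distance shrinks as r grows, from the sum of the differences (r = 1) down to the largest single difference (r → infinity).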
Common Properties of a Similarity
s(p, q) = 1 (or maximum similarity) only if p = q; s(p, q) = s(q, p) for all p and q (symmetry)
Proximity refers to a
similarity or dissimilarity