Data Mining Exam 1: Lecture 3

The cosine measure often falls in the range [-1, 1]. True or False?

True

Similarity/Dissimilarity for Simple Attributes: Interval or Ratio

d = |p - q|; s = -d, or s = 1/(1 + d)

Similarity/Dissimilarity for Simple Attributes: Ordinal

d = |p - q|/(n - 1); s = 1 - |p - q|/(n - 1), where the values are mapped to integers 0 to n - 1 and n is the number of values
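The interval/ratio and ordinal formulas above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture material: the function names and the QUALITY scale are hypothetical, and the ordinal case assumes the values are mapped to ranks 0 to n - 1 in their natural order.

```python
def interval_dissimilarity(p, q):
    """d = |p - q| for interval or ratio attributes."""
    return abs(p - q)


def ordinal_dissimilarity(p, q, levels):
    """d = |p - q| / (n - 1), with values mapped to ranks 0..n-1."""
    n = len(levels)
    if n < 2:
        raise ValueError("need at least two ordinal levels")
    return abs(levels.index(p) - levels.index(q)) / (n - 1)


# Hypothetical ordinal scale, ordered from lowest to highest.
QUALITY = ["poor", "fair", "ok", "good", "wonderful"]

d_interval = interval_dissimilarity(3.0, 7.5)   # 4.5
s_interval = 1 / (1 + d_interval)               # one of the two options on the card

d_ordinal = ordinal_dissimilarity("fair", "good", QUALITY)  # |1 - 3| / 4 = 0.5
s_ordinal = 1 - d_ordinal                                   # 0.5
```

Dividing by n - 1 keeps the ordinal dissimilarity in [0, 1], so s = 1 - d is also in [0, 1].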

Simple Matching Coefficient

(F11+F00) / (F01 + F10 + F11 + F00)

Similarity Measure

A numerical measure of how alike two data objects are; higher when objects are more alike; often falls in the range [0, 1]

Dissimilarity Measure

A numerical measure of how different two data objects are; lower when objects are more alike; minimum dissimilarity is often 0; upper limit varies

Similarity/Dissimilarity for Simple Attributes: Nominal

d = 0 if p = q, d = 1 if p != q; s = 1 if p = q, s = 0 if p != q

Which of the following statements regarding data exploration is NOT true? A. Data exploration can make use of humans' abilities to recognize patterns. B. In data mining, clustering and anomaly detection can only be applied to handle data exploration tasks. C. Data exploration can help to select the right tool for preprocessing or analysis. D. A preliminary exploration of the data can help to better understand its characteristics.

B. In data mining, clustering and anomaly detection can only be applied to handle data exploration tasks.

For the following vectors, x and y, x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1), the Jaccard coefficient of x and y is: A. 0.4 B. 0.5 C. 0.6 D. 0.7

C. 0.6

Which of the following statements regarding similarity and dissimilarity measures is NOT true? A. Similarity measure often falls in the range [0,1]. B. The dissimilarity measure is lower when objects are more alike. C. Dissimilarity measures always have an invariant upper limit. D. Proximity refers to a similarity or dissimilarity.

C. Dissimilarity measures always have an invariant upper limit.

Mahalanobis Distance

A distance used to determine how similar an unknown sample is to a known set of samples. It takes into account the correlations in the data set and is scale-invariant.
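The card gives no formula, so the sketch below uses the standard textbook definition, d(p, q) = sqrt((p - q)^T S^-1 (p - q)), where S is the covariance matrix of the data. It is restricted to two attributes so the matrix inverse can be written out by hand. The function name and the covariance values are hypothetical.

```python
import math


def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between 2-D points p and q for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = p[0] - q[0], p[1] - q[1]
    # Inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    quad = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(quad)


# Hypothetical covariance: the first attribute has variance 4, the second variance 1.
COV = [[4.0, 0.0], [0.0, 1.0]]
origin = (0.0, 0.0)

d_wide = mahalanobis_2d((2.0, 0.0), origin, COV)    # Euclidean distance 2
d_narrow = mahalanobis_2d((0.0, 1.0), origin, COV)  # Euclidean distance 1
```

Both points are one standard deviation from the origin along their axis, so both Mahalanobis distances come out as 1.0 even though the Euclidean distances differ. With the identity matrix as covariance, the function reduces to the Euclidean distance.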

Euclidean Distance

dist(p, q) = sqrt(sum over k = 1..n of (p_k - q_k)^2), where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes of data objects p and q

Similarity Between Binary Vectors

F01 = # of attributes where p was 0 and q was 1; F10 = # of attributes where p was 1 and q was 0; F00 = # of attributes where p was 0 and q was 0; F11 = # of attributes where p was 1 and q was 1

Jaccard Coefficient

F11/(F01 + F10 + F11)
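As a check on the binary-vector formulas, the sketch below counts F01, F10, F00 and F11 and then computes the Simple Matching Coefficient and the Jaccard coefficient. The vectors x and y are the ones from the Jaccard question in this set; the function names are illustrative only.

```python
def binary_counts(p, q):
    """Return (f01, f10, f00, f11) for two equal-length 0/1 vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    return f01, f10, f00, f11


def simple_matching(p, q):
    """SMC = (F11 + F00) / (F01 + F10 + F11 + F00)."""
    f01, f10, f00, f11 = binary_counts(p, q)
    return (f11 + f00) / (f01 + f10 + f11 + f00)


def jaccard(p, q):
    """J = F11 / (F01 + F10 + F11); 0-0 matches are ignored."""
    f01, f10, f00, f11 = binary_counts(p, q)
    return f11 / (f01 + f10 + f11)


x = (1, 1, 0, 1, 0, 1)
y = (1, 1, 1, 0, 0, 1)

counts = binary_counts(x, y)   # (1, 1, 1, 3)
smc = simple_matching(x, y)    # 4 / 6
jac = jaccard(x, y)            # 3 / 5 = 0.6
```

The two coefficients differ only in whether the 0-0 matches (F00) count as agreement. Note that jaccard divides by zero if both vectors are all zeros; this sketch does not handle that case.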

If two objects have a cosine measure of 1, they are identical. True False

False

Minkowski Distance is a

Generalization of Euclidean Distance

Cosine Similarity

If d1 and d2 are document vectors, then cos(d1, d2) = (d1 · d2) / (||d1|| * ||d2||), where · is the vector dot product and ||d|| is the length of vector d
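A minimal Python sketch of the cosine formula follows. The vectors are made-up examples, chosen to show why a cosine measure of 1 does not imply that two objects are identical: d2 is simply d1 scaled by 2.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)."""
    if len(d1) != len(d2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm1 * norm2)


d1 = (1, 2, 0)
d2 = (2, 4, 0)   # same direction as d1, twice the length

same_direction = cosine_similarity(d1, d2)        # approximately 1
orthogonal = cosine_similarity((1, 0), (0, 1))    # 0
opposite = cosine_similarity((1, 2), (-1, -2))    # approximately -1
```

Cosine depends only on the angle between the vectors, not on their lengths, so any positive multiple of a vector has cosine 1 with it.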

Correlation

Measures the linear relationship between objects, and ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation)
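The card does not say which correlation is meant; the sketch below assumes the usual Pearson correlation coefficient. The data are made up to hit the two extremes of the range.

```python
import math


def correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


x = (1, 2, 3)
perfect_positive = correlation(x, (2, 4, 6))   # y = 2x, approximately 1
perfect_negative = correlation(x, (6, 4, 2))   # y = 8 - 2x, approximately -1
```

The 1/(n - 1) factors of the sample covariance and standard deviations cancel, so they are left out of the computation.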

Minkowski Distance Examples

r = 1: city block (Manhattan, L1) distance; r = 2: Euclidean (L2) distance; r = infinity: supremum (L_max) distance, the maximum difference between any component of the vectors
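The three cases above can be sketched with a single function based on the standard Minkowski formula, dist = (sum over k of |p_k - q_k|^r)^(1/r), with the r = infinity case handled as the maximum component difference. The function name and the example points are illustrative.

```python
import math


def minkowski(p, q, r):
    """Minkowski distance of order r; r = math.inf gives the supremum distance."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    if r < 1:
        raise ValueError("r must be at least 1")
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


p = (0, 2)
q = (3, 6)   # component differences are 3 and 4

city_block = minkowski(p, q, 1)        # 3 + 4 = 7
euclidean = minkowski(p, q, 2)         # sqrt(9 + 16) = 5
supremum = minkowski(p, q, math.inf)   # max(3, 4) = 4
```

As r grows, the largest component difference dominates the sum, which is why the limit is the maximum difference.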

Common Properties of a Similarity

s(p, q) = 1 only if p = q; s(p, q) = s(q, p) for all p and q (symmetry)

Proximity refers to a

similarity or dissimilarity

