exam 2 slides
item support formula
supp(X) = (# transactions containing X) / (# total transactions)
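The support formula can be sketched in a few lines of Python. This is only an illustration: the function name `support` and the basket data are hypothetical, made up for this card.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical market-basket data, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

print(support({"bread"}, transactions))          # 0.8 (4 of 5 transactions)
print(support({"bread", "milk"}, transactions))  # 0.6 (3 of 5 transactions)
```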
Association analysis algorithms
-Apriori -Eclat -FP Growth -etc.
Uses for cluster results
-Data segmentation -Categories for classifying new data -Labeled data for classification -Anomaly detection
the three similarity measures
-Euclidean Distance -Manhattan Distance -Cosine Similarity
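A minimal sketch of the three measures, assuming points are given as equal-length tuples of numbers. The function names are illustrative, and `cosine_similarity` assumes neither vector is all zeros.

```python
import math

def euclidean(a, b):
    """Straight-line distance between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis (grid travel)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(euclidean((0.0, 0.0), (3.0, 4.0)))          # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))          # 7.0
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0 (perpendicular vectors)
```

Euclidean and Manhattan are distances (smaller means more similar), while cosine similarity grows as vectors point in the same direction.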
Cluster Analysis Characteristics
-Unsupervised -No labels for clusters -No 'correct' clustering
Association analysis steps
-create item sets -identify frequent item sets -generate rules
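The three steps can be sketched with a brute-force search. This is not Apriori, Eclat, or FP Growth, just a small illustration that enumerates candidates directly. The function names, thresholds, and basket data are assumptions made up for this example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Steps 1 and 2: create candidate item sets, keep the frequent ones."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            supp = sum(1 for t in transactions if cset <= t) / n
            if supp >= min_support:
                frequent[cset] = supp
                found = True
        if not found:
            break  # no frequent set of this size, so none can exist at larger sizes
    return frequent

def generate_rules(frequent, min_confidence):
    """Step 3: split each frequent item set into antecedent --> consequent."""
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent item set is itself frequent.
                conf = supp / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Hypothetical transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

frequent = frequent_itemsets(transactions, min_support=0.6)
for antecedent, consequent, conf in generate_rules(frequent, min_confidence=0.7):
    print(sorted(antecedent), "-->", sorted(consequent), round(conf, 2))
```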
how to evaluate cluster results
-find error (distance between each sample and its centroid) -find square error -sum the square errors between all samples and their centroids -sum = WSSE (within-cluster sum of squared errors) -if WSSE1 < WSSE2 -> WSSE1 is numerically better.
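A small sketch of the WSSE calculation, assuming Euclidean distance is the error measure and that each sample already has a cluster index. The name `wsse` and the data are hypothetical.

```python
def wsse(samples, centroids, assignments):
    """Sum of squared distances between each sample and its centroid."""
    total = 0.0
    for sample, cluster in zip(samples, assignments):
        centroid = centroids[cluster]
        total += sum((x - c) ** 2 for x, c in zip(sample, centroid))
    return total

# Hypothetical clustering result, invented for illustration.
samples = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
centroids = [(1.0, 2.0), (6.0, 5.0)]
assignments = [0, 0, 1, 1]  # index of the centroid each sample belongs to

print(wsse(samples, centroids, assignments))  # 4.0
```

A lower WSSE is only numerically better. It always shrinks as K grows, so it is most meaningful when comparing results that use the same K.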
key uses of rule confidence
-turns frequent item sets into significant rules: keep only the rules whose confidence meets a minimum threshold
When to stop iterating
-no changes to centroids -number of samples changing clusters is below a threshold
K-Means algorithm steps
-start -select K initial centroids -assign each sample to the closest centroid -calculate each cluster's mean to find its new centroid -repeat the assign and update steps until the stopping criterion is met
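The steps above can be sketched in plain Python. This version assumes squared Euclidean distance, picks the initial centroids at random from the samples, and stops when no sample changes cluster. The names `k_means`, `max_iter`, and `seed` are illustrative choices, not from the slides.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(samples, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Select K initial centroids (here: K distinct samples chosen at random).
    centroids = rng.sample(samples, k)
    assignments = None
    for _ in range(max_iter):
        # Assign each sample to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(s, centroids[j]))
            for s in samples
        ]
        # Stop when no sample changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [s for s, a in zip(samples, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical data with two well-separated groups.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(samples, k=2)
print(centroids)
print(assignments)
```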
Association analysis characteristics
-unsupervised -usefulness of rules is subjective -need to determine how the rules will be used
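residual distance in least squares method
vertical distance between a data point and the regression line; least squares fits the line that minimizes the sum of the squared residuals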
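A short sketch of residuals for a one-variable least squares fit. It assumes the usual closed-form slope and intercept for simple linear regression, and takes each residual as the observed y minus the y predicted by the line. The data points are made up.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
sum_squared_residuals = sum(r ** 2 for r in residuals)

print(slope, intercept)
print(residuals)
print(sum_squared_residuals)  # the quantity least squares minimizes
```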
Manhattan Distance
A measure of travel through a grid system like navigating around the buildings and blocks of Manhattan, NYC via horizontal and vertical paths
cosine similarity
Cosine of the angle between vectors A and B.
Goal of regression analysis
Given input variables, predict a numerical output
Error in cluster analysis
distance between sample and centroid
rule confidence formula
conf(X --> Y) = supp(X U Y) / supp(X)
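The confidence formula can be checked on a small example. This is a sketch only: the helper names `support` and `confidence` and the basket data are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(X --> Y) = supp(X U Y) / supp(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

# supp({bread, milk}) = 3/5 and supp({bread}) = 4/5, so the result is about 0.75.
print(confidence({"bread"}, {"milk"}, transactions))
```

Read the result as: of the transactions that contain bread, roughly three quarters also contain milk.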
square error
error ^2
Issue with initial centroid
final clusters are sensitive to the choice of initial centroids
Goal of association analysis
find rules to capture associations between items/events
cluster analysis goal
organize similar items into groups, aka clusters. Differences between items within a cluster are minimized, while differences between items in different clusters are maximized
Solution to initial centroid issue
run k-means multiple times and choose best results
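One way to sketch this: run a compact k-means several times with different random seeds and keep the run with the lowest WSSE. Using WSSE as the meaning of "best" is an assumption here, as are the names `k_means_once`, `best_of_runs`, and `n_runs`.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means_once(samples, k, seed, max_iter=100):
    """One k-means run; returns (wsse, centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)  # initial centroids depend on the seed
    assignments = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: sq_dist(s, centroids[j])) for s in samples]
        if new == assignments:
            break
        assignments = new
        for j in range(k):
            members = [s for s, a in zip(samples, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(d) / len(members) for d in zip(*members))
    wsse = sum(sq_dist(s, centroids[a]) for s, a in zip(samples, assignments))
    return wsse, centroids, assignments

def best_of_runs(samples, k, n_runs=10):
    """Run k-means n_runs times and keep the run with the lowest WSSE."""
    runs = [k_means_once(samples, k, seed) for seed in range(n_runs)]
    return min(runs, key=lambda run: run[0])

# Hypothetical data with three loose groups.
samples = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9), (9.0, 1.0), (9.1, 1.2)]
wsse, centroids, assignments = best_of_runs(samples, k=3)
print(wsse)
print(assignments)
```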
Euclidean Distance
the distance between two points measured as a straight line.
X --> Y rule
X is the antecedent; Y is the consequent