data mining midterm 3
parametric stats based approach to outlier detection
a priori - assumes a distribution (typically normal) in advance
univariate detection: assume the data follow a normal distribution, learn the distribution's parameters from the input data, and identify points with low probability as outliers
multivariate detection: use the Mahalanobis distance to transform the data into a univariate set of distances, or use a chi-squared statistic
you can also use multiple parametric distributions: model the data as multiple clusters / a mixture of multiple normal distributions
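A minimal numpy/scipy sketch of both ideas (the function names and thresholds are illustrative assumptions, not from the notes): fit a normal distribution and flag low-probability points, and use the Mahalanobis distance with a chi-squared cutoff for the multivariate case.

```python
import numpy as np
from scipy.stats import chi2

def univariate_outliers(x, z_thresh=3.0):
    # fit a normal distribution to the data, flag low-probability points (|z| > threshold)
    z = np.abs(x - x.mean()) / x.std()
    return z > z_thresh

def mahalanobis_outliers(X, alpha=0.975):
    # squared Mahalanobis distances of normal data follow a chi-squared distribution
    # with d degrees of freedom, giving a univariate test for multivariate data
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared Mahalanobis distance per point
    return d2 > chi2.ppf(alpha, df=X.shape[1])
```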
which type of cluster analysis is good for finding arbitrary shapes?
density-based and continuity-based similarity methods
optimizing within cluster variation
exact optimum: O(n^(dk+1) log n); heuristics: greedy approaches (k-means, k-medoids)
partitioning method
partitions the data so that each partition forms a cluster
- k clusters, k <= n
- each group must contain at least one object
- exclusive cluster separation: each object must belong to exactly one group
- most are distance based
- uses an iterative relocation technique: attempts to improve the clusters through iteration
- finds mutually exclusive clusters of spherical shape
- may use a mean or medoid to represent the cluster center
- good for small to medium sized datasets
what are the types of clustering methods?
partitioning, hierarchical, density-based, grid-based; some methods combine several of these
which methods are designed to find spherical shaped clusters?
partitioning and hierarchical methods
core-distance
the smallest radius e' such that the e'-neighborhood of p contains at least MinPts objects, i.e., the smallest radius that makes p a core object. it is undefined if p is not a core object with respect to e and MinPts.
what happens to STING as the granularity approaches 0?
An interesting property of STING is that it approaches the clustering result of DBSCAN if the granularity approaches 0 (i.e., toward very low-level data). aka dense clusters can be identified using STING
cluster outlier detection
- is it in a cluster? DBSCAN
- is there a big distance between the cluster it's in and the other clusters? k-means with an outlier score
- is it part of a small or sparse cluster? CBLOF
- fixed-width clustering: anything not within the width is an outlier
hierarchical methods
- makes a hierarchical decomposition of the dataset (multiple levels)
-agglomerative: bottom-up approach. all dp start off as individual units and are built up; keeps merging until all groups are merged into one (top of hierarchy - combination of all) or a termination condition holds
-divisive: top-down. starts off as one giant cluster and keeps breaking into subsets until all the dp are on their own
-can be distance based, density based, or continuity based
-may incorporate other techniques like microclustering or consider object "linkages"
cons: once a step is done (merge or split) it can't be undone; cannot correct erroneous merges or splits
cons of k means
- need to pick a k in advance
- can't discover non-convex shapes or clusters of very different sizes
- sensitive to outliers
requirements of clustering
- scalability: models need to work well on large data sets
- ability to deal with different types of attributes: needs to cluster more types of data, not just numerical
- discovery of clusters with arbitrary shape: not just spherical
- reduced requirements for domain knowledge to determine input parameters: e.g., k-means requires you to specify k; ideally domain knowledge shouldn't be needed, but for many methods it is
- ability to deal with noisy data: we want models that are not so sensitive to outliers
- incremental clustering and insensitivity to input order: when new data arrive, some algorithms need to start from scratch; some are also sensitive to input order. we need algorithms that take incremental data and are insensitive to input order
- capability of clustering high-dimensionality data: most algorithms handle only 2 or 3 dimensions; we want ones that handle more
- constraint-based clustering: find groups with good clustering behavior that satisfy the constraints
- interpretability and usability: the goal should influence the attributes picked for clustering; the analysis needs to be interpretable and useful
clustering methods can be compared using the following criteria:
- the partitioning criteria: no hierarchy - customers grouped under a manager are at the same level conceptually; hierarchy - basketball sits under the hierarchy of sports
- separation of clusters: can be exclusive (a dp belongs to only one cluster) or a dp can be within multiple clusters (multiple classifications)
- similarity measure: similarity may be measured by distance in some methods, by density or continuity in others
- clustering space: subspace clustering, finding clusters in low-dimensional subspaces
characteristics of cluster analysis
-automatic classification: creates implicit classes, can automatically find groupings - data segmentation: partitions data based on similarity -unsupervised learning since class labels are not present (learn by observation)
density based clustering
-can find non-spherical shapes
DBSCAN: Density-Based Spatial Clustering of Applications with Noise (density-based clustering based on connected regions with high density)
- the density of a neighborhood can be measured simply by the number of objects in the neighborhood (objects close to o)
- finds core objects, that is, objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters
- a user-specified parameter e > 0 specifies the radius of the neighborhood we consider for every object
what can you use cluster analysis for?
-gaining insight into distribution of data -observe characteristics of each cluster -focus on particular cluster for further analysis -preprocessing -outlier detection (outlier points can be "interesting" to do further analysis on), fraud stuff
outlier detection
-supervised: a classification problem (has a class imbalance problem)
-semi-supervised
-unsupervised: makes the implicit assumption that all normal dps are clustered together
normal objects vs outliers (statistical model): normal objects are generated by a statistical model; a dp that doesn't fit the model is an outlier. gaussian model - needs to fit the normal distribution
-proximity based: its proximity to its neighbors significantly deviates
-clustering based
how to make k means more efficient /scalable with big data
-use a decent sized sample
-filter approach - spatial data
-micro-clustering: first group nearby objects into "microclusters", then perform k-means clustering on the microclusters
-use k-medoids instead! less sensitive to outliers since it chooses an actual dp to be the center
k medoid algorithm
1) choose seeds (k representative objects)
2) assign each remaining object to the nearest representative
3) choose a random non-representative object
4) compute the total cost S of swapping it with a representative
5) if S < 0, swap it with the representative to form the new representative set
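A rough numpy sketch of one swap-evaluation step in this spirit (pam_swap_step and its cost bookkeeping are hypothetical helpers, not the textbook PAM code):

```python
import numpy as np

def pam_swap_step(X, medoids):
    """One PAM-style step: try every (medoid, non-medoid) swap and keep the best improving one."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    cost = lambda meds: dist[:, meds].min(axis=1).sum()            # absolute-error criterion
    best, best_s = list(medoids), 0.0
    for i in range(len(medoids)):
        for cand in range(len(X)):
            if cand in medoids:
                continue
            trial = list(medoids)
            trial[i] = cand
            s = cost(trial) - cost(medoids)    # total swap cost S
            if s < best_s:                     # S < 0 means the swap reduces the absolute error
                best, best_s = trial, s
    return best                                # repeat until no swap has S < 0
```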
hierarchical clustering pt2
A hierarchical clustering method works by grouping data objects into a hierarchy or "tree" of clusters. Our use of a hierarchy here is just to summarize and represent the underlying data in a compressed way. can uncover a hierarchy as well
dendrogram
A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step-by-step
CLIQUE
CLustering In QUEst
- grid- and density-based approach to subspace clustering in high-dimensional data space
- apriori-like
- finds clusters on attributes that are more similar to each other (subspaces): instead of trying to find clusters over the full high-dimensional space, use just a few similar attributes in a sub-category, e.g., symptoms (high fever, stuffy nose, etc.), and find clusters there
parameters for DBSCAN
DBSCAN = density based
- ε is a distance parameter that defines the radius to search for nearby neighbors
- minpts = number of points that need to be in the neighborhood to be considered a cluster
a point in DBSCAN can be:
- a core point
- a border point (is in a cluster but doesn't have enough minpts around it to be a core)
- a noise point
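A short usage example with scikit-learn's DBSCAN; the data X and the eps / min_samples values below are placeholders, not values from the notes.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                  # placeholder data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)  # eps = neighborhood radius, min_samples = MinPts
labels = db.labels_                         # cluster ids; -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points; clustered non-core points are border points
```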
divisive method
DIANA (DIvisive ANAlysis) top-down strategy. The cluster is split according to some principle such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
univariate data
Data involving only one attribute or variable are called univariate data.
cons of hierarchical method
Hierarchical clustering methods can encounter difficulties regarding the selection of merge or split points Such a decision is critical, because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters.
k-medoids
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is assigned to the cluster of which the representative object is the most similar. minimizing the sum of the dissimilarities between each object p and its corresponding representative object: uses absolute error criterion - we want to minimize absolute error NP-hard except when k = 1 Partitioning Around Medoids (PAM):
unsupervised outlier detection
normal objects do not have to share a single pattern; instead, they can form multiple groups, where each group has distinct features. however, an outlier is expected to occur far away in feature space from any of those groups of normal objects.
number of clusters
It can be regarded as finding a good balance between compressibility and accuracy in cluster analysis
- simple rule: set the number of clusters to k = sqrt(n/2), so each cluster has about sqrt(2n) points
- elbow method: increasing the number of clusters reduces the sum of within-cluster variance; pick the k where the reduction levels off
- cross-validation
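A small sketch of the elbow method using scikit-learn's KMeans; X and the range of k values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)   # placeholder data
sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_      # within-cluster sum of squared distances
# plot k vs. sse[k] and pick the k where the curve bends (the "elbow")
```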
challenges in outlier detection
- modeling normal objects and outliers effectively; outlierness: to what extent would it be considered a normal point?
- application-specific outlier detection
- handling noise in outlier detection
- understandability
signatures
Our goal is to replace large sets by much smaller representations called "signatures," then compare the signatures of two sets and estimate the Jaccard similarity of the underlying sets from the signatures alone. basically, signatures are the result of minhashing: the short value/vector used to represent a document's set, with H(d) being the signature of document d. If similarity(d1, d2) is high then Probability(H(d1) == H(d2)) is high. one signature is made per document.
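A toy minhash sketch, assuming each document has already been converted to a set of integer shingle ids; the random hash family below is one common choice, not the only one.

```python
import numpy as np

def minhash_signature(shingle_ids, num_hashes=100, prime=2_147_483_647, seed=0):
    # signature = vector of minimum hash values over several random hash functions
    rng = np.random.default_rng(seed)
    a = rng.integers(1, prime, num_hashes)
    b = rng.integers(0, prime, num_hashes)
    ids = np.array(list(shingle_ids))
    return ((a[:, None] * ids[None, :] + b[:, None]) % prime).min(axis=1)

def estimated_jaccard(sig1, sig2):
    # fraction of matching signature positions estimates the Jaccard similarity of the sets
    return (sig1 == sig2).mean()
```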
PAM vs CLARA
PAM examines every object in the data set against every current medoid, whereas CLARA confines the candidate medoids to only a random sample of the data set.
proximity based approaches to outlier detection
Proximity-based approaches assume that the proximity of an outlier object to its nearest neighbors significantly deviates from the proximity of the object to most of the other objects in the data set. two types: distance based and density based
single linkage approach
This is a single-linkage approach in that each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. agglomerative (AGNES)
minimal spanning tree algorithm
Thus, an agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm, where a spanning tree of a graph is a tree that connects all vertices, and a minimal spanning tree is the one with the least sum of edge weights.
Partitioning Around Medoids (PAM):
We consider whether replacing a representative object by a nonrepresentative object would improve the clustering quality, i.e., the quality of another data point as a new medoid. This quality is measured by a cost function of the average dissimilarity between an object and the representative object of its cluster. Therefore, the cost function calculates the difference in absolute-error value if a current representative object is replaced by a nonrepresentative object. The total cost of swapping is the sum of costs incurred by all nonrepresentative objects. If the total cost is negative, then oj is replaced (swapped) with orandom because the actual absolute error E is reduced. If the total cost is positive, the current representative object oj is considered acceptable, and nothing is changed in the iteration. does not scale well for large data sets
complete linkage algorithm
When an algorithm uses the maximum distance, dmax (Ci , Cj ), to measure the distance between clusters, it is sometimes called a farthest-neighbor clustering algorithm. If the clustering process is terminated when the maximum distance between nearest clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm. complete subgraph, that is, with edges connecting all the nodes in the clusters Farthest-neighbor algorithms tend to minimize the increase in diameter of the clusters at each iteration.
CELL algorithm
a grid-based method for distance-based outlier detection. each cell is a cube with length r/2 (r = distance threshold parameter). uses the DB(r, pi) outlier definition. 3-time scan of the data
multi phase clustering
a way to make the hierarchical method better: integrate it with other techniques. examples: BIRCH and Chameleon
collective outlier
a certain event containing multiple dp is considered an outlier as a whole, e.g., 100 order delays on one day when usually there are around 1 or 2 a day. Given a data set, a subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set. Importantly, the individual data objects may not be outliers. e.g., the black objects as a whole form a collective outlier because the density of those objects is much higher than the rest of the data set. need multiple dp to see the outlier pattern - consider the behavior of a group of objects
an objective function
aims for high intracluster similarity and low intercluster similarity. This objective function tries to make the resulting k clusters as compact and as separate as possible.
categories of hierarchical methods
algorithmic (deterministic measurement of distance): agglomerative, divisive, multiphase
probabilistic: measure the quality of clusters by the fitness of models
bayesian: return a group of clustering structures and their probabilities, conditional on the given data
kmeans
an objective function: a centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. the centroid can be defined in various ways, such as the mean or medoid of the objects. uses euclidean distance. the quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared errors between all objects in Ci and the centroid ci. the k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. each remaining object is assigned to the cluster to which it is the most similar, based on the euclidean distance between the object and the cluster mean, and the algorithm then iteratively improves the within-cluster variation. k-means is not guaranteed to converge to the global optimum and often terminates at a local optimum; the results may depend on the initial random selection of cluster centers. run time: O(nkt), n = number of dp, k = number of clusters, t = number of iterations. k-means can be applied only when the mean of a set of objects is defined.
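A minimal Lloyd's-iteration sketch of the idea above (toy code, assuming Euclidean data and that no cluster goes empty during the iterations):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(iters):
        # assign each point to the nearest centroid (Euclidean distance)
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):          # converged (usually a local optimum)
            break
        centroids = new_centroids
    return labels, centroids
```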
local proximity based outliers
an outlier relative to its distance to cluster 1 but not necessarily in comparison to cluster 2. distance-based methods cannot detect this, only density-based ones. The critical idea here is that we need to compare the density around an object with the density around its local neighbors. the local outlier factor (LOF) is used - a value close to 1 means the point is deep in a cluster
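A short scikit-learn example of the local outlier factor; X is a placeholder feature matrix and n_neighbors is an illustrative choice.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.rand(300, 2)                 # placeholder feature matrix
lof = LocalOutlierFactor(n_neighbors=20)
pred = lof.fit_predict(X)                  # -1 flags suspected local outliers
scores = -lof.negative_outlier_factor_     # LOF score: values near 1 = deep inside a cluster
```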
contextual outliers
anomalies that depend on multiple factors: 80 degrees is an outlier in antarctica but not in florida. dependent on behavioral attributes and contextual attributes. a generalization of local outliers: its density significantly deviates from the local area in which it occurs
novelty detection vs outliers
both are found because they are different from the normal mechanisms. however, a novelty usually gets incorporated into the model later and won't be treated as an outlier from that point forward
k modes method
can cluster nominal data The k-modes method is a variant of k-means, which extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes It uses new dissimilarity measures to deal with nominal objects and a frequency-based method to update modes of clusters
When I try out a clustering method on a data set, how can I evaluate whether the clustering results are good?"
- clustering tendency
- number of clusters
- measuring cluster quality
difference between density based clustering vs density based outlier detection
clustering: needs minpts and r. outlier detection: needs reachability distance and r - no minimum number of points!
density based method
continue growing a given cluster until it exceeds some threshold. it needs to hit the minimum number of points to become a cluster (need a certain number of point minimum in its neighborhood) -can find arbitrarily shaped clusters -clusters are high density areas in space separated by low density regions -usually only deals with exclusive clusters -may filter out outliers
what information does OPTICS need per object?
core-distance: the smallest value e' that makes p a core object, i.e., the smallest radius within which p has at least MinPts objects. if p is not a core object with respect to e and MinPts, the core-distance of p is undefined.
reachability-distance: the minimum radius value that makes p density-reachable from q: max{core-distance(q), dist(p, q)}. q has to be a core object. if the core-distance is greater than the euclidean distance, the reachability-distance is the core-distance.
the complexity is O(n log n) if a spatial index is used, and O(n^2) otherwise, where n is the number of objects
clustering tendency
determines whether a given data set has a non-random structure, which may lead to meaningful clusters. tests whether the data set deviates from a uniform data distribution using: the Hopkins Statistic, a spatial statistic that tests the spatial randomness of a variable as distributed in a space.
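One common formulation of the Hopkins statistic as a sketch; the sample size m and the axis-aligned uniform sampling box are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=50, seed=0):
    """Values near 0.5 suggest uniform (random) data; values near 1 suggest clusterable data."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    real = X[rng.choice(len(X), m, replace=False)]
    fake = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    w = nn.kneighbors(real)[0][:, 1].sum()   # distance from sampled real points to their nearest other real point
    u = nn.kneighbors(fake)[0][:, 0].sum()   # distance from uniform random points to the nearest real point
    return u / (u + w)
```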
global outlier detection
distance based: works here since every point is judged against the same global parameters
OPTICS
doesn't force all clusters to share one global density threshold: by setting e = infinity it can produce clusterings at any density, the main advantage over DBSCAN, though the run time may be higher. outputs a cluster ordering that represents the density-based clustering structure of the data; objects in a denser cluster are listed closer to each other in the ordering. does not require the user to provide a specific density threshold - good! to construct the different clusterings simultaneously, the objects are processed in a specific order; clusters with higher density (lower e) are finished first
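A short scikit-learn OPTICS example, assuming X is a numeric data array; min_samples is a placeholder value.

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(300, 2)              # placeholder data
opt = OPTICS(min_samples=5).fit(X)      # no fixed eps required (max_eps defaults to infinity)
order = opt.ordering_                   # the cluster ordering of the objects
reach = opt.reachability_[order]        # reachability plot: valleys correspond to dense clusters
labels = opt.labels_                    # -1 marks points not placed in any cluster
```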
grid based method
embeds the space into cells independent of the distribution of the input objects; uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the clustering operations are performed. pros: fast processing time (independent of the # of dp, only dependent on the number of cells). examples: STING and CLIQUE
measure cluster quality
extrinsic: cluster homogeneity, cluster completeness, rag bag (miscellaneous dp), small cluster preservation. intrinsic: silhouette coefficient
exclusive vs fuzzy
exclusive: a dp can only belong to one cluster. fuzzy: a dp can belong to multiple clusters
types of outliers
global contextual collective
space driven clustering method
grid-based clustering: embeds the space into cells independent of the distribution of the dps
data driven clustering methods
hierarchical, density based, partitioning based - they are data-driven: they partition the set of objects and adapt to the distribution of the objects in the embedding space
single linkage algorithm
if the clustering process is terminated when the distance between nearest clusters exceeds a user-defined threshold, it is called a single-linkage algorithm. (min span tree, min distance)
directly density-reachable
if a dp p is in the neighborhood of the core object q, then p is directly density-reachable from q. the non-core object is density-reachable from the core object but not vice versa, unless both are cores
How is this statistical information useful for query answering? for STING
it filters down the cells that fit the criteria for the query from the top down until it reaches the bottom; confidence testing is used. First, a layer within the hierarchical structure is determined from which the query-answering process is to start. This layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated probability range) reflecting the cell's relevancy to the given query. The irrelevant cells are removed from further consideration. Processing of the next lower level examines only the remaining relevant cells. This process is repeated until the bottom layer is reached. At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned.
agglomerative method
it finds the two clusters that are closest to each other (according to some similarity measure), and combines the two to form one cluster. Because two clusters are merged per iteration, where each cluster contains at least one object, an agglomerative method requires at most n iterations. AGNES (AGglomerative NESting)
What advantages does STING offer over other clustering methods?
it is query independent: the statistical information stored in each cell represents summary information of the data in the grid cell, independent of the query; the grid structure facilitates parallel processing and incremental updating; and the method's efficiency is a major advantage: STING goes through the database once to compute the statistical parameters of the cells, so the time complexity of generating clusters is O(n). After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.
"Which method is more robust—k-means or k-medoids?"
k-medoids, if there is a lot of noise and outliers in the data; its complexity is higher though
heuristic clustering methods
k-means and the k-medoids algorithms, which progressively improve the clustering quality and approach a local optimum; sufficient for reaching the goal, not necessarily the optimal solution. good for finding spherical-shaped clusters in small to medium sized data sets
statistical approaches
makes assumptions about data normality: they assume that the normal objects in a data set are generated by a stochastic process (a generative model), so objects in low-probability regions are outliers. parametric method: assumes that the normal data objects are generated by a parametric distribution (a probability distribution with a fixed set of parameters). non-parametric method: assumes that the number and nature of the parameters are flexible and not fixed in advance; the model is built from the input data, not a priori. In summary, statistical methods for outlier detection learn models from data to distinguish normal data objects from outliers
what is the most popular clustering method?
most popular is partitioning by distance (cons: can only find spherical shaped clusters)
distance based proximity approach to outlier detection
nested loop: checks an object in comparison to the whole data set - costly. grid based - better: checks it in comparison to others in its group (cell)
noise vs outliers
noise: random error or variance. outliers: interesting because they are not generated by the same mechanisms as the rest of the data. we must justify why the outliers detected are generated by some other mechanisms; this is often achieved by making various assumptions on the rest of the data and showing that the outliers detected violate those assumptions significantly.
gaussian distribution
normal distribution
The general criterion of a good partitioning
objects in the same cluster are "close" or related to each other, whereas objects in different clusters are "far apart" or very different
DBSCAN algorithm
1) pick a random unvisited object and mark it as visited
2) check whether its e-neighborhood contains at least MinPts objects
- if not: label it as noise
- if so: it becomes a core object; a cluster is formed and the neighborhood points get checked next
the clustering process continues until all objects are visited. the computational complexity of DBSCAN is O(n log n) (with a spatial index). it must use global parameters, so all clusters have to look the same, and it requires the user to provide a specific density threshold
pros and cons of CLIQUE
pros: high scalability; insensitive to the order of inputs. cons: good results depend on the grid size and on the density threshold, and every cluster shares these same parameters, so the simplicity of the model can hurt quality
grid based method
quantizes the object space into a finite number of cells that form a grid structure. pros: fast processing time - only dependent on the number of grid cells, independent of the number of dp; a good approach for spatial data mining problems; can be integrated with density-based clustering or hierarchical methods, since it is just a spatial partitioning. - multiresolution grid structure
user parameters in DBSCAN
- radius of the neighborhood (e)
- density threshold of dense regions (MinPts)
an object is a core object if it has at least MinPts dp in its neighborhood
CLARA
sampling-based method for large data sets to do k medoids Instead of taking the whole data set into consideration, CLARA uses a random sample of the data set. The PAM algorithm is then applied to compute the best medoids from the sample. CLARA builds clusterings from multiple random samples and returns the best clustering as the output. The effectiveness of CLARA depends on the sample size.
4 widely used distance measurements for algorithmic hierarchal methods
sensitive to outliers:
- max: farthest neighbor (complete linkage)
- min: nearest neighbor (single linkage, min span tree)
not as sensitive:
- avg (handles categorical + numeric data)
- mean (simplest to compute)
CLIQUE algorithm
separates space into cells categorizes some as dense according to density threshold A 2-D cell, say in the subspace formed by age and salary, contains l points only if the projection of this cell in every dimension, that is, age and salary, respectively, contains at least l points.
clustering
set of clusters resulting from cluster analysis, discovering groupings dp within clusters have high similarities, and they are dissimilar from dp not in the cluster partitioning sets into subsets can lead to discovery of previously unknown groupings
how to find similar items
shingling - turn documents into sets
minhashing - compress large sets in a way that still preserves similarity (signatures)
locality-sensitive hashing - focus on pairs that have a higher likelihood of being similar instead of looking at everything
difference and similarities between collective outliers vs contextual
similarity: both are dependent on structure, both need to be reduced to conventional outlier detection diff: collective is usually implicit, not clearly defined context while contextual usually is clearly defined
Single versus complete linkages
single: finds local closeness (minimum spanning tree) complete: finds global closeness
STING
statistical information stored in grid cells. a multiresolution technique that divides the spatial area of the input objects into rectangular cells; can be hierarchical and recursive. statistical parameters are kept in each cell (e.g., the mean); the parameters of the higher-level cells are calculated from the stats of the lower-level cells. has an attribute-independent parameter (count) and attribute-dependent parameters: stdev, mean, min, max, type of distribution (normal, uniform, exponential) - assigned by the user. Here, the attribute is a selected measure for analysis, such as price for house objects. the complexity of generating clusters is O(n)
Sum of Squared Errors (SSE)
sum of dist(y, y*)^2 over all data points, where y* is the centroid of y's cluster. measures the extent of similarity within clusters; it is the within-cluster variation part of the objective function. the problem is NP-hard in general Euclidean space, even for two clusters
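A one-function sketch of this quantity, assuming numpy arrays for the points, integer labels, and centroids:

```python
import numpy as np

def sse(X, labels, centroids):
    # within-cluster variation: sum of squared distances from each point to its cluster centroid
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centroids))
```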
classification approach to outlier detection
supervised: one class model using only the model of the normal class to detect outliers, can handle fuzzy clustering
what happens if the distribution types of the lower level cells in STING disagree with each other?
the higher level cells are set to the distribution type of "none"
non parametric methods of statistical outlier analysis
the model of "normal data" is learned from the input data, rather than assuming one a priori. bases decidion from observations instead of infering something before hand -histogram: if it fits into a bin, its normal- else, its an outlier, con: its hard to choose the appropriate bin size - solution to making histogram better: kernel(gaussian function) algorithms for pattern analysis, whose best known member is the support vector machine (SVM
to be in a cluster in DBSCAN
the point must have at least MinPts other points within a ball of radius e. if a point isn't reachable from a core point, it is noise. the reachability-distance captures this by recording the minimum e at which a point can be added into a cluster, which prevents noise from being added to the cluster
cons of STING
the quality of STING clustering depends on the granularity of the lowest level of the grid structure: if the granularity is very fine, the cost of processing increases substantially; if the bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis. granularity is important. it also doesn't consider the spatial relationship between children and parents, or between neighboring cells; the cluster boundaries are horizontal or vertical but never diagonal, which may lower quality
partitioning pt2
the simplest and most fundamental form of cluster analysis. the number of clusters is given as background knowledge (k). organizes objects into k partitions, k <= n. examples: k-means, k-medoids
locality-sensitive hashing
to hash data points into buckets so that data points near each other are located in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations with various degrees of similarity. reduced to the pairs that are most likely to be similar We check only the candidate pairs for similarity.
shingling
turns a document into a set problem; similar documents are more likely to share more shingles. a k value of 8-10 is generally used in practice; a small value results in shingles that are present in most of the documents (bad for differentiating documents). measure similarity of docs by their shingle sets using the Jaccard index: more common shingles give a bigger Jaccard index, so the documents are more likely similar. big overhead (too much data) - solved by using signatures
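A tiny sketch of character k-shingles and the Jaccard index; k and the example strings are arbitrary.

```python
def shingles(text, k=9):
    # set of all overlapping character k-shingles of the document
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # Jaccard index: size of intersection over size of union
    return len(a & b) / len(a | b)

# higher Jaccard index (more shared shingles) => documents more likely similar
sim = jaccard(shingles("the quick brown fox"), shingles("the quick brown cat"))
```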
contextual outlier models
reduce it to a conventional outlier detection model to make detection possible; the context can be handled with a predictive model or based on contextual data you already have
density-connected
two objects are density-connected if both are density-reachable from a common core object; density-connected objects get put in the same cluster
global outliers
when a point significantly deviates from the rest of the data set. most algorithms try to find global outliers. the simplest type: point anomalies. dependent on behavioral attributes