data mining midterm 3


parametric stats based approach to outlier detection

a priori - assumes a normal distribution in advance. univariate detection: -we assume the data follow a normal distribution -we learn the parameters of the normal distribution from the input data and identify the points with low probability as outliers. detection of multivariate outliers: -multivariate outlier detection using the Mahalanobis distance -transform to a univariate set -chi-squared. you can also use multiple parametric distributions: -using multiple clusters -using multiple normal distributions
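As a concrete illustration of the multivariate route, here is a minimal sketch (the variable names, toy data, and the 97.5% chi-squared cutoff are all my own assumptions) that transforms the data to a univariate Mahalanobis score and flags the tail:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))          # mostly "normal" 2-D data
X = np.vstack([X, [[6.0, 6.0]]])       # one obvious multivariate outlier

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# squared Mahalanobis distance of each point to the estimated center;
# this transforms the multivariate data into a univariate score
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# under a multivariate normal assumption, d2 follows a chi-squared
# distribution with dof = number of dimensions
cutoff = stats.chi2.ppf(0.975, df=X.shape[1])
print(np.where(d2 > cutoff)[0])        # the injected point should be flagged
```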

which type of cluster analysis is good for finding arbitrary shapes?

density-based and continuity-based similarity methods

optimizing within cluster variation

for the exact optimum: O(n^(dk+1) log n). heuristics: greedy (k-means, k-medoids)

partitioning method

partitions the data so that each partition forms a cluster. k clusters, k <= n -each group must contain at least one object -exclusive cluster separation: each object must belong to exactly one group -most are distance based -uses an iterative relocation technique: attempts to improve the clustering through iteration -finds mutually exclusive clusters of spherical shape -may use the mean or medoid to represent the cluster center -good for small to medium-sized data sets

what are the types of clustering methods?

partitioning, density based, hierarchical, grid based. some methods combine some or all of these together

which methods are designed to find spherical shaped clusters?

partitioning and hierarchical methods

core-distance

the smallest radius ε' such that the ε'-neighborhood of the core object p contains at least MinPts objects. it is undefined if p is not a core object with respect to ε and MinPts.

what happens to STING as the granularity approaches 0?

An interesting property of STING is that it approaches the clustering result of DBSCAN as the granularity approaches 0 (i.e., toward very low-level data). in other words, dense clusters can also be identified using STING.

cluster outlier detection

- is it in any cluster? DBSCAN - is there a big distance between the object and the cluster it's in? k-means with an outlier score -is it part of a small or sparse cluster? CBLOF; fixed-width clustering - anything not within the width is an outlier
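A hedged sketch of the "k-means with outlier score" idea: score each point by its distance to its nearest centroid and flag the farthest ones. The toy data, k=2, and the top-1% threshold are all illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(8, 1, (200, 2)),
               [[4.0, 20.0]]])          # a planted outlier far from both clusters

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# outlier score = distance from each point to its assigned centroid
dist_to_center = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = np.quantile(dist_to_center, 0.99)
print(np.where(dist_to_center > threshold)[0])   # planted point should appear
```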

hierarchical methods

- makes a hierarchical decomposition of the data set, ranked in multiple levels -agglomerative: bottom-up approach. all dp start off as individual units and are built up; keeps merging until all groups are merged into one (the top of the hierarchy, a combination of all) or a termination condition holds -divisive: top-down. starts off as one giant cluster and keeps breaking into subsets until all the dp are on their own -can be distance based, density based, or continuity based -may incorporate other techniques like microclustering or consider object "linkages" cons: once a step is done (a merge or split) it can't be undone; cannot correct erroneous merges or splits

cons of k means

- need to pick a k in advance -can't discover non-convex shapes or clusters of very different sizes -sensitive to outliers

requirements of clustering

- scalability: models need to work well on large data sets -ability to deal with different types of attributes: needs to cluster more types of data instead of just numerical -discovery of clusters with arbitrary shape: not just spherical -reduced requirements for domain knowledge to determine input parameters: with k-means you need to specify k; you don't want domain knowledge to be required, but for a lot of methods it is -ability to deal with noisy data: we want models that are not so sensitive to outliers -incremental clustering and insensitivity to input order: when new data arrive, some algorithms need to start from scratch; some are also sensitive to the order of the input data. we need algorithms that take incremental data and are insensitive to input order -capability of clustering high-dimensionality data: most algorithms handle 2 or 3 dimensions; we want ones that take more -constraint-based clustering: find groups with good clustering behavior that satisfy the constraints -interpretability and usability: the goal should influence the attributes picked for clustering; the analysis needs to be interpretable and useful

clustering methods can be compared using the following criteria:

- the partitioning criteria: no hierarchy - customers grouped under a manager are at the same level conceptually; hierarchy - basketball sits under the hierarchy of sports -separation of clusters: can be exclusive - a dp belongs to only one cluster - or a dp can be within multiple clusters (multiple classifications) -similarity measure: similarity may be measured by distance in some methods, by density or continuity in others -clustering space: subspace clustering, finding clusters in low-dimensional subspaces

characteristics of cluster analysis

-automatic classification: creates implicit classes, can automatically find groupings - data segmentation: partitions data based on similarity -unsupervised learning since class labels are not present (learn by observation)

density based clustering

-can find non-spherical shapes. DBSCAN: Density-Based Spatial Clustering of Applications with Noise (density-based clustering based on connected regions with high density). density is measured by the number of objects close to o. it finds core objects, that is, objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters. a user-specified parameter ε > 0 specifies the radius of the neighborhood we consider for every object; the density of a neighborhood can be measured simply by the number of objects in the neighborhood.

what can you use cluster analysis for?

-gaining insight into the distribution of data -observing the characteristics of each cluster -focusing on a particular cluster for further analysis -preprocessing -outlier detection (outlier points can be "interesting" for further analysis), e.g., fraud detection

outlier detection

-supervised: a classification problem (has a class imbalance problem) -semi-supervised -unsupervised: makes the implicit assumption that all normal dp are clustered together. ways to model normal objects vs. outliers: -statistical model: normal objects are generated by a statistical model; dp that don't fit the model are outliers (e.g., a Gaussian model - normal data needs to fit the normal distribution) -proximity based: an outlier's proximity to its neighbors significantly deviates -clustering based

how to make k means more efficient /scalable with big data

-use a decent-sized sample -filtering approach - for spatial data -microclustering: first group nearby objects into "microclusters", then perform k-means clustering on the microclusters -use k-medoids instead! less sensitive to outliers since it chooses an actual dp to be the center

k-medoid algorithm

1) choose k representative objects (seeds) 2) assign each remaining object to the nearest representative 3) choose a random non-representative object 4) compute the cost s of swapping it with a representative 5) if s < 0, swap it with the representative to form the new representative
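A minimal PAM-style sketch of steps 1-5, written for clarity rather than speed; the function name, random-swap loop, and iteration budget are my own choices, not a canonical implementation:

```python
import numpy as np

def pam(X, k, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    medoids = rng.choice(n, size=k, replace=False)       # 1) choose seeds
    cost = D[:, medoids].min(axis=1).sum()               # total absolute error
    for _ in range(n_iter):
        i = rng.integers(k)                              # a current medoid slot
        o = rng.integers(n)                              # 3) random non-representative
        if o in medoids:
            continue
        trial = medoids.copy()
        trial[i] = o
        new_cost = D[:, trial].min(axis=1).sum()         # 4) cost after the swap
        if new_cost < cost:                              # 5) s < 0: keep the swap
            medoids, cost = trial, new_cost
    labels = D[:, medoids].argmin(axis=1)                # 2) assign to nearest rep
    return medoids, labels

X = np.random.default_rng(2).normal(size=(100, 2))
meds, labels = pam(X, k=3)
```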

hierarchical clustering pt2

A hierarchical clustering method works by grouping data objects into a hierarchy or "tree" of clusters. Our use of a hierarchy here is just to summarize and represent the underlying data in a compressed way. can uncover a hierarchy as well

dendrogram

A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step-by-step
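A small sketch of building and drawing a dendrogram with SciPy; the 'ward' linkage choice and the toy data are arbitrary assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(3).normal(size=(20, 2))
Z = linkage(X, method='ward')   # agglomerative merge history, one row per merge
dendrogram(Z)                   # leaves = objects, height = merge distance
plt.show()
```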

CLIQUE

CLustering In QUEst. a grid- and density-based approach to subspace clustering in high-dimensional data space. apriori-like. finds clusters in subspaces of attributes that are more similar to each other: instead of trying to find clusters over the full high dimensionality, use just a few similar attributes in a subcategory, e.g., symptoms (high fever, stuffy nose, etc.) and find clusters there.

parameters for DBSCAN

DBSCAN = density based. ε is a distance parameter that defines the radius to search for nearby neighbors. MinPts = the number of points that need to be in the neighborhood for it to count as dense. a point in DBSCAN can be: -a core point -a border point (in a cluster but without enough points around it to be a core) -a noise point
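A usage sketch of the two parameters via scikit-learn's DBSCAN; the eps and min_samples values and the toy data are illustrative and would need tuning on real data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(3, 0.3, (100, 2)),
               [[10.0, 10.0]]])                 # an isolated noise point

db = DBSCAN(eps=0.5, min_samples=5).fit(X)      # eps = radius, min_samples = MinPts
print(set(db.labels_))                          # cluster ids; -1 marks noise points
print(db.core_sample_indices_[:5])              # indices of core points
```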

divisive method

DIANA (DIvisive ANAlysis) top-down strategy. The cluster is split according to some principle such as the maximum Euclidean distance between the closest neighboring objects in the cluster.

univariate data

Data involving only one attribute or variable are called univariate data.

cons of hierarchical methods

Hierarchical clustering methods can encounter difficulties regarding the selection of merge or split points. Such a decision is critical, because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters.

k-medoids

Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is assigned to the cluster whose representative object is the most similar. minimizes the sum of the dissimilarities between each object p and its corresponding representative object: uses the absolute-error criterion - we want to minimize absolute error. NP-hard except when k = 1. see Partitioning Around Medoids (PAM).

unsupervised outlier detection

normal objects don't have to fall into one group; instead, they can form multiple groups, where each group has distinct features. however, an outlier is expected to occur far away in feature space from any of those groups of normal objects.

number of clusters

It can be regarded as finding a good balance between compressibility and accuracy in cluster analysis. simple rule: set the number of clusters to √(n/2); each cluster would then have about √(2n) points. elbow method: increasing the number of clusters reduces the sum of within-cluster variance of each cluster; pick the k at the "elbow" where the reduction levels off. cross-validation: cluster on part of the data and measure the quality on the held-out part.
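A small elbow-method sketch using scikit-learn's `inertia_` attribute (the within-cluster SSE); the k range and the three toy blobs are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = np.vstack([np.random.default_rng(s).normal(3 * s, 0.5, (100, 2))
               for s in range(3)])              # three well-separated blobs

ks = range(1, 9)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]
plt.plot(ks, sse, marker='o')                   # the elbow should appear near k=3
plt.xlabel('k'); plt.ylabel('within-cluster SSE')
plt.show()
```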

challenges in outlier detection

-modeling normal objects and outliers effectively: outlierness - to what extent would a point be considered normal? -application-specific outlier detection -handling noise in outlier detection -understandability

signatures

Our goal in this section is to replace large sets by much smaller representations called "signatures." compare the signatures of two sets and estimate the Jaccard similarity of the underlying sets from the signatures alone. basically, signatures are the result of minhashing: the short vector of numbers used to represent the bin of similar sets, with H(d) being the signature of document d. if similarity(d1, d2) is high then Probability(H(d1) == H(d2)) is high. make as many signatures as there are documents.
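A compact minhashing sketch under the usual universal-hashing convention h(x) = (a*x + b) mod p; the parameter ranges, the prime, and the 200 hash functions are arbitrary choices:

```python
import random

def minhash_signature(item_set, hash_params, prime=2_147_483_647):
    # one signature entry per hash function: the min of h over the set's elements
    return [min((a * hash(x) + b) % prime for x in item_set)
            for a, b in hash_params]

def estimate_jaccard(sig1, sig2):
    # fraction of agreeing positions estimates the Jaccard similarity
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

random.seed(0)
params = [(random.randrange(1, 10**6), random.randrange(10**6)) for _ in range(200)]
s1 = {"the", "quick", "brown", "fox"}
s2 = {"the", "quick", "brown", "dog"}
sig1 = minhash_signature(s1, params)
sig2 = minhash_signature(s2, params)
print(estimate_jaccard(sig1, sig2))   # true Jaccard = 3/5 = 0.6
```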

PAM vs CLARA

PAM examines every object in the data set against every current medoid, whereas CLARA confines the candidate medoids to only a random sample of the data set.

proximity based approaches to outlier detection

Proximity-based approaches assume that the proximity of an outlier object to its nearest neighbors significantly deviates from the proximity of the object to most of the other objects in the data set. two flavors: -distance based -density based

single linkage approach

This is a single-linkage approach in that each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters. agglomerative (AGNES)

minimal spanning tree algorithm

Thus, an agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm, where a spanning tree of a graph is a tree that connects all vertices, and a minimal spanning tree is the one with the least sum of edge weights.

Partitioning Around Medoids (PAM):

We consider whether replacing a representative object by a nonrepresentative object would improve the clustering quality, i.e., the quality of another data point as a new medoid. This quality is measured by a cost function of the average dissimilarity between an object and the representative object of its cluster. Therefore, the cost function calculates the difference in absolute-error value if a current representative object is replaced by a nonrepresentative object. The total cost of swapping is the sum of costs incurred by all nonrepresentative objects. If the total cost is negative, then o_j is replaced or swapped with o_random because the actual absolute error E is reduced. If the total cost is positive, the current representative object, o_j, is considered acceptable, and nothing is changed in the iteration. does not scale well for large data sets.

complete linkage algorithm

When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance between clusters, it is sometimes called a farthest-neighbor clustering algorithm. If the clustering process is terminated when the maximum distance between nearest clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm. each cluster is a complete subgraph, that is, with edges connecting all the nodes in the cluster. Farthest-neighbor algorithms tend to minimize the increase in the diameter of the clusters at each iteration.
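A brief sketch contrasting the two linkages on the same data via scikit-learn's AgglomerativeClustering; the toy data and n_clusters are illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(5).normal(size=(60, 2))
single = AgglomerativeClustering(n_clusters=3, linkage='single').fit_predict(X)
complete = AgglomerativeClustering(n_clusters=3, linkage='complete').fit_predict(X)
# single linkage chains through nearest neighbors (local closeness);
# complete linkage keeps cluster diameters small (global closeness)
print(single[:10], complete[:10])
```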

CELL algorithm

a grid-based method for distance-based outlier detection. each cell is a cube with side length r/2 (r = the distance threshold parameter). detects DB(r, π) outliers. needs at most three scans of the data.

multiphase clustering

a way to make hierarchical methods better: integrate them with other techniques. examples: BIRCH and Chameleon

collective outlier

a certain action containing multiple dp is considered an outlier as a whole, e.g., 100 order delays in one day when usually there are around 1 or 2 a day. Given a data set, a subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set. Importantly, the individual data objects may not be outliers. e.g., the black objects as a whole form a collective outlier because the density of those objects is much higher than the rest of the data set. you need multiple dp to see the outlier pattern; consider the behavior of a group of objects.

an objective function

aims for high intracluster similarity and low intercluster similarity. This objective function tries to make the resulting k clusters as compact and as separate as possible.

categories of hierarchical methods

algorithmic (deterministic measurement of distance): agglomerative, divisive, multiphase. probabilistic: measure the quality of clusters by the fitness of models. bayesian: return a group of clustering structures and their probabilities, conditional on the given data.

k-means

an objective function. A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. the centroid can be defined in various ways, such as by the mean or medoid of the objects. uses euclidean distance. The quality of cluster Ci can be measured by the within-cluster variation, which is the sum of squared error between all objects in Ci and the centroid ci, defined as E = Σ_{i=1}^{k} Σ_{p∈Ci} dist(p, ci)². The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively improves the within-cluster variation. The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum; the results may depend on the initial random selection of cluster centers. run time: O(nkt), where n = number of dp, k = number of clusters, t = number of iterations. The k-means method can be applied only when the mean of a set of objects is defined.
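Since the card walks through the assignment and update steps, here is a bare-bones Lloyd's-algorithm sketch; the function name and convergence test are my own, and the k-means++ seeding and empty-cluster handling real implementations use are deliberately skipped:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid by Euclidean distance
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # update step: centroid = mean of the points assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):           # converged (local optimum)
            break
        centers = new_centers
    return centers, labels
```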

local proximity based outliers

an outlier relative to the distance to cluster 1 but not necessarily in comparison to cluster 2. distance-based methods cannot detect this; only density-based ones can. The critical idea here is that we need to compare the density around an object with the density around its local neighbors. the local outlier factor (LOF) is used - a value close to 1 means the point is deep in a cluster
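A usage sketch of LOF via scikit-learn; the n_neighbors and contamination values, and the planted point near the dense cluster, are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.2, (100, 2)),    # dense cluster C1
               rng.normal(5, 2.0, (100, 2)),    # sparse cluster C2
               [[1.0, 1.0]]])                   # outlier relative to C1 only

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
flags = lof.fit_predict(X)                      # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_          # LOF values; ~1 = deep in a cluster
print(np.where(flags == -1)[0], scores[-1])
```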

contextual outliers

anomalies that depend on multiple factors: 80 degrees is an outlier in Antarctica but not in Florida. dependent on behavioral attributes and contextual attributes. a generalization of local outliers: the density significantly deviates from the local area in which it occurs

novelty detection vs outliers

both are found because they are different from the normal mechanisms. however, a novelty usually gets incorporated into the model later and won't be treated as an outlier from that point forward

k modes method

can cluster nominal data The k-modes method is a variant of k-means, which extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes It uses new dissimilarity measures to deal with nominal objects and a frequency-based method to update modes of clusters

"When I try out a clustering method on a data set, how can I evaluate whether the clustering results are good?"

-clustering tendency -number of clusters -measuring cluster quality

difference between density based clustering vs density based outlier detection

clustering: needs MinPts and r. outlier detection: needs the reachability distance and r; no minimum number of points!

density based method

continues growing a given cluster as long as the density in the neighborhood exceeds some threshold; a neighborhood needs to contain a minimum number of points to grow the cluster -can find arbitrarily shaped clusters -clusters are high-density areas in space separated by low-density regions -usually only deals with exclusive clusters -may filter out outliers

what information does OPTICS need per object?

core-distance: the smallest value ε' such that the ε'-neighborhood of p contains at least MinPts objects, i.e., the minimum distance threshold that makes p a core object -if p is not a core object with respect to ε and MinPts, the core-distance of p is undefined -the point must have at least MinPts other points within a ball of radius ε'. reachability-distance of p from q: the minimum radius value that makes p density-reachable from q: max{core-distance(q), dist(p, q)}; q has to be a core object. if the core-distance is greater than the euclidean distance, then it is the core-distance. the complexity is O(n log n) if a spatial index is used, and O(n²) otherwise, where n is the number of objects

clustering tendency

determines whether a given data set has a non-random structure, which may lead to meaningful clusters. test whether the data set has a uniform data distribution using the Hopkins Statistic, a spatial statistic that tests the spatial randomness of a variable as distributed in a space.
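A hedged sketch of the Hopkins statistic under the common convention H = Σu / (Σu + Σw), where u are nearest-neighbor distances from uniform random probes to the data and w are nearest-neighbor distances among sampled data points; the sample size m is an arbitrary choice:

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, m=50, seed=0):
    rng = np.random.default_rng(seed)
    tree = cKDTree(X)
    lo, hi = X.min(axis=0), X.max(axis=0)
    probes = rng.uniform(lo, hi, size=(m, X.shape[1]))   # uniform random points
    u = tree.query(probes, k=1)[0]                       # NN distance to the data
    sample = X[rng.choice(len(X), m, replace=False)]
    w = tree.query(sample, k=2)[0][:, 1]                 # k=2 skips the point itself
    return u.sum() / (u.sum() + w.sum())                 # ~0.5 = uniform, ~1 = clustered

X = np.vstack([np.random.default_rng(7).normal(c, 0.3, (100, 2)) for c in (0, 5)])
print(hopkins(X))   # clustered data should give a value well above 0.5
```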

global outlier detection

distance based: appropriate, since all objects are judged against the same global parameters

OPTICS

doesn't force all clusters to share one density; by setting ε = infinity it covers every density level - the main advantage over DBSCAN, although the run time may be more complex. outputs a cluster ordering that represents the density-based clustering structure of the data; objects in a denser cluster are listed closer to each other in the ordering. does not require a user to provide a specific density threshold - good! To construct the different clusterings simultaneously, the objects are processed in a specific order; clusters with higher density (lower ε) will be finished first when creating the clusters
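A usage sketch of scikit-learn's OPTICS, whose max_eps defaults to infinity as described above; the min_samples value and toy data are arbitrary assumptions:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.2, (100, 2)),    # dense cluster
               rng.normal(5, 1.0, (100, 2))])   # looser cluster

opt = OPTICS(min_samples=10).fit(X)
order = opt.ordering_                 # the cluster ordering of the objects
reach = opt.reachability_[order]      # valleys of low reachability = dense clusters
core = opt.core_distances_
print(reach[:5], core[:5])
```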

grid based method

embeds the space into cells independent of the distribution of the input objects. uses a multiresolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed. pros: fast processing time (independent of the number of dp, dependent only on the number of cells). examples: STING and CLIQUE

measure cluster quality

extrinsic: cluster homogeneity, cluster completeness, rag bag (miscellaneous dp), small cluster preservation. intrinsic: silhouette coefficient

exclusive vs fuzzy

exclusive: a dp can only belong to one cluster. fuzzy: a dp can belong to multiple clusters

types of outliers

global contextual collective

space driven clustering method

grid-based clustering: embeds the space into cells independent of the distribution of the dp

data driven clustering methods

hierarchical, density based, partitioning based. aka they are data-driven: they partition the set of objects and adapt to the distribution of the objects in the embedding space.

single linkage algorithm

if the clustering process is terminated when the distance between nearest clusters exceeds a user-defined threshold, it is called a single-linkage algorithm. (min span tree, min distance)

directly density-reachable

a dp p is directly density-reachable from a core object q if p is in the neighborhood of q. a non-core object is density-reachable from a core object but not vice versa, unless both are cores

How is this statistical information useful for query answering? for STING

it filters down the necessary cells that fit the criteria for the query from the top down until it reaches the bottom. confidence testing is used. First, a layer within the hierarchical structure is determined from which the query-answering process is to start. This layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated probability range) reflecting the cell's relevancy to the given query. The irrelevant cells are removed from further consideration. Processing of the next lower level examines only the remaining relevant cells. This process is repeated until the bottom layer is reached. At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned.

agglomerative method

it finds the two clusters that are closest to each other (according to some similarity measure), and combines the two to form one cluster. Because two clusters are merged per iteration, where each cluster contains at least one object, an agglomerative method requires at most n iterations. AGNES (AGglomerative NESting)

What advantages does STING offer over other clustering methods?

it is query independent: statistical information stored in each cell represents the summary information of the data in the grid cell, independent of the query; the grid structure facilitates parallel processing and incremental updating; and the method's efficiency is a major advantage: STING goes through the database once to compute the statistical parameters of the cells, and hence the time complexity of generating clusters is O(n). After generating the hierarchical structure, the query processing time is O(g), where g is the total number of grid cells at the lowest level, which is usually much smaller than n.

"Which method is more robust—k-means or k-medoids?"

k-medoids, if there is a lot of noise and outliers in the data; its complexity is high though

heuristic clustering methods

k-means and the k-medoids algorithms, which progressively improve the clustering quality and approach a local optimum. sufficient for reaching the goal, though not necessarily the optimal solution. good for finding spherical-shaped clusters in small to medium-sized data sets

statistical approaches

make assumptions about data normality. They assume that the normal objects in a data set are generated by a stochastic process (a generative model); objects in low-probability regions are outliers. parametric method: assumes that the normal data objects are generated by a parametric distribution (a probability distribution with a fixed set of parameters). non-parametric method: assumes that the number and nature of the parameters are flexible and not fixed in advance; the model is built from the input data, not a priori. In summary, statistical methods for outlier detection learn models from data to distinguish normal data objects from outliers

what is the most popular clustering method?

most popular is partitioning by distance (cons: can only find spherical shaped clusters)

distance based proximity approach to outlier detection

nested loop: checks an object against the whole data set - costly. grid based - better: checks an object only in comparison to others in its group

noise vs outliers

noise: random error or variance. outliers: interesting because they are not made by the same mechanisms as the rest of the data. justify why the outliers detected are generated by some other mechanisms; this is often achieved by making various assumptions on the rest of the data and showing that the outliers detected violate those assumptions significantly.

gaussian distribution

normal distribution

The general criterion of a good partitioning

objects in the same cluster are "close" or related to each other, whereas objects in different clusters are "far apart" or very different

DBSCAN algorithm

picks a random unvisited object, marks it as visited, and checks whether the point has at least MinPts objects around it -if not: labeled as noisy data -if so: it becomes a core object and the neighborhood points get checked next. the clustering process continues until all objects are visited. the computational complexity of DBSCAN is O(n log n) with a spatial index. it must use global params, so all clusters have to look the same (same density); requires a user to provide a specific density threshold
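A from-scratch sketch of that procedure, favoring readability over the spatial-index speedups (so this version is O(n²)); all names here are my own:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise until proven otherwise
    visited = np.zeros(n, dtype=bool)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # fine for a small demo
    neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:     # not a core point: stays noise
            continue                        # (may be claimed later as a border)
        labels[i] = cluster                 # new core point starts a cluster
        seeds = list(neighbors[i])
        while seeds:                        # expand through the dense region
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:   # j is also a core point
                    seeds.extend(neighbors[j])
            if labels[j] == -1:
                labels[j] = cluster         # border or core joins the cluster
        cluster += 1
    return labels
```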

pros and cons of CLIQUE

pros: -high scalability -insensitive to the order of inputs. cons: good results are dependent on the grid size and on the density threshold; every cluster shares these same params, so the simplicity of the model can make it a little bad

grid based method

quantizes the object space into a finite number of cells that form a grid structure. pros: fast processing time - only dependent on the number of grid cells, independent of the number of dp. a good approach to spatial data mining problems. can be integrated with density-based clustering or hierarchical methods, since it's just a space thing. -multiresolution grid structure

user parameters in DBSCAN

radius ε of the neighborhood. density threshold of dense regions (MinPts) -an object is a core object if its neighborhood contains at least MinPts dp

CLARA

sampling-based method for large data sets to do k medoids Instead of taking the whole data set into consideration, CLARA uses a random sample of the data set. The PAM algorithm is then applied to compute the best medoids from the sample. CLARA builds clusterings from multiple random samples and returns the best clustering as the output. The effectiveness of CLARA depends on the sample size.

4 widely used distance measurements for algorithmic hierarchical methods

sensitive to outliers: -max: farthest neighbor (complete linkage) -min: nearest neighbor (single linkage, min span tree). not as sensitive: -avg (handles categorical + numeric data) -mean (simplest to compute)

CLIQUE algorithm

separates the space into cells and categorizes some as dense according to a density threshold. A 2-D cell, say in the subspace formed by age and salary, contains l points only if the projection of this cell onto every dimension, that is, age and salary, respectively, contains at least l points.

clustering

the set of clusters resulting from cluster analysis; discovering groupings. dp within clusters have high similarities, and they are dissimilar from dp not in the cluster. partitions sets into subsets. can lead to the discovery of previously unknown groupings

how to find similar items

shingling - turn documents into sets. minhashing - compress large sets in a way that still deduces similarity. locality-sensitive hashing - focus on pairs that have a higher likelihood of being similar instead of looking at everything -signatures

difference and similarities between collective outliers vs contextual

similarity: both are dependent on structure; both need to be reduced to conventional outlier detection. diff: collective usually has an implicit, not clearly defined context, while contextual usually has a clearly defined one

Single versus complete linkages

single: finds local closeness (minimum spanning tree) complete: finds global closeness

STING

statistical info stored in grid cells. a multiresolution technique that puts the spatial areas of the input objects into rectangular cells. can be hierarchical and recursive. keeps statistical parameters in each cell, like the mean, avg, etc.; the stat parameters of the higher-level cells are calculated from the stats of the lower-level cells. has an attr-independent param: count; and attr-dependent params: stdev, mean, min, max, type of distribution (normal, uniform, exponential) - assigned by the user. Here, the attribute is a selected measure for analysis, such as price for house objects. the complexity of generating clusters is O(n)

Sum of Squared Errors (SSE)

sum of dist(y, y*)² over all data points. measures the extent of similarity within clusters; part of the objective function (within-cluster variation). the problem is NP-hard in general Euclidean space, even for two clusters

classification approach to outlier detection

supervised: one class model using only the model of the normal class to detect outliers, can handle fuzzy clustering

what happens if the distribution types of the lower level cells in STING disagree with each other?

the higher level cells are set to the distribution type of "none"

non parametric methods of statistical outlier analysis

the model of "normal data" is learned from the input data, rather than assuming one a priori. bases decidion from observations instead of infering something before hand -histogram: if it fits into a bin, its normal- else, its an outlier, con: its hard to choose the appropriate bin size - solution to making histogram better: kernel(gaussian function) algorithms for pattern analysis, whose best known member is the support vector machine (SVM

to be in a cluster in DBSCAN

the point must have at least MinPts other points within a ball of radius ε, or be within reach of such a core point; otherwise it is noise. the reachability-distance captures this by being the minimum ε' value at which a certain point can be added into the cluster; this prevents noise from being added to the cluster

cons of STING

the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is very fine, the cost of processing will increase substantially; however, if the bottom level of the grid structure is too coarse, it may reduce the quality of cluster analysis. granularity is important. it doesn't consider the spatial relationship between children and parents, or with neighbors; the cluster boundaries are either horizontal or vertical but never diagonal, which may lower quality

partitioning pt2

the simplest and most fundamental form of cluster analysis. the number of clusters is given as background knowledge (k). organizes objects into k partitions, k <= n. examples: k-means, k-medoids

locality-sensitive hashing

hashes data points into buckets so that data points near each other land in the same buckets with high probability, while data points far from each other are likely to be in different buckets. This makes it easier to identify observations with various degrees of similarity. the search is reduced to the pairs that are most likely to be similar; we check only the candidate pairs for similarity.
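A hedged banding sketch of LSH over minhash signatures: split each signature into b bands of r rows and treat documents that collide in any band as candidate pairs. The toy signatures and function name are assumptions:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands, rows):
    # signatures: dict doc_id -> list of minhash values (len == bands * rows)
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])   # this band's slice
            buckets[band].append(doc)                    # same band -> same bucket
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates   # only these pairs get a full similarity check

sigs = {"d1": [3, 7, 1, 9], "d2": [3, 7, 2, 8], "d3": [5, 5, 5, 5]}
print(lsh_candidate_pairs(sigs, bands=2, rows=2))   # d1, d2 collide in band 0
```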

shingling

turn a doc into a set problem. similar documents are more likely to share more shingles. a k value of 8-10 is generally used in practice; a small value results in shingles that are present in most documents (bad for differentiating documents). measure the similarity of docs by their shingles -> use the Jaccard index: more common shingles result in a bigger Jaccard index and hence it's more likely that the documents are similar. con: big overhead - too much data -> solve this by using signatures
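A short sketch of k-shingling plus the Jaccard index, with k=8 per the rule of thumb above; the two example sentences are arbitrary:

```python
def shingles(text, k=8):
    # all overlapping k-character substrings of the document
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

d1 = "the quick brown fox jumps over the lazy dog"
d2 = "the quick brown fox leaps over the lazy dog"
print(jaccard(shingles(d1), shingles(d2)))   # high value -> likely similar docs
```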

contextual outlier models

reduce it to a conventional outlier detection model to make detection possible. the contexts can be handled by a predictive model, or can be based on data you already have about the contexts

density-connected

two points are density-connected if both are density-reachable from a common core object, so their clusters get put together

global outliers

when a point significantly deviates from the rest of the data set. most algorithms try to find global outliers. the simplest type: point anomalies. dependent on behavioral attributes


Related study sets

NOS 110 CH.3 Desktop Virtualization Quiz

40 Questions to test a Data Scientist on Dimensionality Reduction Techniques

Chapter 31: Assessment and Management of Patients With Hypertension

Chapter 2 - Job Performance Concepts and Measures

class 4 - chapter 28: safety, security, and emergency preparedness

AP Psych, Unit 5, Cognitive Psychology

Marketing Research Final (Philip)

Sociology Chapter 4 : Social Structure and Interaction in Everyday Life