Data Analytics - Module 3 Clustering Techniques

True

T/F: A cluster of data objects can be treated as one group.

True

T/F: A cluster is a subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it.
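
The distance condition in this card can be checked directly on a small example. The sketch below is only an illustration: the function name `is_well_separated_cluster`, the use of Euclidean distance, and the sample points are assumptions, not part of the module.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def is_well_separated_cluster(cluster, others):
    """Return True if every distance inside the cluster is smaller than
    every distance from a cluster member to a point outside it.

    Assumes `others` is non-empty."""
    # Largest distance between any two members (0.0 for a single-point cluster)
    max_inside = max((dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)
    # Smallest distance from any member to any outside point
    min_outside = min(dist(a, o) for a in cluster for o in others)
    return max_inside < min_outside


cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
others = [(5.0, 5.0), (6.0, 5.0)]
print(is_well_separated_cluster(cluster, others))
```

This is a strict definition; many practical clustering methods accept groups that only approximately satisfy it.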

True

T/F: The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

True

T/F: While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign the labels to the groups.

Partitioning, Hierarchical, Density-Based, Grid-Based, Model-Based, Constraint-Based

Enumeration (6): Clustering Methods

unsupervised machine learning

Clustering, falling under the category of _______________, is one of the problems that machine learning algorithms solve.

•Each cluster must have at least one object.
•Each object must be part of exactly one cluster, i.e., no overlapping.

Enumeration (2): The clusters formed have the following characteristics:
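
Both characteristics can be verified mechanically for a proposed grouping. This is a minimal sketch; the name `is_valid_partition` is hypothetical, and it assumes the objects are unique, hashable identifiers.

```python
def is_valid_partition(objects, clusters):
    """Check the two characteristics: no empty cluster, and every object
    appears in exactly one cluster (no overlap, nothing left out)."""
    if any(len(c) == 0 for c in clusters):
        return False
    assigned = [obj for c in clusters for obj in c]
    no_overlap = len(assigned) == len(set(assigned))
    covers_all = set(assigned) == set(objects)
    return no_overlap and covers_all


objects = ["a", "b", "c"]
print(is_valid_partition(objects, [["a", "b"], ["c"]]))       # valid grouping
print(is_valid_partition(objects, [["a", "b"], ["b", "c"]]))  # "b" overlaps
```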

•Determines the best positions for the K center points (centroids) by an iterative process.
•Assigns each data point to its closest centroid; the data points nearest to a particular centroid form a cluster.

Enumeration (2): The k-means clustering algorithm mainly performs two tasks:

•Agglomerative approach (bottom-up)
•Divisive approach (top-down)

Enumeration (2): The two approaches for creating a hierarchical decomposition
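
A rough sketch of the bottom-up (agglomerative) approach: start with every object in its own cluster and keep merging the two closest clusters. The choices here are assumptions made for illustration only: one-dimensional points, single linkage (distance between the closest members), and the hypothetical function name `agglomerative`. The divisive approach would run the other way, starting from one cluster and splitting it.

```python
def agglomerative(points, num_clusters):
    """Merge the two closest clusters until `num_clusters` remain.

    Assumes 1 <= num_clusters <= len(points)."""
    clusters = [[p] for p in points]  # bottom-up: one cluster per object
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


result = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], 3)
print(result)
```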

•Scalability
•Ability to deal with different kinds of attributes
•Discovery of clusters with arbitrary shape
•High dimensionality
•Ability to deal with noisy data
•Interpretability

Enumeration (6): Properties of Clustering

•Widely used in data analysis, market research, pattern recognition, and image processing
•Assists marketers in finding different groups
•Helps in data discovery
•Tracking applications
•A tool to gain insight into the distribution of data and to analyze the characteristics of each cluster
•In biology, it can be used to determine plant and animal taxonomies and to categorize genes
•Helps in the identification of areas of similar land use

Enumeration (7): Applications in Data Mining

1. Select the number K to decide the number of clusters.
2. Select K random points as the initial centroids. (They need not be points from the input dataset.)
3. Assign each data point to its closest centroid, which forms the K predefined clusters.
4. Recompute the centroid of each cluster as the mean of its assigned points (the position that minimizes the within-cluster variance).
5. Repeat step 3, i.e., reassign each data point to the new closest centroid.
6. If any reassignment occurred, go to step 4; otherwise go to FINISH.
7. The model is ready.

Enumeration (7): Steps in K-Means Algorithm
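
The steps in this card can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `k_means`, the `init`, `seed` and `max_iter` parameters, the Euclidean distance, and the sample points are all assumptions.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def k_means(points, k, init=None, seed=0, max_iter=100):
    """Return (centroids, assignment) for 2-D or n-D points given as tuples."""
    rng = random.Random(seed)
    # Steps 1-2: K is given; take the initial centroids from `init` or pick K random points
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Step 6: stop when no point changed cluster
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Step 7: the model is ready
    return centroids, assignment


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(points, k=2, init=[(0, 0), (10, 10)])
print(assignment)
print(centroids)
```

Passing `init` makes the run repeatable; with random initial centroids the result can depend on the starting points, so the algorithm is often run several times.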

Hierarchical

In this method, a hierarchical decomposition of the given set of data objects is created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed.

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K is the number of clusters to create, chosen in advance: if K=2 there will be two clusters, if K=3 there will be three, and so on. It is an iterative algorithm that divides the unlabeled dataset into K clusters such that each data point belongs to only one group of points with similar properties. It offers a convenient way to discover the groups in an unlabeled dataset on its own, without the need for any labeled training data. It is a centroid-based algorithm: each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and the centroids of their corresponding clusters.
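
The quantity being minimized can be computed for any given grouping. The sketch below assumes the usual formulation with squared Euclidean distances; the function name `within_cluster_sum_of_squares` and the sample data are illustrative only.

```python
def within_cluster_sum_of_squares(points, assignment, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for p, a in zip(points, assignment):
        c = centroids[a]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total


points = [(0, 0), (2, 0), (10, 0), (12, 0)]
# Grouping nearby points together gives a small objective value...
good = within_cluster_sum_of_squares(points, [0, 0, 1, 1], [(1, 0), (11, 0)])
# ...while mixing far-apart points gives a much larger one.
bad = within_cluster_sum_of_squares(points, [0, 1, 0, 1], [(5, 0), (7, 0)])
print(good, bad)
```

A lower value means the points sit closer to their centroids, which is what the algorithm works toward.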

Grid-based

The object space is quantized into a finite number of cells that form a grid structure, and clustering is performed on the grid. A major advantage of the grid-based method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space rather than on the number of data objects.
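
A small sketch of the quantization step: each point is mapped to a grid cell, and only cells holding enough points are kept. The names `dense_cells`, `cell_size` and `min_points` are assumptions made for illustration; a full grid-based method would go on to connect neighboring dense cells into clusters.

```python
from collections import Counter


def dense_cells(points, cell_size, min_points):
    """Quantize points into grid cells and return the cells that hold
    at least `min_points` points, with their counts."""
    counts = Counter(
        tuple(int(coord // cell_size) for coord in p) for p in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_points}


points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (5.1, 5.5), (5.9, 5.2), (9.7, 0.3)]
cells = dense_cells(points, cell_size=1.0, min_points=2)
print(cells)
```

After this step, later processing works on the cell counts instead of the individual points, which is where the speed advantage comes from.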

Model-Based

A model is hypothesized for each cluster, and the method finds the best fit of the data to that model. Clusters are located by clustering the density function, which reflects the spatial distribution of the data points. This approach also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers and noise into account. It therefore yields robust clustering methods.

Partitioning

Each cluster is represented by a prototype, and an iterative control strategy is used to optimize the clustering. The data set is divided into subsets called partitions, and each partition represents a cluster.

cluster

is a subset of similar objects

Density-Based

Focuses mainly on density. A given cluster keeps growing as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
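
The growing rule can be sketched as follows, in the style of the DBSCAN algorithm (which the card itself does not name). The function name `density_clusters`, the parameters `radius` and `min_points`, and the convention that a point counts as its own neighbor are assumptions made for this illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def density_clusters(points, radius, min_points):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""

    def neighbors(i):
        # Indices of all points within `radius` of point i (including i itself)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= radius]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_points:
            labels[i] = -1  # not dense enough: provisionally noise
            continue
        cluster_id += 1  # dense neighborhood: start a new cluster
        labels[i] = cluster_id
        frontier = list(nbrs)
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise point joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_points:  # dense around j too: keep growing
                frontier.extend(j_nbrs)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = density_clusters(points, radius=1.5, min_points=3)
print(labels)
```

The isolated point at (50, 50) has no dense neighborhood, so it is left as noise rather than forced into a cluster.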

Constraint-Based

performed by the incorporation of application or user-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide us with an interactive way of communication with the clustering process. The user or the application requirement can specify constraints.

Clustering

the process of making a group of abstract objects into classes of similar objects.

Ability to deal with different kinds of attributes

•Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.

Ability to deal with noisy data

•Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.

Discovery of clusters with arbitrary shape

•The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be limited to distance measures that tend to find small spherical clusters.

High dimensionality

•The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data.

Interpretability

•The clustering results should be interpretable, comprehensible, and usable

Scalability

•We need highly scalable clustering algorithms to deal with large databases.
•The algorithm should scale with the data; if it does not, we cannot get an appropriate result.

