K-means Clustering
For K-means, if there is a dependent (y) variable, do we remove it before trying to group customers?
Yes. If your dataset contains a dependent variable, remove it before applying a clustering algorithm; clustering is unsupervised and should use only the independent features.
Common applications of clustering include
● Customer segmentation: grouping customers by buying patterns, income, spending behaviour, loyalty, customer lifetime value
● Document clustering
● Image segmentation
● Recommendation engines
● Anomaly detection, e.g. identifying fraudulent or criminal activity
● Classification among different species of plants and animals
● Creating news feeds: clustering articles based on their similarity
● Pattern detection in medical imaging for diagnostics
Euclidean distance is insensitive to outliers.
False
The silhouette of a point is given by: Silhouette Score = (b - a) / max(a, b), where a = the average distance of the point from all other points in the same cluster, and b = the average distance of the point from all points in the nearest cluster. Which of the following is the correct statement?
For a good cluster, a is smaller than b.
Which of the following distance metrics can account for the relationship between variables?
Mahalanobis distance
If the silhouette value is close to 1, the sample is well-clustered and already assigned to a very appropriate cluster.
True
The Jaccard distance measures the distance between two sets.
True
Similar
closer to each other
Dissimilar
not alike; different; farther away from each other
Disadvantage of Connectivity clustering (Hierarchical)
Computationally expensive: the distance between every pair of objects has to be computed, so the computing required grows rapidly with the number of objects.
objective of clustering
● Ensure that the distance between data points within a cluster is very low compared to the distance between two clusters
● Group similar data points into a group, e.g. segmenting customers into similar groups or automatically organizing similar files/emails into folders
● Simplify data by reducing many data points into a few clusters
Jaccard Distance
Measures the distance between two sets, i.e. how different the two sets are from each other: 1 minus the size of the intersection divided by the size of the union of the two sets.
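A minimal Python sketch of this definition; the helper function name and the example customer sets are invented purely for illustration:

```python
# Minimal sketch: Jaccard distance between two Python sets,
# defined as 1 - |A ∩ B| / |A ∪ B|.
def jaccard_distance(a: set, b: set) -> float:
    if not a and not b:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(a | b)

# Example: two customers described by the product categories they bought.
customer_1 = {"books", "electronics", "toys"}
customer_2 = {"books", "toys", "garden"}
print(jaccard_distance(customer_1, customer_2))  # 0.5 (2 shared out of 4 total)
```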
Unsupervised learning
Training an algorithm using information that is neither classified nor labelled. There are no defined dependent and independent variables; patterns in the data are used to identify and group similar observations.
Silhouette Coefficient
The silhouette score measures how similar the points within a cluster are to each other compared to how similar those points are to points in the neighbouring clusters, so it gives a way to compare whole clusterings. It is a metric that indicates the goodness of clustering algorithms, especially k-means. Its value ranges between -1 and +1:
● +1 indicates tight, well-separated clusters
● 0 indicates clusters that are not well separable
● -1 indicates that the data points of a cluster are closer to the centroids of other clusters than to the centroid of their own cluster
For each point in a cluster, the score (1) measures how close the point is to all other points in the same cluster, and (2) compares that with how close the point is to the nearest other cluster.
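A minimal sketch of computing this score with scikit-learn's silhouette_score; the synthetic data and the choice of k = 3 are illustrative assumptions, not part of the notes:

```python
# Sketch: silhouette score for a K-means result, using scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # illustrative data

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
score = silhouette_score(X, labels)  # ranges from -1 to +1; closer to +1 is better
print(f"silhouette score: {score:.3f}")
```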
Unsupervised Learning
is a class of machine learning techniques used to find patterns in data
What is Lloyd's Algorithm?
1. Assume K centroids (the choice of K is subjective). 2. Compute the squared Euclidean distance of each object to these K centroids and assign each object to the closest centroid, forming clusters. 3. Compute the new centroid (mean) of each cluster based on the objects assigned to it. 4. Repeat steps 2 and 3 until convergence, usually defined as the point at which no objects move between clusters. For example, with K = 3 there are 3 centroids: compute the distance of each point to each centroid, assign each point to the closest centroid to form the clusters, recompute each centroid as the centre (mean) of the points within its cluster, and repeat.
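A minimal NumPy sketch of these steps; the function name and the initialisation by sampling K data points are illustrative choices under these assumptions, not a definitive implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm: assign points to the nearest
    centroid, recompute centroids as cluster means, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: squared Euclidean distance of every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)  # each point joins its closest centroid
        # Step 3: new centroid = mean of the points assigned to it
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```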
Strengths and Weaknesses of K-means clustering
Strengths:
● Uses simple principles without the need for any complex statistical terms.
● Once clusters and their associated centroids are identified, it is easy to assign new objects (for example, new customers) to a cluster based on the object's distance from the closest centroid.
● Because the method is unsupervised, using k-means helps to eliminate subjectivity from the analysis and lets the data speak for itself.
Weaknesses:
● How to choose K is not clear.
● The k-means algorithm is sensitive to the starting positions of the initial centroids, so it is important to rerun the analysis several times for a particular value of K to ensure the cluster results achieve the overall minimum WSS (see the sketch below).
● Susceptible to the curse of dimensionality: as the number of dimensions grows, the algorithm becomes computationally more complex.
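A hedged sketch of the initialisation-sensitivity point above, using scikit-learn's KMeans and its inertia_ (WSS) attribute; the synthetic data and the choice of k = 4 are assumptions made for illustration:

```python
# Sketch: K-means is sensitive to the initial centroids, so compare the
# within-cluster sum of squares (inertia_) across several single-start runs
# versus one multi-start run.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=2.0, random_state=7)

for seed in range(3):  # three single-start runs with different starting centroids
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  WSS={km.inertia_:.1f}")

# n_init=10 restarts k-means 10 times and keeps the run with the lowest WSS.
best = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(f"best of 10 restarts  WSS={best.inertia_:.1f}")
```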
Connectivity clustering (Hierarchical)
Based on the idea that related objects are closer to each other: can we then create a hierarchy of clusters/groups? The algorithm builds a tree, so you can decide how many clusters you want after the analysis. Similarity is defined between every pair of points/rows/objects (two points are similar if they are close to each other), and this connectivity is used to decide how the clustering should be done. Useful when you want flexibility in how many clusters you ultimately want, for example when grouping items on an online marketplace like Etsy or Amazon.
Computationally expensive: n objects require n(n-1)/2 pairwise distances, so 1,000 rows already mean roughly 500,000 distances.
• In terms of outputs from the algorithm, in addition to cluster assignments you also get a tree (dendrogram) that tells you about the hierarchies between the clusters; you can then pick the number of clusters you want from this tree (see the sketch below).
• In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis.
• Algorithms can be agglomerative (start with each object as its own cluster and aggregate them into larger clusters) or divisive (start with the complete data and divide it into partitions).
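A possible sketch of agglomerative clustering and its dendrogram using SciPy's linkage, dendrogram and fcluster; the generated data, the Ward linkage, and the cut into 3 clusters are illustrative assumptions:

```python
# Sketch: agglomerative (hierarchical) clustering with SciPy,
# producing the dendrogram described above.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in (0, 4, 8)])  # toy data

Z = linkage(X, method="ward")  # merge history; dendrogram's y-axis is the merge distance
dendrogram(Z)
plt.show()

# Pick the number of clusters after inspecting the tree, e.g. cut it into 3.
labels = fcluster(Z, t=3, criterion="maxclust")
```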
Centroid clustering (e.g. K-means clustering, Euclidean)
Computationally less expensive compared to hierarchical techniques, but you have to pre-define K, the number of clusters. Two points are similar if they are close to the same centroid; the centroid is the mean of the points assigned to it, so centroids are like the heart of the cluster: they capture the points closest to them and add them to the cluster. This is the most used family of clustering techniques. It aims to partition the n observations into K clusters so as to minimize the within-cluster sum of squares (i.e. variance): take the squared distances of the points to their centroid and sum them. The objective is to find K clusters/groups, each represented by its centroid (the term also appears in hierarchical agglomerative clustering when the cluster distance is defined by the distance between centroids, d_centroid). For example, with 1,000 rows and K = 5 centres, each iteration computes the distance of every point to the 5 centres, i.e. 5 x 1,000 = 5,000 distances (K x n in general); choose K based on the size of the dataset and the computing tools available. A large K produces smaller groups and a small K produces larger groups. K-means uses Euclidean distances and is the most popular; other variants like K-medians and K-medoids use other distance measures.
What are some challenges with K-means clustering?
Sometimes it is quite tough to figure out the appropriate number of clusters, i.e. the value of K. The output is highly influenced by the initial inputs, for example the number of clusters. It is affected by the presence of outliers in the data set. In cases where clusters have complex spatial shapes, k-means clustering is not a good choice.
Steps in k-Means clustering
Step 1: Initialize K random centroids (K points). Step 2: For each data point, calculate its Euclidean distance from the K randomly chosen centroids and assign the point to the cluster of the nearest centroid. Step 3: Update each centroid by calculating the average of the data points newly assigned to its cluster. Step 4: Repeat the above process for a given number of iterations or until the centroid allocation no longer changes.
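A minimal sketch of these steps carried out with scikit-learn's KMeans; the random data, k = 3, and the example query point are illustrative only:

```python
# Sketch: the four steps above as run by scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(200, 2))  # toy data

km = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=1)
km.fit(X)                    # runs the initialise/assign/update loop until convergence

print(km.cluster_centers_)   # final centroids after convergence
print(km.labels_[:10])       # cluster assignment of the first 10 points

# New observations are assigned to the closest existing centroid.
print(km.predict([[0.5, -0.2]]))
```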
Manhattan Distance
Takes the sum of the absolute values of the differences of the coordinates: |X1 - X2| + |Y1 - Y2| + |Z1 - Z2|. For example, if x = (a, b) and y = (c, d), the Manhattan distance between x and y is |a - c| + |b - d|.
Elbow Method
The elbow method is used to determine the optimal value of K for the k-means clustering algorithm. The basic idea is to plot the within-cluster sum of squares (WSS, i.e. variance) against changing K; there is no method that defines the exact value of K. ● The elbow method is the most popular and well-known method to find the optimal number of clusters. ● It is based on plotting the value of the cost function against different values of K. ● The elbow point is where the decline in distortion slows down sharply (further increases in K give diminishing returns); it is taken as the optimal number of clusters for the dataset.
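A sketch of the elbow plot using KMeans.inertia_ as the WSS; the synthetic data and the range of K values tried are assumptions made for illustration:

```python
# Sketch: plot WSS against k and look for the "elbow".
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)  # toy data

ks = range(1, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squares (WSS)")
plt.show()
```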
Scaling is very important in cluster analysis because features do not all have the same range of values, and scaling prevents any feature from becoming dominant due to the high magnitude of its values.
True
Distance-based clustering algorithms are sensitive to the different scales of the variables, and scaling is done to avoid this.
True
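A minimal sketch of scaling before clustering with scikit-learn's StandardScaler; the income/age example columns are invented for illustration:

```python
# Sketch: standardise features before distance-based clustering so that no
# single high-magnitude feature dominates the distance calculation.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Illustrative data: income in dollars (large scale) vs. age in years (small scale).
X = np.column_stack([
    np.random.default_rng(0).normal(60_000, 15_000, 300),  # income
    np.random.default_rng(1).normal(40, 12, 300),          # age
])

X_scaled = StandardScaler().fit_transform(X)  # each column now has mean 0, std 1
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```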
Visual analysis of attributes selected for clustering might give an idea of the range of values of K.
True
Is there any metric to compare clustering results?
You can compare clustering results by checking silhouette scores and by doing cluster profiling. Besides this, you should also validate the clustering result by consulting with a domain expert to see if the cluster profiles make sense or not.
Clustering
an unsupervised learning technique that groups together collections of objects that are similar, simplifying the data (many rows reduced to a few clusters)
The data given to an unsupervised algorithm
is not labelled, which means only the input variables (X) are given, with no corresponding output variables. The aim is to understand the data better by reducing its size or grouping it into various categories, letting the data tell you what the categories are (grouping similar items together and keeping dissimilar items from overlapping with other items).
Mahalanobis distance
can account for the relationship between variables; it measures how far a point is from the regular pattern of all the data points (e.g. how close points A and B are to a reference cloud of points). The Mahalanobis distance is a measure of the distance between a point P and a distribution D, introduced by P. C. Mahalanobis in 1936. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. This distance is zero for P at the mean of D and grows as P moves away from the mean along each principal component axis. If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. The Mahalanobis distance is thus unitless, scale-invariant, and takes into account the correlations of the data set.
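A small sketch using scipy.spatial.distance.mahalanobis; the correlated data cloud and the test point are invented for illustration, and the plain Euclidean distance is printed alongside for comparison:

```python
# Sketch: Mahalanobis distance of a point from the centre of a data cloud,
# using the inverse covariance matrix of the data.
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.5], [1.5, 2.0]], size=500)

mu = X.mean(axis=0)
VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix

point = np.array([2.0, 2.0])
print(mahalanobis(point, mu, VI))  # accounts for the correlation between the variables
print(np.linalg.norm(point - mu))  # plain Euclidean distance, for comparison
```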
The similarity between data points
is determined by the distance between them
Euclidean distance
is the most commonly used distance metric. Think of the map distance between my home (X1, Y1) and my work (X2, Y2): the Euclidean distance between the two is sqrt((X1 - X2)^2 + (Y1 - Y2)^2) in two dimensions. In m dimensions it generalises to sqrt((X1 - X2)^2 + (Y1 - Y2)^2 + (Z1 - Z2)^2 + ...), summing the squared differences over all m columns and taking the square root of the total. Three main features: 1. It is highly scale dependent; changing the units of one variable can have a huge influence on the results, so standardizing the dimensions is good practice. 2. It completely ignores the relationship between measurements. 3. It is sensitive to outliers; if the data has outliers that cannot be handled or removed, use of the Manhattan distance is preferred.
Chebyshev or chessboard distance
Looks at each coordinate separately and takes the maximum of |X1 - X2|, |Y1 - Y2|, |Z1 - Z2|. On a chessboard it is the minimum number of moves a king needs to go from square A to square B.
Minkowski Distance
A general means of calculating the distance between points in n-dimensional space, computed from the coordinate differences Xi - Yi as (sum of |Xi - Yi|^p)^(1/p): p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and p -> infinity gives the Chebyshev distance.
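A small sketch with scipy.spatial.distance showing how the Minkowski distance reduces to the Manhattan, Euclidean and Chebyshev distances; the two example vectors are arbitrary:

```python
# Sketch: Minkowski distance for different p.
import numpy as np
from scipy.spatial.distance import minkowski, cityblock, euclidean, chebyshev

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, p=1), cityblock(x, y))    # 5.0 and 5.0 (Manhattan)
print(minkowski(x, y, p=2), euclidean(x, y))    # ~3.606 each (Euclidean)
print(minkowski(x, y, p=100), chebyshev(x, y))  # large p approaches 3.0 (Chebyshev)
```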
What is k-means clustering? (Euclidean) (Centroid based)
The most used clustering technique. K-means is a centroid-based (distance-based) algorithm: each cluster is associated with a centroid, distances are calculated to assign each point to a cluster, and two data points are similar if they are close to the same centroid. It aims to partition the n observations into K clusters so as to minimize the within-cluster sum of squares (i.e. variance). You have to pre-define K, the number of clusters you want. It is computationally less expensive compared to hierarchical techniques: instead of the n(n-1)/2 pairwise distances hierarchical clustering needs (which becomes very large as n grows), k-means scales roughly linearly in terms of computation.