Hierarchical Clustering and PCA
Eigenvalue
Represents the amount of variance explained by each factor (principal component)
Advantages of dimension reduction
- removes multicollinearity to improve ML model performance
- decreases computational time for fitting models
- makes visualization easier
- decreases storage requirements
- avoids the curse of dimensionality
- helps reduce overfitting
We fit a dataset to the PCA class of sklearn, and obtain the following output on checking the explained_variance_ratio_ attribute: [0.92461621 0.05301557 0.01718514 0.00518309] What is the minimum number of principal components we should keep to preserve 95% of the variance in the data?
2
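A quick sketch of the arithmetic behind this answer, plugging the ratios from the question into NumPy (the cumulative sum crosses 0.95 at the second component):

```python
import numpy as np

# explained_variance_ratio_ values from the question
ratios = np.array([0.92461621, 0.05301557, 0.01718514, 0.00518309])

cumulative = np.cumsum(ratios)              # [0.9246, 0.9776, 0.9948, 1.0000]
n_components = np.argmax(cumulative >= 0.95) + 1
print(n_components)                         # 2

# sklearn can also select this automatically: PCA(n_components=0.95)
# keeps the smallest number of components that preserve 95% of the variance.
```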
Which of the following is true regarding PCA (Principal Component Analysis):
It is used to find inter-relations between variables in the data. It is used to visualize data. As the number of variables decreases on using PCA, further analysis becomes simpler.
How do we know whether K-means is appropriate to use in a given business problem?
The K-means algorithm is appropriate when the number of clusters to be formed is already known in advance, and when the data is well separated into spherical-like clusters.
Feature elimination
Simply identify and remove variables (columns) that are not important
Dendrograms (in Hierarchical Clustering) can be compared to one another by comparing their cophenetic correlations.
True
Generally, normalized Euclidean distance is used. However, when the data is very large and high dimensional, Manhattan distance is found to perform better computationally. The type of distance to use is decided by the problem at hand.
True
Hierarchical clustering doesn't work well when we have a huge amount of data.
True
Lower values of the cophenetic coefficient indicate bad clustering.
True
PCA (Principal Component Analysis) is a dimensionality reduction process but there is no guarantee that the dimension is interpretable.
True
The cophenet function of scipy returns both the cophenetic correlation and cophenetic distances between every pair of observations in the data.
True
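A minimal sketch (on a hypothetical random dataset) showing both return values of scipy's cophenet:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(20, 3)            # hypothetical data: 20 observations, 3 features

Z = linkage(X, method='average')     # linkage matrix from hierarchical clustering
c, coph_dists = cophenet(Z, pdist(X))

print(c)                             # cophenetic correlation (closer to 1 is better)
print(coph_dists)                    # condensed matrix of pairwise cophenetic distances
```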
The labels_ attribute of sklearn's AgglomerativeClustering class will give the clusters to which each observation has been assigned
True
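A minimal sketch with made-up data illustrating the labels_ attribute:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(10, 2)                       # hypothetical 2-D observations

model = AgglomerativeClustering(n_clusters=3)   # defaults: ward linkage, euclidean
model.fit(X)

print(model.labels_)    # cluster (0, 1 or 2) assigned to each of the 10 observations
```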
The transformed variables obtained after executing PCA are independent of each other.
True
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points, one in each cluster.
True
Which of the following parameters is used to define the distance metric in sklearn AgglomerativeClustering?
affinity
Eigenvectors of the covariance matrix
are the principal components.
The xxxxxx attribute of sklearn's PCA class returns the eigenvectors of the covariance matrix of the input data.
components_
What is Connectivity clustering? (Ex. Distance between A and B)
connect each pair of points (all pairs) and calculate the distance between them [apply linear scaling first, then compute the Euclidean distance between the two points, so that variables on large scales do not dominate]
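A small sketch of that idea, assuming z-score scaling as the "linear scaling" step before computing pairwise Euclidean distances:

```python
import numpy as np
from scipy.stats import zscore
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 200.0],
              [2.0, 150.0],
              [3.0, 400.0]])                    # hypothetical points A, B, C

X_scaled = zscore(X)                            # scale columns so no variable dominates

# Euclidean distance between every pair of points
print(squareform(pdist(X_scaled, metric='euclidean')))
```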
What does PCA Principal Component Analysis do
- creates new variables using linear combinations of the old variables
- is designed to create variables that are independent of one another
- also tells us how important each of these new variables is
- this "importance" helps us choose how many of the new variables to use
What do you do with clusters after interpretation?
It depends on the problem you are trying to solve. Clustering gives you groups that are similar in some aspects. This can be used for recommendations, understanding customers, marketing campaigns, etc.
Which of the following attributes of sklearn's PCA class will give the eigenvalues of the covariance matrix of the input data?
explained_variance_
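A short sketch (on hypothetical data) cross-checking explained_variance_ and components_ against NumPy's eigendecomposition of the covariance matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)          # hypothetical feature matrix

pca = PCA().fit(X)
print(pca.explained_variance_)      # eigenvalues of the covariance matrix
print(pca.components_)              # eigenvectors (one per row) = principal components

# cross-check with NumPy (eigenvectors may differ only in sign)
eigvals, _ = np.linalg.eigh(np.cov(X, rowvar=False))
print(np.sort(eigvals)[::-1])       # same eigenvalues, in decreasing order
```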
Clustering (unsupervised learning technique) Ex. Customer Spend Data {no. of visits, item counts, avg monthly spend}
Group customers by some sense of similarity in order to (1) better understand our data and customers and (2) target each group more precisely once we understand it better.
agglomerative clustering method
is a bottom-up method: we assign each observation to its own cluster, then compute the similarity (e.g., distance) between each pair of clusters and join the two most similar clusters.
divisive clustering method
is a top-down method where we assign all of the observations to a single cluster and then partition the cluster into the two least similar clusters.
Hierarchical Clustering
is connectivity-based clustering in which nearby points are joined to form clusters. It gives you a dendrogram from which you can figure out how many clusters should be formed. Hierarchical clustering is computationally expensive, so it will not perform well when the data size is very large.
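A minimal dendrogram sketch with scipy, on a hypothetical small dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(15, 2)        # hypothetical small dataset

Z = linkage(X, method='ward')    # build the hierarchy bottom-up
dendrogram(Z)                    # cut the tree at a chosen height to pick the clusters
plt.show()
```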
PCA Principal Component Analysis is the
most popular feature extraction technique
An advantage of hierarchical clustering over K-Means is
not having to pre-define the number of clusters.
PCA explains how to
reduce dimensions; find new relationships among the variables
The disadvantage of feature elimination is
that we would gain no insight from the dropped variables and lose any information they contain
In single linkage hierarchical clustering,
the distance between two clusters is defined as the shortest distance between two points, one in each cluster.
Feature extraction is
a type of dimensionality reduction technique: create a few new variables from the old variables
Let the sum of variances of cluster A be m. When another cluster B is merged with cluster A, the sum of variances of the cluster (A+B)
will be greater than m
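A quick numerical illustration with two made-up clusters (here "sum of variances" is taken as the within-cluster sum of squared deviations from the centroid):

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.8]])   # hypothetical cluster A
B = np.array([[8.0, 9.0], [9.0, 8.5]])               # hypothetical cluster B

def within_cluster_ss(points):
    # total squared distance of the points from their own centroid
    return np.sum((points - points.mean(axis=0)) ** 2)

m = within_cluster_ss(A)
merged = within_cluster_ss(np.vstack([A, B]))
print(m, merged)     # the merged value is greater than m
```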
The zscore function of scipy does which of the following?
From each observation in a column, it subtracts the column mean and divides by the column standard deviation
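A minimal example of scipy's zscore on two columns with very different scales:

```python
import numpy as np
from scipy.stats import zscore

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])     # hypothetical columns on very different scales

print(zscore(X))                  # each column now has mean 0 and std deviation 1
```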
If we set n_clusters=4, then the cluster labels assigned to the observation will range from 1 to 4.
False
PCA (Principal Component Analysis) is a feature elimination type of dimensionality reduction technique.
False
The cophenetic coefficient shows the goodness of fit of our clustering. The lower value of the cophenetic coefficient indicates good clustering.
False
The cophenetic correlation can have a value of 2
False
In complete linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points, one in each cluster.
False
what is Dimension Reduction
The process of reducing the number of independent variables
How do we choose the correct distance to use in clustering algorithms?
There is no single distance that will give the best results with all data and all problem statements. The type of distance you use depends on the data and the problem statement
