Hierarchical Clustering and PCA

Eigenvalue

Represents the total variance explained by each factor

Advantages of dimension reduction

- removes multicollinearity to improve ML model performance
- decreases computational time for fitting models
- makes visualization easier
- decreases storage requirements
- avoids the curse of dimensionality
- helps reduce overfitting

We fit a dataset to the PCA class of sklearn, and obtain the following output on checking the explained_variance_ratio_ attribute: [0.92461621 0.05301557 0.01718514 0.00518309] What is the minimum number of principal components we should keep to preserve 95% of the variance in the data?

2
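A minimal sketch of how this is typically checked, using hypothetical data: take the cumulative sum of explained_variance_ratio_ and keep the smallest number of components whose cumulative ratio reaches 0.95.

```python
# Minimal sketch with hypothetical data: pick the smallest number of
# components whose cumulative explained variance ratio reaches 95%.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)                       # hypothetical 4-feature dataset
pca = PCA().fit(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.95)) + 1  # index of first True, plus one
print(cumulative, n_keep)
```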

Which of the following is true regarding PCA (Principal Component Analysis):

- It is used to find inter-relations between variables in the data.
- It is used to visualize data.
- As the number of variables decreases after using PCA, further analysis becomes simpler.

How do we know whether K-means is appropriate to use in a given business problem?

The K-means algorithm is used when it is already known in advance how many clusters have to be formed; K-means is also suitable when the data is well separated into spherical-like clusters.
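A minimal K-means sketch on synthetic, well-separated blob data (make_blobs is used here purely for illustration):

```python
# K-means fits well when k is known up front and clusters are roughly spherical.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # synthetic data
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)   # k chosen in advance
print(km.labels_[:10], km.cluster_centers_.shape)
```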

Feature elimination

Simply identify and remove variables (columns) that are not important

Dendrograms (in Hierarchical Clustering) can be compared to one another by comparing their cophenetic correlations.

True

Generally, normalized Euclidean distance is used. However, when the data is very large and high-dimensional, Manhattan distance is found to perform better computationally. The type of distance to use is decided by the problem at hand.

True

Hierarchical clustering doesn't work well when we have a huge amount of data.

True

Lower values of the cophenetic coefficient indicate bad clustering.

True

PCA (Principal Component Analysis) is a dimensionality reduction process, but there is no guarantee that the new dimensions are interpretable.

True

The cophenet function of scipy returns both the cophenetic correlation and cophenetic distances between every pair of observations in the data.

True
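A small sketch of that call on hypothetical data; cophenet takes the linkage matrix and the condensed pairwise distances and returns both outputs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.rand(20, 3)                 # hypothetical data
Z = linkage(X, method='average')          # hierarchical clustering (linkage matrix)
c, coph_dists = cophenet(Z, pdist(X))     # correlation + pairwise cophenetic distances
print(c)
```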

The labels_ attribute of sklearn's AgglomerativeClustering class will give the clusters to which each observation has been assigned

True
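For example (hypothetical data), labels_ holds one cluster label per observation after fitting:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(50, 2)                              # hypothetical data
model = AgglomerativeClustering(n_clusters=4).fit(X)
print(model.labels_)                                   # one label per row of X, values 0..3
```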

The transformed variables obtained after executing PCA are independent of each other.

True

In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points, one in each cluster.

True

Which of the following parameters is used to define the distance metric in sklearn AgglomerativeClustering?

affinity
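A brief sketch of setting it; note that recent sklearn releases have renamed this parameter to metric, so the snippet below assumes an older version where affinity is still accepted:

```python
from sklearn.cluster import AgglomerativeClustering

# 'affinity' selects the distance metric between observations
# (renamed to 'metric' in newer sklearn releases).
model = AgglomerativeClustering(
    n_clusters=3,
    affinity='manhattan',
    linkage='complete',   # 'ward' linkage would require Euclidean distance
)
```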

Eigenvectors of the covariance matrix

are the principal components.

The xxxxxx attribute of sklearn's PCA class returns the eigenvectors of the covariance matrix of the input data.

components_

What is Connectivity clustering? (Ex. Distance between A and B)

Connect each (i.e., every) pair of points and calculate the distance between them. [Use linear scaling first, then compute the Euclidean distance between the two points, so that variables with large ranges do not dominate.]

What does PCA Principal Component Analysis do

- creates new variables using linear combinations of old variables
- is designed to create variables that are independent of one another
- also tells us how important each of these new variables is
- this "importance" helps us choose how many variables we will use

What you do with clusters after interpretation?

It depends on the problem you are trying to solve. Clustering gives you groups that are similar in some respects; this can be used for recommendations, understanding customers, marketing campaigns, etc.

Which of the following attributes of sklearn's PCA class will give the eigenvalues of the covariance matrix of the input data?

explained_variance_
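A short sketch (hypothetical data) tying the last two cards together: components_ holds the eigenvectors and explained_variance_ the corresponding eigenvalues of the covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 4)               # hypothetical data
pca = PCA().fit(X)

print(pca.components_)                   # eigenvectors: one principal direction per row
print(pca.explained_variance_)           # eigenvalues: variance along each component
print(pca.explained_variance_ratio_)     # eigenvalues as a fraction of total variance
```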

Clustering (unsupervised learning technique). Ex. customer spend data {number of visits, item counts, average monthly spend}

Group observations by some sense of similarity in order to (1) better understand our data and customers and (2) target groups more precisely by understanding them better.

agglomerative clustering method

is a bottom-up method: we assign each observation to its own cluster, then compute the similarity (e.g., distance) between each pair of clusters and join the two most similar clusters.

divisive clustering method

is a top-down method where we assign all of the observations to a single cluster and then partition the cluster into the two least similar clusters.

Hierarchical Clustering

is connectivity-based clustering in which nearby points are joined to form clusters. It gives you a dendrogram from which you can work out how many clusters should be formed. Hierarchical clustering is computationally expensive, so it will not perform well when the data size is very large.
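A minimal dendrogram sketch with hypothetical data; the method argument of linkage selects how cluster-to-cluster distance is computed ('single', 'complete', 'average', 'ward'):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(30, 2)        # hypothetical data
Z = linkage(X, method='ward')    # agglomerative (bottom-up) merge history
dendrogram(Z)                    # inspect the tree to decide how many clusters to keep
plt.show()
```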

PCA Principal Component Analysis is the

most popular feature extraction technique

An advantage of hierarchical clustering over K-Means is

not having to pre-define the number of clusters.

PCA explains how to

reduce dimensions and find new relationships among the variables

The disadvantage of feature elimination is

that we would gain no insight from the dropped variables and lose any information they contain

In single linkage hierarchical clustering,

the distance between two clusters is defined as the shortest distance between two points in each cluster.

Feature extraction is

a type of dimensionality reduction technique that creates a few new variables from the old variables

Let the sum of variances of cluster A be m. When another cluster B is merged with cluster A, the sum of variances of the cluster (A+B)

will be greater than m

The zscore function of scipy does which of the following?

From each observation in a column, it subtracts the column mean and divides by the column standard deviation
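A small sketch of the call on a hypothetical array; by default zscore standardizes along axis 0, i.e., column-wise:

```python
import numpy as np
from scipy.stats import zscore

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
print(zscore(X, axis=0))   # (value - column mean) / column standard deviation
```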

If we set n_clusters=4, then the cluster labels assigned to the observations will range from 1 to 4.

False

PCA (Principal Component Analysis) is a feature elimination type of dimensionality reduction technique.

False

The cophenetic coefficient shows the goodness of fit of our clustering. A lower value of the cophenetic coefficient indicates good clustering.

False

The cophenetic correlation can have a value of 2

False

In complete linkage hierarchical clustering, the distance between two clusters is defined as the shortest distance between two points, one in each cluster.

False

what is Dimension Reduction

The process of reducing the number of independent variables

How do we choose the correct distance to use in clustering algorithms?

There is no single distance that will give the best results with all data and all problem statements. The type of distance you use depends on the data and the problem statement.

