Unsupervised Learning and K means Clustering
How do we find the clusters that solve the optimization task for clustering?
K means algorithm (provides a local optimum rather than an exact solution)
What is an issue with K means?
K means algorithm finds a local rather than a global optimum.
What are the two most ubiquitous clustering methods?
K means and hierarchical
What are some challenges with unsupervised learning?
Much more subjective. No clear-cut goal such as predicting a response and there's no easy way to assess quality of the solution (cross validation not possible)
How does the K means algorithm work?
1. (Random Initialization) Randomly assign a number, from 1 to K , to each of the observations. These serve as initial cluster assignments for the observations. 2. (Iterative Updates.) Iterate until the cluster assignments stop changing: For each of the clusters C_1,...,C_k compute the cluster centroid. Assign each observation to the cluster whose centroid is closest.
How do we solve the local optima issue?
1. Run K means algorithm multiple B times, each time for a new random cluster assignment in step 1 2. Selects the best solution out of those B runs, i.e. for which the within-cluster variation is smallest
What is clustering?
A broad class of methods for grouping the observations, leading to discovery of unknown subgroups in data.
What is PCA?
A tool for grouping the predictor variables that is used for data visualization or data pre-processing before supervised techniques are applied.
What is the goal of unsupervised learning?
Discover patterns and relationships among measurements x1,...,xp and among the n observations
What conditions have to be met for K means?
Each observation I belongs to at least one of the K clusters and clusters are non overlapping.
How do we quantify the similarity of observations for clustering?
Euclidean distance
What is the goal of clustering?
Partition observations into distinct groups so that the observations within each cluster are quire similar to one another, while the observations in different clusters are quite different from one another.
What are the two most popular unsupervised learning techniques?
Principal component analysis and Clustering
How does PCA address that issue?
Shrinks the total number p of variables down to a set of few principal components (PCs) with those PCs retaining as much information from all p variables as possible.
How do we define a within-cluster variation of a cluster C_k?
Squared Euclidean distance between all distinct pairs of observations i, j element of C_k, i != j
What is considered a good clustering based on within-cluster variation?
W(C_k) is small
What is k means clustering?
We seek to partition the observations into a pre-specified number K of clusters
What is the optimization problem for k means in plain English?
We want to partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
What issue does PCA address?
When there are too many variables to plot
What is within-cluster variation for cluster C_k?
a measure of the amount by which the observations within C_k differ from each other