1. How do you choose the optimal value of K in K-Means clustering?
Answer» The following methods can be considered for finding the optimal value of K in K-means clustering.

Approximate Expected Overall R-square: This is calculated under the hypothesis that all the explanatory variables used for clustering are independent. If there is a large difference between the Observed Overall R-square and the Approximate Expected Overall R-square, we can suspect high correlation among the explanatory variables.

Cubic Clustering Criterion (CCC): The CCC compares the Observed Overall R-square with the R-square expected under the null hypothesis that the data are sampled from a uniform distribution; large positive values indicate well-separated clusters.

Pseudo F: The Pseudo F statistic is the ratio of between-cluster variance to within-cluster variance; large values indicate compact, well-separated clusters.
The optimal number of clusters is found at the point where CCC and Pseudo F reach their maximum and the Overall R-square tapers off.

Elbow Method: The Elbow method is a method of interpretation and validation of consistency within cluster analysis, designed to help find the appropriate number of clusters in a data set. One simple heuristic is to compute the total within sum of squares (WSS) for different values of k and look for an "elbow" in the curve (see the first sketch below). Define a cluster's centroid as the point that is the mean of all the points in the cluster. The within sum of squares for a single cluster is the sum of the squared distances of each point in the cluster from the cluster's centroid, and the total WSS is the sum of the within sum of squares of all the clusters. The total WSS decreases as the number of clusters increases, because each cluster becomes smaller and tighter. The hope is that the rate at which the WSS decreases slows down for k beyond the optimal number of clusters: the graph of WSS versus k should flatten out beyond the optimal k, so the optimal k sits at the "elbow" of the graph. Unfortunately, this elbow can be difficult to see.

CH Index (Calinski-Harabasz): The Calinski-Harabasz index of a clustering is the ratio of the between-cluster variance (essentially the variance of all the cluster centroids around the dataset's grand centroid) to the total within-cluster variance (essentially the average WSS of the clusters). For a given dataset, the total sum of squares (TSS) is the sum of the squared distances of all the data points from the dataset's grand centroid; the TSS is independent of the clustering. If WSS(k) is the total WSS of a clustering with k clusters, then the between sum of squares of the clustering is BSS(k) = TSS - WSS(k). WSS(k) measures how close the points within each cluster are to each other, while BSS(k) measures how far apart the clusters are from each other, so a good clustering has a small WSS(k) and a large BSS(k). The within-cluster variance is W = WSS(k)/(n-k), where n is the number of points in the dataset, and the between-cluster variance is B = BSS(k)/(k-1). The within-cluster variance decreases as k increases, but the rate of decrease should slow down past the optimal k; the between-cluster variance increases with k, but the rate of increase should also slow down past the optimal k. So, in theory, the ratio B/W is maximized at the optimal k (see the second sketch below).

All these metrics can be evaluated together to decide on the final value of K.
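Below is a minimal sketch of the elbow heuristic, assuming scikit-learn and matplotlib are available; the make_blobs data and the 1-10 range of k are illustrative assumptions, not part of the original answer. scikit-learn's KMeans exposes the total WSS of a fitted clustering as its inertia_ attribute.

```python
# Elbow method sketch: compute the total within sum of squares (WSS)
# for a range of k and look for the "elbow" where the curve flattens.
# The synthetic blobs below stand in for a real dataset X.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

ks = range(1, 11)
wss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)  # inertia_ is the total WSS of this clustering

plt.plot(ks, wss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Total within sum of squares (WSS)")
plt.title("Elbow method")
plt.show()
```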
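And a companion sketch for the Calinski-Harabasz index, computed both from the B = BSS(k)/(k-1) and W = WSS(k)/(n-k) definitions above and via scikit-learn's built-in calinski_harabasz_score; again, the synthetic data and the range of k are assumptions for illustration. The k that maximizes the index is the candidate optimum.

```python
# Calinski-Harabasz sketch: the B/W ratio should peak at the optimal k.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
n = X.shape[0]
tss = ((X - X.mean(axis=0)) ** 2).sum()  # TSS: independent of the clustering

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss = km.inertia_              # WSS(k)
    bss = tss - wss                # BSS(k) = TSS - WSS(k)
    ch_manual = (bss / (k - 1)) / (wss / (n - k))  # B / W
    ch_sklearn = calinski_harabasz_score(X, km.labels_)
    print(f"k={k}: CH={ch_manual:.1f} (sklearn: {ch_sklearn:.1f})")
```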