Evaluating Clustering Algorithms — Silhouette Score
Theory
Silhouette Score is a metric to evaluate the performance of a clustering algorithm. It uses the compactness of individual clusters (intra-cluster distance) and the separation between clusters (inter-cluster distance) to give an overall representative score of how well the clustering algorithm has performed.
This is a simple metric, but many a time we fail to use it correctly and end up quoting numbers that misrepresent the actual score. In this blog I will cover a simple example of how the silhouette score can be misleading if used blindly. To keep things visually understandable I will stick to a two-dimensional dataset, but the idea carries over to multi-dimensional datasets as well.
The silhouette score function is directly available in sklearn (sklearn.metrics.silhouette_score) and can be used readily.
Let's quickly look at the math behind it.
The silhouette score for a datapoint i is given as

s(i) = (bi - ai) / max(ai, bi)

where,
bi : the inter-cluster distance, defined as the average distance from datapoint i to the points of the closest cluster that i is not a part of
ai : the intra-cluster distance, defined as the average distance from datapoint i to all other points in the cluster it belongs to
The overall silhouette score for the complete dataset is the mean of the silhouette scores of all data points in the dataset. As can be seen from the formula, the silhouette score always lies between -1 and 1, with values close to 1 indicating better clustering.
Practical
Let's calculate the silhouette score for a dataset using sklearn.
Import libraries
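The imports used throughout this post (assuming numpy, matplotlib and scikit-learn are installed):

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs, make_circles
from sklearn.metrics import silhouette_score
```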
Create a dataset using the make_blobs function from sklearn
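A sketch of this step; the exact parameters used in the original post are not shown, so the values below are illustrative:

```python
# Three well-separated Gaussian blobs in two dimensions; y holds the cluster labels
X, y = make_blobs(n_samples=500, centers=3, cluster_std=0.8, random_state=42)
```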
Visualise the data
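One way to plot it:

```python
# Scatter plot of the blobs, coloured by cluster label
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=15)
plt.title('make_blobs dataset')
plt.show()
```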
Calculate the silhouette score for this dataset
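Using sklearn's silhouette_score on the data and its labels (with the illustrative parameters above, the exact value may differ slightly from the one quoted below):

```python
score = silhouette_score(X, y)
print(f'Silhouette score: {score:.3f}')
```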
The silhouette score is 0.804, which is close to 1, so these clusters have been separated quite cleanly, as is also visible from the plot.
Issue
Now let's try another dataset. This time we will use the concentric circles from sklearn.
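A sketch of the dataset creation, assuming make_circles with a small amount of noise (parameters are illustrative):

```python
# Two concentric circles; label 0 is the outer circle, label 1 the inner one
X_c, y_c = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=42)
```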
Let’s plot the data.
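A possible plotting snippet, reusing X_c and y_c from above:

```python
# Outer circle in blue, inner circle in red
plt.scatter(X_c[y_c == 0, 0], X_c[y_c == 0, 1], c='blue', s=15, label='outer')
plt.scatter(X_c[y_c == 1, 0], X_c[y_c == 1, 1], c='red', s=15, label='inner')
plt.legend()
plt.title('make_circles dataset')
plt.show()
```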
Looking at the plot, we can see that the two clusters are separated nicely, with the outer circle in blue and the inner circle in red. Thus we would expect the silhouette score to be high.
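The score itself, computed the same way as before (the exact value varies with the noise and factor chosen above):

```python
score_c = silhouette_score(X_c, y_c)
print(f'Silhouette score (circles): {score_c:.3f}')
```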
On the other hand, we see that the silhouette score is 0.099, which is close to 0. This is clearly a misrepresentation of how well our algorithm is performing.
This is a common mistake many data scientists make when quoting the silhouette score as a measure of clustering performance. With high-dimensional data the problem becomes even more severe, since visualising high-dimensional data is difficult.
The fix in this case is very easy. We use a kernel transformation and create a new axis that measures the distance of each datapoint from the centre of the plot. This maps the data onto a new plane where the clusters are linearly separable, as sketched below.
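A minimal sketch of that transformation, assuming the circles are centred at the origin as in the make_circles example above: the radius of each point becomes the new feature, and the score is recomputed in that space.

```python
# New feature: distance of each point from the centre of the plot (the origin)
r = np.sqrt(X_c[:, 0] ** 2 + X_c[:, 1] ** 2).reshape(-1, 1)

# In this radial space the two circles become two well-separated bands,
# so the silhouette score is now much closer to 1
score_radial = silhouette_score(r, y_c)
print(f'Silhouette score after radial transform: {score_radial:.3f}')
```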
Conclusion
The silhouette score, like many other clustering evaluation metrics, is susceptible to error. Whenever it is used to quote algorithm performance, one must be sure that the distance metric used in the algorithm is able to linearly separate the data.
In cases where the dataset is not linearly separable and its dimensionality is very high, we must be careful when quoting the silhouette score.
Dimensionality reduction techniques can be used to reduce the data to two dimensions for visualisation. As a rule of thumb, when using density-based clustering algorithms the silhouette score may not be an appropriate metric.
Bonus
Below are functions to calculate the silhouette score without sklearn.
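A minimal sketch of such functions, assuming Euclidean distances and integer cluster labels (not optimised; sklearn's implementation is vectorised and much faster):

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """Silhouette score s(i) = (bi - ai) / max(ai, bi) for a single datapoint i."""
    own = labels[i]
    dists = np.linalg.norm(X - X[i], axis=1)

    # ai: average distance to all other points in i's own cluster
    same = labels == own
    if same.sum() <= 1:              # singleton cluster: silhouette is defined as 0
        return 0.0
    a_i = dists[same].sum() / (same.sum() - 1)

    # bi: smallest average distance to the points of any other cluster
    b_i = min(dists[labels == c].mean() for c in np.unique(labels) if c != own)

    return (b_i - a_i) / max(a_i, b_i)

def silhouette(X, labels):
    """Overall silhouette score: the mean of s(i) over all datapoints."""
    return float(np.mean([silhouette_sample(X, labels, i) for i in range(len(X))]))
```

Calling silhouette(X, y) on the blob dataset above should agree with sklearn's silhouette_score(X, y) up to floating-point error.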