Cluster analysis or clustering is a data-mining task that consists in grouping a set of experiments (observations) in such a way that element belonging to the same group are more similar (in some mathematical sense) to each other than to those in the other groups. We call the groups with the name of clusters.
In this example we will see how centroid based clustering works. The basic idea of Centroid Based clustering is to define clusters based on the distance of each member of the cluster and the so-called centroid of the cluster itself.
The most common centroid based clustering algorithm is the so called K-means. The procedure follows a simple and easy way to classify a given data set through a certain number of predefined clusters.
The idea is to define k centroids, one for each cluster. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point, the algorithm re-computes the new centroids as the centers of the clusters resulting from the previous step. The algorithm stops when no more changes are observed from an iteration and the following one.
The algorithm is composed by the following steps:
- guess K centroids positions, a centroid for each results cluster
- assign each observation to the group that has the closest centroid
- when all observations have been assigned, recalculate the positions of the K centroids
- repeat Steps 2 and 3 until either the centroid position or the observation assignments no longer move.
The iterative procedure leads to a separation of the objects into k-groups.
- Although it can be proved that the procedure always converge, the k-means algorithm does not necessarily find the most optimal configuration, being sensitive to the initial randomly selected clusters. A typical approach to overcome this limitation is to re-run the k-means algorithm multiple times starting with different initial clusters to reduce the possibilities to remain trapped in a local solution.
- Another limitation is due to the fact that the number of cluster is defined by the user upfront. As the points are grouped by a predefined number of clusters, there is no guarantee that the provided number is the one that provide the best classification of the observations.
A first example
This first example shows a typical cluster example based on the Euclidean distance of the observation in a 2D domain.
When performing the exercise in Grapheme (or Nexus), the user has two additional options to plat with:
- Distance: this option defines how the distance of each observation point with respect of the center of the cluster it belongs to is computed. In this specific case we will use the default Euclidean distance, which measure the distance of two points as the length of the straight line that link them. Other typically used distance metrics used in clustering are: Manhattan, Camberra, Chebyshev, Wassertein and Mahalanobis
- Scaling: this defines how the domain to which the points belong to should be scaled before distance is computed. This is useful when computing the clustering against non-homogeneous dimensions. In this specific case, no scaling is used. Typical scale algorithms are: linear, logarithmic and variance based one.
The 400 provided observation points will be clustered in 2, 3, 4 and 5 clusters respectively. With the exception of the number of clusters, all the other algorithm parameters will be left unchanged.
|grouping in 2 clusters||grouping in 3 clusters|
|grouping in 4 clusters||grouping in 5 clusters|
A real-world example: House Market
This second example uses the data of Houses and Neighborhoods collected from the UCI Machine Learning Repository (original data can be found Here) and published by Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. The data can be used to visualize important relationships between house value, house size (number of rooms), neighborhood characteristics, average pollution and age of houses in the same neighborhood.
Limiting to this example, we will focus analyzing and in grouping houses based on the following data:
- RM: average number of rooms per dwelling
- MEDV: median value of owner-occupied homes in $1000’s
Number or Rooms and Average Value:
Firstly, we would like to identify if there is a clear relationship between the average value of a property (MEDV) and the number of rooms (RM). In order to prove that, we define a scatter plot. A linear regression clearly highlight a relationship between number of rooms and average property value as shown on the right.
Another way to highlight this relationship is to attempt to group houses based on their value and number of rooms. Since the quantities with respect of we want to cluster are not homogeneous, we scale inputs linearly within the domain.