
Clustering is a machine learning technique for grouping data points: points in the same group should share similar properties and features. In data science, we can use clustering algorithms to gain valuable insights from data by seeing which groups the data points fall into.
The concept sounds similar to community detection, and that’s because it essentially is: different fields of science gave the same idea different names. In the literature, you will find community detection associated more with social networks and clustering with machine learning. The distinction also reflects the data each technique works on: community detection algorithms typically operate on the graph structure itself, while clustering algorithms are more prominent with vector representations.
In graph theory, the clustering coefficient measures the degree to which nodes in a graph tend to cluster together. In social networks in particular, nodes tend to form tightly connected groups. This measurement is essential because it tells us whether clustering is even feasible for the graph.
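The clustering coefficient is easy to compute with a graph library. Below is a minimal sketch using networkx (the library choice is an assumption; any graph library with a clustering-coefficient function would do):

```python
import networkx as nx

# Toy graph: a triangle (a tightly connected group) plus one pendant node.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3), (3, 4)])

# Local clustering coefficient of a node: the fraction of pairs of its
# neighbors that are themselves connected by an edge.
print(nx.clustering(G))           # {1: 1.0, 2: 1.0, 3: 0.333..., 4: 0}

# Average clustering coefficient of the whole graph.
print(nx.average_clustering(G))   # (1 + 1 + 1/3 + 0) / 4 ≈ 0.583
```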
K-means clustering
K-means is probably the most well-known clustering algorithm. It is a distance-based algorithm: we calculate distances to decide which cluster a point is assigned to. The main objective of K-means is to minimize the sum of squared distances between the points and their respective cluster centers.
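In symbols, if the x_i are the data points, C_1, …, C_k the clusters, and μ_j the mean of the points in cluster C_j, the objective is the within-cluster sum of squared distances:

```latex
\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```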
The K-means algorithm works as follows:
- Choose the number of clusters k.
- Select k random points from the data as the initial centers.
- Assign all the points to the closest cluster center.
- Recompute the centers of the newly formed clusters.
- Repeat steps 3 and 4 until the cluster centers no longer change (a minimal code sketch of these steps follows below).
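Here is a minimal NumPy sketch of these steps (a from-scratch illustration, not production code; in practice a library implementation such as scikit-learn's KMeans is the usual choice):

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    # A minimal sketch; it does not handle empty clusters or multiple restarts.
    rng = np.random.default_rng(seed)

    # Step 2: select k random points from the data as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 3: assign every point to the closest cluster center.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: recompute the centers of the newly formed clusters.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])

        # Step 5: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return labels, centers

# Toy usage: two well-separated blobs in the plane.
rng = np.random.default_rng(42)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(pts, k=2)
```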
Spectral clustering
Spectral clustering is a more general technique that can be applied not only to graphs but also to images or, in fact, any kind of data. It uses information from the eigenvalues (the spectrum) of matrices derived from the data. In contrast to the K-means approach, even if two points are close to each other in distance, they are not clustered together unless they are connected.
Spectral clustering works as follows:
- Compute the graph Laplacian (which is just another matrix representation of the graph).
- For k clusters, compute the first k eigenvectors, i.e., the ones belonging to the k smallest eigenvalues.
- Stack these eigenvectors side by side as the columns of a new matrix.
- Represent every node by the corresponding row of this new matrix.
- Use K-means clustering to group these rows into k clusters (a code sketch follows below).
Because K-means is used in the final step, the resulting clusters are not always the same: they may vary depending on the choice of the initial centers.
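Here is a minimal sketch of these steps for an undirected graph given as an adjacency matrix (assuming NumPy for the eigendecomposition and scikit-learn's KMeans for the final step; fixing random_state pins down the seed-dependent result mentioned above):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(adjacency, k, seed=0):
    # Step 1: unnormalized graph Laplacian L = D - A,
    # where D is the diagonal matrix of node degrees.
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency

    # Step 2: eigenvectors of L; eigh returns them sorted by ascending
    # eigenvalue, so the first k columns are the ones we need.
    _, eigenvectors = np.linalg.eigh(laplacian)

    # Steps 3-4: the n x k matrix whose rows are the node embeddings.
    embedding = eigenvectors[:, :k]

    # Step 5: K-means on the rows; a fixed random_state makes the
    # otherwise seed-dependent result reproducible.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)

# Toy graph: two triangles joined by a single edge (between nodes 2 and 3).
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
print(spectral_clustering(A, k=2))   # e.g. [0 0 0 1 1 1]
```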
Hierarchical clustering
Hierarchical cluster analysis (HCA) is a clustering technique that builds a hierarchy of clusters, ordered from top to bottom. A familiar example of such a hierarchy is the way files and folders are organized on our hard disk.
This clustering technique is divided into two types:
- Agglomerative hierarchical clustering,
- Divisive hierarchical clustering.
In this lesson, we will take a closer look at agglomerative hierarchical clustering.
Agglomerative hierarchical clustering is the most common type of hierarchical clustering. It uses a “bottom-up” approach: each object starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Agglomerative hierarchical clustering works as follows:
- Treat each point as a single-point cluster.
- Take the two closest points and group them into one cluster.
- Take the two closest clusters and merge them into one cluster.
- Repeat step 3 until you are left with one single cluster (a short code sketch follows below).
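A minimal sketch of this bottom-up procedure using SciPy's hierarchical-clustering utilities (the library choice is an assumption; scikit-learn's AgglomerativeClustering would work just as well):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: two loose groups in the plane.
points = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])

# linkage performs the loop described above: it starts from single-point
# clusters and repeatedly merges the two closest clusters ("ward" measures
# closeness by the increase in within-cluster variance).
merges = linkage(points, method="ward")

# Cut the hierarchy at the level where exactly 2 clusters remain.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]
```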