Reading, Writing, Arithmetic, Robotics: What to Know About Machine Learning Part 7
This is the seventh article in a series dedicated to the various aspects of machine learning. Today’s article will cover the types of clustering used in machine learning, offering insight into the various ways that ML agents can organize unlabeled and unstructured data.
Our last article on the kinds of algorithms employed by machine learning agents included a section dedicated to clustering algorithms. The main takeaway of that section was that there is more than one way to cluster a data set, and that different data sets call for different clustering methods before they can be made sense of.
That was an important point, but a lot was left unsaid, in particular an in-depth explanation of the types of clustering methods used and why they matter for making sense of data. Keep reading for a breakdown of some of the most significant types of clustering used by machine learning agents.
In centroid-based clustering, the data set is divided into a fixed number of separate clusters. One drawback is that the number of clusters is specified in advance rather than discovered through an analysis of the data set. A centroid is the center of a cluster; its starting position is often chosen at random, and data points are assigned to a cluster based on their proximity to its centroid.
Again, the big disadvantage here is that we need to select the number of clusters beforehand, while normally we want the algorithm to give us that insight. However, many centroid-based clustering algorithms are very fast, and offer quick insights into how “close” data points are to each other based on their proximity to a center that may have started out as an arbitrary guess.
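To make the centroid idea concrete, here is a minimal sketch of k-means, one common centroid-based algorithm, in plain Python. The function name `k_means`, the sample points, and the choice of `k=2` are illustrative assumptions, not taken from any particular library.

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Minimal k-means sketch. Note that k has to be chosen up front."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly chosen starting centers
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
```

On this toy data the two groups end up in separate clusters, but only because `k=2` was supplied by hand, which is exactly the drawback described above.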
Hierarchical clustering involves making a tree out of the data, but don’t confuse this with decision-tree algorithms, which map out a sequence of choices that lead step by step to a decision or prediction. Hierarchical clustering is different in that it strives to organize data into categories, with an overarching main category that branches into more specialized subcategories.
Let’s revive our tried-and-true animal categorization example for this one. The algorithm’s input data is a set of unlabeled animal pictures. The agent may not know the names of any of the animals, but it will be able to use visual cues (such as yellow fur) to create categories, and create subcategories from there (yellow fur with no mane; yellow fur with mane). The agent doesn’t need to know the words “yellow,” “fur,” or “mane”; it only needs to recognize the visual similarities between the pictures.
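The tree-building idea can be sketched with a simple bottom-up (agglomerative) approach, one of several ways to do hierarchical clustering. Everything specific here is an assumption made for illustration: the image IDs, the two numeric features standing in for "how yellow the fur is" and "how much mane is visible", and the use of single linkage as the distance between clusters.

```python
def agglomerative(items):
    """Bottom-up hierarchical clustering sketch using single linkage.

    items maps an ID to a feature tuple. Returns a nested tuple tree whose
    leaves are IDs (strings) and whose inner nodes are (left, right) pairs.
    """
    # Each cluster is (tree, member_vectors); every item starts on its own.
    clusters = [(name, [vec]) for name, vec in items.items()]

    def linkage(a, b):
        # Single linkage: distance between the two closest members.
        return min(
            sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
            for p in a[1] for q in b[1]
        )

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them into one node.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical features per unlabeled picture: (yellowness of fur, amount of mane).
pictures = {
    "img1": (5.0, 1.0),  # yellow fur, mane
    "img2": (5.0, 0.0),  # yellow fur, no mane
    "img3": (5.2, 0.1),  # yellow fur, no mane
    "img4": (0.0, 0.0),  # not yellow
    "img5": (0.3, 0.0),  # not yellow
}
tree = agglomerative(pictures)
```

The top split of `tree` separates the yellow-fur pictures from the rest, and inside the yellow-fur branch the maned picture sits apart from the two maneless ones. This brute-force version rescans every pair on each merge, so it is only suitable for small examples.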
Density-based clustering identifies distinct groups in the data by looking for regions where data points are packed closely together (e.g. yellow fur in pictures of animals). If there are lots of lions in a data set of animal pictures, then the group of yellow-furred animals will be quite dense. Areas of high density are clustered together, and the borders between clusters are determined by where each cluster begins to be less dense.
A disadvantage here is that data sets whose clusters have very different densities are hard to analyze, since a single density threshold rarely suits every cluster. But if the densities are reasonably similar, what you get is a clustering method that will find and create clusters from a high number of data points while still coping with outliers, which sit in the sparse regions between clusters and are typically set aside as noise rather than forced into a cluster.
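One well-known density-based algorithm is DBSCAN. The following is a simplified sketch of it in plain Python, not a reference implementation. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense region) follow common usage, while the values and sample points are assumptions chosen for the example. In this sketch, points in sparse regions receive the label -1.

```python
def dbscan(points, eps, min_pts):
    """Simplified DBSCAN-style sketch. Returns one label per point; -1 is noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [
            j for j, q in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(points[i], q)) ** 0.5 <= eps
        ]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # sparse region: noise for now, may become a border point
            continue
        cluster_id += 1  # point i is in a dense region, so start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise point on the cluster's border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is dense too, so keep growing the cluster
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 5.2), (4.9, 5.1), (5.2, 4.9),
    (9.0, 1.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
```

Unlike the centroid-based approach, the number of clusters is not passed in; it falls out of the density settings. The catch is that a single `eps` has to suit every cluster, which is why data with widely varying densities is awkward for this method.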
Distribution-based clustering relies on distribution models, which give the probabilities of the occurrence of certain variables (like yellow fur!). These algorithms group data points according to the distribution each point is most likely to belong to.
A big advantage of this method is that it can capture correlation and dependence between variables. However, just like with centroid-based clustering, the drawback is that the kind of distribution has to be decided before the data is analyzed.
There are various categories of distributions (like binomial, or normal/Gaussian), but the point to take home is that, whatever distribution is used, this method works out which data points, and clusters, likely belong to that distribution. That in turn tells us which data points and clusters are more closely related than others.
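As a rough illustration, the sketch below fits a mixture of two one-dimensional normal (Gaussian) distributions with the expectation-maximization procedure, which is one common way to do distribution-based clustering. The choice of Gaussians, the number of components, the starting means, and the sample values are all assumptions made up front for the example, which mirrors the drawback noted above.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def fit_gaussian_mixture(data, start_means, iterations=50):
    """EM sketch for a 1-D Gaussian mixture; the caller supplies starting means."""
    k = len(start_means)
    means = list(start_means)
    stds = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            scores = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
            total = sum(scores)
            resp.append([score / total for score in scores])
        # M-step: re-estimate each component from its probability-weighted points.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            weights[c] = mass / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / mass
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / mass
            stds[c] = max(math.sqrt(var), 1e-3)  # floor avoids a zero-width component
    return weights, means, stds, resp

# Hypothetical measurements that fall into two groups, around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, stds, resp = fit_gaussian_mixture(data, start_means=[0.0, 6.0])

# Assign each point to the component it most probably belongs to.
assignments = [r.index(max(r)) for r in resp]
```

Each row of `resp` holds the probabilities that a point belongs to each distribution, so the method gives a degree of membership and not just a yes-or-no answer. Taking the most probable component per point, as `assignments` does, turns that into ordinary clusters.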
These different types of clustering algorithms exist because not every data set is the same, and therefore no algorithm is one-size-fits-all. Some data sets have many outliers, and are therefore better suited to density-based clustering than to centroid-based clustering. Likewise, if there is a high variety of densities among data points, then perhaps a distribution-based method is the way to go. Whatever the choice, these algorithms are vital because they allow an agent to learn from data that is fundamentally confusing for it because the data is unlabeled. With clustering, the confusion can be cleared away, even if just partially.