It works by creating clusters from a data set: the data is divided into groups based on the patterns it contains. It is an unsupervised learning algorithm, which means there is no target variable to predict. Instead, the algorithm looks at the observations themselves and groups those that are similar to one another.
One way to find the optimal number of clusters is the elbow method. You plot a line chart with the number of clusters (the value of k) on the x-axis and the within-cluster sum of squared distances on the y-axis, then join the points. The values drop rapidly at first and then level off, so the line forms an elbow shape, and the k at the bend is a reasonable choice.
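The elbow method can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the toy `points`, the helper names (`sq_dist`, `init_centroids`, `kmeans_wcss`), the deterministic farthest-point start, and the "largest change in drop" rule for locating the bend are all assumptions made for the example. In practice the bend is usually judged by eye from the plotted line.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Deterministic farthest-point start (an assumption made here so the
    example is repeatable; random or k-means++ starts are more common)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_wcss(points, k, max_iter=100):
    """Cluster the points into k groups and return the within-cluster
    sum of squared distances (WCSS)."""
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its points (keep it if empty).
        new_centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Toy data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]

ks = list(range(1, 6))
wcss = [kmeans_wcss(points, k) for k in ks]

# Crude bend detector (an assumption): pick the k after which the
# improvement from adding one more cluster shrinks the most.
drops = [wcss[i] - wcss[i + 1] for i in range(len(wcss) - 1)]
bend = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
elbow_k = ks[bend.index(max(bend)) + 1]

for k, value in zip(ks, wcss):
    print(k, round(value, 2))
print("elbow at k =", elbow_k)
```

On this toy data the WCSS falls steeply up to k = 3 and barely changes afterwards, which is the elbow shape described above. Real data rarely has such a sharp bend, so the numeric rule should be treated as a rough guide only.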
A target number k is then chosen. This is the number of centroids you need, and each centroid acts as an imaginary location representing the centre of a cluster. The algorithm then allocates every data point to the nearest centroid and moves each centroid to the mean of its points, repeating these two steps while trying to keep the clusters as compact as possible.
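The allocation of points to centroids can be sketched as two alternating steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The code below is a minimal sketch under stated assumptions; the function names (`assign`, `update`, `kmeans`), the toy `points`, and the hand-picked `start` centroids are hypothetical and chosen only for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points.
    A centroid with no points is left where it is."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate the two steps until the centroids stop moving
    or max_iter is reached."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Toy data with two obvious groups (hypothetical values).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
start = [(1.0, 1.0), (9.0, 8.0)]  # hand-picked starting centroids, k = 2

centroids, labels = kmeans(points, start)
print(labels)
print(centroids)
```

Each centroid ends up at the average position of its group, which is what keeps the distances inside each cluster small. A different choice of `start` can lead to a different final grouping, which is why the starting values matter.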
Advantages | Disadvantages |
---|---|
Scales to large data sets. | Choosing k manually may take a long time. |
Simple to implement. | Results depend on initial values, such as the starting centroids and the k value. |
Adapts to new examples. | Outliers can drag centroids towards them or end up in their own cluster instead of being ignored. |
Can be generalized to clusters of different shapes and sizes. | Clustering data of varying sizes and density can cause issues. |