Discuss the differences between K-Means and Hierarchical clustering.

Following are the differences between the K-Means and Hierarchical Clustering Algorithm (HCA):

  1. K-Means requires us to specify the number of clusters (K) up front, but Hierarchical clustering has no such requirement. The algorithm itself deduces the optimum number of clusters and displays it in the form of a dendrogram.
  2. K-Means performs better on spherical data than HCA.
  3. Hierarchical clustering, in its common agglomerative form, keeps merging until it builds one giant cluster, whereas K-Means keeps the same number of clusters across all its iterations.
  4. K-Means assumes roughly spherical (circular) clusters, while Hierarchical clustering has no such requirement.
  5. K-Means uses the mean (or the median, in the K-Medians variant) to compute a centroid representing each cluster, while HCA offers various linkage methods that may or may not employ centroids.
  6. With the introduction of mini-batches, K-Means can work with very large datasets, but HCA lacks this ability.
  7. Hierarchical methods are suited for cases that require arranging the clusters into a natural hierarchy. In K-Means all clusters are on the same level, i.e. have similar WCSS or cohesiveness.
  8. HCA can produce reproducible results, while older versions of K-Means (with random initialization) cannot.
  9. K-Means simply divides the data into mutually exclusive subsets, while HCA arranges it into a tree format (see the sketch after this list).
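
To make the contrast concrete, here is a minimal sketch, assuming scikit-learn is available, that runs both algorithms on the same toy dataset. Note that `KMeans` needs `n_clusters` up front, while the hierarchical workflow can defer that choice to the dendrogram.

```python
# Minimal comparison of K-Means and agglomerative HCA on toy blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-Means: K must be specified; centroids are refined iteratively.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Agglomerative HCA: merges the closest pair of clusters at every step.
hca_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

print(km_labels[:10], hca_labels[:10])
```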

Discuss the various linkage methods for clustering.

Following are the linkage methods used by the Hierarchical clustering algorithm (a short sketch follows the list):

  1. Single Linkage: a distance-based criterion; it measures the minimum pairwise distance between clusters.
  2. Complete Linkage: a distance-based criterion; it measures the maximum pairwise distance between clusters.
  3. Centroid Linkage: a distance-based criterion; it measures the distance between cluster centroids.
  4. Ward’s Linkage: a cluster-similarity-based criterion; it merges the pair of clusters whose merge causes the smallest increase in total within-cluster variance (WCSS).
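
As a quick illustration, the sketch below (assuming SciPy is available) builds the merge tree under each of these criteria; only the closeness measure changes between runs.

```python
# Build the hierarchical merge tree under different linkage criteria.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

for method in ["single", "complete", "centroid", "ward"]:
    Z = linkage(X, method=method)  # Z encodes merge order and merge distances
    print(method, "-> first merge at distance", Z[0, 2])
```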

Discuss Hierarchical clustering in detail.

The Hierarchical clustering algorithm adopts an agglomerative learning approach. Following are the steps involved in its training process:

  1. In the first iteration it starts with individual points, finds the closest pair, and combines them into a single cluster. In later iterations it does the same with clusters.
  2. The process goes on till all the points are clustered into one giant cluster.
  3. The desired number of clusters is deduced from the dendrogram, e.g. by cutting the tree at a suitable height.
  4. The output with the desired number of clusters is chosen (see the sketch after this list).
  5. The criterion for closeness is based on the chosen linkage method.
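
These steps map directly onto SciPy's hierarchical clustering API; a minimal sketch, assuming SciPy is available:

```python
# Agglomerative workflow: build the full merge tree, then cut it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

# Steps 1-2: repeatedly merge the closest pair (Ward linkage here)
# until everything sits in one giant cluster.
Z = linkage(X, method="ward")

# Steps 3-4: cut the dendrogram to obtain the desired number of clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```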

Discuss the agglomerative and divisive clustering approaches.

Following are the two approaches used by clustering algorithms:

  1. Agglomerative: The algorithm starts by treating each individual point as a cluster and proceeds by merging the nearest clusters. Thus similarity is used as the measure for forming new clusters. If not stopped as per business requirements, the algorithm would go on to create one giant cluster containing all the data points. This is also called the bottom-up approach.
  2. Divisive: The algorithm starts with one giant cluster containing all the data points and proceeds by splitting off dissimilar points into separate clusters. Thus dissimilarity is used as the measure for forming new clusters. If not stopped as per business requirements, the algorithm would go on to create an individual cluster for every data point. This is also called the top-down approach (see the sketch below).
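
Libraries ship agglomerative clustering directly (e.g. scikit-learn's `AgglomerativeClustering`), but divisive clustering usually has to be assembled by hand. One common approximation is bisecting K-Means, sketched below; the `divisive_clusters` helper is illustrative, not a library API.

```python
# Divisive (top-down) clustering approximated by bisecting K-Means:
# repeatedly split the largest remaining cluster in two.
import numpy as np
from sklearn.cluster import KMeans

def divisive_clusters(X, n_clusters):
    clusters = [np.arange(len(X))]  # start with one giant cluster
    while len(clusters) < n_clusters:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)       # take the largest cluster...
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters += [idx[halves == 0], idx[halves == 1]]  # ...and split it in two
    return clusters

X = np.random.default_rng(2).normal(size=(60, 2))
print([len(c) for c in divisive_clusters(X, 4)])
```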

Discuss the various improvements in K-Means.

Following are the improvements made in K-Means:

  1. Since the initial centroids are arbitrarily chosen, the results of earlier versions were not exactly replicable. K-Means++ proposed a new seeding method in which each initial centroid is chosen to be far from the ones already picked (with probability proportional to its squared distance from the nearest existing centroid), which makes the results far more stable across runs.
  2. Earlier versions were computationally expensive, as the distance had to be calculated from each centroid to all data points and then compared to find the nearest centroid. Leveraging the triangle inequality (the distance between two points can never exceed the sum of their distances to a third point), many of these distance computations can be skipped, reducing the amount of computation.
  3. Earlier versions showed out-of-memory errors for large datasets. This was solved by using mini-batches: the algorithm is run on each mini-batch and the centroids move incrementally with each new batch. This increases the speed of computation by around 3 times (see the sketch after this list).
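
A quick sketch of the mini-batch variant, assuming scikit-learn is available; the centroids are updated incrementally from small random batches, which keeps memory usage bounded:

```python
# Mini-batch K-Means: fit on small random batches instead of the full dataset.
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans

X, _ = make_blobs(n_samples=100_000, centers=5, random_state=42)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=42).fit(X)
print(mbk.cluster_centers_)
```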

What are the challenges with K-Means?

Following are the challenges faced by K-Means Clustering:

  1. K-Means doesn’t perform well if the clusters have varying sizes, different densities, or non-spherical shapes (illustrated in the sketch after this list).
  2. It has to be run for a sufficient number of iterations, or it will produce a suboptimal result.
  3. It is computationally expensive, as the distance has to be calculated from each centroid to all data points.
  4. It can’t handle high-dimensional data well.
  5. The number of clusters needs to be specified up front.
  6. Since the initial centroids are arbitrarily chosen, the result is not exactly replicable.
  7. It shows out-of-memory errors for large datasets.
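
Challenge 1 is easy to demonstrate. In the sketch below (assuming scikit-learn), K-Means splits the non-spherical "two moons" shapes incorrectly, while single-linkage hierarchical clustering recovers them:

```python
# K-Means struggles on non-spherical clusters; single linkage does not.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hca = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

print("K-Means ARI:    ", adjusted_rand_score(y, km))   # well below 1.0
print("Single-link ARI:", adjusted_rand_score(y, hca))  # close to 1.0
```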

Discuss the step by step implementation of K-Means Clustering.

Following are the steps taken by K-Means algorithm in its learning process:

  1. The user specifies the value of K, e.g. using the elbow curve.
  2. For the first iteration, K points are arbitrarily chosen. Let’s call them centroids for the time being; step 4 explains this nomenclature. In the case of K-Means++, care is taken that these points are farthest from each other. In further iterations, the centroids calculated in step 4 are used.
  3. The distance of all the points from these K centroids is calculated, and each point is assigned to its nearest centroid. Thus the first set of K clusters is formed.
  4. The centroid of each cluster is calculated. [That’s why we were calling the arbitrarily chosen points centroids.]
  5. Go back to step 2 and repeat the process until the centroids don’t move at all or the maximum number of iterations is reached. Beyond a certain number of iterations, the movement of the centroids, and hence the decrease in WCSS, is negligible. A from-scratch sketch of these steps follows.
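
Here is a minimal from-scratch sketch of these steps in plain NumPy (random initialization rather than K-Means++, and no handling of empty clusters), for illustration only:

```python
# From-scratch K-Means following the steps above.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 2: arbitrary initial "centroids"
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each cluster's centroid.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0, 3, 6)])
labels, centroids = kmeans(X, k=3)
print(centroids)
```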

What is the significance of ‘K’ in K-Means and how is it calculated?

K in K-Means clustering signifies the number of clusters. In the K-Means algorithm we need to specify the value of K. By the inherent nature of the algorithm, the WCSS decreases as the number of clusters increases, but the rate of decrease is steep only for the first few values and then plateaus. The point where the curve bends (the "elbow") indicates the ideal number of clusters for the given dataset, i.e. the ideal value of K.

Discuss the elbow method.

In the K-Means algorithm we need to specify the number of clusters. By the inherent nature of the algorithm, the WCSS decreases as the number of clusters increases, but the rate of decrease is steep only for the first few values and then plateaus. The resulting curve resembles the shape of a human elbow, hence the name. The point where the curve bends indicates the ideal number of clusters for the given dataset, as sketched below.
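
A short sketch of the elbow method, assuming scikit-learn and matplotlib are available; `KMeans` exposes the WCSS as `inertia_`:

```python
# Elbow method: plot WCSS against K and look for the bend.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS (inertia)")
plt.show()  # the curve drops steeply, then flattens; the elbow sits near K = 4
```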