## Summary

K-means is a simple, classic, distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm assumes that a cluster is composed of objects that lie close to each other, so its goal is to produce compact, well-separated clusters.

## Core idea

K-means is an iterative cluster-analysis algorithm. It first selects k objects at random as the initial cluster centers, then computes the distance between every object and each cluster center and assigns each object to the nearest one. A cluster center together with the objects assigned to it forms a cluster. After each round of assignments, every cluster center is recalculated from the objects currently in its cluster. This process repeats until a termination condition is met, for example: no (or only a minimal number of) objects are reassigned to a different cluster, no (or only a minimal number of) cluster centers change, or the sum of squared errors reaches a local minimum.
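For instance, with Euclidean distance as the similarity measure, a point is "more similar" to whichever point it is closer to (the coordinates below are made up purely for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([1.5, 2.5])
c = np.array([8.0, 9.0])

# Smaller distance = greater similarity under this criterion
dist_ab = np.linalg.norm(a - b)  # sqrt(0.5), about 0.707
dist_ac = np.linalg.norm(a - c)  # sqrt(98), about 9.899
assert dist_ab < dist_ac  # a is more similar to b than to c
```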

## Specific implementation steps of the algorithm

- First, choose a value of k: the number of clusters we want to partition the data set into.
- Randomly select k data points from the data set as the initial centroids.
- For each point in the data set, compute its distance to every centroid (for example, the Euclidean distance) and assign the point to the set of the nearest centroid.
- Once all points have been assigned, there are k sets. Recalculate the centroid of each set.
- If the distance between each newly calculated centroid and its previous position is below a set threshold (meaning the centroids have barely moved and the process has stabilized or converged), we consider the clustering to have reached the expected result, and the algorithm terminates.
- If the new centroids have moved significantly, repeat steps 3 through 5.
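The steps above can be sketched in a few lines of NumPy. This is a hedged illustration, not the article's implementation given later; the helper name `kmeans_sketch` and its default parameters are choices made here:

```python
import numpy as np

def kmeans_sketch(X, k, tol=1e-4, max_iter=100, seed=0):
    """Minimal sketch of the steps above; name and defaults are assumptions."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop once the centroids move less than the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids  # Step 6: otherwise iterate again
    return centroids, labels
```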

## Diagram of algorithm steps

Figure (a) above shows the initial data set; assume k = 2. In figure (b) we randomly choose the centroids of the two classes, the red and the blue centroid in the figure. We then compute the distance from every sample point to both centroids and label each sample with the class of the nearer centroid; figure (c) shows the classes of the sample points after this first iteration. Next, we compute new centroids for the points currently marked red and blue; as figure (d) shows, the positions of the red and blue centroids have changed. Figures (e) and (f) repeat the process of figures (c) and (d): label every point with the class of its nearest centroid, then compute new centroids. Finally, we obtain the two classes shown in figure (f).

## Algorithm terminology

Cluster: a set of data points whose members are similar to one another.

Centroid (center of mass): the center of a cluster, computed as the mean of all points in the cluster.
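Concretely, the centroid is the coordinate-wise mean of the cluster's points (the three points below are a made-up example):

```python
import numpy as np

# Hypothetical cluster of three 2-D points; the centroid is the coordinate-wise mean
cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = cluster.mean(axis=0)
print(centroid)  # [3. 4.]
```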

## Advantages and disadvantages of the algorithm

Advantages:

- The principle is simple, the implementation is easy, and convergence is fast.
- When the resulting clusters are dense and clearly separated from one another, the algorithm works well.
- The only main parameter to tune is the number of clusters, k.

Disadvantages:

- The value of k must be given in advance, and in many cases it is difficult to estimate.
- K-means is sensitive to the initially chosen centroids: different random seeds can yield completely different clusterings, which strongly affects the results.
- It is sensitive to noise and outliers (a sensitivity that can, conversely, be exploited to detect outliers).
- As an iterative method, it can only reach a local optimum, with no guarantee of finding the global optimum.
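The initialization sensitivity and local-optimum problems are commonly mitigated by restarting the algorithm with different random seeds and keeping the run with the lowest sum of squared errors (SSE); k-means++ seeding is another standard remedy. A hedged NumPy sketch of the restart trick, where the `kmeans` helper is a stand-alone re-implementation written only for this illustration:

```python
import numpy as np

def kmeans(X, k, seed, max_iter=100):
    """Plain k-means with random initial centroids (illustrative helper)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])  # keep empty clusters in place
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = ((X - centroids[labels]) ** 2).sum()  # sum of squared errors of this run
    return centroids, labels, sse

# Three synthetic blobs; different seeds can converge to different local optima
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.2, (20, 2)) for m in [(0, 0), (3, 0), (0, 3)]])

# Restart with several seeds and keep the run with the lowest SSE
best_centroids, best_labels, best_sse = min(
    (kmeans(X, 3, seed) for seed in range(10)), key=lambda r: r[2])
```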

## Code

```python
# Test routine
import numpy as np
import matplotlib.pyplot as plt

# Load the data
def loadDataSet(fileName):
    data = np.loadtxt(fileName, delimiter='\t')
    return data

# Euclidean distance
def distEclud(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Build a set of k random centroids for the given data set
def randCent(dataSet, k):
    m, n = dataSet.shape
    # Sample k distinct rows so no two initial centroids coincide
    indices = np.random.choice(m, k, replace=False)
    centroids = dataSet[indices, :].copy()
    return centroids

# k-means clustering
def KMeans(dataSet, k):
    m = np.shape(dataSet)[0]  # number of samples (rows)
    # Column 0: index of the cluster each sample belongs to
    # Column 1: squared error (distance to that cluster's centroid)
    clusterAssment = np.mat(np.zeros((m, 2)))
    clusterChange = True

    # Step 1: initialize the centroids
    centroids = randCent(dataSet, k)
    while clusterChange:
        clusterChange = False
        # Traverse all samples (rows)
        for i in range(m):
            minDist = float('inf')
            minIndex = -1
            # Step 2: find the nearest centroid
            for j in range(k):
                # Euclidean distance from the sample to this centroid
                distance = distEclud(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            # Step 3: update the sample's cluster assignment
            if clusterAssment[i, 0] != minIndex:
                clusterChange = True
                clusterAssment[i, :] = minIndex, minDist ** 2
        # Step 4: update the centroids
        for j in range(k):
            # All points currently assigned to cluster j
            pointsInCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == j)[0]]
            if len(pointsInCluster) > 0:
                centroids[j, :] = np.mean(pointsInCluster, axis=0)  # column-wise mean
    print("Congratulations, cluster complete!")
    return centroids, clusterAssment

def showCluster(dataSet, k, centroids, clusterAssment):
    m, n = dataSet.shape
    if n != 2:
        print("Data is not two-dimensional")
        return 1
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("The value of k is too large")
        return 1
    # Draw all samples
    for i in range(m):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # Draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i])
    plt.show()

dataSet = loadDataSet("K-means_data.txt")
k = 3
centroids, clusterAssment = KMeans(dataSet, k)
showCluster(dataSet, k, centroids, clusterAssment)
```
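For comparison, the same procedure is available in scikit-learn (assuming it is installed); `n_init` runs multiple random restarts and `tol` is the convergence threshold on centroid movement. The sample points below are illustrative values, not the article's data file:

```python
import numpy as np
from sklearn.cluster import KMeans

# A few points loosely shaped like the data set below (values are illustrative)
X = np.array([[-1.2, 0.4], [-1.1, 0.5], [-0.3, 0.2],
              [-0.2, 0.3], [-0.5, -0.4], [-0.4, -0.5]])
km = KMeans(n_clusters=3, n_init=10, tol=1e-4, random_state=0).fit(X)
print(km.cluster_centers_)  # one row per centroid
print(km.labels_)           # cluster index of each point
```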

Contents of the `K-means_data.txt` data set (tab-separated, one point per line):

```
-1.26	0.46
-1.15	0.49
-1.19	0.36
-1.33	0.28
-1.06	0.22
-1.27	0.03
-1.28	0.15
-1.06	0.08
-1.00	0.38
-0.44	0.29
-0.37	0.45
-0.22	0.36
-0.34	0.18
-0.42	0.06
-0.11	0.12
-0.17	0.32
-0.27	0.08
-0.49	-0.34
-0.39	-0.28
-0.40	-0.45
-0.15	-0.33
-0.15	-0.21
-0.33	-0.30
-0.23	-0.45
-0.27	-0.59
-0.61	-0.65
-0.61	-0.53
-0.52	-0.53
-0.42	-0.56
-1.39	-0.26
```