The principles of K-means and K-means++ are fairly simple, so they are only summarized briefly below.
This article focuses on a Python implementation of the K-means and K-means++ algorithms, along with a simple test.
Because the clustering result of the K-means algorithm depends on the choice of initial points, an improvement of the algorithm was proposed: K-means++.
In fact, this algorithm only changes how the initial centroids are selected; all other steps are identical. The basic idea of the initial centroid selection is that the initial cluster centers should be as far apart from each other as possible.
The algorithm is described as follows:
- Step 1: randomly select one sample as the first cluster center c1;
- Step 2:
- compute, for each sample, the shortest distance to the existing cluster centers (i.e. the distance to the nearest already-chosen center), denoted D(x);
- the larger D(x) is, the more likely the sample is to be selected as the next cluster center;
- finally, select the next cluster center by the roulette-wheel method;
- Step 3: repeat Step 2 until k cluster centers have been selected.
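The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the full implementation given later; the function name `kmeanspp_init` and the use of `numpy.random.default_rng` are choices made here for the sketch:

```python
import numpy as np

def kmeanspp_init(data, k, seed=0):
    """Pick k initial centers: the first at random (Step 1), the rest by
    roulette-wheel selection weighted by the distance D(x) to the nearest
    already-chosen center (Steps 2 and 3)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    centers = [data[rng.integers(n)]]        # Step 1: random first center
    while len(centers) < k:                  # Step 3: repeat until k centers
        # Step 2: D(x) = distance from each sample to its nearest center
        d = np.min([np.linalg.norm(data - c, axis=1) for c in centers], axis=0)
        probs = d / d.sum()                  # larger D(x) -> higher probability
        centers.append(data[rng.choice(n, p=probs)])  # roulette selection
    return np.array(centers)
```

On well-separated data the second center almost always lands far from the first, since samples near an existing center get probability close to zero.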
After the initial points are selected, the standard K-means algorithm proceeds as usual.
K-means++ can significantly reduce the final error of the clustering result.
Although selecting the initial points takes extra time, K-means itself converges faster from these starting points, so the algorithm actually reduces the overall computation time.
Tests published online on real and synthetic data sets report that the speed is usually about twice as fast, and that for some data sets the error is reduced by nearly a factor of 1000.
Here is a simple example showing how K-means++ selects the initial cluster centers.
The data set contains 8 samples; their distribution and serial numbers are shown in the figure below:
Suppose that after Step 1, point 6 in Figure 2 is selected as the first initial cluster center.
Then the D(x) of each sample, and its probability of being selected as the second cluster center in Step 2, are shown in the table below:
P(x) is the probability that each sample is selected as the next cluster center.
The Sum row gives the cumulative sum of the probabilities P(x); it is used to select the second cluster center by the roulette method.
The method works as follows: generate a random number between 0 and 1, determine which interval it falls into, and the serial number corresponding to that interval is the selected second cluster center.
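The interval lookup can be sketched as follows. Only the first two probabilities below match the intervals quoted in the text; the remaining values are hypothetical placeholders, chosen so that points 1–4 carry a total probability of 0.9 as in the example:

```python
import numpy as np

def roulette_pick(probs, r):
    """Given per-sample probabilities and a random number r in [0, 1),
    return the index whose cumulative-sum interval contains r."""
    cum = np.cumsum(probs)  # interval upper bounds, e.g. [0.2, 0.525, ...]
    return int(np.searchsorted(cum, r, side='right'))

# Hypothetical probabilities for illustration; the first two match the
# intervals in the text: point 1 -> [0, 0.2), point 2 -> [0.2, 0.525).
# Point 6 is the first center, so its probability is 0.
probs = [0.2, 0.325, 0.2, 0.175, 0.0, 0.0, 0.05, 0.05]
```

For example, `roulette_pick(probs, 0.1)` returns index 0 (point 1) and `roulette_pick(probs, 0.3)` returns index 1 (point 2), exactly the interval lookup described above.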
For example, the interval of point 1 is [0, 0.2), and the interval of point 2 is [0.2, 0.525).
From the table above, the probability that the second initial cluster center is one of points 1, 2, 3, or 4 is 0.9.
These four points happen to be the four points farthest from the first initial cluster center, point 6.
This confirms the idea behind K-means++: points far from the existing cluster centers are more likely to be selected as the next cluster center.
Note that k = 2 in this example, which keeps things simple. When k is greater than 2, each sample has one distance to each already-selected center, and the minimum of those distances is taken as D(x).
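Computing D(x) as a minimum over the existing centers is a one-liner; the helper name `d_x` here is just for illustration:

```python
import numpy as np

def d_x(sample, centers):
    """D(x): distance from a sample to its nearest existing cluster center."""
    return min(np.linalg.norm(sample - c) for c in centers)

# A sample at (0, 0) with centers at (3, 4) and (6, 8) has D(x) = 5.0,
# since the distances to the two centers are 5.0 and 10.0.
```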
3. Code implementation
```python
import random

import matplotlib.pyplot as plt
import numpy as np


def distance(vec1, vec2):
    """Euclidean distance between two vectors."""
    return np.sqrt(np.sum((vec1 - vec2) ** 2))


def kmeans(dataset, k, is_kmeans=True, is_random=False):
    """
    dataset:   2-D array of samples
    k:         number of clusters
    is_kmeans: True (default) uses plain K-means initialization,
               otherwise K-means++ initialization
    is_random: with K-means++, True picks each new center by roulette
               according to distance-based probabilities; False takes
               the farthest sample as the next center
    """
    # work in float so the center updates are not truncated to integers
    dataset = np.asarray(dataset, dtype=np.float64)
    num_sample, num_feature = dataset.shape

    if not is_kmeans:
        # K-means++ initialization of the cluster centers
        # 1. Randomly select the first center point
        first_idx = random.sample(range(num_sample), 1)
        center = dataset[first_idx]
        # 2. Compute the distance from each sample to the chosen centers
        #    and select the remaining centers one by one
        dist_note = np.full(num_sample, np.inf)
        for j in range(k):
            if j + 1 == k:
                break  # enough cluster centers selected, exit directly
            # keep the minimum distance from each sample to any chosen center
            for i in range(num_sample):
                dist = distance(center[j], dataset[i])
                if dist < dist_note[i]:
                    dist_note[i] = dist
            if is_random:
                # roulette: sample the next center with probability
                # proportional to its distance
                dist_p = dist_note / dist_note.sum()
                next_idx = np.random.choice(range(num_sample), 1, p=dist_p)
            else:
                # deterministic variant: take the farthest sample
                next_idx = dist_note.argmax()
            center = np.vstack([center, dataset[next_idx]])
    else:
        # plain K-means: random initial cluster centers
        center_indexs = random.sample(range(num_sample), k)
        center = dataset[center_indexs, :]

    # Iterative part of the K-means algorithm
    cluster_assessment = np.zeros((num_sample, 2))
    cluster_assessment[:, 0] = -1  # all samples start unassigned
    cluster_changed = True
    while cluster_changed:
        cluster_changed = False
        for i in range(num_sample):
            min_distance = np.inf
            c = 0
            # assign each sample to the nearest center
            for j in range(k):
                dist = distance(dataset[i, :], center[j, :])
                if min_distance > dist:
                    min_distance = dist
                    c = j
            if cluster_assessment[i, 0] != c:
                # an assignment changed, so another iteration is needed
                cluster_assessment[i, :] = c, min_distance
                cluster_changed = True
        # update the cluster center positions
        for j in range(k):
            changed_center = dataset[cluster_assessment[:, 0] == j].mean(axis=0)
            center[j, :] = changed_center
    return cluster_assessment, center


def show_cluster(dataSet, k, centroids, clusterAssement):
    """Plot the clustering result for 2-D data."""
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', '<r', 'pr']
    center_mark = ['*r', '*b', '*g', '*k', '*r', '*r', '*r', '*r']
    for i in range(numSamples):
        markIndex = int(clusterAssement[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex], markersize=2)
    for j in range(k):
        plt.plot(centroids[j, 0], centroids[j, 1], center_mark[j], markersize=12)
    plt.show()


# Four roughly separated square blobs as test data
x1 = np.random.randint(0, 50, (50, 2))
x2 = np.random.randint(40, 100, (50, 2))
x3 = np.random.randint(90, 120, (50, 2))
x4 = np.random.randint(110, 160, (50, 2))
test = np.vstack((x1, x2, x3, x4))

result, center = kmeans(test, 4, is_kmeans=False, is_random=False)
print(center)
show_cluster(test, 4, center, result)

# Example cluster centers from one run (values vary with the random data):
# [[134 134]
#  [ 21  25]
#  [106 104]
#  [ 70  64]]
```