K-means and K-means++ (Python implementation)


1, K-means

2, K-means++

3, Code implementation

The principles of K-means and K-means++ are relatively simple; for details, refer directly to the following:


This post focuses on implementing and testing the K-means and K-means++ algorithms in Python.

1, K-means

2, K-means++


Because the clustering results of the K-means algorithm vary with the choice of initial points, an improved version of the algorithm was proposed: K-means++.

Algorithm steps

In fact, this algorithm only improves the selection of the initial points; all other steps are the same. The basic idea of the initial centroid selection is that the initial cluster centers should be as far apart from each other as possible.

The algorithm is described as follows:

  • Step 1: randomly select one sample as the first cluster center c1;
  • Step 2:
    • For each sample, calculate the shortest distance to the existing cluster centers (i.e. the distance to the nearest cluster center), denoted D(x);
    • The larger D(x) is, the more likely the sample is to be selected as the next cluster center;
    • Finally, select the next cluster center by roulette-wheel selection;
  • Step 3: repeat step 2 until k cluster centers have been selected.
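
The steps above can be sketched compactly with NumPy. This is only an illustrative sketch, not the implementation given later in this post: the function name `kmeanspp_init` and its `rng` and `squared` parameters are assumptions introduced here. The steps above weight samples by D(x); the original K-means++ formulation by Arthur and Vassilvitskii weights them by D(x)², which the `squared` flag switches on. The sketch also assumes the data contains at least k distinct points.

```python
import numpy as np

def kmeanspp_init(data, k, rng=None, squared=False):
    """Pick k initial centers from data (hypothetical helper for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    # Step 1: choose the first center uniformly at random
    chosen = [int(rng.integers(n))]
    # D(x): distance from each sample to its nearest chosen center
    d = np.linalg.norm(data - data[chosen[0]], axis=1)
    # Steps 2-3: keep adding centers until k have been chosen
    while len(chosen) < k:
        weights = d ** 2 if squared else d
        probs = weights / weights.sum()
        nxt = int(rng.choice(n, p=probs))  # roulette-wheel selection
        chosen.append(nxt)
        # Keep the minimum distance over all centers chosen so far
        d = np.minimum(d, np.linalg.norm(data - data[nxt], axis=1))
    return data[chosen]

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = kmeanspp_init(points, 3, rng=np.random.default_rng(0))
print(centers.shape)
```

Samples that are already centers have D(x) = 0, so they get probability 0 and cannot be picked twice.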

After the initial points have been selected, the standard K-means algorithm is used as usual.


K-means++ can significantly reduce the final error of the clustering results.
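
The post does not define "error" precisely. A common choice, assumed here, is the within-cluster sum of squared distances (SSE). The sketch below uses a hypothetical helper `sse` and toy data made up for illustration to show how a good and a poor set of centers compare under that measure.

```python
import numpy as np

def sse(data, labels, centers):
    """Sum of squared distances from each sample to its assigned center."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels, dtype=int)
    diffs = data - centers[labels]  # vector from assigned center to sample
    return float(np.sum(diffs ** 2))

# Toy data (made up): two tight groups, around x = 0 and x = 10
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# Centers that match the two groups
good = sse(points, [0, 0, 1, 1], [[0.0, 1.0], [10.0, 1.0]])
# Centers that split the data the wrong way
bad = sse(points, [0, 1, 0, 1], [[5.0, 0.0], [5.0, 2.0]])
print(good, bad)  # 4.0 100.0
```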

Although computing the initial points takes extra time, K-means itself then converges quickly during the iterations, so the algorithm actually reduces the total computation time.

According to reported tests on real and synthetic data sets, the speed typically improves about twofold, and for some data sets the error is reduced by a factor of nearly 1000.

Here is a simple example to show how K-means++ selects the initial cluster centers.

There are 8 samples in the data set. The distribution and corresponding serial number are shown in the following figure:

Suppose point 6 is selected as the first initial cluster center after step 1 in Figure 2.

The D(x) of each sample and its probability of being selected as the second cluster center in step 2 are shown in the table below:

P(x) is the probability that each sample is selected as the next cluster center.

The Sum in the last row is the cumulative sum of the probabilities P(x), which is used to select the second cluster center by the roulette method.

The method is to generate a random number between 0 and 1 and determine which interval it falls into; the sample number corresponding to that interval is the second cluster center.

For example, the interval of point 1 is [0, 0.2), and the interval of point 2 is [0.2, 0.525).
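
The interval lookup described above can be sketched with a cumulative sum. In this sketch the helper name `roulette_pick` is an assumption, and the probabilities are hypothetical: the first two match the intervals quoted above, while the remaining two are made up so that the list sums to 1.

```python
import numpy as np

def roulette_pick(probs, r):
    """Return the 0-based index whose cumulative interval contains r in [0, 1)."""
    cumulative = np.cumsum(probs)
    idx = int(np.searchsorted(cumulative, r, side="right"))
    # Guard against floating-point round-off in the last cumulative value
    return min(idx, len(probs) - 1)

# Hypothetical probabilities; their intervals are
# [0, 0.2), [0.2, 0.525), [0.525, 0.8), [0.8, 1.0)
probs = [0.2, 0.325, 0.275, 0.2]

print(roulette_pick(probs, 0.1))  # 0, i.e. point 1
print(roulette_pick(probs, 0.3))  # 1, i.e. point 2
```

In practice `r` would come from a random number generator, for example `np.random.random()`.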

From the table above, it can be seen that the probability that the second initial cluster center is one of points 1, 2, 3 and 4 is 0.9.

These four points are exactly the four points farthest from the first initial cluster center, point 6.

This confirms the idea behind the K-means++ improvement: points far away from the existing cluster centers are more likely to be selected as the next cluster center.

Note that k is 2 in this example, which keeps it simple. When k is greater than 2, each sample has a distance to every existing cluster center, and the minimum of these distances is taken as D(x).
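
As an illustration of taking the minimum over several centers, the sketch below computes D(x) for every sample at once. The helper name `nearest_center_distance` and the sample coordinates are assumptions made up for this example.

```python
import numpy as np

def nearest_center_distance(data, centers):
    """Return D(x): each sample's distance to its nearest center."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise Euclidean distances, shape (num_samples, num_centers)
    pairwise = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    # Keep only the smallest distance for each sample
    return pairwise.min(axis=1)

samples = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centers = np.array([[0.0, 0.0], [6.0, 8.0]])
print(nearest_center_distance(samples, centers))  # [0. 5. 0.]
```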

3, Code implementation

import numpy as np
import random
import matplotlib.pyplot as plt  # needed by show_cluster

def distance(vec1, vec2):
    # Define distance
    return np.sqrt(np.sum((vec1 - vec2) ** 2))

def kmeans(dataset, k, is_kmeans=True, is_random=False):
    """
    dataset: 2D array of samples, shape (num_sample, num_feature)
    k: number of clusters
    is_kmeans: if True (default), use K-means; otherwise use K-means++
    is_random: only used with K-means++. If True, select each new center by roulette
               with probability proportional to distance; if False (default), select
               the sample farthest from the existing centers
    """
    num_sample, num_feature = dataset.shape
    # K-means++ initialization of the cluster centers
    if not is_kmeans:
        # 1. Randomly select the first center point
        first_idx = random.sample(range(num_sample), 1)
        center = dataset[first_idx]
        # 2. Track each sample's distance to its nearest center, and select the other centers
        dist_note = np.full(num_sample, np.inf)

        for j in range(k - 1):  # k - 1 more centers are needed after the first one
            # Update each sample's minimum distance using the most recently added center
            for i in range(num_sample):
                dist = distance(center[j], dataset[i])
                if dist < dist_note[i]:
                    dist_note[i] = dist
            # With roulette, draw the next center with probability proportional to its distance;
            # otherwise, use the farthest sample as the next cluster center
            if is_random:
                dist_p = dist_note / dist_note.sum()
                next_idx = np.random.choice(range(num_sample), 1, p=dist_p)
                center = np.vstack([center, dataset[next_idx]])
            else:
                next_idx = dist_note.argmax()
                center = np.vstack([center, dataset[next_idx]])
    # K-means: random initialization of the cluster centers
    else:
        center_indexs = random.sample(range(num_sample), k)
        center = dataset[center_indexs, :]

    # Iterative implementation of the K-means algorithm
    # Column 0 holds the assigned cluster, column 1 the distance to its center
    cluster_assessment = np.zeros((num_sample, 2))
    cluster_assessment[:, 0] = -1  # Set all categories to -1
    cluster_changed = True
    while cluster_changed:
        cluster_changed = False

        for i in range(num_sample):
            min_distance = np.inf
            c = 0
            # Determine which cluster each sample belongs to, i.e. which center point is closest
            for j in range(k):
                dist = distance(dataset[i, :], center[j, :])
                if min_distance > dist:
                    min_distance = dist
                    c = j
            # If any assignment changed, the stop condition is not yet met
            if cluster_assessment[i, 0] != c:
                cluster_changed = True
            # Always store the distance so it reflects the current centers
            cluster_assessment[i, :] = c, min_distance
        # Update cluster center positions (clusters with no samples keep their old center)
        # Note: center keeps the dtype of dataset, so integer data gives truncated integer centers
        for j in range(k):
            members = dataset[cluster_assessment[:, 0] == j]
            if len(members) > 0:
                center[j, :] = members.mean(axis=0)
    return cluster_assessment, center

def show_cluster(dataSet, k, centroids, clusterAssement):
    # Draw the clustering results for two-dimensional data (the marker lists cover up to 8 clusters)
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', '<r', 'pr']
    center_mark = ['*r', '*b', '*g', '*k', '*r', '*r', '*r', '*r']

    # Plot each sample with the marker of its assigned cluster
    for i in range(numSamples):
        markIndex = int(clusterAssement[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex], markersize=2)
    # Plot the cluster centers with larger star markers
    for j in range(k):
        plt.plot(centroids[j, 0], centroids[j, 1], center_mark[j], markersize=12)
    plt.show()

Code test:

x1 = np.random.randint(0, 50, (50, 2))
x2 = np.random.randint(40, 100, (50, 2))
x3 = np.random.randint(90, 120, (50, 2))
x4 = np.random.randint(110, 160, (50, 2))
test = np.vstack((x1, x2, x3, x4))

# Clustering features
result, center = kmeans(test, 4, is_kmeans=False, is_random=False)
show_cluster(test, 4, center, result)

Resulting cluster centers:
[[134 134]
 [ 21  25]
 [106 104]
 [ 70  64]]


Tags: Python

Posted on Mon, 15 Jun 2020 02:18:30 -0400 by jpadie