Clustering algorithm KMeans

This article is reposted from the Internet and is kept here only as my own learning notes.

K-means is the most commonly used clustering algorithm. It is based on Euclidean distance and assumes that the closer two points are to each other, the more similar they are.

1. Algorithm

1.1. Algorithm steps
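
In outline (this is exactly what the code in Section 3.5 does):

  1. Choose K initial cluster centers, for example by picking K samples at random;
  2. Assign every sample to the cluster whose center is closest in Euclidean distance;
  3. Recompute each cluster center as the mean of the samples assigned to it;
  4. Repeat steps 2 and 3 until the assignments no longer change (or a maximum number of iterations is reached).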

1.2. Complexity
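
The time complexity is O(tknd), where t is the number of iterations, k the number of clusters, n the number of samples and d the number of dimensions; the space complexity is O((n + k)d) for storing the samples and the cluster centers.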

2. Advantages and disadvantages

Advantages:

  • It is easy to understand and the clustering results are good. Although the result is only a local optimum, a local optimum is often good enough;
  • When dealing with large data sets, the algorithm can ensure good scalability;
  • When the clusters are approximately Gaussian, the results are very good;
  • The complexity of the algorithm is low.

Disadvantages:

  • The value of K has to be set manually, and different values of K give different results;
  • It is sensitive to the initial cluster centers, and different initialization methods give different results;
  • It is sensitive to outliers;
  • Each sample can only be assigned to a single cluster, so it is not suitable for multi-label tasks;
  • It is not suitable for very sparse data, for clusters of very unbalanced size, or for non-convex cluster shapes.

3. Algorithm tuning & Improvement

3.1. Data preprocessing

The essence of K-means is a data-partitioning algorithm based on Euclidean distance, so dimensions with large means and variances have a decisive impact on the clustering. Data that has not been normalized and brought to a common unit therefore cannot be compared directly. Common preprocessing steps are data normalization and data standardization (a sketch follows).
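
As a minimal sketch, both rescalings can be done directly with NumPy; the toy array below is made up for illustration:

import numpy as np

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]) # toy data with very different feature scales

# Min-max normalization: rescale every feature to [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score standardization: zero mean and unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)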

In addition, outliers and noisy data have a strong influence on the mean and can pull the cluster centers away from the bulk of the data, so we also need to detect outliers before clustering; a simple check is sketched below.
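
A common and simple way to flag such points is a z-score rule; the helper below and its threshold of 3 are illustrative assumptions, not part of the original method:

import numpy as np

def remove_outliers(X, z_thresh=3.0):
    # Keep only samples whose every feature lies within z_thresh standard deviations of the mean
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return X[(z < z_thresh).all(axis=1)]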

Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

3.2. Reasonable selection of K value

The choice of K has a great impact on K-means and is also its biggest drawback. Common ways to choose K are the elbow method and the gap statistic.

  1. Elbow method (a sketch follows this list)
  2. Gap statistic method
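
A minimal sketch of the elbow method, assuming scikit-learn is available (the article's own implementation in Section 3.5 does not record the within-cluster sum of squares): run K-means for a range of K values, plot the inertia, and choose the K at the "elbow" where the curve stops dropping sharply.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2) # toy data for illustration

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_) # within-cluster sum of squared distances

plt.plot(ks, inertias, 'o-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show() # the "elbow" of this curve is a reasonable K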

3.3. Using kernel function

K-means based on Euclidean distance assumes that every cluster has the same prior probability and is roughly spherical, but real data rarely looks like that. For non-convex cluster shapes we can introduce a kernel function, giving the kernel K-means algorithm, a type of kernel clustering method. The main idea of kernel clustering is to map the data points from the input space into a higher-dimensional feature space through a nonlinear mapping and to cluster in that new feature space. The nonlinear mapping increases the chance that the data points become linearly separable, so when the classical clustering algorithm fails, introducing a kernel function can yield more accurate clustering results.
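
A minimal sketch of kernel K-means with an RBF kernel (the function names and the gamma value are illustrative assumptions, not from the original article). The key point is that the squared distance from a mapped point to a cluster mean in feature space can be expressed purely through kernel values, so the mapping never has to be computed explicitly:

import numpy as np

def rbf_kernel(X, gamma=0.5):
    # Pairwise RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_kmeans(X, k, gamma=0.5, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X, gamma)
    labels = rng.integers(0, k, size=len(X)) # random initial assignment
    for _ in range(n_iter):
        dist = np.full((len(X), k), np.inf)
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:
                continue # empty cluster: leave its distances at infinity
            # ||phi(x_i) - mu_c||^2 = K_ii - 2 * mean_j K_ij + mean_{j,l} K_jl  (j, l in cluster c)
            dist[:, c] = np.diag(K) - 2 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels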

3.4. K-Means++
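
K-means++ improves the random initialization of the cluster centers: the first center is picked uniformly at random, and each subsequent center is picked with probability proportional to its squared distance from the nearest center already chosen, so the initial centers tend to be spread out. A minimal sketch (the helper name kmeans_pp_init is my own; it could replace center_init in the class in Section 3.5):

import numpy as np

def kmeans_pp_init(data, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [data[rng.integers(len(data))]] # first center: chosen uniformly at random
    for _ in range(k - 1):
        # Squared distance of every sample to its nearest already-chosen center
        d2 = np.min([np.sum((data - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum() # far-away points are more likely to be picked
        centers.append(data[rng.choice(len(data), p=probs)])
    return np.array(centers, dtype=float)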

3.5. KMeans code

import numpy as np
import matplotlib.pyplot as plt
import random
import math

class KMeans(object):
    def __init__(self, k, data):
        self.k = k # Number of clusters
        self.data = data # Matrix sample

    # Euclidean distance calculation
    def getDistance(self, n1, n2):
        # distance = 0.0
        # for a, b in zip(n1, n2):
        #     distance += math.pow(a - b, 2)
        # return math.sqrt(distance)
        return np.linalg.norm(n1 - n2, ord=2)

    # Cluster center initialization
    def center_init(self):
        idx = random.sample(range(len(self.data)), k=self.k) # Randomly select k sample Subscripts
        self.centers = self.data[idx].astype(float) # k samples are selected as cluster centers (cast to float so mean updates are not truncated to integers)

    def fit(self):
        self.center_init() # Cluster center initialization
        clusterDistance = np.full((len(self.data), 2), -1.0) # Record the nearest cluster center subscript of each sample and the corresponding distance (-1 means not assigned yet)
        
        flag = True # Control the flag of iteration and control the number of clustering iterations
        while(flag):
            flag = False
            # Traverse each sample
            for i in range(len(self.data)):
                minIdx = -1 # Distance minimum cluster center subscript
                minDistance = float('inf') # minimum distance found so far
                for j in range(self.k): # Traverse each cluster and calculate the distance from the sample
                    dist = self.getDistance(self.data[i], self.centers[j]) # Distance from sample i to cluster j
                    if dist < minDistance:
                        minDistance = dist
                        minIdx = j
                if clusterDistance[i][0] != minIdx: # If the cluster to which sample i belongs is not minIdx, the cluster category of sample i has changed
                    flag = True # At this point, you should continue to iterate and update the cluster center
                # Record the minimum distance dist between sample i and the cluster and the subscript j of the corresponding cluster
                clusterDistance[i][0] = minIdx
                clusterDistance[i][1] = minDistance

            # After the cluster of samples is divided, the cluster center is updated with the sample mean
            for i in range(self.k):
                x = self.data[clusterDistance[:, 0] == i] # Take out all samples belonging to cluster i
                if len(x) > 0: # Guard against empty clusters, whose mean would be NaN
                    self.centers[i] = np.mean(x, axis=0) # Take the sample mean as the new cluster center

# Two dimensional sample points
x1 = np.random.randint(0, 50, (50, 2))
x2 = np.random.randint(40, 100, (50, 2))
x3 = np.random.randint(90, 120, (50, 2))
data = np.vstack((x1, x2, x3))

model = KMeans(k=3, data=data)
model.fit()
centers = model.centers # Cluster center

# Visual clustering center
plt.scatter(data[:, 0], data[:, 1], c='b', s=10)
plt.scatter(centers[:, 0], centers[:, 1], c='r', s=30, marker='*')
plt.show()

3.6. ISODATA

ISODATA stands for Iterative Self-Organizing Data Analysis. It addresses the drawback that K has to be fixed manually in advance: for high-dimensional, massive data sets it is often hard to estimate K accurately. Its idea is very intuitive: when a cluster contains too few samples, remove it; when a cluster contains too many samples and is too dispersed, split it into two sub-clusters (see the sketch below).
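
A minimal sketch of those two decisions (the helper name and the thresholds n_min and sigma_max are hypothetical; a full ISODATA implementation also merges clusters whose centers are too close and has several more parameters):

import numpy as np

def isodata_adjust(data, labels, centers, n_min=5, sigma_max=10.0):
    new_centers = []
    for c, center in enumerate(centers):
        members = data[labels == c]
        if len(members) < n_min: # too few samples: drop this cluster
            continue
        sigma = members.std(axis=0)
        if sigma.max() > sigma_max: # too dispersed: split along the widest dimension
            d = sigma.argmax()
            offset = np.zeros(len(center))
            offset[d] = sigma[d]
            new_centers.extend([center + offset, center - offset])
        else:
            new_centers.append(center)
    return np.array(new_centers) # re-run K-means starting from these adjusted centers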

4. Proof of convergence
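
K-means can be viewed as coordinate descent on the within-cluster sum of squares

$$J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2$$

The assignment step minimizes $J$ over the cluster labels with the centers fixed, and the update step minimizes $J$ over the centers with the labels fixed (the mean is the point that minimizes the sum of squared distances to a set of points). Hence $J$ never increases. Since $J$ is bounded below by zero and there are only finitely many ways to partition the samples into K clusters, the algorithm terminates after a finite number of iterations, in general at a local rather than a global optimum.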
