This article is reposted from the Internet and is kept only as a personal study note.
K-means is the most commonly used clustering algorithm based on Euclidean distance: it assumes that the closer two points are to each other, the more similar they are.
1.1. Algorithm steps
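The body of this section is missing from the source, so here is a hedged sketch of the standard K-means (Lloyd's) iteration: (1) pick K samples as initial centers, (2) assign each sample to its nearest center, (3) move each center to the mean of its assigned samples, (4) repeat until the assignments stop changing. Function and variable names below are illustrative, not from the original:

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct samples as the initial centers
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 4: assignments stopped changing
        labels = new_labels
        # Step 3: move each center to the mean of its assigned samples
        for j in range(k):
            if np.any(labels == j):  # guard against empty clusters
                centers[j] = data[labels == j].mean(axis=0)
    return centers, labels
```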
2. Advantages and disadvantages
Advantages:

- It is easy to understand and the clustering results are good; although it only reaches a local optimum, a local optimum is often sufficient in practice;
- The algorithm scales well to large data sets;
- It works very well when the clusters are approximately Gaussian;
- Its computational complexity is low.

Disadvantages:

- The value of K must be set manually, and different K values give different results;
- It is sensitive to the initial cluster centers, and different initialization methods give different results;
- It is sensitive to outliers;
- Each sample can be assigned to only one cluster, so it is not suitable for multi-category tasks;
- It is not suitable for clusters that are highly scattered, severely imbalanced in size, or non-convex in shape.
3. Algorithm tuning & Improvement
3.1. Data preprocessing
K-means is essentially a data-partitioning algorithm based on Euclidean distance, so dimensions with large means and variances have a dominant effect on the clustering. Data that has not been normalized to a common scale and unit therefore cannot be used directly in the distance computations. Common preprocessing methods include data normalization and data standardization.
In addition, outliers and noisy data strongly affect the mean, shifting the cluster centers, so outlier detection is also needed before clustering.
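As an illustration, the two common preprocessing transforms mentioned above can be written directly in NumPy, applied per feature (column). The function names are my own:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each feature (column) to the range [0, 1]."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def z_score_standardize(X):
    """Shift each feature (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# The second feature's scale (hundreds) would otherwise dominate the distances:
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
print(min_max_normalize(X))    # each column now spans [0, 1]
print(z_score_standardize(X))  # each column now has mean 0, std 1
```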
3.2. Reasonable selection of K value
The choice of K has a great impact on K-means and is also its biggest drawback. Common methods for choosing K include:

- Elbow method
- Gap statistic
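A hedged sketch of the elbow method: run K-means for several values of K, record the within-cluster sum of squared distances (inertia), and pick the K at which the curve bends. The `run_kmeans` helper below is a minimal implementation written only for this illustration:

```python
import numpy as np

def run_kmeans(data, k, max_iter=100, seed=0):
    """Minimal K-means, returning the within-cluster sum of squares (inertia)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return ((data - centers[labels]) ** 2).sum()

# Synthetic data with three well-separated groups
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(m, 1.0, (40, 2)) for m in (0, 8, 16)])
inertias = {k: run_kmeans(data, k) for k in range(1, 7)}
# Plotting k against inertias[k], the curve drops steeply up to k = 3
# and then flattens: that bend ("elbow") suggests choosing K = 3.
```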
3.3. Using kernel function
Euclidean-distance-based K-means assumes that each cluster has the same prior probability and is roughly spherical, but such distributions are not common in real data. For non-convex data distributions, we can introduce a kernel function; the resulting algorithm is called kernel K-means, a kind of kernel clustering method. The main idea of kernel clustering is to map the input data points into a higher-dimensional feature space through a nonlinear mapping and to cluster in that new feature space. The nonlinear mapping increases the probability that the data points become linearly separable, so when the classical clustering algorithm fails, introducing a kernel function can yield more accurate clustering results.
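A minimal sketch of kernel K-means, assuming an RBF kernel (the function names and the random-assignment initialization are my own choices). The trick is that the squared distance to a cluster mean in feature space can be computed purely from kernel entries: ||phi(x_i) - mu_c||^2 = K_ii - 2 * mean_{j in c} K_ij + mean_{j,l in c} K_jl, so the mapping phi never has to be evaluated explicitly:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Cluster using feature-space distances computed only through K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)  # random initial assignment
    for _ in range(max_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            mask = labels == c
            if not mask.any():
                dist[:, c] = np.inf  # empty cluster: never assign to it
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - 2*mean_j K_ij + mean_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, mask].mean(axis=1)
                          + K[np.ix_(mask, mask)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```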
3.5. KMeans code
```python
import numpy as np
import matplotlib.pyplot as plt
import random
import math

class KMeans(object):
    def __init__(self, k, data):
        self.k = k        # number of clusters
        self.data = data  # sample matrix

    # Euclidean distance
    def getDistance(self, n1, n2):
        return np.linalg.norm(n1 - n2, ord=2)

    # Cluster-center initialization
    def center_init(self):
        idx = random.sample(range(len(self.data)), k=self.k)  # k random sample indices
        # Use float so that the mean updates are not truncated to integers
        self.centers = self.data[idx].astype(float)

    def fit(self):
        self.center_init()
        # Column 0: index of each sample's nearest center; column 1: that distance
        clusterDistance = np.zeros((len(self.data), 2))
        flag = True  # keep iterating while any sample changes cluster
        while flag:
            flag = False
            for i in range(len(self.data)):
                minIdx = -1             # index of the nearest center
                minDistance = math.inf  # smallest distance so far
                for j in range(self.k):
                    # Distance from sample i to center j
                    dist = self.getDistance(self.data[i], self.centers[j])
                    if dist < minDistance:
                        minDistance = dist
                        minIdx = j
                if clusterDistance[i, 0] != minIdx:
                    # Sample i changed cluster, so keep iterating
                    flag = True
                clusterDistance[i, 0] = minIdx
                clusterDistance[i, 1] = minDistance
            # After reassignment, update each center as the mean of its samples
            for i in range(self.k):
                x = self.data[clusterDistance[:, 0] == i]
                if len(x) > 0:  # guard against empty clusters
                    self.centers[i] = np.mean(x, axis=0)

# Two-dimensional sample points
x1 = np.random.randint(0, 50, (50, 2))
x2 = np.random.randint(40, 100, (50, 2))
x3 = np.random.randint(90, 120, (50, 2))
data = np.vstack((x1, x2, x3))
model = KMeans(k=3, data=data)
model.fit()
centers = model.centers  # cluster centers
# Visualize the samples and the cluster centers
plt.scatter(data[:, 0], data[:, 1], c='b', s=10)
plt.scatter(centers[:, 0], centers[:, 1], c='r', s=30, marker='*')
plt.show()
```
3.6. ISODATA

The full name of ISODATA is Iterative Self-Organizing Data Analysis. It addresses the drawback that K must be fixed manually in advance: for high-dimensional, massive data sets, it is often difficult to estimate K accurately. ISODATA improves on this with a very intuitive idea: when a cluster contains too few samples, remove that cluster; when a cluster contains too many samples and is too dispersed, split it into two sub-clusters.
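The split-and-merge idea can be sketched as follows. This is only an illustration of the remove/split rules described above, not the full ISODATA specification; the thresholds `min_size` and `max_std` and the split-along-the-widest-dimension rule are my own assumptions:

```python
import numpy as np

def isodata_step(data, centers, min_size=5, max_std=2.0):
    """One ISODATA-style pass: drop tiny clusters, split dispersed ones."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centers = []
    for j in range(len(centers)):
        members = data[labels == j]
        if len(members) < min_size:
            continue  # too few samples: discard this cluster
        std = members.std(axis=0)
        if std.max() > max_std and len(members) >= 2 * min_size:
            # Too dispersed: split along the widest dimension into two centers
            d = std.argmax()
            mean = members.mean(axis=0)
            for sign in (-1.0, 1.0):
                c = mean.copy()
                c[d] += sign * std[d]
                new_centers.append(c)
        else:
            new_centers.append(members.mean(axis=0))
    return np.array(new_centers)
```

In a full ISODATA run this step would be interleaved with ordinary K-means updates until the number of clusters stabilizes.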
4. Proof of convergence
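The body of this section is missing from the source; as a hedged sketch, the standard textbook argument goes as follows. K-means minimizes the objective

```latex
J(c,\mu) = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^{2}
```

In the assignment step, with the centers $\mu$ fixed, choosing $c_i \leftarrow \arg\min_j \lVert x_i - \mu_j \rVert^2$ can only decrease (or keep) $J$. In the update step, with the assignments $c$ fixed, setting $\mu_j \leftarrow \frac{1}{|C_j|}\sum_{i \in C_j} x_i$ can also only decrease $J$, because the mean is the point minimizing the sum of squared distances to the samples of a cluster. Since $J \ge 0$ and there are only finitely many ways to partition $n$ samples into $K$ clusters, $J$ is non-increasing and bounded below, so the iteration converges, though only to a local (not necessarily global) optimum.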