Python machine learning: the K-means clustering algorithm

Summary

The K-means clustering algorithm (also written "k-means") is a simple and classic distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm assumes that a cluster is composed of objects that are close to one another, so its final goal is to obtain compact and well-separated clusters.
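
As a minimal illustration of distance as a similarity measure (the points below are made up for the example), the Euclidean distance between two points can be computed directly with NumPy:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([1.5, 2.5])  # close to a
c = np.array([8.0, 9.0])  # far from a

# Under the K-means view, a smaller distance means a higher similarity
print(np.linalg.norm(a - b))  # ~0.707
print(np.linalg.norm(a - c))  # ~9.899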

Core idea

K-means is an iterative clustering analysis algorithm. It starts by randomly selecting K objects as the initial cluster centers, then computes the distance between each object and each cluster center and assigns each object to the nearest one. A cluster center together with the objects assigned to it forms a cluster. Each time all samples have been assigned, the cluster centers are recalculated from the objects currently in each cluster. This process repeats until a termination condition is met, for example: no (or only a minimal number of) objects are reassigned to a different cluster, no (or only a minimal number of) cluster centers change, or the sum of squared errors reaches a local minimum.

Specific implementation steps of the algorithm

  • First, determine a value of k, i.e. the number of clusters we want to partition the data set into.
  • Randomly select k data points from the data set as the initial centroids.
  • For each point in the data set, calculate its distance to each centroid (for example, the Euclidean distance) and assign it to the set of the nearest centroid.
  • Once all points have been assigned, there are k sets. Recalculate the centroid of each set.
  • If the distance between each newly calculated centroid and its original position is less than a set threshold (indicating that the recalculated centroids move little, are stabilizing, and the algorithm is converging), we can consider that the clustering has reached the expected result and terminate the algorithm.
  • If the new centroids differ greatly from the original centroids, repeat steps 3-5. A compact sketch of this loop follows.
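
The loop described above can be sketched compactly with NumPy. This is a minimal, illustrative implementation (the function name kmeans_sketch, the tolerance tol, and the empty-cluster guard are choices made for this sketch, not part of the original text); a fuller step-by-step version appears in the Code section below:

import numpy as np

def kmeans_sketch(X, k, tol=1e-4, max_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct rows of X as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: distance from every point to every centroid; nearest wins
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster happens to end up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids move less than the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels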

Diagram of algorithm steps

Figure (a) represents the initial data set; assume k = 2. In Figure (b), we randomly select the centroids of the two classes, shown as the red centroid and the blue centroid. We then calculate the distance between every sample point and each of the two centroids, and mark each sample with the category of the centroid closest to it. As shown in Figure (c), after computing the distances to the red and blue centroids we obtain the category of each sample point after the first iteration. Next, we compute new centroids for the points currently marked red and blue; as shown in Figure (d), the positions of the new red and blue centroids have changed. Figures (e) and (f) repeat the process of Figures (c) and (d): mark every point with the category of its nearest centroid, then find new centroids. Finally, we obtain the two categories shown in Figure (f).

Algorithm terminology

Cluster: a collection of data points. The objects within a cluster are similar to one another.
Centroid: the center of all points in a cluster (computed as the mean of those points).
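
For example, the centroid of a cluster is simply the per-coordinate mean of its points (the points below are made up for illustration):

import numpy as np

points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 0.0]])
centroid = points.mean(axis=0)  # average each coordinate over the cluster
print(centroid)  # [3. 2.]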

Advantages and disadvantages of the algorithm

Advantages:

  • The principle is simple, the implementation is easy, and convergence is fast.
  • When the resulting clusters are dense and the differences between clusters are obvious, it works well.
  • The only main parameter to tune is the number of clusters, k.

Disadvantages:

  • The value of K must be given in advance, and in many cases it is very difficult to estimate.
  • The K-means algorithm is sensitive to the initially selected centroids: different random seeds can produce completely different clustering results, which has a great impact on the outcome.
  • It is sensitive to noise and abnormal points (though this sensitivity can also be used to detect outliers).
  • Because it uses an iterative method, it is only guaranteed to find a locally optimal solution, not the global optimum. Two common mitigations are sketched below.
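
Two common mitigations for the first two disadvantages (standard practice, though not discussed in the original text) are to run K-means several times from different random seeds and keep the run with the lowest sum of squared errors, and to scan a range of k values looking for an "elbow" where the error stops dropping sharply. A minimal sketch, reusing the kmeans_sketch function from above on synthetic data:

import numpy as np

# Synthetic 2-D data with three blobs, just for this illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.15, size=(20, 2))
               for c in ([0, 0], [2, 2], [0, 2])])

def sse(X, centroids, labels):
    # Sum of squared distances from each point to its assigned centroid
    return float(np.sum((X - centroids[labels]) ** 2))

# Scan k with 10 random restarts each; the "elbow" where the error
# stops dropping sharply suggests a good k (here it should be near 3)
for k in range(1, 7):
    runs = [kmeans_sketch(X, k, seed=s) for s in range(10)]
    best_c, best_l = min(runs, key=lambda r: sse(X, r[0], r[1]))
    print(k, round(sse(X, best_c, best_l), 3))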

Code

# Test routine
import numpy as np
import matplotlib.pyplot as plt

# Loading data
def loadDataSet(fileName):
    data = np.loadtxt(fileName,delimiter='\t')
    return data

# Euclidean distance calculation
def distEclud(x,y):
    return np.sqrt(np.sum((x-y)**2))  # Calculate Euclidean distance

# Build a set of k random centroids for the given data set
def randCent(dataSet, k):
    m, n = dataSet.shape
    # Sample k distinct row indices so that no two initial centroids coincide
    indices = np.random.choice(m, k, replace=False)
    return dataSet[indices, :].copy()

# K-means clustering
def KMeans(dataSet, k):

    m = np.shape(dataSet)[0]  # number of samples (rows)
    # Column 0: index of the cluster each sample belongs to
    # Column 1: squared error (squared distance to that cluster's centroid)
    clusterAssment = np.zeros((m, 2))
    clusterChange = True

    # Step 1: initialize the centroids
    centroids = randCent(dataSet, k)
    while clusterChange:
        clusterChange = False

        # Traverse all samples (rows)
        for i in range(m):
            minDist = np.inf
            minIndex = -1

            # Step 2: traverse all centroids and find the nearest one
            for j in range(k):
                # Euclidean distance from the sample to this centroid
                distance = distEclud(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            # Step 3: update the cluster assignment of this sample
            if clusterAssment[i, 0] != minIndex:
                clusterChange = True
                clusterAssment[i, :] = minIndex, minDist ** 2
        # Step 4: update the centroids
        for j in range(k):
            pointsInCluster = dataSet[clusterAssment[:, 0] == j]  # all points assigned to cluster j
            if len(pointsInCluster) > 0:  # guard against an empty cluster
                centroids[j, :] = np.mean(pointsInCluster, axis=0)  # column-wise mean

    print("Congratulations, cluster complete!")
    return centroids, clusterAssment

def showCluster(dataSet,k,centroids,clusterAssment):
    m,n = dataSet.shape
    if n != 2:
        print("Data is not two-dimensional")
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("k The value is too large.")
        return 1

    # Draw all samples
    for i in range(m):
        markIndex = int(clusterAssment[i,0])
        plt.plot(dataSet[i,0],dataSet[i,1],mark[markIndex])

    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # Drawing centroid
    for i in range(k):
        plt.plot(centroids[i,0],centroids[i,1],mark[i])

    plt.show()

# Run the clustering on the sample data set
dataSet = loadDataSet("K-means_data.txt")
k = 3
centroids, clusterAssment = KMeans(dataSet, k)
showCluster(dataSet, k, centroids, clusterAssment)
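
For comparison, the same data can be clustered with scikit-learn's KMeans class (an alternative to the routine above, assuming scikit-learn is installed; not part of the original code):

from sklearn.cluster import KMeans as SKKMeans

data = loadDataSet("K-means_data.txt")
# n_init=10 runs the algorithm 10 times from different random centroids
# and keeps the run with the lowest sum of squared errors
km = SKKMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_)  # final centroids
print(km.labels_)           # cluster index of each sample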

#K-means_data.txt data set
-1.26	0.46
-1.15	0.49
-1.19	0.36
-1.33	0.28
-1.06	0.22
-1.27	0.03
-1.28	0.15
-1.06	0.08
-1.00	0.38
-0.44	0.29
-0.37	0.45
-0.22	0.36
-0.34	0.18
-0.42	0.06
-0.11	0.12
-0.17	0.32
-0.27	0.08
-0.49	-0.34
-0.39	-0.28
-0.40	-0.45
-0.15	-0.33
-0.15	-0.21
-0.33	-0.30
-0.23	-0.45
-0.27	-0.59
-0.61	-0.65
-0.61	-0.53
-0.52	-0.53
-0.42	-0.56
-1.39	-0.26