Implementing the K-means++ algorithm from scratch and using it to find data outliers

Preface

In this article we screen for outliers using all dimensions of the data at once. We implement the methods ourselves rather than calling ready-made modules, and finally display the results visually.

Time flies and writing is not easy. If you are passing by and find this useful, I hope you will like, bookmark and follow. Thank you!

1: Introduction to clustering algorithms

General introduction to clustering algorithms

Unlike classification, clustering groups samples by data similarity without any predefined categories. Depending on how much label information the original samples carry, it can be divided into semi-supervised clustering and unsupervised clustering.

  1. Semi-supervised clustering: make full use of the known label information to classify the samples whose labels are unknown
  2. Unsupervised clustering: group samples by similarity without any known label information

Common clustering methods are:

  1. Partition clustering: K-Means(++), K-Medoids, CLARANS, etc.
  2. Hierarchical clustering: BIRCH, CURE, agglomerative clustering, etc.
  3. Density-based methods: DBSCAN, DENCLUE, OPTICS, etc.
  4. Grid-based methods: STING, CLIQUE, etc.
  5. Other methods: statistics, deep learning, etc.

Of course, there are many methods. If you can use and understand several of them in practice, you can solve many problems.

There are many ways to evaluate a clustering result, of which the simplest and most practical is the purity method. There are also more advanced measures, such as the silhouette coefficient, the elbow method and the Calinski-Harabasz index.

  1. The purity method only calculates the proportion of correctly clustered samples in the total number, i.e.
    $purity(X,Y)=\frac{1}{n}\sum_{i=1}^{k}|x_i\cap y_i|$, where
    $x=(x_1,x_2,\ldots,x_k)$ is the set of clusters produced by the algorithm, $y=(y_1,y_2,\ldots,y_k)$
    represents the original data set to be clustered, grouped by its true classes, and $n$ is the total number of clustered objects.

  2. Some more advanced evaluation methods will be introduced later and are not described here.
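
As a rough illustration of the purity calculation, here is a minimal sketch. The function name `purity` and the toy labels are made up for this example. It also assumes the common convention that each cluster is matched to the true class it overlaps most, whereas the formula above simply pairs $x_i$ with $y_i$.

```python
import numpy as np

def purity(cluster_labels, true_labels):
    """Fraction of samples that fall in the majority true class of their cluster."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        # size of the largest overlap between this cluster and any true class
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()
    return correct / len(true_labels)

# toy data: cluster 0 holds two "a" samples, cluster 1 holds two "b" and one "a"
example_purity = purity([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"])
print(example_purity)
```

Four of the five toy samples sit in the majority class of their cluster, so the purity here is 4/5.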

  • Implementing K-means from scratch
    This article finds several cluster centers in the data with the K-means++ algorithm and then uses them to find outliers, so we first briefly review the underlying principle of the K-means algorithm.
  1. Set the number of clusters k and randomly select k initial points as the cluster centers.
  2. Define the distance between each sample and each center (each sample must have the same dimension as each center so that matrix operations can be performed), and assign each sample to the class of its nearest center.
  3. Recalculate the center of each class and go back to step 2, until the class of every sample no longer changes (we can also set a maximum number of iterations at the start as an exit condition).
  4. Output the final centers and the members of each class.
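
The four steps above can also be condensed into a short vectorized loop. This is only a sketch under assumptions: the name `kmeans_sketch`, the `max_iter` parameter and the use of Euclidean distance are choices made for this example, not taken from the article's own implementation.

```python
import numpy as np

def kmeans_sketch(X, centers, max_iter=100):
    X = np.asarray(X, dtype=float)
    centers = np.array(centers, dtype=float)  # copy, so the caller's array is untouched
    labels = None
    for _ in range(max_iter):
        # step 2: distance from every sample to every center, shape (m, K)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # stop rule of step 3: no sample changed class
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 3: move each non-empty cluster's center to the mean of its members
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# toy data: two well-separated pairs of points, one initial center in each pair
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers_demo, labels_demo = kmeans_sketch(X_demo, X_demo[[0, 2]])
```

On this toy data the loop stops after the second pass, because the assignments no longer change.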

According to the above steps, we implement the K-means algorithm:

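  1. Define the distance function between a sample and a center point
import numpy as np
import pandas as pd  # used later to print the results

def distance(vA, vB):
    # vA: 1-D sample array; vB: 1 x n matrix row holding one cluster center
    d_t = vA - np.array(vB)[0]
    # Minkowski distance of order 4
    dist = np.power(np.sum(np.power(d_t, 4)), 1 / 4)
    # alternative formula: squared Euclidean distance
    # dist = np.dot((vA - vB), (vA - vB).T)[0, 0]
    return dist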
  2. Define the function that randomly initializes the cluster centers
def init_randCent(data, K):
    n = np.shape(data)[1]  # number of attributes
    cent_value = np.mat(np.zeros((K, n)))
    for j in range(n):
        min_t = np.min(data[:, j])
        range_t = np.max(data[:, j]) - min_t
        # random initialization between the minimum and maximum of attribute j
        cent_value[:, j] = min_t * np.mat(np.ones((K, 1))) + np.random.rand(K, 1) * range_t
    return cent_value
  3. The main logic, written here as a recursive function
def K_means(data, K, cent_value):  # K is the number of clusters we set
    m, n = np.shape(data)
    # column 0: cluster index, column 1: distance to that cluster's center;
    # by default every sample starts in class 0 with distance 0
    subcenter = np.mat(np.zeros((m, 2)))
    labels = ['Category %d' % (c + 1) for c in range(K)]

    def main_process(tag, count, ls_r):  # recursive function: one call per iteration
        if not tag:  # no sample changed class in the last pass: report and stop
            df1 = pd.DataFrame(subcenter, columns=['Clustering category', 'Distance from center(0 Represents the initialization distance, which has never changed for this class)'])
            df2 = pd.DataFrame(cent_value, index=labels)
            ls_rr = [round(r / sum(ls_r), 3) for r in ls_r]
            df3 = pd.DataFrame([ls_r, ls_rr], columns=labels, index=['Number of samples', 'Proportion%']).T
            df4 = pd.concat([df2, df3], axis=1)
            print(df1)
            print(df4)
            print('\033[1;38m Total iterations:%s\033[0;m' % count)
            return 'END!!'
        ls_r = []
        tag = False
        count = count + 1
        for i in range(m):
            minDist = np.inf
            minIndex = 0
            for j in range(K):
                # distance between sample i and cluster center j
                dist = distance(data[i, :], cent_value[j, :])
                if dist < minDist:
                    minDist = dist
                    minIndex = j
            # reassign the sample only if its nearest center changed
            if subcenter[i, 0] != minIndex:
                subcenter[i, :] = np.mat([minIndex, minDist])
                tag = True
        # recalculate each cluster center as the mean of its members
        for j in range(K):
            sum_all = np.mat(np.zeros((1, n)))
            r = 0
            for i in range(m):
                if subcenter[i, 0] == j:  # sample i belongs to class j
                    sum_all = sum_all + data[i, :]
                    r = r + 1
            if r > 0:
                cent_value[j, :] = sum_all / r
            else:
                # empty class: keep the old center (NumPy returns nan here instead of raising ZeroDivisionError)
                print('r is zero!')
            ls_r.append(r)
        return main_process(tag, count, ls_r)

    main_process(True, 0, [])  # call the recursive function

K_means(data, 3, init_randCent(data, 3))

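PS: I have debugged the code above, so it can be copied, pasted and used directly. Note that the functions index the input as data[i, :], so pass the data as a 2-D NumPy array (for example the values of a DataFrame) rather than the DataFrame itself.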

Based on the data in the previous article, we test the results:
Using the advanced 3-sigma criterion to deal with practical problems

     Clustering category  Distance from center (0 represents the initialization distance, which has never changed for this class)
0     2.0                      0.434378
1     2.0                      0.301428
2     1.0                      0.192687
3     2.0                      0.427841
4     2.0                      0.233552
..    ...                           ...
935   2.0                      0.347604
936   2.0                      0.334781
937   2.0                      0.299281
938   2.0                      0.378599
939   2.0                      0.433147

[940 rows x 2 columns]
                   0         1         2  Number of samples  Proportion%
Category 1  0.505249  0.272886  0.258199               37.0        0.039
Category 2  0.120483  0.553747  0.151894              325.0        0.346
Category 3  0.125011  0.123172  0.115365              578.0        0.615
Total iterations: 10

We can standardize the data before clustering to reduce differences caused by orders of magnitude; otherwise an attribute with very large values would dominate the distance calculation. Min-max standardization can be used, i.e. $(V-\min(V))/(\max(V)-\min(V))$.
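
A minimal sketch of min-max standardization applied column by column follows. The function name `min_max_scale` and the handling of constant columns are assumptions made for this example.

```python
import numpy as np

def min_max_scale(X):
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # assumption: a constant column (range 0) is mapped to 0 instead of dividing by zero
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# the second attribute is two orders of magnitude larger than the first
scaled = min_max_scale([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
print(scaled)
```

After scaling, both attributes lie in [0, 1], so neither one decides the distance on its own.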

The columns 0, 1 and 2 to the right of the category labels in the result above are the final cluster centers of the data set. From the code we know that the initial centers are drawn at random each time, so the clustering results differ from run to run, and the iteration may settle in a local optimum. Even though the data is standardized, some clustering results are still lopsided. In this case a Monte Carlo approach, repeating the run and taking the expectation, can be used; we will not go into it here.

2: Implementing the K-means++ algorithm

Next, we explain and implement the K-means++ algorithm from scratch, and finally use it for systematic outlier filtering.

  • Understanding the K-means++ algorithm from the ground up

The K-means++ algorithm mainly concerns how the initial centers are chosen. Outliers aside, we want the initial centers to be as far apart from each other as possible. The specific steps are as follows:

  1. First, fix the number of cluster centers k and randomly select the first cluster center from the input data.
  2. For each point in the data set, calculate its distance to the nearest existing cluster center; we denote it $D(x_{exist})$.
  3. Then select a new data point as the next cluster center, such that a point with a larger $D(x_{exist})$ is more likely to be selected.
  4. Repeat steps 2 and 3 until $k$ centers have been selected; the remaining operations are the same as in ordinary K-means.

Note that outliers affect the selection of new cluster centers, because the $D(x_{exist})$ of an outlier is very large. There are two ways to deal with this: 1. Handle outliers before modeling, which is no use if the purpose of clustering is to find the outliers in the first place; 2. In the set of distances $D(x)=\{D(x_{1exist}),D(x_{2exist}),\ldots,D(x_{nexist})\}$ ($n$ is the number of data points), do not simply pick the point with the maximum distance, but pick a point with a "larger" distance. "Larger" is decided by an "area probability" idea: if the distances are laid end to end as segments of one line, a random point on that line is more likely to fall inside the segment of a point whose $D(x_{iexist})$ is large.
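
The "area probability" idea is what is usually called roulette-wheel selection. Here is a small sketch of it in isolation; the function name `pick_by_distance`, the toy distances and the seeded generator are made up for this example. It weights by the plain distance $D$, as described above, whereas the original K-means++ formulation weights by $D^2$.

```python
import numpy as np

def pick_by_distance(d, rng):
    """Return an index drawn with probability d[i] / sum(d)."""
    d = np.asarray(d, dtype=float)
    edges = np.cumsum(d)          # right edge of each point's segment on the line
    r = rng.random() * edges[-1]  # random position on the joined segments
    return int(np.searchsorted(edges, r, side="right"))

rng = np.random.default_rng(0)
d_demo = [0.0, 1.0, 1.0, 8.0]  # point 0 is already a center; point 3 is far away
draws = [pick_by_distance(d_demo, rng) for _ in range(2000)]
share_of_far_point = draws.count(3) / len(draws)
print(share_of_far_point)
```

The far point owns 8 of the 10 units of length, so it is drawn in roughly 80% of the trials, while the closer points still have a chance. That is the sense in which a "larger" point is preferred without always taking the maximum.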

We implement the above process in code

init_value = np.inf  # start from infinity so that any real distance is smaller
def nest_dist(point, clc):
    # distance from a sample to its nearest already-chosen cluster center
    min_dist = init_value
    m = np.shape(clc)[0]  # number of cluster centers chosen so far
    for i in range(m):
        d = distance(point, clc[i, :])  # distance to cluster center i
        if min_dist > d:  # keep the shortest distance
            min_dist = d
    return min_dist

def get_cent_value(data, K):
    m, n = np.shape(data)  # m is the number of samples
    clc = np.mat(np.zeros((K, n)))
    # randomly select a sample as the first cluster center
    index = np.random.randint(0, m)
    clc[0, :] = np.copy(data[index, :])
    # shortest distance from each sample to the chosen centers
    d = [0 for i in range(m)]
    for i in range(1, K):
        sum_all = 0
        for j in range(m):
            # distance between sample j and its nearest chosen cluster center
            d[j] = nest_dist(data[j, :], clc[0:i, :])
            # add up all the shortest distances
            sum_all = sum_all + d[j]
        # random position on the line formed by joining all distances end to end
        sum_all = sum_all * np.random.random()
        # walk along the line; the sample whose segment contains the position becomes the next center
        for j, dist in enumerate(d):
            sum_all = sum_all - dist
            if sum_all > 0:
                continue
            else:
                clc[i] = np.copy(data[j, :])
                break
    return clc
K_means(data, 3, get_cent_value(data, 3))

Let's look at the running results:

     Clustering category  Distance from center (0 represents the initialization distance, which has never changed for this class)
0     2.0                      0.127224
1     2.0                      0.172709
2     0.0                      0.000000
3     1.0                      0.208919
4     1.0                      0.157283
..    ...                           ...
935   2.0                      0.136519
936   2.0                      0.137244
937   1.0                      0.155543
938   2.0                      0.194231
939   2.0                      0.126017

[940 rows x 2 columns]
                   0         1         2  Number of samples  Proportion%
Category 1  0.118565  0.621701  0.162584              231.0        0.246
Category 2  0.188617  0.313694  0.151047              265.0        0.282
Category 3  0.133774  0.077739  0.108143              444.0        0.472
Total iterations: 14

As before, we run the K-means++ algorithm on the standardized data. The clusters are relatively balanced in size, and the results of multiple runs are relatively stable. This is mainly because K-means++ improves the initial centers at the very start of clustering, placing them as far apart as possible, instead of leaving the outcome largely to chance and ending in a local optimum.

3: Outlier filtering based on the K-means++ algorithm

  • Having implemented the K-means++ algorithm from scratch, we should of course put it to use

Building on the previous article, we only need to slightly modify the row-wise p-norm calculation and put it in a for loop, because we now have to compute the norm of each class's data relative to that class's center and finally merge the results of all classes. The code is relatively simple, so we will not describe it in detail here. The number of clusters we choose is 3.
The advanced version of the 3-sigma criterion is used to deal with the actual data and the advanced processing of outliers
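
The previous article's code is not reproduced here, so the following is only a sketch of what such a per-class loop could look like. Everything in it is an assumption: the function name `flag_outliers`, the use of Euclidean distance to the class center, and the threshold of mean plus three standard deviations of the distances within each class.

```python
import numpy as np

def flag_outliers(X, labels, centers, n_sigma=3.0):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    is_outlier = np.zeros(len(X), dtype=bool)
    for j in range(len(centers)):  # one pass per class
        idx = np.where(labels == j)[0]
        if idx.size == 0:
            continue
        # distance of every member of class j to the center of class j
        dist = np.linalg.norm(X[idx] - centers[j], axis=1)
        threshold = dist.mean() + n_sigma * dist.std()
        is_outlier[idx] = dist > threshold  # merge this class's result into the whole
    return is_outlier

# toy data: class 0 has twenty ordinary points and one far point, class 1 has three points
cluster_a = np.array([[1.0, 0.0], [-1.0, 0.0]] * 10 + [[50.0, 0.0]])
cluster_b = np.array([[100.0, 100.0], [101.0, 100.0], [100.0, 101.0]])
X_demo = np.vstack([cluster_a, cluster_b])
labels_demo = np.array([0] * len(cluster_a) + [1] * len(cluster_b))
centers_demo = np.array([cluster_a.mean(axis=0), cluster_b.mean(axis=0)])
flags = flag_outliers(X_demo, labels_demo, centers_demo)
print(np.where(flags)[0])
```

Because the threshold is computed separately inside each class, a point is judged against its own cluster rather than against the data set as a whole.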

4: Summary

  • Starting from the underlying principle of the algorithm, this article implements the K-means++ algorithm and finally applies it to the screening of outliers. In theory, the K-means++ algorithm is better than the ordinary K-means algorithm.
  • Nevertheless, one important problem remains unsolved: when using a clustering algorithm (whether hierarchical, partition-based or otherwise), we do not know in advance how many classes are best. The number of clusters is generally determined after the fact, for example with the elbow method or the CH value. In addition, to guard against local optima, a Monte Carlo approach of taking the expectation over repeated runs can be used. These topics will be covered in the next article.

Tags: Python Algorithm Machine Learning Data Mining

Posted on Sat, 25 Sep 2021 07:42:00 -0400 by dgudema