Clustering algorithm learning

Algorithm steps
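The steps themselves are not spelled out here. As a minimal sketch, assuming the standard (Lloyd-style) K-means procedure, they can be written in NumPy as follows. The function name `kmeans_sketch`, its parameters, and the three toy points are hypothetical, chosen only for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Hypothetical helper: plain Lloyd-style K-means on an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of the samples assigned to it
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy data: one isolated point and two points close together.
X = np.array([[1.0, 1.0], [5.0, 5.0], [6.0, 6.0]])
centers, labels = kmeans_sketch(X, k=2)
print(centers, labels)
```

On this toy data the two nearby points end up in the same cluster whichever samples are drawn as initial centers; on real data the result can depend on the initialization.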

Euclidean distance
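For reference, the Euclidean distance between two points x and y with d features (the symbols are chosen here for illustration) is:

```latex
d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}
```

For example, the distance between (1, 1) and (5, 5) is \sqrt{4^2 + 4^2} = \sqrt{32} \approx 5.66.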

Advantages and disadvantages of K-means

Advantages:
1. The algorithm is fast and simple;
2. High efficiency and scalability for large data sets;
3. The time complexity is nearly linear, which makes it suitable for mining large-scale data sets. The time complexity of the K-means clustering algorithm is O(n × k × t), where n is the number of objects in the data set, t is the number of iterations of the algorithm, and k is the number of clusters.

Disadvantages:
1. In the K-means algorithm, K must be given in advance, but a suitable value of K is very difficult to estimate.
2. The K-means algorithm first determines an initial partition from the initial cluster centers and then optimizes that partition. The choice of initial cluster centers has a great impact on the clustering result: if the initial values are chosen badly, the algorithm may fail to produce an effective clustering. This is one of the main problems of K-means.
3. The algorithm repeatedly reassigns samples and recomputes the cluster centers, so when the amount of data is very large, its time overhead is also very large.

    from sklearn import datasets
    from sklearn.cluster import k_means
    import matplotlib.pyplot as plt
    import numpy as np

    # Optional: explore the iris data set instead of the toy data below.
    # iris = datasets.load_iris()
    # print(dir(iris))
    # print(iris.target_names)
    # X = iris.data
    # y = iris.target
    # print("X's feature names", iris.feature_names)
    # print("X's size", X.shape, "y's size", y.shape)

    fig1, (ax1, ax2) = plt.subplots(2, 1, sharex=True)

    # Toy data: three points with their true labels.
    X = np.array([[1, 1], [5, 5], [6, 6]])
    y = np.array([0, 1, 1])
    ax1.scatter(X[:, 0], X[:, 1], c=y)

    # With init='random' and n_init=1, k_means first centers the data
    # (X -= X.mean(axis=0)) and then picks samples as the initial centroids
    # according to the random seed.
    pred = k_means(X, n_clusters=2, init='random', random_state=2, max_iter=10, n_init=1)
    centroid, label, inertia = pred
    print(pred)

    # Plot the same points again, colored by the predicted cluster label.
    ax2.scatter(X[:, 0], X[:, 1], c=label)
    plt.show()

sklearn.cluster.KMeans parameter introduction

n_clusters: int. The number of clusters to generate. The default is 8.
max_iter: int. The maximum number of iterations for a single run of the k-means algorithm. The default is 300.
n_init: int. The number of times the algorithm is run with different initial cluster centers; the final result is the run that is best in terms of inertia. The default is 10 (newer scikit-learn releases default to 'auto').
init: one of three options: 'k-means++', 'random', or an ndarray. The default is 'k-means++'.
    1) 'k-means++' selects the initial centroids with a special method that speeds up the convergence of the iterations.
    2) 'random' selects the initial centroids at random from the training data.
    3) An ndarray of shape (n_clusters, n_features) gives the initial centroids directly.
tol: float, default 1e-4. Used together with inertia to determine the convergence condition.
n_jobs: int. The number of processes used for the computation; internally, the n_init runs are computed in parallel. Newer scikit-learn releases no longer accept this parameter.
    (1) If the value is -1, all CPUs are used. If the value is 1, nothing runs in parallel, which is convenient for debugging.
    (2) If the value is less than -1, the number of CPUs used is (n_cpus + 1 + n_jobs). So n_jobs = -2 uses all CPUs but one.
random_state: int or numpy.RandomState, optional. The generator used to initialize the centroids. If the value is an integer, it is used as the seed. By default the numpy random number generator is used.

Main attributes

cluster_centers_: the cluster centers.
labels_: the cluster to which each sample belongs.
inertia_: the sum of squared distances from each sample to its cluster center, used to evaluate whether the number of clusters is appropriate. The smaller the value, the tighter the clusters; the number of clusters at the critical point (the elbow) is selected.
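The attributes listed above belong to the `KMeans` estimator class rather than to the `k_means` function. As a minimal sketch, reusing the same three toy points, they can be read off a fitted model like this:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [5, 5], [6, 6]])

# n_init is set explicitly because its default differs between scikit-learn releases.
model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)

print("cluster centers:", model.cluster_centers_)  # one row per cluster
print("labels:", model.labels_)                    # cluster index of each sample
print("inertia:", model.inertia_)                  # sum of squared distances to centers
```

For this data the two nearby points share a cluster with center (5.5, 5.5), so the inertia is 4 × 0.5² = 1.0.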

2. Hierarchical clustering

Source: https://www.cnblogs.com/zongfa/p/9344769.html

Watermelon book: https://zhuanlan.zhihu.com/p/70414047

The merging (agglomerative) algorithm of hierarchical clustering computes the similarity between every pair of data points, merges the two most similar ones, and repeats this process. In short, similarity is determined by the distance between data points or between categories of data points: the smaller the distance, the higher the similarity. At each step the two nearest data points or categories are merged, which eventually produces a clustering tree.

Calculation of similarity

Hierarchical clustering uses Euclidean distance to calculate the distance (similarity) between different categories of data points.

Compute the Euclidean distance between every pair of data points (the distance matrix).

After merging data point B and data point C, recalculate the distance matrix between all data points. Distances between individual data points are calculated as before. What needs explaining is the distance between the combined data point (B,C) and the other data points: to calculate the distance from (B,C) to A, we take the average of the distance from B to A and the distance from C to A.

After calculation, the distance from data point D to data point E is the smallest of all distance values, at 1.20. This means that among all current data points (including combined data points), D and E have the highest similarity. Therefore, we merge data point D and data point E, and then recalculate the distances to the other data points.

The remaining work is to repeatedly calculate the distances between data points, and between data points and combined data points. This step should really be done by a program. Here, because the amount of data is small, we calculate by hand and list the distance results and the merge made at each step.
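As a hedged sketch of letting a program do this work, SciPy's `scipy.cluster.hierarchy.linkage` performs the repeated merge-and-recompute loop. The six 2-D points below are made-up values for illustration, not the data behind the distances quoted above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for six points named A..F (not the article's data).
names = ["A", "B", "C", "D", "E", "F"]
X = np.array([
    [0.0, 0.0],   # A
    [5.0, 5.0],   # B
    [5.5, 5.2],   # C
    [9.0, 1.0],   # D
    [9.3, 1.4],   # E
    [0.8, 0.5],   # F
])

# Each row of Z records one merge: (cluster i, cluster j, distance, new cluster size).
Z = linkage(X, method="average", metric="euclidean")
print(Z)

# Cut the tree into three flat clusters.
flat = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(names, flat)))
```

With these coordinates D and E are the closest pair and are merged first, and cutting the tree at three clusters groups (A,F), (B,C) and (D,E).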

Distance between two combined data points

There are three methods to calculate the distance between two combined data points: single link, complete link and average link. Before starting the calculation, let's introduce the three calculation methods and their advantages and disadvantages.

Single Linkage: the distance between the two nearest data points, one from each combined data point, is taken as the distance between the two combined data points. This method is susceptible to extreme values: two dissimilar combined data points may be merged because a single pair of their extreme data points happens to be close together.
Complete Linkage: the calculation is the opposite of Single Linkage. The distance between the two farthest data points, one from each combined data point, is taken as the distance between the two combined data points. Its problem is also the opposite: two similar combined data points may fail to be merged because their extreme values are far apart.
Average Linkage: the distance between every data point in one combined data point and every data point in the other is calculated, and the mean of all these distances is taken as the distance between the two combined data points. This method requires more computation, but the result is more reasonable than that of the first two methods.
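The three definitions can be sketched directly with NumPy. The function `linkage_distance` and the sample coordinates are hypothetical, used only to make the definitions concrete.

```python
import numpy as np

def linkage_distance(P, Q, method="average"):
    """Distance between two clusters P and Q (arrays of shape (n, d) and (m, d))."""
    # Pairwise Euclidean distances between every point in P and every point in Q.
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    if method == "single":
        return d.min()    # nearest pair
    if method == "complete":
        return d.max()    # farthest pair
    if method == "average":
        return d.mean()   # mean over all pairs
    raise ValueError(f"unknown method: {method}")

# Made-up clusters on a line so the distances are easy to check by hand.
P = np.array([[0.0, 0.0], [1.0, 0.0]])
Q = np.array([[3.0, 0.0], [5.0, 0.0]])

for m in ("single", "complete", "average"):
    print(m, linkage_distance(P, Q, m))
```

The four pairwise distances here are 3, 5, 2 and 4, so single linkage gives 2.0, complete linkage gives 5.0, and average linkage gives 3.5.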

We use Average Linkage to calculate the distance between combined data points. The following is the calculation of the distance from the combined data point (A,F) to (B,C): it is the mean of the distances between each data point in (A,F) and each data point in (B,C).

Tree view

Note: the tree diagram shown here is only illustrative; it was not generated from the data above.

Disadvantages

The efficiency of the traditional hierarchical clustering algorithm is relatively low: O(t × n^2), where t is the number of iterations and n is the number of sample points. Its most obvious disadvantage is that it cannot reassign points: if sample point A has been placed in cluster C1 in some iteration, then A will belong to cluster C1 in all later iterations, which can hurt the accuracy of the clustering result.

Improvement:

In practice, hierarchical clustering is usually combined with a partition-based clustering algorithm, which addresses both the efficiency problem and the inability to reassign sample points. The BIRCH algorithm will be introduced later.

3. Looking for the best k

The elbow method (elbow rule) is a common approach. As shown in the figure below, K = 3 is the value of K at the elbow.

In the figure below, 2 and 3 are the better values of k.

So what is the principle of this method?

It is to minimize the distance from each point to its cluster center.
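The quantity being minimized is the within-cluster sum of squared distances, which scikit-learn exposes as `inertia_`. Below is a minimal sketch of the elbow method without any plotting helper, assuming synthetic data from `make_blobs` with three hand-picked centers (not the data in the figures above).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (an assumption for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.6,
    random_state=0,
)

ks = range(1, 9)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Inertia drops sharply until k reaches the number of groups and changes
# little afterwards; the bend in this curve is the "elbow".
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the kind of elbow curve described above; for this synthetic data the bend is expected near k = 3.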

1. Install the yellowbrick library

pip install yellowbrick

2. Run it; it is essentially one line of code

Source: https://www.zhihu.com/question/279825061/answer/409613401

    from sklearn.cluster import KMeans
    from yellowbrick.cluster.elbow import kelbow_visualizer
    from yellowbrick.datasets.loaders import load_nfl

    X, y = load_nfl()

    # Use the quick method and immediately show the figure
    kelbow_visualizer(KMeans(random_state=4), X, k=(2,10))

The visualizer automatically selects k = 4 as the optimal K value.