Principle of the K-means clustering method:
1. First, choose K cluster centers at random;
2. Compute the (Euclidean) distance from each data point to each of the K cluster centers, and assign the point to the cluster whose center is nearest;
3. Compute the mean of all points in each cluster (averaging the corresponding elements of the vectors), and take these means as the new cluster centers;
4. Repeat steps 2 and 3 until none of the cluster centers change.
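The four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration using the standard library, not the scikit-learn implementation; the function name kmeans, the parameters max_iter and seed, the sample points, and the handling of an empty cluster are my own assumptions.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick K distinct data points at random as the initial centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 3: move each center to the element-wise mean of its points.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster simply keeps its old center.
                new_centers.append(centers[c])
        # Step 4: stop as soon as no center changes.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Made-up sample points, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Because the initial centers are random, which cluster gets index 0 and which gets index 1 depends on the seed.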
With this principle in mind, we apply it to a small, simple example.
Third party libraries to install:
NumPy, matplotlib, SciPy, and scikit-learn (imported as sklearn). Install them in this order: matplotlib and SciPy both build on NumPy, and scikit-learn in turn requires NumPy and SciPy.
Installation method (pip installation):
If you cannot reach the default PyPI server, you can download the packages from a mirror site, which is also very fast. Open a command prompt (DOS window) and, taking numpy as an example, enter the following command:
pip install numpy -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com
Because you cannot move the cursor into the middle of a line in the DOS window to delete text, it helps to save this command in a document. Each time you need to install a library, change the package name in the saved command first and then paste it into the command prompt window.
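After installing, a quick check like the one below can confirm that all four libraries are visible to Python. It is a minimal sketch using only the standard library; the helper name check_installed is my own, not part of any of these packages.

```python
import importlib.util


def check_installed(modules=("numpy", "matplotlib", "scipy", "sklearn")):
    """Return {module name: True if Python can find it, else False}."""
    # Note: the scikit-learn package is imported under the name "sklearn".
    return {name: importlib.util.find_spec(name) is not None for name in modules}


for name, ok in check_installed().items():
    print(name, "OK" if ok else "missing")
```

Any library reported as missing can be installed with the pip command above, substituting its package name.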
Data introduction:
The data set records eight main categories of annual per capita consumption expenditure of urban households in 31 provinces of China in 1999. The eight variables are: food; clothing; household equipment and services; medical and health care; transportation and communications; recreation, education and cultural services; housing; and miscellaneous goods and services. Using this existing data, we cluster the 31 provinces.
Purpose of the experiment:
Through clustering, we can see how the consumption levels of the provinces compared with one another in 1999.
Data display:
Data link:
Link: https://pan.baidu.com/s/1PArczd8hxp1AbQe9MOKwgQ
Extraction code: 9fg2
(Put the txt file and the following program in the same folder.)
Program presentation:
For ease of understanding, I have added comments in some places.
import numpy as np
from sklearn.cluster import KMeans


def loadData(filePath):
    # This function reads the data from the txt file: one province per line,
    # the name first, then its expenditure values, separated by commas.
    retData = []
    retCityName = []
    with open(filePath, 'r') as fr:
        for line in fr:
            # strip() with no arguments removes leading and trailing whitespace
            # (including '\n', '\r', '\t' and ' '). line is a string; after
            # split(",") it becomes a list of strings.
            items = line.strip().split(",")
            retCityName.append(items[0])
            # Note that a list (one row of floats) is appended here.
            retData.append([float(items[i]) for i in range(1, len(items))])
    return retData, retCityName


if __name__ == '__main__':
    # A Python file can be used in two ways: run directly as a script, or
    # imported into another script as a module. The code under
    # if __name__ == '__main__': runs only in the first case; it is not
    # executed when the file is imported.
    data, cityName = loadData('31 Household consumption level of provincial and municipal residents-city.txt')
    km = KMeans(n_clusters=4)
    # fit_predict() computes the cluster centers and returns, for each
    # sample, the index of the cluster it belongs to.
    label = km.fit_predict(data)
    # axis=1 adds up the elements of each row, giving one total per cluster
    # center (axis=0 would instead add up the elements of each column).
    expenses = np.sum(km.cluster_centers_, axis=1)
    # print(expenses)
    CityCluster = [[], [], [], []]
    # label[i] is the cluster of cityName[i], so each name is appended to
    # the list of its own cluster.
    for i in range(len(cityName)):
        CityCluster[label[i]].append(cityName[i])
    for i in range(len(CityCluster)):
        print("Expenses:%.2f" % expenses[i])
        print(CityCluster[i])
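KMeans numbers its clusters arbitrarily, so the order in which the groups are printed carries no meaning by itself. As an optional refinement, the sketch below groups the names by label and lists the clusters from the lowest to the highest total expenditure. The helper group_by_cluster is hypothetical, and the names, labels, and totals are made-up placeholder values, not results from the data set.

```python
def group_by_cluster(names, labels, totals):
    """Return (total, [names]) pairs, one per cluster, sorted by total."""
    groups = [[] for _ in totals]
    for name, lab in zip(names, labels):
        groups[lab].append(name)
    return sorted(zip(totals, groups), key=lambda pair: pair[0])


# Placeholder values for illustration only (not taken from the data set).
names = ["A", "B", "C", "D", "E"]
labels = [2, 0, 2, 1, 0]            # cluster index of each name
totals = [5000.0, 3000.0, 7500.0]   # summed cluster center of each cluster

for total, members in group_by_cluster(names, labels, totals):
    print("Expenses:%.2f" % total)
    print(members)
```

In the program above, the corresponding call would pass cityName, label, and expenses.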
Operation results:
Source:
MOOC course: Python machine learning, Beijing University of technology