### Catalog

- 1. Python implementation of the KNN algorithm
  - 1. Import packages
  - 2. Plot the distribution of the different films
  - 3. Prepare the training samples and the sample to be classified
  - 4. Calculate the distance between the test sample and each training sample
  - 5. Find the K training samples closest to the test sample
  - 6. Find the class with the most votes
  - 7. Wrap everything in a custom function
- 2. Test on the iris dataset

KNN is one of the top ten machine-learning algorithms. Its principle is easy to understand, and as the saying goes, "Talk is cheap. Show me the code." So let's implement it in Python and then test the model on the iris dataset.

Algorithm principle: to classify a new sample, look at which training samples it is "closest" to, where closeness is usually quantified by distance. Find the K nearest training samples, then classify the new sample by majority vote among them.

The distance used here is the Euclidean distance: `d = sqrt((x1 - x2)^2 + (y1 - y2)^2)`.
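As a quick sanity check on the formula, this computes the Euclidean distance between the unknown movie `(18, 90)` and the first training sample `(3, 104)` used later in this post:

```python
import numpy as np

# Euclidean distance between the unknown movie (18, 90)
# and the first training sample (3, 104)
a = np.array([18, 90])
b = np.array([3, 104])
d = np.sqrt(((a - b) ** 2).sum())
print(d)  # sqrt(15^2 + 14^2) = sqrt(421) ≈ 20.52
```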

Disadvantages: the algorithm is computationally expensive, because every training sample must be compared against the test sample. In addition, when the training classes are unbalanced (for example, one class makes up a disproportionate share of the samples), the test sample is easily assigned to that class: its members may not actually be closer, but they dominate by sheer count.

Optimization: to weaken the effect of class imbalance, weight each vote by the inverse distance (weight = 1/d), so that training samples closer to the test sample carry more weight.
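The weighted variant described above can be sketched like this (a minimal sketch, not the function built later in this post; the name `weighted_knn` and the epsilon guard are my own choices):

```python
import numpy as np

def weighted_knn(x_test, x_data, y_data, k):
    """Distance-weighted KNN sketch: each of the k nearest neighbors
    votes with weight 1/d instead of a flat count of 1."""
    distances = np.sqrt(((x_data - x_test) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = {}
    for i in nearest:
        # A tiny epsilon guards against division by zero for exact matches
        votes[y_data[i]] = votes.get(y_data[i], 0) + 1 / (distances[i] + 1e-8)
    # Return the label with the largest total weight
    return max(votes, key=votes.get)
```

On the movie data used below, the three nearby Romance samples outweigh the two distant Action samples even more decisively than with plain counting.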

Here is my implementation. This example classifies movies by their numbers of fight scenes and kiss scenes.


### 1. Python implementation of the KNN algorithm

#### 1. Import packages

```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

#### 2. Plot the distribution of the different films

```python
# Plot the training samples
x1 = np.array([3, 2, 1])
y1 = np.array([104, 100, 81])
x2 = np.array([101, 99, 98])
y2 = np.array([10, 5, 2])
s1 = plt.scatter(x1, y1, c='r')
s2 = plt.scatter(x2, y2, c='b')
# Unknown movie
x = np.array([18])
y = np.array([90])
s3 = plt.scatter(x, y, c='k')
plt.legend(handles=[s1, s2, s3], labels=['Romance', 'Action', 'Unknown'], loc='best')
plt.show()
```

#### 3. Prepare the training samples and the sample to be classified

```python
# Training samples: each x has two features (fights, kisses)
x_data = np.array([
    [3, 104],
    [2, 100],
    [1, 81],
    [101, 10],
    [99, 5],
    [81, 2]
])
# y holds the labels (movie types)
y_data = np.array(['Romance', 'Romance', 'Romance', 'Action', 'Action', 'Action'])
# The sample to classify
x_test = np.array([18, 90])
```

#### 4. Calculate the distance between the test sample and each training sample

```python
# With 6 training samples, compute the squared distance from the
# test sample to each of them: (xi-x)^2 + (yi-y)^2
((np.tile(x_test, (x_data.shape[0], 1)) - x_data) ** 2).sum(axis=1)
```

```python
# Euclidean distance from the test point to each training point:
# sqrt((xi-x)^2 + (yi-y)^2)
distances = (((np.tile(x_test, (x_data.shape[0], 1)) - x_data) ** 2).sum(axis=1)) ** 0.5
# Indices that would sort the distances in ascending order
sort_distances = distances.argsort()
```
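As a side note, NumPy broadcasting makes the explicit `np.tile` copy unnecessary: subtracting a 1-D array from a 2-D array subtracts it from every row. This sketch checks that both forms give the same distances:

```python
import numpy as np

x_data = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [81, 2]])
x_test = np.array([18, 90])

# Broadcasting subtracts x_test from every row of x_data directly
bcast = np.sqrt(((x_data - x_test) ** 2).sum(axis=1))
# The np.tile version from the post, for comparison
tiled = (((np.tile(x_test, (x_data.shape[0], 1)) - x_data) ** 2).sum(axis=1)) ** 0.5
print(np.allclose(bcast, tiled))  # True
```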

#### 5. Find the K training samples closest to the test sample

The choice of k here is arbitrary; in practice it needs to be tuned by evaluating the model several times.

```python
k = 5  # Use the 5 nearest neighbors
classcount = {}
# Take the k nearest samples and count each type
for i in range(k):
    votlabel = y_data[sort_distances[i]]
    classcount[votlabel] = classcount.get(votlabel, 0) + 1
classcount
```

Result: `{'Romance': 3, 'Action': 2}`

#### 6. Find the class with the most votes

```python
# Find the class with the most votes
max_k = ''
max_v = 0
for k, v in classcount.items():
    if v > max_v:
        max_v = v
        max_k = k
print(max_k)
```

#### 7. Wrap everything in a custom function

```python
# x_data: training sample features
# y_data: training sample labels (types)
# x_test: the sample to classify
# k: number of nearest neighbors to use
# Returns the predicted type
def knn(x_test, x_data, y_data, k):
    # Number of training samples
    x_data_size = x_data.shape[0]
    # Copy x_test once per training sample and take the difference
    diffMat = np.tile(x_test, (x_data_size, 1)) - x_data
    # Square the differences
    sqDiffMat = diffMat ** 2
    # Sum over the feature axis
    sqDistances = sqDiffMat.sum(axis=1)
    # Square root gives the Euclidean distances
    distances = sqDistances ** 0.5
    # Indices sorted from nearest to farthest
    sortedDistances = distances.argsort()
    classCount = {}
    for i in range(k):
        # Get the label of the i-th nearest sample
        votelabel = y_data[sortedDistances[i]]
        # Count the label
        classCount[votelabel] = classCount.get(votelabel, 0) + 1
    # Find the most frequent label
    max_k = ''
    max_v = 0
    for key, v in classCount.items():
        if v > max_v:
            max_v = v
            max_k = key
    return max_k
```
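A quick sanity check on the movie data before moving on to iris (the function is repeated here in compact form so the snippet runs on its own):

```python
import numpy as np

def knn(x_test, x_data, y_data, k):
    # Distance from the test point to every training sample
    distances = (((np.tile(x_test, (x_data.shape[0], 1)) - x_data) ** 2).sum(axis=1)) ** 0.5
    classCount = {}
    for i in distances.argsort()[:k]:
        votelabel = y_data[i]
        classCount[votelabel] = classCount.get(votelabel, 0) + 1
    # Return the most frequent label among the k nearest samples
    return max(classCount, key=classCount.get)

x_data = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [81, 2]])
y_data = np.array(['Romance', 'Romance', 'Romance', 'Action', 'Action', 'Action'])
print(knn(np.array([18, 90]), x_data, y_data, 5))  # Romance
```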

### 2. Test on the iris dataset

#### 1. Import packages

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split  # Data splitting
from sklearn.metrics import classification_report, confusion_matrix  # Precision, recall, confusion matrix
import random
```

#### 2. Import the data and split the dataset

```python
# Load the data
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
```

If you don't want to use the built-in train_test_split() method to split the dataset, you can write it yourself as follows:

```python
# The iris data is ordered by class, so shuffle it first.
# This is equivalent to:
# x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
data_size = iris.data.shape[0]
index = [i for i in np.arange(data_size)]
random.shuffle(index)
iris.data = iris.data[index]
iris.target = iris.target[index]
# Split the dataset
test_size = int(data_size * 0.2)
x_train = iris.data[test_size:]
x_test = iris.data[:test_size]
y_train = iris.target[test_size:]
y_test = iris.target[:test_size]
```
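An equivalent, slightly more compact shuffle uses `np.random.permutation`, which returns a shuffled index array in one call and avoids building a Python list (a sketch of the same split, with the variable names chosen to match the post):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# One shuffled index array for both features and labels,
# so each sample keeps its label
index = np.random.permutation(iris.data.shape[0])
data, target = iris.data[index], iris.target[index]

# Hold out 20% of the samples for testing
test_size = int(data.shape[0] * 0.2)
x_train, x_test = data[test_size:], data[:test_size]
y_train, y_test = target[test_size:], target[:test_size]
print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
```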

#### 3. Call the custom KNN function and compute precision, recall, and the confusion matrix

```python
prediction = []
# Call the knn(x_test, x_data, y_data, k) function on each test sample
for i in range(x_test.shape[0]):
    prediction.append(knn(x_test[i], x_train, y_train, 5))
print(classification_report(y_test, prediction))
print(confusion_matrix(y_test, prediction))
```
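As a cross-check (my own addition, not part of the original workflow), scikit-learn's built-in KNeighborsClassifier with the same k should give comparable accuracy on the same data:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
# random_state is fixed here only to make the run reproducible
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))  # accuracy on the held-out 20%
```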

The results look good. Keep going!

The end.