[Machine learning algorithm] Implementing the KNN classification algorithm by hand in Python and testing the model on the iris data set


KNN is often listed among the top ten machine learning algorithms. Its principle is easy to understand, and as the saying goes: "Talk is cheap. Show me the code." So this post implements it in Python and tests the model on the iris data set.

Algorithm principle: to decide which class a new sample belongs to, look at the training samples that are "closest" to it, where closeness is generally quantified by distance. Find the K nearest training samples and assign the new sample to the class that most of them belong to (majority vote).
The distance here is Euclidean distance:
d = sqrt((xi - x)^2 + (yi - y)^2), where (xi, yi) is a training sample and (x, y) is the sample to be classified.

Disadvantages of the algorithm: the computational cost is high, because every sample to be classified must be compared with all training samples. In addition, when the classes in the training set are imbalanced, for example when one class makes up a very large share of the samples, new samples are easily assigned to that class. Its members are not necessarily closer; they simply outnumber the others among the K neighbors.
Algorithm optimization: to weaken the influence of class imbalance, each vote can be weighted by distance (weight = 1/d), so that the closer a training sample is to the sample to be classified, the greater its weight.
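
A minimal sketch of the weight = 1/d idea, assuming Euclidean distance and a plain dictionary of per-class scores. The name weighted_knn, the eps parameter and the toy data are made up for illustration and are not part of the implementation in this post.

```python
import numpy as np

def weighted_knn(x_test, x_data, y_data, k, eps=1e-9):
    """Distance-weighted vote: each neighbor contributes 1/d instead of 1."""
    # Euclidean distance from x_test to every training row
    distances = np.sqrt(((x_data - x_test) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    scores = {}
    for idx in nearest:
        label = y_data[idx]
        # eps (an assumption) avoids division by zero when a training
        # point coincides with x_test
        scores[label] = scores.get(label, 0.0) + 1.0 / (distances[idx] + eps)
    return max(scores, key=scores.get)

# Toy data: class 'B' has more samples, but the 'A' samples are much closer
x_data = np.array([[0, 0], [1, 0], [10, 0], [11, 0], [12, 0]])
y_data = np.array(['A', 'A', 'B', 'B', 'B'])
print(weighted_knn(np.array([0.5, 0]), x_data, y_data, k=5))  # prints A
```

With k=5 an unweighted vote would pick 'B' (3 votes against 2), while the weighted vote picks 'A' because the two 'A' points are far closer.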

Here is my implementation process. This example classifies movies according to the number of fights and kisses.


1, Python implementation of KNN algorithm

1. Import package

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

2. Plot the movies to show how the different types are distributed

x1 = np.array([3,2,1])
y1 = np.array([104,100,81])
x2 = np.array([101,99,98])
y2 = np.array([10,5,2])
s1 = plt.scatter(x1,y1,c='r')
s2 = plt.scatter(x2,y2,c='b')
#Unknown movie
x = np.array([18])
y = np.array([90])
s3 = plt.scatter(x,y,c='k')

3. Preparation of training samples and samples to be tested

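#Define training samples; each row of x_data has two features (same points as plotted above)
x_data = np.array([[3,104],
                   [2,100],
                   [1,81],
                   [101,10],
                   [99,5],
                   [98,2]])
#y_data holds the labels
y_data = np.array(['Romance','Romance','Romance','Action','Action','Action'])
#Movie whose type is to be predicted
x_test = np.array([18,90])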

4. Calculate the distance between the sample points to be tested and each training sample point

#There are 6 training samples, so compute the distance from the test sample to each of the 6 training points
((np.tile(x_test,(x_data.shape[0],1))-x_data)**2).sum(axis=1)  # squared distances: (xi-x)^2+(yi-y)^2
#Euclidean distance from the test point to each training point
distances = (((np.tile(x_test,(x_data.shape[0],1))-x_data)**2).sum(axis=1))**0.5  #sqrt((xi-x)^2+(yi-y)^2)
sort_distances = distances.argsort() #Indices of the training points, ordered from nearest to farthest
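
As a side note, the np.tile copy is not strictly needed: NumPy broadcasting subtracts a single row from every row of a 2-D array. The sketch below compares the three forms on the movie data; the x_data rows are taken from the plotted coordinates above, which is an assumption about the training array.

```python
import numpy as np

# Assumed training matrix: one row per movie, built from the plotted points
x_data = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
x_test = np.array([18, 90])

# Version with np.tile: x_test is copied once per training row
d_tile = (((np.tile(x_test, (x_data.shape[0], 1)) - x_data) ** 2).sum(axis=1)) ** 0.5

# Version with broadcasting: x_test is subtracted from every row directly
d_broadcast = np.sqrt(((x_data - x_test) ** 2).sum(axis=1))

# Version with np.linalg.norm along each row
d_norm = np.linalg.norm(x_data - x_test, axis=1)

print(np.allclose(d_tile, d_broadcast), np.allclose(d_tile, d_norm))  # True True
print(d_broadcast.argsort())  # [1 2 0 3 4 5]
```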

5. Find the type of K training sample points closest to the sample point to be tested

The value of k is chosen arbitrarily here. In practice it has to be tuned by evaluating the model several times with different values.

k = 5   #Set the K of the k nearest neighbor to 5
classcount = {}
#Take the k nearest samples and count how many belong to each type
for i in range(k):
    votlabel = y_data[sort_distances[i]]
    classcount[votlabel] = classcount.get(votlabel,0)+1
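
On the point that k has to be tuned: one simple way is to hold out part of the data and compare accuracy for a few candidate values. This is only a sketch under stated assumptions: the data is synthetic (two Gaussian clusters), and the names knn_predict and holdout_accuracy are hypothetical helpers, not functions from this post.

```python
import numpy as np
from collections import Counter

def knn_predict(x, x_train, y_train, k):
    # Euclidean distance to every training sample, then majority vote
    d = np.sqrt(((x_train - x) ** 2).sum(axis=1))
    nearest = d.argsort()[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

def holdout_accuracy(x_train, y_train, x_val, y_val, k):
    preds = [knn_predict(x, x_train, y_train, k) for x in x_val]
    return float(np.mean(np.array(preds) == y_val))

rng = np.random.default_rng(0)
# Synthetic data: two Gaussian clusters with 50 points each
x = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
order = rng.permutation(len(x))
x, y = x[order], y[order]
x_train, y_train, x_val, y_val = x[:80], y[:80], x[80:], y[80:]

# Try a few odd values of k and keep the one with the best hold-out accuracy
scores = {k: holdout_accuracy(x_train, y_train, x_val, y_val, k) for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Odd values of k avoid ties in a two-class problem. On a small data set, cross-validation would give a steadier estimate than a single hold-out split.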


6. Find the largest number of classes

#Find the class with the most votes
max_k = ''
max_v = 0
for label,v in classcount.items():
    if v>max_v:
        max_v = v
        max_k = label
max_k   #predicted type of the unknown movie
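
The same counting and "find the maximum" steps can also be written with collections.Counter from the standard library. This is an alternative sketch rather than the approach used in this post; the labels list stands for the types of the 5 nearest movies in the example (3 Romance, 2 Action).

```python
from collections import Counter

# Types of the 5 nearest training samples in the movie example
labels = ['Romance', 'Romance', 'Romance', 'Action', 'Action']

votes = Counter(labels)                  # counts the votes per type
winner, count = votes.most_common(1)[0]  # most frequent type and its vote count
print(winner, count)  # Romance 3
```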

7. Write as a custom function

#x_data: training sample features;
#y_data: training sample labels (types);
#x_test: sample to be classified;
#k: number of nearest neighbors to use
#Returns the predicted type
def knn(x_test,x_data,y_data,k):
    #Number of training samples
    x_data_size = x_data.shape[0]
    # Copy x_test once per training sample and
    # compute the difference between x_test and each sample
    diffMat = np.tile(x_test,(x_data_size,1))-x_data
    # Square the differences
    sqDiffMat = diffMat**2
    # Sum over the features
    sqDistances = sqDiffMat.sum(axis=1)
    # Square root gives the Euclidean distance
    distances = sqDistances**0.5
    # Indices sorted from nearest to farthest
    sortedDistances = distances.argsort()
    classCount = {}
    for i in range(k):
        # Label of the i-th nearest neighbor
        votelabel = y_data[sortedDistances[i]]
        # Count the votes for each label
        classCount[votelabel] = classCount.get(votelabel,0)+1
    # Find the label with the most votes
    max_k = ''
    max_v = 0
    for label,v in classCount.items():
        if v > max_v:
            max_v = v
            max_k = label
    return max_k

2, iris data set test

1. Import package

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split   #Data segmentation
from sklearn.metrics import classification_report,confusion_matrix   #Precision, recall, confusion matrix
import random

2. Import data, divide data sets

#Load data
iris = datasets.load_iris()
x_train,x_test,y_train,y_test = train_test_split(iris.data,iris.target,test_size = 0.2)

If you don't want to use the built-in train_test_split() method to divide the data set, you can write your own code as follows:

#Because iris data is ordered by class, shuffle it first
#Together with the split below, this does the same job as x_train,x_test,y_train,y_test = train_test_split(iris.data,iris.target,test_size = 0.2)
data_size = iris.data.shape[0]
index = [i for i in range(data_size)]
random.shuffle(index)   #shuffle the indices in place
iris.data = iris.data[index]
iris.target = iris.target[index]

#Split the data set
test_size = int(data_size * 0.2)
x_train = iris.data[test_size:]
x_test = iris.data[:test_size]
y_train = iris.target[test_size:]
y_test = iris.target[:test_size]
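
For comparison, here is a sketch of the same shuffle-then-split idea wrapped in a function that leaves the original arrays untouched. The name shuffle_split and its seed parameter are hypothetical; it uses NumPy's permutation instead of the random module.

```python
import numpy as np

def shuffle_split(x, y, test_ratio=0.2, seed=None):
    """Shuffle the row indices once, then cut them into test and train parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(x.shape[0])      # shuffled row indices
    test_size = int(x.shape[0] * test_ratio)
    test_idx, train_idx = order[:test_size], order[test_size:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Small demonstration on made-up data: 10 samples with 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)
x_train, x_test, y_train, y_test = shuffle_split(x, y, test_ratio=0.2, seed=42)
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
```

Indexing features and labels with the same shuffled indices keeps each sample paired with its label.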

3. Call the written KNN function, and calculate the precision, recall and confusion matrix

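prediction = []
#Call the knn(x_test, x_data, y_data, k) function on each test sample (k=5 as in part 1; an assumed value)
for i in range(x_test.shape[0]):
    prediction.append(knn(x_test[i],x_train,y_train,5))
print(classification_report(y_test,prediction))
print(confusion_matrix(y_test,prediction))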

The results look good. Keep going!
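
To make the precision, recall and confusion matrix from step 3 less of a black box, here is a small NumPy-only sketch of how they are counted. The function names and the toy labels are made up for illustration; rows are true classes and columns are predicted classes, the same layout sklearn's confusion_matrix uses.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    # m[i, j] = number of samples of true class i predicted as class j
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

def precision_recall(m):
    tp = np.diag(m).astype(float)   # correct predictions per class
    predicted = m.sum(axis=0)       # column sums: how often each class was predicted
    actual = m.sum(axis=1)          # row sums: how many samples each class really has
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    return precision, recall

# Toy labels: one sample of class 1 is misclassified as class 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
m = confusion_counts(y_true, y_pred, 3)
precision, recall = precision_recall(m)
print(m)
print(precision, recall)
```

Here class 1 has recall 0.5 (one of its two samples was missed) and class 2 has precision 2/3 (one of its three predictions was wrong).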



Tags: Python

Posted on Mon, 09 Mar 2020 06:27:36 -0400 by devil_online