Machine learning -- KNN algorithm learning

Principle understanding

KNN, the k-nearest neighbor algorithm, is a classification algorithm. The idea is that each sample can be represented by its K closest neighbors: among those K samples, whichever category holds the largest share is the category assigned to the sample.

As the saying goes, birds of a feather flock together. To judge whether a person is good or bad, look at whether most of their friends are good or bad. The analogy is a stretch, but it captures the principle behind the KNN algorithm.

The goal of the KNN algorithm is to find the K nearest samples of known category and use them to predict the category of a sample whose category is unknown.

The value of K matters a great deal in the KNN algorithm. As shown in the figure below, the results for K=3 and K=5 can be very different.
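
To make the effect of K concrete, here is a minimal sketch in plain NumPy. The points, the labels "A" and "B", and the helper name knn_predict are all made up for illustration; the data is arranged so that the majority among the 3 nearest neighbors differs from the majority among the 5 nearest.

```python
import collections
import numpy as np

def knn_predict(query, points, labels, k):
    """Return the majority label among the k points closest to query."""
    # Euclidean distance from the query to every point
    dist = np.sqrt(np.sum((points - query) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = dist.argsort()[:k]
    votes = collections.Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two "A" samples and three "B" samples
points = np.array([[1.0, 0.0], [0.0, 2.0], [1.5, 0.0], [0.0, 3.0], [3.5, 0.0]])
labels = ["A", "A", "B", "B", "B"]
query = np.array([0.0, 0.0])

pred_k3 = knn_predict(query, points, labels, 3)  # neighbors in order: A, B, A
pred_k5 = knn_predict(query, points, labels, 5)  # neighbors in order: A, B, A, B, B
print("K=3:", pred_k3)
print("K=5:", pred_k5)
```

With K=3 the vote is two "A" against one "B"; with K=5 the two extra, more distant "B" samples tip the vote the other way.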

Algorithm flow

General process of the K-nearest neighbor algorithm:
1. Collect data: any method can be used.
2. Prepare data: numeric values are required for the distance calculation, preferably in a structured data format.
3. Analyze data: any method can be used.
4. Train the algorithm: this step does not apply to the K-nearest neighbor algorithm.
5. Test the algorithm: calculate the error rate.
6. Use the algorithm: first input the sample data and the structured output results, then run the K-nearest neighbor algorithm to determine which class each input sample belongs to, and finally carry out any follow-up processing on the computed classes.
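
Step 5 can be sketched as follows: classify a set of held-out samples whose labels are known and count the share that comes out wrong. The data and the helper names knn_predict and error_rate are assumptions for illustration; the last test sample is deliberately mislabeled so that the error rate is not zero.

```python
import collections
import numpy as np

def knn_predict(query, train_x, train_y, k):
    """Majority label among the k training samples closest to query."""
    dist = np.sqrt(np.sum((train_x - query) ** 2, axis=1))
    nearest = dist.argsort()[:k]
    return collections.Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def error_rate(test_x, test_y, train_x, train_y, k):
    """Fraction of test samples whose predicted label differs from the true label."""
    wrong = sum(knn_predict(x, train_x, train_y, k) != y
                for x, y in zip(test_x, test_y))
    return wrong / len(test_y)

# Hypothetical labeled data: class 0 near the origin, class 1 near (10, 10)
train_x = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [9, 10], [10, 9]], dtype=float)
train_y = [0, 0, 0, 1, 1, 1]

# Held-out samples; the last label is wrong on purpose
test_x = np.array([[1, 1], [9, 9], [0.5, 0.5], [10, 10]], dtype=float)
test_y = [0, 1, 0, 0]

rate = error_rate(test_x, test_y, train_x, train_y, k=3)
print("Error rate: %.2f" % rate)
```

There is no training call anywhere in the sketch, which matches step 4: the "model" is simply the stored training samples.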

Code practice

  • Film category classification
  • Cancer prediction
  • Iris classification
  • Handwritten numeral recognition

Film category classification


# -*- coding: UTF-8 -*-
import collections
import numpy as np
def createDataSet():
	# Four samples, each with two features
	group = np.array([[1,101],[5,89],[108,5],[115,8]])
	# Labels for the four samples
	labels = ['romance movie','romance movie','action movie','action movie']
	return group, labels
def classify(inx, dataset, labels, k):
	# Euclidean distance from inx to every sample in dataset
	dist = np.sum((inx - dataset)**2, axis=1)**0.5

	# Labels of the k nearest samples
	# dist.argsort() returns the indices that sort dist from smallest to largest
	k_labels = [labels[index] for index in dist.argsort()[0 : k]]

	print('k_labels', k_labels)
	# The most frequent label is the final category
	# collections.Counter counts how often each element occurs (key: element, value: count);
	# most_common(1) returns a list holding the single most frequent (element, count) pair
	label = collections.Counter(k_labels).most_common(1)[0][0]
	return label

if __name__ == '__main__':
	# Create the dataset
	group, labels = createDataSet()
	# Test sample
	test = [55,20]
	# kNN classification
	test_class = classify(test, group, labels, 3)
	# Print the classification result
	print(test_class)
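
To see what the distance line and argsort do for this data, the sketch below repeats the calculation step by step with the same group array and test point. The intermediate names diff and order are introduced here only for illustration.

```python
import numpy as np

group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
test = np.array([55, 20])

# Broadcasting subtracts the test point from every row of group
diff = group - test
# Euclidean distance: square, sum over the two features, take the root
dist = np.sqrt((diff ** 2).sum(axis=1))
# Indices of the samples ordered from nearest to farthest
order = dist.argsort()

print(np.round(dist, 2))
print(order)      # nearest to farthest: samples 2, 3, 1, 0
print(order[:3])  # the three neighbors that vote when k=3
```

Samples 2 and 3 both carry the label 'action movie', so two of the three votes go to that class and the test point is classified as an action movie.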

Cancer prediction

Iris classification

Sklearn (scikit-learn) is one of the most widely used Python machine learning libraries. The algorithms it supports cover classification, regression, dimensionality reduction, and clustering, and it also provides modules for feature extraction, data processing, and model evaluation.

A quick sklearn test

from sklearn.neighbors import KNeighborsClassifier
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
neigh = KNeighborsClassifier(n_neighbors=3)
# The fit function uses X as the training data and y as the target values (the labels) to fit the model.
neigh.fit(X, y)
# Predicting for the value 1.1 returns 0, so 1.1 is assigned to class 0.
print(neigh.predict([[1.1]]))



# -*- coding: utf-8 -*-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
class KNN(object):

    # The iris data has three classes (setosa / 0, versicolor / 1, virginica / 2), 50 samples per class, and four features per sample (sepal length, sepal width, petal length, petal width)
    def get_iris_data(self):
        iris = load_iris()
        iris_data = iris.data
        iris_target = iris.target
        return iris_data, iris_target

    def run(self):
        # 1. Get the feature values and target values of the iris data
        iris_data, iris_target = self.get_iris_data()
        #print("iris_data, iris_target",iris_data, iris_target)
        # 2. Split the data into a training set and a test set; test_size=0.25 means 25% of the data is used as the test set
        x_train, x_test, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.25)
        # 3. Feature engineering (standardize the feature values)
        std = StandardScaler()
        x_train = std.fit_transform(x_train)
        x_test = std.transform(x_test)
        # 4. Feed the data to the algorithm
        knn = KNeighborsClassifier(n_neighbors=5) # Create a KNN instance; n_neighbors defaults to 5, and a better value can be found through grid search
        knn.fit(x_train, y_train) # Feed the training set to the algorithm
        y_predict = knn.predict(x_test) # Get the predictions
        # Display the prediction results
        labels = ["setosa","versicolor","virginica"]
        for i in range(len(y_predict)):
            print("Test sample %d: true value: %s\tprediction: %s"%((i+1),labels[y_test[i]],labels[y_predict[i]]))
        print("Accuracy:",knn.score(x_test, y_test))

if __name__ == '__main__':
    knn = KNN()
    knn.run()
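
The script imports GridSearchCV and mentions grid search for n_neighbors but never runs it. A minimal sketch of how that search might look is given below. The candidate values, cv=5, random_state=0, and the use of a pipeline are assumptions made here, not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# Putting the scaler inside the pipeline means each cross-validation fold
# is standardized using only its own training part
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Candidate K values (chosen arbitrarily for this sketch)
candidates = [1, 3, 5, 7, 9]
search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": candidates},
    cv=5,
)
search.fit(x_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = search.score(x_test, y_test)
print("Best n_neighbors:", best_k)
print("Cross-validation accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

The parameter name kneighborsclassifier__n_neighbors follows from make_pipeline, which names each step after its lowercased class name. The best K depends on the split, so no particular value should be expected.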

Handwriting Recognition

s = "hello [world] python"  # sample string, assumed here for illustration
print(s.split('o')[0])  # the content before the first 'o'
print(s.split("[")[1].split("]")[0])  # the content between '[' and ']'
print(s.split('l')[2])  # the content between the second and third 'l'


# -*- coding: UTF-8 -*-
import numpy as np
#The operator module is a built-in operator function interface in Python. It defines built-in functions for arithmetic, comparison and other operations corresponding to the standard object API.
import operator
from os import listdir
from sklearn.neighbors import KNeighborsClassifier as kNN

def img2vector(filename):
    # Create a 1x1024 zero vector
    returnVect = np.zeros((1, 1024))
    # Open the file; the with block closes it automatically
    with open(filename) as fr:
        # Read the 32 rows one by one
        for i in range(32):
            lineStr = fr.readline()
            # Copy the first 32 characters of the row into returnVect
            for j in range(32):
                returnVect[0, 32*i+j] = int(lineStr[j])
    # Return the converted 1x1024 vector
    return returnVect
def handwritingClassTest():
    # Labels of the training set
    hwLabels = []
    # listdir returns the file names in the trainingDigits directory
    trainingFileList = listdir('trainingDigits')
    # Number of files in the directory
    m = len(trainingFileList)
    # Initialize the training matrix: m rows, 1024 columns, all zeros
    trainingMat = np.zeros((m, 1024))
    # Parse the class of each training sample from its file name
    for i in range(m):
        # Get the file name
        fileNameStr = trainingFileList[i]
        # The class digit is the part of the file name before '_'
        classNumber = int(fileNameStr.split('_')[0])
        # Add the class to hwLabels
        hwLabels.append(classNumber)
        # Store the 1x1024 vector returned by img2vector in row i of trainingMat
        trainingMat[i,:] = img2vector('trainingDigits/%s' % (fileNameStr))

    # Construct the kNN classifier
    neigh = kNN(n_neighbors = 3, algorithm = 'auto')
    # Fit the model: trainingMat is the training matrix and hwLabels holds the corresponding labels
    neigh.fit(trainingMat, hwLabels)
    # listdir returns the file names in the testDigits directory
    testFileList = listdir('testDigits')
    # Error count
    errorCount = 0.0
    # Number of test samples
    mTest = len(testFileList)
    # Parse the class of each test sample from its file name and classify it
    for i in range(mTest):
        # Get the file name
        fileNameStr = testFileList[i]
        # Get the class digit
        classNumber = int(fileNameStr.split('_')[0])
        # Get the 1x1024 vector of the test sample
        vectorUnderTest = img2vector('testDigits/%s' % (fileNameStr))
        # Get the prediction; predict returns an array, so take its single element
        classifierResult = neigh.predict(vectorUnderTest)[0]
        print("Classifier returned %d\tTrue label is %d" % (classifierResult, classNumber))
        if(classifierResult != classNumber):
            errorCount += 1.0
    print("Total errors: %d\nError rate: %f%%" % (errorCount, errorCount/mTest * 100))
if __name__ == '__main__':
    handwritingClassTest()


References
Sklearn Chinese documentation

Posted on Mon, 27 Sep 2021 09:21:08 -0400 by R0d Longfella