# Machine learning -- KNN algorithm learning

## Principle understanding

KNN is the k-nearest neighbors algorithm, a classification algorithm. The idea is that each sample can be represented by its K closest neighbors: among those K samples, the category that appears most often is assigned to the sample being classified.

There is a saying that birds of a feather flock together. If we want to judge whether a person is good or bad, we can look at whether there are more good people or more bad people among their friends. Although this explanation is a bit far-fetched, it captures the principle behind the KNN algorithm.

The goal of the KNN algorithm is to find the K nearest known-category samples and use them to predict the category of an unknown sample.
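This voting idea can be sketched in a few lines of plain Python. The points and labels below are made up for illustration; note how the same query point gets a different answer depending on K:

```python
import numpy as np
from collections import Counter

def knn_predict(x, points, labels, k):
    """Classify x by majority vote among its k nearest points."""
    dists = np.linalg.norm(points - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    votes = [labels[i] for i in nearest]
    return Counter(votes).most_common(1)[0][0]   # majority label

# Toy data: two 'square' points near the origin, three 'triangle' points farther away
points = np.array([[1, 1], [1, 2], [6, 5], [7, 6], [6, 6]])
labels = ['square', 'square', 'triangle', 'triangle', 'triangle']

print(knn_predict(np.array([2, 2]), points, labels, k=3))  # → square
print(knn_predict(np.array([2, 2]), points, labels, k=5))  # → triangle
```

With K=3 the two nearby squares outvote one triangle; with K=5 all five points vote and the three triangles win, which is exactly the K-sensitivity discussed below.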

At the same time, the value of K in the KNN algorithm is very important: as the classic two-class illustration shows, the same query point can be classified one way with K=3 and a different way with K=5.

## Algorithm flow

General process of K-nearest neighbor algorithm:
1. Collect data: any method can be used.
2. Prepare data: the value required for distance calculation, preferably in a structured data format.
3. Analyze data: any method can be used.
4. Training algorithm: this step is not applicable to K-nearest neighbor algorithm.
5. Test algorithm: calculate the error rate.
6. Use the algorithm: input sample data in the required structured format, run the K-nearest neighbor algorithm to decide which class the input belongs to, and then apply whatever follow-up processing the application needs.
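Steps 2 and 5 are the only numeric parts of the flow. A minimal sketch of both, using made-up numbers for illustration:

```python
import numpy as np

# Step 2: distance calculation needs numeric, structured features
train = np.array([[1.0, 101.0], [5.0, 89.0], [108.0, 5.0]])
query = np.array([55.0, 20.0])
dists = np.sqrt(np.sum((train - query) ** 2, axis=1))  # Euclidean distance to each row
print(dists)

# Step 5: the error rate is simply the fraction of wrong predictions
predicted = np.array([0, 1, 1, 0])
actual = np.array([0, 1, 0, 0])
error_rate = np.mean(predicted != actual)
print(error_rate)  # → 0.25
```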

## Code practice

- Film category classification
- Cancer prediction
- Iris classification
- Handwritten numeral recognition

### Film category classification

Code

```python
# -*- coding: UTF-8 -*-
import collections
import numpy as np

def createDataSet():
    # Four samples, each with two features
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # Labels for the four samples
    labels = ['romance film', 'romance film', 'action movie', 'action movie']
    return group, labels

def classify(inx, dataset, labels, k):
    # Computing the distance is really just the distance between points
    dist = np.sum((inx - dataset) ** 2, axis=1) ** 0.5
    # Labels of the k nearest samples:
    # dist.argsort() returns the indices that would sort the distances
    # from smallest to largest
    k_labels = [labels[index] for index in dist.argsort()[0:k]]
    print('k_labels', k_labels)
    # The most frequent label among the k neighbors is the final category.
    # collections.Counter provides convenient, fast counting: it tallies the
    # elements and returns a dict-like object whose keys are the elements
    # and whose values are their counts.
    label = collections.Counter(k_labels).most_common(1)[0][0]
    return label

if __name__ == '__main__':
    # Create the dataset
    group, labels = createDataSet()
    # Test sample
    test = [55, 20]
    # kNN classification with k = 3
    test_class = classify(test, group, labels, 3)
    # Print the classification result
    print(test_class)
```

### Iris classification

Scikit-learn (sklearn) has become the most popular Python machine learning library. The algorithms it supports include classification, regression, dimensionality reduction, and clustering, and it also provides modules for feature extraction, data preprocessing, and model evaluation.
https://scikitlearn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

https://sklearn.apachecn.org/docs/master/7.html
A quick sklearn test:

```python
from sklearn.neighbors import KNeighborsClassifier

# NOTE: the training values were lost from the original post; the points
# [[0], [1], ..., [8]] below are a reconstruction consistent with the
# predictions shown in the comments.
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
neigh = KNeighborsClassifier(n_neighbors=3)
# fit() uses X as the training data and y as the target values (the labels)
neigh.fit(X, y)

print(neigh.predict([[1.1]]))
# The query value here is 1.1 and the result is [0], meaning 1.1 falls in class 0

print(neigh.predict([[1.6]]))
print(neigh.predict([[5.2]]))
print(neigh.predict([[7.2]]))
print(neigh.predict([[8.2]]))
```

Code

```python
# -*- coding: utf-8 -*-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

class KNN(object):

    # The iris data has three classes (iris setosa / 0, iris versicolor / 1,
    # iris virginica / 2), 50 samples per class, and four features per sample
    # (sepal length, sepal width, petal length, petal width)
    def get_iris_data(self):
        iris = load_iris()
        iris_data = iris.data
        iris_target = iris.target
        return iris_data, iris_target

    def run(self):
        # 1. Get the iris feature values and target values
        iris_data, iris_target = self.get_iris_data()
        # 2. Split the data into training and test sets; test_size=0.25 means
        #    25% of the data is used as the test set
        x_train, x_test, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.25)
        # 3. Feature engineering (standardize the feature values)
        std = StandardScaler()
        x_train = std.fit_transform(x_train)
        x_test = std.transform(x_test)
        # 4. Feed the algorithm
        knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors defaults to 5; a better value could be found by grid search
        knn.fit(x_train, y_train)        # Fit on the training set
        y_predict = knn.predict(x_test)  # Get the predictions
        # 5. Show the predictions
        labels = ["iris setosa", "iris versicolor", "iris virginica"]
        for i in range(len(y_predict)):
            print("Test %d: true value: %s\t predicted: %s" % ((i + 1), labels[y_test[i]], labels[y_predict[i]]))
        print("Accuracy:", knn.score(x_test, y_test))

if __name__ == '__main__':
    knn = KNN()
    knn.run()
```
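The code above mentions finding the optimal `n_neighbors` by grid search but never performs it. A minimal sketch with `GridSearchCV` on the same iris data (the candidate k values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier()
# Try several candidate values of k with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(knn, param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_)  # the best k found by cross-validation
print(search.best_score_)   # its mean cross-validated accuracy
```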

### Handwritten numeral recognition

A quick look at `str.split`, which is used below to parse class labels out of file names:

```python
s = '[HelloHello]'
print(s.split('o'))   # splits on every 'o' → ['[Hell', 'Hell', ']']
# split returns a list, so index into it rather than calling split again:
print(s.split('[')[1].split(']')[0])  # the content between [ and ] → 'HelloHello'
print(s.split('l'))   # adjacent separators produce empty strings → ['[He', '', 'oHe', '', 'o]']
# File names like '0_13.txt' encode the class before the underscore:
print('0_13.txt'.split('_')[0])  # → '0'
```

Code

```python
# -*- coding: UTF-8 -*-
import numpy as np
from os import listdir
from sklearn.neighbors import KNeighborsClassifier as kNN

def img2vector(filename):
    # Create a 1x1024 zero vector
    returnVect = np.zeros((1, 1024))
    # Each digit file holds 32 lines of 32 characters ('0' or '1')
    with open(filename) as fr:
        for i in range(32):
            # Read one line and append its first 32 characters to returnVect
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    # Return the converted 1x1024 vector
    return returnVect

def handwritingClassTest():
    # Labels of the training set
    hwLabels = []
    # listdir returns the file names in the trainingDigits directory
    trainingFileList = listdir('trainingDigits')
    # Number of files in the folder
    m = len(trainingFileList)
    # Initialize the training matrix: m rows, 1024 columns, all zeros
    trainingMat = np.zeros((m, 1024))
    # Parse the class of each training sample from its file name
    for i in range(m):
        # Get the file name, e.g. '0_13.txt'
        fileNameStr = trainingFileList[i]
        # The class number is the part of the name before the '_'
        classNumber = int(fileNameStr.split('_')[0])
        # Append the class to hwLabels
        hwLabels.append(classNumber)
        # Store the 1x1024 vector of this file in row i of trainingMat
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)

    # Build the kNN classifier
    neigh = kNN(n_neighbors=3, algorithm='auto')
    # Fit the model: trainingMat is the training matrix, hwLabels the labels
    neigh.fit(trainingMat, hwLabels)
    # listdir returns the file names in the testDigits directory
    testFileList = listdir('testDigits')
    # Misclassification count
    errorCount = 0.0
    # Number of test samples
    mTest = len(testFileList)
    # Parse the class of each test sample from its file name and classify it
    for i in range(mTest):
        # Get the file name
        fileNameStr = testFileList[i]
        # Get the true class number
        classNumber = int(fileNameStr.split('_')[0])
        # Get the 1x1024 vector of the test sample
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        # Get the prediction (predict returns an array; take its first element)
        classifierResult = neigh.predict(vectorUnderTest)[0]
        print("Classifier returned %d\t the real answer is %d" % (classifierResult, classNumber))
        if classifierResult != classNumber:
            errorCount += 1.0
    print("Got %d errors\nError rate: %f%%" % (errorCount, errorCount / mTest * 100))

if __name__ == '__main__':
    handwritingClassTest()
```
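The script above depends on local `trainingDigits/` and `testDigits/` folders. If those are not at hand, the same kNN pipeline can be tried on scikit-learn's bundled digits dataset (8x8 grayscale images rather than 32x32 text files); the split parameters below are arbitrary choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(x_train, y_train)
print("Accuracy:", neigh.score(x_test, y_test))
```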

## Credits

Posted on Mon, 27 Sep 2021 09:21:08 -0400 by R0d Longfella