KNN is the k-nearest neighbors algorithm, a classification algorithm. The idea is to select the K nearest neighbors of a sample: each sample can be represented by its K closest neighbors, and among those K samples, the category holding the largest share becomes the predicted category.
There is a saying that birds of a feather flock together. If we want to judge whether a person is good or bad, we can look at whether good or bad people make up the majority of their friends. Although this explanation is far-fetched, it is exactly the principle behind the KNN algorithm.
The purpose of the KNN algorithm is to search for the K nearest samples of known category in order to predict the category of an unknown sample.
At the same time, the value of K in the KNN algorithm is very important. As shown in the figure below, the same query point can be assigned to different categories when K=3 versus K=5.
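To make this sensitivity to K concrete, here is a minimal sketch with made-up one-dimensional data (the sample values, labels, and query point are invented for illustration) where K=3 and K=5 produce different votes:

```python
import collections

# Hypothetical 1-D training data: (feature value, label)
samples = [(1.0, 'A'), (1.2, 'A'), (3.0, 'B'), (3.1, 'B'), (3.2, 'B')]
query = 2.0

def knn_vote(k):
    # Sort samples by distance to the query and keep the k nearest
    nearest = sorted(samples, key=lambda s: abs(s[0] - query))[:k]
    # Majority vote among the k nearest labels
    return collections.Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

print(knn_vote(3))  # 'A' — two of the three nearest samples are class A
print(knn_vote(5))  # 'B' — with all five samples, class B holds the majority
```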
General process of K-nearest neighbor algorithm:
1. Collect data: any method can be used.
2. Prepare data: distance calculations require numeric values, preferably in a structured data format.
3. Analyze data: any method can be used.
4. Training algorithm: this step is not applicable to K-nearest neighbor algorithm.
5. Test algorithm: calculate the error rate.
6. Use the algorithm: first input the sample data and the structured output format, then run the K-nearest neighbor algorithm to determine which class each input sample belongs to, and finally perform any follow-up processing on the computed classification.
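Step 5 above boils down to a simple ratio: the error rate is the fraction of misclassified test samples. A sketch with hypothetical true and predicted labels:

```python
# Hypothetical labels for five test samples
y_true    = [0, 1, 1, 2, 0]
y_predict = [0, 1, 2, 2, 0]

# Count the samples where prediction and truth disagree
errors = sum(t != p for t, p in zip(y_true, y_predict))
error_rate = errors / len(y_true)
print(error_rate)  # 0.2 — one error out of five samples
```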
- Film category classification
- Cancer prediction
- Iris classification
- Handwritten numeral recognition
Film category classification
```python
# -*- coding: UTF-8 -*-
import collections
import numpy as np

def createDataSet():
    # Four samples with two features each
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # Labels for the four samples
    labels = ['affectional film', 'affectional film', 'action movie', 'action movie']
    return group, labels

def classify(inx, dataset, labels, k):
    # Euclidean distance from inx to every sample in the dataset
    dist = np.sum((inx - dataset)**2, axis=1)**0.5
    # argsort returns the indices that would sort dist from smallest to largest;
    # take the labels of the k nearest samples
    k_labels = [labels[index] for index in dist.argsort()[0:k]]
    print('k_labels', k_labels)
    # collections.Counter counts each element and behaves like a dictionary
    # (key = element, value = count); most_common(1) returns the most frequent
    # label with its count, which gives the final category
    label = collections.Counter(k_labels).most_common(1)
    return label

if __name__ == '__main__':
    # Create the dataset
    group, labels = createDataSet()
    # Test sample
    test = [55, 20]
    # kNN classification
    test_class = classify(test, group, labels, 3)
    # Print the classification result
    print(test_class)
```
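Two calls in `classify` do the heavy lifting: `argsort` and `Counter.most_common`. A small standalone demonstration of how they combine into a majority vote:

```python
import collections
import numpy as np

dist = np.array([3.2, 0.5, 2.1, 0.9])
labels = ['action movie', 'affectional film', 'action movie', 'affectional film']

# argsort gives the indices that sort dist ascending: [1, 3, 2, 0]
order = dist.argsort()
# Labels of the 3 nearest samples
k_labels = [labels[i] for i in order[:3]]
# most_common(1) returns a list with the (label, count) pair of the majority label
winner = collections.Counter(k_labels).most_common(1)[0][0]
print(winner)  # affectional film
```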
scikit-learn (sklearn) has become the most popular Python machine learning library. The machine learning algorithms it supports include classification, regression, dimensionality reduction, and clustering, and it also provides modules for feature extraction, data preprocessing, and model evaluation.
```python
from sklearn.neighbors import KNeighborsClassifier

# The original feature values were lost in extraction; nine one-dimensional
# samples, one per label below, are assumed here for illustration
X = [[0], [1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
neigh = KNeighborsClassifier(n_neighbors=3)
# fit uses X as the training data and y as the target values (the labels)
neigh.fit(X, y)
# 1.1 is closest to the class-0 samples, so the prediction is 0
print(neigh.predict([[1.1]]))
print(neigh.predict([[1.6]]))
print(neigh.predict([[5.2]]))
print(neigh.predict([[7.2]]))
print(neigh.predict([[8.2]]))
```
```python
# -*- coding: utf-8 -*-
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

class KNN(object):
    # The iris data has three classes (setosa / 0, versicolor / 1, virginica / 2),
    # 50 samples per class, and four features per sample
    # (sepal length, sepal width, petal length, petal width)
    def get_iris_data(self):
        iris = load_iris()
        iris_data = iris.data
        iris_target = iris.target
        return iris_data, iris_target

    def run(self):
        # 1. Obtain the feature values and target values of the iris data
        iris_data, iris_target = self.get_iris_data()
        # 2. Split the data into training and test sets;
        #    test_size=0.25 means 25% of the data is used as the test set
        x_train, x_test, y_train, y_test = train_test_split(iris_data, iris_target, test_size=0.25)
        # 3. Feature engineering (standardize the feature values)
        std = StandardScaler()
        x_train = std.fit_transform(x_train)
        x_test = std.transform(x_test)
        # 4. Feed the algorithm; n_neighbors defaults to 5, and the optimal
        #    value can be found later through grid search
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(x_train, y_train)
        # Send the test set into the algorithm and get the predictions
        y_predict = knn.predict(x_test)
        # Display the prediction results
        labels = ["setosa", "versicolor", "virginica"]
        for i in range(len(y_predict)):
            print("Test %d: true value: %s\t predicted: %s" % ((i + 1), labels[y_test[i]], labels[y_predict[i]]))
        print("Accuracy:", knn.score(x_test, y_test))

if __name__ == '__main__':
    knn = KNN()
    knn.run()
```
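The remark about finding the optimal `n_neighbors` through grid search can be made concrete with `GridSearchCV` from `sklearn.model_selection`. A sketch, assuming the same iris split and standardization as above (with a fixed `random_state` added for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)
std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

# Search over candidate values of n_neighbors with 5-fold cross-validation
param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}
gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
gs.fit(x_train, y_train)
print(gs.best_params_)              # the k that scored best in cross-validation
print(gs.score(x_test, y_test))    # accuracy of the refit best model on the test set
```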
`str.split` returns a list of the pieces between every occurrence of the separator; the handwriting example below relies on this to read the class number that precedes the underscore in the file name.

```python
s = '[HelloHello]'
print(s.split('o'))   # ['[Hell', 'Hell', ']'] — split at every 'o'
print(s.split('['))   # ['', 'HelloHello]'] — split returns a list, so .split cannot be chained on the result
print(s.split('l'))   # ['[He', '', 'oHe', '', 'o]'] — adjacent separators produce empty strings
```
```python
# -*- coding: UTF-8 -*-
import numpy as np
from os import listdir
from sklearn.neighbors import KNeighborsClassifier as kNN

def img2vector(filename):
    # Create a 1x1024 zero vector
    returnVect = np.zeros((1, 1024))
    # Open the file and read it line by line
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        # Append the first 32 characters of each line to returnVect in turn
        for j in range(32):
            returnVect[0, 32*i+j] = int(lineStr[j])
    # Return the converted 1x1024 vector
    return returnVect

def handwritingClassTest():
    # Labels of the training set
    hwLabels = []
    # listdir returns the file names in the trainingDigits directory
    trainingFileList = listdir('trainingDigits')
    # Number of files in the folder
    m = len(trainingFileList)
    # Initialize the training matrix: m rows, 1024 columns, all zeros
    trainingMat = np.zeros((m, 1024))
    # Parse the category of each training sample from its file name
    for i in range(m):
        # Get the name of the file
        fileNameStr = trainingFileList[i]
        # The class number is the part of the file name before the '_'
        classNumber = int(fileNameStr.split('_')[0])
        # Add the obtained category to hwLabels
        hwLabels.append(classNumber)
        # Store the 1x1024 vector of each file in row i of trainingMat
        trainingMat[i, :] = img2vector('trainingDigits/%s' % (fileNameStr))
    # Construct the kNN classifier
    neigh = kNN(n_neighbors=3, algorithm='auto')
    # Fit the model: trainingMat is the training matrix, hwLabels the corresponding labels
    neigh.fit(trainingMat, hwLabels)
    # Return the file list in the testDigits directory
    testFileList = listdir('testDigits')
    # Error count
    errorCount = 0.0
    # Number of test samples
    mTest = len(testFileList)
    # Parse the category of each test sample from its file name and classify it
    for i in range(mTest):
        # Get the name of the file
        fileNameStr = testFileList[i]
        # Get the class number
        classNumber = int(fileNameStr.split('_')[0])
        # Convert the test file to a 1x1024 vector
        vectorUnderTest = img2vector('testDigits/%s' % (fileNameStr))
        # Obtain the prediction
        classifierResult = neigh.predict(vectorUnderTest)[0]
        print("Classifier result: %d\t true result: %d" % (classifierResult, classNumber))
        if classifierResult != classNumber:
            errorCount += 1.0
    print("Misclassified %d samples\nError rate: %f%%" % (errorCount, errorCount / mTest * 100))

if __name__ == '__main__':
    handwritingClassTest()
```
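Since the trainingDigits/testDigits files are not included here, the `img2vector` logic can still be checked against a synthetic text image. The file content below is an assumption about the dataset's format: 32 lines of 32 characters, each '0' or '1'.

```python
import os
import tempfile
import numpy as np

# Build a synthetic 32x32 "digit": the first four rows are all ones
rows = ['1' * 32 if i < 4 else '0' * 32 for i in range(32)]
path = os.path.join(tempfile.mkdtemp(), '0_0.txt')
with open(path, 'w') as f:
    f.write('\n'.join(rows))

# Same flattening logic as img2vector: 32 lines of 32 chars -> 1x1024 vector
vect = np.zeros((1, 1024))
with open(path) as f:
    for i in range(32):
        line = f.readline()
        for j in range(32):
            vect[0, 32 * i + j] = int(line[j])

print(vect.shape)        # (1, 1024)
print(int(vect.sum()))   # 128 — four full rows of ones
```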