1, K-nearest neighbor algorithm
KNN (k-nearest neighbors) is a supervised learning algorithm used for classification. Unlike clustering, it relies on training data whose labels are known. The idea is that the category of an object can be judged from the categories of the K known objects that are most similar to it. Given a new test sample, the algorithm finds the k samples in the training set that are closest to it under some distance measure, and predicts the category of the new sample from the labels of those k samples. Usually, the label that occurs most often among the k samples is taken as the predicted label of the new sample.
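The voting step described above can be sketched in a few lines. This is only an illustration, not the implementation used later in this article; the function name majority_vote and the sample labels are made up for the example.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the label that occurs most often among the k nearest neighbors."""
    # most_common(1) returns a one-element list holding the (label, count)
    # pair with the highest count.
    label, _count = Counter(neighbor_labels).most_common(1)[0]
    return label

# Hypothetical labels of the k = 5 training samples nearest to a new sample
nearest = ['A', 'B', 'A', 'A', 'B']
print(majority_vote(nearest))  # A
```

With an even k a tie is possible; in this sketch the tied label that was seen first wins, which is one reason an odd k is often chosen for two-class problems.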
2, Importing data using Python
First create a KNN.py file and enter the following code in the file:
    import numpy as np   # Import the NumPy library
    import operator      # Import the operator module

    # Define a data set containing sample data and the corresponding class labels
    def createDataSet():
        group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return group, labels
Save the KNN.py file, change the current path to the location where KNN.py is stored, and open the Python development environment.
Enter the following code to import the KNN module:
    import KNN
    group, labels = KNN.createDataSet()
    print(group)
    print(labels)

Output is:

    [[1.  1.1]
     [1.  1. ]
     [0.  0. ]
     [0.  0.1]]
    ['A', 'A', 'B', 'B']
3, Implementing the KNN classification algorithm
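The code is as follows:

    # Requires "import numpy as np" and "import operator" at the top of KNN.py
    def calssify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]                 # number of training samples
        # Difference between inX and every training sample
        diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
        sqDiffMat = diffMat ** 2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5                 # Euclidean distances
        sortedDistIndicies = distances.argsort()       # indices from nearest to farthest
        classCount = {}
        for i in range(k):                             # vote among the k nearest samples
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        # Sort the labels by vote count, highest first
        sortclassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortclassCount[0][0]

That is, the Euclidean distance between the input point and every training point is calculated, the distances are sorted in ascending order, and the k nearest points vote on the label. The label with the most votes is returned.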
Euclidean distance formula, for two points x and y with n features:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
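As a quick check of the distance computation, the sketch below applies the Euclidean formula to the four points from createDataSet. The query point [0, 0] is an assumption chosen only for illustration, and the subtraction relies on NumPy broadcasting instead of np.tile.

```python
import numpy as np

# The four training points from createDataSet
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
# Hypothetical query point, chosen only for illustration
query = np.array([0.0, 0.0])

# Broadcasting subtracts the query from every row, so no explicit tiling is needed
diff = group - query
distances = np.sqrt((diff ** 2).sum(axis=1))
print(distances)

# argsort returns the row indices ordered from nearest to farthest
order = distances.argsort()
print(order)  # [2 3 1 0]
```

The two points labeled 'B' (rows 2 and 3) are nearest to the query, so with k = 3 the vote would be two 'B' against one 'A'.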
Text data: [image not preserved]
4, Example:
Helen has been using online dating sites to find the right date for herself. Although the dating sites recommend different people, she doesn't like everyone. After summing up, she found that she had met three types of people:
People she doesn't like
People of average charm
Very charming people
Despite finding this pattern, Helen was still unable to sort the matches recommended by dating websites into the appropriate categories. She thinks she can date people of average charm from Monday to Friday, and she prefers to keep company with very charming people on weekends. Helen hopes that our classification software can help her sort matches into the right categories. In addition, Helen has collected some data that is not recorded on the dating websites. She believes that this data will help to classify the matches.
1. Parse data from text file:
Helen has been collecting dating data for some time. She stores the data in the text file datingTestSet.txt.
Each sample occupies one row, and there are 1000 rows in total. Each of Helen's samples contains the following three features:
frequent flyer miles obtained each year
percentage of time spent playing video games
litres of ice cream consumed per week
Before the feature data above can be passed to the classifier, it must be converted into a format the classifier accepts. Create a function called files2matrix in KNN.py to handle the input format. The input of this function is the file name as a string, and the output is the training sample matrix and the class label vector.
    # Requires "import numpy as np" at the top of KNN.py
    def files2matrix(filename):
        with open(filename) as fr:
            arraylines = fr.readlines()
        lenoflines = len(arraylines)          # number of lines; each line is one sample
        matrix = np.zeros((lenoflines, 3))    # zero matrix with lenoflines rows and 3 columns
        labelmatrix = []                      # empty list for the class labels
        index = 0
        for line in arraylines:
            line = line.strip()
            listFromline = line.split('\t')
            matrix[index, :] = listFromline[0:3]        # first three fields are the features
            labelmatrix.append(int(listFromline[-1]))   # last field is the integer label
            index += 1
        return matrix, labelmatrix
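For a purely numeric, tab-separated file, NumPy's np.loadtxt can do the same parsing in one call. The sketch below is an alternative, not the method used in this article. It writes three made-up rows to a temporary file so that it runs without the dating data file, and the name load_dating_file is hypothetical.

```python
import os
import tempfile
import numpy as np

def load_dating_file(filename):
    """Parse a tab-separated file with three feature columns and an integer label."""
    data = np.loadtxt(filename, delimiter='\t', ndmin=2)
    features = data[:, 0:3]                     # first three columns
    labels = data[:, -1].astype(int).tolist()   # last column as a list of ints
    return features, labels

# Made-up rows in the assumed format: miles, game-time %, ice cream, label
sample = "1000\t5.0\t0.5\t1\n20000\t2.5\t1.5\t2\n30000\t10.0\t0.25\t3\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

features, labels = load_dating_file(path)
os.remove(path)
print(features.shape)  # (3, 3)
print(labels)          # [1, 2, 3]
```

This only works when every field is numeric; a file whose last column holds text labels would still need line-by-line parsing as in files2matrix.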
2. Analyze data: use Matplotlib to create scatter chart
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt

    # Only needed when the chart labels contain Chinese characters
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']

    datingDataMat, datingLabels = files2matrix('datingTestSet2.txt')

    fig = plt.figure()
    ax = fig.add_subplot(111)   # a single subplot: 1 row, 1 column, first position
    # Columns 0 and 1; marker size and color both vary with the class label
    ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
               15 * np.array(datingLabels), 15 * np.array(datingLabels))
    ax.set_xlabel('Frequent flyer miles earned each year')
    ax.set_ylabel('Percentage of time spent playing video games')
    plt.show()
3. The scatter call without color marks:
    ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1])
Output without color marks: [figure not preserved]
Output after adding color marks: [figure not preserved]
5. Prepare data: normalize the feature values. Code:

    import numpy as np

    def autoNorm(dataSet):
        minVals = dataSet.min(0)    # column-wise minimum
        maxVals = dataSet.max(0)    # column-wise maximum
        ranges = maxVals - minVals
        m = dataSet.shape[0]        # m is the number of rows in dataSet
        normDataSet = dataSet - np.tile(minVals, (m, 1))
        normDataSet = normDataSet / np.tile(ranges, (m, 1))   # element-wise division
        return normDataSet, ranges, minVals
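The function above applies min-max scaling, newValue = (oldValue - min) / (max - min), to each column, so that a feature with large numbers such as flyer miles does not dominate the distance. Since NumPy broadcasts a row vector across a matrix, the same result can be obtained without np.tile. A minimal sketch with made-up numbers; the name auto_norm_broadcast is hypothetical, and it assumes every column has at least two distinct values so that no range is zero.

```python
import numpy as np

def auto_norm_broadcast(data_set):
    """Scale every column of data_set to the range [0, 1]."""
    min_vals = data_set.min(axis=0)
    max_vals = data_set.max(axis=0)
    ranges = max_vals - min_vals
    # Broadcasting applies the row vectors min_vals and ranges to every row
    norm = (data_set - min_vals) / ranges
    return norm, ranges, min_vals

# Made-up data: two features on very different scales
data = np.array([[0.0, 10.0], [50.0, 20.0], [100.0, 40.0]])
norm, ranges, min_vals = auto_norm_broadcast(data)
print(norm)
print(ranges)
```

A new sample must be scaled with the same min_vals and ranges before it is classified, which is why the function returns them.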
6. Test algorithm code:
    def datingClassTest():
        hoRatio = 0.50    # share of the data held out for testing
        datingDataMat, datingLabels = files2matrix('datingTestSet2.txt')
        normMat, ranges, minVals = autoNorm(datingDataMat)   # normalize before classifying
        m = normMat.shape[0]    # number of rows of normMat
        numTestVecs = int(m * hoRatio)
        errorCount = 0.0
        for i in range(numTestVecs):
            # The first numTestVecs rows are test data, the rest is training data
            classifierResult = calssify0(normMat[i, :], normMat[numTestVecs:m, :],
                                         datingLabels[numTestVecs:m], 3)
            print("the classifier came back with: %d, the real answer is: %d"
                  % (classifierResult, datingLabels[i]))
            if classifierResult != datingLabels[i]:
                errorCount += 1.0
        print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
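The hold-out procedure can be tried without the data file. The sketch below is an illustration under assumptions: it uses a compact k-NN function (knn_predict, a hypothetical name) and eight made-up points in two well-separated groups. Like datingClassTest, it takes the first rows as test data and the remaining rows as training data, then reports the error rate.

```python
from collections import Counter
import numpy as np

def knn_predict(in_x, data_set, labels, k):
    """Classify in_x by majority vote among its k nearest rows of data_set."""
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

def holdout_error_rate(data, labels, ho_ratio, k):
    """Use the first ho_ratio share of the rows as test data, the rest for training."""
    m = data.shape[0]
    num_test = int(m * ho_ratio)
    errors = 0
    for i in range(num_test):
        predicted = knn_predict(data[i], data[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / num_test

# Made-up data: label 1 clusters near (0, 0), label 2 clusters near (1, 1)
data = np.array([[0.1, 0.0], [0.9, 1.0],                 # test rows
                 [0.0, 0.1], [0.0, 0.0], [0.1, 0.1],     # training rows, label 1
                 [1.0, 1.0], [1.0, 0.9], [0.9, 0.9]])    # training rows, label 2
labels = [1, 2, 1, 1, 1, 2, 2, 2]
print(holdout_error_rate(data, labels, ho_ratio=0.25, k=3))  # 0.0
```

Taking the first rows as the test set is only reasonable when the rows are in no particular order; otherwise the data should be shuffled first.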
7. Use algorithm: build a complete and available system
We have tested the classifier on the data above, and now we can finally use it to classify people for Helen. We will give Helen a short program: she finds someone on the dating website and enters his information, and the program predicts how much she will like that person.
    def classifyperson():
        resultList = ['not at all', 'in small doses', 'in large doses']
        percentTats = float(input("percentage of time spent playing video games?"))
        ffMiles = float(input("frequent flier miles earned per year?"))
        iceCream = float(input("liters of ice cream consumed per week?"))
        datingDatMat, datingLabels = files2matrix('datingTestSet2.txt')   # read file
        normMat, ranges, minVals = autoNorm(datingDatMat)                 # data normalization
        inArr = np.array([ffMiles, percentTats, iceCream])                # features in column order
        # Scale the input with the training minimums and ranges before classifying
        classifierResult = calssify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
        print("You will probably like this person:", resultList[classifierResult - 1])

Output results:

    classifyperson()
    percentage of time spent playing video games?100
    frequent flier miles earned per year?90
    liters of ice cream consumed per week?10
    You will probably like this person: in large doses
Note: the code above comes from the book Machine Learning in Action.