Book: Machine Learning in Action
Author: Peter Harrington
Advantages and disadvantages of K-nearest neighbor algorithm
- Advantages: High accuracy, insensitive to outliers, no data input assumptions.
- Disadvantages: high computational and space complexity.
- Applicable data range: numeric and nominal.
General flow of the K-nearest neighbor algorithm
- Collect data: You can use any method.
- Prepare the data: Numeric values are needed for the distance calculations, preferably in a structured format.
- Analyze the data: You can use any method.
- Train the algorithm: This step does not apply to the K-nearest neighbor algorithm.
- Test the algorithm: Calculate the error rate.
- Use the algorithm: First obtain the input sample data and structured output, then run the K-nearest neighbor algorithm to determine which class the input belongs to, and finally apply any subsequent processing to the computed class.
code analysis
First paragraph
    # from numpy import *
    import numpy as np
    import operator

    def createDataSet():
        # group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
        group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])  # Data
        labels = ['A','A','B','B']                             # Corresponding labels
        return group, labels

- The first line of code in the original book is from numpy import *. I changed it to import numpy as np because, with the wildcard import, a reader who is not familiar with numpy could easily mistake numpy's array function for a Python built-in, whereas np.array makes it obvious that array comes from numpy. import numpy as np imports the numpy module and binds it to the shorter alias np to simplify writing.
- group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]]) creates a two-dimensional numpy array of shape $4\times 2$.
The second paragraph
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        diffMat = np.tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances**0.5
        sortedDistIndicies = distances.argsort()
        classCount={}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
        # Python 3 uses items(); the book's Python 2 code calls classCount.iteritems() here
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
- Four input parameters for the classify0 function:
- inX: the input vector to classify
- dataSet: the training samples, a two-dimensional matrix whose number of rows is the number of training samples and whose number of columns is the number of features
- labels: the label vector, with as many elements as dataSet has rows
- k: the number of nearest neighbors, a scalar
- dataSetSize = dataSet.shape[0] assigns the number of training samples (rows) to the variable dataSetSize.
- np.tile(array, (i, j)) is a function in the numpy module that repeats array i times in the row direction and j times in the column direction. If a scalar j is passed instead of (i, j), a one-dimensional array is simply repeated j times along its single axis and the result stays one-dimensional.
    >>> import numpy
    >>> numpy.tile([0,0],5)
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> numpy.tile([0,0],(1,1))
    array([[0, 0]])
    >>> numpy.tile([0,0],(3,1))
    array([[0, 0],
           [0, 0],
           [0, 0]])
    >>> numpy.tile([0,0],(1,3))
    array([[0, 0, 0, 0, 0, 0]])
    >>> numpy.tile([0,0],(2,2))
    array([[0, 0, 0, 0],
           [0, 0, 0, 0]])
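As a side note, the tile call in classify0 only exists to give the input vector the same shape as dataSet before subtracting. NumPy broadcasting produces the same result without the explicit copy. The following is a minimal sketch; the sample values and the snake_case variable names are made up for illustration and are not from the book.

```python
import numpy as np

# Illustrative data (an assumption, not the book's dataset)
data_set = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
in_x = np.array([5.0, 2.0])

# Explicit tiling, as classify0 does it
diff_tiled = np.tile(in_x, (data_set.shape[0], 1)) - data_set

# Broadcasting: the 1-D vector is stretched across the rows automatically
diff_broadcast = in_x - data_set

print(np.array_equal(diff_tiled, diff_broadcast))  # True
```

Both forms are equivalent here; the tiled version simply makes the shape matching visible.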
- What exactly does diffMat = np.tile(inX, (dataSetSize,1)) - dataSet do? Recall the Euclidean distance formula between two points:
$$d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$$
This line of code computes the differences $[(x_1-x_2),(y_1-y_2)]$ that appear in the Euclidean distance formula, between the point to be classified and every training sample point. An example makes this concrete:
Assume there are three training samples (1,1), (2,2), (3,3) and the sample to be classified is (5,2). The code first repeats (5,2) three times in the row direction, making it a 3-row, 2-column matrix with the same shape as dataSet, and then subtracts dataSet from it, that is:
$$\begin{bmatrix} 5 & 2 \\ 5 & 2 \\ 5 & 2 \end{bmatrix}-\begin{bmatrix} 1 & 1 \\ 2 & 2 \\ 3 & 3 \end{bmatrix}=\begin{bmatrix} 5-1 & 2-1 \\ 5-2 & 2-2 \\ 5-3 & 2-3 \end{bmatrix}$$
- diffMat**2 squares each element of the matrix diffMat; it does not mathematically multiply the matrix by itself.
- After the code sqDiffMat = diffMat**2, the matrix becomes:
$$\begin{bmatrix} (5-1)^2 & (2-1)^2 \\ (5-2)^2 & (2-2)^2 \\ (5-3)^2 & (2-3)^2 \end{bmatrix}$$
- sqDistances = sqDiffMat.sum(axis=1) sums the elements of each row, and distances = sqDistances**0.5 then takes the square root of each row sum. The result is not a single number but one Euclidean distance per training sample, that is:
$$\begin{bmatrix} \sqrt{(5-1)^2+(2-1)^2} \\ \sqrt{(5-2)^2+(2-2)^2} \\ \sqrt{(5-3)^2+(2-3)^2} \end{bmatrix}=\begin{bmatrix} \sqrt{17} \\ 3 \\ \sqrt{5} \end{bmatrix}$$
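The distance steps for the example with training samples (1,1), (2,2), (3,3) and input (5,2) can be traced directly. This is a small sketch that follows the same four lines as classify0; only the snake_case variable names are my own.

```python
import numpy as np

data_set = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)
in_x = np.array([5, 2], dtype=float)

diff_mat = np.tile(in_x, (data_set.shape[0], 1)) - data_set  # row-wise differences
sq_diff_mat = diff_mat ** 2                                   # element-wise squares
sq_distances = sq_diff_mat.sum(axis=1)                        # one sum per row: 17, 9, 5
distances = sq_distances ** 0.5                               # sqrt(17), 3, sqrt(5)

print(sq_distances)
print(distances)
```

Note that axis=1 is what keeps the three samples separate; summing without an axis would collapse everything into a single number.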
- argsort is available in numpy both as the function np.argsort(array) and as the method array.argsort(). It returns the indices that would sort the array from smallest to largest. For a multidimensional array, the axis parameter selects the axis to sort along. If the array is negated first, as in np.argsort(-x), the returned indices sort the array from largest to smallest.

    >>> x = np.array([3, 1, 2])
    >>> np.argsort(x)
    array([1, 2, 0])
    >>> x = np.array([[0, 3], [2, 2]])
    >>> x
    array([[0, 3],
           [2, 2]])
    >>> np.argsort(x, axis=0)  # Sort down each column
    array([[0, 1],
           [1, 0]])
    >>> np.argsort(x, axis=1)  # Sort along each row (the two equal values keep their original order here)
    array([[0, 1],
           [0, 1]])
    >>> x = np.array([3, 1, 2])
    >>> np.argsort(x)  # Sort in ascending order
    array([1, 2, 0])
    >>> np.argsort(-x)  # Sort in descending order
    array([0, 2, 1])
    >>> x[np.argsort(x)]
    array([1, 2, 3])
    >>> x[np.argsort(-x)]
    array([3, 2, 1])
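In classify0, the sorted indices are what connect distances back to labels. The sketch below applies argsort to the three distances of the (5,2) example; the labels list and k = 2 are hypothetical values chosen only for illustration.

```python
import numpy as np

# Distances from (5,2) to (1,1), (2,2), (3,3)
distances = np.array([17 ** 0.5, 3.0, 5 ** 0.5])
labels = ['A', 'B', 'B']  # hypothetical labels for the three samples
k = 2

sorted_indices = distances.argsort()    # indices from nearest to farthest
nearest = sorted_indices[:k]            # indices of the k nearest samples
nearest_labels = [labels[i] for i in nearest]

print(nearest.tolist(), nearest_labels)  # [2, 1] ['B', 'B']
```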
- classCount={} defines a dictionary classCount whose keys are the distinct label values among the k nearest neighbors and whose values are the number of times each label appears among those k neighbors. For example, if the nearest eight neighbors have the labels ['A','A','A','A','A','B','B','C'], then classCount = {'A':5,'B':2,'C':1}.
- The loop that follows builds the classCount dictionary described above, and the next line sorts it in descending order by value, because the K-nearest neighbor algorithm needs the label that appears most often among the k neighbors. After sorting, the label of the first (largest) entry is returned as the result of the function. Several points in this loop are worth noting:
- The dictionary method .get(key, default) returns the value stored for key. If the key is not in the dictionary, it returns the given default, or None when no default is given. Setting the default to 0 is a neat trick here: a label that has not been seen yet is treated as having a count of 0, so the first vote sets it to 1 without a separate existence check.
- sorted(iterable, key=None, reverse=False) parameter description (the Python 2 signature also had a cmp parameter, which Python 3 removed):
- iterable: the iterable object to sort
- key: a function of one argument that is applied to each element of the iterable; the elements are ordered by the values it returns
- reverse: sort order, reverse = True for descending, reverse = False for ascending (default)
- itemgetter is a function in the operator module. It returns a callable that fetches the item(s) at the given index or key from whatever object it is called on.
Passing an itemgetter as the key parameter of sorted lets the dictionary's key-value pairs be sorted by key (itemgetter(0)) or by value (itemgetter(1)).

    >>> from operator import itemgetter
    >>> a = [1,2,3,4,5]
    >>> b = itemgetter(0)
    >>> b(a)
    1
    >>> c = itemgetter(0,1,2)
    >>> c(a)
    (1, 2, 3)
- In Python 2, items() returns the key-value pairs of the dictionary as a list and iteritems() returns them as an iterator. In Python 3, iteritems() no longer exists and items() returns a view of the key-value pairs, so Python 3 code should call items().
- sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True) sorts the key-value pairs in descending order by the value of the dictionary. Under Python 3, write classCount.items() instead of classCount.iteritems().
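The voting step can be tried in isolation. The following Python 3 sketch uses the eight-neighbor example from above; the variable names are my own, and the collections.Counter line is an alternative I am adding for comparison, not something the book uses.

```python
import operator
from collections import Counter

# Labels of the eight nearest neighbors (example values)
neighbor_labels = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C']

# Count votes with dict.get, as classify0 does
class_count = {}
for label in neighbor_labels:
    class_count[label] = class_count.get(label, 0) + 1

# Sort the (label, count) pairs by count, largest first
sorted_class_count = sorted(class_count.items(),
                            key=operator.itemgetter(1), reverse=True)
winner = sorted_class_count[0][0]

# Alternative from the standard library (an addition, not the book's code)
counter_winner = Counter(neighbor_labels).most_common(1)[0][0]

print(class_count)             # {'A': 5, 'B': 2, 'C': 1}
print(winner, counter_winner)  # A A
```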
Paragraph 3
    def file2matrix(filename):
        # The text file has four tab-separated columns: the first three are sample features, the fourth is the label
        with open(filename) as fr:
            arrayOLines = fr.readlines()                    # Read all lines once (a second fr.readlines() call would return an empty list)
        numberOfLines = len(arrayOLines)                    # Number of lines in the text
        returnMat = np.zeros((numberOfLines,3))             # All-zero matrix with numberOfLines rows and 3 columns
        classLabelVector = []                               # Label list that stores the label of each sample
        index = 0                                           # Row index, starting at 0
        for line in arrayOLines:
            line = line.strip()                             # Remove whitespace on both sides of the line
            listFromLine = line.split('\t')                 # Split the line on tab delimiters
            returnMat[index,:] = listFromLine[0:3]          # Store the three features of the sample in row index of the matrix
            classLabelVector.append(int(listFromLine[-1]))  # Append the label of the sample to the label list
            index += 1
        return returnMat,classLabelVector
Paragraph 4
    def autoNorm(dataSet):
        minVals = dataSet.min(0)                          # Minimum of each column of the dataset
        maxVals = dataSet.max(0)                          # Maximum of each column of the dataset
        ranges = maxVals - minVals                        # Range (max - min) of each column
        normDataSet = np.zeros(dataSet.shape)             # All-zero matrix of the same shape as the dataset
        m = dataSet.shape[0]                              # Number of samples (rows) in the dataset
        normDataSet = dataSet - np.tile(minVals, (m,1))   # Subtract the column minimum from every element
        normDataSet = normDataSet/np.tile(ranges, (m,1))  # Divide by the column range (min-max normalization, formula below)
        return normDataSet, ranges, minVals
- Usage of min(0), max(0) in matrices:
- min(0) returns the minimum value of each column in the matrix
- min(1) returns the minimum value of each row in the matrix
- max(0) returns the maximum value of each column in the matrix
- max(1) returns the maximum value of each row in the matrix
- Min-max normalization formula:
$$\text{newValue}=\dfrac{\text{oldValue}-\text{min}}{\text{max}-\text{min}}$$
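The normalization formula can be checked on a tiny matrix. This is a minimal sketch with made-up numbers; it uses broadcasting instead of np.tile, which gives the same result. Note that a column whose values are all equal would have a range of 0 and cause a division by zero; neither this sketch nor autoNorm guards against that.

```python
import numpy as np

# Made-up data: two features on very different scales
data = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [30.0, 800.0]])

min_vals = data.min(0)            # column minima: 10, 200
max_vals = data.max(0)            # column maxima: 30, 800
ranges = max_vals - min_vals      # column ranges: 20, 600

norm = (data - min_vals) / ranges  # every column now lies in [0, 1]

# A new sample has to be scaled with the same min_vals and ranges
new_sample = np.array([25.0, 500.0])
new_norm = (new_sample - min_vals) / ranges

print(norm)
print(new_norm)
```

Scaling new samples with the training minima and ranges is presumably why autoNorm returns ranges and minVals alongside the normalized matrix.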
Paragraph 5
    def datingClassTest():
        hoRatio = 0.10                                        # Fraction of the data held out for testing
        # Call file2matrix to read the text: the first three columns go to datingDataMat, the last column to datingLabels
        datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
        # Call autoNorm for min-max normalization; it returns the normalized matrix, the range of each column, and the minimum of each column
        normMat, ranges, minVals = autoNorm(datingDataMat)
        m = normMat.shape[0]                                  # Number of samples (rows) in the dataset
        numTestVecs = int(m*hoRatio)                          # 10% of the samples are used for testing
        errorCount = 0.0                                      # Initialize the number of prediction errors
        for i in range(numTestVecs):
            # Call classify0: the first numTestVecs rows are test samples, the remaining rows are the training set
            classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
            print(classifierResult, datingLabels[i])          # Show the predicted label and the real label
            if (classifierResult != datingLabels[i]): errorCount += 1.0  # If the two differ, add 1 to the number of errors
        print(errorCount/float(numTestVecs))                  # Print the classification error rate
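The hold-out logic of datingClassTest can be exercised without the dating data file. The sketch below is an illustration under assumptions: the data is synthetic and already scaled to [0, 1], the classify0 here is a compact re-implementation written for this example, and taking the first 10% of rows as the test set is only fair if the rows are in no particular order.

```python
import numpy as np

def classify0(in_x, data_set, labels, k):
    """Compact kNN vote written for this sketch (same idea as classify0 above)."""
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    votes = {}
    for i in distances.argsort()[:k]:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)

# Synthetic data: class 1 near (0, 0), class 2 near (1, 1), interleaved
points, labels = [], []
for i in range(10):
    points.append([0.01 * i, 0.01 * i])
    labels.append(1)
    points.append([1.0 - 0.01 * i, 1.0 - 0.01 * i])
    labels.append(2)
norm_mat = np.array(points)

ho_ratio = 0.10
m = norm_mat.shape[0]
num_test_vecs = int(m * ho_ratio)   # first 10% of rows are the test set

error_count = 0
for i in range(num_test_vecs):
    result = classify0(norm_mat[i], norm_mat[num_test_vecs:m],
                       labels[num_test_vecs:m], 3)
    if result != labels[i]:
        error_count += 1

error_rate = error_count / num_test_vecs
print(error_rate)  # 0.0 on this well-separated toy data
```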
The raw_input() function (Python 2) reads a line from the user and returns it as a string with the trailing line break stripped. It was renamed input() in Python 3.0 and later versions.
In Python 2, the basic difference between raw_input and input is that raw_input always returns a string, whereas input evaluates what the user types, so entering a number yields a numeric value rather than a string. In Python 3, input() always returns a string, so numeric input has to be converted explicitly, for example with float().
Paragraph 6
    def img2vector(filename):
        returnVect = np.zeros((1,1024))       # Create a 1 x 1024 row vector (array)
        with open(filename) as fr:
            for i in range(32):               # Read the first 32 lines of the file
                lineStr = fr.readline()
                for j in range(32):           # Store the first 32 characters of each line in the vector, 32 * 32 = 1024
                    returnVect[0,32*i+j] = int(lineStr[j])
        return returnVect
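To see how img2vector flattens an image, it can be run on a small generated file. This is a sketch under assumptions: the 32 x 32 "image" below (a single row of ones) is invented for the demonstration and written to a temporary file, and img2vector is repeated so the block is self-contained.

```python
import os
import tempfile
import numpy as np

def img2vector(filename):
    return_vect = np.zeros((1, 1024))          # 1 x 1024 row vector
    with open(filename) as fr:
        for i in range(32):                    # first 32 lines
            line_str = fr.readline()
            for j in range(32):                # first 32 characters per line
                return_vect[0, 32 * i + j] = int(line_str[j])
    return return_vect

# Hypothetical image: all zeros except row 5, which is all ones
rows = ['0' * 32 for _ in range(32)]
rows[5] = '1' * 32
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(rows) + '\n')
    path = tmp.name

vec = img2vector(path)
os.remove(path)

print(vec.shape)              # (1, 1024)
print(int(vec.sum()))         # 32
print(vec[0, 160:192].sum())  # row 5 occupies positions 160..191
```

Pixel (i, j) lands at position 32*i + j, so each image row occupies a contiguous run of 32 entries in the vector.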
Seventh paragraph
    import os

    def handwritingClassTest():
        hwLabels = []                                     # Create the label list
        trainingFileList = os.listdir('trainingDigits')   # List of the names of the files (or folders) under the trainingDigits folder
        m = len(trainingFileList)                         # Number of files (or folders) under the trainingDigits folder
        trainingMat = np.zeros((m,1024))                  # All-zero matrix with m rows and 1024 columns; each row will hold one image
        for i in range(m):                                # Traverse the m files under the training folder
            fileNameStr = trainingFileList[i]             # File name
            fileStr = fileNameStr.split('.')[0]           # Remove the .txt suffix from the file name
            classNumStr = int(fileStr.split('_')[0])      # Split on '_' and convert the first part to an integer label, for example "9_45" -> "9" -> 9
            hwLabels.append(classNumStr)                  # Add the label to the label list
            trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)  # Store the 1024 values (32*32) of file i in row i of trainingMat
        testFileList = os.listdir('testDigits')           # List of the names of the files (or folders) under the testDigits folder
        errorCount = 0.0                                  # Initialize the number of prediction errors
        mTest = len(testFileList)                         # Number of files (or folders) under the testDigits folder
        for i in range(mTest):                            # Traverse the mTest files under the test folder
            fileNameStr = testFileList[i]
            fileStr = fileNameStr.split('.')[0]
            classNumStr = int(fileStr.split('_')[0])
            vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
            classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)  # Classification result
            print(classifierResult, classNumStr)          # Show the predicted label and the real label
            if (classifierResult != classNumStr): errorCount += 1.0  # If the two differ, add 1 to the number of errors
        print(errorCount)                                 # Print the number of classification errors
        print(errorCount/float(mTest))                    # Print the classification error rate
- os.listdir(path) is a function in the os module that returns a list of the names of the files and folders contained in the specified path.
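The label extraction from file names can be isolated into a small helper. The sketch below is an illustration: label_from_filename is a hypothetical name of my own, and it uses os.path.splitext instead of split('.'), which behaves the same for names such as "9_45.txt".

```python
import os

def label_from_filename(file_name_str):
    """Hypothetical helper: '9_45.txt' -> 9 (digit class before the underscore)."""
    file_str = os.path.splitext(file_name_str)[0]   # drop the extension: '9_45'
    return int(file_str.split('_')[0])              # text before '_' as an integer

# Example file names in the same pattern as the digit files
names = ['9_45.txt', '0_13.txt', '3_7.txt']
parsed = [label_from_filename(n) for n in names]
print(parsed)  # [9, 0, 3]
```

Because os.listdir returns every entry in the folder, including subfolders or unrelated files, a name that does not follow this pattern would raise a ValueError here, just as it would in handwritingClassTest.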