K Nearest Neighbor Algorithm Code Details

Book: Machine Learning in Action
Author: Peter Harrington

Advantages and disadvantages of the k-nearest neighbor algorithm

  • Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
  • Disadvantages: high computational and space complexity.
  • Applicable data types: numeric and nominal values.

General flow of the k-nearest neighbor algorithm

  • Collect data: any method may be used.
  • Prepare data: the values needed for distance calculations, preferably in a structured format.
  • Analyze data: any method may be used.
  • Train the algorithm: this step does not apply to the k-nearest neighbor algorithm.
  • Test the algorithm: calculate the error rate.
  • Use the algorithm: first provide sample data and structured output, then run the k-nearest neighbor algorithm to determine which class the input data belongs to, and finally apply whatever subsequent processing the computed class requires.

Code analysis

First paragraph

# from numpy import *
import numpy as np
import operator

def createDataSet():
    # group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]]) # data
    group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B'] # Corresponding label
    return group, labels
  • The first line of code in the original book is from numpy import *. I change it to import numpy as np because, with the book's original import, a reader unfamiliar with NumPy may mistake the array function for a Python built-in, whereas writing np.array makes it obvious that array comes from NumPy. import numpy as np simply imports the numpy module under the shorter alias np.
  • group = np.array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]]) creates a two-dimensional $4\times 2$ NumPy array.
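A quick interactive check of this first code block (assuming createDataSet has been defined in the current session or imported from the module you saved it in):

    >>> group, labels = createDataSet()
    >>> group.shape
    (4, 2)
    >>> labels
    ['A', 'A', 'B', 'B']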

Second paragraph

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of training samples
    diffMat = np.tile(inX, (dataSetSize,1)) - dataSet  # coordinate differences to every training sample
    sqDiffMat = diffMat**2  # square each difference
    sqDistances = sqDiffMat.sum(axis=1)  # sum the squared differences per row (per sample)
    distances = sqDistances**0.5  # Euclidean distance to each training sample
    sortedDistIndicies = distances.argsort()  # indices that sort the distances in ascending order
    classCount = {}
    for i in range(k):  # vote with the labels of the k nearest neighbors
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # sort the votes in descending order by count (the book's Python 2 code uses iteritems())
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]  # label with the most votes
  • The classify0 function takes four input parameters:
    • inX: the input vector to classify
    • dataSet: the training samples, a two-dimensional matrix whose rows are the training samples and whose columns are the features
    • labels: the label vector, with as many entries as dataSet has rows
    • k: the number of nearest neighbors, a scalar
  • dataSetSize = dataSet.shape[0] assigns the number of training samples to the variable dataSetSize.
  • np.tile(array, (i, j)) is a function in the numpy module that repeats array i times along the row direction and j times along the column direction. If the second argument is a scalar j, the array is repeated j times along the column direction and once along the row direction by default.
    >>> import numpy
    >>> numpy.tile([0,0],5)
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    >>> numpy.tile([0,0],(1,1))
    array([[0, 0]])
    >>> numpy.tile([0,0],(3,1))
    array([[0, 0],
           [0, 0],
           [0, 0]])
    >>> numpy.tile([0,0],(1,3))
    array([[0, 0, 0, 0, 0, 0]])
    >>> numpy.tile([0,0],(2,2))
    array([[0, 0, 0, 0],
           [0, 0, 0, 0]])
    
  • What exactly does diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet do? Looking at the Euclidean distance between two points, $d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$, this line computes the per-coordinate differences $[(x_1-x_2),(y_1-y_2)]$ between the point to be classified and every training sample point. An example makes this clearer:

    Suppose there are three training samples (1,1), (2,2), (3,3) and the sample to be classified is (5,2). The code first repeats (5,2) three times along the row direction, producing a 3-row, 2-column matrix with the same shape as dataSet, and then subtracts:
    $$\begin{bmatrix} 5 & 2 \\ 5 & 2 \\ 5 & 2 \end{bmatrix}- \begin{bmatrix} 1 & 1 \\ 2 & 2 \\ 3 & 3 \end{bmatrix}=\begin{bmatrix} 5-1 & 2-1 \\ 5-2 & 2-2 \\ 5-3 & 2-3 \end{bmatrix}$$

  • diffMat**2 does not perform a matrix multiplication of two identical matrices; it squares each element of the matrix diffMat.
  • After sqDiffMat = diffMat**2, the matrix becomes: $$\begin{bmatrix} (5-1)^2 & (2-1)^2 \\ (5-2)^2 & (2-2)^2 \\ (5-3)^2 & (2-3)^2 \end{bmatrix}$$
  • sqDistances = sqDiffMat.sum(axis=1) sums each row, and distances = sqDistances**0.5 takes the square root of each sum, giving the Euclidean distance from the point to be classified to each training sample, that is: $\left[\sqrt{(5-1)^2+(2-1)^2},\ \sqrt{(5-2)^2+(2-2)^2},\ \sqrt{(5-3)^2+(2-3)^2}\right]$
  • np.argsort(array) is a numpy function that returns the indices that would sort the array from smallest to largest; for a multidimensional array, the axis parameter selects the axis to sort along; and if a minus sign is placed in front of the array, the returned indices correspond to sorting from largest to smallest.
      >>> x = np.array([3, 1, 2])
      >>> np.argsort(x)
      array([1, 2, 0])
      >>> x = np.array([[0, 3], [2, 2]])  # equal values keep their original order
      >>> x
      array([[0, 3],
             [2, 2]])
      >>> np.argsort(x, axis=0)  # Sort by Column
      array([[0, 1],
             [1, 0]])
      
      >>> np.argsort(x, axis=1)  # Sort by row
      array([[0, 1],
             [0, 1]])
      >>> x = np.array([3, 1, 2])
      >>> np.argsort(x)  # Sort in ascending order
      array([1, 2, 0])
      >>> np.argsort(-x)  # Sort in descending order
      array([0, 2, 1])
      >>> x[np.argsort(x)]
      array([1, 2, 3])
      >>> x[np.argsort(-x)]
      array([3, 2, 1])
    
  • classCount={} defines a dictionary classCount whose keys are the distinct label values among the k nearest neighbors and whose values are the number of times each label appears within those k. For example, if the nearest eight neighbors have labels ['A','A','A','A','A','B','B','C'], then classCount = {'A':5,'B':2,'C':1}.
  • The loop that follows builds the classCount dictionary described above and then sorts it in descending order by value, because the k-nearest neighbor algorithm assigns the label that occurs most often among the k nearest neighbors. After sorting, the label with the largest count is returned by the function. Several points in this loop are worth noting:
    • The dictionary's .get(key, default=None) method returns the value for the specified key; if the key is not in the dictionary it returns the default (None, or whatever default was supplied). Setting the default to 0 is a neat trick here: if the key does not yet exist in the dictionary, it is effectively created with a count of 0 before being incremented.
    • sorted(iterable, key=None, reverse=False) parameter description (Python 2's sorted also accepts a cmp argument, which was removed in Python 3):
      • iterable: an iterable object
      • key: a function of one argument; it is applied to each element of the iterable and its return value is used for sorting
      • reverse: reverse=True sorts in descending order, reverse=False (the default) in ascending order
    • itemgetter is a function in the operator module that fetches the item at the specified index (or indices) of an object:
        >>> from operator import itemgetter
        >>> a = [1,2,3,4,5]
        >>> b = itemgetter(0)
        >>> b(a)
        1
        >>> c = itemgetter(0,1,2)
        >>> c(a)
        (1, 2, 3)
      
      Combining itemgetter with the key parameter lets you sort a dictionary's items by key (itemgetter(0)) or by value (itemgetter(1)), whichever index is given inside the parentheses.
    • In Python 2, items() returns the dictionary's key-value pairs as a list and iteritems() returns them as an iterator; in Python 3, iteritems() was removed and items() returns a view of the key-value pairs.
    • sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True) sorts the key-value pairs in descending order by the dictionary's values (the book's Python 2 code uses classCount.iteritems() here).
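Putting the pieces of classify0 together, here is a short interactive walk-through of the (5,2) example above, plus a sanity check against createDataSet. It assumes classify0 and createDataSet are defined in the current session (or imported from the module you saved them in); exact floating-point formatting may differ slightly.

    >>> inX = np.array([5, 2])
    >>> dataSet = np.array([[1, 1], [2, 2], [3, 3]])
    >>> diffMat = np.tile(inX, (3, 1)) - dataSet
    >>> diffMat
    array([[ 4,  1],
           [ 3,  0],
           [ 2, -1]])
    >>> ((diffMat**2).sum(axis=1))**0.5   # distances to the three training samples
    array([4.12310563, 3.        , 2.23606798])
    >>> group, labels = createDataSet()
    >>> classify0([0, 0], group, labels, 3)   # two of the three nearest neighbors are 'B'
    'B'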

Third paragraph

def file2matrix(filename):
    fr = open(filename)  # Open the text file (four columns: the first three are sample features, the fourth is the label)
    numberOfLines = len(fr.readlines())  # Determine the number of lines in the file
    returnMat = np.zeros((numberOfLines,3))  # Create an all-zeros matrix with numberOfLines rows and 3 columns
    classLabelVector = []  # Create a list to store the label of each sample
    index = 0  # Set the initial index to 0
    fr = open(filename)  # Reopen the file: the first readlines() consumed it, so the file pointer must be reset
    for line in fr.readlines():
        line = line.strip()  # Remove whitespace at both ends of the line
        listFromLine = line.split('\t')  # Split the line on tab delimiters
        returnMat[index,:] = listFromLine[0:3]  # Assign the sample's features to row index of the matrix
        classLabelVector.append(int(listFromLine[-1]))  # Append the sample's label (last column) to the label list
        index += 1
    return returnMat,classLabelVector
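file2matrix expects the book's datingTestSet2.txt, which may not be at hand. A minimal way to try it is to write a small tab-delimited file in the same format (three numeric feature columns plus an integer label); the file name and values below are made up for illustration:

import numpy as np

# Hypothetical three-sample file in the same format as datingTestSet2.txt
sample = ("40920\t8.326976\t0.953952\t3\n"
          "14488\t7.153469\t1.673904\t2\n"
          "26052\t1.441871\t0.805124\t1\n")
with open('tiny_dating.txt', 'w') as f:
    f.write(sample)

mat, labels = file2matrix('tiny_dating.txt')
print(mat.shape)   # (3, 3)
print(labels)      # [3, 2, 1]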

Fourth paragraph

def autoNorm(dataSet):
    minVals = dataSet.min(0)  # Minimum of each column of the dataset
    maxVals = dataSet.max(0)  # Maximum of each column of the dataset
    ranges = maxVals - minVals  # Range (max - min) of each column
    normDataSet = np.zeros(np.shape(dataSet))  # Create an all-zeros matrix with the same shape as the dataset
    m = dataSet.shape[0]  # Assign the number of samples (rows) in the dataset to m
    normDataSet = dataSet - np.tile(minVals, (m,1))  # Subtract each column's minimum from every element of that column
    normDataSet = normDataSet/np.tile(ranges, (m,1))  # Divide by each column's range: min-max normalization
    return normDataSet, ranges, minVals
  • Usage of min(0), max(0) in matrices:
    • min(0) returns the minimum value of each column in the matrix
    • min(1) returns the minimum value of each row in the matrix
    • max(0) returns the maximum value of each column in the matrix
    • max(1) returns the maximum value of each row in the matrix
  • Min-max normalization formula: $\text{newValue}=\dfrac{\text{oldValue}-\text{min}}{\text{max}-\text{min}}$
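A quick check of autoNorm on a tiny made-up array (assuming the function above is defined in the session; numpy's printed formatting may vary slightly):

    >>> data = np.array([[10.0, 0.5],
    ...                  [20.0, 1.0],
    ...                  [30.0, 1.5]])
    >>> normed, ranges, minVals = autoNorm(data)
    >>> normed
    array([[0. , 0. ],
           [0.5, 0.5],
           [1. , 1. ]])
    >>> ranges, minVals
    (array([20.,  1.]), array([10. ,  0.5]))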

Fifth paragraph

def datingClassTest():
    hoRatio = 0.10  # Fraction of the data used for testing
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')  # Call the third code block: read the file, assign the first three columns to datingDataMat and the last column to datingLabels
    normMat, ranges, minVals = autoNorm(datingDataMat)  # Call the fourth code block: min-max normalization, returning the normalized matrix, each column's range, and each column's minimum
    m = normMat.shape[0]  # Assign the number of samples (rows) in the dataset to m
    numTestVecs = int(m*hoRatio)  # 10% of the samples are used as test vectors
    errorCount = 0.0  # Initialize the number of prediction errors
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)  # Call the second code block to classify the i-th test sample against the remaining 90% of the data
        print(classifierResult, datingLabels[i])  # Show the predicted and the true label
        if (classifierResult != datingLabels[i]): errorCount += 1.0  # If they differ, add 1 to the error count
    print(errorCount/float(numTestVecs))  # Print the classification error rate

The raw_input() function reads a line from the user and returns it as a string with the trailing newline stripped. In Python 3 it was renamed input().
The basic difference in Python 2 is that raw_input() always returns a string, whereas input() evaluates what the user types, so a number entered at the prompt comes back as an integer rather than a string.
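The raw_input()/input() note matters because the book pairs datingClassTest with a small interactive classifier. Here is a Python 3 sketch in that spirit, using input() together with the functions defined above (the prompts and result strings are illustrative, not necessarily the book's exact wording):

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']  # illustrative class descriptions
    percentTats = float(input("percentage of time spent playing video games? "))
    ffMiles = float(input("frequent flier miles earned per year? "))
    iceCream = float(input("liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles, percentTats, iceCream])
    # Normalize the new sample with the same ranges and minimums before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])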

Sixth paragraph

def img2vector(filename):
    returnVect = np.zeros((1,1024))  # Create a 1x1024 array (one row vector per image)
    fr = open(filename)
    for i in range(32):  # Read the first 32 lines of the file
        lineStr = fr.readline()
        for j in range(32):  # Store the first 32 characters of each line in the vector (32*32 = 1024 values)
            returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect
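img2vector expects the 32x32 text-image files from the book's trainingDigits/ and testDigits/ folders. To try it without them, you can write a hypothetical file in the same format (the file name and pixel pattern below are made up):

import numpy as np

# 32 lines of 32 '0'/'1' characters, mimicking the book's digit files
rows = ['0' * 32 for _ in range(32)]
rows[15] = '0' * 10 + '1' * 12 + '0' * 10   # a horizontal stroke in the middle
with open('fake_digit_0_0.txt', 'w') as f:
    f.write('\n'.join(rows))

vect = img2vector('fake_digit_0_0.txt')
print(vect.shape)       # (1, 1024)
print(int(vect.sum()))  # 12: the number of '1' pixels written above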

Seventh paragraph

import os  # needed for os.listdir (not imported in the first code block)

def handwritingClassTest():
    hwLabels = []  # Create the label list
    trainingFileList = os.listdir('trainingDigits')  # Return a list of the file (or folder) names under the trainingDigits folder
    m = len(trainingFileList)  # Assign the number of files under the trainingDigits folder to m
    trainingMat = np.zeros((m,1024))  # Create an all-zeros matrix with m rows and 1024 columns; each row will hold one image
    for i in range(m):  # Traverse the m files under the training folder
        fileNameStr = trainingFileList[i]  # Assign the file name to fileNameStr
        fileStr = fileNameStr.split('.')[0]  # Strip the .txt suffix from the file name and assign the result to fileStr
        classNumStr = int(fileStr.split('_')[0])  # Split on '_' and convert the first part to an integer class label, e.g. "9_45" -> "9" -> 9
        hwLabels.append(classNumStr)  # Add the label to the label list
        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)  # Store the 1024-element vector of the i-th file (32 characters from each of its first 32 lines) in row i of trainingMat
    testFileList = os.listdir('testDigits')  # Return a list of the file (or folder) names under the testDigits folder
    errorCount = 0.0  # Initialize the number of prediction errors
    mTest = len(testFileList)  # Assign the number of files under the testDigits folder to mTest
    for i in range(mTest):  # Traverse the mTest files under the test folder
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)  # Return the classification result
        print(classifierResult, classNumStr)  # Show the predicted and the true label
        if (classifierResult != classNumStr): errorCount += 1.0  # If they differ, add 1 to the error count
    print(errorCount)  # Print the total number of classification errors
    print(errorCount/float(mTest))  # Print the classification error rate
  • os.listdir(path) is a function in the os module that returns a list of the names of the files and folders contained in the specified path.
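The only non-obvious step above is recovering the digit class from the file name. A quick interactive illustration (the file name '9_45.txt' is hypothetical, following the naming scheme the book's digit files use):

    >>> fileNameStr = '9_45.txt'
    >>> fileStr = fileNameStr.split('.')[0]
    >>> fileStr
    '9_45'
    >>> int(fileStr.split('_')[0])   # the class label encoded in the name
    9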
