Implementation of k-nearest neighbor algorithm


Data preparation

Because kNN handles multi-class classification naturally, the original ten-class MNIST handwritten digit dataset is used directly; it is available as mnist.csv at https://github.com/phdsky/ML/tree/master/data.

k-nearest neighbor algorithm

The idea of k-nearest neighbors is very simple: for each input x, traverse the sample space, compute the distance from every training sample to x, select the k samples with the smallest distances, and decide the class of x by a majority vote among them. Note that kNN has no explicit training step; the algorithm proceeds as follows:

1. Given the training set and a query point x, compute the distance from x to every training sample under a chosen metric.
2. Take the k training samples closest to x as the neighborhood of x.
3. Assign x the class that appears most often among these k neighbors (majority voting).
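
Step 1 requires a distance metric. The minkowski helper in the code below corresponds to the standard Minkowski (Lp) distance between two n-dimensional points x_i and x_j:

$$
L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}
$$

With p = 2 this is the Euclidean distance, and with p = 1 the Manhattan distance.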

For each query sample x, kNN must traverse the whole training set once, so a single query over n training samples costs O(n); classifying n query samples therefore costs O(n^2) in total, which is unacceptable for large amounts of input data. The algorithm code is as follows:

# @Author: phd
# @Date: 19-4-17
# @Site: github.com/phdsky
# @Description:
#   KNN has no explicit training process
#       can deal with multi-class classification

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split


def calc_accuracy(y_pred, y_truth):
    assert len(y_pred) == len(y_truth)
    n = len(y_pred)

    hit_count = 0
    for i in range(0, n):
        if y_pred[i] == y_truth[i]:
            hit_count += 1

    print("Predicting accuracy %f\n" % (hit_count / n))


def minkowski(xi, xj, p):
    assert len(xi) == len(xj)
    n = len(xi)

    # Generic Lp (Minkowski) distance in pure Python -- kept for reference, too slow:
    # distance = 0
    # for i in range(0, n):
    #     distance += pow(abs(xi[i] - xj[i]), p)
    #
    # distance = pow(distance, 1/p)

    # Euclidean distance via numpy (p is ignored here; equivalent to p = 2)
    distance = np.linalg.norm(xi - xj)

    return distance


class KNN(object):
    def __init__(self, k, p):
        self.k = k
        self.p = p

    def vote(self, k_vec):
        # k_vec holds the k nearest (distance, label) pairs
        assert len(k_vec) == self.k
        flag = np.full(10, 0)  # Vote counter for the ten digit classes 0-9

        for i in range(0, self.k):
            flag[k_vec[i][1]] += 1  # Count the label of the i-th neighbor

        return np.argmax(flag)  # Return the class with the most votes

    def predict(self, X_train, y_train, X_test):
        n = len(X_test)
        m = len(X_train)
        predict_label = np.full(n, -1)

        for i in range(0, n):
            to_predict = X_test[i]
            distances, distances_label = [], []
            for j in range(0, m):
                to_compare = X_train[j]
                dist = minkowski(to_predict, to_compare, self.p)
                distances.append(dist)

            # Pair each distance with its training label, then sort by distance
            distances_label = list(zip(distances, y_train))
            distances_label.sort(key=lambda kv: kv[0])

            predict_label[i] = self.vote(distances_label[0:self.k])
            # print("Nearest neighbour is %s" % X_train[predict_label[i]])
            print("Sample %d predicted as %d" % (i, predict_label[i]))

        return predict_label


def example():
    print("Start testing on simple dataset...")

    X_train = np.asarray([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
    y_train = np.asarray([0, 1, 2, 3, 4, 5])
    X_test = np.asarray([[3, 5]])

    knn = KNN(k=1, p=2)  # p=2 Euclidean distance
    y_predicted = knn.predict(X_train=X_train, y_train=y_train, X_test=X_test)

    print("Simple testing done...\n")


if __name__ == "__main__":

    # example()

    mnist_data = pd.read_csv("../data/mnist.csv")
    mnist_values = mnist_data.values

    images = mnist_values[:, 1:]  # Remaining 784 columns are the pixel values
    labels = mnist_values[:, 0]   # First column is the digit label

    X_train, X_test, y_train, y_test = train_test_split(
        images, labels, test_size=100, random_state=42
    )

    knn = KNN(k=10, p=2)  # p=2 Euclidean distance

    # Start predicting; kNN has no explicit training step
    print("Testing on %d samples..." % len(X_test))
    y_predicted = knn.predict(X_train=X_train, y_train=y_train, X_test=X_test)

    calc_accuracy(y_pred=y_predicted, y_truth=y_test)

Code output

/Users/phd/Softwares/anaconda3/bin/python /Users/phd/Desktop/ML/knn/knn.py
Testing on 100 samples...
Sample 0 predicted as 8
Sample 1 predicted as 1
Sample 2 predicted as 9
Sample 3 predicted as 9
Sample 4 predicted as 8
Sample 5 predicted as 6
Sample 6 predicted as 2
Sample 7 predicted as 2
Sample 8 predicted as 7
Sample 9 predicted as 1
Sample 10 predicted as 6
Sample 11 predicted as 3
Sample 12 predicted as 1
Sample 13 predicted as 2
Sample 14 predicted as 7
Sample 15 predicted as 4
Sample 16 predicted as 3
Sample 17 predicted as 3
Sample 18 predicted as 6
Sample 19 predicted as 4
Sample 20 predicted as 9
Sample 21 predicted as 5
Sample 22 predicted as 2
Sample 23 predicted as 6
Sample 24 predicted as 0
Sample 25 predicted as 0
Sample 26 predicted as 0
Sample 27 predicted as 8
Sample 28 predicted as 6
Sample 29 predicted as 3
Sample 30 predicted as 6
Sample 31 predicted as 6
Sample 32 predicted as 1
Sample 33 predicted as 9
Sample 34 predicted as 8
Sample 35 predicted as 6
Sample 36 predicted as 7
Sample 37 predicted as 3
Sample 38 predicted as 6
Sample 39 predicted as 1
Sample 40 predicted as 9
Sample 41 predicted as 7
Sample 42 predicted as 9
Sample 43 predicted as 6
Sample 44 predicted as 8
Sample 45 predicted as 3
Sample 46 predicted as 4
Sample 47 predicted as 2
Sample 48 predicted as 7
Sample 49 predicted as 8
Sample 50 predicted as 4
Sample 51 predicted as 3
Sample 52 predicted as 3
Sample 53 predicted as 7
Sample 54 predicted as 1
Sample 55 predicted as 2
Sample 56 predicted as 6
Sample 57 predicted as 2
Sample 58 predicted as 9
Sample 59 predicted as 6
Sample 60 predicted as 4
Sample 61 predicted as 0
Sample 62 predicted as 4
Sample 63 predicted as 8
Sample 64 predicted as 5
Sample 65 predicted as 3
Sample 66 predicted as 4
Sample 67 predicted as 3
Sample 68 predicted as 9
Sample 69 predicted as 3
Sample 70 predicted as 9
Sample 71 predicted as 4
Sample 72 predicted as 2
Sample 73 predicted as 8
Sample 74 predicted as 1
Sample 75 predicted as 6
Sample 76 predicted as 3
Sample 77 predicted as 7
Sample 78 predicted as 0
Sample 79 predicted as 3
Sample 80 predicted as 1
Sample 81 predicted as 7
Sample 82 predicted as 6
Sample 83 predicted as 7
Sample 84 predicted as 6
Sample 85 predicted as 1
Sample 86 predicted as 9
Sample 87 predicted as 5
Sample 88 predicted as 3
Sample 89 predicted as 6
Sample 90 predicted as 9
Sample 91 predicted as 3
Sample 92 predicted as 7
Sample 93 predicted as 6
Sample 94 predicted as 6
Sample 95 predicted as 5
Sample 96 predicted as 2
Sample 97 predicted as 9
Sample 98 predicted as 3
Sample 99 predicted as 5
Predicting accuracy 0.970000


Process finished with exit code 0

Above, k = 10 selects the ten nearest neighbors and p = 2 selects the Euclidean distance. In the implementation, computing the distance with the element-wise Minkowski loop (p = 2) was too slow, so it is replaced by numpy's Euclidean norm (np.linalg.norm). On the 100 held-out samples the test accuracy reaches 97%, which shows that the algorithm is simple and effective.
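
The remaining bottleneck is the Python loop over every (test, train) pair. Below is a minimal sketch of a vectorized alternative; it assumes X_train, y_train and X_test are the numpy arrays from the script above, and predict_vectorized is a hypothetical helper that is not part of the original code.

import numpy as np

def predict_vectorized(X_train, y_train, X_test, k=10):
    # Hypothetical, vectorized variant of KNN.predict (not in the original script)
    predictions = np.empty(len(X_test), dtype=int)
    for i, x in enumerate(X_test):
        # Compute all Euclidean distances to the training set in one numpy call
        dists = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k smallest distances (order among them does not matter)
        nearest = np.argpartition(dists, k - 1)[:k]
        # Majority vote over the neighbors' digit labels
        predictions[i] = np.bincount(y_train[nearest]).argmax()
    return predictions

The logic is the same as before, but the inner distance loop moves into numpy, which is typically much faster for arrays of this size.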

Nearest neighbor algorithm based on kd tree
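
A kd-tree organizes the training points in a binary space-partitioning tree so that a nearest-neighbor query no longer has to scan every sample. The tree construction and search are not implemented here; purely as a point of reference, scikit-learn's KNeighborsClassifier can be asked to use a kd-tree backend, assuming the same X_train/X_test split as above:

from sklearn.neighbors import KNeighborsClassifier

# Reference sketch only: kd-tree backed kNN, k = 10, Euclidean distance (p = 2)
clf = KNeighborsClassifier(n_neighbors=10, algorithm='kd_tree', p=2)
clf.fit(X_train, y_train)      # Builds the kd-tree over the training samples
y_pred = clf.predict(X_test)   # Queries the tree for each test sample

Note that kd-trees work best in low dimensions; on 784-dimensional raw MNIST pixels their advantage over brute force is limited, which is why dimensionality reduction is often applied first.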
