K-nearest neighbor method and its implementation in Python

Overview

The K-nearest neighbor (KNN) method is a basic classification and regression method with no explicit learning process. The choice of the value of $K$, the distance metric, and the classification decision rule are its three basic elements. In general, only the $K$ most similar samples in the training set are selected, and $K$ is usually no greater than 20. Finally, the category that appears most frequently among these $K$ neighbors is taken as the class of the new data.

  • Advantages: high accuracy, insensitive to outliers, and no assumptions about the input data
  • Disadvantages: high computational complexity and space complexity
  • Applicable data types: numerical and nominal
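For reference, before the hand-written implementation below, this is roughly how the same classifier looks with scikit-learn's KNeighborsClassifier (a minimal sketch; scikit-learn is already used later to load the iris data):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K = 3 neighbors, Euclidean distance (p = 2)
clf = KNeighborsClassifier(n_neighbors=3, p=2)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))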

Nearest neighbor algorithm

Nearest neighbor model

Distance measurement

The $L_p$ distance is defined as:
$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}$$
When $p = 2$, it is the Euclidean distance:
$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}$$
When $p = 1$, it is the Manhattan distance:
$$L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|$$
When $p = \infty$, it is the maximum of the coordinate-wise distances (the Chebyshev distance):
$$L_\infty(x_i, x_j) = \max_{l} \left| x_i^{(l)} - x_j^{(l)} \right|$$
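A small sketch of these distances with NumPy (the vectors x and y here are made-up values for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# L1 (Manhattan), L2 (Euclidean) and L-infinity (Chebyshev) distances
print(np.linalg.norm(x - y, ord=1))       # 5.0
print(np.linalg.norm(x - y, ord=2))       # ~3.606
print(np.linalg.norm(x - y, ord=np.inf))  # 3.0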

Selection of the K value

If $K$ is small, the approximation error of "learning" decreases, but the estimation error increases, and the prediction is sensitive to noise. A smaller $K$ means the overall model is more complex and prone to overfitting.
If $K$ is large, the estimation error of learning decreases, but the approximation error increases. A larger $K$ means the overall model becomes simpler.
If $K = N$, then no matter what the input instance is, it will simply be predicted to belong to the class with the most training instances.
In practice, a relatively small value is generally chosen, and cross validation is used to select the optimal $K$, as sketched below.
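A minimal sketch of choosing $K$ by cross validation with scikit-learn (cross_val_score with 5 folds; the candidate range 1-20 follows the rule of thumb from the overview):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = 1, 0.0
for k in range(1, 21):
    # Average accuracy over 5 cross-validation folds for this value of K
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)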

Classification decision rule

The majority voting rule can be explained as follows. If the loss function for classification is the 0-1 loss
$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$
and the classification function is $f: \mathbb{R}^n \rightarrow \{c_1, c_2, \ldots, c_K\}$, then the probability of misclassification is $P(Y \neq f(X)) = 1 - P(Y = f(X))$.
For a given instance $x \in \mathcal{X}$, its $k$ nearest training instances form a set $N_k(x)$. If the class assigned to the region covered by $N_k(x)$ is $c_j$, then the misclassification rate is
$$\frac{1}{k}\sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k}\sum_{x_i \in N_k(x)} I(y_i = c_j)$$
Minimizing the misclassification rate is therefore equivalent to maximizing $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, so majority voting is equivalent to minimizing the empirical risk.
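A tiny sketch of the voting step itself, using Counter in the same way as the hand-written KNN in myKNN.py below (the neighbor labels here are made up for illustration):

from collections import Counter

# Labels of the k nearest neighbors of some query point
neighbor_labels = [1, 0, 1, 1, 0]
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # 1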

Code

Main.py

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myKNN import KNN
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def pplt(df, test_point=[0, 0], flag=0):
    plt.scatter(df[:50]['sepal length'], df[:50]['sepal width'], label='0')
    plt.scatter(df[50:100]['sepal length'], df[50:100]['sepal width'], label='1')
    if flag == 1:
        plt.plot(test_point[0], test_point[1], 'bo', label='test_point')
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.legend()
    plt.show()


iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
# print(df)
pplt(df)
data = np.array(df.iloc[:100, [0, 1, -1]])
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("np.shape(X_train)", np.shape(X_train))
print("np.shape(y_train)", np.shape(y_train))
clf = KNN(X_train, y_train)
print(clf.score(X_test, y_test))
test_point = [6.0, 3.0]
print('Test Point:{}'.format(clf.predict(test_point)))

pplt(df, test_point, flag=1)

myKNN.py

import numpy as np
from collections import Counter

class KNN:
    def __init__(self, X_train, y_train, n_neighbors=3, p=2):
        """
        :param n_neighbors: Number of adjacent points
        :param p: Distance measurement
        """
        self.n = n_neighbors
        self.p = p
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X):
        # Initialize the candidate list with the first n training points
        knn_list = []
        for i in range(self.n):
            # np.linalg.norm(x, ord=None, axis=None, keepdims=False) computes a norm:
            # ord is the norm order, axis selects row/column-wise computation (None gives the vector/matrix norm),
            # and keepdims keeps the dimensions of the input
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            knn_list.append((dist, self.y_train[i]))

        for i in range(self.n, len(self.X_train)):
            max_index = knn_list.index(max(knn_list, key=lambda x: x[0]))
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            if knn_list[max_index][0] > dist:
                knn_list[max_index] = (dist, self.y_train[i])

        # Vote: the most frequent label among the k nearest neighbors is the prediction
        knn = [k[-1] for k in knn_list]
        count_pairs = Counter(knn)
        predicted = sorted(count_pairs.items(), key=lambda x: x[1])[-1][0]
        return predicted

    def score(self, X_test, y_test):
        right_count = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right_count += 1
        return right_count / len(X_test)

Output


Implementation of nearest neighbor method: kd tree

Constructing kd tree

A kd tree is a tree data structure that stores instance points in a k-dimensional space for fast retrieval. It is a binary tree that represents a partition of the k-dimensional space: constructing a kd tree is equivalent to repeatedly splitting the space with hyperplanes perpendicular to a coordinate axis, which produces a series of k-dimensional hyper-rectangular regions. Each node of the kd tree corresponds to one such hyper-rectangular region.
A kd tree is constructed as follows. First build the root node, which corresponds to the hyper-rectangular region containing all instance points in the k-dimensional space. Then recursively split the space and generate child nodes: select a coordinate axis and a cut point on that axis to determine a hyperplane; the hyperplane, passing through the cut point and perpendicular to the chosen axis, divides the current hyper-rectangular region into left and right sub-regions (child nodes), and the instances are divided between the two sub-regions accordingly. The process stops when a sub-region contains no more instances (the node at termination is a leaf node). During this process, each instance is stored on its corresponding node.
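For intuition, take the 2-dimensional sample set used in the code below, {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}, and split with the median rule used there (split_pos = len // 2). The root splits on the first coordinate at (7,2); its left subtree {(2,3), (5,4), (4,7)} then splits on the second coordinate at (5,4), its right subtree {(8,1), (9,6)} splits at (9,6), and recursion continues until every point is stored on a node.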

Search kd tree

Given a target point, we search for its nearest neighbor as follows: first find the leaf node containing the target point; then, starting from that leaf node, move back up through the parent nodes in turn, continually updating the node closest to the target point, and stop when it is certain that no closer node exists. In this way the search is confined to a local region of the space, and efficiency is greatly improved.

The leaf node containing the target point corresponds to the smallest hyper-rectangular region that contains it. Take the instance point at this leaf node as the current nearest point; the true nearest neighbor must lie inside the hypersphere centered at the target point and passing through the current nearest point. Then move back to the parent of the current node. If the hyper-rectangular region of the parent's other child intersects this hypersphere, look for an instance point closer to the target within the intersection; if such a point exists, it becomes the new current nearest point. The algorithm then moves up to the next parent node and repeats the process. If the region of the other child does not intersect the hypersphere, or no closer point is found, the search along that branch stops. If the instance points are randomly distributed, the average complexity of a kd tree search is $O(\log N)$.
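As a sanity check on the kd tree search, a brute-force linear scan over all points must give the same answer and is easy to write (a minimal sketch with NumPy; brute_force_nearest is a helper name introduced here, not part of the code below):

import numpy as np

def brute_force_nearest(points, target):
    # Linear scan: compute the Euclidean distance to every point and keep the minimum
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points - np.asarray(target, dtype=float), axis=1)
    idx = int(np.argmin(dists))
    return points[idx], float(dists[idx])

# The same 2-D sample set and query point used in main.py below
data = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
print(brute_force_nearest(data, [3, 4.5]))  # nearest point [2. 3.], distance ~1.80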

Code implementation

main.py

import KD
import search
import time
from random import random

data = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
kd = KD.KDTree(data)
KD.preorder(kd.root)

# A k-dimensional random vector is generated, and the component value of each dimension is 0 ~ 1
def random_point(k):
    return [random() for _ in range(k)]

# Generate n k-dimensional random vectors
def random_points(k, n):
    return [random_point(k) for _ in range(n)]

ret = search.find_nearest(kd, [3, 4.5])
print(ret)

N = 400000
t0 = time.perf_counter()
kd2 = KD.KDTree(random_points(3, N))  # Construct a kd tree containing 400000 3-dimensional sample points
ret2 = search.find_nearest(kd2, [0.1, 0.5, 0.8])  # Find the nearest point from the target in 400000 sample points
t1 = time.perf_counter()
print("Time:", t1 - t0, "s")
print(ret2)

KD.py

# The main data structures contained in each node of KD tree are as follows:
class KdNode(object):
    def __init__(self, dom_elt, split, left, right):
        self.dom_elt = dom_elt  # K-dimensional vector node (a sample point in k-dimensional space)
        self.split = split  # Integer (serial number of the dimension to be split)
        self.left = left  # The node divides the KD tree composed of the left subspace of the hyperplane
        self.right = right  # The node divides the KD tree composed of the right subspace of the hyperplane

class KDTree(object):
    def __init__(self, data):
        k = len(data[0])  # Data dimension
        self.root = CreateNode(k, 0, data)  # The kd tree is constructed from the component of dimension 0, and the root node is returned

def CreateNode(k, split, data_set):
    """
    Split data_set along dimension `split` and create a KdNode.
    """
    if not data_set:  # The data set is empty
        return None
    # The key parameter takes a function of one argument and returns the value used for comparison
    # operator.itemgetter(split) could be used instead to extract the split-th component of each point:
    # data_set.sort(key=itemgetter(split))  # sort by the dimension to be split
    data_set.sort(key=lambda x: x[split])
    split_pos = len(data_set) // 2
    median = data_set[split_pos]  # Median split point
    split_next = (split + 1) % k

    # Create kd tree recursively
    return KdNode(median, split,
                  CreateNode(k, split_next, data_set[:split_pos]),  # Create left subtree
                  CreateNode(k, split_next, data_set[split_pos + 1:])  # Create right subtree
                  )

# KDTree preorder traversal
def preorder(root):
    print(root.dom_elt)
    if root.left:
        preorder(root.left)
    if root.right:
        preorder(root.right)

search.py

"""
Search the constructed kd tree to find the sample point closest to a given target point.
"""

from math import sqrt
from collections import namedtuple
import KD

# Define a namedtuple to store the nearest coordinate point, the nearest distance and the number of visited nodes respectively
result = namedtuple("Result_tuple", "nearest_point nearest_dist nodes_visited")

def find_nearest(tree, point):
    k = len(point)  # Data dimension

    def travel(kd_node, target, max_dist):
        if kd_node is None:
            return result([0] * k, float("inf"), 0)

        nodes_visited = 1

        s = kd_node.split  # Dimension to split
        pivot = kd_node.dom_elt  # Split point (the instance stored at this node)

        if target[s] <= pivot[s]:  # The s-th coordinate of the target is not greater than that of the split point (the target lies in the left subspace)
            nearer_node = kd_node.left  # The next access node is the root node of the left subtree
            further_node = kd_node.right  # Record the right subtree at the same time
        else:  # The target is closer to the right subtree
            nearer_node = kd_node.right
            further_node = kd_node.left

        temp1 = travel(nearer_node, target, max_dist)  # Traverse to find the area containing the target point

        nearest = temp1.nearest_point  # Take the nearest point found in that subtree as the "current nearest point"
        dist = temp1.nearest_dist  # Update nearest distance

        nodes_visited += temp1.nodes_visited

        if dist < max_dist:
            max_dist = dist  # The nearest point will be in the hypersphere with the target point as the center and max_dist as the radius

        temp_dist = abs(pivot[s] - target[s])  # The distance between the target point and the split hyperplane in the s-dimension
        if max_dist < temp_dist:  # Determine whether the hypersphere intersects the hyperplane
            return result(nearest, dist, nodes_visited)  # Disjoint can be returned directly without judgment

        # ---------------------------------------------------------------------------------
        # Calculate the Euclidean distance between the target point and the segmentation point
        temp_dist = sqrt(sum((p1 - p2) ** 2 for p1, p2 in zip(pivot, target)))

        if temp_dist < dist:  # If closer
            nearest = pivot  # Update nearest point
            dist = temp_dist  # Update nearest distance
            max_dist = dist  # Update hypersphere radius

        # Check whether the area corresponding to another child node has a closer point
        temp2 = travel(further_node, target, max_dist)

        nodes_visited += temp2.nodes_visited
        if temp2.nearest_dist < dist:  # If there is a closer distance in another child node
            nearest = temp2.nearest_point  # Update nearest point
            dist = temp2.nearest_dist  # Update nearest distance

        return result(nearest, dist, nodes_visited)

    return travel(tree.root, point, float("inf"))  # Recursion from root node

Output
