Statistical learning method -- K-nearest neighbor (kd tree implementation)

Take the set of two-dimensional points (x, y) {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)} as an example to illustrate how a k-d tree is constructed.

(1) Build steps

1. When building the root node, the splitting dimension is x. Sorted by x from small to large, the point set is:

(2,3), (4,7), (5,4), (7,2), (8,1), (9,6)

The median x value chosen is 7, so (7,2) becomes the root. (Note: the mathematical median of 2, 4, 5, 7, 8, 9 is (5 + 7) / 2 = 6, but the splitting point must be a member of the point set, so this article uses the index len(points)//2 = 3 and takes points[3] = (7,2) as the median.)

2. (2,3), (4,7), (5,4) form the left subtree of node (7,2); (8,1), (9,6) form its right subtree.

3. When constructing the left subtree of node (7,2), the splitting dimension for the point set (2,3), (4,7), (5,4) is y. The median of 3, 4, 7 is 4, so (5,4) is the splitting node; (2,3) goes to its left subtree and (4,7) to its right subtree.

4. When constructing the right subtree of node (7,2), the splitting dimension for the point set (8,1), (9,6) is also y. The median chosen is (9,6) (index len(points)//2 = 1 after sorting by y), which becomes the splitting node, and (8,1) goes to its left subtree. The k-d tree is now complete.

As the construction process above and the following figure show, building a k-d tree amounts to partitioning the two-dimensional plane step by step.
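The four build steps can be sketched in plain Python, without NumPy. This is only a minimal illustration, not the implementation used in this article: the function name `build` and the nested-tuple layout `(point, axis, left, right)` are assumptions made for the example.

```python
def build(points, axis=0):
    """Return a k-d tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    # Sort on the current splitting dimension (0 = x, 1 = y).
    ordered = sorted(points, key=lambda p: p[axis])
    # Upper median: for an even count this picks the larger middle element,
    # so the splitting point is always a member of the set.
    mid = len(ordered) // 2
    next_axis = (axis + 1) % len(ordered[0])
    return (
        ordered[mid],
        axis,
        build(ordered[:mid], next_axis),
        build(ordered[mid + 1:], next_axis),
    )


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
root, _, left, right = tree
print(root)      # (7, 2)
print(left[0])   # (5, 4)
print(right[0])  # (9, 6)
```

The printed nodes match steps 1, 3 and 4: (7,2) is the root, (5,4) heads the left subtree and (9,6) heads the right subtree.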

(2) Code implementation: building the kd tree

import numpy as np


class Node:
    def __init__(self, data, sp=0, left=None, right=None):
        self.data = data
        self.sp = sp  # splitting dimension: 0 = feature 1, 1 = feature 2
        self.left = left
        self.right = right

    def __lt__(self, other):
        # Compare coordinates lexicographically; "<" on two NumPy arrays
        # would return an array instead of a single bool.
        return tuple(self.data) < tuple(other.data)


class KDTree:
    def __init__(self, data):
        self.dim = data.shape[1]
        self.root = self.createTree(data, 0)
        self.nearest_node = None
        self.nearest_dist = np.inf  # start at infinity

    def createTree(self, dataset, sp):
        if len(dataset) == 0:
            return None
        # Sort the rows by the splitting column
        dataset_sorted = dataset[np.argsort(dataset[:, sp])]
        # Index of the median
        mid = len(dataset) // 2
        # Build the subtrees, cycling through the dimensions
        left = self.createTree(dataset_sorted[:mid], (sp + 1) % self.dim)
        right = self.createTree(dataset_sorted[mid + 1:], (sp + 1) % self.dim)
        parentNode = Node(dataset_sorted[mid], sp, left, right)
        return parentNode


data = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
kdtree = KDTree(data)  # create the KDTree

2 kd tree search (find nearest neighbor node)

Note: when k is 1, the k-nearest neighbor is simply called the nearest neighbor.

Searching a k-d tree is also an important part of feature matching. Its purpose is to find the data point in the k-d tree that is closest to the query point.

(1) Simple case 1: the query point is (2.1, 3.1)

1. A binary search from the root node along the search path quickly finds an approximate nearest neighbor, namely the leaf node (2,3).

2. The leaf node found is not necessarily the nearest neighbor. Any point that is closer must lie inside the circle centered at the query point and passing through the leaf node.

3. To find the true nearest neighbor, a "backtracking" operation is also needed: the algorithm walks back up the search path looking for data points closer to the query point.

Walkthrough:

1. In this example, the binary search starts from (7,2), then reaches (5,4), and finally reaches (2,3). At this point the search path is <(7,2), (5,4), (2,3)>.

2. First, take (2,3) as the current nearest neighbor. Its distance to the query point (2.1,3.1) is 0.1414.

3. Then backtrack to its parent node (5,4) and check whether the other child subspace of the parent could contain a data point closer to the query point. Draw a circle with (2.1,3.1) as the center and 0.1414 as the radius, as shown in Figure 3. The circle does not intersect the hyperplane y = 4, so there is no need to search the right subspace of node (5,4).

4. Finally, backtrack to (7,2). The circle centered at (2.1,3.1) with radius 0.1414 does not intersect the hyperplane x = 7 either, so there is no need to search the right subspace of (7,2). All nodes on the search path have now been backtracked, the search ends, and the nearest neighbor (2,3) is returned, with a nearest distance of 0.1414.
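The arithmetic behind this walkthrough can be checked with a few lines of Python. This is only a sketch of the pruning test, assuming Python 3.8 or later for `math.dist`; the variable names `gap_y4` and `gap_x7` are made up for the example.

```python
import math

query = (2.1, 3.1)
leaf = (2, 3)

# Radius of the candidate circle: distance from the query point to the leaf.
radius = math.dist(query, leaf)
print(round(radius, 4))  # 0.1414

# Distance from the query point to each splitting line on the search path.
gap_y4 = abs(query[1] - 4)  # node (5,4) splits on y = 4
gap_x7 = abs(query[0] - 7)  # node (7,2) splits on x = 7

# The far side of a split needs searching only if the circle crosses the line.
print(radius > gap_y4)  # False: skip the right subspace of (5,4)
print(radius > gap_x7)  # False: skip the right subspace of (7,2)
```

Both tests are false, which is why no subspace other than the one containing the leaf has to be visited in this case.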

(2) Case 2: the query point is (2, 4.5)

1. A binary search is again carried out first: from (7,2) it reaches node (5,4), whose splitting hyperplane is y = 4. Since the query point's y value is 4.5, the search enters the right subspace and finds (4,7), forming the search path <(7,2), (5,4), (4,7)>.

2. Take (4,7) as the current nearest neighbor; its distance to the query point is 3.202.

3. Backtrack to (5,4), whose distance to the query point is 3.041. Since 3.041 < 3.202, (5,4) becomes the current nearest neighbor.

4. Draw a circle centered at (2, 4.5) with radius 3.041, as shown in Figure 4. The circle intersects the hyperplane y = 4, so the left subspace of (5,4) must also be searched. Node (2,3) is therefore added to the search path, which becomes <(7,2), (2,3)>.

5. Backtrack to the leaf node (2,3). Its distance to (2, 4.5) is 1.5, smaller than that of (5,4), so the nearest neighbor is updated to (2,3) and the nearest distance to 1.5.

6. Backtrack to (7,2). The circle centered at (2, 4.5) with radius 1.5 does not intersect the splitting hyperplane x = 7, so the right subspace of (7,2) need not be searched.

The backtracking of the search path is now complete. The nearest neighbor (2,3) is returned, with a nearest distance of 1.5.
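The distances and the two intersection tests of Case 2 can be reproduced as follows. As before, this is a small check rather than part of the search code, and it assumes Python 3.8 or later for `math.dist`; the names `d_47`, `d_54` and `d_23` are illustrative.

```python
import math

query = (2, 4.5)

d_47 = math.dist(query, (4, 7))  # first candidate, the leaf reached by the descent
d_54 = math.dist(query, (5, 4))  # parent node, checked while backtracking
d_23 = math.dist(query, (2, 3))  # found in the left subspace of (5,4)
print(round(d_47, 3), round(d_54, 3), round(d_23, 3))  # 3.202 3.041 1.5

# Circle of radius d_54 against the line y = 4: it crosses, so search the far side.
crosses_y4 = d_54 > abs(query[1] - 4)
print(crosses_y4)  # True

# Circle of radius d_23 against the line x = 7: no crossing, so prune.
crosses_x7 = d_23 > abs(query[0] - 7)
print(crosses_x7)  # False
```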

(3) Code implementation

import numpy as np


class Node:
    def __init__(self, data, sp=0, left=None, right=None):
        self.data = data
        self.sp = sp  # splitting dimension: 0 = feature 1, 1 = feature 2
        self.left = left
        self.right = right

    def __lt__(self, other):
        # Compare coordinates lexicographically; "<" on two NumPy arrays
        # would return an array instead of a single bool.
        return tuple(self.data) < tuple(other.data)


class KDTree:
    def __init__(self, data):
        self.dim = data.shape[1]
        self.root = self.createTree(data, 0)
        self.nearest_node = None
        self.nearest_dist = np.inf  # start at infinity

    def createTree(self, dataset, sp):
        if len(dataset) == 0:
            return None
        # Sort the rows by the splitting column
        dataset_sorted = dataset[np.argsort(dataset[:, sp])]
        # Index of the median
        mid = len(dataset) // 2
        # Build the subtrees, cycling through the dimensions
        left = self.createTree(dataset_sorted[:mid], (sp + 1) % self.dim)
        right = self.createTree(dataset_sorted[mid + 1:], (sp + 1) % self.dim)
        parentNode = Node(dataset_sorted[mid], sp, left, right)
        return parentNode

    def nearest(self, x):
        # Reset the search state so that repeated queries do not interfere
        self.nearest_node = None
        self.nearest_dist = np.inf

        def visit(node):
            if node is not None:
                dis = node.data[node.sp] - x[node.sp]
                # Descend into the side of the split that contains x
                visit(node.left if dis > 0 else node.right)
                # Euclidean (2-norm) distance between the current node and x
                curr_dis = np.linalg.norm(x - node.data, 2)
                # Update the best candidate
                if curr_dis < self.nearest_dist:
                    self.nearest_dist = curr_dis
                    self.nearest_node = node
                # If the candidate circle crosses the splitting hyperplane,
                # the other subtree must be searched as well
                if self.nearest_dist > abs(dis):
                    # Exact opposite of the branch taken above, so that the
                    # case dis == 0 does not visit the same side twice
                    visit(node.right if dis > 0 else node.left)

        # Start the search from the root
        visit(self.root)
        return self.nearest_node.data, self.nearest_dist


data = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
kdtree = KDTree(data)  # create the KDTree
node, dist = kdtree.nearest(np.array([6, 5]))
print(node, dist)

(4) Performance comparison

https://www.cnblogs.com/21207-ihome/p/6084670.html

Generally speaking, the nearest-neighbor search only needs to examine a few leaf nodes, as shown in the following figure:

However, if the instance points are poorly distributed, almost all nodes must be traversed, as follows:
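To make the difference concrete, the sketch below counts how many nodes a nearest-neighbor search touches. It is a plain-Python illustration, not the code from this article: `build` and `nearest_with_count` are hypothetical helper names, and the "poor" distribution is one assumed example (16 points on a ring with the query point at its center), chosen because every point is then equally far from the query and no subtree can be pruned.

```python
import math


def build(points, axis=0):
    """Build a 2-D k-d tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    nxt = (axis + 1) % 2
    return (ordered[mid], axis,
            build(ordered[:mid], nxt), build(ordered[mid + 1:], nxt))


def nearest_with_count(tree, query):
    """Return (nearest point, distance, number of nodes visited)."""
    best = [None, math.inf]
    visited = 0

    def visit(node):
        nonlocal visited
        if node is None:
            return
        visited += 1
        point, axis, left, right = node
        diff = point[axis] - query[axis]
        near, far = (left, right) if diff > 0 else (right, left)
        visit(near)
        dist = math.dist(point, query)
        if dist < best[1]:
            best[0], best[1] = point, dist
        # Search the far side only if the candidate circle crosses the split.
        if best[1] > abs(diff):
            visit(far)

    visit(tree)
    return best[0], best[1], visited


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
visited_good = nearest_with_count(build(points), (2.1, 3.1))[2]
print(visited_good)  # 3 of 6 nodes

ring = [(math.cos((i + 0.5) * math.pi / 8), math.sin((i + 0.5) * math.pi / 8))
        for i in range(16)]
visited_ring = nearest_with_count(build(ring), (0.0, 0.0))[2]
print(visited_ring)  # 16 of 16 nodes
```

For the query (2.1, 3.1) only the three nodes on the search path are visited, while on the ring every splitting line lies inside the candidate circle, so the search degenerates into a full traversal.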

3 kd tree search for the K nearest nodes in the k-nearest neighbor algorithm

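Note: Python's heapq module provides only a min-heap; there is no max-heap. What can be done?

Store each value x as -x, so that the min-heap on the negated values behaves like a max-heap on the original values.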
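A minimal sketch of the negation trick, using only the standard `heapq` functions `heappush` and `heappop`; the sample distances are made up for the example.

```python
import heapq

distances = [3.2, 1.5, 4.5, 2.8]

# heapq is a min-heap, so push the negated values.
max_heap = []
for d in distances:
    heapq.heappush(max_heap, -d)

# The root of the heap now holds the negation of the largest value.
largest = -max_heap[0]
print(largest)  # 4.5

# Popping removes the largest value; the next largest moves to the root.
popped = -heapq.heappop(max_heap)
print(popped)        # 4.5
print(-max_heap[0])  # 3.2
```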

(1) Algorithm idea (using the heap module heapq)

We implement the K-nearest neighbor search with a max-heap of size K:

1. First, descend from the root node to a leaf node.

2. Starting from the leaf node, backtrack, recording in the max-heap the distance from each visited node to the target point.

(1) If the heap holds fewer than k nodes, push the current node and, at every node on the way back up (including the root), also visit the subtree on the other side.

(2) If the heap holds k nodes, take the maximum distance in the heap on every backtracking step and check whether the circle of that radius around the target point crosses the splitting hyperplane of the current node; this decides whether to visit the other side. When a newly visited node is closer to the target point than the current maximum distance, the maximum is removed from the heap and the new node is inserted, with the heap reordered. The search ends when the circle whose radius is the maximum of the k distances no longer intersects the other side of any node on the path.
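The heap bookkeeping in rules (1) and (2) can be tried out on its own, without the tree. The sketch below simply scans all points in order and keeps the k nearest in a size-k simulated max-heap; the function name `k_nearest_by_heap` is hypothetical, and the tree traversal and pruning are deliberately left out. In the k-d tree search, the value `-heap[0][0]` would play the role of the pruning radius.

```python
import heapq
import math


def k_nearest_by_heap(points, query, k):
    """Keep the k nearest points seen so far in a size-k simulated max-heap."""
    heap = []  # entries are (-distance, point); heap[0] is the farthest kept point
    for p in points:
        d = math.dist(p, query)
        if len(heap) < k:
            # Rule (1): the heap is not full yet, so push directly.
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:
            # Rule (2): closer than the current maximum, so evict it.
            heapq.heapreplace(heap, (-d, p))
    # Sort from nearest to farthest for display.
    return sorted((-neg_d, p) for neg_d, p in heap)


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
result = k_nearest_by_heap(points, (6, 5), 2)
for d, p in result:
    print(p, round(d, 3))
# (5, 4) 1.414
# (4, 7) 2.828
```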

(2) Code implementation

import heapq

import numpy as np


class Node:
    def __init__(self, data, sp=0, left=None, right=None):
        self.data = data
        self.sp = sp  # splitting dimension: 0 = feature 1, 1 = feature 2
        self.left = left
        self.right = right
        # heapq is a min-heap, so the negated distance is stored to simulate
        # a max-heap; the default -inf therefore stands for a distance of +inf
        self.nearest_dist = -np.inf

    def __lt__(self, other):
        return self.nearest_dist < other.nearest_dist


class KDTree:
    def __init__(self, data):
        self.k = data.shape[1]  # number of dimensions (not the k of k-nearest)
        self.root = self.createTree(data, 0)
        self.heap = []  # initialize the heap

    def createTree(self, dataset, sp):
        if len(dataset) == 0:
            return None
        # Sort the rows by the splitting column
        dataset_sorted = dataset[np.argsort(dataset[:, sp])]
        # Index of the median
        mid = len(dataset) // 2
        # Build the subtrees, cycling through the dimensions
        left = self.createTree(dataset_sorted[:mid], (sp + 1) % self.k)
        right = self.createTree(dataset_sorted[mid + 1:], (sp + 1) % self.k)
        parentNode = Node(dataset_sorted[mid], sp, left, right)
        return parentNode

    def nearest(self, x, k):
        self.heap = []  # reset so that repeated queries do not interfere

        def visit(node):
            if node is not None:
                dis = node.data[node.sp] - x[node.sp]
                # Descend into the side of the split that contains x
                visit(node.left if dis > 0 else node.right)
                # Euclidean (2-norm) distance between the current node and x
                curr_dis = np.linalg.norm(x - node.data, 2)
                node.nearest_dist = -curr_dis
                if len(self.heap) < k:
                    # The heap is not full yet: push directly
                    heapq.heappush(self.heap, node)
                else:
                    # self.heap[0] holds the largest distance kept so far;
                    # replace it if the current node is closer
                    if self.heap[0].nearest_dist < -curr_dis:
                        heapq.heapreplace(self.heap, node)
                # If the heap is not full, or the circle whose radius is the
                # largest kept distance crosses the splitting hyperplane,
                # the other subtree must be searched as well
                if len(self.heap) < k or abs(self.heap[0].nearest_dist) > abs(dis):
                    # Exact opposite of the branch taken above, so that the
                    # case dis == 0 does not visit the same side twice
                    visit(node.right if dis > 0 else node.left)

        # Start the search from the root
        visit(self.root)
        # Largest negated distance first, i.e. nearest node first
        nds = heapq.nlargest(k, self.heap)
        for nd in nds:
            # Negate back to print the true distance
            print(nd.data, -nd.nearest_dist)


data = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
kdtree = KDTree(data)  # create the KDTree
kdtree.nearest(np.array([6, 5]), 5)

(3) Comparison with the original (brute-force) KNN

import numpy as np


def KNNClassfy(preData, dataSet, k):
    # Squared Euclidean distances; the square root is skipped here because
    # it does not change the ordering and saves one computation
    distance = np.sum(np.power(dataSet - preData, 2), 1)
    # Sort from small to large and keep the indices of the k nearest points
    sortDistIdx = np.argsort(distance, 0)[:k]
    for i in range(k):
        print(dataSet[sortDistIdx[i]], np.linalg.norm(dataSet[sortDistIdx[i]] - preData, 2))


data = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
predata = np.array([6, 5])
KNNClassfy(predata, data, 5)