# 1 kd tree construction

Take the set of two-dimensional points ((x, y)): (2,3), (5,4), (9,6), (4,7), (8,1), (7,2) as an example to illustrate the construction process of a k-d tree.

## (1) Build steps

### 1. When building the root node, the splitting dimension is (x). Sorted by the (x) dimension from small to large, the point set is:

(2,3), (4,7), (5,4), (7,2), (8,1), (9,6);

The median is 7, so the point (7,2) is selected. (Note: the mathematical median of 2, 4, 5, 7, 8, 9 is (5 + 7) / 2 = 6, but the algorithm needs the median to be an actual point in the set, so this article takes the index len(points)//2 = 3, giving points[3] = (7,2).)

### 2. (2,3), (4,7), (5,4) go into the left subtree of the (7,2) node; (8,1), (9,6) go into the right subtree.

### 3. When constructing the left subtree of the (7,2) node, the splitting dimension for the point set (2,3), (4,7), (5,4) is (y). The median 4 is selected from 3, 4, 7, so (5,4) becomes the splitting plane; (2,3) hangs on its left subtree and (4,7) on its right subtree.

### 4. When constructing the right subtree of the (7,2) node, the splitting dimension for the point set (8,1), (9,6) is also (y). The median point is (9,6), which becomes the splitting plane, and (8,1) hangs on its left subtree. At this point, the k-d tree has been built.

As the construction process and the figure below show, building a k-d tree is the process of partitioning the two-dimensional plane step by step.

## (2) Code implementation: building the kd tree

```python
class Node:
    def __init__(self, data, sp=0, left=None, right=None):
        self.data = data
        self.sp = sp  # 0: split on feature 1; 1: split on feature 2
        self.left = left
        self.right = right

    def __lt__(self, other):
        return self.data < other.data
```
```python
import numpy as np

class KDTree:
    def __init__(self, data):
        self.dim = data.shape[1]  # number of features
        self.root = self.createTree(data, 0)
        self.nearest_node = None
        self.nearest_dist = np.inf  # initialize to infinity

    def createTree(self, dataset, sp):
        if len(dataset) == 0:
            return None

        dataset_sorted = dataset[np.argsort(dataset[:, sp])]  # sort by the splitting feature
        # median index
        mid = len(dataset) // 2
        # build the node recursively
        left = self.createTree(dataset_sorted[:mid], (sp + 1) % self.dim)
        right = self.createTree(dataset_sorted[mid + 1:], (sp + 1) % self.dim)
        parentNode = Node(dataset_sorted[mid], sp, left, right)

        return parentNode
```
```python
data = np.array([[2,3],[5,4],[9,6],[4,7],[8,1],[7,2]])
kdtree = KDTree(data)  # create the KDTree
```
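To check that construction matches the walkthrough above, here is a minimal standalone sketch (independent of the class above; `build` and `preorder` are illustrative helper names, not part of the original code) that builds the same tree as nested tuples and prints its preorder traversal:

```python
import numpy as np

def build(points, depth=0):
    # recursively build a kd-tree as nested tuples: (point, left, right)
    if len(points) == 0:
        return None
    axis = depth % 2  # alternate between x (0) and y (1)
    pts = points[np.argsort(points[:, axis])]
    mid = len(pts) // 2  # same median rule as the article: len(points)//2
    return (tuple(int(v) for v in pts[mid]),
            build(pts[:mid], depth + 1),
            build(pts[mid + 1:], depth + 1))

def preorder(node):
    # root, then left subtree, then right subtree
    if node is None:
        return []
    point, left, right = node
    return [point] + preorder(left) + preorder(right)

data = np.array([[2,3],[5,4],[9,6],[4,7],[8,1],[7,2]])
tree = build(data)
print(preorder(tree))  # [(7, 2), (5, 4), (2, 3), (4, 7), (9, 6), (8, 1)]
```

The preorder output matches the walkthrough: (7,2) is the root, (5,4) and (9,6) split on y, and (2,3), (4,7), (8,1) are leaves.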

# 2 kd tree search (find nearest neighbor node)

## (1) Simple case 1: query point (2.1, 3.1)

1. Through binary search from the root node along the search path, the nearest approximate point, namely the leaf node (2,3), can be found quickly.

2. The leaf node found this way is not necessarily the closest. If a closer point exists, it must lie inside the circle centered at the query point and passing through the leaf node.

3. In order to find the real nearest neighbor, we also need to perform the "backtracking" operation:

`The algorithm searches backwards along the search path for data points closer to the query point.`

### Derivation:

1. In this example, the binary search starts from (7,2), then reaches (5,4), and finally reaches (2,3). At this time, the nodes in the search path are < (7,2), (5,4), (2,3) >.

2. First, take (2,3) as the nearest neighbor; its distance to the query point (2.1,3.1) is 0.1414.

3. Then backtrack to its parent node (5,4) and check whether the other child's subspace contains a point closer to the query. Draw a circle centered at (2.1,3.1) with radius 0.1414, as shown in Figure 3. The circle does not intersect the hyperplane y = 4, so there is no need to search the right subspace of the (5,4) node.

4. Finally, backtrack to (7,2). The circle centered at (2.1,3.1) with radius 0.1414 does not intersect the hyperplane x = 7 either, so there is no need to enter the right subspace of (7,2). At this point all nodes on the search path have been backtracked, the search ends, and the nearest neighbor (2,3) is returned with a distance of 0.1414.
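The numbers in this walkthrough can be verified with a few lines of arithmetic (a small illustrative check, not part of the original code):

```python
import math

query, leaf = (2.1, 3.1), (2, 3)
# Euclidean distance from the query point to the leaf (2,3)
dist = math.hypot(query[0] - leaf[0], query[1] - leaf[1])
print(round(dist, 4))  # 0.1414
# the radius-0.1414 circle around the query crosses neither y = 4 nor x = 7,
# so neither sibling subspace needs to be searched
print(dist < abs(4 - query[1]), dist < abs(7 - query[0]))  # True True
```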

## (2) Case 2: the search point is (2, 4.5)

1. Binary search is carried out first: from (7,2) the search reaches the (5,4) node, whose partition hyperplane is y = 4. Since the query point's y value is 4.5, the search enters the right subspace and finds (4,7), forming the search path < (7,2), (5,4), (4,7) >.

2. Take (4,7) as the current nearest neighbor; its distance to the query point is 3.202.

3. Then backtrack to (5,4), whose distance to the query point is 3.041. (Since (4,7) is 3.202 away while (5,4) is only 3.041 away, (5,4) becomes the current nearest neighbor.)

4. Draw a circle centered at (2, 4.5) with radius 3.041, as shown in Figure 4. The circle intersects the hyperplane y = 4, so the left subspace of (5,4) must also be searched, and the node (2,3) is added to the search path, which becomes < (7,2), (2,3) >.

5. Backtracking to the leaf node (2,3): (2,3) is closer to (2,4.5) than (5,4), so the nearest neighbor is updated to (2,3) and the nearest distance is updated to 1.5.

6. Backtrack to (7,2); the circle centered at (2,4.5) with radius 1.5 does not intersect the splitting hyperplane x = 7.

At this point, backtracking along the search path is complete. The nearest neighbor (2,3) is returned, with a nearest distance of 1.5.
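The distances quoted in this case can likewise be verified directly (an illustrative check, not part of the original code):

```python
import math

query = (2, 4.5)
# distances from the query point to each node visited during backtracking
for p in [(4, 7), (5, 4), (2, 3)]:
    print(p, round(math.hypot(p[0] - query[0], p[1] - query[1]), 3))
# (4, 7) 3.202
# (5, 4) 3.041
# (2, 3) 1.5
```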

## (3) Code implementation

```python
import numpy as np

class Node:
    def __init__(self, data, sp=0, left=None, right=None):
        self.data = data
        self.sp = sp  # 0: split on feature 1; 1: split on feature 2
        self.left = left
        self.right = right

    def __lt__(self, other):
        return self.data < other.data
```
```python
class KDTree:
    def __init__(self, data):
        self.dim = data.shape[1]  # number of features
        self.root = self.createTree(data, 0)
        self.nearest_node = None
        self.nearest_dist = np.inf  # initialize to infinity

    def createTree(self, dataset, sp):
        if len(dataset) == 0:
            return None

        dataset_sorted = dataset[np.argsort(dataset[:, sp])]  # sort by the splitting feature
        # median index
        mid = len(dataset) // 2
        # build the node recursively
        left = self.createTree(dataset_sorted[:mid], (sp + 1) % self.dim)
        right = self.createTree(dataset_sorted[mid + 1:], (sp + 1) % self.dim)
        parentNode = Node(dataset_sorted[mid], sp, left, right)

        return parentNode

    def nearest(self, x):
        def visit(node):
            if node is not None:
                dis = node.data[node.sp] - x[node.sp]
                # descend first into the child on the query point's side
                visit(node.left if dis > 0 else node.right)
                # distance between the current node and the target, by the 2-norm
                curr_dis = np.linalg.norm(x - node.data, 2)
                # update the current best
                if curr_dis < self.nearest_dist:
                    self.nearest_dist = curr_dis
                    self.nearest_node = node
                # if the current best radius crosses the splitting hyperplane,
                # the other subtree may contain a closer point, so visit it too
                if self.nearest_dist > abs(dis):
                    visit(node.left if dis < 0 else node.right)

        # reset the best-so-far for each query, then search from the root
        self.nearest_node = None
        self.nearest_dist = np.inf
        visit(self.root)
        return self.nearest_node.data, self.nearest_dist
```
```python
data = np.array([[2,3],[5,4],[9,6],[4,7],[8,1],[7,2]])
kdtree = KDTree(data)  # create the KDTree
node, dist = kdtree.nearest(np.array([6,5]))
print(node, dist)
```

## (4) Performance comparison

https://www.cnblogs.com/21207-ihome/p/6084670.html

Generally speaking, the nearest-neighbor search only needs to examine a few leaf nodes, as shown in the following figure. However, if the instance points are poorly distributed, almost all nodes may have to be traversed, as shown below:

# 3 kd tree: searching the nearest K nodes in the k-nearest neighbor algorithm

## (1) Algorithm idea (with the help of a heap -- Python's heapq)

### We implement the K-nearest-neighbor search with a max-heap of size K:

1. First, search down from the root node to a leaf node.

2. Starting from the leaf node, record the distance from each visited node to the target point in the max-heap.

(1) If the heap holds fewer than k elements, backtrack normally and push every visited node; even when the root node is reached, the subtree on the other side still needs to be visited.

(2) If the heap size equals k, at each backtracking step take the current maximum distance, check whether the circle of that radius around the target point intersects the other side of the current node's splitting hyperplane, and decide whether to visit it. Whenever a new node is closer than the current maximum, the maximum distance is popped from the heap and the new value is inserted. The search ends when, with k elements in the heap, the maximum distance can no longer reach the other side of any remaining node.
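The heap bookkeeping described above can be sketched in isolation; `push_bounded` is an illustrative helper (not part of the original code) showing how negating distances makes Python's min-heap module `heapq` act as a bounded max-heap:

```python
import heapq

def push_bounded(heap, dist, label, k):
    # negate distances so Python's min-heap keeps the LARGEST distance at heap[0]
    if len(heap) < k:
        heapq.heappush(heap, (-dist, label))
    elif -dist > heap[0][0]:  # new distance smaller than the current k-th largest
        heapq.heapreplace(heap, (-dist, label))

heap = []
for d, name in [(4.0, 'a'), (1.0, 'b'), (3.0, 'c'), (2.0, 'd')]:
    push_bounded(heap, d, name, 3)

# the heap now holds the 3 smallest distances
print(sorted(-d for d, _ in heap))  # [1.0, 2.0, 3.0]
```

`heap[0]` always exposes the current k-th (largest) distance, which is exactly the radius needed for the hyperplane-intersection test during backtracking.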

## (2) Code implementation

```python
import numpy as np
import heapq

class Node:
    def __init__(self, data, sp=0, left=None, right=None):
        self.data = data
        self.sp = sp  # 0: split on feature 1; 1: split on feature 2
        self.left = left
        self.right = right
        # store the negated distance so Python's min-heap behaves as a max-heap:
        # the default -inf corresponds to an actual distance of +inf
        self.nearest_dist = -np.inf

    def __lt__(self, other):
        return self.nearest_dist < other.nearest_dist

class KDTree:
    def __init__(self, data):
        self.dim = data.shape[1]  # number of features
        self.root = self.createTree(data, 0)
        self.heap = []  # initialize the heap

    def createTree(self, dataset, sp):
        if len(dataset) == 0:
            return None

        dataset_sorted = dataset[np.argsort(dataset[:, sp])]  # sort by the splitting feature
        # median index
        mid = len(dataset) // 2
        # build the node recursively
        left = self.createTree(dataset_sorted[:mid], (sp + 1) % self.dim)
        right = self.createTree(dataset_sorted[mid + 1:], (sp + 1) % self.dim)
        parentNode = Node(dataset_sorted[mid], sp, left, right)

        return parentNode

    def nearest(self, x, k):
        def visit(node):
            if node is not None:
                dis = node.data[node.sp] - x[node.sp]
                # descend first into the child on the query point's side
                visit(node.left if dis > 0 else node.right)

                # distance between the current node and the target, by the 2-norm
                curr_dis = np.linalg.norm(x - node.data, 2)
                node.nearest_dist = -curr_dis
                # update the heap
                if len(self.heap) < k:  # heap not yet full: push directly
                    heapq.heappush(self.heap, node)
                else:
                    # self.heap[0] holds the current k-th (largest) distance, negated
                    if self.heap[0].nearest_dist < -curr_dis:
                        heapq.heapreplace(self.heap, node)

                # if the heap is not full, or the circle of the current k-th distance
                # crosses the splitting hyperplane, the other subtree must be visited
                if len(self.heap) < k or abs(self.heap[0].nearest_dist) > abs(dis):
                    visit(node.left if dis < 0 else node.right)

        self.heap = []  # reset the heap for each query, then search from the root
        visit(self.root)

        nds = heapq.nlargest(k, self.heap)  # nearest first
        for nd in nds:
            print(nd.data, -nd.nearest_dist)
```
```python
data = np.array([[2,3],[5,4],[9,6],[4,7],[8,1],[7,2]])
kdtree = KDTree(data)  # create the KDTree
kdtree.nearest(np.array([6,5]), 5)
```

## (3) Comparison with the original KNN

```python
import numpy as np

def KNNClassfy(preData, dataSet, k):
    # squared distances; the square root is skipped here to save one computation
    distance = np.sum(np.power(dataSet - preData, 2), 1)
    sortDistIdx = np.argsort(distance, 0)[:k]  # ascending sort, take the k nearest indices
    for i in range(k):
        print(dataSet[sortDistIdx[i]], np.linalg.norm(dataSet[sortDistIdx[i]] - preData, 2))

data = np.array([[2,3],[5,4],[9,6],[4,7],[8,1],[7,2]])
predata = np.array([6,5])

KNNClassfy(predata, data, 5)
```
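As a sanity check, a short NumPy sketch (independent of the functions above) computes the true ranking for the query point (6,5); the kd-tree search and the brute-force version should both agree with it:

```python
import numpy as np

data = np.array([[2,3],[5,4],[9,6],[4,7],[8,1],[7,2]])
query = np.array([6, 5])
# exact Euclidean distances to every point, then sort ascending
d = np.linalg.norm(data - query, axis=1)
order = np.argsort(d)
# the two unambiguous nearest points ((9,6) and (7,2) tie for third place)
print([tuple(map(int, data[i])) for i in order[:2]])  # [(5, 4), (4, 7)]
print(round(float(d[order[0]]), 3))  # 1.414
```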

Posted on Sun, 07 Jun 2020 05:46:29 -0400 by TabLeft