Overview
The K-nearest neighbor (KNN) method is a basic classification and regression method with no explicit learning process. The choice of the value of K, the distance metric, and the classification decision rule are its three basic elements. In practice, only the K most similar samples in the data set are selected, with K usually no greater than 20; the new instance is then assigned the class that occurs most frequently among those K neighbors.
- Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
- Disadvantages: high computational and space complexity
- Applicable data types: numerical and nominal
Nearest neighbor algorithm
Nearest neighbor model
Distance measurement
The $L_p$ distance is defined as

$$
L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}
$$
When $p = 2$, it is the Euclidean distance:

$$
L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}
$$
When $p = 1$, it is the Manhattan distance:

$$
L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|
$$
When $p = \infty$, it is the maximum of the distances over all coordinates:

$$
L_\infty(x_i, x_j) = \max_{l} \left| x_i^{(l)} - x_j^{(l)} \right|
$$
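As a quick illustration (a minimal sketch that is not part of the original text; the helper name `lp_distance` is our own), the three special cases above can be computed directly with NumPy:

```python
import numpy as np

def lp_distance(xi, xj, p=2):
    """Compute the L_p (Minkowski) distance between two points."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    if np.isinf(p):
        # L_infinity: the maximum coordinate-wise difference
        return np.max(np.abs(xi - xj))
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

x1, x2 = [1, 1], [4, 5]
print(lp_distance(x1, x2, p=1))       # Manhattan distance: 7.0
print(lp_distance(x1, x2, p=2))       # Euclidean distance: 5.0
print(lp_distance(x1, x2, p=np.inf))  # Chebyshev (max) distance: 4.0
```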
Selection of 𝐾 value
If K is small, the approximation error of "learning" decreases, but the estimation error increases and the prediction becomes sensitive to noise. A smaller K means a more complex overall model that is prone to overfitting.

If K is large, the estimation error of learning decreases, but the approximation error increases. A larger K means a simpler overall model.

If K = N, then no matter what the input instance is, the prediction is simply the class with the most training instances.

In practice, a relatively small K is generally chosen, and cross validation is used to select the optimal value of K, as sketched below.
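A sketch of this selection step (this uses scikit-learn's KNeighborsClassifier and GridSearchCV rather than the custom KNN class below, and the search range 1-20 is our own choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# 5-fold cross validation over a small range of K values
param_grid = {'n_neighbors': list(range(1, 21))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print("Best K:", grid.best_params_['n_neighbors'])
print("Cross-validated accuracy:", grid.best_score_)
```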
Classification decision rule
The majority voting rule can be explained as follows. If the loss function for classification is the 0-1 loss

$$
L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}
$$

and the classification function is $f: \mathbf{R}^n \to \{c_1, c_2, \dots, c_K\}$, then the probability of misclassification is

$$
P(Y \neq f(X)) = 1 - P(Y = f(X)).
$$
For a given instance $x \in \mathcal{X}$, the $k$ training instances nearest to $x$ form a set $N_k(x)$. If the class of the region covering $N_k(x)$ is $c_j$, then the misclassification rate is:
$$
\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)
$$

To minimize the misclassification rate is therefore to maximize $\sum_{x_i \in N_k(x)} I(y_i = c_j)$, so the majority voting rule is equivalent to empirical risk minimization.
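A tiny numerical illustration of this equivalence (the neighbor labels here are made up for the example): the majority class among the k neighbors is exactly the class with the smallest empirical misclassification rate.

```python
from collections import Counter

# Hypothetical labels of the k = 5 nearest neighbors of some query point
neighbor_labels = ['A', 'A', 'B', 'A', 'B']
k = len(neighbor_labels)

# Empirical misclassification rate (1/k) * sum(I(y_i != c_j)) for each candidate class
for c in set(neighbor_labels):
    rate = sum(y != c for y in neighbor_labels) / k
    print(c, rate)  # A -> 0.4, B -> 0.6

# Majority voting picks the class that minimizes that rate
print(Counter(neighbor_labels).most_common(1)[0][0])  # 'A'
```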
Code
Main.py
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from myKNN import KNN
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def pplt(df, test_point=(0, 0), flag=0):
    plt.scatter(df[:50]['sepal length'], df[:50]['sepal width'], label='0')
    plt.scatter(df[50:100]['sepal length'], df[50:100]['sepal width'], label='1')
    if flag == 1:
        plt.plot(test_point[0], test_point[1], 'bo', label='test_point')
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.legend()
    plt.show()


iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
# print(df)
pplt(df)

data = np.array(df.iloc[:100, [0, 1, -1]])
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("np.shape(X_train)", np.shape(X_train))
print("np.shape(y_train)", np.shape(y_train))

clf = KNN(X_train, y_train)
print(clf.score(X_test, y_test))

test_point = [6.0, 3.0]
print('Test Point: {}'.format(clf.predict(test_point)))
pplt(df, test_point, flag=1)
```
myKNN.py
```python
import numpy as np
from collections import Counter


class KNN:
    def __init__(self, X_train, y_train, n_neighbors=3, p=2):
        """
        :param n_neighbors: number of neighbors
        :param p: order of the distance metric (p=2 is the Euclidean distance)
        """
        self.n = n_neighbors
        self.p = p
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X):
        # Start with the first n training points
        knn_list = []
        for i in range(self.n):
            # np.linalg.norm(x, ord=None, axis=None, keepdims=False) computes a norm;
            # with a 1-D input and ord=p it returns the L_p norm of the difference vector
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            knn_list.append((dist, self.y_train[i]))
        # Scan the remaining points, keeping only the n closest ones
        for i in range(self.n, len(self.X_train)):
            max_index = knn_list.index(max(knn_list, key=lambda x: x[0]))
            dist = np.linalg.norm(X - self.X_train[i], ord=self.p)
            if knn_list[max_index][0] > dist:
                knn_list[max_index] = (dist, self.y_train[i])
        # Majority vote among the n nearest neighbors
        knn = [k[-1] for k in knn_list]
        count_pairs = Counter(knn)
        max_count = sorted(count_pairs.items(), key=lambda x: x[1])[-1][0]
        return max_count

    def score(self, X_test, y_test):
        right_count = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right_count += 1
        return right_count / len(X_test)
```
Output
Implementation of nearest neighbor method: kd tree
Constructing kd tree
A kd tree is a tree data structure that stores instance points in k-dimensional space for fast retrieval. It is a binary tree that represents a partition of the k-dimensional space: the kd tree repeatedly splits the k-dimensional space with hyperplanes perpendicular to the coordinate axes, producing a series of k-dimensional hyper-rectangular regions. Each node of the kd tree corresponds to one k-dimensional hyper-rectangular region.
The method of constructing a kd tree is as follows. Construct the root node so that it corresponds to the hyper-rectangular region containing all instance points in the k-dimensional space. Then recursively split the k-dimensional space to generate child nodes: select a coordinate axis and a cut point on that axis; the hyperplane through the cut point and perpendicular to the chosen axis divides the current hyper-rectangular region into left and right sub-regions (child nodes), and the instances are divided between the two sub-regions accordingly. The process ends when a sub-region contains no more instances (the node at termination is a leaf node). Along the way, each instance is stored at the corresponding node.
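For example, for the six sample points [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]] used in main.py below, a hand-worked trace of this construction goes as follows: at depth 0 the points are sorted on the first coordinate and the median point (7, 2) becomes the root; the left subspace {(2, 3), (5, 4), (4, 7)} is then split on the second coordinate at (5, 4), leaving (2, 3) and (4, 7) as its children; the right subspace {(9, 6), (8, 1)} is split at (9, 6), with (8, 1) as its left child.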
Searching the kd tree
Given a target point, we search for its nearest neighbor. First find the leaf node containing the target point; then, starting from that leaf, move back up through the parent nodes, continually updating the point closest to the target, and terminate once it is certain that no closer node exists. In this way, the search is confined to a local region of the space, which greatly improves efficiency.
The leaf node containing the target point corresponds to the smallest hyper-rectangular region containing that point. Take the instance point stored at this leaf as the current nearest point; the true nearest neighbor must lie inside the hypersphere centered at the target point and passing through the current nearest point. Then return to the parent of the current node. If the hyper-rectangular region of the parent's other child intersects this hypersphere, look for an instance point closer to the target inside the intersection; if such a point exists, it becomes the new current nearest point. The algorithm then moves up to the next parent and repeats the process. If the other child's hyper-rectangular region does not intersect the hypersphere, or no closer point is found, the search on that branch stops.
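As a concrete illustration (a hand-worked trace of the search code below on the six-point tree described above, with target point (3, 4.5)): the descent first reaches the leaf (4, 7), which becomes the initial current nearest point (distance ≈ 2.69); backing up to (5, 4), the split point itself is closer (distance ≈ 2.06), and its other child (2, 3) is closer still (distance √3.25 ≈ 1.80); at the root (7, 2), the distance from the target to the splitting hyperplane is |7 − 3| = 4 > 1.80, so the hypersphere does not cross into the right subtree and the subtree rooted at (9, 6) is pruned. The search returns (2, 3) at distance ≈ 1.80 after visiting only 4 of the 6 nodes.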
Code implementation
main.py
```python
import KD
import search
import time
from random import random

data = [[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]]
kd = KD.KDTree(data)
KD.preorder(kd.root)


# Generate a k-dimensional random vector with each component in [0, 1)
def random_point(k):
    return [random() for _ in range(k)]


# Generate n k-dimensional random vectors
def random_points(k, n):
    return [random_point(k) for _ in range(n)]


ret = search.find_nearest(kd, [3, 4.5])
print(ret)

N = 400000
t0 = time.perf_counter()
kd2 = KD.KDTree(random_points(3, N))              # Build a kd tree containing 400,000 3-dimensional sample points
ret2 = search.find_nearest(kd2, [0.1, 0.5, 0.8])  # Find the point nearest to the target among the 400,000 samples
t1 = time.perf_counter()
print("Time:", t1 - t0, "s")
print(ret2)
```
KD.py
```python
# The main data structure stored in each node of the kd tree:
class KdNode(object):
    def __init__(self, dom_elt, split, left, right):
        self.dom_elt = dom_elt  # A k-dimensional vector (a sample point in k-dimensional space)
        self.split = split      # Integer: the index of the dimension used for splitting
        self.left = left        # kd tree for the subspace to the left of the splitting hyperplane
        self.right = right      # kd tree for the subspace to the right of the splitting hyperplane


class KDTree(object):
    def __init__(self, data):
        k = len(data[0])                    # Data dimension
        self.root = CreateNode(k, 0, data)  # Build the kd tree starting from dimension 0 and keep the root node


def CreateNode(k, split, data_set):
    """Split data_set on dimension `split` and create a KdNode."""
    if not data_set:  # The data set is empty
        return None
    # key takes a one-argument function whose return value is used for comparison;
    # operator.itemgetter(split) would work equally well here:
    # data_set.sort(key=itemgetter(split))
    data_set.sort(key=lambda x: x[split])  # Sort by the dimension to be split
    split_pos = len(data_set) // 2
    median = data_set[split_pos]           # The median point becomes the split point
    split_next = (split + 1) % k
    # Build the kd tree recursively
    return KdNode(median, split,
                  CreateNode(k, split_next, data_set[:split_pos]),      # Build the left subtree
                  CreateNode(k, split_next, data_set[split_pos + 1:]))  # Build the right subtree


# Preorder traversal of the kd tree
def preorder(root):
    print(root.dom_elt)
    if root.left:
        preorder(root.left)
    if root.right:
        preorder(root.right)
```
search.py
""" Yes, the structure is good kd Search the tree to find the sample point closest to the target point """ from math import sqrt from collections import namedtuple import KD # Define a namedtuple to store the nearest coordinate point, the nearest distance and the number of visited nodes respectively result = namedtuple("Result_tuple", "nearest_point nearest_dist nodes_visited") def find_nearest(tree, point): k = len(point) # Data dimension def travel(kd_node, target, max_dist): if kd_node is None: return result([0] * k, float("inf"), 0) nodes_visited = 1 s = kd_node.split # Dimension to split pivot = kd_node.dom_elt # Axis to split if target[s] <= pivot[s]: # If the s dimension of the target point is less than the corresponding value of the split axis (the target is closer to the left subtree) nearer_node = kd_node.left # The next access node is the root node of the left subtree further_node = kd_node.right # Record the right subtree at the same time else: # The target is closer to the right subtree nearer_node = kd_node.right further_node = kd_node.left temp1 = travel(nearer_node, target, max_dist) # Traverse to find the area containing the target point nearest = temp1.nearest_point # Use this leaf node as the "current closest point"“ dist = temp1.nearest_dist # Update nearest distance nodes_visited += temp1.nodes_visited if dist < max_dist: max_dist = dist # The nearest point will be in the hypersphere with the target point as the center and max_dist as the radius temp_dist = abs(pivot[s] - target[s]) # The distance between the target point and the split hyperplane in the s-dimension if max_dist < temp_dist: # Determine whether the hypersphere intersects the hyperplane return result(nearest, dist, nodes_visited) # Disjoint can be returned directly without judgment # --------------------------------------------------------------------------------- # Calculate the Euclidean distance between the target point and the segmentation point temp_dist = sqrt(sum((p1 - p2) ** 2 for p1, p2 in zip(pivot, target))) if temp_dist < dist: # If closer nearest = pivot # Update nearest point dist = temp_dist # Update nearest distance max_dist = dist # Update hypersphere radius # Check whether the area corresponding to another child node has a closer point temp2 = travel(further_node, target, max_dist) nodes_visited += temp2.nodes_visited if temp2.nearest_dist < dist: # If there is a closer distance in another child node nearest = temp2.nearest_point # Update nearest point dist = temp2.nearest_dist # Update nearest distance return result(nearest, dist, nodes_visited) return travel(tree.root, point, float("inf")) # Recursion from root node