LOF for anomaly detection

preface:

LOF stands for Local Outlier Factor. LOF judges whether each point p is an outlier mainly by comparing the density at p with the density at its neighboring points: the lower the density at p, the more likely p is to be identified as an outlier. The density itself is computed from the distances between points: the farther apart the points, the lower the density; the closer together, the higher the density, which matches intuition. Moreover, because LOF computes density from the k-neighborhood of a point rather than globally, it is called a "local" outlier factor. In other words, LOF is a density-based method that detects anomalies through the local density of the data.

principle

The LOF algorithm is a representative density-based outlier detection method. It computes an outlier factor LOF for each point in the data set and decides whether the point is an outlier by checking how far its LOF is from 1. If the LOF is much greater than 1, the point is considered an outlier; if it is close to 1, the point is normal.
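As a quick illustration, here is a minimal sketch using scikit-learn's LocalOutlierFactor (note that scikit-learn stores the negated LOF score in negative_outlier_factor_, so values much smaller than -1 correspond to LOF values much greater than 1):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Four points in a tight cluster plus one isolated point.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0]])
clf = LocalOutlierFactor(n_neighbors=3)
labels = clf.fit_predict(X)          # -1 = outlier, 1 = inlier
lof = -clf.negative_outlier_factor_  # recover the LOF values themselves
print(labels)  # the isolated point [5, 5] is labeled -1
print(lof)     # its LOF is much greater than 1; the others are close to 1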

In order to describe the LOF algorithm, the following concepts are introduced:
(1) k-distance of object p (the distance from p to its k-th nearest object)
For a positive integer k, the k-distance of object p is written k-distance(p). Let o be an object in the sample space and write the distance between p and o as d(p, o). If the following two conditions hold, then k-distance(p) = d(p, o):
1) in the sample space, at least k objects q satisfy d(p, q) <= d(p, o);
2) in the sample space, at most k-1 objects q satisfy d(p, q) < d(p, o).
Equivalently, k-distance(p) = max{ d(p, o) : o in Nk(p) }, i.e. the largest distance within the k-neighborhood of p.
Clearly, k-distance(p) quantifies the extent of the local region around object p: in regions with high object density the value of k-distance(p) is small, while in regions with low object density it is large.
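For instance, the k-distance can be computed directly from pairwise distances. The following is a small NumPy sketch following the definition above (the function name k_distance is my own):

import numpy as np

def k_distance(X, i, k):
    """k-distance of object X[i]: the distance to its k-th nearest other object."""
    d = np.linalg.norm(X - X[i], axis=1)  # distances from X[i] to every object
    d = np.delete(d, i)                   # exclude p itself
    return np.sort(d)[k - 1]              # the k-th smallest distance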

(2) The k-distance neighborhood of object p
Once the k-distance of object p is known, the set of objects whose distance from p does not exceed k-distance(p) is called the k-distance neighborhood of p, written Nk(p).
This neighborhood is the set of all objects (excluding p itself) in the region centered at p with radius k-distance(p). Since several objects may lie at exactly the k-distance, the set contains at least k objects. Intuitively, the more outlying the point, the larger this region; the less outlying, the smaller the region.
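Continuing the sketch, the k-distance neighborhood collects every object within that radius, excluding p itself (ties at exactly k-distance(p) are included, which is why the set can hold more than k objects):

def k_neighborhood(X, i, k):
    """Indices of all objects (excluding X[i]) within k-distance(X[i]) of X[i]."""
    radius = k_distance(X, i, k)
    d = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero((d <= radius) & (np.arange(len(X)) != i))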
(3) The reachability distance of object p relative to object o
Formula: reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }
To understand the formula, consider two different points p1 and p2 relative to an object o, with k = 3. For p1, which lies inside the k-neighborhood of o, the reachability distance is k-distance(o), i.e. the radius of the neighborhood circle. For p2, which clearly lies outside the k-neighborhood of o, the reachability distance is simply the actual distance d(p2, o) between the two points.
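The formula translates directly into code (same sketch as above):

def reach_dist(X, i, j, k):
    """Reachability distance of X[i] relative to X[j]: max(k-distance(o), d(p, o))."""
    return max(k_distance(X, j, k), np.linalg.norm(X[i] - X[j]))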

(4) Local reachability density
The local reachability density of object p is defined as the reciprocal of the average reachability distance over the k-nearest neighbors of p:
lrd_k(p) = |Nk(p)| / ( sum over o in Nk(p) of reach-dist_k(p, o) )
In essence it is the reciprocal of the mean of all reachability distances within the k-distance neighborhood of p; the larger it is, the more tightly packed the data around p.
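In code (continuing the sketch), this is the inverse of the mean reachability distance over the neighborhood:

def lrd(X, i, k):
    """Local reachability density: reciprocal of the mean reachability distance."""
    neighbors = k_neighborhood(X, i, k)
    return 1.0 / np.mean([reach_dist(X, i, j, k) for j in neighbors])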

(5) Local outlier factor

The local outlier factor of object p is the average, over the points in its k-distance neighborhood, of the ratio of each neighbor's local reachability density to that of p:
LOF_k(p) = ( sum over o in Nk(p) of lrd_k(o) / lrd_k(p) ) / |Nk(p)|
If the ratio is close to 1, p has roughly the same density as the points in its k-distance neighborhood and may belong to the same cluster as them. If the ratio is less than 1, the density of p is higher than that of its neighbors, and p is a dense point. If the ratio is greater than 1, the density of p is lower than that of its neighbors, and p is more likely to be an outlier. In this way, outliers can be found accurately even when the sample data are unevenly distributed.
Indeed, for an outlier p, the lrd_k(o) of its neighbors o is much larger than lrd_k(p); working through such an extreme case makes the algorithm easy to understand.
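Putting the pieces together gives the factor itself (still the same illustrative sketch; for real use, prefer scikit-learn's optimized implementation shown later):

def lof(X, i, k):
    """Local outlier factor: mean ratio of the neighbors' lrd to the lrd of X[i]."""
    neighbors = k_neighborhood(X, i, k)
    return np.mean([lrd(X, j, k) for j in neighbors]) / lrd(X, i, k)

# For the toy data from the earlier example, lof(X, 4, 3) comes out much
# greater than 1, while the clustered points score close to 1.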

Advantages and disadvantages

Advantages:
The LOF algorithm is unsupervised.
The LOF algorithm is density based.
The LOF algorithm is suitable for anomaly detection on data whose regions have different densities (see the sketch after these lists).

Disadvantages:
The data must exhibit an obvious density difference for outliers to be detectable.
The computation is expensive.
Its application scenarios are limited.
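A small experiment illustrating the different-density advantage (the data here are synthetic and the numbers are illustrative):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
dense = rng.normal(0.0, 0.1, size=(100, 2))   # tight cluster around the origin
sparse = rng.normal(5.0, 1.0, size=(100, 2))  # loose cluster around (5, 5)
outlier = np.array([[0.0, 1.0]])              # near the dense cluster, yet far
                                              # relative to its local density
X = np.vstack([dense, sparse, outlier])

clf = LocalOutlierFactor(n_neighbors=20)
labels = clf.fit_predict(X)
print(labels[-1])  # -1: flagged, because LOF compares against local density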

Adjusting parameters

LocalOutlierFactor(n_neighbors=20, algorithm='auto', leaf_size=30,
                   metric='minkowski', p=2, metric_params=None,
                   contamination=0.1, n_jobs=1)

n_neighbors: int, optional (default=20) - Number of neighbors used by default for kneighbors queries. If n_neighbors is greater than the number of samples provided, all samples are used.

algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional -
Algorithm used to compute the nearest neighbors:
'ball_tree' uses BallTree
'kd_tree' uses KDTree
'brute' uses brute-force search
'auto' attempts to decide the most appropriate algorithm based on the values passed to fit().
Note: fitting on sparse input will override this setting and use brute force.

leaf_size: int, optional (default=30) - Leaf size passed to BallTree or KDTree.
This can affect the speed of construction and querying, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

metric: string or callable, default 'minkowski' - The metric used for distance computation.
from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']

p: integer, optional (default=2) - Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances.
When p = 1, this is equivalent to using manhattan_distance (l1).
For p = 2, euclidean_distance (l2) is used.
For arbitrary p, minkowski_distance (l_p) is used.
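A usage sketch tying these parameters together (the values are illustrative, not recommendations):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).normal(size=(200, 2))
clf = LocalOutlierFactor(
    n_neighbors=20,       # neighborhood size used for the k-distance
    algorithm='auto',     # let scikit-learn pick BallTree / KDTree / brute force
    leaf_size=30,         # tree leaf size; affects build/query speed and memory
    metric='minkowski',   # distance metric
    p=2,                  # p=2 makes the Minkowski metric Euclidean
    contamination=0.1,    # expected proportion of outliers in the data
)
labels = clf.fit_predict(X)            # -1 for outliers, 1 for inliers
scores = -clf.negative_outlier_factor_
print((labels == -1).sum())            # about 200 * 0.1 = 20 points flagged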

summary

In practice, the algorithm's execution efficiency is low on large data sets. As a density-based algorithm, its results are better than KNN's. In suitable scenarios, the method is worth trying.
