ORB-SLAM2 source code analysis: feature matching


1, Feature matching in monocular initialization: SearchForInitialization

2, TrackWithMotionModel


3, Introduction to bag of words (BoW)

1. Intuitive understanding of bag of words

2. Basic idea of bag of words

3. From dictionary structure to k-ary tree

K-means clustering

4. Similarity calculation: TF-IDF

5. Summary of the bag-of-words model

4, General process of bag of words

Loop closure detection:

Accelerated matching

How is BoW built and generated? Why does BoW usually use the BRIEF descriptor?

Speed

Accuracy

Visualization

Offline training of the vocabulary tree (also known as the dictionary)

Online generation of the BoW vector for an image

Source code analysis

Understanding the bag-of-words vector BowVector

Understanding FeatureVector

Saving and loading the dictionary tree

1, Feature matching in monocular initialization: SearchForInitialization

Compute the distance between descriptors (Hamming distance). The candidate with the smallest distance is taken as the matching point.

1. The number of matched points must reach 100 before initialization can proceed

As shown in the figure above, for each feature point of the first image, a search window with a radius of 100 pixels is placed around the corresponding position in the second image, and candidate feature points are looked up inside this window.

Candidate feature points are retrieved quickly with GetFeaturesInArea

2. The selected feature points must belong to level 0 of the image pyramid

3. Eliminate points that are within the selected grid cells but outside the search radius

4. The best distance should not exceed 50, and the ratio of the best distance to the second-best distance is checked

5. Build a histogram of the orientation differences of the matched points

Take the orientation of the FAST feature point in the first image and the orientation of its match in the second image, compute the angle between the two directions, and accumulate these angles into a histogram.

6. Find the three histogram bins that contain the most matched points

7. Check whether the count of the second-largest bin is less than 0.1 times the count of the largest bin. If so, only the largest bin is kept as the main direction

8. Otherwise, check whether the count of the third-largest bin is less than 0.1 times the count of the largest bin. If so, the largest and second-largest bins are kept as the main directions
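
Steps 5 to 8 can be sketched as follows. This is a simplified illustration, not the ORB-SLAM2 implementation: the function names `binOf` and `threeMaxima` and the evenly spaced bins are assumptions made for this example. Only the 0.1 ratio comes from the steps above.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: map the orientation difference of a match (degrees)
// to a histogram bin, assuming numBins evenly spaced bins over [0, 360).
int binOf(float angle1Deg, float angle2Deg, int numBins) {
    assert(numBins > 0);
    float rot = std::fmod(angle1Deg - angle2Deg, 360.0f);
    if (rot < 0.0f) rot += 360.0f;           // wrap into [0, 360)
    int bin = static_cast<int>(rot * numBins / 360.0f);
    if (bin >= numBins) bin = numBins - 1;   // guard against rounding up to 360
    return bin;
}

// Return the indices of the three most populated bins; -1 marks a discarded one.
std::array<int, 3> threeMaxima(const std::vector<int>& counts) {
    int max1 = 0, max2 = 0, max3 = 0;
    int ind1 = -1, ind2 = -1, ind3 = -1;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        const int c = counts[i];
        const int idx = static_cast<int>(i);
        if (c > max1) {
            max3 = max2; ind3 = ind2;
            max2 = max1; ind2 = ind1;
            max1 = c;    ind1 = idx;
        } else if (c > max2) {
            max3 = max2; ind3 = ind2;
            max2 = c;    ind2 = idx;
        } else if (c > max3) {
            max3 = c;    ind3 = idx;
        }
    }
    if (max2 < 0.1f * max1) {         // step 7: keep only the largest bin
        ind2 = -1;
        ind3 = -1;
    } else if (max3 < 0.1f * max1) {  // step 8: keep the two largest bins
        ind3 = -1;
    }
    return {ind1, ind2, ind3};
}
```

Matches whose bin is not among the returned indices would then be rejected as inconsistent with the dominant rotation.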

2, TrackWithMotionModel

The initial pose is predicted with the constant-velocity motion model, and matching points are then searched for by projection


1. Map points already found in the preceding EPnP step are not tracked or searched again

2. The pyramid level at which each keyframe map point should appear in the current image has to be predicted

How the pyramid level is calculated: it is derived from the distance between the map point and the optical center of the camera

3. As in the earlier SearchByProjection, the keyframe map points are projected into the current frame and then searched for within a certain radius
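
The level prediction in step 2 can be sketched like this. It is a minimal example loosely modeled on scale prediction from distance. The function name, the parameter names, and the sample values 1.2 and 8 are assumptions for illustration, not taken from the text above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: predict the pyramid level at which a map point should
// be searched, given its current distance to the camera center.
//   maxDistance : distance at which the point would be observed at level 0
//   currentDist : distance from the map point to the current optical center
//   scaleFactor : scale ratio between consecutive pyramid levels (e.g. 1.2)
//   numLevels   : number of pyramid levels (e.g. 8)
int predictPyramidLevel(float maxDistance, float currentDist,
                        float scaleFactor, int numLevels) {
    assert(maxDistance > 0.0f && currentDist > 0.0f);
    assert(scaleFactor > 1.0f && numLevels > 0);
    const float ratio = maxDistance / currentDist;
    // A closer point appears larger, so it is found on a coarser level.
    // The level is the logarithm of the ratio in base scaleFactor, rounded up.
    const int level = static_cast<int>(
        std::ceil(std::log(ratio) / std::log(scaleFactor)));
    // Clamp to the valid range of levels.
    return std::min(std::max(level, 0), numLevels - 1);
}
```

A point at its maximum distance maps to level 0, and halving the distance with a scale factor of 1.2 moves it up about four levels.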

3, Introduction to bag of words (BoW)

1. Intuitive understanding of bag of words

How to find a matching image

Feature matching

Changes in lighting affect the result, and matching takes a long time

Bag-of-words model (BoW)

2. Basic idea of bag of words

Introduce the basic idea starting from the concept of a word

How to describe it quantitatively

s --- score function

A & B --- binary vectors

W --- vector dimension

‖·‖₁ --- L1 norm of a vector, the sum of the absolute values of its elements

In this way the similarity of the description vectors, that is, the degree of similarity of the images, is defined
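
The formula these symbols belong to is not reproduced in the text. One common form that is consistent with the definitions above, given here as an assumption rather than as the original figure, is:

```latex
s(A, B) = 1 - \frac{1}{W} \left\lVert A - B \right\rVert_1
```

With binary vectors, identical A and B give s = 1, and vectors that differ in every element give s = 0.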

3. From dictionary structure to k-ary tree

k: each node branches into k children

d: the tree has d levels in total

Word: a collection of locally adjacent feature points

Function: for example, all pedestrians in the pictures are grouped under the word "person"

Dictionary: a word index with a certain structure, trained from a large number of pictures

So given a feature point, how do we find the matching word?

From dictionary structure to k-ary tree (two indexing methods)

K-means clustering


1. First, K cluster centers are generated randomly

2. Using the cluster centers, the data are divided into K categories: each data point is assigned to the category of the center it is closest to

3. The center of each cluster is then recalculated from the data assigned to it

4. Repeat steps 2 and 3 until the centers no longer change

Note: different initial centers can lead to different final clustering results
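
The four steps can be sketched on one-dimensional data. This is a minimal example under stated assumptions: the function name `kMeans1D` is made up for illustration, and the initial centers are passed in by the caller instead of being drawn at random, so that the result is reproducible.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Minimal 1-D k-means sketch. `centers` holds the initial cluster centers
// (step 1) and is refined in place; the return value is the label per point.
std::vector<int> kMeans1D(const std::vector<double>& data,
                          std::vector<double>& centers,
                          int maxIterations = 100) {
    assert(!centers.empty());
    std::vector<int> labels(data.size(), 0);
    for (int iter = 0; iter < maxIterations; ++iter) {
        // Step 2: assign every point to its nearest center.
        for (std::size_t i = 0; i < data.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < centers.size(); ++c) {
                const double d = std::fabs(data[i] - centers[c]);
                if (d < best) {
                    best = d;
                    labels[i] = static_cast<int>(c);
                }
            }
        }
        // Step 3: recompute each center as the mean of its members.
        std::vector<double> sum(centers.size(), 0.0);
        std::vector<int> count(centers.size(), 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            sum[labels[i]] += data[i];
            ++count[labels[i]];
        }
        bool changed = false;
        for (std::size_t c = 0; c < centers.size(); ++c) {
            if (count[c] == 0) continue;  // an empty cluster keeps its center
            const double updated = sum[c] / count[c];
            if (updated != centers[c]) {
                centers[c] = updated;
                changed = true;
            }
        }
        // Step 4: stop once no center moves.
        if (!changed) break;
    }
    return labels;
}
```

Starting from the centers 0 and 5, the points 1, 2, 3, 10, 11, 12 settle into two clusters with centers 2 and 11. Other starting centers may converge differently, which is the sensitivity mentioned in the note.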

4. Similarity calculation: TF-IDF

TF (term frequency): the more frequently a word appears in an image, the more representative it is of that image

IDF (inverse document frequency): the less frequently a word appears in the dictionary, the more discriminative it is

Image level

Dictionary level

Similarity calculation with the BoW vector (bag of words)

The BoW vector consists of two elements: word and weight

The vector describes an image: which words are in the image and what their weights are. Then, by choosing a norm, the similarity of two images can be calculated
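
A small sketch of the weighting and scoring follows. It assumes the usual definitions TF_i = n_i / n (occurrences of word i over all words in the image) and a per-word IDF learned offline, for example log(N / N_i). The names `SparseBow`, `tfIdf` and `l1Score` are hypothetical, and the L1 score is one possible choice of norm, not necessarily the one used in the original figures.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <vector>

// Sparse BoW vector: word id -> weight (a simplified stand-in for a BoW vector).
using SparseBow = std::map<unsigned int, double>;

// imageWords: the word id of every feature in the image.
// idf: per-word IDF value from offline training (assumed given).
SparseBow tfIdf(const std::vector<unsigned int>& imageWords,
                const std::vector<double>& idf) {
    SparseBow bow;
    if (imageWords.empty()) return bow;
    for (unsigned int w : imageWords) {
        assert(w < idf.size());
        bow[w] += 1.0;                        // raw count n_i
    }
    const double n = static_cast<double>(imageWords.size());
    for (auto& entry : bow) {
        const double tf = entry.second / n;   // TF_i = n_i / n
        entry.second = tf * idf[entry.first]; // weight = TF_i * IDF_i
    }
    return bow;
}

// L1 similarity of two BoW vectors after L1 normalization:
// 1 for identical vectors, 0 for vectors with no word in common.
double l1Score(SparseBow a, SparseBow b) {
    auto normalize = [](SparseBow& v) {
        double norm = 0.0;
        for (const auto& e : v) norm += std::fabs(e.second);
        if (norm > 0.0) {
            for (auto& e : v) e.second /= norm;
        }
    };
    normalize(a);
    normalize(b);
    double diff = 0.0;
    for (const auto& e : a) {
        const auto it = b.find(e.first);
        diff += std::fabs(e.second - (it == b.end() ? 0.0 : it->second));
    }
    for (const auto& e : b) {
        if (a.find(e.first) == a.end()) diff += std::fabs(e.second);
    }
    return 1.0 - 0.5 * diff;
}
```

Keeping the vector sparse means only the words that actually occur in an image are stored and compared.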

5. Summary of the bag-of-words model

Step 1. Extract features from a large number of images and cluster them to generate words, which form the dictionary

Step 2. For each incoming frame, use TF-IDF to calculate the word weights

Step 3. Generate the BoW vector of the frame and update the forward index and the inverted index

The generated BoW vector serves as a compact signature of the image, which can be used for loop closure detection, image matching, pose optimization, finding feature points, and so on

4, General process of bag of words

Why "bag of words" rather than "list of words" or "array of words"?

Because the order and position of the words are discarded and only their frequency is considered, the representation is greatly simplified and storage space is saved. Analyzing and comparing similarity becomes very efficient.

Loop closure detection:

The core is to judge whether two images show the same scene, that is, to judge the similarity of the images.

How to calculate the similarity of two images

Bag of words can solve this problem. It treats the set of image features as visual words and only cares about which words appear in the image and how many times. This is closer to how humans recognize scenes and is robust to changes in lighting, viewpoint, season and so on.

Accelerated matching

In the ORB-SLAM2 code, SearchByBoW (used for keyframe tracking, relocalization, loop closure detection and Sim3 calculation) and SearchForTriangulation in local mapping both use the FeatureVector from BoW internally to accelerate feature matching.

Using FeatureVector avoids pairwise matching of all feature points: only feature points under the same node are compared, which greatly speeds up matching. As for matching accuracy, the paper Bags of Binary Words for Fast Place Recognition in Image Sequences reports 0 false positives on 26292 images, indicating that the accuracy is guaranteed. The effect is very good in practical applications.


The drawback is that the bag-of-words dictionary trained offline has to be loaded in advance, which increases storage space. However, the advantages far outweigh the disadvantages, and there are many ways to improve this, such as storing the dictionary in binary form to compress it, which reduces storage space and speeds up loading.

How is BoW built and generated? Why does BoW usually use the BRIEF descriptor?

Speed

Computation and matching are very fast. According to the paper, computing a 256-bit descriptor takes only about 17.3 µs per keypoint. Because the descriptors are binary, the distance between them can be measured as a Hamming distance using an XOR operation, which is very fast. SIFT and SURF descriptors are floating-point and require the Euclidean distance, which is much slower.

On an Intel Core i7 2.67 GHz CPU, using FAST+BRIEF features, feature extraction plus bag-of-words place recognition on 26300 images takes 22 ms per frame.
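
The XOR-based Hamming distance mentioned above can be sketched as follows. The four-word layout `Descriptor256` is an assumption chosen for this example and is not how the descriptors are stored in the code discussed here.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A 256-bit binary descriptor stored as four 64-bit words (assumed layout).
using Descriptor256 = std::array<std::uint64_t, 4>;

// Hamming distance: the number of bit positions in which a and b differ.
int hammingDistance(const Descriptor256& a, const Descriptor256& b) {
    int distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // XOR leaves a 1 exactly where the two descriptors differ;
        // counting those 1 bits gives the distance for this word.
        distance += static_cast<int>(std::bitset<64>(a[i] ^ b[i]).count());
    }
    return distance;
}
```

The whole comparison is four XORs and four bit counts, with no floating-point arithmetic, which is why binary descriptors are so cheap to match.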

Accuracy

The conclusion first: the loop closure performance is no worse than that of high-precision features such as SIFT and SURF.

Let's look at the comparison in detail. It comes from the paper Bags of Binary Words for Fast Place Recognition in Image Sequences, IEEE Transactions on Robotics, 2012.

Precision-recall curves of the three descriptors BRIEF, SURF64 and U-SURF128, with the same parameters, on the training datasets NewCollege and Bicocca25b

SURF64: 64-dimensional descriptor with rotation invariance

U-SURF128: 128-dimensional descriptor without rotation invariance

On both datasets, SURF64 performs significantly better than U-SURF128 (the area under the curve is larger). On the Bicocca25b dataset, BRIEF is significantly better than U-SURF128 and slightly better than SURF64. On the NewCollege dataset, SURF64 is better than BRIEF, but BRIEF still performs well.

In short, BRIEF and SURF64 perform about equally well, so it can be called a tie.

Visualization

In the figure below, the image pairs on the left are the loop matching results of BRIEF: features that fall under the same Word in the vocabulary are connected by a line.

The image pairs on the right are the loop matching results of SURF64 on the same dataset.

Row 1: the matching results of BRIEF and SURF64 are close despite a certain viewpoint change.

Row 2: BRIEF closes the loop successfully but SURF64 does not, because SURF64 does not obtain enough correspondences.

Row 3: BRIEF fails to close the loop while SURF64 succeeds.

The reason is mainly related to viewing distance. Because BRIEF, unlike SURF64, has no scale invariance, it tends to fail on close-range scenes with large scale changes, as in row 3. At medium and long range, where the scale changes little, BRIEF performs close to or even better than SURF64.

However, the scale problem of BRIEF can be solved with an image pyramid. In the paper, the authors also mention ORB + BRIEF feature points. The main problems were the lack of rotation invariance and scale invariance, but both are solved now.

In short, the loop closure performance of BRIEF is trustworthy!

Offline training of the vocabulary tree (also known as the dictionary)

First, ORB feature points are extracted from the images and their descriptors are clustered with k-means. Following the chosen branching factor and depth of the tree, clustering proceeds level by level from the root toward the leaves, and finally a very large vocabulary tree is obtained.

1. Traverse all training images and extract ORB feature points from each image.

2. Set the branching factor k and depth L of the vocabulary tree. The descriptors of all feature points are clustered into k sets by k-means to form the first level of the vocabulary tree, and the clustering is then repeated within each set to obtain the second level. Iterating in this way finally yields a vocabulary tree that meets these settings. It is usually large. For example, the offline dictionary used by ORB-SLAM2 has more than 1.08 million nodes.

3. The nodes in the level farthest from the root are called leaves or Words. Each Word is given a weight according to how relevant it is in the training set: the more often it appears in the training set, the less discriminative it is and the lower its weight.

Online generation of the BoW vector for an image

1. ORB features are extracted from the new frame to obtain a certain number (generally several hundred) of feature points, whose descriptor dimension is consistent with that of the vocabulary tree

2. The descriptor of each feature point finds its own position in the offline-built vocabulary tree: starting from the root node, compute the Hamming distance between the descriptor and the descriptor of each child node, select the child with the smallest Hamming distance, and repeat until a leaf node is reached.

The whole process is shown in the figure below. The purple line represents the path of one feature point from the root node to a leaf node.
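
The descent from root to leaf can be tried out on a toy tree. This sketch uses 8-bit descriptors and a hand-built array of nodes. The `ToyNode` struct and the `hamming8` and `findWord` functions are hypothetical names for this example only.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy vocabulary-tree node with an 8-bit descriptor (real ORB uses 256 bits).
struct ToyNode {
    std::uint8_t descriptor;
    std::vector<std::size_t> children;  // indices into the node array; empty = leaf
    int wordId;                         // meaningful only for leaves
};

// Hamming distance between two 8-bit descriptors.
int hamming8(std::uint8_t a, std::uint8_t b) {
    return static_cast<int>(
        std::bitset<8>(static_cast<unsigned>(a ^ b)).count());
}

// Descend from the root (index 0), always following the closest child,
// and return the word id of the leaf that is reached.
int findWord(const std::vector<ToyNode>& nodes, std::uint8_t feature) {
    assert(!nodes.empty());
    std::size_t current = 0;
    while (!nodes[current].children.empty()) {
        std::size_t best = nodes[current].children.front();
        int bestDist = hamming8(feature, nodes[best].descriptor);
        for (std::size_t child : nodes[current].children) {
            const int d = hamming8(feature, nodes[child].descriptor);
            if (d < bestDist) {
                bestDist = d;
                best = child;
            }
        }
        current = best;
    }
    return nodes[current].wordId;
}
```

With branching factor k and depth L, each lookup costs about k times L distance computations instead of one comparison against every word.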

Source code analysis

The code that converts a descriptor into a Word id, a Word weight, and the id of an ancestor node (the node levelsup levels above the leaf) is shown below:

/**
 * @brief Convert a descriptor into its Word id, Word weight, and the id of an ancestor node
 *        (not the node directly above the leaf, but the node levelsup levels above the leaf)
 * @tparam TDescriptor
 * @tparam F
 * @param[in] feature feature descriptor
 * @param[in & out] word_id Word id
 * @param[in & out] weight Word weight
 * @param[in & out] nid id of the node on the descriptor's path to its Word that lies
 *        levelsup levels above the leaf
 * @param[in] levelsup Depth from leaf
 */
template<class TDescriptor, class F>
void TemplatedVocabulary<TDescriptor,F>::transform(const TDescriptor &feature,
  WordId &word_id, WordValue &weight, NodeId *nid, int levelsup) const
{
  // propagate the feature down the tree
  vector<NodeId> nodes;
  typename vector<NodeId>::const_iterator nit;
  // level at which the node must be stored in nid, if given
  // m_L: depth levels, m_L = 6 in ORB-SLAM2
  // nid_level: level of the node whose id is stored in nid, which is convenient for indexing
  const int nid_level = m_L - levelsup;
  if(nid_level <= 0 && nid != NULL) *nid = 0; // root
  NodeId final_id = 0; // root
  int current_level = 0;
  do
  {
    ++current_level;
    nodes = m_nodes[final_id].children;
    final_id = nodes[0];
    // Initialize the best (smallest) distance with the first child of the current node
    double best_d = F::distance(feature, m_nodes[final_id].descriptor);
    // Traverse the remaining children to find the one with the minimum distance
    for(nit = nodes.begin() + 1; nit != nodes.end(); ++nit)
    {
      NodeId id = *nit;
      double d = F::distance(feature, m_nodes[id].descriptor);
      if(d < best_d)
      {
        best_d = d;
        final_id = id;
      }
    }
    // Record the id of the node on the path that lies levelsup levels above the leaf
    if(nid != NULL && current_level == nid_level)
      *nid = final_id;
  } while( !m_nodes[final_id].isLeaf() );
  // turn node id into word id
  // Take the Word id and weight of the leaf in the vocabulary tree closest to the current descriptor
  word_id = m_nodes[final_id].word_id;
  weight = m_nodes[final_id].weight;
}

The code that converts all feature points in an image into the two std::map containers BowVector and FeatureVector is shown below:

/**
 * @brief Convert all feature points of an image into BowVector and FeatureVector
 * @tparam TDescriptor
 * @tparam F
 * @param[in] features All feature points in the image
 * @param[in & out] v BowVector
 * @param[in & out] fv FeatureVector
 * @param[in] levelsup Depth from leaf
 */
template<class TDescriptor, class F>
void TemplatedVocabulary<TDescriptor,F>::transform(
  const std::vector<TDescriptor>& features,
  BowVector &v, FeatureVector &fv, int levelsup) const
{
  v.clear();
  fv.clear();
  if(empty()) // safe for subclasses
    return;
  // normalize
  // Determine whether the BowVector needs to be normalized according to the selected scoring type
  LNorm norm;
  bool must = m_scoring_object->mustNormalize(norm);
  typename vector<TDescriptor>::const_iterator fit;
  if(m_weighting == TF || m_weighting == TF_IDF)
  {
    unsigned int i_feature = 0;
    // Traverse all feature points in the image
    for(fit = features.begin(); fit < features.end(); ++fit, ++i_feature)
    {
      WordId id; // Word id of leaf node
      NodeId nid; // The NodeId in the FeatureVector is used to speed up the search
      WordValue w; // Weight corresponding to leaf node
      // Convert the current descriptor into Word id, Word weight, and the id of the ancestor node
      // (not the node directly above the leaf, but the node levelsup levels above it)
      // w is the idf value if TF_IDF, 1 if TF
      transform(*fit, id, w, &nid, levelsup);
      if(w > 0) // not stopped
      {
        // If the Word weight is greater than 0, add it to BowVector and FeatureVector
        v.addWeight(id, w);
        fv.addFeature(nid, i_feature);
      }
    }
    if(!v.empty() && !must)
    {
      // unnecessary when normalizing
      const double nd = v.size();
      for(BowVector::iterator vit = v.begin(); vit != v.end(); vit++)
        vit->second /= nd;
    }
  }
  // The code for the IDF / BINARY case is omitted
  if(must) v.normalize(norm);
}

This is equivalent to compressing the information of the current image, and it matters a great deal for later fast feature matching, loop closure detection and relocalization. These two containers are very important, so let's explain them one by one:

Understanding the bag-of-words vector BowVector

Internally it is actually a

std::map<WordId, WordValue>

WordId and WordValue are the id and weight of the Word, that is, the leaf closest to the descriptor among all leaves (explained later). The weights of the same Word id are accumulated. See the code

void BowVector::addWeight(WordId id, WordValue v)
{
  // Returns an iterator to the first element whose key is greater than or equal to id
  BowVector::iterator vit = this->lower_bound(id);
  // http://www.cplusplus.com/reference/map/map/key_comp/
  if(vit != this->end() && !(this->key_comp()(id, vit->first)))
  {
    // If id == vit->first, it is the same Word, and the weight is accumulated
    vit->second += v;
  }
  else
  {
    // If the Word id is not in the BowVector yet, add it
    this->insert(vit, BowVector::value_type(id, v));
  }
}

Understanding FeatureVector

Internally it is actually a

std::map<NodeId, std::vector<unsigned int>>

The NodeId is not the id of the direct parent of the leaf node, but the id of the node levelsup levels above the leaf, which corresponds to the Word's node id in the vocabulary tree figure above. Why not use the direct parent? Because the later search for the match of a Word compares it against the Words in all child nodes under the same node id. The search area is shown as the Word's search region in the figure. The size of the search range is therefore determined by levelsup. The larger levelsup is, the wider the search range and the slower the search. The smaller levelsup is, the narrower the search range and the faster the search, but fewer features can be matched. In addition, the std::vector stores the indexes of all feature points of the image that fall under that NodeId. See the code

void FeatureVector::addFeature(NodeId id, unsigned int i_feature)
{
  FeatureVector::iterator vit = this->lower_bound(id);
  // Put the features under the same node id in one vector
  if(vit != this->end() && vit->first == id)
    vit->second.push_back(i_feature);
  else {
    vit = this->insert(vit, FeatureVector::value_type(id,
      std::vector<unsigned int>() ));
    vit->second.push_back(i_feature);
  }
}

FeatureVector is mainly used for fast matching of feature points between different images and for accelerating geometric verification. For example, it is used in ORBmatcher::SearchByBoW

DBoW2::FeatureVector::const_iterator f1it = vFeatVec1.begin();
DBoW2::FeatureVector::const_iterator f2it = vFeatVec2.begin();
DBoW2::FeatureVector::const_iterator f1end = vFeatVec1.end();
DBoW2::FeatureVector::const_iterator f2end = vFeatVec2.end();
while(f1it != f1end && f2it != f2end)
{
    // Step 1: take out the ORB feature points that belong to the same node (only those can be matching points)
    if(f1it->first == f2it->first)
    {
        // Step 2: traverse the feature points of the KF that belong to this node
        for(size_t i1=0, iend1=f1it->second.size(); i1<iend1; i1++)
        {
            const size_t idx1 = f1it->second[i1];
            MapPoint* pMP1 = vpMapPoints1[idx1];
            // rest of the loop body omitted
            // ..........
        }
    }
    // advancing of f1it / f2it omitted
}
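
The idea of walking two FeatureVectors in parallel can be shown in a self-contained sketch. It uses plain std::map containers in place of the real classes, and `FeatureIndexMap` and `candidatePairs` are hypothetical names. Descriptor comparison is left out. The function only lists which feature pairs would be compared.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Node id -> indices of the features that fall under that node.
using FeatureIndexMap = std::map<unsigned int, std::vector<unsigned int>>;

// Collect the candidate index pairs (i1, i2) whose features share a node id.
std::vector<std::pair<unsigned int, unsigned int>>
candidatePairs(const FeatureIndexMap& fv1, const FeatureIndexMap& fv2) {
    std::vector<std::pair<unsigned int, unsigned int>> pairs;
    auto it1 = fv1.begin();
    auto it2 = fv2.begin();
    while (it1 != fv1.end() && it2 != fv2.end()) {
        if (it1->first == it2->first) {
            // Same node: every feature of image 1 is a candidate for
            // every feature of image 2 under this node.
            for (unsigned int i1 : it1->second)
                for (unsigned int i2 : it2->second)
                    pairs.emplace_back(i1, i2);
            ++it1;
            ++it2;
        } else if (it1->first < it2->first) {
            it1 = fv1.lower_bound(it2->first);  // skip nodes missing in fv2
        } else {
            it2 = fv2.lower_bound(it1->first);  // skip nodes missing in fv1
        }
    }
    return pairs;
}
```

Because both maps are ordered by node id, the walk touches each node at most once, and features under different nodes are never compared.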

Saving and loading the dictionary tree

Here the vocabulary tree is referred to as the dictionary tree

How is the trained dictionary tree saved as a txt file?

template<class TDescriptor, class F>
void TemplatedVocabulary<TDescriptor,F>::saveToTextFile(const std::string &filename) const
{
   fstream f;
   f.open(filename.c_str(), ios_base::out);
   // The first line stores the number of branches, depth, scoring method and weighting method of the tree
   f << m_k << " " << m_L << " " << " " << m_scoring << " " << m_weighting << endl;
   for(size_t i=1; i<m_nodes.size();i++)
   {
      const Node& node = m_nodes[i];
      // The first number in each line is the parent node id
      f << node.parent << " ";
      // The second number in each line indicates whether the node is a leaf (Word): 1 = yes, 0 = no
      if(node.isLeaf())
        f << 1 << " ";
      else
        f << 0 << " ";
      // Next, the 256-bit descriptor is stored, and finally the node weight
      f << F::toString(node.descriptor) << " " << (double)node.weight << endl;
   }
   f.close();
}

How to load the trained dictionary tree file?

/**
 * @brief Load the trained vocabulary tree in txt format
 * @tparam TDescriptor
 * @tparam F
 * @param[in] filename vocabulary tree file name
 * @return true Loading succeeded
 * @return false Loading failed
 */
template<class TDescriptor, class F>
bool TemplatedVocabulary<TDescriptor,F>::loadFromTextFile(const std::string &filename)
{
  ifstream f;
  f.open(filename.c_str());
  if(f.eof())
    return false;
  m_words.clear();
  m_nodes.clear();
  string s;
  getline(f,s);
  stringstream ss;
  ss << s;
  ss >> m_k; // Number of branches of the tree
  ss >> m_L; // Depth of tree
  int n1, n2;
  ss >> n1;
  ss >> n2;
  if(m_k<0 || m_k>20 || m_L<1 || m_L>10 || n1<0 || n1>5 || n2<0 || n2>3)
  {
    std::cerr << "Vocabulary loading failure: This is not a correct text file!" << endl;
    return false;
  }
  m_scoring = (ScoringType)n1; // Scoring type
  m_weighting = (WeightingType)n2; // Weight type
  createScoringObject();
  // The total number of nodes is the sum of a geometric series:
  // 1 + k + ... + k^L = (k^(L+1) - 1)/(k - 1), which already includes the leaf level.
  // It is only an upper bound used for reserve; the actual storage is resized step by step
  int expected_nodes =
    (int)((pow((double)m_k, (double)m_L + 1) - 1)/(m_k - 1));
  m_nodes.reserve(expected_nodes);
  // Pre-allocate space for the words (leaves)
  m_words.reserve(pow((double)m_k, (double)m_L + 1));
  // The first node is the root node with id set to 0
  m_nodes.resize(1);
  m_nodes[0].id = 0;
  while(!f.eof())
  {
    string snode;
    getline(f,snode);
    stringstream ssnode;
    ssnode << snode;
    // nid is the current node id, which is simply the reading order (the root is 0)
    int nid = m_nodes.size();
    // Grow the node array by 1
    m_nodes.resize(m_nodes.size()+1);
    m_nodes[nid].id = nid;
    // Read the first number in each line: the parent node id
    int pid ;
    ssnode >> pid;
    // Record the parent-child relationship between node ids
    m_nodes[nid].parent = pid;
    m_nodes[pid].children.push_back(nid);
    // Read the second number: whether the node is a leaf (Word)
    int nIsLeaf;
    ssnode >> nIsLeaf;
    // Each descriptor is 256 bits and one byte holds 8 bits, so a descriptor needs 32 bytes.
    // Here F::L = 32, that is, 32 bytes are read and collected in ssd as a string
    stringstream ssd;
    for(int iD=0;iD<F::L;iD++)
    {
      string sElement;
      ssnode >> sElement;
      ssd << sElement << " ";
    }
    // Store the contents of ssd in the descriptor of the node
    F::fromString(m_nodes[nid].descriptor, ssd.str());
    // Read the last number: the weight of the node (Word only)
    ssnode >> m_nodes[nid].weight;
    if(nIsLeaf>0)
    {
      // If it is a leaf (Word), store it in m_words
      int wid = m_words.size();
      m_words.resize(wid+1);
      // Store the id of the Word, which is unique
      m_nodes[nid].word_id = wid;
      // Build vector<Node*> m_words, storing the pointer to the node of each Word
      m_words[wid] = &m_nodes[nid];
    }
    else
    {
      // Non-leaf node: reserve space for m_k branches
      m_nodes[nid].children.reserve(m_k);
    }
  }
  return true;
}
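
The geometric-series node count used for the reserve call can be checked with a few lines. This is a sketch with a made-up function name, `fullTreeNodes`, and it assumes a completely filled tree.

```cpp
#include <cassert>
#include <cstdint>

// Number of nodes in a full k-ary tree whose leaves sit L levels below the
// root: 1 + k + k^2 + ... + k^L, which equals (k^(L+1) - 1) / (k - 1) for k > 1.
std::uint64_t fullTreeNodes(int k, int L) {
    assert(k >= 1 && L >= 0);
    std::uint64_t total = 0;
    std::uint64_t levelSize = 1;  // the root level holds one node
    for (int level = 0; level <= L; ++level) {
        total += levelSize;
        levelSize *= static_cast<std::uint64_t>(k);
    }
    return total;
}
```

Assuming k = 10 with the depth of 6 mentioned earlier, the full tree has 1,111,111 nodes. This is consistent with the figure of just over 1.08 million nodes quoted above, since k-means does not necessarily fill every branch.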


Posted on Tue, 19 Oct 2021 16:55:35 -0400 by lentin64