Object detection is a computer vision task that involves localizing one or more objects in an image and classifying each of them. It is challenging because it requires both successful localization, to locate and draw a bounding box around each object, and successful classification, to predict the correct category of each localized object.
An extension of object detection marks the specific pixels that belong to each detected object, rather than using coarse bounding boxes for localization. This harder version of the problem is generally called object segmentation or semantic segmentation.
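To make the difference between a bounding box and a pixel mask concrete, the sketch below derives the tightest box around a small mask. The mask is a hypothetical 5x6 grid stored as nested lists (real libraries typically use arrays), and `mask_to_box` is an illustrative helper, not part of any library discussed here. The (y1, x1, y2, x2) order with exclusive lower-right coordinates is an assumption of this sketch.

```python
def mask_to_box(mask):
    """Return (y1, x1, y2, x2) of the tightest box around the set pixels.

    y2 and x2 are exclusive. Returns None for an empty mask.
    """
    # rows that contain at least one object pixel
    rows = [y for y, row in enumerate(mask) if any(row)]
    # column index of every object pixel
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


# hypothetical 5x6 mask: 1 marks pixels that belong to the object
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

box = mask_to_box(mask)
y1, x1, y2, x2 = box
box_area = (y2 - y1) * (x2 - x1)
mask_area = sum(sum(row) for row in mask)
print(box, box_area, mask_area)
```

Here the box covers 12 pixels while the object itself occupies only 7, which is the extra precision a mask provides over a box.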
The Region-based Convolutional Neural Network (R-CNN) is a family of convolutional neural network models for object detection developed by Ross Girshick et al. There are four main variants of the approach, the most recent of which is called Mask R-CNN. The four variants are as follows:
- [R-CNN]: bounding boxes are proposed by the selective search algorithm. Each proposed region is warped to a fixed size, features are extracted by a deep convolutional neural network (such as AlexNet), and the final object classification is performed with linear SVMs.
- [Fast R-CNN]: a simplified design with a single model. Bounding boxes are still specified as input, but a region-of-interest pooling layer is applied after the deep CNN to consolidate the regions, and the model predicts both class labels and regions of interest directly.
- [Faster R-CNN]: adds a Region Proposal Network that interprets the features extracted by the deep CNN and learns to propose regions of interest directly.
- [Mask R-CNN]: an extension of Faster R-CNN that adds an output model for predicting a mask for each detected object.
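The region-of-interest pooling step mentioned for Fast R-CNN can be sketched as follows. This is a deliberately simplified, single-channel version on nested lists: the function name `roi_max_pool`, the integer box coordinates, and the way bin boundaries are rounded are all assumptions of the sketch, and real implementations also handle channels, fractional coordinates, and the scale between image and feature map.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (y1, x1, y2, x2) into an out_h x out_w grid.

    y2 and x2 are exclusive. Bins may overlap by one row or column when the
    region size is not divisible by the output size.
    """
    y1, x1, y2, x2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # row range covered by output row i
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            # column range covered by output column j
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# hypothetical 4x4 feature map with values 0..15
feature_map = [[4 * r + c for c in range(4)] for r in range(4)]

print(roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2))
print(roi_max_pool(feature_map, (1, 1, 4, 4), 2, 2))
```

Whatever the size of the proposed region, the output grid has a fixed shape, which is what lets a single network head process every region.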
The Mask R-CNN model, introduced in the 2018 paper titled "Mask R-CNN", is the most recent variant of the R-CNN family and supports both object detection and object segmentation. The paper provides a good summary of this line of models:
The Region-based CNN (R-CNN) approach to bounding-box object detection is to attend to a manageable number of candidate object regions and evaluate convolutional networks independently on each RoI. R-CNN was extended to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements, and is the current leading framework in several benchmarks.
This family of methods may be among the most effective for object detection, achieving state-of-the-art results at the time on computer vision benchmark datasets. Although accurate, the models may be slower at prediction time than alternatives such as YOLO, which may be less accurate but are designed specifically for real-time prediction.
Mask R-CNN is complex to implement, especially compared with a simple or even a state-of-the-art deep convolutional neural network model. Source code is available for each version of the R-CNN model in separate GitHub repositories, with prototype models based on the Caffe deep learning framework. For example:
- R-CNN: Regions with Convolutional Neural Network Features, GitHub.
- Fast R-CNN, GitHub.
- Faster R-CNN Python Code, GitHub.
- Detectron, Facebook AI, GitHub.
Instead of developing an R-CNN or Mask R-CNN model from scratch, we can use a reliable third-party implementation built on Keras. Arguably the best third-party implementation of Mask R-CNN is the Mask R-CNN Project developed by Matterport, which is released as open source under the MIT license and has been widely used in various projects and Kaggle competitions.
The project provides many examples in the form of Python notebooks that show how to use the library.
There are three main use cases for the Mask R-CNN model with the Matterport library:
- [Object Detection Application]: use a pre-trained model to detect objects in new images.
- [New Model via Transfer Learning]: use a pre-trained model as the starting point for developing a model for a new object detection dataset.
- [New Model from Scratch]: develop a new model from scratch for an object detection dataset.
git clone https://github.com/matterport/Mask_RCNN.git
cd Mask_RCNN
python setup.py install
# Confirm whether the installation is successful
pip show mask-rcnn
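The installation can also be checked from Python. The sketch below uses the standard library to test whether a module can be located without importing it; `is_installed` is an illustrative helper name, and `mrcnn` is the package name used by the imports later in this tutorial.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


# 'mrcnn' is the package name used by the imports in this tutorial
print("mrcnn installed:", is_installed("mrcnn"))
```

If this prints False, the install step most likely ran in a different Python environment from the one running the script.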
Download the model weights: mask_rcnn_coco.h5
Prepare sample photos: elephant.jpg
Load the model and predict:
First, the model must be defined as an instance of the MaskRCNN class, which takes a configuration object as a parameter. The configuration object defines how the model is used during training or inference. In this case, the configuration only specifies the number of images per batch and the number of classes to predict. The full set of configuration options, and the properties that can be overridden, can be seen in the config.py file.
# define the test configuration
class TestConfig(Config):
    NAME = "test"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    NUM_CLASSES = 1 + 80
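The configuration pattern above, in which a subclass overrides class attributes and the base class derives further values from them, can be illustrated without the library. In this sketch `BaseConfig` is a hypothetical stand-in for the library's `Config` class, and the way `BATCH_SIZE` is derived is an assumption about how such a base class might behave; check config.py for the real definition. `SketchConfig` mirrors the `TestConfig` shown above.

```python
class BaseConfig:
    """Hypothetical stand-in for a library configuration base class."""
    NAME = None
    GPU_COUNT = 1
    IMAGES_PER_GPU = 2
    NUM_CLASSES = 1

    def __init__(self):
        # derived value, computed from the (possibly overridden) attributes
        self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT


class SketchConfig(BaseConfig):
    # override only what differs from the defaults
    NAME = "test"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    NUM_CLASSES = 1 + 80  # background + 80 object classes


config = SketchConfig()
print(config.NAME, config.BATCH_SIZE, config.NUM_CLASSES)
```

With one GPU and one image per GPU the batch size is 1, which matches passing a single image to the model at prediction time.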
Next, define the MaskRCNN instance. The model is defined in inference mode, indicating that we want to make predictions rather than train. A directory must also be specified where any log messages can be written.
# define the model
rcnn = MaskRCNN(mode='inference', model_dir='./', config=TestConfig())
Load the downloaded weights:
# load coco model weights
rcnn.load_weights('mask_rcnn_coco.h5', by_name=True)
# load photograph
img = load_img('elephant.jpg')
img = img_to_array(img)
# make prediction
results = rcnn.detect([img], verbose=0)
The result contains one dictionary for each image passed to the detect() function; in this case, it is a list holding a single dictionary for the one image.
The dictionary has keys for the bounding boxes, masks, and so on, and each key points to a collection covering all objects detected in the image. The keys of interest are as follows:
- rois: the bounding boxes, or regions of interest (ROI), of the detected objects.
- masks: the masks of the detected objects.
- class_ids: the class integers of the detected objects.
- scores: the probability or confidence of each predicted class.
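Because the entries under these keys are aligned by position, detections can be filtered together. The sketch below keeps only detections above a confidence threshold. The result dictionary is mocked with plain lists and made-up values (the library returns arrays), `filter_detections` is an illustrative helper, and the `masks` key is left out for brevity.

```python
def filter_detections(result, min_score):
    """Keep only detections whose score is at least min_score."""
    keep = [i for i, score in enumerate(result['scores'])
            if score >= min_score]
    # apply the same index selection to every aligned key
    return {
        'rois': [result['rois'][i] for i in keep],
        'class_ids': [result['class_ids'][i] for i in keep],
        'scores': [result['scores'][i] for i in keep],
    }


# mocked prediction for one image; all values are hypothetical
mock_result = {
    'rois': [[60, 40, 420, 610], [300, 10, 340, 50]],
    'class_ids': [21, 1],
    'scores': [0.99, 0.42],
}

filtered = filter_detections(mock_result, min_score=0.5)
print(filtered)
```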
You can draw each box detected in an image by first obtaining the dictionary for the first image (for example, results[0]) and then retrieving the list of bounding boxes (for example, ['rois']).
boxes = results[0]['rois']
Each bounding box is defined by the coordinates of its top-left and bottom-right corners in the image:
y1, x1, y2, x2 = boxes[0]
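Note that the y coordinate comes first in this order, while most drawing routines expect an (x, y) anchor plus a width and height. A small conversion helper, sketched here with the illustrative name `box_to_xywh` and a made-up box, keeps that reordering in one place.

```python
def box_to_xywh(box):
    """Convert (y1, x1, y2, x2) into (x, y, width, height)."""
    y1, x1, y2, x2 = box
    return x1, y1, x2 - x1, y2 - y1


# hypothetical box in (y1, x1, y2, x2) order
x, y, width, height = box_to_xywh((10, 20, 110, 220))
print(x, y, width, height)
```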
Draw each rectangle in the image:
# example of inference with a pre-trained coco model
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from mrcnn.config import Config
from mrcnn.model import MaskRCNN
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# draw an image with detected objects
def draw_image_with_boxes(filename, boxes_list):
    # load the image
    data = plt.imread(filename)
    # plot the image
    plt.imshow(data)
    # get the context for drawing boxes
    ax = plt.gca()
    # plot each box
    for box in boxes_list:
        # get coordinates
        y1, x1, y2, x2 = box
        # calculate width and height of the box
        width, height = x2 - x1, y2 - y1
        # create the shape
        rect = Rectangle((x1, y1), width, height, fill=False, color='red')
        # draw the box
        ax.add_patch(rect)
    # show the plot
    plt.show()

# define the test configuration
class TestConfig(Config):
    NAME = "test"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    NUM_CLASSES = 1 + 80

# define the model
rcnn = MaskRCNN(mode='inference', model_dir='./', config=TestConfig())
# load coco model weights
rcnn.load_weights('mask_rcnn_coco.h5', by_name=True)
# load photograph
img = load_img('elephant.jpg')
img = img_to_array(img)
# make prediction
results = rcnn.detect([img], verbose=0)
# visualize the results for the first (and only) image
draw_image_with_boxes('elephant.jpg', results[0]['rois'])
Now that you know how to load the model and make predictions, let's perform real object detection: in addition to localizing objects, we want to know what they are. The Mask_RCNN API provides a function called display_instances() that takes the loaded image array and the components of the prediction dictionary, such as the bounding boxes, scores, and class labels, and plots the photo with all of these annotations.
One of the arguments is the list of predicted class identifiers, available under the 'class_ids' key of the dictionary. The function also needs a mapping from ids to class labels. The pre-trained model was fit on a dataset with 80 class labels (81 including the background class). The list of labels is provided in the Mask R-CNN demo notebook tutorial.
# example of inference with a pre-trained coco model
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from mrcnn.visualize import display_instances
from mrcnn.config import Config
from mrcnn.model import MaskRCNN

# define 81 classes that the coco model knows about
class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
               'bus', 'train', 'truck', 'boat', 'traffic light',
               'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
               'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
               'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
               'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
               'kite', 'baseball bat', 'baseball glove', 'skateboard',
               'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
               'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
               'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
               'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
               'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
               'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
               'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
               'teddy bear', 'hair drier', 'toothbrush']

# define the test configuration
class TestConfig(Config):
    NAME = "test"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    NUM_CLASSES = 1 + 80

# define the model
rcnn = MaskRCNN(mode='inference', model_dir='./', config=TestConfig())
# load coco model weights
rcnn.load_weights('mask_rcnn_coco.h5', by_name=True)
# load photograph
img = load_img('elephant.jpg')
img = img_to_array(img)
# make prediction
results = rcnn.detect([img], verbose=0)
# get dictionary for first prediction
r = results[0]
# show photo with bounding boxes, masks, class labels and scores
display_instances(img, r['rois'], r['masks'], r['class_ids'], class_names, r['scores'])
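The mapping from class ids to labels can be tried out without running the model. The sketch below uses only the first four entries of the label list and made-up predictions; `label_detections` is an illustrative helper, not a function of the Mask_RCNN API.

```python
from collections import Counter


def label_detections(class_ids, scores, class_names):
    """Pair each predicted class id with its label and score."""
    return [(class_names[class_id], score)
            for class_id, score in zip(class_ids, scores)]


# first four entries of the label list; index 0 is the background class
short_class_names = ['BG', 'person', 'bicycle', 'car']

# hypothetical predictions for one image
mock_class_ids = [1, 3, 1]
mock_scores = [0.98, 0.87, 0.65]

labelled = label_detections(mock_class_ids, mock_scores, short_class_names)
counts = Counter(label for label, _ in labelled)
print(labelled)
print(dict(counts))
```

The class id is simply an index into the label list, so the order of the list has to match the order used when the model was trained.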