Reading notes: Deep Learning for Computer Vision with Python, Volume II, Chapter 8 - Using HDF5 and Large Datasets



So far in this book, we have only worked with datasets that fit into our machine's main memory. For small datasets this is a reasonable approach: we load each individual image, preprocess it, and feed it through the network. For large-scale deep learning datasets such as ImageNet, however, we need a data generator that accesses only a portion of the dataset at a time (a mini-batch) and passes that batch through the network.

Fortunately, Keras includes methods that accept raw file paths on disk as input to the training process. You do not have to store the entire dataset in memory: just supply the image paths to the Keras data generator, and your images will be loaded in batches and fed through the network.
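To make this idea concrete, here is a minimal sketch (not the book's code; the names batch_generator and load_fn are mine) of what such a path-based generator does: it loads only one mini-batch worth of files per step, so the full dataset never has to fit in memory.

```python
import numpy as np

def batch_generator(paths, labels, batch_size, load_fn):
    """Yield (images, labels) mini-batches by loading only
    batch_size files at a time from disk."""
    while True:
        for i in range(0, len(paths), batch_size):
            batch_paths = paths[i:i + batch_size]
            batch_labels = labels[i:i + batch_size]
            # load_fn reads one image from disk (e.g. cv2.imread)
            images = np.array([load_fn(p) for p in batch_paths])
            yield images, np.array(batch_labels)

# demo with a fake loader so the sketch is self-contained
fake_load = lambda p: np.zeros((32, 32, 3), dtype="uint8")
gen = batch_generator(["a.jpg"] * 10, [0] * 10, batch_size=4,
    load_fn=fake_load)
(images, labels) = next(gen)
print(images.shape)  # (4, 32, 32, 3)
```

Note that every call to load_fn is a separate disk read, which is exactly the per-image I/O cost the next paragraph warns about.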

However, this approach is extremely inefficient. Every image residing on disk requires its own I/O operation, and each one adds latency to the training pipeline. Training a deep learning network is slow enough already; we should avoid I/O bottlenecks wherever we can.

A more elegant solution is to generate an HDF5 dataset from the raw images, as we did in Chapter 3 on transfer learning and feature extraction, only this time we store the images themselves rather than extracted features. HDF5 not only stores massive datasets, it also optimizes I/O operations, especially reading batches from the file (called "slices"). As we will see in the rest of this book, the extra step of packing the raw images on disk into an HDF5 file lets us build a deep learning framework that can quickly construct datasets and train networks on them.
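As an illustration of HDF5 slicing, here is a small h5py sketch (the file name and shapes are placeholders, not the chapter's actual dataset): a mini-batch is extracted with a single array slice rather than dozens of per-image file reads.

```python
import os
import tempfile

import h5py
import numpy as np

# build a small demo file standing in for a dataset like train.hdf5
path = os.path.join(tempfile.gettempdir(), "demo.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("images",
        data=np.zeros((100, 256, 256, 3), dtype="uint8"))
    f.create_dataset("labels", data=np.zeros((100,), dtype="int64"))

# reading a mini-batch is one contiguous array slice -- no per-image I/O
with h5py.File(path, "r") as db:
    batch = db["images"][0:32]        # images 0..31 in a single read
    batchLabels = db["labels"][0:32]

print(batch.shape)  # (32, 256, 256, 3)
```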

In the rest of this chapter, I will demonstrate how to build an HDF5 dataset for the Kaggle Dogs vs. Cats competition. In the next chapter, we will use this HDF5 dataset to train the AlexNet architecture and ultimately claim a better position on the leaderboard.

1. Downloading the Kaggle Dogs vs. Cats dataset

To download the Kaggle Dogs vs. Cats dataset, you first need to create a Kaggle account. From there, go to the Dogs vs. Cats home page.

You need to download train.zip only; do not download test1.zip. The images in test1.zip are used solely for computing predictions and submitting them to the Kaggle evaluation server. Since we need class labels to build our own training and testing splits, train.zip is all we need. Submitting your own predictions is beyond the scope of this book, but it can easily be done by writing your predictions on test1.zip in the file format outlined in sampleSubmission.csv.

After downloading train.zip, unzip it and you will find a directory called train, which contains our actual images. The label of each image can be obtained from its file name: dog images are named dog.<id>.jpg and cat images cat.<id>.jpg.
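Extracting the label from such a file name is a one-line string operation (the numeric IDs below are illustrative):

```python
import os

# example file names from the unzipped train/ directory
examplePaths = ["train/cat.11866.jpg", "train/dog.4301.jpg"]

# the class label is the first token of the base file name
exampleLabels = [os.path.basename(p).split(".")[0]
    for p in examplePaths]
print(exampleLabels)  # ['cat', 'dog']
```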

2. Creating the configuration file

Here is the directory structure of the Kaggle Dogs vs. Cats project (the Python scripts it contains will be introduced over the coming sections):

First, the dogs_vs_cats_config.py configuration file mainly contains the following:

1. The paths to the input images.

2. The total number of class labels.

3. Information on the training, validation, and testing splits.

4. The paths to the HDF5 datasets.

5. The paths to output models, plots, logs, etc.

Using a Python file rather than a JSON file lets me include snippets of Python code and makes the configuration file more efficient to work with (a good example is using the os.path module to manipulate file paths). I recommend you get into the habit of using Python-based configuration files in your deep learning projects; it will greatly improve your productivity and let you control most of the parameters of a project from a single file.

The reference code is as follows:

# define the paths to the images directory
IMAGES_PATH = "../datasets/kaggle_dogs_vs_cats/train"

# since we do not have validation data or access to the testing
# labels, we need to take a number of images from the training
# data and use them instead
NUM_CLASSES = 2
NUM_VAL_IMAGES = 1250 * NUM_CLASSES
NUM_TEST_IMAGES = 1250 * NUM_CLASSES

# define the path to the output training, validation, and testing
# HDF5 files
TRAIN_HDF5 = "../datasets/kaggle_dogs_vs_cats/hdf5/train.hdf5"
VAL_HDF5 = "../datasets/kaggle_dogs_vs_cats/hdf5/val.hdf5"
TEST_HDF5 = "../datasets/kaggle_dogs_vs_cats/hdf5/test.hdf5"

# path to the output model file
MODEL_PATH = "output/alexnet_dogs_vs_cats.model"

# define the path to the dataset mean
DATASET_MEAN = "output/dogs_vs_cats_mean.json"

# define the path to the output directory used for storing plots,
# classification reports, etc.
OUTPUT_PATH = "output"

IMAGES_PATH defines the path to the directory containing the dog and cat images -- these are the images we will pack into an HDF5 dataset later in this chapter.

NUM_CLASSES, NUM_VAL_IMAGES, and NUM_TEST_IMAGES define the total number of class labels (two: one for dogs and one for cats) and the number of validation and testing images (2,500 each).

TRAIN_HDF5, VAL_HDF5, and TEST_HDF5 specify the paths to the output HDF5 files for the training, validation, and testing splits.

The second half of the configuration file defines the paths for the serialized model weights, the dataset mean, and an output directory for storing plots, classification reports, logs, and so on.

The DATASET_MEAN file will be used to store the average red, green, and blue pixel intensity values across the entire (training) dataset. When we train our network, we will subtract the mean RGB values from every pixel of each input image (the same applies to testing and evaluation). This method, known as mean subtraction, is a form of data normalization and is often used instead of scaling pixel intensities to the range [0, 1], because it has proven more effective on large datasets and deeper neural networks.
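A minimal sketch of what such a mean-subtraction preprocessor might look like (the function name is mine, and the book builds its own preprocessor class; the sample values are the rounded channel means serialized later in this chapter):

```python
import numpy as np

def mean_subtract(image, rMean, gMean, bMean):
    # image is assumed to be in OpenCV's BGR channel order;
    # subtracting the training-set channel means centers the
    # data around zero
    image = image.astype("float32")
    image[..., 0] -= bMean
    image[..., 1] -= gMean
    image[..., 2] -= rMean
    return image

# sanity check: an image whose pixels equal the means maps to all zeros
means = np.array([106.16, 115.99, 125.04], dtype="float32")  # B, G, R
img = np.tile(means, (2, 2, 1))
out = mean_subtract(img, rMean=125.04, gMean=115.99, bMean=106.16)
print(out.max())  # 0.0
```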

3. Building the dataset

Now that our configuration file is defined, let's move on to actually building our HDF5 datasets. Open a new file for the dataset-building script and insert the following code:

# import the necessary packages
import os
import sys

# __file__ is the path of this script; the line below takes its
# grandparent directory so that the helper modules and the
# configuration file one level up can be imported
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(BASE_DIR)

import dogs_vs_cats_config as config
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# module path assumed -- adjust it to wherever AspectAwarePreprocessor
# lives in your project
from pyimagesearch.preprocessing import AspectAwarePreprocessor
from hdf5DatasetWriter import HDF5DatasetWriter
from imutils import paths
import numpy as np
import progressbar
import json
import cv2

# grab the paths to the images, then extract the class label
# ("cat" or "dog") from each file name
trainPaths = list(paths.list_images(config.IMAGES_PATH))
trainLabels = [p.split(os.path.sep)[-1].split(".")[0]
    for p in trainPaths]

# encode the string labels as integers
le = LabelEncoder()
trainLabels = le.fit_transform(trainLabels)

# perform stratified sampling from the training set to build the
# testing split from the training data
split = train_test_split(trainPaths, trainLabels, test_size=config.NUM_TEST_IMAGES, stratify=trainLabels, random_state=42)
(trainPaths, testPaths, trainLabels, testLabels) = split

# perform another stratified sampling, this time to build the
# validation data
split = train_test_split(trainPaths, trainLabels, test_size=config.NUM_VAL_IMAGES, stratify=trainLabels, random_state=42)
(trainPaths, valPaths, trainLabels, valLabels) = split

# construct a list pairing the training, validation, and testing
# image paths along with their corresponding labels and output HDF5
# files
datasets = [
    ("train", trainPaths, trainLabels, config.TRAIN_HDF5),
    ("val", valPaths, valLabels, config.VAL_HDF5),
    ("test", testPaths, testLabels, config.TEST_HDF5)]

# initialize the image preprocessor and the lists of RGB channel
# averages
aap = AspectAwarePreprocessor(256, 256)
(R, G, B) = ([], [], [])

# loop over the dataset tuples
for (dType, paths, labels, outputPath) in datasets:
    # create HDF5 writer
    print("[INFO] building {}...".format(outputPath))
    writer = HDF5DatasetWriter((len(paths), 256, 256, 3), outputPath)

    # initialize the progress bar
    widgets = ["Building Dataset: ", progressbar.Percentage(), " ", progressbar.Bar(), " ", progressbar.ETA()]
    pbar = progressbar.ProgressBar(maxval=len(paths), widgets=widgets).start()

    # loop over the image paths
    for (i, (path, label)) in enumerate(zip(paths, labels)):
        # load the image and process it
        image = cv2.imread(path)
        image = aap.preprocess(image)

        # if we are building the training dataset, then compute the
        # mean of each channel in the image, then update the
        # respective lists
        if dType == "train":
            (b, g, r) = cv2.mean(image)[:3]
            R.append(r)
            G.append(g)
            B.append(b)

        # add the image and label to the HDF5 dataset
        writer.add([image], [label])
        pbar.update(i)

    # close the HDF5 writer
    pbar.finish()
    writer.close()

# construct a dictionary of averages, then serialize the means to a
# JSON file
print("[INFO] serializing means...")
D = {"R": np.mean(R), "G": np.mean(G), "B": np.mean(B)}
f = open(config.DATASET_MEAN, "w")
f.write(json.dumps(D))
f.close()

Running the code generates the HDF5 files.

The generated train.hdf5 file is roughly 30 GB, mainly because HDF5 stores the images as raw, uncompressed arrays (image file formats such as JPEG and PNG apply data-compression algorithms to keep file sizes small). The lack of compression dramatically increases our storage costs, but it also helps speed up training, since we do not have to waste processor time decoding images: we can access them directly from the HDF5 dataset, preprocess them, and pass them through our network.

The script also generates a dogs_vs_cats_mean.json file with the contents below: the average R, G, and B pixel values over all training images. We will build a new image preprocessor that normalizes our images by subtracting these RGB means from the input images before they pass through the network. This mean normalization helps "center" the data around zero mean. In general, such normalization lets our networks learn faster, which is why we use it (rather than [0, 1] scaling) on larger, more challenging datasets.

{"R": 125.03920054779053, "G": 115.98626419296265, "B": 106.16002079620361}
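The storage-versus-speed trade-off described above can be observed directly with h5py: the same array written with and without gzip compression (a toy array of zeros here, so the size difference is exaggerated; the chapter's own files are written uncompressed).

```python
import os
import tempfile

import h5py
import numpy as np

# highly compressible demo data standing in for real images
data = np.zeros((200, 64, 64, 3), dtype="uint8")
tmp = tempfile.gettempdir()
rawPath = os.path.join(tmp, "raw.hdf5")
gzPath = os.path.join(tmp, "gz.hdf5")

with h5py.File(rawPath, "w") as f:
    # uncompressed: larger on disk, but no decode cost when reading
    f.create_dataset("images", data=data)
with h5py.File(gzPath, "w") as f:
    # gzip-compressed: smaller on disk, slower to slice
    f.create_dataset("images", data=data, compression="gzip")

print(os.path.getsize(rawPath) > os.path.getsize(gzPath))  # True
```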

         4. Summary

In this chapter, we learned how to serialize raw images into an HDF5 dataset suitable for training a deep neural network. The reason we serialize the raw images into an HDF5 file rather than simply accessing mini-batches of image paths on disk during training is I/O latency: for every image on disk, we would have to perform an I/O operation to read it. This subtle optimization may not seem like a big deal, but I/O latency is a huge problem in a deep learning pipeline -- the training process is slow enough as it is.

By contrast, if we serialize all the images into an efficiently packed HDF5 file, we can extract our mini-batches with very fast array slices, significantly reducing I/O latency and helping to speed up the training process.

Whenever you use the Keras library and work with a dataset too large to fit into memory, be sure to consider serializing the dataset into HDF5 format first -- as we will find in the next chapter, it makes training your network an easier (and more efficient) task.

Tags: Python neural networks Deep Learning

Posted on Sun, 05 Dec 2021 21:27:26 -0500 by gioahmad