PyTorch Pitfalls, Part 6: The Data Processing Modules DataLoader and Dataset

Overview of data processing in deep learning

The three elements of deep learning: data, computing power, and algorithms
In engineering practice, the importance of data is attracting more and more attention. A common saying in the data science community is that "data determines the upper limit of the model, and the algorithm determines the lower limit of the model". In other words, a good model requires good data, and data is a key factor in how well the model performs.

Data is important

Simply put, the task is to find good data and feed it to the model.
However, what counts as "good" data, how to find it, and how the model's performance changes after training on it are huge topics that this post does not discuss in depth. As a starting point, the overview below takes a feature engineering perspective and summarizes the data processing steps most commonly used in deep learning.

Basic steps of data processing in ML/DL


Before the experiment, data must be collected, including the raw samples and their labels. Labels are generally obtained in one of several ways: taking them from a public dataset, manual labeling, automatic or semi-automatic labeling, generation on a simulation platform, and so on.


Once the raw data is available, the dataset needs to be divided into a training set, a validation set, and a test set:

  • Training set: used to train the model
  • Validation set: used to check whether the model is overfitting, and to select the model's hyperparameters (learning rate, optimization algorithm, network structure, etc.) by comparing performance on the validation set
  • Test set: used to measure the performance and generalization ability of the model (the test metrics are often provided by a third party, and the algorithm developers have no access to the test data and labels)
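
One way to perform such a split in PyTorch is `torch.utils.data.random_split`. The following is a minimal sketch, not a recommended recipe: the toy tensors, the 80/10/10 ratio, and the seed are all assumptions made for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy data (made up for illustration): 100 samples, 4 features, binary labels.
features = torch.randn(100, 4)
labels = torch.randint(0, 2, (100,))
full_dataset = TensorDataset(features, labels)

# 80/10/10 split; the seeded generator makes the split reproducible.
generator = torch.Generator().manual_seed(0)
train_set, valid_set, test_set = random_split(
    full_dataset, [80, 10, 10], generator=generator
)

print(len(train_set), len(valid_set), len(test_set))  # 80 10 10
```

Each returned object is a `Subset` that stores indices into the full dataset, so the three parts never share a sample.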

Data reading

The core of data reading in PyTorch is DataLoader.
DataLoader relies on two submodules, Sampler and Dataset. The Sampler generates indices, that is, the serial numbers of the samples; the Dataset reads the image and its corresponding label for a given index.
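
To make the role of the Sampler concrete, the sketch below lists the indices produced by the built-in samplers in `torch.utils.data`. The five-element list is a hypothetical stand-in for a real dataset; the samplers only need its length.

```python
from torch.utils.data import BatchSampler, RandomSampler, SequentialSampler

data = ["a", "b", "c", "d", "e"]  # hypothetical data source; only its length matters

sequential = list(SequentialSampler(data))  # indices in order
shuffled = list(RandomSampler(data))        # a random permutation of the indices
batches = list(BatchSampler(SequentialSampler(data), batch_size=2, drop_last=False))

print(sequential)        # [0, 1, 2, 3, 4]
print(sorted(shuffled))  # [0, 1, 2, 3, 4]
print(batches)           # [[0, 1], [2, 3], [4]]
```

Roughly speaking, `shuffle=False` and `shuffle=True` on a DataLoader correspond to the sequential and random samplers shown here, and the batch sampler groups their indices before the Dataset is asked for the samples.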

Data preprocessing

Examples include data centering, standardization, rotation, and flipping.
In PyTorch, data preprocessing is performed through the transforms module of torchvision.
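
To show what these operations do to the data, here is a minimal sketch that applies them by hand with plain tensor operations. It is an illustration only, run on a made-up random batch; in practice `torchvision.transforms` offers ready-made equivalents such as `Normalize`, `RandomHorizontalFlip`, and `RandomRotation`.

```python
import torch

# A fake batch of 8 RGB images, 32x32, values in [0, 1] (illustrative data only).
images = torch.rand(8, 3, 32, 32)

# Centering and standardization: subtract the per-channel mean,
# then divide by the per-channel standard deviation.
mean = images.mean(dim=(0, 2, 3), keepdim=True)
std = images.std(dim=(0, 2, 3), keepdim=True)
standardized = (images - mean) / std

# Horizontal flip: reverse the width dimension (the last one).
flipped = torch.flip(images, dims=[3])

# Rotation by 90 degrees in the height/width plane.
rotated = torch.rot90(images, k=1, dims=(2, 3))

print(standardized.mean(dim=(0, 2, 3)))  # close to 0 for every channel
print(standardized.std(dim=(0, 2, 3)))   # close to 1 for every channel
```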

Data reading module in PyTorch

  • Function of DataLoader: builds an iterable data loader;
  • dataset: a Dataset instance, which determines where and how the data is read;
  • batch_size: the number of samples per batch;
  • num_workers: the number of subprocesses used to read data (0 means the data is read in the main process);
  • shuffle: whether to reshuffle the samples at every epoch;
  • drop_last: whether to drop the last incomplete batch when the number of samples is not divisible by batch_size;
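
The effect of these parameters can be checked on a small example. The sketch below assumes a toy `TensorDataset` of 10 samples, chosen so that a batch size of 4 leaves an incomplete final batch; the variable names are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 toy samples: inputs of shape (1,) and integer targets 0..9.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

loader_keep = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0, drop_last=False)
loader_drop = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0, drop_last=True)

print(len(loader_keep))  # 3 batches: 4 + 4 + 2
print(len(loader_drop))  # 2 batches: 4 + 4

for inputs, targets in loader_drop:
    print(inputs.shape, targets.shape)  # torch.Size([4, 1]) torch.Size([4])
```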

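class Dataset(object):
    def __getitem__(self, index):
        # Subclasses must override this to return the sample at `index`
        raise NotImplementedError

    def __add__(self, other):
        # dataset_a + dataset_b concatenates the two datasets
        return ConcatDataset([self, other])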
  • Dataset defines where and how the data is read;
  • Function: Dataset is an abstract class. Every custom dataset must inherit from it and override __getitem__();
  • __getitem__(): receives an index and returns one sample
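
As a minimal sketch of this contract, the hypothetical `SquaresDataset` below (not part of PyTorch) inherits from `Dataset`, overrides `__getitem__`, and adds `__len__` so that it can be concatenated and batched.

```python
from torch.utils.data import ConcatDataset, Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: each sample is the pair (value, value squared)."""

    def __init__(self, start, stop):
        self.values = list(range(start, stop))

    def __getitem__(self, index):
        # Receive an index, return one sample.
        value = self.values[index]
        return value, value * value

    def __len__(self):
        return len(self.values)


first = SquaresDataset(0, 3)
second = SquaresDataset(10, 12)
combined = first + second  # Dataset.__add__ returns a ConcatDataset

print(len(combined))  # 5
print(combined[3])    # (10, 100)
```

Index 3 falls past the three samples of `first`, so it resolves to the first sample of `second`.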

A data reading example of classification task

For details, see the reference classification-task DataLoader example.
Core code:

from torch.utils.data import DataLoader

# Build the dataset instances. RMBDataset is a user-defined Dataset subclass
train_data = RMBDataset(data_dir=train_dir, transform=train_transform)  # data_dir is the data path, transform is the data preprocessing
valid_data = RMBDataset(data_dir=valid_dir, transform=valid_transform)  # One for training and one for validation

# Build the DataLoaders
train_loader = DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)  # shuffle=True: the samples are reshuffled at every epoch
valid_loader = DataLoader(dataset=valid_data, batch_size=BATCH_SIZE)

The first argument passed to DataLoader is the dataset, here the RMBDataset instance built earlier. The second argument is batch_size. Setting shuffle=True means the samples are reshuffled at every epoch.
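
The per-epoch reshuffling can be observed directly. The sketch below is an illustration that uses a toy `TensorDataset` in place of RMBDataset and records the order in which samples arrive over two epochs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(8))  # toy samples 0..7
loader = DataLoader(dataset, batch_size=4, shuffle=True)

epoch_orders = []
for epoch in range(2):
    order = []
    for (batch,) in loader:  # each batch holds a single tensor here
        order.extend(batch.tolist())
    epoch_orders.append(order)
    print(f"epoch {epoch}: {order}")
```

Every epoch visits each sample exactly once, but the order is drawn afresh each time the loader is iterated.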

RMBDataset is instantiated twice in the code above, once for training and once for validation.
Its core is the overridden __getitem__ method:

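# Assumes `from PIL import Image` at the top of the module
def __getitem__(self, index):
    path_img, label = self.data_info[index]     # (image path, label) pair for this index
    img = Image.open(path_img).convert('RGB')   # 0~255

    if self.transform is not None:
        img = self.transform(img)   # Apply the transform here, e.g. convert to a tensor

    return img, label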
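
The same pattern can be tried without image files. The sketch below is a hypothetical in-memory stand-in for RMBDataset: the class name, the sample values, and the `to_unit_tensor` transform are assumptions for illustration, while the `data_info` lookup and the optional transform mirror the method above.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class InMemoryDataset(Dataset):
    """Hypothetical stand-in for RMBDataset that keeps its samples in memory."""

    def __init__(self, data_info, transform=None):
        self.data_info = data_info  # list of (sample, label) pairs
        self.transform = transform

    def __getitem__(self, index):
        sample, label = self.data_info[index]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, label

    def __len__(self):
        return len(self.data_info)


def to_unit_tensor(values):
    # Convert a list of 0~255 values into a tensor scaled to 0~1.
    return torch.tensor(values) / 255.0


data_info = [([0.0, 255.0], 0), ([127.5, 51.0], 1), ([255.0, 0.0], 0)]
loader = DataLoader(InMemoryDataset(data_info, transform=to_unit_tensor), batch_size=3)

inputs, labels = next(iter(loader))
print(inputs.shape, labels)  # torch.Size([3, 2]) tensor([0, 1, 0])
```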

Tags: Machine Learning Pytorch Deep Learning

Posted on Mon, 20 Sep 2021 21:38:27 -0400 by ven.ganeva