Overview of data processing in deep learning
Three elements of deep learning: data, computing power and algorithm
In engineering practice, the importance of data has received increasing attention. In the data science community there is a saying that "data determines the upper limit of the model, and the algorithm determines the lower limit". In other words, only good data can produce a good model; data is the key factor that determines model quality.
Data is important
Simply put, the goal is to find good data and feed it to the model.
However, what counts as "good" data, how to find it, and how model performance changes after training on it is a huge topic that this article does not explore in depth. Instead, from the perspective of feature engineering, it summarizes the data processing steps most commonly used in deep learning.
Basic steps of data processing in ML/DL
Before an experiment, data needs to be collected, including both raw samples and labels. Label information can generally be obtained in several ways: collecting public datasets, manual annotation, automatic/semi-automatic annotation, generation by a simulation platform, and so on.
Once the raw data is available, it needs to be split into a training set, a validation set, and a test set:
- Training set: used to train the model
- Validation set: used to check whether the model is overfitting, and to select the model's hyperparameters (learning rate, optimization algorithm, network structure, etc.) by comparing the algorithm's performance on the validation set
- Test set: used to evaluate the model's performance and generalization ability (the test-set metrics are often provided by a third party, so the algorithm engineers never touch the test data or labels)
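The three-way split above can be sketched with PyTorch's `random_split`; the toy dataset, the 80/10/10 ratio, and the seed below are illustrative assumptions, not part of the original text:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 100 samples (features and labels are placeholders)
features = torch.randn(100, 3)
labels = torch.randint(0, 2, (100,))
full_dataset = TensorDataset(features, labels)

# 80/10/10 split into training / validation / test sets
n_train, n_valid = 80, 10
n_test = len(full_dataset) - n_train - n_valid
train_set, valid_set, test_set = random_split(
    full_dataset, [n_train, n_valid, n_test],
    generator=torch.Generator().manual_seed(42),  # fixed seed for reproducibility
)
print(len(train_set), len(valid_set), len(test_set))  # 80 10 10
```

Fixing the generator seed makes the split reproducible across runs, which matters when comparing hyperparameter choices on the validation set.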
The core of data reading in PyTorch is the DataLoader.
DataLoader is further subdivided into two sub-modules, Sampler and Dataset. The Sampler generates indices, i.e. the serial numbers of the samples; the Dataset reads the image and its corresponding label according to the index.
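This division of labor can be seen directly by listing the indices a sampler produces; the tiny `TensorDataset` below is only a stand-in:

```python
import torch
from torch.utils.data import TensorDataset, SequentialSampler, RandomSampler

dataset = TensorDataset(torch.arange(5).float())

# The Sampler only produces indices; the Dataset maps index -> sample
seq = list(SequentialSampler(dataset))
print(seq)  # [0, 1, 2, 3, 4]

rnd = list(RandomSampler(dataset))
print(sorted(rnd))  # the same indices, visited in shuffled order
```

`shuffle=True` on a DataLoader is essentially choosing `RandomSampler` instead of `SequentialSampler` behind the scenes.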
Typical preprocessing includes mean-centering, standardization, rotation, flipping, etc. Data preprocessing in PyTorch is performed through transforms.
Data reading module in PyTorch
```python
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, multiprocessing_context=None)
```
- Function: builds an iterable data loader;
- dataset: a Dataset instance, which determines where and how to read data;
- batch_size: batch size;
- num_workers: number of subprocesses used for data loading (0 means loading in the main process);
- shuffle: whether the samples are shuffled at each epoch;
- drop_last: whether to discard the last batch when the number of samples is not divisible by batch_size;
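The effect of `batch_size` and `drop_last` can be checked with a minimal loader; the 10-sample `TensorDataset` here is just a stand-in:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 samples, batch_size=3, drop_last=False -> batches of sizes 3, 3, 3, 1
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1),
                        torch.zeros(10))
loader = DataLoader(dataset, batch_size=3, shuffle=True, drop_last=False)

batch_sizes = [x.shape[0] for x, y in loader]
print(sorted(batch_sizes))  # [1, 3, 3, 3]
```

With `drop_last=True` the final size-1 batch would be discarded, which is sometimes preferred when layers such as BatchNorm behave poorly on tiny batches.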
```python
class Dataset(object):
    def __getitem__(self, index):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])
```
- Dataset defines where and how to read data;
- Function: Dataset is an abstract class; all custom datasets need to inherit from it and override `__getitem__()`;
- `__getitem__()` receives an index and returns a sample
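A minimal custom Dataset following this contract might look as below; `ListDataset` and its in-memory pairs are hypothetical names for illustration (note that `__len__` is also usually overridden so the DataLoader knows the dataset size):

```python
from torch.utils.data import Dataset

class ListDataset(Dataset):
    """A minimal custom Dataset wrapping an in-memory list of (sample, label) pairs."""

    def __init__(self, pairs):
        self.pairs = pairs

    def __getitem__(self, index):
        # Receive an index, return one sample and its label
        return self.pairs[index]

    def __len__(self):
        # The DataLoader uses this to know how many samples there are
        return len(self.pairs)

ds = ListDataset([(0.1, 0), (0.2, 1), (0.3, 0)])
print(len(ds), ds[1])  # 3 (0.2, 1)
```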
A data reading example of classification task
The following is a reference DataLoader example for a classification task.
```python
# Build MyDataset instances; MyDataset (here RMBDataset) must be defined by the user
train_data = RMBDataset(data_dir=train_dir, transform=train_transform)  # data_dir is the data path, transform is the preprocessing
valid_data = RMBDataset(data_dir=valid_dir, transform=valid_transform)  # one for training, one for validation

# Build DataLoaders
train_loader = DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)  # shuffle=True: samples are shuffled each epoch
valid_loader = DataLoader(dataset=valid_data, batch_size=BATCH_SIZE)
```
Here the DataLoader receives a Dataset as its first parameter, i.e. the RMBDataset built earlier; the second parameter is batch_size; and shuffle=True means the samples are shuffled at each epoch.
Stepping into RMBDataset in the code shows that it builds two datasets, one for training and one for validation.
The core function to override is `__getitem__()`:
```python
def __getitem__(self, index):
    path_img, label = self.data_info[index]
    img = Image.open(path_img).convert('RGB')  # pixel values 0~255
    if self.transform is not None:
        img = self.transform(img)  # apply transforms here, e.g. convert to tensor
    return img, label
```