[PyTorch] Custom data reading with PyTorch DataLoader
This post collects the custom data reading approaches I have come across, based on three good articles. In essence, a custom reader returns, for the train and test splits of an existing dataset, a list of image paths and a list of labels, so it has to be adapted to how each dataset is stored.
All pictures are in one folder 1
When I first started using PyTorch, I ran into many pitfalls while writing a DataLoader. The tutorials online cover two layouts: all images in one folder, and one folder per class. Many people have written about the latter, so I'll cover the former here in boilerplate form: for each new task you only need to change a few parameters.
I'm writing this post while a model is training.
1, What already exists
Example: take the dog breed dataset on Kaggle. The data folder contains three items:
test: several thousand pictures without labels (the test set)
train: 10222 pictures of dogs, all jpg, with varying widths and heights, mostly around 400 × 300 or larger
labels.csv: a CSV table of picture name + breed name
I like to read out the form information with pandas first
import pandas as pd
import numpy as np

df = pd.read_csv('./dog_breed/labels.csv')
print(df.info())
print(df.head())
You can see that there are 10222 rows in total. The id column holds the picture name without the .jpg suffix, and the breed column holds the dog's breed.
2, Preprocessing
What we need to do is:
1) Get a long list1 containing the path of each picture
2) Get another long list2 containing the integer label of each picture, in the same order as list1
3) Split off part of both lists as the validation set
1) Count the breeds and map each breed name to an integer:
from pandas import Series, DataFrame
breed = df['breed']
breed_np = breed.to_numpy()  # Series.as_matrix() was removed in pandas 1.0
print(type(breed_np))
print(breed_np.shape)  # (10222,)
Let's see how many different kinds there are
breed_set = set(breed_np)
print(len(breed_set)) #120
Build a dictionary that maps each breed name to a number. It can be inverted later to turn a predicted number back into a name:
breed_120_list = sorted(breed_set)  # sorted so the mapping is the same on every run
dic = {}
for i in range(120):
    dic[breed_120_list[i]] = i
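The reverse lookup (number back to name) can be sketched as follows. This is a minimal illustration on a made-up list of breed names; class_to_idx and idx_to_class are hypothetical names, not part of the original code. Sorting the names first keeps the numbering stable, because the iteration order of a set of strings is not guaranteed to be the same from one Python process to the next.

```python
# Toy breed column; the real one comes from df['breed'].
breeds = ["beagle", "pug", "beagle", "collie", "pug"]

# Sort the unique names so the name -> number mapping is reproducible.
class_names = sorted(set(breeds))
class_to_idx = {name: i for i, name in enumerate(class_names)}
idx_to_class = {i: name for name, i in class_to_idx.items()}

# Encode every row, then decode one label back to its name.
labels = [class_to_idx[b] for b in breeds]
print(class_to_idx)
print(labels)
print(idx_to_class[labels[3]])
```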
2) Process the id column and split it into two parts:
file = df["id"].to_numpy()
print(file.shape)
import os
file = [i+".jpg" for i in file]
file = [os.path.join("./dog_breed/train",i) for i in file ]
file_train = file[:8000]
file_test = file[8000:]
print(file_train)
np.save( "file_train.npy" ,file_train )
np.save( "file_test.npy" ,file_test )
Each entry is the path of one picture.
3) Process the breed column and split it into two parts:
breed = df["breed"].to_numpy()
print(breed.shape)
number = []
for i in range(10222):
    number.append(dic[breed[i]])
number = np.array(number)
number_train = number[:8000]
number_test = number[8000:]
np.save("number_train.npy", number_train)
np.save("number_test.npy", number_test)
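Taking the first 8000 rows as the training set works when the rows are already in random order. If they are not, the two lists can be shuffled together before splitting. The sketch below runs on toy data and uses a hypothetical helper name, split_train_val; it is not part of the original code.

```python
import random

def split_train_val(paths, labels, n_train, seed=0):
    """Shuffle paths and labels with the same permutation, then split."""
    assert len(paths) == len(labels)
    order = list(range(len(paths)))
    random.Random(seed).shuffle(order)  # fixed seed -> reproducible split
    train_idx, val_idx = order[:n_train], order[n_train:]
    train = ([paths[i] for i in train_idx], [labels[i] for i in train_idx])
    val = ([paths[i] for i in val_idx], [labels[i] for i in val_idx])
    return train, val

# Toy stand-ins for the real path and label lists.
paths = ["img%d.jpg" % i for i in range(10)]
labels = [i % 3 for i in range(10)]
(train_paths, train_labels), (val_paths, val_labels) = split_train_val(paths, labels, 8)
print(len(train_paths), len(val_paths))
```

Because one permutation is applied to both lists, every path keeps its own label after the shuffle.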
3, Dataloader
We already have a list of image paths and a list of target numbers. Just fill in the Dataset class.
from PIL import Image  # needed by default_loader below
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)
preprocess = transforms.Compose([
    #transforms.Scale(256),
    #transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
])
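def default_loader(path):
    # convert('RGB') guards against grayscale or RGBA files, which would not match the 3-channel normalize
    img_pil = Image.open(path).convert('RGB')
    img_pil = img_pil.resize((224, 224))
    img_tensor = preprocess(img_pil)
    return img_tensor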
Whatever the loader returns has already been converted to a tensor.
class trainset(Dataset):
    def __init__(self, loader=default_loader):
        # image paths and their labels, as prepared above
        self.images = file_train
        self.target = number_train
        self.loader = loader

    def __getitem__(self, index):
        fn = self.images[index]
        img = self.loader(fn)
        target = self.target[index]
        return img, target

    def __len__(self):
        return len(self.images)
Let's look at the code. A custom dataset only needs a single class that inherits from Dataset and implements three special methods:
def __init__(self, loader=default_loader):
This usually stores a loader (see the code above), a list of image paths, and a list of targets.
def __getitem__(self, index):
Given an index, this returns the tensor for one image together with its target. The loader turns the image into a tensor after resizing (or cropping), type conversion and normalization.
def __len__(self):
This returns the total number of samples.
Taken together: __len__ tells the DataLoader how many samples there are, the DataLoader then produces (optionally shuffled) indices, and __getitem__(self, index) returns the (input, target) pair for each index. That is how the dataset is traversed.
4, Use
Instantiate the dataset and wrap it in a DataLoader:
train_data = trainset()
trainloader = DataLoader(train_data, batch_size=4, shuffle=True)
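To check that such a dataset and loader behave as expected without needing the image files, the same three-method pattern can be tried on random tensors. This is only a sketch: ToyDataset and its sizes are made up for illustration, and small random tensors stand in for the loaded images.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Same structure as trainset, but with random tensors instead of images."""
    def __init__(self, n_samples=10):
        self.images = torch.randn(n_samples, 3, 32, 32)
        self.target = torch.arange(n_samples) % 2  # fake labels 0/1

    def __getitem__(self, index):
        return self.images[index], self.target[index]

    def __len__(self):
        return len(self.images)

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=True)

# Collect the shape of every batch the loader yields.
batch_shapes = []
for imgs, targets in loader:
    batch_shapes.append((tuple(imgs.shape), tuple(targets.shape)))
print(batch_shapes)
```

With 10 samples and batch_size=4 the last batch holds only 2 samples; passing drop_last=True to DataLoader discards such an incomplete batch.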
All pictures are in one folder 2
The previous post, "PyTorch learning path (Level 1) -- training an image classification model", showed how to train an image classification model with PyTorch; it is worth reading before this one. That code reads image data through the torchvision.datasets.ImageFolder interface, which assumes that the training data is stored with one folder per category. Sometimes the data is not organized that way: for example, a single folder holds images of every category, and a separate label file (such as a txt file) records which label belongs to which image. In that case torchvision.datasets.ImageFolder cannot be used, and you need to write your own data reading interface. At the end, this post also covers how to save the model and how to train on multiple GPUs.
How?
Let's first look at how torchvision.datasets.ImageFolder is written. The main code is shown below; for the details, see the official GitHub code.
It looks complicated but is actually simple. The class inherits from torch.utils.data.Dataset and has three main methods: __init__ (initialization), __getitem__ (fetching an image) and __len__ (the dataset size). __init__ calls find_classes to get the class names (classes) and the dictionary that maps each class name to a numeric category (class_to_idx). It then calls make_dataset to get imgs, a list in which each element is a tuple of two items: image path and label. The rest is plain assignment. The most important line in __getitem__ is img = self.loader(path), which reads the image. As __init__ shows, self.loader defaults to default_loader, whose core is reading the image data with the Image module of Python's PIL library.
class ImageFolder(data.Dataset):
    """A generic data loader where the images are arranged in this way: ::

        root/dog/xxx.png
        root/dog/xxy.png
        root/dog/xxz.png

        root/cat/123.png
        root/cat/nsdf3.png
        root/cat/asd932_.png

    Args:
        root (string): Root directory path.
        transform (callable, optional): A function/transform that takes in an PIL image
            and returns a transformed version. E.g, ``transforms.RandomCrop``
        target_transform (callable, optional): A function/transform that takes in the
            target and transforms it.
        loader (callable, optional): A function to load an image given its path.

    Attributes:
        classes (list): List of the class names.
        class_to_idx (dict): Dict with items (class_name, class_index).
        imgs (list): List of (image path, class_index) tuples
    """

    def __init__(self, root, transform=None, target_transform=None,
                 loader=default_loader):
        classes, class_to_idx = find_classes(root)
        imgs = make_dataset(root, class_to_idx)
        if len(imgs) == 0:
            raise(RuntimeError("Found 0 images in subfolders of: " + root + "\n"
                               "Supported image extensions are: " + ",".join(IMG_EXTENSIONS)))

        self.root = root
        self.imgs = imgs
        self.classes = classes
        self.class_to_idx = class_to_idx
        self.transform = transform
        self.target_transform = target_transform
        self.loader = loader

    def __getitem__(self, index):
        """
        Args:
            index (int): Index

        Returns:
            tuple: (image, target) where target is class_index of the target class.
        """
        path, target = self.imgs[index]
        img = self.loader(path)
        if self.transform is not None:
            img = self.transform(img)
        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target

    def __len__(self):
        return len(self.imgs)
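The helpers find_classes and make_dataset are not shown above. The following is a rough sketch of what they do, written from the description in this post rather than copied from torchvision, so details such as the extension list and the exact signatures are assumptions. It builds a throwaway folder tree to run against.

```python
import os
import tempfile

IMG_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed subset of extensions

def find_classes(root):
    """Each subfolder of root is a class; classes are numbered in sorted order."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    class_to_idx = {name: i for i, name in enumerate(classes)}
    return classes, class_to_idx

def make_dataset(root, class_to_idx):
    """Return a list of (image path, class index) tuples."""
    imgs = []
    for name in sorted(class_to_idx):
        folder = os.path.join(root, name)
        for fname in sorted(os.listdir(folder)):
            if fname.lower().endswith(IMG_EXTENSIONS):
                imgs.append((os.path.join(folder, fname), class_to_idx[name]))
    return imgs

# Build a tiny root/dog, root/cat tree with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for cls, files in {'dog': ['a.png', 'b.png'], 'cat': ['c.png', 'notes.txt']}.items():
        os.makedirs(os.path.join(root, cls))
        for f in files:
            open(os.path.join(root, cls, f), 'w').close()
    classes, class_to_idx = find_classes(root)
    imgs = make_dataset(root, class_to_idx)
    labels = [label for _, label in imgs]
    names = [os.path.basename(p) for p, _ in imgs]

print(classes, class_to_idx)
print(names, labels)
```

The non-image file notes.txt is skipped, which mirrors the role IMG_EXTENSIONS plays in the error message of __init__.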
Now look at the default_loader function. It dispatches to one of two functions depending on the image backend; pil_loader is the one normally used.
def pil_loader(path):
    with open(path, 'rb') as f:
        with Image.open(f) as img:
            return img.convert('RGB')

def accimage_loader(path):
    import accimage
    try:
        return accimage.Image(path)
    except IOError:
        # Potentially a decoding problem, fall back to PIL.Image
        return pil_loader(path)

def default_loader(path):
    from torchvision import get_image_backend
    if get_image_backend() == 'accimage':
        return accimage_loader(path)
    else:
        return pil_loader(path)
After understanding the ImageFolder class, you can customize your own data reading interface.
First, in PyTorch, classes that read data generally inherit from one base class, torch.utils.data.Dataset, and override methods such as __init__, __len__ and __getitem__.
Assume img_path is your image folder, holding all images (both training and test), and that the label folder holds two files, train.txt and val.txt; txt_path is the path to one of them. Each line of a txt file is an image path, a tab, and a label. How __init__ in the code below fills self.img_name and self.img_label therefore depends on how your data is stored, and you can adjust it to match your own layout. __getitem__ changes little and still uses default_loader to read the image. Finally, the transform converts each image into a Tensor.
class customData(Dataset):
    def __init__(self, img_path, txt_path, dataset='', data_transforms=None, loader=default_loader):
        # each line of the txt file: image path, tab, label
        with open(txt_path) as input_file:
            lines = input_file.readlines()
            self.img_name = [os.path.join(img_path, line.strip().split('\t')[0]) for line in lines]
            self.img_label = [int(line.strip().split('\t')[-1]) for line in lines]
        self.data_transforms = data_transforms
        self.dataset = dataset
        self.loader = loader

    def __len__(self):
        return len(self.img_name)

    def __getitem__(self, item):
        img_name = self.img_name[item]
        label = self.img_label[item]
        img = self.loader(img_name)

        if self.data_transforms is not None:
            try:
                img = self.data_transforms[self.dataset](img)
            except Exception:
                # on failure the untransformed image is returned
                print("Cannot transform image: {}".format(img_name))
        return img, label
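The label file format assumed by customData (image path, tab, label on each line) can be produced and parsed as in the sketch below. The sample entries and the temporary folder are made up for illustration, and the parsing mirrors the two list comprehensions in __init__.

```python
import os
import tempfile

samples = [("cat_001.jpg", 0), ("dog_001.jpg", 1), ("cat_002.jpg", 0)]  # made-up entries

with tempfile.TemporaryDirectory() as txt_dir:
    txt_path = os.path.join(txt_dir, "train.txt")
    # Write one line per image: path, a tab character, then the label.
    with open(txt_path, "w") as f:
        for name, label in samples:
            f.write("{}\t{}\n".format(name, label))

    # Read it back the same way customData.__init__ does.
    img_path = "/ImagePath"
    with open(txt_path) as input_file:
        lines = input_file.readlines()
    img_name = [os.path.join(img_path, line.strip().split('\t')[0]) for line in lines]
    img_label = [int(line.strip().split('\t')[-1]) for line in lines]

print(img_name)
print(img_label)
```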
Once the data reading interface is defined, it can be called like this:
image_datasets = {x: customData(img_path='/ImagePath',
                                txt_path=('/TxtFile/' + x + '.txt'),
                                data_transforms=data_transforms,
                                dataset=x)
                  for x in ['train', 'val']}
The image_datasets returned this way offers the same interface as the object returned by torchvision.datasets.ImageFolder, so one can be swapped for the other unnoticed. This is the building-block feeling of writing code mentioned in the first post.
With image_datasets in hand, torch.utils.data.DataLoader is still used to batch the images and the labels, each into its own Tensor.
dataloders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size, shuffle=True) for x in ['train', 'val']}
How can the model produced by each epoch be saved? Simply use torch.save, passing the model and the path to save to (including the file name). If the output folder does not exist, create it manually or in code.
torch.save(model, 'output/resnet_epoch{}.pkl'.format(epoch))
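torch.save(model, ...) pickles the whole model object. A commonly used alternative is to save only the parameters through state_dict() and load them into a freshly constructed model. The sketch below uses a small nn.Linear as a stand-in for the real network and a temporary folder as the output directory; both are assumptions for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real network

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "output")
    os.makedirs(out_dir, exist_ok=True)  # create the folder in code if missing

    epoch = 0
    path = os.path.join(out_dir, "resnet_epoch{}.pkl".format(epoch))
    torch.save(model.state_dict(), path)  # save parameters only

    restored = nn.Linear(4, 2)  # must have the same architecture
    restored.load_state_dict(torch.load(path))

# Compare every saved tensor with the restored one.
same = all(torch.equal(a, b) for a, b in
           zip(model.state_dict().values(), restored.state_dict().values()))
print(same)
```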
Finally, PyTorch supports multi-GPU training. Assuming your network is called model, the single line below (wrapping it in torch.nn.DataParallel) makes subsequent training run on GPUs 0 and 1 to speed it up.
model = torch.nn.DataParallel(model, device_ids=[0,1])
For the complete code, see: GitHub
The pictures of each class are placed in a folder
This post is aimed at PyTorch beginners. PyTorch's documentation is of high quality and easy to get started with. This post uses an example from the official tutorial to show how to train a ResNet model for image classification with PyTorch. The code logic is very clear and resembles that of many other deep learning frameworks, so it suits beginners who want to learn how to train a model in PyTorch (without running the MNIST demo every time). The explanation below comes from personal use and follows the building-block order of the code, from data import through to the end of model training.
First comes data import, which uses the official torchvision.datasets.ImageFolder interface. This interface takes the folder where the images are stored, data_dir = '/data' below. For a classification problem, data_dir generally contains two folders, train and val, and each of them contains N subfolders, where N is the number of categories and each subfolder holds the images of one category. torchvision.datasets.ImageFolder then returns a list-like dataset (image_datasets['train'] or image_datasets['val'] in the code below) in which each element is a tuple of an image and its label.
data_dir = '/data'
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
Here data_transforms is a dictionary, shown below, that performs image preprocessing such as resizing and cropping, implemented with the torchvision.transforms module. torchvision.transforms.Compose chains all the transforms, and torchvision.transforms.RandomSizedCrop does the cropping. Note that torchvision.transforms.RandomSizedCrop and transforms.RandomHorizontalFlip() take a PIL Image as input, that is, the image as read by Python's PIL library, whereas transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) operates on a Tensor. That is why transforms.ToTensor() comes before transforms.Normalize: it produces the Tensor. In addition, transforms.Scale(256) is really a resize operation and has been replaced by transforms.Resize; likewise, RandomSizedCrop has been renamed to RandomResizedCrop.
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),  # formerly RandomSizedCrop
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),  # formerly Scale
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
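What transforms.Normalize does to each pixel is simple arithmetic: for every channel, output = (input - mean) / std, applied to values that ToTensor has already scaled to the range [0, 1]. The sketch below reproduces that formula in plain Python for a single made-up RGB pixel; normalize_pixel is a hypothetical helper, not a torchvision function.

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Apply (value - mean) / std channel by channel."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]

# A pixel exactly at the channel means maps to zero in every channel.
print(normalize_pixel([0.485, 0.456, 0.406]))
# A pure white pixel (1.0 after ToTensor) maps to positive values.
print(normalize_pixel([1.0, 1.0, 1.0]))
```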
As noted, torchvision.datasets.ImageFolder only returns a list-like dataset, which cannot be fed to the model directly. Another PyTorch class, torch.utils.data.DataLoader, is used to wrap it: it collates the individual samples into batches in Tensor format for the model. Note that images and labels are each collated into their own Tensor. Another very important class is torch.utils.data.Dataset, an abstract class that all dataset classes in PyTorch should inherit from; torchvision.datasets.ImageFolder mentioned above is one example. (torch.utils.data.DataLoader is not a Dataset subclass; it wraps a Dataset.) So when your data is not stored in the folder-per-class layout, you need to write your own class to read it. That class must inherit from torch.utils.data.Dataset, and its output is then batched into Tensors by torch.utils.data.DataLoader.
dataloders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4, shuffle=True, num_workers=4) for x in ['train', 'val']}
After the dataloaders are created, one more step is needed before the data can be fed to the model: wrapping the Tensors in Variables. Look at the code below. dataloders is a dictionary, dataloders['train'] holds the training data, and the for loop reads batch_size samples from dataloders['train'] at a time; batch_size was set when the dataloaders were created. Each data item therefore contains a Tensor of image data (inputs) and a Tensor of labels. torch.autograd.Variable then wraps each Tensor into the Variable type that the model accepts.
Why wrap them in Variables? In the PyTorch version used here, torch.Tensor and torch.autograd.Variable are two important data structures. A Variable can be seen as a wrapper around a Tensor that holds not only the Tensor's content but also its gradient and related information, which is why Variables are used throughout neural networks. How do you get the Tensor back out of a Variable? That is simple too: the wrapped inputs below is a Variable, so inputs.data is the corresponding Tensor. (Since PyTorch 0.4 the two types are merged and the wrapping is no longer required.)
for data in dataloders['train']:
    inputs, labels = data

    if use_gpu:
        inputs = Variable(inputs.cuda())
        labels = Variable(labels.cuda())
    else:
        inputs, labels = Variable(inputs), Variable(labels)
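In newer PyTorch releases the same loop is usually written without Variable, moving each batch to a device instead. The following is a sketch of that newer style, not the original author's code; the random TensorDataset stands in for dataloders['train'].

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Random data standing in for the real image dataset.
dataset = TensorDataset(torch.randn(8, 3, 16, 16), torch.randint(0, 2, (8,)))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

n_batches = 0
for inputs, labels in loader:
    # .to(device) replaces both .cuda() and the Variable wrapper.
    inputs, labels = inputs.to(device), labels.to(device)
    n_batches += 1

print(n_batches, inputs.shape, labels.dtype)
```

The same code runs on a machine with or without a GPU, so the use_gpu branch is no longer needed.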
Once the data is wrapped it can be fed to the model, so the model has to be imported first. PyTorch ships with common network structures for classification, such as VGG, ResNet and DenseNet, which can be imported from the torchvision.models module. For example, torchvision.models.resnet18(pretrained=True) imports the ResNet18 network with pre-trained weights. Because the pre-trained networks are generally trained on the 1000-class ImageNet dataset, transferring to a 2-class problem on your own dataset means replacing the last fully connected layer with one that has the output size you need. The three lines below import ResNet18 with the models module, read the number of input features of the fully connected layer, and replace that layer with a new one built from this number and the number of categories you want (2 here). With that, the network structure is ready.
model = models.resnet18(pretrained=True)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 2)
However, a network structure and data are not enough to run the code; a loss function must also be defined. In PyTorch, the torch.nn module defines all the layers of a network, such as convolution, downsampling and loss layers. The cross entropy loss is used here, defined as follows:
criterion = nn.CrossEntropyLoss()
Next, define the optimizer, for example the most common one, stochastic gradient descent, implemented in PyTorch's torch.optim module. Because a momentum term is given, this is SGD with momentum (it is not Adam, which additionally adapts the step size per parameter). The inputs of the class are the parameters to optimize, model.parameters(), the learning rate, and the momentum coefficient. Most optimizers are defined in this same way.
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
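The update that SGD with momentum performs can be written out by hand. The sketch below follows the rule given in the torch.optim.SGD documentation for the default settings (no dampening, no weight decay, no Nesterov): the velocity is v = momentum * v + grad and the parameter moves by -lr * v. It runs on a single scalar with a made-up loss, f(p) = p^2, purely for illustration; sgd_momentum_step is a hypothetical helper.

```python
def sgd_momentum_step(p, v, grad, lr=0.001, momentum=0.9):
    """One SGD-with-momentum update for a scalar parameter."""
    v = momentum * v + grad   # accumulate velocity
    p = p - lr * v            # step against the velocity
    return p, v

p, v = 1.0, 0.0
history = []
for _ in range(3):
    grad = 2.0 * p            # gradient of f(p) = p**2
    p, v = sgd_momentum_step(p, v, grad)
    history.append(p)

print(history)
```

Each step reuses part of the previous velocity, so successive steps in the same direction grow larger than they would with plain SGD.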
Next, a learning rate schedule is usually defined. The StepLR class of the torch.optim.lr_scheduler module is used here; it multiplies the learning rate by gamma every step_size epochs.
scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
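For StepLR the learning rate has a closed form: lr = base_lr * gamma ** (epoch // step_size). The plain-Python sketch below evaluates it for the settings used here (base learning rate 0.001, step_size=7, gamma=0.1); step_lr is a hypothetical helper, not a PyTorch function.

```python
def step_lr(base_lr, epoch, step_size=7, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under StepLR."""
    return base_lr * gamma ** (epoch // step_size)

# The rate stays flat within each block of 7 epochs, then drops tenfold.
for epoch in (0, 6, 7, 13, 14):
    print(epoch, step_lr(0.001, epoch))
```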
The preparations are finally finished. It's time to start training.
First, the learning rate is updated. Because a learning rate schedule has been defined, it is stepped once per epoch. The original code does this at the beginning of each epoch; since PyTorch 1.1.0, scheduler.step() should instead be called after optimizer.step(), that is, at the end of each epoch:
scheduler.step()
Then set the model status to training status:
model.train(True)
Then set all gradients in the network to 0:
optimizer.zero_grad()
Then there is the forward propagation of the network:
outputs = model(inputs)
Then take the output and the imported labels as the input of the loss function to get the loss:
loss = criterion(outputs, labels)
The outputs are also in torch.autograd.Variable format. Given the outputs (the output of the network's fully connected layer), we also want to know which category the model predicts for each sample, and torch.max is used for that. The first argument of torch.max() here is a Tensor, so outputs.data is passed instead of outputs; the second argument, 1, is dim, meaning the maximum is taken over each row, and the returned index is the category with the highest probability. The loss is also in torch.autograd.Variable format.
_, preds = torch.max(outputs.data, 1)
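A small example makes the two return values of torch.max concrete. With dim=1 it returns the largest value in each row together with its column index, and the index is what serves as the predicted class. The scores and labels below are made up.

```python
import torch

# Fake network outputs for 3 samples and 2 classes.
outputs = torch.tensor([[0.2, 1.5],
                        [2.0, -1.0],
                        [0.1, 0.3]])

values, preds = torch.max(outputs, 1)  # max over each row (dim=1)
print(values)
print(preds)

# Count how many predictions match the (made-up) ground truth.
labels = torch.tensor([1, 0, 0])
num_correct = torch.sum(preds == labels).item()
print(num_correct)
```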
After the loss is computed, it is backpropagated. Note that this step happens only during training; testing uses only the forward pass.
loss.backward()
Backpropagating the loss computes the gradients, and the parameters are then updated from these gradients with optimizer.step(). After optimizer.step(), the weights and gradients of each layer can be inspected in optimizer.param_groups[0]['params'].
optimizer.step()
That completes training on one batch! Repeating this process over and over eventually gets you the results you want.
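Put together, the steps above form the loop sketched below. To keep it runnable without a dataset or pretrained weights, a tiny linear model and random tensors stand in for ResNet18 and the image loader; those stand-ins, and the choice of two epochs, are assumptions for illustration. Following current PyTorch guidance, scheduler.step() is called at the end of each epoch here.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins: random 12-feature "images" and 2 classes.
dataset = TensorDataset(torch.randn(16, 12), torch.randint(0, 2, (16,)))
loader = DataLoader(dataset, batch_size=4, shuffle=True)
model = nn.Linear(12, 2).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

epoch_losses = []
for epoch in range(2):
    model.train()                      # training mode
    running_loss, running_corrects = 0.0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()          # reset gradients
        outputs = model(inputs)        # forward pass
        loss = criterion(outputs, labels)
        _, preds = torch.max(outputs, 1)
        loss.backward()                # backpropagation
        optimizer.step()               # parameter update
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels).item()
    scheduler.step()                   # once per epoch, after the updates
    epoch_losses.append(running_loss / len(dataset))
    print(epoch, epoch_losses[-1], running_corrects / len(dataset))
```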
In addition, if you have a GPU available, your data and model can run on the GPU, which is also very simple in PyTorch. The line below checks whether a GPU is available; if so, use_gpu is True.
use_gpu = torch.cuda.is_available()
For the complete code, see: GitHub