# Section two

The main content of this post is two applications: linear regression (the Kaggle house price prediction competition) and convolutional neural networks.

## 1. Linear regression practice

The project consists of the following steps: read and preprocess the data, define the neural network, define the loss function and the evaluation metric (as required by the competition), define the training function, run k-fold cross validation, and finally train on all the data and predict the results.

### 1.1 data reading and preprocessing

The competition data is split into a training set and a test set. Both include features of each house, such as street type, year of construction, roof type, basement condition, and so on. The feature values may be continuous numbers, discrete labels, or even missing values ("NA"). We first read the data with pandas.

```
import pandas as pd

# The file paths below are placeholders; point them at the Kaggle csv files
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
# At this point the data we read is a pandas DataFrame
```

The second step is to see what the data tables look like:

```
train_data.shape  # Output: (1460, 81)
test_data.shape  # Output: (1459, 80)
train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]]  # Shows the first and last few columns as a DataFrame
```

The third step is to combine the features of the training set and the test set:

```
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
# The first column is the Id; the last column of the training set is the label
```

Finally, we standardize the data. Missing numerical features are replaced by the column mean, and non-numerical features are expanded into indicator (dummy) columns of 0s and 1s. Here I choose Z-score standardization, after which each column has mean 0 and standard deviation 1.
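Why filling missing values with 0 (after standardization) is the same as filling with the column mean can be checked with a tiny pure-Python sketch (toy numbers, not the competition data):

```python
# Toy feature column with one missing value; toy numbers, not the Kaggle data
values = [10.0, 20.0, 30.0, None]
observed = [v for v in values if v is not None]

mean = sum(observed) / len(observed)
std = (sum((v - mean) ** 2 for v in observed) / (len(observed) - 1)) ** 0.5

# Z-scores of the observed entries: their mean is exactly 0 after standardization
z = [(v - mean) / std for v in observed]

# So writing 0 into the standardized column for a missing entry
# is equivalent to imputing the (pre-standardization) column mean
filled = [(v - mean) / std if v is not None else 0.0 for v in values]
```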

```
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardization each column has mean 0, so filling missing values
# with 0 is the same as filling them with the column mean
all_features[numeric_features] = all_features[numeric_features].fillna(0)

# One-hot encode the discrete features; dummy_na=True also creates a column for missing values
all_features = pd.get_dummies(all_features, dummy_na=True)
```

### 1.2 transform pandas data into tensor data and separate training set and test set

```
import torch

n_train = train_data.shape[0]  # Number of training rows
n_features = all_features.shape[1]  # Number of feature columns
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float)
train_labels = torch.tensor(train_data.SalePrice.values, dtype=torch.float).view(-1, 1)
```

### 1.3 define neural network and loss function

In this exercise, the mean squared error (MSE) loss is used, and the optimizer is Adam (an improved variant of SGD that is not particularly sensitive to the learning rate).

```
from torch import nn

# Mean squared error loss
loss = nn.MSELoss()

# A simple linear model serves as the network here
def get_net():
    return nn.Linear(train_features.shape[1], 1)

# Evaluation metric: the competition scores the RMSE of the log prices
def log_rmse(net, features, labels):
    # Clamp predictions below 1 to 1, making the logarithm more stable
    clipped_preds = torch.max(net(features), torch.tensor(1.0))
    rmse = torch.sqrt(loss(clipped_preds.log(), labels.log()))
    return rmse.item()

# Training function (initialization + optimization): (net, data..., number of epochs,
# learning rate, weight decay, mini-batch size (each step uses some but not all data))
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    # Training and generalization error per epoch (for plotting)
    train_ls, test_ls = [], []
    dataset = torch.utils.data.TensorDataset(train_features, train_labels)
    train_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True)
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate,
                                 weight_decay=weight_decay)
    net = net.float()  # Make sure all parameters are float
    for epoch in range(num_epochs):
        for X, y in train_iter:
            l = loss(net(X.float()), y.float())
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
```

### 1.4 k-fold cross validation

In this method, the training set is split into k parts: one part serves as the validation set and the remaining k-1 parts serve as the training set. This lets us see how well the model generalizes while we train.
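The index bookkeeping behind this split can be sketched on its own (a toy illustration with n = 10 samples and k = 5 folds, not the real data):

```python
# Toy illustration: 10 samples split into k = 5 folds of size 2
k, n = 5, 10
fold_size = n // k
folds = [list(range(j * fold_size, (j + 1) * fold_size)) for j in range(k)]

i = 2  # fold i is held out for validation; the other k-1 folds are for training
valid_idx = folds[i]
train_idx = [idx for j in range(k) if j != i for idx in folds[j]]
```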

```
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l  # Provides the semilogy plotting helper

# k-fold cross validation
def get_k_fold_data(k, i, X, y):
    # Return the training and validation data for the i-th fold
    assert k > 1
    fold_size = X.shape[0] // k  # The data is split into k pieces of size fold_size
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat((X_train, X_part), dim=0)
            y_train = torch.cat((y_train, y_part), dim=0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse',
                         range(1, num_epochs + 1), valid_ls,
                         ['train', 'valid'])
        print('fold %d, train rmse %f, valid rmse %f' % (i, train_ls[-1], valid_ls[-1]))
    return train_l_sum / k, valid_l_sum / k

# The batch size is usually a power of 2
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr, weight_decay, batch_size)
#print('%d-fold validation: avg train rmse %f, avg valid rmse %f' % (k, train_l, valid_l))
```

### 1.5 model training and output results

Train on all the data and predict the output:

```
def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse')
    print('train rmse %f' % train_ls[-1])
    preds = net(test_features).detach().numpy()  # Predicted prices
    test_data['SalePrice'] = pd.Series(preds.reshape(-1))
    # pd.concat combines DataFrames, analogous to torch.cat() for tensors
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('./submission.csv', index=False)

k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_and_pred(train_features, test_features, train_labels, test_data, num_epochs, lr, weight_decay, batch_size)
```

## 2. Convolutional neural network (CNN)

When we do image classification, we usually first think of linear regression: for example, flattening a 1*28*28 image into a 1*784 vector. But this destroys the spatial structure of the image, turning it into ordinary one-dimensional data. So we need a way to extract image features locally: just as when people look at objects, each neuron is responsible for only part of the data. This is the idea behind convolutional neural networks, which capture the features of objects locally. A CNN is mainly composed of three kinds of layers: convolutional layers, pooling layers, and fully connected layers, followed by a softmax classification output. Below are several effective network models.
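What "destroying the spatial characteristics" means can be made concrete with a quick index calculation (pure Python, no framework needed): after row-by-row flattening, horizontally adjacent pixels stay next to each other, but vertically adjacent ones end up a whole row apart.

```python
H, W = 28, 28

def flat_index(r, c, width=W):
    # Position of pixel (r, c) after flattening the image row by row
    return r * width + c

# Horizontal neighbours remain adjacent in the flat vector...
gap_horizontal = flat_index(0, 6) - flat_index(0, 5)
# ...but vertical neighbours are separated by a full row of 28 positions,
# so a plain fully connected layer has no built-in notion of "above/below"
gap_vertical = flat_index(1, 5) - flat_index(0, 5)
```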

### 2.1 LeNet

LeNet is divided into a convolutional block and a fully connected block. The basic unit in the convolutional block is a convolutional layer followed by an average pooling layer: the convolutional layer recognizes spatial patterns in the image, such as lines and local objects, and the average pooling layer after it reduces the sensitivity of the convolutional layer to location.

The convolutional block consists of two such basic units stacked. Each convolutional layer uses a 5*5 window, with a sigmoid activation on its output. The first convolutional layer has 6 output channels, and the second increases this to 16.

The fully connected block consists of three fully connected layers, with 120, 84, and 10 outputs respectively, where 10 is the number of output classes.

#### 2.1.1 defining the neural network

```
class Flatten(torch.nn.Module):  # Flattening operation
    def forward(self, x):
        return x.view(x.shape[0], -1)

class Reshape(torch.nn.Module):  # Reshape image size
    def forward(self, x):
        return x.view(-1, 1, 28, 28)  # (B x C x H x W)

net = torch.nn.Sequential(  # LeNet
    Reshape(),
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2),  # 5*5 convolution layer, padding 2, stride 1 by default
    nn.Sigmoid(),  # Activation function
    nn.AvgPool2d(kernel_size=2, stride=2),  # Pooling layer
    nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5),
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    Flatten(),
    nn.Linear(in_features=16*5*5, out_features=120),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)
```

#### 2.1.2 example display

Let's implement the LeNet model, again using Fashion-MNIST as the training dataset.

```
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
import torch
import torch.nn as nn
import torch.optim as optim
import time
import matplotlib.pyplot as plt

# LeNet network definition
class Flatten(torch.nn.Module):  # Flattening operation
    def forward(self, x):
        return x.view(x.shape[0], -1)

class Reshape(torch.nn.Module):  # Reshape image size
    def forward(self, x):
        return x.view(-1, 1, 28, 28)  # (B x C x H x W)

net = torch.nn.Sequential(  # LeNet
    Reshape(),
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2, stride=1),  # b*1*28*28 => b*6*28*28
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),  # b*6*28*28 => b*6*14*14
    nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5),  # b*6*14*14 => b*16*10*10
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),  # b*16*10*10 => b*16*5*5
    Flatten(),  # b*16*5*5 => b*400
    nn.Linear(in_features=16 * 5 * 5, out_features=120),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)

def try_gpu():
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu')
    return device

device = try_gpu()

batch_size = 256  # Batch size
train_iter, test_iter = d2l.load_data_fashion_mnist(
    batch_size=batch_size, root='/home/kesci/input/FashionMNIST2065')
print(len(train_iter))

# Data display
def show_fashion_mnist(images, labels):
    d2l.use_svg_display()
    # The underscore "_" means we ignore (do not use) that variable
    _, figs = plt.subplots(1, len(images), figsize=(12, 12))
    for f, img, lbl in zip(figs, images, labels):
        f.imshow(img.view((28, 28)).numpy())
        f.set_title(lbl)
        f.axes.get_xaxis().set_visible(False)
        f.axes.get_yaxis().set_visible(False)
    plt.show()

for Xdata, ylabel in train_iter:
    break
X, y = [], []
for i in range(10):
    print(Xdata[i].shape, ylabel[i].numpy())
    X.append(Xdata[i])  # Add the i-th feature to X
    y.append(ylabel[i].numpy())  # Add the i-th label to y
show_fashion_mnist(X, y)

# Computing accuracy
'''
(1). net.train()
    Enables BatchNormalization and Dropout (training mode)
(2). net.eval()
    Disables BatchNormalization and Dropout (evaluation mode)
'''

def evaluate_accuracy(data_iter, net, device=torch.device('cpu')):
    """Evaluate accuracy of a model on the given data set."""
    acc_sum, n = torch.tensor([0], dtype=torch.float32, device=device), 0
    for X, y in data_iter:
        X, y = X.to(device), y.to(device)
        net.eval()
        y = y.long()
        acc_sum += torch.sum((torch.argmax(net(X), dim=1) == y))  # e.g. [[0.2,0.4,0.5,0.6,0.8],[0.1,0.2,0.4,0.3,0.1]] => [4, 2]
        n += y.shape[0]
    return acc_sum.item() / n

def train_ch5(net, train_iter, test_iter, criterion, num_epochs, batch_size, device, lr=None):
    net.to(device)
    optimizer = optim.SGD(net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        train_l_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
        train_acc_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
        n, start = 0, time.time()
        for X, y in train_iter:
            net.train()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            loss = criterion(y_hat, y)
            loss.backward()
            optimizer.step()

            y = y.long()
            train_l_sum += loss.float()
            train_acc_sum += (torch.sum((torch.argmax(y_hat, dim=1) == y))).float()
            n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net, device)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, '
              'time %.1f sec'
              % (epoch + 1, train_l_sum / n, train_acc_sum / n, test_acc,
                 time.time() - start))

# Train
lr, num_epochs = 0.9, 10

def init_weights(m):
    if type(m) == nn.Linear or type(m) == nn.Conv2d:
        torch.nn.init.xavier_uniform_(m.weight)

net.apply(init_weights)
net = net.to(device)

# Cross entropy measures the distance between two probability distributions:
# the smaller the cross entropy, the closer they are
criterion = nn.CrossEntropyLoss()
train_ch5(net, train_iter, test_iter, criterion, num_epochs, batch_size, device, lr)
```

This was run on a local CPU environment; each training epoch is slow, but the accuracy is good.

### 2.2 AlexNet

LeNet does not always work well on large real datasets:
1. Neural network computation is expensive.
2. Parameter initialization and non-convex optimization were not yet well studied.

AlexNet features:

1. 8 layers of transformations: 5 convolutional layers, 2 fully connected hidden layers, and 1 fully connected output layer.
2. The sigmoid activation function is replaced by the simpler ReLU activation function.
3. Dropout is used to control the model complexity of the fully connected layers.
4. Data augmentation is introduced, such as flipping, cropping, and color changes, to enlarge the dataset and mitigate overfitting.
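As a minimal sketch of point 4, flip and crop can be written directly on a nested-list "image" (in real code one would use torchvision.transforms such as RandomHorizontalFlip or RandomResizedCrop; the helpers below are purely illustrative):

```python
def hflip(img):
    # Horizontal flip: reverse every row
    return [row[::-1] for row in img]

def crop(img, top, left, h, w):
    # Cut out an h x w window whose top-left corner is (top, left)
    return [row[left:left + w] for row in img[top:top + h]]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

flipped = hflip(img)
patch = crop(img, 1, 1, 2, 2)
```

Each transformed copy is a slightly different view of the same labeled example, which is why augmentation acts like enlarging the dataset.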

```
import time
import torch
from torch import nn, optim
import torchvision
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
import os
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class AlexNet(nn.Module):
    def __init__(self):
        super(AlexNet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 96, 11, 4),  # in_channels, out_channels, kernel_size, stride
            nn.ReLU(),
            nn.MaxPool2d(3, 2),  # kernel_size, stride
            # Shrink the convolution window; padding 2 keeps the input and output
            # height/width equal, and the number of output channels grows
            nn.Conv2d(96, 256, 5, 1, 2),
            nn.ReLU(),
            nn.MaxPool2d(3, 2),
            # Three consecutive convolution layers with even smaller windows.
            # Except for the last one, the number of output channels keeps growing.
            # No pooling after the first two, so height and width are preserved
            nn.Conv2d(256, 384, 3, 1, 1),
            nn.ReLU(),
            nn.Conv2d(384, 384, 3, 1, 1),
            nn.ReLU(),
            nn.Conv2d(384, 256, 3, 1, 1),
            nn.ReLU(),
            nn.MaxPool2d(3, 2)
        )
        # The fully connected part here is several times larger than in LeNet;
        # dropout layers are used to mitigate overfitting
        self.fc = nn.Sequential(
            nn.Linear(256*5*5, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            # The network is simplified because it runs on a CPU;
            # on a GPU this extra layer can be added back
            #nn.Linear(4096, 4096),
            #nn.ReLU(),
            #nn.Dropout(0.5),

            # Output layer. Fashion-MNIST has 10 classes, not the 1000 of the paper
            nn.Linear(4096, 10),
        )

    def forward(self, img):
        feature = self.conv(img)
        output = self.fc(feature.view(img.shape[0], -1))
        return output

net = AlexNet()

def load_data_fashion_mnist(batch_size, resize=None, root='/home/kesci/input/FashionMNIST2065'):
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())

    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=2)

    return train_iter, test_iter

batch_size = 16
# If "out of memory" appears, reduce batch_size or the resize target
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=224)
for X, Y in train_iter:
    print('X =', X.shape, '\nY =', Y.type(torch.int32))
    break

lr, num_epochs = 0.001, 3
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
```

### 2.3 VGG11

VGG builds a deep model by repeating simple base blocks. Each block consists of:
1. several identical 3*3 convolution layers with padding 1, followed by
2. one 2*2 max pooling layer with stride 2.
The convolution layers keep the height and width of the input unchanged, while the pooling layer halves them.

```
import time
import torch
from torch import nn, optim
import torchvision
import numpy as np
import sys
sys.path.append("/home/jiahui/PycharmProjects/deep_learning")
from deep_learning.d2lzh_pytorch import utils as d2l

def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    # root is a placeholder; point it at the Fashion-MNIST download directory
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())

    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=2)

    return train_iter, test_iter

def vgg_block(num_convs, in_channels, out_channels):  # Number of conv layers, input channels, output channels
    blk = []
    for i in range(num_convs):
        if i == 0:
            blk.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        else:
            blk.append(nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1))
        blk.append(nn.ReLU())
    blk.append(nn.MaxPool2d(kernel_size=2, stride=2))  # Halves the width and height
    return nn.Sequential(*blk)

conv_arch = ((1, 1, 64), (1, 64, 128), (2, 128, 256), (2, 256, 512), (2, 512, 512))
# After five vgg_blocks the width and height have been halved five times: 224 / 32 = 7
fc_features = 512 * 7 * 7  # c * w * h
fc_hidden_units = 4096  # An arbitrary choice

def vgg(conv_arch, fc_features, fc_hidden_units=4096):
    net = nn.Sequential()
    # Convolutional part
    for i, (num_convs, in_channels, out_channels) in enumerate(conv_arch):
        # Each vgg_block halves the width and height
        net.add_module("vgg_block_" + str(i+1), vgg_block(num_convs, in_channels, out_channels))
    # Fully connected part
    net.add_module("fc", nn.Sequential(
        d2l.FlattenLayer(),
        nn.Linear(fc_features, fc_hidden_units),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(fc_hidden_units, fc_hidden_units),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(fc_hidden_units, 10)
    ))
    return net

net = vgg(conv_arch, fc_features, fc_hidden_units)
X = torch.rand(1, 1, 224, 224)

# named_children returns the first-level submodules and their names
# (named_modules would return all submodules, including submodules of submodules)
for name, blk in net.named_children():
    X = blk(X)
    print(name, 'output shape: ', X.shape)

ratio = 8
small_conv_arch = [(1, 1, 64//ratio), (1, 64//ratio, 128//ratio), (2, 128//ratio, 256//ratio),
(2, 256//ratio, 512//ratio), (2, 512//ratio, 512//ratio)]
net = vgg(small_conv_arch, fc_features // ratio, fc_hidden_units // ratio)
print(net)

batch_size = 16
# If "out of memory" appears, reduce batch_size or the resize target
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=224)
for X, Y in train_iter:
    print('X =', X.shape, '\nY =', Y.type(torch.int32))
    break

lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
```

### 2.4 NiN

LeNet, AlexNet, and VGG all first extract spatial features with a module made of convolutional layers, and then output classification results with a module made of fully connected layers.
NiN instead stacks several small networks, each made of a convolutional layer and "fully-connected-like" 1*1 convolution layers, to build a deep network.
The number of output channels equals the number of label classes; a global average pooling layer then averages all elements in each channel and classifies directly. What a 1*1 convolution kernel does:
1. Channel count: the number of channels can be reduced by controlling the number of convolution kernels.
2. Added nonlinearity. A 1*1 convolution is equivalent to the computation of a fully connected layer applied at each position, followed by a nonlinear activation, so it increases the nonlinearity of the network.
3. Few parameters to compute.
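Point 3 can be checked by counting weights (toy channel sizes, bias terms ignored): a convolution layer holds in_channels * out_channels * k * k weights, shared over all spatial positions.

```python
def conv_weight_count(in_c, out_c, k):
    # Number of weights in a k x k convolution layer (bias ignored);
    # the same weights are reused at every spatial position
    return in_c * out_c * k * k

# Reducing 256 channels to 64 with a 1 x 1 kernel vs a 3 x 3 kernel
p_1x1 = conv_weight_count(256, 64, 1)
p_3x3 = conv_weight_count(256, 64, 3)
```

The 1*1 kernel needs 9 times fewer weights than a 3*3 kernel for the same channel reduction.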

```
import time
import torch
from torch import nn, optim
import torchvision
import numpy as np
import sys
sys.path.append("/home/jiahui/PycharmProjects/deep_learning")
from deep_learning.d2lzh_pytorch import utils as d2l
import torch.nn.functional as F

def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    # root is a placeholder; point it at the Fashion-MNIST download directory
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())

    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=2)

    return train_iter, test_iter

def nin_block(in_channels, out_channels, kernel_size, stride, padding):
    blk = nn.Sequential(nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
                        nn.ReLU(),
                        nn.Conv2d(out_channels, out_channels, kernel_size=1),
                        nn.ReLU(),
                        nn.Conv2d(out_channels, out_channels, kernel_size=1),
                        nn.ReLU())
    return blk

class GlobalAvgPool2d(nn.Module):
    # Global average pooling: set the pooling window shape to the input height and width
    def __init__(self):
        super(GlobalAvgPool2d, self).__init__()
    def forward(self, x):
        return F.avg_pool2d(x, kernel_size=x.size()[2:])

net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, stride=4, padding=0),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nin_block(96, 256, kernel_size=5, stride=1, padding=2),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nin_block(256, 384, kernel_size=3, stride=1, padding=1),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Dropout(0.5),
    # The number of label classes is 10
    nin_block(384, 10, kernel_size=3, stride=1, padding=1),
    GlobalAvgPool2d(),
    # Convert the four-dimensional output to two dimensions with shape (batch size, 10)
    d2l.FlattenLayer())

X = torch.rand(1, 1, 224, 224)
for name, blk in net.named_children():
    X = blk(X)
    print(name, 'output shape: ', X.shape)

batch_size = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=224)
for X, Y in train_iter:
    print('X =', X.shape, '\nY =', Y.type(torch.int32))
    break

lr, num_epochs = 0.002, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
```

### 2.5 GoogLeNet

1. It is composed of Inception basic blocks.
2. The Inception block is equivalent to a sub-network with four parallel lines. It extracts information through convolution layers and max pooling layers with different window shapes, and its 1*1 convolution layers reduce the number of channels, lowering the complexity of the model.
3. The tunable hyperparameters are the numbers of output channels of each layer, which is how the complexity of the model is controlled.

Complete model structure:

```
import time
import torch
from torch import nn, optim
import torchvision
import numpy as np
import sys
sys.path.append("/home/jiahui/PycharmProjects/deep_learning")
from deep_learning.d2lzh_pytorch import utils as d2l
import os
import torch.nn.functional as F

def load_data_fashion_mnist(batch_size, resize=None, root='~/Datasets/FashionMNIST'):
    # root is a placeholder; point it at the Fashion-MNIST download directory
    trans = []
    if resize:
        trans.append(torchvision.transforms.Resize(size=resize))
    trans.append(torchvision.transforms.ToTensor())

    transform = torchvision.transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(root=root, train=True, download=True, transform=transform)
    mnist_test = torchvision.datasets.FashionMNIST(root=root, train=False, download=True, transform=transform)
    train_iter = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=2)
    test_iter = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False, num_workers=2)

    return train_iter, test_iter

class Inception(nn.Module):
    # c1 - c4 are the numbers of output channels of the layers in each line
    def __init__(self, in_c, c1, c2, c3, c4):
        super(Inception, self).__init__()
        # Line 1: a single 1 x 1 convolution layer
        self.p1_1 = nn.Conv2d(in_c, c1, kernel_size=1)
        # Line 2: a 1 x 1 convolution layer followed by a 3 x 3 convolution layer
        self.p2_1 = nn.Conv2d(in_c, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        # Line 3: a 1 x 1 convolution layer followed by a 5 x 5 convolution layer
        self.p3_1 = nn.Conv2d(in_c, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        # Line 4: a 3 x 3 max pooling layer followed by a 1 x 1 convolution layer
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_c, c4, kernel_size=1)

    def forward(self, x):
        p1 = F.relu(self.p1_1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        return torch.cat((p1, p2, p3, p4), dim=1)  # Concatenate along the channel dimension

b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b2 = nn.Sequential(nn.Conv2d(64, 64, kernel_size=1),
                   nn.Conv2d(64, 192, kernel_size=3, padding=1),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b3 = nn.Sequential(Inception(192, 64, (96, 128), (16, 32), 32),
                   Inception(256, 128, (128, 192), (32, 96), 64),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b4 = nn.Sequential(Inception(480, 192, (96, 208), (16, 48), 64),
                   Inception(512, 160, (112, 224), (24, 64), 64),
                   Inception(512, 128, (128, 256), (24, 64), 64),
                   Inception(512, 112, (144, 288), (32, 64), 64),
                   Inception(528, 256, (160, 320), (32, 128), 128),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

b5 = nn.Sequential(Inception(832, 256, (160, 320), (32, 128), 128),
                   Inception(832, 384, (192, 384), (48, 128), 128),
                   d2l.GlobalAvgPool2d())

net = nn.Sequential(b1, b2, b3, b4, b5,
                    d2l.FlattenLayer(), nn.Linear(1024, 10))

X = torch.rand(1, 1, 96, 96)

for blk in net.children():
    X = blk(X)
    print('output shape: ', X.shape)

batch_size = 16
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=96)
for X, Y in train_iter:
    print('X =', X.shape, '\nY =', Y.type(torch.int32))
    break

lr, num_epochs = 0.001, 5
optimizer = torch.optim.Adam(net.parameters(), lr=lr)
d2l.train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
```