Convolutional Neural Networks

The lecture slides are available here.

The goals of this session are to practice **implementing a Convolutional Neural Network**, **understanding the computations involved**, and, more generally, to **build a full Deep Learning pipeline in PyTorch** to train a model on a given dataset. Some parts of this course are inspired by this great PyTorch tutorial.

We will be coding in **Python 3** and will use the **PyTorch** library.

To install PyTorch on your local machine, follow this link.

Convolutional Neural Networks (CNNs) are powerful neural models that take advantage of the convolution operator to learn to extract information from raw images. You can refer to your lecture for more information about CNNs, as well as to many great online resources.

Why would we want to use a CNN when dealing with images? We could also use a simpler Multi-Layer Perceptron (MLP), where each neuron in a layer aggregates information from all neurons in the previous layer. An important notion in Machine Learning is that of the **inductive bias**, or **model prior**: the prior knowledge that you, as a designer, incorporate into the model you are building.

Take some time to think about what prior knowledge is incorporated into a CNN that is not in an MLP.

The assumption that data has an underlying spatial structure, known as the **Spatial Inductive Bias**, is built into CNNs. Indeed, the convolution operation aggregates information only from the local spatial neighborhood around the center of the filter. Models equipped with such an inductive bias are particularly well suited to extracting information from the pixels of an image.
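To make this bias concrete, we can compare parameter counts with a back-of-the-envelope calculation (the layer sizes below are the ones used later in this session): a 5x5 convolution mapping 3 channels to 20 reuses the same filters at every spatial position, whereas a fully connected layer producing an output of the same size needs one weight per input-output pair.

```python
# Conv2d(3, 20, kernel_size=5): the same filters are shared across all positions
conv_params = 20 * 3 * 5 * 5 + 20  # weights + biases
print(conv_params)  # 1520

# A fully connected layer mapping the 3x32x32 input to a 20x28x28 output
# needs a weight for every (input pixel, output unit) pair
fc_params = (3 * 32 * 32) * (20 * 28 * 28) + 20 * 28 * 28  # weights + biases
print(fc_params)  # 48184640
```

Weight sharing reduces the parameter count by more than four orders of magnitude here, while also encoding translation equivariance.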

The CIFAR-10 dataset is a well-known dataset of RGB images. It is composed of **60000 32x32 colour images** labelled as belonging to one of **10 classes**. You can find **6000 images per class**. There are **50000 training images** and **10000 test images**. If you are looking for a dataset with more classes, you can look at CIFAR-100 that, as indicated by its name, contains 100 semantic classes.

The following PyTorch code takes advantage of **torchvision**. It allows you to create three datasets (train, val, test) and to apply a normalization to the images. The names of the 10 classes in the CIFAR-10 dataset are also given for you to use later.

```
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

# Split the train set into a train set and a validation set
train_size = int(0.75 * len(train_set))
valid_size = len(train_set) - train_size
train_set, valid_set = torch.utils.data.random_split(train_set, [train_size, valid_size])

# Ground-truth classes in the CIFAR-10 dataset
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
```

It is important to visualise the data you will be working on. Moreover, when training and evaluating your model, you will need to load data from your training, validation and test sets respectively. To do so, we will use the **DataLoader** class from **torch.utils.data**.

Start by **implementing 3 dataloaders** for your training, validation and test sets.

```
# Define your batch size
batch_size = 4

# Training dataloader: we want to shuffle the samples between epochs
training_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=0)
# Validation dataloader: no need to shuffle
valid_dataloader = torch.utils.data.DataLoader(valid_set, batch_size=batch_size, shuffle=False, num_workers=0)
# Test dataloader: no need to shuffle
test_dataloader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=0)
```

Now, try to **create an iterator** to go through your training dataloader, and **get the next batch**.

```
dataiter = iter(training_dataloader)
# Note: the old dataiter.next() method was removed in recent PyTorch versions;
# use the built-in next() instead
images, labels = next(dataiter)
```

Finally, write a function that takes a batch of image tensors as input and displays them, along with their associated labels. You can use **matplotlib.pyplot** to do so.

```
import matplotlib.pyplot as plt
import numpy as np

def imshow(img):
    img = img / 2 + 0.5  # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(batch_size)))
```

```
def process_img(img):
    img = img / 2 + 0.5  # unnormalize
    npimg = img.numpy()
    npimg = np.transpose(npimg, (1, 2, 0))
    return npimg

def imshow_batch(imgs, labels, classes):
    # Get the batch size
    bs = imgs.shape[0]
    # Create a Matplotlib figure with batch_size subplots
    fig, axs = plt.subplots(1, bs)
    for i in range(bs):
        # Show the image
        axs[i].imshow(process_img(imgs[i]))
        # Remove the axis legend
        axs[i].axis('off')
        # Add the GT class of the image as the title of the subplot
        axs[i].title.set_text(classes[labels[i]])
    plt.show()

imshow_batch(images, labels, classes)
```

It is now time to build a CNN! Write a class inheriting from **torch.nn.Module**. Be careful with the dimensions of the input tensors and the dimensions of your desired output.
Then, you can play with different hyperparameters, such as the number of layers, and the hyperparameters of the **torch.nn.Conv2d** layer (number of output channels, kernel size, stride, padding, etc.).

```
import torch.nn.functional as F

# This is the base LeNet architecture you saw in the lecture, adapted to our input and output dimensions
class LeNet(torch.nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        # 3 input channels, 20 output channels,
        # 5x5 filters, stride=1, no padding
        self.conv1 = torch.nn.Conv2d(3, 20, 5, 1, 0)
        self.conv2 = torch.nn.Conv2d(20, 50, 5, 1, 0)
        self.fc1 = torch.nn.Linear(5*5*50, 500)
        self.fc2 = torch.nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        # Max pooling with a filter size of 2x2
        # and a stride of 2
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 5*5*50)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = LeNet()
print('in: ', images.shape)   # in:  torch.Size([4, 3, 32, 32])
out = model(images)
print('out: ', out.shape)     # out: torch.Size([4, 10])
```

It is important to understand the convolution operations going on inside your neural network.

Let’s denote the size of input and output tensors of a convolution layer along axis $x$ as $I_x$ and $O_x$, and the respective kernel size, padding and stride as $K_x$, $P_x$ and $S_x$. **Write the formula that gives you $O_x$ as a function of $I_x$, $K_x$, $P_x$ and $S_x$.**

**IMPORTANT: Before moving on to the next step, call me to present and possibly discuss this result!**

Let’s denote the size of the input and output tensors along axis $x$ as $I_x$ and $O_x$, and the respective kernel size, padding and stride as $K_x$, $P_x$ and $S_x$. We have

\[O_x = \left\lfloor \frac{I_x - K_x + 2P_x}{S_x} \right\rfloor + 1\]

(the floor accounts for the case where $S_x$ does not evenly divide $I_x - K_x + 2P_x$). For an input tensor with shape $(N_{in}, I_x, I_y)$, where $N_{in}$ is the number of input channels, the output tensor of a convolution layer with $N_{out}$ filters will have the following shape:

\(\left(N_{out},\ \left\lfloor \frac{I_x - K_x + 2P_x}{S_x} \right\rfloor + 1,\ \left\lfloor \frac{I_y - K_y + 2P_y}{S_y} \right\rfloor + 1\right)\)
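As a sanity check, we can trace this formula through the LeNet defined above (5x5 convolutions with stride 1 and no padding, then 2x2 max pooling with stride 2, which follows the same sliding-window formula). This also explains the `5*5*50` input size of `fc1`, and requires only plain arithmetic:

```python
def conv_out(i, k, p=0, s=1):
    # O = floor((I - K + 2P) / S) + 1
    return (i - k + 2 * p) // s + 1

x = 32                     # CIFAR-10 images are 32x32
x = conv_out(x, k=5)       # conv1: 5x5 filters, stride 1, no padding -> 28
x = conv_out(x, k=2, s=2)  # 2x2 max pooling, stride 2 -> 14
x = conv_out(x, k=5)       # conv2 -> 10
x = conv_out(x, k=2, s=2)  # 2x2 max pooling -> 5
print(x, 5 * 5 * 50)       # 5 1250: conv2 outputs 50 channels, so fc1 takes 5*5*50 inputs
```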

Now, you can write a function that takes as input a tensor shape, as well as the hyperparameters of the layer and outputs the size of the output tensor. You can then check your function is correct by comparing its returned value with the real shapes of tensors within the forward pass of your neural network.

```
def compute_output_shape_conv(input_shape=torch.Size([4, 3, 32, 32]), kernel_size=(3, 3), stride=(1, 1), padding=(0, 0), n_out=20):
    assert len(input_shape) == 4, 'input shape should be (B, C, H, W)'
    assert len(kernel_size) == 2 and len(stride) == 2 and len(padding) == 2, 'all conv hyperparameters should be defined along both x and y axes'
    I_x = input_shape[2]
    I_y = input_shape[3]
    out = []
    for i, I in enumerate([I_x, I_y]):
        # Floor division implements the floor in the formula
        O = 1 + (I - kernel_size[i] + 2*padding[i]) // stride[i]
        out.append(int(O))
    return torch.Size([input_shape[0], n_out, out[0], out[1]])

class LeNet(torch.nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        # 3 input channels, 20 output channels,
        # 5x5 filters, stride=1, no padding
        self.conv1 = torch.nn.Conv2d(3, 20, 5, 1, 0)
        self.conv2 = torch.nn.Conv2d(20, 50, 5, 1, 0)
        self.fc1 = torch.nn.Linear(5*5*50, 500)
        self.fc2 = torch.nn.Linear(500, 10)

    def forward(self, x):
        out_shape_conv1 = compute_output_shape_conv(x.shape, (5, 5), (1, 1), (0, 0), 20)
        x = F.relu(self.conv1(x))
        assert x.shape == out_shape_conv1
        # Max pooling with a filter size of 2x2
        # and a stride of 2
        x = F.max_pool2d(x, 2, 2)
        out_shape_conv2 = compute_output_shape_conv(x.shape, (5, 5), (1, 1), (0, 0), 50)
        x = F.relu(self.conv2(x))
        assert x.shape == out_shape_conv2
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 5*5*50)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```

The next step is to define a **loss function** that is suited to the problem you want to solve, in our case **multi-class classification**. Then you have to choose an **optimizer**. You are encouraged to try different ones to compare them. You can also study the impact of different hyperparameters of the optimizer (learning rate, momentum, etc.)

```
import torch.optim as optim
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
```
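For instance, to compare optimizers you could swap the SGD optimizer above for Adam, which adapts the learning rate per parameter. This is only a sketch on a small stand-in module, and `lr=1e-3` is a common default rather than a tuned value:

```python
import torch
import torch.optim as optim

# Small stand-in module; in the session you would pass your LeNet's parameters
model = torch.nn.Linear(10, 2)

# Adam keeps per-parameter running averages of gradients and squared gradients
optimizer = optim.Adam(model.parameters(), lr=1e-3)
```

The rest of the training loop (zero_grad / backward / step) is unchanged, which makes it easy to compare optimizers with everything else held fixed.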

It is now time to write the code for **training and validating your model**. You must iterate through your training data using your dataloader, and compute forward and backward passes on the given data batches.
Don’t forget to log your training and validation losses (the latter is mainly used to tune hyperparameters).

```
from tqdm import tqdm

epochs = 10
for epoch in range(epochs):
    logging_loss = 0.0
    for i, data in enumerate(training_dataloader):
        inputs, labels = data
        # Zero the parameter gradients
        optimizer.zero_grad()
        # Forward pass
        out = model(inputs)
        # Compute the loss
        loss = criterion(out, labels)
        # Backward pass - compute gradients
        loss.backward()
        # Optimizer step - model update
        optimizer.step()
        logging_loss += loss.item()
        if i % 2000 == 1999:
            # Log the training loss
            logging_loss /= 2000
            print('Training loss epoch ', epoch, ' -- mini-batch ', i, ': ', logging_loss)
            logging_loss = 0.0

    # Model validation
    with torch.no_grad():
        logging_loss_val = 0.0
        for data_val in tqdm(valid_dataloader):
            input_val, labels_val = data_val
            out_val = model(input_val)
            loss_val = criterion(out_val, labels_val)
            logging_loss_val += loss_val.item()
        logging_loss_val /= len(valid_dataloader)
        print('Validation loss: ', logging_loss_val)
```

A useful tool to visualize your training is **Tensorboard**. You can also have a look at solutions such as **Weights & Biases**, but we will focus on the simpler Tensorboard for now.
You can easily use Tensorboard with PyTorch through **torch.utils.tensorboard**.
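A minimal sketch of how this could look (assuming the `tensorboard` package is installed; the log directory name and the scalar tag are arbitrary choices, and the loss values are stand-ins for your real training losses):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='./runs/lenet_cifar10')  # arbitrary directory

# Inside your training loop, you would log one scalar per epoch, e.g.:
for epoch in range(3):
    fake_train_loss = 1.0 / (epoch + 1)  # stand-in value for illustration
    writer.add_scalar('loss/train', fake_train_loss, epoch)
writer.close()
```

You can then run `tensorboard --logdir runs` and open the printed URL in a browser to see the curves.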

Once training is completed, it can be useful to save the weights of your neural network to use it later. The following tutorial explains how you can do this. Now, try to save and then load your trained model.

```
path = './le_net_cifar10.pth'
# Saving the model
torch.save(model.state_dict(), path)

# Loading the model
trained_model = LeNet()
trained_model.load_state_dict(torch.load(path))
# For inference, you should put your model in eval mode
trained_model.eval()
```

You must now **evaluate the performance of your trained model** on the **test set**. To this end, you have to iterate through test samples, and perform forward passes on given data batches. You might want to compute the **test loss**, but also any **accuracy-related metrics** you are interested in. You could also **visualize some test samples** along with the **output distribution of your model**.
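As a starting point, the core accuracy computation can be sketched on synthetic logits (the values below are made up for illustration); on the real test set you would accumulate the counts inside a `torch.no_grad()` loop over `test_dataloader`:

```python
import torch

# Synthetic logits for a batch of 4 samples over 10 classes (stand-ins for model outputs)
logits = torch.tensor([
    [2.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # -> class 0 (correct)
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # -> class 1 (correct)
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # -> class 2 (wrong, GT is 3)
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0],  # -> class 9 (correct)
])
labels = torch.tensor([0, 1, 3, 9])

# The predicted class is the index of the largest logit
preds = logits.argmax(dim=1)
correct = (preds == labels).sum().item()
accuracy = correct / labels.size(0)
print(accuracy)  # 0.75
```

Note that the softmax is monotonic, so taking the argmax of the raw logits gives the same prediction as taking it on the output probabilities.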