# Notes on fastai Book Ch. 4

ai
fastai
notes
pytorch
Chapter 4 covers broadcasting, stochastic gradient descent, the MNIST loss function, and the sigmoid activation function.
Published

March 14, 2022

## Tenacity and Deep Learning

• Deep learning practitioners need to be tenacious
• Only a handful of researchers kept trying to make neural networks work through the 1990s and 2000s.
• Academic papers on neural networks were rejected by top journals and conferences, despite showing dramatically better results than anything previously published
• Jürgen Schmidhuber
• pioneered many important ideas
• worked with his student Sepp Hochreiter on the long short-term memory (LSTM) architecture
• LSTMs are now widely used for speech recognition and other text modelling tasks
• Paul Werbos
• Invented backpropagation for neural networks in 1974
• considered the most important foundation of modern AI

## The Foundations of Computer Vision

• MNIST Database
• contains images of handwritten digits, collected by the National Institute of Standards and Technology
• created in 1998
• LeNet-5
• A convolutional neural network structure proposed by Yann LeCun and his colleagues
• Demonstrated the first practically useful recognition of handwritten digit sequences in 1998
• One of the most important breakthroughs in the history of AI

## Pixels

### MNIST_SAMPLE

• A sample of the famous MNIST dataset consisting of handwritten digits.
• contains training data for the digits 3 and 7
• images are in single-channel grayscale format
• already split into training and validation sets

from fastai.vision.all import *
from fastbook import *

matplotlib.rc('image', cmap='Greys')

print(URLs.MNIST_SAMPLE)
path = untar_data(URLs.MNIST_SAMPLE)
print(path)
https://s3.amazonaws.com/fast-ai-sample/mnist_sample.tgz
/home/innom-dt/.fastai/data/mnist_sample

# Set base path to mnist_sample directory
Path.BASE_PATH = path

# A custom fastai method that returns the contents of path as a list
path.ls()
(#3) [Path('labels.csv'),Path('train'),Path('valid')]

#### fastcore L Class

type(path.ls())
fastcore.foundation.L

(path/'train').ls()
(#2) [Path('train/3'),Path('train/7')]

threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
threes
(#6131) [Path('train/3/10.png'),Path('train/3/10000.png'),Path('train/3/10011.png'),Path('train/3/10031.png'),Path('train/3/10034.png'),Path('train/3/10042.png'),Path('train/3/10052.png'),Path('train/3/1007.png'),Path('train/3/10074.png'),Path('train/3/10091.png')...]

im3_path = threes[1]
print(im3_path)
im3 = Image.open(im3_path)
im3
/home/innom-dt/.fastai/data/mnist_sample/train/3/10000.png

### PIL Image Module

print(type(im3))
print(im3.size)
<class 'PIL.PngImagePlugin.PngImageFile'>
(28, 28)

# Slice of the image from index 4 up to, but not including, index 10
array(im3)[4:10,4:10]
array([[  0,   0,   0,   0,   0,   0],
[  0,   0,   0,   0,   0,  29],
[  0,   0,   0,  48, 166, 224],
[  0,  93, 244, 249, 253, 187],
[  0, 107, 253, 253, 230,  48],
[  0,   3,  20,  20,  15,   0]], dtype=uint8)

### NumPy Arrays and PyTorch Tensors

• NumPy
• the most widely used library for scientific and numeric programming in Python
• does not support using GPUs or calculating gradients
• Python is slow compared to many languages
• anything fast in Python is likely to be a wrapper for a compiled object written and optimized in another language like C
• NumPy arrays and PyTorch tensors can finish computations many thousands of times faster than pure Python
• NumPy array
• a multidimensional table of data
• all items are the same type
• can use any type, including arrays, for the array type
• simple types are stored as a compact C data structure in memory
• PyTorch tensor
• nearly identical to NumPy arrays
• can only use a single basic numeric type for all elements
• not as flexible as a genuine array of arrays
• must always be a regularly shaped multi-dimensional rectangular structure
• cannot be jagged
• supports using GPUs
• PyTorch can automatically calculate derivatives of operations performed with tensors
• impossible to do deep learning without this capability
• perform operations directly on arrays or tensors as much as possible instead of using loops

data = [[1,2,3],[4,5,6]]
arr = array (data)
tns = tensor(data)

arr  # numpy
array([[1, 2, 3],
[4, 5, 6]])

tns  # pytorch
tensor([[1, 2, 3],
[4, 5, 6]])

# select a row
tns[1]
tensor([4, 5, 6])

# select a column
tns[:,1]
tensor([2, 5])

# select a slice
tns[1,1:3]
tensor([5, 6])

# Perform element-wise addition
tns+1
tensor([[2, 3, 4],
[5, 6, 7]])

tns.type()
'torch.LongTensor'

# Perform element-wise multiplication
tns*1.5
tensor([[1.5000, 3.0000, 4.5000],
[6.0000, 7.5000, 9.0000]])

### NumPy Array Objects

#### numpy.array function

print(type(array(im3)[4:10,4:10]))
array
<class 'numpy.ndarray'>
<function numpy.array>

print(array(im3)[4:10,4:10][0].data)
print(array(im3)[4:10,4:10][0].dtype)
<memory at 0x7f3c13a20dc0>
uint8

### PyTorch Tensor

#### fastai tensor function

print(type(tensor(im3)[4:10,4:10][0]))
tensor
<class 'torch.Tensor'>
<function fastai.torch_core.tensor(x, *rest, dtype=None, device=None, requires_grad=False, pin_memory=False)>

print(tensor(im3)[4:10,4:10][0].data)
print(tensor(im3)[4:10,4:10][0].dtype)
tensor([0, 0, 0, 0, 0, 0], dtype=torch.uint8)
torch.uint8

### Pandas DataFrame

# Full Image
pd.DataFrame(tensor(im3))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 29 150 195 254 255 254 176 193 150 96 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 48 166 224 253 253 234 196 253 253 253 253 233 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 93 244 249 253 187 46 10 8 4 10 194 253 253 233 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 107 253 253 230 48 0 0 0 0 0 192 253 253 156 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 3 20 20 15 0 0 0 0 0 43 224 253 245 74 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 249 253 245 126 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 14 101 223 253 248 124 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 11 166 239 253 253 253 187 30 0 0 0 0 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0 0 0 16 248 250 253 253 253 253 232 213 111 2 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 43 98 98 208 253 253 253 253 187 22 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 51 119 253 253 253 76 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 183 253 253 139 0 0 0 0 0 0 0
17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 182 253 253 104 0 0 0 0 0 0 0
18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 85 249 253 253 36 0 0 0 0 0 0 0
19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 60 214 253 253 173 11 0 0 0 0 0 0 0
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 98 247 253 253 226 9 0 0 0 0 0 0 0 0
21 0 0 0 0 0 0 0 0 0 0 0 0 42 150 252 253 253 233 53 0 0 0 0 0 0 0 0 0
22 0 0 0 0 0 0 42 115 42 60 115 159 240 253 253 250 175 25 0 0 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 187 253 253 253 253 253 253 253 197 86 0 0 0 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 103 253 253 253 253 253 232 67 1 0 0 0 0 0 0 0 0 0 0 0 0 0
25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

tensor(im3).shape
torch.Size([28, 28])

im3_t = tensor(im3)
# Create a pandas DataFrame from image slice
df = pd.DataFrame(im3_t[4:15,4:22])
# Set defined CSS-properties to each <td> HTML element within the given subset.
# Color-code the values using a gradient
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 29 150 195 254 255 254 176 193 150 96 0 0 0
2 0 0 0 48 166 224 253 253 234 196 253 253 253 253 233 0 0 0
3 0 93 244 249 253 187 46 10 8 4 10 194 253 253 233 0 0 0
4 0 107 253 253 230 48 0 0 0 0 0 192 253 253 156 0 0 0
5 0 3 20 20 15 0 0 0 0 0 43 224 253 245 74 0 0 0
6 0 0 0 0 0 0 0 0 0 0 249 253 245 126 0 0 0 0
7 0 0 0 0 0 0 0 14 101 223 253 248 124 0 0 0 0 0
8 0 0 0 0 0 11 166 239 253 253 253 187 30 0 0 0 0 0
9 0 0 0 0 0 16 248 250 253 253 253 253 232 213 111 2 0 0
10 0 0 0 0 0 0 0 43 98 98 208 253 253 253 253 187 22 0

## Pixel Similarity

• Establish a baseline to compare against your model
• a simple model that you are confident should perform reasonably well
• should be simple to implement and easy to test
• helps indicate whether your super-fancy models are any good

### Method

• Calculate the average values for each pixel location across all images for each digit
• This will generate a blurry image of the target digit
• Compare the values for each pixel location in a new image to the average

# Store all images of the digit 7 in a list of tensors
seven_tensors = [tensor(Image.open(o)) for o in sevens]
# Store all images of the digit 3 in a list of tensors
three_tensors = [tensor(Image.open(o)) for o in threes]
len(three_tensors),len(seven_tensors)
(6131, 6265)

#### fastai show_image function

show_image(three_tensors[1]);

#### PyTorch Stack Function

# Stack all images for each digit into a single tensor
# and scale pixel values from the range [0,255] to [0,1]
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
stacked_threes.shape
torch.Size([6131, 28, 28])
# The rank is the number of axes, equivalent to len(stacked_threes.shape)
stacked_threes.ndim
3

# Calculate the mean values for each pixel location across all images of the digit 3
mean3 = stacked_threes.mean(0)
show_image(mean3);

# Calculate the mean values for each pixel location across all images of the digit 7
mean7 = stacked_sevens.mean(0)
show_image(mean7);

# Pick a single image to compare to the average
a_3 = stacked_threes[1]
show_image(a_3);

# Calculate the Mean Absolute Error between the single image and the mean pixel values
dist_3_abs = (a_3 - mean3).abs().mean()
# Calculate the Root Mean Squared Error between the single image and the mean pixel values
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt()
print(f"MAE: {dist_3_abs}")
print(f"RMSE: {dist_3_sqr}")
MAE: 0.11143654584884644
RMSE: 0.20208320021629333

dist_7_abs = (a_3 - mean7).abs().mean()
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
print(f"MAE: {dist_7_abs}")
print(f"RMSE: {dist_7_sqr}")
MAE: 0.15861910581588745
RMSE: 0.30210891366004944

Note: The error is larger when comparing the image of a 3 to the average pixel values for the digit 7
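
The difference between the two distance measures can be seen on a small example. The following is a minimal sketch in plain Python (no PyTorch), using made-up four-pixel "images" that are not taken from MNIST: both candidates have the same mean absolute error, but the root mean squared error penalizes the single large mistake more heavily.

```python
import math

def mae(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Hypothetical flattened "images" with the same total absolute error
target     = [0.0, 0.0, 0.0, 0.0]
spread_out = [0.5, 0.5, 0.5, 0.5]   # four moderate errors
one_big    = [2.0, 0.0, 0.0, 0.0]   # a single large error

print(mae(spread_out, target), rmse(spread_out, target))  # 0.5 0.5
print(mae(one_big, target), rmse(one_big, target))        # 0.5 1.0
```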

#### torch.nn.functional

F
<module 'torch.nn.functional' from '/home/innom-dt/miniconda3/envs/fastbook/lib/python3.9/site-packages/torch/nn/functional.py'>

#### PyTorch mse_loss function

# Calculate the Mean Absolute Error aka L1 norm
print(F.l1_loss(a_3.float(),mean7))
# Calculate the Root Mean Squared Error aka L2 norm
print(F.mse_loss(a_3,mean7).sqrt())
tensor(0.1586)
tensor(0.3021)

• Broadcasting: automatically expanding a tensor with a smaller rank to have the same size as one with a larger rank to perform an operation
• an important capability that makes tensor code much easier to write
• it does not actually create multiple copies of the smaller tensor
• PyTorch performs broadcast calculations in C on the CPU and CUDA on the GPU
• tens of thousands of times faster than pure Python
• up to millions of times faster on GPU
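
To make the idea concrete, here is a rough sketch of what a broadcast subtraction computes, written as explicit plain-Python loops. The tiny 2x2 "images" and the helper name `distance_to_mean` are made up for illustration. PyTorch performs the same arithmetic without the Python-level loops and without copying the smaller tensor.

```python
# Hypothetical data: a batch of three 2x2 images and one 2x2 mean image
batch = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
mean_img = [[1.0, 1.0], [1.0, 1.0]]

def distance_to_mean(batch, mean_img):
    """Mean absolute difference between each image in batch and mean_img.

    Gives the same numbers as (batch - mean_img).abs().mean((-1, -2)) would
    on tensors, where the rank-2 mean image is broadcast over the rank-3 batch.
    """
    n_pixels = len(mean_img) * len(mean_img[0])
    return [
        sum(abs(px - m)
            for row, m_row in zip(img, mean_img)
            for px, m in zip(row, m_row)) / n_pixels
        for img in batch
    ]

distances = distance_to_mean(batch, mean_img)
print(distances)  # [1.5, 1.0, 1.0]
```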

# Create tensors for the validation set for the digit 3
# and stack them into a single tensor
valid_3_tens = torch.stack([tensor(Image.open(o))
for o in (path/'valid'/'3').ls()])
# Scale pixel values from [0,255] to [0,1]
valid_3_tens = valid_3_tens.float()/255

# Create tensors for the validation set for the digit 7
# and stack them into a single tensor
valid_7_tens = torch.stack([tensor(Image.open(o))
for o in (path/'valid'/'7').ls()])
# Scale pixel values from [0,255] to [0,1]
valid_7_tens = valid_7_tens.float()/255

valid_3_tens.shape,valid_7_tens.shape
(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))

# Calculate Mean Absolute Error using broadcasting
# Subtraction operation is performed using broadcasting
# Absolute Value operation is performed elementwise
# Mean operation is performed over the values indexed by the height and width axes
def mnist_distance(a,b): return (a-b).abs().mean((-1,-2))
# Calculate MAE for two single images
mnist_distance(a_3, mean3)
tensor(0.1114)

# Calculate MAE between a single image and a vector of images
valid_3_dist = mnist_distance(valid_3_tens, mean3)
valid_3_dist, valid_3_dist.shape
(tensor([0.1422, 0.1230, 0.1055,  ..., 0.1244, 0.1188, 0.1103]),
torch.Size([1010]))

tensor([1,2,3]) + tensor([1,1,1])
tensor([2, 3, 4])

(valid_3_tens-mean3).shape
torch.Size([1010, 28, 28])

# An image is classified as a 3 if its MAE to the mean 3 is smaller than its MAE to the mean 7
def is_3(x): return mnist_distance(x,mean3) < mnist_distance(x,mean7)

is_3(a_3), is_3(a_3).float()
(tensor(True), tensor(1.))

is_3(valid_3_tens)
tensor([ True,  True,  True,  ..., False,  True,  True])

accuracy_3s =      is_3(valid_3_tens).float() .mean()
accuracy_7s = (1 - is_3(valid_7_tens).float()).mean()

accuracy_3s,accuracy_7s,(accuracy_3s+accuracy_7s)/2
(tensor(0.9168), tensor(0.9854), tensor(0.9511))

print(f"Correct 3s: {accuracy_3s * valid_3_tens.shape[0]:.0f}")
print(f"Incorrect 3s: {(1 - accuracy_3s) * valid_3_tens.shape[0]:.0f}")
Correct 3s: 926
Incorrect 3s: 84

print(f"Correct 7s: {accuracy_7s * valid_7_tens.shape[0]:.0f}")
print(f"Incorrect 7s: {(1 - accuracy_7s) * valid_7_tens.shape[0]:.0f}")
Correct 7s: 1013
Incorrect 7s: 15

• the key to having a model that can improve
• need to represent a task such that there are weight assignments that can be evaluated and updated
• Sample function:
• assign a weight value to each pixel location
• X is the image represented as a vector
• all of the rows are stacked up end to end into a single long line
• W contains the weights for each pixel

def pr_eight(x,w): return (x*w).sum()

def f(x): return x**2

#### plot_function

plot_function(f, 'x', 'x**2')

plot_function(f, 'x', 'x**2')
plt.scatter(-1.5, f(-1.5), color='red');

• the gradients tell us how much we need to change each weight to make our model better
• $$\frac{\text{rise}}{\text{run}} = \frac{\text{the change in the value of the function}}{\text{the change in the value of the parameter}}$$
• derivative of a function
• when we know how our function will change, we know how to make it smaller
• the key to machine learning
• PyTorch is able to automatically compute the derivative of nearly any function
• The gradient only tells us the slope of the function
• it does not indicate exactly how far to adjust the parameters
• if the slope is large, more adjustments may be required
• if the slope is small, we may be close to the optimal value
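
The rise-over-run definition can be checked numerically. This is a minimal sketch in plain Python, assuming a hypothetical helper `numerical_derivative` that uses a small central difference. It only approximates the slope and is not how PyTorch computes gradients.

```python
def square(x):
    return x ** 2

def numerical_derivative(func, x, h=1e-5):
    """Central-difference estimate of rise over run around x."""
    return (func(x + h) - func(x - h)) / (2 * h)

slope = numerical_derivative(square, 3.0)
print(round(slope, 4))  # 6.0, matching the analytic derivative 2x at x=3
```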

xt = tensor(3.).requires_grad_()

yt = f(xt)
yt
tensor(9., grad_fn=<PowBackward0>)

yt.grad_fn
<PowBackward0 at 0x7f91e90a6670>

#### Tensor.backward()

yt.backward()

The derivative of f(x) = x**2 is 2x, so the derivative at x=3 is 6

xt.grad
tensor(6.)

Derivatives should be 6, 8, 20

xt = tensor([3.,4.,10.]).requires_grad_()
xt
tensor([ 3.,  4., 10.], requires_grad=True)

def f(x): return (x**2).sum()

yt = f(xt)
yt
tensor(125., grad_fn=<SumBackward0>)

yt.backward()
xt.grad
tensor([ 6.,  8., 20.])

### Stepping with a Learning Rate

• nearly all approaches to updating model parameters start with multiplying the gradient by some small number called the learning rate
• Learning rate is often a number between 0.001 and 0.1
• could be any value
• size of step is determined by the learning rate
• picking a learning rate that is too small means more steps are needed to reach the optimal parameter values
• picking a learning rate that is too big can result in the loss getting worse or bouncing around the same range of values
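
The effect of the learning rate can be illustrated with a toy loss. The sketch below assumes a hypothetical loss of x**2 (gradient 2x) and a made-up helper `descend`. The specific learning rates are chosen only to show the three regimes.

```python
def grad(x):
    # Derivative of the hypothetical loss x**2
    return 2 * x

def descend(lr, steps=20, start=3.0):
    """Run plain gradient descent and return the final parameter value."""
    x = start
    for _ in range(steps):
        x -= lr * grad(x)   # step against the gradient, scaled by lr
    return x

for lr in (0.001, 0.1, 1.1):
    print(lr, descend(lr))
```

With these settings, the smallest learning rate barely moves from the starting point, the middle one gets close to the minimum at 0, and the largest one overshoots further on every step.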

### An End-to-End SGD Example

• Steps to turn function into classifier
1. Initialize the weights
• initialize parameters to random values
2. For each image, use these weights to predict whether it appears to be a 3 or a 7.
3. Based on these predictions, calculate how good the model is (its loss)
• “testing the effectiveness of any current weight assignment in terms of actual performance”
• need a function that will return a number that is small when performance is good
• standard convention is to treat a small loss as good and a large loss as bad
4. Calculate the gradient, which measures for each weight how changing that weight would change the loss
• use calculus to determine whether to increase or decrease individual weight values
5. Step (update) the weights based on that calculation
6. Go back to step 2 and repeat the process
7. Iterate until you decide to stop the training process
• until either the model is good enough, the model accuracy starts to decrease or you don’t want to wait any longer
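
The steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: a single weight, hypothetical data generated from y = 2x, a hand-derived gradient instead of automatic differentiation, and the whole dataset used in every step.

```python
import random

random.seed(0)  # make the random initialization reproducible

# Hypothetical data generated from y = 2 * x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]

w = random.uniform(-1.0, 1.0)      # 1. initialize the weight randomly
lr = 0.05

for step in range(100):            # 6-7. repeat for a fixed number of steps
    preds = [w * x for x in xs]    # 2. predict
    # 3. loss: mean squared error (small when predictions are good)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 4. gradient of the loss with respect to w, derived by hand
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    w -= lr * grad                 # 5. step the weight

print(round(w, 4))  # 2.0
```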

Scenario: build a model of how the speed of a rollercoaster changes over time

#### torch.arange()

time = torch.arange(0,20).float();
print(time)
tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19.])

#### matplotlib.pyplot.scatter()

# Add some random noise to mimic manually measuring the speed
speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1
plt.scatter(time,speed);

# A quadratic function with trainable parameters
def f(t, params):
    a,b,c = params
    return a*(t**2) + (b*t) + c

# Note: despite the name, this returns the root of the mean squared error (RMSE)
def mse(preds, targets): return ((preds-targets)**2).mean().sqrt()

#### Step 1: Initialize the parameters

# Initialize trainable parameters with random values
# Let PyTorch know that we want to track the gradients
params = torch.randn(3).requires_grad_()
params
tensor([-0.7658, -0.7506,  1.3525], requires_grad=True)

#hide
orig_params = params.clone()

#### Step 2: Calculate the predictions

preds = f(time, params)
print(preds.shape)
preds
torch.Size([20])
tensor([ 1.3525e+00, -1.6391e-01, -3.2121e+00, -7.7919e+00, -1.3903e+01, -2.1547e+01, -3.0721e+01, -4.1428e+01, -5.3666e+01, -6.7436e+01, -8.2738e+01, -9.9571e+01, -1.1794e+02, -1.3783e+02,
-1.5926e+02, -1.8222e+02, -2.0671e+02, -2.3274e+02, -2.6029e+02, -2.8938e+02], grad_fn=<AddBackward0>)

def show_preds(preds, ax=None):
    if ax is None: ax=plt.subplots()[1]
    ax.scatter(time, speed)
    ax.scatter(time, to_np(preds), color='red')
    ax.set_ylim(-300,100)

show_preds(preds)

#### Step 3: Calculate the loss

• goal is to minimize this value

loss = mse(preds, speed)
loss
tensor(160.6979, grad_fn=<SqrtBackward0>)

#### Step 4: Calculate the gradients

loss.backward()
params.grad
tensor([-165.5151,  -10.6402,   -0.7900])

# Set learning rate to 0.00001
lr = 1e-5

# Multiply the gradients by the learning rate
params.grad * lr
tensor([-1.6552e-03, -1.0640e-04, -7.8996e-06])

params
tensor([-0.7658, -0.7506,  1.3525], requires_grad=True)

#### Step 5: Step the weights.

# Using a learning rate of 0.0001 for larger steps
lr = 1e-4
# Update the parameter values, then reset the gradients
params.data -= lr * params.grad.data
params.grad = None

# Test the updated parameter values
preds = f(time,params)
mse(preds, speed)
tensor(157.9476, grad_fn=<SqrtBackward0>)

show_preds(preds)

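def apply_step(params, prn=True):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()
    # Update the parameters and reset the gradients for the next step
    params.data -= lr * params.grad.data
    params.grad = None
    if prn: print(loss.item())
    return preds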

#### Step 6: Repeat the process

for i in range(10): apply_step(params)
157.9476318359375
155.1999969482422
152.45513916015625
149.71319580078125
146.97434997558594
144.23875427246094
141.50660705566406
138.77809143066406
136.05340576171875
133.33282470703125

_,axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()

Many steps later…

_,axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()

#### Step 7: Stop

• Watch the training and validation losses and our metrics to decide when to stop

• Initial model weights can be randomly initialized or from a pretrained model
• Compare the model output with our labeled training data using a loss function
• The loss function returns a number that we want to minimize by improving the model weights
• We change the weights a little bit to make the model slightly better based on gradients calculated using calculus
• the magnitude of the gradients indicates how big of a step needs to be taken
• Multiply the gradients by a learning rate to control how big of a change to make for each update
• Iterate

## The MNIST Loss Function

• Khan Academy: Intro to Matrix Multiplication
• Accuracy is not useful as a loss function
• accuracy only changes when prediction changes from a 3 to a 7 or vice versa
• its derivative is 0 almost everywhere
• need a loss function that gives a slightly better loss when our weights result in slightly better prediction
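
A small plain-Python sketch of this point, under made-up assumptions: a one-feature dataset, a single weight `w`, and a smooth loss built from a sigmoid (the squashing function introduced later in these notes). A tiny change in the weight leaves the accuracy untouched but does move the smooth loss.

```python
import math

# Hypothetical one-feature dataset: label 1 means "3", label 0 means "7"
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def accuracy(w):
    """Fraction of items where thresholding w*x at 0 matches the label."""
    return sum((w * x > 0) == bool(y) for x, y in zip(xs, ys)) / len(xs)

def smooth_loss(w):
    """Mean distance between sigmoid(w*x) and the label."""
    return sum(abs(sigmoid(w * x) - y) for x, y in zip(xs, ys)) / len(xs)

for w in (0.5, 0.5001):
    print(w, accuracy(w), smooth_loss(w))
```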

#### Tensor.view()

# 1. Concatenate all independent variables into a single tensor
# 2. Flatten each image matrix into a vector
#    -1: infer the size of this axis so that all the data fits
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)

train_x.shape
torch.Size([12396, 784])

# Label 3s as 1 and label 7s as 0
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
train_y.shape
torch.Size([12396, 1])

# Combine independent and dependent variables into a dataset
dset = list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y
(torch.Size([784]), tensor([1]))

valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x,valid_y))

# Randomly initialize parameters
def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()

# Initialize weight values
weights = init_params((28*28,1))

# Initialize bias values
bias = init_params(1)

# Calculate a prediction for a single image
(train_x[0]*weights.T).sum() + bias
tensor([-6.2330], grad_fn=<AddBackward0>)

### Matrix Multiplication

# Matrix multiplication using loops
def mat_mul(m1, m2):
    result = []
    for m1_r in range(len(m1)):
        for m2_r in range(len(m2[0])):
            sum_val = 0
            for c in range(len(m1[0])):
                sum_val += m1[m1_r][c] * m2[c][m2_r]
            result += [sum_val]
    return result

# Create copies of the tensors that don't require gradients
train_x_clone = train_x.clone().detach()
weights_clone = weights.clone().detach()

%%time
# Matrix multiplication using @ operator
(train_x_clone@weights_clone)[:5]
CPU times: user 2.35 ms, sys: 4.15 ms, total: 6.5 ms
Wall time: 5.29 ms

tensor([[ -6.5802],
[-10.9860],
[-21.2337],
[-18.2173],
[ -1.7079]], device='cuda:0')

%%time
# This is why you should avoid using loops
mat_mul(train_x_clone, weights_clone)[:5]
CPU times: user 1min 37s, sys: 28 ms, total: 1min 37s
Wall time: 1min 37s

[tensor(-6.5802, device='cuda:0'),
tensor(-10.9860, device='cuda:0'),
tensor(-21.2337, device='cuda:0'),
tensor(-18.2173, device='cuda:0'),
tensor(-1.7079, device='cuda:0')]

# Move tensor copies to GPU
train_x_clone = train_x_clone.to('cuda');
weights_clone = weights_clone.to('cuda');

%%time
(train_x_clone@weights_clone)[:5]
CPU times: user 2.19 ms, sys: 131 µs, total: 2.32 ms
Wall time: 7.78 ms

tensor([[ -6.5802],
[-10.9860],
[-21.2337],
[-18.2173],
[ -1.7079]], device='cuda:0')

# Over 86,000 times faster on GPU
print(f"{(44.9 * 1e+6) / 522:,.2f}")
86,015.33

# Define a linear layer
# Matrix-multiply xb and weights and add the bias
def linear1(xb): return xb@weights + bias
preds = linear1(train_x)
preds
tensor([[ -6.2330],
[-10.6388],
[-20.8865],
...,
[-15.9176],
[ -1.6866],
[-11.3568]], grad_fn=<AddBackward0>)

# Determine which predictions were correct
corrects = (preds>0.0).float() == train_y
corrects
tensor([[False],
[False],
[False],
...,
[ True],
[ True],
[ True]])

# Calculate the current model accuracy
corrects.float().mean().item()
0.5379961133003235

# Test a small change in the weights
with torch.no_grad(): weights[0] *= 1.0001

preds = linear1(train_x)
((preds>0.0).float() == train_y).float().mean().item()
0.5379961133003235

trgts  = tensor([1,0,1])
prds   = tensor([0.9, 0.4, 0.2])

#### torch.where(condition, x, y)

# Measures how distant each prediction is from 1 if it should be one
# and how distant it is from 0 if it should be 0 and take the mean of those distances
# returns a lower number when predictions are more accurate
# Assumes that all predictions are between 0 and 1
def mnist_loss(predictions, targets):
    return torch.where(targets==1, 1-predictions, predictions).mean()

torch.where(trgts==1, 1-prds, prds)
tensor([0.1000, 0.4000, 0.8000])

mnist_loss(prds,trgts)
tensor(0.4333)

mnist_loss(tensor([0.9, 0.4, 0.8]),trgts)
tensor(0.2333)

### Sigmoid Function

• always returns a value between 0 and 1
• function is a smooth curve that only goes up
• makes it easier for SGD to find meaningful gradients

#### torch.exp(x)

• [torch.exp](https://pytorch.org/docs/stable/generated/torch.exp.html) returns $$e^{x}$$ where $$e$$ is [Euler’s number](https://en.wikipedia.org/wiki/E_(mathematical_constant))
• $$e \approx 2.7183$$

print(torch.exp(tensor(1)))
print(torch.exp(tensor(2)))
tensor(2.7183)
tensor(7.3891)

# Always returns a number between 0 and 1
def sigmoid(x): return 1/(1+torch.exp(-x))

plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4);

def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
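
Why the sigmoid matters can be seen with a few raw model outputs. The sketch below re-implements both loss variants in plain Python under hypothetical names (`loss_without_sigmoid`, `loss_with_sigmoid`) and uses made-up raw predictions. It is an illustration of the idea, not the PyTorch code above.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def loss_without_sigmoid(preds, targets):
    # Distance from 1 when the target is 1, distance from 0 otherwise
    return sum((1 - p) if t == 1 else p for p, t in zip(preds, targets)) / len(preds)

def loss_with_sigmoid(preds, targets):
    # Squash raw outputs into (0, 1) first, then apply the same rule
    return loss_without_sigmoid([sigmoid(p) for p in preds], targets)

raw_preds = [-6.2, 4.0, -20.9]   # hypothetical raw linear-layer outputs
targets   = [1, 0, 1]

print(loss_without_sigmoid(raw_preds, targets))  # far outside the range [0, 1]
print(loss_with_sigmoid(raw_preds, targets))     # stays between 0 and 1
```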

### SGD and Mini-Batches

• calculating the loss for the entire dataset would take a lot of time
• the full dataset is also unlikely to fit in memory
• calculating the loss for a single data item would result in an imprecise and unstable gradient
• we can compromise by calculating the loss for a few data items at a time
• mini-batch: a subset of data items
• batch size: the number of data items in a mini-batch
• larger batch-size
• typically results in a more accurate and stable estimate of your dataset’s gradient from the loss function
• takes longer per mini-batch
• fewer mini-batches processed per epoch
• the batch size is limited by the amount of available memory for the CPU or GPU
• ideal batch-size is context dependent
• accelerators like GPUs work best when they have lots of work to do at a time
• typically want to use the largest batch-size that will fit in GPU memory
• typically want to randomly shuffle the contents of mini-batches for each epoch
• DataLoader: handles shuffling and mini-batch collation
• can take any Python collection and turn it into an iterator over many batches
• PyTorch Dataset: a collection that contains tuples of independent and dependent variables
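
As a rough mental model of what a data loader does, here is a simplified plain-Python generator. The name `mini_batches` and its `seed` parameter are hypothetical. The real DataLoader also collates items into tensors and offers many more options.

```python
import random

def mini_batches(items, batch_size, shuffle=True, seed=None):
    """Yield lists of up to batch_size items, optionally in shuffled order."""
    indices = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [items[i] for i in indices[start:start + batch_size]]

batches = list(mini_batches(list(range(15)), batch_size=6, seed=42))
print([len(b) for b in batches])  # [6, 6, 3]
```

The last mini-batch is smaller whenever the dataset size is not a multiple of the batch size. In this sketch, passing a different seed (or none) for each epoch gives a different shuffle.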

In-Place Operations:

• methods in PyTorch that end in an underscore modify their objects in place

#### Map-style datasets:

• implements the __getitem__() and __len__() protocols, and represents a map from indices/keys to data samples

#### Iterable-style datasets:

• an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples
• particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data
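
The two dataset styles differ only in which Python protocol they implement. The sketch below uses plain classes with hypothetical names to show the protocols. With PyTorch, an iterable-style dataset would subclass `torch.utils.data.IterableDataset`.

```python
class LettersMapDataset:
    """Map-style: supports len() and integer indexing."""
    def __init__(self, letters):
        self.letters = letters
    def __len__(self):
        return len(self.letters)
    def __getitem__(self, idx):
        # Return an (independent, dependent) tuple
        return idx, self.letters[idx]

class LettersIterableDataset:
    """Iterable-style: items are only available by iterating in order."""
    def __init__(self, letters):
        self.letters = letters
    def __iter__(self):
        for idx, letter in enumerate(self.letters):
            yield idx, letter

map_ds = LettersMapDataset("abc")
iter_ds = LettersIterableDataset("abc")
print(len(map_ds), map_ds[2])   # 3 (2, 'c')
print(list(iter_ds))            # [(0, 'a'), (1, 'b'), (2, 'c')]
```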

DataLoader
fastai.data.load.DataLoader

# Sample collection
coll = range(15)
coll
range(0, 15)

# Wrap the collection in a DataLoader that yields shuffled mini-batches of five items
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)
[tensor([ 0,  7,  4,  5, 11]),
tensor([ 9,  3,  8, 14,  6]),
tensor([12,  2,  1, 10, 13])]

# Sample dataset of independent and dependent variables
ds = L(enumerate(string.ascii_lowercase))
ds
(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]

dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
[(tensor([20, 18, 21,  5,  6,  9]), ('u', 's', 'v', 'f', 'g', 'j')),
(tensor([13, 19, 12, 16, 25,  3]), ('n', 't', 'm', 'q', 'z', 'd')),
(tensor([15,  1,  0, 24, 10, 23]), ('p', 'b', 'a', 'y', 'k', 'x')),
(tensor([11, 22,  2,  4, 14, 17]), ('l', 'w', 'c', 'e', 'o', 'r')),
(tensor([7, 8]), ('h', 'i'))]

## Putting It All Together

# Randomly initialize parameters
weights = init_params((28*28,1))
bias = init_params(1)

# Create data loader for training dataset
dl = DataLoader(dset, batch_size=256)

#### fastcore first():

first
<function fastcore.basics.first(x, f=None, negate=False, **kwargs)>

# Get the first mini-batch from the data loader
xb,yb = first(dl)
xb.shape,yb.shape
(torch.Size([256, 784]), torch.Size([256, 1]))

# Create data loader for validation dataset
valid_dl = DataLoader(valid_dset, batch_size=256)

# Smaller example mini-batch for testing
batch = train_x[:4]
batch.shape
torch.Size([4, 784])

# Test model smaller mini-batch
preds = linear1(batch)
preds
tensor([[ -9.2139],
[-20.0299],
[-16.8065],
[-14.1171]], grad_fn=<AddBackward0>)

# Calculate the loss
loss = mnist_loss(preds, train_y[:4])
loss
tensor(1.0000, grad_fn=<MeanBackward0>)

# Compute the gradients
loss.backward()
weights.grad.shape,weights.grad.mean(),bias.grad
(torch.Size([784, 1]), tensor(-3.5910e-06), tensor([-2.5105e-05]))

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(),bias.grad
(tensor(-7.1820e-06), tensor([-5.0209e-05]))

Note: loss.backward() adds the gradients of loss to any gradients that are currently stored. This means we need to zero the gradients first

calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(),bias.grad
(tensor(-1.0773e-05), tensor([-7.5314e-05]))

weights.grad.zero_()
bias.grad.zero_();

def train_epoch(model, lr, params):
    for xb,yb in dl:
        # Compute the gradients for the current mini-batch
        calc_grad(xb, yb, model)
        for p in params:
            # Assign directly to the data attribute to prevent
            # PyTorch from taking the gradient of that step
            p.data -= p.grad*lr
            p.grad.zero_()

# Calculate accuracy using broadcasting
(preds>0.0).float() == train_y[:4]
tensor([[False],
[False],
[False],
[False]])

def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()

batch_accuracy(linear1(batch), train_y[:4])
tensor(0.)

def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)

validate_epoch(linear1)
0.3407

lr = 1.
params = weights,bias
# Train for one epoch
train_epoch(linear1, lr, params)
validate_epoch(linear1)
0.6138

# Train for twenty epochs
for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')
0.7358 0.9052 0.9438 0.9575 0.9638 0.9692 0.9726 0.9741 0.975 0.976 0.9765 0.9765 0.9765 0.9779 0.9784 0.9784 0.9784 0.9784 0.9789 0.9784 

Note: Accuracy improves from 0.7358 to 0.9784

### Creating an Optimizer

Why we need Non-Linear activation functions

• a series of any number of linear layers in a row can be replaced with a single linear layer with different parameters
• adding a non-linear layer between linear layers helps decouple the linear layers from each other so they can learn separate features
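
The first point can be verified with scalars. The following plain-Python sketch uses made-up parameter values and hypothetical helper names. It shows that two stacked linear functions equal one linear function with combined parameters, and that putting a ReLU between them breaks this equivalence.

```python
def linear(x, w, b):
    return w * x + b

def relu(x):
    return max(x, 0.0)

# Hypothetical scalar parameters for two layers
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_linear(x):
    return linear(linear(x, w1, b1), w2, b2)

def collapsed(x):
    # A single linear layer with w = w2*w1 and b = w2*b1 + b2
    return linear(x, w2 * w1, w2 * b1 + b2)

def with_relu(x):
    return linear(relu(linear(x, w1, b1)), w2, b2)

for x in (-2.0, 0.0, 2.0):
    print(x, two_linear(x), collapsed(x), with_relu(x))
```

For x = -2.0 the first layer outputs a negative value, so the ReLU version (0.5) differs from the purely linear one (9.5).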

#### nn.Module():

nn.Linear
torch.nn.modules.linear.Linear

linear_model = nn.Linear(28*28,1)
linear_model
Linear(in_features=784, out_features=1, bias=True)

#### nn.Parameter():

w,b = linear_model.parameters()
w.shape,b.shape
(torch.Size([1, 784]), torch.Size([1]))

print(type(w))
print(type(b))
<class 'torch.nn.parameter.Parameter'>
<class 'torch.nn.parameter.Parameter'>

b
Parameter containing:
tensor([0.0062], requires_grad=True)

# Implements the basic optimization steps used earlier for use with a PyTorch Module
class BasicOptim:
    def __init__(self,params,lr): self.params,self.lr = list(params),lr

    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr

    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None

# PyTorch optimizers need a reference to the target model parameters
opt = BasicOptim(linear_model.parameters(), lr)

def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()

validate_epoch(linear_model)
0.4673

def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')

train_model(linear_model, 20)
0.4932 0.8193 0.8467 0.9155 0.935 0.9477 0.956 0.9629 0.9653 0.9682 0.9697 0.9731 0.9741 0.9751 0.9761 0.9765 0.9775 0.978 0.9785 0.9785 

Note: The PyTorch version arrives at almost exactly the same accuracy as the hand-crafted version

#### fastai SGD():

SGD
<function fastai.optimizer.SGD(params, lr, mom=0.0, wd=0.0, decouple_wd=True)>

linear_model = nn.Linear(28*28,1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)
0.4932 0.8135 0.8481 0.916 0.9341 0.9487 0.956 0.9634 0.9653 0.9673 0.9692 0.9717 0.9746 0.9751 0.9756 0.9765 0.9775 0.9775 0.978 0.978 
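
The hand-written optimizer and a library SGD optimizer reach nearly identical accuracies because, without momentum or weight decay, both apply the same update rule. A minimal sketch of that equivalence follows. It assumes PyTorch's `torch.optim.SGD` as a stand-in for the fastai `SGD` used above, and a tiny random regression problem whose sizes and loss are made up for illustration.

```python
import torch
from torch import nn

# Same idea as the BasicOptim class in these notes, repeated so the example is self-contained
class BasicOptim:
    def __init__(self, params, lr): self.params, self.lr = list(params), lr

    def step(self):
        for p in self.params: p.data -= p.grad.data * self.lr

    def zero_grad(self):
        for p in self.params: p.grad = None

torch.manual_seed(0)

# Two models that start from identical parameters (sizes are illustrative)
model_a = nn.Linear(3, 1)
model_b = nn.Linear(3, 1)
model_b.load_state_dict(model_a.state_dict())

# A tiny made-up mini-batch
x = torch.randn(8, 3)
y = torch.randn(8, 1)

opt_a = BasicOptim(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)   # plain SGD: no momentum, no weight decay

# One training step with each optimizer
for model, opt in ((model_a, opt_a), (model_b, opt_b)):
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()

weights_match = torch.allclose(model_a.weight, model_b.weight)
bias_match = torch.allclose(model_a.bias, model_b.bias)
print(weights_match, bias_match)
```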

dls = DataLoaders(dl, valid_dl)

#### fastai Learner:

Learner
fastai.learner.Learner

learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)

#### fastai Learner.fit:

lr
1.0

learn.fit(10, lr=lr)
epoch train_loss valid_loss batch_accuracy time
0 0.635737 0.503216 0.495584 00:00
1 0.443481 0.246651 0.777723 00:00
2 0.165904 0.159723 0.857704 00:00
3 0.074277 0.099495 0.918057 00:00
4 0.040486 0.074255 0.934740 00:00
5 0.027243 0.060227 0.949951 00:00
6 0.021766 0.051380 0.956330 00:00
7 0.019304 0.045439 0.962709 00:00
8 0.018036 0.041227 0.965653 00:00
9 0.017262 0.038097 0.968106 00:00

def simple_net(xb):
    # Linear layer
    res = xb@w1 + b1
    # ReLU activation layer
    res = res.max(tensor(0.0))
    # Linear layer
    res = res@w2 + b2
    return res

w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)

#### PyTorch F.relu:

F.relu
<function torch.nn.functional.relu(input: torch.Tensor, inplace: bool = False) -> torch.Tensor>

plot_function(F.relu)
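
`F.relu` computes the same thing as the `res.max(tensor(0.0))` line in the hand-written `simple_net`: an element-wise maximum against zero. A small sketch on a few hand-picked inputs (the values are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Element-wise maximum against zero, as written in the hand-built simple_net
via_max = x.max(torch.tensor(0.0))

# The built-in functional form
via_relu = F.relu(x)

print(via_relu)                          # negatives become 0, positives pass through
print(torch.equal(via_max, via_relu))
```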

#### nn.Sequential:

simple_net = nn.Sequential(
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,1)
)
simple_net
Sequential(
(0): Linear(in_features=784, out_features=30, bias=True)
(1): ReLU()
(2): Linear(in_features=30, out_features=1, bias=True)
)
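
To get a feel for the size of this model, the parameters can be counted layer by layer. The sketch below rebuilds the same three-layer `nn.Sequential` and sums the element counts; the ReLU layer contributes no parameters.

```python
from torch import nn

simple_net = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),
    nn.Linear(30, 1)
)

# Each Linear layer holds a weight matrix and a bias vector
for name, p in simple_net.named_parameters():
    print(name, tuple(p.shape), p.numel())

# 784*30 + 30 + 30*1 + 1
total = sum(p.numel() for p in simple_net.parameters())
print(total)
```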

learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)

learn.fit(40, 0.1)
epoch train_loss valid_loss batch_accuracy time
0 0.259396 0.417702 0.504416 00:00
1 0.128176 0.216283 0.818449 00:00
2 0.073893 0.111460 0.920020 00:00
3 0.050328 0.076076 0.941119 00:00
4 0.039086 0.059598 0.958292 00:00
5 0.033148 0.050273 0.964671 00:00
6 0.029618 0.044374 0.966634 00:00
7 0.027258 0.040340 0.969087 00:00
8 0.025527 0.037404 0.969578 00:00
9 0.024172 0.035167 0.971541 00:00
10 0.023068 0.033394 0.972522 00:00
11 0.022145 0.031943 0.973503 00:00
12 0.021360 0.030726 0.975466 00:00
13 0.020682 0.029685 0.974975 00:00
14 0.020088 0.028779 0.975466 00:00
15 0.019563 0.027983 0.975957 00:00
16 0.019093 0.027274 0.976448 00:00
17 0.018670 0.026638 0.977920 00:00
18 0.018285 0.026064 0.977920 00:00
19 0.017933 0.025544 0.978901 00:00
20 0.017610 0.025069 0.979392 00:00
21 0.017310 0.024635 0.979392 00:00
22 0.017032 0.024236 0.980373 00:00
23 0.016773 0.023869 0.980373 00:00
24 0.016531 0.023529 0.980864 00:00
25 0.016303 0.023215 0.981354 00:00
26 0.016089 0.022923 0.981354 00:00
27 0.015887 0.022652 0.981354 00:00
28 0.015695 0.022399 0.980864 00:00
29 0.015514 0.022164 0.981354 00:00
30 0.015342 0.021944 0.981354 00:00
31 0.015178 0.021738 0.981354 00:00
32 0.015022 0.021544 0.981845 00:00
33 0.014873 0.021363 0.981845 00:00
34 0.014731 0.021192 0.981845 00:00
35 0.014595 0.021031 0.982336 00:00
36 0.014464 0.020879 0.982826 00:00
37 0.014338 0.020735 0.982826 00:00
38 0.014217 0.020599 0.982826 00:00
39 0.014101 0.020470 0.982336 00:00

#### matplotlib.pyplot.plot:

plt.plot
<function matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)>

#### fastai learner.Recorder:

learn.recorder
Recorder

Recorder
fastai.learner.Recorder

#### fastcore L.itemgot():

L.itemgot
<function fastcore.foundation.L.itemgot(self, *idxs)>

plt.plot(L(learn.recorder.values).itemgot(2));

learn.recorder.values[-1][2]
0.98233562707901
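
Judging by the value returned above, each row of `learn.recorder.values` appears to hold the training loss, the validation loss, and then the metric for one epoch, so index 2 selects `batch_accuracy`. The sketch below uses a plain list comprehension as a stand-in for `L(...).itemgot(2)` (an assumption made to avoid needing fastcore), with the last three epochs copied from the training table.

```python
# Last three epochs from the learn.fit(40, 0.1) table:
# [train_loss, valid_loss, batch_accuracy]
values = [
    [0.014338, 0.020735, 0.982826],
    [0.014217, 0.020599, 0.982826],
    [0.014101, 0.020470, 0.982336],
]

# Plain-Python stand-in for L(values).itemgot(2): take item 2 from every row
accuracies = [row[2] for row in values]

# Equivalent of learn.recorder.values[-1][2]: the metric from the final epoch
final_accuracy = values[-1][2]

print(accuracies, final_accuracy)
```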

### Going Deeper

• deeper models: models with more layers
• deeper models become harder to optimize as more layers are added
• deeper models can reach the same performance with fewer parameters
• we can use smaller matrices with more layers
• we can train the model more quickly using less memory
• deeper models typically perform better
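
The point about smaller matrices can be made concrete by counting parameters. The layer sizes below are hypothetical and chosen only to show the arithmetic; the comparison says nothing about which network would be more accurate.

```python
def linear_param_count(sizes):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# One wide hidden layer (hypothetical sizes)
wide_shallow = linear_param_count([784, 1000, 1])

# Two narrower hidden layers (hypothetical sizes)
narrow_deep = linear_param_count([784, 100, 100, 1])

print(wide_shallow, narrow_deep)
```

With these sizes the deeper network uses roughly a ninth of the parameters, mostly because the first weight matrix is ten times narrower.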

dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
epoch train_loss valid_loss accuracy time
0 0.066122 0.008277 0.997547 00:04

learn.model
Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(5): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(6): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(7): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(1): Sequential(
)
(1): Flatten(full=False)
(2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.25, inplace=False)
(4): Linear(in_features=1024, out_features=512, bias=False)
(5): ReLU(inplace=True)
(6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Dropout(p=0.5, inplace=False)
(8): Linear(in_features=512, out_features=2, bias=False)
)
)

## Jargon Recap

• neural networks contain two types of numbers
1. Parameters: numbers that are randomly initialized and optimized
• define the model
2. Activations: numbers that are calculated using the parameter values
• tensors
• regularly-shaped arrays like a matrix
• have rows and columns
• called the axes or dimensions
• rank: the number of dimensions of a tensor
• Rank-0: scalar
• Rank-1: vector
• Rank-2: matrix
• a neural network contains a number of linear and non-linear layers
• non-linear layers are referred to as activation layers
• ReLU: a function that sets any negative values to zero
• Mini-batch: a small group of inputs and labels gathered together in two arrays; each gradient descent step is computed on one mini-batch
• Forward pass: Applying the model to some input and computing the predictions
• Loss: A value that represents how well or badly the model is doing
• Gradient: The derivative of the loss with respect to each model parameter
• Gradient descent: Taking a step in the direction opposite to the gradients to make the model parameters a little bit better
• Learning rate: The size of the step we take when applying SGD to update the parameters of the model
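
Several of these terms can be seen together in a few lines of PyTorch. The sketch below is a toy example with made-up data (labels are simply twice the inputs) and a single parameter; it is meant only to attach each term to a line of code.

```python
import torch

# Rank: the number of axes of a tensor
scalar = torch.tensor(3.0)
vector = torch.tensor([1.0, 2.0])
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
ranks = (scalar.ndim, vector.ndim, matrix.ndim)

# Mini-batch: inputs and labels in two arrays (made-up data, y = 2x)
xb = torch.tensor([[1.0], [2.0], [3.0]])
yb = torch.tensor([[2.0], [4.0], [6.0]])

# Parameter: a number the optimizer is allowed to change (started at 0 here for simplicity)
w = torch.tensor([[0.0]], requires_grad=True)
lr = 0.1   # learning rate: the size of the step

# Forward pass: compute activations (here, the predictions) from the parameter
preds = xb @ w

# Loss: mean squared error, chosen for this toy example
loss = ((preds - yb) ** 2).mean()

# Gradient: derivative of the loss with respect to w
loss.backward()

# Gradient descent: step in the direction opposite to the gradient
with torch.no_grad():
    w -= lr * w.grad

new_loss = ((xb @ w - yb) ** 2).mean()
print(ranks, loss.item(), new_loss.item())
```

After the single step the loss is much lower than before, because `w` has moved from 0 toward the value 2 that generated the labels.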
