Notes on fastai Book Ch. 4
- Tenacity and Deep Learning
- The Foundations of Computer Vision
- Pixels
- Pixel Similarity
- Computing Metrics Using Broadcasting
- Stochastic Gradient Descent
- The MNIST Loss Function
- Putting It All Together
- Adding a Nonlinearity
- References
Tenacity and Deep Learning
- Deep learning practitioners need to be tenacious
- Only a handful of researchers kept trying to make neural networks work through the 1990s and 2000s.
- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton were not awarded the Turing Award until 2018
- Academic Papers for neural networks were rejected by top journals and conferences, despite showing dramatically better results than anything previously published
- Jürgen Schmidhuber
- pioneered many important ideas
- worked with his student Sepp Hochreiter on the long short-term memory (LSTM) architecture
- LSTMs are now widely used for speech recognition and other text modelling tasks
- Paul Werbos
- Invented backpropagation for neural networks in 1974
- considered the most important foundation of modern AI
The Foundations of Computer Vision
- MNIST Database
- contains images of handwritten digits, collected by the National Institute of Standards and Technology
- created in 1998
- LeNet-5
- A convolutional neural network structure proposed by Yann LeCun and his colleagues
- Demonstrated the first practically useful recognition of handwritten digit sequences in 1998
- One of the most important breakthroughs in the history of AI
Pixels
MNIST_SAMPLE
- A sample of the famous MNIST dataset consisting of handwritten digits.
- contains training data for the digits 3 and 7
- images are in single-channel grayscale format
- already split into training and validation sets
from fastai.vision.all import *
from fastbook import *
matplotlib.rc('image', cmap='Greys')
print(URLs.MNIST_SAMPLE)
path = untar_data(URLs.MNIST_SAMPLE)
print(path)
https://s3.amazonaws.com/fast-ai-sample/mnist_sample.tgz
/home/innom-dt/.fastai/data/mnist_sample
# Set base path to mnist_sample directory
Path.BASE_PATH = path
# A custom fastai method that returns the contents of path as a list
path.ls()
(#3) [Path('labels.csv'),Path('train'),Path('valid')]
fastcore L Class
- https://fastcore.fast.ai/foundation.html#L
- Behaves like a list of items but can also index with a list of indices or masks
- Displays the number of items before printing the items
type(path.ls())
fastcore.foundation.L
(path/'train').ls()
(#2) [Path('train/3'),Path('train/7')]
threes = (path/'train'/'3').ls().sorted()
sevens = (path/'train'/'7').ls().sorted()
threes
(#6131) [Path('train/3/10.png'),Path('train/3/10000.png'),Path('train/3/10011.png'),Path('train/3/10031.png'),Path('train/3/10034.png'),Path('train/3/10042.png'),Path('train/3/10052.png'),Path('train/3/1007.png'),Path('train/3/10074.png'),Path('train/3/10091.png')...]
im3_path = threes[1]
print(im3_path)
im3 = Image.open(im3_path)
im3
/home/innom-dt/.fastai/data/mnist_sample/train/3/10000.png
PIL Image Module
- https://pillow.readthedocs.io/en/stable/reference/Image.html
- provides a class with the same name which is used to represent a PIL image
- provides a number of factory functions, including functions to load images from files, and to create new images
print(type(im3))
print(im3.size)
<class 'PIL.PngImagePlugin.PngImageFile'>
(28, 28)
# Slice of the image from index 4 up to, but not including, index 10
array(im3)[4:10,4:10]
array([[ 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 29],
[ 0, 0, 0, 48, 166, 224],
[ 0, 93, 244, 249, 253, 187],
[ 0, 107, 253, 253, 230, 48],
[ 0, 3, 20, 20, 15, 0]], dtype=uint8)
NumPy Arrays and PyTorch Tensors
- NumPy
- the most widely used library for scientific and numeric programming in Python
- does not support using GPUs or calculating gradients
- Python is slow compared to many languages
- anything fast in Python is likely to be a wrapper for a compiled object written and optimized in another language like C
- NumPy arrays and PyTorch tensors can finish computations many thousands of times faster than pure Python
- NumPy array
- a multidimensional table of data
- all items are the same type
- can use any type, including arrays, for the array type
- simple types are stored as a compact C data structure in memory
- PyTorch tensor
- nearly identical to NumPy arrays
- can only use a single basic numeric type for all elements
- not as flexible as a genuine array of arrays
- must always be a regularly shaped multi-dimensional rectangular structure
- cannot be jagged
- supports using GPUs
- PyTorch can automatically calculate derivatives of operations performed with tensors
- impossible to do deep learning without this capability
- perform operations directly on arrays or tensors as much as possible instead of using loops
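The tensor constraints listed above can be checked directly. This is a minimal sketch that assumes only that PyTorch is installed; the variable names are illustrative and do not come from the book.

```python
import torch

# A regularly shaped nested list converts to a 2x3 tensor.
regular = torch.tensor([[1, 2, 3], [4, 5, 6]])

# Mixing ints and floats still yields one dtype for every element.
mixed = torch.tensor([1, 2.5])

# A jagged nested list (rows of different lengths) is rejected.
try:
    torch.tensor([[1, 2, 3], [4, 5]])
    jagged_ok = True
except ValueError:
    jagged_ok = False

# Operating on the whole tensor replaces an explicit Python loop.
looped = [[v * 2 for v in row] for row in regular.tolist()]
vectorized = regular * 2

print(regular.shape, mixed.dtype, jagged_ok)
```

The loop and the vectorized expression produce the same values; the vectorized form hands the work to compiled code instead of the Python interpreter.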
data = [[1,2,3],[4,5,6]]
arr = array(data)
tns = tensor(data)
arr  # numpy
array([[1, 2, 3],
[4, 5, 6]])
tns  # pytorch
tensor([[1, 2, 3],
[4, 5, 6]])
# select a row
tns[1]
tensor([4, 5, 6])
# select a column
tns[:,1]
tensor([2, 5])
# select a slice
tns[1,1:3]
tensor([5, 6])
# Perform element-wise addition
tns+1
tensor([[2, 3, 4],
[5, 6, 7]])
tns.type()
'torch.LongTensor'
# Perform element-wise multiplication
tns*1.5
tensor([[1.5000, 3.0000, 4.5000],
[6.0000, 7.5000, 9.0000]])
NumPy Array Objects
- https://numpy.org/doc/stable/reference/arrays.html
- an N-dimensional array type, the ndarray, which describes a collection of “items” of the same type
numpy.array function
- https://numpy.org/doc/stable/reference/generated/numpy.array.html
- creates an array
print(type(array(im3)[4:10,4:10]))
array
<class 'numpy.ndarray'>
<function numpy.array>
print(array(im3)[4:10,4:10][0].data)
print(array(im3)[4:10,4:10][0].dtype)
<memory at 0x7f3c13a20dc0>
uint8
PyTorch Tensor
- https://pytorch.org/docs/stable/tensors.html
- a multi-dimensional matrix containing elements of a single data type
fastai tensor function
- https://docs.fast.ai/torch_core.html#tensor
- Like torch.as_tensor, but handle lists too, and can pass multiple vector elements directly.
print(type(tensor(im3)[4:10,4:10][0]))
tensor
<class 'torch.Tensor'>
<function fastai.torch_core.tensor(x, *rest, dtype=None, device=None, requires_grad=False, pin_memory=False)>
print(tensor(im3)[4:10,4:10][0].data)
print(tensor(im3)[4:10,4:10][0].dtype)
tensor([0, 0, 0, 0, 0, 0], dtype=torch.uint8)
torch.uint8
Pandas DataFrame
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
- Two-dimensional, size-mutable, potentially heterogeneous tabular data
# Full Image
pd.DataFrame(tensor(im3))
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29 | 150 | 195 | 254 | 255 | 254 | 176 | 193 | 150 | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 48 | 166 | 224 | 253 | 253 | 234 | 196 | 253 | 253 | 253 | 253 | 233 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 93 | 244 | 249 | 253 | 187 | 46 | 10 | 8 | 4 | 10 | 194 | 253 | 253 | 233 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 107 | 253 | 253 | 230 | 48 | 0 | 0 | 0 | 0 | 0 | 192 | 253 | 253 | 156 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 3 | 20 | 20 | 15 | 0 | 0 | 0 | 0 | 0 | 43 | 224 | 253 | 245 | 74 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 249 | 253 | 245 | 126 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 101 | 223 | 253 | 248 | 124 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 166 | 239 | 253 | 253 | 253 | 187 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 248 | 250 | 253 | 253 | 253 | 253 | 232 | 213 | 111 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 43 | 98 | 98 | 208 | 253 | 253 | 253 | 253 | 187 | 22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 51 | 119 | 253 | 253 | 253 | 76 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 183 | 253 | 253 | 139 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 182 | 253 | 253 | 104 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 85 | 249 | 253 | 253 | 36 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 214 | 253 | 253 | 173 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 98 | 247 | 253 | 253 | 226 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 150 | 252 | 253 | 253 | 233 | 53 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
22 | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 115 | 42 | 60 | 115 | 159 | 240 | 253 | 253 | 250 | 175 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23 | 0 | 0 | 0 | 0 | 0 | 0 | 187 | 253 | 253 | 253 | 253 | 253 | 253 | 253 | 197 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
24 | 0 | 0 | 0 | 0 | 0 | 0 | 103 | 253 | 253 | 253 | 253 | 253 | 232 | 67 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
tensor(im3).shape
torch.Size([28, 28])
im3_t = tensor(im3)
# Create a pandas DataFrame from image slice
df = pd.DataFrame(im3_t[4:15,4:22])
# Set defined CSS-properties to each ``<td>`` HTML element within the given subset.
# Color-code the values using a gradient
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 29 | 150 | 195 | 254 | 255 | 254 | 176 | 193 | 150 | 96 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 48 | 166 | 224 | 253 | 253 | 234 | 196 | 253 | 253 | 253 | 253 | 233 | 0 | 0 | 0 |
3 | 0 | 93 | 244 | 249 | 253 | 187 | 46 | 10 | 8 | 4 | 10 | 194 | 253 | 253 | 233 | 0 | 0 | 0 |
4 | 0 | 107 | 253 | 253 | 230 | 48 | 0 | 0 | 0 | 0 | 0 | 192 | 253 | 253 | 156 | 0 | 0 | 0 |
5 | 0 | 3 | 20 | 20 | 15 | 0 | 0 | 0 | 0 | 0 | 43 | 224 | 253 | 245 | 74 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 249 | 253 | 245 | 126 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14 | 101 | 223 | 253 | 248 | 124 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 11 | 166 | 239 | 253 | 253 | 253 | 187 | 30 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 16 | 248 | 250 | 253 | 253 | 253 | 253 | 232 | 213 | 111 | 2 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 43 | 98 | 98 | 208 | 253 | 253 | 253 | 253 | 187 | 22 | 0 |
Pixel Similarity
- Establish a baseline to compare against your model
- a simple model that you are confident should perform reasonably well
- should be simple to implement and easy to test
- helps indicate whether your super-fancy models are any good
Method
- Calculate the average values for each pixel location across all images for each digit
- This will generate a blurry image of the target digit
- Compare the values for each pixel location in a new image to the average
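The method above can be sketched on toy data before applying it to MNIST. The example below uses hand-made 2x2 "images" as stand-ins for the two digit classes; the tensors and names are invented for illustration, and the mean absolute difference is just one possible distance.

```python
import torch

# Two tiny synthetic "classes" of 2x2 images (stand-ins for the 3s and 7s).
class_a = torch.tensor([[[1.0, 0.0], [0.0, 1.0]],
                        [[0.8, 0.2], [0.0, 1.0]]])
class_b = torch.tensor([[[0.0, 1.0], [1.0, 0.0]],
                        [[0.2, 0.8], [1.0, 0.0]]])

# Step 1: average each pixel location across all images of a class.
mean_a = class_a.mean(0)
mean_b = class_b.mean(0)

# Step 2: compare a new image to each average using the mean absolute difference.
new_image = torch.tensor([[0.9, 0.1], [0.1, 0.9]])
dist_a = (new_image - mean_a).abs().mean()
dist_b = (new_image - mean_b).abs().mean()

# The class whose average image is closer is the prediction.
prediction = "a" if dist_a < dist_b else "b"
print(prediction, dist_a.item(), dist_b.item())
```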
# Store all images of the digit 7 in a list of tensors
seven_tensors = [tensor(Image.open(o)) for o in sevens]
# Store all images of the digit 3 in a list of tensors
three_tensors = [tensor(Image.open(o)) for o in threes]
len(three_tensors),len(seven_tensors)
(6131, 6265)
fastai show_image function
- https://docs.fast.ai/torch_core.html#show_image
- Display tensor as an image
show_image(three_tensors[1]);
PyTorch Stack Function
- https://pytorch.org/docs/stable/generated/torch.stack.html
- Concatenates a sequence of tensors along a new dimension
# Stack all images for each digit into a single tensor
# and scale pixel values from the range [0,255] to [0,1]
stacked_sevens = torch.stack(seven_tensors).float()/255
stacked_threes = torch.stack(three_tensors).float()/255
stacked_threes.shape
torch.Size([6131, 28, 28])
len(stacked_threes.shape)
stacked_threes.ndim
3
# Calculate the mean values for each pixel location across all images of the digit 3
mean3 = stacked_threes.mean(0)
show_image(mean3);
# Calculate the mean values for each pixel location across all images of the digit 7
mean7 = stacked_sevens.mean(0)
show_image(mean7);
# Pick a single image to compare to the average
a_3 = stacked_threes[1]
show_image(a_3);
# Calculate the Mean Absolute Error between the single image and the mean pixel values
dist_3_abs = (a_3 - mean3).abs().mean()
# Calculate the Root Mean Squared Error between the single image and the mean pixel values
dist_3_sqr = ((a_3 - mean3)**2).mean().sqrt()
print(f"MAE: {dist_3_abs}")
print(f"RMSE: {dist_3_sqr}")
MAE: 0.11143654584884644
RMSE: 0.20208320021629333
Khan Academy: Understanding Square Roots
dist_7_abs = (a_3 - mean7).abs().mean()
dist_7_sqr = ((a_3 - mean7)**2).mean().sqrt()
print(f"MAE: {dist_7_abs}")
print(f"RMSE: {dist_7_sqr}")
MAE: 0.15861910581588745
RMSE: 0.30210891366004944
Note: The error is larger when comparing the image of a 3 to the average pixel values for the digit 7
torch.nn.functional
- https://pytorch.org/docs/stable/nn.functional.html
- Provides functional forms of PyTorch operations such as loss functions and activation functions; available here as F
F
<module 'torch.nn.functional' from '/home/innom-dt/miniconda3/envs/fastbook/lib/python3.9/site-packages/torch/nn/functional.py'>
PyTorch l1_loss function
- https://pytorch.org/docs/stable/generated/torch.nn.functional.l1_loss.html#torch.nn.functional.l1_loss
- takes the mean element-wise absolute value difference
PyTorch mse_loss function
- https://pytorch.org/docs/stable/generated/torch.nn.functional.mse_loss.html#torch.nn.functional.mse_loss
- Measures the element-wise mean squared error
- Penalizes bigger mistakes more heavily
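As a quick check of what these two functions compute, the sketch below compares them with the hand-written formulas on made-up values. The numbers are illustrative only; one deliberately large error (3.0) is included to show how squaring weights it more heavily.

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([0.0, 0.5, 1.0, 3.0])
target = torch.tensor([0.0, 0.0, 0.0, 0.0])

# Mean absolute error, written out and via F.l1_loss.
mae_manual = (pred - target).abs().mean()
mae_builtin = F.l1_loss(pred, target)

# Mean squared error, written out and via F.mse_loss.
mse_manual = ((pred - target) ** 2).mean()
mse_builtin = F.mse_loss(pred, target)

# Taking the square root of the MSE gives the RMSE.
rmse = mse_builtin.sqrt()

print(mae_builtin.item(), mse_builtin.item(), rmse.item())
```

Here the error of 3.0 accounts for 3.0 of the 4.5 total absolute error, but 9.0 of the 10.25 total squared error.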
# Calculate the Mean Absolute Error aka L1 norm
print(F.l1_loss(a_3.float(),mean7))
# Calculate the Root Mean Squared Error aka L2 norm
print(F.mse_loss(a_3,mean7).sqrt())
tensor(0.1586)
tensor(0.3021)
Computing Metrics Using Broadcasting
- broadcasting
- automatically expanding a tensor with a smaller rank to have the same size as one with a larger rank in order to perform an operation
- an important capability that makes tensor code much easier to write
- PyTorch does not allocate additional memory for broadcasting
- it does not actually create multiple copies of the smaller tensor
- PyTorch performs broadcast calculations in C on the CPU and CUDA on the GPU
- tens of thousands of times faster than pure Python
- up to millions of times faster on GPU
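A small sketch of the behaviour described above, using toy shapes rather than the MNIST tensors. The call to `expand` is used here only to make the "no extra copies" point visible; it is an illustration of the idea, not a claim about how PyTorch implements every broadcast internally.

```python
import torch

batch = torch.zeros(4, 2, 2)             # rank 3: four 2x2 "images"
mean_img = torch.tensor([[1.0, 2.0],
                         [3.0, 4.0]])    # rank 2: a single 2x2 "image"

# Subtraction broadcasts mean_img across the leading axis of batch.
diff = batch - mean_img

# expand() shows the idea behind broadcasting: the new axis gets
# stride 0, so the same four values are reused rather than copied.
expanded = mean_img.expand(4, 2, 2)

print(diff.shape, expanded.stride())
```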
# Create tensors for the validation set for the digit 3
# and stack them into a single tensor
valid_3_tens = torch.stack([tensor(Image.open(o))
                            for o in (path/'valid'/'3').ls()])
# Scale pixel values from [0,255] to [0,1]
valid_3_tens = valid_3_tens.float()/255
# Create tensors for the validation set for the digit 7
# and stack them into a single tensor
valid_7_tens = torch.stack([tensor(Image.open(o))
                            for o in (path/'valid'/'7').ls()])
# Scale pixel values from [0,255] to [0,1]
valid_7_tens = valid_7_tens.float()/255
valid_3_tens.shape,valid_7_tens.shape
(torch.Size([1010, 28, 28]), torch.Size([1028, 28, 28]))
# Calculate Mean Absolute Error using broadcasting
# Subtraction operation is performed using broadcasting
# Absolute Value operation is performed elementwise
# Mean operation is performed over the values indexed by the height and width axes
def mnist_distance(a,b): return (a-b).abs().mean((-1,-2))
# Calculate MAE for two single images
mnist_distance(a_3, mean3)
tensor(0.1114)
# Calculate MAE between each image in a batch and the single mean image
valid_3_dist = mnist_distance(valid_3_tens, mean3)
valid_3_dist, valid_3_dist.shape
(tensor([0.1422, 0.1230, 0.1055, ..., 0.1244, 0.1188, 0.1103]),
torch.Size([1010]))
tensor([1,2,3]) + tensor([1,1,1])
tensor([2, 3, 4])
(valid_3_tens-mean3).shape
torch.Size([1010, 28, 28])
# Compare the MAE between an image and the mean 3 with the MAE between that image and the mean 7
def is_3(x): return mnist_distance(x,mean3) < mnist_distance(x,mean7)
is_3(a_3), is_3(a_3).float()
(tensor(True), tensor(1.))
is_3(valid_3_tens)
tensor([ True, True, True, ..., False, True, True])
accuracy_3s = is_3(valid_3_tens).float().mean()
accuracy_7s = (1 - is_3(valid_7_tens).float()).mean()
accuracy_3s,accuracy_7s,(accuracy_3s+accuracy_7s)/2
(tensor(0.9168), tensor(0.9854), tensor(0.9511))
print(f"Correct 3s: {accuracy_3s * valid_3_tens.shape[0]:.0f}")
print(f"Incorrect 3s: {(1 - accuracy_3s) * valid_3_tens.shape[0]:.0f}")
Correct 3s: 926
Incorrect 3s: 84
print(f"Correct 7s: {accuracy_7s * valid_7_tens.shape[0]:.0f}")
print(f"Incorrect 7s: {(1 - accuracy_7s) * valid_7_tens.shape[0]:.0f}")
Correct 7s: 1013
Incorrect 7s: 15
Stochastic Gradient Descent
- the key to having a model that can improve
- need to represent a task such that there are weight assignments that can be evaluated and updated
- Sample function:
- assign a weight value to each pixel location
- X is the image represented as a vector
- all of the rows are stacked up end to end into a single long line
- W contains the weights for each pixel
def pr_eight(x,w): return (x*w).sum()
def f(x): return x**2
plot_function
plot_function(f, 'x', 'x**2')
plot_function(f, 'x', 'x**2')
plt.scatter(-1.5, f(-1.5), color='red');
Calculating Gradients
- the gradients tell us how much we need to change each weight to make our model better
- \(\frac{rise}{run} = \frac{the \ change \ in \ value \ of \ the \ function}{the \ change \ in \ the \ value \ of \ the \ parameter}\)
- derivative of a function
- tells you how much a change in its parameters will change its result
- Khan Academy: Basic Derivatives
- when we know how our function will change, we know how to make it smaller
- the key to machine learning
- PyTorch is able to automatically compute the derivative of nearly any function
- The gradient only tells us the slope of the function
- it does not indicate exactly how far to adjust the parameters
- if the slope is large, more adjustments may be required
- if the slope is small, we may be close to the optimal value
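The rise-over-run idea can be tried numerically without any library. The sketch below approximates the slope of x**2 with a central difference; `square` and `numerical_slope` are hypothetical helper names, and this is only an approximation technique for building intuition, not how PyTorch computes derivatives.

```python
def square(x):
    return x ** 2

def numerical_slope(fn, x, h=1e-5):
    # Central difference: rise over run across a small interval around x.
    return (fn(x + h) - fn(x - h)) / (2 * h)

slope_at_3 = numerical_slope(square, 3.0)   # the analytic derivative 2x gives 6
slope_at_0 = numerical_slope(square, 0.0)   # at the minimum of x**2 the slope is 0

print(round(slope_at_3, 4), round(slope_at_0, 4))
```

A large slope (as at x = 3) suggests the parameter is still far from the minimum, while a slope near zero suggests it is close.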
Tensor.requires_grad
- https://pytorch.org/docs/stable/generated/torch.Tensor.requires_grad.html
- is True if gradients need to be computed for the Tensor
- here gradient refers to the value of a function’s derivative at a particular argument value
- The PyTorch API puts the focus onto the argument, not the function
xt = tensor(3.).requires_grad_()
yt = f(xt)
yt
tensor(9., grad_fn=<PowBackward0>)
Tensor.grad_fn
- https://pytorch.org/tutorials/beginner/former_torchies/autograd_tutorial.html#tensors-that-track-history
- references the function that created the tensor
yt.grad_fn
<PowBackward0 at 0x7f91e90a6670>
Tensor.backward()
- https://pytorch.org/docs/stable/generated/torch.Tensor.backward.html#torch.Tensor.backward
- Computes the gradient of current tensor w.r.t. graph leaves.
- uses the chain rule
- backward refers to backpropagation
- the process of calculating the derivative for each layer
yt.backward()
The derivative of f(x) = x**2 is 2x, so the derivative at x=3 is 6
xt.grad
tensor(6.)
Derivatives should be 6, 8, 20
xt = tensor([3.,4.,10.]).requires_grad_()
xt
tensor([ 3., 4., 10.], requires_grad=True)
def f(x): return (x**2).sum()
yt = f(xt)
yt
tensor(125., grad_fn=<SumBackward0>)
yt.backward()
xt.grad
tensor([ 6., 8., 20.])
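To see the chain rule at work on a composed function, and to see why gradients are usually reset between steps, here is a small sketch in plain PyTorch. The function (3x + 1)**2 is an arbitrary choice made for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# y is a composition: an inner linear function followed by a square.
y = (3 * x + 1) ** 2

# backward() applies the chain rule: dy/dx = 2 * (3x + 1) * 3.
y.backward()
first_grad = x.grad.clone()

# Calling backward() on a fresh result adds to the stored gradient
# rather than replacing it.
y2 = (3 * x + 1) ** 2
y2.backward()
accumulated_grad = x.grad.clone()

print(first_grad, accumulated_grad)
```

At x = 2 the chain rule gives 2 * 7 * 3 = 42; the second call doubles the stored value to 84 because gradients accumulate until they are cleared.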
Stepping with a Learning Rate
- nearly all approaches to updating model parameters start with multiplying the gradient by some small number called the learning rate
- Learning rate is often a number between 0.001 and 0.1
- although it could be any value
- stepping: adjusting your model parameters
- size of step is determined by the learning rate
- picking a learning rate that is too small means more steps are needed to reach the optimal parameter values
- picking a learning rate that is too big can result in the loss getting worse or bouncing around the same range of values
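The effect of the learning rate can be seen on the simplest possible case, minimizing f(x) = x**2 by hand. The helper `descend` and the three learning rates below are chosen purely for illustration.

```python
def descend(lr, steps=20, start=3.0):
    """Run gradient descent on f(x) = x**2 and return the final |x|."""
    x = start
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2
        x = x - lr * grad     # step against the gradient
    return abs(x)

too_small = descend(lr=0.001)   # barely moves away from the start
reasonable = descend(lr=0.1)    # approaches the minimum at 0
too_big = descend(lr=1.1)       # overshoots further on every step

print(too_small, reasonable, too_big)
```

Each step multiplies x by (1 - 2 * lr), so the value shrinks toward 0 only while that factor has magnitude below 1; with lr = 1.1 the factor is -1.2 and the distance from the minimum grows.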
An End-to-End SGD Example
- Steps to turn function into classifier
- Initialize the weights
- initialize parameters to random values
- For each image, use these weights to predict whether it appears to be a 3 or a 7.
- Based on these predictions, calculate how good the model is (its loss)
- “testing the effectiveness of any current weight assignment in terms of actual performance”
- need a function that will return a number that is small when performance is good
- standard convention is to treat a small loss as good and a large loss as bad
- Calculate the gradient, which measures for each weight how changing that weight would change the loss
- use calculus to determine whether to increase or decrease individual weight values
- Step (update) the weights based on that calculation
- Go back to step 2 and repeat the process
- Iterate until you decide to stop the training process
- until either the model is good enough, the model accuracy starts to decrease or you don’t want to wait any longer
Scenario: build a model of how the speed of a rollercoaster changes over time
torch.arange()
- https://pytorch.org/docs/stable/generated/torch.arange.html?highlight=arange#torch.arange
- Returns a 1-D tensor of size \(\left\lceil \frac{\text{end} - \text{start}}{\text{step}} \right\rceil\) with values from the interval [start, end) taken with common difference step beginning from start.
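The size formula can be checked on a case where the step does not divide the interval evenly. The values of `start`, `end`, and `step` below are arbitrary.

```python
import math
import torch

start, end, step = 0, 10, 3
values = torch.arange(start, end, step)

# The size formula from the documentation: ceil((end - start) / step).
expected_size = math.ceil((end - start) / step)

print(values, expected_size)
```

The interval is half-open, so 9 is included but 12 is not, giving four values.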
time = torch.arange(0,20).float();
print(time)
tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19.])
torch.randn()
- https://pytorch.org/docs/stable/generated/torch.randn.html?highlight=randn#torch.randn
- Returns a tensor filled with random numbers from a normal distribution with mean 0 and variance 1 (also called the standard normal distribution)
matplotlib.pyplot.scatter()
- https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html
- A scatter plot of y vs. x with varying marker size and/or color.
# Add some random noise to mimic manually measuring the speed
speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1
plt.scatter(time,speed);
# A quadratic function with trainable parameters
def f(t, params):
    a,b,c = params
    return a*(t**2) + (b*t) + c
# Despite the name, this returns the root mean squared error (note the final sqrt)
def mse(preds, targets): return ((preds-targets)**2).mean().sqrt()
Step 1: Initialize the parameters
# Initialize trainable parameters with random values
# Let PyTorch know that we want to track the gradients
params = torch.randn(3).requires_grad_()
params
tensor([-0.7658, -0.7506, 1.3525], requires_grad=True)
orig_params = params.clone()
Step 2: Calculate the predictions
preds = f(time, params)
print(preds.shape)
preds
torch.Size([20])
tensor([ 1.3525e+00, -1.6391e-01, -3.2121e+00, -7.7919e+00, -1.3903e+01, -2.1547e+01, -3.0721e+01, -4.1428e+01, -5.3666e+01, -6.7436e+01, -8.2738e+01, -9.9571e+01, -1.1794e+02, -1.3783e+02,
-1.5926e+02, -1.8222e+02, -2.0671e+02, -2.3274e+02, -2.6029e+02, -2.8938e+02], grad_fn=<AddBackward0>)
def show_preds(preds, ax=None):
    if ax is None: ax=plt.subplots()[1]
    ax.scatter(time, speed)
    ax.scatter(time, to_np(preds), color='red')
    ax.set_ylim(-300,100)
show_preds(preds)
Step 3: Calculate the loss
- goal is to minimize this value
loss = mse(preds, speed)
loss
tensor(160.6979, grad_fn=<SqrtBackward0>)
Step 4: Calculate the gradients
loss.backward()
params.grad
tensor([-165.5151, -10.6402, -0.7900])
# Set learning rate to 0.00001
lr = 1e-5
# Multiply the gradients by the learning rate
params.grad * lr
tensor([-1.6552e-03, -1.0640e-04, -7.8996e-06])
params
tensor([-0.7658, -0.7506, 1.3525], requires_grad=True)
Step 5: Step the weights.
# Using a learning rate of 0.0001 for larger steps
lr = 1e-4
# Update the parameter values
params.data -= lr * params.grad.data
# Reset the computed gradients
params.grad = None
# Test the updated parameter values
preds = f(time,params)
mse(preds, speed)
tensor(157.9476, grad_fn=<SqrtBackward0>)
show_preds(preds)
def apply_step(params, prn=True):
    preds = f(time, params)
    loss = mse(preds, speed)
    loss.backward()
    params.data -= lr * params.grad.data
    params.grad = None
    if prn: print(loss.item())
    return preds
Step 6: Repeat the process
for i in range(10): apply_step(params)
157.9476318359375
155.1999969482422
152.45513916015625
149.71319580078125
146.97434997558594
144.23875427246094
141.50660705566406
138.77809143066406
136.05340576171875
133.33282470703125
_,axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()
Many steps later…
_,axs = plt.subplots(1,4,figsize=(12,3))
for ax in axs: show_preds(apply_step(params, False), ax)
plt.tight_layout()
Step 7: Stop
- Watch the training and validation losses and our metrics to decide when to stop
Summarizing Gradient Descent
- Initial model weights can be randomly initialized or from a pretrained model
- Compare the model output with our labeled training data using a loss function
- The loss function returns a number that we want to minimize by improving the model weights
- We change the weights a little bit to make the model slightly better based on gradients calculated using calculus
- the magnitude of the gradients indicates how big of a step needs to be taken
- Multiply the gradients by a learning rate to control how big of a change to make for each update
- Iterate
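The summary above fits in a few lines of plain PyTorch. This is a sketch on synthetic data from a known line, not the rollercoaster example; it updates the parameters inside `torch.no_grad()` rather than through `.data`, which is a substitution made here, and the learning rate and step count are arbitrary.

```python
import torch

torch.manual_seed(0)

# Synthetic data from a known line, y = 2x + 1 (illustrative only).
x = torch.linspace(-1, 1, 20)
y = 2 * x + 1

# Randomly initialize the weights and ask PyTorch to track gradients.
params = torch.randn(2).requires_grad_()
lr = 0.1
losses = []

for _ in range(200):
    preds = params[0] * x + params[1]      # predict
    loss = ((preds - y) ** 2).mean()       # loss (mean squared error)
    loss.backward()                        # gradients
    with torch.no_grad():
        params -= lr * params.grad         # step, scaled by the learning rate
    params.grad = None                     # reset before the next pass
    losses.append(loss.item())

print(losses[0], losses[-1], params.detach())
```

After the loop the two parameters should sit close to the slope 2 and intercept 1 that generated the data.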
The MNIST Loss Function
- Khan Academy: Intro to Matrix Multiplication
- Accuracy is not useful as a loss function
- accuracy only changes when prediction changes from a 3 to a 7 or vice versa
- its derivative is 0 almost everywhere
- need a loss function that gives a slightly better loss when our weights result in slightly better prediction
torch.cat()
- https://pytorch.org/docs/stable/generated/torch.cat.html
- Concatenates a given sequence of tensors in the specified dimension
- All tensors must have the same shape except in the specified dimension
Tensor.view()
- https://pytorch.org/docs/stable/generated/torch.Tensor.view.html#torch.Tensor.view
- Returns a new tensor with the same data as the self tensor but of a different shape.
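A toy sketch of the two calls, using small made-up tensors in place of the stacked digit images:

```python
import torch

a = torch.zeros(3, 2, 2)   # three 2x2 "images"
b = torch.ones(5, 2, 2)    # five 2x2 "images"

# cat joins along an existing dimension (dimension 0 by default).
both = torch.cat([a, b])

# view reshapes without copying; -1 lets PyTorch infer that axis.
flat = both.view(-1, 2 * 2)

# The view shares storage with the original tensor.
flat[0, 0] = 7.0

print(both.shape, flat.shape, both[0, 0, 0].item())
```

Writing through `flat` changes `both` as well, which is what "the same data" means in the description of `view`.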
# 1. Concatenate all independent variables into a single tensor
# 2. Flatten each image matrix into a vector
# -1: infer the size of this axis so that all the data fits
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
train_x.shape
torch.Size([12396, 784])
# Label 3s as `1` and label 7s as `0`
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)
train_y.shape
torch.Size([12396, 1])
# Combine independent and dependent variables into a dataset
dset = list(zip(train_x,train_y))
x,y = dset[0]
x.shape,y
(torch.Size([784]), tensor([1]))
valid_x = torch.cat([valid_3_tens, valid_7_tens]).view(-1, 28*28)
valid_y = tensor([1]*len(valid_3_tens) + [0]*len(valid_7_tens)).unsqueeze(1)
valid_dset = list(zip(valid_x,valid_y))
# Randomly initialize parameters
def init_params(size, std=1.0): return (torch.randn(size)*std).requires_grad_()
# Initialize weight values
weights = init_params((28*28,1))
# Initialize bias values
bias = init_params(1)
# Calculate a prediction for a single image
(train_x[0]*weights.T).sum() + bias
tensor([-6.2330], grad_fn=<AddBackward0>)
Matrix Multiplication
# Matrix multiplication using loops
def mat_mul(m1, m2):
    result = []
    for m1_r in range(len(m1)):
        for m2_r in range(len(m2[0])):
            sum_val = 0
            for c in range(len(m1[0])):
                sum_val += m1[m1_r][c] * m2[c][m2_r]
            result += [sum_val]
    return result
# Create copies of the tensors that don't require gradients
train_x_clone = train_x.clone().detach()
weights_clone = weights.clone().detach()
%%time
# Matrix multiplication using @ operator
(train_x_clone@weights_clone)[:5]
CPU times: user 2.35 ms, sys: 4.15 ms, total: 6.5 ms
Wall time: 5.29 ms
tensor([[ -6.5802],
[-10.9860],
[-21.2337],
[-18.2173],
[ -1.7079]], device='cuda:0')
%%time
# This is why you should avoid using loops
mat_mul(train_x_clone, weights_clone)[:5]
CPU times: user 1min 37s, sys: 28 ms, total: 1min 37s
Wall time: 1min 37s
[tensor(-6.5802, device='cuda:0'),
tensor(-10.9860, device='cuda:0'),
tensor(-21.2337, device='cuda:0'),
tensor(-18.2173, device='cuda:0'),
tensor(-1.7079, device='cuda:0')]
# Move tensor copies to GPU
train_x_clone = train_x_clone.to('cuda');
weights_clone = weights_clone.to('cuda');
%%time
(train_x_clone@weights_clone)[:5]
CPU times: user 2.19 ms, sys: 131 µs, total: 2.32 ms
Wall time: 7.78 ms
tensor([[ -6.5802],
[-10.9860],
[-21.2337],
[-18.2173],
[ -1.7079]], device='cuda:0')
# Over 86,000 times faster on GPU (the ratio below uses 44.9 s for the loop and 522 µs for the GPU, presumably from a separate run)
print(f"{(44.9 * 1e+6) / 522:,.2f}")
86,015.33
# Define a linear layer
# Matrix-multiply xb and weights and add the bias
def linear1(xb): return xb@weights + bias
preds = linear1(train_x)
preds
tensor([[ -6.2330],
[-10.6388],
[-20.8865],
...,
[-15.9176],
[ -1.6866],
[-11.3568]], grad_fn=<AddBackward0>)
# Determine which predictions were correct
corrects = (preds>0.0).float() == train_y
corrects
tensor([[False],
[False],
[False],
...,
[ True],
[ True],
[ True]])
# Calculate the current model accuracy
corrects.float().mean().item()
0.5379961133003235
# Test a small change in the weights
with torch.no_grad():
    weights[0] *= 1.0001
preds = linear1(train_x)
((preds>0.0).float() == train_y).float().mean().item()
0.5379961133003235
trgts = tensor([1,0,1])
prds = tensor([0.9, 0.4, 0.2])
torch.where(condition, x, y)
- https://pytorch.org/docs/stable/generated/torch.where.html
- Return a tensor of elements selected from either x or y, depending on condition
# Measures how distant each prediction is from 1 if it should be one
# and how distant it is from 0 if it should be 0 and take the mean of those distances
# returns a lower number when predictions are more accurate
# Assumes that all predictions are between 0 and 1
def mnist_loss(predictions, targets):
    return torch.where(targets==1, 1-predictions, predictions).mean()
torch.where(trgts==1, 1-prds, prds)
tensor([0.1000, 0.4000, 0.8000])
mnist_loss(prds,trgts)
tensor(0.4333)
mnist_loss(tensor([0.9, 0.4, 0.8]),trgts)
tensor(0.2333)
Sigmoid Function
- always returns a value between 0 and 1
- the function is a smooth curve that only goes up
- makes it easier for SGD to find meaningful gradients
torch.exp(x)
- https://pytorch.org/docs/stable/generated/torch.exp.html
- returns \(e^{x}\) where \(e\) is Euler’s number (https://en.wikipedia.org/wiki/E_(mathematical_constant))
- \(e \approx 2.7183\)
print(torch.exp(tensor(1)))
print(torch.exp(tensor(2)))
tensor(2.7183)
tensor(7.3891)
# Always returns a number between 0 and 1
def sigmoid(x): return 1/(1+torch.exp(-x))
plot_function(torch.sigmoid, title='Sigmoid', min=-4, max=4);
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    return torch.where(targets==1, 1-predictions, predictions).mean()
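To see why a sigmoid-based loss suits SGD better than accuracy, the sketch below computes both on three made-up raw outputs. `sigmoid_loss` is a hypothetical stand-in with the same body as the function above, defined locally so the example runs on its own.

```python
import torch

def sigmoid_loss(predictions, targets):
    # Squash to (0, 1), then measure the distance from the correct label.
    predictions = predictions.sigmoid()
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

raw = torch.tensor([4.0, -2.0, -1.0], requires_grad=True)  # unbounded outputs
targets = torch.tensor([1, 0, 1])

loss = sigmoid_loss(raw, targets)
loss.backward()

# Accuracy at a 0.5 threshold, for comparison; it is a step function.
accuracy = ((raw.sigmoid() > 0.5).long() == targets).float().mean()

print(loss.item(), raw.grad, accuracy.item())
```

Every entry of `raw.grad` is non-zero, so each output gets a signal about which direction improves the loss, whereas the accuracy stays at 2/3 until some prediction crosses the threshold.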
SGD and Mini-Batches
- calculating the loss for the entire dataset would take a lot of time
- the full dataset is also unlikely to fit in memory
- calculating the loss for a single data item would result in an imprecise and unstable gradient
- we can compromise by calculating the loss for a few data items at a time
- mini-batch: a subset of data items
- batch size: the number of data items in a mini-batch
- larger batch-size
- typically results in a more accurate and stable estimate of your dataset’s gradient from the loss function
- takes longer per mini-batch
- fewer mini-batches processed per epoch
- the batch size is limited by the amount of available memory for the CPU or GPU
- ideal batch-size is context dependent
- accelerators like GPUs work best when they have lots of work to do at a time
- typically want to use the largest batch-size that will fit in GPU memory
- typically want to randomly shuffle the contents of mini-batches for each epoch
- DataLoader
- handles shuffling and mini-batch collation
- can take any Python collection and turn it into an iterator over many batches
- PyTorch Dataset: a collection that contains tuples of independent and dependent variables
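The shuffling and batching that a DataLoader handles can be sketched by hand. The data and batch size here are arbitrary choices for illustration, and this is not how fastai implements it:

```python
import torch

torch.manual_seed(0)
data = torch.arange(10)            # ten hypothetical data items
batch_size = 4

# Draw a new random order for this epoch, then slice it into mini-batches
perm = torch.randperm(len(data))
batches = [data[perm[i:i + batch_size]] for i in range(0, len(data), batch_size)]

for b in batches:
    print(b)
```

The last mini-batch is smaller whenever the dataset size is not a multiple of the batch size.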
In-Place Operations:
- methods in PyTorch that end in an underscore modify their objects in place
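A minimal illustration of the underscore convention, using `add` versus `add_` and `zero_` on a throwaway tensor:

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0])

u = t.add(1)        # no underscore: returns a new tensor, t is unchanged
t.add_(1)           # trailing underscore: modifies t itself
same_values = torch.equal(t, u)

t.zero_()           # in-place fill with zeros, as used later for gradients
print(u, t, same_values)
```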
PyTorch DataLoader:
- https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
- Combines a dataset and a sampler, and provides an iterable over the given dataset.
- supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning
PyTorch Dataset:
- https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset
- an abstract class representing a dataset
Map-style datasets:
- implements the __getitem__() and __len__() protocols, and represents a map from indices/keys to data samples
Iterable-style datasets:
- an instance of a subclass of IterableDataset that implements the __iter__() protocol, and represents an iterable over data samples
- particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data
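The two dataset styles can be compared with a small sketch. The class names `SquaresMap` and `SquaresStream` are invented for this example, and the PyTorch DataLoader is used here rather than the fastai one:

```python
import torch
from torch.utils.data import Dataset, IterableDataset, DataLoader

class SquaresMap(Dataset):
    """Map-style: random access by index via __getitem__ and __len__."""
    def __init__(self, n): self.n = n
    def __len__(self): return self.n
    def __getitem__(self, i): return i, i * i

class SquaresStream(IterableDataset):
    """Iterable-style: samples are produced one after another via __iter__."""
    def __init__(self, n): self.n = n
    def __iter__(self): return ((i, i * i) for i in range(self.n))

# Both styles can be batched by the same DataLoader
map_batches = list(DataLoader(SquaresMap(6), batch_size=4))
stream_batches = list(DataLoader(SquaresStream(6), batch_size=4))

print(map_batches)
print(stream_batches)
```

Shuffling is only available for the map-style dataset, since an iterable-style dataset has no indices to permute.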
fastai DataLoader:
- https://docs.fast.ai/data.load.html#DataLoader
- API compatible with PyTorch DataLoader, with a lot more callbacks and flexibility
DataLoader
fastai.data.load.DataLoader
# Sample collection
coll = range(15)
coll
range(0, 15)
# Sample collection
coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)
[tensor([ 0, 7, 4, 5, 11]),
tensor([ 9, 3, 8, 14, 6]),
tensor([12, 2, 1, 10, 13])]
# Sample dataset of independent and dependent variables
ds = L(enumerate(string.ascii_lowercase))
ds
(#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]
dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
[(tensor([20, 18, 21, 5, 6, 9]), ('u', 's', 'v', 'f', 'g', 'j')),
(tensor([13, 19, 12, 16, 25, 3]), ('n', 't', 'm', 'q', 'z', 'd')),
(tensor([15, 1, 0, 24, 10, 23]), ('p', 'b', 'a', 'y', 'k', 'x')),
(tensor([11, 22, 2, 4, 14, 17]), ('l', 'w', 'c', 'e', 'o', 'r')),
(tensor([7, 8]), ('h', 'i'))]
Putting It All Together
# Randomly initialize parameters
weights = init_params((28*28,1))
bias = init_params(1)
# Create data loader for training dataset
dl = DataLoader(dset, batch_size=256)
fastcore first():
- https://fastcore.fast.ai/basics.html#first
- First element of x, optionally filtered by f, or None if missing
first
<function fastcore.basics.first(x, f=None, negate=False, **kwargs)>
# Get the first mini-batch from the data loader
xb,yb = first(dl)
xb.shape,yb.shape
(torch.Size([256, 784]), torch.Size([256, 1]))
# Create data loader for validation dataset
valid_dl = DataLoader(valid_dset, batch_size=256)
# Smaller example mini-batch for testing
batch = train_x[:4]
batch.shape
torch.Size([4, 784])
# Test the model on the smaller mini-batch
preds = linear1(batch)
preds
tensor([[ -9.2139],
[-20.0299],
[-16.8065],
[-14.1171]], grad_fn=<AddBackward0>)
# Calculate the loss
loss = mnist_loss(preds, train_y[:4])
loss
tensor(1.0000, grad_fn=<MeanBackward0>)
# Compute the gradients
loss.backward()
weights.grad.shape,weights.grad.mean(),bias.grad
(torch.Size([784, 1]), tensor(-3.5910e-06), tensor([-2.5105e-05]))
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(),bias.grad
(tensor(-7.1820e-06), tensor([-5.0209e-05]))
Note: loss.backward() adds the gradients of loss to any gradients that are currently stored, which is why the values above doubled on the second call. This means we need to zero the gradients before computing new ones.
calc_grad(batch, train_y[:4], linear1)
weights.grad.mean(),bias.grad
(tensor(-1.0773e-05), tensor([-7.5314e-05]))
weights.grad.zero_(); bias.grad.zero_()
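The accumulation behaviour can be reproduced in isolation. This sketch uses a toy loss (the sum of squares, whose gradient is 2*w) instead of the MNIST model:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def loss_fn(w):
    return (w ** 2).sum()          # gradient is 2*w

loss_fn(w).backward()
first = w.grad.clone()             # gradient from one backward pass

loss_fn(w).backward()              # second call adds to the stored gradient
accumulated = w.grad.clone()

w.grad.zero_()                     # reset in place before the next pass
loss_fn(w).backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```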
def train_epoch(model, lr, params):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            # Assign directly to the data attribute to prevent
            # PyTorch from taking the gradient of that step
            p.data -= p.grad*lr
            p.grad.zero_()
# Calculate accuracy using broadcasting
(preds>0.0).float() == train_y[:4]
tensor([[False],
[False],
[False],
[False]])
def batch_accuracy(xb, yb):
    preds = xb.sigmoid()
    correct = (preds>0.5) == yb
    return correct.float().mean()
batch_accuracy(linear1(batch), train_y[:4])
tensor(0.)
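The earlier accuracy check thresholds raw outputs at 0.0, while batch_accuracy thresholds sigmoid outputs at 0.5. Because sigmoid(0) = 0.5 and the function only goes up, the two tests should agree. A small sketch with made-up values:

```python
import torch

logits = torch.tensor([-3.0, -0.1, 0.2, 4.0])   # hypothetical raw outputs

by_zero = logits > 0.0               # threshold on raw outputs
by_half = logits.sigmoid() > 0.5     # threshold after the sigmoid

print(by_zero, by_half)
```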
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb,yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
validate_epoch(linear1)
0.3407
lr = 1.
params = weights,bias
# Train for one epoch
train_epoch(linear1, lr, params)
validate_epoch(linear1)
0.6138
# Train for twenty epochs
for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')
0.7358 0.9052 0.9438 0.9575 0.9638 0.9692 0.9726 0.9741 0.975 0.976 0.9765 0.9765 0.9765 0.9779 0.9784 0.9784 0.9784 0.9784 0.9789 0.9784
Note: Accuracy improves from 0.7358 to 0.9784
Creating an Optimizer
Why we need Non-Linear activation functions
- a series of any number of linear layers in a row can be replaced with a single linear layer with different parameters
- adding a non-linear layer between linear layers helps decouple the linear layers from each other so they can learn separate features
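The first point can be checked numerically. In this sketch (all values are arbitrary, hand-picked numbers), two stacked linear layers give the same output as a single layer with weights W1 @ W2 and bias b1 @ W2 + b2, while inserting a ReLU breaks that equivalence:

```python
import torch

x = torch.tensor([[1.0, -1.0], [2.0, 0.5]])
W1, b1 = torch.tensor([[1.0, 2.0], [3.0, -1.0]]), torch.tensor([0.5, -0.5])
W2, b2 = torch.tensor([[1.0], [-2.0]]), torch.tensor([0.1])

# Two linear layers in a row
two_layers = (x @ W1 + b1) @ W2 + b2

# The equivalent single linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

# Same two layers with a ReLU in between
with_relu = torch.relu(x @ W1 + b1) @ W2 + b2

print(two_layers, one_layer, with_relu)
```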
torch.nn:
- https://pytorch.org/docs/stable/nn.html
- provides the basic building blocks for building PyTorch models
nn.Linear():
- https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
- Applies a linear transformation to the incoming data: \(y=xA^{T}+b\)
- contains both the weights and biases in a single class
- inherits from nn.Module()
nn.Module():
- https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module
- Base class for all neural network modules
- every PyTorch model should subclass this class
- modules can contain other modules
- submodules can be assigned as regular attributes
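A minimal sketch of these points. The class name `TinyNet` and the layer sizes are made up for illustration:

```python
import torch
from torch import nn

class TinyNet(nn.Module):   # hypothetical example module
    def __init__(self):
        super().__init__()
        # Submodules assigned as regular attributes are registered automatically
        self.hidden = nn.Linear(4, 3)
        self.out = nn.Linear(3, 1)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

net = TinyNet()
names = [name for name, _ in net.named_parameters()]
y = net(torch.zeros(2, 4))

print(names, y.shape)
```

The parameters of both submodules show up through the parent module, which is what lets an optimizer be built from `net.parameters()`.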
nn.Linear
torch.nn.modules.linear.Linear
linear_model = nn.Linear(28*28,1)
linear_model
Linear(in_features=784, out_features=1, bias=True)
nn.Parameter():
- https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html#torch.nn.parameter.Parameter
- A Tensor subclass
- A kind of Tensor that is to be considered a module parameter.
w,b = linear_model.parameters()
w.shape,b.shape
(torch.Size([1, 784]), torch.Size([1]))
print(type(w))
print(type(b))
<class 'torch.nn.parameter.Parameter'>
<class 'torch.nn.parameter.Parameter'>
b
Parameter containing:
tensor([0.0062], requires_grad=True)
# Implements the basic optimization steps used earlier for use with a PyTorch Module
class BasicOptim:
    def __init__(self,params,lr): self.params,self.lr = list(params),lr
    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr
    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None
# PyTorch optimizers need a reference to the target model parameters
opt = BasicOptim(linear_model.parameters(), lr)
def train_epoch(model):
    for xb,yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()
validate_epoch(linear_model)
0.4673
def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')
train_model(linear_model, 20)
0.4932 0.8193 0.8467 0.9155 0.935 0.9477 0.956 0.9629 0.9653 0.9682 0.9697 0.9731 0.9741 0.9751 0.9761 0.9765 0.9775 0.978 0.9785 0.9785
Note: The PyTorch version arrives at almost exactly the same accuracy as the hand-crafted version
fastai SGD():
- https://docs.fast.ai/optimizer.html#SGD
- An Optimizer for SGD with lr and mom and params
- by default does the same thing as BasicOptim
SGD
<function fastai.optimizer.SGD(params, lr, mom=0.0, wd=0.0, decouple_wd=True)>
linear_model = nn.Linear(28*28,1)
opt = SGD(linear_model.parameters(), lr)
train_model(linear_model, 20)
0.4932 0.8135 0.8481 0.916 0.9341 0.9487 0.956 0.9634 0.9653 0.9673 0.9692 0.9717 0.9746 0.9751 0.9756 0.9765 0.9775 0.9775 0.978 0.978
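To see that a plain SGD step matches BasicOptim, the sketch below compares one update from BasicOptim with one from `torch.optim.SGD`. This swaps in PyTorch's optimizer rather than fastai's SGD, and it uses a small made-up regression problem instead of MNIST:

```python
import torch
from torch import nn

class BasicOptim:
    def __init__(self, params, lr): self.params, self.lr = list(params), lr
    def step(self):
        for p in self.params: p.data -= p.grad.data * self.lr
    def zero_grad(self):
        for p in self.params: p.grad = None

torch.manual_seed(0)
model_a = nn.Linear(3, 1)
model_b = nn.Linear(3, 1)
model_b.load_state_dict(model_a.state_dict())   # identical starting weights

x, y = torch.randn(8, 3), torch.randn(8, 1)     # hypothetical data
opt_a = BasicOptim(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)

# One identical training step for each model/optimizer pair
for model, opt in ((model_a, opt_a), (model_b, opt_b)):
    loss = ((model(x) - y) ** 2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()

print(model_a.weight, model_b.weight)
```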
dls = DataLoaders(dl, valid_dl)
fastai Learner:
- https://docs.fast.ai/learner.html#Learner
- Group together a model, some data loaders, an optimizer and a loss function to handle training
Learner
fastai.learner.Learner
learn = Learner(dls, nn.Linear(28*28,1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
fastai Learner.fit:
- https://docs.fast.ai/learner.html#Learner.fit
- fit a model for a specified number of epochs using a specified learning rate
lr
1.0
learn.fit(10, lr=lr)
epoch | train_loss | valid_loss | batch_accuracy | time |
---|---|---|---|---|
0 | 0.635737 | 0.503216 | 0.495584 | 00:00 |
1 | 0.443481 | 0.246651 | 0.777723 | 00:00 |
2 | 0.165904 | 0.159723 | 0.857704 | 00:00 |
3 | 0.074277 | 0.099495 | 0.918057 | 00:00 |
4 | 0.040486 | 0.074255 | 0.934740 | 00:00 |
5 | 0.027243 | 0.060227 | 0.949951 | 00:00 |
6 | 0.021766 | 0.051380 | 0.956330 | 00:00 |
7 | 0.019304 | 0.045439 | 0.962709 | 00:00 |
8 | 0.018036 | 0.041227 | 0.965653 | 00:00 |
9 | 0.017262 | 0.038097 | 0.968106 | 00:00 |
Adding a Nonlinearity
def simple_net(xb):
    # Linear layer
    res = xb@w1 + b1
    # ReLU activation layer
    res = res.max(tensor(0.0))
    # Linear layer
    res = res@w2 + b2
    return res
w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)
PyTorch F.relu:
- https://pytorch.org/docs/stable/generated/torch.nn.functional.relu.html#torch.nn.functional.relu
- Applies the rectified linear unit function element-wise.
- \(\text{ReLU}(x) = (x)^+ = \max(0, x)\)
F.relu
<function torch.nn.functional.relu(input: torch.Tensor, inplace: bool = False) -> torch.Tensor>
plot_function(F.relu)
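As a quick check of the formula, `F.relu` should agree with the `res.max(tensor(0.0))` form used in simple_net and with clamping at zero. The input values below are arbitrary:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

a = F.relu(x)
b = x.max(torch.tensor(0.0))   # element-wise maximum, as in simple_net
c = x.clamp(min=0)

print(a, b, c)
```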
nn.Sequential:
- https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html#torch.nn.Sequential
- A sequential container.
- Treats the whole container as a single module
- outputs from the previous layer are fed as input to the next layer in the list
simple_net = nn.Sequential(
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,1)
)
simple_net
Sequential(
(0): Linear(in_features=784, out_features=30, bias=True)
(1): ReLU()
(2): Linear(in_features=30, out_features=1, bias=True)
)
learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(40, 0.1)
epoch | train_loss | valid_loss | batch_accuracy | time |
---|---|---|---|---|
0 | 0.259396 | 0.417702 | 0.504416 | 00:00 |
1 | 0.128176 | 0.216283 | 0.818449 | 00:00 |
2 | 0.073893 | 0.111460 | 0.920020 | 00:00 |
3 | 0.050328 | 0.076076 | 0.941119 | 00:00 |
4 | 0.039086 | 0.059598 | 0.958292 | 00:00 |
5 | 0.033148 | 0.050273 | 0.964671 | 00:00 |
6 | 0.029618 | 0.044374 | 0.966634 | 00:00 |
7 | 0.027258 | 0.040340 | 0.969087 | 00:00 |
8 | 0.025527 | 0.037404 | 0.969578 | 00:00 |
9 | 0.024172 | 0.035167 | 0.971541 | 00:00 |
10 | 0.023068 | 0.033394 | 0.972522 | 00:00 |
11 | 0.022145 | 0.031943 | 0.973503 | 00:00 |
12 | 0.021360 | 0.030726 | 0.975466 | 00:00 |
13 | 0.020682 | 0.029685 | 0.974975 | 00:00 |
14 | 0.020088 | 0.028779 | 0.975466 | 00:00 |
15 | 0.019563 | 0.027983 | 0.975957 | 00:00 |
16 | 0.019093 | 0.027274 | 0.976448 | 00:00 |
17 | 0.018670 | 0.026638 | 0.977920 | 00:00 |
18 | 0.018285 | 0.026064 | 0.977920 | 00:00 |
19 | 0.017933 | 0.025544 | 0.978901 | 00:00 |
20 | 0.017610 | 0.025069 | 0.979392 | 00:00 |
21 | 0.017310 | 0.024635 | 0.979392 | 00:00 |
22 | 0.017032 | 0.024236 | 0.980373 | 00:00 |
23 | 0.016773 | 0.023869 | 0.980373 | 00:00 |
24 | 0.016531 | 0.023529 | 0.980864 | 00:00 |
25 | 0.016303 | 0.023215 | 0.981354 | 00:00 |
26 | 0.016089 | 0.022923 | 0.981354 | 00:00 |
27 | 0.015887 | 0.022652 | 0.981354 | 00:00 |
28 | 0.015695 | 0.022399 | 0.980864 | 00:00 |
29 | 0.015514 | 0.022164 | 0.981354 | 00:00 |
30 | 0.015342 | 0.021944 | 0.981354 | 00:00 |
31 | 0.015178 | 0.021738 | 0.981354 | 00:00 |
32 | 0.015022 | 0.021544 | 0.981845 | 00:00 |
33 | 0.014873 | 0.021363 | 0.981845 | 00:00 |
34 | 0.014731 | 0.021192 | 0.981845 | 00:00 |
35 | 0.014595 | 0.021031 | 0.982336 | 00:00 |
36 | 0.014464 | 0.020879 | 0.982826 | 00:00 |
37 | 0.014338 | 0.020735 | 0.982826 | 00:00 |
38 | 0.014217 | 0.020599 | 0.982826 | 00:00 |
39 | 0.014101 | 0.020470 | 0.982336 | 00:00 |
matplotlib.pyplot.plot:
- https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html
- Plot y versus x as lines and/or markers
plt.plot
<function matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)>
fastai learner.Recorder:
- https://docs.fast.ai/learner.html#Recorder
- Callback that registers statistics (lr, loss and metrics) during training
learn.recorder
Recorder
Recorder
fastai.learner.Recorder
fastcore L.itemgot():
- https://fastcore.fast.ai/foundation.html#L.itemgot
- Create new L with item idx of all items
L.itemgot
<function fastcore.foundation.L.itemgot(self, *idxs)>
plt.plot(L(learn.recorder.values).itemgot(2));
learn.recorder.values[-1][2]
0.98233562707901
Going Deeper
- deeper models: models with more layers
- deeper models become harder to optimize as more layers are added
- deeper models can need fewer parameters to reach the same performance
- we can use smaller matrices with more layers
- we can train the model more quickly using less memory
- typically perform better
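To make the "smaller matrices with more layers" point concrete, the sketch below counts parameters for one wide hidden layer versus two narrow ones. The layer sizes are arbitrary choices for illustration, and the comparison only shows parameter counts; it says nothing about which network is more accurate.

```python
from torch import nn

def n_params(model):
    # Total number of trainable values in the model
    return sum(p.numel() for p in model.parameters())

# One wide hidden layer
wide = nn.Sequential(nn.Linear(784, 1000), nn.ReLU(), nn.Linear(1000, 1))

# Two narrower hidden layers
deep = nn.Sequential(nn.Linear(784, 100), nn.ReLU(),
                     nn.Linear(100, 100), nn.ReLU(),
                     nn.Linear(100, 1))

print(n_params(wide), n_params(deep))
```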
dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.066122 | 0.008277 | 0.997547 | 00:04 |
learn.model
Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(4): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
(1): BasicBlock(
(conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(5): Sequential(
(0): BasicBlock(
(conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(6): Sequential(
(0): BasicBlock(
(conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(7): Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
)
(1): Sequential(
(0): AdaptiveConcatPool2d(
(ap): AdaptiveAvgPool2d(output_size=1)
(mp): AdaptiveMaxPool2d(output_size=1)
)
(1): Flatten(full=False)
(2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.25, inplace=False)
(4): Linear(in_features=1024, out_features=512, bias=False)
(5): ReLU(inplace=True)
(6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Dropout(p=0.5, inplace=False)
(8): Linear(in_features=512, out_features=2, bias=False)
)
)
Jargon Recap
- neural networks contain two types of numbers
- Parameters: numbers that are randomly initialized and optimized
- define the model
- Activations: numbers that are calculated using the parameter values
- tensors
- regularly-shaped arrays like a matrix
- have rows and columns
- called the axes or dimensions
- rank: the number of dimensions of a tensor
- Rank-0: scalar
- Rank-1: vector
- Rank-2: matrix
- a neural network contains a number of linear and non-linear layers
- non-linear layers are referred to as activation layers
- ReLU: a function that sets any negative values to zero
- Mini-batch: a small group of inputs and labels gathered together in two arrays to perform gradient descent
- Forward pass: Applying the model to some input and computing the predictions
- Loss: A value that represents how well (or badly) the model is doing
- Gradient: The derivative of the loss with respect to all model parameters
- Gradient descent: Taking a step in the direction opposite to the gradients to make the model parameters a little bit better
- Learning rate: The size of the step we take when applying SGD to update the parameters of the model
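Several of these terms can be seen together in a toy gradient descent loop. This sketch fits a single parameter to made-up data where y = 2x. The parameter starts at zero instead of a random value so the run is reproducible, and the whole dataset is used at each step rather than mini-batches.

```python
import torch

# Hypothetical toy data: y = 2x, so the ideal parameter value is 2
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = 2 * x

w = torch.zeros(1, requires_grad=True)    # parameter
lr = 0.05                                 # learning rate

for step in range(50):
    preds = w * x                         # forward pass
    loss = ((preds - y) ** 2).mean()      # loss
    loss.backward()                       # gradient of the loss w.r.t. w
    with torch.no_grad():
        w -= lr * w.grad                  # step opposite to the gradient
    w.grad.zero_()                        # clear the stored gradient

print(w.item(), loss.item())
```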
References
I’m Christian Mills, a deep learning consultant specializing in practical AI implementations. I help clients leverage cutting-edge AI technologies to solve real-world problems.