Notes on fastai Book Ch. 17
- A Neural Net from the Foundations
- Building a Neural Net Layer from Scratch
- The Forward and Backward Passes
- Conclusion
- References
import fastbook
fastbook.setup_book()
import inspect
def print_source(obj):
    for line in inspect.getsource(obj).split("\n"):
        print(line)
A Neural Net from the Foundations
Building a Neural Net Layer from Scratch
Modeling a Neuron
- a neuron receives a given number of inputs and has an internal weight for each of them
- the neuron sums the weighted inputs and adds an internal bias to produce an output
- \(out = \sum_{i=1}^{n}{x_{i}w_{i}} + b\), where \((x_{1},\ldots,x_{n})\) are the inputs, \((w_{1},\ldots,w_{n})\) are the weights, and \(b\) is the bias
output = sum([x*w for x,w in zip(inputs, weights)]) + bias
- the output of the neuron is fed to a nonlinear function called an activation function
- the output of the nonlinear activation function is fed as input to another neuron
- Rectified Linear Unit (ReLU) activation function:
def relu(x): return x if x >= 0 else 0
- a deep learning model is built by stacking a lot of neurons in successive layers
- a linear layer: all inputs are linked to each neuron in the layer
- need to compute the dot product for each input and each neuron with a given weight
sum([x*w for x,w in zip(input,weight)])
The output of a fully connected layer
- \(y_{i,j} = \sum_{k=1}^{n}{x_{i,k}w_{k,j}} + b_{j}\)
y[i,j] = sum([a*b for a,b in zip(x[i,:],w[j,:])]) + b[j]
y = x @ w.t() + b
- x: a matrix containing the inputs with a size of batch_size by n_inputs
- w: a matrix containing the weights for the neurons with a size of n_neurons by n_inputs
- b: a vector containing the biases for the neurons with a size of n_neurons
- y: the output of the fully connected layer with a size of batch_size by n_neurons
- @: matrix multiplication
- w.t(): the transpose matrix of w
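As a quick sanity check (a sketch that is not part of the book's code; the sizes below are arbitrary), the matrix form produces one output per neuron for every item in the batch:

import torch

# Hypothetical sizes chosen only for illustration
batch_size, n_inputs, n_neurons = 4, 3, 2
x = torch.randn(batch_size, n_inputs)   # inputs
w = torch.randn(n_neurons, n_inputs)    # one row of weights per neuron
b = torch.zeros(n_neurons)              # one bias per neuron
y = x @ w.t() + b
y.shape   # torch.Size([4, 2]) -> batch_size by n_neurons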
Matrix Multiplication from Scratch
- Need three nested loops
- for the row indices
- for the column indices
- for the inner sum
import torch
from torch import tensor
def matmul(a,b):
    # Get the number of rows and columns for the two matrices
    ar,ac = a.shape
    br,bc = b.shape
    # The number of columns in the first matrix needs to be
    # the same as the number of rows in the second matrix
    assert ac==br
    # Initialize the output matrix
    c = torch.zeros(ar, bc)
    # For each row in the first matrix
    for i in range(ar):
        # For each column in the second matrix
        for j in range(bc):
            # For each column in the first matrix:
            # element-wise multiplication, then sum the products
            for k in range(ac): c[i,j] += a[i,k] * b[k,j]
    return c
m1 = torch.randn(5,28*28)
m2 = torch.randn(784,10)
Using nested for-loops
%time t1=matmul(m1, m2)
CPU times: user 329 ms, sys: 0 ns, total: 329 ms
Wall time: 328 ms
%timeit -n 20 t1=matmul(m1, m2)
325 ms ± 801 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
Note: Using loops is extremely inefficient!!! Avoid loops whenever possible.
Using PyTorch’s built-in matrix multiplication operator
- written in C++ to make it fast
- need to vectorize operations on tensors to take advantage of the speed of PyTorch
- use element-wise arithmetic and broadcasting
%time t2=m1@m2
CPU times: user 190 µs, sys: 0 ns, total: 190 µs
Wall time: 132 µs
%timeit -n 20 t2=m1@m2
The slowest run took 9.84 times longer than the fastest. This could mean that an intermediate result is being cached.
6.42 µs ± 8.4 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
Elementwise Arithmetic
- addition: +
- subtraction: -
- multiplication: *
- division: /
- greater than: >
- less than: <
- equal to: ==
Note: Both tensors need to have the same shape to perform element-wise arithmetic.
a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a + b
tensor([12., 14., 3.])
a < b
tensor([False, True, True])
Reduction Operators
- return tensors with only one element
- all: Tests if all elements evaluate to True.
- sum: Returns the sum of all elements in the tensor.
- mean: Returns the mean value of all elements in the tensor.
# Check if every element in matrix a is less than the corresponding element in matrix b
# Check if every element in matrix a is equal to the corresponding element in matrix b
((a < b).all(), (a == b).all())
(tensor(False), tensor(False))
# Convert tensor to a plain python number or boolean
(a + b).mean().item()
9.666666984558105
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
m*m
tensor([[ 1., 4., 9.],
[16., 25., 36.],
[49., 64., 81.]])
# Attempt to perform element-wise arithmetic on tensors with different shapes
n = tensor([[1., 2, 3], [4,5,6]])
m*n
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_38356/3763285369.py in <module>
1 n = tensor([[1., 2, 3], [4,5,6]])
----> 2 m*n
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0
# Replace the inner-most for-loop with element-wise arithmetic
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc): c[i,j] = (a[i] * b[:,j]).sum()
    return c
%timeit -n 20 t3 = matmul(m1,m2)
488 µs ± 159 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
Note: Just replacing one of the for loops with PyTorch element-wise arithmetic dramatically improved performance.
Broadcasting
- describes how tensors of different ranks are treated during arithmetic operations
- gives specific rules that codify when shapes are compatible for an element-wise operation and how the tensor with the smaller shape is expanded to match the tensor with the bigger shape
Broadcasting with a scalar
- the scalar is “virtually” expanded to the same shape as the tensor where every element contains the original scalar value
a = tensor([10., 6, -4])
a > 0
tensor([ True, True, False])
Note: Broadcasting with a scalar is useful when normalizing a dataset by subtracting the mean and dividing by the standard deviation.
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
(m - 5) / 2.73
tensor([[-1.4652, -1.0989, -0.7326],
[-0.3663, 0.0000, 0.3663],
[ 0.7326, 1.0989, 1.4652]])
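The 5 and 2.73 above stand in for dataset statistics; a minimal sketch of the same normalization pattern, reusing m from above and broadcasting its own (scalar) mean and standard deviation:

# The 0-dimensional results of m.mean() and m.std() broadcast over every element of m
(m - m.mean()) / m.std()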
Broadcasting a vector to a matrix
- the vector is virtually expanded to the same shape as the tensor, by duplicating the rows/columns as needed
- PyTorch uses the expand_as method to expand the vector to the same size as the higher-rank tensor
- creates a new view on the existing vector tensor without allocating new memory
- It is only possible to broadcast a vector of size n with a matrix of size m by n.
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
m.shape,c.shape
(torch.Size([3, 3]), torch.Size([3]))
m + c
tensor([[11., 22., 33.],
[14., 25., 36.],
[17., 28., 39.]])
c.expand_as(m)
tensor([[10., 20., 30.],
[10., 20., 30.],
[10., 20., 30.]])
help(torch.Tensor.expand_as)
Help on method_descriptor:
expand_as(...)
expand_as(other) -> Tensor
Expand this tensor to the same size as :attr:`other`.
``self.expand_as(other)`` is equivalent to ``self.expand(other.size())``.
Please see :meth:`~Tensor.expand` for more information about ``expand``.
Args:
other (:class:`torch.Tensor`): The result tensor has the same size
as :attr:`other`.
help(torch.Tensor.expand)
Help on method_descriptor:
expand(...)
expand(*sizes) -> Tensor
Returns a new view of the :attr:`self` tensor with singleton dimensions expanded
to a larger size.
Passing -1 as the size for a dimension means not changing the size of
that dimension.
Tensor can be also expanded to a larger number of dimensions, and the
new ones will be appended at the front. For the new dimensions, the
size cannot be set to -1.
Expanding a tensor does not allocate new memory, but only creates a
new view on the existing tensor where a dimension of size one is
expanded to a larger size by setting the ``stride`` to 0. Any dimension
of size 1 can be expanded to an arbitrary value without allocating new
memory.
Args:
*sizes (torch.Size or int...): the desired expanded size
.. warning::
More than one element of an expanded tensor may refer to a single
memory location. As a result, in-place operations (especially ones that
are vectorized) may result in incorrect behavior. If you need to write
to the tensors, please clone them first.
Example::
>>> x = torch.tensor([[1], [2], [3]])
>>> x.size()
torch.Size([3, 1])
>>> x.expand(3, 4)
tensor([[ 1, 1, 1, 1],
[ 2, 2, 2, 2],
[ 3, 3, 3, 3]])
>>> x.expand(-1, 4) # -1 means not changing the size of that dimension
tensor([[ 1, 1, 1, 1],
[ 2, 2, 2, 2],
[ 3, 3, 3, 3]])
Note: Expanding the vector does not increase the amount of data stored.
t = c.expand_as(m)
t.storage()
10.0
20.0
30.0
[torch.FloatStorage of size 3]
help(torch.Tensor.storage)
Help on method_descriptor:
storage(...)
storage() -> torch.Storage
Returns the underlying storage.
Note: PyTorch accomplishes this by giving the new dimension a stride of 0
* When PyTorch looks for the next row by adding the stride, it will stay at the same row
t.stride(), t.shape
((0, 1), torch.Size([3, 3]))
c + m
tensor([[11., 22., 33.],
[14., 25., 36.],
[17., 28., 39.]])
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6]])
c+m
tensor([[11., 22., 33.],
[14., 25., 36.]])
# Attempt to broadcast a vector with an incompatible matrix
c = tensor([10.,20])
m = tensor([[1., 2, 3], [4,5,6]])
c+m
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_38356/3928136702.py in <module>
1 c = tensor([10.,20])
2 m = tensor([[1., 2, 3], [4,5,6]])
----> 3 c+m
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
c = tensor([10.,20,30])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]])
# Expand the vector to broadcast across a different dimension
c = c.unsqueeze(1)
m.shape,c.shape, c
(torch.Size([3, 3]),
torch.Size([3, 1]),
tensor([[10.],
[20.],
[30.]]))
c.expand_as(m)
tensor([[10., 10., 10.],
[20., 20., 20.],
[30., 30., 30.]])
c+m
tensor([[11., 12., 13.],
[24., 25., 26.],
[37., 38., 39.]])
t = c.expand_as(m)
t.storage()
10.0
20.0
30.0
[torch.FloatStorage of size 3]
t.stride(), t.shape
((1, 0), torch.Size([3, 3]))
Note: By default, new broadcast dimensions are added at the beginning, using c.unsqueeze(0) behind the scenes.
c = tensor([10.,20,30])
c.shape, c.unsqueeze(0).shape, c.unsqueeze(1).shape
(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))
Note: The unsqueeze command can be replaced by None indexing.
c.shape, c[None,:].shape, c[:,None].shape
(torch.Size([3]), torch.Size([1, 3]), torch.Size([3, 1]))
Note: You can omit trailing colons when indexing, and ... means all preceding dimensions.
c[None].shape, c[...,None].shape
(torch.Size([1, 3]), torch.Size([3, 1]))
c, c.unsqueeze(-1)
(tensor([10., 20., 30.]),
tensor([[10.],
[20.],
[30.]]))
# Replace the second for loop with broadcasting
def matmul(a,b):
    ar,ac = a.shape
    br,bc = b.shape
    assert ac==br
    c = torch.zeros(ar, bc)
    for i in range(ar):
        # c[i,j] = (a[i,:] * b[:,j]).sum() # previous
        c[i] = (a[i].unsqueeze(-1) * b).sum(dim=0)
    return c
m1.shape, m1.unsqueeze(-1).shape
(torch.Size([5, 784]), torch.Size([5, 784, 1]))
m1[0].unsqueeze(-1).expand_as(m2).shape
torch.Size([784, 10])
%timeit -n 20 t4 = matmul(m1,m2)
414 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
Note: Even faster still, though the improvement is not as dramatic.
Broadcasting rules
- when operating on two tensors, PyTorch compares their shapes element-wise
- starts with the trailing dimensions and works its way backward
- adds 1 when it meets an empty dimension
- two dimensions are compatible when one of the following is true
- They are equal
- One of them is 1, in which case that dimension is broadcast to make it the same as the other
- arrays do not need to have the same number of dimensions
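For example, applying these rules to an image-shaped tensor and a per-channel vector (a sketch, not from the book's code; the shapes are illustrative):

# Shapes are compared from the trailing dimension:
#   image:  256 x 256 x 3
#   means:            3
# 3 matches 3, and the vector's missing leading dimensions are treated as 1,
# so it is broadcast across the first two dimensions.
image = torch.randn(256, 256, 3)
channel_means = torch.randn(3)
(image - channel_means).shape   # torch.Size([256, 256, 3])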
Einstein Summation
- a compact representation for combining products and sums in a general way
- \(ik,kj \rightarrow ij\)
- the lefthand side represents the operands' dimensions, separated by commas
- righthand side represents the result dimensions
- a practical way of expressing operations involving indexing and sum of products
Notation Rules
- Repeated indices are implicitly summed over.
- Each index can appear at most twice in any term.
- Each term must contain identical nonrepeated indices.
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)
%timeit -n 20 t5 = matmul(m1,m2)
The slowest run took 10.35 times longer than the fastest. This could mean that an intermediate result is being cached.
26.9 µs ± 37.3 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
Note: It is extremely fast, even compared to the earlier broadcast implementation.
- einsum is often the fastest way to do custom operations in PyTorch, without diving into C++ and CUDA
- still not as fast as carefully optimized CUDA code
Additional Einsum Notation Examples
x = torch.randn(2, 2)
print(x)
# Transpose
torch.einsum('ij->ji', x)
tensor([[ 0.7541, -0.8633],
[ 2.2312, 0.0933]])
tensor([[ 0.7541, 2.2312],
[-0.8633, 0.0933]])
x = torch.randn(2, 2)
y = torch.randn(2, 2)
z = torch.randn(2, 2)
print(x)
print(y)
print(z)
# Return a vector of size b where the k-th coordinate is the sum of x[k,i] y[i,j] z[k,j]
torch.einsum('bi,ij,bj->b', x, y, z)
tensor([[-0.2458, -0.7571],
[ 0.0921, 0.5496]])
tensor([[-1.2792, -0.0648],
[-0.2263, -0.1153]])
tensor([[-0.2433, 0.4558],
[ 0.8155, 0.5406]])
tensor([-0.0711, -0.2349])
# trace
x = torch.randn(2, 2)
x, torch.einsum('ii', x)
(tensor([[ 1.4828, -0.7057],
[-0.6288, 1.3791]]),
tensor(2.8619))
# diagonal
x = torch.randn(2, 2)
x, torch.einsum('ii->i', x)
(tensor([[-1.0796, 1.1161],
[ 2.2944, 0.6225]]),
tensor([-1.0796, 0.6225]))
# outer product
x = torch.randn(3)
y = torch.randn(2)
f"x: {x}", f"y: {y}", torch.einsum('i,j->ij', x, y)
('x: tensor([ 0.1439, -1.8456, -1.5355])',
'y: tensor([-0.7276, -0.5566])',
tensor([[-0.1047, -0.0801],
[ 1.3429, 1.0273],
[ 1.1172, 0.8547]]))
# batch matrix multiplication
As = torch.randn(3,2,5)
Bs = torch.randn(3,5,4)
torch.einsum('bij,bjk->bik', As, Bs)
tensor([[[ 1.9657, 0.5904, 2.8094, -2.2607],
[ 0.7610, -2.0402, 0.7331, -2.2257]],
[[-1.5433, -2.9716, 1.3589, 0.1664],
[ 2.7327, 4.4207, -1.1955, 0.5618]],
[[-1.7859, -0.8143, -0.8410, -0.2257],
[-3.4942, -1.9947, 0.7098, 0.5964]]])
# with sublist format and ellipsis
torch.einsum(As, [..., 0, 1], Bs, [..., 1, 2], [..., 0, 2])
tensor([[[ 1.9657, 0.5904, 2.8094, -2.2607],
[ 0.7610, -2.0402, 0.7331, -2.2257]],
[[-1.5433, -2.9716, 1.3589, 0.1664],
[ 2.7327, 4.4207, -1.1955, 0.5618]],
[[-1.7859, -0.8143, -0.8410, -0.2257],
[-3.4942, -1.9947, 0.7098, 0.5964]]])
# batch permute
A = torch.randn(2, 3, 4, 5)
torch.einsum('...ij->...ji', A).shape
torch.Size([2, 3, 5, 4])
# equivalent to torch.nn.functional.bilinear
A = torch.randn(3,5,4)
l = torch.randn(2,5)
r = torch.randn(2,4)
torch.einsum('bn,anm,bm->ba', l, A, r)
tensor([[ 1.1410, -1.7888, 4.7315],
[ 3.8092, 3.0976, 2.2764]])
The Forward and Backward Passes
Defining and Initializing a Layer
# Linear layer
def lin(x, w, b): return x @ w + b
# Initialize random input values
x = torch.randn(200, 100)
# Initialize random target values
y = torch.randn(200)
# Initialize layer 1 weights
w1 = torch.randn(100,50)
# Initialize layer 1 biases
b1 = torch.zeros(50)
# Initialize layer 2 weights
w2 = torch.randn(50,1)
# Initialize layer 2 biases
b2 = torch.zeros(1)
# Get a batch of hidden state
l1 = lin(x, w1, b1)
l1.shape
torch.Size([200, 50])
l1.mean(), l1.std()
(tensor(-0.0385), tensor(10.0544))
Note: Having activations with a high standard deviation is a problem, since the values can scale to numbers that can't be represented by a computer by the end of the model.
x = torch.randn(200, 100)
for i in range(50): x = x @ torch.randn(100,100)
x[0:5,0:5]
tensor([[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan]])
Note: Having activations that are too small can cause all the activations at the end of the model to go to zero.
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.01)
x[0:5,0:5]
tensor([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
Note: Need to scale the weight matrices so the standard deviation of the activations stays at 1.
- Understanding the difficulty of training deep feedforward neural networks
- the right scale for a given layer is \(\frac{1}{\sqrt{n_{in}}}\), where \(n_{in}\) represents the number of inputs
Note: For a layer with 100 inputs, \(\frac{1}{\sqrt{100}}=0.1\)
x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100,100) * 0.1)
x[0:5,0:5]
tensor([[-1.7695, 0.5923, 0.3357, -0.7702, -0.8877],
[ 0.6093, -0.8594, -0.5679, 0.4050, 0.2279],
[ 0.4312, 0.0497, 0.1795, 0.3184, -1.7031],
[-0.7370, 0.0251, -0.8574, 0.6826, 2.0615],
[-0.2335, 0.0042, -0.1503, -0.2087, -0.0405]])
x.std()
tensor(1.0150)
Note: Even a slight variation from \(0.1\) will dramatically change the values
# Redefine inputs and targets
x = torch.randn(200, 100)
y = torch.randn(200)
from math import sqrt
# Scale the weights based on the number of inputs to the layers
w1 = torch.randn(100,50) / sqrt(100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) / sqrt(50)
b2 = torch.zeros(1)
l1 = lin(x, w1, b1)
l1.mean(),l1.std()
(tensor(-0.0062), tensor(1.0231))
# Define non-linear activation function
def relu(x): return x.clamp_min(0.)
l2 = relu(l1)
l2.mean(),l2.std()
(tensor(0.3758), tensor(0.6150))
Note: The activation function ruined the mean and standard deviation.
- The \(\frac{1}{\sqrt{n_{in}}}\) weight initialization method does not account for the ReLU function.
x = torch.randn(200, 100)
for i in range(50): x = relu(x @ (torch.randn(100,100) * 0.1))
x[0:5,0:5]
tensor([[1.2172e-08, 0.0000e+00, 0.0000e+00, 7.1241e-09, 5.9308e-09],
[1.9647e-08, 0.0000e+00, 0.0000e+00, 9.2189e-09, 7.1026e-09],
[1.8266e-08, 0.0000e+00, 0.0000e+00, 1.1150e-08, 7.0774e-09],
[1.8673e-08, 0.0000e+00, 0.0000e+00, 1.0574e-08, 7.3020e-09],
[2.1829e-08, 0.0000e+00, 0.0000e+00, 1.1662e-08, 1.0466e-08]])
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
- the paper by Kaiming He et al. (the ResNet authors)
- Introduced Kaiming initialization:
- \(\sqrt{\frac{2}{n_{in}}}\), where \(n_{in}\) is the number of inputs of our model
x = torch.randn(200, 100)
for i in range(50): x = relu(x @ (torch.randn(100,100) * sqrt(2/100)))
x[0:5,0:5]
tensor([[0.0000, 0.0000, 0.1001, 0.0358, 0.0000],
[0.0000, 0.0000, 0.1612, 0.0164, 0.0000],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.1764, 0.0000, 0.0000],
[0.0000, 0.0000, 0.1331, 0.0000, 0.0000]])
x = torch.randn(200, 100)
y = torch.randn(200)
w1 = torch.randn(100,50) * sqrt(2 / 100)
b1 = torch.zeros(50)
w2 = torch.randn(50,1) * sqrt(2 / 50)
b2 = torch.zeros(1)
l1 = lin(x, w1, b1)
l2 = relu(l1)
l2.mean(), l2.std()
(tensor(0.5720), tensor(0.8259))
def model(x):
    l1 = lin(x, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3
out = model(x)
out.shape
torch.Size([200, 1])
# Squeeze the output to get rid of the trailing dimension
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()
loss = mse(out, y)
Gradients and the Backward Pass
- the gradients are computed in the backward pass using the chain rule from calculus
- chain rule: \((g \circ f)'(x) = g'(f(x)) f'(x)\)
- our loss is a big composition of different functions
loss = mse(out,y) = mse(lin(l2, w2, b2), y)
- chain rule: \[\frac{\text{d} loss}{\text{d} b_{2}} = \frac{\text{d} loss}{\text{d} out} \times \frac{\text{d} out}{\text{d} b_{2}} = \frac{\text{d}}{\text{d} out} mse(out, y) \times \frac{\text{d}}{\text{d} b_{2}} lin(l_{2}, w_{2}, b_{2})\]
- To compute all the gradients we need for the update, we need to begin from the output of the model and work our way backward, one layer after the other.
- We can automate this process by having each function we implement provide its own backward step
Gradient of the loss function
- undo the squeeze in the mse function
- calculate the derivative of \(x^{2}\): \(2x\)
- calculate the derivative of the mean: \(\frac{1}{n}\) where \(n\) is the number of elements in the input
def mse_grad(inp, targ):
    # grad of loss with respect to output of previous layer
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]
Gradient of the ReLU activation function
def relu_grad(inp, out):
    # grad of relu with respect to input activations
    inp.g = (inp>0).float() * out.g
Gradient of a linear layer
def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    w.g = inp.t() @ out.g
    b.g = out.g.sum(0)
SymPy
- a library for symbolic computation that is extremely useful when working with calculus
- Symbolic computation deals with the computation of mathematical objects symbolically
- the mathematical objects are represented exactly, not approximately, and mathematical expressions with unevaluated variables are left in symbolic form
from sympy import symbols,diff
sx,sy = symbols('sx sy')
# Calculate the derivative of sx**2
diff(sx**2, sx)
2*sx
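As a small aside (not part of the book's code), SymPy can also confirm the derivative used in mse_grad above: the derivative of the squared error with respect to the prediction is 2*(prediction - target). The symbol names below are arbitrary.

from sympy import symbols, diff
pred, targ = symbols('pred targ')
# Derivative of the squared error with respect to the prediction
diff((pred - targ)**2, pred)   # 2*pred - 2*targ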
Define Forward and Backward Pass
def forward_and_backward(inp, targ):
    # forward pass:
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    # we don't actually need the loss in backward!
    loss = mse(out, targ)

    # backward pass:
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
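One way to sanity-check these manual gradients (a sketch, assuming the x, y, w1, b1, w2, b2 tensors defined earlier are still in scope) is to recompute the same forward pass with PyTorch's autograd and compare:

forward_and_backward(x, y)
# Keep copies of the manually computed gradients
w1g, b1g, w2g, b2g = w1.g.clone(), b1.g.clone(), w2.g.clone(), b2.g.clone()

# Same computation, but tracked by autograd
w1a = w1.clone().requires_grad_(True)
b1a = b1.clone().requires_grad_(True)
w2a = w2.clone().requires_grad_(True)
b2a = b2.clone().requires_grad_(True)
loss = mse(relu(x @ w1a + b1a) @ w2a + b2a, y)
loss.backward()
# The manual and autograd gradients should match closely
[torch.allclose(manual, auto, atol=1e-4) for manual, auto in
 zip([w1g, b1g, w2g, b2g], [w1a.grad, b1a.grad, w2a.grad, b2a.grad])]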
Refactoring the Model
- define classes for each function that include their own forward and backward pass functions
class Relu():
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)
        return self.out
    def backward(self): self.inp.g = (self.inp>0).float() * self.out.g
class Lin():
    def __init__(self, w, b): self.w,self.b = w,b
    def __call__(self, inp):
        self.inp = inp
        self.out = inp@self.w + self.b
        return self.out
    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)
class Mse():
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out
    def backward(self):
        x = (self.inp.squeeze()-self.targ).unsqueeze(-1)
        self.inp.g = 2.*x/self.targ.shape[0]
class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()
model = Model(w1, b1, w2, b2)
loss = model(x, y)
model.backward()
Going to PyTorch
# Define a base class for all functions in the model
class LayerFunction():
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out
    def forward(self): raise Exception('not implemented')
    def bwd(self): raise Exception('not implemented')
    def backward(self): self.bwd(self.out, *self.args)
class Relu(LayerFunction):
    def forward(self, inp): return inp.clamp_min(0.)
    def bwd(self, out, inp): inp.g = (inp>0).float() * out.g
class Lin(LayerFunction):
    def __init__(self, w, b): self.w,self.b = w,b
    def forward(self, inp): return inp@self.w + self.b
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        self.w.g = inp.t() @ self.out.g
        self.b.g = out.g.sum(0)
class Mse(LayerFunction):
    def forward(self, inp, targ): return (inp.squeeze() - targ).pow(2).mean()
    def bwd(self, out, inp, targ):
        inp.g = 2*(inp.squeeze()-targ).unsqueeze(-1) / targ.shape[0]
torch.autograd.Function
- In PyTorch, each basic function we need to differentiate is written as a torch.autograd.Function that has a forward and backward method
from torch.autograd import Function
class MyRelu(Function):
    # Performs the operation
    @staticmethod
    def forward(ctx, i):
        result = i.clamp_min(0.)
        ctx.save_for_backward(i)
        return result
    # Defines a formula for differentiating the operation with
    # backward mode automatic differentiation
    @staticmethod
    def backward(ctx, grad_output):
        i, = ctx.saved_tensors
        return grad_output * (i>0).float()
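A short usage sketch (not from the book's code): a custom Function is invoked through its apply method, and autograd calls its backward during loss.backward().

inp = torch.randn(5, requires_grad=True)
out = MyRelu.apply(inp)
out.sum().backward()
# inp.grad is 1 where inp > 0 and 0 elsewhere
inp.grad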
help(staticmethod)
Help on class staticmethod in module builtins:
class staticmethod(object)
| staticmethod(function) -> method
|
| Convert a function to be a static method.
|
| A static method does not receive an implicit first argument.
| To declare a static method, use this idiom:
|
| class C:
| @staticmethod
| def f(arg1, arg2, ...):
| ...
|
| It can be called either on the class (e.g. C.f()) or on an instance
| (e.g. C().f()). Both the class and the instance are ignored, and
| neither is passed implicitly as the first argument to the method.
|
| Static methods in Python are similar to those found in Java or C++.
| For a more advanced concept, see the classmethod builtin.
|
| Methods defined here:
|
| __get__(self, instance, owner, /)
| Return an attribute of instance, which is of type owner.
|
| __init__(self, /, *args, **kwargs)
| Initialize self. See help(type(self)) for accurate signature.
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| __new__(*args, **kwargs) from builtins.type
| Create and return a new object. See help(type) for accurate signature.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
|
| __func__
|
| __isabstractmethod__
torch.nn.Module
- the base structure for all models in PyTorch
Implementation Steps
1. Make sure the superclass __init__ is called first when you initialize it.
2. Define any parameters of the model as attributes with nn.Parameter.
3. Define a forward function that returns the output of your model.
import torch.nn as nn
class LinearLayer(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_out, n_in) * sqrt(2/n_in))
        self.bias = nn.Parameter(torch.zeros(n_out))
    def forward(self, x): return x @ self.weight.t() + self.bias
lin = LinearLayer(10,2)
p1,p2 = lin.parameters()
p1.shape,p2.shape
(torch.Size([2, 10]), torch.Size([2]))
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse
    def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)
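A brief usage sketch (assuming the x, y tensors and the mse function defined earlier are still in scope): instantiate the model, compute the loss, and let autograd handle the backward pass.

model = Model(100, 50, 1)
loss = model(x, y)
loss.backward()
# Gradients now live in each parameter's .grad attribute
model.layers[0].weight.grad.shape   # torch.Size([50, 100])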
Note: fastai provides its own variant of Module that is identical to nn.Module, but automatically calls super().__init__().
class Model(Module):
    def __init__(self, n_in, nh, n_out):
        self.layers = nn.Sequential(
            nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out))
        self.loss = mse
    def forward(self, x, targ): return self.loss(self.layers(x).squeeze(), targ)
Conclusion
- A neural net is a bunch of matrix multiplications with nonlinearities in between
- Vectorize and take advantage of techniques such as element-wise arithmetic and broadcasting when possible
- Two tensors are broadcastable if, comparing their dimensions from the end and going backward, each pair of dimensions either matches or one of them is 1
- May need to add dimensions of size one to make tensors broadcastable
- Properly initializing a neural net is crucial to get training started
- Use Kaiming initialization when using ReLU
- The backward pass is the chain rule applied multiple times, computing the gradient from the model output and going back, one layer at a time
References