The beginning of the lesson 8 video simply recaps some stuff from collaborative filtering, including a few techniques. The second half is based on Chapter 13 in the book.

Honestly, I rushed through this chapter a bit; I read it sort of late at night and evidently didn't retain a ton of it. That said, I think it's a reasonable one to move quickly through: I took computer vision in school and have done tasks like MNIST recognition before. Here's my work on the MNIST repro (honestly: a mess, I didn't get much out of it).

Anyways, here are the questions from Chapter 13, which I admittedly did not retain super well. If I'm being honest, the real reason I'm here is for diffusion models, which are literally next! I'm pretty stoked about that, which is part of why I'm erring toward moving more quickly.

Also, sort of crazy: we are beyond the list of chapters that have official wikis with answers! See this questionnaire solutions megathread; I believe the most reliable source now is a GitHub repo of someone's attempts here!

  1. ✅ What is a "feature"?

I believe it's a transformation of the data that picks out certain (usually visual) elements: for example, edges found via an edge-detection kernel, or features made by combining several edge detections.

  1. ✅ Write out the convolutional kernel matrix for a top edge detector.

[[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

  1. ❌ Write out the mathematical operation applied by a 3×3 kernel to a single pixel in an image.

? - vague

NOTE: the answer provided is "multiplication", which I agree with, but it's very vague. More precisely: multiply each of the nine pixels in the 3×3 neighborhood elementwise by the kernel, then sum the nine products.
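To make that concrete, here's a minimal sketch (a hypothetical `img` tensor and the top-edge kernel from above), checking the single-pixel operation against PyTorch's convolution:

```python
import torch
import torch.nn.functional as F

img = torch.rand(28, 28)                      # hypothetical grayscale image
top_edge = torch.tensor([[ 1.,  1.,  1.],
                         [ 0.,  0.,  0.],
                         [-1., -1., -1.]])

# output value for pixel (i, j): multiply the 3x3 neighborhood elementwise
# by the kernel, then sum the nine products
i, j = 10, 10
pixel_out = (img[i-1:i+2, j-1:j+2] * top_edge).sum()

# the same operation applied at every position by PyTorch's convolution
conv_out = F.conv2d(img[None, None], top_edge[None, None])
print(torch.isclose(pixel_out, conv_out[0, 0, i-1, j-1]))  # tensor(True)
```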

  1. ✅ What is the value of a convolutional kernel applied to a 3×3 matrix of zeros?

Zero: every elementwise product is zero, so their sum is zero as well.

  1. ✅ What is "padding"?

How we deal with applying the kernel at edge pixels, where not every pixel the kernel needs exists in the image; typically we add a border of zeros around the image.

  1. ✅ What is "stride"?

How far we move the kernel between applications. By skipping pixels (stride greater than 1), we also reduce the output size.
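A quick sketch of how padding and stride change the output shape, using PyTorch's functional conv (the sizes are just illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 28, 28)   # one 1-channel 28x28 image
w = torch.randn(4, 1, 3, 3)     # four 3x3 kernels

print(F.conv2d(x, w).shape)                       # (1, 4, 26, 26): no padding, so we lose the edges
print(F.conv2d(x, w, padding=1).shape)            # (1, 4, 28, 28): padding=1 keeps the size
print(F.conv2d(x, w, padding=1, stride=2).shape)  # (1, 4, 14, 14): stride 2 halves height and width
```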

  1. ✅ Create a nested list comprehension to complete any task that you choose.

[a * b for a in range(3) for b in range(3)], a tiny multiplication table, is roughly what I was going for.

Yes, although as a nit: we need the in range() calls, which my first attempt left out.

  1. ❌ What are the shapes of the input and weight parameters to PyTorch's 2D convolution?

I don't remember the specific example in the book!

EDIT: the input shape is (minibatch, in_channels, image_height, image_width); the weight shape is (out_channels, in_channels, kernel_height, kernel_width).
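A small sanity check of those shapes (the numbers are just illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(64, 3, 28, 28)   # input:  (minibatch, in_channels, image_height, image_width)
w = torch.randn(16, 3, 3, 3)     # weight: (out_channels, in_channels, kernel_height, kernel_width)
b = torch.randn(16)              # one bias per output channel

print(F.conv2d(x, w, b, padding=1).shape)   # torch.Size([64, 16, 28, 28])
```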

  1. ✅ What is a "channel"?

My sense is, broadly, another dimension of the input; for example, the red, green, and blue channels of a color image.

  1. ✅ What is the relationship between a convolution and a matrix multiplication?

A convolution can be represented as a matrix multiplication: one with a special weight matrix that is mostly zeros and reuses the same kernel values in many places.
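A sketch of one direction of that equivalence, using the im2col trick (F.unfold) to write the convolution as an ordinary matrix multiplication; all sizes here are made up:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)    # one 3-channel 8x8 image
w = torch.randn(4, 3, 3, 3)    # 4 output channels, 3x3 kernels

# im2col: extract every 3x3 patch and flatten it into a column -> (1, 3*3*3, 36)
cols = F.unfold(x, kernel_size=3)

# the convolution is then a plain matmul with the flattened kernels
out = (w.view(4, -1) @ cols).view(1, 4, 6, 6)

print(torch.allclose(out, F.conv2d(x, w), atol=1e-5))  # True
```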

  1. ✅ What is a "convolutional neural network"?

A neural network that has layers for convolution rather than just matrix multiplications.

NOTE: as noted above, a convolution is equivalent to a particular matrix multiplication, but convolutions can be computed more efficiently

  1. ✅ What is the benefit of refactoring parts of your neural network definition?

Clarity for future readers and fewer chances for bugs, since the repeated pieces are written once.
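For example, the book pulls the repeated conv-then-activation pattern into a small helper; this is a sketch from memory, so the details may differ slightly from the book's version:

```python
import torch.nn as nn

def conv(ni, nf, ks=3, act=True):
    "A stride-2 convolution from ni to nf channels, optionally followed by a ReLU."
    layer = nn.Conv2d(ni, nf, kernel_size=ks, stride=2, padding=ks // 2)
    return nn.Sequential(layer, nn.ReLU()) if act else layer
```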

  1. ✅ What is Flatten? Where does it need to be included in the MNIST CNN? Why?

At the end of the convolutions, so that all of the channels can be put through to a linear layer.

NIT: at the end, yes, to convert the rank-4 activations tensor into a rank-2 tensor
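A minimal illustration (made-up shapes) of what Flatten does to the final activations:

```python
import torch
import torch.nn as nn

acts = torch.randn(64, 10, 1, 1)     # rank-4 CNN output: (batch, channels, 1, 1)
flat = nn.Flatten()(acts)            # rank-2: (batch, channels), ready for a linear layer or the loss
print(acts.shape, "->", flat.shape)  # torch.Size([64, 10, 1, 1]) -> torch.Size([64, 10])
```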

  1. ✅ What does "NCHW" mean?

N dimension, Channel, Height, Width: the axis order of the activations tensor.

NIT: N is the batch axis!

  1. ✅❌ Why does the third layer of the MNIST CNN have 7*7*(1168-16) multiplications?

I don't remember!

NOTE: no answer in the solutions here either! My after-the-fact guess: 16 of the layer's 1,168 parameters are biases, so 1168-16 = 1,152 is the number of actual weights (16 filters × 8 input channels × 3×3), and the 7*7 presumably counts the spatial positions the kernel is applied over, with every weight used once per position.

  1. ❌ What is a "receptive field"?

I also don't remember!

NIT: the area of the image that is involved in the calculation of a layer (more precisely, of a given activation in that layer)

  1. ❌ What is the size of the receptive field of an activation after two stride 2 convolutions? Why?

The size would be n/16, I believe, because each stride-2 convolution halves each dimension? Likely (4,4) in this case; I'm not certain what the OG dimensions are.

NOTE: thinking about it again, I believe the intended answer is 7×7: the activation depends on a 3×3 grid of first-layer activations, each of which covers a 3×3 patch of the input and sits 2 pixels from its neighbours, so the input region involved spans 3 + 2*2 = 7 pixels on each side.

  1. ✅ Run conv-example.xlsx yourself and experiment with trace precedents.

note: no cell / text 'trace precedents' in the current spreadsheet! (Trace Precedents is an Excel feature, on the Formulas ribbon, rather than something in the sheet itself.)

NOTE: no notes on this one in the solutions either

  1. ✅ Have a look at Jeremy or Sylvain's list of recent Twitter "like"s, and see if you find any interesting resources or ideas there.

LOL, not really possible as of late!

  1. ❌ How is a color image represented as a tensor?

Rank 3: three channels, one for each of red, green, and blue, giving channels × height × width.

NIT: rank 4 actually, ch_out x ch_in x ks x ks (though that looks like the shape of a conv layer's weights rather than of the image itself)

  1. ✅ How does a convolution work with a color input?

The kernel has a separate set of 3×3 weights for each color channel; each is applied to its own channel and the results are summed into a single output value.

  1. ✅ What method can we use to see the data in DataLoaders?

.show_batch() I believe?

  1. ✅ Why do we double the number of filters after each stride-2 conv?

So that we don't lose too much capacity, I believe: each stride-2 conv quarters the number of activation locations, so doubling the filters keeps the amount of computation per layer from dropping too quickly.

  1. ❌ Why do we use a larger kernel in the first conv with MNIST (with simple_cnn)?

Don't remember.

When the number of outputs is similar to the number of inputs, the layer isn't forced to learn anything useful. With a 3×3 kernel on a single-channel image, each position only has 9 input values feeding a similar number of outputs; a larger first kernel (e.g. 5×5) gives more inputs per position, forcing the layer to distill them into useful features.

  1. ✅ What information does ActivationStats save for each layer?

The mean and standard deviation of the activations of each layer, plus the percentage of activations near zero. We hope to see a relatively well-behaved, roughly normal distribution of activations if the network is training properly.

  1. ❌ How can we access a learner's callback after training?

Don't remember.

NOTE: via the learn object; the callback becomes an attribute named after it, e.g. learn.activation_stats
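A sketch of how this looks with fastai, from memory of the chapter's code (the `dls` and `simple_cnn` pieces are assumed to exist as in the book):

```python
from fastai.vision.all import *

# assumes `dls` and `simple_cnn()` are already defined, as in the chapter
learn = Learner(dls, simple_cnn(), loss_func=F.cross_entropy,
                metrics=accuracy, cbs=ActivationStats(with_hist=True))
learn.fit(1)

# afterwards the callback hangs off the learner as an attribute
learn.activation_stats.plot_layer_stats(0)   # mean, std, % near zero, per training batch
learn.activation_stats.color_dim(-2)         # the histogram-over-time "colorful dimension" plot
```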

  1. ❌ What are the three statistics plotted by plot_layer_stats? What does the x-axis represent?

Also don't remember!

Answers don't really say here! From the book, I believe they are the mean, the standard deviation, and the percentage of activations near zero; the x-axis is the batch number over the course of training.

  1. ✅ Why are activations near zero problematic?

It means they aren't really being used! Those parts of the model are basically inactive, wasted computation.

  1. ✅ What are the upsides and downsides of training with a larger batch size?

We have fewer chances to update the weights, but each update will be more accurate (a better gradient estimate) because it is computed from more data. We can also process more in parallel, depending on the size of our GPU.

  1. ✅ Why should we avoid using a high learning rate at the start of training?

It will often make it impossible to converge: training won't settle into a valley where the loss is lower because it's bouncing around too much, and it may even diverge.

  1. ✅ What is 1cycle training?

We start with a low LR to warm up and figure out where to go, raise it once we've started making progress in that direction, and then lower it again at the end.
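In fastai this is the default fitting method; a quick sketch (assuming an existing learn object, with an arbitrary learning rate), plus the recorder plot that shows the warm-up-then-anneal schedule:

```python
# assumes `learn` is an existing fastai Learner
learn.fit_one_cycle(3, lr_max=0.06)   # LR ramps up and then anneals back down over the 3 epochs
learn.recorder.plot_sched()           # plots the learning-rate and momentum schedules
```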

  1. ✅ What are the benefits of training with a high learning rate?

Fewer epochs needed.

  1. ✅❌ Why do we want to use a low learning rate at the end of training?

To find the exact spot nearby that lowers the loss.

NOTE: not in the questions!

  1. ❌ What is "cyclical momentum"?

Don't remember.

The optimizer takes a step in the direction of the gradients but also continues in the direction of previous steps (that's momentum); with cyclical momentum, the momentum is varied over training in the opposite direction to the learning rate.

  1. ✅ What callback tracks hyperparameter values during training (along with other information)?

Recorder?

  1. ❌ What does one column of pixels in the color_dim plot represent?

Don't remember this plot.

NOTE: I was lost here; apparently each column is the histogram of a layer's activations for a single batch

  1. ❌ What does "bad training" look like in color_dim? Why?

Don't remember.

NOTE: this one makes sense: bad training shows a non-smooth picture on the color_dim graph, with activations repeatedly collapsing to near zero and then spiking back up, indicating that the activations aren't really working

  1. ❌ What trainable parameters does a batch normalization layer contain?

Oh man - don't remember!

NOTE: the gamma (multiplicative) and beta (additive) weights, one of each per channel

  1. ❌ What statistics are used to normalize in batch normalization during training? How about during validation?

Not sure.

NOTE: during training, the mean and standard deviation of the current batch's activations are used to normalize; during validation, a running mean and standard deviation accumulated during training are used instead
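A small sketch covering both of the last two answers (made-up sizes, standard nn.BatchNorm2d):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16)
print(bn.weight.shape, bn.bias.shape)   # gamma and beta: one trainable value per channel

x = torch.randn(8, 16, 4, 4) * 5 + 3    # activations with mean ~3, std ~5

bn.train()
y = bn(x)                               # normalized with this batch's own mean/std
print(y.mean().item(), y.std().item())  # roughly 0 and 1 (gamma/beta start at 1 and 0)

bn.eval()
y = bn(x)                               # normalized with the running statistics
                                        # accumulated during training instead
```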

  1. ❌ Why do models with batch normalization layers generalize better?

I don't remember this stuff about batch normalization!

NOTE: it adds some randomness to the training process (each mini-batch is normalized with its own statistics), which acts as a regularizer