# Convolutional Networks

The input to a convolutional network is an image of size \(W_i \times H_i \times C_i\), where:

- \(W_i\) is the input width
- \(H_i\) is the input height
- \(C_i\) is the input channel dimension, e.g. 3 channels for red, green, and blue

The output is \(W_o \times H_o \times C_o\) where:

- \(C_o\) is the number of convolutional filters

The convolutional weights have dimension \(W_c \times H_c \times C_i \times C_o\).

The output is computed by sliding a \(W_c \times H_c\) window along the input image and producing one output value per window position. The amount that we slide the window per step is called the *stride*.
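Assuming no padding and a stride of \(s\) (the document does not specify padding, so this is the simplest case), the output spatial dimensions follow directly from counting window positions:

\(W_o = \lfloor (W_i - W_c) / s \rfloor + 1\), and similarly \(H_o = \lfloor (H_i - H_c) / s \rfloor + 1\).

For example, a \(3 \times 3\) window sliding over a \(7 \times 7\) input with stride 2 yields a \(3 \times 3\) output.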

Let \(W^c\) be the convolutional weights and \(X\) the input, and suppose we are looking at a particular \(W_c \times H_c\) window of the input. The output for the \(k\)-th convolutional filter is computed as follows:

- For the \(i\)-th channel of the input, take the \(W_c \times H_c\) window in that channel, e.g. \(X[0:W_c, 0:H_c, i]\), and compute its dot product with \(W^c[0:W_c, 0:H_c, i, k]\).
- Sum these dot products over all input channels \(i\).
- Add a bias term.
- The result is the \(k\)-th channel of the output for that specific \(W_c \times H_c\) input window.
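The steps above can be sketched as a naive NumPy implementation. The function name `conv2d` and the \((W, H, C)\) axis ordering follow the conventions used in this document; this is an illustrative sketch assuming no padding, not an optimized implementation:

```python
import numpy as np

def conv2d(X, W, b, stride=1):
    """Naive 2-D convolution.

    X: input of shape (W_i, H_i, C_i)
    W: weights of shape (W_c, H_c, C_i, C_o)
    b: bias of shape (C_o,)
    """
    W_i, H_i, C_i = X.shape
    W_c, H_c, _, C_o = W.shape
    # Output spatial size with no padding
    W_o = (W_i - W_c) // stride + 1
    H_o = (H_i - H_c) // stride + 1
    out = np.zeros((W_o, H_o, C_o))
    for x in range(W_o):
        for y in range(H_o):
            # The current W_c x H_c window across all input channels
            window = X[x * stride : x * stride + W_c,
                       y * stride : y * stride + H_c, :]
            for k in range(C_o):
                # Dot product with filter k, summed over all channels, plus bias
                out[x, y, k] = np.sum(window * W[:, :, :, k]) + b[k]
    return out
```

With an all-ones filter, each output value is simply the sum of the input window plus the bias, which makes the result easy to check by hand.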

## 1. What does it mean to take a \(1 \times 1\) convolution?

Usually this is done to reduce the *depth* of the input: the input may be \(W_i \times H_i \times C_i\) and the output \(W_i \times H_i \times C_o\), where each of the \(C_o\) filters computes a linear combination of the \(C_i\) input channels. In other words, at each spatial location, each output channel is a weighted sum of the input channels plus a bias.
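Because the window is a single pixel, a \(1 \times 1\) convolution reduces to a matrix multiply over the channel dimension applied at every spatial location. A small NumPy sketch (the specific shapes are arbitrary, chosen only for illustration):

```python
import numpy as np

# Input: W_i x H_i x C_i, here 8 x 8 spatial with 64 channels
X = np.random.rand(8, 8, 64)
# 1x1 filters mapping 64 input channels down to 16 output channels
W = np.random.rand(1, 1, 64, 16)
b = np.random.rand(16)

# A 1x1 convolution is a per-pixel linear map over channels:
# at each (x, y), multiply the C_i-vector by a (C_i, C_o) matrix and add bias.
out = X @ W[0, 0] + b  # shape (8, 8, 16)
```

This is why \(1 \times 1\) convolutions are cheap: spatial structure is untouched, and only the channel dimension is mixed.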