Convolutional Neural Networks' building blocks aren't just performing the convolution you learned in digital signal processing class. In my opinion, the best way to think of these layers is as a channel-wise matrix-vector multiplication of convolutions.
Denote our discrete signals (time-series, images, videos, etc.) as vectors in $\mathbb{R}^{C \times N}$, where $N$ is the total number of pixels across our spatio-temporal dimensions ($d = 1, 2, 3$ for time, space, space-time resp.), and we consider our signal to be vector-valued at each pixel location. Notationally, we write
- $x \in \mathbb{R}^{C \times N}$, our $N$-pixel, $C$-vector-valued, $d$-dimensional signal
- $x[n] \in \mathbb{R}^{C}$, a pixel of our signal $x$, indexed at location $n$
- $x_c \in \mathbb{R}^{N}$, the $d$-dimensional $c$-th channel of signal $x$, of length $N$
- $x_c[n] \in \mathbb{R}$, the value of the $c$-th channel of $x$ at pixel $n$.
For images in particular, we often refer to $x_c$ as a feature-map. When $C = 1$, we refer to $x$ as a scalar-valued signal. Mono-channel audio and grayscale images/videos are all scalar-valued signals. Stereo audio ($C = 2$) and color images/videos ($C = 3$) are vector-valued signals. More exotic channel counts arise in many scientific instruments, such as EEG, where every electrode may be considered a channel of a brain time-series signal, and multi-coil MRI, where each RF coil contributes a channel to the vector-valued image.
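To make this notation concrete, here is a minimal sketch in PyTorch (the sizes are arbitrary, chosen only for illustration), treating a random color image as the signal:

```python
import torch

# A C-vector-valued, d=2 dimensional signal: a "color image" with C=3 channels
# and N = H*W total pixels (random values, for illustration only).
H, W = 4, 5
x = torch.randn(3, H, W)     # shape (C, H, W)

x_c  = x[0]                  # the c-th channel of x: a feature-map with N = H*W pixels
x_n  = x[:, 2, 3]            # the pixel of x at location n = (2, 3): a vector in R^C
x_cn = x[0, 2, 3]            # the value of channel c at pixel n: a scalar

print(x_c.shape, x_n.shape, x_cn.shape)
# torch.Size([4, 5]) torch.Size([3]) torch.Size([])
```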
Using our notation, the familiar discrete time/space convolution for scalar-valued signals can be written as

$$(x * k)[n] = \sum_{m} x[n - m]\, k[m],$$

where $x$ and $k$ may be of different sizes $N$ and $M$, and our output may be of various sizes based on choices of padding and the dimension $d$ of the signals. Without loss of generality, we will ignore the relationship between spatio-temporal sizes and instead focus on channels.
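As a sanity check, here is a small sketch of that sum in code (assuming zero padding outside the signal and a "full" output of length $N + M - 1$; as an aside, deep-learning libraries technically implement cross-correlation, i.e. they don't flip the kernel, but that detail doesn't change the channel structure discussed next):

```python
import torch

def scalar_conv(x, k):
    """Direct evaluation of (x * k)[n] = sum_m x[n - m] k[m] for 1-d scalar signals.
    Zero padding outside the signal, 'full' output of length N + M - 1."""
    N, M = len(x), len(k)
    y = torch.zeros(N + M - 1)
    for n in range(N + M - 1):
        for m in range(M):
            if 0 <= n - m < N:
                y[n] += x[n - m] * k[m]
    return y

x = torch.randn(8)   # scalar-valued signal of size N = 8
k = torch.randn(3)   # scalar kernel of size M = 3
print(scalar_conv(x, k))
# Agrees with np.convolve(x.numpy(), k.numpy(), mode="full").
```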
The documentation of popular deep-learning libraries such as PyTorch describes a conv-layer as follows,

$$y_i = b_i + \sum_{j=1}^{C_{in}} K_{ij} * x_j, \qquad i = 1, \dots, C_{out},$$

where $x$, $K$, $b$, and $y$ are our input, conv-kernel weights, conv-bias weights, and output respectively. Clearly, a deep-learning conv-layer involves potentially more than just scalar convolution. We can see that our input and output signals have a differing number of channels ($C_{in}$ and $C_{out}$), and our conv-kernel $K$ is a $C_{out} \times C_{in}$ matrix of scalar kernels, each of spatio-temporal size $M$. Indeed, the indexing in the above equation looks very reminiscent of matrix-vector (mat-vec) multiplication. However, instead of indexing and multiplying scalars, the conv-layer equation indexes and convolves channels. We can write this mat-vec interpretation of the conv-layer as follows,
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{C_{out}} \end{bmatrix} = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1C_{in}} \\ K_{21} & K_{22} & \cdots & K_{2C_{in}} \\ \vdots & \vdots & \ddots & \vdots \\ K_{C_{out}1} & K_{C_{out}2} & \cdots & K_{C_{out}C_{in}} \end{bmatrix} * \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{C_{in}} \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{C_{out}} \end{bmatrix},$$

where the usual scalar multiplication and addition are replaced with convolution and addition of feature-maps. A more colorful representation is given below, where squares represent feature-maps and the colored circles represent scalars.
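Here is a sketch that checks this picture numerically with PyTorch (layer sizes are arbitrary): each output feature-map $y_i$ is the bias $b_i$ plus the sum over input channels $j$ of the scalar convolution $K_{ij} * x_j$.

```python
import torch
import torch.nn.functional as F

C_in, C_out, M, H, W = 3, 4, 3, 8, 8
x = torch.randn(1, C_in, H, W)          # input signal (with a leading batch dim)
K = torch.randn(C_out, C_in, M, M)      # the C_out x C_in "matrix" of scalar kernels,
                                        # same shape as nn.Conv2d(C_in, C_out, M).weight
b = torch.randn(C_out)                  # one bias per output channel

y = F.conv2d(x, K, bias=b)              # the usual conv-layer

# The channel-wise mat-vec: loop over rows (output channels) and columns (input
# channels) of K, doing one single-channel convolution per entry.
y_manual = torch.zeros_like(y)
for i in range(C_out):
    y_manual[0, i] = b[i]
    for j in range(C_in):
        y_manual[0, i] += F.conv2d(x[:, j:j+1], K[i:i+1, j:j+1])[0, 0]

print(torch.allclose(y, y_manual, atol=1e-5))   # True
```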
It's now standard for libraries to have a `groups` parameter. Our channel-wise mat-vec visual makes understanding this parameter much easier. By default, `groups=1` and we have our standard conv-layer as above. In general, when `groups` $= g$, we divide the rows and columns of our kernel matrix $K$ into $g$ groups of rows and columns and only allow a single group of kernels in each division of $K$. The figure below shows an example of this.
This is equivalent to dividing our input signal $x$ into $g$ groups of channel-size $C_{in}/g$, passing each group through its own conv-layer of size $(C_{out}/g) \times (C_{in}/g)$, and concatenating the results channel-wise. Hence, it is required that $K$ is divisible by $g$ in rows and columns, i.e. $g$ must divide both $C_{out}$ and $C_{in}$.
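A small sketch of this equivalence (sizes again arbitrary, with $C_{in} = 6$, $C_{out} = 4$, and $g = 2$):

```python
import torch
import torch.nn.functional as F

C_in, C_out, g, M, H, W = 6, 4, 2, 3, 8, 8
x = torch.randn(1, C_in, H, W)
K = torch.randn(C_out, C_in // g, M, M)   # grouped kernels: each output channel only
                                          # sees C_in/g input channels
y = F.conv2d(x, K, groups=g)

# Equivalent: split the input into g channel groups, conv each with its own block of
# kernels, and concatenate the results channel-wise.
x_groups = x.chunk(g, dim=1)              # g groups of C_in/g input channels
K_groups = K.chunk(g, dim=0)              # g blocks of C_out/g kernels
y_manual = torch.cat(
    [F.conv2d(xg, Kg) for xg, Kg in zip(x_groups, K_groups)], dim=1)

print(torch.allclose(y, y_manual, atol=1e-5))   # True
```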
I hope this makes starting out with convolutional neural networks a little easier. In a later post, I'll go over deep-learning's conv-transposed layers, which require us to look at the matrix representation of convolution to understand them best.