Nikola Janjušević
Published 2022-12-11

How to think of Conv-Layers in Neural Networks

Convolutional Neural Networks' building blocks aren't just performing the convolution you learned in digital signal processing class. In my opinion, the best way to think of these layers is as a channel-wise matrix-vector multiplication of convolutions.

Notation

Denote our discrete signals (time-series, images, videos, etc.) as vectors in $\mathbb{R}^{CN}$, where $N$ is the total number of pixels in our spatio-temporal dimensions (1, 2, 3, for time, space, space-time resp.), $C$ is the number of channels, and we consider our signal to be vector-valued at each pixel location. Notationally, we write

$$\mathbf{x} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_C \end{bmatrix} \in \mathbb{R}^{CN}, \quad \mathbf{x}_i \in \mathbb{R}^{N}, \quad i = 1, \dots, C.$$

For images in particular, we often refer to $\mathbf{x}_i$ as a feature-map. When $C=1$, we refer to $\mathbf{x}$ as a scalar-valued signal. Mono-channel audio and grayscale images/videos are all scalar-valued signals. Stereo audio ($C=2$) and color images/videos ($C=3$) are vector-valued signals. More exotic channel numbers arise in many scientific instruments, such as EEG, where every electrode may be considered a channel of a brain time-series signal, and multi-coil MRI, where each RF-coil contributes a channel to the vector-valued image.
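To make these shapes concrete, here is a tiny sketch using PyTorch's channel-first tensor convention (the particular sizes are my own choices for illustration):

```python
import torch

# Channel-first tensors for the signals described above
# (the particular sizes here are arbitrary).
gray_image   = torch.randn(1, 128, 128)   # scalar-valued: C = 1, N = 128*128
color_image  = torch.randn(3, 128, 128)   # vector-valued: C = 3 feature-maps
stereo_audio = torch.randn(2, 44100)      # vector-valued: C = 2, N = 44100
print(color_image[1].shape)               # one feature-map x_i: torch.Size([128, 128])
```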

Using our notation, the familiar discrete time/space convolution for scalar-valued signals can be written as

$$(\mathbf{x} \ast \mathbf{h})[\mathbf{n}] = \sum_{\mathbf{m} \in \mathbb{Z}^d} \mathbf{h}[\mathbf{m}]\,\mathbf{x}[\mathbf{n}-\mathbf{m}], \quad \mathbf{x} \in \mathbb{R}^{N_x}, \; \mathbf{h} \in \mathbb{R}^{N_h}, \; \mathbf{n} \in \mathbb{Z}^d,$$

where $\mathbf{x}$ and $\mathbf{h}$ may be of different sizes $N_x$ and $N_h$, and our output may be of various sizes based on choices of padding and the dimension of the signals $d$. Without loss of generality, we will ignore the relationship between spatio-temporal sizes and instead focus on channels.
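As a quick sanity check, here is a minimal sketch of this scalar convolution using NumPy's `np.convolve` (the signal, kernel, and "full" padding choice are my own, just for illustration):

```python
import numpy as np

# Scalar (single-channel) convolution, as in the equation above.
# "full" mode gives output length N_x + N_h - 1, one possible padding choice.
x = np.array([1.0, 2.0, 3.0, 4.0])   # signal,  N_x = 4
h = np.array([1.0, -1.0])            # kernel,  N_h = 2
y = np.convolve(x, h, mode="full")   # y[n] = sum_m h[m] x[n - m]
print(y)                             # [ 1.  1.  1.  1. -4.]
```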

Conv-layers

The documentation of popular deep-learning libraries such as PyTorch describes a conv-layer as follows,

$$\mathbf{y}_i = \mathbf{b}_i + \sum_{j=1}^{C_{\mathrm{in}}} \mathbf{h}_{ij} \ast \mathbf{x}_j, \quad \mathbf{x} \in \mathbb{R}^{C_{\mathrm{in}}N_x}, \; \mathbf{h} \in \mathbb{R}^{C_{\mathrm{in}}C_{\mathrm{out}}N_h}, \; \mathbf{y} \in \mathbb{R}^{C_{\mathrm{out}}N_y}, \; \mathbf{b} \in \mathbb{R}^{C_{\mathrm{out}}},$$

where $\mathbf{x}, \mathbf{h}, \mathbf{b}, \mathbf{y}$ are our input, conv-kernel weights, conv-bias weights, and output respectively. Clearly, a deep-learning conv-layer involves potentially more than just scalar convolution. We can see that our input and output signals have a differing number of channels ($C_{\mathrm{in}}$ and $C_{\mathrm{out}}$), and our conv-kernel $\mathbf{h}$ is a matrix of $C_{\mathrm{out}} \times C_{\mathrm{in}}$ scalar kernels, each of spatio-temporal size $N_h$. Indeed, the indexing in the above equation looks very reminiscent of matrix-vector (mat-vec) multiplication. However, instead of multiplying indexed scalars, the conv-layer equation convolves indexed channels. We can write this mat-vec interpretation of the conv-layer as follows,

$$\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_{C_{\mathrm{out}}} \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{C_{\mathrm{out}}} \end{bmatrix} + \begin{bmatrix} \mathbf{h}_{11} & \mathbf{h}_{12} & \cdots & \mathbf{h}_{1C_{\mathrm{in}}} \\ \mathbf{h}_{21} & \mathbf{h}_{22} & \cdots & \mathbf{h}_{2C_{\mathrm{in}}} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{h}_{C_{\mathrm{out}}1} & \mathbf{h}_{C_{\mathrm{out}}2} & \cdots & \mathbf{h}_{C_{\mathrm{out}}C_{\mathrm{in}}} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_{C_{\mathrm{in}}} \end{bmatrix},$$

where the usual scalar multiplication is replaced with convolution (and addition remains addition). A more colorful representation is given below, where squares represent feature-maps and the colored circles represent scalars. Note that $\mathbf{H}_{ij} = \mathbf{h}_{ij}$.

"Channel-wise mat-vec convolution" of standard deep-learning Conv-layer.

Groups

It's now standard for libraries to have a `groups` parameter. Our channel-wise mat-vec visual makes understanding this parameter much easier. By default, `groups=1` and we have our standard conv-layer as above. In general, when `groups`$=G$, we divide the rows and columns of our kernel matrix $\mathbf{H}$ into $G$ groups each and only allow a single group of kernels in each block-diagonal division of $\mathbf{H}$ (all other kernels are zero). The figure below shows an example of this.

Varying the groups parameter of a 4x4 channel conv kernel matrix.

This is equivalent to dividing our input signal into $G$ groups of channel-size $C_{\mathrm{in}}/G$, passing each group through its own conv-layer of size $C_{\mathrm{out}}/G \times C_{\mathrm{in}}/G$, and concatenating the results channel-wise. Hence, it is required that both $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ (the columns and rows of $\mathbf{H}$) are divisible by $G$.
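Here is a minimal sketch of that equivalence (again with made-up sizes), checking a grouped `nn.Conv2d` against the split-convolve-concatenate description:

```python
import torch
import torch.nn.functional as F

# Grouped conv == split input into G channel groups, convolve each group
# with its own (C_out/G x C_in/G) kernel block, and concatenate.
C_in, C_out, G, N = 4, 4, 2, 8
x = torch.randn(1, C_in, N, N)
conv = torch.nn.Conv2d(C_in, C_out, kernel_size=3, padding=1, groups=G, bias=False)
y_ref = conv(x)

w = conv.weight                                    # shape (C_out, C_in // G, 3, 3)
outs = []
for g in range(G):
    xg = x[:, g * C_in // G:(g + 1) * C_in // G]   # g-th group of input channels
    wg = w[g * C_out // G:(g + 1) * C_out // G]    # g-th diagonal block of H
    outs.append(F.conv2d(xg, wg, padding=1))       # ordinary (ungrouped) conv
y = torch.cat(outs, dim=1)                         # concatenate channel-wise

print(torch.allclose(y, y_ref, atol=1e-5))         # True
```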

I hope this makes starting out with convolutional neural networks a little easier. In a later post, I'll go over deep-learning's conv-transpose layers, which are best understood through the matrix representation of convolution.