Convolutional Neural Networks' building blocks aren't just performing the convolution you learned in digital signal processing class. In my opinion, the best way to think of these layers is as a channel-wise matrix-vector multiplication of convolutions.
Denote our discrete signals (time-series, images, videos, etc.) as vectors in $\mathbb{R}^{C \times N}$, where $N$ is the total number of pixels across our spatio-temporal dimensions ($d = 1, 2, 3$ for time, space, space-time resp.), and we consider our signal to be vector-valued at each pixel location. Notationally, we write
- $x \in \mathbb{R}^{C \times N}$, our $N$-pixel, $C$-vector-valued, $d$-dimensional signal
- $x[n] \in \mathbb{R}^{C}$, a pixel of our signal $x$, indexed at location $n$
- $x_c \in \mathbb{R}^{N}$, the $d$-dimensional $c$-th channel of signal $x$, of length $N$
- $x_c[n] \in \mathbb{R}$, the value of the $c$-th channel of $x$ at pixel $n$.
For images in particular, we often refer to $x_c$ as a feature-map. When $C = 1$, we refer to $x$ as a scalar-valued signal. Mono-channel audio and grayscale images/videos are all scalar-valued signals. Stereo audio ($C = 2$) and color images/videos ($C = 3$) are vector-valued signals. More exotic channel counts arise in many scientific instruments, such as EEG, where every electrode may be considered a channel of a brain time-series signal, and multi-coil MRI, where each RF coil contributes a channel to the vector-valued image.
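To make this notation concrete, here is a minimal sketch in PyTorch (the sizes are arbitrary, chosen only for illustration), treating a random color image as the signal:

```python
import torch

# A C-vector-valued, d=2 dimensional signal: a "color image" with C=3 channels
# and N = H*W total pixels (random values, for illustration only).
H, W = 4, 5
x = torch.randn(3, H, W)     # shape (C, H, W)

x_c  = x[0]                  # the c-th channel of x: a feature-map with N = H*W pixels
x_n  = x[:, 2, 3]            # the pixel of x at location n = (2, 3): a vector in R^C
x_cn = x[0, 2, 3]            # the value of channel c at pixel n: a scalar

print(x_c.shape, x_n.shape, x_cn.shape)
# torch.Size([4, 5]) torch.Size([3]) torch.Size([])
```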
Using our notation, the familiar discrete time/space convolution for scalar-valued signals can be written as

$$(x * k)[n] = \sum_{m} x[n - m]\, k[m],$$

where $x$ and $k$ may be of different sizes $N$ and $M$, and our output may be of various sizes based on choices of padding and the dimension $d$ of the signals. Without loss of generality, we will ignore the relationship between spatio-temporal sizes and instead focus on channels.
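As a sanity check, here is a small sketch of that sum in code (assuming zero padding outside the signal and a "full" output of length $N + M - 1$; as an aside, deep-learning libraries technically implement cross-correlation, i.e. they don't flip the kernel, but that detail doesn't change the channel structure discussed next):

```python
import torch

def scalar_conv(x, k):
    """Direct evaluation of (x * k)[n] = sum_m x[n - m] k[m] for 1-d scalar signals.
    Zero padding outside the signal, 'full' output of length N + M - 1."""
    N, M = len(x), len(k)
    y = torch.zeros(N + M - 1)
    for n in range(N + M - 1):
        for m in range(M):
            if 0 <= n - m < N:
                y[n] += x[n - m] * k[m]
    return y

x = torch.randn(8)   # scalar-valued signal of size N = 8
k = torch.randn(3)   # scalar kernel of size M = 3
print(scalar_conv(x, k))
# Agrees with np.convolve(x.numpy(), k.numpy(), mode="full").
```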
The documentation of popular deep-learning libraries such as PyTorch describes a conv-layer as follows,

$$y_i = b_i + \sum_{j=1}^{C_{in}} K_{ij} * x_j, \qquad i = 1, \dots, C_{out},$$

where $x$, $K$, $b$, and $y$ are our input, conv-kernel weights, conv-bias weights, and output respectively. Clearly, a deep-learning conv-layer involves potentially more than just scalar convolution. We can see that our input and output signals have a differing number of channels ($C_{in}$ and $C_{out}$), and our conv-kernel $K$ is a $C_{out} \times C_{in}$ matrix of scalar kernels, each of spatio-temporal size $M$. Indeed, the indexing in the above equation looks very reminiscent of matrix-vector (mat-vec) multiplication. However, instead of indexing and multiplying scalars, the conv-layer equation indexes and convolves channels. We can write this mat-vec interpretation of the conv-layer as follows,
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{C_{out}} \end{bmatrix} = \begin{bmatrix} K_{11} & K_{12} & \cdots & K_{1C_{in}} \\ K_{21} & K_{22} & \cdots & K_{2C_{in}} \\ \vdots & \vdots & \ddots & \vdots \\ K_{C_{out}1} & K_{C_{out}2} & \cdots & K_{C_{out}C_{in}} \end{bmatrix} * \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{C_{in}} \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{C_{out}} \end{bmatrix},$$

where the usual scalar multiplication and addition are replaced with convolution and addition of feature-maps. A more colorful representation is given below, where squares represent feature-maps and the colored circles represent scalars.
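Here is a sketch that checks this picture numerically with PyTorch (layer sizes are arbitrary): each output feature-map $y_i$ is the bias $b_i$ plus the sum over input channels $j$ of the scalar convolution $K_{ij} * x_j$.

```python
import torch
import torch.nn.functional as F

C_in, C_out, M, H, W = 3, 4, 3, 8, 8
x = torch.randn(1, C_in, H, W)          # input signal (with a leading batch dim)
K = torch.randn(C_out, C_in, M, M)      # the C_out x C_in "matrix" of scalar kernels,
                                        # same shape as nn.Conv2d(C_in, C_out, M).weight
b = torch.randn(C_out)                  # one bias per output channel

y = F.conv2d(x, K, bias=b)              # the usual conv-layer

# The channel-wise mat-vec: loop over rows (output channels) and columns (input
# channels) of K, doing one single-channel convolution per entry.
y_manual = torch.zeros_like(y)
for i in range(C_out):
    y_manual[0, i] = b[i]
    for j in range(C_in):
        y_manual[0, i] += F.conv2d(x[:, j:j+1], K[i:i+1, j:j+1])[0, 0]

print(torch.allclose(y, y_manual, atol=1e-5))   # True
```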
It's now standard for libraries to have a `groups` parameter. Our channel-wise mat-vec visual makes understanding this parameter much easier. By default, `groups=1` and we have our standard conv-layer as above. In general, when `groups` $= g$, we divide the rows and columns of our kernel matrix $K$ into $g$ groups of rows and columns and only allow a single group of kernels in each division of $K$. The figure below shows an example of this.
This is equivalent to dividing our input signal $x$ into $g$ groups of channel-size $C_{in}/g$, passing each group through its own conv-layer of size $(C_{out}/g) \times (C_{in}/g)$, and concatenating the results channel-wise. Hence, it is required that $K$ is divisible by $g$ in rows and columns, i.e. $g$ must divide both $C_{out}$ and $C_{in}$.
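A small sketch of this equivalence (sizes again arbitrary, with $C_{in} = 6$, $C_{out} = 4$, and $g = 2$):

```python
import torch
import torch.nn.functional as F

C_in, C_out, g, M, H, W = 6, 4, 2, 3, 8, 8
x = torch.randn(1, C_in, H, W)
K = torch.randn(C_out, C_in // g, M, M)   # grouped kernels: each output channel only
                                          # sees C_in/g input channels
y = F.conv2d(x, K, groups=g)

# Equivalent: split the input into g channel groups, conv each with its own block of
# kernels, and concatenate the results channel-wise.
x_groups = x.chunk(g, dim=1)              # g groups of C_in/g input channels
K_groups = K.chunk(g, dim=0)              # g blocks of C_out/g kernels
y_manual = torch.cat(
    [F.conv2d(xg, Kg) for xg, Kg in zip(x_groups, K_groups)], dim=1)

print(torch.allclose(y, y_manual, atol=1e-5))   # True
```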
I hope this makes starting out with convolutional neural networks a little easier. In a later post, I'll go over deep-learning's conv-transposed layers, which require us to look at the matrix representation of convolution to understand them best.