where x and h may have different sizes Nx and Nh, and the size of our output depends on the choice of padding and on the dimension d of the signals. Without loss of generality, we will ignore the relationship between spatio-temporal sizes and instead focus on channels.
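Written per output channel, and assuming $*$ denotes the scalar convolution above, a deep-learning conv-layer can be sketched as follows. The index names $i$ and $j$ are an assumption made for illustration; the symbols themselves are the ones defined in the text.

$$
y_i = b_i + \sum_{j=1}^{C_{in}} h_{ij} * x_j, \qquad i = 1, \dots, C_{out},
$$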
where x, h, b, and y are our input, conv-kernel weights, conv-bias weights, and output respectively. Clearly, a deep-learning conv-layer involves more than just scalar convolution. Our input and output signals can have different numbers of channels (Cin and Cout), and our conv-kernel h is a Cout×Cin matrix of scalar kernels, each of spatio-temporal size Nh. Indeed, the indexing in the above equation is very reminiscent of matrix-vector (mat-vec) multiplication. However, instead of multiplying scalars, the conv-layer equation convolves channels. We can write this mat-vec interpretation of the conv-layer as follows,
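Stacking the channels of x, b, and y into vectors and collecting the kernels into a Cout×Cin matrix H gives the form below. The exact layout is an assumption, sketched from the definitions above.

$$
\begin{bmatrix} y_1 \\ \vdots \\ y_{C_{out}} \end{bmatrix}
=
\begin{bmatrix}
H_{11} & \cdots & H_{1 C_{in}} \\
\vdots & \ddots & \vdots \\
H_{C_{out} 1} & \cdots & H_{C_{out} C_{in}}
\end{bmatrix}
\begin{bmatrix} x_1 \\ \vdots \\ x_{C_{in}} \end{bmatrix}
+
\begin{bmatrix} b_1 \\ \vdots \\ b_{C_{out}} \end{bmatrix},
$$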
where the usual scalar multiplication is replaced with convolution, while addition is unchanged. A more colorful representation is given below, where squares represent feature-maps and the colored circles represent scalars. Note that Hij=hij.
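To make the channel-wise mat-vec concrete, here is a minimal pure-Python sketch for 1-D signals. The helper names `conv1d` and `conv_layer` are hypothetical, and the sketch assumes true convolution with "full" padding (output length Nx+Nh-1); deep-learning libraries typically use other padding choices and often implement cross-correlation instead.

```python
def conv1d(x, h):
    """Full 1-D convolution of two scalar signals given as lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, x_n in enumerate(x):
        for k, h_k in enumerate(h):
            y[n + k] += x_n * h_k
    return y


def conv_layer(x, H, b):
    """Channel-wise mat-vec: y[i] = b[i] + sum_j conv1d(x[j], H[i][j]).

    x: list of Cin channels (each a list of Nx samples)
    H: Cout x Cin nested list of kernels (each a list of Nh taps)
    b: list of Cout scalar biases
    """
    y = []
    for H_i, b_i in zip(H, b):  # one row of H per output channel
        # "Multiply" each entry of the row with its input channel by convolving.
        terms = [conv1d(x_j, h_ij) for x_j, h_ij in zip(x, H_i)]
        # Sum the convolved channels sample by sample, then add the bias.
        y.append([sum(samples) + b_i for samples in zip(*terms)])
    return y


# Illustrative values (assumed): Cin = 2, Cout = 3, Nx = 3, Nh = 2.
x = [[1, 2, 3], [0, 1, 0]]
H = [
    [[1, 0], [1, 1]],
    [[0, 1], [2, 0]],
    [[1, 1], [0, 0]],
]
b = [0, 1, -1]
y = conv_layer(x, H, b)
print(len(y), len(y[0]))  # Cout channels, each of length Nx + Nh - 1
```

The loop over rows of `H` is exactly the row-times-vector step of an ordinary mat-vec, with `conv1d` standing in for scalar multiplication.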
It's now standard for libraries to have a groups parameter. Our channel-wise mat-vec visual makes understanding this parameter much easier. By default, groups=1 and we have our standard conv layer as above. In general, when groups=G, we divide both the rows and the columns of our kernel matrix H into G groups, and only allow kernels in the G diagonal blocks of H; every other entry is zero. The figure below shows an example of this.
This is equivalent to dividing our input signal into G groups of channel-size Cin/G, passing each group through its own conv-layer of size Cout/G×Cin/G, and concatenating the results channel-wise. Hence, both Cin and Cout must be divisible by G.
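The equivalence can be checked numerically with a small pure-Python sketch. All helper names (`conv1d`, `conv_layer`, `grouped_conv_layer`, `block_diagonal`) are hypothetical, biases are omitted for brevity, and "full" 1-D convolution is assumed as the scalar operation.

```python
def conv1d(x, h):
    """Full 1-D convolution of two scalar signals given as lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, x_n in enumerate(x):
        for k, h_k in enumerate(h):
            y[n + k] += x_n * h_k
    return y


def conv_layer(x, H):
    """Bias-free channel-wise mat-vec: y[i] = sum_j conv1d(x[j], H[i][j])."""
    return [
        [sum(s) for s in zip(*(conv1d(x_j, h_ij) for x_j, h_ij in zip(x, H_i)))]
        for H_i in H
    ]


def grouped_conv_layer(x, H_groups):
    """Split x into G channel groups, run one small conv-layer per group,
    and concatenate the outputs channel-wise.

    H_groups: list of G kernel matrices, each of size (Cout/G) x (Cin/G).
    """
    G = len(H_groups)
    if len(x) % G != 0:
        raise ValueError("Cin must be divisible by G")
    step = len(x) // G
    y = []
    for g, H_g in enumerate(H_groups):
        y.extend(conv_layer(x[g * step:(g + 1) * step], H_g))
    return y


def block_diagonal(H_groups, n_h):
    """Dense Cout x Cin kernel matrix with H_groups on its diagonal blocks
    and all-zero kernels of length n_h everywhere else."""
    G = len(H_groups)
    cin_g = len(H_groups[0][0])
    zero = [0.0] * n_h
    H = []
    for g, H_g in enumerate(H_groups):
        for row in H_g:
            H.append([zero] * (g * cin_g) + list(row) + [zero] * ((G - 1 - g) * cin_g))
    return H


# Illustrative values (assumed): Cin = 4, Cout = 2, G = 2, Nh = 2.
x = [[1, 2], [3, 4], [5, 6], [7, 8]]
H_groups = [
    [[[1, 0], [0, 1]]],  # group 0: 1 x 2 kernel matrix
    [[[1, 1], [2, 0]]],  # group 1: 1 x 2 kernel matrix
]
y_grouped = grouped_conv_layer(x, H_groups)
y_dense = conv_layer(x, block_diagonal(H_groups, n_h=2))
print(y_grouped == y_dense)
```

The grouped layer stores only the G diagonal blocks, so it needs G times fewer kernels than the dense Cout×Cin layer it matches.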
I hope this makes starting out with convolutional neural networks a little easier. In a later post, I'll go over deep-learning's conv-transposed layers, which require us to look at the matrix representation of convolution to understand them best.