$$\mathbf{h} \ast \mathbf{x} \quad \Leftrightarrow \quad (\mathbf{h} \ast \mathbf{x})[n] = \sum_{m=-M}^M \mathbf{h}[m] \mathbf{x}[n-m], \quad 1 \leq n \leq N,$$

where $\mathbf{x}[n < 1] \equiv 0$ and $\mathbf{x}[n > N] \equiv 0$.
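To make the indexing concrete, here is a minimal NumPy sketch of this zero-padded convolution. It assumes 0-based arrays, with `h[k]` storing the tap $\mathbf{h}[k-M]$ and `x[i]` storing $\mathbf{x}[i+1]$; the helper name `conv_same` is mine. For an odd-length kernel with $N \geq 2M+1$, it matches `np.convolve`'s `'same'` mode.

```python
import numpy as np

def conv_same(h, x):
    """Zero-padded convolution: y_hat[n] = sum_{m=-M}^{M} h[m] * x[n-m].

    h has length 2M+1, with h[k] holding the tap at lag m = k - M.
    x has length N; samples outside 0..N-1 are treated as zero.
    """
    M = (len(h) - 1) // 2
    N = len(x)
    y_hat = np.zeros(N)
    for n in range(N):
        for k, m in enumerate(range(-M, M + 1)):
            if 0 <= n - m < N:          # zero-padding: skip samples outside the support
                y_hat[n] += h[k] * x[n - m]
    return y_hat

# for an odd-length kernel this agrees with NumPy's 'same'-mode convolution
h = np.array([0.5, -1.0, 2.0])          # M = 1
x = np.arange(1.0, 9.0)                 # N = 8
assert np.allclose(conv_same(h, x), np.convolve(x, h, mode="same"))
```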
Derive the partial derivative,

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}[m]}, \quad \text{for } -M \leq m \leq M.$$

Below is a solution that I would expect students to arrive at. It makes use of the scalar chain rule and doesn't worry about deriving Jacobians, as the course does not emphasize this perspective heavily.
Solution:
Let $\hat{\mathbf{y}} = \mathbf{h} \ast \mathbf{x}$. Then,
$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{h}[m]} &= \frac{\partial }{\partial \mathbf{h}[m]} \left( \frac{1}{2}\sum_{i=1}^N \left( \mathbf{y}[i] - \hat{\mathbf{y}}[i] \right)^2 \right) \\ &= \sum_{i=1}^N \frac{\partial \hat{\mathbf{y}}[i]}{\partial \mathbf{h}[m]} \left(\hat{\mathbf{y}}[i] - \mathbf{y}[i] \right) \\ &= \sum_{i=1}^N \mathbf{x}[i-m] \left(\hat{\mathbf{y}}[i] - \mathbf{y}[i] \right) \\ &= \sum_{i=1}^N \mathbf{x}[i-m] \left((\mathbf{h} \ast \mathbf{x})[i] - \mathbf{y}[i] \right), \quad \text{for } -M \leq m \leq M. \end{aligned}$$

Note that this may be interpreted as a convolution of a flipped version of $\mathbf{x}[n]$ with $(\hat{\mathbf{y}}-\mathbf{y})[n]$. To see this, let $\vec{\mathbf{x}}[n] = \mathbf{x}[-n]$. Then,
$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{h}[m]} &= \sum_{i=1}^N \vec{\mathbf{x}}[m-i] \left(\hat{\mathbf{y}}[i] - \mathbf{y}[i] \right), \quad \text{for } -M \leq m \leq M \\ &\Leftrightarrow \\ \left(\frac{\partial \mathcal{L}}{\partial \mathbf{h}}\right)[m] &= \left((\hat{\mathbf{y}} - \mathbf{y}) \ast \vec{\mathbf{x}} \right)[m], \quad \text{for } -M \leq m \leq M. \end{aligned}$$

Again, we assume zero-padding for $\mathbf{y}$ and $\hat{\mathbf{y}}$.
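As a quick numerical sanity check on this result (not part of the solution itself), here is a NumPy sketch that evaluates the loss and the derived gradient and compares the gradient against central finite differences. The helper name `loss_and_grad` is mine; it uses `np.convolve(..., mode="same")` for the zero-padded convolution, which matches the definition above when the kernel length $2M+1$ is odd and $N \geq 2M+1$.

```python
import numpy as np

def loss_and_grad(h, x, y):
    """L = 0.5 * sum_i (y[i] - (h*x)[i])^2 and dL/dh[m] = sum_i x[i-m] * (y_hat[i] - y[i]).

    h has length 2M+1 (h[k] is the tap at lag m = k - M); x and y have length N >= 2M+1.
    """
    M = (len(h) - 1) // 2
    N = len(x)
    y_hat = np.convolve(x, h, mode="same")   # zero-padded 'same' convolution
    e = y_hat - y                            # residual y_hat - y
    L = 0.5 * np.sum(e ** 2)
    g = np.zeros_like(h)
    for k, m in enumerate(range(-M, M + 1)):
        for i in range(N):
            if 0 <= i - m < N:               # x is zero outside its support
                g[k] += x[i - m] * e[i]
    return L, g

# central finite-difference check of the analytic gradient
rng = np.random.default_rng(0)
h, x, y = rng.standard_normal(5), rng.standard_normal(12), rng.standard_normal(12)
eps = 1e-6
g_num = np.array([
    (loss_and_grad(h + eps * d, x, y)[0] - loss_and_grad(h - eps * d, x, y)[0]) / (2 * eps)
    for d in np.eye(len(h))
])
assert np.allclose(loss_and_grad(h, x, y)[1], g_num, atol=1e-5)
```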
There are two distinct approaches to deriving this partial derivative: the Jacobian way and the scalar way. I see many students trip themselves up because they're aware of these two methods but not fully aware of the distinction between them.
From preschool we are all familiar with the standard scalar chain rule of calculus. For differentiable $f: \mathbb{R} \rightarrow \mathbb{R}$ and $g: \mathbb{R} \rightarrow \mathbb{R}$,
$$\frac{\partial (f\circ g)}{\partial x} = f^\prime(g(x)) g^\prime(x) = \frac{\partial f}{\partial g}\Bigm\lvert_{g(x)} \frac{\partial g}{\partial x} \Bigm\lvert_{x}.$$

It's this rule that we directly employ in the solution above, by expanding the loss function in terms of its scalar variables, $\hat{\mathbf{y}}[n]$ and $\mathbf{y}[n]$.
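As a toy numerical illustration (my own example, with $f(u) = \sin u$ and $g(x) = x^2$), the scalar chain rule can be checked against a finite difference:

```python
import numpy as np

# toy check of (f o g)'(x) = f'(g(x)) * g'(x), assuming f(u) = sin(u) and g(x) = x**2
f, df = np.sin, np.cos
g, dg = lambda x: x ** 2, lambda x: 2 * x

x = 0.7
analytic = df(g(x)) * dg(x)                            # cos(x**2) * 2x
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # central difference
assert np.isclose(analytic, numeric, atol=1e-6)
```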
However, students also learn that a similar chain rule exists for vector input/output mappings. Namely, for differentiable mappings $f: \mathbb{R}^K \rightarrow \mathbb{R}^M$ and $g: \mathbb{R}^N \rightarrow \mathbb{R}^K$,
$$\frac{\partial (f\circ g)}{\partial x} = \frac{\partial f}{\partial g}\Bigm\lvert_{g(x)} \frac{\partial g}{\partial x} \Bigm\lvert_{x},$$

where $\frac{\partial f}{\partial g} \in \mathbb{R}^{M \times K}$ is the Jacobian matrix of $f$, defined element-wise as
$$\left(\frac{\partial f}{\partial g}\right)_{ij} = \frac{\partial f_i}{\partial g_j}, \quad 1 \leq i \leq M, \quad 1 \leq j \leq K,$$

and an analogous definition for $\frac{\partial g}{\partial x} \in \mathbb{R}^{K \times N}$. As a sanity check, observe that the shapes in the matrix multiplication in (7) work out, and that the Jacobian of $f \circ g$ is an $M \times N$ matrix.
Trouble then arises when students derive the elements of the Jacobian matrices in (7) separately and then forget to compose them using matrix multiplication, i.e.,
$$\left(\frac{\partial f}{\partial x}\right)_{ij} = \sum_{k=1}^K \left(\frac{\partial f}{\partial g}\right)_{ik} \left(\frac{\partial g}{\partial x}\right)_{kj} = \sum_{k=1}^K \frac{\partial f_i}{\partial g_k} \frac{\partial g_k}{\partial x_j}.$$

Students who have trouble will often try to do element-wise multiplication of $\frac{\partial f}{\partial g}$ and $\frac{\partial g}{\partial x}$, even when the shapes don't make sense.
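To make the composition concrete, here is a small toy example of my own (a linear $g: \mathbb{R}^3 \rightarrow \mathbb{R}^2$ composed with an elementwise square, so $N = 3$, $K = 2$, $M = 2$): the Jacobians chain by matrix multiplication, while the tempting element-wise product doesn't even have compatible shapes.

```python
import numpy as np

# toy composition: g: R^3 -> R^2 is linear, f: R^2 -> R^2 squares elementwise,
# so N = 3, K = 2, M = 2 in the notation above
A = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]])
g = lambda x: A @ x
f = lambda u: u ** 2

def jacobian_fg(x):
    """Chain rule for Jacobians: J_{f o g}(x) = J_f(g(x)) @ J_g(x)."""
    J_g = A                        # (K, N) Jacobian of the linear map g
    J_f = np.diag(2.0 * g(x))      # (M, K) Jacobian of the elementwise square at g(x)
    return J_f @ J_g               # matrix product -> (M, N)

# numerical Jacobian of f o g, built column by column with central differences
x = np.array([0.3, -0.7, 1.1])
eps = 1e-6
J_num = np.column_stack([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(len(x))
])
assert np.allclose(jacobian_fg(x), J_num, atol=1e-6)

# the common mistake: an element-wise product of the two Jacobians,
# np.diag(2.0 * g(x)) * A, would need shapes (2, 2) and (2, 3) to broadcast -- they don't
```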
Solution: the Jacobian way
Let $\hat{\mathbf{y}} = \mathbf{h} \ast \mathbf{x}$. Then,
$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{h}},$$

where $\frac{\partial \mathcal{L}}{\partial \mathbf{h}} \in \mathbb{R}^{1 \times (2M+1)}$, $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \in \mathbb{R}^{1 \times N}$, and $\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{h}} \in \mathbb{R}^{N \times (2M+1)}$. The Jacobian of the loss w.r.t. $\hat{\mathbf{y}}$ is,
$$\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}[j]} = \hat{\mathbf{y}}[j] - \mathbf{y}[j], \quad \text{for } 1 \leq j \leq N.$$

And the Jacobian of the convolution w.r.t. the kernel is,
$$\frac{\partial \hat{\mathbf{y}}[i]}{\partial \mathbf{h}[j]} = \mathbf{x}[i-j], \quad \text{for } 1 \leq i \leq N, ~ -M \leq j \leq M.$$

Combining them using matrix multiplication, we get
$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial \mathbf{h}[m]} &= \sum_{i=1}^N \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}[i]}\frac{\partial \hat{\mathbf{y}}[i]}{\partial \mathbf{h}[m]} \\ &= \sum_{i=1}^N \mathbf{x}[i-m](\hat{\mathbf{y}}[i] - \mathbf{y}[i]), \quad \text{for } -M \leq m \leq M. \end{aligned}$$
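For completeness, here is a NumPy sketch (the helper names are mine) that builds the two Jacobians explicitly, $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}$ as a $1 \times N$ row vector and $\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{h}}$ as an $N \times (2M+1)$ matrix of shifted, zero-padded copies of $\mathbf{x}$, and checks that their matrix product matches the scalar-way result:

```python
import numpy as np

def jacobian_way_grad(h, x, y):
    """dL/dh = (dL/dy_hat) @ (dy_hat/dh): a (1 x N) row vector times an (N x (2M+1)) matrix."""
    M = (len(h) - 1) // 2
    N = len(x)
    y_hat = np.convolve(x, h, mode="same")     # zero-padded 'same' convolution (N >= 2M+1)
    dL_dyhat = (y_hat - y).reshape(1, N)       # 1 x N
    dyhat_dh = np.zeros((N, 2 * M + 1))        # N x (2M+1), entry (i, k) holds x[i - m], m = k - M
    for i in range(N):
        for k, m in enumerate(range(-M, M + 1)):
            if 0 <= i - m < N:                 # x is zero outside its support
                dyhat_dh[i, k] = x[i - m]
    return (dL_dyhat @ dyhat_dh).ravel()       # 1 x (2M+1) -> flat array of length 2M+1

def scalar_way_grad(h, x, y):
    """The scalar-chain-rule result: dL/dh[m] = sum_i x[i-m] (y_hat[i] - y[i])."""
    M = (len(h) - 1) // 2
    N = len(x)
    e = np.convolve(x, h, mode="same") - y
    return np.array([
        sum(x[i - m] * e[i] for i in range(N) if 0 <= i - m < N)
        for m in range(-M, M + 1)
    ])

rng = np.random.default_rng(1)
h, x, y = rng.standard_normal(5), rng.standard_normal(10), rng.standard_normal(10)
assert np.allclose(jacobian_way_grad(h, x, y), scalar_way_grad(h, x, y))
```

Either way the gradient is the same; the Jacobian bookkeeping just makes the shapes, and the matrix multiplication that composes them, explicit.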