I've recently seen some graduate students get confused about partial derivatives and applying the chain rule. The following is a quiz I've written for my PhD advisor's Image and Video Processing students; I will use it to illustrate the proper application of the multi-dimensional chain rule. The quiz question is as follows:
Consider the following loss function, with input $x[n]$, $1 \le n \le N$, and target $y[n]$, $1 \le n \le N$,
$$ L(y, x; h) = \frac{1}{2} \left\| y - h * x \right\|_2^2, $$
where $h[m]$, $-M \le m \le M$, is a learnable 1D filter and $*$ denotes 1D convolution with zero-padding, i.e.,
$$ (h * x)[n] = \sum_{m=-M}^{M} h[m] \, x[n-m], \qquad 1 \le n \le N, $$
where $x[n < 1] \equiv 0$ and $x[n > N] \equiv 0$.
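For readers who want to check this convention numerically, here is a minimal sketch (assuming NumPy; the helper name `conv_zero_pad` and the convention that the array `h` stores $h[-M], \dots, h[M]$ in order are mine, not part of the quiz). For an odd-length filter, the definition coincides with `np.convolve` in `'same'` mode:

```python
import numpy as np

def conv_zero_pad(h, x):
    """Direct implementation of (h * x)[n] = sum_m h[m] x[n - m],
    treating x as zero outside its support (zero-padding).
    h has length 2M + 1 and stores h[-M], ..., h[M] in order;
    x is 0-indexed here, versus 1-indexed in the text."""
    M = (len(h) - 1) // 2
    N = len(x)
    out = np.zeros(N)
    for n in range(N):
        for m in range(-M, M + 1):
            if 0 <= n - m < N:          # zero-padding: skip out-of-range x
                out[n] += h[m + M] * x[n - m]
    return out

rng = np.random.default_rng(0)
h = rng.standard_normal(5)    # M = 2
x = rng.standard_normal(16)   # N = 16
assert np.allclose(conv_zero_pad(h, x), np.convolve(x, h, mode='same'))
```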
Derive the partial derivative,
$$ \frac{\partial L}{\partial h[m]}, \qquad \text{for } -M \le m \le M. $$
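Whatever closed form you derive, it is worth sanity-checking it against finite differences. Here is a minimal sketch of such a check (assuming NumPy; `loss` and `grad_fd` are hypothetical helper names of mine, and the `mode='same'` call relies on the equivalence noted above):

```python
import numpy as np

def loss(h, x, y):
    """L(y, x; h) = 0.5 * ||y - h * x||_2^2, with zero-padded convolution
    (odd-length h, centered at m = 0)."""
    e = y - np.convolve(x, h, mode='same')
    return 0.5 * np.dot(e, e)

def grad_fd(h, x, y, eps=1e-6):
    """Numerical dL/dh[m] via central differences, one filter tap at a time.
    Entry i of the result corresponds to m = i - M for h of length 2M + 1."""
    g = np.zeros_like(h)
    for i in range(len(h)):
        hp, hm = h.copy(), h.copy()
        hp[i] += eps
        hm[i] -= eps
        g[i] = (loss(hp, x, y) - loss(hm, x, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
h, x, y = rng.standard_normal(5), rng.standard_normal(16), rng.standard_normal(16)
print(grad_fd(h, x, y))   # compare against your derived expression
```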
Below is a solution that I would expect students to arrive at. It makes use of the scalar chain rule and doesn't worry about deriving Jacobians, as the course does not emphasize this perspective heavily.
There are two distinct approaches to deriving this partial derivative: the Jacobian way and the scalar way. I see many students trip themselves up because they are aware of the two methods but not fully aware of the distinction between them.
From preschool we are all familiar with the standard scalar chain rule of calculus. For differentiable $f : \mathbb{R} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$,
$$ \frac{\partial (f \circ g)}{\partial x} = f'(g(x)) \, g'(x) = \left. \frac{\partial f}{\partial g} \right|_{g(x)} \left. \frac{\partial g}{\partial x} \right|_{x}. $$
It's this rule that we directly employ in the solution, by expanding the loss function in terms of its scalar variables, $\hat{y}[n]$ and $y[n]$.
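For concreteness, writing $\hat{y}[n] = (h * x)[n]$, the expansion runs along the following lines (a sketch of the intended scalar route, not a full solution write-up):

$$
L = \frac{1}{2} \sum_{n=1}^{N} \big( y[n] - \hat{y}[n] \big)^2
\quad \Longrightarrow \quad
\frac{\partial L}{\partial h[m]}
= \sum_{n=1}^{N} \frac{\partial L}{\partial \hat{y}[n]} \, \frac{\partial \hat{y}[n]}{\partial h[m]}
= -\sum_{n=1}^{N} \big( y[n] - \hat{y}[n] \big) \, x[n-m],
$$

where the last step uses $\partial \hat{y}[n] / \partial h[m] = x[n-m]$, read off directly from the convolution sum.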
However, students also learn that a similar chain rule exists for vector input/output mappings. Namely, for differentiable mappings $f : \mathbb{R}^K \to \mathbb{R}^M$ and $g : \mathbb{R}^N \to \mathbb{R}^K$,
$$ \frac{\partial (f \circ g)}{\partial x} = \left. \frac{\partial f}{\partial g} \right|_{g(x)} \left. \frac{\partial g}{\partial x} \right|_{x}, \tag{7} $$
where $\frac{\partial f}{\partial g} \in \mathbb{R}^{M \times K}$ is the Jacobian matrix of $f$, defined element-wise as
$$ \left( \frac{\partial f}{\partial g} \right)_{ij} = \frac{\partial f_i}{\partial g_j}, \qquad 1 \le i \le M, \quad 1 \le j \le K, $$
and an analogous definition for $\frac{\partial g}{\partial x} \in \mathbb{R}^{K \times N}$. As a sanity check, observe that the shapes in the matrix multiplication in (7) are compatible, and that the Jacobian of $f \circ g$ is an $M \times N$ matrix.
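Applied to the quiz problem, the Jacobian route composes two pieces. Here is a sketch, with $\hat{y} = h * x \in \mathbb{R}^N$; the scalar-valued $L$ plays the role of $f$ with output dimension one, which is unrelated to the filter half-width $M$:

$$
\frac{\partial L}{\partial h}
= \left. \frac{\partial L}{\partial \hat{y}} \right|_{\hat{y} = h * x} \frac{\partial \hat{y}}{\partial h},
\qquad
\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y})^\top \in \mathbb{R}^{1 \times N},
\qquad
\left( \frac{\partial \hat{y}}{\partial h} \right)_{nm} = x[n-m] \in \mathbb{R}^{N \times (2M+1)}.
$$

Multiplying the $1 \times N$ row vector by the $N \times (2M+1)$ matrix reproduces the scalar-rule result above, one entry per filter tap.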
Trouble then arises when students derive the elements of the Jacobian matrices in (7) separately and then forget to compose them using matrix multiplication, i.e.,