Either algebraically or graphically, we may then obtain the following piecewise representation for $x_i$ in terms of $v_i$,
$$
x_i = \begin{cases}
v_i + \lambda, & v_i < -\lambda \\
0, & |v_i| \le \lambda \\
v_i - \lambda, & v_i > \lambda
\end{cases}
$$
This corresponds to the element-wise soft-thresholding operator, $S_\lambda$, shown below.
The soft-thresholding operator may be written more compactly as
$$
S_\lambda(u) = \operatorname{sign}(u)\max(|u| - \lambda,\, 0),
$$
or in Deep-Learning notation as
$$
S_\lambda(u) = \operatorname{sign}(u)\,\mathrm{relu}(|u| - \lambda).
$$
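As a quick sanity check, here is a minimal NumPy sketch of the operator (the function name `soft_threshold` and the test values are illustrative, not part of the original derivation):

```python
import numpy as np

def soft_threshold(u, lam):
    """Element-wise soft-thresholding: S_lam(u) = sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

# Entries with |u| <= lam are zeroed; the rest are shrunk toward zero by lam.
x = soft_threshold(np.array([-2.0, -0.3, 0.0, 0.5, 1.5]), lam=1.0)
# x is (up to the sign of zero) [-1.0, 0.0, 0.0, 0.0, 0.5]
```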
Thus we have derived the proximal operator for the ℓ1 vector-norm to be the element-wise soft-thresholding operator (also known as the shrinkage-thresholding operator).
In order to continue our derivation of ISTA, it is helpful to look at the proximal operator from a different perspective. First recall that the subdifferential $\partial f(x_0) \subset \mathbb{R}^n$ of a convex function $f$ at a point $x_0$ is the set defined as
$$
\partial f(x_0) = \left\{\, y \;\middle|\; f(x) - f(x_0) \ge y^T (x - x_0), \;\forall x \in \operatorname{dom} f \,\right\},
$$
i.e., the set of slopes $y$ for which the hyperplane through $(x_0, f(x_0))$ remains below the function everywhere. This allows us to generalize the idea of the derivative to convex functions that are non-smooth (not differentiable everywhere).
For instance, we know that a differentiable convex function has a global minimum at $x_0$ if and only if its gradient is zero there, $\nabla f(x_0) = 0$. Similarly, a non-smooth convex function has a global minimum at $x_0$ if and only if zero is in the subdifferential, $0 \in \partial f(x_0)$. Note that at any point where $f$ is differentiable, the subdifferential reduces to the singleton $\{\nabla f(x_0)\}$.
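A standard one-dimensional example makes this concrete: for $f(x) = |x|$ the subdifferential is

$$
\partial f(x_0) = \begin{cases}
\{-1\}, & x_0 < 0 \\
[-1,\, 1], & x_0 = 0 \\
\{+1\}, & x_0 > 0,
\end{cases}
$$

so $0 \in \partial f(x_0)$ exactly at $x_0 = 0$, which is indeed the global minimizer of $|x|$.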
Armed with our understanding of the proximal operator, the subdifferential, and the prox-op's characterization as the resolvent of the subdifferential, we can continue on our journey to derive ISTA by deriving the Proximal Gradient Method (Boyd, Sec. 4.2.1).
First, we will state the method. Consider the following optimization problem:
$$
\underset{x}{\text{minimize}} \quad f(x) + g(x)
$$
with f and g both convex functionals and f differentiable. The proximal gradient method
$$
x_{k+1} := \operatorname{prox}_{\eta g}\!\left(x_k - \eta \nabla f(x_k)\right),
$$
is a fixed-point iteration that converges to a minimizer of the objective function $f(x) + g(x)$ for a fixed step-size $\eta \in (0, 1/L]$, where $L$ is the Lipschitz constant of $\nabla f$. It can be shown that the algorithm converges as $O(1/k)$ – very slow!
This can be seen from the necessary and sufficient condition that $x^\star$ is a minimizer of $f + g$ if and only if zero is in the subdifferential of $f + g$ at $x^\star$:

$$
\begin{aligned}
0 &\in \nabla f(x^\star) + \partial g(x^\star) \\
0 &\in \eta\nabla f(x^\star) + \eta\,\partial g(x^\star) \\
0 &\in \eta\nabla f(x^\star) - x^\star + x^\star + \eta\,\partial g(x^\star) \\
(I + \eta\,\partial g)(x^\star) &\ni (I - \eta\nabla f)(x^\star) \\
x^\star &= (I + \eta\,\partial g)^{-1}(I - \eta\nabla f)(x^\star) \\
x^\star &= \operatorname{prox}_{\eta g}\!\left(x^\star - \eta\nabla f(x^\star)\right).
\end{aligned}
$$
We've used the fact that the proximal operator is the resolvent of the subdifferential to move from the containment relation to one of equality. The operator
$$
(I + \eta\,\partial g)^{-1}(I - \eta\nabla f)
$$
is referred to as the forward-backward operator. The proximal gradient method as shown applies the forward-backward operator in a fixed-point iteration to minimize f+g.
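As a sketch of the method in code (the helper names `grad_f` and `prox_g` are assumptions for illustration; `prox_g(v, eta)` is taken to evaluate $\operatorname{prox}_{\eta g}(v)$):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, eta, n_iter=100):
    """Fixed-point iteration on the forward-backward operator:
    x_{k+1} = prox_{eta g}(x_k - eta * grad_f(x_k))."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        v = x - eta * grad_f(x)  # forward (explicit) gradient step on f
        x = prox_g(v, eta)       # backward (implicit) proximal step on g
    return x
```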
We're now ready to assemble the Iterative Soft-Thresholding Algorithm. We'll restate the objective for convenience,
$$
\underset{x}{\text{minimize}} \quad \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|x\|_1
$$
where $A \in \mathbb{R}^{n \times m}$ with $m \gg n$, i.e. a fat/over-complete dictionary, $y \in \mathbb{R}^n$ is our signal, and $x \in \mathbb{R}^m$ is our sparse code to be determined.
If we label the quadratic term as f and the relaxed sparsity term as g, we can use the proximal gradient method to get the update step
$$
x_{k+1} := S_{\lambda\eta}\!\left(x_k - \eta A^T(Ax_k - y)\right)
$$
where $S_{\lambda\eta}$ is the proximal operator of $\eta g = \eta\lambda\|\cdot\|_1$, i.e. soft-thresholding with parameter $\lambda\eta$, as derived earlier.
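Concretely, the two ingredients substituted into the generic proximal gradient update are

$$
\nabla f(x) = A^T(Ax - y), \qquad \operatorname{prox}_{\eta g}(v) = \operatorname{prox}_{\eta\lambda\|\cdot\|_1}(v) = S_{\lambda\eta}(v).
$$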
More often, the ISTA update step is presented as
$$
x_{k+1} := S_{\lambda/L}\!\left(x_k - \tfrac{1}{L} A^T(Ax_k - y)\right)
$$
where $L = \sigma_{\max}(A)^2$, the square of the maximum singular value of $A$, i.e. the largest eigenvalue of $A^TA$ and the Lipschitz constant of $\nabla f$. This is simply ISTA with the fixed maximum step-size $\eta = 1/L$, ensuring that the gradient step $I - \tfrac{1}{L}A^TA$ is non-expansive.
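Putting it all together, a compact NumPy sketch of the full iteration might look as follows (the function name `ista`, the zero initialization, and the iteration count are illustrative choices, not prescribed by the derivation):

```python
import numpy as np

def ista(A, y, lam, n_iter=500):
    """Minimize (1/2)||Ax - y||_2^2 + lam * ||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2  # sigma_max(A)^2, the Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        v = x - (1.0 / L) * (A.T @ (A @ x - y))                 # gradient (forward) step
        x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)   # soft-threshold (backward) step
    return x
```

Note that the threshold applied at each iteration is $\lambda/L$, matching the update above.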
In sum, ISTA is a fixed-point iteration on the forward-backward operator defined by the soft-thresholding (prox-op of the ℓ1 norm) and the gradient of the quadratic difference between the original signal and its sparse-code reconstruction. The threshold and step-size of the algorithm are determined by the sparsity-fidelity trade-off required by the problem and the maximum scaling possible in the dictionary.