8 Non-linear minimization

Up to now: \(\vb f(\vb m)=\vb G \vb m\) (with constant \(\vb G\))

Now: either there is no \(\vb G\) or \(\vb G\) depends on \(\vb m\)

Example: traveltime tomography with curved rays

8.1 Comparison straight vs. curved rays

8.2 Linearization of non-linears

Non-linear problem \(y=y_0 e^{\beta x}\) with \(\vb m=(y_0, \beta)^T\)

Data transformation: \(\hat y = \log y=\log y_0 + \beta x\)

\(\Rightarrow\) Application of linear inverse theory

Logarithmic model transform (positivity constraint)

\(\hat m_i = \log m_i\) \(\Rightarrow\) non-linear problem for \(\hat{\vb m}\) even if linear for \(\vb m\)

8.3 Linearization using Taylor series

One-dimensional function \(f(x)\) of single parameter \(x\):

\[ f(x) = f(x_0) + f'(x_0) \cdot (x-x_0) + f''(x_0)/2 \cdot (x-x_0)^2 + \ldots \]

Multi-dimensional function of model parameters \(m_1, m_2, ...\):

\[ f(\vb m) = f(\vb m_0 + \Delta \vb m) \approx f(\vb m_0) + \sum\limits_i^M \pdv{f(\vb m_0)}{m_j} \Delta m_j \]

\[ \vb f(\vb m_0 + \Delta \vb m ) = \vb f(\vb m_0) + \sum\limits_i^M \pdv{\vb f(\vb m_0)}{m_j} \Delta m_j = \vb S \Delta\vb m \]

8.4 Linearization

\[ \vb f(\vb m_0 + \Delta \vb m ) = \vb f(\vb m_0) + \sum\limits_i^M \pdv{\vb f(\vb m_0)}{m_j} \Delta m_j = \vb f(\vb m_0) + \vb S \Delta\vb m \]

We hope the change in the model fits our data: \(\vb f(\vb m_0 + \Delta \vb m ) \approx \vb d\)

This leads to a linearized problem

\[\vb S \Delta \vb m = \vb d - \vb f(\vb m_0) = \Delta \vb d\]

\(\vb S(\vb m_0)\): sensitivity matrix of partial derivatives \(S_{ij} = \pdv{f_i(\vb m_0)}{m_j}\)

8.5 Computation of the sensitivity (Jacobian)

analytically (derivation of the forward operator)
transforming the PDE and its numerical solution
perturbation method (brute force)

\[ \pdv{f_i(\vb m)}{m_j} \approx \frac{f_i(\vb m+\delta_j \Delta m) - f_i(\vb m)}{\Delta m} \]

8.6 Iterative solution of non-linear problems

Starting with a model \(\vb m_0\), we iteratively (\(n=0,1,\ldots\)) improve the model by

\[ \vb m^{n+1} = \vb m^n + \Delta \vb m^n \]

by solving the linear sub-problem

\[ \vb S^n \Delta \vb m^n = \Delta \vb d^n = \vb d - \vb f(\vb m^n) \]

8.7 The objective function

\[ \Phi_d = \| \vb d - \vb f(\vb m) \|^2 = (\vb d - \vb f(\vb m))^T \cdot (\vb d - \vb f(\vb m)) \]

Topography of \(\Phi_d\) determines non-linearity (s. Notebook on VES)

8.8 Objective function for linear problems

\[ \Phi_d = (\vb d - \vb G \vb m)^T (\vb d - \vb G \vb m) = \] \[ (\vb d^T - \vb m^T\vb G^T) (\vb d - \vb G \vb m) = \] \[ \vb d^T \vb d + \vb m^T \vb G^T \vb G \vb m - 2 \vb d^T \vb G\vb m \]

\(\Rightarrow\) a parabola for all \(m_i\)

8.9 Example: linear regression

\(y = a x + b\) (\(a\)=0.5, \(b\)=-1)

8.10 Misfit for non-linear problems (Menke, 2012)

A single minimum
B multiple minima
C periodics
D broad range
E flat valley F separated valleys

8.11 Linearization (Menke, 2012)

8.12 Line search for strong non-linearity

The Taylor approximation might change along \(\Delta\vb m\), so we optimize the step length

\[ \vb m^{n+1} = \vb m^n + \tau^n \Delta \vb m^n \]

9 Newtons method

Taylor series expansion for the objective function:

\[ \Phi_d(\vb m) = \Phi_d(\vb m^0) + \sum\limits_i^M b_i (m_i-m_i^0) + \] \[ + \frac12 \sum\limits_i^M \sum\limits_j^M B_{ij} (m_i-m_i^0) (m_j-m_j^0) \]

with gradient vector \(b_i=\pdv{\Phi_d}{m_i}\) and Hessian matrix \(B_{ij}=\pdv{\Phi_d}{m_i}{m_j}\)

9.1 Newtons method

\[ \ldots \sum\limits_i^M b_i (m_i-m_i^0) + \frac12 \sum\limits_i^M \sum\limits_j^M B_{ij} (m_i-m_i^0) (m_j-m_j^0) \]

Minimum: derivative zero

\(\pdv{Phi_d}{m_k}=0=b_k + \sum\limits_j B_{kj}(m_k-m_k^0) \Rightarrow \vb m - \vb m^0=-\vb B^{-1} \vb b\)

9.2 Newtons method for linear problems

\[\Phi_d=(\vb d - \vb G \vb m)^T (\vb d - \vb G \vb m)\]

\[b_i=\pdv{\Phi_d}{m_i}=-2\vb G^T(\vb d - \vb G \vb m)\]

\[B_{ij}=\pdv{\Phi_d}{m_i}{m_j}=2\vb G^T\vb G\]

\(\Rightarrow \vb m = -\vb B^{-1} \vb b = (\vb G^T\vb G)^{-1} \vb G^T \vb d\) (least-squares solution)

9.3 Newtons method for non-linear problems

\[\Phi_d=(\vb d - \vb f(\vb m))^T (\vb d - \vb f(\vb m))\]

\[b_i=\pdv{\Phi_d}{m_i}=-2\vb S^T(\vb d - \vb f(\vb m)) \quad\mbox{with}\quad S_{ij}=\pdv{f_i(\vb m^0)}{m_j}\]

\[B_{ij} = \pdv{\Phi_d}{m_i}{m_j} = \pdv{b_i}{m_j} \approx 2\vb S^T\vb S\]

\(\Rightarrow \vb m - \vb m^0 = -\vb B^{-1} \vb b = (\vb S^T\vb S)^{-1} \vb S^T (\vb d - \vb f(\vb m))\)

\(\Rightarrow\) least-squares solution for \(\Delta\vb m\)/\(\Delta\vb d\)

9.4 Linearization (Menke, 2012)

9.5 Gauss-Newton minimization

Choose starting model \(\vb m^0\) and set \(n\)=0
Compute model response \(\vb f(\vb m^n)\)
Compute sensitivity matrix \(\vb S^n\)
Solve linearized subproblem \(\vb S^n \Delta\vb m^n=\Delta\vb d=(\vb d-\vb f(\vb m^n))\)
Update model by \(\vb m^{n+1} = \vb m^n + \Delta\vb m^n\)
If convergence quit, otherwise \(n\leftarrow n+1\) & proceed with 2.

9.6 Line search for strong non-linearity

The Taylor approximation might change along \(\Delta\vb m\), so we optimize the step length \(\tau\) by “searching along the line”

\[ \vb m^{n+1} = \vb m^n + \tau^n \Delta \vb m^n \]

exact line search: testing \(\vb f(\vb m + \tau \Delta \vb m)\) for many \(\tau\)
interpolation of \(\vb f(\vb m^n+\tau\Delta\vb m)\) between \(\vb f(\vb m^n)\) & \(\vb f(\vb m^n+\Delta \vb m)\)
fit parabola through points \(\vb f(\vb m^n)\), \(\vb f(\vb m^n+\Delta \vb m/2)\), \(\vb f(\vb m^n+\Delta \vb m)\)

9.7 Gauss-Newton minimization

Choose starting model \(\vb m^0\) and set \(n\)=0
Compute model response \(\vb f(\vb m^n)\)
Compute sensitivity matrix \(\vb S^n\)
Solve linearized subproblem \(\vb S^n \Delta\vb m^n=\Delta\vb d=(\vb d-\vb f(\vb m^n))\)
Optimize line search parameter \(\tau^n\)
Update model by \(\vb m^{n+1} = \vb m^n + \tau^n \Delta\vb m^n\)
If convergence quit, otherwise \(n\leftarrow n+1\) & proceed with 2.

9.8 Model transformation

To keep the parameters positive, we often invert for the logarithms.

If we invert for \(\hat{m}\) instead of \(m\), we use the chain rule \[\pdv{f}{\hat{m}} = \pdv{f}{m} \cdot \pdv{m}{\hat{m}} = \pdv{f}{m} / \pdv{\hat{m}}{m}\]

E.g. \(\pdv*{\log\rho}{\rho}=1/\rho\)

9.9 Data transformation

Often, measured data show a wide range so that we use the logarithm

If we invert \(\hat{d}\) instead of \(d\), we use the chain rule \[\pdv{\hat{f}}{m} = \pdv{f}{m} \cdot \pdv{\hat{f}}{f}\]

E.g. \(\pdv*{\log\rho^a}{\rho^a}=1/\rho_a\)

9.10 Combined model and data transformation

Sensitivity \(S_{ij}=\pdv{\rho^a_i}{\rho_j}\)

Data \(d_i=\log \rho^a_i\), model parameter \(m_i=\log\rho_j\)

\(\Rightarrow\) Jacobian matrix

\[J_{ij} = \pdv{\log \rho^a_i}{\log\rho_j} = \pdv{\rho^a_i}{\rho_j} \cdot \frac{\rho_j}{\rho^a_i}\]

9.11 Regularization

\[\Phi=(\vb d - \vb f(\vb m))^T (\vb d - \vb f(\vb m)) + \lambda (\vb c - \vb C\vb m)^T (\vb c - \vb C\vb m)\]

\[b_i=\pdv{\Phi}{m_i}=-2\vb S^T(\vb d - \vb f(\vb m)) - 2\lambda \vb C^T (\vb c - \vb C \vb m)\]

\[B_{ij} = \pdv{\Phi}{m_i}{m_j} = \pdv{b_i}{m_j} \approx 2\vb S^T\vb S + 2\lambda\vb C^T \vb C\]

\[\Rightarrow (\vb S^T\vb S + \lambda \vb C^T\vb C) \Delta\vb m = \vb S^T (\vb d-\vb f(\vb m^n)) + \lambda \vb C^T(\vb c - \vb C\vb m)\]