2  Linear problems

Note: Linear problems

\(\vb f(\vb m)\) is linear with respect to \(\vb m\) \(\Rightarrow\) the forward response can be written as a matrix-vector product

\[\vb d = \vb G \vb m + \vb n \]

Note: Examples

Gravity, Magnetics, Magnetic Resonance, VSP, straight-ray tomography

Warning: Problem

\(\vb m = \vb G^{-1}\vb d\)? No, because \(\vb G\) is usually not invertible, often not even square.

2.1 Types of inverse problems

Important: Problem size
  • every row stands for a measurement (data point)
  • every column represents a model parameter

\[\vb d = \vb G \vb m + \vb n \quad\Rightarrow\quad \vb G\in \mathbb{R}^{N\times M}\]

  • \(N\) = \(M\): even-determined
  • \(N\) > \(M\): over-determined problem (in general no exact solution)
  • \(N\) < \(M\): under-determined problem (no unique solution)
  • in practice mostly mixed-determined: both over- and under-determined model parts

2.1.1 Types of inverse problems (Menke, 2012)

2.2 A simple matrix problem

Let us assume a simple linear problem with two model parameters and three equations (representing data)

\[ \begin{aligned} m_1 - m_2 &= -1 \\ 2 m_1 - m_2 &= 0 \\ m_1 + m_2 &= 2.5 \end{aligned} \]

whose forward operator can be expressed by a matrix \[ \textbf{G}=\mqty(1&-1\\2&-1\\1&1) \]

We can rearrange each equation so that one parameter is expressed as a function of the other:

\[m_2 = (d_i - G_{i1} \cdot m_1)/G_{i2}\]

All three equations can be plotted as lines in the \((m_1, m_2)\) plane:

Show the code
import numpy as np
import matplotlib.pyplot as plt

G = np.array([[1, -1], [2, -1], [1, 1]])   # forward operator of the three equations
d = np.array([-1, 0, 2.5])                 # data vector
fig, ax = plt.subplots()
m1 = np.linspace(0.65, 1.1, 10)
for i in range(3):
    m2 = (d[i] - G[i, 0]*m1) / G[i, 1]     # solve equation i for m2
    ax.plot(m1, m2)

ax.set_aspect(1.0)
ax.set_ylim(1.5, 2.1)
ax.grid()
ax.set_xlabel("$m_1$")
ax.set_ylabel("$m_2$");

2.2.1 Error bars

If the data \(d_i\) have some error \(\delta d_i\), the equation for \(m_2\) \[m_2 = (d_i - G_{i1} \cdot m_1)/G_{i2}\]

inherits an error

\[\delta m_2 = \delta d_i / G_{i2}\]

Show the code
fig, ax = plt.subplots()
dd = 0.05  # assumed data error
for i in range(len(G)):
    y = (d[i] - G[i, 0] * m1) / G[i, 1]
    dy = dd/G[i, 1]
    ax.fill_between(m1, y-dy, y+dy, color=f"C{i}", alpha=0.5)
    ax.plot(m1, y, "-", label=f"{i}", color=f"C{i}")

ax.set_aspect(1.0)
ax.set_ylim(1.6, 2.0)
ax.legend()
ax.grid()

With a larger assumed data error (dd = 0.1 instead of 0.05), the bands widen and their overlap region grows.

Show the code
fig, ax = plt.subplots()
dd = 0.1  # larger assumed data error
for i in range(len(G)):
    y = (d[i] - G[i, 0] * m1) / G[i, 1]
    dy = dd/G[i, 1]
    ax.fill_between(m1, y-dy, y+dy, color=f"C{i}", alpha=0.5)
    ax.plot(m1, y, "-", label=f"{i}", color=f"C{i}")

ax.set_aspect(1.0)
ax.set_ylim(1.6, 2.0)
ax.legend()
ax.grid()

2.3 The objective function

We minimize the squared \(L_2\) norm (sum of squared distances) of the residual between data and model response:

\[ \Phi = \| \vb d - \vb G \vb m \|^2_2 = \sum\limits_{i=1}^N (d_i-g_i(\vb m))^2 \rightarrow\min \]

Let’s compute the objective function for a range of values for \(m_1\) and \(m_2\) (grid search).

2.3.1 Distance function for the first two data

Show the code
m1 = np.arange(0.65, 1.1, 0.01)    # grid of m1 values
m2 = np.arange(1.6, 2.001, 0.01)   # grid of m2 values
M1, M2 = np.meshgrid(m1, m2)
fig, ax = plt.subplots(ncols=3, sharex=True, sharey=True)
Y = [0, 0, 0]
ext = [m1[0], m1[-1], m2[-1], m2[0]]
for i in range(3):
    Y[i] = np.abs(G[i, 0] * M1 + G[i, 1] * M2 - d[i])   # absolute misfit of equation i
    ax[i].matshow(Y[i], extent=ext, cmap="Spectral_r")

ax[0].invert_yaxis()

If we combine the distances of the first two (the square root of the sum of their squares), we observe an elongated minimum, similar to that of the two lines alone.

Show the code
plt.matshow(np.sqrt(Y[0]**2+Y[1]**2), extent=ext, cmap="Spectral_r")
plt.gca().invert_yaxis()

Thus, the ambiguity is not much reduced by combining these two similar equations. If we use all three equations, there is a clear minimum of ellipsoidal shape.

Show the code
plt.matshow(np.sqrt(Y[0]**2+Y[1]**2+Y[2]**2), extent=ext, cmap="Spectral_r")
plt.gca().invert_yaxis()

The minimum pixel (grid spacing of 0.01) lies at (0.82, 1.71). How can we determine the exact position of the minimum?

2.4 The method of least squares

2.4.1 Derivation

\[\Phi = \| {\vb d}-{\vb G}{\vb m}\|^2_2 = ({\vb d}-{\vb G}{\vb m})^T ({\vb d}-{\vb G}{\vb m}) \]

\[\nabla_m = \qty(\pdv{m_1}, \pdv{m_2}, \ldots, \pdv{m_M})^T\]

\[\nabla_m \Phi = \nabla_m ({\vb G}{\vb m}-{\vb d})^T ({\vb G}{\vb m}-{\vb d})=0\]

\[\nabla_m \Phi = \nabla_m ({\vb m}^T{\vb G}^T-{\vb d}^T) ({\vb G}{\vb m}-{\vb d})=0\]

\[\nabla_m \Phi = \nabla_m \qty({\vb m}^T{\vb G}^T{\vb G}{\vb m} - 2\,{\vb d}^T{\vb G}{\vb m} + {\vb d}^T{\vb d})=0\]

\[\nabla_m \Phi = 2\,{\vb G}^T ({\vb G}{\vb m}-{\vb d})=2\,({\vb G}^T {\vb G}{\vb m}-{\vb G}^T{\vb d})=0\]

\[\Rightarrow \quad\quad {\vb m}=({\vb G}^T{\vb G})^{-1} {\vb G}^T{\vb d} ={\vb G}^\dagger{\vb d}\] \[\mbox{with}\quad {\vb G}^\dagger = ({\vb G}^T{\vb G})^{-1} {\vb G}^T\]

(generalized inverse or pseudo-inverse, pinv; in Matlab/Julia obtained by G\d)

The solution \({\vb m}_{LS}={\vb G}^\dagger {\vb d}\) is called the least-squares solution.
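
As a quick numerical check (a minimal sketch, assuming numpy as np and G, d from the example above), the normal-equation generalized inverse, numpy's pinv and lstsq all give the same solution, matching the grid-search minimum near (0.82, 1.71):

Gdag = np.linalg.inv(G.T @ G) @ G.T              # (G^T G)^-1 G^T
m_normal = Gdag @ d                              # least-squares solution via normal equations
m_pinv = np.linalg.pinv(G) @ d                   # Moore-Penrose pseudo-inverse
m_lstsq, *_ = np.linalg.lstsq(G, d, rcond=None)  # numpy's least-squares solver
print(m_normal, m_pinv, m_lstsq)                 # all approx. [0.821, 1.714]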

2.5 Simple matrix problem - Solution

Show the code
mLS, *_ = np.linalg.lstsq(G, d, rcond=None)  # least-squares solution
fig, ax = plt.subplots()
x = np.array([0.65, 1.1])
dd = 0.081
for i in range(len(G)):
    y = (d[i] - G[i, 0] * x) / G[i, 1]
    dy = dd/G[i, 1]
    co = f"C{i}"
    ax.fill_between(x, y-dy, y+dy,
        color=co, alpha=.5)
    ax.plot(x, y, "-", label=f"{i}", color=co)

ax.plot(mLS[0], mLS[1], "r*", label="m-LS")
ax.set_aspect(1.0)
ax.set_ylim(1.6, 2.0)
ax.legend()
ax.grid()

2.6 Resolution analysis

  • How does reality propagate into our image?
  • What is the quality of the inversion result?
  • What is the contribution of the individual data?

2.6.1 Model resolution

\[{\vb d}= {\vb G}{\vb m}^\text{true} + {\vb n}\] Matrix inversion with inverse operator \({\vb G}^\dagger\): \[{\vb m}^\text{est} = {\vb G}^\dagger {\vb d}= {\vb G}^\dagger{\vb G}{\vb m}^\text{true} + {\vb G}^\dagger{\vb n} = {\vb R}^M {\vb m}^\text{true} + {\vb G}^\dagger{\vb n}\] with the model resolution matrix \({\vb R}^M={\vb G}^\dagger{\vb G}\)
\(\Rightarrow\) How is the true model (\({\vb m}^\text{true}\)) reflected in the estimated (\({\vb m}^\text{est}\))?

Diagonal of \(\vb R^M\): resolution of individual parameters (0-1)

2.6.2 Data resolution

\[{\vb m}^\text{est} = {\vb G}^\dagger {\vb d}^\text{obs}\]

How are the data explained by the model? \[{\vb d}^\text{est} = {\vb G}{\vb m}^\text{est} = {\vb G}{\vb G}^\dagger{\vb d}^\text{obs} = {\vb R}^D {\vb d}^\text{obs}\]

with the data resolution (information density) matrix: \[{\vb R}^D={\vb G}{\vb G}^\dagger\]

Diagonal of \(R^D\): information content of individual data

2.6.3 Model resolution for overdetermined problems

\[{\vb G}^\dagger=({\vb G}^T{\vb G})^{-1} {\vb G}^T\]

\[\Rightarrow {\vb R}^M = ({\vb G}^T{\vb G})^{-1} {\vb G}^T {\vb G}= {\vb I}\]

perfect model resolution

\[{\vb G}^\dagger=({\vb G}^T{\vb G})^{-1} {\vb G}^T \quad \Rightarrow \quad {\vb R}^D = {\vb G}({\vb G}^T{\vb G})^{-1} {\vb G}^T\]

2.6.4 Resolution of matrix problem

Show the code
Ginv = np.linalg.inv(G.T @ G) @ G.T
RM = Ginv@G
RD = G@Ginv
im = plt.imshow(RD, vmin=-1, vmax=1, cmap="bwr")
plt.colorbar(im)

  • the 3rd row (measurement) is the most important
  • the 1st and 2nd are highly correlated; sum(diag(RD)) = 2 = \(M\) (checked numerically below)
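
A quick numerical check (a sketch assuming RM, RD and np from the cell above) confirms both statements: the model resolution matrix is the identity and the trace of the data resolution matrix equals the number of model parameters \(M=2\).

print(np.allclose(RM, np.eye(2)))   # True: perfect model resolution (overdetermined case)
print(np.trace(RD))                 # approx. 2 = M
print(np.diag(RD))                  # information content of the individual data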

2.7 Linear regression

Show the code
x = np.arange(0, 3, 0.1)
x[-1] = 3.5                       # shift the last point away from the others
err = 0.1                         # standard deviation of the synthetic noise
d = 0.5-0.5*x + np.random.randn(len(x)) * err
A = np.ones((len(x), 2))          # design matrix: column of ones (intercept) and x (slope)
A[:, 1] = x[:]
Ainv = np.linalg.inv(A.T @ A) @ A.T
m = Ainv @ d
print("m = ", m)
fig, ax = plt.subplots()
ax.plot(x, d, "b+")
ax.plot(x, A@m, "r-")
ax.grid()
residual = A @ m - d
print("mean=", np.mean(residual))
print("std=", np.std(residual))
print("chi=", np.sqrt(np.mean((residual/err)**2)))  # error-weighted RMS, i.e. sqrt(chi²)
m =  [ 0.51803119 -0.50901033]
mean= -4.4408920985006264e-17
std= 0.08366116071051583
chi= 0.8366116071051584

2.8 Linear regression - data importance matrix

Show the code
RD = A @ Ainv
fig, ax = plt.subplots(figsize=(7,6))
im = ax.matshow(RD, cmap="bwr",
    vmin=-0.3, vmax=0.3)
plt.colorbar(im);

  • neighboring data points are correlated, points at opposite ends are anti-correlated

Show the code
rd = np.diag(RD)
display(np.sum(rd))
plt.plot(x, rd, "+-", ms=5);
1.9999999999999993

  • main diagonal of \(\vb R^D\) \(\Rightarrow\) data importance (largest for the outermost points)
  • sum of importances equals \(M\) (later: the rank of the matrix)

2.9 Errors & robust inversion

  • error goes into the scaling of \(\vb G\) and \(\vb d\)
  • sometimes data have different errors
  • there might be outliers spoiling the result

2.9.1 Error weighting

Unweighted residual norm (root-mean-square, RMS):

\[\|{\bf d}-{\bf f}({\bf m})\| = \sqrt{\frac1N\sum_i (d_i-f_i({\bf m}))^2}\]

Weighting by individual error \(\epsilon_i\) (chi-square value):

\[\chi^2=\frac1N \sum \left( \frac{d_i-f_i({\bf m})}{\epsilon_i} \right)^2 \rightarrow \min\]

In case of correct error estimates: \(\chi^2\approx1\)

2.9.2 Error weighting

Replacing \(d_i\) by \(\hat{d}_i=d_i/\epsilon_i\) leads to \[{\bf m}= (\hat{\bf G}^T\hat{\bf G})^{-1} \hat{\bf G}^T \hat{\bf d}\] with \(\hat{{\bf G}}=\mbox{diag}(1/\epsilon_i)\cdot {\bf G}\)
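
A self-contained sketch of this weighting for the simple 3×2 system from above (the error values eps are hypothetical):

Gs = np.array([[1, -1], [2, -1], [1, 1]])     # the simple system from above
ds = np.array([-1.0, 0.0, 2.5])
eps = np.array([0.05, 0.05, 0.1])             # assumed individual data errors
W = np.diag(1.0 / eps)                        # diag(1/eps_i)
Gh, dh = W @ Gs, W @ ds                       # hat G and hat d
m_w = np.linalg.inv(Gh.T @ Gh) @ Gh.T @ dh    # error-weighted least-squares solution
print(m_w)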

2.9.3 General \(L_p\) norm

The \(L_2\) norm is the square root of the sum of squared elements of a vector \(\bf v\):

\[\Phi^2=\|{\bf v}\|_2=\sqrt{\sum_i v_i^2}\]

More generally, the \(L_P\) norm is computed by \[\Phi^P=\|{\bf v}\|_P=\left( \sum_i |v_i|^P \right)^{1/P}\]
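
A short numerical illustration (a sketch with a hypothetical residual vector v) of how the \(L_P\) norm depends on \(P\):

v = np.array([0.1, -0.2, 0.3, 2.0])           # hypothetical residuals with one large outlier
for P in [1, 2, 4]:
    print(P, np.sum(np.abs(v)**P)**(1/P))     # (sum |v_i|^P)^(1/P); large P emphasizes the outlier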

2.9.4 \(L_1\) norm minimization (IRLS)

\[\Phi^1=\|{\bf v}\|_1=\sum_i |v_i|\]

The \(L_1\) norm is much less prone to outliers, as the misfit is not squared.

By the ratio of its contributions to the \(L_1\) and \(L_2\) norms we can reweight \(\epsilon_i\): \[ w_i = \frac{|v_i|/\|\vb v\|^1_1}{v_i^2/\|\vb v\|^2_2} = \frac{\|\vb v\|^2_2}{|v_i| \cdot \|\vb v\|^1_1} \]

This is known as Iteratively Reweighted Least-Squares (IRLS)

2.9.5 Linear regression with outliers

Show the code
d[-1] += 1                       # add an outlier to the last data point
m = Ainv @ d                     # ordinary least-squares fit, now biased by the outlier
print("m = ", m)
fig, ax = plt.subplots()
ax.plot(x, d, "b+")
ax.plot(x, A@m, "r-")
ax.plot(x, 0.5-0.5*x, "m--")     # true line
ax.grid()
m =  [ 0.42987692 -0.4263657 ]
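
The IRLS weighting from the previous slide can be used to down-weight this outlier. A minimal sketch, assuming A, d and Ainv from the cells above:

m_irls = Ainv @ d                                   # start from the ordinary LS solution
for it in range(10):                                # a few reweighting iterations
    v = A @ m_irls - d                              # current residual vector
    w = np.sum(v**2) / ((np.abs(v) + 1e-12) * np.sum(np.abs(v)))  # IRLS weights w_i
    AtW = A.T * w                                   # A^T diag(w)
    m_irls = np.linalg.solve(AtW @ A, AtW @ d)      # weighted normal equations

print("m_IRLS =", m_irls)                           # should be closer to the true (0.5, -0.5)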

2.9.6 The misfit distribution

Important

Always have a look at the misfit distribution (not only \(\chi^2\))
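
A minimal sketch of such a check, assuming A, d, m and err from the regression example above: a histogram of the error-weighted residuals, in which the outlier clearly separates from the bulk.

wres = (A @ m - d) / err             # error-weighted residuals
fig, ax = plt.subplots()
ax.hist(wres, bins=15)               # outlier far away from the roughly Gaussian bulk
ax.set_xlabel("weighted residual")
ax.set_ylabel("count")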

2.9.7 Wrap-up overdetermined problems (\(N>M\))

  • objective function \(\Phi\) as squared data misfit
  • weighting of individual data (unitless, more flexible & objective)
  • broad minimum of the objective function
  • grid search to plot \(\Phi\) \(\Rightarrow\) only feasible for \(M=2\)
  • the least-squares method yields the best-fitting model \({\vb m}_{LS}={\vb G}^\dagger{\vb d}\)
  • resolution matrices for model (perfect) and data (distributed)

2.10 Under-determined problems

  • specifically \(N<M\) (too few data for the model parameters), but more generally whenever the rank \(r<M\)
Note: Rank of a matrix

the number of linearly independent columns (and rows), i.e. the degree of non-degeneracy.

2.10.1 Example of a mixed-determined problem

\[ \textbf{G}=\mqty(1 & 1 & 0 \\ 0 & 0 & 1) \]

  • \(N\)<\(M\): clearly under-determined
  • parameter \(m_3\) is perfectly determined (\(=d_2\))
  • any combination of \(m_1\) and \(m_2=d_1-m_1\) fulfils the first equation
    \(\Rightarrow\) there is no unique solution (model ambiguity)

2.10.2 Changing the system

\[ \textbf{G}=\mqty(1 & 1 & 0 \\ 0 & 0 & 1) \]

adding the sum of data 1 and 2 as a third equation

\[ \textbf{G}=\mqty(1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1) \]

does not help, because the equations are linearly dependent (\(\operatorname{rank}({\bf G})=2\)).
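
This can be checked numerically (a small sketch assuming numpy as np):

G2 = np.array([[1, 1, 0], [0, 0, 1]])
G3 = np.array([[1, 1, 0], [0, 0, 1], [1, 1, 1]])              # with the added, dependent equation
print(np.linalg.matrix_rank(G2), np.linalg.matrix_rank(G3))   # both 2 < M = 3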

2.10.3 Minimum-norm solution

Analogous to the least-squares inverse \(({\bf G}^T {\bf G})^{-1} {\bf G}^T\), there is a generalized inverse for under-determined problems:

\[ \bf m = G^T (G G^T)^{-1} d = G^\dagger d \]

which is called the minimum-norm solution.

Note: Observation from the notebook

The backslash operator (G\d) yields the minimum-norm solution.
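
For the 2×3 example above, the minimum-norm solution can be computed directly (a sketch assuming numpy as np; the data values in d2 are hypothetical):

G2 = np.array([[1, 1, 0], [0, 0, 1]])
d2 = np.array([1.0, 2.0])                            # hypothetical data
m_mn = G2.T @ np.linalg.inv(G2 @ G2.T) @ d2          # G^T (G G^T)^-1 d
print(m_mn)                                          # [0.5, 0.5, 2.0]: m1 and m2 share d1 equally
print(np.linalg.pinv(G2) @ d2)                       # pinv yields the same minimum-norm solution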

2.10.4 Derivation

Among all models fulfilling \(\bf G m=d\), we are looking for the “smallest”:

\[ \text{min} \Phi=\bf m^T m \quad\mbox{with}\quad G m=d \]

We use the method of Lagrange multipliers (\(\lambda_i\)) \[ \Phi = \bf m^T m + \lambda^T (G m - d) \rightarrow \mbox{min} \]

The derivatives vanish

\[ \frac{\partial\Phi}{\partial \lambda} = \bf G m - d=0 \]

\[ \frac{\partial\Phi}{\partial m} = 2\bf m^T + {\bf \lambda}^T G = 0 \]

\[ \Rightarrow {\bf m} = -\frac{1}{2} ({\lambda}^T {\bf G})^T = -\frac{1}{2}{\bf G}^T \lambda \]

from which we can determine the Lagrange multipliers

\[ \lambda = -2 ({\bf G G^T})^{-1} \bf d \]

Replacing them we obtain the minimum-norm solution \[ {\bf m}_{MN} = \bf G^T (G G^T)^{-1} d \]

2.10.5 Resolution of the minimum norm inverse

\[ \bf G^\dagger = G^T (G G^T)^{-1} \]

\[ {\bf R}^M = \bf G^\dagger G = G^T (G G^T)^{-1} G \]

\[ {\bf R}^D = \bf G G^\dagger = G G^T (G G^T)^{-1} = I \]

Note

For underdetermined problems, all data are independent and equally important. Model parameters are insufficiently resolved.
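
Again a quick numerical check for the 2×3 example (a sketch assuming numpy as np):

G2 = np.array([[1, 1, 0], [0, 0, 1]])
Gdag = G2.T @ np.linalg.inv(G2 @ G2.T)     # minimum-norm generalized inverse
print(Gdag @ G2)                           # R^M: m1 and m2 smeared (0.5 each), m3 perfectly resolved
print(G2 @ Gdag)                           # R^D: identity, all data equally important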

2.11 Appendix

2.11.1 Derivation (1)

\[\Phi = \| {\vb d}-{\vb G}{\vb m}\|^2_2 = ({\vb d}-{\vb G}{\vb m})^T ({\vb d}-{\vb G}{\vb m}) \] \[\Phi = ({\vb G}{\vb m}-{\vb d})^T ({\vb G}{\vb m}-{\vb d})\]

\[\frac{\partial\Phi}{\partial m} = \frac{\partial}{\partial m} ({\vb G}{\vb m}-{\vb d})^T ({\vb G}{\vb m}-{\vb d}) + ({\vb G}{\vb m}-{\vb d})^T \frac{\partial}{\partial m}({\vb G}{\vb m}-{\vb d}) = 0\]

\[{\vb G}^T{\vb G}{\vb m}- {\vb G}^T{\vb d}+ {\vb G}^T{\vb G}{\vb m}- {\vb G}^T{\vb d}= 0\]

\[{\vb G}^T{\vb G}{\vb m}= {\vb G}^T{\vb d}\quad \Rightarrow \quad {\vb m}={\vb G}^\dagger{\vb d} \quad\mbox{with}\quad {\vb G}^\dagger = ({\vb G}^T{\vb G})^{-1} {\vb G}^T\]

2.11.2 Derivation (2)

\[\Phi = ({\vb d}-{\vb G}{\vb m})^T ({\vb d}-{\vb G}{\vb m}) = \sum_i \left[ \left(d_i-\sum_j G_{ij}m_j \right)\left(d_i-\sum_k G_{ik}m_k \right) \right]\] \[\Phi = \sum_i \left[ d_i d_i - d_i\sum_k G_{ik}m_k - d_i\sum_j G_{ij}m_j + \sum_j G_{ij}m_j\sum_k G_{ik}m_k \right]\] \[\Phi = \sum_i d_i d_i - 2 \sum_j m_j \sum_i d_i G_{ij} + \sum_i \sum_j\sum_k m_j m_k G_{ij}G_{ik}\] \[\Phi = \sum_i d_i d_i - 2 \sum_j m_j \sum_i d_i G_{ij} +\sum_j\sum_k m_j m_k \sum_i G_{ij}G_{ik}\]

\[\pdv{\Phi}{m_q} = \sum_j\sum_k(\delta_{jq}m_k+m_j\delta_{kq})\sum_i G_{ij}G_{ik} - 2\sum_j \delta_{jq}\sum_i G_{ij} d_i = 0\]

\[0 = 2\sum_k m_k\sum_i G_{iq}G_{ik} - 2\sum_i G_{iq}d_i = 2\,({\vb G}^T{\vb G}{\vb m}-{\vb G}^T {\vb d})_q\]

2.11.3 Derivation via a perturbation \(t\delta{\vb m}\)

\[\Phi(t) = ({\vb d}-{\vb G}({\vb m}+t\delta{\vb m}))^T ({\vb d}-{\vb G}({\vb m}+t\delta{\vb m}))\] \[\Phi(t) = ({\vb m}+t\delta{\vb m})^T{\vb G}^T{\vb G}({\vb m}+t\delta{\vb m})-2({\vb m}+t\delta{\vb m})^T{\vb G}^T{\vb d}+{\vb d}^T{\vb d}\] \[\Phi(t) = t^2(\delta{\vb m}^T{\vb G}^T{\vb G}\delta{\vb m}) + 2t(\delta{\vb m}^T{\vb G}^T{\vb G}{\vb m}-\delta{\vb m}^T{\vb G}^T{\vb d}) + ({\vb m}^T{\vb G}^T{\vb G}{\vb m}+{\vb d}^T{\vb d}-2{\vb m}^T{\vb G}^T{\vb d})\] If \(\vb m\) is the minimizer, \(\Phi(t)\) has a minimum at \(t=0\), therefore \(\partial\Phi/\partial t\) must be 0 there: \[\partial\Phi(t=0)/\partial t = 2(\delta{\vb m}^T{\vb G}^T{\vb G}{\vb m}- \delta{\vb m}^T{\vb G}^T{\vb d}) = 2\delta{\vb m}^T ({\vb G}^T{\vb G}{\vb m}-{\vb G}^T{\vb d}) = 0\] As this holds for every \(\delta{\vb m}\), we obtain \({\vb G}^T{\vb G}{\vb m}={\vb G}^T{\vb d}\).