WKF
Wassertein roubust kalman filter is a member of distributionally robust kalman filters, along with $\tau$ divergence kalman filter and moment based kalman filter. This blog contains a brief derivation of WKF(Wassertein roubust kalman filter).
Solve the minimax problem
\[\inf _{\psi \in \mathcal{L}} \sup _{\mathbb{P} \in \mathcal{P}} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right]=\sup _{\mathbb{P} \in \mathcal{P}} \inf _{\psi \in \mathcal{L}} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right]\]minimax theorem. Proved in [1]
模糊集 \(\mathcal{P}=\left\{\mathbb{P} \in \mathcal{N}_{d}: W_{2}(\mathbb{P}, \overline{\mathbb{P}}) \leq \rho\right\}\) , 距离标称分布 \(\overline{\mathbb{P}}_{\mathbf{z}_{k} \mid \mathcal{Y}_{k-1}}\) \(\rho\) 的可能的真实分布 \(\mathbb{P}_{\mathbf{z}_{k} \mid \mathcal{Y}_{k-1}}\) 。
Wasserstein distance of two normal distributions \(W_{2}\left(\mathbb{Q}_{1}, \mathbb{Q}_{2}\right)=\sqrt{\left\|\mu_{1}-\mu_{2}\right\|^{2}+\operatorname{Tr}\left[\Sigma_{1}+\Sigma_{2}-2\left(\Sigma_{2}^{\frac{1}{2}} \Sigma_{1} \Sigma_{2}^{\frac{1}{2}}\right)^{\frac{1}{2}}\right]}\) .
where $\langle\boldsymbol{A}, \boldsymbol{B}\rangle:=\operatorname{Tr}\left[\boldsymbol{A}^{\top} \boldsymbol{B}\right]$ and:
\[\begin{aligned} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right] &=\mathbb{E}^{\mathbb{P}}\left[(x-\psi(y))^{\top}(x-\psi(y)) \right]\\ &=\mathbb{E}^{\mathbb{P}}\text{Tr}\left[(x-\psi(y))(x-\psi(y))^{\top} \right]\\ &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[(x-\psi(y))(x-\psi(y))^{\top} \right]\\ &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[xx^{\top}-x\psi(y)^{\top}-\psi(y)x^{\top}+\psi(y)\psi(y)^{\top}\right]\\ \end{aligned}\]因为 $ \mathbb{P} $ 是高斯,最优估计是affine的,即 $ \hat{x}_{k \mid k}=A_ky_k+b_k $ ,代入(2):
\[\begin{aligned} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right] &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[x_kx_k^{\top}-x_k(A_ky_k+b_k)^{\top}-(A_ky_k+b_k)x_k^{\top}+(A_ky_k+b_k)(A_ky_k+b_k)^{\top}\right]\\ &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[x_kx_k^{\top}-x_ky_k^{\top}A_k^{\top}-x_kb_k^{\top}-A_ky_kx_k^{\top}-b_kx_k^{\top}+A_ky_ky_k^{\top}A_k^{\top}+A_ky_kb_k^{\top}+b_ky_k^{\top}A_k^{\top}+b_kb_k^{\top}\right] \end{aligned}\]记高斯分布 \(\mathbb{P}_{\mathbf{z}_{k} \mid \mathcal{Y}_{k-1}}\sim \mathcal{N}(c_k,S_k)\) ,\(\overline{\mathbb{P}}_{\mathbf{z}_{k} \mid \mathcal{Y}_{k-1}}\sim \mathcal{N}(\mu_k,\Sigma_k)\).
根据定义:$S_{xx,k}:=\mathbb{E}(x_k-c_{x,k})(x_k-c_{x,k})^{\top}$,$S_{xy,k}=S_{yx,k}^{\top}:=\mathbb{E}(x_k-c_{x,k})(y_k-c_{y,k})^{\top}$,$S_{yy,k}:=\mathbb{E}(y_k-c_{y,k})(y_k-c_{y,k})^{\top}$
且 $S_{k}=\left[\begin{array}{ll} S_{x x, k} & S_{x y, k}
S_{y x, k} & S_{y y, k} \end{array}\right]$
因为$S_{xx,k}=\mathbb{E}(x_k-c_{x,k})(x_k-c_{x,k})^{\top}=\mathbb{E}x_kx_k^{\top}-\mathbb{E}x_kc_{x,k}^{\top}-c_{x,k}\mathbb{E}x_k^{\top}+c_{x,k}c_{x,k}^{\top}=\mathbb{E}x_kx_k^{\top}-c_{x,k}c_{x,k}^{\top}$
得 $\mathbb{E}x_kx_k^{\top}=S_{xx,k}+c_{x,k}c_{x,k}^{\top}$
类似 $\mathbb{E}x_ky_k^{\top}=S_{xy,k}+c_{x,k}c_{y,k}^{\top}$,$\mathbb{E}y_kx_k^{\top}=S_{yx,k}+c_{y,k}c_{x,k}^{\top}$,$\mathbb{E}y_ky_k^{\top}=S_{yy,k}+c_{y,k}c_{y,k}^{\top}$
代入(3): \(\begin{aligned} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right] &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[x_kx_k^{\top}-x_ky_k^{\top}A_k^{\top}-x_kb_k^{\top}-A_ky_kx_k^{\top}-b_kx_k^{\top}+A_ky_ky_k^{\top}A_k^{\top}+A_ky_kb_k^{\top}+b_ky_k^{\top}A_k^{\top}+b_kb_k^{\top}\right]\\ &=\text{Tr}[S_{xx,k}+c_{x,k}c_{x,k}^{\top}-(S_{xy,k}+c_{x,k}c_{y,k}^{\top})A_k^{\top}-c_{x,k}b_k^{\top}-A_k(S_{yx,k}+c_{y,k}c_{x,k}^{\top})-b_kc_{x,k}^{\top}\\ & \quad +A_k(S_{yy,k}+c_{y,k}c_{y,k}^{\top})A_k^{\top}+A_kc_{y,k}b_k^{\top}+b_kc_{y,k}^{\top}A_k^{\top}+b_kb_k^{\top}]\\ &=\langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle-\langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle-\langle b_k,c_{x,k} \rangle-\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle-\langle b_k,c_{x,k} \rangle\\ & \quad + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle +2\langle b_k,A_kc_{y,k} \rangle+\langle b_k,b_k \rangle \\ &=\langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle \\ & \quad + 2\langle b_k,A_kc_{y,k} -c_{x,k}\rangle + \langle b_k,b_k \rangle \end{aligned}\) Now
\[\begin{aligned} \inf _{\psi \in \mathcal{L}} \sup _{\mathbb{P} \in \mathcal{P}} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right] & \Leftrightarrow \sup _{\mathbb{P} \in \mathcal{P}} \inf _{\psi \in \mathcal{L}} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right]\\ & \Leftrightarrow \sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle \\ & \quad -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle + 2\langle b_k,A_kc_{y,k} -c_{x,k}\rangle + \langle b_k,b_k \rangle \end{aligned}\]Since (5) is constraint-free, quadratic and convex in terms of $b_k$, the optimal solution of $b_k$ is obtained by the first-order optimality condition, i.e. $b_{k}^{\star}=c_{x,k}-A_kc_{y,k}$
Thus simplying (5) to:
\[\begin{aligned} &\sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle + 2\langle b_k,A_kc_{y,k} -c_{x,k}\rangle + \langle b_k,b_k \rangle \\ &= \sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle -\langle A_kc_{y,k} -c_{x,k},A_kc_{y,k} -c_{x,k}\rangle \\ &= \sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle -\langle A_k^{\top}A_k, c_{y,k}c_{y,k}^{\top}\rangle +2\langle A_k^{\top},c_{y,k}c_{x,k}^{\top} \rangle - \langle c_{x,k}, c_{x,k} \rangle \\ &= \sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}\rangle + \langle A_k^{\top}A_k,S_{yy,k} \rangle - \langle A_k,S_{xy,k} \rangle -\langle A_k^{\top},S_{yx,k}\rangle \end{aligned}\]Write the objective function in a compact form: \(\sup _{S_{k}}\inf _{A_{k}} \left\langle\left[\begin{array}{cc} I & -A_{k} \\ -A_{k}^{\top} & A_{k}^{\top} A_{k} \end{array}\right], S_{k}\right\rangle\) which is subject to: \(\sqrt{\left\|c_k-\mu_k\right\|^{2}+\operatorname{Tr}\left[S_k+\Sigma_k-2\left(\Sigma_k^{\frac{1}{2}} S_k \Sigma_k^{\frac{1}{2}}\right)^{\frac{1}{2}}\right]}\le\rho\) Note that the objective function (7) is irrelevant to $c_k$. Therefore, to maximize (7), the larger the feasible set of $S_k$, the better. This gives the optimal solution of $c_k$ as $c_k^{\star}=\mu_k$
The optimization problem (7) over $A_k$ is constraint-free, differentiable, and convex, the first-order optimality condition, i.e. $A_k^{\star}=S_{xy,k}S_{yy,k}^{-1}$
Proceed and we get: \(\sup_{S_k} \text{Tr}[S_{xx,k}-S_{xy,k}S_{yy,k}^{-1}S_{yx,k}]\) which subjects to: \(\operatorname{Tr}\left[S_k+\Sigma_k-2\left(\Sigma_k^{\frac{1}{2}} S_k \Sigma_k^{\frac{1}{2}}\right)^{\frac{1}{2}}\right]\le\rho^2\) The objective function $f(S) \triangleq \operatorname{Tr}\left[S_{x x}-S_{x y} S_{y y}^{-1} S_{y x}\right]$ has unit total elasticity, i.e., \(f(S)=\langle S,\nabla f(S) \rangle \quad \forall S\in \mathbb{S}_{+}^{d}\)
\[\langle S, \nabla f(S)\rangle=\left\langle\left[\begin{array}{ll} S_{x x} & S_{x y} \\ S_{y x} & S_{y y} \end{array}\right],\left[\begin{array}{cc} I_{n} & -S_{x y} S_{y y}^{-1} \\ -S_{y y}^{-1} S_{y x} & S_{y y}^{-1} S_{y x} S_{x y} S_{y y}^{-1} \end{array}\right]\right\rangle=f(S)\]Solve the max problem with Frank-Wolfe Algorithm.
Steps of wasserstein kalman filter
\[\left.\begin{array}{rl} x_{t} & =F_{t-1} x_{t-1}+G_{t-1} w_{t-1} \\ y_{t} & =H_{t} x_{t}+D_{t} v_{t} \end{array}\right\} \quad \forall t \in \mathbb{N}\]Here $x_t\in \mathbb{R}^n$, $y_t \in \mathbb{R}^m$, $Y_{t} \triangleq\left(y_{1}, \ldots, y_{t}\right)$,
$z_{t}=\left[x_{t}^{\top}, y_{t}^{\top}\right]^{\top} \in \mathbb{R}^d, d=n+m$
$G_t \in \mathbb{R}^{n\times p}$ , $w_t \sim \mathcal{N}(0,Q), w_t \in \mathbb{R}^{p}$
$D_t\in \mathbb{R}^{m\times q}$, $v_t \sim \mathcal{N}(0,R), v_t\in \mathbb{R}^{q}$
$x_0 \sim \mathcal{N}(\hat{x}_0,V_0)$
According to (13), we have the nominal transition kernel: \(\overline{\mathbb{P}}_{z_t|x_{t-1}}=\mathcal{N}_{d}\left(\left[\begin{array}{c} F_{t-1} \\ H_{t} F_{t-1} \end{array}\right] x_{t-1},\left[\begin{array}{c} G_{t-1} & 0\\ H_{t} G_{t-1} & D_{t} \end{array}\right]\left[\begin{array}{c} Q_{t-1} & 0\\ 0 & R_{t} \end{array}\right]\left[\begin{array}{c} G_{t-1} & 0\\ H_{t} G_{t-1} & D_{t} \end{array}\right]^{\top}\right)\) Suppose we already have \(\mathbb{P}_{x_{t-1} \mid Y_{t-1}}^{\star}=\mathcal{N}_{n}\left(\hat{x}_{t-1}, V_{t-1}\right)\)
Then predict with the pseudo nominal distribution, which is defined for every Borel set $B \subseteq \mathbb{R}^d$ and observation history $Y_{t−1} \in \mathbb{R}^{m×(t−1)}$: \(\overline{\mathbb{P}}_{z_{t} \mid Y_{t-1}}\left(B \mid Y_{t-1}\right)=\int_{\mathbb{R}^{n}} \overline{\mathbb{P}}_{z_{t} \mid x_{t-1}}\left(B \mid x_{t-1}\right) \mathbb{P}_{x_{t-1} \mid Y_{t-1}}^{\star}\left(\mathrm{d} x_{t-1} \mid Y_{t-1}\right)\) Therefore: \(\overline{\mathbb{P}}_{z_{t} \mid Y_{t-1}}\left(B \mid Y_{t-1}\right)=\mathcal{N}(\mu_t,\Sigma_t)\) where: \(\mu_{t}=\left[\begin{array}{c} \mu_{x, t} \\ \mu_{y, t} \end{array}\right]=\left[\begin{array}{c} F_{t-1} \\ H_{t} F_{t-1} \end{array}\right] \hat{x}_{t-1 \mid t-1}\) and \(\begin{array}{l} \Sigma_{t}=\left[\begin{array}{c} F_{t-1} \\ H_{t} F_{t-1} \end{array}\right] V_{t-1 \mid t-1}\left[\begin{array}{c} F_{t-1} \\ H_{t} F_{t-1} \end{array}\right]^{\top}+ {\left[\begin{array}{cc} G_{t-1} Q_{t-1}^{\frac{1}{2}} & 0 \\ H_{t} G_{t-1} Q_{t-1}^{\frac{1}{2}} & R_{t}^{\frac{1}{2}} \end{array}\right]\left[\begin{array}{cc} G_{t-1} Q_{t-1}^{\frac{1}{2}} & 0 \\ H_{t} G_{t-1} Q_{t-1}^{\frac{1}{2}} & R_{t}^{\frac{1}{2}} \end{array}\right]^{\top}} \end{array}\) Here comes the update: \(\inf _{\psi_{t} \in \mathcal{L}} \sup _{\mathbb{P} \in \mathcal{P}_{z_{t} \mid Y_{t-1}}} \mathbb{E}^{\mathbb{P}}\left[\left\|x_{t}-\psi_{t}\left(y_{t}\right)\right\|^{2}\right]\) where: \(\mathcal{P}_{z_{t} \mid Y_{t-1}}=\left\{\mathbb{P} \in \mathcal{N}_{d}: W_{2}\left(\mathbb{P}, \overline{\mathbb{P}}_{z_{t} \mid Y_{t-1}}\right) \leq \rho_{t}\right\}\) The Wasserstein radius $\rho_t$ quantifies our distrust in the pseudo-nominal a priori estimate and can therefore be interpreted as a measure of model uncertainty.
The optimal solution $S_t^{\star}$ of (115) yields the least favorable conditional distribution: \(\mathbb{P}_{z_{t} \mid Y_{t-1}}^{\star}=\mathcal{N}_{d}\left(\mu_{t}, S_{t}^{\star}\right)\) And the worst-case conditional distribution of $x_t$ given $Y_t$:
(By using the formulas for conditional normal distributions (see, [2, page 522])) \(\mathbb{P}_{x_{t} \mid Y_{t}}^{\star}=\mathcal{N}_{d}\left(\hat{x}_{t|t}, V_{t}\right)\) where: \(\hat{x}_{t|t}=S_{t, x y}^{\star}\left(S_{t, y y}^{\star}\right)^{-1}\left(y_{t}-\mu_{t, y}\right)+\mu_{t, x}\) and \(V_{t}=S_{t, x x}^{\star}-S_{t, x y}^{\star}\left(S_{t, y y}^{\star}\right)^{-1} S_{t, y x}^{\star}\)
Reference
- S. Shafieezadeh Abadeh, V. A. Nguyen, D. Kuhn, and P. M. Mohajerin Esfahani, “Wasserstein Distributionally Robust Kalman Filtering,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/hash/15212f24321aa2c3dc8e9acf820f3c15-Abstract.html
- C. R. Rao, Linear statistical inference and its applications, 2. ed., Paperback ed. in Wiley series in probability and statistics. New York: Wiley, 2002.