Post

WKF

Wassertein roubust kalman filter is a member of distributionally robust kalman filters, along with $\tau$ divergence kalman filter and moment based kalman filter. This blog contains a brief derivation of WKF(Wassertein roubust kalman filter).

Solve the minimax problem

\[\inf _{\psi \in \mathcal{L}} \sup _{\mathbb{P} \in \mathcal{P}} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right]=\sup _{\mathbb{P} \in \mathcal{P}} \inf _{\psi \in \mathcal{L}} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right]\]

minimax theorem. Proved in [1]

模糊集 \(\mathcal{P}=\left\{\mathbb{P} \in \mathcal{N}_{d}: W_{2}(\mathbb{P}, \overline{\mathbb{P}}) \leq \rho\right\}\) , 距离标称分布 \(\overline{\mathbb{P}}_{\mathbf{z}_{k} \mid \mathcal{Y}_{k-1}}\) \(\rho\) 的可能的真实分布 \(\mathbb{P}_{\mathbf{z}_{k} \mid \mathcal{Y}_{k-1}}\) 。

Wasserstein distance of two normal distributions \(W_{2}\left(\mathbb{Q}_{1}, \mathbb{Q}_{2}\right)=\sqrt{\left\|\mu_{1}-\mu_{2}\right\|^{2}+\operatorname{Tr}\left[\Sigma_{1}+\Sigma_{2}-2\left(\Sigma_{2}^{\frac{1}{2}} \Sigma_{1} \Sigma_{2}^{\frac{1}{2}}\right)^{\frac{1}{2}}\right]}\) .

where $\langle\boldsymbol{A}, \boldsymbol{B}\rangle:=\operatorname{Tr}\left[\boldsymbol{A}^{\top} \boldsymbol{B}\right]$ and:

\[\begin{aligned} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right] &=\mathbb{E}^{\mathbb{P}}\left[(x-\psi(y))^{\top}(x-\psi(y)) \right]\\ &=\mathbb{E}^{\mathbb{P}}\text{Tr}\left[(x-\psi(y))(x-\psi(y))^{\top} \right]\\ &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[(x-\psi(y))(x-\psi(y))^{\top} \right]\\ &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[xx^{\top}-x\psi(y)^{\top}-\psi(y)x^{\top}+\psi(y)\psi(y)^{\top}\right]\\ \end{aligned}\]

因为 $ \mathbb{P} $ 是高斯,最优估计是affine的,即 $ \hat{x}_{k \mid k}=A_ky_k+b_k $ ,代入(2):

\[\begin{aligned} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right] &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[x_kx_k^{\top}-x_k(A_ky_k+b_k)^{\top}-(A_ky_k+b_k)x_k^{\top}+(A_ky_k+b_k)(A_ky_k+b_k)^{\top}\right]\\ &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[x_kx_k^{\top}-x_ky_k^{\top}A_k^{\top}-x_kb_k^{\top}-A_ky_kx_k^{\top}-b_kx_k^{\top}+A_ky_ky_k^{\top}A_k^{\top}+A_ky_kb_k^{\top}+b_ky_k^{\top}A_k^{\top}+b_kb_k^{\top}\right] \end{aligned}\]

记高斯分布 \(\mathbb{P}_{\mathbf{z}_{k} \mid \mathcal{Y}_{k-1}}\sim \mathcal{N}(c_k,S_k)\) ,\(\overline{\mathbb{P}}_{\mathbf{z}_{k} \mid \mathcal{Y}_{k-1}}\sim \mathcal{N}(\mu_k,\Sigma_k)\).

根据定义:$S_{xx,k}:=\mathbb{E}(x_k-c_{x,k})(x_k-c_{x,k})^{\top}$,$S_{xy,k}=S_{yx,k}^{\top}:=\mathbb{E}(x_k-c_{x,k})(y_k-c_{y,k})^{\top}$,$S_{yy,k}:=\mathbb{E}(y_k-c_{y,k})(y_k-c_{y,k})^{\top}$

且 $S_{k}=\left[\begin{array}{ll} S_{x x, k} & S_{x y, k}
S_{y x, k} & S_{y y, k} \end{array}\right]$

因为$S_{xx,k}=\mathbb{E}(x_k-c_{x,k})(x_k-c_{x,k})^{\top}=\mathbb{E}x_kx_k^{\top}-\mathbb{E}x_kc_{x,k}^{\top}-c_{x,k}\mathbb{E}x_k^{\top}+c_{x,k}c_{x,k}^{\top}=\mathbb{E}x_kx_k^{\top}-c_{x,k}c_{x,k}^{\top}$

得 $\mathbb{E}x_kx_k^{\top}=S_{xx,k}+c_{x,k}c_{x,k}^{\top}$

类似 $\mathbb{E}x_ky_k^{\top}=S_{xy,k}+c_{x,k}c_{y,k}^{\top}$,$\mathbb{E}y_kx_k^{\top}=S_{yx,k}+c_{y,k}c_{x,k}^{\top}$,$\mathbb{E}y_ky_k^{\top}=S_{yy,k}+c_{y,k}c_{y,k}^{\top}$

代入(3): \(\begin{aligned} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right] &=\text{Tr}\mathbb{E}^{\mathbb{P}}\left[x_kx_k^{\top}-x_ky_k^{\top}A_k^{\top}-x_kb_k^{\top}-A_ky_kx_k^{\top}-b_kx_k^{\top}+A_ky_ky_k^{\top}A_k^{\top}+A_ky_kb_k^{\top}+b_ky_k^{\top}A_k^{\top}+b_kb_k^{\top}\right]\\ &=\text{Tr}[S_{xx,k}+c_{x,k}c_{x,k}^{\top}-(S_{xy,k}+c_{x,k}c_{y,k}^{\top})A_k^{\top}-c_{x,k}b_k^{\top}-A_k(S_{yx,k}+c_{y,k}c_{x,k}^{\top})-b_kc_{x,k}^{\top}\\ & \quad +A_k(S_{yy,k}+c_{y,k}c_{y,k}^{\top})A_k^{\top}+A_kc_{y,k}b_k^{\top}+b_kc_{y,k}^{\top}A_k^{\top}+b_kb_k^{\top}]\\ &=\langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle-\langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle-\langle b_k,c_{x,k} \rangle-\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle-\langle b_k,c_{x,k} \rangle\\ & \quad + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle +2\langle b_k,A_kc_{y,k} \rangle+\langle b_k,b_k \rangle \\ &=\langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle \\ & \quad + 2\langle b_k,A_kc_{y,k} -c_{x,k}\rangle + \langle b_k,b_k \rangle \end{aligned}\) Now

\[\begin{aligned} \inf _{\psi \in \mathcal{L}} \sup _{\mathbb{P} \in \mathcal{P}} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right] & \Leftrightarrow \sup _{\mathbb{P} \in \mathcal{P}} \inf _{\psi \in \mathcal{L}} \mathbb{E}^{\mathbb{P}}\left[\|x-\psi(y)\|^{2}\right]\\ & \Leftrightarrow \sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle \\ & \quad -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle + 2\langle b_k,A_kc_{y,k} -c_{x,k}\rangle + \langle b_k,b_k \rangle \end{aligned}\]

Since (5) is constraint-free, quadratic and convex in terms of $b_k$, the optimal solution of $b_k$ is obtained by the first-order optimality condition, i.e. $b_{k}^{\star}=c_{x,k}-A_kc_{y,k}$

Thus simplying (5) to:

\[\begin{aligned} &\sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle + 2\langle b_k,A_kc_{y,k} -c_{x,k}\rangle + \langle b_k,b_k \rangle \\ &= \sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle -\langle A_kc_{y,k} -c_{x,k},A_kc_{y,k} -c_{x,k}\rangle \\ &= \sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}+c_{x,k}c_{x,k}^{\top}\rangle + \langle A_k^{\top}A_k,S_{yy,k}+c_{y,k}c_{y,k}^{\top} \rangle - \langle A_k,S_{xy,k}+c_{x,k}c_{y,k}^{\top} \rangle -\langle A_k^{\top},S_{yx,k}+c_{y,k}c_{x,k}^{\top} \rangle -\langle A_k^{\top}A_k, c_{y,k}c_{y,k}^{\top}\rangle +2\langle A_k^{\top},c_{y,k}c_{x,k}^{\top} \rangle - \langle c_{x,k}, c_{x,k} \rangle \\ &= \sup _{c_k, S_k} \inf _{A_k, b_k} \langle I,S_{xx,k}\rangle + \langle A_k^{\top}A_k,S_{yy,k} \rangle - \langle A_k,S_{xy,k} \rangle -\langle A_k^{\top},S_{yx,k}\rangle \end{aligned}\]

Write the objective function in a compact form: \(\sup _{S_{k}}\inf _{A_{k}} \left\langle\left[\begin{array}{cc} I & -A_{k} \\ -A_{k}^{\top} & A_{k}^{\top} A_{k} \end{array}\right], S_{k}\right\rangle\) which is subject to: \(\sqrt{\left\|c_k-\mu_k\right\|^{2}+\operatorname{Tr}\left[S_k+\Sigma_k-2\left(\Sigma_k^{\frac{1}{2}} S_k \Sigma_k^{\frac{1}{2}}\right)^{\frac{1}{2}}\right]}\le\rho\) Note that the objective function (7) is irrelevant to $c_k$. Therefore, to maximize (7), the larger the feasible set of $S_k$, the better. This gives the optimal solution of $c_k$ as $c_k^{\star}=\mu_k$

The optimization problem (7) over $A_k$ is constraint-free, differentiable, and convex, the first-order optimality condition, i.e. $A_k^{\star}=S_{xy,k}S_{yy,k}^{-1}$

Proceed and we get: \(\sup_{S_k} \text{Tr}[S_{xx,k}-S_{xy,k}S_{yy,k}^{-1}S_{yx,k}]\) which subjects to: \(\operatorname{Tr}\left[S_k+\Sigma_k-2\left(\Sigma_k^{\frac{1}{2}} S_k \Sigma_k^{\frac{1}{2}}\right)^{\frac{1}{2}}\right]\le\rho^2\) The objective function $f(S) \triangleq \operatorname{Tr}\left[S_{x x}-S_{x y} S_{y y}^{-1} S_{y x}\right]$ has unit total elasticity, i.e., \(f(S)=\langle S,\nabla f(S) \rangle \quad \forall S\in \mathbb{S}_{+}^{d}\)

\[\langle S, \nabla f(S)\rangle=\left\langle\left[\begin{array}{ll} S_{x x} & S_{x y} \\ S_{y x} & S_{y y} \end{array}\right],\left[\begin{array}{cc} I_{n} & -S_{x y} S_{y y}^{-1} \\ -S_{y y}^{-1} S_{y x} & S_{y y}^{-1} S_{y x} S_{x y} S_{y y}^{-1} \end{array}\right]\right\rangle=f(S)\]

Solve the max problem with Frank-Wolfe Algorithm.

Steps of wasserstein kalman filter

\[\left.\begin{array}{rl} x_{t} & =F_{t-1} x_{t-1}+G_{t-1} w_{t-1} \\ y_{t} & =H_{t} x_{t}+D_{t} v_{t} \end{array}\right\} \quad \forall t \in \mathbb{N}\]

Here $x_t\in \mathbb{R}^n$, $y_t \in \mathbb{R}^m$, $Y_{t} \triangleq\left(y_{1}, \ldots, y_{t}\right)$,

$z_{t}=\left[x_{t}^{\top}, y_{t}^{\top}\right]^{\top} \in \mathbb{R}^d, d=n+m$

$G_t \in \mathbb{R}^{n\times p}$ , $w_t \sim \mathcal{N}(0,Q), w_t \in \mathbb{R}^{p}$

$D_t\in \mathbb{R}^{m\times q}$, $v_t \sim \mathcal{N}(0,R), v_t\in \mathbb{R}^{q}$

$x_0 \sim \mathcal{N}(\hat{x}_0,V_0)$

According to (13), we have the nominal transition kernel: \(\overline{\mathbb{P}}_{z_t|x_{t-1}}=\mathcal{N}_{d}\left(\left[\begin{array}{c} F_{t-1} \\ H_{t} F_{t-1} \end{array}\right] x_{t-1},\left[\begin{array}{c} G_{t-1} & 0\\ H_{t} G_{t-1} & D_{t} \end{array}\right]\left[\begin{array}{c} Q_{t-1} & 0\\ 0 & R_{t} \end{array}\right]\left[\begin{array}{c} G_{t-1} & 0\\ H_{t} G_{t-1} & D_{t} \end{array}\right]^{\top}\right)\) Suppose we already have \(\mathbb{P}_{x_{t-1} \mid Y_{t-1}}^{\star}=\mathcal{N}_{n}\left(\hat{x}_{t-1}, V_{t-1}\right)\)

Then predict with the pseudo nominal distribution, which is defined for every Borel set $B \subseteq \mathbb{R}^d$ and observation history $Y_{t−1} \in \mathbb{R}^{m×(t−1)}$: \(\overline{\mathbb{P}}_{z_{t} \mid Y_{t-1}}\left(B \mid Y_{t-1}\right)=\int_{\mathbb{R}^{n}} \overline{\mathbb{P}}_{z_{t} \mid x_{t-1}}\left(B \mid x_{t-1}\right) \mathbb{P}_{x_{t-1} \mid Y_{t-1}}^{\star}\left(\mathrm{d} x_{t-1} \mid Y_{t-1}\right)\) Therefore: \(\overline{\mathbb{P}}_{z_{t} \mid Y_{t-1}}\left(B \mid Y_{t-1}\right)=\mathcal{N}(\mu_t,\Sigma_t)\) where: \(\mu_{t}=\left[\begin{array}{c} \mu_{x, t} \\ \mu_{y, t} \end{array}\right]=\left[\begin{array}{c} F_{t-1} \\ H_{t} F_{t-1} \end{array}\right] \hat{x}_{t-1 \mid t-1}\) and \(\begin{array}{l} \Sigma_{t}=\left[\begin{array}{c} F_{t-1} \\ H_{t} F_{t-1} \end{array}\right] V_{t-1 \mid t-1}\left[\begin{array}{c} F_{t-1} \\ H_{t} F_{t-1} \end{array}\right]^{\top}+ {\left[\begin{array}{cc} G_{t-1} Q_{t-1}^{\frac{1}{2}} & 0 \\ H_{t} G_{t-1} Q_{t-1}^{\frac{1}{2}} & R_{t}^{\frac{1}{2}} \end{array}\right]\left[\begin{array}{cc} G_{t-1} Q_{t-1}^{\frac{1}{2}} & 0 \\ H_{t} G_{t-1} Q_{t-1}^{\frac{1}{2}} & R_{t}^{\frac{1}{2}} \end{array}\right]^{\top}} \end{array}\) Here comes the update: \(\inf _{\psi_{t} \in \mathcal{L}} \sup _{\mathbb{P} \in \mathcal{P}_{z_{t} \mid Y_{t-1}}} \mathbb{E}^{\mathbb{P}}\left[\left\|x_{t}-\psi_{t}\left(y_{t}\right)\right\|^{2}\right]\) where: \(\mathcal{P}_{z_{t} \mid Y_{t-1}}=\left\{\mathbb{P} \in \mathcal{N}_{d}: W_{2}\left(\mathbb{P}, \overline{\mathbb{P}}_{z_{t} \mid Y_{t-1}}\right) \leq \rho_{t}\right\}\) The Wasserstein radius $\rho_t$ quantifies our distrust in the pseudo-nominal a priori estimate and can therefore be interpreted as a measure of model uncertainty.

The optimal solution $S_t^{\star}$ of (115) yields the least favorable conditional distribution: \(\mathbb{P}_{z_{t} \mid Y_{t-1}}^{\star}=\mathcal{N}_{d}\left(\mu_{t}, S_{t}^{\star}\right)\) And the worst-case conditional distribution of $x_t$ given $Y_t$:

(By using the formulas for conditional normal distributions (see, [2, page 522])) \(\mathbb{P}_{x_{t} \mid Y_{t}}^{\star}=\mathcal{N}_{d}\left(\hat{x}_{t|t}, V_{t}\right)\) where: \(\hat{x}_{t|t}=S_{t, x y}^{\star}\left(S_{t, y y}^{\star}\right)^{-1}\left(y_{t}-\mu_{t, y}\right)+\mu_{t, x}\) and \(V_{t}=S_{t, x x}^{\star}-S_{t, x y}^{\star}\left(S_{t, y y}^{\star}\right)^{-1} S_{t, y x}^{\star}\)

Reference

  1. S. Shafieezadeh Abadeh, V. A. Nguyen, D. Kuhn, and P. M. Mohajerin Esfahani, “Wasserstein Distributionally Robust Kalman Filtering,” in Advances in Neural Information Processing Systems, Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/hash/15212f24321aa2c3dc8e9acf820f3c15-Abstract.html
  2. C. R. Rao, Linear statistical inference and its applications, 2. ed., Paperback ed. in Wiley series in probability and statistics. New York: Wiley, 2002.
This post is licensed under CC BY 4.0 by the author.

Trending Tags