First, we describe the individual anomalous region (IAR) model, in which
the anomalous regions are not shared across the group of patients, though
their parameters or effects are. This model is defined in
fcdiff.model.UnsharedRegionModel.
Generative Model
If you are familiar with graphical models, then look at
Fig. 1 for a summary of the generative model.
Now we go into more detail on the distributions of these random variables and
their parameters.
Let \(R_{nu}\) be a Bernoulli random variable indicating that
region \(n\) of patient \(u\) is anomalous. \(R_{nu}\) is drawn
from the distribution
\[p(r_{nu}; \pi) = (1 - \pi)^{1 - r_{nu}} \pi^{r_{nu}}\]
where \(\pi \in (0, 1)\) is the parameter of a Bernoulli distribution.
You can sample from this distribution with
UnsharedRegionModel.sample_R().
Let \(T_{nmu}\) be a Bernoulli random variable indicating that the
connection between regions \(n\) and \(m\) of patient \(u\) is
anomalous. \(T_{nmu}\) is dependent on the anomalous state of the regions
at either end of the connection, and is drawn from the distribution
\[\begin{split}p(t_{nmu} | r_{nu}, r_{mu}; \eta)
\begin{cases}
\delta(t_{nmu}) & \mathrm{if} \, r_{nu} = r_{mu} = 0
\\
\delta(1 - t_{nmu}) & \mathrm{if} \, r_{nu} = r_{mu} = 1
\\
(1 - \eta)^{1 - t_{nmu}} \eta^{t_{nmu}} & \mathrm{if} \, r_{nu} \neq r_{mu}
\end{cases}\end{split}\]
where \(\delta\) is the Dirac delta function and \(\eta \in (0, 1)\)
is the parameter of a Bernoulli distribution. \(T_{nmu}\) is
deterministic if the anomalous state of regions \(n\) and \(m\) in
patient \(u\) is the same, and is a Bernoulli random variable with
parameter \(\eta\) if they are different.
This distribution encourages anomalous networks containing cliques of
anomalous nodes, where larger values of \(\eta\) allow more edges
outside of cliques to be affected.
You can sample from this distribution with
UnsharedRegionModel.sample_T().
Let \(F_{nm}\) be a multinomial random variable indicating the state of
healthy connectivity between regions \(n\) and \(m\).
We use three states of connectivity:
- \(f_{nm-1} = 1\) denotes negative connectivity;
- \(f_{nm0} = 1\) denotes no connectivity;
- \(f_{nm1} = 1\) denotes positive connectivity.
Exactly one component of \(f_{nm}\) must be equal to one. \(F_{nm}\)
is drawn from the distribution
\[p(f_{nm}; \gamma)
=
\prod_{k = -1}^1
\gamma_k^{f_{nmk}}\]
where \(\gamma = (\gamma_{-1}, \gamma_0, \gamma_{1})\) is the parameter
vector of a multinomial distribution such that
\(\gamma_k \in (0, 1) \, \forall k\) and
\(\sum_{k = -1}^1 \gamma_k = 1\).
You can sample from this distribution with
UnsharedRegionModel.sample_F().
Let \(\tilde{F}_{nmu}\) be a multinomial random variable indicating the
state of connectivity between regions \(n\) and \(m\) of
patient \(u\). \(\tilde{F}_{nmu}\) is dependent on \(T_{nmu}\),
the anomalous state of the connection between regions \(n\) and \(m\)
of patient \(u\), and on \(F_{nm}\), the healthy
connectivity state between regions \(n\) and \(m\).
\(\tilde{F}_{nmu}\) is drawn from the distribution
\[\begin{split}p(\tilde{f}_{nmu} | f_{nm}, t_{nmu}; \epsilon)
=
\begin{cases}
(1 - \epsilon)^{f_{nm}^\top \tilde{f}_{nmu}}
\left( \frac{\epsilon}{2} \right)^{1 - f_{nm}^\top \tilde{f}_{nmu}}
&
\mathrm{if} \, t_{nmu} = 0
\\
\epsilon^{f_{nm}^\top \tilde{f}_{nmu}}
\left( \frac{1 - \epsilon}{2} \right)^{1 - f_{nm}^\top \tilde{f}_{nmu}}
&
\mathrm{if} \, t_{nmu} = 1
\end{cases}\end{split}\]
where \(\epsilon \in (0, 1)\) is the parameter of a Bernoulli
distribution. If the connection between regions \(n\) and \(m\) of
patient \(u\) is anomalous, the connectivity state is perturbed from the
healthy template with high probability \(1 - \epsilon\). Conversely, if
the connection is typical, the connectivity state is perturbed with
small probability \(\epsilon\).
You can sample from this distribution with
UnsharedRegionModel.sample_F_tilde().
Let \(B_{nmh}\) be the random Pearson correlation coefficient between the
fMRI BOLD contrast time series from regions \(n\) and \(m\) of healthy
subject \(h\). This is dependent on the healthy
connectivity state, and is drawn from a mixture of Normal distributions
\[p(b_{nmh} | f_{nm}; \mu, \sigma)
=
\prod_{k = -1}^1
\mathcal{N}(b_{nmh}; \mu_k, \sigma_k^2)^{f_{nmk}}\]
where \(\mu = (\mu_{-1}, \mu_0, \mu_{1})\),
\(\sigma = (\sigma_{-1}, \sigma_0, \sigma_{1})\) and
\(\mathcal{N}(\cdot; \mu_k, \sigma_k^2)\) is a Normal distribution
with mean \(\mu_k\) and variance \(\sigma_k^2\).
You can sample from this distribution with
UnsharedRegionModel.sample_B().
Similarly, let \(\tilde{B}_{nmu}\) denote the random Pearson correlation
coefficient between the fMRI BOLD contrast time series from regions \(n\)
and \(m\) of patient \(u\).
This is dependent on the connectivity state of the patient, and is drawn from
the same mixture of Normal distributions as the healthy correlations
\[p(\tilde{b}_{nmh} | \tilde{f}_{nm}; \mu, \sigma)
=
\prod_{k = -1}^1
\mathcal{N}(\tilde{b}_{nmh}; \mu_k, \sigma_k^2)^{\tilde{f}_{nmk}}\]
except that it is conditional on \(\tilde{F}\) instead of \(F\).
You can sample from this distribution with
UnsharedRegionModel.sample_B_tilde().
We assume independence between all healthy subjects and patients,
independence between healthy connections and independence between regions,
and thus obtain the full joint distribution
\[\begin{split}p&(f, b, r, t, \tilde{f}, \tilde{b}; \theta)
=
p(f; \gamma)
p(b | f; \mu, \sigma)
p(r; \pi)
p(t | r; \eta)
p(\tilde{f} | f, t; \epsilon)
p(\tilde{b} | \tilde{f}; \mu, \sigma)
\\
=
&\left(
\prod_{n = 1}^N
\prod_{m > n}
p(f_{nm}; \gamma)
\prod_{h = 1}^H
p(b_{nmh} | f_{nm}; \mu, \sigma)
\right)
\\
&\left(
\prod_{n = 1}^N
\prod_{u = 1}^U
p(r_{nu}; \pi)
\prod_{m > n}
p(t_{nmu} | r_{nu}, r_{mu}; \eta)
p(\tilde{f}_{nmu} | f_{nm}, t_{nmu}; \epsilon)
p(\tilde{b} | \tilde{f}_{nmu}; \mu, \sigma)
\right)\end{split}\]
where \(\theta = (\pi, \gamma, \eta, \epsilon, \mu, \sigma)\).
Exact Inference
In [2] three inference algorithms are discussed. Here,
we only present the algorithm that gave the best estimation performance
on synthetic data experiments and computational efficiency. Similarly,
only this algorithm is available in the toolbox.
Our goal is to infer the posterior probability distribution
\(p(r_{nu} | \tilde{b}, b; \theta)\) for all regions
\(n \in \{1, ..., N\}\) and in all patients
\(u \in \{1, ..., U\}\).
This requires marginalizing out all latent random variables in the model
to compute the partition function \(p(\tilde{b}, b; \theta)\).
First, note that we can easily sum over \(T_{nmu}\)
\[\begin{split}p(\tilde{f}_{nmu} | f_{nm}, r_{nu}, r_{mu}; \eta, \epsilon)
&=
\sum_{t_{nmu}}
p(t_{nmu} | r_{nu}, r_{mu}; \eta)
p(\tilde{f}_{nmu} | f_{nm}, t_{nmu}; \epsilon)
\\
&=
\begin{cases}
(1 - \epsilon)^{f_{nm}^\top \tilde{f}_{nmu}}
\left( \frac{\epsilon}{2} \right)^{1 - f_{nm}^\top \tilde{f}_{nmu}}
&
\mathrm{if} \, r_{nu} = r_{mu} = 0
\\
\epsilon^{f_{nm}^\top \tilde{f}_{nmu}}
\left( \frac{1 - \epsilon}{2} \right)^{1 - f_{nm}^\top \tilde{f}_{nmu}}
&
\mathrm{if} \, r_{nu} = r_{mu} = 1
\\
\tilde{\epsilon}^{f_{nm}^\top \tilde{f}_{nmu}}
\left( \frac{1 - \tilde{\epsilon}}{2} \right)^{1 - f_{nm}^\top \tilde{f}_{nmu}}
&
\mathrm{if} \, r_{nu} \neq r_{mu}
\end{cases}\end{split}\]
where \(\tilde{\epsilon} = \eta \epsilon + (1 - \eta)(1 - \epsilon)\).
Note
The marginalization of \(T\) comes at the expense of coupling
\(\epsilon\) with \(\eta\).
Next, we marginalize out \(\tilde{F}_{nmu}\)
\[\begin{split}p(\tilde{b}_{nmu} | f_{nm}, r_{nu}, r_{mu}; \theta)
&=
\sum_{\tilde{f}_{nmu}}
p(\tilde{f}_{nmu} | f_{nm}, r_{nu}, r_{mu}; \eta, \epsilon)
p(\tilde{b}_{nmu} | \tilde{f}_{nmu}; \mu, \sigma)
\\
&=
\sum_{k=-1}^{1}
\mathcal{N}(\tilde{b}_{nmu}; \mu_{k}, \sigma_{k}^2)
\\
&\qquad
\left(
(1 - \epsilon)^{f_{nmk}}
\left(
\frac{\epsilon}{2}
\right)^{1 - f_{nmk}}
\right)^{(1 - r_{nu}) (1 - r_{mu})}
\\
&\qquad
\left(
\epsilon^{f_{nmk}}
\left(
\frac{1 - \epsilon}{2}
\right)^{1 - f_{nmk}}
\right)^{r_{nu} r_{mu}}
\\
&\qquad
\left(
\tilde{\epsilon}^{f_{nmk}}
\left(
\frac{1 - \tilde{\epsilon}}{2}
\right)^{1 - f_{nmk}}
\right)^{r_{nu} (1 - r_{mu}) + (1 - r_{nu}) r_{mu}}
\\
&=
\prod_{k = -1}^{1}
\bigg(
\mathcal{M}_{k0}(\tilde{b}_{nmu}; \theta)^{r_{nu} r_{mu}}
\\
&\qquad
\mathcal{M}_{k1}(\tilde{b}_{nmu}; \theta)^{(1 - r_{nu}) (1 - r_{mu})}
\\
&\qquad
\mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)^{(r_{nu} (1 - r_{mu}) + (1 - r_{nu}) r_{mu})}
\bigg)^{f_{nmk}}\end{split}\]
where
\[\begin{split}\mathcal{M}_{k0}(\tilde{b}_{nmu}; \theta)
&=
(1 - \epsilon) \mathcal{N}(\tilde{b}; \mu_k, \sigma_k^2)
+
\frac{\epsilon}{2} \sum_{l \neq k} \mathcal{N}(\tilde{b}_{nmu}; \mu_l, \sigma_l^2)
\\
\mathcal{M}_{k1}(\tilde{b}_{nmu}; \theta)
&=
\epsilon \mathcal{N}(\tilde{b}; \mu_k, \sigma_k^2)
+
\frac{1 - \epsilon}{2} \sum_{l \neq k} \mathcal{N}(\tilde{b}_{nmu}; \mu_l, \sigma_l^2)
\\
\mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
&=
\tilde{\epsilon} \mathcal{N}(\tilde{b}; \mu_k, \sigma_k^2)
+
\frac{1 - \tilde{\epsilon}}{2} \sum_{l \neq k} \mathcal{N}(\tilde{b}_{nmu}; \mu_l, \sigma_l^2).\end{split}\]
Note
The marginalization of \(T\) comes at the expense of coupling
\(\epsilon\) and \(\eta\) with
\(\mu\) and \(\sigma\).
As \(\epsilon\) is assumed to be small \(\mathcal{M}_{k0}\) is
dominated by the likelihood of the correlation being drawn from the
\(k^\mathrm{th}\) Normal distribution, whereas \(\mathcal{M}_{k1}\)
is dominated by the likelihoods of the correlation being drawn from the
other Normal distributions. Finally, \(\mathcal{M}_{k \neq}\) is an
interpolation between the other two terms that tends to
\(\mathcal{M}_{k0}\) as \(\eta \to 0\) and tends to
\(\mathcal{M}_{k1}\) as \(\eta \to 1\).
Next, we could marginalize out \(F_{nm}\), but this would
complicate the form enough to make inference difficult and would also couple
\(\gamma\) with \((\epsilon, \eta, \mu, \sigma)\).
Finally, note that there is no way to analytically marginalize over \(R\),
because of the pair wise conditional dependence between \(\tilde{F}\) and
\(R\). Furthermore, it is very computationally expensive to perform
the brute force summation which would require a sum over \(2^N\) terms
- one for each value the binary \(N\) vector can take on.
Now we can construct the joint probability over all the remaining random
variables
\[\begin{split}p(f, b, r, \tilde{b}; \theta)
&=
p(f; \gamma)
p(b | f; \mu, \sigma)
p(r; \pi)
p(\tilde{b} | f, r; \theta)
\\
&=
\left(
\prod_{n=1}^N
\prod_{m > n}
p(f_{nm}; \gamma)
\prod_{h=1}^H
p(b_{nmh} | f_{nm}; \mu, \sigma)
\right)
\\
&\quad
\left(
\prod_{u=1}^U
\prod_{n=1}^N
p(r_{nu}; \pi)
\prod_{m > n}
p(\tilde{b}_{nmu} | f_{nm}, r_{nu}, r_{mu}; \eta, \epsilon, \mu, \sigma)
\right).\end{split}\]
Variational Inference
To perform inference without marginalizing out the remaining hidden random
variables, \(F\) and \(\tilde{F}\), we use a variational mean field
approximation of their posterior.
The factorization takes the form
\[\begin{split}p(f, r | b, \tilde{b})
&\approx
q(f, r)
=
q_F(f) q_R(r)
\\
&=
\left(
\prod_{n = 1}^{N}
\prod_{m > n}
\prod_{k = -1}^{1}
q_{F_{nmk}}^{f_{nmk}}
\right)
\left(
\prod_{n = 1}^N
\prod_{u = 1}^U
q_{R_{nu0}}^{1 - r_{nu}} q_{R_{nu1}}^{r_{nu}}
\right).\end{split}\]
Computing the posterior \(p(r | b, \tilde{b}; \theta)\) also requires
an estimate of \(\theta\).
To estimate both \(\theta\) and the variational factors, we minimize the
variational free energy
\[\begin{split}\mathcal{E}(q, \theta; b, \tilde{b})
&=
- \mathbb{E}_q \left[ \log p(f, b, r, \tilde{b}; \theta) \right]
+ \mathbb{E}_q \left[ \log q(f, r) \right]
\\
&=
- \mathbb{E}_{q_F} \left[ \log p(f; \gamma) \right]
- \mathbb{E}_{q_F} \left[ \log p(b | f; \mu, \sigma) \right]
- \mathbb{E}_{q_R} \left[ \log p(r; \pi) \right]
\\
&\qquad
- \mathbb{E}_{q_F q_R} \left[
\log p(\tilde{b} | f, r; \epsilon, \eta, \mu, \sigma)
\right]
+ \mathbb{E}_{q_R} \left[ \log q_R(r) \right]
+ \mathbb{E}_{q_F} \left[ \log q_F(f) \right]\end{split}\]
where
\[\begin{split}\mathbb{E}_{q_F} \left[ \log p(f; \gamma) \right]
&=
\sum_{n = 1}^N
\sum_{m > n}
\sum_{k = -1}^{1}
q_{F_{nmk}} \log \gamma_k
\\
\mathbb{E}_{q_R} \left[ \log p(r; \pi) \right]
&=
\sum_{n = 1}^N
\sum_{u = 1}^U
q_{R_{nu0}} \log (1 - \pi)
+
q_{R_{nu1}} \log \pi
\\
\mathbb{E}_{q_F} \left[ \log p(b | f; \mu, \sigma) \right]
&=
\sum_{n = 1}^N
\sum_{m > n}
\sum_{k = -1}^{1}
q_{F_{nmk}}
\sum_{h = 1}^H
\log \mathcal{N}(b_{nmh}; \mu_k, \sigma_k^2)
\\
\mathbb{E}_{q_F q_R} \left[
\log p(\tilde{b} | f, r; \epsilon, \eta, \mu, \sigma)
\right]
&=
\sum_{n = 1}^N
\sum_{m > n}
\sum_{k = -1}^{1}
q_{F_{nmk}}
\sum_{u = 1}^U
\bigg(
q_{R_{nu0}} q_{R_{mu0}}
\log \mathcal{M}_{k0}(\tilde{b}_{nmu}; \theta)
\\
&\qquad
+
q_{R_{nu1}} q_{R_{mu1}}
\log \mathcal{M}_{k1}(\tilde{b}_{nmu}; \theta)
\\
&\qquad
+
\left( q_{R_{nu0}} q_{R_{mu1}} + q_{R_{nu1}} q_{R_{mu0}} \right)
\log \mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
\bigg)
\\
\mathbb{E}_{q_F} \left[ \log q(f) \right]
&=
\sum_{n = 1}^N
\sum_{m > n}
\sum_{k = -1}^{1}
q_{F_{nmk}} \log q_{F_{nmk}}
\\
\mathbb{E}_{q_R} \left[ \log q(r) \right]
&=
\sum_{n = 1}^N
\sum_{u = 1}^U
q_{R_{nu0}} \log q_{R_{nu0}}
+
q_{R_{nu1}} \log q_{R_{nu1}}.\end{split}\]
The algorithm for performing this minimization is
\[\begin{split}\begin{array}{lll}
\text{Line} & \text{Operation}
\\ \hline
1 &
e \gets \mathcal{E}(q, \theta, b, \tilde{b})
\\
2 &
\text{for } s = 1 \dots S
\\
3 & \quad
q_F^* \gets \mathcal{U}_{q_F}(q_R, \theta, b, \tilde{b})
\\
4 & \quad
q_R^* \gets \mathcal{U}_{q_R}(q_R,q_F^*, \theta, b, \tilde{b})
\\
5 & \quad
\theta^* \gets \mathcal{U}_{\theta}(q^*, \theta, b, \tilde{b})
\\
6 & \quad
e^* \gets \mathcal{E}(q^*, \theta^*, b, \tilde{b})
\\
7 & \quad
\text{if } (e - e^*) / e < \xi
\\
8 & \quad \quad
\text{break}
\\
9 & \quad
q \gets q^*, \, \theta \gets \theta^*, \, e \gets e^*
\end{array}\end{split}\]
where \(\xi\) is the relative tolerance used to detect convergence before
the maximum number of iteration steps \(S\).
The \(\mathcal{U}_{q_F}\) function is determined by the following update
equation
\[\begin{split}\log q_{F_{nmk}}^*
&=
\log \gamma_k
+
\sum_{h=1}^H
\log \mathcal{N}(b_{nmh}; \mu_k, \sigma_k)
\\
&\quad
+
\sum_{u=1}^U
q_{R_{nu0}} q_{R_{mu0}} \log \mathcal{M}_{k0}(\tilde{b}_{nmu}; \theta)
+
q_{R_{nu1}} q_{R_{mu1}} \log \mathcal{M}_{k1}(\tilde{b}_{nmu}; \theta)
\\
&\qquad
+
\left(
q_{R_{nu1}} q_{R_{mu0}}
+
q_{R_{nu0}} q_{R_{mu1}}
\right)
\log \mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
+ \text{const.}\end{split}\]
where \(\text{const.}\) is used to imply that \(\log q_{F_{nmk}}^*\)
is further normalized to ensure that \(\sum_{k=-1}^1 q_{F_{nmk}}^* = 1\).
Note
We optimize w.r.t. \(\log q\) to ensure positivity of \(q\).
The \(\mathcal{U}_{q_R}\) function is determined by the following update
equations
\[\begin{split}\log q_{R_{nu0}}^*
&=
\log(1 - \pi)
\\
&\,
+
\sum_{m \neq n}
\sum_{k=-1}^{1}
q_{F_{nmk}}
\left(
q_{R_{mu0}} \log \mathcal{M}_{k0}(\tilde{b}_{nmu}; \theta)
+
q_{R_{mu1}} \log \mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
\right)
+
\text{const.}\end{split}\]
\[\begin{split}\log q_{R_{nu1}}^*
&=
\log \pi
\\
&\,
+
\sum_{m \neq n}
\sum_{k=-1}^{1}
q_{F_{nmk}}
\left(
q_{R_{mu1}} \log \mathcal{M}_{k1}(\tilde{b}_{nmu}; \theta)
+
q_{R_{mu0}} \log \mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
\right)
+
\text{const.}\end{split}\]
where \(\text{const.}\) is used to imply that \(\log q_{R_{nu0}}^*\)
and \(\log q_{R_{nu1}}^*\) are further normalized to ensure that
\(q_{R_{nu0}}^* + q_{R_{nu1}}^* = 1\).
Note
Here, we estimate each component of the variational parameters and then
project the solution onto the solution space of probability
distributions.
Instead, we could incorporate the equality constraints into the linear
system of equations. However, note that means we would have to optimize
w.r.t. \(q\) instead of \(\log q\) and thus would need to
explicitly handle the positivity constraints.
The \(\mathcal{U}_{\theta}\) function is determined by the following
update equations
\[\pi^*
=
\frac{1}{UN}
\sum_{u=1}^U
\sum_{n=1}^N
q_{R_{nu1}}\]
\[\gamma_k^*
=
\sum_{n=1}^N
\sum_{m > n}
q_{F_{nmk}}
+
\text{const.}\]
as well as the multivariable minimization of \(\mathcal{E}\) with
respect to \((\mu, \sigma, \epsilon, \eta)\).
This minimization can be performed using a variety of iterative descent
methods.
By default, a trust region reflective Gauss-Newton method is used.
In order to use a descent method, we need the following derivatives.
The derivative of the energy w.r.t. \(\mu_j\) is
\[\begin{split}\frac{\partial \mathcal{E}}{\partial \mu_j}
&=
-
\frac{\partial}{\partial \mu_j} \left(
\mathbb{E}_{q_F} \left[
\log p(b | f; \mu, \sigma)
\right]
+
\mathbb{E}_{q_F q_R} \left[
\log p(\tilde{b} | f, r; \theta)
\right]
\right)
\\
&=
-
\sum_{n=1}^N
\sum_{m > n} \bigg(
q_{F_{nmj}}
\sum_{h=1}^H
\frac{\partial}{\partial \mu_j} \left(
\log \mathcal{N}(b_{nmh}; \mu_j, \sigma_j^2)
\right)
\\
&\quad
+
\sum_{k=-1}^1
q_{F_{nmk}} \bigg(
\sum_{u=1}^U
q_{R_{nu0}} q_{R_{mu0}}
\frac{\partial}{\partial \mu_j} \left(
\log \mathcal{M}_{k0}(\tilde{b}_{nmu}; \theta)
\right)
\\
&\qquad
+
q_{R_{nu1}} q_{R_{mu1}}
\frac{\partial}{\partial \mu_j} \left(
\log \mathcal{M}_{k1}(\tilde{b}_{nmu}; \theta)
\right)
\\
&\qquad
+
\left(
q_{R_{nu0}} q_{R_{mu1}}
+
q_{R_{nu1}} q_{R_{mu0}}
\right)
\frac{\partial}{\partial \mu_j} \left(
\log \mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
\right)
\bigg)
\bigg)\end{split}\]
where
\[\frac{\partial}{\partial \mu_j}
\left(
\log \mathcal{M}_{k0}(b; \theta)
\right)
=
\mathcal{M}_{k0}(b; \theta)^{-1}
(1 - \epsilon)^{\delta(j, k)}
\left( \frac{\epsilon}{2} \right)^{1 - \delta(j, k)}
\frac{\partial}{\partial \mu_j}
\left(
\mathcal{N}(b; \mu_j, \sigma_j^2)
\right)\]
\[\frac{\partial}{\partial \mu_j} \left(
\log \mathcal{M}_{k1}(b; \theta)
\right)
=
\mathcal{M}_{k1}(b; \theta)^{-1}
\epsilon^{\delta(j, k)}
\left( \frac{1 - \epsilon}{2} \right)^{1 - \delta(j, k)}
\frac{\partial}{\partial \mu_j} \left(
\mathcal{N}(b; \mu_j, \sigma_j^2)
\right)\]
\[\frac{\partial}{\partial \mu_j} \left(
\log \mathcal{M}_{k \neq}(b; \theta)
\right)
=
\mathcal{M}_{k \neq}(b; \theta)^{-1}
\tilde{\epsilon}^{\delta(j, k)}
\left( \frac{1 - \tilde{\epsilon}}{2} \right)^{1 - \delta(j, k)}
\frac{\partial}{\partial \mu_j} \left(
\mathcal{N}(b; \mu_j, \sigma_j^2)
\right)\]
and
\[\frac{\partial}{\partial \mu_j} \left(
\mathcal{N}(b; \mu_j, \sigma_j^2)
\right)
=
\mathcal{N}(b; \mu_j, \sigma_j^2)
\frac{\partial}{\partial \mu_j} \left(
\log \mathcal{N}(b; \mu_j, \sigma_j^2)
\right)\]
\[\frac{\partial}{\partial \mu_j} \left(
\log \mathcal{N}(b; \mu_j, \sigma_j^2)
\right)
=
\sigma_j^{-2}
(b - \mu_j)\]
The derivative of the energy w.r.t. \(\sigma_j^2\) is
\[\begin{split}\frac{\partial \mathcal{E}}{\partial \sigma_j^2}
&=
-
\frac{\partial}{\partial \sigma_j^2} \left(
\mathbb{E}_{q_F} \left[
\log p(b | f; \mu, \sigma)
\right]
+
\mathbb{E}_{q_F q_R} \left[
\log p(\tilde{b} | f, r; \theta)
\right]
\right)
\\
&=
-
\sum_{n=1}^N
\sum_{m > n} \bigg(
q_{F_{nmj}}
\sum_{h=1}^H
\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{N}(b_{nmh}; \mu_j^2, \sigma_j^2)
\right)
\\
&\quad
+
\sum_{k=-1}^1
q_{F_{nmk}} \bigg(
\sum_{u=1}^U
q_{R_{nu0}} q_{R_{mu0}}
\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{M}_{k0}(\tilde{b}_{nmu}; \theta)
\right)
\\
&\qquad
+
q_{R_{nu1}} q_{R_{mu1}}
\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{M}_{k1}(\tilde{b}_{nmu}; \theta)
\right)
\\
&\qquad
+
\left(
q_{R_{nu0}} q_{R_{mu1}}
+
q_{R_{nu1}} q_{R_{mu0}}
\right)
\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
\right)
\bigg)
\bigg)\end{split}\]
where
\[\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{M}_{k0}(b; \theta)
\right)
=
\mathcal{M}_{k0}(b; \theta)^{-1}
(1 - \epsilon)^{\delta(j, k)}
\left( \frac{\epsilon}{2} \right)^{1 - \delta(j, k)}
\frac{\partial}{\partial \sigma_j^2} \left(
\mathcal{N}(b; \mu_j, \sigma_j^2)
\right)\]
\[\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{M}_{k1}(b; \theta)
\right)
=
\mathcal{M}_{k1}(b; \theta)^{-1}
\epsilon^{\delta(j, k)}
\left( \frac{1 - \epsilon}{2} \right)^{1 - \delta(j, k)}
\frac{\partial}{\partial \sigma_j^2} \left(
\mathcal{N}(b; \mu_j, \sigma_j^2)
\right)\]
\[\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{M}_{k \neq}(b; \theta)
\right)
=
\mathcal{M}_{k \neq}(b; \theta)^{-1}
\tilde{\epsilon}^{\delta(j, k)}
\left( \frac{1 - \tilde{\epsilon}}{2} \right)^{1 - \delta(j, k)}
\frac{\partial}{\partial \sigma_j^2} \left(
\mathcal{N}(b; \mu_j, \sigma_j^2)
\right)\]
and
\[\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{N}(b; \mu_j, \sigma_j^2)
\right)
=
(2 \sigma_j^2)^{-1} \left(
(b - \mu_j)^2 - \sigma_j^2
\right)\]
\[\frac{\partial}{\partial \sigma_j^2} \left(
\mathcal{N}(b; \mu_j, \sigma_j^2)
\right)
=
\mathcal{N}(b; \mu_j, \sigma_j^2)
\frac{\partial}{\partial \sigma_j^2} \left(
\log \mathcal{N}(b; \mu_j, \sigma_j^2)
\right)\]
The derivative of the energy w.r.t. \(\eta\) is
\[\begin{split}\frac{\partial \mathcal{E}}{\partial \eta}
&=
-
\frac{\partial}{\partial \eta} \left(
\mathbb{E}_{q_F q_R} \left[
\log p(\tilde{b} | f, r; \theta)
\right]
\right)
\\
&=
-
\sum_{n=1}^N
\sum_{m > n}
\sum_{k = -1}^1
q_{F_{nmk}}
\sum_{u=1}^U
\left(
q_{R_{nu0}} q_{R_{mu1}}
+
q_{R_{nu1}} q_{R_{mu0}}
\right)
\frac{\partial}{\partial \eta} \left(
\log \mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
\right)\end{split}\]
where
\[\frac{\partial}{\partial \eta} \left(
\log \mathcal{M}_{k \neq}(\tilde{b}; \theta)
\right)
=
\mathcal{M}_{k \neq}(\tilde{b}; \theta)^{-1}
\left(
(2 \epsilon - 1) \mathcal{N}(\tilde{b}; \mu_k, \sigma_k^2)
-
\frac{(2 \epsilon - 1)}{2}
\sum_{l \neq k}
\mathcal{N}(\tilde{b}; \mu_l, \sigma_l)
\right)\]
The derivative of the energy w.r.t. \(\epsilon\) is
\[\begin{split}\frac{\partial \mathcal{E}}{\partial \epsilon}
&=
-
\frac{\partial}{\partial \epsilon} \left(
\mathbb{E}_{q_F q_R} \left[
\log p(\tilde{b} | f, r; \theta)
\right]
\right)
\\
&=
-
\sum_{n=1}^N
\sum_{m > n}
\sum_{k = -1}^1
q_{F_{nmk}} \bigg(
\sum_{u=1}^U
q_{R_{nu0}} q_{R_{mu0}}
\frac{\partial}{\partial \epsilon} \left(
\log \mathcal{M}_{k0}(\tilde{b}_{nmu}; \theta)
\right)
\\
&\qquad
+
q_{R_{nu1}} q_{R_{mu1}}
\frac{\partial}{\partial \epsilon} \left(
\log \mathcal{M}_{k1}(\tilde{b}_{nmu}; \theta)
\right)
\\
&\qquad
+
\left(
q_{R_{nu0}} q_{R_{mu1}}
+
q_{R_{nu1}} q_{R_{mu0}}
\right)
\frac{\partial}{\partial \epsilon} \left(
\log \mathcal{M}_{k \neq}(\tilde{b}_{nmu}; \theta)
\right)
\bigg)\end{split}\]
where
\[\frac{\partial}{\partial \epsilon} \left(
\log \mathcal{M}_{k0}(\tilde{b}; \theta)
\right)
=
\mathcal{M}_{k0}(\tilde{b}; \theta)^{-1}
\left(
\frac{1}{2}
\sum_{l \neq k}
\mathcal{N}(\tilde{b}; \mu_l, \sigma_l^2)
-
\mathcal{N}(\tilde{b}; \mu_k, \sigma_k^2)
\right)\]
\[\frac{\partial}{\partial \epsilon} \left(
\log \mathcal{M}_{k1}(\tilde{b}; \theta)
\right)
=
\mathcal{M}_{k1}(\tilde{b}; \theta)^{-1}
\left(
\mathcal{N}(\tilde{b}; \mu_k, \sigma_k^2)
-
\frac{1}{2}
\sum_{l \neq k}
\mathcal{N}(\tilde{b}; \mu_l, \sigma_l^2)
\right)\]
\[\frac{\partial}{\partial \epsilon} \left(
\log \mathcal{M}_{k \neq}(\tilde{b}; \theta)
\right)
=
\mathcal{M}_{k \neq}(\tilde{b}; \theta)^{-1}
\left(
(2 \eta - 1) \mathcal{N}(\tilde{b}; \mu_k, \sigma_k^2)
-
\frac{(2 \eta - 1)}{2}
\sum_{l \neq k}
\mathcal{N}(\tilde{b}; \mu_l, \sigma_l^2)
\right)\]