Recursive Bayesian Networks

Inside Probabilities

The inside probability

Inside Probability

\beta(x_{i:k}) = p(\mathbf{Y}_{i:k} \ |\ x_{i:k})

is the probability of generating the subsequence of terminal variables \mathbf{Y}_{i:k}=\{y_{i+1},\ldots,y_{k}\} from the non-terminal variable x_{i:k} with all intermediate variables marginalised out.

Binary Case

This is best understood by looking at a simple example with a single non-terminal template variable and a binary split transition:

Tree diagram

The inside probability \beta(x_{1:9}) is the probability of generating the subsequence of terminal variables \mathbf{Y}_{1:9}=\{y_{2},\ldots,y_{9}\} from the non-terminal variable x_{1:9}. Its value can be computed recursively by looking one step ahead in the generating process and considering all possibilities. In this simple case, the next step will always produce two child variables and the different possibilities are simply which part of the sequence is generated by which of the children. This includes the three possibilities illustrated above:

  • orange [split at j=4]: the left child x_{1:4} generates \{y_2,\ldots,y_4\}; the right child x_{4:9} generates \{y_5,\ldots,y_9\}

  • blue [split at j=5]: the left child x_{1:5} generates \{y_2,\ldots,y_5\}; the right child x_{5:9} generates \{y_6,\ldots,y_9\}

  • green [split at j=6]: the left child x_{1:6} generates \{y_2,\ldots,y_6\}; the right child x_{6:9} generates \{y_7,\ldots,y_9\}.

In general, the splitting point j can range from 2 to 8 (i<j<k). These different possibilities have to be summed over, and the intermediate variables (the left and right child) have to be marginalised out:

Inside Probabilities (Binary Case)

\beta(x_{i:k})
=
p_S(z_{i:k}=\tau \mid x_{i:k})
\sum_{i<j<k}
\iint
p_{\tau}(x_{i:j}, x_{j:k} \mid x_{i:k}) \,
\beta(x_{i:j}) \, \beta(x_{j:k}) \,
dx_{i:j} \, dx_{j:k}

Here, the sum runs over all possible splitting points and the integrals marginalise out the intermediate (child) variables. The inside probabilities \beta(x_{i:j}) and \beta(x_{j:k}) are the recursive part and “hide” all subsequent generation steps from the children onwards. Finally, the particular binary transition considered here (denoted by \tau) is usually only one of multiple possible transitions (at least one other transition is needed to generate terminal variables). We therefore have to multiply everything by the probability of taking this particular transition, which is the first term with the structural distribution p_S.
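For a concrete picture of this recursion, the following is a minimal, self-contained sketch for a discrete setting, where the non-terminal variable takes finitely many states so the integrals become sums. All names and array shapes here are illustrative assumptions, not the library's API.

```python
import numpy as np

def inside_binary(obs, p_split, trans, emit):
    """Inside probabilities for a single non-terminal template variable with
    one binary split transition and one terminal (emission) transition.

    obs     -- observed terminal symbols as integer indices
    p_split -- shape (S,): p_S(z = binary split | x) for each state of x
    trans   -- shape (S, S, S): trans[x, l, r] = p_tau(left=l, right=r | x)
    emit    -- shape (S, V): p(y | x) under the terminal transition

    Returns a dict mapping spans (i, k) to shape-(S,) arrays beta(x_{i:k}).
    """
    n = len(obs)
    beta = {}
    # Width-1 spans: the terminal transition generates a single observation.
    for i in range(n):
        beta[(i, i + 1)] = (1 - p_split) * emit[:, obs[i]]
    # Wider spans: sum over splitting points i < j < k and marginalise the
    # two children against their inside probabilities.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            total = np.zeros(len(p_split))
            for j in range(i + 1, k):
                total += np.einsum('xlr,l,r->x', trans,
                                   beta[(i, j)], beta[(j, k)])
            beta[(i, k)] = p_split * total
    return beta
```

With a single state (S = 1), the sums collapse and the recursion can easily be checked by hand.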

General Case

Once the simple binary case is clear, the more complex general case is much easier to understand. It differs in the following respects.

Multiple template variables: So far, we had a single non-terminal template variable x that could be instantiated at any possible location i:k in the parse chart (x_{i:k}). In other words, the set of non-terminal template variables \mathcal{X} contained a single variable \mathcal{X}=\{x\}. However, in general, \mathcal{X} may contain an arbitrary number of non-terminal template variables \mathcal{X}=\{x^{(1)},x^{(2)},\ldots\}. The same is true for the set of terminal template variables \mathcal{Y}=\{y^{(1)},y^{(2)},\ldots\}.

Multiple possible transitions: As mentioned above, even in the simple binary case, there has to be at least one other transition to generate the terminal variables (as the binary split only generates two new non-terminal variables). In general, there may be an arbitrary number of transitions \tau\in\mathcal{T}_x from a particular non-terminal variable x and each transition may involve any number of terminal or non-terminal variables v\in\mathcal{X}\cup\mathcal{Y}.

Multiple splitting points: If a transition generates more than two variables, there is more than one splitting point. To be precise, if a transition generates \eta variables, there are \eta-1 splitting points, which we can denote by j_1,\ldots,j_{\eta-1}.
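The enumeration of splitting-point combinations can be sketched directly; the function name here is hypothetical:

```python
from itertools import combinations

def splitting_points(i, k, eta):
    """Yield, for a transition of arity eta on the span i:k, every choice of
    splitting points i < j_1 < ... < j_{eta-1} < k, as the resulting list of
    eta sub-spans (i:j_1), (j_1:j_2), ..., (j_{eta-1}:k)."""
    for js in combinations(range(i + 1, k), eta - 1):
        bounds = (i,) + js + (k,)
        yield list(zip(bounds[:-1], bounds[1:]))
```

For the binary example above, `list(splitting_points(1, 9, 2))` yields the seven splits j = 2, ..., 8; a ternary transition on the same span already has 21 possibilities.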

Taking this into account, the general form of the inside probabilities for the non-terminal variable x_{i:k} is given by

Inside Probabilities (General Case)

\begin{aligned}
\beta(x_{i:k})
={}&
\sum_{\tau \in \mathcal{T}_x}
p_S(z_{i:k}=\tau \mid x_{i:k})
{\sum\cdots\sum}_{i<j_1<\ldots<j_{\eta-1}<k} \\
&
{\int\cdots\int}_{\{v_{j:j^\prime}\}\subseteq\mathcal{X}}
p_{\tau}(v_{i:j_1}, \ldots, v_{j_{\eta-1}:k} \mid x_{i:k})
\prod_{\{v_{j:j^\prime}\}\subseteq\mathcal{X}}
\beta(v_{j:j^\prime}).
\end{aligned}

We denote the second line as the inside marginals \widetilde{\beta}_{i:\ldots:k}(x_{i:k}):

Inside Marginals

\widetilde{\beta}_{i:\ldots:k}(x_{i:k})
=
{\int\cdots\int}_{\{v_{j:j^\prime}\}\subseteq\mathcal{X}}
p_{\tau}(v_{i:j_1}, \ldots, v_{j_{\eta-1}:k} \mid x_{i:k})
\prod_{\{v_{j:j^\prime}\}\subseteq\mathcal{X}}
\beta(v_{j:j^\prime}),

where the subscript i:\ldots:k indicates the dependency on i<j_1<\ldots<j_{\eta-1}<k. This is implemented in Transition.inside_marginals(), which returns an array or iterable over all possible inside marginals \widetilde{\beta}_{i:\ldots:k}(x_{i:k}) for the different (combinations of) splitting points.
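In a discrete setting, the inside marginals of a binary transition reduce to one tensor contraction per splitting point. The following is an illustrative sketch under that assumption, not the actual Transition.inside_marginals() implementation:

```python
import numpy as np

def binary_inside_marginals(trans, child_betas):
    """Yield the inside marginals beta~_{i:j:k}(x_{i:k}) for each splitting
    point j, given trans[x, l, r] = p_tau(left=l, right=r | x) and, per
    split, the inside probabilities (beta_left, beta_right) of the children.
    """
    for beta_left, beta_right in child_betas:
        # Marginalise both children: sum_{l,r} p(l, r | x) beta(l) beta(r)
        yield np.einsum('xlr,l,r->x', trans, beta_left, beta_right)
```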

We denote the summation in the first line as the inside mixture:

Inside Mixture

\beta(x_{i:k})
=
\sum_{\tau \in \mathcal{T}_x}
p_S(z_{i:k}=\tau \mid x_{i:k})
{\sum\cdots\sum}_{i<j_1<\ldots<j_{\eta-1}<k} \ \widetilde{\beta}_{i:\ldots:k}(x_{i:k}),

which uses the inside marginals \widetilde{\beta}_{i:\ldots:k}(x_{i:k}) to compute the (multi-)sum over possible (combinations of) splitting points as well as the sum over possible transitions weighted by the structural distribution. This is implemented in Cell.inside_mixture().
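Schematically, the mixture step just weights and sums these marginals. A minimal sketch, in which the transition objects and their attributes are assumptions rather than the library's actual classes:

```python
def inside_mixture(transitions, structural_weights):
    """beta(x_{i:k}) = sum over transitions tau of
    p_S(z = tau | x) times the sum of tau's inside marginals."""
    beta = 0.0
    for tau, p_s in zip(transitions, structural_weights):
        # tau.inside_marginals() iterates over the (combinations of)
        # splitting points, as described above.
        beta = beta + p_s * sum(tau.inside_marginals())
    return beta
```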

To explain the different parts of the equation in more detail, we have:

  • The sum over all possible transitions for the given variable, weighting each transition by its probability to occur, given by the structural distribution:

\sum_{\tau \in \mathcal{T}_x}
p_S(z_{i:k}=\tau \mid x_{i:k})

  • The multi-sum over all possible combinations of splitting points. For the simple binary case, where we only have a single splitting point, this reduces to the simple sum from above. A larger number of splitting points quickly becomes computationally expensive as the number of possibilities grows exponentially (and is also more difficult to vectorise):

{\sum\cdots\sum}_{i<j_1<\ldots<j_{\eta-1}<k}

  • The multi-integral to marginalise out the set of generated non-terminal (child) variables (denoted by \{v_{j:j^\prime}\}\subseteq\mathcal{X}). Importantly, this does not include any terminal variables that may have been generated along with the non-terminal variables:

{\int\cdots\int}_{\{v_{j:j^\prime}\}\subseteq\mathcal{X}}

  • The actual transition probability. The generated variables v_{i:j_1}, \ldots, v_{j_{\eta-1}:k} may contain both non-terminal variables and terminal variables:

p_{\tau}(v_{i:j_1}, \ldots, v_{j_{\eta-1}:k} \mid x_{i:k})

  • The recursive part containing the product of inside probabilities of all generated non-terminal variables.

\prod_{\{v_{j:j^\prime}\}\subseteq\mathcal{X}}
\beta(v_{j:j^\prime}).

Non-Sequential Case

Everything transfers to the non-sequential case, which is more abstract, but the equations become simpler.

Inside Probabilities (Non-Sequential)

\begin{aligned}
\beta(x_\pi) ={}& p(\mathbf{Y}_\pi \ |\ x_\pi) \\
={}&
\sum_{\tau \in \mathcal{T}_x}
p_S(z_{\pi}=\tau \mid x_{\pi})
\sum_{(\pi_1,\pi_2,\ldots)\in\mathcal{P}_\tau(\pi)}
\underbrace{
\int_{\{v_{\pi_i}\}\subseteq\mathcal{X}}
p_{\tau}(v_{\pi_1}, v_{\pi_2}, \ldots \mid x_{\pi})
\prod_{\{v_{\pi_i}\}\subseteq\mathcal{X}} \beta(v_{\pi_i})
}_{\widetilde{\beta}^{(\tau)}_{(\pi_1,\pi_2,\ldots)}(x_\pi)} \\
={}&
\sum_{\tau \in \mathcal{T}_x}
p_S(z_{\pi}=\tau \mid x_{\pi})
\sum_{(\pi_1,\pi_2,\ldots)\in\mathcal{P}_\tau(\pi)}
{\widetilde{\beta}^{(\tau)}_{(\pi_1,\pi_2,\ldots)}(x_\pi)}
& \text{\makebox[0pt][r]{(Inside Mixture)}} \\
\widetilde{\beta}^{(\tau)}_{(\pi_1,\pi_2,\ldots)}(x_\pi)
={}&
\int_{\{v_{\pi_i}\}\subseteq\mathcal{X}}
p_{\tau}(v_{\pi_1}, v_{\pi_2}, \ldots \mid x_{\pi})
\prod_{\{v_{\pi_i}\}\subseteq\mathcal{X}}
\beta(v_{\pi_i}),
& \text{\makebox[0pt][r]{(Inside Marginals)}}
\end{aligned}

where

  • \pi references a subset of terminal variables

  • \mathbf{Y}_\pi is the actual subset of terminal variables referenced by \pi

  • \beta(x_\pi) is the probability of generating \mathbf{Y}_\pi conditional on the non-terminal variable x_\pi with all intermediate variables marginalised out

  • \mathcal{P}_\tau(\pi) is the set of all partitions of \pi that are compatible with transition \tau; in particular, for transitions with a fixed arity \eta, the number of parts must equal \eta (the number of variables produced by \tau)

  • (\pi_1,\pi_2,\ldots)\in\mathcal{P}_\tau(\pi) is one particular partition, so that each \pi_i is the subset of data points represented by the variable v_{\pi_i}.
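Enumerating \mathcal{P}_\tau(\pi) for a fixed arity can be sketched as follows; this is a brute-force illustration, and a real implementation would exploit the transition's structural constraints instead:

```python
from itertools import product

def ordered_partitions(pi, eta):
    """Yield all ordered partitions (pi_1, ..., pi_eta) of the index set pi
    into eta non-empty parts, one part per variable generated by the
    transition."""
    pi = sorted(pi)
    for labels in product(range(eta), repeat=len(pi)):
        parts = [frozenset(e for e, l in zip(pi, labels) if l == p)
                 for p in range(eta)]
        if all(parts):  # discard assignments that leave a part empty
            yield tuple(parts)
```

For |\pi| = 3 and \eta = 2 this yields 2^3 − 2 = 6 partitions; in the sequential case, only those whose parts are contiguous index ranges remain valid.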

This formulation also generalises to context-free graph grammars. For instance, in node-replacement graph grammars, variables correspond to nodes and each transition \tau defines additional constraints for the partition (\pi_1,\pi_2,\ldots)\in\mathcal{P}_\tau(\pi), such that the node replacement rule is fulfilled.

Outside Probabilities

For simplicity, we directly provide the outside probabilities for the general non-sequential case.

Outside Probabilities

\begin{aligned}
\alpha(x_\pi) ={}& p(x_\pi, \mathbf{Y} \setminus \mathbf{Y}_\pi) \\
={}&
\sum_{\bar{\tau} \in \bar{\mathcal{T}}_x} \
\sum_{(\bar{\pi},\pi_1,\pi_2,\ldots)\in\bar{\mathcal{P}}_{\bar{\tau}}(\pi)}
\underbrace{
\int_{\bar{x}_{\bar{\pi}}}
\int_{\{v_{\pi_i}\}\subseteq\mathcal{X}}
p_S(z_{\bar{\pi}}=\bar{\tau} \mid \bar{x}_{\bar{\pi}}) \
p_{\bar{\tau}}(x_{\pi}, v_{\pi_1}, v_{\pi_2}, \ldots \mid \bar{x}_{\bar{\pi}}) \
\alpha(\bar{x}_{\bar{\pi}})
\prod_{\{v_{\pi_i}\}\subseteq\mathcal{X}} \beta(v_{\pi_i})
}_{\widetilde{\alpha}^{(\bar{\tau})}_{(\bar{\pi},\pi_1,\pi_2,\ldots)}(x_\pi)} \\
={}&
\sum_{\bar{\tau} \in \bar{\mathcal{T}}_x} \
\sum_{(\bar{\pi},\pi_1,\pi_2,\ldots)\in\bar{\mathcal{P}}_{\bar{\tau}}(\pi)}
\widetilde{\alpha}^{(\bar{\tau})}_{(\bar{\pi},\pi_1,\pi_2,\ldots)}(x_\pi)
& \text{\makebox[0pt][r]{(Outside Mixture)}} \\
\widetilde{\alpha}^{(\bar{\tau})}_{(\bar{\pi},\pi_1,\pi_2,\ldots)}(x_\pi)
={}&
\int_{\bar{x}_{\bar{\pi}}}
\int_{\{v_{\pi_i}\}\subseteq\mathcal{X}}
p_S(z_{\bar{\pi}}=\bar{\tau} \mid \bar{x}_{\bar{\pi}}) \
p_{\bar{\tau}}(x_{\pi}, v_{\pi_1}, v_{\pi_2}, \ldots \mid \bar{x}_{\bar{\pi}}) \
\alpha(\bar{x}_{\bar{\pi}})
\prod_{\{v_{\pi_i}\}\subseteq\mathcal{X}} \beta(v_{\pi_i})
& \text{\makebox[0pt][r]{(Outside Marginals)}}
\end{aligned}

where

  • \mathbf{Y} \setminus \mathbf{Y}_\pi is the set of all terminal variables excluding \mathbf{Y}_\pi, that is, the complement of the terminal variables from the respective inside probabilities

  • \alpha(x_\pi) is the joint probability of generating x_\pi and all terminal variables \mathbf{Y} \setminus \mathbf{Y}_\pi

  • \bar{\mathcal{T}}_x is the set of transitions that include x as one of the generated variables; if x appears multiple times in the generated variables, these are listed as separate transitions \bar{\tau} and the position of the respective occurrence of x is marked and used in the evaluation of p_{\bar{\tau}}

  • \bar{\pi}, \pi, \pi_1, \pi_2, \ldots reference subsets of terminal variables, such that \pi, \pi_1, \pi_2, \ldots is a partition of \bar{\pi}, and

    \bar{\mathcal{P}}_{\bar{\tau}}(\pi)
    = \{ (\bar{\pi},\pi_1,\pi_2,\ldots) \ | \ \pi, \pi_1, \pi_2, \ldots \text{ partition } \bar{\pi} \}

    is the set of all such possibilities compatible with transition \bar{\tau}; note that the elements of \bar{\mathcal{P}}_{\bar{\tau}}(\pi) contain the parent set \bar{\pi}, not \pi.
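As with the inside marginals, a discrete binary sketch makes the structure concrete: the contribution of one parent span to a child's outside probability contracts the parent's outside probability (weighted by the structural distribution) with the sibling's inside probability. Names and shapes are illustrative assumptions:

```python
import numpy as np

def outside_contribution(trans, p_split, alpha_parent, beta_sibling,
                         child='left'):
    """One term of the outside marginals for a binary transition:
    sum over parent state x and sibling state of
    p_S(z = split | x) * p_tau(l, r | x) * alpha(x) * beta(sibling)."""
    weighted = p_split * alpha_parent  # per-state weight of the parent
    if child == 'left':
        return np.einsum('xlr,x,r->l', trans, weighted, beta_sibling)
    return np.einsum('xlr,x,l->r', trans, weighted, beta_sibling)
```

A full outside pass would sum such contributions over all parent spans, sibling positions, and transitions in which the variable appears as a generated child.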

Marginal Likelihood

The marginal likelihood p(\mathbf{Y}) is the probability of the observed data after marginalising out all latent variables of the model. It is the quantity typically used to measure how well a model describes the data, and model parameters are optimised by minimising the negative marginal log likelihood -\log p(\mathbf{Y}). We obtain the marginal likelihood from the inside probabilities at the root node:

Marginal Likelihood

p(\mathbf{Y}) = \sum_{x\in\mathcal{X}} w_x \int \beta(x_{0:n}) \ p_P(x_{0:n}) \ dx_{0:n},

where p_P(x_{0:n}) is the prior transition that generates a non-terminal variable of type x from scratch and w_x is the probability of generating a particular template variable of type x. Note that this is fundamentally the same as for normal transitions (w_x is the equivalent of the structural distribution), except that some things simplify as we do not condition on another non-terminal variable and do not have to consider possible splitting points as we are generating only a single new non-terminal variable. Computing marginal likelihoods from the inside probabilities at the root node is implemented in Prior.marginal_likelihood(), which essentially corresponds to combining the Transition.inside_marginals() and Cell.inside_mixture() methods for the special case of the prior.
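In the discrete sketch used above, the integral over the root variable becomes a dot product between the root's inside probabilities and the prior; names are again illustrative, not the library's API:

```python
import numpy as np

def marginal_likelihood(root_betas, priors, weights):
    """p(Y) = sum over template variables x of
    w_x * sum over states of beta(x_{0:n}) * p_P(x_{0:n})."""
    return sum(w * float(np.dot(beta, prior))
               for beta, prior, w in zip(root_betas, priors, weights))
```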

Parsing

To compute marginal likelihoods, we have to recursively evaluate the equation for general inside probabilities to obtain their values at the root node. This is done by the RBN.inside() method. It iterates over all inside variables using inside_schedule(), delegating the marginalisation of variables to Transition.inside_marginals() and the computation of mixtures to Cell.inside_mixture(), before returning the marginal likelihood computed via Prior.marginal_likelihood().