Variable elimination

Next, we turn our attention to the problem of inference in graphical models. Given a probabilistic model (such as a Bayes net on an MRF), we are interested in using it to answer useful questions, e.g. determining the probability that a given email is spam. More formally, we will be focusing on two types of questions:

It turns out that inference is a challenging task. For many probabilities of interest, it will be NP-hard to answer any of these questions. Crucially, whether inference is tractable will depend on the structure of the graph that describes that probability. If a problem is intractable, we will still be able to obtain useful answers via approximate inference methods.

This chapter covers the first exact inference algorithm, variable elimination. We will discuss approximate inference in later chapters.

An illustrative example

Consider first the problem of marginal inference. Suppose for simplicity that we a given a chain Bayesian network, i.e. a probability of the form

We are interested in computing the marginal probability . We will assume for the rest of the chapter that the are discrete variables taking possible values eachThe principles behind variable elimination also extend to many continuous distributions (e.g. Gaussians), but we will not discuss these extensions here. .

The naive way of performing this is to sum the probability over all the assignments to :

However, we can do much better by leveraging the factorization of our probability distribution. We may rewrite the sum in a way that ``pushes in” certain variables deeper into the product.

We perform this summation by first summing the inner terms, starting from , and ending with . More concretely, we start by computing an intermediary factor by summing out . This takes time because we must sum over for each assignment to . The resulting factor can be thought of as a table of values (though not necessarily probabilities), with one entry for each assignment to (just as factor can be represented as a table. We may then rewrite the marginal probability using as

Note that this has the same form as the initial expression, except that we are summing over one fewer variableThis technique is a special case of dynamic programming, a general algorithm design approach in which we break apart a larger problem into a sequence of smaller ones. . We may therefore compute another factor , and repeat the process until we are only left with . Since each step takes time, and we perform steps, inference now takes time, which is much better than our naive solution.

Also, at each time, we are eliminating a variable, which gives the algorithm its name.

Eliminating Variables

Having established some intuitions, with a special case, we will now introduce the variable elimination algorithm in its most general form.

Factors

We will assume that we are given a graphical model as a product of factors

Recall that we can view a factor as a multi-dimensional table assigning a value to each assignment of a set of variables . In the context of a Bayesian network, the factors correspond to conditional probability distributions; however, this definition also makes our algorithm equally applicable to Markov Random Fields. In this latter case, the factors encode an unnormalized distribution; to compute marginals, we first calculate the partition function (also using variable elimination), then we compute marginals using the unnormalized distribution, and finally we divide the result by the partition constant to construct a valid marginal probability.

Factor Operations

The variable elimination algorithm will repeatedly perform two factor operations: product and marginalization. We have been implicitly been performing these operations in our chain example.

The factor product operation simply defines the product of two factors as

The scope of is defined as the union of the variables in the scopes of ; also denotes an assignment to the variables in the scope of defined by the restriction of to that scope. For example, we define .

Next, the marginalization operation “locally” eliminates a set of variable from a factor. If we have a factor over two sets of variables , marginalizing produces a new factor

where the sum is over all joint assignments to the set of variables .
Here, we are marginalizing out variable from factor

We use to refer to the marginalized factor. It is important to understand that this factor does not necessarily correspond to a probability distribution, even if was a CPD.

Orderings

Finally, the variable elimination algorithm requires an ordering over the variables according to which variables will be “eliminated”. In our chain example, we took the ordering implied by the DAG. It is important note that:

We will come back to these complications later, but for now let the ordering be fixed.

The variable elimination algorithm

We are now ready to formally define the variable elimination (VE) algorithm. Essentially, we loop over the variables as ordered by and eliminate them in that ordering. Intuitively, this corresponds to choosing a sum and “pushing it in” as far as possible inside the product of the factors, as we did in the chain example.

More formally, for each variable (ordered according to ),

  1. Multiply all factors containing
  2. Marginalize out to obtain new factor
  3. Replace the factors in by

Examples

Let’s try to understand what these steps correspond to in our chain example. In that case, the chosen ordering was Starting with , we collected all the factors involving , which were and . We then used them to construct a new factor . This can be seen as the results of steps 2 and 3 of the VE algorithm: first we form a large factor ; then we eliminate from that factor to produce . Then, we repeat the same procedure for , except that the factors are now .

For a slightly more complex example, recall the graphical model of a student’s grade that we introduced earlier.
Bayes net model of a student’s grade on an exam; in addition to , we also model other aspects of the problem, such as the exam’s difficulty , the student’s intelligence , his SAT score , and the quality of a reference letter from the professor who taught the course. Each variable is binary, except for , which takes 3 possible values.
The probability specified by the model is of the form

Let’s suppose that we are computing and are eliminating variables in their topological ordering in the graph. First, we eliminate , which corresponds to creating a new factor . Next, we eliminate to produce a factor ; then we eliminate yielding , and so forth. Note that these operations are equivalent to summing out the factored probability distribution as follows:

Note that this example requires computing at most operations per step, since each factor is at most over 2 variables, and one variable is summed out at each step (the dimensionality in this example is either 2 or 3).

Introducing evidence

A closely related and equally important problem is computing conditional probabilities of the form

where is a probability distribution, over sets of query variables , observed evidence variables , and unobserved variables .

We can compute this probability by performing variable elimination once on and then once more on .

To compute , we simply take every factor which has scope over variables that are also found in , and we set their values as specified by . Then we perform standard variable elimination over to obtain a factor over only .

Running Time of Variable Elimination

It is very important to understand that the running time of Variable Elimination depends heavily on the structure of the graph.

In the previous example, suppose we eliminated first. Then, we would have had to transform the factors into a big factor over 3 variables, which would require time to compute. If we had a factor , then we would have had to eliminate as well, producing a single giant factor in time. Then, eliminating any variable from this factor would require almost as much work as if we had started with the original distribution, since all the variable have become coupled.

Clearly some ordering are much more efficient than others. In fact, the running time of Variable Elimination will equal , where is the maximum size of any factor during the elimination process.

Choosing variable elimination orderings

Unfortunately, choosing the optimal VE ordering is an NP-hard problem. However, in practice, we may resort to the following heuristics:

In practice, these methods often result in reasonably good performance in many interesting settings.

Variable Elimination - Volodymyr Kuleshov