Dynamic programming algorithms are the bread and butter for structured prediction in NLP.

They are also quite a pain to debug, especially when implemented directly in your language of choice, and not in some high-level programming language for dynamic programming algorithms, such as Dyna.

In this post, I suggest a way, or more of a “debugging trick” to make sure that your dynamic programming algorithm is implemented correctly. This trick is a by-product of working with spectral algorithms, where parameters are masked by an unknown linear transformation. Recently, Avneesh, Chris and I have successfully debugged a hypergraph search algorithm for MT with it.

The idea is simple. Take your parameters for the model, and transform them by a random linear transformation, such that the linear transformation will cancel out if you compute marginals or any other quantity using the dynamic programming algorithm. Marginals have to be positive. If following the linear transformation, you are getting any negative marginals, that’s a (one-way) clear indication that somewhere you have a bug. You are actually supposed to get the same inside/outside probabilities and marginals as you get without the linear transformation.

Here are more exact details. Consider the CKY algorithm for parsing with context-free grammars — in its real semiring version, i.e. when it is computing the inside probabilities.The deduction rules (Dyna style or otherwise) are simple to understand, and they look something like:

\(\mathrm{goal \vdash root\_weight(a) + span(a,0,N)}\)

\(\mathrm{span(a,i,j) \vdash rule\_weight(a \rightarrow b c) + span(b,i,k) + span(c,k,j)} \)

\(\mathrm{span(a,i,i+1) \vdash rule\_weight(a \rightarrow x) + word(x, i, i+1)} \)

Here, the indices in \( \mathrm{span} \) denote spaces between words. So for example, in the sentence “The blog had a new post”, if we wanted to denote an NP over “The blog”, that would be computed in \( \mathrm{span(“NP”, 0, 2)}\). N is the length of the sentence.

\( \mathrm{word(x,i,i+1)} \) are axioms that denote the words in the sentence. For each word in the sentence, such as a “blog”, you would seed the chart with the word elements in the chart (such as \(\mathrm{word(“blog”, 1, 2)}\)).

\( \mathrm{rule\_weight(a \rightarrow b c)}\) are the probabilities \(p(a \rightarrow b c | a) \) for the underlying context-free grammar.

Let \(n\) be the number of nonterminals in the grammar. In addition, denote by \(R\) an \(n \times n \times n\) array such that \(R_{abc} = p(a \rightarrow b c | a)\) (and \(0\) if the rule is not in the grammar). In addition, denote by \(\alpha(i,j)\) a vector of size \(1 \times n\) such that \(\alpha_a(i,j) = \mathrm{span(a,i,j)}\).

Then, now, note that for any \(a\), the above CKY algorithm (in declarative form) dictates that:

\(\alpha_a(i,j) = \sum_{k=i+1}^{j-1} \sum_{b,c} \alpha_b(i,k) \alpha_c(k,j) R_{abc}\)

for non-leaf spans \(i,j\).

One way to view this is as a generalization of matrix product — tensor product. \(R\) is actually a three dimensional tensor (\(n \times n \times n\)) and it can be viewed as a mapping that maps a pair of vectors, and through *contraction* returns another vector:

\(\alpha_a(i,j) = \sum_{k=i+1}^{j-1} R(2,3; \alpha(i,k), \alpha(k,j))\)

Here, \(R(x,y ; w,v)\) denotes a contraction operation on the \( x \) and \( y \) dimensions. Contraction here means that we sum out the two dimensions (\(2\) and \(3\)) while multiplying in the two vectors \(w \in \mathbb{R}^n \) and \(v \in \mathbb{R}^n \).

Now that we have this in a matrix form, we can note something interesting. Let’s say that we have some matrix \(G \in \mathbb{R}^{n \times n}\), which is invertible. Now, we define linearly transformed \(\alpha\) with overline, as (call the following “Formula 1”):

\(\overline{\alpha}_a(i,j) = \left( \sum_{k=i+1}^{j-1} R(2,3; \overline{\alpha}(i,k)\times G^{-1} , \overline{\alpha}(k,j)\times G^{-1} ) \right) \times G \).

It is easy to verify that \(\overline{\alpha}(i,j) \times G^{-1} = \alpha(i,j)\). The \(G\) acts as a “plug”: it linearly transforms each \(\alpha\) term. We transform it back with \(G^{-1}\) before feeding it to \(R\), and then transform the result of \(R\) with \(G\) again, to keep everything linearly transformed.

Now, let \(\beta\) be a vector of size \(1 \times n\) such that \(\beta_a = \mathrm{root\_weight(a)}\). If we define \(\overline{\beta} = \beta (G^{-1})^{\top}\), then, the total probability of a string using the CKY algorithm can be computed as:

\(\langle \alpha(0,N), \beta \rangle = \alpha(0,N) \beta^{\top} = \alpha(0,N) G G^{-1} \beta^{\top} = \langle \overline{\alpha}(0,N), \overline{\beta} \rangle\).

This means that if we add the multiplication by \(G\) (and its inverse) to all of our \(\alpha\) and \(\beta\) terms as above, the calculation of our total probability of string would not change, and therefore, since \(p(a \rightarrow b c | a)\) are all positive, computing the total probability with the linear transformations should also be positive, and not only that — it should be identical to the result as if we are not using \(G\)!

The next step is noting that the multiplication by \(G\) can be folded into our \(R\) tensor. If we define

\([\overline{R}]_{abc} = \sum_{a’,b’,c’} R_{abc} [G^{-1}]_{bb’} [G^{-1}]_{cc’} G_{a’a}\)

then Formula 1 can be replaced with:

\( \overline{\alpha}_a(i,j) = \left( \sum_{k=i+1}^{j-1} \overline{R}(2,3; \overline{\alpha}(i,k), \overline{\alpha}(k,j)) \right) \).

The last set of parameters we need to linearly transform by \(G\) is that of \( \mathrm{rule\_weight(a \rightarrow x)}\). It is not hard to guess how. First, denote by \( \gamma_x \in \mathbb{R}^n \) a vector such that \( [\gamma_x]_a = \mathrm{rule\_weight(a \rightarrow x_i)}\).

Note \(\alpha_a(i,i+1) = [\gamma_{x_i}]_a \) where \(x_i \) is the \(i\)th word in the sentence. We need to multiply gamma for each \( x \) by \(G\) on the left: \(\overline{\gamma}_x = \gamma_x G\). We do that for all \(x\) in the vocabulary, defining these linearly transformed gammas. This means now that we also make sure that for the leaf nodes it holds that \( \alpha(i,i+1) = \overline{\alpha}(i,i+1) G^{-1} \) — all linear transformations by \( G \) will cancel from top to bottom when using \( \overline{\beta} \) for root probabilities, \(\overline{R}\) for binary rule probabilities and \( \overline{\gamma}\) for preterminal rule probabilities (instead of \( \beta\), \( R \) and \( \gamma \)).

Perfect! By linearly transforming our parameters \(\mathrm{rule\_weight}\), \(\mathrm{root\_weight}\), we got parameters which are very sensitive to bugs in our dynamic programming algorithm. If we have the slightest mistake somewhere, it is very likely, if \(G\) is chosen well, that some of the total tree probabilities on a training set won’t be identical (or will even be negative) to the non-linearly transformed parameters, even though they should be.

You might think that this is a whole lot of hassle to go through for debugging. But really, linearly transforming the parameters is just about few lines of code. Here is what Avneesh used in Python:

G = np.random.random_sample((rank, rank)) G = G + G.transpose() Ginv = np.linalg.inv(G) for src_RHS in paramDict: if src_RHS == "Pi": paramDict[src_RHS] = np.absolute(paramDict[src_RHS]).dot(Ginv) else: for target_RHS in paramDict[src_RHS]: parameter = np.absolute(paramDict[src_RHS][target_RHS]) arity = len(parameter.shape) - 1 if arity == 0: paramDict[src_RHS][target_RHS] = G.dot(parameter) elif arity == 1: paramDict[src_RHS][target_RHS] = G.dot(parameter).dot(Ginv) elif arity == 2: result = np.tensordot(G, parameter, axes=[1,0]) result = np.tensordot(Ginv, result, axes=[1,1]).swapaxes(0,1) result = np.tensordot(Ginv, result, axes=[1,2]).swapaxes(1,2) paramDict[src_RHS][target_RHS] = result

Voila! This code is added before the dynamic programming algorithm starts to execute. Then, the results of computing the inside probabilities with this linear transformation are compared to the results without the linear transformation — they have to be identical.

Lines 1-3 create an invertible G matrix. Note that we make sure it is symmetric, just to be on the safe side, in case we need to multiply something by \(G^{\top}\) instead of \( G \). This debugging method does not have to use a symmetric \( G\).

Next, we multiply all parameters in paramDict. The “Pi” one is the root probabilities, and we just multiply it by \(G^{-1}\). Later on “arity” defines whether the set of parameters is a matrix, i.e. arity = 1 (unary rules, not discussed above, but have very similar formulation), tensor, arity = 2 (binary rules) or vectors (preterminal rules; arity = 0).

If you want more details about this idea of linearly transforming the parameters, you can find them here for latent-variable PCFGs. The linear transformations in that paper are used for spectral learning of L-PCFGs — but as I mentioned, a by-product of that is this debugging trick. When using EM with dynamic programming algorithms, the usual method for making sure everything is working correctly, is checking that the likelihood increases after each iteration. The linear-transformation debugging method is more fine-grained and targeted towards making sure the dynamic programming algorithm is correct. In addition, I witnessed cases in which the likelihood of EM continuously improved, but there was still a bug with the dynamic programming algorithm.

By the way, here I described mostly the debugging method applied to the inside algorithm. The same linear transformations can be also used with the outside algorithm, and therefore to compute any marginal — the linear transformations should cancel in that case as well. It is also not too hard to imagine how to generalize this to the case where the dynamic programming algorithm is not CKY, but some other dynamic programming algorithm.