Category Archives: nlp

Spectral learning of hidden Markov models

I recently gave a tutorial at CMU about spectral learning for NLP. This tutorial was based on a tutorial I had given last year with Michael Collins, Dean Foster, Karl Stratos and Lyle Ungar at NAACL.

One of the algorithms I explained there was the spectral learning algorithm for HMMs by Hsu, Kakade and Zhang (2009). This algorithm estimates parameters of HMMs in the “unsupervised setting” — only from sequences of observations. (Just like the Baum-Welch algorithm — expectation-maximization for HMMs — does.)

I want to repeat this explanation here, and give some intuition about this algorithm, since it seems to confuse people quite a lot. At a first glance, it looks quite mysterious why the algorithm works, though its implementation is very simple. It is one of the earlier algorithms in this area of latent-variable learning using the method of moments and spectral methods, and promoted the creation of other algorithms for latent-variable learning.

So here are the main ideas behind it, with some intuition. In my explanation of the algorithm, I am going to forget about the “spectral” part. No singular value decomposition will be involved, or any type of spectral decomposition. Just plain algebraic and matrix multiplication tricks that require understanding what marginal probabilities are and how matrix multiplication and inversion work, and nothing more. Pedagogically, I think that’s the right thing to do, since introducing the SVD step complicates the understanding of the algorithm.

Consider a hidden Markov model. The parameters are represented in matrix form \( T \), \( O \) and \( \pi \). We assume \( m \) latent states, \( n \) observations. More specifically, \( T \) is an \( m \times m \) stochastic matrix where \( m \) is the number of latent states, such that \( T_{hh’} \) is the probability of transitioning to state \( h \) from state \( h’ \). \( O \) is an \( n \times m \) matrix such that \( O_{xh} \) is the probability of emitting symbol \( x \) — an observation — from latent state \( h \). \( \pi \) is an \( m \) length vector with \( \pi_h\) being the initial probability for state \( h \).

To completely get rid of the SVD step, and simplify things, we will have to make the assumption that \(m = n\). This means that the number of states equals the number of observations. Not a very useful HMM, perhaps, but it definitely makes the derivation more clear. The fact that \( m=n\) means that \( O \) is now a square matrix — and we will assume it is invertible. We will also assume that  \(T \) is invertible, and that \( \pi \) is positive in all coordinates.

If we look at the joint distribution of \(p(X_1 = x_1,X_2 = x_2)\), the first two observations in the HMM, then it can written as:

\( p(X_1 = x_1, X_2 = x_2) = \sum_{h_1,h_2} p(X_1 = x_1, H_1 = h_1, X_2 = x_2, H_2 = h_2) = \sum_{h_1,h_2} \pi_{h_1} O_{x_1,h_1} T_{h_2,h_1} O_{x_2,h_2} \)

Nothing special here, just marginal probability summing out the first two latent states.

It is not hard to see that this can be rewritten in matrix form, i.e. if we define \( [P_{2,1}]_{x_2,x_1} = p(X_1 = x_1, X_2= x_2) \) then:

\( P_{2,1} = O T \mathrm{diag}(\pi)O^{\top} \)

where \( \mathrm{diag}(\pi) \) is just an \( m \times m \) diagonal matrix with \( \pi_h \) on the diagonal.
Just write down this matrix multiplication step-by-step explicitly, multiplying, say, from right to left, and you will be able to verify this identity for \( P_{2,1} \). Essentially, the matrix product, which involves  dot-product between rows and vectors of two matrices, eliminates and sums out the latent states (and does other things, like multiplying in the starting probabilities).

Alright. So far, so good.

Now, what about the joint distribution of three observations?

\( p(X_1 = x_1, X_2 = x, X_3=x_3) = \sum_{h_1,h_2,h_3} p(X_1 = x_1, H_1 = h_1, X_2 = x_2, H_2 = h_2, X_3=x_3, H_3 = h_3) = \sum_{h_1,h_2,h_3} \pi_{h_1} O_{x_1,h_1} T_{h_2,h_1} O_{x_2,h_2} T_{h_3,h_2} O_{x_3,h_3} \)

Does this have a matrix form too? Yes, not surprisingly. If we fix \( x \), the second observation, and define \( [P_{3,x,1}]_{x_3,x_1} = p(X_1 = x_1, X_2 = x, X_3 = x_3) \), (i.e. \( P_{3,x,1} \) is an \( m \times m \) matrix defined for each observation symbol \( x \)), then

\( P_{3,x,1} = OT \mathrm{diag}(O_x) T \mathrm{diag}(\pi) O^{\top} \).

Here, \( \mathrm{diag}(O_x) \) is a diagonal matrix where the on the diagonal we have the \(x\)th row of \( O \).

Now define \( B_x = P_{3,x,1}P_{2,1}^{-1} \) (this is well-defined because \( P_{2,1} \) is invertible — all the conditions we had on the HMM parameters make sure that it is true), then:

\( B_x = OT \mathrm{diag}(O_x) T \mathrm{diag}(\pi)  O^{\top} \times (O T\mathrm{diag}(\pi)O^{\top})^{-1} = OT\mathrm{diag}(O_x)O^{-1} \)

(just recall that \( (AB)^{-1} = B^{-1} A^{-1} \) whenever both sides are defined and \( A \) and \( B \) are square matrices.)

This part of getting \( B_x \) (and I will explain in a minute why we need it) is the hardest part in our derivation so far. We can also verify that \( p(X_1 = x_1) \) equals \( O\pi \). Let’s call \( b_1 \) a vector such that \([b_1]_x = p(X_1=x_1)\) — i.e. \( b_1 \) is exactly the vector \( P_1 \).

We can also rewrite \( P_1 \) the following way:

\( P_1^{\top} = 1^{\top} T \mathrm{diag}(\pi) O^{\top} = 1^{\top} O^{-1} \underbrace{O T \mathrm{diag}(\pi) O^{\top}}_{P_{2,1}} \)

where \( 1^{\top} \) is an \( 1 \times m \) vector with the value 1 in all coordinates. The first equality is the “surprising” one — we use \( T \) to calculate the distribution of \( p(X_1 = x_1) \) — but if you write down this matrix multiplication explicitly, you will discover that we will be summing over the elements of \( T \) in such a way that it does not play a role in the sum — that’s because each row of \( T \) sums to 1. (As Hsu et al. put it in their paper: this is an unusual but easily verified form to write \( P_1 \).)

The above leads to the identity \( P_1^{\top} = 1^{\top} O^{-1} P_{2,1} \).

Now, it can be easily verified from the above form of \( P_1 \) that for \( b_{\infty}^{\top} \) defined as \( (P^{\top}_{2,1})^{-1} P_1 \), an \(m\) length vector, then:

\( b_{\infty}^{\top} = 1^{\top} O^{-1} \).

So what do we have so far? We managed to define the following matrices and vectors based only on the joint distribution of the first three symbols in the HMM:

\( B_x = P_{3,x,1}P_{2,1}^{-1}  = OT\mathrm{diag}(O_x)O^{-1}, \)

\( b_1 = P_1 = O\pi, \)

\( b_{\infty}^{\top} = (P^{\top}_{2,1})^{-1} P_1 = 1^{\top} O^{-1}. \)

The matrix \( B_x \in \mathbb{R}^{m \times m} \) and vectors \( b_{\infty} \in \mathbb{R}^m \) and \( b_1 \in \mathbb{R}^m \) will now play the role of our HMM parameters. How do we use them as our parameters?

Say we just observe a single symbol in our data, i.e. the length of the sequence is 1, and that symbol is \(x\). Let’s multiply \( b^{\top}_{\infty} B_x b_1 \).

According to the above equalities, it is true that this equals:

\( b^{\top}_{\infty} B_x b_1 = (1^{\top} O^{-1}) (O T \mathrm{diag}(O_x) O^{-1}) (O \pi) = 1^{\top} T \mathrm{diag}(O_x) \pi \).

Note that this quantity is a scalar. We are multiplying a matrix by a vector from left and right. Undo this matrix multiplication, and write it the way we like in terms of sums over the latent states, and what do we get? The above just equals:

\( b^{\top}_{\infty} B_x b_1 = \sum_{h_1,h_2} T_{h_2,h_1} O_{x,h_1} \pi_{h_1} = \sum_{h_1,h_2} p(H_1) p(X_1 = x | H_1 = h_1) p(H_2 = h_2 | H_1 = h_1) = p(X_1 = x_1) \).

So, this triplet-product gave us back the distribution over the first observation. That’s not very interesting, we could have done it just by using \( b_1 \) directly. But… let’s go on and compute:

\( b^{\top}_{\infty} B_{x_2} B_{x_1} b_1. \)

This can be easily verified to equal \( p(X_1 = x_1, X_2 = x_2) \).

The interesting part is that in the general case,

\( b^{\top}_{\infty} B_{x_n} B_{x_{n-1}}…B_{x_1} b_1 = p(X_1 = x_1, \ldots, X_n = x_n) \) –

we can now calculate the probability of any observation sequence in the HMM only by knowing the distribution over the first three observations! (To convince yourself about the general case above, just look at Lemma 1 in the Hsu et al. paper.)

In order to turn this into an estimation algorithm, we just need to estimate from data \( P_{2,1} \) and \( P_{3,x,1} \) for each observation symbol (all observed, just “count and normalize”), and voila, you can estimate the probability of any sequence of observations (one of the basic problems with HMMs according to this old classic paper, for example).

But… We made a heavy assumption. We assumed that \( n = m \) — we have as many observation symbols as latent states. What do we do if that’s not true? (i.e. if \( m < n \))? That’s where the “spectral” part kicks in. Basically, what we need to do is to reduce our \( O \) matrix into an \( m \times m \) matrix using some \( U \) matrix, while ensuring that \( U^{\top}O \) is invertible (just like we assumed \( O \) was invertible before). Note that \( U \) needs to be \( n \times m \).

It turns out that a \( U \) that will be optimal in some sense, and will also make all of the above algebraic tricks work is the left singular value matrix of \( P_{2,1} \). Understanding why this is the case requires some basic knowledge of linear algebra — read the paper to understand this!

A debugging trick for implementation of dynamic programming algorithms

Dynamic programming algorithms are the bread and butter for structured prediction in NLP.
They are also quite a pain to debug, especially when implemented directly in your language of choice, and not in some high-level programming language for dynamic programming algorithms, such as Dyna.

In this post, I suggest a way, or more of a “debugging trick” to make sure that your dynamic programming algorithm is implemented correctly. This trick is a by-product of working with spectral algorithms, where parameters are masked by an unknown linear transformation. Recently, Avneesh, Chris and I have successfully debugged a hypergraph search algorithm for MT with it.

The idea is simple. Take your parameters for the model, and transform them by a random linear transformation, such that the linear transformation will cancel out if you compute marginals or any other quantity using the dynamic programming algorithm. Marginals have to be positive. If following the linear transformation, you are getting any negative marginals, that’s a (one-way) clear indication that somewhere you have a bug. You are actually supposed to get the same inside/outside probabilities and marginals as you get without the linear transformation.

Here are more exact details. Consider the CKY algorithm for parsing with context-free grammars — in its real semiring version, i.e. when it is computing the inside probabilities.The deduction rules (Dyna style or otherwise) are simple to understand, and they look something like:

\(\mathrm{goal \vdash root\_weight(a) + span(a,0,N)}\)

\(\mathrm{span(a,i,j) \vdash rule\_weight(a \rightarrow b c) + span(b,i,k) + span(c,k,j)} \)

\(\mathrm{span(a,i,i+1) \vdash rule\_weight(a \rightarrow x) + word(x, i, i+1)} \)

Here, the indices in \( \mathrm{span} \) denote spaces between words. So for example, in the sentence “The blog had a new post”, if we wanted to denote an NP over “The blog”, that would be computed in \( \mathrm{span(“NP”, 0, 2)}\). N is the length of the sentence.

\( \mathrm{word(x,i,i+1)} \) are axioms that denote the words in the sentence. For each word in the sentence, such as a “blog”, you would seed the chart with the word elements in the chart (such as \(\mathrm{word(“blog”, 1, 2)}\)).

\( \mathrm{rule\_weight(a \rightarrow b c)}\) are the probabilities \(p(a \rightarrow b c | a) \) for the underlying context-free grammar.

Let \(n\) be the number of nonterminals in the grammar. In addition, denote by \(R\) an \(n \times n \times n\) array such that \(R_{abc} = p(a \rightarrow b c | a)\) (and \(0\) if the rule is not in the grammar). In addition, denote by \(\alpha(i,j)\) a vector of size \(1 \times n\) such that \(\alpha_a(i,j) = \mathrm{span(a,i,j)}\).

Then, now, note that for any \(a\), the above CKY algorithm (in declarative form) dictates that:

\(\alpha_a(i,j) = \sum_{k=i+1}^{j-1} \sum_{b,c} \alpha_b(i,k) \alpha_c(k,j) R_{abc}\)

for non-leaf spans \(i,j\).

One way to view this is as a generalization of matrix product — tensor product. \(R\) is actually a three dimensional tensor (\(n \times n \times n\)) and it can be viewed as a mapping that maps a pair of vectors, and through *contraction* returns another vector:

\(\alpha_a(i,j) = \sum_{k=i+1}^{j-1} R(2,3; \alpha(i,k), \alpha(k,j))\)

Here, \(R(x,y ; w,v)\) denotes a contraction operation on the \( x \) and \( y \) dimensions. Contraction here means that we sum out the two dimensions (\(2\) and \(3\)) while multiplying in the two vectors \(w \in \mathbb{R}^n \) and \(v \in \mathbb{R}^n \).

Now that we have this in a matrix form, we can note something interesting. Let’s say that we have some matrix \(G \in \mathbb{R}^{n \times n}\), which is invertible. Now, we define linearly transformed \(\alpha\) with overline, as (call the following “Formula 1”):

\(\overline{\alpha}_a(i,j) = \left( \sum_{k=i+1}^{j-1} R(2,3; \overline{\alpha}(i,k)\times G^{-1} ,  \overline{\alpha}(k,j)\times G^{-1} ) \right) \times G \).

It is easy to verify that \(\overline{\alpha}(i,j) \times G^{-1} = \alpha(i,j)\). The \(G\) acts as a “plug”: it linearly transforms each \(\alpha\) term. We transform it back with \(G^{-1}\) before feeding it to \(R\), and then transform the result of \(R\) with \(G\) again, to keep everything linearly transformed.

Now, let \(\beta\) be a vector of size \(1 \times n\) such that \(\beta_a = \mathrm{root\_weight(a)}\). If we define \(\overline{\beta} = \beta (G^{-1})^{\top}\), then, the total probability of a string using the CKY algorithm can be computed as:

\(\langle \alpha(0,N), \beta \rangle = \alpha(0,N) \beta^{\top} = \alpha(0,N) G G^{-1} \beta^{\top} = \langle \overline{\alpha}(0,N), \overline{\beta} \rangle\).

This means that if we add the multiplication by \(G\) (and its inverse) to all of our \(\alpha\) and \(\beta\) terms as above, the calculation of our total probability of string would not change, and therefore, since \(p(a \rightarrow b c | a)\) are all positive, computing the total probability with the linear transformations should also be positive, and not only that — it should be identical to the result as if we are not using \(G\)!

The next step is noting that the multiplication by \(G\) can be folded into our \(R\) tensor. If we define

\([\overline{R}]_{abc} = \sum_{a’,b’,c’} R_{abc} [G^{-1}]_{bb’} [G^{-1}]_{cc’} G_{a’a}\)

then Formula 1 can be replaced with:

\( \overline{\alpha}_a(i,j) = \left( \sum_{k=i+1}^{j-1} \overline{R}(2,3; \overline{\alpha}(i,k), \overline{\alpha}(k,j)) \right) \).

The last set of parameters we need to linearly transform by \(G\) is that of \( \mathrm{rule\_weight(a \rightarrow x)}\). It is not hard to guess how. First, denote by \( \gamma_x \in \mathbb{R}^n \) a vector such that \( [\gamma_x]_a = \mathrm{rule\_weight(a \rightarrow x_i)}\).

Note \(\alpha_a(i,i+1) = [\gamma_{x_i}]_a \) where \(x_i \) is the \(i\)th word in the sentence. We need to multiply gamma for each \( x \) by \(G\) on the left: \(\overline{\gamma}_x = \gamma_x G\). We do that for all \(x\) in the vocabulary, defining these linearly transformed gammas. This means now that we also make sure that for the leaf nodes it holds that \( \alpha(i,i+1) = \overline{\alpha}(i,i+1) G^{-1} \) — all linear transformations by \( G \) will cancel from top to bottom when using \( \overline{\beta} \) for root probabilities, \(\overline{R}\) for binary rule probabilities and \( \overline{\gamma}\) for preterminal rule probabilities (instead of \( \beta\), \( R \) and \( \gamma \)).

Perfect! By linearly transforming our parameters \(\mathrm{rule\_weight}\), \(\mathrm{root\_weight}\), we got parameters which are very sensitive to bugs in our dynamic programming algorithm. If we have the slightest mistake somewhere, it is very likely, if \(G\) is chosen well, that some of the total tree probabilities on a training set won’t be identical (or will even be negative) to the non-linearly transformed parameters, even though they should be.

You might think that this is a whole lot of hassle to go through for debugging. But really, linearly transforming the parameters is just about few lines of code. Here is what Avneesh used in Python:

G = np.random.random_sample((rank, rank))
G = G + G.transpose()
Ginv = np.linalg.inv(G)
for src_RHS in paramDict:
   if src_RHS == "Pi":
       paramDict[src_RHS] = np.absolute(paramDict[src_RHS]).dot(Ginv)
       for target_RHS in paramDict[src_RHS]:
          parameter = np.absolute(paramDict[src_RHS][target_RHS])
          arity = len(parameter.shape) - 1
          if arity == 0:
             paramDict[src_RHS][target_RHS] =
          elif arity == 1:
             paramDict[src_RHS][target_RHS] =
          elif arity == 2:
             result = np.tensordot(G, parameter, axes=[1,0])
             result = np.tensordot(Ginv, result, axes=[1,1]).swapaxes(0,1)
             result = np.tensordot(Ginv, result, axes=[1,2]).swapaxes(1,2)
         paramDict[src_RHS][target_RHS] = result

Voila! This code is added before the dynamic programming algorithm starts to execute. Then, the results of computing the inside probabilities with this linear transformation are compared to the results without the linear transformation — they have to be identical.

Lines 1-3 create an invertible G matrix. Note that we make sure it is symmetric, just to be on the safe side, in case we need to multiply something by \(G^{\top}\) instead of \( G \). This debugging method does not have to use a symmetric \( G\).

Next, we multiply all parameters in paramDict. The “Pi” one is the root probabilities, and we just multiply it by \(G^{-1}\). Later on “arity” defines whether the set of parameters is a matrix, i.e. arity = 1 (unary rules, not discussed above, but have very similar formulation), tensor, arity = 2 (binary rules) or vectors (preterminal rules; arity = 0).

If you want more details about this idea of linearly transforming the parameters, you can find them here for latent-variable PCFGs. The linear transformations in that paper are used for spectral learning of L-PCFGs — but as I mentioned, a by-product of that is this debugging trick. When using EM with dynamic programming algorithms, the usual method for making sure everything is working correctly, is checking that the likelihood increases after each iteration. The linear-transformation debugging method is more fine-grained and targeted towards making sure the dynamic programming algorithm is correct. In addition, I witnessed cases in which the likelihood of EM continuously improved, but there was still a bug with the dynamic programming algorithm.

By the way, here I described mostly the debugging method applied to the inside algorithm. The same linear transformations can be also used with the outside algorithm, and therefore to compute any marginal — the linear transformations should cancel in that case as well. It is also not too hard to imagine how to generalize this to the case where the dynamic programming algorithm is not CKY, but some other dynamic programming algorithm.