**You, my CKY algorithm,**

**dictate every parser’s rhythm,**

**if Cocke, Younger and Kasami hadn’t bothered,**

**all of our parsing dreams would have been shattered.**

**You are so simple, yet so powerful,**

**and with the proper semiring and time,**

**you will be truthful,**

**to return the best parse — anything less would be a crime.**

**With dynamic programming, memoization or tabularization,**

**you are one of a kind,**

**I really don’t need to mention,**

**if it weren’t for you, all syntax trees would be behind.**

**Failed attempts have been made to show there are better,**

**for example, by using matrix multiplication,**

**all of these impractical algorithms didn’t matter –**

**you came out stronger, insisting on just using summation.**

**All parsing algorithms to you hail,**

**at least those with backbones which are context-free,**

**you will never become stale,**

**as long as we need to have a syntax tree.**

**It doesn’t matter that the C is always in front,**

**or that the K and Y can swap,**

**you are still on the same hunt,**

**maximizing and summing, nonstop.**

**Every NLP student knows you intimately,**

**they have seen your variants dozens of times,**

**you have earned that respect legitimately,**

**and you will follow them through their primes.**

**CKY, going backward and forward,**

**inside and out,**

**it is so straightforward –**

**You are the best, there is no doubt.**

It is frustrating to see a paper about a core, pivotal area in NLP submitted to a computational linguistics conference come back with a comment such as “this paper is more appropriate for a machine learning conference.” More than frustrating, it is baffling. But it happens. In fact, it recently happened to me and my co-authors. I get the sense that no matter what, our paper would have done something to irritate this reviewer. The question is whether this reviewer tends to have a “reviewing bias” that is consistent across multiple reviews. I tend to think that such bias sometimes exists.

We all have biases as reviewers, I am pretty sure. There are reviewers who start the process with the intention to find all possible faults in a paper, and provide harsher feedback than others – they basically aim to kill. On the flip side, there are reviewers who are more lenient, and look for ways to get the paper accepted. In my experience as a reviewer for ACL conferences, I have seen multiple reviews of the same people, and I do get the impression that such prior bias exists. After all, we are all human, this is only natural. To exacerbate this problem, there is not any accepted objective measure by which to decide whether one paper is better than another, and should, as such, be given preference to be presented in the conference.

We have all been reviewers for quite a long time – as such, it is possible to control for bias. We can make the system more fair. Not as reviewers, but as those who make the decisions about reviewer assignment. This is a rather ad-hoc solution, somewhere in between completely abandoning the reviewing process altogether and continuing with the wild-west procedure that characterizes current conference reviewing, and that which we keep complaining about.

I suggest to do the following in order to improve the reviewing process. Based on history of reviews from previous conferences, we can simply create a prior distribution over scores for all reviewers who have reviewed papers multiple times. For each reviewer, we would have a profile based on these previous rounds of reviewing. Based on this, when we assign reviewers to papers, we control for the prior bias through this constructed reviewer profile. One way to control for this is to ensure that some “mean score” (such as the recommendation score) extracted from the profile of the reviewers assigned to a specific paper, is close to identical for all papers (if the mean score is a great indicator used by the area chairs or the program chairs).

There are more complex ways to create such a rubric (but this is a start). For example, from my experience, it is sufficient for one reviewer to give a relatively low score, in order for the paper to basically get rejected. So perhaps a better way to use the above reviewer profile (rather than aiming for the average score to be apriori similar across all papers) is to classify reviewers into “tend-to-be-positive” versus “tend-to-be-negative” and then use the same ratio of each group of reviewers for all papers. (That might be hard, given that the pool of reviewers is rather limited.) We can also create reviewer profiles with respect to a specific category of papers (where category is loosely defined as some function of the paper).

The way I see it, there could be two results to this experiment:

- If we succeed in controlling for reviewer bias, but the result is that more papers receive similar scores, or it is much harder to tell, as area chair, which papers to accept, then the conclusion is that the final scores shouldn’t matter that much, and there is a need to go deeper in the scrutiny of the papers as area chairs. One bad review, perhaps even a couple, shouldn’t get a paper rejected – perhaps, instead we need to extend our notion of borderline cases. (It could also mean that we are not properly eliminating bias using the prior distribution.)
- If we succeed in controlling for the average reviewing score over time, and the number of borderline cases stays similar, then I think we have a more balanced, fair, reviewing system. The solution is not full-proof, and there will still be great variance in the reviewing process. But I believe, overall, it will be a more just system. (For example, we are more likely to eliminate cases in which one paper gets two reviewers who tend to be negative in their reviews, while another paper gets none.)

Of course, this profile construction can be taken in other ways as well, for example, it can be performed on groups of reviewers, instead of single reviewers. One common belief is that junior graduate students tend to give lower scores in their reviews compared to more senior researchers. We can create profiles, or prior distributions, for groups based on this conjecture, such as graduate students versus senior researchers, and try to control bias with such profiles. This coarser version of bias control is likely to be more useful, since the number of reviewing datapoints required to create a personal profile for a single graduate student, especially when junior, is not sufficient.

(One thing that is a bit concerning, given all we have considered here, is that we might have to give up on the level of expertise to a specific topic assigned to reviewing a paper, unless we have a very large pool of reviewers. Otherwise, it would be hard to make sure that the prior scores are similar, across all papers, when only a small number of reviewers can be assigned to a specific paper.)

]]>

I recently gave a tutorial at CMU about spectral learning for NLP. This tutorial was based on a tutorial I had given last year with Michael Collins, Dean Foster, Karl Stratos and Lyle Ungar at NAACL.

One of the algorithms I explained there was the spectral learning algorithm for HMMs by Hsu, Kakade and Zhang (2009). This algorithm estimates parameters of HMMs in the “unsupervised setting” — only from sequences of observations. (Just like the Baum-Welch algorithm — expectation-maximization for HMMs — does.)

I want to repeat this explanation here, and give some intuition about this algorithm, since it seems to confuse people quite a lot. At a first glance, it looks quite mysterious why the algorithm works, though its implementation is very simple. It is one of the earlier algorithms in this area of latent-variable learning using the method of moments and spectral methods, and promoted the creation of other algorithms for latent-variable learning.

So here are the main ideas behind it, with some intuition. In my explanation of the algorithm, I am going to forget about the “spectral” part. No singular value decomposition will be involved, or any type of spectral decomposition. Just plain algebraic and matrix multiplication tricks that require understanding what marginal probabilities are and how matrix multiplication and inversion work, and nothing more. Pedagogically, I think that’s the right thing to do, since introducing the SVD step complicates the understanding of the algorithm.

Consider a hidden Markov model. The parameters are represented in matrix form \( T \), \( O \) and \( \pi \). We assume \( m \) latent states, \( n \) observations. More specifically, \( T \) is an \( m \times m \) stochastic matrix where \( m \) is the number of latent states, such that \( T_{hh’} \) is the probability of transitioning to state \( h \) from state \( h’ \). \( O \) is an \( n \times m \) matrix such that \( O_{xh} \) is the probability of emitting symbol \( x \) — an observation — from latent state \( h \). \( \pi \) is an \( m \) length vector with \( \pi_h\) being the initial probability for state \( h \).

To completely get rid of the SVD step, and simplify things, we will have to make the assumption that \(m = n\). This means that the number of states equals the number of observations. Not a very useful HMM, perhaps, but it definitely makes the derivation more clear. The fact that \( m=n\) means that \( O \) is now a square matrix — and we will assume it is invertible. We will also assume that \(T \) is invertible, and that \( \pi \) is positive in all coordinates.

If we look at the joint distribution of \(p(X_1 = x_1,X_2 = x_2)\), the first two observations in the HMM, then it can written as:

\( p(X_1 = x_1, X_2 = x_2) = \sum_{h_1,h_2} p(X_1 = x_1, H_1 = h_1, X_2 = x_2, H_2 = h_2) = \sum_{h_1,h_2} \pi_{h_1} O_{x_1,h_1} T_{h_2,h_1} O_{x_2,h_2} \)

Nothing special here, just marginal probability summing out the first two latent states.

It is not hard to see that this can be rewritten in matrix form, i.e. if we define \( [P_{2,1}]_{x_2,x_1} = p(X_1 = x_1, X_2= x_2) \) then:

\( P_{2,1} = O T \mathrm{diag}(\pi)O^{\top} \)

where \( \mathrm{diag}(\pi) \) is just an \( m \times m \) diagonal matrix with \( \pi_h \) on the diagonal.

Just write down this matrix multiplication step-by-step explicitly, multiplying, say, from right to left, and you will be able to verify this identity for \( P_{2,1} \). Essentially, the matrix product, which involves dot-product between rows and vectors of two matrices, eliminates and sums out the latent states (and does other things, like multiplying in the starting probabilities).

Alright. So far, so good.

Now, what about the joint distribution of three observations?

\( p(X_1 = x_1, X_2 = x, X_3=x_3) = \sum_{h_1,h_2,h_3} p(X_1 = x_1, H_1 = h_1, X_2 = x_2, H_2 = h_2, X_3=x_3, H_3 = h_3) = \sum_{h_1,h_2,h_3} \pi_{h_1} O_{x_1,h_1} T_{h_2,h_1} O_{x_2,h_2} T_{h_3,h_2} O_{x_3,h_3} \)

Does this have a matrix form too? Yes, not surprisingly. If we fix \( x \), the second observation, and define \( [P_{3,x,1}]_{x_3,x_1} = p(X_1 = x_1, X_2 = x, X_3 = x_3) \), (i.e. \( P_{3,x,1} \) is an \( m \times m \) matrix defined for each observation symbol \( x \)), then

\( P_{3,x,1} = OT \mathrm{diag}(O_x) T \mathrm{diag}(\pi) O^{\top} \).

Here, \( \mathrm{diag}(O_x) \) is a diagonal matrix where the on the diagonal we have the \(x\)th row of \( O \).

Now define \( B_x = P_{3,x,1}P_{2,1}^{-1} \) (this is well-defined because \( P_{2,1} \) is invertible — all the conditions we had on the HMM parameters make sure that it is true), then:

\( B_x = OT \mathrm{diag}(O_x) T \mathrm{diag}(\pi) O^{\top} \times (O T\mathrm{diag}(\pi)O^{\top})^{-1} = OT\mathrm{diag}(O_x)O^{-1} \)

(just recall that \( (AB)^{-1} = B^{-1} A^{-1} \) whenever both sides are defined and \( A \) and \( B \) are square matrices.)

This part of getting \( B_x \) (and I will explain in a minute why we need it) is the hardest part in our derivation so far. We can also verify that \( p(X_1 = x_1) \) equals \( O\pi \). Let’s call \( b_1 \) a vector such that \([b_1]_x = p(X_1=x_1)\) — i.e. \( b_1 \) is exactly the vector \( P_1 \).

We can also rewrite \( P_1 \) the following way:

\( P_1^{\top} = 1^{\top} T \mathrm{diag}(\pi) O^{\top} = 1^{\top} O^{-1} \underbrace{O T \mathrm{diag}(\pi) O^{\top}}_{P_{2,1}} \)

where \( 1^{\top} \) is an \( 1 \times m \) vector with the value 1 in all coordinates. The first equality is the “surprising” one — we use \( T \) to calculate the distribution of \( p(X_1 = x_1) \) — but if you write down this matrix multiplication explicitly, you will discover that we will be summing over the elements of \( T \) in such a way that it does not play a role in the sum — that’s because each row of \( T \) sums to 1. (As Hsu et al. put it in their paper: this is an unusual but easily verified form to write \( P_1 \).)

The above leads to the identity \( P_1^{\top} = 1^{\top} O^{-1} P_{2,1} \).

Now, it can be easily verified from the above form of \( P_1 \) that for \( b_{\infty}^{\top} \) defined as \( (P^{\top}_{2,1})^{-1} P_1 \), an \(m\) length vector, then:

\( b_{\infty}^{\top} = 1^{\top} O^{-1} \).

So what do we have so far? We managed to define the following matrices and vectors based only on the joint distribution of the first three symbols in the HMM:

\( B_x = P_{3,x,1}P_{2,1}^{-1} = OT\mathrm{diag}(O_x)O^{-1}, \)

\( b_1 = P_1 = O\pi, \)

\( b_{\infty}^{\top} = (P^{\top}_{2,1})^{-1} P_1 = 1^{\top} O^{-1}. \)

The matrix \( B_x \in \mathbb{R}^{m \times m} \) and vectors \( b_{\infty} \in \mathbb{R}^m \) and \( b_1 \in \mathbb{R}^m \) will now play the role of our HMM parameters. How do we use them as our parameters?

Say we just observe a single symbol in our data, i.e. the length of the sequence is 1, and that symbol is \(x\). Let’s multiply \( b^{\top}_{\infty} B_x b_1 \).

According to the above equalities, it is true that this equals:

\( b^{\top}_{\infty} B_x b_1 = (1^{\top} O^{-1}) (O T \mathrm{diag}(O_x) O^{-1}) (O \pi) = 1^{\top} T \mathrm{diag}(O_x) \pi \).

Note that this quantity is a scalar. We are multiplying a matrix by a vector from left and right. Undo this matrix multiplication, and write it the way we like in terms of sums over the latent states, and what do we get? The above just equals:

\( b^{\top}_{\infty} B_x b_1 = \sum_{h_1,h_2} T_{h_2,h_1} O_{x,h_1} \pi_{h_1} = \sum_{h_1,h_2} p(H_1) p(X_1 = x | H_1 = h_1) p(H_2 = h_2 | H_1 = h_1) = p(X_1 = x_1) \).

So, this triplet-product gave us back the distribution over the first observation. That’s not very interesting, we could have done it just by using \( b_1 \) directly. But… let’s go on and compute:

\( b^{\top}_{\infty} B_{x_2} B_{x_1} b_1. \)

This can be easily verified to equal \( p(X_1 = x_1, X_2 = x_2) \).

The interesting part is that in the general case,

\( b^{\top}_{\infty} B_{x_n} B_{x_{n-1}}…B_{x_1} b_1 = p(X_1 = x_1, \ldots, X_n = x_n) \) –

we can now calculate the probability of any observation sequence in the HMM only by knowing the distribution over the first three observations! (To convince yourself about the general case above, just look at Lemma 1 in the Hsu et al. paper.)

In order to turn this into an estimation algorithm, we just need to estimate from data \( P_{2,1} \) and \( P_{3,x,1} \) for each observation symbol (all observed, just “count and normalize”), and voila, you can estimate the probability of any sequence of observations (one of the basic problems with HMMs according to this old classic paper, for example).

But… We made a heavy assumption. We assumed that \( n = m \) — we have as many observation symbols as latent states. What do we do if that’s not true? (i.e. if \( m < n \))? That’s where the “spectral” part kicks in. Basically, what we need to do is to reduce our \( O \) matrix into an \( m \times m \) matrix using some \( U \) matrix, while ensuring that \( U^{\top}O \) is invertible (just like we assumed \( O \) was invertible before). Note that \( U \) needs to be \( n \times m \).

It turns out that a \( U \) that will be optimal in some sense, and will also make all of the above algebraic tricks work is the left singular value matrix of \( P_{2,1} \). Understanding why this is the case requires some basic knowledge of linear algebra — read the paper to understand this!

]]>Dynamic programming algorithms are the bread and butter for structured prediction in NLP.

They are also quite a pain to debug, especially when implemented directly in your language of choice, and not in some high-level programming language for dynamic programming algorithms, such as Dyna.

In this post, I suggest a way, or more of a “debugging trick” to make sure that your dynamic programming algorithm is implemented correctly. This trick is a by-product of working with spectral algorithms, where parameters are masked by an unknown linear transformation. Recently, Avneesh, Chris and I have successfully debugged a hypergraph search algorithm for MT with it.

The idea is simple. Take your parameters for the model, and transform them by a random linear transformation, such that the linear transformation will cancel out if you compute marginals or any other quantity using the dynamic programming algorithm. Marginals have to be positive. If following the linear transformation, you are getting any negative marginals, that’s a (one-way) clear indication that somewhere you have a bug. You are actually supposed to get the same inside/outside probabilities and marginals as you get without the linear transformation.

Here are more exact details. Consider the CKY algorithm for parsing with context-free grammars — in its real semiring version, i.e. when it is computing the inside probabilities.The deduction rules (Dyna style or otherwise) are simple to understand, and they look something like:

\(\mathrm{goal \vdash root\_weight(a) + span(a,0,N)}\)

\(\mathrm{span(a,i,j) \vdash rule\_weight(a \rightarrow b c) + span(b,i,k) + span(c,k,j)} \)

\(\mathrm{span(a,i,i+1) \vdash rule\_weight(a \rightarrow x) + word(x, i, i+1)} \)

Here, the indices in \( \mathrm{span} \) denote spaces between words. So for example, in the sentence “The blog had a new post”, if we wanted to denote an NP over “The blog”, that would be computed in \( \mathrm{span(“NP”, 0, 2)}\). N is the length of the sentence.

\( \mathrm{word(x,i,i+1)} \) are axioms that denote the words in the sentence. For each word in the sentence, such as a “blog”, you would seed the chart with the word elements in the chart (such as \(\mathrm{word(“blog”, 1, 2)}\)).

\( \mathrm{rule\_weight(a \rightarrow b c)}\) are the probabilities \(p(a \rightarrow b c | a) \) for the underlying context-free grammar.

Let \(n\) be the number of nonterminals in the grammar. In addition, denote by \(R\) an \(n \times n \times n\) array such that \(R_{abc} = p(a \rightarrow b c | a)\) (and \(0\) if the rule is not in the grammar). In addition, denote by \(\alpha(i,j)\) a vector of size \(1 \times n\) such that \(\alpha_a(i,j) = \mathrm{span(a,i,j)}\).

Then, now, note that for any \(a\), the above CKY algorithm (in declarative form) dictates that:

\(\alpha_a(i,j) = \sum_{k=i+1}^{j-1} \sum_{b,c} \alpha_b(i,k) \alpha_c(k,j) R_{abc}\)

for non-leaf spans \(i,j\).

One way to view this is as a generalization of matrix product — tensor product. \(R\) is actually a three dimensional tensor (\(n \times n \times n\)) and it can be viewed as a mapping that maps a pair of vectors, and through *contraction* returns another vector:

\(\alpha_a(i,j) = \sum_{k=i+1}^{j-1} R(2,3; \alpha(i,k), \alpha(k,j))\)

Here, \(R(x,y ; w,v)\) denotes a contraction operation on the \( x \) and \( y \) dimensions. Contraction here means that we sum out the two dimensions (\(2\) and \(3\)) while multiplying in the two vectors \(w \in \mathbb{R}^n \) and \(v \in \mathbb{R}^n \).

Now that we have this in a matrix form, we can note something interesting. Let’s say that we have some matrix \(G \in \mathbb{R}^{n \times n}\), which is invertible. Now, we define linearly transformed \(\alpha\) with overline, as (call the following “Formula 1”):

\(\overline{\alpha}_a(i,j) = \left( \sum_{k=i+1}^{j-1} R(2,3; \overline{\alpha}(i,k)\times G^{-1} , \overline{\alpha}(k,j)\times G^{-1} ) \right) \times G \).

It is easy to verify that \(\overline{\alpha}(i,j) \times G^{-1} = \alpha(i,j)\). The \(G\) acts as a “plug”: it linearly transforms each \(\alpha\) term. We transform it back with \(G^{-1}\) before feeding it to \(R\), and then transform the result of \(R\) with \(G\) again, to keep everything linearly transformed.

Now, let \(\beta\) be a vector of size \(1 \times n\) such that \(\beta_a = \mathrm{root\_weight(a)}\). If we define \(\overline{\beta} = \beta (G^{-1})^{\top}\), then, the total probability of a string using the CKY algorithm can be computed as:

\(\langle \alpha(0,N), \beta \rangle = \alpha(0,N) \beta^{\top} = \alpha(0,N) G G^{-1} \beta^{\top} = \langle \overline{\alpha}(0,N), \overline{\beta} \rangle\).

This means that if we add the multiplication by \(G\) (and its inverse) to all of our \(\alpha\) and \(\beta\) terms as above, the calculation of our total probability of string would not change, and therefore, since \(p(a \rightarrow b c | a)\) are all positive, computing the total probability with the linear transformations should also be positive, and not only that — it should be identical to the result as if we are not using \(G\)!

The next step is noting that the multiplication by \(G\) can be folded into our \(R\) tensor. If we define

\([\overline{R}]_{abc} = \sum_{a’,b’,c’} R_{abc} [G^{-1}]_{bb’} [G^{-1}]_{cc’} G_{a’a}\)

then Formula 1 can be replaced with:

\( \overline{\alpha}_a(i,j) = \left( \sum_{k=i+1}^{j-1} \overline{R}(2,3; \overline{\alpha}(i,k), \overline{\alpha}(k,j)) \right) \).

The last set of parameters we need to linearly transform by \(G\) is that of \( \mathrm{rule\_weight(a \rightarrow x)}\). It is not hard to guess how. First, denote by \( \gamma_x \in \mathbb{R}^n \) a vector such that \( [\gamma_x]_a = \mathrm{rule\_weight(a \rightarrow x_i)}\).

Note \(\alpha_a(i,i+1) = [\gamma_{x_i}]_a \) where \(x_i \) is the \(i\)th word in the sentence. We need to multiply gamma for each \( x \) by \(G\) on the left: \(\overline{\gamma}_x = \gamma_x G\). We do that for all \(x\) in the vocabulary, defining these linearly transformed gammas. This means now that we also make sure that for the leaf nodes it holds that \( \alpha(i,i+1) = \overline{\alpha}(i,i+1) G^{-1} \) — all linear transformations by \( G \) will cancel from top to bottom when using \( \overline{\beta} \) for root probabilities, \(\overline{R}\) for binary rule probabilities and \( \overline{\gamma}\) for preterminal rule probabilities (instead of \( \beta\), \( R \) and \( \gamma \)).

Perfect! By linearly transforming our parameters \(\mathrm{rule\_weight}\), \(\mathrm{root\_weight}\), we got parameters which are very sensitive to bugs in our dynamic programming algorithm. If we have the slightest mistake somewhere, it is very likely, if \(G\) is chosen well, that some of the total tree probabilities on a training set won’t be identical (or will even be negative) to the non-linearly transformed parameters, even though they should be.

You might think that this is a whole lot of hassle to go through for debugging. But really, linearly transforming the parameters is just about few lines of code. Here is what Avneesh used in Python:

G = np.random.random_sample((rank, rank)) G = G + G.transpose() Ginv = np.linalg.inv(G) for src_RHS in paramDict: if src_RHS == "Pi": paramDict[src_RHS] = np.absolute(paramDict[src_RHS]).dot(Ginv) else: for target_RHS in paramDict[src_RHS]: parameter = np.absolute(paramDict[src_RHS][target_RHS]) arity = len(parameter.shape) - 1 if arity == 0: paramDict[src_RHS][target_RHS] = G.dot(parameter) elif arity == 1: paramDict[src_RHS][target_RHS] = G.dot(parameter).dot(Ginv) elif arity == 2: result = np.tensordot(G, parameter, axes=[1,0]) result = np.tensordot(Ginv, result, axes=[1,1]).swapaxes(0,1) result = np.tensordot(Ginv, result, axes=[1,2]).swapaxes(1,2) paramDict[src_RHS][target_RHS] = result

Voila! This code is added before the dynamic programming algorithm starts to execute. Then, the results of computing the inside probabilities with this linear transformation are compared to the results without the linear transformation — they have to be identical.

Lines 1-3 create an invertible G matrix. Note that we make sure it is symmetric, just to be on the safe side, in case we need to multiply something by \(G^{\top}\) instead of \( G \). This debugging method does not have to use a symmetric \( G\).

Next, we multiply all parameters in paramDict. The “Pi” one is the root probabilities, and we just multiply it by \(G^{-1}\). Later on “arity” defines whether the set of parameters is a matrix, i.e. arity = 1 (unary rules, not discussed above, but have very similar formulation), tensor, arity = 2 (binary rules) or vectors (preterminal rules; arity = 0).

If you want more details about this idea of linearly transforming the parameters, you can find them here for latent-variable PCFGs. The linear transformations in that paper are used for spectral learning of L-PCFGs — but as I mentioned, a by-product of that is this debugging trick. When using EM with dynamic programming algorithms, the usual method for making sure everything is working correctly, is checking that the likelihood increases after each iteration. The linear-transformation debugging method is more fine-grained and targeted towards making sure the dynamic programming algorithm is correct. In addition, I witnessed cases in which the likelihood of EM continuously improved, but there was still a bug with the dynamic programming algorithm.

By the way, here I described mostly the debugging method applied to the inside algorithm. The same linear transformations can be also used with the outside algorithm, and therefore to compute any marginal — the linear transformations should cancel in that case as well. It is also not too hard to imagine how to generalize this to the case where the dynamic programming algorithm is not CKY, but some other dynamic programming algorithm.

]]>A CFG includes a set of nonterminals (\(N\)) with a designed start symbol \(\mathrm{S} \in N\), a set of terminal words (the “words”) and a set of context-free rules such as \(\mathrm{S \rightarrow NP\, VP}\) or \(\mathrm{NP \rightarrow DET\, NN}\), or \(\mathrm{NN \rightarrow\, dog}\). The left handside of a rule is a nonterminal, and the right handside is a string over the union of the nonterminals and the vocabulary.

A PCFG simply associates each rule with a weight, such that the sum of all weights for all rules with the same nonterminal on the left hand-side is 1. The weights, naturally, need to be positive.

Now, a PCFG defines a generative process for generating a phrase-structure tree which is followed by starting with the starting symbol \(\mathrm{S}\), and then repeatedly expanding the partially-rewritten tree until the yield of the tree contains only terminals.

Here is an example of a phrase-structure tree, which could be generated using a PCFG:

Each node (or its symbol, to be more precise) in this tree and its immediate children correspond to a context-free rule. The tree could have been generated probabilistically by starting with the root symbol, and then recursively expanding each node, each time probabilistically selecting a rule from the list of rules with the left handside being the symbol of the node being expanded. I am being succinct here on purpose, since I am assuming most of you know PCFGs to some extent. If not, then just searching for “probabilistic context-free grammars” will yield many results which are useful to learn the basics about PCFGs. (Here is a place to start, taken from Michael Collins’ notes.)

What is the end result, and what is the probability distribution that a PCFG induces? This is where we have to be careful. First, a PCFG defines a measure over phrase-structure trees. The measure of a tree is just the product of all rules that appear in that tree \(t\), so that:

$$\mu(t) = \prod_{r \in t} p_r.$$

Here, \(r \in t\) denotes that the rule \( r\) appears in \(t \) somewhere. \( t \) is treated as a list of rules (not a set! a rule could appear several times in \( t \), so we slightly abuse the \( \in \) notation). \( p_r \) is the rule probability associated with rule \( r \).

You might be wondering why I am using the word “measure,” instead of just saying that the above equation defines a distribution over trees. After all, aren’t we used to assuming that the probability of a tree is just the product of all rule probabilities that appear in the tree?

And that’s where the catch is. In order for the product of all rule probabilities to be justifiably called “a probability distribution”, it has to sum to 1 over all possible trees. But that’s not necessarily the case for any assignment of rule probabilities, even if they sum to 1 and are positive. The sum of all measures of finite trees could sum to *less* than 1, simply because some probability “leaks” to infinite trees.

Take for example the simple grammar with rules \(\mathrm{S \rightarrow S\, S}\) and \( \mathrm{S \rightarrow a} \). If the rule probability \(\mathrm{S \rightarrow S\, S}\) is larger than 0.5, then if we start generating a tree using this PCFG, there is a non-zero probability that we will never stop generating that tree! The rule \(\mathrm{S \rightarrow S\, S}\) has such large probability, that the tree we generate in a step-by-step derivation through the PCFG may potentially grow too fast and the derivation will never terminate.

Fortunately, we know that in a frequentist setting, if we estimate a PCFG using frequency count (with trees being observed) or EM (when trees are not being observed, but only strings are being observed), the resulting measure we get is actually a probability distribution. This means that this kind of estimation always leads to a *tight* PCFG. That’s the term used to denote a PCFG for which the sum of measures over finite trees is 1.

So where is the issue? The issue starts when we do not follow a vanilla maximum likelihood estimation. If we reguarlize, or if we smooth the estimated parameters or do something similar to that, we may end up with a non-tight PCFG.

This is especially true in the Bayesian setting, where typically we put a Dirichlet prior on each of the set of probabilities associated with a single non-terminal on the left handside. The support of the Dirichlet prior is all possible probability distributions over the \(K\) rules for nonterminal \(A\), and therefore, it assigns a non-zero probability to non-tight grammars.

What can we do to handle this issue? This is where we identified the following three alternatives (the paper can be found here).

- Just ignore the problem. That means that you let the Bayesian prior, say a factorized Dirichlet, have non-zero probability on non-tight grammars. This is what people have done until now, perhaps without realizing that non-tight grammars are considered as well.
- Re-normalize the prior. Here, we take any prior that we start with (such as a Dirichlet), remove all the non-tight grammars, and re-normalize to get back a probability distribution over all possible tight grammars (and only the tight grammars).
- Re-normalize the likelihood. It has been shown that any non-tight PCFG can be converted to a tight PCFG which induces a distribution over finite trees which is identical to the distribution that we would get from the non-tight PCFG, after re-normalizing the non-tight PCFG to become a distribution over the finite trees by dividing the probability of all finite trees by the total probability mass that the finite trees get (Chi, 1999). So, one can basically map all non-tight distributions in the prior to tight PCFG distributions according to Chi’s procedure.

Notice the difference between 2 and 3, in where we “do the re-normalization.” In 2, we completely ignore all non-tight grammars, and assign them probability 0. In 3, we map any non-tight point to a tight point according to the procedure described in Chi (1999).

The three alternatives 1, 2 and 3 are not mathematically equivalent, as is shown in this Mathematica log. But we do believe they are equivalent in the following sense: any prior which is defined using one of the approaches can be transformed into a different prior that can be used with one of the other approaches and both would yield the same posterior over trees, conditioned on a string and marginalizing out the parameters. That’s an open problem, so you are more than welcome sending us ideas!

But the real question is, why would we want to handle non-tight grammars? Do we really care? This might be more of a matter of aesthetics, as we show in the paper, empirically speaking. Still, it is worth pointing out this issue, especially since people have worked hard to show what happens with non-tightness in the non-Bayesian setting. In the Bayesian setting, this problem has been completely swept under the rug until now!

Notes:

- 1. For those of you who are less knowledgable about Bayesian statisitics, here is an excellent blog post about it from Bob Carpenter.
- Non-tight PCFGs are also referred to as “inconsistent PCFGs.”