
Eliminating Prior Bias: Reviewing as a Controlled Experiment

A lot of ink has been spilled about the faults in the conference reviewing process; some even suggest eliminating it altogether. I’m sure many of us have experienced the presence of the “third reviewer” in our reviews (the one that kills the paper). Organizers of the latest NIPS conference recently ran an experiment in which a portion of the submissions was reviewed by two independent sets of reviewers, and virtual accept/reject decisions were made separately based on each set of reviews for the same paper. Their conclusion confirmed what was already known: there is a lot of randomness in the reviewing process. This experiment generated a lot of discussion at NIPS.

It is frustrating to see a paper about a core, pivotal area in NLP, submitted to a computational linguistics conference, come back with a comment such as “this paper is more appropriate for a machine learning conference.” More than frustrating, it is baffling. But it happens. In fact, it recently happened to me and my co-authors. I get the sense that, no matter what we had written, something in our paper would have irritated this reviewer. The question is whether this reviewer has a “reviewing bias” that is consistent across multiple reviews. I tend to think that such bias sometimes exists.

We all have biases as reviewers, I am pretty sure. There are reviewers who start the process with the intention of finding every possible fault in a paper and provide harsher feedback than others; they basically aim to kill. On the flip side, there are reviewers who are more lenient and look for ways to get the paper accepted. In my experience as a reviewer for ACL conferences, I have seen multiple reviews by the same people, and I do get the impression that such prior bias exists. After all, we are all human; this is only natural. Exacerbating the problem, there is no accepted objective measure by which to decide whether one paper is better than another and should, as such, be given preference for presentation at the conference.

Many of us have been reviewers for quite a long time, so a history of our reviewing scores already exists; as such, it is possible to control for bias and make the system fairer. Not as reviewers, but as those who make the decisions about reviewer assignment. This is a rather ad-hoc solution, somewhere in between abandoning the reviewing process altogether and continuing with the wild-west procedure that characterizes current conference reviewing and that we keep complaining about.

I suggest doing the following to improve the reviewing process. Based on the history of reviews from previous conferences, we can simply create a prior distribution over scores for every reviewer who has reviewed papers multiple times. For each reviewer, we would have a profile based on these previous rounds of reviewing. Then, when we assign reviewers to papers, we control for prior bias through this constructed reviewer profile. One way to do this is to ensure that some “mean score” (such as the average recommendation score) extracted from the profiles of the reviewers assigned to a given paper is close to identical for all papers (assuming the mean score is a key indicator used by the area chairs or the program chairs).
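To make this concrete, here is a minimal sketch in Python of what building such profiles and balancing the assignment could look like. Everything in it is an assumption made for illustration: the toy review history, the 1-to-5 recommendation scale, and the snake-order heuristic for dealing reviewers across papers are my own inventions, not part of any existing conference system.

```python
from collections import defaultdict
from statistics import mean

# Toy review history from earlier conferences: (reviewer, recommendation score)
# pairs on a hypothetical 1-5 scale. All names and numbers are made up.
history = [
    ("rev_a", 2), ("rev_a", 3), ("rev_a", 2),
    ("rev_b", 4), ("rev_b", 5),
    ("rev_c", 3), ("rev_c", 3),
    ("rev_d", 2), ("rev_d", 1),
    ("rev_e", 4), ("rev_e", 4),
    ("rev_f", 3), ("rev_f", 4),
]

# Each reviewer's "profile" is, in this sketch, simply the mean of their past scores.
past = defaultdict(list)
for reviewer, score in history:
    past[reviewer].append(score)
prior = {r: mean(s) for r, s in past.items()}

def snake_assign(papers, prior, per_paper=2):
    """Balance per-paper prior means by dealing reviewers, sorted from
    harshest to most lenient, across papers in a snake (boustrophedon)
    order, so each paper gets a mix of strict and lenient reviewers.
    Assumes the pool has at least per_paper * len(papers) reviewers."""
    ranked = sorted(prior, key=prior.get)   # harshest first
    assignment = {p: [] for p in papers}
    for round_idx in range(per_paper):
        batch = ranked[round_idx * len(papers):(round_idx + 1) * len(papers)]
        if round_idx % 2 == 1:              # reverse every other round
            batch = list(reversed(batch))
        for paper, reviewer in zip(papers, batch):
            assignment[paper].append(reviewer)
    return assignment

papers = ["paper_1", "paper_2", "paper_3"]
assignment = snake_assign(papers, prior)
for paper, reviewers in assignment.items():
    print(paper, reviewers, "prior mean =", round(mean(prior[r] for r in reviewers), 2))
```

The snake ordering is just one cheap way to make the per-paper prior means similar; a real implementation would also have to respect topic fit, conflicts of interest, and reviewer load limits.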

There are more complex ways to create such a rubric (but this is a start). For example, in my experience, it is often sufficient for one reviewer to give a relatively low score for the paper to be effectively rejected. So perhaps a better way to use the above reviewer profile (rather than aiming for the average score to be a priori similar across all papers) is to classify reviewers into “tend-to-be-positive” versus “tend-to-be-negative” and then use the same ratio of each group of reviewers for all papers. (That might be hard, given that the pool of reviewers is rather limited.) We can also create reviewer profiles with respect to a specific category of papers (where a category is loosely defined as some function of the paper).
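A sketch of this coarser, two-group variant follows, again under invented assumptions: the priors, the threshold at the pool mean, and the tolerance for flagging a paper are all illustrative choices, not a proposal for specific values.

```python
from statistics import mean

# Toy priors (mean past scores) for a small reviewer pool; numbers are made up.
prior = {"rev_a": 2.3, "rev_b": 4.5, "rev_c": 3.0,
         "rev_d": 1.5, "rev_e": 4.0, "rev_f": 3.5}

global_mean = mean(prior.values())

def label(reviewer):
    # Coarse two-way profile: "lenient" above the pool mean, "harsh" otherwise.
    # The threshold is arbitrary; quantiles or per-track baselines would also work.
    return "lenient" if prior[reviewer] > global_mean else "harsh"

def lenient_ratio(reviewers):
    return sum(label(r) == "lenient" for r in reviewers) / len(reviewers)

# A hypothetical assignment, checked for an unbalanced lenient/harsh mix.
assignment = {"paper_1": ["rev_d", "rev_b"],
              "paper_2": ["rev_a", "rev_c"],
              "paper_3": ["rev_e", "rev_f"]}

pool_ratio = lenient_ratio(list(prior))
for paper, reviewers in assignment.items():
    if abs(lenient_ratio(reviewers) - pool_ratio) > 0.25:  # arbitrary tolerance
        print(paper, "has an unbalanced mix:", [label(r) for r in reviewers])
```

In this toy example, paper_2 (two harsh reviewers) and paper_3 (two lenient reviewers) would be flagged for rebalancing, which is exactly the situation the ratio constraint is meant to prevent.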

The way I see it, there are two possible outcomes of this experiment:

  • If we succeed in controlling for reviewer bias, but the result is that more papers receive similar scores, or it becomes much harder for area chairs to tell which papers to accept, then the conclusion is that the final scores should not matter that much, and area chairs need to scrutinize the papers more deeply. One bad review, perhaps even a couple, should not get a paper rejected; perhaps, instead, we need to broaden our notion of borderline cases. (It could also mean that we are not properly eliminating bias using the prior distribution.)
  • If we succeed in controlling for the average reviewing score over time, and the number of borderline cases stays similar, then I think we have a more balanced and fair reviewing system. The solution is not foolproof, and there will still be great variance in the reviewing process. But I believe that, overall, it will be a more just system. (For example, we are more likely to eliminate cases in which one paper gets two reviewers who tend to be negative in their reviews, while another paper gets none.)

Of course, this profile construction can be taken in other directions as well; for example, it can be performed on groups of reviewers instead of single reviewers. One common belief is that junior graduate students tend to give lower scores in their reviews than more senior researchers do. We can create profiles, or prior distributions, for groups based on this conjecture, such as graduate students versus senior researchers, and try to control bias with such profiles. This coarser version of bias control is likely to be more useful, since a single reviewer, especially a junior graduate student, often has not accumulated enough reviews to build a reliable personal profile.
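A small sketch of this back-off idea, with invented seniority labels, scores, and cutoff: use a reviewer’s personal mean when enough past reviews exist, and fall back to the group-level mean otherwise.

```python
from statistics import mean

# Toy data: each reviewer has a seniority group and a (possibly short)
# list of past scores; all names, groups, and numbers are illustrative.
reviewers = {
    "rev_a": ("grad_student", [2]),            # too few reviews for a personal profile
    "rev_b": ("senior",       [4, 5, 4, 4]),
    "rev_c": ("grad_student", [3, 2, 2]),
    "rev_d": ("senior",       [3, 4]),
}

MIN_REVIEWS = 3  # arbitrary cutoff below which we back off to the group prior

# Group-level priors, pooling all scores from reviewers in the same group.
group_scores = {}
for group, scores in reviewers.values():
    group_scores.setdefault(group, []).extend(scores)
group_prior = {g: mean(s) for g, s in group_scores.items()}

def prior_for(reviewer):
    """Use the personal mean when there is enough history; otherwise
    fall back to the seniority-group mean."""
    group, scores = reviewers[reviewer]
    return mean(scores) if len(scores) >= MIN_REVIEWS else group_prior[group]

for r in reviewers:
    print(r, round(prior_for(r), 2))
```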

(One thing that is a bit concerning, given all we have considered here, is that we might have to give up some reviewer expertise in the specific topic of a paper, unless we have a very large pool of reviewers. Otherwise, it would be hard to make sure that the prior scores are similar across all papers when only a small number of reviewers can be assigned to any specific paper.)