Monday, March 28, 2016

A federation of reviewing communities? Area-wise analysis of the amount of discussion on IJCAI papers...

I always thought that one of the defining characteristics of AI conferences is the significant amount of inter-reviewer discussion on each paper.

In planning, for example, it is not unheard of to have discussions that are as long as the paper itself (yes, we are thinking of you, **@trik**!).

Having also handled many AI&Web papers over the years, I did have a hunch that the amount of discussion is not the same across areas.

Now that we have access to the reviews for all 2300 IJCAI papers, we decided to see how the various areas stack up.

We counted the number of words in all the discussion comments for each paper, and then averaged them across each area.  Here is what we got:

So, papers in Planning & Scheduling, Heuristic Search, KR, Constraints, and MAS areas get significantly more discussion, compared to Machine Learning, NLP and AI&Web.
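For anyone who wants to replicate the computation on their own review data: the statistic is just a per-area average of per-paper discussion word counts. A minimal Python sketch, with made-up example records (the actual IJCAI-16 review data is, of course, not public):

```python
from collections import defaultdict

# Made-up (area, discussion text) records, one per paper -- purely illustrative.
papers = [
    ("Planning & Scheduling", "I disagree with the novelty claim"),
    ("Planning & Scheduling", "The proofs in Section 4 need another pass"),
    ("Machine Learning", "Accept"),
    ("Machine Learning", ""),
]

totals = defaultdict(lambda: [0, 0])  # area -> [total discussion words, paper count]
for area, discussion in papers:
    totals[area][0] += len(discussion.split())
    totals[area][1] += 1

avg_discussion_words = {area: words / count
                        for area, (words, count) in totals.items()}
```

On the real data one would of course count words across every discussion comment attached to a paper, not a single string per paper.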

The AIW statistic is somewhat understandable, as the reviewers there are not drawn solely from the AI community and may have different cultural norms.

The Machine Learning statistic, however, is worrisome, especially since a majority of submissions are in ML. Some of the ML colleagues I talked to say that things are not that different at other ML conferences (ICML, NIPS, etc.). Which makes me wonder whether the much-talked-about NIPS experiment is a reflection of peer reviewing in general, or of peer reviewing in ML...

In case you are wondering, here is the plot for the length of reviews (again measured in terms of the total number of words across all reviews). Interestingly, AIW submissions have longer reviews than ML and NLP!

So you know!

(with all real legwork from Lydia, aka IJCAI-16 data scientist...)


  1. It would be interesting to weight these numbers according to the quality of the submissions (I am not totally sure how one would do that): a paper that is a clear reject (with blatant errors, for example) will have almost no discussion. Maybe there are many very bad papers from AIW? I am not claiming anything, I am just wondering.

  2. Thanks for sharing these stats! I think it's very important to think about the way we do reviews, and data-driven analysis seems necessary...

    There's an underlying assumption here that longer reviews are better. I imagine there is a positive correlation (a four-sentence review will never be high quality), but I've seen some very low-quality long reviews...

    At least as someone in human-aware AI, I think the NIPS experiment results might also hold in our conferences (e.g., my paper that was rated as low contribution by all reviewers at AAAI was now rated as high contribution by all reviewers at IJCAI. While we made some changes to the paper, the core hasn't changed...).

    I would also be curious to see the average scores papers in each area received. And variances on all these stats would be helpful too :-) At the major HCI conference they also did a survey where they ask authors for example excerpts from low/high-quality reviews to do more in-depth analysis. Perhaps something to consider (of course, it's always hard to get an objective view from someone who got a bad review, but I think it might be possible if authors are prompted for concrete examples).

    I think to some extent it might be a problem that we have the exact same criteria for all these different areas (a paper with human experiments is very different from a theoretical paper). And in general, I think it's worth thinking about how to frame the evaluation criteria because to a large extent it can affect the way a review is written.

    1. Hi OA:

      My main point is *not* about the length of reviews, but rather about the discussion (or lack thereof). I can imagine a short review with the reviewer taking part in the discussion. Having short reviews and *NO* discussion makes it very hard to figure out what the reviewers are thinking (as I can attest, having gone through a whole lot of such papers this last week!). Besides, discussion is where you learn to calibrate your opinions, no?

      As for the NIPS experiment, my point is not that you can never get two sets of three reviewers who disagree about a paper, but that the relatively large incidence of it at NIPS may well be correlated with the amount of inter-reviewer discussion in ML compared to other areas.

      On your own paper of course, I have no idea--unless you help me out by telling me its ID so I can "rectify" the situation (*HUGE GRIN*).

      Who really should be working on notifications right about now, but is always a sucker for philosophy at the height of execution imperative (c.f. )

    2. Thanks Rao,

      Yeah, I agree that discussion length is important! Though it could potentially be caused by things like variance in the reviews? (And then maybe ML papers just have more agreement? That would be one way to explain the discussion-length differences, but the data might not support it.)

      Of course one example is not a proof, but in general I hypothesize that the human papers have higher variance than planning papers (my hypothesis might be rejected though :)). If that is true, then it is really a bad combination: high variance and not enough discussion...

      Anyway, the bottom line is that I think it's great that you guys are looking at the data, and there are many more things that could be examined. So I'm in favor of this process!

      PS thanks for offering to rectify my situation, but for now I will keep the paper ID to myself :-)

  3. This comment has been removed by the author.

  4. I believe the short length of discussion in the ML area means low review quality. Take my paper, for example: two reviewers failed to give any response to my rebuttal on some key issues. Probably there are just too many papers in the machine learning area, and there is a shortage of qualified reviewers for them.

  5. This comment has been removed by the author.
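One concrete way to test the variance hypothesis raised in the comments above: check whether per-paper review-score variance correlates with discussion length. A sketch with entirely made-up scores and word counts (the `pearson` helper and the data are illustrative assumptions, not part of the IJCAI analysis):

```python
from statistics import mean, pvariance

# Made-up per-paper data: (review scores, discussion word count).
papers = [
    ([6, 6, 6], 40),    # reviewers agree, little discussion
    ([3, 7, 8], 900),   # reviewers disagree, long discussion
    ([5, 5, 6], 120),
    ([2, 8, 5], 700),
]

def pearson(xs, ys):
    """Pearson correlation, written out to avoid extra dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

variances = [pvariance(scores) for scores, _ in papers]
lengths = [words for _, words in papers]
r = pearson(variances, lengths)  # positive r supports the hypothesis on this toy data
```

On real data, a weak or negative correlation would point instead at the "more agreement in ML" explanation floated in the thread.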