TIL: Gensim Koan

Ten-Year-Old Bug?

Vincent Warmerdam koaning.io
2021-06-11

A while ago, one of my managers shared a paper on Slack. It’s a paper on a ten-year-old software bug in Gensim. There are many ways to train word embeddings, but some common advice may be based on a software bug. To quote the abstract:

It is a common belief in the NLP community that continuous bag-of-words (CBOW) word embeddings tend to underperform skip-gram (SG) embeddings. We find that this belief is founded less on theoretical differences in their training objectives but more on faulty CBOW implementations in standard software libraries such as the official implementation word2vec.c and Gensim. We show that our correct implementation of CBOW yields word embeddings that are fully competitive with SG on various intrinsic and extrinsic tasks while being more than three times as fast to train.

That’d be a big “Ouch”. The described bug is also a pretty big one:

Specifically, we find that in these implementations, the gradient for source embeddings is incorrectly multiplied by the context window size, resulting in incorrect weight updates and inferior performance.
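To make that concrete, here is a minimal NumPy sketch of my reading of the bug. The names are made up and the real training loop is more involved; this is not Gensim's actual code.

```python
import numpy as np

def cbow_context_update(ctx_vecs, grad_wrt_mean, lr, buggy=True):
    """Apply one CBOW gradient step to the context (source) embeddings.

    ctx_vecs:      (k, d) array holding the k context word embeddings
    grad_wrt_mean: (d,) gradient of the loss w.r.t. their average
    """
    k = ctx_vecs.shape[0]
    if buggy:
        # Behaviour described in the paper for word2vec.c and Gensim:
        # every context vector receives the full gradient, so across
        # the window the update is effectively multiplied by k.
        ctx_vecs -= lr * grad_wrt_mean
    else:
        # Corrected behaviour: the forward pass averages the context
        # vectors, so the chain rule puts a factor of 1/k on each one.
        ctx_vecs -= lr * grad_wrt_mean / k
    return ctx_vecs
```

Since the CBOW forward pass averages the context vectors, the missing 1/k in the backward pass is exactly the "multiplied by the context window size" that the authors describe.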

The paper suggests the bug also has an effect on the hyperparameters:

We believe that Gensim requires a smaller learning rate for CBOW, because a smaller learning rate is more likely to reduce the stochastic loss, since the incorrect update is still a descent direction, although not following the negative stochastic gradient.
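That reasoning checks out if you write the update down. Scaling a gradient by the (positive) window size k does not flip its direction; it is the same as taking a correct gradient step with a learning rate that is k times larger, which is why a smaller learning rate compensates. A tiny sanity check, again with made-up numbers rather than anything from Gensim:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=5)  # a stand-in stochastic gradient
k = 4                      # context window size

# The faulty step -lr * (k * grad) points the same way as the correct
# step -lr * grad: scaling by a positive constant keeps it a descent
# direction, just with an effective learning rate of k * lr.
cosine = (k * grad) @ grad / (np.linalg.norm(k * grad) * np.linalg.norm(grad))
assert np.isclose(cosine, 1.0)
```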

The paper then goes on to show that the corrected implementation is faster to train while maintaining the embeddings' predictive properties.

Lessons

There are a few lessons you might take from this. One lesson might be that “open-source software can still be fragile”. While true, I think there’s an even better lesson: if the code hadn’t been open source, this error might never have been found.

Then again, another big lesson is that numeric algorithms are very tricky to debug.

Required Nuance

The Gensim Twitter account commented on this work and pointed me to this GitHub issue. It seems that the initial version of the paper was a preprint and that the paper has since been revised. It even got a new name!

It also seems that there is plenty of nuance to add. This quote from the GitHub issue stands out to me in particular:

(as is sadly rampant in papers) it’s not crystal-clear that they’ve given preexisting implementations (in this case Gensim) the same breadth of meta-parameter optimizations as their method.

I’m inclined to agree here. The paper could do a better job of saying “we really only changed this one thing”. Without that explicit statement, one risks reporting metric-hacked results.

One big lesson still stands. Numeric algorithms are not easy to debug.