The Saga continues in Embeddings
A while ago there was a story in the news about an American teenager who wrote over 20,000 entries and made some 200,000 edits on Scots Wikipedia. He caused quite a controversy in doing so.
The kid seems to have had good intentions, but the crucial problem was that he did not speak Scots himself and wrote a mixture of English and Scots on the pages. Many of his pages contain typos as well as erroneous definitions. Given that Scots is an endangered language, there was an awkward side effect: he may have been responsible for about 33–49% of all the Scots content on Wikipedia.
This story got plenty of coverage (1, 2, 3), but the saga seems to continue in the world of text embeddings. The thing is that many popular embeddings are trained on Wikipedia. After all, it’s a reasonably well-structured dataset covering a wide range of topics, so it makes sense as a data source for embeddings. But when half of the text comes from an erroneous source … you kind of wonder if the embeddings contain made-up words.
So I figured it’d be interesting to try and see if I could find one. I downloaded fastText embeddings and byte-pair embeddings to try it out. The Slate article mentions that “smawer” is a fake word that got added, and I could confirm this by checking the online Scots dictionary.
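If you want to follow along, here’s roughly how the models can be fetched. This is just a sketch; I’m assuming the `whatlies`, `fasttext` and `bpemb` packages are pip-installed and that a Scots model is available via `fasttext.util`.

> # rough setup sketch (package names and model availability are assumptions)
> import fasttext.util
> # downloads cc.sco.300.bin (the Common Crawl fastText model for Scots)
> # into the working directory; the byte-pair vectors are fetched
> # automatically by whatlies/bpemb the first time BytePairLanguage is used
> fasttext.util.download_model("sco", if_exists="ignore")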
So is it in the vocabulary?
> from whatlies.language import BytePairLanguage
> lang = BytePairLanguage(lang="sco", vs=100000)
> lang.score_similar("smawer")
# [(Emb[smawer], 0.0),
# (Emb[lairger], 0.18736843287866867),
# (Emb[sma], 0.3332672898020761),
# (Emb[than], 0.3675327164548263),
# (Emb[anes], 0.3803581570228056),
# (Emb[ither], 0.38676436447569584),
# (Emb[pairts], 0.3880290360751052),
# (Emb[lairge], 0.39303860726974627),
# (Emb[veelages], 0.3976859689479768),
# (Emb[▁smawer], 0.405010403535216)]
Note that this is a 3.7 GB file.
> from whatlies.language import FastTextLanguage
> lang = FastTextLanguage("cc.sco.300.bin")
> lang.score_similar("smawer")
# [(Emb[smawer], 1.1920928955078125e-07),
# (Emb[lairger], 0.6052425503730774),
# (Emb[than], 0.6364072561264038),
# (Emb[ither], 0.642298698425293),
# (Emb[larger], 0.656043291091919),
# (Emb[Canis], 0.675086259841919),
# (Emb[ootlyin], 0.6806771159172058),
# (Emb[bigger], 0.6863239407539368),
# (Emb[Nayarit], 0.7038226127624512),
# (Emb[nor], 0.7145431041717529)]
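As an extra sanity check, you could also ask the fastText model directly whether “smawer” is a real vocabulary entry rather than a vector stitched together from subword n-grams. A minimal sketch, assuming the `fasttext` Python package and the same `cc.sco.300.bin` file:

> import fasttext
> ft = fasttext.load_model("cc.sco.300.bin")
> # get_word_id returns -1 for out-of-vocabulary tokens, so a non-negative
> # id means the token really made it into the training vocabulary;
> # given the neighbours above, this should come back True
> ft.get_word_id("smawer") != -1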
Both embeddings contain “smawer”. The word does not exist in Scots, but I imagine the embeddings picked it up because the kid added it to the corpus as a “Scottish”-sounding variant of “smaller”. Both language models seem to agree that the word “lairger” is related to it, which the online Scots dictionary confirms can mean “large”.
Ouch.