TIL: Linkrot is a Huge Problem

It happens often and comes with some nasty hidden side effects.

Vincent Warmerdam koaning.io
2021-12-03

When you link to an online resource from your blog, odds are that the link will break over time. Some sites, like Wikipedia, are relatively stable, but blogs and news sites often forget to redirect old links when a new version of the website is deployed.

It turns out to be a huge problem.

Figure 1: The older an article is, the higher the probability of a broken link.

A group of researchers investigated links on nytimes.com in this article to understand how quickly links break. The article is titled “What the ephemerality of the Web means for your hyperlinks.” To quote the article:

We found that of the 553,693 articles within the purview of our study––meaning they included URLs on nytimes.com––there were a total of 2,283,445 hyperlinks pointing to content outside of nytimes.com. Seventy-two percent of those were “deep links” with a path to a specific page, such as example.com/article, which is where we focused our analysis (as opposed to simply example.com, which composed the rest of the data set). Of these deep links, 25 percent of all links were completely inaccessible. Linkrot became more common over time: 6 percent of links from 2018 had rotted, as compared to 43 percent of links from 2008 and 72 percent of links from 1998. Fifty-three percent of all articles that contained deep links had at least one rotted link.

Part of the story is that linkrot can be a good thing. Maybe there’s data shared via S3 or Google Drive that shouldn’t be public, and shutting those links down is the right call. But a broken link from a domain that’s no longer registered can easily be hijacked to serve unwanted content. That’s something to be slightly wary about.

It’s not just the New York Times that suffers from linkrot; Stack Overflow is also having issues with it.

Solutions?

Linkrot is a hard problem, but part of me thinks it might become more manageable with a CI step that checks for broken links. Projects like deadlink could run as GitHub Actions or as a manual check once in a while. If there are broken links in your content, at least you’d be warned as early as possible.
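
To make this concrete, here’s a minimal sketch of what such a check could look like, written in Python with the requests library. The specifics are all assumptions: the content/**/*.md glob, the ten-second timeout, and the regex for pulling URLs out of a file depend on how your blog is laid out, and a dedicated tool like deadlink will handle far more edge cases.

```python
# Minimal link checker sketch: scan markdown files for http(s) links
# and flag any that no longer resolve. The glob pattern, timeout, and
# URL regex below are assumptions; adapt them to your own repo layout.
import glob
import re
import sys

import requests

URL_PATTERN = re.compile(r"https?://[^\s)\"'>]+")


def find_links(path):
    """Return all http(s) URLs found in a file."""
    with open(path, encoding="utf-8") as fh:
        return URL_PATTERN.findall(fh.read())


def is_broken(url):
    """Consider a link broken on a 4xx/5xx status or a connection error."""
    try:
        resp = requests.head(url, timeout=10, allow_redirects=True)
        # Some servers reject HEAD requests; retry with GET before judging.
        if resp.status_code >= 400:
            resp = requests.get(url, timeout=10, allow_redirects=True)
        return resp.status_code >= 400
    except requests.RequestException:
        return True


if __name__ == "__main__":
    broken = []
    for path in glob.glob("content/**/*.md", recursive=True):
        for url in find_links(path):
            if is_broken(url):
                broken.append((path, url))
    for path, url in broken:
        print(f"{path}: {url}")
    # A non-zero exit code is what makes this usable as a CI gate.
    sys.exit(1 if broken else 0)
```

Running a check like this on a schedule, say a weekly cron trigger in CI, tends to work better than running it on every push: external links can flake temporarily, and you don’t want a third-party outage to block your own deploys.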