Link rot in backlist ebooks


Teresa Elsey is senior managing editor (digital) in the Trade division at Houghton Mifflin Harcourt. She directs a group that produces and updates more than a thousand ebooks yearly, including adult fiction and non-fiction, culinary and lifestyle, YA titles, picture books, and e-only projects. She began her career in print publishing (though she likes ebooks better!) and has also worked at O’Reilly Media, Let’s Go, and Cengage. She’ll be at ebookcraft March 19 in Toronto delivering a session called Building Ebooks That Last.

As the manager of a very large (and ever-increasing) ebook backlist at Houghton Mifflin Harcourt, I’ve become deeply interested in longevity — keeping those ebooks in good shape and on sale without continual maintenance from my team. The only way the ebook work we do can be sustainable is if we find ways to not have to redo… and redo… and redo it.

One threat to the longevity of our ebooks is link rot, or the disappearance of external websites cited by authors in their books. We may think of books as discrete, standalone objects, but even physical books exist as part of a network of citations, allusions, and references. Ebooks, because of their close linkage to the web, are even more connected to this network. And this is a wonderful fact about ebooks, but also a concerning one when you realize that maintaining your ebook requires some responsibility for maintaining that whole network.

When I spoke about maintaining backlist ebooks at ebookcraft in 2016, I talked about an example book from our backlist, Powers of Two, by Joshua Wolf Shenk. Like many of Houghton Mifflin Harcourt’s non-fiction titles, it contains numerous hyperlinks, mostly as part of citations in the endnotes. It was published in August 2014, and the ebook includes 275 linked URLs. I tested those URLs (using the W3C Link Checker) in 2016, a bit less than two years after pub date, and I found that 47 of them, or 17%, were not working. That wasn’t especially bad, in terms of what some of the research on link rot would suggest (an oft-cited study says the average URL has a two-year half-life, which would have predicted 50% not working). Now, in early 2019, we’re four and a half years from the book’s pub date. I retested all the URLs, and this time I found that 58 of them, or 21%, aren’t working. Put another way, 79% of the URLs the author cited in 2014 still lead to working websites.

That’s great in terms of the gloomiest predictions about link rot. (The two-year half-life model would give us less than 25% working at this point.) But having 79% of the hyperlinks in an ebook still working is a pretty bad result in terms of user experience. And it’s pretty terrible in terms of what we as publishers expose ourselves to, given that my company has seen retailers suppress or put a quality warning on an ebook because readers have reported as few as two broken links.
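The half-life model is simple enough to check against these numbers directly. A quick sketch (the decay formula is my own rendering of the model, assuming the surviving fraction halves every two years):

```python
# Two-year half-life model for URL survival:
# fraction of cited URLs still working after t years = 0.5 ** (t / 2).

def predicted_surviving(years: float, half_life: float = 2.0) -> float:
    """Fraction of cited URLs the model expects to still work after `years`."""
    return 0.5 ** (years / half_life)

# At the 2016 check (~2 years out): model predicts 50% working;
# the book actually had 83% working (228 of 275).
print(round(predicted_surviving(2.0), 2))   # 0.5

# At the 2019 check (~4.5 years out): model predicts ~21% working;
# the book actually had 79% working (217 of 275).
print(round(predicted_surviving(4.5), 2))   # 0.21
```

So the book is beating the model handily, which is the "great" part; the user-experience part is another story.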

A note here on my method: Automatic link-checking has its limitations. Some URLs appear to resolve, but don’t really lead where they ought to (or once did). Some URLs come up as broken because they have typos in the original book. Some sites that the link checker told me returned 404s worked just fine when I re-checked them manually later. (And this checking doesn’t even consider reference rot, the idea that a page may still exist while saying something different than it did when it was originally cited). So which categories of HTTP status codes I counted as “broken” vs. “suspicious but maybe fine” is admittedly a bit arbitrary, but we have to draw the line somewhere and my baby only naps for so long.
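That line-drawing can be sketched as code. A minimal illustration of the kind of triage involved (the buckets and cutoffs here are my own assumptions, not the exact categories used in my test):

```python
import urllib.error
import urllib.request

def classify_status(code: int) -> str:
    """Rough triage of HTTP status codes for a link-rot audit.
    Where to draw the "broken" line is a judgment call."""
    if code in (404, 410):   # page gone, or explicitly removed
        return "broken"
    if code >= 500:          # server errors: worth re-checking manually later
        return "suspicious"
    if 300 <= code < 400:    # redirects may or may not land somewhere useful
        return "suspicious"  # (urlopen follows redirects by default, so these
    return "ok"              # surface only with a custom opener)

def check_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and triage the result. Network failures count as suspicious."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_status(resp.status)
    except urllib.error.HTTPError as err:
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        return "suspicious"
```

Note that even a script like this inherits the same limitations: a 200 from a typo-laden URL or a repurposed domain still isn't the page the author cited.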

Partially because of that, I wanted to take a closer look at what kinds of URLs had become problematic. In this book, it was some small personal websites and a lot of YouTube links, but also the occasional link to a site I would have expected to be more stable: The New Yorker, the Wall Street Journal, NPR, PBS.

I also wondered which of the broken links would be fixable at this point, four years after the book was published. I worked through the 32 URLs that returned 404 (not found) errors to see which I could find an archived page for. To do this, I used the Time Travel tool by Memento, which searches web archives for previous versions of a given URL. I entered the date Aug. 1, 2014 to find versions of the cited web pages that would be as close as possible to how the web pages appeared when the book was first published.


Of the 32 sites, Time Travel found at least one archived version for 21 of them. Time Travel searches a wide selection of web archives, but nearly all of the best results were from Internet Archive (four were from other places: Archive-It, archive.is, the Library of Congress, and Arquivo.pt). Put another way, about two-thirds (21 of 32) of my link-rot problems were solved by the foresight of Brewster Kahle. And while that’s certainly not all of the broken links, it’s a respectable batting average. Replacing those broken links in my ebook with links to the archived pages seems an obvious win for user experience.

Internet Archive did well at returning the sorts of pages that you’d hope your authors are citing (Wall Street Journal articles, NPR stories, academic papers). As well, it’s strong at capturing fundamental items of our shared cultural history, namely this video of the Beatles performing “Twist and Shout” for the Queen Mother. Its failures included a personal website, links to YouTube videos that had been taken down for copyright infringement, and a URL with an obvious typo in it. (It’s possible that some URLs deserve to be allowed to fade away.)

The process of using Time Travel to find archived pages was slow and manual, but the preponderance of results coming from Internet Archive reminded me of Simon Collinson’s advice that simply prefixing URLs with “http://web.archive.org/web/” directs you to the Internet Archive version of the page if one exists. Inspired by the automatability of that, I added “http://web.archive.org/web/” as a prefix to every URL in my test, hoping to discover what fraction of all the book’s cited websites were already covered by Internet Archive… only to discover that Internet Archive disallows checking by web bots, so the W3C link checker returned nothing.

So while the degree to which Internet Archive alone solves the link-rot problems of the publishing industry is left as an exercise for the reader, I completed this experiment encouraged that there are tools we could be using, some of them quite easily and automatably, to improve the user experience of the hyperlinks in our ebooks.

If you’d like to hear more from Teresa Elsey about building ebooks that last, register for ebookcraft on March 18 and 19, 2019 in Toronto. You can find more details about the conference here, or sign up for the mailing list to get all of the conference updates.