Book content in HTML: An overview

Nellie McKesson has been working in publishing for well over a decade, in almost every book production role there is (production editor, print layout specialist, ebook production manager, to name a few). She’s spent the last six or so years working specifically with automated workflows: creating ebook and print files via hands-off automated toolchains. In late 2017, she founded the startup Hederis, whose goal is to solve publishing problems through user-focused automation and web technology. She currently lives in Portland, Oregon, where she mostly just works and plays D&D. Nellie will be at ebookcraft’s main day on March 19 delivering a talk called “Pagination in the Browser: Why, What, and How.”

Books and web technologies are a well-established pairing at this point. At the core of an EPUB file, you’ll find a collection of HTML files, along with CSS to make the book look good. You can also feed HTML and CSS files into a PDF processor program like Prince or Antenna House to create a print-ready PDF, or even view EPUBs and generate paged PDFs straight from a web browser (I’ll be diving deeply into this during my talk at ebookcraft 2019). You can use JavaScript or other scripting languages to dynamically adjust your content: insert information targeted at specific readers, generate a table of contents, re-order frontmatter based on the file format (e.g., move the copyright page to the back of the EPUB, but leave it in the front of the print PDF), and much more.
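As a small illustration of format-specific re-ordering, here is a minimal Python sketch. It assumes each section of the book has already been labeled with a type; the `type` key, the `copyright` value, and the `order_sections` function are hypothetical names made up for this example, not part of any real tool.

```python
def order_sections(sections, output_format):
    """Return the sections in reading order for the given output format.

    For "epub", the copyright page moves to the back of the book.
    For any other format (for example "pdf"), the order is unchanged.
    """
    if output_format != "epub":
        return list(sections)
    # Keep everything else in its original order, then append the copyright page.
    others = [s for s in sections if s["type"] != "copyright"]
    copyright_pages = [s for s in sections if s["type"] == "copyright"]
    return others + copyright_pages


book = [
    {"type": "titlepage"},
    {"type": "copyright"},
    {"type": "chapter"},
]
epub_order = [s["type"] for s in order_sections(book, "epub")]
pdf_order = [s["type"] for s in order_sections(book, "pdf")]
```

A script like this only works if every book labels its copyright page the same way, which is exactly why a markup standard matters.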

When it comes to building efficient workflows around HTML and CSS, one of the fundamental prerequisites is establishing a standard way to mark up your content. By adhering to a standard set of markup rules, you open the door to automation, template-based workflows, and more.

What does it mean to mark up content?

Content, in the context of books, means the actual book text, like this:

Alice in Wonderland

Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’

When a human looks at this block of content, they can typically infer that the first two lines are titles (the first being the book title, the second the chapter title), and the third line marks the start of the actual book text (though, without any visual cues, it may take a moment to make sense of things).

When you’re marking up content, you’re adding extra information (or metadata) about the content itself. You’re labeling each distinct unit of text, like this:

Book Title* -> Alice in Wonderland

Chapter Title -> Down the Rabbit-Hole

Body Text -> Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’

*Please note: This is not HTML. This is a made-up label, for the purpose of illustrating the concept.

When computers need to process the content, these labels tell the computers what to do with each piece of content and how to display it (in addition to making it easier for humans to parse the content). For example, most web browsers (like Google Chrome, Firefox, and Internet Explorer) and ebook-reading apps (like Kindle and iBooks) know to display content that is marked up with an HTML heading tag in a larger font size than content that is tagged as a standard paragraph.** While this ability to parse content correctly may seem extraneous when you’re also adding visual design to your content, it’s vital for things like scripting, accessibility, templating, and more.

**You might ask: “Why not just rely on visual formatting like larger font sizes, bold, and italics, to denote different types of content, instead of markup?” While visual formatting is a language that most humans can understand, it’s one that computers still struggle with.



Defining an HTML markup standard (the basics)

Because HTML is designed to be used on lots of types of content, it’s very flexible; you can mark up content in a variety of ways, and all of them might be perfectly valid. To create an efficient workflow, you need to make HTML less flexible. You want to create rules for exactly how every piece of your content should be tagged, and then apply those same rules to every book. For example:

  • All chapter headings must be marked up with an <h1> tag*, and must use a class name of chaptertitle

  • All chapters must be wrapped in a <section> tag with a data-type attribute** of chapter

  • All body text paragraphs must be marked up with a <p> tag

*HTML has a limited set of tags that you can choose from. You can see a list of them here.

**Attributes are extra descriptors that you can add to HTML tags to make them more meaningful. You can choose from predefined attributes (like class) or make up your own, as long as the name starts with the “data-” prefix (like data-type).
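Once rules like these are written down, a script can check a file against them. The following is a minimal sketch using Python's built-in `html.parser` module. The `MarkupChecker` class, the `check_markup` function, and the wording of the messages are assumptions made for illustration, and the sketch covers only the three example rules above.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as "open".
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "wbr"}


class MarkupChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements as (tag, attrs) pairs
        self.problems = []  # human-readable rule violations

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "chaptertitle" in classes:
            if tag != "h1":
                self.problems.append(f"<{tag}> uses class chaptertitle; expected <h1>")
            elif not self._in_chapter():
                self.problems.append("chapter title is outside a chapter <section>")
        if tag not in VOID_TAGS:
            self.stack.append((tag, attrs))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating minor nesting slips.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text sitting directly inside a chapter section is not wrapped in <p>.
        if data.strip() and self.stack and self._is_chapter(self.stack[-1]):
            self.problems.append(f"unwrapped text in chapter: {data.strip()[:30]!r}")

    @staticmethod
    def _is_chapter(element):
        tag, attrs = element
        return tag == "section" and attrs.get("data-type") == "chapter"

    def _in_chapter(self):
        return any(self._is_chapter(element) for element in self.stack)


def check_markup(html):
    """Return a list of rule violations found in the HTML string."""
    checker = MarkupChecker()
    checker.feed(html)
    checker.close()
    return checker.problems
```

Calling `check_markup` on a chapter that follows all three rules returns an empty list; any violation comes back as a short message that a person can act on.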

Your rules should at minimum be:

Descriptive: Each label should easily convey the type of content — a heading, a quote, poetry, etc. Because HTML has a limited set of tags, you can convey extra meaning via attributes like classes, using names that you make up; however, you want to ensure that your HTML is also:

Semantic: Each tag you use should convey meaning about the content inside of it. You could certainly use the basic <p> tag for every block, and then rely on class attributes to distinguish them, e.g.:

<p class="booktitle">Alice in Wonderland</p>

<p class="chaptertitle">Down the Rabbit-Hole</p>

<p class="bodytext">Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’</p>***

***Unlike the previous labeling example, this IS HTML!

But this is poor semantics. Computers don’t inherently understand what your class names mean — after all, you made those class names up. However, computers do understand the inherent meaning of HTML tags, like this:

<h1 class="booktitle">Alice in Wonderland</h1>

<h2 class="chaptertitle">Down the Rabbit-Hole</h2>

<p class="bodytext">Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversation?’</p>

Each tag conveys extra meaning about the content. This meaning is built into the HTML language, so software such as browsers, ebook-reading apps, and scripts can identify headings and paragraphs without knowing anything about your class names.
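To show what semantic tags buy you, here is a minimal Python sketch that builds a simple table of contents using nothing but the heading tags. It relies on Python's built-in `html.parser` module; the `HeadingCollector` class and `build_toc` function are hypothetical names made up for this example, and the indentation style is an arbitrary choice.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) for every <h1> to <h6>, ignoring class names."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def build_toc(html):
    """Return one line per heading, indented two spaces per level below <h1>."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return "\n".join("  " * (level - 1) + text for level, text in collector.headings)


semantic = (
    '<h1 class="booktitle">Alice in Wonderland</h1>\n'
    '<h2 class="chaptertitle">Down the Rabbit-Hole</h2>\n'
    '<p class="bodytext">Alice was beginning to get very tired.</p>'
)
all_paragraphs = semantic.replace("h1", "p").replace("h2", "p")
```

Run on the semantic version, `build_toc` finds the book title and the chapter title. Run on the version that uses `<p>` for everything, it finds nothing, because the script has no way to know that a made-up class name means "heading".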

Here are some existing standards that you can check out for inspiration, or adopt as-is:

How might you mark up content?

I’m not going to dive too deeply into the methods for marking up content and converting to HTML, but here are some popular strategies:

Mark up in Microsoft Word, then convert to HTML: Since many publishing workflows are still Word-based, and the various people involved in book production (authors, proofreaders, copy editors, production editors, etc.) generally understand how to use that software, this is a common choice. It typically involves creating a set of Word Styles that are attached to every manuscript, and then applying a style to every paragraph in the manuscript. This often happens around the copyediting stage of production, since the manuscript is largely complete at that point. After the styles are applied, the manuscript is converted to HTML using custom scripts.* These styled Word files are also compatible with an InDesign workflow (and can help streamline that process).

*I’m always happy to chat or advise on these kinds of conversion scripts. I like to use the open-source Mammoth library as a starting point.
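The core of such a conversion is a mapping from Word style names to HTML tags and classes. The sketch below shows that idea in plain Python. It does not use Mammoth or read a real Word file: it assumes the paragraphs have already been extracted as (style name, text) pairs, and the style names, the `STYLE_MAP` table, and the `convert_paragraphs` function are all hypothetical.

```python
from html import escape

# Hypothetical Word style names mapped to (tag, class) pairs.
STYLE_MAP = {
    "Book Title": ("h1", "booktitle"),
    "Chapter Title": ("h2", "chaptertitle"),
    "Body Text": ("p", "bodytext"),
}


def convert_paragraphs(paragraphs):
    """Convert (style_name, text) pairs to HTML.

    Returns the HTML plus a list of style names that have no mapping.
    Paragraphs with unmapped styles are left out of the HTML so that a
    person can review them instead of the script guessing.
    """
    html_lines = []
    unmapped = []
    for style, text in paragraphs:
        if style not in STYLE_MAP:
            unmapped.append(style)
            continue
        tag, class_name = STYLE_MAP[style]
        # Escape &, < and > so manuscript text cannot break the markup.
        html_lines.append(f'<{tag} class="{class_name}">{escape(text, quote=False)}</{tag}>')
    return "\n".join(html_lines), unmapped


manuscript = [
    ("Chapter Title", "Down the Rabbit-Hole"),
    ("Sidebar", "A style nobody added to the map"),
]
html_output, needs_review = convert_paragraphs(manuscript)
```

Reporting unmapped styles, rather than silently falling back to a plain paragraph, is one possible design choice; it keeps the output faithful to the standard at the cost of a manual review step.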

Author directly in HTML and apply markup as you go: In this workflow, you use an HTML-first authoring tool to write the content, and the tool applies standardized markup as part of the writing process. Transitioning to this type of workflow can be a bit tricky, as it requires author buy-in, and HTML-first authoring systems can sometimes introduce bugs in the markup; but it can also potentially speed up the book production process and open the door to more collaborative writing. Some HTML-first authoring systems are Pressbooks, Editoria, and WordPress Gutenberg.

Use machine-learning or “smart” conversion tools to mark up the content: Machine learning is a hot topic in the tech industry, and publishing is no exception. (Check out the panel “Cybernetic Ebooks: A Panel on Machine Learning and AI in Book Production” at ebookcraft 2019.) While still a bit experimental, machine learning can analyze the book content and make guesses about how to label each block of text. The machine-learning scripts then apply these labels and return the marked-up content to you for review. There are also some smart, rule-based conversion tools that follow a similar approach, based on the formatting applied to the content. Bookalope is a pioneer in the machine-learning space, and PagedMedia.org’s XSweet is an example of the rule-based approach. Hederis** uses a combination of machine-learning and rule-based conversion.

**I feel a bit sheepish including my own company here, but it was built for this purpose!
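To give a feel for the rule-based approach, here is a toy Python sketch that guesses a label from formatting alone. It is not how Bookalope, XSweet, or Hederis work; the `guess_label` and `label_blocks` functions, the label names, and the size thresholds are all invented for illustration.

```python
def guess_label(font_size, bold, body_size=12):
    """Guess a content label from visual formatting alone.

    The thresholds are invented for this example; a real tool would
    tune them per manuscript or learn them from examples.
    """
    if font_size >= body_size * 2:
        return "booktitle"
    if font_size > body_size or bold:
        return "chaptertitle"
    return "bodytext"


def label_blocks(blocks):
    """Attach a guessed label to each block so a person can review it."""
    return [
        {**block, "label": guess_label(block["font_size"], block["bold"])}
        for block in blocks
    ]


blocks = [
    {"text": "Alice in Wonderland", "font_size": 24, "bold": True},
    {"text": "Down the Rabbit-Hole", "font_size": 16, "bold": True},
    {"text": "Alice was beginning to get very tired.", "font_size": 12, "bold": False},
]
labels = [block["label"] for block in label_blocks(blocks)]
```

Rules this simple misfire easily: a body-size paragraph set entirely in bold would be labeled as a chapter title. That is why the review step matters, whichever tool makes the guesses.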

Whichever approach you choose, it will require an understanding of the types of content labels that are available in your chosen standard and how to apply them, which means you’ll need to invest some time in training your staff and vendors.

If you’d like to hear more from Nellie McKesson about pagination solutions, register for ebookcraft, taking place March 18 and 19, 2019, in Toronto. You can find more details about the conference here, or sign up for the mailing list to get all of the conference updates.