2 An API for Books

Hugh McGuire


Here is a radical statement: A publisher’s job is to provide good APIs (Application Programming Interfaces) for their books. Now that almost all books are made into digital products (that is, ebooks), good publishers of the future will be the ones who provide great APIs. In this article I am going to explore:

  • why I think publishers must provide good APIs
  • why this is actually much easier and less scary than you might think
  • why the old-fashioned book index should be the starting point for book APIs


What is an API?

An API is a set of tools/protocols that allow different bits of software to communicate with each other, under defined conditions. As Terry Jones puts it: “Just as a User Interface gives humans access to information, an API gives programs access to information.” Here are some examples of APIs that you have probably interacted with:

  • if you have ever “Facebook liked” a website, you were using an API
  • if you use an app or a desktop client to read and post to Twitter, you are using the Twitter API
  • if you have ever paid for something online using PayPal, you were using an API
  • if you’ve ever used an app that shows you maps, you were using an API

Here is another way to think of an API: an API lets one service use the data from another service, under defined circumstances.

What does this have to do with books?

Books evoke emotions, books inspire thinking, books contain and convey ideas. But they also contain “data”; they are, if you like, made of data. A digital photo taken on your phone, a movie you stream over Netflix, an mp3 you listen to on your iPod: these are all digital things made up of 1s and 0s, and they are represented as “data.” The way we access them is through technology. Books, especially digital books, are similarly made up of data: of 1s and 0s, but also, in fact, of HTML.

If we start to think of “books as data,” then the traditional publisher’s role starts to sound a lot like the role of providing an API: A publisher’s job is to manage how and when and under what circumstances people (readers) or other services (book stores, libraries, other?) access books (data).

We know what this job looks like in the old world of bound paper and bricks and mortar stores, and we’re pretty sure we understand it in a world of EPUB and Kindle.

But as we move into a primarily digital world, publishers should, and soon must, start thinking about their digital APIs, in ways that go beyond the traditional “send our book to bookstores.” (Don’t worry, I’ll tell you below why that might not be so scary.)

What does a publisher do?

Here’s a historical definition of a publisher:

publisher (n.) mid-15c., “one who announces in public,” agent noun from publish (v.). Meaning “one whose business is bringing out for sale books, periodicals, engravings, etc.” is from 1740.
(source: the Online Etymology Dictionary)

A publisher’s job is to “make public” works created by authors, and usually, but not always, to monetize those works.

It’s no secret that digital is bringing profound changes to the environment under which publishers do their jobs. Bricks and mortar stores are disappearing, online book sales are increasing, and the barriers to publishing, especially ebooks, are almost gone. In this world of massive and exploding supply of books, and diminishing physical spaces selling books, the job of “making public” starts to look very different.

Some of the traditional mechanisms by which publishers achieved their goals of “making public” books to readers and monetizing them (namely: production, distribution and sales) are becoming commoditized, and available to writers themselves.

If self-publishers can do a lot of the things publishers used to do, then finding new and better ways to “make public” writing is what will separate the good publishers from the bad in the future. Those who are the best at it will be the publishers writers want to work with.

And one asset that a publisher has, and no one else does (not even Amazon), is a careful, contextual knowledge of the content of their books. Publishers have commissioned their books, edited them, copy edited them, proofread them (maybe!), designed them… They know their books back to front.

This knowledge should enable publishers to better aggregate interest around their books. The problem is that, although publishers have all this knowledge about their books, they generally do not know how to share it, or how to use it as an effective tool to bring their books to a wider public.

But it would be very easy for publishers to get better at it.

What do publishers have intimate knowledge of, exactly?

Books can be described with a bunch of things: a cover, an author name, a title, a list of chapters, a description of the book. This is what you might call “metadata” or “the stuff we used to put on library catalog cards.”

Inside the book we have words and sentences, maybe some images. And if you look closely at those words and sentences you can define a set of “things” found in those sentences, such as:

  • people (real or fictional characters)
  • places (real or fictional)
  • times/dates

And there are some other more abstract kinds of things:

  • concepts
  • references to other texts
  • quotations from people, or books, or articles, or movies or songs
  • examples

This list can go on and on, and of course will depend on the kind of book you are producing.

What if I asked you to build an index instead of an API?

As you look at that list above, particularly if you make non-fiction books or textbooks, you might nod your head and say … yeah that’s the stuff we put in our index.

You are right. An index is a kind of “map of the stuff in your book, and where to find it.” An index in a traditional paper book is a great tool for humans to find out what kind of things are in that book. It doesn’t take much to turn that human-readable index into a computer-readable index, which makes it, more or less, an API. (This is how search engines work: they “index” every word on an HTML page, and every page in a website, and then provide an API that allows people and other programs to find things.)

In the case of paper books, the usual way to index a book is to list out whatever you think is important (people, places, dates, references, concepts etc), and to tell the reader on which pages those things appear. You print this list with page references and stick the list in the back of your book.
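To make the paper version concrete, here is a minimal sketch in Python of that "list of things with page references." The function name `build_page_index`, the three-page sample book, and the naive case-insensitive substring matching are all assumptions made for illustration; a human indexer makes far subtler choices.

```python
from collections import defaultdict

def build_page_index(pages, terms):
    """Map each term to the sorted list of page numbers where it appears."""
    index = defaultdict(list)
    for page_number, text in sorted(pages.items()):
        for term in terms:
            # Naive match: case-insensitive substring search.
            if term.lower() in text.lower():
                index[term].append(page_number)
    return dict(index)

# Hypothetical three-page book, keyed by page number.
pages = {
    1: "John Smith arrived in Montreal.",
    2: "He had been reading about existentialism.",
    3: "Montreal was cold, and John Smith was tired.",
}
index = build_page_index(pages, ["John Smith", "Montreal", "existentialism"])
for term in sorted(index):
    print(term, index[term])
```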

In the case of ebooks, a (good) index will do something slightly different: it will list out the people, places, concepts etc, and link to where they appear in the text (since page numbers don’t mean much in an ebook).

So in an ebook (which is, after all, just a collection of HTML files, with a few other bits and bobs), an index entry links to an anchor tag in the HTML file where the entry appears.

So you might have text that reads:

blah blah John Smith blah blah.

And the HTML markup of this page, which allows you to link from your index file, is:

blah blah <a id="index-entry-000134">John Smith</a> blah blah.

And then at the end of your ebook, you’ll have an index page, which will have a corresponding list item that looks something like this, linking to your anchor in the body of your text:

<a href="/chapter4/#index-entry-000134">Smith, John</a>

This index page/HTML file will be a list of links to places in your book where certain things (index entries) appear.
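As a rough sketch of how such an index page could be generated rather than typed by hand, the Python below turns a list of entries into an HTML list of links. The function `render_index`, the tuple layout, and the second "Smith, John" mention in chapter 7 are hypothetical; only the anchor ids and the `/chapter4/` path come from the examples above.

```python
from collections import defaultdict
from html import escape

def render_index(entries):
    """Render index entries as an HTML list, one item per heading.

    `entries` is a list of (heading, chapter_path, anchor_id) tuples.
    """
    by_heading = defaultdict(list)
    for heading, chapter_path, anchor_id in entries:
        by_heading[heading].append(f"{chapter_path}#{anchor_id}")

    lines = ["<ul>"]
    for heading in sorted(by_heading, key=str.lower):
        hrefs = by_heading[heading]
        # The first link carries the heading; later ones are numbered locators.
        links = [f'<a href="{escape(hrefs[0])}">{escape(heading)}</a>']
        links += [f'<a href="{escape(h)}">{n}</a>'
                  for n, h in enumerate(hrefs[1:], start=2)]
        lines.append("  <li>" + ", ".join(links) + "</li>")
    lines.append("</ul>")
    return "\n".join(lines)

entries = [
    ("Smith, John", "/chapter4/", "index-entry-000134"),
    ("Montreal", "/chapter4/", "index-entry-000139"),
    ("Smith, John", "/chapter7/", "index-entry-000201"),  # hypothetical second mention
]
index_html = render_index(entries)
print(index_html)
```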

Add some semantic data, shake, and you’ve got yourself an API — or at least a map upon which to build your API

If instead of just making your anchor link for your index file, you spend a little bit of extra time adding some more data, you could do something like this, which will specify that at this anchor spot, you’ve defined a person named John Smith:

blah blah <a id="index-entry-000134" alt="Smith, John" class="person">John Smith</a> blah blah.

and similarly you could tag “place” text as:

blah blah <a id="index-entry-000139" alt="Montreal" class="place">Montreal</a> blah blah.

and further, you could tag concepts as:

blah blah <a id="index-entry-000157" alt="existentialism" class="concept">existentialism</a> blah blah.

If you wanted to be even fancier you could do this semantic tagging based on a defined schema such as schema.org or microformats. This would allow you (and other book publishers) to standardize how you talk about “the things that are in our books.”
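Once anchors carry this extra data, a program can read them back out. The sketch below uses Python's standard `html.parser` module to turn tagged anchors into simple records. It assumes the anchors use the `id`, `alt` and `class` attributes shown above, and that people are marked with `class="person"` in the same way places are marked with `class="place"`. The names `TaggedAnchorParser` and `extract_records` are made up for this example.

```python
from html.parser import HTMLParser

class TaggedAnchorParser(HTMLParser):
    """Collect one record per <a id="index-entry-..."> anchor."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._open = None  # record for the anchor currently being read

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        anchor_id = attributes.get("id") or ""
        if tag == "a" and anchor_id.startswith("index-entry-"):
            self._open = {
                "id": anchor_id,
                "kind": attributes.get("class"),  # None if the anchor is untyped
                "label": attributes.get("alt"),
                "text": "",
            }

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            self.records.append(self._open)
            self._open = None

def extract_records(html):
    parser = TaggedAnchorParser()
    parser.feed(html)
    parser.close()
    return parser.records

page = (
    'blah <a id="index-entry-000134" alt="Smith, John" class="person">John Smith</a> '
    'blah <a id="index-entry-000139" alt="Montreal" class="place">Montreal</a> blah.'
)
records = extract_records(page)
for record in records:
    print(record)
```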

Now if you did this with your whole book (or let someone do it for you), you could then:

a) generate an index file (which you were going to do anyway)

but even more interesting, you could:
b) generate a “smart” index file that knows not just where “all instances of ‘John Smith’ appear” but:

  • where all people appear
  • where all instances of people named John appear
  • where all instances of people named Smith appear
  • that “my dear Granny Smith” is a person and “my delicious Granny Smith” is an apple

And now instead of “just” an index, you have a complete semantic map of your book, a map that was only a little bit harder to produce than the standard index you were going to make anyway.
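Here is a small sketch of what "smart" could mean in practice. It assumes the semantic map is available as a list of records with a `kind` and an inverted `label` ("Surname, Given"); the sample records, the `food` kind for the apple, and the helper names are all hypothetical.

```python
# Hypothetical records, as might be extracted from tagged anchors.
records = [
    {"id": "index-entry-000134", "kind": "person", "label": "Smith, John"},
    {"id": "index-entry-000139", "kind": "place", "label": "Montreal"},
    {"id": "index-entry-000150", "kind": "person", "label": "Doe, John"},
    {"id": "index-entry-000162", "kind": "person", "label": "Smith, Granny"},
    {"id": "index-entry-000171", "kind": "food", "label": "Granny Smith"},
]

def of_kind(records, kind):
    """All entries of one kind, e.g. every person in the book."""
    return [r for r in records if r["kind"] == kind]

def split_name(label):
    """Split an inverted 'Surname, Given' label into (given, surname)."""
    surname, _, given = label.partition(",")
    return given.strip(), surname.strip()

def people_with_given_name(records, given):
    return [r for r in of_kind(records, "person")
            if split_name(r["label"])[0] == given]

def people_with_surname(records, surname):
    return [r for r in of_kind(records, "person")
            if split_name(r["label"])[1] == surname]

print([r["label"] for r in of_kind(records, "person")])
print([r["label"] for r in people_with_given_name(records, "John")])
print([r["label"] for r in people_with_surname(records, "Smith")])
```

Because the apple is tagged with a different kind, it never shows up in a search for people named Smith, even though the words on the page are identical.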

So what?

With a semantic map like this you have the beginnings of an incredibly powerful API. If in addition to just having the semantic map … you also have your book “live” online (behind an authentication wall or not), it means you can make this semantic map/API available to the world (for free, or under certain commercial arrangements), and say: please, find wonderful things to do with my wonderful book content.
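To give a feel for what "making the map available under defined conditions" might look like, here is a deliberately tiny sketch in Python. It is not a web server: `handle_request` is a hypothetical function that a real server would call, and the `/entries` URL scheme, the `demo-key` API key, and the record fields are all assumptions invented for this illustration.

```python
import json

# Hypothetical semantic map for one book.
SEMANTIC_MAP = [
    {"id": "index-entry-000134", "kind": "person", "label": "Smith, John", "chapter": "/chapter4/"},
    {"id": "index-entry-000139", "kind": "place", "label": "Montreal", "chapter": "/chapter4/"},
    {"id": "index-entry-000157", "kind": "concept", "label": "existentialism", "chapter": "/chapter5/"},
]

def handle_request(path, api_key=None, valid_keys=frozenset({"demo-key"})):
    """Answer /entries or /entries/<kind> with a (status, JSON body) pair."""
    # The "defined conditions": here, simply holding a valid key.
    if api_key not in valid_keys:
        return 401, json.dumps({"error": "a valid API key is required"})
    parts = [p for p in path.split("/") if p]
    if not parts or parts[0] != "entries" or len(parts) > 2:
        return 404, json.dumps({"error": "unknown resource"})
    entries = SEMANTIC_MAP
    if len(parts) == 2:
        entries = [e for e in entries if e["kind"] == parts[1]]
    # Each entry carries a link back to the passage in the live book.
    body = [dict(e, url=e["chapter"] + "#" + e["id"]) for e in entries]
    return 200, json.dumps(body)

status, body = handle_request("/entries/place", api_key="demo-key")
print(status, body)
```

The key check is the interesting design choice: the same map can be open to everyone, or limited to partners, by changing who holds a key.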

What might the world want to do with this book API?

Here are some things you might want to let people do (or maybe you want to do them yourself):

  • extract a list of characters in the book, and give each of them little biographies. Overlay these biographies in a “biographical ebook reader.”
  • extract a list of places in the book, and put them into a map
  • extract the 500 words around every appearance of the concept of “existentialism” … link all those appearances across your catalog of philosophy texts
  • extract the 500 words around every appearance of the concept of “existentialism” … across all publishers who make their semantic maps available, and sort by decade, influence, and nationality of the author
  • build a timeline of where different characters were at different times during the narrative of a book
  • link that timeline to a map
  • build a web-based “discovery navigation” (rather than linear reading), allowing users to explore a book by characters, by places, by time, by concepts
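As one example from the list above, pulling out the words around every appearance of a concept takes only a few lines. This Python sketch assumes plain text input and a single-word term; the function name `excerpts_around` and the sample sentence are hypothetical, and the window is shrunk to 4 words so the result is easy to read.

```python
import re

def excerpts_around(text, term, window=500):
    """Return about `window` words around each appearance of `term`.

    Half of the window is taken before the term and half after it.
    """
    words = text.split()
    half = window // 2
    results = []
    for i, word in enumerate(words):
        # Compare on word characters only, so "existentialism," still matches.
        if re.sub(r"\W", "", word).lower() == term.lower():
            start = max(0, i - half)
            results.append(" ".join(words[start:i + half + 1]))
    return results

text = "He read about existentialism, then walked through Montreal thinking about it."
print(excerpts_around(text, "existentialism", window=4))
```

With a semantic map, the search would run over tagged concepts rather than raw words, so a passage could be found even where the word itself never appears.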

You might also want to publish this semantic map on the web, so that Google and other search engines will have a detailed understanding of what is in your book, so that when people are looking for a specific kind of information that appears on page 119 of your book, search engines will know to point to you.
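One plausible way to publish such a map is as JSON-LD using the schema.org vocabulary. The Python sketch below builds a description of a book and the things it mentions. The mapping from the class names used earlier to schema.org types, the book title, author and URL, and the function name `book_json_ld` are all assumptions, and nothing here guarantees how any particular search engine would use the result.

```python
import json

# Assumed mapping from this article's class names to schema.org types;
# anything unlisted (such as a concept) falls back to the generic "Thing".
SCHEMA_TYPES = {"person": "Person", "place": "Place"}

def book_json_ld(title, author, url, records):
    """Describe a book and the things it mentions as a JSON-LD dictionary."""
    seen = set()
    mentions = []
    for record in records:
        key = (record["kind"], record["label"])
        if key in seen:  # list each person, place or concept only once
            continue
        seen.add(key)
        mentions.append({
            "@type": SCHEMA_TYPES.get(record["kind"], "Thing"),
            "name": record["text"],
        })
    return {
        "@context": "https://schema.org",
        "@type": "Book",
        "name": title,
        "url": url,
        "author": {"@type": "Person", "name": author},
        "mentions": mentions,
    }

# Hypothetical records and book details.
records = [
    {"kind": "person", "label": "Smith, John", "text": "John Smith"},
    {"kind": "place", "label": "Montreal", "text": "Montreal"},
    {"kind": "concept", "label": "existentialism", "text": "existentialism"},
    {"kind": "person", "label": "Smith, John", "text": "John Smith"},
]
doc = book_json_ld("A Hypothetical Book", "A. N. Author",
                   "https://example.com/book/", records)
print(json.dumps(doc, indent=2))
```

Output like this is typically embedded in a web page inside a `<script type="application/ld+json">` element.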

There are so many etceteras here, and we are just getting started. We can see little shimmers of this happening already:

  • Small Demons extracts person/place/thing information from books through an API hack — they ask publishers for EPUB files and crawl those files, and then display the people, products, other books and movies and music found inside of those books.
  • Dracula Dissected takes the text of Bram Stoker’s Dracula, and separates it by character, date, location, so you can explore the book in different ways
  • Pearson’s FT Press API lets you pull out information from EPUBs, and do various things with them.
  • Wordnik.com extracts definitional sentences from Simon and Schuster books and displays them as example sentences for dictionary entries.

These are early days, and early examples.

We will see more.

We will see more because a publisher’s job is to “make public” and an API is designed especially to put conditions on how data is made public. That’s what it’s for.

We will see more because it turns out making an API for books is easy.

How will this happen?

Tools like PressBooks, which is a web native book production tool, make adding such semantic indexing tools close to trivial. Such an indexing tool could let you automatically create a semantic index, or let you do it by hand; or most effectively, some combination of the two. Select your text to go into your index file, add the right metadata (is this a person? a concept?), and press go. The challenge is making good index files of course, with good metadata, and good schemas.
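The "combination of the two" could work roughly like the Python sketch below: an automatic pass wraps every known term in a typed anchor, and an editor overrides the guesses that are wrong. This is not how PressBooks or any real tool is implemented; `auto_tag`, the `known_terms` and `overrides` structures, and the `food` kind are invented for the example, and the anchor format follows the one used earlier in this article.

```python
import re
from html import escape

def auto_tag(text, known_terms, overrides=None, first_id=1):
    """Wrap each known term in plain text in a typed anchor.

    `known_terms` maps a term to a guessed (kind, label). `overrides` maps
    the ordinal of a match (1 for the first match in the text, and so on)
    to a corrected (kind, label): the editor's hand corrections.
    """
    if not known_terms:
        return escape(text)
    overrides = overrides or {}
    # Try longer terms first so "Granny Smith" wins over "Smith".
    ordered = sorted(known_terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))

    pieces, last = [], 0
    for n, match in enumerate(pattern.finditer(text), start=1):
        kind, label = overrides.get(n, known_terms[match.group(0)])
        entry_id = f"index-entry-{first_id + n - 1:06d}"
        pieces.append(escape(text[last:match.start()]))
        pieces.append(
            f'<a id="{entry_id}" alt="{escape(label)}" class="{escape(kind)}">'
            f"{escape(match.group(0))}</a>"
        )
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)

text = "my dear Granny Smith ate my delicious Granny Smith."
known_terms = {"Granny Smith": ("person", "Smith, Granny")}
# The editor marks the second match as the apple, not the person.
tagged = auto_tag(text, known_terms, overrides={2: ("food", "Granny Smith apple")})
print(tagged)
```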

We’ve got ebooks covered.

It’s time to start moving towards networked books with APIs.

Getting there isn’t so hard: the key to making these newfangled APIs will be the old-fashioned index.

Thanks to Brian O’Leary, Laura Dawson, Terry Jones, Erin McKean, and Dac Chartrand for helping improve an earlier version of this article. Thanks to all of them as well for the work they have done to inspire my thinking on this subject.



WordCamp Montreal 2013 Book Copyright © 2013 by Hugh McGuire. All Rights Reserved.

