Why markup?

The TCP texts as keyed were and are encoded in Standard Generalized Markup Language, or SGML (the precursor to XML), using a series of schemas originally abridged from the DTD known as TEI P3, or rather from the popular subset of P3 called TEI-lite. TEI P3 (a searchable version of which is still maintained in the University of Michigan Library) was the version of the schema, and of the accompanying guidelines, in wide use when the TCP was getting started in 1999.  That original DTD was much modified during the course of the TCP project as it encountered features in the books that the schema could not handle well. Some of these modifications anticipated changes made by TEI and some imitated them, and many of the discrepancies were reconciled in the end. The original SGML DTD was also replaced for most purposes, including distributing and indexing the texts, by a near-equivalent XML DTD.

Function and purpose

The details of the markup language are less important than its function and purpose.  Like all markup, the TCP markup records information about the book and embeds that information within the text of the book at the point of relevance.

The purpose of this encoding is to explicitly mark the parts and structure of the text, and the relationships between these parts, so that the document’s structure can be understood by a computer just as human readers make sense of the layout of a typeset page.  It says, in effect, ‘this string of text is a chapter heading; this is an epigraph; this is a date; this is a table column; this is a chapter; this is a quoted piece of poetry divided into stanzas.’ And so on.
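
As a rough illustration of what it means for a computer to "understand" structure in this way, the following Python sketch parses a small invented fragment and reports what each marked-up part is. The fragment and its element names (div, head, epigraph, date, lg, l) are assumptions that follow general TEI conventions; they are not taken from the TCP DTD, and the LABELS table and describe function are hypothetical names used only for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample text. The element names follow general TEI conventions
# and are assumptions for illustration, not the TCP's actual tag set.
SAMPLE = """\
<div type="chapter">
  <head>Chap. I. Of the Beginning</head>
  <epigraph><q>Time tries all.</q></epigraph>
  <p>Written in the year <date>1642</date>.</p>
  <lg type="stanza">
    <l>First line of verse,</l>
    <l>second line of verse.</l>
  </lg>
</div>
"""

# Hypothetical mapping from element name to a human-readable description.
LABELS = {
    "div": "chapter",
    "head": "chapter heading",
    "epigraph": "epigraph",
    "p": "paragraph",
    "date": "date",
    "lg": "stanza",
    "l": "verse line",
}


def describe(xml_text):
    """Return (label, text) pairs for every labelled element, in document order."""
    root = ET.fromstring(xml_text)
    found = []
    for element in root.iter():
        label = LABELS.get(element.tag)
        if label:
            # Join all nested text and collapse runs of whitespace.
            text = " ".join("".join(element.itertext()).split())
            found.append((label, text))
    return found


for label, text in describe(SAMPLE):
    print(f"{label}: {text}")
```

Each printed line pairs a structural label with the string of text it applies to, which is the same statement the markup makes: this string is a chapter heading, this one is a date, and so on.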

This encoding creates the potential for many possible uses, such as targeted searching (for example, searching for a term only when it appears in stage directions or in verse), flexible rendering of the texts (for example, placing all notes in a margin, or at the end of the document), intelligent navigation (allowing for the automatic generation of a table of contents), and more.
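
Two of these uses, targeted searching and the automatic generation of a table of contents, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the sample play is invented, the element names (stage, sp, l, head, div) follow general TEI conventions rather than the TCP DTD itself, and the helper functions search_within and table_of_contents are hypothetical names introduced for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample. Element names are TEI-style assumptions, not the TCP tag set.
PLAY = """\
<text>
  <div type="act"><head>Act I</head>
    <stage>Enter the King with a sword.</stage>
    <sp><speaker>King</speaker><l>My sword shall speak for me.</l></sp>
  </div>
  <div type="act"><head>Act II</head>
    <stage>Exit King.</stage>
    <sp><speaker>Queen</speaker><p>He has left his sword behind.</p></sp>
  </div>
</text>
"""


def normalized_text(element):
    """Return the element's full text with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def search_within(root, tag, term):
    """Find a term only where it occurs inside elements with the given tag."""
    term = term.lower()
    return [
        normalized_text(element)
        for element in root.iter(tag)
        if term in normalized_text(element).lower()
    ]


def table_of_contents(root):
    """Collect the heading of every division, in document order."""
    return [normalized_text(head) for head in root.findall(".//div/head")]


root = ET.fromstring(PLAY)
print(search_within(root, "stage", "sword"))  # stage directions only
print(search_within(root, "l", "sword"))      # verse lines only
print(table_of_contents(root))
```

The word "sword" occurs three times in the sample, but the search restricted to stage directions returns only "Enter the King with a sword.", and the search restricted to verse lines returns only the King's line. The Queen's prose speech is ignored by both. The table of contents comes out as "Act I" and "Act II" without anyone having compiled it by hand.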

The current de facto standard for text encoding of primary sources in the humanities is TEI P5. Although the TCP’s XML schema does not conform to this version of the TEI guidelines (it is a bit closer to TEI P4), neither does it differ so radically that conversion between them is impossible. In most respects, in fact, conversion is quite straightforward, and the “semantics” of TCP markup remain very much within the TEI world.

Thanks to colleagues at Oxford, especially Sebastian Rahtz, indispensable pillar of TEI and friend to TCP, a set of stylesheets (run from an Ant script) is available that can generate P5-conformant versions of the TCP texts. Such versions are at present available for all the TCP output, though in a few cases the P5 files may lag behind the TCP XML files as regards revisions.

History and philosophy of the TCP schema

In 2000, a working group came together to develop a schema for the TCP’s encoding practice. The group worked within traditions of library-based digitization that had developed at universities like Michigan and Virginia during the preceding decade, and particularly within the tradition such libraries had adopted of light structural markup: markup that was reasonably efficient to apply and relatively devoid of bias toward any particular sort of use. The Document Type Definition (DTD) that the original group developed, effectively a subset of TEI-lite, outlines which tags may be used, and under which circumstances, throughout a document. Working from that DTD, and from general guidelines arrived at by the working group, library staff at Michigan produced a more comprehensive and comprehensible set of keying and coding instructions from which our data conversion firms could work.

That original DTD, and those original keying instructions, necessarily suffered changes when they encountered the vast variety of early print.  And not all of the original decisions turned out to be wise ones. But most were, and the general philosophy of those original decisions proved robust enough to see the project through tens of thousands of encoded books.

Because EEBO contains so many different types of text, because the corpus contains so much text, and because most of the encoding would have to be done by taggers not versed in the history, controversies, or personnel of the early modern period, the group determined that the DTD would reflect a ‘low’ and fairly generic level of tagging, one that responded especially to visual cues in the books and could therefore be readily recognized by inexpert taggers.

Although there are some exceptions, the TCP as a general rule emphasizes capturing block-level elements (such as paragraphs, stanzas, verse lines, block quotations, dramatic speeches and stage directions, chapters, tables, lists, and figures), as well as font changes, but not phrase-level elements (such as individual names, places, and dates mentioned in the text). The goal is that the TCP’s text will be a useful starting point for scholars who wish to enhance a text (or many texts) with additional, more granular markup.

Copies of the current DTDs and keying instructions are linked from the TCP’s page of internal communication and documentation.