Keying/Coding Specifications

Revision History

This symbol: links to sample pages that illustrate a given feature.


Target data

The data-conversion vendor will return keyed and coded text files transcribed from the page images.

Transcriptional accuracy will be 99.995% or better (error rate of 1 character/byte in 20,000). We will test and if necessary reject data by the shipment.
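As a rough illustration of the arithmetic behind this threshold (a sketch only; the function name is hypothetical, and how errors are sampled and counted is an assumption not fixed by these specifications):

```python
def meets_accuracy_target(error_count: int, total_characters: int) -> bool:
    """Return True if the error rate is at most 1 character in 20,000 (99.995%)."""
    if total_characters <= 0:
        raise ValueError("total_characters must be positive")
    # Integer cross-multiplication avoids floating-point rounding at the boundary.
    return error_count * 20_000 <= total_characters
```

Under this reading, a sample of 100,000 characters may contain at most 5 errors.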

Coding will be valid SGML, validated against the supplied dtd or a true subset thereof. This dtd is an XML-compliant extract from TEI and uses TEI semantics; the TEI guidelines (TEI P3) may be safely used as a general guide to the meaning of particular tags.

Rejection of source data

The vendor may, at its discretion, reject as much as 10% of the books submitted for conversion if they are deemed impossible to convert accurately. Valid reasons for rejection (which must be stated) include: (1) excessive abbreviation, and (2) illegible text (due to poor image or print quality).

Keying/coding guidelines

Changes. We recognize the need for consistency, and the expense entailed in changing instructions and procedures midstream; such changes will certainly be minimized. Nevertheless, there is certain to be unexpected material in the data; and there are certain to be unforeseen consequences to some of the instructions given here. These instructions, as well as the eebo_sgm.dtd, will therefore undoubtedly undergo some revision during the course of this project; most of it, probably, towards the beginning.

Exceptions. There is considerable variety in the source material and minor special instructions may be required for some books, or some portions of books, in some cases overriding the instructions given below.

Feedback. Conversion firms involved in this project are encouraged to ask questions: both to inquire about specific features not anticipated by the Guidelines, and to challenge the Guidelines (or the dtd) if they seem to produce unreasonable results. We will likewise provide advice on the conversion firms' tagging practices as quickly as possible.

Naming scheme

Page-image ID numbers

The beginning of each page (including the first page and all blank pages) should be recorded with a <PB> tag. The REF attribute of the <PB> tag is required: its value will eventually be the ID-string of the page image used by Bell & Howell to retrieve the image. Until such ID strings are available, simply record the image number within the book as the value of the REF attribute. E.g., a page appearing on the third page image will begin with <PB REF="3">.

Since most of the page images are in fact images of page openings (i.e., each image contains two pages), in most cases there will be two <PB> tags for each REF value, like this:

<PB N="6" REF="3">
<PB N="7" REF="3">
<PB N="8" REF="4">
<PB N="9" REF="4">
<PB N="10" REF="5">
<PB N="11" REF="5">
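The pairing of page numbers with image numbers shown above can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes the simplest case (every image is a two-page opening and the printed page numbers are arabic and run consecutively), which real books will often violate.

```python
def pb_tags_for_openings(first_page: int, first_image: int, image_count: int) -> list:
    """Build <PB> tags for a run of page-opening images, two pages per image."""
    tags = []
    page = first_page
    for image in range(first_image, first_image + image_count):
        for _ in range(2):  # assumption: each image shows exactly two pages
            tags.append(f'<PB N="{page}" REF="{image}">')
            page += 1
    return tags
```

Calling pb_tags_for_openings(6, 3, 3) reproduces the six tags in the example above.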

File names

The text captured from each book should be returned as a single file, *.sgm. For now use the method of file-naming specified on the EEBO FAQ page, using the Wing or Pollard & Redgrave STC number as the basis for the file name.


Material to record

With a few standard exceptions noted below, the text will be recorded in its entirety, first page to last, in the order in which it was intended to be read (top left to bottom right, left column before right column, etc.).

The chief exception is parallel texts. Running parallel texts, printed in a multi-column, multi-row, or facing-page arrangement, or some combination thereof, need to be treated as separate texts (normally, separate <DIV>s, sometimes perhaps separate <TEXT>s), each one recorded until its end and not restarted on each page. Notes and other material relating to only one of the texts on a page need to be embedded in that text, not in any of the others. If a single heading or figure applies to more than one of the parallel texts, it should be recorded at the appropriate place in each text to which it applies.

Partial or fragmentary parallel texts will normally be broken primarily at the chapter or section level (e.g. <DIV1 TYPE="chapter">), then into parallel versions of that chapter (e.g. <DIV2 TYPE="version">) when necessary. But full parallel texts (e.g. an entire Latin-English parallel New Testament, or a Latin-English parallel Boethius) will normally be broken into versions first (<DIV1 TYPE="version">), then each version into its chapters (<DIV2 TYPE="chapter">).

All material should be recorded in the form in which it appears in the book: do not attempt to correct spelling or typographic errors (except upside-down letters; see below). Spaces between words should always consist of one space character. Spacing between words is, however, often highly irregular in these books, often difficult to discern, and therefore often requires a measure of judgment. This may involve advisedly departing from the spacing that appears in the original book when sense demands it.

Material to record as attribute values

  1. 'Milestone' information

    Page numbers as printed in the book will be preserved only as the value of the "N" attribute of the <PB> (page-break) tag. Unnumbered pages should receive a <PB> tag with the N attribute omitted. Incorrect page numbers should be recorded just as they appear. Page numbers will usually consist of arabic or roman numerals, but may also appear as letters or letter-number combinations. If there appear to be multiple separate paginations, choose one to record with the <PB> tag; record the other with a <MILESTONE> tag. Ignore any typographic elements used to set off the page number. E.g. -2-, {p. 2}, and PAGE 2 should all be recorded as <PB N="2">; (ccii) .cc.ii. and -ccii- should all be recorded as <PB N="ccii">; etc.

    Placement of <PB> tags. The rules are: (1) "pages always break at the top"; that is, <PB> tags will be inserted in the text at the actual location of the page break, regardless of the location of the page number on the printed page. (2) "Divisions begin at page breaks; they don't end there"; that is, if a structural break of some kind coincides with the page break (e.g., if a new section, paragraph, etc., begins at the head of the new page), the <PB> tag should be tucked inside the opening tag for the new division, neither inside the closing tag for the old division nor between the two divisions. And (3) "Words cannot break at page breaks"; that is, if a hyphenated word straddles a page break, finish the word and any attached punctuation, then insert the <PB> tag. Treat the hyphen as any other end-of-line hyphen.
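Rule (2) can be illustrated with a small sketch that moves a <PB> tag standing immediately before an opening tag to just inside it. This is a hedged illustration, not a prescribed tool: the function name is hypothetical, the set of opening tags handled (numbered <DIV>s, <P>, <LG>) is an assumption, and the sketch moves the <PB> past one opening tag only, leaving open how deeply it should be nested.

```python
import re

# A <PB> tag, optional whitespace, then one opening tag of a new division.
_PB_BEFORE_OPEN = re.compile(r"(<PB\b[^>]*>)(\s*)(<(?:DIV[1-7]|P|LG)\b[^>]*>)")

def tuck_page_breaks(sgml: str) -> str:
    """Move each <PB> that directly precedes an opening tag to just inside that tag."""
    return _PB_BEFORE_OPEN.sub(r"\3\1", sgml)
```

For example, `</P><PB N="7" REF="3"><P>Next` becomes `</P><P><PB N="7" REF="3">Next`, while a <PB> in the middle of running text is left where it is.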

    In parallel texts, material on a single page is often recorded at widely separated points in the data stream (once in each parallel <DIV>). In that case, the <PB> tag, including the page number, should be repeated, i.e., recorded in both <DIV>s.

    Foliation. Some books may be foliated instead of paginated, i.e., every leaf may receive a number, rather than every page (in which case, typically, the back page of each leaf has no number). Record a foliated book in the same way as a paginated book, supplying the folio number as the value of the "N" attribute of the <PB> tag. A typical page sequence in this kind of book will look like this:

    <PB N="iij">
    <PB N="iv">
    <PB N="v">

    The folio number may be explicitly labeled as such ("Fol.xvii." or "Folio .cxli."). Discard the label and punctuation and record only the actual number (<PB N="xvii"> <PB N="cxli">).
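The stripping of labels and typographic dress from page and folio numbers might be sketched like this. It is a heuristic illustration only: the function name and the list of labels recognized (page, p., fol., folio) are assumptions drawn from the examples given here, and unusual numbering will still need a keyer's judgment.

```python
import re

# Labels seen in the examples above; the list is an assumption, not exhaustive.
_LABEL = re.compile(r"^(?:page|folio|fol\.|p\.)\s*", re.IGNORECASE)

def normalize_page_number(printed: str) -> str:
    """Reduce a printed page or folio number to the bare value for the N attribute."""
    s = printed.strip()
    # Drop brackets, dashes, dots, etc. at either end.
    s = re.sub(r"^[^0-9A-Za-z]+|[^0-9A-Za-z]+$", "", s)
    # Drop a leading label such as "PAGE", "p.", "Fol." or "Folio".
    s = _LABEL.sub("", s)
    # Drop any punctuation or spacing left inside the number itself.
    return re.sub(r"[^0-9A-Za-z]", "", s)
```

The result would then be recorded as, e.g., <PB N="ccii">; incorrect page numbers are still recorded just as printed.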

    Other non-structural numerations and alternative numerations. If the book contains some other running numeration system alongside folio or page references, use the milestone element to record it, and use its form, recorded with the "rend" attribute, to distinguish it from other milestones. There is no need to interpret its meaning or decide on its "unit" value, unless it is clear what that is. For example, if a book contains an unexplained sequential series of numbers (perhaps in brackets) in its margins, record them as <MILESTONE N="">; if it contains an explained series of numbers in the margins, use the accompanying explanation as the "REND" value, insofar as that is practical: if an edition, for example, contains a series of sequential references in the margin that look like this: [Boeth., cap. 43], record them as milestones like this: <MILESTONE REND="Boeth." UNIT="cap." N="43">; if a book contains a series of sequential references in the margin that look like this: "*Chapter 4 in the Greek," record them like this: <MILESTONE REND="in the Greek" UNIT="chapter" N="4">. Note that this applies only to a sequence; occasional notes of this sort should be recorded simply as <NOTE>. If in doubt whether a set of numbers represents <MILESTONE>s or <NOTE>s, use <NOTE>. Some books contain conflicting structural enumerations, e.g. a system of proposition numbers in the margins that does not correspond with the chapter numbers; the former may be recorded using <MILESTONE> tags.

  2. Structural numerations

    Line numbers in verse should be recorded only as the value of the "N" attribute of the <L> tag. Record in this fashion only line numbers actually printed in the book, and use the form of the number that appears in the book. (Line numbers in prose should usually be regarded as non-structural and recorded as milestones, as above.)

    Stanza, chapter, section numbers, etc. (that is, sequential numbers that appear in the headings to <LG>s and numbered <DIV>s) should be included as they appear in the book as part of the text surrounded by the appropriate <HEAD> tag, but should also be recorded, if possible as an arabic number, as the value of the "N" attribute of the appropriate <DIV> or <LG> tag.

    <DIV2 TYPE="chapter" N="5"><HEAD>Chapter V.</HEAD>

    <LG N="14"><HEAD>Stanza XIV.</HEAD>
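Where the heading gives a roman numeral, the arabic value for the "N" attribute could be derived along these lines. This is a sketch under stated assumptions: the function name is hypothetical, and the treatment of "j" as "i" and "u" as "v" reflects common early printing practice rather than anything these specifications require.

```python
_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(numeral: str) -> int:
    """Convert a roman numeral such as 'V.', 'XIV' or 'iij' to an integer."""
    s = numeral.strip().strip(".").lower().replace("j", "i").replace("u", "v")
    if not s or any(ch not in _VALUES for ch in s):
        raise ValueError(f"not a roman numeral: {numeral!r}")
    total = 0
    for i, ch in enumerate(s):
        value = _VALUES[ch]
        # A smaller value before a larger one is subtracted (iv, ix, xl ...).
        if i + 1 < len(s) and value < _VALUES[s[i + 1]]:
            total -= value
        else:
            total += value
    return total
```

With this, "Chapter V." yields N="5" and "Stanza XIV." yields N="14", while the heading text itself is kept exactly as printed.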

    Paragraph numbers (sequential numbers appearing at the beginning of a series of paragraphs that you have not chosen to regard as <DIV>s) should be included as they appear in the book as part of the text surrounded by the <P> tags, but should also be recorded, if possible as an arabic number, as the value of the "N" attribute of the <P> tag.

    Item numbers and label numbers in lists should be recorded as part of the text included within the <ITEM> (or <LABEL>) tags. They should not be recorded as attribute values. See below under "Lists and Tables."

    Enumerations in tables may be variously treated: given a column of their own, left as part of the text in a row, or even made part of an embedded <LIST>, whichever adequately represents the information most efficiently. See below under "Lists and Tables."

  3. Other attributes

    Language. Supply a value for the LANG attribute of numbered <DIV>s and of whole <TEXT>s, but do so only if the bulk of the text (barring notes) in that <DIV> or in that <TEXT> is in the indicated language. Supply the attribute at the highest level at which it applies: e.g., if an entire text is in Latin, add LANG="lat" to the <TEXT> tag, but not to all the <DIV> tags within that <TEXT>; if one of the <DIV1>s in a text is in Latin and another is in English, assign LANG="lat" to one of the <DIV1>s and LANG="eng" to the other; and so on.

    Assign multiple LANG values to the same <DIV> or <TEXT> only if it contains two or more languages in some kind of organized relationship. E.g., a bilingual Latin/English dictionary should be coded as <TEXT LANG="lat eng"> (with a space between the two codes). Supply a value for the LANG attribute only if you are sure what language it is; otherwise, do not use the attribute at all. Use the USMARC 3-letter language codes published by the Library of Congress. (These are identical to the 3-letter codes contained in the ISO standard 639-2.) Do not attempt to differentiate between forms of the same language: e.g., record LANG="fre" for French texts and LANG="eng" for English ones, not LANG="frm" ('Middle French') or LANG="enm" ('Middle English').
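A simple check of LANG values along these lines might look like the following sketch. The function name is hypothetical, the lower-case requirement is inferred from the examples rather than stated, and the check does not confirm that a code actually exists in the USMARC list.

```python
import re

# Period-specific codes named above, mapped to the undifferentiated code to use instead.
_PREFER = {"enm": "eng", "frm": "fre"}

def check_lang_value(value: str) -> list:
    """Return a list of problems found in a LANG attribute value (empty if none)."""
    problems = []
    for code in value.split(" "):  # multiple codes are separated by single spaces
        if not re.fullmatch(r"[a-z]{3}", code):
            problems.append(f"{code!r} is not a 3-letter lower-case code")
        elif code in _PREFER:
            problems.append(f"use {_PREFER[code]!r} instead of {code!r}")
    return problems
```

An empty result for "lat eng" means only that the value is well formed; whether the text really is Latin and English remains a keyer's decision.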

    TYPEs of DIV. Supply a value for the TYPE attribute of numbered <DIV> elements if the appropriate value is obvious; otherwise, omit the attribute entirely. If you do supply a value, use these rules:

    1. Use the designation supplied by the book itself. "Chapter 3" should be recorded as <DIV1 TYPE="chapter">

    2. Use lower-case throughout ("chapter" not "Chapter").

    3. If the designation is not in English, and there is a ready equivalent in English, use the English. E.g., for "pars" or "partie" use "part"; for "capitulum" or "chapitre" or "cm." or "chapt." or "cap." use "chapter".

      If the designation in the book is a verbose version of a common English term, use the simpler form. E.g., if the book says "Prefatory Remarks by the Author," you shouldn't be afraid to translate this into <DIV1 TYPE="preface">

      Otherwise, use whatever is there.

    4. If there is no designation in the book, and the <DIV> is used to mark a series of items of similar type, use a term describing the form or genre shared by the items. E.g., in a book of poems, use
               <DIV1 TYPE="poem">
               <DIV1 TYPE="poem">
      See further under Poetry, below.

    5. If there is no designation in the book, and the <DIV> is used to mark a series of items of dissimilar type, or if there is no series at all, just use a term that describes the form of the item as generically as can be (<DIV1 TYPE="letter">; <DIV1 TYPE="preface">).

    6. If none of these rules apply, do not supply any value for the TYPE attribute.
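Rules 1 to 3 above (use the book's own designation, in lower case, with the English equivalent where one is ready to hand) can be sketched as a small lookup. The function name is hypothetical, and the table contains only the equivalents named in rule 3; it is not a complete list.

```python
# English equivalents named in rule 3; trailing periods are removed before lookup.
_ENGLISH_EQUIVALENTS = {
    "pars": "part",
    "partie": "part",
    "capitulum": "chapter",
    "chapitre": "chapter",
    "cm": "chapter",
    "chapt": "chapter",
    "cap": "chapter",
}

def div_type_value(designation: str) -> str:
    """Return a TYPE value for a designation as printed in the book."""
    key = designation.strip().rstrip(".").lower()
    # Otherwise, use whatever is there (lower-cased).
    return _ENGLISH_EQUIVALENTS.get(key, key)
```

Verbose designations such as "Prefatory Remarks by the Author" still call for judgment and are not covered by this lookup.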

    MS. Any page that contains handwritten corrections, deletions, glosses, etc., should have the "MS" attribute of the <PB> tag for that page set to "Y".

    Provide other attribute values only when instructed to and when there is specific information to supply. Do not supply values of this sort: TYPE="unknown" or TYPE="unspecified".

Material not to record at all

  1. Running headers and footers.

  2. Catchwords and quire signatures.

  3. All other text that is simply an artifact of the printing process.

  4. Handwritten notes or other handwritten material.

  5. Text within illustrations (except captions).

  6. Separator lines and similar typographic flourishes.

  7. Most formatting. See below.

Material to record only as an empty tag or flag character marking the location

  1. Illustrations. Except for captions attached to an illustration, which should be captured using the <HEAD> tag placed within an otherwise empty <FIGURE> tag (<FIGURE><HEAD>The meaning of the Embleme.</HEAD></FIGURE>), nothing but an empty FIGURE tag should mark the location of illustrations. That is, figures without captions should be recorded by <FIGURE> tags containing nothing (<FIGURE></FIGURE>).

  2. Missing and damaged text. A span of text that appears to be more or less completely missing (e.g. because of a flaw in the image, or a tear in the page, or un-inked type in the original printing) or so damaged as to be unreadable should be marked with the empty tag <GAP DESC="damage">. If only a character or two is missing, instead of the GAP tag, insert the character "@" (decimal 64, hex 40) for each missing/damaged letter. See further under Characters, below, for the use of "@" "$" <GAP> and <UNCLEAR> markers in combination.

    Transcribe as: I charge the<GAP DESC="damage">ame of our Lord Iesus [...] he shall come to <GAP DESC="damage">quick and the dea [...] thou peruse this copie<GAP DESC="damage">ligently cor@@ct it [...] that thou put too likewise<GAP DESC="damage">ge, and set it
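The choice between "@" and <GAP> can be sketched as below. The function name is hypothetical, and the cut-off of two characters is only one reading of "a character or two"; the further combinations with "$" and <UNCLEAR> described under Characters are not covered.

```python
def damage_marker(missing_characters: int) -> str:
    """Return the marker to key for a run of missing or unreadable characters."""
    if missing_characters < 1:
        raise ValueError("missing_characters must be at least 1")
    if missing_characters <= 2:
        # One "@" for each missing or damaged letter.
        return "@" * missing_characters
    # Longer spans are marked once, with an empty GAP tag.
    return '<GAP DESC="damage">'
```

For instance, "cor" + damage_marker(2) + "ct" gives "cor@@ct", as in the transcription above.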

  3. Non-Roman alphabets. Individual letters (e.g. Greek or Hebrew letters used as manuscript sigla, symbols, reference marks, or abbreviations) should be recorded as special characters, using character entities; but words or extended passages in a non-Roman alphabet (Cyrillic, Hebrew, Greek, Arabic, etc.) should be recorded simply as <GAP DESC="foreign">, without transcribing the word(s) themselves. The <GAP> tags cannot contain any text, though any notes, milestones, page-breaks, etc. that appear within the passage should be recorded as usual, using <GAP> tags before and after the interrupting milestones as necessary.

    Surrounding structures should be preserved. A line of verse quoted in Greek, for example, should be recorded as

    <Q><L><GAP DESC="foreign"></L></Q>

    Record as: the semicircle .18.5, <GAP DESC="foreign"> .21.7, <GAP DESC="foreign"> .23

  4. Peculiar symbols. Characters and symbols (other than the kind of material listed above under "material not to record at all") that do not fall into the standard ISO sets and that we have made no provision for transcribing should be recorded with the symbol "#" (decimal 35, hex 23). See details below under Characters. If a number of unrecordable symbols occur together, record "#" for each such symbol.

  5. The presence of musical notation should be recorded with the <GAP> tag, with the value of the "DESC" attribute assigned as "music": <GAP DESC="music">.

Large structures


One text or many? Most works will consist of a single <TEXT> containing a single <BODY> element (optionally also a <FRONT> and/or <BACK> element for front and back matter respectively). Some works will consist instead of a <GROUP> element that contains multiple <TEXT>s (each <TEXT> with its own <BODY> and, optionally, <FRONT> and <BACK>). The GROUP element will be used most frequently for items that contain several works published or bound together, each with its own title page, that were originally printed separately, e.g. the collected works of an author.


The <BODY> (and, if necessary, the <FRONT> and <BACK> elements) will normally be divided into numbered <DIV>s corresponding to the main divisions of the text. Very simple documents, on the other hand, with no internal division (a work consisting of a single poem, for example) do not require <DIV>s at all: use no more <DIV> layers than necessary.


The numbered <DIV> elements, from <DIV1> to <DIV7>, represent a hierarchy: the <BODY> is subdivided into <DIV1>s; <DIV1>s, if necessary, are subdivided into <DIV2>s, and so on. <DIV>s divide into parts: with few exceptions, you need to have more than one of something to call it a <DIV>.

Individual small texts embedded within a larger work (e.g. entire poems quoted within a chapter of a treatise) should usually not be tagged as <DIV>s but should instead be placed within <Q> tags. The <Q> element may if necessary contain an entire <TEXT>, with its own <BODY>, <FRONT>, <BACK>, numbered <DIV>s etc.

Useful clues to the DIV structure include:

Weaker evidence for <DIV>s includes:

In general, these are not sufficient to establish a <DIV> and should instead be recorded as ordinary text. Numbered paragraphs, for example, should simply retain the number as part of the paragraph (and as the value of the "N" attribute of the <P> tag), but there is no need to call the number a <HEAD> and therefore make the <P> a <DIV>.

<P N="3">¶ III. In the third place, the Calvinist partie striveth ...

Marginal "headings" that you decide not to treat as <HEAD>s can usually be encoded either as <NOTE>s, with the PLACE attribute set to "marg", or (if they contain a sequential numeration) as <MILESTONE>s.

TYPES of DIVs. See above under "attributes."

Front matter

Front matter (material to include in the <FRONT> element) typically includes title pages, dedications, tables of contents, prefaces, prologues, honorific poems, remarks "to the Reader", etc., each of which should be recorded with a numbered <DIV>, their subsections recorded with higher-numbered <DIV>s, etc.

Title pages do not require special tags. Each title page should be recorded as a numbered <DIV> within the <FRONT> element. Include both the front and back (recto and verso).

Back matter

Back matter (material to include in the <BACK> element) typically includes indexes, glossaries, colophons, afterwords, appendices, etc., each of which should be recorded with a numbered <DIV>, their subsections recorded with higher-numbered <DIV>s, etc.


In general

Do not attempt to record the physical appearance of the page (centering, extra spaces, justification, type face, type size, etc.), though such cues may and should be used to determine the beginning and end of divisions within the text, the distinction between text and notes, etc. On type faces, see the special instructions below about use of the <HI> tag.

Record line-breaks (with the <LB> tag) only (1) if the text is unintelligible without a break; and (2) if there is no intervening structural tag. Many times, it is better to repeat a tag than to insert a line break in the middle of one; but more often it is possible to get by without doing either, especially if there is any punctuation at the line break. E.g., record this:

            CHAP. XI.
  Some Advantages and Helps for raising
  and affecting the Soul by Meditation.
like this:
  <HEAD>Some Advantages and Helps for raising and affecting
        the Soul by Meditation.</HEAD>
or, better, like this:
  <HEAD>CHAP. XI. Some Advantages and Helps for raising and
        affecting the Soul by Meditation.</HEAD>
but NOT like this:
  <HEAD>CHAP. XI.<LB>Some Advantages and Helps for raising and
        affecting the Soul by Meditation.</HEAD>

See below for the special case of prose interrupted by an interlinear gloss.

Paragraph breaks should be recorded with <P> in prose and with <LG> (line-group or stanza) in verse.

Typeface changes

Do not record italic or bold type, the various kinds of black-letter ("gothic") typefaces, regular roman typefaces, or fonts of different sizes as such. Instead record every change from the predominant typeface with the <HI> tag, unless you use that change as a cue to insert a structural tag of some kind. For example, a book may have black-letter text and italic headings. Record the headings as <HEAD> ... </HEAD>, not as <HEAD><HI> ... </HI></HEAD>, since you have used the change to italic as a cue to tag the italic text as a <HEAD>.

Predominance is established at the <DIV> level. E.g., if the Preface or Dedication or chapter or section (occupying its own <DIV>) is in italic, it needs no special tagging, even if the main body of the book is in some other typeface. But if an individual word, phrase, sentence, line, or paragraph is in some other face than that which is predominant in that <DIV>, then mark the "different" text with the <HI> tag.

The exception, of course, is again if you are using the change of type face as a cue to structural role: in a book that prints its text in roman and its notes or block quotations in italic, once you have recorded the italic text as a <NOTE> or a <Q>, you do not need to mark it also as <HI>. Instead, the italic type itself becomes the predominant form within the <NOTE> or within the <Q>; any changes of typeface within these tags (e.g. a single word in textura black-letter) should be recorded with <HI>.

If the text switches to yet another typeface within a section flagged with <HI>, simply mark the new typeface with another (nested) <HI> tag.

The most common contrasting type forms may be described as: (1) roman; (2) italic; (3) textura; (4) rotunda; (5) bastarda (see letter samples below), but individual books may use other contrasting forms: subtypes of italic; changes of font size; etc. The general appearance of the book must be the key: if the book intends two kinds of type to contrast, then flag the change with <HI> (as instructed above).

EXCEPTION. Many books use a "diminuendo" effect both in headings and in the beginning of text divisions: for two or three lines of text, each line is smaller than the one above it, and is sometimes in a different typeface as well. This is simply decorative, and can usually be ignored; i.e., if this is clearly what is going on, do not code the lines of contrasting appearance with <HI>.

Do not record changes of typeface within a word (e.g. a single letter or two in another typeface within a word that is otherwise in the typeface used in the immediate context).

When punctuation coincides with the end of a span marked by the <HI> element, and there is doubt as to whether the punctuation belongs inside or outside the closing tag, place it within the closing </HI> tag:

<HI>Sillepsis,</HI> or the Double supply.

If two adjacent spans of text are in two different typefaces, both of which contrast with the predominant face as well as with each other, record the two spans with two separate sets of <HI>..</HI> tags.

Record superscripted and subscripted text using the keyboard "circumflex" or "caret" character (^ = DECIMAL 94, HEX 5E) before each superscripted character (^a;, ^b;) and the same "caret" character doubled (^^) before each subscripted character (i.e., ^^a;, ^^b;, etc.).

Record ornamented capitals, large drop caps, etc., as ordinary capital letters.

Record "small caps" as ordinary capital letters.

Record vertical text (text printed perpendicularly to the main text) as if it were horizontal.

Block quotations

<Q>s are used for block quotations, whether of prose or verse. Don't use them for ordinary "inline quotations."

"Block quotations" include both quotations that are set off from the main text by indentation and blank lines (in the modern fashion) and also lengthy quotations that are set off by the use of other typographic cues, such as a series of quotation marks in the margin, or (if unambiguously marking a block quotation) a change of typeface, or some combination of these. If you're not sure if a block of text is a <Q>, simply record the appearance of the text (using, e.g. <P> and <HI>).

<Q>s are usually the best way to tag even very substantial items embedded in prose, e.g. a poem or a document of some kind quoted within a chapter, or within a note, or within an introduction.

<Q> can if necessary even contain an entire <TEXT>, with its own <FRONT> matter, <BODY>, <DIV> structure, and so on. Use <Q> for such embedded items, rather than trying to treat them as <DIV>s of the main text (unless that's really what they are). Treating them as <DIV>s forces you to treat all the material surrounding them as <DIV>s too, at the same level.
Prefer this:

       <DIV1 TYPE="introduction">
       <P>blah blah</P>
       <P>blah blah</P>
         <Q>here's a poem</Q>
       <P>blah blah</P>
to this:

       <DIV1 TYPE="introduction">
       <DIV2 TYPE="stuff before the poem">
         <P>blah blah</P>
         <P>blah blah</P>
       <DIV2 TYPE="poem">
         <LG><L>here's a poem</L></LG>
       <DIV2 TYPE="stuff after the poem">
         <P>blah blah</P>

Block quotations accompanied by citations should record the quotation within <Q> tags and the citation within <BIBL> tags; the <BIBL> will normally be placed within the associated <Q>.

Notes, etc.

Most material that is set off from the main body of the text but is adjacent and related to it can be safely tagged as <NOTE>. (But arguments (summaries at the head of <DIV>s), salutations, and speaker names and stage directions in drama are among the note-like features that have their own tags.)

Record each note at the point in the main text to which it relates, set off by appropriate tags, not at the point where it appears on the page.

A note that spills onto the next page needs to be treated as a single note, not two, and should be placed in the text where it applies.

  1. Notes tied to points

    If the note points to a place in the text which is marked with a flag of some kind (e.g. a footnote reference number, an asterisk (*), etc.), discard this marker from the note once it has served its purpose by locating the <NOTE> in the right place in the text. The corresponding character in the text itself should also be removed, but not completely: it should be preserved as the value of the "N" attribute of the <NOTE> tag. Notes that use non-alphabetical symbols such as "daggers," section-marks, paragraph marks, etc., should preserve those characters too if possible, using character entities, like this: <NOTE N="&dagger;">. If the character is not recognized as corresponding to a readily available character entity, supply "#" or "$" as the value, using the rules given below for unrecognized symbols.

    If the note is keyed to the text by line number, verse number, etc., place the note at the end of the line (etc.) to which it applies, and discard the literal number from the note.

    Use the "PLACE" attribute of the <NOTE> tag to indicate where the note appears on the page:
    • PLACE="marg" in margin or adjacent to the text (even if part of it runs across the whole page because of lack of room in the margin)
    • PLACE="foot" in a footnote, below the text
    • PLACE="inter" interlinearly (between the lines of text)
    • PLACE="inline" not distinguished from the main text by location.

    If there are multiple distinguishable sets of notes in the same location (two sets of footnotes, for example; or multiple sets of marginal notes marked by different kinds of flags, one set marked by numbers, one by letters), distinguish them by appended numbers: PLACE="foot1" and PLACE="foot2" for example.

    Example of book with two sets of marginal notes, one keyed to letters, one to numbers; record them as <NOTE PLACE="marg1"> and <NOTE PLACE="marg2">
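The PLACE values described above, with or without an appended set number, can be checked with a short pattern. This is an illustrative sketch: the function name is hypothetical, and the assumption that set numbers start at 1 is inferred from the "foot1"/"foot2" examples.

```python
import re

# The four locations listed above, optionally followed by a set number (1, 2, ...).
_PLACE = re.compile(r"(marg|foot|inter|inline)([1-9][0-9]*)?")

def is_valid_place(value: str) -> bool:
    """Return True if value is one of the PLACE values, optionally numbered."""
    return _PLACE.fullmatch(value) is not None
```

A check like this catches keying slips such as PLACE="margin" but cannot tell whether the right location was chosen for a given note.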

    Notes that apply to two (or more) distinct loci or lines should be reproduced and inserted at *both* (or all) the relevant points.

    These need to be distinguished from notes that apply to a span of loci or lines; notes applying to a span of lines should be placed after the last line in the span with indications of the length of the span (e.g., "14-23" [with reference to line numbers] or "*-*" [with reference to two "*" flags in the text]) retained.

  2. Notes tied to regions, divs, etc.

    1. Mostly in verse:

      A note that appears next to a single verse line or set of lines and seems to relate to that line (or set) should be placed at the end of the line(s) in question.

      A note that relates to a specified group of lines, verses, etc., should be moved into the text at the end of the last item to which it applies. If there are line numbers, the line number indication in the note should be preserved. If physical arrangement, rather than explicit line numbers, serves to specify the line or verse number range, and there are line numbers in the verse, supply the appropriate number range in brackets at the beginning of the note.

      Notes referenced to a line (verse, etc.) number followed by "f." ("2365 f." meaning "line 2365 and following") should be treated as notes referenced to a span of two lines (in this case, 2365-66), that is, placed at the end of the second line (2366), with the full line reference preserved in the note: <NOTE PLACE="foot">2365 f.: ... </NOTE>
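The reading of an "f." reference as a two-line span can be sketched as follows. The function name is hypothetical, and only the plain "number f." form shown in the example is handled.

```python
import re

def span_for_f_reference(reference: str) -> tuple:
    """Interpret a reference such as '2365 f.' as a span of two lines."""
    match = re.fullmatch(r"\s*(\d+)\s*f\.\s*", reference)
    if match is None:
        raise ValueError(f"not an 'f.' reference: {reference!r}")
    start = int(match.group(1))
    # "f." means the named line and the one following it.
    return (start, start + 1)
```

The note is then placed at the end of the second line of the span, with the reference kept in the note text exactly as printed.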

    2. Mostly in prose:

      A note that seems to relate to an entire text division (e.g. a <DIV> or <P>) should be inserted at the beginning of the text that comprises that division, or at the end of the <HEAD> if that is more convenient (and if it has one). E.g. a marginal note applying to a paragraph as a whole may be inserted at the beginning of the paragraph. This occurs commonly in books that contain a running summary or set of running headers in the margins: if these are not treated as <HEAD>s, or <ARGUMENT>s, they should be treated as <NOTE>s (PLACE="marg") and inserted at the beginning of the section to which they apply. If the summary is found centered at the head of the text proper (instead of in the margin) it should usually be given a tag of its own and tagged as <ARGUMENT> or <HEAD> (see below under "Heads").

      A marginal note in a prose text that seems to apply vaguely to the material next to which it is placed should be inserted at the end of the nearest sentence (as marked by punctuation), or at some other break in the text if that seems more appropriate.

      In the case of notes that supply bibliographic citations, similarity of wording between note and text may provide a clue as to the best place to insert the note, as in this example:

            Democ.Instit.    Antonius Demochares saith of him, that he was exiled
            Christ.relig.    in the persecution under Diocletian, and that he
                             returned from banishment after the death of Diocletian
                             and Licinius, and recovered his Bishoprick again,
                             where he continued until the reign of Iulian.

      <P>Antonius Demochares saith of him,<NOTE PLACE="marg">Democ. Instit. Christ. relig.</NOTE> that he was exiled in the persecution under Diocletian, and that he returned from banishment after the death of Diocletian and Licinius, and recovered his Bishoprick again, where he continued until the reign of Iulian.</P>

      Apparatus that relates generally to the material on a page, or for which the appropriate place cannot readily be determined, should be attached to the last line of text at the bottom of the page.

Reference numbers in the text that point to something other than a note (e.g. to some part of an illustration), or for which the target cannot be found, should simply be recorded as part of the text.

Passages of verse (especially 2 or more lines, quoted and arranged as verse) within a note will normally be most readily coded as a quotation (<Q>) containing <L>s or <LG>s, embedded within the <NOTE> element.

Notes comprising a running interlinear commentary or interlinear gloss pose special problems. See below.

Lists and tables

In general, prefer to record itemized sequences as <LIST>s rather than <TABLE>s if possible. Use <TABLE> when the material cannot be readily understood without the spatial organization that tables provide.

Numbered sequences of items when the items themselves are blocks of text of considerable size (numbered paragraphs, for example) should not normally be treated as lists.

Complex lists (lists within lists) should be encoded with nested <LIST> tags:

<LIST>
<ITEM> .. </ITEM>
<ITEM>
      <LIST>
      <ITEM> .. </ITEM>
      <ITEM> .. </ITEM>
      </LIST>
</ITEM>
</LIST>

Treat any numbers that enumerate items in a list as part of the text of that item; record them neither with separate <LABEL> tags nor as attribute values. E.g.:

      <ITEM>1. Avarice</ITEM>
      <ITEM>2. Sloth</ITEM>
      <ITEM>3. Pride</ITEM>

Lists of pairs may be tagged with the element pair <LABEL> and <ITEM> (in that order). If you use this option, you may omit any "leader" (e.g. a dot leader) between the paired items. E.g.:

The Prince...............Jn. Longfellow
The Pauper...............Thomas Goodrich
Joan the Tappester........Jack Smithson

      <LABEL>The Prince</LABEL><ITEM>Jn. Longfellow</ITEM>
      <LABEL>The Pauper</LABEL><ITEM>Thomas Goodrich</ITEM>
      <LABEL>Joan the Tappester</LABEL><ITEM>Jack Smithson</ITEM>
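
Where leader-dotted pairs like these have already been keyed one pair to a line, the conversion to <LABEL>/<ITEM> pairs can be sketched mechanically. The Python below is an illustration only, not part of the dtd or of any required workflow; the function name pair_to_label_item and the assumption that a leader is a run of three or more periods are made up for the example.

```python
import re

# Hypothetical helper: the name and the "three or more periods" threshold
# are assumptions for this sketch, not rules from the guidelines.
LEADER = re.compile(r"^(.*?)\s*\.{3,}\s*(.*)$")


def pair_to_label_item(line):
    """Turn 'label.....item' into a <LABEL>/<ITEM> pair, dropping the leader.

    Returns None when the line contains no dot leader.
    """
    match = LEADER.match(line.strip())
    if match is None:
        return None
    label, item = match.groups()
    return f"<LABEL>{label}</LABEL><ITEM>{item}</ITEM>"


for printed in (
    "The Prince...............Jn. Longfellow",
    "Joan the Tappester........Jack Smithson",
):
    print(pair_to_label_item(printed))
```

A single period, as in "Jn.", is left alone because only a run of periods counts as a leader; lines without a leader are returned as None so that they can be handled by hand.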

Tables should be recorded as you would using HTML tables, oriented by row, with the number of columns determined by the number of cells within the row. Use the spatial organization of the text to determine the number of rows and columns (not necessarily reflected in printed border lines). The ROWS and COLS attributes of the <CELL> tag should be used just like the ROWSPAN and COLSPAN attributes of the <TD> in HTML to indicate cells that extend across two or more rows or columns. Cells that contain a heading or label for a row (or column) should receive the attribute ROLE="label".

EEBO dtd               HTML equivalent
<ROW>                  <TR>
<CELL ROLE="label">    <TH>

Particularly complex tables may be recorded (again as in HTML) with nested <TABLE> tags, i.e., a <TABLE> within a <CELL>, or by combinations of <LIST> and <TABLE>, i.e. a <LIST> within a table <CELL> or a <TABLE> within a list <ITEM>.

Physical arrangements that cannot easily be accommodated by our simple table model (e.g., labels with text running vertically) may need to be adapted and adjusted until they fit; it is more important to preserve the relationships between the items in the table than to preserve its exact layout.

Tables that continue from one page to the next may be tagged as one continuous table, with an embedded <PB> tag, especially if its headings are not repeated on the new page. If the headings are repeated, it is usually easier to close the old <TABLE> and open a new one on the new page.

Here is a sample simple table (this one is simple enough that it could almost be done as a <LIST>).

Recorded as:

<DIV TYPE="table">
<HEAD>By this table, shall ye fynde the Epistles and Gospels, for the Son|daies, and other feastiuall dayes.</HEAD> <P>FOR TO fynde them the sooner, shall ye seke for these capital letters, <HI>A, B, C D,</HI> whi|che stande by the syde of this boke alwaies, On or vnder the letter shall you fynde a crosse &cross;, where the Epistle or the Gospell begynneth, and where the end is, there shal ye find an halfe crosse, # And the fyrst lyne in this table is alway the e|pistle, and the seconde lyne is alway the Gospell.</P>
<TABLE>
<ROW><CELL ROLE="label" COLS="3">On the fyrst Sonday in Aduent.</CELL></ROW>
<ROW>
<CELL>Rom. xiii.</CELL>
<CELL>And for as muche as we knowe</CELL>
</ROW>
<ROW>
<CELL>Math. xxi.</CELL>
<CELL>Nowe when they drew nye vnto</CELL>
</ROW>
<ROW><CELL ROLE="label" COLS="3">On the second sonday in the Aduent.</CELL></ROW>
<ROW>
<CELL>Rom. xv.</CELL>
<CELL>what so euer thynges are writen</CELL>
</ROW>
<ROW>
<CELL>Luc. xx.</CELL>
<CELL>And there shall be signes</CELL>
</ROW>
</TABLE>
</DIV>


Headings at the head of text divisions and stanzas (<DIV>s and <LG>s) should be tagged as <HEAD>. Subheadings should be tagged as <HEAD TYPE="sub">.

Some headings have special tags (see below). If heading-like material doesn't fall clearly into one of these special categories, use simple <HEAD>. Incipits ("here begins a tract about sin") are typically recorded as <HEAD>s with TYPE="incipit". Subheadings may be recorded as <HEAD TYPE="sub">.

  1. Arguments: use <ARGUMENT> for summaries or abstracts that appear at the head of a division, often headed by "ARGUMENT" ("Argumentum"; "Tharguement").

  2. Epigraphs: use <EPIGRAPH> for brief quotations (often in verse) that appear at the head of a division, with or without mention of the author or book from which it is quoted. Epigraphs are frequently centered, in quotation marks, or italics, or all three--but not always. The quotation itself should be tagged as <Q>; any attached bibliographical information on author or title or source should be recorded as <BIBL>:
    "Idlenesse is lesse harmefull then vnprofitable occupation."
    <Q>"Idlenesse is lesse harmefull then vnprofitable occupation."</Q>

    Epigraphs are a common place to find bits of non-roman script; record those bits with <GAP DESC="foreign"> as described above, but place the "foreign" portion inside the <EPIGRAPH> tag.

    Commentaries and sermons frequently quote a passage of text at the beginning (or at the beginning of each division), then comment on it. Encode these passages as <EPIGRAPH><Q> ... </Q></EPIGRAPH>.

  3. Openers: use <OPENER> for introductory phrases at the beginning of an item, especially letters, especially if the material is of some length or complexity, and especially if the item also has a distinct <HEAD> apart from the <OPENER>. If in doubt whether material qualifies as <OPENER> or <HEAD>, call it a <HEAD>, especially if the item would otherwise lack a <HEAD>. For the special use of <OPENER> with letters, see below under Letters.

  4. Bylines: use <BYLINE> for phrases indicating authorship, but only if the phrase is easily separable from other heading material. E.g.: <HEAD>The Defense of Poesie</HEAD><BYLINE>By Sir PHILIP SIDNEY Knight</BYLINE>


Material at the end of a text division that is set off from the main text is normally to be tagged as a <TRAILER> or <CLOSER>. <TRAILER> is the more general tag, used for material without such internal structures as datelines, salutations, or signatures. Typical <TRAILER>s include "Amen," "Finis," and explicits ("here ends the tract written by Master John Knox."). <CLOSER>, on the other hand, is the counterpart of <OPENER>; it is used when the concluding material includes lengthy or complex information, including datelines, salutations, or signatures, especially in letters. See Letters, below, for examples. Requests for prayer for the author's soul are typically recorded as <CLOSER>s.

Epigraphs and bylines can appear at the foot of a division as well as at its head (see above for a description of epigraphs).

Special types of texts

  1. Poetry

    Verse lines. Each verse line should be enclosed in <L> tags. Do not attempt to record the varying indentation of verse lines; pay attention to indentation only insofar as it indicates a stanza break or a "broken" line (see below).

    Broken lines. Sometimes when a verse line is too long to fit on the page, its last word or two is placed (sometimes marked off with a bracket or parenthesis) at the end of the next line or at the end of the preceding line (wherever it fits best). Such detached bits of verse lines should be recorded at the end of the line to which they really belong.

    Mary had a little lamb, [snow.
      Its fleece was white as

    <L>Mary had a little lamb,</L>
    <L>Its fleece was white as snow.</L>

    Groups of lines (<LG>s).

    1. Within a <DIV TYPE="poem">, <BODY>, or <SP>. When a poem constitutes a <BODY> or is tagged as a numbered <DIV> (or as a dramatic speech <SP>), groups of lines forming the smallest subdivisions of the poem should be enclosed in <LG> ("line-group", i.e. stanza) tags. A poem or speech containing no subdivisions (only ungrouped verse lines) does not need an <LG> tag: the <DIV> (or <SP>) tag provides enough context.

      <DIV1 TYPE="poem">
      <L>When the cat's away</L>
      <L>The mice will play</L>
      </DIV1>

    2. Interspersed with prose. <LG>s should be used around verse line(s) that alternate with prose paragraphs, so that the <LG> tag and the <P> tag serve a similar grouping function. See further below.

      <P>A stitch in time saves nine.</P>
      <LG>
      <L>When the cat's away</L>
      <L>The mice will play</L>
      </LG>
      <P>Too many cooks spoil the broth</P>

    3. Quoted within a <P>, <NOTE>, etc. When lines of verse occur within <Q> tags (e.g. quoted within a prose paragraph, in a note, or as part of an epigraph), do not place the verse lines inside a <LG> tag unless you have good reason to believe that the lines represent a complete stanza, e.g. if more than one stanza is quoted and you need to separate them; or possibly if the metrical form makes it clear that a whole stanza is quoted. If all you know is that some lines of verse are being quoted, then tag them as verse lines (<L>), period. The <Q> tag provides enough context. See further below.

      <P>John walked along, chanting constantly:
      <Q>
      <L>When the cat's away</L>
      <L>The mice will play</L>
      </Q>
      But no one noticed.</P>

      <P>John walked along, chanting constantly:
      <Q>
      <LG>
      <L>Red rover Red rover,</L>
      <L>Come over Come over</L>
      </LG>
      <LG>
      <L>The bird's on the wing,</L>
      <L>The dog's had his fling.</L>
      </LG>
      </Q>
      But no one noticed.</P>

    Lines vs. line-groups. It is often unclear when a group of lines has enough organization to be called a stanza (line-group <LG>). If in doubt, err on the side of fewer line-groups rather than more. And be consistent throughout a particular poem, so that a particular structure is not sometimes tagged as a <LG> and sometimes left untagged. Clues to look at include, in decreasing order of significance:


    Strongly suggestive:
    blank divider lines.
    drop caps

    Indicative, but need support:
    verse structure (rhyme; refrains; etc.)
    indentation (but indentation alone is insufficient to justify a <LG> element)
    paragraph signs (¶; but these are also used in many cases without any structural function)

    <LG>s vs. <DIV>s. It is not always easy to distinguish between <LG>s and <DIV>s: both can have headings; both can nest to create a structural hierarchy. Metrical units (true stanzas) are always <LG>s; verse paragraphs of irregular length are frequently best recorded as <LG>s, especially if they are not consistently supplied with headings. On the other hand, <DIV>s should be used for line-groups big enough to have true titles, or to appear in tables of contents.

    Groups of stanzas within a poem should receive a numbered <DIV> tag. In most cases, you will use only a single level of <LG> (no nesting), and treat it effectively as the lowest-level text division. Any grouping of stanzas is therefore recorded as a <DIV>.

    Entire poems. Each poem will usually be recorded as a <DIV> of the appropriate number (<DIV1> etc.), with TYPE="poem". Don't try to distinguish between different kinds of poems, between poems and songs, etc. Any discrete item in verse is TYPE="poem". Poems may, of course be subdivided further into <DIV>s and <LG>s of various types. If a book consists of a single poem, then the <BODY> element constitutes the poem. If a poem is quoted within a prose context, it is usually easiest to treat it as a <Q>. See next.

    Poetry mixed with prose. When poetry is truly interspersed with prose, and either the poetry is the predominant form, or there is no clearly predominant form, the prose should be recorded within <P> tags, the verse within <LG> tags. When poetry gives way to prose, close the <LG> and open a <P>; when prose gives way to poetry, close the <P> and open an <LG>, even if the actual prose paragraph, or even the last sentence, is not finished.


    1. Be aware that sometimes the interspersed "prose" is really a <HEAD> or a <TRAILER> to a section of the poetry; it may even be a <NOTE>.

    2. When a group of verse lines is quoted (e.g. in a passage predominantly composed of prose, or in a note), leave the prose <P> open, and embed within it the quoted passage (recorded with <L> and, if appropriate, <LG> tags as usual) within <Q> tags.

    3. An entire poem quoted within a prose context will ordinarily be treated as the <BODY> of a <TEXT> quoted within a <Q>.

  2. Drama

    Aside from a few special tags (below), prose drama should be recorded like other prose (in <P>s, etc.) and verse drama like other verse (in <LG>s, <L>s, etc.), including the rules for interspersed poetry and prose.

    Cast lists. Cast lists (often headed "dramatis personae") should be recorded like other lists, with the <LIST> tag. Cast lists will commonly appear as separate <DIV>s (within the <FRONT> matter of a book if the book contains one play). For complex cast lists, use nested lists and labels to indicate cast groupings.

    Stage directions. Stage directions should be recorded with the <STAGE> element. Stage directions sometimes appear between the columns of a multicolumn text, or in the margin, where they look like notes. In other books, they may be centered (as if they were headings) or indented (as if they were little paragraphs). They are occasionally typographically distinct (in italics; within parentheses; or both).

    Speakers. The name (sometimes abbreviated) of the speaker is recorded with <SPEAKER>. In print, these appear at the head of a speech: e.g. typically above the first line of the speech (sometimes centered), in the margin, in an indented line of its own at the head of the speech, or in italics at the beginning of the first line of the speech. Regardless of where it appears in print, the <SPEAKER> tag is tucked into the beginning of the appropriate <SP> ("speech") tag.

    Additional text associated with the speaker's name should be included in the <SPEAKER> tag, like this: <SPEAKER>Mr. Jones, reading from letter.</SPEAKER> Multiple names should be enclosed in a single set of <SPEAKER> tags, like this: <SPEAKER>Mr. Jones and Mrs. Smith.</SPEAKER>

    Speeches. The basic unit of drama is the SPEECH (<SP>). A speech normally continues uninterrupted as long as the character speaking it is uninterrupted by another speaker or by the end of a division (act, scene, etc.).

    "Songs" and other material specially set off within a speech should not normally be given any special tagging; if they have headings, they may need to be recorded as a nested <LG>. In exceptional cases when they contain an elaborate structure they may be recorded as a quotation (<Q>).

    Prologues and Epilogues should normally be treated as part of the play, recorded as <SP>s like any other speech, though they may sometimes require a numbered <DIV> of their own.

    Acts and Scenes. The act/scene structure should be recorded with appropriately TYPEd and numbered <DIV>s (e.g., <DIV2 TYPE="act" N="3">).

  3. Letters

    Personal letters that appear as text divisions should be treated as <DIV>s just like any other text division (chapters, sections, etc.). Letters quoted within running text (e.g. a letter quoted within the chapter of a book) have been given a special tag, <LETTER>. Note that dedications frequently look like letters, since they contain salutations and signatures, but they're not: treat them as <DIV TYPE="dedication">. (You may, however, still use <OPENER>, <CLOSER>, <SIGNED>, <SALUTE>, etc. in such letter-like divisions, if they apply.)

    Special tags are available to tag the salutations (<SALUTE>), signatures (<SIGNED>), and datelines (<DATELINE>) often found in letters. Use these only if they clearly apply.

    Place <SALUTE>, <SIGNED>, and <DATELINE> within <OPENER> if they appear at the head of a letter; place them within <CLOSER> if they appear at the end. See the TEI guidelines for fuller descriptions of these elements. If salutations and signatures are combined or confused in a single opener or closer, use the <OPENER> or <CLOSER> tag alone, without trying to tag the separate constituent parts.

  4. Dictionaries and glossaries will be recorded differently depending on the complexity of the entries. Some extremely simple word lists may be simple enough to record as <LIST>s. Slightly more complex ones may be recorded with <P>s (one <P> for each entry). But many dictionaries will require numbered <DIV> elements to represent dictionary entries. The headword for each entry (with any associated grammatical information) can usually be recorded as a <HEAD> to the <DIV>. Complex entries can be subdivided if necessary into component parts using higher-numbered <DIV>s. For example, the following entries from Cotgrave's French-English Dictionary of 1611 ...
    Affaicter. To trim, tricke, decke, dresse curiously, make neat, spruce, fine; to refine; also, to tame, reclaime, breake, make gentle, bring to ciuilitie.
      Affaicter vn oiseau. To man a hauke throughly.
    Affaicterie: f. A trimming, tricking, decking, neat, quaint, or fine dressing; also, neatnesse, nicenesse, curiositie, quaintnesse; also, a breaking, taming, reclayming, ciuilizing, making gentle; (hence) also, the through manning of a hauke, &c.

    ...can be recorded like this. The encoding of the phrasal subentry for "Affaicter vn oiseau" with a <DIV2> is probably superfluous in this case (a new paragraph with a <HI> heading would do as well); it is encoded more thoroughly here as an example of what can be done with more complex entries if necessary.

        <DIV1 TYPE="entry"><HEAD>Affaicter.</HEAD>
        <P>To trim, tricke, decke, dresse curiously, make neat, spruce,
        fine; to refine; also, to tame, reclaime, breake, make gentle, bring
        to ciuilitie.</P>
        <DIV2 TYPE="subentry">
        <HEAD>Affaicter vn oiseau.</HEAD>
        <P>To man a hauke throughly.</P>
        </DIV2>
        </DIV1>
        <DIV1 TYPE="entry">
        <HEAD>Affaicterie: f.</HEAD>
        <P>A trimming, tricking, decking, neat, quaint, or fine dressing;
        also, neatnesse, nicenesse, curiositie, quaintnesse; also, a breaking,
        taming, reclayming, ciuilizing, making gentle; (hence) also, the through
        manning of a hauke, &c.</P></DIV1>

  5. Interlinear commentaries and glosses pose special problems (1) because like tables, they depend on the physical layout of the page, in two dimensions, to make sense; (2) because the interlinear text may relate either to the preceding line of main text or to the following line, depending on the book, and one needs to be able to decide which it is; and (3) because they can occur in too many diverse forms to anticipate all likely variants here. Some general guidelines:


In general punctuation should be retained, but its spacing somewhat regularized. When a colon, semicolon, comma, question mark, closing quotation mark, or period falls between words, place a space after it, but none before it (unless it is being used to set off a number, like this: .lxvi. or .45. in which case it should be spaced as shown). When an opening quotation mark falls between words, place a space before it, but none after it. When a virgule falls between words, place a space before and after it. In case of doubt, follow the spacing of the original as best you can.
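
Part of this spacing rule can be sketched in Python. The sketch below is hedged: it covers only the comma, colon, semicolon, question mark, and virgule when they fall between letters. It deliberately leaves periods alone (because of number forms such as .lxvi.) and leaves quotation marks alone (because an opening mark cannot be told from a closing one without context). The function name regularize_spacing is an assumption made for the example.

```python
import re

# "Between words" is approximated here as "between two letters";
# [^\W\d_] matches any letter but no digit or underscore.
LETTER = r"[^\W\d_]"


def regularize_spacing(text):
    """Apply the between-words spacing rule to , : ; ? and the virgule."""
    # No space before the mark, exactly one space after it.
    text = re.sub(rf"(?<={LETTER})\s*([:;,?])\s*(?={LETTER})", r"\1 ", text)
    # One space on each side of a virgule.
    text = re.sub(rf"(?<={LETTER})\s*/\s*(?={LETTER})", " / ", text)
    return text


print(regularize_spacing("reason ,and lawes;so"))
print(regularize_spacing("hauke/hound"))
```

Anything the sketch does not recognize is left exactly as keyed, in keeping with the instruction to follow the spacing of the original in case of doubt.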

Record the various forms of colon, period, comma, semicolon, and virgule (slanted line) with their modern keyboard equivalents ( : . , ; / ); a vertical bar should be recorded using the &verbar; entity (since we have reserved the keyboard character for another purpose).

Question marks vary considerably in form (some of them looking like inverted semicolons); record them all with the standard "?".

Opening and closing double quotation marks should both be recorded using the ordinary keyboard double-quote character (" = HEX 22), not the &ldquo; and &rdquo; entities.

Opening and closing single quotation marks, as well as apostrophes, should be recorded with the same character, the ordinary keyboard single-quote character (' = HEX 27).

Hyphens (not dashes) should normally be recorded using the ordinary hyphen character.

Hyphens at the end of a line should be recorded as the ordinary keyboard "pipe" (vertical bar) character, unless they appear between numerals, when they should be recorded with the ordinary hyphen. Be aware that hyphens in many texts may appear as an angled stroke, not a horizontal one, and may also commonly appear doubled, resembling an equals sign (=), either horizontally or at an angle.

If there is no end-of-line hyphen, but you think that there should have been (i.e., that a single word has been broken across two lines), place a plus sign, instead of a space, between the two halves: "cro+wn" "pri+nce"
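
The end-of-line hyphen rule can be illustrated with a short Python sketch. It assumes that the text has been keyed one printed line per string and that every line-end hyphen was keyed as an ordinary hyphen; the function name record_line_end_hyphens is hypothetical. It does not attempt the plus-sign rule, which depends on a keyer's judgment that a word has been broken.

```python
def record_line_end_hyphens(lines):
    """Join printed lines, recording each end-of-line hyphen as '|'.

    A hyphen that stands between two numerals is kept as a hyphen.
    """
    pieces = []
    for index, line in enumerate(lines):
        text = line.rstrip()
        following = lines[index + 1].lstrip() if index + 1 < len(lines) else ""
        if text.endswith("-") and following:
            between_numerals = text[-2:-1].isdigit() and following[:1].isdigit()
            if between_numerals:
                pieces.append(text)              # keep the ordinary hyphen
            else:
                pieces.append(text[:-1] + "|")   # record the break as a pipe
        else:
            # An unhyphenated line break becomes a single space.
            pieces.append(text + (" " if following else ""))
    return "".join(pieces)


print(record_line_end_hyphens(["for the Son-", "daies, and other"]))
print(record_line_end_hyphens(["see pages 12-", "14 of the table"]))
```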

Dashes should be recorded using the entity &mdash;, regardless of where they appear.
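
If a keying tool produces typographic quotation marks or dashes, the substitutions described above can be applied in one pass. This is a minimal Python sketch under stated assumptions: the input is Unicode text, and the function name normalize_marks and the choice of code points are ours for illustration. It must run before any pipe characters are introduced for end-of-line hyphens, since it converts every vertical bar it finds to &verbar;.

```python
# Mapping of typographic characters to the forms asked for above.
MARKS = str.maketrans({
    "\u201c": '"',        # left double quotation mark
    "\u201d": '"',        # right double quotation mark
    "\u2018": "'",        # left single quotation mark
    "\u2019": "'",        # right single quotation mark or apostrophe
    "\u2014": "&mdash;",  # em dash
    "|": "&verbar;",      # printed vertical bar (the key itself is reserved)
})


def normalize_marks(text):
    """Replace typographic quotes, dashes, and vertical bars."""
    return text.translate(MARKS)


print(normalize_marks("\u201cFinis\u201d \u2014 said \u2018he\u2019"))
```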

Ellipses, whether two characters or many--strings of dots or asterisks indicating omitted or missing text--should be recorded as ordinary text, not the &hellip; entity, using periods or asterisks as appropriate: . . . . . * * * * * . .

Some editions of prose mark extended quotations by placing quotation marks at the beginning of every quoted line. The same technique is used in other books to mark proverbs and other sententious remarks. E.g.,

     he made reasons...seyenge:  God made alle thynges
   " by reason, and governethe thynges
   " made by reason; the sterres be movede by reason; and so
   " oure naturalle lyfe excedynge from reason by slawthe and
   " ignoraunce awe to be reducede by lawes and reasons.
   " Wherefore thau3he there be somme thynges in the rule of
   " seynte Benedicte, the intellect of whom the dullenesse of my
   " mynde may not comprehende, y suppose hit be beste to 3iffe
   " credence to auctorite. Wherefore also he persuadeth hymselfe ...

     O no (said Cecropia) company confirmes reso-
   " lutions, & lonelines breeds a werines of ones thoughts,
   " and so a sooner consenting to reasonable profers.

If this is really a block quotation, and you can identify the beginning and end of it, go ahead and place the whole block of text marked by the marks inside <Q> tags. Bear in mind that the marginal quotation marks can take unusual forms (sometimes they look like a pair of commas), and that it is not always easy to discover where the quotation actually begins and ends. The upper example above could be encoded as:

<P> ... he made reasons...seyenge:
<Q>God made alle thynges by reason, and governethe thynges made by reason; the sterres be movede &startq; by reason; and so oure naturalle lyfe excedynge from reason by slawthe and ignoraunce awe to be reducede by lawes and reasons. Wherefore thau3he there be somme thynges in the rule of seynte Benedicte, the intellect of whom the dullenesse of my mynde may not comprehende, y suppose hit be beste to 3iffe &endq; credence to auctorite.</Q>
Wherefore also he persuadeth hymselfe ... </P>

Whether or not you can identify the quotation well enough to tag it as <Q>, record the first and last of the marginal quotation marks with the special entities &startq; (first mark) and &endq; (last mark). If there is only one such marginal quotation mark (as sometimes happens with short quotations or proverbs), use both entities in sequence (&startq;&endq;).
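
The placement of &startq; and &endq; can be sketched as follows. This Python fragment is an illustration under assumptions: each string is one printed line, the lines passed in contain a single marked passage, a marginal mark is a double-quote character at the very start of a line, and the function name record_marginal_quotes is hypothetical. Deciding whether the passage is a true quotation to be wrapped in <Q> is left to the keyer.

```python
def record_marginal_quotes(lines, mark='"'):
    """Drop marginal quotation marks, keeping &startq; and &endq; only."""
    stripped = [line.strip() for line in lines]
    marked = [i for i, text in enumerate(stripped) if text.startswith(mark)]
    pieces = []
    for i, text in enumerate(stripped):
        if i in marked:
            text = text[len(mark):].lstrip()   # remove the marginal mark
            entity = ""
            if i == marked[0]:
                entity += "&startq;"           # first marked line
            if i == marked[-1]:
                entity += "&endq;"             # last marked line
            if entity:
                text = entity + " " + text
        pieces.append(text)
    return " ".join(pieces)


sample = ["he said", '" first marked line', '" middle line',
          '" last marked line', "and went on"]
print(record_marginal_quotes(sample))
```

When only one line carries a mark, the same line is both first and last, so the two entities come out in sequence (&startq;&endq;), as the rule requires.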

Braces and brackets that group multiple lines should be ignored if all they do is group portions of ordinary running text, such as poetry. But if they are used to link one piece of text to another, as they frequently are in tables and lists, their meaning needs to be interpreted. Sometimes this will require entering text more than once, e.g. if the brace means "this word applies to all these other words," the easiest technique may be simply to apply the word to all of the other words by entering it that many times; sometimes it may require treating the single item as a head or label for a list containing the grouped items; sometimes it may involve attaching a ROWS or COLS attribute to a table <CELL>. Many variations are possible, which the following examples can only suggest.

      chapter   1 How to build a kite
                2 When to fly a kite
                3 Famous kite flyers of our time
                4 When not to fly a kite
                5 "I've flown it: now what?"

(Brace used like a "ditto" mark to associate one word repeatedly with a series of items; may be recorded as follows, by repeating the word:)

      <LABEL>chapter 1</LABEL>
       <ITEM>How to build a kite</ITEM>
      <LABEL>chapter 2</LABEL>
       <ITEM>When to fly a kite</ITEM>
      <LABEL>chapter 3</LABEL>
       <ITEM>Famous kite flyers of our time</ITEM>
      <LABEL>chapter 4</LABEL>
       <ITEM>When not to fly a kite</ITEM>
      <LABEL>chapter 5</LABEL>
       <ITEM>"I've flown it: now what?"</ITEM>

      Dramatis Personae
        Joan, a noblewoman
        John, a philosopher

(Brace used to associate one item as a head of a set of other items; may be recorded as follows, placing the one item in a <HEAD> tag and the list of items in <LIST> and <ITEM> tags:)

      <LIST>
      <HEAD>Dramatis Personae</HEAD>
       <ITEM>Joan, a noblewoman</ITEM>
       <ITEM>John, a philosopher</ITEM>
      </LIST>

      In apice trianguli.      Triangulus.
      In basi præcedens 3.
      Sequens & vltima. 3.

(Brace used in a table to place one cell in conjunction with a set of other cells; may be recorded using the COLS or ROWS attribute of the <CELL> tag:)

      <ROW><CELL>In apice trianguli.</CELL>
      <CELL ROWS="3">Triangulus.</CELL></ROW>
      <ROW><CELL>In basi praecedens 3.</CELL></ROW>
      <ROW><CELL>Sequens &amp; vltima. 3.</CELL></ROW>


Basic letter forms. Most letters encountered will belong to the modern alphabet, though their appearance may be strange.

  1. "u" and "v," though often interchangeable in spelling, should be recorded as they appear ("u" for "u", "v" for "v", without applying modern spelling practice).

  2. Lower-case "j" is really just a variant of lower-case "i", but record the form that seems to be intended ("i" for "i" and "j" for "j"); "j" appears most often paired with "i" in order to distinguish the pair from letters like "u" or "n"; one thus finds roman numerals like this: xvij, xij, or Latin plurals like this: alijs. The dot on the "i" and "j" is often in the form of a slanted line, like an acute accent, but record these letters as ordinary dotted "i" and "j," NOT as &iacute; or &jacute;.

  3. Upper-case "I" and "J", on the other hand, are difficult if not impossible to distinguish: always record them as "I".

  4. Many books print a pair of "v"s or even a paired "Uv" where we would expect a "w"; do not convert these pairs to "w" but record whatever letters actually appear: "uu" "vv" etc.

  5. Lower-case "x" frequently resembles lower-case "r".

Ligatures. Ligatured characters (ae, oe, ct, st, sp, fi, ff, ss, etc.) should be recorded as two separate characters. Ignore the ligature. Be aware that the italic "ae" ligature usually has no upper bow to the "a" and is easily mistaken for "oe". Be aware also that italic fonts especially tend to have ligatures between many more pairs of letters than we are accustomed to.
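
Where text arrives with ligatures already encoded as single characters (from OCR output, say), the rule amounts to a simple substitution. The Python sketch below is illustrative only: the function name expand_ligatures is hypothetical, and the table lists just those ligatures named above that have a single Unicode code point (ct and sp, for example, have none and would already be keyed as two letters).

```python
# Precomposed ligature characters and the two letters to record instead.
LIGATURES = str.maketrans({
    "\u00e6": "ae",  # ae ligature
    "\u0153": "oe",  # oe ligature
    "\ufb00": "ff",  # ff ligature
    "\ufb01": "fi",  # fi ligature
    "\ufb06": "st",  # st ligature
})


def expand_ligatures(text):
    """Record each ligature as two separate characters."""
    return text.translate(LIGATURES)


print(expand_ligatures("In basi pr\u00e6cedens 3."))
```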

Ampersands, whether shaped like & or like "7," should be recorded as &amp;.

Some examples of ampersands (&amp;)

Letters printed upside-down (a common printer's error) should be recorded as if turned right side up.

Recognizable letters with diacritics

  1. e-hook (an "e" with a small hook or reversed cedilla attached to the bottom of the bowl) should be recorded as "ae".

  2. "Macrons" and similar strokes over letters

    1. A more-or-less horizontal stroke above a single letter should be recorded using the keyboard "tilde" (or "swung dash") character (a~, e~, i~, o~, u~, v~, y~, m~, p~ ). This stroke may be somewhat slanted in some type faces, may resemble a tilde or a breve, or may be reduced to little more than an elongated dot.

      Some examples of 'macrons' (~)

    2. A similar stroke forms part of the normal letter form of some characters (especially v and w) in some typefaces (especially in the Bastarda group). These should not, of course, receive any treatment other than as an ordinary letter.

    3. A similar stroke over two or more contiguous letters (whether or not it crosses an upright stroke on one of the letters) should be treated as a generic abbreviation mark; i.e., it should not be recorded as a character at all, but the entire word should be placed within <ABBR> tags.

  3. Acute, circumflex, and grave accents on vowels should be recorded with the standard ISO character entities (&agrave;, etc.). The dots on "i" and "j" frequently resemble acute accents but should not be recorded as such. Macrons and superscript letters may both sometimes resemble acute accents.

  4. Other diacritical marks attached to letters, unless they can be identified as superscript letters or are specifically listed below as forming an abbreviation "symbol," should not be recorded as diacritics at all; instead the entire word containing the letter(s) should be placed within <ABBR> tags.

    Some "general" abbreviation diacritics
    <ABBR>Cantuar</ABBR>, <ABBR>clico</ABBR>,
    <ABBR>Suff</ABBR>, <ABBR>qd</ABBR>,
    <ABBR>Marchionis</ABBR> <ABBR>Ric</ABBR>
    <ABBR>Alred</ABBR> <ABBR>vl</ABBR>
    <ABBR>red</ABBR> <ABBR>apd</ABBR>

  5. Superscripts

    1. A small superscript letter appearing directly over another letter should be treated as if it followed the letter: "y" with a small superscript "t" over it or next to it (two forms of the common abbreviation for "that") should both be recorded as if the "t" followed the "y", that is, as "y^t".

    2. In English-language text, the commonest such abbreviations by superscript are y^e for "the" (the superscript "e" is frequently printed so that it resembles a "c", or is reduced to a barely recognizable form, but record y^e anyway); y^t for "that" (with the "t" frequently reduced to little more than a vertically elongated dot); y^u for "thou"; w^c for "which"; and w^t for "with."

      Some common abbreviations by superscript
      "thou" (y^u)
      "that" (y^t)
      "the" (y^e)
      "with" (w^t)

    3. Vaguely superscript-like strokes in Latin and French texts (if no actual superscripted letter can be distinguished) should be treated as an unidentified diacritic or general abbreviation mark, causing the entire word to be placed in <ABBR> tags.

  6. Abbreviation symbols. A number of abbreviation symbols, mostly based on ordinary letters, are distinctive enough and consistent enough in appearance to be recognized. Each should be recorded with its own character entity.

    1. The following table illustrates the commonest abbreviation symbols. More may be added later. Note that some have conditions attached; e.g., the "q3"- or "q;"-like symbol illustrated below means "-que" when it appears at the end of a word, but means something quite different (e.g. "quam," especially if it has a stroke over it) when it stands alone.

      Symbol   Record as:   Meaning       Examples:   Conditions:
               &abper;      per, par
                                                      at the end of a word only
                                                      at the end of a word only
               &absed;      sed                       only when forming a word by itself
               &abcon;      con-, cum-                at the beginning of a word only
               &abrum;      -rum                      at the end of a word only
                                                      at the end of a word only

    2. Other abbreviation symbols:

      1. If there is a recognizable base letter to which modifications or additional strokes have been added, record the base letter, ignore the additional strokes, and place the entire word within <ABBR> tags, as described above.
        Some modified but identifiable letters: (p); (q); (d).

      2. If there is no identifiable base letter, record the symbol with the hash character (#) and place the entire word (if the symbol appears within a word) within <ABBR> tags.
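
The position conditions attached to the abbreviation symbols lend themselves to a mechanical check. The following Python sketch is one possible way to do it, not a prescribed tool: the names `CONDITIONS` and `check_symbol_positions` are hypothetical, only the entities whose names and conditions are given in the table above are included, and each word is assumed to have been separated from surrounding punctuation beforehand.

```python
# Position conditions for the entities named in the table above.
CONDITIONS = {
    "&absed;": "alone",   # only when forming a word by itself
    "&abcon;": "start",   # at the beginning of a word only
    "&abrum;": "end",     # at the end of a word only
}


def check_symbol_positions(word):
    """Return the entities in `word` that violate their position condition."""
    problems = []
    for entity, condition in CONDITIONS.items():
        start = word.find(entity)
        while start != -1:
            end = start + len(entity)
            at_start = start == 0
            at_end = end == len(word)
            allowed = {
                "alone": at_start and at_end,
                "start": at_start,
                "end": at_end,
            }[condition]
            if not allowed:
                problems.append(entity)
            # Look for a further occurrence of the same entity.
            start = word.find(entity, end)
    return problems
```

A word such as `&abcon;stant` passes, while `in&abcon;` is reported because the symbol does not stand at the beginning of the word.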

Letters from other alphabets, e.g. Hebrew and Greek, when used singly (as opposed to in whole words or extended text) should be recorded with ISO standard character entities.

Other symbols include alchemical and astrological symbols, which will rarely if ever appear as part of words, but may appear in or as marginal notes, in designations of units of measure, in calendrical tables, etc.

  1. A selection follows.

    Symbol   Example   Meaning   Record as
    Zodiacal signs
    Planetary signs
    Other signs

  2. Symbols and marks not listed here

    1. Recognizable standard symbols should receive the standard ISO character entity if one exists.

    2. Unrecognized symbols and marks, and symbols other than those listed here or in the standard ISO character sets, should be recorded either

      1. with the hash character (#) if the symbol is clear enough but is not listed here or in the ISO sets; or

      2. as "$" if you're not sure what to make of it.
        Dubious characters. Individual characters that cannot be readily identified as one thing or another ("Is this a funny-looking 'q' or some kind of symbol?" "Is this a 'c' or a 't'?") should be recorded as "$". However, do not overuse this expedient: if the same symbol recurs repeatedly in a book, ask us for help in identifying it; do not simply record dozens or hundreds of examples of the same symbol with "$".
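
The order of preference described above (a listed or ISO entity first, then "#", then "$") can be summarized as a small decision function. This is a sketch only; the function name `record_symbol` and its parameters are hypothetical, and deciding whether a symbol is clearly printed remains a human judgment.

```python
def record_symbol(entity=None, clearly_printed=True):
    """Choose how to record a symbol that is not an ordinary letter.

    entity: the character entity for the symbol, if it is listed in these
        guidelines or in the standard ISO sets; None otherwise.
    clearly_printed: whether the symbol's shape is clear on the page.
    """
    if entity is not None:
        return entity          # recognizable symbol with an entity
    if clearly_printed:
        return "#"             # clear, but not listed anywhere
    return "$"                 # not sure what to make of it
```

For instance, a section sign would be passed in with its ISO entity, `record_symbol("&sect;")`, while an unlisted but clearly printed mark gives `#`.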

"Excessive" abbreviation. If sampling shows that more than one word in every ten in a given text contains an abbreviation symbol, a dubious mark ($), a peculiar symbol mark (#), or an <ABBR> tag, the work should be rejected for conversion.

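The one-word-in-ten test above can be approximated on a keyed sample. The Python sketch below is illustrative rather than an official acceptance tool: the names `abbreviation_counts` and `is_excessive` are hypothetical, words are assumed to be separated by whitespace, and abbreviation-symbol entities are assumed to be exactly those whose names begin with "ab" (as in &abper; or &abrum;).

```python
import re

# Tags other than <ABBR> (for example <UNCLEAR CAUSE="del">) are stripped first
# so that their attributes are not miscounted as words.
OTHER_TAGS = re.compile(r"</?(?!ABBR\b)[A-Za-z][^>]*>", re.IGNORECASE)

# Assumption: an abbreviation-symbol entity is any entity named "&ab...;".
MARKED = re.compile(r"&ab[A-Za-z0-9]+;|[$#]|<ABBR>", re.IGNORECASE)


def abbreviation_counts(sample):
    """Return (marked_words, total_words) for a keyed text sample."""
    words = OTHER_TAGS.sub("", sample).split()
    marked = sum(1 for word in words if MARKED.search(word))
    return marked, len(words)


def is_excessive(sample):
    """True if more than one word in every ten is marked."""
    marked, total = abbreviation_counts(sample)
    # Integer comparison avoids floating-point rounding at the 10% boundary.
    return marked * 10 > total
```

A sample with exactly one marked word in ten is not rejected by this sketch, since the rule speaks of more than one word in every ten. Superscript abbreviations such as y^e are not counted, because the rule does not list them.
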
Illegible text (text that is blurred, blotted, bled-through, or otherwise hard or impossible to read) should be surrounded with the tag <UNCLEAR>. If the text appears to have been deliberately erased or crossed out, specify deletion as the cause with <UNCLEAR CAUSE="del">. Text within the <UNCLEAR> tags should be recorded as usual. For example, characters, diacritics, superscripts, and symbols should be recorded as above as far as possible; characters that can't be identified with any confidence should be recorded as "$"; characters that are more or less completely gone or too damaged to read, as "@"; peculiar symbols, as "#"; and so forth. Discernible features, too, within an UNCLEAR span should be recorded as usual.

transcribed as: as <UNCLEAR>$p@$</UNCLEAR> hostility

transcribed as: one <UNCLEAR>accord @@@</UNCLEAR>
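
The tagging of illegible spans can be sketched as a small helper. The function name `mark_unclear` is hypothetical; the tag, the CAUSE="del" attribute, and the "$" and "@" placeholders are those described above.

```python
def mark_unclear(text, deleted=False):
    """Wrap an illegible span in <UNCLEAR> tags.

    text: the span as keyed, with "$" for characters that cannot be
        identified and "@" for characters that are gone or unreadable.
    deleted: True if the text was deliberately erased or crossed out.
    """
    open_tag = '<UNCLEAR CAUSE="del">' if deleted else "<UNCLEAR>"
    return open_tag + text + "</UNCLEAR>"
```

Calling `mark_unclear("$p@$")` reproduces the tagged span in the first example above, and `mark_unclear("accord @@@")` the second.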

The following samples are far from a definitive list of letter forms, but are meant only to provide some help recognizing the most common letters in the most common typefaces. Many books will have to be considered individually, the form(s) of each letter ascertained by its presence in a recognized word or unambiguous context so as to create, in effect, an alphabet or set of alphabets for that book. The samples below are arranged under headings that describe the most common families of type: roman, italic, textura, rotunda, and bastarda. There are many variants of each of these (except rotunda, which is fairly uniform) which may differ very considerably from the examples given here. And individual misprinted and ill-aligned letters may present a very anomalous appearance.

Record as:   Textura   Italic   Bastarda   Rotunda