Revision History
Text in red is included for purposes of discussion.
The data-conversion vendor will return keyed and coded text files transcribed from the page images.
Transcriptional accuracy will be 99.995% or better (error rate of 1 character/byte in 20,000). We will test and if necessary reject data by the shipment.
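The accuracy threshold can be checked with simple integer arithmetic. The sketch below is illustrative only: it assumes that errors have already been counted in a shipment (or in a sample of it), which these guidelines do not describe, and the function names are hypothetical.

```python
# Illustrative sketch: the 99.995% target expressed as "at most 1 error per 20,000 characters".
# How errors are counted or sampled is an assumption, not something the guidelines specify.
ERROR_RATE_DENOMINATOR = 20_000


def max_allowed_errors(total_characters: int) -> int:
    """Largest error count that still meets the 1-in-20,000 rate."""
    if total_characters < 0:
        raise ValueError("total_characters must be non-negative")
    return total_characters // ERROR_RATE_DENOMINATOR


def shipment_passes(errors_found: int, total_characters: int) -> bool:
    """True if errors_found / total_characters is no worse than 1 / 20,000."""
    if errors_found < 0 or total_characters < 0:
        raise ValueError("counts must be non-negative")
    # Cross-multiplying keeps the comparison in integers and avoids rounding.
    return errors_found * ERROR_RATE_DENOMINATOR <= total_characters
```

For example, a shipment of 1,000,000 characters would tolerate up to 50 errors under this reading.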
Coding will be valid SGML, validated against the supplied dtd or a true subset thereof. This dtd is an XML-compliant extract from TEI and uses TEI semantics; the TEI guidelines (TEI P3) may be safely used as a general guide to the meaning of particular tags.
The vendor may, at its discretion, reject as much as 10% of the books submitted for conversion if they are deemed impossible to convert accurately. Valid reasons for rejection (which must be stated) include: (1) excessive abbreviation, and (2) illegible text (due to poor image or print quality).
Changes. We recognize the need for consistency, and the expense entailed in changing instructions and procedures midstream; such changes will certainly be minimized. Nevertheless, there is certain to be unexpected material in the data; and there are certain to be unforeseen consequences to some of the instructions given here. These instructions, as well as the EEBO dtd, will therefore undoubtedly undergo some revision during the course of this project; most of it, probably, towards the beginning.
Exceptions. There is considerable variety in the source material and minor special instructions may be required for some books, or some portions of books, in some cases overriding the instructions given below.
Feedback. Conversion firms involved in this project are encouraged to ask questions: both to inquire about specific features not anticipated by the Guidelines, and to challenge the Guidelines (or the dtd) if they seem to produce unreasonable results. We will likewise provide advice on the conversion firms' tagging practices as quickly as possible.
With a few standard exceptions noted below, the text will be recorded in its entirety, first page to last, in the order it was intended to be read (top left to bottom right, left column before right column, etc.).
The chief exception is parallel texts. Running parallel texts, printed in a multi-column, multi-row, or facing-page arrangement, or some combination thereof, need to be treated as separate texts (normally, separate <DIV>s, sometimes perhaps separate <TEXT>s), each one recorded until its end and not restarted on each page. Notes and other material relating to only one of the texts on a page need to be embedded in that text, not in any of the others. If a single heading or figure applies to more than one of the parallel texts, it should be recorded at the appropriate place in each text to which it applies.
Partial or fragmentary parallel texts will normally be broken primarily at the chapter or section level (e.g. <DIV1 TYPE="chapter">), then into parallel versions of that chapter (e.g. <DIV2 TYPE="version">) when necessary. But full parallel texts (e.g. an entire Latin-English parallel New Testament, or a Latin-English parallel Boethius) will normally be broken primarily into versions first (<DIV1 TYPE="version">), then each version into its chapters (<DIV2 TYPE="chapter">).
All material should be recorded in the form in which it appears in the book: do not attempt to correct spelling or typographic errors (except upside-down letters; see below). Spaces between words should always consist of one space character. Spacing between words is, however, often highly irregular in these books, often difficult to discern, and therefore often requires a measure of judgment. This may involve advisedly departing from the spacing that appears in the original book when sense demands it.
Page numbers as printed in the book will be preserved only as the value of the "N" attribute of the <PB> (page-break) tag. Unnumbered pages should receive a <PB> tag with the N attribute omitted. Incorrect page numbers should be recorded just as they appear. Page numbers will usually consist of arabic or roman numerals, but may also appear as letters or letter-number combinations. If there appear to be multiple separate paginations, choose one to record with the <PB> tag; record the other with a <MILESTONE> tag. Ignore any typographic elements used to set off the page number. E.g. -2-, {p. 2}, and PAGE 2 should all be recorded as <PB N="2">; (ii) and -ii- should both be recorded as <PB N="ii">; etc.
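The rule about ignoring typographic elements around the page number can be sketched as follows. This is a hedged illustration, not part of the specification: the function name is hypothetical, and the set of decorations and labels it strips (brackets, hyphens, "p." and "PAGE") is an assumption generalized from the examples above.

```python
import re
from typing import Optional

# Assumed labels that may precede a printed page number ("p. 2", "PAGE 2").
_LABEL = re.compile(r"^(?:page\b|p\.)\s*", re.IGNORECASE)
# Assumed decorative characters that may surround a printed page number.
_DECORATION = "-(){}[] "


def pb_tag(printed: Optional[str]) -> str:
    """Build a <PB> tag from the page number as printed (None = unnumbered page)."""
    if printed is None:
        return "<PB>"  # unnumbered pages omit the N attribute
    core = printed.strip(_DECORATION)
    core = _LABEL.sub("", core).strip(_DECORATION)
    if not core:
        return "<PB>"
    # The number itself is kept exactly as printed, even if it is wrong.
    return f'<PB N="{core}">'
```

Under these assumptions, pb_tag("{p. 2}") and pb_tag("PAGE 2") both give <PB N="2">, and pb_tag("(ii)") gives <PB N="ii">.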
Placement of <PB> tags. The rules are: (1) "pages always break at the top"; that is, <PB> tags will be inserted in the text at the actual location of the page break, regardless of the location of the page number on the printed page. (2) "Divisions begin at page breaks; they don't end there"; that is, if a structural break of some kind coincides with the page break (e.g., if a new section, paragraph, etc., begins at the head of the new page), the <PB> tag should be tucked inside the opening tag for the new division, NOT inside the closing tag for the old division. And (3) "Words can break at page breaks but their unbroken form needs to be recorded too"; that is, if a hyphenated word straddles a page break, insert the <PB> tag within the word at the point where the page break occurs, but record the full word as an attribute value of the <ORIG> element, like this: <ORIG REG="condition">condi-</ORIG><PB>tion.
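Rule (3) can be illustrated with a small helper that assembles the markup for a word split across a page break. It is a sketch only: the function name is hypothetical, and it assumes the trailing hyphen of the first part is a line-end hyphen that does not belong to the unbroken word.

```python
from typing import Optional


def word_across_page_break(first_part: str, second_part: str,
                           page_number: Optional[str] = None) -> str:
    """Encode a hyphenated word that straddles a page break, per rule (3).

    first_part is the text before the break, including its hyphen ("condi-");
    second_part is the rest of the word, without attached punctuation ("tion").
    """
    # Assumption: the hyphen is only a line-end hyphen, so it is dropped from REG.
    full_word = first_part.rstrip("-") + second_part
    pb = f'<PB N="{page_number}">' if page_number else "<PB>"
    return f'<ORIG REG="{full_word}">{first_part}</ORIG>{pb}{second_part}'
```

Calling word_across_page_break("condi-", "tion") reproduces the sample given in rule (3).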
Alternative, simpler instruction: (3) "Words cannot break at page breaks"; that is, if a hyphenated word straddles a page break, finish the word and any attached punctuation, then insert the <PB> tag. Treat the hyphen as any other end-of-line hyphen.
In parallel texts, material on a single page is often recorded at widely separated points in the data stream (once in each parallel <DIV>). In that case, the <PB> tag, including the page number, should be repeated, i.e., recorded in both <DIV>s.
Foliation. Some books may be foliated instead of paginated, i.e., every leaf may receive a number, rather than every page (in which case, typically, the back page of each leaf has no number). When recording a foliated book, do not record any "N" value in the <PB> tag. Instead, after each <PB> tag for a page occupying the recto (front) of a leaf, place a <MILESTONE> tag with the attribute "UNIT" set to "folio" and the attribute "N" set to whatever the printed number of the leaf is. A typical page sequence in this kind of book will look like this:
<PB><MILESTONE UNIT="folio" N="iij"> <PB> <PB><MILESTONE UNIT="folio" N="iv"> <PB> <PB><MILESTONE UNIT="folio" N="v"> <PB>
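The same sequence can be generated mechanically from a list of printed leaf numbers. The following is a minimal sketch with a hypothetical function name; it assumes every leaf has a numbered recto followed by an unnumbered verso, which real books will not always honor.

```python
from typing import Iterable, List


def foliated_page_tags(folio_numbers: Iterable[str]) -> List[str]:
    """Return the tag(s) that open each page of a foliated book, recto then verso."""
    tags: List[str] = []
    for number in folio_numbers:
        # Recto: <PB> carries no N; the leaf number goes on the MILESTONE.
        tags.append(f'<PB><MILESTONE UNIT="folio" N="{number}">')
        # Verso: a bare <PB>, since the back of the leaf has no printed number.
        tags.append("<PB>")
    return tags


if __name__ == "__main__":
    print(" ".join(foliated_page_tags(["iij", "iv", "v"])))
```

The leaf numbers are passed through exactly as printed, so "iij" stays "iij".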
Other non-structural enumerations and alternative numerations. If the book contains some other running numeration system alongside folio or page references, use the milestone element to record it, and use its form or location, recorded with the "rend" attribute, to distinguish it from other milestones. There is no need to interpret its meaning or decide on its "unit" value. For example, if a book contains an unexplained sequential number in brackets in its margins, as well as folio numbers, record the latter as <MILESTONE UNIT="folio" N=""> and the former as <MILESTONE REND="margin" N="">; if it contains an explained series of numbers in the margins, use the explanation as the "REND" value: if an edition, for example, contains a series of sequential references in the margin that look like this: [Boeth., cap. 43], record them as milestones like this: <MILESTONE REND="Boeth." UNIT="cap." N="43">. Note that this applies only to a sequence; occasional notes of this sort should be recorded simply as <NOTE>. If in doubt whether a set of numbers represents <MILESTONE>s or <NOTE>s, use <NOTE>. Some books contain conflicting structural enumerations, e.g. a system of proposition numbers in the margins that does not correspond with the chapter numbers; the former may be recorded using <MILESTONE> tags.
Line numbers in verse should be recorded only as the value of the "N" attribute of the <L> tag. Record in this fashion only line numbers actually printed in the book, and use the form of the number that appears in the book.
Stanza, chapter, section numbers, etc. (that is, sequential numbers that appear in the headings to <LG>s and numbered <DIV>s) should be included as they appear in the book as part of the text surrounded by the appropriate <HEAD> tag, but should also be recorded, if possible as an arabic number, as the value of the "N" attribute of the appropriate <DIV> or <LG> tag.
<DIV2 TYPE="chapter" N="5"><HEAD>Chapter V.</HEAD>
<LG N="14"><HEAD>Stanza XIV.</HEAD>
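Deriving the arabic "N" value from a printed roman numeral might be sketched as below. The function names are hypothetical, and two points are assumptions rather than rules stated here: that "j" and "u" are treated as variant forms of "i" and "v", and that the printed form is used for "N" when no arabic value can be derived.

```python
from typing import Optional

_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_arabic(numeral: str) -> Optional[int]:
    """Convert a roman numeral to an int, or return None if it is not one."""
    # Assumption: early printed forms such as "iij" use j for i and u for v.
    s = numeral.strip().rstrip(".").lower().replace("j", "i").replace("u", "v")
    if not s or any(ch not in _VALUES for ch in s):
        return None
    total = 0
    for ch, following in zip(s, s[1:] + " "):
        value = _VALUES[ch]
        # A smaller value before a larger one is subtracted (iv = 4, xiv = 14).
        if following != " " and _VALUES[following] > value:
            total -= value
        else:
            total += value
    return total


def numbered_open_tag(tag: str, printed_number: str,
                      type_value: Optional[str] = None) -> str:
    """Build an opening tag such as <DIV2 TYPE="chapter" N="5">."""
    arabic = roman_to_arabic(printed_number)
    # Assumption: fall back to the printed form when conversion is not possible.
    n_value = str(arabic) if arabic is not None else printed_number.strip()
    type_part = f' TYPE="{type_value}"' if type_value else ""
    return f'<{tag}{type_part} N="{n_value}">'
```

The converter does not check that a numeral is well formed, so additive forms such as "iiii" are accepted and read as 4.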
Paragraph numbers (sequential numbers appearing at the beginning of a series of paragraphs that you have not chosen to regard as <DIV>s) should be included as they appear in the book as part of the text surrounded by the <P> tags, but should also be recorded, if possible as an arabic number, as the value of the "N" attribute of the <P> tag.
Item numbers and label numbers in lists should be recorded as part of the text included within the <ITEM> (or <LABEL>) tags. They should not be recorded as attribute values.
Enumerations in tables may be variously treated: given a column of their own, left as part of the text in a row, or even made part of an embedded <LIST>, whichever adequately represents the information most efficiently.
Language. Supply a value for the LANG attribute of numbered <DIV>s and of whole <TEXT>s, but do so only if the bulk of the text (barring notes) in that <DIV> or in that <TEXT> is in the indicated language. Supply the attribute at the highest level at which it applies: e.g., if an entire text is in Latin, add LANG="lat" to the <TEXT> tag, but not to all the <DIV> tags within that <TEXT>; if one of the <DIV1>s in a text is in Latin and another is in English, assign LANG="lat" to one of the <DIV1>s and LANG="eng" to the other; and so on.
Assign multiple LANG values to the same <DIV> or <TEXT> only if it contains two or more languages in some kind of organized relationship. E.g., a bilingual Latin/English dictionary should be coded as <TEXT LANG="lat eng"> (with a space between the two codes). Supply a value for the LANG attribute only if you are sure what language it is; otherwise, do not use the attribute at all. Use USMARC 3-letter language codes published by the Library of Congress at http://lcweb.loc.gov/marc/languages/ (These are identical to the 3-letter codes contained in the ISO standard 639-2; see http://lcweb.loc.gov/standards/iso639-2/langhome.html)
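As an illustration of the LANG rules, the sketch below builds the attribute from a list of codes and omits it entirely when no language is certain. The function name is hypothetical, and the check only confirms that each code has the shape of a 3-letter code; it does not look the code up in the Library of Congress list.

```python
import re
from typing import Sequence

_THREE_LETTERS = re.compile(r"[a-z]{3}")


def lang_attribute(codes: Sequence[str]) -> str:
    """Return ' LANG="..."' for the given codes, or '' if none are supplied."""
    if not codes:
        return ""  # unsure of the language: do not use the attribute at all
    for code in codes:
        # Shape check only; membership in the MARC code list is not verified.
        if _THREE_LETTERS.fullmatch(code) is None:
            raise ValueError(f"not a 3-letter lowercase code: {code!r}")
    # Multiple codes are separated by a single space.
    return ' LANG="{}"'.format(" ".join(codes))
```

For a bilingual dictionary, "<TEXT" + lang_attribute(["lat", "eng"]) + ">" yields <TEXT LANG="lat eng">.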
TYPEs of DIV. Supply a value for the TYPE attribute of numbered <DIV> elements if the appropriate value is obvious; otherwise, omit the attribute entirely. If you do supply a value, use these rules:
If the designation in the book is a verbose version of a common English term, use the simpler form. E.g., if the book says "Prefatory Remarks by the Author," you shouldn't be afraid to translate this into <DIV1 TYPE="preface">
Otherwise, use whatever is there.
<DIV1 TYPE="poem">
See further under Poetry, below.
Provide other attribute values only when instructed to and when there is specific information to supply. Do not supply values of this sort: TYPE="unknown" or TYPE="unspecified".
Surrounding structures should be preserved, so long as doing so does not involve a needless repetition of the <FOREIGN> tag. A line of verse quoted in Greek, for example, should be recorded as
<Q><L><FOREIGN></FOREIGN></L></Q>
The same <GAP> tag should be used for characters and symbols (other than the kind of material listed above under "material not to record at all") that we have made no provision for transcribing. See below under Characters.
One text or many? Most works will consist of a single <TEXT> containing a single <BODY> element (optionally also a <FRONT> and/or <BACK> element for front and back matter respectively). Some works will consist instead of a <GROUP> element that contains multiple <TEXT>s (each <TEXT> with its own <BODY> and, optionally, <FRONT> and <BACK>). The GROUP element will be used most frequently for items that contain several works published or bound together, each with its own title page, that were originally printed separately, e.g. the collected works of an author.
The <BODY> (and, if necessary, the <FRONT> and <BACK> elements) will normally be divided into numbered <DIV>s corresponding to the main divisions of the text. Very simple documents, on the other hand, with no internal division (a work consisting of a single poem, for example) do not require <DIV>s at all: use no more <DIV> layers than necessary.
The numbered <DIV> elements, from <DIV1> to <DIV7>, represent a hierarchy: the <BODY> is subdivided into <DIV1>s; <DIV1>s, if necessary, are subdivided into <DIV2>s, and so on. <DIV>s divide into parts: with few exceptions, you need to have more than one of something to call it a <DIV>.
Individual small texts embedded within a larger work (e.g. entire poems quoted within a chapter of a treatise) should usually not be tagged as <DIV>s but should instead be placed within <Q> tags. The <Q> element may if necessary contain an entire <TEXT>, with its own <BODY>, <FRONT>, <BACK>, numbered <DIV>s etc.
Useful clues to the DIV structure include:
<P N="3">¶ III. In the third place, the Calvinist partie striveth ...
Marginal "headings" that you decide not to treat as <HEAD>s can usually be encoded either as <NOTE>s, with the PLACE attribute set to "marg" or (if they contain a sequential numeration), as <MILESTONE>s.
TYPES of DIVs. See above under "attributes."
Front matter (material to include in the <FRONT> element) typically includes title pages, dedications, tables of contents, prefaces, prologues, honorific poems, remarks "to the Reader", etc., each of which should be recorded with a numbered <DIV>, their subsections recorded with higher-numbered <DIV>s, etc.
Title pages do not require special tags. Each title page should be recorded as a numbered <DIV> within the <FRONT> element. Include both the front and back (recto and verso).
Back matter (material to include in the <BACK> element) typically includes indexes, glossaries, colophons, afterwords, appendices, etc., each of which should be recorded with a numbered <DIV>, their subsections recorded with higher-numbered <DIV>s, etc.
Do not attempt to record the physical appearance of the page (centering, extra spaces, justification, type face, type size, etc.), though such cues may and should be used to determine the beginning and end of divisions within the text, the distinction between text and notes, etc. On type faces, see the special instructions below about use of the <HI> tag.
Record line-breaks (with the <LB> tag) only (1) if the text is unintelligible without a break; and (2) if there is no intervening structural tag. Many times, it is better to repeat a tag than to insert a line break in the middle of one; but more often it is possible to get by without doing either, especially if there is any punctuation at the line break. E.g., record this:
CHAP. XI. Some Advantages and Helps for raising and affecting the Soul by Meditation.
like this:
<HEAD>CHAP. XI.</HEAD> <HEAD>Some Advantages and Helps for raising and affecting the Soul by Meditation.</HEAD>
or, better, like this:
<HEAD>CHAP. XI. Some Advantages and Helps for raising and affecting the Soul by Meditation.</HEAD>
but NOT like this:
<HEAD>CHAP. XI. <LB>Some Advantages and Helps for raising and affecting the Soul by Meditation.</HEAD>
Paragraph breaks should be recorded with <P> in prose and with <LG> (line-group or stanza) in verse.
Do not record italic or bold type, the various kinds of black-letter ("gothic") typefaces, regular roman typefaces, or fonts of different sizes as such. Instead record every change from the predominant typeface with the <HI> tag, unless you use that change as a cue to insert a structural tag of some kind. For example, a book may have black-letter text and italic headings. Record the headings as <HEAD> ... </HEAD>, not as <HEAD><HI> ... </HI></HEAD>, since you have used the change to italic as a cue to tag the italic text as a <HEAD>
Predominance is established at the <DIV> level. E.g., if the Preface or Dedication or chapter or section (occupying its own <DIV>) is in italic, it needs no special tagging, even if the main body of the book is in some other typeface. But if an individual word, phrase, sentence, line, or paragraph is in some other face than that which is predominant in that <DIV>, then mark the "different" text with the <HI> tag.
The exception, of course, is again if you are using the change of type face as a cue to structural role: in a book that prints its text in roman and its notes or block quotations in italic, once you have recorded the italic text as a <NOTE> or a <Q>, you do not need to mark it also as <HI>. Instead, the italic type itself becomes the predominant form within the <NOTE> or within the <Q>; any changes of typeface within these tags (e.g. a single word in textura black-letter) should be recorded with <HI>
If the text switches to yet another typeface within a section flagged with <HI>, simply mark the new typeface with another (nested) <HI> tag.
The most common contrasting type forms may be described as: (1) roman; (2) italic; (3) textura; (4) rotunda; (5) bastarda (see letter samples below), but individual books may use other contrasting forms: subtypes of italic; changes of font size; etc. The general appearance of the book must be the key: if the book intends two kinds of type to contrast, then flag the change with <HI> (as instructed above).
Do not record changes of typeface within a word (e.g. a single letter or two in another typeface within a word that is otherwise in the typeface used in the immediate context).
Record superscripted and subscripted text using character entities (&supa;, &supb;; &suba;, &subb;, etc.), not elements.
Record ornamented capitals, large drop caps, etc., as ordinary capital letters.
Record "small caps" as ordinary capital letters.
<Q>s are used for block quotations, whether of prose or verse. Don't use them for ordinary "inline quotations."
"Block quotations" include both quotations that are set off from the main text by indentation and blank lines (in the modern fashion) and also lengthy quotations that are set off by the use of other typographic cues, such as a series of quotation marks in the margin, or (if unambiguously marking a block quotation) a change of typeface, or some combination of these. If you're not sure if a block of text is a <Q>, simply record the appearance of the text (using, e.g. <P> and <HI>). Quotations marked by a string of quotation marks along the margin are unambiguously block quotations; there is no other ready way to record them.
<Q>s are usually the best way to tag even very substantial items embedded in prose, e.g. a poem or a document of some kind quoted within a chapter, or within a note, or within an introduction.
<Q> can if necessary even contain an entire <TEXT>, with its own <FRONT> matter, <BODY>, <DIV> structure, and so on. Use <Q> for such embedded items, rather than trying to treat them as <DIV>s of the main text (unless that's really what they are). Treating them as <DIV>s forces you to treat all the material surrounding them as <DIV>s too, at the same level.
Prefer this:
<DIV1 TYPE="introduction">
<P>blah blah</P>
<P>blah blah</P>
<P> <Q>here's a poem</Q> </P>
<P>blah blah</P>
</DIV1>
to this:
<DIV1 TYPE="introduction">
<DIV2 TYPE="stuff before the poem">
<P>blah blah</P>
<P>blah blah</P>
</DIV2>
<DIV2 TYPE="poem">
<LG><L>here's a poem</L></LG>
</DIV2>
<DIV2 TYPE="stuff after the poem">
<P>blah blah</P>
</DIV2>
</DIV1>
Most material that is set off from the main body of the text but is adjacent and related to it can be safely tagged as <NOTE>. (Arguments (summaries at the head of <DIV>s), salutations, and speaker names and stage directions in drama are among the note-like features that have their own tags.)
Record each note at the point in the main text to which it relates, set off by appropriate tags, not at the point where it appears on the page. If the place is marked in the text with a flag of some kind (e.g. a footnote reference number, an asterisk (*), etc.), discard this marker both from the text and from the note. If the note is keyed to the text by line number, verse number, etc., place the note at the end of the line (etc.) to which it applies, and discard the literal number from the note.
Use the "PLACE" attribute of the <NOTE> tag to indicate where the note appears on the page:
- PLACE="marg" in margin or adjacent to the text (even if part of it runs across the whole page because of lack of room in the margin)
- PLACE="foot" in a footnote, below the text
- PLACE="inter" interlinearly (between the lines of text)
- PLACE="inline" not distinguished from the main text by location.
If there are multiple distinguishable sets of notes in the same location (two sets of footnotes, for example; or multiple sets of marginal notes marked by different kinds of flags, one set marked by numbers, one by letters), distinguish them by appended numbers: PLACE="foot1" and PLACE="foot2" for example.
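The permitted PLACE values, including the numbered variants used to tell apart several sets of notes in one location, could be checked along these lines. This is only a sketch: the function name is hypothetical, and it assumes that appended numbers start at 1 and may follow any of the four base values.

```python
import re

# The four base values listed above, optionally followed by a set number (foot1, foot2, ...).
_PLACE = re.compile(r"(marg|foot|inter|inline)([1-9][0-9]*)?")


def is_valid_place(value: str) -> bool:
    """True if value is one of the PLACE values described in these guidelines."""
    return _PLACE.fullmatch(value) is not None
```

A check of this kind would accept "marg" and "foot2" and reject values such as "bottom".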
A note that spills onto the next page needs to be treated as a single note, not two, and should be placed in the text where it applies.
A note that relates to a specified group of lines, verses, etc., should be moved into the text at the end of the last item to which it applies, with the line number indication preserved. If physical arrangement, rather than explicit line numbers, serves to specify the line or verse number range, supply the number range in brackets at the beginning of the note. Notes referenced to a line (verse, etc.) number followed by "f." ("2365 f." meaning "line 2365 and following") should be treated as notes referenced to a span of two lines (in this case, 2365-66), that is, placed at the end of the second line (2366), with the full line reference preserved in the note: <NOTE PLACE="foot">2365 f.: ... </NOTE>
A note that appears next to a single line or set of lines and seems to relate to that line (or set) should be placed at the end of the line(s) in question.
Notes that apply to two (or more) distinct loci or lines should be reproduced and inserted at *both* (or all) the relevant points.
These need to be distinguished from notes that apply to a span of loci or lines; notes applying to a span of lines should be placed after the last line in the span with indications of the length of the span (e.g., "14-23" [with reference to line numbers] or "*-*" [with reference to two "*" flags in the text]) retained.
A note that seems to relate to an entire text division should be attached as a note to the heading for that division. A summary of the division should usually be given a tag of its own and tagged as <ARGUMENT> or <HEAD> (see below under Heads).
Apparatus that relates generally to the material on a page, or for which the appropriate place cannot readily be determined, should be attached to the last line of text at the bottom of the page.
Reference numbers in the text that point to something other than a note (e.g. to some part of an illustration), or for which the target cannot be found, should simply be recorded as part of the text.
Passages of verse (especially 2 or more lines, quoted and arranged as verse) within a note will normally be most readily coded as a quotation (<Q>) containing <L>s or <LG>s, embedded within the <NOTE> element.
In general, prefer to record itemized sequences as <LIST>s rather than <TABLE>s if possible. Use <TABLE> when the material cannot be readily understood without the spatial organization that tables provide.
Numbered sequences of items when the items themselves are blocks of text of considerable size (numbered paragraphs, for example) should not normally be treated as lists.
Complex lists (lists within lists) should be encoded with nested <LIST> tags:
<LIST> <ITEM> .. </ITEM> <ITEM><LIST> <ITEM> .. </ITEM> <ITEM> .. </ITEM> </LIST> </ITEM> </LIST>
Treat any numbers that enumerate items in a list as part of the text of that item; record them neither with separate <LABEL> tags nor as attribute values. E.g.:
<LIST> <HEAD>Sins</HEAD> <ITEM>1. Avarice</ITEM> <ITEM>2. Sloth</ITEM> <ITEM>3. Pride</ITEM> </LIST>
Lists of pairs may be tagged with the element pair <LABEL> and <ITEM> (in that order). If you use this option, you may omit any "leader" (e.g. a dot leader) between the paired items. E.g.:
THE PLAYERS' NAMES
The Prince...............Jn. Longfellow
The Pauper...............Thomas Goodrich
Joan the Tappester........Jack Smithson
<LIST>
<HEAD>THE PLAYERS' NAMES</HEAD>
<LABEL>The Prince</LABEL><ITEM>Jn. Longfellow</ITEM>
<LABEL>The Pauper</LABEL><ITEM>Thomas Goodrich</ITEM>
<LABEL>Joan the Tappester</LABEL><ITEM>Jack Smithson</ITEM>
</LIST>
Tables should be recorded as you would using HTML tables, oriented by row, with the number of columns determined by the number of cells within the row. Use the spatial organization of the text to determine the number of rows and columns (not necessarily reflected in printed border lines). The ROWS and COLS attributes of the <CELL> tag should be used just like the ROWSPAN and COLSPAN attributes of the <TD> in HTML to indicate cells that extend across two or more rows or columns. Cells that contain a heading or label for a row (or column) should receive the attribute ROLE="label".
EEBO dtd | HTML equivalent |
---|---|
<TABLE> | <TABLE> |
<ROW> | <TR> |
<CELL> | <TD> |
<CELL ROLE="label"> | <TH> |
<CELL ROWS=""> | <TD ROWSPAN=""> |
<CELL COLS=""> | <TD COLSPAN=""> |
Particularly complex tables may be recorded (again as in HTML) with nested <TABLE> tags, i.e., a <TABLE> within a <CELL>, or by combinations of <LIST> and <TABLE>, i.e. a <LIST> within a table <CELL> or a <TABLE> within a list <ITEM>.
Tables that continue from one page to the next may be tagged as one continuous table, with an embedded <PB> tag, especially if the headings are not repeated on the new page. If the headings are repeated, it is usually easier to close the old <TABLE> and open a new one on the new page.
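The correspondence between the EEBO table tags and their HTML equivalents can be made concrete with a small converter. It is a rough sketch rather than a real SGML parser: the function name is hypothetical, it assumes upper-case tag names and double-quoted attributes as in the samples here, and it drops any attributes on <TABLE> and <ROW>.

```python
import re

_TAG = re.compile(r"<(/?)(TABLE|ROW|CELL)\b([^>]*)>")


def eebo_table_to_html(sgml: str) -> str:
    """Rewrite <TABLE>/<ROW>/<CELL> markup as HTML <TABLE>/<TR>/<TD>/<TH>."""
    open_cells = []  # HTML names (TD or TH) of the CELLs currently open

    def convert(match):
        closing, name, attrs = match.group(1), match.group(2), match.group(3)
        if name == "TABLE":
            return f"<{closing}TABLE>"
        if name == "ROW":
            return f"<{closing}TR>"
        if closing:
            # Close with whichever HTML element the matching CELL opened as.
            return f"</{open_cells.pop()}>"
        html_name = "TH" if 'ROLE="label"' in attrs else "TD"
        open_cells.append(html_name)
        attrs = attrs.replace(' ROLE="label"', "")
        attrs = re.sub(r"\bROWS=", "ROWSPAN=", attrs)
        attrs = re.sub(r"\bCOLS=", "COLSPAN=", attrs)
        return f"<{html_name}{attrs}>"

    return _TAG.sub(convert, sgml)
```

Because open cells are tracked on a stack, a <TABLE> nested inside a <CELL> is converted as well; text and other tags pass through unchanged.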
Here is a sample simple table (this one is simple enough that it could almost be done as a <LIST>):
Recorded as:
<DIV TYPE="table">
<HEAD><HI>TABLE</HI></HEAD>
<HEAD>By this table, shall ye fynde the Epistles and Gospels, for the Son|daies, and other feastiuall dayes.</HEAD>
<P>FOR TO fynde them the sooner, shall ye seke for these capital letters, <HI>A, B, C D,</HI> whi|che stande by the syde of this boke alwaies, On or vnder the letter shall you fynde a crosse ✗, where the Epistle or the Gospell begynneth, and where the end is, there shal ye find an halfe crosse, <GAP EXTENT="1"> And the fyrst lyne in this table is alway the e|pistle, and the seconde lyne is alway the Gospell.</P>
<TABLE>
<ROW><CELL ROLE="label" COLS="3">On the fyrst Sonday in Aduent.</CELL></ROW>
<ROW>
<CELL>Rom. xiii.</CELL>
<CELL>C</CELL>
<CELL>And for as muche as we knowe</CELL>
</ROW>
<ROW>
<CELL>Math. xxi.</CELL>
<CELL>A</CELL>
<CELL>Nowe when they drew nye vnto</CELL>
</ROW>
<ROW><CELL ROLE="label" COLS="3">On the second sonday in the Aduent.</CELL></ROW>
<ROW>
<CELL>Rom. xv.</CELL>
<CELL>A</CELL>
<CELL>what so euer thynges are writen</CELL>
</ROW>
<ROW>
<CELL>Luc. xx.</CELL>
<CELL>C</CELL>
<CELL>And there shall be signes</CELL>
</ROW>
</TABLE>
</DIV>
Headings at the head of text divisions and stanzas (<DIV>s and <LG>s) should be tagged as <HEAD>. Subheadings should be tagged as <HEAD TYPE="sub">.
Some headings have special tags (see below). If heading-like material doesn't fall clearly into one of these special categories, use simple <HEAD>. Incipits ("here begins a tract about sin") are typically recorded as <HEAD>s with TYPE="incipit".
"Idlenesse is lesse harmefull then vnprofitable occupation."
PUTTENHAM
<EPIGRAPH> <Q>"Idlenesse is lesse harmefull then vnprofitable occupation."</Q> <BIBL>PUTTENHAM</BIBL></EPIGRAPH>
Epigraphs are a common place to find bits of non-roman script; record those bits with <FOREIGN> as described above, but place the <FOREIGN> portion inside the <EPIGRAPH> tag.
Material at the end of a text division that is set off from the main text is normally to be tagged as a <TRAILER> or <CLOSER>. <TRAILER> is the more general tag, used for material without such internal structures as datelines, salutations, or signatures. Typical <TRAILER>s include "Amen," "Finis," and explicits ("here ends the tract written by Master John Knox."). <CLOSER>, on the other hand, is the counterpart of <OPENER>; it is used when the concluding material includes lengthy or complex information, including datelines, salutations, or signatures, especially in letters. See Letters, below, for examples. Requests for prayer for the author's soul are typically recorded as <CLOSER>s.
Epigraphs can appear at the foot of a division as well as at its head (see above for a description of epigraphs).
Verse lines. Each verse line should be enclosed in <L> tags. Do not attempt to record the varying indentation of verse lines; pay attention to indentation only insofar as it indicates a change of stanza or a "broken" line (see below).
Broken lines. Sometimes when a verse line is too long to fit on the page, its last word or two is placed (sometimes marked off with a bracket or parenthesis) at the end of the next line or at the end of the preceding line (wherever it fits best). Such detached bits of verse lines should be recorded at the end of the line to which they really belong.
Mary had a little lamb, [snow.
Its fleece was white as
<L>Mary had a little lamb,</L>
<L>Its fleece was white as snow.</L>
Groups of lines (<LG>s). Groups of lines should be enclosed in <LG> ("line-group", i.e. stanza) tags. A poem containing no divisions (only undivided lines) should still be recorded as containing a single <LG>. Do not supply a value for the TYPE attribute of <LG>.
<DIV1 TYPE="poem"> <LG> <L>When the cat's away</L> <L>The mice will play</L> </LG> </DIV1>
Exception: when a group of lines is quoted (e.g. in a prose paragraph, in a note, or as part of an epigraph), do not place the lines inside a <LG> tag unless you have good reason to believe that the lines represent a complete stanza, e.g. if more than one stanza is quoted and you need to separate them; or possibly if the metrical form makes it clear that a whole stanza is quoted. If all you know is that some lines of verse are being quoted, then tag them as verse lines (<L>), period.
Lines vs. line-groups. It is often unclear when a group of lines has enough organization to be called a stanza (line-group <LG>). If in doubt, err on the side of fewer line-groups rather than more. And be consistent throughout a particular poem, so that a particular structure is not sometimes tagged as a <LG> and sometimes left untagged. Clues to look at include, in decreasing order of significance:
<LG>s vs. <DIV>s. It is not always easy to distinguish between <LG>s and <DIV>s: both can have headings; both can nest to create a structural hierarchy. Metrical units (true stanzas) are always <LG>s; verse paragraphs of irregular length are frequently best recorded as <LG>s, especially if they are not consistently supplied with headings. On the other hand, <DIV>s should be used for line-groups big enough to have titles, or to appear in tables of contents.
Groups of stanzas within a poem should receive a numbered <DIV> tag. In most cases, you will use only a single level of <LG> (no nesting), and treat it effectively as the lowest-level text division. Any grouping of stanzas is therefore recorded as a <DIV>.
Entire poems. Each poem should be recorded as a <DIV> of the appropriate number (<DIV1> etc.), with TYPE="poem". Don't try to distinguish between different kinds of poems, between poems and songs, etc. Any discrete item in verse is TYPE="poem". Poems may, of course, be subdivided further into <DIV>s and <LG>s of various types.
Poetry mixed with prose. When poetry is interspersed with prose, and either the poetry is the predominant form, or there is no clearly predominant form, the prose should be recorded within <P> tags, the verse within <LG> tags. When poetry gives way to prose, close the <LG> and open a <P>; when prose gives way to poetry, close the <P> and open an <LG>, even if the actual prose paragraph, or even the last sentence, is not finished.
Exceptions:
Aside from a few special tags (below), prose drama should be recorded like other prose (in <P>s, etc.) and verse drama like other verse (in <LG>s, <L>s, etc.), including the rules for interspersed poetry and prose.
Cast lists. Cast lists (often headed "dramatis personae") should be recorded like other lists, with the <LIST> tag. Cast lists will commonly appear as separate <DIV>s (within the <FRONT> matter of a book if the book contains one play). For complex cast lists, use nested lists and labels to indicate cast groupings.
Stage directions. Stage directions should be recorded with the <STAGE> element. Stage directions sometimes appear between the columns of a multicolumn text, or in the margin, where they look like notes. In other books, they may be centered (as if they were headings) or indented (as if they were little paragraphs). They are occasionally typographically distinct (in italics; within parentheses; or both).
Speakers. The name (sometimes abbreviated) of the speaker is recorded with <SPEAKER>. These appear at the head of a speech: e.g. typically above the first line of the speech (sometimes centered), in the margin, in an indented line of its own at the head of the speech, or in italics at the beginning of the first line of the speech.
Additional text associated with the speaker's name should be included in the <SPEAKER> tag, like this: <SPEAKER>Mr. Jones, reading from letter.</SPEAKER> Multiple names should be enclosed in a single set of <SPEAKER> tags, like this: <SPEAKER>Mr. Jones and Mrs. Smith.</SPEAKER>
Speeches. The basic unit of drama is the SPEECH (<SP>). A speech normally continues uninterrupted as long as the character speaking it is uninterrupted by another speaker or the end of a division (act, scene, etc.).
"Songs" and other material specially set off within a speech should not normally be given any special tagging; if they have headings, they may need to be recorded as a nested <LG>. In exceptional cases when they contain an elaborate structure they may be recorded as a quotation (<Q>).
Prologues and Epilogues should normally be treated as part of the play, recorded as <SP>s like any other speech, though they may sometimes require a numbered <DIV> of their own.
Acts and Scenes. The act/scene structure should be recorded with appropriately TYPED and numbered <DIV>s.
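Taken together, the drama tags above might combine as in the following sketch. The TYPE values "act" and "scene" are assumed (these instructions ask only that the <DIV>s be appropriately typed and numbered), and the bracketed text is placeholder content. The speaker names are the ones used as examples above.

```xml
<DIV1 TYPE="act">
<HEAD>[act heading]</HEAD>
<DIV2 TYPE="scene">
<HEAD>[scene heading]</HEAD>
<STAGE>[stage direction printed at the head of the scene]</STAGE>
<!-- a prose speech: recorded in P, like other prose -->
<SP>
<SPEAKER>Mr. Jones, reading from letter.</SPEAKER>
<P>[prose speech]</P>
</SP>
<!-- a verse speech: recorded in L, like other verse -->
<SP>
<SPEAKER>Mrs. Smith.</SPEAKER>
<L>[first verse line of the speech]</L>
<L>[second verse line of the speech]</L>
</SP>
<STAGE>[closing stage direction]</STAGE>
</DIV2>
</DIV1>
```

Where a book contains more than one play, the play itself would presumably take the <DIV1> and the acts and scenes would move down a level.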
Personal letters that appear as text divisions should be treated as <DIV>s just like any other text division (chapters, sections, etc.). Letters quoted within running text (e.g. a letter quoted within the chapter of a book) have been given a special tag, <LETTER>. Note that dedications frequently look like letters, since they contain salutations and signatures, but they're not: treat them as <DIV TYPE="dedication">.
Special tags are available to tag the salutations (<SALUTE>), signatures (<SIGNED>), and datelines (<DATELINE>) often found in letters. Use these only if they clearly apply.
Place <SALUTE>, <SIGNED>, and <DATELINE> within <OPENER> if they appear at the head of a letter; place them within <CLOSER> if they appear at the end. See the TEI guidelines for fuller descriptions of these elements.
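A quoted letter using these tags might be sketched as follows. The bracketed text is placeholder content, and the exact points in running text at which the dtd allows <LETTER> to appear are not shown here; check the dtd rather than relying on this sketch.

```xml
<LETTER>
<OPENER>
<!-- dateline and salutation at the head of the letter -->
<DATELINE>[place and date]</DATELINE>
<SALUTE>[salutation]</SALUTE>
</OPENER>
<P>[body of the letter]</P>
<CLOSER>
<!-- signature at the end of the letter -->
<SIGNED>[signature]</SIGNED>
</CLOSER>
</LETTER>
```

A dateline printed at the foot of the letter would go inside <CLOSER> instead of <OPENER>.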
Punctuation should be retained. When a colon, comma, question mark, closing quotation mark, or period falls between words, place a space after it. When an opening quotation mark falls between words, place a space before it. When a virgule falls between words, place a space before and after it.
For unusual punctuation marks, see under "Characters and symbols" below.
Record the various forms of colon, period, comma, and virgule (slanted line) with their modern keyboard equivalents ( : . , / ).
Question marks vary considerably in form (some of them looking like inverted semicolons); record them all with the standard "?".
Opening and closing double quotation marks should both be recorded using the ordinary keyboard double-quote character (" = HEX 22), not the &ldquo; and &rdquo; entities.
Opening and closing single quotation marks, as well as apostrophes, should be recorded with the same character, the ordinary keyboard single-quote character (' = HEX 27).
Hyphens (not dashes) should normally be recorded using the ordinary hyphen character.
Hyphens at the end of a line should be recorded as the entity &eolhy; ?or the ordinary keyboard "pipe" (vertical bar) character, unless they appear between numerals, when they should be recorded with the ordinary hyphen. Be aware that hyphens in many texts may appear as an angled stroke, not a horizontal one, and may also commonly appear doubled, resembling an equals sign (=), either horizontally or at an angle.
Dashes should be recorded using the entity &mdash;, regardless of where they appear.
Ellipses, whether two characters or many--strings of dots or asterisks indicating omitted or missing text--should be recorded as ordinary text, not the &hellip; entity, using periods or asterisks as appropriate: . . . . . * * * * * . .
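By way of illustration, a sentence keyed according to the punctuation rules above might look like this. The sentence itself is invented, and the absence of spaces around the dash entity is an assumption, since these instructions do not specify spacing for dashes.

```xml
<!-- straight double quotes, straight apostrophe, ordinary hyphen,
     dash as an entity, ellipsis keyed as periods -->
<P>He asked, "Is it well-made?" and she answered: 'tis so&mdash;or nearly . . . .</P>
```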
Some editions of prose mark extended quotations by placing quotation marks at the beginning of every quoted line. E.g.,
he made reasons...seyenge: " God made alle thynges
" by reason, and governethe thynges
" made by reason; the sterres be movede by reason; and so
" oure naturalle lyfe excedynge from reason by slawthe and
" ignoraunce awe to be reducede by lawes and reasons.
" Wherefore thau&yogh;he there be somme thynges in the rule of
" seynte Benedicte, the intellect of whom the dullenesse of my
" mynde may not comprehende, y suppose hit be beste to &yogh;iffe
" credence to auctorite. Wherefore also he persuadeth hymselfe ...
Do not record the quotation marks themselves, but place the whole block of text marked by the marks inside <Q> tags. Bear in mind that the marginal quotation marks can take unusual forms (sometimes they look like a pair of commas), and that it is not always easy to discover where the quotation actually begins and ends. The example above should be encoded as:
<P> ... he made reasons...seyenge:
<Q>God made alle thynges by reason, and governethe thynges made by reason; the sterres be movede by reason; and so oure naturalle lyfe excedynge from reason by slawthe and ignoraunce awe to be reducede by lawes and reasons. Wherefore thau&yogh;he there be somme thynges in the rule of seynte Benedicte, the intellect of whom the dullenesse of my mynde may not comprehende, y suppose hit be beste to &yogh;iffe credence to auctorite.</Q>
Wherefore also he persuadeth hymselfe ... </P>
Braces and brackets that group multiple lines should be ignored if all they do is group portions of ordinary running text, such as poetry. But if they are used to link one piece of text to another, such as frequently in tables and lists, their meaning needs to be interpreted. Sometimes this will require entering text more than once, e.g. if the brace means "this word applies to all these other words"; sometimes it may require treating the single item as a head or label for a list containing the grouped items; sometimes it may involve attaching a ROWS or COLS attribute to a table <CELL>. Many variations are possible, which the following examples can only suggest.
Example 1. A brace links the single word "chapter" to each of five numbered titles:

chapter { 1 How to build a kite
        { 2 When to fly a kite
        { 3 Famous kite flyers of our time
        { 4 When not to fly a kite
        { 5 "I've flown it: now what?"

Record as:

<LIST>
<LABEL>chapter 1</LABEL> <ITEM>How to build a kite</ITEM>
<LABEL>chapter 2</LABEL> <ITEM>When to fly a kite</ITEM>
<LABEL>chapter 3</LABEL> <ITEM>Famous kite flyers of our time</ITEM>
<LABEL>chapter 4</LABEL> <ITEM>When not to fly a kite</ITEM>
<LABEL>chapter 5</LABEL> <ITEM>"I've flown it: now what?"</ITEM>
</LIST>

Example 2. Under the heading "Dramatis Personae", a brace links the label "townspeople" to four names; two further entries stand outside the brace:

Dramatis Personae
townspeople { Joe
            { Mary
            { Bothom
            { Josephus
Joan, a noblewoman
John, a philosopher

Record as:

<LIST>
<HEAD>Dramatis Personae</HEAD>
<LABEL>townspeople</LABEL>
<ITEM>
<LIST>
<ITEM>Joe</ITEM>
<ITEM>Mary</ITEM>
<ITEM>Bothom</ITEM>
<ITEM>Josephus</ITEM>
</LIST>
</ITEM>
<ITEM>Joan, a noblewoman</ITEM>
<ITEM>John, a philosopher</ITEM>
</LIST>

Example 3. A single table cell spans three rows (the source image is not reproduced here).

Record as:

<TABLE>
<ROW> <CELL>In apice trianguli.</CELL> <CELL ROWS="3">Triangulus.</CELL> </ROW>
<ROW> <CELL>In basi praecedens 3.</CELL> </ROW>
<ROW> <CELL>Sequens &amp; vltima. 3.</CELL> </ROW>
</TABLE>
Basic letter forms. Most letters encountered will belong to the modern alphabet, though their appearance may be strange.
Ligatures. Ligatured characters (ae, oe, ct, st, sp, fi, ff, ss, etc.; italic fonts especially tend to have ligatures between many further pairs of letters) should be recorded as two separate characters. Ignore the ligature. Be aware that the italic "ae" ligature usually has no upper bow to the "a" and is easily mistaken for "oe".
Ampersands should be recorded as &amp;.
Letters printed upside-down (a common printer's error) should be recorded as if turned right side up.
Recognizable letters with diacritics
Superscripts
Abbreviation symbols. A number of abbreviation symbols are distinctive enough and consistent enough in appearance to be recognized. Each should be recorded with its own character entity.
record as: | stands for: | conditions: |
---|---|---|
&abper; | per, par | |
&abpro; | pro | |
&abus; | -us | at the end of a word only |
&abque; | -que | at the end of a word only |
&abser; | ser | |
&abcon; | con-, cum- | at the beginning of a word only |
&abrum; | -rum | at the end of a word only |
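To show how these entities sit inside keyed words, here is a short sketch. The Latin words are chosen for illustration and are not drawn from any particular book; each comment gives the word that the keyed form stands for.

```xml
<P>
&abper;fectus   <!-- perfectus -->
&abpro;pter     <!-- propter -->
omnib&abus;     <!-- omnibus: -us at the end of a word only -->
at&abque;       <!-- atque: -que at the end of a word only -->
&abser;uus      <!-- seruus -->
&abcon;tra      <!-- contra: con- at the beginning of a word only -->
illo&abrum;     <!-- illorum: -rum at the end of a word only -->
</P>
```

In each case the entity replaces only the abbreviation symbol; the letters actually printed around it are keyed as they appear.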
Other symbols include alchemical and astrological symbols. A selection follows.
Letters from other alphabets when used singly (as opposed to in whole words or extended text) should be recorded with ISO standard character entities.
Symbols and marks not listed here
"Excessive" abbreviation. If sampling shows that more than one word in every ten in a given text contains an abbreviation symbol, or a <GAP> tag, or an <ABBR> tag, the work should be rejected for conversion.
The following samples are far from a definitive list of letter forms, but are meant only to provide some help recognizing the most common letters in the most common typefaces. Many books will have to be considered individually, the form(s) of each letter ascertained by its presence in a recognized word or unambiguous context so as to create, in effect, an alphabet or set of alphabets for that book. The samples below are arranged under headings that describe the most common families of type: roman, italic, textura, rotunda, and bastarda. There are many variants of each of these (except rotunda, which is fairly uniform) which may differ very considerably from the examples given here. And individual misprinted and ill-aligned letters may present a very anomalous appearance.
[Table of sample letter forms, with columns "Record as", "Textura", and "Italic" and one row for each letter from a to z; the sample images are not reproduced here.]