EEBO Text Conversion FAQ

[Questions about setup and production]

This document lists technical questions received from data conversion firms with regard to their work on the project, together with the answers. It also lists updates to the keying guidelines and other announcements, in order that all information supplied to any firm is available to all.


  1. Announcement: slight change to instruction set (draft 1.4)
    Posted 1/18/01. Revision by Schaffner.
  2. Should the final file submitted be validated and normalized?
    Submitted 1/22/01. Posted 1/24/01. Answered by Schaffner.
  3. Are there any file naming requirements?
    Submitted 1/22/01. Posted 1/24/01. Answered by Schaffner and Nunn.
  4. Would it be possible to receive Group 4 TIFF images, instead of the DJVU files on the B&H website?
    Submitted 1/23/01. Answer in progress. Can you use pdf for now?
  5. Announcement: new page-naming requirements (instructions draft 1.5; dtd 6 Feb revision)
    Posted 2/6/01. Revision by Schaffner, based on suggestions by Powell.

(1) ANNOUNCEMENT: SLIGHT CHANGE TO INSTRUCTION SET (draft 1.4).
Draft 1.4 (18 Jan 2001) of the keying instructions has been revised to add the requirement that footnote references ("flag" characters) be retained as the value of the "N" attribute of the <NOTE> tag rather than discarded, in keeping with committee recommendations.

Q 2: Should the final file submitted be validated and normalized?
A: Delivered text files should certainly be valid (which means that they need to be validated). Whether to normalize depends on what your normalizer does. We would prefer that tags and attributes be consistent as to case (preferably all upper-case, but all lower-case is OK); attribute values should generally be lower-case, as specified in the guidelines. We would also prefer (but do not insist) that attributes be consistent as to sequence within the tag. The dtd itself does not allow tag minimization except for elements declared as EMPTY, so normalization will add nothing to validation in that respect. But we do not wish to see the degree of normalization that would insert unnecessary default attributes into tags. That is, a normal table cell should appear as <CELL>, not as <CELL ROWS="1" COLS="1"> and so forth.
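
For firms that want a quick pre-delivery check of these two preferences, the following is a rough Python sketch only. It is not a validator and does not replace validation against the dtd. The function names and regular expressions are illustrative assumptions, and the only default attributes it looks for are the ROWS="1" and COLS="1" values from the <CELL> example above.

```python
import re

# Matches the element name in start tags and end tags (assumed simple SGML names).
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9.\-]*)")
# Matches a whole <CELL ...> start tag, in either case.
CELL_RE = re.compile(r"<CELL\b[^>]*>", re.IGNORECASE)
# Matches ROWS=1 or COLS=1 (quoted or not), but not ROWS="12" and the like.
DEFAULT_RE = re.compile(r'\b(?:ROWS|COLS)\s*=\s*"?1"?(?![0-9])', re.IGNORECASE)


def tag_case(text):
    """Return 'upper', 'lower', 'mixed', or 'none' for the tag names in text."""
    names = TAG_RE.findall(text)
    if not names:
        return "none"
    if all(name.isupper() for name in names):
        return "upper"
    if all(name.islower() for name in names):
        return "lower"
    return "mixed"


def cells_with_default_attributes(text):
    """List <CELL> start tags that spell out ROWS="1" or COLS="1"."""
    return [tag for tag in CELL_RE.findall(text) if DEFAULT_RE.search(tag)]
```

A result of "mixed" from tag_case, or a non-empty list from cells_with_default_attributes, would suggest that the file deserves a second look before delivery.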

Q 3: Are there any file naming requirements?
A: Let's create unique ID numbers and use them as the basis for all file naming and tracking. We may need to develop a more sophisticated system later on, but for the time being we will base filenames on the unique Short Title Catalog ID number, since it is the most widely known system. (This is *not* the same as the online "ESTC Record ID number.") Prefix "S" or "W" to the number, depending on whether the book is listed in the original Short Title Catalog of Pollard and Redgrave or in its continuation by Wing; replace any internal spaces, brackets, or periods with underscore characters; and attach an *.sgm extension. If multiple numbers are listed, we will use the first given. If a Wing number and a Thomason number are both given, we will use the Wing. If only a Thomason number is available, we will use that with a prefixed "T". Examples:

For book listed as STC (2nd ed.) / 12626a, use filename S_12626a.sgm
For book listed as Wing (2nd ed., 1994) / B1210, use filename W_B1210.sgm
For book listed as Wing / T2021A, use filename W_T2021A.sgm
For book listed ONLY as Thomason / E.684[13], use filename T_E_684_13.sgm
For item given all these numbers:
  • Wing (2nd ed., 1994) / B1187
  • Wing (2nd ed., 1994) / B1175A
  • Wing (2nd ed., 1994) / B1181
  • Thomason / E.740[10]
  • Thomason / E.741[1-5]
use filename W_B1187.sgm
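
The naming rule can be sketched in Python as follows. This is an illustration under stated assumptions, not an official tool. The function name sgm_filename and the input format (citation strings in listed order) are hypothetical. Collapsing runs of replaced characters and trimming a trailing underscore are inferred from the example T_E_684_13.sgm. Treating Thomason as a fallback when an STC number is also present extends the Wing rule by assumption.

```python
import re

# Catalog name (first word of the citation) -> filename prefix.
PREFIXES = {"STC": "S", "Wing": "W", "Thomason": "T"}


def sgm_filename(citations):
    """Derive a filename from a book's catalog citations, in listed order.

    Each citation is a string such as "Wing (2nd ed., 1994) / B1187".
    """
    parsed = []
    for citation in citations:
        catalog_part, _, number = citation.partition("/")
        words = catalog_part.split()
        catalog = words[0] if words else ""
        if catalog not in PREFIXES or not number.strip():
            raise ValueError(f"unrecognized citation: {citation!r}")
        parsed.append((catalog, number.strip()))
    if not parsed:
        raise ValueError("no citations given")

    # Use the first number given; a Thomason number is used only when
    # no other catalog is listed (assumption for the STC case).
    preferred = [p for p in parsed if p[0] != "Thomason"] or parsed
    catalog, number = preferred[0]

    # Replace spaces, brackets, and periods with underscores. Runs are
    # collapsed and edge underscores trimmed (inferred from the examples).
    cleaned = re.sub(r"[ \[\].]+", "_", number).strip("_")
    return f"{PREFIXES[catalog]}_{cleaned}.sgm"
```

Called with the five numbers of the last example, in the order listed, this returns W_B1187.sgm; called with only "Thomason / E.684[13]", it returns T_E_684_13.sgm.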

Q 4: The image files [on] the B&H website are in DJVU, which seems to be a B&H proprietary format. Images in this format take 15-20 minutes each to download. In addition, these images can only be printed using the B&H tool, which is much slower than using our production printers. Group 4 TIFF images would be ideal. Is this possible?
A: Group 4 TIFFs would obviously be ideal (for us too). We're working on this question with B&H. In the meantime, the B&H site offers another option that may serve: pdf, which is not an ideal format but is perhaps good enough for the first batch. To download an entire book in pdf:

  1. Go to the EEBO site.
  2. Select "search" at the top of the screen.
  3. Use the search function to find the right book.
  4. Click on the little camera icon to the left of the citation.
  5. When the first page comes up, check the "mark all" box.
  6. Click "marked list" at the top of the screen.
  7. Click "images / download" under the "images" column.
  8. Enter an email address and click "submit."
  9. Wait for a pin to be displayed.
  10. Select and copy the pin (or note it).
  11. Click "download" at the top of the screen.
  12. Enter the pin that you just copied into the pin box.
  13. Click "submit."
  14. Wait for the "[in progress]" to disappear and a link to appear in its place in the "files" column.
  15. Click on the link when it appears (it will look like this: "pages 1-699 (155644029 bytes)").
  16. This will download the entire book as a single .pdf file.

I've found it fastest to set my browser to download pdfs to disk rather than opening them in Acrobat Reader.

(5) ANNOUNCEMENT: NEW PAGE-NAMING REQUIREMENTS (changes to dtd and specs).
Change to specs: version 1.5 of Instructions now requires an attribute (REF) on each page-break tag that will serve to link the text page to the page image. The dtd accordingly adds a new REF attribute to <PB>. See the keying instructions for details.