EEBO Text Conversion FAQ

[Questions about the RFQ]

This document lists questions received from data conversion firms with regard to their prospective participation in the project, together with the answers. It also lists updates to the keying guidelines and other announcements, in order that all information supplied to any firm is available to all.

Contents

  1. Can we look at more examples of the text that we will be converting?
    Received 10/24/00. Posted 10/25/00. Answered by: Hillary Nunn (hnunn@umich.edu)
    See now (11/17) also question 26 below, which provides further information.
  2. The "e.g." image on the instructions doesn't link to anything.
    Received 10/25/00. Posted 10/26/00. Answered by Nunn and Schaffner.
  3. Have you a desired weekly or monthly volume/turnaround schedule?
    Received 10/24/00. Posted 10/26/00. Answered by Paul Schaffner (pfs@umich.edu).
  4. What character belongs in the filename for the DTD where it looks as if there's a space ("eebo?sgm.dtd.txt")?
    Received 10/30/00. Posted 10/30/00. Answered by Schaffner.
  5. Announcement: five new sample pages added to selection.
    Posted 10/31/00. Samples supplied by Schaffner.
  6. Two questions related to the specified accuracy rate.
    Received and posted 11/10/00. Answered by Schaffner.
  7. Are we to take the "kB" in the estimate of data quantity as three zeroes (10^3)?
    Received and posted 11/10/00. Answered by Schaffner.
  8. Announcement: five new sample pages added to selection.
    Posted 11/10/00. Samples supplied by Schaffner.
  9. Is there a list of the books that will be converted?
    Received 11/15/00. Posted 11/15/00. Answered by Nunn.
  10. In what order will books be converted?
    Received 11/15/00. Posted 11/15/00. Answered by Nunn.
  11. How will you clarify the specifications in a multi-vendor environment?
    Received 11/15/00. Posted 11/15/00. Answered by Nunn and Schaffner.
  12. How many vendors are involved?
    Received 11/15/00. Posted 11/15/00. Answered by Nunn and Schaffner.
  13. Do you have any estimate of the average number of pages in a book or bytes in a page?
    Received 11/15/00. Posted 11/15/00. Answered by Schaffner.
  14. Is there an optimal batch size? Optimal turnaround time?
    Received 11/15/00. Posted 11/15/00. Answered by Schaffner.
  15. Are the existing keyboarding instructions "best-guess" approximations?
    Received 11/15/00. Posted 11/15/00. Answered by Schaffner.
  16. How will the 99.995% accuracy requirement be interpreted in the context of potentially illegible material?
    Received 11/15/00. Posted 11/15/00. Answered by Schaffner.
  17. Who will be evaluating vendor bids and responses to the RFQ?
    Received 11/15/00. Posted 11/15/00. Answered by Sandler.
  18. Can the response to the EEBO-TCP RFQ be submitted electronically, or do you prefer hard copy?
    Received 11/13/00. Posted 11/15/00. Answered by Nunn (with Schaffner & Sandler).
  19. How many texts are expected to be converted in the six weeks immediately after the award?
    Received 11/13/00. Posted 11/15/00. Answered by Nunn (with Schaffner & Sandler).
  20. When is the award expected to be announced?
    Received 11/13/00. Posted 11/15/00. Answered by Nunn (with Schaffner & Sandler).
  21. Will the amount of data a vendor can convert per year have an impact on the evaluation process?
    Received 11/13/00. Posted 11/15/00. Answered by Nunn (with Schaffner & Sandler).
  22. Can we assume that each vendor selected will receive a cross-section of titles that range in difficulty from straight-forward to highly complex?
    Received 11/13/00. Posted 11/15/00. Answered by Nunn (with Schaffner & Sandler).
  23. Will the images of the text be provided on CD-ROM, or on an alternate media?
    Received 11/13/00. Posted 11/15/00. Answered by Nunn (with Schaffner & Sandler).
  24. What are the image specifications - dpi, grayscale, compression, etc.?
    Received 11/13/00. Posted 11/15/00. Answered by Nunn (with Schaffner & Sandler).
  25. You ask for a price in terms of dollars per thousand characters (kB). Are markup characters included in that thousand?
    Received 11/16/00. Posted 11/16/00. Answered by Schaffner.
  26. May we have access to the images now [for test purposes]?
    Received 11/16/2000. Posted 11/17/00. Answered by Nunn.
  27. Did you consider using the Chadwyck-Healey DTD for Romantic Literature?
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  28. What are the different types of documents in the database?
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  29. How will the DTD Revisions be monitored by the University of Michigan?
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  30. How will you check for the accuracy of the markup?
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  31. How will you test for a 99.995% accuracy rate?
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  32. How can you be sure that the attribute values supplied for language (LANG) are right?
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  33. Don't CDATA attributes leave too much room for interpretation?
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  34. The placement of the marginal text needs to be defined and explained more clearly.
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  35. It may be difficult to identify stanzas in poems.
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  36. Intensive training will be required in the identification of the different alphabets.
    Received 11/16/2000. Posted 11/17/00. Answered by Schaffner.
  37. For how long a period of time will contracts be awarded?
    Received 11/16/00. Posted 11/27/00. Answered by Rebecca Dunlavy and Hillary Nunn.

Q 1: Can we look at more examples of the text that we will be converting?
A: Yes. You can look at UMI's free "Featured Content" sample, online at http://wwwlib.umi.com/eebo/featured (or reached via the EEBO home page at http://wwwlib.umi.com/eebo/). This contains 100 books intended to present the "best" books in the 90,000-volume EEBO collection.

ADDENDUM: See also Question 26 below.

Q 2: The "e.g." image on the instructions doesn't link to anything.
A: It will. The image that you mention at present serves no purpose at all, since I have not yet had the chance to add the appropriate links to the sample pages that it refers to. However, you are not missing anything, since the links, when I put them in, will simply point to the transcriptions and images already available and listed in the "samples" document:
http://www.lib.umich.edu/eebo/docs/dox/samples.html

ADDENDUM: Some links have now been added, and more will be. 10/26/00.

Q 3: Have you a desired weekly or monthly volume/turnaround schedule?
A: The truth is that we haven't yet established the staff or procedures here to deal with large-scale workflow. We expect the project to start slowly (with a small set of books at first while we work the kinks out of the instructions and see what works and what needs work), and to speed up over the coming year, as the instruction set becomes more firmly fixed and resources here are developed to cope with the workflow. At its height, we anticipate receiving and processing something on the order of 25 MB of text per week, perhaps as early as the last quarter of 2001 or the first quarter of 2002. But even at that pace, (1) turnaround time will be less important than throughput; and (2) we expect that the job will be spread among two or more vendors, and the work allocated according to the vendors' capacity. No single vendor is likely to be responsible for the whole amount. Vendors that can do less will be given less to do.

That said, if you are looking for a rough guide to our expectations: in the past, albeit with much smaller projects, we have been accustomed to turnaround times of one to two months, with throughput of (say) 30-40 MB/month of encoded text.
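
For a sense of scale, the peak figure of 25 MB per week can be converted into pages and texts. The following is only a rough Python sketch: it assumes the working estimates of 2kB per page and 200 pages per text given under Question 13, takes MB to mean one million characters, and uses illustrative function names.

```python
KB = 1_000      # one thousand characters (see Question 7)
MB = 1_000_000  # assumption: one million characters


def pages_per_week(mb_per_week: float, kb_per_page: float = 2) -> float:
    """Approximate pages received per week at a given throughput."""
    return mb_per_week * MB / (kb_per_page * KB)


def texts_per_week(mb_per_week: float, pages_per_text: float = 200,
                   kb_per_page: float = 2) -> float:
    """Approximate number of texts per week, using the early estimates."""
    return pages_per_week(mb_per_week, kb_per_page) / pages_per_text


print(pages_per_week(25))  # pages per week at the anticipated peak
print(texts_per_week(25))  # texts per week at the anticipated peak
```

These are order-of-magnitude figures only; the real mix of books may differ considerably from the early estimates.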

Q 4: What character belongs in the filename for the DTD where it looks as if there's a space? ("eebo?sgm.dtd.txt")
A: An "underscore" character (_). The full URL is
http://www.lib.umich.edu/eebo/docs/dox/eebo_sgm.dtd.txt

(5) ANNOUNCEMENT: FIVE NEW SAMPLE PAGES ADDED TO SELECTION.
Samples 3.1a, 3.1b, 3.2, 3.3, and 3.4 have been added to the selection linked to on the samples page.

(6) Two questions related to the specified accuracy rate: (1) Q: Given that there are potential problems in the source material (unclear text, characters that resemble each other, etc.), will we be able to test our ability to produce the desired accuracy rate on a small sample first? and (2) Q: Are there criteria or a protocol in place for evaluating character accuracy under these conditions?
A: (1) Yes, certainly you may test a small sample first. We intend to start slowly, so the first books done will serve as a (paid-for) sample that will enable us to adjust our expectations and requirements to what is feasible. If you would like to work on a small (free) sample, the difficulty will be in finding something that adequately represents the variety of type problems in the source material. The small set of sample pages provided here online offers a very small sample, but some variety; you may want to try to transcribe those, and then check your transcriptions against ours (though ours are not guaranteed to be free of error!). Or pick individual pages from several of the books in the UMI EEBO free set (see Question 1 above). Or pick a single book from that set, convert it, and we will proof 5% of its pages, as we expect to do during production.

(2) There is no fixed protocol yet for assessing character accuracy. In the past we have been reasonable (even generous), refusing to regard as errors mistakes that could not reasonably have been avoided without substantial interpretation of the text or context. Establishing such a protocol will be one of the first tasks once we begin production and have hard data to base it on.

Q 7: Are we to take the "kB" in the estimate of data quantity as three zeroes (10^3)?
A: Yes.

(8) ANNOUNCEMENT: FIVE NEW SAMPLE PAGES ADDED TO SELECTION.
Samples 3.5, 3.6, 3.7, 3.8, and 3.9 have been added to the selection linked to on the samples page.

Q 9: Is there a list of the books that will be converted?
A: There is no list, but there is a selection principle. Books will be selected for conversion based on the New Cambridge Bibliography of English Literature (NCBEL). All of the books associated in one way or another with the authors listed in the NCBEL will eventually be converted, including both books written by the authors in question and works otherwise associated with them (e.g., books which the authors are rebutting, or books which themselves rebut the NCBEL author). Moreover, all books printed before 1500 will be converted, regardless of author.

Q 10: In what order will books be converted?
A: The bulk of the pre-1500 books will not be done first; the first batch of 10-25 books (per vendor) will contain a representative range of books as regards both date and variety. No other sequence has been determined, though a prominent text or two is likely to appear early on.

Q 11: How will you clarify the specifications in a multi-vendor environment?
A: We have no experience with working with multiple vendors on the same project, so do not have a mechanism in place to do this, but we will ensure that all such information is shared among all participating vendors.

Q 12: How many vendors are involved?
A: We do not know how many vendors will end up working on this project; about a dozen have received the RFQ.

Q 13: Do you have any estimate of the average number of pages in a book or bytes in a page?
A: Initial estimates were based on byte counts of the Chadwyck-Healey Early English Prose Fiction database, plus approx. 10% to account for the material that C-H typically omits. These estimates were: 200 pp./text; 1.8kB/page. Overall averages within the three major collections are as follows: Pollard & Redgrave (ESTC I): average 324.36 pp. per work (9.5 million pages total, 29,288 titles); Wing (ESTC II): 194.3 pp. per work (11.5 million pages total, 59,189 titles); Thomason Tracts: 86 pp. per work (1.28 million pages total, 14,917 titles). But these averages do not necessarily correspond with the mix to be found in the final selection. Experience will give us surer figures in time; 2kB/page is probably a good working estimate.
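
The collection averages quoted above can be rechecked from the page and title totals, and the working estimate of 2kB per page turned into a rough per-book size. This is a minimal Python sketch; the names are illustrative only, and the totals are the rounded figures given in this answer, so the results agree with the quoted averages only approximately.

```python
# Rounded totals quoted in the answer: (total pages, number of titles).
COLLECTIONS = {
    "Pollard & Redgrave (ESTC I)": (9_500_000, 29_288),
    "Wing (ESTC II)": (11_500_000, 59_189),
    "Thomason Tracts": (1_280_000, 14_917),
}

KB = 1_000                 # "kB" read as one thousand characters (Question 7)
CHARS_PER_PAGE = 2 * KB    # working estimate from the answer


def average_pages(total_pages: int, titles: int) -> float:
    """Average number of pages per work."""
    return total_pages / titles


def estimated_book_size(pages: float, chars_per_page: int = CHARS_PER_PAGE) -> float:
    """Rough size of one encoded book, in characters."""
    return pages * chars_per_page


for name, (pages, titles) in COLLECTIONS.items():
    avg = average_pages(pages, titles)
    size_kb = estimated_book_size(avg) / KB
    print(f"{name}: {avg:.1f} pp./work, about {size_kb:.0f} kB/book")
```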

Q 14: Is there an optimal batch size? Optimal turnaround time?
A: Turnaround time per se is not very important (see Question 3 above); turnaround of 1-2 months seems reasonable. The optimal batch size will increase from quite small (ca. 5 MB) at the beginning of the project to 20-25 MB during the period of full-scale production.

Q 15: Are the existing keyboarding instructions "best-guess" approximations? I.e., are they subject to improvement in response to vendor suggestions?
A: Yes, of course. We welcome vendor suggestions, and will consider emending the instructions in response. Further experience with the material on our part as well as on the part of the vendors will doubtless give us reason to improve the instructions.

Q 16: How will the 99.995% accuracy requirement be interpreted in the context of potentially illegible material?
A: As stated above, there is no fixed protocol yet for assessing per-character accuracy. In the past we have been reasonable (even generous), refusing to regard as errors mistakes that could not reasonably have been avoided without substantial interpretation of the text or context. Establishing such a protocol will be one of the first tasks once we begin production and have hard data to base it on.

Q 17: Who will be evaluating vendor bids and responses to the RFQ?
A: Representatives of the University Library's Collections department and Digital Library Production Service, as well as the University Purchasing office.

Q 18: Can the response to the EEBO-TCP RFQ be submitted electronically, or do you prefer hard copy?
A: E-mail attachments or hard copy files are fine.

Q 19: How many texts are expected to be converted in the six weeks immediately after the award?
A: We envision sending each vendor ten texts from a variety of time periods and in a variety of fonts in the first installment. We will be expecting vendors to begin sending texts back to us within six weeks, but we don't necessarily expect them all to be returned at the same time. Instead, we're aiming to establish a streaming system, where texts flow back and forth rather than being sent and returned in large blocks.

Q 20: When is the award expected to be announced? (Please note that Christmas [intervenes]; will the six weeks be extended to seven if the award is made in December?)
A: Vendors will know that they've been selected by December 15. We will not expect all ten initial texts to be completed within the six-week period, but, as noted above, we would like to see as much encoded text as possible so that we can begin fine-tuning our keyboarding instructions.

Q 21: Will the amount of data a vendor can convert per year have an impact on the evaluation process?
A: Yes, the amount that a vendor can convert will be a factor in the evaluation process. We would like to know that vendors can devote enough time and space to EEBO material to become accustomed to its printing and to give sustained attention to the project so that what is learned can be evenly applied throughout the encoding process. By the same token, as has been stated, vendors that can do less will be given less to do, and turnaround time is less important than the amount of text produced.

Q 22: Can we assume that each vendor selected will receive a cross-section of titles that range in difficulty from straight-forward to highly complex?
A: Generally speaking, each vendor will be receiving a range of texts. Because of the difficulty of sorting such a mass of works, grouping texts by their subject, length, or typeface would prove impractical. As a result, vendors will need to be able to encode a wide variety of texts.

Q 23: Will the images of the text be provided on CD-ROM, or on an alternate media?
A: Vendors will be given access to the texts via the internet. They will be given unique identifying numbers that call up texts to be encoded in the EEBO database; from there, vendors will be able to download the texts in Adobe format, burn their own CD ROMs, and print as they like. If delivery over the internet proves too slow, it may prove necessary to develop alternative systems for transferring texts, most likely via CD ROM. The question of who will pay the cost for burning CD ROMs will be addressed as the need arises.

Q 24: What are the image specifications - dpi, grayscale, compression, etc.?
A: The EEBO images are made from the microfilm, and are bitonal and 400 dpi, compressed via CCITT Group 4 (fax) compression. (Let me know if there are more specifics you'd like.)

Q 25: You ask for a price in terms of dollars per thousand characters (kB). Are markup characters included in that thousand?
A: Yes. The price should be per thousand characters regardless of whether they are characters transcribed from the book or supplied as tags, attributes, etc.
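
In other words, the billable quantity is simply the total character count of the delivered file, markup included. The following is a minimal Python sketch of that reading; the sample markup, the rate, and the function names are hypothetical, and whether whitespace and line breaks count toward the total is an assumption not addressed in this answer.

```python
def billable_characters(encoded_text: str) -> int:
    """Count every character delivered: transcribed text, tags, and attributes.

    Assumption: whitespace and line breaks are counted as well.
    """
    return len(encoded_text)


def price(encoded_text: str, dollars_per_thousand: float) -> float:
    """Price of a delivery at a given rate per thousand characters (kB)."""
    return billable_characters(encoded_text) / 1000 * dollars_per_thousand


# Hypothetical fragment; the tag names are for illustration only.
sample = "<P>In the beginning <HI>God</HI> created the heauen and the earth.</P>"
HYPOTHETICAL_RATE = 1.50  # dollars per thousand characters, invented for the example

print(billable_characters(sample))
print(price(sample, HYPOTHETICAL_RATE))
```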

Q 26: May we have access to the images now, in order to assess the mix of genres, typefaces, and formats in the database, as well as to examine the quality and legibility of the images themselves?
A: Yes. Bell and Howell will provide full access to the EEBO images till December 1 to interested vendors. All vendors should shortly be receiving the login and password by fax.

Q 27: Does this Early English Work contain any romantic literature? If so, did you consider using the Chadwyck-Healey DTD for Romantic Literature?
A: I'm not familiar with any Chadwyck-Healey "romantic" collection, so I don't know what they or you mean by the term (Romantic poets? if so, no; Romances, in the medieval sense? if so, a few). In any case, though we work with many of the CH dtds because we mount many CH textbases locally, we would never consider using a proprietary dtd for our own work; we are a TEI shop and are accustomed to using TEI and TEI derivatives, regarding them as the public standard for the encoding of humanities texts. In a big collaborative effort like EEBO, a TEI-based dtd is the only choice possible.

Q 28: What are the different types of documents in the database? You have described, Prose, Poems, Plays, Letters, Dictionaries. However, many of the examples seemed to be Biblical in nature.
A: I'm not sure that I understand the last point. Do not Bibles fall under prose? There are four samples from Bibles amongst the 36 sample pages (and of these four, one is really a map and list, and another a calendar that just happens to appear in a Bible). Bibles probably do not make up quite this proportion of the database generally; but the sample pages do not pretend to be proportionally representative.

On the larger question, the database contains virtually every early English printed book; therefore every possible type of book is to be found in it. The selection principle currently in place (described in the answer to Question 9) will probably mean that "literary" texts, including drama and poetry, will be somewhat overrepresented, but there will certainly be sermons, plays, reference books, philosophic and scientific works, laws, chronicles, practical books of every sort, as well, of course, as Bibles, tracts, treatises, and commentaries. Even among the sample pages you will find an herbal, a poetic manual, several teaching manuals and grammars, a prose romance, poetry, rhetoric, a collection of proverbs, a work on linguistics, an allegory, ecclesiastical history, a Biblical commentary, sermons, and several tracts and treatises.

Q 29: How will the DTD Revisions be monitored by the University of Michigan and how will this be communicated to all the vendors. This did not form a part of the documentation. This will become a major area of concern and is unavoidable. It seems that you will need to maintain a database and a record of all changes requested.
A: We do not expect to make many changes in the dtd, especially after the first few months. Necessary changes will be announced directly to the project leader(s) at the conversion firm(s); a record of the announcement will be maintained on the EEBO web page; a dated record of all changes will be preserved at the head of the dtd itself. No one will be expected to call back work in progress to ensure retroactive compliance with the changes. In general, DTD changes have not been a problem in past projects; changes are usually slight (e.g., addition of a single attribute or element to accommodate an unanticipated feature), and rarely affect the common run of books. And since most tend to make the dtd less, rather than more, restrictive, they do not in general cause previously converted texts to become invalid.

Q 30: How will you check for the accuracy of the markup and what is the acceptance criteria for the same?
A: We will employ in-house tag reviewers. Criteria for acceptance (aside from SGML validity) are necessarily somewhat subjective, but they will certainly include an assessment of whether the instructions were followed with reasonable care, bearing in mind that the data capture firm cannot be expected to apply specialized knowledge to the interpretation of the text; whether the text is tagged consistently and intelligibly (even if not in exact conformity with instructions); and whether repair of incorrect tagging would be onerous for our reviewers.

Q 31: For text accuracy you have defined 99.995% but have not declared as to how you will do the testing for this and the testing methods, sampling procedures that you will employ.
A: In past projects we have manually proof-read a random 5% of the pages of the books converted, rejecting data when necessary at the batch level (assuming a 20MB batch, the proofread 5% sample, roughly 1MB, should contain fewer than about 50 erroneous characters), though it is usually possible to identify the individual books in which the errors are concentrated.

We expect to continue to use this method, unless it proves impractical. But no text will be rejected on the basis of anything less than 5% of its pages. See also the answers to Questions 6 and 16 regarding the criteria for assessing accuracy.
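
The figure of roughly 50 erroneous characters is consistent with applying the 99.995% requirement to the proofread 5% sample of a 20MB batch (about 1MB); a full 20MB batch at the same rate would allow about 1,000. The following is a minimal Python sketch of that arithmetic, assuming that MB means one million characters; the function names are illustrative, and the actual acceptance protocol has not yet been fixed.

```python
MB = 1_000_000            # assumption: one million characters
ERRORS_PER_100K = 5       # 99.995% accuracy = at most 5 errors per 100,000 characters
SAMPLE_PERCENT = 5        # share of pages proofread


def sample_size(batch_chars: int, percent: int = SAMPLE_PERCENT) -> int:
    """Approximate number of characters in the proofread sample."""
    return batch_chars * percent // 100


def allowed_errors(chars: int, per_100k: int = ERRORS_PER_100K) -> float:
    """Error budget for a given number of characters at the required accuracy."""
    return chars * per_100k / 100_000


batch = 20 * MB
print(allowed_errors(sample_size(batch)))  # budget within the 5% sample
print(allowed_errors(batch))               # budget across the whole batch
```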

Q 32: (Problem Areas anticipated - Markup:) Identifying language -- for example whether it is Latin, Greek, Anglo-Saxon, Old English, Middle English or Modern Early English. The attribute value for this has been described as CDATA and hence will parse whatever we determine. But will it be right?
A: Perhaps not. But note that we specifically instructed you NOT to try to distinguish between subtypes of language (Anglo-Saxon, if there is any, Middle English, and early Modern English all get recorded as "eng"). We also instructed you not to insert a value unless you're sure of it. It may be that this instruction is still too ambitious; if it proves so, it may be dropped entirely after the early (test) phase of production.

As for the declaration of the LANG value as CDATA, replacing CDATA with something else (a list of possible values or #IDREFS) would not help matters any, since under those circumstances you could still insert the wrong value (e.g., put "lat" instead of "eng"). So I am not sure why you bring it up in this connection.
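
The point can be illustrated outside SGML. In the Python sketch below, a check against a list of permitted values stands in for a more restrictive attribute declaration; the list of codes is hypothetical (only "eng" and "lat" appear in this document). The check rejects an unknown code, but it accepts "lat" even where the passage is in fact English.

```python
# Hypothetical list of permitted language codes, for illustration only.
PERMITTED_LANG_VALUES = {"eng", "lat", "gre", "heb"}


def is_permitted(lang_value: str) -> bool:
    """Mimic what a restricted declaration can check: membership in the list."""
    return lang_value in PERMITTED_LANG_VALUES


# A passage that is actually English, but tagged with the wrong (permitted) code.
actual_language = "eng"
tagged_value = "lat"

print(is_permitted(tagged_value))       # the list-based check passes
print(tagged_value == actual_language)  # yet the value is still wrong
print(is_permitted("xyz"))              # only unknown codes are caught
```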

Q 33: (Problem Areas anticipated - Markup:) Incidentally this is true of almost all the attributes. While it is easier to use a Standard DTD and try to have all documents marked up to this standard, there will be multiple variations in the documents which will now be open to vendor interpretation. How will you resolve this -- example what are all the values that can occur within the CDATA?
A: Though we hope to achieve *some* uniformity across the data, perhaps (in the case of attribute values) through the publication of lists of suggested values, we are willing to accept some variety as well. Minor variation in attribute values can be resolved here through global harmonization. We hesitate to prescribe many values now, with little experience of the variety found in the data, simply because we do not know enough to be able to predict all the values that we will need. Hence CDATA.

This question also seems to overstate the importance in the instructions of attribute values; to ignore several instructions that specify values for attributes even when the specifications are not embodied in the dtd; and to ignore the various instructions that require the vendor to supply no value when there is no obvious value to supply.
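
As an illustration of what global harmonization of attribute values might involve, the Python sketch below folds minor variants (case, stray spaces, trailing periods, common abbreviations) into a single form. The synonym table and the values in it are hypothetical, invented for the example; this is not a description of the procedure actually used here.

```python
from typing import Mapping

# Hypothetical table of variant spellings and their preferred forms.
SYNONYMS = {
    "chap": "chapter",
    "ch": "chapter",
    "bk": "book",
}


def harmonize(value: str, synonyms: Mapping[str, str] = SYNONYMS) -> str:
    """Reduce an attribute value to a preferred form.

    Collapses whitespace, lowercases, strips trailing periods, and then
    applies the synonym table; unknown values are returned normalized.
    """
    key = " ".join(value.split()).lower().rstrip(".")
    return synonyms.get(key, key)


for raw in ["Chapter", " Chap. ", "ch.", "Bk", "poem"]:
    print(repr(raw), "->", harmonize(raw))
```

A table of this kind can only be drawn up once the range of values actually supplied is known, which is one reason for not prescribing many values in advance.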

Q 34: The placement of the marginal text needs to be defined and explained more clearly. Otherwise you could end up with different interpretations to the same.
A: So long as we are determined to interpret the relationship between marginal text and main text at all (and we are), different interpretations of that relationship, and particularly of the placement of the marginal text relative to the main text, are inevitable. Given the multitude of ways that early modern authors and printers found to use marginal text and to relate it to main text, we can only say that if you can suggest a clearer way of defining the possible relationships and of prescribing the optimal placement, please let us know!

Q 35: May be difficult to identify stanzas in poems, even the University has recognised this fact.
A: Yes, we recognize this fact because we have had many poetry texts converted in the past and have frequently found this to be a problem. Again, if you have a suggestion, short of dispensing with stanzas altogether, we would like to hear it.

Q 36: (Problem Areas anticipated - Text:) Intensive training will be required to train the keyboarders in the subtleties and identification of the different alphabets. More so since the languages are mixed.
A: This is really a comment rather than a question. But yes, we imagine that some training will be needed, but perhaps it is more a matter of practice than of training. Or even of learning a method for identifying the characters in a particular book by comparing them to other, known, characters in the same book, instead of rote training in the recognition of letter forms. But how you do this is of course up to you.

I assume that by "alphabets" you mean type faces; but if you are referring to the distinction between the Latin alphabet and (say) the Greek or Hebrew alphabets, this distinction is usually quite marked in the books I have seen.

Q 37: For how long a period of time will contracts be awarded?
A: The University reserves the right to award contracts of varying lengths depending on the answers we receive to the RFQ. No contract will be made for a period of less than a year; we anticipate that most vendors will receive two-year contracts initially, which will then be eligible for renewal.