Calculating error rates for EEBO data

1. Methods and terms

  1. By the book or the batch. Most errors are counted and an error rate is initially calculated for each book (or item) received. Smaller items (currently, those under 30KB) are often batched together and sampled and proofed as a unit, in order to save overhead.

  2. Sample sizes. For most books:
    1. at least 5% of its pages (randomly selected) are sampled.

    2. at least 5% of its data (in bytes) is sampled.

    3. at least five pages are sampled (unless the item is itself less than five pages long, in which case it is proofed entire).

    For very large books: the sample size is reduced on a sliding scale so that for the very largest books the sample size may amount to only 3% rather than 5% of the whole.

    For very small books: these are often combined into a temporary file that is itself the size of a modest book, and that file is in turn sampled.
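The sampling rule for ordinary books can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `choose_sample`, the `page_sizes` input (one byte count per page), and the idea of adding random pages until all three minimums are met are hypothetical, not a description of the actual sampling tool. The sliding scale for very large books and the batching of very small ones are not modeled.

```python
import math
import random

def choose_sample(page_sizes, percent=5, min_pages=5, seed=None):
    """Return the sorted page indices to proof for one item.

    page_sizes is a hypothetical list holding the size of each page in bytes.
    """
    n = len(page_sizes)
    if n <= min_pages:
        # An item with no more than five pages ends up proofed entire.
        return list(range(n))

    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)  # random page selection

    need_pages = max(min_pages, math.ceil(n * percent / 100))
    need_bytes = sum(page_sizes) * percent / 100

    chosen, got_bytes = [], 0
    for idx in order:
        # Stop once both the page minimum and the byte minimum are satisfied.
        if len(chosen) >= need_pages and got_bytes >= need_bytes:
            break
        chosen.append(idx)
        got_bytes += page_sizes[idx]
    return sorted(chosen)
```

With 200 pages of equal size this selects 10 pages; with 40 pages the five-page minimum governs instead.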

  3. Error ratio: the denominator. For purposes of establishing an error ratio, the size of the sample is deemed to be the size of the actual sample file (in bytes), after all tags have been removed and all character entities reduced to a single character.
    NOTE: Though we have not been "counting" any errors of spacing, we have not been correspondingly reducing the sample size by the number of space (or newline) characters. This policy has caused the calculated error rate to be underestimated, probably by about 10%.
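As a rough sketch of how the denominator might be computed, the Python below strips tags, reduces each character entity to a single character, and measures what remains. It assumes SGML/XML-style `<...>` tags and `&name;` entities and treats one character as one byte; the function name `sample_size`, the regular expressions, and the `count_whitespace` switch are all hypothetical. The switch is there only to show how excluding spaces and newlines would shrink the denominator, as the NOTE describes.

```python
import re

# Assumed markup conventions: <...> for tags, &name; for character entities.
TAG = re.compile(r"<[^>]*>")
ENTITY = re.compile(r"&[A-Za-z0-9#]+;")

def sample_size(text, count_whitespace=True):
    """Size of a sample after removing tags and reducing entities."""
    stripped = TAG.sub("", text)
    # Each entity is replaced by one placeholder character.
    reduced = ENTITY.sub("x", stripped)
    if not count_whitespace:
        # Hypothetical variant: leave spaces and newlines out of the count.
        reduced = re.sub(r"\s", "", reduced)
    return len(reduced)

# Illustrative fragment; the entity name is made up for the example.
fragment = "<P>The &longs;ame <HI>thing</HI>\n"
size_with_spaces = sample_size(fragment)
size_without_spaces = sample_size(fragment, count_whitespace=False)
```

For this fragment the first figure is 15 and the second is 12. Because the current policy uses the larger figure as the denominator while not counting spacing errors, the calculated rate comes out lower than it would otherwise.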

  4. Error ratio: the numerator.
    1. Error classes. The errors found in the sample file are evaluated on a case-by-case basis in order to determine into which category they fall:
      1. inexcusable or 'unforced' transcription errors;
      2. unwarranted use of the illegible marker;
      3. excusable or 'forced' transcription errors;
      4. spacing errors; and
      5. supply of a letter actually missing or completely illegible in the source.
      Only errors in categories (1) and (2) are used in calculating the error rate. Errors in categories (3), (4), and (5) are noted but not counted. See below for an explanation of category (1) and how it is distinguished from category (3).

    2. Error numbers:
      • 1 pair of transposed letters = 1 error
      • 1 wrongly interpreted letter = 1 error
      • 1 omitted letter = 1 error
      • 1 inserted letter (not representing anything in the text) = 1 error
      • 1 letter mistaken for 2 = 1 error
      • 2 letters mistaken for 1 = 1 error
      • 1 perfectly legible letter ($) or word ($Word$) flagged as illegible = 1 error

      For purposes of counting, a character entity counts as 1 letter, as does a paired superscript (or subscript) marker and its following letter, this pair being regarded as equivalent to a superscript (or subscript) character entity. Errors are case sensitive; that is, capturing "g" as "G" is an error.

      And of course "letter" in the above means any alphanumeric character or symbol.
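Putting numerator and denominator together, the calculation might look like the Python sketch below. It assumes each error found has already been classified by a reviewer and counted according to the rules above, and that only inexcusable transcription errors and unwarranted illegible markers enter the rate. The class labels, the function name `error_rate`, and the sample figures are hypothetical.

```python
# Hypothetical labels for the five error classes.
COUNTED = {"inexcusable", "unwarranted_illegible"}
NOTED_ONLY = {"excusable", "spacing", "supplied_letter"}

def error_rate(error_classes, sample_size):
    """Return (counted_errors, rate) for one sample.

    error_classes holds one label per error found; sample_size is the
    size of the sample after tags and entities have been reduced.
    """
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    unknown = set(error_classes) - COUNTED - NOTED_ONLY
    if unknown:
        raise ValueError(f"unrecognized error class: {sorted(unknown)}")
    # Errors in the noted-only classes are recorded but do not affect the rate.
    counted = sum(1 for label in error_classes if label in COUNTED)
    return counted, counted / sample_size

# Made-up example: five errors found in a 20,000-character sample.
found = ["inexcusable", "inexcusable", "spacing",
         "excusable", "unwarranted_illegible"]
counted, rate = error_rate(found, 20000)
```

Here three of the five errors are counted, giving a rate of 3 in 20,000, or 0.015%.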

2. "Excusable" vs. "inexcusable"

"Inexcusable" errors are, in general, those that a non-specialist keyer could reasonably be expected to avoid, given the nature of the source material. Errors involving transposed characters, omitted characters, or inserted characters can all usually be regarded as inexcusable without question.

See the accompanying file of EXAMPLES of "excusable", "inexcusable", and dubious errors.

Errors involving erroneous interpretation of characters (or character groups) inevitably have some subjective quality to them. When deciding whether a keyer should be expected to have interpreted the character correctly, we make the following assumptions:

  1. The keyers are looking at the same image file that we are.

  2. The keyers must depend on the physical appearance of the letter(s), and cannot be counted on to consider the sense of the word or passage, though they sometimes do anyway and certainly should when they can.

  3. A letter may often be accurately read even if broken or otherwise deformed, so long as enough remains to give unambiguous testimony to its original shape.

  4. Truly ambiguous letter forms should be represented by "$", not guessed at.

  5. The letters may be "zoomed in" on if that helps to resolve ambiguities.

  6. Not only the characteristic forms of a letter in general, but also the characteristic form of the letter in the given typeface and (especially) the characteristic form of the letter in the same book and on the same page may be brought to bear in resolving ambiguities.

  7. The shapes of adjacent letters may be adduced as evidence for the value of a given letter (especially where a ligature is involved).