Calculating error rates for EEBO data
1. Methods and terms
- By the book or the batch. Most errors are counted and an error rate is initially
   calculated for each book (or item) received. Smaller items (currently, those under 30KB) are often batched together and sampled and proofed as a unit, in order to save overhead.
- Sample sizes. For most books:
    - at least 5% of the book's pages (randomly selected) are sampled;
    - at least 5% of the book's data (in bytes) is sampled; and
    - at least five pages are sampled (unless the item
        is itself fewer than five pages long, in which
        case it is proofed entire).
    For very large books: the sample size is reduced on a sliding scale, so that for the very largest books it may amount to only 3% rather than 5% of the whole.
    For very small books: very small books are often combined into a temporary file which is itself the size of a modest book, and that file is in turn sampled.
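The rules for most books can be sketched in a few lines of Python. This is only an illustration: it reads the three conditions as applying jointly (an interpretation), the function names are hypothetical, and the sliding scale for very large books is left out because its exact steps are not given here.

```python
def min_sample_pages(total_pages: int, percent: int = 5, floor: int = 5) -> int:
    """Smallest page count that satisfies the page rules for most books."""
    # Integer ceiling of total_pages * percent / 100 (avoids float rounding).
    by_percent = (total_pages * percent + 99) // 100
    # Never fewer than `floor` pages, but a very short item is proofed entire.
    return min(total_pages, max(floor, by_percent))


def sample_is_adequate(total_pages: int, sampled_pages: int,
                       total_bytes: int, sampled_bytes: int,
                       percent: int = 5) -> bool:
    """Check a drawn sample against both the page rule and the byte rule."""
    enough_pages = sampled_pages >= min_sample_pages(total_pages, percent)
    enough_bytes = sampled_bytes * 100 >= total_bytes * percent
    return enough_pages and enough_bytes
```

For example, a 40-page book needs at least five pages under this reading (5% would be only two), while a 3-page item is proofed entire.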
- Error ratio: the denominator. For purposes of
    establishing an error ratio, the size of the
    sample is deemed to be the size of the actual
    sample file (in bytes), after all tags have been
    removed and all character entities reduced to
    a single character.
    NOTE: Though we have not been "counting" any errors
    of spacing, we have not been correspondingly
    reducing the sample size by the number of space
    (or newline) characters. This policy has caused
    the calculated error rate to be underestimated,
    probably by about 10%. 
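A minimal sketch of the denominator calculation follows. It assumes SGML/XML-style markup, with tags written `<...>` and character entities written `&name;`. The patterns and the function name are illustrative assumptions rather than the project's actual tooling, and the sketch counts characters rather than bytes.

```python
import re

# Assumed markup syntax: <TAG ...> for tags, &name; or &#123; for entities.
TAG = re.compile(r"<[^>]*>")
ENTITY = re.compile(r"&[#\w.-]+;")


def sample_size(marked_up: str, count_whitespace: bool = True) -> int:
    """Size of a sample after tags are removed and entities reduced."""
    text = TAG.sub("", marked_up)   # remove all tags
    text = ENTITY.sub("?", text)    # each entity becomes a single character
    if count_whitespace:
        return len(text)            # spaces and newlines included
    return sum(1 for ch in text if not ch.isspace())


print(sample_size("<P>the &amacr;gel</P>"))                          # 8
print(sample_size("<P>the &amacr;gel</P>", count_whitespace=False))  # 7
```

The `count_whitespace` switch shows the effect described in the NOTE: leaving spaces and newlines in the denominator makes the sample look larger and the resulting rate correspondingly lower.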
- Error ratio: the numerator.
- Error classes. The errors found in
    the sample file are evaluated on a case-by-case
    basis in order to determine into which category
    they fall:
      (1) inexcusable or 'unforced' transcription errors;
      (2) unwarranted use of the illegible marker;
      (3) excusable or 'forced' transcription errors;
      (4) spacing errors; and
      (5) supply of a letter actually missing or completely illegible in the source.
    Only errors in categories (1) and (2) are used in calculating
    the error rate. Errors in categories (3), (4), and (5) are
    noted but not counted.
    See below for an explanation of category (1)
    and how it is distinguished from category (3).
- Error numbers:
    - 1 pair of transposed letters = 1 error
    - 1 wrongly interpreted letter = 1 error
    - 1 omitted letter = 1 error
    - 1 inserted letter (not representing anything in the
        text) = 1 error
    - 1 letter mistaken for 2 = 1 error
    - 2 letters mistaken for 1 = 1 error
    - 1 perfectly legible letter ($) or word ($Word$) flagged as illegible = 1 error
    For purposes of counting, a character entity
        counts as 1 letter, as does a paired superscript
        (or subscript) marker and its following letter,
        this pair being regarded as equivalent to a
        superscript (or subscript) character entity.
        Errors are case sensitive; that is, capturing
        "g" as "G" is an error.
    And of course "letter" in the above means
        any alphanumeric character or symbol.
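Putting numerator and denominator together, the calculation might be sketched as below. The function name and the input format (one category number per error found) are assumptions made for illustration. The category numbers refer to the "Error classes" list, of which only (1) and (2) are counted, and the rate is returned as a plain fraction because no reporting unit is specified here.

```python
from collections import Counter

# Category numbers follow the "Error classes" list; only (1) and (2) count.
COUNTED_CATEGORIES = frozenset({1, 2})


def error_rate(error_categories, sample_chars):
    """Return (counted_errors, rate) for one sample.

    error_categories: one category number (1-5) per error found, each
        entry already standing for exactly one error under the rules above.
    sample_chars: the denominator, i.e. the size of the stripped sample.
    """
    if sample_chars <= 0:
        raise ValueError("sample size must be positive")
    tally = Counter(error_categories)
    counted = sum(n for cat, n in tally.items() if cat in COUNTED_CATEGORIES)
    return counted, counted / sample_chars


# Seven errors noted, of which three fall in counted categories.
counted, rate = error_rate([1, 1, 2, 3, 4, 4, 5], sample_chars=10_000)
```

Errors in the other categories still appear in the tally, so they can be noted in a report without affecting the rate.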
 
 
2. "Excusable" vs. "inexcusable"
   "Inexcusable" errors are, in general, those that a
   non-specialist keyer could reasonably be
   expected to avoid, given the nature of the source
   material. Errors involving transposed characters,
   omitted characters, or inserted characters can all
   usually be regarded as inexcusable without
   question.
   
Follow this link to a file of EXAMPLES
   of "excusable", "inexcusable", and dubious errors.
   
Errors involving erroneous interpretation of
   characters (or character groups) inevitably have
   some subjective quality to them. When deciding
   whether a keyer should be expected to have
   interpreted the character correctly, we make the
   following assumptions:
- The keyers are looking at the same image file
       that we are.
- The keyers must depend on the physical appearance
       of the letter(s), and cannot be counted on to
       consider the sense of the word
       or passage, though they sometimes do anyway and certainly should when they can.
- A letter may often be accurately read even if
       broken or otherwise deformed, so long as enough
       remains to give unambiguous testimony to its
       original shape.
- Truly ambiguous letter forms should be
       represented by "$", not guessed at.
- The letters may be "zoomed in" on if that helps
       to resolve ambiguities.
- Not only the characteristic forms of a letter
       in general, but also the characteristic form
       of the letter in the given typeface and
       (especially) the characteristic form of the
       letter in the same book and on the same page
       may be brought to bear in resolving ambiguities.
- The shapes of adjacent letters may be adduced
       as evidence for the value of a given letter
       (especially where a ligature is involved).