Calculating error rates for EEBO data
1. Methods and terms
- By the book or the batch. Most errors are counted and an error rate is initially
calculated for each book (or item) received. Smaller items (currently, those under 30KB) are often batched together and sampled and proofed as a unit, in order to save overhead.
- Sample sizes. For most books:
- at least 5% of its pages (randomly selected) are sampled.
- at least 5% of its data (in bytes) is sampled.
- at least five pages are sampled (unless the item
is itself fewer than five pages long, in which
case it is proofed entire).
For very large books: the sample size is reduced on a sliding scale so that for the very largest books the sample size may amount to only 3% rather than 5% of the whole.
For very small books: very small books are often combined into a temporary file which is itself the size of a modest book, and that file is in turn sampled.
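The page-count rules for most books can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `percent` and `minimum` parameters, and the use of a seeded random generator are hypothetical, and the sketch covers neither the 5%-of-bytes rule nor the sliding scale for very large books, whose exact steps are not given here.

```python
import random

def pages_to_sample(page_count, percent=5, minimum=5):
    """Number of pages to proof under the page-count rules for most books."""
    if page_count <= minimum:
        # Items of five pages or fewer end up proofed entire.
        return page_count
    # Round the percentage up with integer arithmetic so that
    # "at least 5%" is always satisfied.
    by_percent = (page_count * percent + 99) // 100
    return max(minimum, by_percent)

def choose_sample_pages(page_count, seed=None):
    """Pick the page numbers (1-based) to proof, at random."""
    rng = random.Random(seed)  # the seed is an assumption, for repeatability
    k = pages_to_sample(page_count)
    return sorted(rng.sample(range(1, page_count + 1), k))
```

For example, `pages_to_sample(60)` gives 5 (the five-page minimum applies), while `pages_to_sample(400)` gives 20.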
- Error ratio: the denominator. For purposes of
establishing an error ratio, the size of the
sample is deemed to be the size of the actual
sample file (in bytes), after all tags have been
removed and all character entities reduced to
a single character.
NOTE: Though we have not been counting any errors
of spacing, we have not been correspondingly
reducing the sample size by the number of space
(or newline) characters. Because those characters
inflate the denominator, this policy has caused
the calculated error rate to be underestimated,
probably by about 10%.
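A minimal sketch of how the denominator might be computed, assuming SGML/XML-style markup in which tags look like `<...>` and character entities look like `&name;`. The regular expressions and function names are illustrative assumptions, not the project's actual tooling. The sketch counts characters, which equals the byte count only when the stripped text is single-byte.

```python
import re

# Assumed markup conventions: <TAG ...> for tags, &name; for entities.
TAG = re.compile(r"<[^>]*>")
ENTITY = re.compile(r"&#?[\w.-]+;")

def stripped_text(marked_up_text):
    """Remove all tags and reduce each character entity to one character."""
    text = TAG.sub("", marked_up_text)
    return ENTITY.sub("?", text)  # "?" is a one-character placeholder

def sample_size(marked_up_text):
    """Denominator as described above: spaces and newlines are included."""
    return len(stripped_text(marked_up_text))

def sample_size_without_spacing(marked_up_text):
    """Alternative denominator that leaves out space and newline characters."""
    return sum(1 for ch in stripped_text(marked_up_text) if not ch.isspace())
```

Comparing the two functions on a real sample file would show how much the inclusion of spacing characters enlarges the denominator for that file.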
- Error ratio: the numerator.
- Error classes. The errors found in
the sample file are evaluated on a case-by-case
basis in order to determine into which category
they fall:
- (1) inexcusable or 'unforced' transcription errors;
- (2) unwarranted use of the illegible marker;
- (3) excusable or 'forced' transcription errors;
- (4) spacing errors; and
- (5) supply of a letter actually missing or completely illegible in the source.
Only errors in categories (1) and (2) are used in calculating
the error rate. Errors in categories (3), (4), and (5) are
noted but not counted.
See below for an explanation of category (1)
and how it is distinguished from category (3).
- Error numbers:
- 1 pair of transposed letters = 1 error
- 1 wrongly interpreted letter = 1 error
- 1 omitted letter = 1 error
- 1 inserted letter (not representing anything in the
text) = 1 error
- 1 letter mistaken for 2 = 1 error
- 2 letters mistaken for 1 = 1 error
- 1 perfectly legible letter ($) or word ($Word$) flagged as illegible = 1 error
For purposes of counting, a character entity
counts as 1 letter, as does a paired superscript
(or subscript) marker and its following letter,
this pair being regarded as equivalent to a
superscript (or subscript) character entity.
Errors are case sensitive; that is, capturing
"g" as "G" is an error.
And of course "letter" in the above means
any alphanumeric character or symbol.
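Putting numerator and denominator together, the calculation might look like the sketch below. It assumes each error found in the sample has already been judged by hand and recorded as its category number, (1) through (5), with each listed event counting as one error. The function and constant names are hypothetical.

```python
from collections import Counter

COUNTED = {1, 2}        # categories used in calculating the error rate
NOTED_ONLY = {3, 4, 5}  # categories noted but not counted

def error_rate(error_categories, sample_size):
    """Return counted errors divided by the sample size.

    error_categories: one category number (1-5) per error found.
    sample_size: the denominator, i.e. the size of the stripped sample file.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    tally = Counter(error_categories)
    unknown = set(tally) - COUNTED - NOTED_ONLY
    if unknown:
        raise ValueError(f"unknown error categories: {sorted(unknown)}")
    counted = sum(tally[c] for c in COUNTED)
    return counted / sample_size
```

With seven errors recorded as `[1, 1, 2, 3, 4, 4, 5]` in a sample of 6000 characters, only the three errors in categories (1) and (2) count, so the rate is 3/6000.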
2. "Excusable" vs. "inexcusable"
"Inexcusable" errors are, in general, those that a
non-specialist keyer could reasonably be
expected to avoid, given the nature of the source
material. Errors involving transposed characters,
omitted characters, or inserted characters can all
usually be regarded as inexcusable without
question.
A separate file of EXAMPLES illustrates
"excusable", "inexcusable", and dubious errors.
Errors involving erroneous interpretation of
characters (or character groups) inevitably have
some subjective quality to them. When deciding
whether a keyer should be expected to have
interpreted the character correctly, we make the
following assumptions:
- The keyers are looking at the same image file
that we are.
- The keyers must depend on the physical appearance
of the letter(s), and cannot be counted on to
consider the sense of the word
or passage, though they sometimes do anyway and certainly should when they can.
- A letter may often be accurately read even if
broken or otherwise deformed, so long as enough
remains to give unambiguous testimony to its
original shape.
- Truly ambiguous letter forms should be
represented by "$", not guessed at.
- The letters may be "zoomed in" on if that helps
to resolve ambiguities.
- Not only the characteristic forms of a letter
in general, but also the characteristic form
of the letter in the given typeface and
(especially) the characteristic form of the
letter in the same book and on the same page
may be brought to bear in resolving ambiguities.
- The shapes of adjacent letters may be adduced
as evidence for the value of a given letter
(especially where a ligature is involved).