Note: Some of this may be more easily done using a parsing editor with display options (e.g. XMetaL or EPCedit), but these instructions presuppose that you are using the plain-text editor TextPad.
<!DOCTYPE ETS PUBLIC "-//UMDLPS//DTD EEBOPROOF 2.0//EN">
<CHANGE><DATE>yyyy-mm-dd</DATE><RESPSTMT><NAME>[name of reviewer]</NAME><RESP>MURP</RESP></RESPSTMT><ITEM>Proofed text and corrected markup.</ITEM></CHANGE>
but many people use a more complicated one that contains a checklist of common problems to check, such as this:
<!DOCTYPE ETS PUBLIC "-//UMDLPS//DTD EEBOPROOF 2.0//EN">
<CHANGE><DATE>yyyy-mm-dd</DATE><RESPSTMT><NAME>[name of reviewer]</NAME><RESP>MURP</RESP></RESPSTMT><ITEM>
* Review overall document structure and hierarchy, including GROUP TEXT FRONT BODY BACK DIVs
* Observe divtop and divbottom material marking beginnings and ends of divisions, check that they are correctly tagged as HEAD, HEADNOTE, CLOSER, OPENER, ARGUMENT, EPIGRAPH, SIGNED, DATELINE, BYLINE, etc.
* Add TYPEs to DIVs in order to validate, making sure that the @type structure makes sense as a navigational aid, and that each type makes sense with respect to the one above it.
* Survey book for troublesome formats of all kinds, including lists, illustrations, tables, music, math, end-notes, quasi-tabular arrangements, and marginalia. Make sure that each is correctly tagged. Decide how much is worth changing.
* Look for unobtrusive numerations and other occult signs of structure. Do they mark divs? milestones?
* Examine marginalia. Is it all best tagged as NOTE?
* Look for illustrations; make sure that usable text within them has been captured and correctly tagged.
* Add FIGDESCs to any and all likely illustrations.
* Check placement and completeness of PBs.
* Check for GAPs and #s.
* If appropriate, check for yoghs.
* If appropriate, check for Latin problems, e.g. oe and ae digraphs
* If appropriate, check for abbreviation and brevigraph problems.
* If appropriate, check for units of measure, symbols (alchemical, astronomical, etc.) and the like.
* Look for decorated initials. Have they been marked as such?
* Survey the illegibles. Decide whether resolving them is feasible or even possible. Use the 100 rule only as a rule of thumb. Resolve those you can.
* Correct the errors found during proofing.
* Proof title page(s).
* Run 'skint'
* Run 'check'
* Run 'v'
</ITEM></CHANGE>
The existing templates include a good deal of boilerplate. Pieces of it that do not apply can be deleted. This area is used to record anything distinctive done to the text, or anything left undone, e.g. "blackletter text should have been tagged as HI throughout, but wasn't." Feel free to edit the templates if you find that they do not accurately reflect the most common tasks that you find yourself performing in the books. Many reviewers use the template as a quick checklist of things to do and look for. If you do not use it this way, you may prefer a much shorter form of the template.
Compare the structure applied by the vendor and correct to match the book. Detailed multilevel structural hierarchies can often be left unmarked if it proves too much trouble to capture them, but this decision should be made only after you determine to your satisfaction what the real structure is and how much would be sacrificed.
Typical vendor problems:
- missing the lower levels in a multi-level hierarchy
- treating whole poems as if they were merely stanzas (LGs)
- missing signs of subordination (i.e., putting two sections at the same level instead of making one subordinate to the other)
- using or abusing DIVs when what is really needed is <Q><TEXT><BODY><DIV1> ... </DIV1></BODY></TEXT></Q> as if the book had a 'grapefruit' structure instead of a 'raisins-in-oatmeal' structure.
In TextPad, using Find in Files to search for <DIV[^>]*> (with binary, all matching lines, and regular expression checked) will provide a list of the DIVs together with their TYPEs.
Lack of TYPEs should be the primary (often the only) reason that a file fails to validate. Pursue invalid bits one by one until the file validates.
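For those who prefer a script to Find in Files, the same search can be sketched in Python. This is only an illustration: it reuses the pattern <DIV[^>]*> from the instruction above, and it assumes that the attribute is written in upper case as TYPE="...". The function name list_divs and the sample text are made up for the example.

```python
import re

# Same pattern as the TextPad search: any DIV start tag (DIV, DIV1, DIV2, ...).
DIV_TAG = re.compile(r"<DIV[^>]*>")
# Assumption: the attribute is written TYPE="..." in upper case.
TYPE_ATTR = re.compile(r'\bTYPE\s*=\s*"[^"]*"')


def list_divs(sgml_text):
    """Return (line_number, tag, has_type) for every DIV start tag."""
    results = []
    for line_number, line in enumerate(sgml_text.splitlines(), start=1):
        for match in DIV_TAG.finditer(line):
            tag = match.group(0)
            results.append((line_number, tag, bool(TYPE_ATTR.search(tag))))
    return results


# Made-up sample, not taken from a real file.
sample = '<BODY>\n<DIV1 TYPE="book">\n<DIV2>\n</DIV2>\n</DIV1>\n</BODY>'
for line_number, tag, has_type in list_divs(sample):
    print(line_number, tag, "" if has_type else "<-- no TYPE")
```

Any DIV flagged as lacking a TYPE is a likely cause of a validation failure and a good place to start.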
Check completeness of PBs. (In Find in Files, find <PB[^>]*> with Regular expression and Binary files checked.) The resulting list should show a PB for every page in the file, including blank pages at beginning and end.
If the image set includes two (or more!) copies of the same page, choose one to capture and omit the other(s); mark the uncaptured page with <GAP DESC="duplicate" EXTENT="1 page"> and a <PB> tag.
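The PB check can likewise be sketched in Python. The pattern <PB[^>]*> is the one given above; the expected number of pages is something you must count yourself from the image set, and the names find_pbs and missing_pb_count are hypothetical.

```python
import re

# Same pattern as the TextPad search: any PB tag, with or without attributes.
PB_TAG = re.compile(r"<PB[^>]*>")


def find_pbs(sgml_text):
    """Return (line_number, tag) for every PB tag in the text."""
    found = []
    for line_number, line in enumerate(sgml_text.splitlines(), start=1):
        for match in PB_TAG.finditer(line):
            found.append((line_number, match.group(0)))
    return found


def missing_pb_count(sgml_text, expected_pages):
    """Pages counted in the image set minus PB tags found (0 means complete)."""
    return expected_pages - len(find_pbs(sgml_text))


# Made-up sample: two PBs, where the reviewer counted three pages.
sample = "<PB>\n<P>first page</P>\n<PB>\n<P>second page</P>"
print(find_pbs(sample))
print("PBs missing:", missing_pb_count(sample, expected_pages=3))
```

A positive result means that some pages (often blanks at the beginning or end) lack a PB; a negative result suggests stray or duplicated PBs.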
Note: Sometimes the last image in the set includes the FIRST page of a book that was bound with the one that you are working on. Omit this material and treat the page as a blank flyleaf. Similarly, the first image in a set will sometimes include the last page of some other book. Treat this also as a blank flyleaf. Include a PB tag but no text.
If Latin text is present, check oe's for possible ae ligatures.
# is used to mark a clear but unknown symbol, though sometimes it has been used for any old blot. Replace it with <GAP DESC="symbol">, <GAP DESC="illegible"> or (better) the correct character or character entity if that can be ascertained.
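A short Python sketch can list the lines that still contain a #, so that each can be examined against the page image. It assumes that a # immediately after an ampersand belongs to a numeric character reference and should be skipped; whether such references occur in your files is itself an assumption. The name find_stray_hashes is hypothetical.

```python
import re

# A "#" that is not part of a numeric character reference such as "&#254;".
# Assumption: such references may occur in the file and should be skipped.
STRAY_HASH = re.compile(r"(?<!&)#")


def find_stray_hashes(sgml_text):
    """Return (line_number, count, line) for each line containing a bare '#'."""
    hits = []
    for line_number, line in enumerate(sgml_text.splitlines(), start=1):
        count = len(STRAY_HASH.findall(line))
        if count:
            hits.append((line_number, count, line.strip()))
    return hits


# Made-up sample text.
sample = "The signe of # in Aries.\nA reference &#254; is skipped.\n# and # again"
for hit in find_stray_hashes(sample):
    print(hit)
```

The script only locates the marks; deciding between a GAP and the correct character still requires looking at the page.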
If you're looking for the right character entity, most of the common ones are contained in the various *.ent files in \CODE\ENTITIES, complete with brief descriptions. Using TextPad's Find in Files in that directory, with files set to *.ent, can sometimes be useful in turning up the right symbol. Also consult the handy list of entities at http://www.lib.umich.edu/tcp/docs/dox/charmap.htm.
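The same keyword search over the *.ent files can be sketched in Python. The function search_entities is hypothetical, the files are assumed to be readable as Latin-1, and the entity declarations in the demonstration are invented for the example rather than copied from the real entity files.

```python
from pathlib import Path
import tempfile


def search_entities(directory, keyword):
    """Return (file name, line number, line) for lines in *.ent files
    that contain the keyword, ignoring case."""
    hits = []
    needle = keyword.lower()
    for ent_file in sorted(Path(directory).glob("*.ent")):
        # Assumption: Latin-1 decoding, which accepts any byte sequence.
        text = ent_file.read_text(encoding="latin-1")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((ent_file.name, line_number, line.strip()))
    return hits


# Demonstration on a throwaway directory with invented declarations.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "sample.ent").write_text(
        '<!ENTITY yogh SDATA "[yogh]" -- small letter yogh -->\n'
        '<!ENTITY thorn SDATA "[thorn]" -- small letter thorn -->\n',
        encoding="latin-1",
    )
    for hit in search_entities(tmp, "yogh"):
        print(hit)
```

In practice you would pass the path of your own entities directory in place of the temporary one.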
A brief sample will show whether the MUSIC, MATH, and FOREIGN gaps are correctly used. Check spacing around <GAP DESC="foreign">. Early files often need spaces added on each side.
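The spacing check around <GAP DESC="foreign"> can be sketched as follows. The sketch assumes that the tag begins exactly with <GAP DESC="foreign" (further attributes are allowed before the closing bracket), and it treats the start and end of the text as acceptable boundaries. The name gaps_needing_space is hypothetical.

```python
import re

# Assumption: DESC is the first attribute; other attributes may follow it.
FOREIGN_GAP = re.compile(r'<GAP DESC="foreign"[^>]*>')


def gaps_needing_space(sgml_text):
    """Return (line_number, context) for foreign GAPs that lack whitespace
    on either side."""
    problems = []
    for match in FOREIGN_GAP.finditer(sgml_text):
        start, end = match.start(), match.end()
        space_before = start == 0 or sgml_text[start - 1].isspace()
        space_after = end == len(sgml_text) or sgml_text[end].isspace()
        if not (space_before and space_after):
            line_number = sgml_text.count("\n", 0, start) + 1
            context = sgml_text[max(0, start - 10):end + 10].replace("\n", " ")
            problems.append((line_number, context))
    return problems


# Made-up sample: the first GAP is run together with the text, the second is not.
sample = ('as the Greeks say<GAP DESC="foreign">which is\n'
          'to say <GAP DESC="foreign"> in our tongue')
for line_number, context in gaps_needing_space(sample):
    print(line_number, context)
```

The script flags every GAP that touches a neighbouring character, including punctuation that may legitimately follow it, so review each hit rather than adding spaces blindly.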
Illegibilities are harder. You may find individual letters marked as $, groups of letters marked as strings of $s (e.g. Lo$$on for "London"), illegible words marked as $word$, and pages, lines, and spans of text marked as $page$, $line$, and $span$.
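To see how the illegibilities in a file break down by kind, the markers described above can be tallied with a short Python sketch. The category names ("letter" for a single $, "letters" for a run of $s) and the function name tally_illegibles are made up for the example.

```python
from collections import Counter
import re

# Either one of the named markers ($word$, $page$, $line$, $span$)
# or a run of one or more bare $ signs standing for illegible letters.
MARKER = re.compile(r"\$(word|page|line|span)\$|\$+")


def tally_illegibles(text):
    """Count illegibility markers by kind."""
    counts = Counter()
    for match in MARKER.finditer(text):
        if match.group(1):
            counts[match.group(1)] += 1      # $word$, $page$, $line$, $span$
        elif len(match.group(0)) == 1:
            counts["letter"] += 1            # a single $
        else:
            counts["letters"] += 1           # a run such as $$ in Lo$$on
    return counts


# Made-up sample text.
sample = "Printed at Lo$$on by $word$ Smith, a$ the $line$ $page$"
print(tally_illegibles(sample))
```

A file dominated by $page$ and $line$ markers usually points to a physical problem with the copy, while scattered single letters are more often worth resolving.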
The notes file should already contain a count of illegibilities of the most common types. Searching for (regular expression, binary, file count only) \$[^ ]*\$? should confirm the overall count, which is the most important one: if there are fewer than 100 $-groups in the file, you should at least consider resolving them individually (either by supplying the correct reading or by deciding that the text really is illegible). Sometimes it is clear that the illegibles arise from some insuperable problem (such as a tight binding or cropped pages); in that case it is not usually worth while trying to resolve them individually. And if the book contains more than 100 illegibles, it is again not usually worth while trying to resolve them individually.
Once you have resolved the illegibles that you intend to, run the batch file "skint" (in TextPad, from the Tools menu). This file edits the sgm file 'in place', replacing the $s with proper <GAP DESC="illegible"> tags, and saves the unmodified version in the same directory with the extension .bak.
If you're resolving illegibilities individually, you'll find that many can be read (given contextual information) with at least 95% certainty. Feel free to insert the correct character in such cases based on context, so long as the physical form remaining does not contradict your conclusions as to the correct character. Do not attempt to supply a character when there is nothing in the original at all, no matter how correct or inevitable it might be. Those that cannot be resolved should be replaced by <GAP DESC="illegible" EXTENT="1"> (or whatever extent applies), normally by running "skint."
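The overall count and the 100 rule can be sketched in Python with the same regular expression, \$[^ ]*\$?. The sketch searches line by line, on the assumption that the TextPad search does not match across line breaks. The function names are hypothetical, and the threshold is only the rule of thumb given above.

```python
import re

# The same expression as the TextPad search.
DOLLAR_GROUP = re.compile(r"\$[^ ]*\$?")


def count_dollar_groups(text):
    """Count $-groups line by line (assumed to mirror TextPad's behaviour)."""
    return sum(len(DOLLAR_GROUP.findall(line)) for line in text.splitlines())


def worth_resolving(text, threshold=100):
    """Rule of thumb only: fewer than `threshold` groups suggests that
    resolving them one by one is worth considering."""
    return count_dollar_groups(text) < threshold


# Made-up sample text.
sample = "Printed at Lo$$on for $word$ Smith.\nThe $line$ is lost."
print(count_dollar_groups(sample), worth_resolving(sample))
```

The result is a prompt for judgment, not a verdict: a low count caused by a tight binding or cropped pages may still not be worth pursuing.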
Other problems with illegibility may require creative solutions, and they are too various to be listed here.