Dear TCP (EEBO/Evans/ECCO) firms:

I would like to address four sets of issues related to the 
capture of individual characters. One set of issues is new,
relates immediately only to EEBO (not Evans or ECCO),
and arises from our plan (mentioned last month) to convert a
number of Welsh books. The other three are matters that 
we did not give enough attention to in the beginning, and
that we have only slowly come to realize are problems. These
also are more likely to occur in EEBO than in the other projects,
but we would like to keep our standards identical through all
the TCP projects.

In each case, in what follows, I either suggest changes to
the way that we capture certain characters or clarify our
existing practice; I would appreciate it if you could
consider these and let me know if you think that they 
are feasible, or if you have alternative suggestions to make.
We realize that character-level changes are difficult and
therefore do not wish to make any mistake about what we are
doing!

I would also suggest that, if you agree to these changes, you
be given at least two or three months to implement them in 
production. We would not start testing for them during proofing
until we reach files arriving after (say) 1 July.

Here are the four categories:

(1) Characters used to print Welsh.
(2) Upper-case I and J.
(3) Upper-case U and V.
(4) Anglo-Saxon type.


(1) Characters used to print Welsh (likely to affect only EEBO)
---------------------------------------------------------------

As promised, I have looked at several of the Welsh books that
we hope to convert during the next few months, in search of
characters and other typographic features for which we ought
to prepare you. I have found a few.

Since Welsh spelling was not really fixed till the nineteenth
century, some of the early books use experimental systems to
represent the distinctive sounds of Welsh. Most of these
use the standard Latin alphabet, but there are a few exceptions.

One system uses diacritic dots placed below l, d, and u to
denote the sounds nowadays spelt "ll", "dd", and "w". I suggest
that we capture these using character entities:

   d with a dot below it = &ddotb;
   l with a dot below it = &ldotb;
   u with a dot below it = &udotb;
   
I did not notice any upper-case versions of these, but if they
exist, they should be captured using upper-cased character
entities:

   D with a dot below it = &Ddotb;
   L with a dot below it = &Ldotb;
   U with a dot below it = &Udotb;
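
As an illustration only (this is not part of our tools), the naming
pattern for the dotted letters can be sketched in a few lines of
Python. The function names are hypothetical, and the Unicode preview
(base letter plus combining dot below, U+0323) is my own assumption
about a convenient display form; the capture rule itself is just the
entity name.

```python
import unicodedata

COMBINING_DOT_BELOW = "\u0323"  # assumption: used only to preview the glyph


def dotted_entity(letter):
    """Return the proposed entity for d, l or u (either case) with a dot below."""
    if letter.lower() not in ("d", "l", "u"):
        raise ValueError("no dotted entity is proposed for %r" % letter)
    # The case of the entity name follows the case of the printed letter.
    return "&%sdotb;" % letter


def unicode_preview(letter):
    """Compose the letter with a dot below, for display purposes only."""
    return unicodedata.normalize("NFC", letter + COMBINING_DOT_BELOW)
```

For example, dotted_entity("d") gives &ddotb; and dotted_entity("L")
gives &Ldotb;.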

Several books use a very odd character that corresponds to 
modern Welsh "w" and "W" (lower and upper case). I suggest 
that we capture this also using a character entity:
 
   [odd Welsh w] = &w;
   [odd Welsh W] = &W;
   
Several books use two different forms of 'y'--an ordinary 'y'
and a strange-looking y with a broken back. I suggest that
we capture the ordinary 'y' as 'y', but capture the broken-
backed 'y' as &y; (upper-case version: &Y;).
 
   y = y    but 'broken' y = &y;
   Y = Y    but 'broken' Y = &Y;
 
Welsh books in blackletter sometimes use a 'd' with a hook
at the upper right corner, as the equivalent of modern Welsh
'dd'. In so doing, they appear to be adapting one form of
the medieval English (and modern Icelandic) character called 'eth' 
(= U+00F0). I suggest that the Welsh use of the eth should
be treated the same as early English use of the eth: both
should be captured as &eth; (from the ISO Latin-1 character
set).
 
Finally, Welsh is full of contractions marked by apostrophes
(e.g. i'r, a'r, etc.). But some of the books seem to distinguish
between the usual apostrophe, which resembles a right single
quotation mark (U+2019), and a reversed apostrophe, which
resembles a left single quotation mark (U+2018). To avoid conflict
with existing practice and with the body of books already
converted, in which we capture the apostrophe as the upright
'neutral vertical single quote' (U+0027), I suggest that we
continue to use ' for the normal apostrophe (the one resembling
U+2019), but use the character entity &lsquo; for the reversed
apostrophe.
 
 (Quotation marks, as opposed to apostrophes, should continue
 to be captured using the simple '.)
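
The apostrophe rule can be restated as a small Python sketch, offered
only as a summary. The function name and the is_quotation_mark flag
are hypothetical; telling a quotation mark from an apostrophe is a
judgment made by the keyer, which the sketch simply takes as given.

```python
RIGHT_SINGLE = "\u2019"  # shape of the usual apostrophe
LEFT_SINGLE = "\u2018"   # shape of the reversed apostrophe


def capture_single_mark(shape, is_quotation_mark=False):
    """Return the proposed capture for a printed apostrophe-like mark."""
    if is_quotation_mark:
        # Quotation marks continue to be captured as the simple '.
        return "'"
    if shape == LEFT_SINGLE:
        # Reversed apostrophe: the new entity.
        return "&lsquo;"
    if shape in (RIGHT_SINGLE, "'"):
        # Normal apostrophe: unchanged from existing practice.
        return "'"
    raise ValueError("not an apostrophe-like mark: %r" % shape)
```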
 

Examples of all of these characters and some further details
can be found in a preliminary list on the web at 
http://www.lib.umich.edu/tcp/docs/dox/cym.html

A modified character-entities list including the new dotted
characters can be found in the usual place at:
http://www.lib.umich.edu/tcp/docs/dox/eebochar.ent.txt


(2) Upper-case I and J (all TCP projects)
-----------------------------------------

I think that I and J are being captured well now, but I'll take
this opportunity to re-state the principle and supply some
examples. 

The formal rule is: if there are distinct (visual) forms for the 
two letters in the given typeface, in the given book 
(and there almost always are, except in blackletter 
faces), capture the I form as I and the J form as J, even 
if the forms are not distributed as they are in modern usage.

If the typeface (as used in the given book) does not contain
distinct I and J forms, use I.

The rule of thumb is:

  In Roman, I and J are always distinct. Anything that
    looks J-like should be captured as J.

  In Italic, I and J are always distinct. Anything that
    looks J-like should be captured as J.*

  In Blackletter (all forms of it, including Fraktur, Rotunda,
    Bastarda, Textura), there is only one letter. It
    often looks rather J-like, but capture it as I anyway:
    there is no J.
    
  * there may be a few exceptional early Italic type-faces
    which contain an 'I' with a cross-bar that looks a little
    like J. But these are rare: we can deal with them
    individually. Likewise for any other rare exceptions.
    

For examples, see http://www.lib.umich.edu/tcp/docs/dox/ijnew.html
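
The rule of thumb might be summarized in Python as follows. This is a
rough sketch, not a tool we use: the function name and the typeface
labels are hypothetical, and it deliberately ignores the rare
exceptions mentioned in the note above, which we would handle
individually.

```python
BLACKLETTER_FACES = {"blackletter", "fraktur", "rotunda", "bastarda", "textura"}
FACES_WITH_DISTINCT_J = {"roman", "italic"}


def capture_upper_ij(typeface, looks_like_j):
    """Return "I" or "J" for an upper-case I/J form, per the rule of thumb."""
    face = typeface.lower()
    if face in BLACKLETTER_FACES:
        # Only one letter exists, however J-like it may look.
        return "I"
    if face in FACES_WITH_DISTINCT_J:
        # Capture by shape, even where the usage is not the modern one.
        return "J" if looks_like_j else "I"
    raise ValueError("typeface not covered by the rule of thumb: %r" % typeface)
```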


(3) Upper-case U and V (all TCP projects)
-----------------------------------------

U and V, like I and J, only gradually emerged during the course
of early printing as distinct forms with distinct functions.
We've noticed two problems with our present capture of these letters.

  (a) Blackletter faces, just as they tend to contain only a
  single I/J character (which we capture as I), also tend to 
  contain only a single U/V character. In some versions of
  Blackletter (in the Textura and Rotunda families), this letter
  looks like "U"; in other versions of Blackletter (especially
  in the Bastarda family), this letter looks rather like "V".
  But each font in fact contains only one letter, and we are
  left with inconsistency: books in Textura have been converted
  with all the upper-case U/Vs captured as U and books in
  Bastarda with all the upper-case U/Vs captured as V. 
  
  We suggest that this inconsistency be removed, and that the
  upper-case U/V character be captured as "V" in all blackletter
  faces.
  
  (b) Italic has a different problem. There are at least three
  distinct forms of the U/V character in Italic. One is clearly
  a U (a slanted form of the Roman U), one is clearly a V (a
  slanted form of the Roman V), but the third, which we call
  'swash-V', is ambiguous, and in fact changes in function from
  the early books to the late books. In early books, it tends
  to function like a V; in late books like a U.
  
  Our suggestion is to capture this ambiguous form with a
  distinctive entity (&V;), whilst capturing the unambiguous
  forms as before (using U and V). This will allow us to decide
  on a book-by-book basis whether to leave the 'swash' form
  as ambiguous, or change it globally to U or V.
  
For examples, see http://www.lib.umich.edu/tcp/docs/dox/uvnew.html
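
Both suggestions, together with the later book-by-book decision about
the swash form, can be sketched in Python. This is illustrative only:
the function names, the typeface labels and the form labels ("U",
"V", "swash") are hypothetical, and the sketch assumes that the swash
form occurs only in Italic.

```python
BLACKLETTER_FACES = {"blackletter", "fraktur", "rotunda", "bastarda", "textura"}


def capture_upper_uv(typeface, form):
    """Return the proposed capture for an upper-case U/V form."""
    face = typeface.lower()
    if face in BLACKLETTER_FACES:
        # (a) One letter only: always V, whatever it looks like.
        return "V"
    if form == "swash":
        if face != "italic":
            raise ValueError("swash form is assumed to occur only in Italic")
        # (b) Ambiguous form: keep it distinct for a later decision.
        return "&V;"
    if form in ("U", "V"):
        # Unambiguous forms are captured as before.
        return form
    raise ValueError("unknown form: %r" % form)


def resolve_swash(text, decision=None):
    """Apply a per-book decision: None leaves &V; in place, else "U" or "V"."""
    if decision is None:
        return text
    if decision not in ("U", "V"):
        raise ValueError("decision must be 'U', 'V' or None")
    return text.replace("&V;", decision)
```

So a late book might be processed with resolve_swash(text, "U") and an
early one with resolve_swash(text, "V"), while a doubtful one is left
alone.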


(4) Anglo-Saxon type (mostly in EEBO)
-------------------------------------

A number of books have come through containing Anglo-Saxon type--
that is, type designed to mimic the letter-forms used by 
scribes writing in the earliest form of English (Old English). 
In principle, this material is not hard to capture: it is in
English (albeit Old English!), most of the letter-forms are
familiar, only a few letters (d f g r s t) have peculiar
forms, and even fewer letters (thorn, eth, wynn) and one
or two abbreviation-forms (that, &) are entirely new.

So you are free to key it in, if you like. Nevertheless, 
we recognize that you are not likely to want to
train people to recognize type that will be seen only very
occasionally, if at all. So we suggest two options: either

 (1) key it in, using the guidelines at
 http://www.lib.umich.edu/tcp/docs/dox/asax.html    or

 (2) flag it using the usual <GAP DESC="foreign"> but
 add the attribute LANG="ang". This will allow us to
 find these passages and key them in ourselves if
 desired.
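
To show how option (2) would let us find these passages later, here
is a minimal Python sketch. It is an assumption about how one might
search the files, not a description of our actual tools: the function
name is hypothetical, and it uses a simple pattern match rather than
a real SGML parser.

```python
import re
from typing import List

GAP_TAG = re.compile(r"<GAP\b[^>]*>", re.IGNORECASE)
ATTRIBUTE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def find_old_english_gaps(sgml: str) -> List[int]:
    """Return the offsets of <GAP> tags with DESC="foreign" and LANG="ang"."""
    offsets = []
    for match in GAP_TAG.finditer(sgml):
        # Collect the tag's attributes, in whatever order they were keyed.
        attrs = {name.upper(): value
                 for name, value in ATTRIBUTE.findall(match.group(0))}
        if attrs.get("DESC") == "foreign" and attrs.get("LANG") == "ang":
            offsets.append(match.start())
    return offsets
```

A plain <GAP DESC="foreign"> with no LANG attribute is skipped, so
ordinary foreign-language gaps are not mistaken for Old English.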
 

Again, responses are welcome.

Paul