Dear TCP (EEBO/Evans/ECCO) firms:

I would like to address four sets of issues related to the capture of individual characters. One set of issues is new, relates immediately only to EEBO (not Evans or ECCO), and arises from our plan (mentioned last month) to convert a number of Welsh books. The other three are matters that we did not give enough attention to in the beginning, and that we have only slowly come to realize are problems. These also are more likely to occur in EEBO than in the other projects, but we would like to keep our standards identical through all the TCP projects.

In each case, in what follows, I either suggest changes or clarify already existing changes in the way that we capture certain characters; I would appreciate it if you could consider these and let me know if you think that they are feasible, or if you have alternative suggestions to make. We realize that character-level changes are difficult and therefore do not wish to make any mistake about what we are doing!

I would also suggest that (if you agree to these changes) you be given at least two or three months to implement them in production. We would not start testing for them during proofing till we reach (say) files arriving after 1 July.

Here are the four categories:

(1) Characters used to print Welsh.
(2) Upper-case I and J.
(3) Upper-case U and V.
(4) Anglo-Saxon type.

(1) Characters used to print Welsh (likely to affect only EEBO)
---------------------------------------------------------------

As promised, I have looked at several of the Welsh books that we hope to convert during the next few months, in search of characters and other typographic features for which we ought to prepare you. I have found a few.

Since Welsh spelling was not really fixed till the nineteenth century, some of the early books use experimental systems to represent the distinctive sounds of Welsh. Most of these use the standard Latin alphabet, but there are a few exceptions.
One system uses diacritic dots placed below l, d, and u to denote the sounds nowadays spelt "ll" "dd" and "w". I suggest that we capture these using character entities:

  d with a dot below it = &ddotb;
  l with a dot below it = &ldotb;
  u with a dot below it = &udotb;

I did not notice any upper-case versions of these, but if they exist, they should be captured using upper-cased character entities:

  D with a dot below it = &Ddotb;
  L with a dot below it = &Ldotb;
  U with a dot below it = &Udotb;

Several books use a very odd character that corresponds to modern Welsh "w" and "W" (lower and upper case). I suggest that we capture this also using a character entity:

  [odd Welsh w] = &w;
  [odd Welsh W] = &W;

Several books use two different forms of 'y'--an ordinary 'y' and a strange-looking y with a broken back. I suggest that we capture the ordinary 'y' as 'y', but capture the broken-backed 'y' as &y; (upper-case version: &Y;).

  y = y   but 'broken' y = &y;
  Y = Y   but 'broken' Y = &Y;

Welsh books in blackletter sometimes use a 'd' with a hook at the upper right corner, as the equivalent of modern Welsh 'dd'. In so doing, they appear to be adapting one form of the medieval English (and modern Icelandic) character called 'eth' (= U+00F0). I suggest that the Welsh use of the eth should be treated the same as early English use of the eth: both should be captured as &eth; (from the ISO latin-1 character set).

Finally, Welsh is full of contractions marked by apostrophes (e.g. i'r a'r etc.). But some of the books seem to distinguish between the usual apostrophe that resembles a right single quotation mark (U+2019) and a reversed apostrophe that resembles a left single quotation mark (U+2018). To avoid conflict with the existing practice and the body of books already converted, in which we use the upright 'neutral vertical single quote' (U+0027), I suggest that we continue to use ' for the normal apostrophe (U+2019), but use a character entity &lsquo; for the reversed apostrophe.
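For anyone who wants to check captured text against the dotted-letter entities proposed above, here is a minimal sketch of how they might later be expanded. The entity names are the ones suggested in this message; the Unicode targets, the names WELSH_ENTITIES and expand_entities, and the decision to leave &w;, &W;, &y; and &Y; unexpanded are illustrative assumptions only, not part of the TCP specification.

```python
import re

# Entity names come from the proposal in this message.
# The Unicode code points are an assumption made for illustration.
WELSH_ENTITIES = {
    "ddotb": "\u1e0d",  # d with dot below
    "ldotb": "\u1e37",  # l with dot below
    "udotb": "\u1ee5",  # u with dot below
    "Ddotb": "\u1e0c",  # D with dot below
    "Ldotb": "\u1e36",  # L with dot below
    "Udotb": "\u1ee4",  # U with dot below
}

# Matches a named entity reference such as &ldotb;
ENTITY_RE = re.compile(r"&([A-Za-z]+);")


def expand_entities(text):
    """Replace known dotted-letter entities; leave every other entity as keyed."""
    def repl(match):
        # Unknown names (for example &w; or &y;) are returned unchanged.
        return WELSH_ENTITIES.get(match.group(1), match.group(0))
    return ENTITY_RE.sub(repl, text)


if __name__ == "__main__":
    print(expand_entities("a&ldotb;an &w;rth &Ddotb;u&y;"))
```

Unrecognized entities pass through untouched, so the odd Welsh w and the broken-backed y stay flagged for a later decision rather than being silently normalized.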
(Quotation marks, as opposed to apostrophes, should continue to be captured using simple ')

Examples of all of these characters and some further details can be found in a preliminary list on the web at

  http://www.lib.umich.edu/tcp/docs/dox/cym.html

A modified character-entities list including the new dotted characters can be found in the usual place at:

  http://www.lib.umich.edu/tcp/docs/dox/eebochar.ent.txt

(2) Upper-case I and J (all TCP projects)
-----------------------------------------

I think that I and J are being captured well now, but I'll take this opportunity to re-state the principle and supply some examples.

The formal rule is: if there are distinct (visual) forms for the two letters in the given typeface, in the given book (and there almost always are, except in blackletter faces), capture the I form as I and the J form as J, even if the forms are not distributed as they are in modern usage. If the typeface (as used in the given book) does not contain distinct I and J forms, use I.

The rule of thumb is:

  In Roman, I and J are always distinct. Anything that looks J-like should be captured as J.

  In Italic, I and J are always distinct. Anything that looks J-like should be captured as J.*

  In Blackletter (all forms of it, including Fraktur, Rotunda, Bastarda, Textura), there is only one letter. It often looks rather J-like, but capture it as I anyway: there is no J.

* There may be a few exceptional early Italic type-faces which contain an 'I' with a cross-bar that looks a little like J. But these are rare: we can deal with them individually. Likewise for any other rare exceptions.

For examples, see http://www.lib.umich.edu/tcp/docs/dox/ijnew.html

(3) Upper-case U and V (all TCP projects)
-----------------------------------------

U and V, like I and J, only gradually emerged during the course of early printing as distinct forms with distinct functions. We've noticed two problems with our present capture of these letters.
(a) Blackletter faces, just as they tend to contain only a single I/J character (which we capture as I), also tend to contain only a single U/V character. In some versions of Blackletter (in the Textura and Rotunda families), this letter looks like "U"; in other versions of Blackletter (especially in the Bastarda family), this letter looks rather like "V". But each font in fact contains only one letter, and we are left with inconsistency: books in Textura have been converted with all the upper-case U/Vs captured as U and books in Bastarda with all the upper-case U/Vs captured as V. We suggest that this inconsistency be removed, and that the upper-case U/V character be captured as "V" in all blackletter faces.

(b) Italic has a different problem. There are at least three distinct forms of the U/V character in Italic. One is clearly a U (a slanted form of the Roman U), one is clearly a V (a slanted form of the Roman V), but the third, which we call 'swash-V', is ambiguous, and in fact changes in function from the early books to the late books. In early books, it tends to function like a V; in late books like a U. Our suggestion is to capture this ambiguous form with a distinctive entity (&V;), whilst capturing the unambiguous forms as before (using U and V). This will allow us to decide on a book-by-book basis whether to leave the 'swash' form as ambiguous, or change it globally to U or V.

For examples, see http://www.lib.umich.edu/tcp/docs/dox/uvnew.html

(4) Anglo-Saxon type (mostly in EEBO)
-------------------------------------

A number of books have come through containing Anglo-Saxon type-- that is, type designed to mimic the letter-forms used by scribes writing in the earliest form of English (Old English).
In principle, this material is not hard to capture: it is in English (albeit Old English!), most of the letter-forms are familiar, only a few letters (d f g r s t) have peculiar forms, and even fewer letters (thorn, eth, wynn) and one or two abbreviation-forms (that, &) are entirely new. So you are free to key it in, if you like.

Nevertheless, we recognize that you are not likely to want to train people to recognize type that will be seen only very occasionally, if at all. So we suggest two options: either

  (1) key it in, using the guidelines at http://www.lib.umich.edu/tcp/docs/dox/asax.html

or

  (2) flag it using the usual but add the attribute LANG="ang". This will allow us to find these passages and key them in ourselves if desired.

Again, responses are welcome.

Paul