FAQ – Text Creation Partnership

Frequently asked questions

Is my institution a TCP partner? Does it matter?
May I contribute text to the TCP corpus?
Where can I access the TCP digital collections?
How can I contact the TCP?
Can I download the raw files?
Where can I consult the documentation?
What is the difference between Early English Books Online (EEBO) and EEBO-TCP?
What is the difference between EEBO-TCP Phase I and Phase II?
How much does it cost to key and encode a single TCP text?
When will the texts be freely available?
Why would I buy something that is achievable only if others do the same?
Why would I buy something that is going to become freely available?
Are personal memberships or subscriptions available?
Why don’t you use OCR?
A work that I am interested in hasn’t been converted yet. When will you do it?
Why does TCP (for the most part) only include one edition of a work?
Do you want to know about errors?

Is my institution a TCP partner? Does it matter?

The TCP consists of four overlapping partnerships: four groups of institutions that contributed to the production of four separate batches of texts. Those institutions received in return our thanks, the satisfaction of having added to the world’s knowledge, perpetual ownership of the texts, and exclusive use of them for several years. For all four of the partnerships (Evans, ECCO, and EEBO Phases 1 and 2), the period of exclusive use has ended and the texts are free for anyone to access and use. Therefore, for practical purposes, it does not matter whether or not one belongs to a partner institution.

May I contribute text to the TCP corpus?

Yes, individuals and individual projects are welcome to contribute transcriptions they have made to the TCP collection, and some have done so. For example, TCP has received texts from the Hamlet Quartos Project at Oxford and the Lexicons of Early Modern English project at Toronto, as well as some volunteer efforts. In order to “fit in,” text must of course meet the basic TCP criteria of accuracy, fidelity to the source (no modernization), and freedom from legal restrictions on distribution, access, and use. If you have a text or collection of texts that you are willing to add to the corpus, feel free to contact us at tcp-info@umich.edu and we can talk about accuracy, fidelity, and markup as needed.

Where can I access the TCP digital collections?

The University of Michigan hosts the texts for online search at the following addresses.

EEBO-TCP (Early English Books Online)
- Phase 1 comprising 25,000 texts available to everyone.
- Phase 2 comprising 35,000 texts available to everyone.
ECCO-TCP (Eighteenth Century Collections Online) – Full text of about 3,000 books available freely to everyone.
Evans-TCP (Evans Early American Imprints) – Full text of about 5,000 books available to everyone.

Note: These are the TCP home sites. Since the texts were intended for re-purposing and unconstrained re-use, TCP has contributed the same texts, in their original form or as modified, alone or in combination with other texts, to other sites such as the ProQuest EEBO site or the JISC Historic Texts portal.

How can I contact the TCP?

To contact the TCP team, please email us at tcp-info@umich.edu

Can I download the raw files?

Yes. The raw transcripts for EEBO Phases 1 and 2, ECCO, and Evans are all available for bulk download as zipped files for those wishing to do text mining or similar projects. These are substantial file transfers and should not be undertaken casually! We are in the process of reorganizing these downloadable files, doing away with the old arrangement by date-of-release (since that no longer matters), and rearranging the files strictly by ID number.

Public downloads are available from these Dropbox.com folders:

If for some reason these links fail, a backup copy is mirrored on Box.com:

Most of the files are available in three forms:

TCP (P4) XML. This, the version that we generally recommend, uses UTF-8 character encoding and TEI bibliographic headers based on MARC catalog records. Its character inventory, converted from SGML SDATA character entities, consists mostly of corresponding single or composite Unicode characters, where those exist and are widely supported by fonts. The few exceptions are converted instead to text strings within {curly braces}. Though the TCP schema includes customizations, some of which anticipate developments in TEI P5, this is still better thought of as an essentially TEI P4 version.
SGML. The original SGML files, as produced by the TCP keyers and editors, also remain available. These files use 7-bit character encoding with named (‘mnemonic’) SDATA character entities and minimal headers but are otherwise very similar to TEI P3. Only users of tools that cannot accept multi-byte character encodings, or those that desire utter losslessness, are likely to want to look at this version.
P5 XML. For those who use TEI-based tools or need compatibility with other TEI corpora, we also make available (thanks to Sebastian Rahtz at Oxford) an XML version conformant to TEI P5, largely but not completely lossless relative to the underlying SGML. This version features TEI headers (again, based for the most part on underlying MARC) and UTF-8 character encoding, solving the character issues by heavy use of the TEI <g> element. The native home for most of the P5 files (EEBO-1, Evans, and ECCO) is on GitHub, but we provide a snapshot of the GitHub version for convenience. The P5 version of EEBO-2, since it was restricted at the time of its creation, is not currently available on GitHub, a public platform; we hope to change that.
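For those planning text mining, the following is a minimal Python sketch, not an official TCP tool, of how one might survey the special-character handling described above before processing a download. It assumes that placeholders in the P4 XML appear as simple {name} strings without nested braces, and that the P5 files use the standard TEI namespace with a ref attribute on each <g> element. The function names, file names, placeholder names, and ref values are all hypothetical.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Assumption: placeholders look like {name}, with no nested braces.
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def count_placeholders(xml_data):
    """Count {curly-brace} strings in the text content of one XML file."""
    root = ET.fromstring(xml_data)
    text = "".join(root.itertext())
    return Counter(PLACEHOLDER.findall(text))


def count_g_refs(xml_data):
    """Count TEI P5 <g> elements, grouped by their ref attribute."""
    root = ET.fromstring(xml_data)
    g_tag = "{" + TEI_NS + "}g"
    return Counter(g.get("ref", "") for g in root.iter(g_tag))


def xml_members(zip_bytes):
    """Yield (name, raw bytes) for every .xml member of a zipped download."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith(".xml"):
                yield name, archive.read(name)


if __name__ == "__main__":
    # Hypothetical sample standing in for a real P4 file.
    sample = "<TEXT><P>one {example} two {example}</P></TEXT>"
    print(count_placeholders(sample))
```

Tallying these markers first gives a quick picture of which characters a given tool or font will need to handle, and of whether the P4 or P5 form suits a project better.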

Where can I consult the documentation?

Most of the TCP’s policies for transcription and encoding can be pieced together from the TCP documentation page. The documents listed and linked there are admittedly ill organized, incomplete (needing to be supplemented by ad-hoc decisions made in response to particular textual difficulties), and were never intended for public eyes. With those caveats, they remain valuable guides to both the philosophy and the practice of the TCP editors. Among the most useful:

The base capture and encoding instructions.
A list of SGML character entities, matched up with their equivalents in the XML.
The (TEI P3-based) SGML DTD.
The (TEI P4-based) XML DTD.

What is the difference between Early English Books Online (EEBO) and EEBO-TCP?

Simply put, EEBO is a commercial product published by ProQuest LLC, and available to libraries for purchase or license. EEBO-TCP is a project based at the Universities of Michigan and Oxford, and supported by more than 150 libraries around the world.

EEBO consists of the complete digitized page images and bibliographic metadata (catalog records) for more than 135,000 early English books listed in Pollard & Redgrave’s Short-Title Catalogue (1475-1640) and Wing’s Short-Title Catalogue (1641-1700) and their revised editions, as well as the Thomason Tracts (1640-1661) collection and the Early English Books Tract Supplement. With EEBO alone, you can search for a book based on the information in the catalog record and you can flip through or download page images in TIFF or PDF format. With EEBO alone, it is not possible to search the full text of a book or to read a modern-type transcription of the text.

EEBO-TCP captures the full text of each transcribed work in EEBO — by intention, almost every unique monographic English-language title. This is done by manually keying the full text of each work and adding markup to indicate the structure of the text (chapter divisions, tables, lists, etc.). The result is an accurate transcription of each work, which can be fully searched, or used as the basis of a new project. To date, EEBO-TCP has produced about 60,000 texts, with another few thousand still to come. The EEBO-TCP text files are delivered back to ProQuest and indexed in EEBO, so users at partner libraries can seamlessly perform full text searches and view transcriptions directly within the EEBO platform, although the texts can also be accessed in other ways, including TCP’s own search sites hosted by the University of Michigan Library. EEBO-TCP is administered by the University of Michigan Library. During its most productive years, it employed full teams of editors at Michigan and Oxford, plus a few ancillary sites. Only the Michigan team remains active today (2020).

What is the difference between EEBO-TCP Phase I and Phase II?

The initial EEBO-TCP project began in 1999. Its goal was to key and encode 25,000 selected works from the EEBO corpus. This effort was completed in 2009, with the support of nearly 150 library partners. The 25,000 texts produced by this effort are called “Phase I.” This set of texts was released to the public with no restrictions on use whatever, as of January 1, 2015. Anyone could and can search and view these texts online at the Michigan TCP site, or can download them in bulk for individual use and re-use.

Under the encouragement of the project advisory board, and with the promise of another round of support from many libraries, in 2008 the TCP decided to continue the work of EEBO-TCP in a second phase. As described more fully elsewhere on this site, EEBO Phase II adopted the audacious goal of keying and encoding at least one edition of each unique monographic English-language work (with principled exceptions) represented in EEBO. Our guess — and it was only a guess — was that that would require converting roughly 44,000 additional texts, contingent, of course, on obtaining the requisite funds. Our estimate of how many texts it would take to achieve comprehensive coverage has largely been borne out (perhaps 45,000 was more accurate?), while our fund-raising fell only slightly short. As currently funded, EEBO-TCP Phase II is likely to produce nearly 40,000 texts — 35,000 of which are already online, the remainder due to appear late in 2020. All texts belonging to Phase 2 (both prospective texts and those already released) were opened to the public on 1 August 2020.

So ultimately, the entire EEBO-TCP corpus (Phase I and Phase II together) will consist of about 65,000 works.

How much does it cost to key and encode a single TCP text?

The cost of keying and encoding a book depends on how long the book is and how difficult it is to capture and edit the text. A book might be particularly challenging due to the difficulty of the font, the quality of the image (as preserved, or as captured on microfilm and digital scan), or simply the presence of unusual and complex textual features, such as large tables or genealogical charts. A work might consist of a single broadsheet, or thousands of pages. Our vendors charged a flat fee by the character (technically, by the kilobyte) of data captured. The costs of review and editing, which was done in-house at Michigan and Oxford, are measured in time, typically by counting how many books can be reviewed in a month. On average, we estimate that it cost $200-$250 to key, encode, and review a “typical” work. The cost of a very large work can easily be in the thousands of dollars.

A research library paid $60,000 to become a partner, so each library that joined supported the conversion of 250-300 new books.
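As a rough check on that arithmetic, the sketch below divides the partnership fee by the estimated per-text cost range. The function name is illustrative only, and the calculation assumes every text is a "typical" one, ignoring very large works and any overhead.

```python
def texts_funded(fee, cost_low, cost_high):
    """Return the (fewest, most) whole texts a fee covers at the given per-text costs."""
    return fee // cost_high, fee // cost_low


# $60,000 fee, $200-$250 per typical text
print(texts_funded(60_000, 200, 250))  # (240, 300)
```

The result, about 240 to 300 texts, is in line with the rounded figure given above.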

When will the texts be freely available?

All of the texts are now freely available. Indeed, it was always part of the mission of the TCP to ensure that the text files we produced would ultimately be freely available to the public. The date on which restrictions on sharing and distributing the texts were lifted depended on when each project was completed.

ECCO-TCP is already available to the public
Evans-TCP was released to the public on June 30, 2014
EEBO-TCP Phase I was released to the public on January 1, 2015
EEBO-TCP Phase II was released to the public on August 1, 2020

Why would I buy something that is achievable only if others do the same?

Mere calculation may have disinclined some libraries from joining. TCP partnership was always less a purchase than an (admittedly risky) investment, since all of Michigan’s projections for the TCP corpus depended on certain optimistic assumptions about how many other institutions would join. Some libraries may have joined out of faith in Michigan’s track record, or because of a long-standing connection with the University or its staff. Others joined out of an idealistic belief in the collaborative model that TCP represented or in the public value of the product it promised, and some perhaps out of a cost-benefit risk estimate. For all the partner libraries, however, TCP membership was in effect a commitment to fellow libraries to share the burden and reward of this work. Partner libraries contributed to the cost of producing tens of thousands of painstakingly produced electronic editions of early English works. Each new library that joined made it possible for the project to key books that we otherwise would not, improving the corpus for everyone.

Why would I buy something that is going to become freely available?

This question too has no obvious answer that will please everyone, and indeed it may have influenced some potential partners to refrain from joining. The structure of the TCP, with its provisions for exclusive access for a time, followed by public release, was something of a balancing act, designed to encourage membership by both those who were unwilling to wait ten years or so for access to the texts on behalf of their students and faculty, and those who believed in the creation of an unrestricted public resource and were willing on altruistic grounds to contribute to it. Whatever the partners’ motives for joining, the success of EEBO-TCP depended on their support. The partnership fee directly funded the conversion of new books, and greatly affected the rate at which the work was carried out. By joining up, a library not only gained immediate access to the texts, and not only contributed to making a larger, more comprehensive corpus for everyone, but also measurably affected the pace, and advanced the completion date, of the project, and thereby advanced the date at which the texts would be released to the public.

I’m not affiliated with an institution (or my institution doesn’t have EEBO or TCP), but I would be willing to pay for access to these texts. Do you have individual partnerships?

Since all the texts are now freely available, there is no need to join anything to gain access. Access to the underlying page images, however, remains restricted to customers of the companies in question (ProQuest, Gale, and Readex/Newsbank).

Why don’t you use OCR?

Because of the irregularity and difficulty of early printing, the ambiguity of some character glyphs, and the challenges of structural complexity, as well as the variable quality of the microfilm-based images from which we are working, optical character recognition cannot reliably “read” the EEBO images to produce an accurate electronic text. The review and correction of the text produced would be so expensive and labor-intensive that it is more efficient to simply key the work from scratch. However, there has been a great deal of interest over the past several years in Europe and in North America in improving OCR for older works, and there are a number of research projects investigating this right now, using TCP texts as a “ground truth” against which to compare their results. See further our somewhat fuller discussion of the OCR vs. keying question.

A work that I am interested in hasn’t been converted yet. When will you do it?

We used to say that users (especially those affiliated with partner libraries) were welcome to request works from EEBO that had not yet been keyed, and that their requests would go to the top of the queue. Since active transcription has ceased, and there is no longer a queue, asking for a particular work is less likely to receive a satisfactory response. But users are still encouraged to ask about works they fail to find in the EEBO-TCP corpus (via email to tcp-info@umich.edu), since it is possible that the work is available in a related edition, or is lagging in the pipeline, or that an alternative source can be found. If none of these things is true, information that a work is in demand is valuable in itself and may inform future digitization efforts, for which, thank you.

Why does TCP (for the most part) only include one edition of a work?

We recognize that each edition of a work is unique, that one cannot stand in for others, and that for many scholarly purposes, there is value in examining closely the differences between editions. However, given limited funding, our first priority was always to capture as many different works and as great a variety of text as we could, usually focusing on the first edition of each work. Simply put, for every book that we chose to convert, a different book did not get converted: duplication, even partial duplication, has its costs. However, we have keyed additional editions where there is sufficient justification for doing so, and a user has made a case for it.

I found an error in a transcription.

We are very grateful to those who report errors to us, and will incorporate corrections into our next release of the texts. Unfortunately we don’t yet offer a way to report (or correct) errors within the interface itself. Please get in touch at tcp-info@umich.edu.