What is TEI?

TEI, the Text Encoding Initiative, was founded in 1987 to develop guidelines for encoding machine-readable texts of interest to the humanities and social sciences.

TEI Lite is a smaller version of TEI: a subset of the full TEI tag set, chosen to cover the most commonly needed tags.

The Electronic Text Center uses Text Encoding Initiative (TEI) tag sets and rules, an application of the Extensible Markup Language (XML), to encode texts. TEI tags describe the structural hierarchies, divisions, and characteristics of a given document.

Basic TEI Tags

These are the basic tags that almost all TEI documents include:

<teiHeader> ... </teiHeader>
<front> ... </front>
<body> ... </body>
<back> ... </back>
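
As a rough illustration of how these parts fit together, the sketch below parses a hypothetical, heavily simplified skeleton with Python's standard library. In the TEI Guidelines the parts are written teiHeader, front, body, and back, with the last three nested inside a text element. The sample title and the bare TEI root (no namespace, and a header missing several required parts) are assumptions made for brevity, so this is not a complete, valid TEI file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified skeleton: a real TEI file needs a fuller header
# and, in current TEI, a namespace declaration on the root element.
TEI_SKELETON = """\
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample title</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <front><!-- title page, preface --></front>
    <body><p>Main text goes here.</p></body>
    <back><!-- appendices, index --></back>
  </text>
</TEI>
"""

root = ET.fromstring(TEI_SKELETON)

# The header and the text are the two children of the root element.
top_level = [child.tag for child in root]

# Front matter, body, and back matter are the children of <text>.
text_parts = [child.tag for child in root.find("text")]

print(top_level)
print(text_parts)
```

The header carries the description of the electronic file, while everything that belongs to the encoded work itself sits under the text element.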

What Does TEI Do?

TEI tags describe the characteristics of a given text.

For example, TEI tags may be used to indicate paragraph and line breaks, pagination, and major divisions of a text such as volumes, chapters, and sections. In addition, tags may be placed around typographical characteristics such as text that is underlined, italicized, superscripted, etc., and around text that needs special emphasis such as foreign words, misspellings, proper names, etc.
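
A short sketch may make this concrete. The sample passage below is invented for illustration. The element names (div, head, pb, lb, p, hi, foreign) are TEI elements, but which ones a project uses, and with which attribute values, is a project decision. The Python code simply reads back what the tags record.

```python
import xml.etree.ElementTree as ET

# Invented sample: one chapter with a page break, a line break,
# an italicized word, and a foreign word.
SAMPLE = """\
<div type="chapter" n="1">
  <head>Chapter One</head>
  <pb n="12"/>
  <p>She called it a <hi rend="italic">very</hi> fine day,<lb/>
  and greeted him with a cheerful <foreign xml:lang="fr">bonjour</foreign>.</p>
</div>
"""

div = ET.fromstring(SAMPLE)

# Typographical characteristics: text marked as italic.
italic = [hi.text for hi in div.iter("hi") if hi.get("rend") == "italic"]

# Text needing special treatment: foreign words.
foreign = [f.text for f in div.iter("foreign")]

# Pagination: the page numbers recorded on page-break elements.
pages = [pb.get("n") for pb in div.iter("pb")]

print(italic, foreign, pages)
```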

Special Characters

Special characters include characters that are not found on a standard English-language keyboard or that are not among the 128 characters of the US-ASCII character set, as well as characters such as the ampersand that have a reserved meaning in markup. Examples include characters with diacritics and special symbols such as the copyright sign. How these characters are represented differs between HTML and XML. For example, an ampersand is typically coded as &amp; in HTML. In XML, an ampersand may be coded as &#x0026;. In XML, numeric codes for special characters begin with "&#" and end with a semicolon (;). The middle component is the character's Unicode code point, written in decimal or, when preceded by "x", in hexadecimal. Unicode can represent far more than 65,536 characters. A character entity file is an index of the special characters and is consulted when a document is displayed. Character entities may be declared inside the XML document or in an external file.

Example:

À is represented with character entity &#x00C0;

^ is represented with character entity &#x005E;
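
The following sketch, using only Python's standard library, shows an XML parser turning numeric character references back into characters. The sample sentence and the helper name to_char_ref are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: each &#x....; reference names a Unicode code point in hex.
encoded = "<p>&#x00C0; la carte &#x0026; caf&#x00E9; &#x005E;</p>"

# The parser resolves the references while reading the document.
decoded = ET.fromstring(encoded).text
print(decoded)


def to_char_ref(ch):
    """Return the hexadecimal character reference for one character."""
    return f"&#x{ord(ch):04X};"


print(to_char_ref("À"))
```

Going the other way, to_char_ref looks up a character's code point and writes it in the four-digit hexadecimal form used in the examples above.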

How Does TEI Work?

TEI is used to organize text into a strict "document tree". The entire document is contained in a single "root element", with other features, such as sections, chapters, pages, paragraphs, titles, etc., branching off from the root. It is this strict tree structure that makes it possible to reliably search a TEI document and to apply stylesheets for display to the user.
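
To see the tree, the sketch below walks an invented two-chapter fragment and prints each element indented under its parent. The function name outline and the sample text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Invented fragment: a text with two chapters.
DOC = """\
<text>
  <body>
    <div type="chapter" n="1">
      <head>First Chapter</head>
      <p>Opening paragraph.</p>
    </div>
    <div type="chapter" n="2">
      <head>Second Chapter</head>
      <p>Another paragraph.</p>
    </div>
  </body>
</text>
"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Because every element has exactly one parent, a program can always tell which chapter a paragraph belongs to, which is what searching and stylesheets rely on.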

TEI may be customized to fit the needs of the project.

TEI includes tags that are specific to a particular genre - drama, poetry, prose.

It is important to have a DTD that is appropriate for the project. The DTD defines the structural rules of a type of document. These rules include a complete list of allowable elements and attributes, special character entities, rules for external files (such as images), as well as the hierarchical structure of all elements. Without a DTD it is difficult to automate the validation of a document's structure. The Electronic Text Center can help you create a TEI-conformant DTD or use a standard DTD.
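
The kind of rule a DTD expresses can be imitated in a few lines. The sketch below is a toy stand-in, not DTD validation: the ALLOWED_CHILDREN table is a hypothetical, drastically simplified rule set, and a real project would use a validating XML parser with the actual DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: each element mapped to the children it may contain.
ALLOWED_CHILDREN = {
    "text": {"front", "body", "back"},
    "front": {"div"},
    "body": {"div"},
    "back": {"div"},
    "div": {"head", "p", "div"},
    "head": set(),
    "p": set(),
}


def structure_errors(element):
    """Return a list of rule violations found at or below this element."""
    if element.tag not in ALLOWED_CHILDREN:
        return [f"unknown element <{element.tag}>"]
    errors = []
    for child in element:
        known = child.tag in ALLOWED_CHILDREN
        if known and child.tag not in ALLOWED_CHILDREN[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        errors.extend(structure_errors(child))
    return errors


good = ET.fromstring("<text><body><div><head>T</head><p>x</p></div></body></text>")
bad = ET.fromstring("<text><body><p>x</p><stanza/></body></text>")

print(structure_errors(good))
print(structure_errors(bad))
```

The second sample fails twice: a paragraph sits directly in the body, which this toy rule set does not permit, and stanza is not in the list of known elements at all.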

Encoding: Getting Started

The text selected for encoding will need to be transferred to a text editor. Typing the text manually and scanning it with optical character recognition (OCR) are the most common methods of transfer. Maintaining the format of the original text and noting typographical characteristics are helpful for placing the TEI tags. Tags may be used to indicate paragraph and line breaks, pagination, and major divisions of the text such as chapter or section headings. In addition, tags may be placed around typographical characteristics such as underlined or italicized text, hyphenation, special characters such as the ampersand or dollar sign, and alternate spellings and misspellings.

The visual presentation of a TEI-encoded document requires the use of a style sheet or other conversion program. The University of Virginia states in its "Guidelines for SGML Text Mark-up at the Electronic Text Center":

SGML texts are not, of course, designed to be read "in the raw". Ideally, one uses them through software tools that interpret the tags as database "fields" while searching and as a set of typographical layout instructions while displaying the results.

The UNL Libraries Electronic Text Center uses Text Encoding Initiative (TEI) tag sets and rules, an application of the Extensible Markup Language (XML), to encode texts. TEI tags describe structural divisions and characteristics of a given text. As stated in the University of Virginia's "Guidelines for SGML Text Mark-up at the Electronic Text Center":

By recording the structure of a text, such tags allow one to use an SGML [or XML] search program to constrain searches to particular elements: one cannot limit a search to a single chapter in a novel if there are no markers in the text for chapter divisions; one cannot view a quotation from a play in the context of a scene if the scenes are not delimited.
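
The point of the quotation can be sketched in Python. The two-chapter sample and the function name search_chapter are invented for illustration. The search is limited to one chapter only because the chapter divisions are marked in the text.

```python
import xml.etree.ElementTree as ET

# Invented sample: the word "storm" appears in both chapters.
NOVEL = """\
<body>
  <div type="chapter" n="1">
    <p>The storm broke at midnight.</p>
    <p>By morning the sea was calm.</p>
  </div>
  <div type="chapter" n="2">
    <p>A second storm followed within the week.</p>
    <p>The harbour stayed closed.</p>
  </div>
</body>
"""


def search_chapter(root, chapter_n, word):
    """Return the paragraphs of one chapter that contain the given word."""
    chapter = root.find(f".//div[@type='chapter'][@n='{chapter_n}']")
    if chapter is None:
        return []
    hits = []
    for p in chapter.iter("p"):
        text = "".join(p.itertext())
        if word.lower() in text.lower():
            hits.append(text)
    return hits


novel = ET.fromstring(NOVEL)

# Unconstrained search: every paragraph in the document.
all_hits = [p.text for p in novel.iter("p") if "storm" in p.text]

# Constrained search: chapter 2 only.
chapter_two_hits = search_chapter(novel, "2", "storm")

print(len(all_hits), chapter_two_hits)
```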

Additional characteristics to note about TEI encoding include the following:

Every element has an opening tag and a closing tag (an empty element, such as a page break, may be written as a single self-closing tag).
Tags are case-sensitive and must be nested properly.
Attributes may be used to further define the tags.
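
These rules are what an XML parser enforces. The sketch below uses Python's standard library to test a few invented fragments. The helper name is_well_formed is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# Properly nested, with an attribute that further defines the tag.
nested_ok = is_well_formed("<p>An <hi rend='italic'>emphasized</hi> word.</p>")

# Improper nesting: <hi> is opened inside <p> but closed outside it.
nested_bad = is_well_formed("<p>An <hi>emphasized</p></hi>")

# Case mismatch: <P> and </p> are different tags.
case_bad = is_well_formed("<P>text</p>")

# An empty element written as a single self-closing tag.
empty_ok = is_well_formed("<p>before<pb n='3'/>after</p>")

print(nested_ok, nested_bad, case_bad, empty_ok)
```

A fragment that passes this check is only well-formed. Whether it uses the right elements in the right places is a separate question answered by the project's DTD.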

Almost all TEI documents possess a basic set of tags:

TEI Header
Front Matter
Body
Back Matter

TEI documents are unique because tagging is specific to what is being encoded. Although there are some tags that will be the same no matter which source is being encoded, there are also tags that are unique to particular genres such as drama, poetry, or prose. Empty TEI documents that possess the basic tag sets are called templates. There are specific tag sets to make each template unique. Available templates may be used or customized to fit the goals of the project in question.
