Introduction to Text Encoding

Text encoding is the process of converting documents into an electronically searchable format for scholarly research.

CDRH prepares material for electronic access by encoding text. Four main steps are involved:

  1. transferring selected materials to a computer text editor
  2. encoding, or marking up, the document using tags and elements
  3. validating, or checking the correctness of, the document
  4. presenting the document to the user via a Web or some other interface
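
As a rough illustration, steps 2 through 4 can be sketched in Python with the standard library. The function names encode, validate, and present are hypothetical, chosen only for this example, and the validation shown is only a well-formedness check, not validation against a schema or DTD. It is not a description of the tools CDRH uses.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def encode(title_text):
    # Step 2: wrap the transcribed text in a <title> element, escaping
    # characters such as & and < that have special meaning in markup.
    return "<title>" + escape(title_text) + "</title>"


def validate(markup):
    # Step 3 (simplified): the parser raises ParseError if the markup
    # is not well-formed, and returns the parsed element otherwise.
    return ET.fromstring(markup)


def present(element):
    # Step 4 (simplified): turn the element into an HTML heading for display.
    return "<h1>" + escape(element.text or "") + "</h1>"


transcribed = "Leaves of Grass"  # Step 1: text as typed into an editor
markup = encode(transcribed)
page = present(validate(markup))
print(markup)
print(page)
```
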

Examples of text encoding projects include:

Walt Whitman Archive

What is a Tag?

Once material is transferred to a computer text editor, the text is encoded by placing tags around portions of the text. These tags identify characteristics of the material and determine how it will be displayed and how it will function on the Internet. Tags may indicate where a title is located, that a passage is rendered in italics, that a word is misspelled, where a table or image is placed, where links are located, and so forth.

There are three kinds of tags:

  1. opening, or beginning, tags: <title>
  2. closing, or end, tags: </title>
  3. empty tags, which stand alone and mark a point in the text, such as a page break: <pb/>
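
The three kinds of tags can be told apart by where the slash appears. The following Python sketch makes that concrete. The regular expression and the classify function are assumptions made for this illustration and are far simpler than a real XML parser. The <pb/> tag is the page-break element used in TEI-style encoding.

```python
import re

# Hypothetical, simplified pattern: an optional leading slash, a tag name,
# optional attributes, and an optional trailing slash before the final ">".
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)([^<>]*?)(/?)>")


def classify(tag):
    match = TAG.fullmatch(tag)
    if match is None:
        return "not a tag"
    leading_slash, _name, _attributes, trailing_slash = match.groups()
    if leading_slash:
        return "closing"  # slash right after "<", as in </title>
    if trailing_slash:
        return "empty"    # slash right before ">", as in <pb/>
    return "opening"      # no slash, as in <title>


for tag in ["<title>", "</title>", "<pb/>"]:
    print(tag, "is a(n)", classify(tag), "tag")
```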

What is an Element?

An element is a section of text enclosed by a matching pair of opening and closing tags. Tags identify the element and keep different elements separate from each other. The following examples come from the Electronic Text Center:

Example:

<title>Leaves of Grass</title>

Elements can also appear within elements. This is known as nesting.

Example:

<title>Leaves of Grass <subtitle>a textual variorum of the printed poems </subtitle> </title>

If elements are not nested properly, that is, if an inner element is not closed before the element that contains it, the text will not function properly on the Internet.
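
One way to see the difference between proper and improper nesting is to hand both to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the overlapping example are made up for this illustration. The parser accepts the properly nested title and rejects the version in which the subtitle is closed after the title.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    # The parser raises ParseError when tags are missing or overlap.
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


nested = ("<title>Leaves of Grass <subtitle>a textual variorum "
          "of the printed poems </subtitle> </title>")

# Hypothetical faulty encoding: </title> appears before </subtitle>.
overlapping = ("<title>Leaves of Grass <subtitle>a textual variorum "
               "of the printed poems </title> </subtitle>")

root = ET.fromstring(nested)
children = [child.tag for child in root]  # elements nested inside <title>

print(is_well_formed(nested))
print(is_well_formed(overlapping))
print(children)
```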

What is an Attribute?

An attribute modifies or further describes an element and appears only in the opening (or empty) tag, never in the closing tag.

Example:

<title rend="italic">Leaves of Grass</title>

The rend="italic" attribute signifies that the title should be rendered in italics: Leaves of Grass
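
How an attribute ends up affecting the display depends on the software that presents the document. As a minimal sketch, assuming the output is HTML and that rend="italic" maps to an <i> tag, a Python program could read the attribute as follows. The render function is hypothetical and handles only a single element with plain text inside it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def render(markup):
    element = ET.fromstring(markup)
    text = escape(element.text or "")
    # element.get returns the attribute value, or None if it is absent.
    if element.get("rend") == "italic":
        return "<i>" + text + "</i>"
    return text


print(render('<title rend="italic">Leaves of Grass</title>'))
print(render("<title>Leaves of Grass</title>"))
```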