Introducing the XML encoding

Introducing XML #

This project, like most scholarly projects in the humanities, uses the XML ("eXtensible Markup Language") implementation created by the Text Encoding Initiative. The fundamental concepts of XML are very simple.  It aims to encode documents in ways that will make them useful for computer processing.  It does this by inserting codes ("mark-up") in the stream of text.  For example, one could encode the heading at the start of this text as follows:

    <head>Introducing XML</head>

That is: we say the string "Introducing XML" is the content of a <head> element, with the element closed by </head>. In fact, this kind of encoding is present in computer documents all the time; most of the time you simply never see it.

XML has just two other features which you need to know (really, it is a simple language):

  1. Elements may have attributes. If we want to specify that this head is a top-level element, we can add an attribute, as follows: <head level="top">
  2. Elements may be "empty": that is, contain no content.  This <head> element does have content.  But if we wanted to say that this <head> element was preceded by a page break we might say <pb/><head>.  The "/" indicates the end of the element.  Thus <pb/> indicates that the <pb> element finishes as soon as it is opened.  This is useful when we want to mark a line or page break or similar.
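As a small sketch, the two features can be seen side by side in the fragment below. It only reuses the <pb/> and <head level="top"> examples given above; the comments are added for explanation and are not part of any project transcription.

```xml
<!-- An empty element: it opens and closes in a single tag -->
<pb/>
<!-- An element with one attribute (level) and with content -->
<head level="top">Introducing XML</head>
```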

This is all you need to know to get started!

A sample page #

Here is a sample transcription page, from folio 6r of Cp:

First, observe the elements <text><body><div> which start the page.  These are standard Text Encoding Initiative (TEI) elements, used in countless documents.  <text> says simply: this is a text. <body> says: this is the main part of the text. <div> says: we are looking at this part of the text. You will see that EVERY page of project transcription begins with these elements.  Note too that they are closed at the end of the page -- if you fail to close them you will get an error message.

Note the attributes on the first <div> element:

  1. n="GP" says: this is the General Prologue
  2. prev="urn:det:TCUSask:CTP2:document=Cp:Folio=5v:Line=36:entity=Tales:Group=GP".  This looks like a lot. But all it is saying is: this <div> element is continuing the <div> element on the previous page, that is folio 5v (to be more precise: it is continuing the General Prologue from line 36 on that page)
  3. type="G" just says that we define this division as a "G", for group.

Now, look at the next line:

  1. <lb/> Here is an "empty" element: a line break! This means that what follows starts a new line
  2. <l n="363"> This is line (<l>) 363 of the General Prologue
  3. </l> closes the element holding line 363
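Putting together the pieces described so far, the opening of the page has roughly the shape sketched below. This is an outline under assumptions, not a copy of the actual transcription: the indentation and the order of the attributes are chosen here for readability (attribute order does not matter in XML), and comments stand in for the transcribed words, which are not reproduced.

```xml
<text>
  <body>
    <div n="GP" type="G" prev="urn:det:TCUSask:CTP2:document=Cp:Folio=5v:Line=36:entity=Tales:Group=GP">
      <lb/><l n="363"><!-- the words of line 363 go here --></l>
      <!-- the remaining lines of the page follow the same pattern -->
    </div>
  </body>
</text>
```

Note how each element opened at the top is closed again at the bottom, in reverse order.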

Finally, look at the encoding on line 366: Frat<am>̉</am><ex>er</ex>nite, for the modern word "Fraternity" here spelt as Frat̉nite.

  1. <am>̉</am> says the hook next to the t is a mark of abbreviation
  2. <ex>er</ex> says that the hook is abbreviating the letters "er"

This represents most of the encoding you will meet!

Validating the XML #

There are two other features of XML which you should know:

  1. In XML, elements must nest hierarchically.  That is, they cannot overlap: if one element contains a second element, then that second element must close before the first one does.  It is an error to say <head> a heading <p> with a paragraph</head> but this text overlaps! </p> because the <p> opens inside the <head> but closes outside it.
  2. In XML, you can specify exactly what elements may contain each other, what attributes they may have, even the range of values those attributes might have. This is an extremely powerful way of helping ensure that documents are consistent.
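Both points can be sketched in one small document. The elements below nest correctly, and the declarations at the top are an example of the kind of rules point 2 describes. The declarations use DTD notation only because it is compact; this is an assumption made for illustration and not the project's actual schema, and the attribute value "sub" is hypothetical.

```xml
<!DOCTYPE div [
  <!-- a div must contain exactly one head followed by one p -->
  <!ELEMENT div (head, p)>
  <!ELEMENT head (#PCDATA)>
  <!-- level is optional and may only be "top" or "sub" (hypothetical values) -->
  <!ATTLIST head level (top|sub) #IMPLIED>
  <!ELEMENT p (#PCDATA)>
]>
<div>
  <head level="top">a heading</head>
  <p>with a paragraph that no longer overlaps</p>
</div>
```

Under these rules, a <head> left unclosed, a <p> placed before the <head>, or a level value other than the two listed would each be reported as an error.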

You will see this at work if you make a mistake in the encoding and press the preview button.  For example, if you delete the closing </text> tag on a page, and press "Preview", you will see something like:

These messages can be rather cryptic! Email us if you have a problem you just can't solve.
