Better Proofreading through XSLT

31 Jul 2013

Posted by scott

Proofreading is a big concern over here at the Harry Watkins Diary shop. I believe both Naomi and I have already written a bit about it on these pages. Because we're using XML markup (in accordance with the TEI Guidelines, we hope), the text we have to proofread can go wrong in an especially wide variety of ways. We may have mis-transcribed some of Harry's text, introduced a syntax error into the XML, or made a semantic error in deciding which tags to apply (e.g., is an occurrence of the word "Othello" a title, a role, or the name of a 'real' person?).

If you're XML-savvy, you're probably thinking, "Gosh, why are they typing raw XML? If they'd just create their XML using oXygen [a very powerful and utterly groovy XML editing/development tool], there would be no syntax errors." Which is true. The problem is that our unit of transcription/proofreading is one diary page. And it doesn't make sense to treat each diary page as a complete XML document -- for example, an underlined phrase may begin at the bottom of one page and extend onto the next, so its opening tag sits on one page and its closing tag on another. So each page is a fragment of a well-formed XML document, but it is not itself something an XML processor would accept.

So, at this point, our optimal process is looking something like this:

  1. transcribe all pages in one volume of the diary (there are 14 volumes, totalling over 1100 pages)
  2. concatenate all the XML together and wrap it in TEI-approved XML headers and footers
  3. have someone go through that document with oXygen, fixing all the XML problems (but how to "back-propagate" those changes into the individual page XML fragments?)
  4. proofread the syntactically correct XML for semantic/transcription errors
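Steps 2 and 3 can be partly automated, which also gives one possible answer to the "back-propagate" question: if we record the line on which each page starts in the concatenated document, a parser error can be traced back to the page fragment that needs fixing, and the fix can be made there. Here is a minimal Python sketch of that idea. It is not our actual tooling: the page names, the `assemble` and `check` helpers, and the bare-bones wrapper (a real TEI document also needs a `teiHeader`) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the real TEI header and footer (assumption).
HEADER = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>\n'
FOOTER = '</body></text></TEI>\n'


def assemble(pages):
    """Concatenate (name, fragment) pairs inside the wrapper.

    Returns the full document plus a list of (start_line, name) pairs,
    where start_line is the 1-based line on which that page begins.
    """
    parts = [HEADER]
    starts = []
    line = HEADER.count("\n") + 1
    for name, fragment in pages:
        if not fragment.endswith("\n"):
            fragment += "\n"
        starts.append((line, name))
        parts.append(fragment)
        line += fragment.count("\n")
    parts.append(FOOTER)
    return "".join(parts), starts


def check(pages):
    """Return None if the assembled document is well-formed.

    Otherwise return (page_name, line_within_page, parser_message).
    """
    document, starts = assemble(pages)
    try:
        ET.fromstring(document)
        return None
    except ET.ParseError as err:
        doc_line, _column = err.position
        page_name, page_start = "(wrapper)", 1
        for start, name in starts:
            # Keep the last page that begins at or before the error line.
            if start <= doc_line:
                page_name, page_start = name, start
        return page_name, doc_line - page_start + 1, str(err)


# An underlined phrase that starts on one page and ends on the next.
pages = [
    ("page-001", '<p>Went to the <hi rend="underline">Park'),
    ("page-002", 'Theatre</hi> tonight.</p>'),
]
# The same two pages, with the closing </hi> forgotten on the second.
broken = [
    ("page-001", '<p>Went to the <hi rend="underline">Park'),
    ("page-002", 'Theatre tonight.</p>'),
]

print(check(pages))
print(check(broken))
```

One caveat: the parser reports where it noticed the problem, which is not always where the mistake was made. A tag left unclosed on one page may only be flagged several pages later, so the reported page is a starting point for the search rather than a verdict.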

This relieves our weary eyes of checking the XML syntax, but it doesn't relieve us of looking at the markup, which is almost as unpleasant. Is there a way to make this proofreading more comfortable? This is where XSLT comes in.

We know we're eventually going to need some serious tools for transforming the XML of the diary into a form suitable for publication online, and XSLT is the way to go for us. So, in the interim, we've decided to use some quick-n-dirty XSLT to render the marked-up transcriptions in a way that supports proofreading. For example, take the XML for "Volume 0" -- a prequel of ephemera found in the boxes with the diary. With a little bit of XSLT, we can transform it into HTML that makes it easy for us to treat each page separately, but also to see Watkins' corrections, abbreviations, and underlining, as well as the semantics of places, roles, and people.
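To give a feel for what such a rendering does, here is a rough sketch of the same kind of mapping written in Python rather than XSLT. To be clear, this is not our stylesheet, just an illustration of the idea: each TEI element is turned into an HTML element that a browser (plus a little CSS) can make visible. The choice of TEI elements (`pb`, `hi`, `del`, `add`, `abbr`, `placeName`, `persName`, `roleName`, `title`), the class names, and the `render` function are assumptions for the example.

```python
import html
import xml.etree.ElementTree as ET

# Assumed mapping from TEI element names to (HTML tag, CSS class).
STYLES = {
    "p": ("p", None),
    "del": ("del", None),          # Watkins' deletions
    "add": ("ins", None),          # Watkins' additions
    "abbr": ("abbr", None),
    "placeName": ("span", "place"),
    "persName": ("span", "person"),
    "roleName": ("span", "role"),
    "title": ("cite", None),
}


def render(elem):
    """Recursively convert a TEI element into an HTML string."""
    name = elem.tag.split("}")[-1]  # drop the namespace prefix, if any
    inner = html.escape(elem.text or "")
    for child in elem:
        inner += render(child) + html.escape(child.tail or "")
    if name == "pb":
        # A page break becomes a visible divider and heading.
        return "<hr/><h2>Page %s</h2>" % html.escape(elem.get("n", "?"))
    if name == "hi" and elem.get("rend") == "underline":
        return "<u>" + inner + "</u>"
    if name in STYLES:
        tag, css_class = STYLES[name]
        attr = ' class="%s"' % css_class if css_class else ""
        return "<%s%s>%s</%s>" % (tag, attr, inner, tag)
    return inner  # unknown elements: keep the text, drop the tag


sample = (
    '<body xmlns="http://www.tei-c.org/ns/1.0"><pb n="1"/>'
    '<p>Saw <title>Othello</title> at the <hi rend="underline">Park</hi> '
    '<del>Theater</del><add>Theatre</add>.</p></body>'
)
page_html = render(ET.fromstring(sample))
print(page_html)
```

Giving places, people, and roles their own classes means a stylesheet can colour each one differently, so a mis-tagged "Othello" stands out at a glance instead of hiding inside angle brackets.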

We're very curious about how others have dealt with this problem -- essentially, having units of transcription which do not stand comfortably by themselves as full XML documents. Any advice?