Posted by scott

Unless you're staff on this project, there's not a whole lot to see, beyond the anecdotes and updates occasionally appearing here. Probably no-one wants to look too closely at an ongoing transcription project, anyway. For the stouter of heart, though, here's a tiny peep behind the scenes . . .

As the technical advisor for the project, my primary initial responsibility has been to provide a web-based environment which can support a multi-campus, multi-level transcription effort. The entire project is implemented using Drupal, a widely used open-source content management platform. The transcription process itself, though, is supported by a set of Drupal modules known collectively as Workbench. As the Workbench page explains,

Workbench provides overall improvements for managing content that Drupal does not provide out of the box. Workbench gives us three important solutions:

  • a unified and simplified user interface for users who ONLY have to work with content. This decreases training and support time.
  • the ability to control who has access to edit any content based on an organization's structure not the web site structure
  • a customizable editorial workflow that integrates with the access control feature described above or works independently on its own

What this means to me/us is:

  • transcribers (who "only have to work with content") can work without knowing much Drupal arcana
  • different kinds of project staff, with different levels of responsibility, can be given different kinds of control over their portions of the transcription process
  • the especially demanding workflow required by transcription can be substantially automated

And, of course, the transcription itself can be carried out anywhere, though proofreading in pairs works best face to face.

So, the workflow. Step Zero is to make sure there's stuff to be transcribed. When the current batch of transcriptions is wrapping up, Shane sends me the next batch: a set of page scans as image files. I convert these into a set of pages on the Drupal site which allow transcribers to attach extra information to each page of the diary (such as the starting and ending date of the entries on the page, as well as the XML-tagged transcription itself).

When those are available, Amy and Naomi assign a set of pages to project staff. Then the real fun with Workbench begins. A single diary page goes through 6 workflow states:

Draft -> Needs Proofread 1 -> Needs Revision 1 -> Needs Proofread 2 -> Needs Revision 2 -> Published

One of the lovely things about Workbench is that it's fairly easy to limit the pages "visible" to an individual staff member to those for which she is currently responsible. So the project directors can choose to see only those pages which are in a Needs Proofread state, and the transcription staff don't need to be distracted by pages for which they're not responsible. Of course, Amy and Naomi decided that it would be better to have different directors be responsible for the different states of proofreading, so one of them does the first proofread for odd-numbered volumes and the other does the first proofread for the even-numbered volumes. This is not something that Drupal+Workbench allows us to do directly, but Drupal is highly customizable, so with 40 lines of code I was able to create a special "view" for the directors which shows them exactly those pages they need to proofread at any given time.
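For readers who want to see the shape of that rule, here is a minimal sketch in Python. It is an illustration only, not the Drupal code running on the site. The names `Page`, `proofreader_for`, `pages_for`, `director_a`, and `director_b` are all hypothetical, and the sketch assumes that the second proofread of a volume goes to whichever director did not do the first.

```python
from dataclasses import dataclass

# State names copied from the workflow listed above.
NEEDS_PROOFREAD_1 = "Needs Proofread 1"
NEEDS_PROOFREAD_2 = "Needs Proofread 2"


@dataclass
class Page:
    """Hypothetical stand-in for one diary page on the site."""
    volume: int
    page_number: int
    state: str


def proofreader_for(page, odd_first="director_a", even_first="director_b"):
    """Return the director who should proofread this page now, or None.

    Assumption: the director who takes the first proofread of
    odd-numbered volumes takes the second proofread of even-numbered
    volumes, and vice versa.
    """
    if page.state not in (NEEDS_PROOFREAD_1, NEEDS_PROOFREAD_2):
        return None  # the page is not waiting on a proofread
    first = odd_first if page.volume % 2 == 1 else even_first
    second = even_first if first == odd_first else odd_first
    return first if page.state == NEEDS_PROOFREAD_1 else second


def pages_for(director, pages):
    """Filter a list of pages down to those this director must proofread."""
    return [p for p in pages if proofreader_for(p) == director]
```

Calling `pages_for("director_a", all_pages)` would then return only the pages that director needs to look at, which is roughly what the custom view does for each director.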

Sometimes, though, we have to pay a small price for doing things our own way; we've learned that we have to treat the changing of moderation states very delicately. Both Drupal and Workbench are community-driven, which means there's a perpetual dance between what the code can do and what its users want it to be able to do. In the case of Workbench, the internal details of how Drupal stores information about revised content appear to conflict with some uses of Workbench, triggering heady philosophical discussions about (say) whether creating a new revision of some content should cause the last-updated timestamp to change. For us, this means that some transcriptions appear to exist in multiple workflow states simultaneously, which contributes a bit of occasional confusion. The Workbench developers have decided (probably wisely) not to attempt to fix this problem in the current version of Drupal but are waiting instead for some significant infrastructural changes coming with Drupal 8.

Update, July 31. In fact, our problems with Workbench are more easily solved: it appears that I had not-quite-intentionally allowed all of our staff to change the workflow state of a page to any other state, at any point -- rather than allowing them only to change to the 'next' state when editing was complete. Removing that feature, and making a few edits to the affected pages, solved the problem.
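The difference between "any state to any state" and "only the next state" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Workbench's own configuration or API: the function names `allowed_next` and `change_state` are hypothetical, and the workflow is treated as strictly linear, as in the list of six states above.

```python
# The six workflow states, in order, as listed earlier in this post.
WORKFLOW = [
    "Draft",
    "Needs Proofread 1",
    "Needs Revision 1",
    "Needs Proofread 2",
    "Needs Revision 2",
    "Published",
]


def allowed_next(state):
    """Return the states a page may move to from `state`.

    Assumption: only the immediately following state is allowed,
    so the last state ("Published") has no onward transition.
    """
    i = WORKFLOW.index(state)  # raises ValueError for an unknown state
    return WORKFLOW[i + 1:i + 2]


def change_state(current, requested):
    """Return the new state, or raise ValueError if the jump is not allowed."""
    if requested not in allowed_next(current):
        raise ValueError(f"cannot move from {current!r} to {requested!r}")
    return requested
```

Under this rule, `change_state("Draft", "Needs Proofread 1")` succeeds, while `change_state("Draft", "Published")` raises an error instead of quietly leaving a page in an unexpected state.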

If you're curious (or have ideas) about how Drupal could be used to support similar kinds of "digital humanities" projects, take a look at the Drupal for Humanists project, which seems to be on a bit of a hiatus but nonetheless includes a good general overview as well as a handful of exemplars. And, of course, drop me a line.