Technology
The technological challenge for the Electronic Boethius project was to create an adaptable and extensible set of XML tools, an Edition Production Toolkit (EPT now stands for Edition Production Technology), for producing a fully encoded image-based electronic edition. We knew from long experience with the Electronic Beowulf that it was difficult to build an image-based electronic edition that truly integrated images and text. Because so much of the encoding of a scholarly edition (what is called its apparatus) is based on source documents, we wanted to create tools that helped an editor integrate images and text, to provide an electronic facsimile for pervasive analysis, including illustrations of all textual notes that described the manuscript. We also knew that encoding both a manuscript transcript and an edition of it produces overlapping (also called conflicting, or concurrent) markup. XML parsers cannot process documents with overlapping markup, because such documents are not well-formed XML. We solved the problem in an ad hoc way with the Electronic Beowulf by encoding two separate and distinct XML (SGML at the time) files, one for the transcript and one for the edition. It was an unsatisfactory solution, however, because it required constant updating of two closely interrelated files. We wanted the software to keep track of and update all changes.
The technology underpinning the Electronic Boethius has as a result grown quite complex. The free and open source code for one early version of the EPT, developed in collaboration with the ARCHway Project, is now available, and the source code for the Electronic Boethius will be released as well when the project reaches a suitable stage of completion (one of the theoretical virtues of electronic editions is that they are never permanently complete). In our view, however, the source code is not as important for anyone wishing to build image-based electronic editions as the Application Programming Interfaces (APIs). There are two APIs, discussed below, for the Electronic Boethius. By following the APIs, other projects can use and adapt our tools and create new ones that automatically work under the same EPT platform. (Potential users of our software may also wish to try out our Trial version of EPPT, Edition Production and Presentation Technology, at http://beowulf.engl.uky.edu/~eft/.) The EPT is too complex to explain in detail in this brief overview, but the information we provide here should serve as an adequate introduction.
At the outset of the project our need to solve specific problems for preparing image-based encodings led us to develop a Java Swing platform, the Electronic Editions Editor (E3). The E3 platform included a general ScripText environment (now called ImagText) for integrating images and text; a Glossary Tool for building comprehensive glossaries from an XML text file; a Tagger (now xTagger) for inserting XML markup, based on the images, in the text file; a DucType tool for paleographical description, analysis, and encoding; and an OverLay tool for comparing and encoding multiple images of a folio taken under different lighting conditions. The E3 platform also included tools for XML validation and transformation (XSLT), as its menu indicates.

[The Electronic Edition Editor (E3) panel]
The functioning editing environment of what is now EPT was in this way established at an early stage in E3. In this example a paleographical tool allowed the editor to select a letter form from the image, compare it with other examples from the manuscript, encode it for its upper, body, and lower characteristics, and insert the markup for the chosen element, in this case <let> (i.e., letter), in the XML file.

[E3: Tagger, ScripText, OverLay, and Paleography tools]
In E3 we used eXist for native XML database support, but we had to abandon it as we used more and more concurrent markup. We also developed a user interface for storing, retrieving, and querying XML documents, but we did not, and in fact still have not, developed a database. Until we do, we are storing data in the file system and querying an in-memory data structure.
As we developed and enhanced these tools, we continually faced two ever-growing problems: (i) the complexity of the markup, which defied XML syntax; and (ii) the integration of text and image tools, which was made more difficult because the two types of tools were stand-alone programs developed and maintained by different programmers. We offer here only a brief overview of our solutions to these problems.
Electronic Boethius utilizes two different types of XML encoding, data-centric and document-centric. Data-centric XML, as for example in a glossary, typically represents a collection of semistructured data objects, where different levels of the XML element hierarchy represent objects and their features, while the content of XML elements represents the values of the features. Inherently more simple than document-centric XML, data-centric XML is usually not bound to a specific ordering of its elements. Thus the XML glossary entry,
<glossentry word="abisgod" pos="verb">
  <hdwd>abisgian</hdwd>
  <fol>125r</fol>
  <line>11</line>
</glossentry>
is equivalent to
<glossentry word="abisgod" pos="verb">
  <fol>125r</fol>
  <line>11</line>
  <hdwd>abisgian</hdwd>
</glossentry>
In compiling data-centric XML there are no situations that require overlapping markup, and the repetitive nature of the markup is conducive to the use of templates.
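The equivalence of the two glossary entries can be checked mechanically. The following is only a minimal sketch in Python, not part of the EPT: the helper name `as_record` is hypothetical, and it assumes that a data-centric entry is fully described by its attributes and by the tag and text of each child element.

```python
import xml.etree.ElementTree as ET

ENTRY_A = ('<glossentry word="abisgod" pos="verb"> <hdwd>abisgian</hdwd> '
           '<fol>125r</fol> <line>11</line> </glossentry>')
ENTRY_B = ('<glossentry word="abisgod" pos="verb"> <fol>125r</fol> '
           '<line>11</line> <hdwd>abisgian</hdwd> </glossentry>')

def as_record(xml_text):
    """Reduce a data-centric element to an order-independent record (hypothetical helper)."""
    elem = ET.fromstring(xml_text)
    # Sorting the (tag, text) pairs discards the order in which children appear.
    children = sorted((child.tag, (child.text or "").strip()) for child in elem)
    return elem.tag, dict(elem.attrib), children

same_record = as_record(ENTRY_A) == as_record(ENTRY_B)
print(same_record)  # prints: True
```

Because child order carries no meaning here, a glossary tool is free to sort, template, or regenerate entries without changing what they say.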
Document-centric XML, on the other hand, is inherently more complex and can become extremely complex. It typically contains XML elements with mixed content, describing properties, features, and meta-information about a pre-existing textual document, such as a transcript of a manuscript page.
In document-centric XML the text is a union of its content and the XML elements, attributes, and values describing it. The order of the text determines the order in which elements, attributes, and values occur in a document-centric XML document. If one changes the order of the <line> elements in the following transcript, for instance, the content between tags remains the same but the meaning is not the same:
<col title="Boethius">
  <line>Se boetius wæs oðre naman ha</line>
  <line>ten seuerinus se wæs heretoga</line>
  <line>romana</line>
</col>
"Boethius was by another name called Severinus; he was a consul of the Romans" is not the same as "Boethius was by another name cal- of the Romans -led Severinus; he was a consul." Unlike data-centric XML, the unique order of the content in document-centric XML is critical. As a result, encoding document-centric XML tends to be very different from, and much more difficult than, the encoding of data-centric XML documents. Because of the unique features of a document, the power of templates is considerably limited and the encoding must proceed following the document, rather than following an arbitrary form.
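The contrast with the glossary entry can be illustrated in the same way. This is a rough sketch only, using Python's standard library rather than any EPT tool; `reading_text` is a hypothetical helper that joins the lines in document order and marks each line break with " | ".

```python
import xml.etree.ElementTree as ET

TRANSCRIPT = (
    '<col title="Boethius">'
    "<line>Se boetius wæs oðre naman ha</line>"
    "<line>ten seuerinus se wæs heretoga</line>"
    "<line>romana</line>"
    "</col>"
)

def reading_text(col):
    """Join the lines in document order, marking each line break with ' | '."""
    return " | ".join(line.text for line in col.findall("line"))

original = ET.fromstring(TRANSCRIPT)

# Build a second copy in which the last two <line> elements change places.
reordered = ET.fromstring(TRANSCRIPT)
first, second, third = list(reordered)
reordered[:] = [first, third, second]

# The same pieces of content are present in both copies ...
same_pieces = sorted(l.text for l in original) == sorted(l.text for l in reordered)
# ... but the text that a reader would follow is different.
same_reading = reading_text(original) == reading_text(reordered)
print(same_pieces, same_reading)  # prints: True False
```

An order-insensitive comparison of the kind that suits a glossary entry would wrongly treat the two transcripts as identical.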
In addition, document-centric XML that is based on images presents specific encoding problems that require specific processing methods. A partial encoding of the transcript for lines 9-11 of fol. 38v illustrates how conflicting hierarchies emerge from the image:

[Some conflicting hierarchies on fol. 38 verso (UV)]
In this example, the <word> markup for haten (called), lines 1-2, overlaps the <line> markup in line 1. Traditional XML tools cannot process overlapping structures, which are not well-formed. For well-formed XML, the <word> markup must nest within the <line> markup. Because manuscript images unavoidably produce documents with overlapping, conflicting, concurrent hierarchies, it was essential to develop tools that could process document-centric XML that included these non-well-formed characteristics.
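The well-formedness problem is easy to reproduce with any standard parser. The sketch below uses Python's built-in XML parser purely as an illustration; the two sample strings are simplified stand-ins for the encoding in the figure, and `is_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# <word> opens in the first line and closes in the second: overlapping markup.
OVERLAPPING = ("<col><line>oðre naman <word>ha</line>"
               "<line>ten</word> seuerinus</line></col>")

# A well-formed workaround that splits the word into two unrelated fragments.
NESTED = ("<col><line>oðre naman <word>ha</word></line>"
          "<line><word>ten</word> seuerinus</line></col>")

def is_well_formed(xml_text):
    """Return True if a conventional XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed(OVERLAPPING), is_well_formed(NESTED))  # prints: False True
```

The nested version parses, but only by recording "ha" and "ten" as two separate words, which loses the fact that the scribe wrote a single word across a line break.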
Tool integration and concurrent markup hierarchies are both non-trivial problems from a Computer Science standpoint. Working in concert with the Electronic Boethius project, the ARCHway Project undertook to tackle these two problems. To maintain the tools that we had developed and to leverage their full potential, ARCHway proposed that the Electronic Boethius programmers should reprogram the tools under the open-source Java Eclipse platform. By reprogramming the tools, we would create an integrated Edition Production Technology (EPT), rather than an Edition Production Toolkit containing discrete tools. While expensive and time-consuming, the reprogramming turned out to be highly beneficial for both projects. ARCHway was able to use the editing tools of the Electronic Boethius to populate its architectural model, and Electronic Boethius had an integrated workbench. The diagram schematizes ARCHway's architecture for the EPT, based on the Eclipse platform.

[ARCHway architecture for Edition Production Technology (EPT)]
The reprogramming of our editing tools, from Java Swing of E3 to Java SWT of Eclipse, provided an effective software architecture for the production and deployment of the EPT, which is now organized as a set of distinct plugin tools that work together through Eclipse. Each tool is responsible for a specific editing or administrative task, and the tools also work with one another through the platform, effectively borrowing functionality and allowing for the creation of new tools without having to reprogram established functions. Eclipse is an ideal environment for both programmers and EPT editors (and ultimately users). The EPT's plugin design makes it possible to assign individual tools to different programmers, while the integration problems are easily solved by the Eclipse platform as a deployment platform for the EPT. In addition to integrating the Electronic Boethius editing tools, the Eclipse architecture automatically integrates any new tools and provides modularity, built-in version control, online updates, and platforms for both development and deployment of the EPT.

[Using Eclipse as development and deployment platform]
The complexity and competing interests of the features of fire-damaged manuscripts and the non-hierarchical nature of a manuscript text, in general, required a transparent, automated process for managing overlapping markup. A successful automated process would free the human editors to work on editing the texts, instead of dealing with the issues of validity and well-formedness of XML document encoding.
This framework allows the editors to concentrate on the purpose of the encoding, the editing of the text and the compilation of a critical apparatus. In our approach, the editor describes a collection of editorial goals expressed in simple, formal, document schemas, each with its own hierarchy. As is true of an editor's work, there is no obligation to build and maintain a master schema. These concurrent schemas demand, however, specialized software to support the editorial process and drive it by the semantics of the markup. This software allows the editor to indicate in the text where the markup must go, select the appropriate markup from a template, and record the results.

[A framework for managing Concurrent Markup Hierarchies]
We introduced an XML representation for documents with overlapping markup that was based on the GODDAG (General Ordered-Descendant Directed Acyclic Graph) representation devised by C. M. Sperberg-McQueen and Claus Huitfeldt. Dubbed KyGODDAG by Sperberg-McQueen in his keynote speech at the WebDB 2005 and XIME-P 2005 workshops, our approach is a novel in-memory data structure representation with a corresponding parser. Applications can access the XML data through the KyGODDAG API. We have already developed an extension of the XPath query language on top of KyGODDAG for searching multihierarchical XML documents. We are working on an extension for XQuery for the near future, and our longer range plan is to develop persistent storage for XML documents with concurrent markup.
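The general idea of several hierarchies sharing one text can be sketched with plain character ranges. The code below is an illustrative simplification in Python and is not the KyGODDAG data structure, its API, or the Extended XPath implementation: the `Span` class, the hierarchy names "physical" and "linguistic", and the `overlapping` query are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    hierarchy: str  # which markup hierarchy the element belongs to (hypothetical names)
    tag: str
    start: int      # first character covered (inclusive)
    end: int        # one past the last character covered (exclusive)

TEXT = "Se boetius wæs oðre naman haten seuerinus"

def span_for(hierarchy, tag, fragment, text=TEXT):
    """Create a span covering the first occurrence of fragment in the shared text."""
    start = text.index(fragment)
    return Span(hierarchy, tag, start, start + len(fragment))

SPANS = [
    span_for("physical", "line", "Se boetius wæs oðre naman ha"),
    span_for("physical", "line", "ten seuerinus"),
    span_for("linguistic", "word", "haten"),
    span_for("linguistic", "word", "seuerinus"),
]

def overlaps(a, b):
    """True when two spans share characters but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)

def overlapping(span, spans=SPANS):
    """Find elements of other hierarchies that partially overlap the given span."""
    return [s for s in spans if s.hierarchy != span.hierarchy and overlaps(span, s)]

haten, seuerinus = SPANS[2], SPANS[3]
print([(s.tag, s.start, s.end) for s in overlapping(haten)])
print(overlapping(seuerinus))
```

In this sketch the word haten overlaps both manuscript lines, while seuerinus sits wholly inside the second line and so overlaps nothing. A query across hierarchies of this kind is something a single XML tree cannot express directly.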
At this stage our framework for managing concurrent XML hierarchies addresses the following problems:
In the second stage of development we enhanced the tools to provide support for concurrent markup hierarchies. In particular, we completely implemented and integrated Extended XPath, as follows:
A complementary xTagger-ImagText pair of software tools now under production similarly allows the editor and research team to provide pervasive, extremely complex, document-centric, image-based XML encoding for the transcript and edition. As we have seen, document-centric XML encoding, especially image-based encoding, is not as amenable to broad template-driven markup as the data-centric encoding of a glossary. However, the xTagger-ImagText software is carefully designed to help editors descriptively encode a manuscript edition with the aid of templates, and link details from folio images to the transcript without concentrating on XML issues.
Although initially developed as separate plugins, xTagger and ImagText tools are designed to work closely together. Indeed, virtually everything that is tagged by xTagger is based on images of the manuscript displayed by ImagText. ImagText is an image viewer, enhanced with basic image display features, such as zoom, magnify, overlay, fit horizontally, and fit vertically. xTagger is a document-centric XML editor that, through markup, captures details of the manuscript and displays them in a variety of user-defined views. Working with xTagger, ImagText is capable of displaying not only a manuscript folio next to its corresponding transcript, but also any details, such as damaged areas, ultraviolet enhancements, notable letterforms, gatherings of folios, and anything else xTagger has tagged with locational markup from ImagText.
The diagram below gives a schematized view of the combination of image and text data in an image-based electronic edition and how the data is processed.

[Image-text processing]
Our tools manage the distributed manuscript images and their XML encodings through an image catalog. The diagram represents the separate image and text data as an image database and an XML markup database. The image catalog associates a manuscript folio name (used in the XML encoding database) with the physical representation of the manuscript folio (an image file in the image database). The editor normally creates an image catalog from the start, associating images of folios with transcripts of the folios and folio-lines. But it is also possible to create the image catalog, and to continuously update it as new images or image fragments accrue, while the encoding is in progress. The image catalog entry below associates the manuscript folio 38v with the corresponding manuscript folio image in the image database:
| Folio name | Folio image |
| --- | --- |
| 38v | file://c:/projects/OthoAvi/images/038v.jpg |
Linking images and text works as follows. After tagging a transcript of a folio with its folio name, the editor can tag the coordinates of any part of the folio for any feature (or element in the document schema). When tagging manuscript features, the editor views in one window the source of the descriptive markup in the digital images of the manuscript, and tags a transcript of it in another window using clickable element buttons. The resulting tagged file is, like the glossary, fully searchable and open to any number of configurable views. Because the XML encoding automatically includes x/y coordinates for all tagged parts of an image, searching the text and the image can proceed in tandem.

[Linking Images and Text]
Later, when the user wants to display the encoded segments of an image, xTagger first retrieves two pieces of information, the folio name and the coordinates of the requested feature or element (here <damage>), from the encoded text. The folio image is determined by the folio name in the image catalog, so ImagText can display it. Then, using the coordinates provided by xTagger, the ImagText tool outlines the image area corresponding to the encoded feature. Similarly, if the user wants to display in xTagger the markup of the encoded features of an image in ImagText, xTagger first determines the corresponding folio name from the image catalog, and then displays the markup by retrieving the encoded information associated with the folio name. The xTagger and ImagText software we are developing will correctly associate encoded features with those specific image areas previously selected by the editor, while resolving behind the scenes all conflicting tags (the <w> and <damage> tags around severinus are not nested, or "well-formed," in the figure above). In other words, the KyGODDAG software silently spares the editor from creating invalid or non-well-formed XML encoding.
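The two lookups described above can be sketched as follows. This is a hedged illustration in Python, not the xTagger or ImagText code: the `folio` element, the `name` and `coords` attributes, and the coordinate values are invented placeholders, and only the catalog entry for folio 38v comes from the table above.

```python
import xml.etree.ElementTree as ET

# Image catalog: folio name -> folio image (entry taken from the catalog table).
IMAGE_CATALOG = {"38v": "file://c:/projects/OthoAvi/images/038v.jpg"}

# Hypothetical encoding: the attribute names and the coordinate values
# are placeholders for illustration, not the project's actual schema or data.
EDITION = """
<folio name="38v">
  <line n="10">ten <damage coords="412,96,540,131">seuerinus</damage> se</line>
</folio>
"""

def regions(edition_xml, feature, catalog=IMAGE_CATALOG):
    """Return (image, bounding box, text) for every occurrence of a tagged feature."""
    root = ET.fromstring(edition_xml)
    results = []
    for folio in root.iter("folio"):
        image_uri = catalog[folio.get("name")]  # folio name -> image file
        for elem in folio.iter(feature):
            box = tuple(int(v) for v in elem.get("coords").split(","))
            results.append((image_uri, box, "".join(elem.itertext())))
    return results

def folio_for_image(image_uri, catalog=IMAGE_CATALOG):
    """Reverse lookup: which folio name does a displayed image belong to?"""
    for name, uri in catalog.items():
        if uri == image_uri:
            return name
    return None

print(regions(EDITION, "damage"))
print(folio_for_image("file://c:/projects/OthoAvi/images/038v.jpg"))  # prints: 38v
```

The first function corresponds to going from encoded text to an outlined image area; the second corresponds to going from a displayed image back to the markup recorded for its folio.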
The xMarkup interface is an invaluable editing tool in our image-based encoding system. The difference between the hierarchies of elements in xTagger and the hierarchies of features in xMarkup is that xTagger controls the essential document schemas (or DTDs), whereas xMarkup organizes elements for an editor into convenient groups for preparing the edition, whether or not the elements belong to the same document schemas. xMarkup is a user interface, ultimately controlled by xTagger, for insertion, deletion, and updating or editing of tags. To guide the process of encoding, xMarkup provides for every element a template with all attributes (and their values). Each xMarkup template shows the connection between the ImagText and xTagger tools by displaying any selected part of an image as well as its x/y coordinates, which are also automatically inserted as an attribute value in the XML document.
The Electronic Boethius uses templates in xMarkup, rather than obliging the editor to encode an XML document directly, for several reasons:
The central repository of data and information in our system is the document-centric XML of the electronic edition. The edition XML incorporates the manuscript transcript and a growing record of its structure (e.g., folios, folio-lines, evidence of quires), scribal features (e.g., word-divisions, book-divisions), editorial changes (verse lines, prose lines, modern word boundaries, restorations), physical condition (especially damages), and anything else represented in the document schemas. Moreover, all information that relates text and images through image coordinates is also contained in the XML edition document. It is accordingly crucial to maintain data integrity and ensure consistent data access for all tools in our platform of text and image tools. We achieve this maintenance through xTagger, the common data access interface for all tools.
xTagger is, in short, the gateway for accessing all information in the edition XML document. It is basically an XML editor with a three-tier architecture: the data structure layer, the mediator layer, and xTagger's User Interface (UI). This architecture allows us to decouple the data management issues from the UI and alter the internal representation of XML without changes in the top layer.

[xTagger 3-tier architecture]
Data structure layer. xTagger's underlying data structure is KyGODDAG, which ensures support for concurrent markup hierarchies and for Extended XPath queries over them. It is common to represent an XML document as an in-memory DOM tree. The DOM data structure allows document traversal and updates, and provides support for XML query languages. A persistent storage model could also be used. The benefit of in-memory representation is speed in relatively small projects, whereas a persistent storage model is more appropriate for larger projects and combined projects.
Mediator layer. xTagger's mediator layer is the common interface for accessing the edition XML data. The mediator connects the XML document's data structure representation to the document presentation in the user interface. This layer is responsible for translating common editor functionality (for instance, character insertion or deletion, markup insertion or deletion, etc.) into calls to the data structure API. The mediator has two main functions: (1) it selects parts of the XML document (filtering markup and content, based on the user's choice) for presentation in the editor's UI; and (2) it maintains a mapping between the XML document components (for instance, KyGODDAG nodes) and xTagger's UI components. The mediator is also responsible for translating possible search results over the XML data structure representation into the corresponding visual components in the editor's UI. [Note: Another important function of the mediator layer is open access: it is designed not only for the tools we have developed so far, but also for any new tools in other projects that follow our system. To facilitate these external efforts, we have already published the API (see the Javadocs links), which we currently maintain and improve.]
User Interface (UI). xTagger's UI is the presentation part of the XML document and represents the core technology for authoring document-centric XML with concurrent markup hierarchies. The UI collects user input (character data and markup insertions, deletions or updates, filtering requests) and is responsible for XML document presentation. Based on the user's experience or needs, xTagger's UI can show the content of the document alone or with a window showing as many XML tags as the user selects for display (filtering capabilities). The user can insert markup by highlighting the content range, selecting the element tag (or alias) and filling in the attribute values through a schema-based template. The mediator translates the content range into a data structure range, performs the necessary document integrity checks (for potential validity), and inserts the markup in the XML data structure.
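The insertion step handled by the mediator might be pictured roughly as below. This Python sketch is an assumption-laden simplification, not the xTagger mediator or the KyGODDAG API: `MarkupStore` is a hypothetical class, the hierarchy names are invented, and the only integrity check shown is that a new element may not cross an existing element of the same hierarchy, which is much weaker than a check for potential validity against a schema.

```python
class MarkupStore:
    """Hypothetical stand-in for the data structure layer: text plus ranged markup."""

    def __init__(self, text):
        self.text = text
        self.spans = []  # (hierarchy, tag, start, end), with end exclusive

    def insert(self, hierarchy, tag, start, end):
        """Insert markup over text[start:end] after a simple integrity check."""
        if not (0 <= start < end <= len(self.text)):
            raise ValueError("range lies outside the text")
        for h, t, s, e in self.spans:
            if h != hierarchy:
                continue  # overlap across different hierarchies is allowed
            crosses = (s < start < e < end) or (start < s < end < e)
            if crosses:
                raise ValueError(f"<{tag}> would cross <{t}> in hierarchy {hierarchy!r}")
        self.spans.append((hierarchy, tag, start, end))

store = MarkupStore("Se boetius wæs oðre naman haten seuerinus")
store.insert("physical", "line", 0, 28)     # first manuscript line ends after "ha"
store.insert("physical", "line", 28, 41)    # second line starts with "ten"
store.insert("linguistic", "word", 26, 31)  # "haten" spans the line break: accepted

try:
    store.insert("physical", "seg", 26, 31)  # same hierarchy as the lines: rejected
    rejected = False
except ValueError:
    rejected = True
print(len(store.spans), rejected)  # prints: 3 True
```

The editor highlights a range and picks a tag; whether the result would be well-formed within each hierarchy is decided by the software rather than by the person doing the encoding.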
The main features of xTagger (and the markup technologies represented by it) are enumerated below:

[Defining xTagger filters]
The editor can define an unlimited number of filtered views. As soon as a filter is defined, it becomes immediately available in a text or XML filtering menu. The following filtering options are available:
It is important to understand that filtering is simply an editor's device for focusing on selected text or XML. Filtering does not remove or otherwise change markup. The markup that existed before applying a filter remains in the document, and is merely hidden from view for convenience.
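The principle that a filter changes the view and never the document can be sketched in a few lines. This is a simplified illustration in Python, not the xTagger filtering code: the sample encoding, the regular expression, and the `filtered_view` helper are assumptions, and the sketch works on the serialized text so that it does not depend on the markup being well-formed.

```python
import re

ENCODED = "<line><w>haten</w> <damage><w>seuerinus</w></damage></line>"

# Matches an opening or closing tag and captures the element name.
TAG = re.compile(r"</?([A-Za-z][\w.-]*)[^>]*>")

def filtered_view(xml_text, visible_tags):
    """Return a display string in which only the selected tags are shown."""
    def keep_or_hide(match):
        return match.group(0) if match.group(1) in visible_tags else ""
    return TAG.sub(keep_or_hide, xml_text)

text_only = filtered_view(ENCODED, set())
damage_only = filtered_view(ENCODED, {"damage"})
print(text_only)    # prints: haten seuerinus
print(damage_only)  # prints: haten <damage>seuerinus</damage>
```

Each call produces a new display string; `ENCODED` itself still holds every tag after any number of filters has been applied.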

[Using user hierarchies]

[Tool tip and status bar information display]

[Linking HTML display to the document-centric XML]
We use this technique in our data-centric XML Glossary Tool editor, too, to link specific glossary entries to the edition and transcript.
© 2005 Electronic Facsimiles & Texts | Last modified by IEI