Hitaext User Manual

Erwin Marsi
Department of Communication and Information Sciences
Tilburg University
The Netherlands
e.c.marsi@uvt.nl

This is work in progress!

Contents

  1. Contents
  2. Summary
  3. Quick start
  4. Introduction
  5. Using Hitaext
  6. Creating your own alignment files
  7. Reference



Summary

Hitaext is a graphical tool for manually aligning pairs of text documents with XML markup. It reads two XML documents and allows you to align XML elements on the basis of the text they contain. The format of the source and target documents is free as long as it is well-formed XML, and both documents are read-only. Alignment is virtually unrestricted: you can create one-to-one, one-to-many or even many-to-many alignments between arbitrary elements at any level of the XML tree structure. The alignments are stored in a simple XML format, which can be used for further processing. Hitaext is implemented in the Python programming language using the wxPython GUI toolkit. It has been tested on Mac OS X, GNU/Linux and MS Windows, but should run on any platform that is supported by Python and wxPython.

Quick start

So you managed to install Hitaext and now you are in a hurry to see if this is of any use to you? Sounds familiar :-) Try this:

Introduction

Hitaext is a tool for manually aligning text fragments from two arbitrary XML documents.

The Hitaext user interface consists of three different windows: the log window, the tree window and the text window.

Using Hitaext

Starting Hitaext

Start Hitaext by clicking (or perhaps double-clicking) on its program icon. The log window is the first visible window upon startup. It serves to display messages about what is going on inside the program, including error messages if something goes wrong. On MS Windows and Linux, the top of the log window contains the menu bar; on Mac OS X, the menu bar is located at the top of the screen.

The log window
Note: the screenshots show Hitaext under Mac OS X and may look slightly different if you are running Hitaext under MS Windows or Linux.


Opening an alignment file

From the File menu, choose Open. Hitaext presents a file selection dialogue for selecting the alignment file. Go to the subdirectory called data and open the file alignment-sample-of-the-cathedral-and-the-bazaar.xml. If any errors are encountered during reading and processing, they will appear in the log window.

After opening an alignment file, two more windows appear. The tree window shows the hierarchical structure of the XML documents in the form of two tree controls. The tree on the left represents the source document; the tree on the right represents the target document. As both trees are still completely folded, the only nodes visible are the document roots (i.e. <book>).

...
Exploring documents

...

Altering alignments

...

Saving an alignment file

...

Creating your own alignment files

Preliminaries

This section explains how to create your own alignment files. It is assumed that you have a minimal knowledge of XML, basically that you understand how markup tags are used to create a hierarchical document structure of elements. If you are not familiar with this, take a look at one of the many introductions to XML on the web, for instance, this XML Tutorial.

The two XML documents we will use are shown below. The source document, which has the filename sample-on-the-origin-of-species.xml, is a small sample from the 6th edition of On the Origin of Species by Charles Darwin (available for free from Project Gutenberg). The target document, with filename sample-over-het-ontstaan-van-de-soorten.xml, is a corresponding sample from the Dutch translation.

Both documents have a simple, ad-hoc markup with <book> as the document root, which contains - besides <title> and <author> elements - a number of <chapter> elements. Chapters contain <p>'s (paragraphs), or <section>'s containing paragraphs, where paragraphs can in turn contain <s>'s (sentences). Actually, Hitaext does not care at all which elements are used and how the markup structures the text, or whether both documents use the same markup scheme. The only thing Hitaext insists on is that both source and target are well-formed XML documents.
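
Since well-formedness is the only requirement, it can be worth checking a document before opening it in Hitaext. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name is_well_formed is made up for this example and is not part of Hitaext.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


good = "<book><chapter><p><s>One sentence.</s></p></chapter></book>"
# In the second string the <s> element is never closed.
bad = "<book><chapter><p><s>One sentence.</p></chapter></book>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

For a file on disk, ET.parse(filename) raises the same ParseError when the document is not well-formed.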


The source file: sample-on-the-origin-of-species.xml

<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>The Origin of Species by means of Natural Selection; or, the
Preservation of Favoured Races in the Struggle for Life.</title>
  <author>Charles Darwin, M.A., F.R.S.</author>
  <chapter>
    <head>Introduction</head>
    <p>
      <s>When on board H.M.S. Beagle, as naturalist, I was much struck with certain
facts in the distribution of the organic beings inhabiting South America,
and in the geological relations of the present to the past inhabitants of
that continent.</s>
      <s>These facts, as will be seen in the latter chapters of
this volume, seemed to throw some light on the origin of species--that
mystery of mysteries, as it has been called by one of our greatest
philosophers.</s>
      <s>On my return home, it occurred to me, in 1837, that
something might perhaps be made out on this question by patiently
accumulating and reflecting on all sorts of facts which could possibly have
any bearing on it.</s>
      <s>After five years' work I allowed myself to speculate on
the subject, and drew up some short notes; these I enlarged in 1844 into a
sketch of the conclusions, which then seemed to me probable:  from that
period to the present day I have steadily pursued the same object.</s>
      <s>I hope
that I may be excused for entering on these personal details, as I give
them to show that I have not been hasty in coming to a decision.</s>
    </p>
    <p>
      <s>My work is now (1859) nearly finished; but as it will take me many more
years to complete it, and as my health is far from strong, I have been
urged to publish this abstract.</s>
      <s>I have more especially been induced to do
this, as Mr. Wallace, who is now studying the natural history of the Malay
Archipelago, has arrived at almost exactly the same general conclusions
that I have on the origin of species.</s>
      <s>In 1858 he sent me a memoir on this
subject, with a request that I would forward it to Sir Charles Lyell, who
sent it to the Linnean Society, and it is published in the third volume of
the Journal of that Society.</s>
      <s>Sir C. Lyell and Dr. Hooker, who both knew of
my work--the latter having read my sketch of 1844--honoured me by thinking
it advisable to publish, with Mr. Wallace's excellent memoir, some brief
extracts from my manuscripts.</s>
    </p>
    <p>[...another paragraph...]</p>
  </chapter>
  <chapter>
    <head>VARIATION UNDER DOMESTICATION</head>
    <section>
      <head>CAUSES OF VARIABILITY</head>
      <p>
        <s>When we compare the individuals of the same variety or sub-variety of our
older cultivated plants and animals, one of the first points which strikes
us is, that they generally differ more from each other than do the
individuals of any one species or variety in a state of nature.</s>
        <s>And if we
reflect on the vast diversity of the plants and animals which have been
cultivated, and which have varied during all ages under the most different
climates and treatment, we are driven to conclude that this great
variability is due to our domestic productions having been raised under
conditions of life not so uniform as, and somewhat different from, those to
which the parent species had been exposed under nature.</s>
        <s>There is, also,
some probability in the view propounded by Andrew Knight, that this
variability may be partly connected with excess of food.</s>
        <s>It seems clear
that organic beings must be exposed during several generations to new
conditions to cause any great amount of variation; and that, when the
organisation has once begun to vary, it generally continues varying for
many generations.</s>
        <s>No case is on record of a variable organism ceasing to
vary under cultivation.</s>
        <s>Our oldest cultivated plants, such as wheat, still
yield new varieties:  our oldest domesticated animals are still capable of
rapid improvement or modification.</s>
      </p>
      <p>[...another paragraph...]</p>
    </section>
    <section>
[...another section...]
</section>
  </chapter>
  <chapter>
[...another chapter...]
</chapter>
</book>

The target file: sample-over-het-ontstaan-van-de-soorten.xml

<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>Het ontstaan van soorten door natuurlijke selectie ofwel het bewaard 
blijven van rassen die in het voordeel zijn in de strijd om het bestaan</title>
  <author>Charles Darwin</author>
  <translator>Ruud Rook</translator>
  <chapter>
    <head>Inleiding</head>
    <p>
      <s>Toen ik als natuuronderzoeker aan boord van Zijne Majesteits Beagle 
vertoefde , vielen mij bepaalde feiten op omtrent de verspreiding van de 
levende wezens van Zuid-Amerika en de geologische relaties tussen de huidige en 
de vroegere bewoners van genoemd werelddeel .</s>
      <s>Deze feiten schenen , zoals zal blijken uit de laatste hoofdstukken 
van dit boek , enig licht te werpen op het ontstaan van soorten -- het raadsel 
der raadselen , zoals een van onze grootste wijsgeren het noemde .</s>
      <s>Na mijn terugkeer in Engeland kwam ik in 1837 op de gedachte dat het 
raadsel misschien kon worden opgelost door geduldig allerlei gegevens die er 
mogelijk verband mee hielden te vergaren en de revue te laten passeren .</s>
      <s>Na vijf jaar studie stond ik mezelf bespiegelingen over dit onderwerp 
toe en maakte ik enkele korte notities .</s>
      <s>In 1844 werkte ik die uit tot een schets van wat me op dat moment 
waarschijnlijke bevindingen leken .</s>
      <s>Sindsdien heb ik me ononderbroken met dit onderwerp beziggehouden .</s>
      <s>Ik hoop dat deze persoonlijke details mij niet euvel worden geduid , 
want ik vermeld ze slechts om aan te tonen dat ik mijn conclusies niet 
overhaast heb getrokken .</s>
    </p>
    <p>
      <s>Mijn werk is nu ( 1859 ) bijna gereed .</s>
      <s>Maar omdat de voltooiing ervan nog vele jaren zal vergen , en omdat 
mijn gezondheid verre van volmaakt is , werd op publicatie van deze 
samenvatting aangedrongen .</s>
      <s>Ik ben hiertoe in het bijzonder aangezet omdat Mr. Wallace , die thans 
de natuurlijke historie van de Maleise archipel bestudeert , tot vrijwel 
dezelfde slotsom omtrent het ontstaan van soorten is gekomen als ik .</s>
      <s>In 1858 stuurde hij mij over dit onderwerp een opstel , met het 
verzoek het Sir Charles Lyell ter hand te stellen , die het doorstuurde naar de 
Linnean Society .</s>
      <s>Het is gepubliceerd in het derde deel van het Journal van genoemde 
instelling .</s>
      <s>Sir C.Lyell en dr. Hooker , die beiden bekend waren met mijn werk -- 
laatstgenoemde had mijn schets uit 1844 gelezen -- , gaven te kennen dat het 
raadzaam zou zijn om , nu het uitstekende opstel van Mr. Wallace er was , tot 
publicatie over te gaan van korte uittreksels van mijn manuscripten .</s>
    </p>
    <p> [...nog een paragraaf...]</p>
  </chapter>
  <chapter>
    <head>VARIATIE BIJ GEDOMESTICEERDE PLANTEN EN DIEREN</head>
    <section>
      <head>Oorzaken van variabiliteit</head>
      <p>
        <s>Wanneer we de individuen van dezelfde variëteit of subvariëteit 
van onze oude cultuurplanten en huisdieren vergelijken , is een van de eerste 
punten die ons opvallen dat ze doorgaans onderling meer verschillen dan 
individuen van soorten of variëteiten in de vrije natuur .</s>
        <s>En als wij onze gedachten laten gaan over de enorme verscheidenheid 
aan planten en dieren die gecultiveerd zijn , en die reeds van oudsher in alle 
mogelijke klimaten en dankzij de meest uiteenlopende behandelingen variëren , 
dan dringt zich de conclusie op dat deze grote variabiliteit het gevolg is van 
de omstandigheid dat gedomesticeerde dieren en planten ontstaan zijn onder 
leefomstandigheden die minder eenvormig zijn dan en enigszins verschillen van 
die waaraan de oorspronkelijke soorten in de vrije natuur waren blootgesteld 
.</s>
        <s>Ook is er iets te zeggen voor het standpunt van Andrew Knight dat 
deze variabiliteit deels toe te schrijven is aan een overdaad aan voedsel .</s>
        <s>Het lijkt duidelijk dat er pas variatie in enige omvang kan 
optreden , als levende wezens in de loop van verscheidene generaties aan 
nieuwe omstandigheden zijn blootgesteld , en dat als de structuur eenmaal 
begonnen is te variëren , deze meestal in de loop van vele generaties blijft 
variëren .</s>
        <s>Er is geen voorbeeld bekend van een variabel organisme dat in 
gedomesticeerde toestand op is gehouden met variëren .</s>
        <s>Onze oudste cultuurgewassen , zoals tarwe , brengen nog steeds 
nieuwe variëteiten voort .</s>
        <s>Nog steeds kunnen onze oudste gedomesticeerde dieren snel worden 
veredeld of veranderd .</s>
      </p>
      <p> [...nog een paragraaf...]</p>
    </section>
    <section>
[...nog een sectie...]
</section>
  </chapter>
  <chapter>
[...nog een hoofdstuk...]
</chapter>
</book>
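
Before aligning, it can be useful to see which element names a document uses and how often each occurs. The sketch below does this with Python's standard library on a tiny stand-in document; count_elements is a hypothetical helper, not a Hitaext function.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_text):
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())


# A small stand-in for the sample documents shown above.
sample = (
    "<book><title>T</title><chapter><head>H</head>"
    "<p><s>A.</s><s>B.</s></p></chapter>"
    "<chapter><head>H2</head><p><s>C.</s></p></chapter></book>"
)
counts = count_elements(sample)
for name in sorted(counts):
    print(name, counts[name])
```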


Step 1: create

Choose New from the File menu. Select the source and target files, then save the alignment file.

Notice that the text is rendered in the text window as a single long string. We want to tweak the rendering so we can distinguish sentences and paragraphs more easily.

Remark: the following steps will be supported by the GUI in a future version of Hitaext.

Step 2: tweak

Open the alignment file in a text editor. It should look like the example below.


The alignment file: alignment-sample-on-the-origin-of-species.xml

<?xml version="1.0" encoding="utf-8"?>
<hitaext>
  <from>
    <filename>data/sample-on-the-origin-of-species.xml</filename>
    <render>
      <font/>
      <elements pseudo_root="book">
        <book blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <title blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <author blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <chapter blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <head blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <p blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <s blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <section blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
      </elements>
    </render>
  </from>
  <to>
    <filename>data/sample-over-het-ontstaan-van-de-soorten.xml</filename>
    <render>
      <font/>
      <elements pseudo_root="book">
        <book blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <title blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <author blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <translator blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <chapter blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <head blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <p blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <s blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <section blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
      </elements>
    </render>
  </to>
  <alignment/>
</hitaext>

The first <elements> contains all the elements occurring in the source document; the second one all those in the target document. You can tweak the way these elements are rendered in Hitaext by changing their attributes. Attributes are booleans and take the value True or False. Their effects are as follows:

Attribute:   Effect:
blankline    insert a blank line after the element's text in the text window
ignore       the element and all its child elements are ignored in the tree and text windows, as if they did not exist
newline      insert a newline character after the element's text in the text window
skip         the element is skipped in the tree window; its text and child elements are not affected
uniq         an administrative feature that records whether an element is unique in the document; do not change it
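
To make the effect of ignore, newline and blankline more concrete, the sketch below flattens a small document into a single string along the lines described above. It is an illustration only, not Hitaext's actual rendering code: the function render_text, the settings dictionary, the single-space separator and the collapsing of trailing newlines are all assumptions made for this example. The skip attribute is left out because it affects only the tree window.

```python
import xml.etree.ElementTree as ET


def squeeze(text):
    """Collapse runs of whitespace; None or blank text becomes ''."""
    return " ".join((text or "").split())


def render_text(elem, settings):
    """Flatten elem to a string, applying ignore/newline/blankline per tag.

    settings is a hypothetical dict: tag name -> dict of boolean flags.
    """
    flags = settings.get(elem.tag, {})
    if flags.get("ignore"):
        return ""  # the element and all its children are left out
    pieces = [squeeze(elem.text)]
    for child in elem:
        pieces.append(render_text(child, settings))
        pieces.append(squeeze(child.tail))
    text = ""
    for piece in pieces:
        if not piece:
            continue
        if text and not text.endswith("\n"):
            text += " "  # assumed separator between adjacent pieces
        text += piece
    if flags.get("blankline"):
        text = text.rstrip("\n") + "\n\n"  # end with exactly one blank line
    elif flags.get("newline"):
        text = text.rstrip("\n") + "\n"
    return text


root = ET.fromstring(
    "<book><p><s>A.</s><s>B.</s></p><p><s>C.</s></p></book>"
)
plain = render_text(root, {})
tweaked = render_text(root, {"p": {"blankline": True}, "s": {"newline": True}})
print(repr(plain))
print(repr(tweaked))
```

With no flags set, all text ends up on one line; with newline on <s> and blankline on <p>, each sentence gets its own line and paragraphs are separated by a blank line.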

A tweaked version of the same alignment file might look like the example below.


The alignment file: alignment-sample-on-the-origin-of-species.xml

<?xml version="1.0" encoding="utf-8"?>
<hitaext>
  <from>
    <filename>data/sample-on-the-origin-of-species.xml</filename>
    <render>
      <font/>
      <elements pseudo_root="book">
        <book blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <title blankline="True" ignore="False" newline="False" skip="False" uniq="True"/>
        <author blankline="True" ignore="False" newline="False" skip="False" uniq="True"/>
        <chapter blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <head blankline="True" ignore="False" newline="False" skip="False" uniq="False"/>
        <p blankline="True" ignore="False" newline="False" skip="False" uniq="False"/>
        <s blankline="False" ignore="False" newline="True" skip="False" uniq="False"/>
        <section blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
      </elements>
    </render>
  </from>
  <to>
    <filename>data/sample-over-het-ontstaan-van-de-soorten.xml</filename>
    <render>
      <font/>
      <elements pseudo_root="book">
        <book blankline="False" ignore="False" newline="False" skip="False" uniq="True"/>
        <title blankline="True" ignore="False" newline="False" skip="False" uniq="True"/>
        <author blankline="True" ignore="False" newline="False" skip="False" uniq="True"/>
        <translator blankline="True" ignore="False" newline="False" skip="False" uniq="True"/>
        <chapter blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <head blankline="True" ignore="False" newline="False" skip="False" uniq="False"/>
        <p blankline="True" ignore="False" newline="False" skip="False" uniq="False"/>
        <s blankline="False" ignore="False" newline="True" skip="False" uniq="False"/>
        <section blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
      </elements>
    </render>
  </to>
  <alignment/>
</hitaext>

Now save the alignment file.
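
If you prefer not to edit the file by hand, the same changes can be scripted. The sketch below uses only Python's standard xml.etree.ElementTree module and a shortened stand-in for the alignment file; the helper set_render_flag is made up for this example and is not part of Hitaext.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for an alignment file (filenames and most elements omitted).
ALIGNMENT = """<hitaext>
  <from>
    <render>
      <elements pseudo_root="book">
        <p blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <s blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
      </elements>
    </render>
  </from>
  <to>
    <render>
      <elements pseudo_root="book">
        <p blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
        <s blankline="False" ignore="False" newline="False" skip="False" uniq="False"/>
      </elements>
    </render>
  </to>
  <alignment/>
</hitaext>"""


def set_render_flag(root, tag, attribute, value):
    """Set one boolean attribute for `tag` under both <from> and <to>.

    Returns the number of entries that were updated.
    """
    changed = 0
    for side in ("from", "to"):
        for entry in root.findall(f"./{side}/render/elements/{tag}"):
            entry.set(attribute, "True" if value else "False")
            changed += 1
    return changed


root = ET.fromstring(ALIGNMENT)
set_render_flag(root, "p", "blankline", True)
set_render_flag(root, "s", "newline", True)
print(ET.tostring(root, encoding="unicode"))
```

For a real alignment file, load it with ET.parse(filename), apply the changes to the tree's root, and write it back with tree.write(filename, encoding="utf-8", xml_declaration=True). The uniq attribute should be left untouched.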

Step 3: reload

Choose Open from the File menu and reload the alignment file. Check the layout in the text window. If you are not satisfied, edit the alignment file and reopen it.

Remark: all of these steps can be carried out automatically using the Daeso Framework Python library, which provides the basic functionality for Hitaext.

Reference

Menus

File menu

...

Keys

The tree window supports the following key strokes:

Key:          Effect:
Up arrow      Select previous element (skipping collapsed parts of the tree)
Down arrow    Select next element (skipping collapsed parts of the tree)
Right arrow   Expand subtree and select first child element (for non-terminal elements; otherwise same as Down arrow)
Left arrow    Select parent element and collapse subtree
Tab           Change focus between source and target trees
Ctrl + Tab    Select aligned element in other tree; repeat to visit the other aligned elements (for one-to-many alignments)
Spacebar      Toggle alignment between selected elements