What contribution can digital textual collections make to research in Ancient History?
As Arnaldo Momigliano (1980: 14) wrote:

the domain of work for a historian is determined by the existence of documents and information about the past, which must be combined and interpreted to understand what happened. His problems are determined by the relation between what the sources are and what he wants to learn.

Our daily experience as users of the World Wide Web may suggest two features of digital collections that can help this work: the size that digital collections can now reach, and the speed and precision with which query languages retrieve relevant content. This last feature, however, is still far from being fully realized in practice. Searching for sources of historical phenomena requires us to delve into the content of each document at a far deeper level than typical Web search engines can support. To quote Bruce Robertson (2009: 3) on the point:

[t]hough we can summon up an exhaustive list of Web resources that contain the words “Gallipoli” and “sources”, today’s Web cannot effectively respond to a basic historical question such as, “which sources attest the Gallipoli Campaign of World War I?” much less a more advanced one such as, “what evidence is there for major architectural projects undertaken in the U.K. during the period of the Boer War, and does anyone think that these projects are influenced by the conflict?”

The CIDOC Conceptual Reference Model (CIDOC-CRM) is a popular answer to the problem of connecting cultural heritage information, and it has also gained the status of a standard. As you can hear directly from one of its creators (S. Stead, “The CIDOC CRM”), the CIDOC-CRM is based on the idea that objects (be they texts, artefacts, pictures or others) are understood and digitally represented as parts of events that take place in space and time.
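To make the event-centred idea more concrete, here is a minimal sketch in Python. It is not the CIDOC-CRM itself: the class, the field names and the helper function are my own illustrative assumptions, not CRM class or property identifiers, and the sample record is only there to show the shape of the data.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A toy event record: things are described through the events they take part in.

    All field names are hypothetical and chosen for readability only.
    """
    label: str
    place: str
    time_span: str
    participants: list = field(default_factory=list)
    sources: list = field(default_factory=list)  # texts or objects attesting the event


# Illustrative sample data, not an excerpt from any real database.
events = [
    Event(
        label="Rebuilding of the Athenian walls",
        place="Athens",
        time_span="479-478 BCE",
        participants=["Athenians", "Themistocles"],
        sources=["Thucydides, Histories 1.89-93"],
    ),
]


def sources_attesting(all_events, label):
    """Return the sorted, de-duplicated sources attached to events with this label."""
    return sorted({s for e in all_events if e.label == label for s in e.sources})


print(sources_attesting(events, "Rebuilding of the Athenian walls"))
```

Once texts and artefacts hang off a shared event record in this way, a question of the form "which sources attest event X?" becomes a simple lookup rather than a keyword search.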

Thanks to Natural Language Processing (NLP) technologies, linguists can extract the events that are mentioned or narrated, as well as the actors and circumstances that take part in them, directly from the texts. A number of projects are indeed applying these empirical methods in the humanities (e.g. Reiter et al. 2010; Cybulska and Vossen 2011). Yet the adoption of data-driven approaches to event extraction requires sophisticated NLP tools that, although available for well-resourced languages such as English, are still largely absent for Ancient Greek. Those NLP tools should be able to perform linguistic analysis at many different levels, including:

  • morphology (part-of-speech tagging and lemmatization);
  • syntax;
  • semantics (thematic roles, argument structure, word-sense disambiguation);
  • discourse (co-reference and ellipsis resolution).

In terms of both software and annotated corpora, Ancient Greek is poorly resourced, but the situation is not quite desperate. For morphology and syntax, the Ancient Greek Dependency Treebank (AGDT) provides both a model and a solid framework for manual annotation (Bamman et al. 2009). In current linguistics, a treebank is a corpus of sentences with word-by-word annotation of morphology and syntactic structure.
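As a rough illustration of what "word-by-word annotation" means in a dependency treebank, the following Python sketch stores one short sentence as a list of tokens, each pointing to its syntactic head. This is a simplified stand-in and not the actual AGDT file format: the `Token` class, the coarse part-of-speech values and the `dependents` helper are assumptions made for the example, and the sentence is a constructed one ("The Athenians built a wall").

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int        # 1-based position in the sentence
    form: str      # the word as it appears in the text
    lemma: str     # dictionary form
    pos: str       # coarse part of speech (simplified for this sketch)
    head: int      # id of the governing token; 0 marks the root
    relation: str  # syntactic function with respect to the head


# A constructed example sentence, annotated by hand for illustration.
sentence = [
    Token(1, "Ἀθηναῖοι", "Ἀθηναῖος", "noun", 3, "SBJ"),
    Token(2, "τεῖχος", "τεῖχος", "noun", 3, "OBJ"),
    Token(3, "ᾠκοδόμησαν", "οἰκοδομέω", "verb", 0, "PRED"),
]


def dependents(tokens, head_id, relation=None):
    """Return the tokens governed by head_id, optionally filtered by relation label."""
    return [
        t for t in tokens
        if t.head == head_id and (relation is None or t.relation == relation)
    ]


# Find the root of the tree, then its subject and object.
root = next(t for t in sentence if t.head == 0)
subjects = dependents(sentence, root.id, "SBJ")
objects = dependents(sentence, root.id, "OBJ")

print(root.lemma, [t.form for t in subjects], [t.form for t in objects])
```

Even this small structure already answers "who did what to what" at the syntactic level, which is the starting point for the semantic layers discussed in this post.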

A sentence annotated according to AGDT standards

The AGDT is the first comprehensive treebank of Ancient Greek literary texts of the Archaic and Classical eras. In its current release it includes the unabridged text of the Iliad and the Odyssey, the complete works of Hesiod and Aeschylus, as well as five tragedies of Sophocles (Ajax, Oedipus Rex, Electra, Antigone, Trachiniae) and the Euthyphro of Plato. The crucial task for those who want to build an infrastructure for event extraction will be to supplement the morpho-syntactic analysis of the AGDT with the semantic information that this work requires.

The Hellespont Project has chosen to explore this possibility. Hellespont is a case study that focuses on the history of Athens in the years 479-431 BCE, as narrated in the text of Thucydides’ Histories (1.89-118); it aims to create a dynamic environment for research where two of the largest online collections, the Perseus Digital Library and the Arachne archaeological database, will be integrated via the CIDOC-CRM (Romanello and Thomas 2012).

In the following posts, I will explain in detail how a layer of semantic and pragmatic information (such as thematic roles or co-reference resolution) is added on top of the available treebank.

I will also try to show that a fine-grained linguistic analysis is useful for more than digital representation. A text annotated with syntactic and semantic information can support a wide range of linguistic and literary studies that help us explore Thucydides’ work in many different ways.


