In this post, I present User Manual notes for GATE’s Onto Root Gazetteer (ORG) and references to ORG. In Discussion of GATE’s Onto Root Gazetteer, I discuss aspects of Onto Root Gazetteer which I found interesting or problematic. These notes and discussion may be of use to those researchers in legal informatics who are interested in text mining and annotation for the semantic web.
Thanks to Diana Maynard, Danica Damljanovic, Phil Gooch, and the GATE User Manual for comments and materials which I have liberally used. Errors rest with me (and please tell me where they are so I can fix them!).
Purpose
Onto Root Gazetteer links text to an ontology by creating Lookup annotations which come from the ontology rather than a default gazetteer. The ontology is preprocessed to produce a flexible, dynamic gazetteer; that is, it is a gazetteer which takes into account alternative morphological forms and can be added to. An important advantage is that text can be annotated as an individual of the ontology, thus facilitating the population of the ontology.
Besides being flexible and dynamic, some advantages of ORG over other gazetteers:
- It is more richly structured (see it as a gazetteer containing other gazetteers)
- It allows one to relate textual and ontological information by adding instances.
- It gives one richer annotations that can be used for further processes.
In the following, we present the step by step instructions for ‘rolling your own’, then show the results of the ‘prepackaged’ example that comes with the plugin.
Setup
Step 1. Add (if not already used) the Onto Root Gazetteer plugin to GATE following the usual plugin instructions.
Step 2. Add (if not already used) the Ontology Tools (OWLIM Ontology LR, OntoGazetteer, GATE Ontology Editor, OAT) plugin. ORG uses ontologies, so one must have these tools to load them as language resources.
Step 3. Create (or load) an ontology with OWLIM (see the instructions on the ontologies). This is the ontology that is the language resource that is then used by Onto Root Gazetteer. Suppose this ontology is called myOntology. It is important to note that OWLIM can only use OWL-Lite ontologies (see the documentation about this). Also, I succeeded in loading an ontology only from the resources folder of the Ontology_Tools plugin (rather than from another drive); I don’t know if this is significant.
Step 4. In GATE, create processing resources with default parameters:
- Document Reset PR
- RegEx Sentence Splitter (or ANNIE Sentence Splitter, but that one is likely to run slower
- ANNIE English Tokeniser
- ANNIE POS Tagger
- GATE Morphological Analyser
Step 5. When all these PRs are loaded, create a Onto Root Gazetteer PR and set the initial parameters as follows. Mandatory ones are as follows (though some are set as defaults):
- Ontology: select previously created myOntology
- Tokeniser: select previously created Tokeniser
- POSTagger: select previously created POS Tagger
- Morpher: select previously created Morpher.
Step 6. Create another PR which is a Flexible Gazetteer. At the initial parameters, it is mandatory to select previously created OntoRootGazetteer for gazetteerInst. For another parameter, inputFeatureNames, click on the button on the right and when prompt with a window, add ‘Token.root’ in the provided text box, then click Add button. Click OK, give name to the new PR (optional) and then click OK.
Step 7. To create an application, right click on Application, New –> Pipeline (or Corpus Pipeline). Add the following PRS to the application in this order:
- Document Reset PR
- RegEx Sentence Splitter
- ANNIE English Tokeniser
- ANNIE POS Tagger
- GATE Morphological Analyser
- Flexible Gazetteer
Step 8. Run the application over the selected corpus.
Step 9. Inspect the results. Look at the Annotation Set with Lookup and also the Annotation List to see how the annotations appear.
Small Example
The ORG plugin comes with a demo application which not only sets up all the PRs and LRs (the text, corpus, and ontology), but also the application ready to run. This is the file exampleApp.xgapp, which is in resource folder of the plugin (Ontology_Based_Gazetteer). To start this, start GATE with a clean slate (no other PRs, LRs, or applications), then Applications, then right click to Restore application from file, then load the file from the folder just given.
The ontology which is used for an illustration is for GATE itself, giving the classes, subclasses, and instances of the system. While the ontology is loaded along with the application, one can find it here. The text is simple (and comes with the application): language resources and parameters.
FIGURE 1 (missing at the moment)
FIGURE 2 (missing at the moment)
One can see that the token “language resources” is annotated with respect to the class LanguageResource, “resources” is annotated with GATEResource, and “parameters” is annotated with ResourceParameter. We discuss this further below.
One further aspect is important and useful. Since the ontology tools have been loaded and a particular ontology has been used, one can not only see the ontology (open the OAT tab in the window with the text), but one can annotate the text with respect to the ontology — highlight some text and a popup menu allows one to select how to annotate the text. With this, one can add instances (or classes) to the ontology.
Documentation
One can consult the following for further information about how the gazetteer is made, among other topics:
- GATE User Manual on Onto Root Gazetteer.
- See section 3.1. of this paper.
- See section 4 of TAO Deliverables.
- See Peters and Maynard in A GATE-way into NeON.
Discussion
See the related post Discussion of GATE’s Onto Root Gazetteer.
By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0
Helpful instructions Adam, thanks for that.
Probably worth mentioning that the OntoRoot Gazetteer only really works for ‘toy’ ontologies, and certainly not for anything more than a few MB in size. I’ve tried it with the Foundational Model of Anatomy ontology and it gives up after about 2 hours!
Cheers
Phil