Annotating Rules in Legislation

Over the last couple of months, I have had discussions about text mining and annotating rules in legislation with several people (John Sheridan of The Office of Public Sector Information, Richard Goodwin of The Stationery Office, and John Cyriac of Compliance Track). While nothing concrete has yet resulted from these discussions, it is clearly a “hot topic”.
In the course of these discussions, I prepared a short outline of the issues and approaches, which I present below. Comments, suggestions, and collaborations are welcome.
Vision, context, and objectives
One of the main visions of artificial intelligence and law has been to develop a legislative processing tool. Such a tool has several related objectives:

      [1.] To guide the drafter to write well-formed legal rules in natural language.
      [2.] To automatically parse and semantically represent the rules.
      [3.] To automatically identify and annotate the rules so that they can be extracted from a corpus of legislation for web-based applications.
      [4.] To enable inference, modeling, and consistency testing with respect to the rules.
      [5.] To reason with respect to domain knowledge (an ontology).
      [6.] To serve the rules on the web so that users can use natural language to input information and receive determinations.

While no such tool exists, there has been steady progress on understanding the problems and developing working software solutions. In early work (see The British nationality act as a logic program (1986)), an act was manually translated into a program, allowing one to draw inferences given ground facts. Haley is a software and service company which provides a framework which partially addresses 1, 2, 4, and 6 (see Policy Automation). Some research addresses aspects of 3 (see LKIF-Core Ontology). Finally, there are XML annotation schemas for legislation (and related input support) such as The Crown XML Schema for Legislation and Akoma Ntoso, both of which require manual input. Despite these advances, there is much progress yet to be made. In particular, no results fulfill [3.].
In consideration of [3.], the primary objective of this proposal is to use the General Architecture for Text Engineering (GATE) framework to automatically identify and annotate legislative rules from a corpus. The annotation should support web-based applications and be consistent with semantic web markup languages for rules, e.g. RuleML. A subsidiary objective is to define an authoring template which can be used within existing authoring applications to manually annotate legislative rules.
Attaining these objectives would:

  • Support automated creation, maintenance, and distribution of rule books for compliance.
  • Contribute to the development of a legislative processing tool.
  • Make legislative rules accessible for web-based applications. For example, given other annotations, one could identify rules that apply with respect to particular individuals in an organisation along with relevant dates, locations, etc.
  • Enable further processing of the rules such as removing formatting, parsing the content of the rules, and representing them semantically.
  • Allow an inference engine to be applied over the formalised rule base.
  • Make legislation more transparent and communicable among interested parties such as government departments, EU governments, and citizenry.

To attain the objectives, we propose the following phases, where the numbers represent weeks of effort:

  • Create a relatively small sample corpus to scope the study.
  • Manually identify the forms of legislative rules within the corpus.
  • Develop or adapt an annotation scheme for rules.
  • Apply the analysis tools of GATE and annotate the rules.
  • Validate that GATE annotates the rules as intended.
  • Apply the annotation system to a larger corpus of documents.

For each phase, we would produce a summary of results, noting where difficulties were encountered and how they might be addressed.
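To give a feel for what "identifying rules" might mean at the sample-corpus stage, here is a minimal sketch in plain Python. It is not GATE code and not the method proposed above; it assumes, purely for illustration, that a sentence containing a deontic cue such as "shall" or "must" is a candidate rule. The cue list, the sample text, and the function names are all hypothetical.

```python
import re

# Hypothetical cue list: modal and deontic phrases that often signal a legal rule.
RULE_CUES = re.compile(
    r"\b(shall|must|may not|is required to|is prohibited from)\b",
    re.IGNORECASE,
)

def split_sentences(text):
    """Very rough sentence splitter: break after '.' or ';' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.;])\s+", text) if s.strip()]

def annotate_candidate_rules(text):
    """Return (sentence, cue) pairs for sentences that contain a rule cue."""
    results = []
    for sentence in split_sentences(text):
        match = RULE_CUES.search(sentence)
        if match:
            results.append((sentence, match.group(1).lower()))
    return results

# Invented sample text, not taken from any real act.
sample = (
    "This section applies to registered companies. "
    "A director must file the annual return within 28 days. "
    "The registrar shall publish a notice of the filing."
)
candidates = annotate_candidate_rules(sample)
for sentence, cue in candidates:
    print(f"[{cue}] {sentence}")
```

A heuristic like this would over- and under-generate on real legislation, which is exactly what the manual identification and validation phases are meant to expose.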
Extending the work
The work can be extended in a variety of ways:

  • Apply the GATE rules to a larger corpus with more variety of rule forms.
  • Process the rules for semantic representation and inference.
  • Take into consideration defeasibility and exceptions.
  • Develop semantic web applications for the rules.

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Instructions for GATE's Onto Root Gazetteer

In this post, I present User Manual notes for GATE’s Onto Root Gazetteer (ORG) and references to ORG. In Discussion of GATE’s Onto Root Gazetteer, I discuss aspects of Onto Root Gazetteer which I found interesting or problematic. These notes and discussion may be of use to those researchers in legal informatics who are interested in text mining and annotation for the semantic web.
Thanks to Diana Maynard, Danica Damljanovic, Phil Gooch, and the GATE User Manual for comments and materials which I have liberally used. Errors rest with me (and please tell me where they are so I can fix them!).
Onto Root Gazetteer links text to an ontology by creating Lookup annotations which come from the ontology rather than a default gazetteer. The ontology is preprocessed to produce a flexible, dynamic gazetteer; that is, it is a gazetteer which takes into account alternative morphological forms and can be added to. An important advantage is that text can be annotated as an individual of the ontology, thus facilitating the population of the ontology.
Besides being flexible and dynamic, ORG has some advantages over other gazetteers:

  • It is more richly structured (see it as a gazetteer containing other gazetteers)
  • It allows one to relate textual and ontological information by adding instances.
  • It gives one richer annotations that can be used for further processes.

In the following, we present the step by step instructions for ‘rolling your own’, then show the results of the ‘prepackaged’ example that comes with the plugin.
Step 1. Add (if not already used) the Onto Root Gazetteer plugin to GATE following the usual plugin instructions.
Step 2. Add (if not already used) the Ontology Tools (OWLIM Ontology LR, OntoGazetteer, GATE Ontology Editor, OAT) plugin. ORG uses ontologies, so one must have these tools to load them as language resources.
Step 3. Create (or load) an ontology with OWLIM (see the instructions on the ontologies). This is the ontology that is the language resource that is then used by Onto Root Gazetteer. Suppose this ontology is called myOntology. It is important to note that OWLIM can only use OWL-Lite ontologies (see the documentation about this). Also, I succeeded in loading an ontology only from the resources folder of the Ontology_Tools plugin (rather than from another drive); I don’t know if this is significant.
Step 4. In GATE, create processing resources with default parameters:

  • Document Reset PR
  • RegEx Sentence Splitter (or ANNIE Sentence Splitter, but that one is likely to run slower)
  • ANNIE English Tokeniser
  • ANNIE POS Tagger
  • GATE Morphological Analyser

Step 5. When all these PRs are loaded, create an Onto Root Gazetteer PR and set the initial parameters. The mandatory ones are as follows (though some are set by default):

  • Ontology: select previously created myOntology
  • Tokeniser: select previously created Tokeniser
  • POSTagger: select previously created POS Tagger
  • Morpher: select previously created Morpher.

Step 6. Create another PR, a Flexible Gazetteer. In the initial parameters, it is mandatory to select the previously created OntoRootGazetteer for gazetteerInst. For another parameter, inputFeatureNames, click on the button on the right and, when prompted with a window, add ‘Token.root’ in the provided text box, then click the Add button. Click OK, give a name to the new PR (optional), and then click OK.
Step 7. To create an application, right click on Applications, New –> Pipeline (or Corpus Pipeline). Add the following PRs to the application in this order:

  • Document Reset PR
  • RegEx Sentence Splitter
  • ANNIE English Tokeniser
  • ANNIE POS Tagger
  • GATE Morphological Analyser
  • Flexible Gazetteer

Step 8. Run the application over the selected corpus.
Step 9. Inspect the results. Look at the Annotation Set with Lookup and also the Annotation List to see how the annotations appear.
Small Example
The ORG plugin comes with a demo application which sets up not only all the PRs and LRs (the text, corpus, and ontology), but also the application, ready to run. This is the file exampleApp.xgapp, which is in the resource folder of the plugin (Ontology_Based_Gazetteer). To use it, start GATE with a clean slate (no other PRs, LRs, or applications), right click on Applications, choose Restore application from file, and load the file from the folder just given.
The ontology which is used for an illustration is for GATE itself, giving the classes, subclasses, and instances of the system. While the ontology is loaded along with the application, one can find it here. The text is simple (and comes with the application): language resources and parameters.
FIGURE 1 (missing at the moment)
FIGURE 2 (missing at the moment)
One can see that the token “language resources” is annotated with respect to the class LanguageResource, “resources” is annotated with GATEResource, and “parameters” is annotated with ResourceParameter. We discuss this further below.
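The result just described can be imitated in plain Python to show why the pipeline order matters: the lookup runs on root forms, so the roots must be computed first. This is a sketch only, not GATE's API or ORG's actual matching logic. The tokeniser, the plural-stripping "morphological analyser", the exact-match lookup, and the entry table are all hypothetical stand-ins; only the three class names come from the demo.

```python
import re

def tokenise(text):
    """Stand-in for a tokeniser: one dict per word, with character offsets."""
    return [{"string": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(r"[A-Za-z]+", text)]

def add_roots(tokens):
    """Stand-in for a morphological analyser: lower-case and strip a plural 's'."""
    for tok in tokens:
        word = tok["string"].lower()
        tok["root"] = word[:-1] if word.endswith("s") and len(word) > 3 else word
    return tokens

def lookup_on_roots(tokens, entries):
    """Stand-in for a gazetteer that matches on the root feature rather than
    the surface string. `entries` maps a tuple of root forms to a label."""
    lookups = []
    for i in range(len(tokens)):
        for entry, label in entries.items():
            window = tokens[i:i + len(entry)]
            if tuple(t["root"] for t in window) == entry:
                lookups.append((window[0]["start"], window[-1]["end"], label))
    return lookups

# Hypothetical entries, labelled with the class names seen in the demo.
ENTRIES = {
    ("language", "resource"): "LanguageResource",
    ("resource",): "GATEResource",
    ("parameter",): "ResourceParameter",
}

text = "language resources and parameters"
tokens = add_roots(tokenise(text))   # roots must exist before the lookup runs
lookups = lookup_on_roots(tokens, ENTRIES)
for start, end, label in lookups:
    print(text[start:end], "->", label)
```

If the lookup ran on the surface strings instead, the plural "resources" would not match the singular entry "resource", which is the point of feeding Token.root to the Flexible Gazetteer.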
One further aspect is important and useful. Since the ontology tools have been loaded and a particular ontology has been used, one can not only see the ontology (open the OAT tab in the window with the text), but one can annotate the text with respect to the ontology — highlight some text and a popup menu allows one to select how to annotate the text. With this, one can add instances (or classes) to the ontology.
One can consult the following for further information about how the gazetteer is made, among other topics:

See the related post Discussion of GATE’s Onto Root Gazetteer.

Discussion of GATE's Onto Root Gazetteer

In Instructions for GATE’s Onto Root Gazetteer, I give instructions for setting up Onto Root Gazetteer. In this post, I discuss aspects of the Onto Root Gazetteer that I found interesting or problematic.
For me, the documentation was not helpful as too much technical information was provided (e.g. preprocessing the ontology) rather than the steps just to get it to run. Also, no walk through example was clearly illustrated. I would still like (and will provide in the near future) a richer text (a nice paragraph) and a simpler ontology (couple of classes, subclasses, object and data properties, and individuals) to illustrate just what is done fully.
Though I have it running, there are several questions (and partial answers or musings):

  • What is the annotation relative to the ontology good for?
  • What is the difference between gazetteers derived from ontologies and default gazetteers?
  • What are the selection criteria for annotating the tokens?
  • What is the relationship between the annotated text and the ontology?

Concerning the first point, presumably more annotations allow more processing capabilities. A (simple) example would be very helpful.
Concerning the second point, matters are more complex (to my mind). First, default gazetteers (or flexible gazetteers for that matter) are flat lists (a list containing no sublists as parts) where the items in the list are annotated as per the properties of the list; for example, if we have a gazetteer for Organisation (call this the header of the list) which lists IBM, BBC, Hackney Council (call these the items of the list), then every token of IBM, BBC, and Hackney Council found in the corpus will be annotated Organisation. If there is a token organisation in the corpus, it will not be annotated with Organisation; similarly, no token of IBM in the corpus is annotated IBM. The list categorises, in effect, IBM, BBC, and Hackney Council as of the type Organisation.
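The behaviour of a flat gazetteer, as described in this paragraph, can be sketched in a few lines of plain Python. This is an illustration of the idea only, not GATE's implementation; the function name and the sample sentence are made up.

```python
import re

# A flat gazetteer: one header (the annotation type) and a plain list of items.
ORGANISATION_ITEMS = ["IBM", "BBC", "Hackney Council"]

def annotate_flat(text, header, items):
    """Annotate every occurrence of an item with the header of its list."""
    annotations = []
    for item in items:
        for m in re.finditer(r"\b" + re.escape(item) + r"\b", text):
            annotations.append((m.start(), m.end(), m.group(), header))
    return sorted(annotations)

# Invented sample sentence.
text = "IBM and the BBC are each an organisation, as is Hackney Council."
flat = annotate_flat(text, "Organisation", ORGANISATION_ITEMS)
for start, end, string, label in flat:
    print(f"{string!r} [{start}:{end}] -> {label}")
```

Note that the word "organisation" in the sample receives no annotation, and no token is labelled with its own string: every match simply inherits the header of the list.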
ORG works differently (I believe, but may be wrong), but these points are not made in the documentation. First, a gazetteer which is derived from an ontology preserves the subsumption hierarchy of the ontology, giving us a list of lists. Such a gazetteer is a taxonomy of terminology, which is not the same as an ontology (though the two are frequently mistaken to be identical). Second, if a token in the text is found to (flexibly) match an item in the gazetteer, then the token is annotated with that item, meaning that if the string IBM is a token in our text and an item in the gazetteer, then the token is annotated IBM. In these respects, ORG works differently from other gazetteers.
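The reading given in this paragraph, which is flagged as uncertain, can be made concrete with a toy sketch in plain Python. It is not ORG's implementation. The taxonomy, the individuals, and the function names are hypothetical, and the matching is a simple case-insensitive comparison rather than ORG's flexible matching.

```python
# Hypothetical toy taxonomy: each class maps to its parent (None for the top).
PARENT = {
    "Organisation": None,
    "Company": "Organisation",
    "Broadcaster": "Organisation",
}
# Hypothetical individuals and the class each belongs to.
INSTANCE_OF = {"IBM": "Company", "BBC": "Broadcaster"}

def ancestors(cls):
    """Walk up the subsumption hierarchy from a class to the top."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain

def annotate_from_taxonomy(tokens):
    """Annotate a token with the item it matches, plus that item's place
    in the hierarchy (the 'list of lists')."""
    classes_by_lower = {c.lower(): c for c in PARENT}
    out = []
    for tok in tokens:
        if tok in INSTANCE_OF:
            out.append({"token": tok, "item": tok, "type": "instance",
                        "classes": ancestors(INSTANCE_OF[tok])})
        elif tok.lower() in classes_by_lower:
            cls = classes_by_lower[tok.lower()]
            out.append({"token": tok, "item": cls, "type": "class",
                        "classes": ancestors(cls)})
    return out

result = annotate_from_taxonomy("IBM is an organisation".split())
for ann in result:
    print(ann)
```

Under this reading, the token IBM is annotated with the item IBM and the token "organisation" with the class Organisation, which is the contrast with a flat list that the paragraph draws.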
The third question might be addressed in the richer documentation concerning ORG. It relates to observations concerning the results of the example application. Consider the following. The token “language resources” has the annotation:
URI=, heuristic_level=0, majorType=, propertyURI=, type=class
The token “resources” has the annotation:
URI=, heuristic_level=0, majorType=, propertyURI=, type=class
And the token “parameters” has annotation:
URI=, heuristic_level=0, majorType=, propertyURI=, type=class
We see that the tokens in the text are annotated in relation to the ontology. Yet it is not clear why the token “resources” is not annotated with LanguageResource or ResourceParameter since these are components of the ORG as well. Likely there is some prioritising among the annotations that we need to learn.
Finally, concerning the last question, matters are somewhat unclear (to me), largely because the lines between annotations, gazetteers, and ontologies are blurred; for me, the key unclarity centres on annotations in the text that match items in the gazetteer. Consider the issue from a different point of view. ORG was developed in the context of a project to support ontology development from text: find terms and relations which are candidates for the ontology, then (if one wants) use the terms and relations to build the ontology. For example, if one sees lots of occurrences of “organisation” in the text, then perhaps it would be introduced as a concept in the ontology. We have a many-one relation from the tokens to the ontology. This makes sense. See it another way, where we have a default gazetteer in which every given token (e.g. IBM) in a text has the same annotation, giving the impression of a one-many relation. This also makes sense. Neither of these seems problematic to me, largely because I don’t really know much or presume much about the meaning of the annotation on the token: from the text, I abstract the concept; from the gazetteer, I label tokens as belonging to the same annotation class. In no case is a token “organisation” annotated with Organisation; even if it were, I couldn’t really object unless I said more about what I think the annotation means.
Contrast these points with what goes on with ORG (admittedly, this gets pretty philosophical, and in terms of day to day practice, it may not be relevant). First, it seems that one instance in the ontology is associated with multiple tokens in the text. Second, an instance or class in the ontology can be associated with a token that is intended to have some similar meaning; e.g. the individual IBM in the ontology is associated by annotation with every token of IBM in the text, and similarly for the classes. Neither of these makes sense to me in terms of what ontologies are intended to represent, which is a state of knowledge (the fixed concepts, object and data properties, and individuals) about a domain. On the first point, how can I be assured that the intended meaning of tokens is the same throughout the corpus? In one document, we might find IBM as the name of a non-existent company, in another for an existing company, and in another for a company that has gone bankrupt. Simply put, the string might remain the same, but the knowledge we have about it may vary. Ontologies (as they are currently represented) do not allow such dynamic interpretation. To ignore this point risks having annotations (and whatever might flow from the annotations) slip; for example, it would be wrong to find a relationship between IBM and owners where the company doesn’t exist. On the second point, conceptually it makes no sense to say that a token “organisation” is itself associated with the concept or instance ‘organisation’ in the ontology. Of course, in developing the ontology, going from the text to the ontology makes good sense since one is abstracting from the text to the ontology. Yet, in that move, one makes something different: a concept over all the “ideas” drawn from the tokens.
So, I disagree emphatically with Peters and Maynard (from the NeON article): “Texts are annotated with ontology classes, and the textual elements function as instances of these classes.” The textual element “organisation” or “IBM” is an instance of the concept organisation or the individual IBM? I think this is a category mistake.
In general, I find the relationship between the text, intermediate representations (gazetteers), and ontologies (higher level representations of knowledge) rather interesting, but somewhat murky. As I said earlier, perhaps this is just philosophy. Depending on the domain of discussion, the corpus, and the way the annotations and ontologies are used, perhaps my intuition of lurking trouble will not be realised…. Equally, there is likely something simple that I’m missing. If so, please enlighten me.