In Information Extraction of Legal Case Factors, we presented lists and rules for annotation of legal case factors. In this post, we go one step further and use the ANNotations In Context (ANNIC) tool of GATE. This is a plugin that helps to search for annotations, visualise them, and inspect their features, and it is particularly useful for JAPE rule development. We outline how to load and run ANNIC. (See introductory notes on this and related posts.)
Introduction to ANNIC
ANNIC is an annotation indexing and retrieval system. It is integrated with the data stores, where results of annotations on a corpus can be saved. Once a processing pipeline is run over the corpus, we can use ANNIC to query and inspect the contexts where annotations appear; the queries are in a subset of the JAPE language, so can be complex. The results of the queries are presented graphically, making them easy to understand. As such, ANNIC is a very useful tool in the development of rules as one can discover and test patterns in corpora. There is also an export facility, so the results can be presented in a file, but this is not a full information extraction system such as one might want with templates.
For later, but important to know from the documentation: “Be warned that only the annotation sets, types and features initially indexed will be updated when adding/removing documents to the datastore. This means, for example, that if you add a new annotation type in one of the indexed document, it will not appear in the results when searching for it.” This implies that where one adds new annotations to the pipeline, one should delete the old data store and create a new one with respect to the new results. For example, if one ran the pipeline without POS tagging, one cannot add POS tagging later and inspect those annotations in the existing data store.
Further details on ANNIC are available in the GATE documentation on ANNIC, and there is an online video.
Instantiating the serial data store
The following steps are used to create the requisite parts and inspect them with ANNIC. One starts with an empty GATE, then adds processing resources, language resources, and pipelines, since these can all be related to the data store in a later step. This material is adapted or adopted from the GATE ANNIC documentation, cutting out many of the options. The first step is to instantiate a searchable serial data store (SSD), which is how the annotated documents are saved and searched. The application, lists, and rules that this example uses are from Information Extraction of Legal Case Factors.
- Right click on Datastores > Create datastore.
- From the drop-down list select “Lucene Based Searchable DataStore”.
- At the input window, provide the following parameters:
- DataStore URL: Select an empty folder where the data store is created.
- Index Location: Select an empty folder. This is where the index will be created.
- Annotation Sets: Provide the annotation sets that you wish to include or exclude from being indexed. There are options here, but we want to index all the annotation sets in all the documents, so make this list empty.
- Base-Token Type: These are the basic tokens of any document (e.g. Token), which your documents must have in order to be indexed.
- Index Unit Type: This specifies the unit of index (e.g. Sentence). In other words, only annotations lying within the boundaries of these unit annotations are indexed (e.g. in the case of Sentence, no annotations that span the boundary between two sentences are considered for indexing). We use the Sentence unit.
- Features: Users can specify the annotation types and features that should be included or excluded from being indexed (e.g. exclude SpaceToken, Split, or Person.matches).
- Click OK. If all parameters are OK, a new empty searchable SSD will be created.
- Create an empty corpus and save it to the SSD.
- Populate the corpus with some documents. Each document in the corpus is automatically indexed and saved to the data store.
- Load some processing resources and then a pipeline. Run the pipeline over the corpus.
- Once the pipeline has finished (and there are no errors), save the corpus in the SSD by right clicking on the corpus, then “Save to its datastore”.
- Double click on the SSD file under Datastores. Click on the “Lucene DataStore Searcher” tab to activate the search GUI.
- Now you are ready to specify a search query of your annotated documents in the SSD.
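For example, assuming the pipeline from Information Extraction of Legal Case Factors has produced Token and Sentence annotations, a minimal first query to confirm that the data store is searchable might be:

```
{Token.string == "secret"}
```

This should return every indexed sentence in which the string “secret” occurs, together with its left and right contexts.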
Output
The GUI opens with parts as shown in the following two figures:


Working with the GUI
The figures above show three main sections. In the top left section, there is a blank text area in which one can write a query (more on this below); the search query returns the “content” of the annotations. There are options to select a corpus, an annotation set, the number of results, and the size of the context (e.g. the number of tokens to the left and right of what one searches for). In the central section, one sees a visualisation of the annotations and their values given the search query. An annotation rows manager lets one add (green plus sign) or remove (red minus sign) annotation types and features to display in this section. The bottom section contains the results table of the query, i.e. the matches across the corpus with their left and right contexts, along with tabbed panes of statistics, such as how many instances of a particular annotation appear.
Queries
The queries written in the blank text area are a subset of the JAPE patterns and use the annotations used in the pipeline. Queries are activated by hitting ENTER (or the Search icon). Below we give a few template patterns which can be used as SSD queries.
- String
- {AnnotationType}
- {AnnotationType == String}
- {AnnotationType.feature == featureValue}
- {AnnotationType1, AnnotationType2.feature == featureValue}
- {AnnotationType1.feature == featureValue, AnnotationType2.feature == featureValue}
Specific queries are:
- Trandes — returns all occurrences of the string “Trandes” in the corpus.
- {Person} — returns annotations of type Person.
- {Token.string == "Microsoft"} — returns all occurrences of “Microsoft”.
- {Person}({Token})*2{Organization} — returns Person followed by zero or up to two tokens followed by Organization.
- {Token.orth == "upperInitial", Organization} — returns Token with feature orth with value set to “upperInitial” and which is also annotated as Organization.
- {Token.string == "Trandes"}({Token})*10{Secret} — returns the string “Trandes” followed by zero to ten tokens followed by Secret.
- {Token.string == "not"}({Token})*4{Secret} — returns the string “not”, followed by four or fewer tokens, followed by something annotated with Secret.
An example of a result for the last query is:
Trandes averred nothing more than that it possessed secret.
In ANNIC, the result of the query appears as:

One can write queries using the JAPE operators: | (OR operator), +, and *. ({A})+n means one and up to n occurrences of annotation {A}, and ({A})*n means zero or up to n occurrences of annotation {A}.
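For instance, the OR operator allows alternatives to be searched in a single query. A sketch, assuming the Secret annotation type from the earlier post (the alternative “never” is a hypothetical addition for illustration):

```
({Token.string == "not"} | {Token.string == "never"})({Token})*4{Secret}
```

This returns matches where either “not” or “never” is followed, within four tokens, by something annotated as Secret.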
Summary
ANNIC is particularly useful in writing and refining one’s JAPE rules. One’s results can also be exported as HTML files.
By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0