Text Mining Legal Resources with GATE — Study 1

This page reports the results of a first study of applying GATE to a legal resource. The focus of this study was to annotate a list of cases.
I used a web page from BAILII which contains a list of cases with the following information:

  • The first party in a case, e.g. Meade.
  • The second party in a case, e.g. Mason.
  • The citation year, e.g. [1999].
  • The court level in which the case was decided, e.g. England and Wales Court of Appeal.
  • The court within the level, e.g. Civil.
  • The citation number, e.g. 780.
  • The date of the decision, e.g. 12 February 1999.

A sample of entries from the page I worked with is:
McSpadden v Keen [1999] EWCA Civ 1515 (27 May 1999)
McTaggart, R v [1997] EWCA Crim 3050 (24th November, 1997)
McTaggart, R v [1997] EWCA Crim 3137 (2nd December, 1997)
McVeigh & Anor, R v [1998] EWCA Crim 784 (3rd March, 1998)
McWhirter & Anor, R (on the application of) v Secretary of State for Foreign and Commonwealth Affairs [2003] EWCA Civ 384 (05 March 2003)
M-D v D [2008] EWHC 1929 (Fam) (19 December 2008)
MD (Guinea) v Secretary of State for the Home Department [2009] EWCA Civ 733 (17 June 2009)
MD (Iran) v Secretary of State for the Home Department [2007] EWCA Civ 532 (27 April 2007)
Below is a screenshot of the result of annotation in GATE. The parts of the annotation are colour coded as shown in the column on the right.
[Image: GATE annotations on a list of legal case information]
There was a range of irregularities in the source which had to be accommodated:

  • v and v. for the versus relation.
  • Decision date formats.
  • Length of the names of the parties.
  • Different orders of court and court level, e.g. EWCA Civ 1515 as against EWHC 1929 (Fam).
  • Variations that arise as a consequence of using a page stripped of HTML annotations. The first name in the image is an artifact.

In this approach, I did not annotate the parties as plaintiff and defendant, since the case decisions themselves associate the parties with different roles in different court contexts; the approach taken here is more general. Given the variation among case citations, I opted to identify each piece of the citation, which allows one to extract and reconstruct the citation in subsequent work.
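To make the idea of identifying each piece of a citation concrete, here is a minimal sketch in Python. It is not the JAPE grammars from the zip file; the regular expression, the function name `parse_citation`, and the dictionary keys are all assumptions chosen for illustration. It only tries to cope with the irregularities listed above as they appear in the sample entries (v and v., ordinal dates, and the two orders of court division and number).

```python
import re
from datetime import datetime

# Hypothetical pattern for illustration only; not taken from the JAPE files.
CITATION = re.compile(
    r"""^(?P<first_party>.+?)                 # everything to the left of versus
        \s+v\.?(?=\s)                         # versus divider, written v or v.
        (?:\s+(?P<second_party>.+?))?         # right of versus (may be missing)
        \s+\[(?P<year>\d{4})\]                # citation year in square brackets
        \s+(?P<court>[A-Z]+)                  # court level abbreviation
        \s+(?:
              (?P<sub1>[A-Z][a-z]+)\s+(?P<num1>\d+)       # division, then number
            | (?P<num2>\d+)\s+\((?P<sub2>[A-Z][a-z]+)\)   # number, then (division)
        )
        \s+\((?P<date>[^)]+)\)$               # decision date in parentheses
    """,
    re.VERBOSE,
)


def parse_citation(line):
    """Split one list entry into its parts, or return None if it does not fit."""
    m = CITATION.match(line.strip())
    if m is None:
        return None
    # Drop ordinal suffixes and commas: "24th November, 1997" -> "24 November 1997".
    cleaned = re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", m["date"]).replace(",", "")
    return {
        "first_party": m["first_party"],
        "second_party": m["second_party"],
        "year": int(m["year"]),
        "court": m["court"],
        "subcourt": m["sub1"] or m["sub2"],
        "number": int(m["num1"] or m["num2"]),
        # Assumes English month names, as in the sample entries.
        "decided": datetime.strptime(cleaned, "%d %B %Y").date(),
    }


print(parse_citation("M-D v D [2008] EWHC 1929 (Fam) (19 December 2008)"))
```

For entries such as "McTaggart, R v [1997] ...", this sketch leaves `second_party` empty and keeps the whole of "McTaggart, R" as the first party, in line with the decision above not to assign roles to the parties.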
Although this is a small-scale and relatively simple task, the result has one main strength: it gives us a list of parties to cases. It is difficult to identify parties automatically in general, but with this approach we can extract those entities which have been involved in a case, then use that information for subsequent annotation tasks. Another strength is that we have isolated the components of the case citation, which can then be reconstructed as we wish.
The list of parties could be further refined by isolating last names, distinguishing among parties which appear in a list, differentiating persons from organisations, and filtering out additional information that appears. This is left for future work.
The Case Base List zip file contains the following files, which were used with GATE.

  • ew-cases-0133.html, which is the HTML file that lists the cases.
  • ew-cases-0133SHORT.xml, which is the XML file with the result of annotation. This is the file shown in the graphic above. It is a short version of ew-cases-0133.html so that one can more easily see the results of the annotation, which appear as stand-off annotations. In the first part of the file, one can see the tokens of the file with numerical ranges (node numbers); later in the file, one can see the annotations themselves, each referring to the starting and ending node numbers of the text it covers.
  • GraphicListAnnotation.png, the graphic above.
  • CiteYear.jape, this annotates the citation year for use in the citation, as in [1998].
  • Courts_abbr.jape, this annotates the court level in terms of abbreviations, as in EWCA, which is the England and Wales Court of Appeal.
  • dateAWynerMods.jape, this annotates the decision date such as (23rd June, 2001) and (21 July 2000).
  • FirstParty.jape, this annotates the first party, which is that party to the left of versus.
  • SecondParty.jape, this annotates the second party, which is that party to the right of versus.
  • SubCourts_abbr.jape, this annotates the courts within a court level such as civil courts (Civ) and criminal courts (Crim).
  • Versus.jape, this annotates the versus divider.
  • england_wales_courts_hierarchy.lst, this is a list of courts in England and Wales.
  • england_wales_courts_hierarchy_abbr.lst, this is a list of abbreviations for the courts in England and Wales.
  • england_wales_courts_subclass.lst, this is a list of the divisions within a court level.
  • england_wales_courts_subclass_abbr.lst, this is a list of abbreviations of courts within a court level.
  • cite_year.lst, this is a list of years with square brackets as in [1999]. Perhaps a rule can be written for this, taking into account the brackets.
  • list.def, the ‘master list’ of lists for use in GATE.
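
The stand-off annotations described for ew-cases-0133SHORT.xml, and the rule suggested in place of cite_year.lst, can be illustrated together with a small Python sketch. This is not GATE's own format or API: the tuple layout, the type label "CiteYear", and the use of plain character offsets in place of GATE node numbers are assumptions made for the example. The point is only that a four-digit year in square brackets can be matched by a rule rather than listed, and that the annotation is stored apart from the text as a pair of offsets.

```python
import re

# A rule for bracketed years, standing in for the entries of cite_year.lst.
CITE_YEAR = re.compile(r"\[(\d{4})\]")


def annotate_years(text):
    """Return stand-off annotations as (start, end, type, features) tuples.

    The text itself is left untouched; each annotation only points into it.
    """
    return [
        (m.start(), m.end(), "CiteYear", {"year": m.group(1)})
        for m in CITE_YEAR.finditer(text)
    ]


text = "McVeigh & Anor, R v [1998] EWCA Crim 784 (3rd March, 1998)"
for start, end, kind, features in annotate_years(text):
    # The covered string is recovered by slicing the original text.
    print(start, end, kind, text[start:end], features)
```

Because the brackets are part of the pattern, the year inside the decision date (in parentheses) is not picked up as a citation year.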

The files are released under a Creative Commons Attribution-ShareAlike license. The main objective of the contribution is to foster open, public, and collaborative development of text mining tools for legal resources.
Advice, suggestions, alternatives, and contributions along the lines of this work are very welcome.
Cheers,
Adam
Copyright © 2009 Adam Wyner
