Estrella Project Overview

Until September 2009, I worked on the Estrella Project (The European project for Standardized Transparent Representations in order to Extend Legal Accessibility) at the University of Liverpool. One of the documents which I co-authored (with Trevor Bench-Capon) for the project was the ESTRELLA User Report, an open document about key elements of the project. In the context of commercial, academic, and governmental collaborations, many of the issues and topics from that project are still relevant, especially concerning the motivations and goals of open source materials for legal informatics. In order to circulate this discussion further afield, I have taken the liberty of reproducing an extract from the report. LKIF stands for the Legal Knowledge Interchange Format, which was a key deliverable in the project. For further documents from the project, see the Estrella Project website.
Overview
The Estrella Project (The European project for Standardized Transparent Representations in order to Extend Legal Accessibility) has developed a platform which allows public administrations to deploy comprehensive solutions for the management of legal knowledge. In reasoning about social benefits or taxation, public administrators must represent and reason with complex legislation. The platform is intended to support the representation of and reasoning about legislation in a way that can help public administrations to improve the quality and efficiency of their services. Moreover, given a suitable interface, the legislation can be made available for the public to interact with. For example, LKIF tools could be made available to citizens via the web to help them to assess their eligibility for social benefits as well as to fill out the appropriate application forms.
The platform has been designed to be open and standardised so that public administrations need not become dependent on proprietary products of particular vendors. Along the same lines, the platform supports interoperability among various components for legal knowledge-based systems, allowing public administrations to choose freely among the components. A standardised platform also enables a range of vendors to develop innovative products to suit particular market needs without having to be concerned with an all-encompassing solution, compatibility with other vendors, or being locked out of a strategic market by “monolithic” vendors. As well, the platform abstracts from the expression of legislation in different natural languages, thus providing a common, abstract legal “lingua franca”.
The main technical achievement of the Estrella Project is the development of a Legal Knowledge Interchange Format (LKIF), which represents legal information in a form which builds upon emerging XML-based standards of the Semantic Web. The project platform provides Application Programmer Interfaces (APIs) for interacting with legal knowledge-based systems using LKIF. LKIF provides formalisms for representing concepts (“ontologies”), inference rules, precedent cases, and arguments. An XML document schema for legislation, called MetaLex, has been developed, which complements and integrates national XML standards for legislation. This format supports document search, exchange, and association among documents, as well as enforcing a link between legal sources and the legal knowledge systems which reason about the information in the sources. In addition, a reference inference engine has been developed which supports reasoning with legal knowledge represented in LKIF. The utility of LKIF as an interchange format for legal knowledge has been demonstrated with pilot tests in which legal documents expressed in the proprietary formats of several vendors were translated, via LKIF, from the format of one vendor to that of another.
Background Context
The Estrella Project originated in the context of European Union integration, where:

  • The European Parliament passes EU wide directives which need to be incorporated into or related to the legislation of member states.
  • Goods, services, and citizens are free to move across open European borders.
  • Democratic institutions must be strengthened as well as be more responsive to the will of the citizenry.
  • Public administrations must be more efficient and economical.

In the EU, the legal systems of member states have been composed of heterogeneous, often conflicting, rules and regulations concerning taxes, employment, education, pensions, health care, property, trade, and so on. Integration of new EU legislation with the existing legislation of the member states, as well as homogenisation of legal systems across the EU, has been problematic, complex, and expensive to implement. As the borders of member states open, the rules and regulations concerning the benefits and liabilities of citizens and businesses must move as people, goods, and services move. For example, laws concerning employment and pensions ought to be comparable across the member states so as to facilitate the movement of employees across national boundaries. In addition, there are more general concerns about improving the functionality of the legal system so as to garner public support for it, promoting transparency, compliance, and citizen involvement. Finally, the costs of administering the legal system by EU administrative departments, administrations of member states, and companies throughout the EU are significant and rising. The more complex and dynamic the legislative environment, the more burdensome the costs.
Purposes
Given this background context, the Estrella Project was initiated with the following purposes in mind:

  • to facilitate the integration of EU legal systems
  • to modernise public administration at the levels of the EU and within member states by supporting efficiency, transparency, accountability, accessibility, inclusiveness, portability, and simplicity of core governmental processes and services
  • to improve the quality of legal information by testing legal systems for consistency (are there contradictions between portions of the law?) and correctness (is the law achieving the goal it is specified for?).
  • to reduce the costs of public administration
  • to reduce private sector costs of managing their legal obligations
  • to encourage public support for democratic institutions by participation, transparency, and personalisation of services
  • to ease the mobility of goods, services, and EU citizens within the EU
  • to support businesses across EU member states
  • to provide the means to “modularise” the legal systems for different levels of EU legal structure, e.g. provide a “municipal government” module which could be amended to suit local circumstances
  • to support a range of governmental and legal processes across organisations and on behalf of citizens and businesses
  • to support a variety of reasoning patterns as needed across a range of resources (e.g. directives, legal case bases).

Using XSLT to Re-represent GATE Output

Once one has processed some documents with GATE, what can one do with the result? After all, information extraction implies that the information is extracted, not simply annotated. (See introductory notes on this and related posts.)
There are several paths. One is to use Annotations in Context (ANNIC), which searches for and returns a display of annotated elements; we discuss how to use ANNIC in a separate post. However, ANNIC does not appear to support an “export” function for further processing of the results. Another path is to export the document with inline annotations; this, with a bit of further manual work, can then be processed with eXtensible Stylesheet Language Transformations (XSLT). There are other approaches (e.g. XQuery), but this post provides an example of using XSLT to present output as a rule book.
In Legislative Rule Extraction, we annotated some legislation. We carry on with the annotated legislation.
Output of GATE
In addition to the graphic output from GATE’s application, we can output the results of the annotation either inline or offset. As we are interested in providing alternative presentations of the annotated material, we look at the inline annotation.
In GATE, right click on the document file (after applying the application to it) and choose “Save preserving document format”. For our sample text, the result is:


 Article 1 
 Subject matter 
 This Directive lays down rules concerning the
following :
 1) 
 the taking-up and pursuit, within the Community,
of the self-employed activities of direct insurance and
reinsurance ;
 2) 
 the supervision in the case of insurance and
reinsurance groups ;
 3) 
 the reorganisation and winding-up of direct
insurance undertakings .

Legal XML
The GATE output needs to be made into proper XML, having a root and being properly nested. As there will be several rules, each rule extracted should go between some legal XML annotation. There is an issue about how to save and process a full corpus, as the only options to save are XML or Datastore, but we leave this aside for the time being. For now, we ‘manually’ wrap our GATE output as below.
I used the online XSLT editor at w3schools, which allows one to experiment and get results right away. In particular, one can cut and paste the XML rulebook (below) into the left hand pane and the XSLT code (below) into the right hand pane, hit the edit button, and get the transformed output. One caveat: one might have to do a bit of editing on the XML rulebook for spaces and returns, since there are some bumps between what appears in WordPress and what is needed to run the code.
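The manual wrapping step described above can also be scripted. A minimal sketch in Python, where the root element name rulebook and the sample annotation names are assumptions for illustration (the original markup was stripped in rendering):

```python
# Sketch: wrap inline-annotated GATE output in a single root element so that
# the result is well-formed XML ready for XSLT processing.
import xml.etree.ElementTree as ET

def wrap_fragment(fragment: str, root: str = "rulebook") -> str:
    """Wrap an annotated fragment in a root element and verify it parses."""
    wrapped = f"<{root}>{fragment}</{root}>"
    ET.fromstring(wrapped)  # raises ParseError if the result is not well-formed
    return wrapped

print(wrap_fragment("<article><title>Article 1</title></article>"))
```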
The XML Rulebook:





 Article 1 
 Subject matter 
 This Directive lays down rules concerning the
following :
 1) 
 the taking-up and pursuit, within the Community,
of the self-employed activities of direct insurance and
reinsurance ;
 2) 
 the supervision in the case of insurance and
reinsurance groups ;
 3) 
 the reorganisation and winding-up of direct
insurance undertakings .



The XSLT code:






  
  
  

My Rulebook

Reference Code:
Title:
Description:
Description:
Description:

XSLT Output
The result is the following:
Output of XSLT on the XML Rulebook
In general, one can create any number of rulebooks from the same underlying data, varying the layout and substance of the presentation. For example, we can change the colours or headers easily; we can present more or less information. This is a lot more powerful than the static book that now exists.
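Since the XSLT markup above was stripped in rendering, the same re-presentation idea can be sketched in Python's standard library (which has no XSLT processor of its own). The element names rule, ref, title, and description are illustrative assumptions, not part of any GATE or LKIF standard:

```python
# Sketch: re-present an XML rulebook as a simple HTML page, the job the
# XSLT stylesheet performs. Element names here are illustrative assumptions.
import xml.etree.ElementTree as ET

RULEBOOK = """
<rulebook>
  <rule>
    <ref>Article 1</ref>
    <title>Subject matter</title>
    <description>This Directive lays down rules concerning the taking-up
    and pursuit of the activities of direct insurance and reinsurance.</description>
  </rule>
</rulebook>
"""

def to_html(xml_text: str) -> str:
    """Transform a rulebook document into an HTML presentation."""
    root = ET.fromstring(xml_text)
    parts = ["<html><body><h1>My Rulebook</h1>"]
    for rule in root.iter("rule"):
        parts.append(f"<h2>Reference Code: {rule.findtext('ref')}</h2>")
        parts.append(f"<p>Title: {rule.findtext('title')}</p>")
        for d in rule.iter("description"):
            parts.append(f"<p>Description: {d.text.strip()}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)

print(to_html(RULEBOOK))
```

Varying the layout or the selected elements in to_html corresponds to varying the XSLT stylesheet: any number of presentations can be generated from the same underlying data.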
Problems and Issues
Our example is a simple illustration of what can be done. Note that we have not yet fulfilled the requirements from our initial post since we have not numbered the sections, but this can be added later.
An important problem is that GATE annotations are not always in accordance with XML standards. In particular, XML markup must be strictly nested, as in

      <a><b> ... </b></a>

There can be no crossover, such as in

      <a><b> ... </a></b>

though this may well occur for GATE annotations. There may be several approaches to this problem, but we leave that for future discussion.
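Any standard XML parser makes the problem concrete: a strictly nested document parses, while crossing annotations are rejected outright. A small sketch with Python's standard library:

```python
# Sketch: strictly nested markup parses; crossing annotations do not.
import xml.etree.ElementTree as ET

nested = "<doc><a><b>ok</b></a></doc>"      # strictly embedded: parses
crossing = "<doc><a><b>bad</a></b></doc>"   # crossover: rejected by the parser

ET.fromstring(nested)  # no error

try:
    ET.fromstring(crossing)
except ET.ParseError as e:
    print("not well-formed:", e)
```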
Another problem is that “Save preserving document format” only works with documents and not corpora, and we might want to work with corpora.
Finally, XSLT is useful for transforming XML files, not for extracting information from them; for extraction one would need something such as XQuery.
By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Legal Informatics Start-up from Stanford University

The Stanford Daily, an online newspaper with news from Stanford University, reports the creation of a spin-off, start-up company Lex Machina which is the result of collaboration between the Law School and Department of Computer Science at Stanford. The focus of the company is to make intellectual property litigation more transparent; it covers patent infringement, copyright, trademark, antitrust, and certain trade secret lawsuits. There are commercial and non-commercial services.
This is an interesting development, particularly in terms of the collaboration between a law school and department of computer science. I hope it is the first of many, and I look forward to learning more about the company and system.

CFP: Workshop on Semantic Processing of Legal Texts

LREC 2010 Workshop on
SEMANTIC PROCESSING OF LEGAL TEXTS (SPLeT-2010)
CALL FOR PAPERS

23 May 2010, Malta
Workshop description
The legal domain represents a primary candidate for web-based information distribution, exchange and management, as testified by the numerous e-government, e-justice and e-democracy initiatives worldwide. The last few years have seen a growing body of research and practice in the field of Artificial Intelligence and Law which addresses a range of topics: automated legal reasoning and argumentation, semantic and cross-language legal information retrieval, document classification, legal drafting, legal knowledge discovery and extraction, as well as the construction of legal ontologies and their application to the law domain. In this context, it is of paramount importance to use Natural Language Processing techniques and tools that automate and facilitate the process of knowledge extraction from legal texts.
Within the last two years, a number of dedicated workshops and tutorials specifically focussing on different aspects of semantic processing of legal texts have demonstrated the current interest in research on Artificial Intelligence and Law in combination with Language Resources (LR) and Human Language Technologies (HLT). The LREC 2008 Workshop on “Semantic processing of legal texts” was held in Marrakech, Morocco, on the 27th of May 2008. The JURIX 2008 Workshop on “the Natural Language Engineering of Legal Argumentation: Language, Logic, and Computation (NaLEA)” focussed on recent advances in natural language engineering and legal argumentation. The ICAIL 2009 Workshops “LOAIT ’09 – the 3rd Workshop on Legal Ontologies and Artificial Intelligence Techniques joint with the 2nd Workshop on Semantic Processing of Legal Texts” and “NALEA’09 – Workshop on the Natural Language Engineering of Legal Argumentation: Language, Logic, and Computation” addressed complementary themes: the former focussed on Legal Knowledge Representation with particular emphasis on the issue of ontology acquisition from legal texts, while the latter tackled issues related to legal argumentation and linguistic technologies.
To continue this momentum, a 3rd Workshop on “Semantic Processing of Legal Texts” is being organised at the Language Resources and Evaluation Conference to bring to the attention of the broader LR/HLT community the specific technical challenges posed by the semantic processing of legal texts and also to share with the community the motivations and objectives which make it of interest to researchers in legal informatics. The outcomes of these interactions are expected to advance research and applications and foster interdisciplinary collaboration within the legal domain.
The main goals of the workshop are to provide an overview of the state-of-the-art in legal knowledge extraction and management, to explore new research and development directions and emerging trends, and to exchange information regarding legal LRs and HLTs and their applications.
Areas of Interest
The workshop will focus on the topics of the automatic extraction of information from legal texts and the structural organisation of the extracted knowledge. Particular emphasis will be given to the crucial role of language resources and human language technologies.
Papers are invited on, but not limited to, the following topics:

  • Building legal resources: terminologies, ontologies, corpora
  • Ontologies of legal texts, including subareas such as ontology acquisition, ontology customisation, ontology merging, ontology extension, ontology evolution, lexical information, etc.
  • Information retrieval and extraction from legal texts
  • Semantic annotation of legal texts
  • Legal text processing
  • Multilingual aspects of legal text semantic processing
  • Legal thesauri mapping
  • Automatic Classification of legal documents
  • Logical analysis of legal language
  • Automated parsing and translation of natural language arguments into a logical formalism
  • Linguistically-oriented XML mark up of legal arguments
  • Dialogue protocols for argumentation
  • Legal argument ontology
  • Computational theories of argumentation that are suitable to natural language
  • Controlled language systems for law.

Submissions
Submissions are solicited from researchers working on all aspects of semantic processing of legal texts. Authors are invited to submit papers describing original completed work, work in progress, interesting problems, case studies or research trends related to one or more of the topics of interest listed above. The final version of the accepted papers will be published in the Workshop Proceedings.
Short or full papers can be submitted. Short papers are expected to present new ideas or new visions that may influence the direction of future research, yet they may be less mature than full papers. While an exhaustive evaluation of the proposed ideas is not necessary, insight into and in-depth understanding of the issues are expected. Full papers should be more fully developed and evaluated. Short papers will be reviewed in the same way as full papers by the Program Committee and will be published in the Workshop Proceedings.
Full paper submissions should not exceed 10 pages, short papers 6 pages; both should be typeset using a font size of 11 points. Style files will be made available by LREC for the camera-ready versions of accepted papers. Papers should be submitted electronically, no later than February 10, 2010. The only accepted format for submitted papers is Adobe PDF. Submission will be electronic using START paper submission software available at
SPLeT 2010 Workshop
Note that when submitting a paper through the START page, authors will be kindly asked to provide relevant information about the resources that have been used for the work described in their paper or that are the outcome of their research. In this way, authors will contribute to the LREC2010 Map, our new feature for LREC 2010. For further information on this initiative, please refer to
LREC2010 Map of Language Resources
Important Dates
Paper submission deadline: 10 February 2010
Acceptance notification sent: 5 March 2010
Final version deadline: 21 March 2010
Workshop date: 23 May 2010
Workshop Chairs

  • Enrico Francesconi (Istituto di Teoria e Tecniche dell’Informazione Giuridica of CNR, Florence, Italy)
  • Simonetta Montemagni (Istituto di Linguistica Computazionale of CNR, Pisa, Italy)
  • Wim Peters (Natural Language Processing Research Group, University of Sheffield, UK)
  • Adam Wyner (Department of Computer Science, University College London, UK)

Address any queries regarding the workshop to: lrec10_legalWS@ilc.cnr.it
Program Committee

  • Johan Bos (University of Rome, Italy)
  • Danièle Bourcier (Humboldt Universität, Berlin, Germany)
  • Thomas R. Bruce (Cornell Law School, Ithaca, NY, USA)
  • Pompeu Casanovas (Institut de Dret i Tecnologia, UAB, Barcelona, Spain)
  • Alessandro Lenci (Dipartimento di Linguistica, Università di Pisa, Pisa, Italy)
  • Leonardo Lesmo (Dipartimento di Informatica, Università di Torino, Torino, Italy)
  • Raquel Mochales Palau (Catholic University of Leuven, Belgium)
  • Paulo Quaresma (Universidade de Évora, Portugal)
  • Erich Schweighofer (Universität Wien, Rechtswissenschaftliche Fakultät, Wien, Austria)
  • Manfred Stede (University of Potsdam, Germany)
  • Daniela Tiscornia (Istituto di Teoria e Tecniche dell’Informazione Giuridica of CNR, Florence, Italy)
  • Tom van Engers (Leibniz Center for Law, University of Amsterdam, Netherlands)
  • Radboud Winkels (Leibniz Center for Law, University of Amsterdam, Netherlands)

Open Source Information Extraction: Data, Lists, Rules, and Development Environment

Open source software development and standards are widely discussed and practiced, and they have led to a range of useful applications and services. GATE is one such example.
However, one quickly learns that open source can easily mean open only to a certain extent: GATE is open source, but the applications and additional functionalities that are developed with respect to GATE often are not. On the one hand, this makes perfect sense, as the applications and functionalities are added value, labour intensive, and so on. On the other hand, the scientific community cannot verify, validate, or build on prior work unless the applications and functionalities are available. This can also hinder commercial development, since closed development impedes progress, dissemination, and a common framework from which everyone benefits. It also does not recognise the fundamentally experimental aspect of information extraction. By comparison, the rapid growth and contributions of the natural (Biology, Physics, Chemistry, etc.) and theoretical (Maths) sciences could only have occurred in an open, transparent development environment.
I advocate open source information extraction where an information extraction result can only be reported if it can be independently verified and built on by members of the scientific community. This means that the following must be made available concurrent with the report of the result:

  • Data and corpora
  • Lists (e.g. gazetteers)
  • Rules (e.g. JAPE rules)
  • Any additional processing components (e.g. information extraction to schemes or XSLT)
  • Development environment (e.g. GATE)

In other words, the results must be independently reproducible in full. The slogan is:

No publication without replicability.

This would:

  • Contribute to the research community and build on past developments.
  • Support teaching and learning.
  • Encourage interchange. The Semantic Web chokes on different formats.
  • Return academic research to the common (i.e. largely taxpayer funded) good rather than owned by the researcher or university. If someone needs to keep their work private, they should work at a company.
  • Lead to distributive, collaborative research and results, reducing redundancy and increasing the scale and complexity of systems.

The knowledge bottleneck, particularly in relation to language, has not been and likely will not be solved by any one individual or research team. Open source information extraction will, I believe, make greater progress toward addressing it.
Obviously, money must be made somewhere. One source is public funding, including contributions from private organisations which see a value in building public infrastructure. Another source is, like other open source software, systems, or other public information, to make money “around” the free material by adding non-core goods, services, or advertising.
By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Tutorial on NLP techniques for managing legal resources on the Semantic Web

Next week, 16 December 2009, I am giving a three hour tutorial at JURIX (International Conference on Legal Knowledge and Information Systems) in Rotterdam, The Netherlands on Natural Language Processing Techniques for Managing Legal Resources on the Semantic Web. The tutorial description appears below. Further material from the tutorial will be presented on the blog.
Legal resources such as legislation, public notices, and case law are increasingly available on the internet. To be automatically processed by web services, the resources must be annotated using semantic web technologies such as XML, RDF, and ontologies. However, manual annotation is labour and knowledge intensive. Using natural language processing (NLP) techniques and systems, a significant portion of these resources can be automatically annotated. In this tutorial, we outline the motivations and objectives of NLP, give an overview of several accessible systems (General Architecture for Text Engineering, C&C/Boxer, Attempto Controlled English), provide examples of processing legal resources, and discuss future directions in this area.

Annotating Rules in Legislation

Over the last couple of months, I have had discussions about text mining and annotating rules in legislation with several people (John Sheridan of The Office of Public Sector Information, Richard Goodwin of The Stationery Office, and John Cyriac of Compliance Track). While nothing concrete has yet resulted from these discussions, it is clearly a “hot topic”.
In the course of these discussions, I prepared a short outline of the issues and approaches, which I present below. Comments, suggestions, and collaborations are welcome.
Vision, context, and objectives
One of the main visions of artificial intelligence and law has been to develop a legislative processing tool. Such a tool has several related objectives:

      [1.] To guide the drafter to write well-formed legal rules in natural language.
      [2.] To automatically parse and semantically represent the rules.
      [3.] To automatically identify and annotate the rules so that they can be extracted from a corpus of legislation for web-based applications.
      [4.] To enable inference, modeling, and consistency testing with respect to the rules.
      [5.] To reason with respect to domain knowledge (an ontology).
      [6.] To serve the rules on the web so that users can use natural language to input information and receive determinations.

While no such tool exists, there has been steady progress on understanding the problems and developing working software solutions. In early work (see The British nationality act as a logic program (1986)), an act was manually translated into a program, allowing one to draw inferences given ground facts. Haley is a software and service company which provides a framework which partially addresses 1, 2, 4, and 6 (see Policy Automation). Some research addresses aspects of 3 (see LKIF-Core Ontology). Finally, there are XML annotation schemas for legislation (and related input support) such as The Crown XML Schema for Legislation and Akoma Ntoso, both of which require manual input. Despite these advances, there is much progress yet to be made. In particular, no results fulfill [3.].
In consideration of [3.], the primary objective of this proposal is to use the General Architecture for Text Engineering (GATE) framework in order to automatically identify and annotate legislative rules from a corpus. The annotation should support web-based applications and be consistent with semantic web mark ups for rules, e.g. RuleML. A subsidiary objective is to define an authoring template which can be used within existing authoring applications to manually annotate legislative rules.
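GATE accomplishes this kind of identification with gazetteer lists and JAPE grammars over linguistic annotations. As a simplified, illustrative analogue only (not GATE's actual mechanism), the core idea of keyword-triggered rule annotation can be sketched as follows; the cue list is an assumption:

```python
# Simplified, illustrative analogue of legislative rule spotting.
# GATE would use gazetteers and JAPE grammars over token annotations;
# this sketch only shows the idea of cue-triggered annotation.
import re

# Deontic/rule cues: an assumed, non-exhaustive list for illustration.
RULE_CUES = re.compile(
    r"\b(shall|must|lays down rules|is required to|may not)\b",
    re.IGNORECASE,
)

def annotate_rules(sentences):
    """Wrap sentences containing a rule cue in a <rule> annotation."""
    return [
        f"<rule>{s}</rule>" if RULE_CUES.search(s) else s
        for s in sentences
    ]

sample = [
    "This Directive lays down rules concerning reinsurance.",
    "The committee met in Brussels.",
    "Member States shall ensure compliance.",
]
for line in annotate_rules(sample):
    print(line)
```

A real system would, of course, need linguistic analysis rather than surface patterns, which is precisely why the GATE framework is proposed.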
Benefits
Attaining these objectives would:

  • Support automated creation, maintenance, and distribution of rule books for compliance.
  • Contribute to the development of a legislative processing tool.
  • Make legislative rules accessible for web-based applications. For example, given other annotations, one could identify rules that apply with respect to particular individuals in an organisation along with relevant dates, locations, etc.
  • Enable further processing of the rules such as removing formatting, parsing the content of the rules, and representing them semantically.
  • Allow an inference engine to be applied over the formalised rule base.
  • Make legislation more transparent and communicable among interested parties such as government departments, EU governments, and citizenry.

Scope
To attain the objectives, we propose the following phases:

  • Create a relatively small sample corpus to scope the study.
  • Manually identify the forms of legislative rules within the corpus.
  • Develop or adapt an annotation scheme for rules.
  • Apply the analysis tools of GATE and annotate the rules.
  • Validate that GATE annotates the rules as intended.
  • Apply the annotation system to a larger corpus of documents.

For each section, we would produce a summary of results, noting where difficulties are encountered and ways they might be addressed.
Extending the work
The work can be extended in a variety of ways:

  • Apply the GATE rules to a larger corpus with more variety of rule forms.
  • Process the rules for semantic representation and inference.
  • Take into consideration defeasibility and exceptions.
  • Develop semantic web applications for the rules.

By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0

Meeting with John Sheridan on the Semantic Web and Public Administration

I met today with John Sheridan, Head of e-Services, Office of Public Sector Information, The National Archives, located at the Ministry of Justice, London, UK. Also at the meeting was John’s colleague Clare Allison. John and I had met at the ICAIL conference in Barcelona, where we briefly discussed our interests in applications of Semantic Web technologies to legal informatics in the public sector. Recently, John got back in contact to talk further about how we might develop projects in this area.
Perhaps most striking to me is that John made it clear that the government (at least his sector) is proactive, looking for research and development projects that make government data available and usable in a variety of ways. In addition, he wanted to develop a range of collaborations to better understand the opportunities the Semantic Web may offer.
As part of catching up with what is going on, I took a look around the web for relatively recent documents on related activities.

In our discussion, John gave me an overview of the current state of affairs in public access to legislation, in particular, the legislative markup and API. The markup is intended to support publication, revision, and maintenance of legislation, among other possibilities. We also had some discussion about developing an ontology of government which would be linked to legislation.
Another interesting dimension is that John’s office is one of the few I know of which are actively engaged in developing a knowledge economy, partly encouraged by public administrative requirements and goals. Others in this area are the Dutch and the US (with xml.gov). All very promising, and these discussions are well worth following up on.
Copyright © 2009 Adam Wyner

Session I of "Automated Content Analysis and the Law" Workshop

Today is session I of the NSF-sponsored workshop on Automated Content Analysis and the Law. The theme of today’s meeting is the state of judicial/legal scholarship, in order to:

  • Identify the theoretical and substantive puzzles in legal and judicial scholarship which might benefit from automated content analysis
  • Discuss the kinds of data/measures that automated content analysis could provide to address these puzzles.

Further comments later in the day after the session.
–Adam Wyner
Copyright © 2009 Adam Wyner

Participating in One-Lex — Managing Legal Resources on the Semantic Web

Later this summer, I’ll be participating in the summer school Managing Legal Resources in the Semantic Web, September 7 to 12 in San Domenico di Fiesole (Florence, Italy). This program will focus on several aspects of legal document management:

  • Drafting methods, to improve the language and the structure of legislative texts
  • Legal XML standards, to improve the accessibility and interoperability of legal resources
  • Legal ontologies, to capture legal metadata and legal semantics
  • Formal representation of legal contents, to support legal reasoning and argumentation
  • Workflow models, to cope with the lifecycle of legal documentation

While I’m familiar with several of these areas, I’m using this opportunity to fill in my knowledge of the others.