In this note, I point to various parts of a discussion on developing and analysing legal textual data raised at ICAIL 2011. Please feel free to add comments to this document (or send them to me in person, by email, or on your blog with a link back to this post), and I can then add them here (I’m very happy to attribute contributions). The intention is to stimulate discussion on these matters and help the community of researchers move ahead on common interests.
Unlike the situation several years ago, we now have accessible sources of large corpora of legal textual information. The World Legal Information Institute and its member LIIs provide free, independent, non-profit access to worldwide law. For example, one can go to the US site and download cases such as United States v Grant USCA9 19; 286 F.2d 157 (19 January 1961); one can request zipped files or screen scrape cases. The LIIs have introduced standardised references and formats for cases, and their sites support Boolean and regular-expression searches.
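As a small illustration of what the standardised references make possible, here is a sketch in Python that pulls the parts out of a case reference like the one above. The regular expression is illustrative only, not the LIIs’ official citation grammar, and it covers just this one citation shape.

```python
import re

# Parse a case reference such as:
#   "United States v Grant USCA9 19; 286 F.2d 157 (19 January 1961)"
# The pattern is an illustrative sketch, not a general citation grammar.
CITATION = re.compile(
    r"(?P<parties>.+?\sv\s.+?)\s+"          # party names around "v"
    r"(?P<neutral>[A-Z]+\d*\s+\d+);\s*"     # medium-neutral reference, e.g. USCA9 19
    r"(?P<report>\d+\s+F\.\d?d?\s+\d+)\s+"  # reporter citation, e.g. 286 F.2d 157
    r"\((?P<date>[^)]+)\)"                  # decision date
)

def parse_citation(text):
    """Return the components of a case reference, or None if nothing matches."""
    m = CITATION.search(text)
    return m.groupdict() if m else None

ref = "United States v Grant USCA9 19; 286 F.2d 157 (19 January 1961)"
print(parse_citation(ref)["date"])  # -> 19 January 1961
```

A parser along these lines is the kind of small, reusable component that could sit at the front of any pipeline over LII-formatted cases.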
From the contacts that I have had (e.g. in the US and UK), the LIIs would be very happy to collaborate with academic researchers in the analysis of their data, in keeping with their primary mission. In particular, developing tools that can be integrated and deployed with their platforms might be a way to go, thereby addressing significant platform and dissemination issues.
Another source of corpora is public.resource.org, which distributes a range of corpora covering legislation, codes, and cases.
Analysis and Annotation
There is a range of issues around information retrieval and extraction. Others can speak to IR, statistical, and machine learning approaches; what I know better is annotation, whether fully automatic, semi-automatic, or manual. Here we have issues about what to annotate and how. Some low-level information is unproblematic (e.g. entities of various sorts, sections, and sentiment); higher-level information (e.g. factors) may be more complex. I have some suggestions for low-level annotations; a good starting point for factors is the CATO factor set, though there is a general issue about how to extend factor identification to other domains (the CATO factors are specific to trade secret law).
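To make the low-level case concrete, here is a minimal sketch of standoff annotation over raw case text, where each annotation is a (start, end, type) span. The two entity patterns are illustrative placeholders, not a proposed scheme.

```python
import re

# Illustrative low-level entity patterns; a real scheme would be agreed
# by the community, not hard-coded like this.
PATTERNS = {
    "SectionRef": re.compile(r"[Ss]ection\s+\d+(\(\w+\))*"),
    "CaseRef": re.compile(r"\b\d+\s+F\.\d?d?\s+\d+\b"),
}

def annotate(text):
    """Return standoff annotations: a sorted list of (start, end, type) tuples."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

doc = "Under section 43(a), the court in 286 F.2d 157 held otherwise."
for start, end, label in annotate(doc):
    print(label, doc[start:end])
```

Because the spans are standoff (offsets into the unmodified text) rather than inline markup, annotations from different tools or annotators over the same text can be compared directly.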
One general problem with analysis is that different researchers use different tools in their work and report only the results. This makes results hard to interchange, which is particularly problematic with annotation work. If a common ‘framework’ tool is used and some consensus is developed about (at least) low-level annotation types, then work can proceed more collaboratively, transparently, and reproducibly. There are also more forceful arguments for researchers, public service bodies, and information providers to adopt such an open development methodology, among them justification and traceability (see Wyner and Peters 2010 and David Lewis’s ICAIL 2011 keynote address on related points). The General Architecture for Text Engineering (GATE) is one open framework for text-processing modules.
There are ‘open’ systems for text annotation, for example Open Calais and the data enrichment service of The Stationery Office’s Open Up platform. However, there are intellectual property issues that need to be considered.
Another general issue is how to carry out manual annotation, for example to build gold standards, which are required for training and evaluating machine learning systems. There has been significant progress here, for example with TeamWare, which provides curated, web-based annotation tools along with annotation analysis (e.g. inter-annotator agreement). For a short tutorial (for an experiment) on using TeamWare to annotate some legal case factors, see Web-based Annotation Support for the Law. Wim Peters and I have proposed to law school faculty that they use this tool to support exercises for first- and second-year students, since these exercises often require identifying and extracting information from cases. Wim and I think integrating annotation exercises into legal e-learning could both help develop large annotated data sets and serve an important educational purpose. See our paper about some of these points and proposals.
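For readers unfamiliar with the agreement measures such tools report, here is a small sketch of one common one, Cohen’s kappa, for two annotators who each assign one label per item. TeamWare computes such statistics itself; this just illustrates the arithmetic, and the factor labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators marking sentences for a hypothetical factor F, or none.
a = ["F", "F", "none", "F", "none", "none"]
b = ["F", "none", "none", "F", "none", "F"]
print(round(cohens_kappa(a, b), 3))  # -> 0.333
```

Here the raw agreement is 4/6, but half of that is expected by chance, so kappa is only about 0.33; this gap is why curated annotation platforms report chance-corrected measures rather than raw agreement.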
Large corpora can be assembled and tools applied to them, but to raise funds the community needs to develop a range of motivating research questions and use cases. Aside from questions already pursued in the AI and Law community, we might consult further with public bodies (the National Center for State Courts and similar), legal information service providers (Lexis-Nexis, Thomson Reuters, Practical Law Company), law societies, political scientists, and others. The kinds of answers we look for partially guide how we structure not only the corpora, but more so the annotations.
Digging into Data has issued a Request for Proposals, but the due date is June 16 (I had been working on a proposal, but needed better research questions to hold local interest). Though the deadline is too soon for me to submit, the call does demonstrate widespread interest among funding bodies in the development and analysis of large corpora in the humanities and social sciences. The other obvious funding sources are national (US, UK, French, etc.) and international (the EU and Digging into Data).
By Adam Wyner
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.