Open source software development and standards are widely discussed and practised. They have led to a range of useful applications and services. GATE is one such example.
However, one quickly learns that open source can mean open only to a certain extent: GATE is open source, but the applications and additional functionalities developed with GATE often are not. On the one hand, this makes perfect sense, as the applications and functionalities represent added value and are labour intensive to build. On the other hand, the scientific community cannot verify, validate, or build on prior work unless the applications and functionalities are available. Closed development can also hinder commercial development, since it impedes progress, dissemination, and the emergence of a common framework from which everyone benefits. Nor does it recognise the fundamentally experimental aspect of information extraction. In contrast, the rapid growth and contributions of the natural (Biology, Physics, Chemistry, etc.) and theoretical (Maths) sciences could only have occurred in an open, transparent development environment.
I advocate open source information extraction where an information extraction result can only be reported if it can be independently verified and built on by members of the scientific community. This means that the following must be made available concurrent with the report of the result:
- Data and corpora
- Lists (e.g. gazetteers)
- Rules (e.g. JAPE rules)
- Any additional processing components (e.g. information extraction to schemes or XSLT)
- Development environment (e.g. GATE)
In other words, the results must be independently reproducible in full. The slogan is:
No publication without replicability.
This would:
- Contribute to the research community and build on past developments.
- Support teaching and learning.
- Encourage interchange. The Semantic Web chokes on different formats.
- Return academic research to the common (i.e. largely taxpayer funded) good rather than leaving it owned by the researcher or university. If someone needs to keep their work private, they should work at a company.
- Lead to distributive, collaborative research and results, reducing redundancy and increasing the scale and complexity of systems.
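As a rough illustration of what releasing the artefacts listed earlier (data and corpora, lists, rules, additional components, development environment) might look like in practice, the sketch below builds a checksum manifest for a release directory and reports which categories are empty. The folder names, the `REQUIRED` tuple, and both function names are assumptions made for this example; they are not part of GATE or of any existing convention.

```python
import hashlib
from pathlib import Path

# Hypothetical folder names mirroring the list of artefacts in the post;
# the layout itself is an assumption, not a GATE convention.
REQUIRED = ("corpora", "gazetteers", "rules", "components", "environment")


def build_manifest(release_dir: Path) -> dict:
    """Map each required category to {relative file path: SHA-256 hex digest}."""
    manifest = {}
    for category in REQUIRED:
        folder = release_dir / category
        entries = {}
        if folder.is_dir():
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    digest = hashlib.sha256(path.read_bytes()).hexdigest()
                    entries[path.relative_to(release_dir).as_posix()] = digest
        manifest[category] = entries
    return manifest


def missing_categories(manifest: dict) -> list:
    """Return the required categories for which no file was released."""
    return [category for category in REQUIRED if not manifest.get(category)]
```

A reviewer could run `missing_categories(build_manifest(Path("release")))` on the material accompanying a paper; an empty list would indicate only that every category contains at least one file, not that the reported result actually reproduces. The `environment` folder might hold, for instance, a note recording which version of the development environment was used.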
The knowledge bottleneck, particularly in relation to language, has not been and likely will not be solved by any one individual or research team. Open source information extraction will, I believe, make greater progress toward addressing it.
Obviously, money must be made somewhere. One source is public funding, including contributions from private organisations which see a value in building public infrastructure. Another, as with other open source software, systems, or public information, is to make money “around” the free material by adding non-core goods, services, or advertising.
By Adam Wyner
Distributed under the Creative Commons
Attribution-Non-Commercial-Share Alike 2.0
Hi Adam,
I largely agree with your stance on openness.
Where we differ in opinion is how you present this. In my opinion you are wrong to single out GATE the way you do as a special case of the lack of openness you want to criticize.
Reproducibility and openness are of course very important in science and have always been firmly embedded in GATE’s philosophy.
I contend that the development of GATE applications around the world has fostered scientific progress and collaboration. The evidence is all over the web. Just look at the GATE web site to appreciate how large the GATE community is. Just read the GATE papers to encounter exactly the same opinions that you put forward. It is essential to GATE that people work with its freely available code. That’s why the architecture itself is open source, and new applications are constantly made publicly available. Solving the knowledge bottleneck is in the case of GATE a truly collaborative activity.
Most freely available tools come as executables, i.e. without source code. Increasingly tools are also offered as web services. This is general practice.
Depending on the rigidity of one’s principles, opinions may vary about whether applications should be made available as open source code on the one hand or as executables or web services on the other.
In my view, what is crucial is the level of detail of the description of the tools’ functionality. In principle, as in other branches of science, reproducibility entails independent verification of both methodology and results. Executables and web services provide the results, but hide the source code. Adequate description of the methods, e.g. coverage of grammars and statistical algorithms, should warrant reproducibility. Description of scientific method, not the provision of scientific technology, is fundamental to scientific progress.
As an analogy, in chemistry, results can be verified without the original author supplying laboratory equipment or analytical software.
It is therefore fully in keeping with the scientific method if, for instance, in NLP, the verification of noun phrase annotation performance involves the evaluation of the results of an NP annotator together with a description of its functionality, but without the provision of the source code.
You imply that all Sheffield development of GATE should be made public. I reply that this is our principle, except when tools are still under development, or subject to legal restrictions imposed by the funder.
Best wishes,
Wim
Dr. W. Peters
Natural Language Processing group
Department of Computer Science
University of Sheffield
Regent Court
211 Portobello Street
Sheffield S1 4DP
tel: 00-44-114-2221902
fax: 00-44-114-2221810
email: w.peters@dcs.shef.ac.uk