Open Linguistics

Fourth Workshop on Linked Data in Linguistics (LDL-2015): Resources and Applications

Christian Chiarcos — Thu, 18 Dec 2014 19:43:09 +0000

We are very happy to announce the next instantiation of the OWLG’s Linked Data in Linguistics (LDL) workshop series. The OWLG’s fourth Workshop on Linked Data in Linguistics is becoming increasingly international and, for the first time, will be held outside of Europe: on July 31st, 2015, in Beijing, China, co-located with ACL-IJCNLP 2015.

See you in Beijing!

4th Workshop on Linked Data in Linguistics (LDL-2015): Resources and Applications
Beijing, July 31st, 2015, http://ldl2015.linguistic-lod.org, co-located with ACL-IJCNLP 2015

Workshop Description

The substantial growth in the quantity, diversity and complexity of linguistic data accessible on the Web has led to many new and interesting research areas in Natural Language Processing (NLP) and linguistics. However, resource interoperability represents a major challenge that still needs to be addressed, in particular if information from different sources is combined. With its fourth instantiation, the Linked Data in Linguistics workshop continues to provide a major forum to discuss the creation of linguistic resources on the web using linked data principles, as well as issues of interoperability, distribution protocols, access and integration of language resources and natural language processing pipelines developed on this basis.

As a result of the preceding workshops, a considerable number of resources is now available in the Linguistic Linked Open Data (LLOD) cloud [1]. LDL-2015 will thus specifically welcome papers addressing the usage aspect of Linked Data and related technologies in NLP, linguistics and neighboring fields, such as Digital Humanities.

Organized by the interdisciplinary Open Linguistics Working Group (OWLG) [2], the LDL workshop series is open to researchers from a wide range of disciplines, including (computational) linguistics and NLP, but also the Semantic Web, linguistic typology, corpus linguistics, terminology and lexicography. In 2015, we plan to increase the involvement of the LIDER project [3] and the W3C Community Group on Linked Data for Language Technology (LD4LT) [4], to build on their efforts to facilitate the use of linked data and language resources for commercial applications, and to continue the success of LIDER’s roadmapping workshop series in engagement with enterprise.

[1] http://linguistics.okfn.org/resources/llod/
[2] http://linguistics.okfn.org/
[3] http://www.lider-project.eu/
[4] http://www.w3.org/community/ld4lt/

Topics of Interest

We invite presentations of algorithms, methodologies, experiments, use cases, project proposals and position papers regarding the creation, publication or application of linguistic data collections and their linking with other resources, as well as descriptions of such data. This includes, but is not limited to, the following:

A. Resources

Modelling linguistic data and metadata with OWL and/or RDF.
Ontologies for linguistic data and metadata collections as well as cross-lingual retrieval.
Descriptions of data sets following Linked Data principles.
Legal and social aspects of Linguistic Linked Open Data.
Best practices for the publication and linking of multilingual knowledge resources.

B. Applications

Applications of such data, other ontologies or linked data from any subdiscipline of linguistics or NLP.
The role of (Linguistic) Linked Open Data to address challenges of multilinguality and interoperability.
Application and applicability of (Linguistic) Linked Open Data for knowledge extraction, machine translation and other NLP tasks.
NLP contributions to (Linguistic) Linked Open Data.

We invite both long (8 pages and 2 pages of references, formatted according to the ACL-IJCNLP guidelines) and short papers (4 pages and 2 pages of references) representing original research, innovative approaches and resource types, use cases or in-depth discussions. Short papers may also represent project proposals, work in progress or data set descriptions.

Dataset Description Papers

In addition to full papers and regular short papers, authors may submit short dataset description papers describing a resource’s availability, published location and key statistics (such as size). Such papers do not need to show a novel method for the creation or publishing of the data; instead, they will be judged on the quality, usefulness and clarity of the description given in the paper.

For contact information, submission details and last-minute updates, please consult our website at http://ldl2015.linguistic-lod.org

Important Dates

May 8th, 2015: Paper submission
June 5th, 2015: Notification of Acceptance
June 21st, 2015: Camera-Ready Copy
July 31st, 2015: Workshop

Organizing Committee

Christian Chiarcos (Goethe University Frankfurt, Germany)
Philipp Cimiano (Bielefeld University, Germany)
Nancy Ide (Vassar College, USA)
John P. McCrae (Bielefeld University, Germany)
Petya Osenova (Bulgarian Academy of Sciences, Bulgaria)

Program Committee

Eneko Agirre (University of the Basque Country, Spain)
Guadalupe Aguado (Universidad Politécnica de Madrid, Spain)
Claire Bonial (University of Colorado at Boulder, USA)
Peter Bouda (Interdisciplinary Centre for Social and Language Documentation, Portugal)
Antonio Branco (University of Lisbon, Portugal)
Martin Brümmer (University of Leipzig, Germany)
Paul Buitelaar (INSIGHT, NUIG Galway, Ireland)
Steve Cassidy (Macquarie University, Australia)
Nicoletta Calzolari (ILC-CNR, Italy)
Thierry Declerck (DFKI, Germany)
Ernesto William De Luca (University of Applied Sciences Potsdam, Germany)
Gerard de Melo (University of California at Berkeley)
Judith Eckle-Kohler (Technische Universität Darmstadt, Germany)
Francesca Frontini (ILC-CNR, Italy)
Jeff Good (University at Buffalo)
Asunción Gómez Pérez (Universidad Politécnica de Madrid, Spain)
Jorge Gracia (Universidad Politécnica de Madrid, Spain)
Yoshihiko Hayashi (Waseda University, Japan)
Fahad Khan (ILC-CNR, Italy)
Seiji Koide (National Institute of Informatics, Japan)
Lutz Maicher (Universität Leipzig, Germany)
Elena Montiel-Ponsoda (Universidad Politécnica de Madrid, Spain)
Steven Moran (Universität Zürich, Switzerland)
Sebastian Nordhoff (Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany)
Antonio Pareja-Lora (Universidad Complutense Madrid, Spain)
Maciej Piasecki (Wroclaw University of Technology, Poland)
Francesca Quattri (Hong Kong Polytechnic University, Hong Kong)
Laurent Romary (INRIA, France)
Felix Sasaki (Deutsches Forschungszentrum für Künstliche Intelligenz, Germany)
Andrea Schalley (Griffith University, Australia)
Gilles Sérasset (Joseph Fourier University, France)
Kiril Simov (Bulgarian Academy of Sciences, Sofia, Bulgaria)
Milena Slavcheva (JRC-Brussels, Belgium)
Armando Stellato (University of Rome, Tor Vergata, Italy)
Marko Tadić (University of Zagreb, Croatia)
Marieke van Erp (VU University Amsterdam, The Netherlands)
Daniel Vila (Universidad Politécnica de Madrid)
Cristina Vertan (University of Hamburg, Germany)
Walther v. Hahn (University of Hamburg, Germany)
Menzo Windhouwer (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)

Third Workshop on Linked Data in Linguistics, Reykjavik, 27th May 2014

John P. McCrae — Tue, 13 May 2014 13:07:24 +0000

The Open Linguistics Working Group, in conjunction with the W3C OntoLex Community Group, is organizing the Third Workshop on Linked Data in Linguistics (LDL-2014), co-located with the Language Resources and Evaluation Conference (LREC) in Reykjavik, Iceland, on the 27th of May 2014. LDL-2014 is also supported by two recently started EU projects. The first is LIDER (Linked Data as an enabler of cross-media and multilingual content analytics for enterprises across Europe), which aims to provide an ecosystem for the establishment of linguistic linked open data, as well as media resources metadata, for a free and open exploitation of such resources in multilingual, cross-media content analytics across Europe. The second is QTLeap (Quality Translation with Deep Language Engineering Approaches), which explores novel ways of attaining higher-quality machine translation that are opened up by a new generation of increasingly sophisticated semantic datasets (including Linked Open Data) and by recent advances in deep language processing.

The goal of the workshop is twofold. First, we will assemble researchers from various fields of linguistics, natural language processing, knowledge management and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking mono- and multilingual linguistic and knowledge data collections, including corpora, grammars, dictionaries, wordnets, translation memories, domain specific ontologies etc. In particular, we will discuss the application of the Linked Open Data paradigm to linguistic data as it might provide an important step towards making linguistic data: i) easily and uniformly queryable, ii) interoperable and iii) sharable over the Web using open standards such as the HTTP protocol and the RDF data model [1].

Secondly, we will provide researchers in natural language processing and semantic web technologies with a platform to present case studies and best practices on the exploitation of linguistic resources exposed on the Web for Natural Language Processing applications, or other content-centered applications such as content analytics, knowledge extraction, etc. The availability of massive linked open knowledge resources raises the question of how such data can be suitably employed to facilitate different NLP tasks and research questions. This workshop will also present several contributions to the Linguistic Linked Open Data (LLOD) cloud [2], in particular contributions that demonstrate an added value resulting from the combination of linked datasets and ontologies as a source for semantic information with linguistic resources published according to linked data principles. Another important question that will be addressed in the workshop is how Natural Language Processing techniques can be employed to further facilitate the growth and enrichment of linguistic resources on the Web.

META-NET Data Liberation Campaign

John Judge — Mon, 19 Nov 2012 11:25:36 +0000

Data Liberation Campaign

Recently META-NET, an extensive network of excellence in Language Technologies, conducted an internal survey to find out about the existence and availability of national corpora (or similar) for the various European languages covered by the network. The results of this small study showed that for almost every European language there exists some reference corpus of an established quality and in many cases produced or otherwise endorsed by the respective official language body. However, despite these corpora existing and being held by national organisations in the majority of cases, it is not possible for language technology researchers to get access to these corpora for their own work. For example, it is not possible for researchers to download or to run their own analysis over the data.
In most cases, the reasons cited for these restrictions are copyright and redistribution restrictions that the corpus owners or corpus compilers have agreed with the publishers who provided the source data. These restrictions prevent researchers from using the data for non-profit purposes such as scientific research, which can benefit the entire language and language technology community. This is a striking finding in the wake of the recent publication of the Languages in the Digital Age series, which highlighted that a lack of resources or a lack of availability of resources is putting many European languages at risk in the digital age.

In response to these findings, META-NET have prepared an open letter to all the official language bodies in Europe and to those holding onto the various corpora calling on them to consider trying to make this important language data available for research purposes. The open letter is also addressed to the European Patent Office. In this letter META-NET ask for those with the power to do so to reconsider their distribution policies to allow greater access to their data for research. They also offer to provide a safe and secure mechanism (META-SHARE) to share the data should they choose to do so, and also any additional help their legal team can provide regarding licensing, copyright and other legal issues.

The open letter, including a recipient list, is reproduced below. If you feel that there is a huge benefit to liberating these corpora and making them available for research then please contact your local language body and let them know that you are in favour of the META-NET proposal.

META-NET-Open-Letter

UBY – A Large-Scale Unified Lexical-Semantic Resource (UBY 1.0) released

Judith Eckle-Kohler — Sat, 31 Mar 2012 14:03:00 +0000

We are pleased to announce the release of UBY 1.0 –

a large-scale lexical-semantic resource for natural language processing (NLP)
based on the ISO standard Lexical Markup Framework (LMF), see UBY website.

UBY combines a wide range of information from expert-constructed and collaboratively constructed resources for English and German.

Currently, UBY holds structurally and semantically interoperable versions of nine resources in two languages:

English WordNet, Wiktionary, Wikipedia, FrameNet and VerbNet,
German Wikipedia, Wiktionary and GermaNet, and multilingual OmegaWiki.

A subset of these resources is linked at the word sense level.
There are monolingual sense alignments between VerbNet–FrameNet and VerbNet–WordNet as well as between WordNet–Wikipedia and WordNet–Wiktionary.
In addition, UBY provides cross-lingual sense alignments between WordNet and German OmegaWiki,
also including the inter-language links given in Wikipedia and OmegaWiki.

All resources in UBY are represented according to our LMF lexicon model, UBY-LMF.
UBY-LMF captures lexical information at a fine-grained level by employing a large number of Data Categories from ISOCat.

Highlights of UBY:

The union of a wide range of heterogeneous resources in a single, standardized resource.
The linking at the word sense level between a subset of the resources.

UBY is complemented by a Java API, the UBY-API, and conversion tools (e.g., for converting the resources to UBY-LMF).
The UBY-API and conversion tools are available at Google Code:

http://code.google.com/p/uby/

Highlights of the UBY-API:

Unified access to the various information types in the nine resources.
Easy cross-resource access to the various information types in the resources.

A tutorial showing the use of the UBY-API can be found at

http://code.google.com/p/uby/wiki/ApiTutorial

A Web Interface for exploring and visualizing UBY is currently being developed and will soon be available
at the UBY website.

This project was initiated under the auspices of Prof. Dr. Iryna Gurevych, Ubiquitous Knowledge Processing Lab (UKP), Technische Universität Darmstadt.
We are grateful for the generous financial support from the Volkswagen Foundation and the German Research Foundation.

Please direct any questions or suggestions to uby-users@googlegroups.com

Workshop on Linked Data in Linguistics (LDL-2012)

Christian Chiarcos — Mon, 05 Mar 2012 10:39:48 +0000

The Open Linguistics Working Group is organizing a workshop on Linked Data in Linguistics, March 7 – 9, 2012, Frankfurt/Main, Germany, as part of the 34th Annual Meeting of the German Linguistic Society (DGfS).

The explosion of information technology has led to a substantial growth in quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked. This workshop will present principles, use cases, and best practices for using the linked data paradigm to represent, exploit, store, and connect different types of linguistic data collections.

We are glad to welcome researchers from the fields of language documentation, typology, computational linguistics, corpus linguistics, as well as researchers from other empirically-oriented disciplines of linguistics who share an interest in data and metadata modelling with Semantic Web technologies such as RDF or OWL.
Aside from numerous presentations from such diverse fields of linguistics, information technology and neighboring disciplines, we are happy to announce two invited speakers, Nancy Ide (Vassar College) and Martin Haspelmath (Max Planck Institute for Evolutionary Anthropology, Leipzig).

We’re very much looking forward to welcoming you in Frankfurt. For anyone who cannot attend, we have made the proceedings available online.
Please see the LDL-2012 website for detailed information on program, venue, and registration. The workshop will have both online proceedings and a printed companion volume. The online proceedings are available from the Workshop page.

Workshop on Open Data in Linguistics

Sebastian Hellmann — Thu, 02 Jun 2011 10:34:50 +0000

The workshop will be held on June 30th, 17:30 in Workshop II during the OKCon 2011.

To read up on the current status of the Open Linguistics Working Group, see this blog post.

At the beginning there will be 6 presentations:

Dennis Spohr – Linking lexical resources and ontologies on the Semantic Web with lemon – Slides – There are a large number of ontologies currently available on the Semantic Web. However, in order to exploit them within natural language processing applications, more linguistic information than can be represented in current Semantic Web standards is required. Further, there are a large number of lexical resources available representing a wealth of linguistic information, but this data exists in various formats and is difficult to link to ontologies and other resources. We present a model we call lemon (Lexicon Model for Ontologies) that supports the sharing of terminological and lexicon resources on the Semantic Web as well as their linking to the existing semantic representations provided by ontologies. We demonstrate that lemon can succinctly represent existing lexical resources and that, in combination with standard NLP tools, we can easily generate new lexica for domain ontologies according to the lemon model. We demonstrate that by combining generated and existing lexica we can collaboratively develop rich lexical descriptions of ontology entities. We also show that the adoption of Semantic Web standards can provide added value for lexicon models by supporting a rich axiomatization of linguistic categories that can be used to constrain the usage of the model and to perform consistency checks.
Ernesto William De Luca – Multilingual Lexical Linked Data – A lot of the information that is already available on the Web, or retrieved from local information systems and social networks, is structured in data silos that are not semantically related. Semantic technologies show that typed links, which directly express the relations between data items, are an advantage for every application that can reuse the incorporated knowledge about the data. In this presentation, we present our work on providing Lexical Linked Data (LLD) through a meta-model that contains all the resources and makes it possible to retrieve and navigate them from different perspectives. We show some use cases where we link lexical data, and show how to reuse and infer semantic data derived from lexical data.
Christian Chiarcos – Modelling linguistic corpora and their annotations with OWL/DL – Slides
Sebastian Nordhoff – The Glottolog/Langdoc Project: Publishing a bibliographical database of 200,000 references for 7,000 languages as Linked Data – Slides
Sebastian Hellmann – NIF: NLP Interchange Format – Slides
Richard Littauer – Towards Open Methods: Using Scientific Workflows in Linguistics – Slides

The second part will consist of a mixture of an Open Panel and a Q&A session. At any time, people can come to the front and make a statement, which we will then discuss. Topics include, but are not limited to:

Incentives for publishing data: Requirement analysis for a Scientific Journal as a forum for publishing data
Best practices for publishing Open Linguistic Data

If you have any questions, or if you’re interested in keeping in touch, please write to the [open-linguistics mailing list](https://lists.okfn.org/mailman/listinfo/open-linguistics)!

The Open Linguistics Working Group

Christian Chiarcos — Fri, 20 May 2011 08:08:44 +0000

Status Quo and Perspectives, by Christian Chiarcos and Sebastian Hellmann

Since its formation last year, the Open Linguistics Working Group (OWLG) has been steadily growing, and the direction in which the working group is heading has been clarified (although a number of issues remain open). In recent months, we have concentrated on identifying goals and directions for the working group to pursue. In this blog post, we summarize the results of this process, the group’s current status, and the main challenges and problems we have identified so far.

An important result of our discussions is the set of seven points described in the next section, which define the purpose of the working group. In the section after that, we summarize four major problems and challenges of working with linguistic data. Such problems will become a primary topic of the Working Group. Thereafter, we give an overview of the current status and activities of the group and provide some suggestions for how to get involved.

Purpose

As a result of numerous discussions with interested linguists, NLP engineers and information technology experts, we identified seven open problems for our respective communities and their ways of using, accessing and sharing linguistic data. These represent the challenges to be addressed by the working group, and the role that it is going to fulfil:

Promote the idea and definition of open data, as specified at opendefinition.org, in linguistics and in relation to language data.
Act as a central point of reference and support for people interested in open linguistic data.
Provide guidance on legal issues surrounding linguistic data to the community.
Build an index of indexes of open linguistic data sources and tools and link existing resources.
Facilitate communication between existing groups.
Serve as a mediator between providers and users of technical infrastructure.
Assemble best-practice guidelines and use cases to create, use and distribute data.

In many respects, the OWLG is not unique with respect to these goals. Indeed, there are numerous initiatives with similar motivation, e.g., the Cyberling blog, the ACL Special Interest Group for Annotation, and large multi-national initiatives such as the ISO initiative on Language Resources Management (ISO TC37/SC4) or European projects such as CLARIN, FLARENET and META-NET. The key difference between these and our Working Group is that we are not affiliated with an existing organization or one particular community. Instead, our members represent the whole bandwidth from academic linguistics (with its various subfields, e.g., typology and corpus linguistics) through applied linguistics (e.g., language documentation, computational linguistics, computational lexicography) and computational philology to natural language processing and information technology. We do not consider ourselves to be in competition with any existing organization, but hope to establish new links and further synergies between these.

In the following section, we summarize typical and concrete scenarios where such an interdisciplinary community may help to resolve problems observed (or, sometimes, overlooked) in the daily practice of working with linguistic resources.

Open linguistics resources, problems and challenges

Among the broad range of problems associated with linguistic resources, we identified four major classes of problems and challenges during our discussions that may be addressed by the OWLG. First, there is a great uncertainty with respect to legal questions of the creation and distribution of linguistic data; second, there are technical problems such as the choice of tools, representation formats and metadata standards for different types of linguistic annotation; third, we have not yet identified a point of reference for existing open linguistic resources; finally, there is the agitation challenge, i.e., how (and whether) we should convince our collaborators to release their data under open licenses.

These challenges are described below in detail.

1. Legal questions

The linguistic community is becoming increasingly aware of the potentially difficult legal status of different types of linguistic resources:

How do I find a suitable license for my corpus?
Whose copyright do I have to respect? For example, corpora may have complex copyright situations where the original authors own the primary data, and thus may have partial copyright on the entire collection.
Are there exceptions to copyright (e.g., for academic research) that may allow me to work with my corpus anyway?
How can copyright issues be circumvented (or solved)?
What legal restrictions apply to a particular resource (e.g., web corpora, newspaper corpora, digitizations of printed editions, audio and video files)?
How can we create multi-media (audio, video) data collections in a way that allows us to use (and, hopefully, distribute) them for research?

The situation is even more complex because the legal situation may change over time (e.g., German copyright law was changed twice within the last decade), and this complexity multiplies on an international scale. The OWLG provides a platform to discuss such problems, to collect recommendations and to document use cases as found in publications and technical reports, and as discussed at conferences and on mailing lists.

2. Technical problems

Often, when creating a new corpus in a novel domain, the question arises of which tool to choose for which type of annotation. The OWLG will encourage the documentation of use cases, collect links to documented case studies and best-practice recommendations (e.g., by EMELD or FLARENET), and participate in the maintenance of existing sites that provide an overview of annotation tools and their domains of application (e.g., the Linguistic Annotation Wiki, or corresponding parts of the ACL Wiki).
A question related to the choice of tools is which representation formalisms to choose. We intend to provide basic information about proposed standard formats (e.g., the ISO proposal LAF/GrAF, the specifications of the Text Encoding Initiative [TEI]) and applicable formalisms (e.g., XML or RDF). These formats, again, are closely related to the question of which corpus infrastructure (database, search interface) may be suitable to store, query and visualize which kinds of linguistic annotation (e.g., domain- and community-specific tools like Toolbox and ELAN, or general-purpose corpus query tools like ANNIS).
A third problem concerns documentation requirements for different types of resources, the use of metadata standards (e.g., Dublin Core, or the TEI header), and how annotation documentation and interoperability can be improved by linking linguistic resources with terminology repositories (e.g., GOLD, ISOcat).
The OWLG aims to collect such questions and (partial) answers to them. We will contribute to existing metadata repositories and co-operate with other initiatives that pursue similar goals, e.g., the ACL Special Interest Group in Linguistic Annotation. As opposed to these, the OWLG does not require membership in a particular organization, and we focus on linguistic resources released under an open license. Further, we encourage (but do not require) the conversion of linguistic resources to Linked Data.

3. Overview over existing resources

If a new research question is to be addressed, the question arises of which resources may already be available and whether they are accessible. Often, this problem is still solved by asking experts on mailing lists, e.g., the CORPORA list.
Therefore, the OWLG has begun to collect metadata about open linguistic resources within the CKAN repository. Although there are other metadata repositories available (e.g., those maintained by META-NET, FLARENET, or CLARIN), the CKAN repository is qualitatively different in two respects. On the one hand, CKAN focuses on the license status of the resources, and it encourages the use of open licenses. On the other hand, it is not specifically directed at linguistic resources; rather, it is used by a large set of different working groups, whose resources may be exploited by linguists (e.g., exhaustive collections of legal documents from several countries [from law], or the open richly annotated cuneiform corpus [from archeology]).

4. Agitation

One of the goals of the OWLG is the promotion of open licenses for linguistic data collections. As we know from practical experience, researchers sometimes hesitate to provide their data under an open license. There are many different reasons for this, ranging from uncertainty with respect to the legal situation to the (understandable) fear that other people will exploit the resources before the original author has had the chance to do so.
We hope to contribute to the clarification of legal issues and to provide case studies that may help to clarify these problems. For example, one solution for the second aspect mentioned above may be that data collections are designed as open linguistic resources from the beginning, but that their publication is delayed for several years, so that the creators can exploit the data long enough before any competitor may get hold of it.
One important argument in favor of open resources in academia is that only resources that are available to other researchers make it possible for empirically working linguists to meet elementary scientific standards such as verifiability. Following this premise, we intend to promote the use of open resources in linguistics.

Current status and on-going developments (as of May 19th, 2011)

So far, we have focused on delineating which questions the Open Linguistics Working Group may address, and on formulating its general goals and potentially fruitful application scenarios. This blog entry summarizes these discussions, and it concludes a critical step in the formation process of the working group: having defined a (preliminary) set of goals and principles, we can now concentrate on the tasks at hand, collecting resources and attracting interested people in order to address the challenges identified above.

At the moment, our Working Group assembles 32 people from 21 different organizations and 7 countries (Germany, US, UK, France, Canada, Hungary, and Slovenia). Our group is relatively small, but continuously growing and sufficiently heterogeneous. It includes people from library science, typology, historical linguistics, cognitive science, computational linguistics, and information technology, just to name a few, so, the ground for fruitful interdisciplinary discussions has been laid out.

We are very glad that famous linguists such as Nancy Ide (Text Encoding Initiative, American National Corpus, Vassar College) and Christiane Fellbaum (WordNet, Princeton University) accepted our invitation to post guest blogs. We would like to intensify this tradition and encourage all members of the OWLG to describe interesting projects and experiences on this medium, to share insights and difficulties over the Open Linguistics mailing list, and, of course, to join our meetings and telcos. The next meeting is about to be held in conjunction with the Fifth Open Knowledge Conference (OKCon), June 30th to July 1st 2011 in Berlin, Germany, and of course the OKCon itself is a great reason to join us there.

As for our first concrete activities, we have begun to compile a list of resources of particular interest to the members of the working group. Most of these resources are free, others are partially free (i.e., annotations free, but text under copyright), and a few have been included that are very representative for a particular type of resource (e.g., corpora derived from the Penn Treebank as a prototypical multi-layer corpus). Altogether, the list comprises 102 entries by now, and the next step would be to register them at the CKAN metadata repository and to select a few for deeper investigation.

One aspect of such investigations may be the conversion of some of the resources to RDF and their provision as Linked Data. Several working group members (including the authors of this blog post) are working in this direction.
The ultimate result may be a Linguistic Linked (Open) Data cloud, as sketched in the graphic to the right. On this basis, novel applications in all participating fields may be developed.
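
To give a rough idea of what converting a resource to RDF can involve, the following minimal sketch serializes a single lexicon entry as N-Triples using only the Python standard library. Everything specific in it is an assumption made for illustration: the example.org namespace, the sample entry, the field names and the choice of predicates (rdfs:label, dcterms:license, owl:sameAs) are not taken from any particular resource or from a working group recommendation.

```python
# Minimal sketch: turn one lexicon entry into N-Triples lines.
# The namespace, the sample entry and the predicate choices are hypothetical.

DCTERMS = "http://purl.org/dc/terms/"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
BASE = "http://example.org/lexicon/"  # hypothetical namespace for the resource


def uri(value):
    """Wrap an absolute URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(text, lang=None):
    """Build a plain literal with an optional language tag."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    suffix = "@" + lang if lang else ""
    return '"' + escaped + '"' + suffix


def entry_to_ntriples(entry):
    """Return the entry as a string with one N-Triples statement per line."""
    subject = uri(BASE + entry["id"])
    triples = [
        (subject, uri(RDFS + "label"), literal(entry["lemma"], entry["lang"])),
        (subject, uri(DCTERMS + "license"), uri(entry["license"])),
    ]
    # Links to other data sets are what make the result "linked" data.
    for target in entry.get("same_as", []):
        triples.append((subject, uri(OWL + "sameAs"), uri(target)))
    return "\n".join(" ".join(triple) + " ." for triple in triples)


sample_entry = {
    "id": "tree-n-1",
    "lemma": "tree",
    "lang": "en",
    "license": "http://creativecommons.org/licenses/by/3.0/",
    "same_as": ["http://example.org/other-lexicon/tree"],
}

ntriples = entry_to_ntriples(sample_entry)
print(ntriples)
```

A real conversion would normally rely on an established vocabulary for the resource type and on an RDF library rather than on string concatenation. The sketch is only meant to show that each statement is a subject, predicate and object, and that links to other data sets are expressed in the same way as any other statement.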

Get involved

Having said all that, we hope to have encouraged others to contribute and to join. If we have indeed succeeded in doing so, you may be interested in how to join and how to contribute:

How to join

Sign up for the Open Linguistics mailing list
Put your name on the Wiki page
Optional: Contribute

How to contribute

Register your (open) resources at CKAN (and please, don’t forget to tag them as “linguistics”)
Attend meetings / telcos (announced over the mailing list)
Write blog posts for our blog
Become a group administrator on CKAN (on request)
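
For the first item above, registering a resource at CKAN, the metadata for a data set can be prepared as a small JSON record. The sketch below only builds such a record and does not contact any server. The field names (name, title, license_id, tags) follow CKAN's package_create action as we understand it and should be checked against the documentation of the CKAN instance in use; the data set name, title and license identifier are made up for illustration.

```python
import json

# Minimal sketch: prepare CKAN-style package metadata for an open resource.
# Field names are assumed to match CKAN's package_create action; the sample
# values ("example-treebank", "cc-by", "corpus") are hypothetical.


def build_ckan_package(name, title, license_id, extra_tags=()):
    """Return a metadata dict that always carries the "linguistics" tag."""
    tags = ["linguistics"] + [tag for tag in extra_tags if tag != "linguistics"]
    return {
        "name": name,              # URL-friendly identifier of the data set
        "title": title,            # human-readable title
        "license_id": license_id,  # identifier of an open license
        "tags": [{"name": tag} for tag in tags],
    }


package = build_ckan_package(
    "example-treebank", "Example Treebank", "cc-by", ["corpus"]
)
payload = json.dumps(package, indent=2, sort_keys=True)
print(payload)
```

Putting the "linguistics" tag into the helper itself means it cannot be forgotten when several resources are registered in a row.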