W3C Workshop Program:
The Multilingual Web – Linked Open Data and MultilingualWeb-LT Requirements
11–13 June 2012, Dublin
Last updated: 16 July 2012
The MultilingualWeb-LT project is a continuation of the MultilingualWeb project. Coordinated by the W3C, it aims to develop metadata for Web content (mainly HTML5) and deep Web content (e.g., XML files from which HTML pages are generated) that facilitates interaction with multilingual technologies and localization processes. The project aims to raise the visibility of existing best practices and standards and to identify gaps. This fifth MultilingualWeb workshop, held in Dublin, was hosted by Trinity College Dublin.
Day one (11 June) focused on the intersection between Linked Open Data and multilingual technologies. One of the major issues in the evolution of Linked Open Data is its lack of multilinguality. There is a strong separation between terminology, lexical and language resources on the one hand, and the technologies used for Linked Data on the other. This gap makes it difficult to exploit potential synergies between these approaches, and to exploit the possibilities that multilingual web technologies offer for Linked Data. Removing silos and integrating these technologies is therefore an important goal, and was the main topic of this portion of the workshop.
Days two and three (12–13 June) focused on the MultilingualWeb-LT project, with the aim of refining and developing the project's initial list of requirements. The workshop gathered detailed feedback about the requirements and the metadata items.
NB: In contrast to previous MultilingualWeb workshops, this workshop had two specific foci and was aimed at a narrower cross-section of a technically oriented audience, with the goal of producing concrete proposals about next steps in the two areas described above. While future MultilingualWeb workshops will continue the format of broad events and will aim at a larger audience, attendees for this workshop were required to participate actively.
The IRC log is the raw scribe log, which has not undergone careful post-editing and may contain errors or omissions; it should be read with that in mind. It represents the scribes' best effort to capture, in real time, the gist of the talks and the discussions that followed. IRC was used not only to capture notes on the talks, but also to let remote participants, and participants with accessibility needs, follow the sessions in real time. People following on IRC could also add their own contributions to the flow of text.
As many of the sessions, especially on June 12 and 13, were working meetings, no slides or recordings are available. Interested parties should refer instead to the IRC log links provided below.
Linked Open Data
Professor Vincent Wade
Trinity College Dublin
Greetings from the University
W3C Internationalization Activity Lead
Brief introductory remarks
Research Lecturer, Knowledge and Data Engineering Group, School of Computer Science & Statistics, Trinity College Dublin
Welcome and Discussion of Goals
Linked Open Data and Connecting Europe
Keynote: The Privilege and Responsibility of Personal and Social Freedom in a World of Autonomous Machines
abstract The next generations of computing devices are going to form an advanced network of autonomous sensing, the Internet of Things. What is the impact of this development, and how can we prepare individuals and society as a whole to cope with the onslaught of accelerating change in our lives and our civilization?
Publications Office of the EU
Multilingualism and Linked Open Data in the EU Open Data Portal and Other Projects of the Publications Office
abstract As the official publisher of the EU Institutions, the Publications Office (PO) is concerned with multilingual publishing on the Web. This presentation will discuss different projects of the PO, the possible impact of standardisation in terms of multilingual interoperability, and the contribution provided by "official" information providers in terms of quality and provenance.
Juliane Stiller & Marlies Olensky
Researchers, Berlin School of Library and Information Science, Humboldt-Universität zu Berlin
Europeana: A Multilingual Trailblazer
abstract For Europeana, the single access point to digital cultural heritage aggregated from libraries, museums and archives, multilinguality is a crucial feature to enable universal access. Europeana's content is aggregated from 33 different countries, and it is Europeana's mission to provide access to cultural heritage content irrespective of the language of the content or the user's mother tongue. Multilingual access will be implemented in Europeana in different layers: (1) translation of the user interface of all static pages, (2) cross-lingual search (query translation and document translation), (3) subject browsing (controlled vocabulary and semantic data layer). A challenge is the scope and coverage of semantic resources, which do not cover all domains and all languages. Therefore the new Europeana Data Model builds on and accommodates different standards and provides interoperability while preserving the original data. It uses Semantic Web principles to represent the objects. The Europeana Data Model's feasibility was confirmed in workshops with representatives from the archive, library, audiovisual archive and museum communities. Last year, a LOD pilot was released in cooperation with partners that were willing to publish their metadata as Linked Open Data. At its launch it comprised 3.5 million of the 19 million objects represented in Europeana. Hence, Europeana is a trailblazer for the planned European Open Data Portal. The work that has been done in the related Europeana projects with respect to data modelling, multilinguality and publishing Linked Open Data is an important step towards such a portal and should be reused by other European initiatives.
[Chair: Paul Buitelaar • Scribe: Felix Sasaki]
AKSW, University of Leipzig
Linked Data in Linguistics for NLP and Web Annotation
abstract This presentation introduces three major data pools that have recently been made freely available as Linked Data by a collaborative community process: (1) the DBpedia Internationalization committee is concerned with the extraction of RDF from the language-specific Wikipedia editions; (2) the creation of a configurable extractor based on DBpedia and able to extract information from all languages of Wiktionary with manageable effort; (3) the Working Group for Open Linguistic Data, an Open Knowledge Foundation group with the goal of converting Open Linguistics data sets to RDF and interlinking them. The presentation highlights and stresses the role of Open Licences and RDF for the sustenance of such pools. It also provides a short update on the recent progress of NIF (Natural Language Processing Interchange Format) by the LOD2-EU project. NIF 2.0 will have many new features, including interoperability with the above-mentioned data pools as well as major RDF vocabularies such as OLiA, Lemon, and NERD. Furthermore, NIF can be used as an exchange language for Web annotation tools such as AnnotateIt, as it uses robust Linked Data aware identifiers for Website annotation.
Research Assistant, Trinity College Dublin
Linking Localisation and Language Resources
abstract We introduce Localisation & Language Linked Data (L3D) as a way to connect content creators, consumers, language service providers and translators. Our Drupal-based platform, CMS-LION, provides text segmentation, machine translation, crowd-sourced post-editing, and content review. The output of each component is inter-connected through Linked Data, which is enriched using provenance data. This allows post-edits to be extracted and used in the re-training of Statistical Machine Translation (SMT) engines in close collaboration with the Panacea (http://www.panacea-lr.eu/) research project. Panacea offers web-service-based workflows for executing Natural Language Processing (NLP) processes. This talk will introduce the platform, its features, and the collaboration with Panacea, as well as discuss our current and future research agenda.
Jose Emilio Labra Gayo
Associate Professor, University of Oviedo
Best Practices for Multilingual Linked Open Data
abstract This presentation gives an overview of guidelines and best practices for publishing multilingual linked open data. There are many projects that publish linked open data with the data in multiple languages and this presentation will address some of the issues and how to resolve them for these kinds of projects.
[Chair: Dave Lewis • Scribe: Arle Lommel]
Linked Open Data and the Lexicon
Director, Translation Research Group, Brigham Young University
Bringing Terminology to Linked Data through
abstract There is currently a "strong separation" between the well-established fields of terminology and Linked Data. This presentation describes a proposal to adapt TermBase eXchange (TBX, ISO 30042)—the international standard for representing complex, typically multilingual termbases—for use in the Linked Data and MultilingualWeb-LT communities. This version would be called RDF-TBX and would be isomorphic to the current version of TBX. It would help bring the terminology community and the Linked Data community (more generally, the Semantic Web community) closer together. Possible applications include (a) linking terms in data to a termbase that has been converted to RDF-TBX and (b) carrying out automated conversions between existing terminology resources with concept relations and existing Semantic Web ontology resources. I would like to determine through face-to-face interaction at the workshop whether this project, if carried out, is likely to address the separation between terminology and Linked Data. Depending on the reaction to this position by the Dublin workshop participants, an RDF-TBX project could begin immediately at the Brigham Young University Translation Research Group.
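To make the idea concrete, the sketch below maps a minimal TBX-style concept entry to RDF triples. Note that the `tbx:` namespace and predicate names here are hypothetical placeholders—RDF-TBX was only a proposal at the time of this workshop and had not defined a vocabulary.

```python
# Sketch: mapping a minimal TBX-style concept entry to N-Triples.
# The namespace and predicate names are hypothetical placeholders.

TBX = "http://example.org/rdf-tbx#"    # placeholder vocabulary namespace
BASE = "http://example.org/termbase/"  # placeholder termbase namespace

def entry_to_ntriples(concept_id, terms):
    """terms: mapping of language tag -> preferred term string."""
    subject = f"<{BASE}{concept_id}>"
    triples = [f'{subject} <{TBX}conceptEntry> "{concept_id}" .']
    for lang, term in terms.items():
        # Language-tagged literals keep each term tied to its language,
        # which is what makes the converted termbase multilingual LOD.
        triples.append(f'{subject} <{TBX}term> "{term}"@{lang} .')
    return "\n".join(triples)

print(entry_to_ntriples("C001", {"en": "browser", "de": "Browser", "fr": "navigateur"}))
```

Application (a) from the abstract then reduces to linking a term occurrence in content to the `<...termbase/C001>` URI, and application (b) to mapping these concept-oriented triples onto SKOS-like ontology resources.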
Managing Director, Interverbum Technology
Challenges with Linked Data and Terminology
abstract This presentation will address some of the current challenges in integrating structured terminology data into Linked Open Data, with a focus on practical issues that arise in commercial environments.
Extending the Use of Web-Based Terminology Services
abstract Today terminology plays an extremely important role in multilingual Europe in ensuring efficient and precise communication. This presentation focuses on Tilde's projects and advances in establishing cloud-based platforms for acquiring, sharing, and reusing language resources to improve automated natural language data processing (e.g., machine translation). Multilingual consolidated and harmonized terminology is already utilized as data in the process of human translation, and it is now also being developed as a web-based service with machines (e.g., machine translation systems, indexing systems, and search engines) as users. This development has the potential to vastly enhance the degree of automation for linked open data technologies and reveal synergies between Linked Open Data and Multilingual Language Technologies.
Research Associate, University of Bielefeld
The Need for Lexicalization of Linked Data
abstract While linked data is frequently independent of language, the interpretation of this data requires natural language identifiers in order to be meaningful to the end user. For many applications of linked data, especially in multilingual contexts, it is necessary to go beyond the simple string label and provide a richer description of the lexicalization of the linked data entities, for example by generating natural language descriptions of the data. To address this gap we have proposed a model, which we call Lemon (Lexicon Model for Ontologies), that distinguishes the labels at both the semantic and syntactic levels. Lemon aims to build on existing models for representing lexical information, but is concise, descriptive and modular. Furthermore, this model is designed to bridge the gap between the existing linked data cloud, described in formats such as RDF(S) and OWL and the rapidly growing linguistic linked data cloud, where a significant amount of multilingual data already exists. I will show examples of how we can use collaborative editing techniques with the Lemon model to create such data without significant effort and how this can be applied to tasks such as answering natural language questions over linked data.
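The semantic/syntactic split that Lemon draws can be sketched with a few triples: the ontology entity carries the language-independent meaning, while the lexical entry carries the per-language written form. This is a deliberately simplified illustration using example URIs; the full model adds parts of speech, morphology, and syntactic frames.

```python
# Sketch of a lemon-style lexical entry. Example entry and ontology URIs
# are illustrative; the property names follow the core lemon model.

LEMON = "http://lemon-model.net/lemon#"

def lemon_entry(entry_uri, written_rep, lang, ontology_ref):
    return "\n".join([
        f"<{entry_uri}> a <{LEMON}LexicalEntry> .",
        # Syntactic level: how the entry is written in one language.
        f"<{entry_uri}> <{LEMON}canonicalForm> _:form .",
        f'_:form <{LEMON}writtenRep> "{written_rep}"@{lang} .',
        # Semantic level: which linked-data entity the entry denotes.
        f"<{entry_uri}> <{LEMON}sense> _:sense .",
        f"_:sense <{LEMON}reference> <{ontology_ref}> .",
    ])

print(lemon_entry("http://example.org/lexicon/cat_en",
                  "cat", "en", "http://dbpedia.org/resource/Cat"))
```

A second entry pointing its sense at the same DBpedia resource but with a German written form is what multilingual lexicalization of a single linked-data entity looks like in this model.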
eGov Consultant, W3C
Cool URIs Are Human Readable
abstract In order to aid data interoperability between public administrations across Europe, the EU's ISA Programme is promoting the development of a system whereby things like standards, code lists and vocabularies can be discovered through a common metadata system called ADMS. In parallel, a number of core vocabularies have been developed for describing people, businesses and locations. And guess what? All the documentation is in English. The RDF and XML schemas are all in English and the terms in those vocabularies are all English words. All the hooks are in place for the schemas to be localised, but the problem goes a little deeper than merely finding the budget for localisation: Language is part of someone's identity. In some cases it's a defining national characteristic, and so in order to create a framework for interoperability across national borders we need not only to get the language right but also the culture and the trust. Vocabularies must be available at a stable URI, be subject to an identifiable policy on change control, and be published on a domain that is both politically and geographically neutral. Technically, this last point is irritating since example.us is a "dumb string," but can you imagine the French government using it? Or the UK government using anything ending in .eu? Multilingual Linked Data needs localised RDF vocabularies and localised reference data, but don't overlook the importance of the branding and/or national identity inherent in a domain name. Cool URIs are human-readable—and humans are irrational.
[Chair: Arle Lommel • Scribe: Jirka Kosek]
Identifying Users and Use Cases
Matching Data to
[Session Leader: Thierry Declerck • Scribe: Tadej Štajner]
The META-NET Strategic Research Agenda and Linked Open Data
abstract META-NET (http://www.meta-net.eu) develops a Strategic Research Agenda (SRA) for the European Language Technology research community. The preparation process of the SRA has received input from more than 130 persons involved in language technology and adjacent areas and communities such as the Semantic Web. The overall goal is to provide input to the currently ongoing, long-term research strategy planning in order to best realise and support the multilingual European information society. In the long and complex process towards developing the SRA, three priority research themes emerged: (1) Translation Cloud; (2) Social Intelligence and e-Participation; (3) Socially Aware Interactive Assistants. The SRA is already in a mature state and will be finalised within the next months. The MultilingualWeb workshop in Dublin is a great opportunity to discuss the role of linked open data (LOD) in the SRA in general and in the priority themes specifically. For example, using LOD to generate cross-lingual references between named entities as a part of the translation cloud, or exploiting publicly available government data to foster e-participation across Europe. The goal of this presentation is to foster a discussion on what the role of LOD in the SRA priority themes can and should be, and how the areas of multilinguality and open data can and should work together in developing further long-term research goals.
[Chair: Kimmo Rossi • Scribe: Declan Groves]
Linked Open Data and Connecting Europe
Light buffet dinner
Held at the Trinity Capital Hotel
Research Lecturer, Knowledge and Data Engineering Group, School of Computer Science & Statistics, Trinity College Dublin
Welcome and Introduction
State of Requirements
Representation Formats: HTML, XML, RDFa, etc.
Speaker: Maxime Lefrançois
MLW-LT, the Semantic Web, and Linked Open Data
abstract This talk describes issues that arise in combining MLW-LT activities with Semantic Web and Linked Open Data requirements. It will make proposals for the representation format of MLW-LT in order to interface better with these communities.
[Session Leader: Jirka Kosek • Scribe: Felix Sasaki]
[Session Leader: Phil Ritchie • Scribe: Pedro Diez]
[Session Leader: Tadej Štajner • Scribe: Phil Ritchie]
Updating ITS 1.0
(Review of Data Categories from
[Session Leader: Felix Sasaki • Scribe: Milan Karasek]
Content Authoring Requirements
Speaker: Alex Lik
Biosense Webster (J&J)/A-CLID
CMS-Based Localisation Management
abstract This talk discusses the authoring requirements of Biosense Webster (with reference to additional Johnson&Johnson companies) at various stages of topic-based authoring, with emphasis on terminology management awareness and implementation. This case demonstrates a strong need for easy and ROI-reasonable ways to deploy multilingual content, especially in light of the forthcoming (March 2013) electronic Instructions for Use (e-IFU) requirement.
Speaker: Des Oates
Adobe's Content Localisation Process
abstract Adobe, as a publisher of large volumes of global content, will present an overview of its web content localisation process, as well as the services and systems employed in its workflows. To help illustrate the requirement for MultilingualWeb-LT metadata, we will show how a standard set of metadata would help reduce the impedance across our services.
[Session Leader: Moritz Hellwig • Scribe: Felix Sasaki]
Speaker: Bryan Schnabel
XLIFF Technical Committee
Encoding ITS 2.0 Metadata to Facilitate an XLIFF Roundtrip
abstract Three ways to support extensibility have been discussed for XLIFF 2.0: (1) allow custom namespaces in prescribed elements (similar to XLIFF 1.2); (2) provide XML elements and attributes to prescribe extensibility (no custom namespaces); (3) allow both options, but be very picky about which are valid where. After explaining the three options and their pros and cons, this presentation will show which has been adopted by the XLIFF committee and how to do a roundtrip with ITS 2.0.
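As an illustration of option (1), the sketch below builds an XLIFF 1.2-style trans-unit that carries an ITS-namespace attribute, so a "do not translate" flag can travel with the unit through the localisation roundtrip. This is only a sketch of the general mechanism, not the roundtrip encoding ultimately adopted by the XLIFF committee.

```python
# Sketch of extensibility option (1): a custom namespace (here ITS) used
# directly on an XLIFF 1.2-style <trans-unit>, marking the unit as
# non-translatable so the flag survives the roundtrip.
import xml.etree.ElementTree as ET

XLIFF = "urn:oasis:names:tc:xliff:document:1.2"
ITS = "http://www.w3.org/2005/11/its"

ET.register_namespace("", XLIFF)
ET.register_namespace("its", ITS)

unit = ET.Element(f"{{{XLIFF}}}trans-unit", {
    "id": "u1",
    f"{{{ITS}}}translate": "no",   # ITS Translate data category as an attribute
})
source = ET.SubElement(unit, f"{{{XLIFF}}}source")
source.text = "ACME WidgetPro"     # hypothetical product name: do not translate

print(ET.tostring(unit, encoding="unicode"))
```

Option (2) would instead express the same flag through XLIFF-native extension elements, trading namespace freedom for stricter validation.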
[Session Leader: Yves Savourel • Scribe: Tadej Štajner]
Speaker: Mark Davis
Coordinating the BCP47 "t" Extension with MLW-LT Data Categories
abstract Mark Davis, President of the Unicode Consortium, will join the Workshop remotely to discuss the "transformed content" Extension to BCP 47 that provides subtags for specifying the source language or script of transformed content, including content that has been transliterated, transcribed, or translated, or in some other way influenced by the source. It also provides for additional information used for identification.
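The mechanics can be sketched with a minimal tag splitter: everything before the "t" singleton identifies the content's own language, and the subtags after it identify the source of the transformation. This is a simplified sketch; a real parser must also handle the extension's mechanism and field keys (e.g., transform mechanisms) and other BCP 47 singletons.

```python
# Sketch: splitting a BCP 47 tag at the "t" extension singleton to recover
# the source-language subtags of transformed content. Simplified: ignores
# the extension's field keys and other singleton extensions.

def split_t_extension(tag):
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return tag, None
    i = subtags.index("t")
    target = "-".join(subtags[:i])      # language of the content itself
    source = "-".join(subtags[i + 1:])  # language/script it was derived from
    return target, source

# "ja-t-it": Japanese content transformed (e.g., transliterated) from Italian.
print(split_t_extension("ja-t-it"))        # ('ja', 'it')
print(split_t_extension("und-Cyrl-t-und-latn"))
```

Coordinating this with MLW-LT data categories means such source-language information could be carried either in the language tag itself or in dedicated metadata, and the two must not contradict each other.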
Private room at the Brazen Head
Welcome and Implementation Commitments
[Session Leader: Felix Sasaki • Scribe: Jirka Kosek]
Project Information Metadata
[Session Leader: David Filip • Scribe: Jirka Kosek]
Translation Process Metadata
Speaker: David Filip
Translation (Localization) Process
[Session Leader: Pedro L. Díez Orzas • Scribe: Moritz Hellwig]
[Session Leader: Dave Lewis • Scribe: David Filip]
[Session Leader: Yves Savourel • Scribe: Felix Sasaki]
Implementation Commitments and Plans
[Session Leader: Felix Sasaki • Scribe: Arle Lommel]
Coordination and Liaison with Other Initiatives
[Session Leader: David Filip • Scribe: Yves Savourel]