Standards and best practices for the Multilingual Web
Today, the World Wide Web is fundamental to communication in all walks of life. As the share of English web pages decreases and that of other languages increases, it is vitally important to ensure the multilingual success of the World Wide Web.
The MultilingualWeb initiative was established to look at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. It aims to raise the visibility of existing best practices and standards and identify gaps. The core vehicle for this is a series of events which started in 2010. Currently these workshops are supported by the MultilingualWeb-LT (MLW-LT) project. MLW-LT aims at defining ITS 2.0, which provides metadata for multilingual content creation, localization, and language technologies.
On 11–13 June 2012, the W3C ran the fifth workshop in the MultilingualWeb series in Dublin, “The Multilingual Web - Linked Open Data and MultilingualWeb-LT Requirements.” It was hosted by Trinity College Dublin (TCD), and Professor Vincent Wade, Director of the Intelligent Systems Laboratory at TCD, gave a brief welcome address.
The purpose of this workshop was two-fold: first, to discuss the intersection between Linked Open Data and Multilingual Technologies (11 June), and second, to discuss requirements of the W3C MultilingualWeb-LT Working Group (12–13 June). This workshop was different than the previous workshops because it focused on two specific topics and targeted a smaller audience (to allow for more discussion). Future MultilingualWeb Workshops will return to the broader scope of previous workshops and aim to be a meeting place for content creators, localizers, tools developers, language technology experts, and others to discuss the challenges of the multilingual Web.
The Dublin event ran for three days. The final attendance count was 78, almost exactly the number the organizers had felt was ideal for having deeper discussions about the chosen topics. The audience and speakers encompassed key representatives from the EU, Europeana, the linked open data community, localizers working on linked open data, ISO TC 37, localization companies and research institutions working on terminology, the W3C, and also flagship EU projects in the realm of language technologies like META-NET. This was an ideal group of stakeholders working on the intersection between linked open data and the Multilingual Web.
As with previous workshops, the presenters were video-recorded with the assistance of VideoLectures, who have made the video available on the Web. We again made live IRC scribing available to help people follow the workshop remotely, and assist participants in the workshop itself. As before, people Tweeted about the conference and the speakers during the event, and you can see these linked from the program page. (Note that video is not available for working sessions, which were highly interactive. Those interested in the content of these sessions should consult the IRC transcript for more details.)
After a short summary of key highlights and recommendations, this document provides a short summary of each talk accompanied by a selection of key messages in bulleted list form. Links are also provided to the IRC transcript (taken by scribes during the meeting), video recordings of the talk (where available), and the talk slides. Almost all talks lasted 15 minutes. Finally, there are summaries of the breakout session findings, most of which are provided by the participants themselves. It is strongly recommended to watch the videos, where available, as they provide much more detail than do these summaries.
What follows is an analysis and synthesis of ideas brought out during the workshop. It is very high level, and readers should watch the individual speakers’ talks or read the IRC logs to get a better understanding of the points made.
The workshop opened with short welcome addresses from Vincent Wade, Richard Ishida, Dave Lewis and Kimmo Rossi. Kimmo Rossi emphasized that in the planning of the upcoming Horizon 2020 and other European research programs under consideration, language technology will be seen in the context of the data challenge, that is: linked open data, big data, the data cloud, etc. This workshop is a good opportunity to gather ideas about the relation between language technology and data, including topics for upcoming EC funding opportunities.
The Setting the Stage session started with our keynote speaker, David Orban. He looked into the future of computing devices and the "Web of Things". This is both a technical revolution and a societal challenge, and it needs to be addressed on a personal level as well as on a political, decision-making level. The keynote was followed by two presentations: Peter Schmitz presented developments within the European Publications Office related to multilingual linked open data, and Juliane Stiller and Marlies Olensky described the European digital library Europeana, which features core multilinguality in its user interface and data, and which is currently being converted to use linked open data. The two presentations shared a perspective: huge, multilingual data sets are currently being made available as linked open data. They differed with regard to the application scenarios of the data, including different user groups and consuming tools. For both data sets, interlinking with the current linked open data cloud is an important task.
The linking resources session concentrated on this aspect of interlinking information. The speakers provided three different perspectives on the topic:
The main topic of the linked open data and the lexicon session was how to make use of existing language resources (primarily lexicons and terminology) within linked open data, and vice versa. This section featured five presentations:
In the last session of the first day, Georg Rehm introduced the Strategic Research Agenda (SRA) being developed within the META-NET project. The SRA is one main piece of input for research in the Horizon 2020 and other programs. The presentation of the SRA at the workshop and discussions with participants had a crucial influence on framing the SRA with regard to the relation between linked open data and language technologies; see the SRA section “Linked Open Data and the Data Challenge”.
In addition to the above presentations, during the first day several longer discussion sessions were held: a working session about identifying users and use cases - matching data to users, and a final discussion of action plans.
The second and the third day of the workshop were dedicated to gathering requirements about the Internationalization Tag Set 2.0, which is currently being developed within the MultilingualWeb-LT W3C Working Group. Several presentations highlighted different requirements. These are summarized briefly below.
After a welcome session for the ITS 2.0 discussion, the discussion of representation formats for ITS 2.0 metadata started with a presentation by Maxime Lefrançois. He explained the challenges for making the metadata available for or via Semantic Web technologies like RDF.
Three working sessions followed the representation format discussion: a session on quality metadata, on terminology metadata, and on updating ITS 1.0. The outcomes of these sessions and the two workshop days about ITS 2.0 requirements are summarized at the end of this executive summary.
In the content authoring requirements session, two presentations were given. Alex Lik emphasized the need to ensure interoperability of ITS 2.0 metadata with content-related standards like DITA and localization-workflow-related standards like XLIFF. Des Oates reiterated that requirement with a focus on large-scale localization workflows within Adobe. Such workflows need not only the metadata itself, but also the possibility of metadata round-tripping and a mechanism to expose metadata capabilities in a service-oriented architecture.
The session on requirements on ITS 2.0 for localization scenarios started with a presentation from Bryan Schnabel. As the chair of the XLIFF technical committee, Bryan gave insights into current discussions within the XLIFF TC which are likely to influence the relation between XLIFF and ITS 2.0, and potentially the adoption of ITS 2.0 in general.
The second day closed with a presentation from Mark Davis about the latest BCP 47 developments. BCP 47 is the standard for language tags, and extensions to BCP 47 have recently been created that overlap with functionality needed for ITS 2.0 metadata. The session helped to start coordination of the standardization activities, which is now being continued between the relevant W3C working groups.
On the third day, the in-depth discussion of ITS 2.0 requirements continued in sessions on implementation commitments, project information metadata, translation process metadata, provenance metadata, and translation metadata, followed by a summary of implementation commitments and a session on coordination and liaisons between the MultilingualWeb-LT group and other initiatives.
In the translation process metadata session, David Filip talked about the need to have ITS 2.0 metadata available in complex localization workflows, reiterating statements e.g. from Des Oates and Alex Lik.
The requirements gathering for ITS 2.0 was especially successful in terms of raising awareness of the upcoming metadata in various communities. These range from large corporations that localize huge amounts of content on a daily basis, using both human and machine translation, to smaller localization technology providers who have their specific technological solutions.
Another main achievement was the consolidation of requirements. The Requirements for Internationalization Tag Set (ITS) 2.0 had been published only a few weeks before the workshop and provided a huge amount of proposed metadata items. Via the discussions during the workshop, many proposals were consolidated, taking both needs in real life use cases and judgement of efforts into account. Follow the link to the first working draft of ITS 2.0 to see the set of metadata items agreed upon by workshop participants.
The work within the MultilingualWeb-LT working group will now focus on selected additional metadata and implementations of the metadata proposals with real life use cases, including various “real” client-company scenarios. In this way, the usefulness of the metadata for filling gaps in the way towards a truly multilingual Web can be assured.
Vincent Wade, Director of the Intelligent Systems Laboratory at TCD, welcomed the participants to Dublin and gave a brief speech emphasizing the important role that multilinguality and linked open data play in Ireland, within CNGL or DERI, with support both from national and European funding.
Richard Ishida, W3C Internationalization Activity Lead and the leader of the EU MultilingualWeb project that funded the MultilingualWeb workshop series between 2009 and early 2012, emphasized the success of the MultilingualWeb workshop series so far and the plans to continue this series with a general MultilingualWeb workshop to be held next year.
Dave Lewis, research lecturer at the School of Computer Sciences & Statistics at Trinity College Dublin, introduced the participants to the goal of the first day: to discuss the intersection between multilingual technologies and linked open data, involving perspectives from areas like language technology, localization, and the Web. The 7th framework program of the EU, the upcoming Horizon 2020 program and national funding play an important role for moving these topics forward.
Kimmo Rossi of the European Commission, DG for Communications Networks, Content and Technology (CONNECT), Project Officer for the MultilingualWeb and the MultilingualWeb-LT projects, started with information about a reorganization within the EU: as of 1st July 2012, the European Commission Directorate General for Communications Networks, Content and Technology (DG CONNECT) has been created. Within DG CONNECT, the language technology area is now part of the data value chain unit. In the future, language technology will be seen in the context of topics like linked open data or public sector information. Language technology can help to leverage linked open data, e.g., by extracting meaning from text, by deriving structured data from unstructured data, and by bringing in work on language-related resources (terminologies, ontologies, taxonomies, etc.). Other significant remarks:
The Setting the Stage session was chaired by Paul Buitelaar from the Digital Enterprise Research Institute (DERI).
David Orban, CEO of dotSub, gave the keynote speech, entitled “The Privilege and Responsibility of Personal and Social Freedom in a World of Autonomous Machines.” In his talk he provided a look into the future at the next generations of computing devices. The Web and the physical world are coming together in the Web or “Internet of Things.” What will be the impact of this development on our society, and how can we prepare individuals, and society as a whole, to cope with the onslaught of accelerating change in our lives and our civilization? David Orban provided some views about and answers to these questions in his keynote. The main points are summarized via the bullet points below.
Peter Schmitz, Head of Unit Enterprise Architecture at the Publications Office of the European Commission, talked about “Multilingualism and Linked Open Data in the EU Open Data Portal and Other Projects of the Publications Office.” The talk presented contributions of the EU Publications Office (PO) to the Multilingual Web: daily publications in 23 languages, the multilingual thesaurus EuroVoc, multilingual controlled vocabularies, linked multilingual Web content, etc. The Publications Office is also heavily engaged in the European Open Data Portal and plans to provide dissemination of metadata as RDF, including persistent URIs. Other significant remarks:
Juliane Stiller and Marlies Olensky, researchers at the Berlin School of Library and Information Science at Humboldt-University Berlin, talked about “Europeana: A Multilingual Trailblazer”. Europeana aggregates content from 33 different countries to provide access to cultural heritage content across languages. Multilinguality in Europeana is implemented via the translation of the user interface, cross-lingual search, and subject browsing. It relies on translation of metadata to allow searches to cross language barriers and leverage content from one language to assist search in another. The Europeana Data Model (EDM) has been created to foster interoperability on a semantic level, and has already enabled the project to publish 3.5 million objects as linked open data. Other significant remarks:
The Setting the Stage session on the first day ended with a Q&A period with questions and comments about the current and future state of multilingual technologies on the Web, about content from Europeana and the role of language resources, and about the danger of “killing new topics by too much standardization.” For more details, see the related links.
The Linked Open Data and the Lexicon session was chaired by Arle Lommel from the German Research Center for Artificial Intelligence (DFKI).
Alan Melby, director of the Translation Research Group at Brigham Young University, gave a talk about “Bringing Terminology to Linked Data through TBX.” Terminology and linked open data are currently separate fields. To address this challenge, the latest version of the TermBase eXchange (TBX) format is being developed to include an isomorphic RDF version, RDF-TBX. This will make it possible to integrate the huge amount of linked open data available into terminological applications. For the linked open data community, TBX-based resources can help with disambiguation tasks in multilingual scenarios and with a concept-based approach to the translation of linked open data. Other significant remarks:
Ioannis Iakovidis, managing director at Interverbum Technology, gave a talk about “Challenges with Linked Data and Terminology.” The integration of structured terminology data into linked open data is a challenge, especially in commercial environments. The TermWeb solution is a Web-based terminology management system that integrates with widely adopted software, e.g., office applications. There are various challenges that need to be addressed for Web-based terminology management systems, e.g.:
Standardization in various areas is a prerequisite to address these challenges. The formats and technologies provided by linked open data may be the standardized technology stack that provides the functionality needed to achieve these goals.
Tatiana Gornostay, terminology service manager at Tilde, talked about “Extending the Use of Web-Based Terminology Services.” Tilde is currently advancing the establishment of cloud-based platforms for acquiring, sharing, and reusing language resources. The main application scenario is to improve automated natural language data processing tasks like machine translation. The exposure of terminology services on the Web allows for integration with machine translation systems, indexing systems, and search engines. Other significant remarks:
John McCrae, research associate at the University of Bielefeld, talked about “The Need for Lexicalization of Linked Data.” The talk presented Lemon (Lexicon Model for Ontologies), a model for describing lexical information relative to ontologies. Lemon is based on existing models for representing lexical information (Lexical Markup Framework and SKOS) and aims to bridge the gap between the existing linked data cloud and the growing linguistic linked data cloud. Via the linguistic linked data cloud, a significant amount of multilingual data can be re-used in a variety of applications. Other significant remarks:
Phil Archer, W3C team member working on eGovernment, talked about “Cool URIs Are Human Readable.” The EU is developing various standardized vocabularies for describing people, businesses, locations, etc. Currently the documentation for these is in English; in addition to the task of finding funding for the localization effort, political issues come into play: language, cultural identity, and trust need to be taken into account. These requirements influence basic decisions like choosing a URI that is neutral and acceptable across national borders: vocabularies must be available at a stable URI, be subject to an identifiable policy on change control, and be published on a domain that is both politically and geographically neutral. Other significant remarks:
xmlns.com is geographically and politically neutral when compared to
The Linked Open Data and the Lexicon session on the first day ended with a Q&A period with a discussion about how the problems discussed relate to linguistic research and modeling in the area of semantics (which has a long tradition), about the limits of formalizing concepts, and about issues with translating terms into different languages. For more details, see the related links.
The Identifying Users and Use Cases - Matching Data to Users session was chaired by Thierry Declerck from the German Research Center for Artificial Intelligence (DFKI).
This working session featured discussion about the use cases and user groups interested in linked open data. The process of identifying users and getting their feedback on requirements is difficult, particularly because, while some users are multilingual, most are monolingual and gaining their input requires careful analysis of server logs and other resources to determine how they interact with content. One of the difficulties for multilingual linked open data is that it is often generated at the end of the production chain, when it is difficult to make needed changes. Finally, it was emphasized that developers need to recall that linked data does not always equal linked open data: there are many cases where users will want to use linked data technologies but cannot make data open due to legal or privacy restrictions (e.g., linked data refers to identifiable personal information). As a result linked open data developers need to consider security requirements and build in ways to deal with them early on, and consider the ways in which linked data can move between open and closed categories.
The Building Bridges session was chaired by Kimmo Rossi from the European Commission.
Georg Rehm from DFKI gave a talk about the “META-NET Strategic Research Agenda and Linked Open Data.” META-NET is developing a Strategic Research Agenda (SRA) for the European Language Technology research community. For the SRA, three priority themes have emerged: (1) the Translation Cloud, (2) Social Intelligence and e-Participation, and (3) Socially Aware Interactive Assistants. In all these areas, linked open data plays a crucial role. For example, data can be used to generate cross-lingual references between named entities as a part of the translation cloud, or to exploit publicly available government data to foster e-participation across Europe. Other significant remarks:
The Building Bridges session on the first day ended with a Q&A period with questions and comments about the legal frameworks discussed in the session, the prospects for disruptive change from language technology and how far things have come in just the past decade, and the difficulties in extending human language technologies for smaller languages, and other issues around extending multilingualism’s benefits to those in need. For more details, see the related links.
The Action Plans session was chaired by Dave Lewis from Trinity College Dublin.
Dave Lewis led the discussion about action plans coming out of the first workshop day. The discussion focused on what is needed to achieve concrete developments in the area of multilingual linked open data. As a community, practitioners need to address motivations and incentives for maintenance: at present much of the data is made available by researchers to support specific projects, but extending data and maintaining it is an expensive proposition and the community needs to figure out sustainable models to address these issues. There is also a clear need for best practices to help organizations make their data available and lower the costs involved in doing so. One issue is that those who benefit from linked open data may or may not be the same people who bear the cost of publishing it, so we do not yet see the business model needed to support extensive work in this area, in particular one that resolves payment issues. Kimmo Rossi closed by stating that he sees these developments as taking a number of years, but that we need to start reaching out now to the broader linked open data community to ensure that multilingualism is a core concern and that the solutions that are developed will meet the needs of a multilingual society.
Felix Sasaki, senior researcher at DFKI and W3C fellow, talked about “State of Requirements.” He introduced the Requirements for Internationalization Tag Set (ITS) 2.0 Working Draft and discussed its contents and the process for moving data categories forward in the W3C process.
The Representation Formats session was chaired by Jirka Kosek from the University of Economics, Prague.
Maxime Lefrançois, researcher at Inria, talked about “MLW-LT, the Semantic Web, and Linked Open Data.” The standardization efforts of the MultilingualWeb-LT working group are focused on HTML5 and XML as content formats. The challenge is to integrate these formats and ITS 2.0 metadata with Semantic Web and linked open data approaches. Addressing the challenge will help to interface better with the linked open data community. Other significant remarks:
The Quality Metadata session was chaired by Phil Ritchie from VistaTEC.
Phil Ritchie of VistaTEC was joined by Arle Lommel (DFKI) to discuss a proposal for an ITS 2.0 model for marking language quality data in XML and HTML5 documents. The proposal sparked discussion about the needs of the group, the difficulties in exchanging quality data, and the verbosity of the proposed model (with standoff markup as one proposal for how to address the issue of verbosity and also data security). Overall the group felt that the proposal needed more elaboration on some key points, and Ritchie and Lommel were charged with moving it forward.
The Terminology Metadata session was chaired by Tadej Štajner from the Jožef Stefan Institute.
Tadej Štajner led a discussion about the ways to address terminology needs in ITS 2.0. While fully automated terminology work is not possible, there is a need for a mechanism to tag term candidates in documents so that subsequent processes can evaluate them and make appropriate decisions. In discussion, the following points were raised:
The Updating ITS 1.0 session was chaired by Felix Sasaki from DFKI / W3C.
In this section the group discussed updates to the existing ITS 1.0 data categories in ITS 2.0 to ensure that there was agreement about the details. In most cases, existing data categories are being adopted as is, but there was particular discussion about updating the ruby model to take into account work done in the HTML5 ruby model.
The Localization Requirements session was chaired by Yves Savourel from Enlaso.
Bryan Schnabel, chair of the XLIFF Technical Committee, gave a talk entitled “Encoding ITS 2.0 Metadata to Facilitate an XLIFF Roundtrip.” XLIFF 2.0 is currently under development, and one major topic under discussion is how to support extensibility in XLIFF 2.0. The decisions about extensibility will also influence the role ITS 2.0 metadata may play in XLIFF. The presentation discussed three extensibility options in detail and showed how to achieve metadata round-tripping with XLIFF and various content formats. Other significant remarks:
The BCP 47 Developments session was chaired by Felix Sasaki from DFKI / W3C.
Mark Davis, Sr. internationalization architect at Google and president of the Unicode Consortium, talked about “Coordinating the BCP47 ‘t’ Extension with MLW-LT Data Categories.” The standard for language tags BCP 47 currently has two so-called registered extensions (see the language tag extensions registry): one for setting behavior in locale APIs, the Unicode Locale Extension (“u”), and an extension (“t”) to identify content that has been transformed, including but not limited to: transliteration, transcription, and translation. Since ITS 2.0 metadata is also concerned with these processes, coordination is needed, to avoid conflicts between metadata conveyed via the BCP 47 extension and the to-be-developed ITS 2.0 metadata. Other significant remarks:
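To make the “t” extension described above more concrete, the following minimal Python sketch splits a language tag into the language of the content, the source language it was transformed from, and any further transformation fields. It is an illustration only, not part of the talk or of any ITS 2.0 proposal: the function name `parse_t_extension` and the returned dictionary layout are hypothetical, the tag is lowercased for simplicity, and the code does not validate tags against the full BCP 47 grammar.

```python
import re

# Field keys inside the "t" extension are one letter followed by one digit (e.g. "m0").
FIELD_KEY = re.compile(r"[a-z][0-9]")


def parse_t_extension(tag):
    """Split a language tag that carries a "t" extension.

    Returns a dict with the content language, the source language, and any
    fields, or None if the tag has no "t" extension. All values are lowercased.
    """
    subtags = tag.lower().split("-")

    # Find the "t" singleton; anything after an "x" singleton is private use.
    for i, sub in enumerate(subtags[1:], start=1):
        if sub == "x":
            return None
        if sub == "t":
            break
    else:
        return None

    # The extension runs until the next one-character singleton (or the end).
    extension = []
    for sub in subtags[i + 1:]:
        if len(sub) == 1:
            break
        extension.append(sub)

    # Subtags before the first field key form the source language tag;
    # subtags after a field key belong to that field.
    source, fields, key = [], {}, None
    for sub in extension:
        if FIELD_KEY.fullmatch(sub):
            key = sub
            fields[key] = []
        elif key is None:
            source.append(sub)
        else:
            fields[key].append(sub)

    return {
        "content": "-".join(subtags[:i]),
        "source": "-".join(source),
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }
```

For example, `parse_t_extension("ja-t-it")` reports Japanese content transformed from Italian, while a tag without the extension, such as `"en-US"`, yields `None`. Metadata that records the same facts in a separate ITS 2.0 attribute could contradict such a tag, which is the kind of conflict the coordination is meant to avoid.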
The Implementation Commitments session was chaired by Felix Sasaki from DFKI / W3C.
The session discussed the importance in the W3C process of getting firm implementation commitments very soon since any data category without implementation commitments will be dropped. The current status of these commitments will be maintained on a commitment wiki.
The Project Information Metadata session was chaired by David Filip from the University of Limerick.
This session discussed various proposals for project information metadata. As a result, translation qualification, genre, format type, purpose, and a number of other metadata proposals were dropped, while domain was retained for further discussion, although a number of issues need to be resolved.
The Translation Process Metadata session was chaired by Pedro Díez from Linguaserve.
David Filip, CNGL, talked about “Translation (Localization) Process Metadata?” All ITS 2.0 metadata items are orthogonal. However, in enterprise environments, the metadata has to work with many different types of workflows (see also the presentation from Des Oates) in a huge, service-oriented architecture. This includes, for example, terminology and translation memory life-cycles. In such a scenario, orthogonal categories must sync on the fly. Under the umbrella of the MultilingualWeb-LT working group, Trinity College Dublin and the University of Limerick are working on implementations that help to support this requirement for ITS 2.0 metadata. Other significant remarks:
Pedro Díez led a discussion on various process-related metadata proposals with the goal of creating data-driven processes. Various categories like readiness indicator and state attributes could offer a lot of value to users, but the relationship with APIs needs to be clarified. ITS 2.0 also needs to be coordinated with XLIFF in this area to avoid a collision of definitions between the two. ITS 2.0 has the potential to bring together stages outside of the scope of XLIFF, such as CMS-side authoring.
The Provenance Metadata session was chaired by Dave Lewis from Trinity College Dublin.
Provenance has emerged in recent years as a major topic because knowing where content came from and how it was produced can have a major impact on what is done with it, what quality processes are used, and how trustworthy the content is. The W3C currently has a Provenance working group, and it was agreed that the ITS 2.0 team should approach the Provenance working group to ensure that our efforts are coordinated with theirs. There remains considerable work to be done on defining the ITS 2.0 provenance model and specifying the use cases, but this topic is one that will continue to be very important.
The Translation Metadata session was chaired by Yves Savourel from Enlaso.
Much of the discussion in this section centered on the target pointer proposal, which has implementation commitments from Yves Savourel and Shaun McCance because it allows for basic processing of multilingual XML files without requiring full support for an XML file in some cases. The second category was locale filter, which provides a way for authors to specify into which locales specific content items should be translated. The auto language processing rule category required more elaboration. Next the term location category was discussed as an equivalent to the XLIFF restype category. The end result was that more information is needed on some of these categories before decisions can be made whether to incorporate them or not.
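As a rough illustration of the locale filter idea mentioned in this session, the sketch below decides whether a content item should be translated into a given target locale. Everything about its shape is an assumption made for illustration: the comma-separated filter list, the `*` wildcard, and the function names are hypothetical and are not taken from the workshop discussion or from any ITS 2.0 draft. The matching rule itself is the simple prefix comparison known as basic filtering for language tags.

```python
def basic_filter_match(language_range, tag):
    """Return True if `tag` matches `language_range` by prefix comparison.

    A range matches a tag if they are equal (ignoring case), if the range is a
    prefix of the tag followed by "-", or if the range is the wildcard "*".
    """
    r = language_range.strip().lower()
    t = tag.strip().lower()
    return r == "*" or t == r or t.startswith(r + "-")


def should_translate_for(locale, filter_list):
    """Decide whether content with the given filter list targets `locale`.

    `filter_list` is a hypothetical comma-separated list of language ranges,
    e.g. "en-CA, fr-CA"; an empty list selects no locale.
    """
    ranges = [r for r in filter_list.split(",") if r.strip()]
    return any(basic_filter_match(r, locale) for r in ranges)
```

With these assumptions, a content item carrying the list `"en-CA, fr-CA"` would be translated for `fr-CA` but skipped for `fr-FR`, and a list of `"*"` would select every locale.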
The Implementation Commitments and Plans session was chaired by Felix Sasaki from DFKI / W3C.
The goal of this section was to gain firm implementation commitments for data categories. It was agreed that all implementation commitments should be finalized by mid-July, at which point Felix Sasaki was to create a new draft of the specification that included those categories for which there were sufficient commitments. All creators of data categories were to work to build consensus on their proposals and commitments to implement the proposed definitions.
The Coordination and Liaison with Other Initiatives session was chaired by David Filip from the University of Limerick.
This working session discussed the relationship between the ITS 2.0 effort and other groups, such as XLIFF, ISO TC 37, and the ETSI ISG LIS, as well as participation in the upcoming FEISGILTT conference in October to be held in conjunction with Localization World. David Filip invited participants in the Workshop who were interested in participating in the program committee for that program to get involved.
The Fifth Multilingual Web Workshop was closed by Arle Lommel from DFKI.
This section ended the Workshop with special thanks to the staff of Trinity College Dublin, especially Eithne McCann and Dominic Jones for their support in running a successful workshop. The next Workshop, which will return to the broader, more general format of previous Multilingual Web workshops, was announced tentatively for Rome in March 2013.