W3C Workshop Report:
Data, content and services for the Multilingual Web
29 April 2015, Riga
Today, the World Wide Web is fundamental to communication in all parts of life. As the share of English web pages decreases and that of other languages increases, it is vitally important to foster multilingualism for the World Wide Web.
The MultilingualWeb initiative examines best practices and standards related to all aspects of creating, localizing, and deploying the Web multilingually. The initiative aims to raise the visibility of existing best practices and standards and to identify gaps in Web technologies that impact multilinguality online. The core vehicle for this effort is a series of events that started in 2010, run by the initial MultilingualWeb project and now by the LIDER project.
On 29 April 2015 the W3C ran the eighth workshop in the series. The theme of the workshop was
Data, content and services for the Multilingual Web. The workshop in Riga was co-organized by Tilde and co-located with the Riga Summit 2015 on the Multilingual Digital Single Market. The opening session was organized as a joint session with CEF – Towards a Connected Multilingual Europe. The following presenters gave a brief welcome address: Dace Melbārde (Minister of Culture of the Republic of Latvia), Márta Nagy-Rothengass (Head of Unit, Data Value Chain, European Commission, DG Connect), Jānis Kārkliņš (Internet Governance Forum) and Richard Ishida (World Wide Web Consortium, Internationalization Activity Lead).
As with the previous workshops, this event focused on discussion of best practices and standards aimed at helping content creators, localizers, tools developers, and others to meet the challenges of the multilingual Web. The key objective was to provide the opportunity for networking across a wide range of communities.
Participants were able to switch between the workshop and the parallel CEF event. Together the two events had more than 200 registered participants and featured one day of talks and discussions. A specific focus of the MultilingualWeb workshop was on data, content and services on and for the multilingual Web, provided via standardized technologies.
All presentations were recorded and are available on the MultilingualWeb YouTube channel. We also provided scribing of presentations and discussions; the related session notes are linked from the presentations.
The program and attendees continued to reflect the same wide range of interests and subject areas as in previous workshops. We once again had good representation from, for example, content creators, the localization industry, research, and the government/non-profit sector.
After a short summary of highlights, this document provides a description of each talk accompanied by a selection of key messages. In addition there are links to session notes, video recordings of the presentations, and slides.
During the workshop we recognized the contributions of Jörg Schütz. Jörg was a long-term contributor to the MultilingualWeb community. He had been due to present at the Riga workshop but passed away unexpectedly a few weeks before the event.
The creation of this report was supported by the European Commission through the Seventh Framework Programme (FP7), Grant Agreement No. 610782: the LIDER project. The MultilingualWeb workshop series is being supported by LIDER and has been supported by the Thematic Network MultilingualWeb, Grant Agreement No. 250500, and by the LT-Web project, Grant Agreement No. 287815.
Contents: Summary • Keynote • Developers and Creators • Localizers • Machines • Lightning Talks • Users
What follows is an analysis and synthesis of ideas brought out during the workshop. It is very high level, and you should watch or follow the individual speakers' talks to get a better understanding of the points made.
During the initial joint session held with the parallel CEF event, Dace Melbārde (Minister of Culture of the Republic of Latvia) stressed the importance of user-friendly, multilingual digital services, especially for minority languages, while Márta Nagy-Rothengass (Head of Unit, Data Value Chain, European Commission, DG Connect) talked about the imminent deployment of digital services infrastructure. Jānis Kārkliņš (Internet Governance Forum) discussed how to foster multilingualism on the internet, e.g. via broad adoption of IDNs (Internationalized Domain Names), and Richard Ishida (W3C, Internationalization Activity Lead) argued that standards create interoperable diversity, and should be world-wide in scope, not limited to Europe.
The keynote presentation from Paige Williams stressed how multilingual technology is about changing the lives of people around the world. Paige gave a broad overview of technologies that are crucial for a truly multilingual Web. The workshop continued with the Developers and Creators session, opened by Han-Teng Liao, who emphasized the need for best practices and standards that put the user at the center. Roberto Navigli and Rodolfo Maslias presented a multilingual metasearch service, which allows users to access Web information in multiple languages.
Fernando Serván presented work undertaken by the Food and Agriculture Organization (FAO) in the area of multilingual publishing. There is a need to implement culture-specific requirements for high-quality layout. Juliane Stiller closed the session with a presentation on the evaluation of multilingual features in Europeana.
The Localizers session was opened by Leonid Glazychev. He discussed methodologies for standardizing quality assessment, a topic that was taken up later by Arle Lommel. Jan Nelson presented on a concrete tool implementing localization standards: the multilingual app toolkit. David Filip made a presentation, with contributions from Loic Dufresne de Virel, on localization technologies being deployed within Intel: the so-called I18n/L10n service bus.
The Machines session started with a joint presentation from Asunción Gómez-Pérez and Philipp Cimiano. They presented intermediate outcomes of the LIDER project, focusing among other items on a roadmap for the use of linguistic linked data in content analytics applications. The presentation from Andrzej Zydroń and Dave Lewis gave the current state of the FALCON project. The Machines session closed with a presentation from Ilan Kernerman on multilingual glossaries.
The following session of lightning talks started with Felix Sasaki. He presented the FREME project, which is developing interfaces to several language and data technologies for the multilingual and semantic enrichment of digital content. Felix also presented the Ocelot application on behalf of Phil Ritchie; Ocelot will implement several enrichment functionalities of FREME. The last lightning talk was given by Ben Koeleman and introduced the topic of swarm translation.
The final, Users session started with Delyth Prys, who talked about best practices for supporting minority languages, using the example of Welsh. Thibault Grouas introduced JocondeLab, a multilingual web site providing access to cultural heritage artifacts via linked data technologies. The Users session finished with a presentation from Dennis Tan. With the example of internationalized domain names (IDN), Dennis emphasized the need to take an end-to-end perspective into account, involving every part of multilingual content creation, distribution and consumption. This view fit very well as a closing message to the workshop and supported the holistic and broad perspective of the MultilingualWeb community.
The keynote was given by Paige Williams (Microsoft). She spoke about
People-First: Multilingualism in a Single Digital World. Paige pointed out that in today's world of borderless communication, using a growing number of heterogeneous technologies, one has to rethink the classic approach to localization. Technology has to be made available in the preferred language of a global audience. Language technologies play an important role in achieving this goal. By improving global communication, language technologies can also contribute to growth in the global economy. Other important remarks:
- Standards simplify the process of working with multiple languages. Microsoft is building bridges between language, culture and technologies through language related standards and the Microsoft local language program.
- Microsoft provides the global readiness approach for employees and external developers. This should help them to build adequate customer experience for all markets and languages.
- Paige encouraged the audience to embrace diversity and to try to understand what
global really means.
During the Q&A session, questions were raised about Microsoft's plans to provide speech technologies for small languages and for regions with little or no internet connectivity. Paige replied that currently technology support for smaller languages is low, and that Microsoft sees this as an important challenge to be addressed in the future.
Developers and Creators session
The developers and creators session was chaired by Tatjana Gornostaja, Tilde.
Han-Teng Liao (Oxford Internet Institute) gave a presentation entitled
A call to implement a common translation and country-language selector repository. In his view, there are several gaps when it comes to expressing country and language choices in a consistent and usable fashion. These gaps concern preferences around machine-translated content as well as language and country preferences. Europe, with its great cultural and linguistic diversity, should foster the development of related best practices. Other significant remarks:
- The Unicode Common Locale Data Repository (CLDR) provides much of the machine-readable information needed to address the challenges being discussed.
- European institutions should push for the harmonization of auto-translation mechanisms and language and country selectors. This concerns both the public and the private sector.
- Public institutions that could provide good examples are immigration departments or visa and border control agencies.
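A consistent language selector ultimately needs a fallback strategy for when a user's exact locale is unavailable. The sketch below illustrates the BCP 47 "lookup" fallback defined in RFC 4647, which CLDR-aware implementations build on; the catalog of available translations is hypothetical:

```python
# Minimal sketch of the BCP 47 "lookup" fallback (RFC 4647) for a
# language selector. The catalog of available translations is invented.
AVAILABLE = {"en", "de", "de-AT", "lv"}

def lookup(requested: str, available: set, default: str = "en") -> str:
    """Truncate the requested tag from the right until a match is found."""
    tag = requested
    while tag:
        if tag in available:
            return tag
        # Drop the last subtag (e.g. "de-CH-1996" -> "de-CH" -> "de").
        tag = tag.rpartition("-")[0]
    return default

print(lookup("de-CH", AVAILABLE))   # falls back to "de"
print(lookup("de-AT", AVAILABLE))   # exact match: "de-AT"
print(lookup("fr-FR", AVAILABLE))   # no match, default "en"
```

The same truncation logic generalizes to script and region subtags, which is where CLDR's likely-subtags data becomes useful.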
Roberto Navigli (Sapienza University) and Rodolfo Maslias (European Parliament) gave a presentation entitled
Metasearch in a Big Multilingual Terminological Space. In the European Union every day a vast amount of multilingual data is being produced. Roberto and Rodolfo presented work on a multilingual metasearch engine that could help to tackle this challenge. The metasearch engine integrates several large-scale, multilingual data sources. Other significant remarks:
- Domain-specific data is key to making the search results more accurate. The metasearch engine uses data from domains such as the labor market or migration.
- The metasearch engine could play an important economic role, e.g. by helping with cross border and cross lingual job search.
- Multilingual term bases are a crucial element for implementing these and other search scenarios.
Fernando Serván (FAO) gave a presentation entitled
Moving from a Multilingual Publications Repository to eBook Collections in the United Nations. FAO is publishing high volumes of multilingual content every year. The publications are made available using the standardized ePub format. FAO wants to take high quality layout requirements for a wide range of cultures and languages into account. Other significant remarks:
- At FAO, currently the creation of ePub documents is not fully automated. Each ePub document needs to be corrected manually before publication.
- Due to this situation, FAO needs standards and best practices on how to implement high quality layout requirements in the area of digital publishing.
- To reach a global audience, requirements from non-Latin scripts and typographic traditions especially need to be taken into account.
Juliane Stiller (Humboldt University) gave a presentation entitled
Evaluating Multilingual Features in Europeana: Deriving Best Practices for Digital Cultural Heritage. She reported on recent improvements in Europeana, the European digital library. A key feature of Europeana is cross-lingual querying of information about cultural artifacts. Juliane reported, for example, on improvements made in the area of query translation. Other significant remarks:
- A standardized multilingual vocabulary for identifying entities like persons, locations, events etc. is a basis for implementing cross-lingual queries in Europeana.
- Enriching metadata about cultural artifacts with links to entity information can improve search results tremendously.
- A non-technical challenge is the re-use of suitable multilingual resources, including licensing aspects.
The developers and creators session ended with a Q&A session. Related to Han-Teng's presentation, it was pointed out that European public administration, specifically the Publications Office of the European Union, has some high quality metadata that can help to tackle the challenges described. Also related to Han-Teng's presentation, it was pointed out that W3C provides some best practices on language and country selection. These best practices could be improved with additional information from various layout traditions.
Data used by the multilingual metasearch engine (cf. the presentation from Roberto and Rodolfo) was discussed as well. The metasearch engine relies both on data sources from the public sector, like IATE, and from industry. For successful metasearch, one needs three categories of data sources: public data, academic data and data from industry.
Localizers session
This session was chaired by Fernando Serván of the Food and Agriculture Organization of the UN (FAO).
Leonid Glazychev (Logrus Intl.) gave a presentation entitled
Standardizing Quality Assessment for the Multilingual Web. Leonid discussed the need for a standard way to assess translation quality. He introduced a proposal for a standard: ASTM WK46397. It aims at simplifying quality assessment. Other significant remarks:
- The proposal covers several aspects like quality assessment via crowdsourcing, the general assessment process, and quality metrics.
- Leonid presented an application of the proposed standard for the review of a well-known public web portal.
- He emphasized the need for standardized metrics to assess translation quality, a topic covered later by Arle Lommel.
Jan Anders Nelson (Microsoft) gave a presentation entitled
XLIFF 2.0 and Microsoft's Multilingual App Toolkit. Jan gave an overview of Microsoft's approach towards localization. Several years ago, support of standards was not an important element of the company strategy. This has changed, as can be seen by Microsoft's effort on XLIFF 2.0. Jan demonstrated the role of XLIFF 2.0 in the multilingual app toolkit for localizing applications, targeting several platforms like Windows, iOS and Android. Other significant remarks:
- The use of XLIFF 2.0 enables developers to re-use large percentages of their code.
- Developers also benefit from translation services that are provided via the toolkit.
- During the presentation he showed how cross-platform projects are supported by Microsoft's use of XLIFF 2.0.
- The support includes both machine translation services as well as engagement with language services providers.
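For readers unfamiliar with the format, a minimal XLIFF 2.0 file has roughly the following shape; the file and unit identifiers and the Latvian target text are illustrative:

```xml
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       version="2.0" srcLang="en" trgLang="lv">
  <file id="f1">
    <unit id="greeting">
      <segment>
        <source>Hello, world!</source>
        <target>Sveika, pasaule!</target>
      </segment>
    </unit>
  </file>
</xliff>
```

Each unit carries source and target segments side by side, which is what allows a toolkit to round-trip the same resources between platforms and translation providers.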
David Filip (University of Limerick) gave a presentation entitled
Developing Standards-Based Localization Service Bus at Intel. David also presented material from Loic Dufresne de Virel. The Intel localization group has started a partnership with the Irish ADAPT centre. The aim of the collaboration is to design a data model and architecture for internationalization and localization within Intel: the I18n/L10n service bus. Other significant remarks:
- The I18n/L10n service bus benefits from supporting several recent standards in the realm of multilingual content production.
- The modularity of standards is a prerequisite for matching the abstract requirements of business processes.
- Relevant standards encompass CMIS 1.1, ITS 2.0, XLIFF 2.0 and the upcoming XLIFF 2.1.
Machines session
This session was chaired by Feiyu Xu from DFKI.
Philipp Cimiano (Bielefeld University) and Asunción Gómez-Pérez gave a presentation entitled
LIDER: Building Free, Interlinked and Interoperable Language Resources. They presented an overview of the intermediate outcomes of LIDER. The aim of LIDER is to build the basis for a linguistic linked data (LLD) cloud. LIDER produces guidelines and best practices for the creation of LLD; tooling such as LingHub, which demonstrates the application of linked metadata for exploring linguistic resources; a community around LLD; and use cases and requirements for content analytics tasks on unstructured, multilingual, cross-media content. Other significant remarks:
- The LIDER roadmap for the use of LLD for content analytics tasks is being developed with valuable feedback from various research and industry communities. These are active in several W3C community groups: BPMLOD, OntoLex and LD4LT.
- The LIDER reference architecture describes how to develop LLD applications based on existing and new multilingual resources and their deployment in natural language processing services.
- LIDER as a project plays a crucial role in bridging communities like language technology and (big) data technologies.
Andrzej Zydroń (XTM International) and Dave Lewis gave a presentation entitled
FALCON: Building the Localization Web. They introduced the current state of the FALCON project. FALCON is deploying the LIDER principles of linguistic linked data in the translation and localization community. The project cooperates closely with LIDER to provide feedback from this industry on LLD use cases and requirements. FALCON has developed an LLD enabled online translation workflow that combines technologies from the realms of translation, translation management, computer-aided translation, and terminology management. Other significant remarks:
- The FALCON tool chain has been enhanced with automatic text extraction, machine translation and publicly available language resources.
- These components are integrated into one workflow through a web services architecture. It leverages open standards and linked tabular data formats.
- FALCON demonstrates how statistical machine translation training can be integrated with manual translation correction by human translators.
Ilan Kernerman (K Dictionaries) gave a presentation entitled
Semi-Automatic Generation of Multilingual Glossaries. K Dictionaries provides multilingual dictionaries as a basis for glossary generation. The semi-automatic generation process can be enhanced via linguistic linked data technologies. Standardized formats for representing lexica as linked data like LEMON can help to improve the process. Other significant remarks:
- Publishing houses, including providers of lexica, have invested in XML technologies over several years. Hence, the creation of linked data sources often involves a custom conversion from XML to RDF.
- Sub-steps of the conversion include, for example, the processing of word lists and continuous re-processing of the results.
- The multilingual glossaries help to improve access to web content across languages.
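To make the target of such an XML-to-RDF conversion concrete, a hypothetical lexical entry in the W3C OntoLex-Lemon vocabulary (the successor to the original lemon model) might look like this in Turtle; the entry and its namespace are invented for illustration:

```turtle
@prefix :        <http://example.org/lexicon#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .

# A lexical entry for the English word "cat", linked to a concept.
:entry_cat a ontolex:LexicalEntry ;
    ontolex:canonicalForm [ ontolex:writtenRep "cat"@en ] ;
    ontolex:sense [ ontolex:reference <http://dbpedia.org/resource/Cat> ] .
```

Because senses point to shared concept URIs, entries for the same concept in different languages become linkable, which is the basis for semi-automatic glossary generation.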
During the Q&A the level of disambiguation in multilingual glossaries was discussed. This is an area of active development and improvement for multilingual glossaries. Here resources like BabelNet can be of great help. The availability of BabelNet and multilingual glossaries as linguistic linked data eases the task of resource integration.
Participants were asking for actual use cases of linguistic linked data. As examples, references were made to the case of BabelNet and IATE, see the presentation from Roberto Navigli and Rodolfo Maslias, and the use of linked metadata by the Publications Office of the European Union. The session chair Feiyu Xu described how the relation between big data technologies and natural language processing is being explored in German national Big Data projects. So-called sar-graphs help to build a bridge between linguistic and world knowledge.
Lightning Talks session
This session was chaired by David Filip from University of Limerick.
Arle Lommel (DFKI) gave a presentation entitled
Designing Purpose-Specific Quality Metrics for the Web. He discussed the Multidimensional Quality Metrics (MQM) and their application to web content. The ITS 2.0 Localization Quality Issue data category is an application of a subset of the quality types provided by MQM. Certain tools like Ocelot allow to parameterize the actual set of metrics being used for a given quality assessment task.
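In the HTML serialization of ITS 2.0, the Localization Quality Issue data category is expressed with its-* attributes on the affected span. The fragment below is a hypothetical example of how a reviewer's finding might be annotated; the issue values shown are drawn from the data category's defined vocabulary:

```html
<!-- A reviewer flags a terminology problem in a translated segment. -->
<p>The app was
  <span its-loc-quality-issue-type="terminology"
        its-loc-quality-issue-severity="60"
        its-loc-quality-issue-comment="Use the approved glossary term">
    downloaded on your machine</span>.</p>
```

A tool like Ocelot can then surface such annotations to reviewers and map them onto whichever MQM metric subset has been configured for the task.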
Felix Sasaki (DFKI/W3C Fellow) gave a presentation entitled
FREME project: Language and Data Processing as First-Class Citizens on the Web. FREME is a European project that aims to bring various types of language and data technologies to the market. This is done via the design of interfaces (software interfaces for programmers and graphical user interfaces) for implementing multilingual and semantic enrichment processes for digital content. The interface design is driven by four selected business cases and may inform future standardization related to language and data services on the Web.
On behalf of Phil Ritchie (VistaTEC) Felix Sasaki gave a presentation entitled
Ocelot: An Agile XLIFF editor. The presentation provided an overview of the Ocelot tool, a flexible XLIFF editor that is an integral component of the Okapi framework, an open source set of components and applications for localization purposes. Ocelot's flexibility can be seen in the aforementioned adaptation to MQM, as well as in its usage within the FREME project. Within FREME, Ocelot is currently being adapted to allow translators to deploy multilingual semantic enrichment during localization.
Ben Koeleman (YAYANGO) gave a presentation entitled
Swarm Translation. Ben shared his experience from a special translation project: a community of over 25,000 volunteers collaboratively translated books into their native language. Ben called this approach
swarm translation. An important aspect of swarm translation is that this is not a sequential process. Many passages are translated simultaneously, including parallel discussions on the translations themselves. Ben shared lessons learned concerning the technological and organizational set up of swarm translation. Formulating these as best practices may help to guide future, similar endeavors.
In the Q&A session, one challenge of swarm translation was discussed: how to balance demand against the translation services on offer. This is difficult in a community project that consists of volunteers. Related to the FREME project, the priorities of the interfaces being developed were discussed. To make linked data accessible to people who are not linked data experts, one needs interfaces that hide the complexity of data sets and data access, together with the ability to work with non-linked, structured data formats such as CSV. The usefulness of tools like Ocelot or translate5 was questioned, since they do not provide what translators need in daily life, that is, CAT (computer-assisted translation) tooling functionality. This topic has to be separated from the design of interfaces to certain data and language technologies provided by FREME. The vision behind these interfaces is to ease technology integration across platforms.
Users session
This session was chaired by Olaf-Michael Stefanov of ASLING and JIAMCATT.
Delyth Prys (Bangor University) gave a presentation entitled
Best Practices for Sharing Language Technology Resources in Minority Language Environments. She presented a new Welsh National Language Technologies Portal launched by the Language Technologies Department of Bangor University. The portal provides technology components for developing language technology applications for the Welsh language. Example components are spelling/grammar checkers, part-of-speech taggers and a machine translation system. Other significant remarks:
- Having technology components available is not enough. To achieve uptake by technology users one needs documentation, tutorials and examples of how to use the technology.
- Another crucial aspect for minority languages is the cost of language technology components. The portal for Welsh makes the components available for free, to foster commercial application development.
- The portal for Welsh can be seen as a success story of how to avoid
digital extinction, a term coined to describe the situation of smaller languages; see the META-NET White Paper Series.
Thibault Grouas (Ministry of Culture and Communication, France) gave a presentation entitled
Building a Multilingual Website with No Translation Resources. Thibault introduced JocondeLab, a multilingual web site that provides access to cultural artifacts in France. A key feature of JocondeLab is that it provides information integration of various types of multilingual resources, using linked data technologies. Among these resources is the French DBpedia. By interlinking this and other resources, JocondeLab provides access in 14 languages. Other significant remarks:
- The approach of creating multilingual user experience via publicly available cross-lingual linked data sources can be seen as a best practice of low budget multilingual content creation.
- The multilingual experience was created without actual translation (by humans or machines), but purely based on cross-lingual linking. The only translation / localization task concerned the multilingual user interface.
- JocondeLab won the French Data Intelligence Award. For re-use, the data has been made available under various licenses.
Dennis Tan (VeriSign, Inc.) gave a presentation entitled
Towards an End-to-End Multilingual Web. Dennis presented on the current state of internationalized domain names (IDNs). IDNs allow users around the world to experience web addresses in their own languages and scripts. There is huge demand for IDNs, since most users live in countries and regions that use non-Latin scripts. Due to the interconnected nature of the Web itself, the tooling around IDNs needs to be made available in an end-to-end manner. Other significant remarks:
- Sometimes certain issues like homographs are regarded as a challenge hindering adoption of IDNs. Dennis pointed out that there is mature technology to avoid the danger of homographs: in the browser, in domain name registries, and via dedicated testing tools.
- Dennis introduced the initiative of Universal Acceptance. Its aim is to foster the end-to-end vision with contributions from a wide range of stakeholders.
- Universal acceptance will be achieved once IDNs are available for everybody and in every tool: in web browsers, email clients, mobile applications, and user settings.
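The translation underlying IDNs, between Unicode labels and the ASCII-compatible Punycode form used in DNS, can be demonstrated with Python's built-in idna codec (which implements the older IDNA 2003 rules); the domain name below is illustrative:

```python
# Encode a Unicode domain name to its ASCII-compatible (Punycode) form
# and back, using Python's built-in "idna" codec (IDNA 2003 rules).
unicode_name = "münchen.example"

ascii_name = unicode_name.encode("idna")   # ACE form as used in DNS
print(ascii_name)                          # b'xn--mnchen-3ya.example'
print(ascii_name.decode("idna"))           # 'münchen.example'
```

Universal acceptance means that every tool in the chain performs this conversion (or accepts both forms) consistently, so that the user only ever sees the Unicode form.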
In the Q&A session, the language technology efforts around Welsh were discussed. Keyboard layout for Welsh and the treatment of digraph characters had been an issue, but this now seems to be resolved. Coding clubs are a means to gather interest from young people to work with the Welsh language. Components being developed for Welsh have the potential to be re-used for other small languages, e.g. in the area of speech and machine translation system development.
The multilingual cultural heritage data sets that have been developed for the JocondeLab portal may find usage also in commercial applications. The data is available under suitable licenses. One challenge during the data set creation process was granularity: e.g. DBpedia provides less detailed information in some areas than certain cultural heritage metadata.
Finally, the success of IDNs and the usefulness of networking around the topic in the last years were discussed. The effort of Universal Acceptance is a means to get important stakeholders involved. Events like the MultilingualWeb workshop series are essential to keep the conversation across stakeholder groups going. Dennis's words fit well for closing the 8th MultilingualWeb workshop: It is all about meeting the right people and about keeping the conversation going.