W3C Workshop Program:
Data, content and services for the Multilingual Web
29 April 2015, Riga, part of the
Riga Summit 2015
on the Multilingual Digital Single Market (27-29 April)
The MultilingualWeb community develops and promotes best practices and standards related to all aspects of creating, localizing, and deploying the Web across boundaries of language. This W3C workshop aims to raise the visibility of existing best practices and standards for dealing with language on the Internet, and to identify and resolve gaps that keep the Internet from living up to its global potential. It will be held in Riga as part of the Riga Summit 2015 on the Multilingual Digital Single Market. See information about the venue on the summit website.
The workshop is made possible with the support of the LIDER project.
After the keynote speech, each main session will contain a series of talks followed by some time for questions and answers.
Minister of Culture, Latvia
Internet Governance Forum
World Wide Web Consortium
People-First: Multilingualism in a Single Digital World
abstract In today’s borderless world of devices and services, content travels and roams alongside people. This reality means rethinking the classic approach to localization as a multilingual, multi-geo, and multicultural capability. How can we account for going beyond language? How do we address this dilemma internally in the enterprise, and what do we offer our external partners to extend their market reach? This talk explores how Microsoft’s Global Readiness group is increasingly focused on going beyond the traditional borders of localization to help deliver global-ready solutions that meet people’s preferences and expectations.
[Chair, Arle Lommel, Richard Ishida, Felix Sasaki • Scribe, Phil Ritchie]
Oxford Internet Institute, University of Oxford
A Call to Build and Implement a Common Translation and Country-Language Selector Repository
abstract Several gaps prevent Web users from expressing their country and language choices in a consistent and usable fashion. These gaps include the lack of: (a) better ways for users to express their preferences for receiving machine-translated content versus original content for any language(s) they specify; (b) consistent language selectors; and (c) consistent country selectors.
Europe has the opportunity to demonstrate and develop best practices and standards for its Web users and the world by: (a) expanding the specification and implementation of what-to-translate (and what-not-to-translate) preferences via the Accept-Language header; (b) providing common and open resources for the ordering of language names and codes; and (c) providing common and open resources for the categorization and collation of country names and codes. The Unicode CLDR provides some standard solutions for the last two issues. For a better Web user experience, the European institutions can lead by harmonizing auto-translation mechanisms and the language and country selectors of websites in both the public and private sectors. Example institutions include immigration, visa, and border control agencies, as well as airline, hotel-booking, and travel agencies, serving European and international Web users alike.
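As background for the proposal above, the Accept-Language header already carries an ordered list of language preferences with quality values, as defined in the HTTP specification. The following is a minimal, illustrative sketch (not the proposed extension itself) of parsing such a header into a ranked preference list; the function name and sample header are hypothetical.

```python
# Sketch: parse an HTTP Accept-Language header into (language-tag, q) pairs,
# ordered by descending quality. This is the existing mechanism the talk
# proposes to extend with translate / do-not-translate preferences.

def parse_accept_language(header):
    """Return language tags sorted by descending q-value."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                q = float(value)
        prefs.append((tag, q))
    return sorted(prefs, key=lambda p: p[1], reverse=True)

# e.g. a Latvian user who also reads British English and Russian:
print(parse_accept_language("lv, en-GB;q=0.8, ru;q=0.5"))
# → [('lv', 1.0), ('en-GB', 0.8), ('ru', 0.5)]
```

A server could use such a ranked list to decide which original-language version to serve; the gap the talk identifies is that the header cannot say "machine-translate into this language" versus "serve the original".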
Sapienza University of Rome
Metasearch in a Big Multilingual Terminological Data Space
abstract One of the major challenges faced by the 28 EU Member States is managing the large amounts of data produced every day in a complex variety of different codes (languages, formats, etc.). Our initiative is developing a multilingual metasearch service that allows access to Web information in an unconventional but logical way: one enters a query in one of the official EU languages and gets results from the other 23 official EU languages, without file-format limitations. The metasearch technology searches a large number of heterogeneous terminological databases and glossaries and returns processed results. This technology aims to seamlessly connect thousands of glossaries and terminological resources across languages (provided by TermCoord), such as BabelNet (Sapienza University of Rome), which integrates several large-scale (but general-purpose) multilingual resources such as Wikipedia, WordNet, and Wiktionary into a big multilingual terminological data space.
To make the search results more accurate and richer, two universities are providing data from specific domains: migration, human trafficking, and human rights (Aristotle University of Thessaloniki), and management of waste sorting (Università di Salerno). The Terminology Coordination Unit focuses on the labour market domain (job mobility, job vacancies). To further promote and contribute to a barrier-free EU, the metasearch engine could enable EU citizens to search online for all sorts of information that will help them enrich their academic qualifications and improve their employability, and enable EU companies to extend their visibility. The key element is terminology: data cannot be properly explored without terminology resources, and multilingual terminological resources are therefore the only way to disambiguate big data sets.
Moving from a Multilingual Publications Repository to eBook Collections in the United Nations
abstract With more than 600 publications per year, the Food and Agriculture Organization of the United Nations (FAO) is one of the main publishers in the area of food and agriculture. Increasingly, these publications are being made available in EPUB format, but this process is not automated and it is subject to errors and corrections. This presentation reviews the need for standards and best practices in the area of high-quality digital publishing from a user perspective, with particular attention to constraints for non-Latin languages.
Berlin School of Library and Information Science, Humboldt University of Berlin
Evaluating Multilingual Features in Europeana: Deriving Best Practices for Digital Cultural Heritage
abstract This talk addresses recent improvements in multilingual access to content from Europeana (the European digital library, archive, and museum). In the past year, Europeana implemented query translation and further improved its processes for automatic enrichment with multilingual vocabularies. Automatic enrichment and linking of digital objects to multilingual vocabularies for persons, locations, time spans, and concepts help users discover content in languages they do not speak. Determining criteria for identifying suitable multilingual resources (e.g., with clear and workable re-use conditions) has recently become one of Europeana’s pressing concerns. Query translation is a challenge because queries can appear in more than 30 different languages, and named entities often predominate in them.
The presentation focuses on the impact of these multilingual implementations and how they can be evaluated. Europeana has thus far evaluated its query translation feature and the impact of automatic enrichment on cross-language retrieval. The results are helping to develop best practices for multilingual access in digital cultural heritage.
[Chair, Tatiana Gornostay • Scribe, Thierry Declerck]
Logrus International Corp.
Standardizing Quality Assessment for the Multilingual Web
abstract Standards are crucial during all stages of multilingual content production, from file formats to tools to processes, etc. A standardized way of assessing the quality of translated multilingual materials independently and efficiently is no less important. This becomes critical when dealing with big and visible resources designated for wide public use.
The presentation focuses on the ASTM WK46397 proposal, which aims to create a simplified quality assessment standard for multilingual content. It targets cases that combine high visibility of multilingual resources, a large and significantly diverse target audience, and limited review capabilities and/or budget. The standard comprises three components: a general approach based on customized crowdsourcing, the process, and the quality metric itself. Each component, and the connections between them, is described in detail. The presentation also discusses the standard’s applicability and the results of an actual project, carried out using the developed process and metric, that reviewed a translated version of an important public portal.
XLIFF 2.0 and Microsoft’s Multilingual App Toolkit
abstract The presentation examines the role of XLIFF 2.0 in cross-platform, multilingual app development and looks at how Microsoft has moved to provide rich support for developers working on Windows, iOS, and Android platforms. This support enables them not only to re-use large percentages of their code, but also to benefit from cross-platform translation services in the latest version of the Multilingual App Toolkit, an extension to Visual Studio. We will show how cross-platform projects are supported by our use of XLIFF 2.0 in this unique extension.
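XLIFF 2.0 files, as exchanged by such toolchains, wrap translatable text in a file/unit/segment hierarchy with `source` and `target` elements. The following hand-written minimal document and the strings in it are illustrative, not taken from the Multilingual App Toolkit; it is parsed here with Python's standard library only to show the structure.

```python
# Sketch: a minimal XLIFF 2.0 document and how a tool might walk its segments.
import xml.etree.ElementTree as ET

XLIFF = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       version="2.0" srcLang="en" trgLang="lv">
  <file id="f1">
    <unit id="u1">
      <segment>
        <source>Hello world</source>
        <target>Sveika, pasaule</target>
      </segment>
    </unit>
  </file>
</xliff>"""

ns = {"x": "urn:oasis:names:tc:xliff:document:2.0"}
root = ET.fromstring(XLIFF)
# The xliff element declares the source and target languages once,
# so every segment in the file shares the same language pair.
for seg in root.iterfind(".//x:segment", ns):
    src = seg.find("x:source", ns).text
    tgt = seg.find("x:target", ns).text
    print(src, "->", tgt)  # → Hello world -> Sveika, pasaule
```

Because the format is a single cross-platform standard, the same file can round-trip between a Windows, iOS, or Android project and any XLIFF 2.0-aware translation service.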
University of Limerick
Loïc Dufresne de Virel
Developing a Standards-Based Localization Service Bus at Intel
abstract Intel’s in-house localization group recently embarked on a major project that will revolutionize its internal service provision. Intel partnered with CNGL/ADAPT to design the data model and architecture for a modular, extensible, vendor-agnostic, and future-proof I18n/L10n service bus. This presentation details how the proposed data model and the overall bus architecture benefit from the use of a metadata-rich message (workflow token) format that is largely informed by recent standards such as CMIS 1.1, ITS 2.0, XLIFF 2.0, and XLIFF 2.1. The modularity of these standards offers a robust match for a generalized and abstracted BPM bus solution connecting a number of messaging brokers, grouping Content Management Systems, Code Control Repositories, and I18n/L10n services that cover code scanning, pseudo-translation, machine translation, etc.
[Chair, Fernando Serván • Scribe, Kevin Koidl]
Universidad Politécnica de Madrid
LIDER: Building Free, Interlinked, and Interoperable Language Resources
abstract This presentation provides an overview of the intermediate outcomes of the LIDER project. LIDER aims at establishing a new Linked Open Data (LOD)-based ecosystem of free, interlinked, and semantically interoperable language resources: Linguistic Linked Data. A specific focus will be on the roadmap for the use of Linguistic Linked Data for Content Analytics, which is currently being developed in LIDER with feedback from various communities in research and industry.
FALCON: Building the Localization Web
abstract The FALCON project (falcon-project.eu) has assembled an online translation tool chain that combines state-of-the-art website translation, translation management, computer-aided translation, and terminology management products. This tool chain has been enhanced with open-source automatic text extraction and machine translation technologies and with public language resources. Integration of these components uses web services that leverage open standards, including XLIFF, TBX, and open linked tabular data formats. Iterative quality improvement is delivered by using linked data to actively manage the curation and reuse of language resources within customer projects. The work demonstrates the integration of iterative SMT training management with active curation of MT corrections and target-term capture by post-editors in a live localisation workflow using this commercial tool chain.
Semi-Automatic Generation of Multilingual Glossaries
abstract The semi-automatic generation of multilingual glossaries stems from K Dictionaries’ unique English multilingual dictionary. It involves: (1) reverse engineering parts of the initial data, followed by (2) editing the word lists and their links and re-processing the results, and (3) expanding through Linked (Open) Data and Semantic Web technologies. Stages (1) and (2) are ready for 17 languages, each linked to 43 languages. The standard applied for the LOD extension is based on LEMON and combines research conducted in 2014 at Leipzig University and Universidad Politécnica de Madrid on converting our data from XML to RDF. The glossaries serve as powerful tools for dealing with multilingual content on the Web and for networking dozens of languages.
[Chair, Feiyu Xu • Scribe, Felix Sasaki]
Designing Purpose-Specific Quality Metrics for the Web
abstract This talk examines the process of creating appropriate translation quality metrics for various forms of Web content, which have different requirements from traditional printed documents. It focuses on the use of the Multidimensional Quality Metrics (MQM) specification as a way to design suitable metrics and facilitate technical integration with ITS 2.0-aware tools.
Felix Sasaki; representing the FREME project
DFKI / W3C Fellow
Language and Data Processing as First-Class Citizens on the Web
abstract Language technologies and data technologies have led to many solutions for producing and processing multilingual data and content, but several challenges hinder widespread adoption. Many technologies are hidden behind vendor-specific solutions or can only be configured by data processing or language technology experts. Out-of-the-box solutions often produce unsatisfactory results, and they may not cover the languages, domains, or content formats needed for a given use case.
This presentation introduces the FREME project. FREME aims to fill these gaps by providing easy-to-use interfaces for various types of language and data technologies. These interfaces will be implemented in the browser, among other environments, making language and data technologies first-class citizens on the Web. The design of the interfaces (including both software and graphical user interfaces) needs to be driven by input from various user types. Four domains are central to the project: localisation, digital publishing, public data from the agriculture domain, and website personalisation. This presentation is a call for input in these areas (and beyond) to ensure that the solutions respond to global user requirements.
Ocelot: An Agile XLIFF Editor
abstract Ocelot is a flexible, open-source XLIFF editor that is an integral component of the well-known Okapi Framework. Though Ocelot cannot be considered a fully-fledged Computer-Aided Translation tool, its open architecture, implementation of two important standards, and creative use of metadata mean that it can be used to great effect in localization processes that relate to content curation and quality assurance.
Ocelot supports XLIFF 1.2 and, more recently, XLIFF 2.0. Its compatibility with the Okapi Framework’s filtering/merging and translation kit creation pipeline means that many disparate file types can be safely edited. Ocelot also supports aspects of the ITS 2.0 W3C Recommendation—Provenance, Language Quality Issue, and MT Confidence—in an XLIFF environment. It has been used to great effect in linguistic review (where quality models such as QT Launchpad’s MQM can be configured) and in post-editing, where Fluency/Adequacy scores are captured efficiently with no impact on linguist performance. Within the EC-funded FREME project, Vistatec is currently adding Semantic Web features to Ocelot, such as in-editor semantic enrichment and serialization as Linked Data.
abstract Social networks use the Internet to quickly mobilize thousands of users around a cause. The quantitative jump in communication volume arising from these networks opens up the possibility of a new method of translation: swarm translation. Swarm translation connects people who have acquired a foreign language into worldwide networks. In this model, previously scattered knowledge is transformed into innovative potential on translation platforms by allowing individuals to take an active role in translating what they care about, instead of waiting in vain for the market to deliver translations. This presentation describes the concept and the development path of one swarm-translation community in which, within a few days, over 25,000 users collaboratively translated books into their native language. In some cases over thirty different formulations of text passages were generated. An important consideration in this model is that translation shifts from a sequential process, in which passages are translated in order, to one in which all portions of the translation are carried out simultaneously and in parallel. The presentation further describes how swarm translation can help strengthen the development and status of translators’ mother tongues, especially in the case of less-dominant languages. We will describe best practices and lessons learned concerning the technological and organizational conditions required to initiate swarm translation, so that language barriers can be converted into mass border crossings to other language cultures.
[Chair, David Filip • Scribe, John McCrae]
Bangor University, Wales
Best Practices for Sharing Language Technology Resources in Minority Language Environments
abstract On March 6, 2015 the Language Technologies Department at Bangor University launched a new Welsh National Language Technologies Portal at http://techiaith.org/?lang=en. It contains an initial collection of eight language technology resources selected by the Welsh Government and comprises a list of resources developed in past projects that can be shared with software developers, language activists, coding clubs, researchers and others. These include API keys for our spelling/grammar checker, Part of Speech tagger, and machine translation.
In addition to the resources themselves, the website contains documentation, tutorials, and suggestions for use to encourage take-up. The resources are available at no cost and on generous free licences so that commercial companies, public organisations, and enthusiasts can use them. We believe this is a good model for minority and less-resourced languages, where knowledge of what is available and how it may be used is limited, and where funding and access to these resources are restricted.
Policy officer for languages in the digital environment, Ministère de la Culture et de la Communication (France)
Building a Multilingual Website with No Translation Resources
abstract The French Ministry of Culture has designed a fully multilingual website, JocondeLab (http://jocondelab.iri-research.org/jocondelab/), from a monolingual data set consisting of 300,000 illustrated descriptions of works of art displayed in French museums. We added the multilingualism by matching the tags from Joconde with those from DBpedia.fr, an RDF database extracted from the French Wikipedia in partnership with the French research institute INRIA. Doing so allowed full access to the 300,000 files in 14 languages, including French regional languages present on Wikipedia. Only three months of development, on a very low budget, were necessary to obtain these results. No translation was needed except for a few interface items (less than half a page). The JocondeLab website uses the W3C's SPARQL standard to integrate data from DBpedia and is compliant with most of the WCAG standard. It is also distributed under a copyleft licence (the French CeCILL licence, compatible with the GNU LGPL and the EUPL). The project won the French “Data Intelligence Award” in 2014 and is listed in the OECD observatory of public innovations. It provides a very good example of how to make multilingualism a reality with no translation resources.
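The core of this approach is that a tag matched to a DBpedia resource already carries labels in many languages, so multilingualism comes free with the link. The sketch below is hypothetical (the resource URI, languages, and function name are illustrative, not taken from JocondeLab); it only builds the kind of SPARQL query such a site might send to a DBpedia endpoint to retrieve labels in other languages.

```python
# Sketch: build a SPARQL query asking for a linked resource's labels
# in a chosen set of languages. No network access is performed here.

def label_query(resource_uri, languages):
    """Return a SPARQL query for rdfs:label values in the given languages."""
    filters = ", ".join('"%s"' % lang for lang in languages)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE {\n"
        "  <" + resource_uri + "> rdfs:label ?label .\n"
        "  FILTER (lang(?label) IN (" + filters + "))\n"
        "}"
    )

# Hypothetical example: labels for a matched resource in English, German,
# and Occitan (a regional language present on Wikipedia):
print(label_query("http://fr.dbpedia.org/resource/La_Joconde", ["en", "de", "oc"]))
```

Sending such a query to a public SPARQL endpoint would return one label per available language, which is how a monolingual tag set can drive a 14-language interface without human translation.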
Towards an End-to-End Multilingual Web
abstract Today, the majority of Internet users live in countries where Latin is not the native script, and the next generations of Internet users are expected to continue the trend towards greater use of non-Latin scripts. Internationalized Domain Names (IDNs) offer a familiar, localized Web experience for these users, as well as a natural and memorable way to identify websites and other resources that they can use to navigate the World Wide Web. However, the reality is that IDNs continue to face serious acceptance challenges that may cause registrants to overlook IDNs for their online presence and identity. If domain names are used to establish an online identity, and IDNs offer an intuitive and familiar alternative to achieve the same goal in the local language, why do we have this problem? This presentation reviews the current state of IDN deployment worldwide and discusses some of the best practices adopted by software developers to narrow the adoption gap. Finally, it focuses on the homograph issue to separate facts from myths.
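For readers unfamiliar with the mechanics behind the homograph issue: IDNs are carried in the DNS as ASCII "xn--" (Punycode) labels, so two names that look identical on screen can be entirely different domains. The sketch below uses Python's built-in `idna` codec purely for illustration; note that this codec implements the older IDNA 2003 mapping, while registries and browsers today apply stricter IDNA 2008 / UTS #46 rules, so real-world behaviour differs.

```python
# Sketch: a Latin domain and a visually identical lookalike whose first
# letter is CYRILLIC SMALL LETTER A (U+0430). The ASCII name passes
# through unchanged; the lookalike becomes a distinct Punycode label.

latin = "apple.com"
lookalike = "\u0430pple.com"  # renders the same as "apple.com" in many fonts

print(latin.encode("idna"))      # → b'apple.com'
print(lookalike.encode("idna"))  # an ASCII b'xn--...' label: a different domain
```

This is exactly why IDN-aware software displays suspicious mixed-script names in their xn-- form rather than as Unicode, one of the deployment best practices the talk surveys.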
[Chair, Olaf-Michael Stefanov • Scribe, Arle Lommel]
In the Gold Hall Foyer
details A cocktail event sponsored by Verisign will take place in the Gold Hall Foyer.