W3C Workshop Program:
Making the Multilingual Web Work
12-13 March 2013, Rome
The MultilingualWeb community develops and promotes best practices and standards related to all aspects of creating, localizing, and deploying the Web across boundaries of language. This W3C workshop aims to raise the visibility of existing best practices and standards for dealing with language on the Internet and to identify and resolve gaps that keep the Internet from living up to its global potential. It will be held in Rome, and will be hosted by the Food and Agriculture Organization (FAO) of the United Nations.
After the keynote speech, each main session on the first day and a half will contain a series of talks followed by some time for questions and answers. The afternoon of the second day will be dedicated to an Open Space discussion forum, where participants can discuss the themes of the workshop in breakout sessions. This will be facilitated by Des Oates of Adobe. All attendees participate in all sessions.
The program also features a showcase of implementations of the forthcoming ITS 2.0 specification that will allow attendees to get a sneak peek at how this specification will impact and support multilingual requirements on the Web.
The Workshop will be followed on March 14 by two independent half-day workshops run by the QTLaunchPad project, one of the Workshop sponsors, on translation quality and European research initiatives in translation. Interested parties should visit the QTLaunchPad pages for more information and to apply to participate in these workshops.
Related links: Workshop Report • Call for Participation • About W3C
Food and Agriculture Organization of the United Nations (FAO)
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI)
Welcome and Introductions
Mark Davis, Vladimir Weinstein
Keynote: Innovations in Internationalization at Google
abstract This presentation covers the range of internationalization challenges that Google has encountered and overcome in the past year, illustrating the challenges that many companies face. Among the topics are how to better serve multilingual users (particularly focusing on personal names, plurals, gender, language resolution, and bidi languages), expanding CLDR and ICU (and Google products) to more languages, client-side i18n, and the overall localization process.
Jan Anders Nelson
Going Global with Mobile App Development: Enabling the Connected Enterprise
abstract Companies are shifting from viewing the enterprise as a machine with mostly sequential workflows to seeing it as an interconnected complex ecosystem with a variety of feedback entities. Enterprises need to become more responsive to changing environments to better adapt and further evolve by operating as a learning entity that proactively interacts with its environment and continuously improves based on experiments and feedback. Accordingly, the currently static multilingual content (web, technical communication and documentation, service descriptions, etc.) offered by companies must also become more fluent and dynamic. This change will have a dramatic impact on existing corporate language formalities and rules, which must become learning-enabled entities within the connected enterprise. It also poses many challenges for language-processing and compliance-control systems, both internally and externally. As a result there will be not a single, static language model, but instead multiple models derived and learned from natural patterns identified in continuous data streams, shared across communities and combined with additional media elements. As a first attempt towards possible solutions, we discuss how the development of mobile applications for smartphones and tablets might help companies successfully master the transition phase to a fully connected enterprise. The presentation is organized into: (1) the emerging connected enterprise, (2) arising language challenges, (3) contribution of mobile app development, and (4) projection and outlook.
University of Sassari
Multilingual Mark-Up of Text-Audio Synchronization at a Word-by-Word Level – How HTML5 May Assist In-Browser Solutions
abstract W3C standards and recommendations can be adopted to facilitate multilingual narration by synchronising the presentation of text and audio sources at a word-by-word level. The advent of HTML5 audio within the browser, in particular, is a key enabler. We explore the issues arising in scripting using that standard to allow people on the web to mark up audio cue-points that correspond to particular words in the text, leading towards a tool that may be useful in multilingual appreciation/learning. This work extends our existing MLW semantic mark-up prototype first presented at the Pisa MLW Workshop.
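To make the idea of word-level cue-points concrete, here is a minimal sketch of the lookup that such a tool needs: given the current playback time, find the word to highlight. It is not the presenters' implementation. The cue data and the function name `word_at` are hypothetical, and the sketch is written in Python for brevity; in a browser the same lookup would typically run in a script that reads the audio element's `currentTime` on each `timeupdate` event.

```python
from bisect import bisect_right

# Hypothetical cue points: (start time in seconds, word), sorted by start time.
CUES = [(0.0, "Nel"), (0.4, "mezzo"), (0.9, "del"), (1.1, "cammin")]


def word_at(cues, t):
    """Return the index of the word being spoken at time t.

    Returns None if t falls before the first cue point.
    """
    starts = [start for start, _ in cues]
    # bisect_right counts the cues that start at or before t.
    i = bisect_right(starts, t) - 1
    return i if i >= 0 else None
```

Storing only start times keeps authoring simple, since a person marking cue-points only has to tap once per word; each word is then assumed to last until the next cue begins.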
Multilingual Challenges from a Tool Developer's Perspective
abstract As a non-native English speaker I've been involved with multilingual development projects for over a decade, first leading the PHP documentation and website teams and then on Drupal since 2003. The open-source, community-built Drupal 8 Multilingual Initiative, which I am currently leading, focuses on making the Drupal base system more multilingual-friendly and includes huge user-facing improvements and many more backend technology developments and unification. This presentation will address how multilingual data is modeled in a practical implementation (software translation, different content models, workflow basics, etc.), how a predominantly English/American-focused development team can be persuaded to include multilingual features, and how to integrate multilingual tool development with open source methods (e.g., crowd-sourcing software translation, pluggable translation integration systems). Taken together, these issues show how software approaches to multilingual needs can be built right into one of the world's leading content management systems.
Director LRC, University of Limerick
Enabling the Global Conversation in Communities
abstract The Rosetta Foundation has recently conducted a first pilot project using SOLAS, a highly innovative, standards-based community space for translation and localization, developed at the Localisation Research Centre at the University of Limerick, in cooperation with the Centre for Next Generation Localisation. This presentation makes a case for demand- and user-driven translation and localization (social localization), and then describes why social localisation requires new technologies—based on open standards and open source—using the experience of the Rosetta Foundation as an example. It will then demonstrate how SOLAS-Match can be used in this context.
Román Díez González
Spanish Tax Agency
Pedro L. Díez-Orzas
ITS2.0 Implementation Experience in HTML5 with the "Spanish Tax Agency" Site
abstract The presentation will show the experience of the Spanish Tax Agency in working with MT-specific features of the forthcoming ITS 2.0 specification (developed in the W3C's MultilingualWeb-LT project). It specifically talks about shifting from HTML 4.01 to HTML5 and strategies for annotating HTML5 content with ITS 2.0 markup in an efficient and pragmatic way, when faced with real-world pressures and requirements. This presentation will describe how the www.agenciatributaria.es site has been made multilingual using Linguaserve's Real Time Translation System, the shift to HTML5, and the experience with ITS 2.0 annotation (both automatic and manual).
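One ITS 2.0 annotation that an MT workflow can act on in HTML5 is the `translate` attribute, which marks content that should or should not be translated. The sketch below is only an illustration, not Linguaserve's system: it collects the text a translation step would receive and skips anything under `translate="no"`. The class and function names are hypothetical, only explicit `yes`/`no` values are handled, and the list of void elements is deliberately short.

```python
from html.parser import HTMLParser

# Elements without an end tag; they must not change the inheritance stack.
VOID = {"br", "img", "hr", "input", "meta", "link"}


class TranslatableText(HTMLParser):
    """Collect text chunks that are not inside translate="no"."""

    def __init__(self):
        super().__init__()
        self.stack = [True]  # translate state per open element; default is "yes"
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        value = dict(attrs).get("translate")
        if value == "no":
            self.stack.append(False)
        elif value == "yes":
            self.stack.append(True)
        else:
            # No explicit value: inherit from the parent element.
            self.stack.append(self.stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack[-1] and data.strip():
            self.chunks.append(data.strip())


def translatable_chunks(html):
    parser = TranslatableText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For a made-up fragment such as `<p>Pay your <span translate="no">IRPF</span> tax <b>online</b>.</p>`, the function returns the chunks around the protected term, so the term itself never reaches the MT engine.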
Hans-Ulrich von Freyberg
Standardization for the Multilingual Web: A Driver of Business Opportunities
abstract Standardisation efforts are particularly attractive if they promise to drive business applications forward. Although still in development, the ITS 2.0 standard (developed by the W3C's MultilingualWeb-LT project) is already proving that it can fulfil this promise. This talk shows how the German Industrial Machine Builders' Association (VDMA) and Cocomore, as its service provider, benefit from the development of the ITS 2.0 standard. It demonstrates how the systems created during the standardisation effort support developing client relationships and business opportunities. As a further aspect, it is shown how the results of the standard development process have impacted VDMA's ability to conserve valuable resources.
Building Multilingual Web Sites with Joomla! the Leading Open Source CMS
abstract Used by over 2.8% of the web and by over 3000 government web sites, Joomla is the leading Open Source CMS. Making web sites truly multilingual, rather than relying on automated translation tools, is now a requirement. The latest release of Joomla makes building multilingual sites considerably easier, while also making them more accessible across different user agents and form factors (e.g., desktop or laptop computers, mobile phones, or tablets). This presentation will showcase how these new developments can be used in Joomla to greatly reduce the burden of building and releasing multilingual sites.
Humboldt Universität zu Berlin
The Europeana Use Case - Multilingual and Semantic Interoperability in Cultural Heritage Information Systems
abstract This presentation discusses the semantic and multilingual interoperability challenges that arise when building large-scale cultural heritage information systems. Europeana—the European digital library, archive, and museum—will be used as a use case to discuss concrete processes and issues when developing an aggregated solution to access very heterogeneous collections of cultural heritage material. What does it mean to target a European audience? How can sparse metadata in different formats be aggregated and enriched in order to provide a satisfactory multilingual user experience? Using actual examples from Europeana, challenges and problems will be highlighted and potential solutions discussed.
Tool-Supported Linguistic Quality in Web-Related Multilanguage Contexts
abstract Textual content still dominates the Web. The linguistic quality of textual content—correct spelling, terminology, grammar and style—is of the utmost importance for various content-related processes. Search engines and Machine Translation systems, for example, become more accurate if they operate on high-quality content. Given the volume of content on the Web, automation is important for linguistic quality management. The presentation will address LanguageTool, an adaptable open-source tool that has implemented support for the currently drafted ITS 2.0 specification. It will focus on the experience of adapting LanguageTool in a real-world scenario (examples will be drawn from Russian and English) and using it for checking compliance with governmental regulations like the German BITV2.
Making the Multilingual Web Work: When Open Standards Meet CMS
abstract Multilingual content on the web is no longer a luxury, but instead a requirement for most organizations. Although getting web content translated, prepared, and onto the web is hardly turn-key, a best practice is emerging. Many organizations have turned to Web Content Management Systems (CMS) to solve the complexity of publishing to the web. Similarly, many organizations have embraced open standards to solve the complexity of their content/data creation, publication, and translation workflow. This presentation will demonstrate how the use of a CMS, combined with open standards, moved my team one step closer to a turn-key multilingual web workflow. It will show how using a component CMS (Trisoft, with DITA and XLIFF), and web CMS (Drupal, and the Drupal XLIFF module) improves quality, reduces cost, and reduces time-to-market. It will also address the architecture, the hurdles, and the benefits experienced.
How do you publish one thousand web pages, in 12 languages, at a high quality, 50% quicker than you can today?
abstract Today's public organizations and institutions are faced with creating information for a digital world: information that is fluid rather than static, highly customized for individual needs, and available on-demand across multiple channels and geographies. As content volumes increase, new ways of delivering multilingual information without overstretching translation budgets must be found. Machine translation combined with human post-editing is an innovative new approach to help overcome these challenges, but can it really deliver the level of quality required for multilingual web content? Calling on real-life examples from some of the world's leading private sector companies, this presentation will demonstrate how integrated machine translation and post-editing is already in use to considerably increase the amount of multilingual information that is being published on the web, without compromising on quality. Referencing SDL's automated translation survey 2010, run in conjunction with the European Association of Machine Translation and the American Machine Translation Association, the presentation will also consider trends in the acceptance of machine translation combined with post-editing.
Delivering Multilingual Web Content for Mass Consumer Products: Rewriting Contemporary Industry Standards
abstract This presentation is based on a case study of Adobe Photoshop Elements, which has about a quarter of a million words spread across 100+ webpages per language and ships in 16 languages. Given its popularity, content changes happen regularly throughout the year, requiring quick and efficient localization. While advanced authoring tools are sensitive to the needs of internationalizing the content properly, even the most advanced localization tools do not provide a truly failsafe mode of translating and testing localized content, with many manual, error-prone tasks required. This situation leads to a "reactive" mode in which issues are identified after production is done and web pages are fixed after they are live. For any popular product, however, developers cannot afford to be in a reactive mode and instead have to fix issues proactively. The Photoshop Elements team uses cutting-edge processes and tools that not only ensure high-quality translations by leveraging MT and seamless content distribution, but also ensure that "proactive" testing and bug-fixing happen during staging itself, with almost the entire process automated. This innovative development has greatly improved turnaround time, reduced effort, and increased efficiency of web page localization.
Charles McCathie Nevile
Localization in a Big Company: The Hard Bits
abstract Yandex has more than two decades of experience in developing linguistic analysis tools. We have machine-assisted translation systems, localisation systems, content management systems, front-end development modularisation and more. So what goes wrong in practice? And why doesn't all this technology always lead to perfect results?
Quality Translation: Addressing the Next Barrier to Multilingual Communication on the Internet
abstract Even as Europe has integrated politically, languages are still one of the most pervasive barriers to interpersonal communication, cross-border commerce, and full participation in European democracy. Despite recent progress in machine translation, it is clear that the quality of today's Internet-based translation services is neither good enough for many tasks nor complete in terms of language coverage. Speakers of smaller languages thus find themselves largely excluded from vital discussions of European identity and policy. This talk will argue in favor of a concerted push in Europe for quality translation technology and address concrete preparatory actions performed by the EC-funded QT LaunchPad Project. It will discuss how these actions will directly influence the future of the Multilingual Web, both in Europe and around the world.
Sponsored and hosted by FAO
To further promote networking among attendees, a reception was held on FAO premises.
Sponsored by Verisign
Dinner for Workshop participants at the Ristorante Orazio (requires advance reservation)
José Emilio Labra Gayo
University of Oviedo
Multilingual Linked Open Data Patterns
abstract This talk presents a catalog of patterns and best practices to publish Multilingual Linked Data and identifies some issues that should be taken into account. Each pattern contains a description, a context, an example and a short discussion of its usage.
Universidad Politécnica de Madrid
Multilingualism in Linked Data
abstract This presentation describes a Linked Data generation process that follows an iterative incremental life cycle model. It covers the following activities: (1) specification: for analyzing and selecting the data sources, (2) modelling: for developing the model that represents the information domain of the data sources, (3) generation: for transforming the data sources into RDF, (4) linking: for creating links between the RDF resources of our dataset and other RDF resources in external datasets, (5) publication: for publishing the model, RDF resources, and links generated, and (6) exploitation: for developing applications that consume the dataset in question. The presentation addresses each of these activities and its constituent tasks and the techniques, technologies, and tools available for reuse in a multilingual scenario. It will also present how multilingualism is present in the different activities of the linked data life cycle from specification through to maintenance and use.
Publications Office EU
Public Linked Open Data - the Publications Office's Contribution to the Semantic Web
abstract This presentation addresses the current status of ongoing projects under the responsibility of the Publications Office of the European Union (Publications Office) that contribute to the Semantic Web: (a) CELLAR, a repository exposing metadata about official EU information as Linked Open Data, which the Publications Office is preparing to open to the public, with data loading ongoing since June 2012. (b) The Open Data Portal (technical implementation directed by the Publications Office; a beta version has been available on the web since mid-December 2012), which provides metadata (information about datasets) and several datasets as Linked Open Data. (c) Standardisation: the Publications Office has contributed to the definition of the European Legislation Identifier (ELI). ELI is based on machine-readable URI templates. The Publications Office provides multilingual controlled vocabularies for re-use.
Multilingual Issues in the Representation of International Bibliographic Standards for the Semantic Web
abstract The presentation will discuss current initiatives at the International Federation of Library Associations and Institutions (IFLA) to apply its language policy to linked data representations of its bibliographic standards in a multilingual (semantic) web environment. These include guidelines on translations of namespaces, the Multilingual Dictionary of Cataloguing, and current multilingual element sets and value vocabularies including the Functional Requirements family of models and International Standard Bibliographic Description.
University of Franche-Comté
Language Technology Tools for Supporting the Multilingual Web
abstract The NooJ linguistic development environment (www.nooj4nlp.net) has a strong community, with language resources for 22 languages available, and more in development. This presentation discusses the potential of such robust finite state tools for generating annotated multilingual resources for the web. It will present the current ability to process HTML/XML-annotated documents and to transform those annotations for specific purposes such as standardised markup for supporting multilingual applications.
Internationalized Domain Names: Challenges and Opportunities
abstract Today Internationalized Domain Names (IDNs) are getting more attention than at any other time since they were introduced into the Domain Name System in 2000, but they still have a long path to general adoption. IDNs are far from being ubiquitous and trusted. Verisign, as a registry operator and manager of over 1M IDNs, plays a small part in this ecosystem, which comprises not only registries but also developers, content creators, and policy and standards bodies, who are all attempting to further internationalize, or localize, the identifiers on the Internet. We therefore intend to highlight some of the challenges we have found through our experience as a registry operator and encourage all players to make IDNs a ubiquitous and trusted product for the multilingual web.
What’s in a Name?
abstract This presentation discusses the complex issues that arise when dealing with names on the Internet and in applications that require semantic knowledge of names.
Sebastian Hellmann (presenting for Sören Auer)
The LOD2 Stack and the NLP2RDF Project
abstract This presentation provides an overview of the LOD2 Stack and its components. It then discusses the NLP2RDF project.
Reorganizing Information in a Multilingual Website: Issues and Challenges
abstract This presentation addresses the issues and challenges faced when reorganizing content on a multilingual website. It identifies the standards and best practices available and their use, and addresses issues of long-term use of multilingual content.
The Globalization Penalty
abstract All of the effort put into best practice, website management, structure, and process will not result in success if the content on an organization's or firm's multilingual website can't be found. True best practice includes the need to understand and quickly respond to global algorithm updates on the world's top search engines, allowing content to stay on top of changes, trends, and updates. This presentation will discuss how innovative global leaders infuse SEO throughout the translation process, going beyond simple keyword localization and on-page deployment, to increase the qualified traffic to their web content and site. We will consider several case studies, including the use of improved search performance in the eSupport world, where the inability to find critical 'fixes' and product information leads to costly call center sessions or damaging product returns. In this presentation we will also examine specifics and best practices for: (a) Global Search Performance: How will you best analyse search engine rankings for current in-market websites across multiple markets in relation to your competition, or similar web sites? (b) International Keyword Optimization: What do we see in terms of best practice as you research and curate keywords with high relevancy and search volumes for each market and locale? (c) Content Keyword Mapping: To improve rankings and visibility, how do leaders insert top keywords in website metadata and content to ensure optimization? (d) Ongoing ISEO Management: What should you plan for as you manage ongoing benchmarking and monitor your competitors, ensuring that each of your sites is up to date on new local keywords, search insights, and global tracking keywords, as well as on algorithm changes and web regulations?
Explanation of the format for the afternoon, and selection of discussion topics. Topics are suggested by participants, and the most popular are allocated to breakout groups. A chair is chosen for each group from volunteers. There are also three pre-selected groups.
Various locations are available for breakout groups. Participants can join whichever group they find interesting, and can switch groups at any point. Group chairs facilitate the discussion and ensure that notes are taken to support the summary to be given to the plenary.
The breakout groups were
Group reports and discussion
Everyone meets again in the main conference area and each breakout group presents their findings. Other participants can comment and ask questions.
Proxy-Based Website Translation
abstract Traditional Global Translation Management Systems (GTMS) use an infrastructure-heavy process that requires integration of a Content Management System with the GTMS to translate content. It also requires the content creator's site to serve up all content itself and have the appropriate architecture for maintaining all of the multilingual files and directories. At the opposite extreme is Google Translate, which is fully automatic but totally non-customizable. This presentation describes a "middle way" between the two in which the infrastructure needed for a custom solution is outsourced so that the content creator maintains a monolingual site. When a page is requested in another language, the request is intercepted by a proxy server, which recognizes the language requirement. The request is forwarded to the monolingual site, thus fetching the latest content (including dynamically generated content), which is returned to the proxy. The proxy server then extracts the translatable content and, if it has previously been translated by a human translator, substitutes the translation for the source-language text. If the text has not been translated, the proxy can call MT to translate the content for immediate delivery and enter it into a human translation workflow. This process offers advantages for certain classes of site creators.
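The per-segment decision described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the presenter's product: the function name `translate_page`, the dictionary used as a store of human translations, the `machine_translate` callable, and the review queue are all hypothetical stand-ins, and fetching and parsing the source page are left out.

```python
def translate_page(segments, human_translations, machine_translate, review_queue):
    """Translate extracted source-language segments the way the proxy would.

    segments: translatable strings extracted from the freshly fetched page.
    human_translations: dict mapping source segments to human translations.
    machine_translate: callable used as a fallback for unseen segments.
    review_queue: list that collects segments for the human workflow.
    """
    output = []
    for segment in segments:
        if segment in human_translations:
            # Previously translated by a human: reuse that translation.
            output.append(human_translations[segment])
        else:
            # Not yet translated: deliver MT now, queue for human translation.
            output.append(machine_translate(segment))
            review_queue.append(segment)
    return output
```

Because the lookup happens on every request, newly published source content is served immediately in machine-translated form and is replaced by the human translation once the queued segment has been processed.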
brands4friends (eBay Inc.)
Designing User Experience for Multilingual Web Sites
abstract A look at multilingual Web sites through the users' eyes. What do they expect? A splash screen with no content but a language selection menu? Or content automatically served in a language they (most likely) understand? A drop-down to choose the language from? Or flags for that matter? A presentation aimed especially at those whose answers were not 1× yes and 3× no (not in that order).
University of Bielefeld
Modeling Multilingual Language Resources on the Web
abstract The availability of multilingual resources on the Web is a key component in enabling the creation of sophisticated multilingual tools and agents using Web data. Existing resources (e.g., WordNets, terminologies, machine-readable dictionaries, and lexica) often use significantly different schemes to represent their data and as such it is difficult to combine these. I present the proposed lemon model that provides a simple core model along with a collection of modules, in order to capture the information stored in various resources, and make these available on the Web using linked data principles such as RDF. I will discuss the current status of resources on the Web, which have adopted the model, and describe recent theoretical developments due to the W3C OntoLex Community Group.
Update on the META-NET Strategic Research Agenda for Multilingual Europe 2020: Final Version and Next Steps
abstract Following a presentation of the META-NET Strategic Research Agenda for Multilingual Europe 2020 (SRA) at the MultilingualWeb Workshop in Dublin, this presentation will provide an update on the current state of play. This will include a quick walkthrough of the final version of the SRA, also mentioning where we currently stand and what the next steps in the META-NET initiative are.
Ontology Engineering Group, Universidad Politécnica de Madrid
Towards an Observatory of the Multilingual Web of Data
abstract With the Web of Data growing at a fast pace, it is time to start thinking about it in terms of language. In this talk we will present the first results of the Multilingual Web of Data Observatory. The main goal will be to shed light on questions such as: What is the distribution of natural languages on the Web of Linked Data? To what extent are language tags used to indicate the language of property values? Which domains are predominantly mono/multilingual? What is the distribution of cross-lingual links vs. monolingual links? How are cross-lingual links established (e.g. owl:sameAs)? Do mono/multilingual datasets organize themselves into clusters with respect to the natural languages used?
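The first two questions can be explored on a small scale by counting the language tags attached to literals in an RDF dump. The sketch below is not the Observatory's tooling: it assumes N-Triples input with one triple per line, uses a simplified regular expression rather than a full RDF parser, and the sample data and the function name `language_tag_counts` are made up for illustration.

```python
import re
from collections import Counter

# A quoted literal followed by a language tag, e.g. "Roma"@it (simplified).
LANG_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')


def language_tag_counts(ntriples):
    """Count language tags on literals, one triple per line."""
    counts = Counter()
    for line in ntriples.splitlines():
        match = LANG_LITERAL.search(line)
        if match:
            # Language tags are case-insensitive, so normalise to lowercase.
            counts[match.group(1).lower()] += 1
    return counts


# Made-up sample data using example.org IRIs.
SAMPLE = """\
<http://example.org/rome> <http://www.w3.org/2000/01/rdf-schema#label> "Rome"@en .
<http://example.org/rome> <http://www.w3.org/2000/01/rdf-schema#label> "Roma"@it .
<http://example.org/rome> <http://www.w3.org/2000/01/rdf-schema#label> "Roma"@es .
<http://example.org/rome> <http://example.org/population> "2800000" .
<http://example.org/paris> <http://www.w3.org/2000/01/rdf-schema#label> "Paris"@en .
"""
```

Literals without a tag, such as the population value in the sample, are simply not counted; comparing their number with the tagged total gives a rough answer to how consistently language tags are used.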
Andrejs Vasiljevs & Mārcis Pinnis
The Next Step in Translation Automation: Online Terminology Services for Human and Machine Translation
abstract The explosion of multilingual content on the Web is a tremendous challenge for everyone in the localization industry. Although machine translation is already transforming the industry, translation quality, especially for terminology translation, is among the most critical deficiencies that hinder a wider application of machine translation. This presentation covers work on a new wave of terminology services to assist human translators and improve the quality of machine translation systems. Workflows for terminology services include term identification in the source text, automatic acquisition of translation candidates from term banks, extraction of multilingual terminology from parallel and comparable Web resources, and user involvement in terminology data review and clean-up. The presentation demonstrates terminology services under development within the FP7 Terminology as a Service (TaaS) project.
Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University
Zhishi.me: Towards Chinese Linking Open Data
abstract Linking Open Data (LOD) has become one of the most important community efforts to publish high-quality interconnected semantic data. Such data has been widely used in many applications to provide intelligent services such as entity search and personalized recommendation. While DBpedia, one of the LOD core data sources, contains resources described in multilingual versions and semantic data in English is proliferating, there is very little work on publishing Chinese semantic data. In this talk, I will present Zhishi.me, the first effort to publish large-scale Chinese semantic data and link them together as a Chinese LOD (CLOD). Besides the common challenges of interlinking heterogeneous data sources, I will emphasize the specific issues in dealing with Chinese with respect to XML encoding and IRI formatting, and the corresponding solutions. Moreover, I will introduce our current effort to build a large-scale Chinese relation knowledge base, which is further integrated into zhishi.me. Finally, I will present the future plans for developing zhishi.me.