W3C Workshop Program:
New Horizons for the Multilingual Web
7-8 May 2014 (8-9 May: LIDER workshop), Madrid
The MultilingualWeb community develops and promotes best practices and standards related to all aspects of creating, localizing, and deploying the Web across boundaries of language. This W3C workshop aims to raise the visibility of existing best practices and standards for dealing with language on the Internet and to identify and resolve gaps that keep the Internet from living up to its global potential. It will be held in Madrid, hosted by the Universidad Politécnica de Madrid; see information about the venue in the call for participation.
Live stream: a live stream for the event is available.
After the keynote speech, each main session on the first day and a half will contain a series of talks followed by some time for questions and answers. The afternoon of the second day will be dedicated to an Open Space discussion forum, where participants can discuss the themes of the workshop in breakout sessions. All attendees participate in all sessions.
Aligned with the event, the LIDER project is organizing a Workshop on Linked Data, Language Technologies and Multilingual Content Analytics. This workshop is running 8 May (afternoon, in parallel to the open space sessions) and 9 May (morning). A separate registration is required.
Related links: Workshop report • LD4LT / LIDER workshop report • About W3C
Félix Pérez Martínez
Director of the Escuela Técnica Superior de Ingenieros de Telecomunicación de la UPM (ETSIT UPM)
Victor Robles Forcada
Director of the Escuela Técnica Superior de Ingenieros Informáticos de la UPM (ETSIINF UPM)
Welcome and Introductions
Keynote: Multilingual User Generated Content at Wikipedia scale
Pau Giner, David Chan and Santhosh Thottingal
Best Practices on the Design of Translation
abstract Wikipedia is one of the most multilingual projects on the web today. In order to provide access to knowledge for everyone, Wikipedia is available in more than 280 languages. However, the coverage of topics and their level of detail vary from language to language. The Language Engineering team at the Wikimedia Foundation is building open source tools that support translating content when creating new articles, in order to ease the diffusion of quality content across languages. The translation process in Wikipedia presents many different challenges. The translation tools aim to make the translation process more fluent by integrating resources such as translation services, dictionaries, and information from semantic databases such as Wikidata.org. In addition to the technical challenges, ensuring content quality is one of the most important aspects considered in the design of the tool, since any translation that does not read naturally is not acceptable to a community focused on content quality. This talk will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.
DFKI, Co-Founder of Yocoy
Always Correct Translation for Mobile Conversational Communication
abstract The coexistence of different languages and cultures poses a true challenge for global mobility and communication in business and personal life. Machine translation remains one of the biggest challenges in artificial intelligence, and especially for language technology. Nowadays most machine translation systems rely on machine learning and statistical methods, taking parallel texts as their training data. Such systems can offer very broad coverage, but their output often lacks accuracy and fluency. They therefore cannot be employed for face-to-face communication, because translation errors cause too much trouble for people in foreign countries. In this talk we present a mobile application called "Yocoy Language Guide", used for face-to-face communication between people speaking different languages. At the heart of the app is a new technology called ACT (Always Correct Translation), which delivers 100% correct translation for everyday dialogues across five languages: German, English, Spanish, French, and Chinese. The Yocoy Language Guide has been downloaded by more than 200,000 people from all over the world.
W3C Internationalization Activity Lead
New Internationalization Developments at the World Wide Web Consortium
abstract The W3C Internationalization Activity is striving to ensure that the technology of the World Wide Web remains world wide in scope. The Internationalization Working Group works with other W3C working groups and liaises with other organizations to help ensure universal access to the Web, regardless of language, script or culture. It does so by reviewing and contributing to developments in Web technology. It also provides internationalization-related education and outreach for specification developers, content developers and implementers. This talk will list and describe some of the things at the top of the group's radar in May 2014, and some of the recent success stories.
Charles McCathie Nevile
Multilingual Aspects of Schema.org
abstract Schema.org produces a continuously evolving metadata schema that is already used on around 10% of websites. The talk will briefly explain what it is and how to use it, then explore what it means for developers, especially in the context of a multilingual Web, and how they can help fix things that don't work for them.
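As a hedged sketch of what such markup can involve (not taken from the talk), schema.org's real `inLanguage` property labels content with a BCP 47 language tag; below, a JSON-LD snippet is assembled in Python, with the page name and tag as invented examples.

```python
import json

# Illustrative only: a schema.org description of a Spanish-language page,
# serialized as JSON-LD. "WebPage", "name" and "inLanguage" are genuine
# schema.org terms; the values are made up for this sketch.
page = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "name": "Página de ejemplo",  # page title, in Spanish
    "inLanguage": "es",           # BCP 47 language tag for Spanish
}
jsonld = json.dumps(page, ensure_ascii=False, indent=2)
print(jsonld)
```

In practice such a block would be embedded in the page inside a `script type="application/ld+json"` element, where search engines and other consumers can pick up the language declaration.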
UN Food and Agriculture Organization
Bridging the Online Language Barriers with Machine Translation at the United Nations
abstract The Food and Agriculture Organization of the United Nations is introducing a widget for machine translation into Arabic, Chinese and Russian on its website, FAO.org. This is a new experience because of the technical challenges, the political environment and the internal dynamics. The presentation discusses the approach adopted, the steps taken and the feedback mechanism developed. This experience should be useful to international organizations considering the adoption of machine translation as part of their vision for a multilingual web.
Multilingual Web Considerations for Multiple Devices and Channels – When should you think about Multilingual Content?
abstract It is no longer satisfactory to consider only the PC when developing websites. Increasingly, people access data via tablets and smartphones. We need to consider not only how we display our multilingual content but how it looks on all devices and how it downloads through the various channels. Is it OK for mobile phone users to have to scroll back and forth to see a conventional website on their phone because it has not been customised for that device? Is it OK for iPad users to switch to Puffin because a Flash video has been embedded in your site? Should a user on GPRS be happy to wait more than 90 seconds to download a page on your site? One of the most common and expensive mistakes made when developing a website is not considering the multilingual aspect until the site has been completed, approved and published. So when should you consider multilingual content, and how should you manage it to make it more cost effective? We believe the answer is before you create a single sentence. If writers use structured content to create the site, reuse during translation can be maximised, sometimes saving over 30% of the cost compared to non-structured content.
Space, expansion and display need to be considered to ensure there are no issues displaying in different locales. Perhaps most crucially, the language used needs to be ‘translatable’: we have seen multiple examples where a marketer has produced a ‘brilliant’ concept in English but the translated version has a very different effect.
Universidad Europea - Madrid
Post-editing Practices and the Multilingual Web: Sealing Gaps in Best Practices and Standards
abstract With post-editing becoming a widespread activity in the translation/localisation industry, there are still some gaps to be filled in best practices and standards. If post-editing is to contribute fully to ensuring the multilingual success of the World Wide Web, there are a number of issues yet to be dealt with:
- Is there a real benefit in using standards for post-editing purposes in daily practice?
- Do annotation tags make sentences slightly less understandable and more cryptic for post-editors?
- In cases where there is more than one annotation per phrase, the post-editor may miss the visual continuity of the sentence, spend too much time rereading it or even leave syntax mistakes from the MT uncorrected. How should this information be presented (if at all)?
- Should post-editors be allowed to insert annotations?
This talk aims to raise awareness of these questions, which arise in everyday scenarios where the post-editor is confronted with pressures of productivity, quality and price.
Spain Tourism and Cultural Website
abstract As Content Manager at SEGITTUR, I am responsible for managing several international and promotional websites: www.spain.info, www.spainisculture.com and www.studyinspain.info. These websites are available in 18 languages. I will explain how we manage content in so many languages and how search engines are changing our methodology and overall content strategy.
CNGL KDEG Trinity College Dublin Ireland
Marking Up Our Virtue: Multilingual Standards and Practices in the Digital Humanities
abstract This talk will introduce the area of Digital Humanities (DH) and Digital Cultural Heritage and discuss the relevance of linguistic standards on the web to research and best practice in the area. These domains represent a vibrant area of research activity for a wide number of academic disciplines. At the heart of DH is the use of digital approaches to the interpretation, collation, comprehension and dissemination of cultural and historical artefacts. Examples of artefacts range from 17th-century accounts of rebellions in Ireland, through early 20th-century grey literature concerning the World Wars, to 14th-century commentaries on herbs. Humanists research the texts, the manuscripts themselves and the networks that surround these artefacts in a great many ways, for example by attempting to trace references to particular paramilitary units, or by seeking the assistance of the general public in tagging and marking up hand-written artefacts. Language is a highly complex topic in this domain. Many documents may be written in several archaic forms of one or more languages, with highly informal or irregular content, and may have accompanying markup in yet another language. The general assumption that linguistic resources are available and applicable is therefore extremely subtle in this domain. The incredible richness of the content, combined with its social and cultural value, means there is a constant challenge in choosing the best approaches to digitisation. The realisation that standards are required to create enduring digital cultural archives has been met with the difficulty of choosing which standard, and of involving non-technical experts in best practice.
The Multilingual App Toolkit Version 3.0: How to create a Service Provider for any Translation Service Source as a key Extensibility Feature
abstract Over the last few years, MLW attendees have heard about the release and evolution of the Microsoft Multilingual App Toolkit (MAT) for Windows Store and Phone apps and the MS Translator service integration into the toolkit. In this talk, we will demonstrate MAT v.3 and will focus on how to create a service provider for any translation service source as a key extensibility feature. We will also briefly address new features and improvements in this release. As before, we continue to share our interest in support for the XLIFF standard as a showcase implementation.
The Difference made by Standards oriented Processes
abstract We will discuss the benefits of changing a translation workflow to largely standards-driven and automated processes. We will give some insight into the optimizations this allows, and we will highlight the portions of the translation process where we still see a need for better standardization. An interesting aspect of standardization in language technology to date is that it has mainly helped define tighter internal processes, but has not significantly improved interoperability across the industry. The change, and the challenges, in moving from the historical file- and batch-based content translation approach to today’s highly interactive on-demand and on-change process will be another focus point. Finally, we will discuss what we see as the final frontier in today’s content creation, translation, and presentation model.
Content Relevancy starts with understanding your international Audience
abstract Many organizations nowadays are (becoming) publishers of vast amounts of content, driven by the internet and modern behavior. Consumers and citizens expect to find relevant information by “googling”, and they are impatient: if they do not find what they are looking for, if the content is not relevant, or if they do not understand what is written, they will voice their opinions on social channels such as Twitter and Facebook, but also on many other social networks and blogs. The challenge of reaching the modern citizen explodes when the audience is international. Topics and themes differ per country, or even between ethnic populations within a country. Market research has been the traditional way to understand audiences, but it is expensive, cumbersome, and poor at capturing the thoughts of the younger (under-30) generation. The conversations citizens have on social channels such as forums and blogs, and the remarks they make on micro-blogs like Twitter, create a new data set which can be used to understand your citizens. SDL will present how to optimize your translation quality so that your content is easier for your audience to find and more relevant to them.
Towards an open Framework of E-services for Translation Business, Training, and Research by the example of Terminology Services
abstract Coming from the three “worlds” – industry, academia, and freelancing – and being a provider, a trainer, and a consumer at the same time, I call for cooperation amongst our communities – Data, Language Technology, and the Web. I will present several use cases in translation and terminology work that address gaps with regard to standards and the multilingual Web. One example is the export of terminology into the Linked Data format. Through the example of terminology services, I hope my presentation will serve as a “seed” to make our goals mutual and to plan next steps towards this three-dimensional open infrastructure of e-services for translation business, training, and research.
David Filip, David Lewis and Arle Lommel
CNGL at University of Limerick, CNGL at Trinity College Dublin, DFKI
Quality Models, Linked Data and XLIFF: Standardization Efforts for a Multilingual and Localized Web
abstract The last year has seen unprecedented synchronised activity in the advancement of standards to support the multilingual and localized web. At the W3C, the MultilingualWeb-Language Technology Working Group completed its work in producing the Internationalization Tag Set (ITS 2.0) Recommendation. This provides content metadata addressing a range of content management, localization and language technology integration issues. It was progressed in close collaboration with the development of the XLIFF 2.0 specification at OASIS, with a mapping between XLIFF and ITS 2.0 available. In parallel, the QTLaunchpad project, in consultation with GALA, has produced a specification for Multidimensional Quality Metrics (MQM) relevant to translation and its use of language technology, which is also aligned with ITS 2.0. In this presentation, members of these activities will briefly introduce these developments, explain how they relate to each other and highlight the path for further work, in particular through the use of linked data.
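As a hedged illustration (not code from the presentation), ITS local markup is plain attribute-based metadata that tools can read with standard XML machinery; the snippet below uses the real ITS namespace and Python's standard library, while the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C Internationalization Tag Set (ITS).
ITS_NS = "http://www.w3.org/2005/11/its"

# Invented sample document: its:translate="no" is the ITS "Translate"
# data category, telling translation tools to leave that span untouched.
doc = """<text xmlns:its="http://www.w3.org/2005/11/its">
  <p>Click <code its:translate="no">Save</code> to store the file.</p>
</text>"""

root = ET.fromstring(doc)
# Collect the text of every element flagged as non-translatable.
protected = [el.text for el in root.iter()
             if el.get(f"{{{ITS_NS}}}translate") == "no"]
print(protected)
```

A localization workflow would typically use such flags to lock UI strings or product names before content is handed to translators or machine translation.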
University of Rome
Babelfying the Multilingual Web: state-of-the-art Disambiguation and Entity Linking of Web Pages in 50 Languages
abstract The creation of huge, broad-coverage knowledge bases and linguistic linked data such as DBpedia and BabelNet, made possible by the availability of collaboratively curated online resources such as Wikipedia and Wiktionary, has not only engaged researchers, but has also attracted big industry players such as Google and IBM, who are moving fast towards large-scale knowledge-oriented systems. The semantic annotation of arbitrary Web pages is an important use of multilingual linguistic linked data. However, most of the time the task is limited to linking (some) named entities or works only on a restricted number of languages. In this talk we will present Babelfy (http://babelfy.org), a new, state-of-the-art, wide-coverage system which leverages a novel graph-based algorithm to semantically annotate and link arbitrary text, such as Web pages, written in any language, with both concepts, i.e. abstract meanings of words or domain terms, and named entities from BabelNet (http://babelnet.org), a huge multilingual semantic network and linked data set.
Victor Rodríguez Doncel
Universidad Politécnica de Madrid
Towards high quality, industry-ready Linguistic Linked Licensed Data
abstract The application of Linked Data technology to the publication of linguistic data promises to facilitate the interoperability of such resources and has led to the emergence of the so-called Linguistic Linked Data Cloud (LLD), in which linguistic data is published following the Linked Data principles. Three essential issues need to be addressed for such data to be easily exploitable by language technologies: i) appropriate machine-readable licensing information is needed for each dataset, ii) minimum quality standards for Linguistic Linked Data need to be defined, and iii) appropriate vocabularies for publishing Linguistic Linked Data resources are needed. We propose the notion of Licensed Linguistic Linked Data (3LD), in which different licensing models might co-exist, from totally open to more restrictive licenses through to completely closed datasets.
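As a hedged, minimal sketch of issue i) above (machine-readable licensing information), a dataset description can carry a license statement with the real `dcterms:license` property; the dataset URI and license choice below are invented examples, and the Turtle is assembled by hand with the standard library only.

```python
# Illustrative only: one RDF triple, in Turtle syntax, attaching a
# machine-readable license to a (fictional) linguistic dataset.
dataset = "http://example.org/lexicon-es"                 # invented dataset URI
license_uri = "http://creativecommons.org/licenses/by/4.0/"  # example license

turtle = (
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    f"<{dataset}> dcterms:license <{license_uri}> .\n"
)
print(turtle)
```

Publishing such a triple alongside the data lets aggregators and language-technology tools filter datasets by license terms automatically, rather than relying on prose on a project page.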
At the Posada de la Villa
Dinner sponsored by Verisign with support from Lionbridge
details Dinner on May 7 will be served at 20:30 at the Posada de la Villa, Calle Cava Baja, 9. Note that the restaurant is not within easy walking distance of the Workshop venue. It is roughly six kilometers away and can be accessed via Metro lines 1 and 2 (Tirso de Molina station, approx. 500 meters) and 5 (La Latina station, approx. 250 meters). If any attendees require assistance in reaching the restaurant, please let Ms. Nieves Sande know in advance at the registration desk. For more information on transit options, see the separate flyer on the dinner.
Alta Plana Corporation
Sentiment, opinion, and emotion on the multilingual Web
abstract Two sorts of information co-exist on the multilingual Web, intertwined: Facts and Feelings. Extraction of each type is complicated by idiom, expressive vocabularies, metaphor, and cultural context that often translate poorly from one language to another, if at all. This talk will describe sentiment's business value and survey technical approaches to meeting the extraction challenge.
Universidad Politécnica de Madrid
The LIDER Project
abstract The LIDER project aims at establishing a new Linked Open Data (LOD) based ecosystem of free, interlinked, and semantically interoperable language resources (corpora, dictionaries, lexical and syntactic metadata, etc.) and media resources (image, video, etc. metadata) that will allow for free and open exploitation of such resources in multilingual, cross-media content analytics across the EU and beyond, with specific use cases in industries related to social media, financial services, localization, and other multimedia content providers and consumers. In some cases, we will explore new business models and hybrid licensing schemes for the use of Linguistic Linked Data in commercial settings for free-but-not-open resources.
Martin Brümmer, Mariano Rico, Marco Fossati
University of Leipzig, Universidad Politécnica de Madrid, Fondazione Bruno Kessler
DBpedia: Glue for all Wikipedias and a Use Case for Multilingualism
abstract Fourteen official DBpedia chapters exist apart from the English one: Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish. However, there is still a lot to be done in terms of DBpedia internationalization. This talk will cover:
- Deployment of local DBpedia chapters acting as the backbone of the emerging national Open Data landscapes, with an emphasis on the liaison with governmental and civic society organizations. It will then focus on the task of mapping language-specific knowledge representations coming from the Wikipedias into a unified culture-agnostic ontology.
- The overwhelming benefits of an increasingly multilingual DBpedia as a source of language resources and a resource for Natural Language Processing.
Jorge Gracia and José Emilio Labra
Universidad Politécnica de Madrid, University of Oviedo
Best Practices for Multilingual Linked Open Data: a Community Effort
abstract The W3C Best Practices for Multilingual Linked Open Data community group was born one year ago during the last MLW workshop in Rome. It continues to lead the effort of a large community towards a shared view of the issues that multilingualism raises on the Web of Data, and of their possible solutions. Despite our initial optimism, we found the task of identifying best practices for ML-LOD a difficult one, requiring a deep understanding of the Web of Data in its multilingual dimension and in its practical problems. In this talk we will review the group's progress so far, mainly in the identification and analysis of topics, use cases, and design patterns, as well as the future challenges.
Josef van Genabith
Quality Machine Translation for the 21st Century
abstract Over the last 10 years Machine Translation (MT) has started to make substantial in-roads into professional translation (localisation and globalisation) and our daily lives. Much research in machine translation has concentrated on in-bound translation for gisting. We frequently use (freely available) applications like Google Translate or Bing Translate to bridge language barriers on the Web, to make sense of information presented in a language we are not familiar with. For this to be successful, MT output has to be understandable, in the sense that it gets most of the basic content encoded in the source language across into the target language – this is often referred to as adequacy. Crucially, the MT does not have to be perfectly fluent and may even make some mistakes or omissions that can be recognised, tolerated and compensated for by the human user given the context. For many applications, however, in particular out-bound translation (rather than in-bound for gisting), translation quality (fluency and adequacy) is crucial. Most MT engines are now statistical (SMT), i.e. they learn from previous human translations (bitext) how to translate new text. Barriers to high quality (S)MT include lack of training data; statistical models that do not fully capture large translation divergences between certain “challenging” language pairs (e.g. substantial reordering, translation into morphologically rich languages, etc.); and limits in how translation outputs are evaluated automatically and by humans. The QT21 project proposal brings together key stakeholder constituencies, including leading MT teams in Europe, language service provider companies, professional translation and localisation organisations, think-tanks and a large public translation user, to systematically address these quality barriers, focusing on “challenging” languages.
Pedro L. Díez Orzas
Linguaserve I.S. S.A.
Multilingual Web: Affordable for SMEs and small Organizations?
abstract Multilingual Web Language Technology solutions are often affordable only for certain companies and organizations. The reasons span several areas, such as costs, infrastructure and specialized personnel:
- Budget to implement and maintain a continuous multilingual web activity.
- Know-how and maturity level for understanding and adapting new technologies to their special needs.
- Internal or external personnel to take care of technical and management tasks to develop multilingual web strategies.
In order to implement MLW solutions, these organizations can be helped by companies with expertise in web localization, but they usually also need software integration providers or an internal technical department to approach certain solutions. Turning a monolingual website into a multilingual one is not a problem, even for small companies; but gaining competitive advantage from the latest technology for continuous, incremental multilingual web activity is very often something that SMEs say they cannot afford. The fact is that the European business network consists mostly of SMEs, and one of the peculiarities of the European market is its multilingual composition (as opposed to the US market, for instance). Also, reaching external and global markets requires the capacity for multilingual communication strategies, not only for the needs of a sustainable and incremental multilingual web, but also for all other needs surrounding online commercial activities. This presentation offers a medium-term vision and introduces some factors that affect making these technologies, solutions and services more affordable for a wide range of small and medium enterprises and public organizations, based on five key factors, among others:
- Best practices for web creation
- Globalization or interoperability standards
- Metadata and data standards
- Adaptive solutions and services for companies of different sizes
- Flexible complementary multilingual communication services
Societies that succeed in bringing their SMEs’ creativity and outstanding entrepreneurial initiatives to the global markets will be the successful societies of the near future.
Universal Access: Barrier or Excuse?
abstract While IDN ccTLDs and IDN TLDs work as expected within the DNS system, they do not work well in the real world. This could be why IDN TLDs are not being adopted as quickly as desired. This paper looks at two issues: what the barriers are to the effective use of IDN TLDs, and who can help address them. The paper doesn't provide answers but raises questions that, I hope, will be answered in part during subsequent responses.
Internationalized Domain Names: Challenges and Opportunities, 2014
abstract Today Internationalized Domain Names (IDNs) are getting more attention than at any other time since they were introduced into the Domain Name System in 2000, but they still have a long path to general adoption. IDNs are far from being ubiquitous and trusted. Verisign, as a registry operator and manager of over 1M IDNs, plays a small part in an ecosystem comprising not only registries but also developers, content creators, and policy and standards bodies, all attempting to further internationalize, or locally localize, the identifiers on the Internet. We therefore intend to highlight some of the challenges we have found through our experience as a registry operator, and to encourage all players to make IDNs a ubiquitous and trusted product for the multilingual web.
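As background, a hedged sketch of the encoding step that underlies IDNs: non-ASCII labels travel through the DNS in an ASCII-compatible Punycode form. Python's built-in `idna` codec implements the original IDNA 2003 rules (current registry practice follows IDNA 2008, usually via the third-party `idna` package); the domain below is an invented example.

```python
# Unicode form -> ASCII-compatible ("xn--") form sent to the DNS.
ace = "bücher.example".encode("idna")
print(ace)

# ASCII-compatible form -> Unicode form shown to users.
roundtrip = ace.decode("idna")
print(roundtrip)
```

The "xn--" prefix marks a Punycode-encoded label, which is part of why applications (browsers, mail clients, registrars) must all cooperate for IDNs to appear seamless to end users.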
Digital Language Extinction as a Challenge for the Multilingual Web
abstract The large, pan-European study “Europe’s Languages in the Digital Age”, published as the META-NET White Paper Series in late 2012, showed that 21 of the 31 languages investigated are in danger of digital extinction: the support these 21 languages receive through language technologies is either weak or non-existent. In early 2014 the comparison was extended by ca. 20 additional languages, and these updated results are even more alarming, because far more than the original 21 languages are in danger of becoming digitally extinct. Most of these additional languages belong to a category usually referred to as “small languages”, a term used, generally speaking, almost synonymously with “under-resourced languages”. The presentation will first describe the original approach of the META-NET study “Europe’s Languages in the Digital Age” and then explain the updated results. It will also discuss the challenges this situation poses for the multilingual web and the corresponding needs, especially with regard to intensifying technology development and knowledge transfer from the better supported to the less supported languages.
[Chair, Thierry Declerck]
Explanation of the open space format for the afternoon, and selection of discussion topics. Topics are suggested by participants, and the most popular are allocated to breakout groups. A chair is chosen for each group from volunteers. There are also some pre-selected groups.
The LIDER workshop will run in parallel to the Open Space session.
Various locations are available for breakout groups. Participants can join whichever group they find interesting, and can switch groups at any point. Group chairs facilitate the discussion and ensure that notes are taken to support the summary to be given to the plenary.
Group reports and discussion
Everyone meets again in the main conference area and each breakout group presents their findings. Other participants can comment and ask questions.