W3C Workshop Program:
A Local Focus for the Multilingual Web
21-22 September 2011, Limerick, Ireland
The MultilingualWeb project is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. Coordinated by the W3C, the project aims to raise the visibility of existing best practices and standards and identify gaps. This third workshop in Limerick, Ireland, was hosted by the University of Limerick. The workshop was co-located with the 16th Annual LRC Conference.
Each main session on the first day contained a series of 15-minute talks followed by some time for questions and answers. On the second day, the workshop lasted for the morning only and was dedicated to an Open Space discussion forum, where participants could discuss the themes of the workshop in breakout sessions. This was organized by TAUS. All attendees participated in all sessions.
The IRC log is the raw scribe log, which has not undergone careful post-editing and may contain errors or omissions. It should be read with that in mind. It constitutes the best efforts of the scribes to capture the gist of the talks and discussions that followed, in real time. IRC was used not only to capture notes on the talks, but also to allow remote participants, or participants with accessibility problems, to follow in real time. People following on IRC could also add contributions to the flow of text themselves.
Where no link is provided to slides, we are still waiting to receive them. Some video links are unavailable because the speaker requested it. In two cases the speaker was unable to attend the workshop, but their slides are available. You can also find links to all videos on the VideoLectures workshop page. Thanks to VideoLectures for hosting the videos.
Dean of the Faculty of Science and Engineering, University of Limerick
Brief welcome address
W3C Internationalization Activity Lead & MultilingualWeb Project Coordinator
Disruptive Innovations / W3C CSS Working Group Co-Chair
Babel 2012 on the Web
abstract If Open Web and Internet Standards were mostly western-centric in the early years, things have drastically changed. English is no longer the most common language on the net and the various standards bodies have improved the support for the languages and scripts of the world. The new cool kids on the block in 2012 will be HTML5, CSS3 and EPUB3, and this talk will show you how Standards are paving the way for the
Dr. David Filip
MultilingualWeb-LT: Meta-data interoperability between Web CMS, Localization tools and Language Technologies at the W3C
abstract MLW-LT, an FP7 funded coordination action, is going to set up a W3C Working Group (WG) for standardizing metadata exchange between Web CMS, Localization Tools and Language Technologies. This session will open the public discussion of the WG Charter and encourage participation in the WG from outside of the initial EC funded consortium.
The WG aims to address three major interoperability gaps in the multilingual web content lifecycle, namely between Deep Web meta-data and localization (L10n); Surface Web meta-data and Real time Machine Translation; and Deep Web meta-data and meta-data driven MT training.
Addressing these gaps will include alignment with other existing and ongoing LT and L10n standardization activities, most prominently the W3C ITS and OASIS XLIFF TC efforts, as XLIFF will be used for prototyping MLW-LT metadata round-trips in the three main scenarios outlined above.
The journey of the W3C Internationalization Tag Set - current location and possible itinerary
abstract The W3C Internationalization Tag Set (ITS) is an enabler for the internationalization and localization of content. Although ITS is a rather young standard, its uptake has been impressive. One reason for this is the work of the ITS Interest Group (ITS IG), which promotes its adoption and gathers feedback. This presentation will sketch insights from the ITS IG. The following will be covered: 1. Brief Introduction to W3C ITS; 2. Review of ITS use in commercial and open source tools; 3. Existing Rule Sets; 4. Overview of suggested enhancements; 5. Relationships; 6. Outlook (Contributors: Yves Savourel, Jirka Kosek, Felix Sasaki, Richard Ishida, Christian Lieske)
CSS & i18n: dos and don'ts when styling multilingual Web sites
abstract The talk covers best practices and pitfalls when dealing with languages that create large compound words (like German), languages with special capitalization rules (again, like German), or languages written in right-to-left scripts. This includes things like box sizes, box shadows and corners, image replacement etc. It also covers benefits that new CSS 3 properties and values offer in terms of internationalization, a discussion of whether the :lang pseudo-class selector meets all needs or if there's more to wish for, and how to implement style sheets for various languages and scripts (all rules in a single file or spread over multiple files?). The talk will be practical rather than theoretical in nature.
[Chair, Tadej Štajner • Scribe, Jirka Kosek]
CMS and Localisation – Challenges in Multilingual Web Content Management
abstract Content Management Systems (CMS) have come to be widely used to provide and manage content on the Web. As such, CMS are increasingly used for multilingual content, which presents new challenges to developers and content providers. This presentation will explore these challenges and show how and why a closer alignment of CMS developers and LSPs can improve translation management, workflows and quality.
Multilinguality on Health Care Websites – Local Multi-Cultural Challenges
abstract Globally acting health care organisations like the World Health Organization have to present their websites in a variety of languages to make sure that as many people as possible can benefit from their online offer. The same applies to the European Union, which publishes its official documents in 23 languages and therefore has to guarantee that its websites are equally multilingual. Because Germany is a country with a large number of immigrants, the government and other official institutions would do well to present their websites not only in German or English, but also in other languages, like Turkish or Russian. The websites of the WHO, the EU and some German institutions were checked for their multilingual offer and possible shortcomings of the different language versions. The most severe and most frequent shortcomings and their consequences for users will be highlighted in this talk.
Lise Bissonnette Janody
Balance and Compromise: Issues in Content Localization
abstract Web content managers need to make choices with respect to the content they translate and localize on their websites. What guides these decisions? When in the process should they be made? What are their impacts? This talk provides a high-level overview of these choices, and how they fit into the overall content strategy cycle.
[Chair, Charles McCathieNevile • Scribe, Christian Lieske]
Efficient translation production for the Multilingual Web
abstract The translation editor has seen major technological advances over the last years. Compared to classic translation memory applications, current systems allow expert users to double, if not triple, the number of words translated. Whereas the key technology advances are in the area of sub-segment reuse and statistical machine translation (SMT), the actual productivity gains relate to the ergonomics of how systems allow users to interact with, control and automate the various data sources. This presentation will review key capabilities on the various document, segment and sub-segment levels, such as: document-level SMT, TrustScore, dynamic routing, dynamic preview (document level); match type differentiation, auto-propagation, SMT integration and SMT configurations, segment-level SMT trust scores and feedback cycles (segment level); auto-suggest dictionary and phrase completions (sub-segment level). The discussed capabilities will be put into perspective by looking at how the vast amount of multilingual online content is affected by such innovation.
A Micro Crowdsourcing Architecture to Localize Web Content for Less-Resourced Languages
abstract We will report on a novel browser extension-based client-server architecture using open standards that allows localization of web content using the power of the crowd. We address issues related to MT-based solutions and propose an alternative approach based on translation memories (TMs). The approach is inspired by Exton et al. (2009) on real-time localization of desktop software using the crowd and Wasala and Weerasinghe (2008) on browser-based pop-up dictionary extensions. The architectural approach chosen enables in-context real-time localization of web content supported by the crowd. To the best of our knowledge, this is the only practical web content localization methodology currently being proposed that incorporates Translation Memories. The approach also supports the building of resources such as parallel corpora, which are still not available for many languages, especially under-served ones.
Interoperability standards in the localization industry – Status today and opportunities for the future
abstract Interoperability and related standards are topics still frequently and controversially discussed.
While standards such as TMX and TBX are established within the industry, others, such as XLIFF, are rated differently and not that widely implemented.
This presentation covers the current status of interoperability in the localization and translation industry, historical development, understanding of interoperability, related business requirements, effects on delivery models, interoperability between tools, open standards, current challenges and opportunities for the future.
[Chair, Christian Lieske • Scribe, Felix Sasaki]
The use of SMT in financial news sentiment analysis
abstract Statistical Machine Translation systems are a welcome development for news analytics. They enable topic-specific translation services, but are not without problems. The SMT system that is developed for the Let'sMT (FP7) project is trained and used to translate financial news for SemLab's news sentiment analysis platform. This talk will give an example of the benefits and problems of integrating such systems.
University of Leipzig
NLP Interchange Format (NIF)
abstract NIF is an RDF/OWL-based format that allows several NLP tools to be combined and chained in a flexible, light-weight way. The core of NIF consists of a vocabulary, which can represent Strings as RDF resources. A special URI design is used to pinpoint annotations to a part of a document. These URIs can then be used to attach arbitrary annotations to the respective character sequence. Based on these URIs, annotations can be interchanged between different NLP tools. Although NLP tools are abundantly available on all linguistic levels for the English language, this is often not the case for languages with fewer speakers. Thus, it becomes especially necessary to create a format that allows the integration and interoperability of NLP tools. Web site: http://aksw.org/Projects/NIF.
With respect to multilinguality, two use cases come to mind: 1. An existing English software system that uses an English NLP tool needs to be ported to another language, but the NLP tool for the other language is not compatible with the system because there is no common interface (example: a CMS with keyword extraction). 2. Paragraphs in different kinds of documents can be annotated in RDF with multilingual translations that can potentially remain stable over the lifetime of a document. In particular, the URI recipe introduced (Context-Hash) has advantageous properties that withstand comparison with other URI naming approaches.
LMF-aware Web services for accessing lexical resources
abstract This talk will demonstrate that Lexical Markup Framework (LMF), the ISO standard for modeling and representing lexicons, can be nicely applied to the design and implementation of lexicon access Web services, in particular when the service is designed in the so-called RESTful style. As the implemented prototype service provides access to bilingual/multilingual semantic resources, in addition to standard WordNets, slight revisions to the LMF specification will also be
[Chair, Felix Sasaki • Scribe, Dag Schmidtke]
CNGL/Trinity College Dublin
Digital Content Management Standards for the Personalised Multilingual Web
abstract The World Wide Web is at a critical phase in its evolution right now. The user experience is no longer limited to a single offering in a single language. Localisation has offered a web of many languages to users, and this is now becoming a hyper-focused tailoring that makes each web experience different for each user. The need to address the key requirements of a web which is real-time, personal and in the right language is paramount to the future of how information is consumed. This talk will discuss the key trends in personalisation, with particular focus on work being undertaken in the Digital Content Management track of the CNGL, and will provide an insight into current and future trends, both in research and in the living web.
An Open Source tool helps a global community of professionals shift from traditional contacts and annual meetings to continuous interaction on the web
abstract The challenges of maintaining and developing a multilingual web site with open source software tools and crowd-sourced translations, for a community of professional translators and terminologists working for international organizations and multilateral bodies, where that "community" has no budget and depends on members' contributions in kind, but has continued to grow since 1987.
Using an Open Source tool which supports multilingualism to provide a complex support site for an international working group on language issues.
How use of the Tiki CMS Wiki Groupware software made it possible to provide an ongoing interactive support site for JIAMCATT, helping convert the "International Annual Meeting on Computer-Assisted Translation and Terminology" into an ongoing year-round affair. The site, which is run without a budget and in members' spare time, is nevertheless fully bilingual English-French, with parts in Arabic, Chinese, Russian and Spanish (all official languages of the United Nations) as well as some German.
[Chair, Reza Keschawarz • Scribe, Jirka Kosek]
University of Vienna
Terminologies for the Multilingual Semantic Web - Evaluation of Standards in the light of current and emerging needs
abstract In recent years several standards have emerged or have come of age in the field of terminology management (such as ISO 30042 (TBX), ISO 26162, ISO 12620, etc.). Different user communities in the language industry (incl. translation and localization), language technology research, industrial engineering and other domain communities are increasingly interested in using such standards in their local application contexts. This is exactly where problems more often than not arise, in the natural need to adapt global and sometimes abstract, heavy-weight standards specifications to local situations that differ from each other. Thus the way standards are prepared needs to be adapted in such a way that different requirements from user groups and from local situations can be processed and taken into account appropriately and efficiently. The paper discusses innovative (web-service-oriented) approaches to standards creation in the field of terminology management in relation to different web-based user groups and semantic web-application contexts, integrating vocabulary-oriented W3C recommendations such as SKOS. The speaker will integrate his experiences in the strategic contexts of FlareNet, CLARIN, ISO/TC 37 and in concrete user communities, e.g. in legal and administrative terminologies (the "LISE" project) and in risk terminologies (the "MGRM" project).
META-NET: Towards a Strategic Research Agenda for Multilingual Europe
abstract META-NET is a Network of Excellence, consisting of 47 research centres in 31 countries, dedicated to fostering the technological foundations of a multilingual European information society. A continent-wide effort in Language Technology (LT) research and engineering is needed for realising applications that enable automatic translation, multilingual information and knowledge management and content production across all European languages. The META-NET Language White Paper series "Languages in the European Information Society" reports on the state of each European language with respect to LT and explains the most urgent risks and opportunities. The series covers all official and several unofficial as well as regional European languages. After a brief introduction of META-NET we will present key results of the 30 Language White Papers, which provide valuable insights concerning the technological, research, and also standards-related gaps of a multilingual Europe realised with the help of LT. These insights are an important piece of input for the Strategic Research Agenda for Multilingual Europe, which will be finalised by the beginning of 2012.
Beyond Specifications: Looking at the Big Picture of Standards
abstract In the localization industry standardization has been seen primarily as a technical activity: the development of technical specifications. As a result there are many technical standards that have failed to achieve widespread adoption. The GALA Standards Initiative, an open, non-profit effort, is attempting to address areas that surround standards development—education, promotion, coordination of development activities, and development of useful guidelines and business cases, and non-technical, business-oriented standards—to help achieve an environment in which the needs of various user groups will help drive greater adoption of standards.
[Chair, Jörg Schütz • Scribe, Charles McCathieNevile]
At the Carlton Castletroy Park Hotel
details To further promote networking among attendees, there will be a reception in the restaurant of the Carlton Castletroy Park Hotel, starting at 8pm. (This is the same location as the workshop venue.)
Jaap van der Meer
Explanation of the format for the morning, and selection of discussion topics. Topics are suggested by participants, and the most popular are allocated to breakout groups. A chair is chosen for each group from volunteers.
Various locations are available for breakout groups. Participants can join whichever group they find interesting, and can switch groups at any point. Group chairs facilitate the discussion and ensure that notes are taken to support the summary to be given to the plenary.
Group reports and discussion
Everyone meets again in the main conference area and each breakout group presents their findings. Other participants can comment and ask questions.