W3C Workshop Program:
The Multilingual Web: Where Are We?
26-27 October 2010, Madrid, Spain
The MultilingualWeb project is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. Coordinated by the W3C, the project aims to raise the visibility of existing best practices and standards and identify gaps. This first workshop in Madrid, Spain, was hosted by the Universidad Politécnica de Madrid.
Each main session begins with a half-hour 'anchor' presentation. This is followed by a series of 15-minute talks. Questions and answers are saved for a (typically) half-hour discussion slot at the end of each session. All attendees participated in all sessions.
The IRC log is the raw scribe log, which has not undergone careful post-editing and may contain errors or omissions. It should be read with that in mind. It constitutes the best efforts of the scribes to capture, in real time, the gist of the talks and the discussions that followed. IRC was used not only to capture notes on the talks; it could also be followed in real time by remote participants or by participants with accessibility problems. People following on IRC could also add contributions to the flow of text themselves.
Some video links are missing, either because the speaker requested it or in some cases due to technical problems. In one case an audio track is available, rather than a video. Thanks to the Universidad Politécnica de Madrid for recording and hosting the videos.
Related links: Workshop report • About W3C
Director, Escuela Técnica Superior de Ingenieros de Telecomunicación (ETSIT-UPM)
Workshop opening and welcome
European Commission - DG INFSO E1
EC language programs and hopes for the future
LRC, University of Limerick
The Multilingual Web, Policy Making and Access to Digital Knowledge for All
abstract Access to digital knowledge is no longer just a "nice-to-have", it is a
fundamental human right as important as access to food and water, to
appropriate educational and health services. The World Health Organisation
has reported that thousands of people die every day because they do not have
access to appropriate health information. Content and languages currently
ignored by mainstream localisation efforts – because there is no "business
case" for them - can realistically only be tackled using leading edge
component technologies linked together in standardised and interoperable
frameworks. Efforts under the umbrella of The Rosetta Foundation and the
United Nations' Internet Governance Forum to create such an open framework
will be outlined, and their potential highlighted to reach billions of users
currently being excluded from the digital world.
W3C (World Wide Web Consortium)
The Multilingual Web: Latest developments at the W3C/IETF
abstract The World Wide Web Consortium (W3C) develops base standards for the Web, such as HTML, CSS, SVG, XML, the Semantic Web and so on. Since the beginning, "Web for All" has been a fundamental goal of the W3C. Richard's talk will look at the work of the W3C and some other key organizations that are helping to develop standards and best practices that make the World Wide Web international — what has been done and what is currently in progress.
Localizing the web from the Mozilla perspective
abstract Axel will present on the achievements and challenges Mozilla is facing
both as a browser vendor and as the host of a variety of websites. Firefox is
available in over 70 languages, with over 40 locales participating early
in the beta program for Firefox 4. We host a variety of websites in a
variety of languages, based on a variety of infrastructures. We'll
present what works, and where we're still researching and developing.
How does a multilingual web differ for static sites, for web
applications, and for live multilingual documents? Technically, what is
"Localizing HTML"? Also, how can we serve our users the web in their
language and respect their privacy at the same time? Can we improve
The Web everywhere: Multilingualism at Opera
abstract This presentation is about approaches for localization of various Opera browser versions and other assets.
Jan Nelson, Peter Constable
Bridging languages, cultures, and technology
abstract Microsoft's products span a very wide range of applications that depend on
advancing Web technologies. This talk will provide a high-level overview of
how our currently shipping products reach across regions, languages, markets
and to what extent our "connected" products are available. We will also
look at some areas of progress and some outstanding challenges in
globalization of Web applications.
[Chair, Adriane Rinsche • Scribe, Jirka Kosek]
Roberto Belo Rovella, David Vella
BBC World Service
Challenges for a multilingual news provider: pursuing best practices and standards for BBC World Service
abstract The BBC World Service operates in 32 languages, and many of them still
present challenges for proper display on various web platforms,
particularly on mobile devices. Roberto Belo-Rovella (Interactive Editor)
and David Vella (Software Engineer) explore some of these issues, and the
implemented solutions, aimed at reaching the audience on whichever platform
they use to access BBC content.
Multilingual Aspects in Speech and Multimodal Interfaces
abstract How does multilinguality affect standards for voice and (more generally)
multimodal applications? Is speech different from text as regards languages?
Tricky issues, current standard solutions, best practices, and open
questions when you shift the focus from written/visual to spoken/auditory
Universidad Politécnica de Madrid
Experiences in creating multilingual web sites
abstract Luis talked about the "Lingu@net Europa" web site, a multilingual center for language learning. 32 different languages are available. The content is created not by professional translators, but by language teaching professionals. For this group, technologies like translation memories are not easy to learn. Hence, the site is created in a workflow which they find easy to use. In terms of standards, UTF-8 character encoding, HTML, CSS and XML play a crucial role. But other technologies like MS-Office or the text indexing framework Lucene are applied, too. An issue in creating multilingual sites is how to handle "multilingual links". In the project, XSLT was used to create such links. Luis suggested that having such facilities in a CMS would be helpful.
Pedro L. Díez Orzas, Giuseppe Deriard, Pablo Badía Mas
Key Aspects of Multilingual Web Content Life Cycles: Present and Future
abstract Pedro emphasized that the creation of multilingual content is a process. Important information in that process is, for example, what does not need to be translated. To express that information, the tDTD format ("translatability data type definition") has been developed. However, in a common CMS and the content life cycle, multilinguality (i.e. this kind of information) is not regarded as important. The methodologies and workflows are getting more and more "hybrid": traditional human translation, a combination of MT with post-editing, and "MT only". To make the creation of such workflows easy, CMSs have to take the requirements of multilingual content management into account.
World Wide Web Foundation
The Remaining Five Billion: Why is Most of The World's Population Not Online and What Mobile Phones Can Do About It
abstract This presentation explains how the huge penetration rate of mobile phones is
making them the prime tool to bridge the digital divide, and shows that
there is a lot to be done before it happens: information should be made
available using voice or messaging, it should be relevant to its users (and
in their language), and everybody should be able to contribute to it.
[Chair, Charles McCathieNevile • Scribe, Felix Sasaki]
Best Practices and Standards for Improving Globalization-related Processes
abstract Today's globalization-related processes such as translation can benefit from a number of best practices and standards developed by a number of standards bodies. Starting from a sketch of enterprise-scale globalization-related processes, the presentation will touch on the material developed by the aforementioned organizations. In addition, relations and gaps in the ecosystem will be discussed.
Josef van Genabith
Centre for Next Generation Localisation (CNGL)
Next Generation Localisation
abstract Next Generation Localisation can be conceptualised in terms of a spatial metaphor consisting of a cube with three axes: volume, access and personalisation. In the talk I will show which part of this "Localisation Cube" is addressed by current state-of-the-art technologies as used in the industry, and how Next Generation Localisation Technologies will allow us to address each point in the cube at configurable quality and speed.
Applying Standards – Benefits and Drawbacks
abstract Daniel briefly reviewed the history of standards such as TMX, TBX and XLIFF. He then asked whether standards are always the best approach, and argued that they may be overkill for some projects. For other projects, where you have to exchange large volumes of content and data is accessed by many, a standard format like XLIFF is absolutely appropriate.
Institut Jozef Stefan
Cross-lingual document similarity for Wikipedia languages
abstract Marko explained that cross-lingual information retrieval can be seen in a scenario where a user initiates a search in one language but expects results in more than one language. There are many areas involved in text-related search and each represents data about text in a slightly different way. Marko focused on the correlated vector space model and described how the system they are currently building works using correlated Wikipedia texts.
[Chair, Felix Sasaki • Scribe, Elliot Nedas]
Language resources, language technologies, text mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web
abstract Felix introduced applications concerning summarization, Machine Translation (MT) and text mining, and showed what is needed in terms of resources. For this he identified different types of language resources, and distinguished between linguistic approaches and statistical approaches. Machines need three types of data: input, resources and workflow. Three types of gaps currently exist in this data scenario: metadata, process and purpose. These gaps were exemplified with an MT application. The purpose gap specifically concerns the identification of metadata, process flows and the employed resources. Any identification must be facilitated across applications with a common understanding, and therefore different communities have to join in and share the information that has to be employed in the descriptive part of the identification task. The semantic technologies of the Semantic Web (SW) offer a particular solution that can provide a machine-readable information foundation. For web pages, a shallower approach than the complex, fully-fledged SW approach is available through microformats or RDFa. A few examples gave some insight into how the SW actually contributes to closing these gaps.
Nicoletta Calzolari Zamorani
Language Resources: a pillar of Language Technology
abstract The traditional production process is too costly. It is urgent to create a framework that enables effective cooperation of many groups on common tasks, adopting the paradigm of accumulation of knowledge so successful in more mature disciplines, such as biology, astronomy and physics. This requires a change in the paradigm, and the design of a new generation of language resources, based on open content interoperability standards. The semantic web notion may help in determining the shape of the language resources of the future, consistent with the vision of an open distributed space of sharable knowledge available on the web for processing. This enables building on each other's achievements, integrating results, and having them accessible to various systems and applications. This is the only way to make a great leap forward.
lemon: An Ontology-Lexicon model for the Multilingual Semantic Web
abstract In order to allow ontologies to interact with multilingual text in both the analysis and the generation mode, it is necessary to model the relation that natural language expressions have with language-independent knowledge representation systems. Most of the latter use a label attribute to encode the natural language expressions that correspond to a concept. Often, such labels exist only in English, or only in the language of the country for which a taxonomy or ontology has been designed. Such labels correspond in fact to terms, which are not explicitly linked to other terms/labels of other concepts. As such, a lot of information about possible linguistic realizations of concepts is left out. The Semantic Web, and in particular the Linked Data project, proposes solutions that allow for the re-use of lexical and terminological resources by semantic interlinking. However, currently, there is no standard for describing the relationship between natural language expressions and ontology elements. Therefore a central aspect of the Monnet project on Multilingual Ontologies for Networked Knowledge (http://www.monnet-project.eu/) is the design and development of a model that associates linguistic information with domain semantics. This model, which we call lemon (lexicon model for ontologies), is built on existing work, in particular LMF, ISOcat, SKOS, LexInfo and LIR. Lemon is an RDF model that allows for lexical data to be shared and interlinked on the Web and is a central endeavor towards standardizing lexicalized, multilingual knowledge representation on the Semantic Web.
José Carlos González
Universidad Politécnica de Madrid / DAEDALUS
Turning multilingual resources into applications: a market perspective
abstract The talk will show how language resources and tools can evolve to
satisfy the present and future demands coming from clients across
industries. The line of argumentation will be supported by the experience of
13 years since the starting of a university spin-off around language
Semantic Technologies in Multilingual Business Intelligence
abstract Recently Business Intelligence (BI) has gained new momentum with the increased penetration of cloud computing and open source software developments in the field of business analytics, forecasting and business process optimization. Even SMEs and micro companies are now in a position to employ BI tools that were previously reserved for large enterprises with enormous license costs and additional human resources.
VU University Amsterdam
KYOTO: a platform for anchoring textual meaning across languages
abstract The KYOTO project builds a platform that can be used by social groups to
model the meaning of their terms and to use this model to mine facts from
the textual sources in their community.
Since the KYOTO project uses a generic architecture that can be used for any
set of languages, the modeling of terms and the extraction of facts from
text is interoperable across these languages as well. The core semantic
technology of KYOTO thus enables social groups with a common interest to
access their knowledge using formal systems, thus connecting Semantic Web2.0
communities to Semantic Web3 technology, but it also allows the creation of
communities across language borders. Likewise, knowledge that is implicit in
these social communities can be exchanged across different language
communities, thus creating a more global understanding and exchange.
W3C Internationalization Tag Set
abstract ITS is a W3C Recommendation that helps to internationalize XML-based content. Content that has been internationalized with ITS can more easily be processed by humans and machines. ITS also plays an important role in the W3C Best Practice Note: "Best Practices for XML Internationalization". Christian explained that seven so-called data categories are the heart of ITS. They cover topics such as marking a range of content that must not be translated. Thus, ITS helps humans and machines, since ITS information can, for example, help to configure a spell checker or to communicate with a translator. ITS data categories are valuable in themselves – you do not need to work with the ITS namespace. They are therefore also useful for RDF or for other non-XML data. Although ITS is a relatively new standard, Christian was able to point to existing implementations (e.g. the Okapi framework) that support ITS-based processing. In addition, he sketched first scenarios and visions for the possible pivotal role of ITS in the creation of multilingual, Web-based resources: clients (such as Web browsers) that interpret ITS and thus can feed more adequate content to machine translation systems.
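The "must not be translated" marker mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the talk: it assumes the local its:translate attribute of ITS 1.0 in the namespace http://www.w3.org/2005/11/its, ignores global rules (its:rules) and attribute values, and uses a hypothetical helper name and a made-up sample document.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"  # ElementTree's "{namespace}local" form


def translatable_text(elem, inherited=True):
    """Collect text runs that a translator (or MT system) should see.

    Hypothetical helper: only the local its:translate attribute is
    honoured. Element content is translatable by default, and the
    setting is inherited by child elements.
    """
    flag = elem.get(TRANSLATE)
    translate = inherited if flag is None else (flag == "yes")
    chunks = []
    if translate and elem.text and elem.text.strip():
        chunks.append(elem.text.strip())
    for child in elem:
        chunks.extend(translatable_text(child, translate))
        # Text after a child's end tag belongs to the current element.
        if translate and child.tail and child.tail.strip():
            chunks.append(child.tail.strip())
    return chunks


# Made-up sample document, for illustration only.
doc = ET.fromstring(
    '<doc xmlns:its="http://www.w3.org/2005/11/its">'
    '<p>Press the <code its:translate="no">Enter</code> key.</p>'
    '<p its:translate="no">ACME-42<note its:translate="yes">Internal ID</note></p>'
    '</doc>'
)
print(translatable_text(doc))  # ['Press the', 'key.', 'Internal ID']
```

A client along the lines sketched in the talk could pass only these runs to a machine translation system and leave the protected content untouched.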
[Chair, Dan Tufis • Scribe, Jörg Schütz]
Facebook Translation Technology and the Social Web
abstract The interactive nature of the social web presents both unique challenges and opportunities unseen in traditional translation models. The current approaches to creating user interfaces and the translation technologies available are inadequate when it comes to providing fast, high-quality translations, nor do they take advantage of the capabilities inherent in the social web. This talk will describe the challenges and opportunities and show how Facebook's technology deals with both.
Google's community Translation in Sub-Saharan Africa
abstract Sub-Saharan Africa has around 14% of the world's population yet only 2% of the world's internet users. Low representation of African languages and relevant content remains among the biggest barriers to closing this gap. Google is serious about Africa, and our strategy is to get users online by developing an accessible, relevant and sustainable internet ecosystem. This talk will explore the relevance facet by sharing insights from two recent community translation initiatives aimed at increasing African-language web content.
Emmanuelle Gutiérrez y Restrepo, Loïc Martínez Normand
Localization and web accessibility
abstract The localization of web content is a creative and evolving undertaking that can maintain and even increase the accessibility of the original content. In any case, the localized version should never be less accessible than the original version. For this reason it is essential for web content localization practitioners to understand and apply the web accessibility guidelines (WCAG 2.0) with the goal of providing high quality results that also respect the rights of all users.
Department of Information Technology, Government of India
Challenges for Multilingual Web in India: Technology development and Standardization perspective
abstract After setting the scene with an overview of the challenges in India and the complexity of Indian scripts, Swaran Lata talked about various technical challenges they are facing, and some key initiatives aimed at addressing those challenges. For example, there are e-government initiatives in the local languages of the various states, and a national ID project that brings together multilingual databases, on-line services and web interfaces. She then mentioned various standardization-related challenges, and initiatives that are in place to address those.
[Chair, Chiara Pacello • Scribe, Charles McCathieNevile]
Google / Unicode Consortium
Software for the world: Latest developments in Unicode and CLDR (videocast)
abstract Mark started with information about the extent of Unicode use on the Web, and the recent release of Unicode 6.0. He then talked about Internationalized Domain Names and recent developments in that area, such as top-level IDNs and new specifications related to IDNA and Unicode IDNA compatibility processing. For the remainder of his talk, Mark described CLDR (the Common Locale Data Repository), which aims to provide locale-based data formats for the whole world so that applications can be written in a language-independent way. This included a description of BCP 47 language tag structure, and the new Unicode locale extension.
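As a rough illustration of the BCP 47 language tag structure mentioned in this abstract, the sketch below splits a well-formed tag into language, script, region, variants and extensions, where the singleton "u" introduces the Unicode locale extension. It is not taken from the talk. The function name is hypothetical, and the parser is deliberately simplified: it does no validation against the subtag registry, treats private-use "x" like any other singleton, and does not handle extended language subtags or grandfathered tags.

```python
def parse_language_tag(tag):
    """Split a well-formed BCP 47 tag into its main parts (simplified)."""
    parts = tag.split("-")
    result = {
        "language": parts[0].lower(),
        "script": None,
        "region": None,
        "variants": [],
        "extensions": {},
    }
    i = 1
    # Script: exactly four letters, conventionally title-cased ("Latn").
    if i < len(parts) and len(parts[i]) == 4 and parts[i].isalpha():
        result["script"] = parts[i].title()
        i += 1
    # Region: two letters ("RS") or three digits ("419").
    if i < len(parts) and (
        (len(parts[i]) == 2 and parts[i].isalpha())
        or (len(parts[i]) == 3 and parts[i].isdigit())
    ):
        result["region"] = parts[i].upper()
        i += 1
    # Variants: any remaining multi-character subtags before a singleton.
    while i < len(parts) and len(parts[i]) > 1:
        result["variants"].append(parts[i].lower())
        i += 1
    # Extensions: a one-character singleton followed by its subtags.
    while i < len(parts):
        singleton = parts[i].lower()
        i += 1
        values = []
        while i < len(parts) and len(parts[i]) > 1:
            values.append(parts[i].lower())
            i += 1
        result["extensions"][singleton] = values
    return result


# Serbian in Latin script as used in Serbia.
print(parse_language_tag("sr-Latn-RS"))
# German (Germany) with phonebook collation via the Unicode locale extension.
print(parse_language_tag("de-DE-u-co-phonebk"))
```

Production code would normally rely on an established locale library rather than a hand-written splitter like this one.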
[Chair, Richard Ishida • Scribe, Charles McCathieNevile]