Standards and best practices for the Multilingual Web
Today, the World Wide Web is fundamental to communication in all walks of life. As the share of English web pages decreases and that of other languages increases, it is vitally important to ensure the multilingual success of the World Wide Web.
The MultilingualWeb project is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. The project aims to raise the visibility of existing best practices and standards and to identify gaps. The core vehicle for this is a series of four events planned over the coming two years.
On 26-27 October 2010 the W3C ran the first workshop in Madrid, entitled "The Multilingual Web: Where Are We?". The Madrid workshop was hosted by the Universidad Politécnica de Madrid.
The aim of the workshop was to survey and introduce people to currently available best practices and standards that are aimed at helping content creators, localizers, tools developers, and others meet the challenges of the multilingual Web. The key objective was to share information about existing initiatives and begin to identify gaps.
One of the unique features of the workshop was the variety of backgrounds of the slightly more than 100 workshop participants. The program and attendees reflected an unusually wide range of topics and, judging from attendee feedback received, the participants appreciated not only the breadth of insights, but also the interesting and useful networking opportunities.
What follows describes the topics introduced by the speakers, followed by a selection of key messages raised during each talk, in bulleted-list form. Links are also provided to the IRC transcript (taken by scribes during the meeting), to video recordings of the talks (where available), and to the slides.
The workshop covered a wide range of topics, but they all play a part in realizing the Multilingual Web of the future.
In the Developers session we heard about many initiatives that are currently in development to better support the multilingual user experience on the Web, including characters and fonts, locale data formats, internationalized domain names and typographic support, but there is still much to do. There are also questions about how to manage the process of handling translations and multilingual customer feedback. A critical part of the effort is for users to make their voice heard in standards arenas and, importantly, to application developers such as browser implementers.
The Creators session echoed the need for web applications and devices to better support locale-specific content. It highlighted the growing importance of the Mobile Web, and the fact that this platform has significant deficiencies with regard to multilingual support. Speakers in the session also reminded us that "content" means not only textual web pages, but also information for multimodal and voice applications, and SMS, especially in developing countries. There are also issues surrounding navigation to information in a multilingual environment, from choosing how to link to translated pages, to better understanding IVR (interactive voice response) systems. Another area that still shows gaps relates to translation: there are currently various approaches to providing translation-related information about non-translatability, about the (automatic) tools used for translation, about translation quality, and so on. Hybrid systems may be important for increasing the effectiveness of the localization process.
In the Localization session we heard about the need to improve standards and better integrate them into the localization process, and the need to get a better grip on metadata. Content itself is evolving - becoming more complex and faster-changing - and localization approaches need to be adapted to address the needs of this volatile and decentralized multilingual web of the future. The idea of a language-neutral representation was first introduced in this session as a way of dealing more effectively with multilingual searches.
The Machines session reiterated the need for standardization of metadata related to the localization process and the importance of Semantic Web technologies in supporting that. RDF and the Semantic Web featured in other talks in this session as a way of representing information in a language-neutral form that may even change the way we think about machine translation in the future. For significant progress in these areas, however, we were warned that a new way of working is needed, founded on cooperation and sharing of standardized resources, rather than hiding data away. A new framework is needed to enable this. The META-NET project is expected to be a significant contributor in this respect.
In the Users session, speakers demonstrated that there is an appetite for multilingual content around the world, but that there are significant organizational and technical challenges in the way of reaching people in continents such as Africa and Asia. We also saw linkages between multilingual development of the web and work on accessibility.
This initial workshop was mostly about sharing information about what goes into making up the multilingual web, and the kinds of initiatives that are currently being worked on. That kind of information sharing will continue in upcoming workshops, but we will also begin asking speakers to identify specific steps that can help us move forward to more effectively create, manage and present multilingual content to users around Europe and the world.
The workshop began with a welcome address from Guillermo Cisneros, Director of the Escuela Técnica Superior de Ingenieros de Telecomunicación (ETSIT-UPM). He said that this is a worldwide task and that the multilingual issue is important for the future of the Web. He also drew our attention to accessibility and sign languages as part of the task of making the Web accessible to all. Given the university's background, it is very happy to support this initiative.
Following Sr. Cisneros, Richard Ishida gave a brief overview of the MultilingualWeb project, and introduced the format of the workshop.
Kimmo Rossi of the European Commission, DG for Information Society and Media, Digital Content Directorate, praised the enthusiasm and voluntary contributions of the project partners. This project is about much more than standardization. We are reaching the end of the 7th Framework Programme and there have been some visible commitments in the Digital Agenda for Europe - an attempt to find out how technologies can be put to the service of society, reaching beyond Europe. There is a need to bridge the 'innovation gap' that plagues Europe - to take up innovations faster. The 8th Framework Programme is getting under way with 'research and innovation' as a focus. We need good-quality, structured input for the consultation process, and the Commission hopes to make use of the opinions of the stakeholders who attend these workshops. Kimmo also spoke about upcoming calls for proposals at the Commission. There is an ongoing call for proposals, open until 18 January, with 50 million euros available for projects in language technology. Areas include multilingual content processing (the whole chain of authoring and managing online multilingual content), multilingual information access and mining, and natural speech interaction. Another funding opportunity, launching on 1 February, is an SME initiative for digital content and languages. The objective is to bridge the language barriers in the data economy by enabling the acquisition of large quantities of data, then the pooling of data to build useful services for citizens.
The keynote speaker was Reinhard Schäler, of the Localisation Research Centre at the University of Limerick, and recently named CEO of the Rosetta Foundation.
The Developers session was chaired by Adriane Rinsche (LTC).
The session started with the talk "The Multilingual Web: Latest developments at the W3C/IETF" by Richard Ishida (W3C). Richard gave an overview of the W3C and of various internationalization activities under way not only at the W3C but also at the Unicode Consortium and the IETF. At the end of the talk he presented several tools, available from http://www.w3.org/International, which can help in testing existing sites.
In the next talk, "Localizing the Web from the Mozilla perspective", Axel Hecht (Mozilla) talked mainly about dealing with the challenges of community-driven development of software and Web sites in more than 80 languages.
Then Charles McCathieNevile (Opera) presented the talk "The Web everywhere, multilingualism at Opera". He discussed approaches to localizing the various versions of the Opera browser and other assets.
This was followed by a joint talk, "Bridging languages, cultures, and technology", by Jan Nelson (Senior Program Manager, Microsoft) and Peter Constable (Senior Program Manager, Microsoft). Jan briefly introduced the complexity of the translation, internationalization and globalization problems handled at his company, and then presented the http://www.wikibhasha.org/ project. Peter talked about some issues of the multilingual Web and demonstrated some of IE9's internationalization features.
At the end of the second day Mark Davis of Google, and President of the Unicode Consortium, gave a videocast talk from California with the title "Software for the world: Latest developments in Unicode and CLDR". Mark started with information about the extent of Unicode use on the Web, and the recent release of Unicode 6.0. He then talked about Internationalized Domain Names (IDNs) and recent developments in that area, such as top-level IDNs and new specifications related to IDNA and Unicode IDNA compatibility processing. For the remainder of his talk, Mark described CLDR (the Common Locale Data Repository), which aims to provide locale-based data formats for the whole world so that applications can be written in a language-independent way. This included a description of the BCP 47 language tag structure, and the new Unicode locale extension.
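To make the tag structure Mark described more concrete, here is a minimal sketch, in Python, of how a BCP 47 language tag carrying the Unicode locale extension might be broken into its parts. The helper is hypothetical and deliberately simplified (it ignores variants, grandfathered tags and multiple extension singletons); the example tags and CLDR keys such as "co" (collation) and "nu" (numbering system) are real.

# A minimal, simplified sketch (hypothetical helper, not a library API) that
# splits a BCP 47 language tag into language, script, region and the Unicode
# locale extension ("-u-") defined by CLDR for keywords such as collation.
def parse_bcp47(tag):
    subtags = tag.split("-")
    result = {"language": subtags[0], "extensions": {}}
    i = 1
    while i < len(subtags):
        st = subtags[i]
        if len(st) == 4 and st.isalpha():
            result["script"] = st.title()            # e.g. "Hans"
        elif (len(st) == 2 and st.isalpha()) or (len(st) == 3 and st.isdigit()):
            result["region"] = st.upper()            # e.g. "DE" or "419"
        elif st.lower() == "u":
            # Unicode locale extension: key/value pairs, e.g. co-phonebk.
            keys = subtags[i + 1:]
            result["extensions"]["u"] = dict(zip(keys[0::2], keys[1::2]))
            break
        i += 1
    return result

# German (Germany) with phonebook collation, and Simplified Chinese (China)
# with Han decimal digits -- both are valid tags using CLDR-defined keys.
print(parse_bcp47("de-DE-u-co-phonebk"))
print(parse_bcp47("zh-Hans-CN-u-nu-hanidec"))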
The Developers session on the first day ended with a lively Q&A discussion about HTML5 development and issues that should be solved as part of the MLW activity. For details see the related links.
This session was chaired by Charles McCathieNevile of Opera.
Roberto Belo Rovella and David Vella, of the BBC World Service, kicked off the Creators session with "Challenges for a multilingual news provider: pursuing best practices and standards for BBC World Service". The BBC World Service runs a multilingual site with editors for each language site, i.e. the content is not directly translated. Important topics are character encoding and font support. There is a general movement towards using Unicode, and the BBC was one of the first content providers to take this path; in the past, some web sites used images instead. With font support getting better and better, the adoption of Unicode is also increasing. The situation for mobile display in India is still problematic: 70% of mobile devices in India cannot display Hindi properly. Hence, the BBC is delivering its content there partially as images. This is seen as a temporary measure until the mobile display issue is resolved. Such experiences make you realize that the "create once, publish everywhere" approach does not work in practice: local requirements and missing font support lead to different workflows.
Paolo Baggia from Loquendo presented on "Multilingual Aspects in Speech and Multimodal Interfaces". Many people cannot read, or find themselves in contexts where no written language is available. W3C has worked in this area for about 10 years, in the "Multimodal Interaction" and "Voice Browser" Working Groups. Standards like VoiceXML, SSML (Speech Synthesis Markup Language) and PLS (Pronunciation Lexicon Specification) can be used to create multilingual voice applications. In speech synthesis and text-to-speech scenarios, xml:lang and the underlying BCP 47 standard for language identifiers are crucial for letting users choose the appropriate language. In the area of phonetic alphabets, more information than BCP 47 provides is necessary.
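As an illustration of how such markup fits together, the following sketch builds a small SSML 1.0 document in which xml:lang carries BCP 47 language tags, using only Python's standard library. The content is invented and is not taken from a Loquendo application.

# A minimal sketch: an SSML 1.0 document where xml:lang tells a speech
# synthesizer which language each span is in. Built with ElementTree.
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", SSML_NS)

speak = ET.Element(f"{{{SSML_NS}}}speak",
                   {"version": "1.0", f"{{{XML_NS}}}lang": "en-US"})
s1 = ET.SubElement(speak, f"{{{SSML_NS}}}s")
s1.text = "The Spanish word for 'workshop' is"
s2 = ET.SubElement(speak, f"{{{SSML_NS}}}s", {f"{{{XML_NS}}}lang": "es-ES"})
s2.text = "taller"

print(ET.tostring(speak, encoding="unicode"))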
Luis Bellido from Universidad Politécnica de Madrid presented on "Experiences in creating multilingual web sites". His example was the "Lingu@net Europa" web site, a multilingual centre for language learning available in 32 different languages. The content is created not by professional translators, but by language-teaching professionals. For this group the use of technologies like translation memories is not easily learned, so the site is created with a workflow that they find easy to use. In terms of standards, the UTF-8 character encoding, HTML, CSS and XML play a crucial role, but other technologies such as MS Office and the text-indexing framework Lucene are applied too. An issue in creating multilingual sites is how to handle "multilingual links". In the project, XSLT was used to create such links, and Luis suggested that having such facilities in a CMS would be helpful.
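One widely used pattern for exposing such links between translated pages, sketched below for illustration, is to emit rel="alternate" hreflang link elements pointing at the other language versions. The URLs and languages are invented, and this is not necessarily the mechanism used on the Lingu@net Europa site.

# A minimal sketch that generates <link rel="alternate" hreflang="..."> markup
# for the translated versions of a page; data is an invented example.
translations = {"en": "/en/resources", "es": "/es/recursos", "de": "/de/ressourcen"}

def alternate_links(current_lang, pages):
    """Emit one <link> element per translation of the current page."""
    return "\n".join(
        f'<link rel="alternate" hreflang="{lang}" href="{url}">'
        for lang, url in pages.items() if lang != current_lang)

print(alternate_links("en", translations))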
Pedro L. Díez Orzas, together with Giuseppe Deriard and Pablo Badía Mas, from Linguaserve presented on "Key Aspects of Multilingual Web Content Life Cycles: Present and Future". Pedro emphasized that the creation of multilingual content is a process. Important information in that process is, for example, what does not need to be translated. To express that information, the tDTD ("translatability data type definition") format has been developed. However, in a typical CMS and content life cycle, multilinguality (i.e. this kind of information) is not regarded as important. Methodologies and workflows are becoming more and more "hybrid": traditional human translation, MT combined with post-editing, and "MT only". To make the creation of such workflows easy, CMSs have to take the requirements of multilingual content management into account.
Max Froumentin from the World Wide Web Foundation presented on "The Remaining Five Billion: Why is Most of the World's Population Not Online and What Mobile Phones Can do About it". Max pointed out that not only Web pages, but also SMS and voice systems are part of the Web. These are used a lot by people who cannot access Web pages easily. The underlying technologies (HTTP, URIs) are the same, but in the case of SMS or voice the user is not aware of that. SMS and voice could be a means to spread knowledge: e.g. a farmer who cannot read, but who knows how to grow trees in the desert, can share his knowledge via recordings or voice applications. The World Wide Web Foundation is running projects that help to develop such applications. Max gave examples from the Sahara region and India. These also demonstrate business models: people are willing to pay for information delivered in that way.
During the Q&A part of the Creators session, various kinds of information relevant for multilingual content were discussed, e.g. information like "this page is multilingual" or "this page was machine translated". Sometimes new mechanisms were proposed, e.g. for multilingual links, which could in fact be implemented with existing standards. It became clear that in such situations what is needed is not new standards, but widespread knowledge about how standards work together. It was also pointed out that user and customer requirements need to be at the center of standardization and (MT) technology development.
This session was chaired by Felix Sasaki of DFKI.
As the anchor speaker for the Localizers session, Christian Lieske of SAP in Germany talked about "Best Practices and Standards for Improving Globalization-related Processes". Christian started with a sketch of enterprise-scale, globalization-related processes, and went on to touch on globalization best practices and some of the standards and best practices that have been developed by standards bodies. Useful standards include TermBase eXchange (TBX) (for terminology assets), Translation Memory eXchange (TMX) (for translation memory assets), XML Localization Interchange File Format (XLIFF) (for canonicalised content), and the Internationalization Tag Set (ITS) (for resource descriptions related to internationalization). Christian explained the benefits of some of these standards, but then went on to discuss opportunities for improvement, given that in reality things are not always simple.
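To give a feel for one of these formats, the following sketch shows a minimal, invented XLIFF 1.2 fragment with a single translation unit, together with a few lines of Python that extract the source/target pairs. It is an illustration only, not an example taken from SAP's processes.

# A minimal sketch: a tiny XLIFF 1.2 document and code that lists its
# source/target pairs. The file name and strings are invented examples.
import xml.etree.ElementTree as ET

XLIFF = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="welcome.html" source-language="en" target-language="es"
        datatype="html">
    <body>
      <trans-unit id="1">
        <source>Welcome to the workshop</source>
        <target>Bienvenidos al taller</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""

ns = {"x": "urn:oasis:names:tc:xliff:document:1.2"}
root = ET.fromstring(XLIFF)
for tu in root.iterfind(".//x:trans-unit", ns):
    src = tu.findtext("x:source", namespaces=ns)
    tgt = tu.findtext("x:target", namespaces=ns)
    print(f'{tu.get("id")}: "{src}" -> "{tgt}"')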
The next speaker was Josef van Genabith, from the Centre for Next Generation Localisation (CNGL), talking about "Next Generation Localisation". After reviewing what localization is, Josef described three global 'mega-trends' in localization: volume (moving from corporate to social media content), access (moving from desktop to mobile devices), and personalization (moving from broad regional targets to individual user preferences). He then explored a few alternate views on how to address the challenge of coping with the demands of these trends taken together.
Daniel Grasmick of Lucy Software spoke about "Applying Standards - Benefits and Drawbacks". Daniel briefly reviewed the history of standards such as TMX, TBX and XLIFF. He then asked whether standards are always the best approach, and argued that they may be overkill for some projects. For other projects, where large volumes of content have to be exchanged and the data is accessed by many, a standard format like XLIFF is absolutely appropriate.
Marko Grobelnik of the Jozef Stefan Institute spoke about "Cross-lingual Document Similarity for Wikipedia Languages". Marko explained that cross-lingual information retrieval can be seen in a scenario where a user initiates a search in one language but expects results in more than one language. There are many areas involved in text-related search, and each represents data about text in a slightly different way. Marko focused on the correlated vector space model and described how the system they are currently building works using correlated Wikipedia texts.
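The following is a minimal sketch, not the system Marko described, of the general idea behind a shared, correlated vector space: comparable document pairs in two languages are stacked into one term space, a truncated SVD learns a shared latent space, and monolingual documents are then folded in and compared by cosine similarity. The documents and vocabulary are toy data chosen only for illustration.

# A minimal cross-lingual similarity sketch in the spirit of cross-language
# LSI over comparable (e.g. Wikipedia-like) document pairs. Toy data only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

en_docs = ["the city hosts a large university",
           "the river flows through the old town",
           "the football club won the national league"]
es_docs = ["la ciudad alberga una gran universidad",
           "el rio atraviesa el casco antiguo",
           "el club de futbol gano la liga nacional"]

vec_en = TfidfVectorizer().fit(en_docs)
vec_es = TfidfVectorizer().fit(es_docs)
X_en = vec_en.transform(en_docs).toarray()
X_es = vec_es.transform(es_docs).toarray()

# Stack each comparable pair into one bilingual vector and learn a shared space.
X = np.hstack([X_en, X_es])                  # shape: (pairs, |V_en| + |V_es|)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
V_k, S_k = Vt[:k].T, np.diag(S[:k])

def fold_in(doc_vec, lang):
    """Project a monolingual document into the shared latent space."""
    full = np.zeros(X.shape[1])
    if lang == "en":
        full[:X_en.shape[1]] = doc_vec
    else:
        full[X_en.shape[1]:] = doc_vec
    return full @ V_k @ np.linalg.inv(S_k)

q = fold_in(vec_en.transform(["university in the city"]).toarray()[0], "en")
d = fold_in(vec_es.transform(["una universidad en la ciudad"]).toarray()[0], "es")
cos = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))
print(f"cross-lingual similarity: {cos:.2f}")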
The session ended with Q&A on various topics. These included why SAP doesn't use standards like XLIFF, who are the best players for developing standards, why there are various different browsers, thoughts about standardizing non-translate flags, and more. For details, see the related links.
The Machines session started with an introduction by the session chair, Dan Tufis, pointing out that machines are essential in the development of a truly multilingual World Wide Web, because effective machine-to-machine interoperability is key to an efficient multilingual web user experience.
In the first presentation, "Language Resources, Language Technology, Text Mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web", Felix Sasaki of the DFKI in Germany talked about Language Technologies (LT) and in particular interoperability of technologies. He introduced applications concerning summarization, Machine Translation (MT) and text mining, and showed what is needed in terms of resources. For this he identified different types of language resources, and distinguished between linguistic approaches and statistical approaches. Machines need three types of data: input, resources and workflow, and currently there are the following types of gaps that exist in this data scenario: metadata, process, purpose. These gaps were exemplified with an MT application. The purpose gap specifically concerns the identification of metadata, process flows and the employed resources. Any identification must be facilitated across applications with a common understanding, and therefore different communities have to join in and share the information that has to be employed in the descriptive part of the identification task. A particular solution that can provide a machine-readable information foundation was provided by the semantic technologies of the Semantic Web (SW). A more shallow approach than the complex fully-fledged approach of the SW for web pages is available through microformats or RDFa. With some few examples some insights were presented on how the SW actually contributes to closing the introduced gaps. The point to remember is that RDF is a means to provide a common information space for machines. The talk closed with some ideas on joint projects (community effort), and specifically on how META-NET is already working in that direction. A short discussion pointed out the currently available language description frameworks and the overall complexity of RDF for browser developers.
In the next presentation Nicoletta Calzolari of ILC-CNR in Pisa, Italy extended the notion of language resources (LRs) to also include tools. She pointed out that a new paradigm is needed in the LR world to accommodate the continuous evolution of the multilingual Web, including the satisfaction of new demands and requirements that account for these dynamic needs. To this end, a web of LRs is currently being built, driven by standardization efforts. To successfully enable further evolution, distributed services are also needed, including effective collaboration between the different communities. In this context, a very important and sensitive issue concerns politics and policies to support these changes. Today, several EU-funded projects have taken up this new R&D direction, and national initiatives are joining in to build stronger communities, which are critical for ensuring truly web-based access to LRs together with a common global vision and cooperation. Examples are projects such as CLARIN, FLaReNet and META-NET, whose shared roadmap holds that interoperability between resources and tools, as well as more openness through sharing, is key to overall success. The short follow-up discussion focused on a scenario with many infrastructures, and on the question of interoperability today, identifying the EU project META-NET as the key player in solving these issues.
The third speaker, Thierry Declerck of DFKI, presented the "LEMON" project, which is researching and developing an ontology-lexicon model for LRs. LEMON is part of the EU-funded Monnet project and collaborates with standardization bodies. The project's R&D contributes to the multilingual Web by providing linguistically enriched data and information. The industrial use case of Monnet is the financial world, in particular the field of reporting. The standards used on the industrial side are XBRL, IFRS and GAAP, and the encodings in these standards are related to Semantic Web standards to build a bridge between financial data and linguistic data. The overall approach was exemplified by an online term translation application. The talk closed with an architectural overview of the Monnet components and the standards used on the language side, which include LMF, LIR and SKOS. The project envisages establishing a strong link with the META-NET project very soon.
The last talk of this morning session was given by Jose Gonzales of UPM and the university spin-off company DAEDALUS in Madrid, Spain. The company's projects share similar subject areas with the LEMON project and the other presented approaches, but with a strong market-oriented view. The LRs employed by DAEDALUS date back to the 1990s, when no resources for Spanish were available, so the initial focus was on spell checking, which was needed mainly by the media market. These developments had an important influence on all later developments, such as search and retrieval, and even ontology development. The initial work on LRs was followed by multilingual developments based on continuous experience in the field, exemplified by a multilingual information extraction application and by an example that integrated speech (DALI) into the application scenario. Current applications include sentiment analysis and opinion mining (also in Japanese), and an EC-funded project (Flavius) which takes into account user-generated content and machine translation. The talk closed with some online examples and an outlook on linking ontologies with lexicon tools.
In the first presentation of the afternoon track of the Machines session, Jörg Schütz of bioloom group, Germany, introduced yet another application field for multilingual web activities, quite different from the previous scenarios: business and process intelligence. He pointed out that, in response to our global economies, today's business and process intelligence processes and tools have to deal with input that stems from many different languages and cultures, without a clear distinction between what was the original source language and what should be the eventual target language of the output - output that triggers possible decision-making and optimization processes based on data mining, data combination and analytics findings. To bridge the gap between existing language technologies and process intelligence operations, the Semantic Web provides the means to design, model and put into operation appropriate mechanisms. However, although there are existing standards on both sides for data representation, modeling of rules and terminology, querying and inferencing, and fact extraction, which can be seamlessly combined through Semantic Web technologies, the communities of these ecosystems have so far failed to talk to each other, due to trust in their own standards, fear of increased complexity, lack of reference implementations, being outpaced by technology, lack of exchange between solution providers, and uncertain involvement of buyers and customers. The talk concluded with a look at what is currently missing for closer collaboration between the actors in these ecosystems, and for better interoperability between their data, tools and processes.
Piek Vossen of VU University Amsterdam gave a talk entitled "KYOTO: a platform for anchoring textual meaning across languages". The talk described a very open, generic platform with which you can mine any kind of relation from text, but which can be tuned or customized to the kind of information you are interested in. Piek began his talk with the provocative question "Why translate text if you can mine text and represent the knowledge and information in a language-neutral form?" He then described the evolution of the web into the future, where knowledge would be understood by machines, and what was needed to get there. The remainder of his talk described how the KYOTO project is attempting to convert the natural language of today's web into knowledge representations that can be used in the web of the future.
As a stand-in for a talk that had to be canceled, Christian Lieske of SAP in Germany took the stage a second time. This time (drawing on older material developed with Felix Sasaki and Richard Ishida), he presented the "W3C Internationalization Tag Set (ITS)". ITS is a W3C Recommendation that helps to internationalize XML-based content. Content that has been internationalized with ITS can be processed more easily by humans and machines. ITS also plays an important role in the W3C note "Best Practices for XML Internationalization". Christian explained that seven so-called data categories are the heart of ITS. They cover topics such as a marker indicating that a range of content must not be translated. Thus ITS helps humans and machines, since ITS information can, for example, help to configure a spell checker or communicate with a translator. ITS data categories are valuable in themselves: you do not need to work with the ITS namespace, so they are also useful for RDF or other non-XML data. Although ITS is a relatively new standard, Christian was able to point to existing implementations (e.g. the Okapi framework) that support ITS-based processing. In addition, he sketched early scenarios and visions for a possible pivotal role of ITS in the creation of multilingual, Web-based resources: clients (such as Web browsers) that interpret ITS and thus can feed more adequate content to machine translation systems.
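To illustrate the Translate data category, the following sketch walks an invented XML document and collects only the text whose effective its:translate value is "yes", honouring the inheritance of the attribute. The namespace is the ITS namespace, but the sample document and helper function are illustrative only.

# A minimal sketch of ITS "Translate": skip anything marked its:translate="no",
# letting a nested its:translate="yes" override it again. Toy document.
import xml.etree.ElementTree as ET

ITS_TRANSLATE = "{http://www.w3.org/2005/11/its}translate"

DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <p>Press <code its:translate="no">Ctrl+S</code> to save your work.</p>
  <p its:translate="no">Build 2010-10-26 <em its:translate="yes">(internal)</em></p>
</doc>"""

def translatable_text(elem, inherited="yes"):
    """Yield text nodes whose effective its:translate value is 'yes'."""
    current = elem.get(ITS_TRANSLATE, inherited)
    if current == "yes" and elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from translatable_text(child, current)
        if current == "yes" and child.tail and child.tail.strip():
            yield child.tail.strip()

for text in translatable_text(ET.fromstring(DOC)):
    print(text)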
In the short Q&A discussion round before the lunch break the question was raised on how all these "islands of LRs" could be made available to the public, and concluded with the observation that there are already some ISO initiatives underway to support this opening direction also in terms of structuring the resources, and that an open issue still is the actual representation format.
This session was chaired by Chiara Pacella of Facebook.
Ghassan Haddad from Facebook presented the anchor talk for the Users session, entitled "Facebook Translation Technology and the Social Web". After a short movie showing the volume of interactions with Facebook, there was a brief introduction to the history of Facebook and the unique challenge they set themselves: to achieve translation into, currently, 77 languages without slowing development. This was done through crowd-sourcing. Ghassan demonstrated approaches Facebook uses to help people adopt new translations as they appear, and to deal with the complexities of interactive and dynamic messages. Ghassan then talked about the process of community translation in the crowd-sourcing approach, and how app developers can set up content for translation. About 50 languages are supported entirely by the community; the other 27 languages use professional support.
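As a hint of why such dynamic messages are hard to translate, the sketch below (which is not Facebook's actual system) selects among translated templates by plural category before filling in runtime values. The languages, strings and simplified plural rules are illustrative; real systems rely on the CLDR plural rules.

# A minimal sketch of plural-aware dynamic messages: the correct translated
# template depends on a runtime count. Strings and rules are toy examples.
TEMPLATES = {
    "en": {"one": "{name} added {n} photo.",
           "other": "{name} added {n} photos."},
    "ru": {"one": "{name} добавил(а) {n} фотографию.",
           "few": "{name} добавил(а) {n} фотографии.",
           "other": "{name} добавил(а) {n} фотографий."},
}

def plural_category(lang, n):
    """Very rough plural selection; real systems use the CLDR plural rules."""
    if lang == "ru":
        if n % 10 == 1 and n % 100 != 11:
            return "one"
        if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
            return "few"
        return "other"
    return "one" if n == 1 else "other"

def render(lang, name, n):
    return TEMPLATES[lang][plural_category(lang, n)].format(name=name, n=n)

print(render("en", "Ana", 1))
print(render("ru", "Ana", 3))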
Denis Gikunda from Google presented on "Google's community Translation in Sub Saharan Africa". Sub-Saharan Africa has around 14% of the world's population yet only 5% of the world's internet users (though this is growing). The low representation of African languages and the lack of relevant content remain among the biggest barriers to closing this gap. Google's strategy for Africa is to get people online by developing an accessible, relevant and sustainable internet ecosystem. The talk shared insights from two recent community translation initiatives aimed at increasing African-language content: the Community Translation program and the Kiswahili Wikipedia Challenge.
Loïc Martínez Normand, representing the Sidar Foundation, presented on "Localization and web accessibility". Loïc described how he and Emmanuelle Gutiérrez y Restrepo have interfaced with many standards, especially those of the W3C, discussed what 'accessibility' means, and introduced the Sidar Foundation. The main content of the presentation explored those W3C Web Content Accessibility Guidelines (WCAG) that relate to international users. At the end, the presentation compared the quick tips published by the W3C Accessibility and Internationalization activities.
Swaran Lata, of the Department of Information Technology, Government of India, and also W3C India Office Manager, closed the Users session with a talk entitled "Challenges for Multilingual Web in India: Technology development and Standardization perspective". After setting the scene with an overview of the challenges in India and the complexity of Indian scripts, Swaran Lata talked about various technical challenges they are facing, and some key initiatives aimed at addressing those challenges. For example, there are e-government initiatives in the local languages of the various states, and a national ID project, which brings together multilingual databases, on-line services and web interfaces. She then mentioned various standardization-related challenges, and initiatives that are in place to address them.
The Q&A, began with opinions that it is difficult to get work done through volunteers - the do-good approach wears off after a while. This was followed by some discussion of sign languages, and then minority languages. The final question was about standardization of identifiers for African languages. For details, see the links below.