MULTILINGUALWEB

Standards and best practices for the Multilingual Web

W3C Workshop Report:
The Multilingual Web: Where Are We?
26-27 October 2010, Madrid, Spain

Today, the World Wide Web is fundamental to communication in all walks of life. As the share of English web pages decreases and that of other languages increases, it is vitally important to ensure the multilingual success of the World Wide Web.

The MultilingualWeb project is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. The project aims to raise the visibility of existing best practices and standards and identify gaps. The core vehicle for this is a series of four events which are planned for the coming two years.

On 26-27 October 2010 the W3C ran the first workshop in Madrid, entitled "The Multilingual Web: Where Are We?". The Madrid workshop was hosted by the Universidad Politécnica de Madrid.

The aim of the workshop was to survey and introduce people to currently available best practices and standards that are aimed at helping content creators, localizers, tools developers, and others meet the challenges of the multilingual Web. The key objective was to share information about existing initiatives and begin to identify gaps.

One of the unique features of the workshop was the variety of backgrounds of the slightly more than 100 workshop participants. The program and attendees reflected an unusually wide range of topics and, judging from attendee feedback received, the participants appreciated not only the breadth of insights, but also the interesting and useful networking opportunities.

What follows describes the topics introduced by each speaker, followed by a bulleted selection of key messages raised during their talk. Links are also provided to the IRC transcript (taken by scribes during the meeting), to video recordings of the talks (where available), and to the talk slides.

Contents: Summary · Welcome · Developers · Creators · Localizers · Machines · Users

Summary

The workshop covered a wide range of topics, but they all play together in realizing the Multilingual Web of the future.

In the Developers session we heard about many initiatives that are currently in development to better support the multilingual user experience on the Web, including characters and fonts, locale data formats, internationalized domain names and typographic support, but there is still much to do. There are also questions about how to manage the process of handling translations and multilingual customer feedback. A critical part of the effort is for users to make their voice heard in standards arenas and, importantly, to application developers such as browser implementers.

The Creators session echoed the need for web applications and devices to better support local content. It highlighted the growing importance of the Mobile Web, and the fact that this platform has significant deficiencies with regard to multilingual support. Speakers in the session also reminded us that "content" means not only textual web pages, but also information for multimodal and voice applications, or SMS, especially in developing countries. There are also issues surrounding navigation to information in a multilingual environment, from choosing how to link to translated pages, to better understanding IVR systems. Another area that still shows gaps relates to translation: currently there are various approaches to providing translation-related information, such as what should not be translated, which (automatic) tools were used for translation, and what the translation quality is. Hybrid systems may be important to increase the effectiveness of the localization process.

In the Localization session we heard about the need to improve standards and better integrate them into the localization process, and the need to get a better grip on metadata. Content itself is evolving - becoming more complex and fast changing - and localization approaches need to be adapted to address the needs of this volatile and decentralized multilingual web of the future. The idea of a language-neutral representation was first introduced in this session as a way of dealing more effectively with multilingual searches.

The Machines session reiterated the need for standardization of metadata related to the localization process and the importance of Semantic Web technologies to support that. RDF and the Semantic Web featured in other talks in this session as a way of representing information in a language-neutral form that may even change the way we think about machine translation in the future. For significant progress in these areas, however, we were warned that a new way of working is needed, founded on cooperation and sharing of standardized resources, rather than hiding data away. A new framework is needed to enable this. The META-NET project is expected to be a significant contributor in this respect.

In the Users session, speakers demonstrated that there is an appetite for multilingual content around the world, but that there are significant organizational and technical challenges in the way of reaching people in continents such as Africa and Asia. We also saw linkages between multilingual development of the web and work on accessibility.

This initial workshop was mostly about sharing information about what goes to make up the multilingual web, and the kinds of initiatives that are currently being worked on. That kind of information sharing will continue in upcoming workshops, but we will also begin asking speakers to identify specific steps that can help us move forward to more effectively create, manage and present multilingual content to users around Europe and the world.

Welcome session

The workshop began with a welcome address from Guillermo Cisneros, Director of the Escuela Técnica Superior de Ingenieros de Telecomunicación (ETSIT-UPM). He said that this is a worldwide task and that the multilingual issue is important for the future of the Web. He also brought our attention to the needs of accessibility and sign languages as part of the task of making the Web accessible to all. Given the background of the university, they are very happy to support this initiative.

Following Sr. Cisneros, Richard Ishida gave a brief overview of the MultilingualWeb project, and introduced the format of the workshop.

Kimmo Rossi of the European Commission, DG for Information Society and Media, Digital Content Directorate, praised the enthusiasm and voluntary contributions of the project partners. This project is about much more than standardization. We are reaching the end of the 7th Framework Programme and there have been some visible commitments on the Digital Agenda for Europe - an attempt to find out how technologies can be put to the service of society, reaching beyond Europe. There is a need to bridge the 'innovation gap' that plagues Europe - to take up innovations faster. The 8th Framework Programme is getting under way with 'research and innovation' as a focus. We need good-quality structured input for the consultation process, and the Commission hopes to make use of the opinions of the stakeholders who attend these workshops. Kimmo spoke about upcoming calls for proposals at the Commission. There is an ongoing call for proposals, open until 18 January, with 50 million euros available for projects in language technology. Areas include multilingual content processing (the whole chain of authoring and managing online multilingual content), multilingual information access and mining, and natural speech interaction. Another funding opportunity, launching on 1 February, is an SME initiative for digital content and languages. The objective is to bridge the language barriers in the data economy by enabling the acquisition of large quantities of data, then pooling that data to build useful services for citizens.

The keynote speaker was Reinhard Schäler, of the Localisation Research Centre at the University of Limerick, and recently named CEO of the Rosetta Foundation.

Developers session

The Developers session was chaired by Adriane Rinsche (LTC).

The session started with the talk "The Multilingual Web: Latest developments at the W3C/IETF", by Richard Ishida (W3C). Richard gave an overview of the W3C and of various internationalization (I18N) activities under way not only at the W3C but also at the Unicode Consortium and the IETF. At the end of the talk he presented several tools available from http://www.w3.org/International which can help in testing existing sites.

  • If you include ASCII-only pages, almost 70% of the Web is now using UTF-8 (a Unicode encoding) for web pages (according to a survey of 6.5 billion pages by Google).
  • Be careful about normalization of Unicode text on the Web. Use NFC. Be aware that HTML and CSS do not normalize class/id names when matching style selectors (see the sketch after this list).
  • Work is moving forward on internationalized domain names and internationalized resource identifiers.
  • You should be using BCP 47 for language tagging.
  • CSS and HTML work needs more support from local users to establish requirements and use cases. Vertical text and ruby are currently hot topics, as are extensions to provide better support for right-to-left and bidirectional text on the Web (Arabic, Hebrew, etc.).
  • WOFF web fonts are coming very soon. These will allow downloadable font support. There are still issues related to rendering OpenType and related features that are OS specific (eg. Uniscribe).
  • HTML5 provides a number of changes relating to internationalization features in HTML - for example, new form input types.
  • The Mobile Web is, and will become even more, important for the future of the Web. So also is speech technology.
  • The W3C provides resources such as articles, tests, tutorials, etc at http://www.w3.org/International/ organized by task.
  • Use the Internationalization Checker on your web pages: http://qa-dev.w3.org/i18n-checker/
  • The W3C needs multilingual people to participate in standards development to get the work done.
  • There are three legs to the Web: standards developers, user agent implementers, and users, and they all have to work together for things to work. Users need to be talking to implementers to get Web standards implemented.
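
A minimal sketch (plain Python, standard library only) of two of the points above: Unicode normalization to NFC, and BCP 47 language tags as they would appear in HTML lang attributes. The tags shown are real BCP 47 examples; everything else is illustrative.

    import unicodedata

    # The same visible text, "café", built from different code points.
    precomposed = "caf\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "cafe\u0301"     # "e" followed by U+0301 COMBINING ACUTE ACCENT

    print(precomposed == decomposed)                                # False: different code points
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True after NFC normalization

    # Because HTML and CSS do not normalize, class="café" in markup and .café in
    # a style sheet only match if both are stored in the same normalization form.

    # Well-formed BCP 47 language tags, e.g. for HTML lang attributes:
    # lang="en", lang="es-419" (Latin American Spanish), lang="zh-Hans" (Simplified Chinese)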

In the next talk, "Localizing the Web from the Mozilla perspective", Axel Hecht (Mozilla) talked mainly about the challenges of community-driven development of software and Web sites in more than 80 languages.

  • If your site sniffs User-Agent headers to determine the language of the page the user wants to see, please stop and use Accept-Language instead, because Mozilla is changing the user agent string in Firefox 4 and dropping the locale code (a sketch of Accept-Language negotiation follows this list).
  • When negotiating content language, another challenge, especially with increasing numbers of minority languages, is sensitivity about declaring a preference for, say, Kurdish within a Turkish domain. How do you balance serving the best content against user privacy?
  • Many things have not improved since the start of JavaScript, eg. Date.toLocaleString() is not truly internationalized. There is some work under way to add internationalization features, but it doesn't go far enough. Mozilla wants to hear what you need - should Accept-Language be site-specific, for example, with a dedicated user interface? Do we need better APIs for selecting languages per BCP47?
  • There are live multilingual documents, such as documentation and knowledge bases: for these pages you don't know whether a change to a German page is a translation or a bug fix. How do you differentiate between an added translation and a bug fix whose content should be propagated to pages in other languages?
  • When users hit the international feedback button they submit comments in many languages. Mozilla is experimenting with Firefox 4, but still doesn't know how to handle this.
  • Mozilla worked through several existing Wiki systems, but none was really sufficient from a multilingual point of view. They're now developing their own Kitsune system.
  • How do you keep apart localizable content from application logic, in particular for Web Apps?
  • What functionality is missing in the browsers (in general or in Firefox)?
  • How to create/maintain documents in multiple languages without putting too much burden on people with regard to process flow?
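
The Accept-Language point above can be made concrete with a small sketch. This is not Mozilla's code; it is a simplified, standard-library-only illustration of server-side language negotiation, and the header value and list of available page languages are made up.

    def parse_accept_language(header):
        """Return (language-tag, q) pairs from an Accept-Language header, highest q first."""
        entries = []
        for part in header.split(","):
            part = part.strip()
            if not part:
                continue
            if ";q=" in part:
                tag, _, q = part.partition(";q=")
                try:
                    weight = float(q)
                except ValueError:
                    weight = 0.0
            else:
                tag, weight = part, 1.0
            entries.append((tag.strip().lower(), weight))
        return sorted(entries, key=lambda item: item[1], reverse=True)

    def choose_language(header, available):
        """Pick the best available page language for the request; fall back to the first one."""
        lookup = {tag.lower(): tag for tag in available}
        for tag, _weight in parse_accept_language(header):
            if tag in lookup:
                return lookup[tag]
            primary = tag.split("-")[0]     # e.g. "de-AT" falls back to "de"
            if primary in lookup:
                return lookup[primary]
        return available[0]

    print(choose_language("de-AT,de;q=0.9,en;q=0.5", ["en", "de", "fr"]))  # -> "de"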

Then Charles McCathieNevile (Opera) presented the talk "The Web everywhere, multilingualism at Opera", about approaches to localizing the various Opera browser versions and other assets.

  • India and Iran are key markets for mobile Web.
  • Extensions built on widget technology rely on the widget developer to build in internationalization.
  • XLIFF may be over-engineered - they'd like something simpler.
  • Content is in XHTML, which makes it easier to process, e.g. using XSLT to transform to other formats.
  • There is a conflict between large dictionaries with many rules to support line-breaking and the size of small implementations on TVs or basic games consoles.
  • If you want good RTL and vertical text support we need a 'big kick' from the user community because it involves messing with a very difficult part of the code base.
  • Enabling translation from languages other than English is more complicated, eg. for maintaining quality, but it would give access to a broader community of skilled translators.
  • Extensibility is a potential problem.

Next was a joint talk, "Bridging languages, cultures, and technology", by Jan Nelson (Senior Program Manager, Microsoft) and Peter Constable (Senior Program Manager, Microsoft). Jan briefly introduced the complexity of the translation, internationalization and globalization problems handled at his company. He then introduced the http://www.wikibhasha.org/ project. Peter talked about some issues of the multilingual web and also demonstrated some of IE9's I18N features.

  • We are in danger of losing an important proportion of world culture as languages are dying out around us. "A language not on the Internet is a language that 'no longer exists'..."
  • There are APIs and a collaborative translation framework to help developers work with the 35 languages supported by Microsoft translation tools. http://www.microsofttranslator.com/
  • WikiBhasha beta has just been announced - an open source project, supporting 35 languages, aimed at facilitating the development of additional language content in Wikipedia. http://www.wikibhasha.org/
  • The Microsoft Local Language Program fosters the development and proliferation of regional language groups, to discuss how to extend language reach. (Currently 95 languages supported and about 1 billion people) http://www.microsoft.com/LLP
  • Internet users will surpass 2 billion in 2010, and these new users mostly speak languages of the developing world, which we need to figure out how to support.
  • We're talking about the multilingual WEB, not the multilingual browser.
  • The growth of UTF-8 to over 50% of web pages has happened in the last year and a half or so.
  • While we have a great separation of content and code in software, how do we separate content from HTML code?
  • How do we separate preferred language from location so that browsers serve information that the user prefers in a more intelligent way?
  • There are numerous issues related to user language, format and location preferences in scripting.

At the end of the second day Mark Davis of Google, and President of the Unicode Consortium, gave a videocast talk from California with the title "Software for the world: Latest developments in Unicode and CLDR". Mark started with information about the extent of Unicode use on the Web, and the recent release of Unicode 6.0. He then talked about Internationalized Domain Names and recent developments in that area, such as top-level IDNs and new specifications related to IDNA and Unicode IDNA compatibility processing. For the remainder of his talk, Mark described CLDR (the Common Locale Data Repository), which aims to provide locale-based data formats for the whole world so that applications can be written in a language-independent way. This included a description of the BCP 47 language tag structure, and the new Unicode locale extension.

  • Unicode 6.0 was just released with 109,000 characters, including the new Rupee symbol for India and emoji characters used heavily in Japan.
  • Each character has around 70 properties, and these are important to allow programmers to write code in a way that is truly language independent.
  • CLDR is used in a wide range of places, from your iPhone to Google, and so improvements to the CLDR data (for example for African locales) will have a significant impact.
  • The new Unicode locale extension allows you to augment BCP 47 language tags to include locale information (see the locale formatting sketch after this list).
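
As an illustration of CLDR in practice, the sketch below uses the Python Babel library (an assumption for illustration; any CLDR-backed library would do) to format the same date and number for several locales. The BCP 47 tag in the final comment shows the Unicode locale extension syntax.

    from datetime import date

    from babel.dates import format_date      # Babel ships CLDR locale data
    from babel.numbers import format_decimal

    d = date(2010, 10, 26)
    for locale in ("en_US", "es_ES", "de_DE", "hi_IN"):
        print(locale,
              format_date(d, format="long", locale=locale),
              format_decimal(1234567.89, locale=locale))

    # The Unicode locale extension adds locale preferences to a BCP 47 tag, e.g.
    # "de-DE-u-co-phonebk" requests German phonebook collation order.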

The Developers session on the first day ended with a lively Q&A discussion about HTML5 development and issues that should be solved as part of the MLW activity. For details see the related links.

Creators session

This session was chaired by Charles McCathieNevile of Opera.

Roberto Belo Rovella and David Vella, of the BBC World Service, kicked off the Creators session with "Challenges for a multilingual news provider: pursuing best practices and standards for BBC World Service". The BBC World Service has a multilingual site with editors for each site, i.e. not direct translations. Important topics are character encoding and font support. There is a general movement towards using Unicode, and the BBC was one of the first content providers to go down this path. In the past, some web sites used images instead. With font support getting better and better, the adoption of Unicode is also increasing. The situation for mobile display in India is still problematic: 70% of mobile devices in India cannot display Hindi properly. Hence, the BBC is delivering their content partially as images. This is seen as a temporary measure, until the mobile display issues are resolved. Such experiences make you realize that the approach "create once, publish everywhere" does not work in practice: local requirements and missing font support lead to different workflows.

  • Of their 23 million online unique visitors per week, nearly 10% were mobile users, and this is their focus for the future.
  • It is of key importance to the BBC to support the cultural differences in the way people view the world, even though in some cases they have to look for the best common denominator. For example, the look and feel of the basic page is recognizably that of the BBC, however Russian users are used to long, heavy front pages.
  • Some automated technology is used, such as the automatic conversion between Simplified and Traditional Chinese.
  • Some users are transitioning between scripts, for example younger Uzbek users prefer Latin script, but older readers prefer Cyrillic. Also some regions use Arabic script. The BBC is monitoring those tendencies and uses all three scripts on their Uzbek site, depending on which is most appropriate.
  • When dealing with users in places like China, the BBC has to take into account that it is often difficult to link outside the country, so they package up and send all the information that's needed for their local partner to create a complete user experience.
  • Another issue has been enabling people to input text into the CMS. Keyboards don't necessarily exist, or don't provide all the needed features. Tavultesoft and Microsoft Keyboard Creator are used to fill the gap until operating systems step up to provide support.
  • BBC is looking forward to exploiting the font download technology currently being specced and implemented, as a way of making Unicode text readable to wider audiences.
  • Moving to mobile devices is reintroducing some of the problems related to font support - either fonts aren't supported or OpenType features are not supported. About 70% of devices don't render the text correctly.
  • Because of this, the BBC had to resort to creating alternative pages that download graphical images of text - this is only a temporary solution; however, in the 2 months since launch it has increased traffic by 50%.
  • There are a number of issues you need to address to work with right-to-left scripts, which include design features such as picture carousel direction.
  • Ensure you anticipate growth of text in other languages when designing standard layouts.
  • Check whether support technologies, such as Flash, contain support for multilingual rollout.
  • Gaps:
    • The ability to publish once and display correctly everywhere - new font download technology needs to be supported by adequate rendering of complex font features.
    • Mobile device developers need to be provided with expertise in language support.
    • Support for older browsers needs to be rapidly deprecated.
    • Rich interactive experiences need to be platform agnostic.

Paolo Baggia from Loquendo presented about "Multilingual Aspects in Speech and Multimodal Interfaces". Many people cannot read, or there are contexts in which no written language is available. W3C has worked in this area for about 10 years, in the "Multimodal Interaction" and "Voice Browser" Working Groups. Standards like VoiceXML, SSML (Speech Synthesis Markup Language) or PLS (Pronunciation Lexicon Specification) can be used to create multilingual voice applications. In speech synthesis (text-to-speech) scenarios, xml:lang and the underlying BCP 47 standard for language identifiers are crucial for letting users choose the appropriate language. In the area of phonetic alphabets, more information than provided by BCP 47 is necessary.

  • Speech technologies enable multilinguality to be addressed in a wide variety of sectors and applications.
  • The role of audio material in the Web arena is increasing constantly.
  • The use of standards facilitates the development of multilingual speech applications.
  • Voice is different from text because it takes into account the reader, and the speaker may have an accent.
  • W3C Voice Browser standards are the basis for all the voice development in the Web:
    • Dialog Apps – VoiceXML 2.0 (2004), VoiceXML 2.1 (2007)
    • Grammars for Speech (and DTMF) – SRGS 1.0 (2004), SISR 1.0 (2007)
    • Prompts – SSML 1.0 (2004), SSML 1.1 (2010)
    • Pronunciation Lexicon – PLS 1.0 (2008)
    • Input Results – EMMA 1.0 (2009)
    • More to come: VoiceXML 3.0, SCXML 1.0, EmotionML 1.0, etc.
  • Attaching language information to speech markup is not trivial (see the sketch after this list).
  • Proposal to create an IANA Registry for Phonetic Alphabets
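
A small sketch of the xml:lang point above, using only the Python standard library: a minimal SSML 1.0 prompt (with made-up content) that switches language for one phrase, and code that lists the BCP 47 tags attached to the markup.

    import xml.etree.ElementTree as ET

    ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      Your train to <voice xml:lang="es-ES">Madrid</voice> leaves at half past nine.
    </speak>"""

    XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"   # how ElementTree spells xml:lang

    root = ET.fromstring(ssml)
    print("prompt language:", root.get(XML_LANG))
    for elem in root.iter():
        if elem is not root and elem.get(XML_LANG):
            print("switch to", elem.get(XML_LANG), "for:", (elem.text or "").strip())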

Luis Bellido from Universidad Politécnica de Madrid presented about "Experiences in creating multilingual web sites". His example is the "Lingu@net Europa" web site, a multilingual centre for language learning, available in 32 different languages. The content is created not by professional translators, but by language-teaching professionals. For this group the use of technologies like translation memories is not easily learned. Hence, the site is created for them with a workflow which they find easy to use. In terms of standards, UTF-8 character encoding, HTML, CSS and XML play a crucial role. But other technologies like MS Office or the text-indexing framework Lucene are applied, too. An issue in creating multilingual sites is how to handle "multilingual links". In the project, XSLT was used to create such links. Luis suggested that having such facilities in a CMS would be helpful.

  • A key issue is how to deal with links to material that is available in multiple languages, but not necessarily in the language of the document containing the link. How do you indicate to the user which languages are available? (A sketch using the hreflang attribute follows this list.)
  • Content management systems should also provide assistance for dealing with this type of situation.
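
One possible approach to the multilingual-link problem, sketched in Python (the URLs, languages and helper are hypothetical): generate the links with the standard HTML hreflang attribute, and label each link in its own language so readers can recognize it.

    # Hypothetical set of translated versions of one page.
    TRANSLATIONS = {
        "en": ("/about/index.en.html", "English"),
        "es": ("/about/index.es.html", "español"),
        "de": ("/about/index.de.html", "Deutsch"),
    }

    def language_links(current_lang):
        """Build an HTML fragment linking to the other language versions of a page."""
        items = []
        for tag, (url, label) in TRANSLATIONS.items():
            if tag == current_lang:
                continue
            # hreflang says what language the target document is in; lang marks the label itself.
            items.append(f'<a href="{url}" hreflang="{tag}" lang="{tag}">{label}</a>')
        return "<nav>" + " | ".join(items) + "</nav>"

    print(language_links("en"))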

Pedro L. Díez Orzas, together with Giuseppe Deriard and Pablo Badía Mas, from Linguaserve presented about "Key Aspects of Multilingual Web Content Life Cycles: Present and Future". Pedro emphasized that the creation of multilingual content is a process. Important information in that process includes, for example, what does not need to be translated. To express that information, the tDTD format ("translatability data type definition") has been developed. However, in a typical CMS and content life cycle, multilinguality (i.e. this kind of information) is not regarded as important. Methodologies and workflows are becoming more and more "hybrid": traditional human translation, MT combined with post-editing, "MT only". To make the creation of such workflows easy, CMSs have to take the requirements of multilingual content management into account.

  • It would be good to have standard ways of indicating and excluding content that does not need to be translated - in terms of both content types and content structural elements. Also use of attributes to indicate whether content has been translated and when.
  • In most cases CMSs don't consider lifecycles and workflow for multilingual content - neither for the content itself, nor the metadata, and there is no real management of language versions. Programmed solutions tend to be short-term and non-extensible.
  • The future points to hybrid systems for localization, combining different production systems like offline and real-time localization.
  • The way forward for translation is also to have hybrid systems that combine technologies for rule-based MT, statistical MT, and both segment and page translation memories and combine different methodologies too.
  • For best practice, avoid the generalized use of SSL throughout the portal, JavaScript, and images or Flash applications with text. Do use XHTML and web accessibility guidelines, well-structured templates, and text that is as free as possible from errors.

Max Froumentin from the World Wide Web Foundation presented about "The Remaining Five Billion: Why is Most of the World's Population Not Online and What Mobile Phones Can do About it". Max pointed out that not only Web pages, but also SMS and voice systems are part of the Web. These are used a lot by people who cannot access Web pages easily. The underlying technologies (HTTP, URIs) are the same, but in the case of SMS or voice the user is not aware of that. SMS and voice could be a means to spread knowledge: e.g. a farmer who cannot read, but who knows how to grow trees in the desert, can share his knowledge via recordings or voice applications. The World Wide Web Foundation is running projects that help to develop such applications. Max gave examples from the Sahara region and India. These also demonstrate business models: people are willing to pay for information delivered in that way.

  • Half the world's population has access to the internet and doesn't use it. In total, 75% of the world's population has access to the Web.
  • SMS and voice are also ways to access the Web.
  • SMS has created barriers to reading and typing local languages into messages, and that needs to be fixed.
  • Interactive voice response applications struggle to support speech synthesis and recognition in non-mainstream languages.
  • More work is needed in natural language processing to help people who are not familiar with how menu selection systems work for IVR systems because they haven't been exposed to the desktop computer.
  • There is a lack of local, interesting and relevant content - this is particularly problematic.
  • There is also a lack of knowledge about how to develop interactive systems for local needs, or to obtain funding and business models to support it.

During the Q&A part of the Creators session, various kinds of information relevant for multilingual content were discussed, e.g. information like "this page is multilingual" or "this page was machine translated". Sometimes new mechanisms were proposed, e.g. for multilingual links, which could be implemented with existing standards. It became clear that in such situations what is needed is not new standards, but widespread knowledge about how standards work together. It was pointed out that user and customer requirements need to be at the center of standardization and (MT) technology development.

Localizers session

This session was chaired by Felix Sasaki of DFKI.

As the anchor speaker for the Localizers session, Christian Lieske of SAP in Germany talked about "Best Practices and Standards for Improving Globalization-related Processes". Christian started with a sketch of enterprise-scale, globalization-related processes, and went on to touch on globalization best practices and some of the standards and best practices that have been developed by standards bodies. Useful standards include TermBase eXchange (TBX) (for terminology assets), Translation Memory eXchange (TMX) (for translation memory assets), XML Localization Interchange File Format (XLIFF) (for canonicalised content; a minimal example appears after the list below), and the Internationalization Tag Set (ITS) (for resource descriptions related to internationalization). Christian explained the benefits of some of these standards, but then went on to discuss opportunities for improvement, given that in reality things are not always simple.

  • Use a well-supported source format, such as XHTML, DocBook, Darwin Information Typing Architecture (DITA), Open Document Format (ODF), Office Open XML (OOXML), etc., since you cannot easily duplicate the effort that went into those standards and their reach.
  • Provide general annotations (e.g. batch) with standardized metadata.
  • Consult Internationalization Quick Tips for the Web and XML Internationalization Best Practices.
  • The 5M Safety System model includes: Meetings, Models, Metas, Modules, Mashes
  • We should be prepared to mix and match standards to create a real world solution.
  • Metadata can help link concepts between different representations.
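
To make the XLIFF reference above concrete, here is a minimal, hypothetical XLIFF 1.2 document together with a few lines of standard-library Python that read its translation units. Real localization tools do far more, but the exchange format is this simple at its core.

    import xml.etree.ElementTree as ET

    NS = "urn:oasis:names:tc:xliff:document:1.2"

    xliff = f"""<xliff version="1.2" xmlns="{NS}">
      <file original="welcome.html" source-language="en" target-language="de" datatype="html">
        <body>
          <trans-unit id="1">
            <source>Welcome to the workshop.</source>
            <target>Willkommen zum Workshop.</target>
          </trans-unit>
        </body>
      </file>
    </xliff>"""

    root = ET.fromstring(xliff)
    for unit in root.iter(f"{{{NS}}}trans-unit"):
        source = unit.find(f"{{{NS}}}source").text
        target = unit.find(f"{{{NS}}}target").text
        print(unit.get("id"), source, "->", target)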

The next speaker was Josef van Genabith, from the Centre for Next Generation Localisation (CNGL), talking about "Next Generation Localisation". After reviewing what localization is, Josef described three global 'mega-trends' in localization: volume (moving from corporate to social media content), access (moving from desktop to mobile devices), and personalization (moving from broad regional targets to individual user preferences). He then explored a few alternate views on how to address the challenge of coping with the demands of these trends taken together.

  • Current localization is focused on high volume, desktop access and broad regional targets. To expand coverage in each dimension we will need more automatic processes, without human intervention.
  • For every 10,000 calls to Customer Service, an additional 100,000 customers self-serve on the corporate web site, PLUS an additional 300,000 customers self-serve within external customer forums.
  • A model architecture to manage the complexity has to take into account component frameworks, data/software, reusability, standards, plug & play, and flexible adaptive workflows connecting components, on demand and on the fly.
  • Although this needs massive concerted effort and much basic research, this is not just work for big players. SMEs have flexibility that empowers them to also participate.

Daniel Grasmick of Lucy Software spoke about "Applying Standards - Benefits and Drawbacks". Daniel briefly reviewed the history of standards such as TMX, TBX and XLIFF. He then asked whether standards are always the best approach, and argued that they may be overkill for some projects. For other projects, where you have to exchange large volumes of content and data is accessed by many, a standard format like XLIFF is absolutely appropriate.

  • People often criticize standards such as TMX and TBX, but the important fact is that not a lot of people volunteer to help fix things.

Marko Grobelnik of the Jozef Stefan Institute spoke about "Cross-lingual Document Similarity for Wikipedia Languages". Marko explained that cross-lingual information retrieval can be seen in a scenario where a user initiates a search in one language but expects results in more than one language. There are many areas involved in text-related search and each represents data about text in a slightly different way. Marko focused in on the correlated vector space model and described how the system they are currently building works using Wikipedia correlated texts.

  • We want to represent text in a language-neutral form across around 200 languages using statistical techniques. That would give us a 200x200 matrix of correspondences for comparing document content (a toy similarity sketch follows this list).
  • The system should be available next year with an open licence and pre-trained statistical models.
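
The sketch below is a toy illustration only, not the speakers' system: it assumes documents in different languages have already been projected into one shared, language-neutral vector space (the hard part, which their statistical models address), and then compares them with cosine similarity. All vectors and document names are invented.

    import numpy as np

    documents = {
        "en: article on climate":   np.array([0.90, 0.10, 0.30]),
        "es: artículo sobre clima": np.array([0.85, 0.15, 0.35]),
        "en: article on football":  np.array([0.10, 0.90, 0.20]),
    }

    def cosine(a, b):
        """Cosine similarity between two vectors in the shared space."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = documents["en: article on climate"]
    for name, vector in documents.items():
        print(f"{name}: {cosine(query, vector):.3f}")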

The session ended with Q&A on various topics. These included why SAP doesn't use standards like XLIFF, who are the best players for developing standards, why there are various different browsers, thoughts about standardizing non-translate flags, and more. For details, see the related links.

Machines session

The Machines Session started with an introduction by session chair, Dan Tufis, pointing out that machines are essential in the development of a true multilingual World Wide Web because effective machine-to-machine interoperability is key to an efficient multilingual web user experience.

In the first presentation, "Language Resources, Language Technology, Text Mining, the Semantic Web: How interoperability of machines can help humans in the multilingual web", Felix Sasaki of the DFKI in Germany talked about Language Technologies (LT) and in particular the interoperability of technologies. He introduced applications concerning summarization, Machine Translation (MT) and text mining, and showed what is needed in terms of resources. For this he identified different types of language resources, and distinguished between linguistic approaches and statistical approaches. Machines need three types of data: input, resources and workflow, and in this data scenario there are currently three types of gaps: metadata, process and purpose. These gaps were exemplified with an MT application. The purpose gap specifically concerns the identification of metadata, process flows and the employed resources. Any identification must be facilitated across applications with a common understanding, and therefore different communities have to join in and share the information that has to be employed in the descriptive part of the identification task. One solution that can provide a machine-readable information foundation is the semantic technologies of the Semantic Web (SW). A shallower approach than the fully-fledged SW stack is available for web pages through microformats or RDFa. A few examples gave some insight into how the SW actually contributes to closing these gaps. The point to remember is that RDF is a means to provide a common information space for machines. The talk closed with some ideas on joint projects (community effort), and specifically on how META-NET is already working in that direction. A short discussion pointed out the currently available language description frameworks and the overall complexity of RDF for browser developers.

  • Machines are not doing a good job of using metadata available in the input to assist multilingual content processing.
  • Machines don't know about the workflow/processes that input data has been through.
  • Machines don't make explicit 'who' they are and what resources they are using. This would make it easier to combine tools.
  • These gaps can only be filled by a huge alliance of people who are producing content.
  • All these groups need to agree upon one machine-readable information space for filling the gaps, and that space is already there: the Semantic Web (a tiny RDF example follows this list).
  • We aim to propose a project to work on these issues. The place to do that work is the W3C, since it needs to be a part of the core Web technologies.
  • The META-NET project is a large project that is keen to build bridges across communities and people working on language technologies, including small companies.
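
A tiny example of what "one machine-readable information space" can look like in practice, using the Python rdflib library (an assumption for illustration): one concept carries labels in several languages as language-tagged RDF literals, which any Semantic Web tool can consume.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import RDFS

    g = Graph()
    concept = URIRef("http://example.org/id/multilingual-web")   # hypothetical identifier

    # Language-tagged literals: one resource, labels in three languages.
    g.add((concept, RDFS.label, Literal("Multilingual Web", lang="en")))
    g.add((concept, RDFS.label, Literal("Web multilingüe", lang="es")))
    g.add((concept, RDFS.label, Literal("Mehrsprachiges Web", lang="de")))

    print(g.serialize(format="turtle"))   # rdflib 6+ returns a string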

In the next presentation Nicoletta Calzolari of ILC-CNR in Pisa, Italy extended the notion of language resources (LR) to also include tools. She pointed out that a new paradigm is needed in the LR world to accommodate the continuous evolution of the multilingual Web, including the satisfaction of new demands and requirements that account for these dynamic needs. To this end, a web of LRs is now being built, driven by standardization efforts. To enable further evolution, distributed services are also needed, including effective collaboration between the different communities. In this context, a very important and sensitive issue concerns politics and policies to support these changes. Today, several EU-funded projects have taken up this new R&D direction, and national initiatives are joining in to build stronger communities that are critical to ensure truly web-based access to LRs, together with a common global vision and cooperation. Examples are projects such as CLARIN, FLaReNet and META-NET, which jointly have on their roadmap that interoperability between resources and tools is key to overall success, as is more openness through sharing efforts. The short follow-up discussion focused on a scenario with many infrastructures, and on the question of interoperability today, identifying the EU project META-NET as a key player in solving these issues.

  • We need a new paradigm for R&D in language resources and language technologies, which will be based on open and distributed linguistic infrastructures, and enables the controlled and effective cooperation of many groups on common tasks. The value of our resources increases as we share them.
  • We need an infrastructure that takes us from 'Language Resources' to 'Language Services'.
  • The thing that allows us to link together old and dispersed investments in the lexical web is standards. But we need to develop standards for content interoperability.
  • Results from the Vienna and Barcelona FLaReNet forums indicated that standards, interoperability and metadata are important topics to be approached in cooperation.
  • META-SHARE: an open, integrated, secured, and interoperable exchange facility for language data and tools for the HLT (Human Language Technologies) domain and other application domains (e.g. digital libraries, cognitive systems, robotics, etc.). It is a big step forward and a paradigm shift.

The third speaker, Thierry Declerk of DFKI, presented the "LEMON" project, which is researching and developing an ontology-lexicon model for LRs. LEMON is part of the EU-funded project Monnet and collaborates with standardization bodies. The project's R&D contributes to the multilingual Web by providing linguistically enriched data and information. The industrial use case of Monnet is the financial world, in particular the field of reporting. The standards used on the industrial side are XBRL, IFRS and GAAP, and the encodings in these standards are related to Semantic Web standards to build a bridge between financial data and linguistic data. The overall approach was exemplified by an online term translation application. The talk closed with an architectural overview of the Monnet components and the standards used on the language side, which include LMF, LIR and SKOS. The project envisages establishing a strong link with the META-NET project very soon.

  • Monnet seeks to ensure multilingual access to structured/networked knowledge - ontologies, knowledge bases, linked data - by abstracting from the language-specific content to a semantic level.
  • Monnet is based on standards: LMF, LexInfo, LIR, SKOS, OWL, RDF, IsoCat.
  • lemon can contribute to the W3C on how to represent lexicalized ontologies and to put them into relation with other semantic resources on the web.
  • lemon can contribute to ISO on guidelines for linguistically driven lexicalization of ontologies: adjusting the role played by LMF and other ISO standards on language resource management.
  • Future work will be dedicated to ontology-based, crosslingual information access and presentation.

The last talk of this morning session was given by Jose Gonzales of UPM and the university spin-off company DAEDALUS in Madrid, Spain. The company's projects share similar subject areas with the LEMON project and the other presented approaches, but with a strong market-oriented view. The LRs employed by DAEDALUS date back to the 1990s, when no resources for Spanish were available; the initial focus was therefore on spell checking, which was needed mainly by the media market. These developments had an important influence on all future developments such as search and retrieval, and even ontology development. The initial work on LRs was followed by multilingual developments based on continuous experience in the field, exemplified by a multilingual information extraction application and by an example that integrated speech (DALI) into the application scenario. Current applications include sentiment analysis and opinion mining (also in Japanese), and an EC-funded project (Flavius) which takes into account user-generated content and machine translation. After some online examples, the talk closed with an outlook on linking ontologies with lexicon tools.

  • The Flavius project is working on converting user-developed content, with its grammatical errors, shortcuts, etc., into a more regularized syntax that can then be used by machine translation tools.

In the first presentation of the afternoon track of the machines session, Jörg Schütz of bioloom group, Germany, introduced yet another application field for multilingual web activities which is quite different from the previous scenarios: business and process intelligence. He pointed out that in response to our global economies, today's business and process intelligence processes and tools have to deal with input that stems from many different languages and cultures without having a clear distinction between what has been the original source language and what should be the eventual target language of the output - output that triggers possible decision making and optimization processes based on data mining, combining and analytics findings. To bridge the gap between existing language technologies and process intelligence operations the semantic web provides the means to design, model and set into operation appropriate mechanisms. However, although there are existing standards on both sides for data representation, modeling of rules and terminology, querying and inferencing, and fact extraction which can be seamlessly combined through semantic web technologies, the communities of these ecosystems so far have failed to talk to each other, due to trust in their own standards, fear over increased complexity, lack of reference implementations, being outpaced by technology, lack of exchange between solution providers, and uncertain involvement of buyers/customers. The talk concluded with what is currently missing to lead to a closer collaboration between the actors of these ecosystems, and to a better interoperability between their data, tools and processes.

  • We need better exchanges between and mutual understanding of the communities and the actors of the two ecosystems.
  • There needs to be a common mindset for change.
  • There is a need for joint reference implementations with a strong commitment to interoperability.
  • There is a need for real self-adapting and self-learning tools and technologies.

Piek Vossen of VU University Amsterdam gave a talk entitled "KYOTO: a platform for anchoring textual meaning across languages". The talk described a very open generic platform with which you can mine any kind of relations from text, but which can be tuned or customized to the kind of information you are interested in. Piek began his talk with the provocative question "Why translate text if you can mine text and represent the knowledge and information in a language neutral form?" He then described the evolution of the web into the future, where knowledge would be understood by machines, and what was needed to get there. The remainder of his talk described how the Kyoto project was attempting to convert the natural language of the web so far into knowledge representations that can be used in the future web.

  • We need an interoperable representation of the structure of language and an interoperable representation of formal conceptual knowledge in order to connect the different versions of the web. We then need methods to map natural language on the older web into these interoperable representations.
  • The moment we have RDF it means that machines can interpret this and reason over the information found in the text and do useful things.
  • We shouldn't just focus on machine translation, but on cross-language text mining based on uniform representations of the text. This requires anchoring vocabularies of all languages to a common conceptual backbone.
  • This process generates a huge amount of data, and we need to get to grips with how to represent that in a usable way. We also need to work on rendering that information back into natural language for human consumption.

As a stand-in for a talk that had to be canceled, Christian Lieske of SAP in Germany took the stage a second time. This time (drawing on older material developed with Felix Sasaki and Richard Ishida), he presented the "W3C Internationalization Tag Set (ITS)". ITS is a W3C Recommendation that helps to internationalize XML-based content. Content that has been internationalized with ITS can more easily be processed by humans and machines. ITS also plays an important role in the W3C Best Practice Note "Best Practices for XML Internationalization". Christian explained that seven so-called data categories are the heart of ITS. They cover topics such as a marker indicating that a range of content must not be translated. Thus, ITS helps humans and machines, since ITS information can, for example, help to configure a spell checker or to communicate with a translator. ITS data categories are valuable in themselves – you do not need to work with the ITS namespace. They are therefore also useful for RDF or for other non-XML data. Although ITS is a relatively new standard, Christian was able to point to existing implementations (e.g. the Okapi framework) that support ITS-based processing. In addition, he sketched initial scenarios and visions for a possible pivotal role of ITS in the creation of multilingual, Web-based resources: clients (such as Web browsers) that interpret ITS and thus can feed more adequate content to machine translation systems.

  • The W3C Internationalization Tag Set (ITS) is a W3C Recommendation that helps to internationalize XML-based content.
  • Among other things, ITS proposes a standard way to mark up content that should not be translated (illustrated in the sketch after this list).
  • Content that has been internationalized with ITS can more easily be processed by humans and machines.
  • ITS has an important partner document in the W3C Note "Best Practices for XML Internationalization".
  • ITS data categories are valuable in themselves – you do not need to work with the ITS namespace. They are therefore useful also for RDF or for other non-XML data. You can also embed ITS information in a document, if the schema supports it, or use it in a separate reference document.
  • A user agent could use ITS rules for converting content into XLIFF.
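
A minimal sketch of the local ITS translate marker in action, using only the Python standard library. The document is hypothetical and, for brevity, the code ignores ITS inheritance and global rules.

    import xml.etree.ElementTree as ET

    ITS = "http://www.w3.org/2005/11/its"   # ITS 1.0 namespace

    doc = f"""<doc xmlns:its="{ITS}">
      <p>Press the <cmd its:translate="no">--verbose</cmd> switch for details.</p>
      <p its:translate="no">Do-not-translate legal boilerplate.</p>
    </doc>"""

    root = ET.fromstring(doc)
    for elem in root.iter():
        text = (elem.text or "").strip()
        if not text:
            continue
        flag = elem.get(f"{{{ITS}}}translate", "yes")   # simplified: no inheritance, no global rules
        print(f"translate={flag}: {text}")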

In the short Q&A discussion round before the lunch break, the question was raised of how all these "islands of LRs" could be made available to the public. The discussion concluded with the observation that some ISO initiatives are already under way to support this opening-up, also in terms of structuring the resources, and that the actual representation format is still an open issue.

Users session

This session was chaired by Chiara Pacella of Facebook.

Ghassan Haddad from Facebook presented the anchor talk for the Users session, entitled "Facebook Translation Technology and the Social Web". After a short movie showing the volume of interactions on Facebook, there was a brief introduction to the history of Facebook and the challenge they set themselves: to achieve translations into (currently) 77 languages without slowing development. This was done through crowd-sourcing. Ghassan demonstrated approaches Facebook uses to help people adopt new translations as they appear, and to deal with the complexities of interactive and dynamic messages. Ghassan then talked about the process of translation using the community in the crowd-sourcing approach, and how app developers can set up content for translation. About 50 languages are completely supported by the community - the other 27 languages use professional support.

  • In 2007, when Facebook was available in 1 language, it had 50 million users, and 7% of (new) users were international (outside the US). In 2010, with 77 languages, it has 500 million users, and 75% of (new) users are international.
  • Choosing your language has to be very easy, since Facebook caters for people of all types. One approach was to pick 10 languages to add to the registration page (with a link to others) based on the languages spoken in your region. Within a week, this led to a more than 300% increase in the number of people selecting a language other than US English.
  • When a new language translation is released, people in relevant regions receive notifications that strongly encourage them to change to the new language. In Arabic, this led to a 500% increase in language adoption overnight.
  • Applications need to figure out how to handle things such as gender and number in interactive and dynamic messages. For Russian, Arabic and Hebrew, for example, over 50% of all strings require some adaptation depending on context. Facebook developed dynamic string explosion rules to enable this (see the illustrative sketch after this list).
  • Facebook believes it is important to make translations and tools available to the public.
  • In the crowd-sourcing context, quality is no longer the opinion of a regional marketing manager. Translated text receives reviews from between 50 to 5000 users.
  • The fb:intl markup Facebook uses allows you to add descriptions to text to aid translators.
  • There are sometimes translations that are rated highly by users but which Facebook changes because they don't believe it is a good translation.
  • Motivational leader boards are useful to reward people for their work.
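
The gender-and-number point can be illustrated with a small sketch (not Facebook's actual mechanism, and English-only for clarity): one logical message "explodes" into several variants, and the right one is chosen from the actor's gender and the item count. CLDR defines the real plural categories, which for languages like Arabic or Russian go well beyond "one" and "other".

    # One logical message, expanded into variants by gender and plural category.
    MESSAGES = {
        ("female", "one"):   "{name} added a new photo to her album.",
        ("female", "other"): "{name} added {count} new photos to her album.",
        ("male", "one"):     "{name} added a new photo to his album.",
        ("male", "other"):   "{name} added {count} new photos to his album.",
    }

    def plural_category(count):
        # Simplified English rule; CLDR defines per-language rules with more categories.
        return "one" if count == 1 else "other"

    def render(name, gender, count):
        template = MESSAGES[(gender, plural_category(count))]
        return template.format(name=name, count=count)

    print(render("Chiara", "female", 3))
    print(render("Ghassan", "male", 1))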

Denis Gikunda from Google presented about "Google's community Translation in Sub Saharan Africa". Sub-Saharan Africa has around 14% of the world's population yet only 5% of the world's internet users (though this is growing). Low representation of African languages and relevant content remains among the biggest barriers to closing this gap. Google's strategy for Africa is to get people online by developing an accessible, relevant and sustainable internet ecosystem. The talk shared insights from two recent community translation initiatives aimed at increasing African language content: the Community Translation program and the Kiswahili Wikipedia Challenge.

  • Oral literature, indigenous knowledge, cultural novelty and creativity remain unamplified and are lost over generations.
  • Price is a barrier, in particular the cost of getting bandwidth to the continent in the first place.
  • Africa needs initiatives to create content in order to drive more content creation.
  • Wikipedia is a source of locally relevant information, so it can seed translation, and that can be used to seed more content.
  • There is a lack of data to start from, which is needed to bootstrap machine translation.

Loïc Martínez Normand, representing the Sidar Foundation, presented about "Localization and web accessibility". Loïc described how he and Emmanuelle Gutiérrez y Restrepo have interfaced with many standards, especially those of the W3C, discussed what 'accessibility' means, and introduced the Sidar Foundation. The main content of the presentation explored the W3C Web Content Accessibility Guidelines (WCAG) as they relate to international users. At the end, the presentation compared the Quick Tips from the W3C Accessibility and Internationalization activities.

  • Accessibility in the Web context is not only about people with disabilities. The key principle of accessibility is designing web sites and software that are flexible enough to meet different user needs, preferences, and situations.
  • Non-text content is increasingly important on the Web today and accessibility to it needs to be considered when addressing the multilingual Web.
  • Some things such as providing text alternatives don't appear to be in the i18n vocabulary, but we think it is important to take these things into account.
  • Accessibility should be a right, but in reality we need business cases and a lot of work to convince people to make the Web accessible.

Swaran Lata, of the Department of Information Technology, Government of India, and also W3C India Office Manager, closed off the Users session with a talk entitled "Challenges for Multilingual Web in India : Technology development and Standardization perspective". After setting the scene with an overview of the challenges in India and the complexity of Indian scripts, Swaran Lata talked about various technical challenges they are facing, and some key initiatives aimed at addressing those challenges. For example, there are e-government initiatives in the local languages of the various states, and a national ID project, that brings together multilingual databases, on-line services and web interfaces. She then mentioned various standardization related challenges, and initiatives that are in place to address those.

  • Internet usage in India is increasing, but mainly in English.
  • According to the 2001 Census, India has 122 major languages and 2371 dialects. Of the 122 languages, 22 are constitutionally recognized. There are also many different writing systems. Traveling through India is like traveling through Europe.
  • The government works with businesses in consortia to develop linguistic solutions for a given region.
  • TDIL's work on machine translation has been advancing, and shortly they expect to make this available, like Google translate. They are trying to link all other Indian languages with English and Hindi.
  • Optical character recognition technology is being used to convert legacy data into Unicode.
  • Unicode was declared as a Standard for Data Storage for Web Based E-Governance Services in India.
  • A report is being prepared about issues encountered when rendering Indic scripts in browsers, including such things as drop capitals, underlining, list numbering, character indentation, etc. In particular, most browsers are unaware of syllable boundaries for Indic scripts, although some issues also come from the operating system.
  • In addressing speech they found that the W3C Pronunciation Lexicon Specification (PLS 1.0) needs part-of-speech tags to disambiguate Indic language content.
  • We would like to work with people here on joint projects that include Indian needs.
  • A lot of standardization work still remains to be done for many areas touching on mobile Web support, from character encodings and font support, to keyboard and transliteration.

The Q&A began with opinions that it is difficult to get work done through volunteers - the do-good approach wears off after a while. This was followed by some discussion of sign languages, and then minority languages. The final question was about standardization of identifiers for African languages. For details, see the links below.

Author: Richard Ishida. Thanks to Jirka Kosek, Jörg Schütz and Felix Sasaki for help in producing this report, and to the scribes for the workshop sessions: Jirka Kosek, Felix Sasaki, Eliott Nedas, Jörg Schütz, and Charles McCathieNevile. Photos courtesy of Luis Bellido, Richard Ishida, and Matti Pölla.