Standards and best practices for the Multilingual Web

World Wide Web Consortium (W3C) · Consiglio Nazionale delle Ricerche · European Commission

W3C Workshop Report:
Content on the Multilingual Web
4-5 April 2011, Pisa, Italy

Today, the World Wide Web is fundamental to communication in all walks of life. As the share of English web pages decreases and that of other languages increases, it is vitally important to ensure the multilingual success of the World Wide Web.


The MultilingualWeb project is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. The project aims to raise the visibility of existing best practices and standards and identify gaps. The core vehicle for this is a series of four events which are planned over a two year period.

On 4-5 April 2011 the W3C ran the second workshop in the series, in Pisa, entitled "Content on the Multilingual Web". The Pisa workshop was hosted jointly by the Istituto di Informatica e Telematica and Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche.

As for the previous workshop, the aim of this workshop was to survey, and introduce people to, currently available best practices and standards that are aimed at helping content creators, localizers, tools developers, and others meet the challenges of the multilingual Web. The key objective was to share information about existing initiatives and begin to identify gaps.

The workshop was originally planned as a small discussion-based event for around 40 people, but after the format of the Madrid workshop proved such a success, it was decided to run a similar type of event with a similar number of people. The final attendance count was 95, and in addition to repeating the wide range of sessions of the Madrid workshop we added a new Policy session.

Another innovation of this event was that we not only video recorded the presenters, but streamed that content live over the Web. A number of people who were unable to attend the workshop, including someone as far away as New Zealand, followed the streamed video. We also made available live IRC minuting, and some people used that to follow the conference and contribute to discussion. As in Madrid, there were numerous people tweeting about the conference and the speakers during the event, and a number of people afterwards wrote blog posts about their experience. The tweets and blog posts are linked to from the Social Media Links page for the workshop.

The program and attendees reflected the same unusually wide range of topics as in Madrid, and attendee feedback indicated, once again, that the participants appreciated not only the unusual breadth of insights, but also the interesting and useful networking opportunities. We had good representation from industry (content and localization related) as well as research.

What follows describes the topics introduced by the speakers, followed by a selection of key messages from each talk in bulleted-list form. Links are also provided to the IRC transcript (taken by scribes during the meeting), to video recordings of the talks (where available), and to the talk slides. Most talks lasted 15 minutes, though some sessions started with a half-hour 'anchor' slot.

Contents: Summary · Welcome · Developers · Creators · Localizers · Machines · Users · Policy


Workshop sponsors

IIT · W3C Italy


As in the previous workshop, a wide range of topics was covered, and several themes spanned more than one session. What follows is an analysis and synthesis of ideas brought out during the workshop. It is very high level, and you should watch the individual speakers' talks to get a better understanding of the points made. Alongside some of the points below you will find examples of speakers who mentioned a particular point (this is not an exhaustive list).

During the workshop we heard about work on a number of new or under-used technologies that should have an impact on the development of the multilingual Web as we go forward. These included content negotiation, XForms, Widgets, HTML5, and IDN (Pemberton, Caceres, Ishida, Bittersmann, Laforenza). This is still work in progress, and the community needs to participate in the ongoing discussions to ensure that these developments meet its needs and come to fruition. The time to participate is now.

We also heard that the MultilingualWeb project inspired a new widget extension for Opera to help users choose a language variant of a page (McCathieNevile).

We were given an overview of the Internationalisation Tag Set (ITS) and how it is implemented in various formats (Kosek). A key obstacle to its use, however, is the inability to work with ITS customizations in various authoring tools. Several speakers voiced a desire for better training of authoring tool implementers in internationalisation needs, in order to make it easier for content authors to produce well internationalised content (Leidner, Pastore, Serván).

There was also a call for universities to add training in internationalisation to their curricula for software engineers and developers in general (Leidner, Pastore, Nedas, Serván). More best-practice guides should also be produced, complemented by more automation of support tools for authors (Pastore, Schmidtke, Carrasco, Serván).

In the Localizers session, several speakers stressed the need for and benefits of more work on standards as a means to enable interoperability of data and tools (Lieske). Lack of interoperability is seen as an important failing in the industry by many speakers (van der Meer, etc.). Existing standards need to be improved upon with more granular and flexible approaches, and with a view to standardising more than just the import and export of files. It was proposed, however, that standards development should be sped up and use a more 'agile' approach (Andrä): proving viability with implementation experience, and quickly discarding things that don't work (an idea revisited later in the workshop). Standards should not impede innovation (van der Meer, Andrä).

There was a pronouncement that TMX is dead (Filip), but that was modified slightly afterwards by several people who felt that the size of its legacy base would keep it around for several more years, just like CD-ROMs (Herranz). There were a lot of hopes and expectations surrounding the upcoming version of XLIFF (Filip, van der Meer).

There was also a call for more work on the elaboration and use of metadata to support localization processes, building on the foundation provided by ITS (Leidner, Filip) but also using Semantic Web technologies such as RDF (Lewis). One particular suggestion was the introduction of a way to distinguish the original content from its translations. This was also picked up in a discussion session.

We saw how crowdsourcing was implemented at Opera, and some of the lessons learned (Nes). Crowdsourcing reappeared several times during the workshop, in speakers' talks but also in the discussion sessions. The industry is still trying to understand how and where it is best applied and most useful. Facebook shared with us how their system works (Pacella).

The Social Web is leading to an explosion of content that is nowadays directly relevant to corporate and organizational strategies. This is leading to a change, where immediacy trumps quality in many situations (Shannon, Truscott). This and other factors are placing increased emphasis on automated approaches to handling data and producing multilingual solutions (Herranz, Lewis, Truscott), but in order to cope with this there is a need for increased interoperability between initiatives via standard approaches.

While many speakers are looking to improvements in language automation, there appears to be a strong expectation that machine translation can now, or soon will, provide usable results (Schmidtke, Grunwald, Herranz, Vasiljevs), whether for speeding up translation (using post-editors rather than full translation), providing gist translations for social media content, or extracting data to feed language technology. We heard about various projects that aim to produce data to support machine translation development, specialising in the 'Hidden Web' (Pajntar) or smaller languages (Vasiljevs), and the META-SHARE project, which aims to assist in sharing data with those who need it (Piperidis). One thing that such tools need to address is how to deal with comparable texts (i.e. texts that are not completely parallel, since one page has slightly different content than another) (de Rijke).

On the other hand, machine translation is unlikely to translate poetry well any time soon. We saw a demonstration of a tool that helps human translators align data within poems (Brelstaff). This project benefited greatly from the use of standardised, open technologies, although there were still some issues with browser support in some cases.

Changes in the way content is generated and technology developments are also expected to shift emphasis further onto the long tail of translation (Lewis, Lucardi). There is also a shift to greater adaptation of content and personalisation of content for local users. Speakers described their experiences in assuring a web presence that addresses local relevance (Schmidtke, Hurst, Truscott). The ability to componentise content is a key enabler for this, as is some means of helping the user find relevant content, such as geolocation or content negotiation.

Inconsistencies in the user interfaces for users wanting to switch between localized variants of sites need investigation and standardisation. In some cases this is down to differences in browser support (Bittersmann, Carrasco).

We also saw how one project used mood related information in social media in various ways to track events or interests (de Rijke), and received advice on how to do search engine optimisation in a world that includes the social Web (Lucardi). Following W3C best practices was cited as important for the latter. And Facebook described how they manage controlled and uncontrolled text when localizing composite messages for languages that modify words in substantially different ways in different contexts according to declension, gender and number (Pacella).

In the Policy session (an addition since the Madrid conference) we heard how the industry is at the beginning of radical change, such as it hasn't seen for 25 years (van der Meer), and interoperability and standards will be key to moving forward into the new era.

The next workshop will take place in Limerick, on 21-22 September, 2011.

Welcome session


Domenico Laforenza, Director of the Institute for Informatics and Telematics (IIT), Italian National Research Council (CNR), opened the workshop with a welcome and a talk about "The Italian approach to Internationalised Domain Names (IDNs)". Basically, this is a system through which you can use URLs on the Internet in, for example, Danish or Chinese, using accented letters or non-Latin characters. Until recently, the choice of domain names was limited to the twenty-six Latin characters used in English (in addition to the ten digits and the hyphen "-"). IDN, introduced by ICANN (the Internet Corporation for Assigned Names and Numbers), represents a breakthrough for hundreds of millions of Internet users around the world who until now were forced to use an alphabet that was not their own. With regard to Italy, the impact of accents will certainly be less marked, but it will give everyone the opportunity to register domains which completely match the name of the person, company or brand name chosen. Domenico described the Italian registry and the basic concepts of how IDNs work.
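Under the hood, an IDN is mapped to an ASCII-compatible "Punycode" form (prefixed xn--) before it enters the DNS. Python's built-in idna codec illustrates the round trip; this is a minimal sketch of the encoding mechanism only, since real registration also involves registry policy rules:

```python
# Convert an internationalised domain label to its ASCII-compatible
# encoding (Punycode) and back, using Python's built-in idna codec.
label = "münchen"

ascii_form = label.encode("idna")       # what is actually stored in the DNS
print(ascii_form)                       # b'xn--mnchen-3ya'

round_trip = ascii_form.decode("idna")  # what the browser shows the user
print(round_trip)                       # münchen
```

The xn-- prefix is how resolvers recognise that a label carries an encoded Unicode name rather than plain ASCII.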

Following this talk, Richard Ishida gave a brief overview of the MultilingualWeb project, and introduced the format of the workshop.

Oreste Signore, employee of CNR and Head of the W3C Italian Office, also welcomed delegates with a talk entitled "Is the Web really a "Web for All"?". This talk was a brief reminder of the basic issues of the Web: multicultural, multilingual, for all. It also took a look into the relevant W3C activities to pursue the ultimate goal of One Web, which include accessibility as well as multilinguality.

Kimmo Rossi, Project Officer for the MultilingualWeb project, and working at the European Commission, DG for Information Society and Media, Digital Content Directorate, praised the enthusiasm and voluntary contributions of the project partners. Kimmo has found this to be a wonderful forum for networking and finding out about the various aspects of the multilingual Web. Now we need to start putting these ideas into practice, so he is looking for good recommendations for industry and stakeholders about what needs to be done, and preferably who could do it. Kimmo described some key findings of a EuroBarometer survey that is soon to be published: about 90% of those interviewed prefer to use their own language for non-passive use, and around 45% believe that they are missing out on what the Web has to offer due to lack of content in their language.

The keynote speaker was Ralf Steinberger, of the European Commission's Joint Research Centre JRC. In his talk he said that there is ample evidence that information published in the media in different countries is largely complementary and that only the biggest stories are discussed internationally. This applies to facts (e.g. on disease outbreaks or violent events) and to opinions (e.g. the same subject may be discussed with very different emotions across countries), but there is also a more subtle bias of the media: national media prefer to talk about local issues and about the actions of their politicians, giving their readers an inflated impression of the importance of their own country. Monitoring the media from many countries and aggregating the information found there would allow readers a less biased and more equilibrated view, but how to achieve this aggregation? The talk gave evidence of such information complementarity from the Europe Media Monitor family of applications, and showed first steps towards the aggregation of information from highly multilingual news collections.

Developers session

The developers Session was chaired by Adriane Rinsche (LTC).


Steven Pemberton of CWI/W3C gave the anchor talk for the Developers session, "Multilingual forms and applications". After an introduction to content management and to XForms, the talk described the use of XForms to simplify the administration of multilingual forms and applications. A number of approaches are possible, using generic features of XForms, that allow there to be one form, with all the text centralised, separate from the application itself. This can be compared to how style sheets allow styling to be centralised away from a page, and allow one page to have several stylings; the XForms techniques can provide a sort of Language-Sheet facility. Key points:

  • HTTP allows for content negotiation, in order to return the documents that you prefer if there are multiple alternatives on the server.
  • Some servers or proxies send you something other than a 404 message when nothing matches your request. This is bad for link checking.
  • Unless you are using Chrome or have set Google-specific preferences, a Google page uses your IP address to determine the language, and can often guess incorrectly. They ignore the browser settings sent in HTTP.
  • Sites that allow you to switch between languages often mix labels for regions with those for languages.
  • XForms is a W3C technology originally designed to make production of forms for the Web easier, but has evolved into an applications language in general.
  • XForms splits out the data from the form itself, and requires an order of magnitude less work to create code than JavaScript.
  • There are many implementations of XForms, including native implementations in Mozilla and OpenOffice, plugins such as FormsPlayer, and 'zero install' and server-based implementations (which are good for mobile use).
  • XForms provides a way of implementing multilingual sites.
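The content-negotiation mechanism mentioned above can be sketched in a few lines: the browser sends an Accept-Language header with weighted preferences, and the server picks the best available variant. This is a minimal, illustrative implementation (the function names are my own, and real servers implement richer matching rules):

```python
def parse_accept_language(header):
    """Parse an HTTP Accept-Language header into (tag, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        q = 1.0                         # default quality value
        for p in parts[1:]:
            if p.strip().startswith("q="):
                q = float(p.strip()[2:])
        prefs.append((tag, q))
    return sorted(prefs, key=lambda pair: -pair[1])

def negotiate(header, available):
    """Return the best available language variant, or None if nothing matches."""
    available = [a.lower() for a in available]
    for tag, q in parse_accept_language(header):
        if q == 0:
            continue                    # q=0 means "not acceptable"
        if tag in available:
            return tag
        base = tag.split("-")[0]        # fall back from 'en-gb' to 'en'
        if base in available:
            return base
    return None

print(negotiate("da, en-gb;q=0.8, en;q=0.7", ["en", "it"]))  # en
```

When nothing matches, a well-behaved server should return a page offering the available variants rather than a bare 404, which is the link-checking problem noted above.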

Marcos Caceres, Platform Architect at Opera Software, prepared a talk entitled "Lessons from standardizing i18n aspects of packaged web applications". Since Marcos was unable to make it to the workshop, the talk was delivered by Charles McCathieNevile, Chief Standards Officer at Opera. The W3C's Widget specifications have seen a great deal of support and uptake within industry. Widget-based products are now numerous in the market and play a central role in delivering packaged web applications to consumers. Despite this, the W3C's Widget specifications, and their proponents, have faced significant challenges in both specifying and achieving adoption of i18n capabilities. This talk described how the W3C's Web Apps and i18n Working Groups collaborated to create an i18n model, the challenges they faced in the market and within the Consortium, and how some of those challenges were overcome. The talk also proposed some rethinking of best practices and relayed some hard lessons learned from the trenches. Key points:

  • Widgets are Web applications written in HTML and installed locally to your device, mobile, TV, server, etc. Opera extensions are widgets. The 7 specifications are in Last Call phase.
  • Widgets have been designed to enable easy localization.
  • The Widgets specification added aspects of the ITS (Internationalisation Tag Set) specification.
  • Make sure you introduce internationalisation into your development as early as possible, keep it simple, and test thoroughly.
  • The new SwapLang widget extension for Opera helps you change the language page by picking up alternatives from the markup of the page.

Richard Ishida, Internationalisation Activity Lead at the W3C, presented "HTML5 proposed markup changes related to internationalisation". HTML5 is proposing changes to the markup used for internationalisation of web pages. They include character encoding declarations, language declarations, ruby, and the new elements and attributes for bidirectional text support. HTML5 is still very much work in progress, and these topics are still under discussion. The talk aimed to spread awareness of proposed changes so that people can participate in the discussions. Key points:

  • UTF-8, together with its ASCII subset, is now used for over 50% of all web pages.
  • HTML5 provides an alternative way to declare the encoding of pages and applies some additional rules about how declarations work, particularly for pages designed to work as both HTML and XHTML. Some of these changes have been decided extremely recently, and you can still voice your opinion if you have other ideas.
  • The HTML Working Group has just decided that the meta element with http-equiv set to Content-Language will be non-conforming in HTML5. This means that if you want to have an in-document way to indicate metadata about the language of the document as an object you will need to find an alternative approach.
  • Aspects of ruby markup defined in the Ruby Annotation specification do not currently appear in HTML5. There is still a need to clarify how this should play out.
  • A number of changes have been introduced into HTML5 (and CSS) to address problems with bidirectional text (such as that found in Arabic, Hebrew, Urdu, etc) which are not handled well currently. These changes are particularly important for content that is inserted into a page from an external source - be it a database or human input.
  • If you feel that HTML5 is not on the right track with regards to international features, you need to get involved to make things better!
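The encoding declarations mentioned above come in two in-document forms: the HTML5 short form and the older http-equiv pragma. The sketch below looks for either form in a page; it is a deliberately simplified regex illustration, not the full HTML5 encoding-detection algorithm (which also considers the HTTP header and a byte-order mark):

```python
import re

# HTML5 short form: <meta charset="utf-8">
META_CHARSET = re.compile(
    r'<meta\s+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

# Older pragma form: <meta http-equiv="Content-Type"
#                          content="text/html; charset=...">
META_PRAGMA = re.compile(
    r'<meta\s+http-equiv\s*=\s*["\']?content-type["\']?\s+'
    r'content\s*=\s*["\'][^"\']*charset=([\w-]+)', re.IGNORECASE)

def declared_encoding(html):
    """Return the declared encoding name, or None if the page has none."""
    m = META_CHARSET.search(html) or META_PRAGMA.search(html)
    return m.group(1).lower() if m else None

print(declared_encoding('<meta charset="UTF-8">'))  # utf-8
print(declared_encoding("<p>no declaration</p>"))   # None
```

A page that declares nothing falls back to browser defaults, which is exactly the fragile situation the HTML5 rules are meant to tighten up.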

Gunnar Bittersmann, a developer at VZ Netzwerke, talked about "Internationalisation (or the lack of it) in current browsers". The talk addressed two common i18n problems that users of current mainstream browsers face. Users should get content from multilingual Web sites automatically in a language they understand, hence they need a way to state their preferences. Some browsers give users this option, but others don't. Gunnar demonstrated live whether and how languages can be set in various browsers, and discussed the usability issue that browser vendors have to deal with: the trade-off between functionality and a simple user interface. Users should also be able to enter email addresses with international domain names into forms. That might not be possible in modern browsers that already support HTML5's new email input type. Gunnar showed how to validate email addresses without being too restrictive and raised the question: does the HTML5 specification have to be changed to reflect the users' needs? Key points:

  • HTML5 should not prevent users from typing email addresses in non-ASCII scripts. Currently we have to write our own pattern checker to support, say, Cyrillic addresses.
  • Some browsers currently don't allow you to easily set language preferences the way you want to.
  • Content negotiation should allow you to say that you would like to see a page in the original language for languages that you speak when the original is not the language you would generally prefer.
  • There could be standardisation around a vocabulary to distinguish original vs. translation, human vs. machine translations, or translations vs. adaptations.
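As a rough illustration of the "own pattern checker" point above: a hand-written validator can accept non-ASCII local parts and domains that an ASCII-only email pattern would reject. The pattern here is deliberately loose and purely illustrative, not a full address-syntax validator:

```python
import re

# A deliberately permissive pattern: one '@', a non-empty local part with no
# spaces or extra '@', and a dot-separated domain. Unicode letters (e.g.
# Cyrillic) pass, unlike an ASCII-only pattern.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_like_email(address):
    return bool(EMAIL_RE.match(address))

print(looks_like_email("иван@пример.рф"))   # True
print(looks_like_email("user@example.com")) # True
print(looks_like_email("not an email"))     # False
```

The design choice is to reject only what is structurally impossible and leave the rest to a confirmation step, rather than lock out whole scripts.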

Jochen Leidner, Senior Research Scientist with Thomson Reuters, gave a talk "What's Next in Multilinguality, Web News & Social Media Standardization?" The talk reviewed the state of the art in multilingual technology for the Web and its adoption by companies like Thomson Reuters. According to Jochen, the Web is no longer just a protocol (HTTP) and a mark-up language (XHTML); rather, it has become an ecosystem of different content mark-up standards, conventions, proprietary technologies, and multimedia (audio, video, 3D). The static Web page is no longer the sole inhabitant of that ecosystem: there are Web applications (from CGI to AJAX), Web services, and social media hubs with huge transaction volumes that exhibit some properties of IT systems and social fabric. In this talk, he discussed some of the challenges that this diversity implies for the technology stack, assessed the standardization situation, and speculated what the future may (and perhaps should?) bring. He concluded that for the most part, the internationalisation and localization technologies are working, and have been adopted in computer software, programming languages, and Web sites. Key points:

  • For the most part, the internationalisation and localization technologies are working, and have been adopted in computer software, programming languages, and Web sites. Where ten years ago, multilingual fonts were not widely available, internationalisation required third party libraries, and the Web was nearly exclusively in English, today Unicode 6.0 is widely supported, has been included in mainstream programming languages such as Java or C#, and multilingual pages can be rendered by free Web browsers.
  • On a critical note, software developers' skills are often lacking in the areas of internationalisation/localization. This can be attributed to the fact that university courses on these topics are not part of the curriculum, and may not even be offered, since the material is not based on any important theories, which limits the attractiveness of the subject matter to academic computer scientists.
  • Perhaps the only remaining item on the I18N/L10N wish list is a Web standard allowing publishers to classify the content of Web pages by topic.
  • A minor proposal for Web standardization was presented towards the end, namely a tag to interlink pages that are translations of each other, including information that indicates which of a set of translated pages is the "original", and whether a particular translated page was produced by humans or machine.
  • Leidner also warned that social media sites like Facebook and mobile application stores like the Apple AppStore represent "walled gardens" and as such endanger the open Web's ecosystem, where content can be freely cross-linked and indexed.
  • HTML5 can make mobile application development more affordable than development on multiple proprietary platforms and devices while contributing to the openness of the Web.
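Leidner's interlinking proposal was not (and is not) a standard. The sketch below imagines what such markup might look like, combining the real hreflang link relation with hypothetical data-* attributes for the original-vs-translation and human-vs-machine distinctions; everything beyond hreflang is an assumption for illustration only:

```python
# Sketch of the proposed metadata: interlink translated pages, flag the
# "original", and record whether a translation is human or machine.
# The data-* attribute names below are hypothetical, not standardised.
pages = [
    {"href": "/en/about", "lang": "en", "original": True,  "method": None},
    {"href": "/de/about", "lang": "de", "original": False, "method": "human"},
    {"href": "/ja/about", "lang": "ja", "original": False, "method": "machine"},
]

def translation_links(pages):
    """Render one <link> element per language variant of a page."""
    links = []
    for p in pages:
        attrs = 'rel="alternate" hreflang="%s" href="%s"' % (p["lang"], p["href"])
        if p["original"]:
            attrs += ' data-original="true"'                  # hypothetical
        elif p["method"]:
            attrs += ' data-translation="%s"' % p["method"]   # hypothetical
        links.append("<link %s>" % attrs)
    return links

for link in translation_links(pages):
    print(link)
```

A search engine or language-negotiating browser consuming such links could then prefer the original, or warn the user that a variant is machine-translated.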

The Developers session on the first day ended with a Q&A question about language negotiation, and a suggestion that it should be possible to identify the original language in which a page was written.

Creators session

This session was chaired by Felix Sasaki of DFKI.


Dag Schmidtke, Senior International Project Engineer at the Microsoft European Development Centre, gave the anchor talk for the Creators session with "Office.com 2010: Re-engineering for Global reach and local touch". Office.com is one of the largest multilingual content-driven web sites in the world. With more than 1 billion visits per year, it reaches 40 languages. For the Office 2010 release, authoring and publishing for Office.com was changed to make use of Microsoft Word and SharePoint. A large migration effort was undertaken to move 5 million+ assets for 40 markets to new file formats and management systems. This talk presented lessons learnt from this major re-engineering exercise for designing and managing multilingual web sites. Key points:

  • 'Global reach' is about the size of the translation effort and how many people we can reach, 'Local touch' is about being relevant in the market.
  • The international team was a key stakeholder in the platform redesign and migration.
  • Limit the number of moving parts, don't change everything at once.
  • Rather than just test features, test user oriented scenarios.
  • Treat English as just another language.
  • Be prepared to scale down large amounts of content for markets that can't sustain the work required to maintain and update it.
  • Although the authoring environment was switched to Word, to make life easier for the authors, the content is exported to XML for localization. A strict schema, global CSS and guidelines were essential to make this work.
  • In order to have a local touch there are in-market managers and in-market content development and community engagement.
  • It's important to maintain a customer connection to understand their voice and measure value for them: Microsoft analyses search engine patterns, feeds results back to authors and marketeers, and reorganises the site.
  • Trends for the future: growing impact of the multilingual cloud, growth of multimedia, language automation to provide more content in markets, and interoperability through standards.
  • Conclusion: it is possible to design both for scale and for local relevance.

Jirka Kosek, XML Guru from the University of Economics, Prague, presented "Using ITS in the common content formats". The Internationalisation Tag Set (ITS) is a set of generic elements and attributes that can be used in any XML content format to support easier internationalisation and localization of documents. The talk showed examples and advantages of using ITS in formats like XHTML, DITA and DocBook. Problems of integration with HTML5 were also briefly discussed. Key points:

  • You can use ITS without namespaces because the specification proposes data categories that can be mapped to your own markup.
  • The goal of ITS is to help establish markup that makes localization more accurate, faster and better prepared for automatic processing.
  • Examples of ITS support include DocBook, DITA, OOXML, and ODF via extensions.
  • Although you can add ITS features to OOXML and ODF, it is not easy to use/apply those customizations to the content using the authoring interface.
  • XHTML can easily support ITS markup extensions but HTML5 will not support suitable extension mechanisms for adding ITS support. The ITS features need to be integrated into the HTML specification.
  • The translate flag is something that would be very useful, but popular translation services (such as Google and Microsoft) need to also recognise and support markup for that.
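To illustrate the local form of ITS markup Kosek described, the sketch below reads its:translate flags from XML content. It ignores ITS inheritance and global rules, so it is a simplification; the product name in the sample is invented:

```python
import xml.etree.ElementTree as ET

# ITS 1.0 namespace, used for the its:translate attribute.
ITS_NS = "http://www.w3.org/2005/11/its"

doc = ET.fromstring(
    '<doc xmlns:its="{0}">'
    '<p>Translate this paragraph.</p>'
    '<p its:translate="no">ACME RotorMatic 3000</p>'  # invented product name
    '</doc>'.format(ITS_NS)
)

def translatable(elem):
    """True unless the element carries its:translate='no'."""
    return elem.get("{{{0}}}translate".format(ITS_NS)) != "no"

for p in doc.iter("p"):
    print(p.text, "->", "translate" if translatable(p) else "leave as-is")
```

This is exactly the flag that translation services would need to honour for the markup to be useful end to end, as the bullet above notes.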

Serena Pastore from The National Institute of Astrophysics (INAF) was due to present the talk "Obstacles for following i18n best practices when developing content at INAF", but was unable to attend the workshop. Her slides and a summary of her talk are included here. INAF is an Italian research institute whose goals are scientific and technological research in astronomy and astrophysics. Its researchers and technologists make heavy use of the Web to deploy content, from web pages (including text, images, forms, sounds, etc.) to multimedia formats (i.e. audio/video) and social objects (i.e. tweets). Moreover, the people who need to be reached are heterogeneous (from the general public to science users) and come from different countries. To reach all these kinds of potential stakeholders, web content should be multilingual, but INAF has encountered great obstacles in achieving that goal. The wide availability of authoring tools makes web publishing very easy for everyone, without attention being paid to the final web product, and this very often works against developing content that is actually accessible, usable and international. INAF is trying to educate and persuade its content authors that following web standards and best practices, including in the i18n area, gives content added value, since this is the only way to disseminate information that can reach every stakeholder. It is therefore trying to promote and disseminate knowledge of these subjects: one example is a document, derived from the W3C i18n best practices, that sets out for INAF's authors and web managers the main steps needed to lay the basis for multilingual content. Meanwhile, the hope is for a new generation of authoring tools able to automate these mechanisms. The context, and some of the issues encountered in achieving even basic internationalisation, are the following:

  • Most of the web content is in one language (Italian or English).
  • Many project web sites are only in English, with the idea of reaching all users.
  • There are many web authors, spread across the several sites that make up the Institute, who use different authoring tools (from specific audio/video software like Flash to a vast number of CMSs such as WordPress, Joomla, Plone, etc.), and some who still compose web pages directly in HTML/XHTML/CSS with a simple text editor. These tools require customisation, or special attention, in order, for example, to declare and apply an encoding.
  • Web content is hosted on several web servers managed by different users with little experience, since it is now so easy to set up and run a web server with default settings. It is therefore unlikely that such managers remember to set the character encoding or the HTTP charset parameter.
  • Many web authors are not specifically skilled in web technologies, and are sometimes reluctant to follow web standards or practices that require special attention when creating content.

Manuel Tomas Carrasco Benitez, of the European Commission Directorate-General for Translation prepared a presentation about "Standards for Multilingual Web Sites". Because he was unable to attend the workshop, Charles McCathieNevile gave the talk for him. The talk argued that additional standards are required to facilitate the use and construction of multilingual web sites. The user interface standards should be a best practices guide combining existing mechanisms such as transparent content negotiation (TCN) and new techniques such as a language button in the browser. Servers should expect the same API to the content, though eventually one should address the whole cycle of Authorship, Translation and Publishing Chain (ATP-chain). Key points:

  • Currently users are faced with inconsistent user interfaces for moving between localized versions of sites, even within the same web site.
  • It should be easy to create a new multilingual web site using standard off-the-shelf software and approaches, but it's not.
  • There needs to be a best practice guide for designing interfaces for multilingual sites.
  • Suggestions for the user: language buttons in the browser; transparent content negotiation; a reserved URI keyword to show available variants; integrate machine translation automatically, or send requests for human translation; use metadata more.
  • Proposal: create a new Working Group in the W3C Internationalisation Activity - the issue is to get people together to do the work.
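Transparent content negotiation builds on the language negotiation already present in HTTP, where the client sends an Accept-Language header and the server chooses among available variants. As an illustrative server-side sketch (simplified: it matches primary subtags only and ignores wildcard ranges; the talk itself proposes going further, with a browser language button and a reserved URI keyword):

```python
def pick_variant(accept_language: str, available: list) -> str:
    """Choose the best available language variant for an
    HTTP Accept-Language header value, e.g. "it, en;q=0.8"."""
    prefs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        if ";q=" in piece:
            tag, q = piece.split(";q=", 1)
            try:
                q = float(q)
            except ValueError:
                q = 0.0
        else:
            tag, q = piece, 1.0  # no q-value means quality 1.0
        prefs.append((q, tag.strip().lower()))
    # Try the user's preferences in decreasing order of quality,
    # matching on the primary language subtag only.
    for _, tag in sorted(prefs, reverse=True):
        primary = tag.split("-")[0]
        for lang in available:
            if lang.lower().split("-")[0] == primary:
                return lang
    return available[0]  # fall back to the site default

print(pick_variant("it, en;q=0.8", ["en", "fr", "it"]))  # -> it
```

The inconsistency the talk complains about arises because each site layers its own ad hoc language switcher on top of (or instead of) this negotiation; a best practice guide would standardise how the two interact.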

Sophie Hurst, Director of Global Corporate Communications at SDL, presented about "Local is Global: Effective Multilingual Web Strategies". The talk proposed that the Web on-the-go is now an everyday reality. It touches all of our lives from the moment we wake, through our commute, from work to an evening out on the town. This reality presents both an opportunity and an incredible challenge as Web content managers attempt to optimise customer engagement. Because visitors do not see themselves as part of a global audience but as individuals, the talk examined the Web content management software requirements that enable organizations to maintain central control while providing their audiences with locally relevant and translated content. From a global brand management perspective, the talk examined how organizations can manage, build and sustain a global brand identity by reusing brand assets across all channels (multiple multilingual web sites, email and mobile web sites). It also took a fresh look at automated personalization and profiling, and at how Web content can be targeted for specific language requirements as well as the interests of local audiences. Key points:

  • Asia has the highest internet usage figures but the lowest penetration figures. Various other statistics point to the non-English world as the area of main growth for the Web. Online sales are an increasingly important use case for the Web. So online, relevant information in the right language is the key to global success today.
  • A component-based model is key to representing your brand consistently in different countries while allowing for local marketing input. Content components should be reusable and modified per country.
  • Targeting, personalisation and local knowledge are key to making sure that content is relevant to a specific culture. Use geo-positioning, and ensure you work with an agency or local people who understand local nuances and can give feedback on what works.
  • For an efficient way of making sure all the content gets localized on time and on message you need integrated web content and translation management.

The Q/A part of the Creator session began with questions about why we need standard approaches to multilingual web navigation if companies have already figured out how to do it, whether companies use locally-adapted CSS, and how accurate geolocation is. A large part of the session was dedicated to a discussion about the value or opportunities for sub-locale personalisation. This brought in other topics such as how many people are multilingual, aspects of dealing with the social web, and approaches to crowdsourcing. For more details, see the related links.

Localizers session

This session was chaired by Jörg Schütz of bioloom group.


Christian Lieske, Knowledge Architect at SAP AG, talked about "The Bricks to Build Tomorrow's Translation Technologies and Processes". His co-authors were Felix Sasaki, of DFKI, and Yves Savourel, of Enlaso. Two questions were addressed: why talk about tomorrow's translation technologies and processes, and what are the most essential ingredients for building that tomorrow? Although support for standards such as XLIFF and TMX has increased interoperability among tools, today's translation-related processes are facing challenges beyond the ability to import and export files. They require standards that are granular and more flexible. Using concrete examples of the ways that various tools can interoperate beyond the exchange of files, the talk walked through some of the issues encountered and outlined a new approach to standardization in which modular standards, similar to Lego® blocks, could serve as core components for tomorrow's agile, interoperable, and innovative translation technologies.

Answers to the Why? included remarks about the growing demand for language services (in particular translations) and the lack of interoperability between language-related tools. Additionally, Christian mentioned shortcomings in existing standards, such as the XML Localization Interchange File Format (XLIFF), and the limited adoption of Web-based technologies as challenges to the status quo.

The What? was summarized by the observation that static entities (such as data models) should not be the starting point for the evolution of translation technologies and processes. Rather, the right mindset and an overall architecture/methodology should be put in focus first. Detailed measures that were mentioned included the following:

  • Identify the processing areas related to language processing (e.g. extraction of text) - and keep them apart
  • Determine the entities that are needed in each area (e.g. “text units” and the “skeleton” left after extraction of text)
  • Chart technology options and needs (e.g. use of RDF)
  • Realize opportunities to reuse, and worship standards
  • Distinguish between models and implementations/serializations
  • Distinguish between entities without context and entities with business/processing context
  • Set up rules to transform data models into syntaxes
  • Set up flexible registries (or even more powerful collaboration tools)

The Core Components Technical Specification (CCTS), developed within UN/CEFACT, UBL and ebXML, was mentioned as an example from a non-language business domain that exemplifies these measures.

Dr. David Filip, Senior Researcher at the Centre for Next Generation Localisation (CNGL), the University of Limerick and the LRC, talked about "Multilingual transformations on the web via XLIFF current and via XLIFF next". David argued that content metadata must survive language transformations to be of use in the multilingual Web. To achieve that goal, metadata related to content creation and content language transformation must be congruent, i.e. designed up front with the transformation processes in mind. To make the case for XLIFF as the principal vehicle for critical metadata throughout multilingual transformations, he gave a high-level overview of XLIFF structure and functions, both in the current version and in the next generation of the standard, which is currently a major and exciting work in progress in the OASIS XLIFF TC. Key points:

  • Metadata must survive language transformations. Content meta-data must be designed up front with the transformation processes in mind. Multiple source languages are becoming standard rather than the exception in large multilingual content repositories.
  • It is critical to secure semantics match between content creation and transformation processes standards, to marry content creation, localization and publishing standards.
  • ITS is a good basis, but more metadata is needed - for example, multiple alternatives relating to transformation, not just translate on/off.
  • There is a need for coordination between standards bodies.
  • TMX is dead (now definitely, together with LISA). XLIFF is its natural successor (CNGL LRC Phoenix makes use of XLIFF as TM).
  • XLIFF is the principal vehicle for critical metadata throughout multilingual transformations. The next generation XLIFF standard is a major and exciting work in progress in the OASIS XLIFF TC. There has been an influx of new resources to the XLIFF group, and toolmakers want to participate.
  • Challenges for XLIFF: establish a powerful and compulsory core, sort out the inline markup salad, create meaningful extensions, coordinate with W3C, Unicode and ISO TC37 - and all this in a short time-frame.
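For readers unfamiliar with the format under discussion, the XLIFF 1.2 skeleton that carries source/target pairs can be generated with Python's standard xml.etree library. This is an illustrative sketch only: real XLIFF files carry far richer inline markup and the process metadata discussed in the talk.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
ET.register_namespace("", XLIFF_NS)  # serialize as the default namespace

def make_xliff(original: str, src: str, tgt: str, units) -> str:
    """Build a minimal XLIFF 1.2 document from (id, source, target) tuples."""
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", version="1.2")
    file_el = ET.SubElement(xliff, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": src,
        "target-language": tgt,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for uid, source, target in units:
        tu = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", id=uid)
        ET.SubElement(tu, f"{{{XLIFF_NS}}}source").text = source
        ET.SubElement(tu, f"{{{XLIFF_NS}}}target").text = target
    return ET.tostring(xliff, encoding="unicode")

doc = make_xliff("index.html", "en", "it", [("1", "Hello", "Ciao")])
print(doc)
```

The trans-unit element shown here is the natural attachment point for the kind of transformation metadata David argues must survive the whole process.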

Sven C. Andrä, CEO of Andrä AG, spoke about "Interoperability Now! A pragmatic approach to interoperability in language technology". Existing language technology standards give the false impression of interoperability between tools. There's a gap to bridge that is mostly about mindsets, technology and mutual consent on the interpretation of standards. A couple of players agreed to search for this mutual consent based on existing standards to bridge this gap. The talk gave some background on the issues with the use of existing standards and how Interoperability Now! is approaching this. Key points:

  • There are not enough reference implementations of standards, and support in tools is patchy. Very few standards have verification tools. There is little exchange between tool providers because there is no organization that supports that.
  • For us, interoperability means the lossless exchange of data between technologies from different vendors, and the ability of these technologies to work together in a mixed environment.
  • We need interoperability to support innovation and fair competition. We are all competitors working together, but it makes sense.
  • Interoperability Now! is adapting specifications such as XLIFF to create interoperability with a view to feeding experiences back into the main standard.
  • There is no existing standard for packaging content in the localization industry.
  • Open source reference implementations are important for future standards.
  • Short-timeframe, agile development allows for flexibility to discard failures and deploy while concepts are still timely.
  • It is still not clear to us whether the words 'agile' and 'standard' go together well, but we should find out soon.

Eliott Nedas, Business Development Manager at XTM International, spoke about "Flexibility and robustness: The cloud, standards, web services and the hybrid future of translation technology". After introducing the current state of affairs, describing leading innovations, and also lamenting the demise of LISA, the talk moved on to describe a possible future and who the winners and losers will be. The last part of the talk looked at what we can do to get standards moving internally in medium and large organisations. Key points:

  • There are many standards out there, and what is needed is a system for users that makes the standards themselves invisible.
  • OAXAL provides a standards based system that is free.
  • In the future all translation systems will be totally intuitive hybrid systems that actually use standards; integrate social-media-type tools for communication; use advanced TM architectures for total control of linguistic assets; include powerful quality controls, with automated and human review features; enable real-time previews of all content types; use advanced MT of all types; have user-configurable automation; allow for easy integration with content cycles; and work from any platform.
  • Developers should be trained in standards as they learn to develop so that it becomes second nature to use them.

Pål Nes, Localization Coordinator at Opera Software, gave a talk about "Challenges in Crowd-sourcing". Opera Software has a large community, with members from all over the world. The talk presented various obstacles encountered and lessons learned from using a community of external volunteer resources for localization in a closed-source environment. Topics included training and organization of volunteers and managing terminology and branding, as well as other issues that come with the territory. The talk also described the tools and formats used by Opera. Key points:

  • Crowd sourcing is not free or effortless, though it can certainly help.
  • You should have at least one contact person per language and you should teach them about branding guidelines and terminology.
  • Opera found that they needed to start vetting applicants for crowd sourcing. A small number of productive translators is better than a large number of inactive applicants.
  • Crowd sourcing is excellent for static translations with relatively stable content, but not so good for things like press releases and marketing materials.
  • It's best to start with a small set of languages first, to iron out problems.
  • If I had known what I know today I would never have tried to create a customized version of XLIFF!
  • XLIFF is a minefield because tools don't support the same features, especially for inline content.

Manuel Herranz, CEO of PangeaMT, talked about "Open Standards in Machine Translation". The web is an open space, and the standards by which it is "governed" must be open. However, according to the talk, one barrier clearly remains before the web can become even more transnational and truly global: the language barrier. Language Service Providers' translation business model is clearly antiquated, and it is increasingly being questioned when we face the real translation needs of web users, where immediacy is paramount. The talk was about open standards in machine translation technologies and workflows that support a truly multilingual web. Key points:

  • Language translation is a job that is becoming unmanageable. Increasing demands, increasing volumes, shorter deadlines. Human production is not sufficient.
  • I disagree that TMX will disappear any time soon. It is likely to remain in use for some time, just like MP3 and the CD-ROM are still around.

David Grunwald, CEO of GTS Translation, spoke about "Website translation using post-edited machine translation and crowdsourcing". In his talk he described a plugin for web sites that GTS has developed using the open-source Wordpress CMS. According to David, it is the only solution that supports post-editing of MT and allows content publishers to create their own translation community. The talk presented the GTS system and described some of the challenges in the translation of dynamic web content, as well as the potential rewards that their concept holds. Key points:

  • There are over 100 million blog publishers worldwide; tens of thousands of online newspapers/magazines and web sites use open-source CMSs. GTS sees this as a large potential market for a solution capable of producing high quality content at a very low cost, based on a content translation platform that uses machine translation software and human post-editing by a translation community (crowdsourcing).
  • Widgets from Google and Microsoft do not cache content, so it is not indexed by search engines.

The Q&A dwelt briefly on crowdsourcing considerations. A comment was also made that initiatives, such as Interoperability Now, should be sure to talk with standards bodies at some point. It was mentioned that the W3C has started up a Community Group program to enable people to discuss and develop ideas easily, and then easily take them on to standardisation if it is felt that it is appropriate. For details, see the related links.

Machines session

This session was chaired by Tadej Štajner of the Jožef Stefan Institute.


Dave Lewis, Research Lecturer at the Centre for Next Generation Localisation (CNGL) and Trinity College Dublin gave the anchor talk for the Machines session: "Semantic Model for end-to-end multilingual web content processing". This talk presented a Semantic Model for end-to-end multilingual web content processing flows that encompass content generation, its localisation and its adaptive presentation to users. The Semantic Model is captured in the RDF language in order to both provide semantic annotation of web services and to explore the benefits of using federated triple stores, which form the Linked Open Data cloud that is powering a new range of real world applications. Key applications include the provenance-based Quality Assurance of content localisation and the harvesting and data cleaning of translated web content and terminology needed to train data-driven components such as statistical machine translation and text classifiers. Key points:

  • There's going to be more and more work in the long tail of translation because of increasing changes in the way content is generated.
  • Before looking at the enterprise web site, people tend to seek answers to problems via user forums, social media, etc. Enterprises may increasingly attempt to bring that information into their web site, but it will likely be small and frequent pieces of content.
  • Web services appear to offer a range of benefits for managing translation, especially of smaller content, but it is not clear how various niche products would fit into any standardisation framework.
  • A handoff standard is required that conforms to careful profiling and definition of processing expectations. XLIFF variants have been used, but XLIFF is not really designed for this kind of use in web services.
  • The lower level of the Semantic Web stack, in particular RDF-based data interchange, provides a number of interesting benefits for handling communication between services.
  • We've been looking at processing content and developing a basic taxonomy of content states that we can integrate in various ways with other ontologies, along with a high-level taxonomy of services to go with it. These taxonomies are being developed in an ongoing, trial-and-error way with real projects.
  • We have to deal with a world where we start packaging content and pushing it down a pipeline, but it changes part way along, so we need to track the current state of the content, between major services and between web services also.
  • The localization industry should not be trying to standardise semantics.

Alexandra Weissgerber, Senior Software Engineer at Software AG, spoke next about "Developing multilingual Web services in agile software teams". Developing multilingual Web services in agile software teams is a multi-faceted enterprise which comprises various areas, including methodology, governance and localization. The talk reported on Software AG's use of standards and best practices - particularly where and how they did or did not fit - the gaps they have encountered, and their strategies and workarounds for bridging them effectively.

Andrejs Vasiljevs of Tilde spoke about "Bridging technological gap between smaller and larger languages". Small markets, limited language resources, tiny research communities – these are some of the obstacles to the development of technologies for smaller languages, according to this talk. The presentation shared experiences and best practices from EU collaborative projects, with a particular focus on acquiring resources and developing machine translation technologies for smaller languages. Novel methods helped to collect more training data for statistical MT, involve users in data sharing and MT customisation, collect multilingual terminology, and adapt MT to the terminology and stylistic requirements of particular applications. Key points:

  • "Creation, preservation and processing of, and access to [..] content in digital form should [..] ensure that all cultures can express themselves and have access to Internet in all languages, including indigenous and minority languages." (UNESCO, Code of Ethics for the Information Society (Draft))
  • "Survival of smaller languages depends on the outcome of the race between development of Machine Translation and proliferation of larger languages." (Alvin Toffler)
  • Automated acquisition of linguistic knowledge extracted from parallel corpora replaces time- and resource-consuming manual work. But the applicability of current data-driven methods depends directly on the availability of very large quantities of parallel corpus data, and the translation quality of current data-driven MT systems is low for under-resourced languages and domains.
  • System adaptation is a prohibitively expensive service, not affordable for smaller companies or the majority of public institutions.
  • Some strategies to bridge the gap: encourage users to share their data, involve users in MT improvements, and use other kinds of multilingual data beyond parallel texts.
  • The Accurat project is looking at "comparable corpora" - ie. non-parallel bi- or multilingual text resources. Sources include multilingual news feeds, multilingual web sites, Wikipedia articles, etc. Results will be made available as an open-source toolkit later this year.
  • The Web is becoming increasingly spoiled with low-quality machine-translated pages. Tagging machine-translated texts would help to keep this data out of MT training corpora.

Boštjan Pajntar, Researcher at the Jožef Stefan Institute, gave a talk about "Collecting aligned textual corpora from the Hidden Web". With the constant growth of web-based content, large collections of textual data become available. Many, if not most, professional non-English web sites offer pages translated into English and other languages for their clients and partners. These are usually professional translations, and they are abundant. The talk refers to this as the Hidden Web, and presented possibilities, problems and best practices for harnessing such aligned textual corpora. Such data can then be used efficiently as a translation memory, for example to help human translators, or as training data for machine translation algorithms. Key points:

  • When we looked for standards to apply to our research it was easy to find TMX (so it may not be dead yet), but XLIFF was not so visible.
  • The 'hidden web' refers to the huge amounts of high-quality translated text on non-English web sites. Our objective is to harness this.
  • It would be good to have information about what level of correctness is the minimum for good quality parallel corpora.
  • Translation memory tools appear to be designed for human compilation or simple cases. There is a need for more powerful tools for automatic harnessing of parallel text data.
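Harvested sentence pairs ultimately have to be emitted in an interchange format, and since the speaker found TMX the most visible option, here is a minimal sketch of serializing paired sentences as TMX 1.4 with Python's standard library. It is illustrative only: real harvesting needs proper sentence alignment rather than the pre-paired input assumed here, and the header values are invented.

```python
import xml.etree.ElementTree as ET

def to_tmx(pairs, src_lang="en", tgt_lang="sl") -> str:
    """Emit aligned sentence pairs as a minimal TMX 1.4 document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "harvest-sketch",   # invented tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
        "o-tmf": "none",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv")
            # xml:lang uses the predefined XML namespace
            tuv.set("{http://www.w3.org/XML/1998/namespace}lang", lang)
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

tmx_doc = to_tmx([("Welcome.", "Dobrodošli.")])
print(tmx_doc)
```

Because each tu simply pairs one segment per language, the format also accommodates the automatic harvesting scenario; the open question raised in the talk is how much alignment noise such a corpus can tolerate.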

Gavin Brelstaff, Senior Researcher at CRS4 Sardinia, provided the final talk in the Machines session, entitled "Interactive alignment of Parallel Texts – a cross browser experience". His co-author was Francesca Chessa, of the University of Sassari. The talk reported their experience test-driving current standards and best-practice related to multilingual Web applications. Following an overview of their pilot demonstrator for the interactive alignment of parallel texts (e.g. poetic translations in/out of Sardinian), they indicated pros and cons of the practical deployment of key standards - including TEI-p5, XML, XSL, UTF-8, CSS2, RESTful-HTTP, XQuery, W3C-range. Key points:

  • Statistical machine translation is not going to translate poetry well any time soon.
  • Gavin's tool allows translators to manually align text for a poem, using colour-coding and a point-and-click interface, to express how the translation was done.
  • It should be possible to bind events to XML, not just HTML.
  • w3cRange doesn't work, so it is not possible to align within words.
  • The project suffered various problems due to differences in browser support or bugs (listed in the slides).

Topics discussed during the Q&A session included the following: whether semantic tagging can assist machine translation; what are the implications of copyright when harvesting resources from the hidden Web; how does localization apply within the Scrum model; the effectiveness of matching on the hidden Web when the content of two comparable pages has gaps; can one ever expect to translate poetry, and what is the actual purpose of Gavin's tool; and will RDF semantic tagging lead to new approaches for natural language generation. For the details, follow the related links.

Users session

This session was chaired by Christian Lieske of SAP.


Paula Shannon, Chief Sales Officer at Lionbridge, presented the anchor talk for the Users session, entitled "Social Media is Global. Now What?". Paula began the session with a short video entitled "Social Media Revolution 2", which can be seen on Youtube. According to Paula, there is no question about it: companies are embracing social media and working it on a global scale. But the expansion is not without its challenges, chief among them how to communicate effectively on multiple platforms, in multiple languages, with a variety of cultural audiences. The talk looked at how companies are making it happen, the ways they are using social media globally, and the emerging best practices for dealing with language and culture on blogs, Twitter, community forums and other platforms. Key points:

  • Social media is an ecosystem, and at the centre of this ecosystem is, and will be, search. Rather than designing perfect copy and ensuring that it goes out via the channel you want, you need to create landing pages that map to the reality of the searcher. The user absolutely sits at the centre of this ecosystem.
  • The move into the social media arena for companies who want to improve contact with customers is happening quickly, the volumes of content are mind-boggling, and it is expanding, not retracting.
  • Consider your strategy for social media representation: do you use a single centralised page, or dispersed local language/culture pages, or a hybrid model?
  • The primary concern of users is moving from quality to immediacy, and sometimes just access. This is the tipping point.
  • Social media demands real time language. There's no time for pre- and post-processing. The industry needs to let go of old-fashioned thinking and understand that the user makes the call on what level of quality is important for them.
  • There's a renewed interest in all manner of automated translation.
  • Engagement is the new ROI (which could be expanded as Risk Of Ignoring).

Maarten de Rijke, from the University of Amsterdam, presented about "Emotions, experiences and the social media". There is little doubt, said Maarten, that the web is being fundamentally transformed by social media. The realization that we now live a significant part of our lives online is giving rise to new perspectives on text analytics and to new interaction paradigms. The talk proposed that emotions and experiences are key to communication in social media: recognizing and tracking them in highly dynamic multilingual text streams produced by users around Europe, or even around the globe, is an emerging area for research and innovation. In his talk, Maarten illustrated this with a few examples drawn from online reputation management and large-scale mood tracking. Key points:

  • The Political Mashup project tracks ownership of topics from parliament to social media and back in multiple languages.
  • The CoSyne project looks for parallel Wikipedia pages and translates additional content in one language to fill gaps in another language.
  • The MoodViews project tracked emotion indicators used by LiveJournal users. By identifying spikes it is possible to identify major news events, and by tracking against topics follow how moods change over time, etc.
  • In this way, with the right linguistic tools, social media can be used as a 'societal thermometer'.

Gustavo Lucardi, COO of Trusted Translations, spoke about "Nascent Best Practices of Multilingual SEO". The talk touched, from the perspective of a Language Service Provider (LSP), on how Multilingual Search Engine Optimisation (MSEO) is already an essential part of the language Localization process. The presentation provided an in-depth look at the nascent Best Practices and explained the concepts behind Multilingual Search Engine Optimisation. Key points:

  • We are beginning to see Social SEO complementing traditional Search Engine Optimisation techniques.
  • It is not effective to simply translate keywords, as people in different countries and cultures are looking for different things, based on their own culture and behaviour. You have to find the right keywords for each language.
  • You have to label your content for language using W3C standards for multilingual SEO to work.
  • What worked for us: focus on long tail and niche markets, and look for keywords that produce conversions rather than just traffic.
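Labeling content for language, as mentioned above, is commonly done for search engines with alternate hreflang link annotations that map each localized page to its language or locale. The following sketch generates such tags; the URL layout and the helper function are invented for illustration, not taken from the talk:

```python
def hreflang_links(base: str, locales, default: str = "en") -> str:
    """Generate <link rel="alternate" hreflang="..."> tags for each
    localized variant of a page, plus an x-default fallback."""
    lines = []
    for locale in locales:
        # Assumed URL convention: one path segment per locale
        lines.append(f'<link rel="alternate" hreflang="{locale}" '
                     f'href="{base}/{locale.lower()}/" />')
    # x-default points search engines at the fallback page for
    # visitors whose language is not otherwise covered
    lines.append(f'<link rel="alternate" hreflang="x-default" '
                 f'href="{base}/{default}/" />')
    return "\n".join(lines)

print(hreflang_links("https://example.com", ["en", "es-MX", "de"]))
```

Annotations like these complement, rather than replace, the per-language keyword research the talk emphasises: they only tell the search engine which variant to show, not what searchers in that locale actually look for.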

Chiara Pacella, of Facebook Ireland, gave a talk about "Controlled and uncontrolled environments in social networking web sites and linguistic rules for multilingual web sites". She argued that in social networking web sites a "controlled" component, generated by content creators, must coexist with an "uncontrolled" component generated by the users. Even though the latter is more difficult to control, it is the former that creates more challenges in terms of l10n/i18n. A crowdsourcing approach has proven successful for Facebook, but this was achieved thanks to the implementation of standard linguistic rules that are complex and detailed but, at the same time, easily understandable by the actors involved in the translation process. Key points:

    • Social media content is extremely dynamic. "Uncontrolled" components, not necessarily textual, are combined with "controlled" components to generate the final output. This has to be taken into account when localising content. But the controlled components are still not static, due to changes to the context in which the text appears.
    • Facebook has developed an approach that uses tokens and dynamic string explosion to address the needs of dynamic text to cope with declensions, gender, number, etc. via rules. This involves translators providing multiple translations for controlled text to fit in the various different linguistic contexts.
    • Because Facebook uses a crowd-sourcing approach, the rules and translation approach need to be understandable by non-linguists.
    • Using Facebook's crowd-sourcing approach, over 500,000 users contributed to the translation of the site. French was translated in 24 hours and released in less than 3 weeks. Over 76 languages have been launched, and about 100 languages are being translated in total. So crowd-sourcing provides quality, speed and reach.
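The token-plus-variants idea behind "dynamic string explosion" can be illustrated with a small sketch. The data and rule format below are invented for illustration and are not Facebook's actual implementation: translators supply one translation per linguistic context, and the runtime picks the variant matching the grammatical properties of the substituted token.

```python
# Each translated string is "exploded" into variants keyed by the
# grammatical context (gender, number) of the token substituted in.
# (Illustrative data; not Facebook's actual rule format.)
TRANSLATIONS = {
    ("{name} added a photo.", "it"): {
        ("female", "one"): "{name} ha aggiunto una foto.",
        ("male", "one"): "{name} ha aggiunto una foto.",
    },
    ("{name} are now friends.", "it"): {
        ("mixed", "many"): "{name} ora sono amici.",
        ("female", "many"): "{name} ora sono amiche.",
    },
}

def render(source: str, lang: str, name: str, gender: str, number: str) -> str:
    """Pick the translation variant matching the token's grammar."""
    variants = TRANSLATIONS.get((source, lang))
    if not variants:
        return source.format(name=name)  # fall back to the source string
    # Use the exact context if available, else any variant as a last resort
    text = variants.get((gender, number)) or next(iter(variants.values()))
    return text.format(name=name)

print(render("{name} are now friends.", "it", "Anna e Maria", "female", "many"))
# -> Anna e Maria ora sono amiche.
```

The point of Chiara's talk is that the rule vocabulary exposed to translators (here, the gender/number keys) must stay simple enough for non-linguist crowd contributors to apply correctly.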

Ian Truscott, VP Products at SDL Tridion, finished off the Users session with a talk entitled "Customizing the multilingual customer experience – deliver targeted online information based on geography, user preferences, channel and visitor demographics". The talk posited that users are increasingly using social media and different devices alongside the 'traditional' web and offline media, and therefore information that was previously unavailable or inaccessible is today shaping their opinions and buying behaviour. As a result, users' expectations have changed, raising the bar for any organization that interacts with them. They expect information that is always targeted and relevant to their needs, available in their language and on the device of their choice. The presentation sought to highlight some of the specific challenges that are emerging, as well as to demonstrate the technology available to solve them. Key points:

    • These days 'instantness' rather than quality is assuming more importance, and people are looking for relevant content as the needle in the information haystack.
    • Over 50% of tweets are not in English, and that figure will continue to grow. This is important because social media provide people with their own channels which need to be tracked for monitoring and understanding audience sentiment.
    • User generated content drives buying decisions, so how do you leverage this in new markets and connect multilingual communities?
    • When we publish we need to be where the user is, which involves repurposing by channel and community. This drives an explosion in content on multiple devices.
    • The demand for more content and more languages drives a need for more automation.

The Q&A began with a question about what progress Lionbridge and SDL have made with regard to managing social media translations. There was a comment that ICU is working on library support for handling gender and plural variations for complex language display. And there was a question about the sources of the theories that underlie Maarten's work. For details, see the related links.

Policy session

This session was chaired by Charles McCathieNevile of Opera Software.


Jaap van der Meer of TAUS presented the anchor talk for the Policy session with a talk entitled "Perspectives on interoperability and open translation platforms". This presentation gave a summary of the joint TAUS-LISA survey on translation industry interoperability and a report from the recent Standards Summit in Boston (February 28-March 1) as well as perspectives on open translation platforms from TAUS Executive Forums. Key points:

  • There is a certain apathy about standards out there, but lack of interoperability costs the language industry a fortune!
  • Survey respondents defined interoperability in terms of three needs: a standard format for exchanging translation memory, interaction of CMSs with translation management systems and MT, and a standard format for exchanging terminology.
  • Standards are important to simplify translation business processes, reduce costs and improve translation quality. Also to allow people to switch vendors.
  • Biggest barriers include lack of compliance with interchange format standards, but also the lack of an organising body to lead and monitor compliance.
  • Most important standards are TMX and XLIFF, according to the survey.
  • Survey respondents can be grouped into Believers (largest group), Realists, and Pragmatists, according to their views on how to address current issues.
  • Strengths and weaknesses of the localisation industry seem to map onto the top of the content disruption pyramid, whereas identified opportunities and threats seem to relate more to the emerging, new content related to social media and new technologies, such as support, knowledge bases and user generated content.
  • One of the key things we have noted is that whereas in the old model TM is core, in the new model Data is core. That change is happening very quickly now.
  • Enterprises have realised that the 20-year-old model of squeezing translation rates a little each year is gone; now they need an enterprise language strategy. We will see more change over the next 5 years than over the last 25. Enterprises need to decide where they will fit over the next five years: open, collaborative, or closed?
  • If you want to be a player in the 21st Century translation model you have to be interoperable - you have no choice. We need coaching, need to fix the issues around standards, etc.
  • As we introduce standardisation, we must try to avoid a 'Translation State' that closes out the innovators.
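
Since the survey singled out TMX as one of the two most important standards, a minimal sketch of what a TMX translation-memory exchange file looks like, and how it might be read with only the Python standard library, may be useful. The element names follow the TMX 1.4 structure (header, body, tu, tuv, seg); the sample segment content is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Minimal TMX 1.4-style document: each <tu> (translation unit) holds one
# <tuv> (translation unit variant) per language, with the text in <seg>.
TMX_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          creationtool="example" creationtoolversion="0.1"
          adminlang="en" o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The page could not be found.</seg></tuv>
      <tuv xml:lang="it"><seg>Impossibile trovare la pagina.</seg></tuv>
    </tu>
  </body>
</tmx>
"""

# ElementTree expands xml:lang to its full namespace-qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx(text):
    """Return {source_segment: {lang: segment}} from a TMX document."""
    root = ET.fromstring(text)
    srclang = root.find("header").get("srclang")
    memory = {}
    for tu in root.iter("tu"):
        variants = {tuv.get(XML_LANG): tuv.findtext("seg")
                    for tuv in tu.iter("tuv")}
        memory[variants[srclang]] = variants
    return memory

tm = read_tmx(TMX_SAMPLE)
print(tm["The page could not be found."]["it"])
```

Because the format is a plain XML vocabulary, any tool that can read it can reuse a translation memory produced by a competing tool, which is exactly the vendor-switching benefit the survey respondents cited.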

Fernando Serván, Senior Programme Officer in the Meeting Planning and Language Support Group of the FAO of the UN, presented "From multilingual documents to multilingual web sites: challenges for international organizations with a global mandate". International organizations face many challenges when trying to reach their global audience in as many languages as possible. The Food and Agriculture Organization of the United Nations (FAO) works in six languages (Arabic, Chinese, English, French, Russian and Spanish) to try to have an impact in the agricultural sector of its member countries. The presentation focused on the need for multilingual support on the Web and referred to the standards and best practices needed. It covered aspects such as the creation and deployment of multilingual content, translation needs and the possible integration of TM and MT, the availability of CAT tools, etc. Key points:

  • Organizations need to integrate content of documents into web sites and vice versa. CMS software would need capabilities to export content so it can be easily processed with CAT tools and adapted to documents or imported back into the CMS.
  • Additional work with universities and training institutes is needed to integrate best practices into university curricula, generating improved methods and new professional profiles (different from current language or IT professionals, who each have a particular set of skills).
  • Multilingual web sites are difficult for international organizations to maintain. Best practices and methods from commercial corporate web sites could provide guidance and advice to be adapted to not-for-profit web sites.
  • Work on TM-MT integration is ongoing, and international organizations such as FAO are interested potential partners for collaboration in these initiatives.
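
The export/import round trip between a CMS and CAT tools described above is what XLIFF was designed for. The sketch below, using only the standard library, wraps a set of CMS strings in a minimal XLIFF 1.2-style file and reads translated targets back; the element names follow XLIFF 1.2, but the string keys and content are invented for illustration and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET

def export_xliff(strings, srclang, trglang):
    """Wrap CMS strings in a minimal XLIFF 1.2-style document for CAT tools."""
    xliff = ET.Element("xliff", version="1.2")
    file_ = ET.SubElement(xliff, "file", {
        "original": "cms-export", "datatype": "plaintext",
        "source-language": srclang, "target-language": trglang})
    body = ET.SubElement(file_, "body")
    for key, text in strings.items():
        tu = ET.SubElement(body, "trans-unit", id=key)
        ET.SubElement(tu, "source").text = text
        ET.SubElement(tu, "target")  # to be filled in by the translator
    return ET.tostring(xliff, encoding="unicode")

def import_xliff(doc):
    """Read translated <target> elements back into a {key: text} mapping."""
    root = ET.fromstring(doc)
    return {tu.get("id"): tu.findtext("target")
            for tu in root.iter("trans-unit")}

# Export a (hypothetical) CMS string for English-to-French translation.
doc = export_xliff({"home.title": "Welcome"}, "en", "fr")
```

The stable `trans-unit` ids are what allow the translated content to be imported back into the right place in the CMS, regardless of which CAT tool produced the translations.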

Stelios Piperidis, Senior Researcher at ILSP-"Athena" RC, gave a talk entitled "On the way to sharing Language Resources: principles, challenges, solutions". This talk presented the basic features of the META-SHARE architecture, the repositories network, and the metadata schema. It then discussed the principles that META-SHARE uses regarding language resource sharing and the instruments that support them, the membership types along with the privileges and obligations they entail, as well as the legal infrastructure that META-SHARE will employ to achieve its goals. The talk concluded by elaborating on potential synergies with neighbouring initiatives and future plans at large. Key points:

  • Jaap needs data, Fernando has data, the issue is how to share the data. Sharing and exchanging data is the focus of the META-NET work.
  • Data collection, cleaning, annotation, curation, maintenance, etc is a very costly business, let alone standardisation. META-SHARE tries to share the benefits of an open, integrated, secure, and interoperable exchange infrastructure for language data and tools for the Human Language Technologies domain.
  • META-SHARE is simple, it's free, and it's yours, so consider increasing your share in META-SHARE.

Author: Richard Ishida. Contributors: scribes for the workshop sessions, Jirka Kosek, Steven Pemberton, Felix Sasaki, Tadej Štajner, and Jörg Schütz. Photos in the collage at the top courtesy of Richard Ishida. Thanks to CNR for video recording the conference, and to VideoLectures for hosting the video content.