W3C Workshop Report:
A Local Focus for the Multilingual Web
21-22 September 2011, Limerick, Ireland
Today, the World Wide Web is fundamental to communication in all walks of life. As the share of English web pages decreases and that of other languages increases, it is vitally important to ensure the multilingual success of the World Wide Web.
The MultilingualWeb project is looking at best practices and standards related to all aspects of creating, localizing and deploying the Web multilingually. The project aims to raise the visibility of existing best practices and standards and identify gaps. The core vehicle for this is a series of four events planned over a two-year period.
On 21-22 September 2011 the W3C ran the third workshop in the series, in Limerick, entitled "A Local Focus for the Multilingual Web". The Limerick workshop was hosted by the University of Limerick. Kieran Hodnett, Dean of the Faculty of Science and Engineering, gave a brief welcome address.
As for the previous workshops, the aim of this workshop was to survey, and introduce people to, currently available best practices and standards that are aimed at helping content creators, localizers, tools developers, and others meet the challenges of the multilingual Web. The key objective was to share information about existing initiatives and begin to identify gaps.
Unlike the previous workshops, due to co-location with the 16th LRC Conference, this event ran for one and a half days. In another departure, the final half day was dedicated to an Open Space discussion forum, in breakout sessions, rather than to presentations. Participants pooled ideas for discussion groups at the beginning of the morning, split into 6 breakout areas and reported back in a plenary session at the end of the morning. Participants could join whichever group they found interesting, and could switch groups at any point. During the reporting session participants in other groups could ask questions of or make comments about the findings of the group. This proved to be an extremely popular part of the workshop, and several participants wanted to continue working on the things they discussed after the event, with a view to meeting again for further discussion during the next workshop. The final attendance count for the event was 85.
As for previous workshops, we video-recorded the presenters and, with the assistance of VideoLectures, made the videos available on the Web. We were unable to stream the content live over the Web as we did in Pisa. We also once more made available live IRC scribing to help people follow the workshop remotely, and assist participants in the workshop itself. As before, there were numerous people tweeting about the conference and the speakers during the event, and you can see these linked from the program page.
The program and attendees continued to reflect the same wide range of interests and subject areas as in previous workshops and we once again had good representation from industry (content and localization related) as well as research.
In what follows, after a short summary of key highlights and recommendations, you will find a short summary of each talk accompanied by a selection of key messages in bulleted list form. Links are also provided to the IRC transcript (taken by scribes during the meeting), video recordings of the talk (where available), and the talk slides. All talks lasted 15 minutes. Finally, there are summaries of the breakout session findings, most of which are provided by the participants themselves.
The next workshop will take place in Luxembourg, on 15-16 March, 2012.
What follows is an analysis and synthesis of ideas brought out during the workshop. It is very high level, and you should watch the individual speakers' talks to get a better understanding of the points made.
Our keynote speaker, Daniel Glazman, reviewed the progress made by CSS and HTML in terms of internationalization support. He exhorted content authors to always write documents in UTF-8 and tag them with language information. He also encouraged the use of content negotiation in order to help people reach resources in their own language more easily. This latter point was also taken up later in the Creators session, and in one of the discussion groups.
He called for browser implementers to quickly implement the new HTML5 bdi element, to support 'start' and 'end' values instead of 'left' and 'right', and to address other problems with mixed direction content in forms. He warned, however, that more pressure is needed from users, especially Japanese and Chinese, to encourage browser developers to support their language features.
Finally, he intimated that ePub and related standards are likely to be an important way of packaging documents in the future.
In the Developers session we heard from Christian Lieske how ITS (the W3C's Internationalization Tag Set) is being applied today, but how there is a need for new data categories to be supported, and application of ITS to additional usage scenarios, such as HTML5. The MultilingualWeb-LT project, introduced by David Filip, plans to address these concerns and others in its work to improve and provide reference implementations for metadata through the authoring and localization process chain.
We heard some more about various useful new features for script support in CSS, as well as some pitfalls to avoid, from Gunnar Bittersmann. He proposed that there was a need for a :script selector in CSS, although this was disputed during the Q&A that followed.
During the Creators session, Moritz Hellwig spoke about his experiences with Content Management Systems (CMS) and how the lack of a simple method to reformat translated text content and integrate media into translated text could cause projects to be abandoned. He called for the gap between CMS and Language Service Provider (LSP) communities to be closed. He said that we need a metadata standard that covers issues such as terminology and domain information (content domains and audience domains), and widely adopted open definitions for content import and export functions. The translation workflow should also allow shipping of other information, over and above text.
Danielle Boßlet took up the topic of user navigation again, calling on content developers to put focus on users and their needs when creating web sites. She called for all information (including site maps and indexes) to be available in all languages offered, and for more attention to be paid to links so that users can clearly see what information is available in their own language.
Lise Bissonnette Janody concluded the session with a call for better processes to ensure the development of effective content, and a number of ideas about how to achieve that.
The Localizers session began with an analysis of how to improve the efficiency of the translator's workbench from Matthias Heyn. He ended his talk by predicting that reviewer productivity (in addition to translator productivity) would soon become an important topic, and that another emerging theme would be automation of the broader production chain.
Asanka Wasala demonstrated a crowd-sourced approach to enabling more content for the linguistic long tail of the Web, where many languages are still under-served, leaving people without access to information. For him, a key standardization issue is the divergent ways in which browsers and operating systems handle the rendering of complex scripts. He also called for more standardization with regard to extension development across browsers.
From Sukumar Munshi we had a review of the importance of interoperability, and the plea for an initiative to help people collaborate and to facilitate work on standards.
In the Machines session we had overviews of ongoing best practices in news sentiment analysis, NLP interchange formats and Web services for accessing lexical resources.
Thomas Dohmen said that there is a need for a standard API that is easy to integrate and rich enough for domain-specific machine translation, and that data collection standards are needed for markup to tell you about the quality of the data being used to train machine translation systems.
Alex proposed that there is an enormous opportunity and a great need to include globalization and internationalization in the personalized future of the web.
Olaf-Michael showed some clever ways of tracking changes in a wiki where there was no specific source language, and highlighted the Linport project, which defines a universal way to put everything needed for a translation project into one electronic package for transfer among stakeholders. This type of standard was later discussed in a breakout session.
Kicking off the Policy session, Gerhard Budin said that we need a modelling approach to terminological semantics that includes the modelling of cultural differences (only some of which are expressed linguistically), and an integrative model of semantic interoperability in order to (semi-)automate cross-lingual linkages, inferences, translations and other operations. He also called for more networking among stakeholders to better link the many diverse projects together and bring forward successful elements into sustainable structures once a project ends.
Georg Rehm followed with an overview of the status of the META-NET project, and in particular drew attention to a number of white papers that are in development and which each provide information about the state of one European language in the digital age including the respective support through Language Technology.
In the final presentation, Arle Lommel reviewed the state of GALA, and called for localization standards to be written in a more user-friendly way, rather than by geeks for geeks. He also called for more coordination in order to avoid incompatible standards, and more guidance about how to use standards and the business case they address.
The second day was devoted to Open Space breakout sessions. The session topics were: Standardization; Translation Container Standards; LT-Web; Multilingual Social Media; and User Focused Best Practices.
Among the proposals arising from these discussions was a way of achieving automation and operability in translation container standards that goes beyond Linport in terms of scope. Another recommendation was a stronger coupling between CMS and translation memory systems, and an endorsement of the value of backing standardization with implementations and test suites, such as the MultilingualWeb-LT project will deliver.
The best practices group documented a number of recommendations. These include producing a simplified version of existing WCAG (Web Content Accessibility Guidelines) 2.0 documents, and involving translators, localizers and designers more in the design and training related to web accessibility guidelines. There were also recommendations about how to adapt Web content for educational users, and a call to protect endangered languages by providing more localized content. As previously mentioned, there was also a call for more focus on linking strategies, to help people better understand what is and is not available to them in their language.
Another group discussed making a list of currently available standards, and making that list available to the public as a way of raising awareness about such standards, and informing future work on those standards.
Welcome session & Keynote talk
Daniel Glazman, of Disruptive Innovations and W3C CSS Working Group Co-Chair, gave the keynote speech, entitled "Babel 2012 on the Web". In his talk Daniel said that if Open Web and Internet Standards were mostly western-centric in the early years, things have drastically changed. English is no longer the most common language on the net, and the various standards bodies have improved the support for the languages and scripts of the world. The new cool kids on the block in 2012 will be HTML5, CSS3 and EPUB3, and this talk showed us how Standards are paving the way for the Multilingual Web in those areas.
- Authors are encouraged to use UTF-8. Many recent content authoring tools use UTF-8 as the default, and some tools no longer offer other character encoding options. This is a good development. I strongly suggest that you never write documents in an encoding other than UTF-8.
- Tagging the language of a document makes sense for everyone. Programming languages give access to the HTML lang attribute, but not to the xml:lang attribute. The lang attribute should be ubiquitous throughout all markup formats to make things simpler.
- You need multiple links to link to resources in different languages. It would be good to have a way around this so that a user gets the resource in their own language automatically.
- The HTML dir attribute doesn't handle vertical text and there are no plans yet for it to do so. You can do it with CSS. There is not enough pressure from Japanese and Chinese users yet to influence the development of the HTML markup.
- There are still problems when dealing with mixed direction in form data. Also the new bdi element still needs to be implemented to deal with mixed direction text inserted into pages.
- There are still issues with localization of dates and calendars, and particularly with time zones. Also, knowledge about handling names is still in the early stages.
- PHP5 also has a lot of issues with character encodings and UTF-8. PHP 6 is supposed to fix this, but is not here yet. Hopefully it will also provide a better localization framework.
- We need a more stable approach to localization frameworks. We can't rely on solutions that may not last.
- CSS is working on vertical text support, text transformation, hyphenation, emphasis marks, international list numbering, etc. Also, downloadable fonts and OpenType language support are coming.
- The CSS box model has a big issue because it uses 'left' and 'right' everywhere. Because of the direction of their script, many users need an abstraction. CSS is implementing 'start' and 'end' keywords to replace 'left' and 'right' to make content more easily transferable between scripts.
- ePub3, for ebooks, incorporates OpenType and WOFF, HTML5, CSS 2.1 and parts of CSS3. This is likely to be an important way of packaging documents in the future.
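The point above about users getting a resource in their own language automatically is what HTTP content negotiation addresses. As a minimal sketch, assuming a server that receives the browser's Accept-Language header and knows which language versions of a page exist, the choice could look like the following Python. The function names, the fallback rule (drop trailing subtags, so pt-BR falls back to pt) and the sample values are illustrative assumptions, not something shown in the talk.

```python
from __future__ import annotations


def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Return (language-range, q) pairs, most preferred first."""
    ranges = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, *params = part.split(";")
        q = 1.0  # a range without an explicit q value has weight 1
        for param in params:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed weight as "not wanted"
        ranges.append((tag.strip().lower(), q))
    # sorted() is stable, so equal weights keep their order in the header
    return sorted(ranges, key=lambda item: item[1], reverse=True)


def choose_language(header: str, available: list[str], default: str) -> str:
    """Pick the best available language version for an Accept-Language header."""
    lookup = {lang.lower(): lang for lang in available}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue  # q=0 means the user explicitly rejects this language
        if tag == "*":
            return default
        candidate = tag
        while candidate:
            if candidate in lookup:
                return lookup[candidate]
            # Simplified fallback: "pt-br" -> "pt" -> "" (stop)
            candidate = candidate.rpartition("-")[0]
    return default


if __name__ == "__main__":
    pages = ["en", "fr", "ja", "pt"]
    print(choose_language("ja, en;q=0.8", pages, default="en"))
    print(choose_language("pt-BR, fr;q=0.7", pages, default="en"))
    print(choose_language("de", pages, default="en"))
```

A site using something like this would still want visible links to the other language versions, since the header reflects browser settings rather than a guaranteed user preference.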
The Developers session was chaired by Tadej Štajner of the Institut Jožef Stefan.
Dr. David Filip, Senior Researcher at the Localisation Research Centre (LRC) and the Centre for Next Generation Localisation (CNGL), started off the Developers session with a talk entitled, "MultilingualWeb-LT: Meta-data interoperability between Web CMS, Localization tools and Language Technologies at the W3C". MultilingualWeb-LT, an FP7 funded coordination action, is going to set up a W3C Working Group to standardize metadata exchange between Web Content Management Systems, Localization Tools and Language Technologies. This session opened the public discussion of the working group's charter and encouraged participation in the working group from outside of the initial EC funded consortium. Key points:
- The working group aims to address three major interoperability gaps in the multilingual web content life cycle: namely, those between Deep Web metadata and localization (L10n); Surface Web metadata and Real time Machine Translation; and Deep Web metadata and metadata driven MT training.
- Addressing these gaps will include alignment with other existing and ongoing language technology and localization standardization activities; prominently the W3C ITS (Internationalization Tag Set) and the OASIS XLIFF technical committee effort, as XLIFF will be used for prototyping MultilingualWeb-LT metadata round-trips in the three main scenarios outlined above.
- The work will include development of reference implementations, XLIFF round-trip prototypes, and test suites for all three areas.
- Metadata under consideration includes the ITS data categories, such as translate, Localization note, terminology, etc, but may also include translation provenance, human post-editing, QA provenance, domain information, etc.
Christian Lieske, Knowledge Architect at SAP AG, talked about "The Journey of the W3C Internationalization Tag Set – Current Location and Possible Itinerary". His coauthors were Felix Sasaki of DFKI, Yves Savourel of Enlaso, Jirka Kosek of the University of Economics in Prague, and Richard Ishida of the W3C. Two questions were addressed: Where is the W3C Internationalization Tag Set (ITS) today? and Where may ITS be heading? In addition, a brief introduction to ITS was provided, which highlighted that ITS is a set of internationalization-related concepts that may not just be applicable in an XML world.
- Although ITS is only four years old, it already has considerable implementation support especially in the world of localization tools. Support for ITS does already exist in both market leading commercial tools, as well as in free and open source (FOSS) efforts such as the Okapi framework.
- In addition, ITS has already been related to popular formats such as XHTML 1.1 and the Darwin Information Typing Architecture (DITA). Furthermore, ITS has been integrated into several formats, most notably DocBook 5.0.
- As for the possible route ahead, several directions are on the horizon. On the one hand, a couple of enhancements have already been suggested for ITS. They include features that are mainly of use for machines (e.g. ITS rules that allow the generation of/designation of identifiers), as well as features that are of high importance for humans (e.g. a scheme and mechanism to talk about different types of context such as "the 'source' element provides semantic information, namely the name of an author encoded with the Dublin Core vocabulary").
- Another direction is the use of ITS concepts in additional usage scenarios such as HTML5.
- The exact route for ITS, however, isn't charted yet. Options range from an additional version produced by a W3C Working Group, to continued requirements gathering and real world implementation in the context of the forthcoming MultilingualWeb-LT project (funded by the European Commission).
Gunnar Bittersmann, Web Developer at brands4friends, presented "CSS & i18n: dos and don'ts when styling multilingual Web sites". The talk covered best practices and pitfalls when dealing with languages that create large compound words (like German), languages with special capitalization rules (again, like German), or languages written in right-to-left scripts. This includes things like box sizes, box shadows and corners, image replacement etc. It also covers benefits that new CSS3 properties and values offer in terms of internationalization, a discussion about whether the :lang pseudo-class selector meets all needs or if there's more to wish for, and how to implement style sheets for various languages and scripts (all rules in a single file or spread over multiple files?). The talk was of practical rather than theoretical nature. Key points:
- Browsers are just beginning to implement automatic hyphenation, but not for all languages at once – hyphenation rules are language-specific.
- Don't expect text to fit in fixed size boxes of the same size when translated.
- Be careful when applying text effects, such as text-transform rules, since they may not be appropriate in all languages, e.g. German needs uppercase letters at the start of nouns.
- Different text directions can cause problems for things like shadow styling. If using separate styles for different direction text you need to ensure that the order in which style sheets are included is the same, and you need to remember to make fixes in both style sheets. I recommend you use specific selectors to include all the style rules in a single style sheet.
- Useful new CSS features coming through include auto hyphenation, new text emphasis styles, script-specific justification rules, and use of start/end rather than left/right.
- It would be useful to have a :script selector in CSS, rather than just relying on language selectors.
The Developers session on the first day ended with a Q&A question about language negotiation, and a suggestion that it should be possible to identify the original language in which a page was written.
This session was chaired by Charles McCathieNevile of Opera Software.
Moritz Hellwig, Consultant at Cocomore, gave the first talk for the Creators session, "CMS and Localisation – Challenges in Multilingual Web Content Management". Because he was unable to make the trip, this talk was delivered as a pre-recorded video. Content Management Systems (CMS) have come to be widely used to provide and manage content on the Web. As such, CMS are increasingly used for multilingual content, which presents new challenges to developers and content providers. This presentation explored these challenges and showed how and why a closer alignment of CMS developers and LSPs can improve translation management, workflows and quality. Key points:
- When dealing with one customer, substantial problems were encountered after translation related to formatting of text content, and re-integration of pictures and other media into the translated text. This led to the abandonment of plans to localize the content.
- The Web is much more open than traditional media, and brings companies and customers together across geographical boundaries. Content management systems therefore have to be able to not only store, display and manage translations, but also deal with and integrate translations from language service providers. However there is a gap in understanding between CMS and LSP communities.
- We need a metadata standard that covers issues like terminology and domain information and that can be integrated into the editorial workflows that content managers use. For example, we need a way to identify terminology that needs to be translated with particular care. The markup therefore needs to carry information about under what circumstances content should be translated.
- Another example is the lack of domain information, which creates MT translation issues. Typically there need to be both content domains and audience domains.
- Also we need open definitions for content import and export functions which encourage widespread adoption. The translation workflow should allow shipping of other information, above and beyond text, to the translation vendor. We need a clear definition of what content data and meta information must be exchanged between the content management side and the language service provider side, in order to minimise reintegration efforts, as well as improve translations.
- We look forward to participating in the MultilingualWeb-LT initiative because it will bring LSPs and CMSs together and help them to better understand each other's requirements and issues.
Danielle Boßlet, Translator, spoke on "Multilinguality on Health Care Websites - Local Multi-Cultural Challenges". Global health care organizations like the World Health Organization have to present their websites in a variety of languages to make sure that as many people as possible can benefit from their online offer. The same applies to the European Union, which publishes its official documents in 23 languages and therefore has to guarantee that its websites are equally multilingual. Due to the fact that Germany is a country with a large number of immigrants, the government and other official institutions would do well to present their websites not only in German or English, but also in other languages, like Turkish or Russian. The websites of the WHO, the EU and some German institutions were checked for their multilingual offerings and possible shortcomings of the different language versions. The severest and most frequent shortcomings and their consequences for users were highlighted in this talk. Key points:
- There are two main issues on the website of the WHO: 1. Many pages are only available in English. Therefore, less information is provided for non-English speaking users. 2. Links are often poorly implemented: Either because they are left in English and therefore cannot be used by non-English speakers or because they lead to English content without indicating it.
- The health portal of the European Union also has two severe issues: 1. On the one hand, links are implemented well because they are translated. On the other hand, they very often lead to information that is only available in English, without an appropriate indication for the user. 2. Neither the site map nor the A to Z index of the website has been translated; for almost all languages both remain in English, making both functions inaccessible to non-English speakers.
- Two of three German government healthcare websites only offer a German and an English version, which is clearly insufficient given the number of immigrants living in Germany. On all websites issues with links can be found.
- To avoid the problems described above, the focus has to be on users and their needs when creating websites. It is important that all information (including site maps and indexes) is available in all languages offered. If this is not possible, the least that should be done is to clearly show users which information they can access in their language.
- More attention has to be paid to links; after all they are one of the key features of the web. They need to be transparent for users. To achieve this, they have to be translated and must lead to website versions in the users' native language. If this is not possible they must be accompanied by a hint that indicates the language of the linked content.
Lise Bissonnette Janody, Content Consultant at Dot-Connection, presented "Balance and Compromise: Issues in Content Localization". Web content managers need to make choices with respect to the content they translate and localize on their websites. What guides these decisions? When in the process should they be made? What are their impacts? This talk provided a high-level overview of these choices, and how they fit into the overall content strategy cycle. Key points:
- Effective content is appropriate, useful, user-centred, clear, consistent, concise and is supported by tools, people and resources.
- You need a process to ensure that you are developing effective content.
- Set baseline targets, examine what you have, get into specifics and keep it in synch.
The Q&A part of the Creators session began with questions for Danielle about how she gathered her data, and a couple of audience members contributed additional information. One mentioned that a significant factor that hinders large organizations from meeting user needs is fear of imperfections, but another is resources. Crowd-sourced resources may be able to help. Another question asked to what extent standards are being applied. There's a time lag for adoption, and the proliferation of content on the Web is driving issues. W3C recommendations for page design are commonly followed, but WHO doesn't offer content negotiation. For more details, see the related links.
This session was chaired by Christian Lieske of SAP.
Matthias Heyn, VP Global Solutions at SDL, talked about "Efficient translation production for the Multilingual Web". The translation editor has seen major technological advances in recent years. Compared to classic translation memory applications, current systems allow expert users to double, if not triple, the number of words translated. Whereas the key technology advances are in the area of sub-segment reuse and statistical machine translation (SMT), the actual productivity gains relate to the ergonomics of how systems allow users to interact, control and automate the various data sources. This presentation reviewed key capabilities on the various document, segment and sub-segment levels like: Document level SMT, TrustScore, dynamic routing, dynamic preview; Match type differentiation, Auto-propagation, SMT integration and SMT configurations, segment-level SMT trust scores and feedback cycles (segment level); Auto-suggest dictionary and phrase completions (sub-segment level). The capabilities discussed were put into the perspective of how the vast amount of multilingual online content is affected by such innovation. Key points:
- At the topic level, don't translate if it hasn't changed, but show it to provide context for the text that has actually changed. This can be achieved by markup exclusions or by 'perfect matching', i.e. comparison with earlier versions.
- On the segment level:
  - Don't translate if you can re-use an (approved) existing translation. Different mechanisms for identifying text can affect quality.
  - Adapt an automated translation proposal (instead of translating from scratch). MT is increasingly accepted by professional translators, and significant productivity gains can come depending on the relevance and training of the engine. Use 'trust scores' to determine when a proposal is most likely useful and when not.
  - Automate retraining of SMT engine / phrase dictionaries in feedback cycle(s).
  - Auto-propagate translations for identical source segments.
- At the subsegment level:
  - A key driver is auto completion. It's an art to display not too many suggestions and to avoid noise.
- Actual productivity gains relate to the ergonomics of how systems allow users to interact, control and automate the various data sources.
- ITS and XLIFF are important standards for the topic level, translation memory and XLIFF are important for segment level, but there is nothing yet around the auto-propagation and auto-suggest areas.
- An upcoming theme is reviewer productivity (in addition to translator productivity).
- Another upcoming theme is automation of the broader production chain.
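The segment-level advice above (re-use approved translations, and auto-propagate translations for identical source segments) can be illustrated with a small Python sketch. It is a deliberately simplified assumption of how such a pre-translation pass might work: the function name, the dictionary-based translation memory and the stand-in translate callback are all hypothetical, only exact matches are handled, and fuzzy matching and trust scores are left out.

```python
from __future__ import annotations

from typing import Callable


def pretranslate(
    segments: list[str],
    memory: dict[str, str],
    translate: Callable[[str], str],
) -> tuple[list[str], dict[str, int]]:
    """Translate segments, preferring approved memory entries and
    propagating new translations to later identical source segments."""
    targets: list[str] = []
    stats = {"reused": 0, "propagated": 0, "translated": 0}
    session: dict[str, str] = {}  # translations produced during this pass

    for source in segments:
        if source in memory:
            # Approved existing translation: do not translate again.
            target = memory[source]
            stats["reused"] += 1
        elif source in session:
            # Identical source segment seen earlier: auto-propagate.
            target = session[source]
            stats["propagated"] += 1
        else:
            # New segment: hand over to a translator or MT engine (stand-in here).
            target = translate(source)
            session[source] = target
            stats["translated"] += 1
        targets.append(target)

    return targets, stats


if __name__ == "__main__":
    approved = {"Save": "Speichern"}
    document = ["Save", "Cancel", "Save", "Open file", "Cancel"]
    # The lambda is a placeholder that only marks segments needing work.
    result, counts = pretranslate(document, approved, lambda s: f"[{s}]")
    print(result)
    print(counts)
```

Keeping the approved memory separate from the per-session dictionary reflects the distinction drawn in the talk between reviewed translations and fresh proposals that may still need checking.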
Asanka Wasala, PhD Student from the Centre for Next Generation Localisation (CNGL) and the Localisation Research Centre (LRC), talked about "A Micro Crowdsourcing Architecture to Localize Web Content for Less-Resourced Languages". He reported on a novel browser extension-based client-server architecture using open standards that allows localization of web content using the power of the crowd. The talk addressed issues related to MT-based solutions and proposed an alternative approach based on translation memories (TMs). The approach is inspired by Exton et al. (2009) on real-time localization of desktop software using the crowd and Wasala and Weerasinghe (2008) on browser based pop-up dictionary extensions. The architectural approach chosen enables in-context real-time localization of web content supported by the crowd. To the best of his knowledge, this is the only practical web content localization methodology currently being proposed that incorporates Translation Memories. The approach also supports the building of resources such as parallel corpora, which are still unavailable for many languages, especially under-served ones. Key points:
- Many languages are not supported by translation sites such as Google Translate, Bing Translator and Yahoo Babel Fish, leaving large numbers of people without access to information.
- A browser extension mechanism that provides term lookup and translation memory access enables people to help themselves via a crowd-sourced approach to translation. A key benefit is that it is independent of web sites themselves.
- Issues include copyright questions, localization of non-text content and formatting, constant updates, as well as how to manage voting for the best translation and deployment of the content.
- A key standardization issue is the different ways in which browsers handle rendering of text. Browsers tend to rely on different Operating System level code for representation of complex scripts, and there is variable support for OpenType features.
- Another issue is a lack of standardization with regard to extension development across browsers.
Sukumar Munshi, Corporate Development Officer at Across Systems, spoke about "Interoperability standards in the localization industry – Status today and opportunities for the future". Unable to attend the workshop at the last minute, Sukumar provided a pre-recorded video presentation. Interoperability and related standards are topics still frequently and controversially discussed. While standards such as TMX and TBX are established within the industry, others, such as XLIFF, are rated differently and not that widely implemented. This presentation covered the current status of interoperability in the localization and translation industry, historical development, understanding of interoperability, related business requirements, effects on delivery models, interoperability between tools, open standards, current challenges and opportunities for the future. Key points:
- The industry does not suffer from a lack of standards, but more from lack of awareness.
- Some standards are not in synch with evolving requirements.
- Standards relevant to the industry are developed by many bodies.
- There is a lack of training, promotion and best practices surrounding localization standards.
- We need an initiative (such as GALA) to collaborate and facilitate work on standards, and bring together disparate constituencies that are impacted by standards.
- Predictions for the future of interoperability in 2015:
  - The pressure on interoperability has increased due to cloud based processing.
  - The community of organizations, customers, suppliers and tool vendors will fully endorse supporting standards.
  - All interested parties will have agreed on how interoperability standards should be applied.
  - Tools will support the basic concepts of all current interoperability standards.
  - Added value will be on performance, throughput and solutions for specific applications of the standards.
The Q&A dwelt briefly on crowd sourcing considerations. A comment was also made that initiatives, such as Interoperability Now, should be sure to talk with standards bodies at some point. It was mentioned that the W3C has started up a Community Group program to enable people to discuss and develop ideas easily, and then easily take them on to standardization if it is felt that it is appropriate. For details, see the related links.
This session was chaired by Felix Sasaki of DFKI.
Thomas Dohmen, Project Manager at SemLab, talked about "The use of SMT in financial news sentiment analysis". Statistical Machine Translation systems are a welcome development for news analytics. They enable topic-specific translation services, but are not without problems. The SMT system being developed for the Let'sMT (FP7) project is trained and used to translate financial news for SemLab's news sentiment analysis platform. This talk gave an example of the benefits and problems of integrating such systems. Key points:
- MT systems can be very useful to sentiment analysts that are not well acquainted with translation technologies but the APIs need to be standardised.
- Standard APIs need to be easy to integrate and rich enough for domain-specific MT.
- Data collecting standards are needed for markup to tell you about the quality of the data being used to train MT, e.g. from human translation, not re-used MT data, whether literal or approximate translations, etc.
- A general translation standard is needed, covering accessibility and quality assurance.
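The kind of training-data markup called for in the key points above could look something like the following sketch. No such standard is described in the talk: the `SegmentPair` record, its `origin` and `fidelity` fields and their allowed values are hypothetical, chosen only to show how provenance labels would let a trainer exclude re-used MT output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    """One aligned training segment with hypothetical provenance labels."""
    source: str
    target: str
    origin: str    # assumed values: "human", "mt", "post-edited"
    fidelity: str  # assumed values: "literal", "approximate"


def usable_for_training(pairs, allowed_origins=("human", "post-edited")):
    """Keep only pairs whose origin is trusted, dropping re-used MT output."""
    return [p for p in pairs if p.origin in allowed_origins]


if __name__ == "__main__":
    corpus = [
        SegmentPair("Shares fell sharply.", "Les actions ont fortement chuté.", "human", "literal"),
        SegmentPair("Profits rose.", "Les bénéfices ont augmenté.", "mt", "literal"),
    ]
    for pair in usable_for_training(corpus):
        print(pair.source, "->", pair.target)
```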
Sebastian Hellmann, Researcher at the University of Leipzig, spoke about the "NLP Interchange Format (NIF)". NIF is an RDF/OWL-based format that allows developers to combine and chain several NLP tools in a flexible, light-weight way. The core of NIF consists of a vocabulary, which can represent strings as RDF resources. A special URI design is used to pinpoint annotations to a part of a document. These URIs can then be used to attach arbitrary annotations to the respective character sequence. Based on these URIs, annotations can be interchanged between different NLP tools. Although NLP tools are abundantly available on all linguistic levels for the English language, this is often not the case for languages with fewer speakers. Thus, it becomes especially necessary to create a format that allows the integration and interoperability of NLP tools. With respect to multilinguality, two use cases come to mind: 1. An existing English software system that uses an English NLP tool needs to be ported to another language, but the NLP tool for the other language is not compatible with the system because there is no common interface (example: a CMS with keyword extraction). 2. Paragraphs in different kinds of documents can be annotated in RDF with multilingual translations that can potentially remain stable over the lifetime of a document. In particular, the URI recipe introduced (Context-Hash) possesses advantageous properties, which withstand comparison to other URI naming approaches. Key points:
- The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
- Currently it is difficult to address content within a web document. The URI normally only points to the whole document. NIF provides 2 URI schemes to attach annotations to substrings of the document and use them in RDF as subjects.
- Conceptual interoperability is achieved by incorporating existing linguistic ontologies such as the Ontologies of Linguistic Annotation (OLiA). OLiA provides a local annotation model for a tag set and then links it to the OLiA reference model. Conceptual interoperability makes it possible to achieve parser, language and framework independence.
- NIF enables the creation of heterogeneous, distributed and loosely coupled NLP applications, which use the Web as an integration platform. Another benefit is that a NIF wrapper has to be created only once for a particular tool, but enables the tool to interoperate with a potentially large number of other tools without additional adaptations. Ultimately, we envision an ecosystem of NLP tools and services emerging that uses NIF for exchanging and integrating rich annotations.
- Community portal and interactive tutorial challenges online to foster adoption, e.g. the Multilingual Part-Of-Speech Tagger Release with more summary information.
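The idea of addressing substrings by URI, described above, can be sketched as follows. This is a simplified illustration, not the NIF specification: the fragment syntax (`#offset_…`, `#hash_…`), the 10-character context window, the use of MD5, the function names and the `http://example.org/…` URIs are all assumptions. It shows one offset-style scheme and one context-hash-style scheme, and why the latter can stay stable when text elsewhere in the document changes.

```python
import hashlib


def offset_uri(doc_uri, begin, end):
    """Address a substring by its character offsets (breaks if text shifts)."""
    return f"{doc_uri}#offset_{begin}_{end}"


def context_hash_uri(doc_uri, text, begin, end, context=10):
    """Address a substring by a hash of itself plus its nearby context.

    The URI stays the same as long as the string and the `context`
    characters on either side of it are unchanged.
    """
    left = text[max(0, begin - context):begin]
    right = text[end:end + context]
    anchor = text[begin:end]
    digest = hashlib.md5(f"{left}({anchor}){right}".encode("utf-8")).hexdigest()
    return f"{doc_uri}#hash_{context}_{end - begin}_{digest}"


def ntriple(subject, predicate, value):
    """Format one annotation as an N-Triples line with a plain literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


if __name__ == "__main__":
    doc = "http://example.org/doc"  # hypothetical document URI
    text = "NIF was presented in Limerick. The Semantic Web is an extension of the Web."
    begin = text.index("Semantic Web")
    end = begin + len("Semantic Web")
    subject = context_hash_uri(doc, text, begin, end)
    # Hypothetical annotation property, used as an RDF predicate.
    print(ntriple(subject, "http://example.org/vocab#label", "Semantic Web"))
    print(offset_uri(doc, begin, end))
```

Because the resulting URI is an ordinary RDF subject, any tool that can emit triples about it can contribute annotations without knowing about the other tools in the chain.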
Yoshihiko Hayashi, Professor at Osaka University, gave a talk about "LMF-aware Web services for accessing lexical resources". This talk demonstrated that the Lexical Markup Framework (LMF), the ISO standard for modeling and representing lexicons, can be nicely applied to the design and implementation of lexicon access Web services, in particular when the service is designed in the so-called RESTful style. As the implemented prototype service provides access to bilingual/multilingual semantic resources, in addition to standard WordNets, slight revisions to the LMF specification were also proposed. Key points:
- International standards for language resource management can be effectively utilized in implementing standardized language Web services, in particular for accessing lexical resources. Accessing a lexicon is achieved by query-driven extraction and rendering of the relevant portion of the resource (sub-lexicon).
- The described service is designed in a RESTful way:
  - the access URI specifies a sub-lexicon, as defined by Wordnet-LMF, a variant of the Lexical Markup Framework (LMF; ISO 24613:2008).
  - the resulting sub-lexicon is represented as an LMF-compliant XML document, which can be easily consumed, for example by being converted into an HTML document via XSLT conversion.
- Slight modifications to the standard were required to accommodate the EDR concept dictionary, which is a bilingually constructed semantic lexicon.
- Next steps include incorporating more wordnets to further attest the applicability of the LMF standard, Web-servicizing other types of lexical resources (e.g. bilingual dictionaries), and implementing an RDF endpoint for Lexical Linked Data, which requires a standardized RDF representation of LMF.
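The RESTful pattern described in the key points above (an access URI selects a sub-lexicon, which is returned as XML and then rendered as HTML) can be sketched as below. This is not the prototype service from the talk: the path layout `/lexicon/<lang>/lemma/<lemma>`, the sample data and the function names are assumptions, the element names are only loosely modelled on LMF and are not validated against the standard, and the HTML step uses plain Python instead of XSLT because the standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical in-memory lexicon: language -> lemma -> list of glosses.
LEXICON = {
    "en": {"dog": ["a domesticated canine"], "bank": ["a financial institution", "the side of a river"]},
}


def sub_lexicon(path):
    """Map an access path to an LMF-like XML document for that sub-lexicon."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "lexicon" or parts[2] != "lemma":
        raise ValueError(f"unsupported access path: {path!r}")
    lang, lemma = parts[1], parts[3]
    root = ET.Element("LexicalResource")
    lexicon = ET.SubElement(root, "Lexicon", language=lang)
    glosses = LEXICON.get(lang, {}).get(lemma)
    if glosses is not None:
        entry = ET.SubElement(lexicon, "LexicalEntry")
        ET.SubElement(entry, "Lemma", writtenForm=lemma)
        for number, gloss in enumerate(glosses, 1):
            ET.SubElement(entry, "Sense", id=f"{lemma}-{number}", gloss=gloss)
    return ET.tostring(root, encoding="unicode")


def to_html(xml_text):
    """Render the XML sub-lexicon as a simple HTML list."""
    root = ET.fromstring(xml_text)
    items = []
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma").get("writtenForm")
        for sense in entry.findall("Sense"):
            items.append(f"<li><b>{escape(lemma)}</b>: {escape(sense.get('gloss'))}</li>")
    return "<ul>" + "".join(items) + "</ul>"


if __name__ == "__main__":
    document = sub_lexicon("/lexicon/en/lemma/bank")
    print(document)
    print(to_html(document))
```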
Topics discussed during the Q&A session included whether XPointer is an alternative for the RDF-based string annotation described by Sebastian, and why SPARQL can't be used to access the LMF data. There was a suggestion that moving Sebastian's data into XML and overlaying it with xml:tm would allow for translation memory at a segment level. For the details, follow the related links.
This session was chaired by Reza Keschawarz of the LTC.
Alexander O'Connor, Research Fellow (DCM) at Trinity College Dublin and the Centre for Next Generation Localisation (CNGL), started the Users session with a talk entitled "Digital Content Management Standards for the Personalized Multilingual Web". The World Wide Web is at a critical phase in its evolution right now. The user experience is no longer limited to a single offering in a single language. Localisation has offered a web of many languages to users, and this is now becoming a hyper-focused tailoring that makes each web experience different for each user. The need to address the key requirements of a web which is real-time, personal and in the right language is paramount to the future of how information is consumed. This talk discussed the key trends in personalization, with particular focus on work being undertaken in the Digital Content Management track of the CNGL, and provided an insight into current and future trends, both in research and in the living web. Key points:
- There is an overwhelming amount of content on the web, and it is constantly growing in volume and diversity.
- Web content is also increasingly being created by non-English speakers, in their native and in other languages.
- There is a central need for users to be able to filter their experience and to be recommended content which they might browse.
- Content standards are established, metadata standards are emerging and user models are not yet standardised.
- There is enormous opportunity and a great need to include globalization and internationalization in the personalized future of the web.
Olaf-Michael Stefanov, Co-administrator, Multilingual Website at JIAMCATT, talked about how "An Open Source tool helps a global community of professionals shift from traditional contacts and annual meetings to continuous interaction on the web". The talk described the challenges of maintaining and developing a multilingual web site with open source software tools and crowd-sourced translations for a community of professional translators and terminologists working for international organizations and multilateral bodies. That community has no budget and depends on members' contributions in kind, but it has been growing since 1987, and it uses an Open Source tool which supports multilingualism to provide a complex support site for an international working group on language issues. The talk explored how use of the Tiki CMS Wiki Groupware software made it possible to provide an ongoing interactive support site for JIAMCATT, helping convert the "International Annual Meeting on Computer-Assisted Translation and Terminology" into an ongoing year-round affair. The site, which is run without a budget and on the spare time of members, is nevertheless fully bilingual English-French, with parts in Arabic, Chinese, Russian and Spanish (all official languages of the United Nations) as well as some German. Key points:
- Use of the online multilingual community tool has increased participation dramatically.
- JIAMCATT working groups include translation support (tools & services), machine translation, standardization and interoperability of systems, tools for interpreters, Arabic translation tools, and UN Terminology.
- The Linport project aims to define a universal way to put everything needed for a translation project into one electronic package that can be transmitted among all stakeholders to improve communication and efficiency in translation.
- The Tiki has a useful way of showing what translations need updating on a given page as changes are made to other language versions - there is no source language.
The Q&A began with a request for more information from Alex about the implications for localization. He said that good localization needs additional metadata, including that related to identification and personalization. There was also a comment that there is not such a harsh clash between the business and academic worlds with regard to semantic web technologies. Alex replied that it's more an issue of adoption.
This session was chaired by Jörg Schütz of bioloom group.
Gerhard Budin of the University of Vienna presented the first talk in the Policy session, "Terminologies for the Multilingual Semantic Web - Evaluation of Standards in the light of current and emerging needs". In recent years several standards have emerged or have come of age in the field of terminology management (such as ISO 30042 (TBX), ISO 26162, ISO 12620, etc.) Different user communities in the language industry (including translation and localization), language technology research, industrial engineering and other domain communities are increasingly interested in using such standards in their local application contexts. This is exactly where problems more often than not arise in the natural need to adapt global and sometimes abstract, heavy-weight standards specifications to local situations that differ from each other. Thus the way standards are prepared needs to be adapted in such a way that different requirements from user groups and from local situations can be processed and taken into account appropriately and efficiently. The paper discussed innovative (web-service-oriented) approaches to standards creation in the field of terminology management in relation to different web-based user groups and semantic web-application contexts, integrating vocabulary-oriented W3C recommendations such as SKOS. The speaker integrated his experiences in the strategic contexts of FlareNet, CLARIN, ISO/TC 37 and in concrete user communities, e.g. in legal and administrative terminologies (the "LISE" project) and in risk terminologies (the "MGRM" project). Key points:
- Language-unaware and language-independent data modelling in the Semantic Web and for ontologies creates huge problems in localization and translation work. What we need is a modelling approach to terminological semantics that includes the modelling of cultural differences (only some of which are expressed linguistically).
- We need an integrative model of semantic interoperability in order to (semi-)automate cross-lingual linkages, inferences, translations and other operations.
- Lemon (an ontology-lexicon model for the multilingual semantic web) is one of the most promising approaches so far.
- We need to spend more time linking the many diverse projects together and bring forward successful elements into sustainable structures once a project ends.
- Networking among stakeholders is crucial to create standards that fulfill their purposes, and we need real life reference implementations of standards involving users.
- Linked Open Data is one of the most important initiatives at the moment, and we will see more of this, but we need standards to integrate and support it.
Georg Rehm, META-NET Network Manager at DFKI GmbH, presented "META-NET: Towards a Strategic Research Agenda for Multilingual Europe". META-NET is a Network of Excellence, consisting of 47 research centres in 31 countries, dedicated to fostering the technological foundations of a multilingual European information society. A continent-wide effort in Language Technology (LT) research and engineering is needed for realizing applications that enable automatic translation, multilingual information and knowledge management and content production across all European languages. The META-NET Language White Paper series "Languages in the European Information Society" reports on the state of each European language with respect to LT and explains the most urgent risks and chances. The series covers all official and several unofficial as well as regional European languages. After a brief introduction of META-NET the talk presented key results of the 30 Language White Papers which provide valuable insights concerning the technological, research, and also standards-related gaps of a multilingual Europe realized with the help of language technology. These insights are an important piece of input for the Strategic Research Agenda for Multilingual Europe which will be finalized by the beginning of 2012. Key points:
- META-NET is a network of excellence dedicated to fostering the technological foundations of the European multilingual information society. META-NET consists of 54 research organizations in 33 European countries.
- META-NET wants to assemble all stakeholders – researchers, LT user and provider industries, language communities, funding programmes, policy makers – in the Multilingual Europe Technology Alliance (META) so that they team up for a major dedicated push.
- Language Technology support varies greatly from language to language. The META-NET Language White Paper Series consists of 30 documents that inform about the state of one European language each in the digital age including the respective support through Language Technology. The results are alarming: to give Machine Translation as an example, 23 out of the 30 European languages that were examined currently suffer from very limited quality and performance of MT support.
- In a large-scale vision and strategy building process, META-NET consulted with hundreds of LT experts, researchers, partners in collaborating projects (such as, among others, Multilingual Web and its successor, Multilingual Web LT), language professionals, officials, policy makers and administrators in order to provide input for the META-NET Strategic Research Agenda. This document will be finalized in early 2012 and presented to national and international politicians, administrators and funding agencies. The SRA will cover a timeframe from now to ca. 2025 and present big umbrella visions, key research goals and technology roadmaps.
Arle Lommel, Standards Coordinator for the Globalization and Localization Association (GALA), gave a talk entitled "Beyond Specifications: Looking at the Big Picture of Standards". In the localization industry standardization has been seen primarily as a technical activity: the development of technical specifications. As a result there are many technical standards that have failed to achieve widespread adoption. The GALA Standards Initiative, an open, non-profit effort, is attempting to address areas that surround standards development (education, promotion, coordination of development activities, development of useful guidelines and business cases, and non-technical, business-oriented standards) to help achieve an environment in which the needs of various user groups will help drive greater adoption of standards. Key points:
- Localization standards tend to be written by geeks for geeks, and the people who need to actually use them may not understand them.
- Obstacles to use of standards: yet another format can complicate rather than simplify the situation if not widely embraced; they may simply be unimplementable; scope or specifications may be unclear; unanticipated evolving use cases and feature creep may dilute effectiveness; it's not clear whether it saves money.
- Coordination is needed to avoid incompatible standards. A centralised coordination makes this simpler.
- There needs to be guidance on how to use standards, the business case and in-depth training.
- GALA aims to help standards groups promote what they are doing.
During the Q&A, it was suggested that it would be useful to have a summary of all the standards from the workshop - a glossary of alphabet soup we've talked about. (This was developed further in the discussion sessions on the following day.) There was also some discussion about whether there are too many standards, and whether we can find a way to merge things to make life simpler. And a final set of questions focused on MT support and roadmaps related to the presentations.
This session was chaired by Jaap van der Meer of TAUS.
Workshop participants were asked to suggest topics for discussion on small pieces of paper that were then stuck on a wall. Jaap then led the group in grouping the ideas and selecting a number of topics for breakout sessions. People voted for the discussion they wanted to participate in, and a group chair was chosen to facilitate. The participants then separated into breakout areas for the discussion, and near the end of the workshop met together again in plenary to discuss the findings of each group. Participants were able to move between breakout groups.
At Limerick we split into the following groups:
- Standardization, led by Gerhard Budin
- Translation Container Standard, led by Manuel Tomás Carrasco Benitez
- LT-Web, led by David Filip
- Multilingual Social Media, led by Timo Honkela
- Best Practices (User focus), led by Silvia Rodriguez
Summaries of the findings of these groups are provided below, some of which have been contributed by the breakout group chair. A number of groups were keen to renew discussion on these topics at the next workshop in Luxembourg.
Translation Container Standard
Discussions in this group centred around the idea that, in practice, translation seems to include a lot of files being e-mailed around, and translation tools create packages, which are not always interoperable – so what can be done to achieve automation and interoperability?
Many issues were discussed. Some were addressed in Linport.
There are a couple of key workflow issues to consider. First, can I send you something and can you immediately use it? This was the focus of Linport. However, there's often a focus on the containers and not on the concrete content formats. Standards are often too narrowly focused on particular use cases. Interoperability-Now! was slightly broader, Linport even more so. Secondly, can we merge all of the efforts into one? At the very least, we want to avoid overlapping development of the same functionality. We can't just focus on being a translation-focused project (Linport); we need to look for a broader scope.
A question arose about independent translators, who don't have the bandwidth to participate in these initiatives.
LT-Web
- NIF format (Sebastian): annotated (linked open) data can contribute to test suites, if NIF (meta) data is close to what is used in the localization area. That can be an input to the test suites that need to be developed for MultilingualWeb-LT.
- SDL (Matthias Heyn): all three scenarios are interesting: deep web <> LSP, surface web <> real time MT, deep web <> MT training. Details are important.
- ISO (Monica): reference implementations and test suites are a good idea. People who want to apply standards will like that. Data categories are very critical. Mappings of data categories can help to make language resources relevant for (localization area) clients. But these clients already have their own categories. So all you can do in a standard is specify what should be used ("best practices") and give examples (test suite) of input and output.
- CMS (Matthias again): for coupling between CMS and TMs, there are industry solutions already. The same mistakes are repeated again and again, e.g. translation is handled inside the system while the supply chain, user exception issues, etc. are forgotten in the setup of the system. Normally you have XML mapping tables, so a solution is already here. MultilingualWeb-LT needs to be clear about what is needed here. A lot of connectors do exist. Everybody has their own idea of what is important. It is important to place the effort of MultilingualWeb-LT adequately.
Summary: Maybe best practices, test suites and implementations that demonstrate the usage of the best practices, are needed more than "solving world hunger". Open questions: what data categories to tackle, how that relates to ITS.
Multilingual Social Media
One important topic discussed was crowd sourcing for translation, emphasizing the Facebook use case, where half a million people are participating in choices of terminology for 75 languages. Important criteria are speed, cost, quality, trustworthiness. Lionbridge said that a lot of trustworthiness comes from the fact that crowd sourced content has a local feel.
People have many motivations to participate: the chance to make decisions, peer recognition, to see contributions made visible, national pride, etc.
Hybrid approaches of professional plus crowd source localization are also feasible and practical.
There are also two cases for cultural differences: the design of the product, and interacting with the user generated content. For instance, family relationships are not always directly translatable, since they mean different things in different cultures. Multilingual mining of social media can be very useful for marketing analytics.
During the plenary session, the question was asked about how to control quality assurance in the face of contributions from so many diverse sources. Timo responded that if there are many people in a local community, mistakes get discovered sooner or later. On the other hand, professional translators make correct translations, but may not adapt to the market completely. A rule of thumb: if a use of a term is shared by a large number of people, it's preferred over the official translation.
Best Practices (User focus)
Throughout the whole session, this breakout group had a strong focus on users, evaluating their current needs and expectations regarding multilingual websites.
1. Web accessibility and usability issues
The Web Accessibility Initiative (WAI) of the World Wide Web Consortium (W3C) released the Web Content Accessibility Guidelines (WCAG 2.0) in December 2008. These guidelines aim to be testable, addressing one of the main criticisms of the previous version, WCAG 1.0 (May 1999). However, to date, validating website accessibility has not been an easy task for web designers and developers, who find the guidelines not only reasonably difficult to implement, but also too restrictive and time-consuming. Studies have shown that success criteria are often met only up to the first level (A), because accessibility issues are sometimes not seen as being as important as other web design parameters. This is partially due to:
- the language and style the guidelines are written in
- sometimes content can be subject to interpretation (especially regarding principle 3: understandable).
- supplementary documentation to WCAG 2.0: The guidelines are also interlinked with four other documents which ideally should help developers and designers to implement them.
One of the suggestions proposed is to come up with a simplified version of existing WCAG 2.0 documents.
Note from Richard: There is some work being currently done on these.
When it comes to dealing with multilingual websites, implementation of WCAG 2.0 becomes even more complicated. Some of the accessibility concerns observed by the WAI should also be taken into account during the localization process (subtitling, audio-description, alternative content, focus order…). Sign language was a particularly interesting issue for debate. When localizing websites and videos, should we not include sign language interpretation in embedded videos too? After all, sign languages are often the native languages of deaf people. We pointed out, though, that they are usually rejected by customers because they are "too expensive". All these things considered, one might even think about web accessibility as a localizable element of the web: the accessibility level achieved in the original product should be maintained in the target languages. Moreover, negotiations could occur with the client to improve the level of accessibility for the localized websites, even if the original source website was not 100% accessible. However, little attention is currently paid to web accessibility when performing localization tasks. In order to improve this situation, it has been suggested to enhance communication across the whole web development cycle. That is, there is a need for further training, as well as to bring together experts in different fields, getting developers, translators, localizers and designers involved in the implementation of web accessibility guidelines.
Now, from the Web Localization industry perspective, it is important to make customers understand the need to observe these web accessibility guidelines too. The example of a "rainbow menu bar" popped up. The idea might seem attractive, but maybe we are restricting the accessibility level of the website, thus leaving some communities of users aside. It is hard to find a compromise between what the client wants and what is technically and ethically accessible. In this sense, one of the participants said that it is useful to have a session with customers, letting them "experience" what restricted accessibility actually means. Accessibility workshops for developers are very helpful, too. Most companies are not aware of what is required in creating a multilingual/localized website, let alone any accessibility issues, and in a lot of cases, their web developers do not have sufficient knowledge in this area either. Therefore, it takes some "education" sessions with customers to explain to them that, for instance, using heavy large imagery for countries where Internet connection speeds are relatively slow (e.g. China) is not advisable, nor is using Flash on the Home page for countries where most people access the Internet from their mobile phone (e.g. Russia). There are also even simpler aspects to take into account, such as the color scheme: for example, in Japan white is the color of death.
Answers from the audience:
- Indeed, guidelines are often based around checklists, which are convenient for QA, but not so good for designers – it is often better to guide designers while they're actually designing. WAI is starting a new EC-funded project which may provide an opportunity to input web accessibility concerns and get them addressed.
- The suggestion of inclusion of sign language videos is great, especially combined with subtitles for screen readers.
2. MLW for educational purposes
Within the educational domain, (multilingual) websites should be even more adapted to the user, although sometimes more attention is paid to the purpose or function of the website than to the end user. The content, as well as the user interface, should be adapted to an audience made up mostly of children and/or young students (it is often the case that teachers implement eLearning platforms suitable for them, but not for their students, in terms of clarity and simplicity of the content and organizational design). Among others, these are the recommendations that were proposed when thinking about what an education-oriented website should look like: larger font sizes, simple/simplified content, the right selection of images (yes, here the rainbow would fit), appropriate vocabulary complexity, etc.
Notes: There was another W3C workshop on Plain Language issues in Germany (Berlin) on Sept 19 which could add helpful insights: http://www.xinnovations.de/w3c-tag/articles/w3c-tag-2011.html – only accessible to people with knowledge of German, though.
3. MLW in institutional organizations
In general, international organizations and institutions tend to ignore the users' needs in terms of web usability and, in particular, as regards language choice. It is often the case that they create the websites thinking first about themselves as an entity (image of the organization, internal communication, documents repository…), but not about the end users. As an example, we recalled one of the presentations (Danielle's) on international health care institutions and their multilingual portals. For instance, the user support on the EU health portal is considered poor because it is only offered in English. During the Q&A session, people seemed to justify this by explaining the technical challenges (human resources, funding…) behind such an implementation. However, users often do not know about internal problems: the only thing that is important for them is having the information that they need available in their language. Organizations should, therefore, at least get a handle on their accessibility and localization issues, so that they can make improvements when budget allows.
Generally speaking, the main problems that we have encountered when looking at institutional websites are the following:
- Institutional websites (such as EU's, an institution that supports multilingualism) should protect endangered languages by making their websites available in those languages.
- When different languages are offered, content usually differs between languages. There is always a dominant language (mainly English) with information that has not been translated into the rest of the languages the site claims to offer.
- When browsing the web in a specific language, some link names are not translated, and some others lead only to English content (without previously warning the user).
- The result of all this is that the user/reader is deceived: a lot of content seems to be available in his/her language, but it is not always the case.
- Sometimes layout is not consistent across languages
Answers from the audience:
- Large multilingual websites are often expensive to maintain
- Christian: In Germany, there are awards for accessible websites: www.biene-award.de
- Dorothea: Nevertheless, accessibility is not a big issue as I perceive it, rather a necessity by law for public institutions, often not well understood
- Maria Pia: In Italy there is a Legislation on Accessibility: Legge Stanca http://www.pubbliaccesso.gov.it/english/index.htm
- Dorothea: Actually, here is a list of legislation in other countries of the world, including EU countries: http://www.standards-schmandards.com/projects/government-guidelines/
- We are entering the mobile world. If you provide mobile-ready content, you almost automatically get accessibility.