W3C Workshop Program:
New Horizons for the Multilingual Web
7-8 May 2014 (8-9 May: LIDER workshop), Madrid
The MultilingualWeb community develops and promotes best practices and standards related to all aspects of creating, localizing, and deploying the Web across boundaries of language. This W3C workshop aims to raise the visibility of existing best practices and standards for dealing with language on the Internet and to identify and resolve gaps that keep the Internet from living up to its global potential. It will be held in Madrid, hosted by the Universidad Politécnica de Madrid; see information about the venue in the call for participation.
Live stream: a live stream for the event is available.
After the keynote speech, each main session on the first day and a half will contain a series of talks followed by some time for questions and answers. The afternoon of the second day will be dedicated to an Open Space discussion forum, where participants can discuss the themes of the workshop in breakout sessions. All attendees participate in all sessions.
Aligned with the event, the LIDER project is organizing a Workshop on Linked Data, Language Technologies and Multilingual Content Analytics. This workshop is running 8 May (afternoon, in parallel to the open space sessions) and 9 May (morning). A separate registration is required.
Related links: Workshop report • LD4LT / LIDER workshop report • About W3C
Félix Pérez Martínez
Director of the Escuela Técnica Superior de Ingenieros de Telecomunicación de la UPM (ETSIT UPM)
Victor Robles Forcada
Director of the Escuela Técnica Superior de Ingenieros Informáticos de la UPM (ETSIINF UPM)
Welcome and Introductions
Keynote: Multilingual User Generated Content at Wikipedia scale
Pau Giner, David Chan and Santhosh Thottingal
Best Practices on the Design of Translation
abstract Wikipedia is one of the most multilingual projects on the web today. In order to provide access to knowledge for everyone, Wikipedia is available in more than 280 languages. However, the coverage of topics and their level of detail vary from language to language. The Language Engineering team at the Wikimedia Foundation is building open source tools that support translating content when creating new articles, in order to ease the diffusion of quality content across languages. The translation process in Wikipedia presents many different challenges. The translation tools aim to make the translation process more fluent by integrating resources such as translation services, dictionaries, and information from semantic databases such as Wikidata.org. In addition to the technical challenges, ensuring content quality is one of the most important aspects considered in the design of the tool, since any translation that does not read naturally is not acceptable to a community focused on content quality. This talk will cover the design (from both technical and user experience perspectives) of the translation tools, and their expected impact on Wikipedia and the Web as a whole.
DFKI, Co-Founder of Yocoy
Always Correct Translation for Mobile Conversational Communication
abstract The coexistence of different languages and cultures poses a true challenge for global mobility and communication in business and personal life. Machine translation remains one of the biggest challenges in artificial intelligence, and especially for language technology. Nowadays most machine translation systems rely on machine learning and statistical methods, taking parallel texts as their training data. Such systems can offer very broad coverage, but their output often lacks accuracy and fluency. They therefore cannot be employed for face-to-face communication, because translation errors cause too much trouble for people in foreign countries. In this talk we present a mobile application called "Yocoy Language Guide", used for face-to-face communication between people speaking different languages. At the heart of the app is a new technology called ACT (Always Correct Translation), which delivers 100% correct translation for everyday dialogues across five languages: German, English, Spanish, French, and Chinese. The Yocoy Language Guide has been downloaded by more than 200,000 people from all over the world.
W3C Internationalization Activity Lead
New Internationalization Developments at the World Wide Web Consortium
abstract The W3C Internationalization Activity is striving to ensure that the technology of the World Wide Web remains world wide in scope. The Internationalization Working Group works with other W3C working groups and liaises with other organizations to help ensure universal access to the Web, regardless of language, script or culture. It does so by reviewing and contributing to developments in Web technology. It also provides internationalization-related education and outreach for specification developers, content developers and implementers. This talk will list and describe some of the things at the top of the group's radar in May 2014, and some of the recent success stories.
Charles McCathie Nevile
Multilingual Aspects of Schema.org
abstract Schema.org produces a continuously evolving metadata schema that is already used on around 10% of websites. The talk will briefly explain what it is and how to use it, then explore what it means for developers, especially in the context of a multilingual Web, and how they can help fix things that don't work for them.
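As a hedged sketch of what such markup can involve (not taken from the talk), schema.org's real `inLanguage` property labels content with a BCP 47 language tag; below, a JSON-LD snippet is assembled in Python, with the page name and tag as invented examples.

```python
import json

# Illustrative only: a schema.org description of a Spanish-language page,
# serialized as JSON-LD. "WebPage", "name" and "inLanguage" are genuine
# schema.org terms; the values are made up for this sketch.
page = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "name": "Página de ejemplo",  # page title, in Spanish
    "inLanguage": "es",           # BCP 47 language tag for Spanish
}
jsonld = json.dumps(page, ensure_ascii=False, indent=2)
print(jsonld)
```

In practice such a block would be embedded in the page inside a `script type="application/ld+json"` element, where search engines and other consumers can pick up the language declaration.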
UN Food and Agriculture Organization
Bridging the Online Language Barriers with Machine Translation at the United Nations
abstract The Food and Agriculture Organization of the United Nations is introducing a widget for machine translation into Arabic, Chinese and Russian on its website, FAO.org. This is a new experience because of the technical challenges, the political environment and the internal dynamics. The presentation discusses the approach adopted, the steps taken and the feedback mechanism developed. This experience should be useful to international organizations considering the adoption of machine translation as part of their vision for a multilingual web.
Multilingual Web Considerations for Multiple Devices and Channels – When should you think about Multilingual Content?
abstract It is no longer satisfactory to consider only the PC when developing websites. Increasingly, people access data via tablets and smartphones. We need to consider not only how we display our multilingual content but how it looks on all devices and how it downloads through the various channels. Is it OK for mobile phone users to have to scroll back and forth to see a conventional website on their phone because it has not been customised for that device? Is it OK for iPad users to switch to Puffin because a Flash video has been embedded in your site? Should a user on GPRS be happy to wait more than 90 seconds to download a page on your site? One of the most common and expensive mistakes made when developing a website is not considering the multilingual aspect until the site has been completed, approved and published. So when should you consider multilingual content, and how should you manage it to make it more cost effective? We believe the answer is before you create a single sentence. If writers use structured content to create the site, reuse during translation can be maximised, sometimes saving over 30% of the cost compared to non-structured content.
Space, expansion and display need to be considered to ensure there are no issues displaying in different locales. Perhaps most crucially, the language used needs to be ‘translatable’: we have seen multiple examples where a marketer has produced a ‘brilliant’ concept in English but the translated version has a very different effect.
Universidad Europea - Madrid
Post-editing Practices and the Multilingual Web: Sealing Gaps in Best Practices and Standards
abstract With post-editing becoming a widespread activity in the translation/localisation industry, there are still some gaps to be filled in best practices and standards. If post-editing is to contribute fully to ensuring the multilingual success of the World Wide Web, there are a number of issues yet to be dealt with:
- Is there a real benefit in using standards for post-editing purposes in daily practice?
- Do annotation tags make sentences slightly less understandable and more cryptic for post-editors?
- In cases where there is more than one annotation per phrase, the post-editor may miss the visual continuity of the sentence, spend too much time rereading it or even leave syntax mistakes from the MT uncorrected. How should this information be presented (if at all)?
- Should post-editors be allowed to insert annotations?
This talk aims to raise awareness of these questions, which arise in everyday scenarios where the post-editor is confronted with pressures of productivity, quality and price.
Spain Tourism and Cultural Website
abstract As Content Manager at SEGITTUR, I am responsible for managing several international and promotional websites: www.spain.info, www.spainisculture.com and www.studyinspain.info. These websites are available in 18 languages. I will explain how we manage content in so many languages and how search engines are changing our methodology and overall content strategy.
CNGL KDEG Trinity College Dublin Ireland
Marking Up Our Virtue: Multilingual Standards and Practices in the Digital Humanities
abstract This talk will introduce the area of Digital Humanities (DH) and Digital Cultural Heritage and discuss the relevance of linguistic standards on the web to research and best practice in the area. These domains represent a vibrant area of research activity for a wide number of academic disciplines. At the heart of DH is the use of digital approaches to the interpretation, collation, comprehension and dissemination of cultural and historical artefacts. Examples of artefacts range from 17th-century accounts of rebellions in Ireland, through early 20th-century grey literature concerning the World Wars, to 14th-century commentaries on herbs. Humanists research the texts, the manuscripts themselves and the networks that surround these artefacts in a great many ways, for example by attempting to trace references to particular paramilitary units, or by seeking the assistance of the general public in tagging and marking up hand-written artefacts. Language is a highly complex topic in this domain. Many documents may be written in several archaic forms of one or more languages, with highly informal or irregular content, and may have accompanying markup in yet another language. The general assumption that linguistic resources are available and applicable is therefore extremely subtle in this domain. The incredible richness of the content, combined with its social and cultural value, means there is a constant challenge in choosing the best approaches to digitisation. The realisation that standards are required to create enduring digital cultural archives has been met with the difficulty of choosing which standard, and of involving non-technical experts in best practice.
The Multilingual App Toolkit Version 3.0: How to create a Service Provider for any Translation Service Source as a key Extensibility Feature
abstract Over the last few years, MLW attendees have heard about the release and evolution of the Microsoft Multilingual App Toolkit (MAT) for Windows Store and Phone apps and the MS Translator service integration into the toolkit. In this talk, we will demonstrate MAT v.3 and will focus on how to create a service provider for any translation service source as a key extensibility feature. We will also briefly address new features and improvements in this release. As before, we continue to share our interest in support for the XLIFF standard as a showcase implementation.
The Difference made by Standards oriented Processes
abstract We will discuss the benefits of changing a translation workflow to largely standards-driven and automated processes. We will give some insight into the optimizations this allows, and we will highlight the portions of the translation process where we still see a need for better standardization. An interesting aspect of standardization in language technology to date is that it has mainly helped define tighter internal processes, but has not significantly improved interoperability across the industry. The change, and the challenges, in moving from the historical file- and batch-based content translation approach to today’s highly interactive on-demand and on-change process will be another focus point. Finally, we will discuss what we see as the final frontier in today’s content creation, translation, and presentation model.
Content Relevancy starts with understanding your international Audience
abstract Many organizations nowadays are (becoming) publishers of vast amounts of content, driven by the internet and modern behavior. Consumers and citizens expect to find relevant information by “googling”, and they are impatient: if they do not find what they are looking for, if the content is not relevant, or if they do not understand what is written, they will voice their opinions on social channels such as Twitter and Facebook, but also on many other social networks and blogs. The challenge of reaching the modern citizen explodes when the audience is international. Topics and themes differ per country, or even between ethnic populations within a country. Market research has been the traditional way to understand audiences, but it is expensive, cumbersome, and poor at capturing the thoughts of the younger (under-30) generation. The conversations citizens have on social channels such as forums and blogs, and the remarks they make on micro-blogs like Twitter, create a new data set which can be used to understand your citizens. SDL will present how to optimize your translation quality so that your content is easier for your audience to find and more relevant to them.
Towards an open Framework of E-services for Translation Business, Training, and Research by the example of Terminology Services
abstract Coming from the three “worlds” – industry, academia, and freelancing – and being a provider, a trainer, and a consumer at the same time, I call for cooperation amongst our communities – Data, Language Technology, and the Web. I will present several use cases in translation and terminology work that address gaps with regard to standards and the multilingual Web. One example is the export of terminology into the Linked Data format. Through the example of terminology services, I hope my presentation will serve as a “seed” to make our goals mutual and to plan next steps towards this three-dimensional open infrastructure of e-services for translation business, training, and research.
David Filip, David Lewis and Arle Lommel
CNGL at University of Limerick, CNGL at Trinity College Dublin, DFKI
Quality Models, Linked Data and XLIFF: Standardization Efforts for a Multilingual and Localized Web
abstract The last year has seen unprecedented synchronised activity in the advancement of standards to support the multilingual and localized web. At the W3C, the MultilingualWeb-Language Technology Working Group completed its work in producing the Internationalization Tag Set (ITS 2.0) Recommendation. This provides content metadata addressing a range of content management, localization and language technology integration issues. It was progressed in close collaboration with the development of the XLIFF 2.0 specification at OASIS, with a mapping between XLIFF and ITS 2.0 available. In parallel, the QTLaunchpad project, in consultation with GALA, has produced a specification for Multidimensional Quality Metrics (MQM) relevant to translation and its use of language technology, which is also aligned with ITS 2.0. In this presentation, members of these activities will briefly introduce these developments, explain how they relate to each other and highlight the path for further work, in particular through the use of linked data.
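As a hedged illustration (not code from the presentation), ITS local markup is plain attribute-based metadata that tools can read with standard XML machinery; the snippet below uses the real ITS namespace and Python's standard library, while the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C Internationalization Tag Set (ITS).
ITS_NS = "http://www.w3.org/2005/11/its"

# Invented sample document: its:translate="no" is the ITS "Translate"
# data category, telling translation tools to leave that span untouched.
doc = """<text xmlns:its="http://www.w3.org/2005/11/its">
  <p>Click <code its:translate="no">Save</code> to store the file.</p>
</text>"""

root = ET.fromstring(doc)
# Collect the text of every element flagged as non-translatable.
protected = [el.text for el in root.iter()
             if el.get(f"{{{ITS_NS}}}translate") == "no"]
print(protected)
```

A localization workflow would typically use such flags to lock UI strings or product names before content is handed to translators or machine translation.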
University of Rome
Babelfying the Multilingual Web: state-of-the-art Disambiguation and Entity Linking of Web Pages in 50 Languages
abstract The creation of huge, broad-coverage knowledge bases and linguistic linked data such as DBpedia and BabelNet, made possible by the availability of collaboratively curated online resources such as Wikipedia and Wiktionary, has not only engaged researchers, but has also attracted big industry players such as Google and IBM, who are moving fast towards large-scale knowledge-oriented systems. The semantic annotation of arbitrary Web pages is an important use of multilingual linguistic linked data. However, most of the time the task is limited to linking (some) named entities or works only on a restricted number of languages. In this talk we will present Babelfy (http://babelfy.org), a new, state-of-the-art, wide-coverage system which leverages a novel graph-based algorithm to semantically annotate and link arbitrary text, such as Web pages, written in any language, with both concepts, i.e. abstract meanings of words or domain terms, and named entities from BabelNet (http://babelnet.org), a huge multilingual semantic network and linked data set.
Victor Rodríguez Doncel
Universidad Politécnica de Madrid
Towards high quality, industry-ready Linguistic Linked Licensed Data
abstract The application of Linked Data technology to the publication of linguistic data promises to facilitate the interoperability of such resources and has led to the emergence of the so-called Linguistic Linked Data Cloud (LLD), in which linguistic data is published following the Linked Data principles. Three essential issues need to be addressed for such data to be easily exploitable by language technologies: i) appropriate machine-readable licensing information is needed for each dataset, ii) minimum quality standards for Linguistic Linked Data need to be defined, and iii) appropriate vocabularies for publishing Linguistic Linked Data resources are needed. We propose the notion of Licensed Linguistic Linked Data (3LD), in which different licensing models might co-exist, from totally open to more restrictive licenses through to completely closed datasets.
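As a hedged, minimal sketch of issue i) above (machine-readable licensing information), a dataset description can carry a license statement with the real `dcterms:license` property; the dataset URI and license choice below are invented examples, and the Turtle is assembled by hand with the standard library only.

```python
# Illustrative only: one RDF triple, in Turtle syntax, attaching a
# machine-readable license to a (fictional) linguistic dataset.
dataset = "http://example.org/lexicon-es"                 # invented dataset URI
license_uri = "http://creativecommons.org/licenses/by/4.0/"  # example license

turtle = (
    "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
    f"<{dataset}> dcterms:license <{license_uri}> .\n"
)
print(turtle)
```

Publishing such a triple alongside the data lets aggregators and language-technology tools filter datasets by license terms automatically, rather than relying on prose on a project page.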
At the Posada de la Villa
Dinner sponsored by Verisign with support from Lionbridge
details Dinner on May 7 will be served at 20:30 at the Posada de la Villa, Calle Cava Baja, 9. Note that the restaurant is not within easy walking distance of the Workshop venue. It is roughly six kilometers away and can be accessed via Metro lines 1 and 2 (Tirso de Molina station, approx. 500 meters) and 5 (La Latina station, approx. 250 meters). If any attendees require assistance in reaching the restaurant, please let Ms. Nieves Sande know in advance at the registration desk. For more information on transit options, see the separate flyer on the dinner.
Alta Plana Corporation
Sentiment, opinion, and emotion on the multilingual Web
abstract Two sorts of information co-exist on the multilingual Web, intertwined: Facts and Feelings. Extraction of each type is complicated by idiom, expressive vocabularies, metaphor, and cultural context that often translate poorly from one language to another, if at all. This talk will describe sentiment's business value and survey technical approaches to meeting the extraction challenge.
Universidad Politécnica de Madrid
The LIDER Project
abstract The LIDER project aims at establishing a new Linked Open Data (LOD) based ecosystem of free, interlinked, and semantically interoperable language resources (corpora, dictionaries, lexical and syntactic metadata, etc.) and media resources (image, video, etc. metadata) that will allow for free and open exploitation of such resources in multilingual, cross-media content analytics across the EU and beyond, with specific use cases in industries related to social media, financial services, localization, and other multimedia content providers and consumers. In some cases, we will explore new business models and hybrid licensing schemes for the use of Linguistic Linked Data in commercial settings for free-but-not-open resources.
Martin Brümmer, Mariano Rico, Marco Fossati
University of Leipzig, Universidad Politécnica de Madrid, Fondazione Bruno Kessler
DBpedia: Glue for all Wikipedias and a Use Case for Multilingualism
abstract Fourteen official DBpedia chapters exist apart from the English one: Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish. However, there is still a lot to be done in terms of DBpedia internationalization. This talk will cover:
- Deployment of local DBpedia chapters acting as the backbone of the emerging national Open Data landscapes, with an emphasis on the liaison with governmental and civic society organizations. It will then focus on the task of mapping language-specific knowledge representations coming from the Wikipedias into a unified culture-agnostic ontology.
- The overwhelming benefits of an increasingly multilingual DBpedia as a source of language resources and a resource for Natural Language Processing.
Jorge Gracia and José Emilio Labra
Universidad Politécnica de Madrid, University of Oviedo
Best Practices for Multilingual Linked Open Data: a Community Effort
abstract The W3C Best Practices for Multilingual Linked Open Data community group was born one year ago during the last MLW workshop in Rome. It continues to lead the effort of a large community towards a shared view of the issues that multilingualism raises on the Web of Data, and of their possible solutions. Despite our initial optimism, we found the task of identifying best practices for ML-LOD a difficult one, requiring a deep understanding of the Web of Data in its multilingual dimension and in its practical problems. In this talk we will review the group's progress so far, mainly in the identification and analysis of topics, use cases, and design patterns, as well as the future challenges.
Josef van Genabith
Quality Machine Translation for the 21st Century
abstract Over the last 10 years Machine Translation (MT) has started to make substantial in-roads into professional translation (localisation and globalisation) and our daily lives. Much research in machine translation has concentrated on in-bound translation for gisting. We frequently use (freely available) applications like Google Translate or Bing Translate to bridge language barriers on the Web, to make sense of information presented in a language we are not familiar with. For this to be successful, MT output has to be understandable, in the sense that it gets most of the basic content encoded in the source language across into the target language – this is often referred to as adequacy. Crucially, the MT does not have to be perfectly fluent and may even make some mistakes or omissions that can be recognised, tolerated and compensated for by the human user given the context. For many applications, however, in particular out-bound translation (rather than in-bound for gisting), translation quality (fluency and adequacy) is crucial. Most MT engines are now statistical (SMT), i.e. they learn from previous human translations (bitext) how to translate new text. Barriers to high quality (S)MT include lack of training data; statistical models that do not fully capture large translation divergences between certain “challenging” language pairs (e.g. substantial reordering, translation into morphologically rich languages, etc.); and limits in how translation outputs are evaluated automatically and by humans. The QT21 project proposal brings together key stakeholder constituencies, including leading MT teams in Europe, language service provider companies, professional translation and localisation organisations, think-tanks and a large public translation user, to systematically address these quality barriers, focusing on “challenging” languages.
Pedro L. Díez Orzas
Linguaserve I.S. S.A.
Multilingual Web: Affordable for SMEs and small Organizations?
abstract Multilingual Web Language Technology solutions are often affordable only for certain companies and organizations. The reasons span several areas, such as costs, infrastructure and specialized personnel:
- Budget to implement and maintain a continuous multilingual web activity.
- Know-how and maturity level for understanding and adapting new technologies to their special needs.
- Internal or external personnel to take care of technical and management tasks to develop multilingual web strategies.
In order to implement MLW solutions, these organizations can be helped by companies with expertise in web localization, but they usually also need software integration providers or an internal technical department to approach certain solutions. Turning a monolingual website into a multilingual one is not a problem, even for small companies; but gaining competitive advantage from the latest technology for continuous, incremental multilingual web activity is very often something that SMEs say they cannot afford. The fact is that the European business network consists mostly of SMEs, and one of the peculiarities of the European market is its multilingual composition (as opposed to the US market, for instance). Also, reaching external and global markets requires the capacity for multilingual communication strategies, not only for the needs of a sustainable and incremental multilingual web, but also for all other needs surrounding online commercial activities. This presentation offers a medium-term vision and introduces some factors that affect making these technologies, solutions and services more affordable for a wide range of small and medium enterprises and public organizations, based on five key factors, among others:
- Best practices for web creation
- Globalization or interoperability standards
- Metadata and data standards
- Adaptive solutions and services for companies of different sizes
- Flexible complementary multilingual communication services
Societies that succeed in bringing their SMEs’ creativity and outstanding entrepreneurial initiatives to the global markets will be the successful societies of the near future.
Universal Access: Barrier or Excuse?
abstract While IDN ccTLDs and IDN TLDs work as expected within the DNS system, they do not work well in the real world. This could be why IDN TLDs are not being adopted as quickly as desired. This paper looks at two issues: what the barriers are to the effective use of IDN TLDs, and who can help address them. The paper doesn't provide answers but raises questions that, I hope, will be answered in part during subsequent responses.
Internationalized Domain Names: Challenges and Opportunities, 2014
abstract Today Internationalized Domain Names (IDNs) are getting more attention than at any other time since they were introduced into the Domain Name System in 2000, but they still have a long path to general adoption. IDNs are far from being ubiquitous and trusted. Verisign, as a registry operator and manager of over 1M IDNs, plays a small part in an ecosystem comprising not only registries but also developers, content creators, and policy and standards bodies, all attempting to further internationalize, or locally localize, the identifiers on the Internet. We therefore intend to highlight some of the challenges we have found through our experience as a registry operator, and to encourage all players to make IDNs a ubiquitous and trusted product for the multilingual web.
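As background, a hedged sketch of the encoding step that underlies IDNs: non-ASCII labels travel through the DNS in an ASCII-compatible Punycode form. Python's built-in `idna` codec implements the original IDNA 2003 rules (current registry practice follows IDNA 2008, usually via the third-party `idna` package); the domain below is an invented example.

```python
# Unicode form -> ASCII-compatible ("xn--") form sent to the DNS.
ace = "bücher.example".encode("idna")
print(ace)

# ASCII-compatible form -> Unicode form shown to users.
roundtrip = ace.decode("idna")
print(roundtrip)
```

The "xn--" prefix marks a Punycode-encoded label, which is part of why applications (browsers, mail clients, registrars) must all cooperate for IDNs to appear seamless to end users.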
Digital Language Extinction as a Challenge for the Multilingual Web
abstract The large, pan-European study “Europe’s Languages in the Digital Age”, published as the META-NET White Paper Series in late 2012, showed that 21 of the 31 languages investigated are in danger of digital extinction: the support these 21 languages receive through language technologies is either weak or non-existent. In early 2014 the comparison was extended by ca. 20 additional languages, and these updated results are even more alarming, because far more than the original 21 languages are in danger of becoming digitally extinct. Most of these additional languages belong to a category usually referred to as “small languages”, a term used, generally speaking, almost synonymously with “under-resourced languages”. The presentation will first describe the original approach of the META-NET study “Europe’s Languages in the Digital Age” and then explain the updated results. It will also discuss the challenges this situation poses for the multilingual web and the corresponding needs, especially with regard to intensifying technology development and knowledge transfer from the better supported to the less supported languages.
[Chair, Thierry Declerck]
Explanation of the open space format for the afternoon, and selection of discussion topics. Topics are suggested by participants, and the most popular are allocated to breakout groups. A chair is chosen for each group from volunteers. There are also some pre-selected groups.
The LIDER workshop will run in parallel to the Open Space session.
Various locations are available for breakout groups. Participants can join whichever group they find interesting, and can switch groups at any point. Group chairs facilitate the discussion and ensure that notes are taken to support the summary to be given to the plenary.
Group reports and discussion
Everyone meets again in the main conference area and each breakout group presents their findings. Other participants can comment and ask questions.