MULTILINGUALWEB

Standards and best practices for the Multilingual Web

8 May
09:00

Machines - Part II

Seth Grimes

Alta Plana Corporation

Sentiment, opinion, and emotion on the multilingual Web

Two sorts of information co-exist on the multilingual Web, intertwined: Facts and Feelings. Extraction of each type is complicated by idiom, expressive vocabularies, metaphor, and cultural context that often translate poorly from one language to another, if at all. This talk will describe sentiment's business value and survey technical approaches to meeting the extraction challenge.

Asunción Gómez-Pérez

Universidad Politécnica de Madrid

The LIDER Project

The LIDER project aims at establishing a new Linked Open Data (LOD) based ecosystem of free, interlinked, and semantically interoperable language resources (corpora, dictionaries, lexical and syntactic metadata, etc.) and media resources (image, video, etc. metadata) that will allow for free and open exploitation of such resources in multilingual, cross-media content analytics across the EU and beyond, with specific use cases in industries related to social media, financial services, localization, and other multimedia content providers and consumers. In some cases, we will explore new business models and hybrid licensing schemes for the use of Linguistic Linked Data in commercial settings for Free but not Open resources.

Martin Brümmer, Mariano Rico, Marco Fossati

University of Leipzig, Universidad Politécnica de Madrid, Fondazione Bruno Kessler

DBpedia: Glue for all Wikipedias and a Use Case for Multilingualism

Fourteen official DBpedia chapters exist apart from the English one, namely Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish. However, there is still a lot to be done in terms of DBpedia internationalization. This talk will cover:
- The deployment of local DBpedia chapters acting as the backbone of the emerging national Open Data landscapes, with an emphasis on liaison with governmental and civil society organizations. The talk will then focus on the task of mapping language-specific knowledge representations coming from the Wikipedias into a unified, culture-agnostic ontology.
- The overwhelming benefits of an increasingly multilingual DBpedia as a source of language resources and a resource for Natural Language Processing (see the query sketch below for a flavour of such use).
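As a rough illustration of what using DBpedia as a multilingual language resource can look like in practice (not part of the talk itself), the sketch below queries the public DBpedia SPARQL endpoint for language-tagged labels of a single resource; the endpoint URL, the example resource and the language list are assumptions made for the example.

```python
# Sketch only: query DBpedia for labels of one resource in several languages.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
      <http://dbpedia.org/resource/Berlin> rdfs:label ?label .
      FILTER ( lang(?label) IN ("en", "es", "de", "ja") )
    }
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    # each binding carries the label text and its language tag
    print(binding["label"]["xml:lang"], binding["label"]["value"])
```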

Jorge Gracia and José Emilio Labra

Universidad Politécnica de Madrid, University of Oviedo

Best Practices for Multilingual Linked Open Data: a Community Effort

The W3C Best Practices for Multilingual Linked Open Data community group was born one year ago, during the last MLW workshop in Rome. It continues to lead the effort of a large community towards acquiring a shared view of the issues caused by multilingualism on the Web of Data and their possible solutions. Despite our initial optimism, we found the task of identifying best practices for ML-LOD a difficult one, requiring a deep understanding of the Web of Data in its multilingual dimension and in its practical problems. In this talk we will review the progress of the group so far, mainly in the identification and analysis of topics, use cases, and design patterns, as well as the future challenges.
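For context (this example is not taken from the group's deliverables), one of the simplest design patterns discussed in ML-LOD work is attaching language-tagged labels to a single resource rather than minting separate per-language resources; the resource URI below is hypothetical.

```python
# Sketch only: one resource, several language-tagged rdfs:label literals.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import RDFS

g = Graph()
resource = URIRef("http://example.org/resource/Madrid")  # hypothetical URI
g.add((resource, RDFS.label, Literal("Madrid", lang="en")))
g.add((resource, RDFS.label, Literal("Madrid", lang="es")))
g.add((resource, RDFS.label, Literal("マドリード", lang="ja")))

print(g.serialize(format="turtle"))
```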

Josef van Genabith

DFKI

Quality Machine Translation for the 21st Century

Over the last 10 years Machine Translation (MT) has started to make substantial inroads into professional translation (localisation and globalisation) and our daily lives. Much research in machine translation has concentrated on in-bound translation for gisting. We frequently use (freely available) applications like Google Translate or Bing Translate to bridge language barriers on the Web, to make sense of information that is presented in a language we are not familiar with. For this to be successful, MT output has to be understandable (in the sense that it gets most of the basic content encoded in the source language across into the target language – this is often referred to as adequacy). Crucially, MT does not have to be perfectly fluent and may even make some mistakes or omissions that can be recognised, tolerated and compensated for by the human user given the context. For many applications, in particular out-bound translation (rather than in-bound translation for gisting), however, translation quality (fluency and adequacy) is crucial. Most MT engines are now statistical (SMT), i.e. they learn from previous human translations (bitext) how to translate new text. Barriers to high-quality (S)MT include lack of training data, statistical models that do not fully capture large translation divergences between certain “challenging” language pairs (e.g. substantial reordering, translation into morphologically rich languages, etc.), and limits in how translation outputs are evaluated automatically and by humans. The QT21 project application brings together key stakeholder constituencies, including leading MT teams in Europe, language service provider companies, professional translation and localisation organisations, think-tanks and a large public translation user, to systematically address these quality barriers, focusing on “challenging” languages.
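As a toy illustration of the automatic-evaluation barrier mentioned in the abstract (not part of QT21), the sketch below computes the modified n-gram precision at the core of BLEU-style MT metrics; real metrics additionally apply a brevity penalty, smoothing and multiple references.

```python
# Toy sketch of modified n-gram precision, the core of BLEU-style MT evaluation.
from collections import Counter

def modified_ngram_precision(hypothesis, reference, n):
    hyp_ngrams = Counter(zip(*[hypothesis[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[reference[i:] for i in range(n)]))
    # Clip each hypothesis n-gram count by its count in the reference.
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return overlap / total if total else 0.0

hyp = "the cat sat on the mat".split()
ref = "there is a cat on the mat".split()
print(modified_ngram_precision(hyp, ref, 1))  # unigram precision
print(modified_ngram_precision(hyp, ref, 2))  # bigram precision
```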
[Chair, Hans Uszkoreit]
10:15
10:30
Coffee Break and Posters
11:00

Users

Pedro L. Díez Orzas

Linguaserve I.S. S.A.

Multilingual Web: Affordable for SMEs and small Organizations?

Multilingual Web language technology solutions are often affordable only for certain companies and organizations. The reasons span several areas, such as costs, infrastructure and specialized personnel:
- Budget to implement and maintain a continuous multilingual web activity.
- Know-how and maturity level for understanding and adapting new technologies to their special needs.
- Internal or external personnel to take care of technical and management tasks to develop multilingual web strategies.
In order to implement MLW solutions, these organizations can be helped by expert companies in the web localization sector, but they usually also need software integration providers or an internal technical department to approach certain solutions. Making a monolingual web become a multilingual web is not a problem, even for small companies, but taking competitive advantage of the latest technology for continuous and incremental multilingual web activity is very often something that SMEs say they cannot afford. The fact is that the European business network consists mostly of SMEs, and one of the peculiarities of the European market is its multilingual composition (as opposed to the US market, for instance). Also, reaching external and global markets requires the capacity for multilingual communication strategies, not only for the needs of a sustainable and incremental multilingual web, but also for all other needs surrounding online commercial activities. This presentation offers a medium-term vision and introduces factors that make these technologies, solutions and services more affordable for a wide range of small and medium enterprises and public organizations, based on five key factors, among others:
- Best practices for web creation
- Globalization or interoperability standards
- Metadata and data standards
- Adaptive solutions and services for companies of different sizes
- Flexible complementary multilingual communication services
Societies that succeed in bringing their SMEs’ creativity and outstanding entrepreneurial initiatives to global markets will be the successful societies of the near future.

Don Hollander

APTLD

Universal Access: Barrier or Excuse?

While IDN ccTLDs and IDN TLDs work as expected within the DNS system, they don’t work well in the real world. This could be why IDN TLDs are not being adopted as quickly as desired. This paper looks at two issues: what the barriers to the effective use of IDN TLDs are, and who can help address them. The paper doesn't provide answers but raises questions that, I hope, will be answered in part during subsequent responses.

Dennis Tan

VeriSign, Inc

Internationalized Domain Names: Challenges and Opportunities, 2014

Today Internationalized Domain Names (IDNs) are getting more attention than at any other time since they were introduced into the Domain Name System in 2000, but they still have a long path to general adoption. IDNs are far from being ubiquitous and trusted. Verisign, as a registry operator and manager of over 1M IDNs, plays a small part in this ecosystem, which comprises not only registries but also developers, content creators, and policy- and standards-making bodies who are all attempting to further internationalize, or locally localize, the identifiers on the Internet. Therefore, we intend to highlight some of the challenges we have found through our experience as a registry operator and encourage all players to make IDNs a ubiquitous and trusted product for the multilingual web.
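As background (not part of Verisign's presentation), the sketch below shows how an IDN label is converted to the ASCII-compatible Punycode form actually carried in the DNS; Python's built-in codec implements the older IDNA2003 rules, whereas registries today follow IDNA2008, so some labels behave differently.

```python
# Minimal illustration: an IDN label and its ASCII-compatible (Punycode) form.
name = "münchen.example"
ascii_form = name.encode("idna")      # b'xn--mnchen-3ya.example'
print(ascii_form)
print(ascii_form.decode("idna"))      # back to 'münchen.example'
```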

Georg Rehm

DFKI GmbH

Digital Language Extinction as a Challenge for the Multilingual Web

The large, pan-European study “Europe’s Languages in the Digital Age”, published as the META-NET White Paper Series in late 2012, has shown that 21 of the 31 languages investigated are in danger of digital extinction. The level of support these 21 languages receive through language technologies is either weak or non-existent. In early 2014 the comparison was extended to ca. 20 additional languages – these updated results are even more alarming, because far more than the original 21 languages are in danger of becoming digitally extinct. Most of these additional languages belong to a category that is usually referred to as “small languages”, which is, generally speaking, used almost synonymously with “under-resourced languages”. The proposed presentation will first describe the original approach of the META-NET study “Europe’s Languages in the Digital Age” and then explain the updated results. The presentation will also discuss the challenges this situation poses for the multilingual web and the corresponding needs, especially with regard to intensifying technology development and knowledge transfer from the better supported to the less supported languages.
[Chair, Thierry Declerck]
12:00
12:15
Lunch
13:30

Open Space set up &
start of LIDER workshop

Arle Lommel

DFKI

Explanation of the open space format for the afternoon, and selection of discussion topics. Topics are suggested by participants, and the most popular are allocated to breakout groups. A chair is chosen for each group from volunteers. There are also some pre-selected groups.

The LIDER workshop will run in parallel to the Open Space session.

14:00

Open space

Break-out discussions

Various locations are available for breakout groups. Participants can join whichever group they find interesting, and can switch groups at any point. Group chairs facilitate the discussion and ensure that notes are taken to support the summary to be given to the plenary.

15:00
Coffee Break and Posters
15:30

Open Space (contd.)

17:00

Open space

Group reports and discussion

Everyone meets again in the main conference area and each breakout group presents their findings. Other participants can comment and ask questions.

17:45

Wrap-Up

Workshop close



Posters

Manuel Tomas Carrasco Benitez

Language Technology, European Commission

Big Multilingual Linked Data (BigMu)

BigMu is the confluence of three streams with their own peculiarities and traditions:
- Linked Data
- Big Data
- Multilingual parallel corpora
One must apply standards in the simplest fashion. Back to basics: it is about using URIs and content negotiation for the language, format and similar items. Going for simplicity, one service can supply both the human- and the machine-readable versions using XHTML with appropriate markup, though different outputs could be arranged in XML and HTML. The same mechanisms must work for small data (one record) and Big Data (terabyte-sized databases), and for tabular and prose data.
Multilingual data often requires cleaning; hard-to-process data might be discarded; and there is a bias toward bilingual data. The challenge is to end up with clean, complete n-lingual aligned data. To put the problem in perspective, there is nothing better than trying to process a large corpus such as roughly ten years of the Official Journal of the European Union (OJ).
The presentation will combine both:
- Theory: using current web technologies
- Practice: the experience of cleaning a large corpus
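As a minimal sketch of the content-negotiation idea in the abstract above (not the author's implementation), the example below serves one URI whose language and format are chosen from the request headers; Flask is used purely for illustration, and the resource path and record data are invented.

```python
# Sketch only: one URI, language and format negotiated from the request.
from flask import Flask, request, jsonify

app = Flask(__name__)

DOCS = {  # hypothetical multilingual record
    "en": {"title": "Official Journal notice"},
    "fr": {"title": "Avis du Journal officiel"},
    "es": {"title": "Anuncio del Diario Oficial"},
}

@app.route("/doc/123")
def doc():
    lang = request.accept_languages.best_match(DOCS.keys(), default="en")
    fmt = request.accept_mimetypes.best_match(["application/json", "text/html"])
    record = DOCS[lang]
    if fmt == "application/json":      # machine-readable view
        return jsonify(record)
    # human-readable view from the same data
    return f"<html lang='{lang}'><body><h1>{record['title']}</h1></body></html>"
```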

Martin Brümmer

University of Leipzig

The DBpedia Data Stack: a Large, Multilingual, Semantic Knowledge Graph

DBpedia is currently a very successful data dissemination project with high industry uptake. We analysed the project and identified several barriers preventing DBpedia from realizing its full potential and ensuring sustainable operation, namely:
- lack of tools to support improved and cost-efficient data curation and multilingualism,
- lack of highly available value-added services with quality of service (QoS) guarantees and lack of enterprise-optimized infrastructures
- lack of proper documentation, tutorials and support, resulting in steep learning curves for new technologies
These obstacles prohibit the participation of SMEs in Linked Data environments, thus depriving them of valuable resources for business diversification and development. On the other hand, Linked Data technologies remain stuck in their original research roots and are likewise deprived of real-world development opportunities. To address these barriers, technological advances as well as an organizational framework are required to provide a sustainable environment for future developments.

The poster presents: 1. the available data provided by DBpedia, with a focus on multilingual data and data exploitable for NLP processes; 2. a new organisation, the DBpedia Association (http://dbpedia.org/association), which has been created to support DBpedia and improve its output for exploitation by industry.

Thierry Declerck and Paul Buitelaar

DFKI and INSIGHT National Center for Data Analytics, National University of Ireland, Galway

Multilingual polarity and sentiment lexicons in the LOD framework

In this poster we present our experiences in the generation, integration and use of Linguistic Linked Licensed Data in the context of two industry-driven EU projects: EuroSentiment and TrendMiner.
The EuroSentiment project is concerned with the establishment of a market for language resources for sentiment analysis. In the EuroSentiment project we develop a pool of semantically interoperable language resources for sentiment analysis, including domain-specific lexicons and annotated corpora. Sentiment analysis applications are able to access domain-specific polarity scores for individual lexical items in the context of semantically defined sentiment lexicons and corpora, or access and integrate complete language resources. The provision of such services across providers, customers and applications depends on a semantically interoperable representation of language resources for sentiment analysis. Language resources as used for sentiment analysis include a variety of dictionaries, corpora, sentiment models, etc. In the EuroSentiment project we are concerned with the specification and use of a model that enables the easy exchange and/or integration of these different language resources across sentiment analysis platforms and applications.
The TrendMiner project is delivering portable, open-source, real-time methods for cross-lingual mining and summarisation of large-scale streaming media. At the beginning of the project two use cases were described: “Multilingual Trend Mining and Summarisation for Financial Decision Support” and “Multilingual Public Spheres: Political Trends and Summaries”; more recently, in the context of a project extension, use cases in the fields of psychology and eHealth have been added. In all cases, the detection of opinions and sentiments related to relevant entities plays a central role.
Both the TrendMiner and EuroSentiment projects use the Marl model for encoding opinions (expressed as polarity features) about entities over time. The Marl model is integrated into the TrendMiner ontologies, which cover various domain fields, biographies, etc. The use and further development of this model is the cornerstone of the cooperation between EuroSentiment and TrendMiner.
Together, EuroSentiment and TrendMiner are delivering a large set of semantically enriched language resources, opinion/sentiment-marked lexicons and corpora, published in the Linguistic Linked Data and Linked Data clouds, and starting to bridge linguistic knowledge and domain knowledge.
References:
http://eurosentiment.eu/
http://www.trendminer-project.eu/
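As an illustrative sketch (not project code) of the Marl model mentioned above, the example below encodes a single opinion about an entity with a polarity and a numeric polarity value; the Marl namespace URI is the one published by the ontology authors at the time, and the opinion URI is hypothetical.

```python
# Sketch only: a Marl-style opinion annotation with polarity information.
from rdflib import Graph, Namespace, URIRef, Literal, RDF
from rdflib.namespace import XSD

MARL = Namespace("http://www.gsi.dit.upm.es/ontologies/marl/ns#")

g = Graph()
opinion = URIRef("http://example.org/opinion/1")  # hypothetical URI
g.add((opinion, RDF.type, MARL.Opinion))
g.add((opinion, MARL.describesObject, URIRef("http://dbpedia.org/resource/Berlin")))
g.add((opinion, MARL.hasPolarity, MARL.Positive))
g.add((opinion, MARL.polarityValue, Literal("0.8", datatype=XSD.float)))

print(g.serialize(format="turtle"))
```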

Serge Gladkoff and Renat Bikmatov

Logrus International

Previewing ITS 2.0 metadata in XLIFF-based frameworks

Logrus would like to present a universal parser and preview tool that helps users easily view localization metadata embedded in content. The parser has been drastically improved since the WICS project. The tool supports the Internationalization Tag Set (ITS) 2.0 metadata standard recently introduced by W3C. The tool is built entirely in JavaScript. It supports previewing HTML and XML files in web browsers; any external ITS 2.0 rule files are properly processed. No preliminary file format conversion is needed. The product is based on, and is a further development of, an open-source metadata preview tool, code-named "WICS", developed by Logrus for the W3C in 2013. During the presentation, the new tool will be used to preview several sample files containing different ITS 2.0 metadata.
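As a rough sketch of what processing an external ITS 2.0 rules file involves (the Logrus tool itself is JavaScript; this Python example is only illustrative), the snippet below reads a global translateRule and applies its XPath selector to a small document to surface the metadata a previewer would display.

```python
# Sketch only: read an ITS 2.0 global rule and apply its selector to content.
from lxml import etree

ITS_NS = {"its": "http://www.w3.org/2005/11/its"}

RULES = etree.fromstring(b"""
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">
  <its:translateRule selector="//code" translate="no"/>
</its:rules>""")

CONTENT = etree.fromstring(b"<doc><p>Press <code>Ctrl+S</code> to save.</p></doc>")

for rule in RULES.findall("its:translateRule", ITS_NS):
    for node in CONTENT.xpath(rule.get("selector")):
        # report the metadata that applies to each selected node
        print(node.tag, "translate =", rule.get("translate"))
```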

Dave Lewis

CNGL at Trinity College Dublin

The Localisation Web: Combining Open Data and Language Technology for Web Content Translation

The FALCON project combines the power of open data on the web with data-driven language technologies to construct the Localization Web. This consists of a network of terms and translations inter-linked to each other and to source and target documents via URLs. FALCON will integrate the resulting web of linked localisation and language data into localisation tool chains using existing data query and access control standards. Meta-data from these tools will add value to these data assets, enabling seamless quality monitoring across the value chain and their on-demand leverage in training machine translation and text analytics engines. FALCON will demonstrate the active curation of language resources and value-add meta-data, operating as an integral part of next generation localisation workflows. An open meta-data schema will capture the provenance of terms and translations as they progress through these workflows. The controlled, decentralized generation and sharing of this meta-data will yield new levels of end-to-end visibility into process and quality across the value chain. This will enable flexible, on-demand assembly of training data for targeted domain and quality improvements to machine translation and text analytics engines.
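As a hedged illustration of the provenance idea (FALCON's actual metadata schema is not shown here), the sketch below records where a translated term came from using the W3C PROV-O vocabulary, one existing standard such a workflow could build on; all resource, activity and agent URIs are invented.

```python
# Sketch only: PROV-O style provenance for one translated term.
from rdflib import Graph, Namespace, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # hypothetical workflow data

g = Graph()
source = EX["term/en/localisation"]
target = EX["term/de/Lokalisierung"]

g.add((target, RDF.type, PROV.Entity))
g.add((target, PROV.wasDerivedFrom, source))                      # came from the source term
g.add((target, PROV.wasGeneratedBy, EX["activity/mt-postedit-42"]))  # which workflow step produced it
g.add((target, PROV.wasAttributedTo, EX["agent/translator-7"]))      # who is responsible

print(g.serialize(format="turtle"))
```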

John P. McCrae and Philipp Cimiano

Bielefeld University

WordNet as a central hub of the Multilingual Linked Data Web

WordNet is the most widely used lexical resource today, but it covers only English. As such, WordNet has been emulated for many languages, and the founding of the Global WordNet Association has formed a large community of language resource producers working on many languages, especially under-resourced ones. However, until now there have been no agreed formats for the representation of wordnets and no stable identifiers for synsets to allow interoperability between models. We have worked on producing the first RDF version of WordNet directly supported by the WordNet development team, providing a stable, up-to-date RDF version of WordNet in order to enable an interlingual index to be created between wordnets in different languages. Furthermore, by using the W3C vocabulary lemon we have established a sound yet extensible model for the representation of wordnets. As such, we believe that the Princeton RDF WordNet, although itself monolingual, will play the role of a crucial hub for multilingual language resources on the web.
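As a minimal sketch (not the Princeton RDF WordNet itself) of how a wordnet entry can be modelled with lemon, the example below links a lexical entry to its written form and to a sense pointing at a synset identifier; the lemon namespace shown is the lemon-model.net one, and the entry and synset URIs are illustrative only.

```python
# Sketch only: a lemon-style lexical entry, form, and sense for one word.
from rdflib import Graph, Namespace, Literal, RDF

LEMON = Namespace("http://lemon-model.net/lemon#")
EX = Namespace("http://example.org/wordnet/")  # hypothetical URIs

g = Graph()
entry, form, sense = EX["cat-n"], EX["cat-n#form"], EX["cat-n#sense-1"]

g.add((entry, RDF.type, LEMON.LexicalEntry))
g.add((entry, LEMON.canonicalForm, form))
g.add((form, LEMON.writtenRep, Literal("cat", lang="en")))
g.add((entry, LEMON.sense, sense))
g.add((sense, LEMON.reference, EX["synset-02121620-n"]))  # stable synset identifier

print(g.serialize(format="turtle"))
```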

Roberto Navigli, Tiziano Flati and Andrea Moro

Sapienza University of Rome

Babelfy: State-of-the-art Multilingual Word Sense Disambiguation and Entity Linking

Entity Linking (EL) and Word Sense Disambiguation (WSD) both address the lexical ambiguity of language. But while the two tasks are quite similar, they differ in a fundamental aspect: in EL the textual mention can be linked to an entity which may or may not contain the exact mention, while in WSD there is a perfect match between the word form (or rather, its lemma) and a suitable sense. In this poster we present Babelfy, a unified graph-based approach to EL and WSD, based on a loose identification of candidate meanings coupled with a densest-subgraph heuristic which selects high-coherence semantic interpretations. Experiments show state-of-the-art performance on both tasks across different datasets, including a multilingual setting.
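As a toy sketch of the densest-subgraph heuristic named in the abstract (not Babelfy's actual implementation, which adds further constraints such as keeping at least one candidate per mention), the example below greedily removes the candidate meaning with the fewest connections and keeps the densest intermediate subgraph.

```python
# Toy sketch of the greedy densest-subgraph heuristic over candidate meanings.
def densest_subgraph(nodes, edges):
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    current = set(nodes)
    best, best_density = set(current), 0.0
    while current:
        n_edges = sum(len(adj[n] & current) for n in current) / 2
        density = n_edges / len(current)
        if density >= best_density:
            best, best_density = set(current), density
        # drop the vertex with the fewest connections inside `current`
        current.remove(min(current, key=lambda n: len(adj[n] & current)))
    return best

nodes = ["bank#finance", "bank#river", "money#currency", "water#liquid"]
edges = [("bank#finance", "money#currency"), ("bank#river", "water#liquid"),
         ("bank#finance", "water#liquid")]
print(densest_subgraph(nodes, edges))
```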

Alexander O'Connor

CNGL, KDEG, Trinity College Dublin, Ireland

CENDARI

The CENDARI project is an EU FP7 project which aims to facilitate the exploration and interconnection of the hidden archives of Europe. Taking an infrastructure approach, the project focuses on developing services to support historians of WW1 and of mediæval manuscripts in their academic inquiries. This involves collaborations with archives, libraries and other memory institutions in many countries, seeking to collect and collate collection- and item-level information. In addition, the project seeks to provide new forms of query tool, which give scholars the apparatus to pose and answer research questions rather than manipulating text queries. The proposed CENDARI infrastructure takes a unification approach, drawing in heterogeneous data in many languages and formats and providing a unified, tagged and enriched representation which complies with standards and best practice. Currently, CENDARI is developing initial versions of services which import and reconcile heterogeneous archive metadata and which tag and identify entity references in text, a semantic note-taking environment, and a tool for historical queries based on complex semantic attributes. These tools are being piloted in a number of integration scenarios, concentrating on a group of deserters in the aftermath of the First World War and on mediæval scientific and philosophical manuscripts. Drop over to our poster to hear more!

Participants from the LIDER project

LIDER

Overview of the LIDER project

The LIDER project aims at providing the basis for the creation of a Linguistic Linked Data cloud that can support content analytics tasks over unstructured multilingual cross-media content. By achieving this goal, LIDER will improve the ease and efficiency with which Linguistic Linked Data is exploited in content analytics processes. LIDER will create a strong community around the topic, guidelines and best practices for building and exploiting resources, a reference architecture for Linguistic Linked Data, and a long-term roadmap for the use of Linked Data for multilingual and multimedia content analytics in enterprises.