Weakly supervised multilingual identification of verbal multiword expressions

Domain: Natural Language Processing
Keywords: natural language processing, multi-word expressions, multilingualism, supervised machine learning
Location: University of Tours, LIFAT (Laboratoire d'Informatique Fondamentale et Appliquée de Tours), Blois campus, France
Supervisors: Agata Savary, Caroline Pasquer, Jean-Yves Antoine
Duration: 6 months
Remuneration: around 577 EUR / month (minimum)
Funding: ANR PARSEME-FR project

Motivation and context

The aim of this internship is to boost applications in Natural Language Processing (NLP) by focusing on one of their major challenges: multiword expressions (MWEs). MWEs are groups of words which exhibit unpredictable properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. For instance, faire ‘make/do’ and valoir ‘be worth sth’ are verbs, while their combination yields a noun: faire-valoir ‘a stooge, a person who is used by somebody to do things that are unpleasant or dishonest’. Similarly, the meaning of casser sa pipe ‘to die’ (literally ‘to break one’s pipe’) cannot be straightforwardly deduced from the meanings of its individual components. Additionally, MWEs exhibit unpredictable morpho-syntactic and lexical constraints. For instance, replacing the verb in lancer un appel ‘to issue a call’ (lit. ‘to throw a call’) by a synonym yields an invalid expression: *jeter un appel ‘to throw a call’. Doing the same in casser sa pipe ‘to die’ imposes a literal reading of the resulting expression: briser sa pipe ‘to break one’s pipe’.

One of the main aims of MWE-oriented NLP research is to model such expressions so as to optimize their automatic processing (for instance, to avoid their literal translation in machine translation systems). Two major MWE-related NLP tasks are MWE identification and MWE discovery. In the former, an identifier takes a text as input and automatically annotates (points at) the occurrences of MWEs in context. In the latter, the input consists of large quantities of raw text and the output is a list of potential MWEs given out of context. MWE identification is usually done in a supervised manner, i.e. by training a system on a manually annotated corpus. MWE discovery, conversely, is usually unsupervised, i.e. it can be applied to very large quantities of raw data. MWE identification is a prerequisite for downstream applications such as machine translation (which may want to treat MWEs with dedicated procedures).

Automatic identification of verbal MWEs in 19 languages was addressed by edition 1.1 of the PARSEME shared task (Ramisch et al., 2018), in which the BdTln team participated with the VarIDE system (Pasquer et al., 2018). The results of the shared task show that identifying unseen MWEs (i.e. those MWEs which do not occur in the training data) is particularly challenging (Savary et al., 2019). Thus, identification should ideally exploit not only annotated corpora but also MWE lexicons and MWE discovery methods.

The aim of this internship is, thus, to study the potential of coupling MWE identification with MWE discovery, so as to better cope with unseen data and increase the overall identification performance. The general idea is to use the MWE candidates extracted by a discovery tool, together with their occurrence contexts, as seen data (as if they had been manually annotated) but with a lower reliability score. This is to be done in a highly multilingual context, where manually annotated data for at least 19 languages are openly available.
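As a purely illustrative sketch of this general idea (not the method to be developed during the internship), the Python fragment below merges manually annotated occurrences with occurrences proposed by a discovery tool and attaches a lower reliability weight to the latter. All names (TrainingExample, build_training_set) and the weight values 1.0 and 0.5 are assumptions made for the example; how such weights should be set and used by the identifier is precisely one of the questions the internship leaves open.

```python
from dataclasses import dataclass

# Hypothetical reliability scores: the values are assumptions for this sketch.
GOLD_WEIGHT = 1.0
DISCOVERED_WEIGHT = 0.5


@dataclass(frozen=True)
class TrainingExample:
    tokens: tuple       # tokens of the sentence
    mwe_indices: tuple  # positions of the MWE components in the sentence
    weight: float       # reliability of the annotation


def build_training_set(gold, discovered, discovered_weight=DISCOVERED_WEIGHT):
    """Merge gold and discovered MWE occurrences into one weighted training set.

    Both inputs are iterables of (tokens, mwe_indices) pairs. If the same
    occurrence appears in both, the manually annotated version wins.
    """
    examples = {}
    for tokens, indices in discovered:
        key = (tuple(tokens), tuple(indices))
        examples[key] = TrainingExample(key[0], key[1], discovered_weight)
    for tokens, indices in gold:
        key = (tuple(tokens), tuple(indices))
        examples[key] = TrainingExample(key[0], key[1], GOLD_WEIGHT)
    return list(examples.values())


# Toy data: one manually annotated occurrence, two automatically discovered ones.
gold = [(["il", "a", "cassé", "sa", "pipe"], [2, 3, 4])]
discovered = [
    (["elle", "lance", "un", "appel"], [1, 3]),
    (["il", "a", "cassé", "sa", "pipe"], [2, 3, 4]),
]
training_set = build_training_set(gold, discovered)
```

A supervised identifier could then, for instance, use the weight of each example to scale its contribution to the training loss, so that discovered candidates help with unseen MWEs without being trusted as much as manual annotations.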

We believe that the context of this internship is particularly stimulating. It includes the European research network PARSEME (Parsing and Multiword Expressions) and its French spin-off ANR project PARSEME-FR. These communities have developed cross-lingually unified and validated corpora manually annotated for verbal MWEs (like casser sa pipe or lancer un appel), which underlie the PARSEME shared task 1.1 mentioned above. Edition 1.2 of this shared task is planned for 2020. It will be dedicated to weakly supervised identification of verbal MWEs. The outcome of this internship (i.e. a verbal MWE identification and discovery tool) may be submitted to this shared task, where it would compete side by side with many other systems developed worldwide. The student working on this internship will, thus, regularly participate in on-site or online meetings not only with her/his supervisors but also with external members of the aforementioned initiatives.

Expected outcomes

The expected outcomes of this internship include:

an overview of the state-of-the-art in MWE identification and discovery, and especially on coupling these two functionalities
a system which couples MWE identification and discovery, so as to address the major challenge in MWE identification stemming from unseen data
running the system on the data of the PARSEME shared task 1.2 in early 2020
submitting the results to the shared task platform
submitting a system description paper to the MWE-LEX 2020 workshop
presenting the system at the MWE-LEX 2020 workshop, co-located with the COLING 2020 conference in Barcelona in September 2020

Expected follow-up

A 3-4-year PhD grant, related to the same line of research, might be available in the research team starting from September 2020.

Candidate's profile

2nd-year master's student in computational linguistics, computer science or a related field
Interests in linguistics and familiarity with language technology
Good knowledge of French
Good programming skills, preferably in Python

Important dates

Application deadline: 16 December 2019 (or until filled)
Notification: 20 December 2019
Position starts: mid-January 2020
Position ends: around mid-July 2020

How to apply

Send your CV and a cover letter to Agata Savary, Caroline Pasquer and Jean-Yves Antoine (first.last@univ-tours.fr).

References

Baldwin, T. and Kim, S. N. (2010) Multiword Expressions, in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.
Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017. Multiword expression processing: A survey. Computational Linguistics, 43(4):837–892.
Caroline Pasquer, Carlos Ramisch, Agata Savary, Jean-Yves Antoine (2018) VarIDE at PARSEME Shared Task 2018: Are variants really as alike as two peas in a pod?, in the Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), 25-26 August 2018, Santa Fe, USA.
Carlos Ramisch, Silvio Ricardo Cordeiro, Agata Savary, Veronika Vincze, Verginica Barbu Mititelu, Archna Bhatia, Maja Buljan, Marie Candito, Polona Gantar, Voula Giouli, Tunga Güngör, Abdelati Hawwari, Uxoa Iñurrieta, Jolanta Kovalevskaitė, Simon Krek, Timm Lichte, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Behrang QasemiZadeh, Renata Ramisch, Nathan Schneider, Ivelina Stoyanova, Ashwini Vaidya, Abigail Walsh (2018) Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions, In the Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), 25-26 August 2018, Santa Fe, USA.
Agata Savary, Silvio Ricardo Cordeiro, Carlos Ramisch (2019) Without lexicons, multiword expression identification will never fly: A position statement, In the Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), 2 August 2019, Florence, Italy.
