Weakly supervised multilingual identification of verbal multiword expressions

Motivation and context

The aim of this internship is to boost applications in Natural Language Processing (NLP) by focusing on one of their major challenges: multiword expressions (MWEs). MWEs are groups of words which exhibit unpredictable properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. For instance, faire ‘make/do’ and valoir ‘be worth sth’ are verbs, while their combination yields a noun: faire-valoir ‘a stooge, a person who is used by somebody to do things that are unpleasant or dishonest’. Similarly, the meaning of casser sa pipe ‘to die’ (literally to break one’s pipe) cannot be deduced from the meanings of its individual components. Additionally, MWEs exhibit unpredictable morpho-syntactic and lexical constraints. For instance, replacing the verb in lancer un appel ‘to issue a call’ (lit. to throw a call) by a synonym yields an invalid expression: *jeter un appel ‘to throw a call’. Doing the same in casser sa pipe ‘to die’ imposes a literal reading of the resulting expression: briser sa pipe ‘to break one’s pipe’.

One of the main aims of MWE-oriented NLP research is to model such expressions so as to optimize their automatic processing (for instance, to avoid their literal translation in machine translation systems). Two major MWE-related NLP tasks are MWE identification and MWE discovery. In the former, an identifier takes a text as input and automatically annotates (points at) the occurrences of MWEs in context. In the latter, the input consists of large quantities of raw text and the output is a list of potential MWEs given out of context. MWE identification is usually done in a supervised manner, i.e. by training a system on a manually annotated corpus. MWE discovery, conversely, is usually unsupervised, i.e. it can be applied to very large quantities of raw data. MWE identification is a prerequisite for downstream applications such as machine translation (which may want to treat MWEs with dedicated procedures).

Automatic identification of verbal MWEs in 19 languages was addressed by edition 1.1 of the PARSEME shared task (Ramisch et al., 2018), in which the BdTln team participated with the VarIDE system (Pasquer et al., 2018a). The results of the shared task show that identifying unseen MWEs (i.e. those MWEs which do not occur in the training data) is particularly challenging (Savary et al., 2019). Thus, identification should ideally exploit not only annotated corpora but also MWE lexicons and MWE discovery methods.
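The notion of an unseen MWE can be made concrete with a small sketch. It assumes, for illustration only, that each annotated MWE is represented by the list of its component lemmas and that two occurrences count as the same MWE when their lemmas match regardless of order. The function name split_seen_unseen and the toy data are hypothetical.

```python
def split_seen_unseen(train_mwes, test_mwes):
    """Split test MWEs into those already annotated in training and the rest.

    Each MWE is a list of lemmas. Assumption: two MWEs are the same
    expression if their sorted lemmas are identical.
    """
    seen_keys = {tuple(sorted(mwe)) for mwe in train_mwes}
    seen, unseen = [], []
    for mwe in test_mwes:
        if tuple(sorted(mwe)) in seen_keys:
            seen.append(mwe)
        else:
            unseen.append(mwe)
    return seen, unseen


# Toy data: lemmatized MWEs annotated in the training and test corpora.
train = [["casser", "pipe"], ["lancer", "appel"]]
test = [["pipe", "casser"], ["faire", "valoir"]]

seen, unseen = split_seen_unseen(train, test)
print("seen:", seen)
print("unseen:", unseen)
```

Under these assumptions, the second test expression is unseen: a purely supervised identifier has had no annotated example of it to learn from.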

The aim of this internship is, thus, to study the potential of coupling MWE identification with MWE discovery, so as to better cope with unseen data and increase the overall identification performance. The general idea is to use the MWE candidates extracted by a discovery tool, together with their occurrence contexts, as seen data (as if they had been manually annotated) but with a lower reliability score. This is to be done in a highly multilingual context, where manually annotated data for at least 19 languages are openly available.
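The general idea can be sketched as follows. This is a minimal illustration and not the method to be developed in the internship: it assumes lemmatized sentences, represents an MWE as a set of lemmas, and treats any raw sentence containing all lemmas of a discovered candidate as an occurrence context, which is a crude heuristic that also matches literal or coincidental co-occurrences. The names TrainingExample and build_training_set, and the weight 0.5, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    tokens: tuple          # lemmatized sentence
    mwe_lemmas: frozenset  # lemmas of the MWE marked in this sentence
    weight: float          # reliability: 1.0 = manual annotation


def build_training_set(gold, discovered, raw_sentences, discovered_weight=0.5):
    """Merge manually annotated data with discovered MWE candidates.

    gold: list of (tokens, mwe_lemmas) pairs from the annotated corpus.
    discovered: iterable of lemma sets proposed by a discovery tool.
    raw_sentences: lemmatized sentences from an unannotated corpus.
    """
    examples = [TrainingExample(tuple(t), frozenset(m), 1.0) for t, m in gold]
    gold_mwes = {frozenset(m) for _, m in gold}
    for candidate in discovered:
        candidate = frozenset(candidate)
        if candidate in gold_mwes:
            continue  # already covered by reliable manual annotations
        for sentence in raw_sentences:
            # Assumed heuristic: all candidate lemmas occur in the sentence.
            if candidate <= set(sentence):
                examples.append(
                    TrainingExample(tuple(sentence), candidate, discovered_weight)
                )
    return examples


gold = [(["il", "casser", "son", "pipe"], {"casser", "pipe"})]
discovered = [{"lancer", "appel"}, {"casser", "pipe"}]
raw = [["elle", "lancer", "un", "appel"], ["il", "boire", "un", "café"]]

examples = build_training_set(gold, discovered, raw)
for ex in examples:
    print(ex.weight, sorted(ex.mwe_lemmas), ex.tokens)
```

A downstream identifier could then use the weight field, for example as a per-example loss weight, so that automatically discovered occurrences influence training less than manually annotated ones.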

We believe that the context of this internship is particularly stimulating. It includes the European research network PARSEME (Parsing and Multiword Expressions) and its French spin-off, the ANR project PARSEME-FR. These communities have developed cross-lingually unified and validated corpora manually annotated for verbal MWEs (like casser sa pipe or lancer un appel), which underlie the PARSEME shared task 1.1 mentioned above. Edition 1.2 of this shared task is planned for 2020. It will be dedicated to weakly supervised identification of verbal MWEs. The outcome of this internship (i.e. a VMWE identification and discovery tool) will potentially be submitted to this shared task, to compete side by side with many other systems developed worldwide. The student working on this internship will, thus, regularly participate in on-site or online meetings not only with her/his supervisors but also with external members of the aforementioned initiatives.

Expected outcomes

The expected outcomes of this internship include:

Expected follow-up

Candidate's profile

Important dates

How to apply

Send your CV and a cover letter to Agata Savary, Caroline Pasquer and Jean-Yves Antoine (first.last@univ-tours.fr).

References