
Lexicon-to-corpus multiword expression browser

  • Domain: Natural Language Processing
  • Location: University of Tours, LIFAT (Laboratoire d'Informatique Fondamentale et Appliquée de Tours), Blois campus
  • Duration: 6 months
  • Remuneration: around 577 EUR / month
  • Funding: ANR PARSEME-FR project

Motivation and context

The internship will take place within the PARSEME-FR project, which involves several Natural Language Processing (NLP) teams in France. The project aims to boost NLP applications by focusing on one of their major challenges: multiword expressions (MWEs).

MWEs are groups of words which exhibit unpredictable properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. For instance, faire ‘make/do’ and valoir ‘be worth sth’ are verbs, while faire-valoir is a noun meaning ‘a stooge, a person who is used by somebody to do things that are unpleasant or dishonest’. Similarly, the meaning of casser sa pipe (literally ‘to break one’s pipe’) ‘to die’ cannot be straightforwardly deduced from the meanings of its individual components. Additionally, MWEs exhibit unpredictable morpho-syntactic and lexical constraints. For instance, replacing the verb in lancer un appel (lit. ‘to throw a call’) ‘to issue a call’ by a synonym yields an invalid expression, *jeter un appel ‘to throw a call’. Doing the same in casser sa pipe ‘to die’ imposes a literal reading on the resulting expression: briser sa pipe ‘to break one’s pipe’.

One of the main aims of MWE-oriented NLP research is to model such expressions so as to optimize their automatic processing (for instance, to avoid their literal translation in machine translation systems). Two major MWE-related NLP tasks are MWE discovery and MWE identification. In the former, the input consists of large quantities of raw text and the output is a list of potential MWEs. In the latter, an identifier takes a text as input and automatically annotates (points at) the occurrences of MWEs in it. MWE identification is a prerequisite for downstream applications such as machine translation (which may want to treat MWEs with dedicated procedures).

Automatic identification of MWEs in 19 languages was addressed by the PARSEME shared task (Ramisch et al., 2018), in which the BdTln team participated with the VarIDE system (Pasquer et al., 2018a). The results of the shared task show that identifying unseen MWEs (i.e. those MWEs which do not occur in the training data) is particularly challenging. Thus, identification should, ideally, exploit not only annotated corpora but also MWE lexicons and MWE discovery methods.

Objectives

The main objective of this internship is to develop tools to link and explore the annotated corpus and the lexicon built in the PARSEME-FR project. The internship can be divided into the following tasks:

  • Study the linguistic properties of French multiword expressions, through the exploration of a verbal MWE lexicon extracted from the Lexicon-Grammar tables (Gross 1984)
  • Develop a tool to automatically link annotated verbal MWEs and their corresponding entries in the lexicon, using matching procedures based on some normalization of the expressions. For instance, a lexicon entry like to spill the beans should be linked to its corpus occurrences like the beans he spilled, spilling the beans etc.
  • Develop a web interface in order to search annotated expressions via multicriteria filters and to visualize their tree-like linguistic description
  • Optionally, extend the work to all multiword expressions, which involves the automatic conversion of an existing dictionary of nominal, adjectival and adverbial compounds into the PARSEME-FR format, in particular using syntactic information available in the annotated corpus
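The linking task described above can be illustrated with a minimal sketch. It assumes that each corpus occurrence is available as the list of its annotated tokens, and that normalization means lemmatizing, dropping the infinitive marker and ignoring word order. The lemma table, the function names normalize and link, and the sample data are all hypothetical; a real tool would read lemmas from the annotated corpus and would probably need finer matching criteria.

```python
from collections import defaultdict

# Toy lemma table (assumption): a real pipeline would take lemmas
# from the annotated corpus rather than from a hand-written dictionary.
TOY_LEMMAS = {"beans": "bean", "spilled": "spill", "spilling": "spill"}

# Tokens dropped during normalization (assumption: only the infinitive marker).
INFINITIVE_MARKERS = {"to"}


def normalize(tokens):
    """Map a token sequence to an order-independent key of lemmas."""
    lemmas = [TOY_LEMMAS.get(t.lower(), t.lower()) for t in tokens]
    lemmas = [lemma for lemma in lemmas if lemma not in INFINITIVE_MARKERS]
    # Sorting discards word order but keeps repeated lemmas.
    return tuple(sorted(lemmas))


def link(lexicon_entries, occurrences):
    """Return a dict from lexicon entry to the occurrences sharing its key."""
    index = defaultdict(list)
    for entry in lexicon_entries:
        index[normalize(entry.split())].append(entry)

    links = defaultdict(list)
    for occurrence in occurrences:
        for entry in index.get(normalize(occurrence), []):
            links[entry].append(occurrence)
    return dict(links)


lexicon = ["to spill the beans", "to kick the bucket"]
occurrences = [
    ["the", "beans", "spilled"],   # annotated tokens of "the beans he spilled"
    ["spilling", "the", "beans"],  # annotated tokens of "spilling the beans"
    ["kicked", "a", "ball"],       # not an occurrence of any lexicon entry
]
links = link(lexicon, occurrences)
```

In this sketch both occurrences of spill the beans are attached to the same lexicon entry, whatever their word order or inflection, while the third token sequence remains unlinked. Ignoring word order is a deliberate simplification and may overgenerate on real data.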

Profile

  • Master 1 or Master 2 in computational linguistics or computer science,
  • Good knowledge of French,
  • Interest in linguistics and familiarity with language technology,
  • Programming skills (Python, web programming).

Important dates

  • Application deadline: 15 January 2018 (or until filled)
  • Notification: 25 January 2018
  • Position starts: around March 2018
  • Position ends: July-August 2018

References

Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Yannick Parmentier, Caroline Pasquer, and Jean-Yves Antoine. Annotation d’expressions polylexicales verbales en français. In Jean-Yves Antoine and Iris Eshkol, editors, 24e conférence sur le Traitement Automatique des Langues Naturelles (TALN), Actes de TALN, volume 2 : articles courts, pages 1–9, Orléans, France, June 2017.

Maurice Gross. Lexicon-grammar and the syntactic analysis of French. In Proc. of COLING-ACL 1984, pages 275–282, Stanford, CA, 1984. Association for Computational Linguistics.

Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, and Antoine Doucet. The PARSEME shared task on automatic identification of verbal multiword expressions. In Proc. of EACL 2017 Workshop on MWEs, pages 31–47, Valencia, April 2017.


How to apply

Applications should be sent to Mathieu.Constant@univ-lorraine.fr. They should include a CV, a cover letter and, optionally, letters of support from teachers.

2018-lifat-m2-1.1537531097.txt.gz · Last modified: 2018/09/21 13:58 by agata.savary
