Lexicon-to-corpus multiword expression browser

  • Domain: natural language processing
  • Location: ATILF, Nancy, France
  • Supervisors: Mathieu Constant and Agata Savary
  • Duration: 4 to 6 months
  • Remuneration: around 500 euros a month
  • Funding: CNRS

Motivation and context

The proposed internship will take place within the PARSEME-FR project, which aims to investigate the links between multiword expression (MWE) identification and syntactic-semantic analysis. MWEs, such as hot dog, hard disk, kick the bucket, United Nations and pay attention, are groups of words which exhibit unpredictable properties. Most prominently, their meaning cannot be straightforwardly derived from the meanings of their components. For this reason they pose major challenges to automatic text processing and have to be explicitly described in a lexicon.

One of the objectives of the PARSEME-FR project is to build, for French, both a comprehensive MWE lexicon including fine-grained linguistic information and an MWE-annotated corpus. One of the most notable outcomes so far is the annotation of the French dataset (Candito et al. 2017) for the PARSEME Shared Task on verbal multiword expression identification (Savary et al. 2017).

The two resources are intended not only for use in natural language processing applications, such as syntactic and semantic analysis, but also for exploration in linguistic studies.


Objectives

The main objective of this internship is to develop tools to link and explore the annotated corpus and the lexicon built in the PARSEME-FR project. The internship can be divided into the following tasks:

  • Study the linguistic properties of French multiword expressions, through the exploration of a verbal MWE lexicon extracted from the Lexicon-Grammar tables (Gross 1984)
  • Develop a tool to automatically link annotated verbal MWEs to their corresponding entries in the lexicon, using matching procedures based on a normalization of the expressions. For instance, a lexicon entry such as "to spill the beans" should be linked to corpus occurrences such as "the beans he spilled" and "spilling the beans".
  • Develop a web interface in order to search annotated expressions via multicriteria filters and to visualize their tree-like linguistic description
  • Optionally, extend the work to all multiword expressions, which involves the automatic conversion of an existing dictionary of nominal, adjectival and adverbial compounds into the PARSEME-FR format, in particular using syntactic information available in the annotated corpus
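
The linking task above can be illustrated with a minimal Python sketch. It is not the project's method, only one possible normalization: each expression is reduced to the sorted lemmas of its components, ignoring a few function words, so that word order and inflection no longer matter. The lemma table, the list of ignored words and the function names `normalize` and `link` are assumptions made for this example; in practice the lemmas would come from the annotated corpus.

```python
from collections import defaultdict

# Toy lemma table standing in for the lemma annotations of a corpus
# (hypothetical data, for illustration only).
TOY_LEMMAS = {"spilled": "spill", "spilling": "spill",
              "beans": "bean", "kicked": "kick"}

# Function words ignored during matching (an assumption of this sketch).
IGNORED = {"to", "the", "a", "an"}


def normalize(tokens):
    """Return an order-insensitive key: the sorted lemmas of the kept words."""
    lemmas = (TOY_LEMMAS.get(t.lower(), t.lower()) for t in tokens)
    return tuple(sorted(lemma for lemma in lemmas if lemma not in IGNORED))


def link(lexicon_entries, occurrences):
    """Map each lexicon entry to the annotated occurrences sharing its key."""
    index = defaultdict(list)
    for occ in occurrences:
        index[normalize(occ)].append(occ)
    return {entry: index.get(normalize(entry.split()), [])
            for entry in lexicon_entries}


lexicon = ["to spill the beans", "to kick the bucket"]
# Each occurrence lists only the tokens annotated as part of the MWE,
# so "he" in "the beans he spilled" is not included.
corpus = [["the", "beans", "spilled"],
          ["spilling", "the", "beans"],
          ["kicked", "the", "bucket"]]
links = link(lexicon, corpus)
```

Such a key is deliberately coarse: two different expressions built from the same lemmas would be conflated, so a real tool would probably add syntactic constraints on top of it.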


Required profile

  • Master 1 or Master 2 in computational linguistics or computer science,
  • Good knowledge of French,
  • Interest in linguistics and familiarity with language technology,
  • Programming skills (Python, web programming).

Important dates

  • Application deadline: 15 January 2018 (or until filled)
  • Notification: 25 January 2018
  • Position starts: around March 2018
  • Position ends: July-August 2018


References

Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Yannick Parmentier, Caroline Pasquer, and Jean-Yves Antoine. Annotation d’expressions polylexicales verbales en français. In Jean-Yves Antoine and Iris Eshkol, editors, 24e conférence sur le Traitement Automatique des Langues Naturelles (TALN), Actes de TALN, volume 2 : articles courts, pages 1–9, Orléans, France, June 2017.

Maurice Gross. Lexicon-grammar and the syntactic analysis of French. In Proc. of COLING-ACL 1984, pages 275–282, Stanford, CA, 1984. Association for Computational Linguistics.

Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, and Antoine Doucet. The PARSEME shared task on automatic identification of verbal multiword expressions. In Proc. of EACL 2017 Workshop on MWEs, pages 31–47, Valencia, April 2017.

How to apply

Applications should be sent to […]. They should include a CV, a cover letter and, optionally, letters of support from teachers.

internships-2018-atilf-m2-1-en.txt · Last modified: 2017/12/01 11:58 by agata.savary
