Agence Nationale de la Recherche
Unsupervised and semi-supervised methods for the identification of unseen multiword expressions


Research topic

Multiword expressions (MWEs) such as French fries, take a break, do one's best and spill the beans pose problems for text analysis pipelines because of their idiosyncrasies (Baldwin and Kim 2009). Among their challenging characteristics are irregular syntactic structure (by and large), discontinuity (to take her remarks into account), ambiguity (a piece of cake can be something easy or something you can eat) and non-compositionality (a hot dog is not a dog). MWE identification is the task of finding such expressions in running text, that is, determining which words are part of MWEs and how they are related to each other (Constant et al. 2017).

Significant progress has been made in MWE identification in recent years. Successful systems have been developed with the help of machine learning techniques based on annotated corpora. This progress has been made possible by the release of open corpora and open-source systems, especially in the context of international shared tasks such as DiMSUM 2016, PARSEME 2017 and PARSEME 2018. These systems often consist of adaptations of standard NLP models such as sequence taggers (Waszczuk 2018, Taslimipoor and Rohanian 2018) and parsers (Nasr et al. 2015, Constant and Nivre 2016).

Despite this progress, the performance of MWE identification systems is still not on par with the performance of other text analysis tools. For instance, the best MWE identification system at the PARSEME shared task 2018 obtained an F-measure of 54 points, whereas the best parser at the CoNLL 2018 shared task obtained an LAS of 75.84.

These figures can be explained in part by the challenging nature of MWEs and by the scarcity of training data. However, the models being employed are also not fully compatible with the nature of the phenomenon. Indeed, supervised learning is based on generalisations made from observations. MWEs are by definition idiosyncratic, and there is little to generalise from one MWE to another. As a consequence, most systems are able to cope with variants of observed expressions, but fail to detect new ones, that is, those never observed in the training corpus. It is unclear whether sophisticated deep learning architectures can be beaten by much simpler memorisation baselines (Cordeiro et al. 2016).
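To make the distinction between seen and unseen MWEs concrete, here is a minimal sketch in Python. It is only an illustration, not the evaluation procedure of any shared task: the function names, the representation of an MWE as a tuple of lemmas, and the toy data are all assumptions made for this example.

```python
def normalise(mwe):
    """Order-insensitive key: the sorted tuple of lowercased lemmas (an assumption)."""
    return tuple(sorted(lemma.lower() for lemma in mwe))


def split_seen_unseen(train_mwes, test_mwes):
    """Split test MWEs into those whose lemmas were observed in training and the rest."""
    known = {normalise(m) for m in train_mwes}
    seen = [m for m in test_mwes if normalise(m) in known]
    unseen = [m for m in test_mwes if normalise(m) not in known]
    return seen, unseen


# Toy annotated MWEs, given as tuples of lemmas (hypothetical data).
train = [("take", "break"), ("spill", "bean")]
test = [("break", "take"), ("hot", "dog"), ("spill", "bean")]

seen, unseen = split_seen_unseen(train, test)
unseen_ratio = len(unseen) / len(test)
```

A pure memorisation baseline can at best recover the expressions in `seen`. The expressions in `unseen` are the ones this postdoc targets.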

The goal of this postdoc is to improve current MWE identification systems by increasing their performance on unseen MWEs. To this end, the recruited researcher will study, implement and evaluate original methods to enrich supervised MWE identification models with information automatically extracted from large unannotated corpora. Methods to discover new MWEs in raw corpora abound (Church and Hanks 1990, Smadja 1993, Ramisch et al. 2010, Riedl and Biemann 2015, Yazdani et al. 2015), but they have rarely been combined in large-scale MWE identification pipelines. This postdoc represents an opportunity to explore this promising research direction.
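As an illustration of what discovery in raw corpora can look like, the sketch below scores adjacent word pairs with pointwise mutual information, in the spirit of the association measure of Church and Hanks (1990). It is a simplified example under stated assumptions: the function name `pmi_scores`, the `min_count` parameter and the toy sentence are hypothetical, and probabilities are plain relative frequencies without smoothing.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Return log2(P(x, y) / (P(x) * P(y))) for each adjacent pair of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs that are too rare to be scored reliably
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy unannotated "corpus" (hypothetical data).
tokens = "he ate french fries and she ate french fries with the salt".split()
scores = pmi_scores(tokens)
candidates = sorted(scores, key=scores.get, reverse=True)
```

Pairs with high scores are candidate MWEs. On a corpus this small the scores are dominated by pairs that occur only once, which is one reason such measures are usually combined with frequency thresholds such as `min_count`.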

The TALEP team has experience with supervised MWE identification using recurrent neural networks (Zampieri et al. 2018), statistical MWE discovery tools (Cordeiro et al. 2018) and automatic compositionality prediction using word embeddings (Cordeiro et al. 2016). These will serve as starting points for the exploratory work of this postdoc. Familiarity with one or more of these tools and technologies is a plus.


Environment

This position is funded by the ANR PARSEME-FR project, a French spin-off of PARSEME. The PARSEME community gathers partners from 30+ countries interested in the automatic processing of MWEs. Its main event is the PARSEME shared task. The goal of the PARSEME-FR project is to tackle the challenges posed by MWEs in NLP specifically for French. The recruited person will participate in the national project meetings, co-author articles, and interact with other PARSEME-FR and PARSEME members.

The postdoc will be supervised by Carlos Ramisch and Alexis Nasr. The recruited person will become a member of the TALEP team, specialised in computational linguistics. TALEP is a dynamic and international team of LIS, a computer science lab affiliated to CNRS and Aix Marseille University, located on the Luminy campus in Marseille.

Aix Marseille University is one of the largest universities in France, providing a lively and diverse research environment which attracts scientists carrying out research in many areas, including computational linguistics, in collaboration with leading international organizations.

The metropolitan area of Aix-Marseille is the second largest in France. It offers a vibrant environment conveniently situated on the south coast of France. Marseille is a cosmopolitan and well-connected city with a Mediterranean climate, surrounded by stunning landscapes such as the Calanques and the Provence region.


Profile

  • PhD in computer science or computational linguistics
  • Good knowledge of French and English (not necessarily native)
  • Interest in linguistics and familiarity with language technology
  • Capacity to work independently and as part of a team

Important dates

  • Application deadline: February 15, 2019 (or until filled)
  • Position starts: April 2019 (flexible)
  • Duration: 1 year

Application

Applications should be sent before February 15, 2019. Candidates should send the following documents in PDF, in French OR in English, to Carlos Ramisch and Alexis Nasr (FirstName.LastName@lis-lab.fr), indicating "Application postdoc AMU" in the subject line:

  • a CV, including a list of publications
  • a cover letter explaining their motivations
  • the names and emails of at least 2 referees to be contacted
job-2019-lis-postdoc.txt · Last modified: 2018/12/26 10:33 by carlos.ramisch