Aix-Marseille University offers a 2-year post-doc position in computational linguistics.

Identification of multiword expressions off the beaten path: towards MWE-aware semantic parsing


Context

We are hiring a post-doc researcher in the domain of natural language processing. This position is funded by the ANR PARSEME-FR project, a French spin-off of the PARSEME action. The PARSEME community gathers partners from 31 countries around scientific challenges in the automatic processing of multiword expressions. The goal of the PARSEME-FR project is to tackle the challenges posed by multiword expressions in natural language processing systems, specifically for French. The post-doc will participate in the PARSEME-FR project meetings and interact with other project members in France and with members of the international PARSEME community.

The post-doc will be supervised by Carlos Ramisch and Alexis Nasr (Aix-Marseille University) and will become a member of the TALEP team, specialised in computational linguistics. The TALEP team is part of LIF, a computer science lab affiliated with CNRS and Aix-Marseille University, located on the Luminy campus in Marseilles, the second-largest city in France. Marseilles is conveniently situated on the south coast of France, in the Provence-Alpes-Côte d'Azur region. It is a cosmopolitan and well-connected city with a Mediterranean climate, surrounded by stunning landscapes such as the Calanques natural park, where the Luminy campus is located.


Research project

One of the main goals of natural language processing (NLP) systems is to automatically find the underlying structure in running text. Many tools and techniques for text analysis have been developed to transform sequences of characters into increasingly abstract representations. This process is usually based on a pipeline of modules which successively perform operations such as text segmentation, tokenization, part-of-speech tagging, morphological analysis, lemmatization, and syntactic and semantic parsing.

Multiword expressions (MWEs) such as French fries, take a break, do one's best and spill the beans often pose problems for text analysis chains because of their idiosyncratic nature (Baldwin and Kim 2009, Sag et al. 2001). Among their notably challenging characteristics, they can include words that only occur in a given expression (e.g. astray in go astray), they can present irregular syntactic structure (e.g. by and large, an adverbial formed by the coordination of a preposition with an adjective), they can be discontinuous (e.g. to take this relevant remark into account), they can be ambiguous (e.g. a piece of cake can be something very easy or something you can eat) and they can present some degree of semantic non-compositionality (e.g. a hot dog is not literally a dog). The task of automatic MWE identification consists in finding such irregularities in text, that is, identifying which words are part of a multiword expression, and how they are related to each other.

Considerable progress has been made in recent years in understanding and modelling the interactions between MWE identification and syntactic parsing. It has been shown that the automatic discovery of new MWEs can greatly benefit from parsed data (Seretan 2008). The use of sequence models, such as CRFs and structured perceptrons, has also been explored as a means to tag expressions, mainly contiguous ones, in context prior to syntactic parsing (Riedl and Biemann 2016, Schneider et al. 2014, Constant and Sigogne 2011). The use of subtrees, such as special multiword constituents (Green et al. 2013) and dependencies (Nasr et al. 2015, Vincze et al. 2013), to learn parsing models has also been investigated to deal with syntactically irregular, ambiguous and discontinuous MWEs. These challenges have also been addressed by joint models such as a synchronous transition-based dependency parser and MWE segmenter (Constant and Nivre 2016).

On the other hand, the interactions between MWEs and semantic parsing have not received much attention. This is surprising, given that many MWEs breach the principle of semantic compositionality to some extent, requiring special treatment in semantic analysis. Therefore, the goal of the proposed post-doc project is to develop and evaluate original methods for semantic-aware MWE identification and MWE-aware semantic parsing. The recruited post-doc will be responsible for the development of a software prototype for MWE identification in French. Nonetheless, the methods implemented therein should be as language-independent and universal as possible.

The project is structured in three phases: (i) MWE tagging of syntactic trees, (ii) use of deep learning methods for better generalization and compositionality modelling, and (iii) transformation of the tagged tree into a pre-semantic graph.

The first phase will consist in adapting existing sequence models to tag syntactic trees instead of flat word sequences. The first experiments will focus on verbal MWEs using the corpora of the PARSEME shared task on the automatic identification of verbal MWEs. Since verbal MWEs are often discontinuous and inflected, it is crucial to abstract away intervening elements and inflection, working on trees as input and tagging those nodes and dependencies that are part of an MWE. The main assumption here is that verbal expressions are syntactically regular, so that MWE identification can be placed after syntactic parsing and before semantic analysis.
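
To illustrate why trees help with discontinuous and inflected expressions, here is a minimal sketch in Python. It is not the project's method: instead of a trained sequence model it uses a plain lexicon lookup over lemmas and head links, and the Token class, the tag_mwe function, the pattern format and the hand-built parse are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    form: str   # inflected surface form
    lemma: str  # base form, abstracts away inflection
    head: int   # idx of the syntactic governor, 0 for the root


def tag_mwe(tokens, pattern):
    """Return the sorted indices of the tokens matching `pattern`, or [].

    `pattern` is a hypothetical lexicon entry: a list of
    (lemma, parent_lemma) pairs in top-down order, the first pair being
    the head of the expression (its parent is None).
    """
    head_lemma = pattern[0][0]
    for candidate in tokens:
        if candidate.lemma != head_lemma:
            continue
        matched = {head_lemma: candidate.idx}
        for lemma, parent_lemma in pattern[1:]:
            parent_idx = matched.get(parent_lemma)
            # Look for a token with the right lemma attached to the
            # already matched parent, wherever it occurs in the sentence.
            hit = next(
                (t.idx for t in tokens
                 if t.lemma == lemma and t.head == parent_idx),
                None,
            )
            if hit is None:
                break
            matched[lemma] = hit
        else:
            return sorted(matched.values())
    return []


# "She took this relevant remark into account" (toy parse, built by hand)
sentence = [
    Token(1, "She", "she", 2),
    Token(2, "took", "take", 0),
    Token(3, "this", "this", 5),
    Token(4, "relevant", "relevant", 5),
    Token(5, "remark", "remark", 2),
    Token(6, "into", "into", 7),
    Token(7, "account", "account", 2),
]
pattern = [("take", None), ("account", "take"), ("into", "account")]
print(tag_mwe(sentence, pattern))  # [2, 6, 7]
```

The intervening words "this relevant remark" and the inflected form "took" do not prevent the match, because the lookup follows lemmas and dependencies rather than adjacent surface forms.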

The second phase will focus on the use of word embeddings and deep learning models to perform the classification task. Methods based on word embeddings are potentially interesting for two main reasons. First, they could find MWEs that are similar to known ones by using vector similarity. Therefore, they could generalize better from little training data, increasing the coverage of the system. Second, methods based on word embeddings can identify idiomatic MWEs by identifying word combinations whose overall vector is distant from the vectors of their component words (Cordeiro et al. 2016). The system developed in the first and second phases will be evaluated on French corpora, currently under annotation, covering all MWE categories, not only verbal ones.
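
The second idea can be sketched as follows, under stated assumptions: the expression is assumed to have its own vector (for instance learned by treating it as a single token), the component vectors are combined by simple addition, and the distance is measured with cosine similarity. This is one common formulation, not necessarily the one the project will adopt. The function names and the three-dimensional toy vectors are invented for illustration and do not come from a real embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compositionality(mwe_vec, component_vecs):
    """Similarity between the MWE vector and the sum of its components.

    A low score suggests that the expression is idiomatic.
    """
    composed = [sum(dims) for dims in zip(*component_vecs)]
    return cosine(mwe_vec, composed)


# Toy vectors, invented for this example only.
hot, dog, hot_dog = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
olive, oil, olive_oil = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.1]

print(compositionality(hot_dog, [hot, dog]))      # 0.0: far from its parts
print(compositionality(olive_oil, [olive, oil]))  # close to 1: compositional
```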

Once the MWE identification system is both precise and robust enough, the third phase will consist in applying tree transformation operations to the resulting tagged trees so that they become closer to semantic predicate-argument structures. Verbal MWEs are particularly relevant here, since phenomena such as light-verb constructions, verbal idioms and inherently reflexive verbs should be modelled as atomic multiword predicates. The final output will be a tree (or graph) of identified predicates and their arguments, thus allowing further processing by downstream applications that require the extraction of semantic structures from text.
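
As a rough illustration of such a transformation, the sketch below collapses the tokens tagged as one verbal MWE into a single predicate and collects the remaining direct dependents as its arguments. It is a simplification under assumptions: the Token class, the collapse_mwe function, the dependency labels and the hand-built parse are hypothetical, and the actual operations to be developed in the project may differ.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int     # 1-based position in the sentence
    lemma: str   # base form of the word
    head: int    # idx of the syntactic governor, 0 for the root
    deprel: str  # dependency label (illustrative label set)


def collapse_mwe(tokens, mwe_ids):
    """Merge the tokens in `mwe_ids` into one predicate.

    Returns the predicate name and the list of (deprel, lemma) pairs for
    tokens outside the MWE that depend on one of its members.
    """
    mwe = set(mwe_ids)
    predicate = "_".join(t.lemma for t in tokens if t.idx in mwe)
    arguments = [
        (t.deprel, t.lemma)
        for t in tokens
        if t.idx not in mwe and t.head in mwe
    ]
    return predicate, arguments


# "She took this relevant remark into account" (toy parse, built by hand)
sentence = [
    Token(1, "she", 2, "nsubj"),
    Token(2, "take", 0, "root"),
    Token(3, "this", 5, "det"),
    Token(4, "relevant", 5, "amod"),
    Token(5, "remark", 2, "obj"),
    Token(6, "into", 7, "case"),
    Token(7, "account", 2, "obl"),
]
print(collapse_mwe(sentence, [2, 6, 7]))
# ('take_into_account', [('nsubj', 'she'), ('obj', 'remark')])
```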


Profile

  • PhD in computer science or computational linguistics
  • Good knowledge of French and English (not necessarily native)
  • Interest in linguistics and familiarity with language technology
  • Capacity to work independently and as part of a team

Important dates

  • Application deadline: September 30, 2017 (or until filled)
  • Position starts: January 2018 (or earlier if possible)
  • Duration: 2 years, that is, 1 year renewable once

Application

Candidates should send the following documents as a single attached document named LASTNAME-Firstname.pdf, in French or in English, to Carlos Ramisch and Alexis Nasr (FirstName.LastName@lif.univ-mrs.fr):

  • a CV, including a list of publications
  • a cover letter explaining how this position matches your research interests and experience
  • the names and emails of 2 referees to be contacted

References

job-2017-lif-postdoc.1500905407.txt.gz · Last modified: 2017/07/24 16:10 by carlos.ramisch
