We are hiring a post-doc researcher in the domain of natural language processing. This position is funded by the ANR PARSEME-FR project, a French spin-off of the PARSEME action. The PARSEME community gathers partners from 31 countries around scientific challenges in automatic processing of multiword expressions. The goal of the PARSEME-FR project is to tackle the challenges posed by multiword expressions in natural language processing systems specifically for French. The post-doc will participate in the national project meetings, co-author articles submitted to top-tier conferences and journals, and interact with other project members in France and with members of the international research community.
The post-doc will be supervised by Carlos Ramisch and Alexis Nasr (Aix Marseille University) and will become a member of the TALEP team, which specialises in computational linguistics. TALEP is part of LIF, a computer science lab affiliated with CNRS and Aix Marseille University, located on the Luminy campus in Marseilles.
The metropolitan area of Aix-Marseilles is the second largest in France. It offers a vibrant environment conveniently situated on the south coast of France, in the Provence-Alpes-Côte d'Azur region. Aix-Marseilles is a cosmopolitan and well-connected urban area with a Mediterranean climate, surrounded by stunning landscapes such as the Calanques natural park and the Provence region.
Aix Marseille University is the largest university in France and in the francophone world by number of students and staff. It provides a lively and diverse research environment that attracts scientists carrying out research in many areas, including computational linguistics, in collaboration with leading international organizations.
One of the main goals of natural language processing (NLP) systems is to automatically find the underlying structure in running text. Many tools and techniques for text analysis have been developed to transform sequences of characters into increasingly abstract representations. This process is usually based on a pipeline of modules which successively perform operations such as text segmentation, tokenization, part-of-speech tagging, morphological analysis, lemmatization, and syntactic and semantic parsing.
Multiword expressions (MWEs) such as French fries, take a break, do one's best and spill the beans often pose problems for text analysis chains because of their idiosyncratic nature (Baldwin and Kim 2009, Sag et al. 2001). Among their notably challenging characteristics, they can include words that only occur in a given expression (e.g. astray in go astray), they can present irregular syntactic structure (e.g. by and large, an adverbial formed by the coordination of a preposition with an adjective), they can be discontinuous (e.g. to take this relevant remark into account), they can be ambiguous (e.g. a piece of cake can be something very easy or something you can eat) and they can present some degree of semantic non-compositionality (e.g. a hot dog is not literally a dog). The task of automatic MWE identification consists in finding such irregularities in text, that is, determining which words are part of a multiword expression, and how they are related to each other.
Considerable progress has been made in recent years to understand and model the interactions between MWE identification and syntactic parsing. It has been shown that the automatic discovery of new MWEs can greatly benefit from parsed data (Seretan, 2008). The use of sequence models, such as CRFs and structured perceptrons, has also been explored as a means to tag (mainly contiguous) expressions in context, prior to syntactic parsing (Riedl and Biemann 2016, Schneider et al. 2014, Constant and Sigogne 2011). The use of subtrees, such as special multiword constituents (Green et al. 2013) and dependencies (Nasr et al. 2015, Vincze et al. 2013), to learn parsing models has also been investigated to deal with syntactically irregular, ambiguous and discontinuous MWEs. These challenges have also been addressed by joint models such as a synchronous transition-based dependency parser and MWE segmenter (Constant and Nivre 2016).
On the other hand, the interactions between MWEs and semantic parsing have not received much emphasis. This is surprising, given that many MWEs breach the principle of semantic compositionality to some extent, requiring special treatment in semantic analysis. Therefore, the goal of the proposed post-doc project is to develop and evaluate original methods for semantic-aware MWE identification and MWE-aware semantic parsing. The recruited post-doc will be responsible for the development of a software prototype for MWE identification in French. Nonetheless, the methods implemented therein should be as language independent and universal as possible.
The project is structured in three phases: (i) MWE tagging of syntactic trees, (ii) use of deep learning methods for better generalization and compositionality modelling, and (iii) transformation of the tagged tree into a pre-semantic graph.
The first phase will consist in adapting existing sequence models to tag syntactic trees instead of flat word sequences. The first experiments will focus on verbal MWEs, using the corpora of the PARSEME shared task on the automatic identification of verbal MWEs. Since such expressions can be discontinuous and inflected, it is crucial to abstract away from intervening elements and inflection, working on trees as input and tagging those nodes and dependencies that are part of an MWE. The main assumption here is that verbal expressions are syntactically regular, so that MWE identification can be placed after syntactic parsing and before semantic analysis.
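To illustrate what tagging tree nodes rather than flat word sequences can look like, here is a minimal Python sketch. It is not the sequence model the project plans to adapt: it swaps in a simple lexicon lookup over lemmas and dependencies. The `Node` class, the nested pattern format and the toy tree are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One node of a dependency tree (hypothetical minimal representation)."""
    idx: int           # position in the sentence, starting at 1
    form: str          # inflected surface form
    lemma: str         # lemma, used to abstract away inflection
    head: int          # idx of the governor, 0 for the root
    mwe: bool = False  # set to True when the node belongs to an MWE


def children(tree, idx):
    """Return the direct dependents of the node with index idx."""
    return [n for n in tree if n.head == idx]


def match(tree, node, pattern):
    """Match a (lemma, [sub-patterns]) pattern at node.

    Return the list of matched nodes, or None if the pattern does not fit.
    Dependents that are not mentioned in the pattern are simply ignored,
    which is how intervening material is skipped.
    """
    lemma, subpatterns = pattern
    if node.lemma != lemma:
        return None
    matched = [node]
    for sub in subpatterns:
        for child in children(tree, node.idx):
            found = match(tree, child, sub)
            if found is not None:
                matched.extend(found)
                break
        else:  # no dependent matched this sub-pattern
            return None
    return matched


def tag_mwes(tree, lexicon):
    """Mark every node that takes part in a lexicon pattern."""
    for node in tree:
        for pattern in lexicon:
            found = match(tree, node, pattern)
            if found:
                for n in found:
                    n.mwe = True
    return tree


# Toy tree for "took this relevant remark into account" (assumed analysis).
tree = [
    Node(1, "took", "take", 0),
    Node(2, "this", "this", 4),
    Node(3, "relevant", "relevant", 4),
    Node(4, "remark", "remark", 1),
    Node(5, "into", "into", 6),
    Node(6, "account", "account", 1),
]
# Hypothetical lexicon entry: "take" governing "account" governing "into".
lexicon = [("take", [("account", [("into", [])])])]

tagged_forms = [n.form for n in tag_mwes(tree, lexicon) if n.mwe]
```

Because matching follows dependencies and compares lemmas, the inflected verb and the intervening object phrase do not block the match, which would be harder to obtain on a flat token sequence.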
The second phase will focus on the use of word embeddings and deep learning models to perform the classification of tree nodes as belonging to a given expression. Methods based on word embeddings are potentially interesting for two main reasons. First, they could find MWEs that are similar to known ones by using vector similarity. Therefore, they could perform better generalizations based on little training data, increasing the coverage of the system. Second, methods based on word embeddings can identify idiomatic MWEs by detecting word combinations whose overall vector is distant from the vectors of their component words (Cordeiro et al. 2016). The system developed in the first and second phases will be evaluated on French corpora, currently under annotation, covering all MWE categories, not only verbal ones.
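The second idea can be sketched in a few lines of Python: compare the vector of the whole expression with a vector composed from its component words, and treat a low cosine similarity as a sign of idiomaticity. This is only an illustrative sketch. The additive composition, the function names and the three-dimensional toy vectors are assumptions made for this example; real embeddings would be learned from a corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compositionality(expression_vec, component_vecs):
    """Similarity between an expression vector and the sum of its parts.

    Values close to 1 suggest a compositional expression, values close
    to 0 suggest an idiomatic one (additive composition is an assumption).
    """
    composed = [sum(dims) for dims in zip(*component_vecs)]
    return cosine(expression_vec, composed)


# Invented toy vectors, for illustration only.
TOY_VECTORS = {
    "hot": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "hot_dog": [0.0, 0.1, 1.0],    # far from both "hot" and "dog"
    "olive": [1.0, 0.0, 0.2],
    "oil": [0.0, 1.0, 0.2],
    "olive_oil": [0.9, 1.0, 0.5],  # close to "olive" + "oil"
}

hot_dog_score = compositionality(
    TOY_VECTORS["hot_dog"], [TOY_VECTORS["hot"], TOY_VECTORS["dog"]]
)
olive_oil_score = compositionality(
    TOY_VECTORS["olive_oil"], [TOY_VECTORS["olive"], TOY_VECTORS["oil"]]
)
```

With these toy vectors, the score for hot dog comes out much lower than the score for olive oil. In practice, the threshold separating idiomatic from compositional combinations would have to be tuned on annotated data.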
Once the MWE identification system is both precise and robust enough, the third phase will consist in applying tree transformation operations on the resulting tagged trees so that they become closer to semantic predicate-argument structures. Verbal MWEs are particularly relevant here, since phenomena such as light-verb constructions, verbal idioms and inherently reflexive verbs should be modelled as atomic multiword predicates. The final output will be a tree (or graph) of identified predicates and their arguments, thus allowing further processing by downstream applications that require the extraction of semantic structures from text.
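One possible transformation of this kind is sketched below in Python: the nodes tagged as belonging to the same MWE are collapsed into a single predicate, and their remaining dependents become its arguments. This is a sketch under stated assumptions, not the project's method. The token dictionaries, the `mwe` identifier field, the Universal Dependencies style relation labels and the `to_predicates` function are hypothetical.

```python
def to_predicates(tokens):
    """Collapse tagged MWE nodes into atomic multiword predicates.

    Each token is a dict with the keys id, lemma, head, deprel and mwe,
    where mwe is an MWE identifier or None (assumed input format).
    """
    # Group the lexicalized components by MWE identifier.
    groups = {}
    for tok in tokens:
        if tok["mwe"] is not None:
            groups.setdefault(tok["mwe"], []).append(tok)

    predicates = []
    for members in groups.values():
        member_ids = {tok["id"] for tok in members}
        # Name the predicate after its lemmas, in surface order.
        ordered = sorted(members, key=lambda tok: tok["id"])
        name = "_".join(tok["lemma"] for tok in ordered)
        # Dependents of any component that are not components themselves
        # become arguments of the merged predicate.
        arguments = [
            (tok["deprel"], tok["lemma"])
            for tok in tokens
            if tok["head"] in member_ids and tok["id"] not in member_ids
        ]
        predicates.append({"predicate": name, "arguments": arguments})
    return predicates


# Toy tagged tree for "She took this relevant remark into account".
sentence = [
    {"id": 1, "lemma": "she", "head": 2, "deprel": "nsubj", "mwe": None},
    {"id": 2, "lemma": "take", "head": 0, "deprel": "root", "mwe": 1},
    {"id": 3, "lemma": "this", "head": 5, "deprel": "det", "mwe": None},
    {"id": 4, "lemma": "relevant", "head": 5, "deprel": "amod", "mwe": None},
    {"id": 5, "lemma": "remark", "head": 2, "deprel": "obj", "mwe": None},
    {"id": 6, "lemma": "into", "head": 7, "deprel": "case", "mwe": 1},
    {"id": 7, "lemma": "account", "head": 2, "deprel": "obl", "mwe": 1},
]

structure = to_predicates(sentence)
```

Here the three components of take into account end up as one predicate whose arguments are the subject and the object, while the modifiers of the object stay attached to it and are left out of this flat view.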
Applications should be sent before October 15, 2017. Candidates should send the following documents as a single attached document named LASTNAME-Firstname.pdf, in French or in English, to Nuria Gala, Carlos Ramisch and Alexis Nasr (FirstName.LastName@univ-amu.fr), indicating "Application post-doc AMU" in the subject line:
Candidates applying to both positions should indicate and motivate this in their cover letter, but send a single application file.