===== Identification of multiword expressions off the beaten path: towards MWE-aware semantic parsing =====

  * **Application deadline:** Oct 15, 2017
  * **Field:** natural language processing/computational linguistics
  * **City**: Marseilles, France
  * **University**: [[http://edu.univ-amu.fr/en | Aix Marseille Université (AMU)]]
  * **Laboratory**: [[http://www.lif.univ-mrs.fr/ | Laboratoire d'informatique fondamentale (LIF)]] - Luminy campus
  * **Research team**: [[http://www.lif.univ-mrs.fr/recherche/equipes/6/presentation | Written and Spoken Language Processing (TALEP)]]
  * **Supervisors**: [[http://pageperso.lif.univ-mrs.fr/~carlos.ramisch|Carlos Ramisch]] and [[http://pageperso.lif.univ-mrs.fr/~alexis.nasr|Alexis Nasr]]
  * **Duration:** 2 years, starting in January 2018 (or earlier)
  * **Remuneration:** around 2000-2400€/month net (depending on experience)
  * **Funding:** [[http://parsemefr.lif.univ-mrs.fr/|ANR PARSEME-FR project]]
  * **Keywords**: multiword expressions, MWE identification, syntactic parsing, word embeddings, recurrent neural networks, compositionality prediction, semantic parsing

-----------------

==== Context ====

We are hiring a post-doc researcher in the domain of natural language processing. This position is funded by the [[http://parsemefr.lif.univ-mrs.fr | ANR PARSEME-FR project ]], a French spin-off of the [[http://www.parseme.eu | PARSEME action]]. The PARSEME community gathers partners from 31 countries around scientific challenges in automatic processing of multiword expressions. The goal of the PARSEME-FR project is to tackle the challenges posed by multiword expressions in natural language processing systems specifically for French. The post-doc will participate in the national project meetings, co-author articles submitted to top-tier conferences and journals, and interact with other project members in France and with members of the international research community.
The post-doc will be supervised by Carlos Ramisch and Alexis Nasr (Aix Marseille University) and will become a member of the TALEP team of LIF, specialised in computational linguistics. LIF is a computer science lab affiliated to CNRS and Aix Marseille University, located on the Luminy campus in Marseilles.

The metropolitan area of Aix-Marseilles is the second largest in France. It offers a vibrant environment conveniently situated on the south coast of France, in the Provence-Alpes-Côte d'Azur region. Aix-Marseilles is a cosmopolitan and well-connected urban area with a Mediterranean climate, surrounded by stunning landscapes such as the Calanques natural park and the Provence region. By number of students and staff, Aix Marseille University is the largest university in France and in the francophone world. It provides a lively and diverse research environment which attracts scientists carrying out research in many areas, including computational linguistics, in collaboration with leading international organizations.

--------------------------

==== Research project ====

One of the main goals of natural language processing (NLP) systems is to automatically find the underlying structure in running text. Many tools and techniques for **text analysis** have been developed to transform sequences of characters into increasingly abstract representations. This process is usually based on a pipeline of modules which successively perform operations such as text segmentation, tokenization, part-of-speech tagging, morphological analysis, lemmatization, syntactic and semantic parsing.

Multiword expressions (MWEs) such as //French fries, take a break, do one's best// and //spill the beans// often pose problems for text analysis chains because of their idiosyncratic nature (Baldwin and Kim 2009, Sag et al. 2001). Among their notably challenging characteristics:
  * they can include words that only occur in a given expression (e.g. //astray// in //go astray//),
  * they can present irregular syntactic structure (e.g. //by and large//, an adverbial formed by the coordination of a preposition with an adjective),
  * they can be discontinuous (e.g. //to **take** this relevant remark **into account**//),
  * they can be ambiguous (e.g. //a piece of cake// can be something very easy or something you can eat),
  * they can present some degree of semantic non-compositionality (e.g. a //hot dog// is not literally a //dog//).

The task of automatic **MWE identification** consists in finding such irregularities in text, that is, determining which words are part of a multiword expression, and how they are related to each other.

Considerable progress has been made in recent years in understanding and modelling the interactions between MWE identification and syntactic parsing. It has been shown that the automatic discovery of new MWEs can greatly benefit from parsed data (Seretan 2008). The use of sequence models, such as CRFs and structured perceptrons, has also been explored as a means to tag (mainly contiguous) expressions in context, prior to syntactic parsing (Riedl and Biemann 2016, Schneider et al. 2014, Constant and Sigogne 2011). The use of subtrees, such as special multiword constituents (Green et al. 2013) and dependencies (Nasr et al. 2015, Vincze et al. 2013), to learn parsing models has also been investigated to deal with syntactically irregular, ambiguous and discontinuous MWEs. These challenges have also been addressed by joint models such as a synchronous transition-based dependency parser and MWE segmenter (Constant and Nivre 2016).

On the other hand, the interactions between MWEs and **semantic parsing** have not received much attention. This is surprising, given that many MWEs breach the principle of semantic compositionality to some extent, requiring special treatment in semantic analysis.
Therefore, the **goal of the proposed post-doc project is to develop and evaluate original methods for semantic-aware MWE identification and MWE-aware semantic parsing**. The recruited post-doc will be responsible for the development of a software prototype for MWE identification in French. Nonetheless, the methods implemented therein should be as language-independent and universal as possible. The project is structured in three phases: (i) MWE tagging of syntactic trees, (ii) use of deep learning methods for better generalization and compositionality modelling, and (iii) transformation of the tagged tree into a pre-semantic graph.

The first phase will consist in adapting existing sequence models to tag syntactic trees instead of flat word sequences. The first experiments will focus on verbal MWEs using the corpora of the [[http://multiword.sourceforge.net/sharedtask2017/ | PARSEME shared task on the automatic identification of verbal MWEs]]. Since verbal MWEs are often discontinuous and inflected, it is crucial to abstract away from intervening elements and inflection, taking trees as input and tagging those nodes and dependencies that are part of an MWE. The main assumption here is that verbal expressions are syntactically regular, so that MWE identification can be placed //after// syntactic parsing and before semantic analysis.

The second phase will focus on the use of word embeddings and deep learning models to classify tree nodes as belonging to a given expression. Methods based on word embeddings are potentially interesting for two main reasons. First, they could find MWEs that are similar to known ones by using vector similarity. They could therefore generalize better from little training data, increasing the coverage of the system. Second, methods based on word embeddings can identify idiomatic MWEs by detecting word combinations whose overall vector is distant from the vectors of their component words (Cordeiro et al. 2016).
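
To make the second reason concrete, the following minimal sketch scores an expression by the cosine similarity between its own vector and an additive composition of its component vectors. It is only an illustration under stated assumptions: the function names, the choice of a normalized sum as composition function, and the three-dimensional toy vectors are all hypothetical and invented for this example. It is not the project's method, and real experiments would rely on embeddings trained on large corpora.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compositionality_score(expression_vec, component_vecs):
    """Compare the vector of the whole expression with the sum of its
    length-normalized component vectors (an assumed composition function).
    Low scores suggest an idiomatic, non-compositional expression."""
    composed = [0.0] * len(expression_vec)
    for vec in component_vecs:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        for i, x in enumerate(vec):
            composed[i] += x / norm
    return cosine(expression_vec, composed)


# Hypothetical toy embeddings, invented for illustration only.
TOY_VECTORS = {
    "hot": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "hot_dog": [0.0, 0.1, 1.0],    # far from both components
    "olive": [1.0, 0.0, 0.2],
    "oil": [0.0, 1.0, 0.2],
    "olive_oil": [0.9, 1.0, 0.3],  # close to the sum of its components
}

hot_dog_score = compositionality_score(
    TOY_VECTORS["hot_dog"], [TOY_VECTORS["hot"], TOY_VECTORS["dog"]]
)
olive_oil_score = compositionality_score(
    TOY_VECTORS["olive_oil"], [TOY_VECTORS["olive"], TOY_VECTORS["oil"]]
)
print(f"hot dog: {hot_dog_score:.2f}, olive oil: {olive_oil_score:.2f}")
```

With these toy vectors, //hot dog// receives a much lower score than //olive oil//, which is the kind of signal a classifier could use as a feature when deciding whether a candidate word combination is idiomatic.
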
The system developed in the first and second phases will be evaluated on French corpora, currently under annotation, covering all MWE categories, not only verbal ones.

Once the MWE identification system is both precise and robust enough, the third phase will consist in applying tree transformation operations to the resulting tagged trees so that they become closer to semantic predicate-argument structures. Verbal MWEs are particularly relevant here, since phenomena such as light-verb constructions, verbal idioms and inherently reflexive verbs should be modelled as atomic multiword predicates. The final output will be a tree (or graph) of identified predicates and their arguments, thus allowing further processing by downstream applications that require the extraction of semantic structures from text.

-----------------

==== Profile ====

  * PhD in computer science or computational linguistics
  * Good knowledge of French and English (not necessarily native)
  * Interest in linguistics and familiarity with language technology
  * Capacity to work independently and as part of a team

-------------------------

==== Important dates ====

  * **Application deadline: October 15, 2017 (or until filled)**
  * Position starts: December 2017 or January 2018
  * Duration: 2 years, that is, 1 year renewable once

---------------------

==== Application ====

**Applications should be sent before October 15, 2017**.
Candidates should send the following documents as a single attached document named **LASTNAME-Firstname.pdf**, in French OR in English, to Nuria Gala, Carlos Ramisch and Alexis Nasr (FirstName.LastName@univ-amu.fr), indicating "Application post-doc AMU" in the subject line:
  * a CV, including a list of publications
  * a cover letter explaining how the offer matches your interests and experience
  * a copy of your PhD degree or a document indicating the expected defense date
  * the names and emails of one or two referees to be contacted

Candidates applying to both positions should indicate and motivate this in their cover letter, but send a single application file.

--------------------

==== References ====

  * [[ https://people.eng.unimelb.edu.au/tbaldwin/pubs/handbook2009.pdf | Baldwin T. and Kim S. N. (2009) Multiword Expressions ]]
  * [[ http://aclweb.org/anthology/P16-1016 | Constant M. and Nivre J. (2016) A Transition-Based System for Joint Lexical and Syntactic Analysis ]]
  * [[ http://aclweb.org/anthology/W11-0809 | Constant M. and Sigogne A. (2011) MWU-aware Part-of-Speech Tagging with a CRF model and lexical resources ]]
  * [[ http://www.aclweb.org/anthology/P16-1187 | Cordeiro S. et al. (2016) Predicting the Compositionality of Nominal Compounds: Giving Word Embeddings a Hard Time ]]
  * [[ http://aclweb.org/anthology/J13-1009 | Green S. et al. (2013) Parsing Models for Identifying Multiword Expressions ]]
  * [[ http://aclweb.org/anthology/P15-1108 | Nasr A. et al. (2015) Joint Dependency Parsing and Multiword Expression Tokenisation ]]
  * [[ http://aclweb.org/anthology/W16-1816 | Riedl M. and Biemann C. (2016) Impact of MWE resources in Multiword Recognition ]]
  * [[ http://lingo.stanford.edu/pubs/WP-2001-03.pdf | Sag I. et al. (2001) Multiword Expressions: A Pain in the Neck for NLP ]]
  * [[ http://www.issco.unige.ch/en/staff/seretan/publ/PhDThesis-VioletaSeretan.pdf | Seretan V. (2008) Collocation Extraction Based on Syntactic Parsing ]]
  * [[ http://aclweb.org/anthology/I13-1024 | Vincze V. et al. (2013) Dependency Parsing for Identifying Hungarian Light Verb Constructions ]]