ATILF offers a PhD position in computational linguistics.

===== Integrating Multiword Expressions at the heart of statistical syntactic and semantic analysis =====

  * **Applications accepted until the position is filled**
  * **Field:** natural language processing/computational linguistics
  * **Location:** [[http://www.atilf.fr/|ATILF]], University of Lorraine (Nancy, France)
  * **Supervisor:** [[http://igm.univ-mlv.fr/~mconstan/|Matthieu Constant]]
  * **Co-supervisor:** [[http://utilisateurs.linguist.univ-paris-diderot.fr/~mcandito/|Marie Candito]] (Univ. Paris Diderot and INRIA)
  * **Duration:** 3 years, October 2016 to September 2019
  * **Remuneration:** around 1700 €/month
  * **Funding:** CNRS funding, [[http://parsemefr.lif.univ-mrs.fr/|ANR PARSEME-FR project]]
  * **Keywords:** multiword expressions (MWEs), syntactic parsing, semantic parsing, deep learning

----

==== Context ====

The proposed PhD thesis belongs to the field of natural language processing, at the crossroads of computer science and linguistics. In particular, it will focus on the processing of multiword expressions, namely sequences of several words with a certain degree of idiomaticity. These expressions are very frequent and diverse; examples include hot dog, piece of cake, cut the mustard, take a step, and by and large. Identifying multiword expressions in context is an essential step for syntactic parsing, semantic analysis and, more generally, natural language processing applications such as machine translation. This PhD proposal is part of the ANR-funded PARSEME-FR project, which aims at widely integrating such expressions into syntactic and semantic parsers.
----

==== Profile ====

  * Master's degree in computer science or computational linguistics
  * Good knowledge of French and English; another language would be a plus
  * Interest in linguistics and familiarity with language technology
  * Ability to work independently and as part of a team

----

==== Application ====

Candidates should send the following documents in PDF format, in French or in English, to Mathieu Constant (FirstName.LastName@u-pem.fr) and Marie Candito (FirstName.LastName@linguist.univ-paris-diderot.fr):

  * CV
  * Cover letter
  * Transcript of MSc and BSc grades (translated if not in French or English)
  * Reference letters would be a plus

----

==== Hosting Institutions ====

=== Main affiliation ===

  * **Laboratory:** [[http://www.atilf.fr/|ATILF]]
  * **University:** [[http://www.univ-lorraine.fr/|University of Lorraine]]

=== Secondary affiliation ===

  * **Laboratory:** [[https://www.rocq.inria.fr/alpage-wiki/tiki-index.php?page=Alpage%20%E2%80%94%20Homepage|ALPAGE - INRIA]]
  * **Institutions:** [[http://www.univ-paris-diderot.fr/sc/site.php?bc=accueil&np=accueil|Université Paris Diderot]] and [[http://www.inria.fr/centre/paris|INRIA Paris]]

----

==== Scientific description ====

This PhD thesis aims at revisiting statistical syntactic and semantic analysis in the light of multiword expressions. More precisely, it falls within the framework of linear-time dependency parsing. Taking multiword expressions into account is a challenge for automatic text analysis, mainly because of their non-compositionality, i.e. the partial or total irregularity in the way their elements combine at the lexical, morpho-syntactic and/or semantic levels. Furthermore, there is a continuum between entirely fixed expressions (piece of cake) and almost free expressions (traffic light).
A large majority of these expressions are in fact partially compositional (white wine, take a nap) and thus require a non-atomic representation. The first task will be to design a new lexical, syntactic and semantic representation that handles such expressions satisfactorily.

Given this new representation, the next step will be to develop new parsing algorithms that integrate MWEs. Priority will be given to a system that performs MWE identification and syntactic parsing jointly, so that the two tasks can inform each other. Since multiword expressions generally represent semantic units, a natural extension of this joint system is a system that automatically constructs a shallow semantic graph for an input sentence.

The parsers developed should combine two features: speed and accuracy. To reach high accuracy, joint prediction can give the system access to richer linguistic information at analysis time. In addition, the use of deep learning techniques and large-scale MWE resources can be investigated. Yet this sophistication comes at the cost of increased complexity and ambiguity. A possible solution is to add constraints that reduce the search space. Finally, the proposed solutions should have (quasi-)linear time complexity, so that parsing large volumes of text remains feasible.

This thesis will be carried out in collaboration with Joakim Nivre (Univ. Uppsala, Sweden), in the framework of the European COST Action PARSEME.

----

==== Bibliography ====

  * [[http://www.mitpressjournals.org/doi/pdf/10.1162/coli.07-056-R1-07-027|Nivre, J. (2008). Algorithms for deterministic incremental dependency parsing. Computational Linguistics 34(4): 513–553.]]
  * [[http://www.aclweb.org/anthology/P/P14/P14-1070.pdf|Candito, M. and Constant, M. (2014). Strategies for Multiword Expression Analysis and Dependency Parsing. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL'14), Baltimore, USA.]]
  * [[https://aclweb.org/anthology/C/C14/C14-1177.pdf|Le Roux, J., Constant, M. and Rozenknop, A. (2014). Syntactic Parsing and Compound Recognition via Dual Decomposition: Application to French. 25th International Conference on Computational Linguistics (COLING'14).]]