ATILF offers a PhD position in computational linguistics.
The proposed PhD thesis is in the field of natural language processing, at the crossroads of computer science and linguistics. In particular, it will focus on the processing of multiword expressions, i.e. sequences of several words with a certain degree of idiomaticity. These expressions are very frequent and diverse, for instance hot dog, piece of cake, cut the mustard, take a step, by and large. Identifying multiword expressions in context is an essential step for syntactic parsing, semantic analysis and, more generally, natural language processing applications such as machine translation. This PhD proposal is part of the ANR-funded PARSEME-FR project, which aims to integrate such expressions widely into syntactic and semantic parsers.
Candidates should send the following documents in PDF format, in French or in English, to Mathieu Constant (FirstName.LastName@u-pem.fr) and Marie Candito (FirstName.LastName@linguist.univ-paris-diderot.fr)
This PhD thesis aims to revisit statistical syntactic and semantic analysis in the light of multiword expressions. More precisely, it falls within the framework of linear-time dependency parsing.
Taking multiword expressions (MWEs) into account is a challenge for automatic text analysis, mainly because of their non-compositionality, i.e. the partial or total irregularity in the way their elements combine at the lexical, morpho-syntactic and/or semantic levels. Furthermore, there is a continuum between entirely fixed expressions (piece of cake) and almost free expressions (traffic light). The vast majority of these expressions are in fact partially compositional (white wine, take a nap) and therefore require a non-atomic representation. The first task will be to design a new lexical, syntactic and semantic representation that handles such expressions satisfactorily. Given this new representation, the next step will be to develop new parsing algorithms that integrate MWEs. Priority will be given to a system that performs MWE identification and syntactic parsing jointly, so that the two tasks can inform each other. Since multiword expressions generally represent semantic units, a natural extension of this joint system is one that automatically constructs a shallow semantic graph for an input sentence.
The developed parsers should combine two properties: speed and accuracy. To reach high accuracy, joint prediction can give the system access to richer linguistic information at analysis time. In addition, the use of deep learning techniques and large-scale MWE resources can be investigated. Yet this sophistication comes at the cost of increased complexity and ambiguity. A possible solution is to add constraints that reduce the search space. Finally, we would like the proposed solutions to have (quasi-)linear time complexity, so that parsing large volumes of text is feasible.
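To make the idea of joint, linear-time analysis concrete, here is a minimal sketch in Python of a standard arc-standard transition system extended with one extra transition that fuses two adjacent items into a multiword unit. This is only an illustration under stated assumptions, not the method the thesis will develop: the MERGE transition, the Node and State classes, and the hand-written action sequence are all hypothetical, and a real parser would predict each action with a trained classifier instead of reading it from a list.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """A stack item: a single token or a merged multiword unit."""
    ids: tuple   # positions of the covered tokens in the sentence
    form: str    # surface form; words are joined by "_" after MERGE


@dataclass
class State:
    stack: list
    buffer: deque
    arcs: list = field(default_factory=list)  # (head form, dependent form)
    mwes: list = field(default_factory=list)  # token positions of each MWE


def apply(state, action):
    """Apply one transition in place. Every action except SHIFT shrinks the stack by one."""
    if action == "SHIFT":
        state.stack.append(state.buffer.popleft())
    elif action == "MERGE":
        # Hypothetical transition: fuse the two topmost items into one MWE node.
        right = state.stack.pop()
        left = state.stack.pop()
        merged = Node(left.ids + right.ids, left.form + "_" + right.form)
        # Keep only maximal units when an existing MWE is extended.
        state.mwes = [m for m in state.mwes if m not in (left.ids, right.ids)]
        state.mwes.append(merged.ids)
        state.stack.append(merged)
    elif action == "LEFT-ARC":
        # The topmost item becomes the head of the item below it.
        head = state.stack.pop()
        dep = state.stack.pop()
        state.arcs.append((head.form, dep.form))
        state.stack.append(head)
    elif action == "RIGHT-ARC":
        # The item below the top becomes the head of the topmost item.
        dep = state.stack.pop()
        state.arcs.append((state.stack[-1].form, dep.form))
    else:
        raise ValueError(f"unknown action: {action}")


def parse(words, actions):
    """Run a given action sequence over a sentence and return the final state."""
    state = State(stack=[], buffer=deque(Node((i,), w) for i, w in enumerate(words)))
    for action in actions:
        apply(state, action)
    return state


words = ["She", "ate", "a", "hot", "dog"]
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "SHIFT", "SHIFT",
           "MERGE", "LEFT-ARC", "RIGHT-ARC"]
state = parse(words, actions)
print(state.arcs)
print(state.mwes)
```

In this sketch a sentence of n tokens is fully analysed in exactly 2n - 1 transitions (n SHIFT actions plus n - 1 actions that each remove one item from the stack), which is where the linear bound comes from when every transition is chosen greedily in constant time. Merging hot and dog into an atomic node is the simplest possible choice; a partially compositional expression such as take a nap would call for a richer, non-atomic representation.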
This thesis will be carried out in collaboration with Joakim Nivre (Uppsala University, Sweden), within the framework of the European COST Action PARSEME.