The University of Tours (Blois campus) offers a PhD position in computational linguistics.

===== Keeping tabs, bringing into line and sending to the Outer Rim. How to tackle ambiguity, variability and compositionality of multiword expressions? =====


  * **Application deadline:** May 30, 2016 (or until filled)
  * **Field:** natural language processing/computational linguistics
  * **Location:** [[http://li.univ-tours.fr/ | Computer Science Lab]], University of Tours, campus in Blois (France)
  * **Supervisor**: [[http://www.info.univ-tours.fr/~savary/|Agata Savary]] (Tours)
  * **Co-supervisor**: [[http://pageperso.lif.univ-mrs.fr/~carlos.ramisch|Carlos Ramisch]] (Marseilles)
  * **Duration:** 3 years, October 2016 to September 2019
  * **Remuneration:** around 1660 €/month (including 64 hours of yearly teaching duties)
  * **Funding:** [[http://parsemefr.lif.univ-mrs.fr/|ANR PARSEME-FR project]]
  * **Keywords**: multiword expressions (MWEs), variability, ambiguity, linking, named entities, lexicons, compositionality, MWE identification, MWE linking

-----------------
==== Context ====

A PhD position in the domain of natural language processing opens in October 2016.

This PhD position is offered in the framework of the [[http://parsemefr.lif.univ-mrs.fr | ANR PARSEME-FR project ]], a French spin-off of the [[http://www.parseme.eu | PARSEME action]]. PARSEME is a European scientific network funded by the [[http://www.cost.eu/COST_Actions/ict/IC1207 | COST program]]. The action's consortium gathers partners from 31 countries around scientific challenges in automatic processing of multiword expressions. The goal of the PARSEME-FR project is to tackle these challenges specifically for French.

The PhD will be supervised by Agata Savary (University of Tours, campus in Blois) and co-supervised by Carlos Ramisch (Aix-Marseille University). The PhD candidate will work in Blois, with occasional visits to Marseilles. The candidate will have the opportunity to attend the meetings of the PARSEME-FR project, which gather the French scientific community working on multiword expressions.

-----------------
==== Profile ====

  * Master in computer science or computational linguistics
  * Good knowledge of French and English; another language would be a plus
  * Interest in linguistics and familiarity with language technology
  * Capacity to work independently and as part of a team

-------------------------
==== Important dates ====

  * **Application deadline: May 30, 2016 (or until filled)**
  * Notification: June 24, 2016
  * Position starts: October 2016
  * Position ends: September 2019

---------------------
==== Application ====

Candidates should send the following documents in PDF format, in French or in English, to Agata Savary (FirstName.LastName@univ-tours.fr) and Carlos Ramisch (FirstName.LastName@lif.univ-mrs.fr):
  * CV
  * Cover letter
  * Transcript of MSc and BSc grades (translated if not in French or English)

------------------------------
==== Hosting Institutions ====

=== Main affiliation ===

  * **University**: [[http://international.univ-tours.fr/welcome-international-265902.kjsp?RH=INTER&RF=INTER-EN | Université François Rabelais Tours]]
  * **Laboratory**: [[http://li.univ-tours.fr | Laboratoire d'informatique (LI)]]
  * **Research team**: Databases and Natural Language Processing (BdTln), campus in Blois

=== Secondary affiliation ===

  * **University**: [[http://edu.univ-amu.fr/en | Aix Marseille Université (AMU)]]
  * **Laboratory**: [[http://www.lif.univ-mrs.fr/ | Laboratoire d'informatique fondamentale (LIF)]]
  * **Research team**: Written and Spoken Language Processing (TALEP)

-------------------------------
==== Scientific challenges ====

This PhD thesis will be dedicated to semi-fixed and flexible multiword expressions (MWEs) such as //French fries//, //random access memory//, //take a break//, //take time//, //do one's best//, //spill the beans// and //kick the bucket//. This class also encompasses multiword named entities (MWNEs) such as //Jeffrey David Ullman//, //European Bank for Reconstruction and Development// and //United Kingdom of Great Britain and Northern Ireland//, which carry a rich semantic and pragmatic load, best represented via links to representations of real-world objects and concepts (people, places, events, etc.).

MWEs and MWNEs present some degree of idiosyncrasy that must be taken into account by any computational application performing some degree of semantic processing, such as machine translation and information extraction. When not handled correctly, they are often at the root of errors in NLP applications. To date, their automatic identification in running text and their linking with lexical and knowledge bases remain largely unsatisfactory. The goal of this PhD thesis is to propose and evaluate robust computational models for MWE and MWNE identification and linking.

Important properties of MWEs that must be taken into account are:
  * discontinuity
  * morpho-syntactic variability
  * semantic and pragmatic ambiguity

For instance, the components of //take a break// can inflect and be separated by extra words (//he **takes** two-and-a-half-week long **breaks** every two months//) or be subject to syntactic transformations (//the **break** that he **took** this time was short//). Elliptical variants or acronyms can replace full names (//Jeff D. Ullman//, //EBRD//, //Great Britain//). Finally, an idiomatic expression (//to kick the bucket//) may have both a literal and an idiomatic reading, depending on the context (//he kicked the bucket while cleaning the floor//).
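
The effect of inflection and discontinuity on surface matching can be illustrated with a minimal sketch. It assumes a toy lemma table (''LEMMAS'') standing in for a real morphological analyser, and a hypothetical helper ''find_mwe'' that looks for the component lemmas in their canonical order with a bounded gap. None of these names come from the PARSEME-FR project.

```python
# Toy lemma table: a hypothetical stand-in for a real morphological analyser.
LEMMAS = {"takes": "take", "took": "take", "taken": "take", "breaks": "break"}


def lemmatize(token):
    """Return the lemma of a token, falling back to its lowercased form."""
    t = token.lower()
    return LEMMAS.get(t, t)


def find_mwe(tokens, component_lemmas, max_gap=6):
    """Return index tuples where the component lemmas occur in canonical order,
    with at most max_gap intervening tokens between consecutive components."""
    lemmas = [lemmatize(t) for t in tokens]
    matches = []

    def extend(start, remaining, found):
        if not remaining:
            matches.append(tuple(found))
            return
        # The first component may occur anywhere; later ones must stay close.
        if found:
            stop = min(len(lemmas), found[-1] + max_gap + 2)
        else:
            stop = len(lemmas)
        for i in range(start, stop):
            if lemmas[i] == remaining[0]:
                extend(i + 1, remaining[1:], found + [i])

    extend(0, list(component_lemmas), [])
    return matches


inflected = "he takes two-and-a-half-week long breaks every two months".split()
relative = "the break that he took this time was short".split()

print(find_mwe(inflected, ["take", "break"]))  # [(1, 4)]
print(find_mwe(relative, ["take", "break"]))   # []
print(find_mwe(relative, ["take", "time"]))    # [(4, 6)]
```

The inflected, discontinuous occurrence is found, but the relative clause is not, because the noun precedes the verb. The same sentence also yields a spurious match for //take time//, which shows the limits of order-based surface matching.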

Syntactic parsing may help in distinguishing valid from invalid morphosyntactic variants (e.g. //the **break** that he **took** this **time** was short// is a variant of //take a break// but not of //take time//). Furthermore, precise lexical encoding of variational patterns can also help detect MWEs and MWNEs in non-canonical forms. Finally, distributional vector-space representations such as word embeddings can help distinguish literal from idiomatic readings.
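
As a rough illustration of how a parse can separate the two candidates, the sketch below checks verb-object relations over a hand-written dependency analysis of the example sentence. The edges follow UD-style labels but are an assumption made for illustration, not the output of an actual parser. Lemmas are used as node identifiers for brevity, which would not work for sentences with repeated words. The helper name ''is_verb_object_variant'' is hypothetical.

```python
# Hand-written, UD-style dependency edges (head, relation, dependent) over lemmas.
# This parse is an illustrative assumption, not the output of an actual parser.
EDGES = [
    ("break", "det", "the"),
    ("break", "acl:relcl", "take"),
    ("take", "obj", "that"),
    ("take", "nsubj", "he"),
    ("take", "obl:tmod", "time"),
    ("time", "det", "this"),
    ("short", "nsubj", "break"),
    ("short", "cop", "be"),
]

RELATIVE_PRONOUNS = {"that", "which", "who", "whom"}


def is_verb_object_variant(edges, verb, noun):
    """True if noun is a direct object of verb, or if verb heads a relative
    clause on noun and a relative pronoun fills the verb's object slot."""
    edge_set = set(edges)
    direct = (verb, "obj", noun) in edge_set
    relativized = (noun, "acl:relcl", verb) in edge_set and any(
        head == verb and rel == "obj" and dep in RELATIVE_PRONOUNS
        for head, rel, dep in edges
    )
    return direct or relativized


print(is_verb_object_variant(EDGES, "take", "break"))  # True
print(is_verb_object_variant(EDGES, "take", "time"))   # False
```

Under this analysis //time// is a temporal modifier of the verb and not its object, so only //take a break// is accepted.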

Variability is probably the key challenge in MWE and MWNE linking. Appropriate disambiguation and variant conflation are necessary whenever the canonical form of an MWE or MWNE present in a knowledge base needs to be attached to its occurrences in a text. For instance, in entity linking, named entities pre-identified in a text are to be automatically linked to nodes in knowledge bases (Linked Open Data) representing the same referents. The same applies to flexible MWEs, where the knowledge base is a lexicon describing the properties of MWEs. Conversely, in treebank annotation, automatic methods are needed to project an MWE lexicon onto a treebank in order to speed up annotation and minimize manual work.
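
Variant conflation for linking can be sketched as follows, assuming a toy knowledge base with made-up identifiers (''kb:ebrd'', ''kb:uk'') and two simple variant generators, acronyms and elliptical sub-sequences. This is only one naive way to conflate variants and is not the method the thesis is expected to develop.

```python
# Toy knowledge base: canonical names mapped to hypothetical identifiers.
KB = {
    "European Bank for Reconstruction and Development": "kb:ebrd",
    "United Kingdom of Great Britain and Northern Ireland": "kb:uk",
}


def acronym(name):
    """Build an acronym from the capitalised words of a name."""
    return "".join(w[0] for w in name.split() if w[0].isupper())


def variants(name):
    """Generate candidate variants: the full name, its acronym, and every
    contiguous sub-sequence of at least two words that starts and ends
    with a capitalised word."""
    words = name.split()
    out = {name, acronym(name)}
    for i in range(len(words)):
        for j in range(i + 2, len(words) + 1):
            if words[i][0].isupper() and words[j - 1][0].isupper():
                out.add(" ".join(words[i:j]))
    return out


# Inverted index from lowercased variant to the KB identifiers it may denote.
INDEX = {}
for canonical, kb_id in KB.items():
    for variant in variants(canonical):
        INDEX.setdefault(variant.lower(), set()).add(kb_id)


def link(mention):
    """Return the candidate KB identifiers for a mention (empty set if unknown)."""
    return INDEX.get(mention.lower(), set())


print(link("EBRD"))           # {'kb:ebrd'}
print(link("Great Britain"))  # {'kb:uk'}
print(link("Outer Rim"))      # set()
```

The generator deliberately over-produces: //Reconstruction and Development// would also be linked to the bank. This is the kind of case where contextual disambiguation becomes necessary.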

This PhD thesis will be dedicated to a better understanding of how syntactic parsing, supported by lexical encoding and vector-space semantic models, may enhance the results of MWE and MWNE identification and disambiguation. The impact of this task on entity linking and treebank annotation will be assessed with the help of French language resources and tools developed in the PARSEME-FR project.


-------------------------
==== Further reading ====

  * [[ http://lingo.stanford.edu/pubs/WP-2001-03.pdf | Multiword Expressions: A Pain in the Neck for NLP ]]
  * [[ http://link.springer.com/content/pdf/10.1007%2F978-3-540-45115-0_6.pdf | Reducing Information Variation in Text ]]
  * [[ https://aclweb.org/anthology/P/P15/P15-1108.pdf | Joint Dependency Parsing and Multiword Expression Tokenisation ]] 
  * [[ http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.461.1494&rep=rep1&type=pdf | Evaluating Entity Linking with Wikipedia ]]  
  * [[ http://aclweb.org/anthology/W/W12/W12-3311.pdf | A Generic Framework for Multiword Expressions Treatment: from Acquisition to Applications ]]
  * [[ http://aclweb.org/anthology/Q/Q14/Q14-1016.pdf | Discriminative Lexical Semantic Segmentation with Gaps: Running the MWE Gamut ]]