Agence Nationale de la Recherche

2018-lifat-m2-1

Differences

This shows you the differences between two versions of the page.

2018-lifat-m2-1 [2018/09/21 13:57]
agata.savary
2018-lifat-m2-1 [2018/09/21 14:42] (current)
agata.savary
Line 1: Line 1:
-====== Lexicon-to-corpus multiword expression browser  ======
+====== Verbal Multiword Expression Discovery in French Based on Seen Data and Distributional Semantics  ======
  
   * **Domain:** Natural Language Processing
Line 13: Line 13:
 The internship will take place in the framework of the [[http://parsemefr.lis-lab.fr|PARSEME-FR]] project, which involves several NLP teams in France. The aim is to boost applications in Natural Language Processing (NLP), by focusing on one of their major challenges: multiword expressions (MWEs).
  
-MWEs are groups of words which exhibit unpredicted properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. For instance,  _faire_ ‘make/do’ and valoir ‘be worth sth’ are verbs, while _**faire-valoir**‘a stooge, a person who is used by somebody to do things that are unpleasant or dishonest’. Similarly, the meaning of casser sa pipe (literally to break one’s pipe) ‘to die’ cannot be straightforwardly deduced from the meanings of the individual components. Additionally, MWEs exhibit unpredicted morpho-syntactic and lexical constraints. For instance, replacing the verb in lancer un appel (lit. to throw a call) ‘to issue a call’ by a synonym yields an invalid expression *jeter un appel ‘to throw a call’. Doing alike in casser sa pipe ‘to die’  imposes a literal reading of the resulting expression: briser sa piper ‘to break one’s pipe’.
+MWEs are groups of words which exhibit unpredictable properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. For instance, //faire// ‘make/do’ and //valoir// ‘be worth sth’ are verbs, while //**faire-valoir**// means ‘a stooge, a person who is used by somebody to do things that are unpleasant or dishonest’. Similarly, the meaning of //casser sa pipe// (literally //to break one’s pipe//) ‘to die’ cannot be straightforwardly deduced from the meanings of the individual components. Additionally, MWEs exhibit unpredictable morpho-syntactic and lexical constraints. For instance, replacing the verb in //**lancer** un **appel**// (lit. //to throw a call//) ‘to issue a call’ by a synonym yields an invalid expression //*jeter un appel// ‘to throw a call’. Doing likewise in //**casser sa pipe**// ‘to die’ imposes a literal reading of the resulting expression: //briser sa pipe// ‘to break one’s pipe’.
  
-One of the main aims of MWE-oriented NLP research is to model such expressions so as to optimize their automatic processing (for instance, to avoid their literal translation in machine translation systems). Two major MWE-related NLP tasks include MWE discovery and MWE identification. In the former, the input consists of large quantities of raw texts and the output is a list of potential MWEs. In the latter, an identifier takes a text as input and automatically annotates (points at) the occurrences of MWEs in it. MWE identification is a pre-requisite for downstream applications such as machine translation (which may want to treat MWEs with dedicated procedures).
+One of the main aims of MWE-oriented NLP research is to model such expressions so as to optimize their automatic processing (for instance, to avoid their literal translation in machine translation systems). Two major MWE-related NLP tasks include **MWE discovery** and **MWE identification**. In the former, the input consists of large quantities of raw texts and the output is a list of potential MWEs. In the latter, an identifier takes a text as input and automatically annotates (points at) the occurrences of MWEs in it. MWE identification is a pre-requisite for downstream applications such as machine translation (which may want to treat MWEs with dedicated procedures).
  
-Automatic identification of MWEs in 19 languages was addressed by the PARSEME shared task (Ramisch et al., 2018), in which the BdTln team participated with the VarIDE system (Pasquer et al., 2018a). The results of the shared task show that identifying unseen MWEs (i.e. those MWEs which do not occur in the training data) is particularly challenging. Thus, identification should, ideally, exploit not only annotated corpora but also MWE lexicons and MWE discovery methods.
+Automatic identification of MWEs in 19 languages was addressed by the PARSEME shared task (Ramisch et al., 2018), in which the BdTln team participated with the VarIDE system (Pasquer et al., 2018a). The results of the shared task show that identifying **unseen MWEs** (i.e. those MWEs which do not occur in the training data) is particularly challenging. Thus, identification should, ideally, exploit not only annotated corpora but also MWE lexicons and MWE discovery methods.
  
 +===== Topics =====
 +This internship is dedicated to investigating how MWE discovery could benefit from previously seen data, rather than being performed from scratch. The hypothesis to be tested is that new (unseen) MWEs of certain types can be discovered thanks to their semantic similarity to known (previously seen) MWEs. For instance, knowing that //**haute température**// ‘high temperature’ is a known MWE, and replacing its components by synonyms and antonyms, we obtain semantically close but previously unseen MWEs such as //**température élevée**// ‘high temperature’ or //**basse/moyenne température**// ‘low/medium temperature’ (Savary & Jacquemin, 2003). Thus, new MWEs might be discovered by examining **lexical substitution** performed within known MWEs. The challenge is to determine to what degree lexical substitution yields valid MWEs, rather than spurious MWE candidates. For instance, a known MWE //**prendre** un **bain**// ‘to take a bath’ allows us to discover //**prendre** une **douche**// ‘to take a shower’. But the use of other semantically related but more distant words leads to invalid or literal expressions, such as //prendre une baignoire// ‘to take a bathtub’ or //prendre un WC// ‘to take a WC’.
 +
 +To perform lexical substitution, a model of semantic similarity of words and expressions is needed. Previous work exploited semantic resources such as WordNet. In this internship, we focus on the domain of **distributional semantics**, which is based on the hypothesis that words having a similar meaning occur in similar contexts. Recent developments in distributional semantics include the construction of **word embeddings**, i.e. mappings from words or expressions to low-dimensional vectors of real numbers, which are expected to represent co-occurrence contexts of these words/expressions in a compact way. Thus, an embedding of a word/expression can be considered an abstract representation of its meaning. 
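As a rough illustration of how word embeddings can support lexical substitution, the following Python sketch ranks substitution candidates for a word by cosine similarity. The three-dimensional vectors and the names ''TOY_EMBEDDINGS'', ''cosine'' and ''nearest_neighbours'' are invented for this example; real embeddings would be trained on a large French corpus and have far more dimensions.

```python
import math

# Toy vectors invented for illustration only (assumption, not real data).
TOY_EMBEDDINGS = {
    "bain": [0.9, 0.8, 0.1],
    "douche": [0.85, 0.75, 0.2],
    "baignoire": [0.6, 0.2, 0.8],
    "appel": [0.1, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, embeddings, k=2):
    """Return the k words most similar to `word`, best first."""
    target = embeddings[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for neighbour, score in nearest_neighbours("bain", TOY_EMBEDDINGS):
        print(f"{neighbour}\t{score:.3f}")
```

With these toy vectors, //douche// is ranked above //baignoire// as a substitute for //bain//, which mirrors the intuition that closer words are more likely to yield valid MWEs. Similarity alone does not decide validity, though.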
  
 ===== Objectives =====
  
-The main objective of this internship is to develop tools to link and explore the annotated corpus and the lexicon, built in the PARSEME-FR project.
-The internship can be divided in the following tasks:
-  * Study the linguistic properties of French multiword expressions, through the exploration of a verbal MWE lexicon extracted from the Lexicon-Grammar tables (Gross 1984)
-  * Develop a tool to automatically link annotated verbal MWEs and their corresponding entries in the lexicon, using matching procedures based on some normalization of the expressions. For instance, a lexicon entry like //to spill the beans// should be linked to its corpus occurrences like //the beans he spilled//, //spilling the beans// etc.
-  * Develop a web interface in order to search annotated expressions via multicriteria filters and to visualize their tree-like linguistic description
-  * Optionally, extend the work to all multiword expressions, which involves the automatic conversion of an existing dictionary of nominal, adjectival and adverbial compounds into the PARSEME-FR format, in particular using syntactic information available in the annotated corpus
+The objectives of this internship are to exploit word embeddings for the discovery of new MWEs based on their semantic proximity to previously seen MWEs, contained in a lexicon or in an annotated corpus (resources of both types belong to the outcomes of the PARSEME-FR project). The discovery should lead to (semi-)automatic enrichment of these initial resources. Two stages are to be considered:
+  * candidates for new MWEs are generated by replacing individual components of known MWEs by semantically close words, established notably via word embeddings;
+  * the candidates generated in this way are filtered based on their corpus frequency or contexts of occurrence; for instance, adjectives //chaud/froid// ‘hot/cold’ tend to co-occur more frequently with //**prendre** un **bain**/une **douche**// ‘to take a bath/shower’ than with //**prendre** une **baignoire**// (spacieuse/solide...) ‘to take a (spacious/solid) bathtub’.
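The two stages of the new objectives (candidate generation, then filtering) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: MWEs are represented as tuples of their lexicalized lemmas, the neighbour lists are assumed to come from some embedding model, the lemmatized toy corpus is invented, and sentence-level co-occurrence stands in for a real frequency or context-based filter. All function names are hypothetical.

```python
from collections import Counter


def generate_candidates(mwe, neighbours):
    """Stage 1: replace one component at a time by each of its neighbours."""
    candidates = set()
    for i, word in enumerate(mwe):
        for substitute in neighbours.get(word, []):
            candidates.add(mwe[:i] + (substitute,) + mwe[i + 1:])
    return candidates


def filter_by_frequency(candidates, corpus_sentences, min_count=2):
    """Stage 2: keep candidates whose components co-occur in at least
    `min_count` sentences (a crude proxy for corpus frequency)."""
    counts = Counter()
    for sentence in corpus_sentences:
        tokens = set(sentence)
        for candidate in candidates:
            if all(word in tokens for word in candidate):
                counts[candidate] += 1
    return {c for c in candidates if counts[c] >= min_count}


# Hypothetical inputs: a known MWE, assumed neighbour lists, and an
# invented lemmatized corpus.
KNOWN_MWE = ("prendre", "bain")
NEIGHBOURS = {"bain": ["douche", "baignoire"]}
TOY_CORPUS = [
    ["il", "prendre", "un", "douche", "chaud"],
    ["elle", "prendre", "un", "douche", "froid"],
    ["il", "prendre", "un", "baignoire"],
    ["je", "prendre", "un", "bain"],
]

if __name__ == "__main__":
    generated = generate_candidates(KNOWN_MWE, NEIGHBOURS)
    kept = filter_by_frequency(generated, TOY_CORPUS, min_count=2)
    print(sorted(generated))
    print(sorted(kept))
```

Working on lemma tuples rather than surface strings sidesteps agreement changes such as //un bain// versus //une douche//. A realistic filter would also look at the contexts of occurrence mentioned above, for example co-occurring adjectives.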
  
-===== Profile ====
-  * Master 1 or Master 2 in computational linguistics or computer science,
-  * Good knowledge of French,
-  * Interests in linguistics and familiarity with language technology,
-  * Programming skills (python, web programming).
+Possible extensions of the objectives:
 
-===== Important dates ====
-  * Application deadline: 15 January 2018 (or until filled)
-  * Notification: 25 January 2018
-  * Position starts: around March 2018
-  * Position ends: July-August 2018
+  * integrating MWE discovery with MWE identification in //VarIDE//
+  * coupling word embedding-based lexical replacement with semantic resources such as WordNet.
  
  
-===== References ====
+===== Candidate's profile ====
+  * 2nd-year master's student in computational linguistics, computer science or similar
+  * Interests in linguistics and familiarity with language technology
+  * Good knowledge of French
+  * Good programming skills, preferably in Python
 
-Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Yannick Parmentier, Caroline Pasquer, and Jean-Yves Antoine. Annotation d’expressions polylexicales verbales en français. In Jean-Yves Antoine Iris Eshkol, editor, 24e conférence sur le Traitement Automatique des Langues Naturelles (TALN), Actes de TALN, volume 2 articles courts, pages 1–9, Orléans, France, 06 2017.
+===== Important dates ====
+  * Application deadline: 15 December 2018 (or until filled)
+  * Notification: 15 January 2018
+  * Position starts: around February-March 2018
+  * Position ends: around July-August 2018
 
-Maurice Gross. Lexicon-grammar and the syntactic analysis of French. In Proc. of COLING-ACL 1964, pages 275–282, Stanford, CA, 1984. Association for Computational Linguistics.
-
-Agata Savary, Carlos Ramisch, Silvio Cordeiro, Federico Sangati, Veronika Vincze, Behrang QasemiZadeh, Marie Candito, Fabienne Cap, Voula Giouli, Ivelina Stoyanova, and Antoine Doucet. The PARSEME shared task on automatic identification of verbal multiword expressions. In Proc. of EACL 2017 Workshop on MWEs, pages 31–47, Valencia, April 2017.
+===== How to apply =====
+Send your CV and a cover letter to:
+  * Caroline Pasquer: first.last@etu.univ-tours.fr
+  * Agata Savary, Jean-Yves Antoine: first.last@univ-tours.fr
+  * Carlos Ramisch: first.last@lis-lab.fr
  
  
 +===== References ====
 +  * Baldwin, T., Kim, S. N. (2010) [[https://people.eng.unimelb.edu.au/tbaldwin/pubs/handbook2009.pdf|Multiword Expressions]], in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.
 +  * Farahmand, M., Henderson, J. (2016) [[http://www.aclweb.org/anthology/W16-1809|Modeling the non-substitutability of multiword expressions with distributional semantics and a loglinear model]], in Proceedings of the ACL 2016 Workshop on MWEs, Berlin, pp. 61-66.
 +  * Fazly, A., Cook, P., Stevenson, S. (2009) [[http://www.aclweb.org/anthology/J09-1005|Unsupervised type and token identification of idiomatic expressions]], Computational Linguistics 35(1), pp. 61-103.
 +  * Peng, J., Aharodnik, K., Feldman, A. (2018) A Distributional Semantics Model for Idiom Detection - The Case of English and Russian, Special Session on Natural Language Processing in Artificial Intelligence, pp. 675-682.
 +  * Pasquer, C., Savary, A., Antoine, J.-Y., Ramisch, C. (2018b) [[http://aclweb.org/anthology/C18-1219|If you’ve seen some, you’ve seen them all: Identifying variants of multiword expressions]], in Proceedings of the 27th International Conference on Computational Linguistics (COLING-18), Santa Fe, USA.
 +  * Ramisch, C., Cordeiro, S., Savary, A., Vincze, V. et al. (2018) [[http://aclweb.org/anthology/W18-4925|Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions]], in Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), August 2018, Santa Fe, USA, pp. 222-240.
 +  * Savary, A., Jacquemin, Ch. (2003) [[https://link.springer.com/content/pdf/10.1007%2F978-3-540-45115-0_6.pdf|Reducing Information Variation in Text]], in Renals, S., Grefenstette, G. (eds.) Text- and Speech-Triggered Information Access, Proceedings of TESTIA 2000, 8th ELSNET European Summer School on Language and Speech Communication, Lecture Notes in Artificial Intelligence 2705, Springer Verlag, pp. 145-181.
 ------------------------------
-===== How to apply ===== 
  
-Applications should be sent to Mathieu.Constant@univ-lorraine.fr. They should include a CV, a cover letter, and possibly support letters by teacher. 
2018-lifat-m2-1.1537531057.txt.gz · Last modified: 2018/09/21 13:57 by agata.savary