Syntactic Parsing and Multiword Expressions in French

Garder la trace, mettre de l'ordre et relier les points : modéliser l'ambiguïté, la variation et la compositionalité des expressions polylexicales

Date limite d'envoi des dossiers: 30 mai 2016
Domaine: traitement automatique des langues/linguistique informatique
Localisation: Laboratoire d'informatique, Université de Tours, antenne de Blois (France)
Encadrant: Agata Savary (Tours)
Co-encadrant: Carlos Ramisch (Marseille)
Durée: 3 ans, octobre 2016 à septembre 2019
Rémunération: approx. 1660 €/mois (incluant 64 heures annuelles d'enseignement)
Financement: projet ANR PARSEME-FR
Mots clefs: expressions polylexicales (EP), variation, ambiguïté, linking, entités nommées, lexiques, compositionalité, identification d'EP

Contexte

Cette thèse de doctorat s'inscrit dans le cadre du projet ANR PARSEME-FR, un spin-off français de l'action PARSEME. PARSEME est un réseau scientifique européen financé par le programme COST. Le consortium regroupe plus de 200 partenaires de 31 pays autour des défis scientifiques en traitement automatique d'expressions polylexicales. L'objectif du projet PARSEME-FR est de s'attaquer à ces défis spécifiquement en langue française.

Le travail de thèse sera encadré par Agata Savary (Université de Tours, antenne de Blois) et co-encadré par Carlos Ramisch (Aix Marseille Université). Le/la doctorant(e) travaillera à Blois, et effectuera des visites ponctuelles à Marseille. Il/elle participera aux réunions régulières du projet PARSEME-FR réunissant des spécialistes français en traitement d'expressions polylexicales.

Profil

Master en informatique ou traitement automatique des langues
Bon niveau de français et d'anglais, la maîtrise d'une autre langue serait un plus
Intérêts pour la linguistique, connaissances des technologies langagières
Capacité de travail en autonomie et en équipe

Calendrier

Date limite d'envoi des dossiers: 30 mai 2016 (ou tant que le poste n'est pas pourvu)
Résultats de la sélection: 24 juin 2016
Début de contrat: octobre 2016
Fin de contrat: septembre 2019

Dossier

Les candidats doivent faire parvenir les documents suivants au format PDF, en français OU en anglais, à Agata Savary (prenom.nom@univ-tours.fr) ET Carlos Ramisch (prenom.nom@lif.univ-mrs.fr) :

CV
lettre de motivation
relevé de notes de Master et licence (avec traduction, si rédigés en une autre langue)

Institutions d'accueil

Affiliation principale

Université: Université François Rabelais Tours
Laboratoire: Laboratoire d'informatique (LI)
Équipe de recherche: Bases de données et traitement des langues (BdTln), antenne de Blois

Affiliation secondaire

Université: Aix Marseille Université (AMU)
Laboratoire: Laboratoire d'informatique fondamentale (LIF)
Équipe de recherche: Traitement automatique du langage écrit et parlé (TALEP)

Défis scientifiques

Cette thèse sera dédiée au traitement informatique des expressions polylexicales (EP) semi-figées et flexibles, telles que blanc d'œuf, mémoire vive, prendre une pause, prendre le temps, retourner sa veste ou prendre le taureau par les cornes. Cette classe englobe également les entités nommées polylexicales (ENP) telles que Valéry Giscard d'Estaing, Fonds européen de développement régional et Royaume-Uni de Grande-Bretagne et d'Irlande du Nord. La particularité des ENP réside dans leur richesse sémantique et pragmatique, représentable par des liens vers des référents d'entités et de concepts du monde réel ou du monde du discours (personnes, lieux, objets, événements, etc.).

Les EP et les ENP présentent des comportements linguistiques irréguliers qui doivent être pris en compte par toute application informatique réalisant des traitements d'ordre sémantique, telle que la traduction automatique ou l'extraction d'informations. En l'absence de traitements appropriés, les EP et les ENP sont sources d'erreurs en traitement automatique des langues. À ce jour, leur identification automatique en contexte, ainsi que leur rattachement à une entrée dans une base de connaissances (entity linking), n'atteignent pas des niveaux de qualité suffisants.

L'objectif de cette thèse est de proposer et évaluer des méthodes informatiques robustes pour l'identification et le linking des EP et des ENP.

Les propriétés principales des EP et des ENP à prendre en compte dans ce contexte sont :

leur discontinuité dans des textes
leur variation morpho-syntaxique
leur ambiguïté sémantique et pragmatique

Par exemple, les constituants de prendre une pause peuvent se fléchir ou être séparés par des éléments externes (il n'a pris ces derniers mois aucune pause) ou soumis à des transformations syntaxiques (les pauses qu'il prend ces derniers temps sont longues). Des variantes elliptiques ou des acronymes peuvent remplacer les noms complets (Giscard d'Estaing, FEDER, Grande-Bretagne). Finalement, une EP potentielle peut avoir une interprétation idiomatique ou littérale (il a retourné sa veste pour la nettoyer de l'intérieur).

L'analyse syntaxique, tout comme l'encodage lexical fin des EP, peut aider à détecter les occurrences non-canoniques d'EP et à distinguer les variantes morpho-syntaxiques valides de celles qui sont impossibles. Par exemple, les pauses qu'il prend ces derniers temps sont longues est une variante de prendre une pause mais pas de prendre le temps. De plus, l'emploi de représentations distributionnelles fondées sur des espaces vectoriels (word embeddings) peut contribuer à distinguer les interprétations littérales et idiomatiques.

Les variations constituent l'un des défis majeurs de la tâche de linking d'EP et d'ENP. Des algorithmes de désambiguïsation et de fusion de variantes sont requis lorsqu'une forme canonique d'une EP ou d'une ENP figurant dans une base de connaissances est à relier à ses occurrences dans un texte. Par exemple, dans la tâche d'entity linking, les entités nommées pré-identifiées dans un texte doivent être automatiquement reliées aux Données Liées Ouvertes (Linked Open Data) représentant les mêmes référents. C'est également le cas pour les EP flexibles, dont les propriétés sont décrites dans un lexique morpho-syntaxique. La tâche inverse, mais également complexe, consiste à projeter automatiquement un lexique d'EP sur un corpus arboré, afin d'accélérer le processus d'annotation du corpus.

Cette thèse de doctorat devra mener à une meilleure compréhension des avantages que l'identification automatique et la désambiguïsation des EP et des ENP peuvent tirer de l'analyse syntaxique, de l'encodage lexical et des modèles sémantiques distributionnels. L'impact de ces traitements sur les tâches d'entity linking et de la pré-annotation de corpus sera évalué à l'aide de ressources et outils langagiers du français, développés dans le cadre du projet ANR PARSEME-FR.

English version

The University of Tours (Blois campus) offers a PhD position in computational linguistics.

Keeping tabs, bringing into line and sending to the Outer Rim. How to tackle ambiguity, variability and compositionality of multiword expressions?

Application deadline: May 30, 2016
Field: natural language processing/computational linguistics
Location: Computer Science Lab, University of Tours, campus in Blois (France)
Supervisor: Agata Savary (Tours)
Co-supervisor: Carlos Ramisch (Marseilles)
Duration: 3 years, October 2016 to September 2019
Remuneration: around 1660 €/month (including 64 hours of yearly teaching duties)
Funding: ANR PARSEME-FR project
Keywords: multiword expressions (MWEs), variability, ambiguity, linking, named entities, lexicons, compositionality, MWE identification, MWE linking

Context

A PhD position in the domain of natural language processing opens in October 2016.

This PhD position is offered in the framework of the ANR PARSEME-FR project, a French spin-off of the PARSEME action. PARSEME is a European scientific network funded by the COST program. The action's consortium gathers partners from 31 countries around scientific challenges in the automatic processing of multiword expressions. The goal of the PARSEME-FR project is to tackle these challenges specifically for French.

The PhD will be supervised by Agata Savary (University of Tours, campus in Blois) and co-supervised by Carlos Ramisch (Aix-Marseille University). The PhD candidate will work in Blois, with occasional visits to Marseilles. The candidate will have the opportunity to attend the meetings of the PARSEME-FR project, which gather the French scientific community working on multiword expressions.

Profile

Master's degree in computer science or computational linguistics
Good knowledge of French and English, another language would be a plus
Interests in linguistics and familiarity with language technology
Capacity to work independently and as part of a team

Important dates

Application deadline: May 30, 2016 (or until the position is filled)
Notification: June 24, 2016
Position starts: October 2016
Position ends: September 2019

Application

Candidates should send the following documents in PDF format, in French or in English, to Agata Savary (FirstName.LastName@univ-tours.fr) and Carlos Ramisch (FirstName.LastName@lif.univ-mrs.fr):

CV
Cover letter
Transcript of MSc and BSc grades (translated if not in French or English)

Hosting Institutions

Main affiliation

University: Université François Rabelais Tours
Laboratory: Laboratoire d'informatique (LI)
Research team: Databases and Natural Language Processing (BdTln), campus in Blois

Secondary affiliation

University: Aix Marseille Université (AMU)
Laboratory: Laboratoire d'informatique fondamentale (LIF)
Research team: Written and Spoken Language Processing (TALEP)

Scientific challenges

This PhD thesis will be dedicated to semi-fixed and flexible multiword expressions (MWEs) such as French fries, random access memory, take a break, take time, do one's best, spill the beans and kick the bucket. This class also encompasses multiword named entities (MWNEs) such as Jeffrey David Ullman, European Bank for Reconstruction and Development and United Kingdom of Great Britain and Northern Ireland. MWNEs carry a rich semantic and pragmatic load, best represented via links to representations of objects and concepts of the real world (people, places, objects, events, etc.).

MWEs and MWNEs are idiosyncratic to a degree that must be taken into account by any computational application performing semantic processing, such as machine translation or information extraction. When not handled correctly, they are often at the root of errors in NLP applications. To date, the quality of their automatic identification in running text, and of their linking to lexical and knowledge bases, remains largely unsatisfactory. The goal of this PhD thesis is to propose and evaluate robust computational models for MWE and MWNE identification and linking.

Important properties of MWEs that must be taken into account are:

discontinuity
morpho-syntactic variability
semantic and pragmatic ambiguity

For instance, the components of take a break can inflect and be separated by intervening words (he takes two-and-a-half-week long breaks every two months) or undergo syntactic transformations (the break that he took this time was short). Elliptical variants or acronyms can replace full names (Jeff D. Ullman, EBRD, Great Britain). Finally, an idiomatic expression (to kick the bucket) may have both a literal and an idiomatic reading, depending on the context (he kicked the bucket while cleaning the floor).
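The discontinuity and inflection illustrated above can be sketched with a toy lemma-based matcher. This is only an illustration under strong assumptions, not a method proposed by the project: the sentence is lemmatized by hand, and the function name `find_mwe` and the `max_gap` parameter are hypothetical.

```python
from typing import List, Optional, Tuple


def find_mwe(lemmas: List[str],
             components: Tuple[str, ...],
             max_gap: int = 6) -> Optional[List[int]]:
    """Return the token indices of the first in-order match of `components`,
    allowing at most `max_gap` intervening tokens between two consecutive
    components. Return None when no such match exists."""
    for start, lemma in enumerate(lemmas):
        if lemma != components[0]:
            continue
        positions = [start]
        for comp in components[1:]:
            prev = positions[-1]
            # Candidate positions for the next component, within the gap limit.
            window = range(prev + 1, min(len(lemmas), prev + 2 + max_gap))
            nxt = next((i for i in window if lemmas[i] == comp), None)
            if nxt is None:
                break
            positions.append(nxt)
        else:
            return positions
    return None


# Hand-lemmatized version of:
# "he takes two-and-a-half-week long breaks every two months"
sentence = ["he", "take", "two-and-a-half-week", "long", "break",
            "every", "two", "month"]
print(find_mwe(sentence, ("take", "break")))  # [1, 4]
```

Such a surface matcher finds the inflected, discontinuous occurrence, but it misses the relative-clause variant (the break that he took), where the noun precedes the verb, and it cannot tell a literal reading from an idiomatic one.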

Syntactic parsing may help in distinguishing valid from invalid morphosyntactic variants (e.g. the break that he took this time was short is a variant of take a break but not of take time). Furthermore, precise lexical encoding of variational patterns can help detect MWEs and MWNEs in non-canonical forms. Finally, distributional vector-space representations such as word embeddings can help distinguish literal from idiomatic readings.
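One simple way to picture the use of vector-space representations for literal versus idiomatic readings is to compare the context of an occurrence with seed words for each reading. The sketch below is purely illustrative: the two-dimensional vectors in `TOY_EMB` are made up for the example (real embeddings would be learned from a corpus), and the names `reading`, `mean_vector` and the seed lists are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(words: List[str], emb: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the words that are present in `emb`."""
    vecs = [emb[w] for w in words if w in emb]
    dim = len(next(iter(emb.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Made-up vectors, for illustration only:
# axis 0 is roughly "household", axis 1 roughly "death".
TOY_EMB = {
    "floor": [0.9, 0.1], "cleaning": [0.8, 0.0], "mop": [1.0, 0.1],
    "funeral": [0.0, 1.0], "illness": [0.1, 0.9], "died": [0.0, 0.9],
}


def reading(context: List[str], literal_seeds: List[str],
            idiomatic_seeds: List[str], emb: Dict[str, List[float]]) -> str:
    """Pick the reading whose seed words are closer to the context."""
    ctx = mean_vector(context, emb)
    lit = cosine(ctx, mean_vector(literal_seeds, emb))
    idi = cosine(ctx, mean_vector(idiomatic_seeds, emb))
    return "literal" if lit >= idi else "idiomatic"


# Context words of "he kicked the bucket while cleaning the floor"
print(reading(["cleaning", "floor"], ["mop"], ["died"], TOY_EMB))
```

With these toy vectors the cleaning context is classified as literal, while a context such as illness and funeral would come out as idiomatic. How well this carries over to real embeddings and real French data is exactly the kind of question the thesis would have to investigate.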

Variability is probably the key challenge in MWE and MWNE linking. Appropriate disambiguation and variant conflation are necessary whenever a canonical form of an MWE or MWNE present in a knowledge base needs to be attached to its occurrences in a text. For instance, in entity linking, named entities pre-identified in a text are to be automatically linked to nodes in knowledge bases (Linked Open Data) representing the same referents. The same applies to flexible MWEs, where the knowledge base is a lexicon describing the properties of MWEs. Conversely, in treebank annotation, automatic methods are needed to project an MWE lexicon onto a treebank in order to speed up annotation and minimize manual work.
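Variant conflation for named entities can be illustrated with a minimal sketch that accepts acronyms and elliptical variants of a canonical name. It is a toy under stated assumptions, not the project's approach: the stop-word list and the function names `acronym` and `is_variant` are hypothetical, and matching is done on plain whitespace-separated tokens.

```python
from typing import List

# Function words skipped when building an acronym (assumed list).
STOPWORDS = {"of", "for", "and", "the"}


def acronym(name: str) -> str:
    """Build an acronym from the initials of the content words of `name`."""
    return "".join(w[0].upper() for w in name.split()
                   if w.lower() not in STOPWORDS)


def is_variant(mention: str, canonical: str) -> bool:
    """True if `mention` equals the canonical name, is its acronym, or is a
    contiguous sub-sequence of its tokens (an elliptical variant)."""
    m: List[str] = mention.split()
    c: List[str] = canonical.split()
    if not m:
        return False
    if mention == canonical or mention == acronym(canonical):
        return True
    return any(c[i:i + len(m)] == m for i in range(len(c) - len(m) + 1))


EBRD = "European Bank for Reconstruction and Development"
UK = "United Kingdom of Great Britain and Northern Ireland"

print(acronym(EBRD))                    # EBRD
print(is_variant("Great Britain", UK))  # True
print(is_variant("Jeff D. Ullman", "Jeffrey David Ullman"))  # False
```

The last call shows the limits of surface matching: shortened first names and initials are not covered, and nothing here resolves ambiguity between several candidate entries in the knowledge base.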

This PhD thesis will be dedicated to a better understanding of how syntactic parsing, supported by lexical encoding and vector-space semantic models, may enhance the results of MWE and MWNE identification and disambiguation. The impact of this task on entity linking and treebank annotation will be assessed with the help of French language resources and tools developed in the PARSEME-FR project.
