ANR PROFITEROLE (PRocessing Old French Instrumented TExts for the Representation Of Language Evolution)
Scientific lead, IHRIM: Serge HEIDEN (IHRIM UMR 5317, ENS de Lyon)
Scientific lead: Sophie PREVOST (LATTICE UMR 8094, CNRS)
- Lattice (UMR 8094, ENS Paris / Univ. Paris Sorbonne Nouvelle): S. Prévost (lead), I. Tellier, K. Gerdes, A. Stein, T. Rainsford
- LLF (UMR 7110, Univ. Paris Diderot): B. Crabbé (lead), B. Sagot, E. de la Clergerie, M. Constant, R. Garnier
- IHRIM (UMR 5317, ENS Lyon): S. Heiden (lead), C. Guillot, A. Lavrentiev
The PROFITEROLE project has three closely related goals in the fields of linguistics and Natural Language Processing (NLP). The first is to model morphological and syntactic aspects of the diachronic evolution of French. The second is to develop a methodology for exploring and annotating heterogeneous linguistic data, while providing automatic analysers for various stages of the French language. The third is to expand linguistic resources for French by building a large annotated corpus (1 million words) of Medieval French (9th-15th centuries) and morphological lexicons (plus NLP tools) covering several stages of French.
The Medieval period is critical for the study of the evolution of French: it is during this period that most core morphological and syntactic changes were initiated and began to spread throughout the language. Focusing on this chronological span therefore gives us better insight into the evolution of French and a better understanding of certain mechanisms of change that have also taken place in other languages. Material constraints have so far limited data mining and other extensive analyses of French diachronic text collections, which require partially automated exploitation of the data. This holds especially for the Medieval period. The release in 2013 of the Syntactic Reference Corpus of Medieval French (SRCMF) has opened up new perspectives for both linguistics and NLP. The SRCMF is an Old French (9th-13th centuries) treebank annotated with fine-grained syntactic dependency structures; each of its 251,000 words carries a manually checked POS tag and a syntactic function.
Old French shows much greater variation than Modern French at both the grapho-phonetic and syntactic levels. This variation must be conceived as internal to the language, although it can also be seen as external, since it operates between texts that differ in external variables (such as date, dialect, domain-genre, form, or register), with date being the main parameter of variation. The SRCMF is well suited to the study of variation in its internal and external dimensions, and of the possible correlations between the two. This is a hitherto understudied domain of syntax, especially with regard to the progressive fixation of word order in the history of French, which will be our main linguistic focus. The complexity of the task calls for sophisticated statistical and computational techniques. The strong variation in Medieval French complicates the identification and characterization of its successive stages, while also being a critical factor in the passage from one stage to the next. A significant increase in processed data is therefore crucial for a better understanding of language change. Yet this same multi-heterogeneity of the data is a major challenge for automatic text enrichment.
Building on the SRCMF and on the diachronic morphological lexicons designed during the project, we will develop an annotation methodology capable of handling this variability by exploring two distinct approaches. The first relies on manually crafted symbolic parsers; the second is based on Machine Learning. These automatic analysis tools and resources, accurately and easily configurable for various stages of Medieval French, will then serve to automatically explore and annotate additional Old French plain texts. All the annotated data and tools will provide new and valuable linguistic insights for the study of diachronic variation, and will more generally contribute to our understanding of multi-heterogeneous data and its computerized processing, both for past stages of French and for present-day stages of languages.
The collaboration of international specialists in historical linguistics, digital humanities and natural language processing is an important asset for the success of this project.