09.11.2021 François Pellegrino and Matthew Stave

Morph in translation: a preliminary account of the temporal distribution of morphological information in the DoReCo corpus


Despite sharing the same overall communicative functions, human languages exhibit a wide range of grammatical strategies to convey information. Over the last sixty years, linguistic grammars and typological studies have revealed the existence of typical linguistic profiles, a picture further refined by the more recent development of corpus-based quantitative studies. We now have a much better (though still incomplete) picture of morphological typology, one that admits the existence of intermediate strategies between archetypes (such as purely isolating vs. agglutinating languages). Still, corpus-based studies often remain biased towards ‘major’ languages or rely on written corpora, with their inherent limitations. As a consequence, elaborating a cross-linguistic and usage-based account of how a language system fulfills its speakers’ communicative goals is an ongoing task (Levshina, 2021; Levshina & Moran, 2021).

The DoReCo project (Paschen et al., 2020) aims to overcome these obstacles by offering (semi-)spontaneous oral recordings time-aligned with a fine-grained morphosyntactic analysis for about 50 languages, with at least 10,000 words per language. Research on the DoReCo database has already resulted in quantitative studies on linguistic complexity (Easterday et al., 2021; Stave et al., 2021a) and morphological typology (Stave, von Prince & Seifart, 2021). In this colloquium, we will present a preliminary account of the way morphological information is distributed over time in more than a dozen languages from the DoReCo collection. Our study stems from recent proposals that the rate of information transmitted in human communication is regulated, either in terms of fluctuations (Frank & Jaeger, 2008; Seifart et al., 2018) or on average (Cohen Priva, 2017; Coupé et al., 2019; Pellegrino, Coupé & Marsico, 2011). We have adopted an original dual approach, based on both the linguist-informed morphosyntactic analysis of the source language and an analysis of the English translation performed with state-of-the-art natural language processing. In this way, we combine the advantages of a linguistic approach that stays close to the original speech production with those of automatic processing applied to the same pivot language (English) for each corpus.



