09.11.2021 François Pellegrino and Matthew Stave

Morph in translation: a preliminary account of the temporal distribution of morphological information in the DoReCo corpus


Despite sharing the same overall communicative functions, human languages exhibit a wide range of grammatical strategies to convey information. Over the last sixty years, linguistic grammars and typological studies have revealed the existence of typical linguistic profiles, which the more recent development of corpus-based quantitative studies has further refined. We now have a much better (though still incomplete) picture of morphological typology, one that admits the existence of intermediate strategies between archetypes (such as purely isolating vs. agglutinating languages). Still, corpus-based studies often remain biased towards ‘major’ languages or rely on written corpora with their inherent limitations. As a consequence, elaborating a cross-linguistic and usage-based account of how a language system fulfills its speakers’ communicative goals is an ongoing task (Levshina, 2021; Levshina & Moran, 2021).

The DoReCo project (Paschen et al., 2020) aims to overcome these obstacles by offering (semi-)spontaneous oral recordings time-aligned with their fine-grained morphosyntactic analysis for about 50 languages, with at least 10,000 words per language. Research on the DoReCo database has already resulted in quantitative studies on linguistic complexity (Easterday et al., 2021; Stave et al., 2021a) and morphological typology (Stave, von Prince & Seifart, 2021). In this colloquium, we will present a preliminary account of the way morphological information is distributed over time in more than a dozen languages from the DoReCo collection. Our study stems from recent proposals that the rate of information transmitted in human communication is regulated, either in terms of fluctuations (Frank & Jaeger, 2008; Seifart et al., 2018) or on average (Cohen Priva, 2017; Coupé et al., 2019; Pellegrino, Coupé & Marsico, 2011). We have adopted an original dual approach, based on both the linguist-informed morphosyntactic analysis of the source language and an analysis of the English translation performed with state-of-the-art natural language processing. In this way, we combine the advantages of a linguistic approach close to the original speech production with those of automatic processing performed on the same pivot language (English) for each corpus.



Cohen Priva, U. (2017). Not so fast: Fast speech correlates with lower lexical and structural information. Cognition, 160, 27-34.

Coupé, C., Oh, Y. M., Dediu, D., & Pellegrino, F. (2019). Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Science Advances, 5(9), eaaw2594.

Easterday, S., Stave, M., Allassonnière-Tang, M., & Seifart, F. (2021). Syllable complexity and morphological synthesis: A well-motivated positive complexity correlation across subdomains. Frontiers in Psychology, 12, 583.

Frank, A. F., & Jaeger, T. F. (2008). Speaking rationally: Uniform information density as an optimal strategy for language production. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 30, No. 30).

Levshina, N. (2021). Corpus-based typology: applications, challenges and some solutions. Linguistic Typology.

Levshina, N., & Moran, S. (2021). Efficiency in human languages: Corpus evidence for universal principles. Linguistics Vanguard, 7(s3).

Paschen, L., Delafontaine, F., Draxler, C., Fuchs, S., Stave, M., & Seifart, F. (2020). Building a time-aligned cross-linguistic reference corpus from language documentation data (DoReCo). In Proceedings of the 12th language resources and evaluation conference (pp. 2657-2666).

Pellegrino, F., Coupé, C., & Marsico, E. (2011). A cross-language perspective on speech information rate. Language, 539-558.

Seifart, F., Strunk, J., Danielsen, S., Hartmann, I., Pakendorf, B., Wichmann, S., ... & Bickel, B. (2018). Nouns slow down speech across structurally and culturally diverse languages. Proceedings of the National Academy of Sciences, 115(22), 5720-5725.

Stave, M., Paschen, L., Pellegrino, F., & Seifart, F. (2021). Optimization of morpheme length: a cross-linguistic assessment of Zipf’s and Menzerath’s laws. Linguistics Vanguard, 7(s3).

Stave, M., von Prince, K., & Seifart, F. (2021). A usage-based approach to morphological typology. 54th Annual Meeting of the Societas Linguistica Europaea, 30 August-3 September 2021.