The Indo-European Cognate Relationships Dataset.

Matthew Scarborough has featured at LH many times (see, e.g., here), and he has now posted The Indo-European Cognate Relationships dataset (Scientific Data 12: 1541):

This is somewhat old news, since the dataset (v1.0) has been available since the publication of the analysis paper in Science two years ago, but now that that paper is out, we (mainly Cormac Anderson and Paul Heggarty, who wrote most of the paper) have finally been able to publish The Indo-European Cognate Relationships dataset paper in Scientific Data as of yesterday. The paper discusses the underlying dataset and its organisation and structure, and is published together with a revised version (v1.2) of the dataset on Zenodo. The dataset itself can be explored using its web application at https://iecor.clld.org.

From the article’s abstract:

The Indo-European Cognate Relationships (IE-CoR) dataset is an open-access relational dataset showing how related, inherited words (‘cognates’) pattern across 160 languages of the Indo-European family. IE-CoR is intended as a benchmark dataset for computational research into the evolution of the Indo-European languages. It is structured around 170 reference meanings in core lexicon, and contains 25731 lexeme entries, analysed into 4981 cognate sets. Novel, dedicated structures are used to code all known cases of horizontal transfer. All 13 main documented clades of Indo-European, and their main subclades, are well represented. Time calibration data for each language are also included, as are relevant geographical and social metadata. Data collection was performed by an expert consortium of 89 linguists drawing on 355 cited sources. The dataset is extendable to further languages and meanings and follows the Cross-Linguistic Data Format (CLDF) protocols for linguistic data. It is designed to be interoperable with other cross-linguistic datasets and catalogues, and provides a reference framework for similar initiatives for other language families.

Not to understate the achievement here, but where we say benchmark dataset, I believe this is the most comprehensive cognacy-indexed dataset for Indo-European since Isidore Dyen’s dataset, which was used in Dyen, Kruskal & Black’s An Indoeuropean Classification: A Lexicostatistical Experiment (Transactions of the American Philosophical Society 82 (5)) and which, with some modifications, has been essentially the same modern-language dataset behind many recent phylogenetic studies focused primarily on lexical cognacy data, including Gray & Atkinson (2003), Bouckaert et al. (2012) and Chang et al. (2015). And while Heggarty et al. (2023) is not immune from criticism, I believe that we and our co-authors have at the least made a solid new dataset that can be used for research on the Indo-European language family, and a database structure that can serve as a template for work on other language families for many years to come.

Congratulations to all the co-authors for finally getting this out. This one has been a long time in the making.

Congratulations from me as well: y’all have done a great thing.

Comments

  1. David Eddyshaw says

    Yes indeed.
