The Indo-European Cognate Relationships (IE-CoR) dataset is a comprehensive, open-access relational database of cognates (related words inherited from a common ancestor) across 160 Indo-European languages. Developed by a consortium of 89 linguists, it is intended as a benchmark for computational research into the evolution of this vast language family, and comprises 25,731 lexeme entries grouped into 4,981 cognate sets across 170 core meanings. The dataset includes time calibration data, geographical and social metadata, and a novel structure for coding horizontal transfer, and it follows the Cross-Linguistic Data Format (CLDF) for interoperability and long-term accessibility. IE-CoR addresses limitations of previous datasets through improved coverage, rigorous coding protocols, and a focus on the primary cognate state of root morphemes, making it a valuable resource for phylogenetic and quantitative linguistic research.
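To make the structure concrete, the following is a minimal Python sketch of how lexemes in a CLDF-style dataset can be grouped into cognate sets and turned into the presence/absence coding that phylogenetic methods typically consume. It is an illustration under stated assumptions, not IE-CoR's own tooling: the column names (`ID`, `Language_ID`, `Parameter_ID`, `Form`, `Form_ID`, `Cognateset_ID`) are assumed to follow common CLDF defaults, and the rows, identifiers, and helper functions are invented for the example rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows, NOT taken from IE-CoR. Column names are assumed
# to follow common CLDF defaults for a form table and a cognate table.
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
f1,english,father,father
f2,german,father,Vater
f3,latin,father,pater
f4,english,dog,dog
f5,german,dog,Hund
"""

COGNATES_CSV = """ID,Form_ID,Cognateset_ID
c1,f1,set_father_1
c2,f2,set_father_1
c3,f3,set_father_1
c4,f4,set_dog_1
c5,f5,set_dog_2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def cognate_sets(forms_text, cognates_text):
    """Map each cognate set ID to the set of languages attesting it."""
    forms = {row["ID"]: row for row in read_rows(forms_text)}
    sets = defaultdict(set)
    for row in read_rows(cognates_text):
        language = forms[row["Form_ID"]]["Language_ID"]
        sets[row["Cognateset_ID"]].add(language)
    return dict(sets)


def presence_matrix(sets, languages):
    """Binary coding: 1 if a language attests a cognate set, else 0.

    Simplification: a language with no entry for a meaning is coded 0 here,
    whereas a real analysis would usually treat it as missing data.
    """
    columns = sorted(sets)
    return {lang: [int(lang in sets[c]) for c in columns] for lang in languages}


SETS = cognate_sets(FORMS_CSV, COGNATES_CSV)
MATRIX = presence_matrix(SETS, ["english", "german", "latin"])

for lang, row in MATRIX.items():
    print(lang, row)
```

Each column of the resulting matrix corresponds to one cognate set, so two languages share a 1 only where their words for a meaning descend from the same root.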
A pair of landmark studies published in Nature traces the origin of the Indo-European family of more than 400 languages, spoken by more than 40% of the world's population, to the Caucasus-Lower Volga people, who lived in present-day Russia around 6,500 years ago.
Harvard researchers traced the origins of the vast Indo-European language family to the Caucasus-Lower Volga region, identifying the ancestral population known as the Yamnaya, who appeared around 3300 BCE and spread from Hungary to western China.