Learn how to fine-tune large language models like Llama 3 for function calling, enabling them to interact with external tools and APIs for tasks such as web search and math operations.
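Function-calling fine-tuning works by training on chat transcripts in which the assistant emits a structured tool call instead of free text. As a rough illustration, here is a toy sketch of what one such training sample might look like; the field names (`tool_call`, `arguments`) and overall layout are assumptions for illustration, not the exact schema of any particular fine-tuning recipe or chat template.

```python
import json

def build_function_calling_sample(question, tool_name, arguments, tool_result, answer):
    """Assemble one chat-style training sample in which the assistant
    issues a structured tool call, receives the tool's result, and then
    answers the user. Hypothetical format for illustration only."""
    return {
        "messages": [
            {"role": "user", "content": question},
            # The assistant's turn is a machine-parseable call, not prose.
            {"role": "assistant", "tool_call": {"name": tool_name, "arguments": arguments}},
            # The tool's output is fed back as its own turn.
            {"role": "tool", "content": json.dumps(tool_result)},
            # The final assistant turn grounds its answer in that output.
            {"role": "assistant", "content": answer},
        ]
    }

sample = build_function_calling_sample(
    question="What is 17 * 23?",
    tool_name="multiply",
    arguments={"a": 17, "b": 23},
    tool_result={"value": 391},
    answer="17 * 23 = 391.",
)
print(json.dumps(sample, indent=2))
```

During fine-tuning, many such samples teach the model when to emit a call (and with which arguments) versus when to answer directly.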
HuggingFace has released FineWeb, a new large-scale dataset consisting of 15 trillion tokens and occupying 44TB of disk space, designed for pretraining large language models (LLMs). The dataset, derived from CommonCrawl, undergoes rigorous deduplication and quality filtering, making it a valuable resource for researchers.
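To make the deduplication-and-filtering idea concrete, here is a toy sketch of that kind of pipeline: exact deduplication via content hashing plus a crude quality filter (a minimum word count). This is purely illustrative; FineWeb's actual pipeline operates at web scale and uses far more sophisticated techniques than this.

```python
import hashlib

def dedup_and_filter(docs, min_words=5):
    """Toy pipeline: drop exact duplicates by hashing normalized text,
    then keep only documents that pass a simple length-based quality
    filter. Illustrative only, not FineWeb's real implementation."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate: discard
        seen.add(digest)
        if len(doc.split()) >= min_words:
            kept.append(doc)  # passes the minimum-length quality filter
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # duplicate, dropped
    "too short",                                     # fails the quality filter
    "Common Crawl snapshots contain a lot of boilerplate and spam.",
]
print(dedup_and_filter(corpus))  # two documents survive
```

Real pipelines replace the exact-hash step with fuzzy techniques such as MinHash so that near-duplicates are also caught.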
In Apache Spark, a DataFrame is a Dataset[Row].