klotz: commoncrawl* + fineweb*

0 bookmark(s) - Sort by: Date ↓ / Title / - Bookmarks from other users for this tag

  1. HuggingFace has released FineWeb, a new large-scale dataset consisting of 15 trillion tokens and 44TB of disk space designed for pretraining large language models (LLMs). The dataset, which leverages data from CommonCrawl, undergoes rigorous deduplication and quality filtering processes, making it a valuable tool for researchers.
    2024-06-04 Tags: , , , , by klotz

Top of the page

First / Previous / Next / Last / Page 1 of 0 SemanticScuttle - klotz.me: Tags: commoncrawl + fineweb

About - Propulsed by SemanticScuttle