Hugging Face has released FineWeb, a new large-scale dataset for pretraining large language models (LLMs). It contains 15 trillion tokens and occupies 44TB of disk space. Built from CommonCrawl data, the dataset has been rigorously deduplicated and quality-filtered, making it a valuable resource for researchers.
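To give a rough sense of what deduplication and quality filtering mean in practice, here is a minimal toy sketch in Python. It is not FineWeb's actual pipeline: the function names, the exact-hash deduplication, and the thresholds (`min_words`, `min_alpha_ratio`) are all illustrative assumptions, and real web-scale pipelines typically use fuzzier matching and many more heuristics.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def passes_quality(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: keep documents that are long enough and mostly letters."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def dedup_and_filter(docs):
    """Drop exact duplicates (after normalization), then apply the quality check."""
    seen = set()
    kept = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # an equivalent document was already processed
        seen.add(key)
        if passes_quality(doc):
            kept.append(doc)
    return kept


sample = [
    "Large language models are trained on text collected from the web.",
    "Large  language models are trained on text collected from the WEB.",  # duplicate after normalization
    "$$$ 1234 !!! #### 5678",  # mostly symbols and digits
    "Too short.",
]
cleaned = dedup_and_filter(sample)
print(cleaned)
```

In this toy run only the first document survives: the second is a duplicate once case and whitespace are normalized, the third fails the letter-ratio check, and the fourth is below the minimum word count.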