🤗 FineData

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)

🤗 HuggingFace 🍷 FineWeb datasets

Read our technical report!

This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl), released under a permissive license (ODC-By).

The creation of 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barriers to entry for anyone intending to pretrain high-performance large language models.

All code and artefacts needed for reproduction are public and built on top of open source libraries, such as the 🤗 libraries datatrove, nanotron or lighteval.

Version 1 of the 🍷 FineWeb dataset is available here. Our ablation models can be found here.

Version 2 of the 🥂 FineWeb dataset (a multilingual extension to 1,800+ languages/scripts) is available here.