🤗 FineData

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).

Organization Card

🤗 HuggingFace 🍷 FineWeb datasets

Read our technical report!

This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl), released under a permissive license (ODC-By).

The creation of 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone intending to pretrain high-performance large language models.

All code and artefacts needed for reproduction are public and built on top of open-source libraries, including the 🤗 libraries datatrove, nanotron and lighteval.

Version 1 of the 🍷 FineWeb dataset is available as HuggingFaceFW/fineweb. Our ablation models are published under the same organization.

Version 2 of the 🥂 FineWeb dataset, a multilingual extension to 1,800+ languages/scripts, is available as HuggingFaceFW/fineweb-2.
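For readers who want to peek at the data without downloading a full dump, here is a minimal sketch using the 🤗 `datasets` library in streaming mode. It assumes the Version 1 repository is `HuggingFaceFW/fineweb` (as listed under datasets on this page), that a `sample-10BT` configuration exists, and that each record has a `text` field; none of these details are stated in the card itself. The helper names `open_fineweb_stream` and `take_texts` are hypothetical.

```python
from itertools import islice
from typing import Iterable


def open_fineweb_stream(repo_id: str = "HuggingFaceFW/fineweb",
                        config: str = "sample-10BT"):
    """Return a streaming iterator over the train split (needs network access).

    The repo id, config name, and split are assumptions for illustration.
    """
    # Imported lazily so take_texts() works even without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, name=config, split="train", streaming=True)


def take_texts(records: Iterable[dict], n: int, field: str = "text") -> list[str]:
    """Collect `field` from the first n records where it is present and non-empty."""
    texts = (rec[field] for rec in records if rec.get(field))
    return list(islice(texts, n))
```

Calling `take_texts(open_fineweb_stream(), 3)` would then pull three documents over the network; because the stream is lazy, only the records actually consumed are fetched.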

Collections 5

Spaces 5

Discussion Forum

FineWeb: decanting the web for the finest text data at scale

Generate high-quality web text data for LLM training

Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks

Evaluate multilingual models using FineTasks

Tasks Explorer

Datasets Metrics Explorer

Models 30

HuggingFaceFW/fineweb-edu-classifier

Text Classification • Updated Nov 17, 2024 • 6.27k downloads • 176 likes

HuggingFaceFW/Datasets-Metrics-Viewer-Data

Updated Sep 2, 2024

HuggingFaceFW/ablation-model-fineweb-edu

Text Generation • Updated Jun 11, 2024 • 362 downloads • 12 likes

HuggingFaceFW/ablation-exp-filter-custom-all_filters-28BT

Text Generation • Updated Jun 4, 2024 • 8 downloads • 1 like

HuggingFaceFW/ablation-exp-filter-custom-line_char_duplicated_0.01-28BT

Text Generation • Updated Jun 4, 2024 • 3 downloads • 2 likes

HuggingFaceFW/ablation-exp-filter-custom-line_ratio_0.67-28BT

Text Generation • Updated Jun 4, 2024 • 7 downloads

HuggingFaceFW/ablation-exp-filter-custom-lines_punct_0.12-28BT

Text Generation • Updated Jun 4, 2024 • 3 downloads • 3 likes

HuggingFaceFW/ablation-exp-filter-baseline_c4-28BT

Text Generation • Updated Jun 4, 2024 • 8 downloads • 2 likes

HuggingFaceFW/ablation-exp-filter-baseline_cc-28BT

Text Generation • Updated Jun 4, 2024 • 9 downloads • 4 likes

HuggingFaceFW/ablation-exp-filter-c4-word_lengths-28BT

Text Generation • Updated Jun 4, 2024 • 6 downloads • 2 likes

Datasets 7

HuggingFaceFW/fineweb-edu-score-2

Viewer • Updated Apr 11 • 13.1B rows • 31.4k downloads • 73 likes

HuggingFaceFW/clean-wikipedia

Viewer • Updated Mar 19 • 61.1M rows • 922 downloads • 2 likes

HuggingFaceFW/fineweb-edu

Viewer • Updated Jan 31 • 3.3B rows • 139k downloads • 675 likes

HuggingFaceFW/fineweb

Viewer • Updated Jan 31 • 25B rows • 868k downloads • 2.15k likes

HuggingFaceFW/fineweb-2

Viewer • Updated Jan 8 • 12.5B rows • 41k downloads • 479 likes

HuggingFaceFW/admin

Viewer • Updated Dec 7, 2024 • 16 rows • 14.5k downloads • 3 likes

HuggingFaceFW/fineweb-edu-llama3-annotations

Viewer • Updated Jun 3, 2024 • 467k rows • 292 downloads • 40 likes