metadata
title: README
emoji: π
colorFrom: purple
colorTo: pink
sdk: static
pinned: false
🍷 FineData
This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.
- 🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blogpost and paper.
- 📚 FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- 🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
- 📄 FinePDFs: 3T tokens of text data extracted from PDFs sourced from the Web.
- FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
- FinePDFs-Edu: 350B+ highly educational tokens filtered from 📄 FinePDFs.