Blog: https://together.ai/blog/redpajama-data-v2
Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
GitHub: https://github.com/togethercomputer/RedPajama-Data
Description:
RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals, and 20B documents that are deduplicated.
Is there any way we can actually read those datasets? I’m a noob when it comes to what’s under the hood. On Hugging Face it looks like they tried to upload the dataset but it failed, likely due to the sheer size of the thing…