Blog: https://together.ai/blog/redpajama-data-v2

Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2

GitHub: https://github.com/togethercomputer/RedPajama-Data

Description:

RedPajama-V2 is an open dataset for training large language models. It comprises over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents additionally come with quality signals, and 20B documents are deduplicated.

  • FairSum@alien.topB · 11 months ago

    Man, 30T tokens deduplicated is a lot of data.

    For reference, Llama 2 was trained on 2T tokens and GPT-4 was believed to have been trained on 13T tokens (and my suspicion is Turbo was too). This is much, much more than that.

  • Maykey@alien.topB · 11 months ago

    20B documents that are deduplicated.

    I wonder if we’ll see an even slimmer version.

  • Feztopia@alien.topB · 11 months ago

    Does this have the same level of deduplication as SlimPajama, or do we need a SlimPajama v2?

  • LuluViBritannia@alien.topB · 11 months ago

    Is there any way we can read those datasets? I’m a noob when it comes to “what’s under the hood”. On Hugging Face it looks like they tried to upload the dataset, but it failed, likely due to the sheer size of the thing…
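    For reference, the dataset card describes streaming access through the Hugging Face datasets library, so you can peek at records without downloading anything to disk. A rough sketch, assuming the “sample” configuration and the raw_content field named on the card (verify both there; recent datasets releases may also require trust_remote_code=True because the repo ships a loading script):

        # Stream a handful of documents instead of downloading the full corpus.
        from datasets import load_dataset

        ds = load_dataset(
            "togethercomputer/RedPajama-Data-V2",
            name="sample",    # small demo configuration listed on the dataset card (assumption)
            streaming=True,   # iterate over HTTP rather than materializing files on disk
        )

        # Print the start of the first few documents from the train split.
        for i, doc in enumerate(ds["train"]):
            print(doc["raw_content"][:200])   # field name taken from the dataset card; verify
            if i == 4:
                break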

  • UserMinusOne@alien.topB · 11 months ago

    How much free space is required to do a “git clone …”?

    Is there a better method to download the data without requiring additional space for the history (.git)? If so, how big is the whole dataset?

    Given the current developments, maybe someone should start collecting the raw data and serving it as torrents… Just in case.
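    For what it’s worth, a plain git clone of an LFS-backed dataset repo roughly doubles the space, since each large file is stored once under .git/lfs and once in the working tree (GIT_LFS_SKIP_SMUDGE=1 git clone … fetches only the pointer files). A lighter option is huggingface_hub, which keeps no git history at all and lets you filter to just the files you want. A minimal sketch, with a made-up allow_patterns value; swap in paths from the repo’s actual file listing:

        # Fetch only a chosen subset of the dataset repo, with no .git history on disk.
        from huggingface_hub import snapshot_download

        snapshot_download(
            repo_id="togethercomputer/RedPajama-Data-V2",
            repo_type="dataset",
            allow_patterns=["sample/**"],     # placeholder pattern, not the repo's real layout
            local_dir="redpajama-v2-subset",  # download here instead of the shared HF cache
        )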