Blog: https://together.ai/blog/redpajama-data-v2
Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
GitHub: https://github.com/togethercomputer/RedPajama-Data
Description:
RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals, and 20B documents that are deduplicated.
Is there any way we can actually read those datasets? I’m a noob when it comes to what’s under the hood. On Hugging Face it looks like they tried to upload the dataset but it failed, likely due to the sheer size of the thing…