[R] MADLAD-400 - 4.6 / 2.6 trillion token dataset covering 419 languages + translation models up to 10.7B parameters

APaperADay@alien.top · 10 months ago

CatalyzeX_code_bot@alien.top · 10 months ago

Found 2 relevant code implementations for “MADLAD-400: A Multilingual And Document-Level Large Audited Dataset”.

If you have code to share with the community, please add it here 😊🙏

–

To opt out from receiving code links, DM me.

APaperADay@alien.top · 10 months ago

Credit to u/jbochi for getting the models to run + telling Google to fix their model checkpoints.

jbochi@alien.top · 10 months ago

thanks

maizeq@alien.top · 10 months ago

Their use of monolingual and multilingual to describe the same dataset is unusual.

I get that they’re probably trying to say “monolingual at the document-level”, but the back and forth is quite confusing.

E.g.

“We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset”

“We use both supervised parallel data with a machine translation objective and the monolingual MADLAD-400 dataset”

“Through MADLAD-400, we introduce a highly multilingual, general web-domain, document-level text dataset”

Unless I am missing something obvious, these are either typos or poor wording decisions.