[R] MADLAD-400 - 4.6 / 2.6 trillion token dataset covering 419 languages + translation models up to 10.7B parameters

APaperADay@alien.top · 1 year ago

maizeq@alien.top · 1 year ago

Their use of monolingual and multilingual to describe the same dataset is unusual.

I get that they’re probably trying to say “monolingual at the document-level”, but the back and forth is quite confusing.

E.g.

“We introduce MADLAD-400, a manually audited, general domain 3T token monolingual dataset”

“We use both supervised parallel data with a machine translation objective and the monolingual MADLAD-400 dataset”

“Through MADLAD-400, we introduce a highly multilingual, general web-domain, document-level text dataset”

Unless I am missing something obvious, these are either typos or poor wording decisions.