I have the feeling a lot of models include a lot of data in many languages. Would it make more sense to train just on English data and have a separate translation layer? Or do I misunderstand something?
By having a separate translation module, you’re making the decision for the model about which parameters should be used for translation, and which for learning about the world.
With an extremely small model (one that doesn’t have the capacity to even fully learn English), this would probably be reasonable. With any other size of model (100–200 million parameters and up, maybe?), it would be far, far more effective to let the model pick and choose how it allocates its parameters.
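To make the contrast concrete, here's a rough sketch of the two setups in Python. All the function names here (`translate`, `english_lm`, `multilingual_lm`) are hypothetical placeholders for whatever components you'd plug in, not a real library API:

```python
from typing import Callable

def pipeline_answer(
    prompt: str,
    user_lang: str,
    translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> text
    english_lm: Callable[[str], str],           # model trained only on English data
) -> str:
    """Separate translation layer: translate in, reason in English, translate out."""
    english_prompt = translate(prompt, user_lang, "en")
    english_answer = english_lm(english_prompt)
    return translate(english_answer, "en", user_lang)

def end_to_end_answer(prompt: str, multilingual_lm: Callable[[str], str]) -> str:
    """Single multilingual model: translation and world knowledge share the same parameters,
    and the model decides internally how to split capacity between them."""
    return multilingual_lm(prompt)
```

In the pipeline version, the split between "translation parameters" and "world-knowledge parameters" is fixed by the architecture; in the end-to-end version, training gets to find that split on its own.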
In practice, this usually leads to translation and world knowledge being so thoroughly melded that we don't currently know how to tell whether a given neuron or set of neurons handles one task or the other. The most likely explanation, in my opinion, is that most neurons are multitask.