You should try scaling up the hidden size and the number of layers of your baseline model together.

In my opinion, though, it’s better to scale your new model down instead, to control for computational complexity.

Also make sure that not just the number of parameters but also the computational cost is comparable. Two models can have the same parameter count while one needs a much larger compute budget; for example, ALBERT shares its parameters across all layers, yet it has the same computational requirements as BERT.
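
To make that concrete, here is a rough back-of-the-envelope sketch (the layer count, widths, vocabulary size, and sequence length are made-up placeholders, and the formulas ignore embeddings, layer norm, and softmax compute) of why ALBERT-style layer sharing cuts parameters but not per-token compute:

```python
# Back-of-the-envelope comparison of parameter count vs. per-token compute
# for two transformer configs. All sizes below are made-up placeholders.

def param_count(n_layers, d_model, d_ff, vocab_size, share_layers=False):
    """Approximate parameters: attention + feed-forward weights per block,
    plus token embeddings. With share_layers=True (ALBERT-style sharing),
    the block weights are stored only once."""
    per_block = 4 * d_model * d_model + 2 * d_model * d_ff
    block_params = per_block if share_layers else n_layers * per_block
    return block_params + vocab_size * d_model

def flops_per_token(n_layers, d_model, d_ff, seq_len):
    """Approximate forward FLOPs per token. Layer sharing does NOT reduce
    this: the shared block is still executed n_layers times."""
    attn = 4 * d_model * d_model + 2 * seq_len * d_model  # projections + attention
    ffn = 2 * d_model * d_ff                              # two feed-forward matmuls
    return 2 * n_layers * (attn + ffn)                    # x2 for multiply-accumulate

bert_like = dict(n_layers=12, d_model=768, d_ff=3072)

print("params, no sharing         :", param_count(**bert_like, vocab_size=30000))
print("params, shared layers      :", param_count(**bert_like, vocab_size=30000, share_layers=True))
print("FLOPs/token (same for both):", flops_per_token(**bert_like, seq_len=512))
```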

Make sure you optimise the learning rate for both architectures, because the optimal LR may change depending on your modifications.
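
For example, you could run a small grid over learning rates for each architecture and keep the one with the best validation loss; in this sketch, build_model and train_and_evaluate are hypothetical stand-ins for your own model constructor and training loop:

```python
# Hypothetical LR sweep: try each learning rate on a fresh model and keep the
# one with the lowest validation loss. Repeat separately per architecture.

def sweep_learning_rates(build_model, train_and_evaluate,
                         lrs=(1e-4, 3e-4, 1e-3, 3e-3)):
    """build_model() returns a fresh model; train_and_evaluate(model, lr)
    returns the final validation loss. Both are placeholders for your code."""
    results = {lr: train_and_evaluate(build_model(), lr) for lr in lrs}
    best_lr = min(results, key=results.get)
    return best_lr, results

# Usage with your own functions, e.g.:
# best_lr_base, _ = sweep_learning_rates(build_baseline, train_and_evaluate)
# best_lr_new,  _ = sweep_learning_rates(build_modified, train_and_evaluate)
```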

Additionally, 100 training steps might not be enough to draw conclusions. Plot the loss curve and train to convergence, and share the curve rather than just the final number.
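
A minimal plotting sketch, assuming you log (step, validation loss) pairs for both runs yourself; matplotlib is just one option:

```python
import matplotlib.pyplot as plt

def plot_loss_curves(runs, out_path="loss_curves.png"):
    """runs: dict mapping run name -> (steps, validation_losses), taken from
    your own training logs. Overlays the curves so convergence is visible."""
    for name, (steps, losses) in runs.items():
        plt.plot(steps, losses, label=name)
    plt.xlabel("training step")
    plt.ylabel("validation loss")
    plt.legend()
    plt.tight_layout()
    plt.savefig(out_path)

# Usage with your own logs, e.g.:
# plot_loss_curves({"vanilla transformer": (steps, base_losses),
#                   "modified model": (steps, new_losses)})
```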

Importantly, verify that your numbers for the vanilla transformer match those reported elsewhere.

Finally, make sure you have an explanation of the mechanism by which the model learns better. Then check whether anybody has done something similar elsewhere and what their results were.

If you could give a further hint about your modifications, I could comment more.
--
PS: Based on this, the loss seems too good to be true: https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word
If I’m not wrong, your perplexities for the vanilla transformer are much higher than GPT-4’s.
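
As a quick sanity check, you can convert the reported per-word cross-entropy loss to perplexity and put it next to the leaderboard entries; the loss value below is only a placeholder:

```python
import math

# Word-level perplexity is exp(average cross-entropy per word), assuming a
# natural-log loss averaged over words. If you train on sub-word tokens,
# rescale the loss by (num_tokens / num_words) before exponentiating.
loss_per_word = 4.5  # placeholder: your measured validation loss
perplexity = math.exp(loss_per_word)
print(f"perplexity: {perplexity:.1f}")  # exp(4.5) ≈ 90
```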