[R] Seeking input: transformer modification with 25-30% improvement in validation loss across 3 datasets

exmatrixmachina@alien.top · 2 years ago

[R] Seeking input: transformer modification with 25-30% improvement in validation loss across 3 datasets

mrfox321@alien.top · 2 years ago

You should share your methodology.

Professor_Entropy@alien.top · 2 years ago

You should try scaling up the hidden size and number of layers together of your baseline model.

In my opinion, it’s better if you rather scaled down your new model to control for computational complexity.

Also make sure not just the # parameters, but the computational complexity isn’t too different. It could be the case that number of parameters are the same but one has higher computational budget. For example, Albert has all layers shared parameters but has the same computational requirements as Bert.

Make sure you optimise the learning rate for both the architectures, because optimal LR may change depending on your modifications.

Additionally, 100 training steps might not be enough to conclude. You should plot the graph and train to convergence. Share the graph rather than just final number.

Importantly, verify that these numbers for vanilla transformer match with hose reported elsewhere?

Finally, make sure you have an explanation of the mechanism by which the model gets better learning. Then check if anybody has done something similar elsewhere and how the results were.

If you could give further hint into your modifications, I could comment more.
--
PS: based on this the loss seems too good to be true https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word
If I’m not wrong your perplexities for vanilla transformers are much higher than gpt4

koolaidman123@alien.top · 2 years ago

A lot of reported architecture improvements disappear at scale, or end up having some contamination

Best way to see if it works is to release the code and have others tinker with it

lightSpeedBrick@alien.top · 2 years ago

Largely unrelated, but this has a similar vibe. I wonder what happened to that high school kid who invented the transformer even before Vaswani et al, and then a year later another guy who claimed to invent a brand new neural network architecture that was supposed to break the internet.

curiousshortguy@alien.top · 2 years ago

Why are we looking at train and validation LOSS instead of a meaningful metric here?

r_s_s_i_u@alien.top · 2 years ago

More than validation loss, try to test your model against a benchmark (after scaling up your model). That will help you decide if your model is actually better (when compared to other models of the same scale)