Long time lurker here. Made an account just to post this.
I’ve been experimenting with some modifications to the transformer architecture (the addition of a new standalone component).
Recently I got to something that seems to improve validation loss by ~25–30% over a vanilla decoder-only transformer; the task is next-token prediction.
My question is whether this is significant enough to dedicate more serious effort to (e.g. getting more compute credits to train a bigger model, running a beefier benchmark, sharing more with folks in academia for feedback, writing a paper, etc.), or whether it’s likely a fluke.
In terms of methodology, I’ve compared vanilla vs. modified on 3 datasets (in increasing order of difficulty): Penn Treebank, Lord of the Rings, and the complete works of Shakespeare. The datasets are small enough that the results can be verified quickly on any laptop.
I’ve also controlled for everything that should stay the same across the 2 variants (vocab size, embedding dim, number of layers, layer norm and residual connection positions, etc.).
With the addition of the new component, however, I end up with ~125% more parameters when everything else is kept equal (from 800K in the vanilla model to 1.8M in the modified version). I’ve tried to address this by also increasing the number of layers in the vanilla model to bring its parameter count up to the same level, and the improvement is still noticeable.
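For reference, the parameter-matching check can be as simple as the sketch below (the class names are placeholders, not my actual code):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    # Total trainable parameters; used to confirm the layer-padded vanilla
    # model lands at roughly the same count as the modified one.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Placeholder class names standing in for the two implementations:
# print(count_params(VanillaTransformer(**cfg)))   # target: ~1.8M after adding layers
# print(count_params(ModifiedTransformer(**cfg)))  # ~1.8M
```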
Below are the loss comparisons after 100 iterations for both the vanilla and modified models on the 3 datasets.
I’d appreciate any input you may have! What next steps, if any, do you recommend? For background, I’m a software engineer by day and a neural-net enthusiast by night since college (more than 10 years ago). I’m loosely connected with some folks who may be able to give input, but I’d appreciate the community’s feedback before nagging them and getting more serious about this :)
# Lord of the Rings
**vanilla**:
step 100 evaluated train loss = 2.9514, valid loss = **2.9790**
step 100 evaluated train loss = 2.8528, valid loss = **2.8742** (w/ more layers => 10% more params than modified)
**modified**:
step 100 evaluated train loss = 2.1858, valid loss = **2.1094**
# Shakespeare’s works
**vanilla**:
step 100 evaluated train loss = 3.1380, valid loss = **3.1767**
step 100 evaluated train loss = 2.9478, valid loss = **2.9677** (w/ more layers => 10% more params than modified)
**modified**:
step 100 evaluated train loss = 2.2036, valid loss = **2.2190**
# Penn Treebank
**vanilla**:
step 100 evaluated train loss = 2.7331, valid loss = **2.7417**
step 100 evaluated train loss = 2.8184, valid loss = **2.5611** (w/ 10 layers => 10% more params than modified)
**modified**:
step 100 evaluated train loss = 2.0061, valid loss = **2.0184**
You should try scaling up the hidden size and the number of layers of your baseline model together.
In my opinion, it would be better to instead scale down your new model, to control for computational complexity.
Also make sure that not just the number of parameters but also the computational complexity isn’t too different. It can happen that the parameter counts are the same while one model has a higher compute budget. For example, ALBERT shares parameters across all layers but has the same computational requirements as BERT.
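To make that concrete, here is a toy sketch (generic PyTorch, nothing to do with your architecture) where the parameter counts differ by ~6x but the per-token compute is essentially the same:

```python
import torch
import torch.nn as nn

d_model, n_layers = 256, 6
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)

# "Shared" (ALBERT-style): one block object, applied n_layers times.
shared_params = sum(p.numel() for p in block.parameters())

# "Unshared" (BERT-style): n_layers independent copies of the block.
unshared = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
     for _ in range(n_layers)]
)
unshared_params = sum(p.numel() for p in unshared.parameters())

print(f"shared:   {shared_params:,} parameters")    # ~1x the block
print(f"unshared: {unshared_params:,} parameters")  # ~6x the block

# Both variants still run a block n_layers times per forward pass, so the
# FLOPs per token are about the same despite the very different param counts.
x = torch.randn(1, 32, d_model)
for _ in range(n_layers):
    x = block(x)
```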
Make sure you optimise the learning rate for both architectures, because the optimal LR may change depending on your modifications.
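A small grid per variant is usually enough; in the sketch below, `train_and_eval` is a hypothetical stand-in for your own training loop (train one model for a fixed number of steps, return its best validation loss):

```python
# `train_and_eval(variant, lr, steps)` is a hypothetical helper, not a real API.
learning_rates = [1e-4, 3e-4, 1e-3, 3e-3]

for variant in ("vanilla", "modified"):
    val_losses = {lr: train_and_eval(variant, lr=lr, steps=5000) for lr in learning_rates}
    best_lr = min(val_losses, key=val_losses.get)
    print(f"{variant}: best LR {best_lr:g} -> val loss {val_losses[best_lr]:.4f}")
```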
Additionally, 100 training steps might not be enough to draw conclusions. You should plot the loss curves and train to convergence, and share the curves rather than just the final numbers.
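Something as simple as this is fine for the curves (the step/loss lists come from your own logs, run until both models have clearly converged):

```python
import matplotlib.pyplot as plt

def plot_val_curves(steps, vanilla_val, modified_val, out="val_loss_curves.png"):
    # `steps` is the list of evaluation steps; the two loss lists are the
    # validation losses you log for each variant at those steps.
    plt.plot(steps, vanilla_val, label="vanilla (param-matched)")
    plt.plot(steps, modified_val, label="modified")
    plt.xlabel("training step")
    plt.ylabel("validation loss")
    plt.legend()
    plt.savefig(out)
```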
Importantly, verify that these numbers for the vanilla transformer match those reported elsewhere.
Finally, make sure you have an explanation of the mechanism by which the model learns better. Then check whether anybody has done something similar elsewhere and what their results were.
If you could give a further hint about your modification, I could comment more.
--
PS: based on this, the loss seems too good to be true: https://paperswithcode.com/sota/language-modelling-on-penn-treebank-word
If I’m not wrong, your perplexities for the vanilla transformer are much higher than GPT-4’s.
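For reference, assuming your losses are per-token cross-entropies in nats, the conversion to perplexity is just exp(loss):

```python
import math

# Perplexity = exp(per-token cross-entropy in nats). This is only comparable
# to the PTB word-level leaderboard if the tokenization matches; a
# character-level loss gives a per-character perplexity, which is not
# comparable to word-level numbers.
for name, val_loss in [("vanilla", 2.7417), ("modified", 2.0184)]:
    print(f"{name}: val loss {val_loss:.4f} -> perplexity {math.exp(val_loss):.1f}")
```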