The only time the query and key matrices are used is to compute the attention scores, i.e. $x_i^T W_q^T W_k x_j$, where $x_i$ is the embedding of token $i$. But all that is ever used is the product $W_q^T W_k$. Why not just replace $W_q^T W_k$ with a single matrix $W_{qk}$ and learn that product directly instead of the two factors? How does it help to have two matrices instead of one? And if it helps, why isn't the same thing done for the weight matrices between neuron layers?

ChatGPT tells me the reason is that it allows the model to learn different representations for the query and the key. But since they are just dotted together, it seems to me that you could use the original embedding as the query with no loss of generality.
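For concreteness, here is a quick NumPy check of the fused-matrix claim (a minimal sketch; the sizes are arbitrary). One thing it makes visible: the fused matrix $W_q^T W_k$ has rank at most the head dimension, so learning two thin factors and learning one unconstrained full matrix are not quite the same parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 8, 4, 5            # illustrative sizes

X = rng.standard_normal((n_tokens, d_model))   # token embeddings
W_q = rng.standard_normal((d_head, d_model))   # query projection
W_k = rng.standard_normal((d_head, d_model))   # key projection

# Factored form: scores[i, j] = (W_q x_i) . (W_k x_j)
scores_factored = (X @ W_q.T) @ (X @ W_k.T).T

# Fused form: scores[i, j] = x_i^T (W_q^T W_k) x_j
W_qk = W_q.T @ W_k                             # d_model x d_model, rank <= d_head
scores_fused = X @ W_qk @ X.T

print(np.allclose(scores_factored, scores_fused))  # True
```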

  • Tensor_Devourer_56@alien.topB · 1 year ago

    If I’m understanding your question correctly, it probably doesn’t make any difference computation-wise. But if we took the query-key product as a single input, the attention layer would have just two inputs: 1. the fused query-key matrix; 2. the value matrix. I think this would be a worse formulation than the one in the original paper, although they are the same computation-wise. Keeping separate query and key matrices makes the data flow clearer. For example, the encoder-decoder attention layer takes the encoder block’s output as the key and value but the processed target sequence as the query. That idea is very clear in the original formulation of the attention layer.
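    A minimal sketch of that data flow (the names and sizes here are made up, just to show which stream feeds which projection):

    ```python
    import numpy as np

    def attention(queries, keys, values):
        # Scaled dot-product attention; arguments are (n_tokens, d) arrays.
        d = queries.shape[-1]
        scores = queries @ keys.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ values

    # Encoder-decoder (cross-) attention: queries come from the target side,
    # keys and values from the encoder output. Separate projections keep the
    # roles of the two streams explicit.
    rng = np.random.default_rng(0)
    d = 4
    enc_out = rng.standard_normal((6, d))   # encoder block output
    tgt = rng.standard_normal((3, d))       # processed target sequence
    W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

    out = attention(tgt @ W_q, enc_out @ W_k, enc_out @ W_v)
    print(out.shape)  # (3, 4)
    ```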

    • tdgros@alien.topB · 1 year ago

      It’s the same mathematically but not computation-wise: the tokens are projected down to a smaller dimension. For an embedding dimension $N$ and a head dimension $d < N$, the two projections cost $2Nd$ multiplies per token, whereas applying the fused $N \times N$ matrix would cost $N^2$.
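      In numbers (illustrative sizes, e.g. an embedding dimension of 512 projected down to a 64-dimensional head):

      ```python
      # Multiplies per token for the attention-score projections,
      # with embedding dim N and head dim d < N (illustrative values).
      N, d = 512, 64
      factored = 2 * N * d    # two thin N -> d projections (query and key)
      fused = N * N           # one full N x N map for the product W_q^T W_k
      print(factored, fused)  # 65536 262144
      ```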