eigenham@alien.top to Machine Learning@academy.garden • [D] In transformer models, why is there a query and key matrix instead of just the product? • 10 months ago
I would suggest looking into the math a little more. The query, key, and value matrices in an attention layer are each linear functions of the input sequence (e.g. Q = XW_Q, K = XW_K). So the pre-softmax attention scores QK^T = X W_Q W_K^T X^T are a quadratic form in the input, and the attention weights are the softmax of that quadratic, iirc.
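If it helps, here's a minimal numpy sketch of that point (all the dimensions and variable names are made up for illustration): since Q and K are linear in X, the two projection matrices can mathematically be folded into a single product W_Q W_K^T, and the scores come out identical.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k = 5, 16, 8          # sequence length, model dim, head dim (arbitrary)
X = rng.normal(size=(n, d_model))   # input sequence
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Standard form: Q and K are each linear in X, so the scores are quadratic in X.
Q, K = X @ W_Q, X @ W_K
scores_two_mats = softmax(Q @ K.T / np.sqrt(d_k))

# Folded form: replace the separate query/key matrices with their single product.
W_QK = W_Q @ W_K.T                  # d_model x d_model, rank at most d_k
scores_one_mat = softmax(X @ W_QK @ X.T / np.sqrt(d_k))

assert np.allclose(scores_two_mats, scores_one_mat)
```

Which also gestures at an answer to the original question: the single matrix W_Q W_K^T is d_model x d_model but has rank at most d_k, so the factored form parameterizes the same (low-rank) bilinear map with fewer parameters.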