The only time the query and key matrices are used is to compute the attention scores, i.e. $v_i^T W_q^T W_k v_j$. But all that is ever used there is the product $W_q^T W_k$. Why not just replace $W_q^T W_k$ with a single matrix $W_{qk}$ and learn that product directly, instead of the two matrices themselves? How does it help to have two matrices instead of one? And if it does help, why isn't the same thing done for the weight matrices between layers of neurons?
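To make the point concrete, here is a quick NumPy sketch (the dimensions and random weights are just placeholders) checking that the score computed with separate $W_q$ and $W_k$ is identical to the score computed with a single collapsed matrix $W_{qk} = W_q^T W_k$:

```python
import numpy as np

# Placeholder sizes: model dim d, query/key dim d_k.
d, d_k = 16, 4
rng = np.random.default_rng(0)

W_q = rng.normal(size=(d_k, d))   # query projection
W_k = rng.normal(size=(d_k, d))   # key projection
v_i = rng.normal(size=d)          # embedding of token i
v_j = rng.normal(size=d)          # embedding of token j

# Score with separate query/key projections: (W_q v_i) . (W_k v_j).
score_two_matrices = (W_q @ v_i) @ (W_k @ v_j)

# Same score using only the collapsed product W_qk = W_q^T W_k.
W_qk = W_q.T @ W_k                # shape (d, d)
score_one_matrix = v_i @ W_qk @ v_j

assert np.allclose(score_two_matrices, score_one_matrix)
```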

ChatGPT tells me the reason is that it allows the model to learn different representations for the query and the key. But since they are just dotted together, it seems to me that you could use the original embedding as the query with no loss of generality.

  • Affectionate-Fish241@alien.topB · 10 months ago

    Dammit, all of the answers are fkin terrible. Looks like the AI bots took over or everyone in this subreddit has become braindead since the blackout.

    You obviously don’t do W_q @ W_k. That’s totally stupid.

    What transformers do is (x_i @ W_q) @ (x_j @ W_k), where x_i and x_j are two tokens in the sequence. This is an interaction operation; it can't be precomputed. What you see noted in the papers is Q = x_i @ W_q and K = x_j @ W_k.

    (Transposes omitted for notational clarity, work that out yourself)
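    A rough sketch of that in NumPy (the shapes here are placeholders, and the omitted transpose is made explicit):

    ```python
    import numpy as np

    # Placeholder sizes: sequence length n, model dim d, query/key dim d_k.
    n, d, d_k = 5, 16, 4
    rng = np.random.default_rng(0)

    X = rng.normal(size=(n, d))      # one row x_i per token in the sequence
    W_q = rng.normal(size=(d, d_k))
    W_k = rng.normal(size=(d, d_k))

    Q = X @ W_q                      # queries, shape (n, d_k)
    K = X @ W_k                      # keys, shape (n, d_k)

    # Pairwise interaction: scores[i, j] = (x_i @ W_q) @ (x_j @ W_k).T,
    # which depends on the tokens X, so it is computed per sequence.
    scores = Q @ K.T                 # shape (n, n)
    ```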

    • mrfox321@alien.topB · 10 months ago

      Your answer is also terrible. It does not answer his question.

      Look at the top 2 replies to see correct interpretations of the question.