The only time the query and key matrices are used is to compute the attention scores, i.e. $v_i^T W_q^T W_k v_j$. But all that is ever used there is the product $W_q^T W_k$. Why not just replace $W_q^T W_k$ with a single matrix $W_{qk}$ and learn that product directly, instead of the two matrices themselves? How does it help to have two matrices instead of one? And if it does help, why isn't the same thing done for the weight matrices between layers of neurons?
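To make the point concrete, here is a quick NumPy sketch (the dimensions and random weights are just placeholders) checking that the score computed with separate $W_q$ and $W_k$ is identical to the score computed with a single collapsed matrix $W_{qk} = W_q^T W_k$:

```python
import numpy as np

# Placeholder sizes: model dim d, query/key dim d_k.
d, d_k = 16, 4
rng = np.random.default_rng(0)

W_q = rng.normal(size=(d_k, d))   # query projection
W_k = rng.normal(size=(d_k, d))   # key projection
v_i = rng.normal(size=d)          # embedding of token i
v_j = rng.normal(size=d)          # embedding of token j

# Score with separate query/key projections: (W_q v_i) . (W_k v_j).
score_two_matrices = (W_q @ v_i) @ (W_k @ v_j)

# Same score using only the collapsed product W_qk = W_q^T W_k.
W_qk = W_q.T @ W_k                # shape (d, d)
score_one_matrix = v_i @ W_qk @ v_j

assert np.allclose(score_two_matrices, score_one_matrix)
```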

ChatGPT tells me the reason is that it allows the model to learn different representations for the query and the key. But since they are just dotted together, it seems to me that you could use the original embedding as the query with no loss of generality.

  • Affectionate-Fish241@alien.topB · 10 months ago

    Dammit, all of the answers are fkin terrible. Looks like the AI bots took over or everyone in this subreddit has become braindead since the blackout.

    You obviously don’t do W_q @ W_k. That’s totally stupid.

    What transformers do is (x_i @ W_q) @ (x_j @ W_k), where x_i and x_j are two tokens in the sequence. This is an interaction operation; it can't be precomputed. What you see noted in the papers is Q = x_i @ W_q and K = x_j @ W_k.

    (Transposes omitted for notational clarity, work that out yourself)
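    A rough sketch of that in NumPy (the shapes here are placeholders, and the omitted transpose is made explicit):

    ```python
    import numpy as np

    # Placeholder sizes: sequence length n, model dim d, query/key dim d_k.
    n, d, d_k = 5, 16, 4
    rng = np.random.default_rng(0)

    X = rng.normal(size=(n, d))      # one row x_i per token in the sequence
    W_q = rng.normal(size=(d, d_k))
    W_k = rng.normal(size=(d, d_k))

    Q = X @ W_q                      # queries, shape (n, d_k)
    K = X @ W_k                      # keys, shape (n, d_k)

    # Pairwise interaction: scores[i, j] = (x_i @ W_q) @ (x_j @ W_k).T,
    # which depends on the tokens X, so it is computed per sequence.
    scores = Q @ K.T                 # shape (n, n)
    ```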

    • mrfox321@alien.topB · 10 months ago

      Your answer is also terrible. It does not answer his question.

      Look at the top 2 replies to see correct interpretations of the question.