tdgros@alien.top, in Machine Learning@academy.garden, on "[D] In transformer models, why is there a query and key matrix instead of just the product?" (1 year ago):
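It’s the same mathematically, but not computationally: the tokens are projected to a smaller dimension d. With separate query and key matrices the cost (parameters, and multiplies per token) is 2Nd, where N is the token dimension, whereas it’d be N² if you fused the two weight matrices into one. So keeping them separate is cheaper whenever d < N/2.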
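
To make that concrete, here's a rough NumPy sketch, not anything from the original thread. The sizes `N`, `d`, `T` and the names `W_q`, `W_k`, `W_qk` are made up for illustration, and it only looks at the raw attention scores (no softmax, no scaling, single head). It checks that the factored and fused forms give the same scores, and compares the weight counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: token (model) dimension, projection dimension, sequence length.
N, d, T = 64, 8, 5

X = rng.standard_normal((T, N))    # T token embeddings of dimension N
W_q = rng.standard_normal((N, d))  # query projection
W_k = rng.standard_normal((N, d))  # key projection

# Factored form: project tokens down to d dimensions, then take dot products.
scores_factored = (X @ W_q) @ (X @ W_k).T

# Fused form: a single N x N matrix, whose rank is at most d.
W_qk = W_q @ W_k.T
scores_fused = X @ W_qk @ X.T

same_scores = np.allclose(scores_factored, scores_fused)
params_factored = W_q.size + W_k.size  # 2 * N * d
params_fused = W_qk.size               # N * N

print(same_scores, params_factored, params_fused)
```

With these sizes the factored form stores 2·64·8 = 1024 weights against 64² = 4096 for the fused one. The fused matrix also has rank at most d, so a freely learned N×N matrix would not be the same model, just a bigger one.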