esotericloop@alien.top to LocalLLaMA@poweruser.forum • Questions on Attention Sinks and Their Usage in LLM Models • 1 year ago
See, you’re attending to the initial token across all layers and heads. :P
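(As an aside, the joke points at the real attention-sink observation: early tokens tend to receive a disproportionate share of attention mass across layers and heads. A minimal sketch of how one might check this, assuming the Hugging Face transformers library and GPT-2 purely for illustration; the model choice and the averaging scheme are my assumptions, not anything from the original post.)

```python
# Hypothetical check of the attention-sink effect: measure how much attention
# every layer/head assigns to the very first token of the sequence.
# Model (gpt2) and prompt are illustrative assumptions, not from the post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# "eager" attention is requested so that attention weights are returned.
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
model.eval()

text = "Attention sinks keep the KV cache stable during streaming generation."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
for layer_idx, attn in enumerate(outputs.attentions):
    # Column 0 is the attention each query position pays to the initial token;
    # average over query positions to get a per-head "sink share".
    sink_share = attn[0, :, :, 0].mean(dim=-1)  # shape: (num_heads,)
    print(f"layer {layer_idx:2d}: mean attention on token 0 per head = "
          f"{[round(x, 3) for x in sink_share.tolist()]}")
```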