If you are wondering whether it can be implemented: there was a modified version of the transformers library. The author made the necessary changes, renamed the library to attention_sinks, and presented it as a drop-in replacement:
https://github.com/tomaarsen/attention_sinks/
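For illustration, a rough sketch of what that drop-in usage looks like: the model class is imported from attention_sinks instead of transformers, and the rest of the generation code stays the same. The keyword arguments attention_sink_size and attention_sink_window_size (how many initial "sink" tokens to keep, and how many recent tokens the sliding window retains) are assumptions about the fork's interface, not something confirmed by this post.

```python
# Hypothetical drop-in usage of the attention_sinks fork: the only change
# compared to plain transformers is where AutoModelForCausalLM comes from.
from attention_sinks import AutoModelForCausalLM  # instead of: from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any supported causal LM

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    # Assumed kwargs: keep the first 4 tokens as attention sinks and
    # roughly 1020 recent tokens in the sliding-window KV cache.
    attention_sink_size=4,
    attention_sink_window_size=1020,
)

inputs = tokenizer("A very long streaming prompt goes here...", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```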
But that fork was impractical to maintain, so the transformers developers suggested he contribute a patch to transformers itself and maintain it there, so the feature could be properly incorporated into the library and stay future-proof.
The author has been working on this patch since the beginning of October:
https://github.com/huggingface/transformers/pull/26681
It's already implemented in llama.cpp.