So I did some research, and after a while in the rabbit hole I think sliding window attention is not implemented in ExLlama (or v2) yet, nor in the AMD ROCm fork of Flash Attention.
I think that means it’s just unsupported right now. Very unfortunate, but I guess I’ll have to wait. Waiting for support is the price I pay for saving 900 euros by buying a 7900 XTX instead of a 4090. I’m fine with that.
Others can help you with the LLM part of this, but I’m mainly curious whether your plan is worth it. You do know that converting an entire book into a summary is pretty much worthless if you don’t make the summary yourself?