if you have the ram, don’t worry about disk at all. the moment you have to spill to any kind of disk, even a gen 5 ssd, your speeds will tank. memory bandwidth matters much more than compute for LLM inference, but it all depends on your needs. there are probably cheaper ways to go about this if you only need something occasionally, maybe runpod or similar, but if you need a lot of inference then running locally could save you money. renting a big machine with a100s will always be faster, though. so the question is whether a 7B model will do what you need, or whether you need the accuracy and comprehension of a 70B or one of the new 120B merges. also llama3 is supposed to be out in jan/feb, and if it’s significantly better then everything changes again.
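rough back-of-envelope on why bandwidth is the ceiling: at decode time every generated token has to stream the whole (quantized) weight set through memory once, so tokens/sec tops out at roughly bandwidth divided by model size. the bandwidth and model-size numbers in this little C sketch are illustrative assumptions, not measurements:

```c
#include <stdio.h>

/* Upper bound on decode speed: each token reads all weights once,
 * so tokens/sec <= memory bandwidth / model size.
 * All numbers below are rough, assumed figures for illustration. */
int main(void)
{
    double model_gb[] = { 4.0, 40.0 };              /* ~7B and ~70B at 4-bit */
    const char *model[] = { "7B q4", "70B q4" };
    double bw_gbs[]  = { 90.0, 800.0 };             /* dual-channel DDR5 vs. high-bandwidth unified memory */
    const char *sys[] = { "DDR5 desktop", "unified-memory workstation" };

    for (int s = 0; s < 2; ++s)
        for (int m = 0; m < 2; ++m)
            printf("%-28s %-8s ~%6.1f tok/s max\n",
                   sys[s], model[m], bw_gbs[s] / model_gb[m]);
    return 0;
}
```

which is also why falling off of ram onto even a fast ssd (single-digit GB/s) drops you to well under a token per second on the bigger models.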
So there are CPU intrinsics for prefetching data. If we can get better at anticipating which data needs to be touched next, we can sprinkle those preload instructions into the hot loop and hide some of the memory latency, which buys more speed.
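A minimal sketch of what that looks like, assuming a gather-style loop where the upcoming indices are already known (gcc/clang on x86; the function name and the lookahead distance are made up for the example):

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical gather loop: sum values at scattered indices.
 * Because the index array tells us the access pattern in advance,
 * we can issue a prefetch a few iterations ahead so the cache line
 * is (hopefully) resident by the time we actually read it. */
float gather_sum(const float *values, const size_t *idx, size_t n)
{
    const size_t AHEAD = 8;  /* lookahead distance, needs tuning per workload */
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + AHEAD < n)
            _mm_prefetch((const char *)&values[idx[i + AHEAD]], _MM_HINT_T0);
        sum += values[idx[i]];
    }
    return sum;
}
```

The tricky part is tuning how far ahead to prefetch: too close and the data still hasn’t arrived when you need it, too far and it gets evicted before you use it.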