• 0 Posts
  • 4 Comments
Joined 10 months ago
Cake day: November 27th, 2023


  • Just tried this out on my Windows machine and got this:

    warning: couldn't find nvcc (nvidia c compiler) try setting $CUDA_PATH if it's installed
    warning: GPU offload not supported on this platform; GPU related options will be ignored
    warning: you might need to install xcode (macos) or cuda (windows, linux, etc.) check the output above to see why support wasn't linked

    So I can't use my GPU the way I can with the standard llama.cpp… and I don't want to install anything; I'd like a portable solution that I can just copy onto my external SSD…



  • You could try running one of the Q4 models without problems: because llama.cpp uses mmap to map the model file into memory, you can go above the available RAM. Many models are accessed sparsely, so not all mapped pages are actually used, and even when a page is needed it can be swapped in and out on demand… I was able to run falcon-180b-chat.Q6_K, which uses about 141 GB, on a 128 GB Windows PC with less than 1% SSD reads during inference… I could even run falcon-180b-chat.Q8, which uses about 182 GB, but in that case the SSD was working heavily during inference and it was unbearably slow (0.01 tokens/second)…
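
    Here's a rough, untested POSIX sketch of what that mmap behaviour looks like (this is not llama.cpp's actual loader; the file name is a placeholder, and on Windows llama.cpp uses the equivalent CreateFileMapping/MapViewOfFile APIs instead of mmap):

    /*
     * Rough sketch: map a model file read-only and touch a few scattered
     * offsets.  The kernel faults pages in only when they are read, so a
     * file larger than physical RAM can still be mapped, and clean pages
     * can simply be dropped again under memory pressure.
     * POSIX only; the file name below is a placeholder.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "falcon-180b-chat.Q6_K.gguf";  /* placeholder */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

        unsigned char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                   MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* Read one byte from eight evenly spaced offsets: only the pages
         * containing those offsets are actually loaded from disk. */
        off_t step = st.st_size / 8 ? st.st_size / 8 : 1;
        long long sum = 0;
        for (off_t off = 0; off < st.st_size; off += step)
            sum += data[off];

        printf("mapped %lld bytes, sample sum = %lld\n",
               (long long)st.st_size, sum);

        munmap(data, (size_t)st.st_size);
        close(fd);
        return 0;
    }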