• 0 Posts
  • 4 Comments
Joined 10 months ago
Cake day: November 27th, 2023


  • Just tried this out on my Windows machine and got this:

    warning: couldn't find nvcc (nvidia c compiler) try setting $CUDA_PATH if it's installed
    warning: GPU offload not supported on this platform; GPU related options will be ignored
    warning: you might need to install xcode (macos) or cuda (windows, linux, etc.) check the output above to see why support wasn't linked

    So I can't use my GPU the way I can with the standard llama.cpp… and I don't want to install anything; I'd like a portable solution that I can just copy onto my external SSD…



  • You could try running one of the Q4 models without problems: because llama.cpp uses mmap to map the model file into memory, you can go above the available RAM. Many models are accessed sparsely, so not all mapped pages are actually used, and even when a page is needed it can be swapped in and out on demand… I was able to run falcon-180b-chat.Q6_K, which uses about 141 GB, on a 128 GB Windows PC with less than 1% SSD reads during inference… I could even run falcon-180b-chat.Q8, which uses about 182 GB, but in that case the SSD was working heavily during inference and it was unbearably slow (0.01 tokens/second)…
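
    Here's a rough, untested POSIX sketch of what that mmap behaviour looks like (this is not llama.cpp's actual loader; the file name is a placeholder, and on Windows llama.cpp uses the equivalent CreateFileMapping/MapViewOfFile APIs instead of mmap):

    /*
     * Rough sketch: map a model file read-only and touch a few scattered
     * offsets.  The kernel faults pages in only when they are read, so a
     * file larger than physical RAM can still be mapped, and clean pages
     * can simply be dropped again under memory pressure.
     * POSIX only; the file name below is a placeholder.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "falcon-180b-chat.Q6_K.gguf";  /* placeholder */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return 1; }

        unsigned char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                   MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* Read one byte from eight evenly spaced offsets: only the pages
         * containing those offsets are actually loaded from disk. */
        off_t step = st.st_size / 8 ? st.st_size / 8 : 1;
        long long sum = 0;
        for (off_t off = 0; off < st.st_size; off += step)
            sum += data[off];

        printf("mapped %lld bytes, sample sum = %lld\n",
               (long long)st.st_size, sum);

        munmap(data, (size_t)st.st_size);
        close(fd);
        return 0;
    }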