Hi, when running AI models I notice that the amount of RAM actually used is a lot less than claimed, yet performance differs greatly depending on how much RAM the machine has. My machines have a limited number of RAM slots, so based on this behaviour, are the models being cached into RAM rather than fully loaded? For instance, to run Llama 70B I have to rent an expensive AWS EC2 instance, but the responses differ greatly: with 13B the model doesn't answer the question and just echoes it back, whereas 70B does. I would still like to be able to run 70B, and I am using a similar chip architecture with avx_vnni. If there isn't enough RAM, would it be possible to create a RAM drive split across multiple machines over 10Gb/s NICs? I have SFP+ NICs and SFP+ ports in my switch.

Are there ways to speed up running larger models with less memory, without quantising them to lower accuracy?

  • J_J_Jake@alien.top · 1 year ago

    This is because llama.cpp uses mmap() by default, which maps the model file into the process's address space rather than copying it all into RAM up front. Pages of the model are read in (and evicted) as they are needed, with available system RAM effectively acting as a cache for the file. You can disable this via the command line if you want the whole model held statically in RAM.
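
    In the llama.cpp CLI the relevant switches are, as far as I recall, --no-mmap (load the whole model into RAM) and --mlock (pin it so the OS can't swap it out). If you drive llama.cpp from Python instead, the llama-cpp-python bindings expose the same knobs as constructor arguments. A minimal sketch, assuming llama-cpp-python is installed and treating the model path, prompt, and context size as placeholders:

    ```python
    # Sketch: loading a GGUF model with memory-mapping disabled via llama-cpp-python.
    # Assumptions: llama-cpp-python is installed (pip install llama-cpp-python) and
    # "models/llama-2-70b.Q8_0.gguf" is a placeholder path to your model file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-2-70b.Q8_0.gguf",  # placeholder path
        use_mmap=False,  # read the whole model into RAM instead of mapping the file
        use_mlock=True,  # pin the buffer so it cannot be swapped back out
        n_ctx=4096,      # context window; adjust to your needs
    )

    out = llm("Q: What does disabling mmap change? A:", max_tokens=64)
    print(out["choices"][0]["text"])
    ```

    Note that neither switch reduces the memory requirement: an unquantised 70B model still has to fit somewhere, so disabling mmap just trades lazy paging for one big upfront allocation.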