I have a Stealth 15M laptop with 16 GB of RAM and a 3060 with 6 GB of VRAM. Can this run 13B models decently well? I'm pretty new to LLM stuff, and so far I can only get it to generate around 2-3 tokens per second, which feels pretty slow. Is there any way I can bump that to 5+ tokens per second? Or is 2-3 tokens per second the limit of my laptop?
If you download GGUF models from TheBloke, the model card page lists how much RAM each specific model file needs when nothing is offloaded to the GPU.
I have included a screenshot of a 13B model's card as an example.
https://preview.redd.it/3bv297j8c92c1.jpeg?width=1426&format=pjpg&auto=webp&s=94fac2937e8a2e0f6b3886d42401b0b50b0010b3
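If you want a rough idea of how much of a model could go onto the 6 GB card, here is a back-of-the-envelope sketch in Python. Everything numeric in it is an assumption for illustration, not something read from the screenshot: a 13B Q4_K_M file of about 7.87 GB, 40 layers of equal size, 1.5 GB of VRAM held back for context and the desktop, and about 2.5 GB of overhead on top of the file size for system RAM. The function names are made up for this example. Treat the result as a starting guess for a layer-offload setting (for example `--n-gpu-layers` in llama.cpp) and adjust from there.

```python
def estimate_gpu_layers(file_size_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Rough number of layers that fit in VRAM.

    Assumes every layer takes an equal share of the model file, which is
    only approximately true, and keeps reserve_gb of VRAM free.
    """
    per_layer_gb = file_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb / per_layer_gb))


def estimate_cpu_ram_gb(file_size_gb: float, n_layers: int,
                        gpu_layers: int, overhead_gb: float = 2.5) -> float:
    """Rough system RAM needed for the layers that stay on the CPU.

    overhead_gb is an assumed fixed allowance on top of the weights.
    """
    cpu_share_gb = file_size_gb * (n_layers - gpu_layers) / n_layers
    return cpu_share_gb + overhead_gb


if __name__ == "__main__":
    # Assumed example values: 13B Q4_K_M file, 40 layers, 6 GB of VRAM.
    size_gb, layers, vram_gb = 7.87, 40, 6.0
    gpu_layers = estimate_gpu_layers(size_gb, layers, vram_gb)
    ram_gb = estimate_cpu_ram_gb(size_gb, layers, gpu_layers)
    print(f"Try offloading about {gpu_layers} of {layers} layers")
    print(f"Estimated system RAM still needed: {ram_gb:.1f} GB")
```

With zero layers offloaded, the second function just returns file size plus the overhead, which is the "RAM required without offloading" kind of number the model card shows. The more layers fit on the GPU, the faster generation usually gets, so a smaller quant that fits more layers in 6 GB may be quicker than a bigger one that mostly runs on the CPU.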