Great test!
Unfortunately, the Llama 2 Chat template is completely broken in SillyTavern. It not only uses a newline as the separator instead of the correct one, but also ends the prompt after the system prompt with the input sequence [INST] instead of [/INST] if you are using vector storage or an example dialogue. You can see for yourself by comparing the output to what the format should look like.
So these Airoboros 3.1.2 tests are unfortunately borked. Still, interesting results for the other models.
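For anyone who wants to compare, here is a rough sketch of how I understand the Llama 2 Chat format, based on Meta's reference code. The function name `build_llama2_prompt` is made up for illustration, and writing `<s>` / `</s>` as literal text is an assumption: many tokenizers add the BOS token themselves, so you may need to drop the leading `<s>`.

```python
def build_llama2_prompt(system, turns):
    """Build a Llama 2 Chat style prompt string.

    system: system prompt text (may be empty).
    turns:  list of (user, assistant) pairs; the assistant part of the
            last pair should be None so the model writes the reply.
    """
    parts = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system:
            # The system prompt is wrapped in <<SYS>> tags inside the first [INST] block.
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        # Every user message is opened with [INST] and closed with [/INST].
        parts.append(f"<s>[INST] {user.strip()} [/INST]")
        if assistant is not None:
            # A finished assistant reply is followed by the end-of-sequence marker.
            parts.append(f" {assistant.strip()} </s>")
    return "".join(parts)


prompt = build_llama2_prompt("Be brief.", [("Hi", "Hello"), ("Bye", None)])
print(prompt)
```

The thing to check against the SillyTavern output is that the prompt ends with [/INST] right before the model's turn, and that turns are not separated by a bare newline.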
Yes, it’s system-wide. You can set your preferred behavior in the NVIDIA Control Panel -> Global Settings -> CUDA - Sysmem Fallback Policy.
The driver default is Prefer Sysmem Fallback, which means it’s going to offload to RAM instead of crashing when VRAM is full.
Prefer No Sysmem Fallback is basically the old memory management: it crashes once your VRAM is full.