file size, which impacts load time:
with load_in_4bit, it downloads and parses the full-precision file (4x bigger than the 4-bit quants if it is bfloat16, 8x bigger if it is float32) and then quantizes on the fly,
with pre-quantized files, it downloads only the quants, so expect roughly a 4x to 8x faster load for 4-bit quants
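For concreteness, here is a minimal sketch of the two load paths with Hugging Face transformers; the model IDs are only examples (any bfloat16 checkpoint and a pre-quantized GPTQ/AWQ counterpart will do), and the pre-quantized path also assumes the relevant quantization backend (e.g. auto-gptq/optimum) is installed.

```python
# Sketch: on-the-fly 4-bit quantization vs. loading a pre-quantized checkpoint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# On-the-fly: downloads the ~14 GB bfloat16 weights, then quantizes to 4-bit in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_slow = AutoModelForCausalLM.from_pretrained(
    "teknium/OpenHermes-2.5-Mistral-7B",        # full-precision repo (example)
    quantization_config=bnb_config,
    device_map="auto",
)

# Pre-quantized: downloads only the ~4 GB 4-bit weights, no conversion step.
model_fast = AutoModelForCausalLM.from_pretrained(
    "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ",  # pre-quantized repo (example)
    device_map="auto",
)
```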
you need a knowledge base (with your legal or financial data) and semantic search to feed the relevant context to a model fine-tuned to follow instructions and answer from that context (see the sketch after this list),
for small (7B) models there are OpenHermes-2.5-Mistral-7B, Mistral-7B-OpenOrca, and dolphin-2.1-mistral-7b,
for bigger models there is Nous-Capybara-34B.
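A minimal sketch of that retrieval step, assuming sentence-transformers for the semantic search; the documents, embedding model, and ChatML template are illustrative, and the resulting prompt would be passed to whichever instruct model you pick from the list above.

```python
# Semantic search over a small knowledge base, then build a context-grounded prompt.
from sentence_transformers import SentenceTransformer, util

# 1. Knowledge base: your legal/financial documents, pre-split into chunks (toy data here).
documents = [
    "Clause 4.2: the lessee must give 60 days written notice before termination.",
    "Invoices are payable within 30 days of the invoice date.",
    "The liability cap is limited to the total fees paid in the preceding 12 months.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)

def build_prompt(question: str, top_k: int = 2) -> str:
    # 2. Semantic search: retrieve the chunks most similar to the question.
    query_embedding = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=top_k)[0]
    context = "\n".join(documents[hit["corpus_id"]] for hit in hits)
    # 3. Feed the retrieved context to an instruction-following model
    #    (ChatML shown here, which OpenHermes-2.5 uses; adapt to your model's template).
    return (
        "<|im_start|>system\n"
        "Answer only from the provided context.<|im_end|>\n"
        "<|im_start|>user\n"
        f"Context:\n{context}\n\nQuestion: {question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_prompt("How much notice is needed to terminate the lease?"))
```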