My boss is a semi-famous author in a niche academic field. I have thousands of pages of text from books, transcripts, and more.
Is there a straightforward path to creating a corpus to augment BERT, Llama, or another LLM? The end goal is to be able to chat with an AI that has been trained on his life’s work.
Is there anything specific to understand about preparing the corpus? Do I need key-value pairs, where I write a ton of example questions and responses?
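Agree that you should look at RAG (retrieval-augmented generation). LLMs are not search engines, so you need to connect the knowledge corpus to the LLM: relevant passages are retrieved from the corpus at query time and passed to the model as context.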
Try LLMWare’s RAG implementation. It is easy to use and automates the Mongo and Milvus setup, so it is a good fit for what you are trying to achieve. LLMWare also has free models on Hugging Face that you can experiment with for your use case.
https://github.com/llmware-ai/llmware
https://huggingface.co/llmware
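To make the RAG idea a bit more concrete, here is a rough sketch of the retrieval step in plain Python. It is not LLMWare's API. The function names (chunk_text, retrieve, build_prompt) are made up for illustration, and simple word-count vectors stand in for the real embeddings that a vector database such as Milvus would store. The point is only that the corpus gets split into passages and searched, so no hand-written question/answer pairs are needed.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split raw text into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def vectorize(text):
    """Word-count vector; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k passages most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages):
    """Assemble the text that would be sent to the LLM (hypothetical wording)."""
    context = "\n\n".join(passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

In a real setup you would run chunk_text over each book or transcript, swap vectorize for an embedding model, and send the output of build_prompt to whichever LLM you choose.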