Around 1.5 months ago, I started https://github.com/michaelfeil/infinity. In my view, the hype around Retrieval-Augmented Generation has made this topic important over the last month, and this repo is the only option under an open license.

I have now implemented everything from faster attention, ONNX / CTranslate2 / torch inference, and caching to better Docker images and better queueing strategies. I am pretty much running out of ideas now. If you have some, feel free to open an issue; it would be very welcome!

  • dancingnightly@alien.topB
    1 year ago

    Thank you for the detail and references. Yeah, I know; I have used sentence transformers, and before them BERT/T5 embeddings, for a long time (e.g. Kaggle competitions, a few hackathons around the issue…). I am just wondering what motivated you to create an embeddings server, as opposed to running the embeddings in place in the code with the SBERT models or calling an API, as you mention with those alternatives. Is the Python code you write in the getting-started part much faster than just using the SentenceTransformer module with batch arrays?

    I ask because I have found, for example when competing in the Learning Agency competitions, that you can build the indexes locally or use open source tools like LlamaIndex equivalents with SBERT, rather than needing to set up a server. Am I missing something to do with speed, or do new models take longer to embed? What problem are you and others facing that calls for a server for embeddings rather than doing it in the code?