So I was looking over the recent merges to llama.cpp’s server and saw that they’ve more or less brought it in line with OpenAI-style APIs – natively – obviating the need for e.g. api_like_OAI.py, or one of the bindings/wrappers like llama-cpp-python (+ooba), koboldcpp, etc. (not that those and others don’t provide great/useful platforms for a wide variety of local LLM shenanigans).
As of a couple days ago (can’t find the exact merge/build), it seems as if they’ve implemented – essentially – the old ‘simple-proxy-for-tavern’ functionality (for lack of a better way to describe it) but *natively*.
As in, you can connect SillyTavern (and numerous other clients, notably Hugging Face chat-ui — *with local web search*) without a layer of python in between. Or, I guess, you’re trading the python layer for a pile of node (typically) but just above bare metal (if we consider compiled cpp to be ‘bare metal’ in 2023 ;).
Anyway, it’s *fast* — or at least not apparently any slower than it needs to be? Similar pp and generation times to main and the server’s own skeletal js ui in the front-ends I’ve tried.
It seems like ggerganov and co. are getting serious about the server side of llama.cpp, perhaps even over/above ‘main’ or the notion of a pure lib/api. You love to see it. apache/httpd vibes 😈
Couple links:
https://github.com/ggerganov/llama.cpp/pull/4198
https://github.com/ggerganov/llama.cpp/issues/4216
But seriously just try it! /models, /v1, /completion are all there now as native endpoints (compiled in C++ with all the gpu features + other goodies). Boo-ya!
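If you want to poke at the native endpoint without wiring up a whole client first, here’s a minimal stdlib-only sketch. It assumes the server is running on its default port (8080) and uses the `/completion` endpoint’s own parameter names (`n_predict` rather than OpenAI’s `max_tokens`); field names and port are assumptions about a typical setup, not gospel:

```python
import json
import urllib.request

# Assumed: llama.cpp server running locally on its default port (8080).
BASE_URL = "http://localhost:8080"

def build_completion_request(prompt: str, n_predict: int = 128) -> urllib.request.Request:
    """Build a POST request for the server's native /completion endpoint."""
    payload = json.dumps({
        "prompt": prompt,
        "n_predict": n_predict,   # llama.cpp's name for max new tokens on /completion
        "temperature": 0.7,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request("Building a website can be done in 10 simple steps:")
# With a server actually running, you'd then do:
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["content"])
```

Same idea works for `/v1` and `/models`; only the payload shape changes.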
I really, really, hope they add support for chat_templates for the chat/completion endpoint: https://huggingface.co/docs/transformers/chat_templating
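For anyone unfamiliar: a chat template is just a per-model recipe for flattening a list of role/content messages into the exact prompt string the model was fine-tuned on (Transformers does this with Jinja templates shipped in tokenizer_config.json). A rough pure-Python sketch of what a ChatML-style template boils down to — illustrative only, real templates vary per model:

```python
# Illustrative sketch of a ChatML-style chat template: turn structured
# messages into the prompt string a model expects. Real templates are
# Jinja strings bundled with the tokenizer and differ model to model.
def apply_chatml_template(messages, add_generation_prompt=True):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    if add_generation_prompt:
        # Cue the model to respond as the assistant.
        out.append("<|im_start|>assistant\n")
    return "\n".join(out)

prompt = apply_chatml_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
])
```

The point of native support would be that the server picks the right recipe per model, instead of every client hard-coding its own.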
Well, fingers crossed my plea for actually supporting chat templates works. Partial support is equal to no support in this case.
https://github.com/ggerganov/llama.cpp/issues/4216#issuecomment-1829944957
Have changed from llama-cpp-python[server] to llama.cpp server, working great with OAI API calls, except multimodal which is not working. Patched it with one line and voilà, works like a charm!
send a PR with your patch!
Already raised an issue, couldn’t create a PR as I’m with my phone only. Solution also included:
sweet, ty 😎!
will this speed up the ollama project?
huge fan of server.cpp too! I actually embed a universal binary (created with lipo) in my macOS app (FreeChat) and use it as an LLM backend running on localhost. Seeing how quickly it improves makes me very happy about this architecture choice.
I just saw the improvements issue today. Pretty excited about the possibility of getting chat template functionality since currently all of that complexity has to live in my client.
Also, TIL about the batching stuff. I’m going to try getting multiple responses using that.
*Love* FreeChat!
It’s not looking so great: they don’t want to actually support the feature, but would rather hard-code templates into the cpp, ignoring whatever template the model was defined with if it doesn’t match.
I made my case for it, but there seems to be resistance to doing it at all… there may be options to load a python Jinja script from cpp if the dependencies exist, and fall back to the hard-coded impl if not, but people seem very resistant to doing anything of the sort. And the cpp Jinja port seems to be too heavyweight for their tastes…
I can’t seem to get it to work. SillyTavern asks for “/v1/completions”, which doesn’t seem to be provided by the llama.cpp API.
I am using it as http://localhost:8000/v1/completions and it is working perfectly
Docker or native?
Did you use one of the example servers or just execute the default one at ./server ?
Thanks for the heads up, I’ll give it a try tomorrow
You mean we don’t need to use llama-cpp-python anymore to serve this at an OAI-like endpoint?
Correct. You run the llama.cpp server, and in your code/GUI/whatever you set the OpenAI API base to the server’s endpoint.
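Concretely, “pointing the OpenAI base at the server” just means all the `/v1/...` paths resolve against localhost instead of api.openai.com. A minimal stdlib sketch of what a client sends (port and model name are assumptions; with the official openai Python client you’d pass the same base as `base_url`):

```python
import json
import urllib.request

# Assumed local endpoint; with the openai Python client this would be
# roughly: OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")
OPENAI_API_BASE = "http://localhost:8080/v1"

def chat_completion_request(messages, model="local-model"):
    """Build an OpenAI-style /v1/chat/completions request against the local server."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{OPENAI_API_BASE}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder key: OpenAI clients send one, a local server
            # typically doesn't check it.
            "Authorization": "Bearer sk-local",
        },
        method="POST",
    )

req = chat_completion_request([{"role": "user", "content": "Hello!"}])
# With the server up: urllib.request.urlopen(req) returns an OpenAI-shaped
# JSON body with choices[0]["message"]["content"].
```

Any tool that lets you override the OpenAI base URL should work the same way.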
I’m pretty sure that makes it compatible with Clipboard Conqueror too!