I’m trying to perfect a dev tool for Python developers to easily scale their code to thousands of cloud resources using only one line of code.
I want to get some project ideas so I can build useful tutorials for running inference and fine-tuning open-source LLMs.
A few weeks back I created a tutorial teaching people to massively parallelize inference with Mistral-7B. I was able to deliver a ton of value to a select few people, and it helped me better understand the flaws in my tool.
Anyway, I want to open this up to the community before I decide which tutorials to prioritize. Please drop any project/tutorial ideas, and if you think someone’s idea is good, please upvote it (so I know you think it would be valuable).
I don’t code, but I’ve been working with the APIs (oobabooga/text-generation-webui) using ChatGPT for the coding. For text I have no problems, but images have been a big struggle for me. So for your tutorial (and as a clue for me): maybe a script that takes screenshots periodically and describes what it sees, from a desktop screenshot or a camera feed or something, using a LLaVA model.
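A minimal sketch of that idea, assuming text-generation-webui is running locally with its OpenAI-compatible API enabled and a LLaVA-style multimodal model loaded; the port, the exact payload shape, and the helper names here are my assumptions, not tested against your setup:

```python
import base64
import io
import json
import time
import urllib.request

# Assumed endpoint: text-generation-webui started with its OpenAI-compatible
# API enabled and a LLaVA-style multimodal model loaded.
API_URL = "http://127.0.0.1:5000/v1/chat/completions"

def build_payload(image_bytes: bytes, prompt: str = "Describe this image.") -> dict:
    """Wrap raw PNG bytes into an OpenAI-style multimodal chat payload."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + b64}},
            ],
        }],
        "max_tokens": 200,
    }

def describe(image_bytes: bytes) -> str:
    """POST one image to the server and return the model's description."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(image_bytes)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def watch_screen(interval_s: float = 30.0) -> None:
    """Grab a screenshot every interval_s seconds and print a description."""
    from PIL import ImageGrab  # needs Pillow; desktop capture on Windows/macOS
    while True:
        buf = io.BytesIO()
        ImageGrab.grab().save(buf, format="PNG")
        print(describe(buf.getvalue()))
        time.sleep(interval_s)
```

Calling `watch_screen()` gives the desktop variant; swapping `ImageGrab.grab()` for an OpenCV camera read would give the camera-feed variant.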
First, a question: what sort of performance can you get with ctransformers compared to AutoAWQ or TensorRT-LLM? I thought GGML-based stuff wouldn't scale well if you are doing batches of 1000 generations at a time, and that GPU-first libraries would outperform it.
Now to tutorials. I feel like most of the community doesn't have more than four GPUs, or a need to scale workloads that far, so I'm not sure how useful that would be for us/them. You could write a tutorial about splitting training among 10-20 commodity GPUs where each GPU has less VRAM than is needed for micro_batch_size. Then the same for inference: sharding models and running parts of them on various GPUs. There was interest in builds with 10x P40 and 8x 3080 cards, so while maybe not the most practical, doing some fun experiments with that kind of setup might be interesting to the community. Basically, in-depth DeepSpeed, FSDP, and 3D parallelism.
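A quick back-of-the-envelope shows why that kind of full sharding pays off. This helper is my own sketch, using the standard mixed-precision Adam figure of 16 bytes of training state per parameter (as in the DeepSpeed ZeRO papers) and ignoring activations, buffers, and fragmentation:

```python
def zero3_gib_per_gpu(n_params: float, n_gpus: int) -> float:
    """Rough per-GPU memory for ZeRO-3 / FSDP full sharding with
    mixed-precision Adam: bf16 params (2 B) + bf16 grads (2 B) +
    fp32 master weights, momentum, and variance (12 B), all sharded
    evenly across the GPUs. Activations are NOT included."""
    bytes_per_param = 2 + 2 + 12  # 16 bytes of training state per parameter
    return n_params * bytes_per_param / n_gpus / 2**30

# A 7B model carries ~104 GiB of training state in total, far beyond any
# single consumer card, but sharded across ten 24 GB P40s it is ~10.4 GiB
# per card, leaving headroom for activations.
print(zero3_gib_per_gpu(7e9, 10))
```

The same arithmetic makes the inference case obvious too: weights-only sharding at 2 bytes per parameter is roughly an eighth of these numbers.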
If you are not committed to writing about scaling to 200 GPUs to push your service, you could write an in-depth guide to the ins and outs of QLoRA, with experiments on various parameters: memory scaling, rank scaling, the relation of training speed to GPU core performance and memory speed, and the applicable scaling laws. All of it with the most popular tools used in the community as well as the lesser-known ones. For the popular ones we have axolotl and training in the oobabooga webui; less popular are tools like H2O LLM Studio. There are probably more that I don't know. Quite a lot of people here ask basic questions about QLoRA. While I like answering those, having a good comprehensive tutorial would be nice.
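To give a feel for the rank-scaling numbers such a tutorial would cover, here is a small sketch of how LoRA trainable-parameter counts grow with rank. The 4096x4096 q/k/v/o projection shapes in 32 layers are an illustrative Llama-7B-style assumption (real models vary, e.g. with GQA), and the function name is mine:

```python
def lora_trainable_params(layer_shapes, rank: int) -> int:
    """Each adapted weight W of shape (d_out, d_in) gets two low-rank
    factors, A (rank x d_in) and B (d_out x rank), so the adapter adds
    rank * (d_in + d_out) trainable parameters per layer."""
    return sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)

# Illustrative setup: q/k/v/o projections, each 4096x4096, in 32 layers.
shapes = [(4096, 4096)] * 4 * 32
for r in (8, 16, 64):
    n = lora_trainable_params(shapes, r)
    print(f"rank {r}: {n / 1e6:.1f}M trainable params")
```

The counts scale linearly with rank and stay tiny next to the 7B frozen base (which QLoRA holds in 4-bit, about half a byte per parameter), which is exactly why rank is the cheap knob to experiment with.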
what is different/better about whatever you are attempting to suggest compared to the existing prominent solutions such as vLLM, TensorRT-LLM, etc?
it’s not clear to me exactly what the value proposition of what you’re offering is.
i don't understand, where is this supposed to run? at a cloud provider? so this script is installed there, and it handles the distribution?
i read the docs on the site and i must say these questions were not answered. perhaps add a 'what is burla' section.
First of all, your platform seems quite interesting and I might give it a go when I get some spare time.
Regarding the brand name… not a great choice tbh: it literally means “scam” in Portuguese and “mockery” in Spanish. Those might not be your target markets though, I understand.
I’d eat my arms for sharding LLMs with Node/Bun.