• 1 Post
  • 13 Comments
Joined 1 year ago
Cake day: November 9th, 2023


  • This is what my hobby project essentially does. I’m running a single chat from 3 different servers in my network, all serving different LLMs that are each given a role in the chat pipeline. I can send the same prompt to multiple models so they can work on it concurrently, or have them hand off each other’s responses to continue elaborating, validating, or whatever that LLM’s job is. Since each server exposes an API and a websocket route, all I need to do is put it behind a proxy and port forward them to the public internet. Anyone here could visit the public URL and run inference workflows in my homelab (theoretically speaking). They could also spin up an instance on their side and we could have our servers talk to each other.

    Of course that’s highly insecure and just bait for bad actors. So I’ll scale it using an overlay network that requires a key exchange and runs over a VPN.

    Any startup thinking they are going to profit from this idea will only burn investor money and waste their own time. This will all be free and it’s only a matter of time before the open source community cuts into their hopes and dreams.
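    The handoff part of a pipeline like this can be sketched in a few lines. Everything here is a hypothetical stand-in: the server URLs, the role phrasing, and the `call_llm` placeholder where a real API/websocket request would go.

```python
# Sketch of a role-based handoff chain: each server receives the
# previous server's output plus its assigned role in the pipeline.
# URLs and roles are made-up examples, not a real deployment.

def call_llm(server_url: str, prompt: str) -> str:
    # Placeholder for the real API/websocket request to a model server.
    return f"[{server_url}] {prompt}"

PIPELINE = [
    ("http://server-a:8000", "elaborate on"),
    ("http://server-b:8000", "validate"),
    ("http://server-c:8000", "summarize"),
]

def run_pipeline(prompt: str) -> str:
    # Each stage continues from the previous stage's response.
    text = prompt
    for url, role in PIPELINE:
        text = call_llm(url, f"Your job is to {role}: {text}")
    return text
```

    Swapping `call_llm` for a real HTTP client is all it takes to point this at actual model servers.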





  • Ideally we’d be in a timeline where LLMs could do this better than classical methods, but we’re not there yet. You can code a handler that cleans up HTML retrieval quite trivially, since you’re just looking for the text in specific tags like articles, headers, paragraphs, etc. There are a ton of frameworks and examples out there on how to do this, and a proper handler would execute the cleanup in a fraction of the time even the most powerful LLM could ever hope to.
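    As a rough sketch of that kind of handler, here is a minimal extractor built on Python’s stdlib `html.parser` that keeps only the text inside content-bearing tags and skips script/style/nav blocks. The tag lists are illustrative assumptions; a real cleaner would tune them per site.

```python
# Minimal "classical" HTML cleanup handler: keep text from content
# tags, drop scripts, styles, and navigation chrome. This kind of
# pass runs in microseconds, far faster than any LLM.
from html.parser import HTMLParser

CONTENT_TAGS = {"article", "h1", "h2", "h3", "p", "li"}
SKIP_TAGS = {"script", "style", "nav", "footer"}

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting level inside content tags
        self.skip = 0     # nesting level inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag in CONTENT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag in CONTENT_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        # Only keep text that is inside a content tag and not inside
        # a skipped block.
        if self.depth > 0 and self.skip == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)

def clean_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```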



  • This is something I’ve noticed with large context as well. This is why the platform built around LLMs will be the major differentiator for the foreseeable future. I’m cooking up a workflow to insert remote LLMs into a chat pipeline, and about an hour ago I successfully tested running inference on a fast Mistral-7B model and a large Dolphin-Yi-70B on different servers from a single chat view. This unlocks the capability to have multiple LLMs working together to manage context by providing summaries, offloading realtime embedding/retrieval to a remote LLM, and a ton of other possibilities.

    I got it working on a 64GB M2 and a 128GB M3. Tonight I will insert the RTX 4090 into the mix. The plan is to have the 4090 run small LLMs, think 13B and smaller; these run at light speed on my 4090. Its job can be to provide summaries of the context using LLMs finetuned for that purpose. The new Orca 13B is a promising little agent that so far follows instructions really well for these types of workflows. Then we can have all 3 servers working together on a solution. Ultimately, all of the responses would be merged into the “ideal response” and output as the “final answer”. I’m not concerned with speed for my use case since I use LLMs for highly technical work. I need correctness above all, even if that means waiting a while for the next step.

    I’m also going to implement a mesh VPN so we can do this over WAN and scale it even more with a trusted group of peers.

    The magic behind ChatGPT is the tooling and how much compute they can burn. My belief is that the model matters less than folks think. It’s the best model, no doubt, but if we were allowed to run it on the CLI as a pure prompt/response workflow between user and model with no tooling in between, my belief is it would perform a lot like the best open source models…
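    The fan-out/merge step described above can be sketched like this. All model calls are stubs; in the real setup each would hit a different server (the M2, the M3, the 4090) over its API, and the merge stub stands in for a small summarizer model.

```python
# Sketch of fan-out and merge: several models answer the same prompt
# concurrently, then a small summarizer merges the drafts into one
# "final answer". Model names and calls are illustrative stubs.
from concurrent.futures import ThreadPoolExecutor

def query_model(name: str, prompt: str) -> str:
    # Stand-in for a remote inference call (e.g. Mistral-7B on one
    # server, Dolphin-Yi-70B on another).
    return f"{name} draft for '{prompt}'"

def merge_drafts(drafts: list) -> str:
    # Stand-in for the summarizer LLM (e.g. a 13B model on the 4090)
    # that would combine the drafts into the "ideal response".
    return " | ".join(drafts)

def final_answer(prompt: str, models: list) -> str:
    # Fan the prompt out to every model concurrently, then merge.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        drafts = list(pool.map(lambda m: query_model(m, prompt), models))
    return merge_drafts(drafts)
```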



  • What’s stopping us from building a mesh of web crawlers and creating a distributed database that anyone can host, adding to the total pool of indexers/servers? How long would it take to create a quality dataset by deploying bots that crawl their way “out” from the most popular and trusted sites for particular knowledge domains, then compress and dump that into a format for training into said global p2p mesh? If we got a couple thousand nerds on Reddit to contribute compute and storage capacity to this network, we might be able to build it relatively fast. Just sayin…
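    The “crawl out from trusted seeds” idea boils down to a depth-limited breadth-first crawl. Here is a minimal sketch with the fetch step left pluggable, so each peer in the mesh could supply its own transport; the link extraction and the dataset shape are deliberately simplified placeholders.

```python
# Depth-limited BFS crawler starting from trusted seed URLs. The
# fetch function is injected so this sketch stays transport-agnostic;
# link extraction via regex is a simplification, not production code.
from collections import deque
import re

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def crawl(seeds, fetch, max_depth=2):
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    dataset = {}
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        dataset[url] = page
        if depth >= max_depth:
            continue  # stop expanding once we've crawled "out" far enough
        for link in LINK_RE.findall(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return dataset
```

    Partitioning the seed lists and merging the resulting datasets is where the p2p mesh would come in.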