Right now it seems we are once again on the cusp of another round of LLM size upgrades. It appears to me that having 24GB of VRAM gets you access to a lot of really great models, but 48GB really opens the door to the impressive 70B models and lets you run the 30B models comfortably. However, I'm seeing more and more 100B+ models being created that push 48GB setups down into lower quants, if they can run the model at all.

This is big, in my opinion, because 48GB is currently the magic number for consumer-level cards: 2x 3090s or 2x 4090s. Adding an extra 24GB to a build via consumer GPUs turns into a monumental task due to either space in the tower or capabilities of the hardware, AND it would only put you at 72GB, right at the very edge of the recommended VRAM for the 120B Q4_K_M models.
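
For anyone who wants to sanity-check that napkin math, here's roughly how I'm estimating it (the bits-per-weight figures are approximate and the overhead factor is just my own assumption for context and buffers):

```python
# Rough VRAM estimate for quantized model weights (napkin math, not exact).
# Bits-per-weight values are approximate; the overhead factor for context and
# buffers is an assumption, not a measured number.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "FP16": 16.0}

def est_vram_gb(params_b: float, quant: str, overhead: float = 0.15) -> float:
    """Estimate VRAM in GB for params_b billion parameters at a given quant."""
    weight_gb = params_b * BITS_PER_WEIGHT[quant] / 8  # GB for the weights alone
    return weight_gb * (1 + overhead)                  # plus context/buffer headroom

for size in (30, 70, 120):
    print(f"{size}B @ Q4_K_M ≈ {est_vram_gb(size, 'Q4_K_M'):.0f} GB")
# ~21 GB for 30B, ~48 GB for 70B, ~83 GB for 120B -- which is why even 72GB
# only just scrapes by on the really big ones.
```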

I genuinely don't know what I am talking about and I am just rambling, because I am trying to wrap my head around HOW to upgrade my VRAM to load the larger models without buying a massively overpriced workstation card. Should I stuff 4 3090s into a large tower? Set up 3 4090s in a rig?

How can the average hobbyist make the jump from 48GB to 72GB+?

Is taking the wait-and-see approach towards Nvidia dropping new scalper-priced high-VRAM cards feasible? Hope and pray for some kind of technical magic that drops the required VRAM while keeping quality?

The reason I am stressing about this and asking for advice is that the quality difference between smaller models and 70B models is astronomical, and the difference between the 70B models and the 100B+ models is a HUGE jump too. From my testing it seems that the 100B+ models really turn the "humanization" of the LLM up to the next level, leaving the 70B models to sound like… well… AI.

I am very curious to see where this gets to by the end of 2024, but for sure… I won't be seeing it on a 48GB VRAM setup.

  • unculturedperl@alien.top · 10 months ago

    Speed costs money, how fast can you afford to go?

    Why 72GB? 80 or 96 seems like a more reasonable number. H100s have 80GB models if you can afford it ($29k?). Two A6000 Adas would be $15k (plus a system to put them in).

    The higher-end compute cards seem more limited by funds and production than anything; X090 cards are where you find more scalpers and their ilk.

  • bick_nyers@alien.top · 10 months ago

    ITT there are people discussing jumping to Threadripper etc. just to get enough PCIe lanes.

    Alternatively, pick up a Zen 2 EPYC on eBay for cheap. A 16-core CPU + motherboard could run you around $500, and you can get six PCIe 4.0 x16 slots. Check motherboard specs and learn more about using server hardware (loud fans!) via ServeTheHome and Art of Server.

    Saw something a while back that GDDR7 will have something like 33% more memory per chip, so if the bus width stays the same we are looking at a 32GB 5090. Keep in mind this will be PCIe 5.0.

  • synn89@alien.top · 10 months ago

    Building a system that supports two 24GB cards doesn't have to cost a lot. Boards that can do dual x8 PCIe, and cases/PSUs that can handle 2 GPUs, aren't hard to find. The problem I see past that is that you're running into much more exotic/expensive hardware. AMD Threadripper comes to mind, which is a big price jump.

    Given that the market of people who can afford that is much smaller than for dual-card setups, I don't feel like we'll see the lion's share of open source happening at that level. People tend to tinker on things that are likely to get used by a lot of people.

    I don't really see this changing much until AMD/Intel come out with graphics cards that break the 24GB consumer-card barrier to compete with Nvidia head on in the AI market. Right now Nvidia won't do that, so as not to compete with their premium-priced server cards.

  • fallingdowndizzyvr@alien.top · 10 months ago

    The easiest thing to do is to get a Mac Studio. It also happens to be the best value. 3x 4090s at $1,600 each is $4,800, and that's just for the cards; adding a machine to put those cards into will cost another few hundred dollars. Just the cost of the 3x 4090s puts you into Mac Ultra 128GB range, and adding the machine to put them in puts you in Mac Ultra 192GB range. With those 3x 4090s you only get 72GB of VRAM; both Mac options give you much more.

  • Bod9001@alien.top · 10 months ago

    If you want to run a general-purpose model that can do everything, fair enough, throw resources at it. But I feel like there's a lot of optimisation that can be done; e.g. a coding model doesn't need to know how to fill out tax returns or who won the European Cup in 1995–96. It may even be possible to shrink models without any loss.

  • kingp1ng@alien.top · 10 months ago

    I keep hearing *unsubstantiated* rumors about model optimization breakthroughs. Everyone knows that the cost of compute is too damn high.

    So I'm just waiting until the next performance improvements arrive. Three years ago, a 1B-param model was state of the art. Hopefully by next year there'll be a model and framework which cuts the compute cost in half.

  • DominicanGreg@alien.top (OP) · 10 months ago

    Parts-wise, a Threadripper + ASUS Pro WS WRX80E-SAGE SE WiFi II is already a $2k price floor.

    Each 4090 is $2-2.3k.

    Each 3090 is $1-1.5k.

    So building a machine from scratch will easily run you $8-10k with 4090s and $6-8k with 3090s. If you already have some GPUs or parts, you would still probably need 2 or more extra GPUs, plus the space and power to run them.

    In my specific situation, I would have to grab the Threadripper, mobo, a case, RAM, and 2 more cards; I'm looking at potentially $5-7k worth of damage. OR… pay $8.6k for a Mac Pro M2 and get an entire extra machine to play with.

    There's definitely an entire Mac Pro M3 series on the way considering they just released the laptops; it's only a matter of time before they shoot out the announcements. So I would definitely feel a bit peeved if I bought the M2 tower only for Apple to release the M3 versions a month or two later.

  • corecursion0@alien.top · 10 months ago

    The next gen of models is at the 110B mark and beyond. I would say estimate what it takes to run 250B at FP8 and FP16, then structure your purchases accordingly. Favour high-bandwidth memory.
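
    Weights-only napkin math for that 250B target (KV cache, activations, and framework overhead come on top of this):

    ```python
    # Weights-only memory for a 250B-parameter model (napkin math; KV cache,
    # activations, and framework overhead are extra on top of this).
    params = 250e9
    for name, bytes_per_param in (("FP8", 1), ("FP16", 2)):
        gb = params * bytes_per_param / 1e9
        print(f"250B @ {name}: ~{gb:.0f} GB just for the weights")
    # -> ~250 GB at FP8, ~500 GB at FP16
    ```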

  • tylerbeefish@alien.top · 10 months ago

    Your wait-and-see approach is probably wise. The newly released GH200 chip leapfrogs the H100 by a considerable margin, and the H100 was already smoking the A100.

    On the consumer side, there does not seem to be high demand for running local LLMs. However, I used a 7B model with GPT4All on my ultrabook from 2014, which has a low-tier Intel 6th-gen CPU and 16GB of RAM, and was getting about 2.5 tokens/second. It was super slow, but it shows what would be possible with some optimizations on consumer hardware.

    If you're willing to spend $10k to run an esoteric 110B model, it might be worthwhile to go for the capability to train them in the first place (even if perhaps very slowly). Or consider a Mac with a large amount of memory built into the SoC (unified memory), which would likely run models at an acceptable rate with some optimizations, if blistering performance isn't necessary.

    Otherwise, patience will likely have some good results in the context of a solid model which works on consumer-grade components. The space seems keen on allowing general users and enabling alternatives to transmitting data to some random server elsewhere. Opinion.

  • Flying_Madlad@alien.top · 10 months ago

    I think the future is modular. Many small machines contributing to hosting a bigger model.

    That way, if you need to upgrade the capacity of your system, you can just add another compute node.

  • MindOrbits@alien.top · 10 months ago

    Yes.

    Workstations are the way to go. There are a few motherboards out there that give you four double-wide slots.

    Pro tip: think in PCIe 3.0 terms; x16 (PCIe 3.0) is a sought-after baseline. x8 often performs at about 80% of x16, often because other system limitations are the bottleneck, not the PCIe bus.

    Depending on CPU, motherboard chipset, and internal lane connections, you will struggle to find four x16 slots.

    PCIe 4.0 adds to the mess, but always to your benefit, just not as much as you might think, depending on the above.

    Older cards: PCIe 3.0. Most cards you'd consider modern and good or better: PCIe 4.0. New cards: PCIe 5.0.

    PCIe 4.0 lanes can be split by chipsets for things like NVMe drives and USB, and 4.0 is 2x the bandwidth of PCIe 3.0 with supported 4.0 devices (x8 PCIe 4.0 ≈ x16 PCIe 3.0). A nice motherboard feature is when x16 PCIe 4.0 lanes are split into two x16 PCIe 3.0 slots. Chipsets and NVMe drives benefit greatly from PCIe 4.0 and often free up more PCIe 3.0 lanes for the slots.
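
    Rough one-direction bandwidth numbers if you want to sanity-check the x8/x16 trade-off (theoretical maxima; real transfers land a bit lower):

    ```python
    # Theoretical one-direction PCIe bandwidth per lane, in GB/s, by generation.
    # Real-world transfers land below these due to protocol overhead.
    GBPS_PER_LANE = {3.0: 0.985, 4.0: 1.969, 5.0: 3.938}

    for gen, lanes in [(3.0, 8), (3.0, 16), (4.0, 8), (4.0, 16)]:
        print(f"PCIe {gen} x{lanes}: ~{GBPS_PER_LANE[gen] * lanes:.1f} GB/s")
    # PCIe 3.0 x16 ≈ 15.8 GB/s and PCIe 4.0 x8 ≈ 15.8 GB/s -- hence x8 4.0 ≈ x16 3.0
    ```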

    So… if you find four double-wide PCIe slots with at least x8 lanes per slot, you're leaving some performance 'on the table', but you're really not that handicapped by the loss for what you're buying, especially when shopping used.

    Really new cards would suffer more from lane saturation, and may not have a favorable cost-to-benefit ratio given newer cards' prices.

  • -Automaticity@alien.top · 10 months ago

    If Nvidia isn't upgrading GPUs past 24GB for the RTX 50 series, then that will probably factor into the open-source community keeping models below 40B parameters. I don't know the exact cutoff point. A lot of people with 12GB of VRAM can run 13B models, but you could also run a 7B at 8-bit with 16k context size. It will get increasingly difficult to run larger contexts with larger models.
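
    For a sense of how fast context eats VRAM on top of the weights, here's KV-cache napkin math (the architecture numbers are assumptions, roughly Mistral-7B-shaped, not exact figures for any particular model):

    ```python
    # Back-of-the-envelope KV-cache size: the part of VRAM that grows with context.
    # Architecture numbers are assumed (32 layers, 8 KV heads, head dim 128, fp16)
    # for illustration only, roughly Mistral-7B-shaped.
    def kv_cache_gib(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_val=2):
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * ctx_len / 2**30  # 2x for K and V

    for ctx in (4096, 16384, 32768):
        print(f"{ctx:>6} ctx: ~{kv_cache_gib(ctx):.1f} GiB of KV cache on top of the weights")
    # ~0.5 GiB at 4k, ~2 GiB at 16k, ~4 GiB at 32k; older models without
    # grouped-query attention need roughly 4x these numbers.
    ```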

    Some larger open models are being released, but there won't be much of a community able to train those huge models on a bunch of datasets and nail the ideal finetune.