What kind of performance should we expect?

cosmicr@alien.top · 1 year ago

What kind of performance should we expect?

DarthNebo@alien.top · 1 year ago

Run this with TGI or vLLM

Aaaaaaaaaeeeee@alien.top · 1 year ago

What’s the latest t/s on a 4-bit model with TGI? Is there a difference compared with the HF Transformers loader?

DarthNebo@alien.top · 1 year ago

The attention layers get replaced with Flash Attention 2, and there’s KV caching as well, so you get much better batch-1 and batch-N results. On top of that, continuous batching is applied to every request.
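
To give a rough idea of why continuous batching helps, here is a toy simulation in plain Python. It is only a sketch, not how TGI or vLLM actually schedule work: the function names, the request lengths, and the "one decode step produces one token for every running request" cost model are all assumptions made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch must run until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle until the longest one in the batch is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request frees its slot immediately."""
    queue = deque(lengths)   # tokens still to generate, per waiting request
    running = []             # tokens still to generate, per active request
    steps = 0
    while queue or running:
        # Fill any free slots from the queue before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that finished.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Hypothetical workload: number of tokens each request wants to generate.
lengths = [2, 8, 3, 8, 2, 3]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With mixed output lengths like these, the continuous version needs fewer decode steps because no slot waits on the longest request in its batch. Real servers also have to account for prefill cost and KV cache memory, which this sketch ignores.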

dodo13333@alien.top · 1 year ago

What is TGI?