Optimum Intel int4 on iGPU UHD 770
I’d like to share the result of running inference with the Optimum Intel library and the Starling-LM-7B Chat model quantized to int4 (NNCF) on the Intel UHD Graphics 770 iGPU (i5-12600), using OpenVINO.
I think it’s quite good: 16 tok/s with 25-30% CPU load. Performance is the same with int8 (NNCF) quantization.
This runs inside a Proxmox VM with an SR-IOV virtualized GPU, 16 GB of RAM and 6 cores. I also found that the memory ballooning device can crash the VM, so I disabled it; swap is on a zram device instead.
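As a side note, before running anything it’s worth confirming that the SR-IOV iGPU is actually visible to OpenVINO inside the VM. A minimal sketch (not part of my script; the reported device name will vary with your setup):

```python
import openvino as ov

core = ov.Core()

# Should list 'GPU' alongside 'CPU' if the SR-IOV iGPU is passed through correctly
print(core.available_devices)

# Prints something like "Intel(R) UHD Graphics 770 ..." when the GPU plugin sees the iGPU
print(core.get_property("GPU", "FULL_DEVICE_NAME"))
```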
free -h output while inferencing:

               total        used        free      shared  buff/cache   available
Mem:            15Gi       6.2Gi       573Mi       4.7Gi        13Gi       9.3Gi
Swap:           31Gi       256Ki        31Gi
Code adapted from https://github.com/OpenVINO-dev-contest/llama2.openvino
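For anyone who wants to try the Optimum Intel route directly, here is a minimal sketch of int4 weight compression plus generation. It assumes a recent optimum-intel release that provides OVWeightQuantizationConfig, and it uses berkeley-nest/Starling-LM-7B-alpha as the model id; it is not the exact script I ran (that one is adapted from the repo above):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

# Assumed Hugging Face id for the Starling-LM-7B Chat model
model_id = "berkeley-nest/Starling-LM-7B-alpha"

# int4 weight-only compression (done by NNCF under the hood)
quant_config = OVWeightQuantizationConfig(bits=4)

# Export to OpenVINO IR with compressed weights, then target the iGPU
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=quant_config,
)
model.to("GPU")

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Plain prompt for brevity; the model's chat format is omitted here
inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

After the first export you can save the IR with model.save_pretrained(...) and reload it later without re-exporting, which saves a lot of startup time.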
What are your thoughts on this?
I hope that something similar emerges on Linux.
SYCL could be a candidate, like Vulkan is for 3D acceleration: it’s a PITA to deal with CUDA, ROCm, etc.
That’s why Intel is pitching oneAPI. They want it to be the single API that brings everything together, which is why it also supports NVIDIA GPUs, AMD GPUs, CPUs and even FPGAs.