In the evolving landscape of AI infrastructure, serverless GPUs have been a game changer. Six months on from our last guide, which sparked multiple discussions and raised awareness about the space, we’re back with fresh insights on the state of “True Serverless” offerings, including performance benchmarks and a cost-effectiveness analysis for the Llama 2 7B and Stable Diffusion 2.1 models.

📊 Performance Testing Methodology: We put the spotlight on popular serverless GPU contenders: Runpod, Replicate, Inferless, and Hugging Face Inference Endpoints, specifically testing for:

1. Cold Starts: Measured as total latency minus inference time, this captures the delay introduced by initializing a dormant serverless function. Cold starts varied widely across platforms.

Average Cold-starts across platforms
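A minimal sketch of how cold start can be derived from request timings, as described above. The `send_request` callable and the idea that the endpoint reports its own inference time are assumptions for illustration, not any platform's actual API:

```python
import time

def measure_cold_start(send_request):
    """Estimate cold-start overhead for a single request.

    `send_request` is a hypothetical callable that hits the endpoint and
    returns the model's reported inference time in seconds; total latency
    is measured client-side. Cold start = total latency - inference time.
    """
    start = time.monotonic()
    inference_time = send_request()           # hypothetical endpoint call
    total_latency = time.monotonic() - start
    return total_latency - inference_time

# Stubbed endpoint: ~0.5 s total round trip, 0.1 s reported inference
cold = measure_cold_start(lambda: (time.sleep(0.5), 0.1)[1])
```

In a real benchmark you would scale the endpoint to zero between runs so every measurement actually hits a dormant function.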

2. Variability: We don’t trust one-off results; we tested each platform over 5 days to ensure stability, and observed clear differences in consistency.

Performance Variability
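One simple way to quantify the consistency differences mentioned above is the coefficient of variation of repeated measurements. The sample values below are illustrative placeholders, not figures from our benchmark:

```python
import statistics

# Hypothetical cold-start samples (seconds) collected across 5 days
samples = [4.2, 4.8, 5.1, 4.5, 4.9]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
cv = stdev / mean  # coefficient of variation: lower means more consistent
```

Comparing `cv` across platforms normalizes away absolute speed, so a slow-but-steady platform is distinguishable from a fast-but-erratic one.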

3. Autoscaling: We simulated traffic peaks to assess how well platforms scale under pressure, sending 200 requests with a concurrency of 5. Not all platforms managed linear scaling efficiently, leading to varied latencies under load.

Autoscaling capabilities
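The load pattern above (200 requests, concurrency 5) can be sketched with a thread pool. The `call_endpoint` body is a stand-in for a real HTTP call to a serverless endpoint:

```python
from concurrent.futures import ThreadPoolExecutor
import time

TOTAL_REQUESTS = 200
CONCURRENCY = 5

def call_endpoint(i):
    # Stand-in for an HTTP request to the serverless inference endpoint;
    # returns the observed per-request latency in seconds.
    start = time.monotonic()
    time.sleep(0.01)  # simulated inference work
    return time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(call_endpoint, range(TOTAL_REQUESTS)))

# Tail latency is what autoscaling problems show up in first
p95 = sorted(latencies)[int(0.95 * len(latencies))]
```

On a platform that scales linearly, `p95` should stay close to the median as concurrency rises; a widening gap signals queuing or slow scale-out.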

4. Decoding Serverless Pricing:

4.1 We modeled a scenario where you process 1,000 documents daily with the Llama 2 7B model. Here’s the TL;DR on costs:

Llama 2 7B Cost Analysis
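The shape of such a cost model is simple to express in code. Every number below is an illustrative assumption, not a price or timing from our analysis:

```python
# Hypothetical serverless GPU cost model for the document workload.
# All constants are illustrative assumptions.
DOCS_PER_DAY = 1000
SECONDS_PER_DOC = 2.0           # assumed inference time per document
COLD_STARTS_PER_DAY = 50        # assumed scale-from-zero events
COLD_START_SECONDS = 10.0       # assumed cold-start duration
PRICE_PER_GPU_SECOND = 0.0005   # assumed $/GPU-second for the chosen tier

billable_seconds = (DOCS_PER_DAY * SECONDS_PER_DOC
                    + COLD_STARTS_PER_DAY * COLD_START_SECONDS)
daily_cost = billable_seconds * PRICE_PER_GPU_SECOND
monthly_cost = daily_cost * 30
```

The same structure applies to the image use case in 4.2: swap in images per day and the platform's Stable Diffusion cold-start times.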

4.2 For the image processing (Stable Diffusion) use case, only the number of processed items and the cold-start times differ. Instead of 1,000 documents, we’re considering 1,000 images daily.

Stable Diffusion Cost Analysis

🔮 Overall Insights: The serverless GPU sector is advancing, notably in reducing cold-start times and improving cost efficiency. However, the best choice depends on your specific use case. While AWS Lambda leads in general serverless computing, specialized workloads, particularly GPU-intensive ones, may find better options elsewhere.

Detailed Blog link: https://www.inferless.com/learn/the-state-of-serverless-gpus-part-2

This analysis aims to shed light on the serverless GPU arena. We welcome feedback and strive for precision in our findings.