What "Serverless GPU" Actually Means
AI isn't just software — it's a physical reality with hard resource ceilings. I've walked through data centres where the evaporative cooling towers consume more water than a small neighbourhood, a reality most software engineers can afford to ignore. My clients, however, cannot. They come to me with a specific constraint: they need GPU horsepower for inference or fine-tuning, but without the operational burden of provisioning dedicated hardware. The traditional path — procuring expensive GPU instances, managing drivers, building complex autoscaling groups — leads to underutilised resources or frantic scrambling during peak demand.
This guide is the decision layer. It compares how AWS, GCP, and Azure each approach serverless GPU access, surfaces the trade-offs that matter in production, and gives you a clear framework for choosing the right platform for your workload. Each platform has a dedicated deep-dive article in this series; links are provided at the end of each section.
The term is overloaded. Across providers it covers at least three distinct patterns:
- Scale-to-zero inference endpoints — you deploy a model; the provider scales instances (including GPU instances) between zero and N based on traffic. You pay per invocation or per second of active compute. Cold starts are the main trade-off.
- Serverless batch / fine-tuning jobs — you submit a job; the provider allocates GPU resources for the duration, then releases them. No endpoint to manage, no replica count to tune.
- Managed model APIs — the provider runs the model entirely (e.g. Amazon Bedrock, Azure OpenAI). You call an API; you never touch the GPU abstraction at all.
Not all platforms support all three patterns equally. That asymmetry is the first thing to understand before choosing.
Platform Comparison
| | AWS | GCP | Azure |
|---|---|---|---|
| Primary service | SageMaker Serverless Inference | Vertex AI Endpoints | Azure AI Projects (AIProjectClient) |
| Managed model API | Amazon Bedrock | Vertex AI Model Garden | Azure OpenAI |
| Serverless inference (custom models) | ✅ SageMaker Serverless | ✅ Vertex AI (min_replica=0) | ⚠️ Managed Online Endpoints (not true scale-to-zero) |
| Serverless fine-tuning | ✅ SageMaker Training | ⚠️ Custom job, not fully serverless | ✅ Azure OpenAI fine-tuning via AIProjectClient |
| Memory ceiling (serverless) | 6 GB | ~16 GB (n1-standard-4 + T4) | Model-dependent (Azure OpenAI managed) |
| Cold-start mitigation | ProvisionedConcurrency | min_replica_count ≥ 1 | N/A (job-based) |
| IaC support | CloudFormation / CDK / Terraform | Terraform (first-class) | Bicep / Terraform |
| Best fit | Variable inference, existing AWS stack | Integrated MLOps pipelines | Fine-tuning Azure OpenAI models |
AWS: SageMaker Serverless Inference and Bedrock
AWS gives you two distinct serverless entry points. For custom models, SageMaker Serverless Inference lets you deploy your own container without managing instances. For foundation models, Amazon Bedrock provides fully managed, pay-per-token API access to Anthropic, Cohere, Amazon Titan, and others — no GPU management whatsoever.
The key configuration levers are MemorySizeInMB (1024–6144 MB) and MaxConcurrency. The 6 GB memory ceiling is the hard limit that matters most in practice: models larger than roughly 3–4 B parameters at FP16 will not fit. For larger models, the path is SageMaker Endpoints with dedicated GPU instances (ml.g5 or ml.inf2) or Bedrock.
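The fit-or-not arithmetic above is worth making explicit. The following is a minimal sketch of the back-of-envelope check, assuming 2 bytes per parameter for FP16 weights and a hypothetical 20% overhead factor for activations and runtime memory (the overhead figure is my assumption, not an AWS number):

```python
def fits_sagemaker_serverless(param_count: float,
                              bytes_per_param: int = 2,
                              overhead_factor: float = 1.2) -> bool:
    """Rough check against the 6144 MB SageMaker Serverless ceiling.

    bytes_per_param=2 assumes FP16 weights; overhead_factor is an
    assumed allowance for activations, CUDA runtime, and framework
    memory, not an official AWS figure.
    """
    ceiling_bytes = 6144 * 1024 ** 2  # 6144 MB ceiling, treated as MiB
    return param_count * bytes_per_param * overhead_factor <= ceiling_bytes

# A 2B-parameter FP16 model needs ~4.8 GB with overhead: fits.
# A 7B-parameter FP16 model needs ~16.8 GB with overhead: does not fit.
```

In practice quantisation (INT8, INT4) shifts this boundary, which is why `bytes_per_param` is a parameter rather than a constant.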
ProvisionedConcurrency is the cold-start dial. Setting it to 0 maximises cost savings but introduces 10–30 s cold starts. Setting it to 1–2 keeps instances warm at a modest fixed cost. I advise clients to start at zero to establish a baseline, then add provisioned concurrency only if latency SLOs are breached.
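Wiring those levers together, here is the shape of a serverless endpoint config as passed to boto3's SageMaker client via `create_endpoint_config(**endpoint_config)`. The names (`demo-serverless-config`, `demo-model`) are placeholders; the payload is built as a plain dict so the structure is visible without AWS credentials:

```python
# Shape of a SageMaker serverless endpoint config. Pass to boto3 as
# sagemaker_client.create_endpoint_config(**endpoint_config).
endpoint_config = {
    "EndpointConfigName": "demo-serverless-config",   # placeholder name
    "ProductionVariants": [
        {
            "VariantName": "AllTraffic",
            "ModelName": "demo-model",                # placeholder model
            "ServerlessConfig": {
                "MemorySizeInMB": 6144,    # 1024-6144, in 1024 MB steps
                "MaxConcurrency": 10,      # cap on concurrent invocations
                # Keeps one instance warm at fixed cost; omit this key
                # entirely for pure scale-to-zero behaviour.
                "ProvisionedConcurrency": 1,
            },
        }
    ],
}
```

Starting from this config, the "baseline first" advice above translates to deploying without the `ProvisionedConcurrency` key, then adding it only once measured p99 latency breaches the SLO.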
When to choose AWS: Your team already operates in the AWS ecosystem, your model fits within 6 GB of memory, and you want the most familiar IAM and observability model. AWS Cost Explorer's per-endpoint spend visibility is also the most mature of the three platforms.
→ See the full implementation guide: Serverless GPU Inference on AWS SageMaker
GCP: Vertex AI Endpoints
Vertex AI Endpoints offer the tightest integration with the broader Vertex platform — Model Registry, Experiments, Pipelines, and Monitoring all share the same resource model. You define a machine type with an attached GPU accelerator (T4, A100, H100), set autoscaling bounds, and Vertex handles the rest.
The critical setting is min_replica_count. Setting it to 0 gives true scale-to-zero cost behaviour but introduces cold starts that regularly exceed 60 seconds for GPU instances. For any user-facing service, min_replica_count=1 is the correct default — the cost of one idle n1-standard-4 + T4 replica (~$0.95/hr) is almost always cheaper than the engineering effort of debugging SLO breaches.
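The trade-off reduces to a handful of keyword arguments on `aiplatform.Model.deploy()`. The sketch below builds them as a plain dict (pass as `model.deploy(**deploy_kwargs)`) so the settings are visible without GCP credentials; the machine and replica numbers mirror the example above:

```python
# Keyword arguments for google-cloud-aiplatform's Model.deploy().
# Built as a dict here so no GCP credentials are needed to inspect it;
# in real use: endpoint = model.deploy(**deploy_kwargs)
deploy_kwargs = {
    "machine_type": "n1-standard-4",
    "accelerator_type": "NVIDIA_TESLA_T4",
    "accelerator_count": 1,
    # 0 gives true scale-to-zero but 60s+ GPU cold starts;
    # 1 keeps a warm replica (~$0.95/hr) and is the safer default.
    "min_replica_count": 1,
    "max_replica_count": 3,
}
```

Swapping `accelerator_type` to an A100 or H100 variant (with a matching machine type) is how you buy the memory headroom discussed next.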
For memory headroom, Vertex AI is the most flexible of the three: you can attach an A100 (40 GB HBM2) or H100 (80 GB HBM3) to a custom machine type, bypassing the 6 GB ceiling that constrains AWS SageMaker Serverless.
When to choose GCP: You are building within the Vertex AI platform and want MLOps tooling (pipelines, model versioning, experiments) in one place. Also the right choice if your model is too large for SageMaker Serverless but you want to stay serverless-adjacent with autoscaling.
→ See the full implementation guide: Serverless GPU Inference on GCP Vertex AI
Azure: Serverless Fine-Tuning and Inference via AIProjectClient
Azure's serverless GPU story splits cleanly by use case.
For fine-tuning, Azure has the strongest serverless model of the three platforms. You connect to your Azure AI Project via AIProjectClient, get an OpenAI client, upload training data, and submit a fine-tuning job. Azure allocates GPU compute for the duration, the job completes, and you pay only for what ran. There is no endpoint to manage, no replica count to tune, and no cold-start concern. For the fine-tuning use case specifically, this is the cleanest serverless pattern I have encountered across any cloud.
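The job-submission step looks like a standard OpenAI fine-tuning call once you have the OpenAI-compatible client from your `AIProjectClient` (the exact method for obtaining that client varies by SDK version, so treat this as a sketch). The base model name and file ID below are placeholders; the file ID would come from a prior training-data upload:

```python
# Parameters for a fine-tuning job, in the shape the OpenAI-compatible
# client expects: client.fine_tuning.jobs.create(**job_params).
# Model name and file ID are placeholders, not real resources.
job_params = {
    "model": "gpt-4o-mini",             # base model to fine-tune (placeholder)
    "training_file": "file-abc123",     # ID returned by the data upload step
    "hyperparameters": {"n_epochs": 3}, # optional tuning knobs
}
```

Because the job is fully managed, there is nothing to clean up afterwards: the GPU allocation exists only for the lifetime of the job.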
For real-time inference on custom models, Azure is the weakest of the three. Azure Machine Learning Managed Online Endpoints support GPU instance types and autoscaling, but they do not scale to zero — a minimum of one instance must remain running. This makes them operationally similar to a managed dedicated instance rather than true serverless. The exception is Azure OpenAI endpoints: these are fully managed, scale transparently, and are the right choice if your inference workload is against a supported OpenAI model.
When to choose Azure: You are fine-tuning Azure OpenAI models, your organisation already runs on Azure, or you need the compliance and data-residency guarantees of Azure's European regions. Do not choose Azure for serverless inference of custom models if cold-start-free scale-to-zero is a hard requirement.
→ See the full implementation guide: Serverless GPU on Azure — Fine-Tuning and Inference with AIProjectClient
NVIDIA: The Foundational Layer
NVIDIA is not a cloud competitor to the three above — it is the substrate all three run on. The GPU SKU your provider uses directly determines available memory, compute throughput, and pricing:
| NVIDIA SKU | HBM | Typical cloud mapping | Best for |
|---|---|---|---|
| T4 | 16 GB | GCP n1 + T4, AWS ml.g4dn | Inference workhorses, cost-optimised |
| A10G | 24 GB | AWS ml.g5 | Inference + light fine-tuning |
| A100 | 40 / 80 GB | GCP a2, Azure NC A100 v4 | Large model inference, training |
| H100 | 80 GB HBM3 | GCP a3, Azure ND H100 v5 | Frontier model training and inference |
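The table above can be turned into a first-pass SKU selector. This is an illustrative sketch only: it uses the rough 2-bytes-per-parameter FP16 estimate, takes the 40 GB A100 variant (an 80 GB A100 also exists), and ignores activation memory and multi-GPU sharding:

```python
# HBM capacity per SKU, from the table above (A100 taken as the
# 40 GB variant; an 80 GB variant also exists).
SKU_HBM_GB = {"T4": 16, "A10G": 24, "A100": 40, "H100": 80}

def smallest_fitting_sku(param_count: float, bytes_per_param: int = 2):
    """Smallest single GPU whose HBM holds the raw weights.

    Weights-only estimate: ignores activations and KV cache, so real
    deployments need headroom beyond this.
    """
    needed_gb = param_count * bytes_per_param / 1e9
    for sku, hbm in sorted(SKU_HBM_GB.items(), key=lambda kv: kv[1]):
        if hbm >= needed_gb:
            return sku
    return None  # weights exceed any single listed GPU: shard or quantise

# smallest_fitting_sku(7e9)  -> "T4"  (14 GB of FP16 weights in 16 GB HBM)
# smallest_fitting_sku(30e9) -> "H100" (60 GB of FP16 weights)
```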
NVIDIA Inference Microservices (NIMs) deserve a mention: these are optimised, pre-built containers for popular models (LLaMA, Mistral, Stable Diffusion) that run identically on-prem or on any compatible cloud GPU. If hardware portability is a requirement — running the same workload on GCP today and on-prem next quarter — NIMs are the right abstraction layer to evaluate.
Choosing: A Decision Framework
Start with your workload shape:
- Managed foundation model (no custom weights)? → Use Bedrock (AWS), Vertex AI Model Garden (GCP), or Azure OpenAI. No GPU management needed.
- Custom model, sporadic traffic, ≤ 6 GB memory? → AWS SageMaker Serverless Inference. Simplest operational model in the AWS ecosystem.
- Custom model, variable traffic, need large GPU memory or tight MLOps integration? → GCP Vertex AI with min_replica_count=1.
- Fine-tuning an OpenAI model or running against Azure OpenAI? → Azure AIProjectClient. Best serverless fine-tuning experience of the three.
- Custom model inference on Azure? → Azure ML Managed Online Endpoints, but budget for a minimum one running instance.
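The branches above can be collapsed into a single function. This is a sketch of the framework as stated, not a substitute for judgment; the boolean inputs are simplifications of the workload traits discussed:

```python
def pick_platform(custom_model: bool, fits_6gb: bool = False,
                  fine_tuning_openai: bool = False,
                  azure_first: bool = False) -> str:
    """Encodes the decision framework above: a suggested starting
    point, not a mandate. Priority order mirrors the bullet list."""
    if not custom_model:
        return "Managed API: Bedrock / Model Garden / Azure OpenAI"
    if fine_tuning_openai:
        return "Azure AIProjectClient fine-tuning"
    if azure_first:
        return "Azure ML Managed Online Endpoints (min 1 instance)"
    if fits_6gb:
        return "AWS SageMaker Serverless Inference"
    return "GCP Vertex AI, min_replica_count=1"
```

Note that the Azure branches deliberately come before the memory check: organisational constraints (existing stack, compliance) usually dominate the purely technical ones.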
FinOps First Principle: Run Serverless for 30 Days, Then Decide
Instrument cost alerting before you go to production — AWS Cost Explorer, GCP Budget API, or Azure Cost Management. Set a budget alarm at 150% of your forecast. A serverless GPU endpoint scaling to meet an unexpected traffic spike can generate a month's worth of expected spend in a single afternoon. Capture p99 latency, invocation count, and actual spend over 30 days. Only then evaluate whether a reserved GPU instance is cheaper. In my experience, fewer than 20% of workloads justify the operational overhead of dedicated GPU compute once idle time and engineering cost are fully factored in.
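As one concrete instance of the "alarm at 150% of forecast" rule, here is the payload shape for AWS Budgets, as passed to boto3 via `budgets_client.create_budget(AccountId=..., **budget_req)`. The dollar figure and email address are placeholders; GCP Budget API and Azure Cost Management have equivalent constructs:

```python
# AWS Budgets request: set BudgetLimit to your monthly forecast, then
# alarm when actual spend crosses 150% of it. Amount and email are
# placeholders.
budget_req = {
    "Budget": {
        "BudgetName": "gpu-endpoint-monthly",
        "BudgetLimit": {"Amount": "500", "Unit": "USD"},  # forecast spend
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    "NotificationsWithSubscribers": [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 150.0,           # percent of BudgetLimit
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@example.com"}
            ],
        }
    ],
}
```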
Conclusion
Serverless GPU access has matured enough to be the default starting point for any new AI workload. But "serverless" means something different on each platform, and the choice between them is not primarily a technical one — it is an operational and organisational one.
Choose AWS if your team already lives in the AWS ecosystem and your models fit within the 6 GB memory ceiling. Choose GCP Vertex AI if you need the full MLOps platform or large-GPU headroom without dedicated instances. Choose Azure if you are fine-tuning Azure OpenAI models or operating in an Azure-first organisation.
The three deep-dive articles in this series go step-by-step through deployment on each platform — from Terraform to SDK calls to cost instrumentation. Start with the platform closest to your existing infrastructure, measure for 30 days, and treat FinOps governance as a first-class requirement from day one.