Architecting Serverless GPU Access: A Field Guide to AWS, GCP, Azure, and NVIDIA

A field guide for architects comparing serverless GPU access across AWS, GCP, and Azure. This essay breaks down the architectural components, tradeoffs, and costs from a consultant's perspective, with actionable code examples.

What "Serverless GPU" Actually Means

AI isn't just software — it's a physical reality with hard resource ceilings. I've walked through data centres where the evaporative cooling towers consume more water than a small neighbourhood, a reality many software engineers can afford to ignore. My clients, however, cannot. They come to me with a specific constraint: they need GPU horsepower for inference or fine-tuning, but without the operational burden of provisioning dedicated hardware. The traditional path — procuring expensive GPU instances, managing drivers, building complex autoscaling groups — leads to underutilised resources or frantic scrambling during peak demand.

This guide is the decision layer. It compares how AWS, GCP, and Azure each approach serverless GPU access, surfaces the trade-offs that matter in production, and gives you a clear framework for choosing the right platform for your workload. Each platform has a dedicated deep-dive article in this series; links are provided at the end of each section.

The term is overloaded. Across providers it covers at least three distinct patterns:

  • Scale-to-zero inference endpoints — you deploy a model; the provider scales instances (including GPU instances) between zero and N based on traffic. You pay per invocation or per second of active compute. Cold starts are the main cost.
  • Serverless batch / fine-tuning jobs — you submit a job; the provider allocates GPU resources for the duration, then releases them. No endpoint to manage, no replica count to tune.
  • Managed model APIs — the provider runs the model entirely (e.g. Amazon Bedrock, Azure OpenAI). You call an API; you never touch the GPU abstraction at all.

Not all platforms support all three patterns equally. That asymmetry is the first thing to understand before choosing.

Platform Comparison

| | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Primary service | SageMaker Serverless Inference | Vertex AI Endpoints | Azure AI Projects (AIProjectClient) |
| Managed model API | Amazon Bedrock | Vertex AI Model Garden | Azure OpenAI |
| Serverless inference (custom models) | ✅ SageMaker Serverless | ✅ Vertex AI (min_replica_count=0) | ⚠️ Managed Online Endpoints (not true scale-to-zero) |
| Serverless fine-tuning | ✅ SageMaker Training | ⚠️ Custom job, not fully serverless | ✅ Azure OpenAI fine-tuning via AIProjectClient |
| Memory ceiling (serverless) | 6 GB | Accelerator-dependent (16 GB T4 up to 80 GB A100/H100) | Model-dependent (Azure OpenAI managed) |
| Cold-start mitigation | ProvisionedConcurrency | min_replica_count ≥ 1 | N/A (job-based) |
| IaC support | CloudFormation / CDK / Terraform | Terraform (first-class) | Bicep / Terraform |
| Best fit | Variable inference, existing AWS stack | Integrated MLOps pipelines | Fine-tuning Azure OpenAI models |

AWS: SageMaker Serverless Inference and Bedrock

AWS gives you two distinct serverless entry points. For custom models, SageMaker Serverless Inference lets you deploy your own container without managing instances. For foundation models, Amazon Bedrock provides fully managed, pay-per-token API access to Anthropic, Cohere, Amazon Titan, and others — no GPU management whatsoever.

The key configuration levers are MemorySizeInMB (1024–6144 MB) and MaxConcurrency. The 6 GB memory ceiling is the hard limit that matters most in practice: models larger than roughly 3–4 B parameters at FP16 will not fit. For larger models, the path is SageMaker Endpoints with dedicated GPU instances (ml.g5 or ml.inf2) or Bedrock.

ProvisionedConcurrency is the cold-start dial. Leaving it unset maximises cost savings but exposes you to 10–30 s cold starts. Setting it to 1–2 keeps instances warm at a modest fixed cost. I advise clients to start without provisioned concurrency to establish a baseline, then add it only if latency SLOs are breached.
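The two levers look like this with boto3 — a minimal sketch, in which the endpoint, config, and model names are illustrative assumptions and the live API calls are commented out so the snippet runs without AWS credentials:

```python
# Sketch: deploying to SageMaker Serverless Inference with boto3.
# Endpoint-config and model names below are assumptions, not real resources.

def serverless_config(memory_mb: int = 6144, max_concurrency: int = 10,
                      provisioned: int = 0) -> dict:
    """Build the ServerlessConfig block for create_endpoint_config."""
    if not 1024 <= memory_mb <= 6144:
        raise ValueError("MemorySizeInMB must be between 1024 and 6144")
    cfg = {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}
    if provisioned > 0:  # keep instances warm to avoid 10-30 s cold starts
        cfg["ProvisionedConcurrency"] = provisioned
    return cfg

# import boto3
# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(
#     EndpointConfigName="my-serverless-config",   # assumed name
#     ProductionVariants=[{
#         "VariantName": "AllTraffic",
#         "ModelName": "my-model",                 # assumed, pre-registered model
#         "ServerlessConfig": serverless_config(),
#     }],
# )
# sm.create_endpoint(EndpointName="my-endpoint",
#                    EndpointConfigName="my-serverless-config")

print(serverless_config(4096, 5, provisioned=1))
```

Starting at the defaults (no ProvisionedConcurrency) matches the baseline-first approach above; add the key only once you have latency data.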

When to choose AWS: Your team already operates in the AWS ecosystem, your model fits within 6 GB of memory, and you want the most familiar IAM and observability model. AWS Cost Explorer's per-endpoint spend visibility is also the most mature of the three platforms.

See the full implementation guide: Serverless GPU Inference on AWS SageMaker

GCP: Vertex AI Endpoints

Vertex AI Endpoints offer the tightest integration with the broader Vertex platform — Model Registry, Experiments, Pipelines, and Monitoring all share the same resource model. You define a machine type with an attached GPU accelerator (T4, A100, H100), set autoscaling bounds, and Vertex handles the rest.

The critical setting is min_replica_count. Setting it to 0 gives true scale-to-zero cost behaviour but introduces cold starts that regularly exceed 60 seconds for GPU instances. For any user-facing service, min_replica_count=1 is the correct default — the cost of one idle n1-standard-4 + T4 replica (~$0.95/hr) is almost always cheaper than the engineering effort of debugging SLO breaches.

For memory headroom, Vertex AI is the most flexible of the three: you can attach an A100 (40 GB HBM2) or H100 (80 GB HBM3) to a custom machine type, bypassing the 6 GB ceiling that constrains AWS SageMaker Serverless.
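The deployment shape can be sketched with the google-cloud-aiplatform SDK. Project, region, and model IDs below are assumptions, and the live calls are commented out so the snippet runs offline:

```python
# Sketch: deploying a registered model to a Vertex AI endpoint.
# The helper captures the autoscaling bounds discussed above.

def deploy_kwargs(user_facing: bool = True,
                  accelerator: str = "NVIDIA_TESLA_T4") -> dict:
    """Keyword arguments for Model.deploy(); min_replica_count=1 for user-facing SLOs."""
    return {
        "machine_type": "n1-standard-4",
        "accelerator_type": accelerator,  # swap for A100/H100 for more memory
        "accelerator_count": 1,
        # 0 = true scale-to-zero, but cold starts regularly exceed 60 s
        "min_replica_count": 1 if user_facing else 0,
        "max_replica_count": 3,
    }

# from google.cloud import aiplatform
# aiplatform.init(project="my-project", location="europe-west4")   # assumed
# model = aiplatform.Model("projects/my-project/locations/europe-west4/models/MODEL_ID")  # assumed
# endpoint = model.deploy(**deploy_kwargs(user_facing=True))

print(deploy_kwargs())
```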

When to choose GCP: You are building within the Vertex AI platform and want MLOps tooling (pipelines, model versioning, experiments) in one place. Also the right choice if your model is too large for SageMaker Serverless but you want to stay serverless-adjacent with autoscaling.

See the full implementation guide: Serverless GPU Inference on GCP Vertex AI

Azure: Serverless Fine-Tuning and Inference via AIProjectClient

Azure's serverless GPU story splits cleanly by use case.

For fine-tuning, Azure has the strongest serverless model of the three platforms. You connect to your Azure AI Project via AIProjectClient, get an OpenAI client, upload training data, and submit a fine-tuning job. Azure allocates GPU compute for the duration, the job completes, and you pay only for what ran. There is no endpoint to manage, no replica count to tune, and no cold-start concern. For the fine-tuning use case specifically, this is the cleanest serverless pattern I have encountered across any cloud.
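The flow can be sketched as follows. The endpoint URL, model name, and file path are assumptions, method names may vary across azure-ai-projects releases, and the live calls are commented out so the snippet runs offline:

```python
# Sketch of the Azure serverless fine-tuning flow via AIProjectClient.
import json

def validate_chat_example(line: str) -> bool:
    """Check one JSONL training row has the chat shape Azure OpenAI fine-tuning expects."""
    row = json.loads(line)
    msgs = row.get("messages", [])
    return bool(msgs) and all("role" in m and "content" in m for m in msgs)

# from azure.identity import DefaultAzureCredential
# from azure.ai.projects import AIProjectClient
# project = AIProjectClient(
#     endpoint="https://<your-project>.services.ai.azure.com/api/projects/<name>",  # assumed
#     credential=DefaultAzureCredential(),
# )
# client = project.get_openai_client()          # method name may differ by SDK version
# f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
# job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-4o-mini")  # assumed base model

print(validate_chat_example('{"messages": [{"role": "user", "content": "hi"}]}'))
```

Validating the JSONL locally before upload is cheap insurance: a malformed row surfaces as a failed job only after GPU time has been allocated.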

For real-time inference on custom models, Azure is the weakest of the three. Azure Machine Learning Managed Online Endpoints support GPU instance types and autoscaling, but they do not scale to zero — a minimum of one instance must remain running. This makes them operationally similar to a managed dedicated instance rather than true serverless. The exception is Azure OpenAI endpoints: these are fully managed, scale transparently, and are the right choice if your inference workload is against a supported OpenAI model.

When to choose Azure: You are fine-tuning Azure OpenAI models, your organisation already runs on Azure, or you need the compliance and data-residency guarantees of Azure's European regions. Do not choose Azure for serverless inference of custom models if cold-start-free scale-to-zero is a hard requirement.

See the full implementation guide: Serverless GPU on Azure — Fine-Tuning and Inference with AIProjectClient

NVIDIA: The Foundational Layer

NVIDIA is not a cloud competitor to the three above — it is the substrate all three run on. The GPU SKU your provider uses directly determines available memory, compute throughput, and pricing:

| NVIDIA SKU | GPU memory | Typical cloud mapping | Best for |
| --- | --- | --- | --- |
| T4 | 16 GB | GCP n1 + T4, AWS ml.g4dn | Inference workhorses, cost-optimised |
| A10G | 24 GB | AWS ml.g5 | Inference + light fine-tuning |
| A100 | 40 / 80 GB | GCP a2, Azure NC A100 v4 | Large model inference, training |
| H100 | 80 GB | GCP a3, Azure ND H100 v5 | Frontier model training and inference |

NVIDIA Inference Microservices (NIMs) deserve a mention: these are optimised, pre-built containers for popular models (LLaMA, Mistral, Stable Diffusion) that run identically on-prem or in any compatible cloud GPU. If hardware portability is a requirement — running the same workload on GCP today and on-prem next quarter — NIMs are the right abstraction layer to evaluate.
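The portability claim is concrete: NIMs expose an OpenAI-compatible HTTP API, so the same request works wherever the container runs. A minimal sketch — the host URL and model name are assumptions, and the network call is commented out so the snippet runs offline:

```python
# Sketch: the same OpenAI-style request body targets a NIM container
# identically on-prem or on any cloud GPU VM.

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# import requests
# resp = requests.post(
#     "http://localhost:8000/v1/chat/completions",   # same URL shape wherever the NIM runs
#     json=chat_payload("meta/llama3-8b-instruct",   # assumed NIM model name
#                       "Summarise serverless GPU options."),
# )
# print(resp.json()["choices"][0]["message"]["content"])

print(chat_payload("meta/llama3-8b-instruct", "hello")["model"])
```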

Choosing: A Decision Framework

Start with your workload shape:

  1. Managed foundation model (no custom weights)? → Use Bedrock (AWS), Vertex AI Model Garden (GCP), or Azure OpenAI. No GPU management needed.
  2. Custom model, sporadic traffic, ≤ 6 GB memory? → AWS SageMaker Serverless Inference. Simplest operational model in the AWS ecosystem.
  3. Custom model, variable traffic, need large GPU memory or tight MLOps integration? → GCP Vertex AI with min_replica_count=1.
  4. Fine-tuning an OpenAI model or running against Azure OpenAI? → Azure AIProjectClient. Best serverless fine-tuning experience of the three.
  5. Custom model inference on Azure? → Azure ML Managed Online Endpoints, but budget for a minimum one running instance.
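The decision tree above can be encoded as a small helper — a deliberately simplified sketch (it treats all fine-tuning as the Azure path and ignores traffic shape), not an exhaustive sizing tool:

```python
# The five-branch decision framework above, as a lookup function.

def choose_platform(custom_weights: bool, fine_tuning: bool = False,
                    model_memory_gb: float = 0, azure_first: bool = False) -> str:
    """Return the article's recommendation for a workload description."""
    if not custom_weights and not fine_tuning:
        # Branch 1: managed foundation model, no GPU management
        return "Managed model API: Bedrock / Vertex AI Model Garden / Azure OpenAI"
    if fine_tuning:
        # Branch 4: serverless fine-tuning is Azure's strongest story
        return "Azure AIProjectClient (serverless fine-tuning)"
    if azure_first:
        # Branch 5: no scale-to-zero, budget for one always-on instance
        return "Azure ML Managed Online Endpoints (minimum one running instance)"
    if model_memory_gb <= 6:
        # Branch 2: fits the SageMaker Serverless memory ceiling
        return "AWS SageMaker Serverless Inference"
    # Branch 3: large-memory GPU or tight MLOps integration
    return "GCP Vertex AI with min_replica_count=1"

print(choose_platform(custom_weights=True, model_memory_gb=20))
```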

FinOps First Principle: Run Serverless for 30 Days, Then Decide

Instrument cost alerting before you go to production — AWS Cost Explorer, GCP Budget API, or Azure Cost Management. Set a budget alarm at 150% of your forecast. A serverless GPU endpoint scaling to meet an unexpected traffic spike can generate a month's worth of expected spend in a single afternoon. Capture p99 latency, invocation count, and actual spend over 30 days. Only then evaluate whether a reserved GPU instance is cheaper. In my experience, fewer than 20% of workloads justify the operational overhead of dedicated GPU compute once idle time and engineering cost are fully factored in.
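The 150% rule is simple enough to state in code. A sketch with illustrative numbers — in production this check would sit behind AWS Cost Explorer, the GCP Budget API, or Azure Cost Management rather than a hand-rolled function:

```python
# Sketch: flag when month-to-date spend exceeds 150% of the pro-rated forecast.

def budget_breached(spend_to_date: float, monthly_forecast: float,
                    day_of_month: int, days_in_month: int = 30,
                    threshold: float = 1.5) -> bool:
    """True when actual spend exceeds threshold x the forecast pro-rated to today."""
    expected_so_far = monthly_forecast * day_of_month / days_in_month
    return spend_to_date > threshold * expected_so_far

# Day 10 of 30 on a $1,000/month forecast: expected $333, alarm above $500.
print(budget_breached(620.0, 1000.0, day_of_month=10))  # True
```

Pro-rating matters: a flat monthly threshold would miss the single-afternoon spike scenario described above until weeks of budget were already gone.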

Conclusion

Serverless GPU access has matured enough to be the default starting point for any new AI workload. But "serverless" means something different on each platform, and the choice between them is not primarily a technical one — it is an operational and organisational one.

Choose AWS if your team already lives in the AWS ecosystem and your models fit within the 6 GB memory ceiling. Choose GCP Vertex AI if you need the full MLOps platform or large-GPU headroom without dedicated instances. Choose Azure if you are fine-tuning Azure OpenAI models or operating in an Azure-first organisation.

The three deep-dive articles in this series go step-by-step through deployment on each platform — from Terraform to SDK calls to cost instrumentation. Start with the platform closest to your existing infrastructure, measure for 30 days, and treat FinOps governance as a first-class requirement from day one.
