AI agents finally work.
They call tools, reason through workflows, and actually complete tasks.
Then the first real API bill arrives.
For many teams, that’s the moment the question appears:
“Should we just run this ourselves?”
The good news is that self-hosting an LLM is no longer a research project or a massive ML infrastructure effort. With the right model, the right GPU, and a few battle-tested tools, you can run a production-grade LLM on a single machine you control.
You’re probably here because one of these happened:
Your OpenAI or Anthropic bill exploded
You can’t send sensitive data outside your VPC
Your agent workflows burn millions of tokens/day
You want custom behavior from your AI, and prompting alone isn’t cutting it
If this is you, perfect. If not, you’re still perfect 🤗
In this article, I’ll walk you through a practical playbook for deploying an LLM on your own infrastructure, including how the models and instance types were evaluated and selected, and the reasoning behind those decisions.
I’ll also provide a zero-switching-cost deployment pattern, so an existing OpenAI or Anthropic codebase can use your self-hosted LLM without rewrites.
By the end of this guide you’ll know:
- Which benchmarks actually matter for LLMs that need to solve and reason through agentic problems, not recite the latest string theory paper
- What it means to quantize and how it affects performance
- Which instance types/GPUs can be used for single-machine hosting¹
- Which models to use²
- How to use a self-hosted LLM without rewriting an existing API-based codebase
- How to make self-hosting cost-effective³
¹ Instance types were evaluated across the “big three”: AWS, Azure, and GCP
² All models are current as of March 2026
³ All pricing data is current as of March 2026
Note: this guide is focused on deploying agent-oriented LLMs — not general-purpose, trillion-parameter, all-encompassing frontier models, which are largely overkill for most agent use cases.
✋Wait…why would I host my own LLM again?
+++ Privacy
This is most likely why you’re here: sensitive data that can never leave your firewall — patient health records, proprietary source code, user data, financial records, RFPs, or internal strategy documents.
Self-hosting removes the dependency on third-party APIs and reduces the risk of a breach, or of failing to retain and log data according to strict privacy policies.
++ Cost Predictability
API pricing scales linearly with usage. Agent workloads sit at the heavy end of the token spectrum, and operating your own GPU infrastructure introduces economies of scale. This matters especially if you plan to run agents across a medium-to-large company (20–30+ agents) or provide agents to customers at any real scale.
+ Performance
Eliminate API round-trips, get predictable tokens-per-second, and add capacity as needed with spot-instance elastic scaling.
+ Customization
Methods like LoRA and QLoRA (not covered in detail here) can be used to fine-tune an LLM’s behavior: adapting its alignment (“abliteration”), tailoring tool usage, adjusting response style, or training on domain-specific data.
This is crucially useful to build custom agents or offer AI services that require specific behavior or style tuned to a use-case rather than generic instruction alignment via prompting.
An aside on finetuning
Methods such as LoRA/QLoRA, model ablation (“abliteration”), realignment techniques, and response stylization are technically complex and outside the scope of this guide. However, self-hosting is often the first step toward exploring deeper customization of LLMs.
Why a single machine?
It’s not a hard requirement; it’s for simplicity. Deploying on a single machine with a single GPU is straightforward. A single machine with multiple GPUs is doable with the right configuration choices.
However, debugging distributed inference across many machines can be nightmarish.
This is your first self-hosted LLM. To simplify the process, we’re going to target a single machine and a single GPU. As your inference needs grow, or if you need more performance, scale up on a single machine. Then as you mature, you can start tackling multi-machine or Kubernetes style deployments.
👉Which Benchmarks Actually Matter?
The LLM benchmark landscape is noisy. There are dozens of leaderboards, and most of them are irrelevant for our use case. We need to prune these benchmarks down to find LLMs that excel at agent-style tasks.
Specifically, we’re looking for LLMs which can:
- Follow complex, multi-step instructions
- Use tools reliably: call functions with well-formed arguments, interpret results, and decide what to do next
- Reason with constraints: reason with potentially incomplete information without hallucinating a confident but wrong answer
- Write and understand code: We don’t need to solve expert level SWE problems, but interacting with APIs and being able to generate code on the fly helps expand the action space and typically translates into better tool usage
Here are the benchmarks to really pay attention to:
| Benchmark | Description | Why? |
|---|---|---|
| Berkeley Function Calling Leaderboard (BFCL v3) | Accuracy of function/tool calling across simple, parallel, nested, and multi-step invocations | Directly tests the capability your agents depend on most: structured tool use. |
| IFEval (Instruction Following Eval) | Strict adherence to formatting, constraint, and structural instructions | Agents need strict adherence to instructions |
| τ-bench (Tau-bench) | E2E agent task completion in simulated environments | Measures real agentic competence: can this LLM actually accomplish a goal over multiple turns? |
| SWE-bench Verified | Ability to resolve real GitHub issues from popular open-source repos | If your agents write or modify code, this is the gold standard. The “Verified” subset filters out ambiguous or poorly-specified issues |
| WebArena / VisualWebArena | Task completion in realistic web environments | Super useful if your agent needs to use a WebUI |
Note: unfortunately, getting reliable benchmark scores on all of these, especially for quantized models, is difficult. You’ll have to use your best judgement: start from the full-precision model’s scores and assume the degradation estimates in the table below.
🤖Quantizing
This is in no way, shape, or form meant to be an exhaustive guide to quantization. My goal is to give you enough information to navigate Hugging Face without coming out cross-eyed.
The basics
A model’s parameters are stored as numbers. At full precision (FP32), each weight is a 32-bit floating point number — 4 bytes. Most modern models are distributed at FP16 or BF16 (half precision, 2 bytes per weight). You will see this as the baseline for each model.
Quantization reduces the number of bits used to represent each weight, shrinking the memory requirement and increasing inference speed, at the cost of some accuracy.
Not all quantization methods are equal. There are some clever methods that retain performance with highly reduced bit precision.
BF16 vs. GPTQ vs. AWQ vs. GGUF
You’ll see these acronyms a lot when model shopping. Here’s what they mean:
- BF16: plain and simple. 2 bytes per parameter. A 70B parameter model will cost you 140GB of VRAM. This is the unquantized baseline.
- GPTQ: stands for “Generative Pre-trained Transformer Quantization”. Quantizes layer by layer using a greedy, error-aware approximation based on the Hessian of each layer’s weights. Largely superseded by AWQ and the methods used in GGUF models (see below)
- AWQ: stands for “Activation-aware Weight Quantization”. Quantizes weights using the magnitude of the activations (per channel) rather than the weight error.
- GGUF: isn’t a quantization method at all; it’s a model container format popularized by llama.cpp, within which you will find some of the following quantization methods:
- K-quants: named by bits-per-weight and method, e.g. Q4_K_M/Q4_K_S.
- I-quants: a newer scheme that preserves more quality at low bitrates (4-bit and below)
Here’s a rough guide as to what quantization does to performance:
| Precision | Bits per weight | VRAM for 70B | Performance |
|---|---|---|---|
| FP16 / BF16 | 16 | ~140 GB | Baseline (100%) |
| Q8 (INT8) | 8 | ~70 GB | ~99–99.5% of FP16 |
| Q5_K_M | 5.5 (mixed) | ~49 GB | ~97–98% |
| Q4_K_M | 4.5 (mixed) | ~42 GB | ~95–97% |
| Q3_K_M | 3.5 (mixed) | ~33 GB | ~90–94% |
| Q2_K | 2.5 (mixed) | ~23 GB | ~80–88% — noticeable degradation |
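The VRAM column follows directly from bits per weight. Here’s a quick back-of-the-envelope estimator for the weights alone (the 10% overhead factor for embeddings and runtime buffers is an assumption, not a law):

```python
def estimate_model_vram_gb(params_b: float, bits_per_weight: float,
                           overhead: float = 1.10) -> float:
    """Rough VRAM estimate for model weights alone (no KV cache).

    params_b: parameter count in billions
    bits_per_weight: e.g. 16 for BF16, 4.5 for Q4_K_M (mixed)
    overhead: headroom for embeddings, buffers, and runtime overhead
    """
    bytes_per_weight = bits_per_weight / 8
    return params_b * bytes_per_weight * overhead

# 70B at BF16, no overhead: matches the ~140 GB baseline in the table
print(round(estimate_model_vram_gb(70, 16, overhead=1.0)))  # 140

# 70B at Q4_K_M (~4.5 bits mixed), with 10% headroom
print(round(estimate_model_vram_gb(70, 4.5)))  # 43
```

Close to the ~42 GB in the table; real quants vary a little because the bit-width is mixed across layers.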
Where quantization really hurts
Not all tasks degrade equally. The things most affected by aggressive quantization (Q3 and below):
- Precise numerical computation: if your agent needs to do exact arithmetic in-weights (as opposed to via tool calls), lower precision hurts
- Rare/specialized knowledge recall: the “long tail” of a model’s knowledge is stored in less-activated weights, which are the first to lose fidelity
- Very long chain-of-thought sequences: small errors compound over extended reasoning chains
- Structured output reliability: at Q3 and below, JSON schema compliance and tool-call formatting start to degrade. This is a killer for agent pipelines
💡Protip: Stick to Q4_K_M and above for agents. Any lower, and long context reasoning and output reliability issues put agent tasks at risk.
🛠️Hardware
Finally, Santa has delivered a capacity-block-free A100 instance with 80GB of VRAM. (Imagined by ChatGPT)
GPUs (Accelerators)
Although more GPU types are available, the landscape across AWS, GCP and Azure can be mostly distilled into the following options, especially for single machine, single GPU deployments:
| GPU | Architecture | VRAM |
|---|---|---|
| H100 | Hopper | 80GB |
| A100 | Ampere | 40GB/80GB |
| L40S | Ada Lovelace | 48GB |
| L4 | Ada Lovelace | 24GB |
| A10/A10G | Ampere | 24GB |
| T4 | Turing | 16GB |
The best tradeoffs between performance and cost exist in the L4, L40S, and A100 range, with the A100 providing the best performance (in terms of model capacity and multi-user agentic workloads). If your agent tasks are simple and require less throughput, it’s safe to downgrade to the L4/A10. Don’t upgrade to the H100 unless you need it.
The 48GB of VRAM provided by the L40S gives us a lot of options for models. We won’t get the throughput of the A100, but we’ll save on hourly cost.
For the sake of simplicity, I’m going to frame the rest of this discussion around this GPU. If you determine that your needs are different (less/more), the decisions I outline below will help you navigate model selection, instance selection and cost optimization.
Note about GPU selection: even though you may have your heart set on an A100, and the finances to buy it, cloud capacity may restrict you to another instance/GPU type unless you’re willing to purchase “Capacity Blocks” [AWS] or “Reservations” [GCP].
Quick decision checkpoint
If you’re deploying your first self-hosted LLM:
| Situation | Recommendation |
|---|---|
| Experimenting | L4 / A10 |
| Production agents | L40S |
| High concurrency | A100 |
Recommended Instance Types
I’ve compiled a non-exhaustive list of instance types across the big three which can help narrow down virtual machine types.
Note: all pricing information was sourced in March 2026.
AWS
AWS offers few single-GPU instance options and is more geared toward large multi-GPU workloads. That said, if you’re willing to purchase reserved capacity blocks, they offer a p5.4xlarge with a single H100. They also have a large fleet of L40S instances, which are prime spot-instance candidates for predictable/scheduled agentic workloads.
| Instance | GPU | VRAM | vCPU | RAM | On-demand $/hr |
|---|---|---|---|---|---|
| g4dn.xlarge | 1x T4 | 16 GB | 4 | 16 GB | ~$0.526 |
| g5.xlarge | 1x A10G | 24 GB | 4 | 16 GB | ~$1.006 |
| g5.2xlarge | 1x A10G | 24 GB | 8 | 32 GB | ~$1.212 |
| g6.xlarge | 1x L4 | 24 GB | 4 | 16 GB | ~$0.805 |
| g6e.xlarge | 1x L40S | 48 GB | 4 | 32 GB | ~$1.861 |
| p5.4xlarge | 1x H100 | 80 GB | 16 | 256 GB | ~$6.88 |
Google Cloud Platform
Unlike AWS, GCP offers single-GPU A100 instances. This makes a2-ultragpu-1g the most cost-effective option for running 70B models on a single machine. You pay only for what you use.
| Instance | GPU | VRAM | On-demand $/hr |
|---|---|---|---|
| g2-standard-4 | 1x L4 | 24 GB | ~$0.72 |
| a2-highgpu-1g | 1x A100 (40GB) | 40 GB | ~$3.67 |
| a2-ultragpu-1g | 1x A100 (80GB) | 80 GB | ~$5.07 |
| a3-highgpu-1g | 1x H100 (80GB) | 80 GB | ~$7.20 |
Azure
Azure has the most limited set of single-GPU instances, so unless you want to go with a smaller model, you’re pretty much locked into the Standard_NC24ads_A100_v4, which gives you an A100 for ~$3.67 per hour.
| Instance | GPU | VRAM | On-demand $/hr | Notes |
|---|---|---|---|---|
| Standard_NC4as_T4_v3 | 1x T4 | 16 GB | ~$0.526 | Dev/test |
| Standard_NV36ads_A10_v5 | 1x A10 | 24 GB | ~$1.80 | A10 (not A10G), slightly different specs |
| Standard_NC24ads_A100_v4 | 1x A100 (80GB) | 80 GB | ~$3.67 | Strong single-GPU option |
‼️Important: Don’t downplay the KV Cache
The key–value (KV) cache is a major factor when sizing VRAM requirements for LLMs.
Remember: LLMs are large transformer based models. A transformer layer computes attention using queries (Q), keys (K), and values (V). During generation, each new token must attend to all previous tokens. Without caching, the model would need to recompute the keys and values for the entire sequence every step.
By caching (storing) the attention keys and values in VRAM, the model doesn’t have to recompute them, making long contexts feasible and taking generation from O(T²) total recomputation down to O(T).
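The effect is easy to see by counting key/value computations (a toy illustration, not a real attention implementation):

```python
def kv_ops(T: int, cached: bool) -> int:
    """Count key/value projection computations needed to generate T tokens.

    Without caching, step t must recompute K/V for all t tokens so far;
    with a KV cache, each step computes K/V only for the one new token.
    """
    if cached:
        return T                                 # O(T) total
    return sum(t for t in range(1, T + 1))       # 1 + 2 + ... + T = O(T^2)

print(kv_ops(1000, cached=False))  # 500500
print(kv_ops(1000, cached=True))   # 1000
```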
Agents must deal with longer contexts. This means that even if the model we select fits within VRAM, we need to also ensure there’s sufficient capacity for the KV cache.
Example: a quantized 32B model might occupy around 20–25 GB of VRAM, but the KV cache for several concurrent requests at an 8K or 16K context can add another 10–20 GB. This is why GPUs with 48 GB or more memory are typically recommended for production inference of mid-size models with longer contexts.
💡Protip: Along with serving models with a Paged KV Cache (discussed below), allocate an additional 30-40% of the model’s VRAM requirements for the KV cache.
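To put numbers on this: KV cache size is 2 (keys + values) × layers × KV heads × head dimension × tokens × bytes per element, per concurrent sequence. A sketch, using an illustrative (hypothetical) 32B-class config with grouped-query attention:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, n_concurrent: int,
                bytes_per_elem: int = 2) -> float:
    """Estimate KV cache VRAM in GiB.

    2x accounts for storing both keys and values; bytes_per_elem=2
    assumes an FP16/BF16 cache.
    """
    per_seq = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return per_seq * n_concurrent / 1024**3

# Hypothetical 32B-class config: 64 layers, 8 KV heads (GQA), head_dim 128,
# with 4 concurrent agents each holding a 16K-token context
print(kv_cache_gb(64, 8, 128, 16384, 4))  # 16.0
```

That lands squarely in the 10–20 GB range mentioned above; models without grouped-query attention (more KV heads) need considerably more.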
💾Models
So now we know:
- the VRAM limits
- the quantization target
- the benchmarks that matter
That narrows the model field from hundreds to just a handful.
From the previous section, we selected the L40S as the GPU, giving us instances at a reasonable price point (especially spot instances on AWS). This caps us at 48GB of VRAM. Factoring in the KV cache limits us to models that fit in ~28GB of VRAM (reserving ~20GB for multiple agents caching long context windows).
With Q4_K_M quantization, this puts us in range of some very capable models.
I’ve included direct links to the models on Hugging Face. You’ll notice that Unsloth is the provider of the quants. Unsloth does very detailed analysis and heavy testing of their quants; as a result, they’ve become a community favorite. But feel free to use any quant provider you prefer.
🥇Top Rank: Qwen3.5-27B
Developed by Alibaba as part of the Qwen3.5 model family.
This 27B model is a dense hybrid transformer architecture optimized for long-context reasoning and agent workflows.
Qwen 3.5 uses a Gated DeltaNet + gated attention hybrid to maintain long context while preserving reasoning ability and minimizing VRAM cost.
The 27B version shares the mechanics of the frontier model and preserves its reasoning, giving it outstanding performance on tool-calling, SWE, and agent benchmarks.
Strange fact: the 27B version performs slightly better than the 32B version.
Link to the Q4_K_M quant
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF?show_file_info=Qwen3.5-27B-Q4_K_M.gguf
🥈Solid Contender: GLM 4.7 Flash
GLM‑4.7‑Flash, from Z.ai, is a 30 billion‑parameter Mixture‑of‑Experts (MoE) language model that activates only a small subset of its parameters per token (~3 B active).
Its architecture supports very long context windows (up to ~128K–200K tokens), enabling extended reasoning over large inputs such as long documents, codebases, or multi‑turn agent workflows.
It comes with turn-based “thinking modes” that support more efficient agent-level reasoning: toggle off for quick tool executions, toggle on for extended reasoning over code or results.
Link to the Q4_K_M quant
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF?show_file_info=GLM-4.7-Flash-Q4_K_M.gguf
👌Worth checking: GPT-OSS-20B
OpenAI’s open-weight models (120B and 20B parameter versions) are still competitive despite being released over a year ago. They consistently outperform Mistral, and the quantized 20B version fits well within our VRAM limit.
It supports configurable reasoning levels (low/medium/high) so you can trade off speed versus depth of reasoning. GPT‑OSS‑20B also exposes its full chain‑of‑thought reasoning, which makes debugging and introspection easier.
It’s a solid choice for agent AI tasks. You won’t get the same performance as OpenAI’s frontier models, but benchmark performance along with a low memory requirement still warrant a test.
Link to the Q4_K_M quant
https://huggingface.co/unsloth/gpt-oss-20b-GGUF
Remember: even if you’re running your own model, you can still use frontier models
This is a smart agentic pattern: if you have a dynamic graph of agent actions, you can switch on an expensive API (Claude 4.6 Opus or GPT 5.4) for complex subgraphs, or for tasks that require frontier-level visual reasoning.
Compress a summary of your entire agent graph using your local LLM to minimize input tokens, and be sure to set the maximum output length when calling the frontier API to minimize costs.
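A minimal sketch of that routing decision. All names and thresholds here are hypothetical; in practice your planner would supply the complexity score:

```python
from dataclasses import dataclass

# Hypothetical endpoint configs: local vLLM vs. a frontier API
LOCAL = {"base_url": "http://localhost:8000/v1", "model": "qwen3.5-27b"}
FRONTIER = {"base_url": "https://api.example.com/v1", "model": "frontier-model"}

@dataclass
class Task:
    description: str
    needs_vision: bool = False
    reasoning_depth: int = 1  # crude complexity score from your planner

def pick_endpoint(task: Task, max_local_depth: int = 3) -> dict:
    """Escalate to the frontier API only when the local model
    is likely to fall short (vision, or deep multi-step reasoning)."""
    if task.needs_vision or task.reasoning_depth > max_local_depth:
        return FRONTIER
    return LOCAL

print(pick_endpoint(Task("call the weather tool"))["model"])                   # qwen3.5-27b
print(pick_endpoint(Task("audit this codebase", reasoning_depth=5))["model"])  # frontier-model
```

Because both endpoints speak the same OpenAI-style API, swapping between them is just a change of base_url and model name.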
🚀Deployment
I’m going to introduce two patterns: the first for evaluating your model outside of production, the second for production use.
Pattern 1: Evaluate with Ollama
Ollama is the docker run of LLM inference. It wraps llama.cpp in a clean CLI and REST API, handles model downloads, and just works. It’s perfect for local dev and evaluation: you can have an OpenAI compatible API running with your model in under 10 minutes.
Setup
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull qwen3.5:27b
ollama run qwen3.5:27b
```
As mentioned, Ollama exposes an OpenAI-compatible API right out of the box. Hit it at http://localhost:11434/v1:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="qwen3.5:27b",
    messages=[
        {"role": "system", "content": "You are a paranoid android."},
        {"role": "user", "content": "Determine when the singularity will eventually consume us"}
    ]
)
```
You can always build llama.cpp from source directly (with the GPU flags enabled), which also works for evals. Ollama just simplifies it.
Pattern #2: Production with vLLM
vLLM is nice because it automagically handles KV caching via PagedAttention, which allocates the cache in fixed-size blocks, much like OS virtual-memory paging. Naively managing the KV cache leads to VRAM underutilization via fragmentation.
While tempting, don’t use Ollama for production. Use vLLM as it’s much better suited for concurrency and monitoring.
Setup
```shell
# Install vLLM (CUDA required)
pip install vllm

# Serve a model with the OpenAI-compatible API server
vllm serve Qwen/Qwen3.5-27B-GGUF \
  --dtype auto \
  --quantization gguf \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000 \
  --api-key your-secret-key
```
Key configuration flags:
| Flag | What it does | Guidance |
|---|---|---|
| --max-model-len | Maximum sequence length (input + output tokens) | Set this to the max you actually need, not the model’s theoretical max. 32K is a good default; setting it to 128K will reserve an enormous KV cache. |
| --gpu-memory-utilization | Fraction of GPU memory vLLM can use | 0.90 is aggressive but fine for dedicated inference machines. Lower to 0.85 if you see OOM errors. |
| --quantization | Tells vLLM which quantization format to use | Must match the model format you downloaded. |
| --tensor-parallel-size N | Shard model across N GPUs | For single-GPU, omit or set to 1. For multi-GPU on a single machine, set to the number of GPUs. |
Monitoring:
vLLM exposes a /metrics endpoint compatible with Prometheus
```yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
```
Key metrics to watch:
- vllm:num_requests_running: current concurrent requests
- vllm:num_requests_waiting: requests queued (if consistently > 0, you need more capacity)
- vllm:gpu_cache_usage_perc: KV cache utilization (high values = approaching memory limits)
- vllm:avg_generation_throughput_toks_per_s: your actual throughput
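Those metrics translate directly into alert rules. A sketch (the thresholds here are illustrative assumptions; tune them to your workload):

```yaml
# Prometheus alerting rules for vLLM capacity signals
groups:
  - name: vllm-capacity
    rules:
      - alert: VllmRequestsQueueing
        expr: vllm:num_requests_waiting > 0
        for: 5m
        annotations:
          summary: "vLLM has had queued requests for 5 minutes; consider adding capacity"
      - alert: VllmKvCacheNearLimit
        expr: vllm:gpu_cache_usage_perc > 0.9
        for: 5m
        annotations:
          summary: "KV cache above 90%; long-context requests are at risk"
```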
🤩Zero switch costs?
Yep.
You use OpenAI’s API:
vLLM’s server is fully OpenAI-compatible: point your client’s base_url at it and your existing code keeps working.
You must launch vLLM with tool calling explicitly enabled. You also need to specify a parser so vLLM knows how to extract the tool calls from the model’s output (e.g., llama3_json, hermes, mistral).
For Qwen3.5, add the following flags when running vLLM
```shell
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3
```
You use Anthropic’s API:
We need one more, somewhat hacky, step: add a LiteLLM proxy as a “phantom Claude” to handle Anthropic-formatted requests.
LiteLLM will act as a translation layer. It intercepts the Anthropic-formatted requests (e.g., messages API, tool_use blocks) and converts them into the OpenAI format that vLLM expects, then maps the response back so your Anthropic client never knows the difference.
Note: Add this proxy on the machine/container which actually runs your agents and not the LLM host.
Configuration is easy:
```yaml
model_list:
  - model_name: claude-local                # The name your Anthropic client will use
    litellm_params:
      model: openai/qwen3.5-27b             # Tells LiteLLM to use the OpenAI-compatible adapter
      api_base: http://yourvllm-server:8000/v1  # This is where you're serving vLLM
      api_key: sk-1234
```
Run LiteLLM:

```shell
pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000
```
Changes to your source code (example call with Anthropic’s API):

```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:4000",  # Point to LiteLLM Proxy
    api_key="sk-1234"  # Must match your LiteLLM master key
)

response = client.messages.create(
    model="claude-local",  # proxied model
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=[{
        "name": "get_weather",
        "description": "Get current weather",
        "input_schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}}
        }
    }]
)

# LiteLLM translates vLLM's response back into an Anthropic ToolUseBlock
print(response.content[0].name)  # Output: 'get_weather'
```
What if I don’t want to use Qwen?
Going rogue? Fair enough.
Just make sure that the arguments for --tool-call-parser, --reasoning-parser, and --quantization match the model you’re using.
Since you’re using LiteLLM as a gateway for an Anthropic client, be aware that Anthropic’s SDK expects a very specific structure for “thinking” vs. “tool use” blocks. When all else fails, log the raw requests and responses and inspect where the translation breaks.
🤑How much is this going to cost?
A typical production agent system can consume:
200M–500M tokens/month
At API pricing, that often lands between:
$2,000 – $8,000 per month
As mentioned, cost scalability is important. I’ll provide two realistic scenarios, with monthly token estimates taken from real-world production workloads.
Scenario 1: Mid-size team, multi-agent production workload
Setup: Qwen 3.5 72B (Q4_K_M) on a GCP a2-ultragpu-1g (1x A100 80GB)
| Cost component | Monthly cost |
|---|---|
| Instance (on-demand, 24/7) | $5.07/hr × 730 hrs = $3,701 |
| Instance (1-year committed use) | ~$3.25/hr × 730 hrs = $2,373 |
| Instance (3-year committed use) | ~$2.28/hr × 730 hrs = $1,664 |
| Storage (1 TB SSD) | ~$80 |
| Total (1-year committed) | ~$2,453/mo |
Comparable API cost: 20 agents running production workloads, averaging 500K tokens/day:
- 500K × 30 = 15M tokens/month per agent × 20 agents = 300M tokens/month
- At ~$9/M tokens: ~$2,700/mo
Nearly equivalent on cost, but with self-hosting you also get: no rate limits, no data leaving your VPC, sub-20ms first-token latency (vs. 200–500ms API round-trip), and the ability to fine-tune.
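The break-even point generalizes: divide your monthly GPU cost by the API’s per-million-token price. A quick sketch (assumes the instance runs 24/7 and your GPU can actually serve the resulting token volume):

```python
def breakeven_tokens_per_month(instance_hr: float,
                               api_cost_per_m_tokens: float,
                               hours_per_month: int = 730) -> float:
    """Monthly token volume (in millions) at which self-hosting
    matches API spend. Below this volume, the API is cheaper."""
    monthly_gpu_cost = instance_hr * hours_per_month
    return monthly_gpu_cost / api_cost_per_m_tokens

# A100 80GB at the 1-year committed rate (~$3.25/hr) vs ~$9/M API tokens
print(round(breakeven_tokens_per_month(3.25, 9)))  # 264  (~264M tokens/month)
```

At the scenario’s 300M tokens/month, you’re past break-even; at 50M tokens/month, stick with the API.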
Scenario 2: Research team, experimentation and evaluation
Setup: Multiple models on a spot-instance A100, running 10 hours/day on weekdays
| Cost component | Monthly cost |
|---|---|
| Instance (spot, ~10hr/day × 22 days) | ~$2.00/hr × 220 hrs = $440 |
| Storage (2 TB SSD for multiple models) | ~$160 |
| Total | ~$600/mo |
This gives you unlimited experimentation: swap models, test quantization levels, and run evals for the price of a moderately heavy API bill.
Always be optimizing
- Use spot instances and make your agents “reschedulable” or “interruptible”: LangChain provides built-ins for this. That way, if you’re ever evicted, your agent can resume from a checkpoint when the instance restarts. Implement a health check (via AWS Lambda or similar) to restart the instance when it stops.
- If your agents don’t need to run overnight, schedule stops and starts with cron or any other scheduler.
- Consider committed-use/reserved instances. If you’re a startup planning on offering AI based services into the future, this alone can give you considerable cost savings.
- Monitor your vLLM usage metrics. Check for signals of overprovisioning (an empty request queue, low utilization). If you’re only using 30% of your capacity, downgrade.
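If your agents keep business hours, the stop/start schedule can be as simple as two crontab entries. A sketch for GCP (instance name, zone, and times are placeholders; assumes the gcloud CLI is authenticated on the box running cron):

```
# Stop the LLM host at 8pm, Mon-Fri
0 20 * * 1-5  gcloud compute instances stop llm-host --zone=us-central1-a --quiet
# Start it again at 8am, Mon-Fri
0 8 * * 1-5   gcloud compute instances start llm-host --zone=us-central1-a --quiet
```

On AWS, the equivalent is `aws ec2 stop-instances` / `start-instances`, or an EventBridge schedule if you’d rather not run cron yourself.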
✅Wrapping things up
Self-hosting an LLM is no longer a massive engineering effort; it’s a practical, well-understood deployment pattern. The open-weight model ecosystem has matured to the point where models like Qwen 3.5 and GLM-4.7 rival frontier APIs on the tasks that matter most for agents: tool calling, instruction following, code generation, and multi-turn reasoning.
Remember:
- Pick your model based on agentic benchmarks (BFCL, τ-bench, SWE-bench, IFEval), not general leaderboard rankings.
- Quantize to Q4_K_M for the best balance of quality and VRAM efficiency. Don’t go below Q3 for production agents.
- Use vLLM for production inference
- GCP’s single-GPU A100 instances are currently the best value for 70B-class models. For 32B-class models, the L40S, L4, and A10 are capable alternatives.
- The cost crossover from API to self-hosted happens at roughly 40–100M tokens/month depending on the model and instance type. Beyond that, self-hosting is both cheaper and more capable.
- Start simple. Single machine, single GPU, one model, vLLM, systemd. Get it running, validate your agent pipeline E2E, then optimize.
Enjoy!

