AI agents finally work.
They call tools, reason through workflows, and actually complete tasks.
Then the first real API bill arrives.
For many teams, that’s the moment the question appears:
“Should we just run this ourselves?”
The good news is that self-hosting an LLM is no longer a research project or a massive ML infrastructure effort. With the right model, the right GPU, and a few battle-tested tools, you can run a production-grade LLM on a single machine you control.
You’re probably here because one of these happened:
Your OpenAI or Anthropic bill exploded
You can’t send sensitive data outside your VPC
Your agent workflows burn millions of tokens/day
You want custom behavior from your AI, and prompting alone isn’t cutting it
If this is you, perfect. If not, you’re still perfect 🤗
In this article, I’ll walk you through a practical playbook for deploying an LLM on your own infrastructure, including how the models and instance types were evaluated and selected, and the reasoning behind those decisions.
I’ll also provide a zero-switching-cost deployment pattern, so an existing OpenAI or Anthropic codebase can use your self-hosted LLM without rewrites.
By the end of this guide you’ll know:
- Which benchmarks actually matter for LLMs that need to solve and reason through agentic problems, not recite the latest string theory paper
- What it means to quantize and how it affects performance
- Which instance types/GPUs can be used for single-machine hosting¹
- Which models to use²
- How to use a self-hosted LLM without rewriting an existing API-based codebase
- How to make self-hosting cost-effective³
¹ Instance types were evaluated across the “big three”: AWS, Azure, and GCP
² All models are current as of March 2026
³ All pricing data is current as of March 2026
Note: this guide is focused on deploying agent-oriented LLMs — not general-purpose, trillion-parameter, all-encompassing frontier models, which are largely overkill for most agent use cases.
✋Wait…why would I host my own LLM again?
+++ Privacy
This is most likely why you’re here: sensitive data that can never leave your firewall — patient health records, proprietary source code, user data, financial records, RFPs, or internal strategy documents.
Self-hosting removes the dependency on third-party APIs and reduces the risk of a breach, or of failing to retain and log data according to strict privacy policies.
++ Cost Predictability
API pricing scales linearly with usage. Agent workloads sit at the heavy end of the token spectrum, and operating your own GPU infrastructure introduces economies of scale. This matters especially if you plan to run agents across a medium-to-large company (20–30+ agents) or provide agents to customers at any real scale.
+ Performance
Eliminate API round-trips, get predictable tokens-per-second, and add capacity as needed with spot-instance elastic scaling.
+ Customization
Methods like LoRA and QLoRA (not covered in detail here) can be used to fine-tune an LLM’s behavior: adapting its alignment (“abliteration”), tailoring tool usage, adjusting response style, or training on domain-specific data.
This is crucially useful to build custom agents or offer AI services that require specific behavior or style tuned to a use-case rather than generic instruction alignment via prompting.
An aside on finetuning
Methods such as LoRA/QLoRA, model ablation (“abliteration”), realignment techniques, and response stylization are technically complex and outside the scope of this guide. However, self-hosting is often the first step toward exploring deeper customization of LLMs.
Why a single machine?
It’s not a hard requirement; it’s for simplicity. Deploying on a single machine with a single GPU is straightforward. A single machine with multiple GPUs is doable with the right configuration choices.
However, debugging distributed inference across many machines can be nightmarish.
This is your first self-hosted LLM. To simplify the process, we’re going to target a single machine and a single GPU. As your inference needs grow, or if you need more performance, scale up on a single machine. Then as you mature, you can start tackling multi-machine or Kubernetes style deployments.
👉Which Benchmarks Actually Matter?
The LLM benchmark landscape is noisy. There are dozens of leaderboards, and most of them are irrelevant for our use case. We need to prune these benchmarks down to find LLMs that excel at agent-style tasks.
Specifically, we’re looking for LLMs which can:
- Follow complex, multi-step instructions
- Use tools reliably: call functions with well-formed arguments, interpret results, and decide what to do next
- Reason with constraints: reason with potentially incomplete information without hallucinating a confident but wrong answer
- Write and understand code: We don’t need to solve expert level SWE problems, but interacting with APIs and being able to generate code on the fly helps expand the action space and typically translates into better tool usage
Here are the benchmarks to really pay attention to:
| Benchmark | Description | Why? |
|---|---|---|
| Berkeley Function Calling Leaderboard (BFCL v3) | Accuracy of function/tool calling across simple, parallel, nested, and multi-step invocations | Directly tests the capability your agents depend on most: structured tool use. |
| IFEval (Instruction Following Eval) | Strict adherence to formatting, constraint, and structural instructions | Agents need strict adherence to instructions |
| τ-bench (Tau-bench) | E2E agent task completion in simulated environments | Measures real agentic competence: can this LLM actually accomplish a goal over multiple turns? |
| SWE-bench Verified | Ability to resolve real GitHub issues from popular open-source repos | If your agents write or modify code, this is the gold standard. The “Verified” subset filters out ambiguous or poorly-specified issues |
| WebArena / VisualWebArena | Task completion in realistic web environments | Super useful if your agent needs to use a WebUI |
Note: unfortunately, getting reliable benchmark scores on all of these, especially for quantized models, is difficult. You’ll have to use your best judgement: start from the full-precision model’s scores and assume the degradation estimates in the table below.
🤖Quantizing
This is in no way, shape, or form meant to be an exhaustive guide to quantization. My goal is to give you enough information to navigate Hugging Face without coming out cross-eyed.
The basics
A model’s parameters are stored as numbers. At full precision (FP32), each weight is a 32-bit floating point number — 4 bytes. Most modern models are distributed at FP16 or BF16 (half precision, 2 bytes per weight). You will see this as the baseline for each model.
Quantization reduces the number of bits used to represent each weight, shrinking the memory requirement and increasing inference speed, at the cost of some accuracy.
Not all quantization methods are equal. There are some clever methods that retain performance with highly reduced bit precision.
BF16 vs. GPTQ vs. AWQ vs. GGUF
You’ll see these acronyms a lot when model shopping. Here’s what they mean:
- BF16: plain and simple. 2 bytes per parameter. A 70B parameter model will cost you 140GB of VRAM. This is the unquantized baseline.
- GPTQ: stands for “Generative Pre-trained Transformer Quantization”. Quantizes layer by layer using a greedy, error-aware approximation based on the Hessian of each layer’s weights. Largely superseded by AWQ and the methods used in GGUF models (see below)
- AWQ: stands for “Activation-aware Weight Quantization”. Quantizes weights using the magnitude of the activations (per channel) rather than the weight error.
- GGUF: isn’t a quantization method at all; it’s a model container format popularized by llama.cpp, within which you will find some of the following quantization methods:
- K-quants: named by bits-per-weight and method, e.g. Q4_K_M/Q4_K_S.
- I-quants: a newer scheme that preserves more quality at low bitrates (4-bit and below)
Here’s a rough guide as to what quantization does to performance:
| Precision | Bits per weight | VRAM for 70B | Performance |
|---|---|---|---|
| FP16 / BF16 | 16 | ~140 GB | Baseline (100%) |
| Q8 (INT8) | 8 | ~70 GB | ~99–99.5% of FP16 |
| Q5_K_M | 5.5 (mixed) | ~49 GB | ~97–98% |
| Q4_K_M | 4.5 (mixed) | ~42 GB | ~95–97% |
| Q3_K_M | 3.5 (mixed) | ~33 GB | ~90–94% |
| Q2_K | 2.5 (mixed) | ~23 GB | ~80–88% — noticeable degradation |
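The VRAM column follows directly from bits per weight. Here’s a quick back-of-the-envelope estimator for the weights alone (the 10% overhead factor for embeddings and runtime buffers is an assumption, not a law):

```python
def estimate_model_vram_gb(params_b: float, bits_per_weight: float,
                           overhead: float = 1.10) -> float:
    """Rough VRAM estimate for model weights alone (no KV cache).

    params_b: parameter count in billions
    bits_per_weight: e.g. 16 for BF16, 4.5 for Q4_K_M (mixed)
    overhead: headroom for embeddings, buffers, and runtime overhead
    """
    bytes_per_weight = bits_per_weight / 8
    return params_b * bytes_per_weight * overhead

# 70B at BF16, no overhead: matches the ~140 GB baseline in the table
print(round(estimate_model_vram_gb(70, 16, overhead=1.0)))  # 140

# 70B at Q4_K_M (~4.5 bits mixed), with 10% headroom
print(round(estimate_model_vram_gb(70, 4.5)))  # 43
```

Close to the ~42 GB in the table; real quants vary a little because the bit-width is mixed across layers.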
Where quantization really hurts
Not all tasks degrade equally. The things most affected by aggressive quantization (Q3 and below):
- Precise numerical computation: if your agent needs to do exact arithmetic in-weights (as opposed to via tool calls), lower precision hurts
- Rare/specialized knowledge recall: the “long tail” of a model’s knowledge is stored in less-activated weights, which are the first to lose fidelity
- Very long chain-of-thought sequences: small errors compound over extended reasoning chains
- Structured output reliability: at Q3 and below, JSON schema compliance and tool-call formatting start to degrade. This is a killer for agent pipelines
💡Protip: Stick to Q4_K_M and above for agents. Any lower, and long context reasoning and output reliability issues put agent tasks at risk.
🛠️Hardware
Finally, Santa has delivered a capacity-block-free A100 instance with 80GB of VRAM. (Imagined by ChatGPT)
GPUs (Accelerators)
Although more GPU types are available, the landscape across AWS, GCP and Azure can be mostly distilled into the following options, especially for single machine, single GPU deployments:
| GPU | Architecture | VRAM |
|---|---|---|
| H100 | Hopper | 80GB |
| A100 | Ampere | 40GB/80GB |
| L40S | Ada Lovelace | 48GB |
| L4 | Ada Lovelace | 24GB |
| A10/A10G | Ampere | 24GB |
| T4 | Turing | 16GB |
The best tradeoffs between performance and cost exist in the L4, L40S, and A100 range, with the A100 providing the best performance (in terms of model capacity and multi-user agentic workloads). If your agent tasks are simple and require less throughput, it’s safe to downgrade to the L4/A10. Don’t upgrade to the H100 unless you need it.
The 48GB of VRAM provided by the L40S gives us a lot of options for models. We won’t get the throughput of the A100, but we’ll save on hourly cost.
For the sake of simplicity, I’m going to frame the rest of this discussion around this GPU. If you determine that your needs are different (less/more), the decisions I outline below will help you navigate model selection, instance selection and cost optimization.
Note about GPU selection: even though you may have your heart set on an A100, and the finances to buy it, cloud capacity may restrict you to another instance/GPU type unless you’re willing to purchase “Capacity Blocks” [AWS] or “Reservations” [GCP].
Quick decision checkpoint
If you’re deploying your first self-hosted LLM:
| Situation | Recommendation |
|---|---|
| Experimenting | L4 / A10 |
| Production agents | L40S |
| High concurrency | A100 |
Recommended Instance Types
I’ve compiled a non-exhaustive list of instance types across the big three which can help narrow down virtual machine types.
Note: all pricing information was sourced in March 2026.
AWS
AWS offers few single-GPU instance options and is more geared toward large multi-GPU workloads. That said, if you’re willing to purchase reserved capacity blocks, they offer a p5.4xlarge with a single H100. They also have a large fleet of L40S instances, which are prime spot-instance candidates for predictable/scheduled agentic workloads.
| Instance | GPU | VRAM | vCPU | RAM | On-demand $/hr |
|---|---|---|---|---|---|
| g4dn.xlarge | 1x T4 | 16 GB | 4 | 16 GB | ~$0.526 |
| g5.xlarge | 1x A10G | 24 GB | 4 | 16 GB | ~$1.006 |
| g5.2xlarge | 1x A10G | 24 GB | 8 | 32 GB | ~$1.212 |
| g6.xlarge | 1x L4 | 24 GB | 4 | 16 GB | ~$0.805 |
| g6e.xlarge | 1x L40S | 48 GB | 4 | 32 GB | ~$1.861 |
| p5.4xlarge | 1x H100 | 80 GB | 16 | 256 GB | ~$6.88 |
Google Cloud Platform
Unlike AWS, GCP offers single-GPU A100 instances. This makes a2-ultragpu-1g the most cost-effective option for running 70B models on a single machine. You pay only for what you use.
| Instance | GPU | VRAM | On-demand $/hr |
|---|---|---|---|
| g2-standard-4 | 1x L4 | 24 GB | ~$0.72 |
| a2-highgpu-1g | 1x A100 (40GB) | 40 GB | ~$3.67 |
| a2-ultragpu-1g | 1x A100 (80GB) | 80 GB | ~$5.07 |
| a3-highgpu-1g | 1x H100 (80GB) | 80 GB | ~$7.20 |
Azure
Azure has the most limited set of single-GPU instances, so unless you want to go with a smaller model, you’re pretty much locked into the Standard_NC24ads_A100_v4, which gives you an A100 for ~$3.67 per hour.
| Instance | GPU | VRAM | On-demand $/hr | Notes |
|---|---|---|---|---|
| Standard_NC4as_T4_v3 | 1x T4 | 16 GB | ~$0.526 | Dev/test |
| Standard_NV36ads_A10_v5 | 1x A10 | 24 GB | ~$1.80 | A10 (not A10G), slightly different specs |
| Standard_NC24ads_A100_v4 | 1x A100 (80GB) | 80 GB | ~$3.67 | Strong single-GPU option |
‼️Important: Don’t downplay the KV Cache
The key–value (KV) cache is a major factor when sizing VRAM requirements for LLMs.
Remember: LLMs are large transformer based models. A transformer layer computes attention using queries (Q), keys (K), and values (V). During generation, each new token must attend to all previous tokens. Without caching, the model would need to recompute the keys and values for the entire sequence every step.
By caching (storing) the attention keys and values in VRAM, the model doesn’t have to recompute them, making long contexts feasible and taking generation from O(T²) total recomputation down to O(T).
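The effect is easy to see by counting key/value computations (a toy illustration, not a real attention implementation):

```python
def kv_ops(T: int, cached: bool) -> int:
    """Count key/value projection computations needed to generate T tokens.

    Without caching, step t must recompute K/V for all t tokens so far;
    with a KV cache, each step computes K/V only for the one new token.
    """
    if cached:
        return T                                 # O(T) total
    return sum(t for t in range(1, T + 1))       # 1 + 2 + ... + T = O(T^2)

print(kv_ops(1000, cached=False))  # 500500
print(kv_ops(1000, cached=True))   # 1000
```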
Agents must deal with longer contexts. This means that even if the model we select fits within VRAM, we need to also ensure there’s sufficient capacity for the KV cache.
Example: a quantized 32B model might occupy around 20–25 GB of VRAM, but the KV cache for several concurrent requests at an 8K or 16K context can add another 10–20 GB. This is why GPUs with 48 GB or more memory are typically recommended for production inference of mid-size models with longer contexts.
💡Protip: Along with serving models with a Paged KV Cache (discussed below), allocate an additional 30-40% of the model’s VRAM requirements for the KV cache.
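To put numbers on this: KV cache size is 2 (keys + values) × layers × KV heads × head dimension × tokens × bytes per element, per concurrent sequence. A sketch, using an illustrative (hypothetical) 32B-class config with grouped-query attention:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, n_concurrent: int,
                bytes_per_elem: int = 2) -> float:
    """Estimate KV cache VRAM in GiB.

    2x accounts for storing both keys and values; bytes_per_elem=2
    assumes an FP16/BF16 cache.
    """
    per_seq = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return per_seq * n_concurrent / 1024**3

# Hypothetical 32B-class config: 64 layers, 8 KV heads (GQA), head_dim 128,
# with 4 concurrent agents each holding a 16K-token context
print(kv_cache_gb(64, 8, 128, 16384, 4))  # 16.0
```

That lands squarely in the 10–20 GB range mentioned above; models without grouped-query attention (more KV heads) need considerably more.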
💾Models
So now we know:
- the VRAM limits
- the quantization target
- the benchmarks that matter
That narrows the model field from hundreds to just a handful.
From the previous section, we selected the L40S as the GPU, giving us instances at a reasonable price point (especially spot instances on AWS). This caps us at 48GB of VRAM. Factoring in the KV cache limits us to models that fit in ~28GB of VRAM (reserving ~20GB for multiple agents caching long context windows).
With Q4_K_M quantization, this puts us in range of some very capable models.
I’ve included direct links to the models on Hugging Face. You’ll notice that Unsloth is the provider of the quants. Unsloth does very detailed analysis and heavy testing of their quants; as a result, they’ve become a community favorite. But feel free to use any quant provider you prefer.
🥇Top Rank: Qwen3.5-27B
Developed by Alibaba as part of the Qwen3.5 model family.
This 27B model is a dense hybrid transformer architecture optimized for long-context reasoning and agent workflows.
Qwen 3.5 uses a Gated DeltaNet + gated attention hybrid to maintain long context while preserving reasoning ability and minimizing VRAM cost.
The 27B version shares the mechanics of the frontier model and preserves its reasoning, giving it outstanding performance on tool-calling, SWE, and agent benchmarks.
Strange fact: the 27B version performs slightly better than the 32B version.
Link to the Q4_K_M quant
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF?show_file_info=Qwen3.5-27B-Q4_K_M.gguf
🥈Solid Contender: GLM 4.7 Flash
GLM‑4.7‑Flash, from Z.ai, is a 30 billion‑parameter Mixture‑of‑Experts (MoE) language model that activates only a small subset of its parameters per token (~3 B active).
Its architecture supports very long context windows (up to ~128K–200K tokens), enabling extended reasoning over large inputs such as long documents, codebases, or multi‑turn agent workflows.
It comes with turn-based “thinking modes” that support more efficient agent-level reasoning: toggle off for quick tool executions, toggle on for extended reasoning over code or results.
Link to the Q4_K_M quant
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF?show_file_info=GLM-4.7-Flash-Q4_K_M.gguf
👌Worth checking: GPT-OSS-20B
OpenAI’s open-weight models (120B and 20B parameter versions) are still competitive despite being released over a year ago. They consistently outperform Mistral, and the quantized 20B version fits well within our VRAM limit.
It supports configurable reasoning levels (low/medium/high) so you can trade off speed versus depth of reasoning. GPT‑OSS‑20B also exposes its full chain‑of‑thought reasoning, which makes debugging and introspection easier.
It’s a solid choice for agent AI tasks. You won’t get the same performance as OpenAI’s frontier models, but benchmark performance along with a low memory requirement still warrant a test.
Link to the Q4_K_M quant
https://huggingface.co/unsloth/gpt-oss-20b-GGUF
Remember: even if you’re running your own model, you can still use frontier models
This is a smart agentic pattern: if you have a dynamic graph of agent actions, you can switch on an expensive API (Claude 4.6 Opus or GPT 5.4) for complex subgraphs, or for tasks that require frontier-level visual reasoning.
Compress a summary of your entire agent graph using your local LLM to minimize input tokens, and be sure to set the maximum output length when calling the frontier API to minimize costs.
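A minimal sketch of that routing decision. All names and thresholds here are hypothetical; in practice your planner would supply the complexity score:

```python
from dataclasses import dataclass

# Hypothetical endpoint configs: local vLLM vs. a frontier API
LOCAL = {"base_url": "http://localhost:8000/v1", "model": "qwen3.5-27b"}
FRONTIER = {"base_url": "https://api.example.com/v1", "model": "frontier-model"}

@dataclass
class Task:
    description: str
    needs_vision: bool = False
    reasoning_depth: int = 1  # crude complexity score from your planner

def pick_endpoint(task: Task, max_local_depth: int = 3) -> dict:
    """Escalate to the frontier API only when the local model
    is likely to fall short (vision, or deep multi-step reasoning)."""
    if task.needs_vision or task.reasoning_depth > max_local_depth:
        return FRONTIER
    return LOCAL

print(pick_endpoint(Task("call the weather tool"))["model"])                   # qwen3.5-27b
print(pick_endpoint(Task("audit this codebase", reasoning_depth=5))["model"])  # frontier-model
```

Because both endpoints speak the same OpenAI-style API, swapping between them is just a change of base_url and model name.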
🚀Deployment
I’m going to introduce two patterns: the first for evaluating your model outside of production, the second for production use.
Pattern 1: Evaluate with Ollama
Ollama is the docker run of LLM inference. It wraps llama.cpp in a clean CLI and REST API, handles model downloads, and just works. It’s perfect for local dev and evaluation: you can have an OpenAI compatible API running with your model in under 10 minutes.
Setup
```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull qwen3.5:27b
ollama run qwen3.5:27b
```
As mentioned, Ollama exposes an OpenAI-compatible API right out of the box. Hit it at http://localhost:11434/v1:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="qwen3.5:27b",
    messages=[
        {"role": "system", "content": "You are a paranoid android."},
        {"role": "user", "content": "Determine when the singularity will eventually consume us"}
    ]
)
```
You can always build llama.cpp from source directly (with the GPU flags enabled), which also works for evals. Ollama just simplifies it.
Pattern #2: Production with vLLM
vLLM is nice because it automagically handles KV caching via PagedAttention, which allocates the cache in fixed-size blocks, much like OS virtual-memory paging. Naively managing the KV cache leads to VRAM underutilization via fragmentation.
While tempting, don’t use Ollama for production. Use vLLM as it’s much better suited for concurrency and monitoring.
Setup
```shell
# Install vLLM (CUDA required)
pip install vllm

# Serve a model with the OpenAI-compatible API server
vllm serve Qwen/Qwen3.5-27B-GGUF \
  --dtype auto \
  --quantization gguf \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000 \
  --api-key your-secret-key
```
Key configuration flags:
| Flag | What it does | Guidance |
|---|---|---|
| --max-model-len | Maximum sequence length (input + output tokens) | Set this to the max you actually need, not the model’s theoretical max. 32K is a good default; setting it to 128K will reserve an enormous KV cache. |
| --gpu-memory-utilization | Fraction of GPU memory vLLM can use | 0.90 is aggressive but fine for dedicated inference machines. Lower to 0.85 if you see OOM errors. |
| --quantization | Tells vLLM which quantization format to use | Must match the model format you downloaded. |
| --tensor-parallel-size N | Shard model across N GPUs | For single-GPU, omit or set to 1. For multi-GPU on a single machine, set to the number of GPUs. |
Monitoring:
vLLM exposes a /metrics endpoint compatible with Prometheus
```yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
```
Key metrics to watch:
- vllm:num_requests_running: current concurrent requests
- vllm:num_requests_waiting: requests queued (if consistently > 0, you need more capacity)
- vllm:gpu_cache_usage_perc: KV cache utilization (high values = approaching memory limits)
- vllm:avg_generation_throughput_toks_per_s: your actual throughput
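Those metrics translate directly into alert rules. A sketch (the thresholds here are illustrative assumptions; tune them to your workload):

```yaml
# Prometheus alerting rules for vLLM capacity signals
groups:
  - name: vllm-capacity
    rules:
      - alert: VllmRequestsQueueing
        expr: vllm:num_requests_waiting > 0
        for: 5m
        annotations:
          summary: "vLLM has had queued requests for 5 minutes; consider adding capacity"
      - alert: VllmKvCacheNearLimit
        expr: vllm:gpu_cache_usage_perc > 0.9
        for: 5m
        annotations:
          summary: "KV cache above 90%; long-context requests are at risk"
```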
🤩Zero switch costs?
Yep.
You use OpenAI’s API:
vLLM’s server is fully OpenAI-compatible: point your client’s base_url at it and your existing code keeps working.
You must launch vLLM with tool calling explicitly enabled. You also need to specify a parser so vLLM knows how to extract the tool calls from the model’s output (e.g., llama3_json, hermes, mistral).
For Qwen3.5, add the following flags when running vLLM
```shell
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3
```
You use Anthropic’s API:
We need one more, somewhat hacky, step: add a LiteLLM proxy as a “phantom Claude” to handle Anthropic-formatted requests.
LiteLLM will act as a translation layer. It intercepts the Anthropic-formatted requests (e.g., messages API, tool_use blocks) and converts them into the OpenAI format that vLLM expects, then maps the response back so your Anthropic client never knows the difference.
Note: Add this proxy on the machine/container which actually runs your agents and not the LLM host.
Configuration is easy:
```yaml
model_list:
  - model_name: claude-local                # The name your Anthropic client will use
    litellm_params:
      model: openai/qwen3.5-27b             # Tells LiteLLM to use the OpenAI-compatible adapter
      api_base: http://yourvllm-server:8000/v1  # This is where you're serving vLLM
      api_key: sk-1234
```
Run LiteLLM:

```shell
pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000
```
Changes to your source code (example call with Anthropic’s API):

```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:4000",  # Point to LiteLLM Proxy
    api_key="sk-1234"  # Must match your LiteLLM master key
)

response = client.messages.create(
    model="claude-local",  # proxied model
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=[{
        "name": "get_weather",
        "description": "Get current weather",
        "input_schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}}
        }
    }]
)

# LiteLLM translates vLLM's response back into an Anthropic ToolUseBlock
print(response.content[0].name)  # Output: 'get_weather'
```
What if I don’t want to use Qwen?
Going rogue? Fair enough.
Just make sure that the arguments for --tool-call-parser, --reasoning-parser, and --quantization match the model you’re using.
Since you’re using LiteLLM as a gateway for an Anthropic client, be aware that Anthropic’s SDK expects a very specific structure for “thinking” vs. “tool use” blocks. When all else fails, log the raw requests and responses and inspect where the translation breaks.
🤑How much is this going to cost?
A typical production agent system can consume:
200M–500M tokens/month
At API pricing, that often lands between:
$2,000 – $8,000 per month
As mentioned, cost scalability is important. I’ll provide two realistic scenarios, with monthly token estimates taken from real-world production workloads.
Scenario 1: Mid-size team, multi-agent production workload
Setup: Qwen 3.5 72B (Q4_K_M) on a GCP a2-ultragpu-1g (1x A100 80GB)
| Cost component | Monthly cost |
|---|---|
| Instance (on-demand, 24/7) | $5.07/hr × 730 hrs = $3,701 |
| Instance (1-year committed use) | ~$3.25/hr × 730 hrs = $2,373 |
| Instance (3-year committed use) | ~$2.28/hr × 730 hrs = $1,664 |
| Storage (1 TB SSD) | ~$80 |
| Total (1-year committed) | ~$2,453/mo |
Comparable API cost: 20 agents running production workloads, averaging 500K tokens/day:
- 500K × 30 = 15M tokens/month per agent × 20 agents = 300M tokens/month
- At ~$9/M tokens: ~$2,700/mo
Nearly equivalent on cost, but with self-hosting you also get: no rate limits, no data leaving your VPC, sub-20ms first-token latency (vs. 200–500ms API round-trip), and the ability to fine-tune.
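The break-even point generalizes: divide your monthly GPU cost by the API’s per-million-token price. A quick sketch (assumes the instance runs 24/7 and your GPU can actually serve the resulting token volume):

```python
def breakeven_tokens_per_month(instance_hr: float,
                               api_cost_per_m_tokens: float,
                               hours_per_month: int = 730) -> float:
    """Monthly token volume (in millions) at which self-hosting
    matches API spend. Below this volume, the API is cheaper."""
    monthly_gpu_cost = instance_hr * hours_per_month
    return monthly_gpu_cost / api_cost_per_m_tokens

# A100 80GB at the 1-year committed rate (~$3.25/hr) vs ~$9/M API tokens
print(round(breakeven_tokens_per_month(3.25, 9)))  # 264  (~264M tokens/month)
```

At the scenario’s 300M tokens/month, you’re past break-even; at 50M tokens/month, stick with the API.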
Scenario 2: Research team, experimentation and evaluation
Setup: Multiple models on a spot-instance A100, running 10 hours/day on weekdays
| Cost component | Monthly cost |
|---|---|
| Instance (spot, ~10hr/day × 22 days) | ~$2.00/hr × 220 hrs = $440 |
| Storage (2 TB SSD for multiple models) | ~$160 |
| Total | ~$600/mo |
This gives you unlimited experimentation: swap models, test quantization levels, and run evals for the price of a moderately heavy API bill.
Always be optimizing
- Use spot instances and make your agents “reschedulable” or “interruptible”: LangChain provides built-ins for this. That way, if you’re ever evicted, your agent can resume from a checkpoint when the instance restarts. Implement a health check (via AWS Lambda or similar) to restart the instance when it stops.
- If your agents don’t need to run overnight, schedule stops and starts with cron or any other scheduler.
- Consider committed-use/reserved instances. If you’re a startup planning on offering AI based services into the future, this alone can give you considerable cost savings.
- Monitor your vLLM usage metrics. Check for signals of overprovisioning (an empty request queue, low utilization). If you’re only using 30% of your capacity, downgrade.
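If your agents keep business hours, the stop/start schedule can be as simple as two crontab entries. A sketch for GCP (instance name, zone, and times are placeholders; assumes the gcloud CLI is authenticated on the box running cron):

```
# Stop the LLM host at 8pm, Mon-Fri
0 20 * * 1-5  gcloud compute instances stop llm-host --zone=us-central1-a --quiet
# Start it again at 8am, Mon-Fri
0 8 * * 1-5   gcloud compute instances start llm-host --zone=us-central1-a --quiet
```

On AWS, the equivalent is `aws ec2 stop-instances` / `start-instances`, or an EventBridge schedule if you’d rather not run cron yourself.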
✅Wrapping things up
Self-hosting an LLM is no longer a massive engineering effort; it’s a practical, well-understood deployment pattern. The open-weight model ecosystem has matured to the point where models like Qwen 3.5 and GLM-4.7 rival frontier APIs on the tasks that matter most for agents: tool calling, instruction following, code generation, and multi-turn reasoning.
Remember:
- Pick your model based on agentic benchmarks (BFCL, τ-bench, SWE-bench, IFEval), not general leaderboard rankings.
- Quantize to Q4_K_M for the best balance of quality and VRAM efficiency. Don’t go below Q3 for production agents.
- Use vLLM for production inference
- GCP’s single-GPU A100 instances are currently the best value for 70B-class models. For 32B-class models, the L40S, L4, and A10 are capable alternatives.
- The cost crossover from API to self-hosted happens at roughly 40–100M tokens/month depending on the model and instance type. Beyond that, self-hosting is both cheaper and more capable.
- Start simple. Single machine, single GPU, one model, vLLM, systemd. Get it running, validate your agent pipeline E2E, then optimize.
Enjoy!

