Self-hosting LLMs vs API: the real cost math for 2026

By Ibra · 16 Jun 2026 · 4 min read

Every few weeks a founder asks us the same question. Token bills are climbing and open models are good now, so should we just host our own and stop paying the API tax? It is a fair question. The answer is usually more interesting than yes or no.

Self-hosting can absolutely be cheaper. It is just rarely cheaper at the volume most teams are actually running, and the savings live behind a wall of operational cost that does not show up in a GPU price list.

Where the breakeven actually sits

The raw GPU rate is the smallest part of the bill. Once you add idle time between requests, the DevOps work to keep a serving stack healthy, and the engineering hours to tune it, self-hosting tends to cost three to five times the bare GPU price.

The numbers are blunt. Self-hosting only starts to win above roughly 11 billion tokens a month, which is on the order of 500 million tokens a day. Below that, APIs usually come out cheaper once everything is counted. At 50 million tokens a day, a small hosted model via API runs around 2,250 dollars a month. The same workload on four mid-tier GPUs comes in closer to 5,175 dollars, more than twice as much, because you pay for the hardware whether or not a request is in flight.
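The 50 million tokens a day comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming a 30-day month. The per-million-token rate is not quoted above; it is backed out from the two figures that are, and the function names are illustrative.

```python
DAYS_PER_MONTH = 30  # assumption: the month length behind the figures is not stated


def monthly_millions_of_tokens(tokens_per_day: float) -> float:
    """Convert a daily token volume into millions of tokens per month."""
    return tokens_per_day * DAYS_PER_MONTH / 1_000_000


def implied_api_rate(monthly_bill_usd: float, tokens_per_day: float) -> float:
    """Back out the USD price per million tokens from a monthly API bill."""
    return monthly_bill_usd / monthly_millions_of_tokens(tokens_per_day)


def api_monthly_cost(tokens_per_day: float, usd_per_million: float) -> float:
    """API spend scales linearly with volume."""
    return monthly_millions_of_tokens(tokens_per_day) * usd_per_million


# Figures quoted above for 50 million tokens a day.
tokens_per_day = 50_000_000
api_bill = 2250.0
gpu_bill = 5175.0  # four mid-tier GPUs, billed whether busy or idle

rate = implied_api_rate(api_bill, tokens_per_day)
ratio = gpu_bill / api_monthly_cost(tokens_per_day, rate)
print(f"implied API rate: ${rate:.2f} per million tokens")
print(f"GPU bill is {ratio:.1f}x the API bill")
```

Under that 30-day assumption the implied API price is 1.50 dollars per million tokens, and the GPU bill is 2.3 times the API bill.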

That last point is the one teams underestimate. An API charges you per token. A GPU charges you per hour, awake or idle, and most workloads are bursty. You end up paying full price for a lot of empty capacity.
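One way to see the idle-capacity effect is to divide the hourly GPU price by the tokens actually served in that hour. The sketch below is illustrative only: the hourly rate and peak throughput are made-up placeholders rather than benchmarks, and `cost_per_million_tokens` is a hypothetical helper.

```python
def cost_per_million_tokens(hourly_rate_usd: float,
                            peak_tokens_per_hour: float,
                            utilization: float) -> float:
    """Effective USD per million tokens for a GPU billed by the hour.

    utilization is the fraction of peak throughput actually used, in (0, 1].
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_served = peak_tokens_per_hour * utilization
    return hourly_rate_usd / tokens_served * 1_000_000


# Placeholder inputs (assumptions, not measurements).
HOURLY_RATE_USD = 2.0
PEAK_TOKENS_PER_HOUR = 4_000_000

for u in (1.0, 0.5, 0.1):
    cost = cost_per_million_tokens(HOURLY_RATE_USD, PEAK_TOKENS_PER_HOUR, u)
    print(f"utilization {u:.0%}: ${cost:.2f} per million tokens")
```

The hourly bill is fixed, so the effective price per token is inversely proportional to utilization: halve the utilization and the cost per token doubles.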

The costs that never make the spreadsheet

The hardware is visible. The rest is not.

A production serving stack needs an owner. Allocating 20 to 30 percent of a senior engineer to keep inference healthy is roughly 3,000 to 6,000 dollars a month in salary, before anything breaks at 2am. If you run your own boxes there is also power and cooling, and a single high-end card can add 65 dollars a month in electricity alone. And there is the opportunity cost of pointing your best engineers at infrastructure plumbing instead of product.
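Putting these line items next to the hardware bill gives a rough fully loaded figure. The sketch below only adds up numbers quoted in this article. Combining them into one total is an assumption made for illustration. Electricity defaults to zero because it only applies if you run your own boxes, and opportunity cost is left out because it has no quoted figure.

```python
def fully_loaded_monthly_cost(hardware_usd: float,
                              engineer_share_usd: float,
                              electricity_per_card_usd: float = 0.0,
                              cards: int = 0) -> float:
    """Sum the visible hardware bill and the less visible monthly costs."""
    return hardware_usd + engineer_share_usd + electricity_per_card_usd * cards


HARDWARE_USD = 5175.0                   # four mid-tier GPUs, quoted earlier
ENGINEER_SHARE_USD = (3000.0, 6000.0)   # 20 to 30 percent of a senior engineer
API_BILL_USD = 2250.0                   # API bill for the same 50M tokens a day

low = fully_loaded_monthly_cost(HARDWARE_USD, ENGINEER_SHARE_USD[0])
high = fully_loaded_monthly_cost(HARDWARE_USD, ENGINEER_SHARE_USD[1])
print(f"fully loaded: ${low:,.0f} to ${high:,.0f} a month")
print(f"versus API:   {low / API_BILL_USD:.1f}x to {high / API_BILL_USD:.1f}x")
```

Passing `electricity_per_card_usd=65.0, cards=4` adds the power line for an on-premises setup.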

The open serving tools (vLLM, Text Generation Inference, llama.cpp) carry no license fee. They also carry no support contract. Someone on your team has to deploy them, tune batching and quantisation, watch for memory leaks, and upgrade without taking the service down. That someone is the real cost.

# self-hosting is not the model, it is everything around it
# (illustrative settings only, not the API of any particular serving library)
serve_config = dict(
    model="open-weights-13b",
    gpus=4,
    batching="continuous",   # someone tunes this
    quantization="int8",     # and this
    autoscale=False,         # you pay for idle either way
)

When self-hosting genuinely wins

There are clear cases where running your own models is the right call, and they are not mainly about cost.

Compliance and data residency is the strongest one. If your data cannot leave your environment, self-hosting keeps every token inside your boundary, and that is worth paying for regardless of the breakeven. Air-gapped and regulated workloads land here.

Extreme, steady volume is the second. If you genuinely push hundreds of millions of tokens a day with a predictable load, you can keep GPUs busy enough to beat API pricing, and the savings become real.

A strong MLOps capability is the third. Self-hosting rewards teams that already know how to run production infrastructure. If keeping a serving stack fast and cheap is not a skill you have in house, the savings evaporate into downtime and firefighting.

Self-hosting is not a way to save money on AI. It is a way to buy control, and control sometimes happens to be cheaper.

The answer is usually hybrid

The teams getting this right rarely pick one side. They route the bulk of traffic to APIs for flexibility and burst capacity, and they self-host the specific workloads where volume, latency, or compliance justify it. Companies that land on a deliberate hybrid setup report 40 to 70 percent savings against a fully API-dependent stack, without betting the whole operation on a GPU fleet.
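The routing idea can be sketched as a small decision function. This is a hypothetical illustration, not a description of any particular gateway: the `InferenceRequest` fields, the workload names, and the backend labels are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceRequest:
    workload: str                 # e.g. "chat" or "nightly-batch"
    regulated_data: bool = False  # True if the data must stay inside your boundary


# Hypothetical: workloads whose steady volume justifies dedicated GPUs.
SELF_HOSTED_WORKLOADS = frozenset({"nightly-batch"})


def choose_backend(request: InferenceRequest) -> str:
    """Return "self_hosted" or "api" for a single request."""
    if request.regulated_data:
        return "self_hosted"      # compliance overrides cost
    if request.workload in SELF_HOSTED_WORKLOADS:
        return "self_hosted"      # steady, high-volume traffic
    return "api"                  # default: flexibility and burst capacity


print(choose_backend(InferenceRequest("chat")))
print(choose_backend(InferenceRequest("chat", regulated_data=True)))
print(choose_backend(InferenceRequest("nightly-batch")))
```

Checking the compliance flag first means a regulated request never reaches an external API, whatever the cost picture says.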

The practical move is to measure before you migrate. Look at your real token volume, your load shape over a week, and your compliance constraints, then model the fully loaded cost of each option rather than comparing a GPU rate to an API rate. The honest comparison almost always changes the decision.
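Measuring the load shape can start from whatever request log you already have. The sketch below assumes a hypothetical log of `(ISO timestamp, token count)` pairs and reports total volume plus a peak-to-mean ratio per hour. A high ratio means bursty traffic, and so more idle capacity if you size GPUs for the peak.

```python
from collections import defaultdict
from datetime import datetime


def summarize_load(events):
    """Summarize (ISO timestamp, token count) pairs into volume and burstiness."""
    per_hour = defaultdict(int)
    for timestamp, tokens in events:
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0)
        per_hour[hour] += tokens
    if not per_hour:
        raise ValueError("no events to summarize")

    # Count every hour in the observed window, including hours with no traffic.
    first, last = min(per_hour), max(per_hour)
    hours_spanned = int((last - first).total_seconds() // 3600) + 1

    total = sum(per_hour.values())
    peak = max(per_hour.values())
    return {
        "total_tokens": total,
        "hours_spanned": hours_spanned,
        "peak_hour_tokens": peak,
        "peak_to_mean": peak / (total / hours_spanned),
    }


# Tiny made-up sample; a real run would cover at least a full week.
sample = [
    ("2026-06-01T09:15:00", 1200),
    ("2026-06-01T09:40:00", 800),
    ("2026-06-01T12:05:00", 2000),
]
print(summarize_load(sample))
```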

This is exactly the kind of analysis Astronic runs with teams before they commit capital to hardware. That can mean hosting your own models where it makes sense and keeping you on APIs where it does not. The goal is the lowest total cost that still meets your latency and compliance needs, not a self-hosting project for its own sake.