AI Cost Optimization: Model Routing, Caching, Prompt Compression, and Architecture Patterns
LLM API costs can spiral from hundreds to tens of thousands of dollars per month as usage grows. The difference between a profitable AI feature and a money pit often comes down to architecture. This guide covers practical strategies to reduce LLM costs without sacrificing quality.
Understanding Token Economics#
Every LLM API call is billed by tokens. Understanding the cost structure is the foundation of optimization.
Input vs Output Tokens#
Most providers charge differently for input and output tokens. Output tokens are typically 2-4x more expensive because they require autoregressive generation. This means:
- Long system prompts are expensive at scale but less so than long outputs
- Asking the model to be concise saves real money
- Structured output (JSON) is often cheaper than verbose natural language
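To see how this plays out, a quick back-of-the-envelope calculation helps. The sketch below uses the approximate GPT-4o rates from the comparison table that follows; the request shape (a 2,000-token prompt with a 500-token answer) is illustrative.

```python
# Estimate per-request and monthly cost from token counts.
# Prices are approximate GPT-4o rates in dollars per million tokens.
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# A 2,000-token prompt with a 500-token answer costs $0.01 total:
# the 500 output tokens cost as much as the 2,000 input tokens.
cost = request_cost(2_000, 500)
monthly = cost * 100_000  # at 100k requests/month
```

Note how the output half dominates per token: trimming 100 tokens of output saves as much as trimming 400 tokens of prompt.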
Cost Comparison (approximate, per million tokens)#
| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |
| Llama 3 70B (self-hosted) | ~$0.50 | ~$0.50 |
These prices change frequently. The key insight is that the cost difference between frontier and smaller models is 10-40x.
Model Routing: Cheap Model First#
The single most impactful cost optimization is sending easy requests to cheap models and only escalating to expensive models when needed.
Classification Router#
Use a lightweight classifier (or the cheap model itself) to assess request difficulty:
```python
async def route_request(request):
    complexity = classify_complexity(request)
    if complexity == "simple":
        return await call_model("gpt-4o-mini", request)
    elif complexity == "medium":
        return await call_model("claude-haiku", request)
    else:
        return await call_model("gpt-4o", request)
```
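The router above assumes a `classify_complexity` helper. A learned classifier works best, but even a crude heuristic can be a useful starting point. The sketch below is exactly that; the keyword cues and word-count thresholds are illustrative, not tuned.

```python
def classify_complexity(request: str) -> str:
    """Cheap heuristic stand-in for a learned difficulty classifier.
    Cues and thresholds are illustrative; tune against real traffic."""
    hard_cues = ("prove", "analyze", "step by step", "compare", "refactor")
    text = request.lower()
    # Long requests or analytical verbs suggest a harder task.
    if any(cue in text for cue in hard_cues) or len(text.split()) > 200:
        return "complex"
    if len(text.split()) > 50:
        return "medium"
    return "simple"
```

In production you would replace this with a small fine-tuned classifier or an embedding-based model, but the heuristic version lets you ship routing on day one and measure the savings.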
Cascade Router#
Start with the cheapest model. If the response fails quality checks, escalate:
```python
async def cascade_request(request):
    response = await call_model("gpt-4o-mini", request)
    if passes_quality_check(response):
        return response
    # Quality check failed: escalate to the stronger model.
    return await call_model("gpt-4o", request)
```
The cascade pattern trades latency for cost savings. In practice, 70-80% of requests can be handled by the cheap model, saving 60-70% on API costs.
Quality Checks for Routing#
- Format validation: Does the output match the expected schema?
- Confidence scoring: Does the model express uncertainty?
- Length heuristics: Is the response suspiciously short or long?
- Semantic checks: Does the response address the actual question?
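The first three checks above can be combined into a single gate. The sketch below is a minimal version; the length thresholds and hedge phrases are illustrative, and a real semantic check would need a model call of its own.

```python
import json

def passes_quality_check(response: str,
                         min_len: int = 20,
                         max_len: int = 4000,
                         require_json: bool = False) -> bool:
    """Gate a cheap-model response before accepting it.
    All thresholds are illustrative; tune them per application."""
    # Length heuristic: suspiciously short or long answers get escalated.
    if not (min_len <= len(response) <= max_len):
        return False
    # Format validation: if we asked for JSON, it must parse.
    if require_json:
        try:
            json.loads(response)
        except json.JSONDecodeError:
            return False
    # Confidence scoring: explicit hedging suggests the model is unsure.
    hedges = ("i'm not sure", "i cannot", "as an ai")
    if any(h in response.lower() for h in hedges):
        return False
    return True
```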
Caching LLM Responses#
Many LLM applications receive similar or identical queries. Caching avoids paying for the same answer twice.
Exact Match Caching#
Hash the full prompt (system + user message) and cache the response. Simple and effective for applications with repetitive queries like customer support bots.
```python
import hashlib
import json

import redis.asyncio as redis_lib

redis = redis_lib.Redis()

def get_cache_key(messages):
    # Canonical JSON so the same messages always hash identically.
    content = json.dumps(messages, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

async def cached_completion(messages):
    key = get_cache_key(messages)
    cached = await redis.get(key)
    if cached:
        return json.loads(cached)
    response = await call_model(messages)
    await redis.setex(key, 3600, json.dumps(response))  # 1-hour TTL
    return response
```
Semantic Caching#
Use embeddings to find semantically similar past queries. If a new query is close enough to a cached query, return the cached response. This catches paraphrases and minor variations.
- Embed the incoming query
- Search a vector store for similar past queries
- If similarity exceeds a threshold (e.g., 0.95), return the cached response
- Otherwise, call the model and cache the new pair
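The loop above can be sketched in-process with cosine similarity over raw embedding vectors. This is a toy: a plain list stands in for the vector store, the caller supplies embeddings, and the 0.95 threshold matches the example above.

```python
from math import sqrt

SIMILARITY_THRESHOLD = 0.95  # conservative, per the guidance below
cache = []  # list of (embedding, response); use a real vector store in production

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def semantic_lookup(query_embedding):
    """Return a cached response if a past query is close enough, else None."""
    best = max(cache, key=lambda e: cosine(e[0], query_embedding), default=None)
    if best and cosine(best[0], query_embedding) >= SIMILARITY_THRESHOLD:
        return best[1]
    return None

def semantic_store(query_embedding, response):
    cache.append((query_embedding, response))
```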
Semantic caching has a higher hit rate than exact match but introduces a small risk of returning an imprecise answer. Set the similarity threshold conservatively.
Cache Invalidation#
- Set TTLs based on how quickly your data changes
- Invalidate caches when system prompts or model versions change
- Use versioned cache keys that include the model name and prompt version
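The versioned-key idea can be folded directly into the hashing function. The sketch below extends the exact-match key from earlier; bumping `prompt_version` on deploy makes every old entry an automatic cache miss, so no explicit invalidation pass is needed.

```python
import hashlib
import json

def versioned_cache_key(messages, model: str, prompt_version: str) -> str:
    """Include model and prompt version in the key so a deploy
    naturally invalidates stale entries (they simply stop matching)."""
    payload = json.dumps(
        {"model": model, "prompt_version": prompt_version, "messages": messages},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```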
Prompt Compression#
Long prompts cost more. Reducing prompt length without losing information directly reduces costs.
Techniques#
Remove redundancy: System prompts often contain repeated instructions. Audit and consolidate.
Abbreviate examples: In few-shot prompts, use the minimum number of examples that maintain quality. Often 2-3 examples work as well as 10.
Compress context: When including retrieved documents in RAG, summarize or extract relevant sections instead of dumping full documents.
Use structured references: Instead of repeating instructions for each item in a list, define the pattern once and apply it.
Token-aware truncation: Truncate context to fit within a budget, prioritizing the most relevant content.
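Token-aware truncation is the easiest of these to sketch. The version below assumes chunks arrive pre-sorted by relevance and uses a word-count tokenizer as a stand-in for a real one such as tiktoken.

```python
def truncate_to_budget(chunks, token_budget, count_tokens=lambda s: len(s.split())):
    """Keep the most relevant chunks (assumed pre-sorted by relevance)
    until the token budget is exhausted. The word-count tokenizer is a
    stand-in; swap in a real tokenizer for accurate budgets."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break  # stop at the first chunk that would blow the budget
        kept.append(chunk)
        used += cost
    return kept
```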
Prompt Compression Tools#
- LLMLingua: Microsoft's prompt compression library that can reduce prompt length by 2-5x with minimal quality loss
- Selective context: Include only the top-k most relevant retrieved chunks instead of all chunks above a threshold
Measuring Compression Impact#
Always A/B test compressed prompts against originals. Measure both cost savings and quality metrics. A 50% cost reduction is worthless if quality drops below acceptable thresholds.
Batch vs Real-Time Processing#
Not every LLM call needs to happen in real time. Batch processing unlocks significant savings.
When to Batch#
- Content moderation that can tolerate minutes of delay
- Email summarization and classification
- Data enrichment and extraction pipelines
- Nightly report generation
- Embedding generation for search indices
Batch API Pricing#
OpenAI's Batch API offers a 50% cost reduction for requests that can tolerate a completion window of up to 24 hours. Other providers offer similar discounts for asynchronous workloads.
Implementation Pattern#
```python
async def process_batch(items):
    batch = create_batch_request(items)
    batch_id = await submit_batch(batch)
    # Poll or webhook for completion
    results = await wait_for_batch(batch_id)
    return results
```
Hybrid Architecture#
Use real-time calls for user-facing features and batch processing for background tasks. A typical pattern:
- User submits content — real-time model call for immediate feedback
- Background job runs deeper analysis in batch mode
- Results are merged and presented on next page load
Self-Hosted vs API#
At sufficient scale, running your own models can be dramatically cheaper than API calls.
When Self-Hosting Makes Sense#
- Monthly API spend exceeds $5,000-10,000
- You have ML engineering expertise on the team
- Data privacy requirements prohibit sending data to third parties
- You need custom models (fine-tuned) with high throughput
- Latency requirements demand co-located inference
Cost Breakdown: Self-Hosted#
Running a Llama 3 70B model on a single A100 80GB GPU:
- Cloud GPU cost: ~$2-3/hour ($1,500-2,200/month)
- Throughput: ~30-50 requests/second with vLLM
- Effective cost: ~$0.50 per million tokens
- Break-even: ~5 million output tokens per day vs GPT-4o
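The break-even figure can be sanity-checked with simple arithmetic. The sketch below uses the low end of the GPU cost range and the approximate GPT-4o output price from the comparison table; all figures are illustrative.

```python
# Rough break-even check for self-hosting vs GPT-4o output pricing.
gpu_monthly = 1_500.0        # low end of the $1,500-2,200/month A100 estimate
gpt4o_output_per_m = 10.00   # $/1M output tokens, from the comparison table
self_hosted_per_m = 0.50     # $/1M tokens, from the throughput estimate

savings_per_m = gpt4o_output_per_m - self_hosted_per_m   # $9.50 per 1M tokens
break_even_tokens_per_month = gpu_monthly / savings_per_m * 1_000_000
break_even_per_day = break_even_tokens_per_month / 30    # ~5.3M tokens/day
```

That lands near the ~5 million output tokens per day quoted above; at the high end of the GPU cost range the break-even point rises toward 7 million.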
Infrastructure Requirements#
- vLLM or TGI: High-throughput inference servers with continuous batching
- GPU monitoring: Track utilization, memory, and queue depth
- Auto-scaling: Scale GPU instances based on request queue length
- Fallback: Route to API providers when self-hosted capacity is exhausted
The Middle Ground#
Services like Together AI, Anyscale, and Fireworks offer hosted open-source models at prices between self-hosted and frontier API costs, without the operational burden.
Cost Monitoring#
You cannot optimize what you do not measure. Build cost visibility from day one.
What to Track#
- Cost per request: Total token cost for each API call
- Cost per feature: Aggregate costs by product feature
- Cost per user: Identify expensive usage patterns
- Model distribution: What percentage of requests go to each model tier?
- Cache hit rate: Are your caches actually saving money?
- Waste metrics: Requests that fail, timeout, or produce unused outputs
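The first three metrics above all fall out of one structured log line per call. A minimal sketch, with hypothetical prices you would swap for your provider's current rates:

```python
import json
import time

# Hypothetical (input, output) prices in dollars per million tokens.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def log_request_cost(model, input_tokens, output_tokens, feature, user_id):
    """Emit one structured log line per call so cost can be aggregated
    by model, feature, and user downstream."""
    inp, out = PRICES[model]
    cost = (input_tokens * inp + output_tokens * out) / 1_000_000
    record = {
        "ts": time.time(),
        "model": model,
        "feature": feature,
        "user_id": user_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
    }
    print(json.dumps(record))  # ship to your observability stack instead
    return cost
```

With these records in place, cost per feature and cost per user are just group-by queries in your log store.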
Alerting#
Set alerts for:
- Daily spend exceeding 2x the trailing 7-day average
- Single-user spend exceeding thresholds (possible abuse)
- Cache hit rate dropping below expected levels
- Error rates increasing (you pay for failed requests too)
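The first alert rule above is a one-liner once you keep a trailing window of daily totals. A minimal sketch:

```python
def should_alert(daily_spend, trailing_7d, factor=2.0):
    """Flag when today's spend exceeds `factor` times the
    trailing 7-day average. Returns False with no history."""
    if not trailing_7d:
        return False
    avg = sum(trailing_7d) / len(trailing_7d)
    return daily_spend > factor * avg
```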
Tools for Cost Monitoring#
- Helicone: Proxy that logs all LLM calls with cost tracking and analytics
- Portkey: AI gateway with cost tracking, caching, and routing built in
- LiteLLM: Unified API that normalizes costs across providers
- Custom logging: Log token counts and costs with every request to your observability stack
Architecture Checklist#
Here is a checklist for cost-optimized AI architecture:
- Implement model routing — cheap model first, escalate when needed
- Add exact-match caching with Redis for repetitive queries
- Consider semantic caching if query patterns have high variance
- Audit and compress prompts quarterly
- Move non-real-time workloads to batch processing
- Evaluate self-hosting when monthly spend exceeds $5K
- Build cost dashboards from day one
- Set spend alerts and per-user rate limits
- A/B test every optimization against quality metrics
- Review model pricing monthly — the market changes fast
Summary#
AI cost optimization is an architecture problem, not a negotiation problem. Route easy requests to cheap models, cache aggressively, compress prompts, batch where possible, and monitor everything. Most teams can reduce LLM costs by 50-80% with these techniques while maintaining or improving quality. Start with model routing and caching — they deliver the highest impact with the least effort.
Build smarter AI systems with us at codelit.io.
Article #336 on Codelit — Keep building, keep shipping.