AI Cost Optimization: Model Routing, Caching, Prompt Compression, and Architecture Patterns
LLM API costs can spiral from hundreds to tens of thousands of dollars per month as usage grows. The difference between a profitable AI feature and a money pit often comes down to architecture. This guide covers practical strategies to reduce LLM costs without sacrificing quality.
Understanding Token Economics#
Every LLM API call is billed by tokens. Understanding the cost structure is the foundation of optimization.
Input vs Output Tokens#
Most providers charge differently for input and output tokens. Output tokens are typically 2-4x more expensive because they require autoregressive generation. This means:
- Long system prompts are expensive at scale but less so than long outputs
- Asking the model to be concise saves real money
- Structured output (JSON) is often cheaper than verbose natural language
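To see how this plays out, a quick back-of-the-envelope calculation helps. The sketch below uses the approximate GPT-4o rates from the comparison table that follows; the request shape (a 2,000-token prompt with a 500-token answer) is illustrative.

```python
# Estimate per-request and monthly cost from token counts.
# Prices are approximate GPT-4o rates in dollars per million tokens.
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# A 2,000-token prompt with a 500-token answer costs $0.01 total:
# the 500 output tokens cost as much as the 2,000 input tokens.
cost = request_cost(2_000, 500)
monthly = cost * 100_000  # at 100k requests/month
```

Note how the output half dominates per token: trimming 100 tokens of output saves as much as trimming 400 tokens of prompt.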
Cost Comparison (approximate, per million tokens)#
| Model | Input | Output |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet | $3.00 | $15.00 |
| Claude Haiku | $0.25 | $1.25 |
| Llama 3 70B (self-hosted) | ~$0.50 | ~$0.50 |
These prices change frequently. The key insight is that the cost difference between frontier and smaller models is 10-40x.
Model Routing: Cheap Model First#
The single most impactful cost optimization is sending easy requests to cheap models and only escalating to expensive models when needed.
Classification Router#
Use a lightweight classifier (or the cheap model itself) to assess request difficulty:
```python
async def route_request(request):
    complexity = classify_complexity(request)
    if complexity == "simple":
        return await call_model("gpt-4o-mini", request)
    elif complexity == "medium":
        return await call_model("claude-haiku", request)
    else:
        return await call_model("gpt-4o", request)
```
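The router above assumes a `classify_complexity` helper. A learned classifier works best, but even a crude heuristic can be a useful starting point. The sketch below is exactly that; the keyword cues and word-count thresholds are illustrative, not tuned.

```python
def classify_complexity(request: str) -> str:
    """Cheap heuristic stand-in for a learned difficulty classifier.
    Cues and thresholds are illustrative; tune against real traffic."""
    hard_cues = ("prove", "analyze", "step by step", "compare", "refactor")
    text = request.lower()
    # Long requests or analytical verbs suggest a harder task.
    if any(cue in text for cue in hard_cues) or len(text.split()) > 200:
        return "complex"
    if len(text.split()) > 50:
        return "medium"
    return "simple"
```

In production you would replace this with a small fine-tuned classifier or an embedding-based model, but the heuristic version lets you ship routing on day one and measure the savings.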
Cascade Router#
Start with the cheapest model. If the response fails quality checks, escalate:
```python
async def cascade_request(request):
    response = await call_model("gpt-4o-mini", request)
    if passes_quality_check(response):
        return response
    # Quality check failed: escalate to the stronger model.
    return await call_model("gpt-4o", request)
```
The cascade pattern trades latency for cost savings. In practice, 70-80% of requests can be handled by the cheap model, saving 60-70% on API costs.
Quality Checks for Routing#
- Format validation: Does the output match the expected schema?
- Confidence scoring: Does the model express uncertainty?
- Length heuristics: Is the response suspiciously short or long?
- Semantic checks: Does the response address the actual question?
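The first three checks above can be combined into a single gate. The sketch below is a minimal version; the length thresholds and hedge phrases are illustrative, and a real semantic check would need a model call of its own.

```python
import json

def passes_quality_check(response: str,
                         min_len: int = 20,
                         max_len: int = 4000,
                         require_json: bool = False) -> bool:
    """Gate a cheap-model response before accepting it.
    All thresholds are illustrative; tune them per application."""
    # Length heuristic: suspiciously short or long answers get escalated.
    if not (min_len <= len(response) <= max_len):
        return False
    # Format validation: if we asked for JSON, it must parse.
    if require_json:
        try:
            json.loads(response)
        except json.JSONDecodeError:
            return False
    # Confidence scoring: explicit hedging suggests the model is unsure.
    hedges = ("i'm not sure", "i cannot", "as an ai")
    if any(h in response.lower() for h in hedges):
        return False
    return True
```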
Caching LLM Responses#
Many LLM applications receive similar or identical queries. Caching avoids paying for the same answer twice.
Exact Match Caching#
Hash the full prompt (system + user message) and cache the response. Simple and effective for applications with repetitive queries like customer support bots.
```python
import hashlib
import json

import redis.asyncio as redis_lib

redis = redis_lib.Redis()

def get_cache_key(messages):
    # Canonical JSON so the same messages always hash identically.
    content = json.dumps(messages, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()

async def cached_completion(messages):
    key = get_cache_key(messages)
    cached = await redis.get(key)
    if cached:
        return json.loads(cached)
    response = await call_model(messages)
    await redis.setex(key, 3600, json.dumps(response))  # 1-hour TTL
    return response
```
Semantic Caching#
Use embeddings to find semantically similar past queries. If a new query is close enough to a cached query, return the cached response. This catches paraphrases and minor variations.
- Embed the incoming query
- Search a vector store for similar past queries
- If similarity exceeds a threshold (e.g., 0.95), return the cached response
- Otherwise, call the model and cache the new pair
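The loop above can be sketched in-process with cosine similarity over raw embedding vectors. This is a toy: a plain list stands in for the vector store, the caller supplies embeddings, and the 0.95 threshold matches the example above.

```python
from math import sqrt

SIMILARITY_THRESHOLD = 0.95  # conservative, per the guidance below
cache = []  # list of (embedding, response); use a real vector store in production

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def semantic_lookup(query_embedding):
    """Return a cached response if a past query is close enough, else None."""
    best = max(cache, key=lambda e: cosine(e[0], query_embedding), default=None)
    if best and cosine(best[0], query_embedding) >= SIMILARITY_THRESHOLD:
        return best[1]
    return None

def semantic_store(query_embedding, response):
    cache.append((query_embedding, response))
```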
Semantic caching has a higher hit rate than exact match but introduces a small risk of returning an imprecise answer. Set the similarity threshold conservatively.
Cache Invalidation#
- Set TTLs based on how quickly your data changes
- Invalidate caches when system prompts or model versions change
- Use versioned cache keys that include the model name and prompt version
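The versioned-key idea can be folded directly into the hashing function. The sketch below extends the exact-match key from earlier; bumping `prompt_version` on deploy makes every old entry an automatic cache miss, so no explicit invalidation pass is needed.

```python
import hashlib
import json

def versioned_cache_key(messages, model: str, prompt_version: str) -> str:
    """Include model and prompt version in the key so a deploy
    naturally invalidates stale entries (they simply stop matching)."""
    payload = json.dumps(
        {"model": model, "prompt_version": prompt_version, "messages": messages},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```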
Prompt Compression#
Long prompts cost more. Reducing prompt length without losing information directly reduces costs.
Techniques#
Remove redundancy: System prompts often contain repeated instructions. Audit and consolidate.
Abbreviate examples: In few-shot prompts, use the minimum number of examples that maintain quality. Often 2-3 examples work as well as 10.
Compress context: When including retrieved documents in RAG, summarize or extract relevant sections instead of dumping full documents.
Use structured references: Instead of repeating instructions for each item in a list, define the pattern once and apply it.
Token-aware truncation: Truncate context to fit within a budget, prioritizing the most relevant content.
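Token-aware truncation is the easiest of these to sketch. The version below assumes chunks arrive pre-sorted by relevance and uses a word-count tokenizer as a stand-in for a real one such as tiktoken.

```python
def truncate_to_budget(chunks, token_budget, count_tokens=lambda s: len(s.split())):
    """Keep the most relevant chunks (assumed pre-sorted by relevance)
    until the token budget is exhausted. The word-count tokenizer is a
    stand-in; swap in a real tokenizer for accurate budgets."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break  # stop at the first chunk that would blow the budget
        kept.append(chunk)
        used += cost
    return kept
```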
Prompt Compression Tools#
- LLMLingua: Microsoft's prompt compression library that can reduce prompt length by 2-5x with minimal quality loss
- Selective context: Include only the top-k most relevant retrieved chunks instead of all chunks above a threshold
Measuring Compression Impact#
Always A/B test compressed prompts against originals. Measure both cost savings and quality metrics. A 50% cost reduction is worthless if quality drops below acceptable thresholds.
Batch vs Real-Time Processing#
Not every LLM call needs to happen in real time. Batch processing unlocks significant savings.
When to Batch#
- Content moderation that can tolerate minutes of delay
- Email summarization and classification
- Data enrichment and extraction pipelines
- Nightly report generation
- Embedding generation for search indices
Batch API Pricing#
OpenAI's Batch API offers a 50% cost reduction for requests that can tolerate a completion window of up to 24 hours. Other providers offer similar discounts for asynchronous workloads.
Implementation Pattern#
```python
async def process_batch(items):
    batch = create_batch_request(items)
    batch_id = await submit_batch(batch)
    # Poll or webhook for completion
    results = await wait_for_batch(batch_id)
    return results
```
Hybrid Architecture#
Use real-time calls for user-facing features and batch processing for background tasks. A typical pattern:
- User submits content — real-time model call for immediate feedback
- Background job runs deeper analysis in batch mode
- Results are merged and presented on next page load
Self-Hosted vs API#
At sufficient scale, running your own models can be dramatically cheaper than API calls.
When Self-Hosting Makes Sense#
- Monthly API spend exceeds $5,000-10,000
- You have ML engineering expertise on the team
- Data privacy requirements prohibit sending data to third parties
- You need custom models (fine-tuned) with high throughput
- Latency requirements demand co-located inference
Cost Breakdown: Self-Hosted#
Running a Llama 3 70B model on a single A100 80GB GPU:
- Cloud GPU cost: ~$2-3/hour ($1,500-2,200/month)
- Throughput: ~30-50 requests/second with vLLM
- Effective cost: ~$0.50 per million tokens
- Break-even: ~5 million output tokens per day vs GPT-4o
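The break-even figure can be sanity-checked with simple arithmetic. The sketch below uses the low end of the GPU cost range and the approximate GPT-4o output price from the comparison table; all figures are illustrative.

```python
# Rough break-even check for self-hosting vs GPT-4o output pricing.
gpu_monthly = 1_500.0        # low end of the $1,500-2,200/month A100 estimate
gpt4o_output_per_m = 10.00   # $/1M output tokens, from the comparison table
self_hosted_per_m = 0.50     # $/1M tokens, from the throughput estimate

savings_per_m = gpt4o_output_per_m - self_hosted_per_m   # $9.50 per 1M tokens
break_even_tokens_per_month = gpu_monthly / savings_per_m * 1_000_000
break_even_per_day = break_even_tokens_per_month / 30    # ~5.3M tokens/day
```

That lands near the ~5 million output tokens per day quoted above; at the high end of the GPU cost range the break-even point rises toward 7 million.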
Infrastructure Requirements#
- vLLM or TGI: High-throughput inference servers with continuous batching
- GPU monitoring: Track utilization, memory, and queue depth
- Auto-scaling: Scale GPU instances based on request queue length
- Fallback: Route to API providers when self-hosted capacity is exhausted
The Middle Ground#
Services like Together AI, Anyscale, and Fireworks offer hosted open-source models at prices between self-hosted and frontier API costs, without the operational burden.
Cost Monitoring#
You cannot optimize what you do not measure. Build cost visibility from day one.
What to Track#
- Cost per request: Total token cost for each API call
- Cost per feature: Aggregate costs by product feature
- Cost per user: Identify expensive usage patterns
- Model distribution: What percentage of requests go to each model tier?
- Cache hit rate: Are your caches actually saving money?
- Waste metrics: Requests that fail, timeout, or produce unused outputs
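The first three metrics above all fall out of one structured log line per call. A minimal sketch, with hypothetical prices you would swap for your provider's current rates:

```python
import json
import time

# Hypothetical (input, output) prices in dollars per million tokens.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def log_request_cost(model, input_tokens, output_tokens, feature, user_id):
    """Emit one structured log line per call so cost can be aggregated
    by model, feature, and user downstream."""
    inp, out = PRICES[model]
    cost = (input_tokens * inp + output_tokens * out) / 1_000_000
    record = {
        "ts": time.time(),
        "model": model,
        "feature": feature,
        "user_id": user_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
    }
    print(json.dumps(record))  # ship to your observability stack instead
    return cost
```

With these records in place, cost per feature and cost per user are just group-by queries in your log store.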
Alerting#
Set alerts for:
- Daily spend exceeding 2x the trailing 7-day average
- Single-user spend exceeding thresholds (possible abuse)
- Cache hit rate dropping below expected levels
- Error rates increasing (you pay for failed requests too)
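The first alert rule above is a one-liner once you keep a trailing window of daily totals. A minimal sketch:

```python
def should_alert(daily_spend, trailing_7d, factor=2.0):
    """Flag when today's spend exceeds `factor` times the
    trailing 7-day average. Returns False with no history."""
    if not trailing_7d:
        return False
    avg = sum(trailing_7d) / len(trailing_7d)
    return daily_spend > factor * avg
```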
Tools for Cost Monitoring#
- Helicone: Proxy that logs all LLM calls with cost tracking and analytics
- Portkey: AI gateway with cost tracking, caching, and routing built in
- LiteLLM: Unified API that normalizes costs across providers
- Custom logging: Log token counts and costs with every request to your observability stack
Architecture Checklist#
Here is a checklist for cost-optimized AI architecture:
- Implement model routing — cheap model first, escalate when needed
- Add exact-match caching with Redis for repetitive queries
- Consider semantic caching if query patterns have high variance
- Audit and compress prompts quarterly
- Move non-real-time workloads to batch processing
- Evaluate self-hosting when monthly spend exceeds $5K
- Build cost dashboards from day one
- Set spend alerts and per-user rate limits
- A/B test every optimization against quality metrics
- Review model pricing monthly — the market changes fast
Summary#
AI cost optimization is an architecture problem, not a negotiation problem. Route easy requests to cheap models, cache aggressively, compress prompts, batch where possible, and monitor everything. Most teams can reduce LLM costs by 50-80% with these techniques while maintaining or improving quality. Start with model routing and caching — they deliver the highest impact with the least effort.
Build smarter AI systems with us at codelit.io.
Article #336 on Codelit — Keep building, keep shipping.