LLM Integration Patterns in Production
Six months ago, our team embarked on integrating Large Language Models into our production application. I thought it would be straightforward—call an API, get a response, ship it. I was naive. What I learned along the way fundamentally changed how I think about building AI-powered features.
The Promise and the Reality
The marketing pitch for LLMs is compelling: add AI to your app with a few API calls. The reality? It's more complex. You're dealing with non-deterministic outputs, managing costs that can spiral quickly, and ensuring reliability when calling external services. But when done right, the user experience improvements are genuinely transformative.
RAG Architecture: Why It Became Our Go-To Pattern
Retrieval-Augmented Generation (RAG) sounds academic, but it's incredibly practical. Let me explain with a real example from our codebase.
We built a customer support assistant that needed to answer questions about our product documentation. Initially, we tried fine-tuning a model on our docs. It cost thousands of dollars and still hallucinated incorrect information. RAG solved this elegantly.
The core idea: instead of expecting the LLM to memorize your content, you retrieve relevant information and inject it into the prompt. Here's the pattern that worked for us:
```python
from openai import AsyncOpenAI
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

client = AsyncOpenAI()
embeddings = OpenAIEmbeddings()

# Our documentation lives in a Pinecone index (index name here is illustrative);
# assumes the Pinecone client has already been initialized.
vector_store = Pinecone.from_existing_index("product-docs", embeddings)

async def answer_question(user_question: str, user_id: str):
    # Step 1: Retrieve relevant documentation
    query_embedding = await embeddings.aembed_query(user_question)
    relevant_docs = vector_store.similarity_search_by_vector(
        query_embedding,
        k=4,
        # get_user_tier() is our own helper that looks up the caller's plan
        filter={"user_tier": get_user_tier(user_id)}
    )

    # Step 2: Build context from retrieved docs
    context = "\n\n".join([doc.page_content for doc in relevant_docs])

    # Step 3: Generate response with context
    response = await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": f"""You are a helpful customer support assistant.
Answer questions using ONLY the following documentation. If the answer
isn't in the documentation, say you don't know.

Documentation:
{context}"""},
            {"role": "user", "content": user_question}
        ],
        temperature=0.3,
        max_tokens=500
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [doc.metadata for doc in relevant_docs]
    }
```
The game-changer? We can update our documentation and it's immediately reflected in answers—no retraining required. Plus, by returning sources, users can verify the information themselves.
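The indexing side is worth showing too, since it is what makes "update the docs and the answers follow" true. Here is a minimal sketch of how documentation might be chunked and pushed into the vector store, assuming LangChain's `Pinecone` wrapper and an already-initialized Pinecone client; the index name and chunk sizes are illustrative, not what our production pipeline necessarily uses:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone

embeddings = OpenAIEmbeddings()

def index_documentation(pages: list[str], metadatas: list[dict]):
    # Split long pages into overlapping chunks so retrieval stays focused
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = splitter.create_documents(pages, metadatas=metadatas)

    # Upsert into an existing Pinecone index; re-running this after a docs
    # update is all it takes for new content to show up in answers
    return Pinecone.from_documents(
        docs,
        embeddings,
        index_name="product-docs",  # illustrative index name
    )
```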
Prompt Engineering: The Art and Science
I used to think prompt engineering was overhyped. Then I spent three hours getting wildly inconsistent outputs before realizing my prompt was the problem.
Here's what I learned: good prompts are specific, structured, and give examples. Bad prompts are vague hopes thrown at an AI.
Before: The Naive Approach
```python
prompt = f"Summarize this text: {user_text}"
```
Results? Sometimes a bullet list, sometimes paragraphs, sometimes it would translate to another language for no reason.
After: The Engineered Approach
```python
prompt = f"""Task: Create a concise summary of the following text.

Requirements:
- Length: 2-3 sentences maximum
- Focus: Key insights and main points
- Format: Single paragraph
- Tone: Professional and clear

Text to summarize:
{user_text}

Summary:"""
```
Consistency improved dramatically. But we took it further with few-shot learning:
```python
SUMMARIZATION_PROMPT = """You are an expert at creating clear, concise summaries.

Example 1:
Text: "Artificial intelligence has made significant progress in recent years..."
Summary: AI has advanced rapidly through breakthroughs in deep learning and transformer architectures. These developments have enabled practical applications in language understanding and computer vision.

Example 2:
Text: "Climate change poses an urgent threat to global ecosystems..."
Summary: Climate change threatens biodiversity and human societies through rising temperatures and extreme weather. Immediate action on emissions reduction is critical to avoid catastrophic outcomes.

Now summarize this text:
Text: {user_text}
Summary:"""
```
The examples teach the model exactly what output format and style you want. It's like showing someone a template before asking them to write something.
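Filling the template is then plain string formatting. A usage sketch under our assumptions (the cheaper model and token limit are our choices for this example, not requirements of the pattern):

```python
async def summarize(user_text: str) -> str:
    # The few-shot examples ride along on every call, so they count
    # toward prompt tokens; keep them short.
    prompt = SUMMARIZATION_PROMPT.format(user_text=user_text)
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",  # summaries rarely need the expensive model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=150,
    )
    return response.choices[0].message.content
```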
Cost Optimization: How We Reduced Our Bill by 70%
Here's a fun story: our first month of LLM integration, our bill was $4,200. I nearly had a heart attack. Our CFO definitely did. Here's how we got it down to $1,100 without sacrificing quality.
Strategy 1: Aggressive Caching
Many queries are similar or identical. Why pay twice?
```python
import hashlib
import json
from datetime import timedelta

from redis import asyncio as aioredis

redis = aioredis.from_url("redis://localhost")

async def cached_llm_call(prompt: str, **kwargs):
    # Create cache key from prompt and parameters
    cache_key = hashlib.sha256(
        f"{prompt}:{str(sorted(kwargs.items()))}".encode()
    ).hexdigest()

    # Check cache
    cached_response = await redis.get(f"llm:{cache_key}")
    if cached_response:
        return json.loads(cached_response)

    # Make API call (model, temperature, etc. come in via kwargs)
    response = await client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )
    result = response.choices[0].message.content

    # Cache for 24 hours
    await redis.setex(
        f"llm:{cache_key}",
        timedelta(hours=24),
        json.dumps(result)
    )
    return result
```
Cache hit rate after one week? 43%. That's 43% fewer API calls.
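That 43% isn't a guess; we count hits and misses. A minimal sketch of how to track it, assuming the Prometheus client we introduce in the monitoring section below; the counter names are ours:

```python
from prometheus_client import Counter

llm_cache_hits = Counter('llm_cache_hits_total', 'LLM prompt cache hits')
llm_cache_misses = Counter('llm_cache_misses_total', 'LLM prompt cache misses')

# In cached_llm_call: call llm_cache_hits.inc() just before returning the
# cached response, and llm_cache_misses.inc() just before the API call.
# Hit rate is then hits / (hits + misses) over whatever window you graph.
```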
Strategy 2: Model Selection Based on Complexity
Not every task needs GPT-4. We built a routing system:
```python
async def smart_llm_call(prompt: str, task_type: str):
    # Simple tasks: use cheaper, faster models
    if task_type in ["classification", "extraction", "simple_summary"]:
        model = "gpt-3.5-turbo"
        max_tokens = 150
    # Complex reasoning: use powerful models
    elif task_type in ["analysis", "creative_writing", "complex_reasoning"]:
        model = "gpt-4-turbo-preview"
        max_tokens = 500
    else:
        # Default to balanced option
        model = "gpt-3.5-turbo-16k"
        max_tokens = 300

    return await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens
    )
```
Cost difference? GPT-3.5 costs about 1/30th of GPT-4. Using the right model for the job saved thousands.
Strategy 3: Streaming Responses
For user-facing features, streaming provides better UX and can reduce wasted tokens if users navigate away:
```python
async def stream_completion(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
Users see responses immediately, and if they close the page, you stop generating (and paying for) tokens.
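The "stop paying when they leave" part depends on how the generator is served. Here's a sketch assuming FastAPI (the route is illustrative): when the client disconnects, the framework cancels the streaming generator, so no further chunks are pulled from the OpenAI stream.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/answer")
async def answer(q: str):
    # If the client disconnects, the streaming response cancels the
    # generator, so we stop consuming further chunks from the model.
    return StreamingResponse(stream_completion(q), media_type="text/plain")
```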
Monitoring and Observability: The Critical Missing Piece
You can't improve what you don't measure. Here's our monitoring setup:
```python
import time

from prometheus_client import Counter, Histogram

llm_requests_total = Counter(
    'llm_requests_total',
    'Total LLM API requests',
    ['model', 'status']
)
llm_latency_seconds = Histogram(
    'llm_latency_seconds',
    'LLM API call latency',
    ['model']
)
llm_tokens_used = Counter(
    'llm_tokens_used_total',
    'Total tokens consumed',
    ['model', 'type']
)
llm_cost_dollars = Counter(
    'llm_cost_dollars_total',
    'Estimated cost in dollars',
    ['model']
)

async def monitored_llm_call(prompt: str, model: str):
    start_time = time.time()
    try:
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )

        # Record metrics
        latency = time.time() - start_time
        llm_latency_seconds.labels(model=model).observe(latency)

        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        llm_tokens_used.labels(model=model, type="prompt").inc(prompt_tokens)
        llm_tokens_used.labels(model=model, type="completion").inc(completion_tokens)

        # Calculate cost (rough estimates)
        cost = calculate_cost(model, prompt_tokens, completion_tokens)
        llm_cost_dollars.labels(model=model).inc(cost)

        llm_requests_total.labels(model=model, status="success").inc()
        return response.choices[0].message.content
    except Exception:
        llm_requests_total.labels(model=model, status="error").inc()
        raise
```
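The `calculate_cost` helper above isn't shown in our excerpt. A minimal sketch, using per-1K-token rates that were roughly right for these models at the time; treat the numbers as illustrative and check current pricing:

```python
# Approximate $ per 1K tokens as (input, output); illustrative values only.
MODEL_PRICES = {
    "gpt-4-turbo-preview": (0.01, 0.03),
    "gpt-3.5-turbo": (0.0005, 0.0015),
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Unknown models fall back to zero so the counter never throws.
    input_price, output_price = MODEL_PRICES.get(model, (0.0, 0.0))
    return (prompt_tokens / 1000) * input_price + (completion_tokens / 1000) * output_price
```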
This lets us track:
- Which features use the most tokens
- Average latency by model
- Daily cost trends
- Error rates
We caught a bug where a recursive function was making 50+ LLM calls per user request. Metrics saved us thousands of dollars.
Error Handling: When LLMs Fail
LLMs will fail. APIs go down, rate limits hit, and sometimes the model just returns garbage. Handling this gracefully is crucial.
```python
import asyncio
import logging

from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True  # surface the original exception once retries are exhausted
)
async def _llm_call(prompt: str) -> str:
    # Each attempt gets its own 30-second timeout; letting exceptions
    # propagate from here is what allows tenacity to retry.
    async with asyncio.timeout(30):  # asyncio.timeout requires Python 3.11+
        response = await client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": prompt}]
        )
    return response.choices[0].message.content

async def resilient_llm_call(prompt: str, fallback_response: str | None = None) -> str:
    try:
        return await _llm_call(prompt)
    except TimeoutError:
        logger.error(f"LLM call timed out for prompt: {prompt[:100]}")
        if fallback_response is not None:
            return fallback_response
        raise
    except Exception as e:
        logger.error(f"LLM call failed: {e}")
        if fallback_response is not None:
            return fallback_response
        raise
```
For user-facing features, we always provide fallback responses. Better to show something useful than an error message.
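In practice that just means passing a canned, safe answer as the fallback; the wording below is only an example:

```python
async def answer_with_fallback(user_question: str) -> str:
    # The canned message is illustrative; use whatever degrades gracefully
    # for your product.
    return await resilient_llm_call(
        prompt=f"Answer this customer support question: {user_question}",
        fallback_response=(
            "Sorry, the assistant is temporarily unavailable. "
            "Please try again in a few minutes or contact support."
        ),
    )
```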
Real-World Impact
After six months of iteration, our LLM-powered features now:
- Handle 50,000+ queries per day
- Maintain 99.5% uptime
- Cost less than $40 per day
- Achieve 4.7/5 user satisfaction ratings
Was it worth the complexity? Absolutely. But it required treating LLMs as a critical infrastructure component, not a magic solution. Proper engineering practices—caching, monitoring, error handling—apply just as much to AI as to traditional backend services.
The future of software includes LLMs, but successful integration requires treating them as powerful tools that need careful orchestration, not silver bullets.