LLM Integration Patterns in Production
Six months ago, our team embarked on integrating Large Language Models into our production application. I thought it would be straightforward—call an API, get a response, ship it. I was naive. What I learned along the way fundamentally changed how I think about building AI-powered features.
The Promise and the Reality
The marketing pitch for LLMs is compelling: add AI to your app with a few API calls. The reality? It's more complex. You're dealing with non-deterministic outputs, managing costs that can spiral quickly, and ensuring reliability when calling external services. But when done right, the user experience improvements are genuinely transformative.
RAG Architecture: Why It Became Our Go-To Pattern
Retrieval-Augmented Generation (RAG) sounds academic, but it's incredibly practical. Let me explain with a real example from our codebase.
We built a customer support assistant that needed to answer questions about our product documentation. Initially, we tried fine-tuning a model on our docs. It cost thousands of dollars and still hallucinated incorrect information. RAG solved this elegantly.
The core idea: instead of expecting the LLM to memorize your content, you retrieve relevant information and inject it into the prompt. Here's the pattern that worked for us:
```python
from openai import AsyncOpenAI
from langchain.vectorstores import Pinecone
from langchain.embeddings import OpenAIEmbeddings

client = AsyncOpenAI()
embeddings = OpenAIEmbeddings()

# Our documentation lives in a Pinecone index (index name here is illustrative);
# assumes the Pinecone client has already been initialized.
vector_store = Pinecone.from_existing_index("product-docs", embeddings)

async def answer_question(user_question: str, user_id: str):
    # Step 1: Retrieve relevant documentation
    query_embedding = await embeddings.aembed_query(user_question)
    relevant_docs = vector_store.similarity_search_by_vector(
        query_embedding,
        k=4,
        # get_user_tier() is our own helper that looks up the caller's plan
        filter={"user_tier": get_user_tier(user_id)}
    )

    # Step 2: Build context from retrieved docs
    context = "\n\n".join([doc.page_content for doc in relevant_docs])

    # Step 3: Generate response with context
    response = await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": f"""You are a helpful customer support assistant.
Answer questions using ONLY the following documentation. If the answer
isn't in the documentation, say you don't know.

Documentation:
{context}"""},
            {"role": "user", "content": user_question}
        ],
        temperature=0.3,
        max_tokens=500
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": [doc.metadata for doc in relevant_docs]
    }
```
The game-changer? We can update our documentation and it's immediately reflected in answers—no retraining required. Plus, by returning sources, users can verify the information themselves.
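The indexing side is worth showing too, since it is what makes "update the docs and the answers follow" true. Here is a minimal sketch of how documentation might be chunked and pushed into the vector store, assuming LangChain's `Pinecone` wrapper and an already-initialized Pinecone client; the index name and chunk sizes are illustrative, not what our production pipeline necessarily uses:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone

embeddings = OpenAIEmbeddings()

def index_documentation(pages: list[str], metadatas: list[dict]):
    # Split long pages into overlapping chunks so retrieval stays focused
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = splitter.create_documents(pages, metadatas=metadatas)

    # Upsert into an existing Pinecone index; re-running this after a docs
    # update is all it takes for new content to show up in answers
    return Pinecone.from_documents(
        docs,
        embeddings,
        index_name="product-docs",  # illustrative index name
    )
```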
Prompt Engineering: The Art and Science
I used to think prompt engineering was overhyped. Then I spent three hours getting wildly inconsistent outputs before realizing my prompt was the problem.
Here's what I learned: good prompts are specific, structured, and give examples. Bad prompts are vague hopes thrown at an AI.
Before: The Naive Approach
```python
prompt = f"Summarize this text: {user_text}"
```
Results? Sometimes a bullet list, sometimes paragraphs, sometimes it would translate to another language for no reason.
After: The Engineered Approach
```python
prompt = f"""Task: Create a concise summary of the following text.

Requirements:
- Length: 2-3 sentences maximum
- Focus: Key insights and main points
- Format: Single paragraph
- Tone: Professional and clear

Text to summarize:
{user_text}

Summary:"""
```
Consistency improved dramatically. But we took it further with few-shot learning:
```python
SUMMARIZATION_PROMPT = """You are an expert at creating clear, concise summaries.

Example 1:
Text: "Artificial intelligence has made significant progress in recent years..."
Summary: AI has advanced rapidly through breakthroughs in deep learning and transformer architectures. These developments have enabled practical applications in language understanding and computer vision.

Example 2:
Text: "Climate change poses an urgent threat to global ecosystems..."
Summary: Climate change threatens biodiversity and human societies through rising temperatures and extreme weather. Immediate action on emissions reduction is critical to avoid catastrophic outcomes.

Now summarize this text:
Text: {user_text}
Summary:"""
```
The examples teach the model exactly what output format and style you want. It's like showing someone a template before asking them to write something.
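Filling the template is then plain string formatting. A usage sketch under our assumptions (the cheaper model and token limit are our choices for this example, not requirements of the pattern):

```python
async def summarize(user_text: str) -> str:
    # The few-shot examples ride along on every call, so they count
    # toward prompt tokens; keep them short.
    prompt = SUMMARIZATION_PROMPT.format(user_text=user_text)
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",  # summaries rarely need the expensive model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=150,
    )
    return response.choices[0].message.content
```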
Cost Optimization: How We Reduced Our Bill by 70%
Here's a fun story: our first month of LLM integration, our bill was $4,200. I nearly had a heart attack. Our CFO definitely did. Here's how we got it down to $1,100 without sacrificing quality.
Strategy 1: Aggressive Caching
Many queries are similar or identical. Why pay twice?
```python
import hashlib
import json
from datetime import timedelta

from redis import asyncio as aioredis

redis = aioredis.from_url("redis://localhost")

async def cached_llm_call(prompt: str, **kwargs):
    # Create cache key from prompt and parameters
    cache_key = hashlib.sha256(
        f"{prompt}:{str(sorted(kwargs.items()))}".encode()
    ).hexdigest()

    # Check cache
    cached_response = await redis.get(f"llm:{cache_key}")
    if cached_response:
        return json.loads(cached_response)

    # Make API call (model, temperature, etc. come in via kwargs)
    response = await client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )
    result = response.choices[0].message.content

    # Cache for 24 hours
    await redis.setex(
        f"llm:{cache_key}",
        timedelta(hours=24),
        json.dumps(result)
    )
    return result
```
Cache hit rate after one week? 43%. That's 43% fewer API calls.
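That 43% isn't a guess; we count hits and misses. A minimal sketch of how to track it, assuming the Prometheus client we introduce in the monitoring section below; the counter names are ours:

```python
from prometheus_client import Counter

llm_cache_hits = Counter('llm_cache_hits_total', 'LLM prompt cache hits')
llm_cache_misses = Counter('llm_cache_misses_total', 'LLM prompt cache misses')

# In cached_llm_call: call llm_cache_hits.inc() just before returning the
# cached response, and llm_cache_misses.inc() just before the API call.
# Hit rate is then hits / (hits + misses) over whatever window you graph.
```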
Strategy 2: Model Selection Based on Complexity
Not every task needs GPT-4. We built a routing system:
```python
async def smart_llm_call(prompt: str, task_type: str):
    # Simple tasks: use cheaper, faster models
    if task_type in ["classification", "extraction", "simple_summary"]:
        model = "gpt-3.5-turbo"
        max_tokens = 150
    # Complex reasoning: use powerful models
    elif task_type in ["analysis", "creative_writing", "complex_reasoning"]:
        model = "gpt-4-turbo-preview"
        max_tokens = 500
    else:
        # Default to balanced option
        model = "gpt-3.5-turbo-16k"
        max_tokens = 300

    return await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens
    )
```
Cost difference? GPT-3.5 costs about 1/30th of GPT-4. Using the right model for the job saved thousands.
Strategy 3: Streaming Responses
For user-facing features, streaming provides better UX and can reduce wasted tokens if users navigate away:
```python
async def stream_completion(prompt: str):
    stream = await client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
Users see responses immediately, and if they close the page, you stop generating (and paying for) tokens.
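The "stop paying when they leave" part depends on how the generator is served. Here's a sketch assuming FastAPI (the route is illustrative): when the client disconnects, the framework cancels the streaming generator, so no further chunks are pulled from the OpenAI stream.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/answer")
async def answer(q: str):
    # If the client disconnects, the streaming response cancels the
    # generator, so we stop consuming further chunks from the model.
    return StreamingResponse(stream_completion(q), media_type="text/plain")
```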
Monitoring and Observability: The Critical Missing Piece
You can't improve what you don't measure. Here's our monitoring setup:
```python
import time

from prometheus_client import Counter, Histogram

llm_requests_total = Counter(
    'llm_requests_total',
    'Total LLM API requests',
    ['model', 'status']
)
llm_latency_seconds = Histogram(
    'llm_latency_seconds',
    'LLM API call latency',
    ['model']
)
llm_tokens_used = Counter(
    'llm_tokens_used_total',
    'Total tokens consumed',
    ['model', 'type']
)
llm_cost_dollars = Counter(
    'llm_cost_dollars_total',
    'Estimated cost in dollars',
    ['model']
)

async def monitored_llm_call(prompt: str, model: str):
    start_time = time.time()
    try:
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )

        # Record metrics
        latency = time.time() - start_time
        llm_latency_seconds.labels(model=model).observe(latency)

        prompt_tokens = response.usage.prompt_tokens
        completion_tokens = response.usage.completion_tokens
        llm_tokens_used.labels(model=model, type="prompt").inc(prompt_tokens)
        llm_tokens_used.labels(model=model, type="completion").inc(completion_tokens)

        # Calculate cost (rough estimates)
        cost = calculate_cost(model, prompt_tokens, completion_tokens)
        llm_cost_dollars.labels(model=model).inc(cost)

        llm_requests_total.labels(model=model, status="success").inc()
        return response.choices[0].message.content
    except Exception:
        llm_requests_total.labels(model=model, status="error").inc()
        raise
```
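The `calculate_cost` helper above isn't shown in our excerpt. A minimal sketch, using per-1K-token rates that were roughly right for these models at the time; treat the numbers as illustrative and check current pricing:

```python
# Approximate $ per 1K tokens as (input, output); illustrative values only.
MODEL_PRICES = {
    "gpt-4-turbo-preview": (0.01, 0.03),
    "gpt-3.5-turbo": (0.0005, 0.0015),
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Unknown models fall back to zero so the counter never throws.
    input_price, output_price = MODEL_PRICES.get(model, (0.0, 0.0))
    return (prompt_tokens / 1000) * input_price + (completion_tokens / 1000) * output_price
```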
This lets us track:
- Which features use the most tokens
- Average latency by model
- Daily cost trends
- Error rates
We caught a bug where a recursive function was making 50+ LLM calls per user request. Metrics saved us thousands of dollars.
Error Handling: When LLMs Fail
LLMs will fail. APIs go down, rate limits hit, and sometimes the model just returns garbage. Handling this gracefully is crucial.
```python
import asyncio
import logging

from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True  # surface the original exception once retries are exhausted
)
async def _llm_call(prompt: str) -> str:
    # Each attempt gets its own 30-second timeout; letting exceptions
    # propagate from here is what allows tenacity to retry.
    async with asyncio.timeout(30):  # asyncio.timeout requires Python 3.11+
        response = await client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": prompt}]
        )
    return response.choices[0].message.content

async def resilient_llm_call(prompt: str, fallback_response: str | None = None) -> str:
    try:
        return await _llm_call(prompt)
    except TimeoutError:
        logger.error(f"LLM call timed out for prompt: {prompt[:100]}")
        if fallback_response is not None:
            return fallback_response
        raise
    except Exception as e:
        logger.error(f"LLM call failed: {e}")
        if fallback_response is not None:
            return fallback_response
        raise
```
For user-facing features, we always provide fallback responses. Better to show something useful than an error message.
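In practice that just means passing a canned, safe answer as the fallback; the wording below is only an example:

```python
async def answer_with_fallback(user_question: str) -> str:
    # The canned message is illustrative; use whatever degrades gracefully
    # for your product.
    return await resilient_llm_call(
        prompt=f"Answer this customer support question: {user_question}",
        fallback_response=(
            "Sorry, the assistant is temporarily unavailable. "
            "Please try again in a few minutes or contact support."
        ),
    )
```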
Real-World Impact
After six months of iteration, our LLM-powered features now:
- Handle 50,000+ queries per day
- Maintain 99.5% uptime
- Cost less than $40 per day
- Achieve 4.7/5 user satisfaction ratings
Was it worth the complexity? Absolutely. But it required treating LLMs as a critical infrastructure component, not a magic solution. Proper engineering practices—caching, monitoring, error handling—apply just as much to AI as to traditional backend services.
The future of software includes LLMs, but successful integration requires treating them as powerful tools that need careful orchestration, not silver bullets.