Building LLM-Powered Applications Beyond the ChatGPT API

RAG, Embeddings, and the Art of Not Going Broke

Build production-ready AI applications that don’t hallucinate your business into the ground or drain your wallet.

The $10,000 API Bill Nobody Talks About

A friend built a customer support chatbot on the ChatGPT API. Beautiful interface, users loved it. Then the bill came: $9,847 for one month.

The problem? The app stuffed the entire documentation into the context window for every query: thousands of tokens per request, hundreds of requests a day.

Solution: RAG (Retrieval-Augmented Generation)

What RAG Does

  1. Store knowledge in vector database (embeddings of documents)
  2. Retrieve relevant chunks when user asks
  3. Augment LLM prompt with only relevant information
  4. Generate accurate responses grounded in your data

Instead of hoping the LLM already knows your documentation, you give it exactly the context it needs, right when it needs it.

The Architecture

Query → Embedding → Vector DB (retrieve) → LLM (generate) → Response

Building Your First RAG App

Step 1: Install dependencies

pip install langchain openai chromadb

Step 2-3: Load documents and chunk them

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every Markdown file under ./docs
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()

# ~1000-character chunks with 200 characters of overlap so ideas aren't cut mid-thought
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(documents)
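
A quick sanity check before paying for embeddings (purely optional, just confirms the split looks reasonable):

print(f"Loaded {len(documents)} documents, split into {len(chunks)} chunks")
print(chunks[0].page_content[:200])  # eyeball the first chunk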

Step 4-5: Create embeddings and build RAG chain

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks, embedding=embeddings
)  # in-memory by default; pass persist_directory=... to reuse the index between runs

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)  # any chat model works here
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=vectorstore.as_retriever()
)
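
Querying the chain is then a one-liner; a minimal usage example (the question is made up):

result = qa_chain.run("How do I rotate my API keys?")
print(result)

If you want the retrieved chunks back for source attribution (it's on the production checklist below), pass return_source_documents=True when building the chain and call it with a dict instead of .run.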

The Gotchas

1. Embedding Model Mismatch: Use the same embedding model for indexing and querying; a query embedded with a different model than your index retrieves nonsense.

2. Chunk Size: Code-heavy docs work better with smaller chunks (500-800 characters); long-form prose needs larger ones (1500-2000).

3. Metadata: Add metadata (source file, section, document type) during chunking and filter on it during retrieval; see the first sketch below.

4. Cost Optimization:

  • Use text-embedding-3-small ($0.00002/1K tokens) instead of ada-002
  • Cache embeddings aggressively
  • Use local embeddings for development (HuggingFace); see the second sketch below
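
For the metadata gotcha, here is a minimal sketch. The doc_type field and the path check are assumptions for illustration; tag the chunks before calling Chroma.from_documents so the metadata actually lands in the index:

# Tag each chunk at indexing time; DirectoryLoader already sets metadata["source"] to the file path
for chunk in chunks:
    chunk.metadata["doc_type"] = "api" if "api" in chunk.metadata["source"] else "guide"

# Filter at query time so retrieval only searches the relevant slice of the index
api_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"doc_type": "api"}}
)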

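And for local embeddings during development, a sketch using sentence-transformers through LangChain (the model name is just a common default; requires the sentence-transformers package):

from langchain.embeddings import HuggingFaceEmbeddings

# Runs locally with no per-token cost; quality is lower than OpenAI's models but fine for iterating
dev_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
dev_vectorstore = Chroma.from_documents(documents=chunks, embedding=dev_embeddings)
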
Hybrid Search

Combine vector search + keyword search for better results:

from langchain.retrievers import BM25Retriever, EnsembleRetriever  # BM25Retriever needs the rank_bm25 package

vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})  # semantic similarity
keyword_retriever = BM25Retriever.from_documents(chunks)             # exact term matching
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.5, 0.5]  # equal weighting is a reasonable starting point; tune on your data
)

The keyword half catches cases where pure vector search misses exact terms, such as error codes or function names.
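
Usage is identical to any other retriever, and it drops straight into the QA chain (the query is made up):

docs = ensemble.get_relevant_documents("What does error E429 mean?")
hybrid_qa = RetrievalQA.from_chain_type(llm=llm, retriever=ensemble)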

Measuring Success

Track:

  • Latency: Aim for <2 seconds
  • Cost per query: Should be pennies
  • Retrieval accuracy: Are right chunks found?
  • User satisfaction: Thumbs up/down on responses
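
A minimal sketch for the first two metrics, using LangChain's OpenAI callback to count tokens (the threshold values are arbitrary):

import time
from langchain.callbacks import get_openai_callback

start = time.time()
with get_openai_callback() as cb:
    answer = qa_chain.run("How do I rotate my API keys?")
latency = time.time() - start

print(f"latency: {latency:.2f}s, tokens: {cb.total_tokens}, cost: ${cb.total_cost:.4f}")
if latency > 2 or cb.total_cost > 0.05:
    print("over budget, investigate")  # wire this into real monitoring/alerting instead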

When RAG Isn’t Enough

  • Real-time data: RAG won’t know today’s weather. Use function calling (see the sketch after this list).
  • Complex reasoning: Multi-step math needs different approaches.
  • Personalization: Combine RAG with user profiles and conversation history.
  • Domain expertise: Fine-tuning might be better for specialized fields.
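
For the real-time data case, a bare-bones function-calling sketch with the OpenAI Python SDK (v1 client; get_weather is a hypothetical tool you would implement yourself):

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Lisbon today?"}],
    tools=tools,
)

# If the model decided to call the tool, run get_weather yourself,
# append the result as a "tool" message, and call the API again for the final answer.
print(response.choices[0].message.tool_calls)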

Production Checklist

  • Chunk size optimized for content type
  • Embedding model chosen
  • Vector database scaled
  • Caching for repeat queries (see the sketch after this checklist)
  • Rate limiting in place
  • Cost monitoring and alerts
  • Fallback handling when retrieval fails
  • Source attribution in responses
  • User feedback mechanism
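
On the caching item, even a naive exact-match cache pays for itself on FAQ-style traffic. A minimal in-memory sketch (swap in Redis or similar for a real deployment):

import hashlib

answer_cache: dict[str, str] = {}

def cached_answer(question: str) -> str:
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in answer_cache:
        answer_cache[key] = qa_chain.run(question)  # only hit the LLM on a cache miss
    return answer_cache[key]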

RAG transformed how I build AI apps. Instead of $10k bills, I pay $50/month. Users get accurate answers, sources are cited, knowledge base updates without retraining.

Start simple, measure everything, iterate.

Now go build something that doesn’t hallucinate.