The $10,000 API Bill Nobody Talks About
A friend built a customer support chatbot on the ChatGPT API. Beautiful interface, users loved it. Then the bill came: $9,847 for one month.
The problem? The app fed the entire documentation into the context window for every query. Thousands of tokens per request, hundreds of requests daily.
Solution: RAG (Retrieval-Augmented Generation)
What RAG Does
- Store knowledge in vector database (embeddings of documents)
- Retrieve relevant chunks when user asks
- Augment LLM prompt with only relevant information
- Generate accurate responses grounded in your data
Instead of hoping the LLM knows your documentation, you give it exactly what it needs, right when it's needed.
The Architecture
Query → Embedding → Vector DB (retrieve) → LLM (generate) → Response
Building Your First RAG App
Step 1: Install dependencies
pip install langchain openai chromadb
Step 2-3: Load documents and chunk them
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load every Markdown file under ./docs
loader = DirectoryLoader('./docs', glob="**/*.md")
documents = loader.load()
# Overlapping chunks so answers aren't cut mid-thought
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
chunks = splitter.split_documents(documents)
Step 4-5: Create embeddings and build RAG chain
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks, embedding=embeddings
)
# The chat model that generates the final answer
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=vectorstore.as_retriever()
)
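Querying the chain is a single call. A minimal usage sketch (the question string is just an example):
# The retriever fetches relevant chunks, the LLM answers from them
answer = qa_chain.run("How do I reset my API key?")
print(answer)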
The Gotchas
1. Embedding Model Mismatch: Use the same embedding model for indexing and querying; mixing models makes similarity scores meaningless.
2. Chunk Size: Code docs need smaller chunks (500-800 chars). Long-form needs larger (1500-2000).
3. Metadata: Add metadata (source, section, document type) during chunking, then filter on it during retrieval (see the sketch after this list).
4. Cost Optimization:
- Use text-embedding-3-small ($0.00002/1K tokens) instead of ada-002
- Cache embeddings aggressively
- Use local embeddings for development (HuggingFace)
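A rough sketch of gotcha 3 with Chroma: tag chunks before they go into the vector store, then filter at query time. The doc_type key and its values are made up for illustration.
# Tag chunks before building the vector store
for chunk in chunks:
    chunk.metadata["doc_type"] = "api-reference"
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)
# Filter at query time so only matching chunks are searched
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"doc_type": "api-reference"}}
)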
Advanced: Hybrid Search
Combine vector search + keyword search for better results:
from langchain.retrievers import BM25Retriever, EnsembleRetriever
# BM25 keyword search needs the rank_bm25 package: pip install rank_bm25
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
keyword_retriever = BM25Retriever.from_documents(chunks)
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.5, 0.5]
)
This catches cases where pure vector search misses exact keyword matches (error codes, product names, API identifiers).
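The ensemble drops in anywhere a retriever is expected, for example rebuilding the QA chain on top of it (the query string is illustrative):
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=ensemble)
# Or inspect what hybrid retrieval actually returns
docs = ensemble.get_relevant_documents("error E4042 refund policy")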
Measuring Success
Track:
- Latency: Aim for <2 seconds
- Cost per query: Should be pennies
- Retrieval accuracy: Are right chunks found?
- User satisfaction: Thumbs up/down on responses
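A minimal sketch for the first two metrics: wrap each query with a timer and a rough token-based cost estimate. The characters-per-token ratio and price here are placeholder assumptions; plug in your model's real pricing.
import time
def answer_with_metrics(question):
    start = time.time()
    answer = qa_chain.run(question)
    latency = time.time() - start
    # Very rough: ~4 characters per token, placeholder $/1K-token rate
    est_tokens = (len(question) + len(answer)) / 4
    est_cost = est_tokens / 1000 * 0.002
    print(f"latency={latency:.2f}s, est_cost=${est_cost:.4f}")
    return answer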
When RAG Isn’t Enough
- Real-time data: RAG won’t know today’s weather. Use function calling (see the sketch after this list).
- Complex reasoning: Multi-step math needs different approaches.
- Personalization: Combine RAG with user profiles and conversation history.
- Domain expertise: Fine-tuning might be better for specialized fields.
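For the real-time case, here's a hedged sketch of function calling with the OpenAI SDK. The get_weather tool and its schema are invented for illustration; in a real app you'd execute the tool yourself and send the result back in a follow-up message.
from openai import OpenAI
client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
# The model replies with a tool call instead of guessing the answer
print(response.choices[0].message.tool_calls)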
Production Checklist
- Chunk size optimized for content type
- Embedding model chosen
- Vector database scaled
- Caching for repeat queries
- Rate limiting in place
- Cost monitoring and alerts
- Fallback handling when retrieval fails
- Source attribution in responses (see the sketch after this checklist)
- User feedback mechanism
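Source attribution, for instance, is mostly one flag on the chain we already built. A sketch assuming your chunks carry a source metadata field (DirectoryLoader sets one by default):
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)
result = qa_chain({"query": "How do refunds work?"})
print(result["result"])
# Cite where the answer came from
for doc in result["source_documents"]:
    print("source:", doc.metadata.get("source"))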
RAG transformed how I build AI apps. Instead of $10k bills, I pay $50/month. Users get accurate answers, sources are cited, and the knowledge base updates without retraining.
Start simple, measure everything, iterate.
Now go build something that doesn’t hallucinate.