~/krishna_dhakal
#AI#AWS#RAG#LLM

Building RAG Pipelines with AWS Bedrock

> June 10, 2025

Retrieval-Augmented Generation (RAG) has become the go-to pattern for building AI applications that need access to domain-specific knowledge without the cost of fine-tuning.


> Why RAG?


Fine-tuning large language models is expensive and slow to update. RAG lets you keep your knowledge base separate from the model — meaning you can update your data without retraining.


> The Stack


  • **AWS Bedrock** — managed inference for foundation models (Claude, Titan, etc.)
  • **Qdrant** — high-performance vector database for semantic search
  • **llama.cpp** — local inference for cost-sensitive workloads
  • **LangChain** — orchestration layer tying everything together

> Embedding Documents


The first step is chunking your documents and embedding them into vector space:


from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import Qdrant

# Split documents into overlapping chunks so retrieval keeps local context
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = splitter.split_documents(docs)

# Embed chunks with Titan and index them in a local Qdrant instance
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")
vectorstore = Qdrant.from_documents(chunks, embeddings, url="http://localhost:6333")
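To see what `chunk_size` and `chunk_overlap` actually do, here's a simplified fixed-window chunker in plain Python. It's a sketch, not LangChain's implementation — `RecursiveCharacterTextSplitter` prefers paragraph and sentence boundaries — but the overlap mechanics are the same:

```python
# Simplified fixed-window chunker illustrating chunk_size / chunk_overlap.
# Each chunk starts (size - overlap) characters after the previous one,
# so the last `overlap` characters of one chunk repeat at the start of the next.
def chunk(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

pieces = chunk("a" * 1000 + "b" * 280, size=512, overlap=64)
print(len(pieces))                       # 3 chunks for 1280 characters
print(pieces[0][-64:] == pieces[1][:64]) # True — chunks share a 64-char seam
```

The shared seam is why overlap matters: a sentence that straddles a chunk boundary still appears whole in at least one chunk.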

> Retrieval & Generation


At query time, embed the user's question and retrieve the top-k chunks, then pass them as context to the LLM:


from langchain.chains import RetrievalQA
from langchain_community.llms import Bedrock

bedrock_llm = Bedrock(model_id="anthropic.claude-v2")  # generation model
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
chain = RetrievalQA.from_chain_type(llm=bedrock_llm, retriever=retriever)
answer = chain.run("What are the key benefits of serverless architecture?")
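Under the hood, "embed the question and retrieve top-k" is just nearest-neighbor search by cosine similarity. A minimal sketch with toy 3-dimensional vectors (the real pipeline uses Titan embeddings and Qdrant's ANN index, not brute force):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query, index, k=2):
    # index: list of (chunk_text, embedding) pairs; rank by similarity to query
    scored = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

index = [
    ("serverless scales to zero", [0.9, 0.1, 0.0]),
    ("kubernetes networking",     [0.1, 0.9, 0.1]),
    ("lambda cold starts",        [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], index))  # the two serverless-related chunks rank first
```

The retrieved chunks are then stuffed into the prompt as context, which is all `RetrievalQA` does behind the scenes.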

> Lessons Learned


  1. **Chunk size matters** — too small loses context, too large dilutes relevance.
  2. **Metadata filtering** speeds up retrieval dramatically on large corpora.
  3. **Hybrid search** (keyword + vector) outperforms pure vector search in many domains.
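On the hybrid-search point: one common way to merge a keyword ranking (e.g. BM25) with a vector ranking is reciprocal rank fusion (RRF). This is a sketch of the general technique, not a claim about any particular library's implementation:

```python
def rrf(rankings, k=60):
    # rankings: list of ranked doc-id lists, one per retriever
    # (e.g. one from BM25 keyword search, one from vector search).
    # Each doc scores 1 / (k + rank); documents ranked well by both
    # retrievers accumulate the highest combined score.
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]   # hypothetical BM25 ranking
vector_hits  = ["doc1", "doc5", "doc3"]   # hypothetical vector ranking
print(rrf([keyword_hits, vector_hits]))   # doc1 and doc3 rise to the top
```

Documents that appear in both lists float above documents that only one retriever found, which is exactly the behavior that makes hybrid search robust across domains.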

RAG is not a silver bullet, but it's the right tool for most knowledge-intensive enterprise AI use cases.