Link Search Menu Expand Document

RAG - 101

LLM

rag

RAG (Retrieval-Augmented Generation)

  • Combines information retrieval with a language model to generate accurate, relevant, less hallucination, private-secure and up-to-date answers.
  • RAG = Retrieval + Augmented + Generation
  • Instead of relying only on what the model was trained on, RAG lets the model look up information from trusted sources before generating a response.
  • Retrieves more accurate, contextually appropriate, relevant data from a knowledge base
  • Uses an LLM to generate a response based on that context
  • RAG systems augment LLM model’s generative capabilities with real-time retrieval of information, ensuring responses are fluent, factually grounded, and relevant.
  • Implementing RAG involves considerations such as the size of the context window and the need for data retrieval and storage in appropriate formats

drawing

  • Core components:
    • Embedding generator
    • Retriever module
    • Prompt Constructor
    • Generator (LLM model)

Schema

RAG Pipeline

  • Tech side
    • Raw docs
    • Document Processing
      • Chunking
      • add metadata
    • Embedding generation each chunk
    • Vectors and original text stored at each DB
  • User side
    • Query Processing
    • Query Embedding
    • Retrieval module
      • Hybrid search (Similarity)
      • Maximum Marginal Relevance (MMR)
        • Top-k chunks
          • BM25 Search (Top-30)
          • Vector Search (Top-10)
          • RRF Fusion (up to Top-40)
        • Reranker
          • Jina Cross-Encoder Scoring and top-10 Results
          • Most Diverse
    • Context Formatting
      • Organize with Metadata
      • Apply instructions (format, tone, constraints)
  • LLM model

What RAG system do:

  • Chunk documents
  • Turns documents into searchable vectors (embed chunks)
  • Finds information using semantic search (retrive top-k chunks)
  • Sends relevant context to the LLM
  • Generates accurate answeres from the data

RAG type

Type Details Examples
Standard Basic retrieval + context for Q&A answering questions from documents (fact-based Q&A, knowledge base assistants)
Fusion multiple queries → better retrieval results combining multiple retrieval strategies for better coverage (large-scale enterprise search)
Corrective verifies queries or fixes responses reducing hallucinations and fixing incorrect outputs (high-stakes domains like legal or healthcare)
Agentic step-by-step reasoning loop with tools automating workflows (research assistants, task automation, multi-step reasoning)
Modular flexible pipeline (swap components) building flexible systems that can be customized per use case (AI platforms, SaaS tools)
Multimodal works with text, images, etc. working with images, audio, and text together (medical reports, visual QA systems)
Graph knowledge graph understanding relationships in structured data (fraud detection, knowledge graphs), cross-document reasoning and multi-hop queries
Hybrid combine between dense + sparse search and reranker improving search accuracy in enterprise systems (BM/25 keyword - lexical search + semantic search)
Multi-query   handling vague or ambiguous user queries (customer support bots)
Re-ranking   improving result quality in search engines and recommendation systems
Adaptive The system adaptively decides whether retrieval is needed based on the query type, and proceeds accordingly  

Techical

Name Details Examples
Document crawler   - Scrapy ✅
- BeautifulSoup ✅
Document parsing - PDF corpus - pdfplumber ✅
- Docling ✅
- Crawler logic ✅
- Camelot
- Tabula
Tokenization Convert text into numerical or basic text as tokens - SentencePiece
- GPT’s BPE tokenizer
Embedding - Convert the text into vectors
- Converting text into vector representations
- Captures the semantic meaning of the data
- Make documents searchable using similiarity
- OpenAI - text-embedding-3-*
- Azure OpenAI
- DashScope - text-embedding-v3
- Bedrock - amazon.titan-embed-text-v2
- Vertex AI - text-embedding-005
- Voyage AI - voyage-3
- Cohere - embed-english-v3.0
- SiliconFlow - BAAI/bge-large-zh-v1.5
- Hugging Face - Any TEI-served model ✅
VectorDB - Pecific database to store vectors
- Allows fast semantic search
- ChromaDB
- Milvus ✅
- DynamoDB
- OpenSearch
- Qdrant
- pgvector ✅
- Pinecone
- Weaviate
Sematic Search   - FAISS ✅
Reranking Function   - Weighted Ranker ✅
- RRF Ranker ✅
- Boost Ranker ✅
- Decay Ranker ✅
- Model Ranker ✅
Reranking Model   - TEI ✅
- Cohere
- Voyage AI
- SiliconFlow
LLM Model   - GPT
- T5 ✅
- LLaMA ✅
- Falcon
- Mistral ✅
- Phi ✅
- Gemini
Model Alignment - Supervised Fine-Tuning (SFT)
- Reinforcement Learning from Human Feedback (RLHF)
- Safety & Constitutional AI
- Train on high-quality human-annotated datasets (InstructGPT, Alpaca, Dolly)
- Generate responses, rank outputs, train a Reward Model (PPO), and refine using Proximal Policy Optimization (PPO)
- Apply RLAIF, adversarial training, and bias filtering
Model Optimization - Compression & Quantization
- API Serving & Scaling - Reduce model size with GPTQ, AWQ, LLM.int8(), and Knowledge Distillation
- Deploy with vLLM, Triton Inference Server, TensorRT, ONNX, and Ray Serve for efficient inference
- FastAPI ✅
- vLLM ✅
Model collection   - Hugging Face ✅
Pretraining - Causal Language Modeling (CLM) with Cross-Entropy Loss
- Gradient Checkpointing
- Parallelization (FSDP, ZeRO)
 
Optimizations - Apply Mixed Precision (FP16/BF16)
- Gradient Clipping
- Adaptive Learning Rate Schedulers for efficiency
 
Evaluation & Benchmarking - Benchmarking performance
- Red-Teaming
- Adversarial Testing
- HumanEval
- HELM
- OpenAI Eval
- MMLU
- ARC
- MT-Bench
Orchestrator - Orchestrated conversation flow & intent detection - LangChain ✅
- LangGraph ✅
- LangSmith
- LlamaIndex ✅
- Ollama ✅
- vLLM ✅
- Azure AI Foundry
AI Agents   - Microsoft Autogen
- Agno
- OpenAI Agent Kit
- CrewAI

LLM Aided Retrieval - Optimizing RAG

  • Chunking Strategy
  • Hybrid Search: (combination of traditional keyword retrieval (BM25) + dense (vector-based) retrieval)
  • Metadata Filtering: use tag, source, type, date, etc.
  • Prompt Engineering
  • Caching Results: speeds up repeated queries
  • Multi-hop RAG: breaks complex queries into sub-questions and retrieves and reasons across RAG.
  • Query rewriting: Improve vague user queries
  • Self-RAG: Model checks its own output
  • SelfQuery: where we use an LLM to convert the user question into a query.
    • Core component: information » Query Parser » Search term + Filter
  • Compression: Using LLM we could increase the number of results we can put in the context by shrinking the responses to only the relevant information.

RAG Optimization

  • Everything starts with data quality.
  • Poor chunking leads to:
    • Too large → noisy, low precision
    • Too small → missing context
  • Hybrid search (semantic + keyword) becomes essential
  • Re-ranking is critical to prioritize true relevance—not just similarity
  • Enforce context. Too much information leads to the “lost in the middle” effect.
  • LLMs may ignore retrieved context unless prompts enforce:
    • Grounding
    • Citations
    • Context adherence
  • Continuous feedback loops
  • Performance monitoring
  • Adaptive tuning to evolving data