Unstructured Note for LLMs and AI Agents (potential to split)
Solutions
- Tools
- RAG (Retrieval-Augmented Generation)
- Vector Databases
Tools
- LangChain – the most widely used framework, with a massive ecosystem. Provides abstractions for models, tools, memory, chains, agents, retrievers, and vector stores.
- LangGraph – the next evolution for production-grade agent control. Solves LangChain's main limitation in agent control flow: LangChain chains are linear, while LangGraph is graph-based.
- Ollama – Run powerful LLMs locally on your own hardware with a single command.
- Langflow – A drag-and-drop visual builder for designing and deploying AI agents and RAG workflows.
- CrewAI – role-based multi-agent collaboration, built on three core abstractions: agents, tasks, and crews.
- AutoGen – Microsoft's framework for building conversational, collaborative, and autonomous multi-agent systems.
- Agno – lightweight and performance-focused framework.
- LlamaIndex – a knowledge/data-centric agent framework designed to connect LLM agents with structured and unstructured data; acts as a knowledge orchestration layer.
- Flowise – a no-code visual agent orchestration framework.
- n8n – agent orchestration built as a node-based workflow automation system, where each node performs an operation.
- Relevance AI – enterprise-focused agent platform with capabilities in knowledge integration, workflow automation, and operational decision-making, aimed specifically at business operations.
- OpenClaw – The always-on personal AI agent that lives on your device and talks to you through WhatsApp, Telegram, and 50+ other platforms.
- Open WebUI – A self-hosted, offline-capable ChatGPT alternative
RAG (Retrieval-Augmented Generation)
Combines information retrieval with a language model to generate accurate, grounded answers.
- retrieves relevant data from a knowledge base
- uses an LLM to generate a response based on that context
RAG Pipeline
- Raw docs
- Chunking
- Embedding each chunk
- Vector DB
- Query embedding
- Similarity search
- Top-k chunks
- Reranker
- LLM
What a RAG system does:
- Chunks documents
- Turns documents into searchable vectors (embeds chunks)
- Finds information using semantic search (retrieves top-k chunks)
- Sends relevant context to the LLM
- Generates accurate answers from the data
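The steps above can be sketched end to end. This is a toy illustration only: a bag-of-words "embedding" and an in-memory list stand in for a real embedding model and vector DB, so it runs with the standard library alone.

```python
# Toy RAG pipeline: chunk -> embed -> store -> search -> build prompt.
import math
from collections import Counter

def embed(text):
    # Toy embedding: sparse word-count vector (real systems use ML models).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Raw docs -> chunks (one sentence per chunk here).
doc = "RAG retrieves context. Embeddings capture meaning. Chunking splits documents."
chunks = [s.strip() for s in doc.split(".") if s.strip()]

# 2-3. Embed each chunk and store it in an in-memory "vector DB".
index = [(chunk, embed(chunk)) for chunk in chunks]

# 4-6. Embed the query, run similarity search, keep the top-k chunks.
query = "what do embeddings capture"
qvec = embed(query)
top_k = sorted(index, key=lambda item: cosine(qvec, item[1]), reverse=True)[:2]

# 7. Send the retrieved context to the LLM (here we just build the prompt).
context = "\n".join(text for text, _ in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Swapping `embed` for a real model and `index` for a vector DB gives the production shape of the same pipeline.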
Agentic RAG
Introduces AI agents that can make decisions, select tools, and even refine queries for more accurate and flexible responses.
How Agentic RAG works at a high level:
- The user query is directed to an AI Agent for processing.
- The agent uses short-term and long-term memory to track query context. It also formulates a retrieval strategy and selects appropriate tools for the job.
- The data fetching process can use tools such as vector search, multiple agents, and MCP servers to gather relevant data from the knowledge base.
- The agent then combines the retrieved data with the query and system prompt, and passes it to the LLM.
- The LLM processes the optimized input to answer the user's query.
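A minimal sketch of that control loop. The tool functions and keyword-based routing below are hypothetical stand-ins; real agents typically let the LLM itself choose tools and run real retrieval.

```python
# Minimal agentic-RAG control loop (illustrative, not a framework API).

def vector_search(query):
    # Stand-in for a vector-DB lookup against the knowledge base.
    return [f"doc chunk about: {query}"]

def web_search(query):
    # Stand-in fallback tool for fresh information.
    return [f"web result for: {query}"]

def agent_answer(query, memory):
    memory.append(query)  # short-term memory of the session
    # Formulate a retrieval strategy and pick a tool (naive routing here).
    tool = web_search if "latest" in query.lower() else vector_search
    context = tool(query)  # fetch data with the chosen tool
    # Combine system prompt + retrieved context + query for the LLM.
    return (
        "System: answer only from the context.\n"
        f"Context: {context}\n"
        f"Query: {query}"
    )  # in a real system, this prompt is sent to the LLM

memory = []
prompt = agent_answer("What is chunk overlap?", memory)
```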
Technical
- Embedding – converting text into vector representations.
    - Captures the semantic meaning of the data.
    - Makes documents searchable by similarity.
- VectorDB – a database purpose-built to store vectors.
    - Allows fast semantic search.
    - Examples:
        - ChromaDB
RAG Chunking
- Chunking decides what knowledge your system is allowed to see.
- If you split text by token count, you’re not building retrieval. You’re breaking meaning.
- When chunking is wrong, no vector database or reranker can save you.
- Real RAG chunking is about:
- Preserving ideas, not lines
- Respecting document structure
- Using semantic boundaries instead of arbitrary cuts
- Adding overlap so context doesn’t vanish
- Treating every chunk as a standalone knowledge unit
- When chunking is right:
- Retrieval improves
- Hallucinations drop
- Answers become precise
- Costs go down
- When chunking is wrong:
- Retrieval fails
- Hallucinations increase
- Context gets fragmented
- Token costs explode
RAG Chunking Parameters
- Chunk Size – measured in tokens or characters.

| Use Case | Recommended Size |
|---|---|
| FAQs | 200-400 tokens |
| Documentation | 400-800 tokens |
| Legal / Contracts | 800-1200 tokens |
| Code | 200-500 tokens |

- Embeddings lose precision after ~800 tokens
- The model gets polluted with noise
- Overlap – chunks must overlap so knowledge isn't cut.
    - Overlap preserves:
        - Definitions
        - Cross-sentence logic
        - References
    - Typical overlap: 10-25%
- Chunking Strategies
    - Fixed-Size Chunking
    - Sentence-Based Chunking
    - Semantic Chunking
        - Chunks are split when the topic changes.
        - Uses sentence embeddings and cosine similarity; break where similarity drops.
        - This produces concept-aligned chunks and self-contained knowledge blocks.
    - Document-Structure Chunking
        - Split by:
            - Headers
            - Sections
            - Paragraphs
            - Bullet groups
        - Use cases: docs, wikis, policies, research papers
    - Hybrid Chunking
        - Best practice:
            - Split by document structure
            - Inside each section, apply semantic chunking
            - Apply size limits + overlap
        - This creates:
            - Logically coherent chunks
            - Embedding-friendly sizes
            - Retrieval-optimized knowledge blocks
- Chunk Metadata
    - Each chunk stores:
        - id
        - document name
        - section
        - page
        - token count
        - etc.
    - Metadata enables:
        - Filtering
        - Source citation
        - Page-level grounding
        - Reranking
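A fixed-size chunker with overlap and per-chunk metadata, matching the parameters above. This is an illustrative sketch that splits on words; production pipelines usually count tokens with the embedding model's tokenizer.

```python
# Fixed-size chunking with overlap; each chunk carries minimal metadata.

def chunk_words(text, size=50, overlap=10):
    """Split text into `size`-word chunks where consecutive chunks share `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        window = words[start:start + size]
        chunks.append({
            "id": len(chunks),    # chunk metadata: id for citation
            "start_word": start,  # position metadata for grounding
            "text": " ".join(window),
        })
    return chunks

text = " ".join(f"word{i}" for i in range(120))
chunks = chunk_words(text, size=50, overlap=10)  # 20% overlap
```

Because consecutive windows share 10 words, a definition that straddles a boundary still appears whole in at least one chunk.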
RAG Failure Root Cause
| Failure | Root Cause |
|---|---|
| Model makes things up | Missing chunk |
| Wrong answer | Chunk too small |
| Vague answer | Chunk too large |
| High cost | Over-long chunks |
| Low recall | Chunk boundaries break meaning |
Chunking is bad when:
- You ask about X ("Where is X defined?", "How does X work?")
- Answers come back vague or half-correct
- Details are missing
Chunk Size vs. Retrieval Accuracy
| Chunk size | Retrieval | LLM Quality |
|---|---|---|
| Too small | High recall, low precision | Fragmented answers |
| Too large | Low recall | Irrelevant context |
| Just right | High recall + precision | Clean answers |
RAG for Specific Data
- Tables
    - Stored as CSV-like text, or one chunk per table
- Code
    - Chunk by function, class, or file
- PDFs
    - Page -> Section -> Paragraph
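The "chunk code by function/class" idea can be sketched with Python's standard `ast` module, which gives one self-contained chunk per top-level definition:

```python
# Chunk Python source code by function/class using the stdlib ast module.
import ast

def chunk_python_source(source):
    """Return one chunk per top-level function or class definition."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,                             # metadata for retrieval
                "kind": type(node).__name__,                   # FunctionDef / ClassDef
                "text": ast.get_source_segment(source, node),  # the chunk itself
            })
    return chunks

src = (
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "class Greeter:\n"
    "    def hi(self):\n"
    "        return 'hi'\n"
)
chunks = chunk_python_source(src)
```

Each chunk keeps a complete, syntactically valid unit, which is exactly what "treat every chunk as a standalone knowledge unit" asks for.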
Vector Databases
- Storage for embedding data (mathematical representations of meaning), held in a vector data type.
- Powerful for solving semantic queries: questions about similarity and relatedness.
- The DB acts as the memory an LLM pulls its data from.
Technical
- Stores
- Vectors
- Metadata
- Original content
- Supports
- Fast similarity search
- Filtering
- Scalable retrieval
- Measures similarity with a distance function:
    - Cosine similarity
    - Euclidean distance
    - Dot product
- Hybrid scoring (blends semantic and keyword relevance):
    - vector_score * 0.7 + keyword_score * 0.3
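The three distance functions and the hybrid score above, sketched in plain Python (in practice the vector DB itself, or NumPy, computes these):

```python
# Distance functions for vector similarity, plus the hybrid blend.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hybrid_score(vector_score, keyword_score):
    # The weighted blend from the note: 70% semantic, 30% keyword.
    return vector_score * 0.7 + keyword_score * 0.3
```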
Costs of a Vector DB
- Large storage footprint for vectors
- RAM-heavy
- Indexing is complex
Core Architecture of a Vector Database
- Ingestion Layer – consumes the data:
    - Raw data
    - Vectors
    - Metadata
- Indexing Layer – builds Approximate Nearest Neighbor (ANN) indexes, using graph- and clustering-based structures:
    - HNSW
    - IVF
    - PQ
- Storage Layer – persists:
    - Vectors
    - Metadata
    - IDs
- Query Engine – takes:
    - A query vector
    - Filters
    - Top-k
    - Returns the most similar items
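A brute-force sketch of the query engine: apply metadata filters, score by cosine similarity, return top-k. Real vector DBs replace the linear scan with an ANN index such as HNSW or IVF; the store and records here are illustrative.

```python
# Minimal query engine over an in-memory store (brute-force, not ANN).
import math

store = [  # each record: id, vector, metadata
    {"id": "a", "vec": [1.0, 0.0], "meta": {"lang": "en"}},
    {"id": "b", "vec": [0.9, 0.1], "meta": {"lang": "de"}},
    {"id": "c", "vec": [0.0, 1.0], "meta": {"lang": "en"}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def query(qvec, filters, top_k):
    # 1. Filter by metadata, 2. score by similarity, 3. return top-k ids.
    candidates = [r for r in store
                  if all(r["meta"].get(k) == v for k, v in filters.items())]
    candidates.sort(key=lambda r: cosine(qvec, r["vec"]), reverse=True)
    return [r["id"] for r in candidates[:top_k]]

results = query([1.0, 0.0], {"lang": "en"}, top_k=2)
```

Filtering before scoring is the same pattern real vector DBs expose as metadata filters on a query.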
Vector DB Usage
- Semantic Search
- Recommendation engines
- AI agents with memory
- Document QA
- Similarity matching
- Fraud detection
- Image and audio search
Vector DB tools
- Dedicated vector DBs:
    - Chroma
    - LanceDB
    - Milvus
    - Weaviate
    - Pinecone
- General-purpose databases with vector search support:
    - PostgreSQL (pgvector)
    - Cassandra
    - ClickHouse
    - OpenSearch
    - Elasticsearch
    - Redis