Vector Databases - 101
Vector Databases
- Storage for embedding data (mathematical representations of meaning), held in a vector data type.
- Well suited to semantic queries that ask about similarity and relatedness.
- Acts as a memory from which an LLM retrieves data.
Embedding Flow
Technical
- Stores
  - Vectors
  - Metadata
  - Original content
- Supports
  - Fast similarity search
  - Filtering
  - Scalable retrieval
- Indexing
  - Locality-Sensitive Hashing (LSH): Similar vectors are more likely to share the same hash code.
  - Hierarchical Navigable Small World (HNSW): Organizes vectors into a hierarchical graph of layers. Each vector is assigned to higher layers with decreasing probability, so the upper layers are sparse and the bottom layer holds every vector.
  - Approximate Nearest Neighbors Oh Yeah (Annoy): Organizes high-dimensional data using a forest of binary trees.
- Measures similarity with a distance function
  - Cosine similarity
  - Euclidean distance
  - Dot product
- Hybrid scoring
  - vector_score * 0.7 + keyword_score * 0.3
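
The similarity measures and the hybrid score listed above can be sketched in plain Python. This is a minimal illustration, not any particular database's implementation: the function names and sample vectors are assumptions made for the example, and the 0.7/0.3 weights are simply the ones from the formula above.

```python
import math

def dot_product(a, b):
    # Sum of element-wise products; larger means more aligned (and/or longer) vectors.
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    # Straight-line distance; smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Dot product of the two vectors after normalizing their lengths.
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot_product(a, b) / (norm_a * norm_b)

def hybrid_score(vector_score, keyword_score, vector_weight=0.7, keyword_weight=0.3):
    # Weighted blend of a semantic score and a keyword score (weights are assumptions).
    return vector_score * vector_weight + keyword_score * keyword_weight

query = [1.0, 0.0]  # hypothetical query embedding
doc = [1.0, 1.0]    # hypothetical document embedding

print(dot_product(query, doc))         # 1.0
print(euclidean_distance(query, doc))  # 1.0
print(round(cosine_similarity(query, doc), 4))  # 0.7071
print(round(hybrid_score(0.8, 0.5), 2))         # 0.71
```

Note that cosine similarity and dot product are "higher is better" scores, while Euclidean distance is "lower is better", so a hybrid formula should only combine scores that point in the same direction and share a comparable scale.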
Cost of Vector DB
- Large storage footprint for vectors
- RAM-heavy
- Indexing is complex
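
To get a feel for the storage cost, the raw size of the vectors alone can be estimated as number of vectors × dimensions × bytes per component. The sketch below assumes 32-bit floats (4 bytes each) and uses 1,000,000 vectors of 768 dimensions purely as an illustrative workload; it ignores index structures, metadata, and original content, which all add to the total.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_component=4):
    # Raw vector payload only; 4 bytes per component assumes float32.
    return num_vectors * dimensions * bytes_per_component

size = raw_vector_bytes(1_000_000, 768)  # hypothetical workload
print(size)                         # 3072000000
print(f"{size / 2**30:.2f} GiB")    # 2.86 GiB
```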
Core Architecture of a Vector Database
- Ingestion Layer - Consumes the incoming data
  - Raw data
  - Vectors
  - Metadata
- Indexing Layer - Builds Approximate Nearest Neighbor (ANN) indexes, using graph-based and clustering-based methods
  - HNSW
  - IVF (inverted file index)
  - PQ (product quantization)
- Storage Layer
  - Vectors
  - Metadata
  - IDs
- Query Engine
  - Takes a query vector
  - Applies filters
  - Selects the top K
  - Returns the most similar items
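
The layers above can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not how a production vector database works: the class `TinyVectorStore` and its methods are hypothetical, and an exact brute-force scan with cosine similarity stands in for the ANN index (HNSW, IVF, PQ) that the indexing layer would normally build.

```python
import math

class TinyVectorStore:
    """Hypothetical toy store; brute-force search replaces a real ANN index."""

    def __init__(self):
        # Storage layer: id -> (vector, metadata)
        self._records = {}

    def add(self, item_id, vector, metadata=None):
        # Ingestion layer: accept a vector plus optional metadata.
        self._records[item_id] = (list(vector), dict(metadata or {}))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector, top_k=3, filters=None):
        # Query engine: filter on metadata, score every remaining item, keep the top K.
        filters = filters or {}
        scored = []
        for item_id, (stored, metadata) in self._records.items():
            if any(metadata.get(key) != value for key, value in filters.items()):
                continue
            scored.append((self._cosine(vector, stored), item_id))
        scored.sort(reverse=True)  # highest similarity first
        return [item_id for _, item_id in scored[:top_k]]

store = TinyVectorStore()
store.add("a", [1.0, 0.0], {"lang": "en"})
store.add("b", [0.9, 0.1], {"lang": "en"})
store.add("c", [0.0, 1.0], {"lang": "fr"})
store.add("d", [1.0, 0.05], {"lang": "fr"})

print(store.query([1.0, 0.0], top_k=2))                           # ['a', 'd']
print(store.query([1.0, 0.0], top_k=2, filters={"lang": "en"}))   # ['a', 'b']
```

The brute-force scan compares the query against every stored vector, which is exactly the cost an ANN index is meant to avoid at scale.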
Vector DB Usage
- Semantic Search
- Recommendation engines
- AI agents with memory
- Document QA
- Similarity matching
- Fraud detection
- Image and audio search
Vector DB tools
- Dedicated vector databases:
  - Chroma
  - LanceDB
  - Milvus
  - Weaviate
  - Pinecone
- Data stores that support vector search:
  - PostgreSQL (pgvector)
  - Cassandra
  - ClickHouse
  - OpenSearch
  - Elasticsearch
  - Redis
