Embeddings

Overview

This guide explains how to implement semantic search for vCon conversation content in Supabase PostgreSQL using the pgvector extension for vector similarity search.


Architecture

High-Level Flow

vCon Content → Embedding API → Vector (array of floats) → PostgreSQL (pgvector) → Similarity Search

Components

  1. pgvector Extension - PostgreSQL extension for vector similarity

  2. Embedding Service - OpenAI, Sentence Transformers, or custom

  3. Vector Storage - New columns in Supabase tables

  4. Similarity Functions - Cosine similarity, L2 distance, inner product

  5. Indexes - HNSW or IVFFlat indexes for fast retrieval


Step 1: Enable pgvector Extension

In Supabase Dashboard

Verify Installation
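In the Supabase Dashboard, enable the extension under Database → Extensions (search for "vector"), or run the SQL below. A minimal sketch; the verification query uses the standard `pg_extension` catalog:

```sql
-- Enable pgvector (Dashboard: Database → Extensions → "vector", or via SQL)
create extension if not exists vector;

-- Verify: should return one row with the installed version
select extname, extversion
from pg_extension
where extname = 'vector';
```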


Step 2: Add Vector Columns to Schema

Add Embedding Columns

Alternative: Dedicated Embeddings Table

For more flexibility, create a separate embeddings table:
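A sketch of such a table, sized at 384 dimensions to match this repo's standard. The table and column names (`vcon_embeddings`, a `vcons` table with a `uuid` primary key) are illustrative assumptions, not a fixed schema:

```sql
-- One row per embedded chunk of vCon content; 384 dims matches this repo
-- (table/column names are illustrative assumptions)
create table if not exists vcon_embeddings (
  id          uuid primary key default gen_random_uuid(),
  vcon_uuid   uuid not null references vcons (uuid) on delete cascade,
  source      text not null,           -- e.g. 'subject', 'dialog', 'analysis'
  content     text not null,           -- the exact text that was embedded
  embedding   vector(384) not null,
  created_at  timestamptz not null default now()
);
```

A dedicated table lets one vCon carry many embeddings (per dialog turn, per analysis) without widening the main tables.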


Step 3: Create Vector Indexes

HNSW Parameters:

  • m: Max connections per layer (default: 16, higher = more accurate but slower)

  • ef_construction: Size of dynamic candidate list (default: 64, higher = better recall)
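With those parameters, an HNSW index for cosine search might look like this (assuming the hypothetical `vcon_embeddings` table from Step 2):

```sql
-- HNSW index for cosine distance (the <=> operator)
create index on vcon_embeddings
  using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);
```

At query time, `set hnsw.ef_search = 100;` trades speed for recall without rebuilding the index.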

IVFFlat Index (For Large Datasets)
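A hedged sketch; pgvector's guidance is to build IVFFlat only after the table has data, with `lists` around rows/1000 (or sqrt(rows) past ~1M rows):

```sql
-- IVFFlat: build AFTER loading data so the cluster centers are meaningful
create index on vcon_embeddings
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 1000);

-- At query time, probing more lists improves recall at some speed cost
set ivfflat.probes = 10;
```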

When to Use Which Index:

  • HNSW: < 1M vectors, need high accuracy, have memory

  • IVFFlat: > 1M vectors, can trade accuracy for speed

  • No Index: < 10K vectors (brute force is fast enough)


Step 4: Generate Embeddings

Option A: Using OpenAI API
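A minimal sketch assuming the `OPENAI_API_KEY` environment variable and the repo's 384-dim standard (`text-embedding-3-small` accepts a `dimensions` parameter). `to_pgvector_literal` is a hypothetical helper for formatting vectors for SQL inserts; the OpenAI import is deferred so the helper works without the package installed:

```python
import os


def to_pgvector_literal(vec):
    """Format a list of floats as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def embed_texts(texts, model="text-embedding-3-small", dimensions=384):
    """Embed a batch of strings via the OpenAI API (requires `pip install openai`)."""
    from openai import OpenAI  # imported lazily; network call happens here
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    resp = client.embeddings.create(model=model, input=texts, dimensions=dimensions)
    return [item.embedding for item in resp.data]
```

Batching many texts per `embeddings.create` call is substantially cheaper in round-trips than one call per text.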

Option B: Using Sentence Transformers (Local/Self-Hosted)
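A sketch assuming the `all-MiniLM-L6-v2` model, which outputs 384-dim vectors matching the repo's standard. The import is deferred so the pure normalization helper stays usable on its own:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so cosine similarity reduces to a dot product."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def embed_local(texts):
    """Encode locally (requires `pip install sentence-transformers`); no API cost."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim output
    return model.encode(texts, normalize_embeddings=True).tolist()
```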

Option C: Batch Processing with Edge Functions (Preferred in this repo)

See docs/INGEST_AND_EMBEDDINGS.md for the production-ready function (supabase/functions/embed-vcons/index.ts), environment variables, and Cron scheduling. This repository standardizes on 384‑dim embeddings to match the migrations and HNSW index.


Step 5: Semantic Search Queries

Similarity Operators in pgvector
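pgvector exposes three distance operators: `<->` (L2 distance), `<#>` (negative inner product), and `<=>` (cosine distance). Pure-Python equivalents, useful for sanity-checking query results:

```python
import math


def l2_distance(a, b):
    """Euclidean distance; pgvector operator: <->"""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """pgvector's <#> returns the NEGATIVE inner product (so ORDER BY ascends)."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; pgvector operator: <=>"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)
```

A typical query orders by the operator ascending, e.g. `ORDER BY embedding <=> $1 LIMIT 10`.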

Python Implementation


Step 6: Hybrid Search (Semantic + Exact)

SQL Function for Hybrid Search with Tags-from-Attachments
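A hypothetical sketch: semantic ranking with an optional tag filter drawn from vCon attachments. The `vcon_embeddings` and `attachments` tables, their columns, and the `'tags'` attachment type are all illustrative assumptions about this schema:

```sql
create or replace function hybrid_search(
  query_embedding vector(384),
  required_tag    text default null,
  match_count     int  default 10
)
returns table (vcon_uuid uuid, content text, similarity float)
language sql stable as $$
  select e.vcon_uuid,
         e.content,
         1 - (e.embedding <=> query_embedding) as similarity
  from vcon_embeddings e
  where required_tag is null
     or exists (
          select 1 from attachments a
          where a.vcon_uuid = e.vcon_uuid
            and a.type = 'tags'
            and a.body ? required_tag   -- jsonb key/array-element test
        )
  order by e.embedding <=> query_embedding
  limit match_count;
$$;
```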

Python Implementation
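Client-side, a common pattern is to fetch semantic and keyword/tag scores separately and blend them. A minimal sketch with a hypothetical `hybrid_rank` helper; `alpha` weights the semantic side:

```python
def hybrid_rank(rows, alpha=0.7):
    """Blend semantic and keyword scores into one ranking.

    Each row is (id, semantic_score in [0, 1], keyword_score in [0, 1]);
    alpha is the weight given to the semantic score.
    """
    scored = [(rid, alpha * sem + (1 - alpha) * kw) for rid, sem, kw in rows]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

With `alpha=0.7`, an exact tag/keyword match can still outrank a slightly better semantic match, which is usually what "precision" means in hybrid search.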


Step 7: Automatic Embedding Generation

Trigger for Automatic Embedding on Insert
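A Postgres trigger cannot call an embedding API directly, so the usual pattern is to enqueue work for a background job (or the Edge Function on a Cron) to drain. A sketch, assuming a `vcons` table with a `uuid` column:

```sql
-- Queue of vCons awaiting embedding (names are illustrative)
create table if not exists embedding_queue (
  vcon_uuid  uuid primary key,
  queued_at  timestamptz not null default now()
);

create or replace function enqueue_embedding() returns trigger
language plpgsql as $$
begin
  insert into embedding_queue (vcon_uuid)
  values (new.uuid)
  on conflict (vcon_uuid) do nothing;   -- idempotent on re-insert
  return new;
end;
$$;

create trigger vcon_embed_on_insert
  after insert on vcons
  for each row execute function enqueue_embedding();
```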

Background Job for Batch Embedding


Step 8: Performance Optimization

Query Optimization

Embedding Dimension Reduction
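For models trained with Matryoshka-style objectives (e.g. the `text-embedding-3-*` family), embeddings can be shortened by truncating and re-normalizing; note this is not valid for arbitrary models. A small sketch:

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length.

    Only principled for Matryoshka-trained models (e.g. text-embedding-3-*),
    whose leading dimensions carry the most information.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```

Halving dimensions roughly halves index size and speeds up distance computations, at a modest recall cost.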

Caching Strategy


Step 9: Monitoring & Maintenance

Track Embedding Coverage
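A sketch of a coverage query, again assuming the illustrative `vcons` / `vcon_embeddings` schema:

```sql
-- Share of vCons that have at least one embedding
select
  count(*)                                            as total_vcons,
  count(*) filter (where e.vcon_uuid is not null)     as embedded,
  round(100.0 * count(*) filter (where e.vcon_uuid is not null)
        / greatest(count(*), 1), 1)                   as pct_embedded
from vcons v
left join (select distinct vcon_uuid from vcon_embeddings) e
  on e.vcon_uuid = v.uuid;
```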

Monitor Search Performance

Index Maintenance


Cost Considerations

OpenAI Embedding Costs

  • text-embedding-ada-002: $0.0001 per 1K tokens (~750 words)

  • text-embedding-3-small: $0.00002 per 1K tokens

  • text-embedding-3-large: $0.00013 per 1K tokens

Example Costs:

  • 100K vCons with 200-word subjects: ~$2.67 (ada-002)

  • 1M dialog messages averaging 50 words: ~$6.67 (ada-002)
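These estimates follow from the ~750 words per 1K tokens rule of thumb (about 4/3 tokens per word). A quick calculator for sanity-checking budgets:

```python
def embedding_cost_usd(n_items, words_per_item, price_per_1k_tokens,
                       tokens_per_word=4 / 3):
    """Rough embedding cost using the ~750 words per 1K tokens rule of thumb."""
    total_tokens = n_items * words_per_item * tokens_per_word
    return total_tokens / 1000 * price_per_1k_tokens
```

For example, 100K vCons × 200 words at ada-002's $0.0001/1K tokens comes to about $2.67.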

Self-Hosted Alternative

Use Sentence Transformers locally:

  • No API costs

  • Faster for batch processing

  • 384-768 dimensions (vs 1536 for ada-002)

  • Slightly lower accuracy


Summary

Key Decisions

  1. Embedding Model

    • OpenAI API (ada-002 / text-embedding-3-*): Best accuracy, per-token API costs

    • Sentence Transformers: Free, good accuracy, self-hosted

  2. Storage Strategy

    • Dedicated embeddings table (recommended)

    • Embedded in existing tables (simpler)

  3. Index Type

    • HNSW: < 1M vectors, high accuracy

    • IVFFlat: > 1M vectors, faster but less accurate

  4. Search Strategy

    • Pure semantic: Best for natural language queries

    • Hybrid: Combine with tags and keywords for precision

Implementation Checklist

This architecture provides production-ready semantic search for vCon conversations in Supabase! 🚀
