DocuVerse Architecture - Complete Technical Overview

01

Executive Summary

The Challenge

Traditional infrastructure struggles with the "bursty" nature of mass indexing. A crawler might sit idle for days, then require thousands of concurrent threads.

⚡

The Solution

Serverless architecture with Modal for compute orchestration and Vector Databases for high-dimensional storage enables scale-to-zero efficiency.

📈

The Result

92% cost reduction compared to traditional Kubernetes deployments while maintaining sub-200ms search latency and 15-minute index freshness.

02

The Paradigm Shift in Search

Traditional

Inverted Index

✖ Lexical gap: "car" ≠ "automobile"
✔ CPU/IO bound operations
✔ Well-understood B-Tree storage
✖ Requires manual synonymization

Elasticsearch Apache Lucene

Evolution

Modern

Vector Embeddings

✔ Semantic similarity: "car" ≈ "automobile"
✔ GPU-accelerated compute
✔ ANN algorithms (HNSW)
✔ Cross-lingual understanding

Pinecone Qdrant BERT/GPT

Resource Requirements Shift

CPU

→

GPU

Matrix multiplications require massive GPU parallelization

I/O

→

Compute

Embedding models are the new bottleneck

B-Tree

→

HNSW

Approximate Nearest Neighbor algorithms for vectors

The Infrastructure Dilemma

Resource Usage Over Time

Mon

Tue

Wed

Thu PEAK

Fri

Sat

Sun

Provisioned for Peak (95% wasted)

Provisioned for Average (latency spikes)

⚠ Provisioned Capacity

Provision for peak = pay for idle GPUs 95% of the time

Provision for average = massive latency spikes during peak

⚠ Traditional Serverless

No GPU support, 15-minute timeouts, cold starts too slow for ML models

✔ Modal Solution

Per-second billing, GPU ephemerality, millisecond container launches

03

The DocuVerse Use Case

Mission

DocuVerse is a "Universal Documentation Search Engine" for developers. A single search bar that retrieves the most relevant technical answers using RAG, regardless of where the information lives.

Data Sources

📜 Official: Python Docs, MDN, AWS 👥 Community: Stack Overflow, GitHub 🌐 Decentralized: IPFS, Arweave

Dataset Specifications

Metric	Value	Implications
Total Documents	5,000,000	Requires efficient bulk indexing strategies
Average Doc Size	4 KB (~800 tokens)	Fits within standard embedding context windows
Update Velocity	~200,000 docs/day	Incremental indexing must be robust
Vector Dimensions	1,536	OpenAI Ada-002 compatible, high-fidelity
Total Index Size	~30 GB	Vectors + Metadata storage requirements
Target Latency	<200ms search, <15min freshness	Tight constraints on ingestion pipeline

The "Matrix Link" - PageRank for Code

Beyond simple text search, DocuVerse constructs a graph of documentation relationships. How many pages link to the React useEffect hook? This "Matrix Link" boosts authoritative pages during vector retrieval.

Final_Score = (Vector_Similarity * 0.8) + (PageRank_Score * 0.2)

04

Complete Architecture Overview

The DocuVerse engine is built on four pillars: Ingestion, Processing, Memory, and Interaction. Each component is designed for serverless execution with automatic scaling.

Ingestion Layer

Processing Layer

Storage Layer

Interaction Layer

📦

Ingestion Layer

CPU-bound web crawling with politeness sharding and deduplication

modal.Queue modal.Dict BeautifulSoup

1,200 pages/sec 300 containers

⚡

Processing Layer

GPU-accelerated embedding generation with intelligent batching

e5-large A10G GPU 8-bit Quantization

4,500 docs/sec 50 GPUs

💾

Storage Layer

Serverless vector storage with bulk import and hybrid search

Pinecone S3 HNSW

10,000 vectors/sec 30GB index

🤝

Interaction Layer

RAG pipeline with reranking and authority-boosted retrieval

LangChain Cross-Encoder GPT-4

<200ms latency Hybrid search

05

Distributed Crawler Architecture

Building a crawler that handles 5 million pages without getting blocked, crashing, or entering infinite loops requires a sophisticated distributed architecture based on the Producer-Consumer Pattern.

Producer-Consumer Pattern with modal.Queue

⚠ The Problem

In a monolithic script, crawling is recursive: visit(url) -> find_links() -> visit(links)

In serverless: deep recursion = stack overflows and timeout errors

✔ The Solution

Flatten recursion into a Queue-Based Architecture where work discovery is decoupled from work execution.

Producer

Queue

Workers

State Store

💡

Why Not modal.map?

While modal.map allows parallel execution over a list, it is static - it expects inputs to be known beforehand. A crawler is dynamic: parsing Page A reveals Pages B and C. The Queue pattern is essential because it allows the workload to expand during runtime.

Politeness Sharding Strategy

A naive crawler scaling to 500 containers resembles a DDoS attack. We implement domain-based sharding:

Worker Type A

*.github.io

Limit: 5 concurrent

Worker Type B

*.readthedocs.io

Limit: 10 concurrent

Worker Type C

General Web

Limit: 100 concurrent

This ensures aggregate throughput is high while per-domain impact remains respectful of robots.txt.

06

GPU Orchestration & Embeddings

The processing layer is the most expensive phase - where Modal's value proposition is strongest. This is where we shift from network-bound (crawling) to compute-bound (embedding).

The Container Loading Advantage

🐢

Traditional (K8s)

~2 minutes

Pull 5GB image, load PyTorch model weights into GPU memory

VS

⚡

Modal Snapshot

<2 seconds

Container snapshot mounted over network, lazy loading on demand

Implication: No need for "warm pools" of GPU servers. If the crawler finds a new pocket of documentation, Modal instantly spins up 50 GPU containers and shuts them down the second the queue is empty.

Batching Strategy for Throughput

GPUs are throughput devices, not latency devices. Sending one document at a time is inefficient due to CPU-to-GPU data transfer overhead.

Input (Crawlers)

Batcher (CPU)

GPU Embedder

Model Selection & Quantization

API-Based (OpenAI)

✔ Simple to implement

✖ $0.10/1M tokens adds up

Selected

Self-Hosted (e5-large)

✔ State-of-the-art for technical text ✔ Full control over GPU costs

Quantization: Scalar (32-bit to 8-bit) = 4x size reduction with minimal recall loss

07

Vector Database Layer

The vectors produced by GPU workers need a home. We analyze two leading contenders and their integration strategies for the serverless pipeline.

📍

Pinecone Serverless

Primary Choice

✔

Separation of Concerns

Vectors stored in blob storage, loaded into index only when needed

✔

Bulk Import via S3

Write Parquet files to S3, Pinecone ingests asynchronously

✔

Hybrid Search

Dense vectors + BM25 sparse vectors for keyword matching

Use for: Production index with zero-maintenance elasticity

◈

Qdrant

Dev/Testing

✔

Open Source

Run as managed cloud or self-hosted in Modal Sandbox

✔

HNSW Optimization

Disable graph rebalancing during bulk upload, force optimization after

✔

LangChain Integration

Deep integration with QdrantVectorStore for metadata filtering

Use for: Local development without cloud costs

HNSW Algorithm Visualization

Hierarchical Navigable Small World graphs enable fast approximate nearest neighbor search:

08

Vector Search Algorithms

Understanding the algorithms behind vector search is crucial for optimizing performance. The two primary approaches - IVF (Inverted File) and HNSW (Hierarchical Navigable Small World) - offer different trade-offs between speed, accuracy, and memory usage.

📂

IVF (Inverted File Index)

IVF partitions the vector space into clusters using k-means. During search, only the nearest clusters are searched, dramatically reducing comparisons.

1

Training Phase

Run k-means on sample vectors to create centroids

2

Indexing Phase

Assign each vector to its nearest centroid's bucket

3

Search Phase

Find nearest centroids, then search only those buckets

Key Parameters

nlist - Number of clusters (typically sqrt(n))
nprobe - Clusters to search at query time

✔ Lower memory footprint

✔ Faster bulk imports

✖ Lower recall at equal speed

🌐

HNSW (Hierarchical NSW)

Recommended

HNSW builds a multi-layer graph where each layer has exponentially fewer nodes. Search starts at the top sparse layer and greedily descends to find nearest neighbors.

1

Entry Point

Start at a random node in the topmost layer

2

Greedy Search

Move to the neighbor closest to query vector

3

Layer Descent

When stuck at local minimum, descend to next layer

4

Base Layer Search

Exhaustive search in bottom layer for final candidates

Key Parameters

M - Max connections per node (16-64)
efConstruction - Build-time search width
efSearch - Query-time search width

✔ Best recall/speed ratio

✔ No training required

✖ Higher memory usage

Complexity Comparison

Operation	IVF	HNSW
Build Time	O(n * k * iterations)	O(n * log(n) * M)
Search Time	O(nprobe * n/nlist)	O(log(n) * efSearch)
Memory	O(n * d + k * d)	O(n * (d + M * layers))
Insert (Online)	O(k)	O(log(n) * M)

Algorithm Operations Visualized

Click on each operation to see the iterative flow and Python implementation.

IVF Build: O(n × k × iterations)

v1 v2 v3 ... vn

↓ k-means

C1 C2 C3

# IVF Build - K-means clustering
import numpy as np

def ivf_build(vectors, k=100, max_iter=20):
    n, d = vectors.shape

    # Initialize k centroids randomly
    centroids = vectors[np.random.choice(n, k, replace=False)]

    for iteration in range(max_iter):  # O(iterations)
        # Assign each vector to nearest centroid
        assignments = []
        for i in range(n):  # O(n)
            distances = []
            for j in range(k):  # O(k)
                dist = np.linalg.norm(vectors[i] - centroids[j])
                distances.append(dist)
            assignments.append(np.argmin(distances))

        # Update centroids
        for j in range(k):
            cluster_vecs = vectors[np.array(assignments) == j]
            if len(cluster_vecs) > 0:
                centroids[j] = cluster_vecs.mean(axis=0)

    return centroids, assignments
# Total: O(n * k * iterations)

HNSW Build: O(n × log(n) × M)

Layer 2 ○─○
Layer 1 ○─○─○─○
Layer 0 ○─○─○─○─○─○─○─○

↑ Insert each vector

# HNSW Build - Hierarchical graph construction
import random
import heapq

def hnsw_build(vectors, M=16, ef_construction=200):
    n = len(vectors)
    graph = {i: {layer: [] for layer in range(max_layer(i))}
             for i in range(n)}

    for i in range(n):  # O(n) - insert each vector
        level = random_level()  # Usually 0, rarely higher

        # Find entry point at top layer
        entry = find_entry_point()

        for layer in range(level, -1, -1):  # O(log(n)) layers
            # Find M nearest neighbors at this layer
            neighbors = search_layer(
                vectors[i], entry, ef_construction, layer
            )

            # Connect to M nearest neighbors
            for neighbor in neighbors[:M]:  # O(M) connections
                graph[i][layer].append(neighbor)
                graph[neighbor][layer].append(i)

            entry = neighbors[0]  # Best neighbor for next layer

    return graph
# Total: O(n * log(n) * M)

IVF Search: O(nprobe × n/nlist)

Query: q

↓ Find nearest clusters

C1 ✓ C2 ✓ C3

↓ Scan vectors in clusters

# IVF Search - Cluster-based retrieval
def ivf_search(query, centroids, inverted_lists, nprobe=8, k=10):
    # Step 1: Find nprobe nearest clusters
    cluster_distances = []
    for j, centroid in enumerate(centroids):  # O(nlist)
        dist = np.linalg.norm(query - centroid)
        cluster_distances.append((dist, j))

    # Sort and take top nprobe clusters
    cluster_distances.sort()
    probe_clusters = [c[1] for c in cluster_distances[:nprobe]]

    # Step 2: Search within selected clusters
    candidates = []
    for cluster_id in probe_clusters:  # O(nprobe)
        vectors_in_cluster = inverted_lists[cluster_id]
        # Each cluster has ~n/nlist vectors
        for vec_id, vector in vectors_in_cluster:  # O(n/nlist)
            dist = np.linalg.norm(query - vector)
            candidates.append((dist, vec_id))

    # Return top-k results
    candidates.sort()
    return candidates[:k]
# Total: O(nprobe * n/nlist)

HNSW Search: O(log(n) × efSearch)

Query: q

↓ Navigate layers

L2: ●→●
L1: ●→●→●
L0: ●→●→●→●→★

# HNSW Search - Greedy layer-by-layer traversal
def hnsw_search(query, graph, vectors, ef_search=64, k=10):
    # Start at top layer with entry point
    entry_point = 0  # Usually node 0
    current_best = entry_point

    # Traverse from top layer down
    for layer in range(max_layer, -1, -1):  # O(log(n)) layers
        # Greedy search at this layer
        changed = True
        while changed:
            changed = False
            # Check all neighbors of current best
            for neighbor in graph[current_best][layer]:  # O(M)
                dist = np.linalg.norm(query - vectors[neighbor])
                if dist < np.linalg.norm(query - vectors[current_best]):
                    current_best = neighbor
                    changed = True

    # At layer 0: expand search with efSearch candidates
    candidates = []
    visited = {current_best}
    heap = [(np.linalg.norm(query - vectors[current_best]), current_best)]

    while heap and len(candidates) < ef_search:
        dist, node = heapq.heappop(heap)
        candidates.append((dist, node))

        for neighbor in graph[node][0]:
            if neighbor not in visited:
                visited.add(neighbor)
                heapq.heappush(heap,
                    (np.linalg.norm(query - vectors[neighbor]), neighbor))

    return sorted(candidates)[:k]
# Total: O(log(n) * efSearch)

IVF Memory: O(n×d + k×d)

Vectors n × d floats

Centroids k × d floats

# IVF Memory Calculation
def ivf_memory_usage(n_vectors, dimension, n_clusters, dtype='float32'):
    bytes_per_float = 4 if dtype == 'float32' else 2

    # Original vectors: n * d
    vectors_memory = n_vectors * dimension * bytes_per_float

    # Cluster centroids: k * d
    centroids_memory = n_clusters * dimension * bytes_per_float

    # Inverted lists (cluster assignments): n * 4 bytes (int32)
    assignments_memory = n_vectors * 4

    total = vectors_memory + centroids_memory + assignments_memory

    print(f"Vectors:    {vectors_memory / 1e9:.2f} GB")
    print(f"Centroids:  {centroids_memory / 1e6:.2f} MB")
    print(f"Total:      {total / 1e9:.2f} GB")

    return total

# Example: 5M vectors, 1024 dims, 4096 clusters
ivf_memory_usage(5_000_000, 1024, 4096)
# Vectors:    20.48 GB
# Centroids:  16.78 MB
# Total:      20.50 GB

HNSW Memory: O(n×(d + M×layers))

Vectors n × d floats

Graph Links n × M × layers

# HNSW Memory Calculation
import math

def hnsw_memory_usage(n_vectors, dimension, M=16, ml=0.36):
    bytes_per_float = 4
    bytes_per_int = 4

    # Original vectors: n * d
    vectors_memory = n_vectors * dimension * bytes_per_float

    # Average number of layers per node
    avg_layers = 1 / (1 - ml)  # ~1.56 for ml=0.36

    # Graph connections: each node has M connections per layer
    # Layer 0: 2*M connections (bidirectional)
    # Higher layers: M connections
    links_per_node = (2 * M) + (avg_layers - 1) * M
    graph_memory = n_vectors * links_per_node * bytes_per_int

    total = vectors_memory + graph_memory

    print(f"Vectors:    {vectors_memory / 1e9:.2f} GB")
    print(f"Graph:      {graph_memory / 1e9:.2f} GB")
    print(f"Total:      {total / 1e9:.2f} GB")

    return total

# Example: 5M vectors, 1024 dims, M=16
hnsw_memory_usage(5_000_000, 1024, M=16)
# Vectors:    20.48 GB
# Graph:      0.94 GB
# Total:      21.42 GB (slightly more than IVF)

IVF Insert: O(k)

New vector: v

↓ Compare to k centroids

C1 C2 ← C3

↓ Add to inverted list

# IVF Online Insert - Fast cluster assignment
def ivf_insert(vector, centroids, inverted_lists, vector_id):
    # Find nearest cluster: O(k)
    min_dist = float('inf')
    best_cluster = 0

    for j, centroid in enumerate(centroids):  # O(k) comparisons
        dist = np.linalg.norm(vector - centroid)
        if dist < min_dist:
            min_dist = dist
            best_cluster = j

    # Add to inverted list: O(1)
    inverted_lists[best_cluster].append((vector_id, vector))

    return best_cluster

# Example: Insert 1000 new vectors
for i, vec in enumerate(new_vectors):
    cluster = ivf_insert(vec, centroids, inverted_lists, i)
    print(f"Vector {i} -> Cluster {cluster}")
# Each insert: O(k) where k = number of clusters

HNSW Insert: O(log(n) × M)

New vector: v

↓ Navigate to position

L1: find entry

L0: connect M neighbors

↓ Create M bidirectional links

# HNSW Online Insert - Graph extension
def hnsw_insert(vector, vector_id, graph, vectors, M=16, ef=200):
    # Determine max layer for new node (exponential decay)
    ml = 0.36  # Layer multiplier
    level = int(-math.log(random.random()) * ml)

    # Initialize empty adjacency lists
    graph[vector_id] = {l: [] for l in range(level + 1)}
    vectors[vector_id] = vector

    entry_point = get_entry_point()

    # Navigate from top layer down: O(log(n))
    for layer in range(get_max_layer(), level, -1):
        entry_point = greedy_search(vector, entry_point, layer)

    # Insert at each layer from level down to 0
    for layer in range(min(level, get_max_layer()), -1, -1):
        # Find ef nearest neighbors: O(ef)
        neighbors = search_layer(vector, entry_point, ef, layer)

        # Select M best neighbors
        selected = select_neighbors(vector, neighbors, M)

        # Create bidirectional connections: O(M)
        for neighbor in selected:
            graph[vector_id][layer].append(neighbor)
            graph[neighbor][layer].append(vector_id)

            # Prune if neighbor has too many connections
            if len(graph[neighbor][layer]) > M:
                graph[neighbor][layer] = prune_connections(
                    neighbor, graph[neighbor][layer], M
                )

        entry_point = neighbors[0]

    return level
# Total: O(log(n) * M) for navigation + connections

Implementation Examples

qdrant_hnsw.py Qdrant with HNSW configuration

from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, HnswConfigDiff,
    OptimizersConfigDiff, PointStruct
)

# Initialize client
client = QdrantClient(host="localhost", port=6333)

# Create collection with optimized HNSW settings
client.create_collection(
    collection_name="docuverse",
    vectors_config=VectorParams(
        size=1536,  # OpenAI ada-002 dimensions
        distance=Distance.COSINE
    ),
    hnsw_config=HnswConfigDiff(
        m=16,                    # Connections per node
        ef_construct=100,        # Build-time search width
        full_scan_threshold=10000  # Use HNSW when > 10k vectors
    ),
    optimizers_config=OptimizersConfigDiff(
        indexing_threshold=20000,  # Start indexing after 20k points
        memmap_threshold=50000     # Use memory mapping for large data
    )
)

# Upsert vectors with payload
points = [
    PointStruct(
        id=idx,
        vector=embedding,
        payload={
            "url": doc.url,
            "title": doc.title,
            "chunk_id": chunk_id
        }
    )
    for idx, (embedding, doc, chunk_id) in enumerate(data)
]

client.upsert(
    collection_name="docuverse",
    points=points,
    wait=True
)

# Search with HNSW parameters
results = client.search(
    collection_name="docuverse",
    query_vector=query_embedding,
    limit=10,
    search_params={
        "hnsw_ef": 128,  # Query-time search width (higher = better recall)
        "exact": False   # Use ANN, not exact search
    },
    with_payload=True,
    score_threshold=0.7  # Minimum similarity
)

# Filter search with metadata
filtered_results = client.search(
    collection_name="docuverse",
    query_vector=query_embedding,
    query_filter={
        "must": [
            {"key": "category", "match": {"value": "python"}}
        ],
        "must_not": [
            {"key": "deprecated", "match": {"value": True}}
        ]
    },
    limit=5
)

pinecone_serverless.py Pinecone Serverless with hybrid search

from pinecone import Pinecone, ServerlessSpec
import os

# Initialize Pinecone client
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create serverless index
pc.create_index(
    name="docuverse-prod",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Get index reference
index = pc.Index("docuverse-prod")

# Upsert vectors with metadata
vectors = [
    {
        "id": f"doc_{i}",
        "values": embedding,
        "metadata": {
            "url": doc.url,
            "title": doc.title,
            "source": doc.source,
            "updated_at": doc.timestamp
        }
    }
    for i, (embedding, doc) in enumerate(data)
]

# Batch upsert (max 100 vectors per batch)
for i in range(0, len(vectors), 100):
    batch = vectors[i:i+100]
    index.upsert(vectors=batch, namespace="production")

# Dense vector search
results = index.query(
    vector=query_embedding,
    top_k=10,
    include_metadata=True,
    namespace="production"
)

# Hybrid search (dense + sparse)
from pinecone_text.sparse import BM25Encoder

# Initialize BM25 for sparse vectors
bm25 = BM25Encoder()
bm25.fit(corpus)  # Fit on your document corpus

# Create sparse vector from query
sparse_query = bm25.encode_queries(query_text)

# Hybrid query with alpha weighting
results = index.query(
    vector=query_embedding,           # Dense vector
    sparse_vector=sparse_query,       # Sparse (BM25) vector
    top_k=50,
    include_metadata=True,
    filter={
        "source": {"$in": ["official", "docs"]},
        "updated_at": {"$gte": "2024-01-01"}
    }
)

# Bulk import from S3 (for millions of vectors)
index.start_import(
    uri="s3://docuverse-bucket/embeddings/",
    integration_id="s3-integration",
    error_mode="continue"  # Skip failed records
)

faiss_ivf.py FAISS IVF with Product Quantization

import faiss
import numpy as np

# Configuration
dimension = 1536
n_vectors = 5_000_000
n_clusters = int(np.sqrt(n_vectors))  # ~2236 clusters

# Training data (sample 10% for k-means)
train_size = min(500_000, n_vectors // 10)
train_vectors = vectors[:train_size].astype('float32')

# Option 1: IVF with Flat (exact search within clusters)
quantizer = faiss.IndexFlatL2(dimension)
index_ivf = faiss.IndexIVFFlat(
    quantizer, dimension, n_clusters, faiss.METRIC_L2
)

# Train the index (k-means clustering)
index_ivf.train(train_vectors)
print(f"Index trained: {index_ivf.is_trained}")

# Add vectors
index_ivf.add(all_vectors.astype('float32'))

# Search parameters
index_ivf.nprobe = 64  # Search 64 clusters (higher = better recall)

# Option 2: IVF with Product Quantization (compressed)
# PQ splits vector into subvectors and quantizes each
m = 96           # Number of subquantizers
bits = 8         # Bits per subquantizer code

index_ivfpq = faiss.IndexIVFPQ(
    quantizer, dimension, n_clusters, m, bits
)

index_ivfpq.train(train_vectors)
index_ivfpq.add(all_vectors.astype('float32'))

# Memory comparison
print(f"IVF Flat memory: {index_ivf.ntotal * dimension * 4 / 1e9:.2f} GB")
print(f"IVF PQ memory: {index_ivfpq.ntotal * m / 1e9:.2f} GB")

# Search
query = query_embedding.reshape(1, -1).astype('float32')
distances, indices = index_ivf.search(query, k=10)

# Option 3: HNSW in FAISS (for comparison)
index_hnsw = faiss.IndexHNSWFlat(dimension, 32)  # M=32
index_hnsw.hnsw.efConstruction = 200
index_hnsw.hnsw.efSearch = 128

index_hnsw.add(all_vectors.astype('float32'))
distances, indices = index_hnsw.search(query, k=10)

# GPU acceleration (if available)
if faiss.get_num_gpus() > 0:
    gpu_index = faiss.index_cpu_to_gpu(
        faiss.StandardGpuResources(),
        0,  # GPU device ID
        index_ivf
    )
    distances, indices = gpu_index.search(query, k=10)

Benchmark Results (5M vectors, 1536 dims)

IVF-Flat

Recall@10 0.92

QPS 850

Memory 28 GB

IVF-PQ

Recall@10 0.85

QPS 2,400

Memory 4.5 GB

HNSW
Best Balance

                                Recall@10
                                0.98
                            
                                QPS
                                1,200
                            
                                Memory
                                32 GB

Pinecone Serverless

Recall@10 0.96

QPS Elastic

Memory Managed

09

RAG Pipeline Integration

LangChain Orchestration Flow

Query Processing

Hybrid Retrieval

Re-Ranking

Answer Synthesis

Click on a card to see more details

1

Query Embedding

Query sent to the same embedding function used for indexing.

Click for details

Query Embedding - Details

Model: e5-large (same as indexing)
Dimension: 1024-dimensional vector
Latency: ~15ms per query
Key insight: Using the same model for queries and documents ensures vectors are in the same latent space, making similarity comparisons meaningful.

2

Hybrid Retrieval

LangChain queries Pinecone with vector + BM25 filters.

Click for details

Hybrid Retrieval - Details

Vector search: Cosine similarity on embeddings
Keyword (BM25): Exact matches for error codes, APIs
Alpha weighting: 0.7 vector + 0.3 sparse
Filters: updated_at > 1 year ago
Initial candidates: Top 50 results from index

3

Cross-Encoder Re-Ranking

Re-rank candidates with Cross-Encoder on Modal GPU.

Click for details

Cross-Encoder - Details

Model: ms-marco-MiniLM-L-12-v2
Input: (query, document) pairs
Output: Relevance score 0-1
Why: Bi-encoder (embedding) is fast but less precise. Cross-encoder sees query+doc together for better relevance.
GPU: Runs on Modal A10G for ~40ms latency

4

Matrix Link Boost

Apply authority boost with PageRank-style scoring.

Click for details

Matrix Link Boost - Details

Formula: Final = (Sim * 0.8) + (Authority * 0.2)
Authority source: Link graph analysis (PageRank)
Effect: Official docs outrank forum posts
Example: React.dev scores 0.95, random blog 0.3
Final output: Top 5 most relevant + authoritative docs

10

Operational Resilience

⚠

Dead Letter Queue (DLQ)

In a system processing millions of items, 0.1% will fail. Failed items are serialized with error traceback and pushed to DLQ for later inspection or retry.

try:
    process(item)
except Exception as e:
    dlq.put({"input": item, "error": str(e)})

🔒

Idempotency

Document IDs generated deterministically: sha256(url). If a worker crashes after writing to Pinecone but before acknowledging, the retry simply overwrites with identical data.

💰

Cost Monitoring

Track tokens processed by embedding function. If daily spend exceeds threshold ($50), the seed_injector is disabled until next billing cycle.

Circuit Breaker $42.50 / $50.00

11

Cost & Performance Analysis

Monthly Cost: Kubernetes vs Serverless

Component	Kubernetes (EKS)	DocuVerse (Modal)	Savings
Compute (Crawler)	$450/mo	$42/mo	90%
Compute (GPU)	$2,200/mo	$150/mo	93%
Vector DB	$300/mo	$45/mo	85%
DevOps Labor	10 hrs/mo	1 hr/mo	90%
Total	~$2,950	~$237	92%

Throughput Benchmarks

1,200

pages/sec

Crawling Speed (300 containers)

4,500

docs/sec

Embedding Rate (50 A10G GPUs)

10,000

vectors/sec

Indexing via S3 bulk import

1.8s

cold start

Container + model weights

12

Complete System Mind Map

Core Engine

Ingestion

Processing

Memory

Interaction

13

Reference Implementation

Complete source code for the DocuVerse engine, structured as a Modal application package.

src/common.py Shared data structures and configuration

from dataclasses import dataclass
from typing import List, Optional

# Constants
QUEUE_NAME = "docuverse-frontier"
DICT_NAME = "docuverse-visited"
EMBED_QUEUE = "docuverse-embeddings"
LINK_MATRIX_QUEUE = "docuverse-matrix"

@dataclass
class Document:
    url: str
    content: str
    title: str
    links: List[str]
    doc_hash: str
    metadata: dict

@dataclass
class VectorRecord:
    id: str
    values: List[float]
    metadata: dict

src/crawler.py Distributed Fetcher with Producer-Consumer pattern

import modal
import hashlib
from .common import Document, QUEUE_NAME, DICT_NAME, EMBED_QUEUE, LINK_MATRIX_QUEUE

# Define the container image with necessary scraping libraries
crawler_image = modal.Image.debian_slim().pip_install(
    "beautifulsoup4", "requests"
)

app = modal.App("docuverse-crawler")

# Persistent State
frontier_queue = modal.Queue.from_name(QUEUE_NAME, create_if_missing=True)
visited_db = modal.Dict.from_name(DICT_NAME, create_if_missing=True)
embed_queue = modal.Queue.from_name(EMBED_QUEUE, create_if_missing=True)
matrix_queue = modal.Queue.from_name(LINK_MATRIX_QUEUE, create_if_missing=True)

@app.function(image=crawler_image, concurrency_limit=300)
def fetch_url(url: str):
    import requests
    from bs4 import BeautifulSoup

    # Idempotency check
    if url in visited_db:
        return

    try:
        response = requests.get(url, timeout=5)
        if response.status_code != 200:
            return

        soup = BeautifulSoup(response.text, 'html.parser')

        # 1. Extract Content
        text = soup.get_text()
        title = soup.title.string if soup.title else url
        doc_hash = hashlib.sha256(text.encode()).hexdigest()

        # 2. Extract Matrix Links (Graph Edges)
        links = [a.get('href') for a in soup.find_all('a', href=True)]
        normalized_links = [l for l in links if l.startswith('http')]

        doc = Document(
            url=url,
            content=text[:5000],  # Truncate for demo
            title=title,
            links=normalized_links,
            doc_hash=doc_hash,
            metadata={"source": "crawler"}
        )

        # 3. Mark as visited
        visited_db[url] = {"hash": doc_hash, "status": "processed"}

        # 4. Dispatch for Processing
        embed_queue.put(doc)
        matrix_queue.put({"source": url, "targets": normalized_links})

        # 5. Expand Frontier
        for link in normalized_links:
            if link not in visited_db:
                frontier_queue.put(link)

    except Exception as e:
        print(f"Failed to crawl {url}: {e}")

@app.function(schedule=modal.Cron("0 2 * * *"))
def seed_injector():
    """Daily job to restart the crawl from root nodes."""
    roots = ["https://docs.python.org/3/", "https://react.dev"]
    for url in roots:
        frontier_queue.put(url)

src/embedder.py GPU Batch Processing with model caching

import modal
from typing import List
from .common import Document, VectorRecord, EMBED_QUEUE

# Define a GPU-enabled image with PyTorch and Transformers
gpu_image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "sentence-transformers")
)

app = modal.App("docuverse-embedder")

@app.cls(gpu="A10G", image=gpu_image, container_idle_timeout=300)
class ModelService:
    def __enter__(self):
        from sentence_transformers import SentenceTransformer
        # Load model once when container starts (Cold Start optimization)
        self.model = SentenceTransformer('intfloat/multilingual-e5-large')

    @modal.method()
    def embed_batch(self, docs: List[Document]) -> List[VectorRecord]:
        texts = [d.content for d in docs]

        # Generate dense vectors
        embeddings = self.model.encode(texts, normalize_embeddings=True)

        records = []
        for doc, emb in zip(docs, embeddings):
            records.append(VectorRecord(
                id=doc.doc_hash,
                values=emb.tolist(),
                metadata={"url": doc.url, "title": doc.title}
            ))
        return records

@app.function(image=modal.Image.debian_slim())
def batch_coordinator():
    """Reads from queue, batches items, and sends to GPU."""
    embed_queue = modal.Queue.from_name(EMBED_QUEUE)
    service = ModelService()

    BATCH_SIZE = 64

    while True:
        try:
            items = embed_queue.get_many(BATCH_SIZE, block=True, timeout=5.0)
            if not items:
                break

            # Invoke GPU function
            vectors = service.embed_batch.remote(items)

            # Send vectors to Pinecone
            # pinecone_upload.remote(vectors)

        except Exception:
            break

src/vector_db.py Pinecone bulk upload via S3

import modal
import os

app = modal.App("docuverse-vectordb")

@app.function(
    secrets=[modal.Secret.from_name("pinecone-secret"),
             modal.Secret.from_name("aws-secret")]
)
def bulk_upsert(parquet_file_path: str):
    from pinecone import Pinecone
    import boto3

    # 1. Upload Parquet to S3
    s3 = boto3.client('s3')
    bucket = "docuverse-ingest-bucket"
    key = f"imports/{os.path.basename(parquet_file_path)}"
    s3.upload_file(parquet_file_path, bucket, key)

    # 2. Trigger Pinecone Import
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    idx = pc.Index("docuverse-prod")

    # Start async import
    idx.start_import(
        uri=f"s3://{bucket}/{key}",
        integration_id="s3-integration-id"
    )
    print("Bulk import started.")

14

Learning Resources

Video Tutorials

Getting Started with Modal

Official introduction to serverless GPU compute with Modal Labs

Building RAG Applications

Complete guide to Retrieval-Augmented Generation with LangChain

HNSW Algorithm Deep Dive

Understanding Hierarchical Navigable Small World graphs

Complete Vector Database Tutorial

Learn ChromaDB, Pinecone & Weaviate for Generative AI

HNSW Explained & Implemented

Hierarchical Navigable Small World with Faiss (Python)

Vector Indexing: HNSW, IVF & PQ

Complete guide to vector indexing methods and when to use each

RAG Pipeline From Scratch

Data Ingestion to Vector Store with LangChain

RAG Chatbot with Claude & Qdrant

Building document processing chatbots with vector databases

How to Choose a Vector Database

PGVector vs Pinecone vs Redis comparison

Official Documentation

📖

Modal Documentation

Complete guide to serverless GPU compute

→

📍

Pinecone Documentation

Vector database setup and best practices

→

◈

Qdrant Documentation

Open-source vector search engine

→

☍

LangChain Documentation

Building LLM applications and RAG pipelines

→

🤖

E5 Embedding Model

Multilingual text embedding model

→

📑

Sentence Transformers

Python framework for embeddings

→

Research & Further Reading

Paper

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Lewis et al. - The original RAG paper from Facebook AI

Paper

The Many Facets of Internet Topology and Traffic

H. Chang & S. Uhlig - American Institute of Mathematical Sciences

Modal & Infrastructure

Blog

Vector Databases

Docs

Embeddings & ANN Algorithms

Docs

Benchmarks & Performance

Tool

LangChain & RAG Integration

Docs

Cloud & Enterprise Solutions

Docs

Use Cases & Applications

Guide