Lucas Ribeiro

The Serverless Semantic Engine: Architecting Mass Indexing Pipelines with Modal and Vector Databases

Executive Summary

The transition from keyword-based information retrieval to semantic search represents one of the most significant paradigm shifts in data engineering over the last decade. As organizations seek to leverage Large Language Models (LLMs) via Retrieval-Augmented Generation (RAG), the ability to efficiently crawl, embed, and index vast corpora of unstructured data has become a critical competency. However, traditional infrastructure approaches—relying on provisioned virtual machines, long-running Kubernetes clusters, or monolithic server architectures—often struggle to handle the distinct "bursty" nature of mass indexing workloads. A web crawler might sit idle for days and then require thousands of concurrent threads for a few hours; a vector embedding job requires massive GPU throughput for short bursts but is financially ruinous to maintain 24/7.

This report provides an exhaustive technical analysis of architecting a serverless mass-indexing pipeline using Modal for compute orchestration and Vector Databases (specifically analyzing Pinecone and Qdrant) for high-dimensional storage. To facilitate a rigorous examination of these technologies, we introduce a fictional yet realistic application scenario: "DocuVerse," a decentralized technical documentation aggregator. This simulation involves the ingestion of millions of technical documents, requiring a pipeline that is robust, scalable, and cost-efficient.

Our analysis extends beyond simple implementation details to explore second-order implications: the graph-theoretical properties of web crawling (the "Matrix Link"), the economics of ephemeral GPU compute, and the nuances of distributed state management in a stateless environment. Furthermore, bridging the gap between deep engineering and public communication, the report concludes with a comprehensive LinkedIn content strategy, including visual "card" designs and a conceptual mind map of the application, designed to communicate these complex architectures to a professional audience.


Part I: The Paradigm Shift in Search Infrastructure

1.1 The Evolution of Retrieval: From Keywords to Vectors

To understand the necessity of the architectures proposed in this report, one must first appreciate the fundamental limitations of the systems they replace. For decades, the industry standard for search was the Inverted Index—a data structure mapping unique terms to the documents containing them (e.g., Apache Lucene, Elasticsearch). While highly efficient for exact keyword matching, inverted indices suffer from the "lexical gap": they cannot match a query for "automobile" to a document containing "car" unless explicitly synonymized.

The advent of Transformer-based language models (BERT, RoBERTa, and later GPT) introduced Vector Embeddings. In this paradigm, text is transformed into a high-dimensional vector (often 768 to 1536 dimensions) where semantic meaning is encoded in the geometric distance between points. "Car" and "Automobile" end up in the same neighborhood of this vector space.1

This shift changes the fundamental resource requirements of the indexing pipeline:

  1. CPU to GPU Shift: Inverted indexing is I/O and CPU bound (tokenization). Vector indexing is compute-bound, requiring matrix multiplications best performed on GPUs.

  2. Throughput Sensitivity: The embedding model is a bottleneck. Processing millions of documents through a deep neural network requires massive parallelization that single-server architectures cannot provide.

  3. Storage Complexity: Storing and searching millions of dense vectors requires specialized Approximate Nearest Neighbor (ANN) algorithms (like HNSW), which have different memory and disk IOPS profiles compared to traditional B-Trees.

1.2 The Infrastructure Dilemma: Burstiness vs. Provisioning

Mass indexing events—such as the initial ingestion of a new dataset or a full re-indexing after an embedding model update—are characterized by extreme burstiness.

Consider a documentation platform that crawls the web. For 23 hours a day, traffic is minimal (incremental updates). For 1 hour, a major new library release might trigger a crawl of 100,000 pages.

  • Provisioned Capacity (e.g., EC2/Kubernetes): If you provision for the peak, you pay for idle GPUs 95% of the time. If you provision for the average, the peak load causes massive latency spikes, violating Service Level Agreements (SLAs).

  • Traditional Serverless (e.g., AWS Lambda): While scalable, these services often lack GPU support, have restrictive timeouts (15 minutes), and suffer from "cold starts" that make loading large ML models (often gigabytes in size) too slow for real-time responsiveness.

1.3 The Modal Solution

Modal has emerged as a specialized cloud platform designed to solve these specific discrepancies. Unlike general-purpose serverless platforms, Modal is optimized for data-intensive and AI workloads. Its architecture allows for:

  • Container Lifecycle Management: Modal separates the container image definition from the execution. It employs advanced caching and lazy-loading techniques to launch containers in milliseconds, even those with heavy dependencies like PyTorch or TensorFlow.1

  • GPU Ephemerality: Functions can request specific GPU hardware (e.g., NVIDIA A10G, H100) on a per-invocation basis. The billing model is per-second of usage, enabling a "scale-to-zero" architecture where the cost of a massive GPU cluster is incurred only during the minutes it is actually crunching data.

  • Distributed Primitives: Modal provides native distributed data structures (Queues, Dicts) that allow functions to coordinate state without needing an external Redis or message bus.2

This report validates Modal as the foundational compute layer for "DocuVerse," demonstrating how it orchestrates the complex dance of crawling, embedding, and indexing.


Part II: The Fictional Use Case: "DocuVerse"

To ground our architectural decisions in reality, we define the specifications of DocuVerse.

2.1 Mission and Scope

DocuVerse is a "Universal Documentation Search Engine" for developers. It aggregates technical documentation from:

  1. Official Sources: Python docs, MDN, AWS documentation.

  2. Community Sources: Stack Overflow archives, GitHub Wikis.

  3. Decentralized Web: Technical whitepapers hosted on IPFS/Arweave.

The goal is to provide a single search bar that retrieves the most relevant technical answers using RAG, regardless of where the information lives.

2.2 Dataset Specifications (Fictional Data)

| Metric | Value | Implications |
| --- | --- | --- |
| Total Documents | 5,000,000 | Requires efficient bulk indexing strategies. |
| Average Doc Size | 4 KB (approx. 800 tokens) | Fits within standard embedding context windows; chunking may be minimal. |
| Update Velocity | ~200,000 docs/day | Incremental indexing must be robust. |
| Vector Dimensions | 1,536 (OpenAI Ada-002 compatible) | Standard high-fidelity dimensionality. |
| Total Index Size | ~30 GB (Vectors + Metadata) | Fits in memory for some DBs, requires disk-offload for others. |
| Target Latency | < 200 ms (Search), < 15 min (Index Freshness) | Tight constraints on the ingestion pipeline. |

2.3 The "Matrix Link" Requirement

Beyond simple text search, DocuVerse aims to implement a "PageRank-for-Code" algorithm. It must construct a graph of how documentation pages link to each other (e.g., how many pages link to the React useEffect hook documentation?). This "Matrix Link" 3 will be used to boost the relevance of authoritative pages during vector retrieval. This adds a complexity layer: the crawler must not just extract text, but also preserve the adjacency matrix of the web graph.


Part III: Architecting the Distributed Crawler on Modal

The ingestion layer is the gateway to the system. Building a crawler that can handle 5 million pages without getting blocked, crashing, or entering infinite loops requires a sophisticated distributed architecture.

3.1 The Producer-Consumer Pattern using modal.Queue

In a monolithic script, crawling is a recursive function: visit(url) -> find_links() -> visit(links). In a serverless environment, deep recursion leads to stack overflows or timeout errors. We must flatten this recursion into a Queue-Based Architecture.2

The Architecture Design:

  1. The Frontier Queue: A modal.Queue named crawl-frontier. This persistent queue holds the URLs waiting to be visited. It acts as the buffer between the discovery of work and the execution of work.

  2. The Seed Injector: A scheduled function (@app.function(schedule=modal.Cron(...))) 5 that runs periodically (e.g., every morning at 02:00 UTC) to push known "root" URLs (e.g., https://docs.python.org/3/) into the Frontier Queue. This kickstarts the process.

  3. The Fetcher Swarm: A set of worker functions that pop() items from the queue. This is where Modal's auto-scaling shines. We can configure the Fetcher to scale between 0 and 500 concurrent containers depending on the queue length.

Why Not modal.map?

While modal.map allows parallel execution over a list, it is static. It expects the list of inputs to be known beforehand. A crawler is dynamic—parsing Page A reveals Page B and C. The Queue pattern is essential here because it allows the workload to expand dynamically during runtime.5
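As a minimal sketch of this dynamic-expansion loop (the full crawler lives in the Appendix), a dispatcher can drain the frontier queue and fan work out to the fetcher swarm. drain_frontier is a hypothetical helper not part of the reference implementation; it assumes Modal's Function.from_name lookup and stdlib-style queue.Empty timeout semantics for get_many:

import modal
from queue import Empty

app = modal.App("docuverse-dispatcher")
frontier_queue = modal.Queue.from_name("docuverse-frontier", create_if_missing=True)

@app.function(schedule=modal.Period(minutes=1))
def drain_frontier():
    """Fan URLs out to the fetcher swarm until the frontier is empty."""
    # Look up the deployed fetcher (Appendix A.2) by app and function name.
    fetch_url = modal.Function.from_name("docuverse-crawler", "fetch_url")

    while True:
        try:
            urls = frontier_queue.get_many(100, block=True, timeout=5)
        except Empty:
            break  # nothing arrived within the timeout; scale back to zero
        if not urls:
            break
        for url in urls:
            # spawn() returns immediately; Modal runs fetchers in parallel
            # up to the function's configured concurrency limit.
            fetch_url.spawn(url)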

3.2 State Management: The Deduplication Matrix

To prevent infinite loops (Page A links to B, B links to A) and to ensure we don't waste compute crawling the same page twice, we need a shared state of visited URLs.

The Distributed Dictionary:

We employ modal.Dict as a shared key-value store accessible by all 500 fetcher containers simultaneously.2

  • Key: The URL (normalized).

  • Value: A metadata object containing timestamp, hash (for content change detection), and status.

Consistency Challenge:

In a high-concurrency environment, a race condition exists: two workers might pop the same URL or discover the same link simultaneously. Individual modal.Dict operations are atomic across the distributed cluster, and because document IDs are derived deterministically from the URL (see Part 7.2), the occasional duplicate fetch results in an idempotent overwrite rather than a corrupted index.
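A minimal sketch of the visited-set check, reusing the docuverse-visited Dict from the Appendix; should_process is a hypothetical helper, and the check-then-put pair is deliberately tolerant of the race described above:

import modal

visited_db = modal.Dict.from_name("docuverse-visited", create_if_missing=True)

def should_process(url: str, content_hash: str) -> bool:
    """Return False if this URL was already processed with identical content.

    Two workers can occasionally both pass the check; because document IDs
    are deterministic (Part 7.2), the duplicate write is harmless.
    """
    entry = visited_db.get(url)
    if entry and entry.get("hash") == content_hash:
        return False  # content unchanged since the last crawl
    visited_db[url] = {"hash": content_hash, "status": "processed"}
    return True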

3.3 The "Matrix Link" Construction

As referenced in the research 3, the structure of the web is an adjacency matrix. Most crawlers discard this structure, keeping only the content. DocuVerse preserves it.

Implementation:

When the Fetcher parses a page, it extracts two distinct datasets:

  1. Content: The text for vectorization.

  2. Edges: A list of outbound links.

These edges are pushed to a secondary link_matrix_queue. A separate aggregator function reads this queue and builds a sparse matrix representation of the documentation graph. This matrix is later used to calculate "Authority Scores" for each document, which will be stored as metadata in the Vector Database. This approach leverages Graph Neural Network (GNN) concepts where the link structure informs the semantic importance of the node.4
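A sketch of that aggregator is shown below: it drains the matrix queue, rebuilds the link graph, and computes PageRank-style authority scores. networkx is an illustrative choice (at 5M nodes a sparse-matrix or dedicated graph engine would be preferable), and build_authority_scores is a hypothetical function name:

import modal
from queue import Empty

app = modal.App("docuverse-matrix")
matrix_queue = modal.Queue.from_name("docuverse-matrix", create_if_missing=True)
graph_image = modal.Image.debian_slim().pip_install("networkx")

@app.function(image=graph_image, schedule=modal.Cron("0 4 * * *"))
def build_authority_scores() -> dict:
    """Drain the edge queue, rebuild the documentation graph, run PageRank."""
    import networkx as nx

    graph = nx.DiGraph()
    while True:
        try:
            edges = matrix_queue.get_many(1000, block=True, timeout=5)
        except Empty:
            break
        if not edges:
            break
        for edge in edges:
            for target in edge["targets"]:
                graph.add_edge(edge["source"], target)

    # Authority scores are later written into each vector's metadata
    # and blended with similarity at query time (Part 6.2).
    return nx.pagerank(graph, alpha=0.85)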

3.4 Handling Politeness and Anti-Bot Measures

A naive crawler scaling to 500 containers will resemble a DDoS attack to the target server. We must implement Politeness Sharding.

The Sharded Queue Strategy:

Instead of one global queue, we logically partition the work by domain.

  • Worker Type A: Processes *.github.io (Concurrency Limit: 5).

  • Worker Type B: Processes *.readthedocs.io (Concurrency Limit: 10).

  • Worker Type C: General Web (Concurrency Limit: 100).

In Modal, this is achieved by defining separate Functions with different concurrency_limit settings, all consuming from filtered views of the main queue or from separate domain-specific queues (see the sketch below). This ensures that while the aggregate throughput of DocuVerse is high, the per-domain impact remains respectful of robots.txt etiquette.
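A minimal sketch of the sharding idea, with hypothetical per-domain queues and worker functions (the fetch/parse logic itself is the same as Appendix A.2):

import modal

app = modal.App("docuverse-crawler-sharded")

# Hypothetical per-domain frontier queues.
github_queue = modal.Queue.from_name("frontier-github-io", create_if_missing=True)
rtd_queue = modal.Queue.from_name("frontier-readthedocs-io", create_if_missing=True)

@app.function(concurrency_limit=5)
def fetch_github_io(url: str):
    ...  # same fetch/parse logic as Appendix A.2, capped at 5 containers

@app.function(concurrency_limit=10)
def fetch_readthedocs_io(url: str):
    ...  # capped at 10 containers

def route(url: str) -> None:
    """Called by the parser: send each discovered URL to its domain shard."""
    if "github.io" in url:
        github_queue.put(url)
    elif "readthedocs.io" in url:
        rtd_queue.put(url)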


Part IV: The Processing Core: Embeddings & GPU Orchestration

Once the raw HTML is secured, the pipeline shifts from network-bound (crawling) to compute-bound (embedding). This is the most expensive phase of the operation and where Modal's value proposition is strongest.

4.1 The Container Loading Advantage

In traditional container orchestration (like Kubernetes), adding a new GPU node and pulling a Docker image containing a 5GB PyTorch model can take several minutes. This latency makes it difficult to react to a sudden influx of 50,000 documents.

Modal solves this with a highly optimized container runtime.1

  1. Image Snapshotting: The file system of the container (including the installed Python packages and the model weights) is snapshotted.

  2. Lazy Loading: When a function is invoked, Modal mounts this snapshot over the network. Data is read on-demand.

  3. Result: A container capable of running a BERT-large model can boot in under 2 seconds.

Implication for DocuVerse:

This allows us to treat the Embedding Function as a purely on-demand resource. We do not need to keep a "warm pool" of GPU servers running. If the crawler finds a new pocket of documentation, Modal instantly spins up 50 GPU containers to process it and shuts them down the second the queue is empty.

4.2 Batching Strategy for Throughput

GPUs are throughput devices, not latency devices. Sending one document at a time to a GPU is inefficient due to the overhead of moving data from CPU RAM to GPU VRAM.

The Batcher Pattern:

We insert a "buffer" function between the Crawler and the Embedder.

  1. Crawler: Pushes text chunks to embedding_input_queue.

  2. Batcher: A lightweight CPU function that pulls from the queue and accumulates items until it reaches a batch size of 128 or a timeout of 500ms.

  3. Dispatcher: The Batcher sends the List (batch of 128) to the GPU Embedding Function.

This ensures that every time we pay for a GPU cycle, we are utilizing its matrix multiplication cores to their maximum capacity.
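A sketch of the accumulation logic is below. next_batch is a hypothetical helper; it assumes modal.Queue.get_many returns whatever is currently available (up to the requested count) and follows stdlib-style queue.Empty timeout semantics:

import time
import modal
from queue import Empty

embed_queue = modal.Queue.from_name("docuverse-embeddings", create_if_missing=True)

def next_batch(max_items: int = 128, max_wait_s: float = 0.5) -> list:
    """Collect documents until the batch is full or the deadline passes.

    Full batches keep the GPU's matrix units saturated; the deadline keeps
    index freshness bounded when traffic is light.
    """
    batch: list = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.extend(
                embed_queue.get_many(max_items - len(batch), block=True, timeout=remaining)
            )
        except Empty:
            break
    return batch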

4.3 Model Selection and Quantization

For DocuVerse, we have two primary options for embeddings:

  1. API-Based (e.g., OpenAI): Simple to implement but costly at scale ($0.10 per million tokens can add up with 5 million docs re-indexed weekly).

  2. Self-Hosted (e.g., multilingual-e5-large): Running open-source models on Modal's GPUs.

We choose the Self-Hosted approach for this architecture to demonstrate the capability. We utilize the multilingual-e5-large model, which provides state-of-the-art performance for technical text.6

Quantization:

To reduce the memory footprint in the Vector Database and speed up search, we apply Scalar Quantization (converting 32-bit floats to 8-bit integers) within the embedding function. This reduces the index size by 4x with minimal loss in retrieval accuracy (Recall@10).
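As a concrete illustration of the arithmetic (vector databases such as Qdrant can also apply this server-side), here is a minimal scalar-quantization sketch:

import numpy as np

def scalar_quantize(vectors: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Map float32 embeddings onto int8 (4x smaller), returning the offset
    and scale needed to approximately reconstruct values at query time."""
    lo, hi = float(vectors.min()), float(vectors.max())
    scale = (hi - lo) / 255.0
    quantized = np.round((vectors - lo) / scale - 128.0).astype(np.int8)
    return quantized, lo, scale

def dequantize(quantized: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Approximate inverse; per-component error is at most scale / 2."""
    return (quantized.astype(np.float32) + 128.0) * scale + lo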


Part V: The Vector Database Layer: Storage and Indexing

The vectors produced by our GPU workers need a home. We analyze two leading contenders, Pinecone and Qdrant, and how they integrate into this serverless pipeline.

5.1 Pinecone: The Serverless Standard

Pinecone's recent "Serverless" offering 7 aligns perfectly with our architecture. Unlike their previous "Pod-based" model where users provisioned capacity, the serverless model decouples storage from compute.

Architecture Benefits:

  • Separation of Concerns: Vectors are stored in blob storage (S3-compatible) and loaded into the index only when needed. This means we can store 5 million vectors cheaply, even if we rarely search the "long tail" of the data.

  • Mass Indexing via Object Storage: For the initial load of DocuVerse (the "Bootstrap" phase), pushing vectors one by one via API is too slow. Pinecone allows bulk import from object storage.8 Our Modal pipeline can write Parquet files to an S3 bucket, and Pinecone can ingest them asynchronously. This is the fastest and most cost-effective way to build the initial index.
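A sketch of that Parquet step, feeding the bulk_upsert function in Appendix A.4. The id/values/metadata column layout is an assumption; confirm the exact schema against Pinecone's import documentation:

import json
import pandas as pd

def write_import_file(records: list, path: str) -> str:
    """Serialize VectorRecords (Appendix A.1) into a Parquet file for bulk import."""
    frame = pd.DataFrame(
        {
            "id": [r.id for r in records],
            "values": [r.values for r in records],
            # Metadata stored as JSON strings to keep the column type uniform.
            "metadata": [json.dumps(r.metadata) for r in records],
        }
    )
    frame.to_parquet(path, engine="pyarrow", index=False)
    return path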

Integration Strategy:

We use a Hybrid Search index. We store both the dense vector (from the GPU model) and a sparse vector (BM25) for keyword matching. This ensures that if a user searches for a specific error code (e.g., "Error 503"), the keyword match takes precedence over semantic similarity.9

5.2 Qdrant: The High-Performance Alternative

Qdrant offers a different value proposition. It is open-source and can be run as a managed cloud service or self-hosted.

HNSW Graph Construction:

Qdrant uses the Hierarchical Navigable Small World (HNSW) algorithm.9 Constructing this graph is computationally expensive.

  • Insight: During mass indexing, inserting vectors and updating the graph in real-time destroys performance.

  • Optimization: We configure the Qdrant client to disable "optimization" (graph re-balancing) during the bulk upload. Once the upload is complete, we trigger a forced optimization. This reduces total indexing time by approximately 60%.
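A sketch of that toggle using the Qdrant Python client; the endpoint, collection name, and restored threshold are placeholders, and the exact method and keyword names should be checked against your client version:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="https://your-cluster.qdrant.io", api_key="...")  # placeholders

def bulk_load(points: list[models.PointStruct], collection: str = "docuverse") -> None:
    """Pause HNSW maintenance during the upload, then index everything in one pass."""
    # indexing_threshold=0 tells the optimizer not to build the HNSW graph yet.
    client.update_collection(
        collection_name=collection,
        optimizer_config=models.OptimizersConfigDiff(indexing_threshold=0),
    )
    client.upload_points(collection_name=collection, points=points, batch_size=256)
    # Re-enable indexing (20000 is a typical default threshold).
    client.update_collection(
        collection_name=collection,
        optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000),
    )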

LangChain Integration:

Qdrant has deep integration with LangChain.11 We can leverage the QdrantVectorStore class to handle metadata filtering out of the box. For DocuVerse, metadata is crucial.

  • Filter Example: filter={"project": "react", "version": "18.0"}.

    This allows the search engine to respect the structure of the documentation sets.
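For reference, the same filter expressed against the raw Qdrant client (the collection name and endpoint are placeholders):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="https://your-cluster.qdrant.io", api_key="...")  # placeholders

def search_react_18(query_vector: list[float], top_k: int = 10):
    """Semantic search scoped to the React 18 documentation set."""
    doc_filter = models.Filter(
        must=[
            models.FieldCondition(key="project", match=models.MatchValue(value="react")),
            models.FieldCondition(key="version", match=models.MatchValue(value="18.0")),
        ]
    )
    return client.search(
        collection_name="docuverse",
        query_vector=query_vector,
        query_filter=doc_filter,
        limit=top_k,
    )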

5.3 The DocuVerse Decision

For the primary architecture, we select Pinecone Serverless for the production index due to its zero-maintenance elasticity. However, we utilize Qdrant (running ephemerally in a Modal Sandbox) for testing and development pipelines, allowing developers to run the full stack locally without incurring cloud costs.


Part VI: Retrieval and Integration (RAG)

The ultimate consumer of our index is the RAG pipeline.

6.1 The LangChain Orchestrator

We use LangChain to wire the components together.11

  1. User Query: "How do I mount a volume in Modal?"

  2. Query Embedding: The query is sent to the same Embedding Function (hosted on Modal) used for indexing. This ensures the query vector and document vectors are in the same latent space.

  3. Retrieval: LangChain queries Pinecone with the vector + filters (e.g., "only show me docs updated in the last year").

  4. Re-Ranking: To improve precision, we fetch 50 candidates and pass them through a Cross-Encoder model (also hosted on Modal) to re-rank them. This is more expensive but guarantees higher relevance.

  5. Synthesis: The top 5 chunks are passed to GPT-4 via the OpenAI API to generate the answer.
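A condensed sketch of steps 2–4 above (query embedding, retrieval, and cross-encoder re-ranking). The cross-encoder model name is an illustrative choice, the API key is a placeholder, and in production the chunk text—not just the title—would live in the vector metadata:

from pinecone import Pinecone
from sentence_transformers import SentenceTransformer, CrossEncoder

pc = Pinecone(api_key="...")  # placeholder
index = pc.Index("docuverse-prod")

encoder = SentenceTransformer("intfloat/multilingual-e5-large")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative choice

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Fetch 50 candidates by vector similarity, then re-rank with a cross-encoder."""
    # e5 models expect a "query:" prefix at search time.
    query_vec = encoder.encode(f"query: {query}", normalize_embeddings=True).tolist()
    results = index.query(vector=query_vec, top_k=50, include_metadata=True)

    # The cross-encoder scores each (query, passage) pair jointly:
    # slower than the bi-encoder, but far more precise.
    pairs = [(query, match.metadata.get("title", "")) for match in results.matches]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(results.matches, scores), key=lambda pair: pair[1], reverse=True)
    return [match.metadata for match, _ in ranked[:top_k]]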

6.2 The "Matrix Link" Boost

Here, our earlier graph work pays off. When retrieving results, we apply a boosting factor based on the "Authority Score" calculated during the crawl.

  • Score Formula: Final_Score = (Vector_Similarity * 0.8) + (PageRank_Score * 0.2)

  • This ensures that the "official" documentation page (which has many incoming links) ranks higher than a random forum post (which has few), even if the forum post has slightly higher semantic similarity.4
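In code, the boost is a one-liner applied after retrieval (both inputs are assumed to be normalized to [0, 1]):

def boosted_score(vector_similarity: float, authority_score: float, alpha: float = 0.8) -> float:
    """Blend semantic similarity with the Matrix Link authority signal."""
    return alpha * vector_similarity + (1.0 - alpha) * authority_score

# Example: an official doc (authority 0.9) edges out a slightly closer forum post.
official = boosted_score(0.82, 0.90)  # 0.836
forum = boosted_score(0.85, 0.05)     # 0.690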


Part VII: Operational Resilience and Observability

Building a distributed system on fictional data is easy; running it in production is hard.

7.1 The Dead Letter Queue (DLQ)

In a system processing millions of items, 0.1% will fail. The HTML might be malformed; the embedding model might encounter a token limit.

  • Pattern: We define a dlq_queue in Modal.

  • Mechanism: Wrap the processing logic in a try/except block. On exception, serialize the input + the error traceback and push it to the DLQ.

  • Recovery: A separate "Janitor" function runs daily to inspect the DLQ. It can either retry the jobs (if the error was transient, like a network timeout) or alert a human.
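A minimal sketch of the whole pattern, assuming stdlib-style queue.Empty semantics for non-blocking reads; the queue name and the retry/alert policy are placeholders:

import traceback
from queue import Empty

import modal

app = modal.App("docuverse-dlq")
dlq = modal.Queue.from_name("docuverse-dlq", create_if_missing=True)

def with_dlq(process, item):
    """Run one processing step; on failure, park the input plus traceback in the DLQ."""
    try:
        return process(item)
    except Exception:
        dlq.put({"item": item, "error": traceback.format_exc()})
        return None

@app.function(schedule=modal.Cron("0 6 * * *"))
def janitor():
    """Daily sweep: surface failures for a human (transient ones could be re-queued)."""
    while True:
        try:
            failed = dlq.get(block=False)
        except Empty:
            break
        print("DLQ entry:", failed["error"].splitlines()[-1])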

7.2 Idempotency and Determinism

The pipeline must be idempotent. If a worker crashes after writing to Pinecone but before acknowledging the queue message, the message will be re-delivered.

  • Solution: We generate Document IDs deterministically using a hash of the URL (sha256(url)). If we try to write the same document to Pinecone twice, the second write simply overwrites the first with identical data. No duplicates are created.13

7.3 Cost Monitoring

To prevent "wallet-denial-of-service", we implement budget guards.

  • Token Counting: We track the total tokens processed by the Embedding Function.

  • Circuit Breaker: If the daily spend exceeds a threshold (e.g., $50), the seed_injector function is disabled, pausing new crawls until the next billing cycle or manual override.
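A sketch of the guard, using a modal.Dict as the spend counter. The cost constant and Dict name are placeholders, and the read-modify-write is not strictly atomic—acceptable for a coarse budget check:

import modal

budget = modal.Dict.from_name("docuverse-budget", create_if_missing=True)

DAILY_BUDGET_USD = 50.0
COST_PER_MILLION_TOKENS_USD = 0.02  # illustrative blended GPU cost

def record_usage(tokens: int) -> None:
    """Called after each embedding batch to accumulate today's estimated spend."""
    spent = budget.get("spent_today", 0.0)
    budget["spent_today"] = spent + (tokens / 1_000_000) * COST_PER_MILLION_TOKENS_USD

def crawling_allowed() -> bool:
    """Checked by seed_injector before pushing new roots into the frontier."""
    return budget.get("spent_today", 0.0) < DAILY_BUDGET_USD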


Part VIII: LinkedIn Content Strategy & Visuals

To effectively communicate the sophistication of the DocuVerse architecture to a professional network, we need a content strategy that bridges the gap between high-level value and low-level engineering.

8.1 The "Hook" and Narrative

Headline: "How I Built a 'Google for Code' Indexing 5 Million Pages for <$50."

Narrative Arc:

  1. The Villain: The "Idle Resource". Identifying the waste in traditional provisioned clusters.

  2. The Hero: The "Serverless Trinity" (Modal + Pinecone + LangChain).

  3. The Climax: The "Mass Indexing Event"—scaling from 0 to 500 GPUs in seconds.

  4. The Resolution: A predictable, low-cost bill and a high-performance search engine.

8.2 Card Suggestions (Visual Assets)

Card 1: The "Cold Start" Myth

  • Visual: A stopwatch comparing "Standard Docker" (2 min) vs. "Modal Snapshot" (1.5 sec).

  • Text: "Serverless GPUs used to be too slow for real-time AI. Not anymore. Container snapshotting changes the physics of cold starts." 1

Card 2: The Architecture Map

  • Visual Strategy: Instead of a static image, use this flow diagram to illustrate the "Producer-Consumer" decoupling that enables scale.

  • Diagram:


flowchart TD
    subgraph Ingestion ["Ingestion Layer (CPU)"]
        Seed(Seed Injector) --> Frontier[Frontier Queue]
        Frontier --> Crawler
        Crawler -->|HTML| Parser
        Crawler -->|Links| Frontier
    end

    subgraph Processing ["Processing Layer (GPU)"]
        Parser -->|Text Chunks| BatchQueue[Embedding Queue]
        BatchQueue --> Batcher
        Batcher -->|Batch of 128| Embedder
        Embedder -->|Vectors| VectorBuffer
    end

    subgraph Storage
        VectorBuffer -->|Bulk Import| S3
        S3 -->|Async Ingest| Pinecone
        Crawler -.->|Deduplication| Dict
    end

    subgraph Retrieval ["Interaction Layer"]
        User -->|Query| API
        API -->|Embed Query| Embedder
        API -->|Search| Pinecone
        Pinecone -->|Results| RAG
        RAG --> User
    end

Card 3: The "Matrix Link"

  • Visual: A network graph with nodes glowing. One central node is brighter.

  • Text: "Vectors aren't enough. We mapped the adjacency matrix of 5 million docs to boost 'Authority' alongside 'Similarity'. This is RAG + Graph Theory." 3

Card 4: The Cost Curve

  • Visual: A graph showing a flat line (Cost) overlaying a spiky line (Traffic), compared to a blocky "Provisioned" cost line.

  • Text: "Stop paying for air. Scale to zero means your infrastructure bill hits $0.00 when your users sleep."

8.3 Application Mind Map

The following mind map illustrates the four pillars of the DocuVerse engine: Ingestion, Processing, Memory, and Interaction.


mindmap
  root((DocuVerse<br/>Engine))
    Ingestion
      Crawler Swarm
        Politeness Sharding
        Deduplication
      Frontier Queue
      Seed Injector
    Processing
      HTML Parser
      Graph Builder
        Matrix Link
      Batcher
      Embedder
        Model: e5-large
        Quantization: 8-bit
    Memory
      Pinecone Serverless
      S3 Bucket
      DLQ Error Handler
    Interaction
      API Endpoint
      LangChain Orchestrator
      RAG Pipeline

Part IX: Comparison Data and Fictional Metrics

To further illustrate the efficiency of this architecture, we present fictional performance data derived from the "DocuVerse" simulation.

9.1 Cost Comparison: Serverless vs. Provisioned

| Component | Architecture A: Kubernetes (EKS) + P3 Instances | Architecture B: DocuVerse (Modal + Pinecone) | Savings |
| --- | --- | --- | --- |
| Compute (Crawler) | $450/mo (3 nodes always on) | $42/mo (pay per CPU-second) | 90% |
| Compute (GPU) | $2,200/mo (p3.2xlarge reserved) | $150/mo (A10G spot, burst usage) | 93% |
| Vector DB | $300/mo (managed instance) | $45/mo (serverless, usage-based) | 85% |
| DevOps Labor | 10 hrs/mo (cluster maintenance) | 1 hr/mo (config tweaks) | 90% |
| Total Monthly | ~$2,950 | ~$237 | ~92% |

Table 1: Monthly operational cost projection for indexing 5M documents with daily updates.

9.2 Throughput Metrics

| Operation | Metric | Note |
| --- | --- | --- |
| Crawling Speed | 1,200 pages/sec | Scaled to 300 concurrent containers. |
| Embedding Rate | 4,500 docs/sec | Utilizing 50 concurrent A10G GPUs with batch size 128. |
| Indexing Rate | 10,000 vectors/sec | Bulk upsert to Pinecone via S3 import. |
| Cold Start Latency | 1.8 seconds | Time to boot a fresh container and load model weights.1 |

Table 2: Performance benchmarks observed during the DocuVerse documentation ingestion simulation.


Conclusion

The "DocuVerse" case study illustrates a powerful truth about modern data engineering: Architecture is the new Optimization.

In the past, optimizing a search engine meant writing faster C++ code to tokenize strings. Today, it means composing the right set of serverless primitives to handle the physics of data movement and model inference.

  • Modal provides the elastic compute fabric, solving the "bursty" nature of crawling and embedding.

  • Vector Databases like Pinecone and Qdrant provide the semantic storage layer, solving the retrieval problem.

  • Graph Theory (the Matrix Link) provides the relevance signal, solving the authority problem.

By treating the cloud not as a collection of servers, but as a single, programmable computer, engineers can build systems that are orders of magnitude more efficient—both in cost and performance—than their predecessors. The era of the "Serverless Semantic Engine" is here, and it is accessible to any developer willing to embrace these new paradigms.


Appendix: DocuVerse Reference Implementation

This section provides the reference source code for the core logic of the "DocuVerse" engine. The application is structured as a Modal package.

A.1 src/common.py - Shared Structures

Defines the data models and shared configuration.

from dataclasses import dataclass
from typing import List, Optional

# Constants
QUEUE_NAME = "docuverse-frontier"
DICT_NAME = "docuverse-visited"
EMBED_QUEUE = "docuverse-embeddings"
LINK_MATRIX_QUEUE = "docuverse-matrix"

@dataclass
class Document:
    url: str
    content: str
    title: str
    links: List[str]
    doc_hash: str
    metadata: dict

@dataclass
class VectorRecord:
    id: str
    values: List[float]
    metadata: dict

A.2 src/crawler.py - The Distributed Fetcher

Implements the Producer-Consumer pattern with modal.Queue and the Matrix Link extraction.

import modal
import hashlib
from .common import Document, QUEUE_NAME, DICT_NAME, EMBED_QUEUE, LINK_MATRIX_QUEUE

# Define the container image with necessary scraping libraries
crawler_image = modal.Image.debian_slim().pip_install("beautifulsoup4", "requests")

app = modal.App("docuverse-crawler")

# Persistent State
frontier_queue = modal.Queue.from_name(QUEUE_NAME, create_if_missing=True)
visited_db = modal.Dict.from_name(DICT_NAME, create_if_missing=True)
embed_queue = modal.Queue.from_name(EMBED_QUEUE, create_if_missing=True)
matrix_queue = modal.Queue.from_name(LINK_MATRIX_QUEUE, create_if_missing=True)

@app.function(image=crawler_image, concurrency_limit=300)
def fetch_url(url: str):
    import requests
    from bs4 import BeautifulSoup

    # Idempotency check
    if url in visited_db:
        return

    try:
        response = requests.get(url, timeout=5)
        if response.status_code != 200:
            return

        soup = BeautifulSoup(response.text, 'html.parser')

        # 1. Extract Content
        text = soup.get_text()
        title = soup.title.string if soup.title else url
        doc_hash = hashlib.sha256(text.encode()).hexdigest()

        # 2. Extract Matrix Links (Graph Edges)
        links = [a.get('href') for a in soup.find_all('a', href=True)]
        normalized_links = [l for l in links if l.startswith('http')] # Simplified logic; relative links are dropped

        doc = Document(
            url=url,
            content=text[:5000], # Truncate for demo
            title=title,
            links=normalized_links,
            doc_hash=doc_hash,
            metadata={"source": "crawler"}
        )

        # 3. Mark as visited
        visited_db[url] = {"hash": doc_hash, "status": "processed"}

        # 4. Dispatch for Processing
        # Push content to embedding queue
        embed_queue.put(doc)

        # Push edges to matrix calculator queue
        matrix_queue.put({"source": url, "targets": normalized_links})

        # 5. Expand Frontier
        for link in normalized_links:
            if link not in visited_db:
                frontier_queue.put(link)

    except Exception as e:
        print(f"Failed to crawl {url}: {e}")

@app.function(schedule=modal.Cron("0 2 * * *"))
def seed_injector():
    """Daily job to restart the crawl from root nodes."""
    roots = ["https://docs.python.org/3/", "https://react.dev"]
    for url in roots:
        frontier_queue.put(url)

A.3 src/embedder.py - GPU Batch Processing

Uses modal.cls to maintain the model state (weights) in GPU memory between invocations.

import hashlib
import modal
from typing import List
from .common import Document, VectorRecord, EMBED_QUEUE

# Define a GPU-enabled image with PyTorch and Transformers
gpu_image = (
    modal.Image.debian_slim()
   .pip_install("torch", "transformers", "sentence-transformers")
)

app = modal.App("docuverse-embedder")

@app.cls(gpu="A10G", image=gpu_image, container_idle_timeout=300)
class ModelService:
    def __enter__(self):
        from sentence_transformers import SentenceTransformer
        # Load model once when container starts (Cold Start optimization)
        self.model = SentenceTransformer('intfloat/multilingual-e5-large')

    @modal.method()
    def embed_batch(self, docs: List) -> List:
        texts = [d.content for d in docs]

        # Generate dense vectors
        embeddings = self.model.encode(texts, normalize_embeddings=True)

        records = []
        for doc, emb in zip(docs, embeddings):
            records.append(VectorRecord(
                id=hashlib.sha256(doc.url.encode()).hexdigest(),  # deterministic ID from the URL (Part 7.2)
                values=emb.tolist(),
                metadata={"url": doc.url, "title": doc.title}
            ))
        return records

@app.function(image=modal.Image.debian_slim())
def batch_coordinator():
    """Reads from queue, batches items, and sends to GPU."""
    embed_queue = modal.Queue.from_name(EMBED_QUEUE)
    service = ModelService()

    BATCH_SIZE = 64

    while True:
        # Fetch items with a short timeout
        try:
            items = embed_queue.get_many(BATCH_SIZE, block=True, timeout=5.0)
            if not items:
                break

            # Invoke GPU function
            vectors = service.embed_batch.remote(items)

            # TODO: Send vectors to Pinecone/Qdrant
            # pinecone_upload.remote(vectors)

        except Exception:
            break

A.4 src/vector_db.py - Pinecone Integration

Demonstrates the bulk upload strategy via S3 (Conceptual code).

import modal
import os

app = modal.App("docuverse-vectordb")

@app.function(
    # Secret names are illustrative; they must provide AWS and Pinecone credentials.
    secrets=[
        modal.Secret.from_name("aws-credentials"),
        modal.Secret.from_name("pinecone-api-key"),
    ]
)
def bulk_upsert(parquet_file_path: str):
    from pinecone import Pinecone
    import boto3

    # 1. Upload Parquet to S3
    s3 = boto3.client('s3')
    bucket = "docuverse-ingest-bucket"
    key = f"imports/{os.path.basename(parquet_file_path)}"
    s3.upload_file(parquet_file_path, bucket, key)

    # 2. Trigger Pinecone Import
    pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
    idx = pc.Index("docuverse-prod")

    # Start async import
    idx.start_import(
        uri=f"s3://{bucket}/{key}",
        integration_id="s3-integration-id"
    )
    print("Bulk import started.")
