Everyone has a folder full of PDFs that is basically a graveyard. With CocoIndex, that folder can become a real-time research database: semantic search over titles and abstracts, "show me all papers by Jeff Dean", and vector search backed by Postgres + PGVector.
This post shows how to:
- Extract paper metadata (title, authors, abstract, page count) from PDFs
- Build embeddings for titles and abstract chunks for semantic search
- Maintain author→paper mappings so you can ask questions like "all papers by X"
- Keep everything in Postgres so you can query with plain SQL or your favorite query engine
Why this is different from "embed the whole PDF"
Most "RAG over papers" demos just chunk the whole PDF and stuff it into a vector store. That works for retrieval, but it loses structure:
- You can't easily filter by author, venue, or page count
- "Top-k vectors" doesn't map cleanly to questions like "list all papers by Hinton after 2015"
- You pay to embed a ton of content you'll never query, especially long appendices
This CocoIndex flow keeps text embeddings, relational metadata, and author graphs in sync, so you can:
- Do semantic search over title/abstract
- Run structured queries over authors and filenames
- Extend later with full-PDF embeddings, images, or knowledge graphs without rewriting your pipeline
The flow at a glance
Here's the high-level pipeline:
- Read PDFs from a local papers/ directory as a CocoIndex source
- For each file:
  - Extract the first page and total page count with pypdf
  - Convert the first page into Markdown with Marker (or Docling)
  - Use an LLM to extract title, authors, and abstract into a typed dataclass
- Split abstracts into semantic chunks and embed them with sentence-transformers/all-MiniLM-L6-v2
- Collect three logical tables into Postgres + PGVector:
  - paper_metadata (filename, title, authors, abstract, num_pages)
  - author_papers (author_name, filename)
  - metadata_embeddings (id, filename, location, text, embedding)
CocoIndex handles incremental updates: drop a new PDF into papers/, and the pipeline keeps the index fresh automatically.
Defining the flow with CocoIndex
First, define a flow that watches a papers/ folder as a source and keeps it in sync:
import datetime

import cocoindex


@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    # Watch the papers/ folder; re-scan every 10 seconds for new or changed PDFs.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )
This gives you a logical documents table with filename and content (the raw PDF bytes).
Next, extract basic PDF info using pypdf:
import dataclasses
import io

from pypdf import PdfReader, PdfWriter


@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Return the total page count plus a one-page PDF containing only the first page."""
    reader = PdfReader(io.BytesIO(content))
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)
    return PaperBasicInfo(
        num_pages=len(reader.pages),
        first_page=output.getvalue(),
    )
with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
Now each document has a lightweight view of the first page plus total page count.
From first page to structured metadata
To turn the first page into Markdown, reuse a cached Marker converter (or swap in Docling):
import tempfile
from functools import cache

# Marker imports (module paths may differ slightly across marker-pdf versions).
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered


@cache
def get_marker_converter() -> PdfConverter:
    """Build the Marker converter once and reuse it for every document."""
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(),
        config=config_parser.generate_config_dict(),
    )


@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert PDF bytes (here, just the first page) into Markdown via Marker."""
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as f:
        f.write(content)
        f.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(f.name))
    return text
with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(pdf_to_markdown)
Then define a dataclass for the metadata schema and use LLM-structured extraction:
@dataclasses.dataclass
class Author:
    name: str


@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


with data_scope["documents"].row() as doc:
    doc["metadata"] = doc["first_page_md"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o",
            ),
            output_type=PaperMetadata,
            instruction="Extract the title, authors and abstract from the first page.",
        )
    )
CocoIndex parses the LLM response directly into your dataclass, so downstream code can stay type-safe.
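For instance, once extraction yields a PaperMetadata value, downstream helpers can be written as ordinary typed Python. The format_citation function below is a hypothetical illustration of that, not part of the flow above.

# Hypothetical helper showing typed downstream use of the extracted metadata.
def format_citation(metadata: PaperMetadata) -> str:
    """Render a one-line citation string from the extracted metadata."""
    names = ", ".join(author.name for author in metadata.authors)
    return f"{metadata.title} ({names})"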
Building author and embedding tables
Now collect paper-level metadata into a dedicated table:
paper_metadata = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )
Unroll authors into an author_papers relation:
author_papers = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
For semantic search, embed titles and abstract chunks:
with data_scope["documents"].row() as doc:
    doc["title_embedding"] = doc["metadata"]["title"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2",
        )
    )

    doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
        cocoindex.functions.SplitRecursively(
            custom_languages=[
                cocoindex.functions.CustomLanguageSpec(
                    language_name="abstract",
                    separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
                )
            ]
        ),
        language="abstract",
        chunk_size=500,
        min_chunk_size=200,
        chunk_overlap=150,
    )

    with doc["abstract_chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].transform(
            cocoindex.functions.SentenceTransformerEmbed(
                model="sentence-transformers/all-MiniLM-L6-v2",
            )
        )
Finally, collect all embeddings into a single table that PGVector can index:
metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
Wiring it into Postgres + PGVector
Export everything into Postgres with vector indexes:
paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)

author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)

metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
From here you can:
- Use SQL + PGVector to run "find similar abstracts to X" (see the query sketch below)
- Join metadata_embeddings with paper_metadata to filter by author or page count
- Power agents that can browse your personal research library instead of the open web
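As a concrete example of the first two points, here is a minimal Python sketch of a semantic search over abstracts. The DSN, the search_abstracts helper, and the plain table names metadata_embeddings / paper_metadata are assumptions for illustration; adjust them to the actual table names CocoIndex creates in your database. The <=> operator is pgvector's cosine-distance operator.

# A minimal query sketch (assumptions noted above, not from the original post).
import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def search_abstracts(query: str, top_k: int = 5):
    # Embed the query with the same model used at indexing time,
    # then format it as a pgvector literal like "[0.1,0.2,...]".
    vec = "[" + ",".join(str(x) for x in model.encode(query).tolist()) + "]"
    with psycopg.connect("postgresql://localhost/cocoindex") as conn:  # assumed DSN
        return conn.execute(
            """
            SELECT p.title, e.text, e.embedding <=> %s::vector AS distance
            FROM metadata_embeddings e
            JOIN paper_metadata p ON p.filename = e.filename
            WHERE e.location = 'abstract'
            ORDER BY distance
            LIMIT %s
            """,
            (vec, top_k),
        ).fetchall()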
Get started
If you want more data-infra-for-AI recipes like this, star CocoIndex on GitHub and follow along! 🚀