Everyone has a folder full of PDFs that is basically a graveyard. With CocoIndex, that folder can become a real-time research database: semantic search over titles and abstracts, "show me all papers by Jeff Dean", and vector search backed by Postgres + PGVector.
This post shows how to:
- Extract paper metadata (title, authors, abstract, page count) from PDFs
- Build embeddings for titles and abstract chunks for semantic search
- Maintain author→paper mappings so you can ask questions like "all papers by X"
- Keep everything in Postgres so you can query with plain SQL or your favorite query engine
Why this is different from "embed the whole PDF"
Most "RAG over papers" demos just chunk the whole PDF and stuff it into a vector store. That works for retrieval, but it loses structure:
- You can't easily filter by author, venue, or page count
- "Top-k vectors" doesn't map cleanly to questions like "list all papers by Hinton after 2015"
- You pay to embed a ton of content you'll never query, especially long appendices
This CocoIndex flow keeps text embeddings, relational metadata, and author graphs in sync, so you can:
- Do semantic search over title/abstract
- Run structured queries over authors and filenames
- Extend later with full-PDF embeddings, images, or knowledge graphs without rewriting your pipeline
The flow at a glance
Here's the high-level pipeline:
- Read PDFs from a local papers/ directory as a CocoIndex source
- For each file:
  - Extract the first page and total page count with pypdf
  - Convert the first page into Markdown with Marker (or Docling)
  - Use an LLM to extract title, authors, and abstract into a typed dataclass
- Split abstracts into semantic chunks and embed them with sentence-transformers/all-MiniLM-L6-v2
- Collect three logical tables into Postgres + PGVector:
  - paper_metadata (filename, title, authors, abstract, num_pages)
  - author_papers (author_name, filename)
  - metadata_embeddings (id, filename, location, text, embedding)
CocoIndex handles incremental updates: drop a new PDF into papers/, and the pipeline keeps the index fresh automatically.
Defining the flow with CocoIndex
First, define a flow that watches a papers/ folder as a source and keeps it in sync:
import datetime

import cocoindex


@cocoindex.flow_def(name="PaperMetadata")
def paper_metadata_flow(
    flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope
) -> None:
    # Watch the papers/ folder; re-scan every 10 seconds for new or changed PDFs.
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="papers", binary=True),
        refresh_interval=datetime.timedelta(seconds=10),
    )
This gives you a logical documents table with filename and content (the raw PDF bytes).
Next, extract basic PDF info using pypdf:
import dataclasses
import io

from pypdf import PdfReader, PdfWriter


@dataclasses.dataclass
class PaperBasicInfo:
    num_pages: int
    first_page: bytes


@cocoindex.op.function()
def extract_basic_info(content: bytes) -> PaperBasicInfo:
    """Return the total page count plus a one-page PDF containing only the first page."""
    reader = PdfReader(io.BytesIO(content))
    output = io.BytesIO()
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    writer.write(output)
    return PaperBasicInfo(
        num_pages=len(reader.pages),
        first_page=output.getvalue(),
    )
with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
Now each document has a lightweight view of the first page plus total page count.
From first page to structured metadata
To turn the first page into Markdown, reuse a cached Marker converter (or swap in Docling):
import tempfile
from functools import cache

# Marker imports (module paths may differ slightly across marker-pdf versions).
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered


@cache
def get_marker_converter() -> PdfConverter:
    """Build the Marker converter once and reuse it for every document."""
    config_parser = ConfigParser({})
    return PdfConverter(
        create_model_dict(),
        config=config_parser.generate_config_dict(),
    )


@cocoindex.op.function(gpu=True, cache=True, behavior_version=1)
def pdf_to_markdown(content: bytes) -> str:
    """Convert PDF bytes (here, just the first page) into Markdown via Marker."""
    with tempfile.NamedTemporaryFile(delete=True, suffix=".pdf") as f:
        f.write(content)
        f.flush()
        text, _, _ = text_from_rendered(get_marker_converter()(f.name))
    return text
with data_scope["documents"].row() as doc:
    doc["first_page_md"] = doc["basic_info"]["first_page"].transform(pdf_to_markdown)
Then define a dataclass for the metadata schema and use LLM-structured extraction:
@dataclasses.dataclass
class Author:
    name: str


@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


with data_scope["documents"].row() as doc:
    doc["metadata"] = doc["first_page_md"].transform(
        cocoindex.functions.ExtractByLlm(
            llm_spec=cocoindex.LlmSpec(
                api_type=cocoindex.LlmApiType.OPENAI,
                model="gpt-4o",
            ),
            output_type=PaperMetadata,
            instruction="Extract the title, authors and abstract from the first page.",
        )
    )
CocoIndex parses the LLM response directly into your dataclass, so downstream code can stay type-safe.
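For instance, once extraction yields a PaperMetadata value, downstream helpers can be written as ordinary typed Python. The format_citation function below is a hypothetical illustration of that, not part of the flow above.

# Hypothetical helper showing typed downstream use of the extracted metadata.
def format_citation(metadata: PaperMetadata) -> str:
    """Render a one-line citation string from the extracted metadata."""
    names = ", ".join(author.name for author in metadata.authors)
    return f"{metadata.title} ({names})"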
Building author and embedding tables
Now collect paper-level metadata into a dedicated table:
paper_metadata = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    paper_metadata.collect(
        filename=doc["filename"],
        title=doc["metadata"]["title"],
        authors=doc["metadata"]["authors"],
        abstract=doc["metadata"]["abstract"],
        num_pages=doc["basic_info"]["num_pages"],
    )
Unroll authors into an author_papers relation:
author_papers = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    with doc["metadata"]["authors"].row() as author:
        author_papers.collect(
            author_name=author["name"],
            filename=doc["filename"],
        )
For semantic search, embed titles and abstract chunks:
with data_scope["documents"].row() as doc:
    doc["title_embedding"] = doc["metadata"]["title"].transform(
        cocoindex.functions.SentenceTransformerEmbed(
            model="sentence-transformers/all-MiniLM-L6-v2",
        )
    )

    doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(
        cocoindex.functions.SplitRecursively(
            custom_languages=[
                cocoindex.functions.CustomLanguageSpec(
                    language_name="abstract",
                    separators_regex=[r"[.?!]+\s+", r"[:;]\s+", r",\s+", r"\s+"],
                )
            ]
        ),
        language="abstract",
        chunk_size=500,
        min_chunk_size=200,
        chunk_overlap=150,
    )

    with doc["abstract_chunks"].row() as chunk:
        chunk["embedding"] = chunk["text"].transform(
            cocoindex.functions.SentenceTransformerEmbed(
                model="sentence-transformers/all-MiniLM-L6-v2",
            )
        )
Finally, collect all embeddings into a single table that PGVector can index:
metadata_embeddings = data_scope.add_collector()

with data_scope["documents"].row() as doc:
    metadata_embeddings.collect(
        id=cocoindex.GeneratedField.UUID,
        filename=doc["filename"],
        location="title",
        text=doc["metadata"]["title"],
        embedding=doc["title_embedding"],
    )
    with doc["abstract_chunks"].row() as chunk:
        metadata_embeddings.collect(
            id=cocoindex.GeneratedField.UUID,
            filename=doc["filename"],
            location="abstract",
            text=chunk["text"],
            embedding=chunk["embedding"],
        )
Wiring it into Postgres + PGVector
Export everything into Postgres with vector indexes:
paper_metadata.export(
    "paper_metadata",
    cocoindex.targets.Postgres(),
    primary_key_fields=["filename"],
)

author_papers.export(
    "author_papers",
    cocoindex.targets.Postgres(),
    primary_key_fields=["author_name", "filename"],
)

metadata_embeddings.export(
    "metadata_embeddings",
    cocoindex.targets.Postgres(),
    primary_key_fields=["id"],
    vector_indexes=[
        cocoindex.VectorIndexDef(
            field_name="embedding",
            metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY,
        )
    ],
)
From here you can:
- Use SQL + PGVector to run "find similar abstracts to X" (see the query sketch below)
- Join metadata_embeddings with paper_metadata to filter by author or page count
- Power agents that can browse your personal research library instead of the open web
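As a concrete example of the first two points, here is a minimal Python sketch of a semantic search over abstracts. The DSN, the search_abstracts helper, and the plain table names metadata_embeddings / paper_metadata are assumptions for illustration; adjust them to the actual table names CocoIndex creates in your database. The <=> operator is pgvector's cosine-distance operator.

# A minimal query sketch (assumptions noted above, not from the original post).
import psycopg
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def search_abstracts(query: str, top_k: int = 5):
    # Embed the query with the same model used at indexing time,
    # then format it as a pgvector literal like "[0.1,0.2,...]".
    vec = "[" + ",".join(str(x) for x in model.encode(query).tolist()) + "]"
    with psycopg.connect("postgresql://localhost/cocoindex") as conn:  # assumed DSN
        return conn.execute(
            """
            SELECT p.title, e.text, e.embedding <=> %s::vector AS distance
            FROM metadata_embeddings e
            JOIN paper_metadata p ON p.filename = e.filename
            WHERE e.location = 'abstract'
            ORDER BY distance
            LIMIT %s
            """,
            (vec, top_k),
        ).fetchall()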
Get started
If you want more data-infra-for-AI recipes like this, star CocoIndex on GitHub and follow along! 🚀