BARSIC: Five Questions That Make the Talk Click

#snowflake #snowpark #cheminformatics #rdkit

Johnson & Johnson × Snowflake Dev Day 2026
The speaker is a senior software engineer from J&J, presenting BARSIC — Basic All-purpose RDKit-based SQL Instant Chemistry for Snowflake — their open-source cheminformatics platform. The slides are technically dense; the five questions below trace the full arc.

Q1: A chemist draws a benzene ring — how does a database "understand" it?

A chemist thinks in structures: atoms, bonds, chirality. A relational database thinks in strings and numbers. There is no native overlap.

Several encodings bridge that gap:

SMILES (Simplified Molecular Input Line Entry System): compresses a structure into a one-dimensional string. Phenol, for instance, becomes Oc1ccccc1. One molecule can have dozens of valid SMILES representations — which is precisely where problems begin.
Binary fingerprint: encodes structural features into a bit vector for fast similarity comparisons.
SMARTS: a pattern language for substructure queries, analogous to regular expressions for text.

The core insight: a database never stores "a molecule" — it stores an encoding of one. Doing meaningful search across those encodings is the engineering problem the whole field is trying to solve.

Q2: Finding "all molecules containing a specific functional group" — why is that so hard?

This is called a substructure search: given a query fragment (say, a chlorinated alkene like C=CCl), find every molecule in the library that contains it.

It sounds straightforward. The underlying problem is not:

Subgraph Isomorphism — NP-complete

Determining whether one molecular graph contains another has no known polynomial-time general solution. Running it once on a single molecule is fine. Running it across hundreds of millions of records is a different matter entirely.

The slide puts it plainly: substructure search is the most resource-demanding and challenging of the three common search types (exact match, similarity, substructure). PubChem alone holds 123 million chemical structures. Large pharmaceutical compound collections add further scale.

At that scale, brute-force row-by-row scanning is not an option. The solution has to be architectural.

Q3: Snowflake is powerful — why not just plug in a chemistry extension?

The intuitive answer. Snowflake's architecture makes it unexpectedly difficult.

How traditional databases handle it (PostgreSQL, Oracle, etc.):

Extend the query engine's encoding and decoding logic directly
Accelerate substructure searches with GiST indexes (domain-specific structural indexes)

Snowflake's two structural barriers:

Hybrid columnar storage with micro-partitions — no traditional row-level indexes, so GiST-style extensions have no foothold
No public API hooks into query execution — there is no supported way to intercept and augment how Snowflake processes a query

The commonly proposed workaround was to run a PostgreSQL instance alongside Snowflake, split the query between them, and merge the results. The problem: shuttling data back and forth is expensive, latency is high, and the operational overhead is substantial.

A fundamental paradigm shift was required.

Q4: UDFs can break through — but is the performance actually usable?

Yes, with two levels of optimization layered on top of each other.

Step 1: Snowpark + UDF

Snowpark allows Python, Java, and Scala code to execute inside Snowflake, eliminating the need for an external processing engine. J&J's approach:

Use Anaconda integration to bring RDKit (the dominant open-source cheminformatics library, implemented in C++ with a Python API) natively into Snowflake
Wrap RDKit's encoding, decoding, and matching logic in Scalar UDFs
Expose substructure search as a first-class SQL function call

Step 2: Stored Procedure + Fingerprint Prescreening

A naive UDF is still too slow — running full subgraph isomorphism on every row puts 2M rows at 60–120 seconds. The two-stage pipeline addresses this:

Stage	Operation	Scope
Stored Procedure	Pre-compute the query molecule's fingerprint	Once
UDF + BITAND	Bitwise filter — eliminate non-matches cheaply	All rows, very fast
RDKit full match	Exact subgraph isomorphism	Candidates only (small set)

Per the slide benchmarks, the same 2M rows run in 10–30 seconds with the optimized pipeline — a 4–6× improvement over the naive approach.

The speaker also reported that in offline testing against the full PubChem collection (123 million structures), search times for a typical substructure query were not dramatically different from those on a 3M-row subset — suggesting the pipeline scales well.

Q5: Is BARSIC just a chemistry tool for J&J's internal use?

No — and this is the most broadly applicable insight in the talk.

J&J chose a full open-source release (Apache 2.0) and deliberately designed BARSIC as a general pattern, not a chemistry-specific product:

Swap RDKit for spaCy (NLP), Shapely (geospatial), BioPython (genomics) — same UDF + stored procedure pattern.

BARSIC's three-layer architecture:

Layer 3 — BARSIC SQL API
  Encoding/Decoding · Exact search · Similarity search · Substructure search
  Molecular property calculations · Fingerprinting
        ↑
Layer 2 — RDKit  (swappable for any domain-specific Python library)
        ↑
Layer 1 — Snowflake  (columnar storage + elastic compute clusters)

The broader point: any domain that has a capable Python library can follow this same pattern to bring domain-specific computation directly to the data in Snowflake — no pipelines, no data movement, warehouse-scale parallelism for free.

GitHub: github.com/johnsonandjohnson/BARSIC — Apache 2.0, Snowflake Marketplace listing coming soon.

Summary

Dimension	Key takeaway
Problem	Substructure search requires solving subgraph isomorphism — NP-complete — at scale across millions to hundreds of millions of structures
Barrier	Snowflake's columnar architecture and lack of query hooks rule out traditional DB extension approaches
Breakthrough	Snowpark brings Python computation to the data, eliminating data movement entirely
Optimization	Fingerprint prescreening + RDKit exact match; 4–6× faster than naive UDF on 2M rows
Generality	The pattern is domain-agnostic — RDKit today, spaCy/BioPython/Shapely tomorrow
Availability	Fully open source, Apache 2.0

This post exists because of a gap between what was said and what could have been heard. The speaker at J&J's Snowflake Dev Day 2026 session had genuinely solid material — a clean architecture, real benchmark numbers, and a pattern that generalizes well beyond chemistry. But poor audio quality and a presentation style that buried the narrative made it hard to follow in the room. The five-question structure below is an attempt to re-tell the same content in a way that earns the audience's attention — starting from the problem a chemist actually faces, and building toward the architectural insight that makes BARSIC worth paying attention to.

Note: This post was drafted with the assistance of Claude, and reviewed by ChatGPT and Gemini.