<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Weimo Liu</title>
    <description>The latest articles on DEV Community by Weimo Liu (@weimoliu).</description>
    <link>https://dev.to/weimoliu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3076809%2F00240436-7db0-42b2-aef0-9efb9a0cb1d2.png</url>
      <title>DEV Community: Weimo Liu</title>
      <link>https://dev.to/weimoliu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/weimoliu"/>
    <language>en</language>
    <item>
      <title>How Wiz Crushed Lacework: A Data Infrastructure Perspective</title>
      <dc:creator>Weimo Liu</dc:creator>
      <pubDate>Mon, 04 Aug 2025 19:35:56 +0000</pubDate>
      <link>https://dev.to/weimoliu/how-wiz-crushed-lacework-a-data-infrastructure-perspective-1jk8</link>
      <guid>https://dev.to/weimoliu/how-wiz-crushed-lacework-a-data-infrastructure-perspective-1jk8</guid>
      <description>&lt;p&gt;Google's acquisition of Wiz for $32 billion was a clear signal to the industry: the cloud security war has a winner. What's more interesting is how they won. Wiz wasn't the first mover. Lacework started five years earlier with a solid team, strong product vision, and top-tier VC backing. So what went wrong for Lacework? And what went right for Wiz?&lt;/p&gt;

&lt;p&gt;If you browse social media, you'll find engineers and CISOs asking the same thing. Threads on X, Reddit, and Hacker News have dozens of posts dissecting the matchup. The answer holds lessons not just for security vendors, but for anyone building modern data-intensive products.&lt;/p&gt;

&lt;p&gt;Obviously, Wiz did many things right, from product strategy and GTM to customer support and execution. But there's one angle I haven't seen discussed much in the usual analysis. It happens to be my niche: data infrastructure. It might just be their secret weapon, and that's what prompted me to write this breakdown.&lt;/p&gt;

&lt;p&gt;Take Reddit, for example: multiple posts compare Lacework and Wiz, with engineers sharing firsthand experiences from evaluations and deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d7853dxwlvvhtyv53iz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1d7853dxwlvvhtyv53iz.png" alt=" " width="749" height="323"&gt;&lt;/a&gt;Source: &lt;a href="https://www.reddit.com/r/cybersecurity/comments/1crzc92/wiz_vs_lacework/" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp7waat87bzokt63pvbx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvp7waat87bzokt63pvbx.png" alt=" " width="670" height="285"&gt;&lt;/a&gt;Source: &lt;a href="https://www.reddit.com/r/cybersecurity/comments/1c1s9r2/comment/kz5xqe5/?utm_source=share&amp;amp;utm_medium=web3x&amp;amp;utm_name=web3xcss&amp;amp;utm_term=1&amp;amp;utm_content=share_button" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfed8i3bw1iqfxetdvkj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfed8i3bw1iqfxetdvkj.png" alt=" " width="595" height="217"&gt;&lt;/a&gt;Source: &lt;a href="https://www.reddit.com/r/cybersecurity/comments/1c1s9r2/comment/l7q1oie/?utm_source=share&amp;amp;utm_medium=web3x&amp;amp;utm_name=web3xcss&amp;amp;utm_term=1&amp;amp;utm_content=share_button" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5mdo0by0dtu9qvr5hcv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5mdo0by0dtu9qvr5hcv.png" alt=" " width="724" height="159"&gt;&lt;/a&gt;Source: &lt;a href="https://www.reddit.com/r/cybersecurity/comments/1c1s9r2/comment/kz7xx7s/?utm_source=share&amp;amp;utm_medium=web3x&amp;amp;utm_name=web3xcss&amp;amp;utm_term=1&amp;amp;utm_content=share_button" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt85nxekcshruk3c3y5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt85nxekcshruk3c3y5d.png" alt=" " width="689" height="202"&gt;&lt;/a&gt;Source: &lt;a href="https://www.reddit.com/r/cybersecurity/comments/1cxi15l/comment/l553xbg/?utm_source=share&amp;amp;utm_medium=web3x&amp;amp;utm_name=web3xcss&amp;amp;utm_term=1&amp;amp;utm_content=share_button" rel="noopener noreferrer"&gt;Reddit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm not a security guy. I come from a data infrastructure background. But this story is just as much about data architecture as it is about product strategy.&lt;/p&gt;

&lt;p&gt;Let's look at what each company built.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lacework: Graph Ideas, SQL Reality
&lt;/h2&gt;

&lt;p&gt;Lacework launched in 2015 with the Polygraph® Data Platform. It aimed to detect threats by mapping relationships and behaviors between cloud assets, a classic graph use case. But under the hood, Lacework didn't use a graph database. They built it on Snowflake.&lt;/p&gt;

&lt;p&gt;Why Snowflake? Probably because Sutter Hill Ventures incubated both companies. And to be fair, Snowflake made sense on paper. It offers strong scalability and relatively low cost. You can store huge volumes of cloud telemetry, and it scales elastically. That's helpful for cost control and data retention.&lt;/p&gt;

&lt;p&gt;But there's a catch. Snowflake isn't built for graph workloads. Writing a 3-hop relationship query in SQL can take 100+ lines of nested joins. Here's what a basic traversal looks like in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;network_id&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; 
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;logins&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; 
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;devices&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt; 
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;connections&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt; 
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;networks&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;network_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;network_id&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now imagine debugging this at 10 hops, with filters, aggregations, and alert logic. Even the best engineers will slow down. Development becomes brittle and difficult to maintain.&lt;/p&gt;
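&lt;p&gt;SQL's usual escape hatch for variable-depth traversal is a recursive CTE, and it isn't much friendlier. Here's a sketch (not Lacework's actual code; it assumes you've already flattened every relationship into a generic &lt;code&gt;edges(src_id, dst_id)&lt;/code&gt; table, which is extra work in itself):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH RECURSIVE reachable AS (
  SELECT u.user_id AS node_id, 0 AS hops
  FROM users u
  WHERE u.user_id = 'alice'
  UNION ALL
  SELECT e.dst_id, r.hops + 1
  FROM reachable r
  JOIN edges e ON e.src_id = r.node_id
  WHERE r.hops &amp;lt; 10   -- depth cap; real queries also need cycle handling
)
SELECT DISTINCT node_id FROM reachable;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;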

&lt;h2&gt;
  
  
  Wiz: Native Graph, Feature Velocity
&lt;/h2&gt;

&lt;p&gt;Wiz was founded in 2020 by Assaf Rappaport and his former team from Adallom. They chose a different path. From day one, Wiz used Amazon Neptune, a native graph database.&lt;/p&gt;

&lt;p&gt;In a joint blog post with AWS titled "The World is a Graph", Wiz CTO Ami Luttwak explained their approach:&lt;/p&gt;

&lt;p&gt;"The world is a graph, not a table. It's time our tooling reflected this."&lt;/p&gt;

&lt;p&gt;Wiz modeled everything (users, assets, roles, and flows) as nodes and edges. They queried it with Gremlin. Here's a real-world example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;V&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;hasLabel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"vm"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;has&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"public"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"connectedTo"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;hasLabel&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"network"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"reachableBy"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;has&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"admin"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of logic takes just a few lines of Gremlin. In SQL? It would be a nightmare.&lt;/p&gt;

&lt;p&gt;This architectural choice gave Wiz a massive edge in developer velocity. With Neptune and Gremlin, engineers could express complex security logic in concise, readable queries and ship them quickly. What would take days or weeks in SQL due to brittle joins and long query chains could be prototyped and pushed in hours. This mattered. Security is a fast-moving field, and Wiz's ability to ship features at startup speed meant it could respond to customer requests, compliance requirements, and threat intelligence faster than Lacework. Even with a smaller team, they consistently outpaced Lacework's product delivery cadence.&lt;/p&gt;

&lt;p&gt;By 2022, Wiz deepened its commitment to graph infrastructure by continuing to scale on Amazon Neptune. Their bet on native graph tech was not just architectural; it defined their velocity and differentiation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Graph Bet That Changed Everything
&lt;/h2&gt;

&lt;p&gt;Lacework prioritized cost efficiency. By using Snowflake, they could ingest and retain massive volumes of telemetry with elastic scaling and lower marginal cost. They didn't need to maintain a separate graph database or optimize for graph workloads. The tradeoff was in capability: Snowflake's tabular design wasn't built for deep relationship queries. Modeling graph logic in SQL, especially multi-hop joins, was verbose, fragile, and hard to iterate on. This slowed down development and made advanced threat modeling harder to execute.&lt;/p&gt;

&lt;p&gt;Wiz optimized for speed. By betting on a native graph engine, they gained fast iteration, concise query logic, and a security model grounded in relationships. They could express new detections or traversal-based insights in a few lines of Gremlin, prototype ideas quickly, and ship updates faster.&lt;/p&gt;

&lt;p&gt;In cybersecurity, speed wins. Customers care more about feature velocity and detection quality than marginal compute savings. Wiz took a costly but strategic path: they paid much more for infrastructure but delivered faster innovation and outpaced the field. And that cost is real; native graph databases bring their own scaling headaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A friend at a Series F cybersecurity startup told me they only store a single day's worth of graph data, because the graph database cannot scale out.&lt;/li&gt;
&lt;li&gt;Another company splits its graph workload: topology stays in a graph database, but all attributes are offloaded to a warehouse like Snowflake or Databricks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lacework's architecture helped them scale cheaply, but that same architecture made it difficult to build graph-native security features. Their bet optimized for storage and cost; Wiz's bet optimized for iteration and product value. The outcome was clear.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can You Get Wiz's Speed With the Low Cost and Unlimited Scalability of Data Lakes?
&lt;/h2&gt;

&lt;p&gt;If you've made it this far, you might be wondering: is it possible to get the benefits of a native graph system (fast iteration, expressive multi-hop queries) without the painful cost and complexity of traditional graph databases?&lt;br&gt;
Plenty of cybersecurity unicorns have attempted creative workarounds to the scalability and cost challenges of traditional graph databases, but the wish list most teams carry around looks something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No ETL&lt;/li&gt;
&lt;li&gt;No duplicated storage&lt;/li&gt;
&lt;li&gt;Query your Parquet files or Iceberg/Delta tables with Cypher or Gremlin&lt;/li&gt;
&lt;li&gt;Subsecond response times&lt;/li&gt;
&lt;li&gt;Lower cost than Snowflake&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those workarounds are clever tradeoffs. But they're still compromises.&lt;br&gt;
What if you didn't have to choose between fast iteration and low cost?&lt;/p&gt;

&lt;p&gt;(Trigger warning: Shameless plug coming)&lt;/p&gt;

&lt;p&gt;That's the question we asked ourselves when building PuppyGraph, a graph query engine designed to run directly on your data lake.&lt;/p&gt;

&lt;p&gt;Wiz chose graphs and shipped features fast. Lacework chose SQL and struggled with velocity.&lt;/p&gt;

&lt;p&gt;The best part of the Wiz story isn't just that they chose graphs; it's that they embraced the tradeoff. They paid more in infrastructure, but got faster iteration and better product velocity in return.&lt;/p&gt;

&lt;p&gt;Now imagine building at that speed, with a much smaller bill. If you're building the next Wiz, maybe you don't need a $32B exit. Perhaps you just need the right graph engine. (Okay fine, a $32B exit would be nice too.)&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Real-Time Threat Detection With MongoDB &amp; PuppyGraph</title>
      <dc:creator>Weimo Liu</dc:creator>
      <pubDate>Fri, 11 Jul 2025 03:37:19 +0000</pubDate>
      <link>https://dev.to/weimoliu/real-time-threat-detection-with-mongodb-puppygraph-29la</link>
      <guid>https://dev.to/weimoliu/real-time-threat-detection-with-mongodb-puppygraph-29la</guid>
      <description>&lt;p&gt;Security operations teams face an increasingly complex environment. Cloud-native applications, identity sprawl, and continuous infrastructure changes generate a flood of logs and events. From API calls in AWS to lateral movement between virtual machines, the volume of telemetry is enormous-and it's growing.&lt;/p&gt;

&lt;p&gt;The challenge isn't just scale. It's structure. Traditional security tooling often looks at events in isolation, relying on static rules or dashboards to highlight anomalies. But real attacks unfold as chains of related actions: a user assumes a role, launches a resource, accesses data, and then pivots again. These relationships are hard to capture with flat queries or disconnected logs.&lt;/p&gt;

&lt;p&gt;That's where graph analytics comes in. By modeling your data as a network of users, sessions, identities, and events, you can trace how threats emerge and evolve. And with PuppyGraph, you don't need a separate graph database or batch pipelines to get there.&lt;br&gt;
In this post, we'll show how to combine MongoDB and PuppyGraph to analyze AWS CloudTrail data as a graph, without moving or duplicating data. You'll see how to uncover privilege escalation chains, map user behavior across sessions, and detect suspicious access patterns in real time.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why MongoDB for cybersecurity data
&lt;/h2&gt;

&lt;p&gt;MongoDB is a popular choice for managing security telemetry. Its document-based model is ideal for ingesting unstructured and semi-structured logs like those generated by AWS CloudTrail, GuardDuty, or Kubernetes audit logs. Events are stored as flexible JSON documents, which evolve naturally as logging formats change.&lt;/p&gt;
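&lt;p&gt;For a sense of the shape involved, here is a heavily abridged CloudTrail event as it might sit in a collection (the field names are standard CloudTrail; the values are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "eventTime": "2017-02-12T21:59:18Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "ListBuckets",
  "awsRegion": "us-west-2",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {
    "type": "IAMUser",
    "accountId": "811596193553",
    "sessionContext": { "attributes": { "mfaAuthenticated": "false" } }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;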

&lt;p&gt;This flexibility matters in security, where schemas can shift as providers update APIs or teams add new context to events. MongoDB handles these changes without breaking pipelines or requiring schema migrations. It also supports high-throughput ingestion and horizontal scaling, making it well-suited for operational telemetry.&lt;/p&gt;

&lt;p&gt;Many security products and SIEM backends already support MongoDB as a destination for real-time event streams. That makes it a natural foundation for graph-based security analytics: The data is already there—rich, semi-structured, and continuously updated.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why graph analytics for threat detection
&lt;/h2&gt;

&lt;p&gt;Modern security incidents rarely unfold as isolated events. Attackers don’t just trip a single rule—they navigate through systems, identities, and resources, often blending in with legitimate activity. Understanding these behaviors means connecting the dots across multiple entities and actions. That’s precisely what graph analytics excels at. By modeling users, sessions, events, and assets as interconnected nodes and edges, analysts can trace how activity flows through a system. This structure makes it easy to ask questions that involve multiple hops or indirect relationships—something traditional queries often struggle to express.&lt;/p&gt;

&lt;p&gt;For example, imagine you’re investigating activity tied to a specific AWS account. You might start by counting how many sessions are associated with that account. Then, you might break those sessions down by whether they were authenticated using MFA. If some weren’t, the next question becomes: What resources were accessed during those unauthenticated sessions?&lt;/p&gt;

&lt;p&gt;This kind of multi-step investigation is where graph queries shine. Instead of scanning raw logs or filtering one table at a time, you can traverse the entire path from account to identity to session to event to resource, all in a single query. You can also group results by attributes like resource type to identify which services were most affected.&lt;/p&gt;
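&lt;p&gt;As a hedged sketch, that entire path might read as a single Gremlin traversal. (&lt;code&gt;HasIdentity&lt;/code&gt; and &lt;code&gt;HasSession&lt;/code&gt; match the demo schema used later in this post; &lt;code&gt;RecordsEvent&lt;/code&gt;, &lt;code&gt;AccessesResource&lt;/code&gt;, and &lt;code&gt;resource_type&lt;/code&gt; are hypothetical names for the remaining hops.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V("Account[811596193553]")
  .out("HasIdentity").out("HasSession")
  .has("mfa_authenticated", false)             // only non-MFA sessions
  .out("RecordsEvent").out("AccessesResource") // hypothetical edge labels
  .groupCount().by("resource_type")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;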

&lt;p&gt;And when needed, you can go beyond metrics and pivot to visualization, mapping out full access paths to see how a specific user or session interacted with sensitive infrastructure. This helps surface lateral movement, track privilege escalation, and uncover patterns that static alerts might miss.&lt;/p&gt;

&lt;p&gt;Graph analytics doesn’t replace your existing detection rules; it complements them by revealing the structure behind security activity. It turns complex event relationships into something you can query directly, explore interactively, and act on with confidence.&lt;/p&gt;
&lt;h2&gt;
  
  
  Query MongoDB data as a graph without ETL
&lt;/h2&gt;

&lt;p&gt;MongoDB is a popular choice for storing security event data, especially when working with logs that don’t always follow a fixed structure. Services like AWS CloudTrail produce large volumes of JSON-based records with fields that can differ across events. MongoDB’s flexible schema makes it easy to ingest and query that data as it evolves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hubs.ly/Q03q_lFd0" rel="noopener noreferrer"&gt;PuppyGraph&lt;/a&gt; builds on this foundation by introducing graph analytics—without requiring any data movement. Through the &lt;a href="https://www.mongodb.com/products/platform/atlas-sql-interface" rel="noopener noreferrer"&gt;MongoDB Atlas SQL Interface&lt;/a&gt;, PuppyGraph can connect directly to your collections and treat them as relational tables. From there, you define a graph model by mapping key fields into nodes and relationships.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 1.&lt;/strong&gt; Architecture of the integration of MongoDB and PuppyGraph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkehjez8lwavysvblmb6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkehjez8lwavysvblmb6w.png" alt="Figure 1. Architecture of the integration of MongoDB and PuppyGraph." width="800" height="285"&gt;&lt;/a&gt;&lt;br&gt;
This makes it possible to explore questions that involve multiple entities and steps, such as tracing how a session relates to an identity or which resources were accessed without MFA. The graph itself is virtual. There’s no ETL process or data duplication. Queries run in real time against the data already stored in MongoDB.&lt;/p&gt;

&lt;p&gt;While PuppyGraph works with tabular structures exposed through the SQL interface, many security logs already follow a relatively flat pattern: consistent fields like account IDs, event names, timestamps, and resource types. That makes it straightforward to build graphs that reflect how accounts, sessions, events, and resources are linked. By layering graph capabilities on top of MongoDB, teams can ask more connected questions of their security data, without changing their storage strategy or duplicating infrastructure.&lt;/p&gt;
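&lt;p&gt;Concretely, the graph model is a JSON file that maps collections (exposed as tables through the SQL interface) onto node and edge labels with key and join columns. The sketch below shows the idea only, not PuppyGraph's exact schema syntax, and the table names are hypothetical; the working &lt;code&gt;schema.json&lt;/code&gt; ships with the demo materials linked below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Conceptual sketch only -- not the exact PuppyGraph schema format.
node  Account   from table "accounts"    key account_id
node  Identity  from table "identities"  key identity_id
node  Session   from table "sessions"    key session_id   attrs: mfa_authenticated
edge  HasIdentity  Account  -&amp;gt; Identity  join accounts.account_id = identities.account_id
edge  HasSession   Identity -&amp;gt; Session   join identities.identity_id = sessions.identity_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;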
&lt;h2&gt;
  
  
  Investigating CloudTrail activity using graph queries
&lt;/h2&gt;

&lt;p&gt;To demonstrate how graph analytics can enhance security investigations, we’ll explore a real-world dataset of AWS CloudTrail logs. This dataset originates from &lt;a href="https://summitroute.com/blog/2020/10/09/public_dataset_of_cloudtrail_logs_from_flaws_cloud/" rel="noopener noreferrer"&gt;flaws.cloud&lt;/a&gt;, a security training environment developed by Scott Piper.&lt;/p&gt;

&lt;p&gt;The dataset comprises anonymized CloudTrail logs collected over 3.5 years, capturing a wide range of simulated attack scenarios within a controlled AWS environment. It includes over 1.9 million events, featuring interactions from thousands of unique IP addresses and user agents. The logs encompass various AWS API calls, providing a comprehensive view of potential security events and misconfigurations.&lt;/p&gt;

&lt;p&gt;For our demonstration, we imported a subset of approximately 100,000 events into MongoDB Atlas. With PuppyGraph's graph analytics applied on top, we can model and analyze complex relationships between accounts, identities, sessions, events, and resources.&lt;/p&gt;
&lt;h3&gt;
  
  
  Demo
&lt;/h3&gt;

&lt;p&gt;Let’s walk through the demo step by step! We have provided all the materials for this demo on &lt;a href="https://github.com/puppygraph/puppygraph-getting-started/tree/main/use-case-demos/cloudtrail-mongodb-demo" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Please download the materials or clone the repository directly.&lt;/p&gt;

&lt;p&gt;If you’re new to integrating MongoDB Atlas with PuppyGraph, we recommend starting with the &lt;a href="https://hubs.ly/Q03q_lRV0" rel="noopener noreferrer"&gt;MongoDB Atlas + PuppyGraph Quickstart Demo&lt;/a&gt; to get familiar with the setup and core concepts.&lt;/p&gt;
&lt;h4&gt;
  
  
  Prerequisites
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;A MongoDB Atlas account (free tier is sufficient)&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;li&gt;Python 3&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Set up MongoDB Atlas
&lt;/h4&gt;

&lt;p&gt;Follow the &lt;a href="https://www.mongodb.com/docs/atlas/getting-started/" rel="noopener noreferrer"&gt;MongoDB Atlas Getting Started guide&lt;/a&gt; to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new cluster (free tier is fine).&lt;/li&gt;
&lt;li&gt;Add a database user.&lt;/li&gt;
&lt;li&gt;Configure IP access.&lt;/li&gt;
&lt;li&gt;Note your connection string for the MongoDB Python driver (you’ll need it shortly).&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Download and import CloudTrail logs
&lt;/h4&gt;

&lt;p&gt;Run the following commands to fetch and prepare the dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://summitroute.com/downloads/flaws_cloudtrail_logs.tar
mkdir -p ./raw_data
tar -xvf flaws_cloudtrail_logs.tar --strip-components=1 -C ./raw_data
gunzip ./raw_data/*.json.gz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a virtual environment and install dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# On some Linux distributions, install `python3-venv` first.
sudo apt-get update
sudo apt-get install python3-venv
# Create a virtual environment, activate it, and install the necessary packages 
python3 -m venv venv
source venv/bin/activate
pip install ijson faker pandas pymongo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Import the first chunk of CloudTrail data (replace the connection string with your Atlas URI):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export MONGODB_CONNECTION_STRING="your_mongodb_connection_string"
python import_data.py raw_data/flaws_cloudtrail00.json --database cloudtrail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a new cloudtrail database and loads the first chunk of data containing 100,000 structured events.&lt;/p&gt;

&lt;h4&gt;
  
  
  Enable Atlas SQL interface and get JDBC URI
&lt;/h4&gt;

&lt;p&gt;To enable graph access:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an Atlas SQL Federated Database instance.&lt;/li&gt;
&lt;li&gt;Ensure the schema is available (generate from sample, if needed).&lt;/li&gt;
&lt;li&gt;Copy the JDBC URI from the Atlas SQL interface.
See PuppyGraph’s guide for &lt;a href="https://docs.puppygraph.com/getting-started/querying-mongodb-atlas-data-as-a-graph/" rel="noopener noreferrer"&gt;setting up MongoDB Atlas SQL&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Start PuppyGraph and upload the graph schema
&lt;/h4&gt;

&lt;p&gt;Start the PuppyGraph container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 \
  -e PUPPYGRAPH_PASSWORD=puppygraph123 \
  -d --name puppy --rm --pull=always puppygraph/puppygraph:stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log in to the web UI at &lt;a href="http://localhost:8081" rel="noopener noreferrer"&gt;http://localhost:8081&lt;/a&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Username: puppygraph&lt;/li&gt;
&lt;li&gt;Password: puppygraph123&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Upload the schema:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open schema.json.&lt;/li&gt;
&lt;li&gt;Fill in your JDBC URI, username, and password.&lt;/li&gt;
&lt;li&gt;Upload via the &lt;strong&gt;Upload Graph Schema JSON&lt;/strong&gt; section or run:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -XPOST -H "content-type: application/json" \
  --data-binary @./schema.json \
  --user "puppygraph:puppygraph123" localhost:8081/schema
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for the schema to upload and initialize (approximately five minutes).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Figure 2.&lt;/strong&gt; A graph visualization of the schema, which models the graph from relational data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29q06nhhso0su5n2s393.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29q06nhhso0su5n2s393.png" alt="Figure 2. A graph visualization of the schema, which models the graph from relational data." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Run graph queries to investigate security activity
&lt;/h4&gt;

&lt;p&gt;Once the graph is live, open the Query panel in PuppyGraph’s UI.&lt;/p&gt;

&lt;p&gt;Let's say we want to investigate the activity of a specific account. First, we count the number of sessions associated with the account.&lt;/p&gt;

&lt;p&gt;Cypher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH (a:Account)-[:HasIdentity]-&amp;gt;(i:Identity)
  -[:HasSession]-&amp;gt;(s:Session)
WHERE id(a) = "Account[811596193553]"
RETURN count(s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gremlin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V("Account[811596193553]")
 .out("HasIdentity").out("HasSession").count()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Figure 3.&lt;/strong&gt; Graph query in the PuppyGraph UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcoe8zsxfjbqsp2fdtwr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhcoe8zsxfjbqsp2fdtwr.png" alt="Figure 3. Graph query in the PuppyGraph UI." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, we break these sessions down by whether or not they are MFA-authenticated.&lt;/p&gt;

&lt;p&gt;Cypher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH (a:Account)-[:HasIdentity]-&amp;gt;(i:Identity)
  -[:HasSession]-&amp;gt;(s:Session)
WHERE id(a) = "Account[811596193553]"
RETURN s.mfa_authenticated AS mfaStatus, count(s) AS count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gremlin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V("Account[811596193553]")
  .out("HasIdentity").out("HasSession")
  .groupCount().by("mfa_authenticated")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Figure 4.&lt;/strong&gt; Graph query results in the PuppyGraph UI.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rd9nuf5fw63lceynvp8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3rd9nuf5fw63lceynvp8.png" alt="Figure 4. Graph query results in the PuppyGraph UI." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;
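
&lt;p&gt;To make the traversal semantics concrete, here is a pure-Python sketch of what these chained &lt;code&gt;out()&lt;/code&gt; hops and the &lt;code&gt;groupCount()&lt;/code&gt; by property compute. The adjacency data below is a made-up toy example, not the actual CloudTrail graph:&lt;/p&gt;

```python
from collections import Counter

# Toy in-memory graph: adjacency lists keyed by (vertex, edge label).
# The vertices and sessions here are illustrative only.
edges = {
    ("Account[811596193553]", "HasIdentity"): ["Identity[admin]", "Identity[ci-bot]"],
    ("Identity[admin]", "HasSession"): ["Session[s1]", "Session[s2]"],
    ("Identity[ci-bot]", "HasSession"): ["Session[s3]"],
}

session_props = {
    "Session[s1]": {"mfa_authenticated": True},
    "Session[s2]": {"mfa_authenticated": False},
    "Session[s3]": {"mfa_authenticated": False},
}

def out(vertices, label):
    """Follow all outgoing edges with the given label (like Gremlin's .out(label))."""
    return [t for v in vertices for t in edges.get((v, label), [])]

# g.V("Account[...]").out("HasIdentity").out("HasSession").count()
sessions = out(out(["Account[811596193553]"], "HasIdentity"), "HasSession")
print(len(sessions))  # 3

# .groupCount().by("mfa_authenticated")
by_mfa = Counter(session_props[s]["mfa_authenticated"] for s in sessions)
print(dict(by_mfa))  # {True: 1, False: 2}
```

&lt;p&gt;On the real graph, PuppyGraph evaluates the same pattern against the underlying relational tables instead of an in-memory dictionary.&lt;/p&gt;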

&lt;p&gt;Next, we focus on the sessions that are not MFA-authenticated and see which resources they accessed.&lt;/p&gt;

&lt;p&gt;Cypher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH (a:Account)-[:HasIdentity]-&amp;gt;
  (i:Identity)-[:HasSession]-&amp;gt;
  (s:Session {mfa_authenticated: false})
  -[:RecordsEvent]-&amp;gt;(e:Event)
  -[:OperatesOn]-&amp;gt;(r:Resource)
WHERE id(a) = "Account[811596193553]"
RETURN r.resource_type AS resourceType, count(r) AS count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gremlin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V("Account[811596193553]").out("HasIdentity")
  .out("HasSession")
  .has("mfa_authenticated", false)
  .out('RecordsEvent').out('OperatesOn')
  .groupCount().by("resource_type")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Figure 5.&lt;/strong&gt; PuppyGraph UI showing results that are not MFA authenticated.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filxe85mozfbscymg1mr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Filxe85mozfbscymg1mr2.png" alt="Figure 5. PuppyGraph UI showing results that are not MFA authenticated." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we visualize those access paths as a graph.&lt;/p&gt;

&lt;p&gt;Cypher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH path = (a:Account)-[:HasIdentity]-&amp;gt;
  (i:Identity)-[:HasSession]-&amp;gt;
  (s:Session {mfa_authenticated: false})
  -[:RecordsEvent]-&amp;gt;(e:Event)
  -[:OperatesOn]-&amp;gt;(r:Resource)
WHERE id(a) = "Account[811596193553]"
RETURN path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gremlin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V("Account[811596193553]").out("HasIdentity").out("HasSession").has("mfa_authenticated", false)
  .out('RecordsEvent').out('OperatesOn')
  .path()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Figure 6.&lt;/strong&gt; Graph visualization in PuppyGraph UI.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtykf6feuxwt47m1j1lr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtykf6feuxwt47m1j1lr.png" alt="Figure 6. Graph visualization in PuppyGraph UI." width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Tear down the environment
&lt;/h4&gt;

&lt;p&gt;When you’re done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker stop puppy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your MongoDB data will persist in Atlas, so you can revisit or expand the graph model at any time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Security data is rich with relationships between users, sessions, resources, and actions. Modeling these connections explicitly makes it easier to understand what’s happening in your environment, especially when investigating incidents or searching for hidden risks.&lt;/p&gt;

&lt;p&gt;By combining MongoDB Atlas and PuppyGraph, teams can analyze those relationships in real time without moving data or maintaining a separate &lt;a href="https://www.puppygraph.com/blog/graph-database?utm_campaign=12151601-MongoDB%20partner%20blog%20campaign&amp;amp;utm_source=https%3A%2F%2Fwww.mongodb.com%2F&amp;amp;utm_medium=MongoDB%20blog" rel="noopener noreferrer"&gt;graph database&lt;/a&gt;. MongoDB provides the flexibility and scalability to store complex, evolving security logs like AWS CloudTrail, while PuppyGraph adds a native graph layer for exploring that data as connected paths and patterns.&lt;/p&gt;

&lt;p&gt;In this post, we walked through how to import real-world audit logs, define a graph schema, and investigate access activity using graph queries. With just a few steps, you can transform a log collection into an interactive graph that reveals how activity flows across your cloud infrastructure.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>PuppyGraph on MongoDB: Native Graph Queries Without ETL</title>
      <dc:creator>Weimo Liu</dc:creator>
      <pubDate>Tue, 22 Apr 2025 22:13:34 +0000</pubDate>
      <link>https://dev.to/weimoliu/puppygraph-on-mongodb-native-graph-queries-without-etl-416l</link>
      <guid>https://dev.to/weimoliu/puppygraph-on-mongodb-native-graph-queries-without-etl-416l</guid>
<description>&lt;p&gt;MongoDB powers a wide range of workloads—from product catalogs to telemetry streams to user activity logs. Its schema-less structure and distributed architecture make it a natural fit for applications that demand both agility and scale.&lt;/p&gt;

&lt;p&gt;But in many real-world scenarios, data points aren’t just valuable on their own—they’re more powerful when understood in context. Connections between entities often reveal the patterns that matter most: how users interact, how systems behave, and how events unfold over time. While MongoDB provides expressive tools for working with nested documents and join operations across collections, some types of relationship analysis are more naturally expressed as graph queries.&lt;/p&gt;

&lt;p&gt;That’s where PuppyGraph comes in. It adds a real-time graph layer on top of your existing MongoDB deployment—no ETL, no separate &lt;a href="https://hubs.ly/Q03jp_470" rel="noopener noreferrer"&gt;graph databases&lt;/a&gt; needed. You can define a graph model across your collections and run queries using openCypher or Gremlin, all without modifying your source data.&lt;/p&gt;

&lt;p&gt;In this tutorial, we’ll walk through how PuppyGraph connects to MongoDB, how it complements document-based architectures with graph capabilities, and how you can get started running graph queries with minimal setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MongoDB?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.mongodb.com/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt; is a document-oriented NoSQL database designed to store and manage data in a flexible, JSON-like format. Unlike traditional relational databases that use tables and rows, MongoDB employs collections and documents, allowing for dynamic schemas that can evolve with application requirements. This flexibility makes it particularly well-suited for handling semi-structured and unstructured data, accommodating use cases such as content management systems, real-time analytics, AI vector search, and Internet of Things (IoT) applications.&lt;/p&gt;

&lt;p&gt;In MongoDB, data is organized into collections of documents, each containing key-value pairs. This structure enables developers to represent complex hierarchical relationships within a single document, reducing the need for expensive join operations. For example, a document representing a blog post can encapsulate not only the post content but also metadata like author information and comments, all within the same document.&lt;/p&gt;
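
&lt;p&gt;For instance, such a blog-post document might look like the following. The field names and values are illustrative (shown here as a Python dict, which maps directly to a BSON document):&lt;/p&gt;

```python
# An illustrative blog-post document: content, author metadata, and comments
# are all embedded in one document, so no join is needed to assemble the page.
blog_post = {
    "title": "Graph Queries on MongoDB",
    "body": "Lorem ipsum...",
    "author": {"name": "Weimo Liu", "handle": "weimoliu"},
    "comments": [
        {"user": "alice", "text": "Great walkthrough!"},
        {"user": "bob", "text": "Does this work with self-hosted clusters?"},
    ],
    "tags": ["mongodb", "graph"],
}

# Everything needed to render the post is one lookup away.
print(blog_post["author"]["name"], len(blog_post["comments"]))
```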

&lt;p&gt;The database offers a rich set of features, including a powerful query API that supports field searches, range queries, and regular expressions. Indexing capabilities enhance query performance, allowing developers to create indexes on any field. Additionally, MongoDB’s aggregation framework facilitates data transformation and analysis directly within the database, streamlining the development of analytics applications.&lt;/p&gt;
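
&lt;p&gt;As a sketch of the idea, an aggregation pipeline is an ordered list of stages. The hypothetical pipeline below (the stage operators are standard MongoDB; the collection and field names are made up) filters documents and then groups them, followed by a plain-Python rendering of what those two stages compute:&lt;/p&gt;

```python
from collections import Counter

# A hypothetical pipeline: keep active documents, then count per event type.
pipeline = [
    {"$match": {"status": "active"}},
    {"$group": {"_id": "$event_type", "count": {"$sum": 1}}},
]

# What $match + $group compute, expressed in plain Python over sample docs:
docs = [
    {"event_type": "login", "status": "active"},
    {"event_type": "login", "status": "active"},
    {"event_type": "purchase", "status": "active"},
    {"event_type": "login", "status": "archived"},
]
matched = [d for d in docs if d["status"] == "active"]   # $match
counts = Counter(d["event_type"] for d in matched)        # $group + $sum
print(dict(counts))  # {'login': 2, 'purchase': 1}
```

&lt;p&gt;In MongoDB itself, the same pipeline would run server-side via &lt;code&gt;db.collection.aggregate(pipeline)&lt;/code&gt;, avoiding the round trip of pulling documents into the application.&lt;/p&gt;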

&lt;h2&gt;
  
  
  MongoDB Atlas: Managed Cloud Database Service
&lt;/h2&gt;

&lt;p&gt;Recognizing the operational challenges associated with managing databases, MongoDB introduced &lt;a href="https://www.mongodb.com/products/platform/atlas-database?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;MongoDB Atlas&lt;/a&gt;, a fully managed cloud database service. MongoDB Atlas simplifies the deployment, scaling, and management of MongoDB databases, allowing developers to focus on building applications rather than handling database administration tasks.&lt;/p&gt;

&lt;p&gt;MongoDB Atlas provides automated deployment across major cloud providers, including AWS, Google Cloud Platform, and Microsoft Azure, offering flexibility and global reach. It features automated backups, ensuring data durability and facilitating disaster recovery. Built-in monitoring tools provide real-time insights into database performance, enabling proactive optimization and maintenance.&lt;/p&gt;

&lt;p&gt;Security is a core component of MongoDB Atlas, with features such as end-to-end encryption, network isolation, and fine-grained access controls to protect sensitive data. The platform also supports compliance with various industry standards, making it suitable for applications with stringent security requirements.&lt;/p&gt;

&lt;p&gt;By combining the flexibility of MongoDB’s document model with the operational simplicity of a managed service, MongoDB Atlas empowers organizations to build and scale applications with greater agility and confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Graph Analytics for MongoDB using PuppyGraph
&lt;/h2&gt;

&lt;p&gt;For teams working with MongoDB, many valuable insights come from understanding how entities relate across collections — whether it’s tracing user journeys, mapping operational dependencies, or detecting linked anomalies. These questions are most naturally expressed as graph queries when the goal is to trace connections, analyze paths, or detect patterns that span multiple entities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hubs.ly/Q03jp_C_0" rel="noopener noreferrer"&gt;PuppyGraph&lt;/a&gt; adds a real-time graph layer to MongoDB, allowing teams to query those relationships using graph-specific languages like Gremlin or openCypher. Without migrating or duplicating data, you can define how collections map to nodes and edges, then run graph queries directly against MongoDB Atlas or self-hosted deployments. Under the hood, PuppyGraph connects via the MongoDB Atlas SQL JDBC driver, querying live data and returning results with no ETL or transformation required.&lt;/p&gt;

&lt;p&gt;This integration offers several key benefits:&lt;/p&gt;

&lt;p&gt;Query Live Data, Not Snapshots: MongoDB often powers applications with dynamic, operational data — user interactions, catalogs, IoT streams, or content updates. PuppyGraph allows the execution of graph queries directly on this live data. This means you can immediately analyze emerging relationships, detect anomalies like fraud patterns as they happen, or power real-time recommendation engines without waiting for slow batch processes or dealing with data staleness.&lt;/p&gt;

&lt;p&gt;Traverse Relationships with Purpose-Built Queries: Go beyond simple document retrieval. Graph query languages like Gremlin and openCypher, supported by PuppyGraph, are purpose-built for traversing connections, finding paths, analyzing influence, and understanding network structures. This facilitates the application of graph algorithms for tasks like PageRank (identifying importance), community detection (finding clusters), pathfinding, and more, uncovering insights hidden within the relationships scattered across your MongoDB documents — insights that might be difficult or inefficient to obtain using standard document queries alone.&lt;/p&gt;
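
&lt;p&gt;To illustrate what an algorithm like PageRank actually computes, here is a minimal power-iteration sketch in plain Python on a three-node toy graph. This is illustrative only; it does not show how PuppyGraph executes such algorithms internally:&lt;/p&gt;

```python
# Minimal power-iteration PageRank on a toy directed graph.
def pagerank(edges, nodes, damping=0.85, iters=50):
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_deg = {n: sum(1 for s, _ in edges if s == n) for n in nodes}
    for _ in range(iters):
        # Teleport term, then distribute each node's rank along its out-edges.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for s, d in edges:
            new[d] += damping * rank[s] / out_deg[s]
        # Redistribute rank from dangling nodes (no out-edges) uniformly.
        dangling = sum(rank[n] for n in nodes if out_deg[n] == 0)
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        rank = new
    return rank

nodes = ["a", "b", "c"]
edges = [("a", "c"), ("b", "c"), ("c", "a")]
r = pagerank(edges, nodes)
print(max(r, key=r.get))  # 'c' collects the most rank: two in-links
```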

&lt;p&gt;No ETL, No New Stack: Instead of building a separate graph system and maintaining sync jobs, PuppyGraph works directly with MongoDB. This means lower operational overhead, fewer moving parts, and analytics that always reflect the current state of your operational data — a unified source of truth for both document-based and graph-based analysis, all from your existing data platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration Architecture: PuppyGraph and MongoDB
&lt;/h2&gt;

&lt;p&gt;Integrating PuppyGraph with MongoDB Atlas involves a series of components working together to enable seamless graph analytics capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architectural Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;MongoDB Atlas&lt;/a&gt;: A fully managed cloud database service that stores data in a flexible, document-oriented format.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mongodb.com/try/download/jdbc-driver?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;MongoDB Atlas SQL JDBC Driver&lt;/a&gt;: Provides SQL-based access to MongoDB Atlas databases, facilitating connections with SQL-compatible tools and applications.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hubs.ly/Q03jp_C_0" rel="noopener noreferrer"&gt;PuppyGraph&lt;/a&gt;: Connects to MongoDB Atlas via the JDBC driver, allowing users to define graph schemas over existing collections and execute graph queries using languages like Gremlin or openCypher.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integration Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Prepare data in the MongoDB Atlas cluster: Create or import the necessary collections into your MongoDB Atlas cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configure Connection Settings: Set up the connection in PuppyGraph using the JDBC connection string.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define the Graph Schema: Map MongoDB collections to graph elements such as vertices and edges within PuppyGraph.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Execute Graph Queries and Algorithms: Use Gremlin or openCypher to perform complex graph traversals and run graph algorithms directly on MongoDB data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture allows organizations to leverage their existing MongoDB infrastructure to perform sophisticated graph analyses, enhancing their data analysis capabilities without the need for additional data processing steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnj03ikvev6ueqzel39v.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnj03ikvev6ueqzel39v.webp" alt="Figure: How PuppyGraph Integrates with MongoDB" width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step: Running Graph Queries on MongoDB Atlas with PuppyGraph
&lt;/h2&gt;

&lt;p&gt;Let's walk through a simple demo to see exactly how MongoDB integrates with PuppyGraph. We also recommend reading the &lt;a href="https://docs.puppygraph.com/getting-started/querying-mongodb-atlas-data-as-a-graph/" rel="noopener noreferrer"&gt;getting-started document&lt;/a&gt;; what we do here is essentially the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/atlas?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;MongoDB Atlas Cluster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/products/tools/shell?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;MongoDB Shell&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Create a MongoDB Atlas Cluster
&lt;/h3&gt;

&lt;p&gt;See the &lt;a href="https://www.mongodb.com/docs/atlas/getting-started/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; to get started with MongoDB Atlas. You can use the MongoDB Atlas CLI or MongoDB Atlas UI to deploy a free cluster easily. Follow the detailed instructions in the document up to step 4, &lt;a href="https://www.mongodb.com/docs/atlas/getting-started/#manage-the-ip-access-list." rel="noopener noreferrer"&gt;Manage the IP access list&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a MongoDB Atlas cluster.&lt;/li&gt;
&lt;li&gt;Deploy a Free cluster.&lt;/li&gt;
&lt;li&gt;Manage database users for your cluster.&lt;/li&gt;
&lt;li&gt;Manage the IP access list.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Data Preparation
&lt;/h3&gt;

&lt;p&gt;See the &lt;a href="https://www.mongodb.com/docs/mongodb-shell/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim#mongodb-binary-bin.mongosh" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; to connect your cluster via MongoDB Shell. You need to get your connection string. After connecting successfully, run the following commands to create collections and insert data. Documents within a collection are flexible; they don’t have to adhere to the same schema. However, to mitigate potential errors, we create collections with schema validators.&lt;/p&gt;

&lt;p&gt;First, select the database, which will be created automatically once the collections are created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;use modern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create collections with schema validators and insert data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;db.createCollection("person", {
   validator: {
      $jsonSchema: {
         bsonType: "object",
         required: [ "id", "name", "age" ],
         properties: {
            id: { bsonType: "string" },
            name: { bsonType: "string" },
            age: { bsonType: "int"}
         }
      }
   }
})
db.person.insertMany([
  {id: 'v1', name: 'marko', age: 29},
  {id: 'v2', name: 'vadas', age: 27},
  {id: 'v4', name: 'josh', age: 32},
  {id: 'v6', name: 'peter', age: 35}
])
db.createCollection("software", {
   validator: {
      $jsonSchema: {
         bsonType: "object",
         required: [ "id", "name", "lang" ],
         properties: {
            id: { bsonType: "string" },
            name: { bsonType: "string" },
            lang: { bsonType: "string" }
         }
      }
   }
})
db.software.insertMany([
  {id: 'v3', name: 'lop', lang: 'java'},
  {id: 'v5', name: 'ripple', lang: 'java'}
])
db.createCollection("created", {
   validator: {
      $jsonSchema: {
         bsonType: "object",
         required: [ "id", "from_id", "to_id", "weight" ],
         properties: {
            id: { bsonType: "string" },
            from_id: { bsonType: "string" },
            to_id: { bsonType: "string" },
            weight: { bsonType: "double" }
         }
      }
   }
})
db.created.insertMany([
  {id: 'e9', from_id: 'v1', to_id: 'v3', weight: 0.4},
  {id: 'e10', from_id: 'v4', to_id: 'v5', weight: Double(1.1)},
  {id: 'e11', from_id: 'v4', to_id: 'v3', weight: 0.4},
  {id: 'e12', from_id: 'v6', to_id: 'v3', weight: 0.2}
])
db.createCollection("knows", {
   validator: {
      $jsonSchema: {
         bsonType: "object",
         required: [ "id", "from_id", "to_id", "weight" ],
         properties: {
            id: { bsonType: "string" },
            from_id: { bsonType: "string" },
            to_id: { bsonType: "string" },
            weight: { bsonType: "double" }
         }
      }
   }
})
db.knows.insertMany([
  {id: 'e7', from_id: 'v1', to_id: 'v2', weight: 0.5},
  {id: 'e8', from_id: 'v1', to_id: 'v4', weight: Double(1.1)}
])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data for this demo comes from the “modern” graph defined by Apache TinkerPop.&lt;/p&gt;
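
&lt;p&gt;As a sanity check on what these collections encode, here is a small pure-Python walk over the same edges, using the exact ids, names, and weights inserted above:&lt;/p&gt;

```python
# The vertices and edges exactly as inserted into MongoDB above.
names = {"v1": "marko", "v2": "vadas", "v3": "lop",
         "v4": "josh", "v5": "ripple", "v6": "peter"}
knows = [("v1", "v2", 0.5), ("v1", "v4", 1.1)]
created = [("v1", "v3", 0.4), ("v4", "v5", 1.1),
           ("v4", "v3", 0.4), ("v6", "v3", 0.2)]

# Who does marko know?  (like g.V('v1').out('knows').values('name'))
marko_knows = [names[dst] for src, dst, _ in knows if src == "v1"]
print(marko_knows)  # ['vadas', 'josh']

# Who created lop?  (like traversing 'created' edges into v3)
lop_creators = sorted(names[src] for src, dst, _ in created if dst == "v3")
print(lop_creators)  # ['josh', 'marko', 'peter']
```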

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftukf3o1aaoy1y8pln7d0.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftukf3o1aaoy1y8pln7d0.webp" alt="Figure: the " width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;Run the following command to start the PuppyGraph container. The PUPPYGRAPH_PASSWORD environment variable sets the password for the default user puppygraph to puppygraph123; change it to a password of your choice. The --rm flag ensures that the container is removed after it stops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -p 8081:8081 -p 8182:8182 -p 7687:7687 -e PUPPYGRAPH_PASSWORD=puppygraph123 -d --name puppy --rm --pull=always puppygraph/puppygraph:stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Modeling the Graph
&lt;/h3&gt;

&lt;p&gt;Log into the PuppyGraph Web UI at &lt;a href="http://localhost:8081" rel="noopener noreferrer"&gt;http://localhost:8081&lt;/a&gt; with the following credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Username: puppygraph&lt;/li&gt;
&lt;li&gt;Password: puppygraph123&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two methods to model the graph:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use the Graph Schema Builder to create the schema manually.&lt;/li&gt;
&lt;li&gt;Upload the schema JSON file. We have prepared a template; you only need to fill in a few connection fields. You can upload the schema in two ways:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;In Web UI, select the file schema.json under Upload Graph Schema JSON, then click on Upload.&lt;/li&gt;
&lt;li&gt;Run the following command in the terminal:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -XPOST -H "content-type: application/json" --data-binary @./schema.json --user "puppygraph:puppygraph123" localhost:8081/schema
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "catalogs": [
    {
      "name": "mongodb_data",
      "type": "mongodb",
      "jdbc": {
        "username": "[username]",
        "password": "[password]",
        "jdbcUri": "[jdbcUri]",
        "driverClass": "com.mongodb.jdbc.MongoDriver"
      }
    }
  ],
  "graph": {
    "vertices": [
      {
        "label": "person",
        "oneToOne": {
          "tableSource": {
            "catalog": "mongodb_data",
            "schema": "modern",
            "table": "person"
          },
          "id": {
            "fields": [
              {
                "type": "String",
                "field": "id",
                "alias": "id"
              }
            ]
          },
          "attributes": [
            {
              "type": "Long",
              "field": "age",
              "alias": "age"
            },
            {
              "type": "String",
              "field": "name",
              "alias": "name"
            }
          ]
        }
      },
      {
        "label": "software",
        "oneToOne": {
          "tableSource": {
            "catalog": "mongodb_data",
            "schema": "modern",
            "table": "software"
          },
          "id": {
            "fields": [
              {
                "type": "String",
                "field": "id",
                "alias": "id"
              }
            ]
          },
          "attributes": [
            {
              "type": "String",
              "field": "lang",
              "alias": "lang"
            },
            {
              "type": "String",
              "field": "name",
              "alias": "name"
            }
          ]
        }
      }
    ],
    "edges": [
      {
        "label": "knows",
        "fromVertex": "person",
        "toVertex": "person",
        "tableSource": {
          "catalog": "mongodb_data",
          "schema": "modern",
          "table": "knows"
        },
        "id": {
          "fields": [
            {
              "type": "String",
              "field": "id",
              "alias": "id"
            }
          ]
        },
        "fromId": {
          "fields": [
            {
              "type": "String",
              "field": "from_id",
              "alias": "from_id"
            }
          ]
        },
        "toId": {
          "fields": [
            {
              "type": "String",
              "field": "to_id",
              "alias": "to_id"
            }
          ]
        },
        "attributes": [
          {
            "type": "Double",
            "field": "weight",
            "alias": "weight"
          }
        ]
      },
      {
        "label": "created",
        "fromVertex": "person",
        "toVertex": "software",
        "tableSource": {
          "catalog": "mongodb_data",
          "schema": "modern",
          "table": "created"
        },
        "id": {
          "fields": [
            {
              "type": "String",
              "field": "id",
              "alias": "id"
            }
          ]
        },
        "fromId": {
          "fields": [
            {
              "type": "String",
              "field": "from_id",
              "alias": "from_id"
            }
          ]
        },
        "toId": {
          "fields": [
            {
              "type": "String",
              "field": "to_id",
              "alias": "to_id"
            }
          ]
        },
        "attributes": [
          {
            "type": "Double",
            "field": "weight",
            "alias": "weight"
          }
        ]
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using the graph schema builder or the schema.json file, you need to fill in either the JDBC Connection String or the jdbcUri (they are the same thing). The &lt;a href="https://www.mongodb.com/docs/atlas/data-federation/query/sql/drivers/jdbc/connect/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;JDBC Connection String&lt;/a&gt; is used to connect to the MongoDB Atlas database. To find it, follow these &lt;a href="https://www.mongodb.com/docs/atlas/data-federation/query/sql/drivers/jdbc/connect/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim#connect-to-your-federated-database-instance" rel="noopener noreferrer"&gt;instructions&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the MongoDB Atlas UI, go to the Data Federation page and click Connect for the federated database instance that you want to connect to.&lt;/li&gt;
&lt;li&gt;Under Access your data through tools, select Atlas SQL.&lt;/li&gt;
&lt;li&gt;Under Select your driver, select JDBC Driver from the dropdown.&lt;/li&gt;
&lt;li&gt;Under Get Connection String, select the database that you want to connect to and copy the connection string. In this demo, the database is modern.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g1my6bzejoidnz45h64.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1g1my6bzejoidnz45h64.webp" alt="Figure: Get the JDBC Connection String." width="797" height="984"&gt;&lt;/a&gt;&lt;br&gt;
You also need to fill in the username and password fields according to your settings. Once complete, you will see the schema graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qv33hs6vwvakmizt7gk.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qv33hs6vwvakmizt7gk.webp" alt="Figure: Schema graph in PuppyGraph UI" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Querying the Graph
&lt;/h3&gt;

&lt;p&gt;Go to Dashboard in the Web UI to see a dashboard like the one pictured below. Each tab represents a query; click a tab to view its details. To add a new tab, click the plus (+) symbol at the bottom right corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehtoov7ts89irdbbofup.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehtoov7ts89irdbbofup.webp" alt="Figure: PuppyGraph Dashboard" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to Query in the Web UI, where you can use Graph Query to run Gremlin or openCypher queries with visualization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gih9fbg19ikiia8273b.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gih9fbg19ikiia8273b.webp" alt="Figure: Query the graph using Gremlin/openCypher." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some example queries:&lt;/p&gt;

&lt;p&gt;Retrieve a vertex named “marko”.&lt;/p&gt;

&lt;p&gt;Gremlin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V().has("name", "marko").valueMap()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;openCypher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH (v {name: 'marko'}) RETURN v
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retrieve the paths from “marko” to the software created by those whom “marko” knows.&lt;/p&gt;

&lt;p&gt;Gremlin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g.V().has("name", "marko")
.out("knows").out("created").path()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;openCypher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH p=(v {name: 'marko'})-[:knows]-&amp;gt;()-[:created]-&amp;gt;()
RETURN p
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;MongoDB’s flexible document model and robust query engine make it a strong foundation for modern applications, whether you’re powering transactional systems or real-time analytics. For use cases where understanding relationships between entities is key, adding graph capabilities can unlock a new class of insights.&lt;/p&gt;

&lt;p&gt;With PuppyGraph, teams can introduce real-time graph analytics into their MongoDB Atlas environment without modifying schemas, exporting data, or managing additional infrastructure. By connecting through the MongoDB Atlas SQL Interface, PuppyGraph lets you define graph models directly over your collections and query them using Gremlin or openCypher — while the data stays exactly where it is.&lt;/p&gt;

&lt;p&gt;If you’re working with connected data and want to explore graph queries on &lt;a href="https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=puppygraph&amp;amp;utm_term=tony.kim" rel="noopener noreferrer"&gt;MongoDB Atlas&lt;/a&gt;, try PuppyGraph’s &lt;a href="https://hubs.ly/Q03jpZGd0" rel="noopener noreferrer"&gt;free Developer Edition&lt;/a&gt; and experience what’s possible — no ETL required.&lt;/p&gt;

</description>
      <category>graph</category>
      <category>graphdatabase</category>
      <category>graphqueryengine</category>
      <category>mongodb</category>
    </item>
  </channel>
</rss>
