<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Aglietti</title>
    <description>The latest articles on DEV Community by Michael Aglietti (@maglietti).</description>
    <link>https://dev.to/maglietti</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F856691%2Fc59e93a7-01c8-466a-8b9e-3a5f73f9dcbe.jpeg</url>
      <title>DEV Community: Michael Aglietti</title>
      <link>https://dev.to/maglietti</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maglietti"/>
    <language>en</language>
    <item>
      <title>Apache Ignite 3.1.0 is now available</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Mon, 03 Nov 2025 21:14:36 +0000</pubDate>
      <link>https://dev.to/maglietti/apache-ignite-310-is-now-available-36dn</link>
      <guid>https://dev.to/maglietti/apache-ignite-310-is-now-available-36dn</guid>
      <description>&lt;p&gt;Apache Ignite 3.1.0 is now available.&lt;/p&gt;

&lt;p&gt;This release includes 1,461 commits addressing 1,265 JIRA tickets with enhancements across SQL query performance, observability, client capabilities, and disaster recovery.&lt;/p&gt;

&lt;h2&gt;What is Apache Ignite 3?&lt;/h2&gt;

&lt;p&gt;Apache Ignite 3 is a memory-first distributed SQL database built for high-velocity data workloads where milliseconds matter and transaction windows keep shrinking.&lt;/p&gt;

&lt;p&gt;It handles transactions, queries, and processing in one place with strong consistency and flexible data models. Schema-driven colocation keeps related data together, achieving up to 40x performance improvement for relational queries. You get ultra-low-latency performance without managing multiple systems.&lt;/p&gt;

&lt;h2&gt;What's New in 3.1.0&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Query performance improvements&lt;/strong&gt; reduce both network overhead and unnecessary data scanning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine-grained data placement control&lt;/strong&gt; through distribution zones. Configure replica counts and data node filters per zone to optimize query performance based on your access patterns. This extends Ignite 3's schema-driven colocation with control over where replicas live and how many you need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logical namespaces&lt;/strong&gt; through SQL schemas. Multiple tables with the same name can exist in different schemas, solving naming conflicts in shared environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Diagnose performance bottlenecks faster&lt;/strong&gt; with observability improvements. Monitor storage engine health, SQL query execution, replication lag, and clock drift across cluster nodes. Export metrics to logs with configurable filtering.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Standard API support across all client platforms&lt;/strong&gt; makes Ignite 3 a drop-in replacement for traditional databases. .NET developers use the ADO.NET provider, Python developers use a DB API 2.0 driver, and Java developers use enhanced JDBC and Spring Data integration. All clients gain partition awareness and backward compatibility.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduced downtime during partition failures&lt;/strong&gt; with disaster recovery operations. Inspect partition states across the cluster, restart failed partitions, and trigger recovery through CLI or REST API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analytical SQL capabilities&lt;/strong&gt; through the Calcite 1.40 upgrade. ROLLUP/CUBE hierarchical aggregations, NULLS FIRST/NULLS LAST ordering, and enhanced temporal data handling enable sophisticated reporting directly on operational data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Over 400 bug fixes&lt;/strong&gt; addressing data integrity, SQL query processing, and rebalancing operations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
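&lt;p&gt;As a quick sketch of what logical namespaces enable (written in standard SQL DDL; check the Ignite 3.1 documentation for the exact supported syntax):&lt;/p&gt;

```sql
-- Two schemas acting as logical namespaces
CREATE SCHEMA reporting;
CREATE SCHEMA staging;

-- The same table name can now coexist in both without conflict
CREATE TABLE reporting.trades (id INT PRIMARY KEY, amount DECIMAL(18, 2));
CREATE TABLE staging.trades (id INT PRIMARY KEY, amount DECIMAL(18, 2));
```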

&lt;h2&gt;Get Started&lt;/h2&gt;

&lt;p&gt;Download Ignite 3.1.0: &lt;a href="https://ignite.apache.org/download.cgi" rel="noopener noreferrer"&gt;https://ignite.apache.org/download.cgi&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star Ignite 3 on GitHub: &lt;a href="https://github.com/apache/ignite-3" rel="noopener noreferrer"&gt;https://github.com/apache/ignite-3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Get started with Ignite 3: &lt;a href="https://ignite.apache.org/docs/ignite3/latest/quick-start/getting-started-guide" rel="noopener noreferrer"&gt;https://ignite.apache.org/docs/ignite3/latest/quick-start/getting-started-guide&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Apache Ignite, Apache, and the flame logo are trademarks of The Apache Software Foundation.&lt;/p&gt;

</description>
      <category>database</category>
      <category>distributedsystems</category>
      <category>sql</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Monitoring Quine Streaming Graph using Grafana + InfluxDB</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Fri, 09 Jun 2023 15:12:16 +0000</pubDate>
      <link>https://dev.to/maglietti/monitoring-quine-streaming-graph-using-grafana-influxdb-4g8k</link>
      <guid>https://dev.to/maglietti/monitoring-quine-streaming-graph-using-grafana-influxdb-4g8k</guid>
      <description>&lt;h2&gt;Monitoring Data in Motion&lt;/h2&gt;

&lt;p&gt;There has been a significant increase in the popularity of event streaming and stream processing applications and technologies within the data engineering community. With the accelerating growth of big data, IoT, and cloud computing, more organizations are facing the challenge of extracting actionable insights earlier in the event pipeline. For historical reasons, operational tools for monitoring, alerting, and diagnosing system issues are oriented toward data at rest. That doesn't mean they can't be just as useful for monitoring data in motion. It just means adjusting your monitoring regime to a streaming mindset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3ogjpqwcvuta4og9qc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3ogjpqwcvuta4og9qc2.png" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From &lt;a href="https://a16z.com/2020/10/15/emerging-architectures-for-modern-data-infrastructure/#section--15" rel="noopener noreferrer"&gt;Emerging Architectures for Modern Data Infrastructure - Andreessen Horowitz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A good example of a next-gen streaming infrastructure element is Quine. Quine is an event streaming technology designed to process graph-shaped event streams and produce high-value events in real time.&lt;/p&gt;

&lt;p&gt;In this blog post, we'll guide you through setting up Grafana backed by InfluxDB to monitor a Quine instance. We'll show you how to configure Quine to send data to InfluxDB, create a dashboard in Grafana to visualize this data, and use Grafana's powerful features to detect issues and anomalies in real time. By the end of this post, you'll have a solid understanding of how to monitor event stream pipelines using Grafana and InfluxDB, and you'll be equipped with the tools and knowledge needed to keep Quine running smoothly.&lt;/p&gt;

&lt;h2&gt;Setting up Grafana and InfluxDB&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://grafana.com" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; is a tool that helps you visualize and understand operational metrics data. It lets you create visual dashboards to monitor and analyze data from sources across your data infrastructure. DevOps teams use Grafana metrics dashboards to make informed decisions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jdbxd0c0txujavnkd3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6jdbxd0c0txujavnkd3k.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The observability subsystem for Quine is built for Grafana integration.&lt;/p&gt;

&lt;p&gt;Above is an example of my typical development and testing environment when working on a &lt;a href="https://quine.io/recipes" rel="noopener noreferrer"&gt;recipe&lt;/a&gt;. The event sources and output sinks change depending on the scenario, but most of the time, I run Quine on my local host, configured to push metrics to InfluxDB and visualize the observations in Grafana. Using Docker containers makes it easy to configure and clean up my environment quickly.&lt;/p&gt;

&lt;p&gt;We need to do a little pre-work before launching the Docker containers. This is how I set up my environment using &lt;code&gt;docker-compose&lt;/code&gt;. You may do things differently based on how Docker is installed on your host.&lt;/p&gt;

&lt;p&gt;I like to keep &lt;code&gt;docker-compose.yaml&lt;/code&gt; files arranged inside their own directories in a &lt;code&gt;docker&lt;/code&gt; directory that lives in &lt;code&gt;$HOME&lt;/code&gt;. This helps me keep things organized and makes sharing configs between my macOS laptop and Ubuntu servers easy.&lt;/p&gt;

&lt;p&gt;I created a &lt;a href="https://quine-recipe-public.s3.us-west-2.amazonaws.com/quine-grafana-docker.zip" rel="noopener noreferrer"&gt;zip file&lt;/a&gt; of my config to download and use with this blog post.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;
wget https://quine-recipe-public.s3.us-west-2.amazonaws.com/quine-grafana-docker.zip
unzip quine-grafana-docker.zip

Archive:  quine-grafana-docker.zip
  inflating: docker/cassandra/docker-compose.yaml
  inflating: docker/grafana/docker-compose.yaml
   creating: docker/grafana/grafana-provisioning/
   creating: docker/grafana/grafana-provisioning/datasources/
  inflating: docker/grafana/grafana-provisioning/datasources/datasource.yml
   creating: docker/grafana/grafana-provisioning/dashboards/
  inflating: docker/grafana/grafana-provisioning/dashboards/quine.json
  inflating: docker/grafana/grafana-provisioning/dashboards/dashboard.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: I included a &lt;code&gt;docker-compose&lt;/code&gt; file for Cassandra in the zip archive. I won't cover the Cassandra config in this article. The file is included as a reference if you choose to separate your persistent storage from the application to keep from competing for server resources. See the &lt;a href="https://quine.io/components/persistors/cassandra-setup/" rel="noopener noreferrer"&gt;Cassandra Persistor&lt;/a&gt; docs for a sample configuration file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You now have this directory structure in your &lt;code&gt;$HOME&lt;/code&gt; dir.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker
├── cassandra
│   └── docker-compose.yaml
└── grafana
    ├── docker-compose.yaml
    └── grafana-provisioning
        ├── dashboards
        │   ├── dashboard.yaml
        │   └── quine.json
        └── datasources
            └── datasource.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Docker configured and the &lt;code&gt;quine-grafana-docker.zip&lt;/code&gt; files loaded on your Docker host, it's time to start the containers so that they are ready to receive data from Quine.&lt;/p&gt;

&lt;p&gt;Change into the &lt;code&gt;grafana&lt;/code&gt; directory and start the InfluxDB/Grafana stack:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see something similar to this appear in your terminal window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;+] Running 18/18
 ⠿ grafana Pulled                                            8.7s
   ⠿ f56be85fc22e Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;2.8s
   ⠿ 9efeca377709 Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;3.0s
   ⠿ b4608283f0dd Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;3.5s
   ⠿ 94ba646ecfcd Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;3.9s
   ⠿ 6730f2b3d4cf Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;4.1s
   ⠿ 871e090050be Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;4.4s
   ⠿ 03d60ad4c029 Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;5.7s
   ⠿ baaa3e79bf5c Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;7.6s
   ⠿ 01c0c058d3df Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;7.7s
 ⠿ influxdb Pulled                                           9.6s
   ⠿ 918547b94326 Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;7.4s
   ⠿ 5d79063a01c5 Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;7.7s
   ⠿ a8e9798c2a3f Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;7.8s
   ⠿ e8074b4fc936 Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;8.5s
   ⠿ a913b4722330 Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;8.5s
   ⠿ 9c8265b2cf7a Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;8.6s
   ⠿ 9037f1aeb9df Pull &lt;span class="nb"&gt;complete                              &lt;/span&gt;8.6s
&lt;span class="o"&gt;[&lt;/span&gt;+] Running 4/4
 ⠿ Volume &lt;span class="s2"&gt;"grafana_grafana-storage"&lt;/span&gt;   Created                0.0s
 ⠿ Volume &lt;span class="s2"&gt;"grafana_influxdb-storage"&lt;/span&gt;  Created                0.0s
 ⠿ Container grafana-influxdb-1       Started                0.5s
 ⠿ Container grafana-grafana-1        Started                0.7s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify that the containers are running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s2"&gt;"table {{.Names}}&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s2"&gt;{{.Status}}&lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s2"&gt;{{.Ports}}"&lt;/span&gt;
NAMES                 STATUS           PORTS
grafana-grafana-1     Up 4 seconds     0.0.0.0:3000-&amp;gt;3000/tcp
grafana-influxdb-1    Up 4 seconds     0.0.0.0:8086-&amp;gt;8086/tcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations! 🎉  InfluxDB and Grafana are running in separate containers and listening on their default ports.&lt;/p&gt;

&lt;h2&gt;Configuring Quine to Send Metrics Data&lt;/h2&gt;

&lt;p&gt;Enable metrics reporting in Quine via configuration parameters, passed either as Java system properties with &lt;code&gt;-D&lt;/code&gt; or in a &lt;a href="https://quine.io/reference/config/configuration/" rel="noopener noreferrer"&gt;Quine configuration&lt;/a&gt; file. Quine can report metrics to &lt;code&gt;jmx&lt;/code&gt;, &lt;code&gt;csv&lt;/code&gt;, &lt;code&gt;influxdb&lt;/code&gt;, and &lt;code&gt;slf4j&lt;/code&gt; for analysis. The &lt;code&gt;jmx&lt;/code&gt; metrics reporter is enabled by default.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Xmx12G&lt;/span&gt; &lt;span class="nt"&gt;-Xms12G&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dquine&lt;/span&gt;.metrics-reporters.1.type&lt;span class="o"&gt;=&lt;/span&gt;influxdb &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dquine&lt;/span&gt;.metrics-reporters.1.database&lt;span class="o"&gt;=&lt;/span&gt;db0 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dquine&lt;/span&gt;.metrics-reporters.1.period&lt;span class="o"&gt;=&lt;/span&gt;30s &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-Dquine&lt;/span&gt;.metrics-reporters.1.host&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;container_host&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-jar&lt;/span&gt; quine-1.5.3.jar &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-r&lt;/span&gt; wikipedia &lt;span class="nt"&gt;--force-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A couple of things to note when passing configuration as system properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;-D&lt;/code&gt; parameters must come before &lt;code&gt;-jar&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;When launching Quine with a recipe (&lt;code&gt;-r&lt;/code&gt;) you also have to pass &lt;code&gt;--force-config&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alternatively, you can store the same configuration in a file and pass it to Quine to accomplish the same thing.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;metrics.conf&lt;/code&gt; file containing the HOCON configuration from the &lt;a href="https://quine.io/reference/config/configuration/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hocon"&gt;&lt;code&gt;&lt;span class="nl"&gt;quine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="c1"&gt;# where metrics collected by the application should be reported&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;metrics-reporters&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="c1"&gt;# Report metrics to an influxdb (version 1) database&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;influxdb&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="c1"&gt;# required by influxdb - the interval at which new records will&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="c1"&gt;# be written to the database&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;period&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="c1"&gt;# Connection information for the influxdb database&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;database&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;db&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;scheme&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;http&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;host&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l"&gt;&amp;lt;container_host&amp;gt;&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;port&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8086&lt;/span&gt;&lt;span class="w"&gt;

      &lt;/span&gt;&lt;span class="c1"&gt;# Authentication information for the influxdb database. Both&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="c1"&gt;# fields may be omitted&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="c1"&gt;# user = admin&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="c1"&gt;# password = admin&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then launch Quine, passing the configuration file on the command line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-Dconfig&lt;/span&gt;.file&lt;span class="o"&gt;=&lt;/span&gt;metrics.conf &lt;span class="nt"&gt;-jar&lt;/span&gt; quine-1.5.4.jar &lt;span class="nt"&gt;-r&lt;/span&gt; wikipedia &lt;span class="nt"&gt;--force-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Quine Metrics&lt;/h2&gt;

&lt;p&gt;Quine reports three classes of metrics: counters, timers, and gauges.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When queried, the &lt;a href="https://quine.io/reference/rest-api/#/paths/api-v1-admin-metrics/get" rel="noopener noreferrer"&gt;metrics summary&lt;/a&gt; API endpoint reports the same metrics as a metrics reporter.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;Counters&lt;/h3&gt;

&lt;p&gt;Quine uses counters to accumulate the number of times that events occur. Counters can return either a value or a histogram.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;node.edge-counts.*&lt;/code&gt;: Histogram-style summaries of edges per node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;node.property-counts.*&lt;/code&gt;: Histogram-style summaries of properties per node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shard.*.sleep-counters&lt;/code&gt;: Count the lifecycle state of nodes managed by a shard&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Timers&lt;/h3&gt;

&lt;p&gt;Quine reports the elapsed time in milliseconds it takes to perform persistor operations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;persistor.get-journal&lt;/code&gt;: Time taken to read and deserialize a single node's relevant journal&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;persistor.persist-event&lt;/code&gt;: Time taken to serialize and persist one message's worth of on-node events&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;persistor.get-latest-snapshot&lt;/code&gt;: Time taken to read (but not deserialize) a single node snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Gauges&lt;/h3&gt;

&lt;p&gt;Quine gauges report metrics as a value.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;memory.heap.*&lt;/code&gt;: JVM heap usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory.total&lt;/code&gt;: JVM combined memory usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shared.valve.ingest&lt;/code&gt;: Number of current requests to slow ingest for another part of Quine to catch up&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dgn-reg.count&lt;/code&gt;: Number of in-memory registered DomainGraphNodes&lt;/li&gt;
&lt;/ul&gt;
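&lt;p&gt;As a toy illustration of how these gauges might be consumed, here is a short Python sketch that computes heap utilization from a flat name-to-value mapping (the &lt;code&gt;memory.heap.used&lt;/code&gt;/&lt;code&gt;memory.heap.max&lt;/code&gt; names and payload shape are assumptions for illustration, not the exact reporter output format):&lt;/p&gt;

```python
def heap_utilization(gauges: dict) -> float:
    """Return JVM heap usage as a fraction of the maximum.

    `gauges` is assumed to be a flat mapping of gauge names to values,
    e.g. {"memory.heap.used": ..., "memory.heap.max": ...}.
    """
    used = gauges["memory.heap.used"]
    maximum = gauges["memory.heap.max"]
    if maximum <= 0:
        raise ValueError("memory.heap.max must be positive")
    return used / maximum


# Example: warn when heap consumption approaches the configured -Xmx
sample = {"memory.heap.used": 9.6e9, "memory.heap.max": 12e9}
if heap_utilization(sample) > 0.75:
    print("heap nearing max; expect GC pressure and ingest fluctuation")
```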

&lt;h2&gt;Create a Dashboard in Grafana&lt;/h2&gt;

&lt;p&gt;A dashboard in Grafana contains a series of panels that provide an at-a-glance view of how Quine is performing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log into Grafana. The username and password for the container are admin:admin.&lt;/li&gt;
&lt;li&gt;When prompted, either change the default password or skip to keep it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you launched Grafana using the &lt;code&gt;docker-compose&lt;/code&gt; files from the &lt;code&gt;quine-grafana-docker.zip&lt;/code&gt; file that I provided, you will see a dashboard called "Quine - Monitor a Recipe" in the lower-left corner of the Dashboards card. Click on that dashboard to open it. Initially, the dashboard will be empty. It will fill in as you run a recipe.&lt;/p&gt;

&lt;p&gt;Let's start Quine with the Wikipedia recipe and the &lt;code&gt;metrics.conf&lt;/code&gt; file from above to get familiar with each visualization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-Dconfig&lt;/span&gt;.file&lt;span class="o"&gt;=&lt;/span&gt;metrics.conf &lt;span class="nt"&gt;-jar&lt;/span&gt; quine-1.5.3.jar &lt;span class="nt"&gt;-r&lt;/span&gt; wikipedia &lt;span class="nt"&gt;--force-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Quine is running, metrics will begin populating the dashboard after about 30 seconds. You may need to reload your browser for Grafana to pull all of the metrics from InfluxDB. Also, set the time range in the upper-right corner of the dashboard to "Last 15 minutes" to make sure a current window is selected to visualize.&lt;/p&gt;

&lt;p&gt;Your dashboard will begin to populate like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdfoz7159r1waj4zvt60.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsdfoz7159r1waj4zvt60.png" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Grafana dashboard view for Quine running the Wikipedia ingest recipe.&lt;/p&gt;

&lt;p&gt;Hover over each graph in the dashboard to expose a "three-dot" menu in the upper-right corner of the panel. Click on the menu and select "Edit" to review how each visualization is configured. Some visualizations use the query builder, and some are written directly as InfluxDB queries.&lt;/p&gt;
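&lt;p&gt;For reference, a latency panel might be backed by an InfluxQL query along these lines (the measurement and field names here are assumptions based on the metric names above; open a panel's edit view to see the exact queries the sample dashboard uses):&lt;/p&gt;

```sql
-- Mean persistor journal-read latency, bucketed to the 30s reporting period
SELECT mean("value")
FROM "persistor.get-journal"
WHERE $timeFilter
GROUP BY time(30s) fill(null)
```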

&lt;p&gt;Feel free to modify the dashboard to match your environment and needs.&lt;/p&gt;

&lt;h2&gt;What I've Learned Monitoring Quine&lt;/h2&gt;

&lt;p&gt;Monitoring a streaming graph is similar to monitoring any other database, with a few additional key metrics to watch.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pay attention to supernodes! A supernode is a single node in the graph with a very large number of edges (tens of thousands or more). A single supernode, or a moderate number of mid-sized supernodes, will cause Quine to keep nodes awake continuously, which can lead to backpressure.&lt;/li&gt;
&lt;li&gt;Quine is a backpressured system, which means the performance of the persistence subsystem affects the flow of events through the graph. If you see a drop in the event ingest rate, watch for shards filling with awake nodes and for rising persistor latency.&lt;/li&gt;
&lt;li&gt;Java garbage collection impacts backpressure. It is normal for Quine ingest rates to fluctuate as Java manages the heap. Keep an eye on when your heap consumption approaches the max memory configured for Java. I've found the best performance when launching Quine with a 12G (&lt;code&gt;-Xmx12G -Xms12G&lt;/code&gt;) memory allocation pool.&lt;/li&gt;
&lt;/ul&gt;
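&lt;p&gt;The supernode check above can be sketched in a few lines of Python (a toy illustration over an exported mapping of per-node edge counts, not a Quine API):&lt;/p&gt;

```python
# Threshold matching the "tens of thousands of edges" rule of thumb.
SUPERNODE_THRESHOLD = 10_000


def find_supernodes(edge_counts: dict) -> list:
    """Return node ids whose edge count suggests a supernode."""
    return sorted(
        node for node, count in edge_counts.items()
        if count >= SUPERNODE_THRESHOLD
    )


counts = {"user:42": 125_000, "page:Main": 8_500, "user:7": 31_000}
print(find_supernodes(counts))  # the nodes worth investigating first
```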

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The metrics dashboard built into the Exploration UI is good for understanding how Quine is operating right now. Monitoring the performance of a recipe or solution over time, however, requires a DevOps tool like Grafana. This blog post gets you up and running with a sample dashboard that replicates all of the gauges in the Exploration UI and that you can modify to suit your needs.&lt;/p&gt;

&lt;p&gt;Did you run this dashboard in your environment? How did it perform? Start a conversation in &lt;a href="https://that.re/quine-slack" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; with a screenshot of your dashboard or feedback to improve our base Grafana dashboard.&lt;/p&gt;

</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>streaminganalytics</category>
    </item>
    <item>
      <title>Calculate Risk and Optimize Asset Allocation in Real Time</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Fri, 09 Jun 2023 15:12:04 +0000</pubDate>
      <link>https://dev.to/maglietti/calculate-risk-and-optimize-asset-allocation-in-real-time-18he</link>
      <guid>https://dev.to/maglietti/calculate-risk-and-optimize-asset-allocation-in-real-time-18he</guid>
      <description>&lt;h2&gt;The Hidden Cost of Batch Processing for Financial Institutions&lt;/h2&gt;

&lt;p&gt;The recent failures at financial institutions like First Republic Bank, Signature Bank, and even Silicon Valley Bank have brought issues of regulatory compliance and capital management to the forefront for industry members and the wider public alike.&lt;/p&gt;

&lt;p&gt;One thing these events have exposed is that the financial industry largely relies on batch processing to manage mandated operational risk capital requirements, an approach ill-suited to the direction both the market and regulatory compliance are heading. Operationally, batch processing is time-consuming, costly, and often must take place in constrained time windows between market close and open.&lt;/p&gt;

&lt;p&gt;The knock-on financial effects of these operational limitations are even more impactful: institutions are slow to react to changing market conditions, which can lead to over- or under-allocation of certain classes of funds.&lt;/p&gt;

&lt;h2&gt;Real-time Risk Calculation and Asset Allocation&lt;/h2&gt;

&lt;p&gt;Using Quine streaming graph, financial institutions can respond to market changes in real time, providing adequate coverage for risk exposure while ensuring compliance minimally affects asset allocation.&lt;/p&gt;

&lt;p&gt;At a high level, Quine accomplishes this by doing what it does best: combining multiple feeds in real time to build hierarchical models of elements like markets, trading entities, risk classes, and asset values that adjust to changing market conditions as they happen.&lt;/p&gt;

&lt;p&gt;More concretely, we have created a Quine recipe that demonstrates the following in the context of regulatory monitoring requirements like the &lt;a href="https://www.bis.org/basel_framework/standard/LCR.htm" rel="noopener noreferrer"&gt;Basel III Liquidity Coverage Ratio (LCR)&lt;/a&gt;, the &lt;a href="https://www.bis.org/bcbs/publ/d295.htm" rel="noopener noreferrer"&gt;Net Stable Funding Ratio (NSFR)&lt;/a&gt;, and the liquidity risk monitoring tools described in the &lt;a href="https://www.bis.org/bcbs/basel3.htm" rel="noopener noreferrer"&gt;Basel III framework&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculating risk while taking into account complex interdependencies and rules.&lt;/li&gt;
&lt;li&gt;Constantly recomputing liquidity-indexed risk to determine capital requirements relative to market conditions.&lt;/li&gt;
&lt;li&gt;Normalizing multiple sources to calculate the relative value of assets and rolling up the results to determine near-real-time liquidity in the event liquidation becomes necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfespf7szwylkjxndn7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfespf7szwylkjxndn7r.png" alt="A graph that shows the hierarchical nature of the graph for this particular problem domain." width="800" height="588"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A sample view from the graph this recipe generates.&lt;/p&gt;

&lt;p&gt;The recipe can be found &lt;a href="https://quine.io/recipes/finance/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quine Developer Site 2.0
&lt;/h2&gt;

&lt;p&gt;As part of our continued focus on improving the Quine developer experience, we’ve made significant changes to the Quine.io site.&lt;/p&gt;

&lt;p&gt;The most notable change is a total restructuring of the recipe pages to interleave code and contextual or documentary information. Recipe documentation now includes a full walkthrough of a recipe and an explanation of how the recipe works so that recipes can also act as training material.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z29r19trj4k1e3aqg29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z29r19trj4k1e3aqg29.png" alt="Recipe page example showing sidebar navigation, sections that include Scenario, How it Works, and a breakdown of standing and ingest queries, and a link to the full recipe." width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new, more structured recipe page.&lt;/p&gt;

&lt;p&gt;Other changes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improved developer journey by separating tutorials (&lt;a href="https://quine.io/getting-started/" rel="noopener noreferrer"&gt;getting started&lt;/a&gt;), &lt;a href="https://quine.io/docs/" rel="noopener noreferrer"&gt;technical docs&lt;/a&gt;, and &lt;a href="https://quine.io/recipes/" rel="noopener noreferrer"&gt;recipe docs&lt;/a&gt; into their own sections of the site&lt;/li&gt;
&lt;li&gt;Full release notes and release history included in the &lt;a href="https://quine.io/download/" rel="noopener noreferrer"&gt;downloads&lt;/a&gt; page&lt;/li&gt;
&lt;li&gt;Direct links to Quine blog posts, events, and self-service demos are now on the &lt;a href="https://quine.io/info/" rel="noopener noreferrer"&gt;info&lt;/a&gt; page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can still download and easily get started with Quine (hint, hint) and we’d love to &lt;a href="https://that.re/quine-slack" rel="noopener noreferrer"&gt;hear your feedback&lt;/a&gt; and add features you think might help you build great things with Quine.&lt;/p&gt;

</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>streaminganalytics</category>
    </item>
    <item>
      <title>Create a Quine Icon Library with Python</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Tue, 02 May 2023 20:46:16 +0000</pubDate>
      <link>https://dev.to/maglietti/create-a-quine-icon-library-with-python-11md</link>
      <guid>https://dev.to/maglietti/create-a-quine-icon-library-with-python-11md</guid>
      <description>&lt;p&gt;Have you ever wanted to add flair to a graph visualization but are unsure which icons Quine supports? In this blog, we explore a Python script that fetches valid icon names from the web, configures the Exploration UI, then creates a graph of icon nodes for reference. The script uses several popular Python libraries, including Requests, BeautifulSoup, and Halo, along with the &lt;code&gt;/query-ui&lt;/code&gt; and &lt;code&gt;/query/cypher&lt;/code&gt; API endpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Environment
&lt;/h2&gt;

&lt;p&gt;Before we start, we need to ensure that we have the necessary prerequisites in place. The script uses &lt;code&gt;requests&lt;/code&gt;, &lt;code&gt;beautifulsoup4&lt;/code&gt;, &lt;code&gt;log_symbols&lt;/code&gt;, and &lt;code&gt;halo&lt;/code&gt;, each of which you can install using &lt;code&gt;pip&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/thatdot/quine/releases/latest" rel="noopener noreferrer"&gt;Quine&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;Python 3&lt;/li&gt;
&lt;li&gt;Requests library (&lt;code&gt;pip install requests&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;BeautifulSoup library (&lt;code&gt;pip install beautifulsoup4&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Optional Halo library for operation visuals  (&lt;code&gt;pip install log-symbols halo&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start Quine so that it is ready to accept API requests from the script.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;java -jar quine-1.5.3.jar&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Script
&lt;/h2&gt;

&lt;p&gt;The script begins by importing the required libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;halo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Halo&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;log_symbols&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogSymbols&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bs4&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BeautifulSoup&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Build a list of icon names
&lt;/h2&gt;

&lt;p&gt;We use the &lt;code&gt;requests&lt;/code&gt; library to GET the webpage referenced in the &lt;a href="https://docs.quine.io/reference/rest-api.html#/paths/api-v1-query-ui-node-appearances/put" rel="noopener noreferrer"&gt;Replace Node Appearances&lt;/a&gt; API documentation. Quine supports version 2.0.0 of the &lt;a href="https://ionic.io/ionicons/v2/cheatsheet.html" rel="noopener noreferrer"&gt;Ionicons&lt;/a&gt; icon set from the Ionic Framework. The link contains a list of 733 icons supported by Quine. A &lt;code&gt;try...except&lt;/code&gt; block handles any errors that might occur during the request. If the request is successful, the script saves the HTML content of the page.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://ionic.io/ionicons/v2/cheatsheet.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LogSymbols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SUCCESS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GET Icon Cheatsheet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we use BeautifulSoup to parse the HTML content of the page and extract all of the icon names. The &lt;code&gt;soup.select&lt;/code&gt; method finds all &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt; elements with the &lt;code&gt;name&lt;/code&gt; class and returns a list; later we loop over it to extract the &lt;code&gt;value&lt;/code&gt; attribute of each tag. We output &lt;code&gt;len(all_icons)&lt;/code&gt; to verify that we identified all of the icons.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BeautifulSoup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;html.parser&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;all_icons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input.name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LogSymbols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SUCCESS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extract Icon Names:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_icons&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Create Node Appearances
&lt;/h2&gt;

&lt;p&gt;Now that we have the icon names, we can use them to create node appearances for the Quine Exploration UI. We'll use the &lt;code&gt;json&lt;/code&gt; module to format the &lt;code&gt;nodeAppearances&lt;/code&gt; data as JSON, and &lt;code&gt;requests&lt;/code&gt; to replace the current &lt;code&gt;nodeAppearances&lt;/code&gt; with a PUT to the &lt;code&gt;/query-ui/node-appearances&lt;/code&gt; endpoint. As before, we wrap the API call in &lt;code&gt;try...except&lt;/code&gt; to handle any errors. Each node appearance defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;predicate&lt;/code&gt;: a filter that selects which nodes receive this style&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;size&lt;/code&gt;: the size of the icon in pixels&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;icon&lt;/code&gt;: the name of the icon&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;label&lt;/code&gt;: the label of the node&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Cypher does not allow dash (&lt;code&gt;-&lt;/code&gt;) characters in node labels. We get around this by replacing all of the dashes with underscores in the node labels.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
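&lt;p&gt;The rewrite itself is a one-line string replace. A minimal sketch, with illustrative icon names (not necessarily from the Ionicons set):&lt;/p&gt;

```python
# Cypher labels cannot contain dashes, so swap them for underscores.
icon_names = ["ios-alarm", "md-checkmark-circle"]  # illustrative names

labels = [name.replace("-", "_") for name in icon_names]
# labels: ["ios_alarm", "md_checkmark_circle"] -- safe to use as node labels
```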

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;nodeAppearances&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;predicate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;propertyKeys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knownValues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbLabel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;40.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;icon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Property&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;icon_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_icons&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;json_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeAppearances&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/api/v1/query-ui/node-appearances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LogSymbols&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SUCCESS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PUT Node Appearances&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Create Icon Nodes
&lt;/h2&gt;

&lt;p&gt;Finally, our script creates icon nodes by sending a series of POST requests to the Quine &lt;code&gt;/query/cypher&lt;/code&gt; endpoint. For each icon name, a Cypher query creates the corresponding icon node and connects it to the appropriate group node. We use &lt;code&gt;Halo&lt;/code&gt; to create a spinner while we POST the icon data to Quine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;quineSpinner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Halo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Creating Icon Nodes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spinner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bouncingBar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;quineSpinner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;icon_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_icons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;group&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;query_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MATCH (a), (b), (c) &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WHERE id(a) = idFrom(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;) &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;  AND id(b) = idFrom(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;) &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;  AND id(c) = idFrom(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;) &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SET a:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, a.name = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SET b:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, b.name = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SET c:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, c.name = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREATE (a)&amp;lt;-[:` `]-(b)&amp;lt;-[:` `]-(c)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="nf"&gt;else &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MATCH (a), (c) &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WHERE id(a) = idFrom(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;) &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; AND id(c) = idFrom(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;) &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SET a:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, a.name = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;group&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SET c:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, c.name = &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;icon_name&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CREATE (a)&amp;lt;-[:` `]-(c)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
          &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;quineSpinner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/plain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;# print(query_text)
&lt;/span&gt;        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://localhost:8080/api/v1/query/cypher&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;quineSpinner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;succeed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POST Icon Nodes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Timeout&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;quineSpinner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Request Timeout: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RequestException&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;SystemExit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the script
&lt;/h2&gt;

&lt;p&gt;At this point, we are ready to run the script and visualize the icons supported in Quine.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python3 iconLibrary.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The script updates the console as it moves through the blocks of code that we described above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;✔ GET Icon Cheatsheet
✔ Extract Icon Names: 733
✔ PUT Node Appearances
✔ POST Icon Nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Navigate to Quine in your browser and load the nodes that we just created into the Exploration UI. There are multiple ways to load all of the nodes in the UI; for this example, we use &lt;code&gt;MATCH (n) RETURN n&lt;/code&gt;. The Exploration UI will warn that you are about to render 787 nodes, which is correct for all of the icon and grouping nodes generated by the script. Hit the OK button to view the graph.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: If you already had Quine open in a browser before running the script, you will need to refresh your browser window to load the new &lt;code&gt;nodeAppearances&lt;/code&gt; submitted by the query in order for the nodes to render correctly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our case, the nodes are jumbled when they are first rendered. Click the play button in the top nav to have Quine organize the graph. Our result produced the graph visualization of all supported icons below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgraarq318mu9i14e1jf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgraarq318mu9i14e1jf4.png" alt="Quine Icons" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There you have it, a graph visualization using all of the icons Quine supports!&lt;/p&gt;

&lt;p&gt;This script can generate the &lt;code&gt;nodeAppearances&lt;/code&gt; graph and serve as a starting point if you are looking to automate fetching non-streaming data from websites to enrich streaming data stored in Quine. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9rs1f01jkpdn0tdh6gf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9rs1f01jkpdn0tdh6gf.png" alt="Quine Interactive API Docs" width="800" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about Quine or explore using other API libraries with Quine, check out the interactive &lt;a href="https://docs.quine.io/reference/rest-api.html#/" rel="noopener noreferrer"&gt;REST API documentation&lt;/a&gt; available via the document icon in the left nav bar. The interactive documentation is a great place to submit API requests, and it mocks up code samples in popular languages that you can reuse when experimenting with small projects like this one.&lt;/p&gt;

&lt;p&gt;You can download this script and try it for yourself in this &lt;a href="https://github.com/maglietti/quine-iconLibrary" rel="noopener noreferrer"&gt;GitHub Repo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>python</category>
    </item>
    <item>
      <title>Dynamic Duo: Quine &amp; Novelty Detector for Insider Threats</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Thu, 20 Apr 2023 22:03:48 +0000</pubDate>
      <link>https://dev.to/maglietti/dynamic-duo-quine-novelty-detector-for-insider-threats-2fod</link>
      <guid>https://dev.to/maglietti/dynamic-duo-quine-novelty-detector-for-insider-threats-2fod</guid>
      <description>&lt;h2&gt;
  
  
  Adding Quine to the Insider Threat Detection Proof of Concept
&lt;/h2&gt;

&lt;p&gt;A lot has changed since we first posted the &lt;a href="https://www.thatdot.com/blog/stop-insider-threats-with-automated-behavioral-anomaly-detection" rel="noopener noreferrer"&gt;Stop Insider Threats With Automated Behavioral Anomaly Detection&lt;/a&gt; blog post. Most significantly, thatDot released Quine, our streaming graph, as an &lt;a href="https://www.thatdot.com/blog/announcing-open-source-release-of-quine-streaming-graph" rel="noopener noreferrer"&gt;open source project&lt;/a&gt; just as the industry is recognizing the value of real-time ETL and complex event processing in service of business requirements. This is especially true in finance and cybersecurity, where minutes (seconds or even milliseconds) can mean the difference between disaster, survival or success.&lt;/p&gt;

&lt;p&gt;Our goal, at the time, was to show how anomaly detection on &lt;a href="https://www.thatdot.com/blog/whats-the-difference-between-categorical-and-numerical-data" rel="noopener noreferrer"&gt;categorical data&lt;/a&gt; could be used to resolve complex challenges utilizing an industry-recognized standard benchmark dataset, which happened to be static. The approach we used then was to pre-process (batch) the &lt;a href="https://www.osti.gov/biblio/1001546" rel="noopener noreferrer"&gt;VAST Insider Threat challenge dataset&lt;/a&gt; with Python, then ingest that processed stream of data with thatDot's Novelty Detector to identify the bad actor.&lt;/p&gt;

&lt;p&gt;But with a new tool in our kit, we decided to see what would be involved in updating the workflow: replacing the Python pre-processing with Quine in front of Novelty Detector in our pipeline.&lt;/p&gt;

&lt;p&gt;This involved:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Defining the &lt;a href="https://docs.quine.io/getting-started/ingest-streams-tutorial.html" rel="noopener noreferrer"&gt;ingest queries&lt;/a&gt; required to consume and shape the VAST datasets; and&lt;/li&gt;
&lt;li&gt; Developing a &lt;a href="https://docs.quine.io/getting-started/standing-queries-tutorial.html" rel="noopener noreferrer"&gt;standing query&lt;/a&gt; to output the data to Novelty Detector for anomaly detection.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Data from the &lt;a href="https://that.re/insider-threat" rel="noopener noreferrer"&gt;dataset&lt;/a&gt; is broken into three files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Employee to office and source IP address mapping in employeeData.csv
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;
&lt;span class="na"&gt;ingestStreams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FileIngest&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;employeeData.csv&lt;/span&gt;
    &lt;span class="na"&gt;parallelism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;61&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherCsv&lt;/span&gt;
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
        &lt;span class="s"&gt;MATCH (employee), (ipAddress), (office)&lt;/span&gt;
        &lt;span class="s"&gt;WHERE id(employee) = idFrom('employee', $that.EmployeeID)&lt;/span&gt;
          &lt;span class="s"&gt;AND id(ipAddress) = idFrom('ipAddress',$that.IP)&lt;/span&gt;
          &lt;span class="s"&gt;AND id(office) = idFrom('office',$that.Office)&lt;/span&gt;

        &lt;span class="s"&gt;SET employee.id = $that.EmployeeID,&lt;/span&gt;
            &lt;span class="s"&gt;employee:employee&lt;/span&gt;

        &lt;span class="s"&gt;SET ipAddress.ip = $that.IP,&lt;/span&gt;
            &lt;span class="s"&gt;ipAddress:ipAddress&lt;/span&gt;

        &lt;span class="s"&gt;SET office.office = $that.Office,&lt;/span&gt;
            &lt;span class="s"&gt;office:office&lt;/span&gt;

        &lt;span class="s"&gt;CREATE (ipAddress)&amp;lt;-[:USES_IP]-(employee)-[:SHARES_OFFICE]-&amp;gt;(office)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Proximity reader data from door badge scanners in proxLog.csv
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FileIngest&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxLog.csv&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherCsv&lt;/span&gt;
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
        &lt;span class="s"&gt;MATCH (employee), (badgeStatus)&lt;/span&gt;
        &lt;span class="s"&gt;WHERE id(employee) = idFrom('employee', $that.ID)&lt;/span&gt;
          &lt;span class="s"&gt;AND id(badgeStatus) = idFrom('badgeStatus',$that.ID,$that.Datetime,$that.Type,$that.ID)&lt;/span&gt;

        &lt;span class="s"&gt;SET employee.id = $that.ID,&lt;/span&gt;
            &lt;span class="s"&gt;employee:employee&lt;/span&gt;

        &lt;span class="s"&gt;SET badgeStatus.type = $that.Type,&lt;/span&gt;
            &lt;span class="s"&gt;badgeStatus.employee = $that.ID,&lt;/span&gt;
            &lt;span class="s"&gt;badgeStatus.datetime = $that.Datetime,&lt;/span&gt;
            &lt;span class="s"&gt;badgeStatus:badgeStatus&lt;/span&gt;

        &lt;span class="s"&gt;CREATE (employee)-[:BADGED]-&amp;gt;(badgeStatus)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;  Network traffic in IPLog3.5.csv
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FileIngest&lt;/span&gt;
    &lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IPLog3.5.csv&lt;/span&gt;
    &lt;span class="s"&gt;format&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherCsv&lt;/span&gt;
      &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
        &lt;span class="s"&gt;MATCH (ipAddress), (request)&lt;/span&gt;
        &lt;span class="s"&gt;WHERE id(ipAddress) = idFrom('ipAddress',$that.SourceIP)&lt;/span&gt;
          &lt;span class="s"&gt;AND id (request) = idFrom('request', $that.SourceIP,$that.AccessTime, $that.DestIP, $that.Socket)&lt;/span&gt;

        &lt;span class="s"&gt;SET request.reqSize = $that.ReqSize,&lt;/span&gt;
            &lt;span class="s"&gt;request.respSize = $that.RespSize,&lt;/span&gt;
            &lt;span class="s"&gt;request.datetime = $that.AccessTime,&lt;/span&gt;
            &lt;span class="s"&gt;request.dst = $that.DestIP,&lt;/span&gt;
            &lt;span class="s"&gt;request.dstport = $that.Socket,&lt;/span&gt;
            &lt;span class="s"&gt;request:request&lt;/span&gt;

        &lt;span class="s"&gt;SET ipAddress.ip = $that.SourceIP,&lt;/span&gt;
            &lt;span class="s"&gt;ipAddress:ipAddress&lt;/span&gt;

        &lt;span class="s"&gt;CREATE (ipAddress)-[:MADE_REQUEST]-&amp;gt;(request)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These ingests form a basic structure that looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwzmqwo2p5t0ibp7mq2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flwzmqwo2p5t0ibp7mq2y.png" alt="A snapshot of the graph created by ingest streams showing Employee 51 connected by the Badged edge to a door reader event node and node IP address by USES_IP edge, which is connected to a Request node by a Made_request edge." width="800" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ingest streams combine to create the essential graph structure.&lt;/p&gt;

&lt;p&gt;We have created an intuitive schema for identifying nodes by feeding &lt;code&gt;idFrom()&lt;/code&gt; deterministic &lt;em&gt;and&lt;/em&gt; descriptive data, which lets us query for them very efficiently (with sub-millisecond latency).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F301lw5klmw9d8cqspd70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F301lw5klmw9d8cqspd70.png" alt="The same basic graph as above but this time showing a very efficient query for node properties. " width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A quick query efficiently displays relevant properties from connected nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving from Batch to Real Time Monitoring
&lt;/h2&gt;

&lt;p&gt;While this is certainly an improvement from our previous workflow, it is still highly manual (i.e., having to explicitly query for the data we're looking for). The promise of a Quine to Novelty Detector workflow is automation with real-time results.&lt;/p&gt;

&lt;p&gt;By ingesting the data in chronological order (as presented in the source files), we are able to easily match network events to the last associated proximity badge event &lt;em&gt;in real time&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This is accomplished via standing query matches like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;standingQueries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
         &lt;span class="s"&gt;MATCH (request)&amp;lt;-[:MADE_REQUEST]-(ipAddress)&amp;lt;-[:USES_IP]-(employee)-[:BADGED]-&amp;gt;(badgeStatus)&lt;/span&gt;
         &lt;span class="s"&gt;RETURN DISTINCT id(request) AS requestid&lt;/span&gt;
       &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cypher&lt;/span&gt;
     &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
       &lt;span class="na"&gt;print-output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
         &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherQuery&lt;/span&gt;
         &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
          &lt;span class="s"&gt;MATCH (request)&amp;lt;-[:MADE_REQUEST]-(ipAddress)&amp;lt;-[:USES_IP]-(employee)-[:BADGED]-&amp;gt;(badgeStatus)&lt;/span&gt;
          &lt;span class="s"&gt;WHERE id(request) = $that.data.requestid&lt;/span&gt;
            &lt;span class="s"&gt;AND badgeStatus.datetime&amp;lt;=request.datetime&lt;/span&gt;
          &lt;span class="s"&gt;WITH max(badgeStatus.datetime) AS date, request, ipAddress&lt;/span&gt;
          &lt;span class="s"&gt;MATCH (request)&amp;lt;-[:MADE_REQUEST]-(ipAddress)&amp;lt;-[:USES_IP]-(employee)-[:BADGED]-&amp;gt;(badgeStatus)&lt;/span&gt;
          &lt;span class="s"&gt;WHERE badgeStatus.datetime=date&lt;/span&gt;

          &lt;span class="s"&gt;RETURN badgeStatus.type AS status,ipAddress.ip AS src,request.dstport AS port,request.dst AS dst&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
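&lt;p&gt;The pairing logic above, matching each network request to the employee's most recent badge event at or before it, can be sketched in plain Python. This is illustrative only; Quine evaluates the pattern incrementally as events stream in, and the field names here simply mirror the dataset:&lt;/p&gt;

```python
from bisect import bisect_right

# Illustrative sketch (not Quine's implementation): pair a network request
# with the employee's most recent badge event at or before the request time.
# ISO-8601 timestamps sort chronologically as plain strings, so bisect works.

def last_badge_status(badge_events, request_time):
    """Return the badge type of the latest event not after request_time.

    badge_events must be sorted chronologically by "datetime".
    """
    times = [event["datetime"] for event in badge_events]
    index = bisect_right(times, request_time)
    if index == 0:
        return None  # no badge event precedes this request
    return badge_events[index - 1]["type"]

badge_events = [
    {"datetime": "2009-01-12T07:58:00", "type": "prox-in-building"},
    {"datetime": "2009-01-12T12:01:00", "type": "prox-out-building"},
]

# A mid-morning request pairs with the 07:58 badge-in event.
status = last_badge_status(badge_events, "2009-01-12T09:30:00")
```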



&lt;p&gt;The question remains, "How do we share the standing query matches from Quine to Novelty Detector?" This can be done in a number of ways (all via &lt;a href="https://docs.quine.io/getting-started/standing-queries-tutorial.html" rel="noopener noreferrer"&gt;standing query outputs&lt;/a&gt;) including, but not limited to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Writing results to a file that Novelty Detector ingests;&lt;/li&gt;
&lt;li&gt; Emitting webhooks from Quine to Novelty Detector; or&lt;/li&gt;
&lt;li&gt; Publishing results to a Kafka topic to be ingested by Novelty Detector.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although the first two choices will work, they are severely suboptimal. Consider a simple example of a single employee's data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdiiv8r5ch2qr8kdb00y8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdiiv8r5ch2qr8kdb00y8.png" alt="A graph showing employee's data that renders as thousands of nodes connected to four main clusters. " width="800" height="639"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Visualizing data from a single employee.&lt;/p&gt;

&lt;p&gt;Writing the aggregate 115,434 matches to the filesystem would happen one record at a time, on each standing query match.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;andThen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WriteToFile&lt;/span&gt;
           &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;behaviors.jsonl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using webhooks suffers the same issue as writing to a file, and introduces additional latency from the HTTP transactions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;andThen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostToEndpoint&lt;/span&gt;
            &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8080/api/v1/novelty/behaviors/observe?transformation=behaviors&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ultimately, we settled on the third option as it most closely resembles production environments, and is the most performant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;andThen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
           &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WriteToKafka&lt;/span&gt;
           &lt;span class="na"&gt;bootstrapServers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost:9092&lt;/span&gt;
           &lt;span class="na"&gt;topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vast&lt;/span&gt;
           &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;
            &lt;span class="nv"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;JSON&lt;/span&gt;
        &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
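&lt;p&gt;Whichever output you choose, each standing-query match travels as one small JSON record whose fields follow the &lt;code&gt;RETURN&lt;/code&gt; clause above (&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;src&lt;/code&gt;, &lt;code&gt;port&lt;/code&gt;, &lt;code&gt;dst&lt;/code&gt;). A minimal sketch with made-up values:&lt;/p&gt;

```python
import json

# One standing-query match, shaped by the standing query's RETURN clause
# (status, src, port, dst). The values below are made up for illustration;
# in the pipeline, each match is serialized like this before being
# published to the Kafka topic.
match = {
    "status": "prox-in-building",
    "src": "37.170.30.250",
    "port": 8080,
    "dst": "100.59.151.133",
}

payload = json.dumps(match)
```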



&lt;p&gt;The big question: did it work?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1r6p778ze8cn80iw2pf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1r6p778ze8cn80iw2pf7.png" alt="A scatter graph of Novelty Detector results showing the anomalous behavior connected to a compromised faciltiy." width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results from the Novelty Detector UI.&lt;/p&gt;

&lt;p&gt;Absolutely.&lt;/p&gt;

&lt;p&gt;The anomalous activity has been identified.&lt;/p&gt;

&lt;p&gt;Was it worthwhile?&lt;/p&gt;

&lt;p&gt;Sure, but...&lt;/p&gt;

&lt;h2&gt;
  
  
  It Don't Mean a Thing If It Ain't Got That Real-Time Swing
&lt;/h2&gt;

&lt;p&gt;Although we were able to accomplish the same results with Quine in a single step, this was still a batch-processing exercise. The true value of a Quine to Novelty Detector pipeline is in the melding of complex event stream processing in Quine with shallow learning (no training data) techniques in Novelty Detector, providing an efficient solution for detecting persistent threats and unwanted behaviors in your network. This pattern, moving from batch processing, which requires heavy lifting and grooming of datasets, to real-time stream processing is one where Quine and Novelty Detector thrive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it Yourself
&lt;/h2&gt;

&lt;p&gt;If you'd like to try the VAST test case yourself, you can run &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-jo6mbktt5ptzm" rel="noopener noreferrer"&gt;Novelty Detector on AWS&lt;/a&gt; with a generous free usage tier. Instructions for &lt;a href="https://www.thatdot.com/product/novelty-detector-docs/novelty-configuration-guide" rel="noopener noreferrer"&gt;configuring Novelty Detector&lt;/a&gt; are available here.&lt;br&gt;
And the open source version of Quine is available for &lt;a href="https://quine.io/download" rel="noopener noreferrer"&gt;download here&lt;/a&gt;. If you are interested there is also an &lt;a href="https://www.thatdot.com/product/quine-enterprise" rel="noopener noreferrer"&gt;enterprise version &lt;/a&gt;that offers clustering for horizontal scaling and resilience.&lt;/p&gt;

&lt;p&gt;And if you'd prefer a demo or have additional questions, check out the &lt;a href="https://that.re/quine-slack" rel="noopener noreferrer"&gt;Quine community Slack&lt;/a&gt; or &lt;a href="mailto:info@thatdot.com"&gt;send us an email&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>idFrom(): the simple function that’s key to Quine streaming graph</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Thu, 20 Apr 2023 22:03:28 +0000</pubDate>
      <link>https://dev.to/maglietti/idfrom-the-simple-function-thats-key-to-quine-streaming-graph-p2g</link>
      <guid>https://dev.to/maglietti/idfrom-the-simple-function-thats-key-to-quine-streaming-graph-p2g</guid>
      <description>&lt;h2&gt;
  
  
  A simple concept at the core of a new way of processing data
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What's a streaming graph?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When we first released Quine streaming graph last year, we had to answer this question a lot. After all, a "streaming graph" had never existed before.&lt;/p&gt;

&lt;p&gt;As interest grew, we got pretty good at answering, usually something like this: Quine is a real-time event processor like Flink or ksqlDB. It consumes data from sources like Kafka and Kinesis, queries for complex patterns in event streams, and pushes results to the next hop in the streaming architecture the instant a match is made. However, unlike those venerable systems, Quine uses a graph data structure.&lt;/p&gt;

&lt;p&gt;Hence, streaming graph.&lt;/p&gt;

&lt;p&gt;That seemed to work and, engineers being a curious lot, led inevitably to a second question: "How's it different from a graph database?"&lt;/p&gt;

&lt;p&gt;That's a fun question to answer, because it means we get to talk about &lt;code&gt;idFrom()&lt;/code&gt;. And explaining &lt;code&gt;idFrom()&lt;/code&gt; allows us to begin to unpack all the interesting architectural properties that make Quine uniquely well-suited for real-time complex event processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b9bj85nqpmhmkpqs59u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b9bj85nqpmhmkpqs59u.jpeg" alt="A photo of the character of David from the film Prometheus contemplating a scientific discovery. " width="640" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Big things have small beginnings." -- David from the film &lt;em&gt;Prometheus (2012)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Event-driven: what if we stopped querying databases?
&lt;/h2&gt;

&lt;p&gt;Unlike a graph database, which relies on an index to query for the existence of data in the graph, Quine uses &lt;code&gt;idFrom()&lt;/code&gt;, a custom &lt;a href="https://docs.quine.io/reference/cypher/cypher-language.html" rel="noopener noreferrer"&gt;Cypher function&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;idFrom()&lt;/code&gt; generates a unique node ID from a set of user-provided arguments -- most commonly taken from the data in the event stream itself -- which is then used in lieu of an index to locate and operate on a node and its properties. (We will get to the why in a bit but it will help first to look at how you use &lt;code&gt;idFrom()&lt;/code&gt;.)&lt;/p&gt;
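&lt;p&gt;Conceptually, &lt;code&gt;idFrom()&lt;/code&gt; acts like a deterministic hash: the same arguments always produce the same node ID, so no index lookup is needed to find the node again. A rough Python analogy (Quine's actual ID derivation is internal and differs from this sketch):&lt;/p&gt;

```python
import hashlib
import uuid

def id_from(*args):
    """Deterministically derive a UUID-shaped ID from the given values.

    A rough analogy to Quine's idFrom(): identical arguments always map
    to the same ID, so a node can be located without consulting an index.
    (Quine's real derivation is internal and not reproduced here.)
    """
    digest = hashlib.sha256(repr(args).encode("utf-8")).digest()
    return str(uuid.UUID(bytes=digest[:16]))

# The same inputs always yield the same ID; different inputs diverge.
rev_node_id = id_from("revision", 1869025669)
```

&lt;p&gt;Because the ID is a pure function of the data, independent ingest streams that mention the same entity converge on the same node without any coordination.&lt;/p&gt;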

&lt;p&gt;Say you want to analyze an event stream of edits from Wikipedia to keep an eye out for edits made by specific authors to specific articles in specific databases.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;JSON&lt;/em&gt; record (a pared-back version of the actual Wikipedia event feed used in the Wikipedia API recipe &lt;a href="https://docs.quine.io/getting-started/ingest-streams-tutorial.html" rel="noopener noreferrer"&gt;featured in our docs&lt;/a&gt;) might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/mediawiki/revision/create/1.1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wikidatawiki"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"page_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;83996749&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"rev_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1869025669&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"rev_timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-04-05T18:18:23Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"performer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"user_is_bot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6135162&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"rev_parent_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1869025663&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create the nodes in a continuous stream of records, you would use &lt;code&gt;MATCH&lt;/code&gt; to declare the node names then call the &lt;code&gt;idFrom()&lt;/code&gt; function to generate unique node IDs based on the values in the json itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;),(&lt;/span&gt;&lt;span class="n"&gt;pageNode&lt;/span&gt;&lt;span class="ss"&gt;),(&lt;/span&gt;&lt;span class="n"&gt;dbNode&lt;/span&gt;&lt;span class="ss"&gt;),(&lt;/span&gt;&lt;span class="n"&gt;userNode&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parentNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'revision'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.rev_id&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
  &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pageNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'page'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.page_id&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
  &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'db'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.database&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
  &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.performer.user_id&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
  &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parentNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'revision'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.rev_parent_id&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For now, we can skip adding properties to nodes, but it helps our discussion to complete this simple graph by adding relationships between the nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:IN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbNode&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
       &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:TO&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pageNode&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
       &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:MADE&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
       &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parentNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:NEXT&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, as each event streams in, Quine will create and connect nodes, forming the desired subgraph that looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5u9jjndsdwx6plm0vi9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh5u9jjndsdwx6plm0vi9.png" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the same subgraph with the node IDs no longer concealed by the node labels:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31z8htcpglo6jotbnddb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31z8htcpglo6jotbnddb.png" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note the things you didn't have to do to create this graph:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Query to find out whether the node already exists &lt;/li&gt;
&lt;li&gt; Consult a schema&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Quine eliminates the need to check whether a node exists before completing an operation.&lt;/p&gt;

&lt;p&gt;The deterministic nature of node IDs created using &lt;code&gt;idFrom()&lt;/code&gt; means a value or combination of values passed to the function will always result in the same ID.&lt;/p&gt;
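&lt;p&gt;As a quick illustration, reusing the sample &lt;code&gt;rev_parent_id&lt;/code&gt; value from the event above, the same inputs always hash to the same node ID no matter where or when the call is made:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// deterministic: identical arguments always return the identical ID
RETURN idFrom('revision', 1869025663) AS nodeId
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;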

&lt;p&gt;It will either create a new node based on the value or, if that node already exists, update it.&lt;/p&gt;

&lt;p&gt;In the latter case, because Quine is an event-sourced system, it doesn't need to look up whether the node already exists when updating it. Quine appends the update to the existing node, preserving historical versions that can be retrieved using &lt;code&gt;idFrom()&lt;/code&gt; together with the at.time parameter.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;idFrom()&lt;/code&gt; and CRUD operations: why Quine is so dang fast
&lt;/h2&gt;

&lt;p&gt;Because Quine uses a hash of a value to generate a node ID that is then used for CRUD operations, it bears a superficial similarity to NoSQL key-value stores. As long as you know either the ID or the value, it is dead simple to retrieve data from the graph.&lt;/p&gt;

&lt;p&gt;However, because of Quine's in-memory graph structure, it is far more efficient than a key-value database at operating on patterns, ranges (e.g., time-ordered data), or otherwise related data.&lt;/p&gt;

&lt;p&gt;Using the node ID to anchor the query, you specify the edges to traverse to find connected data.&lt;/p&gt;

&lt;p&gt;This might be a query to retrieve a node's properties using its node ID (in this case, for &lt;code&gt;revNode&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH (n) WHERE strId(n) = "8b290926-271c-3497-b5d6-e30fcf934a73" RETURN id(n), properties(n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which delivers these results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnl7skrkq10xknb8howr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnl7skrkq10xknb8howr.png" alt="Screen shot of a node ID and associated properties in json format." width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don't know a node's ID, you can query for it using the node's properties and the &lt;code&gt;strId()&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;userNode:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;user_is_bot:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt; &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;strid&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftytz402kyjf80v7asre2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftytz402kyjf80v7asre2.png" alt="Screen shot of a node strID() results -- the node ID as a string." width="800" height="136"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what about more complex queries -- for example, a query that must retrieve multiple related objects? Key-value stores are famously inefficient in this scenario, but this is precisely where Quine's architectural choices pay off. Using an in-memory graph structure means you can query for any node in a subgraph, follow its edges, and produce one or more values.&lt;/p&gt;

&lt;p&gt;For example, say you want to find all revisions where a bot made an update to the 'wikidatawiki' database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;userNode:&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;user_is_bot:&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:MADE&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;revNode:&lt;/span&gt;&lt;span class="n"&gt;revision&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:TO&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;pageNode:&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:IN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;dbNode:&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"wikidatawiki"&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotd305boxaq0o3x8vd8c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotd305boxaq0o3x8vd8c.png" alt="The results of the query -- two nod IDs returned." width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Either way, it starts with setting the node ID with &lt;code&gt;idFrom()&lt;/code&gt;. And &lt;code&gt;idFrom()&lt;/code&gt; makes Quine &lt;a href="https://www.thatdot.com/blog/scaling-quine-streaming-graph-to-process-1-million-events-sec" rel="noopener noreferrer"&gt;very, very fast.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Standing queries and querying data from the future with &lt;code&gt;idFrom()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Standing queries live inside the graph, monitoring the stream for specific patterns. Quine propagates them throughout the graph so that you never have to issue the query again; they simply persist, watching for matches.&lt;/p&gt;

&lt;p&gt;Once matches are found, standing queries trigger actions using those results (e.g., reporting results, executing code, transforming other data in the graph, or publishing data to another system).&lt;/p&gt;

&lt;p&gt;To do this, every standing query has two parts: the &lt;code&gt;pattern&lt;/code&gt; portion (the subgraph you are matching for in the event stream) and the &lt;code&gt;outputs&lt;/code&gt; portion (the action you wish to take).&lt;/p&gt;

&lt;p&gt;Adapted from the recipe used in &lt;a href="https://docs.quine.io/getting-started/standing-queries-tutorial.html" rel="noopener noreferrer"&gt;Getting Started&lt;/a&gt;, here's a standing query that monitors for non-bot revisions to the 'enwiki' database and outputs these events to the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;standingQueries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;MATCH (userNode:user {user_is_bot: false})-[:MADE]-&amp;gt;(revNode:revision {database: 'enwiki'})&lt;/span&gt;
        &lt;span class="s"&gt;RETURN DISTINCT id(revNode) as id&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cypher&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;print-output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherQuery&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
          &lt;span class="s"&gt;MATCH (n)&lt;/span&gt;
          &lt;span class="s"&gt;WHERE id(n) = $that.data.id&lt;/span&gt;
          &lt;span class="s"&gt;RETURN properties(n)&lt;/span&gt;
        &lt;span class="na"&gt;andThen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PrintToStandardOut&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6d7cgpcudxfkztxgqcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6d7cgpcudxfkztxgqcp.png" alt="results (json) prining to the console." width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Standing query matches printing to console.&lt;/p&gt;

&lt;p&gt;Because standing queries persist in the graph, incrementally updating partial results as new data arrives, you are not just querying past and present state; you are setting up queries for data yet to arrive.&lt;/p&gt;

&lt;p&gt;And while &lt;code&gt;idFrom()&lt;/code&gt; is a key part of what makes standing queries possible, to really understand what makes Quine function so efficiently as a stream processor, we'll need to dive into the actor-based, graph-shaped compute model. But that's for a different post.&lt;/p&gt;

&lt;p&gt;Instead, I'll leave you with a clever use of &lt;code&gt;idFrom()&lt;/code&gt; employed by developers at a SaaS company that uses Quine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5xfrk5fxeam4236w305.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5xfrk5fxeam4236w305.png" alt="Color wheel, Color, Pie chart" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitioning Key Spaces for a SaaS application using &lt;code&gt;idFrom()&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Since you can generate a node ID by passing an arbitrary combination of values to &lt;code&gt;idFrom()&lt;/code&gt;, some Quine users with SaaS or internal multi-tenant applications have employed it to partition graphs by customer namespace or similar property.&lt;/p&gt;
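&lt;p&gt;A minimal sketch of that pattern, using hypothetical &lt;code&gt;order_id&lt;/code&gt; and &lt;code&gt;tenant_id&lt;/code&gt; fields rather than anything from this article's event stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// tenant_id participates in every hash, so identical order_ids
// from different tenants still map to distinct nodes
MATCH (orderNode)
WHERE id(orderNode) = idFrom('order', $that.order_id, $that.tenant_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;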

&lt;p&gt;Sticking with the Wikipedia example, you could create distinct sub-graphs corresponding to each of the database types by adding &lt;code&gt;$that.database&lt;/code&gt; as an additional value determining each node ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;  &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;),(&lt;/span&gt;&lt;span class="n"&gt;pageNode&lt;/span&gt;&lt;span class="ss"&gt;),(&lt;/span&gt;&lt;span class="n"&gt;dbNode&lt;/span&gt;&lt;span class="ss"&gt;),(&lt;/span&gt;&lt;span class="n"&gt;userNode&lt;/span&gt;&lt;span class="ss"&gt;),(&lt;/span&gt;&lt;span class="n"&gt;parentNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
       &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'revision'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.rev_id&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.database&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
         &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pageNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'page'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.page_id&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.database&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
         &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'db'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.database&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
         &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'id'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.performer.user_id&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.database&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
         &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parentNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'revision'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.rev_parent_id&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;$that.database&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a series of subgraphs partitioned by database and would allow you to be certain that if you query for data related to a specific database, you won't inadvertently return data from others.&lt;/p&gt;
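&lt;p&gt;For example, to fetch a revision from the 'enwiki' partition alone (the revision ID below is made up for illustration), you pass the same database value that was used at ingest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MATCH (revNode) WHERE id(revNode) = idFrom('revision', 123456789, 'enwiki')
RETURN properties(revNode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;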

&lt;p&gt;And while the chance of key collision exists, it is &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions" rel="noopener noreferrer"&gt;vanishingly small&lt;/a&gt;, making this approach suitable for use in multi-tenant SaaS applications.&lt;/p&gt;

&lt;p&gt;At any rate, this accomplished what the company wanted: a partitioned graph for data separation, standing and ad hoc queries that work the same across the entire graph, and the only real cost being the discipline of always using the compound key.&lt;/p&gt;

&lt;p&gt;Pretty clever.&lt;/p&gt;

&lt;p&gt;If any of this inspires you or piques your interest and you want to try Quine yourself, check out &lt;a href="https://docs.quine.io/getting-started/" rel="noopener noreferrer"&gt;Getting Started&lt;/a&gt; docs.&lt;/p&gt;

</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>streaminganalytics</category>
    </item>
    <item>
      <title>Quine, streaming graph data that scales past 1 million events/second</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Fri, 07 Oct 2022 20:11:20 +0000</pubDate>
      <link>https://dev.to/maglietti/quine-streaming-graph-data-that-scales-past-1-million-eventssecond-534j</link>
      <guid>https://dev.to/maglietti/quine-streaming-graph-data-that-scales-past-1-million-eventssecond-534j</guid>
      <description>&lt;p&gt;Finding relationships within categorical data is graph's strong point. Doing so at scale, as Quine now makes possible, has significant implications for cyber security, fraud detection, observability, logistics, e-commerce, and any use case that graph is both well-suited for and which must process high velocity data in real time.&lt;/p&gt;

&lt;p&gt;The goal of this test is to demonstrate a high-volume of sustained event ingest that is resilient to cluster node failure in both Quine and the persister using commodity infrastructure, and to share performance and cost results along with details of the test for those interested in either reproducing results or running Quine in production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5lecd3trvn64wh2qv0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5lecd3trvn64wh2qv0l.png" alt="Scaling to 1M+ events/second and demonstrating recovery from various failure scenarios" width="800" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our tests delivered the following results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1M events/second processed for a 2 hour period&lt;/li&gt;
&lt;li&gt;1M+ writes per second&lt;/li&gt;
&lt;li&gt;1M 4-node graph traversals (reads) per second&lt;/li&gt;
&lt;li&gt;21K results (4-node pattern matches) emitted per second&lt;/li&gt;
&lt;li&gt;140 commodity hosts plus 1 hot spare running Quine Enterprise&lt;/li&gt;
&lt;li&gt;66 storage hosts using Apache Cassandra persistor&lt;/li&gt;
&lt;li&gt;3 hosts for Apache Kafka&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Infrastructure
&lt;/h2&gt;

&lt;p&gt;The Cassandra persistor layer is configured with a TTL of 15 minutes and a replication factor of 1 to manage quota limits and spending on cloud infrastructure. This does not fit every possible use case, but it is fairly common. Other, more storage-oriented scenarios will often increase the replication factor and/or TTL. In those variations, maintaining the 1 million events/sec processing rate would require adding Cassandra hosts or disk storage, both of which are budgetary concerns more than technical ones.&lt;/p&gt;
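&lt;p&gt;For anyone reproducing the setup, the two settings above correspond to ordinary Cassandra DDL. A sketch of what they might look like (the keyspace and table names here are illustrative, not the schema Quine actually creates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE KEYSPACE quine
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
  AND durable_writes = false;

-- 900 seconds = 15 minutes; applied to the snapshots and journals tables
ALTER TABLE quine.snapshots WITH default_time_to_live = 900;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;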

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt; Component &lt;/th&gt;
            &lt;th&gt;# of Hosts&lt;/th&gt;
            &lt;th&gt;Host Types&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;td&gt;Quine Cluster&lt;/td&gt;
            &lt;td&gt;141&lt;/td&gt;
            &lt;td&gt;
                &lt;ul&gt;
                    &lt;li&gt;c2-standard-30 (30 vCPUs, 120GB RAM)&lt;/li&gt;
                    &lt;li&gt;Max heap for JVM set to 12GB&lt;/li&gt;
                    &lt;li&gt;140 cluster size w/ 1 hot spare&lt;/li&gt;
                &lt;/ul&gt;
            &lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Cassandra Persistor Cluster&lt;/td&gt;
            &lt;td&gt;66&lt;/td&gt;
            &lt;td&gt;
                &lt;ul&gt;
                    &lt;li&gt;n1-highmem-32 (32 vCPU, 208GB RAM)&lt;/li&gt;
                    &lt;li&gt;375 GB local SSD each&lt;/li&gt;
                    &lt;li&gt;durable_writes=false&lt;/li&gt;
                    &lt;li&gt;TTL=15 minutes on snapshots (to control disk costs in testing) and journals tables&lt;/li&gt;
                &lt;/ul&gt;
            &lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Kafka&lt;/td&gt;
            &lt;td&gt;3&lt;/td&gt;
            &lt;td&gt;
                &lt;ul&gt;
                    &lt;li&gt;n2-standard-4 (4 vCPU, 16 GB RAM)&lt;/li&gt;
                    &lt;li&gt;Preloaded with 8 billion events (sufficient for a sustained 2 hour ingest at 1 million events
                        per second)&lt;/li&gt;
                    &lt;li&gt;420 partitions&lt;/li&gt;
                &lt;/ul&gt;
            &lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Test
&lt;/h2&gt;

&lt;p&gt;The plan is set out below, with each action labeled and the results explained. Events are clearly marked by sequence # on the Grafana screen grabs below the table.&lt;/p&gt;

&lt;p&gt;A few notes on the test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;a href="https://github.com/thatdot/1m-scripts/blob/improve-dx/quine/ingest.py" rel="noopener noreferrer"&gt;script&lt;/a&gt; is used to generate events&lt;/li&gt;
&lt;li&gt;Host failures are manually triggered.&lt;/li&gt;
&lt;li&gt;We used Grafana for the results (and screenshots).&lt;/li&gt;
&lt;li&gt;We pre-loaded Kafka with enough events to sustain &lt;strong&gt;one million events/second&lt;/strong&gt; for two hours.&lt;/li&gt;
&lt;li&gt;A Cassandra cluster is used for persistent data storage. The Cassandra cluster is not over-provisioned to accommodate compaction intentionally (a common strategy) so that the effects of database maintenance on the ingest rate can be demonstrated.&lt;/li&gt;
&lt;li&gt;The cluster is run in a Kubernetes environment&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Seq #&lt;/th&gt;
      &lt;th&gt;Actions&lt;/th&gt;
      &lt;th&gt;Expected Results&lt;/th&gt;
      &lt;th&gt;Actual Results&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;
        1
      &lt;/td&gt;
      &lt;td&gt;
        Start the Quine cluster and begin ingest from Kafka
      &lt;/td&gt;
      &lt;td&gt;
        The ingest rate increases and settles at or above 1 million events per second
      &lt;/td&gt;
      &lt;td&gt;
        Observed
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        2
      &lt;/td&gt;
      &lt;td&gt;
        Let Quine run for 40 minutes to establish a stable baseline
      &lt;/td&gt;
      &lt;td&gt;
        Quine does not fail and maintains a baseline ingest rate at or above 1 million events per second.
      &lt;/td&gt;
      &lt;td&gt;
        Observed
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        3
      &lt;/td&gt;
      &lt;td&gt;
        Kill a Quine host
      &lt;/td&gt;
      &lt;td&gt;
        Quine ingest is not significantly impacted. The hot spare steps in to recover quickly, and Kubernetes replaces the killed host which becomes a new hot spare.
      &lt;/td&gt;
      &lt;td&gt;
        Observed at 17:47. No impact to ingest rate. The hot spare recovered quickly and ingest was not impacted.
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        4
      &lt;/td&gt;
      &lt;td&gt;
        Persistor Maintenance
      &lt;/td&gt;
      &lt;td&gt;
        Cassandra regularly performs maintenance, Quine experiences this as increased latency and should backpressure the ingest to maintain stability during database maintenance.
      &lt;/td&gt;
      &lt;td&gt;
        From 17:55 - 18:15 the ingest rate is reduced as a corresponding increase in latency is measured above 1ms across all nodes from the Cassandra persistor.
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        5
      &lt;/td&gt;
      &lt;td&gt;
        Kill two Quine hosts
      &lt;/td&gt;
      &lt;td&gt;
        Observe the following sequence: hot spare recovers one host, whole cluster suspends ingest due to being degraded, Kubernetes replaces killed hosts, first replaced host recovers the cluster, and the second replaced host becomes the new hot spare.
      &lt;/td&gt;
      &lt;td&gt;
        Observed from 18:18 - 18:25. Thanks to Kubernetes, the impact was not visible; however, the expected sequence was confirmed in the logs.
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        6
      &lt;/td&gt;
      &lt;td&gt;
        Stop and resume a Quine host for about 1 minute to inject high latency
      &lt;/td&gt;
      &lt;td&gt;
        Quine detects that the host is no longer available, ejects it from the cluster, and the hot spare steps in to recover. When the ejected host resumes, it learns it was removed from the cluster, so it shuts down, is restarted by Kubernetes, and becomes the new hot spare.
      &lt;/td&gt;
      &lt;td&gt;
        Observed from 18:41 - 18:46. No impact to the ingest rate: the backpressured ingest affected only a single host in the cluster, and recovery happened quickly.
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        7
      &lt;/td&gt;
      &lt;td&gt;
        Stop and resume a Cassandra persistor host for about 1 minute to inject high latency
      &lt;/td&gt;
      &lt;td&gt;
        Quine back pressures ingest until Cassandra persistor has recovered
      &lt;/td&gt;
      &lt;td&gt;
        Observed from 18:47 - 18:54. Due to replication factor = 1, ingest was impacted until Cassandra persistor recovered. Then ingest resumed to &amp;gt; 1M e/s.
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        8
      &lt;/td&gt;
      &lt;td&gt;
        Kill a Cassandra persistor host
      &lt;/td&gt;
      &lt;td&gt;
        Quine suspends ingest until Cassandra persistor recovers with a new host
      &lt;/td&gt;
      &lt;td&gt;
        Observed from 18:54 - 19:10. Kubernetes recovered the host quickly, and ingest returned to 1M e/s by 18:58 (within a few minutes).
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        9
      &lt;/td&gt;
      &lt;td&gt;
        Persistor Maintenance
      &lt;/td&gt;
      &lt;td&gt;
        Cassandra regularly performs maintenance; Quine experiences this as increased latency and should backpressure ingest to maintain stability during database maintenance.
      &lt;/td&gt;
      &lt;td&gt;
        From 17:55 - 18:15 the ingest rate was reduced as a corresponding latency increase above 1 ms was measured across all nodes of the Cassandra persistor.
      &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;
        10
      &lt;/td&gt;
      &lt;td&gt;
        Let Quine consume the remaining Kafka stream
      &lt;/td&gt;
      &lt;td&gt;
        Observe the Quine hosts drop to zero events per second (not all at once)
      &lt;/td&gt;
      &lt;td&gt;
        Observed from 19:10 - 19:35. Around the time Cassandra persistor latency returned to 1 ms and ingest returned to 1M e/s, the pre-loaded ingest stream began to run dry on some hosts. Over the following 20 minutes, hosts exhausted their partitions of the stream.
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr441cxrybeoqjkddqada.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr441cxrybeoqjkddqada.png" alt="A diagram showing sustained event processing of 1 million events per second." width="800" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see from the overall ingest rate results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;#1 shows an initial peak of 1.25M events/sec&lt;/li&gt;
&lt;li&gt;#2 Quine settles into a steady ingest rate &amp;gt; 1 million events/sec&lt;/li&gt;
&lt;li&gt;#3 Quine recovers nicely after a single-node shutdown and settles back into a steady ingest rate &amp;gt; 1 million events/sec&lt;/li&gt;
&lt;li&gt;#s 4 and 9 show Cassandra maintenance events (see Cassandra Latency - Figure 3)&lt;/li&gt;
&lt;li&gt;#5 Quine has no problem with two-node failure events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We observed that a persistor node high-latency event (7) has a more marked impact on performance than either a Quine node failure (5) or an outright failure of a persistor node (8). In the case of a clear failure, Kubernetes is quick to replace the node, allowing ingest to resume. When a persistor node is non-responsive but not clearly down, Quine’s response is to backpressure ingest until the node recovers.&lt;/p&gt;

&lt;p&gt;An alternate variation on this test could use more persistor machines to stabilize ingest rates during maintenance events.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbisjxm8apzm5mqiug3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbisjxm8apzm5mqiug3y.png" alt="A per-host diagram demonstrates when different operational events impacted performance of both individual hosts and the overall cluster." width="800" height="172"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The individual Quine node ingest graphs indicate when individual nodes are offline and reinforce the observation that Quine Enterprise’s cluster resilience allows for smooth operation during high-volume ingest, even in the face of a Quine node shutdown or failure. Quine’s overall performance tracks persistor performance closely, making the persistor a key area of operational focus for anyone planning a production deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e86ayvk8e8yp5wantzl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e86ayvk8e8yp5wantzl.png" alt="Cassandra Latency events line up with overall decreases in cluster performance." width="800" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The median query latency for the Cassandra cluster during this test was &amp;lt;1 ms. Even during and following a persistor latency injection (7) or persistor node failure (8), cluster latency stayed &amp;lt; 1.5 ms. Events (1), (5), and (8) all reflect increased latency for single nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Standing Queries and 1 Million 4-node traversals per second
&lt;/h2&gt;

&lt;p&gt;The purpose of running any complex event processor, Quine included, is to detect and act on high-value events in real time. This could mean detecting indications of a cyber attack or video stream buffering, or identifying e-commerce upsell opportunities at checkout. This is where Quine really excels.&lt;/p&gt;

&lt;p&gt;Standing queries are a unique feature of Quine. They monitor streams for specified patterns, maintain partial matches, and execute user-specified actions the instant a full match is made. Actions can include anything from updating the graph itself by creating new nodes or edges to writing results out to Kafka or Kinesis, or posting results to a webhook.&lt;/p&gt;

&lt;p&gt;In this test, Quine standing queries monitored for specific 4-node patterns, requiring a 4-node traversal every time an event was ingested. Traditional graph databases slow down ingest when performing multi-node traversals. Not Quine. Quine’s ability to sustain high-speed data ingest together with simultaneous graph analysis is a revolutionary new capability. Not only did Quine ingest more than 1,000,000 events per second, it analyzed all that data in real time to find more than 20,000 matches per second for complex graph patterns. This is a whole new world!&lt;/p&gt;
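&lt;p&gt;The exact patterns used here are in the published test scripts; purely as a hypothetical sketch (the edge name &lt;code&gt;caused&lt;/code&gt; and property &lt;code&gt;prop&lt;/code&gt; are illustrative, not the benchmark's actual schema), a 4-node standing query pattern in a Quine recipe takes a shape like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;standingQueries:
  - pattern:
      type: Cypher
      query: |-
        MATCH (a)-[:caused]-&amp;gt;(b)-[:caused]-&amp;gt;(c)-[:caused]-&amp;gt;(d)
        WHERE exists(d.prop)
        RETURN DISTINCT id(a)
&lt;/code&gt;&lt;/pre&gt;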

&lt;h2&gt;
  
  
  Why Quine Hitting 1 Million Events/Sec Matters
&lt;/h2&gt;

&lt;p&gt;Since its release in 2007 at the start of the NoSQL revolution, Neo4j has proven conclusively the value of graph for connecting and finding complex patterns in categorical data.&lt;/p&gt;

&lt;p&gt;The graph data model is indispensable to everything from fraud detection to network observability to cybersecurity. It is used for recommendation engines, logistics, and XDR/EDR.&lt;/p&gt;

&lt;p&gt;But not long after NoSQL hit the scene, Kafka kicked off the movement toward real-time event processing. Soon, event processors like Flink, Spark Streaming and ksqlDB brought the ability to process live streams. These systems relied on less-expressive key-value stores or slower document and relational databases to save intermediate data.&lt;/p&gt;

&lt;p&gt;Quine is the graph analog and is important because now you can do what graph is really good at -- finding complex patterns across multiple streams of data using not just numerical but categorical data.&lt;/p&gt;

&lt;p&gt;Quine makes all the great graph use cases viable at high volumes and in real time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;If you want to reproduce this test, we have published the &lt;a href="https://github.com/thatdot/1m-scripts" rel="noopener noreferrer"&gt;test details on GitHub&lt;/a&gt; so that you can understand and run it yourself.&lt;/p&gt;

&lt;p&gt;If you want help planning your own test, or you would like to try Quine Enterprise, please contact us. You can also read more about &lt;a href="https://www.thatdot.com/product/quine-enterprise" rel="noopener noreferrer"&gt;Quine Enterprise here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or you can start learning about Quine now by visiting the &lt;a href="https://quine.io/" rel="noopener noreferrer"&gt;Quine open source project&lt;/a&gt;. We have a Slack channel where folks can ask questions and we are always up for a call.&lt;/p&gt;

</description>
      <category>quine</category>
      <category>streamingdata</category>
      <category>graph</category>
    </item>
    <item>
      <title>Use Quine Graph ETL to reduce SIEM storage costs.</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Mon, 25 Jul 2022 19:57:01 +0000</pubDate>
      <link>https://dev.to/maglietti/use-quine-graph-etl-to-reduce-siem-storage-costs-4g0e</link>
      <guid>https://dev.to/maglietti/use-quine-graph-etl-to-reduce-siem-storage-costs-4g0e</guid>
      <description>&lt;h2&gt;
  
  
  The High Cost of Storing Low Value Data
&lt;/h2&gt;

&lt;p&gt;The high cost of SIEM has given rise to countless &lt;a href="https://www.google.com/search?q=%22SIEM%22+reduce+cost" rel="noopener noreferrer"&gt;articles and dozens of companies&lt;/a&gt; promoting strategies or products to reduce monthly bills, with some claiming 50-90% reductions. While the 50-90% number seems a little overblown and sure to be met with skepticism — enterprises tend to take a “better to store it and pay the price than regret we didn’t later” approach, especially when the data may have compliance implications — the appeal is easy to understand.&lt;/p&gt;

&lt;p&gt;I took a look at the current methods for reducing SIEM costs and compared them to what graph ETL using Quine can accomplish, all while considering the impact on data fidelity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The State of Stream Pre-Processing: Random, Destructive, and Only Somewhat Effective
&lt;/h2&gt;

&lt;p&gt;Legacy event log pre-processing offerings typically employ one or more of six basic strategies to reduce the amount of data stored in the SIEM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Sample data&lt;/li&gt;
&lt;li&gt; Filter out fields&lt;/li&gt;
&lt;li&gt; Filter out events&lt;/li&gt;
&lt;li&gt; De-duplicate&lt;/li&gt;
&lt;li&gt; Aggregate/roll-up&lt;/li&gt;
&lt;li&gt; Re-route some data to cheaper alternatives for cold storage (e.g., Logstash or Amazon S3)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These solutions also usually include the ability to set rules that refine system behavior by data source or event type – for instance, sampling one in five events from a log of failed authentication attempts but one in twenty events from an Apache access log.&lt;/p&gt;

&lt;p&gt;It is important to note that stream pre-processing can only be applied to each stream and each record individually. Since many modern event processing use cases — not just SIEM but those for machine learning and e-commerce — depend on combining multiple data sources to model complex events, the single-stream approach means storing duplicate data from each stream required to connect them later (in SQL terms, these data are the keys used to join the various data sets once they are stored).  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We were paying for 600 GB to 700 GB per day with Splunk, which meant we were lousy co-workers to our IT group, because we had to tell them, 'Send us this field, not that field,' and limit the data ingestion severely," said John Gerber, principal cybersecurity analyst at Reston, Va., systems integrator SAIC. -- from &lt;em&gt;Elastic SIEM woos enterprises with cost savings&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As the quote above makes clear, some approaches also require lots of operational intervention, meaning delays for analysts and data scientists and an overall increase in cost of ownership.&lt;/p&gt;

&lt;p&gt;The more important limitation is that these approaches &lt;strong&gt;&lt;em&gt;cannot determine the value of the data they discard&lt;/em&gt;&lt;/strong&gt;. They either throw data away or, in the case of aggregation, &lt;a href="https://pubmed.ncbi.nlm.nih.gov/25810242/" rel="noopener noreferrer"&gt;reduce fidelity&lt;/a&gt;. All data is considered to have the same value.&lt;/p&gt;

&lt;p&gt;Quine’s approach is different: it turns high volumes of low-value data into low volumes of high-value data.&lt;/p&gt;

&lt;p&gt;Instead of storing data in Splunk or a similar system and then determining value, Quine can evaluate data as it arrives and make choices to store or discard based on the problem you are trying to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quine Ingest Queries: Semantic ETL for High Value Data
&lt;/h2&gt;

&lt;p&gt;At the heart of how Quine processes data are two query types: ingest and standing queries (more on the latter below).&lt;/p&gt;

&lt;p&gt;Quine uses ingest queries to consume event data and construct your streaming graph database. Ingest queries perform real-time ETL on incoming streams, combining multiple data sources (for example from multiple Kafka topics, Kinesis streams, data from databases, &lt;a href="https://www.thatdot.com/blog/ingesting-data-from-the-internet" rel="noopener noreferrer"&gt;live feeds via APIs&lt;/a&gt;) into a single streaming graph, eliminating the need to keep duplicate data around for joins.&lt;/p&gt;

&lt;p&gt;Using Quine’s ingest ETL, you can join all the data, eliminating cross-stream duplicates. That accounts for some incremental data reduction over existing methods, which, along with the other five strategies (all of which Quine supports), means Quine offers superior savings on your SIEM costs. But more than just deduplicating data, joining streams lets you draw conclusions early about what makes some data more valuable than other data.&lt;/p&gt;
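&lt;p&gt;As a minimal, hypothetical sketch (the topic and field names are invented for illustration), an ingest query that joins events on a shared key can use &lt;code&gt;idFrom()&lt;/code&gt; to derive a deterministic node ID, so events from any stream land on the same node without storing duplicate join keys:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ingestStreams:
  - type: KafkaIngest
    topics:
      - auth-events
    format:
      type: CypherJson
      query: |-
        MATCH (user), (ip)
        WHERE id(user) = idFrom('user', $that.user_id)
          AND id(ip) = idFrom('ip', $that.src_ip)
        SET user.name = $that.user_id,
            ip.address = $that.src_ip
        CREATE (user)-[:logged_in_from]-&amp;gt;(ip)
&lt;/code&gt;&lt;/pre&gt;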

&lt;p&gt;Quine’s real power, however, is its ability to apply a semantic filter to your data to find patterns made up of multiple events. And it does so as data streams in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Save Only the Patterns That Matter
&lt;/h2&gt;

&lt;p&gt;Ingest queries make it easy to organize the high value, often complex, patterns in data into graph structures. These patterns are characterized by the relationships between multiple events. In a practical sense, you are shaping the data into a form that anticipates the analysis you will perform downstream in your SIEM. Quine can join, interpret, and trim away any data not relevant to the answers.&lt;/p&gt;

&lt;p&gt;What you end up creating in your graph-ETL are subgraphs, or patterns of two or more nodes and connecting edges.&lt;/p&gt;

&lt;p&gt;Here are a few real world examples from the Quine community:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find and store all instances where there have been attempts (both successful and failed) to log into the accounts of members of the executive team from multiple IP addresses&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30yxpsx5p7el0zsse46q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F30yxpsx5p7el0zsse46q.png" alt="A graph with three node types -- IP address, Account, and Executive Team." width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A subgraph for monitoring authentication fraud attempts.&lt;/p&gt;
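&lt;p&gt;As an illustrative sketch (the labels and edge names are hypothetical, not a tested recipe), the pattern in this subgraph could be expressed in Cypher roughly as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;MATCH (ip:IP)-[:attempted_login]-&amp;gt;(acct:Account)-[:member_of]-&amp;gt;(team:ExecutiveTeam)
RETURN DISTINCT id(acct)
&lt;/code&gt;&lt;/pre&gt;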

&lt;ol&gt;
&lt;li&gt;Find and store all instances where multiple processes in different office locations are sending messages to the same IP address&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m0zj0o3bxdtiwomsloa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m0zj0o3bxdtiwomsloa.png" alt="A security-focused subgraph." width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A subgraph for monitoring processes and the IP addresses to which they write.&lt;/p&gt;

&lt;p&gt;In both of these examples, the test for what you keep and what you discard is based on what might possibly be important, on what matters to your business.&lt;/p&gt;

&lt;h2&gt;
  
  
  What if data takes time to become interesting?
&lt;/h2&gt;

&lt;p&gt;One challenge in processing streaming data – especially when event data arrives from many networked sources – is that it can arrive late or out of order, obscuring what would otherwise be an interesting pattern. Consider the examples above.&lt;/p&gt;

&lt;p&gt;What if the login attempts in example one were spread out over days or even weeks?&lt;/p&gt;

&lt;p&gt;What if log events from several locations in example two (above) were delayed for several hours or started at different times? Quine handles this late arriving data (as well as out of order data) using standing queries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgte3dws319whflfbvf7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgte3dws319whflfbvf7.png" alt="A five step diagram of a standing query in a graph." width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Standing queries persist in the graph, storing partial matches and triggering actions when a full match occurs.&lt;/p&gt;

&lt;p&gt;Standing queries live inside the graph and automatically propagate the incremental results computed from both historical data and incoming streaming data. Once matches are found, standing queries trigger actions using those results (e.g., execute code, transform other data in the graph, publish data to another system like Apache Kafka or Kinesis).&lt;/p&gt;

&lt;p&gt;The implication for SIEM storage reduction is that Quine can temporarily retain &lt;em&gt;possibly interesting&lt;/em&gt; incomplete patterns until a match occurs. The data is neither discarded nor taking up costly space in your SIEM. Then, the instant a match occurs, it is sent along to the SIEM for regular processing. If a match doesn’t occur within a useful period, the data can be discarded automatically.&lt;/p&gt;
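&lt;p&gt;A minimal sketch of this flow, assuming Quine's recipe format (the pattern, property, and topic name are hypothetical): the standing query holds partial matches inside the graph, and only completed matches are published to the topic your SIEM consumes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;standingQueries:
  - pattern:
      type: Cypher
      query: |-
        MATCH (acct)-[:attempted_login_from]-&amp;gt;(ip)
        WHERE exists(acct.executive)
        RETURN DISTINCT id(acct)
    outputs:
      to-siem:
        type: WriteToKafka
        topic: siem-alerts
        bootstrapServers: localhost:9092
        format:
          type: JSON
&lt;/code&gt;&lt;/pre&gt;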

&lt;p&gt;Want to go further? Consider bypassing your SIEM altogether and sending alerts and data directly to your SOC or NOC’s dashboards, analysts, or data science team as it arrives and matches occur. But that’s for the next blog post. Until then, try out Quine’s graph ETL on your own log data. It is open source and easy to get started with.&lt;/p&gt;

&lt;p&gt;Who knows, it might just save you a few million dollars.&lt;/p&gt;

&lt;h2&gt;
  
  
  Help Getting Started
&lt;/h2&gt;

&lt;p&gt;If you want to try it on your own logs, here are some resources to help:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download Quine - &lt;a href="https://quine.io/download" rel="noopener noreferrer"&gt;JAR file&lt;/a&gt; | &lt;a href="https://hub.docker.com/r/thatdot/quine" rel="noopener noreferrer"&gt;Docker Image&lt;/a&gt; | &lt;a href="https://github.com/thatdot/quine" rel="noopener noreferrer"&gt;Github&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Check out the &lt;a href="https://www.thatdot.com/blog?*=Ingest" rel="noopener noreferrer"&gt;Ingest Data into Quine&lt;/a&gt; blog series covering everything from &lt;a href="https://docs.quine.io/components/ingest-sources/kafka.html" rel="noopener noreferrer"&gt;ingest from Kafka&lt;/a&gt; to ingesting CSV data&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://quine.io/recipes/apache-log-analytics" rel="noopener noreferrer"&gt;Apache Log Recipe&lt;/a&gt; - this recipe provides more ingest pattern examples&lt;/li&gt;
&lt;li&gt;Join &lt;a href="https://that.re/quine-slack" rel="noopener noreferrer"&gt;Quine Community Slack&lt;/a&gt; and get help from thatDot engineers and community members.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>streaminganalytics</category>
      <category>siem</category>
    </item>
    <item>
      <title>Standing Queries: Turning Event-Driven Data Into Data-Driven Events</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Wed, 06 Jul 2022 21:55:13 +0000</pubDate>
      <link>https://dev.to/maglietti/standing-queries-turning-data-driven-events-into-event-driven-data-29en</link>
      <guid>https://dev.to/maglietti/standing-queries-turning-data-driven-events-into-event-driven-data-29en</guid>
<description>&lt;p&gt;Quine's superpower is the ability to store and execute business logic within the graph. That logic can then operate directly on data as it streams in. We call this type of query a &lt;em&gt;standing query&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A standing query incrementally matches some graph structure while new data is ingested into the graph. Quine’s special design makes this process extremely fast and efficient. When a full pattern match is found, a standing query takes action.&lt;/p&gt;

&lt;p&gt;A standing query is defined in two parts: a &lt;strong&gt;pattern&lt;/strong&gt; and an &lt;strong&gt;output&lt;/strong&gt;. The &lt;strong&gt;pattern&lt;/strong&gt; defines what we want to match, expressed in Cypher using the form &lt;code&gt;MATCH … WHERE … RETURN …&lt;/code&gt;. The &lt;strong&gt;output&lt;/strong&gt; defines the action(s) to take for each result produced by the &lt;code&gt;RETURN&lt;/code&gt;  in the pattern query.&lt;/p&gt;

&lt;p&gt;The result of a standing query output is passed to a series of actions which process the &lt;strong&gt;output&lt;/strong&gt;. This output can be logged, passed to other systems (via Kafka, Kinesis, HTTP POST, and more), or can even be used to perform additional actions like running new queries or even rewriting parts of the graph. Whatever logic your application needs. &lt;/p&gt;
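&lt;p&gt;Put together, a minimal standing query has these two parts. This sketch assumes Quine's recipe YAML; the edge name is illustrative, and &lt;code&gt;PrintToStandardOut&lt;/code&gt; simply logs each result:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;standingQueries:
  - pattern:
      type: Cypher
      query: |-
        MATCH (a)-[:related_to]-&amp;gt;(b)
        RETURN DISTINCT id(a)
    outputs:
      print-match:
        type: PrintToStandardOut
&lt;/code&gt;&lt;/pre&gt;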

&lt;h2&gt;
  
  
  How nodes match patterns
&lt;/h2&gt;

&lt;p&gt;Each node in Quine is backed by an actor, which makes each graph node act like its own little CPU. Actors function as lightweight, single-threaded logical computation units that maintain state and communicate with each other by passing messages.&lt;/p&gt;

&lt;p&gt;The actor model enables Quine to store a standing query in the graph and remember it automatically. When you issue a &lt;code&gt;DistinctId&lt;/code&gt; standing query, the query is broken into individual steps that can be tested one at a time on individual nodes. Quine stores the result of each successive decomposition of the query (smaller and smaller sub-queries) internally on the node issuing that portion of the query. The previous node's query is essentially a subscription to the next node's status as either matching or not matching its part of the query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq9ysm1w1dwpaq8rztze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feq9ysm1w1dwpaq8rztze.png" alt="Quine Streaming Graph Actor Model" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Any changes in the next node’s pattern match state result in a notification to the querying node. In this way, a complex query is relayed through the graph, where each node subscribes to whether or not the next node fulfills its part of the query. When a complete match is made, or unmade, the chain is notified with results and an output action is triggered. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There are two pattern match modes: &lt;code&gt;DistinctId&lt;/code&gt; and &lt;code&gt;MultipleValues&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The pattern query must take the form &lt;code&gt;MATCH … WHERE … RETURN …&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;When the &lt;code&gt;mode&lt;/code&gt; is &lt;code&gt;DistinctId&lt;/code&gt;, the pattern query &lt;code&gt;RETURN&lt;/code&gt; must also be &lt;code&gt;DISTINCT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.quine.io/components/multiple-values.html#match-query" rel="noopener noreferrer"&gt;https://docs.quine.io/components/multiple-values.html#match-query&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Creating a standing query
&lt;/h2&gt;

&lt;p&gt;The first step to making a standing query is determining the graph pattern you want to watch for. You may have deployed Quine in your data pipeline to perform a series of tasks to isolate data, implement a specific feature, or monitor the stream for a specific pattern in real time. In any case, Quine will implement your logic using Cypher. The &lt;a href="https://github.com/thatdot/quine/blob/main/quine/recipes/sq-test.yaml" rel="noopener noreferrer"&gt;recipe&lt;/a&gt; for this example is included in the Quine repo if you'd like to follow along.&lt;/p&gt;

&lt;p&gt;Let's demonstrate this concept using Quine's built-in synthetic data generator, introduced in v1.3.0. Say you need to establish the relationships between all numbers in a number line and any number that is divisible by 10 using &lt;a href="https://mathworld.wolfram.com/IntegerDivision.html" rel="noopener noreferrer"&gt;integer division&lt;/a&gt; (where dividing always returns a whole number; the remainder is discarded).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="py"&gt;ingestStreams:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="py"&gt;format:&lt;/span&gt;
      &lt;span class="py"&gt;query:&lt;/span&gt; &lt;span class="o"&gt;|-&lt;/span&gt;
        &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;gen.node.from&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;toInteger&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;$that&lt;/span&gt;&lt;span class="ss"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
             &lt;span class="nf"&gt;toInteger&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;$that&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
        &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thisNode&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextNode&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;divNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thisNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; 
          &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; 
          &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;divNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="err"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; 
        &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;this.i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;this.prop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gen.string.from&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thisNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:next&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextNode&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; 
               &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thisNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:div_by_ten&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;divNode&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
      &lt;span class="py"&gt;type:&lt;/span&gt; &lt;span class="n"&gt;CypherLine&lt;/span&gt;
    &lt;span class="py"&gt;type:&lt;/span&gt; &lt;span class="n"&gt;NumberIteratorIngest&lt;/span&gt;
    &lt;span class="nl"&gt;ingestLimit&lt;/span&gt;&lt;span class="dl"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This recipe creates a graph of 100,000 nodes with a shape that we can use for our example. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fij0bbdtdn8luzwrf455c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fij0bbdtdn8luzwrf455c.png" alt="Graph Shape" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, I want to count the distinct times that a pattern like the one visualized above occurs in a sample of 100,000 numbers. A key to our pattern is the existence of the &lt;code&gt;prop&lt;/code&gt; property on a node, which is generated by the &lt;code&gt;gen.string.from()&lt;/code&gt; function. &lt;/p&gt;

&lt;p&gt;To detect a pattern in our data, we can write a Cypher query in the &lt;code&gt;pattern&lt;/code&gt; section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="py"&gt;standingQueries:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="py"&gt;pattern:&lt;/span&gt;
      &lt;span class="py"&gt;query:&lt;/span&gt; &lt;span class="o"&gt;|-&lt;/span&gt;
        &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:div_by_ten&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:div_by_ten&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;exists&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c.prop&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
      &lt;span class="py"&gt;type:&lt;/span&gt; &lt;span class="k"&gt;Cypher&lt;/span&gt;
    &lt;span class="py"&gt;outputs:&lt;/span&gt;
      &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="py"&gt;results:&lt;/span&gt;
        &lt;span class="py"&gt;type:&lt;/span&gt; &lt;span class="k"&gt;Drop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query looks for a number that is the ten-divisor of another number, which is in turn the ten-divisor of a third number in the graph. That means it effectively matches one of the first 1,000 nodes created by our "number iterator" ingest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ java &lt;span class="nt"&gt;-jar&lt;/span&gt; quine &lt;span class="nt"&gt;-r&lt;/span&gt; sq-test.yaml
Graph is ready
Running Recipe Standing Query Test Recipe
Using 1 node appearances
Using 11 quick queries 
Running Standing Query STANDING-1
Running Ingest Stream INGEST-1
Quine web server available at http://0.0.0.0:8080
INGEST-1 status is completed and ingested 100000

 | &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; STANDING-1 count 1000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example simply counts how many matches are detected, using the standing query &lt;code&gt;output&lt;/code&gt; variant &lt;code&gt;type: Drop&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Standing query result output
&lt;/h2&gt;

&lt;p&gt;Say that instead of just counting the number of times that the pattern matches, we need to output the match for debugging or inspection. We can replace the &lt;code&gt;Drop&lt;/code&gt; output with a &lt;code&gt;CypherQuery&lt;/code&gt; that uses the matched result and then prints information to the console. When issuing a &lt;code&gt;DistinctId&lt;/code&gt; standing query, the result of a match is a payload that looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"isPositiveMatch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"resultId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2a757517-1225-7fe2-0d0e-22625ad3be37"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"a.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45110&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"a.prop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YH32SISr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"b.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4511&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"b.prop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fqx8aVAU"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"c.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;451&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"c.prop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"61mTZqH8"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This payload includes, in the &lt;code&gt;data&lt;/code&gt; field, the ID of the node that initially matched. So we can write a new Cypher query, triggered by this match, that fetches additional information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:div_by_ten&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:div_by_ten&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;$that.data.id&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;a.i&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a.prop&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b.i&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b.prop&lt;/span&gt; &lt;span class="n"&gt;c.i&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c.prop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;MATCH&lt;/code&gt; portion looks similar to our standing query, but this time we're not monitoring the graph; we're fetching data from the three-node pattern rooted at &lt;code&gt;(c)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Replacing the &lt;code&gt;count-1000-results&lt;/code&gt; output with the &lt;code&gt;inspect-results&lt;/code&gt; output below would accomplish just that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;inspect-results&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherQuery&lt;/span&gt;
  &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;MATCH (a)-[:div_by_ten]-&amp;gt;(b)-[:div_by_ten]-&amp;gt;(c)&lt;/span&gt;
    &lt;span class="s"&gt;WHERE id(c) = $that.data.id&lt;/span&gt;
    &lt;span class="s"&gt;RETURN a.i, a.prop, b.i, b.prop c.i, c.prop&lt;/span&gt;
  &lt;span class="na"&gt;andThen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PrintToStandardOut&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The outputs stage of a standing query is where you can express your business logic and put Quine to work for you in your data pipeline. Take some time to review all of the possible output types in our &lt;a href="https://docs.quine.io/reference/rest-api.html#/schemas/StandingQueryResultOutput" rel="noopener noreferrer"&gt;API documentation&lt;/a&gt; at &lt;a href="https://quine.io" rel="noopener noreferrer"&gt;https://quine.io&lt;/a&gt;. &lt;/p&gt;
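&lt;p&gt;For example, instead of printing to the console, matches could be published to a Kafka topic. Here is a rough sketch of what such an output might look like; the &lt;code&gt;WriteToKafka&lt;/code&gt; field names are my assumptions, so verify them against the &lt;code&gt;StandingQueryResultOutput&lt;/code&gt; schema for your Quine version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch: publish standing query matches to a Kafka topic instead of
# printing them. Field names are assumptions -- check the API docs.
kafka-results:
  type: WriteToKafka
  topic: sq-matches
  bootstrapServers: localhost:9092
  format:
    type: JSON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;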

&lt;h2&gt;
  
  
  Modifying standing queries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Modify a Standing Query Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You also need to notify Quine of changes when you modify the &lt;code&gt;outputs&lt;/code&gt; section of an existing standing query. The Quine API provides &lt;code&gt;DELETE&lt;/code&gt; and &lt;code&gt;POST&lt;/code&gt; methods on the &lt;code&gt;/api/v1/query/standing/{standing-query-name}/output/{standing-query-output-name}&lt;/code&gt; endpoint that let you remove an output and add a new one to an existing standing query. &lt;/p&gt;

&lt;p&gt;Continuing from above, let's change the original standing query output type from &lt;code&gt;Drop&lt;/code&gt; to a new &lt;code&gt;CypherQuery&lt;/code&gt; that outputs the matches to the console. We will use two API calls to accomplish the change. &lt;/p&gt;

&lt;p&gt;Delete the existing output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--request&lt;/span&gt; DELETE &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; http://0.0.0.0:8080/api/v1/query/standing/STANDING-1/output/count-1000-results &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the new output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; http://0.0.0.0:8080/api/v1/query/standing/STANDING-1/output/inspect-results &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s1"&gt;'{
    "type": "CypherQuery",
    "query": "MATCH (a)-[:div_by_ten]-&amp;gt;(b)-[:div_by_ten]-&amp;gt;(c)\nWHERE id(c) = $that.data.id\nRETURN a.i, a.prop, b.i, b.prop, c.i, c.prop",
    "andThen": {
      "type": "PrintToStandardOut"
    }
}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Propagate a New Standing Query&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new standing query is registered in the system, it is automatically registered only on new nodes (or on old nodes that are loaded back into the cache). This is the default behavior because proactively setting the standing query on all existing data can be quite costly, depending on how much historical data there is. So Quine defaults to the most efficient option.&lt;/p&gt;

&lt;p&gt;However, sometimes there is a need to actively propagate standing queries across all previously ingested data as well. You can use the API to request that Quine propagate a new standing query to all nodes in the existing graph. Here's how the request looks in &lt;code&gt;curl&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;--request&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; http://0.0.0.0:8080/api/v1/query/standing/control/propagate?include-sleeping&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Review the in-product API documentation via the Quine web interface for additional code snippets. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we looked at the different types of standing queries that you can create in Quine. A standing query is a powerful tool for data processing because it allows you to express your business logic as part of your data pipeline. We also looked at how you can modify an existing standing query output type and propagate a new standing query across the graph.&lt;/p&gt;

&lt;p&gt;Quine is open source if you want to explore standing queries for yourself using your own data. Download a precompiled version or build it yourself from the &lt;a href="https://github.com/thatdot/quine" rel="noopener noreferrer"&gt;Quine GitHub&lt;/a&gt; repository.&lt;/p&gt;

&lt;p&gt;Have a question, suggestion, or improvement? I welcome your feedback! Please drop into &lt;a href="https://quine-io.slack.com/" rel="noopener noreferrer"&gt;Quine Slack&lt;/a&gt; and let me know. I'm always happy to discuss Quine or answer questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Research
&lt;/h2&gt;

&lt;p&gt;From: &lt;a href="https://drive.google.com/file/d/17uw36E3juptE2QEEwKt-WLRhdXLM9r_R/view?usp=sharing" rel="noopener noreferrer"&gt;https://drive.google.com/file/d/17uw36E3juptE2QEEwKt-WLRhdXLM9r_R/view?usp=sharing&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Standing Query
&lt;/h2&gt;

&lt;p&gt;Quine implements a facility for executing any query type as a “Standing Query,” which is persisted in the graph. When a query is issued as a Standing Query, it includes callback functions describing what to do in each of the four cases where a node: &lt;/p&gt;

&lt;p&gt;1) Initially matches the query &lt;br&gt;
2) Initially does not match the query &lt;br&gt;
3) Initially did not match, but the node data changed so that it now does match the query &lt;br&gt;
4) Initially did match, but the node data changed so that it no longer matches the query &lt;/p&gt;

&lt;p&gt;To implement standing queries, Quine stores the result of each successive decomposition of a query (into smaller and smaller branches) on the node issuing that portion of the query. The query issued to the next node is essentially a subscription to the next node’s status as either matching the query or not. Changes in the next node’s state result in a notification to the querying node. In this way, a complex query is relayed through the graph, where each node subscribes to whether the next node fulfills the smaller query. When a complete match is made, a special actor is notified with the results and the appropriate callback (established with the original query) is called. These callbacks can simply return results, but can also execute arbitrary functionality, like initiating new queries, or even rewriting parts of the graph when certain patterns are matched.&lt;/p&gt;




&lt;p&gt;From: Quine Innovations, Part II&lt;/p&gt;

&lt;h2&gt;
  
  
  Query Execution
&lt;/h2&gt;

&lt;p&gt;Quine is implemented as a graph interpreter. A query is dropped into the graph and turned into a result by a recursive process: evaluate the first part of the query locally, and if it matches the requirements, relay the remainder to more nodes connected to the first (as relevant for the query definition), aggregating and processing the results returned. The relayed remainder of the query is smaller than the initial payload processed by the node in question. The process is repeated until the entire query is “consumed” and the relevant results returned and aggregated.&lt;/p&gt;

&lt;p&gt;The process of resolving a query happens in two directions: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the query is consumed while extending the remaining query parts to the next relevant set of nodes, and&lt;/li&gt;
&lt;li&gt;results are returned from all relevant participating nodes in the query.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first component of query resolution is about exploring the existing data graph to determine which nodes are responsible for resolving which portion of the query. The second component focuses on how results are returned to the requesting node recursively to produce the full and final result.&lt;/p&gt;

&lt;p&gt;Quine enables the unique opportunity to perform these steps separately, so that the result of performing the first part (exploration) can become a back-pressured stream of results delivered only when the consumer is ready to consume each of the next results. This allows the Quine system to achieve maximal memory and computational efficiency by computing a “recipe for results” in phase 1 which does not execute until the optimal moment when a consumer is ready to receive the results in phase 2. The back-pressure technique to slow a stream of data processing is well-known in the streaming data community; however, the application of it in a granular node-by-node fashion when resolving graph queries is a novel innovation.&lt;/p&gt;

</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>streaminganalytics</category>
    </item>
    <item>
      <title>Real-time Graph Analytics for Kafka Streams with Quine</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Mon, 27 Jun 2022 13:55:46 +0000</pubDate>
      <link>https://dev.to/maglietti/real-time-graph-analytics-for-kafka-streams-with-quine-44cp</link>
      <guid>https://dev.to/maglietti/real-time-graph-analytics-for-kafka-streams-with-quine-44cp</guid>
      <description>&lt;p&gt;Kafka is the tool of choice by data engineers when building a streaming data pipeline. Adding Quine into a Kafka-centric data pipeline is the perfect way to introduce streaming analytics to the mix. Adding business logic directly into an event pipeline allows you to process high-value insights in real-time. &lt;/p&gt;

&lt;h2&gt;
  
  
  Simple Streaming Pipeline
&lt;/h2&gt;

&lt;p&gt;Consider this straightforward, minimum viable streaming pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnfbmgx52o06i0j2dk8o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbnfbmgx52o06i0j2dk8o.png" alt="A simple streaming pipeline with Quine ingesting Kafka streaming data" width="800" height="93"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this simple pipeline, &lt;a href="https://vector.dev/" rel="noopener noreferrer"&gt;Vector&lt;/a&gt; will produce demo log lines once a second and stream them into a Kafka topic, &lt;code&gt;demo-logs&lt;/code&gt;, where a Quine ingest stream will transform the log events into a streaming graph. &lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up Vector
&lt;/h2&gt;

&lt;p&gt;Start by installing Vector in your environment. My examples use macOS and may need slight modifications to work correctly in your environment. I installed Vector with &lt;code&gt;brew install vector&lt;/code&gt;, which includes a sample &lt;code&gt;vector.toml&lt;/code&gt; config in &lt;code&gt;/opt/homebrew/etc/vector&lt;/code&gt;. I extended the sample Vector config to build our pipeline. &lt;/p&gt;

&lt;p&gt;Run Vector to get a feel for the events that Vector emits. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;❯ vector -c /opt/homebrew/etc/vector/vector.toml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Vector generates dummy log lines from a built-in &lt;a href="https://vector.dev/docs/reference/configuration/sources/demo_logs/" rel="noopener noreferrer"&gt;demo_logs&lt;/a&gt; source. The log lines are transformed in Vector using the &lt;a href="https://vector.dev/docs/reference/vrl/functions/#parse_syslog" rel="noopener noreferrer"&gt;parse_syslog&lt;/a&gt; function and emitted as a JSON object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"appname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Karimmove"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"facility"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lpr"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"some.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Take a breath, let it go, walk away"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"msgid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ID416"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"procid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;9207&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debug"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2022-06-14T15:34:11.936Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Vector is emitting log entries, we need to connect that output to Kafka by adding a &lt;a href="https://vector.dev/docs/reference/configuration/sinks/kafka/" rel="noopener noreferrer"&gt;Kafka sink&lt;/a&gt; element to the &lt;a href="https://gist.github.com/maglietti/abc26bb47c40940fb0b47ed37bed2c85" rel="noopener noreferrer"&gt;vector.toml&lt;/a&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Stream parsed logs to kafka&lt;/span&gt;
&lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;sinks.to_kafka&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="s"&gt;type = "kafka"&lt;/span&gt;
&lt;span class="s"&gt;inputs = [ "parse_logs" ]&lt;/span&gt;
&lt;span class="s"&gt;bootstrap_servers = "127.0.0.1:9092"&lt;/span&gt;
&lt;span class="s"&gt;key_field = "quine"&lt;/span&gt;
&lt;span class="s"&gt;topic = "demo-logs"&lt;/span&gt;
&lt;span class="s"&gt;encoding = "json"&lt;/span&gt;
&lt;span class="s"&gt;compression = "none"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Local Kafka Instance
&lt;/h2&gt;

&lt;p&gt;Kafka is the next step in the pipeline. I set up a single-node Kafka cluster in Docker. There are plenty of examples on the internet showing how to set up a Kafka cluster in Docker; set up the cluster in whatever way fits your environment. My cluster uses a &lt;a href="https://gist.github.com/maglietti/03c09030feae1329950a3a1db2ed8fd8" rel="noopener noreferrer"&gt;docker-compose&lt;/a&gt; file that launches version 7.1.1 of the Zookeeper and Kafka containers. &lt;/p&gt;
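&lt;p&gt;For reference, a minimal single-node compose file along these lines might look like the sketch below. The image tags and listener settings are assumptions; treat the linked gist as the working configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch of a single-node Zookeeper + Kafka cluster -- settings here are
# assumptions; use the linked docker-compose gist as the source of truth.
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.1.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.1.1
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;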

&lt;p&gt;Start the Kafka cluster and create a topic called &lt;code&gt;demo-logs&lt;/code&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note, I had to run the &lt;code&gt;docker compose up&lt;/code&gt; command a couple of times before both the Zookeeper and Kafka containers launched cleanly. Make sure the containers fully load at least once before including the &lt;code&gt;-d&lt;/code&gt; option to run them in detached mode.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ Docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
❯ docker &lt;span class="nb"&gt;exec &lt;/span&gt;Kafka Kafka-topics &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; kafka:9092 &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--topic&lt;/span&gt; demo-logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;a href="https://github.com/edenhill/kcat" rel="noopener noreferrer"&gt;kcat&lt;/a&gt; to verify the Kafka cluster is up and that the &lt;code&gt;demo-logs&lt;/code&gt; topic was configured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ kcat &lt;span class="nt"&gt;-b&lt;/span&gt; localhost &lt;span class="nt"&gt;-L&lt;/span&gt;
Metadata &lt;span class="k"&gt;for &lt;/span&gt;all topics &lt;span class="o"&gt;(&lt;/span&gt;from broker &lt;span class="nt"&gt;-1&lt;/span&gt;: localhost:9092/bootstrap&lt;span class="o"&gt;)&lt;/span&gt;:
 1 brokers:
  broker 1 at 127.0.0.1:9092 &lt;span class="o"&gt;(&lt;/span&gt;controller&lt;span class="o"&gt;)&lt;/span&gt;
 1 topics:
  topic &lt;span class="s2"&gt;"demo-logs"&lt;/span&gt; with 1 partitions:
    partition 0, leader 1, replicas: 1, isrs: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quine Config
&lt;/h2&gt;

&lt;p&gt;Ok, let's get Quine configured and ready to receive the log events from Kafka via an ingest stream. We can start with a simple ingest stream that takes each demo log line and creates a node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingestStreams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KafkaIngest&lt;/span&gt;
    &lt;span class="na"&gt;topics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;demo-logs&lt;/span&gt;
    &lt;span class="na"&gt;bootstrapServers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost:9092&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherJson&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;MATCH (n)&lt;/span&gt;
        &lt;span class="s"&gt;WHERE id(n) = idFrom($that)&lt;/span&gt;
        &lt;span class="s"&gt;SET n.line = $that&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Launch the Pipeline
&lt;/h2&gt;

&lt;p&gt;Let's launch Vector and Quine to get the pipeline moving. &lt;/p&gt;

&lt;p&gt;Launch Vector using the modified &lt;a href="https://gist.github.com/maglietti/abc26bb47c40940fb0b47ed37bed2c85" rel="noopener noreferrer"&gt;vector.toml&lt;/a&gt; configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ vector &lt;span class="nt"&gt;-c&lt;/span&gt; vector.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launch Quine by running the &lt;a href=""&gt;Kafka Pipeline&lt;/a&gt; recipe.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ java &lt;span class="nt"&gt;-jar&lt;/span&gt; quine-x.x.x &lt;span class="nt"&gt;-r&lt;/span&gt; kafka_pipeline.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And verify that we see nodes generated in Quine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Quine app web server available at http://0.0.0.0:8080

 | &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; INGEST-1 status is running and ingested 18
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations! 🎉 Your pipeline is operating!&lt;/p&gt;

&lt;h2&gt;
  
  
  Improving the Ingest Query
&lt;/h2&gt;

&lt;p&gt;The ingest query that I started with is pretty basic. Using &lt;code&gt;CALL recentNodes(1)&lt;/code&gt;, let's take a look at the newest node in the graph and see what the query produced.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ &lt;span class="c"&gt;## Get Latest Node&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; &lt;span class="s2"&gt;"POST"&lt;/span&gt; &lt;span class="s2"&gt;"http://0.0.0.0:8080/api/v1/query/cypher"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: text/plain'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"CALL recentNodes(1)"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
| jq &lt;span class="s1"&gt;'.'&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"columns"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="s2"&gt;"node"&lt;/span&gt;
  &lt;span class="o"&gt;]&lt;/span&gt;,
  &lt;span class="s2"&gt;"results"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;[&lt;/span&gt;
      &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"id"&lt;/span&gt;: &lt;span class="s2"&gt;"9fde7ef4-c5ec-35f1-ae5f-619bd9ab7d5c"&lt;/span&gt;,
        &lt;span class="s2"&gt;"labels"&lt;/span&gt;: &lt;span class="o"&gt;[]&lt;/span&gt;,
        &lt;span class="s2"&gt;"properties"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"line"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"appname"&lt;/span&gt;: &lt;span class="s2"&gt;"benefritz"&lt;/span&gt;,
            &lt;span class="s2"&gt;"facility"&lt;/span&gt;: &lt;span class="s2"&gt;"uucp"&lt;/span&gt;,
            &lt;span class="s2"&gt;"hostname"&lt;/span&gt;: &lt;span class="s2"&gt;"make.de"&lt;/span&gt;,
            &lt;span class="s2"&gt;"message"&lt;/span&gt;: &lt;span class="s2"&gt;"#hugops to everyone who has to deal with this"&lt;/span&gt;,
            &lt;span class="s2"&gt;"msgid"&lt;/span&gt;: &lt;span class="s2"&gt;"ID873"&lt;/span&gt;,
            &lt;span class="s2"&gt;"procid"&lt;/span&gt;: 871,
            &lt;span class="s2"&gt;"severity"&lt;/span&gt;: &lt;span class="s2"&gt;"emerg"&lt;/span&gt;,
            &lt;span class="s2"&gt;"timestamp"&lt;/span&gt;: &lt;span class="s2"&gt;"2022-06-14T19:58:16.463Z"&lt;/span&gt;,
            &lt;span class="s2"&gt;"version"&lt;/span&gt;: 1
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ingest query creates nodes using &lt;code&gt;idFrom()&lt;/code&gt;, populates them with the properties received from Kafka, and creates no relationships. We can make each node more useful by giving it a label and dropping the parameters that don't interest us. Additionally, using &lt;code&gt;reify.time()&lt;/code&gt;, I can associate the node with a &lt;code&gt;timeNode&lt;/code&gt; to stitch together events that occur across the network in time.&lt;/p&gt;
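&lt;p&gt;Applied to the ingest format above, those improvements might look something like this. This is a hedged sketch, not the final recipe: &lt;code&gt;LogLine&lt;/code&gt; is a label I chose, the property names come from the sample record, and I'm assuming &lt;code&gt;datetime()&lt;/code&gt; accepts the ISO timestamp directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    format:
      type: CypherJson
      query: |-
        MATCH (n)
        WHERE id(n) = idFrom($that)
        // Label the node and keep only the parameters we care about
        SET n: LogLine,
            n.hostname = $that.hostname,
            n.severity = $that.severity,
            n.message  = $that.message
        // Anchor the event to a time node
        WITH n
        CALL reify.time(datetime($that.timestamp)) YIELD node AS timeNode
        CREATE (n)-[:at_time]-&amp;gt;(timeNode)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;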

&lt;h2&gt;
  
  
  Analyzing the sample data
&lt;/h2&gt;

&lt;p&gt;Quine has a web-based graph explorer that really comes to life once you have a handle on the shape of the streaming data. But I'm starting from the beginning with a bare-bones recipe, and when I start pulling apart a new stream of data, I find that asking a few analytical questions through the API serves me well. &lt;/p&gt;

&lt;p&gt;I'll use the &lt;code&gt;/query/cypher&lt;/code&gt; endpoint to get a feel for the shape of the sample data streaming from Kafka. I don't recommend doing a full node scan on a mature streaming graph, but my streaming graph is still young and small. &lt;/p&gt;

&lt;p&gt;Using my REST API client of choice, I POST a Cypher query that returns counts for the parameters that interest me. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rlnkhj67hlmt8g1sou2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1rlnkhj67hlmt8g1sou2.png" alt="Paw Screenshot" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's a lot of JSON to review; let's take this over to a Jupyter Notebook to continue the analysis. My REST API client includes a Python snippet tool that makes it easy to move directly into code without starting from scratch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhpbap41lnwmh6zk6elx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhpbap41lnwmh6zk6elx.png" alt="Jupyter Screenshot" width="800" height="727"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://gist.github.com/maglietti/4fdbc681d703490811a8988e16b08b3f" rel="noopener noreferrer"&gt;Jupyter&lt;/a&gt;, within a few cells, I had the JSON response data loaded into a Pandas DataFrame and an easy-to-review textual summary of what the sample data contains. &lt;/p&gt;
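&lt;p&gt;The first few notebook cells boil down to something like the following. It's a condensed, standard-library sketch of the same counting the notebook does with Pandas, with &lt;code&gt;response_json&lt;/code&gt; standing in for the &lt;code&gt;/query/cypher&lt;/code&gt; response body.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from collections import Counter

# Stand-in for the /query/cypher response body (same shape as shown above)
response_json = """
{"columns": ["node"],
 "results": [
   [{"id": "a", "labels": [], "properties": {"line": {"severity": "emerg", "facility": "uucp"}}}],
   [{"id": "b", "labels": [], "properties": {"line": {"severity": "info", "facility": "uucp"}}}]
 ]}
"""

data = json.loads(response_json)
lines = [row[0]["properties"]["line"] for row in data["results"]]

# Count the values of each parameter of interest
for field in ("severity", "facility"):
    print(field, dict(Counter(line[field] for line in lines)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;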

&lt;p&gt;I let the pipeline run while I developed simple visualizations of the metrics. Right away, I could see that the sample data Vector produces is random and uniformly distributed across all of the parameters in the graph. After 15,000 log lines, the sample generator had exhausted all permutations of the data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdp7d2uwvloobhtksani.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdp7d2uwvloobhtksani.png" alt="Data Analysis" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions and Next Steps
&lt;/h2&gt;

&lt;p&gt;I learned a lot about streaming data while setting up this pipeline. Vector is a great tool that allows you to stream log files into Kafka for analysis. Add a Quine instance on the other side of Kafka, and you are able to perform streaming analytics inside a streaming graph using standing queries. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the same workflow to develop an understanding of streaming data that you do for data at rest&lt;/li&gt;
&lt;li&gt;Perform streaming analysis by connecting Quine to your Kafka cluster&lt;/li&gt;
&lt;li&gt;Use Cypher ingest queries to form the graph within a Quine ingest stream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quine is open source if you want to run this analysis for yourself. Download a precompiled version or build it from source (&lt;a href="https://github.com/thatdot/quine" rel="noopener noreferrer"&gt;Quine GitHub&lt;/a&gt;). I published the recipe that I developed at &lt;code&gt;https://quine.io/recipes&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Have a question, suggestion, or improvement? I welcome your feedback! Please drop into &lt;a href="https://quine-io.slack.com/" rel="noopener noreferrer"&gt;Quine Slack&lt;/a&gt; and let me know. I'm always happy to discuss Quine or answer questions. &lt;/p&gt;

</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>datamodel</category>
      <category>kafka</category>
    </item>
    <item>
      <title>Processing Machine Logs with Streaming Graph</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Thu, 16 Jun 2022 14:15:50 +0000</pubDate>
      <link>https://dev.to/maglietti/processing-machine-logs-with-streaming-graph-7b7</link>
      <guid>https://dev.to/maglietti/processing-machine-logs-with-streaming-graph-7b7</guid>
      <description>&lt;p&gt;You know we had to get here eventually. I'm looking into all of the ways that Quine can connect to and ingest streaming sources. Next up is my old friend, the log file.&lt;/p&gt;

&lt;p&gt;Log files are a structured stream of data that can be parsed with regular expressions. Log lines are emitted at all levels of an application. The challenge is that they are primarily islands of disconnected bits of the overall picture. Placed into a data pipeline, Quine allows us to combine different types of logs and use a standing query to match interesting patterns upstream of a log analytics solution like Splunk or Sumo Logic. &lt;/p&gt;

&lt;h2&gt;
  
  
  Log Line Structure
&lt;/h2&gt;

&lt;p&gt;Processing log files can quickly become as messy as the log files themselves. It's best to approach a log file like any other data source and take the time to understand the log line structure before asking any questions.&lt;/p&gt;

&lt;p&gt;Quine is an application that produces log lines, and just like many other applications, the structure of its log lines follows a pattern. The log line pattern is defined in &lt;a href="https://github.com/thatdot/quine/blob/main/quine-core/src/main/resources/logging.conf#L11" rel="noopener noreferrer"&gt;Scala&lt;/a&gt;, making it easy to understand what each log line contains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;pattern&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"%date %level [%mdc{akkaSource:-NotFromActor}] [%thread] %logger - %msg%n%ex"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quine Log RegEx
&lt;/h2&gt;

&lt;p&gt;Each Quine log line is assembled using the predefined pattern. This presents a perfect opportunity to use a regular expression to reverse the pattern and build a streaming graph. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: the regex in the example below uses log output from a Quine Enterprise cluster. Learn more about &lt;a href="https://www.thatdot.com/product/pricing" rel="noopener noreferrer"&gt;Quine Enterprise&lt;/a&gt; and other &lt;a href="https://www.thatdot.com/product/resources" rel="noopener noreferrer"&gt;products&lt;/a&gt; created by thatDot on our &lt;a href="https://www.thatdot.com/" rel="noopener noreferrer"&gt;website&lt;/a&gt;. The regular expression works for both Quine and Quine Enterprise. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I developed a &lt;a href="https://regex101.com/r/02qYsJ/3" rel="noopener noreferrer"&gt;regular expression&lt;/a&gt; that reverses the log line pattern and returns the log elements for use by the ingest query. I also published a recipe on Quine.io that uses the regular expression to parse Quine log lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="o"&gt;(^&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;}-&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}-&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}:&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}:&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;},&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;})&lt;/span&gt; &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="n"&gt;and&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt; &lt;span class="nf"&gt;string&lt;/span&gt; 
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;FATAL&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="no"&gt;ERROR&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="no"&gt;WARN&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="no"&gt;INFO&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="no"&gt;DEBUG&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;                  &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="n"&gt;level&lt;/span&gt;
&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;[(&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="no"&gt;S&lt;/span&gt;&lt;span class="o"&gt;*)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;                                      &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;actor&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;
&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;[(&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="no"&gt;S&lt;/span&gt;&lt;span class="o"&gt;*)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;                                      &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="no"&gt;S&lt;/span&gt;&lt;span class="o"&gt;*)&lt;/span&gt;                                          &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt;
&lt;span class="err"&gt;-&lt;/span&gt;                                              &lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="nc"&gt;the&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt; &lt;span class="nf"&gt;message&lt;/span&gt;
&lt;span class="o"&gt;((?:(?!^[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;}(?:-[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="o"&gt;]{&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}){&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;}(?:[^|&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;]+){&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;}).*(?:&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;)?)+)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quine Log Ingest Stream
&lt;/h2&gt;

&lt;p&gt;In my previous article, I connected to a &lt;code&gt;CSV&lt;/code&gt; file using the &lt;code&gt;CypherCsv&lt;/code&gt; &lt;code&gt;FileIngest&lt;/code&gt; format so that Quine could break the rows of data stored in the file back into columns. The &lt;code&gt;CypherLine&lt;/code&gt; &lt;code&gt;FileIngest&lt;/code&gt; format allows us to read each line into the &lt;code&gt;$that&lt;/code&gt; variable and process it through a Cypher query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ingestStreams&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FileIngest&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$in_file&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherLine&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;// Quine log pattern "%date %level [%mdc{akkaSource:-NotFromActor}] [%thread] %logger - %msg%n%ex"&lt;/span&gt;
        &lt;span class="s"&gt;WITH text.regexFirstMatch($that, "(^\\d{4}-\\d{2}-\\d{2} \\d{1,2}:\\d{2}:\\d{2},\\d{3}) (FATAL|ERROR|WARN|INFO|DEBUG) \\[(\\S*)\\] \\[(\\S*)\\] (\\S*) - (.*)") as r &lt;/span&gt;
        &lt;span class="s"&gt;WHERE r IS NOT NULL &lt;/span&gt;
        &lt;span class="s"&gt;// 0: whole matched line&lt;/span&gt;
        &lt;span class="s"&gt;// 1: date time string&lt;/span&gt;
        &lt;span class="s"&gt;// 2: log level&lt;/span&gt;
        &lt;span class="s"&gt;// 3: actor address. Might be inside of `akka.stream.Log(…)`&lt;/span&gt;
        &lt;span class="s"&gt;// 4: thread name&lt;/span&gt;
        &lt;span class="s"&gt;// 5: logging class&lt;/span&gt;
        &lt;span class="s"&gt;// 6: Message&lt;/span&gt;
        &lt;span class="s"&gt;WITH *, split(r[3], "/") as path, split(r[6], "(") as msgPts&lt;/span&gt;
        &lt;span class="s"&gt;WITH *, replace(COALESCE(split(path[2], "@")[-1], 'No host'),")","") as qh&lt;/span&gt;
        &lt;span class="s"&gt;MATCH (actor), (msg), (class), (host)&lt;/span&gt;
        &lt;span class="s"&gt;WHERE id(host)  = idFrom("host", qh)&lt;/span&gt;
          &lt;span class="s"&gt;AND id(actor) = idFrom("actor", r[3])&lt;/span&gt;
          &lt;span class="s"&gt;AND id(msg)   = idFrom("msg", r[0])&lt;/span&gt;
          &lt;span class="s"&gt;AND id(class) = idFrom("class", r[5])&lt;/span&gt;
        &lt;span class="s"&gt;SET host: Host, host.address = split(qh, ":")[0], host.port = split(qh, ":")[-1], host.host = qh,&lt;/span&gt;
            &lt;span class="s"&gt;actor: Actor, actor.address = r[3], actor.id = replace(path[-1],")",""), actor.shard = path[-2], actor.type = path[-3],&lt;/span&gt;
            &lt;span class="s"&gt;msg: Message, msg.msg = r[6], msg.type = split(msgPts[0], " ")[0], msg.level = r[2],&lt;/span&gt;
            &lt;span class="s"&gt;class: Class, class.class = r[5]&lt;/span&gt;
        &lt;span class="s"&gt;WITH * CALL reify.time(datetime({date: localdatetime(r[1], "yyyy-MM-dd HH:mm:ss,SSS")})) YIELD node AS time&lt;/span&gt;
        &lt;span class="s"&gt;CREATE (actor)-[:sent]-&amp;gt;(msg),&lt;/span&gt;
               &lt;span class="s"&gt;(actor)-[:of_class]-&amp;gt;(class),&lt;/span&gt;
               &lt;span class="s"&gt;(actor)-[:on_host]-&amp;gt;(host),&lt;/span&gt;
               &lt;span class="s"&gt;(msg)-[:at_time]-&amp;gt;(time)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ingest stream definition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reads Quine log lines from a file&lt;/li&gt;
&lt;li&gt;Parses each line with regex&lt;/li&gt;
&lt;li&gt;Creates host, actor, message, and class nodes&lt;/li&gt;
&lt;li&gt;Populates the node properties&lt;/li&gt;
&lt;li&gt;Relates the nodes in the streaming graph&lt;/li&gt;
&lt;li&gt;Anchors the message with a relationship to a time node from &lt;code&gt;reify.time&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Configuring Quine Logs
&lt;/h2&gt;

&lt;p&gt;Ok, let's run this recipe and see how it works. By default, the log level in Quine is set to &lt;code&gt;WARN&lt;/code&gt;. We can change the log level in the configuration or pass in a Java system property when we launch Quine. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Set the log level in Quine (or Quine Enterprise) via the &lt;code&gt;thatdot.loglevel&lt;/code&gt; configuration option. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Setting Log Level in Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start by getting your current Quine configuration. The easiest way is to start Quine and then &lt;code&gt;GET&lt;/code&gt; it via an API call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ curl &lt;span class="nt"&gt;--request&lt;/span&gt; GET &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; http://0.0.0.0:8080/api/v1/admin/config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; quine.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit the &lt;code&gt;quine.conf&lt;/code&gt; file and add &lt;code&gt;"thatdot":{"loglevel":"DEBUG"},&lt;/code&gt; before the &lt;code&gt;quine&lt;/code&gt; object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;❯ jq &lt;span class="s1"&gt;'.'&lt;/span&gt; quine.conf
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"thatdot"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"loglevel"&lt;/span&gt;: &lt;span class="s2"&gt;"DEBUG"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"quine"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"decline-sleep-when-access-within"&lt;/span&gt;: &lt;span class="s2"&gt;"0"&lt;/span&gt;,
    &lt;span class="s2"&gt;"decline-sleep-when-write-within"&lt;/span&gt;: &lt;span class="s2"&gt;"100ms"&lt;/span&gt;,
    &lt;span class="s2"&gt;"dump-config"&lt;/span&gt;: &lt;span class="nb"&gt;false&lt;/span&gt;,
    &lt;span class="s2"&gt;"edge-iteration"&lt;/span&gt;: &lt;span class="s2"&gt;"reverse-insertion"&lt;/span&gt;,
    &lt;span class="s2"&gt;"id"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"partitioned"&lt;/span&gt;: &lt;span class="nb"&gt;false&lt;/span&gt;,
      &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"uuid"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="s2"&gt;"in-memory-hard-node-limit"&lt;/span&gt;: 75000,
    &lt;span class="s2"&gt;"in-memory-soft-node-limit"&lt;/span&gt;: 10000,
    &lt;span class="s2"&gt;"labels-property"&lt;/span&gt;: &lt;span class="s2"&gt;"__LABEL"&lt;/span&gt;,
    &lt;span class="s2"&gt;"metrics-reporters"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
      &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"jmx"&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;,
    &lt;span class="s2"&gt;"persistence"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"effect-order"&lt;/span&gt;: &lt;span class="s2"&gt;"memory-first"&lt;/span&gt;,
      &lt;span class="s2"&gt;"journal-enabled"&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;,
      &lt;span class="s2"&gt;"snapshot-schedule"&lt;/span&gt;: &lt;span class="s2"&gt;"on-node-sleep"&lt;/span&gt;,
      &lt;span class="s2"&gt;"snapshot-singleton"&lt;/span&gt;: &lt;span class="nb"&gt;false&lt;/span&gt;,
      &lt;span class="s2"&gt;"standing-query-schedule"&lt;/span&gt;: &lt;span class="s2"&gt;"on-node-sleep"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="s2"&gt;"shard-count"&lt;/span&gt;: 4,
    &lt;span class="s2"&gt;"should-resume-ingest"&lt;/span&gt;: &lt;span class="nb"&gt;false&lt;/span&gt;,
    &lt;span class="s2"&gt;"store"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"create-parent-dir"&lt;/span&gt;: &lt;span class="nb"&gt;false&lt;/span&gt;,
      &lt;span class="s2"&gt;"filepath"&lt;/span&gt;: &lt;span class="s2"&gt;"quine.db"&lt;/span&gt;,
      &lt;span class="s2"&gt;"sync-all-writes"&lt;/span&gt;: &lt;span class="nb"&gt;false&lt;/span&gt;,
      &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"rocks-db"&lt;/span&gt;,
      &lt;span class="s2"&gt;"write-ahead-log"&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;,
    &lt;span class="s2"&gt;"timeout"&lt;/span&gt;: &lt;span class="s2"&gt;"2m"&lt;/span&gt;,
    &lt;span class="s2"&gt;"webserver"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"address"&lt;/span&gt;: &lt;span class="s2"&gt;"0.0.0.0"&lt;/span&gt;,
      &lt;span class="s2"&gt;"enabled"&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;,
      &lt;span class="s2"&gt;"port"&lt;/span&gt;: 8080
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, restart Quine and include the &lt;code&gt;config.file&lt;/code&gt; property.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-Dconfig&lt;/span&gt;.file&lt;span class="o"&gt;=&lt;/span&gt;quine.conf &lt;span class="nt"&gt;-jar&lt;/span&gt; quine-x.x.x.jar &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; quineLog.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;DEBUG&lt;/code&gt; level log lines will stream into the &lt;code&gt;quineLog.log&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Passing Log Level at Runtime&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another, more straightforward way to set the Quine log level is to pass a Java system property at launch. Here's how to start Quine with &lt;code&gt;DEBUG&lt;/code&gt; logging from the command line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-Dthatdot&lt;/span&gt;.loglevel&lt;span class="o"&gt;=&lt;/span&gt;DEBUG &lt;span class="nt"&gt;-jar&lt;/span&gt; quine-x.x.x.jar &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; quineLog.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;DEBUG&lt;/code&gt; level log lines will stream into the &lt;code&gt;quineLog.log&lt;/code&gt; file. &lt;/p&gt;

&lt;h2&gt;
  
  
  Ingesting Other Log Formats
&lt;/h2&gt;

&lt;p&gt;You can easily modify the regex I developed for Quine log lines above to parse similar log output, like that found in *nix system log files or other Java applications. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standard-ish Java Log Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Depending on the log level, Java applications emit a lot of information into their logs. This ingest stream handles log lines from most Java applications. Note that a single log entry sometimes spans multiple lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FileIngest&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$app_log&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherJson&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
      &lt;span class="s"&gt;WITH *, text.regexFirstMatch($that.message, '^(\\d{4}(?:-\\d{2}){2}(?:[^]\\r?\\n]+))\\s+?\\[(.+?)\\]\\s+?(\\S+?)\\s+(.+?)\\s+\\-\\s+((?:(?!^\\d{4}(?:-\\d{2}){2}(?:[^|\\r?\\n]+){3}).*(?:\\r?\\n)?)+)') AS r WHERE r IS NOT NULL&lt;/span&gt;
      &lt;span class="s"&gt;CREATE (log {&lt;/span&gt;
        &lt;span class="s"&gt;timestamp: r[1],&lt;/span&gt;
        &lt;span class="s"&gt;component: r[2],&lt;/span&gt;
        &lt;span class="s"&gt;level: r[3],&lt;/span&gt;
        &lt;span class="s"&gt;subprocess: r[4],&lt;/span&gt;
        &lt;span class="s"&gt;message: r[5],&lt;/span&gt;
        &lt;span class="s"&gt;type: 'log'&lt;/span&gt;
      &lt;span class="s"&gt;})&lt;/span&gt;
      &lt;span class="s"&gt;// Create hour/minute buckets per event&lt;/span&gt;
      &lt;span class="s"&gt;WITH * WHERE r[1] IS NOT NULL CALL reify.time(datetime({date: localdatetime(r[1], "yyyy-MM-dd HH:mm:ss,SSS")}), ["hour","minute"]) YIELD node AS timeNode&lt;/span&gt;
      &lt;span class="s"&gt;// Create edges for timenNodes&lt;/span&gt;
      &lt;span class="s"&gt;CREATE (log)-[:at]-&amp;gt;(timeNode)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ubuntu 22.04 LTS Syslog&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're developing distributed applications, you will most likely need a regular expression that parses the Ubuntu &lt;code&gt;/var/log/syslog&lt;/code&gt; file. First, edit &lt;code&gt;/etc/rsyslog.conf&lt;/code&gt; and comment out the traditional-timestamp line so that syslog emits the high-precision RFC 3339 &lt;code&gt;DateTime&lt;/code&gt; format that the regex below expects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;#
# Use traditional timestamp format.
# To enable high precision timestamps, comment out the following line.
#
&lt;/span&gt;&lt;span class="err"&gt;$ActionFileDefaultTemplate&lt;/span&gt; &lt;span class="err"&gt;RSYSLOG_TraditionalFileFormat&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The log line format is:&lt;br&gt;
&lt;code&gt;%timestamp:::date-rfc3339% %HOSTNAME% %app-name% %procid% %msgid% %msg%n&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FileIngest&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$syslog&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherLine&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
      &lt;span class="s"&gt;WITH text.regexFirstMatch($that, '^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d*?\\+\\d{2}:\\d{2}|Z).?\\s(.*?)(?=\\s).?\\s(\\S+)\\[(\\S+)\\]:\\s(.*)') AS s WHERE s IS NOT NULL&lt;/span&gt;
      &lt;span class="s"&gt;CREATE (syslog {&lt;/span&gt;
        &lt;span class="s"&gt;timestamp: s[1],&lt;/span&gt;
        &lt;span class="s"&gt;hostname: s[2],&lt;/span&gt;
        &lt;span class="s"&gt;app_name: s[3],&lt;/span&gt;
        &lt;span class="s"&gt;proc_id: s[4],&lt;/span&gt;
        &lt;span class="s"&gt;message: s[5],&lt;/span&gt;
        &lt;span class="s"&gt;type: 'syslog'&lt;/span&gt;
      &lt;span class="s"&gt;})&lt;/span&gt;
      &lt;span class="s"&gt;// Create hour/minute buckets per event&lt;/span&gt;
      &lt;span class="s"&gt;WITH * WHERE s[1] IS NOT NULL CALL reify.time(datetime({date: localdatetime(s[1], "yyyy-MM-dd'T'HH:mm:ss.SSSSSSz")}), ["hour","minute"]) YIELD node AS timeNode&lt;/span&gt;
      &lt;span class="s"&gt;// Create edges for timenNodes&lt;/span&gt;
      &lt;span class="s"&gt;CREATE (syslog)-[:at]-&amp;gt;(timeNode)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
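&lt;p&gt;The doubled backslashes are YAML escaping; the regex itself uses single backslashes. As a sanity check outside of Quine, here is a rough Python equivalent of the match, with the timezone alternation grouped so that either &lt;code&gt;+HH:MM&lt;/code&gt; or &lt;code&gt;Z&lt;/code&gt; can end the timestamp (the sample log line is invented for illustration):&lt;/p&gt;

```python
import re

# Python version of the recipe's syslog regex. The (?:...) grouping is an
# assumption on my part: it scopes the timezone alternation to the
# timestamp capture instead of the whole pattern.
SYSLOG_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d*?(?:\+\d{2}:\d{2}|Z)).?\s'
    r'(.*?)(?=\s).?\s(\S+)\[(\S+)\]:\s(.*)'
)

# Invented sample line in the rsyslog RFC3339 high-precision format.
line = ("2022-04-14T06:55:26.961757+00:00 myhost "
        "systemd[1]: Started Session 1 of user root.")

m = SYSLOG_RE.match(line)
timestamp, hostname, app_name, proc_id, message = m.groups()
```

&lt;p&gt;The five capture groups line up with &lt;code&gt;s[1]&lt;/code&gt; through &lt;code&gt;s[5]&lt;/code&gt; in the ingest query above.&lt;/p&gt;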



&lt;p&gt;&lt;strong&gt;MySQL Error Log&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're working on a web application that's been around for a while, it's probably sitting on top of a MySQL database. The traditional-format MySQL log messages have these &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/error-log-format.html" rel="noopener noreferrer"&gt;fields&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;time thread [label] [err_code] [subsystem] msg&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;code&gt;2022-04-14T06:55:26.961757Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Socket: /var/run/mysqld/mysqlx.sock&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Add these log entries to your streaming graph for analysis too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FileIngest&lt;/span&gt;
  &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$sqlerr_log&lt;/span&gt;
  &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherLine&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
      &lt;span class="s"&gt;WITH text.regexFirstMatch($that, '^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}\\.\\d{6}Z)\\s(\\d)\\s\\[(\\S+)\\]\\s\\[(\\S+)\\]\\s\\[(\\S+)\\]\\s(.*)') AS m WHERE m IS NOT NULL&lt;/span&gt;
      &lt;span class="s"&gt;CREATE (sqllog {&lt;/span&gt;
        &lt;span class="s"&gt;timestamp: m[1],&lt;/span&gt;
        &lt;span class="s"&gt;thread: m[2],&lt;/span&gt;
        &lt;span class="s"&gt;label: m[3],&lt;/span&gt;
        &lt;span class="s"&gt;err_code: m[4],&lt;/span&gt;
        &lt;span class="s"&gt;subsystem: m[5],&lt;/span&gt;
        &lt;span class="s"&gt;message: m[6],&lt;/span&gt;
        &lt;span class="s"&gt;type: 'sqllog'&lt;/span&gt;
      &lt;span class="s"&gt;})&lt;/span&gt;
      &lt;span class="s"&gt;// Create hour/minute buckets per event&lt;/span&gt;
      &lt;span class="s"&gt;WITH * WHERE m[1] IS NOT NULL CALL reify.time(datetime({date: localdatetime(m[1], "yyyy-MM-dd'T'HH:mm:ss.SSSSSSz")}), ["hour","minute"]) YIELD node AS timeNode&lt;/span&gt;
      &lt;span class="s"&gt;// Create edges for timenNodes&lt;/span&gt;
      &lt;span class="s"&gt;CREATE (sqllog)-[:at]-&amp;gt;(timeNode)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
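&lt;p&gt;Here is the same kind of sanity check for the MySQL regex, again as a Python sketch rather than Quine itself, using the example line from above. Note that the thread field is matched as a single digit, which fits this example:&lt;/p&gt;

```python
import re

# Python version of the recipe's MySQL error-log regex.
MYSQL_RE = re.compile(
    r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}Z)\s(\d)\s'
    r'\[(\S+)\]\s\[(\S+)\]\s\[(\S+)\]\s(.*)'
)

line = ("2022-04-14T06:55:26.961757Z 0 [System] [MY-011323] [Server] "
        "X Plugin ready for connections. Socket: /var/run/mysqld/mysqlx.sock")

m = MYSQL_RE.match(line)
timestamp, thread, label, err_code, subsystem, message = m.groups()
```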



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Streaming data comes from all kinds of sources. With Quine, it's easy to convert that data stream into a streaming graph. &lt;/p&gt;

&lt;p&gt;Quine is open source if you want to run this analysis for yourself. Download a precompiled version or build it yourself from the codebase (&lt;a href="https://github.com/thatdot/quine" rel="noopener noreferrer"&gt;Quine GitHub&lt;/a&gt;). I published the recipe that I developed at &lt;code&gt;https://quine.io/recipes&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Have a question, suggestion, or improvement? I welcome your feedback! Please drop in to &lt;a href="https://quine-io.slack.com/" rel="noopener noreferrer"&gt;Quine Slack&lt;/a&gt; and let me know. I'm always happy to discuss Quine or answer questions. &lt;/p&gt;




</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>datamodel</category>
    </item>
    <item>
      <title>Ingesting From Multiple Data Sources into Quine Streaming Graphs</title>
      <dc:creator>Michael Aglietti</dc:creator>
      <pubDate>Mon, 06 Jun 2022 18:12:57 +0000</pubDate>
      <link>https://dev.to/maglietti/ingesting-from-multiple-data-sources-into-quine-streaming-graphs-21m2</link>
      <guid>https://dev.to/maglietti/ingesting-from-multiple-data-sources-into-quine-streaming-graphs-21m2</guid>
      <description>&lt;p&gt;As part of the ongoing series in which I explore different ways to use the ingest stream to load data into Quine, I want to cover one of Quine's specialities: building a streaming graph from multiple data sources. This time, we'll work with &lt;code&gt;CSV&lt;/code&gt; data exported from IMDb to answer the question: &lt;em&gt;"Which actors have acted in and directed the same movie?"&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;CSV&lt;/code&gt; Files
&lt;/h2&gt;

&lt;p&gt;If someone says they have data, it's most likely in &lt;code&gt;CSV&lt;/code&gt; format or pretty darn close to it. (Or &lt;code&gt;JSON&lt;/code&gt;, but that is another blog post.) In our case, we have two files filled with data in &lt;code&gt;CSV&lt;/code&gt; format. Let's inspect what's inside.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File 1: &lt;code&gt;movieData.csv&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;movieData.csv&lt;/code&gt; file contains records for actors, movies, and the actor's relationship to the movie. Conveniently, each record type has a schema, flattened into rows during export.  &lt;/p&gt;

&lt;p&gt;Should we separate the data back into discrete files and then load them? No, we can set up separate ingest streams to act on each data type in the file. Effectively, we will separate the "jobs to do" into Cypher queries and stream in the data. &lt;/p&gt;
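&lt;p&gt;The idea is easy to see in miniature. This hypothetical Python sketch (invented rows, not the real IMDb data) routes records from one mixed file to per-type handlers, the same "jobs to do" separation the ingest streams perform with their &lt;code&gt;WHERE row.Entity = ...&lt;/code&gt; filters:&lt;/p&gt;

```python
import csv
import io

# Invented miniature of the mixed movieData.csv layout.
MIXED_CSV = """Entity,movieId,title,name
Movie,m1,Toy Story,
Person,,,Tom Hanks
Join,m1,,Tom Hanks
"""

# One bucket per record type, mirroring one ingest stream per type.
handled = {"Movie": [], "Person": [], "Join": []}

for row in csv.DictReader(io.StringIO(MIXED_CSV)):
    handled[row["Entity"]].append(row)
```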

&lt;p&gt;&lt;strong&gt;File 2: &lt;code&gt;ratingData.csv&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our second file, &lt;code&gt;ratingData.csv&lt;/code&gt;, is very straightforward. It contains 100,000 rows of movie ratings. Adding the &lt;code&gt;ratings&lt;/code&gt; data into our model completes our discovery phase for the supplied data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvqftrx67iebnb3kcqdb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvqftrx67iebnb3kcqdb.png" alt="Original implied schema of IMDB data." width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;CypherCsv&lt;/code&gt; Ingest Stream
&lt;/h2&gt;

&lt;p&gt;The Quine API documentation defines the &lt;a href="https://docs.quine.io/reference/rest-api.html#/schemas/com.thatdot.quine.routes.IngestStreamConfiguration" rel="noopener noreferrer"&gt;schema&lt;/a&gt; of the &lt;em&gt;File Ingest Format&lt;/em&gt; ingest stream for us. The schema is robust and accommodates &lt;code&gt;CSV&lt;/code&gt;, &lt;code&gt;JSON&lt;/code&gt;, and &lt;code&gt;line&lt;/code&gt; file types. Please take a moment to read through the documentation. Be sure to select &lt;code&gt;type: FileIngest&lt;/code&gt; -&amp;gt; &lt;code&gt;format: CypherCsv&lt;/code&gt; using the API documentation dropdowns. &lt;/p&gt;

&lt;p&gt;I define ingest streams to transform and load the movie data into Quine. Quine ingest streams behave independently and in parallel when processing files. This means that we can have multiple ingest streams operating on a single file. This is the case for the &lt;code&gt;movieData.csv&lt;/code&gt; file because there are several operations that we need to perform on multiple types of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Movie Rows
&lt;/h2&gt;

&lt;p&gt;The first ingest stream that I set up addresses the &lt;code&gt;Movie&lt;/code&gt; rows in the &lt;code&gt;movieData.csv&lt;/code&gt; file. There are 9125 movies in the data set. I create two kinds of nodes from each Movie row using an ingest query: &lt;code&gt;movie&lt;/code&gt; and &lt;code&gt;genre&lt;/code&gt;. I store all of the movie data as properties on the Movie node.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;$that&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;row.Entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Movie'&lt;/span&gt; &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Movie"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.movieId&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;m.tmdbId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.tmdbId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;m.imdbId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.imdbId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;m.imdbRating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toFloat&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.imdbRating&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;m.released&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.released&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;m.title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.title&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;m.year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toInteger&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.year&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;m.poster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.poster&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;m.runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toInteger&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.runtime&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;m.countries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.countries&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;"|"&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;m.imdbVotes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toInteger&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.imdbVotes&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;m.revenue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toInteger&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.revenue&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;m.plot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.plot&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;m.url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.url&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;m.budget&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toInteger&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.budget&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;m.languages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.languages&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;"|"&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;m.movieId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.movieId&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.genres&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;"|"&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;genres&lt;/span&gt;
&lt;span class="k"&gt;UNWIND&lt;/span&gt; &lt;span class="n"&gt;genres&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Genre"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;g.genre&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genre&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="py"&gt;g:&lt;/span&gt;&lt;span class="n"&gt;Genre&lt;/span&gt;
&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:IN_GENRE&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;g:&lt;/span&gt;&lt;span class="n"&gt;Genre&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quine passes each line to the ingest stream via the variable &lt;code&gt;$that&lt;/code&gt; to which I assign the identity &lt;code&gt;row&lt;/code&gt;. A &lt;code&gt;MATCH&lt;/code&gt; is made when the &lt;code&gt;row.Entity&lt;/code&gt; value is &lt;code&gt;Movie&lt;/code&gt; and a node &lt;code&gt;id&lt;/code&gt; is returned from the &lt;code&gt;idFrom()&lt;/code&gt; function. &lt;code&gt;SET&lt;/code&gt; is used to give the node a label and to store metadata as node properties. &lt;/p&gt;
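&lt;p&gt;&lt;code&gt;idFrom()&lt;/code&gt; is what makes the ingest idempotent: the same arguments always resolve to the same node ID, so re-ingesting a row updates the existing node instead of creating a duplicate. That property can be illustrated in Python like this (a sketch of the idea only, not Quine's actual hashing scheme):&lt;/p&gt;

```python
import hashlib
import uuid

def id_from(*parts):
    # Deterministic ID derived from the argument values. Illustrates the
    # content-addressed idea behind idFrom(); not Quine's algorithm.
    digest = hashlib.sha256("\x00".join(str(p) for p in parts).encode()).digest()
    return str(uuid.UUID(bytes=digest[:16]))

a = id_from("Movie", "m1")
b = id_from("Movie", "m1")  # same inputs, same ID
c = id_from("Movie", "m2")  # different inputs, different ID
```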

&lt;p&gt;Each movie row has a pipe &lt;code&gt;|&lt;/code&gt; delimited list of genres in the &lt;code&gt;genres&lt;/code&gt; column. I split the column value apart and create a Genre node for each genre in the list, labeled and containing the genre as a property. &lt;/p&gt;
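&lt;p&gt;The &lt;code&gt;split(coalesce(row.genres, ""), "|")&lt;/code&gt; pattern guards against a missing column value: &lt;code&gt;coalesce&lt;/code&gt; substitutes an empty string for a null before the split. In Python terms (a sketch, not Quine):&lt;/p&gt;

```python
def split_genres(genres):
    # Mirrors split(coalesce(row.genres, ""), "|"): treat a missing
    # value as "" so the split always succeeds.
    return (genres if genres is not None else "").split("|")

adventure = split_genres("Adventure|Animation|Comedy")
empty = split_genres(None)  # a null row value still splits cleanly
```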

&lt;p&gt;Finally, the &lt;code&gt;Movie&lt;/code&gt; node is related to the &lt;code&gt;Genre&lt;/code&gt; node with &lt;code&gt;MERGE&lt;/code&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Person Rows
&lt;/h2&gt;

&lt;p&gt;The second ingest stream addresses the &lt;code&gt;Person&lt;/code&gt; rows in the same way I did for the &lt;code&gt;Movie&lt;/code&gt; rows. There are 19047 person records in the &lt;code&gt;movieData.csv&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;$that&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;row.Entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Person"&lt;/span&gt; &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Person"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.tmdbId&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="py"&gt;p:&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.imdbId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.imdbId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.bornIn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.bornIn&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.name&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.bio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.bio&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.poster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.poster&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.url&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.born&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.born&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.died&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.died&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.tmdbId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.tmdbId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.born&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="n"&gt;row.born&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.born&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;"T00:00:00Z"&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p.died&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="n"&gt;row.died&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.died&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;"T00:00:00Z"&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ingest query in this ingest stream matches when the &lt;code&gt;row.Entity&lt;/code&gt; is &lt;code&gt;Person&lt;/code&gt;, creates a node using the &lt;code&gt;idFrom()&lt;/code&gt; function, and stores the Person metadata in node properties. &lt;/p&gt;
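&lt;p&gt;The &lt;code&gt;CASE&lt;/code&gt; expressions at the end of the query turn empty &lt;code&gt;born&lt;/code&gt;/&lt;code&gt;died&lt;/code&gt; strings into nulls instead of feeding invalid values to &lt;code&gt;datetime()&lt;/code&gt;. The same normalization in Python (a sketch that assumes &lt;code&gt;yyyy-mm-dd&lt;/code&gt; input; the sample date is invented):&lt;/p&gt;

```python
from datetime import datetime

def to_datetime_or_none(value):
    # Mirrors: CASE value WHEN "" THEN null ELSE datetime(value + "T00:00:00Z") END
    if value == "":
        return None
    return datetime.fromisoformat(value + "T00:00:00+00:00")

born = to_datetime_or_none("1956-07-09")   # invented sample date
died = to_datetime_or_none("")             # empty string becomes None
```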

&lt;h2&gt;
  
  
  Join Rows
&lt;/h2&gt;

&lt;p&gt;Looking at the rows that have &lt;code&gt;Join&lt;/code&gt; in the &lt;code&gt;Entity&lt;/code&gt; column leads me to believe that the data in this &lt;code&gt;CSV&lt;/code&gt; file originated from a relational database. There are two types of joins in the file, &lt;code&gt;Acted&lt;/code&gt; and &lt;code&gt;Directed&lt;/code&gt;. The ingest queries below process them. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acted In&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;$that&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;row.Entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Join"&lt;/span&gt; &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;row.Work&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Acting"&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Person"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.tmdbId&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Movie"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.movieId&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Role"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.tmdbId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.movieId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.role&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; 
  &lt;span class="n"&gt;r.role&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.role&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;r.movie&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.movieId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;r.tmdbId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.tmdbId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; 
  &lt;span class="py"&gt;r:&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;
&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;p:&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:PLAYED&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;r:&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:HAS_ROLE&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;p:&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:ACTED_IN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Acted join rows create relationships between Person, Role, and Movie nodes. Two paths are created from each Person node. The first path, &lt;code&gt;(p)-[:PLAYED]-&amp;gt;(r)&amp;lt;-[:HAS_ROLE]-(m)&lt;/code&gt;, relates actors (Person) to the roles they played, and those roles to the movies (Movie) that feature them. The second path directly relates an actor to the movies they acted in. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Directed&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;$that&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;row.Entity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Join"&lt;/span&gt; &lt;span class="ow"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;row.Work&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Directing"&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Person"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.tmdbId&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Movie"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.movieId&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;p:&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:DIRECTED&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Directed ingest query matches join rows and creates a path relating directors with the movies they have directed. &lt;/p&gt;

&lt;h2&gt;
  
  
  Ratings
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;$that&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Movie"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.movieId&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"User"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.userId&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rtg&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rtg&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;idFrom&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Rating"&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.movieId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.userId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row.rating&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;u.name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.name&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="py"&gt;u:&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;rtg.rating&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row.rating&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;rtg.timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;toInteger&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row.timestamp&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="py"&gt;rtg:&lt;/span&gt;&lt;span class="n"&gt;Rating&lt;/span&gt;
&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;u:&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:SUBMITTED&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;rtg:&lt;/span&gt;&lt;span class="n"&gt;Rating&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:HAS_RATING&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;MERGE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;u:&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:RATED&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last ingest query processes rows from the &lt;code&gt;ratingData.csv&lt;/code&gt; file. It creates the User and Rating nodes, then relates them to each other and to the matching Movie node. &lt;/p&gt;
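&lt;p&gt;Once the ratings ingest is running, you can spot-check the results with an ad hoc query. This is a sketch, run from the Exploration UI or the query API; the labels and relationship names follow the ingest queries above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;MATCH (u:User)-[:RATED]-&amp;gt;(m:Movie)
RETURN u.name AS user, m.title AS movie
LIMIT 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;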

&lt;h2&gt;
  
  
  Running the Recipe
&lt;/h2&gt;

&lt;p&gt;As my project progressed, I developed a &lt;a href="https://docs.quine.io//core-concepts/about-recipes.html#what-is-a-recipe" rel="noopener noreferrer"&gt;Quine recipe&lt;/a&gt; to load my &lt;code&gt;CSV&lt;/code&gt; files and perform the analysis. Running the recipe requires a couple of Quine options to pass in the locations of the &lt;code&gt;CSV&lt;/code&gt; files and an updated configuration setting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-Dquine&lt;/span&gt;.in-memory-soft-node-limit&lt;span class="o"&gt;=&lt;/span&gt;30000 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;-jar&lt;/span&gt; ../releases/latest &lt;span class="nt"&gt;-r&lt;/span&gt; movieData &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--recipe-value&lt;/span&gt; &lt;span class="nv"&gt;movie_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;movieData.csv &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--recipe-value&lt;/span&gt; &lt;span class="nv"&gt;rating_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ratingData.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ingesting the &lt;code&gt;CSV&lt;/code&gt; files results in the following data set stored in Quine:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0xn25prwet0rnlybht9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0xn25prwet0rnlybht9.png" alt="The data model in Quine for the IMDB data." width="800" height="669"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The orange Movie and Person nodes are created directly from the &lt;code&gt;Entity&lt;/code&gt; column in &lt;code&gt;movieData.csv&lt;/code&gt;. The User node comes from &lt;code&gt;ratingData.csv&lt;/code&gt;, and the green nodes are derived from data stored within an entity row. The &lt;code&gt;ActedDirected&lt;/code&gt; relationship is built by the standing query in the recipe. &lt;/p&gt;
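&lt;p&gt;A quick sanity check that the ingest produced the node counts you expect can be sketched as an ad hoc query (assuming aggregate &lt;code&gt;count&lt;/code&gt; is available in your Quine version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;MATCH (m:Movie)
RETURN count(m) AS movies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;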

&lt;h2&gt;
  
  
  Answering the Question
&lt;/h2&gt;

&lt;p&gt;Getting all of this data into Quine was only part of the challenge. Remember the question we were asked: &lt;em&gt;"which actors have acted in and directed the same movie?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Quine is a streaming graph; if we connected the ingest streams to a live streaming source rather than &lt;code&gt;CSV&lt;/code&gt; files, the &lt;a href="https://docs.quine.io/components/writing-standing-queries.html" rel="noopener noreferrer"&gt;standing query&lt;/a&gt; inside the recipe that I developed would answer the question for past movies as well as future ones.&lt;/p&gt;

&lt;p&gt;Our standing query matches the complete pattern for the situation where an actor (&lt;code&gt;Person&lt;/code&gt;) both &lt;code&gt;ACTED_IN&lt;/code&gt; and &lt;code&gt;DIRECTED&lt;/code&gt; the same movie.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;a:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:ACTED_IN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;p:&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:DIRECTED&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;m:&lt;/span&gt;&lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;movieId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m.title&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;Movie&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;personId&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p.name&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;Actor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the standing query completes a match, it processes the movie &lt;code&gt;id&lt;/code&gt; and person &lt;code&gt;id&lt;/code&gt; through the &lt;a href="https://docs.quine.io/components/standing-query-outputs.html" rel="noopener noreferrer"&gt;output&lt;/a&gt; query and actions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;standingQueries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cypher&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MultipleValues&lt;/span&gt;
      &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
        &lt;span class="s"&gt;MATCH (a:Movie)&amp;lt;-[:ACTED_IN]-(p:Person)-[:DIRECTED]-&amp;gt;(m:Movie) &lt;/span&gt;
        &lt;span class="s"&gt;WHERE id(a) = id(m)&lt;/span&gt;
        &lt;span class="s"&gt;RETURN id(m) as movieId, m.title as Movie, id(p) as personId, p.name as Actor&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;set-ActedDirected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CypherQuery&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
          &lt;span class="s"&gt;MATCH (m),(p)&lt;/span&gt;
          &lt;span class="s"&gt;WHERE strId(m) = $that.data.movie AND strId(p) = $that.data.person&lt;/span&gt;
          &lt;span class="s"&gt;MERGE (p:Person)-[:ActedDirected]-&amp;gt;(m:Movie)&lt;/span&gt;
      &lt;span class="na"&gt;log-actor-director&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WriteToFile&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ActorDirector.jsonl"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My standing query creates a new &lt;code&gt;ActedDirected&lt;/code&gt; relationship between the Person and Movie nodes, then logs the relationship. &lt;/p&gt;

&lt;p&gt;Four hundred ninety-one actors acted in and directed the same movie in our data set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Actor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Clint Eastwood"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"Movie"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unforgiven"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"movieId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4a6d64c8-9c90-3362-b443-4d2e7b2fb9d1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"personId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4638a820-3b68-3fc7-9fa7-341e876b701e"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
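&lt;p&gt;Because the standing query materializes the &lt;code&gt;ActedDirected&lt;/code&gt; relationship in the graph, the same answer can also be pulled back with an ad hoc query once ingest finishes. A sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;MATCH (p:Person)-[:ActedDirected]-&amp;gt;(m:Movie)
RETURN p.name AS Actor, m.title AS Movie
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;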



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Phew, we made it through! And we learned a lot along the way. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CSV&lt;/code&gt; data is streamed into Quine&lt;/li&gt;
&lt;li&gt;Quine can read from external files and streaming providers&lt;/li&gt;
&lt;li&gt;You can ingest multiple streams at once (movies and reviewers) and combine them into one streaming graph&lt;/li&gt;
&lt;li&gt;Always separate ingest queries using the jobs-to-be-done framework&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quine is open source, so you can run this analysis for yourself. Download a precompiled version or build it from source (&lt;a href="https://github.com/thatdot/quine" rel="noopener noreferrer"&gt;Quine GitHub&lt;/a&gt;). I published the recipe that I developed at &lt;code&gt;https://quine.io/recipes&lt;/code&gt;. The page has instructions for downloading the &lt;code&gt;CSV&lt;/code&gt; files and running the recipe. &lt;/p&gt;

&lt;p&gt;Have a question, suggestion, or improvement? I welcome your feedback! Please drop in to &lt;a href="https://quine-io.slack.com/" rel="noopener noreferrer"&gt;Quine Slack&lt;/a&gt; and let me know. I'm always happy to discuss Quine or answer questions. &lt;/p&gt;

</description>
      <category>quine</category>
      <category>streaminggraph</category>
      <category>datamodel</category>
    </item>
  </channel>
</rss>
