Blaine Elliott

Posted on Apr 12 • Originally published at blog.anomalyarmor.ai

Why We Open-Sourced Our Database Query Layer

#dataengineering #dataquality

When you connect a data quality tool to your database, you're trusting that tool with access to your data. Most tools ask you to just trust them. We decided to show our work.

Every query AnomalyArmor runs against your database goes through our Query Security Gateway. The gateway is open source. You can read every line of code. You can verify exactly what we're allowed to do.

GitHub: https://github.com/anomalyarmor/anomalyarmor-query-gateway
PyPI: https://pypi.org/project/anomalyarmor-query-gateway/

The trust problem

Data quality tools need database access to do their job. Schema discovery requires reading metadata. Freshness monitoring requires checking timestamps. Anomaly detection requires looking at distributions.

But customers have legitimate concerns. What queries are you actually running? Could you read our customer data? How do we know you're not doing more than you say?

"Trust us" isn't a good enough answer. Especially when the data is sensitive.

Three access levels

We built the gateway around three access levels. You choose how much access to grant based on your security requirements.

Schema Only: The most restrictive. We can query metadata tables (information_schema, pg_catalog, system tables) but nothing else. You get schema discovery and basic tagging. No access to actual table data.

Aggregates: We can run aggregate functions: COUNT, SUM, AVG, MIN, MAX. No raw values. This enables freshness monitoring (checking MAX(updated_at)), row counts, null rates, and statistical distributions. We never see individual records.

Full: Unrestricted read access. This enables improved tagging and intelligence features that sample values to detect patterns. For example, detecting that a column named "data" actually contains Social Security numbers.

Most customers use Aggregates. You get the monitoring features without exposing raw data.

How it works

The gateway sits between AnomalyArmor and your database. Every query passes through it. The gateway parses the SQL, validates it against your access level, and blocks anything that doesn't comply.

Your Query → Gateway → Parser → Validator → Database
                          ↓
                    Audit Logger

If you've set Aggregates access and something tries to run SELECT email FROM users, the gateway blocks it. Doesn't matter if it's a bug in our code or a misconfigured feature. The query never reaches your database.

Every query attempt is logged. You can audit what we ran and what we tried to run.

Why open source

We published the gateway code for a few reasons.

First, transparency. You shouldn't have to take our word for how the access levels work. Read the code. The validator logic is right there. If we say "aggregates mode only allows aggregate functions," you can verify that claim yourself.

Second, security review. Open source means security researchers can audit it. If there's a bypass or a flaw in our logic, someone can find it and report it. Closed source security is security through obscurity.

Third, trust through verification. When your security team asks "how does this tool handle database access," you can point them to a GitHub repo instead of a marketing page.

Defense in depth

We don't just rely on the gateway. There are two layers of enforcement.

The first layer checks features. Before any SQL is constructed, we check if your access level permits that feature. Trying to run freshness monitoring with Schema Only access? Blocked at the feature layer. You never even see a query.

The second layer is the gateway. It parses and validates the actual SQL. This catches anything that somehow bypasses the feature layer. If a bug in our code constructs a query it shouldn't, the gateway stops it.

Both layers have to allow the operation. If either blocks, nothing runs.

What this means for you

When you connect AnomalyArmor to your database, you choose your access level. The default is Full, for maximum monitoring capability. But you can restrict it at any time.

Some customers use Schema Only on production databases and Full on staging. Some use Aggregates everywhere. You can set a company-wide default and override it per data source.

You can change levels whenever you want. Downgrading disables features that require higher access. Upgrading enables them. No migration, no reconfiguration.

The features at each level

Schema Only gets you:

Schema discovery (tables, columns, types)
Basic tagging (inferred from column names and types)
Basic intelligence (metadata-based insights)

Aggregates adds:

Row counts
Freshness monitoring
Null and completeness checks
Cardinality (distinct counts)
Numeric statistics (min, max, average)

Full adds:

Improved tagging (samples values to detect patterns)
Improved intelligence (value-based insights)

Most data quality monitoring works fine with Aggregates. Full is for when you want the AI to analyze actual values to find things like PII in unexpected columns.

Check it yourself

The gateway code is at https://github.com/anomalyarmor/anomalyarmor-query-gateway. It's Apache 2.0 licensed. Read it, fork it, run the tests.

If you find a security issue, email security@anomalyarmor.ai. We take reports seriously.

This is how we think data tools should work. Not "trust us," but "verify us."

Ready to try data observability with transparent security? Sign up for AnomalyArmor and choose your access level when you connect your database.

DEV Community