Harris Bashir

Making AI Data Flows Visible: Building an Open-Source Tool to Understand SaaS & LLM Data Risk

As AI and large language models (LLMs) become embedded into everyday SaaS tools, I’ve noticed a recurring issue across teams and organisations:

It’s surprisingly hard to answer a simple question:
Where does our data actually go when AI is involved?

Teams know they are using AI features (ticket summarisation, content generation, analytics assistants) but often lack a clear, end-to-end view of how data flows across systems once these features are enabled.

This article documents the motivation, design decisions, and lessons learned from building a small open-source, local-first tool focused on making these data flows visible.

The Problem I Kept Seeing in Practice

In many SMEs and startups, AI adoption happens incrementally:

  • A support tool adds AI ticket summarisation
  • A CRM introduces AI-driven insights
  • Marketing tools generate content using LLMs
  • Internal documents are analysed using AI assistants

Each feature feels isolated and low-risk.

However, over time:

  • Personal data is processed in more places
  • Third-party AI providers are introduced
  • Cross-border data flows increase
  • Assumptions replace documentation

What’s missing is not intent or care; it’s visibility.

Why Existing Approaches Fall Short (for SMEs)

Most existing solutions in this space are either:

  • Enterprise-grade compliance platforms
  • Security tools focused on enforcement
  • Vendor-specific dashboards
  • Static documentation or spreadsheets

For smaller teams, these approaches are often:

  • Too heavyweight
  • Too expensive
  • Too opaque
  • Too disconnected from how systems actually behave

I wanted to explore whether a simple, engineering-led approach could help teams reason about AI-related data risk without turning it into a legal or compliance exercise.

Design Principles

Before writing any code, I set a few constraints:

1: Visibility over judgement

The tool should surface potential risks, not declare violations or compliance status.

2: Deterministic and explainable

Risk identification should be based on explicit rules, not black-box AI decisions.

3: Local-first

Everything should run locally. No cloud services, no data collection.

4: Honest about uncertainty

Unknown or unclear data handling should be supported — and treated as a signal, not an error.

5: Narrow scope

This is not a full compliance platform. It focuses specifically on SaaS + LLM data flows.

What the Tool Does

At a high level, the tool:

1: Accepts simple JSON inputs describing:

  • SaaS tools in use
  • AI / LLM features enabled
  • Known (or unknown) data handling details

2: Builds a data flow model:

  • Source → Processing → Destination

3: Applies deterministic risk rules, such as:

  • Personal data sent to third-party LLM providers
  • Lack of anonymisation before LLM processing
  • Cross-border data flows
  • Unknown provider or data location

4: Generates:

  • A structured technical report
  • A plain-English executive summary

The goal is to create outputs that can be read and discussed by both technical and non-technical stakeholders.
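
To make this concrete, here is a minimal sketch of what such an input and rule pass could look like. The field names (llm_provider, data_categories, anonymised, data_location), the HelpDeskPro example, and the three rules are illustrative assumptions for this article, not the project's actual schema or rule set:

```python
import json
from dataclasses import dataclass

# Hypothetical JSON input describing one SaaS tool with an AI feature
# enabled. The schema below is an assumption for illustration only.
raw = """
{
  "tool": "HelpDeskPro",
  "feature": "ticket_summarisation",
  "llm_provider": "third_party",
  "data_categories": ["personal_data"],
  "anonymised": false,
  "data_location": "unknown"
}
"""
flow = json.loads(raw)

@dataclass
class Finding:
    rule: str
    severity: str
    detail: str

def apply_rules(flow: dict) -> list:
    """Each finding maps to one explicit, inspectable condition."""
    findings = []
    if "personal_data" in flow["data_categories"] and flow["llm_provider"] == "third_party":
        findings.append(Finding("personal-data-to-third-party-llm", "high",
                                "Personal data reaches a third-party LLM provider."))
    if not flow["anonymised"]:
        findings.append(Finding("no-anonymisation", "medium",
                                "Data is not anonymised before LLM processing."))
    if flow["data_location"] == "unknown":
        findings.append(Finding("unknown-data-location", "medium",
                                "Processing location is unknown; treated as a risk signal."))
    return findings

for f in apply_rules(flow):
    print(f"[{f.severity}] {f.rule}: {f.detail}")
```

Because each finding maps to exactly one explicit condition, the output is deterministic and can be traced back to a single rule, which is what makes the resulting report explainable to non-technical readers.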

Handling “Unknowns” Explicitly

One important design decision was how to handle incomplete information.

In real organisations, teams often don’t know:

  • Which LLM provider a feature uses
  • Whether data is anonymised
  • Where data is ultimately processed

Instead of treating this as a failure, the tool treats lack of transparency itself as a risk signal.

This mirrors how real-world governance works: uncertainty increases risk.
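
As a small illustration of that principle, here is a sketch in which "unknown" is a valid answer that carries its own risk weight. The weights, the helper function, and the "safe" location set are hypothetical, not taken from the project:

```python
# Hypothetical risk weights: an "unknown" answer is never an error;
# it simply contributes its own risk signal.
RISK_WEIGHT = {"known_safe": 0, "unknown": 1, "known_risky": 2}

def location_risk(data_location):
    """Map a (possibly missing) data-location answer to a risk weight."""
    if data_location is None or data_location.lower() == "unknown":
        return RISK_WEIGHT["unknown"]   # uncertainty itself is a signal
    if data_location in {"EU", "UK"}:   # illustrative "safe" set
        return RISK_WEIGHT["known_safe"]
    return RISK_WEIGHT["known_risky"]

print(location_risk(None))       # 1
print(location_risk("unknown"))  # 1
print(location_risk("EU"))       # 0
```

The key point is that a missing answer never raises an error: it produces a concrete, reportable signal instead.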

What This Tool Is (and Is Not)

To be clear, this project is:

  • Not legal advice
  • Not an automated compliance system
  • Not an audit or enforcement tool

It is a technical visibility tool designed to support better conversations, better documentation, and better decision-making around AI usage.

Why Open Source

I chose to keep this project open-source for a few reasons:

  • Transparency builds trust
  • Deterministic rules should be inspectable
  • Others can adapt or extend the logic
  • It encourages responsible AI practices

This is especially important in areas touching data protection and AI governance, where opacity often causes more harm than good.

Early Learnings

A few things became clear very quickly:

  • Teams are often surprised by how many AI touchpoints exist
  • Mapping flows forces valuable cross-team discussion
  • Even simple models surface non-obvious risks
  • Clarity reduces fear more than silence

The tool doesn’t “solve” compliance, but it helps teams see what they’re already doing.

What’s Next

The project is currently in pilot / exploratory mode.

The focus moving forward is:

  • Gathering feedback from early users
  • Improving clarity and explanations
  • Refining rule logic
  • Keeping the scope intentionally narrow

If you’re interested in exploring how AI features interact with your data flows, or you have thoughts on how this kind of visibility could be improved, feedback is very welcome.

Repository

The project is available here:
👉 https://github.com/harisraja123/LLM-SAAS-Data-Flow-Visibility

Closing Thought

AI adoption doesn’t fail because teams don’t care about data — it fails when systems become too complex to reason about.

Sometimes the most useful thing you can build isn’t another layer of automation, but a clearer picture of what’s already happening.
