DEV Community

Mukhtar
Turn Your File System Into a SQL Database

Stop grepping. Start querying.


The 2-Minute Version:

# Install and scan your Documents folder
brew tap surveilr/tap && brew install surveilr
surveilr admin init -d fs.db
surveilr ingest files -r ~/Documents -d fs.db

# Query it
surveilr shell -d fs.db
-- Find every PDF modified this month
SELECT file_path, size_bytes, last_modified
FROM files
WHERE extension = 'pdf' AND last_modified > date('now', 'start of month');

That's it. Your filesystem is now a SQLite database you can query forever.


The Problem with grep and find

You've been there:

find ~/Documents -name "*.pdf" -mtime -30 | xargs ls -lh

It works. Until you need to:

  • Track changes over time
  • Join file metadata with other data
  • Build complex filters
  • Share results in a structured format
  • Query the same data six months later

Then your one-liner becomes a 200-line Bash script nobody can maintain.

Here's a different approach:

Turn your filesystem into a permanent SQLite database you can keep querying with standard SQL.

No scripts. No parsing. Just tables.


What surveilr Does

surveilr scans directories and captures everything into SQLite:

  • File paths and names
  • Sizes and timestamps
  • Content hashes (SHA-256)
  • Extensions and MIME types
  • Permissions and ownership

The output is a standard .db file you can:

  • Query with any SQLite tool
  • Open in DB Browser, DBeaver, DataGrip
  • Read with Python pandas or R
  • Keep forever (SQLite files last decades)
  • Share with teammates (just copy the file)
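Because it's plain SQLite, even a few lines of standard-library Python can open the file. A minimal sketch — it builds a tiny stand-in database in memory so it runs anywhere, but with a real surveilr output you'd just pass the .db path to connect():

```python
import sqlite3

# Stand-in for a surveilr-produced .db file; with a real file,
# replace ":memory:" with its path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uniform_resource (uniform_resource_id TEXT, size_bytes INTEGER)"
)

# List every table, exactly as a GUI client's schema browser would.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)
```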

Step 1: Create a Database

surveilr admin init -d my-files.db

This creates an empty SQLite database. That's it. No configuration, no setup, no servers.

You can open this file in any SQLite client right now and see the schema.


Step 2: Scan a Directory

surveilr ingest files -r ~/Documents -d my-files.db

surveilr walks the directory tree and inserts metadata into two tables:

  • uniform_resource — Core metadata (hash, size, timestamps)
  • ur_ingest_session_fs_path_entry — File-specific data (paths, extensions)

Pro tip: Run the scan multiple times to track changes over time. Each scan creates a new session.


Step 3: Query Like You Mean It

Open the SQL shell:

surveilr shell -d my-files.db

Now you can run queries that would be painful in Bash.

Find all PDFs larger than 5MB

SELECT
    e.file_path_abs,
    e.file_basename,
    u.size_bytes / 1024 / 1024 AS size_mb
FROM uniform_resource u
JOIN ur_ingest_session_fs_path_entry e
  ON u.uniform_resource_id = e.uniform_resource_id
WHERE e.file_extn = 'pdf'
  AND u.size_bytes > 5242880
ORDER BY u.size_bytes DESC;

Track file modifications over time

Run multiple scans, then query across sessions:

SELECT
    e.file_basename,
    COUNT(DISTINCT u.content_digest) AS versions,
    MIN(u.last_modified_at) AS first_seen,
    MAX(u.last_modified_at) AS last_modified
FROM uniform_resource u
JOIN ur_ingest_session_fs_path_entry e
  ON u.uniform_resource_id = e.uniform_resource_id
WHERE e.file_basename = 'report.docx'
GROUP BY e.file_basename;

Find duplicate files by content hash

SELECT
    u.content_digest,
    COUNT(*) AS copies,
    SUM(u.size_bytes) / 1024 / 1024 AS total_mb_wasted
FROM uniform_resource u
JOIN ur_ingest_session_fs_path_entry e
  ON u.uniform_resource_id = e.uniform_resource_id
GROUP BY u.content_digest
HAVING copies > 1
ORDER BY total_mb_wasted DESC;

Find files that changed in the last 7 days

SELECT
    e.file_path_abs,
    u.last_modified_at,
    u.size_bytes
FROM uniform_resource u
JOIN ur_ingest_session_fs_path_entry e
  ON u.uniform_resource_id = e.uniform_resource_id
WHERE u.last_modified_at > datetime('now', '-7 days')
ORDER BY u.last_modified_at DESC;

These queries work on any SQLite database—forever.
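And "any SQLite tool" includes a script. Here's a sketch of running the duplicate-finding join from Python — the two tables are mocked with only the columns the queries above mention (the real surveilr schema has more), so the snippet is self-contained:

```python
import sqlite3

# Mock the two tables using the column names from the queries above.
# (Illustration only -- the real surveilr schema is richer.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uniform_resource (
    uniform_resource_id INTEGER PRIMARY KEY,
    content_digest TEXT,
    size_bytes INTEGER
);
CREATE TABLE ur_ingest_session_fs_path_entry (
    uniform_resource_id INTEGER,
    file_path_abs TEXT
);
INSERT INTO uniform_resource VALUES
    (1, 'abc', 1048576), (2, 'abc', 1048576), (3, 'def', 2048);
INSERT INTO ur_ingest_session_fs_path_entry VALUES
    (1, '/a/x.pdf'), (2, '/b/x-copy.pdf'), (3, '/c/y.txt');
""")

# Same shape as the duplicate-finding query: group by hash, keep groups > 1.
rows = conn.execute("""
    SELECT u.content_digest, COUNT(*) AS copies
    FROM uniform_resource u
    JOIN ur_ingest_session_fs_path_entry e
      ON u.uniform_resource_id = e.uniform_resource_id
    GROUP BY u.content_digest
    HAVING copies > 1
""").fetchall()
print(rows)
```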


Transform Structured Files Into Queryable Tables

surveilr doesn't just capture file metadata—it can extract and transform content from structured files.

This is where it gets interesting.

Transform CSV Files Into SQL Tables

Got CSV files? Turn them into queryable tables automatically:

surveilr orchestrate transform-csv -d my-files.db

Every CSV file you ingested becomes a SQL table. If you scanned a directory with sales-2024.csv, you can now query it:

SELECT * FROM sales_2024 WHERE revenue > 10000;

No manual schema definition. No import scripts. Just automatic table creation.
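Conceptually, the transform is the classic CSV-to-table load — header row becomes columns, data rows become inserts — which surveilr automates per file. A hand-rolled sketch of the same idea (the CSV content and table name are made up for illustration):

```python
import csv
import io
import sqlite3

# Stand-in CSV content; in a real run this would be a file surveilr ingested.
data = "region,revenue\nnorth,12000\nsouth,8000\n"
reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header row; it names the columns below

conn = sqlite3.connect(":memory:")
# Declared types give SQLite column affinity, so '12000' stores as an integer.
conn.execute("CREATE TABLE sales_2024 (region TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales_2024 VALUES (?, ?)", reader)

# Now it's queryable, just like the article's example.
rows = conn.execute(
    "SELECT region FROM sales_2024 WHERE revenue > 10000"
).fetchall()
```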

Extract Data from HTML Files with CSS Selectors

Ingested HTML files? Extract specific elements with CSS selectors:

# Extract all paragraphs inside divs
surveilr orchestrate transform-html -d my-files.db \
  --css-select "paragraphs:div > p"

# Extract all links
surveilr orchestrate transform-html -d my-files.db \
  --css-select "links:a[href]"

# Extract headers
surveilr orchestrate transform-html -d my-files.db \
  --css-select "headers:h1, h2, h3"

The extracted data is stored as JSON that you can query:

SELECT
    file_path,
    json_extract(transformed_content, '$.paragraphs') AS paragraphs
FROM uniform_resource
WHERE file_extn = 'html';
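You can also consume that JSON from a script instead of with SQL's json_extract. A sketch, assuming a transformed_content column holding JSON shaped like the query above (the row content here is mocked):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Mock a row shaped like the query above: transformed_content holds JSON text.
conn.execute(
    "CREATE TABLE uniform_resource (file_path TEXT, file_extn TEXT, transformed_content TEXT)"
)
conn.execute(
    "INSERT INTO uniform_resource VALUES (?, ?, ?)",
    ("page.html", "html", json.dumps({"paragraphs": ["Hello", "World"]})),
)

# Fetch the JSON column and decode it client-side.
row = conn.execute(
    "SELECT transformed_content FROM uniform_resource WHERE file_extn = 'html'"
).fetchone()
paragraphs = json.loads(row[0])["paragraphs"]
```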

Real-world use case: Scrape product data from saved HTML pages, extract structured content from documentation, parse scraped web pages.

Transform Markdown Files

Extract specific sections from Markdown files:

# Extract all headers
surveilr orchestrate transform-markdown -d my-files.db \
  --md-select "headers:# *"

# Extract code blocks
surveilr orchestrate transform-markdown -d my-files.db \
  --md-select "code:```*```"

Real-world use case: Index documentation, extract code examples from README files, parse technical notes.

Transform XML Files

XML files get automatically parsed into queryable JSON:

surveilr orchestrate transform-xml -d my-files.db

Then query XML content with SQLite's JSON functions:

SELECT
    file_path,
    json_extract(transformed_content, '$.root.item[0].name') AS item_name
FROM uniform_resource
WHERE file_extn = 'xml';
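The underlying idea is an XML-to-JSON flattening: elements become keys, repeated elements become arrays. A minimal stdlib sketch of that shape (the XML document and the exact nesting are assumptions for illustration, not surveilr's actual output format):

```python
import json
import xml.etree.ElementTree as ET

# Stand-in XML content; in a real run, this is a file surveilr ingested.
doc = "<root><item><name>widget</name></item></root>"
root = ET.fromstring(doc)

# Flatten into the JSON-ish shape the query above assumes:
# repeated <item> elements become an array under root.item.
transformed = {"root": {"item": [{"name": item.findtext("name")}
                                 for item in root.findall("item")]}}
item_name = transformed["root"]["item"][0]["name"]
print(json.dumps(transformed))
```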

Real-world use case: Parse configuration files, process exported data from tools, analyze XML logs.

Why This Matters

Most file indexing tools just capture metadata.

surveilr extracts and transforms content so you can query inside your files.

This means you can:

  • Query values within CSV files, not just the file existence
  • Search for specific HTML elements across hundreds of pages
  • Extract code snippets from all Markdown docs
  • Parse XML configs and query their structure

Example workflow:

# 1. Ingest a directory of mixed files
surveilr ingest files -r ~/project-docs -d docs.db

# 2. Transform CSVs into tables
surveilr orchestrate transform-csv -d docs.db

# 3. Extract headers from Markdown
surveilr orchestrate transform-markdown -d docs.db --md-select "headers:# *"

# 4. Query across file types
surveilr shell -d docs.db
-- Find all CSV files with revenue data
SELECT * FROM sales_data WHERE revenue > 50000;

-- Find all Markdown headers mentioning "API"
SELECT file_path, json_extract(transformed_content, '$.headers')
FROM uniform_resource
WHERE file_extn = 'md'
  AND json_extract(transformed_content, '$.headers') LIKE '%API%';

You just turned a directory of mixed files into a queryable knowledge base.


Watch Mode: Continuous Indexing

Want surveilr to automatically re-index when files change?

surveilr ingest files -r ~/Documents -d my-files.db --watch

surveilr monitors the directory and updates the database in real time.

Press Ctrl+C to stop.


Export Results

Need to share findings? Export as JSON, CSV, or Markdown:

surveilr shell -d my-files.db --cmd "
SELECT file_path, size_bytes, last_modified
FROM files
WHERE extension = 'pdf'
" --output json > pdfs.json

Or save queries as reusable SQL files:

cat > find-large-files.sql <<'EOF'
SELECT file_path, size_bytes / 1024 / 1024 AS size_mb
FROM files
WHERE size_bytes > 10485760
ORDER BY size_bytes DESC
LIMIT 20;
EOF

surveilr shell -d my-files.db find-large-files.sql
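If you'd rather script the export, the same JSON output is a few lines of Python: query rows, zip them with their column names, and dump. The table and columns below use the article's simplified files schema as a mock:

```python
import json
import sqlite3

# Mock the simplified 'files' schema used in the queries above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (file_path TEXT, size_bytes INTEGER)")
conn.execute("INSERT INTO files VALUES ('/docs/a.pdf', 2048)")

# Rows -> list of dicts -> JSON, the same shape a JSON export would give.
cols = ["file_path", "size_bytes"]
rows = [dict(zip(cols, r))
        for r in conn.execute("SELECT file_path, size_bytes FROM files")]
exported = json.dumps(rows)
```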

Why Not Just Use Scripts?

You could write a Python script to walk directories and output JSON.

But:

Scripts have no schema.
Next month, you won't remember what fields you captured or what format you used.

surveilr creates stable SQL tables.
The schema is self-documenting.

Scripts don't track history.
You get a point-in-time snapshot, then it's gone.

surveilr accumulates sessions over time.
Run it weekly and query across months of history.

Scripts don't compose.
How do you join your file scan with your email data?

surveilr stores everything in one database.
Cross-domain joins just work.


Forensic Investigation Workflows

Once your filesystem is queryable, you can ask surprising questions:

Find files that disappeared

-- Files that existed in session 1 but not session 2
SELECT file_path
FROM files WHERE session_id = 'session_1'
EXCEPT
SELECT file_path
FROM files WHERE session_id = 'session_2';
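The EXCEPT trick is worth internalizing: it's plain set difference between two SELECTs. A runnable sketch against a mocked two-session files table (the article's simplified schema, not surveilr's real one):

```python
import sqlite3

# Mock the simplified 'files' table with a session column, as in the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (file_path TEXT, session_id TEXT);
INSERT INTO files VALUES
    ('/a.txt', 'session_1'), ('/b.txt', 'session_1'),
    ('/a.txt', 'session_2');
""")

# Set difference: paths in session 1 that no longer appear in session 2.
gone = conn.execute("""
    SELECT file_path FROM files WHERE session_id = 'session_1'
    EXCEPT
    SELECT file_path FROM files WHERE session_id = 'session_2'
""").fetchall()
```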

Track when specific files were modified

SELECT file_basename, last_modified_at, size_bytes
FROM files
WHERE file_basename LIKE '%contract%'
ORDER BY last_modified_at DESC;

Find files created during a specific time window

-- Files created during a deployment window
SELECT file_path, last_modified
FROM files
WHERE last_modified BETWEEN '2024-05-10 14:00' AND '2024-05-10 15:00';

These queries are impractical with find and grep alone.


Real-World Uses

For Developers

  • Track which files changed before/after a bug appeared
  • Find orphaned build artifacts
  • Identify large files bloating repos
  • Build custom file search tools

For Operations

  • Audit log file growth over time
  • Track config file changes
  • Find stale backups
  • Monitor disk usage trends

For Security Teams

  • Establish filesystem baselines
  • Detect unauthorized file modifications
  • Track file access patterns
  • Identify anomalous file creation

For Compliance (Oh, By the Way)

  • Document where regulated data lives (HIPAA PHI, GDPR PII)
  • Prove file retention policies
  • Provide audit evidence
  • Respond to data access requests

But the real power is permanent queryability.


Open the Database in Other Tools

Because surveilr outputs standard SQLite, you can use any SQLite tool:

  • DB Browser for SQLite — Visual database inspector
  • DBeaver — Universal database client
  • Datasette — Instant web UI for your database
  • Python pandas — pd.read_sql_query(sql, conn)
  • DuckDB — Fast analytical queries on SQLite files
  • VS Code SQLite extensions — Query in your editor

You own the .db file. Use whatever tools you want.


Troubleshooting

"Permission denied" errors?
Run with appropriate permissions or scan a different directory.

Database getting too large?
Use --dry-run to preview:

surveilr ingest files -r ~/Documents --dry-run

Want to see what happened?
Add --stats to show ingestion summary:

surveilr ingest files -r ~/Documents -d my-files.db --stats

What's Next?

You just turned your filesystem into a queryable database.

Now combine it with other data sources, or explore more of what surveilr can do.


The Bottom Line

Your filesystem should be queryable, not just searchable.

find and grep are great for one-off questions.

But when you need:

  • Historical tracking
  • Complex queries
  • Structured output
  • Composability
  • Permanence

...you want SQL.

surveilr turns your filesystem into a SQLite database you own forever.

No scripts. No parsing. Just queries.


Ready to query your filesystem? Install surveilr and run your first scan.

Get Started →
