DEV Community

Mukhtar
Turn Your File System Into a SQL Database

Stop grepping. Start querying.


The 2-Minute Version:

# Install and scan your Documents folder
brew tap surveilr/tap && brew install surveilr
surveilr admin init -d fs.db
surveilr ingest files -r ~/Documents -d fs.db

# Query it
surveilr shell -d fs.db
-- Find every PDF modified this month
SELECT file_path, size_bytes, last_modified
FROM files
WHERE extension = 'pdf' AND last_modified > date('now', 'start of month');

That's it. Your filesystem is now a SQLite database you can query forever.


The Problem with grep and find

You've been there:

find ~/Documents -name "*.pdf" -mtime -30 | xargs ls -lh

It works. Until you need to:

  • Track changes over time
  • Join file metadata with other data
  • Build complex filters
  • Share results in a structured format
  • Query the same data six months later

Then your one-liner becomes a 200-line Bash script nobody can maintain.

Here's a different approach:

Turn your filesystem into a permanent SQLite database you can keep querying with standard SQL.

No scripts. No parsing. Just tables.


What surveilr Does

surveilr scans directories and captures everything into SQLite:

  • File paths and names
  • Sizes and timestamps
  • Content hashes (SHA-256)
  • Extensions and MIME types
  • Permissions and ownership

The output is a standard .db file you can:

  • Query with any SQLite tool
  • Open in DB Browser, DBeaver, DataGrip
  • Read with Python pandas or R
  • Keep forever (SQLite files last decades)
  • Share with teammates (just copy the file)
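Because it's plain SQLite, even a few lines of standard-library Python can open the file. A minimal sketch — it builds a tiny stand-in database in memory so it runs anywhere, but with a real surveilr output you'd just pass the .db path to connect():

```python
import sqlite3

# Stand-in for a surveilr-produced .db file; with a real file,
# replace ":memory:" with its path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE uniform_resource (uniform_resource_id TEXT, size_bytes INTEGER)"
)

# List every table, exactly as a GUI client's schema browser would.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)
```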

Step 1: Create a Database

surveilr admin init -d my-files.db

This creates an empty SQLite database. That's it. No configuration, no setup, no servers.

You can open this file in any SQLite client right now and see the schema.


Step 2: Scan a Directory

surveilr ingest files -r ~/Documents -d my-files.db

surveilr walks the directory tree and inserts metadata into two tables:

  • uniform_resource — Core metadata (hash, size, timestamps)
  • ur_ingest_session_fs_path_entry — File-specific data (paths, extensions)

Pro tip: Run the scan multiple times to track changes over time. Each scan creates a new session.


Step 3: Query Like You Mean It

Open the SQL shell:

surveilr shell -d my-files.db

Now you can run queries that would be painful in Bash.

Find all PDFs larger than 5MB

SELECT
    e.file_path_abs,
    e.file_basename,
    u.size_bytes / 1024 / 1024 AS size_mb
FROM uniform_resource u
JOIN ur_ingest_session_fs_path_entry e
  ON u.uniform_resource_id = e.uniform_resource_id
WHERE e.file_extn = 'pdf'
  AND u.size_bytes > 5242880
ORDER BY u.size_bytes DESC;

Track file modifications over time

Run multiple scans, then query across sessions:

SELECT
    e.file_basename,
    COUNT(DISTINCT u.content_digest) AS versions,
    MIN(u.last_modified_at) AS first_seen,
    MAX(u.last_modified_at) AS last_modified
FROM uniform_resource u
JOIN ur_ingest_session_fs_path_entry e
  ON u.uniform_resource_id = e.uniform_resource_id
WHERE e.file_basename = 'report.docx'
GROUP BY e.file_basename;

Find duplicate files by content hash

SELECT
    u.content_digest,
    COUNT(*) AS copies,
    SUM(u.size_bytes) / 1024 / 1024 AS total_mb_wasted
FROM uniform_resource u
JOIN ur_ingest_session_fs_path_entry e
  ON u.uniform_resource_id = e.uniform_resource_id
GROUP BY u.content_digest
HAVING copies > 1
ORDER BY total_mb_wasted DESC;

Find files that changed in the last 7 days

SELECT
    e.file_path_abs,
    u.last_modified_at,
    u.size_bytes
FROM uniform_resource u
JOIN ur_ingest_session_fs_path_entry e
  ON u.uniform_resource_id = e.uniform_resource_id
WHERE u.last_modified_at > datetime('now', '-7 days')
ORDER BY u.last_modified_at DESC;

These queries work on any SQLite database—forever.
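And "any SQLite tool" includes a script. Here's a sketch of running the duplicate-finding join from Python — the two tables are mocked with only the columns the queries above mention (the real surveilr schema has more), so the snippet is self-contained:

```python
import sqlite3

# Mock the two tables using the column names from the queries above.
# (Illustration only -- the real surveilr schema is richer.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE uniform_resource (
    uniform_resource_id INTEGER PRIMARY KEY,
    content_digest TEXT,
    size_bytes INTEGER
);
CREATE TABLE ur_ingest_session_fs_path_entry (
    uniform_resource_id INTEGER,
    file_path_abs TEXT
);
INSERT INTO uniform_resource VALUES
    (1, 'abc', 1048576), (2, 'abc', 1048576), (3, 'def', 2048);
INSERT INTO ur_ingest_session_fs_path_entry VALUES
    (1, '/a/x.pdf'), (2, '/b/x-copy.pdf'), (3, '/c/y.txt');
""")

# Same shape as the duplicate-finding query: group by hash, keep groups > 1.
rows = conn.execute("""
    SELECT u.content_digest, COUNT(*) AS copies
    FROM uniform_resource u
    JOIN ur_ingest_session_fs_path_entry e
      ON u.uniform_resource_id = e.uniform_resource_id
    GROUP BY u.content_digest
    HAVING copies > 1
""").fetchall()
print(rows)
```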


Transform Structured Files Into Queryable Tables

surveilr doesn't just capture file metadata—it can extract and transform content from structured files.

This is where it gets interesting.

Transform CSV Files Into SQL Tables

Got CSV files? Turn them into queryable tables automatically:

surveilr orchestrate transform-csv -d my-files.db

Every CSV file you ingested becomes a SQL table. If you scanned a directory with sales-2024.csv, you can now query it:

SELECT * FROM sales_2024 WHERE revenue > 10000;

No manual schema definition. No import scripts. Just automatic table creation.
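Conceptually, the transform is the classic CSV-to-table load — header row becomes columns, data rows become inserts — which surveilr automates per file. A hand-rolled sketch of the same idea (the CSV content and table name are made up for illustration):

```python
import csv
import io
import sqlite3

# Stand-in CSV content; in a real run this would be a file surveilr ingested.
data = "region,revenue\nnorth,12000\nsouth,8000\n"
reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header row; it names the columns below

conn = sqlite3.connect(":memory:")
# Declared types give SQLite column affinity, so '12000' stores as an integer.
conn.execute("CREATE TABLE sales_2024 (region TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales_2024 VALUES (?, ?)", reader)

# Now it's queryable, just like the article's example.
rows = conn.execute(
    "SELECT region FROM sales_2024 WHERE revenue > 10000"
).fetchall()
```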

Extract Data from HTML Files with CSS Selectors

Ingested HTML files? Extract specific elements with CSS selectors:

# Extract all paragraphs inside divs
surveilr orchestrate transform-html -d my-files.db \
  --css-select "paragraphs:div > p"

# Extract all links
surveilr orchestrate transform-html -d my-files.db \
  --css-select "links:a[href]"

# Extract headers
surveilr orchestrate transform-html -d my-files.db \
  --css-select "headers:h1, h2, h3"

The extracted data is stored as JSON that you can query:

SELECT
    file_path,
    json_extract(transformed_content, '$.paragraphs') AS paragraphs
FROM uniform_resource
WHERE file_extn = 'html';
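You can also consume that JSON from a script instead of with SQL's json_extract. A sketch, assuming a transformed_content column holding JSON shaped like the query above (the row content here is mocked):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Mock a row shaped like the query above: transformed_content holds JSON text.
conn.execute(
    "CREATE TABLE uniform_resource (file_path TEXT, file_extn TEXT, transformed_content TEXT)"
)
conn.execute(
    "INSERT INTO uniform_resource VALUES (?, ?, ?)",
    ("page.html", "html", json.dumps({"paragraphs": ["Hello", "World"]})),
)

# Fetch the JSON column and decode it client-side.
row = conn.execute(
    "SELECT transformed_content FROM uniform_resource WHERE file_extn = 'html'"
).fetchone()
paragraphs = json.loads(row[0])["paragraphs"]
```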

Real-world use case: Scrape product data from saved HTML pages, extract structured content from documentation, parse scraped web pages.

Transform Markdown Files

Extract specific sections from Markdown files:

# Extract all headers
surveilr orchestrate transform-markdown -d my-files.db \
  --md-select "headers:# *"

# Extract code blocks
surveilr orchestrate transform-markdown -d my-files.db \
  --md-select "code:```*```"

Real-world use case: Index documentation, extract code examples from README files, parse technical notes.

Transform XML Files

XML files get automatically parsed into queryable JSON:

surveilr orchestrate transform-xml -d my-files.db

Then query XML content with SQLite's JSON functions:

SELECT
    file_path,
    json_extract(transformed_content, '$.root.item[0].name') AS item_name
FROM uniform_resource
WHERE file_extn = 'xml';
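The underlying idea is an XML-to-JSON flattening: elements become keys, repeated elements become arrays. A minimal stdlib sketch of that shape (the XML document and the exact nesting are assumptions for illustration, not surveilr's actual output format):

```python
import json
import xml.etree.ElementTree as ET

# Stand-in XML content; in a real run, this is a file surveilr ingested.
doc = "<root><item><name>widget</name></item></root>"
root = ET.fromstring(doc)

# Flatten into the JSON-ish shape the query above assumes:
# repeated <item> elements become an array under root.item.
transformed = {"root": {"item": [{"name": item.findtext("name")}
                                 for item in root.findall("item")]}}
item_name = transformed["root"]["item"][0]["name"]
print(json.dumps(transformed))
```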

Real-world use case: Parse configuration files, process exported data from tools, analyze XML logs.

Why This Matters

Most file indexing tools just capture metadata.

surveilr extracts and transforms content so you can query inside your files.

This means you can:

  • Query values within CSV files, not just the file existence
  • Search for specific HTML elements across hundreds of pages
  • Extract code snippets from all Markdown docs
  • Parse XML configs and query their structure

Example workflow:

# 1. Ingest a directory of mixed files
surveilr ingest files -r ~/project-docs -d docs.db

# 2. Transform CSVs into tables
surveilr orchestrate transform-csv -d docs.db

# 3. Extract headers from Markdown
surveilr orchestrate transform-markdown -d docs.db --md-select "headers:# *"

# 4. Query across file types
surveilr shell -d docs.db
-- Find all CSV files with revenue data
SELECT * FROM sales_data WHERE revenue > 50000;

-- Find all Markdown headers mentioning "API"
SELECT file_path, json_extract(transformed_content, '$.headers')
FROM uniform_resource
WHERE file_extn = 'md'
  AND json_extract(transformed_content, '$.headers') LIKE '%API%';

You just turned a directory of mixed files into a queryable knowledge base.


Watch Mode: Continuous Indexing

Want surveilr to automatically re-index when files change?

surveilr ingest files -r ~/Documents -d my-files.db --watch

surveilr monitors the directory and updates the database in real time.

Press Ctrl+C to stop.


Export Results

Need to share findings? Export as JSON, CSV, or Markdown:

surveilr shell -d my-files.db --cmd "
SELECT file_path, size_bytes, last_modified
FROM files
WHERE extension = 'pdf'
" --output json > pdfs.json

Or save queries as reusable SQL files:

cat > find-large-files.sql <<'EOF'
SELECT file_path, size_bytes / 1024 / 1024 AS size_mb
FROM files
WHERE size_bytes > 10485760
ORDER BY size_bytes DESC
LIMIT 20;
EOF

surveilr shell -d my-files.db find-large-files.sql
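If you'd rather script the export, the same JSON output is a few lines of Python: query rows, zip them with their column names, and dump. The table and columns below use the article's simplified files schema as a mock:

```python
import json
import sqlite3

# Mock the simplified 'files' schema used in the queries above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (file_path TEXT, size_bytes INTEGER)")
conn.execute("INSERT INTO files VALUES ('/docs/a.pdf', 2048)")

# Rows -> list of dicts -> JSON, the same shape a JSON export would give.
cols = ["file_path", "size_bytes"]
rows = [dict(zip(cols, r))
        for r in conn.execute("SELECT file_path, size_bytes FROM files")]
exported = json.dumps(rows)
```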

Why Not Just Use Scripts?

You could write a Python script to walk directories and output JSON.

But:

Scripts have no schema.
Next month, you won't remember what fields you captured or what format you used.

surveilr creates stable SQL tables.
The schema is self-documenting.

Scripts don't track history.
You get a point-in-time snapshot, then it's gone.

surveilr accumulates sessions over time.
Run it weekly and query across months of history.

Scripts don't compose.
How do you join your file scan with your email data?

surveilr stores everything in one database.
Cross-domain joins just work.


Forensic Investigation Workflows

Once your filesystem is queryable, you can ask surprising questions:

Find files that disappeared

-- Files that existed in session 1 but not session 2
SELECT file_path
FROM files WHERE session_id = 'session_1'
EXCEPT
SELECT file_path
FROM files WHERE session_id = 'session_2';
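The EXCEPT trick is worth internalizing: it's plain set difference between two SELECTs. A runnable sketch against a mocked two-session files table (the article's simplified schema, not surveilr's real one):

```python
import sqlite3

# Mock the simplified 'files' table with a session column, as in the query above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (file_path TEXT, session_id TEXT);
INSERT INTO files VALUES
    ('/a.txt', 'session_1'), ('/b.txt', 'session_1'),
    ('/a.txt', 'session_2');
""")

# Set difference: paths in session 1 that no longer appear in session 2.
gone = conn.execute("""
    SELECT file_path FROM files WHERE session_id = 'session_1'
    EXCEPT
    SELECT file_path FROM files WHERE session_id = 'session_2'
""").fetchall()
```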

Track when specific files were modified

SELECT file_basename, last_modified_at, size_bytes
FROM files
WHERE file_basename LIKE '%contract%'
ORDER BY last_modified_at DESC;

Find files created during a specific time window

-- Files created during a deployment window
SELECT file_path, last_modified
FROM files
WHERE last_modified BETWEEN '2024-05-10 14:00' AND '2024-05-10 15:00';

These queries are impractical with find and grep alone.


Real-World Uses

For Developers

  • Track which files changed before/after a bug appeared
  • Find orphaned build artifacts
  • Identify large files bloating repos
  • Build custom file search tools

For Operations

  • Audit log file growth over time
  • Track config file changes
  • Find stale backups
  • Monitor disk usage trends

For Security Teams

  • Establish filesystem baselines
  • Detect unauthorized file modifications
  • Track file access patterns
  • Identify anomalous file creation

For Compliance (Oh, By the Way)

  • Document where regulated data lives (HIPAA PHI, GDPR PII)
  • Prove file retention policies
  • Provide audit evidence
  • Respond to data access requests

But the real power is permanent queryability.


Open the Database in Other Tools

Because surveilr outputs standard SQLite, you can use any SQLite tool:

  • DB Browser for SQLite — Visual database inspector
  • DBeaver — Universal database client
  • Datasette — Instant web UI for your database
  • Python pandas — pd.read_sql_query(sql, conn)
  • DuckDB — Fast analytical queries on SQLite files
  • VS Code SQLite extensions — Query in your editor

You own the .db file. Use whatever tools you want.


Troubleshooting

"Permission denied" errors?
Run with appropriate permissions or scan a different directory.

Database getting too large?
Use --dry-run to preview:

surveilr ingest files -r ~/Documents --dry-run

Want to see what happened?
Add --stats to show ingestion summary:

surveilr ingest files -r ~/Documents -d my-files.db --stats

What's Next?

You just turned your filesystem into a queryable database.

Now combine it with other data sources, or explore more of what surveilr can do.


The Bottom Line

Your filesystem should be queryable, not just searchable.

find and grep are great for one-off questions.

But when you need:

  • Historical tracking
  • Complex queries
  • Structured output
  • Composability
  • Permanence

...you want SQL.

surveilr turns your filesystem into a SQLite database you own forever.

No scripts. No parsing. Just queries.


Ready to query your filesystem? Install surveilr and run your first scan.

Get Started →
