DEV Community

Cover image for De-identify SQL: Transforming Production Data
Vlad L
Vlad L

Posted on

De-identify SQL: Transforming Production Data

Perhaps you've noticed similar comments where you work:

Sales:

"This sales demo needs updated data to demonstrate a new feature."

QA:

"Regression testing wasn't possible because the database is stale."

Development:

"Our staging environment doesn't contain the necessary data."

With data-driven applications, our production environment often contains the answer to Sales, QA & Development needs but crossing our fingers and hoping it all works can be problematic.

Testing in Production?

I need a tool that transforms a production database, modifies identifiable fields and creates de-identified data to share with prospective customers, create a development database or populate QA environments.

De-identifying SQL: Creating Data

De-identify SQL is a Node.js command-line tool that modifies SQL statements using a customizable strategy file for each table. For any database column I can:

  • Replace data with a fabricated value.
  • Redact data with a constant value.
  • Generate new data with JavaScript.

De-identify SQL accepts a passed-in file or piped input and creates an output .sql or pipes data to another process. So I can wire this into an automated process or manually create data as needed.

Example Ahead: Creating Fake Users

To begin I need a JSON file for each table I want to modify. Each strategy file is named after the table it's applied to. Here's example input SQL for a user table:

CREATE TABLE `user` (
  `id` int NOT NULL DEFAULT '0',
  `email` varchar(256),
  `full_name` varchar(100),
  `birth_date` date,
  `gov_id` varchar(20),
PRIMARY KEY (`id`)
);

INSERT INTO `user` VALUES (1,'Florida.Gibson@hotmail.com','Florida Gibson','2000-12-10','111-22-3333'),(2,'godfreymo44@hotmail.com','Godfrey Mosciski','1980-01-01','333-22-1111'),(3,'Tommie1Quitzon@gmail.com','Tommie Quitzon','1960-06-07','444-55-6666');
Enter fullscreen mode Exit fullscreen mode

Any field I list in my strategy JSON file will be modified, all other fields will pass-through. To modify the above SQL I can use the following file:

{
  "columns": [
    {
      "columnKey": "full_name",
      "redactWith": "NAME REMOVED",
      "tracked": false
    },
    {
      "columnKey": "email",
      "redactWith": "internet.email",
      "tracked": false
    },
    {
      "columnKey": "birth_date",
      "redactWith": "createBirthday",
      "tracked": false
    },
    {
      "columnKey": "gov_id",
      "redactWith": "{{datatype.number({\"min\":100,\"max\":999})}}-{{datatype.number({\"min\":10,\"max\":99})}}-{{datatype.number({\"min\":1000,\"max\":9999})}}",
      "tracked": false
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Running De-identify SQL with the above input and JSON strategy:

cat user.sql | de-identify-sql > user-de-identified.sql
Enter fullscreen mode Exit fullscreen mode

I'll receive back (output generated randomly, your results may vary):

INSERT INTO `user` VALUES 
(1,'Titus1@yahoo.com','NAME REMOVED','1976-05-17','420-15-1747'),
(2,'Jeremy.Bechtelar@hotmail.com','NAME REMOVED','1961-10-17','327-49-5054'),
(3,'Emiliano.Grimes@gmail.com','NAME REMOVED','2002-11-10','106-57-2546');
Enter fullscreen mode Exit fullscreen mode

Under the Hood

I have a lot of flexibility when modifying SQL statements:

  • In the above example, full_name is redacted with a constant: NAME_REMOVED.
  • For the email column, we're calling a faker function to generate a new email address.
  • To generate a gov_id, we use a mustache-like template of faker functions.
  • Lastly, to create birth_date, we're calling custom JavaScript. This is coming from user.js:
const faker = require('faker')

module.exports = {
  createBirthday: () => {
    return faker.date.past(80).toISOString().slice(0, 10);
  }
}
Enter fullscreen mode Exit fullscreen mode

Preserving Data Relationships

I may also need to work with denormalized or repeating data which needs to be consistently processed to retained the relationships between that data while removing identifying details. Here I can change the tracked parameter in the JSON strategy file from false to true this will replace an repeated fields with the same fabricated values. For example, this order SQL:

INSERT INTO `order` VALUES 
(1000,'Dorthy.OKeefe@hotmail.com',12.99),
(1001,'Belle10@hotmail.com',123.45),
(1002,'Dorthy.OKeefe@hotmail.com',70.00),
(1003,'Garfield.Renner89@hotmail.com',542.01),
(1004,'Garfield.Renner89@hotmail.com',67.52);
Enter fullscreen mode Exit fullscreen mode

With this JSON strategy:

{
  "columns": [
    {
      "columnKey": "order_email",
      "redactWith": "internet.email",
      "tracked": true
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Could become:

INSERT INTO `order` VALUES 
(1000,'Demario26@yahoo.com',12.99),
(1001,'Audra_Stroman@yahoo.com',123.45),
(1002,'Demario26@yahoo.com',70),
(1003,'Rogelio_OHara1@yahoo.com',542.01),
(1004,'Rogelio_OHara1@yahoo.com',67.52);
Enter fullscreen mode Exit fullscreen mode

This allows us to remove identifying fields while preserving relationships between data elements that the application may require. This also makes our fabricated data appear more realistic: in certain tables we expect to see fields repeat.

Disclaimers

Top comments (0)