
Auto Generate Lab Measurement Dataset



View Notebook

A typical data source found in most industrial workplaces is lab measurements. This type of data usually has a well-defined structure and is a mixture of categorical and numeric data types. Measurement ranges and file formats can vary greatly between use cases. This article covers how to generate a simple dataset that can be customized to simulate a wide range of use cases.

Contents

  1. Sample Data Structure
  2. Packages
    • Faker
    • xml.etree.ElementTree
    • Pandas
  3. Generate Dataset
  4. Convert Dataset to XML
  5. Convert Dataset to Pandas DataFrame
  6. Save Dataset Files
    • XML
    • CSV
    • Parquet

Sample Data Structure

Field            DataType     Min    Max
machine_id       string
test_id          uuid
technician       string
test_routine     categorical
batched          categorical
loc_1            dictionary
loc_1_x_offset   decimal      -15    15
loc_1_y_offset   decimal      -1     1
loc_1_z_offset   int          2500   5000
loc_2            dictionary
loc_2_x_offset   decimal      -15    15
loc_2_y_offset   decimal      -1     1
loc_2_z_offset   int          2500   5000
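
To make the nested shape concrete, a single record maps to a dictionary like the sketch below. The values are placeholders for illustration only; the generator function later in the article produces the real ones.

# Illustrative sketch only -- placeholder values, not generated data
example_record = {
    'machine_id': 'Machine_01',                      # string
    'test_id': 'a3fb30a0fe8f4000aa72eb182230d236',   # uuid hex string
    'technician': 'Jane Doe',                        # string
    'test_routine': ['a', 'c'],                      # categorical sample of routines
    'batched': 'Yes',                                # categorical
    'loc_1': {'x_offset': 8.63, 'y_offset': 0.39661, 'z_offset': 3736},
    'loc_2': {'x_offset': -4.94, 'y_offset': 0.964363, 'z_offset': 3182},
}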

Packages

Faker: A Python package for creating, as the name implies, fake data. The library generates not only random numbers and strings but also more complex elements like addresses and names.

from faker import Faker
fake = Faker()
Faker.seed(0)
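
A few illustrative calls show the kind of values Faker can return; the exact output depends on the seed and locale.

# Illustrative Faker calls -- output shown is just an example
print(fake.name())                                        # e.g. 'Norma Fisher'
print(fake.pyint(min_value=1, max_value=25))              # e.g. 13
print(fake.pydecimal(left_digits=2, right_digits=2,
                     min_value=-15, max_value=15))        # e.g. Decimal('8.63')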

xml.etree.ElementTree: Part of the Python standard library; it provides a simple and efficient way to query, parse, and create XML data.

import xml.etree.ElementTree as ET
from xml.dom import minidom
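
As a minimal sketch of the ElementTree calls used later in this article, a tree is built with Element and SubElement and serialized with tostring:

# Minimal sketch: build a tiny tree and serialize it to a string
root = ET.Element('DataFiles')
child = ET.SubElement(root, 'DataFile')
child.set('id', 'example')
child.text = 'hello'
print(ET.tostring(root, encoding='unicode'))
# <DataFiles><DataFile id="example">hello</DataFile></DataFiles>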

Pandas: The go-to library for working with tabular data.

import pandas as pd
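
A minimal sketch of the pattern used later: a list of flat dictionaries converts directly into a DataFrame, one row per dictionary.

# Minimal sketch: list of flat dictionaries -> DataFrame
df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
print(df.shape)  # (2, 2)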
# base imports
import uuid
import os

Generate Dataset

The structure of the dataset is represented by a dictionary and uses Faker to fill in the data values. This allows for simple modifications when moving between use cases.

def generate_measurement_record():
    # Pick a random machine from a fixed pool of equipment names
    base_equipment_name = 'Machine'
    equipment_cnt = 25
    n = fake.pyint(min_value=1, max_value=equipment_cnt)

    measurement_record = {
        'machine_id': '_'.join([base_equipment_name, f'{n:02}'])
        ,'test_id': str(uuid.uuid4().hex)
        ,'technician': fake.name()
        # Random subset of the available test routines
        ,'test_routine': fake.random_sample(elements=('a', 'b', 'c', 'd', 'e', 'f'))
        # Single categorical choice
        ,'batched': fake.random_sample(elements=('Yes', 'No', 'N/A'), length=1)[0]
        # Nested measurement locations with offsets in the ranges
        # defined by the data structure table
        ,'loc_1': {
            'x_offset': fake.pydecimal(left_digits=2, right_digits=2, min_value=-15, max_value=15)
            ,'y_offset': fake.pydecimal(left_digits=1, right_digits=6, min_value=-1, max_value=1)
            ,'z_offset': fake.pyint(min_value=2500, max_value=5000)
        }
        ,'loc_2': {
            'x_offset': fake.pydecimal(left_digits=2, right_digits=2, min_value=-15, max_value=15)
            ,'y_offset': fake.pydecimal(left_digits=1, right_digits=6, min_value=-1, max_value=1)
            ,'z_offset': fake.pyint(min_value=2500, max_value=5000)
        }
    }
    return measurement_record

Repeatedly calling this function generates new records with random data that can be collected in a list up to a desired size.

records_cnt = 5

measurement_record = None
measurement_list=[]

for _ in range(records_cnt):
    measurement_record = generate_measurement_record()
    measurement_list.append(measurement_record) # List of nested dictionaries

Convert Dataset to XML

Next, we'll need a function to convert the records stored as dictionaries to XML, and then apply that function to each record in the list.

import xml.etree.ElementTree as ET

def dict_to_xml(d, r=None):
    # Create a root element with a unique id if one was not supplied
    if r is None:
        r = ET.Element('DataFile')
        r.set('id', str(uuid.uuid4().hex))
    if isinstance(d, dict):
        # Each key becomes a child element; recurse on the value
        for k, v in d.items():
            s = ET.SubElement(r, k)
            dict_to_xml(v, s)
    elif isinstance(d, (tuple, list)):
        # Join sequence values into a single delimited string
        r.text = '/'.join(str(v) for v in d)
    elif isinstance(d, str):
        r.text = d
    else:
        r.text = str(d)
    return r
# Convert each record
xml_dataset = ET.Element('DataFiles')
for rcd in measurement_list:
    file_id = str(uuid.uuid4().hex)
    child = ET.SubElement(xml_dataset,'DataFile')
    child.set('id', file_id)

    # Convert to xml
    element = dict_to_xml(rcd, child)

The resulting XML can be viewed with:

from xml.dom import minidom

xml = ET.tostring(xml_dataset, encoding='unicode', method='xml')
xml = minidom.parseString(xml)
xml = xml.toprettyxml(indent='  ')

print(xml)
<?xml version="1.0" ?>
<DataFiles>
  <DataFile id="33c768494e484df9b954c4c7bae2ce68">
    <machine_id>Machine_10</machine_id>
    <test_id>3c0c95f773bf426585a3a68642d2d41a</test_id>
    <technician>Brett Kerr</technician>
    <test_routine>c/f/a/b</test_routine>
    <batched>Yes</batched>
    <loc_1>
      <x_offset>8.63</x_offset>
      <y_offset>0.39661</y_offset>
      <z_offset>3736</z_offset>
    </loc_1>
    <loc_2>
      <x_offset>-4.94</x_offset>
      <y_offset>0.964363</y_offset>
      <z_offset>3182</z_offset>
    </loc_2>
  </DataFile>
</DataFiles>

Convert Dataset to Pandas DataFrame

Pandas can create a DataFrame directly from a dictionary as long as the dictionary is not nested. So, the first step of conversion is to create a function that flattens the measurement dictionary structure:

import collections.abc

def flatten(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        # Build the flattened key, e.g. loc_1 + _ + x_offset -> loc_1_x_offset
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            # Recurse into nested dictionaries
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)
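
For example, applied to a single nested record, flatten joins the parent and child keys with the separator (a hypothetical input for illustration):

# Hypothetical example: flattening one nested record
nested = {'machine_id': 'Machine_01', 'loc_1': {'x_offset': 8.63, 'z_offset': 3736}}
print(flatten(nested))
# {'machine_id': 'Machine_01', 'loc_1_x_offset': 8.63, 'loc_1_z_offset': 3736}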

Next, create a new list of the flattened records:

measurement_flat_list= []

for rcd in measurement_list:
    measurement_flat_list.append(flatten(rcd)) # List of single depth dictionaries

Finally, convert the list to a dataframe:

measurement_df = pd.DataFrame(measurement_flat_list)
measurement_df.head()
machine_id test_id technician test_routine batched loc_1_x_offset loc_1_y_offset loc_1_z_offset loc_2_x_offset loc_2_y_offset loc_2_z_offset
0 Machine_02 ffbb9f35d5484deebab0598926194203 Stephanie Leblanc [f, d, c, e, a] Yes 14.86 0.607854 3627 0.63 0.86374 3828
1 Machine_20 fb824a88e0f145169b411a2180c5671c Kevin Rogers [b, a] N/A -6.14 0.390133 3198 -4.54 0.105494 3099
2 Machine_23 25d73f925ba14df5a09a11ef8fb3a68d Robert Walters [e, a, f, d, c] N/A 12.73 0.410212 2874 -3.14 0.634917 2588
3 Machine_07 a3fb30a0fe8f4000aa72eb182230d236 Cathy Martinez [f, a] N/A -14.69 0.650746 2915 12.33 0.231556 2794
4 Machine_21 02a27172e035464fbdf680c476312b03 Jeffrey Brown [a, e, f, c] No -8.33 0.94833 4426 12.72 0.731588 3333
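
As an optional sanity check, the offset columns can be summarized to confirm the generated values stay within the ranges from the data structure table. Note that pydecimal returns Decimal objects, which pandas stores as object dtype, so cast them to float first:

# Optional check: offsets should fall within the ranges in the structure table
offset_cols = [c for c in measurement_df.columns if c.endswith('_offset')]
print(measurement_df[offset_cols].astype(float).describe())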

Save Dataset Files

  • XML
  • CSV
  • Parquet
# xml
destPath = 'sample_lab_measurement.xml'
# Convert to string
xml_str = ET.tostring(xml_dataset, encoding='unicode', method='xml')

# Pretty print string
xml_str = minidom.parseString(xml_str)
xml_str = xml_str.toprettyxml(indent='  ')

# Write file as utf-8
with open(destPath, 'w', encoding='utf-8') as f:
    f.write(xml_str)
# csv
destPath = 'sample_lab_measurement.csv'
measurement_df.to_csv(destPath, index=False, header=True)
# parquet
destPath = 'sample_lab_measurement.parquet'
measurement_df.to_parquet(destPath, index=False)
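
As a final optional check, each saved file can be read back to confirm the round trip (to_parquet and read_parquet assume a parquet engine such as pyarrow is installed):

# Optional round-trip check on the three saved files
df_csv = pd.read_csv('sample_lab_measurement.csv')
df_parquet = pd.read_parquet('sample_lab_measurement.parquet')
xml_root = ET.parse('sample_lab_measurement.xml').getroot()
print(len(df_csv), len(df_parquet), len(xml_root))  # record counts should match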
