DEV Community: ugo landini

JR, quality Random Data from the Command line, part II

ugo landini — Wed, 31 May 2023 17:36:34 +0000

In the first part of this series, we have seen how to use JR in simple use cases to stream random data from predefined templates to standard out and Apache Kafka on Confluent Cloud.

In this follow-up, we'll take a closer look at the JR data generation process and at how you can use it to generate data that streaming applications can actually use.

Smart functions

We defined quality data across 2 dimensions:

1. things that must be realistic "in themselves", like an IP address or a credit card number
2. things that are realistic only if coherent with other data, like names, companies, emails, cities, zip codes, mobile phones, locale, etc.

Some JR template functions are “smart”, so let's talk a bit about type 2 data. Let's look at the predefined user template for example:

> jr template show user

{
  "guid": "{{uuid}}",
  "isActive": {{bool}},
  "balance": "{{amount 100 10000 "€"}}",
  "picture": "http://placehold.it/32x32",
  "age": {{integer 20 60}},
  "eyeColor": "{{randoms "blue|brown|green"}}",
  "name": "{{name}} {{surname}}",
  "gender": "{{gender}}",
  "company": "{{company}}",
  "work_email": "{{email_work}}",
  "email": "{{email}}",
  "about": "{{lorem 20}}",
  "country": "{{country}}",
  "address": "{{city}}, {{street}} {{building 2}}, {{zip}}",
  "phone_number": "{{phone}}",
  "mobile": "{{mobile_phone}}",
  "latitude": {{latitude}},
  "longitude": {{longitude}}
}

The user template doesn't contain any logic to correlate type 2 data, but if you run it, you'll see that everything works as expected. For example, let's run the template with the IT locale:

jr run --locale IT user

{
  "guid": "3c37f1d2-c4d4-4a10-ac9e-eefa0d0a4fc1",
  "isActive": false,
  "balance": "€8106.36",
  "picture": "http://placehold.it/32x32",
  "age": 21,
  "eyeColor": "green",
  "name": "Maria Rizzo",
  "gender": "F",
  "company": "Evil Partners",
  "work_email": "maria.rizzo@evilpartners.com",
  "email": "maria.rizzo@hotmail.com",
  "about": "Lorem ipsum dolor sit amet, laoreet ligula. Curabitur id nisl ut Lorem sit amet justo pulvinar aliquet accumsan sit amet",
  "country": "IT",
  "address": "Lodi, Piazza dei Miracoli 80, 26900",
  "phone_number": "0371 95903936",
  "mobile": "3899578232",
  "latitude": -22.4702,
  "longitude": -4.6067
}

As you can see, name, gender, email, country, address, zip code and phones are all coherent. That's because JR, under the hood, keeps track of everything and reuses data previously generated in the template. So, if you generate a work_email, the function will reuse name, surname and company.
The zip code comes from a reverse regex pattern that is valid for the city, the mobile phone is valid for the country, and so on. At the moment some JR localisations are still in progress, so please contribute if you want to help us!
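The idea of functions that remember what was generated earlier can be sketched with Go's text/template. This is only an illustration, not JR's actual implementation: the genContext type, the fixed values and the way email_work is derived are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// genContext is a hypothetical per-document state: values generated earlier
// in the template are remembered so that later functions can reuse them.
type genContext struct {
	name, surname, company string
}

// newFuncs returns template functions that close over a shared context.
// Values are fixed here to keep the sketch deterministic; a real generator
// would pick them at random from locale-specific lists.
func newFuncs(ctx *genContext) template.FuncMap {
	return template.FuncMap{
		"name":    func() string { ctx.name = "Maria"; return ctx.name },
		"surname": func() string { ctx.surname = "Rizzo"; return ctx.surname },
		"company": func() string { ctx.company = "Evil Partners"; return ctx.company },
		// email_work generates nothing new: it reuses name, surname and company.
		"email_work": func() string {
			domain := strings.ToLower(strings.ReplaceAll(ctx.company, " ", ""))
			return strings.ToLower(ctx.name+"."+ctx.surname) + "@" + domain + ".com"
		},
	}
}

// render executes a template string against a fresh context.
func render(text string) (string, error) {
	ctx := &genContext{}
	tpl, err := template.New("user").Funcs(newFuncs(ctx)).Parse(text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tpl.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(`{{name}} {{surname}}, {{company}}: {{email_work}}`)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

Because template actions are evaluated in order, email_work sees the name, surname and company produced before it, which is why the template itself needs no correlation logic.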

This is pretty simple and straightforward, so let's look now at relations between data.

Emitters

So far we have seen simple generation use cases. If you need to generate related data, you need more tools. JR comes preconfigured with some example emitters:

jr emitter list

List of JR emitters:

shoe
shoe_customer
shoe_order
shoe_clickstream

What's an emitter? It's basically a preconfigured jr job, and it's really helpful when you have to generate different entities with different generation parameters and relations between them.

Let's study the preconfigured shoe example:

jr emitter show shoe

Name:shoe
Locale: us
Num: 0
Frequency: 0s
Duration: 0s
Preload: 100
Output: stdout
Topic: shoes
Kcat: false
Oneline: false
Key Template: null
Value Template: shoe
Output Template: {{.V}}

This will generate just 100 shoes in the preload phase (i.e. before the generation phase), and no more: frequency and duration are both 0. So this is useful for more static, "table-like" data.

jr emitter show shoe_customer

Name:shoe_customer
Locale: us
Num: 1
Frequency: 1s
Duration: 10s
Preload: 20
Output: stdout
Topic: shoe_customers
Kcat: false
Oneline: false
Key Template: null
Value Template: shoe_customer
Output Template: {{.V}}

For shoe_customer we have a preload of 20, but it will also generate one customer per second for 10 seconds. So it's still fairly static, but less so than the shoes, which is reasonable: you don't have a new product to sell every second, but you may have new customers.

jr emitter show shoe_clickstream

Name:shoe_clickstream
Locale: us
Num: 1
Frequency: 100ms
Duration: 10s
Preload: 0
Output: stdout
Topic: shoe_clickstream
Kcat: false
Oneline: false
Key Template: null
Value Template: shoe_clickstream
Output Template: {{.V}}

shoe_clickstream is much more dynamic: it emits 1 click every 100ms, with no preload.

jr emitter show shoe_order

Name:shoe_order
Locale: us
Num: 1
Frequency: 500ms
Duration: 10s
Preload: 0
Output: stdout
Topic: shoe_orders
Kcat: false
Oneline: false
Key Template: null
Value Template: shoe_order
Output Template: {{.V}}

shoe_order is similar, no preload and a lower frequency.

But wait, this is just a way to simplify the command line and differentiate frequency, duration, preload and other parameters for every template: where are the relations?

Let's look at the shoe template:

jr template show shoe

{{$id:=uuid}}{{add_v_to_list "shoes_id_list" $id}}{
  "id": "{{$id}}",
  "sale_price": "{{amount 200 2000 ""}}",
  "brand": "{{from "sport_brand"}}",
  "name": "{{randoms "Pro|Cool|Soft|Air|Perf"}} {{from "cool_name"}} {{integer 1 20}}",
  "rating": "{{format_float "%.2f" (floating 1 5)}}"
}

Here you can see that a random uuid is assigned to the $id variable and then added to shoes_id_list with the add_v_to_list function.
The list is automatically shared with all the running templates, so to get a working relationship you just need to pick random ids from this list instead of generating new ones.

jr template show shoe_clickstream
{
  "product_id": "{{random_v_from_list "shoes_id_list"}}",
  "user_id": "{{random_v_from_list "customers_id_list"}}",
  "view_time": {{integer 10 120}},
  "page_url": "https://www.acme.com/product/{{random_string 4 5}}",
  "ip": "{{ip "10.1.0.0/16"}}",
  "ts": {{counter "ts" 1609459200000 10000 }}
}

In the shoe_clickstream template that's pretty clear: product_id and user_id are not random but come from shoes_id_list and customers_id_list, so there is full referential integrity.
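The shared-list mechanism can be approximated in a few lines of Go. The sketch below is not JR's code: the registry type is hypothetical, and the two functions are simplified stand-ins that only borrow the names add_v_to_list and random_v_from_list from the templates above.

```go
package main

import (
	"fmt"
	"math/rand"
	"os"
	"strings"
	"sync"
	"text/template"
)

// registry is a hypothetical store of named lists shared by all templates.
type registry struct {
	mu    sync.Mutex
	lists map[string][]string
}

// add appends v to the named list and returns "" so that calling it from a
// template prints nothing.
func (r *registry) add(list, v string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lists[list] = append(r.lists[list], v)
	return ""
}

// pick returns a random value from the named list, or "" if it is empty.
func (r *registry) pick(list string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	vs := r.lists[list]
	if len(vs) == 0 {
		return ""
	}
	return vs[rand.Intn(len(vs))]
}

// render executes a template with the two list functions bound to reg.
func render(reg *registry, text string) (string, error) {
	funcs := template.FuncMap{"add_v_to_list": reg.add, "random_v_from_list": reg.pick}
	tpl, err := template.New("t").Funcs(funcs).Parse(text)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	err = tpl.Execute(&sb, nil)
	return sb.String(), err
}

func main() {
	reg := &registry{lists: map[string][]string{}}
	// "Preload" three shoes, each registering its id in the shared list.
	for i := 1; i <= 3; i++ {
		shoe := fmt.Sprintf(`{{$id := "shoe-%d"}}{{add_v_to_list "shoes_id_list" $id}}{"id": "{{$id}}"}`, i)
		if _, err := render(reg, shoe); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	// A click can now only reference a shoe that was actually generated.
	click, err := render(reg, `{"product_id": "{{random_v_from_list "shoes_id_list"}}"}`)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(click)
}
```

The mutex is there because, in this sketch, several templates may run concurrently and touch the same lists.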

If you need more than one value from a list, you can use the random_n_v_from_list function instead of random_v_from_list. This function is guaranteed to pick n different values from the list, so it is ideal for 1:many relationships.
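Picking n different values is easy to get wrong with repeated random draws, so here is one simple way to do it in Go. This is an illustrative sketch, not the code behind random_n_v_from_list; pickDistinct is a hypothetical helper, and it assumes the list itself contains no duplicates.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickDistinct returns n elements taken from n different positions of values,
// chosen at random. If n exceeds len(values), every element is returned once.
func pickDistinct(values []string, n int) []string {
	if n > len(values) {
		n = len(values)
	}
	if n < 0 {
		n = 0
	}
	out := make([]string, 0, n)
	// rand.Perm yields every index exactly once, so no position repeats.
	for _, i := range rand.Perm(len(values))[:n] {
		out = append(out, values[i])
	}
	return out
}

func main() {
	shoes := []string{"shoe-1", "shoe-2", "shoe-3", "shoe-4", "shoe-5"}
	// One order referencing three different shoes.
	fmt.Println(pickDistinct(shoes, 3))
}
```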

To start all the emitters, just type:

jr emitter run

A goroutine per emitter will start producing random data, but not too random: coherency and integrity are important for your streaming applications!

Conclusions

We have seen how to use JR in more advanced use cases, streaming quality random data with referential integrity.
In the next part of this series, we will see how to use REST APIs with JR.
In the meantime, happy streaming!

JR, quality Random Data from the Command line, part I

ugo landini — Sun, 07 May 2023 18:11:48 +0000

What is JR?

JR is a CLI tool that helps you stream quality random data. We all know what streaming is and why it is so important nowadays. Now, let's try to define what "quality random data" is.
A simple, and not too scientific, way of defining it is: whatever data is good enough to look real.

Examples:

Is 1.2.3.4 a good IP? And 10.2.138.203?
Is

  {
    "ID": "ABCDEFG1234",
    "name": "Ugo Landini",
    "gender": "F",
    "company": "Confluent",
    "email": "john.wayne@ibm.com"
  }

a good random user?

What about this one instead?

  {
    "ID": "69167997-0253-4165-a17d-9ef896124426",
    "name": "Laura Kim",
    "gender": "F",
    "company": "Boston Static",
    "email": "laura.kim@bostonstatic.com"
  }

Defining quality data

There are essentially two different dimensions:

1. things that must be realistic "in themselves", like an IP address or a credit card number
2. things that are realistic only if coherent with other data, like names, companies, emails, cities, zip codes, mobile phones, locale, etc.
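As a concrete example of data that is realistic "in itself": payment card numbers end with a check digit computed with the Luhn algorithm, so a random string of 16 digits will usually not look like a real card number to any system that validates it. The sketch below only illustrates that check; whether and how JR applies it is not covered here, and luhnValid is a name made up for the example.

```go
package main

import "fmt"

// luhnValid reports whether a string of decimal digits passes the Luhn checksum.
func luhnValid(number string) bool {
	if len(number) == 0 {
		return false
	}
	sum := 0
	double := false // the rightmost digit is not doubled
	for i := len(number) - 1; i >= 0; i-- {
		c := number[i]
		if c < '0' || c > '9' {
			return false
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
	}
	return sum%10 == 0
}

func main() {
	fmt.Println(luhnValid("4111111111111111")) // a widely used test number: true
	fmt.Println(luhnValid("1234567812345678")) // arbitrary digits: false
}
```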

Sometimes we may need to generate random data of type 2 in different streams, so the "coherency" must also spread across different entities: think, for example, of referential integrity in databases. If I am generating users, products and orders to three different Kafka topics and I want to create a streaming application with Apache Flink, I definitely need the data to be coherent across topics.

What is JR?

So, is JR yet another faking library written in Go? Yes and no. JR does implement most of the APIs in Faker.js and Gofakeit, but it's also able to stream data directly to stdout, Kafka, Redis and more (Elastic and MongoDB are coming). JR can talk directly to Confluent Schema Registry, manage json-schema and Avro schemas, and easily maintain coherence and referential integrity. If you need more than what JR offers out of the box, you can also easily pipe your data streams to other CLI tools like kcat.

Why is it called JR?

Just a Random generator, Json Random generator or, better, just JR, after the famous character from the '80s TV series Dallas: these are all valid answers. JR can generate anything, not only JSON, so I definitely prefer the last one.

The use case that generated the generator

I work as a Staff Solutions Engineer at Confluent: some weeks ago I was talking with a prospective customer, and he told me that they needed to send JSON documents like this one (among others) to Confluent Cloud.

{
"VLAN": "DELTA",
"IPV4_SRC_ADDR": "10.1.41.98",
"IPV4_DST_ADDR": "10.1.137.141",
"IN_BYTES": 1220,
"FIRST_SWITCHED": 1681984281,
"LAST_SWITCHED": 1682975009,
"L4_SRC_PORT": 81,
"L4_DST_PORT": 80,
"TCP_FLAGS": 0,
"PROTOCOL": 1,
"SRC_TOS": 211,
"SRC_AS": 4,
"DST_AS": 1,
"L7_PROTO": 443,
"L7_PROTO_NAME": "ICMP",
"L7_PROTO_CATEGORY": "Application"
}

They needed to send many of these (and similar) documents to Kafka, so it was important to measure how good Kafka client compression would be at their rate and with their data.

When you use fully managed services like Confluent Cloud, it's very important to understand how much your data will be compressed: price is directly proportional to throughput, and Kafka batches messages in the producers, so producer compression can easily save you a lot of bandwidth and therefore a lot of money. Now, producing data in real time and analysing it in real time is pretty easy with Confluent Cloud. But answering the prospect's questions in real time (i.e. during the conference call) wasn't as easy as it should have been. Which compression algorithm is better? Would it be fast enough? Is batch size important for compression?

Datagen is the de facto standard for generating random data for Kafka. But customising what's generated is not something you can do in 30 seconds, and enabling compression is currently not an option with the managed connectors. So I decided to write a tool which you could use to easily start streaming random data to Kafka in seconds, and that's why JR was born. With the help of some friends and colleagues we packed JR with a lot of features (and many more are coming!).

Basic JR usage

JR is very straightforward to use. Let's look at all the preinstalled templates:

jr template list

All the templates should be green: that means that their syntax is correct and they compile.

Let's look at the net_device template, which is what I would have written to randomise what they gave me if I had had JR during the conference call:

> jr template show net_device

{
"VLAN": "{{randoms "ALPHA|BETA|GAMMA|DELTA"}}",
"IPV4_SRC_ADDR": "{{ip "10.1.0.0/16"}}",
"IPV4_DST_ADDR": "{{ip "10.1.0.0/16"}}",
"IN_BYTES": {{integer 1000 2000}},
"FIRST_SWITCHED": {{unix_time_stamp 60}},
"LAST_SWITCHED": {{unix_time_stamp 10}},
"L4_SRC_PORT": {{ip_known_port}},
"L4_DST_PORT": {{ip_known_port}},
"TCP_FLAGS": 0,
"PROTOCOL": {{integer 0 5}},
"SRC_TOS": {{integer 128 255}},
"SRC_AS": {{integer 0 5}},
"DST_AS": {{integer 0 2}},
"L7_PROTO": {{ip_known_port}},
"L7_PROTO_NAME": "{{ip_known_protocol}}",
"L7_PROTO_CATEGORY": "{{randoms "Network|Application|Transport|Session"}}"
}

The net_device template is pretty easy to write: these are all "Type 1" fields with no relations. You can easily generate a good IP starting from its CIDR with the ip function. The other functions used in this template, like ip_known_port, integer and unix_time_stamp, are all pretty straightforward. Running this template is just a matter of typing:

> jr template run net_device

{
"VLAN": "DELTA",
"IPV4_SRC_ADDR": "10.1.175.220",
"IPV4_DST_ADDR": "10.1.148.210",
"IN_BYTES": 1553,
"FIRST_SWITCHED": 1680183839,
"LAST_SWITCHED": 1682746947,
"L4_SRC_PORT": 443,
"L4_DST_PORT": 81,
"TCP_FLAGS": 0,
"PROTOCOL": 0,
"SRC_TOS": 195,
"SRC_AS": 0,
"DST_AS": 0,
"L7_PROTO": 22,
"L7_PROTO_NAME": "SFTP",
"L7_PROTO_CATEGORY": "Network"
}
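Generating "a good IP starting from its CIDR", as the template does for the two address fields, boils down to keeping the network prefix and randomising the host bits. Here is a minimal sketch of that idea with Go's standard library; it is not JR's implementation, randomIPv4 is a hypothetical helper, and it handles IPv4 only.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
	"net"
	"os"
)

// randomIPv4 returns a random IPv4 address inside the given CIDR block.
// The network and broadcast addresses are not excluded in this sketch.
func randomIPv4(cidr string) (net.IP, error) {
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	base := network.IP.To4()
	if base == nil {
		return nil, fmt.Errorf("%s is not an IPv4 network", cidr)
	}
	ones, bits := network.Mask.Size()
	hostBits := uint(bits - ones)
	// Keep the network prefix and randomise only the host bits.
	host := uint32(rand.Int63n(int64(1) << hostBits))
	ip := make(net.IP, net.IPv4len)
	binary.BigEndian.PutUint32(ip, binary.BigEndian.Uint32(base)|host)
	return ip, nil
}

func main() {
	ip, err := randomIPv4("10.1.0.0/16")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(ip) // some address in 10.1.0.0/16
}
```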

When you write your own templates you'll probably need to look at all the available functions. Let's see for example how to ask JR which networking functions are available:

> jr man -c network

...

Name: ip_known_protocol
Category: network
Description: returns a random known protocol
Parameters:
Localizable: false
Return: string
Example: jr run --template '{{ip_known_protocol}}'
Output: tcp

Name: http_method
Category: network
Description: returns a random http method
Parameters:
Localizable: false
Return: string
Example: jr run --template '{{http_method}}'
Output: GET

Name: mac
Category: network
Description: returns a random mac Address
Parameters:
Localizable: false
Return: string
Example: jr run --template '{{mac}}'
Output: 7e:8e:75:a5:0a:85

You can also immediately test a function without writing a template, directly from jr man:

> jr man ip --run

Name: ip
Category: network
Description: returns a random Ip Address matching the given cidr
Parameters: cidr string
Localizable: false
Return: string
Example: jr run --template '{{ip "10.2.0.0/16"}}'
Output: 10.2.55.217

10.2.240.243

Elapsed time: 0s
Data Generated (Objects): 1
Data Generated (bytes): 12
Number of templates (Objects): 5
Throughput (bytes per second):       118

Create more random data

Using the -n option you can create more data in each pass. You can use jr run or jr template run: they are equivalent.
This example creates 3 net_device objects at once:

jr run net_device -n 3

Using the --frequency option you can repeat the whole creation pass at a fixed interval.

This example creates 2 net_device objects every second, forever:

jr run net_device -n 2 -f 1s

Using the --duration option you can time-bound the entire object creation.
This example creates 2 net_device objects every 100ms for 1 minute:

jr run net_device -n 2 -f 100ms -d 1m

Results are written to standard out by default (--output "stdout"), but streaming to Kafka is just as simple.

If you have Confluent Cloud, you can just download the client configuration, put the file in a kafka directory and start streaming. If you don't have Confluent Cloud, give it a try: no credit card is needed, a basic cluster to test JR is super cheap and you'll also get $400 of traffic included.

Anyway, here is the configuration template if you need to configure it manually. It's just a standard librdkafka configuration:

# Kafka configuration
# https://github.com/confluentinc/librdkafka/blob/master/CONFIGURATION.md

bootstrap.servers=
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username=
sasl.password=
compression.type=gzip
compression.level=9
statistics.interval.ms=1000

Streaming to Kafka

Once Kafka is configured, streaming to it with JR is straightforward:

jr run -n 5 -f 500ms -d 5s net_device -o kafka
2023/05/07 20:03:07         0 bytes produced to Kafka
2023/05/07 20:03:08      5250 bytes produced to Kafka
2023/05/07 20:03:09      8765 bytes produced to Kafka
2023/05/07 20:03:10     12260 bytes produced to Kafka
2023/05/07 20:03:11     15763 bytes produced to Kafka

Elapsed time: 5s
Data Generated (Objects): 50
Data Generated (bytes): 17364
Number of templates (Objects): 1
Throughput (bytes per second):      3172

By default, JR writes to a topic named test, but you can change that with the -t option.

Conclusions

We have seen how to use JR in simple use cases, streaming quality random data from predefined templates to standard out and Kafka on Confluent Cloud.
In the second part of this series, we will see how to produce your own templates and manage the integrity of generated data.
In the meantime, happy streaming!