In the first part of this series, we saw how to use JR in simple use cases to stream random data from predefined templates to standard output and Apache Kafka on Confluent Cloud.
In this follow-up, we'll take a closer look at the JR data generation process and how you can use it to generate data that is usable by streaming applications.
Smart functions
We defined quality data across 2 dimensions:
- things that must be realistic "in themselves", like an IP address, or a credit card number
- things that are realistic only if coherent with other data, like names, companies, emails, cities, zip codes, mobile phones, locale, etc.
Some JR template functions are "smart", so let's talk a bit about type 2 data. Let's look at the predefined user template for example:
> jr template show user
{
"guid": "{{uuid}}",
"isActive": {{bool}},
"balance": "{{amount 100 10000 "€"}}",
"picture": "http://placehold.it/32x32",
"age": {{integer 20 60}},
"eyeColor": "{{randoms "blue|brown|green"}}",
"name": "{{name}} {{surname}}",
"gender": "{{gender}}",
"company": "{{company}}",
"work_email": "{{email_work}}",
"email": "{{email}}",
"about": "{{lorem 20}}",
"country": "{{country}}",
"address": "{{city}}, {{street}} {{building 2}}, {{zip}}",
"phone_number": "{{phone}}",
"mobile": "{{mobile_phone}}",
"latitude": {{latitude}},
"longitude": {{longitude}}
}
The user template doesn't contain any logic to correlate type 2 data, but if you run it, you'll see that everything works as expected. Let's run the template with the IT locale, for example:
jr run --locale IT user
{
"guid": "3c37f1d2-c4d4-4a10-ac9e-eefa0d0a4fc1",
"isActive": false,
"balance": "€8106.36",
"picture": "http://placehold.it/32x32",
"age": 21,
"eyeColor": "green",
"name": "Maria Rizzo",
"gender": "F",
"company": "Evil Partners",
"work_email": "maria.rizzo@evilpartners.com",
"email": "maria.rizzo@hotmail.com",
"about": "Lorem ipsum dolor sit amet, laoreet ligula. Curabitur id nisl ut Lorem sit amet justo pulvinar aliquet accumsan sit amet",
"country": "IT",
"address": "Lodi, Piazza dei Miracoli 80, 26900",
"phone_number": "0371 95903936",
"mobile": "3899578232",
"latitude": -22.4702,
"longitude": -4.6067
}
As you can see, name, gender, email, country, address, zip code and phones are all coherent. That's because JR, under the hood, keeps track of everything and reuses data previously generated in the template. So, if you generate a work_email, the function will reuse name, surname and company.
The zip code is generated from a regex pattern (a "reverse regex") that is valid for the city, the mobile phone is valid for the country, and so on. At the moment some JR localisations are still in progress, so please contribute if you want to help us!
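The reuse described above can be pictured with a small Python sketch. It is not JR's implementation (JR is written in Go): the `Record` class, its method names and the tiny sample lists are hypothetical, chosen only to show how a later field can be derived from values generated earlier in the same record.

```python
import random


class Record:
    """Hypothetical per-record context that remembers generated values."""

    NAMES = ["Maria", "Luca", "Giulia"]          # illustrative sample data
    SURNAMES = ["Rizzo", "Bianchi", "Ferrari"]
    COMPANIES = ["Evil Partners", "Acme Corp"]

    def __init__(self, rng):
        self.rng = rng
        self.cache = {}  # values already generated for this record

    def _once(self, key, choices):
        # Generate a value the first time it is requested, then reuse it.
        if key not in self.cache:
            self.cache[key] = self.rng.choice(choices)
        return self.cache[key]

    def name(self):
        return self._once("name", self.NAMES)

    def surname(self):
        return self._once("surname", self.SURNAMES)

    def company(self):
        return self._once("company", self.COMPANIES)

    def email_work(self):
        # Derived from the cached name, surname and company, not drawn at random.
        user = f"{self.name()}.{self.surname()}".lower()
        domain = self.company().replace(" ", "").lower()
        return f"{user}@{domain}.com"


def make_user(seed=None):
    rec = Record(random.Random(seed))
    return {
        "name": f"{rec.name()} {rec.surname()}",
        "company": rec.company(),
        "work_email": rec.email_work(),
    }
```

Calling `make_user()` repeatedly gives different people, but within each record the work email always agrees with the name and company, which is the property the user template relies on.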
This is pretty simple and straightforward, so let's look now at relations between data.
Emitters
So far we have seen simple generation use cases. If you need to generate related data, you need more tools. JR comes preconfigured with some example emitters:
jr emitter list
List of JR emitters:
shoe
shoe_customer
shoe_order
shoe_clickstream
What's an emitter? It's basically a preconfigured jr job, and it's really helpful when you have to generate different entities with different generation parameters and relations between them.
Let's study the preconfigured shoe example:
jr emitter show shoe
Name:shoe
Locale: us
Num: 0
Frequency: 0s
Duration: 0s
Preload: 100
Output: stdout
Topic: shoes
Kcat: false
Oneline: false
Key Template: null
Value Template: shoe
Output Template: {{.V}}
This will generate just 100 shoes (the Preload value above) in the preload phase, i.e. before the generation phase, and no more: frequency and duration are both at 0. So this is useful for more static "table-like" stuff.
jr emitter show shoe_customer
Name:shoe_customer
Locale: us
Num: 1
Frequency: 1s
Duration: 10s
Preload: 20
Output: stdout
Topic: shoe_customers
Kcat: false
Oneline: false
Key Template: null
Value Template: shoe_customer
Output Template: {{.V}}
For shoe_customer we have a preload of 20, but it will also generate one customer per second for 10 seconds. So it's static, but less so than the shoes, which is reasonable. You don't have a new product to sell every second, but you may have new customers.
jr emitter show shoe_clickstream
Name:shoe_clickstream
Locale: us
Num: 1
Frequency: 100ms
Duration: 10s
Preload: 0
Output: stdout
Topic: shoe_clickstream
Kcat: false
Oneline: false
Key Template: null
Value Template: shoe_clickstream
Output Template: {{.V}}
shoe_clickstream is much more dynamic: it emits 1 click every 100ms, with no preload.
jr emitter show shoe_order
Name:shoe_order
Locale: us
Num: 1
Frequency: 500ms
Duration: 10s
Preload: 0
Output: stdout
Topic: shoe_orders
Kcat: false
Oneline: false
Key Template: null
Value Template: shoe_order
Output Template: {{.V}}
shoe_order is similar: no preload and a lower frequency (one order every 500ms).
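To compare the four emitters at a glance, here is a rough Python sketch that estimates how many records each one would produce. The parameter values are copied from the `jr emitter show` output above. The `Emitter` dataclass and the formula are assumptions made for illustration: the formula assumes one batch of `num` records every `frequency` for the whole `duration`, and JR's exact count may differ by a record or so depending on how it handles the first and last tick.

```python
from dataclasses import dataclass


@dataclass
class Emitter:
    name: str
    num: int           # records per tick
    frequency_ms: int  # time between ticks; 0 means no generation phase
    duration_ms: int   # length of the generation phase
    preload: int       # records created before the generation phase


def approx_records(e):
    """Rough record count, assuming one tick every frequency_ms for duration_ms."""
    if e.frequency_ms == 0 or e.duration_ms == 0:
        return e.preload  # preload only, nothing is emitted afterwards
    ticks = e.duration_ms // e.frequency_ms
    return e.preload + e.num * ticks


EMITTERS = [
    Emitter("shoe", 0, 0, 0, 100),
    Emitter("shoe_customer", 1, 1000, 10_000, 20),
    Emitter("shoe_clickstream", 1, 100, 10_000, 0),
    Emitter("shoe_order", 1, 500, 10_000, 0),
]

for e in EMITTERS:
    print(f"{e.name}: about {approx_records(e)} records")
```

Under these assumptions the clickstream is by far the busiest stream during the generation phase, while shoes never change after the preload.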
But wait, this is just a way to simplify the command line and differentiate frequency, duration, preload and other parameters for every template: where are the relations?
Let's look at the shoe template:
jr template show shoe
{{$id:=uuid}}{{add_v_to_list "shoes_id_list" $id}}{
"id": "{{$id}}",
"sale_price": "{{amount 200 2000 ""}}",
"brand": "{{from "sport_brand"}}",
"name": "{{randoms "Pro|Cool|Soft|Air|Perf"}} {{from "cool_name"}} {{integer 1 20}}",
"rating": "{{format_float "%.2f" (floating 1 5)}}"
}
Here you can see that a random uuid is assigned to the $id variable, and then added to shoes_id_list with the add_v_to_list function.
The list is automatically shared with all the running templates, so to have a working relationship you just need to get random ids from this list instead of generating them.
jr template show shoe_clickstream
{
"product_id": "{{random_v_from_list "shoes_id_list"}}",
"user_id": "{{random_v_from_list "customers_id_list"}}",
"view_time": {{integer 10 120}},
"page_url": "https://www.acme.com/product/{{random_string 4 5}}",
"ip": "{{ip "10.1.0.0/16"}}",
"ts": {{counter "ts" 1609459200000 10000 }}
}
In the shoe_clickstream template that's pretty clear: product_id and user_id are not random but come from shoes_id_list and customers_id_list, so there is full referential integrity.
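The shared-list mechanism can be sketched in a few lines of Python. The function names mirror the JR template functions, but the bodies are only a guess at the idea, not JR's Go code. The `shared_lists` registry, `make_customer` and `demo` are hypothetical, and it is an assumption that the customer template registers its ids the same way the shoe template does.

```python
import random
import uuid
from collections import defaultdict

# Hypothetical shared registry: list name -> values added so far.
shared_lists = defaultdict(list)


def add_v_to_list(list_name, value):
    """Record a generated value so other templates can reference it."""
    shared_lists[list_name].append(value)


def random_v_from_list(list_name, rng=random):
    """Pick an existing value instead of generating a new one."""
    return rng.choice(shared_lists[list_name])


def make_shoe():
    shoe_id = str(uuid.uuid4())
    add_v_to_list("shoes_id_list", shoe_id)
    return {"id": shoe_id}


def make_customer():
    # Assumption: the customer template registers its ids the same way.
    customer_id = str(uuid.uuid4())
    add_v_to_list("customers_id_list", customer_id)
    return {"id": customer_id}


def make_click():
    return {
        "product_id": random_v_from_list("shoes_id_list"),
        "user_id": random_v_from_list("customers_id_list"),
    }


def demo(n_shoes=100, n_customers=20, n_clicks=5):
    shoes = [make_shoe() for _ in range(n_shoes)]          # preload phase
    customers = [make_customer() for _ in range(n_customers)]
    clicks = [make_click() for _ in range(n_clicks)]       # generation phase
    return shoes, customers, clicks
```

In this sketch, picking from a list that is still empty raises an `IndexError`, which illustrates why the referenced entities are preloaded before the clicks start.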
If you need more than 1 value from a list, you can use the random_n_v_from_list function instead of random_v_from_list. This function is guaranteed to pick n different values from the list, so it is ideal for 1:many relationships.
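As an illustration of picking n different values, here is a minimal Python sketch built on `random.sample`, which draws without replacement. It only shares its name with the JR function, and `make_order` with its field names is a hypothetical order shape, not the actual shoe_order template.

```python
import random


def random_n_v_from_list(values, n, rng=random):
    """Return n items from distinct positions of values (sketch only)."""
    # random.sample never returns the same position twice; the values are
    # distinct as long as the list itself has no duplicates (true for uuids).
    return rng.sample(values, n)


def make_order(shoe_ids, customer_ids, n_items, rng=random):
    # Hypothetical 1:many record: one customer, several different shoes.
    return {
        "customer_id": rng.choice(customer_ids),
        "items": random_n_v_from_list(shoe_ids, n_items, rng),
    }


shoe_ids = [f"shoe-{i}" for i in range(10)]
customer_ids = [f"customer-{i}" for i in range(3)]
order = make_order(shoe_ids, customer_ids, n_items=3, rng=random.Random(7))
```

Note that `random.sample` raises a `ValueError` when n is larger than the list, so in this sketch the list needs at least n entries before an order can be built.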
To start all the emitters, just type:
jr emitter run
A goroutine per emitter will start producing random data, but not too random: coherency and integrity are important for your streaming applications!
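The one-worker-per-emitter model can be approximated in Python with threads. This is an analogy only: JR uses goroutines, and the `run_emitter` and `run_all` helpers, the `(name, count, interval)` tuples and the in-memory `sink` are hypothetical stand-ins for real emitters and outputs.

```python
import threading
import time


def run_emitter(name, count, interval_s, sink, lock):
    """Emit `count` records for one emitter, pausing between records."""
    for i in range(count):
        with lock:  # the sink is shared by all emitter threads
            sink.append((name, i))
        time.sleep(interval_s)


def run_all(specs):
    """Start one thread per (name, count, interval_s) spec and wait for all."""
    sink, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=run_emitter, args=(name, count, interval, sink, lock))
        for name, count, interval in specs
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


records = run_all([("shoe_clickstream", 10, 0.001), ("shoe_order", 2, 0.005)])
```

Each emitter keeps its own pace, so records from the faster emitter interleave with those of the slower one in the shared output.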
Conclusions
We have seen how to use JR in more advanced use cases, streaming quality random data with referential integrity.
In the next part of this series, we will see how to use REST APIs with JR.
In the meantime, happy streaming!