<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gonçalo Trincão Cunha</title>
    <description>The latest articles on DEV Community by Gonçalo Trincão Cunha (@trincao).</description>
    <link>https://dev.to/trincao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F896214%2F781e6e16-98fa-497e-aa5f-0607a918bdc2.png</url>
      <title>DEV Community: Gonçalo Trincão Cunha</title>
      <link>https://dev.to/trincao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/trincao"/>
    <language>en</language>
    <item>
      <title>Rubik Cube Simulation in Python: repeat until solved</title>
      <dc:creator>Gonçalo Trincão Cunha</dc:creator>
      <pubDate>Wed, 27 Jul 2022 17:51:00 +0000</pubDate>
      <link>https://dev.to/trincao/rubik-cube-simulation-repeat-until-solved-2o80</link>
      <guid>https://dev.to/trincao/rubik-cube-simulation-repeat-until-solved-2o80</guid>
      <description>&lt;p&gt;If you repeat any sequence of moves on a Rubik cube enough times, the cube will return to the initial (solved) state.&lt;br&gt;
This happens no matter how simple or complex is the chosen sequence.&lt;/p&gt;

&lt;p&gt;Each sequence has a length (number of moves in the sequence) and a period, or group order, which is the number of times it must be repeated until the cube returns to the solved state.&lt;/p&gt;

&lt;p&gt;Example sequence: F' L' F' L&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdc3pg8fz7779dr5lv3np.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdc3pg8fz7779dr5lv3np.png" alt="Sequence"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Sequence Period: How many times to repeat?
&lt;/h2&gt;

&lt;p&gt;On a 3x3x3 cube, depending on the sequence chosen, the period may be as low as 1 or as high as 1260. Here are a few examples.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period&lt;/th&gt;
&lt;th&gt;Sequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;L L'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;R, D, F, F, D', R'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;U&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;F' L' F' L&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;R' D' R D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;U', R'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1260&lt;/td&gt;
&lt;td&gt;R' U' R D D U' F R&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
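&lt;p&gt;Under the hood, each sequence is a permutation of the cube's stickers, and its period is the order of that permutation: the least common multiple of its cycle lengths. A minimal, cube-independent sketch in plain Python (the permutation below is an arbitrary example, not an actual cube move):&lt;/p&gt;

```python
from math import gcd

def permutation_order(perm):
    """Order of a permutation given as a list: perm[i] is where i goes.

    Equals the least common multiple of its cycle lengths.
    """
    seen = set()
    order = 1
    for start in range(len(perm)):
        if start in seen:
            continue
        # Walk the cycle containing `start` and measure its length
        length, current = 0, start
        while current not in seen:
            seen.add(current)
            current = perm[current]
            length += 1
        order = order * length // gcd(order, length)
    return order

# A 3-cycle (0 1 2) and a 2-cycle (3 4): order is lcm(3, 2) = 6
print(permutation_order([1, 2, 0, 4, 3]))  # 6
```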

&lt;p&gt;On a 4x4x4 cube, the periods can be much larger, even reaching 765765.&lt;br&gt;
Here are some examples.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period&lt;/th&gt;
&lt;th&gt;Sequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;L L'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;U&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;R' D' R D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;2L Bw&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12240&lt;/td&gt;
&lt;td&gt;Rw, U, Fw, Bw, F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;765765&lt;/td&gt;
&lt;td&gt;R R Rw Uw Dw Dw&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Analysis of the Sequence Period
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Main question&lt;/strong&gt;: Given a random sequence of N moves, what is the average sequence period?&lt;/p&gt;

&lt;p&gt;Although there are &lt;a href="https://people.kth.se/~boij/kandexjobbVT11/Material/rubikscube.pdf" rel="noopener noreferrer"&gt;mathematical approaches&lt;/a&gt; to answer this question, we're using a simulation approach with the Python library &lt;a href="https://github.com/trincaog/magiccube" rel="noopener noreferrer"&gt;magiccube&lt;/a&gt;, which is a fast Rubik Cube simulator.&lt;/p&gt;

&lt;p&gt;The simulation is run 1000 times. Each run executes the sequence until the cube returns to the original state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cube&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;magiccube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Cube&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run the simulation N times
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Generate a random sequence
&lt;/span&gt;    &lt;span class="n"&gt;moves&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_random_moves&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute the sequence
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rotate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;moves&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Check if the cube is finished
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
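&lt;p&gt;Each run prints one period, so the simulation output is just a list of numbers. A small aggregation sketch (the periods below are illustrative, not actual simulation output):&lt;/p&gt;

```python
from collections import Counter
from statistics import mean, median

# Periods collected from simulation runs (illustrative values)
periods = [4, 6, 6, 12, 36, 105, 360, 1260]

print("runs:  ", len(periods))     # 8
print("mean:  ", mean(periods))    # 223.625
print("median:", median(periods))  # 24.0

# Frequency of each period, useful for plotting the decay
histogram = Counter(periods)
print(histogram[6])                # 2
```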



&lt;h2&gt;
  
  
  Sequence Period decay
&lt;/h2&gt;

&lt;p&gt;The distribution of sequence periods shows an exponential decay: most sequences have small periods, and few have large periods.&lt;/p&gt;

&lt;p&gt;Using a sequence length of 30 random moves on a 3x3x3 cube, we can see the distribution of period sizes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faenhdhnb0pc8c0ivi25g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faenhdhnb0pc8c0ivi25g.png" alt="Histogram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequence length vs period
&lt;/h2&gt;

&lt;p&gt;The sequence period is typically smaller for shorter sequences, but beyond a certain threshold the period stops increasing.&lt;br&gt;
On the 3x3x3 cube, the threshold is around 2-5 moves.&lt;br&gt;
On the 4x4x4 cube, it is around 11-16 moves.&lt;br&gt;
On the 5x5x5 cube, it is in excess of 20 moves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs15tsojuz6yh14ssfcqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs15tsojuz6yh14ssfcqz.png" alt="Sequence length vs period"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Hope you enjoyed it. &lt;br&gt;
If you are a fan of the Rubik cube, you can use the open-source Python library &lt;a href="https://github.com/trincaog/magiccube" rel="noopener noreferrer"&gt;magiccube&lt;/a&gt; to solve the cube and run simulations.&lt;/p&gt;

</description>
      <category>rubik</category>
      <category>python</category>
      <category>cube</category>
      <category>simulation</category>
    </item>
    <item>
      <title>Reactive vs Synch Performance Test with Spring Boot</title>
      <dc:creator>Gonçalo Trincão Cunha</dc:creator>
      <pubDate>Fri, 22 Jul 2022 19:25:00 +0000</pubDate>
      <link>https://dev.to/trincao/reactive-vs-synch-performance-test-with-spring-boot-3d7m</link>
      <guid>https://dev.to/trincao/reactive-vs-synch-performance-test-with-spring-boot-3d7m</guid>
      <description>&lt;p&gt;Reactive is a programming paradigm that uses asynchronous programming. It inherits the concurrency efficiency of the asynchronous model with the ease of use of declarative programming.&lt;/p&gt;

&lt;p&gt;Multithreading is able to parallelize work on multiple CPUs, but when an IO operation is issued, the thread blocks waiting for the IO to complete.&lt;/p&gt;

&lt;p&gt;Reactive/Async does not parallelize work on multiple CPUs, but when an IO operation is issued, the CPU is handed over to the next task in the event loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typically, using multiple processes or threads is better for CPU-bound systems, while async/reactive is better for IO-bound systems.&lt;/strong&gt;&lt;/p&gt;
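&lt;p&gt;The IO-bound case is easy to illustrate with Python's asyncio (a sketch of the event-loop model, not the Java setup tested below): ten simulated IO calls of 100 ms each complete in roughly 100 ms total, because the event loop hands the CPU to the next task whenever one awaits.&lt;/p&gt;

```python
import asyncio
import time

async def fake_io_call(i):
    # Simulated IO: the event loop is free to run other tasks while we wait
    await asyncio.sleep(0.1)
    return i

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_io_call(i) for i in range(10)))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} calls in {elapsed:.2f}s")  # ~0.10s, not 1s

asyncio.run(main())
```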

&lt;p&gt;NOTE: This is a modified repost of a test done back in 2018. Photo by &lt;a href="https://unsplash.com/@andreuuuw?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Andrew Wulf&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/many?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Easier Asynchronous Programming
&lt;/h2&gt;

&lt;p&gt;Let’s see an example of a method that fetches a user from a database, performs some conversions and transformations, and then displays the results.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;synchronous&lt;/strong&gt; version looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User user = getUserFromDBSync(id);
user = convertUser(user);
user = processResult(user);
displayResults(user);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty straightforward.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;async&lt;/strong&gt; version with callbacks leads to deeply nested code, the infamous “callback hell”.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;getUserFromDB(id, user -&amp;gt; {
  convertUser(user, convertedUser -&amp;gt; {
    processResult(convertedUser, processedUser -&amp;gt; {
      displayResults(processedUser);
    });
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the same example with the &lt;strong&gt;reactive&lt;/strong&gt; approach. It is much more readable and maintainable than the async/callback version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;getUserFromDBAsync(id)
  .map(user -&amp;gt; convertUser(user))
  .map(user -&amp;gt; processResult(user))
  .subscribe(user -&amp;gt; displayResults(user));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Improved multi-tasking
&lt;/h2&gt;

&lt;p&gt;On the concurrency topic, I’ve decided to run a small test to evaluate the difference between the reactive and synchronous versions for IO-bound operations.&lt;/p&gt;

&lt;p&gt;You can get the test project here &lt;a href="https://github.com/trincaog/reactivetest" rel="noopener noreferrer"&gt;https://github.com/trincaog/reactivetest&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The test setup is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load testing client (Gatling)&lt;/li&gt;
&lt;li&gt;Test Service (Spring Boot)&lt;/li&gt;
&lt;li&gt;External backend service (simulated)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test backend service
&lt;/h3&gt;

&lt;p&gt;The test backend service simulates a query to an external service (e.g., a database) that takes some time to return a list of records. For simplicity, the test doesn’t send a query to a real database; instead, it simulates the response delay.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synchronous version setup:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Spring Boot 2.0 (2.0.0.RC1) application / Spring MVC framework&lt;/li&gt;
&lt;li&gt;Embedded Tomcat container with max threads=10,000 (a large number, to avoid queued requests)&lt;/li&gt;
&lt;li&gt;Hosted on AWS ECS/Fargate with 256 mCPU / 2GB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive version setup:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Spring Boot 2.0 (2.0.0.RC1) application / Spring Webflux framework&lt;/li&gt;
&lt;li&gt;Netty framework&lt;/li&gt;
&lt;li&gt;Hosted on AWS ECS/Fargate with 256 mCPU / 2GB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Load Testing Client
&lt;/h2&gt;

&lt;p&gt;The following components were used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS EC2 t2.small 1vCPU / 2GB RAM&lt;/li&gt;
&lt;li&gt;Gatling 2.3.0&lt;/li&gt;
&lt;li&gt;Continuous request loop without any delay between requests&lt;/li&gt;
&lt;li&gt;Two configurations of the external service: one with a 500ms response time; another with a 2,000ms response time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Load Test #1: External Service Delay 500ms
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskacbrghtnbf2hn70vd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskacbrghtnbf2hn70vd8.png" alt="Load test #1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With &amp;lt;=100 concurrent requests, the response times of the two versions are very similar.&lt;/p&gt;

&lt;p&gt;After 200 concurrent users, the response times of the synchronous/Tomcat version start deteriorating, while the reactive version with Netty holds up until 2,000 concurrent users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Load Test #2: External Service Delay 2,000ms
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ipt7txf7imyphk1qteg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ipt7txf7imyphk1qteg.png" alt="Load test #2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This test uses a much slower backing service (4x slower), yet the service sustains a much larger load. This happens because, for the same number of concurrent users, the number of req/sec is 4x lower.&lt;/p&gt;

&lt;p&gt;In this test, the synchronous version starts deteriorating at 4-5x the number of concurrent users of the prior 500ms-delay test.&lt;/p&gt;

</description>
      <category>reactive</category>
      <category>java</category>
      <category>springboot</category>
      <category>async</category>
    </item>
    <item>
      <title>Moving from a Database Mindset to a Data Lake Mindset</title>
      <dc:creator>Gonçalo Trincão Cunha</dc:creator>
      <pubDate>Fri, 22 Jul 2022 17:13:40 +0000</pubDate>
      <link>https://dev.to/trincao/moving-from-a-database-mindset-to-a-data-lake-mindset-kan</link>
      <guid>https://dev.to/trincao/moving-from-a-database-mindset-to-a-data-lake-mindset-kan</guid>
      <description>&lt;p&gt;Image by: Joel Ambass&lt;/p&gt;

&lt;h1&gt;
  
  
  Three paradigm shifts when working with a Data Lake
&lt;/h1&gt;

&lt;p&gt;There are several key conceptual differences between working with databases and Data Lakes.&lt;br&gt;
In this post, let’s identify some of these differences which may not be intuitive at first sight, especially for people with a strong relational database background.&lt;/p&gt;



&lt;h2&gt;
  
  
  The server is disposable. The data is in the Cloud.
&lt;/h2&gt;

&lt;p&gt;Decoupled storage and compute: This is a classic when talking about Data Lakes.&lt;/p&gt;

&lt;p&gt;In traditional database systems (and initial Hadoop-based Data Lakes), storage is tightly coupled with computing servers. The servers either have the storage built-in or are directly connected to the storage.&lt;/p&gt;

&lt;p&gt;In modern cloud-based Data Lake architectures, data storage and compute are independent. Data is held in cloud object storage (e.g., AWS S3, Azure Storage), usually in an open format like Parquet, and compute servers are stateless: they can be started or shut down whenever necessary.&lt;/p&gt;

&lt;p&gt;Having a decoupled storage and compute enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower computing costs&lt;/strong&gt;: The servers run only when necessary; when unused, they can be shut down, lowering compute costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: You don’t have to acquire the hardware for peak usage. The number of servers/CPUs/memory can be scaled up/down dynamically according to current usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxing&lt;/strong&gt;: The same data can be read simultaneously by multiple compute servers/clusters. This allows you to have multiple teams, in separate clusters, working in parallel reading the same data without affecting each other.&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  RAW data is king! Curated data is just derived.
&lt;/h2&gt;

&lt;p&gt;In the database paradigm, once the data from source systems has been transformed and loaded into database tables, the original source data is no longer considered useful. In the Data Lake paradigm, RAW data is kept as the source of truth, potentially forever, because it is the real asset.&lt;/p&gt;

&lt;p&gt;RAW data, however, is typically unsuitable for consumption by business users, therefore it goes through a curation process to improve its quality, provide structure and ease consumption. Curated data is finally stored for feeding data science teams, data warehouses, reporting systems, and general consumption by business users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TLjHpPrD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q48pily8pgk63jl0dhyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TLjHpPrD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q48pily8pgk63jl0dhyw.png" alt="Data Lake Curation" width="551" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typical Data Lake consumers only see the curated data and therefore they value curated data much more than the RAW data which generated it.&lt;/p&gt;

&lt;p&gt;However, the true asset of the Data Lake is the RAW data (along with the curation pipeline) and, in a sense, curated data is similar to a materialized view that can be refreshed at any time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Curated data can be recreated from RAW at any time.&lt;/li&gt;
&lt;li&gt;It can be recreated with an improved curation process.&lt;/li&gt;
&lt;li&gt;We can have multiple curated views, each for a specific analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Schema decisions taken today don’t constrain future requirements
&lt;/h2&gt;

&lt;p&gt;Often the information requirements change and some piece of information not originally collected from the source/operational system needs to be analyzed.&lt;/p&gt;

&lt;p&gt;In a typical scenario, if the original RAW data isn’t stored, the historical data is lost forever.&lt;/p&gt;

&lt;p&gt;However, in a Data Lake architecture, the decision taken today that a field is not to be loaded on the curated schema can be reversed later, because all the detailed information is safely stored in the RAW area of the Data Lake and the historical curated data can be recreated with the additional fields.&lt;/p&gt;
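&lt;p&gt;A toy illustration of this reprocessing in plain Python (the field names are made up): RAW events keep every field from the source, the curation step projects only what is needed today, and adding a field later is just a rerun over RAW.&lt;/p&gt;

```python
# RAW events keep every field from the source system
raw_events = [
    {"order_id": 1, "amount": 25.0, "channel": "web", "country": "PT"},
    {"order_id": 2, "amount": 40.0, "channel": "app", "country": "DE"},
]

def curate(events, fields):
    """Project RAW events onto the curated schema."""
    return [{f: e[f] for f in fields} for e in events]

# Today's curated schema ignores `country`
curated_v1 = curate(raw_events, ["order_id", "amount"])

# New requirement: analysts need `country`. Because RAW was kept,
# the historical curated data is simply recreated with the extra field.
curated_v2 = curate(raw_events, ["order_id", "amount", "country"])

print(curated_v2[0])  # {'order_id': 1, 'amount': 25.0, 'country': 'PT'}
```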

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V6NCFSkJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bba5oxkv2ll5dfg4kmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V6NCFSkJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bba5oxkv2ll5dfg4kmw.png" alt="Curated schema evolution (Image by author)" width="441" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Don’t spend a lot of time trying to create a generic one-size-fits-all curated schema if you don’t need it right now.&lt;/li&gt;
&lt;li&gt;Create the curated schema iteratively, starting with the fields you need right now.&lt;/li&gt;
&lt;li&gt;When additional fields are required, add them to the curation process and reprocess.&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Data Lakes are not a replacement for databases; each tool has its sweet spots and Achilles’ heels.&lt;/p&gt;

&lt;p&gt;It is probably as bad an idea to use a Data Lake for OLTP as it is to use a database to store terabytes of unstructured data.&lt;/p&gt;

&lt;p&gt;I hope this post helped to shed some light on some of the key design differences between both systems.&lt;/p&gt;

</description>
      <category>datalake</category>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>database</category>
    </item>
    <item>
      <title>Speeding up Stream-Static Joins on Apache Spark</title>
      <dc:creator>Gonçalo Trincão Cunha</dc:creator>
      <pubDate>Fri, 22 Jul 2022 16:54:10 +0000</pubDate>
      <link>https://dev.to/trincao/speeding-up-stream-static-joins-on-apache-spark-3gdg</link>
      <guid>https://dev.to/trincao/speeding-up-stream-static-joins-on-apache-spark-3gdg</guid>
      <description>&lt;p&gt;Some time ago I came across a use case where a spark structured streaming job required a join with static data located on very large table.&lt;/p&gt;

&lt;p&gt;The first approach taken wasn’t really great. Even with small micro-batches, it increased the batch processing time by orders of magnitude.&lt;/p&gt;

&lt;p&gt;A (very) simplified example of this case could be a stream of sales events that needs to be merged with additional product information located on a large table of products.&lt;/p&gt;

&lt;p&gt;This post is about using mapPartitions to join Spark Structured Streaming data frames with static data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach #1 — Stream-Static Join
&lt;/h2&gt;

&lt;p&gt;The first approach involved a join of the sales events data frame with the static products table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywunqr0w7ilivq7qjouj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywunqr0w7ilivq7qjouj.png" alt="Stream-static Join"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;

&lt;p&gt;Unfortunately, the join caused each micro-batch to do a full scan of the product table, resulting in a high batch duration even if the stream had a single record to process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq55pr96g44h4dwb428s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq55pr96g44h4dwb428s.png" alt="join performance"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;

&lt;p&gt;The code went like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// streamingDS = … Sales stream initialization …
// Read static product table
val staticDS = spark.read
  .format("parquet")
  .load("/tmp/prods.parquet").as[Product]
// Join of sales stream with products table
streamingDS
  .joinWith(staticDS, 
    streamingDS("productId")===staticDS("productId") &amp;amp;&amp;amp;
    streamingDS("category")===staticDS("category"))
  .map{ 
    case (sale,product) =&amp;gt; new SaleInfo(sale, Some(product))
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using a small demo application, the DAG shows the culprit:&lt;/p&gt;

&lt;p&gt;The partitioning of the static table was ignored, and thus all rows of all partitions (in this case 5) were read.&lt;br&gt;
The full table scan of the product table added &amp;gt;1 min to the micro-batch duration, even when there was only one event to process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnylijxxi1lunrn0nlwu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnylijxxi1lunrn0nlwu7.png" alt="join DAG"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;
&lt;h2&gt;
  
  
  Approach #2 — mapPartitions
&lt;/h2&gt;

&lt;p&gt;The second approach was based on a lookup to a key-value store for each sale event via Spark’s mapPartitions operation, which lets you transform a data frame/data set one partition at a time, with row-level control inside each partition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbejls7trj3x0v4t0a03i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbejls7trj3x0v4t0a03i.png" alt="mapPartitions approach"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;

&lt;p&gt;Neither Parquet nor Delta tables are suitable for individual key lookups, so the prerequisite for this scenario is to have the product information loaded into a key-value store (MongoDB in this example).&lt;/p&gt;

&lt;p&gt;The sample code is a bit more complex, but in certain cases it is well worth the effort to keep the batch duration low, especially with small micro-batches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// streamingDS = … Sales stream initialization …
streamingDS.mapPartitions(partition =&amp;gt; {
  // setup DB connection
  val dbService = new ProductService()
  dbService.connect()

  partition.map(sale =&amp;gt; {
    // Product lookup and merge
    val product = dbService.findProduct(sale.productId)
    new SaleInfo(sale, Some(product))
  }).iterator
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
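&lt;p&gt;The shape of the pattern, stripped of Spark, in a plain-Python sketch (the in-memory ProductService below is a hypothetical stand-in for the key-value store client): the connection is set up once per partition, then each record does a single key lookup.&lt;/p&gt;

```python
# Hypothetical stand-in for the key-value store client
class ProductService:
    def connect(self):
        self.products = {1: "keyboard", 2: "mouse"}

    def find_product(self, product_id):
        return self.products[product_id]

def map_partition(sales):
    # One connection per partition, reused for every record in it
    service = ProductService()
    service.connect()
    for sale in sales:
        # Single-key lookup instead of a full scan of the product table
        yield {**sale, "product": service.find_product(sale["product_id"])}

partition = [{"sale_id": 10, "product_id": 1}, {"sale_id": 11, "product_id": 2}]
print(list(map_partition(partition)))
```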



&lt;p&gt;The new batch duration graph shows that the problem is long gone, and we’re back to a short batch duration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vwgqm109tfz2w48knka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vwgqm109tfz2w48knka.png" alt="mapPartitions performance"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;

&lt;p&gt;Hope you enjoyed reading! Please let me know if you have better approaches to this problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test details&lt;/strong&gt;: Spark version 3.2.1 running on Ubuntu 20.04 LTS / WSL2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Code&lt;/strong&gt;: &lt;a href="https://github.com/trincaog/spark-mappartitions-test" rel="noopener noreferrer"&gt;https://github.com/trincaog/spark-mappartitions-test&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by Marc Sendra Martorell on Unsplash&lt;/p&gt;

</description>
      <category>apachespark</category>
      <category>streaming</category>
      <category>performance</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
