<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gonçalo Trincão Cunha</title>
    <description>The latest articles on DEV Community by Gonçalo Trincão Cunha (@trincao).</description>
    <link>https://dev.to/trincao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F896214%2F781e6e16-98fa-497e-aa5f-0607a918bdc2.png</url>
      <title>DEV Community: Gonçalo Trincão Cunha</title>
      <link>https://dev.to/trincao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/trincao"/>
    <language>en</language>
    <item>
      <title>Rubik Cube Simulation in Python: repeat until solved</title>
      <dc:creator>Gonçalo Trincão Cunha</dc:creator>
      <pubDate>Wed, 27 Jul 2022 17:51:00 +0000</pubDate>
      <link>https://dev.to/trincao/rubik-cube-simulation-repeat-until-solved-2o80</link>
      <guid>https://dev.to/trincao/rubik-cube-simulation-repeat-until-solved-2o80</guid>
      <description>&lt;p&gt;If you repeat any sequence of moves on a Rubik cube enough times, the cube will return to the initial (solved) state.&lt;br&gt;
This happens no matter how simple or complex is the chosen sequence.&lt;/p&gt;

&lt;p&gt;Each sequence has a length (number of moves in the sequence) and a period, or group order, which is the number of times it must be repeated until the cube returns to the solved state.&lt;/p&gt;

&lt;p&gt;Example sequence: F' L' F' L&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdc3pg8fz7779dr5lv3np.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdc3pg8fz7779dr5lv3np.png" alt="Sequence"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Sequence Period: How many times to repeat?
&lt;/h2&gt;

&lt;p&gt;On a 3x3x3 cube, depending on the sequence chosen, the period may be as low as 1 or as high as 1260. Here are a few examples.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period&lt;/th&gt;
&lt;th&gt;Sequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;L L'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;R, D, F, F, D', R'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;U&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;F' L' F' L&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;R' D' R D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;U', R'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1260&lt;/td&gt;
&lt;td&gt;R' U' R D D U' F R&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
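&lt;p&gt;Under the hood, each sequence is a permutation of the cube's stickers, and its period is the order of that permutation: the least common multiple of its cycle lengths. A minimal, cube-independent sketch in plain Python (the permutation below is an arbitrary example, not an actual cube move):&lt;/p&gt;

```python
from math import gcd

def permutation_order(perm):
    """Order of a permutation given as a list: perm[i] is where i goes.

    Equals the least common multiple of its cycle lengths.
    """
    seen = set()
    order = 1
    for start in range(len(perm)):
        if start in seen:
            continue
        # Walk the cycle containing `start` and measure its length
        length, current = 0, start
        while current not in seen:
            seen.add(current)
            current = perm[current]
            length += 1
        order = order * length // gcd(order, length)
    return order

# A 3-cycle (0 1 2) and a 2-cycle (3 4): order is lcm(3, 2) = 6
print(permutation_order([1, 2, 0, 4, 3]))  # 6
```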

&lt;p&gt;On a 4x4x4 cube, the periods can be much larger, even reaching 765765.&lt;br&gt;
Here are some examples.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Period&lt;/th&gt;
&lt;th&gt;Sequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;L L'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;U&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;R' D' R D&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;2L Bw&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12240&lt;/td&gt;
&lt;td&gt;Rw, U, Fw, Bw, F&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;765765&lt;/td&gt;
&lt;td&gt;R R Rw Uw Dw Dw&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Analysis of the Sequence Period
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Main question&lt;/strong&gt;: Given a random sequence of N moves, what is the average sequence period?&lt;/p&gt;

&lt;p&gt;Although there are &lt;a href="https://people.kth.se/~boij/kandexjobbVT11/Material/rubikscube.pdf" rel="noopener noreferrer"&gt;mathematical approaches&lt;/a&gt; to answer this question, we're using a simulation approach with the Python library &lt;a href="https://github.com/trincaog/magiccube" rel="noopener noreferrer"&gt;magiccube&lt;/a&gt;, which is a fast Rubik Cube simulator.&lt;/p&gt;

&lt;p&gt;The simulation is run 1000 times. Each run executes the sequence until the cube returns to the original state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cube&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;magiccube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Cube&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run the simulation N times
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Generate a random sequence
&lt;/span&gt;    &lt;span class="n"&gt;moves&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_random_moves&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute the sequence
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rotate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;moves&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Check if the cube is finished
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cube&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_done&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
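&lt;p&gt;Each run prints one period, so the simulation output is just a list of numbers. A small aggregation sketch (the periods below are illustrative, not actual simulation output):&lt;/p&gt;

```python
from collections import Counter
from statistics import mean, median

# Periods collected from simulation runs (illustrative values)
periods = [4, 6, 6, 12, 36, 105, 360, 1260]

print("runs:  ", len(periods))     # 8
print("mean:  ", mean(periods))    # 223.625
print("median:", median(periods))  # 24.0

# Frequency of each period, useful for plotting the decay
histogram = Counter(periods)
print(histogram[6])                # 2
```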



&lt;h2&gt;
  
  
  Sequence Period decay
&lt;/h2&gt;

&lt;p&gt;The distribution of sequence periods shows an exponential decay: most sequences have small periods, and few have large periods.&lt;/p&gt;

&lt;p&gt;Using a sequence length of 30 random moves on a 3x3x3 cube, we can see the distribution of period sizes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faenhdhnb0pc8c0ivi25g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faenhdhnb0pc8c0ivi25g.png" alt="Histogram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequence length vs period
&lt;/h2&gt;

&lt;p&gt;The sequence period is typically smaller for shorter sequences, but beyond a certain threshold the period stops increasing.&lt;br&gt;
On the 3x3x3 cube, the threshold is around 2-5 moves.&lt;br&gt;
On the 4x4x4 cube, it is around 11-16 moves.&lt;br&gt;
On the 5x5x5 cube, it is in excess of 20 moves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs15tsojuz6yh14ssfcqz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs15tsojuz6yh14ssfcqz.png" alt="Sequence length vs period"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Hope you enjoyed it. &lt;br&gt;
If you are a fan of the Rubik cube, you can use the open-source Python library &lt;a href="https://github.com/trincaog/magiccube" rel="noopener noreferrer"&gt;magiccube&lt;/a&gt; to solve the cube and run simulations.&lt;/p&gt;

</description>
      <category>rubik</category>
      <category>python</category>
      <category>cube</category>
      <category>simulation</category>
    </item>
    <item>
      <title>Reactive vs Synch Performance Test with Spring Boot</title>
      <dc:creator>Gonçalo Trincão Cunha</dc:creator>
      <pubDate>Fri, 22 Jul 2022 19:25:00 +0000</pubDate>
      <link>https://dev.to/trincao/reactive-vs-synch-performance-test-with-spring-boot-3d7m</link>
      <guid>https://dev.to/trincao/reactive-vs-synch-performance-test-with-spring-boot-3d7m</guid>
      <description>&lt;p&gt;Reactive is a programming paradigm that uses asynchronous programming. It inherits the concurrency efficiency of the asynchronous model with the ease of use of declarative programming.&lt;/p&gt;

&lt;p&gt;Multithreading is able to parallelize work on multiple CPUs, but when an IO operation is issued, the thread blocks waiting for the IO to complete.&lt;/p&gt;

&lt;p&gt;Reactive/Async does not parallelize work on multiple CPUs, but when an IO operation is issued, the CPU is handed over to the next task in the event loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typically, using multiple processes or threads is better for CPU-bound systems, while async/reactive is better for IO-bound systems.&lt;/strong&gt;&lt;/p&gt;
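&lt;p&gt;The IO-bound case is easy to illustrate with Python's asyncio (a sketch of the event-loop model, not the Java setup tested below): ten simulated IO calls of 100 ms each complete in roughly 100 ms total, because the event loop hands the CPU to the next task whenever one awaits.&lt;/p&gt;

```python
import asyncio
import time

async def fake_io_call(i):
    # Simulated IO: the event loop is free to run other tasks while we wait
    await asyncio.sleep(0.1)
    return i

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_io_call(i) for i in range(10)))
    elapsed = time.perf_counter() - start
    print(f"{len(results)} calls in {elapsed:.2f}s")  # ~0.10s, not 1s

asyncio.run(main())
```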

&lt;p&gt;NOTE: This is a modified repost of a test done back in 2018. Photo by &lt;a href="https://unsplash.com/@andreuuuw?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Andrew Wulf&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/many?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Easier Asynchronous Programming
&lt;/h2&gt;

&lt;p&gt;Let’s see an example of a method that fetches a user from a database, performs some conversions and transformations, and then displays the results.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;synchronous&lt;/strong&gt; version looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User user = getUserFromDBSync(id);
user = convertUser(user);
user = processResult(user);
displayResults(user);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty straightforward.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;async&lt;/strong&gt; version with callbacks leads to deeply nested code, the infamous “callback hell”.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;getUserFromDB(id, user -&amp;gt; {
  convertUser(user, convertedUser -&amp;gt; {
    processResult(convertedUser, processedUser -&amp;gt; {
      displayResults(processedUser);
    });
  });
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the same example with the &lt;strong&gt;reactive&lt;/strong&gt; approach. It is much more readable and maintainable than the async/callback version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;getUserFromDBAsync(id)
  .map(user -&amp;gt; convertUser(user))
  .map(user -&amp;gt; processResult(user))
  .subscribe(user -&amp;gt; displayResults(user));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Improved multi-tasking
&lt;/h2&gt;

&lt;p&gt;On the concurrency topic, I’ve decided to run a small test to evaluate the difference between the reactive and synchronous versions for IO-bound operations.&lt;/p&gt;

&lt;p&gt;You can get the test project here &lt;a href="https://github.com/trincaog/reactivetest" rel="noopener noreferrer"&gt;https://github.com/trincaog/reactivetest&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The test setup is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load testing client (Gatling)&lt;/li&gt;
&lt;li&gt;Test Service (Spring Boot)&lt;/li&gt;
&lt;li&gt;External backend service (simulated)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test backend service
&lt;/h3&gt;

&lt;p&gt;The test backend service simulates a query to an external service (e.g., a database) that takes some time to return a list of records. For simplicity, the test doesn’t send a query to a real database; instead, it simulates the response delay.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synchronous version setup:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Spring Boot 2.0 (2.0.0.RC1) application / Spring MVC framework&lt;/li&gt;
&lt;li&gt;Embedded Tomcat container with max threads=10,000 (a large number, to avoid queued requests)&lt;/li&gt;
&lt;li&gt;Hosted on AWS ECS/Fargate with 256 mCPU / 2GB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reactive version setup:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Spring Boot 2.0 (2.0.0.RC1) application / Spring Webflux framework&lt;/li&gt;
&lt;li&gt;Netty framework&lt;/li&gt;
&lt;li&gt;Hosted on AWS ECS/Fargate with 256 mCPU / 2GB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Load Testing Client
&lt;/h2&gt;

&lt;p&gt;The following components were used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS EC2 t2.small 1vCPU / 2GB RAM&lt;/li&gt;
&lt;li&gt;Gatling 2.3.0&lt;/li&gt;
&lt;li&gt;Continuous request loop without any delay between requests&lt;/li&gt;
&lt;li&gt;Two configurations of the external service: one with a 500ms response time; another with a 2,000ms response time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Load Test #1: External Service Delay 500ms
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskacbrghtnbf2hn70vd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskacbrghtnbf2hn70vd8.png" alt="Load test #1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With &amp;lt;=100 concurrent requests, the response times of the two versions are very similar.&lt;/p&gt;

&lt;p&gt;After 200 concurrent users, the response times of the synchronous/Tomcat version start deteriorating, while the reactive version with Netty holds up until 2,000 concurrent users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Load Test #2: External Service Delay 2,000ms
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ipt7txf7imyphk1qteg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ipt7txf7imyphk1qteg.png" alt="Load test #2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This test uses a much slower backing service (4x slower), yet the service sustains a much larger load. This happens because, for the same number of concurrent users, the number of req/sec is 4x lower.&lt;/p&gt;

&lt;p&gt;In this test, the synchronous version starts deteriorating at 4-5x the number of concurrent users of the prior 500ms-delay test.&lt;/p&gt;

</description>
      <category>reactive</category>
      <category>java</category>
      <category>springboot</category>
      <category>async</category>
    </item>
    <item>
      <title>Moving from a Database Mindset to a Data Lake Mindset</title>
      <dc:creator>Gonçalo Trincão Cunha</dc:creator>
      <pubDate>Fri, 22 Jul 2022 17:13:40 +0000</pubDate>
      <link>https://dev.to/trincao/moving-from-a-database-mindset-to-a-data-lake-mindset-kan</link>
      <guid>https://dev.to/trincao/moving-from-a-database-mindset-to-a-data-lake-mindset-kan</guid>
      <description>&lt;p&gt;Image by: Joel Ambass&lt;/p&gt;

&lt;h1&gt;
  
  
  Three paradigm shifts when working with a Data Lake
&lt;/h1&gt;

&lt;p&gt;There are several key conceptual differences between working with databases and Data Lakes.&lt;br&gt;
In this post, let’s identify some of these differences which may not be intuitive at first sight, especially for people with a strong relational database background.&lt;/p&gt;



&lt;h2&gt;
  
  
  The server is disposable. The data is in the Cloud.
&lt;/h2&gt;

&lt;p&gt;Decoupled storage and compute: This is a classic when talking about Data Lakes.&lt;/p&gt;

&lt;p&gt;In traditional database systems (and initial Hadoop-based Data Lakes), storage is tightly coupled with computing servers. The servers either have the storage built-in or are directly connected to the storage.&lt;/p&gt;

&lt;p&gt;In modern cloud-based Data Lake architectures, data storage and compute are independent. Data is held in cloud object storage (e.g., AWS S3, Azure Storage), usually in an open format like Parquet, and compute servers are stateless: they can be started or shut down whenever necessary.&lt;/p&gt;

&lt;p&gt;Having a decoupled storage and compute enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lower computing costs&lt;/strong&gt;: The servers run only when necessary; when unused, they can be shut down, lowering compute costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: You don’t have to acquire the hardware for peak usage. The number of servers/CPUs/memory can be scaled up/down dynamically according to current usage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxing&lt;/strong&gt;: The same data can be read simultaneously by multiple compute servers/clusters. This allows you to have multiple teams, in separate clusters, working in parallel reading the same data without affecting each other.&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  RAW data is king! Curated data is just derived.
&lt;/h2&gt;

&lt;p&gt;In the database paradigm, once the data from source systems has been transformed and loaded into database tables, the original source data is no longer considered useful. In the Data Lake paradigm, RAW data is kept as the source of truth, potentially forever, because it is the real asset.&lt;/p&gt;

&lt;p&gt;RAW data, however, is typically unsuitable for consumption by business users, therefore it goes through a curation process to improve its quality, provide structure and ease consumption. Curated data is finally stored for feeding data science teams, data warehouses, reporting systems, and general consumption by business users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TLjHpPrD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q48pily8pgk63jl0dhyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TLjHpPrD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q48pily8pgk63jl0dhyw.png" alt="Data Lake Curation" width="551" height="87"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typical Data Lake consumers only see the curated data and therefore they value curated data much more than the RAW data which generated it.&lt;/p&gt;

&lt;p&gt;However, the true asset of the Data Lake is the RAW data (along with the curation pipeline) and, in a sense, curated data is similar to a materialized view that can be refreshed at any time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Curated data can be recreated from RAW at any time.&lt;/li&gt;
&lt;li&gt;It can be recreated with an improved curation process.&lt;/li&gt;
&lt;li&gt;We can have multiple curated views, each for a specific analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Schema decisions taken today don’t constrain future requirements
&lt;/h2&gt;

&lt;p&gt;Often the information requirements change and some piece of information not originally collected from the source/operational system needs to be analyzed.&lt;/p&gt;

&lt;p&gt;In a typical scenario, if the original RAW data isn’t stored, the historical data is lost forever.&lt;/p&gt;

&lt;p&gt;However, in a Data Lake architecture, the decision taken today that a field is not to be loaded on the curated schema can be reversed later, because all the detailed information is safely stored in the RAW area of the Data Lake and the historical curated data can be recreated with the additional fields.&lt;/p&gt;
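&lt;p&gt;A toy illustration of this reprocessing in plain Python (the field names are made up): RAW events keep every field from the source, the curation step projects only what is needed today, and adding a field later is just a rerun over RAW.&lt;/p&gt;

```python
# RAW events keep every field from the source system
raw_events = [
    {"order_id": 1, "amount": 25.0, "channel": "web", "country": "PT"},
    {"order_id": 2, "amount": 40.0, "channel": "app", "country": "DE"},
]

def curate(events, fields):
    """Project RAW events onto the curated schema."""
    return [{f: e[f] for f in fields} for e in events]

# Today's curated schema ignores `country`
curated_v1 = curate(raw_events, ["order_id", "amount"])

# New requirement: analysts need `country`. Because RAW was kept,
# the historical curated data is simply recreated with the extra field.
curated_v2 = curate(raw_events, ["order_id", "amount", "country"])

print(curated_v2[0])  # {'order_id': 1, 'amount': 25.0, 'country': 'PT'}
```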

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V6NCFSkJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bba5oxkv2ll5dfg4kmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V6NCFSkJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5bba5oxkv2ll5dfg4kmw.png" alt="Curated schema evolution (Image by author)" width="441" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Don’t spend a lot of time trying to create a generic one-size-fits-all curated schema if you don’t need it right now.&lt;/li&gt;
&lt;li&gt;Create the curated schema iteratively, starting with the fields you need right now.&lt;/li&gt;
&lt;li&gt;When additional fields are required, add them to the curation process and reprocess.&lt;/li&gt;
&lt;/ul&gt;



&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Data Lakes are not a replacement for databases; each tool has its sweet spots and Achilles’ heels.&lt;/p&gt;

&lt;p&gt;It is probably as bad an idea to use a Data Lake for OLTP as it is to use a database to store terabytes of unstructured data.&lt;/p&gt;

&lt;p&gt;I hope this post helped to shed some light on some of the key design differences between both systems.&lt;/p&gt;

</description>
      <category>datalake</category>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>database</category>
    </item>
    <item>
      <title>Speeding up Stream-Static Joins on Apache Spark</title>
      <dc:creator>Gonçalo Trincão Cunha</dc:creator>
      <pubDate>Fri, 22 Jul 2022 16:54:10 +0000</pubDate>
      <link>https://dev.to/trincao/speeding-up-stream-static-joins-on-apache-spark-3gdg</link>
      <guid>https://dev.to/trincao/speeding-up-stream-static-joins-on-apache-spark-3gdg</guid>
      <description>&lt;p&gt;Some time ago I came across a use case where a spark structured streaming job required a join with static data located on very large table.&lt;/p&gt;

&lt;p&gt;The first approach taken wasn’t really great. Even with small micro-batches, it increased the batch processing time by orders of magnitude.&lt;/p&gt;

&lt;p&gt;A (very) simplified example of this case could be a stream of sales events that needs to be merged with additional product information located on a large table of products.&lt;/p&gt;

&lt;p&gt;This post is about using mapPartitions to join Spark Structured Streaming data frames with static data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach #1 — Stream-Static Join
&lt;/h2&gt;

&lt;p&gt;The first approach involved a join of the sales events data frame with the static products table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywunqr0w7ilivq7qjouj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fywunqr0w7ilivq7qjouj.png" alt="Stream-static Join"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;

&lt;p&gt;Unfortunately, the join caused each micro-batch to do a full scan of the product table, resulting in a high batch duration even if the stream had a single record to process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq55pr96g44h4dwb428s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpq55pr96g44h4dwb428s.png" alt="join performance"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;

&lt;p&gt;The code went like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// streamingDS = … Sales stream initialization …
// Read static product table
val staticDS = spark.read
  .format("parquet")
  .load("/tmp/prods.parquet").as[Product]
// Join of sales stream with products table
streamingDS
  .joinWith(staticDS, 
    streamingDS("productId")===staticDS("productId") &amp;amp;&amp;amp;
    streamingDS("category")===staticDS("category"))
  .map{ 
    case (sale,product) =&amp;gt; new SaleInfo(sale, Some(product))
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using a small demo application, the DAG shows the culprit:&lt;/p&gt;

&lt;p&gt;The partitioning of the static table was ignored, and thus all rows of all partitions (in this case 5) were read.&lt;br&gt;
The full table scan of the product table added &amp;gt;1 min to the micro-batch duration, even when there was only one event to process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnylijxxi1lunrn0nlwu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnylijxxi1lunrn0nlwu7.png" alt="join DAG"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;
&lt;h2&gt;
  
  
  Approach #2 — mapPartitions
&lt;/h2&gt;

&lt;p&gt;The second approach was based on a lookup to a key-value store for each sale event via Spark’s mapPartitions operation, which lets you transform a data frame/data set one partition at a time, with row-level control inside each partition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbejls7trj3x0v4t0a03i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbejls7trj3x0v4t0a03i.png" alt="mapPartitions approach"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;

&lt;p&gt;Neither Parquet nor Delta tables are suitable for individual key lookups, so the prerequisite for this scenario is to have the product information loaded into a key-value store (MongoDB in this example).&lt;/p&gt;

&lt;p&gt;The sample code is a bit more complex, but in certain cases it is well worth the effort to keep the batch duration low, especially with small micro-batches.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// streamingDS = … Sales stream initialization …
streamingDS.mapPartitions(partition =&amp;gt; {
  // setup DB connection
  val dbService = new ProductService()
  dbService.connect()

  partition.map(sale =&amp;gt; {
    // Product lookup and merge
    val product = dbService.findProduct(sale.productId)
    new SaleInfo(sale, Some(product))
  }).iterator
})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
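&lt;p&gt;The shape of the pattern, stripped of Spark, in a plain-Python sketch (the in-memory ProductService below is a hypothetical stand-in for the key-value store client): the connection is set up once per partition, then each record does a single key lookup.&lt;/p&gt;

```python
# Hypothetical stand-in for the key-value store client
class ProductService:
    def connect(self):
        self.products = {1: "keyboard", 2: "mouse"}

    def find_product(self, product_id):
        return self.products[product_id]

def map_partition(sales):
    # One connection per partition, reused for every record in it
    service = ProductService()
    service.connect()
    for sale in sales:
        # Single-key lookup instead of a full scan of the product table
        yield {**sale, "product": service.find_product(sale["product_id"])}

partition = [{"sale_id": 10, "product_id": 1}, {"sale_id": 11, "product_id": 2}]
print(list(map_partition(partition)))
```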



&lt;p&gt;The new batch duration graph shows that the problem is long gone, and we’re back to a short batch duration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vwgqm109tfz2w48knka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vwgqm109tfz2w48knka.png" alt="mapPartitions performance"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;Image by Author&lt;/center&gt;

&lt;p&gt;Hope you enjoyed reading! Please let me know if you have better approaches to this problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test details&lt;/strong&gt;: Spark version 3.2.1 running on Ubuntu 20.04 LTS / WSL2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test Code&lt;/strong&gt;: &lt;a href="https://github.com/trincaog/spark-mappartitions-test" rel="noopener noreferrer"&gt;https://github.com/trincaog/spark-mappartitions-test&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by Marc Sendra Martorell on Unsplash&lt;/p&gt;

</description>
      <category>apachespark</category>
      <category>streaming</category>
      <category>performance</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
