DEV Community: Scott Sotka

Continued Learnings

Scott Sotka — Thu, 04 Jul 2024 21:30:31 +0000

After writing my original post, "Learnings in Raku and Pg Concurrency", I learned a few more things that make the script work better and become more raku-ish.

Learning 1. Try was not needed. In the previous post I showed this snippet of the code doing the main work.

my $q = @batch.join("\n");
try {
   my @res = $dbh.execute($q).values;
}
CATCH {
   default {
      say "Error: $_";
   }
}

In this snippet you see paired try/CATCH blocks. In the other languages I'm familiar with that support try and catch, the two almost always go together: if code in the try block throws an exception, the catch block handles it. In Raku, however, a CATCH block handles exceptions thrown anywhere in the block that encloses it, so no try is required.

A try can be used alone if you simply want to keep your code from dying because of an exception. In the documentation it is explained as "a try block is a normal block which implicitly turns on the use fatal pragma and includes an implicit CATCH block that drops the exception, which means you can use it to contain them".
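To make the difference concrete, here is a minimal sketch of the three combinations: a CATCH on its own, a try on its own, and a try with a CATCH in the enclosing block. The @log array and the error messages are made up for illustration; nothing here comes from the update script.

```raku
my @log;

# 1. A CATCH block on its own handles exceptions thrown in its enclosing block.
{
    die "first failure";
    CATCH {
        default { @log.push: "CATCH saw: " ~ .message }
    }
}

# 2. A try on its own contains the exception: the block evaluates to Nil
#    and the exception is left in $!.
my $value = try { die "second failure" };
@log.push: "try gave a defined value: " ~ $value.defined;
@log.push: 'try left in $!: ' ~ $!.message;

# 3. A try with a CATCH in the enclosing block: the try's implicit CATCH
#    drops the exception first, so the outer CATCH has nothing to handle.
{
    try { die "third failure" };
    CATCH {
        default { @log.push: "outer CATCH saw: " ~ .message }
    }
}

.say for @log;
```

Only the first two cases add anything to @log. The third adds nothing, which is the situation the original snippet was in: the try contained the error before the CATCH could report it.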

When I learned this I changed that block to simplify it.

my $q = @batch.join("\n");

$dbh.execute($q);

CATCH {
    default {
        say "Error: $_";
    }
}

Learning 2. DB Indexes. After refreshing the data in our non-prod environment, I was working with fresh indexes on the tables I was updating. I mentioned in the previous article that I was able to update 4.5M rows within 30 minutes. That figure came from testing on the non-refreshed DB, where I had not run a fresh ANALYZE on the tables before the updates. Working with fresh indexes, those 4.5M updates finished in 6 minutes.

So keep your indexes up to date. As we don't do many deletes I am assuming there were very few zombie rows (deleted but not expunged by the database) that had to be dealt with in the tests.

Learning 3. Atomicints. The lesson I learned about using an atomicint when concurrent threads each need to update a value, in this case the number of rows that have been processed, still applies. However I found an alternative to this in my case.

I decided to set up a Supply that the threads could use to emit the number of rows each had just processed. I also set up a tap for that Supply, which then adds those values to the overall number of rows processed.

At first I worried that this would simply have the same problem and would also require an atomicint. However, the documentation for Supply says that "A supply is a thread-safe, asynchronous data stream", so I decided to give it a try. The Supply is probably using locking behind the scenes, but that is now hidden from me and doesn't slow down the update threads. I removed the atomicint type from $lines-done and set up the supplier and tap.

my $supplier = Supplier.new;
my $supply = $supplier.Supply;
$supply.tap: -> $batch {
    $lines-done += $batch;
}

Then in the main thread code I added an emit call.

$supplier.emit(@batch.elems);

Testing did not turn up any problems (which I realize is not proof that it won't at some time in the future).
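For anyone who wants to see the Supplier and tap wiring in isolation, here is a small sketch. It deliberately emits from a single thread so the total is deterministic; it does not demonstrate or prove anything about emitting from many threads at once. The @lines data and the batch size of 3 are placeholders, not values from my script.

```raku
my $lines-done = 0;

# The Supplier is the sending end; its Supply is what gets tapped.
my $supplier = Supplier.new;
my $supply   = $supplier.Supply;
$supply.tap: -> $count {
    $lines-done += $count;
}

# Placeholder data standing in for the lines of the SQL script.
my @lines = (1..10).map: { "statement $_" };

# Emit the size of each batch, as the update code does after a batch runs.
for @lines.rotor(3, :partial) -> @batch {
    $supplier.emit(@batch.elems);
}

say $lines-done;
```

With ten lines in batches of three, the tap receives 3, 3, 3 and 1, so the script prints 10.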

Here is the cleaned up code:

#!/usr/bin/env raku

use v6;
use DB::Pg;
use Terminal::Print <T>;

multi sub MAIN ( :$host = 'host.docker.internal', :$port = 8888, :$script-path, :$batch-size, :$dbname, :$user, :$password ) {
    T.initialize-screen;
    my @columns = ^T.columns;
    my @rows = ^T.rows;

    my $conninfo = join " ",
        ('dbname=' ~ $dbname),
        ('host=' ~ $host),
        ('port=' ~ $port),
        ('user=' ~ $user),
        ('password=' ~ $password);

    # to get the total number of lines in the file shell out to wc -l
    my $lines-total = 0 + qqx{ wc -l $script-path }.words[0];
    my $lines-done = 0; 

    T.print-string( 1, @rows.elems - 8, $*PROGRAM-NAME );
    T.print-string( 1, @rows.elems - 7, "Script-path: $script-path");
    T.print-string( 1, @rows.elems - 6, "Total lines: $lines-total");

    #every second print the progress
    my $start-time = now;
    sub format-elapsed-time($elapsed) {
        my $hours = $elapsed.Int div 3600;
        my $minutes = ($elapsed.Int mod 3600) div 60;
        my $seconds = $elapsed.Int mod 60;
        return $hours.fmt("%02d") ~ ':' ~ $minutes.fmt("%02d") ~ ':' ~ $seconds.fmt("%02d");
    }

    my $update-line = @rows.elems - 5;
    my $doneline = @rows.elems - 1;
    my $progress = start {
        loop {
            #show elapsed time
            my $elapsed = now - $start-time;
            my $local-lines-done = $lines-done;
            my $local-lines-total = $lines-total;
            my $pct = (($local-lines-done / $local-lines-total) * 100).fmt("%02.2f");
            T.print-string( 1, $update-line,"Progress: $local-lines-done/$local-lines-total {$pct}%  - " ~ format-elapsed-time($elapsed) ~ " elapsed");

            sleep 1;
            last if $local-lines-done == $local-lines-total;
        }
        T.print-string( 1, $doneline, "All Queries Queued. Waiting on Database...");
    }


    my @batch;
    my $dbh = DB::Pg.new(:$conninfo);
    # check the connection
    my $res = $dbh.execute("SELECT 1");
    $dbh.finish;
    #say "Connection: $res";

    my $supplier = Supplier.new;
    my $supply = $supplier.Supply;
    $supply.tap: -> $batch {
        $lines-done += $batch;
    }

    $script-path.IO.lines.rotor($batch-size, :partial).race.map: -> @batch { 
        my $q = @batch.join("\n");

        $dbh.execute($q);

        CATCH {
            default {
                say "Error: $_";
            }
        }

        $supplier.emit(@batch.elems);
    };

    T.shutdown-screen;
}

Learnings in Raku and Pg Concurrency

Scott Sotka — Fri, 28 Jun 2024 23:04:31 +0000

I've been making my living with Perl 5 and PostgreSQL since the mid-90s, but lately I've been giving Raku a serious look. I've published a couple of Raku modules but had not used it for work. Then recently I needed to automate a system to mask/fake data in our non-prod environments, so I decided to use Raku.

My method was simple. Once I had identified all the tables that contain PII (Personally Identifiable Information), I created SQL scripts with autogenerated fake data for the fields I needed to mask. This created a number of files containing up to 5 million rows of update statements. After I had the files, all I needed to do was run them against our non-prod databases. As you can imagine, simply running them via psql proved to be very slow. I'm sure there were any number of ways to speed this process up that I didn't think of or couldn't afford. I chose to find out if Raku concurrency would help.

I've never really used concurrency in my job and have only toyed with it outside. The main work being done in the script was simply to batch up a number of update lines from a file and send them to PostgreSQL to be executed.

my $q = @batch.join("\n");
try {
   my @res = $dbh.execute($q).values;
}
CATCH {
   default {
      say "Error: $_";
   }
}

I started by wrapping that in a for loop returning promises and awaiting each promise.

The start command returns a promise: it schedules the block to run on a thread from Raku's thread pool, and the promise is either kept (the block ran without a problem) or broken (it failed). The await preceding the for loop collects all these promises and basically tells Raku not to end the program until all the promises finish executing.

my @batch;
await for $script-file.IO.lines -> $line {
   push @batch, $line;
   if @batch.elems == $batch-size {
      start {
         <...main work code...>
      }
      @batch = ();
   }
}
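As a standalone illustration of kept versus broken promises, here is a small sketch. The two blocks are stand-ins for the real database work, and the "query failed" message is invented. It uses Promise.allof so that waiting does not rethrow the failure.

```raku
# Each start block is scheduled on the thread pool and returns a Promise.
my $ok  = start { 21 * 2 };
my $bad = start { die "query failed" };

# Promise.allof is kept once both promises are finished, whether they
# were kept or broken, so awaiting it does not rethrow the exception.
await Promise.allof($ok, $bad);

say $ok.status;          # Kept
say $ok.result;          # 42
say $bad.status;         # Broken
say $bad.cause.message;  # query failed
```

Awaiting $bad directly would rethrow its exception in the awaiting code, which is why a batch that fails inside start needs its own error handling.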

Once in the loop, spawning threads, I worried about whether the database handle was thread-safe. So I thought I would just write a connection sub that would cache the connections for each thread, identified by $*THREAD.id. That way, each thread would get one connection and could re-use it. I wrote the code and it was working, but then I read the documentation and found:

Learning #1. DB::Pg is thread-safe. It already does connection caching.

I also wanted to keep track of the progress for each file as it was processed. So, hey, threads! This promise simply prints the progress every second, showing how many lines have been processed vs. total lines, then that same thing as a percentage along with the elapsed time. I didn't bother with an await for this as it will finish when the lines are all processed.

my $lines-total = 0 + qqx{ wc -l $script-path }.words[0];
my $lines-done = 0;
my $progress = start {
     loop {
         #show elapsed time
         my $elapsed = now - $start-time;
         my $pct = (($lines-done / $lines-total) * 100).fmt("%02.2f");
         say("Progress: $lines-done/$lines-total {$pct}%  - " ~ format-elapsed-time($elapsed) ~ " elapsed");
         sleep 1;
         last if $lines-done == $lines-total;
     }
     say( "All Queries Queued. Waiting on Database...");
 }

Then at the end of the main start block I incremented $lines-done by the $batch-size. It all looked good, and even ran...for a while. After a certain number of iterations the process would suddenly end, producing only the message Killed.

At this point I turned to the Raku discord and posted my problem there and was kindly helped by user ab5tract.

Learning 2. The $lines-done variable was unprotected and being added to by every thread. It was inevitable that two of them would eventually stomp on each other, and they did. There were two options to handle this problem: I could put a lock around the increment, or I could make $lines-done an atomicint. I chose the latter.
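Here is a minimal sketch of the atomicint option, separate from the update script. The eight workers and the 1000 increments each are arbitrary numbers chosen for the example.

```raku
# A native atomic integer shared by all workers.
my atomicint $lines-done = 0;

# Eight concurrent workers, each adding 1 to the counter 1000 times.
my @workers = (^8).map: {
    start {
        atomic-fetch-add($lines-done, 1) for ^1000;
    }
};
await @workers;

# Every increment is applied atomically, so none are lost.
say atomic-fetch($lines-done);  # 8000
```

With a plain `my $lines-done = 0` and `$lines-done++` in the workers, the final count can come up short, because two threads can read the same old value and both write back the same new one.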

It was also pointed out to me that the for-loop was probably not the most elegant solution to batching and spawning the threads.

Learning 3. Rotor is a thing. Instead of worrying about doing the batching myself I could let Raku do it. If I combined it with race the threads would all be done behind the scenes.

Rotor lets you specify how many lines to take from the file at a time (:partial means that if we run out, the final short batch is still sent). So it grabs $batch-size lines and sends them to race, which spreads the work across threads and executes the code in the map for each element of the list (in this case, each batch of lines). It's more compact and raku-ish than producing the list and sending it to a for loop. Race also doesn't care in what order the results come back; if you need the results of the map in a specific order, use hyper instead.
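A toy version of the same pipeline, using seven placeholder strings instead of a file and a batch size of 3, shows what rotor, race and hyper each contribute. The :batch(1) argument is only there so that this tiny list is actually split across workers; it is not something the update script sets.

```raku
# Seven placeholder "lines" and a small batch size.
my @lines = 'a'..'g';
my $batch-size = 3;

# rotor with :partial keeps the final short batch instead of dropping it.
my @batches = @lines.rotor($batch-size, :partial);
say @batches.map(*.elems);   # (3 3 1)

# race: batches are processed in parallel and results may arrive in any
# order, so only an order-independent total is checked here.
my @raced = @lines.rotor($batch-size, :partial).race(:batch(1)).map(*.elems);
say @raced.sum;              # 7

# hyper: also parallel, but results keep the order of the input.
my @hypered = @lines.rotor($batch-size, :partial).hyper(:batch(1)).map(*.elems);
say @hypered;                # [3 3 1]
```

For the update script the order of the batches doesn't matter, which is why race is the natural fit there.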

$script-path.IO.lines.rotor($batch-size, :partial).race.map: -> @batch { 
     my $q = @batch.join("\n");
     try {
         my @res = $dbh.execute($q).values;
     }
     CATCH {
         default {
             say "Error: $_";
         }
     }
     #$dbh.finish;
     atomic-fetch-add($lines-done, @batch.elems);
 };

This is the final form of the main loop.

Learning 4. The $dbh.finish call is not needed for updates, inserts, and deletes. In fact, I found that if you do call finish in these threads, you will eventually segfault. In the case of selects you must either pull all the data selected or make the call to finish.

At this point, I gussied up the output using the Terminal::Print module. This kept my progress messages all updating the same terminal line and not scrolling off the page.

Setting the batch size to 2000 statements gave me pretty good throughput. I have not yet pushed that number beyond 2000. I'll get around to it. Five million lines of update statements took about 30 minutes.

Here is the full code for my update script (I may have overdone it with atomicints, but it works):

#!/usr/bin/env raku

use v6;
use DB::Pg;
use Terminal::Print <T>;

multi sub MAIN ( :$host = 'host.docker.internal', :$port = 8888, :$script-path, :$batch-size, :$dbname, :$user, :$password ) {
    T.initialize-screen;
    my @columns = ^T.columns;
    my @rows = ^T.rows;

    my $conninfo = join " ",
        ('dbname=' ~ $dbname),
        ('host=' ~ $host),
        ('port=' ~ $port),
        ('user=' ~ $user),
        ('password=' ~ $password);

    # to get the total number of lines in the file shell out to wc -l
    my atomicint $lines-total = 0 + qqx{ wc -l $script-path }.words[0];
    my atomicint $lines-done = 0;   # atomic integer tracking the number of lines processed; updated with hardware atomic operations

    T.print-string( 1, @rows.elems - 8, $*PROGRAM-NAME );
    T.print-string( 1, @rows.elems - 7, "Script-path: $script-path");
    T.print-string( 1, @rows.elems - 6, "Total lines: $lines-total");

    #every second print the progress
    my $start-time = now;
    sub format-elapsed-time($elapsed) {
        my $hours = $elapsed.Int div 3600;
        my $minutes = ($elapsed.Int mod 3600) div 60;
        my $seconds = $elapsed.Int mod 60;
        return $hours.fmt("%02d") ~ ':' ~ $minutes.fmt("%02d") ~ ':' ~ $seconds.fmt("%02d");
    }

    my atomicint $update-line = @rows.elems - 5;
    my atomicint $doneline = @rows.elems - 1;
    my $progress = start {
        loop {
            #show elapsed time
            my $elapsed = now - $start-time;
            my $local-lines-done = atomic-fetch($lines-done);
            my $local-lines-total = atomic-fetch($lines-total);
            my $pct = (($local-lines-done / $local-lines-total) * 100).fmt("%02.2f");
            T.print-string( 1, $update-line,"Progress: $local-lines-done/$local-lines-total {$pct}%  - " ~ format-elapsed-time($elapsed) ~ " elapsed");
            #say("Progress: $local-lines-done/$local-lines-total {$pct}%  - " ~ format-elapsed-time($elapsed) ~ " elapsed");
            sleep 1;
            last if $local-lines-done == $local-lines-total;
        }
        T.print-string( 1, $doneline, "All Queries Queued. Waiting on Database...");
        #say( "All Queries Queued. Waiting on Database...");
    }


    my @batch;
    my @promises;
    my $dbh = DB::Pg.new(:$conninfo);
    # check the connection
    my $res = $dbh.execute("SELECT 1");
    $dbh.finish;
    #say "Connection: $res";

    $script-path.IO.lines.rotor($batch-size, :partial).race.map: -> @batch { 
        my $q = @batch.join("\n");
        try {
            my @res = $dbh.execute($q).values;
        }
        CATCH {
            default {
                say "Error: $_";
            }
        }
        #$dbh.finish;
        atomic-fetch-add($lines-done, @batch.elems);
    };

    T.shutdown-screen;
}