<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuval</title>
    <description>The latest articles on DEV Community by Yuval (@yuval1024).</description>
    <link>https://dev.to/yuval1024</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F779976%2Fa06085b9-5205-403c-ac1a-01ade35062bc.jpeg</url>
      <title>DEV Community: Yuval</title>
      <link>https://dev.to/yuval1024</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yuval1024"/>
    <language>en</language>
    <item>
      <title>DIY Database Backup - quick and dirty backup using rsync and s3</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Wed, 16 Apr 2025 10:15:23 +0000</pubDate>
      <link>https://dev.to/yuval1024/diy-database-backup-quick-and-dirty-backup-using-rsync-and-s3-1cn2</link>
      <guid>https://dev.to/yuval1024/diy-database-backup-quick-and-dirty-backup-using-rsync-and-s3-1cn2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzwgoz14xo94azl50xsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzwgoz14xo94azl50xsu.png" alt="MongoDB Application Layer backup example" width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's say you have a local database (MongoDB / PostgreSQL) for a local project; it's not in production yet, so there's no need for RDS or the like.&lt;br&gt;
However, you would still want to back up this database. How?&lt;/p&gt;
&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB 7.0 in a Docker container exposed on port 27090&lt;/li&gt;
&lt;li&gt;PostgreSQL with pgvector in a custom Docker container exposed on port 54032&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data directories mounted from the host at:&lt;br&gt;
&lt;code&gt;/root/data/mongodb&lt;/code&gt; for MongoDB&lt;br&gt;
&lt;code&gt;/root/data/postgresql&lt;/code&gt; for PostgreSQL&lt;/p&gt;
&lt;h2&gt;
  
  
  Storage Layer VS Application Layer
&lt;/h2&gt;

&lt;p&gt;Storage Layer Backups: Direct copying of database files&lt;br&gt;
Application Layer Backups: Using database-specific tools like mongodump and pg_dump&lt;/p&gt;
&lt;h3&gt;
  
  
  1st try - tar the data files:
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cf&lt;/span&gt; /root/data/mongo_db &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Problem: what if files are changing during the backup?&lt;/p&gt;
&lt;h3&gt;
  
  
  2nd try - copy before tarring:
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; /root/data/mongo_db /root/data/backups/mongo_db
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cf&lt;/span&gt; /root/data/backups/tmp/mongo_db &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Problem: Copying is faster than archiving, but files could still change during the copy.&lt;/p&gt;
&lt;h3&gt;
  
  
  3rd try - 2-pass rsync
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rsync &lt;span class="nt"&gt;-av&lt;/span&gt; /root/data/mongodb/ /root/data/backups/tmp/mongo_db/
rsync &lt;span class="nt"&gt;-av&lt;/span&gt; /root/data/mongodb/ /root/data/backups/tmp/mongo_db/
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cf&lt;/span&gt; /root/data/backups/tmp/mongo_db &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This looks way better!&lt;br&gt;
We run rsync twice: the first pass copies all files, and the second pass copies only the files that changed during the first pass, since rsync syncs (copies) only changed files.&lt;/p&gt;
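The two-pass idea can be simulated in Python (a toy illustration I am adding; real rsync compares size and modification time and transfers deltas over the network):

```python
import os, shutil, tempfile

def sync_changed(src, dst):
    """Copy only files that are new or whose size differs: a toy model
    of what each rsync pass does (not the real rsync algorithm)."""
    copied = []
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        s, d = os.path.join(src, name), os.path.join(dst, name)
        if not os.path.exists(d) or os.stat(s).st_size != os.stat(d).st_size:
            shutil.copy2(s, d)
            copied.append(name)
    return copied

src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
for name in ("a.db", "b.db"):
    open(os.path.join(src, name), "w").write("data")

first = sync_changed(src, dst)    # first pass: copies everything
open(os.path.join(src, "a.db"), "a").write(" changed mid-backup")
second = sync_changed(src, dst)   # second pass: copies only the changed file
```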
&lt;h3&gt;
  
  
  Feature Request - Saving Latest X Backups
&lt;/h3&gt;

&lt;p&gt;We would like to keep the latest X backups, e.g. the latest 3; how shall we do this?&lt;br&gt;
One way is to run &lt;code&gt;aws s3 ls&lt;/code&gt; and delete old backups.&lt;br&gt;
However, we want a quick and dirty solution, so we take the current day of the year modulo 3. This way the backup "shard" rotates every day, and each shard is overwritten every 3 days.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# DAY_MOD is day of the year module 3&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DAY_MOD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%j&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt; &lt;span class="k"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DAY_MOD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cf&lt;/span&gt; /root/data/backups/tmp/mongo_db &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
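A quick sanity check of the rotation logic (illustrative Python, not part of the backup script):

```python
# Simulate the day-of-year modulo 3 rotation: each day maps to one of
# three shard names, and each shard is overwritten every 3 days.
def shard_name(day_of_year):
    return "mongo_db_backup_day%d.tar.gz" % (day_of_year % 3)

names = [shard_name(d) for d in range(1, 8)]  # a week of backups
```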



&lt;h3&gt;
  
  
  Feature Request - Speed of Backup?
&lt;/h3&gt;

&lt;p&gt;For this, we can parallelize the compression. We could use the &lt;code&gt;parallel&lt;/code&gt; command, but &lt;code&gt;pigz&lt;/code&gt; (a parallel gzip) seems better.&lt;br&gt;
Let's also limit the number of CPUs to 10 (we could also choose nproc/2, or nproc-1 for that matter).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DAY_MOD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; /root/data/backups/tmp/mongo_db | pigz &lt;span class="nt"&gt;-p&lt;/span&gt; 10 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Feature request - Application Layer Backup
&lt;/h3&gt;

&lt;p&gt;This could be done using pg_dump, or using mongodump.&lt;/p&gt;

&lt;p&gt;It's really two minutes of talking to Claude to get a Docker command like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_PATH&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/backup &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:7.0.15-jammy &lt;span class="se"&gt;\&lt;/span&gt;
  mongodump &lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;27090 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MONGO_USERNAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MONGO_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--archive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/backup/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_PATH&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--gzip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Q: Why double-rsync?&lt;br&gt;
A: The first rsync copies most files. During this time, some files might change. The second rsync then efficiently copies only the files that changed during the first pass, resulting in a more consistent snapshot.&lt;/p&gt;

&lt;p&gt;Q: Storage layer backup? Isn't this a problem?&lt;br&gt;
A: Yes, it is; restoring will require the exact same database version, e.g. the same Docker tag.&lt;/p&gt;

&lt;p&gt;Q: What about differential backup?&lt;br&gt;
A: For larger systems, it makes a lot of sense to integrate CDC (change data capture) and do faster incremental backups. However, for larger systems we might already be using managed solutions.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>There are more than 2 UUID types - UUIDv4, 7, ULID, etc...</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Sat, 17 Feb 2024 20:09:23 +0000</pubDate>
      <link>https://dev.to/yuval1024/there-are-more-than-2-uuid-types-uuidv4-7-ulid-etc-1jg4</link>
      <guid>https://dev.to/yuval1024/there-are-more-than-2-uuid-types-uuidv4-7-ulid-etc-1jg4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Tl;dr&lt;/strong&gt; - UUIDv4, UUIDv7, ULID, Base64, Base58, Base85, HashIDs (hiding IDs on the frontend), libs compatibility between different SDKs.&lt;/p&gt;




&lt;p&gt;So, you've all heard about UUIDv4. It's just a very random collection of bits, represented nicely.&lt;/p&gt;

&lt;p&gt;Let's review some other UUIDs:&lt;/p&gt;

&lt;h2&gt;
  
  
  ULID / UUIDv7
&lt;/h2&gt;

&lt;p&gt;UUIDv4 has a major issue - ordering by a UUIDv4 column is problematic.&lt;br&gt;&lt;br&gt;
Let's say you have a database table with an id that is just a simple int with &lt;code&gt;AUTO_INCREMENT&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
The first record will be id=1, the second record id=2, etc.&lt;br&gt;
Now, when you do something like &lt;code&gt;SELECT * FROM my_table ORDER BY id&lt;/code&gt;, the results will be sorted, but more importantly - they will be stored physically close to each other.&lt;br&gt;&lt;br&gt;
E.g., if you iterate in chunks of 100, you will not jump all over the database; consecutive results sit near each other.  &lt;/p&gt;

&lt;p&gt;What happens with UUIDv4? There's really no point in sorting, because you are just sorting a bunch of random numbers.&lt;br&gt;&lt;br&gt;
In addition, you will "jump" all over the database when reading records. And if your DB is big enough, the pages you read will keep getting evicted from cache.&lt;/p&gt;
&lt;h3&gt;
  
  
  So what about ULID / UUIDv7?
&lt;/h3&gt;

&lt;p&gt;ULID and UUIDv7 are two formats that prefix the ID with a timestamp, i.e. the IDs are always increasing.&lt;br&gt;&lt;br&gt;
This way, for example, if you have a table with a ULID/UUIDv7 index, you can run &lt;code&gt;SELECT * FROM my_table ORDER BY id&lt;/code&gt; and it would make sense.  &lt;/p&gt;
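A hand-rolled sketch of the idea (this ignores the version/variant bits of the real UUIDv7 spec): put a 48-bit millisecond timestamp in the most significant bits and randomness below, and lexicographic order follows creation order. Note the flip side: the first 12 hex characters reveal the creation time.

```python
import secrets, time

def uuid7ish(ts_ms=None):
    """Time-prefixed 128-bit ID as hex: a 48-bit millisecond timestamp
    followed by 80 random bits (a sketch of the UUIDv7/ULID idea,
    not the exact bit layout of the real specs)."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    value = ts_ms * 2**80 + secrets.randbits(80)
    return "%032x" % value

a = uuid7ish(ts_ms=1700000000000)
b = uuid7ish(ts_ms=1700000000001)
# a sorts before b because the timestamp occupies the most significant bits
```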
&lt;h3&gt;
  
  
  Problems with ULID / UUIDv7?
&lt;/h3&gt;

&lt;p&gt;One issue is adoption; another, more problematic one, is information leakage - given an ID, we can tell when it was created.&lt;/p&gt;


&lt;h2&gt;
  
  
  "Nano IDs"
&lt;/h2&gt;

&lt;p&gt;This is just a summary of this great article - &lt;a href="https://unkey.dev/blog/uuid-ux"&gt;The UX of UUIDs&lt;/a&gt;. Go read it now.&lt;br&gt;&lt;br&gt;
UUIDs are not easy to copy; the "-" characters prevent a double-click from selecting the whole string.&lt;br&gt;&lt;br&gt;
We can see what Stripe is doing - the key is just a random string, without dashes; in addition, it is prefixed with a key-type description. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STRIPE_LIVE_PUBLIC_KEY="pk_live_xUBcwUhe....."
STRIPE_LIVE_SECRET_KEY="sk_live_gpTjnUwB....."
STRIPE_TEST_PUBLIC_KEY="pk_test_CcfLsSzE....."
STRIPE_TEST_SECRET_KEY="sk_test_WFnNSjpB....."
DJSTRIPE_WEBHOOK_SECRET="whsec_LqqRWEKkd....."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can copy the key(s) with a double-click, and the key also carries more information. How come?&lt;br&gt;&lt;br&gt;
The answer is that a UUID (v4, for example) is represented in hexadecimal; Stripe IDs are represented in a bigger base. The bigger the base, the shorter the string for the same amount of data.&lt;/p&gt;
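To see the effect, here is a sketch that encodes the same 128-bit value in base 16 and base 62 (hand-rolled conversion for illustration; production code should use a tested library):

```python
import secrets, string

def encode(n, alphabet):
    """Encode a non-negative integer using the given digit alphabet."""
    base = len(alphabet)
    digits = ""
    while True:
        n, rem = divmod(n, base)
        digits = alphabet[rem] + digits
        if n == 0:
            return digits

n = secrets.randbits(128)
hex_form = encode(n, "0123456789abcdef")                    # base 16: up to 32 chars
b62_form = encode(n, string.digits + string.ascii_letters)  # base 62: up to 22 chars
```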
&lt;h3&gt;
  
  
  Base64 vs Base58
&lt;/h3&gt;

&lt;p&gt;We all know about Base64 (see FAQ if not) - but what is Base58?&lt;br&gt;
Base58 is just like Base64, but with the easily confused characters omitted: 0, O, I and l are removed to avoid confusion, and + and / are removed as well.&lt;/p&gt;
&lt;h3&gt;
  
  
  Base85???
&lt;/h3&gt;

&lt;p&gt;Yes, there is also Base85. Let's say:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;you work for a software company which distributes signed .exe files to customers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;and you want the filename to contain url the executable should connect to on first run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can't add this URL to the file content, since you would have to sign many different files (*).  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;URL should be part of the filename&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filename should be short as possible.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So - use Base85 to encode the URL; the bigger the base, the shorter the string.&lt;br&gt;
And this way you get a short filename.&lt;/p&gt;
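Python's standard library can demonstrate the size difference; base64.b85encode is stdlib, though whether every Base85 output character is safe in a filename on your OS is worth verifying. The URL below is a hypothetical example of mine:

```python
import base64

url = b"https://example.com/activate?id=12345"  # hypothetical first-run URL
b64 = base64.urlsafe_b64encode(url)
b85 = base64.b85encode(url)
# Base85 packs 4 bytes into 5 chars; Base64 packs 3 bytes into 4 chars,
# so the Base85 form is shorter for the same payload.
```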


&lt;h2&gt;
  
  
  Hash IDs
&lt;/h2&gt;

&lt;p&gt;Let's say you have a SaaS, and you give each new user an ID. And you have a view of the format &lt;code&gt;https://my-saas.com/users/123&lt;/code&gt; (where 123 is the user_id).&lt;br&gt;&lt;br&gt;
What happens is, people can &lt;a href="https://en.wikipedia.org/wiki/German_tank_problem"&gt;estimate&lt;/a&gt; the number of users on your website by creating a new user and checking the id they got.&lt;br&gt;&lt;br&gt;
So - how can you hide the real user_id from the user itself?&lt;br&gt;&lt;br&gt;
One option, of course, is to use a random id, but then we would get all the issues of UUIDs (a UUID is just a special case of a random ID).  &lt;/p&gt;

&lt;p&gt;Another option is to encode the ID using some key; and this is exactly what &lt;a href="https://sqids.org/python?hashids"&gt;Sqids (formerly Hashids)&lt;/a&gt; is doing!  &lt;/p&gt;

&lt;p&gt;Using a secret key (*), you can convert id-&amp;gt;string and string-&amp;gt;id, and this way you can have something like &lt;code&gt;https://my-saas.com/users/nVB&lt;/code&gt;, and convert &lt;code&gt;nVB&lt;/code&gt; to user_id &lt;code&gt;123&lt;/code&gt; in your backend.&lt;/p&gt;

&lt;p&gt;Sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Taken from: https://github.com/davidaurelio/hashids-python

hashids = Hashids(salt='this is my salt 1')
hashid = hashids.encode(123) # 'nVB'

# and with different salt:
hashids = Hashids(salt='this is my salt 2')
hashid = hashids.encode(123) # 'ojK'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What can be the problem with HashIDs?
&lt;/h3&gt;

&lt;p&gt;Well, first we should check what algorithm it uses, and make sure &lt;code&gt;decode(encode(id)) == id&lt;/code&gt; for all ids; i.e. that we can trust the lib (algorithm) to do the conversions without error.&lt;br&gt;&lt;br&gt;
Another issue is that we might be bound to a specific implementation (and thus technology), unless we verify that results do not change when we switch libs.&lt;br&gt;
Security review of the algorithm - the algorithm does some logic to &lt;a href="https://github.com/davidaurelio/hashids-python?tab=readme-ov-file#curses-"&gt;avoid generating most common English curse words by never placing some letters next to each other&lt;/a&gt;; this might sound like trouble, entropy-wise.&lt;/p&gt;
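The round-trip check can be sketched with a toy reversible encoding standing in for the real library (with the actual hashids lib, you would assert decode(encode(id)) == id over a large range of ids):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def encode(n, salt=7):
    # toy stand-in for hashids.encode: salt-shifted base-26 encoding
    digits = ""
    while True:
        n, rem = divmod(n, 26)
        digits = ALPHABET[(rem + salt) % 26] + digits
        if n == 0:
            return digits

def decode(s, salt=7):
    n = 0
    for ch in s:
        n = n * 26 + (ALPHABET.index(ch) - salt) % 26
    return n

# property check: encode/decode must be exact inverses over many ids
for i in range(10_000):
    assert decode(encode(i)) == i
```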

&lt;h4&gt;
  
  
  Checking Cross-Language Consistency
&lt;/h4&gt;

&lt;p&gt;What happens if the frontend uses Hashids in JavaScript but the backend uses Python, C++ or Rust, for example? One way to check: generate known (salt, id, hashid) triples in one language and assert that the other language reproduces them.&lt;/p&gt;




&lt;h3&gt;
  
  
  FAQ
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: Do we really need 128 bits for an ID? Isn't it too much? What are the odds of collisions?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: Some people claim this isn't really needed. Referring to the &lt;a href="https://en.wikipedia.org/wiki/Birthday_problem"&gt;Birthday Attack probability table&lt;/a&gt;, we see that for 128 bits we would need about 2.6×10&lt;sup&gt;18&lt;/sup&gt; keys in order to get a collision with 1% probability.&lt;/p&gt;
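That figure follows from the birthday approximation n ≈ sqrt(2 * 2^bits * ln(1/(1-p))):

```python
import math

def birthday_n(bits, p):
    """Approximate number of random keys needed for collision
    probability p in a space of 2**bits values."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))

n = birthday_n(128, 0.01)  # on the order of 2.6e18
```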

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: What is base64 for?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: Let's say you want to transfer information via text, e.g. you want to serialize arbitrary (possibly binary) data and send it to another program. Base64 encodes any bytes using only 64 safe text characters, so the data survives text-only channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: Why the "(*)" in "you would have to sign many different files"?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: There are some mechanisms to deal with it, e.g. signing the whole file except a small part of metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: If we use UUID with time as prefix, what happens on daylight saving time?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: Nothing; as the time is unix epoch, which is always increasing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: Why do hash ID libs call the secret key "salt"? It's a secret, not a salt..&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: The goal of hash IDs is to convert a number to a string and vice versa, i.e. to supply a two-directional hash function.&lt;br&gt;&lt;br&gt;
Thus, in order to change the hash result, we use a salt.&lt;br&gt;
In the algorithmic layer, this is indeed a "salt".&lt;br&gt;&lt;br&gt;
In the Product/Marketing layer, it should be called a "secret".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: In "Checking cross-language", why do we need a Dockerfile for the test?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: We really don't need; we can do this one time to check the implementations we need and that's it. Dockerfile is for demonstration purposes only.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://buildkite.com/blog/goodbye-integers-hello-uuids"&gt;UUIDs and poor index locality&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/"&gt;Benchmarking UUIDs and checking WAL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://buildkite.com/blog/goodbye-integers-hello-uuids"&gt;https://buildkite.com/blog/goodbye-integers-hello-uuids&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Testing Github Co-Pilot and Trying to Win World Cup Bet</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Sun, 20 Nov 2022 19:48:52 +0000</pubDate>
      <link>https://dev.to/yuval1024/testing-github-co-pilot-and-trying-to-win-world-cup-bet-104d</link>
      <guid>https://dev.to/yuval1024/testing-github-co-pilot-and-trying-to-win-world-cup-bet-104d</guid>
      <description>&lt;p&gt;The world of Algorithmic Betting is very reach; lots of words were written about &lt;a href="https://en.wikipedia.org/wiki/Arbitrage_betting"&gt;Arbitrage Betting&lt;/a&gt;; what it means is, that different booking providers give different ratios for the same game, so you can, by betting in multiple providers, guarantee to make a profit.  &lt;/p&gt;

&lt;p&gt;However, this requires a lot of effort - real-time betting, scraping, etc..  &lt;/p&gt;

&lt;p&gt;This post will be about trying to find the best strategy for gambling with friends.&lt;br&gt;&lt;br&gt;
Two options exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bet "single" - single result on a game - home/draw/away - 2 points if correct&lt;/li&gt;
&lt;li&gt;Bet "double" - choose 2 of home/draw/away - 1 point if correct&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bonus - guess 5/6 games in a group - get 1 bonus point; guess 6/6 games in a group - get 3 bonus points.&lt;/p&gt;

&lt;p&gt;A script was written to find the cut-off odds below which you should take a single bet. E.g., if a team is expected to win at odds of 1.1 and the cut-off is 1.2, bet single.&lt;br&gt;
If a team is expected to win at 1.5 but the cut-off is 1.4, bet double - i.e. this team plus the next best outcome.&lt;/p&gt;
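The decision rule might look like this (a sketch; the names and structure are mine, not the actual script's):

```python
def choose_bet(best_odds, single_limit):
    """Return "single" when the favourite's odds do not exceed the
    cut-off (a near-certain outcome is worth the 2-point single bet),
    "double" otherwise."""
    if min(best_odds, single_limit) == best_odds:
        # the favourite's odds are at or below the cut-off
        return "single"
    return "double"
```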
&lt;h3&gt;
  
  
  Howto?
&lt;/h3&gt;

&lt;p&gt;We start with RapidAPI: a Google search for &lt;a href="https://letmegooglethat.com/?q=rapid+api+soccer+bet"&gt;rapid api soccer bet&lt;/a&gt;, then we need to find a free provider.&lt;br&gt;
We'll go with Pinnacle, subscribing to the free plan, which is enough.&lt;br&gt;
We scrape the market, scrape the games for the market, and save the results in a cache.&lt;br&gt;
Then we set a &lt;code&gt;single_limit&lt;/code&gt;, i.e. the threshold between single and double bets - below this limit, always bet single.  &lt;/p&gt;
&lt;h4&gt;
  
  
  Teams to Groups
&lt;/h4&gt;

&lt;p&gt;Each group has 6 games - &lt;a href="https://en.wikipedia.org/wiki/Binomial_coefficient"&gt;4 choose 2&lt;/a&gt;, i.e. &lt;code&gt;4!/(2!2!)&lt;/code&gt;.&lt;br&gt;
For 5 correct answers we get 1 bonus point, for 6 we get 3.&lt;br&gt;&lt;br&gt;
How do we know if we have 5 or 6 correct answers? We need to map each game (team) to its group.&lt;br&gt;&lt;br&gt;
How? It can be done automatically! Since "same group as" is an &lt;a href="https://en.wikipedia.org/wiki/Equivalence_relation"&gt;Equivalence Relation&lt;/a&gt;,&lt;br&gt;
we can build it: if we have the games:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;team A &amp;lt;-&amp;gt; Team B, And&lt;/li&gt;
&lt;li&gt;Team B &amp;lt;-&amp;gt; Team C, and&lt;/li&gt;
&lt;li&gt;Team C &amp;lt;-&amp;gt; Team D,
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then we know A, B, C and D are all in the same group!&lt;br&gt;
And we don't have to enter all the data manually. In the code, a similar solution is implemented in &lt;code&gt;GroupsHelper&lt;/code&gt;.&lt;/p&gt;
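This kind of grouping is classic union-find; a minimal sketch (my own illustration, the article's GroupsHelper may be implemented differently):

```python
class Groups:
    """Union-find over team names: every game joins two teams into
    the same group."""
    def __init__(self):
        self.parent = {}

    def find(self, team):
        # follow parent links up to the group's representative
        self.parent.setdefault(team, team)
        while self.parent[team] != team:
            team = self.parent[team]
        return team

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

g = Groups()
for a, b in [("A", "B"), ("B", "C"), ("C", "D")]:
    g.union(a, b)
# now all four teams share one group representative
```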
&lt;h4&gt;
  
  
  Use Copilot!
&lt;/h4&gt;

&lt;p&gt;There is some &lt;a href="https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data"&gt;controversy&lt;/a&gt; around GitHub Copilot.&lt;br&gt;&lt;br&gt;
So please don't use it in your corporate job, lol.. Or make sure to ask legal before doing so.&lt;br&gt;
I've used Copilot for this toy project and got some nice results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main takeaways&lt;/strong&gt;:&lt;br&gt;
It can generate a complete class, if the class is trivial (e.g. the Game class).&lt;br&gt;
You can add a comment before a function to give Copilot a "hint" about what is expected from it. The function name is a hint, of course, but so is the comment.&lt;/p&gt;

&lt;p&gt;Sometimes even hints don't help.. The model gives us the &lt;a href="https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_Group_A"&gt;2018 World Cup groups&lt;/a&gt;, not 2022 as instructed..&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5xTKAED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bke59p7yvtw7il8qyw1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5xTKAED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bke59p7yvtw7il8qyw1u.png" alt="copilot giving 2018 world cup groups" width="500" height="397"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Q&amp;amp;A
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is &lt;code&gt;RAPID_API_KEY = os.environ.get('RAPID_API_KEY')&lt;/code&gt;?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; You should store configuration in environment variables; never in code. See &lt;a href="https://12factor.net/"&gt;12 factors app&lt;/a&gt;.&lt;br&gt;
Python .pyc files can easily be &lt;a href="https://github.com/rocky/python-uncompyle6/"&gt;"decompiled"&lt;/a&gt; to .py and reveal all secrets in code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What methods can be used to explore the API?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; The best option is to search for Swagger file. Swagger is "open source editor to design, define and document RESTful APIs in the Swagger Specification".&lt;br&gt;&lt;br&gt;
Another alternative, is to search for Postman collection for relevant product / service.  &lt;a href="https://github.com/hkamel/azuredevops-postman-collections"&gt;Some&lt;/a&gt; &lt;a href="https://github.com/twitterdev/postman-twitter-api"&gt;Postman&lt;/a&gt; &lt;a href="https://github.com/CiscoDevNet/postman-webex"&gt;collections&lt;/a&gt; &lt;a href="https://github.com/esri-es/ArcGIS-REST-API"&gt;examples&lt;/a&gt;. &lt;br&gt;
For this project, I've used some hacky method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resp = requests.get(url)
open('out.txt', 'w', encoding=resp.encoding).write(json.dumps(resp.json(), indent=4))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--etfAID2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p5gi9hjtxd9vqbl4cq5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--etfAID2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p5gi9hjtxd9vqbl4cq5j.png" alt="saving json to file with indent" width="608" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What about exception handling?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Scraping part was manual, e.g. scraping results of all of the games, and that's it; so in case of error - it was handled manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; I guess there are some libraries to handle HTTP request caching?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Yes, &lt;a href="https://requests-cache.readthedocs.io/en/stable/"&gt;there are indeed&lt;/a&gt;; however, learning those libs would take too much dev time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the &lt;code&gt;req_id&lt;/code&gt; in &lt;code&gt;do_req&lt;/code&gt;?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; While other libs like &lt;code&gt;requests-cache&lt;/code&gt; automatically integrate into (i.e. patch) &lt;code&gt;requests&lt;/code&gt;, since we implement our own "cache" we need a way to know whether a request was already fetched.&lt;br&gt;&lt;br&gt;
This is a signature of the request, which lets us check quickly whether it already exists in the cache. E.g., we could save responses in some key-value DB (Redis) and query by the signature. Actually, we use the disk (/tmp/cache/) as the key-value store.&lt;/p&gt;
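A minimal sketch of such a disk-backed cache (the req_id/do_req bodies here are my assumptions, since the article's code is not shown):

```python
import hashlib, os, tempfile

CACHE_DIR = os.path.join(tempfile.gettempdir(), "cache")

def req_id(url):
    # signature of the request: a stable hash of the URL
    return hashlib.sha256(url.encode()).hexdigest()

def do_req(url, fetch):
    """Return the cached body for url, calling fetch(url) only on a miss.
    The filesystem acts as the key-value store, keyed by req_id."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, req_id(url))
    if os.path.exists(path):
        return open(path).read()
    body = fetch(url)
    open(path, "w").write(body)
    return body
```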

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Why does numpy give a warning on the print() line?&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kebloSqY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i3yyo22k72f71r9d5uwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kebloSqY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i3yyo22k72f71r9d5uwo.png" alt="numpy warning message" width="880" height="116"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Check what is the return type from &lt;code&gt;np.mean()&lt;/code&gt;, for example..&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the name of the technique described by the line &lt;code&gt;for single_limit in [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]:&lt;/code&gt;?&lt;br&gt;
Which scikit-learn method does it correspond to?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Grid search; we search for the best parameters for the model.&lt;br&gt;
scikit-learn ref - &lt;a href="https://scikit-learn.org/stable/modules/grid_search.html"&gt;https://scikit-learn.org/stable/modules/grid_search.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What about a Genetic Algorithm?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; That was the original plan - to use a GA to find the best betting strategy - but there weren't enough parameters for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; In &lt;code&gt;tests.py&lt;/code&gt;, some of the tests are missing an &lt;code&gt;assert&lt;/code&gt; statement!&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Correct; these are statistical tests, so I print values and check them manually; this fits the "type" of the project.&lt;br&gt;&lt;br&gt;
If it were a production system, we'd have to check that values are "similar", e.g. within 2 or 3 standard deviations of each other..&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the &lt;code&gt;random.seed(42)&lt;/code&gt; for?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; In case of bugs, we want to be able to reproduce them, and &lt;code&gt;random.seed()&lt;/code&gt; gives us reproducible results.&lt;br&gt;
Q: But then you get the same results every time; don't you want randomization?&lt;br&gt;
A: Actually, we do. So we can use something like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="n"&gt;time_seed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"seeding with %d"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_seed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
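&lt;p&gt;To replay a failing run later, feed the printed seed back in; a minimal sketch (the &lt;code&gt;REPRO_SEED&lt;/code&gt; env var name is illustrative):&lt;/p&gt;

```python
import os
import random
import time

# if REPRO_SEED is set, replay that exact run; otherwise seed from the clock
# (REPRO_SEED is an illustrative name, not from the original post)
seed = int(os.environ.get('REPRO_SEED', time.time()))
print('seeding with %d' % seed)
random.seed(seed)
```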

&lt;h3&gt;
  
  
  Source Code
&lt;/h3&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;




&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Python - PDB usage and reproducing program execution</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Fri, 11 Nov 2022 10:58:48 +0000</pubDate>
      <link>https://dev.to/yuval1024/python-pdb-usage-and-reproducing-program-execution-mo6</link>
      <guid>https://dev.to/yuval1024/python-pdb-usage-and-reproducing-program-execution-mo6</guid>
      <description>&lt;p&gt;So imagine you have a Python program, and you want to inspect some parameters during an error.&lt;br&gt;&lt;br&gt;
There are &lt;a href="https://www.rookout.com/"&gt;many&lt;/a&gt;, &lt;a href="https://sentry.io"&gt;possible&lt;/a&gt;, &lt;a href="https://docs.python.org/3/library/traceback.html"&gt;ways&lt;/a&gt; to do that;&lt;br&gt;
I'd like to talk about a basic one, which involves a debugger. Just like GDB for C/C++, Python has PDB.&lt;/p&gt;

&lt;p&gt;PDB is a command-line debugger, which can be attached to a process or started from within the process.&lt;br&gt;&lt;br&gt;
Just add the line &lt;code&gt;import pdb; pdb.set_trace()&lt;/code&gt; and you will have a shell where you can communicate with the process.  &lt;/p&gt;

&lt;p&gt;Needless to say, this is good only for CLI programs. Others, like servers, need other solutions (Rookout, PyCharm's remote debugger, etc.).&lt;/p&gt;

&lt;p&gt;Let's say we run a program which calls &lt;code&gt;some_erroneous_function&lt;/code&gt;, and we want to know some value from inside this function.&lt;br&gt;
&lt;code&gt;main() -&amp;gt; foo() -&amp;gt; some_erroneous_function()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;How can we know the value inside &lt;code&gt;some_erroneous_function()&lt;/code&gt;?&lt;br&gt;
Simple - add the following line:&lt;br&gt;&lt;br&gt;
&lt;code&gt;import pdb; pdb.set_trace()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Without it, we can't see the value of &lt;code&gt;a&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--83HJSNM3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jsi7wflt9ashsxsjamfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--83HJSNM3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jsi7wflt9ashsxsjamfn.png" alt="function raising exception" width="859" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With it, we do manage to see the value of &lt;code&gt;a&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xYDcyn3c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4w66yhdmm9iqdk6yr7tm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xYDcyn3c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4w66yhdmm9iqdk6yr7tm.png" alt="adding pdb set_trace to get shell into program execution" width="745" height="494"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What happens when program A runs program B?
&lt;/h3&gt;

&lt;p&gt;When we have&lt;br&gt;
&lt;code&gt;main() -&amp;gt; bar() -&amp;gt; cli_app_bar.py -&amp;gt; some_erroneous_function()&lt;/code&gt;,&lt;br&gt;&lt;br&gt;
the &lt;code&gt;import pdb; pdb.set_trace()&lt;/code&gt; trick simply doesn't work;&lt;br&gt;
we get a stuck process instead. This is because pdb opens in the child process, while the parent process is waiting for the child process to finish, so we're stuck.  &lt;/p&gt;

&lt;p&gt;In this case, we should run the child process ourselves.  &lt;/p&gt;
&lt;h3&gt;
  
  
  What parts are required to run a child process ourselves?
&lt;/h3&gt;

&lt;p&gt;There are 2 parts which are required; one is obvious, the other is often forgotten!&lt;br&gt;
The 2 parts are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;program name + command-line arguments&lt;/li&gt;
&lt;li&gt;environment variables!&lt;/li&gt;
&lt;li&gt;(there's a 3rd part, which is IPC messages, but it's very hard to mimic such behavior...)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's see how we capture this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modify the program to save its CLI arguments and env vars&lt;/li&gt;
&lt;li&gt;Re-run it using those CLI arguments and env vars&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Getting the cmd + env vars
&lt;/h3&gt;

&lt;p&gt;There are several methods; to get the env vars of a running process, you can use &lt;code&gt;cat /proc/46/environ | tr '\0' '\n'&lt;/code&gt; (replace 46 with the process id).&lt;/p&gt;

&lt;p&gt;From within the Python process, we want to print the env vars in a "ready to go" format, e.g. with the &lt;code&gt;export&lt;/code&gt; prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'/tmp/params.txt'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'w'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# print all env vars
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;fout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'export "%s"="%s"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
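&lt;p&gt;The snippet above covers the env vars; the command line - the other required part - can be captured into the same file. A sketch (using the stdlib's &lt;code&gt;shlex.quote&lt;/code&gt; so the line can be pasted back into a shell verbatim):&lt;/p&gt;

```python
import os
import shlex
import sys

# capture both required parts: the exact command line and the env vars;
# the file path matches the snippet above, the format is illustrative
with open('/tmp/params.txt', 'w') as fout:
    fout.write('# command: %s\n' % ' '.join(shlex.quote(a) for a in sys.argv))
    for k, v in os.environ.items():
        fout.write('export %s=%s\n' % (k, shlex.quote(v)))
```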


&lt;p&gt;And then diff with current env vars:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"creating bar before"&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; create_before.py
#!/usr/bin/python3
import os
with open('/tmp/params.before.txt', 'w') as fout:
    for k, v in os.environ.items():
        fout.write('export "%s"="%s"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;' % (k,v))
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;python create_before.py

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"print some stats"&lt;/span&gt;
&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; /tmp/params.txt /tmp/params.before.txt

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"get keys"&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/params.txt | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s1"&gt;'='&lt;/span&gt; &lt;span class="s1"&gt;' { print $1 } '&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/params.keys.txt
&lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/params.before.txt | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s1"&gt;'='&lt;/span&gt; &lt;span class="s1"&gt;' { print $1 } '&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/params.before.keys.txt
&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; /tmp/params.keys.txt /tmp/params.before.keys.txt
diff /tmp/params.keys.txt /tmp/params.before.keys.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And we get the newly added env var key, &lt;code&gt;EXTRA&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;YuvShell $ diff /tmp/params.keys.txt /tmp/params.before.keys.txt
1d0
&amp;lt; export "EXTRA"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Questions
&lt;/h3&gt;

&lt;p&gt;Q: What is the "YuvShell"??&lt;br&gt;
A: It's just me editing the ~/.bashrc and changing the PS1 (Prompt String) var;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fqP0jSsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkkruyhuocodizi5sczw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fqP0jSsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkkruyhuocodizi5sczw.png" alt="changing bash shell prompt string" width="880" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Q: What is the difference between &lt;code&gt;cat some_file.txt | wc -l&lt;/code&gt; and &lt;code&gt;wc -l some_file.txt&lt;/code&gt;?&lt;br&gt;
A: With &lt;code&gt;cat + wc&lt;/code&gt; we use a pipe to transfer data from the cat output to the wc input; with wc only, we don't use the pipe.&lt;br&gt;&lt;br&gt;
Let's create some big file from urandom, and see &lt;code&gt;time&lt;/code&gt; output of both options:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat /dev/urandom | base64 | head -c 1GB &amp;gt; /tmp/random_1GB_file.txt

time cat /tmp/random_1GB_file.txt | wc -l
time wc -l /tmp/random_1GB_file.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ov-tUuhX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9z9wb02kc1jnech0qql0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ov-tUuhX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9z9wb02kc1jnech0qql0.png" alt="performance results of wc with and without pipe" width="564" height="363"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Source Code
&lt;/h3&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;




&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Data Ingestion - Build Your Own "Map Reduce"?</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Fri, 24 Dec 2021 12:04:22 +0000</pubDate>
      <link>https://dev.to/yuval1024/data-ingestion-build-your-own-map-reduce-2j1h</link>
      <guid>https://dev.to/yuval1024/data-ingestion-build-your-own-map-reduce-2j1h</guid>
      <description>&lt;h2&gt;
  
  
  Why map reduce
&lt;/h2&gt;

&lt;p&gt;Let's say you work at Facebook; you have lots of data and probably need lots of map-reduce tasks.&lt;br&gt;
You will use mrjob/PySpark/Spark/Hadoop. You get the point - you need 1 framework to rule them all.&lt;br&gt;
You need a system: where will temp files be stored, API with the cloud, data security, erasure, multi-tenancy, etc.&lt;br&gt;
You need standards - standards between developers themselves, between developers and devops, etc.&lt;/p&gt;

&lt;p&gt;Let's say, on the other hand, you're a solopreneur or a small startup. A team of 3-4 developers at most.&lt;br&gt;
You need things to work, and work fast.&lt;br&gt;
You don't have 10Ks of map-reduce jobs, but probably 1 or 2.&lt;br&gt;
You won't be using Hadoop, that's for sure. You might be using:&lt;/p&gt;
&lt;h2&gt;
  
  
  Different approaches
&lt;/h2&gt;
&lt;h3&gt;
  
  
  LINQ
&lt;/h3&gt;

&lt;p&gt;Not really map-reduce per se, more like "SQL without an SQL engine".&lt;br&gt;
However, this adds the complexities of .NET to your environment;&lt;br&gt;
e.g. read release notes and understand if you can run it on your &lt;a href="https://docs.microsoft.com/en-us/dotnet/core/install/linux"&gt;different OSes&lt;/a&gt; (production, staging, developer machines).&lt;br&gt;
Also - you need to learn C#: loading from files, different encodings, saving, iterators, etc.&lt;br&gt;
If you're not proficient with C#, this could be a one-time investment which will not be worth it.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://github.com/Yelp/mrjob"&gt;Mrjob&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Pros: a native Python lib; you can debug easily (using &lt;a href="https://mrjob.readthedocs.io/en/latest/runners-inline.html"&gt;inline&lt;/a&gt;), run &lt;a href="https://mrjob.readthedocs.io/en/latest/runners-local.html"&gt;locally&lt;/a&gt;, e.g. multi-process on the local machine,&lt;br&gt;
or use Hadoop, Dataproc (it seems that &lt;a href="https://cloud.google.com/dataproc/docs/concepts/overview"&gt;"Dataproc is a managed Spark and Hadoop service..."&lt;/a&gt;), etc.&lt;br&gt;
However, there are lots of moving parts and different configuration options.&lt;/p&gt;
&lt;h3&gt;
  
  
  Custom made map-reduce
&lt;/h3&gt;

&lt;p&gt;Let's go to the UCI Machine Learning website (2015 is on the phone..),&lt;br&gt;
choose some &lt;a href="http://archive.ics.uci.edu/ml/datasets/Bar+Crawl%3A+Detecting+Heavy+Drinking"&gt;dataset&lt;/a&gt;, and test.&lt;/p&gt;

&lt;p&gt;Some notes:&lt;br&gt;
We don't need SHA-256, not even Base64; nothing bad will happen if the keys are not distributed perfectly evenly.&lt;br&gt;
We could take MMH3; googling "python murmurhash" gives 2 interesting results, and since both use &lt;a href="https://github.com/hajimes/mmh3/blob/master/MurmurHash3.cpp"&gt;the&lt;/a&gt; &lt;a href="https://github.com/explosion/murmurhash/blob/master/murmurhash/MurmurHash3.cpp"&gt;same&lt;/a&gt; cpp code, let's take the one with the &lt;a href="https://github.com/hajimes/mmh3"&gt;most stars&lt;/a&gt;.&lt;br&gt;
Other options would be to simply do (% NUM_SHARDS), or even a right shift (however, that requires the shard count to be a power of 2).&lt;/p&gt;
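&lt;p&gt;A minimal sketch of the sharding decision described above; I use the stdlib's &lt;code&gt;zlib.crc32&lt;/code&gt; here so it runs anywhere, but &lt;code&gt;mmh3&lt;/code&gt;'s hash would slot in the same way:&lt;/p&gt;

```python
import zlib

NUM_SHARDS = 8  # power of 2 not required for the modulo approach

def shard_for(key):
    # any fast, well-mixed hash works here; cryptographic strength is wasted.
    # zlib.crc32 stands in for the MMH3 hash discussed in the post.
    return zlib.crc32(key.encode('utf-8')) % NUM_SHARDS
```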

&lt;p&gt;mini setup script:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;and 2 python test scripts:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Results:&lt;br&gt;
imap runs much slower;&lt;br&gt;
we can look at it/sec from tqdm to see that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# test.py sample after 4 seconds:
2801493it [00:04, 566075.99it/s]

# test_imap.py sample after 4 seconds:
73439it [00:04, 18754.44it/s]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the non-imap version is about 30x faster!&lt;/p&gt;

&lt;h2&gt;
  
  
  Q&amp;amp;A
&lt;/h2&gt;

&lt;p&gt;Q: Why a setup.sh and not a requirements.txt file?&lt;br&gt;
A: This is not production code; it's aimed at quick reproducibility, not at pinning the exact same libs (e.g. for security etc.)&lt;/p&gt;

&lt;p&gt;Q: Why MMH3 and not sha256?&lt;br&gt;
A: This is not a security product, we don't need cryptographic hash; we just need a &lt;em&gt;nice distribution of keys&lt;/em&gt;, and we want this to be &lt;strong&gt;fast&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Q: Why is imap slower than a single process?&lt;br&gt;
A: Probably because the imap version has &lt;strong&gt;lots of overhead because of IPC&lt;/strong&gt;;&lt;br&gt;
the gain from offloading the (alleged) "heavy lifting" of the hash calculation to an external process is erased by the IPC.&lt;/p&gt;

&lt;p&gt;Q: Why?&lt;br&gt;
A: Using a process pool might be worth it if the task is more CPU bound; here, &lt;strong&gt;the task is more IO bound&lt;/strong&gt; and the overhead of the MMH hash doesn't justify it.&lt;/p&gt;
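&lt;p&gt;One mitigation worth testing (not measured in this post) is &lt;code&gt;imap&lt;/code&gt;'s &lt;code&gt;chunksize&lt;/code&gt; parameter, which packs many items into each IPC message and amortizes the per-item overhead:&lt;/p&gt;

```python
from multiprocessing import Pool
import zlib

NUM_SHARDS = 8

def shard_for(line):
    # zlib.crc32 stands in for the MMH3 hashing done in the post
    return zlib.crc32(line.encode('utf-8')) % NUM_SHARDS

if __name__ == '__main__':
    lines = ['line-%d' % i for i in range(100000)]
    with Pool(4) as pool:
        # chunksize batches 4096 lines per IPC message instead of one,
        # reducing the per-item overhead the measurements above point at
        shards = list(pool.imap(shard_for, lines, chunksize=4096))
```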

&lt;p&gt;Q: Are we sure about it?&lt;br&gt;
A: We could use &lt;a href="https://github.com/benfred/py-spy"&gt;py-spy&lt;/a&gt; and check.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uz7NsIkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/88aqgbnwjm0bqudnd5m4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uz7NsIkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/88aqgbnwjm0bqudnd5m4.png" alt="Image description" width="880" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Q: Conclusion?&lt;br&gt;
A: It depends.&lt;br&gt;
It also depends on the size of the file,&lt;br&gt;
and on the post-processing of each shard.&lt;br&gt;
Conclusion - test mrjob as well; it might have better IPC.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
