<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuval</title>
    <description>The latest articles on DEV Community by Yuval (@yuval1024).</description>
    <link>https://dev.to/yuval1024</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F779976%2Fa06085b9-5205-403c-ac1a-01ade35062bc.jpeg</url>
      <title>DEV Community: Yuval</title>
      <link>https://dev.to/yuval1024</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yuval1024"/>
    <language>en</language>
    <item>
      <title>DIY Database Backup - quick and dirty backup using rsync and s3</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Wed, 16 Apr 2025 10:15:23 +0000</pubDate>
      <link>https://dev.to/yuval1024/diy-database-backup-quick-and-dirty-backup-using-rsync-and-s3-1cn2</link>
      <guid>https://dev.to/yuval1024/diy-database-backup-quick-and-dirty-backup-using-rsync-and-s3-1cn2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzwgoz14xo94azl50xsu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzwgoz14xo94azl50xsu.png" alt="MongoDB Application Layer backup example" width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's say you have a local database (MongoDB / PostgreSQL) for a local project; it's not in production yet, so there's no need for RDS or the like.&lt;br&gt;
However, you would still want to back up this database. How?&lt;/p&gt;
&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB 7.0 in a Docker container exposed on port 27090&lt;/li&gt;
&lt;li&gt;PostgreSQL with pgvector in a custom Docker container exposed on port 54032&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data directories mounted from the host at:&lt;br&gt;
&lt;code&gt;/root/data/mongodb&lt;/code&gt; for MongoDB&lt;br&gt;
&lt;code&gt;/root/data/postgresql&lt;/code&gt; for PostgreSQL&lt;/p&gt;
&lt;h2&gt;
  
  
  Storage Layer VS Application Layer
&lt;/h2&gt;

&lt;p&gt;Storage Layer Backups: Direct copying of database files&lt;br&gt;
Application Layer Backups: Using database-specific tools like mongodump and pg_dump&lt;/p&gt;
&lt;h3&gt;
  
  
  1st try - tar the data files:
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cf&lt;/span&gt; /root/data/mongo_db &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Problem: what if files are changing during the backup?&lt;/p&gt;
&lt;h3&gt;
  
  
  2nd try - copy before tarring:
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; /root/data/mongo_db /root/data/backups/mongo_db
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cf&lt;/span&gt; /root/data/backups/tmp/mongo_db &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Problem: Copying is faster than archiving, but files could still change during the copy.&lt;/p&gt;
&lt;h3&gt;
  
  
  3rd try - 2-pass rsync
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rsync &lt;span class="nt"&gt;-av&lt;/span&gt; /root/data/mongodb/ /root/data/backups/tmp/mongo_db/
rsync &lt;span class="nt"&gt;-av&lt;/span&gt; /root/data/mongodb/ /root/data/backups/tmp/mongo_db/
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cf&lt;/span&gt; /root/data/backups/tmp/mongo_db &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This looks way better!&lt;br&gt;
We run rsync twice: the first pass copies all files, and the second pass copies only the files that changed during the first pass, since rsync syncs (copies) only changed files.&lt;/p&gt;
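The two-pass idea can be simulated in Python (a toy illustration I am adding; real rsync compares size and modification time and transfers deltas over the network):

```python
import os, shutil, tempfile

def sync_changed(src, dst):
    """Copy only files that are new or whose size differs: a toy model
    of what each rsync pass does (not the real rsync algorithm)."""
    copied = []
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        s, d = os.path.join(src, name), os.path.join(dst, name)
        if not os.path.exists(d) or os.stat(s).st_size != os.stat(d).st_size:
            shutil.copy2(s, d)
            copied.append(name)
    return copied

src = tempfile.mkdtemp()
dst = tempfile.mkdtemp()
for name in ("a.db", "b.db"):
    open(os.path.join(src, name), "w").write("data")

first = sync_changed(src, dst)    # first pass: copies everything
open(os.path.join(src, "a.db"), "a").write(" changed mid-backup")
second = sync_changed(src, dst)   # second pass: copies only the changed file
```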
&lt;h3&gt;
  
  
  Feature Request - Saving Latest X Backups
&lt;/h3&gt;

&lt;p&gt;We would like to keep the latest X backups, e.g. the latest 3; how shall we do this?&lt;br&gt;
One way is to run &lt;code&gt;aws s3 ls&lt;/code&gt; and delete old backups.&lt;br&gt;
However, we want a quick and dirty solution, so we take the current day of the year modulo 3. This way the backup "shard" rotates every day, and each shard is overwritten every 3 days.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# DAY_MOD is day of the year module 3&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DAY_MOD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%j&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt; &lt;span class="k"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DAY_MOD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-cf&lt;/span&gt; /root/data/backups/tmp/mongo_db &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
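A quick sanity check of the rotation logic (illustrative Python, not part of the backup script):

```python
# Simulate the day-of-year modulo 3 rotation: each day maps to one of
# three shard names, and each shard is overwritten every 3 days.
def shard_name(day_of_year):
    return "mongo_db_backup_day%d.tar.gz" % (day_of_year % 3)

names = [shard_name(d) for d in range(1, 8)]  # a week of backups
```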



&lt;h3&gt;
  
  
  Feature Request - Speed of Backup?
&lt;/h3&gt;

&lt;p&gt;For this, we can parallelize the compression. We could use the &lt;code&gt;parallel&lt;/code&gt; command, but &lt;code&gt;pigz&lt;/code&gt; (a parallel gzip) seems better.&lt;br&gt;
Let's also limit the number of CPUs to 10 (we could also choose nproc/2, or nproc-1 for that matter).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;filepath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/root/data/backups/mongo_db_backup_day&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;DAY_MOD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;.tar.gz

&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; /root/data/backups/tmp/mongo_db | pigz &lt;span class="nt"&gt;-p&lt;/span&gt; 10 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;$filepath&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Feature request - Application Layer Backup
&lt;/h3&gt;

&lt;p&gt;This could be done using pg_dump, or using mongodump.&lt;/p&gt;

&lt;p&gt;It's really two minutes of talking to Claude to get a Docker command like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--network&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;dirname&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_PATH&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:/backup &lt;span class="se"&gt;\&lt;/span&gt;
  mongo:7.0.15-jammy &lt;span class="se"&gt;\&lt;/span&gt;
  mongodump &lt;span class="nt"&gt;--host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;27090 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MONGO_USERNAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$MONGO_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--authenticationDatabase&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;admin &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--archive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/backup/&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;basename&lt;/span&gt; &lt;span class="nv"&gt;$BACKUP_PATH&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="nt"&gt;--gzip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Q: Why double-rsync?&lt;br&gt;
A: The first rsync copies most files. During this time, some files might change. The second rsync then efficiently copies only the files that changed during the first pass, resulting in a more consistent snapshot.&lt;/p&gt;

&lt;p&gt;Q: Storage layer backup? Isn't this a problem?&lt;br&gt;
A: Yes, it is; restoring will require the exact same database version, e.g. the same Docker tag.&lt;/p&gt;

&lt;p&gt;Q: What about differential backup?&lt;br&gt;
A: For larger systems, it makes a lot of sense to integrate CDC (change data capture) and do faster incremental backups. However, for larger systems we might already be using managed solutions.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>There are more than 2 UUID types - UUIDv4, 7, ULID, etc...</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Sat, 17 Feb 2024 20:09:23 +0000</pubDate>
      <link>https://dev.to/yuval1024/there-are-more-than-2-uuid-types-uuidv4-7-ulid-etc-1jg4</link>
      <guid>https://dev.to/yuval1024/there-are-more-than-2-uuid-types-uuidv4-7-ulid-etc-1jg4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Tl;dr&lt;/strong&gt; - UUIDv4, UUIDv7, ULID, Base64, Base58, Base85, HashIDs (hiding IDs on the frontend), libs compatibility between different SDKs.&lt;/p&gt;




&lt;p&gt;So, you've all heard about UUIDv4. It's just a very random collection of bits, represented nicely.&lt;/p&gt;

&lt;p&gt;Let's review some other UUIDs:&lt;/p&gt;

&lt;h2&gt;
  
  
  ULID / UUIDv7
&lt;/h2&gt;

&lt;p&gt;UUIDv4 has a major issue - ordering by a UUIDv4 column is problematic.&lt;br&gt;&lt;br&gt;
Let's say you have a database table with an id that is just a simple int with &lt;code&gt;AUTO_INCREMENT&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
The first record will be id=1, the second record id=2, etc.&lt;br&gt;
Now, when you do something like &lt;code&gt;SELECT * FROM my_table ORDER BY id&lt;/code&gt;, the results will be sorted, but more importantly - they will be stored physically close to each other.&lt;br&gt;&lt;br&gt;
E.g., if you iterate in chunks of 100, you will not jump all over the database; consecutive results sit near each other.  &lt;/p&gt;

&lt;p&gt;What happens with UUIDv4? There's really no point in sorting, because you are just sorting a bunch of random numbers.&lt;br&gt;&lt;br&gt;
In addition, you will "jump" all over the database when reading records. And if your DB is big enough, the pages you read will keep getting evicted from cache.&lt;/p&gt;
&lt;h3&gt;
  
  
  So what about ULID / UUIDv7?
&lt;/h3&gt;

&lt;p&gt;ULID and UUIDv7 are two formats that prefix the ID with a timestamp, i.e. the IDs are always increasing.&lt;br&gt;&lt;br&gt;
This way, for example, if you have a table with a ULID/UUIDv7 index, you can run &lt;code&gt;SELECT * FROM my_table ORDER BY id&lt;/code&gt; and it would make sense.  &lt;/p&gt;
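A hand-rolled sketch of the idea (this ignores the version/variant bits of the real UUIDv7 spec): put a 48-bit millisecond timestamp in the most significant bits and randomness below, and lexicographic order follows creation order. Note the flip side: the first 12 hex characters reveal the creation time.

```python
import secrets, time

def uuid7ish(ts_ms=None):
    """Time-prefixed 128-bit ID as hex: a 48-bit millisecond timestamp
    followed by 80 random bits (a sketch of the UUIDv7/ULID idea,
    not the exact bit layout of the real specs)."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    value = ts_ms * 2**80 + secrets.randbits(80)
    return "%032x" % value

a = uuid7ish(ts_ms=1700000000000)
b = uuid7ish(ts_ms=1700000000001)
# a sorts before b because the timestamp occupies the most significant bits
```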
&lt;h3&gt;
  
  
  Problems with ULID / UUIDv7?
&lt;/h3&gt;

&lt;p&gt;One issue is adoption; another, more problematic one, is information leakage - given an ID, we can tell when it was created.&lt;/p&gt;


&lt;h2&gt;
  
  
  "Nano IDs"
&lt;/h2&gt;

&lt;p&gt;This is just a summary of this great article - &lt;a href="https://unkey.dev/blog/uuid-ux"&gt;The UX of UUIDs&lt;/a&gt;. Go read it now.&lt;br&gt;&lt;br&gt;
UUIDs are not easy to copy; the "-" characters prevent a double-click from selecting the whole string.&lt;br&gt;&lt;br&gt;
We can see what Stripe is doing - the key is just a random string, without dashes; in addition, it is prefixed with a key-type description. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;STRIPE_LIVE_PUBLIC_KEY="pk_live_xUBcwUhe....."
STRIPE_LIVE_SECRET_KEY="sk_live_gpTjnUwB....."
STRIPE_TEST_PUBLIC_KEY="pk_test_CcfLsSzE....."
STRIPE_TEST_SECRET_KEY="sk_test_WFnNSjpB....."
DJSTRIPE_WEBHOOK_SECRET="whsec_LqqRWEKkd....."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can copy the key(s) with a double-click, and the key also carries more information. How come?&lt;br&gt;&lt;br&gt;
The answer is that a UUID (v4, for example) is represented in hexadecimal; Stripe IDs are represented in a bigger base. The bigger the base, the shorter the string for the same amount of data.&lt;/p&gt;
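To see the effect, here is a sketch that encodes the same 128-bit value in base 16 and base 62 (hand-rolled conversion for illustration; production code should use a tested library):

```python
import secrets, string

def encode(n, alphabet):
    """Encode a non-negative integer using the given digit alphabet."""
    base = len(alphabet)
    digits = ""
    while True:
        n, rem = divmod(n, base)
        digits = alphabet[rem] + digits
        if n == 0:
            return digits

n = secrets.randbits(128)
hex_form = encode(n, "0123456789abcdef")                    # base 16: up to 32 chars
b62_form = encode(n, string.digits + string.ascii_letters)  # base 62: up to 22 chars
```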
&lt;h3&gt;
  
  
  Base64 vs Base58
&lt;/h3&gt;

&lt;p&gt;We all know about Base64 (see FAQ if not) - but what is Base58?&lt;br&gt;
Base58 is just like Base64, but with the easily confused characters omitted: 0, O, I and l are removed to avoid confusion, and + and / are removed as well.&lt;/p&gt;
&lt;h3&gt;
  
  
  Base85???
&lt;/h3&gt;

&lt;p&gt;Yes, there is also Base85. Let's say:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;you work for a software company which distributes signed .exe files to customers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;and you want the filename to contain url the executable should connect to on first run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can't add this URL to the file content, since you would have to sign many different files (*).  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;URL should be part of the filename&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filename should be short as possible.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So - use Base85 to encode the URL; the bigger the base, the shorter the string.&lt;br&gt;
And this way you get a short filename.&lt;/p&gt;
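Python's standard library can demonstrate the size difference; base64.b85encode is stdlib, though whether every Base85 output character is safe in a filename on your OS is worth verifying. The URL below is a hypothetical example of mine:

```python
import base64

url = b"https://example.com/activate?id=12345"  # hypothetical first-run URL
b64 = base64.urlsafe_b64encode(url)
b85 = base64.b85encode(url)
# Base85 packs 4 bytes into 5 chars; Base64 packs 3 bytes into 4 chars,
# so the Base85 form is shorter for the same payload.
```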


&lt;h2&gt;
  
  
  Hash IDs
&lt;/h2&gt;

&lt;p&gt;Let's say you have a SaaS, and you give each new user an ID. And you have a view of the format &lt;code&gt;https://my-saas.com/users/123&lt;/code&gt; (where 123 is the user_id).&lt;br&gt;&lt;br&gt;
What happens is, people can &lt;a href="https://en.wikipedia.org/wiki/German_tank_problem"&gt;estimate&lt;/a&gt; the number of users on your website by creating a new user and checking the id they got.&lt;br&gt;&lt;br&gt;
So - how can you hide the real user_id from the user itself?&lt;br&gt;&lt;br&gt;
One option, of course, is to use a random id, but then we would get all the issues of UUIDs (a UUID is just a special case of a random ID).  &lt;/p&gt;

&lt;p&gt;Another option is to encode the ID using some key; and this is exactly what &lt;a href="https://sqids.org/python?hashids"&gt;Sqids (formerly Hashids)&lt;/a&gt; is doing!  &lt;/p&gt;

&lt;p&gt;Using a secret key (*), you can convert id-&amp;gt;string and string-&amp;gt;id, and this way you can have something like &lt;code&gt;https://my-saas.com/users/nVB&lt;/code&gt;, and convert &lt;code&gt;nVB&lt;/code&gt; to user_id &lt;code&gt;123&lt;/code&gt; in your backend.&lt;/p&gt;

&lt;p&gt;Sample code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Taken from: https://github.com/davidaurelio/hashids-python

hashids = Hashids(salt='this is my salt 1')
hashid = hashids.encode(123) # 'nVB'

# and with different salt:
hashids = Hashids(salt='this is my salt 2')
hashid = hashids.encode(123) # 'ojK'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What can be the problem with HashIDs?
&lt;/h3&gt;

&lt;p&gt;Well, first we should check what algorithm it uses, and make sure &lt;code&gt;decode(encode(id)) == id&lt;/code&gt; for all ids; i.e. that we can trust the lib (algorithm) to do the conversions without error.&lt;br&gt;&lt;br&gt;
Another issue is that we might be bound to a specific implementation (and thus technology), unless we verify that results do not change when we switch libs.&lt;br&gt;
Security review of the algorithm - the algorithm does some logic to &lt;a href="https://github.com/davidaurelio/hashids-python?tab=readme-ov-file#curses-"&gt;avoid generating most common English curse words by never placing some letters next to each other&lt;/a&gt;; this might sound like trouble, entropy-wise.&lt;/p&gt;
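The round-trip check can be sketched with a toy reversible encoding standing in for the real library (with the actual hashids lib, you would assert decode(encode(id)) == id over a large range of ids):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def encode(n, salt=7):
    # toy stand-in for hashids.encode: salt-shifted base-26 encoding
    digits = ""
    while True:
        n, rem = divmod(n, 26)
        digits = ALPHABET[(rem + salt) % 26] + digits
        if n == 0:
            return digits

def decode(s, salt=7):
    n = 0
    for ch in s:
        n = n * 26 + (ALPHABET.index(ch) - salt) % 26
    return n

# property check: encode/decode must be exact inverses over many ids
for i in range(10_000):
    assert decode(encode(i)) == i
```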

&lt;h4&gt;
  
  
  Checking Cross-Language Consistency
&lt;/h4&gt;

&lt;p&gt;What happens if the frontend uses Hashids in JavaScript but the backend uses Python, C++ or Rust, for example? One way to check: generate known (salt, id, hashid) triples in one language and assert that the other language reproduces them.&lt;/p&gt;




&lt;h3&gt;
  
  
  FAQ
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: Do we really need 128 bits for an ID? Isn't it too much? What are the odds of collisions?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: Some people claim this isn't really needed. Referring to the &lt;a href="https://en.wikipedia.org/wiki/Birthday_problem"&gt;Birthday Attack probability table&lt;/a&gt;, we see that for 128 bits we would need about 2.6×10&lt;sup&gt;18&lt;/sup&gt; keys in order to get a collision with 1% probability.&lt;/p&gt;
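That figure follows from the birthday approximation n ≈ sqrt(2 * 2^bits * ln(1/(1-p))):

```python
import math

def birthday_n(bits, p):
    """Approximate number of random keys needed for collision
    probability p in a space of 2**bits values."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))

n = birthday_n(128, 0.01)  # on the order of 2.6e18
```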

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: What is base64 for?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: Let's say you want to transfer information via text, e.g. you want to serialize arbitrary (possibly binary) data and send it to another program. Base64 encodes any bytes using only 64 safe text characters, so the data survives text-only channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: Why the "(*)" in "you would have to sign many different files"?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: There are some mechanisms to deal with it, e.g. signing the whole file except a small part of metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: If we use UUID with time as prefix, what happens on daylight saving time?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: Nothing; as the time is unix epoch, which is always increasing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: Why do hash ID libs call the secret key "salt"? It's a secret, not a salt..&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: The goal of hash IDs is to convert a number to a string and vice versa, i.e. to supply a two-directional hash function.&lt;br&gt;&lt;br&gt;
Thus, in order to change the hash result, we use a salt.&lt;br&gt;
In the algorithmic layer, this is indeed a "salt".&lt;br&gt;&lt;br&gt;
In the Product/Marketing layer, it should be called a "secret".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q&lt;/strong&gt;: In "Checking cross-language", why do we need a Dockerfile for the test?&lt;br&gt;
&lt;strong&gt;A&lt;/strong&gt;: We really don't need; we can do this one time to check the implementations we need and that's it. Dockerfile is for demonstration purposes only.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://buildkite.com/blog/goodbye-integers-hello-uuids"&gt;UUIDs and poor index locality&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;a href="https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/"&gt;Benchmarking UUIDs and checking WAL&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://buildkite.com/blog/goodbye-integers-hello-uuids"&gt;https://buildkite.com/blog/goodbye-integers-hello-uuids&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Testing Github Co-Pilot and Trying to Win World Cup Bet</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Sun, 20 Nov 2022 19:48:52 +0000</pubDate>
      <link>https://dev.to/yuval1024/testing-github-co-pilot-and-trying-to-win-world-cup-bet-104d</link>
      <guid>https://dev.to/yuval1024/testing-github-co-pilot-and-trying-to-win-world-cup-bet-104d</guid>
      <description>&lt;p&gt;The world of Algorithmic Betting is very reach; lots of words were written about &lt;a href="https://en.wikipedia.org/wiki/Arbitrage_betting"&gt;Arbitrage Betting&lt;/a&gt;; what it means is, that different booking providers give different ratios for the same game, so you can, by betting in multiple providers, guarantee to make a profit.  &lt;/p&gt;

&lt;p&gt;However, this requires a lot of effort - real-time betting, scraping, etc..  &lt;/p&gt;

&lt;p&gt;This post will be about trying to find the best strategy for gambling with friends.&lt;br&gt;&lt;br&gt;
Two options exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bet "single" - single result on a game - home/draw/away - 2 points if correct&lt;/li&gt;
&lt;li&gt;Bet "double" - choose 2 of home/draw/away - 1 point if correct&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Bonus - guess 5/6 games in a group - get 1 bonus point; guess 6/6 games in a group - get 3 bonus points.&lt;/p&gt;

&lt;p&gt;A script was written to find the cut-off odds below which you should take a single bet. E.g., if a team is expected to win at odds of 1.1 and the cut-off is 1.2, bet single.&lt;br&gt;
If a team is expected to win at 1.5 but the cut-off is 1.4, bet double - i.e. this team plus the next best outcome.&lt;/p&gt;
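The decision rule might look like this (a sketch; the names and structure are mine, not the actual script's):

```python
def choose_bet(best_odds, single_limit):
    """Return "single" when the favourite's odds do not exceed the
    cut-off (a near-certain outcome is worth the 2-point single bet),
    "double" otherwise."""
    if min(best_odds, single_limit) == best_odds:
        # the favourite's odds are at or below the cut-off
        return "single"
    return "double"
```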
&lt;h3&gt;
  
  
  Howto?
&lt;/h3&gt;

&lt;p&gt;We start with RapidAPI: a Google search for &lt;a href="https://letmegooglethat.com/?q=rapid+api+soccer+bet"&gt;rapid api soccer bet&lt;/a&gt;, then we need to find a free provider.&lt;br&gt;
We'll go with Pinnacle, subscribing to the free plan, which is enough.&lt;br&gt;
We scrape the market, scrape the games for the market, and save the results in a cache.&lt;br&gt;
Then we set a &lt;code&gt;single_limit&lt;/code&gt;, i.e. the threshold between single and double bets - below this limit, always bet single.  &lt;/p&gt;
&lt;h4&gt;
  
  
  Teams to Groups
&lt;/h4&gt;

&lt;p&gt;Each group has 6 games - &lt;a href="https://en.wikipedia.org/wiki/Binomial_coefficient"&gt;4 choose 2&lt;/a&gt;, i.e. &lt;code&gt;4!/(2!2!)&lt;/code&gt;.&lt;br&gt;
For 5 correct answers we get 1 bonus point, for 6 we get 3.&lt;br&gt;&lt;br&gt;
How do we know if we have 5 or 6 correct answers? We need to map each game (team) to its group.&lt;br&gt;&lt;br&gt;
How? It can be done automatically! Since "same group as" is an &lt;a href="https://en.wikipedia.org/wiki/Equivalence_relation"&gt;Equivalence Relation&lt;/a&gt;,&lt;br&gt;
we can build it: if we have the games:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;team A &amp;lt;-&amp;gt; Team B, And&lt;/li&gt;
&lt;li&gt;Team B &amp;lt;-&amp;gt; Team C, and&lt;/li&gt;
&lt;li&gt;Team C &amp;lt;-&amp;gt; Team D,
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then we know A, B, C and D are all in the same group!&lt;br&gt;
And we don't have to enter all the data manually. In the code, a similar solution is implemented in &lt;code&gt;GroupsHelper&lt;/code&gt;.&lt;/p&gt;
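This kind of grouping is classic union-find; a minimal sketch (my own illustration, the article's GroupsHelper may be implemented differently):

```python
class Groups:
    """Union-find over team names: every game joins two teams into
    the same group."""
    def __init__(self):
        self.parent = {}

    def find(self, team):
        # follow parent links up to the group's representative
        self.parent.setdefault(team, team)
        while self.parent[team] != team:
            team = self.parent[team]
        return team

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

g = Groups()
for a, b in [("A", "B"), ("B", "C"), ("C", "D")]:
    g.union(a, b)
# now all four teams share one group representative
```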
&lt;h4&gt;
  
  
  Use Copilot!
&lt;/h4&gt;

&lt;p&gt;There is some &lt;a href="https://www.theverge.com/2022/11/8/23446821/microsoft-openai-github-copilot-class-action-lawsuit-ai-copyright-violation-training-data"&gt;controversy&lt;/a&gt; around GitHub Copilot.&lt;br&gt;&lt;br&gt;
So please don't use it in your corporate job, lol.. Or make sure to ask legal before doing so.&lt;br&gt;
I've used Copilot for this toy project and got some nice results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main takeaways&lt;/strong&gt;:&lt;br&gt;
It can generate a complete class, if the class is trivial (e.g. the Game class).&lt;br&gt;
You can add a comment before a function to give Copilot a "hint" about what is expected from it. The function name is a hint, of course, but so is the comment.&lt;/p&gt;

&lt;p&gt;Sometimes even hints don't help.. The model gives us the &lt;a href="https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_Group_A"&gt;2018 World Cup groups&lt;/a&gt;, not 2022 as instructed..&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V5xTKAED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bke59p7yvtw7il8qyw1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V5xTKAED--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bke59p7yvtw7il8qyw1u.png" alt="copilot giving 2018 world cup groups" width="500" height="397"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Q&amp;amp;A
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is &lt;code&gt;RAPID_API_KEY = os.environ.get('RAPID_API_KEY')&lt;/code&gt;?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; You should store configuration in environment variables; never in code. See &lt;a href="https://12factor.net/"&gt;12 factors app&lt;/a&gt;.&lt;br&gt;
Python .pyc files can easily be &lt;a href="https://github.com/rocky/python-uncompyle6/"&gt;"decompiled"&lt;/a&gt; to .py and reveal all secrets in code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What methods can be used to explore the API?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; The best option is to search for Swagger file. Swagger is "open source editor to design, define and document RESTful APIs in the Swagger Specification".&lt;br&gt;&lt;br&gt;
Another alternative, is to search for Postman collection for relevant product / service.  &lt;a href="https://github.com/hkamel/azuredevops-postman-collections"&gt;Some&lt;/a&gt; &lt;a href="https://github.com/twitterdev/postman-twitter-api"&gt;Postman&lt;/a&gt; &lt;a href="https://github.com/CiscoDevNet/postman-webex"&gt;collections&lt;/a&gt; &lt;a href="https://github.com/esri-es/ArcGIS-REST-API"&gt;examples&lt;/a&gt;. &lt;br&gt;
For this project, I've used some hacky method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resp = requests.get(url)
open('out.txt', 'w', encoding=resp.encoding).write(json.dumps(resp.json(), indent=4))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--etfAID2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p5gi9hjtxd9vqbl4cq5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--etfAID2m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p5gi9hjtxd9vqbl4cq5j.png" alt="saving json to file with indent" width="608" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What about exception handling?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Scraping part was manual, e.g. scraping results of all of the games, and that's it; so in case of error - it was handled manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; I guess there are some libraries to handle HTTP request caching?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Yes, &lt;a href="https://requests-cache.readthedocs.io/en/stable/"&gt;there are indeed&lt;/a&gt;; however, learning those libs would take too much dev time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the &lt;code&gt;req_id&lt;/code&gt; in &lt;code&gt;do_req&lt;/code&gt;?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; While other libs like &lt;code&gt;requests-cache&lt;/code&gt; automatically integrate into (i.e. patch) &lt;code&gt;requests&lt;/code&gt;, since we implement our own "cache" we need a way to know whether a request was already fetched.&lt;br&gt;&lt;br&gt;
This is a signature of the request, which lets us check quickly whether it already exists in the cache. E.g., we could save responses in some key-value DB (Redis) and query by the signature. Actually, we use the disk (/tmp/cache/) as the key-value store.&lt;/p&gt;
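A minimal sketch of such a disk-backed cache (the req_id/do_req bodies here are my assumptions, since the article's code is not shown):

```python
import hashlib, os, tempfile

CACHE_DIR = os.path.join(tempfile.gettempdir(), "cache")

def req_id(url):
    # signature of the request: a stable hash of the URL
    return hashlib.sha256(url.encode()).hexdigest()

def do_req(url, fetch):
    """Return the cached body for url, calling fetch(url) only on a miss.
    The filesystem acts as the key-value store, keyed by req_id."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, req_id(url))
    if os.path.exists(path):
        return open(path).read()
    body = fetch(url)
    open(path, "w").write(body)
    return body
```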

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; Why does numpy give a warning on the print() line?&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kebloSqY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i3yyo22k72f71r9d5uwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kebloSqY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i3yyo22k72f71r9d5uwo.png" alt="numpy warning message" width="880" height="116"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Check what is the return type from &lt;code&gt;np.mean()&lt;/code&gt;, for example..&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the name of the technique described by the line &lt;code&gt;for single_limit in [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]:&lt;/code&gt;?&lt;br&gt;
Which scikit-learn method does it correspond to?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Grid search; we search for the best parameters for the model.&lt;br&gt;
scikit-learn ref - &lt;a href="https://scikit-learn.org/stable/modules/grid_search.html"&gt;https://scikit-learn.org/stable/modules/grid_search.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What about a Genetic Algorithm?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; That was the original plan - to use a GA to find the best betting strategy - but there weren't enough parameters for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; In &lt;code&gt;tests.py&lt;/code&gt;, some of the tests are missing an &lt;code&gt;assert&lt;/code&gt; statement!&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; Correct; these are statistical tests, so I print values and check them manually; this fits the "type" of the project.&lt;br&gt;&lt;br&gt;
If it were a production system, we'd have to check that values are "similar", e.g. within 2 or 3 standard deviations of each other..&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q:&lt;/strong&gt; What is the &lt;code&gt;random.seed(42)&lt;/code&gt; for?&lt;br&gt;
&lt;strong&gt;A:&lt;/strong&gt; In case of bugs, we want to be able to reproduce them, and &lt;code&gt;random.seed()&lt;/code&gt; gives us reproducible results.&lt;br&gt;
Q: But then you get the same results every time; don't you want randomization?&lt;br&gt;
A: Actually, we do. So we can use something like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="n"&gt;time_seed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"seeding with %d"&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_seed&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time_seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
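&lt;p&gt;To replay a failing run later, feed the printed seed back in; a minimal sketch (the &lt;code&gt;REPRO_SEED&lt;/code&gt; env var name is illustrative):&lt;/p&gt;

```python
import os
import random
import time

# if REPRO_SEED is set, replay that exact run; otherwise seed from the clock
# (REPRO_SEED is an illustrative name, not from the original post)
seed = int(os.environ.get('REPRO_SEED', time.time()))
print('seeding with %d' % seed)
random.seed(seed)
```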

&lt;h3&gt;
  
  
  Source Code
&lt;/h3&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;




&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Python - PDB usage and reproducing program execution</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Fri, 11 Nov 2022 10:58:48 +0000</pubDate>
      <link>https://dev.to/yuval1024/python-pdb-usage-and-reproducing-program-execution-mo6</link>
      <guid>https://dev.to/yuval1024/python-pdb-usage-and-reproducing-program-execution-mo6</guid>
      <description>&lt;p&gt;So imagine you have a Python program, and you want to inspect some parameters during an error.&lt;br&gt;&lt;br&gt;
There are &lt;a href="https://www.rookout.com/"&gt;many&lt;/a&gt;, &lt;a href="https://sentry.io"&gt;possible&lt;/a&gt;, &lt;a href="https://docs.python.org/3/library/traceback.html"&gt;ways&lt;/a&gt; to do that;&lt;br&gt;
I'd like to talk about a basic one, which involves a debugger. Just like GDB for C/C++, Python has PDB.&lt;/p&gt;

&lt;p&gt;PDB is a command-line debugger, which can be attached to a process or started from within the process.&lt;br&gt;&lt;br&gt;
Just add the line &lt;code&gt;import pdb; pdb.set_trace()&lt;/code&gt; and you will have a shell where you can communicate with the process.  &lt;/p&gt;

&lt;p&gt;Needless to say, this is good only for CLI programs. Others, like servers, need other solutions (Rookout, PyCharm's remote debugger, etc.).&lt;/p&gt;

&lt;p&gt;Let's say we run a program which calls &lt;code&gt;some_erroneous_function&lt;/code&gt;, and we want to know some value from inside this function.&lt;br&gt;
&lt;code&gt;main() -&amp;gt; foo() -&amp;gt; some_erroneous_function()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;How can we know the value inside &lt;code&gt;some_erroneous_function()&lt;/code&gt;?&lt;br&gt;
Simple - add the following line:&lt;br&gt;&lt;br&gt;
&lt;code&gt;import pdb; pdb.set_trace()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Without it, we can't see the value of &lt;code&gt;a&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--83HJSNM3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jsi7wflt9ashsxsjamfn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--83HJSNM3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jsi7wflt9ashsxsjamfn.png" alt="function raising exception" width="859" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With it, we do manage to see the value of &lt;code&gt;a&lt;/code&gt;:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xYDcyn3c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4w66yhdmm9iqdk6yr7tm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xYDcyn3c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4w66yhdmm9iqdk6yr7tm.png" alt="adding pdb set_trace to get shell into program execution" width="745" height="494"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What happens when program A runs program B?
&lt;/h3&gt;

&lt;p&gt;When we have&lt;br&gt;
&lt;code&gt;main() -&amp;gt; bar() -&amp;gt; cli_app_bar.py -&amp;gt; some_erroneous_function()&lt;/code&gt;,&lt;br&gt;&lt;br&gt;
the &lt;code&gt;import pdb; pdb.set_trace()&lt;/code&gt; trick simply doesn't work;&lt;br&gt;
we get a stuck process instead. This is because pdb opens in the child process, while the parent process is waiting for the child process to finish, so we're stuck.  &lt;/p&gt;

&lt;p&gt;In this case, we should run the child process ourselves.  &lt;/p&gt;
&lt;h3&gt;
  
  
  What parts are required to run a child process ourselves?
&lt;/h3&gt;

&lt;p&gt;There are 2 parts which are required; one is obvious, the other is often forgotten!&lt;br&gt;
The 2 parts are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;program name + command-line arguments&lt;/li&gt;
&lt;li&gt;environment variables!&lt;/li&gt;
&lt;li&gt;(there's a 3rd part, which is IPC messages, but it's very hard to mimic such behavior...)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's see how we capture this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modify the program to save its CLI arguments and env vars&lt;/li&gt;
&lt;li&gt;Re-run it using those CLI arguments and env vars&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Getting the cmd + env vars
&lt;/h3&gt;

&lt;p&gt;There are several methods; to get the env vars of a running process, you can use &lt;code&gt;cat /proc/46/environ | tr '\0' '\n'&lt;/code&gt; (replace 46 with the process id).&lt;/p&gt;

&lt;p&gt;From within the Python process, we want to print the env vars in a "ready to go" format, e.g. with the &lt;code&gt;export&lt;/code&gt; prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;os&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'/tmp/params.txt'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'w'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# print all env vars
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;fout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'export "%s"="%s"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
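&lt;p&gt;The snippet above covers the env vars; the command line - the other required part - can be captured into the same file. A sketch (using the stdlib's &lt;code&gt;shlex.quote&lt;/code&gt; so the line can be pasted back into a shell verbatim):&lt;/p&gt;

```python
import os
import shlex
import sys

# capture both required parts: the exact command line and the env vars;
# the file path matches the snippet above, the format is illustrative
with open('/tmp/params.txt', 'w') as fout:
    fout.write('# command: %s\n' % ' '.join(shlex.quote(a) for a in sys.argv))
    for k, v in os.environ.items():
        fout.write('export %s=%s\n' % (k, shlex.quote(v)))
```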


&lt;p&gt;And then diff with current env vars:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"creating bar before"&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; create_before.py
#!/usr/bin/python3
import os
with open('/tmp/params.before.txt', 'w') as fout:
    for k, v in os.environ.items():
        fout.write('export "%s"="%s"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;' % (k,v))
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;python create_before.py

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"print some stats"&lt;/span&gt;
&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; /tmp/params.txt /tmp/params.before.txt

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"get keys"&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/params.txt | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s1"&gt;'='&lt;/span&gt; &lt;span class="s1"&gt;' { print $1 } '&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/params.keys.txt
&lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/params.before.txt | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s1"&gt;'='&lt;/span&gt; &lt;span class="s1"&gt;' { print $1 } '&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/params.before.keys.txt
&lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; /tmp/params.keys.txt /tmp/params.before.keys.txt
diff /tmp/params.keys.txt /tmp/params.before.keys.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And we get the newly added env var key, &lt;code&gt;EXTRA&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;YuvShell $ diff /tmp/params.keys.txt /tmp/params.before.keys.txt
1d0
&amp;lt; export "EXTRA"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Questions
&lt;/h3&gt;

&lt;p&gt;Q: What is the "YuvShell"??&lt;br&gt;
A: It's just me editing the ~/.bashrc and changing the PS1 (Prompt String) var;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fqP0jSsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkkruyhuocodizi5sczw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fqP0jSsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lkkruyhuocodizi5sczw.png" alt="changing bash shell prompt string" width="880" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Q: What is the difference between &lt;code&gt;cat some_file.txt | wc -l&lt;/code&gt; and &lt;code&gt;wc -l some_file.txt&lt;/code&gt;?&lt;br&gt;
A: With &lt;code&gt;cat + wc&lt;/code&gt; we use a pipe to transfer data from the cat output to the wc input; with wc only, we don't use the pipe.&lt;br&gt;&lt;br&gt;
Let's create some big file from urandom, and see &lt;code&gt;time&lt;/code&gt; output of both options:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat /dev/urandom | base64 | head -c 1GB &amp;gt; /tmp/random_1GB_file.txt

time cat /tmp/random_1GB_file.txt | wc -l
time wc -l /tmp/random_1GB_file.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ov-tUuhX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9z9wb02kc1jnech0qql0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ov-tUuhX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9z9wb02kc1jnech0qql0.png" alt="performance results of wc with and without pipe" width="564" height="363"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Source Code
&lt;/h3&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;




&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Data Ingestion - Build Your Own "Map Reduce"?</title>
      <dc:creator>Yuval</dc:creator>
      <pubDate>Fri, 24 Dec 2021 12:04:22 +0000</pubDate>
      <link>https://dev.to/yuval1024/data-ingestion-build-your-own-map-reduce-2j1h</link>
      <guid>https://dev.to/yuval1024/data-ingestion-build-your-own-map-reduce-2j1h</guid>
      <description>&lt;h2&gt;
  
  
  Why map reduce
&lt;/h2&gt;

&lt;p&gt;Let's say you work at Facebook; you have lots of data and probably need lots of map-reduce tasks.&lt;br&gt;
You will use mrjob/PySpark/Spark/Hadoop. You get the point - you need 1 framework to rule them all.&lt;br&gt;
You need a system: where will temp files be stored, API with the cloud, data security, erasure, multi-tenancy, etc.&lt;br&gt;
You need standards - standards between developers themselves, between developers and devops, etc.&lt;/p&gt;

&lt;p&gt;Let's say, on the other hand, you're a solopreneur or a small startup. A team of 3-4 developers at most.&lt;br&gt;
You need things to work, and work fast.&lt;br&gt;
You don't have 10Ks of map-reduce jobs, but probably 1 or 2.&lt;br&gt;
You won't be using Hadoop, that's for sure. You might be using:&lt;/p&gt;
&lt;h2&gt;
  
  
  Different approaches
&lt;/h2&gt;
&lt;h3&gt;
  
  
  LINQ
&lt;/h3&gt;

&lt;p&gt;Not really map-reduce per se, more like "SQL without an SQL engine".&lt;br&gt;
However, this adds the complexities of .NET to your environment;&lt;br&gt;
e.g. read release notes and understand if you can run it on your &lt;a href="https://docs.microsoft.com/en-us/dotnet/core/install/linux"&gt;different OSes&lt;/a&gt; (production, staging, developer machines).&lt;br&gt;
Also - you need to learn C#: loading from files, different encodings, saving, iterators, etc.&lt;br&gt;
If you're not proficient with C#, this could be a one-time investment which will not be worth it.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;a href="https://github.com/Yelp/mrjob"&gt;Mrjob&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Pros: a native Python lib; you can debug easily (using &lt;a href="https://mrjob.readthedocs.io/en/latest/runners-inline.html"&gt;inline&lt;/a&gt;), run &lt;a href="https://mrjob.readthedocs.io/en/latest/runners-local.html"&gt;locally&lt;/a&gt;, e.g. multi-process on the local machine,&lt;br&gt;
or use Hadoop, Dataproc (it seems that &lt;a href="https://cloud.google.com/dataproc/docs/concepts/overview"&gt;"Dataproc is a managed Spark and Hadoop service..."&lt;/a&gt;), etc.&lt;br&gt;
However, there are lots of moving parts and different configuration options.&lt;/p&gt;
&lt;h3&gt;
  
  
  Custom made map-reduce
&lt;/h3&gt;

&lt;p&gt;Let's go to the UCI Machine Learning website (2015 is on the phone..),&lt;br&gt;
choose some &lt;a href="http://archive.ics.uci.edu/ml/datasets/Bar+Crawl%3A+Detecting+Heavy+Drinking"&gt;dataset&lt;/a&gt;, and test.&lt;/p&gt;

&lt;p&gt;Some notes:&lt;br&gt;
We don't need SHA-256, not even Base64; nothing bad will happen if the keys are not distributed perfectly evenly.&lt;br&gt;
We could take MMH3; googling "python murmurhash" gives 2 interesting results, and since both use &lt;a href="https://github.com/hajimes/mmh3/blob/master/MurmurHash3.cpp"&gt;the&lt;/a&gt; &lt;a href="https://github.com/explosion/murmurhash/blob/master/murmurhash/MurmurHash3.cpp"&gt;same&lt;/a&gt; cpp code, let's take the one with the &lt;a href="https://github.com/hajimes/mmh3"&gt;most stars&lt;/a&gt;.&lt;br&gt;
Other options would be to simply do (% NUM_SHARDS), or even a right shift (however, that requires the shard count to be a power of 2).&lt;/p&gt;
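&lt;p&gt;A minimal sketch of the sharding decision described above; I use the stdlib's &lt;code&gt;zlib.crc32&lt;/code&gt; here so it runs anywhere, but &lt;code&gt;mmh3&lt;/code&gt;'s hash would slot in the same way:&lt;/p&gt;

```python
import zlib

NUM_SHARDS = 8  # power of 2 not required for the modulo approach

def shard_for(key):
    # any fast, well-mixed hash works here; cryptographic strength is wasted.
    # zlib.crc32 stands in for the MMH3 hash discussed in the post.
    return zlib.crc32(key.encode('utf-8')) % NUM_SHARDS
```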

&lt;p&gt;mini setup script:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;and 2 python test scripts:&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Results:&lt;br&gt;
imap runs much slower;&lt;br&gt;
we can look at it/sec from tqdm to see that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# test.py sample after 4 seconds:
2801493it [00:04, 566075.99it/s]

# test_imap.py sample after 4 seconds:
73439it [00:04, 18754.44it/s]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that the non-imap version is about 30x faster!&lt;/p&gt;

&lt;h2&gt;
  
  
  Q&amp;amp;A
&lt;/h2&gt;

&lt;p&gt;Q: Why a setup.sh and not a requirements.txt file?&lt;br&gt;
A: This is not production code; it's aimed at quick reproducibility, not at pinning the exact same libs (e.g. for security etc.)&lt;/p&gt;

&lt;p&gt;Q: Why MMH3 and not sha256?&lt;br&gt;
A: This is not a security product, we don't need cryptographic hash; we just need a &lt;em&gt;nice distribution of keys&lt;/em&gt;, and we want this to be &lt;strong&gt;fast&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Q: Why is imap slower than a single process?&lt;br&gt;
A: Probably because the imap version has &lt;strong&gt;lots of overhead because of IPC&lt;/strong&gt;;&lt;br&gt;
the gain from offloading the (alleged) "heavy lifting" of the hash calculation to an external process is erased by the IPC.&lt;/p&gt;

&lt;p&gt;Q: Why?&lt;br&gt;
A: Using a process pool might be worth it if the task is more CPU bound; here, &lt;strong&gt;the task is more IO bound&lt;/strong&gt; and the overhead of the MMH hash doesn't justify it.&lt;/p&gt;
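&lt;p&gt;One mitigation worth testing (not measured in this post) is &lt;code&gt;imap&lt;/code&gt;'s &lt;code&gt;chunksize&lt;/code&gt; parameter, which packs many items into each IPC message and amortizes the per-item overhead:&lt;/p&gt;

```python
from multiprocessing import Pool
import zlib

NUM_SHARDS = 8

def shard_for(line):
    # zlib.crc32 stands in for the MMH3 hashing done in the post
    return zlib.crc32(line.encode('utf-8')) % NUM_SHARDS

if __name__ == '__main__':
    lines = ['line-%d' % i for i in range(100000)]
    with Pool(4) as pool:
        # chunksize batches 4096 lines per IPC message instead of one,
        # reducing the per-item overhead the measurements above point at
        shards = list(pool.imap(shard_for, lines, chunksize=4096))
```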

&lt;p&gt;Q: Are we sure about it?&lt;br&gt;
A: We could use &lt;a href="https://github.com/benfred/py-spy"&gt;py-spy&lt;/a&gt; and check.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uz7NsIkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/88aqgbnwjm0bqudnd5m4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uz7NsIkF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/88aqgbnwjm0bqudnd5m4.png" alt="Image description" width="880" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Q: Conclusion?&lt;br&gt;
A: It depends.&lt;br&gt;
It also depends on the size of the file,&lt;br&gt;
and on the post-processing of each shard.&lt;br&gt;
Conclusion - test mrjob as well; it might have better IPC.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
