<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maria Karanasou</title>
    <description>The latest articles on DEV Community by Maria Karanasou (@mkaranasou).</description>
    <link>https://dev.to/mkaranasou</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F40184%2F530cd799-a04a-49e1-b2c4-9572f8285591.png</url>
      <title>DEV Community: Maria Karanasou</title>
      <link>https://dev.to/mkaranasou</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mkaranasou"/>
    <language>en</language>
    <item>
      <title>Python YAML configuration with environment variables parsing</title>
      <dc:creator>Maria Karanasou</dc:creator>
      <pubDate>Tue, 27 Apr 2021 15:51:04 +0000</pubDate>
      <link>https://dev.to/mkaranasou/python-yaml-configuration-with-environment-variables-parsing-2ha6</link>
      <guid>https://dev.to/mkaranasou/python-yaml-configuration-with-environment-variables-parsing-2ha6</guid>
      <description>&lt;h3&gt;
  
  
  Load a YAML configuration file and resolve any environment variables
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11700%2F1%2A4s_GrxE5sn2p2PNd8fS-6A.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11700%2F1%2A4s_GrxE5sn2p2PNd8fS-6A.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: if you want to use this, check the &lt;strong&gt;UPDATE&lt;/strong&gt; at the end of the article :)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’ve worked with Python projects, you’ve probably stumbled across the many ways to provide configuration. I am not going to go through all of them here, but a few are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;using .ini files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;using a python class&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;using .env files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;using JSON or XML files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;using a yaml file&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so on. I’ve put some useful links about the different ways below, in case you are interested in digging deeper.&lt;/p&gt;

&lt;p&gt;My preference is working with yaml configuration, because I usually find it handy and easy to use, and because yaml files are also used elsewhere, e.g. in docker-compose configuration, so most people are familiar with the format.&lt;/p&gt;

&lt;p&gt;For yaml parsing I use the &lt;a href="https://pyyaml.org/wiki/PyYAMLDocumentation" rel="noopener noreferrer"&gt;PyYAML&lt;/a&gt; Python library.&lt;/p&gt;

&lt;p&gt;In this article we’ll talk about the yaml file case and, more specifically, what you can do to &lt;strong&gt;avoid keeping your secrets, e.g. passwords, hosts, usernames etc., directly in it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s say we have a very simple example of a yaml file configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;database:
 name: database_name
 user: me
 password: very_secret_and_complex
 host: localhost
 port: 5432

ws:
 user: username
 password: very_secret_and_complex_too
 host: localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;When you come to a point where you need to deploy your project, it is not really safe to have passwords and sensitive data in a plain text configuration file lying around on your production server. That’s where &lt;a href="https://medium.com/dataseries/hiding-secret-info-in-python-using-environment-variables-a2bab182eea" rel="noopener noreferrer"&gt;&lt;strong&gt;environment variables&lt;/strong&gt;&lt;/a&gt; come in handy. So the goal here is to be able to easily replace the very_secret_and_complex password with input from an environment variable, e.g. DB_PASS, so that this variable only exists when you set it and run your program instead of it being hardcoded somewhere.&lt;/p&gt;

&lt;p&gt;For PyYAML to be able to resolve environment variables, we need three main things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A regex pattern for the environment variable identification, e.g. pattern = re.compile(r'.*?\$\{(\w+)\}.*?')&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A tag that will signify that there’s an environment variable (or more) to be parsed, e.g. !ENV.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And a function that the loader will use to resolve the environment variables&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;Here’s a complete example:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
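&lt;p&gt;The embedded gist did not survive this export, so here is a minimal sketch of such a loader, built from the three pieces listed above. The function and variable names here are mine, not necessarily the original gist’s:&lt;/p&gt;

```python
import os
import re

import yaml


def parse_config(path=None, data=None, tag='!ENV'):
    """Load a yaml configuration and resolve any environment variables.

    Values tagged with e.g. !ENV must contain variables in the
    ${VAR_NAME} format.
    """
    # The regex pattern that identifies ${VAR_NAME} occurrences
    pattern = re.compile(r'.*?\$\{(\w+)\}.*?')

    # A dedicated subclass, so we don't modify yaml.SafeLoader globally
    class EnvLoader(yaml.SafeLoader):
        pass

    # Values matching the pattern get the tag implicitly
    EnvLoader.add_implicit_resolver(tag, pattern, None)

    def constructor_env_variables(loader, node):
        """Replace every ${VAR} in the node's value with os.environ['VAR']."""
        value = loader.construct_scalar(node)
        for group in pattern.findall(value):
            # Leave the placeholder untouched if the variable is not set
            value = value.replace('${%s}' % group,
                                  os.environ.get(group, '${%s}' % group))
        return value

    EnvLoader.add_constructor(tag, constructor_env_variables)

    if path:
        with open(path) as conf_file:
            return yaml.load(conf_file, Loader=EnvLoader)
    return yaml.load(data, Loader=EnvLoader)
```

&lt;p&gt;With this in place, a value like password: !ENV ${DB_PASS} resolves to the contents of the DB_PASS environment variable at load time.&lt;/p&gt;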


&lt;p&gt;Example of a YAML configuration with environment variables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;database:
 name: database_name
 user: !ENV ${DB_USER}
 password: !ENV ${DB_PASS}
 host: !ENV ${DB_HOST}
 port: 5432

ws:
 user: !ENV ${WS_USER}
 password: !ENV ${WS_PASS}
 host: !ENV 'https://${CURR_ENV}.ws.com.local'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;This can also work &lt;strong&gt;with more than one environment variable&lt;/strong&gt; declared in the same line for the same configuration parameter, like this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ws:
 user: !ENV ${WS_USER}
 password: !ENV ${WS_PASS}
 host: !ENV 'https://${CURR_ENV}.ws.com.${MODE}'  # multiple env variables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;And how to use this:&lt;/p&gt;

&lt;p&gt;First, set the environment variables. For example, for DB_PASS:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export DB_PASS=very_secret_and_complex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Or even better, so that the password is not echoed in the terminal:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read -s ‘Database password: ‘ db_pass
export DB_PASS=$db_pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
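&lt;p&gt;The gist with the runner script is also missing from this export; a minimal reconstruction (the file name matches the command shown next, the other names are my guesses) could look like this:&lt;/p&gt;

```python
# use_env_variables_in_config_example.py -- hypothetical reconstruction
import argparse
import os
import re

import yaml

ENV_PATTERN = re.compile(r'.*?\$\{(\w+)\}.*?')


def env_constructor(loader, node):
    """Replace every ${VAR} in a !ENV-tagged value with os.environ['VAR']."""
    value = loader.construct_scalar(node)
    for name in ENV_PATTERN.findall(value):
        env_value = os.environ.get(name)
        if env_value is not None:
            value = value.replace('${%s}' % name, env_value)
    return value


class EnvLoader(yaml.SafeLoader):
    pass


EnvLoader.add_constructor('!ENV', env_constructor)


def main(argv=None):
    parser = argparse.ArgumentParser(description='YAML configuration with env variables')
    parser.add_argument('-c', '--config', required=True,
                        help='path to the yaml configuration file')
    args = parser.parse_args(argv)
    with open(args.config) as f:
        conf = yaml.load(f, Loader=EnvLoader)
    print(conf)
    return conf
```

&lt;p&gt;Add an if __name__ == '__main__': main() guard at the bottom to make it runnable from the command line.&lt;/p&gt;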



&lt;p&gt;Then you can run the above script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python use_env_variables_in_config_example.py -c /path/to/yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And in your code, do stuff with conf, e.g. access the database password like this: conf['database']['password'] (the !ENV values are already resolved at load time, under the original keys).&lt;/p&gt;

&lt;p&gt;I hope this was helpful. Any thoughts, questions, corrections and suggestions are very welcome :)&lt;/p&gt;

&lt;h2&gt;
  
  
  UPDATE
&lt;/h2&gt;

&lt;p&gt;Because I — and other people — have been using this a lot, I created a (very) small library, with tests and some extra features, to make it easier to use this without copy-pasting things all over :)&lt;/p&gt;

&lt;p&gt;You can now just do:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pyaml-env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then you can import parse_config to use it in your code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyaml_env import parse_config

config = parse_config('path/to/yaml')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I also added support for default values (thanks &lt;a href="https://medium.com/@jgilewski" rel="noopener noreferrer"&gt;Jarosław Gilewski&lt;/a&gt; for the idea!) and will probably add a few other config-related things that keep getting transferred from one project to another.&lt;/p&gt;

&lt;p&gt;You can find the repo here:&lt;br&gt;
&lt;a href="https://mariakaranasou.com/pyaml_env/" rel="noopener noreferrer"&gt;&lt;strong&gt;Python YAML configuration with environment variables parsing&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;A very small library that parses a yaml configuration file and it resolves the environment variables, so that no…&lt;/em&gt; mariakaranasou.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/mkaranasou/pyaml_env" rel="noopener noreferrer"&gt;&lt;strong&gt;mkaranasou/pyaml_env&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;A very small library that parses a yaml configuration file and it resolves the environment variables, so that no…&lt;/em&gt; github.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://hackersandslackers.com/simplify-your-python-projects-configuration/" rel="noopener noreferrer"&gt;&lt;strong&gt;The Many Faces and Files of Python Configs&lt;/strong&gt;&lt;br&gt;
*As we cling harder and harder to Dockerfiles, Kubernetes, or any modern preconfigured app environment, our dependency…*hackersandslackers.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://hackernoon.com/4-ways-to-manage-the-configuration-in-python-4623049e841b" rel="noopener noreferrer"&gt;&lt;strong&gt;4 Ways to manage the configuration in Python&lt;/strong&gt;&lt;br&gt;
*I’m not a native speaker. Sorry for my english. Please understand.*hackernoon.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.devdungeon.com/content/python-configuration-files" rel="noopener noreferrer"&gt;&lt;strong&gt;Python configuration files&lt;/strong&gt;&lt;br&gt;
*A common need when writing an application is loading and saving configuration values in a human-readable text format…*www.devdungeon.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://martin-thoma.com/configuration-files-in-python/" rel="noopener noreferrer"&gt;&lt;strong&gt;Configuration files in Python&lt;/strong&gt;&lt;br&gt;
*Most interesting programs need some kind of configuration: Content Management Systems like WordPress blogs, WikiMedia…*martin-thoma.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://medium.com/swlh/python-yaml-configuration-with-environment-variables-parsing-77930f4273ac" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I could use a &lt;a href="https://www.buymeacoffee.com/mkaranasou" rel="noopener noreferrer"&gt;coffee&lt;/a&gt; to keep me going :) &lt;br&gt;
Thanks!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>yaml</category>
    </item>
    <item>
      <title>Adding sequential IDs to a Spark Dataframe</title>
      <dc:creator>Maria Karanasou</dc:creator>
      <pubDate>Fri, 23 Apr 2021 15:56:21 +0000</pubDate>
      <link>https://dev.to/mkaranasou/adding-sequential-ids-to-a-spark-dataframe-2fhg</link>
      <guid>https://dev.to/mkaranasou/adding-sequential-ids-to-a-spark-dataframe-2fhg</guid>
      <description>&lt;h3&gt;
  
  
  How to do it and is it a good idea?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-IOIPNZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/11520/0%2AX8A8V7gkYlcNNzD4" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-IOIPNZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/11520/0%2AX8A8V7gkYlcNNzD4" alt="Photo by [Markus Spiske](https://unsplash.com/@markusspiske?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt;
&lt;/h1&gt;
&lt;h1&gt;
  
  
  Adding sequential unique IDs to a Spark Dataframe is not very straightforward, especially considering its distributed nature. You can do this using either zipWithIndex() or row_number() (depending on the amount and kind of your data), but in every case there is a catch regarding performance.
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The idea behind this
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KeOfYQz1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3154/1%2ASAWPTt-_eh0Txr35RjaNLg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KeOfYQz1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3154/1%2ASAWPTt-_eh0Txr35RjaNLg.jpeg" alt="Typical usages for ids — besides the obvious: for identity purposes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Coming from traditional relational databases, like &lt;a href="https://www.mysql.com"&gt;MySQL&lt;/a&gt;, and non-distributed data frames, like &lt;a href="https://pandas.pydata.org"&gt;Pandas&lt;/a&gt;, one may be used to working with ids (usually auto-incremented), not only for identification but also for the ordering and constraints you can impose on the data by using them as a reference. For example, ordering your data by id (which is usually an indexed field) in descending order will give you the most recent rows first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZEsBmYNA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3164/1%2AcXGB03Uf0IJKcew42e_AEw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZEsBmYNA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3164/1%2AcXGB03Uf0IJKcew42e_AEw.jpeg" alt="A representation of a Spark Dataframe — what the user sees and what it is like physically"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on the needs, we might find ourselves in a position where we would benefit from having a (unique) auto-increment-id-like behavior in a spark dataframe. &lt;strong&gt;When the data is in one table or dataframe (on one machine), adding ids is pretty straightforward. What happens, though, when you have distributed data, split into partitions that might reside on different machines, like in Spark?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(More on partitions &lt;a href="https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html"&gt;here&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Throughout this post, we will explore the obvious and not so obvious options, what they do, and the catch behind using them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Please, note that this article assumes that you have some working knowledge of Spark, and more specifically of &lt;a href="http://spark.apache.org/docs/latest/quick-start.html"&gt;PySpark&lt;/a&gt;. If not, here is a &lt;a href="https://towardsdatascience.com/explaining-technical-stuff-in-a-non-techincal-way-apache-spark-274d6c9f70e9#b88f-81d3a1ffe447"&gt;short intro&lt;/a&gt; with what it is and I’ve put several helpful resources in the &lt;em&gt;Useful links and notes&lt;/em&gt; section. I’ll be glad to answer any questions I can :).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Practicing &lt;strong&gt;Sketchnoting&lt;/strong&gt; again: yes, there are &lt;em&gt;terrible sketches&lt;/em&gt; throughout the article, trying to visually explain things &lt;em&gt;as I understand them&lt;/em&gt;. I hope they are more helpful than they are confusing :).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The RDD way — zipWithIndex()
&lt;/h3&gt;

&lt;p&gt;One option is to fall back to &lt;a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html"&gt;RDDs&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;resilient distributed dataset&lt;/em&gt; (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and use df.rdd.zipWithIndex():&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The ordering is first based on the partition index and then the&lt;br&gt;
ordering of items within each partition. So the first item in&lt;br&gt;
the first partition gets index 0, and the last item in the last&lt;br&gt;
partition receives the largest index.&lt;br&gt;
 This method needs to trigger a spark job when this RDD contains&lt;br&gt;
more than one partitions.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HPcbCLFg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3186/1%2ATdfafEJB01ubCSkZR5NFVA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HPcbCLFg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3186/1%2ATdfafEJB01ubCSkZR5NFVA.jpeg" alt="The process of using zipWithIndex()"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four points here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The indexes will be &lt;strong&gt;starting from 0&lt;/strong&gt; and the &lt;strong&gt;ordering is done by partition&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will need to have all your data in the dataframe — &lt;strong&gt;additions* will not add an auto-increment id&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Falling back to rdds and then to dataframe &lt;a href="https://stackoverflow.com/questions/37088484/whats-the-performance-impact-of-converting-between-dataframe-rdd-and-back"&gt;&lt;strong&gt;can be quite expensive&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The updated version of your dataframe with ids will require you to do some &lt;strong&gt;extra work&lt;/strong&gt; to bring your dataframe back to its original form. Which also adds to the &lt;strong&gt;performance toll&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;*&lt;em&gt;You cannot really update or add to a dataframe, since dataframes are immutable, but you could, for example, join one with another and end up with a dataframe that has more rows than the original.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dataframe way
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;If your data is sortable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you can order your data by one of the columns, let’s say column1 in our example, then you can use the &lt;a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.row_number"&gt;row_number&lt;/a&gt;() function to provide, well, row numbers:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;row_number() is a windowing function, which means it operates over predefined windows / groups of data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The points here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Your data must be &lt;strong&gt;sortable&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will need to work with &lt;strong&gt;a very big window&lt;/strong&gt; (as big as your data)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your indexes will be &lt;strong&gt;starting from 1&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will need to have all your data in the dataframe — &lt;strong&gt;updates will not add an auto-increment id&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No extra work to reformat&lt;/strong&gt; your dataframe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt; you might end up with an &lt;strong&gt;OOM Exception&lt;/strong&gt;, as I’ll explain in a bit.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If your data is NOT sortable — or you don’t want to change the current order of your data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another option is to combine row_number() with monotonically_increasing_id(), which according to the &lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id"&gt;documentation&lt;/a&gt; creates:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A column that generates monotonically increasing 64-bit integers.&lt;br&gt;
The generated ID is guaranteed to be &lt;strong&gt;monotonically&lt;/strong&gt; &lt;strong&gt;increasing and unique, but not consecutive&lt;/strong&gt;. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;strong&gt;monotonically increasing and unique, but not consecutive&lt;/strong&gt; part is the key here. It means that you can sort by these ids but you cannot trust them to be sequential. In some cases, where you only need sorting, monotonically_increasing_id() comes in very handy and you don’t need row_number() at all. But in this case, let’s say we absolutely need consecutive ids.&lt;/p&gt;

&lt;p&gt;Again, resuming from where we left things in code:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;There are of course different ways (semantically) to go about it. For example, you could use a temp view (which has no obvious advantage other than you can use the pyspark SQL syntax):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; df_final.createOrReplaceTempView(‘df_final’)
&amp;gt;&amp;gt;&amp;gt; spark.sql(‘select row_number() over (order by “monotonically_increasing_id”) as row_num, * from df_final’)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The points here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same as above, but also a small side note that practically &lt;strong&gt;the ordering is done by partition&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  And the very big catch to this whole effort
&lt;/h3&gt;

&lt;p&gt;In order to use row_number(), we need to move our data into one partition. The Window in both cases (sortable and non-sortable data) basically consists of all the rows we currently have, so that the row_number() function can go over them and increment the row number. This can cause performance and memory issues: we can easily go OOM, depending on how much data and how much memory we have. So, my suggestion would be to really ask yourself whether you need an auto-increment/indexing-like behavior for your data, or whether you can do things another way and avoid it, because it will be expensive. This is especially true if you process arbitrary amounts of data each time, so you cannot carefully plan for memory (e.g. when processing streaming data in groups or windows).&lt;/p&gt;

&lt;p&gt;Spark will give you the following warning whenever you use Window without providing a way to partition your data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QjcXv3Jf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3096/1%2ALgzG1UwkwaFNPGeDRt45RQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QjcXv3Jf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3096/1%2ALgzG1UwkwaFNPGeDRt45RQ.jpeg" alt="Using row_number() over Window and the OOM danger"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: is this a good idea or not?
&lt;/h2&gt;

&lt;p&gt;Well, &lt;em&gt;probably not&lt;/em&gt;. In my experience, if you find yourself needing this kind of functionality, then you should &lt;em&gt;take a good look at your needs and the transformation process&lt;/em&gt; you have and figure out ways around it if possible. Even if you use zipWithIndex() the performance of your application will probably still suffer — but it seems like a safer option to me.&lt;/p&gt;

&lt;p&gt;But if you cannot avoid it, at least be aware of the mechanism behind it, the risks and plan accordingly.&lt;/p&gt;

&lt;p&gt;I hope this was helpful. Any thoughts, questions, corrections and suggestions are very welcome :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links and notes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/explaining-technical-stuff-in-a-non-techincal-way-apache-spark-274d6c9f70e9"&gt;&lt;strong&gt;Explaining technical stuff in a non-technical way — Apache Spark&lt;/strong&gt;&lt;br&gt;
*What is Spark and PySpark and what can I do with it?*towardsdatascience.com&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Adjusting the indexes to start from 0&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The indexes when using row_number() start from 1. To have them start from 0, we can simply subtract 1 from the row_num column:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_final = df_final.withColumn(‘row_num’, F.col(‘row_num’)-1)&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;On RDDs and Datasets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html"&gt;&lt;strong&gt;A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets&lt;/strong&gt;&lt;br&gt;
*In summation, the choice of when to use RDD or DataFrame and/or Dataset seems obvious. While the former offers you…*databricks.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html"&gt;&lt;strong&gt;RDD Programming Guide&lt;/strong&gt;&lt;br&gt;
*Spark 2.4.4 is built and distributed to work with Scala 2.12 by default. (Spark can be built to work with other…*spark.apache.org&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;About createOrReplaceTempView&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This creates (or replaces, if that view name already exists) a lazily evaluated “view” of your data, which means that if you don’t cache/persist it, any calculations will run again each time you access the view. In general, you can then use it like a Hive table in Spark SQL.&lt;br&gt;
&lt;a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceTempView"&gt;&lt;strong&gt;pyspark.sql module - PySpark 2.4.4 documentation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;schema - a pyspark.sql.types.DataType or a datatype string or a list of column names, default is . The data type string…&lt;/em&gt; spark.apache.org&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Row Number and Windows
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.row_number"&gt;&lt;strong&gt;pyspark.sql module - PySpark 2.4.4 documentation&lt;/strong&gt;&lt;br&gt;
*schema - a pyspark.sql.types.DataType or a datatype string or a list of column names, default is . The data type string…*spark.apache.org&lt;/a&gt;&lt;br&gt;
&lt;a href="https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html"&gt;&lt;strong&gt;Introducing Window Functions in Spark SQL&lt;/strong&gt;&lt;br&gt;
*In this blog post, we introduce the new window function feature that was added in Apache Spark 1.4. Window functions…*databricks.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to next?
&lt;/h2&gt;

&lt;p&gt;Understanding your Machine Learning model’s predictions:&lt;br&gt;
&lt;a href="https://medium.com/mlearning-ai/machine-learning-interpretability-shapley-values-with-pyspark-16ffd87227e3"&gt;&lt;strong&gt;Machine Learning Interpretability — Shapley Values with PySpark&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Interpreting Isolation Forest’s predictions — and not only&lt;/em&gt; medium.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6"&gt;Medium&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  I could use a &lt;a href="https://www.buymeacoffee.com/mkaranasou"&gt;coffee&lt;/a&gt; to keep me going :) Thanks!
&lt;/h3&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>pyspark</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>On explaining technical stuff in a non-technical way — (Py)Spark</title>
      <dc:creator>Maria Karanasou</dc:creator>
      <pubDate>Fri, 23 Apr 2021 15:47:31 +0000</pubDate>
      <link>https://dev.to/mkaranasou/on-explaining-technical-stuff-in-a-non-technical-way-py-spark-c3h</link>
      <guid>https://dev.to/mkaranasou/on-explaining-technical-stuff-in-a-non-technical-way-py-spark-c3h</guid>
      <description>&lt;h3&gt;
  
  
  What is Spark and PySpark and what can I do with it?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11520%2F0%2A-8mJ0H4u-y1uXEFf" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11520%2F0%2A-8mJ0H4u-y1uXEFf" alt="Photo by [Markus Spiske](https://unsplash.com/@markusspiske?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was once asked during a presentation of the &lt;a href="https://equalit.ie/deflect-labs-report-5-baskerville/" rel="noopener noreferrer"&gt;Baskerville Analytics System&lt;/a&gt; to explain &lt;a href="https://spark.apache.org" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; to someone who is not technical at all. It kind of baffled me, because I am very much used to thinking and talking in code, and my mind just kept going back to technical terms, so I believe I didn’t do a great job in the very limited time I had. Let’s try this one more time, for the sake of that one person who asked me, and because I believe that explaining things as simply as possible is a great skill to develop.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;em&gt;A side note: Sketchnoting&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I’ve been reading &lt;a href="https://www.goodreads.com/book/show/36338843-pencil-me-in" rel="noopener noreferrer"&gt;Pencil Me In by Christina R Wodtke&lt;/a&gt;, which talks about Sketchnoting: the process of keeping visual notes to help with understanding and memorization. I’ve always been a visual person and used to doodle a lot throughout my childhood, which indeed helped me remember things better, and sometimes also got me into trouble. And since the whole process of me writing on Medium is so that I better understand what I think I know, and also learn new things, I thought I’d try this again. It’s been a long, long time since I last did this, and I am now very much used to typing and not writing (translation: horrible sketches coming up!), so please be lenient.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The impossible homework&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I guess the first thing to do is to provide an example that anyone, or almost anyone, can relate to. Thus, let’s say that you have homework that is due in a week, and what you have to do is read a really huge book, 7K pages long, and keep a count of how many times the author used the term “big data” and, ideally, also keep the phrases that contain it (silly task, but bear with me :) ).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3010%2F1%2AJqWkrE1i8pva-nfSxMdpkA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3010%2F1%2AJqWkrE1i8pva-nfSxMdpkA.jpeg" alt="The “impossible” homework"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an impossible task given the time constraint: even if you read day and night, you won’t be able to finish within the week. But you are not alone in this, so you decide to talk to your classmates and friends and figure out a solution.&lt;/p&gt;

&lt;p&gt;It seems logical to split the pages so that each one of you takes care of at least a couple of them. It also makes sense that the pages each of you takes home contain related content, so that what you read makes sense on its own; thus, you try to split by chapters.&lt;/p&gt;

&lt;p&gt;It also looks like there is a need for a coordinator. Let’s say you take up that task, since it was your idea. (You would ideally take up a chapter or two yourself, but let’s say that management and communication will take up most of your time.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3044%2F1%2A5eg7h0e8H3IwTHSHoZ49mw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3044%2F1%2A5eg7h0e8H3IwTHSHoZ49mw.jpeg" alt="Help each other!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another thing to consider is to split the pages according to who has the most time available and who is a speedy reader or a slow one so that the process is as efficient as possible, right? Also, some of you might have other homework to do within the week, so this must also be taken into account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3120%2F1%2AJSuV_38Cec3ScFQf2L1ZIg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3120%2F1%2AJSuV_38Cec3ScFQf2L1ZIg.jpeg" alt="Communicate with each other, know the availability and distribute work accordingly"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughout the week, it would be good to talk to your fellow students to check in and see how they’re doing. And of course, since reading the chapters will not be done in one go, use bookmarks to note your progress and keep track of where you are with the task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3110%2F1%2Ao2uzbmPPeags7ehMwMauvA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3110%2F1%2Ao2uzbmPPeags7ehMwMauvA.jpeg" alt="Bookmark, keep track and redistribute work in case of failure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if you had to count more than one term? The splitting of the pages should probably be done according to the title of the chapters and the likelihood of the chapter including the terms. And what if something happens and one of you cannot complete the task? The respective pages should be redistributed to the rest of you, ideally depending on how many pages each of you has left.&lt;/p&gt;

&lt;p&gt;In the end, you would all gather and add up your counts to have your results.&lt;/p&gt;

&lt;p&gt;So, to sum up, to tackle this task, it makes sense to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Split the chapters between fellow students&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have you organize things, since it was your idea and you know how things should play out&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Split the chapters according to each student’s capacity — take into account reading speed and availability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Re-distribute the work if something happens and a person cannot finish up their part&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep track of how things are going — use bookmarks, talk to your fellow students to keep track of their progress, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gather up at the end to share and combine results&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How this relates to Spark and PySpark — getting a bit more technical
&lt;/h2&gt;

&lt;p&gt;The homework example illustrates, &lt;em&gt;as I understand it&lt;/em&gt;, the over-simplified basic thinking behind &lt;a href="https://spark.apache.org" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; (and many similar frameworks and systems, e.g. horizontal or vertical data “sharding”). The data is split into reasonable groups (called “partitions” in Spark’s case), based on the kind of tasks you have to perform on it, so that processing is efficient. Those partitions are then distributed to an ideally equal number of workers, or to as many workers as your system can provide; these workers can be on the same machine or on different ones, e.g. one worker per machine (node). There must be a coordinator of all this effort, to collect the information needed to perform the task, to redistribute the load in case of failure, and even to re-partition the data when the computations require it (e.g. we need to calculate something on each row of data independently, but then we need to group those rows by a key). A (network) connection between the coordinator and the workers is also necessary, so they can communicate and exchange data and information. Finally, there is the concept of doing things in a “lazy” way and of using caching to keep track of intermediate results, so that not everything has to be calculated from scratch every time.&lt;/p&gt;

&lt;p&gt;PySpark is the Python API for &lt;a href="https://spark.apache.org" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, which is “a unified analytics engine for large-scale data processing”.&lt;/p&gt;

&lt;p&gt;Note that this is not an exact, one-to-one comparison with the Spark components, but it is conceptually close. I’ve also omitted many of Spark’s internals and structures for the sake of simplicity. If you want to dig deeper, there are plenty of resources out there, starting with &lt;a href="https://spark.apache.org/research.html" rel="noopener noreferrer"&gt;the official Apache Spark site&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3136%2F1%2ADZpJ7d-yFEecoLBTnQBryA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3136%2F1%2ADZpJ7d-yFEecoLBTnQBryA.jpeg" alt="Comparison with Spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I mentioned, the comparison depicted in the previous image is not quite accurate. Let’s try again and get the teacher into the picture too. The teacher is the one who provides the homework and the instructions (the driver program); the students are split into working groups, and each working group can take care of a part of the task. For the sake of brevity, and to make my drawings less complicated and my life a bit easier, the image below compares just one working group to Spark. This, I feel, is a bit closer to what actually goes on when a Spark application runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3058%2F1%2AviGc7cGu-NZla0wum2J10w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3058%2F1%2AviGc7cGu-NZla0wum2J10w.jpeg" alt="Perhaps a better comparison with Spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In simpler, and a bit more technical, terms: let’s say you have a huge text file on your computer (OK, not big-data-huge, but say a 15GB file), and you really want to know how many words it contains or, as in the homework above, how many times the term “big data” appears in it, along with the relevant phrases. You will be faced with the following issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;you cannot really open this file with, say, Notepad: even if you have 32GB of RAM, an application meant for opening and editing text files will be practically unusable with a 15GB file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;you can code something to count the words or a specific word or phrase in this file, either by reading it line by line or by using something like &lt;code&gt;wc&lt;/code&gt;, depending on your system, but it will be &lt;em&gt;slow&lt;/em&gt;, &lt;em&gt;very&lt;/em&gt; slow. And what if you need to do more complicated things?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, immediately we see that there is no quick and easy option to do simple, let alone complex things with a big file.&lt;/p&gt;

&lt;p&gt;One can think of several workarounds, like splitting the huge file into many little ones, processing the little ones, and adding up the results, leveraging multiprocessing techniques. And here is where Spark comes in to provide an easy solution. Let’s see a very basic PySpark example using the Python library &lt;a href="https://pypi.org/project/pyspark/" rel="noopener noreferrer"&gt;pyspark&lt;/a&gt;.&lt;/p&gt;
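&lt;p&gt;For contrast, the manual workaround might look something like this sketch, with a tiny in-memory “book” standing in for the split-up files. (It uses a thread pool to keep the example small; a real version of this CPU-bound counting would use processes, e.g. &lt;code&gt;ProcessPoolExecutor&lt;/code&gt;.)&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# A tiny in-memory stand-in for the book, as a list of lines.
BOOK = [
    "big data here",
    "nothing relevant here",
    "more big data, big data everywhere",
]

def count_term(lines):
    # each worker counts occurrences in its own slice (its "chapters")
    return sum(line.lower().count("big data") for line in lines)

def chunk(seq, n):
    # naive split into n roughly equal slices
    k = max(1, len(seq) // n)
    return [seq[i:i + k] for i in range(0, len(seq), k)]

# distribute the slices to a pool of workers, then combine the results
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_term, chunk(BOOK, 2)))
total = sum(partial_counts)
print(total)  # 3
```

&lt;p&gt;Notice how much bookkeeping (splitting, distributing, combining) is already on us, and we haven’t even handled failures or progress tracking yet.&lt;/p&gt;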


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;It looks quite simple, doesn’t it? Just a few lines of Python code. Now let’s explain a bit about what it does:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;There is no obvious splitting of the file into “chapters”, no coordination, no keeping track, nothing of the sort. That is because Spark takes care of all the complexity behind the scenes: we do not have to tell workers and executors which parts of the file to read, decide how to split it, or handle an executor suddenly dropping its part, and so on. So, here, we’ve done our homework in just a few lines of code.&lt;/p&gt;

&lt;p&gt;Don’t get me wrong: Spark seems simple, but there is a lot of complexity behind it, and troubleshooting it is not an easy task at all. Still, let’s just appreciate the good parts for now; we can talk about the difficulties later on.&lt;/p&gt;

&lt;p&gt;Additionally, the example here is one of the simplest ones. But I believe that once you understand the mechanism and logic behind such frameworks, it becomes a lot easier to grasp what you can and, more importantly, cannot do with them, how to structure systems that leverage them, and how to estimate whether doing things a certain way will be fast and efficient or not. Again, to keep this simple, I won’t go into further details right now.&lt;/p&gt;

&lt;p&gt;I hope this was helpful. Any thoughts, questions, corrections and suggestions are very welcome :)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://towardsdatascience.com/explaining-technical-stuff-in-a-non-techincal-way-apache-spark-274d6c9f70e9" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I could use a &lt;a href="https://www.buymeacoffee.com/mkaranasou" rel="noopener noreferrer"&gt;coffee&lt;/a&gt; to keep me going :) &lt;br&gt;
Thanks!&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>distributedsystems</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
