<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maria Karanasou</title>
    <description>The latest articles on DEV Community by Maria Karanasou (@mkaranasou).</description>
    <link>https://dev.to/mkaranasou</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F40184%2F530cd799-a04a-49e1-b2c4-9572f8285591.png</url>
      <title>DEV Community: Maria Karanasou</title>
      <link>https://dev.to/mkaranasou</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mkaranasou"/>
    <language>en</language>
    <item>
      <title>Python YAML configuration with environment variables parsing</title>
      <dc:creator>Maria Karanasou</dc:creator>
      <pubDate>Tue, 27 Apr 2021 15:51:04 +0000</pubDate>
      <link>https://dev.to/mkaranasou/python-yaml-configuration-with-environment-variables-parsing-2ha6</link>
      <guid>https://dev.to/mkaranasou/python-yaml-configuration-with-environment-variables-parsing-2ha6</guid>
      <description>&lt;h3&gt;
  
  
  Load a YAML configuration file and resolve any environment variables
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11700%2F1%2A4s_GrxE5sn2p2PNd8fS-6A.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11700%2F1%2A4s_GrxE5sn2p2PNd8fS-6A.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: if you want to use this, check the &lt;strong&gt;UPDATE&lt;/strong&gt; at the end of the article :)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’ve worked with Python projects, you’ve probably stumbled across the many ways to provide configuration. I am not going to go through all of them here, but a few are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;using .ini files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;using a python class&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;using .env files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;using JSON or XML files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;using a yaml file&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so on. I’ve put some useful links about the different ways below, in case you are interested in digging deeper.&lt;/p&gt;

&lt;p&gt;My preference is working with yaml configuration, because I usually find it handy and easy to use, and because yaml files are also used elsewhere, e.g. in docker-compose configuration, so most people are familiar with the format.&lt;/p&gt;

&lt;p&gt;For yaml parsing I use the &lt;a href="https://pyyaml.org/wiki/PyYAMLDocumentation" rel="noopener noreferrer"&gt;PyYAML&lt;/a&gt; Python library.&lt;/p&gt;

&lt;p&gt;In this article we’ll talk about the yaml file case and, more specifically, what you can do to &lt;strong&gt;avoid keeping your secrets, e.g. passwords, hosts, usernames etc., directly in it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s say we have a very simple example of a yaml file configuration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;database:
 name: database_name
 user: me
 password: very_secret_and_complex
 host: localhost
 port: 5432

ws:
 user: username
 password: very_secret_and_complex_too
 host: localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;When you come to a point where you need to deploy your project, it is not really safe to have passwords and sensitive data in a plain text configuration file lying around on your production server. That’s where &lt;a href="https://medium.com/dataseries/hiding-secret-info-in-python-using-environment-variables-a2bab182eea" rel="noopener noreferrer"&gt;&lt;strong&gt;environment variables&lt;/strong&gt;&lt;/a&gt; come in handy. So the goal here is to be able to easily replace the very_secret_and_complex password with input from an environment variable, e.g. DB_PASS, so that this variable only exists when you set it and run your program instead of it being hardcoded somewhere.&lt;/p&gt;

&lt;p&gt;For PyYAML to be able to resolve environment variables, we need three main things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A regex pattern for the environment variable identification, e.g. pattern = re.compile(r'.*?\$\{(\w+)\}.*?')&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A tag that will signify that there’s an environment variable (or more) to be parsed, e.g. !ENV.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And a function that the loader will use to resolve the environment variables&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;Here’s a complete example:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
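&lt;p&gt;The embedded gist did not survive this export, so here is a minimal sketch of such a loader, built from the three pieces listed above. The function and variable names here are mine, not necessarily the original gist’s:&lt;/p&gt;

```python
import os
import re

import yaml


def parse_config(path=None, data=None, tag='!ENV'):
    """Load a yaml configuration and resolve any environment variables.

    Values tagged with e.g. !ENV must contain variables in the
    ${VAR_NAME} format.
    """
    # The regex pattern that identifies ${VAR_NAME} occurrences
    pattern = re.compile(r'.*?\$\{(\w+)\}.*?')

    # A dedicated subclass, so we don't modify yaml.SafeLoader globally
    class EnvLoader(yaml.SafeLoader):
        pass

    # Values matching the pattern get the tag implicitly
    EnvLoader.add_implicit_resolver(tag, pattern, None)

    def constructor_env_variables(loader, node):
        """Replace every ${VAR} in the node's value with os.environ['VAR']."""
        value = loader.construct_scalar(node)
        for group in pattern.findall(value):
            # Leave the placeholder untouched if the variable is not set
            value = value.replace('${%s}' % group,
                                  os.environ.get(group, '${%s}' % group))
        return value

    EnvLoader.add_constructor(tag, constructor_env_variables)

    if path:
        with open(path) as conf_file:
            return yaml.load(conf_file, Loader=EnvLoader)
    return yaml.load(data, Loader=EnvLoader)
```

&lt;p&gt;With this in place, a value like password: !ENV ${DB_PASS} resolves to the contents of the DB_PASS environment variable at load time.&lt;/p&gt;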


&lt;p&gt;Example of a YAML configuration with environment variables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;database:
 name: database_name
 user: !ENV ${DB_USER}
 password: !ENV ${DB_PASS}
 host: !ENV ${DB_HOST}
 port: 5432

ws:
 user: !ENV ${WS_USER}
 password: !ENV ${WS_PASS}
 host: !ENV 'https://${CURR_ENV}.ws.com.local'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;This can also work &lt;strong&gt;with more than one environment variable&lt;/strong&gt; declared in the same line for the same configuration parameter, like this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ws:
 user: !ENV ${WS_USER}
 password: !ENV ${WS_PASS}
 host: !ENV 'https://${CURR_ENV}.ws.com.${MODE}'  # multiple env variables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;And how to use this:&lt;/p&gt;

&lt;p&gt;First, set the environment variables. For example, for DB_PASS:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export DB_PASS=very_secret_and_complex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Or even better, so that the password is not echoed in the terminal:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read -s ‘Database password: ‘ db_pass
export DB_PASS=$db_pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
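&lt;p&gt;The gist with the runner script is also missing from this export; a minimal reconstruction (the file name matches the command shown next, the other names are my guesses) could look like this:&lt;/p&gt;

```python
# use_env_variables_in_config_example.py -- hypothetical reconstruction
import argparse
import os
import re

import yaml

ENV_PATTERN = re.compile(r'.*?\$\{(\w+)\}.*?')


def env_constructor(loader, node):
    """Replace every ${VAR} in a !ENV-tagged value with os.environ['VAR']."""
    value = loader.construct_scalar(node)
    for name in ENV_PATTERN.findall(value):
        env_value = os.environ.get(name)
        if env_value is not None:
            value = value.replace('${%s}' % name, env_value)
    return value


class EnvLoader(yaml.SafeLoader):
    pass


EnvLoader.add_constructor('!ENV', env_constructor)


def main(argv=None):
    parser = argparse.ArgumentParser(description='YAML configuration with env variables')
    parser.add_argument('-c', '--config', required=True,
                        help='path to the yaml configuration file')
    args = parser.parse_args(argv)
    with open(args.config) as f:
        conf = yaml.load(f, Loader=EnvLoader)
    print(conf)
    return conf
```

&lt;p&gt;Add an if __name__ == '__main__': main() guard at the bottom to make it runnable from the command line.&lt;/p&gt;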



&lt;p&gt;Then you can run the above script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python use_env_variables_in_config_example.py -c /path/to/yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And in your code, do stuff with conf, e.g. access the database password like this: conf['database']['password'] (the !ENV values are already resolved at load time, under the original keys).&lt;/p&gt;

&lt;p&gt;I hope this was helpful. Any thoughts, questions, corrections and suggestions are very welcome :)&lt;/p&gt;

&lt;h2&gt;
  
  
  UPDATE
&lt;/h2&gt;

&lt;p&gt;Because I — and other people — have been using this a lot, I created a (very) small library, with tests and some extra features, to make it easier to use this without copy-pasting things all over :)&lt;/p&gt;

&lt;p&gt;You can now just do:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pyaml-env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And then you can import parse_config to use it in your code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyaml_env import parse_config

config = parse_config('path/to/yaml')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I also added support for default values (thanks &lt;a href="https://medium.com/@jgilewski" rel="noopener noreferrer"&gt;Jarosław Gilewski&lt;/a&gt; for the idea!) and will probably add a few other config-related things that keep getting transferred from one project to another.&lt;/p&gt;

&lt;p&gt;You can find the repo here:&lt;br&gt;
&lt;a href="https://mariakaranasou.com/pyaml_env/" rel="noopener noreferrer"&gt;&lt;strong&gt;Python YAML configuration with environment variables parsing&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;A very small library that parses a yaml configuration file and it resolves the environment variables, so that no…&lt;/em&gt; mariakaranasou.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/mkaranasou/pyaml_env" rel="noopener noreferrer"&gt;&lt;strong&gt;mkaranasou/pyaml_env&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;A very small library that parses a yaml configuration file and it resolves the environment variables, so that no…&lt;/em&gt; github.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://hackersandslackers.com/simplify-your-python-projects-configuration/" rel="noopener noreferrer"&gt;&lt;strong&gt;The Many Faces and Files of Python Configs&lt;/strong&gt;&lt;br&gt;
*As we cling harder and harder to Dockerfiles, Kubernetes, or any modern preconfigured app environment, our dependency…*hackersandslackers.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://hackernoon.com/4-ways-to-manage-the-configuration-in-python-4623049e841b" rel="noopener noreferrer"&gt;&lt;strong&gt;4 Ways to manage the configuration in Python&lt;/strong&gt;&lt;br&gt;
*I’m not a native speaker. Sorry for my english. Please understand.*hackernoon.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.devdungeon.com/content/python-configuration-files" rel="noopener noreferrer"&gt;&lt;strong&gt;Python configuration files&lt;/strong&gt;&lt;br&gt;
*A common need when writing an application is loading and saving configuration values in a human-readable text format…*www.devdungeon.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://martin-thoma.com/configuration-files-in-python/" rel="noopener noreferrer"&gt;&lt;strong&gt;Configuration files in Python&lt;/strong&gt;&lt;br&gt;
*Most interesting programs need some kind of configuration: Content Management Systems like WordPress blogs, WikiMedia…*martin-thoma.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://medium.com/swlh/python-yaml-configuration-with-environment-variables-parsing-77930f4273ac" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I could use a &lt;a href="https://www.buymeacoffee.com/mkaranasou" rel="noopener noreferrer"&gt;coffee&lt;/a&gt; to keep me going :) &lt;br&gt;
Thanks!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>yaml</category>
    </item>
    <item>
      <title>Adding sequential IDs to a Spark Dataframe</title>
      <dc:creator>Maria Karanasou</dc:creator>
      <pubDate>Fri, 23 Apr 2021 15:56:21 +0000</pubDate>
      <link>https://dev.to/mkaranasou/adding-sequential-ids-to-a-spark-dataframe-2fhg</link>
      <guid>https://dev.to/mkaranasou/adding-sequential-ids-to-a-spark-dataframe-2fhg</guid>
      <description>&lt;h3&gt;
  
  
  How to do it and is it a good idea?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-IOIPNZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/11520/0%2AX8A8V7gkYlcNNzD4" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z-IOIPNZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/11520/0%2AX8A8V7gkYlcNNzD4" alt="Photo by [Markus Spiske](https://unsplash.com/@markusspiske?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;h1&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt;
&lt;/h1&gt;
&lt;h1&gt;
  
  
  Adding sequential unique IDs to a Spark Dataframe is not very straightforward, especially considering its distributed nature. You can do this using either zipWithIndex() or row_number() (depending on the amount and kind of your data), but in every case there is a catch regarding performance.
&lt;/h1&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The idea behind this
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KeOfYQz1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3154/1%2ASAWPTt-_eh0Txr35RjaNLg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KeOfYQz1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3154/1%2ASAWPTt-_eh0Txr35RjaNLg.jpeg" alt="Typical usages for ids — besides the obvious: for identity purposes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Coming from traditional relational databases, like &lt;a href="https://www.mysql.com"&gt;MySQL&lt;/a&gt;, and non-distributed data frames, like &lt;a href="https://pandas.pydata.org"&gt;Pandas&lt;/a&gt;, one may be used to working with ids (usually auto-incremented), not only for identification but also for the ordering and constraints you can impose on the data by using them as a reference. For example, ordering your data by id (which is usually an indexed field) in descending order will give you the most recent rows first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZEsBmYNA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3164/1%2AcXGB03Uf0IJKcew42e_AEw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZEsBmYNA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3164/1%2AcXGB03Uf0IJKcew42e_AEw.jpeg" alt="A representation of a Spark Dataframe — what the user sees and what it is like physically"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on the needs, we might find ourselves in a position where we would benefit from having a (unique) auto-increment-id-like behavior in a spark dataframe. &lt;strong&gt;When the data is in one table or dataframe (on one machine), adding ids is pretty straightforward. What happens, though, when you have distributed data, split into partitions that might reside on different machines, like in Spark?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(More on partitions &lt;a href="https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-partitions.html"&gt;here&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Throughout this post, we will explore the obvious and not so obvious options, what they do, and the catch behind using them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Please, note that this article assumes that you have some working knowledge of Spark, and more specifically of &lt;a href="http://spark.apache.org/docs/latest/quick-start.html"&gt;PySpark&lt;/a&gt;. If not, here is a &lt;a href="https://towardsdatascience.com/explaining-technical-stuff-in-a-non-techincal-way-apache-spark-274d6c9f70e9#b88f-81d3a1ffe447"&gt;short intro&lt;/a&gt; with what it is and I’ve put several helpful resources in the &lt;em&gt;Useful links and notes&lt;/em&gt; section. I’ll be glad to answer any questions I can :).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Practicing &lt;strong&gt;Sketchnoting&lt;/strong&gt; again: yes, there are &lt;em&gt;terrible sketches&lt;/em&gt; throughout the article, trying to visually explain things &lt;em&gt;as I understand them&lt;/em&gt;. I hope they are more helpful than they are confusing :).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The RDD way — zipWithIndex()
&lt;/h3&gt;

&lt;p&gt;One option is to fall back to &lt;a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html"&gt;RDDs&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;resilient distributed dataset&lt;/em&gt; (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and use df.rdd.zipWithIndex():&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The ordering is first based on the partition index and then the&lt;br&gt;
ordering of items within each partition. So the first item in&lt;br&gt;
the first partition gets index 0, and the last item in the last&lt;br&gt;
partition receives the largest index.&lt;br&gt;
 This method needs to trigger a spark job when this RDD contains&lt;br&gt;
more than one partitions.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HPcbCLFg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3186/1%2ATdfafEJB01ubCSkZR5NFVA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HPcbCLFg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3186/1%2ATdfafEJB01ubCSkZR5NFVA.jpeg" alt="The process of using zipWithIndex()"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four points here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The indexes will be &lt;strong&gt;starting from 0&lt;/strong&gt; and the &lt;strong&gt;ordering is done by partition&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will need to have all your data in the dataframe — &lt;strong&gt;additions* will not add an auto-increment id&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Falling back to rdds and then to dataframe &lt;a href="https://stackoverflow.com/questions/37088484/whats-the-performance-impact-of-converting-between-dataframe-rdd-and-back"&gt;&lt;strong&gt;can be quite expensive&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The updated version of your dataframe with ids will require you to do some &lt;strong&gt;extra work&lt;/strong&gt; to bring your dataframe back to its original form. Which also adds to the &lt;strong&gt;performance toll&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;*&lt;em&gt;You cannot really update or add to a dataframe, since dataframes are immutable, but you could, for example, join one with another and end up with a dataframe that has more rows than the original.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dataframe way
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;If your data is sortable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you can order your data by one of the columns, let’s say column1 in our example, then you can use the &lt;a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.row_number"&gt;row_number&lt;/a&gt;() function to provide, well, row numbers:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;row_number() is a windowing function, which means it operates over predefined windows / groups of data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The points here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Your data must be &lt;strong&gt;sortable&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will need to work with &lt;strong&gt;a very big window&lt;/strong&gt; (as big as your data)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Your indexes will be &lt;strong&gt;starting from 1&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You will need to have all your data in the dataframe — &lt;strong&gt;updates will not add an auto-increment id&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No extra work to reformat&lt;/strong&gt; your dataframe&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;But&lt;/strong&gt; you might end up with an &lt;strong&gt;OOM Exception&lt;/strong&gt;, as I’ll explain in a bit.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If your data is NOT sortable — or you don’t want to change the current order of your data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another option is to combine row_number() with monotonically_increasing_id(), which according to the &lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id"&gt;documentation&lt;/a&gt; creates:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A column that generates monotonically increasing 64-bit integers.&lt;br&gt;
The generated ID is guaranteed to be &lt;strong&gt;monotonically&lt;/strong&gt; &lt;strong&gt;increasing and unique, but not consecutive&lt;/strong&gt;. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;strong&gt;monotonically increasing and unique, but not consecutive&lt;/strong&gt; part is the key here. It means that you can sort by these ids but you cannot trust them to be sequential. In some cases, where you only need sorting, monotonically_increasing_id() comes in very handy and you don’t need row_number() at all. But in this case, let’s say we absolutely need consecutive ids.&lt;/p&gt;

&lt;p&gt;Again, resuming from where we left things in code:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;There are of course different ways (semantically) to go about it. For example, you could use a temp view (which has no obvious advantage other than you can use the pyspark SQL syntax):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; df_final.createOrReplaceTempView(‘df_final’)
&amp;gt;&amp;gt;&amp;gt; spark.sql(‘select row_number() over (order by “monotonically_increasing_id”) as row_num, * from df_final’)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The points here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same as above, but also a small side note that practically &lt;strong&gt;the ordering is done by partition&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  And the very big catch to this whole effort
&lt;/h3&gt;

&lt;p&gt;In order to use row_number(), we need to move our data into one partition. The Window in both cases (sortable and non-sortable data) basically consists of all the rows we currently have, so that the row_number() function can go over them and increment the row number. This can cause performance and memory issues: we can easily go OOM, depending on how much data and how much memory we have. So, my suggestion would be to really ask yourself whether you need an auto-increment/indexing-like behavior for your data, or whether you can do things another way and avoid it, because it will be expensive. This is especially true if you process arbitrary amounts of data each time, so you cannot carefully plan for memory (e.g. when processing streaming data in groups or windows).&lt;/p&gt;

&lt;p&gt;Spark will give you the following warning whenever you use Window without providing a way to partition your data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QjcXv3Jf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3096/1%2ALgzG1UwkwaFNPGeDRt45RQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QjcXv3Jf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3096/1%2ALgzG1UwkwaFNPGeDRt45RQ.jpeg" alt="Using row_number() over Window and the OOM danger"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: is this a good idea or not?
&lt;/h2&gt;

&lt;p&gt;Well, &lt;em&gt;probably not&lt;/em&gt;. In my experience, if you find yourself needing this kind of functionality, then you should &lt;em&gt;take a good look at your needs and the transformation process&lt;/em&gt; you have and figure out ways around it if possible. Even if you use zipWithIndex() the performance of your application will probably still suffer — but it seems like a safer option to me.&lt;/p&gt;

&lt;p&gt;But if you cannot avoid it, at least be aware of the mechanism behind it, the risks and plan accordingly.&lt;/p&gt;

&lt;p&gt;I hope this was helpful. Any thoughts, questions, corrections and suggestions are very welcome :)&lt;/p&gt;

&lt;h2&gt;
  
  
  Useful links and notes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/explaining-technical-stuff-in-a-non-techincal-way-apache-spark-274d6c9f70e9"&gt;&lt;strong&gt;Explaining technical stuff in a non-technical way — Apache Spark&lt;/strong&gt;&lt;br&gt;
*What is Spark and PySpark and what can I do with it?*towardsdatascience.com&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Adjusting the indexes to start from 0&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The indexes when using row_number() start from 1. To have them start from 0, we can simply subtract 1 from the row_num column:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_final = df_final.withColumn(‘row_num’, F.col(‘row_num’)-1)&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;On RDDs and Datasets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html"&gt;&lt;strong&gt;A Tale of Three Apache Spark APIs: RDDs vs DataFrames and Datasets&lt;/strong&gt;&lt;br&gt;
*In summation, the choice of when to use RDD or DataFrame and/or Dataset seems obvious. While the former offers you…*databricks.com&lt;/a&gt;&lt;br&gt;
&lt;a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html"&gt;&lt;strong&gt;RDD Programming Guide&lt;/strong&gt;&lt;br&gt;
*Spark 2.4.4 is built and distributed to work with Scala 2.12 by default. (Spark can be built to work with other…*spark.apache.org&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;About createOrReplaceTempView&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This creates (or replaces, if that view name already exists) a lazily evaluated “view” of your data, which means that if you don’t cache/persist it, any calculations will run again each time you access the view. In general, you can then use it like a Hive table in Spark SQL.&lt;br&gt;
&lt;a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceTempView"&gt;&lt;strong&gt;pyspark.sql module - PySpark 2.4.4 documentation&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;schema - a pyspark.sql.types.DataType or a datatype string or a list of column names, default is . The data type string…&lt;/em&gt; spark.apache.org&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Row Number and Windows
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.row_number"&gt;&lt;strong&gt;pyspark.sql module - PySpark 2.4.4 documentation&lt;/strong&gt;&lt;br&gt;
*schema - a pyspark.sql.types.DataType or a datatype string or a list of column names, default is . The data type string…*spark.apache.org&lt;/a&gt;&lt;br&gt;
&lt;a href="https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html"&gt;&lt;strong&gt;Introducing Window Functions in Spark SQL&lt;/strong&gt;&lt;br&gt;
*In this blog post, we introduce the new window function feature that was added in Apache Spark 1.4. Window functions…*databricks.com&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to next?
&lt;/h2&gt;

&lt;p&gt;Understanding your Machine Learning model’s predictions:&lt;br&gt;
&lt;a href="https://medium.com/mlearning-ai/machine-learning-interpretability-shapley-values-with-pyspark-16ffd87227e3"&gt;&lt;strong&gt;Machine Learning Interpretability — Shapley Values with PySpark&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Interpreting Isolation Forest’s predictions — and not only&lt;/em&gt; medium.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6"&gt;Medium&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  I could use a &lt;a href="https://www.buymeacoffee.com/mkaranasou"&gt;coffee&lt;/a&gt; to keep me going :) Thanks!
&lt;/h3&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>pyspark</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>On explaining technical stuff in a non-technical way — (Py)Spark</title>
      <dc:creator>Maria Karanasou</dc:creator>
      <pubDate>Fri, 23 Apr 2021 15:47:31 +0000</pubDate>
      <link>https://dev.to/mkaranasou/on-explaining-technical-stuff-in-a-non-technical-way-py-spark-c3h</link>
      <guid>https://dev.to/mkaranasou/on-explaining-technical-stuff-in-a-non-technical-way-py-spark-c3h</guid>
      <description>&lt;h3&gt;
  
  
  What is Spark and PySpark and what can I do with it?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11520%2F0%2A-8mJ0H4u-y1uXEFf" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F11520%2F0%2A-8mJ0H4u-y1uXEFf" alt="Photo by [Markus Spiske](https://unsplash.com/@markusspiske?utm_source=medium&amp;amp;utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I was once asked during a presentation of the &lt;a href="https://equalit.ie/deflect-labs-report-5-baskerville/" rel="noopener noreferrer"&gt;Baskerville Analytics System&lt;/a&gt; to explain &lt;a href="https://spark.apache.org" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; to someone who is not technical at all. It kind of baffled me, because I am very much used to thinking and talking in code, and my mind just kept going back to technical terms, so I believe I didn’t do a great job in the very limited time I had. Let’s try this one more time, for the sake of that one person who asked me, and because I believe that explaining things as simply as possible is a great skill to develop.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;em&gt;A side note: Sketchnoting&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I’ve been reading &lt;a href="https://www.goodreads.com/book/show/36338843-pencil-me-in" rel="noopener noreferrer"&gt;Pencil Me In by Christina R Wodtke&lt;/a&gt;, which talks about Sketchnoting: the process of keeping visual notes to help with understanding and memorization. I’ve always been a visual person and used to doodle a lot throughout my childhood, which indeed helped me remember things better, and sometimes also got me into trouble. And since the whole process of me writing on Medium is so that I better understand what I think I know, and also learn new things, I thought I’d try this again. It’s been a long, long time since I last did this, and I am now very much used to typing and not writing (translation: horrible sketches coming up!), so please be lenient.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The impossible homework&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I guess the first thing to do is to provide an example that anyone, or almost anyone, can relate to. Thus, let’s say that you have homework that is due in a week, and what you have to do is read a really huge book, 7K pages long, and keep a count of how many times the author used the term “big data” and, ideally, also keep the phrases that contain it (silly task, but bear with me :) ).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3010%2F1%2AJqWkrE1i8pva-nfSxMdpkA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3010%2F1%2AJqWkrE1i8pva-nfSxMdpkA.jpeg" alt="The “impossible” homework"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an impossible task given the time constraint: even if you read day and night, you won’t be able to finish within the week. But you are not alone in this, so you decide to talk to your classmates and friends and figure out a solution.&lt;/p&gt;

&lt;p&gt;It seems logical to split the pages so that each one of you takes care of at least a couple of them. It also makes sense that the pages each of you takes home contain related content, so that what you read makes sense on its own; thus, you try to split by chapters.&lt;/p&gt;

&lt;p&gt;It also looks like there is a need for a coordinator. Let’s say you take up that task, since it was your idea. (You would ideally take up a chapter or two yourself, but let’s say that management and communication will take up most of your time.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3044%2F1%2A5eg7h0e8H3IwTHSHoZ49mw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3044%2F1%2A5eg7h0e8H3IwTHSHoZ49mw.jpeg" alt="Help each other!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another thing to consider is to split the pages according to who has the most time available and who is a speedy reader or a slow one so that the process is as efficient as possible, right? Also, some of you might have other homework to do within the week, so this must also be taken into account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3120%2F1%2AJSuV_38Cec3ScFQf2L1ZIg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3120%2F1%2AJSuV_38Cec3ScFQf2L1ZIg.jpeg" alt="Communicate with each other, know the availability and distribute work accordingly"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Throughout the week, it would be good to talk to your fellow students to check in and see how they’re doing. And of course, since reading the chapters will not be done in one go, use bookmarks to note your progress and keep track of where you are with the task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3110%2F1%2Ao2uzbmPPeags7ehMwMauvA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3110%2F1%2Ao2uzbmPPeags7ehMwMauvA.jpeg" alt="Bookmark, keep track and redistribute work in case of failure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if you had to count more than one term? The splitting of the pages should probably be done according to the title of the chapters and the likelihood of the chapter including the terms. And what if something happens and one of you cannot complete the task? The respective pages should be redistributed to the rest of you, ideally depending on how many pages each of you has left.&lt;/p&gt;

&lt;p&gt;In the end, you would all gather and add up your counts to have your results.&lt;/p&gt;

&lt;p&gt;So, to sum up, to tackle this task, it makes sense to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Split the chapters between fellow students&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Have you organize things, since it was your idea and you know how things should play out&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Split the chapters according to each student’s capacity — take into account reading speed and availability&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Re-distribute the work if something happens and a person cannot finish up their part&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keep track of how things are going — use bookmarks, talk to your fellow students to keep track of their progress, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gather up at the end to share and combine results&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How this relates to Spark and PySpark — getting a bit more technical
&lt;/h2&gt;

&lt;p&gt;The homework example illustrates, &lt;em&gt;as I understand it&lt;/em&gt;, the over-simplified basic thinking behind &lt;a href="https://spark.apache.org" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; (and many similar frameworks and systems, e.g. horizontal or vertical data “sharding”). The data is split into reasonable groups (called “partitions” in Spark’s case), based on the kind of tasks you have to perform on it, so that processing is efficient. Those partitions are then distributed to an ideally equal number of workers, or to as many workers as your system can provide; these workers can be on the same machine or on different ones, e.g. one worker per machine (node). There must be a coordinator of all this effort, to collect the information needed to perform the task, to redistribute the load in case of failure, and even to re-partition the data when the computations require it (e.g. we need to calculate something on each row of data independently, but then we need to group those rows by a key). A (network) connection between the coordinator and the workers is also necessary, so they can communicate and exchange data and information. Finally, there is the concept of doing things in a “lazy” way and of using caching to keep track of intermediate results, so that not everything has to be calculated from scratch every time.&lt;/p&gt;

&lt;p&gt;PySpark is the Python API for &lt;a href="https://spark.apache.org" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, which is “a unified analytics engine for large-scale data processing”.&lt;/p&gt;

&lt;p&gt;Note that this is not an exact, one-to-one comparison with the Spark components, but it is conceptually close. I’ve also omitted many of Spark’s internals and structures for the sake of simplicity. If you want to dig deeper, there are plenty of resources out there, starting with &lt;a href="https://spark.apache.org/research.html" rel="noopener noreferrer"&gt;the official Apache Spark site&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3136%2F1%2ADZpJ7d-yFEecoLBTnQBryA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3136%2F1%2ADZpJ7d-yFEecoLBTnQBryA.jpeg" alt="Comparison with Spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As I mentioned, the comparison depicted in the previous image is not quite accurate. Let’s try again and get the teacher into the picture too. The teacher is the one who provides the homework and the instructions (the driver program); the students are split into working groups, and each working group can take care of a part of the task. For the sake of brevity, and to make my drawings less complicated and my life a bit easier, the image below compares just one working group to Spark. This, I feel, is a bit closer to what actually goes on when a Spark application runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3058%2F1%2AviGc7cGu-NZla0wum2J10w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3058%2F1%2AviGc7cGu-NZla0wum2J10w.jpeg" alt="Perhaps a better comparison with Spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In simpler, and a bit more technical, terms: let’s say you have a huge text file on your computer (OK, not big-data-huge, but say a 15GB file), and you really want to know how many words it contains or, as in the homework above, how many times the term “big data” appears in it, along with the relevant phrases. You will be faced with the following issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;you cannot really open this file with, say, Notepad: even if you have 32GB of RAM, an application meant for opening and editing text files will be practically unusable with a 15GB file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;you can code something to count the words or a specific word or phrase in this file, either by reading it line by line or by using something like &lt;code&gt;wc&lt;/code&gt;, depending on your system, but it will be &lt;em&gt;slow&lt;/em&gt;, &lt;em&gt;very&lt;/em&gt; slow. And what if you need to do more complicated things?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, immediately we see that there is no quick and easy option to do simple, let alone complex things with a big file.&lt;/p&gt;

&lt;p&gt;One can think of several workarounds, like splitting the huge file into many little ones, processing the little ones, and adding up the results, leveraging multiprocessing techniques. And here is where Spark comes in to provide an easy solution. Let’s see a very basic PySpark example using the Python library &lt;a href="https://pypi.org/project/pyspark/" rel="noopener noreferrer"&gt;pyspark&lt;/a&gt;.&lt;/p&gt;
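&lt;p&gt;For contrast, the manual workaround might look something like this sketch, with a tiny in-memory “book” standing in for the split-up files. (It uses a thread pool to keep the example small; a real version of this CPU-bound counting would use processes, e.g. &lt;code&gt;ProcessPoolExecutor&lt;/code&gt;.)&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

# A tiny in-memory stand-in for the book, as a list of lines.
BOOK = [
    "big data here",
    "nothing relevant here",
    "more big data, big data everywhere",
]

def count_term(lines):
    # each worker counts occurrences in its own slice (its "chapters")
    return sum(line.lower().count("big data") for line in lines)

def chunk(seq, n):
    # naive split into n roughly equal slices
    k = max(1, len(seq) // n)
    return [seq[i:i + k] for i in range(0, len(seq), k)]

# distribute the slices to a pool of workers, then combine the results
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_term, chunk(BOOK, 2)))
total = sum(partial_counts)
print(total)  # 3
```

&lt;p&gt;Notice how much bookkeeping (splitting, distributing, combining) is already on us, and we haven’t even handled failures or progress tracking yet.&lt;/p&gt;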


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;It looks quite simple, doesn’t it? Just a few lines of Python code. Now let’s explain a bit about what it does:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;There is no obvious splitting of the file into “chapters”, no coordination, no keeping track, nothing of the sort. That is because Spark takes care of all the complexity behind the scenes: we do not have to tell workers and executors which parts of the file to read, decide how to split it, or handle an executor suddenly dropping its part, and so on. So, here, we’ve done our homework in just a few lines of code.&lt;/p&gt;

&lt;p&gt;Don’t get me wrong: Spark seems simple, but there is a lot of complexity behind it, and troubleshooting it is not an easy task at all. Still, let’s just appreciate the good parts for now; we can talk about the difficulties later on.&lt;/p&gt;

&lt;p&gt;Additionally, the example here is one of the simplest ones. But I believe that once you understand the mechanism and logic behind such frameworks, it becomes a lot easier to grasp what you can and, more importantly, cannot do with them, how to structure systems that leverage them, and how to estimate whether doing things a certain way will be fast and efficient or not. Again, to keep this simple, I won’t go into further details right now.&lt;/p&gt;

&lt;p&gt;I hope this was helpful. Any thoughts, questions, corrections and suggestions are very welcome :)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://towardsdatascience.com/explaining-technical-stuff-in-a-non-techincal-way-apache-spark-274d6c9f70e9" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I could use a &lt;a href="https://www.buymeacoffee.com/mkaranasou" rel="noopener noreferrer"&gt;coffee&lt;/a&gt; to keep me going :) &lt;br&gt;
Thanks!&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>distributedsystems</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
