DEV Community: Lucas Miranda

Spark AI - Bringing Chat GPT to Data Engineering

Lucas Miranda — Sat, 08 Jul 2023 23:15:49 +0000

Chat GPT has opened up a sea of possibilities with its huge capacity to understand human language. Since OpenAI made its GPT models available to developers through a REST API, many of those possibilities have started to become reality, like Bing integrating GPT into Microsoft's search tool, or Auto-GPT, "An experimental open-source attempt to make GPT-4 fully autonomous".
And now, more precisely on June 29, 2023, a new post on the Databricks blog introduced pyspark-ai, "The English SDK for Apache Spark". It offers a nice API on top of the familiar PySpark DataFrames that lets us load data from the web (much like web scraping) into a DataFrame, perform transformations, run assertions about the data, and describe and plot different views of the dataset, all in natural language. Let's see some examples (from the original article):

# setup, then ingest data
from pyspark_ai import SparkAI
spark_ai = SparkAI()
spark_ai.activate()
auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")
auto_df.show(n=5)

rank	brand	us_sales_2022	sales_change_vs_2021
1	Toyota	1849751	-9
2	Ford	1767439	-2
3	Chevrolet	1502389	6
4	Honda	881201	-33
5	Hyundai	724265	-2

# plot
auto_df.ai.plot()
# with instructions
auto_df.ai.plot("pie chart for US sales market shares, show the top 5 brands and the sum of others")

# transformations
auto_top_growth_df=auto_df.ai.transform("brand with the highest growth")
auto_top_growth_df.show()

brand	us_sales_2022	sales_change_vs_2021
Cadillac	134726	14

# validation
auto_top_growth_df.ai.verify("expect sales change percentage to be between -100 to 100")
# outputs True

SparkAI also provides a cool API for UDFs:

@spark_ai.udf
def previous_years_sales(brand: str, current_year_sale: int, sales_change_percentage: float) -> int:
    """Calculate previous years sales from sales change percentage"""
    ...
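
For comparison, here is a hand-written, plain-Python sketch of what such a UDF would be expected to compute. It is not code produced by pyspark-ai: the function name, the formula (previous = current / (1 + change / 100)) and the rounding are assumptions based on a reading of the docstring.

```python
def previous_sales_by_hand(current_year_sale: int, sales_change_percentage: float) -> int:
    """Estimate last year's sales from this year's sales and the percentage change."""
    # current = previous * (1 + change / 100)  =>  previous = current / (1 + change / 100)
    growth_factor = 1 + sales_change_percentage / 100
    if growth_factor <= 0:
        raise ValueError("a change of -100% or lower has no meaningful previous value")
    return round(current_year_sale / growth_factor)


# Toyota row from the table above: 1,849,751 units, -9% versus 2021
print(previous_sales_by_hand(1849751, -9))  # 2032693
```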

It looks amazing, right? If you want to give it a shot, I have built a CLI on top of pyspark-ai and you can run it interactively. Check it out: https://github.com/lucas-lm/spark-ai-cli

PySparkAI CLI

Let's suppose we want to find the 3 most starred repositories in the google topic on GitHub (https://github.com/topics/google). Using PySpark AI CLI, we could run this command in a shell to get that view:

pyspark-ai https://github.com/topics/google --transform "top 3 python repos with more stars"

The results below were produced by the command above using gpt-3.5-turbo from OpenAI as our LLM.

As we can see, it achieves satisfactory results, but there are some mistakes, like the wrong table name and the lowercase value in the filter when the values in the DataFrame are in Title Case.
As of today, pyspark-ai is still in early-stage development, so this kind of gap is expected.

Nevertheless, it has great potential to become a tool for studying and exploring datasets.

Note about pyspark-ai-cli:
The plot feature is not supported because pyspark-ai enforces plotly as its visualization library (in spark_ai.plot function), which does not display any figure when running from a terminal (https://github.com/plotly/plotly_express/issues/47).

If you want to get started with PySparkAI CLI, check the instructions in the public repository. If you are more interested in the pyspark-ai features, check out its GitHub repository.

PySpark-AI under the hood

If you take a quick look at the pyspark-ai source code, you will notice that it follows a pattern:

Your input to a method (transform, plot, verify etc.) is captured
Your input is used to fill in a prompt template
This prompt is processed by some LLM (commonly through the GPT REST API)
The output of the LLM is parsed to extract the code blocks
The code blocks are executed with Python's exec function
The results that matter from the execution are returned
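
The steps above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the actual pyspark-ai code: the prompt template, the function names and the fake_llm stand-in (which returns a canned reply instead of calling a real model) are all hypothetical, and plain lists of dicts take the place of Spark DataFrames.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly

# Hypothetical prompt template, not the one used by pyspark-ai
PROMPT_TEMPLATE = (
    "Given a list of dicts called rows with keys {columns}, "
    "write Python that stores the answer in a variable named result.\n"
    "Instruction: {instruction}"
)


def fake_llm(prompt: str) -> str:
    """Stand-in for the real LLM call; always returns the same canned reply."""
    return (
        "Here is the code:\n"
        f"{FENCE}python\n"
        "result = max(rows, key=lambda r: r['sales_change_vs_2021'])\n"
        f"{FENCE}\n"
    )


def extract_code_blocks(reply: str) -> list[str]:
    """Pull the contents of every fenced code block out of the reply."""
    pattern = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)
    return pattern.findall(reply)


def run_instruction(rows: list[dict], instruction: str):
    # Steps 1 and 2: take the user's input and build the prompt
    prompt = PROMPT_TEMPLATE.format(columns=sorted(rows[0]), instruction=instruction)
    # Step 3: send the prompt to the LLM
    reply = fake_llm(prompt)
    # Step 4: extract the code blocks
    namespace = {"rows": rows}
    for code in extract_code_blocks(reply):
        # Step 5: execute them; nothing inspects the code first
        exec(code, namespace)
    # Step 6: return the part of the execution that matters
    return namespace["result"]


rows = [
    {"brand": "Toyota", "sales_change_vs_2021": -9},
    {"brand": "Chevrolet", "sales_change_vs_2021": 6},
]
print(run_instruction(rows, "brand with the highest growth")["brand"])  # Chevrolet
```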

This high-level overview is a bit reminiscent of the illustration given in the Databricks blog post (https://www.databricks.com/blog/introducing-english-new-programming-language-apache-spark):

The Downside

Before talking about the cons of the "English SDK", we have to point out that the library lives under the databrickslabs organization on GitHub, which is a strong indication that it is experimental: it is not meant to be treated as a reliable product and, of course, it is not ready for production environments.

What scares me the most about the approach embraced by PySpark-AI is that we do not have control over the code that runs. Even though we can read the logs to understand the generated code, we do not get the chance to assess that code before it is executed.
Even before we had generative AIs as advanced as today's, exec and eval were functions to be avoided because of the inherent security risks they carry.
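
To make "assessing the code before running it" more concrete, here is a minimal sketch of a static check that could run before exec. It is not part of pyspark-ai: the function name and the list of flagged calls are assumptions, and a check like this is only an aid for manual review, not a sandbox.

```python
import ast

FLAGGED_CALLS = {"exec", "eval", "open", "__import__"}  # illustrative, not exhaustive


def find_red_flags(source: str) -> list[str]:
    """Return human-readable warnings for constructs worth a manual review."""
    warnings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            warnings.append(f"line {node.lineno}: import statement")
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in FLAGGED_CALLS
        ):
            warnings.append(f"line {node.lineno}: call to {node.func.id}()")
    return warnings


# A made-up example of generated code that should not be run blindly
generated = "import os\nos.remove('data.csv')\nresult = 1 + 1\n"
for warning in find_red_flags(generated):
    print(warning)
```

Note that the check flags the import but says nothing about os.remove itself, which shows how easily a simple filter misses dangerous code.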

Another problem that comes as a consequence of this dynamic execution is side effects. We cannot trust that the generated code will always be the same given the same input. Relying on a third-party service to give us the output can also be problematic, because we may face instability, increased latency and other undesirable situations.

Code generated by GPT (model gpt-3.5-turbo) raising an error when run.

Conclusion

PySpark-AI, or the "English SDK" as it is being introduced, brings an innovative design with a nice API for working with PySpark, covering a good variety of operations.

It is easy to get started with, it can be useful for beginners, and non-technical users would feel more comfortable trying it as well.

It is not so reliable though. Even with future enhancements, I myself cannot see this kind of solution becoming safe and stable enough to be applied at scale and/or in a real-world production environment.

Python Virtual Environments

Lucas Miranda — Tue, 27 Jun 2023 00:20:47 +0000

Packaging and modularity are great features present in every relevant programming language, as they allow us developers to easily reuse recurring pieces of code (the modules).

In Python we can add packages as dependencies of our project by installing them with some package manager like pip:

pip install requests==2.30.0
pip install pytest pytest-cov

In the example we are installing the library requests version 2.30.0 in the first line and in the second line we are installing the libraries pytest and pytest-cov. As we are not specifying any version for the packages pytest and pytest-cov, we will have the latest versions installed.

That is pretty cool, but a problem arises when working on multiple projects simultaneously, sharing projects with others, or even just using the same machine to study: managing project-specific packages becomes challenging.

Downside of global installations (non virtual environments)

Before we get started with virtual environments, let's picture one situation.
Let's say you have started a new project, and you have to install pandas, requests and Jinja2 for it. You will simply run:

pip install Jinja2 requests pandas

After that, you are going to share this project with someone else and, of course, you will use git to version it and push it to a remote repository such as GitHub. Now other people need to contribute to your project, but when they try to run it, all they see is an error message saying "No module named 'pandas'", which makes sense, since you are sharing only your source code in the repository and not your entire setup (Python, installed packages, environment etc.).

To solve this problem you could simply instruct the people working on your project to run pip install Jinja2 requests pandas. But that does not completely solve the problem. Some people might already have a different version of requests installed on their machines, which would not raise the same error as before, but could raise different errors caused by incompatibilities between the versions in use. And now you say "it works on my machine!"

So you need a way to share the exact same versions you are using in the current project, so that it works properly for everyone. Fortunately, pip has a solution for that: the freeze subcommand. With it you can output all your dependencies (including the dependencies of your original dependencies, that is, the whole dependency graph) with their installed versions, and redirect this output to a text file that can later be used to install the correct dependencies:

pip freeze > requirements.txt

The above command is a common pattern in Python to create a requirements.txt file with the dependencies you have installed so far (it "freezes" them at their currently installed versions). Then, when you want to set up the project on a new machine, you can run:

pip install -r requirements.txt

This command will install everything you have in your requirements.txt file.
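
To get a feeling for what ends up in that file, the sketch below builds similar name==version lines from Python itself, using the standard importlib.metadata module. It is only an approximation for illustration: the function name is made up, and the real pip freeze handles more cases (editable installs, packages installed from URLs and so on).

```python
from importlib.metadata import distributions


def frozen_requirements() -> list[str]:
    """Return 'name==version' lines for every distribution visible to this interpreter."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


for line in frozen_requirements():
    print(line)
```

The list depends entirely on which interpreter runs the script, which is exactly why the same command gives different results inside and outside a virtual environment.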

Note: Make sure to keep this file in the root of your source code for convenience.

Now we have a short command to install every single dependency and we just have to ensure that requirements.txt is up-to-date by running pip freeze > requirements.txt every time we add a new dependency to our project, right? Not exactly. We still have a problem.

Remember when we first installed requests 2.30.0, pytest and pytest-cov at the beginning of this article? Those libraries are not dependencies of the project we started later, but you will notice that your requirements.txt still includes them (and their whole dependency tree). Take a look:

This can be a problem, because it will install packages the project does not need and it might even affect the way our dependencies are resolved by pip install. Furthermore, imagine the mess we will have when we need to work with different versions of the same package for distinct projects on the same machine...

To overcome this problem, we can use python's virtual environments.

Python's venv

A virtual environment in Python (venv for short) is a way to isolate your project context (Python interpreter, installed libraries etc.) from your global Python configuration. In practice, the packages installed in a virtual environment do not conflict with packages installed globally on your machine. It is a best practice to have a separate venv for each project you are working on. Next we will see a step-by-step guide to using and understanding venv.

Create venv

Python comes with a built-in module to create virtual environments. It is called venv, and you can use it to create a new virtual environment like this:

python -m venv venv
# note: on your system, the python command may have another name,
# like python3 or just py

In the above command, python -m venv is the command to create a new virtual environment and the last venv is just the name of your virtual environment. It will be used to create the local folder containing the Python interpreter, the installed libraries etc. It could be any other name, though venv is the most commonly used one.
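
The same module can also be used from Python code. The sketch below creates a throwaway environment in a temporary folder and prints its pyvenv.cfg file, just to show what the command generates. The helper name create_env is made up for this example, and with_pip=False is used only to keep it fast.

```python
import tempfile
import venv
from pathlib import Path


def create_env(env_dir: Path) -> Path:
    """Create a virtual environment at env_dir and return the path to its config file."""
    # with_pip=False keeps the sketch fast; `python -m venv` installs pip by default
    venv.create(env_dir, with_pip=False)
    return env_dir / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    config = create_env(Path(tmp) / "venv")
    print(config.read_text())
```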

Note:
Do not worry about the folder venv generated by this command.
It will be managed by python and you will never have to touch it.

Your venv is now created, but it is not active yet, so you are still in your regular environment.

Activate venv

Now that we have a venv for the project, we can activate it to take advantage of its benefits.

# Linux and MacOS:
source venv/bin/activate

# Windows
.\venv\Scripts\activate

You will notice that venv (or whatever name you gave to your venv) appears in green on your command line, indicating that your virtual environment was successfully activated and that you are now running in it.

Do not worry about your shell configuration, it remains mostly unmodified. Your python interpreter changes to the one contained in your venv though.
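
If you are ever unsure whether the venv is really active, you can ask Python itself. A minimal sketch (the function name is just for illustration):

```python
import sys


def in_virtual_env() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a venv:", in_virtual_env())
```

With the venv active, the interpreter path should point inside your venv folder.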

Working on venv

The first thing to notice is that you no longer have access to the packages and modules installed globally. Trying to import pandas raises ModuleNotFoundError, for example. That makes sense: we are in a brand new environment right now.
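
A quick way to see which of the packages are visible from the current environment, without triggering the error, is to look them up before importing. A small sketch, where is_installed is a made-up helper name:

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Check whether a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("pandas", "requests", "jinja2"):
    print(f"{name}: {'available' if is_installed(name) else 'missing'}")
```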

Let's try to re-install the dependencies we need and freeze them.

pip install Jinja2 requests pandas
pip freeze > requirements.txt

Note that you can use the python and pip commands just like you do outside of the virtual environment. The difference is that python and pip now come from the venv instead of your global installation.

Look at the difference between the previous requirements.txt (on the left) and the new one generated in the virtual environment (on the right):

We went from 22 to 13 dependencies. That is a reduction of 9 dependencies, roughly 40% fewer than we had before!

Note:
When you close your terminal or your IDE, your virtual environment will be deactivated.
So you have to activate it every time you open it again.

Deactivate venv

Whenever you need to deactivate your venv to get back to your "global environment", you can simply run the deactivate command.

deactivate

Further considerations

Perfect, now we can work with virtual environments to have an isolated context for each project we are working on, but there are still a few last observations.

Git

The virtual environment directory can contain a lot of files and it tends to be large in size. Also, the package manager might not work the same way on different machines, so you should add your venv folder to the .gitignore file when working with git. This way you keep only the source code in the remote repository, anyone can recreate the venv, and the package manager will do its best to install the dependencies the right way for their machine.

CI/CD pipelines

In CI/CD pipelines we usually already have a fully isolated environment, so there is no need to add commands to your scripts to create and activate a virtual environment before installing the dependencies. You can go straight to the installation and execution of your Python script in these cases.

VSCode Users

There are some cool extensions for Python in VSCode, but the essential one, in my opinion, is the Python extension from Microsoft.

This extension brings features like IntelliSense, debugging and so on.
In order to take advantage of the features of this extension, you have to make sure that the Python interpreter currently selected is the one from your venv (assuming that you are working in a venv).

Let's suppose I have added a new dependency to my project (only in the venv):

pip install matplotlib
pip freeze > requirements.txt

Here is what will happen if we are NOT selecting the correct python interpreter:

Note that we have matplotlib installed in our venv, but we still get a warning saying that matplotlib could not be resolved. Look what we get if we try to run it from the VSCode "run button":

If we try to run from the terminal with venv active:

That means that VSCode is using the wrong interpreter. Fixing that is very easy: click the Python interpreter in the bottom right corner (marked in red in the illustration) and, in the panel that opens at the top, select the correct interpreter (you can identify the right one by its path, also marked in red).

After that, the warnings will disappear, IntelliSense will work fine and the run button will no longer result in an error:

It is also worth noticing that the bottom right corner now shows your venv interpreter.

In conclusion, virtual environments are a powerful tool for managing dependencies and creating isolated project contexts in Python. They help ensure consistent installations and prevent conflicts between different projects.

If you want to go deeper into virtual environments, also check https://realpython.com/python-virtual-environments-a-primer/