DEV Community: as3fn

Dominate MS Office Reporting With Python Part2

as3fn — Tue, 08 Jun 2021 20:54:12 +0000

Intro

Last time we discussed the WHY, now it's time to discuss the HOW. This series is based on the python-pptx library, a small and great library, but not the most beginner-friendly one. We will try to make it friendlier in this part.

Syntax

After installing the library with pip install python-pptx, things may be overwhelming if you navigate the library's documentation. To save you the trouble, I will break down what you will need from the library, and once you can do simple template population, feel free to delve into the docs and explore more cool stuff!

Presentation()

The first thing you need to do is import Presentation, this is the core of what we will do. Calling it opens a copy of your template and returns a presentation object that you keep in a variable (prs here) to use later on. The argument it needs is the name/path of the template presentation. Nothing is written back to disk until you call prs.save() with an output path, so the template itself stays untouched. Personally I keep my templates in a different location from where my code resides for various reasons, if you want to know how to use the filesystem or my reasons, you may check this article.

from pptx import Presentation
prs = Presentation('./market/data/temp.pptx')

Slides

In the first part, we broke down the presentation into a hierarchy tree. The second level is the slides level: prs.slides is an iterable that lets you loop through the slides or access a slide directly using list notation, starting with 0 for the first slide.

for sld in prs.slides:
    print(sld)
print(prs.slides[0])

Shapes

The really interesting part of this hierarchy is the shapes. I talked about them quite a bit in the first part because of how important they are, so get familiar with them before we proceed.
The shapes can be accessed from a single slide object. Let's say we are interested in the first slide for now, we can get its shapes in a similar fashion to the slides like this: prs.slides[0].shapes. This returns an iterable object,
so to find a specific shape you iterate in a loop and compare against the name of the shape (the one you gave it in the selection pane inside PowerPoint), or against the type number to get all shapes of a specific type (3 is the ID for charts).

for shp in prs.slides[0].shapes:
    if shp.name == 'selectmeplz':
        print('hello world')
    if shp.shape_type == 3:
        print('I am a chart!!!')
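
The name lookup above can be wrapped in a small helper so you don't repeat the loop everywhere. This is only a sketch: find_shape is a name I made up (it is not part of python-pptx), and the stand-in objects below mimic just the name attribute so the snippet runs without a template file.

```python
from types import SimpleNamespace


def find_shape(shapes, name):
    """Return the first shape whose name matches, or None if there is none."""
    for shp in shapes:
        if shp.name == name:
            return shp
    return None


# Stand-ins that mimic only the `name` attribute of python-pptx shapes.
# With a real deck: find_shape(prs.slides[0].shapes, 'selectmeplz')
fake_shapes = [SimpleNamespace(name='Title 1'), SimpleNamespace(name='selectmeplz')]

match = find_shape(fake_shapes, 'selectmeplz')
missing = find_shape(fake_shapes, 'not_there')  # None: no shape has this name
```

Returning None for a missing name lets the calling code decide whether a missing shape is an error or just a slide that doesn't have that shape.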

Basic shapes

Diving down into the shapes, we find 3 major types that you will need in most presentations. The first one is the basic or box shape: these are the titles, headers, footers and so on. These are very basic and you only manipulate the text inside them.
shp.text is the property to use, you can read the text inside the shape or feed it new text like this: shp.text = 'changemeplz'.
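
Not every shape can hold text (charts and tables can't), so it helps to check before writing. Here is a minimal sketch of filling several text boxes in one go. fill_text_shapes is a hypothetical helper name, the shape names are made up, and the stand-in objects only imitate the name, has_text_frame and text attributes that python-pptx shapes expose.

```python
from types import SimpleNamespace


def fill_text_shapes(shapes, new_text_by_name):
    """Set .text on every named shape that can hold text; return the names changed."""
    changed = []
    for shp in shapes:
        # has_text_frame is False for shapes such as charts and tables
        if shp.name in new_text_by_name and getattr(shp, 'has_text_frame', False):
            shp.text = new_text_by_name[shp.name]
            changed.append(shp.name)
    return changed


# Stand-ins for a title box and a chart; with a real deck pass prs.slides[0].shapes
title = SimpleNamespace(name='title', has_text_frame=True, text='changemeplz')
chart_shape = SimpleNamespace(name='C1', has_text_frame=False)

changed = fill_text_shapes([title, chart_shape], {'title': 'Q2 sales', 'C1': 'ignored'})
```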

Charts

Charts (shp.chart) are a critical part of any presentation and the main reason we do analytical presentations in the first place. There are different kinds of charts, but one of them is more common so it will be covered here. The common type is the categorical chart: any chart that has a categorical axis, which includes bar charts and line charts, single or stacked. The basic idea is simple: you create your chart in PowerPoint, add any formatting or modification, name it, then use python-pptx to populate it.

The categorical chart consists of only two parts, the categories and the series. To work on this chart you need to import the CategoryChartData class; calling it returns an empty chart data object that you fill in, as we will see.
Let's say you have your data in tabular form, where your index or categorical axis is known. It can be text, dates or numbers, but the values should be unique. Each column you have left can be treated as a series in your chart. The following code will clarify things more:


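from pptx.chart.data import CategoryChartData

chart = shp.chart  # shp is a chart shape found in the shapes loop
data = CategoryChartData()

# the category axis: unique labels (df is assumed to be a pandas DataFrame)
data.categories = df['your_index_or_main_axis'].tolist()

# every remaining column becomes one series, named after its column
cols = [c for c in df.columns if c != 'your_index_or_main_axis']
for col in cols:
    data.add_series(col, df[col].tolist())
chart.replace_data(data)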

Adding series to your data object can be done for 1 series or more, I do it in a loop so you get the general idea. The series could be the sales data of different companies vs date, so your chart will be a group of trend lines with date on the x axis, and so on.
Once you add your series and set your categorical axis, you replace the chart's data using its replace_data method. Note that this writes every series you added, so if your chart is expecting a limited number only, you have to be careful.
This method of replacing the data is great as the chart ranges and axes change dynamically without your input, so you can populate the same chart with 3 series in one slide then with 5 in another slide without any modification from your side. Cool right?!
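
If a chart should only ever get a fixed number of series, one way to be careful is to cap the columns before they reach the chart. The sketch below is an assumption of mine, not something python-pptx provides: fill_chart_data and max_series are made-up names, the sales numbers are invented, and FakeChartData is a stand-in that copies only the two members of CategoryChartData used here (categories and add_series) so the snippet runs on its own.

```python
def fill_chart_data(chart_data, columns, index_col, max_series=None):
    """Copy a table (dict of column name -> list) into a chart data object.

    chart_data is expected to behave like CategoryChartData: a `categories`
    attribute and an `add_series(name, values)` method.
    """
    chart_data.categories = list(columns[index_col])
    series_names = [name for name in columns if name != index_col]
    if max_series is not None:
        series_names = series_names[:max_series]  # drop the extra columns
    for name in series_names:
        chart_data.add_series(name, list(columns[name]))
    return series_names


class FakeChartData:
    """Stand-in that records what it receives, so no template is needed."""

    def __init__(self):
        self.categories = []
        self.series = {}

    def add_series(self, name, values):
        self.series[name] = values


sales = {
    'month': ['Jan', 'Feb', 'Mar'],
    'company_a': [10, 12, 9],
    'company_b': [7, 8, 11],
    'company_c': [3, 4, 5],
}
recorder = FakeChartData()
used = fill_chart_data(recorder, sales, 'month', max_series=2)
```

With the real objects the call would look like fill_chart_data(CategoryChartData(), df.to_dict('list'), 'month', max_series=2), followed by chart.replace_data() on the filled object.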

Tables

Tables (shp.table) are a tricky kind of shape; the table object is more of an Excel object than a PowerPoint object imo. For some reason table objects don't change dynamically like charts, so if you have a table
with 10 rows and you feed it 11 rows' worth of data, it will not add an extra row. The opposite is also true: less data will not remove the excess rows/columns, so you will do some cleaning by hand for sure. Unlike charts,
you don't work with series or uniform lists, rather you work with individual cells starting at index (0,0). So to populate a table, you need to fill each cell with its data! At least this is what I found so far,
so if you have a better way, please tell me!

table = shp.table
j = 1  # second column
for i in range(len(table.rows)):
    table.cell(i, j).text = str(i * 2)  # .text only accepts strings

I like to work on tables in a column wise manner, I loop through columns and call a function to populate the rows of that column, this is my preference in the end and you can do it in another way if you like.
For more control on the font formatting of each cell, you may take a look at this answer
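
To give an idea of the column-wise approach, here is a rough sketch of a column filler that also copes with the "more data than rows" problem by handing back whatever did not fit. fill_column is a hypothetical helper of mine, and FakeTable is a stand-in with only the two members used (rows and cell) so the example runs without a template.

```python
def fill_column(table, col_idx, values, first_row=0):
    """Write values down one column; return the values that did not fit."""
    values = list(values)
    room = max(len(table.rows) - first_row, 0)  # cells available in this column
    for offset, value in enumerate(values[:room]):
        table.cell(first_row + offset, col_idx).text = str(value)  # cells hold text only
    return values[room:]


class FakeCell:
    def __init__(self):
        self.text = ''


class FakeTable:
    """Stand-in with the two members used above: `rows` and `cell(row, col)`."""

    def __init__(self, n_rows, n_cols):
        self._cells = [[FakeCell() for _ in range(n_cols)] for _ in range(n_rows)]
        self.rows = self._cells  # len(table.rows) gives the row count

    def cell(self, row_idx, col_idx):
        return self._cells[row_idx][col_idx]


table = FakeTable(n_rows=3, n_cols=2)
# first_row=1 skips the header row; only two data rows are left for four values
leftover = fill_column(table, 1, [10, 20, 30, 40], first_row=1)
```

A non-empty leftover list is the signal that the template table needs more rows before the report is complete.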

Congrats

Now you have the basic tools to automate that boring report using only a small script. Of course you are not limited to these tools: you may customize your presentations further, and you are free to explore and share your finds with us. In the next part of this series, I will be doing a walkthrough of a simple report to share my flow and the helper functions that make my life much easier, stay tuned.

It's Time To let Jupyter Go

as3fn — Sun, 11 Apr 2021 23:00:46 +0000

Who Are You

A junior data person who does everything through Jupyter Notebooks (JN). Maybe you just started your first job or are about to start, and you are long past the point of online courses and tutorials. So this is NOT for BEGINNERS in any way, actually if you are a beginner you are better off starting with JN!

The spirit of JN

Before going through why you should leave JN, I want to give a big thanks to the JN development team for making my learning experience, and that of many others, smooth, simple and welcoming as we were starting with data science. But what I feel like thanking them for more is what I call the spirit of JN. When I read about functional programming, a lot of the concepts and best practices were already familiar to me simply because of the JN style. The style of JN is like this: you make small blocks of code (functions) such that the output of one block can feed as input to others, and the whole thing runs in a sequential way. This kinda forces you into dividing your code into smaller, simpler parts. Testing them and moving them around becomes super easy, tracking errors and bugs is much easier, and the effects of a modification are limited to a smaller area of the total code. Expand this style further and you will start creating helper files and modules without realizing it; it grows naturally on you, at least that was my case. This kind of style or spirit is what makes me really grateful to the JN team.

What's A JN

JNs are like a safe space, very friendly and useful for experimenting and creating quick results, models and EDA using a GUI run by Python. Going into the wild, things start to become less friendly: you move files and code all over the place, you may work on other machines, use the command line to run code, inspect the machine and so on.

What JN does is abstract away a lot of things so you can experiment without worrying about them, but this abstraction comes with a price tag: you carry a lot of baggage and limit yourself as a programmer. The wild requires you to be fast and lean, learning new tricks and utilizing the full potential of the beautiful Python language.

So Should I leave JN for good?

The short answer is yes, but not totally. The main reason I left the JN was because I could not use Vim, scrolling and copy and pasting through large notebooks became a nightmare.

I tried looking for a way to use vim with JN but with no hope. Then I found my salvation: vscode integrated JN last year, so you could open, edit and even convert notebooks to other file types natively inside vscode, and as a bonus you can add the vim plugin (or any other plugin) to do your vi magic.

This seemed promising at first, but I was still not able to use vim commands on the cells. The notebooks inside vscode were just like the ones in the browser, you could not jump between cells or use vim inside of them. The only solution was to work with pure .py files. This seemed intimidating at first as I was used to the cell layout and running blocks of code individually, and working with normal Python files was not like what I had been doing for 2 years of online courses.

This seemed like a problem, but because of it, I discovered a new world that I would not have dared to approach if I was sticking to the notebooks.

All You Need Is #%%

This is not a mask for a swear word, rather it's a special comment in Python files that makes the editor treat what is beneath it as a "cell". I found this when I converted my notebooks to .py files: this comment replaced the cell structure, and in vscode, when you add it to your .py file, you get the usual cell options (run cell, run above, debug cell) as shown.

With this weird comment, you now have the structure of JN inside Python files, and you can use vim and other great vscode plugins if you want. This is why I said you will not leave JN totally, as it can still be with you in spirit and style.
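
To make the idea concrete, here is a tiny made-up .py file split into three cells. The numbers and variable names are placeholders of mine. Each # %% line starts a new cell in vscode, while the file stays a plain script that also runs top to bottom from the command line.

```python
# %% Load the data (stand-in values instead of a real file)
sales = [120, 95, 143, 110]

# %% Transform: each cell can reuse what the cells above defined
total = sum(sales)
average = total / len(sales)

# %% Report
summary = f'total={total}, average={average:.1f}'
print(summary)
```

Any text after the marker on the same line is just part of the comment, so it works as a cell title.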

Hacking Is The Way to Go

Adding this comment by hand whenever I wanted to create a new cell was too much for me. Could I make a shortcut/keybinding that writes it down for me? Turns out I could. I had to read about keybindings in vscode and how to configure them, which drove me to dig around and get my hands dirty, something I wouldn't have done if I was sticking to JN. From there on, I started to play with different things like the development environment, configuration files, libraries and so on.

A Mental Shift

Relying on JN will limit you, at least that was my case. Removing JN from the equation forced me to try and tweak things more, read more documentation and learn how stuff works under the hood. This simple shift of mentality has improved and added to my skills greatly. I understand this might not be your cup of tea, but trying will not hurt as you are getting almost exactly the same experience as using JN the normal way.

Dominate MS Office Reporting with Python part1

as3fn — Wed, 24 Mar 2021 19:21:10 +0000

Who Are You?

In the field of data analytics there are 2 types of people: senior business/corporate people and younger programming-driven analysts. The first type is in love with the MS Office suite, primarily Excel and PowerPoint; these two are almost the only tools the first type can use/communicate with. The second type are mostly Python developers; they hate MS Office as it's proprietary software, and a free language can do the same things faster and on a much larger scale.

Being the second type myself, I just want to make clear that this will not be a series about how superior Python is to MS Office reporting software. Rather this will be a bridge between the two, as both can do great things if they work together.

Business Guy

I know how much you like Excel and PowerPoint and that you can't imagine a world without them. They are very user friendly, you have been using them for years, you can port your files to any machine and the reports and your killer presentations will work just fine: connecting to DBs, equations and plotting, sumifs and vlookups.
I know this much cause I was a heavy user of the MS suite for months.

But you should know that Excel is a monster, a very large program that needs updating constantly. It can't handle large datasets as it did before (not because Excel got weaker, but because data has grown by orders of magnitude), and VBA is a pain in the neck. A good program generally does one thing fast and does it well enough; you just can't have one program that does everything well, instead it will do just ok on average. This might have been enough before, but in the age of big data, just ok is not acceptable.

The idea is simple, to work with large volume of data, you need your computing resource free as much as possible to do the computing, with large user interface, with shiny tabs and clickable buttons, with the huge services lurking in the background, your hardware is very busy showing you these cool features, but for each small task you do, you don't want 99.9% of the remaining features, so why bother with extra, slowing-you-down solutions, when you can get a much faster option?

Here comes Python, a language simple enough to understand with little coding experience, yet powerful enough to perform complex data manipulation and calculations. Python is a general purpose language, that is it's not designed to do only one thing like a lot of more hardcore languages, but it's very rich in libraries that are very focused on one domain, so if you only need to do calculations and you need no plotting, then you summon the library that was designed to do the calculations efficiently, and you do the same for plotting, you use what you NEED only, WHEN you need to.

But I am not here to tell you to give up on Excel, rather I will show you that you can mix the two, where Python does the heavy lifting and outputs the results into slides and workbooks that you can present or run quick calculations on while on the go.

Python data Analyst

I don't need to tell you how great python is, we already know that, but Python can't replace MS office reporting software for 2 reasons:

Fun fact; Not Everyone Can Code

Excel and PowerPoint are very user friendly, you don't need to go through tons of tutorials to get decent at using them. While you are using your time to understand scopes and numpy and different pandas magic, other people used their time to learn about business, marketing, financial analysis, brand health and product management etc. These people need fast and easy to use tools to communicate their ideas to others, aka to money.

Legacy

If you are familiar with legacy code, you will find that some parts of the code are there because of "historical" reasons, changing that function or deleting that "useless" line of code can result in breaking the whole program, as this might not be the best practice, you find yourself going around these parts as the cost of going through the trouble of dealing with them is not worth it.

The same is true with managers and more senior roles: you just can't expect someone with 10+ years of experience to ditch what they have been using for years to learn a skill that will take months of free time to master. Instead they are giving you money to utilize these skills for something useful for everyone, that's not a bad deal IMO.

What kind of Reporting anyways?

This series will focus on reporting using a pre-made PowerPoint template, where you use Python to read data from Excel workbooks, manipulate it, then populate tens and hundreds of charts and tables inside a PowerPoint deck. Dealing with Excel reporting can be found everywhere, PowerPoint on the other side lacks quality content online, so this series will focus on this part more, and maybe later Excel will be visited in more detailed way.

MS language

To work with PowerPoint, you need to understand its language: what things are called can affect how you access them, and if you can access them you can manipulate them at will. This will be a very practical series: high-level information that is enough to get the work done; deeper understanding is left to you if you are interested.

Hierarchy

Like a matryoshka toy, entities inside a presentation are stacked inside bigger entities, or objects as MS likes to call them. The biggest one is the presentation. Think of each object as a list of things that you can iterate through or access directly if you know their index.

Slides

Inside the first layer is the slides object, a list that holds everything inside each slide you see. So if you have 20 slides in your presentation, expect to find 20 slide objects inside your presentation. Some slides may contain tables, some may contain charts or even both. They are similar as containers under the presentation layer, but they may contain different things.

Shapes

Now comes the useful part: almost anything of value inside a slide is called a shape object. The shape might be a text box, a table or a chart. Each shape resides inside the slide that contains it, and you can iterate through them to access the desired shape object. It's important to note that being called a shape does not mean they are all the same or have the same methods. A table is different from a chart despite both of them being shape objects.

The shape object was the hardest part for me to understand, so it's ok if you are a bit confused. Shapes are like a general structure, a building of sorts: a building can have a pool, a garage or a garden, and each of these things can have different properties and usage.

So if you want to access a table, the MS way of doing that is to find the shape of that table, then extract the table from that shape using the .table property. Like a street full of buildings, you can't know from the outside which building contains a pool and which contains a garden. The way MS wants you to do it is by opening each house and asking if it contains a pool (if you are interested in pools). That's how you know which shape is which, kind of stupid I would say.

Cool trick

This trick, as simple as it may seem, is quite important, and without it things can get really hard. A dear colleague told me about it, and I am forever thankful to you Sheeren!

So to summarize, if you are inside, let's say, slide #5 and you want to access a certain chart, you have to loop through the list of shapes inside that slide and ask each one of them: are you a chart? If yes, then you can try guessing which chart in the code is the one on the presentation in front of you. If you thought about giving your charts and tables unique names then you are a smart fella, but how, you ask?

From format shape, you can choose selection pane, this list shows every single shape inside the current slide, you can click on your shapes inside your starting template, then rename them in a logical pattern that can help you identify them when you are looping through them in the code. I like to name single charts with C+#number of occurrence from left to right, tables as T+#number in the same fashion. Once you name your shapes, you can copy and paste that slide and the charts will keep the name you gave them!

The real trick I use is that once I am inside the slide, I loop through all the shapes, filtering by type (using the shape_type property), then keep each type in a dictionary for fast access. You can find a list of shape type IDs in this link https://docs.microsoft.com/en-us/office/vba/api/office.msoshapetype
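
A rough sketch of that dictionary, assuming the shapes were named C1, C2, T1 and so on in the selection pane. group_by_type is a hypothetical helper name, the IDs 3 and 19 are msoChart and msoTable from the list linked above, and the stand-in objects only carry the name and shape_type attributes so the snippet runs without a deck.

```python
from collections import defaultdict
from types import SimpleNamespace

CHART, TABLE = 3, 19  # msoChart and msoTable in the MsoShapeType list


def group_by_type(shapes):
    """Map each shape type to a {name: shape} dict for direct lookup."""
    grouped = defaultdict(dict)
    for shp in shapes:
        grouped[shp.shape_type][shp.name] = shp
    return grouped


# Stand-ins for one slide with two charts and a table.
# With a real deck: grouped = group_by_type(prs.slides[4].shapes), then look
# up with the library's own constant, e.g. grouped[MSO_SHAPE_TYPE.CHART]['C1']
shapes = [
    SimpleNamespace(name='C1', shape_type=CHART),
    SimpleNamespace(name='C2', shape_type=CHART),
    SimpleNamespace(name='T1', shape_type=TABLE),
]
grouped = group_by_type(shapes)
chart_names = sorted(grouped[CHART])
```

Keying the inner dictionary by name is what makes the naming trick pay off: one loop per slide, then every chart or table is a direct lookup.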

End of Part 1

This was a quick introduction to the general idea, later on we will go through more technical details and get our hands dirty with code. See you soon, and please know that your input and questions are very welcome.

Filesystem For Data Scientists, With Useful Style

as3fn — Tue, 24 Nov 2020 04:30:50 +0000

If you are a beginner data scientist, chances are you have done a lot of coding and training models on Jupyter notebooks, a powerful tool for rapid prototyping or testing that makes your life easy. Unfortunately, a lot of online courses rarely leave the awesome notebook interface and spoon feed you with the necessary files in one folder where you can access them with one line of code, or if you are following a tutorial you could just copy the pieces where the author downloaded and store the data for analysis/training.

This is all sweet and fine until you face the real world, where files are downloaded, combined and cleaned by no one but you. If you were like me, putting everything in the same folder where your code resides, you will get a really messy and unproductive environment. So is there a solution to this? Yes there is, and it's not as complicated as I thought.

A Little Bit Of Style

Dealing with lots of raw files is not that sweet, you go through a lot of changes and modifications and data cleaning, with small number of files, that's ok. But when you are adding helper text files (dates, prices, brands ..etc) and some old notebook that is useful, your file will grow to something very ugly and unproductive like this mess:

fig. 1 My messy old style

I knew I had to find a way to organize this crazy pile of ordered chaos that in one week would be just chaos, but the solution I knew would involve working with the filesystem through code, which was something totally foreign to me. So let's imagine that we can deal with the filesystem easily (we will go through that shortly), how can we get organized?

A nice style that I have seen professionals follow, and that I find quite generalizable to different data science projects, is very simple yet very effective. The idea is simple: get all related stuff into one folder, and very related stuff into a deeper folder if needed, that's it! A good start is to divide your project files into input and output folders. The input can be divided further into 2 folders: a source folder that contains the raw dirty files, and a data folder where the clean stuff your code needs is nicely put in one place. This can be divided further with one folder per specific type of data (e.g., csv files), you get the idea. This is how it looks in practice:

fig. 2 Much better

Think Of A Potato Not A Tree

Potatoes are root vegetables; for the city boys, that means the good stuff is found in the roots under the dirt, not up on the plant like other veggies. The filesystem is similar to a potato plant: there is one root folder at the top and you go down the paths/roots until you reach the file you want:

fig. 3 Potato plant

The downward movement is not the only similarity between the two. If you look at the picture, imagine the far left potato is the file you are currently at. To go to the far right potato, you need to go in the root direction until you find a node (path) that leads to your target potato, I mean to where the file is. Let's say you're in the data folder in the previous images: to go to the source folder you need to move up to scope2updated, then go down to source. That's it, just moving up or down. How, you ask? This will come next.

Slashes Rock \m/

Slash, A legend

If you only use Windows, it's very likely that you never navigated through files using the command line, which is what people did (and still do) on Unix operating systems. The idea is simple: to navigate through the filesystem using the command line (or using Python, for example) you need to know 3 things:

1. Your current location, i.e., which folder you are running your code from (relative).
2. Your root folder and the big folders inside it (absolute).
3. Your target location (relative or absolute).

Relative Path

As this is a general introduction and an icebreaker on the subject, I will cover useful and general information. If you want more, you can use this info as a starting point for a deeper search.

The relative path is what you need most as a data scientist. You are basically telling your code to use your current location as a pseudo-root folder, a starting point from which you go deeper into the folders you want. This way, when you move your project folder to a different device, the code will run no matter where you place the folder in the new filesystem. This current location is expressed as "./" in the path string. So let's say you want to go to the date.txt file in figure 2, the path will be "./data/date.txt".
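
As a small illustration, here are two ways to build that same relative path without typing the slashes by hand, so the separator matches whatever OS the code runs on. The variable names are mine; the folder and file names are the ones from the example above.

```python
import os
from pathlib import Path

# "." stands for the folder the code is running in
relative = os.path.join('.', 'data', 'date.txt')

# normpath tidies the string (drops the leading "."); the path is still relative
tidy = os.path.normpath(relative)
is_absolute = os.path.isabs(relative)  # False: it does not start at the root

# the same path written with pathlib
as_path = Path('.') / 'data' / 'date.txt'
```

Because neither version mentions where the project folder itself lives, the code keeps working after the folder is moved.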

os.chdir(), os.listdir()

The power of the relative path truly shines with the filesystem navigation and manipulation functions. Two very useful ones are chdir and listdir (from the os library in Python), short for change directory and list directory respectively. listdir gives you a list of the file/folder names inside your current location (the default) or inside a folder whose path you give as an argument. chdir changes the default location or focus to a different folder. This can be useful if you want to run the same function on different folders (data cleaning for example) smoothly, with minimum modifications to the path string.
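
Here is a small sketch of the two functions working together. It builds a throwaway folder with tempfile so it can run anywhere; the file names inside it are placeholders I picked.

```python
import os
import tempfile

start = os.getcwd()  # remember where we began

with tempfile.TemporaryDirectory() as project:
    # build a tiny throwaway project: data/date.txt and data/prices.csv
    os.makedirs(os.path.join(project, 'data'))
    for file_name in ('date.txt', 'prices.csv'):
        with open(os.path.join(project, 'data', file_name), 'w') as fh:
            fh.write('placeholder')

    os.chdir(project)                       # the project is now the current location
    top_level = os.listdir()                # no argument: list the current location
    in_data = sorted(os.listdir('./data'))  # or list a folder given by its path

    os.chdir(start)  # step back out before the temporary folder is deleted
```

listdir makes no promise about the order of the names, which is why the result is sorted before use.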

Relative Home "../"

Relative home is the parent folder hosting your current folder (i.e., the folder that contains your working folder). A good scenario that you may find yourself in: you are inside one folder and want to go up to the parent folder, either to get a file or go back after an os.chdir. This can happen easily using "../" as your path to go back, or add another folder name if you wish to change folders on the same level (think data cleaning).

Working Example

fig. 4

So we are here, we have 3 folders and we want to do some cleaning on their contents. The function is the same and will follow the same steps, so we only need to change the path and then we are good to go.

We start with os.chdir('./first') or os.chdir('first') directly. This will change the working directory to 'first'. After our business is done with 'first', we can either go back to our original starting point and then switch to 'second', or switch to 'second' directly, as follows: os.chdir('../') >> os.chdir('second') or os.chdir('../second') directly, and so on with the other folders.
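
The same walk can be sketched as a loop. The folder name 'third' and the clean_current_folder function are assumptions of mine (the real cleaning step would go there), and the folders are created inside a temporary directory so the snippet can run anywhere.

```python
import os
import tempfile


def clean_current_folder():
    """Placeholder for the real cleaning step: just count the files here."""
    return len(os.listdir('.'))


start = os.getcwd()
results = {}

with tempfile.TemporaryDirectory() as root:
    # recreate the layout: one parent folder holding first, second and third
    for i, folder in enumerate(('first', 'second', 'third'), start=1):
        os.makedirs(os.path.join(root, folder))
        for n in range(i):
            open(os.path.join(root, folder, f'file{n}.txt'), 'w').close()

    os.chdir(root)
    os.chdir('./first')
    results['first'] = clean_current_folder()
    for folder in ('second', 'third'):
        os.chdir('../' + folder)  # up to the parent, then down into the sibling
        results[folder] = clean_current_folder()

    os.chdir(start)  # leave before the temporary folder is removed
```

The function never receives a path: it always works on the current folder, and only the chdir calls decide which folder that is.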

Absolute Path '/'

We end our discussion with the absolute path. It's not that common to deal with if you're only doing data science stuff, but it's still a good small detail to know that you may need someday. The absolute path starts from the original root that everything falls under. Usually you use this kind of path if you are working on a system level, where you need to access files scattered around the whole computer.

fig. 5

This is the absolute path to the folder test from the working example, notice how the slashes are the other way around on Windows. You can get the absolute path of a file or folder in Python using os.path.abspath(path_string).
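
A quick sketch of what abspath does with a relative path. The path is the date.txt example reused as a placeholder; the file does not need to exist for this to work.

```python
import os

relative = os.path.join('.', 'data', 'date.txt')
absolute = os.path.abspath(relative)  # the file does not need to exist

# abspath is the current folder joined with the relative path, then tidied up
same = absolute == os.path.normpath(os.path.join(os.getcwd(), relative))
```

So the absolute path changes whenever the current folder changes, while the relative string stays the same.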

This is about it folks! Hope you are now willing to utilize the power of the filesystem in your projects and that you find it as useful as I do.