Documenting Python Data Science Code with mindoc

#python #documentation #datascience #productivity

You can write amazingly readable code in python.
Think of the hierarchies of abstraction the same way you would the headings of a report.
mindoc can help you document your code like markdown.

https://minchulkim87.github.io/mindoc/

Write amazingly readable code in python.

This talk was a game-changer to me:

I shared it with my colleague, and it changed how we used the already-amazing Jupyter notebook.

I was already trying to write code as readable to other humans as I thought possible, but the talk gave me a better template to work with.

Basically, the idea is to write code that even a non-dev could understand if they tried.

In pandas, for example, the equivalent way of writing code would be:

(df.pipe(clean_data)
   .pipe(group_data_by_customer_type)
   .pipe(compute_statistics)
   .pipe(generate_visualisation))

Even a non-data-person or a non-coder would recognise that the data is cleaned, then grouped by customer type, and then some statistics are calculated before the visualisation is created.
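To make the chain above concrete, here is a minimal runnable sketch. The function bodies, column names, and sample data are all hypothetical stand-ins, not our actual pipeline, and the visualisation step is left out to keep the example free of plotting dependencies.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "customer_type": [" Retail", "retail ", "Wholesale", "wholesale"],
    "amount": [10.0, 30.0, 100.0, None],
})


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with no amount, then tidy up the customer type labels.
    return (df.dropna(subset=["amount"])
              .assign(customer_type=lambda d: d["customer_type"].str.strip().str.lower()))


def group_data_by_customer_type(df: pd.DataFrame):
    # Returns a GroupBy object, which also supports .pipe().
    return df.groupby("customer_type")


def compute_statistics(grouped) -> pd.DataFrame:
    # Mean and total amount per customer type, back as a flat DataFrame.
    return grouped["amount"].agg(["mean", "sum"]).reset_index()


stats = (df.pipe(clean_data)
           .pipe(group_data_by_customer_type)
           .pipe(compute_statistics))

print(stats)
```

Each step takes the output of the previous one as its first argument, which is what lets the chain read from top to bottom like a list of instructions.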

With this newfound excitement over just how clean our code started to look, we quickly started refactoring our data pipeline code.

And then came the awkward moment

The great thing is that we were able to abstract away what steps like "clean_data" actually did! If readers were curious about what "cleaning" the data meant in this code, they simply needed to look at the clean_data code, which might look like the following:

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    df = (df.pipe(rename_columns)
            .pipe(remove_special_characters)
            .pipe(correct_date_format)
            .pipe(replace_values))
    return df

Again, the steps taken in the clean_data "phase" were obvious. Again, you could abstract away further details. Great!

Then we hit our first moment of 'um...'. We constantly had decisions to make about exactly what we would abstract away. And to what end? Were we to get down to the lowest level of "atomic" manipulations of the data?

Once you get to over a hundred such functions, how do you order them? In the order they are used? What if they are used multiple times? Alphabetically? But then the code stops "explaining itself".

Our project seemed way too large for this way of coding to scale to our needs. Not to mention that even the most readable code is still too "technical" for managers and other business people, who are unlikely to be programmers or "data people". We needed a way to write "documentation" for both the other wizards and the muggles.

Then came the Aha! moment

I remembered watching another amazing talk (for a javascript audience, but the message applied to any developer):

Then I suggested that we think of organising the code as if we were writing a Word document. The levels of abstraction were equivalent to levels of headings. If we were to write the code in English, what would the title, headings, and subheadings be?

We decided it was a good idea to arrange our code this way, and started to use the comments in python to indicate headings.

The commenting syntax resembles a markdown header.

# This is a comment. Or... is it a header?
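As a rough illustration of reading comments as headings, the sketch below pulls an outline out of a small source string. The rule it uses (an unindented comment line made of one to six `#` characters followed by a space counts as a heading) is my own assumption for this example, and the sample source and the `outline` helper are hypothetical.

```python
from __future__ import annotations

import re

# Hypothetical source file content, organised with heading-style comments.
SOURCE = '''\
# Customer report
## Clean the data
def clean_data(rows):
    # strip whitespace (an indented, ordinary comment, not a heading)
    return [r.strip() for r in rows]
## Compute statistics
def compute_statistics(rows):
    return len(rows)
'''

# Assumed rule: only unindented comments with 1-6 '#' and a space are headings.
HEADING = re.compile(r"^(#{1,6}) (\S.*)$")


def outline(source: str) -> list[tuple[int, str]]:
    """Return (level, title) for every line that looks like a heading."""
    found = []
    for line in source.splitlines():
        match = HEADING.match(line)
        if match:
            found.append((len(match.group(1)), match.group(2).strip()))
    return found


# Print the outline, indented by heading level.
for level, title in outline(SOURCE):
    print("  " * (level - 1) + title)
```

Read on its own, the outline is the "table of contents" of the code: the title, then the phases underneath it.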

Surely this made sense to a lot of other people. Surely, there was some standard practice around documenting data analysis, data engineering, and data science projects with code.

We took to the python documentation tools.

We looked enviously at Sphinx. But while it seemed great for documenting packages, it didn't meet the needs of a data project.

First of all, "documentation" for python modules uses docstrings heavily. These are useful for creating a manual of sorts for the user of the module: what methods are available, what arguments a function accepts, and so on.

But the point of this new way of coding was that the code was "self-describing". Did I really need to start writing like this?:

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    This function cleans the dataframe.

    First the columns are renamed.
    Then the special characters are removed.
    Then the date format is corrected.
    Then some values are replaced with others.

    Args:
        df (pd.DataFrame): Pandas DataFrame to be cleaned

    Returns:
        pd.DataFrame: The cleaned Pandas DataFrame
    """
    df = (df.pipe(rename_columns)
            .pipe(remove_special_characters)
            .pipe(correct_date_format)
            .pipe(replace_values))
    return df

Surely, the names of the functions and the python type hints already fully describe the function. The "documentation" we needed was documentation of the process rather than of the tool. As I wrote above, documenting a tool is absolutely required. Docstrings like this for the processes in a data project that others need to use? Not so much. The process should still be documented, though.

There was Pycco, which we drew some ideas from, but the presentation made it feel like it was annotating the code rather than documenting the code. Again, there is a place for annotation, but it didn't quite meet our needs.

There was an added limitation of being a public servant. We did not have the luxury of using "pip install", especially when it came to upgrading JupyterLab or extending it. Even if we discovered a solution, if it required packages we did not have access to, we would not be able to use it that easily.

So. I decided to write a simple documentation tool.

Introducing mindoc

It is a simple tool that converts a .py file into an HTML document.

It is basically a flipped version of markdown.

In markdown, you write a document and put code between fenced triplets of backticks.
With mindoc, you write code and put the documentation between fenced triplets of quotes.

The use of the headers allows you to organise your code in hierarchical levels, and "hide" away whichever level of granularity you wish to make the logic of the code more readable.
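To give a feel for the "flipped markdown" idea, here is a toy sketch, and it is not mindoc's actual implementation. It assumes documentation blocks are delimited by lines containing only a triple quote, and it emits Markdown rather than HTML to stay short. The `flip_to_markdown` helper and the sample source are hypothetical.

```python
FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence


def flip_to_markdown(source: str) -> str:
    """Triple-quoted blocks become prose; everything else becomes fenced code."""
    out, code, in_doc = [], [], False

    def flush_code():
        # Trim blank lines at both ends, then emit pending code as one fenced block.
        while code and not code[0].strip():
            code.pop(0)
        while code and not code[-1].strip():
            code.pop()
        if code:
            out.extend([FENCE + "python", *code, FENCE])
        code.clear()

    for line in source.splitlines():
        if line.strip() == '"""':
            # A lone triple quote toggles between documentation and code.
            if not in_doc:
                flush_code()
            in_doc = not in_doc
        elif in_doc:
            out.append(line)
        else:
            code.append(line)
    flush_code()
    return "\n".join(out)


# Hypothetical input: documentation first, then the code it describes.
SAMPLE = '''\
"""
# Clean the data
Rename the columns, then fix the dates.
"""
def clean_data(rows):
    return rows
'''

print(flip_to_markdown(SAMPLE))
```

The output is an ordinary Markdown document with the heading and prose on top and the function in a code fence below, which is the inversion described above.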

It automatically generates the table of contents for you too.

Go check it out on GitHub!

The page that you see from the link is the .py file (converted into HTML).