DEV Community: PaulOpu

Nesting Columns like a Pro: A Guide to Mastering Nested Structs in PySpark

PaulOpu — Sat, 14 Jan 2023 15:10:28 +0000

Task

Are you tired of struggling with adding nested columns to already nested structs in PySpark DataFrames? Well, you're in luck because in this blog post, we'll be diving into the solution to this common problem. With the help of PySpark's withColumn and struct functions, we'll be showing you how to easily add nested columns to your DataFrames, but only if the column doesn't already exist. So, whether you're a beginner or an experienced PySpark developer, this post is for you. Keep reading to learn how to master nested columns in PySpark DataFrames!

Problem

In PySpark, adding columns to a DataFrame is not as simple as selecting them in a SELECT statement. When you want to add a new column that is nested within other structs, you have to rebuild all of the structs that contain that column. This can be a bit tricky because you also have to include all of the other fields that were already in those structs; otherwise they will be lost. It's like a puzzle: you have to keep all the pieces in the right place.
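To make the puzzle concrete, here is a minimal plain-Python analogy that uses nested dicts instead of Spark structs. The helper add_nested_key and the sample field names are made up for illustration and are not part of PySpark. The only point is that every level on the path gets rebuilt while its sibling fields are copied over.

```python
from typing import Any, Dict, List


def add_nested_key(record: Dict[str, Any], path: List[str], value: Any) -> Dict[str, Any]:
    """Return a rebuilt copy of `record` with `value` at `path`, unless it already exists."""
    head, *rest = path
    if not rest:
        # Last path element: add the key only if it is missing.
        return record if head in record else {**record, head: value}
    if not isinstance(record.get(head), dict):
        # Assumption: a missing or non-dict intermediate level is left untouched.
        return record
    # Rebuild this level: copy every sibling, replace only the child on the path.
    return {**record, head: add_nested_key(record[head], rest, value)}


row = {"user": {"name": "Ada", "address": {"city": "Berlin"}}}
updated = add_nested_key(row, ["user", "address", "zip"], None)
print(updated)
```

Forgetting the `**record` copy at any level would silently drop the sibling fields, which is exactly the trap described above.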

Solution

from typing import Any, List
from pyspark.sql import functions, Column, DataFrame

def add_column_to_struct(dataTypes, path: List[str], current_struct_name: str, value: Any) -> Column:
    # path[0] references the current struct (possibly dotted, e.g. "a.b");
    # path[1] is the next field on the way to the new column.
    current_column_level = path[0]
    next_column_level = path[1]

    should_place_column = len(path) == 2
    if should_place_column:
        # Deepest level reached: append the new field only if it is missing.
        new_column = []
        if next_column_level not in dataTypes.names:
            new_column = [functions.lit(value).alias(next_column_level)]
        return functions.struct(
            *([functions.col(current_column_level)[c].alias(c) for c in dataTypes.names] + new_column)
        ).alias(current_struct_name)

    # Rebuild this struct: copy every field and recurse into the one on the path.
    current_struct_fields = []
    for index, column in enumerate(dataTypes.names):

        is_column_in_next_level = column == next_column_level
        if is_column_in_next_level:
            grouped_nested_path = [f"{current_column_level}.{next_column_level}", *path[2:]]

            current_struct = add_column_to_struct(
                dataTypes.fields[index].dataType,
                grouped_nested_path,
                next_column_level,
                value
            )
        else:
            current_struct = functions.col(current_column_level)[column].alias(column)

        current_struct_fields.append(current_struct)

    return functions.struct(*current_struct_fields).alias(current_struct_name)

def create_column_if_not_exist(df: DataFrame, path: List[str], value: Any) -> DataFrame:

    current_column_level = path[0]
    not_nested = len(path) == 1
    if not_nested:
        if current_column_level in df.columns:
            return df
        return df.withColumn(current_column_level, functions.lit(value))

    return df.withColumn(
        current_column_level,
        add_column_to_struct(
            df.schema[current_column_level].dataType,
            path,
            current_column_level,
            value
        )
    )

The code above defines two functions, create_column_if_not_exist and add_column_to_struct, that add a new column to a nested struct column in a PySpark DataFrame only if the column doesn't already exist.

The create_column_if_not_exist function takes in three parameters:

df: a PySpark DataFrame where the column will be added
path: a list of strings describing the path to the new column, ending with the name of the new column itself. For example, to add a field called zip to a struct column called address that is nested within another struct column called user, the path would be ["user", "address", "zip"]
value: the value that will be assigned to the new column

The function first checks if the path has only one element, meaning that the column is not nested within any other structs. In this case, it checks if the column already exists in the DataFrame. If it does, the function returns the DataFrame unmodified; otherwise, it adds the new column using the withColumn method and assigns it the value passed in.

If the path has more than one element, the column is nested within other structs. The function then calls add_column_to_struct, passing it the dataType of the top-level struct along with the other parameters.

The add_column_to_struct function takes four parameters:

dataTypes: the dataType (a StructType) of the struct at the current nesting level
path: the remaining path to the new column. In recursive calls, the first element is the dotted reference to the current struct (for example "user.address")
current_struct_name: the name given to the rebuilt struct at this level
value: the value that will be assigned to the new column

It starts by extracting the first and second elements of the path. These determine whether the new column should be added at this level of nesting or at a deeper level.

If the path has only two elements, the function checks if the new column already exists in the struct. If it does not, it creates the new column and assigns it the value passed in. It then creates a new struct column containing the existing columns and, if it was created, the new column.

If the path has more than two elements, the function iterates through all the columns of the struct and checks whether each one is the next level on the path. If it is, the function calls add_column_to_struct again, passing the dataType of the next level of struct and updating the path and struct name accordingly. Otherwise, it simply reuses the existing column.

Once all the columns have been processed, the function creates a new struct column containing all the columns and returns it.
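To see how the path changes from call to call, the following small sketch mimics only the path handling of add_column_to_struct, without Spark. The helper trace_recursion_paths and the field names are hypothetical, and the sketch assumes that every intermediate struct on the path exists.

```python
from typing import List


def trace_recursion_paths(path: List[str]) -> List[List[str]]:
    """List the `path` argument seen by each successive call."""
    calls = [list(path)]
    # While more than two elements remain, the function recurses and merges
    # the first two elements into one dotted column reference.
    while len(path) > 2:
        path = [f"{path[0]}.{path[1]}", *path[2:]]
        calls.append(list(path))
    return calls


for call_path in trace_recursion_paths(["user", "address", "geo", "zip"]):
    print(call_path)
```

The last call always has exactly two elements: the dotted reference to the innermost struct and the name of the column to add.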

Conclusion

In conclusion, working with nested structs in PySpark DataFrames can be like a game of Tetris: it takes a bit of skill to manoeuvre the columns into the right place, but with the right tools, it's totally doable. And let's be real, there's nothing more satisfying than fitting that last piece into the perfect spot. The provided code is like the cheat code to this game. Hopefully it gives you a clear and clean way to add new columns to nested structs without losing any existing data. So, whether you're a PySpark pro or a beginner, this solution will have you adding nested columns like a boss in no time.

ElasticSearch: Switch Index like a Pro

PaulOpu — Sun, 23 Oct 2022 14:30:08 +0000

Introduction

One of the most used databases for searching large documents is Elastic Search. It has specific mechanisms in place to speed up the search significantly. Nevertheless, the configuration must be chosen wisely; otherwise, your search takes much longer than expected.

Background (Elastic Search Concepts)

Node

The data is distributed across multiple computational instances, called nodes, to parallelize the search and make it quicker. Each instance therefore holds only a fraction of the entire database.

Primary Shards

The data is organized in indices, which can be compared with tables in SQL databases. For example, if you are a sports website, you have indices for players, teams, or tournaments. The data of each index is split into primary shards so that one shard can stay on one node.
It follows that the number of primary shards should not be higher than the number of nodes; otherwise, two separate shards end up on one node, which is less efficient.

Replica Shards

Suppose we have a player index with 5 shards in a cluster with 5 nodes, where shard 1 is on node 1, and so on. Assume that node 1 is busy searching in another index. If you now want to search for players, shard 1 (which is on the busy node 1) might respond with a delay.
To overcome this, we introduce replica shards: each shard can have 0 to n replicas, which contain exactly the same data as the corresponding shard. A replica should lie on a different node than its primary shard. In our example, if replica 1 of shard 1 is on node 2, we can address node 2 instead and skip the busy node 1.

Problem

We observed that the CPU utilization of our nodes increased drastically during peak times and that Elastic Search was not able to handle all search requests. The time to resolve a search request is made up of distributing the request to each node, the actual search in the shards, and collecting the results to aggregate them.
The actual search cannot be the problem, as each primary shard contains just a few hundred MBs and the desired size should be at least a few GBs. Therefore, we assumed that the splitting and aggregation might be the bigger overhead for the system. As a result, we needed to reduce the number of nodes and shards to see if our hypothesis was correct.

Unfortunately, it is not possible to change the number of primary shards of an existing index in place.

Solution

Current Setup

6 Nodes
6 indices with < 1 GB data
- 5 primary shards
- 1 replica per shard
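A quick back-of-the-envelope calculation shows how many shard copies this setup produces. The sketch below uses two hypothetical helper functions and assumes that all six indices share the same settings and that the copies are spread evenly across the nodes.

```python
from math import ceil


def shard_copies(indices: int, primaries: int, replicas: int) -> int:
    """Total shard copies: every primary plus its replicas, for every index."""
    return indices * primaries * (1 + replicas)


def copies_per_node(total_copies: int, nodes: int) -> int:
    """Upper bound on copies per node if they are spread evenly."""
    return ceil(total_copies / nodes)


current = shard_copies(indices=6, primaries=5, replicas=1)
print(current, copies_per_node(current, nodes=6))  # 60 10
```

Every search on an index has to touch one copy of each of its primary shards, so with less than 1 GB of data per index, most of that fan-out is coordination overhead rather than search work.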

Requirements

Changing the number of shards is not just flipping a switch. Therefore, we need to make sure that the production system stays responsive during the change.

Besides zero downtime, there is a second requirement: the amount of data will increase in the future, so the cluster should be able to scale up again.

Implementation

There are several ways to deal with this situation. In the end, we decided to use the concept of aliases in Elastic Search.

Alias

An alias is a pointer to an existing index. For example, the alias member points toward the player index. Now any request to the member alias is redirected to the player index. Creating and deleting an alias is very easy compared to an index. You cannot create an alias with the same name as an existing index. Nevertheless, creating an alias and deleting an index can happen in the same request. We make use of this behavior.

Process

Create a new index with the name new_index and the desired number of primary shards and replicas.

PUT /new_index
{
  "settings": {
    "index": {
      "number_of_shards": 3,  
      "number_of_replicas": 2 
    }
  }
}

Copy the data from old_index to new_index. Use the reindex API call from Elastic Search.

POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_index"
  }
}

Redirection and Cleaning up

a. Delete old_index and create an alias with the name old_index that points to new_index

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "new_index",
        "alias": "old_index"
      }
    },
    {
      "remove_index": {
        "index": "old_index"
      }
    }
  ]
}
```

b. If you already have an alias, redirect it to new_index

```json
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "new_index",
        "alias": "old_index"
      }
    },
    {
      "remove": {
        "index": "other_index",
        "alias": "old_index"
      }
    },
    {
      "remove_index": {
        "index": "old_index"
      }
    }
  ]
}
```
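If you drive the switch from a script, the two request bodies above can be generated by one small function. This is only a sketch: build_switch_actions and its parameter names are my own invention, and it only builds the JSON body without sending anything to the cluster.

```python
import json
from typing import Any, Dict, List, Optional


def build_switch_actions(
    alias: str,
    new_index: str,
    index_to_delete: str,
    previous_target: Optional[str] = None,
) -> Dict[str, List[Dict[str, Any]]]:
    """Build the body for POST /_aliases that switches `alias` to `new_index`."""
    actions: List[Dict[str, Any]] = [{"add": {"index": new_index, "alias": alias}}]
    if previous_target is not None:
        # Case b: the alias already exists, so detach it from its old target.
        actions.append({"remove": {"index": previous_target, "alias": alias}})
    # Drop the index that is no longer needed in the same request.
    actions.append({"remove_index": {"index": index_to_delete}})
    return {"actions": actions}


# Case a: replace the index old_index with an alias of the same name.
print(json.dumps(build_switch_actions("old_index", "new_index", "old_index"), indent=2))
```

Passing previous_target="other_index" reproduces the body of case b. Because all actions travel in one request, there is no moment in which the name old_index points nowhere.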

Flags

To make the approach testable you can add flags to your script making each part of the process optional. Thereby, you can create a new temporary index and switch from that one. That doesn’t affect your production environment, but you can test each step of the process. Here is an example:

Add a new temporary index and reindex from an existing one, but don’t switch (1. and 2. step of the process)
Execute all steps on the temporary index. You can optionally run a load test on that index to see if there was no downtime while switching the indices.
See if everything worked as expected and if your service experienced any downtime
Delete the new index, as it was just for testing (just step 3)
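One way to wire up such flags is with argparse. The flag names below (--create, --reindex, --switch) are hypothetical and the sketch only decides which steps would run. The actual calls to the cluster are left out.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Switch an index step by step.")
    # Each flag enables one step of the process; the flag names are hypothetical.
    parser.add_argument("--create", action="store_true", help="step 1: create the new index")
    parser.add_argument("--reindex", action="store_true", help="step 2: copy the data")
    parser.add_argument("--switch", action="store_true", help="step 3: move the alias, delete the old index")
    return parser


def selected_steps(argv: Optional[List[str]] = None) -> List[str]:
    """Return the enabled steps in the order they would run."""
    args = build_parser().parse_args(argv)
    order = ["create", "reindex", "switch"]
    return [step for step in order if getattr(args, step)]


print(selected_steps(["--create", "--reindex"]))  # ['create', 'reindex']
```

Running the script with only --create and --reindex corresponds to the first test scenario above: the temporary index is filled, but nothing is switched yet.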

Switch the index again

You might ask yourself: if I want to change the number of shards again how can I do that? No problem, just use step 3.b, which takes into account that you already created an alias.

Conclusion

I showed you how to change the settings of an index without interrupting your service. You create a new index, copy all data, and create an alias that points to your new index while deleting the old one.

You have no downtime, as switching the index and deleting the old index happen at the same moment.
Testing is no problem, as you can enable and disable each step and create temporary indices.
The process can be applied multiple times, as it also works with existing aliases.
Use the cat endpoint to monitor the process in real-time

Don’t be afraid to scale down

We reduced the number of shards to 1 with 2 replicas and 3 nodes (before 5 primary shards with 1 replica and 5 nodes). The CPU utilization was reduced by 75% and the costs by 20%.

I hope I could help you to get the most out of your Elastic Search Cluster. If you have any questions let me know and I’m happy to answer them. I would appreciate any feedback that you have, as I like to learn every day. Thanks a lot!
