DEV Community

Discussion on: Welcome Thread - v62

Saikat Bhattacharjee

I have a PySpark project with the DataFrame shown below:

   +----------+--------------------------------+
   | Index    |           flagArray            |
   +----------+--------------------------------+
   |    1     | ['A','S','A','E','Z','S','S']  |
   +----------+--------------------------------+
   |    2     | ['A','Z','Z','E','Z','S','S']  |
   +----------+--------------------------------+

I want to replace each array element with its corresponding numeric value:

     A - 0
     F - 1
     S - 2
     E - 3
     Z - 4

So my output DataFrame should look like this:

   +----------+--------------------------------+--------------------------------+
   | Index    |           flagArray            |           finalArray           |
   +----------+--------------------------------+--------------------------------+
   |    1     | ['A','S','A','E','Z','S','S']  | [0, 2, 0, 3, 4, 2, 2]          |
   +----------+--------------------------------+--------------------------------+
   |    2     | ['A','Z','Z','E','Z','S','S']  | [0, 4, 4, 3, 4, 2, 2]          |
   +----------+--------------------------------+--------------------------------+

I have written a UDF in PySpark that does this with a chain of if/else statements. Is there a better way to handle this?
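
For illustration, one possible alternative to an if/else chain is a dictionary lookup. This is only a sketch under assumptions: the names `FLAG_TO_CODE`, `encode_flags`, and `add_final_array` are hypothetical, the column names `flagArray` and `finalArray` come from the tables above, and treating unknown letters as null is an assumed choice rather than a stated requirement.

```python
# Mapping taken from the letter/number list in the post.
FLAG_TO_CODE = {"A": 0, "F": 1, "S": 2, "E": 3, "Z": 4}


def encode_flags(flags):
    """Return the numeric codes for a list of flag letters (hypothetical helper).

    Assumption: letters missing from the mapping become None,
    and a None (null) input array stays None.
    """
    if flags is None:
        return None
    return [FLAG_TO_CODE.get(flag) for flag in flags]


def add_final_array(df):
    """Add a finalArray column to df by applying encode_flags as a UDF.

    pyspark is imported inside the function so that encode_flags
    can be tried out on plain Python lists without a Spark install.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    encode_udf = udf(encode_flags, ArrayType(IntegerType()))
    return df.withColumn("finalArray", encode_udf(df["flagArray"]))
```

Calling `add_final_array(df)` on the DataFrame above should add the `finalArray` column. Keeping the mapping in one dictionary means a new flag letter only needs one new entry instead of another if/else branch.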