
XML with Python


Note: This isn't a perfect solution. If you search the internet, you will find other approaches, mostly using xml.etree.ElementTree (see this post, this post, and this post for examples). I just wanted something different.

Pandas Read XML

I made a tool to help read XML files as pandas dataframes.

See example in Google Colab here

Isn't it annoying working with data in XML format? I think so. Take a look at this simple example.

<first-tag>
    <not-interested>
        blah blah
    </not-interested>
    <second-tag>
        <the-tag-you-want-as-root>
            <row>
                <columnA>
                    The data that you want
                </columnA>
                <columnB>
                    More data that you want
                </columnB>
            </row>
            <row>
                <columnA>
                    Yet more data that you want
                </columnA>
                <columnB>
                    Eh, get this data too
                </columnB>
            </row>
        </the-tag-you-want-as-root>
    </second-tag>
    <another-irrelevant-tag>
        some other info that you do not want
    </another-irrelevant-tag>
</first-tag>

I wish there were a simple df = pd.read_xml('some_file.xml'), like the pd.read_csv() and pd.read_json() that we all love.

I can't solve this fully with my time and skills, but perhaps this package will help get you started, or inspire a more skilled developer to build something that can be integrated into the pandas package.

Install

pip install pandas_read_xml

Import package

import pandas_read_xml as pdx

Read XML as pandas dataframe

You will need to identify the path to the "root" tag in the XML from which you want to extract the data.

df = pdx.read_xml("test.xml", ['first-tag', 'second-tag', 'the-tag-you-want-as-root'])

Real example

Here is a real example taken from USPTO. It is one of their "daily diff" files for the US trademark applications data.

test_zip_path = 'https://bulkdata.uspto.gov/data/trademark/dailyxml/applications/apc200219.zip'
root_key_list = ['trademark-applications-daily', 'application-information', 'file-segments', 'action-keys']

df = pdx.read_xml(test_zip_path, root_key_list)
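Once it loads, the result is an ordinary pandas dataframe, so the usual inspection tools apply. A quick sketch:

df.shape
df.columns
df.head()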

Auto Flatten

The really cumbersome part of working with XML data (or JSON data) is that they do not represent a single table. Rather, they are (nested) tree representations of what were probably relational databases. Often, these XML data are exported without a clearly documented schema, and more often still, with no clear way of navigating the data.

What is even more annoying is that, in comparison to JSON, the data structures are not consistent across XML files from the same schema. Some files may have multiples of the same tag, resulting in list-type data, while other files of the same schema will only have one of that tag, resulting in non-list-type data. At other times the tag is not present at all, which means the resulting "column" is not just null, it is not even a column. This makes it difficult to "flatten".

Pandas already has some tools to help "explode" (items in a list become separate rows) and "normalise" (key-value pairs in one column become separate columns of data), but they fail when there are these mixed types within the same tags (columns). Besides, "flattening" (combining exploding and normalising) duplicates other data in the dataframe as well, leading to an explosion of memory requirements.
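To make that concrete, here is a toy sketch (not from the package) of those two pandas building blocks on a column that is consistently list-typed and one that is consistently dict-typed; the trouble described above starts when a single column mixes lists, dicts, and plain values across files or rows.

import pandas as pd

# Toy frame mimicking parsed XML: repeated tags become lists, nested tags become dicts.
df = pd.DataFrame({
    "id": [1, 2],
    "codes": [["a", "b"], ["c"]],
    "detail": [{"x": 1}, {"x": 2}],
})

# "explode": each item in a list becomes its own row.
exploded = df.explode("codes").reset_index(drop=True)

# "normalise": each key in a dict becomes its own column.
normalised = df.join(
    pd.json_normalize(df["detail"].tolist()).add_prefix("detail|")
).drop(columns=["detail"])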

So, in this package, I have also attempted to make a few different tools to separate out the relational tables.

A quick example from the same dataframe from USPTO above:

from pandas_read_xml import auto_separate_tables

key_columns = ['action-key', 'case-file|serial-number']

data = df.pipe(auto_separate_tables, key_columns)

will separate out what the auto_separate_tables function guesses to be separate tables. The resulting data is a dictionary where the keys are the "table names" and the corresponding values are the pandas dataframes. Each of the separate tables will have the key_columns as common columns.

You can see the list of separated tables by using python dictionary methods.

data.keys()

And then view a table.

data['classifications']
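For example, a quick way to get an overview of what came out (the exact table names depend on the file):

for table_name, table in data.items():
    print(table_name, table.shape)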

There are also other "smaller" functions that do parts of the job:

  • flatten(df)
  • auto_flatten(df, key_columns)
  • fully_flatten(df, key_columns)

Even more if you look through the code.
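As a rough sketch of how these smaller functions might be used on the same dataframe (assuming they are exposed at the package top level, like auto_separate_tables above):

from pandas_read_xml import flatten, auto_flatten, fully_flatten

df_once = flatten(df)                       # one pass over every column
df_auto = auto_flatten(df, key_columns)     # repeat until some non-key column needs no action
# df_full = fully_flatten(df, key_columns)  # repeat until nothing nested remains; use with caution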

A quick note on how this works

Converting XML to pandas dataframe

The actual "reading" part of the XML uses another library called xmltodict. Some amazing person by the name Martin Blech made it.

My tool simply wraps around it and navigates to the root tag and then converts the dict into a pandas dataframe.

import pandas as pd
import xmltodict


def get_to_root_in_dict(the_dict: dict, root_key_list: list) -> dict:
    if len(root_key_list) > 1:
        return get_to_root_in_dict(the_dict[root_key_list[0]], root_key_list[1:])
    else:
        return the_dict[root_key_list[0]]


def read_xml_as_dataframe(xml: str, root_key_list: list, transpose=False) -> pd.DataFrame:
    if transpose:
        return pd.DataFrame(get_to_root_in_dict(xmltodict.parse(xml), root_key_list)).T
    else:
        return pd.DataFrame(get_to_root_in_dict(xmltodict.parse(xml), root_key_list))
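For instance, a small sketch of that navigation step using the toy XML from the top of the post (assuming it is saved as test.xml):

import pandas as pd
import xmltodict

with open("test.xml") as xf:
    parsed = xmltodict.parse(xf.read())

root = get_to_root_in_dict(parsed, ["first-tag", "second-tag", "the-tag-you-want-as-root"])
# root is now a dict like {"row": [..., ...]}, so pd.DataFrame(root) gives one row
# per <row> element, ready for the flattening steps described below.
df = pd.DataFrame(root)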

Reading XML files

And then there are a whole bunch of helper functions to determine whether the path provided is a URL or a local path, and whether it points to an XML file, to XML files inside a ZIP file, or to XML files inside ZIP files within ZIP files. Admittedly, the IO part of the package is something that I am not very comfortable with... But I've tried.

import io
import urllib.parse
from zipfile import ZipFile

import requests


def get_zip_file_from_path(zip_path: str) -> ZipFile:
    return ZipFile(zip_path, 'r')


def get_zip_file_from_url(zip_url: str) -> ZipFile:
    return ZipFile(io.BytesIO(requests.get(zip_url).content))


def get_files_list_in_zip(zip_file: ZipFile, file_extension: str) -> list:
    return [x for x in zip_file.namelist() if x.endswith(file_extension)]


def read_xml_from_path(path: str) -> str:
    with open(path, 'r') as xf:
        return xf.read()


def read_xml_from_url(url: str) -> str:
    return requests.get(url).text


def read_xml_in_zip(zip_file: ZipFile, xml_file_name: str) -> str:
    with zip_file.open(xml_file_name) as xf:
        return xf.read()


def get_zip_in_zip(zip_file: ZipFile, zip_file_name: str) -> ZipFile:
    return ZipFile(io.BytesIO(zip_file.read(zip_file_name)))


def read_xml_files_in_zip_as_dataframe(zip_file: ZipFile, root_key_list: list, transpose=False) -> pd.DataFrame:
    return pd.concat([read_xml_as_dataframe(read_xml_in_zip(zip_file, xml_file), root_key_list, transpose=transpose)
                      for xml_file in get_files_list_in_zip(zip_file, '.xml')],
                     sort=True, join='outer', ignore_index=True)


def read_xml_files_in_double_zip_as_dataframe(zip_file: ZipFile, root_key_list: list, transpose=False) -> pd.DataFrame:
    return pd.concat([
        read_xml_files_in_zip_as_dataframe(
            get_zip_in_zip(zip_file, sub_zip_file),
            root_key_list,
            transpose=transpose)
        for sub_zip_file in get_files_list_in_zip(zip_file, '.zip')
    ])


def read_xml(path: str, root_key_list: list, transpose=False) -> pd.DataFrame:
    if urllib.parse.urlparse(path).scheme in ['http', 'https']:
        if path.endswith('.xml'):
            return read_xml_as_dataframe(read_xml_from_url(path), root_key_list, transpose=transpose)
        elif path.endswith('.zip'):
            with get_zip_file_from_url(path) as zf:
                if len(get_files_list_in_zip(zf, '.xml')) > 0:
                    return read_xml_files_in_zip_as_dataframe(zf, root_key_list, transpose=transpose)
                elif len(get_files_list_in_zip(zf, '.zip')) > 0:
                    return read_xml_files_in_double_zip_as_dataframe(zf, root_key_list, transpose=transpose)
    else:
        if path.endswith('.xml'):
            return read_xml_as_dataframe(read_xml_from_path(path), root_key_list, transpose=transpose)
        elif path.endswith('.zip'):
            with get_zip_file_from_path(path) as zf:
                if len(get_files_list_in_zip(zf, '.xml')) > 0:
                    return read_xml_files_in_zip_as_dataframe(zf, root_key_list, transpose=transpose)
                elif len(get_files_list_in_zip(zf, '.zip')) > 0:
                    return read_xml_files_in_double_zip_as_dataframe(zf, root_key_list, transpose=transpose)

Flattening

The real fun part is the flattening (some of this applies to dataframes read from JSON as well, but XML data are usually worse because data types are not consistent across files of the same schema). I have not cracked this yet. It seems that, without knowing the schema, it is hard to automatically flatten tree structures without blowing out the memory.

There were a number of obstacles in separating the "relational" tables out without knowing the schema beforehand.

My approach was to determine which "flattening" action should be applied, based on the data types contained within each column, and then apply that flattening method to that column.

The low level flattening methods

import collections

from pandas import json_normalize  # in older pandas: from pandas.io.json import json_normalize


def do_nothing(df: pd.DataFrame, column: str) -> pd.DataFrame:
    return df


def explode(df: pd.DataFrame, column: str) -> pd.DataFrame:
    return (df
            .explode(column)
            .reset_index(drop=True))

def normalise(df: pd.DataFrame, column: str) -> pd.DataFrame:
    return (df
            .reset_index(drop=True)
            .join(json_normalize(df[column], sep='|', max_level=0)
                  .add_prefix(column+'|'))
            .drop(columns=[column]))


def mixed_explode(df: pd.DataFrame, column: str) -> pd.DataFrame:
    explodeable = df[column].apply(lambda x: type(x)==list)
    return pd.concat([df.loc[explodeable, :].pipe(explode, column), df.loc[~explodeable, :]]).reset_index(drop=True)


def mixed_normalise(df: pd.DataFrame, column: str) -> pd.DataFrame:
    normalisable = df[column].apply(lambda x: (type(x)==dict) | (type(x)==collections.OrderedDict))
    return pd.concat([df.loc[normalisable, :].pipe(normalise, column),
                      df.loc[~normalisable, :]],
                     sort=True,
                     join='outer').drop(columns=[column]).reset_index(drop=True)
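As a toy illustration (not from the package docs), mixed_explode leaves the scalar rows alone and only explodes the list rows:

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "codes": [["a", "b"], "c"],   # row 0 holds a list, row 1 a plain value
})

mixed_explode(df, "codes")
# roughly:
#    id codes
# 0   1     a
# 1   1     b
# 2   2     c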

Determining and applying flattening method

from typing import Callable


def determine_flatten_action_for_column(df: pd.DataFrame, column: str) -> Callable:
    column_data_types = df[column].apply(lambda x: type(x))
    if (column_data_types == list).all(skipna=True):
        return explode
    elif (column_data_types == dict).all(skipna=True) or (column_data_types == collections.OrderedDict).all(skipna=True):
        return normalise
    elif (column_data_types == list).any():
        return mixed_explode
    elif (column_data_types == dict).any() or (column_data_types == collections.OrderedDict).any():
        return mixed_normalise
    else:
        return do_nothing


def flatten(df: pd.DataFrame) -> pd.DataFrame:
    for column in df.columns:
        df = determine_flatten_action_for_column(df, column)(df, column)
    return df

Semi-automated flattening

Then I have some "middle" level functions that "fully flatten" (until no lists or dicts remain in the entire dataframe) or "auto_flatten" (until at least one column, other than the key columns, requires no action).

def action_required(df: pd.DataFrame, key_columns: list=[]) -> bool:
    return any(determine_flatten_action_for_column(df, column) != do_nothing
               for column in df.columns if column not in key_columns)


def action_recommended(df: pd.DataFrame, key_columns: list=[]) -> bool:
    return all(determine_flatten_action_for_column(df, column) != do_nothing
               for column in df.columns if column not in key_columns)


def auto_flatten(df: pd.DataFrame, key_columns: list=[]) -> pd.DataFrame:
    while action_recommended(df, key_columns):
        df = flatten(df)
    return df


def fully_flatten(df: pd.DataFrame, key_columns: list=[]) -> pd.DataFrame:
    while action_required(df, key_columns):
        df = flatten(df)
    return df

*Use fully_flatten with caution.

Auto separate tables

These are the highest-level functions that attempt to separate out tables. Basically, if there are columns that do not require further (flattening) action while other columns do, those that require further action can probably be separated out into individual tables (i.e. normalised). This is achieved by first determining which "columns" will become separate "tables", then actually applying the separation process. There are also some ways of handling the table names and columns to help with this process.

def list_separate_tables(df: pd.DataFrame) -> list:
    potential_separate_table = list(set([
        '|'.join(column.split('|')[:-1])
        for column in df.columns
        if determine_flatten_action_for_column(df, column).__name__ != "do_nothing"
    ]))
    potential_separate_table.sort()

    separate_tables_list = potential_separate_table
    for i, potential in enumerate(potential_separate_table):
        if (i < len(potential_separate_table) - 1) and (potential_separate_table[i+1].startswith(potential)):
            separate_tables_list.remove(potential_separate_table[i+1])
    if separate_tables_list == ['']:
        return ['main']+potential_separate_table
    else:
        return [column for column in df.columns
                if determine_flatten_action_for_column(df, column).__name__ != "do_nothing"]


def auto_separate_tables(df: pd.DataFrame, key_columns: list) -> dict:
    data = {}
    main = df.pipe(auto_flatten, key_columns)
    main_table_name = list_separate_tables(main)[0].split('|')[0]

    for table in list_separate_tables(main):
        table_name = table.replace(main_table_name+'|', '', 1)
        table_columns = [
            column for column in main.columns
            if (column.startswith(table)) and (column not in key_columns)
        ]
        data[table_name] = (main[list(key_columns + table_columns)]
                            .dropna(axis='rows', how='all', subset=table_columns)
                            .pipe(fully_flatten, key_columns)
                            .drop_duplicates())
        data[table_name].columns = [column.replace(main_table_name+'|', '', 1) for column in data[table_name].columns]
        data[table_name].columns = [column.replace(table_name+'|', '', 1) for column in data[table_name].columns]
        main = (main[[column for column in main.columns
                      if column not in table_columns]])

    data[main_table_name] = main.drop_duplicates()
    data[main_table_name].columns = [column.replace(main_table_name+'|', '', 1) for column in data[main_table_name].columns]
    return data

The key_columns are what connects the separate tables, as opposed to common keys being generated. Perhaps that approach might be better, but this may fail if there were multiple such XML files that needed to be concatenated later on, as the generated indices would violate uniqueness.
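For example (a sketch only; the second table name here is hypothetical, and the key columns may have been renamed during separation, so check .columns first), two separated tables can be joined back together on those keys:

key_columns = ['action-key', 'case-file|serial-number']

# 'classifications' was shown above; 'correspondent' is a made-up second table name
combined = data['classifications'].merge(
    data['correspondent'],
    on=key_columns,
    how='left',
)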

Another issue is that this auto separation of tables only goes one layer of tables deep. Further structures are simply "fully flattened", which may cause memory blow-ups if the schema is complicated and the dataset gets large. But I think the lower-level functions can be used strategically to achieve a desired outcome.
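For instance, a rough sketch of that kind of strategy (the column prefix here is hypothetical): auto_flatten first, carve off one deeply nested branch together with the key columns, and only fully_flatten that smaller slice.

main = auto_flatten(df, key_columns)

# pick one nested branch to expand fully, keeping the key columns alongside it
branch_columns = [c for c in main.columns if c.startswith('case-file|classifications')]
branch = (main[key_columns + branch_columns]
          .pipe(fully_flatten, key_columns)
          .drop_duplicates())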

Anyway, this is where I am up to. I hope somebody can help develop this further. Feel free to just fork and make it more robust and generalisable.

I'm sure many things will break. I'm sure there are much better ways than this. Please discuss in the comments. I'll see what I can and how I can contribute. (Also, I am not really used to writing packages... so if somebody else would like to take over and make it into an actual project, that'd be cool too.)


Discussion


Have you tried lxml? It's similar to ElementTree.

Short answer is no.

I think when it came up in my initial searches, I wasn't able to use it (I work in a government environment where installing things is okay as long as they don't need compiling or Linux, and lxml seemed to involve both).

But more to the point, they don't "do" anything besides just "read" the data. The thing that lxml offers over the regular xml package seems to be speed. On the otherhand, xmltodict offered an automated way for me to get to a point of simply converting to a pandas dataframe unlike xml or lxml. Both xml and lxml are "lower level" than pandas. If one needs to navigate through the xml trees, then xml and lxml seem like the way to go. But as with pretty much every other file type one typically use with pandas, I wanted a pd.read_file() type solution. Eventually, I need a table structure, not a tree structure. Getting from XML to normalised and separated dataframes was my goal, and the xml and lxml solutions don't get there on their own. I guess I could have used xml as the reading part rather than xmltodict, but xmltodict was nicer in that it was automated.