XML to CSV Conversion Is a Flattening Problem (And It Is Harder Than It Looks)

#beginners #tutorial #webdev #programming

XML is hierarchical. CSV is flat. Converting between them requires making decisions about how to collapse a tree structure into rows and columns. These decisions are not obvious, and different tools make different choices, which is why the "same" conversion can produce wildly different CSV outputs.

The simple case

<users>
  <user>
    <name>Alice</name>
    <email>alice@example.com</email>
    <age>30</age>
  </user>
  <user>
    <name>Bob</name>
    <email>bob@example.com</email>
    <age>25</age>
  </user>
</users>

This maps cleanly to CSV:

name,email,age
Alice,alice@example.com,30
Bob,bob@example.com,25

Each <user> element becomes a row. Each child element becomes a column. Simple. But real-world XML is rarely this clean.

Nested elements

<user>
  <name>Alice</name>
  <address>
    <street>123 Main St</street>
    <city>Springfield</city>
    <state>IL</state>
  </address>
</user>

How do you flatten the nested <address>? Options:

Dot notation columns: address.street, address.city, address.state. This preserves the hierarchy in the column names. Clean, but column names get long with deep nesting.

Concatenation: Combine the address into a single column: "123 Main St, Springfield, IL". Loses structure but produces simpler CSV.

Ignore nesting: Treat each leaf element as a top-level column: street, city, state. Simpler column names but loses the hierarchical context.

Repeated elements

<user>
  <name>Alice</name>
  <phone>555-0001</phone>
  <phone>555-0002</phone>
  <phone>555-0003</phone>
</user>

Alice has three phone numbers. CSV rows have a fixed number of columns. Options:

Multiple columns: phone_1, phone_2, phone_3. But what if some users have 1 phone and others have 10? You need to scan all records first to determine the maximum count.

Delimited within a cell: phone column contains "555-0001; 555-0002; 555-0003". The data is in one cell with a sub-delimiter. Simple but violates first normal form and complicates downstream parsing.

Multiple rows: One row per phone number, duplicating the name for each. This produces a denormalized table but preserves each value in its own cell.

name,phone
Alice,555-0001
Alice,555-0002
Alice,555-0003

Each approach has trade-offs, and the right choice depends on what you will do with the CSV.

Attributes vs. elements

<user id="1" role="admin">
  <name>Alice</name>
</user>

Attributes (id, role) and child elements (name) both carry data. The converter needs to handle both. A common convention: prefix attribute names with @ in the column header: @id, @role, name.

Mixed content

<note>
  Call <b>Alice</b> about the <i>project</i>.
</note>

This element contains text mixed with formatting elements. For CSV, you typically want the plain text: "Call Alice about the project." But some converters will include the markup or produce unexpected results.

The XPath approach

For complex XML, defining the row and column mapping explicitly using XPath expressions is more reliable than automated flattening:

Row selector:    //user
Column mappings:
  name:          ./name/text()
  email:         ./email/text()
  street:        ./address/street/text()
  first_phone:   ./phone[1]/text()

This gives you full control over which elements become columns and how nested/repeated elements are handled.

I built an XML to CSV converter at zovo.one/free-tools/xml-to-csv-converter that handles nested elements (using dot notation), repeated elements (configurable: multiple columns or delimited), and attributes. It auto-detects the record structure and lets you customize the flattening rules before exporting. Paste XML, adjust the mapping if needed, download CSV.

I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.