Bohdan Harabadzhyu

From XML to Word: simplifying conversion with FileConversionLibrary

File conversion can be a tedious task for developers. FileConversionLibrary offers a basic solution for simple cases: it provides essential tools for converting CSV and XML files into formats like PDF, Word, YAML, and JSON. Available on NuGet and GitHub, it is an easy-to-use tool for streamlining data processing workflows. In this article, I will focus specifically on converting XML to Word.


Introduction to XML and Word Formats

XML (eXtensible Markup Language)

XML is a versatile format for storing and exchanging structured data. Published as a World Wide Web Consortium (W3C) Recommendation in 1998, it is widely used thanks to its compatibility and extensibility. Here are some technical details:

Key Features of XML:

  1. Structure

    • XML uses a tree-based hierarchy, starting with a single root element.
    • Documents are composed of elements enclosed in tags.
  2. Flexibility

    • Users can define custom tags, adapting XML for various domains.
    • Optional schemas like DTD (Document Type Definition) or XSD (XML Schema Definition) can enforce document structure.
  3. Portability and Compatibility

    • XML is platform-agnostic and supported across programming languages.
    • It forms the basis for standards like SOAP (Simple Object Access Protocol) and SVG (Scalable Vector Graphics).
  4. Drawbacks

    • XML can be verbose, leading to large file sizes.
    • Parsing XML is typically more computationally expensive than parsing alternatives like JSON.
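
To make the tree structure concrete, here is a minimal sketch that walks a small catalog with LINQ to XML. It assumes a catalog/book layout like the books.xml sample used in this article; the XmlTreeSketch class and its ListBooks method are hypothetical names introduced only for illustration and are not part of FileConversionLibrary.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of FileConversionLibrary.
public static class XmlTreeSketch
{
    // Returns one "id: title" string for every <book> element under the root.
    public static List<string> ListBooks(string xml)
    {
        // XDocument.Parse throws XmlException if the input is not well-formed.
        var document = XDocument.Parse(xml);

        // The document has a single root element; its direct <book> children
        // are the next level of the tree.
        var books = document.Root?.Elements("book") ?? Enumerable.Empty<XElement>();

        return books
            .Select(book =>
            {
                // Attributes and child elements are read by name.
                var id = book.Attribute("id")?.Value ?? "(no id)";
                var title = book.Element("title")?.Value ?? "(no title)";
                return $"{id}: {title}";
            })
            .ToList();
    }
}
```

Calling XmlTreeSketch.ListBooks with the contents of a catalog file yields one entry per book, which shows how custom tags and attributes map onto a simple parent/child hierarchy.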

Word Documents (.docx)

The .docx format is the modern standard for Microsoft Word documents, introduced in Office 2007 as part of the Office Open XML (OOXML) standard. It is designed to offer improved performance, compatibility, and extensibility.

Key Features of .docx:

  1. File Structure

    • .docx files are essentially ZIP archives containing multiple XML files and resources.
    • Components include document.xml (content), styles.xml (styling), and settings.xml (document settings).
  2. Content Representation

    • Text and elements like tables are stored using a structured XML vocabulary.
    • Relationships, such as those linking images or styles, are managed via .rels (relationships) files.
  3. Formatting and Extensibility

    • Supports advanced formatting, including styles, fonts, and embedded media.
    • Allows custom XML parts for metadata or structured content integration.
  4. Advantages Over .doc

    • Smaller file sizes due to ZIP compression.
    • Improved interoperability with other software, thanks to its XML foundation.
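
Since a .docx file is a ZIP archive, its parts can be listed with the standard System.IO.Compression classes. The following is a minimal sketch; DocxInspector and ListParts are hypothetical names used for illustration, and the path passed in is assumed to point to an existing .docx file.

```csharp
using System.Collections.Generic;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper for illustration only; not part of FileConversionLibrary.
public static class DocxInspector
{
    // Returns the full name of every entry (part) stored in the archive.
    public static List<string> ListParts(string docxPath)
    {
        // Open the .docx read-only as an ordinary ZIP archive.
        using var archive = ZipFile.OpenRead(docxPath);

        // Materialize the names before the archive is disposed.
        return archive.Entries.Select(entry => entry.FullName).ToList();
    }
}
```

For a document saved by Word, the result typically includes entries such as [Content_Types].xml, _rels/.rels, and word/document.xml, which match the components described above.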

Why Convert XML to Word?

XML is excellent for structuring data but isn’t user-friendly for non-technical audiences. Converting XML to Word bridges this gap, providing readable and editable documents for various use cases, such as:

  • Report Generation: Transform XML into formatted reports suitable for stakeholders.
  • Documentation: Automatically populate templates for manuals, invoices, or records.
  • Improved Accessibility: Enable non-technical users to view and edit data in a familiar format.

By converting XML to Word, you can unlock the potential of structured data while delivering it in a polished, user-friendly format.

Example XML

To demonstrate how FileConversionLibrary handles XML to Word conversion, we will use the following XML file, books.xml:

<?xml version="1.0"?>
<catalog>
    <book id="bk101">
        <author>Gambardella, Matthew</author>
        <title>XML Developer's Guide</title>
        <genre>Computer</genre>
        <price>44.95</price>
        <publish_date>2000-10-01</publish_date>
        <description>An in-depth look at creating applications
            with XML.</description>
    </book>
    <book id="bk102">
        <author>Ralls, Kim</author>
        <title>Midnight Rain</title>
        <genre>Fantasy</genre>
        <price>5.95</price>
        <publish_date>2000-12-16</publish_date>
        <description>A former architect battles corporate zombies,
            an evil sorceress, and her own childhood to become queen
            of the world.</description>
    </book>
 ...
</catalog>

After converting the XML, the data will be structured and formatted into a Word document. Each book entry will appear as a table or a section, depending on your template and configuration.

Here’s an example of how it might look:

[Image: example of the generated Word document]

DocumentFormat.OpenXml: A Powerful Library for Office Document Manipulation

In the FileConversionLibrary, we leverage DocumentFormat.OpenXml to convert XML data into structured Word documents. DocumentFormat.OpenXml is an open-source library provided by Microsoft that enables developers to edit and process Office Open XML (OOXML) documents programmatically. This includes file formats such as .docx, .xlsx, and .pptx. The library eliminates the need to rely on Microsoft Office's COM Interop, making it lightweight and efficient for server-side or cross-platform applications.

Key Features of DocumentFormat.OpenXml

  1. Standard Compliance
    • The library follows the ISO/IEC 29500 standard and works with the .docx, .xlsx, and .pptx formats introduced in Office 2007.
  2. Platform Independence
    • Runs without Microsoft Office installed on the host machine, which makes it suitable for server-side applications, cloud-based systems, and non-Windows environments.
  3. Efficient File Manipulation
    • Creates, reads, and modifies Office documents with fine-grained control, including styles, metadata, tables, images, and charts.
  4. XML-Based Architecture
    • OOXML documents are XML files packaged in a ZIP archive, so developers can work directly with the XML nodes to customize or extract content.

XmlHelperFile: Simplifying XML Data Extraction

The XmlHelperFile class in FileConversionLibrary is designed to facilitate the extraction of data from XML files. This utility class provides an asynchronous method to read XML files and convert their contents into a structured format, making it easier to process and manipulate the data.

Key Features of XmlHelperFile

  1. Asynchronous Operation.
    The ReadXmlAsync method reads the file with File.ReadAllTextAsync, so the calling thread is not blocked while the file is loaded. This matters in applications that handle large files or many concurrent tasks. Parsing and DataSet population then run synchronously.

  2. XML Parsing.
    Utilizes XmlDocument and XmlNodeReader to parse the XML content, ensuring compatibility with standard XML structures.

  3. DataSet Integration.
    Loads the XML data into a DataSet, which infers tables and columns from the document structure. Only the first inferred table is read, so deeply nested or heterogeneous XML may be only partially captured.

  4. Structured Output.
    Extracts headers and rows from the XML, returning them as a tuple containing an array of headers and a list of rows. Each row is represented as an array of strings, corresponding to the columns in the XML.

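using System;
using System.Collections.Generic;
using System.Data;
using System.IO;
using System.Threading.Tasks;
using System.Xml;

public static class XmlHelperFile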
{
    public static async Task<(string[] Headers, List<string[]> Rows)> ReadXmlAsync(string xmlFilePath)
    {
        // Read the entire content of the XML file asynchronously
        var xmlContent = await File.ReadAllTextAsync(xmlFilePath);

        // Load the XML content into an XmlDocument
        var xmlFile = new XmlDocument();
        xmlFile.LoadXml(xmlContent);

        // Create an XmlNodeReader from the XmlDocument
        using var xmlReader = new XmlNodeReader(xmlFile);

        // Create a DataSet and read the XML data into it
        var dataSet = new DataSet();
        dataSet.ReadXml(xmlReader);

        // Check if the DataSet contains any tables
        if (dataSet.Tables.Count == 0)
        {
            throw new Exception("No tables found in the XML file.");
        }

        // Get the first table from the DataSet
        var table = dataSet.Tables[0];

        // Extract the column names (headers) from the table
        var headers = new string[table.Columns.Count];
        for (var i = 0; i < table.Columns.Count; i++)
        {
            headers[i] = table.Columns[i].ColumnName;
        }

        // Extract the rows from the table
        var rows = new List<string[]>();
        foreach (DataRow row in table.Rows)
        {
            var rowData = new string[table.Columns.Count];
            for (var i = 0; i < table.Columns.Count; i++)
            {
                rowData[i] = row[i].ToString() ?? string.Empty;
            }
            rows.Add(rowData);
        }

        // Return the headers and rows as a tuple
        return (headers, rows);
    }
}
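
To see what DataSet.ReadXml infers from a document when no schema is supplied, the following minimal sketch lists each inferred table with its column names. DataSetInferenceSketch and DescribeTables are hypothetical names introduced for illustration; the sketch is not part of FileConversionLibrary.

```csharp
using System.Collections.Generic;
using System.Data;
using System.IO;

// Hypothetical helper for illustration only; not part of FileConversionLibrary.
public static class DataSetInferenceSketch
{
    // Maps each inferred table name to the list of its column names.
    public static Dictionary<string, List<string>> DescribeTables(string xml)
    {
        var dataSet = new DataSet();

        // No schema is supplied, so the DataSet infers one from the XML itself.
        using var reader = new StringReader(xml);
        dataSet.ReadXml(reader);

        var result = new Dictionary<string, List<string>>();
        foreach (DataTable table in dataSet.Tables)
        {
            var columns = new List<string>();
            foreach (DataColumn column in table.Columns)
            {
                columns.Add(column.ColumnName);
            }
            result[table.TableName] = columns;
        }
        return result;
    }
}
```

For a catalog of flat book elements like the sample above, the repeated book element is inferred as a single table, and both its child elements and its id attribute become columns. That is why reading Tables[0] in ReadXmlAsync is enough for this kind of input.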

IXmlConverter and XmlToWordConverter: Converting XML to Word

The IXmlConverter interface and XmlToWordConverter class in FileConversionLibrary provide a structured way to convert XML files into Word documents. This section explains how these components work together to achieve the conversion.

IXmlConverter Interface

The IXmlConverter interface defines a contract for converting XML files to other formats. It includes a single method, ConvertAsync, which performs the conversion asynchronously.

public interface IXmlConverter
{
    Task ConvertAsync(string xmlFilePath, string outputFilePath);
}

XmlToWordConverter Class

The XmlToWordConverter class implements the IXmlConverter interface to convert XML files into Word documents. It uses the XmlHelperFile class to read and parse the XML data, and the DocumentFormat.OpenXml library to create and manipulate the Word document.

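using System;
using System.IO;
using System.Threading.Tasks;
using System.Xml;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public class XmlToWordConverter : IXmlConverter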
{
    public async Task ConvertAsync(string xmlFilePath, string wordOutputPath)
    {
        try
        {
            // Read and parse the XML file using XmlHelperFile
            var (headers, rows) = await XmlHelperFile.ReadXmlAsync(xmlFilePath);

            // Create a new Word document
            using (var wordDocument = WordprocessingDocument.Create(wordOutputPath, WordprocessingDocumentType.Document))
            {
                // Add a main document part to the Word document
                var mainPart = wordDocument.AddMainDocumentPart();
                mainPart.Document = new Document();
                var body = mainPart.Document.AppendChild(new Body());

                // Add headers to the Word document
                var headerParagraph = body.AppendChild(new Paragraph());
                var headerRun = headerParagraph.AppendChild(new Run());
                headerRun.AppendChild(new Text(string.Join(" ", headers)));

                // Add rows to the Word document
                foreach (var row in rows)
                {
                    var paragraph = body.AppendChild(new Paragraph());
                    var run = paragraph.AppendChild(new Run());
                    run.AppendChild(new Text(string.Join(" ", row)));
                }
            }
        }
        catch (FileNotFoundException e)
        {
            // Handle file not found exception
            Console.WriteLine($"File not found: {e.FileName}");
        }
        catch (XmlException e)
        {
            // Handle invalid XML exception
            Console.WriteLine($"Invalid XML: {e.Message}");
        }
        catch (Exception e)
        {
            // Handle any other unexpected exceptions
            Console.WriteLine($"Unexpected error: {e.Message}");
        }
    }
}

How It Works

  1. Reading and Parsing XML
    The ConvertAsync method starts by calling XmlHelperFile.ReadXmlAsync to read and parse the XML file. This method returns the headers and rows extracted from the XML.

  2. Creating a Word Document
    The method then creates a new Word document using WordprocessingDocument.Create. This document is created at the specified wordOutputPath.

  3. Adding Main Document Part
    A main document part is added to the Word document, which contains the body of the document.

  4. Adding Headers
    A paragraph is created for the headers, and a run is added to this paragraph. The headers are joined into a single string and added as text to the run.

  5. Adding Rows
    For each row in the extracted data, a new paragraph is created, and a run is added to this paragraph. The row data is joined into a single string and added as text to the run.

  6. Exception Handling
    The method catches FileNotFoundException, XmlException, and any other exception, and writes an informative message to the console. Because the exceptions are not rethrown, the caller receives no signal that the conversion failed.
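
The steps above write each row as a single space-joined paragraph. As one possible improvement, the same headers and rows could be written as a Word table instead. The following is a minimal sketch under that assumption, using the Table, TableRow, and TableCell classes from DocumentFormat.OpenXml.Wordprocessing. WordTableSketch and BuildTable are hypothetical names and are not part of FileConversionLibrary.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for illustration only; not part of FileConversionLibrary.
public static class WordTableSketch
{
    // Builds a table with one header row followed by one row per data record.
    public static Table BuildTable(string[] headers, List<string[]> rows)
    {
        var table = new Table();
        table.AppendChild(BuildRow(headers));

        foreach (var row in rows)
        {
            table.AppendChild(BuildRow(row));
        }
        return table;
    }

    // Wraps each value in the cell > paragraph > run > text chain Word expects.
    private static TableRow BuildRow(IEnumerable<string> values)
    {
        var tableRow = new TableRow();
        foreach (var value in values)
        {
            var cell = new TableCell(new Paragraph(new Run(new Text(value))));
            tableRow.AppendChild(cell);
        }
        return tableRow;
    }
}
```

Inside ConvertAsync, a call such as body.AppendChild(WordTableSketch.BuildTable(headers, rows)) could replace the two paragraph-writing steps. This sketch sets no borders or styles, so the grid lines will probably not be visible until table properties are added.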

Executing XML to Word Conversion

Here is an example of how to call the ConvertAsync method of the XmlToWordConverter class within the Main method:

class Program
{
    static async Task Main(string[] args)
    {
        // XML to Word Conversion
        var xmlToWordConverter = new XmlToWordConverter();
        await xmlToWordConverter.ConvertAsync(@"C:\Users\User\Desktop\books.xml", @"C:\Users\User\Desktop\output.docx");
        Console.WriteLine("XML to Word conversion completed.");

        ...
    }
}

Conclusion

In this article, we explored how to convert XML files to Word documents using FileConversionLibrary. My solution is a basic, entry-level example of XML-to-Word conversion that can serve as a starting point and be improved upon.
