
Adrian Paul Nutiu


🚀 The Journey to Sitecore CMS: A Complex Content Migration

📚 Intro

As I mentioned in my first article, "🦸‍♂️ Nobody Dreamed of Becoming a DevOps Engineer", today I'm going to share with you just how lazy I am - and how that trait turned into something remarkable.

More than ten years ago, at my first company, I found myself working on a project that involved migrating a massive live website into Sitecore CMS. The task was far from simple, and it quickly turned into a significant challenge with moments of trial and error.

This is the story of how we navigated through a maze of technical hurdles, limited tools, and budget constraints to deliver a successful solution.


⚑ How It Started: From Manual Labor to Automation

Before diving into building a robust migration solution, my journey began with a simpler and, frankly, more exhausting task - manual content migration. 😭

Alongside my work on implementing the new website, I was also tasked with migrating content by hand. Day after day, I found myself copying and pasting text, images, and other assets. It didn't take long to realize that I despised this repetitive, draining work. 😨


Determined to break free from this mundane cycle, I began experimenting with automation.

Some might view it as laziness, but this laziness can evolve into a drive to envision automations. 🚀


🧗 Challenges

But first, let's see some of the challenges we faced.

🔒 Challenge 1: The Encrypted Database

One of our first challenges was dealing with an Adobe CMS database that had limited access and encrypted content. With no available documentation or Adobe CMS SDK to guide us, we had to give up on reverse-engineering the database to migrate content directly between the two CMSs. 🤔

🕸️ Challenge 2: Static Crawling Is Not Enough

The initial idea was to scrape the website statically, crawling through all available pages and extracting the content... using PHP, the programming language I knew best at that time.
I can still hear my team leader's voice mocking the same words I'd been telling myself: "I can do it faster in PHP. Just wait and see." 🤣🤣
(If you're reading this, I just want to say "Thank you!" 🙇‍♂️)

Well, as it often goes with the best-laid plans, it quickly became apparent that static crawling wasn't enough. While PHP could handle scraping the static content, it wasn't designed to capture dynamically loaded data - like images, videos, and other media fetched through JavaScript, Flash (Is Flash Dead Yet? 😱 Yes, it is.) and other techniques. We were scratching the surface, but we needed something far more sophisticated to get the full picture of the website's content.

First trial and error - accomplished! 😅😅

This was only the beginning of what would become a much larger and more intricate challenge.

Tip of the Iceberg Crawler

The good part in this small automation project is that it provided a bird's-eye view of the number of pages and assets we needed to migrate, helping us gauge the scope of the work.
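
To give an idea of what such an inventory pass can look like, here is a minimal sketch. It is written in Python, not the PHP I used back then, and it is not the original code: the class name, the tag-to-attribute table, and the sample markup are all made up for illustration. It only sees what is present in the raw HTML, which is exactly the limitation described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical table: which attribute holds the URL for each asset tag.
ASSET_ATTRS = {"img": "src", "script": "src", "video": "src", "embed": "src"}


class InventoryParser(HTMLParser):
    """Collects page links and asset URLs found in static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = set()
        self.assets = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Links to other pages are resolved against the current page URL.
            self.pages.add(urljoin(self.base_url, attrs["href"]))
        elif tag in ASSET_ATTRS and attrs.get(ASSET_ATTRS[tag]):
            self.assets.add(urljoin(self.base_url, attrs[ASSET_ATTRS[tag]]))


def inventory(html, base_url):
    """Return (page URLs, asset URLs) referenced directly in the markup."""
    parser = InventoryParser(base_url)
    parser.feed(html)
    return parser.pages, parser.assets


# Made-up sample page: the gallery images are loaded later by the script,
# so a static pass never sees them.
sample = """
<a href="/about">About</a>
<img src="logo.png">
<div id="gallery"></div>
<script src="/js/gallery.js"></script>
"""
pages, assets = inventory(sample, "https://example.com/home/")
print(len(pages), "page link(s),", len(assets), "asset(s)")
```

Counting the two sets across every crawled page is enough to estimate the scope, even if the asset list is incomplete.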

🌐 Challenge 3: The Multilingual Nightmare

The multilingual migration became challenging due to inconsistent content structures across languages. Variants often had missing, misaligned, or differently formatted content, requiring custom logic to ensure accurate mapping and maintain integrity across all languages.

And personally, I don't know 30 languages well enough to tell for sure that I am copying the right Chinese content - for example - without passing it through Google Translate. 😅


🧠 Stepping Back: Designing the Solution

While working in parallel to implement the core functionalities for the website using ASP.NET with the Sitecore SDK, creating a foundation for the new CMS-driven system I got to the realization that static crawling was inadequate, it was time to take a step back and reconsider the approach. I needed to think through the requirements and architect a solution that would tackle the challenges of migrating dynamic content.
This process began only in my mind - an outline of what the solution would look like, how the components might fit together, and what new hurdles we might encounter.

Next came research: digging through resources, looking up technologies, and thinking about ways to stitch everything together. I also tested the waters with my colleagues on each potential component, exploring what might work without telling them about the bigger picture. In those early stages, fear held me back from sharing it outright; I was unsure if my vision would work or be received well.

So, I built an initial proof of concept (POC) to start small and show it in action. It wasn't fancy - just a simple setup that navigated a web page using a custom browser component built on Internet Explorer (yes, not even Chromium). This POC could highlight elements, extract content, and output the data to the console.

Nevertheless, the POC was a success. It was also the point when it clicked for everyone why I had been asking all those questions and testing the waters with them 🤯.

It started to be a dual effort: building a migration solution and ensuring Sitecore integration worked seamlessly.


🛠️ Building the Solution in C#

Codename: Octopussy
(Yes! Like the 007 movie with Roger Moore, from 1983 😅)

To get the content migrated properly, I had to develop several custom components, many of them from scratch, all in C#. Each of these components played a crucial role in making sure the migration was as seamless as possible.

🌐 Custom Browser for Navigation

The heart of our solution was a custom browser, built to navigate the website. This browser allowed us to manually or automatically navigate through pages, ensuring that we could capture not just static content but also dynamically loaded elements like images, videos, and other media.

🛡️ Reverse Proxy for Catching Dynamic Assets

To capture dynamically loaded content like Flash assets, images, and videos, I used a custom reverse proxy in combination with Fiddler, a web debugging tool.

Fiddler helped us monitor HTTP/HTTPS traffic, revealing how assets were loaded asynchronously. The reverse proxy acted as an intermediary, ensuring that all dynamic content, not just static pages, was captured and migrated into Sitecore CMS. This approach ensured we didn't miss any assets and kept the integrity of the original site intact during migration.
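
The value of the proxy was in the difference between what it logged and what a static crawl had already found. The following is a minimal Python sketch of that comparison only, not the original C# proxy; the function name and the sample URLs are hypothetical.

```python
from collections import defaultdict


def dynamic_only(captured, static_urls):
    """Group captured URLs that the static crawl never saw, by content type."""
    missed = defaultdict(list)
    for url, content_type in captured:
        if url not in static_urls:
            missed[content_type].append(url)
    return dict(missed)


# Hypothetical data: what a static crawl found vs. what the proxy logged.
static_urls = {"https://example.com/logo.png"}
captured = [
    ("https://example.com/logo.png", "image/png"),
    ("https://example.com/media/intro.swf", "application/x-shockwave-flash"),
    ("https://example.com/media/photo-1.jpg", "image/jpeg"),
]
missed = dynamic_only(captured, static_urls)
for content_type, urls in missed.items():
    print(content_type, urls)
```

Anything that shows up here was loaded at runtime and would have been lost with static crawling alone.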

🌍 Translation with Google Translate API

Given that the website was multilingual, one of the trivial but necessary components we built was a translation feature. I integrated the Google Translate API to automatically translate page titles and content across different languages. But here's the tricky part: the structure and content were not consistent across languages. So, I had to add a text similarity check to match the correct content items for each language variant. This way, I ensured the translated titles corresponded to the right content in each language.
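
The matching step can be sketched roughly like this. It is a Python illustration, not the original C# code, and it makes several assumptions: the titles are already translated into a common language (the Google Translate call is left out), similarity is measured with the standard library's `difflib.SequenceMatcher`, and the 0.6 threshold is an arbitrary value that would need tuning on real content.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(translated_title, candidates, threshold=0.6):
    """Return the candidate most similar to translated_title, or None if nothing is close enough."""
    scored = [(similarity(translated_title, c), c) for c in candidates]
    score, title = max(scored, default=(0.0, None))
    return title if score >= threshold else None


# Hypothetical titles from the reference-language version of the site.
candidates = ["Annual Report 2012", "Contact us", "Press releases"]

# A machine translation rarely matches word for word, hence the fuzzy comparison.
print(best_match("Yearly report 2012", candidates))
print(best_match("Cookie policy", candidates))
```

Returning `None` below the threshold matters: an unmatched item can be flagged for a human instead of being silently attached to the wrong page.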

📦 Sitecore SDK Integration

Once we had the content, the next challenge was pushing it into Sitecore CMS. Using the Sitecore SDK, I built a service that handled the loading and saving of content into the CMS database. This service also mapped the content to the relevant Sitecore templates, so that everything from text to media had the appropriate structure.

⚙️ Dynamically Generated Assemblies

To make the system flexible and avoid constant restarts during development, I implemented a mechanism for dynamically building and versioning assemblies on the fly. These assemblies were required for templates, which were strongly typed. This was a game changer as it allowed us to add new templates or modify existing ones without restarting the service, making the entire process much more efficient.

🔍 XPath Rules for Content Extraction*

Given the complexity of the site's structure, I couldn't simply rely on traditional scraping techniques. Instead, I used XPath-like rules to match specific elements on each page - whether they were articles, images, or media files - and map that content into Sitecore's CMS. This was crucial for ensuring that content was categorized and saved properly.
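
A rule set of this kind can be sketched as a table that maps target fields to path expressions. The example below is Python with real XPath (the limited subset supported by the standard library's `xml.etree.ElementTree`), not the custom syntax or the C# code from the project; the rule names, expressions, and sample page are assumptions. It also requires well-formed markup, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: target field -> (XPath expression, attribute name or None for text).
RULES = {
    "title": (".//h1", None),
    "body": (".//div[@class='article']/p", None),
    "image": (".//div[@class='article']//img", "src"),
}


def extract(xhtml, rules):
    """Apply each rule to the page and return the first match per field."""
    root = ET.fromstring(xhtml.strip())
    result = {}
    for field, (path, attr) in rules.items():
        node = root.find(path)
        if node is None:
            result[field] = None  # rule did not match; leave it for manual review
        elif attr:
            result[field] = node.get(attr)
        else:
            result[field] = "".join(node.itertext()).strip()
    return result


# Made-up, well-formed sample page.
page = """
<html><body>
  <h1>Company news</h1>
  <div class="article">
    <p>First paragraph.</p>
    <img src="/media/news.jpg" />
  </div>
</body></html>
"""
extracted = extract(page, RULES)
print(extracted)
```

Keeping the rules as data rather than code means a new page type only needs a new table, not a new extractor.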


💸 Why No Use of Expensive Tools?

At the time, there were other solutions available - powerful, "enterprise-level" (question mark) tools designed specifically for content migration. But those tools came with a hefty price tag of around €30k per month.
(Kapow! No! Really! That's what it's called. 😃)

The client, on the other hand, was only willing to invest €30k for the entire project migration. That's right, the budget for several months was the same as one month of using these premium tools. So, we had no choice but to roll up our sleeves and build our own solution.

One does not simply buy it for €30k


📦 The Sitecore CMS: Why Sitecore?

Sitecore, a powerful CMS platform, was not my decision but rather the result of a sales deal. Sitecore brought features that were essential for this project: multilingual support, robust templating, and an efficient publishing mechanism. Sitecore also allows organizations to customize their content based on language, which was critical for this project, given its global reach. We had to ensure all the content was migrated in multiple languages - a task made easier with Sitecore's multilingual capabilities.

Another key feature was Sitecore's template system. This allowed us to define the structure of content upfront, ensuring consistency across different types of content like articles, news items, and images. The ability to create and publish content quickly, while ensuring it was structured correctly, was invaluable.


🙋‍♂️ Lessons Learned: Asking for Help

One of the most important lessons I learned from this project wasn't technical. My team leader reminded me that I didn't have to do everything myself. There were moments where I felt overwhelmed by the complexity of the task, but knowing when to ask for help, and rely on others, was key. This mindset carried me through not just this project but many others down the line.


📈 Rate of Success: Automation and Content Fillers

In terms of the overall success rate, we achieved about 80% coverage through automation. The custom components I developed - such as the browser, reverse proxy, and integration with the Sitecore SDK - were effective at handling the bulk of the migration work. This meant that most of the content, including text, images, and media, was seamlessly transferred into the CMS.

Smooth

However, there was still about 20% of the content that required manual intervention, which we covered using content fillers. This portion of the work involved sections of the website where content structures were too varied or complex to automate efficiently. Rather than spending additional time creating overly elaborate logic for these edge cases, we opted for a more pragmatic approach. The content fillers allowed us to populate those areas without going too deep into custom code for content that was often inconsistent across languages and formats.

This hybrid approach of automation and manual content filling struck the right balance, enabling us to meet the project's goals within the given time and budget.


🏆 Conclusion: A Journey Worth Remembering

Looking back, the migration project was challenging, full of unexpected roadblocks, but also incredibly rewarding. We built a custom solution from the ground up, integrated it with one of the most powerful CMS platforms available, and delivered a multilingual content migration that met the client's expectations - without breaking the bank.

This project remains one of the most technically challenging I've worked on, but it's also one of the most satisfying. From reverse proxies to dynamic assembly generation, we navigated a maze of technical problems, learning a lot along the way. And most importantly, we came out the other side with a system that worked.


💬 *Confession

The XPath rules weren't exactly XPath. 🙄 I used a custom syntax that resembled XPath, which worked for a while but became cumbersome over time. It was only months or years later that I realized using XPath would have been much better, especially if I had known about the HTML Agility Pack and its ability to fix broken HTML, allowing XPath to function properly.

Who knows... Having that in place might have significantly boosted the success rate, potentially raising it to 99%. Probably!



Even though Flash is no longer supported - and back then, it wasn't even on the deprecation radar - and the entire website has since been redesigned and rewritten, I'm still proud of what we accomplished. 💪

I think that anyone who has tackled a major migration considers it a milestone in their career. 🤔


Let me know in the comments if you've tackled any migration.
How did it go? What tools did you use?
