Building a Content Transformation Pipeline

Processing unstructured content into a clean, well-formatted output is a common problem in software development. One elegant solution is the pipeline programming pattern. A pipeline organizes data transformations into discrete, reusable steps, making the process more maintainable, testable, and extensible.

In this article, we’ll implement a realistic example: transforming a messy email containing images, links, and text into a structured blog post.

The Scenario

We receive an email with:

Images mixed with text
Links to various websites
Embedded YouTube links
Inconsistent HTML formatting

Our goal:

Extract all images and store them in a gallery.
Detect links and turn them into clickable anchors, or into embeds when they come from special providers such as YouTube.
Extract the plain-text body, stripped of HTML formatting.
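
These goals suggest the shape of the data the pipeline should end up with. As a rough sketch, the result could be a plain object with one field per goal. The type name `ProcessedEmail` and the helper `createInitialState` are hypothetical; the field names mirror the ones used in the implementation later in this article.

```javascript
/**
 * Hypothetical result shape, one field per goal.
 * @typedef {Object} ProcessedEmail
 * @property {string}   raw            original email HTML
 * @property {string[]} images         image URLs destined for the gallery
 * @property {string[]} links          URLs found in the body
 * @property {string[]} embeddedLinks  HTML for each link (anchor or embed)
 * @property {string}   text           plain-text body
 */

// Starting state: only `raw` is known; each later step fills in one field.
function createInitialState(raw) {
  return { raw, images: [], links: [], embeddedLinks: [], text: "" };
}

const state = createInitialState("<p>Hello</p>");
console.log(Object.keys(state));
// [ 'raw', 'images', 'links', 'embeddedLinks', 'text' ]
```

Starting from a fully populated object is optional; a step that spreads its input and adds one key works just as well when the initial object holds only `raw`.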

Why Pipelines Work

Instead of writing one large function that handles everything, we’ll create a sequence of small processors. Each processor:

Takes a standard input format
Outputs a standard format
Passes the result to the next processor

This structure makes the process easier to read, debug, and extend.
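
To make the debugging claim concrete, here is a minimal sketch of what a shared input/output format buys us: any step can be wrapped to record what it produced, without the step knowing about it. The steps `trimText` and `countWords` and the wrapper `traced` are illustrative names, not part of the article's pipeline.

```javascript
// Two hypothetical steps; each takes a plain object and returns a new one.
const trimText = (data) => ({ ...data, text: data.text.trim() });
const countWords = (data) => ({
  ...data,
  wordCount: data.text.split(/\s+/).length,
});

// Wrap a step so the keys of its output are recorded before being passed on.
const trace = [];
const traced = (step) => (data) => {
  const out = step(data);
  trace.push({ step: step.name, keys: Object.keys(out) });
  return out;
};

// Run the wrapped steps in order, feeding each result into the next step.
const steps = [trimText, countWords].map(traced);
const result = steps.reduce(
  (data, step) => step(data),
  { text: "  hello pipeline world  " }
);

console.log(result); // { text: 'hello pipeline world', wordCount: 3 }
console.log(trace.map((entry) => entry.step)); // [ 'trimText', 'countWords' ]
```

Because every step has the same signature, the wrapper applies to all of them uniformly, and removing it changes nothing about the result.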

Implementation

Here’s an example in modern JavaScript:


javascript
class Pipeline {
  constructor(steps = []) {
    this.steps = steps;
  }
  add(step) {
    this.steps.push(step);
    return this;
  }
  run(input) {
    return this.steps.reduce((data, step) => step(data), input);
  }
}

// Processing steps
function extractImages(data) {
  const images = [...data.raw.matchAll(/<img[^>]+src="([^">]+)"/g)]
    .map(m => m[1]);
  return { ...data, images };
}

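function extractLinks(data) {
  // Stop at whitespace, angle brackets, and quotes so that URLs inside
  // attributes (href="...") are not captured with a trailing quote.
  const links = [...data.raw.matchAll(/https?:\/\/[^\s<>"']+/g)]
    .map(m => m[0]);
  return { ...data, links };
}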

function embedSpecialLinks(data) {
  const embeddedLinks = data.links.map(link => {
    if (link.includes("youtube.com")) {
      return `<iframe src="${link}" frameborder="0"></iframe>`;
    }
    return `<a href="${link}">${link}</a>`;
  });
  return { ...data, embeddedLinks };
}

function cleanText(data) {
  // Strip tags, then collapse runs of whitespace into single spaces.
  const text = data.raw
    .replace(/<[^>]*>/g, "")
    .replace(/\s+/g, " ")
    .trim();
  return { ...data, text };
}

// Build the pipeline
const contentPipeline = new Pipeline()
  .add(extractImages)
  .add(extractLinks)
  .add(embedSpecialLinks)
  .add(cleanText);

// Example usage
const rawEmail = `
  <p>Hello <b>World</b></p>
  <img src="photo1.jpg" />
  https://youtube.com/watch?v=abc123
  https://example.com
`;

const result = contentPipeline.run({ raw: rawEmail });

console.log(result);
/*
{
  raw: "...",
  images: ["photo1.jpg"],
  links: [
    "https://youtube.com/watch?v=abc123",
    "https://example.com"
  ],
  embeddedLinks: [
    "<iframe src=\"https://youtube.com/watch?v=abc123\" frameborder=\"0\"></iframe>",
    "<a href=\"https://example.com\">https://example.com</a>"
  ],
  text: "Hello World https://youtube.com/watch?v=abc123 https://example.com"
}
*/
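
One caveat about the `embedSpecialLinks` step: it puts the original watch URL into the iframe, while YouTube's embeddable player lives under `/embed/<video id>`, and `link.includes("youtube.com")` also matches unrelated hosts that merely contain that string. The following is a sketch of a stricter variant under those assumptions. The names `toYouTubeEmbedUrl` and `embedSpecialLinksStrict` are hypothetical, and only `youtube.com/watch` and `youtu.be` links are handled.

```javascript
// Convert a YouTube watch or short link into an embeddable URL.
// Returns null when the link is not a recognized YouTube video link.
function toYouTubeEmbedUrl(link) {
  let url;
  try {
    url = new URL(link);
  } catch {
    return null; // not a parseable absolute URL
  }
  const host = url.hostname.replace(/^www\./, "");
  let videoId = null;
  if (host === "youtube.com" && url.pathname === "/watch") {
    videoId = url.searchParams.get("v");
  } else if (host === "youtu.be") {
    videoId = url.pathname.slice(1) || null;
  }
  return videoId
    ? `https://www.youtube.com/embed/${encodeURIComponent(videoId)}`
    : null;
}

// Same contract as embedSpecialLinks: reads data.links, adds embeddedLinks.
function embedSpecialLinksStrict(data) {
  const embeddedLinks = data.links.map((link) => {
    const embedUrl = toYouTubeEmbedUrl(link);
    return embedUrl
      ? `<iframe src="${embedUrl}" frameborder="0"></iframe>`
      : `<a href="${link}">${link}</a>`;
  });
  return { ...data, embeddedLinks };
}

console.log(toYouTubeEmbedUrl("https://youtube.com/watch?v=abc123"));
// https://www.youtube.com/embed/abc123
```

Since the variant keeps the same input and output shape, it can replace `embedSpecialLinks` in the `.add(...)` chain without touching the other steps.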

