Peter

Building a Content Transformation Pipeline

Turning unstructured content into clean, well-formatted output is a common problem in software development. One elegant solution is the pipeline pattern. A pipeline organizes data transformations into discrete, reusable steps, which makes the process easier to maintain, test, and extend.

In this article, we’ll implement a realistic example: transforming a messy email containing images, links, and text into a structured blog post.


The Scenario

We receive an email with:

  • Images mixed with text
  • Links to various websites
  • Embedded YouTube links
  • Inconsistent HTML formatting

Our goal:

  1. Extract all images and store them in a gallery.
  2. Detect links and turn them into clickable anchors, or embed them when they come from special providers such as YouTube.
  3. Extract the clean, plain text body without formatting.

Why Pipelines Work

Instead of writing one large function that handles everything, we’ll create a sequence of small processors. Each processor:

  • Takes its input in a standard format
  • Returns its output in that same format
  • Hands the result on to the next processor

This structure makes the process easier to read, debug, and extend.
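
To make these properties concrete, here is a minimal sketch, separate from the implementation that follows, in which every processor takes a plain object and returns a new one. The names `trimTitle`, `countWords`, `tap`, and `runSteps` are hypothetical and chosen only for illustration.

```javascript
// Hypothetical processors: each takes a plain object and returns a new one.
const trimTitle = (data) => ({ ...data, title: data.title.trim() });
const countWords = (data) => ({
  ...data,
  wordCount: data.title.split(/\s+/).filter(Boolean).length,
});

// Hypothetical debugging helper: logs the intermediate value and passes it
// through unchanged, so it can be dropped between any two steps.
const tap = (label) => (data) => {
  console.log(label, data);
  return data;
};

// Because every step has the same shape, steps compose with reduce.
const runSteps = (steps, input) =>
  steps.reduce((data, step) => step(data), input);

// A single step can be checked on its own...
const trimmed = trimTitle({ title: "  Hello pipeline world  " });

// ...and the whole chain can be inspected between steps.
const output = runSteps(
  [trimTitle, tap("after trimTitle:"), countWords],
  { title: "  Hello pipeline world  " }
);
```

Since `tap` returns its input untouched, removing it never changes the result, which is what makes this kind of debugging cheap.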


Implementation

Here’s an example in modern JavaScript:


javascript
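// A pipeline is an ordered list of step functions.
class Pipeline {
  constructor(steps = []) {
    this.steps = [...steps]; // copy so the caller's array is never mutated
  }
  // Append a step; returning `this` allows chained .add() calls.
  add(step) {
    this.steps.push(step);
    return this;
  }
  // Feed the input through every step in order; each step receives
  // the previous step's return value.
  run(input) {
    return this.steps.reduce((data, step) => step(data), input);
  }
}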

// Processing steps
// Collect the src of every <img> tag (single- or double-quoted attribute).
function extractImages(data) {
  const images = [...data.raw.matchAll(/<img[^>]+src=["']([^"'>]+)["']/gi)]
    .map(m => m[1]);
  return { ...data, images };
}

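// Collect every http(s) URL; stop at whitespace, angle brackets, or quotes
// so URLs inside attributes don't pick up a trailing quote character.
function extractLinks(data) {
  const links = [...data.raw.matchAll(/https?:\/\/[^\s<>"']+/g)]
    .map(m => m[0]);
  return { ...data, links };
}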

function embedSpecialLinks(data) {
  const embeddedLinks = data.links.map(link => {
    if (link.includes("youtube.com")) {
      return `<iframe src="${link}" frameborder="0"></iframe>`;
    }
    return `<a href="${link}">${link}</a>`;
  });
  return { ...data, embeddedLinks };
}

function cleanText(data) {
  const text = data.raw
    .replace(/<[^>]*>/g, "") // strip HTML tags
    .replace(/\s+/g, " ")    // collapse newlines and runs of spaces
    .trim();
  return { ...data, text };
}

// Build the pipeline
const contentPipeline = new Pipeline()
  .add(extractImages)
  .add(extractLinks)
  .add(embedSpecialLinks)
  .add(cleanText);

// Example usage
const rawEmail = `
  <p>Hello <b>World</b></p>
  <img src="photo1.jpg" />
  https://youtube.com/watch?v=abc123
  https://example.com
`;

const result = contentPipeline.run({ raw: rawEmail });

console.log(result);
/*
{
  raw: "...",
  images: ["photo1.jpg"],
  links: [
    "https://youtube.com/watch?v=abc123",
    "https://example.com"
  ],
  embeddedLinks: [
    "<iframe src=\"https://youtube.com/watch?v=abc123\" frameborder=\"0\"></iframe>",
    "<a href=\"https://example.com\">https://example.com</a>"
  ],
  text: "Hello World https://youtube.com/watch?v=abc123 https://example.com"
}
*/
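
One caveat about the embed step: YouTube generally refuses to render its `watch` pages inside an iframe, and embeds use the `https://www.youtube.com/embed/<video id>` URL form instead. Matching with `includes("youtube.com")` also accepts unrelated hosts that merely contain that string. The sketch below shows one possible way to handle both using the standard `URL` class. `toYouTubeEmbedUrl` and `renderLink` are hypothetical helper names, not part of the pipeline above, and the sketch assumes only `watch?v=` and `youtu.be` links need to be recognized.

```javascript
// Hypothetical helper: returns an embeddable URL for a YouTube link,
// or null when the link is not a recognized YouTube video URL.
function toYouTubeEmbedUrl(link) {
  let url;
  try {
    url = new URL(link);
  } catch {
    return null; // not a parseable absolute URL
  }
  // Compare the real hostname instead of searching the whole string.
  const host = url.hostname.replace(/^www\./, "");
  let id = null;
  if (host === "youtube.com" && url.pathname === "/watch") {
    id = url.searchParams.get("v");
  } else if (host === "youtu.be") {
    id = url.pathname.slice(1);
  }
  return id ? `https://www.youtube.com/embed/${encodeURIComponent(id)}` : null;
}

// Hypothetical helper: an iframe for YouTube videos, a plain anchor otherwise.
function renderLink(link) {
  const embedUrl = toYouTubeEmbedUrl(link);
  return embedUrl
    ? `<iframe src="${embedUrl}" allowfullscreen></iframe>`
    : `<a href="${link}">${link}</a>`;
}
```

Using `renderLink` as the callback in `embedSpecialLinks` would change the iframe `src` in the sample output to the `/embed/` form; everything else in the pipeline stays the same.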