Thomas.G for MyUnisoft

Migrating from Nextcloud to Azure S3

Hello 👋

Back for a new MyUnisoft technical article, this time with the help of my colleague Nicolas MARTEAU. Today, we will share our journey to completely refactor our document management architecture and how we migrated from Nextcloud to Azure S3 as our storage technology.

We weren't able to cover every detail (both for security reasons 🛡️ and to protect sensitive data 🔒), but I hope you will enjoy what I could share. 😊

👀 Why move away from Nextcloud?

Performance 🤖

Until now, we managed several tens of millions of documents with Nextcloud. However, stability and performance had become an issue, with regular downtime 🕒 and, at times, delays ⏳ of several minutes for a simple document upload.

💬 These upload delays sometimes led to misunderstandings among users. For example, in certain integrations, it was not uncommon for users to delete their accounting entries 🧾 after a few seconds because they thought the attachment was missing.

The complexity and limited functionality of the existing APIs quickly became a significant obstacle 🛑. Simply making a document available in a specific folder could require four or five separate HTTP requests. We needed a more robust storage solution that could scale effectively 📈 and provide consistent, fast response times. 🚀

Infrastructure 🏢

Furthermore, we needed to reduce Nextcloud's impact on our infrastructure. Unlike Azure, Nextcloud doesn't scale well and requires too much maintenance from our DevOps team.

😬 Architectural issues

In the past, users accessed documents stored directly on Nextcloud, with some files displayed through the platform's built-in viewers.

GED Architecture 1

This initial choice was certainly made for simplicity, but it evolved into a significant architectural challenge once we began exposing storage directly to customers. Changing a storage server without affecting our users became complex, and it also complicated the management of certain security and observability concerns.

GED Architecture 2

The primary issue is with PDF documents, such as ledgers, which contain hardcoded URLs pointing to specific storage servers. This requires us to maintain these URLs for years to ensure continued access.


As part of our migration to S3, we are addressing these issues by routing all requests through the same service (GED).

GED New Architecture

This approach enables us to resolve several issues and enhance the product's functionality:

  • Requiring authentication for sensitive documents.
  • Providing full observability over who uploads or downloads specific documents.
  • Enabling updates to storage capabilities without impacting customers.
  • Integrating new storage technologies seamlessly and transparently, for instance through potential future integrations with services like Microsoft OneDrive.

📢 The plan

The first step was to draft an action plan 📝 and thoroughly document the existing setup. After several weeks of work, we established the key steps:

  1. Route all document downloads through the GED service.
  2. Route all document uploads through the GED service.
  3. Migrate all existing documents to our new Azure storage, ensuring zero impact 🚫 on the end user.
  4. Manage the Nextcloud links found in PDFs already exported by our clients before the migration. Since these links pointed directly to our Nextcloud servers 💀, we had to find a reliable way to route those calls through the GED.

Each stage comes with its own set of challenges, which we'll examine in detail later in the article.

Our primary concern, however, was to correct previous architectural missteps 🔍.

1๏ธโƒฃ Download

The first step was to re-abstract downloads and previews, routing them through our backend. This required us to manage both legacy documents 📜 still stored on Nextcloud and new documents that would be hosted on Azure storage.

One challenge we faced was that the token generated by Nextcloud lacks any information about the tenant associated with the document. Without it, our backend cannot identify the relevant database cluster and tenant.

TokenXTenant

To resolve this, we created a new opaque token that embeds the tenant ID:

import crypto from "node:crypto";

// Opaque token: tenant ID prefix + 16 random bytes in hex
const token = `${tenantId}-${crypto.randomBytes(16).toString("hex")}`;
console.log(token); // => 1-f82158a508b8bfbed82b601e2ed60edd
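
On the way back, the tenant ID can be recovered by splitting the token on its first dash. A minimal sketch (the helper name is ours, not part of the actual codebase):

// Hypothetical helper: recover the tenant ID from an opaque token
function parseToken(token: string): { tenantId: number; nonce: string } {
  const separator = token.indexOf("-");

  return {
    tenantId: Number(token.slice(0, separator)),
    nonce: token.slice(separator + 1)
  };
}

parseToken("1-f82158a508b8bfbed82b601e2ed60edd");
// => { tenantId: 1, nonce: "f82158a508b8bfbed82b601e2ed60edd" }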

🔮 Previews

Nextcloud offered automatic previews of uploaded files, a feature we relied on extensively, so we needed to re-implement an equivalent ourselves.

download_preview_usage

We decided not to generate previews at upload, as this would have added significant complexity and cost, along with the challenge of handling asynchronous generation.

For PDFs, we just return an optimized preview of the first page, and for images, we use the Sharp library.

import sharp from "sharp";

// getDimensions and Azure.isAlphaImage are internal helpers
function getImageTransformer(
  query: { x?: string; y?: string; },
  ext: string
): sharp.Sharp {
  const { x, y } = getDimensions(query);
  // Keep the alpha channel for formats that support it (PNG),
  // otherwise produce a lightweight JPEG preview.
  const transformer = Azure.isAlphaImage(ext)
    ? sharp().png()
    : sharp({ failOn: "none" }).jpeg({ mozjpeg: true, quality: 50 });

  return transformer.resize({
    fit: "inside",
    withoutEnlargement: true,
    height: y,
    width: x
  });
}
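
A preview request can then stream the stored image straight through the transformer into the HTTP response, so the file is never fully buffered. A sketch of the wiring (parameter names are illustrative):

import { pipeline } from "node:stream/promises";
import type { Readable, Writable } from "node:stream";

// Illustrative wiring: source is e.g. the Azure download stream,
// destination is e.g. reply.raw in Fastify.
async function streamPreview(
  source: Readable,
  destination: Writable,
  query: { x?: string; y?: string; },
  ext: string
): Promise<void> {
  await pipeline(source, getImageTransformer(query, ext), destination);
}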

📦 Headers and encoding

When returning documents, it's essential to set the correct HTTP headers and apply proper encoding to values like file names.

import path from "node:path";
import mime from "mime-types";

// Resolve the Content-Type from the file extension, not the request headers
const contentType = mime.contentType(
  path.extname(request.body.document)
);
const { body, contentLength } = await getFileFromAzure(request);
// pipe body to reply/response

// Re-inject the original file name, safely encoded
reply.header(
  "Content-Disposition",
  `attachment; filename="${encodeURIComponent(filename)}"`
);
reply.header("Content-Type", contentType);
reply.header("Content-Length", contentLength);

I frequently see developers forget to re-inject the file name.

Monitoring

Being able to monitor usage trends and misuse is critical to guaranteeing the stability of our infrastructure.

GED download monitoring

Built-in file viewer

Since Nextcloud could display multiple documents within a viewer, we chose to re-implement a minimal yet functional viewer to retain this capability.

While our front-ends offer more advanced display modules, this lightweight viewer remains useful in several scenarios:

  • Replacing or rewriting legacy URLs in PDFs.
  • External links shared via APIs.
  • Providing quick access for debugging purposes.

GED viewer

2๏ธโƒฃ Upload

Successfully prototyping an upload wasn't as complex as expected… but, as always, the devil is in the details.

For inter-service uploads between Node applications, another Fastify plugin was added to our workspace package, providing methods to interact with the GED API 🔀.
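
As an illustration, such a plugin can decorate the Fastify instance with a small client. Everything below (names, endpoint, headers) is a sketch rather than our actual API:

import fp from "fastify-plugin";

// Sketch of a workspace plugin exposing a GED client to every service
export default fp(async (fastify) => {
  fastify.decorate("ged", {
    async upload(tenantId: number, fileName: string, file: Buffer) {
      // The endpoint and headers are illustrative
      const response = await fetch("http://ged.internal/document", {
        method: "POST",
        headers: {
          "x-tenant-id": String(tenantId),
          "content-disposition": `attachment; filename="${encodeURIComponent(fileName)}"`
        },
        body: file
      });
      if (!response.ok) {
        throw new Error(`GED upload failed with status ${response.status}`);
      }

      return response.json();
    }
  });
}, { name: "ged-client" });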

📉 Optimize PDFs and images

Many of the PDFs and images submitted by our users are quite large and can be optimized. For this, we use Ghostscript 👻 to optimize PDFs and the Sharp package for images.

To date, we've reduced the size of received PDFs and images by an average of 50%, with no loss in quality ✨.

GhostscriptSharp
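
Under the hood, the PDF pass boils down to spawning the gs binary. A simplified sketch of such a helper (the settings are illustrative, not our exact pipeline):

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Rewrite a PDF through Ghostscript's pdfwrite device.
// "/ebook" downsamples embedded images to ~150 dpi, a good size/quality trade-off.
async function optimizePdf(input: string, output: string): Promise<void> {
  await execFileAsync("gs", [
    "-sDEVICE=pdfwrite",
    "-dCompatibilityLevel=1.4",
    "-dPDFSETTINGS=/ebook",
    "-dNOPAUSE",
    "-dBATCH",
    "-dQUIET",
    `-sOutputFile=${output}`,
    input
  ]);
}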

Compression is performed asynchronously using setImmediate to ensure fast server response times.
A compression value of "null" indicates that the compression ratio is below 5% 🤷‍♀️, in which case updating the file on Azure isn't worth it.
Otherwise, the file is updated in the cloud ✔.
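
Roughly, the pattern looks like this (the optimize and update callbacks stand in for our real helpers):

// Let the HTTP response go out first; do the heavy work on a later tick.
function scheduleOptimization(
  documentId: string,
  optimize: (id: string) => Promise<number | null>, // returns the ratio, or null if < ~5%
  update: (id: string) => Promise<void>             // pushes the optimized file to Azure
): void {
  setImmediate(async () => {
    try {
      const ratio = await optimize(documentId);
      if (ratio !== null) {
        await update(documentId);
      }
    }
    catch (error) {
      console.error("background optimization failed", documentId, error);
    }
  });
}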

Most of these optimizations are carried out via streams, so that the file or image is never completely buffered.

However, we needed to remain vigilant about rising CPU consumption and enhance our infrastructure setup 🏗️ to handle increased workloads effectively.

🖼️ HEIC/HEIF

Apple's proprietary HEIC format 📱 presented a significant challenge, often requiring conversion to JPG or PNG for compatibility.

Given that Python bindings to libheif showed much better performance, we initially opted to create our own N-API Node.js binding for libheif, using low-level libraries for rapid JPG and PNG conversion.

HEIF-converter

For maintenance reasons, however, we ultimately chose Sharp, building libvips directly on our machines and installing the necessary tools (libheif, mozjpeg, libpng, etc.).

🔒 Security

When managing file uploads and storage, vigilance is essential 🕵️‍♂️ in several areas:

  • Monitor for spoofed HTTP headers 🛡️, such as altered content-type headers.
  • Scan files for viruses and malicious content 🦠.

Otherwise, an attacker could misuse your brand and storage capabilities to distribute malicious content and compromise users 🚨.

Make it a habit to consult the OWASP cheat sheets to ensure maximum protection against errors and oversights: OWASP File Upload Cheat Sheet.

We used clamscan (which relies on ClamAV) to scan the files 👁️, and file-type to accurately identify the file type instead of relying solely on the request headers 🧨.
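
Put together, the validation step looks roughly like this (a sketch with error handling and options trimmed; it assumes a reachable ClamAV install):

import { Readable } from "node:stream";
import NodeClam from "clamscan";
import { fileTypeFromBuffer } from "file-type";

// Sketch: validate an upload before it ever reaches storage.
async function assertSafeUpload(file: Buffer, declaredMime: string): Promise<void> {
  // 1. Sniff the real type from the magic bytes instead of trusting headers.
  const detected = await fileTypeFromBuffer(file);
  if (!detected || detected.mime !== declaredMime) {
    throw new Error(`content-type mismatch: got ${detected?.mime ?? "unknown"}`);
  }

  // 2. Scan the content with ClamAV.
  const clamscan = await new NodeClam().init();
  const { isInfected, viruses } = await clamscan.scanStream(Readable.from(file));
  if (isInfected) {
    throw new Error(`infected file rejected: ${viruses.join(", ")}`);
  }
}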

📊 Monitoring

As we regain control, it's essential not to overlook usage monitoring through logs and other metrics.

myunisoft_ged_upload_monitoring_1

3๏ธโƒฃ Migrating Nextcloud documents

To gradually phase out our Nextcloud servers, we developed a temporary Node.js API 🦾 responsible for transferring resources from Nextcloud to Azure. This service handled upload concurrency, which we limited to 64 simultaneous uploads to avoid overloading the server 🔥.
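
The cap itself doesn't require much machinery; a simplified version of the worker-pool idea (the transfer callback stands in for the real Nextcloud-to-Azure copy):

// Drain a queue of transfers with at most `limit` running at any time.
async function runWithConcurrency<T>(
  items: T[],
  transfer: (item: T) => Promise<void>,
  limit = 64
): Promise<void> {
  const queue = [...items];

  async function worker(): Promise<void> {
    let item: T | undefined;
    while ((item = queue.shift()) !== undefined) {
      try {
        await transfer(item);
      }
      catch (error) {
        // Record the failure so the run can be paused and resumed later.
        console.error("transfer failed", item, error);
      }
    }
  }

  await Promise.all(Array.from({ length: limit }, worker));
}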

Albatros_tool

Without detailing every feature of this internal tool, it was designed to support key functionalities such as pausing and resuming the migration process, as well as monitoring the status of each transfer (successes ✅, errors ❌, totals, etc.).

Nextcloud_2_totals

Step 1: Data Extraction

We extracted from the Nextcloud database all the tokens 📄 (used to retrieve document data from the database) and the file paths on the server (used to transfer the resources), saving them into .csv or .txt files.

Nextcloud_export

Step 2: Environment Setup

We then set up a NAS server to run the Node.js tool and directly access the file system 🦄, bypassing the Nextcloud API. This approach was chosen to maximize performance and enable efficient stream-based, parallel processing of the document transfers.

Nextcloud_NAS

Step 3: Create the DB and go 🧨

All that remained was to create the SQLite databases (we chose to generate one database per firm to avoid excessively large files), using the Nextcloud exports that contained tens of millions of rows, and then start the transfers ✅.
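
For the curious, bulk-loading an export into one of those SQLite files looks roughly like this (using better-sqlite3; the file layout and column names are illustrative):

import fs from "node:fs";
import Database from "better-sqlite3";

// One database per firm keeps each file at a manageable size.
const db = new Database("./firm-42.sqlite");
db.pragma("journal_mode = OFF");
db.pragma("synchronous = 0");

db.exec(`CREATE TABLE IF NOT EXISTS documents (
  token TEXT PRIMARY KEY NOT NULL,
  path  TEXT NOT NULL
) WITHOUT ROWID`);

const insert = db.prepare("INSERT OR IGNORE INTO documents (token, path) VALUES (?, ?)");
const insertMany = db.transaction((lines: string[]) => {
  for (const line of lines) {
    const [token, path] = line.split(";");
    insert.run(token, path);
  }
});

// Assumed export format: one "token;path" line per document.
insertMany(fs.readFileSync("./export.csv", "utf8").trim().split("\n"));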

Let's just say we ran into a few surprises along the way 🤫, and the migration ended up taking us several days 🤭.

4๏ธโƒฃ Legacy URLs in PDF

Some URLs are permanently embedded in PDFs 📄, so we need to consider strategies for rewriting them using the information available.

Since Nextcloud tokens didn't contain any tenant information, we created a minimal API (microservice) backed by an SQLite database to maintain the relationship between a token and its corresponding tenant ID.

PRAGMA journal_mode = OFF;       -- no rollback journal: fastest, acceptable for a rebuildable DB
PRAGMA synchronous = 0;          -- don't wait for writes to reach disk
PRAGMA locking_mode = EXCLUSIVE; -- single connection, no lock renegotiation

CREATE TABLE IF NOT EXISTS "tokens" (
  "token" TEXT PRIMARY KEY NOT NULL,
  "schema" INTEGER NOT NULL
) WITHOUT ROWID;

We can manage thousands of tokens within just a few milliseconds ⏱️ using purely synchronous I/O. Additionally, we implemented an LRU cache to ensure that repetitive requests are handled even more quickly.
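
The whole lookup path fits in a few lines; a sketch with better-sqlite3 and lru-cache:

import Database from "better-sqlite3";
import { LRUCache } from "lru-cache";

const db = new Database("./tokens.sqlite", { readonly: true });
const selectSchema = db.prepare(`SELECT "schema" FROM tokens WHERE token = ?`);
const cache = new LRUCache<string, number>({ max: 100_000 });

// Resolve a Nextcloud token to its tenant (schema) ID, synchronously.
function resolveTenant(token: string): number | null {
  const cached = cache.get(token);
  if (cached !== undefined) {
    return cached;
  }

  const row = selectSchema.get(token) as { schema: number } | undefined;
  if (row === undefined) {
    return null;
  }
  cache.set(token, row.schema);

  return row.schema;
}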

The final step is to configure HAProxy 🔀 to redirect Nextcloud viewer requests to a specific GED endpoint, where the URL is parsed to retrieve tokens and correlate them with their respective tenants, using the microservice described above.

const { link } = request.query;

if (!URL.canParse(link)) {
  // Problem
}
// Other URL validation here

const tokens = link.match(
  /(?<=\/)([1-9]{1,4}-\w{15,32}|\w{15})(?=\W|$)/g
);
// Correlate tokens with our microservice database

reply.redirect(`/ged/document/view?tokens=${correlatedTokens.join("|")}`);

This is only a partial overview of the implementation. We use a combination of the WHATWG URL API and regular expressions to extract tokens, ensuring sufficient security to mitigate any ReDoS attack vectors.

We then redirect the request with all tokens to our built-in viewer.

Built-in_viewer

🔬 What we've learned

This project taught us that errors in URLs saved within PDF documents are unforgiving. Due to some technical debt and a lack of foresight, we ended up with an unintended /ged/ged prefix. Today we're having a bit of a laugh about it, and if you ever see this prefix, you'll know it wasn't meant to be 😆.

Managing files with proper streaming while handling errors proved far more challenging than anticipated, plaguing us for weeks with ghost files, memory leaks, and other unexpected bugs. At this level of usage, it's technical excellence or nothing.

โค๏ธ Credits

A migration project of this scale doesn't happen overnight; it took us well over a year to complete all the steps outlined above. A big thank you 🙏 to everyone involved for their dedication and effort ❤️.

  • Nicolas 👨‍💻, for leading the project development from A to Z.
  • The infrastructure team 🏗️ (Vincent, Jean-Charles, and Cyril) for their consistent support throughout the project.
  • Aymeric, for managing and leading the migration of downloads and uploads for his team's services.
  • Many others 👥 for their reviews and support 📝.

This project was incredibly rewarding 🏆, both for its challenges and the range of architectural issues it addressed 📐.


Thank you, see you soon for another technical adventure 😉😊

👋👋👋
