How to store an email archive?


I am going to develop an application that stores emails, and I'm wondering what the best way to store them would be.

I would like to be able to search through them, which makes storing them in separate files very inefficient. So I was wondering which database would be best for this use case.

SQL? NoSQL?
PostgreSQL? MongoDB?

What do you think?


What else does this application need to do, and which decisions (technology stack, ...) have already been made or are dictated by non-functional requirements? What do these messages look like? Will you have to deal with large numbers of binary attachments? Will those attachments be very large?

Approaches that come to mind:

(1) Store all your mail messages as files, one per message, and use something like Elasticsearch or plain Lucene to build an index on top of these files. This gives you a fairly simple store and quite a fast way to search it.

(2) Dump all your messages into a SQL database like PostgreSQL, keeping a few important columns (such as message headers, ...) that need to be searchable and the "raw" mail in a BLOB column. This would allow fast searching at least across the information in those separate columns, but it may get unpleasant if you need full-text search or if mails get bigger (we have to handle messages with > 200 MB of binary attachments at times).

(3) Embed a lightweight IMAP server into your application to do your mail storage. These are built for exactly this purpose and offer query facilities too, but they may have limitations, at least when it comes to full-text search.

I'd probably try (1) first and see how far it takes me. I also recommend not splitting messages (for example, storing binary attachments and text parts in separate stores); this is likely to get nasty rather fast. ;)
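
As a rough illustration of approach (2), here is a minimal Python sketch. It uses the standard-library `sqlite3` module purely as a stand-in for PostgreSQL so that it runs without a server (in PostgreSQL the raw column would typically be `bytea`). The table layout, the column names, and the helpers `store_message` and `find_by_subject` are made up for this example.

```python
import sqlite3
from email import message_from_bytes, policy

# Hypothetical schema: a few searchable header columns plus the raw message.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id      INTEGER PRIMARY KEY,
    sender  TEXT,
    subject TEXT,
    date    TEXT,
    raw     BLOB NOT NULL
)
"""


def _header(msg, name):
    """Return a header as a plain string, or None if it is missing."""
    value = msg[name]
    return str(value) if value is not None else None


def store_message(conn, raw_bytes):
    """Parse the headers, store them next to the untouched raw bytes."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    cur = conn.execute(
        "INSERT INTO messages (sender, subject, date, raw) VALUES (?, ?, ?, ?)",
        (_header(msg, "From"), _header(msg, "Subject"), _header(msg, "Date"), raw_bytes),
    )
    conn.commit()
    return cur.lastrowid


def find_by_subject(conn, text):
    """Substring search on the subject column only (not full-text search)."""
    rows = conn.execute(
        "SELECT id, sender, subject FROM messages WHERE subject LIKE ?",
        (f"%{text}%",),
    )
    return rows.fetchall()
```

Calling `conn = sqlite3.connect(":memory:")` followed by `conn.execute(SCHEMA)` is enough to try it out. Note that this only searches the extracted columns; searching message bodies would still need a full-text index of some kind.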

 

Thank you so much for this detailed reply. I haven't decided on the tech stack yet. I wanted to deal with the toughest choice first.

At first I was thinking of using SQL to store an original version and a "searchable" version, which is basically the email stripped of HTML tags.
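
A "searchable" version like that could be produced with something along these lines. This is only a sketch using Python's standard `html.parser`; the names `_TextExtractor` and `html_to_text` are invented for illustration, and real-world HTML mail will likely need more care than this.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and ignores anything inside <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent blocks do not run together,
        # then collapse all whitespace runs into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `html_to_text("<p>Hello <b>world</b></p>")` returns `"Hello world"`. Joining chunks with a space means a word split by an inline tag would be broken in two, which is usually acceptable for a search copy but is a trade-off.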

After your reply I'm thinking of starting with (1). I have never used Elasticsearch, so it would be nice to learn it along the way. One problem with (1) would be the inode count at some point, but I guess I will stumble upon an easy fix.

As for attachments, I don't expect to receive emails with any. If I do get some, I'll probably compress them and keep them somewhere just in case.
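
One possible fix for the inode problem (an assumption on my part, not something suggested in the thread) would be to pack many messages into a single mbox file instead of one file per message. Python's standard `mailbox` module can do this; `append_to_mbox` and `list_subjects` below are hypothetical helper names.

```python
import mailbox


def append_to_mbox(path, raw_messages):
    """Append raw messages (bytes) to one mbox file, creating it if needed."""
    box = mailbox.mbox(path)
    box.lock()
    try:
        for raw in raw_messages:
            box.add(raw)
    finally:
        # close() flushes pending changes and releases the lock.
        box.close()


def list_subjects(path):
    """Read the Subject header of every message stored in the mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [msg["Subject"] for msg in box]
    finally:
        box.close()
```

The trade-off is that a search index then has to point at a message inside a file rather than at a file path, so this only makes sense once the number of files actually becomes a problem.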

Again, thank you for sharing!

 

Glad I could be of help, and feel free to ask if you need more input. I think you will still have some decisions to make down the road, depending on your actual requirements and the system you're about to build: the total number of messages to be stored, the number of messages coming in per day or hour, and so on. But not having to deal with attachments makes this a bit easier.

 

Archiveopteryx.

It's an IMAP server backed by a SQL database, so it natively understands email, and it's specifically designed for long-term storage and search.
