
What are the most suitable datastores for storing a huge number of articles and news?

FaN000s

I have been assigned to a big new project at my current company. The project will collect a huge number of news articles from different sources.

The full requirements are not clear yet, but we can anticipate some of them. For example:

Building some dashboards to display some statistics about the collected news.
Full-text search (exact, fuzzy, and synonym)
Providing a way for other teams (specifically the data analysis team) to query the data.
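
To make the three search modes in the list above concrete, here is a minimal Python sketch using only the standard library. It is not how a search engine such as Elasticsearch implements them; the sample headlines, the `SYNONYMS` table, and all function names are hypothetical and exist only for illustration.

```python
from difflib import get_close_matches

# Hypothetical sample data; real headlines would come from the collected sources.
ARTICLES = {
    1: "Central bank raises interest rates",
    2: "Local football team wins championship",
    3: "Government announces new budget",
}

# Hypothetical synonym table; a real one would be curated per language.
SYNONYMS = {"soccer": {"football"}, "football": {"soccer"}}


def tokens(text):
    """Split a headline into lowercase words."""
    return text.lower().split()


def exact_search(term):
    """Return IDs of articles containing the term as a whole word."""
    term = term.lower()
    return [aid for aid, title in ARTICLES.items() if term in tokens(title)]


def fuzzy_search(term, cutoff=0.8):
    """Return IDs of articles with a word close to a possibly misspelled term."""
    term = term.lower()
    return [
        aid
        for aid, title in ARTICLES.items()
        if get_close_matches(term, tokens(title), n=1, cutoff=cutoff)
    ]


def synonym_search(term):
    """Return IDs of articles matching the term or any of its synonyms."""
    wanted = {term.lower()} | SYNONYMS.get(term.lower(), set())
    return [aid for aid, title in ARTICLES.items() if wanted & set(tokens(title))]
```

A dedicated search engine would add ranking, stemming, and an index so that queries do not scan every article, which this sketch deliberately leaves out.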

What would you suggest as a datastore for such a project?

I believe there is no one-size-fits-all solution to this type of project.

As a start, I am thinking of using Elassandra, as it combines Cassandra and Elasticsearch, which may satisfy the first two points (Cassandra for aggregation and analytics, Elasticsearch for full-text search).

The third point is still not satisfied. The data analysis people are more familiar with SQL, which neither Cassandra nor Elasticsearch fully provides.

The other approach I am considering is to have a separate store for the analysis team, with the application responsible for writing the data writing it to both stores.
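
A rough Python sketch of that dual-write idea, under heavy assumptions: `sqlite3` stands in for whatever SQL store the analysis team would get, a plain dict stands in for the primary/search store, and `save_article` plus the sample rows are hypothetical.

```python
import sqlite3

# Stand-ins only: sqlite3 plays the SQL store for analysts, and a plain dict
# plays the primary/search store. Neither is a product recommendation.
sql_store = sqlite3.connect(":memory:")
sql_store.execute(
    "CREATE TABLE articles ("
    "id INTEGER PRIMARY KEY, source TEXT, title TEXT, published TEXT)"
)
search_store = {}


def save_article(article):
    """Write one article to both stores (hypothetical helper)."""
    # The connection context manager commits on success and rolls back on any
    # exception, so a failure in the second write also undoes the SQL insert.
    with sql_store:
        sql_store.execute(
            "INSERT INTO articles VALUES (:id, :source, :title, :published)",
            article,
        )
        search_store[article["id"]] = article


# Made-up sample rows.
save_article({"id": 1, "source": "BBC", "title": "Budget announced", "published": "2019-03-01"})
save_article({"id": 2, "source": "CNN", "title": "Rates raised", "published": "2019-03-01"})
save_article({"id": 3, "source": "BBC", "title": "Match report", "published": "2019-03-02"})

# The kind of query the analysis team could run directly in SQL.
per_source = sql_store.execute(
    "SELECT source, COUNT(*) FROM articles GROUP BY source ORDER BY source"
).fetchall()
```

With two real networked stores there is no shared transaction, so the writer would also need retries or a reconciliation job to keep the stores from drifting apart.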

What do you think?

Discussion

Adrian B.G.

IF you have a good team of experienced SysAdmins/Data Engineers to maintain the clusters:

To do it right for the long term, I would choose three solutions for three problems.
Long-term storage: something horizontally scalable with replication (Cassandra, or maybe Kafka with streams?).
A nice alternative would be storing the documents in S3.

From "the source of truth" you can move data with NiFi or other solutions to other platforms that can change over time. This is the trick.
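
The "source of truth" idea can be sketched in a few lines of Python. This is only an in-memory illustration, not NiFi or any real pipeline: the append-only list, the two derived views, and every name below are hypothetical.

```python
from collections import Counter

# Hypothetical append-only log acting as the source of truth.
source_of_truth = []


def ingest(article):
    """Append a raw article; nothing downstream is updated here."""
    source_of_truth.append(dict(article))


def build_search_view(log):
    """Derive a word -> article IDs mapping from the log."""
    view = {}
    for article in log:
        for word in article["title"].lower().split():
            view.setdefault(word, set()).add(article["id"])
    return view


def build_stats_view(log):
    """Derive article counts per source from the log."""
    return Counter(article["source"] for article in log)


# Made-up sample articles.
ingest({"id": 1, "source": "BBC", "title": "Budget announced"})
ingest({"id": 2, "source": "CNN", "title": "Budget debate continues"})

# Any view can be thrown away and rebuilt by replaying the log.
search_view = build_search_view(source_of_truth)
stats_view = build_stats_view(source_of_truth)
```

Because every downstream view is derived from the log, swapping the search or analytics platform later means writing a new builder and replaying, not migrating data between platforms.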

Text search: Elasticsearch is one option.

Real-time analytics/aggregation: Apache Beam/Spark/Flink.

Once-a-month heavy-duty analytics and discovery: a dedicated database where you can load tons of data, extract the report, and shut it down (BigQuery, AWS Athena, Aurora, ...).

ELSE (you do not have a big team of SREs and DevOps engineers):
use managed solutions; I would suggest Google Cloud.

FaN000s Author

I think we are in a trap :) :) :)

We do not have a team of experienced SysAdmins/Data Engineers and I do not think storing the data outside our data center will be an acceptable choice :) :) :) :).

Adrian B.G.

I would suggest looking for another job, but hey, that's just my trivial opinion 😀

This situation is usually a sign of much wider, company-wide issues.

rhymes

How massive? Why not a relational DB like PostgreSQL? I'm not saying it's the right choice, but what was the process of elimination?

You can store huge quantities of data, data analysis can be performed (on the live version or, even better, on a read replica), full-text search is supported, and if that is too limited you can still use Elasticsearch just for search.

It's definitely easier to set up and handle than Cassandra...

FaN000s Author

How massive? There is no specific number, but from what I can tell we may be asked to collect everything we can from major news sites like BBC, CNN, etc., plus some other blogs and news sites.

Why not PostgreSQL (or a relational DB)? Actually, there is no reason, and I am currently looking at CitusData as an option. Another option is Postgres-XL.

The concerns with relational databases are the size of the data, how easy it is to scale and add new nodes, and high availability, all of which NoSQL databases provide by default. That is why we give NoSQL databases a higher priority.

rhymes

"How massive? There is no specific number, but from what I can tell we may be asked to collect everything we can from major news sites like BBC, CNN, etc., plus some other blogs and news sites."

I would only consider an alternative to PostgreSQL once you are in the hundreds of millions, and even then it depends on what you do with the data :D

"Why not PostgreSQL (or a relational DB)? Actually, there is no reason, and I am currently looking at CitusData as an option."

I've heard about it from some colleagues. Check what limitations it has, because it's not exactly like PostgreSQL. I just checked the website: they have been acquired by Microsoft, eheh.

"The concerns with relational databases are the size of the data, how easy it is to scale and add new nodes, and high availability, all of which NoSQL databases provide by default. That is why we give NoSQL databases a higher priority."

Gotcha, obviously keep in mind the tradeoffs.

In any case, I would separate the search, due to size requirements, from the "single source of truth" DB.
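
One way to picture that separation, as a hedged Python sketch: the search side keeps only words and article IDs, and full records are always fetched from the primary DB. Here `sqlite3` is merely a stand-in for the primary database, the dict is a stand-in for a search service, and `add_article` and `search` are hypothetical names.

```python
import sqlite3

# Stand-in for the "single source of truth" DB.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")

# Stand-in for the search service: word -> set of article IDs, no full text.
index = {}


def add_article(article_id, title, body):
    """Store the full article in the primary DB, then index only its words."""
    with primary:
        primary.execute(
            "INSERT INTO articles VALUES (?, ?, ?)", (article_id, title, body)
        )
    for word in set(f"{title} {body}".lower().split()):
        index.setdefault(word, set()).add(article_id)


def search(word):
    """Look up IDs in the index, then fetch the records from the primary DB."""
    ids = sorted(index.get(word.lower(), ()))
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    return primary.execute(
        f"SELECT id, title FROM articles WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()


# Made-up sample articles.
add_article(1, "Election results", "Votes were counted overnight")
add_article(2, "Weather update", "Rain expected after the election")
matches = search("election")
```

Since the index holds nothing that cannot be rebuilt from the primary DB, it can be resized, reindexed, or replaced without putting the articles themselves at risk.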