3 Reasons why we need an Open Source Data Infrastructure Platform

#opensource #devops

TL;DR: Speeding up the setup of data infrastructure, commoditising it, and improving its developer experience is the need of the hour, and open sourcing Gigahex is a first step towards this.

More than a decade in the Big Data industry has made me realize that managing open source distributed systems is a painful experience, one that costs you sleepless nights. Cloud vendors like AWS, GCP and Azure have come to the rescue by offering managed services for an extra platform fee, generally paid per hour per compute instance. This seems reasonable, and large organizations with deep pockets may keep up with the cloud bills, but many SMBs and research institutes may not have the funding to support their work this way.

I want to highlight the three main reasons why it's time to build the Data Infrastructure Platform in the open.

Launch Data Infrastructure in under 60 seconds

We live in a world of supercomputers and Google, where we get answers to the most fascinating questions at the click of a button. But when it comes to setting up a development or testing environment for data engineers, it takes hours or even days of Slack messages, email threads and escalations.

Why can’t we get things up and running in under 60 seconds?

Pay based on criticality of data application

Open source software is free, but deploying and managing it is extremely costly and time consuming. Cloud vendors provide managed services for most of the popular data services: Databricks, AWS EMR, GCP Dataproc, Azure Analytics and a few others.

Why is there no established open source alternative that provides an end-to-end solution for setting up data infrastructure and an analytics engine?
Such an alternative would let businesses choose the right data platform for each application, based on how fast it needs to be and what SLA it requires.

Stay sane in the world of multiple browser tabs

Data engineers are constantly mastering the skill of Cmd+Tab / Win+Tab to find the right window that explains why a job failed, an executor was lost, a session was terminated, or an OOM error was raised. Is it an application issue or an infrastructure issue?
Because data applications are tightly coupled to the infrastructure, each data engineer also needs to be good at Data Ops. This drags them into a world of total chaos, jumping from tab to tab, mail to Slack, Slack to Zoom, until they finally demand that Friday come earlier :)

So why can’t we have an open source data platform to marry the data infrastructure to the data applications?

The new gang on the open source street

Gigahex is making its debut in the world of open source to solve the issues above. The first release enables developers to launch single-node Apache Spark, Kafka and Hadoop clusters on their local machine.

Give it a try and let us know your feedback.
