<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: CIBIN S </title>
    <description>The latest articles on DEV Community by CIBIN S  (@cibin_s).</description>
    <link>https://dev.to/cibin_s</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3482844%2F6ba7a6df-94b2-4c46-8827-7d86e6572a7b.png</url>
      <title>DEV Community: CIBIN S </title>
      <link>https://dev.to/cibin_s</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cibin_s"/>
    <language>en</language>
    <item>
      <title>🧠Understanding 6 Common Data Formats in Cloud Data Analytics</title>
      <dc:creator>CIBIN S </dc:creator>
      <pubDate>Mon, 10 Nov 2025 04:41:24 +0000</pubDate>
      <link>https://dev.to/cibin_s/understanding-6-common-data-formats-in-cloud-data-analytics-n9f</link>
      <guid>https://dev.to/cibin_s/understanding-6-common-data-formats-in-cloud-data-analytics-n9f</guid>
      <description>&lt;p&gt;Data analytics relies heavily on how data is stored, exchanged, and processed. Different data formats are optimized for different use cases — from simple spreadsheets to large-scale distributed processing. In this blog, let’s explore six popular data formats used in cloud-based analytics: CSV, SQL, JSON, Parquet, XML, and Avro.&lt;/p&gt;

&lt;p&gt;We’ll use a simple dataset throughout all examples 👇&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name    Register_No Subject Marks
Arjun   101 Math    90
Priya   102 Science 88
Kavin   103 English 92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CSV (Comma Separated Values)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Explanation:&lt;br&gt;
CSV is the simplest and most human-readable format for storing tabular data. Each line represents a row, and commas separate individual values. It’s widely used for data import/export in spreadsheets and analytics tools.&lt;/p&gt;

&lt;p&gt;Example (data.csv):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name,Register_No,Subject,Marks
Arjun,101,Math,90
Priya,102,Science,88
Kavin,103,English,92
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
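&lt;p&gt;As a quick check, the same rows can be read with Python’s standard csv module (the file text is inlined below so the sketch runs without data.csv on disk):&lt;/p&gt;

```python
import csv
import io

# Inline the data.csv contents so the snippet needs no file on disk.
csv_text = """Name,Register_No,Subject,Marks
Arjun,101,Math,90
Priya,102,Science,88
Kavin,103,English,92
"""

# DictReader maps each data row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["Name"])                      # Arjun
print(sum(int(r["Marks"]) for r in rows))   # 270
```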

&lt;p&gt;&lt;strong&gt;SQL (Relational Table Format)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Explanation:&lt;br&gt;
SQL itself is a query language rather than a storage format; a .sql script captures data as the statements that create and populate structured tables in a relational database, where it can be queried, joined, and managed efficiently.&lt;/p&gt;

&lt;p&gt;Example (data.sql):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE Students (
  Name VARCHAR(20),
  Register_No INT,
  Subject VARCHAR(20),
  Marks INT
);

INSERT INTO Students VALUES ('Arjun', 101, 'Math', 90);
INSERT INTO Students VALUES ('Priya', 102, 'Science', 88);
INSERT INTO Students VALUES ('Kavin', 103, 'English', 92);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
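&lt;p&gt;The same statements can be tried locally with Python’s built-in sqlite3 module (an in-memory database here; SQLite accepts these column declarations as-is):&lt;/p&gt;

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students (
  Name VARCHAR(20),
  Register_No INT,
  Subject VARCHAR(20),
  Marks INT
)""")
students = [
    ("Arjun", 101, "Math", 90),
    ("Priya", 102, "Science", 88),
    ("Kavin", 103, "English", 92),
]
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?)", students)

# Query it back: highest mark first.
top = conn.execute(
    "SELECT Name, Marks FROM Students ORDER BY Marks DESC"
).fetchone()
print(top)  # ('Kavin', 92)
```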

&lt;p&gt;&lt;strong&gt;JSON (JavaScript Object Notation)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Explanation:&lt;br&gt;
JSON is a lightweight data-interchange format that stores data as key-value pairs. It is easy for both humans and machines to read and is widely used in APIs, web applications, and NoSQL databases.&lt;/p&gt;

&lt;p&gt;Example (data.json):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
  {"Name": "Arjun", "Register_No": 101, "Subject": "Math", "Marks": 90},
  {"Name": "Priya", "Register_No": 102, "Subject": "Science", "Marks": 88},
  {"Name": "Kavin", "Register_No": 103, "Subject": "English", "Marks": 92}
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
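&lt;p&gt;Parsing this with Python’s json module turns the text into a list of dicts, and json.dumps round-trips it back to a string:&lt;/p&gt;

```python
import json

json_text = """[
  {"Name": "Arjun", "Register_No": 101, "Subject": "Math", "Marks": 90},
  {"Name": "Priya", "Register_No": 102, "Subject": "Science", "Marks": 88},
  {"Name": "Kavin", "Register_No": 103, "Subject": "English", "Marks": 92}
]"""

students = json.loads(json_text)   # list of dicts
print(students[1]["Subject"])      # Science
print(json.dumps(students[0]))     # serialize one record back to JSON text
```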

&lt;p&gt;&lt;strong&gt;Parquet (Columnar Storage Format)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Explanation:&lt;br&gt;
Parquet is an efficient, columnar storage format used in big data systems like Hadoop, Spark, and AWS Athena. It stores data by columns instead of rows, allowing faster read performance and better compression for analytical queries.&lt;/p&gt;

&lt;p&gt;Example (Conceptual View):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Column 1: Name → [Arjun, Priya, Kavin]
Column 2: Register_No → [101, 102, 103]
Column 3: Subject → [Math, Science, English]
Column 4: Marks → [90, 88, 92]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;(In reality, Parquet is a binary format, so the data is stored in compressed column chunks rather than text.)&lt;/p&gt;
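&lt;p&gt;Writing real Parquet files needs a library such as pyarrow; the pure-Python sketch below only illustrates the row-to-column pivot the format is built on:&lt;/p&gt;

```python
rows = [
    {"Name": "Arjun", "Register_No": 101, "Subject": "Math", "Marks": 90},
    {"Name": "Priya", "Register_No": 102, "Subject": "Science", "Marks": 88},
    {"Name": "Kavin", "Register_No": 103, "Subject": "English", "Marks": 92},
]

# Pivot row-oriented records into column-oriented arrays: the core idea
# behind Parquet's layout and why analytical scans compress so well.
columns = {key: [row[key] for row in rows] for key in rows[0]}
print(columns["Marks"])  # [90, 88, 92]

# A column scan (e.g. averaging Marks) now touches one array only.
avg = sum(columns["Marks"]) / len(columns["Marks"])
print(avg)               # 90.0

# With pyarrow installed, the real write would be:
#   import pyarrow as pa, pyarrow.parquet as pq
#   pq.write_table(pa.table(columns), "data.parquet")
```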

&lt;p&gt;&lt;strong&gt;XML (Extensible Markup Language)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Explanation:&lt;br&gt;
XML represents data in a tree structure with tags. It’s commonly used for configuration files, data exchange, and web services (SOAP). Each element is enclosed in start and end tags, providing structure and hierarchy.&lt;/p&gt;

&lt;p&gt;Example (data.xml):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;Students&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Arjun&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;101&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;Math&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;90&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Priya&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;102&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;Science&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;88&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
  &amp;lt;Student&amp;gt;
    &amp;lt;Name&amp;gt;Kavin&amp;lt;/Name&amp;gt;
    &amp;lt;Register_No&amp;gt;103&amp;lt;/Register_No&amp;gt;
    &amp;lt;Subject&amp;gt;English&amp;lt;/Subject&amp;gt;
    &amp;lt;Marks&amp;gt;92&amp;lt;/Marks&amp;gt;
  &amp;lt;/Student&amp;gt;
&amp;lt;/Students&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
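&lt;p&gt;With Python’s xml.etree.ElementTree the same tree can be built and walked programmatically (built in code here to keep the sketch self-contained; ET.parse("data.xml") would load the file version instead):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Build the Students tree element by element.
students = ET.Element("Students")
for name, reg, subject, marks in [
    ("Arjun", "101", "Math", "90"),
    ("Priya", "102", "Science", "88"),
    ("Kavin", "103", "English", "92"),
]:
    s = ET.SubElement(students, "Student")
    ET.SubElement(s, "Name").text = name
    ET.SubElement(s, "Register_No").text = reg
    ET.SubElement(s, "Subject").text = subject
    ET.SubElement(s, "Marks").text = marks

# Walk the hierarchy just like any parsed XML document.
names = [s.findtext("Name") for s in students.findall("Student")]
print(names)  # ['Arjun', 'Priya', 'Kavin']
```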



&lt;p&gt;&lt;strong&gt;Avro (Row-Based Storage Format)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Explanation:&lt;br&gt;
Avro is a compact binary format often used in Apache Hadoop and Kafka. It stores data along with its schema, making it ideal for data streaming and serialization between services.&lt;/p&gt;

&lt;p&gt;Example (Schema + Data):&lt;br&gt;
Schema (avro_schema.json):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Data (conceptual view):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"Name": "Arjun", "Register_No": 101, "Subject": "Math", "Marks": 90}
{"Name": "Priya", "Register_No": 102, "Subject": "Science", "Marks": 88}
{"Name": "Kavin", "Register_No": 103, "Subject": "English", "Marks": 92}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;(Stored in binary format during real usage.)&lt;/p&gt;
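&lt;p&gt;Real Avro serialization is handled by a library such as fastavro, which writes the schema alongside the binary records; the sketch below only mimics the schema check in plain Python:&lt;/p&gt;

```python
# The record schema from avro_schema.json, as a Python dict.
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
    ],
}

PYTHON_TYPES = {"string": str, "int": int}

def matches_schema(record, schema):
    # A record must have exactly the declared fields, each with the declared type.
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(
        isinstance(record[f["name"]], PYTHON_TYPES[f["type"]]) for f in fields
    )

ok = matches_schema(
    {"Name": "Arjun", "Register_No": 101, "Subject": "Math", "Marks": 90}, schema
)
bad = matches_schema(
    {"Name": "Arjun", "Register_No": "101", "Subject": "Math", "Marks": 90}, schema
)
print(ok, bad)  # True False

# With fastavro installed, actual binary I/O would be:
#   from fastavro import writer
#   with open("students.avro", "wb") as f:
#       writer(f, schema, records)
```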

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each data format serves a unique purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV → Simple and human-readable&lt;/li&gt;
&lt;li&gt;SQL → Structured and relational&lt;/li&gt;
&lt;li&gt;JSON → Flexible and web-friendly&lt;/li&gt;
&lt;li&gt;Parquet → Optimized for analytics&lt;/li&gt;
&lt;li&gt;XML → Hierarchical and descriptive&lt;/li&gt;
&lt;li&gt;Avro → Compact and schema-based&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choosing the right data format depends on your use case, data size, and processing tools.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>MongoDB Hands-On</title>
      <dc:creator>CIBIN S </dc:creator>
      <pubDate>Sat, 06 Sep 2025 04:29:43 +0000</pubDate>
      <link>https://dev.to/cibin_s/mongodb-hands-on-30cf</link>
      <guid>https://dev.to/cibin_s/mongodb-hands-on-30cf</guid>
      <description>&lt;p&gt;Hello All,&lt;br&gt;
    I’ve been exploring how NoSQL databases work, and MongoDB was the perfect place to start. Its document-oriented structure makes it simple to model real-world data. Unlike relational databases, it doesn’t enforce a predefined schema, which allows far more flexibility. This article covers my step-by-step MongoDB practice project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Installed MongoDB Compass locally for an easy GUI-based interaction.&lt;br&gt;
Created a database named yelpDB and a collection named reviews.&lt;br&gt;
Imported and manually added a dataset of sample Yelp-style business reviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tasks Performed&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Insert Records&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I manually inserted at least 10 records into the reviews collection.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvbeyet9ut78usp6471o.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffvbeyet9ut78usp6471o.jpeg" alt="MongoDB Compass screenshot of the inserted review records" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
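&lt;p&gt;The insert step as a pymongo sketch (field names are my assumption from the screenshots, not the original dataset; the connection lines are commented out so the snippet stands alone):&lt;/p&gt;

```python
# Assumed shape of the Yelp-style review documents.
reviews = [
    {"business_id": f"B{i}", "business_name": f"Business {i}",
     "rating": 3 + i % 3, "text": "good food and service"}
    for i in range(1, 11)
]

# With a running local MongoDB this is one call via pymongo:
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   client["yelpDB"]["reviews"].insert_many(reviews)

print(len(reviews))  # 10
```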

&lt;p&gt;&lt;strong&gt;3. Queries&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Top 5 Businesses with Highest Average Rating&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using the aggregation pipeline with $group and $sort, I retrieved the top 5 businesses with the highest average ratings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o3qufqjteu6cbs5zpn5.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o3qufqjteu6cbs5zpn5.jpeg" alt="MongoDB Compass screenshot of the top 5 businesses by average rating" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;
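&lt;p&gt;The pipeline in the shape pymongo would send it (field names assumed), with the same $group/$sort/$limit logic replayed in plain Python over sample documents:&lt;/p&gt;

```python
from collections import defaultdict

# Run against a live collection with: collection.aggregate(pipeline)
pipeline = [
    {"$group": {"_id": "$business_name", "avg_rating": {"$avg": "$rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5},
]

# Same logic replayed over sample documents:
docs = [
    {"business_name": "Healthy Bites", "rating": 5},
    {"business_name": "Healthy Bites", "rating": 4},
    {"business_name": "Pizza Hub", "rating": 3},
]
ratings = defaultdict(list)
for d in docs:
    ratings[d["business_name"]].append(d["rating"])
top5 = sorted(
    ((name, sum(r) / len(r)) for name, r in ratings.items()),
    key=lambda pair: pair[1],
    reverse=True,
)[:5]
print(top5)  # [('Healthy Bites', 4.5), ('Pizza Hub', 3.0)]
```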

&lt;ul&gt;
&lt;li&gt;Count Reviews Containing the Word “good”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To analyze sentiment, I searched for reviews that contained the word "good" using a regex query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl6nmbou6c1z75gciyfo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl6nmbou6c1z75gciyfo.jpeg" alt="MongoDB Compass screenshot of the regex query counting reviews containing the word good" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
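&lt;p&gt;A sketch of the regex filter (pymongo form in the comment; the field name is assumed), with the same case-insensitive match replayed over sample texts:&lt;/p&gt;

```python
import re

# pymongo form against a live collection:
#   collection.count_documents({"text": {"$regex": "good", "$options": "i"}})
query = {"text": {"$regex": "good", "$options": "i"}}

# Same check over sample review texts:
texts = ["Good food", "Average experience", "really good coffee"]
pattern = re.compile(query["text"]["$regex"], re.IGNORECASE)
count = sum(1 for t in texts if pattern.search(t))
print(count)  # 2
```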

&lt;ul&gt;
&lt;li&gt;Get All Reviews for a Specific Business ID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I queried for all reviews belonging to a specific business ID (e.g., B7 – Healthy Bites).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cd9dmabdc1dpy8k5wqw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cd9dmabdc1dpy8k5wqw.jpeg" alt="MongoDB Compass screenshot of all reviews for business B7 (Healthy Bites)" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;
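&lt;p&gt;The equality filter in pymongo form (field name assumed), mirrored over sample documents:&lt;/p&gt;

```python
# pymongo form against a live collection:
#   list(collection.find({"business_id": "B7"}))
query = {"business_id": "B7"}

docs = [
    {"business_id": "B7", "text": "fresh salads"},
    {"business_id": "B2", "text": "slow service"},
    {"business_id": "B7", "text": "great smoothies"},
]
# find() with an equality filter keeps documents whose field matches exactly.
matches = [d for d in docs if d["business_id"] == query["business_id"]]
print(len(matches))  # 2
```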

&lt;ul&gt;
&lt;li&gt;Update a Review &amp;amp; Delete a Record&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I performed both update and delete operations on the dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq44dzd7dbqte716p11hs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq44dzd7dbqte716p11hs.jpeg" alt="MongoDB Compass screenshot of the update and delete operations" width="800" height="421"&gt;&lt;/a&gt;&lt;br&gt;
Updated a review to reflect improved service.&lt;br&gt;
Deleted one record flagged for removal.&lt;/p&gt;
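&lt;p&gt;The update and delete in pymongo form (comments; filter fields assumed), with their semantics mirrored in plain Python:&lt;/p&gt;

```python
# pymongo forms against a live collection:
#   collection.update_one({"business_id": "B7"},
#                         {"$set": {"text": "Service has improved a lot"}})
#   collection.delete_one({"flagged": True})

docs = [
    {"business_id": "B7", "text": "okay service", "flagged": False},
    {"business_id": "B2", "text": "spam review", "flagged": True},
]

# update_one with $set: overwrite one field on the first matching document.
for d in docs:
    if d["business_id"] == "B7":
        d["text"] = "Service has improved a lot"
        break

# delete_one: drop the first document matching the filter.
docs = [d for d in docs if not d["flagged"]]
print(len(docs), docs[0]["text"])
```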

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
This hands-on exercise provided practical exposure to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing data in MongoDB Compass.&lt;/li&gt;
&lt;li&gt;Performing CRUD operations (Create, Read, Update, Delete).&lt;/li&gt;
&lt;li&gt;Writing queries &amp;amp; aggregation pipelines.&lt;/li&gt;
&lt;li&gt;Exporting data for reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, I learned how powerful MongoDB is for handling unstructured data and performing flexible queries without rigid schema constraints.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
