Data in Cloud

Sugithaa K — Mon, 06 Oct 2025 05:20:05 +0000

Data analytics in the cloud requires efficient ways to store, organize, and process data.
Depending on the use case, different data formats are used — some are human-readable, while others are optimized for speed and scalability.

In this blog, we’ll explore six popular data formats used in cloud-based data analytics:
CSV, SQL, JSON, Parquet, XML, and Avro.

For each format, we’ll:
->Explain it in simple terms
->Show a small dataset
->Represent the dataset in that format

1)CSV (Comma Separated Values)
Explanation:
CSV is the simplest text-based data format.
Each row in a CSV file represents one record, and the fields are separated by commas (,).
It’s widely used because it’s easy to create and can be opened in Excel or Google Sheets.
Our Dataset Example
We’ll use a simple dataset of three students and their marks:

Name    Register_No    Subject            Marks
Kavya   101            Cloud Computing    95
Ravi    102            Data Analytics     88
Meena   103            AI & ML            91

Dataset in CSV Format:

Name,Register_No,Subject,Marks
Kavya,101,Cloud Computing,95
Ravi,102,Data Analytics,88
Meena,103,AI & ML,91

-> Advantages:

Easy to read and edit manually.
Compatible with most software and tools.

-> Disadvantages:

No schema: every value is stored as plain text, so data types have to be inferred or supplied by whoever reads the file.
Not suitable for complex or nested data.
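To see what the missing schema means in practice, here is a minimal Python sketch that reads the CSV text with the standard csv module. The helper name load_students and the choice to convert Register_No and Marks to integers are assumptions made for illustration; the file itself does not say which columns are numeric.

```python
import csv
import io

# The CSV text for the student dataset, held in memory so no file is needed.
CSV_TEXT = """Name,Register_No,Subject,Marks
Kavya,101,Cloud Computing,95
Ravi,102,Data Analytics,88
Meena,103,AI & ML,91
"""

def load_students(text):
    """Parse CSV text into a list of dicts, converting numeric columns by hand."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # csv returns every field as a string; the int conversion is our own
        # assumption about the data, not something the CSV file declares.
        row["Register_No"] = int(row["Register_No"])
        row["Marks"] = int(row["Marks"])
        rows.append(row)
    return rows

students = load_students(CSV_TEXT)
print(students[0])
```

Without the two int() calls, Marks would stay the string "95", and sorting or averaging it would not behave like a number.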

2)SQL (Relational Table Format)
Explanation:
SQL (Structured Query Language) is the language used to define and query relational tables, which store data in rows under defined columns and data types.
It’s used with relational databases such as MySQL, PostgreSQL, or Oracle.

Our Dataset Example

Name    Register_No    Subject            Marks
Kavya   101            Cloud Computing    95
Ravi    102            Data Analytics     88
Meena   103            AI & ML            91

Dataset in SQL Format:

CREATE TABLE Students (
Name VARCHAR(50),
Register_No INT,
Subject VARCHAR(50),
Marks INT
);

INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES
('Kavya', 101, 'Cloud Computing', 95),
('Ravi', 102, 'Data Analytics', 88),
('Meena', 103, 'AI & ML', 91);

-> Advantages:

Data is structured and well-organized.
Easy to query and analyze using SQL commands.

-> Disadvantages:

Requires a database engine.
Not suitable for unstructured or flexible data.
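The statements can be tried without installing a database server. The sketch below uses SQLite through Python's built-in sqlite3 module as a stand-in engine; that is an assumption for convenience, and MySQL, PostgreSQL, or Oracle would need their own drivers and connection details. The two SELECT queries are illustrative additions.

```python
import sqlite3

# An in-memory SQLite database stands in for a full database server here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students ("
    "Name VARCHAR(50), Register_No INT, Subject VARCHAR(50), Marks INT)"
)

# Parameter placeholders (?) keep the values separate from the SQL text.
conn.executemany(
    "INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES (?, ?, ?, ?)",
    [
        ("Kavya", 101, "Cloud Computing", 95),
        ("Ravi", 102, "Data Analytics", 88),
        ("Meena", 103, "AI & ML", 91),
    ],
)

# Students scoring above 90, highest first.
top = conn.execute(
    "SELECT Name, Marks FROM Students WHERE Marks > 90 ORDER BY Marks DESC"
).fetchall()

# Average marks across the whole table.
average = conn.execute("SELECT AVG(Marks) FROM Students").fetchone()[0]

print(top)
print(average)
conn.close()
```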

3)JSON (JavaScript Object Notation)
Explanation:
JSON is a lightweight text-based format used to exchange data between applications.
It stores data as key–value pairs, making it easy for computers and humans to read.

Our Dataset Example

Name    Register_No    Subject            Marks
Kavya   101            Cloud Computing    95
Ravi    102            Data Analytics     88
Meena   103            AI & ML            91

Dataset in JSON Format:
[
{
"Name": "Kavya",
"Register_No": 101,
"Subject": "Cloud Computing",
"Marks": 95
},
{
"Name": "Ravi",
"Register_No": 102,
"Subject": "Data Analytics",
"Marks": 88
},
{
"Name": "Meena",
"Register_No": 103,
"Subject": "AI & ML",
"Marks": 91
}
]

-> Advantages:

Readable and easy to use in web APIs.
Supports nested structures (objects, arrays).

-> Disadvantages:

Larger than CSV, because the field names are repeated in every record.
Parsing can be slower for very large files.
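A short Python sketch, using only the standard json module, shows two of the points from this section: numbers keep their type after parsing, and the repeated keys make the JSON text longer than the equivalent CSV. The CSV string is rebuilt inside the example purely for the size comparison.

```python
import json

students = [
    {"Name": "Kavya", "Register_No": 101, "Subject": "Cloud Computing", "Marks": 95},
    {"Name": "Ravi", "Register_No": 102, "Subject": "Data Analytics", "Marks": 88},
    {"Name": "Meena", "Register_No": 103, "Subject": "AI & ML", "Marks": 91},
]

# Serialize without extra whitespace, then parse the text back.
json_text = json.dumps(students, separators=(",", ":"))
parsed = json.loads(json_text)

# Unlike CSV, the parsed numbers are already integers.
total_marks = sum(s["Marks"] for s in parsed)

# Build the same data as CSV text to compare sizes.
csv_text = "Name,Register_No,Subject,Marks\n" + "".join(
    f'{s["Name"]},{s["Register_No"]},{s["Subject"]},{s["Marks"]}\n' for s in students
)

print(total_marks)
print(len(json_text), len(csv_text))
```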

4)Parquet (Columnar Storage Format)
Explanation:
Parquet is a binary, columnar storage format designed for big data analytics.
Instead of saving data row by row, it stores the values of each column together. Similar values compress well, and a query can read only the columns it needs, which reduces storage space and speeds up analytical queries.
It’s supported by systems like Apache Spark, Hadoop, Amazon Athena, and Google BigQuery.

Our Dataset Example

Name    Register_No    Subject            Marks
Kavya   101            Cloud Computing    95
Ravi    102            Data Analytics     88
Meena   103            AI & ML            91

Dataset in Parquet Format (Conceptual Representation):
Parquet File (Binary Representation)

Columns:
Name: [Kavya, Ravi, Meena]
Register_No: [101, 102, 103]
Subject: [Cloud Computing, Data Analytics, AI & ML]
Marks: [95, 88, 91]

-> Advantages:

Highly compressed and efficient for analytical queries.
Excellent performance in cloud big data systems.

-> Disadvantages:

Not readable manually.
Needs specific software to open or process.
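The row-versus-column idea can be sketched in plain Python. This is only a conceptual illustration of the columnar layout, not the Parquet file format itself: a real Parquet file adds encoding, compression, and metadata, and is normally written with a dedicated library.

```python
# Row-oriented view: one tuple per student.
rows = [
    ("Kavya", 101, "Cloud Computing", 95),
    ("Ravi", 102, "Data Analytics", 88),
    ("Meena", 103, "AI & ML", 91),
]
column_names = ["Name", "Register_No", "Subject", "Marks"]

# Column-oriented view: zip(*rows) transposes the rows into columns,
# so the values of each field end up stored together.
columns = {name: list(values) for name, values in zip(column_names, zip(*rows))}

# An analytical query such as "average marks" touches a single column
# and never has to look at names or subjects.
average_marks = sum(columns["Marks"]) / len(columns["Marks"])

print(columns["Marks"])
print(average_marks)
```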

5)XML (Extensible Markup Language)
Explanation:
XML represents data using tags similar to HTML.
It’s structured and self-descriptive, making it useful for web services and document storage.

Our Dataset Example

Name    Register_No    Subject            Marks
Kavya   101            Cloud Computing    95
Ravi    102            Data Analytics     88
Meena   103            AI & ML            91

Dataset in XML Format:

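<Students>
  <Student>
    <Name>Kavya</Name>
    <Register_No>101</Register_No>
    <Subject>Cloud Computing</Subject>
    <Marks>95</Marks>
  </Student>
  <Student>
    <Name>Ravi</Name>
    <Register_No>102</Register_No>
    <Subject>Data Analytics</Subject>
    <Marks>88</Marks>
  </Student>
  <Student>
    <Name>Meena</Name>
    <Register_No>103</Register_No>
    <Subject>AI &amp; ML</Subject>
    <Marks>91</Marks>
  </Student>
</Students>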

-> Advantages:

Good for hierarchical (tree-like) data.
Self-descriptive structure.

-> Disadvantages:

Very verbose (takes more space).
Slower parsing compared to JSON.
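Here is a minimal sketch of reading the student dataset from XML with Python's standard xml.etree.ElementTree module. The element names (Students, Student, and one tag per column) are assumptions chosen to mirror the column names. Note that the ampersand in "AI & ML" has to be written as &amp; inside XML.

```python
import xml.etree.ElementTree as ET

# Tag names are illustrative; they mirror the dataset's column names.
XML_TEXT = """<Students>
  <Student>
    <Name>Kavya</Name>
    <Register_No>101</Register_No>
    <Subject>Cloud Computing</Subject>
    <Marks>95</Marks>
  </Student>
  <Student>
    <Name>Ravi</Name>
    <Register_No>102</Register_No>
    <Subject>Data Analytics</Subject>
    <Marks>88</Marks>
  </Student>
  <Student>
    <Name>Meena</Name>
    <Register_No>103</Register_No>
    <Subject>AI &amp; ML</Subject>
    <Marks>91</Marks>
  </Student>
</Students>"""

root = ET.fromstring(XML_TEXT)

# Turn each <Student> element into a dict of tag -> text.
students = [
    {child.tag: child.text for child in student}
    for student in root.findall("Student")
]

# Element text is always a string, so numbers need explicit conversion.
marks = [int(s["Marks"]) for s in students]

print(students[2]["Subject"])
print(marks)
```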

6)Avro (Row-based Storage Format)
Explanation:
Avro is a binary, row-based storage format developed as an Apache project within the Hadoop ecosystem.
An Avro data file stores the schema (written in JSON) alongside the binary data, which makes it a good fit for streaming and scalable systems.

Our Dataset Example

Name    Register_No    Subject            Marks
Kavya   101            Cloud Computing    95
Ravi    102            Data Analytics     88
Meena   103            AI & ML            91

Dataset in Avro Format:
Schema (in JSON format):
{
"type": "record",
"name": "Student",
"fields": [
{"name": "Name", "type": "string"},
{"name": "Register_No", "type": "int"},
{"name": "Subject", "type": "string"},
{"name": "Marks", "type": "int"}
]
}
Data (Conceptual):
Row 1: Kavya, 101, Cloud Computing, 95
Row 2: Ravi, 102, Data Analytics, 88
Row 3: Meena, 103, AI & ML, 91

-> Advantages:

Compact binary format (saves space).
Schema evolution supported (fields can be added or removed, typically with default values, while older data stays readable).
Ideal for big data and streaming (Kafka, Hadoop).

-> Disadvantages:

Not human-readable.
Needs Avro libraries to read or write.
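To give a feel for why Avro is compact, the sketch below hand-encodes one student row following the Avro binary encoding rules as I understand them: an int is written as a zigzag variable-length integer, a string as its byte length followed by UTF-8 bytes, and a record as its fields in schema order with no field names. The function names are hypothetical, and this covers only the record body, not the full Avro container file with its header and sync markers. In real projects an Avro library would handle all of this.

```python
def zigzag_varint(n):
    """Encode a signed integer as zigzag + base-128 varint (Avro int/long)."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_string(text):
    """Encode a string as its UTF-8 byte length followed by the bytes."""
    data = text.encode("utf-8")
    return zigzag_varint(len(data)) + data

def encode_student(name, register_no, subject, marks):
    """Concatenate the fields in schema order; no field names are written."""
    return (
        encode_string(name)
        + zigzag_varint(register_no)
        + encode_string(subject)
        + zigzag_varint(marks)
    )

record = encode_student("Kavya", 101, "Cloud Computing", 95)
print(len(record), record.hex())
```

Because the field names live in the schema rather than in each row, the encoded row is much shorter than the same record written as JSON text.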

Hands-On with MongoDB: Storing, Querying, and Analyzing Data

Sugithaa K — Wed, 27 Aug 2025 04:53:51 +0000

In this tutorial, I explored MongoDB, a NoSQL database, to learn how to store, query, and analyze data. I worked with a sample dataset of business reviews and performed common operations like insertion, aggregation, search, update, and deletion.

Step 1: Set Up MongoDB
I installed MongoDB Compass and connected to my local database.

Step 2: Insert Sample Data
I inserted 10 sample business reviews manually into the collection using Compass in JSON mode. Here are the documents:
[
{ "business_id": 1, "name": "Cafe One", "rating": 4, "review": "Good food and service" },
{ "business_id": 2, "name": "Pizza Place", "rating": 5, "review": "Excellent pizza, good staff" },
{ "business_id": 3, "name": "Tea Corner", "rating": 3, "review": "Average taste but good location" },
{ "business_id": 4, "name": "Burger Hub", "rating": 2, "review": "Not good, very slow service" },
{ "business_id": 5, "name": "Sushi World", "rating": 5, "review": "Fresh sushi, good experience" },
{ "business_id": 6, "name": "Taco House", "rating": 4, "review": "Good tacos and friendly staff" },
{ "business_id": 7, "name": "Pasta Point", "rating": 3, "review": "Average pasta but good ambience" },
{ "business_id": 8, "name": "Biryani Express", "rating": 5, "review": "Good biryani, loved it" },
{ "business_id": 9, "name": "Coffee Bar", "rating": 4, "review": "Good coffee, nice place to relax" },
{ "business_id": 10, "name": "Ice Cream Land", "rating": 5, "review": "Very good flavors and service" }
]

Step 3: Running Queries
3.1 Top 5 Businesses with Highest Average Rating

Aggregation query:
[
{ "$group": { "_id": "$business_id", "avgRating": { "$avg": "$rating" } } },
{ "$sort": { "avgRating": -1 } },
{ "$limit": 5 }
]

Explanation:

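I used an aggregation pipeline to group by business_id and calculate the average rating, then sorted in descending order and limited the output to find the top 5. In this sample each business has only one review, so the average is simply that review's rating.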
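For readers who want to see what the three stages do, here is a plain-Python sketch of the same group, sort, and limit steps, run on the sample ratings. It does not talk to MongoDB; the function name top_rated is hypothetical. The order among businesses with equal averages comes from Python's stable sort here, and MongoDB may return ties in a different order.

```python
from collections import defaultdict

# (business_id, rating) pairs taken from the sample documents.
reviews = [
    (1, 4), (2, 5), (3, 3), (4, 2), (5, 5),
    (6, 4), (7, 3), (8, 5), (9, 4), (10, 5),
]

def top_rated(reviews, limit=5):
    """Mimic $group (average per business), $sort (descending), and $limit."""
    ratings = defaultdict(list)
    for business_id, rating in reviews:
        ratings[business_id].append(rating)

    # $group stage: one result document per business_id.
    averages = [
        {"_id": business_id, "avgRating": sum(values) / len(values)}
        for business_id, values in ratings.items()
    ]

    # $sort stage: highest average first.
    averages.sort(key=lambda doc: doc["avgRating"], reverse=True)

    # $limit stage: keep only the first `limit` results.
    return averages[:limit]

top = top_rated(reviews)
for doc in top:
    print(doc)
```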

3.2 Count Reviews Containing “Good”
Filter query:
{ "review": { "$regex": "good", "$options": "i" } }

Explanation:
The regex filter matches every review whose text contains ‘good’ in any letter case; the number of matching documents is the count.
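The same case-insensitive match can be checked locally with Python's re module. This is a sketch on the sample review texts, not a MongoDB call. Like the filter, it matches "good" anywhere in the text, so a word such as "goodness" would also count.

```python
import re

# Review texts from the ten sample documents.
reviews = [
    "Good food and service",
    "Excellent pizza, good staff",
    "Average taste but good location",
    "Not good, very slow service",
    "Fresh sushi, good experience",
    "Good tacos and friendly staff",
    "Average pasta but good ambience",
    "Good biryani, loved it",
    "Good coffee, nice place to relax",
    "Very good flavors and service",
]

# re.IGNORECASE plays the role of the "i" option in the filter.
pattern = re.compile("good", re.IGNORECASE)
matching = [text for text in reviews if pattern.search(text)]

print(len(matching))
```

Note that "Not good, very slow service" is counted too: a text match says nothing about whether the review is positive.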

3.3 Get Reviews for a Specific Business
Query example for business_id = 2:
{ "business_id": 2 }

Explanation:

This query retrieves all reviews for a specific business.

3.4 Update a Review
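Update operator (used together with the filter { "business_id": 1 }):
{ "$set": { "review": "Updated review: Really good service and tasty food!" } }

Explanation:
I updated the review for business_id = 1 to give more detailed feedback. The $set operator changes only the review field and leaves the other fields as they are.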
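An update has two parts: a filter that selects the document and an operator that says what to change. The plain-Python sketch below imitates that behaviour on an in-memory list. The helper update_one is a hypothetical stand-in named after the MongoDB operation, not a database call, and only two of the sample documents are included to keep it short.

```python
docs = [
    {"business_id": 1, "name": "Cafe One", "rating": 4, "review": "Good food and service"},
    {"business_id": 2, "name": "Pizza Place", "rating": 5, "review": "Excellent pizza, good staff"},
]

def update_one(docs, filter_, set_fields):
    """Apply set_fields to the first document matching filter_; return the count changed."""
    for doc in docs:
        if all(doc.get(key) == value for key, value in filter_.items()):
            doc.update(set_fields)  # like $set: only the listed fields change
            return 1
    return 0

modified = update_one(
    docs,
    {"business_id": 1},
    {"review": "Updated review: Really good service and tasty food!"},
)

print(modified)
print(docs[0])
```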

3.5 Delete a Record

Explanation:
I deleted one record from the collection to demonstrate the deletion operation.
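Deleting works the same way as querying: a filter selects the document to remove. The sketch below imitates removing the first match from an in-memory list. The helper delete_one is hypothetical, and the filter on business_id 10 is an assumption, since the post does not say which record was deleted.

```python
docs = [
    {"business_id": 9, "name": "Coffee Bar", "rating": 4},
    {"business_id": 10, "name": "Ice Cream Land", "rating": 5},
]

def delete_one(docs, filter_):
    """Remove the first document matching filter_; return the number removed."""
    for index, doc in enumerate(docs):
        if all(doc.get(key) == value for key, value in filter_.items()):
            del docs[index]
            return 1
    return 0

# business_id 10 is an assumed example, not taken from the post.
deleted = delete_one(docs, {"business_id": 10})

print(deleted)
print([doc["business_id"] for doc in docs])
```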
