Understanding Data Formats in Cloud & Data Analytics

Prakalya Sambathkumar — Mon, 06 Oct 2025 07:09:54 +0000

When working with data in cloud systems or analytics projects, the format you store your data in can make a huge difference in performance, scalability, and compatibility.

Different data formats are designed for different purposes — some are easy to read, while others are optimized for large-scale analytics.

In this article, we’ll explore six widely used data formats in analytics:
CSV, SQL, JSON, Parquet, XML, and Avro, using a simple student dataset as an example.

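🎯 Sample Dataset

Name   Register_No  Subject           Marks
Abi    201          Statistics        100
Mano   250          Computer Science  99
Priya  260          English           95
Riya   265          Maths             100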

1️⃣ CSV (Comma-Separated Values)

CSV is the simplest and most familiar data format. Each record is stored as a row, with commas separating the values.
📄 Example:

Name,Register_No,Subject,Marks
Abi,201,Statistics,100
Mano,250,Computer Science,99
Priya,260,English,95
Riya,265,Maths,100

✅ Pros: Easy to create, read, and share.

⚠️ Cons: No data types or schema; struggles with nested or complex data.
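The "no data types" limitation is easy to see with Python's standard csv module: every value comes back as a string, so numeric columns have to be converted by hand. A minimal sketch, assuming the sample above is held in an in-memory string rather than a file:

```python
import csv
import io

# The sample dataset, held in memory instead of a file for this sketch.
csv_text = """Name,Register_No,Subject,Marks
Abi,201,Statistics,100
Mano,250,Computer Science,99
Priya,260,English,95
Riya,265,Maths,100
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information: every field is read as a string.
print(type(rows[0]["Marks"]).__name__)  # str

# Numeric columns must be cast explicitly before doing arithmetic.
total_marks = sum(int(row["Marks"]) for row in rows)
print(total_marks)  # 394
```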

2️⃣ SQL (Structured Query Language)

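Strictly speaking, SQL is a query language rather than a file format, but the relational tables it defines organize data into typed rows and columns, and SQL scripts are a common way to move that data between systems.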

📄 Example:

CREATE TABLE Students (
    Name VARCHAR(50),
    Register_No INT,
    Subject VARCHAR(50),
    Marks INT
);

INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES
('Abi', 201, 'Statistics', 100),
('Mano', 250, 'Computer Science', 99),
('Priya', 260, 'English', 95),
('Riya', 265, 'Maths', 100);

✅ Pros: Enforces schema and data types; easy to query and manage structured data.

⚠️ Cons: Poorly suited to unstructured or hierarchical data.
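To try the statements above without installing a database server, they can be run against an in-memory SQLite database from Python. This is only a sketch: SQLite is used because it ships with Python, and it is more lenient about column types than most database servers.

```python
import sqlite3

# In-memory database, discarded when the connection closes.
conn = sqlite3.connect(":memory:")

conn.execute(
    """
    CREATE TABLE Students (
        Name VARCHAR(50),
        Register_No INT,
        Subject VARCHAR(50),
        Marks INT
    )
    """
)

# Parameterized inserts keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES (?, ?, ?, ?)",
    [
        ("Abi", 201, "Statistics", 100),
        ("Mano", 250, "Computer Science", 99),
        ("Priya", 260, "English", 95),
        ("Riya", 265, "Maths", 100),
    ],
)

# Querying is where the relational format pays off.
top_scorers = conn.execute(
    "SELECT Name, Marks FROM Students WHERE Marks = 100 ORDER BY Name"
).fetchall()
print(top_scorers)  # [('Abi', 100), ('Riya', 100)]

conn.close()
```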

3️⃣ JSON (JavaScript Object Notation)

JSON stores data in key-value pairs. It’s lightweight, flexible, and widely used for APIs and NoSQL databases.

📄 Example:

[
  {
    "Name": "Abi",
    "Register_No": 201,
    "Subject": "Statistics",
    "Marks": 100
  },
  {
    "Name": "Mano",
    "Register_No": 250,
    "Subject": "Computer Science",
    "Marks": 99
  },
  {
    "Name": "Priya",
    "Register_No": 260,
    "Subject": "English",
    "Marks": 95
  },
  {
    "Name": "Riya",
    "Register_No": 265,
    "Subject": "Maths",
    "Marks": 100
  }
]

✅ Pros: Human-readable and easy to share.

⚠️ Cons: Takes more space than binary formats; slower to parse on large datasets.
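Unlike CSV, JSON keeps basic types (numbers stay numbers) and allows nesting. The sketch below parses two of the sample records with Python's standard json module and adds a nested field. The "Contact" field and its value are invented purely for illustration and are not part of the sample dataset.

```python
import json

json_text = """
[
  {"Name": "Abi", "Register_No": 201, "Subject": "Statistics", "Marks": 100},
  {"Name": "Riya", "Register_No": 265, "Subject": "Maths", "Marks": 100}
]
"""

students = json.loads(json_text)

# Numbers are parsed as numbers, not strings.
print(type(students[0]["Marks"]).__name__)  # int

# Hypothetical nested field, added only to show that JSON handles hierarchy.
students[0]["Contact"] = {"Email": "abi@example.com"}

# Serialize and parse again: the nested structure survives the round trip.
round_tripped = json.loads(json.dumps(students))
print(round_tripped[0]["Contact"]["Email"])  # abi@example.com
```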

4️⃣ Parquet (Columnar Storage Format)

Parquet is a binary, columnar format created for efficient data analytics.
It is well supported by engines such as Apache Spark, Hadoop, Amazon Athena, and Google BigQuery.

📄 Example:

import pandas as pd

data = {
    "Name": ["Abi", "Mano", "Priya", "Riya"],
    "Register_No": [201, 250, 260, 265],
    "Subject": ["Statistics", "Computer Science", "English", "Maths"],
    "Marks": [100, 99, 95, 100]
}

df = pd.DataFrame(data)
df.to_parquet("students.parquet", engine="pyarrow", index=False)

print("✅ Parquet file created successfully!")

⚡ Parquet files are not human-readable — they store compressed binary data for faster processing.

✅ Pros: Great compression and query performance.

⚠️ Cons: Needs special libraries to read and write; not ideal for simple text sharing.
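To see why a columnar layout suits analytics, the sketch below pivots the same records from a row layout to a column layout in plain Python. It illustrates the idea only and is not how Parquet encodes data on disk: the point is that an aggregate over one column touches that column's values and nothing else.

```python
# Row layout: each tuple holds one complete record.
rows = [
    ("Abi", 201, "Statistics", 100),
    ("Mano", 250, "Computer Science", 99),
    ("Priya", 260, "English", 95),
    ("Riya", 265, "Maths", 100),
]
column_names = ["Name", "Register_No", "Subject", "Marks"]

# Column layout: one list per column, values of the same type stored together.
columns = {
    name: list(values) for name, values in zip(column_names, zip(*rows))
}

# An aggregate over Marks reads a single list and never touches Name or Subject.
average_marks = sum(columns["Marks"]) / len(columns["Marks"])
print(average_marks)  # 98.5
```

Storing similar values side by side is also what makes column-level compression effective.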

5️⃣ XML (Extensible Markup Language)

XML represents data using a tag-based structure, making it hierarchical and self-descriptive.

📄 Example:

<Students>
  <Student>
    <Name>Abi</Name>
    <Register_No>201</Register_No>
    <Subject>Statistics</Subject>
    <Marks>100</Marks>
  </Student>
  <Student>
    <Name>Mano</Name>
    <Register_No>250</Register_No>
    <Subject>Computer Science</Subject>
    <Marks>99</Marks>
  </Student>
  <Student>
    <Name>Priya</Name>
    <Register_No>260</Register_No>
    <Subject>English</Subject>
    <Marks>95</Marks>
  </Student>
  <Student>
    <Name>Riya</Name>
    <Register_No>265</Register_No>
    <Subject>Maths</Subject>
    <Marks>100</Marks>
  </Student>
</Students>

✅ Pros: Self-descriptive and structured; ideal for hierarchical data.

⚠️ Cons: Verbose and storage-heavy; typically slower to parse than JSON.
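Reading this structure back is straightforward with Python's standard xml.etree.ElementTree module. A minimal sketch using two of the sample records; note that element text is always a string, so Marks is cast by hand:

```python
import xml.etree.ElementTree as ET

xml_text = """<Students>
  <Student>
    <Name>Abi</Name>
    <Register_No>201</Register_No>
    <Subject>Statistics</Subject>
    <Marks>100</Marks>
  </Student>
  <Student>
    <Name>Mano</Name>
    <Register_No>250</Register_No>
    <Subject>Computer Science</Subject>
    <Marks>99</Marks>
  </Student>
</Students>"""

root = ET.fromstring(xml_text)

# Walk the hierarchy: each <Student> element becomes one dictionary.
students = [
    {
        "Name": student.findtext("Name"),
        "Marks": int(student.findtext("Marks")),  # text is a string until cast
    }
    for student in root.findall("Student")
]
print(students)  # [{'Name': 'Abi', 'Marks': 100}, {'Name': 'Mano', 'Marks': 99}]
```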

6️⃣ Avro (Row-based Storage Format)

Avro is a row-based binary format designed for fast data serialization.
It’s schema-based and often used in Apache Kafka and Hadoop ecosystems.

📄 Schema (students.avsc):

{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}

📄 Example:

from fastavro import writer

schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"}
    ]
}

records = [
    {"Name": "Abi", "Register_No": 201, "Subject": "Statistics", "Marks": 100},
    {"Name": "Mano", "Register_No": 250, "Subject": "Computer Science", "Marks": 99},
    {"Name": "Priya", "Register_No": 260, "Subject": "English", "Marks": 95},
    {"Name": "Riya", "Register_No": 265, "Subject": "Maths", "Marks": 100}
]

with open("students.avro", "wb") as out:
    writer(out, schema, records)

print("✅ Avro file created successfully!")

✅ Pros: Schema-based and consistent.

⚠️ Cons: Binary format (not human-readable); requires Avro libraries to parse.
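To show why the schema is indispensable, here is a hand-rolled sketch of how a single record body is laid out in Avro's binary encoding: ints are zigzag-encoded variable-length integers, strings are a length followed by UTF-8 bytes, and fields are written in schema order with no field names. The helper functions are hypothetical and cover only the two types used here. A real .avro file also carries a header with the schema and sync markers, so use a library such as fastavro in practice.

```python
def encode_long(n: int) -> bytes:
    """Zigzag-encode an integer, then write it as a variable-length integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its byte length followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


# Only the two primitive types from the student schema are handled.
ENCODERS = {"int": encode_long, "string": encode_string}


def encode_record(schema: dict, record: dict) -> bytes:
    """Concatenate field values in schema order; field names are not written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
    ],
}

encoded = encode_record(
    schema,
    {"Name": "Abi", "Register_No": 201, "Subject": "Statistics", "Marks": 100},
)
print(len(encoded))  # 19
```

The 19 bytes contain the values but none of the field names, which is why a reader cannot make sense of them without the schema.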

Knowing when to use each helps you build efficient, scalable, and cloud-ready data pipelines. 🌥️
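One quick way to get a feel for the space trade-offs among the text formats is to serialize the same four records as CSV, JSON, and XML and compare byte counts. This is a rough sketch using only the standard library. The exact numbers depend on formatting choices such as indentation, so only the ordering is worth noting.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

fields = ["Name", "Register_No", "Subject", "Marks"]
records = [
    ("Abi", 201, "Statistics", 100),
    ("Mano", 250, "Computer Science", 99),
    ("Priya", 260, "English", 95),
    ("Riya", 265, "Maths", 100),
]

# CSV: field names appear once, in the header row.
csv_buffer = io.StringIO()
csv_writer = csv.writer(csv_buffer, lineterminator="\n")
csv_writer.writerow(fields)
csv_writer.writerows(records)
csv_size = len(csv_buffer.getvalue().encode("utf-8"))

# JSON: field names are repeated in every record.
json_text = json.dumps([dict(zip(fields, record)) for record in records])
json_size = len(json_text.encode("utf-8"))

# XML: field names are repeated twice per value (opening and closing tags).
root = ET.Element("Students")
for record in records:
    student = ET.SubElement(root, "Student")
    for name, value in zip(fields, record):
        ET.SubElement(student, name).text = str(value)
xml_size = len(ET.tostring(root))

print(csv_size, json_size, xml_size)
```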

MongoDB: The Yelp Review Chronicles #dataengineering #mongodb #database #learningjourney

Prakalya Sambathkumar — Tue, 26 Aug 2025 05:01:50 +0000

Episode 1: The Data Adventure Begins
Dive into the vast world of Yelp reviews, where every customer’s opinion shapes the experience of millions. Our stage? MongoDB — the NoSQL powerhouse perfect for handling this diverse data. Our mission? Insert, query, update, and explore insights hidden in these reviews.

Act 1: Setting the Scene — MongoDB Setup
Like any great data story, we start by setting up MongoDB. Whether on local machines or the cloud with MongoDB Atlas, we created a database named yelpDB and a collection reviews — the heart of our review operations.

Act 2: Rolling Out the Cast — Insert Records
With our stage ready, it was time to introduce the actors. We manually inserted 10 sample Yelp reviews, each carrying vital attributes like business_id, review_id, text of the review, and rating.

Act 3: The Rating Royalty — Top 5 Businesses
Who reigns supreme in the Yelp kingdom? MongoDB’s aggregation framework helped us uncover the top 5 businesses with the highest average ratings, proving once again that stars have power.
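The post does not show the pipeline itself. One plausible shape, assuming each review document has the business_id and rating fields described in Act 2, is a $group, $sort, $limit sequence. The sketch below writes that pipeline as plain Python data (the form pymongo's collection.aggregate accepts) and, so it runs without a server, mirrors the same logic in plain Python over a few made-up reviews.

```python
from collections import defaultdict

# Hypothetical pipeline: group by business, average the ratings, keep the top 5.
TOP_5_PIPELINE = [
    {"$group": {"_id": "$business_id", "avg_rating": {"$avg": "$rating"}}},
    {"$sort": {"avg_rating": -1}},
    {"$limit": 5},
]


def top_businesses(reviews, limit=5):
    """Plain-Python mirror of the pipeline, for checking the logic locally."""
    ratings = defaultdict(list)
    for review in reviews:
        ratings[review["business_id"]].append(review["rating"])
    averages = [
        {"_id": business_id, "avg_rating": sum(values) / len(values)}
        for business_id, values in ratings.items()
    ]
    averages.sort(key=lambda doc: doc["avg_rating"], reverse=True)
    return averages[:limit]


# Made-up documents for illustration, not the reviews used in the post.
sample_reviews = [
    {"review_id": "r1", "business_id": "b1", "text": "Good food", "rating": 4},
    {"review_id": "r2", "business_id": "b2", "text": "Great service", "rating": 5},
    {"review_id": "r3", "business_id": "b1", "text": "Too slow", "rating": 2},
]

result = top_businesses(sample_reviews)
print(result)
```

Against a live database, the equivalent call would be along the lines of collection.aggregate(TOP_5_PIPELINE) on the reviews collection.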

Act 4: The Good Word Mystery
What’s the hype about the word “good”? We counted how many reviews mentioned “good” to catch the pulse of positivity (or criticism) in the community.
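A count like this is typically expressed as a case-insensitive regular-expression filter. The filter below is an assumption about how the query might look, not the author's exact query, and the plain-Python function mirrors it over made-up reviews so the logic can be checked without a server.

```python
import re

# Hypothetical filter: matches any review whose text contains "good", ignoring case.
GOOD_FILTER = {"text": {"$regex": "good", "$options": "i"}}


def count_mentions(reviews, word="good"):
    """Plain-Python mirror of the filter: count reviews whose text contains the word."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return sum(1 for review in reviews if pattern.search(review["text"]))


# Made-up documents for illustration.
sample_reviews = [
    {"review_id": "r1", "text": "Good food, friendly staff"},
    {"review_id": "r2", "text": "Not good at all"},
    {"review_id": "r3", "text": "Terrible parking"},
]

good_count = count_mentions(sample_reviews)
print(good_count)  # 2
```

With pymongo, the same filter could be passed to collection.count_documents(GOOD_FILTER). Note that a substring match also counts words like "goodness", and, as the second sample shows, a mention of "good" is not always praise.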

Act 5: Reviews Spotlight — A Business Tale
Because every business has its story, we drilled down to look at all reviews for a particular business_id — say "b2" — gathering the voices behind the numbers.

Act 6: The Plot Twists — Update & Delete
No story remains static. We performed an update to a review’s rating (e.g., changing rating of review "r1") and deleted another review ("r4") that no longer fit the narrative.
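Updates and deletes in MongoDB are driven by small filter and change documents. The sketch below writes them out as plain Python dictionaries in the shape pymongo's update_one and delete_one accept, and applies them to an in-memory list with two hypothetical helper functions so the effect is visible without a server. The new rating value of 5 is an assumption, since the post does not state it.

```python
# Filter and change documents; the new rating (5) is an assumed value.
UPDATE_FILTER = {"review_id": "r1"}
UPDATE_CHANGE = {"$set": {"rating": 5}}
DELETE_FILTER = {"review_id": "r4"}


def matches(doc, flt):
    """True when every key in the filter equals the document's value."""
    return all(doc.get(key) == value for key, value in flt.items())


def update_one(docs, flt, change):
    """Apply the $set fields to the first matching document; return the count."""
    for doc in docs:
        if matches(doc, flt):
            doc.update(change["$set"])
            return 1
    return 0


def delete_one(docs, flt):
    """Remove the first matching document; return the count."""
    for index, doc in enumerate(docs):
        if matches(doc, flt):
            del docs[index]
            return 1
    return 0


# Made-up documents for illustration.
reviews = [
    {"review_id": "r1", "business_id": "b1", "rating": 3},
    {"review_id": "r4", "business_id": "b2", "rating": 1},
]

updated = update_one(reviews, UPDATE_FILTER, UPDATE_CHANGE)
deleted = delete_one(reviews, DELETE_FILTER)
print(reviews)  # [{'review_id': 'r1', 'business_id': 'b1', 'rating': 5}]
```

Against a live collection, the corresponding calls would be collection.update_one(UPDATE_FILTER, UPDATE_CHANGE) and collection.delete_one(DELETE_FILTER).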

Exporting the Chronicles
Every great story deserves to be shared. We exported our curated review dataset and query results into JSON and CSV formats for further analysis, archival, and storytelling across platforms.
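The export step can be sketched with Python's standard library alone. The example assumes the query results have already been fetched into a list of dictionaries with MongoDB's _id field dropped, since ObjectId values are not serializable by the json module as-is. The reviews shown are made up.

```python
import csv
import io
import json

# Made-up query results, with the _id field already removed.
reviews = [
    {"review_id": "r1", "business_id": "b1", "text": "Good food, friendly staff", "rating": 4},
    {"review_id": "r2", "business_id": "b2", "text": "Great service", "rating": 5},
]

# JSON export: keeps types and would keep any nesting.
json_text = json.dumps(reviews, indent=2)

# CSV export: flat rows; the writer quotes text that contains commas.
csv_buffer = io.StringIO()
writer = csv.DictWriter(
    csv_buffer,
    fieldnames=["review_id", "business_id", "text", "rating"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(reviews)
csv_text = csv_buffer.getvalue()

print(csv_text)
```

To write files instead of in-memory strings, the same calls work against file objects opened with open(..., "w", newline="").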

By the end of our Yelp Reviews MongoDB journey, we had:
✅ Inserted sample review records
✅ Aggregated businesses by average rating
✅ Counted reviews mentioning “good”
✅ Queried reviews by business_id
✅ Updated and deleted records
✅ Exported data for external use

💡 This hands-on journey mirrors real-world data engineering workflows — from ETL to insights, data maintenance, and exporting essential data products.

Stay tuned for more seasons of MongoDB exploration and data adventures!

DEV Community: Prakalya Sambathkumar
