Data in the Cloud: 6 Common Data Formats

SAHANA S — Wed, 08 Oct 2025 10:09:57 +0000

In today's world of data analytics and cloud computing, how we store and exchange data can have a huge impact on performance, scalability, and compatibility. Whether you’re working on a data pipeline or exporting reports, understanding different data formats is essential.

In this article, I’ll walk you through six popular data formats — CSV, SQL, JSON, Parquet, XML, and Avro — using a simple example dataset.

Sample Dataset

Let’s use a small dataset of student marks:

Name | RegisterNumber | Subject | Marks
Alice | 1001 | Math | 85
Bob | 1002 | Math | 90
Charlie | 1003 | Math | 78

CSV (Comma-Separated Values)

What is it?
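CSV is one of the simplest data formats. Each record is a line, each value is separated by a comma, and the first line usually holds the column names. It's lightweight and easy to read, but it carries no data types: every value is plain text until the reader converts it.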

When to use:
Exporting data from spreadsheets or databases
Quick sharing and inspection of tabular data

Example:

Name,RegisterNumber,Subject,Marks
Alice,1001,Math,85
Bob,1002,Math,90
Charlie,1003,Math,78
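
To show what reading this file involves, here is a minimal sketch using Python's standard csv module. The text is held in a string (CSV_TEXT is just an illustrative name) so the snippet runs without a file on disk.

```python
import csv
import io

CSV_TEXT = """Name,RegisterNumber,Subject,Marks
Alice,1001,Math,85
Bob,1002,Math,90
Charlie,1003,Math,78
"""

# DictReader treats the first line as the column names.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# CSV has no type information, so numeric columns arrive as strings
# and have to be converted explicitly.
for row in rows:
    row["RegisterNumber"] = int(row["RegisterNumber"])
    row["Marks"] = int(row["Marks"])

top = max(rows, key=lambda r: r["Marks"])
print(top["Name"], top["Marks"])  # Bob 90
```

The explicit int() calls are the price of CSV's simplicity: the format itself cannot say that Marks is a number.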

SQL (Relational Table Format)
What is it?
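SQL is the standard language for working with relational databases, where data is stored in tables of rows and columns. It is not a file format as such, but a script of CREATE TABLE and INSERT statements (a SQL dump) is a common way to move data between databases.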

When to use:
Structured data that fits into relational models
OLTP and OLAP systems

Example:

CREATE TABLE StudentMarks (
  Name TEXT,
  RegisterNumber INT,
  Subject TEXT,
  Marks INT
);


INSERT INTO StudentMarks VALUES ('Alice', 1001, 'Math', 85);
INSERT INTO StudentMarks VALUES ('Bob', 1002, 'Math', 90);
INSERT INTO StudentMarks VALUES ('Charlie', 1003, 'Math', 78);
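
To try these statements without installing a database server, one option is Python's built-in sqlite3 module. This is a sketch under that assumption; the statements above are plain enough that other relational databases should accept them too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE StudentMarks ("
    "Name TEXT, RegisterNumber INT, Subject TEXT, Marks INT)"
)

students = [
    ("Alice", 1001, "Math", 85),
    ("Bob", 1002, "Math", 90),
    ("Charlie", 1003, "Math", 78),
]
# Placeholders (?) keep the values separate from the SQL text.
conn.executemany("INSERT INTO StudentMarks VALUES (?, ?, ?, ?)", students)

ranked = conn.execute(
    "SELECT Name, Marks FROM StudentMarks ORDER BY Marks DESC"
).fetchall()
print(ranked)  # [('Bob', 90), ('Alice', 85), ('Charlie', 78)]
conn.close()
```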

JSON (JavaScript Object Notation)
What is it?
JSON is a lightweight text-based format used for data interchange. It’s structured like a dictionary or object and is widely used in APIs.

When to use:
Web services and REST APIs
Semi-structured data

Example:

[
  {
    "Name": "Alice",
    "RegisterNumber": 1001,
    "Subject": "Math",
    "Marks": 85
  },
  {
    "Name": "Bob",
    "RegisterNumber": 1002,
    "Subject": "Math",
    "Marks": 90
  },
  {
    "Name": "Charlie",
    "RegisterNumber": 1003,
    "Subject": "Math",
    "Marks": 78
  }
]
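
A short sketch of loading and re-serializing this document with Python's standard json module. Unlike CSV, the numbers come back as integers without any conversion step.

```python
import json

JSON_TEXT = """
[
  {"Name": "Alice", "RegisterNumber": 1001, "Subject": "Math", "Marks": 85},
  {"Name": "Bob", "RegisterNumber": 1002, "Subject": "Math", "Marks": 90},
  {"Name": "Charlie", "RegisterNumber": 1003, "Subject": "Math", "Marks": 78}
]
"""

# json.loads turns the array into a list of dicts with typed values.
students = json.loads(JSON_TEXT)
total = sum(s["Marks"] for s in students)
print(total)  # 253

# Dropping whitespace gives a compact form suited to sending over an API.
compact = json.dumps(students, separators=(",", ":"))
```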

Parquet (Columnar Storage Format)
What is it?
Parquet is an Apache columnar storage format, designed for efficient data compression and query performance. It’s a binary format and not human-readable.

When to use:
Big data workloads (Spark, Hive)
Analytics in cloud data lakes

Example:
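Because the file itself is binary, the example shows how to build the dataset in Python with pandas and write it to Parquet (pandas needs a Parquet engine such as pyarrow or fastparquet installed for this):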

import pandas as pd

data = [
    {"Name": "Alice", "RegisterNumber": 1001, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "RegisterNumber": 1002, "Subject": "Math", "Marks": 90},
    {"Name": "Charlie", "RegisterNumber": 1003, "Subject": "Math", "Marks": 78},
]

df = pd.DataFrame(data)
df.to_parquet("student_marks.parquet")
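
To illustrate what "columnar" means, here is a conceptual sketch in plain Python. It is not Parquet's actual on-disk encoding, which is considerably more elaborate; it only shows why grouping values by column suits analytics and compression.

```python
from itertools import groupby

rows = [
    {"Name": "Alice", "RegisterNumber": 1001, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "RegisterNumber": 1002, "Subject": "Math", "Marks": 90},
    {"Name": "Charlie", "RegisterNumber": 1003, "Subject": "Math", "Marks": 78},
]

# Row layout keeps all fields of one record together (as in CSV or Avro).
# Column layout keeps all values of one field together (as in Parquet).
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over Marks only needs to touch the Marks column.
average_marks = sum(columns["Marks"]) / len(columns["Marks"])

# Repeated values sit next to each other, so they compress well,
# for example as (value, run length) pairs.
subject_runs = [
    (value, len(list(group))) for value, group in groupby(columns["Subject"])
]
print(subject_runs)  # [('Math', 3)]
```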

XML (Extensible Markup Language)
What is it?
XML is a markup language that uses custom tags to define data. It’s more verbose than JSON but was a standard for data exchange in the early web.

When to use:
Legacy systems
Document-centric data (RSS feeds, SOAP APIs)

Example:

<Students>
  <Student>
    <Name>Alice</Name>
    <RegisterNumber>1001</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>85</Marks>
  </Student>
  <Student>
    <Name>Bob</Name>
    <RegisterNumber>1002</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>90</Marks>
  </Student>
  <Student>
    <Name>Charlie</Name>
    <RegisterNumber>1003</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>78</Marks>
  </Student>
</Students>
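
A minimal sketch of reading this document with Python's standard xml.etree.ElementTree module. As with CSV, element text is always a string, so the numeric fields are converted by hand.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<Students>
  <Student>
    <Name>Alice</Name>
    <RegisterNumber>1001</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>85</Marks>
  </Student>
  <Student>
    <Name>Bob</Name>
    <RegisterNumber>1002</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>90</Marks>
  </Student>
  <Student>
    <Name>Charlie</Name>
    <RegisterNumber>1003</RegisterNumber>
    <Subject>Math</Subject>
    <Marks>78</Marks>
  </Student>
</Students>
"""

root = ET.fromstring(XML_TEXT)

# findtext returns the text inside the named child element.
students = [
    {
        "Name": s.findtext("Name"),
        "RegisterNumber": int(s.findtext("RegisterNumber")),
        "Subject": s.findtext("Subject"),
        "Marks": int(s.findtext("Marks")),
    }
    for s in root.findall("Student")
]
print([s["Name"] for s in students])  # ['Alice', 'Bob', 'Charlie']
```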

Avro (Row-based Binary Format)
What is it?
Avro is a row-based binary data format from Apache. It supports rich data structures and includes a schema definition, making it excellent for streaming and serialization.

When to use:
Kafka message serialization
Schema evolution
Compact binary storage

Example:
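Here’s how you define and write Avro data using Python. The JSON schema comes first and is saved as student.avsc; the script after it loads that schema and writes the records to students.avro: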

{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "RegisterNumber", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}

import avro.datafile
import avro.io
import avro.schema

# Load the schema saved as student.avsc.
# Note: some older avro-python3 releases spell this function Parse.
with open("student.avsc", "r") as schema_file:
    schema = avro.schema.parse(schema_file.read())

data = [
    {"Name": "Alice", "RegisterNumber": 1001, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "RegisterNumber": 1002, "Subject": "Math", "Marks": 90},
    {"Name": "Charlie", "RegisterNumber": 1003, "Subject": "Math", "Marks": 78},
]

# The writer stores the schema in the file header, followed by the records.
with open("students.avro", "wb") as out_file:
    writer = avro.datafile.DataFileWriter(out_file, avro.io.DatumWriter(), schema)
    for record in data:
        writer.append(record)
    writer.close()  # flushes any buffered records to the file
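
Because every Avro file travels with its schema, a writer can reject records that do not fit it. The sketch below mimics that idea with the standard library only. conforms and PYTHON_TYPES are hypothetical helpers written for this illustration, not part of the avro package, and they cover only the two primitive types used in this schema.

```python
import json

SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "RegisterNumber", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
""")

# Hypothetical mapping from Avro primitive type names to Python types.
PYTHON_TYPES = {"string": str, "int": int}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields and types."""
    fields = {f["name"]: f["type"] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[type_name])
        and not isinstance(record[name], bool)  # bool is a subclass of int
        for name, type_name in fields.items()
    )


good = {"Name": "Alice", "RegisterNumber": 1001, "Subject": "Math", "Marks": 85}
bad = {"Name": "Bob", "RegisterNumber": 1002, "Subject": "Math", "Marks": "90"}
print(conforms(good, SCHEMA), conforms(bad, SCHEMA))  # True False
```

Real Avro schemas also support unions, defaults, and nested records, which is what makes schema evolution possible; none of that is modelled here.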

Conclusion
Understanding these formats helps you choose the right tool for the job, whether you're building a modern data pipeline or integrating with legacy systems.

Exploring MongoDB

SAHANA S — Sat, 06 Sep 2025 03:59:35 +0000

This post covers my exploration of MongoDB using MongoDB Compass, the friendliest GUI for Mongo's document store. I wanted to explore MongoDB beyond the shell, so I cooked up a mini “Yelp-style” reviews dataset in a local yelp DB. My goal? Get hands-on with real-world tasks: inserting sample data, running queries and aggregations, tweaking entries, performing deletions, and even exporting results as files.

Importing Dataset

For convenience, I prepared a small sample in JSON format—modelled after Yelp reviews. I then used the Add Data → Import Data feature in Compass to upload the JSON file.

Inserting Records Manually

To get hands-on, I also manually added at least 10 review entries with fields like business_id, name, rating, and review. This helped me understand data structure and ensure variety in entries.

Queries & Aggregations
Top 5 Businesses by Average Rating
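The exact pipeline isn't reproduced here, so the following is a minimal sketch of one way to express it, assuming the business_id, name, and rating fields described above. top_businesses_pipeline is a hypothetical helper name; the stages themselves can be entered one by one in Compass's Aggregations tab or passed to a driver call such as pymongo's collection.aggregate(pipeline).

```python
def top_businesses_pipeline(limit=5):
    """Build aggregation stages: average rating per business, best first."""
    return [
        # Group reviews by business and compute the average rating.
        {
            "$group": {
                "_id": "$business_id",
                "name": {"$first": "$name"},
                "avg_rating": {"$avg": "$rating"},
                "review_count": {"$sum": 1},
            }
        },
        # Highest average first.
        {"$sort": {"avg_rating": -1}},
        # Keep only the top entries.
        {"$limit": limit},
    ]


pipeline = top_businesses_pipeline()
```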

Count of Reviews Containing the Word “good”
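A sketch of a regex filter that could produce this count, assuming the text lives in the review field and that the match should ignore case (both are assumptions). The filter can be typed into Compass's filter bar or passed to pymongo's count_documents. The sample reviews below are made up purely to show the matching logic in plain Python.

```python
import re

# MongoDB filter: reviews whose text contains "good", case-insensitively.
good_filter = {"review": {"$regex": "good", "$options": "i"}}

# Made-up sample documents, used only to demonstrate the match locally.
sample_reviews = [
    {"review": "Good food, fast service"},
    {"review": "Terrible wait times"},
    {"review": "Pretty good coffee"},
]

pattern = re.compile(good_filter["review"]["$regex"], re.IGNORECASE)
count = sum(1 for doc in sample_reviews if pattern.search(doc["review"]))
print(count)  # 2
```

A bare "good" also matches longer words such as "goodbye"; a pattern with word boundaries, such as \bgood\b, would restrict the count to the whole word.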

All Reviews for a Specific Business ID
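This one is a plain equality filter. A minimal sketch, where reviews_for_business is a hypothetical helper and "B001" is a made-up ID; the optional projection trims the output to the fields of interest. In pymongo the two dicts would go to collection.find(query, projection), and in Compass they map to the Filter and Project boxes.

```python
def reviews_for_business(business_id):
    """Build the filter and projection for one business's reviews."""
    query = {"business_id": business_id}
    # Hide MongoDB's internal _id and keep only the review fields.
    projection = {"_id": 0, "name": 1, "rating": 1, "review": 1}
    return query, projection


query, projection = reviews_for_business("B001")  # "B001" is a made-up ID
```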

Delete a Record

I selected a review and clicked the delete/trash icon to remove it. Compass prompted for confirmation and then removed the document.

Conclusion
By doing this practical exercise, I learned how to:

Insert, query, update, and delete documents in MongoDB Compass
Run aggregation pipelines to analyze data
Use regex to search text fields
Export results for further analysis

MongoDB’s flexible schema and Compass’s visual interface make it a powerful pairing for real-world data tasks.

DEV Community: SAHANA S
