When working with data in the cloud or in analytics pipelines, the way data is stored and exchanged plays a huge role in performance, compatibility, and scalability.
In this blog, we’ll explore 6 popular data formats used in analytics:
CSV (Comma Separated Values)
SQL (Relational Table Format)
JSON (JavaScript Object Notation)
Parquet (Columnar Storage Format)
XML (Extensible Markup Language)
Avro (Row-based Storage Format)
We’ll use a small dataset of students (name, register number, subject, marks) and represent it in all six formats.
🎯 Sample Dataset
Here’s the dataset we’ll use throughout:

Name    | Register_No | Subject   | Marks
--------|-------------|-----------|------
Alice   | 101         | Math      | 85
Bob     | 102         | Physics   | 78
Charlie | 103         | Chemistry | 92
1️⃣ CSV (Comma Separated Values)
CSV is the simplest and most widely used format. Data is stored as plain text, with commas separating the values.
Example:
Alice,101,Math,85
Bob,102,Physics,78
Charlie,103,Chemistry,92
✅ Pros: Easy to read, lightweight.
⚠️ Cons: No schema, can get messy with nested/complex data.
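For parity with the Parquet and Avro sections below, here’s a minimal sketch that writes the same rows with Python’s built-in csv module (the students.csv filename and the header row are just assumptions for this example):

import csv

# Same student rows as the sample dataset
rows = [
    ["Alice", 101, "Math", 85],
    ["Bob", 102, "Physics", 78],
    ["Charlie", 103, "Chemistry", 92],
]

# Write a header row plus the data (filename is illustrative)
with open("students.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["Name", "Register_No", "Subject", "Marks"])
    w.writerows(rows)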
2️⃣ SQL (Relational Table Format)
Relational (SQL) databases store data in tables with rows and columns. The tables are defined and populated using SQL statements.
Example:
CREATE TABLE Students (
    Name        VARCHAR(50),
    Register_No INT,
    Subject     VARCHAR(50),
    Marks       INT
);
INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES
('Alice', 101, 'Math', 85),
('Bob', 102, 'Physics', 78),
('Charlie', 103, 'Chemistry', 92);
✅ Pros: Schema enforcement, supports queries.
⚠️ Cons: Not ideal for unstructured or semi-structured data.
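To try these statements without installing a database server, here’s a quick sketch using Python’s built-in sqlite3 module with an in-memory database:

import sqlite3

# In-memory database keeps the sketch self-contained
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students (
    Name VARCHAR(50), Register_No INT, Subject VARCHAR(50), Marks INT
)""")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [("Alice", 101, "Math", 85),
     ("Bob", 102, "Physics", 78),
     ("Charlie", 103, "Chemistry", 92)],
)

# Schema enforcement + queries: average marks across all students
print(conn.execute("SELECT AVG(Marks) FROM Students").fetchone()[0])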
3️⃣ JSON (JavaScript Object Notation)
JSON represents data as structured key-value pairs. It’s widely used in APIs and NoSQL databases.
Example:
[
  {
    "Name": "Alice",
    "Register_No": 101,
    "Subject": "Math",
    "Marks": 85
  },
  {
    "Name": "Bob",
    "Register_No": 102,
    "Subject": "Physics",
    "Marks": 78
  },
  {
    "Name": "Charlie",
    "Register_No": 103,
    "Subject": "Chemistry",
    "Marks": 92
  }
]
✅ Pros: Great for APIs, supports nested data.
⚠️ Cons: Not space-efficient compared to Parquet/Avro.
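A minimal sketch with Python’s standard json module, writing and reading back the same records (students.json is an assumed filename):

import json

students = [
    {"Name": "Alice", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "Register_No": 102, "Subject": "Physics", "Marks": 78},
    {"Name": "Charlie", "Register_No": 103, "Subject": "Chemistry", "Marks": 92},
]

# indent=2 makes the file human-readable
with open("students.json", "w") as f:
    json.dump(students, f, indent=2)

# Read it back to confirm the round trip
with open("students.json") as f:
    print(json.load(f)[0]["Name"])  # -> Alice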
4️⃣ Parquet (Columnar Storage Format)
Parquet is a binary, columnar format designed for analytics (popular with Hadoop, Spark, AWS Athena, and BigQuery). Because it stores data column by column, queries that read only a few columns can skip the rest of the file, which makes analytical scans much faster.
Example in Python (to generate a .parquet file):
import pandas as pd

# Sample dataset
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Register_No": [101, 102, 103],
    "Subject": ["Math", "Physics", "Chemistry"],
    "Marks": [85, 78, 92],
}
df = pd.DataFrame(data)

# Write to Parquet (requires: pip install pandas pyarrow)
df.to_parquet("students.parquet", engine="pyarrow", index=False)
print("✅ Parquet file saved as students.parquet")
⚡ Binary file (not human-readable). If you open it, you’ll see compressed binary data.
✅ Pros: Efficient for large datasets, great compression, optimized for analytics.
⚠️ Cons: Not human-readable, needs libraries to parse.
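To see the columnar layout pay off, here’s a sketch that reads just two columns back from the file generated above; asking for a subset of columns lets Parquet skip the rest of the file entirely:

import pandas as pd

# Only the Name and Marks columns are read from disk
df = pd.read_parquet("students.parquet", columns=["Name", "Marks"])
print(df)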
5️⃣ XML (Extensible Markup Language)
XML stores data in a tag-based hierarchical structure. It is verbose but still used in enterprise systems.
Example:
<Students>
  <Student>
    <Name>Alice</Name>
    <Register_No>101</Register_No>
    <Subject>Math</Subject>
    <Marks>85</Marks>
  </Student>
  <Student>
    <Name>Bob</Name>
    <Register_No>102</Register_No>
    <Subject>Physics</Subject>
    <Marks>78</Marks>
  </Student>
  <Student>
    <Name>Charlie</Name>
    <Register_No>103</Register_No>
    <Subject>Chemistry</Subject>
    <Marks>92</Marks>
  </Student>
</Students>
✅ Pros: Self-descriptive, supports hierarchical data.
⚠️ Cons: Verbose, storage-heavy compared to JSON.
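Here’s a sketch that builds the same document with Python’s standard xml.etree.ElementTree module (students.xml is an assumed filename; ET.indent needs Python 3.9+):

import xml.etree.ElementTree as ET

records = [
    ("Alice", 101, "Math", 85),
    ("Bob", 102, "Physics", 78),
    ("Charlie", 103, "Chemistry", 92),
]

# Build the <Students>/<Student> hierarchy element by element
root = ET.Element("Students")
for name, reg, subject, marks in records:
    student = ET.SubElement(root, "Student")
    ET.SubElement(student, "Name").text = name
    ET.SubElement(student, "Register_No").text = str(reg)
    ET.SubElement(student, "Subject").text = subject
    ET.SubElement(student, "Marks").text = str(marks)

tree = ET.ElementTree(root)
ET.indent(tree)  # pretty-print with nested indentation
tree.write("students.xml", encoding="utf-8")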
6️⃣ Avro (Row-based Storage Format)
Avro is a row-based binary format developed under the Apache Software Foundation. It’s great for data serialization and works well with Kafka and Hadoop.
An Avro file carries its schema (written in JSON) alongside the data, so readers always know how to decode the records.
Schema (students.avsc):
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
Example in Python (to generate .avro file):
from fastavro import writer

# Define schema (same as students.avsc above)
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
    ],
}

# Data
records = [
    {"Name": "Alice", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "Register_No": 102, "Subject": "Physics", "Marks": 78},
    {"Name": "Charlie", "Register_No": 103, "Subject": "Chemistry", "Marks": 92},
]

# Save to Avro file (requires: pip install fastavro)
with open("students.avro", "wb") as out:
    writer(out, schema, records)

print("✅ Avro file saved as students.avro")
⚡ Like Parquet, the Avro file is binary and not human-readable.
✅ Pros: Schema-based, efficient row storage, great for streaming.
⚠️ Cons: Needs schema + libraries to parse.
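To verify the file, a short sketch reading it back with fastavro; since the schema is stored inside the Avro file, no separate .avsc is needed when reading:

from fastavro import reader

# The embedded schema lets fastavro decode each record into a dict
with open("students.avro", "rb") as f:
    for record in reader(f):
        print(record["Name"], record["Marks"])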