When working with data in the cloud or in analytics pipelines, the way data is stored and exchanged plays a huge role in performance, compatibility, and scalability.
In this blog, we’ll explore 6 popular data formats used in analytics:
CSV (Comma Separated Values)
SQL (Relational Table Format)
JSON (JavaScript Object Notation)
Parquet (Columnar Storage Format)
XML (Extensible Markup Language)
Avro (Row-based Storage Format)
We’ll use a small dataset of students (name, register number, subject, marks) and represent it in all six formats.
🎯 Sample Dataset
Here’s the dataset we’ll use throughout:

Name    | Register_No | Subject   | Marks
--------|-------------|-----------|------
Alice   | 101         | Math      | 85
Bob     | 102         | Physics   | 78
Charlie | 103         | Chemistry | 92
1️⃣ CSV (Comma Separated Values)
CSV is the simplest and most widely used format. Data is stored as plain text, with commas separating the values.
Example:
Alice,101,Math,85
Bob,102,Physics,78
Charlie,103,Chemistry,92
✅ Pros: Easy to read, lightweight.
⚠️ Cons: No schema, can get messy with nested/complex data.
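For parity with the Parquet and Avro sections below, here’s a minimal sketch that writes the same rows with Python’s built-in csv module (the students.csv filename and the header row are just assumptions for this example):

import csv

# Same student rows as the sample dataset
rows = [
    ["Alice", 101, "Math", 85],
    ["Bob", 102, "Physics", 78],
    ["Charlie", 103, "Chemistry", 92],
]

# Write a header row plus the data (filename is illustrative)
with open("students.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["Name", "Register_No", "Subject", "Marks"])
    w.writerows(rows)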
2️⃣ SQL (Relational Table Format)
Relational (SQL) databases store data in tables with rows and columns. The tables are defined and populated using SQL statements.
Example:
CREATE TABLE Students (
    Name        VARCHAR(50),
    Register_No INT,
    Subject     VARCHAR(50),
    Marks       INT
);
INSERT INTO Students (Name, Register_No, Subject, Marks) VALUES
('Alice', 101, 'Math', 85),
('Bob', 102, 'Physics', 78),
('Charlie', 103, 'Chemistry', 92);
✅ Pros: Schema enforcement, supports queries.
⚠️ Cons: Not ideal for unstructured or semi-structured data.
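To try these statements without installing a database server, here’s a quick sketch using Python’s built-in sqlite3 module with an in-memory database:

import sqlite3

# In-memory database keeps the sketch self-contained
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students (
    Name VARCHAR(50), Register_No INT, Subject VARCHAR(50), Marks INT
)""")
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [("Alice", 101, "Math", 85),
     ("Bob", 102, "Physics", 78),
     ("Charlie", 103, "Chemistry", 92)],
)

# Schema enforcement + queries: average marks across all students
print(conn.execute("SELECT AVG(Marks) FROM Students").fetchone()[0])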
3️⃣ JSON (JavaScript Object Notation)
JSON represents data as structured key-value pairs. It’s widely used in APIs and NoSQL databases.
Example:
[
  {
    "Name": "Alice",
    "Register_No": 101,
    "Subject": "Math",
    "Marks": 85
  },
  {
    "Name": "Bob",
    "Register_No": 102,
    "Subject": "Physics",
    "Marks": 78
  },
  {
    "Name": "Charlie",
    "Register_No": 103,
    "Subject": "Chemistry",
    "Marks": 92
  }
]
✅ Pros: Great for APIs, supports nested data.
⚠️ Cons: Not space-efficient compared to Parquet/Avro.
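A minimal sketch with Python’s standard json module, writing and reading back the same records (students.json is an assumed filename):

import json

students = [
    {"Name": "Alice", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "Register_No": 102, "Subject": "Physics", "Marks": 78},
    {"Name": "Charlie", "Register_No": 103, "Subject": "Chemistry", "Marks": 92},
]

# indent=2 makes the file human-readable
with open("students.json", "w") as f:
    json.dump(students, f, indent=2)

# Read it back to confirm the round trip
with open("students.json") as f:
    print(json.load(f)[0]["Name"])  # -> Alice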
4️⃣ Parquet (Columnar Storage Format)
Parquet is a binary, columnar format designed for analytics (popular with Hadoop, Spark, AWS Athena, and BigQuery). Because it stores data column by column, queries that read only a few columns can skip the rest of the file, which makes analytical scans much faster.
Example in Python (to generate a .parquet file):
import pandas as pd

# Sample dataset
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Register_No": [101, 102, 103],
    "Subject": ["Math", "Physics", "Chemistry"],
    "Marks": [85, 78, 92],
}
df = pd.DataFrame(data)

# Write to Parquet (requires: pip install pandas pyarrow)
df.to_parquet("students.parquet", engine="pyarrow", index=False)
print("✅ Parquet file saved as students.parquet")
⚡ Binary file (not human-readable). If you open it, you’ll see compressed binary data.
✅ Pros: Efficient for large datasets, great compression, optimized for analytics.
⚠️ Cons: Not human-readable, needs libraries to parse.
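To see the columnar layout pay off, here’s a sketch that reads just two columns back from the file generated above; asking for a subset of columns lets Parquet skip the rest of the file entirely:

import pandas as pd

# Only the Name and Marks columns are read from disk
df = pd.read_parquet("students.parquet", columns=["Name", "Marks"])
print(df)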
5️⃣ XML (Extensible Markup Language)
XML stores data in a tag-based hierarchical structure. It is verbose but still used in enterprise systems.
Example:
<Students>
  <Student>
    <Name>Alice</Name>
    <Register_No>101</Register_No>
    <Subject>Math</Subject>
    <Marks>85</Marks>
  </Student>
  <Student>
    <Name>Bob</Name>
    <Register_No>102</Register_No>
    <Subject>Physics</Subject>
    <Marks>78</Marks>
  </Student>
  <Student>
    <Name>Charlie</Name>
    <Register_No>103</Register_No>
    <Subject>Chemistry</Subject>
    <Marks>92</Marks>
  </Student>
</Students>
✅ Pros: Self-descriptive, supports hierarchical data.
⚠️ Cons: Verbose, storage-heavy compared to JSON.
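Here’s a sketch that builds the same document with Python’s standard xml.etree.ElementTree module (students.xml is an assumed filename; ET.indent needs Python 3.9+):

import xml.etree.ElementTree as ET

records = [
    ("Alice", 101, "Math", 85),
    ("Bob", 102, "Physics", 78),
    ("Charlie", 103, "Chemistry", 92),
]

# Build the <Students>/<Student> hierarchy element by element
root = ET.Element("Students")
for name, reg, subject, marks in records:
    student = ET.SubElement(root, "Student")
    ET.SubElement(student, "Name").text = name
    ET.SubElement(student, "Register_No").text = str(reg)
    ET.SubElement(student, "Subject").text = subject
    ET.SubElement(student, "Marks").text = str(marks)

tree = ET.ElementTree(root)
ET.indent(tree)  # pretty-print with nested indentation
tree.write("students.xml", encoding="utf-8")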
6️⃣ Avro (Row-based Storage Format)
Avro is a row-based binary format developed under the Apache Software Foundation. It’s great for data serialization and works well with Kafka and Hadoop.
An Avro file carries its schema (written in JSON) alongside the data, so readers always know how to decode the records.
Schema (students.avsc):
{
  "type": "record",
  "name": "Student",
  "fields": [
    {"name": "Name", "type": "string"},
    {"name": "Register_No", "type": "int"},
    {"name": "Subject", "type": "string"},
    {"name": "Marks", "type": "int"}
  ]
}
Example in Python (to generate .avro file):
from fastavro import writer

# Define schema (same as students.avsc above)
schema = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Register_No", "type": "int"},
        {"name": "Subject", "type": "string"},
        {"name": "Marks", "type": "int"},
    ],
}

# Data
records = [
    {"Name": "Alice", "Register_No": 101, "Subject": "Math", "Marks": 85},
    {"Name": "Bob", "Register_No": 102, "Subject": "Physics", "Marks": 78},
    {"Name": "Charlie", "Register_No": 103, "Subject": "Chemistry", "Marks": 92},
]

# Save to Avro file (requires: pip install fastavro)
with open("students.avro", "wb") as out:
    writer(out, schema, records)

print("✅ Avro file saved as students.avro")
⚡ Like Parquet, the Avro file is binary and not human-readable.
✅ Pros: Schema-based, efficient row storage, great for streaming.
⚠️ Cons: Needs schema + libraries to parse.
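To verify the file, a short sketch reading it back with fastavro; since the schema is stored inside the Avro file, no separate .avsc is needed when reading:

from fastavro import reader

# The embedded schema lets fastavro decode each record into a dict
with open("students.avro", "rb") as f:
    for record in reader(f):
        print(record["Name"], record["Marks"])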